Shinji Watanabe Phd Thesis

Speech Recognition Based on a Bayesian Approach

ベイズ的手法にもとづく音声認識

February 2006

Graduate School of Science and Engineering

Waseda University

Shinji Watanabe

ABSTRACT

Speech recognition is a very important technology, which functions as a human interface that con-verts speech information into text information. Conventional speech recognition systems havebeen developed by many researchers using a common database. Therefore, currently availablesystems relate to the specific environment of the database, which lacks robustness. This lack ofrobustness is an obstacle as regards applying speech recognition technology in practice, and im-proving robustness has been a common worldwide challenge in the fields of acoustic and languagestudies. Acoustic studies have taken mainly two directions: the improvement of acoustic mod-els beyond the conventional Hidden Markov Model (HMM), and the improvement of the acousticmodel learning method beyond the conventional Maximum Likelihood (ML) approach. This the-sis addresses the challenge in terms of improving the learning method by employing a Bayesianapproach.

This thesis defines the term “Bayesian approach” to include a consideration of the posteriordistribution of any variable, as well as the prior distribution. That is to say, all the variablesintroduced when models are parameterized, such as model parameters and latent variables, areregarded as probabilistic variables, and their posterior distributions are obtained based on the Bayesrule. The difference between the Bayesian and ML approaches is that the estimation target isthe distribution functionin the Bayesian approach whereas it is theparameter valuein the MLapproach. Based on this posterior distribution estimation, the Bayesian approach can generallyachieve more robust model construction and classification than an ML approach. In fact, theBayesian approach has the following three advantages:

• Effective utilization of prior knowledge through prior distributions (prior utilization)

• Model selection in the sense of maximizing a probability for the posterior distribution ofmodel complexity (model selection)

• Robust classification by marginalizing model parameters (robust classification)

However, the Bayesian approach requires complex integral and expectation computations to obtainposterior distributions when models have latent variables. The acoustic model in speech recogni-tion has the latent variables included in an HMM and a Gaussian Mixture Model (GMM) . There-fore, the Bayesian approach cannot be applied to speech recognition without losing the aboveadvantages. For example, the Maximum A Posteriori based framework approximates the posteriordistribution of the parameter, which loses two of the above advantages although MAP can utilizeprior information. Bayesian Information Criterion and Bayesian Predictive Classification based

i

ii ABSTRACT

frameworks partially realize Bayesian advantages for model selection and robust classification, re-spectively, in speech recognition by approximating the posterior distribution calculation. However,these frameworks cannot benefit from both advantages simultaneously.

Recently, aVariational Bayesian(VB) approach was proposed in the learning theory field,which avoids complex computations by employing the variational approximation technique. Inthe VB approach, approximate posterior distributions (VB posterior distributions) can be obtainedeffectively by iterative calculations similar to the expectation-maximization algorithm in the MLapproach, while the three advantages provided by the Bayesian approaches are still retained. Thisthesis proposes atotal Bayesian framework, Variational Bayesian Estimation and Clustering forspeech recognition (VBEC), where all acoustic procedures of speech recognition (acoustic model-ing and speech classification) are based on the VBposteriordistribution.

VBEC is based on the following four formulations:

1. Setting the output and prior distributions for the model parameters of the standard acousticmodels represented by HMMs and GMMs (setting)．

2. Estimating the VB posterior distributions for the model parameters based on the VB Baum-Welch algorithm similar to the conventional ML based Baum-Welch algorithm (training)．

3. Calculating VBEC objective functions, which are used for model selection (selection)．

4. Classifying speech based on a predictive distribution, which is analytically derived as theStudent’s t-distribution from the marginalization of model parameters based on the VB pos-terior distribution (classification).

VBEC performs the model construction process, which includes model setting, training and selec-tion (1st, 2nd and 3rd), and the classification process (4th) based on the Bayesian approach. Thus,VBEC can be regarded as a total Bayesian framework for speech recognition.

This thesis introduces the above four formulations, and show the effectiveness of the Bayesianapproach through speech recognition experiments. The first set of experiments show the effec-tiveness of the Bayesian acoustic model construction including the prior utilization and modelselection. This work shows the effectiveness of the prior utilization for the sparse training dataproblem. This thesis also shows the effectiveness of the model selection for clustering context-dependent HMM states and selecting the GMM components, respectively. The second set of ex-periments achieve the automatic determination of acoustic model topologies by expanding theBayesian model selection function in the above acoustic model construction. The topologies aredetermined by clustering context-dependent HMM states and by selecting the GMM componentssimultaneously, and the process takes much less time than conventional manual construction withthe same level of performance. The final set of experiments focus on the classification process,and show the effectiveness of VBEC as regards the problem of the mismatch between training andinput speech by applying the robust classification advantages to an acoustic model adaptation task.

ABSTRACT IN JAPANESE

計算機上での音情報理解の中で最も重要な技術の一つが，音情報をテキスト情報に変換する音声認識技術である．従来の音声認識は研究者間で共通のデータベースを用いることにより限られた環境で性能を競い合うという研究スタイルにより大きく発展を遂げた．しかし，実現されたシステムはモデルパラメータ数百万におよぶ超巨大なものであり，限られた環境に特化されたシステムであるゆえに，頑健性を大きく欠落している．そのため,頑健性の点から音声認識の実用化にまだまだ大きな壁が存在し,いかにして音声認識システムの頑健性を向上させるかは，世界共通の課題となっている．そのような取り組みは音響的及び言語的両視点から研究されており，特に音響的視点においては，モデル化の観点から，従来の隠れマルコフモデル (HMM) にもとづく音響モデルをいかに改良するか，及び，学習理論の観点から，従来の最尤学習にもとづく音響モデル学習をいかに改良するか，という 2つの方向性が存在する．本研究は,学習理論の観点から,ベイズ的手法にもとづいた頑健な音声認識実現を取り組む.

本論文で扱うベイズ的手法は単に事前確率分布を最尤推定法に取り入れるだけでなく，分布パラメータや隠れ変数といった，モデルのパラメトリック表現において導入された全ての変数を確率変数とみなし，その事後確率分布をベイズの定理から推定して利用する手法である. そのため，従来のパラメータ推定にもとづく最尤推定法とは推定対象をパラメータ値ではなく分布関数とする点が大きく異なる．この事後確率分布推定にもとづいて，ベイズ的手法は,音声認識で広く用いられている最尤学習に比べてより頑健なモデル構築・識別が可能であるといわれている．実際に，ベイズ的手法には大きく分けて 3つの利点がある．

• 事前確率分布を介した事前知識の効率的な利用（事前知識の活用）

• モデル構造の多様性を確率変数とみなすことによる，事後確率最大化の意味での与えられた学習データに適したモデル構造の選択 (モデル選択)

• モデルパラメータの周辺化による頑健な識別 (頑健な識別)

しかし隠れ変数存在下で事後確率分布を正確に推定するためには，モンテ・カルロシュミレーションなどの数値的手法が必要である．音声認識用音響モデルは，音素コンテクストからなる多数のカテゴリーを持ち，総計数百万に及ぶパラメータが相互に依存し，またHMM や多次元の混合ガウス分布モデル (GMM)を通して多数の隠れ変数を内包する．このような複雑なモデルを数値的手法で扱う場合，莫大な計算量を必要とするため，音声認識におけるベイズ的手法の実現は大変困難であった．そのため，従来音声認識で実現されてきた事後確率最大化法やベイズ的予測識別法，ベイズ情報量基準法などはいずれも，事後確率分布の推定を

iii

iv ABSTRACT IN JAPANESE

行わない近似的実現手法に過ぎず，先に挙げたベイズ的手法の利点を全て内包するものではなかった．近年，変分ベイズ法にもとづく近似的事後確率分布 (VB事後確率分布とよぶ)推定法が提

案され，隠れ変数存在下においても期待値最大化アルゴリズムにより効率よくモデル学習ができるようになった．本研究では，この変分ベイズ法を元に最尤法にもとづく音声認識を発展させ，従来の近似的なベイズ的手法を内包する本格的なベイズ音声認識VBEC(VariationalBayesian Estimation and Clustering for speech recognition)を構築した. VBECは大きく分けて4つの定式化よりなる．

1. HMMおよびGMMを用いて表現される音響モデルに対して，出力確率分布とそのモデルパラメータに対する事前確率分布の設定 (設定)．

2. 従来の最尤学習法にもとづくBaum-Welchアルゴリズムと同様のVB版のBaum-Welchアルゴリズムを構築して，モデルパラメータに対するVB事後確率分布を推定 (学習)．

3. 学習データに応じた適切なモデル構造の選択のための VB評価関数の算出 (選択)．

4. VB事後確率分布と出力確率分布をもとにモデルパラメータに関して周辺化を行い，予測分布がStudentの t分布として解析的にもとまることを示し，その予測分布をもとにした識別 (識別)．

これにより，1から 3のモデルの設定・学習・選択による音響モデル構築過程及び 4の識別過程，つまり音声認識の音響モデルに対する全ての過程がベイズ的手法で実現される．従って VBECは本格的なベイズ音声認識であるといえる．本研究はこの 4つの定式化を紹介すると共に，それによって実現されるベイズ法の利点

についてそれぞれ実験を用いて検証し，有効性を示す．はじめに，モデルの設定・学習・選択(1から 3)を用いて，ベイズ的手法を首尾一貫して利用した音響モデル構築を実現する．それにより，少量学習データでの本手法の優位性を示すとともに，音素環境依存のHMM 状態クラスタリング問題及びGMM混合数決定におけるモデル選択機能効果を示す．次に，上記音響モデル構築過程におけるモデル選択を発展させ，HMM 状態クラスタリング及びGMM混合数の決定を同時に最適化することにより，音響モデル構造の自動決定を実現する．本手法は，計算機のみによる高性能な音響モデルの自動構築を実現し，さらに従来の人手を用いたモデル構築手法と比較して計算時間を大幅に削減することができる．最後に，モデル識別に注目し，実現される頑健な識別効果を音響モデル適応実験に応用することにより，実用的タスクでの VBECの有効性を示す．

Contents

ABSTRACT i

ABSTRACT IN JAPANESE iii

CONTENTS v

LIST OF NOTATIONS ix

LIST OF FIGURES xiii

LIST OF TABLES xv

1 Introduction 11.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Formulation 72.1 Maximum likelihood and Bayesian approach . . . . . . . . . . . . . . . . . . . . 72.2 Variational Bayesian (VB) approach . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.1 VB-EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.2.2 VB posterior distribution for model structure . . . . . . . . . . . . . . . . 12

2.3 Variational Bayesian Estimation and Clustering for speech recognition (VBEC) . . 132.3.1 Output and prior distributions . . . . . . . . . . . . . . . . . . . . . . . . 142.3.2 VB Baum-Welch algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 162.3.3 VBEC objective function . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.3.4 VB posterior based Bayesian predictive classification . . . . . . . . . . . . 22

2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3 Bayesian acoustic model construction 253.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253.2 Efficient VB Baum-Welch algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 263.3 Clustering context-dependent HMM states using VBEC . . . . . . . . . . . . . . . 27

3.3.1 Phonetic decision tree clustering . . . . . . . . . . . . . . . . . . . . . . . 29

v

vi CONTENTS

3.3.2 Maximum likelihood approach . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3.3 Information criterion approach . . . . . . . . . . . . . . . . . . . . . . . . 32

3.3.4 VBEC approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.4 Determining the number of mixture components using VBEC . . . . . . . . . . . . 34

3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.5.1 Prior utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.5.2 Prior parameter dependence . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.5.3 Model selection for HMM states . . . . . . . . . . . . . . . . . . . . . . . 41

3.5.4 Model selection for Gaussian mixtures . . . . . . . . . . . . . . . . . . . . 42

3.5.5 Model selection over HMM states and Gaussian mixtures . . . . . . . . . . 44

3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4 Determination of acoustic model topology 47

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2 Determination of acoustic model topology using VBEC . . . . . . . . . . . . . . . 48

4.2.1 Strategy for reaching optimum model topology . . . . . . . . . . . . . . . 48

4.2.2 HMM state clustering based on Gaussian mixture model . . . . . . . . . . 50

4.2.3 Estimation of inheritable node statistics . . . . . . . . . . . . . . . . . . . 51

4.2.4 Monophone HMM statistics estimation . . . . . . . . . . . . . . . . . . . 52

4.3 Preliminary experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.3.1 Maximum likelihood manual construction . . . . . . . . . . . . . . . . . . 54

4.3.2 VBEC automatic construction based on 2-phase search . . . . . . . . . . . 56

4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.4.1 Determination of acoustic model topology using VBEC . . . . . . . . . . 57

4.4.2 Computational efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.4.3 Prior parameter dependence . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5 Bayesian speech classification 65

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.2 Bayesian predictive classification using VBEC . . . . . . . . . . . . . . . . . . . . 66

5.2.1 Predictive distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.2.2 Student’s t-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.2.3 Relationship between Bayesian prediction approaches . . . . . . . . . . . 71

5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.3.1 Bayesian predictive classification in total Bayesian framework . . . . . . . 72

5.3.2 Supervised speaker adaptation . . . . . . . . . . . . . . . . . . . . . . . . 74

5.3.3 Computational efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

CONTENTS vii

6 Conclusions 796.1 Review of work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 796.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 796.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

ACKNOWLEDGMENTS 83

ACKNOWLEDGMENTS IN JAPANESE 85

BIBLIOGRAPHY 87

LIST OF WORK 93

APPENDICES 97A.1 Upper bound of Kullback-Leibler divergence for posterior distributions . . . . . . 97

A.1.1 Model parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97A.1.2 Latent variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98A.1.3 Model structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

A.2 Variational calculation for VB posterior distributions . . . . . . . . . . . . . . . . 99A.2.1 Model parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100A.2.2 Latent variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101A.2.3 Model structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

A.3 VB posterior calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103A.3.1 Model parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103A.3.2 Latent variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

A.4 Student’s t-distribution using VB posteriors . . . . . . . . . . . . . . . . . . . . . 106

LIST OF NOTATIONS

Abbreviations

ML : Maximum Likelihood (page i)HMM : Hidden Markov Model (page i)GMM : Gaussian Mixture Model (page i)VB : Variational Bayes (page ii)VBEC : Variational Bayesian Estimation and Clustering for speech recognition (page ii)EM : Expectation-Maximization (page 1)MAP : Maximum A Posteriori (page 1)BIC ; Bayesian Information Criterion (page 3)MDL : Minimum Description Length (page 3 )BPC : Bayesian Predictive Classification (page 3)VB-BPC : VB posterior based BPC (page 4)LVCSR : Large Vocabulary Continuous Speech Recognition (page 8)MFCC : Mel Frequency Cepstrum Coefficients (page 10)RHS : Right Hand Side (page 23)MLC : ML-based Classification (page 23)IWR : Isorated Word Recognition (page 36)JNAS : Japanese Newspaper Article Sentences (page 36)MMIXTURE : GMM based phonetic decision tree method utilizing

Gaussian mixture statistics of monophone HMM (page 53)MSINGLE : GMM based phonetic decision tree method utilizing

single Gaussian statistics of monophone HMM (page 53)AMP : Acoustic Model Plant (page 61)δ BPC : Dirac δ posterior based BPC (page 67)UBPC : Uniform posterior based BPC (page 67)SOLON : NTT Speech recognizer with OutLook On the Next generation (page 72)CSJ : Corpus of Spontaneous Japanese (page 75)SI : Speaker Independent (page 75)

ix

x LIST OF NOTATIONS

Abbreviations of organizations

ASJ : Acoustical Society of Japan (page 37)JEIDA : Japan Electronic Industry Development Association (page 37)IEEE : Institute of Electrical and Electronic EngineersSSPR : Spontaneous Speech Processing and RecognitionNIPS : Neural Information Processing SystemsICSLP : International Conference on Spoken Language ProcessingICASSP : International Conference on Acoustics, Speech, and Signal ProcessingIEICE : Institute of Electronics, Information and Communication Engineers

General notations

p(·), q(·) : Probabilistic distribution functionsO : Set of feature vectors of training datax : Set of feature vectors of input dataΘ : Set of model parametersm : Model structure indexZ : Set of latent variablesc : Category index

Speech recognition notations

e : Speech example indexE : Number of speech examplest : Frame indexTe : Number of frames in exampleed : Dimension indexD : Number of dimensionsOt

e ∈ RD : Feature vector of training speech at framet of exampleeO = {Ot

e|t = 1, .., Te, e = 1, ..., E} : Set of feature vectors of training speechxt ∈ RD : Feature vector of input speech at frametx = {xt|t = 1, .., T} : Set of feature vectors of input speechW : Sentence (word sequence)

LIST OF NOTATIONS xi

Acoustic model notations

i, j : HMM indicesJ : Number of temporal HMM states in a phonemek : Mixture component indexL : Number of mixture components in an HMM stateste : HMM state index at framet of examplee

S = {ste|t = 1, ..., Te, e = 1, ..., E} : Set of HMM states

vte : Mixture component index at framet of examplee

V = {vte|t = 1, ..., Te, e = 1, ..., E} : Set of mixture components

aij : State transition probability from statei to statejwjk : k-th weight factor of mixture component for statejµjk : Gaussian parameter for mean vector of componentk in statejΣjk : Gaussian parameter for covariance matrix of componentk in statejαt

e,j : Forward probability at framet of examplee in statejβt

e,j : Backward probability at framet of examplee in statejγt

e,ij : Transition probability from statei to statejat framet of examplee

ζte,jk : Occupation probability of mixture componentk

in statej at framet of exampleeO : 0-th order statistics (occupation count)M : 1st order statisticsV : 2nd order statisticsΞ : Set of sufficient statistics

Notations of prior and posterior parameters and VB values

{φij}Jj=1 : Dirichlet distribution parameter for{aij}J

j=1

{ϕjk}Lk=1 : Dirichlet distribution parameter for{wjk}L

k=1

ξjk : Normal distribution parameter forµjk

νjk : Normal distribution parameter forµjk

ηjk : Gamma distribution parameter forΣjk,d

Rjk,d : Gamma distribution parameter forΣjk,d

Φ : Set of prior or posterior parametersFm : VB objective function for model structuremFm

Θ : VB objective function of model parameterΘ for model structuremFm

S,V : VB objective function of latent variablesS andV for model structurem

Notations of phonetic decision tree clustering

n : Tree node indexr : Root node indexQ : Phonetic question index∆LQ(n) : Gain of log likelihood for questionQ in noden

∆LQ(n)BIC/MDL : Gain of BIC/MDL objective function for questionQ in noden

∆FQ(n) : Gain of VB objective function for questionQ in noden

xii LIST OF NOTATIONS

Special function notations

Γ(·) : Gamma functionΨ(·) : Digamma functionδ(·) : Dirac delta functionχ(·) : Cumulative function of the standard Gaussian distribution

Definitions of probabilistic distribution functions and normalization constants

Normal distributionN (O|µ,Σ) ≡ CN (Σ) exp

(−1

2(O − µ)′Σ−1(O − µ))

CN (Σ) ≡ (2π|Σ|)−12

Dirichlet distributionD({aj}J

j=1|{φj}Jj=1) ≡ CD(φj)

∏j(aj)φj−1

CD({φj}Jj=1) ≡ Γ(

PJj=1 φj)

QJj=1 Γ(φj)

Gamma distribution

G(σ−1|η, r) ≡ CG(η, r(σ−1

) η2−1 exp

(− r

2σ

)CG(η, r) ≡ (r/2)

η2 /Γ(η/2)

Student’s t-distribution

St(x|ω, λ, κ) ≡ CSt

(1 + 1

κλ(x − ω)2)−κ+1

2

CSt(κ, λ) ≡ Γ(κ+12 )

Γ(κ2 )Γ( 1

2)(

1κλ

) 12

List of Figures

1.1 Automatic speech recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Chapter flow of thesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1 General scheme of statistical model construction and classification. . . . . . . . . . 8

2.2 Hidden Markov model for each phoneme unit. A standard acoustic model forphoneme /a/.T, S, G and D denote search spaces of HMM-temporal, HMM-contextual, GMM and feature vector topologies, respectively. . . . . . . . . . . . . 10

2.3 Hidden Markov model for each phoneme unit. A state is represented by the Gaus-sian mixture below it. There are three states and three Gaussian components in thisfigure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.4 VB Baum-Welch algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.5 Total speech recognition frameworks based on VBEC and ML-BIC/MDL. . . . . . 24

3.1 Efficient VB Baum-Welch algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2 A set of all triphone HMM states */ai/* in i th state sequence is clustered based onthe phonetic decision tree method. . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.3 Splitting a set of triphone HMM states in noden into two sets in yes nodenQY and

no nodenQN by answering phonetic questionQ. . . . . . . . . . . . . . . . . . . . 29

3.4 Tree structure in each HMM state . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.5 Acoustic model selection of VBEC: two-phase procedure. . . . . . . . . . . . . . 35

3.6 The left figure shows recognition rates according to the amounts of training databased on VBEC, ML-BIC/MDL and tuned ML-BIC/MDL. The right figure showsan enlarged view of the left figure for 25∼ 1,500 utterances. The horizontal axisis scaled logarithmically. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.7 Number of splits according to amount of training data (2∼3,000 sentences). . . . . 39

3.8 The left figure shows recognition rates according to the amounts of training databased on VBEC, ML-BIC/MDL and tuned ML-BIC/MDL. The right figure showsan enlarged view of the left figure for more than 1,000 utterances The horizontalaxis is scaled logarithmically. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.9 Total number of clustered states according to the amounts of training data basedon VBEC, ML-BIC/MDL and tuned ML-BIC/MDL. The horizontal and verticalaxes are scaled logarithmically. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.10 Objective functions and recognition rates according to the number of clustered states. 42

xiii

xiv LIST OF FIGURES

3.11 Objective functions and word accuracies according to the increase in the numberof total clustered triphone HMM states. . . . . . . . . . . . . . . . . . . . . . . . 43

3.12 Total objective functionFm and word accuracy according to the increase in thenumber of mixture components per state. . . . . . . . . . . . . . . . . . . . . . . . 44

4.1 Distributional sketch of the acoustic model topology. . . . . . . . . . . . . . . . . 484.2 Optimum model search for an acoustic model. . . . . . . . . . . . . . . . . . . . . 494.3 Estimation of inheritable GMM statistics during the splitting process. . . . . . . . 514.4 Model evaluation test using Test1 (a) and Test2 (b). The contour maps denote word

accuracy distributions for the total number of clustered states and the number ofcomponents per state. The horizontal and vertical axes are scaled logarithmically. . 54

4.5 Determined model topologies and their recognition rates (MSINGLE). The hori-zontal and vertical axes are scaled logarithmically. . . . . . . . . . . . . . . . . . . 58

4.6 Determined model topologies and their recognition rates (MMIXTURE). The hor-izontal and vertical axes are scaled logarithmically. . . . . . . . . . . . . . . . . . 58

4.7 Word accuracies and objective functions using GMM state clustering (MSINGLE).The horizontal axis is scaled logarithmically. . . . . . . . . . . . . . . . . . . . . . 60

4.8 Word accuracies and objective functions using GMM state clustering (MMIX-TURE). The horizontal axis is scaled logarithmically. . . . . . . . . . . . . . . . . 60

5.1 (a) shows the Gaussian (Gauss(x)) derived fromδBPC, the uniform distributionbased predictive distribution (UBPC(x)) derived from UBPC in Eq. (5.6), thevariance-rescaled Gaussian (Gauss2(x)) derived from VB-BPC-MEAN in Eq. (5.9),and two Student’s t-distributions (St1(x) and St2(x)) derived from VB-BPC in Eq.(5.9). (b) employs the logarithmic scale of the vertical axes in (a) to emphasize thebehavior of each distribution tail. The parameters corresponding to mean and vari-ance are the same for all distributions. The hyper-parameters of UBPC are set atC = 3.0 andρ = 0.9. The rescaling parameter of Gauss2(x) (ξ) is 1. The degreesof freedom (DoF) of the Student’s t-distributions (η = κ) are 1 for St1(x) and 100for St2(x). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.2 Recognition rate for various amounts of training data. The horizontal axis is scaledlogarithmically. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.3 Word accuracy for various amounts of adaptation data. The horizontal axis isscaled logarithmically. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

List of Tables

1.1 Comparison of VBEC and other Bayesian frameworks in terms of Bayesian ad-vantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 Speech recognition terms corresponding with statistical learning theory terms . . . 7

2.2 Training specifications for ML and VB . . . . . . . . . . . . . . . . . . . . . . . . 12

3.1 Examples of questions for phoneme /a/ . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2 Experimental conditions for isolated word recognition task . . . . . . . . . . . . . 37

3.3 Experimental conditions for LVCSR (read speech) task . . . . . . . . . . . . . . . 38

3.4 Prior distribution parameters.Or,Mr andVr denote the 0th, 1st, and 2nd statis-tics of a root node (monophone HMM state), respectively. . . . . . . . . . . . . . . 38

3.5 Recognition rates in each prior distribution parameter. The model was trainedusing data consisting of 10 sentences. . . . . . . . . . . . . . . . . . . . . . . . . 41

3.6 Recognition rates in each prior distribution parameter. The model was trained byusing data consisting of 150 sentences . . . . . . . . . . . . . . . . . . . . . . . . 41

3.7 Word accuracies for total numbers of clustered states and Gaussians per state. Thecontour graph on the right is obtained from these results. The recognition resultobtained with the bestmanualtuning with ML was 92.0 and that obtainedauto-maticallywith VBEC was 91.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.1 Experimental conditions for LVCSR (read speech) task . . . . . . . . . . . . . . . 56

4.2 Experimental conditions for isolated word recognition . . . . . . . . . . . . . . . . 62

4.3 Comparison with iterative and non-iterative state clustering . . . . . . . . . . . . . 63

4.4 Prior parameter dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.5 Robustness of acoustic model topology determined by VBEC for different speechdata sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.1 Relationship between BPCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.2 Experimental conditions for isolated word recognition task . . . . . . . . . . . . . 72

5.3 Prior distribution parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.4 Configuration of VBEC and ML based approaches . . . . . . . . . . . . . . . . . 74

5.5 Experimental conditions for LVCSR speaker adaptation task . . . . . . . . . . . . 75

5.6 Prior distribution parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

xv

xvi LIST OF TABLES

5.7 Experimental results for model adaptation experiments for each speaker based onVB-BPC, VB-BPC-MEAN, UBPC andδBPC(MAP). The best scores among thefour methods are highlighted with a bold font . . . . . . . . . . . . . . . . . . . . 78

6.1 Technical trend of speech recognition using variational Bayes . . . . . . . . . . . . 80

Chapter 1

Introduction

1.1 Background

Speech information processing is one of the most important human interface topics in the fieldof computer science. In particular, speech recognition, which converts speech information intotext information, as shown in Figure 1.1, is the core technology for allowing computers to un-derstand the human intent. Speech recognition has been studied for a number of years, and thepreliminary technique of phoneme recognition has now progressed to word recognition and largevocabulary continuous speech recognition [1–4] where the vocabulary size of the state-of-the-artrecognizer amounts to 1.8 million [5]. The current successes in speech recognition are based onpattern recognition, which uses statistical learning theory. Maximum Likelihood (ML) methodshave become the standard techniques for constructing acoustic and language models for speechrecognition. ML methods guarantee that ML estimates approach the true values of the parameters.ML methods have been used in various aspects of statistical learning, and especially for acous-tic modeling in speech recognition since the Expectation-Maximization (EM) algorithm [6] is apractical way of obtaining the local optimum solution for the training of latent variable models.Therefore, acoustic modeling based on Hidden Markov Models (HMMs) and Gaussian MixtureModels (GMMs) have been developed greatly by using the ML-EM approach [7–9]．Other train-ing methods have also been proposed with which to train acoustic model parameters, such asdiscriminative training methods [10–13], Maximum A Posteriori (MAP) methods [14, 15], andquasi-Bayes methods [16,17].

However, the performance of current speech recognition systems is far from satisfactory. Specif-ically, the recognition performance is much poorer than the human recognition ability since speechrecognition has a distinct lack of robustness, which is crucial for practical use. In a real environ-ment, there are many fluctuations originating from various factors such as the speaker, contextdependence, speaking style and noise. In fact, the performance of acoustic models trained usingread speech decreases greatly when the models are used to recognize spontaneous speech due tothe mismatch between the read and spontaneous speech environments [18]. Therefore, most of theproblems posed by current speech recognition techniques result from a lack of robustness. Thislack of robustness is an obstacle in terms of the practical application of speech recognition technol-

1

2 INTRODUCTION

Speech recognizer (decoder)Featureextraction Sentence

Acousticscore Languagescore LanguageModel

My name is Watanabe. My name isWatanabe.

AcousticmodelFigure 1.1: Automatic speech recognition.

ogy, and improving robustness has been a common worldwide challenge in the field of acoustic andlanguage studies. Acoustic studies have taken mainly two directions: the improvement of acous-tic models beyond the conventional HMM, and the improvement of the acoustic model learningmethod beyond the conventional ML approach. This thesis addresses the challenge in terms ofimproving the learning method by employing a Bayesian approach.

This thesis defines the term “Bayesian approach” to mean that it considers the posterior distri-bution of any variable, as well as the prior distribution. That is to say, all the variables introducedwhen models are parameterized, such as model parameters and latent variables, are regarded asprobabilistic variables, and their posterior distributions are obtained based on the Bayes rule. Thedifference between the Bayesian and ML approaches is that the target of estimation is thedistri-bution functionin the Bayesian approach whereas it is theparameter valuein the ML approach.Based on this posterior distribution estimation, the Bayesian approach can generally achieve morerobust model construction and classification than an ML approach [19, 20]. In fact, the Bayesianapproach has the following three advantages:

(A) Effective utilization of prior knowledge through prior distributions (prior utilization)

(B) Model selection in the sense of maximizing a probability for the posterior distribution ofmodel complexity (model selection)

(C) Robust classification by marginalizing model parameters (robust classification)

However, the Bayesian approach requires complex integral and expectation computations to obtainposterior distributions when models have latent variables. The acoustic model in speech recogni-tion has the latent variables included in an HMM and a Gaussian Mixture Model (GMM). There-fore, the Bayesian approach cannot be applied to speech recognition without losing the aboveadvantages. For example, the Maximum A Posteriori based framework approximates the poste-rior distribution of the parameter, which loses two of the above advantages although MAP can

1.2. GOAL 3

utilize prior information. Bayesian Information Criterion (BIC)1 and Bayesian Predictive Clas-sification (BPC) based frameworks partially realize Bayesian advantages for model selection androbust classification, respectively, in speech recognition by approximating the posterior distribu-tion calculation [15, 21, 22]. However, these frameworks cannot benefit from both advantagessimultaneously, as shown in Table 1.1.

1.2 Goal

One goal of this work is to provide speech recognition based on a Bayesian approach to overcomethe lack of robustness described above by utilizing the three Bayesian advantages. Recently, aVariational Bayesian(VB) approach was proposed in the learning theory field that avoids com-plex computations by employing the variational approximation technique [23–26]. With this VBapproach, approximate posterior distributions (VB posterior distributions) can be obtained effec-tively by iterative calculations similar to the Expectation-Maximization algorithm used in the MLapproach, while the three advantages of the Bayesian approaches are still retained. Therefore, torealize the goal, a new speech recognition framework is formulated using VB to replace the MLapproaches with the Bayesian approaches in speech recognition.A total Bayesian frameworkisproposed,Variational Bayesian Estimation and Clustering for speech recognition (VBEC), whereall acoustic procedures for speech recognition (acoustic model construction and speech classifica-tion) are based on the VBposteriordistribution. VBEC includes the three Bayesian advantagesunlike the conventional Bayesian approaches, as shown in Table 1.1. Therefore, this study alsoconfirms the effectiveness of the three Bayesian advantages, prior utilization, model selection androbust classification, in VBEC experimentally.

1.3 Overview

This subsection provides an overview of the work (Figure 1.2) with reference to related paper.Chapter 2 discusses the formulation of VBEC compared with those of the conventional ML ap-proaches. VBEC is based on the following four formulations:

1. Setting the output and prior distributions for the model parameters of the standard acousticmodels represented by HMMs and GMMs (setting)．

Table 1.1: Comparison of VBEC and other Bayesian frameworks in terms of Bayesian advantages

Bayesian advantage VBEC MAP BIC/MDL Conventional BPC(A) Prior utilization

√ √– –

(B) Model selection√

–√

–(C) Robust classification

√– –

√

1BIC and Minimum Description Length (MDL) criterion have been independently proposed, but they are practi-cally the same. Therefore, they are identified in this thesis and referred to as BIC/MDL.

4 INTRODUCTION

2. Estimating the VB posterior distributions for the model parameters based on the VB Baum-Welch algorithm similar to the conventional ML based Baum-Welch algorithm (training)．

3. Calculating VBEC objective functions, which are used for model selection (selection)．

4. Classifying speech based on a predictive distribution, which is analytically derived as theStudent’s t-distribution from the marginalization of model parameters based on the VB pos-terior distribution (classification).

Therefore, VBEC performs the model construction process, which includes model setting, train-ing and selection (1st, 2nd and 3rd), and the classification process (4th) based on the Bayesianapproach [27, 28]. Thus, VBEC can be regarded as a total Bayesian framework for speech recog-nition.

Based on the above four formulations, this thesis shows the effectiveness of the Bayesian ad-vantages, through speech recognition experiments.

• Chapter 3 describes the construction of the acoustic model through the consistent use ofBayesian approaches based on the 1st, 2nd and 3rd formulations [27–30]. The VB Baum-Welch algorithm is applied to acoustic modeling to estimate the VB posteriors after settingthe prior distributions. The effectiveness of the prior utilization is shown in cases where thereis a small amount of speech recognition training data. In addition, Bayesian model selectionis applied to the phonetic decision tree clustering and the selection of GMM components.Thus, the effectiveness of VBEC for acoustic model construction is confirmed experimen-tally.

• Chapter 4 describes the automatic determination of acoustic model topologies achieved byexpanding the VBEC function of the Bayesian model selection presented in the acousticmodel construction in Chapter 3 [31,32]. The determination is realized by clustering context-dependent HMM states and by selecting the GMM components simultaneously, and theprocess takes much less time than conventional manual construction with the same level ofperformance.

• Chapter 5 focuses on speech classification based on Bayesian Predictive Classification us-ing VB posteriors (VB-BPC), and compares VB-BPC with the other classification methodstheoretically and experimentally. The chapter reveals the superior performance of VBECcompared with the other classification methods in the practical tasks by applying robustclassification to acoustic model adaptation [33,34].

Finally, Chapter 6 reviews this thesis and discusses related and future work.

1.3. OVERVIEW 5

Chapter1. IntroductionChapter2. Formalization

Chapter6. Conclusions

Chapter3. Bayesian acousticmodel constructionChapter4. Determination ofacoustic model topology

Chapter5. Bayesian speechclassificationChapter3. Bayesian acousticmodel constructionChapter4. Determination ofacoustic model topologyChapter3. Bayesian acousticmodel constructionChapter4. Determination ofacoustic model topology

Chapter5. Bayesian speechclassificationAcoustic model construction Speech classification

Figure 1.2: Chapter flow of thesis.

Chapter 2

Formulation

This chapter begins by describing the difference between Bayesian and conventional MaximumLikelihood (ML) approaches using general terms of statistical learning theory in Section 2.1. Par-ticular attention is paid to the advantages and disadvantages of the Bayesian approaches over theconventional ML approaches. Then, in Section 2.2, the general solution of the Variational Bayesian(VB) approach is explained, which is an approximate realization of the Bayesian approaches. Here,the general terms of statistical learning theory are also used. Finally, Section 2.3 explains the appli-cation of the VB approach to acoustic model construction and classification. The first two sectionsdeal with a general scheme for statistical model construction and classification. Namely, as shownin Figure 2.1, a model is obtained with model parameterΘ and model structurem using trainingdataO, and unknown datax is classified into categoryc based on the model. Readers engagedin speech recognition may find it easier to follow these two sections by regarding the statisticallearning theory terms as speech recognition terms, as shown in Table 2.1.

Table 2.1: Speech recognition terms corresponding with statistical learning theory terms

Statistical learning Speech recognitiontheory terms terms

O Training data Speech feature vectorc Category Word, phoneme, triphone, etc.Θ Model parameter State transition probability, weight factor, Gaussian parameters, etc.Z Latent variable Sequences of HMM states, sequences of Gaussian mixture components, etc.m Model structure Number of HMM states, number of Gaussian mixture components, prior parameters, etc.

2.1 Maximum likelihood and Bayesian approach

This section briefly reviews the Bayesian approach in contrast with the ML approach by addressingthe three Bayesian advantages explicitly in terms of general learning issues, as shown in Figure2.1. The Bayesian approach is based on posterior distributions, while the ML approach is based onmodel parameters. LetO be a training data set of feature vectors. The posterior distribution for a

7

8 CHAPTER 2. FORMULATION

Model setting, training, and selectionTraining data

ModelClassificationUnknown dataCategory

Figure 2.1: General scheme of statistical model construction and classification.

set of model parametersΘ(c) of categoryc is obtained with the famous Bayes theorem as follows:

p(Θ(c)|O,m) =

∫p(O|Θ,m)p(Θ|m)

p(O|m)dΘ(−c), (2.1)

wherep(Θ|m) is a prior distribution forΘ, andm denotes the model structure index, for example,the number of Gaussian Mixture Model (GMM) components and Hidden Markov Model (HMM)states. Here,−c represents the set of all categories withoutc. In this thesis, we can also regardthe prior parameter setting as the model structure, and include its variations in indexm. FromEq. (2.1), prior information can be utilized via the estimation of the posterior distribution, whichdepends on prior distributions. Therefore, the Bayesian approach is superior to the ML approachfor the following reasons:

(A) Prior utilization

Familiar applications based on prior utilization in speech recognition are Large Vocabulary Contin-uous Speech Recognition (LVCSR) using language and lexicon models as priors [7], and speakeradaptation using speaker independent models as priors [15]. In addition, the Bayesian approachhas two major advantages over the ML approach:

(B) Model selection,

(C) Robust classification.

These two advantages are derived from the posterior distributions. First, by regardingm as aprobabilistic variable, we can consider the posterior distributionp(m|O) for model structurem.

2.1. MAXIMUM LIKELIHOOD AND BAYESIAN APPROACH 9

Oncep(m|O) is obtained, an appropriate model structure that maximizes the posterior probabilitycan be selected1 as follows:

m = argmaxm

p(m|O). (2.2)

Second, once the posterior distributionp(Θ(c)|O,m) is estimated for all categories, the categoryfor input datax is determined by:

c = argmaxc

∫p(x|Θ(c),m)p(Θ(c)|O,m)dΘ(c). (2.3)

The parameters are integrated out in Eq. (2.3) so that the effect of over-training is mitigated, androbust classification is obtained.

Although the Bayesian approach is often superior to the ML approach because of the abovethree advantages, the integral and expectation calculations make any practical use of the Bayesianapproach very difficult. In particular, when a model includes latent (hidden) variables, the cal-culation becomes more complex. LetZ be a set of discrete latent variables. Then, with a fixedmodel structurem, posterior distributions for model parametersp(Θ(c)|O,m) andp(Z(c)|O,m)

are expressed as follows:

p(Θ(c)|O,m) =∑

Z

∫p(O, Z|Θ,m)p(Θ|m)

p(O|m)dΘ(−c) (2.4)

and

p(Z(c)|O,m) =∑Z(−c)

∫p(O, Z|Θ,m)p(Θ|m)

p(O|m)dΘ. (2.5)

The posterior distributions for the model structurep(m|O) are expressed as follows:

p(m|O) =∑

Z

∫p(O, Z|Θ,m)p(Θ|m)p(m)

p(O)dΘ, (2.6)

wherep(m) denotes a prior distribution for model structurem.

These equations cannot be solved analytically. The acoustic model for speech recognitionincludes latent variables in HMMs and GMMs, and the total number of model parameters amountsto more thanone million. In addition, these parameters depend on each other hierarchically, asshown in Figure 2.2. Solving all integrals and expectations numerically requires huge amountsof computation time. Therefore, when applying the Bayesian approach to acoustic modeling forspeech recognition, an effective approximation technique is necessary.

1In a strict Bayesian sense, probabilistic variablem should be marginalized using the posterior distribution for amodel structurep(m|O). However, this means that we should prepare various structure models in parallel, whichwould require a lot of memory and computation time. This would be unsuitable for such a large task as speechrecognition. Therefore, in this thesis, one appropriate model is selected rather than dealing with various models basedonp(m|O).


＊/a/N C/a/VV/a/V/a/＊/a/N C/a/VV/a/V/a/

/a/ HMM Phonetic contextclustering GMM Feature spaceFigure 2.2: Hidden Markov model for each phoneme unit. A standard acoustic model for phoneme/a/. T, S, G andD denote search spaces of HMM-temporal, HMM-contextual, GMM and featurevector topologies, respectively.

2.2 Variational Bayesian (VB) approach

This section focuses on the VB approach and derives general solutions for VB posterior distribu-tionsq(Θ|O,m), q(Z|O,m), andq(m|O) to approximate the true corresponding posterior distri-butions. To begin with, it is assumed that

q(Θ, Z|O,m) =∏

c

q(Θ(c)|O(c),m)q(Z(c)|O(c),m). (2.7)

This assumption means that probabilistic variables associated with each category are statisticallyindependent from other categories. The speech data used in this thesis is well transcribed andthe label information is reliable. In addition, the frequently-used feature extraction (e.g. MelFrequency Cepstrum Coefficients (MFCC) ) from the speech is good enough for the statisticalindependence of the observation data to be guaranteed. Therefore, the assumption of class inde-pendence is reasonable.

2.2.1 VB-EM algorithm

VB posterior distributions for model parameters

This subsection discusses VB posterior distributions for model parameters with fixed model struc-turem. Initially, the arbitrary posterior distributionq(Θ(c)|O,m) is introduced and the Kullback-Leibler (KL) divergence [35] betweenq(Θ(c)|O,m) and true posterior distributionp(Θ(c)|O,m)

is considered:

KL [q(Θ(c)|O,m)|p(Θ(c)|O,m)] =

∫q(Θ(c)|O,m) log

q(Θ(c)|O,m)

p(Θ(c)|O, m)dΘ(c). (2.8)

Substituting Eq. (2.4) into Eq. (2.8) and using Jensen’s inequality, the inequality of Eq. (2.9) isobtained as follows:

KL [q(Θ(c)|O,m)|p(Θ(c)|O,m)] ≤ log p(O|m) −Fm[q(Θ|O,m), q(Z|O,m)], (2.9)

2.2. VARIATIONAL BAYESIAN (VB) APPROACH 11

where,

Fm[q(Θ|O,m), q(Z|O,m)] ≡⟨

logp(O, Z|Θ,m)p(Θ|m)

q(Θ|O,m)q(Z|O,m)

⟩q(Θ|O,m),q(Z|O,m)

. (2.10)

Here, the brackets〈〉 denote the expectation i.e.〈g(y)〉p(y) ≡∫

g(y)p(y)dy for continuous variabley and〈g(n)〉p(n) ≡

∑n g(n)p(n) for discrete variablen. The derivation of the inequality is shown

in detail in Appendix A.1.1. The inequality (2.9) is strict unlessq(Θ|O,m) = p(Θ|O,m) andq(Z|O,m) = p(Z|O,m) i.e. the arbitrary posterior distributionq is equivalent to the true posteriordistributionp. From the assumption Eq. (2.7),Fm is decomposed into each category as follows:

Fm[q(Θ|O, m), q(Z|O,m)] =∑

c

⟨log

p(O(c), Z(c)|Θ(c),m)p(Θ(c)|m)

q(Θ(c)|O(c),m)q(Z(c)|O(c),m)

⟩q(Θ(c)|O(c),m),q(Z|O(c),m)

=∑

c

Fm,(c)[q(Θ(c)|O(c),m), q(Z(c)|O(c),m)].

(2.11)

This indicates that the total objective function is calculated by summing up all objective functionsfor each category.

From inequality (2.9),q(Θ(c)|O,m) approachesp(Θ(c)|O,m) as the right-hand side decreases.Therefore, the optimal posterior distribution can be obtained by a variational method, which resultsin minimizing the right-hand side. Since termlog p(O|m) can be disregarded, the minimization ischanged to the maximization ofFm with respect toq(Θ(c)|O,m), and is given by the followingvariational equation:

δ

δq(Θ(c)|O,m)Fm[q(Θ|O, m), q(Z|O,m)]

=δ

δq(Θ(c)|O,m)Fm,(c)[q(Θ(c)|O(c),m), q(Z(c)|O(c),m)] = 0. (2.12)

From this equation, the optimal VB posterior distributionq(Θ(c)|O,m) is obtained as follows:

q(Θ(c)|O,m) ∝ p(Θ(c)|m) exp(⟨

log p(O(c), Z(c)|Θ(c),m)⟩

q(Z(c)|O(c),m)

). (2.13)

This variational calculation is shown in detail in Appendix A.2.1. In this thesis, a tilde˜ is addedto indicate variationally optimized values or functions.

VB posterior distributions for latent variables

A similar method is used to obtain the optimal VB posterior distributionq(Z(c)|O,m). An in-equality similar to Eq. (2.12) is obtained by considering the KL divergence between the arbitraryposterior distributionq(Z(c)|O,m) and the true posterior distributionp(Z(c)|O, m) as follows:

KL [q(Z(c)|O,m)|p(Z(c)|O,m)] ≤ log p(O|m) −Fm[q(Θ|O,m), q(Z|O,m)], (2.14)

The derivation of the inequality is detailed in Appendix A.1.2.


The optimal VB posterior distributionq(Z(c)|O,m) is also obtained by maximizingFm withrespect toq(Z(c)|O,m) with the variational method as follows:

q(Z(c)|O,m) ∝ exp(⟨

log p(O(c), Z(c)|Θ(c), m)⟩

q(Θ(c)|O(c),m)

), (2.15)

This variational calculation is shown in detail in Appendix A.2.2.

VB-EM algorithm

Equations (2.13) and (2.15) are closed-form expressions, and these optimizations can be effec-tively performed by iterative calculations analogous to the Expectation and Maximization (EM)algorithm [6], which increasesFm at every iteration up to a converged value. Then, Eqs. (2.13)and (2.15), respectively, correspond to the Maximization step (M-step) and the Expectation step(E-step) in the VB approach. Therefore, by substitutingq into q, these equations can be representedas follows:q(Θ(c)|O,m) ∝ p(Θ(c)|m) exp

(⟨log p(O(c), Z(c)|Θ(c),m)

⟩eq(Z(c)|O(c),m)

)q(Z(c)|O,m) ∝ exp

(⟨log p(O(c), Z(c)|Θ(c),m)

⟩eq(Θ(c)|O(c),m)

) . (2.16)

Note that optimal posterior distributions for a particular category can be obtained simply by usingthe category’s variables i.e. we are not concerned with the other categories in the calculation, sinceEq. (2.16) only depends on categoryc, which is based on the assumption given by Eq. (2.7).

Finally, to compare the VB approach with the conventional ML approach for training latentvariable models, the training specifications for ML and VB are summarized, as shown in Table2.2.

Table 2.2: Training specifications for ML and VB

Training Min-max optimization Objective functionML ML-EM Differential method Q functionVB VB-EM Variational method Fm functional

2.2.2 VB posterior distribution for model structure

The VB posterior distributions for a model structure are derived in the same way as in Section2.2.1, and model selection is discussed by employing the posterior distribution. Arbitrary posteriordistributionq(m|O) is introduced and the KL divergence betweenq(m|O) and the true posteriordistributionp(m|O) is considered:

KL [q(m|O)|p(m|O)] =∑m

q(m|O) logq(m|O)

p(m|O). (2.17)

2.3 VBEC 13

Substituting Eq. (2.6) into Eq. (2.17) and using Jensen’s inequality, the inequality of Eq. (2.17)can be obtained as follows:

KL [q(m|O)|p(m|O)]

≤ log p(O) +

⟨log

q(m|O)

p(m)−Fm[q(Θ|O,m), q(Z|O,m)]

⟩q(m|O)

. (2.18)

Similar to the discussion in Section 2.2.1, from the inequality (2.18),q(m|O) approachesp(m|O)

as the right-hand side decreases. The derivation of the inequality is detailed in Appendix A.1.3.Therefore, the optimal posterior distribution for a model structure can be obtained by a variationalmethod that results in minimizing the right-hand side as follows:

q(m|O) ∝ p(m) exp (Fm[q(Θ|O,m), q(Z|O,m)]) . (2.19)

This variational calculation is shown in detail in Appendix A.2.3. Assuming thatp(m) is a uniformdistribution2, the proportion relation betweenq(m|O) andFm is obtained as follows based on theconvexity of the logarithmic function:

Fm′ ≥ Fm ⇔ q(m′|O) ≥ q(m|O). (2.20)

Therefore, an optimal model structure in the sense of maximum posterior probability estimationcan be selected as follows:

m = argmaxm

q(m|O) = argmaxm

Fm. (2.21)

This indicates that by maximizing totalFm with respect to bothq(Θ|O,m), q(Z|O,m), andm,we can obtain the optimal parameter distributions and select the optimal model structure simulta-neously [25,26].

2.3 Variational Bayesian Estimation and Clustering for speechrecognition (VBEC)

The four formulations are obtained by using the VB framework with which to perform acousticmodel construction (model setting, training and selection) and speech classification consistentlybased on the Bayesian approach. Consequently, the conventional formulation based on the MLapproaches to the formulation based on the Bayesian approach is replaced as follows:

• Set output distributions→ Set output distributions and prior distributions(Section 2.3.1)

• ML Baum-Welch algorithm→ VB Baum-Welch algorithm(Section 2.3.2)

2We can set an informative prior distribution forp(m).


23a12a

11a 22a 33a11a 22a 33a

Gaussian mixture for state i

1=i 3=i2=i1=i 3=i2=i

Figure 2.3: Hidden Markov model for each phoneme unit. A state is represented by the Gaussianmixture below it. There are three states and three Gaussian components in this figure.

• Log likelihood→ VBEC objective function(Section 2.3.3)

• ML based classification→ VB-BPC(Section 2.3.4)

These four formulations are explained in the following four subsections, by applying the acousticmodel for speech recognition to the general solution in Section 2.2.

2.3.1 Output and prior distributions

Setting of the output and prior distributions is required when calculating the VB posterior distribu-tions. The output distribution is obtained based on a left-to-right HMM with a GMM for each state,as shown in Figure 2.3, which has been widely used to represent a phoneme category in acousticmodels for speech recognition.

Output distribution

Let Oe = {Ote ∈ RD|t = 1, ..., Te} be a sequential speech data set for an examplee of a phoneme

category andO = {Oe|e = 1, ..., E} be the total data set of a phoneme category. Since theformulations for the posterior distributions are common to all phoneme categories, the phonemecategory indexc is omitted from this section to simplify the equation forms.D is used to denotethe dimension number of the feature vector andTe to denote the frame number for an examplee.The output distribution, which represents a phoneme acoustic model, is expressed by

p(O, S, V |Θ,m) =E∏

e=1

Te∏t=1

ast−1e st

ewst

evtebst

evte(Ot

e), (2.22)

2.3 VBEC 15

whereS is a set of sequences of HMM states,V is a set of sequences of Gaussian mixture com-ponents, andst

e andvte denote the state and mixture components at framet of examplee. Here,

S andV are sets of discrete latent variables, which are the concrete forms ofZ in Section 2.1.The parameteraij denotes the state transition probability from statei to statej, andwjk is thek-th weight factor of the Gaussian mixture for statej. In addition,bjk(O

te)(= N (Ot

e|µjk, Σjk))

denotes the Gaussian with the mean vectorµjk and covariance matrixΣjk defined as:

N (Ote|µjk, Σjk) ≡ (2π)−

D2 |Σjk|−

12 exp

(−1

2(Ot

e − µjk)′Σ−1

jk (Ote − µjk)

), (2.23)

where| · | and ′ denote the determinant and the transpose of the matrix, respectively, whileΘ =

{aij, wjk,µjk, Σ−1jk |i, j = 1, ..., J, k = 1, ..., L} is a set of model parameters. Here,J denotes the

number of states in an HMM sequence andL denotes the number of Gaussian components in astate.

Prior distribution

Conjugate distributions, which are based on the exponential function, are easy to use as priordistributions since the function forms of prior and posterior distributions become the same [15,19, 20]. Then, a distribution is selected where the probabilistic variable constraint is the same asthat of the model parameter. The state transition probabilityaij and the mixture weight factorwjk

has the constraint that∑

j aij = 1 and∑

k wjk = 1. Therefore, the Dirichlet distributions foraij andwjk are used, where the variables of the Dirichlet distribution satisfy the above constraint.Similarly, the diagonal elements of the inverse covariance matrixΣ−1

jk is always positive, and theGamma distribution is used. The mean vectorµjk goes from−∞ to∞, and the Gaussian is used.Thus, the prior distribution for acoustic model parameters is expressed as follows:

p(Θ|m) ≡J∏

i=1

J∏j=1

L∏k=1

p({aij′}Jj′=1|m)p({wjk′}L

k′=1|m)p(µjk, Σjk|m)

≡∏i,j,k

D({aij′}Jj′=1|{φ0

ij′}Jj′=1)D({wjk′}L

k′=1|{ϕ0jk′}L

k′=1)

×N (µjk|ν0jk, (ξ

0jk)

−1Σjk)D∏

d=1

G(Σ−1jk,d|η

0jk, R

0jk,d), (2.24)

Here,Φ0 ≡ {φ0ij, ϕ

0jk, ξ

0jk,ν

0jk, η

0jk, R

0jk|i, j = 1, ..., J, k = 1, ..., L} is a set of prior parameters.

In Eq. (2.24),D denotes a Dirichlet distribution andG denotes a gamma distribution. The priordistributions ofaij andwjk are represented by the Dirichlet distributions, and the prior distributionof µjk andΣjk is represented by the normal gamma distribution. If the covariance matrix elementsare off the diagonal, a normal-Wishart distribution is used as the prior distribution ofµjk andΣjk.


The explicit forms of the distributions are defined as follows:

D({aij}Jj=1|{φ0

ij}Jj=1) ≡ CD({φ0

ij}Jj=1)

∏j(aij)

φ0ij−1

D({wjk}Lk=1|{ϕ0

jk}Lk=1) ≡ CD({ϕ0

jk}Lk=1)

∏k(wjk)

ϕ0jk−1

N (µjk|ν0jk, (ξ

0jk)

−1Σjk) ≡ CN (ξ0jk)|Σjk|−

12 exp

(− ξ0

jk

2(µjk − ν0

jk)′Σ−1

jk (µjk − ν0jk)

)G(Σ−1

jk,d|η0jk, R

0jk,d) ≡ CG(η0

jk, R0jk,d)

(Σ−1

jk,d

) η0jk2

−1exp

(− R0

jk,d

2Σjk,d

) , (2.25)

where CD({φ0

ij}Jj=1) ≡ Γ(

∑Jj=1 φ0

ij)/(∏J

j=1 Γ(φ0ij))

CD({ϕ0jk}L

k=1) ≡ Γ(∑L

k=1 ϕ0jk)/(

∏Lk=1 Γ(ϕ0

jk))

CN (ξ0jk) ≡ (ξ0

jk/2π)D2

CG(η0jk, R

0jk,d) ≡

(R0

jk,d/2) η0

jk2 /Γ(η0

jk/2)

. (2.26)

In the Bayesian approach, an important problem is how to set the prior parameters. In thisthesis, two kinds of prior parameters ofν0 andR0 are set using sufficient amounts of data from:

• Statistics of higher hierarchy models in acoustic models for the acoustic model constructiontask.

• Statistics of speaker independent models for the speaker adaptation task.

The other parameters (φ0, ϕ0, ξ0 andη0) have a meaning as regarding tuning the balance betweenthe values obtained from training data and the above statistics. These parameters are set appro-priately based on experiments, and the dependence of these prior parameters is discussed in theexperimental chapters.

2.3.2 VB Baum-Welch algorithm

This subsection introduces concrete forms of the VB posterior distributions for model parame-ters q(Θ|O,m) and for latent variablesq(Z|O,m) in acoustic modeling, which are efficientlycomputed by VB iterative calculations within the VB framework. This calculation is effectivelycomputed by the VB Baum-Welch algorithm shown in Figure 2.4.

VB M-step

First, the VB M-step for acoustic model training is explained. This is solved by substituting theacoustic model setting in Section 2.3.1 into the general solution for the VB M-step in Section 2.2.The derivation is found in Appendix A.3.1. The calculated results for the optimal VB posterior

2.3 VBEC 17VB Baum-Welch aｌｇｏｒｉｔｈmInitial settingVB forward-backward (E-step)Equations (2.33)~(2.36), (2.38)Accumulating sufficient statistics Equation (2.31)

Calculating VB objective functionEquations (2.39)~(2.42)Convergence judgmentＹＥＳＮＯUpdating posterior parameters (M-step)Equation (2.30)

Speech featureSpeech feature

Acoustic modelAcoustic model

Transition and occupation probabilitiesTransition and occupation probabilitiesSufficient statisticsSufficient statistics

Posterior parameters

Figure 2.4: VB Baum-Welch algorithm.

distributions for the model parameters are summarized as follows:

q(Θ|O,m) ≡∏i,j,k

q({aij′}Jj′=1|O,m)q({wjk′}L

k′=1|O,m)q(µjk, Σjk|O,m)

≡∏i,j,k

D({aij′}Jj′=1|{φij′}J

j′=1)D({wjk′}Lk′=1|{ϕjk′}L

k′=1)

×N (µjk|νjk, (ξjk)−1Σjk)

∏d

G(Σ−1jk,d|ηjk, Rjk,d). (2.27)

The concrete forms of the distributions are defined as follows:

D({aij}Jj=1|{φij}J

j=1) ≡ CD({φij}Jj=1)

∏j(aij)

eφij−1

D({wjk}Lk=1|{ϕjk}L

k=1) ≡ CD({ϕjk}Lk=1)

∏k(wjk)

eϕjk−1

N (µjk|νjk, (ξjk)−1Σjk) ≡ CN (ξjk)|Σjk|−

12 exp

(−

eξjk

2(µjk − νjk)

′Σ−1jk (µjk − νjk)

)G(Σ−1

jk,d|ηjk, Rjk,d) ≡ CG(ηjk, Rjk,d)(Σ−1

jk,d

) eηjk2

−1exp

(−

eRjk,d

2Σjk,d

) , (2.28)


where

CD({φij}Jj=1) ≡ Γ(

∑Jj=1 φij)/(

∏Jj=1 Γ(φij))

CD({ϕjk}Lk=1) ≡ Γ(

∑Lk=1 ϕjk)/(

∏Lk=1 Γ(ϕjk))

CN (ξjk) ≡ (ξjk/2π)D2

CG(ηjk, Rjk,d) ≡(Rjk,d/2

) eηjk2

/Γ(ηjk/2)

. (2.29)

Note that Eqs. (2.24) and (2.27) are members of the same function family, and the only differenceis that the set of prior parametersΦ0 in Eq. (2.24) is replaced with a set of posterior distributionparametersΦ ≡ {φij, ϕjk, ξjk, νjk, ηjk, Rjk|i, j = 1, ..., J, k = 1, ..., L} in Eq. (2.27). The conju-gate prior distribution is adopted because the posterior distribution is theoretically a member of thesame function family as the prior distribution and is obtained analytically, which is a characteristicof the exponential distribution family. Here,Φ are defined as:

φij = φ0ij + γij

ϕjk = ϕ0jk + ζjk

ξjk = ξ0jk + ζjk

νjk =ξ0jkν0

jk+ fMjk

ξ0jk+eζjk

ηjk = η0jk + ζjk

Rjk = diag

[R0

jk + Vjk − 1eζjk

Mjk(Mjk)′ +

ξ0jk

eζjk

ξ0jk+eζjk

(fMjk

eζjk− ν0

jk

)(fMjk

eζjk− ν0

jk

)′]

, (2.30)

where diag[·] denotes the diagonalization operation by setting off diagonal elements as zero.γij,ζjk, Mjk andVjk denote 0th, 1st and 2nd order sufficient statistics, respectively, and are definedas follows:

γij ≡∑

e,t γte,ij

ζjk ≡∑

e,t ζte,jk

Mjk ≡∑

e,t ζte,jkO

te

Vjk ≡∑

e,t ζte,jkO

te(O

te)

′

. (2.31)

These sufficient statisticsΞ ≡ {γij, ζjk,Mjk, Vjk|i, j = 1, ..., J, k = 1, ..., L} are computed byusingγt

e,ij andζte,jk defined as follows:

{γt

e,ij ≡ q(st−1e = i, st

e = j|O,m)

ζte,jk ≡ q(st

e = j, vte = k|O, m)

. (2.32)

Here,γte,ij is a VB transition posterior distribution, which denotes the transition probability from a

statei to a statej at a framet of an examplee, andζte,jk is a VB occupation posterior distribution,

which denotes the occupation probability of a mixture componentk in a statej at a framet of anexamplee, in the VB approach. Therefore,Φ can be calculated fromΦ0, γt

e,ij andζte,jk, enabling

q(Θ|O,m) to be obtained.

2.3 VBEC 19

VB transition probability γ (VB E-step)

By using variational calculation, VB transition probabilityγ is obtained as follows:

γte,ij = q(st−1

e = i, ste = j|O,m)

∝ exp(⟨

log p(O, st−1e = i, st

e = j|Θ,m)⟩

eq(Θ|O,m)

)∝ exp

⟨log αt−1

e,i aij

(∑k

wjkbjk(Ote)

)βt

e,j

⟩eq(Θ|O,m)

(2.33)

αt−1e,i is a forward probability at framet of examplee in statej, andβt

e,j is a backward probabilityat framet of examplee in statej. Therefore,γt

e,ij is obtained as follows:

γte,ij =

αt−1e,i aij

(∑k wjkbjk(O

te)

)βt

e,j∑j′ α

Te

e,j′

, (2.34)

Here,aij, wjk andbjk(Ote) are defined as follows:

aij ≡exp(Ψ(φij) − Ψ(

∑j′ φij′)

)wjk ≡exp

(Ψ(ϕjk) − Ψ(

∑k′ ϕjk′)

)bjk(O

te)≡exp

(−1

2

) (D

(log π + (ξjk)

−1 − Ψ(

eηjk

2

))+ log

∣∣∣Rjk

∣∣∣ + ηjk(Ote − νjk)

′(Rjk)−1(Ot

e − νjk)

) (2.35)

whereΨ(y) is a digamma function defined asΨ(y) ≡ ∂/∂y log Γ(y). These are solved by sub-stituting the acoustic model setting in Section 2.3.1 into the general solution forq(Z|O, M) inSection 2.2. The derivation is found in Appendix A.3.1.α andβ are VB forward and backwardprobabilities defined as:{

αte,j ≡

(∑i α

t−1e,i aij

) ∑k wjkbjk(O

te)

βte,j ≡

∑i aji

(∑k wikbik(O

t+1e )

)βt+1

e,i

. (2.36)

αt=0e,j andβt=Te

e,j are initialized appropriately.

VB occupation probability ζ (VB E-step)

Similarly, ζte,jk is obtained from the variational calculation as follows:

ζte,jk = q(st

e = j, vte = k|O,m)

∝ exp(⟨

log p(O, ste = j, vt

e = k|Θ,m)⟩

eq(Θ|O,m)

)∝ exp

⟨log

(∑i

αt−1e,i aij

)wjkbjk(O

te)β

te,j

⟩eq(Θ|O,m)

(2.37)


Therefore,

ζte,jk =

(∑i α

t−1e,i aij

)wjkbjk(O

te)β

te,j∑

i αTee,i

. (2.38)

Thus,γte,ij and ζt

e,jk are calculated efficiently by using a probabilistic assignment via the familiarforward-backward algorithm. This algorithm is called the VB forward backward algorithm.

Similar to the VB forward-backward algorithm, theViterbi algorithmis also derived within theVB approach by exchanging a summation overi for a maximization overi in the calculation of theVB forward probabilityαt

e,j in Eq. (2.36). This algorithm is called the VB Viterbi algorithm.

Thus, VB posteriors can be calculated iteratively in the same way as the Baum-Welch algorithmeven for a complicated sequential model that includes latent variables such as HMM and GMM foracoustic models. These calculations are referred to as a VB Baum-Welch algorithm, as proposedin [27,28]. VBEC is based on the VB Baum-Welch algorithm.

2.3.3 VBEC objective function

This section discusses the VB objective functionFm for a whole acoustic model topology, i.e.,the VBEC objective function, and provides general calculation results. The VBEC objective func-tion is a criterion for both posterior distribution estimation, and model topology optimization inacoustic model construction. This section begins by focusing on one phoneme category. By sub-stituting the VB posterior distribution obtained in Section 2.3.2, we obtain analytical results forFm, and therefore, this calculation also requires a VB iterative calculation based on the VB Baum-Welch algorithm used in the VB posterior calculation. We can separateFm into two components:one is composed solely ofq(S, V |O,m), whereas the other is mainly composed ofq(Θ|O,m).Therefore, we defineFm

Θ andFmS,V , and representFm as follows:

Fm =

⟨log

p(O, S, V |Θ, m)p(Θ|m)

q(Θ|O,m)

⟩eq(Θ|O,m)

eq(S,V |O,m)

− 〈log q(S, V |O,m) 〉eq(S,V |O,m)

= FmΘ −Fm

S,V ,

(2.39)

where FmΘ ≡

⟨log p(O,S,V |Θ,m)p(Θ|m)

eq(Θ|O,m)

⟩eq(Θ|O,m)

eq(S,V |O,m)

FmS,V ≡ 〈log q(S, V |O,m) 〉

eq(S,V |O,m)

. (2.40)

2.3 VBEC 21

FmS,V includes the effect of latent variablesS andV , and is represented as follows:

FmS,V =

∑i,j

γij

(Ψ(φij) − Ψ(

∑j′

φij′))

+∑j,k

ζjk

(Ψ(ϕjk) − Ψ(

∑k′

ϕjk′))

− 1

2

∑j,k

ζjk

(D

(log(π) +

1

ξjk

− Ψ

(ηjk

2

))+ log

∣∣∣Rjk

∣∣∣)

− 1

2

(∑e,t

ηjk(Ote − νjk)(Rjk)

−1(Ote − νjk)

′

)

−∑

e

log

(∑j

αTee,j

).

(2.41)

The fourth term on the right hand side in Eq. (2.41) is composed of the VB forward probabilityobtained in Eq. (2.36), and requires us to compute Eq. (2.35) iteratively using all frames of data.This part corresponds to the latent variable effect for the VBEC objective function.

Next,FmΘ is represented as follows:

FmΘ =

∑i

logΓ(

∑j φ0

ij)

Γ(∑

j′ φij)

∏j Γ(φij)∏j Γ(φ0

ij)+

∑j

logΓ(

∑k ϕ0

jk)

Γ(∑

k ϕjk)

∏k Γ(ϕjk)∏k Γ(ϕ0

jk)

+∑j,k

log

(π)−eζjkD

2

(ξ0jk

ξjk

)D2

(Γ

(eηjk

2

))D ∣∣R0jk

∣∣ η0jk2

(Γ

(η0

jk

2

))D ∣∣∣Rjk

∣∣∣ eηjk2

. (2.42)

From Eq. (2.42),FmΘ can be calculated by using the statistics of the posterior distribution param-

etersΦ given in Eq. (2.30). This part is equivalent to the objective function for model selectionbased on Akaike’s Bayesian information criterion [36].

The wholeFm for all categories is obtained by simply summing up theFm results obtainedin this section for all categories as in Eq. (2.11). Strictly speaking, this operation often requirescomplicated summation because of the shared structure of the model parameters.

Thus, the analytical result of the VBEC objective functionFm for the acoustic model construc-tion is provided. The VBEC objective function is derived analytically so that it retains the effectsof the dependence among model parameters and of the latent variables, defined in the output dis-tribution in Eq. (2.22), unlike with the conventional Bayesian Information Criterion and MinimumDescription Length (BIC/MDL) approaches. From this standpoint, we can recognize the VBECobjective function as a global criterion for the selection of acoustic model topologies i.e. the modeltopology that maximizes the VBEC objective function is globally optimal. Therefore, the VBECobjective function can compare any acoustic models with respect to all topological aspects andtheir combinations, e.g. contextual and temporal topologies in HMMs, the number of componentsper GMM in an HMM state, and the dimensional size of feature vectors, based on the followingequation:

m = argmaxm∈(T×S×G×D)

Fm (2.43)


HereT, S, G andD denote search spaces of HMM-temporal, HMM-contextual, GMM and featurevector topologies, respectively, as shown in Figure 2.2.

Based on the discussion in Section 2.3, the following eight steps provide a VB training algo-rithm for acoustic modeling.———————————————————————————————————————-Step 1) Set posterior parameterΦ[τ = 0] from initialized transition probabilityγ[τ = 0], occupa-tion probabilityζ[τ = 0] and model structurem (prior parameterΦ0 is included) for each category.Step 2) Computea[τ + 1], w[τ + 1] andb(O)[τ + 1] usingΦ[τ ]. (By Eq. (2.35))Step 3) Updateγ[τ + 1] and ζ[τ + 1] via the Viterbi algorithm or forward-backward algorithm.(By Eqs. (2.34) and (2.38))Step 4) Accumulate sufficient statisticsΞ[τ + 1] usingγ[τ + 1], ζ[τ + 1]. (By Eqs. (2.31)Step 5) ComputeΦ[τ + 1] usingΞ[τ + 1] andΦ0. (By Eq. (2.30))Step 6) Calculate totalFm[τ +1] for all categories. (By using Eqs. (2.42) and (2.41) and summingup all categories’Fm)Step 7) If |(Fm[τ + 1] − Fm[τ ])/Fm[τ + 1]| ≤ ε, then stop; otherwise setτ ← τ + 1 and go toStep 2.Step 8) CalculateFm for all possiblem and findm(= argmaxm Fm).———————————————————————————————————————-Here,τ denotes an iteration count, andε denotes a threshold that checks whetherFm converges.Thus, the posterior distribution estimation in the VBEC framework can be effectively constructedbased on the VB Baum-Welch algorithm, which is analogous to the ML Baum-Welch algorithm.In addition, VBEC can recognize the model selection using the VB objective function as shown inStep 8. Thus, VBEC can construct an acoustic model consistently based on the Bayesian approach.

Note that if we change → (a value with attached indicates an ML estimate),Φ → Θ

andFm → Lm (whereLm means the log-likelihood for a modelm), this algorithm becomes anML-based framework, except for the model selection. Therefore, in the implementation phase,the VBEC framework can be realized in the conventional systems of acoustic model constructionby adding the prior distribution setting and by changing the estimation procedure and objectivefunction calculation.

2.3.4 VB posterior based Bayesian predictive classification

This subsection deals with the classification based on the Bayesian approach. Here, the phonemecategory indexc is restored because the category index is important when discussing speech clas-sification. In recognition, the feature vector sequencex = {xt ∈ RD|t = 1, ..., T} of input speechis classified as the optimal phoneme categoryc using a conditional probability functionp(c|x,O)

given input datax and training dataO, as follows:

c = argmaxc

p(c|x, O) = argmaxc

p(c|O)p(x|c, O)

∼= argmaxc

p(c)p(x|c, O).(2.44)

2.3 VBEC 23

Here,p(c) is the prior distribution of phoneme categoryc obtained by language and lexicon mod-els. c is assumed to be independent ofO (i.e., p(c|O) ∼= p(c)). p(x|c, O) is calledpredictivedistribution because this distribution predicts the probability of unknown datax conditioned bytraining dataO. We focus on the predictive distributionp(xt|c, i, j, O) of t-th frame input dataxt

at the HMM transition fromi to j of categoryc. Then, by introducing an output distribution with aset of distribution parametersΘ and a model structurem, p(xt|c, i, j, O) is represented as follows:

p(xt|c, i, j, O) =∑m

∫p(xt|c, i, j, Θ,m)p(Θ|c, i, j, O, m)p(m|O)dΘ

=∑m

∫p(xt|Θ(c)

ij ,m)p(Θ(c)ij |O,m)p(m|O)dΘ

(c)ij ,

(2.45)

whereΘ(c)ij ≡ {a(c)

ij , w(c)jk , µ

(c)jk , Σ

(c)jk |k = 1, ..., L} is a set of model parameters in categoryc. By

selecting an appropriate model structurem, the predictive distribution is approximated as follows:

p(xt|c, i, j, O) ∼=∫

p(xt|Θ(c)ij , m)p(Θ

(c)ij |O, m)dΘ

(c)ij , (2.46)

Therefore, by calculating the integral in Eq. (2.46), the accumulated score of a feature vectorsequencex can be computed by summing up each frame score based on the Viterbi algorithm,which enables input speech to be classified. This predictive distribution based approach, whichinvolves considering the integrals and true posterior distributionsp(Θ

(c)ij |O,m) in Eq. (2.46), is

called the Bayesian inference or Bayesian Predictive Classification (BPC) approach [19,20].After acoustic modeling in Section 2.3.3, the optimal VB posterior distributions are obtained

for the optimal model structureq(Θ|O, m). Therefore, VBEC can deal with the integrals in Eq.(2.46) by using the estimatedVB posterior distributionsq(Θ(c)

ij |O, m) as follows:

p(xt|c, i, j, O,m) ∼=∫

p(xt|Θ(c)ij , m)q(Θ

(c)ij |O, m)dΘ

(c)ij . (2.47)

The integral overΘ(c)ij can be solved analytically by substituting Eqs. (2.22) and (2.27) into Eq.

(5.8). If we consider the marginalization of all the parameters, the analytical result of the RightHand Side (RHS) in Eq. (2.47) is found to be a mixture distribution based on the Student’s t-distributionSt(·), as follows:

RHS in Eq. (2.47)

=φ

(c)ij∑

j φ(c)ij

∑k

ϕ(c)jk∑

k ϕ(c)jk

∏d

St

(xt

d

∣∣∣∣∣ν(c)jk,d,

(1 + ξ(c)jk )R

(c)jk,d

ξ(c)jk η

(c)jk

, η(c)jk

).

(2.48)

This approach is called VB posterior based BPC (VB-BPC).The use of VB-BPC makes VBEC a total Bayesian framework for speech recognition that

possesses a consistent concept whereby all acoustic procedures (acoustic model construction andspeech classification) are carried out based on posterior distributions, as shown in Figure 2.5. Fig-ure 2.5 shows the VBEC framework compared with a conventional approach, ML-BIC/MDL: themodel parameter estimation, model selection and speech classification are based on ML, BIC/MDLand the conventional ML-based Classification (MLC) , respectively.


Model setting: output distribution+ prior distributionVB Baum-Welch algorithmModel selection based on

VB-BPC

Model setting: output distributionML Baum-Welch algorithmModel selection basedon BIC/MDL

MLCSpeech classification

(Decoder)

Acoustic model construc

tion VBEC MLTraining data Training data

Acoustic modelAcoustic model Language modelLexical modelResults Results

Acoustic modelRecognition dataRecognition data

Figure 2.5: Total speech recognition frameworks based on VBEC and ML-BIC/MDL.

2.4 Summary

This chapter described the difference between Bayesian and conventional ML approaches, ex-plained the general solution of the VB approach, and formulated the total Bayesian framework forspeech recognition, VBEC. VBEC is based on four formulations, i. e. model setting, training,selection, and speech classification. Therefore, VBEC performs the model construction process,which includes model setting, training and selection (1st, 2nd and 3rd), and the classification pro-cess (4th) based on the Bayesian approach. Thus, we can say that VBEC is a total Bayesianframework for speech recognition that includes three Bayesian advantages, i. e., prior utilization,model selection, and robust classification. The following three chapters confirm the effectivenessof the Bayesian advantages using speech recognition experiments and an explanation of their im-plementation.

Chapter 3

Bayesian acoustic model construction

3.1 Introduction

The accurate construction of acoustic models contributes greatly to improving speech recogni-tion performance. The conventional framework for acoustic model construction is based on Hid-den Markov Models (HMMs) and Gaussian Mixture Models (GMMs) where the statistical modelparameters are estimated by the familiar Maximum Likelihood (ML) approach. The ML-basedframework presents problems as regards the construction of acoustic models of real speech recog-nition tasks. The reasons for the problems are that ML often estimates parameters incorrectly whenthe amounts of training data assigned for the parameters are small (over-training), and cannot selectan appropriate model structure in principle. This degrades the speech recognition performance, andis especially true for spontaneous speech recognition tasks, which exhibit greater speech variabilitythan read speech [18].

The Bayesian approach is based on aposterior distributionestimation, which can utilize priorinformation via a prior distribution, and can select an appropriate model structure based on aposterior distribution for a model structure, which excludes the over-trained models. Together,these two advantages totally mitigate the effects of over-training. By using these advantages,the Bayesian approach-based speech recognition framework is expected to solve the problemsoriginating from the ML’s limitations. The Variational Bayesian Estimation and Clustering forspeech recognition (VBEC) framework in acoustic model construction includes prior setting, anestimation of the Variational Bayesian (VB) posterior distributions based on the VB Baum-Welchalgorithm and the selection of the appropriate model structure using the VBEC objective function,as discussed in Section 2.31. Consequently, VBEC can totally mitigate the over-training effectswhen constructing acoustic models used in speech recognition.

This chapter mainly introduces an implementation of Bayesian acoustic model constructionwithin the VBEC framework. The implementation is discussed in relation to the following threeaspects:

• Section 3.2 describes the derivation of an efficient VB Baum-Welch algorithm designed to

1VBEC also includes Bayesian predictive classification in the speech classification process, and this will be dis-cussed in Chapter 5.

25

26 CHAPTER 3. BAYESIAN ACOUSTIC MODEL CONSTRUCTION

improve the computational efficiency of the original VB Baum-Welch algorithm in Section2.3.2.

• Section 3.3 reports the application of VBEC model selection to the context-dependent HMMclustering based on the phonetic decision tree method.

• Section 3.4 describes the application of VBEC model selection to the determination of mix-ture components.

Section 3.5 confirms the effectiveness of Bayesian acoustic model construction based on the aboveimplementations using speech recognition experiments. Experiments show the Bayesian advan-tages of prior utilization and model selection.

3.2 Efficient VB Baum-Welch algorithm

With acoustic model training, the VB forward-backward algorithm imposes the greatest computa-tional load, which is proportional to the product of the number of HMM statesI and the numberof framesT , similar to the conventional ML forward-backward algorithm. However, the originalVB forward-backward algorithm described in Section 2.3.2 requires much more computation thanthe ML forward-backward algorithm in each frame, and is not suitable for practical use. There-fore, this section introduces an efficient VB forward-backward algorithm to reduce the amountof computation required in each frame by summarizing the factors, which does not depend onframet. Consequently, a new VB Baum-Welch algorithm is proposed based on the efficient VBforward-backward algorithm.

VB transition probability γ

The VB forward-backward algorithm is obtained by computing the VB transition probabilityγ

and VB occupation probabilityζ, as discussed in Section 2.3.2. From Eqs. (2.34) and (2.36),VB transition probabilityγ is obtained by computingaij

∑k wjkbjk(O

te) defined in Eq. (2.35) for

each framet. By substituting (2.35) intoaij

∑k wjkbjk(O

te), and by employing the logarithmic

function, the concrete form of the equation is represented as follows:

log

(aij

∑k

wjkbjk(Ote)

)

=(Ψ(φij) − Ψ(

∑j′

φij′))log

(∑k

exp(Ψ(ϕjk) − Ψ(

∑k′

ϕjk′))

× exp

(D

2

(− log 2π − 1

ξjk

+ Ψ( ηjk

2

))− 1

2

∑d

(log

(Rjk,d

2

)+

(Ote,d − νjk,d)

2ηjk

Rjk,d

)))(3.1)

Equation (3.1) is very complicated, and in addition, the computation of the digamma functionΨ(·)in (3.1) is very heavy [37]. Therefore, the factors, which do not depend on frames, are summarized

3.3. CLUSTERING CONTEXT-DEPENDENT HMM STATES USING VBEC 27

for computation in advance in the initialization or VB M-step, similar to the calculation of the nor-malization constant in the conventional ML Baum-Welch algorithm. Then, Eq. (3.1) is simplifiedas follows:

log

(aij

∑k

wjkbjk(Ote)

)= Hij + log

(∑k

exp(Ajk + (Ot

e − Gjk)′Bjk(O

te − Gjk)

))(3.2)

where

Φ′ :

Hij ≡ Ψ(φij) − Ψ(∑

j′ φij′)

Ajk ≡ Ψ(ϕjk) − Ψ(∑

k′ ϕjk′) + D2

(− log 2π − 1

eξjk+ Ψ

(eηjk

2

))− 1

2log

∣∣∣ eRjk

2

∣∣∣Gjk ≡ νjk

Bjk ≡ − eηjk

2R−1

jk

(3.3)

Φ′ ≡ {H, A, G, B} are the frame invariant factors, which do not depend on frames. Therefore,by usingΦ′ instead ofΦ, γ only requires the computation of Eq. (3.2) for each frame, which isequivalent to computing the likelihood of the mixture of Gaussians with the state transition in theconventional ML computation. Thus, the VB transition probability computation can be simplifiedto a conventional ML transition probability computation.

VB occupation probability ζ

In a similar way to that used for VB transition probabilityγ, the VB occupation probabilityζcomputation required in each frame can also be simplified to that of the conventional ML approachby using frame invariant factorsΦ′ ≡ {H, A, G, B}. The logarithmic function ofaijwjkbjk(O

te),

which is required in Eq. (2.38) can be simplified by usingΦ′ as follows:

log(aijwjkbjk(O

te)

)= Hij + Ajk + (Ot

e − Gjk)′Bjk(O

te − Gjk). (3.4)

These efficient computation methods are available for the VB Viterbi algorithm. Thus, an effi-cient VB forward-backward algorithm is realized based on the new computation of VB transitionprobability γ and VB occupation probabilityζ, which is computed in the same way as the MLforward-backward algorithm. Therefore, an efficient VB Baum-Welch algorithm is also realizedbased on the VB forward-backward algorithm, which is computed in almost the same way as theML Baum-Welch algorithm, since most of the computation time is used for the forward-backwardalgorithm. This procedure is shown in Figure 3.1.

3.3 Clustering context-dependent HMM states using VBEC

Recently, a context-dependent HMM has been widely used as a standard acoustic model in speechrecognition. The triphone HMM is often adopted as the context-dependent model, which considersthe preceding and following phoneme contexts as well as the center phoneme. Since there are alarge number of triphone contexts, it is almost impossible to collect a sufficient amount of training

28 CHAPTER 3. BAYESIAN ACOUSTIC MODEL CONSTRUCTIONEfficient VB Baum-Welch aｌｇｏｒｉｔｈmInitial settingVB forward-backward (E-step)Equations (2.34),(2.36),(2.38),(3.2),(3.4)Accumulating sufficient statistics Equation (2.29)

Calculating VB objective functionEquations (2.39)~(2.42)Convergence judgmentＹＥＳＮＯUpdating posterior parameters (M-step)Equation (2.30)

Speech featureSpeech feature

Acoustic modelAcoustic model

Transition and occupation probabilitiesTransition and occupation probabilitiesSufficient statisticsSufficient statistics

Posterior parameters Updating frame invariant factorsEquation (3.3)Frame invariant factors

Figure 3.1: Efficient VB Baum-Welch algorithm.

data to estimate all the parameters of triphone HMM states, and this data insufficiency causes over-training. To solve these problems, there are certain methods for sharing parameters over severaltriphone HMM states by the clustering approach [38–41].

Clustering triphone HMM states corresponds to the appropriate selection of the sharing struc-ture of states and the total number of shared states. Therefore, this clustering can be regarded asmodel selection. Conventionally, the ML criterion has been used as the model selection criterion.However, the ML criterion requires the number of shared states or the likelihood gain to be exper-imentally set as a threshold. This is because the likelihood value increases monotonically as thenumber of model parameters increases, and always leads to the selection of a model structure withthe largest number of parameters, in the sense of ML. Therefore the ML criterion is not suitable forsuch a model structure selection. Information criterion approaches typified by the Akaike infor-mation criterion [42], the Minimum Description Length (MDL) [43] and the Bayesian InformationCriterion (BIC) [44] can be used to select an appropriate model structure. In speech recognition,BIC/MDL based acoustic modeling approaches have been widely used [21, 45–48] to deal withacoustic model selection. However, these criteria are derived based on an asymptotic condition,and are only effective when there is a sufficient amount of training data. Practical acoustic mod-


Yes No Yes NoYes No

Yes No

Yes No*/ai/*

k/ai/ik/ai/o：ts/ai/mch/ai/ng：

Root node (n=r)Leaf node

Figure 3.2: A set of all triphone HMM states */ai/* in i th state sequence is clustered based on thephonetic decision tree method.

Yes NoPhoneticquestion

Figure 3.3: Splitting a set of triphone HMM states in noden into two sets in yes nodenQY and no

nodenQN by answering phonetic questionQ.

eling often encounters cases where the amount of training data is small, and therefore, a methodis desired that is not limited by the amount of training data.. On the other hand, VBEC includesprior utilization, as well as model selection, as shown in Table 1.1, and can overcome the problemsfound with the conventional ML and BIC/MDL approaches. This section explains the applicationof VBEC model selection to the triphone HMM clustering by using the VBEC objective functiondescribed in Section 2.3.3

3.3.1 Phonetic decision tree clustering

The phonetic decision tree method has been widely used to cluster triphone HMM states effectivelyby utilizing the phonetic knowledge-based constraint and the binary-tree search. This sectionprovides a general discussion of a conventional phonetic decision tree method.

The original ideas behind phonetic decision tree construction are familiar because its first pro-


Figure 3.4: Tree structure in each HMM state

posal used likelihood as the objective function [41]. An appropriate choice of phonetic questionat each node split allows a decision tree to grow properly, and appropriate state clusters becomerepresented in its leaf nodes, as shown in Figure 3.2. The phonetic question concerns the precedingand following phoneme context, and is obtained through knowledge of the phonetics. Table 3.1shows example questions. When noden is split into yes (nQ

Y ) and no (nQN ) nodes according to

questionQ, as shown in Figure 3.3, an appropriate questionQ(n) is chosen from a set of questionsso that the split gives the largest gain in an arbitrary objective functionHm, i.e., :

Q(n) = argmaxQ

∆HQ(n), (3.5)

where∆HQ(n) ≡ HnQ

Y + HnQN −Hn (3.6)

is the overall gain in objective function when noden is split byQ. Thus, a decision tree is producedspecifically for each state in the sequence, and the trees are independent of each other, as shown inFigure 3.4.

The arbitrary objective functionHn in nodeN is computed by the sufficient statisticΞn innoden by assuming the following constraint.

(C1) Data alignments for each state are fixed during the splitting process.

Table 3.1: Examples of questions for phoneme /a/

Question Yes NoPreceding phoneme is vowel? {a, i, u, e, o}/ a /{ all } otherwise

Following phoneme is plosive ? { all }/ a /{p, b, t, d, k, g} otherwise...

......


Under this constraint, 0-th, 1st and 2nd statistics of noden (Ξn ≡ {On, Mn,Vn}) are computedby simply summing up sufficient statistics ofj (Oj, Mj andVj) by using the following equation:

On =∑

j∈n Oj

Mn =∑

j∈n Mj

Vn =∑

j∈n Vj

, (3.7)

where,j represents a non-clustered triphone HMM state, and is included in a set of triphone HMMstates in noden. Here,Oj, Mj andVj are calculated by using occupation probabilityζ and featurevectorO as follows:

Oj =∑

e,t ζte,j

Mj =∑

e,t ζte,jO

te

Vj =∑

e,t ζte,jO

te(O

te)

′. (3.8)

Therefore, once statisticsΞj ≡ {Oj,Mj,Vj} are prepared for all possible triphone HMM states,the statistics for any node can be easily calculated using Eq. (3.7) under constraint (C1). Here,occupation probabilityζt

e,j is obtained by the forward-backward or Viterbi algorithm within theVB or ML framework. This reduces the computation time to a practical level. The following threesections derive the concrete form of gain objective function∆HQ(n) based on the ML, BIC/MDL,and VBEC approaches.

3.3.2 Maximum likelihood approach

The gain of log-likelihood∆LQ(n) is simply obtained using the sufficient statisticsΞ under thefollowing two constraints:

(C2) The output distribution in a state is represented by a normal distribution.

(C3) The contribution of the state transition to the objective functions is disregarded.

Let Oj = {Ote ∈ RD|e, t ∈ j} be a set of feature vectors that are assigned to statej by the Viterbi

algorithm, andOn = {Oj|j ∈ n} be a set of feature vectors that are assigned to noden. From theconstraints, log-likelihoodLn for a training data set, which is assigned to a state set in noden, isexpressed by the followingD-dimensional normal distribution:

Ln = log p(On|µn, Σn)

= log∏j∈n

∏e,t∈j

N (Ote|µ

n, Σn)

∝ log∏j∈n

∏e,t∈j

∣∣∣Σn∣∣∣− 1

2exp

(−1

2

(Ot

e − µn)′

(Σn)−1(Ot

e − µn))

,

(3.9)

whereµn andΣn denote aD dimensional mean vector and aD × D diagonal covariance matrixfor a data setOn in n, respectively. From Eq. (3.9), ML estimatesµn andΣn can be obtainedusing the sufficient statisticsΞn in noden in Eq. (3.7) as follows:{

µn = Mn

On

Σn = diag[Vn

On − Mn

On

(Mn

On

)′] . (3.10)


Therefore, the gain of log-likelihood∆LQ(n) can be solved as follows [41]:

∆LQ(n) = LnQY + LnQ

N − Ln

= l(nQY ) + l(nQ

N) − l(n).(3.11)

Herel(n) in Eq. (3.11) is defined as:

l(n) = −1

2

(On log

(∣∣∣Σn∣∣∣))

= −1

2

(On log

(∣∣∣∣diag

[Vn

On− Mn

On

(Mn

On

)′]∣∣∣∣)). (3.12)

Equations (3.11) and (3.12) show that∆LQ(n) can be calculated using the sufficient statisticsΞn

in noden.

Therefore, an appropriate questionQ(n) for noden can be selected by:

Q(n) = argmaxQ

∆LQ(n). (3.13)

However,∆LQ(n) is always positive for any split in the ML criterion, and always selects the modelstructure in which the number of states is the largest. Namely, no states are shared at all. To avoidthis, the ML criterion requires the following threshold to be set manually to stop splitting:

∆LQ(n) ≤ Threshold. (3.14)

There are other manual approaches that stop splitting by setting the number of total states, orthe maximum depth of the tree, as well as a hybrid approach. However, the effectiveness of thethresholds in all of these manual approaches has to be judged on the basis of experimental results.

3.3.3 Information criterion approach

This subsection considers automatic model selection based on an information criterion using BIC/MDL,which is widely used as the model selection criterion for various aspects of statistical modeling.The gain of objective function∆LQ(n)

BIC/MDL using BIC/MDL is obtained while splitting a state setby using questionQ, as follows:

∆LQ(n)BIC/MDL = ∆LQ(n) − λ

#(Θn)

2logOr, (3.15)

whereλ is a tuning parameter in BIC/MDL, andOr denotes the frame number of data assignedto a root node.#(Θn) is the number of free parameters used in noden. From the constraints, thefree parameters are aD-dimensional mean vector and aD × D diagonal covariance matrix, andtherefore,#(Θn) = 2D. Equation (3.15) suggests that the BIC/MDL objective function penalizesthe gain in log-likelihood on the basis of the balance between the number of free parameters and theamount of training data, and the penalty can be controlled by varyingλ (in the original definitions


λ = 1 [44] [43]). Model structure selection is achieved according to the amount of training databy using∆LQ(n)

BIC/MDL instead of using∆LQ(n) in Eq. (3.13), and by stopping splitting when:

∆LQ(n)BIC/MDL ≤ 0. (3.16)

Therefore, the BIC/MDL approach does not require a threshold unlike the ML approach.BIC/MDL is an asymptotic criterion that is theoretically effective only when the amount of

training data is sufficiently large. Therefore, for a small amount of training data, model selectiondoes not perform well because of the uncertainty of the ML estimates. The next section aims atsolving the problem caused by a small amount of training data by using VBEC.

3.3.4 VBEC approach

Similar to Eq. (3.13), an appropriate question at each split is chosen to increase the VBEC objectivefunctionFm in the VBEC framework. When noden is split into Yes node (nQ

Y ) and No node (nQN )

by questionQ, the appropriate questionQ(n) is chosen from a set of questions as follows:

Q(n) = argmaxQ

∆FQ(n), (3.17)

where∆FQ(n) is the gain in the VBEC objective function when noden is split byQ. The questionis chosen to maximize the gain inFm by splitting. The VBEC objective function for phoneticdecision tree construction is also simply calculated under the same constraints as the ML approach((C1), (C2), and (C3)). By using these conditions, the objective function is obtained withoutiterative calculations, which reduces the calculation time. Under condition (C1), the latent variablepart ofFm can be disregarded, i.e.,

Fm ≈ FmΘ . (3.18)

In the VBEC objective function of model parameterFmΘ (Eq. (2.42)), the factors of posterior

parametersφ andϕ can also be disregarded under conditions (C1) and (C2), whereφ andϕ are re-lated to the transition probability and weight factor, respectively. Therefore, the objective functionFn in noden for assigned data setOn can be obtained fromFm

Θ in Eq. (2.42) as follows:

Fn = log

(2π)−OnD

2

(ξn,0

ξn

)D2 2

ηnD2

(Γ

(ηn

2

))D |Rn,0|ηn,0

2

2ηn,0D

2

(Γ

(ηn,0

2

))D

|Rn|ηn

2

, (3.19)

where{ξn,νn, ηn, Rn}( ≡ ψn) is a subset of the posterior parameters in Eq. (2.30), and is repre-sented by:

ξn = ξn,0 + On

νn = ξn,0νn,0+Mn

ξn,0+On

ηn = ηn,0 + On

Rn = diag[Rn,0 + Vn − 1

On Mn(Mn)′ + ξn,0On

ξn,0+On

(Mn

On − νn,0) (Mn

On − νn,0)′] . (3.20)


On,Mn andVn are the sufficient statistics in noden, as defined in Eq. (3.7). Here,{ξn,0,νn,0, ηn,0, Rn,0}(≡ψn,0) is a set of prior parameters. In our experiments, prior parametersνn,0 andRn,0 are set byusing monophone (root node) HMM state statistics (Or, Mr andVr) as follows:{

νn,0 = Mr

Or

Rn,0 = ηn,0( Vr

Or − νn,0 (νn,0)′) (3.21)

The other parametersξn,0 andηn,0 are set manually. By substituting Eq. (3.20) into Eq. (3.19), thegain∆FQ(n) is obtained whenn is split intonQ

Y , nQN by questionQ,

∆FQ(n) = f(ψnQY ) + f(ψnQ

N ) − f(ψn) − f(ψn,0). (3.22)

Here,f(ψ) is defined by:

f(ψ) ≡ −D

2log ξ − η

2log |R| + D log Γ

(η

2

). (3.23)

Note that the terms that do not contribute to∆FQ(n) are disregarded. Node splitting stops whenthe condition:

∆FQ(n) ≤ 0, (3.24)

is satisfied similar to the BIC/MDL approach. A model structure based on the VBEC frameworkcan be obtained by executing this construction for all trees, resulting in the maximization of totalFm. This implementation based on the phonetic decision tree method does not require iterativecalculations, and can construct clustered-state HMMs efficiently. There is another major methodfor the construction of clustered-state HMMs that uses successive state splitting algorithm, andwhich does not remove latent variables in HMMs [39,40]. Therefore, this requires the VB Baum-Welch algorithm and the calculation of latent variable part of the VBEC objective function for eachsplitting. This is realized as the VB SSS algorithm by [49].

The relationship between the VBEC model selection and the conventional BIC/MDL modelselection based on Eqs. (3.22) and (3.15), respectively, is discussed. Based on the condition of asufficiently large data set,ξn, ηn → On, νn → Mn/On, andRn → Vn − Mn(Mn)′/On, inaddition,log Γ(ηn/2) → −(On/2) log(On/2) −On/2 in Eq.(3.22). Then, an asymptotic form ofEq.(3.22) is composed of a log-likelihood gain term and a penalty term depending on the number offree parameters and the amount of training data i.e. the asymptotic form becomes the BIC/MDL-type objective function form2. Therefore, VBEC theoretically involves the BIC/MDL objectivefunction and the BIC/MDL is asymptotically equivalent to VBEC, which displays the advantagesof VBEC, especially for small amounts of training data.

3.4 Determining the number of mixture components using VBEC

Once the clustered-state model structure is obtained, the acoustic model selection is completed bydetermining the number of mixture components per state. GMMs include latent variables, and its

2Strictly speaking, the penalty term of Eq. (3.15) is slightly different from that of the asymptotic form in Eq. (3.22)since [21] [46] useOr in the penalty term rather thanOn.

3.4. DETERMINING THE NUMBER OF MIXTURE COMPONENTS USING VBEC 35

Clustering triphone HMM states(Single Gaussian per state)

Determining number of mixture components

(A) Fixed number (B) Varying number(A) Fixed number (B) Varying number……

First-phase procedure

Second-phase procedure

Figure 3.5: Acoustic model selection of VBEC: two-phase procedure.

determination requires the VB Baum-Welch algorithm and the computation of the latent variablepart of the VBEC objective function, unlike the clustering triphone HMM states in Section 3.3.Therefore, this section deals with the determination of the number of GMM components per stateby considering the latent variable effects. This is the first research to apply VB model selection toGMMs in speech recognition3, which corresponds to the first research showing the effectivenessof VB model selection in latent variable models in speech recognition [52, 53]. Then, the effec-tiveness of VB model selection in latent variable models is confirmed in [49] for the successivestate splitting algorithm, and the effectiveness of VB model selection for GMMs is re-confirmedin [54]. In general, there are two methods for determining the number of mixture components,as shown in Figure 3.5. With first method, the number of mixture components per state is thesame for all states. The objective functionFm is calculated for each number of mixture compo-nents, and the number of mixture components that maximizes the totalFm is determined as beingthe appropriate one (fixed-number GMM method). With second method, the number of mixturecomponents per state can vary by state; here, Gaussians are split and merged to increaseFm anddetermine the number of mixture components in each state (varying-number GMM method). Amodel obtained by the varying-number GMM method is expected to be more accurate than oneobtained by the fixed-number GMM method, although the varying-number GMM method requiresmore computation time.

We require the VBEC objective functions for each state to determine the number of mixturecomponents. In this case, the state alignments vary and states are expressed as GMMs. Therefore,the model includes latent variables and the componentFm

S,V cannot be disregarded, unlike thecase of triphone HMM state clustering. However, since the number of mixture components isdetermined for each state and the state alignments do not change greatly, the contribution of thestate transitions to the objective function is expected to be small, and can be ignored. Therefore,the objective functionFm for a particular statej is represented as follows:

(Fm)j = (FmΘ )j − (Fm

V )j (3.25)

3Other applications for determining the number of mixture components using VB are already proposed in [50,51].


where

(FmΘ )j = log

Γ (Lϕ0)∏L

k′=1 Γ(ϕjk′)

Γ(∑L

k′=1 ϕjk′

)Γ(ϕ0)L

+∑

k

log

(2π)−eζjkD

2

(ξ0

ξjk

)D2

+2

eηjkD

2

(Γ

(eηjk

2

))D ∣∣R0jk

∣∣ η0

2

2η0D

2

(Γ

(η0

2

))D ∣∣∣Rjk

∣∣∣ eηjk2

(3.26)

and

(FmV )j =

∑k

ζjk log wjk +∑

k

∑e,t

ζte,jk log be,jk(O

te). (3.27)

Therefore, with the fixed-number GMM method, the totalFm is obtained by summing up all states’(Fm)j, which determines the number of mixture components per state. With the varying-numberGMM method, the change of(Fm)j per state is compared after merging or splitting the Gaussians,which also determines the number of mixture components.

The number of mixture components is also automatically determined by using the BIC/MDLobjective function [47] [48]. However, the BIC/MDL objective function is based on the asymptoticcondition, and VBEC theoretically involves this function, which parallels the discussion in the lastparagraph in Section 3.3.4.

3.5 Experiments

This section deals with the two Bayesian advantages realized by VBEC experimentally. Sections3.5.1 and 3.5.2 examine prior utilization for acoustic model construction. Sections 3.5.3, 3.5.4,and 3.5.5 examine the validity of model selection by comparing the value of the VBEC objectivefunction with the recognition performance.

Experimental conditions

Two sets of speech recognition tasks were used in the experiments. The first set was an IsolatedWord Recognition (IWR) task as shown in Table 3.2 where ASJ continuous speech was used asthe training data and the JEIDA 100 city name speech corpus was used as the test data. Forthe IWR task, the model parameters were trained based on the ML approach, i.e., the IWR taskonly adopted the prior setting and the model selection for acoustic model construction to evaluatethe sole effectiveness of the VBEC model selection. The second set was a Large VocabularyContinuous Speech Recognition (LVCSR) task as shown in Table 3.3 where Japanese NewspaperArticle Sentences (JNAS) were used as training and test data. Unlike the IWR task, the LVCSRtask adopted all Bayesian acoustic model procedures (setting, training, and selection) for the modelconstruction.

3.5. EXPERIMENTS 37

Table 3.2: Experimental conditions for isolated word recognition task

Sampling rate 16 kHzQuantization 16 bitFeature vector 12-order MFCC + 12-order∆ MFCC

(24 dimensions)Window HammingFrame size/shift 25/10 msNumber of HMM states 3 (left-to-right HMM)Number of phoneme categories27

Training data ASJ continuous speech sentences3,000 sentences (male× 30)

Test data JEIDA 100 city names2,500 words (male× 25)

ASJ: Acoustical Society of JapanJEIDA: Japan Electronic Industry Development Association

Prior parameter setting

Prior parametersν0 andR0 were set from the mean and covariance statistics of the monophoneHMM state, respectively. The other parameters were set asξ0 = η0 = 0.01 andϕ0 = 2.0, asshown in Table 3.4. The tuning of the prior parameters is discussed in Section 3.5.2, and the abovesetting was used in the other experimental sections.

3.5.1 Prior utilization

Two experiments using the IWR and LVCSR tasks were conducted to compare the VBEC modelselection with the conventional BIC/MDL selection for various amounts of training data. Severalsubsets were randomly extracted from the training data set, and each of the subsets was used toconstruct a set of clustered triphone HMMs. Then, the selected models were trained by using theML Baum-Welch algorithm with IWR, and by using the VB Baum-Welch algorithm with LVCSR.All output distributions were organized by single Gaussians, so that the effect of model selectioncould be evaluated exclusively.

Figure 3.6 shows the recognition rate and Figure 3.7 shows the total number of states in a setof clustered triphone HMMs, according to the amount of training data in the IWR task. Figures3.8 and 3.9 show the same information for the LVCSR task. In the IWR task, as shown in Figure3.6, when the number of training sentences was less than 60, our method greatly outperformed theML-BIC/MDL ( λ = 1) method by as much as 50%. With such a small amount of training data,the total number of clustered states differed greatly for the VBEC and BIC/MDL model selections(Figure 3.7). Similar behavior was also confirmed from Figures 3.8 and 3.9. These suggest thatVBEC model selection determined the model structure more appropriately than ML-BIC/MDL(λ = 1) selection.

Next, in the IWR task, the penalty term of ML-BIC/MDL in Eq. (3.15) was adjusted toλ = 4


Table 3.3: Experimental conditions for LVCSR (read speech) task

Sampling rate 16 kHzQuantization 16 bitFeature vector 12 - order MFCC +∆ MFCC + Energy +∆ Energy

(26 dimensions)Window HammingFrame size/shift 25/10 msNumber of HMM states 3 (Left to right)Number of phoneme categories43

Training data JNAS 20,000 utterances, 34 hours (male)Test data JNAS 100 utterances, 1,583 words (male)Language model Standard trigram (10 years of newspapers)Vocabulary size 20,000Perplexity 64.0

JNAS: Japanese Newspaper Article Sentences

Table 3.4: Prior distribution parameters.Or,Mr andVr denote the 0th, 1st, and 2nd statistics ofa root node (monophone HMM state), respectively.

Prior parameter Setting valueϕ0 2.0ξ0 0.01η0 0.01ν0 Mean statistics in root noder (Mr/Or)R0 Variance statistics in root noder × η0 (η0(Vr/Or − ν0(ν0)′))

so that the total numbers of states and recognition rates with a small amount of data were as closeas possible to those of VBEC model selection. Nevertheless, VBEC model selection resulted inan improvement of about 2 % in the word recognition rates when the number of training sentenceswas 25∼ 1,500, as shown in Figure 3.5.1, which is an enlarged graph of Figure 3.6. This is becauseBIC/MDL (λ = 4) selected a smaller number of shared states due to the higher penalty, and themodel structure was less precise than with VBEC model selection. In fact, Figure 3.7 shows thatthere is a great difference between the numbers of states for the VBEC and BIC/MDL (λ = 4)model selections.

Similarly, in the LVCSR task, the tuning parameterλ in ML-BIC/MDL was altered fromλ = 1

to λ = 4, which tuned ML-BIC/MDL to select the total number of states that was closest to thatof VBEC and to score the highest word accuracy for small amounts of data (less than 1,000 utter-ances). Although the tuned ML-BIC/MDL improved slightly for small amounts of data (fewer than600 utterances), VBEC still scored about 10 points higher. This suggests that VBEC could clusterHMM states more appropriately than ML-BIC/MDL, i.e., VBEC selected appropriate questions inEq. (3.17), even when the resultant total numbers of clustered states were similar.

3.5. EXPERIMENTS 39

�

��

��

��

��

��

��

��

��

��

��

� ��

zwÊ¼ÅË¼Åº¼Ê

©¼ºÆ¾ÅÀËÀÆÅwÉ¸Ë¼w�|�

�¸Ð¼Ê

¤�£wW��

¤�£wW��

��

��

��

��

��

��

��

��

zwÊ¼ÅË¼Åº¼Ê


�¸Ð¼Ê

¤�£wW��

¤�£wW��

Figure 3.6: The left figure shows recognition rates according to the amounts of training data basedon VBEC, ML-BIC/MDL and tuned ML-BIC/MDL. The right figure shows an enlarged view ofthe left figure for 25∼ 1,500 utterances. The horizontal axis is scaled logarithmically.

10

1000

1 10 100 1000 10000# sentencesTotal number of clustered states BayesMDL λ=1MDL λ=4

Figure 3.7: Number of splits according to amount of training data (2∼3,000 sentences).

Consequently, VBEC enabled theautomaticselection of triphone HMM state clustering withany amount of training data, unlike the ML-BIC/MDL methods, and exhibited considerable su-periority especially with small amounts of training data. This provides experimental proof thatthere is a relationship between VBEC and ML-BIC/MDL where VBEC theoretically involvesML-BIC/MDL, and that the ML-BIC/MDL is asymptotically equivalent to VBEC, which guar-antees that VBEC is theoretically superior to ML-BIC/MDL in model selection because of theprior utilization advantage.

The small-data superiority of VBEC should be effective for acoustic model adaptation [55]and for extremely large recognition tasks where the amount of training data per acoustic modelparameter would be small because of the large speech variability.


2030405060708090

10 100 1,000 10,000 100,000

ML-BIC/MDLTuned ML-BIC/MDLVBECNumber of utterances

Word accuracy

(1 min.) (10,000 min.)(1,000 min.)(100 min.)(10 min.)82838485868788

1,000 10,000 100,000ML-BIC/MDLTuned ML-BIC/MDLVBEC

Number of utterancesWord accuracy

(10,000 min.)(1,000 min.)(100 min.)Figure 3.8: The left figure shows recognition rates according to the amounts of training data basedon VBEC, ML-BIC/MDL and tuned ML-BIC/MDL. The right figure shows an enlarged view ofthe left figure for more than 1,000 utterances The horizontal axis is scaled logarithmically.

100

1,000

10,000

10 100 1,000 10,000 100,000Number of utterancesTotal number of clustered

states

ML-BIC/MDLTuned ML-BIC/MDLVBEC

(1 min.) (10,000 min.)(1,000 min.)(100 min.)(10 min.)

Figure 3.9: Total number of clustered states according to the amounts of training data based onVBEC, ML-BIC/MDL and tuned ML-BIC/MDL. The horizontal and vertical axes are scaled log-arithmically.

3.5.2 Prior parameter dependence

The dependence of the recognition rate on prior parameter values was examined experimentally.The IWR task was used for the experiment because the amount of training data was smaller thanthat of the LVCSR task, and it is easier to examine various prior parameter values.ν0 andR0 canbe obtained from the mean vector and the covariance matrix statistics in the root node, and heuris-tics for the parameter setting were removed. However,ξ0 andη0 remain as heuristic parameters.Therefore, it is important to examine how robustly the triphone HMMs are clustered by VBECmodel selection against changes in prior parameter values with various amounts of training data.

The values of prior parametersξ0 andη0 were varied from10−5 to 10, and the word recogni-tion rates were examined in two typical cases; one where the amount of data was very small (10sentences) and another where it was fairly large (150 sentences). Tables 3.5 and 3.6 indicate therecognition rates for each combination of prior parameters. The prior parameter values for accept-able performance seem to be distributed broadly for both very small and fairly large amounts of

3.5. EXPERIMENTS 41

training data. Here, approximately 20 of the best recognition rates are highlighted in each table.The combinations of prior parameter values that yielded the best recognition rates were alike forthe two different amounts of training data. Namely, appropriate combinations of prior parametervalues can consistently achieve high performance regardless of the amount of training data.

In summary, the values of prior parameters do not greatly influence the quality of the clusteredtriphone HMMs. This suggests that it is not necessary to be very careful when selecting the valuesof prior parameters when the VBEC objective function is applied to speech recognition.

Table 3.5: Recognition rates in each prior distribution parameter. The model was trained usingdata consisting of 10 sentences.

ξ0 η0

101 100 10−1 10−2 10−3 10−4 10−5

101 1.0 1.0 47.2 66.8 65.6 66.3 66.1100 2.0 1.0 66.3 65.9 66.5 66.1 65.510−1 1.5 2.2 65.9 66.2 66.7 66.1 66.510−2 1.7 31.2 66.1 66.5 66.3 65.5 64.610−3 1.7 60.3 66.2 66.7 66.1 65.5 64.610−4 1.7 66.5 66.6 66.3 65.5 64.6 64.610−5 1.7 66.1 66.7 66.1 65.5 64.6 64.6

Table 3.6: Recognition rates in each prior distribution parameter. The model was trained by usingdata consisting of 150 sentences

ξ0 η0

101 100 10−1 10−2 10−3 10−4 10−5

101 14.7 23.3 94.7 94.0 94.0 93.3 92.1100 17.2 22.0 93.5 94.0 93.1 92.3 92.210−1 17.3 49.3 94.3 93.9 93.3 92.5 92.410−2 17.5 83.5 94.4 93.2 92.3 92.3 92.210−3 18.3 92.5 93.8 93.3 92.5 92.4 92.210−4 21.3 94.1 93.2 92.3 92.3 92.2 92.310−5 23.4 94.0 93.4 92.5 92.4 92.2 92.3

3.5.3 Model selection for HMM states

Since VBEC selects the model that gives the highest value of the VBEC objective functionFm,the validity of the model selection can be evaluated by examining the relationship betweenFm

and recognition performance. The validity was examined with an IWR task (in Figure 3.10) andLVCSR tasks (in Figure 3.11). Figure 3.10 (a) plots the recognition rates and the summation ofFm for all phonemes (hereafter,Fm is used to indicate the summation ofFm for all phonemes) forseveral sets of clustered-state triphone HMMs in the IWR task. Here, a set of HMMs was obtainedby controlling the maximum tree depth uniformly for all trees without arresting the splitting, and all


��

��

��

��

��

��

��

��

��

� ��

zwÊË¸Ë¼Ê


©¼ºÆ¾ÅÀËÀÆÅ

É¸Ë¼

�

Ä

(a) VBEC

��

��

��

��

��

��

��

��

��

� ��

zwÊË¸Ë¼Ê


©¼ºÆ¾ÅÀËÀÆÅ

É¸Ë¼

£ÀÂ¼ÃÀ¿ÆÆ»

(b) ML

Figure 3.10: Objective functions and recognition rates according to the number of clustered states.

output pdfs were organized by single Gaussians, so that the effect of clustering could be evaluatedexclusively. The results clearly showed thatFm and word accuracy behaved very similarly i.e. bothcontinued to increase until they reached their peaks at around 1,500 states and then decreased. Thesame type of examination was carried out for log-likelihood and recognition rate (Figure 3.10 (b)).The log-likelihood continued to increase monotonically while the word accuracy decreased afterreaching its peak at around 1,500 states, and the log-likelihood could not provide any informationfor the automatic selection of an appropriate structure. This was due to the nature of the likelihoodmeasure i.e. the more complicated the model structure is, the higher the likelihood becomes.Similar behavior was obtained using the LVCSR task (Figure 3.11). These results indicate thatthe VBEC objective function is valid for model selection as regards clustering the triphone HMMstates.

3.5.4 Model selection for Gaussian mixtures

The above experiments show the effectiveness of triphone HMM state clustering by using VBEC.The next experiment shows the effectiveness of another aspect of acoustic model selection, i.e., thedetermination of the number of mixture components by using VBEC. In the following sections,only the LVCSR task was used for the experiments because the determination of the number ofmixture components requires the VB Baum-Welch algorithm, as discussed in Section 3.4, whichwas only used in the LVCSR task. In VBEC, the number of mixture components is determined sothatFm has the highest value, similar to the triphone HMM state clustering described in Section3.5.3. Therefore, it is important to examine the relationship betweenFm and recognition perfor-

3.5. EXPERIMENTS 43

657075808590

0 10,000 20,000

Word accuracy

Word Accuracy F_m

Total number of clustered states(a) VBEC

657075808590

0 10,000 20,000

Word accuracy

Word Accuracy Log-likelihood

Total number of clustered states(b) ML

Figure 3.11: Objective functions and word accuracies according to the increase in the number oftotal clustered triphone HMM states.

mance for each number of mixture components. 30 sets of GMMs were prepared with the sameclustered-state structure obtained using the all training data in Section 3.5.1. Figure 3.12 showsthe word accuracy and the objective functionFm for each number of mixture components. Inthis experiment, the number of mixture components was the same for all clustered states (fixed-number GMM method). From Figure 3.12, the behavior ofFm and the word accuracy had similarcontours, i.e., from 1 to 9 Gaussians, both increased gradually as the number of mixture com-ponents increased, and, at 9 Gaussians, both peaks were identical, and finally, for more than 9Gaussians, both decreased gradually as the number of mixture components decreased due to theover-training effects. Therefore, VBEC could also select an appropriate model structureautomati-cally for GMMs and the selected model scored a word accuracy of 91.1 points, which is sufficientfor practical use.

The varying-number GMM method is a promising approach to more accurate modeling, andtakes full advantage of automatic selection by using the VBEC objective function. Namely, itis almost impossible to obtain manually an appropriate combination of varying-number GMMs,each of which can have a different number of components. The varying-number GMM method wasapplied to the same clustered-state triphone HMMs. Then, the total number of mixture componentswas 20,226 while the total number was 39,725 for the fixed-number GMM, and the word accuracyimproved 0.4 point to 91.5. The varying-number GMM, thus, improved the performance with asmaller total number of mixture components.


86878889909192

0 10 20 30Word accuracyFm

Number of mixture components per stateWord accuracy

Figure 3.12: Total objective functionFm and word accuracy according to the increase in thenumber of mixture components per state.

Table 3.7: Word accuracies for total numbers of clustered states and Gaussians per state. Thecontour graph on the right is obtained from these results. The recognition result obtained with thebestmanualtuning with ML was 92.0 and that obtainedautomaticallywith VBEC was 91.1.

# G # S129 500 1000 2000 3000 4000 6000 8000

40 79.7 90.9 91.0 91.5 89.6 88.7 85.0 81.530 78.6 90.6 91.7 92.0 91.4 90.1 86.5 84.520 77.3 90.4 91.7 91.9 91.4 91.6 88.5 86.616 75.4 90.0 90.8 91.7 91.8 91.2 88.1 87.712 74.7 89.5 91.3 91.5 91.5 90.4 90.3 89.08 71.4 88.4 91.2 90.7 90.8 90.9 90.1 89.64 67.2 87.7 88.9 91.2 90.6 90.4 90.7 89.41 49.9 82.1 84.3 85.5 84.8 86.0 85.7 85.5

# G : number of mixture components per state

# S : total number of clustered states

90 92

2000 4000 6000 8000510152025303540

Total number of clustered statesNumber of mixture components per state

First-phase procedure Second-phase procedure

92.0

VBEC91.191.5

90.0

90.0

91.591.5

90.0

3.5.5 Model selection over HMM states and Gaussian mixtures

Sections 3.5.3 and 3.5.4 confirmed that VBEC could construct appropriate clustered-state triphoneHMMs with GMMs. The interest then turned to how closely an automatically selected VBECmodel over clustered states and Gaussian mixtures would approach the best manually obtainedmodel provided by the ML framework. Here, the automatic selection was carried out with a two-phase procedure; triphone HMM states were clustered and then the number of mixture componentsper state were determined. Although, this two-phase procedure might select a local optimumstructure since the model structure was only optimized at each phase, this was a realistic solutionfor selecting a model structure in a practical computation time.

The ML framework examined word accuracies for several combinations of each state clusterand mixture size and obtained results with good accuracies as summarized in Table 3.7. Compar-ative results corresponding to Table 3.7 are shown on a contour map. From this experiment, this

3.6. SUMMARY 45

map divides the word accuracy into black, gray and white areas scoring less than 90 points, from90 to 91.5 points and more than 91.5 points, respectively. Models that score more than 90.0 pointswould be regarded as sufficient for practical use in this task. The selected VBEC model (5,675states and 9 Gaussians) scored 91.1 points and this score was within the high performance area(more than 90.0 points). Even if we include the selected model structure and other model struc-tures that scored within the 5-bestFm values in Figure 3.12, the average accuracy was 90.5 i.e., itwas also within the high performance area. These results confirmed that the model structures withhighFm values provided high levels of performance compared with the ML results.

However, the model structure that VBEC selected (5,675 states and 9 Gaussians per state)could not match the best manually obtained model (2,000 states and 30 Gaussians per state), andthe VBEC model could not match the highest accuracy (92.0). The main reason for this inferioritywas the two-phase model selection procedure, as shown in Table 3.7. In this case, in the first phase,the selected model was appropriate for a single Gaussian per state, but not for multiple Gaussiansper state, and it had too many states, which caused the degradation in performance. Therefore,triphone HMM state clustering with multiple Gaussians is required to select the optimum modelstructure.

Consequently, although the word accuracy did not reach the highest value obtained withmanual

tuning, theautomatically selected VBEC model could provide a satisfactory performance forpractical use for the LVCSR task. This suggests the successful construction of a Bayesian acousticmodel, which includes model setting, training, and selection by using VBEC.

3.6 Summary

This chapter introduced the implementation of VBEC for acoustic model construction. VBECincludes prior utilization and model selection, which can automatically select an appropriate modelstructure in clustered-state HMMs and in GMMs according to the VBEC objective function withany amount of training data.

In particular, when the amounts of training data were small, the VBEC model significantlyoutperformed the ML-BIC/MDL model, and as the amount of training data increased, the perfor-mance determined by VBEC and ML-BIC/MDL converged. This superiority of VBEC is based onthe prior utilization advantage over the BIC/MDL criterion. Furthermore, VBEC could determineappropriate sizes for Gaussian mixture models, as well as the clustered triphone HMM structure.Thus, VBEC totally mitigates the effect of over-training, which may occur when the amount oftraining data per Gaussian is small, by utilizing prior information and by selecting an appropriatemodel structure.

There is room for further improvement in selecting the model structure. The two-phase modelselection procedure employed in the experiments, i.e., first clustering triphone HMM states andthen determining the number of mixture components per state, could only locally optimize themodel structure. The VBEC performance is expected to improve if the selection procedure isimproved so as to globally optimize the model structure. Therefore, the next chapter considerssuch a model optimization for acoustic models.

Chapter 4

Determination of acoustic model topology

4.1 Introduction

The acoustic model has a very complicated structure: a category is expressed by a set of clustered-state triphone Hidden Markov Models (HMMs) that possess an output distribution representedby a Gaussian Mixture Model (GMM). Therefore, onlyexpertswho well understand this compli-cated model structure (model topology) can design the models. Although certain algorithms havebeen proposed for dealing with the model topology [38–41,56], these algorithms require heuristictuning since they are based on the Maximum Likelihood (ML) criterion, which cannot determinethe model topology because ML increases monotonically as the number of model parameters in-creases. If we are to eliminate the need for heuristic tuning we must find a way to determine theacoustic model topology automatically.

Some partially successful approaches to the automatic determination of the acoustic modeltopology have been reported that use such information criteria as the Minimum Description Lengthor Bayesian Information Criterion (BIC/MDL). However, the BIC/MDL approach cannot theoret-ically determine the total acoustic model topology since the acoustic model includes latent vari-ables. Therefore, these approaches only determine the model by simplifying the model topologydetermination to a single Gaussian clustering in a model with the constraint that the acoustic modelhas no latent variables [21,45–47].

VBEC can theoretically determine a complicated model topology by using the VBEC objectivefunction, which corresponds to the VB posterior for a model structure, even when latent variablesare included. In the previous chapter, automatic determination using VBEC was confirmed byclustering triphone HMM states with a single Gaussian model, and then determining the number ofcomponents per state fixing the clustered state topology, where the latent variables exist, as shownby the dashed lines and boxes in Figure 4.1. This procedure is called VBEC 2-phase search orsimply 2-phase search in this chapter. Although this procedure is capable of determining the modeltopology within a practical computation time, the determined topology is only locally optimizedat each phase and the obtained performance is not the best.

The goal of this chapter is to reach an optimum topology area without falling into a local op-timum area using VBEC. Two characteristics of acoustic model topology are utilized to reach this

47

48 CHAPTER 4. DETERMINATION OF ACOUSTIC MODEL TOPOLOGY

Number of clustered states

VBEC 2-phase search

1

Optimum areaLocal optimum area

Clustering triphone HMM statewith a single Gaussian model

Increasing the number of componentswhile fixing the clustered state structure

Number of components per state

First phase Second phase

Figure 4.1: Distributional sketch of the acoustic model topology.

goal. The first is that the appropriate model would be distributed in a band where the total numberof model parameters (≈ the total number of Gaussians) is almost constant because the amount ofdata is fixed, as shown in the inversely proportional band in Figure 4.1. The second characteristicis that an optimum model topology area would be in the band and nearly unimodal, as shown inFigure 4.1. The characteristics of the acoustic model were experimentally confirmed in [31] byusing an isolated word speech recognition task. Therefore, by constructing a number of acousticmodels in the band, and then selecting the most appropriate of the in-band models, namely theone that maximizes the VBEC objective function, we can determine an optimum model topologyefficiently. This search algorithm is called in-band model search. To obtain in-band models, GMM-based HMM state clustering is employed using the phonetic decision tree method. Although theconstruction of the GMM-based decision tree is also automatically determined within an originalVBEC framework, as described in Chapter 2, the construction requires an unrealistic number ofcomputations because the VBEC objective function is obtained by a VB iterative calculation usingall frames of data for each clustering. To reduce the number of computations to a practical level,this chapter proposes new approaches for realizing the GMM-based decision tree method within aVBEC framework by utilizing monophone HMM state statistics as priors.

4.2 Determination of acoustic model topology using VBEC

4.2.1 Strategy for reaching optimum model topology

This section describes how to determine the acoustic model topology automatically by usingVBEC. In acoustic modeling, the specifications of the model topology are often represented bythe number of clustered states and the number of GMM components per state, as shown in Figure

4.2. DETERMINATION OF ACOUSTIC MODEL TOPOLOGY USING VBEC 49


State-first approachMixture-first approach

1

Optimum areaLocal optimum area

Triphone HMM state clusteringwith a single Gaussian model

Number of components per state

In-band model searchTriphone HMM states clustering with a GMM

Figure 4.2: Optimum model search for an acoustic model.

4.2. Then, the good models that provide good performance would be distributed in the inverse-proportion band where the total number of distribution parameters (approximately equal to thetotal number of Gaussians) is constant because the amount of data is fixed. Moreover, there wouldbe a unimodal optimum area in the band where the model topologies are represented by an ap-propriate number of pairs of clustered states and components, as shown in Figure 4.2. In orderto realize an optimum model topology, two characteristics of the acoustic model are utilized, theinverse-proportion band and the unimodality. By preparing a number of acoustic models in theband, and by choosing the model that has the best VBEC objective function score, an optimummodel topology can be determined, as shown in Figure 4.2 (in-band model search).

There are at least two conceivable approaches for constructing in-band models: one involvesincreasing the number of mixture components from single Gaussian based triphone models, andthe other involves increasing the number of clustered triphone HMM states from GMM basedmonophone models. The former is obviously an extension of the 2-phase search described inChapter 3 and Section 4.1. Whereas the original 2-phase search determines only the one stateclustering topology that has the best VBEC objective function (Fm) score in its first phase, theextended approach retains a number of single Gaussian based triphone models as candidates fora globally optimum topology. Then, the number of mixture components is increased for eachcandidate, so that each of the triphone models reaches the inverse-proportion band at the bestFm

(state-first approach). This produces a number of in-band models of several topologies (see thedashed arrows in Figure 4.2). The latter approach proceeds in a way that is symmetrical withrespect to the state-first approach. Namely, it prepares a number of GMM based monophonemodels as candidates with several numbers of mixture components in its first phase. Then, thenumber of clustered states is increased for each candidate, so that each of the triphone modelsreaches the inverse-proportion band at the bestFm (mixture-first approach). This also can produce


a number of in-band models (see the solid arrows in Figure 4.2) in the same way as the state-firstapproach.

Here, we note the potential of the mixture-first approach for obtaining accurate state clusters,which comes from a precise representation of output distributions not by single Gaussians but byGMMs during clustering, i.e., GMM based state clustering. Even if two acoustic models producedseparately by the state-first approach and the mixture-first approach have the same quantitativespecifications as regards the numbers of states and mixture components, triphone HMM statesmight be clustered differently by the two approaches due to the difference in the representation ofthe output distributions. In general, the mixture-first approach is advantageous for accurate clus-tering because the precise representation of the output distribution is expected to achieve properdecisions in the clustering process [56]. Accordingly, this chapter employs the mixture-first ap-proach to construct the in-band models.

The original VBEC framework, as formulated in Chapter 2, already involves the theoreticalrealization of GMM based state clustering. However, the straightforward implementation of therealization requires an impractical computation time. Therefore, a key technique for the construc-tion of accurate in-band models involves reducing the computation time needed for GMM basedstate clustering to a practical level.

4.2.2 HMM state clustering based on Gaussian mixture model

This subsection first describes the phonetic decision tree method for clustering context-dependentHMM states, as explained in Section 3.3. The success of the phonetic decision tree is due largelyto the efficient binary tree search under constraints of phonetic knowledge based questions, andbecause of the following three additional constraints:

(C1) Data alignments for each state are fixed during the splitting process.

(C2) The output distribution in a state is represented by a normal distribution.

(C3) The contribution of the state transition to the objective functions is disregarded.

These constraints are already introduced in Section 3.3. Two constraints (C1) and (C2) play arole in eliminating the latent variables involved in an acoustic model. Therefore, under theseconstraints, the 0-th, 1st and 2nd statistics of noden (On, Mn andVn, respectively) are computedsimply by summing up the sufficient statistics of statej (Oj, Mj andVj) as shown in Eq. (3.7),wherej represents a non-clustered triphone HMM state, and is included in noden. Therefore,once we have prepared statisticsOj, Mj andVj for all possible triphone HMM states by usingEq. (3.8), we can easily calculate the statistics for each node, and this reduces the computationtime to a practical level.

In contrast, the goal of GMM based state clustering is to obtain even more accurate state clus-ters at the expense of losing the benefits of constraint (C2). As a result, the splitting process hasto proceed with the latent variables that remain in the model. Consequently, we require the GMMstatistics for eachk componentOn

k , Mnk andVn

k , which are obtained by the VB iterative cal-culation described in Sections 2.3.2 and 2.3.3, to calculate VB posteriors and the gain of VBEC


YesPhoneticquestionNo

Parent node

ChildnodesFigure 4.3: Estimation of inheritable GMM statistics during the splitting process.

objective function∆FQ(n) to examine all possible combinations of noden and questionQ. It is in-evitable that the overall computation time needed to construct phonetic decision trees will becomehuge and impractical.

4.2.3 Estimation of inheritable node statistics

In order to avoid VB iterative calculation for the GMM statisticsOnk , Mn

k andVnk during the

splitting process, the ratio of the n-th order statistics for componentk is assumed to be conserved.That is to say, the ratio of the 0-th order statisticsOn

k for a noden is related to the ratios of the 0-thorder statistics of its yes-nodenY

Q and no-nodenNQ for a questionQ as follows:

OnYQ

k∑k′ O

nYQ

k′

=OnN

Q

k∑k′ O

nNQ

k′

=On

k∑k′ On

k′. (4.1)

Employing the relation (4.1) for the upper node in a phonetic tree successively, the assumptionyields the fact that the ratio at each node is equivalent to the ratioOr

k/∑

k′ Ork′ (≡ wr

k: weightingfactor ) at the root node statistics in the tree (i.e., the ratio of the monophone HMM state statistics)as follows:

Onk∑

k′ Onk′

=Or

k∑k′ Or

k′≡ wr

k for any noden, (4.2)

where the suffixr indicates a root node. Therefore, from Eq. (4.2), the 0-th order statisticsOnk of

componentk in noden is estimated as follows:

Onk = wr

k

∑k′

Onk′ = wr

kOn. (4.3)


This approach is based on the knowledge that the monophone state statistics are phonetically sim-ilar to the clustered state statistics. Similarly, the 1st and 2nd order statisticsζn

k of componentk innoden are estimated as follows:

Mnk = wr

kMn

Vnk = wr

kVn. (4.4)

Thus, we can estimate the GMM statistics of each nodeOnk , Mn

k andVnk without using the VB

iterative calculation, but using thek component ratio of the monophone statisticswrk. We call this

the estimation of inheritable node statistics because the ratiowrk is passed from a parent node to

child nodes, as shown in Figure 4.3. Consequently, VB posteriors and VBEC objective functioncan also be calculated without using a VB iterative calculation during the splitting process. Theconcrete form of the parameters for VB posteriorsΦ is derived by substituting Eqs. (4.3) and (4.4)into the general solution Eq. (2.30), as follows:

Φnk :

ϕnk = ϕ0 + wr

kOn

ξnk = ξ0 + wr

kOn

ηnk = η0 + wr

kOn

νnk =

ξ0νn,0k +wr

kMn

ξ0+wrkOn

Rn = diag[Rn,0 + Vn − 1

On Mn(Mn)′ + ξn,0On

ξn,0+On

(Mn

On − νn,0) (Mn

On − νn,0)′]

. (4.5)

This equation omits theφn part because the contribution of the state transition to the objectivefunctions is disregarded (condition (C3)). For a similar reason, state suffixj is reduced. In addition,this chapter assumesϕ0

k, ξ0k, andη0

k to be constants for anyk, i.e.,{ϕ0k, ξ

0k, η

0k} → {ϕ0, ξ0, η0}

Based on the above VB posteriorsΦn ≡ {Φnk |k = 1, ..., L} and the constraints (C1), (C2), and

(C3),∆FQ(n) is derived from Eqs. (2.39), (2.41), and (2.42) as follows:

∆FQ(n) = f(ΦnYQ) + f(ΦnN

Q ) − f(Φn) − f(Φn,0) −∑

k

wrk log wr

k, (4.6)

where

f(Φ) ≡ − log Γ(∑

kϕk

)+

∑k

log Γ(ϕk) −D

2log ξk −

ηk

2log |Rk| + D log Γ

(ηk

2

). (4.7)

To calculate∆FQ(n), we must estimatewrk and set the prior parametersΦ0 appropriately.

4.2.4 Monophone HMM statistics estimation

MMIXTURE

In this section, weighting factorwrk of componentk in root noder is estimated from monophone

HMM statistics. wrk is needed to estimate the inheritable GMM statistics of each node and to

calculate the gain of VBEC objective function∆FQ(n). The GMM statistics of monophone HMM


Ork, Mr

k andVrk can be obtained by the VB iterative calculation without much computation, as

follows: Or

k =∑

e,t ζte,j=r,k

Mrk =

∑e,t ζ

te,j=r,kO

te

Vrk =

∑e,t ζ

te,j=r,kO

te(O

te)

′. (4.8)

whereζte,j is an occupation probability obtained by the forward-backward or Viterbi algorithm

within the VB or ML framework. Therefore,wrk is estimated byOr

k/∑

k′ Ork′. Moreover,Mr

k andVr

k are used to set the prior parametersνn,0k andRn,0

k . Then,wrk, νn,0

k andRn,0k are represented by

Ork, Mr

k andVrk as follows:

wrk =

Ork

P

k′ Ork′

νn,0k =

Mrk

Ork

Rn,0k = diag

[η0

(Vr

k

Ork− νn,0

k

(νn,0

k

)′)] . (4.9)

This approach utilizes the Gaussian mixturestatistics of monophone HMM to obtainOrk, Mr

k andVr

k , so this approach is called MMIXTURE.

MSINGLE

This section introduces another approach for obtainingwrk, νn,0

k andRn,0k more easily, which was

first proposed in [31]. This approach assumes thatwrk is the same for all the components in an

L-component GMM, and is represented bywrk = 1/L, instead of calculating the GMM statistics

of monophone HMM. In addition, single Gaussian statistics of monophone HMM are employed tosetνn,0

k andRn,0k . These are easily computed by summing up the sufficient statisticsOj, Mj and

Vj for all triphone HMM states. Then,wrk, νn,0

k andRn,0k are represented as follows:

wrk = 1

L

νn,0k =

P

j∈r OjMjP

j∈r Oj

Rn,0k = diag

[η0

(P

j∈r OjVjP

j∈r Oj− νn,0

k

(νn,0

k

)′)] . (4.10)

This approach utilizes only singleGaussian statistics of monophone HMM, so this approach iscalled MSINGLE. It is easier to realize MSINGLE than MMIXTURE because the former doesnot require the preparation of the GMM statistics of monophone HMM. However, because of therough estimation, the MSINGLE approach would be less accurate than MMIXTURE.

Thus, we can construct a number of in-band model topologies by using MMIXTURE or MS-INGLE, i.e., we realize the solid arrows seen in Figure 4.2. Finally, in order to determine anappropriate model from the in-band models (in-band model search), the exact VBEC objectivefunction is calculated by dropping the constraints (C1), (C2), and (C3) in Section 4.2.2 and the in-heritable statistics assumption in Section 4.2.3. Then, the calculation is performed by VB iterationas described in Sections 2.3 and 3.2, unlike the non-iterative approximation described in Eqs. (4.6)and (4.7).


WACC.88 91.1

2.5 3 3.5#S00.51

1.52

#G


Num

ber

of c

ompo

nent

s pe

r st

ate

129 500 1000 2000 3000 4000 6000 8000129 500 1000 2000 3000 4000 6000 8000

1

4

8

12

16

20

30

40

50

100

1

4

8

12

16

20

30

40

50

100WACC

92 9091 89 88WACC

92 9091 89 88

1st-phase (#S = 7790)

2nd-phase (#G = 8)

{7790, 8}{7790, 8}

1st-phase (#S = 7790)

2nd-phase (#G = 8)

{7790, 8}{7790, 8}

{1000, 30}{1000, 30}

(a) Test1

WACC.88 91.1

2.5 3 3.5#S00.51

1.52

#G


Num

ber

of c

ompo

nent

s pe

r st

ate

129 500 1000 2000 3000 4000 6000 8000129 500 1000 2000 3000 4000 6000 8000

1

4

8

12

16

20

30

40

50

100

1

4

8

12

16

20

30

40

50

100WACC

92 9091 89 88WACC

92 9091 89 88

{2000, 40}{2000, 40}

1st-phase (#S = 7790)

2nd-phase (#G = 8)

{7790, 8}{7790, 8}

1st-phase (#S = 7790)

2nd-phase (#G = 8)

{7790, 8}{7790, 8}

(b) Test2

Figure 4.4: Model evaluation test using Test1 (a) and Test2 (b). The contour maps denote wordaccuracy distributions for the total number of clustered states and the number of components perstate. The horizontal and vertical axes are scaled logarithmically.

As regards the setting of the prior parameters, we follow the statistics based setting forν0 andR0 as described in this section, and the constant value setting for the remaining parametersϕ0,ξ0 andη0. Both settings are widely used in the Bayesian community (e.g., [15, 24–26]), and theireffectiveness has also already been confirmed by the speech recognition experiments described inChapter 3.

4.3 Preliminary experiments

4.3.1 Maximum likelihood manual construction

The effectiveness of the proposed method is proved in Section 4.4. First, this section describes pre-liminary experiments that were conducted to show the way in which the performance is distributedover the numbers of states and components per state to confirm the two acoustic model character-istics “inverse-proportional band” and “unimodality” described in Section 4.2.1. The recognitionperformance of conventional ML-based acoustic models with manually varied model topologies isalso examined to provide baselines with which to compare the performance of the automaticallydetermined model topology. Several topologies were produced under manually varied conditionsas regards the number of states, i.e., the sizes of the phonetic decision trees, and the number ofcomponents per state. A total of10 (1, 4, 8, 12, 16, 20, 30, 40, 50 and 100 components)× 8(129 1, 500, 1,000, 2,000, 3,000, 4,000, 6,000, 8,000 clustered states)= 80 acoustic models was

1The number of monophone HMM states.

4.3. PRELIMINARY EXPERIMENTS 55

obtained. In order to examine topologies with widely distributed specifications, the numbers ofclustered states and components were arranged at irregular intervals, i.e., wider intervals betweenthe larger numbers. Then, the examined points along the inversely proportional band were locatedat nearly regular intervals, so that the search for an appropriate model topology would be carriedout evenly over the band. The experimental conditions are summarized in Table 4.1. The trainingdata consisted of about 20,000 Japanese sentences (34 hours) spoken by 30 males, and two testsets (Test1 and Test2) were prepared from Japanese Newspaper Article Sentences (JNAS)2. Onewas used as a performance criterion for obtaining an appropriate model from the various models(model evaluation test) and the other was used to measure the performance of the obtained model(performance test). By exchanging the roles of the two test sets, two sets of results were obtained,and these were utilized to support the certainty of the discussion. The test sets each consistedof 100 Japanese sentences spoken by 10 males and taken from JNAS (a total of 1,898 and 1,897words, respectively), as shown in Table 4.1.

Figure 4.4 (a) and (b), respectively, are contour maps that show the results of model evaluationtests for the examined word accuracy (WACC) obtained using Test1 and Test2, where the hori-zontal and vertical axes are scaled logarithmically. We can see a high performance area along anegative slope band in both maps (an inversely proportional band in the linear scale maps). Theband satisfied the relationship whereby the product of the numbers of states and components perstate ranged approximately from104 to 105. Therefore, the results confirmed the acoustic modelcharacteristic namely that a high performance area was distributed in the inverse-proportion bandwhere the total number of distribution parameters (approximately equal to the total number ofGaussians) is constant.

Next, we focus on the other characteristic, namely the unimodality in the band. The top scoreswere 91.1 and 91.6 WACC for the model evaluation test using Test1 and Test2, respectively, wherethe numbers of states and components per state were{1,000, 30} and{2,000, 40}, respectively({ · , · } denotes the model topology by{the number of clustered states, the number of compo-nents per state}). From Figure 4.4, we can see that high performance areas were distributed acrossthe regions around the top scoring topologies, and the unimodality of the performance distribu-tions were confirmed experimentally. Thus, the two characteristics of the acoustic model wereconfirmed, which indicates the feasibility of the proposed in-band model search.

Finally, a performance test was undertaken, and 91.0 and 91.4 WACC were obtained by rec-ognizing Test1 and Test2 data using the Test2-obtained model{2,000, 40} and Test1-obtainedmodel{1,000, 30}, respectively. Since both Test1 and Test2 scored more than 91.0 points whenusingmanuallyobtained models, our goal is to reach a score of more than 91.0 points when usingautomaticallydetermined models.

2Although this task is almost the same as the LVCSR task described in Chapter 3, the test sets are different, whichmakes the experimental results slightly different, e.g., recognition performance.


Table 4.1: Experimental conditions for LVCSR (read speech) task

Sampling rate 16 kHzQuantization 16-bitFeature vector 12 - order MFCC +∆ MFCC + Energy +∆ Energy

(26 dimensions)Window HammingFrame size/shift 25/10 msNumber of HMM states 3 (Left to right)Number of phoneme categories43

Training data JNAS: 20,000 utterances, 34 hours (122 males)Test1 data JNAS: 100 utterances, 1,898 words (10 males)Test2 data JNAS: 100 utterances, 1,897 words (10 males)Language model Standard trigram (10 years of Japanese newspapers)Vocabulary size 20,000Perplexity Test1: 94.8 (OOV rate: 2.1 %), Test2: 96.3 (OOV rate: 2.4 %)JNAS: Japanese Newspaper Article Sentences, OOV: Out Of Vocabulary words

4.3.2 VBEC automatic construction based on 2-phase search

This subsection demonstrates another automatic determination method, which does not require amodel evaluation test using test data. The method is based on the conventional 2-phase search,which is the same procedure as that demonstrated in Section 3.5.5. The 2-phase search was pro-posed as a way to determine acoustic models within a practical computation time by clusteringtriphone HMM states with a single Gaussian model, and then determining the number of compo-nents per state, as described in Chapter 3.

Prior parametersν0 andR0/η0 were set from the single Gaussian statistics in the monophoneHMM state as prior parametersΘ0 in a 2-phase search because the monophone HMM states havea sufficient amount of data, and we can estimate their statistics accurately without worrying aboutover-training. For the other parameter setting, the conventional Bayesian approach ofφ0 = ϕ0 =

2.0 for the Dirichlet distribution andξ0 = η0 = 0.1 for the Normal Gamma distribution wereemployed. It is noted that the dependences ofξ0, η0, φ0 andϕ0 are not particularly sensitive interms of performance as regards the acoustic model construction, as discussed in Section 3.5.2.These settings of the prior parametersξ0, η0, φ0 andϕ0 were used for all the VBEC experiments,and finally, Section 4.4.3 examines the dependence of the prior parameter settings. In addition,here and below in a VBEC framework, the conventional ML-based decoding for recognition wasemployed rather than the Bayesian Predictive Classification (BPC) based decoding of VBEC, VB-BPC. This made it possible to evaluate the pure effect of the automatic determination of the modeltopology using the proposed determination technique.

The triphone states were automatically clustered and there were 7,790 clustered states, whichhad the best VBEC objective functionFm in the 1st-phase procedure. Then, 10 acoustic modelswere made by using VBEC where each model had a fixed clustered state topology and 1, 4, 8,12, 16, 20, 30, 40, 50 and 100 components. The acoustic model was finally determined as{7790,

4.4. EXPERIMENTS 57

8 } , which had the best VBEC objective functionFm in the 2nd-phase procedure automatically.The dashed line in Figure 4.4 overlaps the same model topology determined by the 2-phase searchin the two ML contour maps. The determined model topology{7,790, 8} scored 88.9 points inTest1 and 89.2 points in Test2. The model topology{7,790, 8} determined by the 2-phase searchcould not match the best manually obtained models ({# S, # G} = {1,000, 30}, {2,000, 40}), andthe VBEC model could not reach the performance goal obtained with the ML-manual approach(91.0 points). The main reason for this inferiority was the local optimality of the model topologyin the 2-phase search. That is to say, in the first phase of state clustering, the selected model wasappropriate for a single Gaussian per state, but had too many states for multiple Gaussians per state.In short, the model topology determined by a 2-phase search was a local optimum, and this causedthe degradation in performance. Optimum performance cannot be determined by the conventional2-phase search procedure, as found with the results described in Section 3.5.5.

4.4 Experiments

This section describes experiments conducted to prove the effectiveness of the proposals. There arethree subsections. First, Section 4.4.1 confirms the automatic determination of the model topologyusing the proposed approach and compares its performance with conventional approaches in largevocabulary continuous speech recognition (LVCSR) tasks. Section 4.4.2 describes experimentsdesigned to compare the computation time needed for the proposed approach, an ML-manual ap-proach, a 2-phase search and the straightforward method of GMM phonetic tree clustering with theVB iterative calculation. Section 4.4.3 examines the prior parameter dependence of the proposalsand discusses the difference between the proposals of MSINGLE and MMIXTURE. The secondand third experiments undertake relatively small tasks using isolated word recognition in order toexamine the topology search as properly as in the previous experiments. This is despite the fact thatthe straightforward method of GMM phonetic tree clustering requires a huge computation time inSection 4.4.2 and the examination of prior parameter dependence requires extra search spaces forprior parameter setting in Section 4.4.3.

4.4.1 Determination of acoustic model topology using VBEC

This subsection conducts experiments to prove the effectiveness of the proposed procedure, namelyan in-band model search using GMM state clustering, which is determined by VBEC. The exper-imental conditions were the same as those described in Section 4.3. The prior parametersν0,R0/η0 andϕ0 were set in accordance with the MSINGLE and MMIXTURE algorithms describedin Section 4.2.4, and the other prior parameters were set in the same way as described in Section4.3.2.

First, the in-band models were constructed using the GMM decision tree clustering proposedin Section 4.2.2, 4.2.3 and 4.2.4 and it was examined whether the constructed model topologieswere in band. The two proposed clustering algorithms MSINGLE and MMIXTURE were used toproduce a set of clustered-state triphone HMMs, which made a total of 10 sets of clustered-state


WACC.88 91.1

2.5 3 3.5#S00.51

1.52

#G


Num

ber

of c

ompo

nent

s pe

r st

ate

129 500 1000 2000 3000 4000 6000 8000129 500 1000 2000 3000 4000 6000 8000

1

4

8

12

16

20

30

40

50

100

1

4

8

12

16

20

30

40

50

100

86.6

89.589.5

90.990.9

90.590.591.491.4

91.891.8

91.091.0

91.291.2

90.990.9

WACC92 9091 89 88

WACC92 9091 89 88

90.190.1

(a) Test1

WACC.88 91.1

2.5 3 3.5#S00.51

1.52

#G


Num

ber

of c

ompo

nent

s pe

r st

ate

129 500 1000 2000 3000 4000 6000 8000129 500 1000 2000 3000 4000 6000 8000

1

4

8

12

16

20

30

40

50

100

1

4

8

12

16

20

30

40

50

100

87.6

89.689.6

90.790.7

91.391.391.591.5

91.791.7

91.691.6

91.791.7

91.491.4

WACC92 9091 89 88

WACC92 9091 89 88

90.890.8

(b) Test2

Figure 4.5: Determined model topologies and their recognition rates (MSINGLE). The horizontaland vertical axes are scaled logarithmically.

WACC.88 91.1

2.5 3 3.5#S00.51

1.52

#G


Num

ber

of c

ompo

nent

s pe

r st

ate

129 500 1000 2000 3000 4000 6000 8000129 500 1000 2000 3000 4000 6000 8000

1

4

8

12

16

20

30

40

50

100

1

4

8

12

16

20

30

40

50

100

86.686.6

90.390.390.390.3

90.790.790.790.7

90.090.090.090.090.790.790.790.7

90.690.690.690.6

90.990.990.990.9

91.491.491.491.4

91.391.391.391.3

WACC.92 9091 89 88

90.990.990.990.9

(a) Test1

WACC.88 91.1

2.5 3 3.5#S00.51

1.52

#G


Num

ber

of c

ompo

nent

s pe

r st

ate

129 500 1000 2000 3000 4000 6000 8000129 500 1000 2000 3000 4000 6000 8000

1

4

8

12

16

20

30

40

50

100

1

4

8

12

16

20

30

40

50

100WACC

92 9091 89 8892 9091 89 88

87.787.7

90.190.190.190.1

90.990.990.990.9

91.291.291.291.291.291.291.291.2

91.491.491.491.4

91.491.491.491.4

91.791.791.791.7

91.391.391.391.3

91.191.191.191.1

(b) Test2

Figure 4.6: Determined model topologies and their recognition rates (MMIXTURE). The horizon-tal and vertical axes are scaled logarithmically.

4.4. EXPERIMENTS 59

HMMs (1, 4, 8, 12, 16, 20, 30, 40, 50 100 components). The determined model topologies ofMSINGLE and MMIXTURE are plotted with crosses in Figures 4.5 and 4.6, respectively, wherethe plots are overlaid on the contour map of Figure 4.4. All the determined models were located onthe band along a negative slope line. Therefore, these experiments confirmed that the determinedmodels were located in the band. Moreover, a speech recognition test was undertaken by using thein-band models to measure how well the determined topologies performed. The obtained word ac-curacies are also plotted in Figures 4.5 and 4.6 for each determined model topology. Almost all theWACC values were more than 90.0 points, which supports the evidence indicating that the modeltopologies determined using MSINGLE and MMIXTURE were good. Thus, it was confirmed thateach of the model topologies was selected appropriately using MSINGLE and MMIXTURE be-cause they were located in a band where the product of the numbers of states and components perstate was a constant, and almost all the WACC values were above 90.0 points. This indicates thevalidity of the approximations of MSINGLE and MMIXTURE.

The proposed procedure was finalized by selecting the set of in-band models with the highestFm value as an optimum acoustic model. This was accomplished by utilizing the unimodal char-acteristic without seeing the performance of the model evaluation test using test data. Since VBECselects the model that gives the highest value of the VBEC objective function, the validity of themodel selection can be evaluated by examining the relation betweenFm and recognition perfor-mance. Figure 4.7 shows the VBEC objective functionFm values and WACCs for both Test1and Test2 for MSINGLE along a line connecting the points of the determined topologies in Figure4.5, where the horizontal axis is the number of components per state, which are logarithmicallyscaled. Figure 4.8 shows the same values for MMIXTURE. With both MSINGLE (Figure 4.7) andMMIXTURE (Figure 4.8), WACC andFm behaved similarly, which suggests that the proposedsearch algorithm worked well. In fact, the VBEC objective function and WACC behaved almostunimodally for both MSINGLE and MMIXTURE. This indicates that VBEC could determine theappropriate in-band model by using the VBEC objective function, which supports the effectivenessof this proposal, i.e., in-band model search.

Next, experiments were undertaken that focused on the suitability of the finally determinedmodel topology using the in-band model search. From Figure 4.7, MSINGLE could determinethe model topologies{912, 40}, and obtained91.2 for Test1 and91.7 for Test2. Similarly, fromFigure 4.8, MMIXTURE could also determine the model topologies{878, 40}, and the WACCsobtained using the performance test were91.4 for Test1 and91.7 for Test2. They all exceednot only the values of 88.9 and 89.2 obtained by the conventional 2-phase search, but also theperformance goal of 91.0 points, and so we can say that MSINGLE and MMIXTURE can providehigh levels of performance. For the model topology, (the number of clustered states and GMMcomponents), the MSINGLE and MMIXTURE models were similar to each other, and matchedone of the best manually obtained models{1,000, 30} described in Section 4.3.1. In contrast, theMSINGLE and MMIXTURE models were different from the other best manually obtained model{2,000, 40}. However, the topology of manually obtained models varies depending on the testset data, and therefore, the determined models do not always have to correspond to the manuallyobtained models. In fact, the performance of MSINGLE and MMIXTURE reached the goal of 91.0


points, and therefore, we can also say that MSINGLE and MMIXTURE are capable of determiningan optimum model topology. Furthermore, the total numbers of Gaussians in MSINGLE andMMIXTURE were smaller than those obtained in ML-manual, and the determined models weremore compact, which could improve the decoding speed.

Thus, these experiments proved that the proposed method canautomaticallydetermine anoptimumacoustic model topology with thehighest performance. In these experiments, the twoproposed algorithms, MSINGLE and MMIXTURE determined similar topologies, and achievedsimilar performance levels, which indicates that there was no great difference between the twoproposals in this experimental situation. Section 4.4.3 comments on the difference.

86.087.088.089.090.091.092.0

1 10 100Number of components per state

WACCVB objective functionWACC for Test1WACC for Test2VB objective function 40

Figure 4.7: Word accuracies and objectivefunctions using GMM state clustering (MSIN-GLE). The horizontal axis is scaled logarith-mically.

86.087.088.089.090.091.092.0

1 10 100Number of components per state

WACCVB objective functionWACC for Test1WACC for Test2VB objective function 40

Figure 4.8: Word accuracies and objec-tive functions using GMM state clustering(MMIXTURE). The horizontal axis is scaledlogarithmically.

4.4.2 Computational efficiency

The previous experiments confirmed that MSINGLE and MMIXTURE enabled GMM based stateclustering and that the in-band model search could determine an optimum model topology withthe highest performance levels. However, GMM based state clustering is also realized with theoriginal VBEC framework, as described in Chapter 2, even if there are latent variables, withoutusing MSINGLE or MMIXTURE. However, as mentioned in Section 4.2, the VB iterative calcu-lations require an impractical amount of computational time to obtain the trueFm values. Theadvantage of MSINGLE and MMIXTURE is that they can determine an optimum model topologywithin a practical computation time. Therefore, to emphasize the advantages of MSINGLE andMMIXTURE compared with the straightforward implementation, we have to consider the com-putation time needed for constructing the acoustic model. This section examines the effectivenessof MSINGLE and MMIXTURE in comparison with the straightforward VBEC implementation,VBEC 2-phase search and the ML manual method. Here we compare these approaches not only interms of the model topology and performance described in the previous section but also in termsof computation time. To consider the experiments in more detail, isolated word recognition taskswere used to test various situations. The experimental conditions are summarized in Table 4.2. The

4.4. EXPERIMENTS 61

training data consisted of about 3,000 Japanese sentences (4.1 hours) spoken by 30 males. Twotest sets were prepared as with the previous LVCSR experiments, which consisted of 100 Japanesecity names spoken by 25 males (a total of 1,200 words for each), as shown in Table 4.23.

First, the straightforward implementation of VBEC is described that uses VB iterative calcula-tion within the original VBEC framework to prepare in-band models. In this experiment, the VBECiterative method was approximated by fixing the frame-to-state alignments during the splitting pro-cess and by using a phonetic decision tree construction as well as MSINGLE and MMIXTURE.Even in this situation, the full version of the iterative algorithm is unrealistic because of the VB it-erative calculation in GMM. So, a restricted version was examined that was implemented as ideallyas possible by using a brute force computation. Namely, 45 personal computers with state-of-the-art specifications were used, so that the computation for all the phonetic decision trees could becarried out in parallel (this is called the VBEC iterative method within the original VBEC frame-work VBEC AMP (Acoustic Model Plant) because it is finally realized by such a large number ofcomputers). Moreover, in order to reduce the computation time needed for the iterative calcula-tion, we employed an approximation to reduce the number of decision branches when choosing theappropriate phonetic question. The 10 best questions were derived from 44 questions by applyingall the questions to a state splitting with a single Gaussian based state clustering method, whichdid not require any iterative calculations. Then, the iterative calculations were performed for thederived 10 best questions. The trial suggested that the questions selected when using the 10 bestquestions covered about 95 % of those selected when using all the questions, and were sufficientwhen carrying out iterative calculations for all the GMMs to construct a set of clustered-state tri-phone HMMs. Finally, an optimum model was also determined from the in-band models using thein-band model search as well as MSINGLE and MMIXTURE.

As with the LVCSR experiments, a total of10 (1, 5, 10, 15, 20, 25, 30, 35, 40 and 50 com-ponents)× 6 (100, 500, 1,000, 1,500, 2,000, 3,000 clustered states)= 60 acoustic models wasprepared for the ML manual method, and a total of 10 sets of clustered-state HMMs (1, 5, 10,15, 20, 25, 30, 35, 40 and 50 components) was prepared for the VBEC automatic methods. Theobtained model topology, performance and computation time needed for constructing the acousticmodels are listed in Table 4.3, and are discussed in order.

Model topology: The model topologies determined using MSINGLE, MMIXTURE and AMPwere almost the same as those obtained using ML-manual. This supports the view that the MSIN-GLE and MMIXTURE approximations for the VBEC objective function work well in GMM stateclustering and can construct the appropriate model topology even when compared with the moreexact AMP method.

Recognition rate: The performance of MSINGLE, MMIXTURE and AMP was almost thesame. This also supports the validity of the MSINGLE and MMIXTURE approximations for theVBEC objective function. In addition, it is also confirmed that the performance was comparable tothe ML-manual performance (MSINGLE, MMIXTURE, AMP and ML-manual scored more than

3Similar to the discussion for the LVCSR task, although this isolated word recognition task is almost the same asthat in Chapter 3, the test sets are different, which makes the experimental results slightly different, e.g., recognitionperformance.


97.0 % ) and was higher than the conventional 2-phase search performance, even for a differenttask from that described in Section 4.4.1.

Computation time: MSINGLE and MMIXTURE both took about 30 hours, and, as expected,this was much faster than AMP, which took 1,150 hours to finalize the calculation, even thoughthe amount of training data (4.1 hours) was relatively small. Therefore, we can say that theseapproaches were very effective ways to construct models because they can obtain comparablerecognition results as regards performance, and can construct models even more rapidly than AMP.

In comparison with the conventional ML-manual approach, MSINGLE and MMIXTURE tooka relatively short computation time, and are regarded as providing a relatively short computationtime. The reason for the short computation times of MSINGLE and MMIXTURE is that thesemethods do not need an extra search for the dimension of the clustered states (i.e., the number ofsearch combinations was reduced to 1/6 because 6× 10 search combinations were reduced to 1×10 search combinations). Moreover, in LVCSR, the difference between the proposals (MSINGLEand MMIXTURE) and ML-manual as regards computation time would become larger because themodel evaluation test in ML-manual requires more computation time in LVCSR than isolated wordrecognition.

Focusing on the difference between MSINGLE and MMIXTURE, we can see that MSINGLEtook slightly less time than MMIXTURE. The difference between them resulted from the mono-phone GMM training required in MMIXTURE.

Thus, we can conclude that MSINGLE and MMIXTURE can also determine an optimummodel topology while maintaining the highest level of performance even for an isolated wordrecognition task, and as regards computation time, MSINGLE and MMIXTURE can constructacoustic models more rapidly than ML-manual or AMP.

Table 4.2: Experimental conditions for isolated word recognition


(24 dimensions)Window HammingFrame size/shift 25/10 msNumber of HMM states 3 (left-to-right HMM)Number of phoneme categories27

Training data ASJ: 3,000 utterances, 4.1 hours (male)Test data 1 JEIDA: 100 city names, 1,200 words (male)Test data 2 JEIDA: 100 city names, 1,200 words (male)

ASJ: Acoustical Society of Japan, JEIDA: Japan Electronic Industry Development Association

4.4. EXPERIMENTS 63

Table 4.3: Comparison with iterative and non-iterative state clustering

ML-manual 2-phase search AMP MSINGLE MMIXTUREModel topology ({#S, #G}) {500, 30}, {500, 35} {2,642, 5} {548, 30} {253, 35} {224, 35}Recognition rate (%) 97.1, 97.6 96.3, 94.9 97.3, 98.0 97.6, 98.2 97.8, 97.8Time (hour) 244 56 1,150 30 37

# S: number of clustered states, # G: number of components per Gaussian

4.4.3 Prior parameter dependence

A fixed combination of prior parameter values,ξ0 = η0 = 0.1 andφ0 = ϕ0 = 2 were usedthroughout the construction of the model topology and the estimation of the posterior distribution.In the small-scale experiments that were conducted in previous research [24–26], the selectionof such values was not a major concern. However, when the scale of the target application islarge, the selection of the prior parameter values might affect model quality. Namely, the best orbetter values might differ greatly. Moreover, estimating appropriate prior parameters for acousticmodels needs so much time that it is impractical for speech recognition. Therefore, this subsectionexamines how robustly the acoustic models constructed by the proposed method performed againstchanges in prior parameter values. Here,ξ0 andη0 were examined, which correspond to the priorparameters of the mean and covariance for a Gaussian. This is because the effects ofξ0 andη0 for speech recognition would be stronger than those ofφ0 andϕ0, which correspond to theprior parameters of the state transition probability and weight factor, analogous to the effects ofthe mean and covariance for a Gaussian and the state transition probability and weight factor forspeech recognition.

The values of prior parametersξ0 andη0 were varied from 0.01 to 10, and the speech recog-nition rates examined. Table 4.4 shows the recognition rates for each combination of prior pa-rameters. We can see that almost all the models for the given range of prior parameter valuesprovide high levels of performance (more than 97.0 %). Here, we found that there was a differencebetween MSINGLE and MMIXTURE as regards model topology and performance. The modeltopologies of MSINGLE were incorrect whenξ0 = η0 = 1.0 and the performance significantlydegraded whenξ0 = η0 = 10 where the numbers of clustered states increased greatly. Here,the approximate GMM statistics approached the prior statistics of monophone HMM too closelyin Eq. (4.5), which enhanced the rough setting of the prior parameters in MSINGLE. Therefore,these experiments confirm that MMIXTURE was better than MSINGLE as regards the robustnessof the prior parameters, especially with large prior parameter values, because the prior parametersetting in MMIXTURE using GMM statistics is more appropriate than that in MSINGLE usingsingle Gaussian statistics. Thus, although MMIXTURE is more difficult to use than MSINGLEbecause MMIXTURE requires the preparation of the GMM statistics of the monophone HMMstates, MMIXTURE is more stable than MSINGLE when the prior parameters are varied.


Table 4.4: Prior parameter dependence

ξ0 = η0 0.01 0.05 0.1 0.5 1.0 10.0MSINGLEModel topology ({#S, #G}) {325, 15} {278, 25} {253, 35} {343, 40} {962, 25} {7,149, 5}Recognition rate (%) 97.6, 97.5 97.1, 98.1 97.6, 98.2 97.5, 97.4 97.1, 97.1 94.1, 94.3MMIXTUREModel topology ({#S, #G}) {308, 15} {204, 35} {224, 35} {373, 25} {371, 30} {257, 40}Recognition rate (%) 97.5, 97.4 97.6, 97.3 97.8, 97.8 97.8, 97.4 98.3, 97.6 97.8, 97.2

4.5 Summary

This chapter proposed theautomaticdetermination of an optimum topology in an acoustic modelby using GMM-based phonetic decision tree clustering and an efficient model search algorithmutilizing the acoustic model characteristics. The proposal was realized by expanding the VBECmodel selection function used in the Bayesian acoustic model construction in Chapter 3. Experi-ments showed that the proposed approach could determine an optimum topology with a practicalcomputation time, and the performance was comparable to the best recognition performance pro-vided by the conventional maximum likelihood approach withmanualtuning. The effectiveness ofthe proposed methods has also been shown for various tasks, such as a lecture speech recognitiontask and a English read speech recognition task in [57], as shown in Table 4.5. Thus, by usingthe proposed method, VBEC can automatically and rapidly determine an acoustic model topol-ogy with the highest performance, enabling us to dispense with manual tuning procedures whenconstructing acoustic models. The next chapter focuses on the last Bayesian advantage, namelyrobust classification by marginalizing model parameters, obtained by using Bayesian prediction inVBEC,

Table 4.5: Robustness of acoustic model topology determined by VBEC for different speech datasets.

Japanese readJapanese isolated wordJapanese lectureEnglish readVBEC 91.7 97.9 74.5 91.3ML-manual 91.4 98.1 74.2 91.3

Chapter 5

Bayesian speech classification

5.1 Introduction

The performance of statistical speech recognition is severely degraded when it encounters previ-ously unseen environments. This is because statistical speech recognition is conventionally basedon conventional Maximum Likelihood (ML) approaches, which often over-train model parametersto fit a limited amount of training data (sparse data problem). On the other hand, a Bayesian frame-work is expected to deal even with unseen environments since it has the following three importantadvantages: effective utilization of prior knowledge, appropriate selection of model structure androbust classification of unseen speech, each of which works to mitigate the effects of the sparsedata problem. Recently, we and others proposed Variational Bayesian Estimation and Clusteringfor speech recognition (VBEC), which includes all the above Bayesian advantages [28]. PreviousVBEC studies have mainly examined and proven the effectiveness derived from the first two ad-vantages [28, 32]. This chapter focuses on the third Bayesian advantage, the robust classificationof unseen speech.

When we use a conventional classification method based on the ML approach (MLC), weprepare a probability function (for examplef(x; Θ) with a set of model parametersΘ), whichrepresents the distribution of features for a classification category, e.g. a phoneme category, andpoint-estimateΘ using labeled speech data. However, the parameter is often estimated incorrectlybecause of the sparseness of the training data, which results in a mismatch between the trainingand input data. Therefore, MLC might be seriously affected by the incorrect estimation when clas-sifying unseen speech data. In contrast, a classification method based on the Bayesian approachdoes not use the point-estimated value of the parameter, but assumes that the value itself also hasa probability distributionrepresented by a function (for example,g(Θ)). Then, by employing theexpectation off(x; Θ) with respect tog(Θ), we obtain a distribution ofx, and can robustly predictthe behavior of unseen data in the Bayesian approach, which is the so-calledmarginalizationpro-cedure of model parameters [20]. Some previous studies proposed classification methods basedon the predictive distribution (Bayesian Predictive Classification, referred to as BPC) for speechrecognition, and proved that they were capable of providing a much more robust classification thanMLC [22,58].

65

66 CHAPTER 5. BAYESIAN SPEECH CLASSIFICATION

A major problem with BPC is how to provideg(Θ). In [22,58], they prepare theg(Θ) of meanparameters off(x; Θ), while assuming thatΘ is distributed according to the constrained uniformposterior whose mean value is set by an ML or a Maximum A Posteriori (MAP) estimate obtainedfrom training data, and its scaling parameter value is determined by setting prior parameters. Then,the predictive distribution has a flat peak shape because of the scaling parameter, so that the distri-bution can cover a peak where the unseen speech feature might be distributed. Here, the coverageof the predictive distribution depends on the hyper-parameter setting. On the other hand, VBECprovidesg(Θ) from training data for all model parameters off(x; Θ) since the VBEC frameworkis designed to deal consistently with a posterior distribution ofΘ, which is a direct realizationof g(Θ), by variational Bayes (VB-posteriors) [25, 26]. As a result, the predictive distribution isanalytically derived as theStudent’s t-distribution. The tail of the Student’s t-distribution is widerthan that of the Gaussian distribution, which also can cover the distribution of the unseen speechfeature. The tail width depends on the training data, i.e., the tail becomes wider as the training databecomes sparser. Note that, in the VBEC framework, an appropriate coverage by the predictivedistribution is automatically determined from the training data, and mitigation of the mismatchbetween the training and input speech is achieved without setting the hyper-parameters

VB posteriors using the VBEC framework are introduced briefly in Section 2.3. Then, inSection 5.2, a generalized formulation is provided for the conventional MLC, MAP and BPC andVB-BPC so that they form a family of BPCs. Section 5.3 describes two experiments. The firstaims to show the role of VB-BPC in the total Bayesian framework VBEC for the sparse dataproblem. Namely, we show how BPC using the VB posteriors (VB-BPC) contributes to solvingthe sparse data problem in association with the other Bayesian advantages provided by VBEC.The second experiment aims to compare the effectiveness of VB-BPC with conventional Bayesianapproaches. There, we apply VB-BPC to a supervised speaker adaptation task within a directparameter adaptation scheme as a practical example of the sparse data problem, and examine theeffectiveness of VB-BPC compared with the conventional MAP and BPC approaches [15,58].

5.2 Bayesian predictive classification using VBEC

5.2.1 Predictive distribution

This section provides a generalized formulation for the ML/MAP based classification, the uniformdistribution based BPC and VB-BPC so that they form a family of BPCs. To focus on the BPC,the model structurem is omitted in this section. As discussed in Section 2.3.4, by calculatingthe integral in Eq. (2.46), the predictive distribution is obtained and BPC is realized. However, ingeneral, a true posterior distributionp(Θ

(c)ij |O) is difficult to obtain analytically. On the other hand,

the numerical approach requires a very long computation time and is impractical for use in speechrecognition. Therefore, it is important to find a way to approximate true posteriors appropriately ifwe are to realize BPC effectively. The following three types of BPCs were categorized accordingto the methods for approximating the true posteriors to Diracδ posteriors, uniform posteriors andVB posteriors.

5.2. BAYESIAN PREDICTIVE CLASSIFICATION USING VBEC 67

Dirac δ posterior based BPC (δ BPC)

Conventional ML-based classification is interpreted as an extremely simplified BPC that only uti-lizes the location parameter to represent a posterior. That is, we consider aDirac δ posterior,which shrinks the model parameters so that they have deterministic values provided by ML esti-mates,Θ(c)

ij ≡ {a(c)ij , w

(c)jk , µ

(c)jk , Σ

(c)jk |k = 1, ..., L}. Then, the true posterior is approximated as:

p(xt|c, i, j, O) ∼=∫

p(xt|Θ(c)ij )δ(Θ

(c)ij − Θ

(c)ij )dΘ

(c)ij , (5.1)

whereδ(y − z) is a Diracδ function defined as∫

g(y)δ(y − z)dy = g(z). Obviously, the RightHand Side (RHS) in Eq. (5.1) can be reduced to a common likelihood function:

RHS in Eq. (5.1)

= p(xt|Θ(c)ij ) = a

(c)ij

∑k

w(c)jk N

(xt

∣∣∣µ(c)jk , Σ

(c)jk

). (5.2)

Therefore, from Eq. (5.2), MLC is represented as a mixture of Gaussian distributions.MAP estimates are available as an alternative to ML estimates. The MAP estimates mitigate

the sparse data problem by smoothing the ML estimates with reliable statistics obtained with asufficient amount of data. In this case, the ML estimatesΘ

(c)ij in the Diracδ function in Eq. (5.2)

are replaced with MAP estimatesΘ∗(c)ij ≡ {a∗(c)

ij , w∗(c)jk , µ

∗(c)jk , Σ

∗(c)jk |k = 1, ..., L} as follows:

p(xt|c, i, j, O) ∼=∫

p(xt|Θ(c)ij )δ(Θ

(c)ij − Θ

∗(c)ij )dΘ

(c)ij . (5.3)

Therefore, the analytical result of the RHS in Eq. (5.3) is shown as follows:

RHS in Eq. (5.3)

= p(xt|Θ∗(c)ij ) = a

∗(c)ij

∑k

w∗(c)jk N

(xt

∣∣∣µ∗(c)jk , Σ

∗(c)jk

). (5.4)

MAP classification is also represented as a mixture of Gaussian distributions. These classificationsusing ML and MAP estimates are based on point estimation simply obtained from training datavia the Diracδ posterior, and therefore, they cannot deal with mismatches between training andtesting conditions. They are referred to asδBPC1.

Uniform posterior based BPC (UBPC)

The mismatch problem caused by the point estimation in speech recognition is first dealt with by in-troducing the distribution of a constraineduniform posteriorfor p(µ

(c)jk |O), which is also regarded

as an approximation of the true posterior within the BPC formulation [22, 58]. Their methods arebased on the prior knowledge that the mismatch between the two spectral coefficients is repre-sented by the difference between the two coefficients ofd dimension byCd−1ρd experimentally,

1These methods are often referred to asplug-inapproaches (ex. [20,22,58])


whereC andρ are hyper-parameters. Therefore, they assumep(µ(c)jk |O) to be the constrained uni-

form posterior that has a location parameter set by MAP estimatesµ∗(c)jk

2 and a scaling parameterset by hyper-parametersCd−1ρd. The other parameters are distributed as Diracδ posteriors as wellasδBPC. Thus, the true posterior is approximated as follows:

p(xt|c, i, j, O)

∼=∫

p(xt|Θ(c)ij )δ(a

(c)ij − a

∗(c)ij )

∏k

δ(w(c)jk − w

∗(c)jk )δ(Σ

(c)jk − Σ

∗(c)jk )

×∏

d

U(µ(c)jk,d|µ

∗(c)jk,d − Cd−1ρd, µ

∗(c)jk,d + Cd−1ρd)dΘ

(c)ij .

(5.5)

Although, in [22], a normal approximation for the integral calculation of the RHS is used, in [58],the integral with respect toΘ(c)

ij of the RHS in Eq. (5.5) is analytically solved as follows:

RHS in Eq. (5.5)

= a∗(c)ij

∑k

w∗(c)jk

∏d

fjk,d

(xt

d

∣∣∣µ∗(c)jk,d, Σ

∗(c)jk,d, C, ρ

), (5.6)

wherefjk,d is defined as follows:

fjk,d

(xt

d

∣∣∣µ∗(c)jk,d, Σ

∗(c)jk,d, C, ρ

)≡ 1

2Cd−1ρd

(χ

(√Σ

∗(c)jk,d(µ

∗(c)jk,d − xt

d + Cd−1ρd)

)

− χ

(√Σ

∗(c)jk,d(µ

∗(c)jk,d − xt

d − Cd−1ρd)

)).

(5.7)

χ is the cumulative distribution function of the standard Gaussian distribution defined asχ(x) =1√2π

∫ x

−∞ e−y2dy. Thus, [58] obtained the predictive distribution by considering the marginalization

of the Gaussian mean parameter using the uniform posterior, and we refer to this BPC approach asUniform posterior based BPC (UBPC).

VB posterior based BPC (VB-BPC)

In VBEC, after the acoustic modeling described in Section 2.3, we obtain the appropriate VBposterior distributionsq(Θ|O). Therefore, VBEC can deal with the integrals in Eq. (2.46) byusing the estimatedVB posterior distributionsq(Θ(c)

ij |O) as follows:

p(xt|c, i, j, O) ∼=∫

p(xt|Θ(c)ij )q(Θ

(c)ij |O)dΘ

(c)ij . (5.8)

The integral overΘ(c)ij can be solved analytically by substituting Eqs. (2.22) and (2.27) into Eq.

(5.8). Similar to UBPC, if we only consider the marginalization of the mean parameter, the analyt-2ML estimates can also be used.

5.2. BAYESIAN PREDICTIVE CLASSIFICATION USING VBEC 69

ical result of the RHS in Eq. (5.8) is found to be a mixture of Gaussian distributions, as follows:

RHS in Eq. (5.8)

=φ

(c)ij∑

j φ(c)ij

∑k

ϕ(c)jk∑

k ϕ(c)jk

N

(xt

∣∣∣∣∣ν(c)jk ,

(1 + ξ(c)jk )R

(c)jk

ξ(c)jk η

(c)jk

).

(5.9)

This corresponds toδBPC(MAP) with a rescaled variance factored by(1 + ξ(c)jk )/ξ

(c)jk , and this

is referred to as VB-BPC-MEAN. If we consider the marginalization of all the parameters, theanalytical result of the RHS in Eq. (5.8) is found to be a mixture distribution based on the Student’st-distributionSt(·), as follows:

RHS in Eq. (5.8)

=φ

(c)ij∑

j φ(c)ij

∑k

ϕ(c)jk∑

k ϕ(c)jk

∏d

St

(xt

d

∣∣∣∣∣ν(c)jk,d,

(1 + ξ(c)jk )R

(c)jk,d

ξ(c)jk η

(c)jk

, η(c)jk

).

(5.10)

The details of the derivation of Eqs. (5.9) and (5.10) are discussed in A.4, and the property of theStudent’s t-distribution is described in Section 5.2.2. This approach is called Bayesian PredictiveClassification using VB posterior distributions (VB-BPC). VB-BPC achieves VBEC with a totalBayesian framework for speech recognition that possesses a consistent concept whereby all proce-dures (acoustic modeling and speech classification) are carried out based on posterior distributions,as shown in Figure 2.5. VBEC mitigates the sparse data problem by using the full potential of theBayesian approach that is drawn out by this consistent concept, and VB-BPC contributes greatlyas one of the components.

5.2.2 Student’s t-distribution

One of the specifications of VB-BPC is that its classification function is represented by a non-Gaussian Student’s t-distribution. Therefore, VB-BPC is related to the study of non-Gaussian dis-tribution based speech recognition. Here, we discuss the Student’s t-distribution to clarify the dif-ference between the Gaussian (Gauss(x)), the UBPC distribution (UBPC(x)), the variance-rescaledGaussian (Gauss2(x)) and the Student’s t-distributions (St1(x) and St2(x)), as shown in Figure 5.1(a), where the distribution parameters corresponding to mean and variance are the same. Figure5.1 (b) employs a logarithmic scale for the vertical axes of the linear scale sketched in Figure 5.1(a) to emphasize the behavior of the distribution tail. In speech recognition, acoustic scores arecalculated based on the logarithmic scale, and therefore, the behavior in Figure 5.1 (b) contributesgreatly to the acoustic score and is important to discuss. The Student’s t-distribution is defined asfollows:

St(x|ω, λ, κ) = CSt

(1 +

1

κλ(x − ω)2

)−κ+12

, (5.11)

where

CSt =Γ

(κ+1

2

)Γ

(κ2

)Γ

(12

) (1

κλ

) 12

. (5.12)


Gauss(x)UBPC(x)Gauss2(x)St1(x)St2(x)

(a) Linear scale

Gauss(x)UBPC(x)Gauss2(x)St1(x)St2(x)

(b) Log scale

Figure 5.1: (a) shows the Gaussian (Gauss(x)) derived fromδBPC, the uniform distribution basedpredictive distribution (UBPC(x)) derived from UBPC in Eq. (5.6), the variance-rescaled Gaussian(Gauss2(x)) derived from VB-BPC-MEAN in Eq. (5.9), and two Student’s t-distributions (St1(x)and St2(x)) derived from VB-BPC in Eq. (5.9). (b) employs the logarithmic scale of the verticalaxes in (a) to emphasize the behavior of each distribution tail. The parameters corresponding tomean and variance are the same for all distributions. The hyper-parameters of UBPC are set atC = 3.0 andρ = 0.9. The rescaling parameter of Gauss2(x) (ξ) is 1. The degrees of freedom(DoF) of the Student’s t-distributions (η = κ) are 1 for St1(x) and 100 for St2(x).

Hereω andλ correspond to the mean and variance of the Gaussian, respectively. The Student’st-distribution has an additional parameterκ, which is referred to as a degree of freedom. Thisparameter represents the width of the distribution tail as shown in Figure 5.1 (b). Ifκ is small, thedistribution tail becomes wider than the Gaussian, and ifκ is large, it approaches the Gaussian.From Eq. (5.10),κ = ηjk, and is approximately proportional to the training data occupationcountsζjk from Eq. (2.30). ηjk is obtained for each Gaussian appropriately based on the VBBaum-Welch algorithm. Therefore, with dense training data,κ = ηjk becomes large and VB-BPCapproaches the Gaussian-basedδBPC, as shown in St2(x) and Gauss(x) of Figure 5.1 (a) and (b),which is theoretically proved in [20]. On the other hand, when the training data is sparse,κ = ηjk

becomes small, and the distribution tail becomes wider in St1(x) of Figure 5.1 (b). This behavioris effective in solving the mismatch problem because a wider distribution can cover regions whereunseen speech might be distributed. Consequently VB-BPC mitigates the effects of the mismatchproblem. This property shows that VB-BPC can automatically change the distribution tail byκ = ηjk in the Student’s t-distribution according to the amount of training data, which is theadvantage of VB-BPC over the other BPCs.

Although UBPC(x) has a flatter peak than Gauss(x) in Figure 5.1 (a), the tail behavior is lessflexible than that of the Student’s t-distribution, and tends to be similar to that of Gauss2(x), whichcorresponds to VB-BPC-MEAN, from Figure 5.1 (b). This similar behavior is probably based onthe fact that both UBPC and VB-BPC-MEAN marginalize only the mean parameter of the output

5.3. EXPERIMENTS 71

distribution.

5.2.3 Relationship between Bayesian prediction approaches

Lastly, the relationships between members of the BPC family are summarized as follows:

• δBPC is actually equivalent to classifications that utilize point-estimated values of modelparameters, and does not have any explicit capability of mitigating the mismatch problembecauseδBPC does not marginalize model parameters.δBPC is an extremely simplifiedcase of BPCs, i.e., any other BPC approachesδBPC if its scale-parameter posterior valueapproaches zero.

• UBPC and VB-BPC-MEAN are similar in the sense that both marginalize only the meanparameters of models. UBPC provides a predictive distribution with a flat-peak shape de-pending on the hyper-parameter setting while VB-BPC-MEAN provides a Gaussian predic-tive distribution whose variance is rescaled so that it spreads as the training data becomessparse. The mitigation effect on the mismatch comes from the flat-peak shape of the distri-bution for UBPC, and from the spread variance for VB-BPC-MEAN, which are determinedby hyper-parameters and from the training data, respectively.

• VB-BPC provides a non-Gaussian and wide-tailed predictive distribution. Since the varianceparameters of models are also marginalized by VB-BPC, the wide-tailed shape of its predic-tive distribution, which is analytically derived as the Student’s t-distribution, is obtained. InVB-BPC, the shape of the distribution is automatically determined from the training data,i.e. the tail becomes wider as the training data becomes sparser unlike with UBPC wherethe flat-peak shape is determined by the hyper-parameter tuning. The relationship betweenBPCs is summarized in Table 5.1.

5.3 Experiments

Two experiments were conducted to show the effectiveness of VB-BPC. The first experiment wasdesigned to show the role of VB-BPC in total Bayesian framework VBEC for the sparse data prob-lem. Namely, it is shown how VB-BPC contributes to solving the sparse data problem in associa-tion with the other Bayesian advantages provided by VBEC. The second experiment was designed

Table 5.1: Relationship between BPCs

Posterior distribution µ Σ Predictive distributionδBPC Dirac δ – – GaussianUBPC Constrained uniform Marginalized – Error function

VB-BPC-MEAN Gaussian Marginalized – GaussianVB-BPC Normal Gamma Marginalized Marginalized Student’s t


Table 5.2: Experimental conditions for isolated word recognition task


(24 dimensions)Window HammingFrame size/shift 25/10 msNumber of HMM states 3 (left-to-right HMM)Number of phoneme categories27Number of GMM components 16

Training data ASJ: 10,709 utterances, 10.2 hours (44 male)Test data JEIDA: 100 city names, 7,200 words (75 male)

ASJ: Acoustical Society of JapanJEIDA: Japan Electronic Industry Development Association

Table 5.3: Prior distribution parameters

Prior parameter Setting valueϕ0

jk ζjk × 0.01ξ0jk ζjk × 0.01

ν0jk mean vector of statej

η0jk ζjk × 0.01

R0jk covariance matrix of statej ×η0

jk

to compare the effectiveness of VB-BPC with the conventional Bayesian approaches. VB-BPC isapplied to speaker adaptation within the direct parameter adaptation scheme as a practical exampleof solving the sparse data problem, and the effectiveness of VB-BPC is examined by compari-son with the conventionalδBPC(MAP) and UBPC approaches [15, 58]. All the experiments wereperformed using the SOLON speech recognition toolkit [4].

5.3.1 Bayesian predictive classification in total Bayesian framework

Isolated word recognition experiments were conducted and the fully implemented version of VBECthat includes VB-BPC was compared with other partially implemented versions. The experimen-tal conditions are summarized in Table 5.2. The setting of the prior parameters for the VB andMAP training is shown in Table 5.3. The training data consisted of 10,709 Japanese utterances(10.2 hours) spoken by 44 males. The test data consisted of 100 Japanese city names spoken by75 males (a total of 7,200 words). Several subsets of different sizes were randomly extracted fromthe training data set, and each of the subsets was used to construct a set of acoustic models. Theacoustic models were represented by context-dependent HMMs and each HMM state had a 16-component GMM. As a result, 36 sets of acoustic models were prepared for various amounts of

5.3. EXPERIMENTS 73

0102030405060708090100

10 100 1,000 10,000Number of utterancesWord recognition rate (%)

Full-VBECVBEC-MAPVBEC-MLMDL-ML

Figure 5.2: Recognition rate for various amounts of training data. The horizontal axis is scaledlogarithmically.

training data.

Table 5.4 summarizes approaches that use a combination of methods for model selection(context-dependent HMM state clustering), training and classification, for each of which we em-ploy either VB or other approaches. The combination determines how well the method includes theBayesian advantages, i.e., effective utilization of prior knowledge, appropriate selection of modelstructure and robust classification of unseen speech. Here BIC/MDL indicates model selectionusing the minimum description length criterion (or the Bayesian information criterion), which weshould recognize as a kind of ML-based approach [45], andδBPC(MAP) is regarded as a partialimplementation of the Bayesian prediction approach as discussed in Section 5.2. Note that all ofthe combinations, except for Full-VBEC, include an ML or a merely partial implementation of theBayesian approach, and that the approaches are listed in the order of how well the Bayesian ad-vantages are included. Figure 5.2 shows recognition results obtained using the combinations. Wecan see that the more the Bayesian advantages were included, the more robustly speech was rec-ognized. In particular, when the training data were sparse (less than 100 utterances), Full-VBECsignificantly outperformed the other combinations. In addition, when the training data were evensparser (less than 50 utterances), Full-VBEC was better than VBEC-MAP by 2∼ 9 %. Note thatthe only difference between them was the classification algorithm, i.e., VB-BPC orδBPC(MAP).This improvement is clearly due to the effectiveness of VB-BPC, and perhaps due to the synergis-tic effect that results from exploiting the full potential of the Bayesian approach by incorporating


Table 5.4: Configuration of VBEC and ML based approaches

Model selection Training ClassificationFull-VBEC VB VB VB-BPCVBEC-MAP VB VB δBPC(MAP)VBEC-ML VB ML δBPC(MLC)BIC/MDL-ML BIC/MDL ML δBPC(MLC)

all its advantages.

5.3.2 Supervised speaker adaptation

The effectiveness of VB-BPC for supervised speaker adaptation was examined as a practical ap-plication of the VB-BPC, which can demonstrate its superiority in terms of solving the sparse dataproblem. The improvements in the accuracy by the adaptation are compared using VB-BPC (Eq.(5.10)), VB-BPC-MEAN (Eq. (5.9)), UBPC (Eq. (5.6)) andδBPC(MAP) (Eq. (5.4)), each ofwhich belongs to the direct HMM parameter adaptation scheme. Table 5.5 summarizes the ex-perimental conditions. The initial (prior) acoustic model was constructed by read sentences andthis model was adapted using 10 lectures given by 10 males and their labels [59]. In this task, themismatch between training and adaptation data is caused not only by the speakers, but also by thedifference in speaking styles between read speech and lectures. The total training data for initialmodels consisted of 10,709 Japanese utterances spoken by 44 males. In the initial model train-ing, 1,000 speaker-independent context-dependent HMM states were constructed using a phoneticdecision tree method. The output distribution in each state was represented by a 16-componentGMM, and the model parameters were trained based on conventional ML estimation. Each lecturewas divided in half based on the utterance units, and the first-half of the lecture was used as adap-tation data and the second-half was used as recognition data. The total adaptation data consisted ofmore than 60 utterances for each male, and 1, 2, 4, 8, 16, 32, 40, 48 and 60 utterances were used asadaptation data. As a result, about 9 sets of adapted acoustic models for several amounts of adap-tation data were prepared for each male. The prior parameter settings are shown in Table 5.6, andwere used to estimate the MAP parameters inδBPC(MAP) and UBPC, and the VB posteriors inVB-BPC-MEAN and VB-BPC. When setting the UBPC hyper-parameters, the hyper-parameterswere preliminary optimized by trying eight kinds of combinations ofC = 2, 3, 4 and5 andρ = 0.7

and0.9 with reference to the result in [58], and the combination of{C = 3, ρ = 0.9} was adoptedthat provided the best average word accuracy. Throughout this experiment we used a beam searchalgorithm with sufficient beam width and a sufficient number of hypotheses to avoid search er-rors in decoding. The language model weight used in this experiment was optimized by the wordaccuracy of each result.

Figure 5.3 compares the recognition results obtained with VB-BPC, VB-BPC-MEAN, UBPCand δBPC(MAP) for several amounts of adaptation data with the baseline performance for thenon-adapted speaker independent model (62.9 word accuracy). Table 5.7 shows the result from

5.3. EXPERIMENTS 75

Table 5.5: Experimental conditions for LVCSR speaker adaptation task

Sampling rate 16 kHzQuantization 16 bitFeature vector 12 order MFCC with energy +∆+∆∆

(39 dimensions)Window HammingFrame size/shift 25/10 msNumber of HMM states 3 (Left to right)Number of phoneme categories43Number of GMM components 16

Initial training data ASJ: 10,709 utterances, 10.2 hours (44 males)Adaptation data CSJ: 1st-half lectures (10 males)Test data CSJ: 2nd-half lectures (10 males)Language model Standard trigram (made by CSJ transcription)Vocabulary size 30, 000Perplexity 82.2OOV rate 2.1 %

CSJ: Corpus of Spontaneous Japanese

Table 5.6: Prior distribution parameters

Prior parameter Setting valueϕ0

jk 10ξ0jk 10

ν0jk SI mean vector of Gaussiank in statej

η0jk 10

R0jk SI covariance matrix of Gaussiank in statej ×η0

jk

SI: Speaker Independent

Figure 5.3 in detail by including the word accuracy scores and the time of the adaptation data foreach speaker. First, we focus on the effectiveness of the marginalization of the model parameters inBPCs for the sparse data problem, as discussed in Section 5.2.3. Namely, the results of VB-BPC,VB-BPC-MEAN and UBPC were compared with that ofδBPC(MAP), which does not marginalizemodel parameters at all. Figure 5.3 shows that for a small amount of adaptation data (less than8 adaptation utterances), VB-BPC, VB-BPC-MEAN and UBPC were better thanδBPC(MAP),which confirms the effectiveness of the marginalization of the model parameters. A more detailedexamination of the results in this region showed that VB-BPC was better than UBPC by 0.7∼ 1.5points, and that VB-BPC-MEAN and UBPC behaved similarly. This suggests the effectiveness ofthe wide tail property of the Student’s t-distribution discussed in Section 5.2.3, which is obtainedby the marginalization of the variance parameters in addition to the mean parameters. In particular,the results of the one utterance adaptation in Table 5.7, where VB-BPC scored the best for 9 of 10speakers, supports the above suggestion of the effectiveness of the VB-BPC marginalization when


62.063.064.065.066.067.068.069.070.071.072.073.0

1 10 100Number of utterances

Word accuracy

VB-BPCVB-BPC-MEANUBPCδBPC(MAP)

Baseline

Figure 5.3: Word accuracy for various amounts of adaptation data. The horizontal axis is scaledlogarithmically.

the mismatch would be very large due to extreme data sparseness. Second, for any given amountof adaptation data, VB-BPC and VB-BPC-MEAN achieved comparable or better performancethan UBPC, which required hyper-parameter (C andρ) optimization. Therefore, we can say thatVB-BPC and VB-BPC-MEAN could determine the shapes of their distributions automaticallyand appropriately from the adaptation data without tuning the hyper-parameters, as discussed inSection 5.2.3. Finally, VB-BPC was the best for almost all amounts of adaptation data. VB-BPCapproached theδBPC(MAP) performance asymptotically, and provided the highest word accuracyscore of 72.9 for this task (the benchmark score obtained by the speaker independent acousticmodel trained by using lectures is about 72.0 word accuracy in [59]). This confirms the steadyimprovement of the performance using VB-BPC.

Thus, the effectiveness of VB-BPC based on the Student’s t-distribution for the sparse dataproblem has been shown in this as well as the previous experiments.

5.3.3 Computational efficiency

This subsection comments on the computation time needed for the speech classification process.Full-VBEC took six times as long as the other approaches (VBEC-MAP, VBEC-MLC and MDL-ML), which was mainly because more computation time was needed for the acoustic score cal-culation for the Student’s t-distribution in VB-BPC than with the Gaussian inδBPC. The current

5.4. SUMMARY 77

implementation of the Student’s t-distribution computation requires additional logarithmic compu-tation time for each feature dimension compared with the Gaussian computation. In addition, thespeed of the Gaussian computation was increased in a number of ways in our decoder (ex. utiliz-ing cash memories), and the speed of the Student’s t-distribution computation must be increasedsimilarly to reduce the difference in computation time.

5.4 Summary

This chapter introduced a method of Bayesian Predictive Classification using Variational Bayesianposteriors (VB-BPC) in speech recognition. An analytical solution is obtained based on the Stu-dent’s t-distribution. The effectiveness of this approach is confirmed by the recognition rate im-provement obtained when VB-BPC is used in a total Bayesian framework, experimentally. Inspeaker adaptation experiments in the direct HMM parameter adaptation scheme, VB-BPC is moreeffective than the conventional maximum a posteriori and uniform distribution based Bayesianprediction approaches. Thus, we show the effectiveness of VB-BPC based on the Student’s t-distribution for the sparse data problem.

This approach is related to the study of non-Gaussian distribution based speech recognition [60,61] since it successfully applied the Student’s t-distribution to large vocabulary continuous speechrecognition. By considering the Bayesian prediction for the various parametric models in speechrecognition (ex. the conventional Bayesian prediction approach is applied to the transformationbased parametric adaptation approach [62]), the next step is to study the application of the othernon-Gaussian distributions to speech recognition.


Table 5.7: Experimental results for model adaptation experiments for each speaker based on VB-BPC, VB-BPC-MEAN, UBPC andδBPC(MAP). The best scores among the four methods arehighlighted with a bold font

Amount of adaptation data (utterance)Speaker ID 1 2 4 8 16 32 40 48 60

time(s) 1.7 25.8 37.6 58.9 107.4 198.1 240.2 296.7 355.0VB-BPC 84.2 84.7 85.6 85.6 87.2 87.7 88.8 88.3 89.2

A01M0097 VB-BPC-MEAN 83.4 85.0 86.0 85.5 86.7 87.5 89.3 89.2 89.7UBPC 83.0 84.6 85.4 85.5 86.7 87.9 88.2 89.5 89.9

Baseline accuracy = 79.0 δBPC(MAP) 82.5 85.0 85.8 85.8 86.4 87.7 89.2 89.2 89.7time(s) 5.3 7.1 18.2 32.1 64.2 152.7 184.6 228.2 281.4VB-BPC 75.9 75.6 75.9 78.6 84.2 85.7 86.8 88.6 89.8


















Baseline accuracy = 60.7 δBPC(MAP) 61.4 65.1 65.1 66.7 68.9 70.9 70.5 70.5 71.9

time(s) 3.3 10.8 18.9 37.7 72.6 143.5 181.1 221.0 272.3VB-BPC 65.7 66.4 66.9 68.0 69.5 70.8 71.6 72.0 72.9

Average VB-BPC-MEAN 64.5 65.7 66.2 67.4 69.3 70.8 71.6 72.0 73.0UBPC 64.2 65.7 66.2 67.0 68.6 70.6 70.8 71.7 72.5

Baseline accuracy = 62.9 δBPC(MAP) 63.4 65.0 65.7 66.9 68.8 70.6 70.9 71.8 72.5

Chapter 6

Conclusions

6.1 Review of work

The aim of this thesis was to overcome a lack of robustness in the current speech recognitionsystems based on the Maximum Likelihood (ML) approach by introducing a Bayesian approach,both theoretically and practically. The thesis has achieved the following objectives by applying theVariational Bayesian (VB) approach to speech recognition:

• The formulation of Variational Bayesian Estimation and Clustering for speech recognition(VBEC), as a total framework for speech recognition, which covers both acoustic modelconstruction and speech classification by consistently using the VB posteriors (Chapter 2).

• Bayesian acoustic model construction by consistently using VB based Bayesian formulationssuch as the VB Baum-Welch algorithm and VB model selection within the VBEC framework(Chapter 3).

• The automatic determination of acoustic model topologies by expanding the above Bayesianacoustic model construction (Chapter 4).

• Bayesian speech classification based on Bayesian predictive classification by using the Stu-dent’s t-distribution within the VBEC framework (Chapter 5).

This thesis confirms the three Bayesian advantages (prior utilization, model selection and robustclassification) over ML through the use of speech recognition experiments. Thus, VBEC totallymitigates the effect of over-training in speech recognition. In addition, VBEC automatic determi-nation enables us to dispense with manual tuning procedures when constructing acoustic models.Thus, this thesis achieves Bayesian speech recognition through the realization of the total Bayesianspeech recognition framework VBEC.

6.2 Related work

VB is a key technique in this thesis. Table 6.1 summarizes the technical trend in VB-applied speechinformation processing. Note that although the first applications of VB to speech recognition were

79

80 CHAPTER 6. CONCLUSIONS

limited to the topics of feature extraction and acoustic models, recent applications have coveredspoken language modeling. Therefore, VB has been widely applied to speech recognition and otherforms of speech processing. Given such a trend, this work plays an important role in pioneering themain formulation and implementation of VB based speech recognition, which is a core technologyin this field.

As regards Bayesian speech recognition, there have been many studies that did not employVB based approaches. Although this thesis mainly compares maximum a posteriori, Bayesianinformation criterion, and Bayesian prediction approaches, a serious discussion should also bedone for another major realization of Bayesian speech recognition, quasi-Bayes approaches [16,17,63].

6.3 Future work

Each summary section in the previous chapters suggests future works related to the techniquedescribed in that chapter. This section provides the future global directions triggered by this the-sis. A major study must be undertaken to expand VBEC expansions to deal with recent topics inspeech recognition such as discriminative training, the full covariance model, and feature extrac-tion. Future work will also concentrate on advanced topics in relation to acoustic model adaptationtechniques such on-line adaptation [71] and structural Bayes [72] by utilizing all the Bayesianadvantages based on VBEC. To realize these approaches, the total Bayesian framework must beexpanded to deal with the sequential updating of prior distributions and the model structure ac-cording to the time evolution. In addition, prior distribution setting must be carefully considered.

Finally, new modeling that extends beyond the standard acoustic model (the hidden Markovmodel, Gaussian mixture model, and, current phoneme unit) must be studied. Recent progress onspeech recognition is mainly due to advanced training methods, which are typified by discrimi-native and Bayesian analysis beyond ML. Although many attempts at new modeling have beenunsuccessful (e.g. segment models [73]), these training methods can provide a breakthrough withrespect to the new modeling problem.

Table 6.1: Technical trend of speech recognition using variational Bayes

Topic References DateFeature extraction [64,65] 2002 -Clustering context-dependent HMM states [29,30,49] 2002 -Formulation of Bayesian speech recognition[27,28] 2002 -Selection of number of GMM components [52–54] 2002 -Acoustic model adaptation [34,55,66,67] 2003 -Determination of acoustic model topology [31,32] 2003 -Gaussian reduction [68] 2004 -Bayesian prediction [33,34] 2005 -Language modeling [69,70] 2003 -

6.4. SUMMARY 81

6.4 Summary

This thesis dealt with Bayesian speech recognition, and realized a total Bayesian speech recogni-tion framework VBEC, both theoretically and practically. This thesis represents pioneering workwith respect to the main formulation and implementation of VB based speech recognition, whichis a core technology in the Bayesian speech recognition field. The VBEC framework will be im-proved to deal with model adaptation techniques and new modeling in speech recognition andother forms of speech information processing. I shall be content if this thesis contributes to ad-vances in worldwide studies of speech information processing through the additional progress ofthis Bayesian speech recognition.

ACKNOWLEDGMENTS

It gives me great pleasure to receive my doctorate from Waseda University, from whom I receivedmy master’s degree five years ago. I would like to thank Professor Tetsunori Kobayashi whowas my main supervisor, and Professors Yasuo Matsuyama, Toshiyasu Matsushima, and YoshinoriSagisaka who acted as vice supervisors, for giving me this opportunity, as well as for their generousteaching during my PhD coursework. I particularly want to thank Professor Kobayashi for noticingmy work in its early stages, and for offering me much advice, both general and detailed, as Ipursued my research and constructed this thesis.

I started my research career during my time in the Department of Physics at Waseda University.From the 4th year of my bachelors degree to the 2nd year of my masters degree I studied atthe Ohba-Nakazato Laboratory where I established my research style of seeking out theories anduniqueness. I want to thank Professor Ichiro Ohba, Professor Hiromichi Nakazato, Dr. HirokiNakamura, and the other senior researchers for their valuable advice and direction, which stayswith me to this day. I will never forget their seminars on scattering and neutrino theory. I have alsobeen stimulated by many colleagues especially Dr. Tsuyoshi Otobe and Dr. Gen Kimura (currentlyat Tohoku University), even after my graduation. I hope we can continue this relationship.

This thesis was conducted while belonging to NTT Communication Laboratories, NTT Cor-poration. My continuous 5-year research on speech information processing has been supportedand allowed to continue by Dr. Shigeru Katagiri, Dr. Shoji Makino, and Dr. Masato Miyoshias executive managers and group leaders thanks to their understanding of my work. In this pe-riod, my research has developed greatly through various research discussions and communicationswithin the Speech Open Laboratory and the Signal Processing Research Group. Members of ourspeech recognition team, Dr. Erik McDermott, Dr. Mike Schuster, Dr. Daniel Willet (currentlyat TEMIC SDS), Dr. Takaaki Hori, Kentaro Ishizuka, and Takanobu Oba have always providedme with valuable technical knowledge with regard to speech recognition. In addition to these col-leagues, members of the technical support staff, and in particular Ken Shimomura, have engagedin the development of the speech recognition research platform SOLON, which is a basic tool ofmy research. I am extremely grateful to all of them. Other members of my laboratory, such asDr. Tomohiro Nakatani, Dr. Chiori Hori (currently at Carnegie Mellon University), Dr. ParhamZolfaghari (currently at BNP Paribas), Dr. Masashi Inoue (currently at the National Institute ofInformatics), and Keisuke Kinoshita, have given me great pleasure through valuable discussionson speech processing and statistical learning theory. Each person has provided a different view-point on speech recognition, and has also encouraged me along the way. I am also grateful for thesupport and discussion given to my work by Atsushi Sako of Kobe University, Toshiaki Kubo of

83

84 ACKNOWLEDGMENTS

Waseda University, Wesley Arnquist of the University of Washington, and Hirokazu Kameoka ofthe University of Tokyo through their internship programs at NTT, and by David Meacock who hasrefined the English of most of my work, including this thesis. Dr. Satoshi Takahashi, YoshikazuYamaguchi (currently at NTT IT Corporation), and Atsunori Ogawa of NTT Cyber Space Labo-ratories have all given me valuable advice as researchers with experience in real applications ofspeech recognition. The part of my work dealing with automatic determination was emboldenedby Yoshikazu Yamaguchi as he always pointed out the importance of this topic and its need onthe development side, and encouraged me to work on this. To continue such work, I would like tomaintain this good relationship between the research and development arms of NTT.

I started the work described in this thesis with Dr. Naonori Ueda, a specialist in statisticallearning theory, Dr. Atsushi Nakamura and Dr. Yasuhiro Minami, specialists in speech recognition,and myself, who, at the time, knew nothing of statistical learning theory or speech recognition.The hard work during that first research period and their strict teaching has a treasured place in mymemory. Their teaching has covered many aspects such as research and social postures, technicaland business writing, how to conduct research, as well as research activities. They also showed metheir own kindness and considerations, which was especially encouraging when I was still youngin my research life. I truly feel it would have been hard to find such a fortunate and advantageouslearning environment with such knowledgeable supervisors.

Finally I would like to thank all of my friends, Professor Yoshiji Horikoshi at Waseda Uni-versity and his family as they treat me like a family member, and my family for their wonderfulsupport throughout these years.

ACKNOWLEDGMENTS IN JAPANESE

愛する母校である早稲田大学で博士学位を取れることをこの上なく幸いに思います. 本学位論文の主査としてこのような機会を与えてくださった早稲田大学小林哲則教授ならびに副査の松山泰男教授，松嶋敏泰教授，匂坂芳典教授には，その指導も含めて大変感謝しております. 特に小林教授には自研究に対して早くから注目して頂き，本学位論文をまとめる上で研究の細部から全体の構成にいたるまでの適切な助言を頂いたことで，学位論文のみならず自研究が大きく進展しました．私の研究経歴は早稲田大学理工学部物理学科時代にはじまります.特に学部 4年から修士

2年まで 3年間所属した早稲田大学大場中里研究室が理論探求と独自性を目指すという自分の研究スタイルを形作ってくれたと思っております．大場一郎教授・中里弘道教授・中村博樹博士等の諸先生・先輩方の指導は今でも自分の身に染み付いております．散乱ゼミや卒業論文時のニュートリノゼミは忘れません．また乙部毅博士や木村元博士 (現在東北大学)をはじめとして研究室同僚や後輩にも大変刺激を受けました．今後も大場中里研究室の皆様と関係が続けばと思っております．本研究は日本電信電話株式会社 NTTコミュニケーション科学基礎研究所での成果です．

入社以来 5年間首尾一貫して音声認識研究を続けられたのは片桐滋博士・牧野昭二博士，三好正人博士が部長としてグループリーダーとして本研究に常に理解を示しサポートしてくれたおかげです. その間，音声オープンラボ・信号処理研究グループに所属した際の様々な研究及び多くの方々との交流が本成果を生み出したといえます．認識チームのエリック・マクダーモット博士，マイク・シュスター博士，ダニエル・ヴィレット博士 (現在TEMIC SDS社)，堀貴明博士，石塚健太郎氏，大庭隆伸氏は音声認識に関する貴重な専門知識を数多く提供してくれました. 上記のメンバーに加えて下村賢氏をはじめとする研究補助員のサポートにより日々進化していった音声認識プラットフォームSOLONにより本成果は実現できたといえます．彼らの本成果への貢献に大きく感謝します．また，中谷智広博士，堀智織博士 (現在カーネギーメロン大学)，パーハム・ゾルファガリ博士 (現在BNPパリバ社)，井上雅史博士（現在国立情報学研究所）木下慶介氏とは音声認識とは違った視点から音声処理や統計理論について議論頂くとともに，数多くの私的な励ましに助けられました．神戸大学佐古淳氏，早稲田大学久保俊明氏，ワシントン州立大学ウェスリー・アーンクィスト氏，東京大学亀岡弘和氏の実習制度を通じた研究協力やデービッド・ミーコック氏の本学位論文を含めた英文添削も本研究を大きく助けてくれました．さらに，高橋敏博士，山口義和氏 (現在NTTアイティ(株))，小川厚徳氏をはじめとするNTTサイバースペース研究所のメンバーには音声認識の実用上の観点から貴重な意見を数々頂きました．特に 4章の自動決定の研究は，開発現場での必要性から，山口氏に大きく薦められた結果完成した成果であります．このような研究を今後も進めていくために，開発研究と基礎研究の緊密な関係を保って生きたいです，

85

86 ACKNOWLEDGMENTS IN JAPANESE

本成果は学習理論の専門家である上田修功博士と音声認識技術の専門家である中村篤博士，南泰浩博士，新人として学習理論も音声認識も全くわからなかった自分の 4人ではじめたものです. 当初は難航した研究活動も厳しかった指導も今思えば大変貴重な時間でありました．彼らの指導は多岐に渡り研究活動のみならず会社人としての心構え，文章能力，そして何よりも研究者はどうあるべきかを常に自分に提示してくれました. また 3人それぞれが持つ異なる側面の優しさが，駆け出しであった自分を大いに励ましてくれました．当時のこのように贅沢な指導は他にはないのではないかと思っています．最後に今日まで自分を支えてくれた友人達，家族同然の付き合いをしてくれた早稲田大

学堀越佳治教授とそのご家族，両親，姉弟，言葉の届かない家族に感謝の意を表します．

Bibliography

[1] K. F. Lee, H. W. Hon, and R. Reddy. An overview of the SPHINX speech recognition system.IEEE Transactions on Acoustics, Speech, and Signal Processing, 38:35–45, 1990.

[2] P. C. Woodland, J. J. Odell, V. Valtchev, and S. J. Young. Large vocabulary continuous speechrecognition using HTK. InProc. ICASSP1994, volume 2, pages 125–128, 1994.

[3] T. Kawahara, A. Lee, T. Kobayashi, K. Takeda, N. Minematsu, K. Itou, A. Ito, M. Yamamoto,A. Yamada, T. Utsuro, and K. Shikano. Japanese dictation toolkit - 1997 version -.Journalof the Acoustical Society of Japan (E), 20:233–239, 1999.

[4] T. Hori. NTT Speech recognizer with OutLook On the Next generation: SOLON. InProc.NTT Workshop on Communication Scene Analysis, volume 1, SP-6, 2004.

[5] T. Hori, C. Hori, and Y. Minami. Fast on-the-fly composition for weighted finite-state trans-ducers in 1.8 million-word vocabulary continuous speech recognition. InProc. ICSLP2004,volume 1, pages 289–292, 2004.

[6] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EMalgorithm.Journal of Royal Statististical Society B, 39:1–38, 1976.

[7] F. Jelinek. Continuous speech recognition by statistical methods. InProc. IEEE, volume64(4), pages 532–556, 1976.

[8] X. D. Huang, Y. Ariki, and M. A. Jack. Hidden Markov models for speech recognition.Edinburgh University Press, 1990.

[9] X. D. Huang, A. Acero, and H. W. Hon.Spoken language processing, a guide to theory,algorithm, and system development. Prentice Hall PTR, 2001.

[10] P. Brown. The acoustic-modeling problem in automatic speech recognition. PhD thesis,Carnegie Mellon University, 1987.

[11] S. Katagiri, B-H. Juang, and C-H. Lee. Discriminative learning for minimum error classifi-cation. IEEE Transactions on Signal Processing, 40:3043–3054, 1992.

[12] E. McDermott.Discriminative training for speech recognition. PhD thesis, Waseda Univer-sity, 1997.

87

88 BIBLIOGRAPHY

[13] D. Povey. Discriminative training for large vocabulary speech recognition. PhD thesis,Cambridge University, 2003.

[14] C. H. Lee, C. H. Lin, and B-H. Juang. A study on speaker adaptation of the parameters ofcontinuous density hidden Markov models.IEEE Transactions on Acoustics, Speech, andSignal Processing, 39:806–814, 1991.

[15] J-L. Gauvain and C-H. Lee. Maximum a posteriori estimation for multivariate Gaussianmixture observations of Markov chains.IEEE Transactions on Speech and Audio Processing,2:291–298, 1994.

[16] Q. Huo, C. Chan, and C.-H. Lee. On-line adaptation of the SCHMM parameters based on thesegmental quasi-Bayes learning for speech recognition.IEEE Transactions on Speech andAudio Processing, 4:141–144, 1996.

[17] J. T. Chien. Quasi-Bayes linear regression for sequential learning of hidden Markov models.IEEE Transactions on Speech and Audio Processing, 10:268–278, 2002.

[18] S. Furui. Recent advances in spontaneous speech recognition and understanding. InProc.SSPR2003, pages 1–6, 2003.

[19] J. O. Berger.Statistical Decision Theory and Bayesian Analysis, Second Edition. Springer-Verlag, 1985.

[20] J. M. Bernardo and A. F. M. Smith.Bayesian Theory. John Wiley & Sons Ltd, 1994.

[21] K. Shinoda and T. Watanabe. MDL-based context-dependent subword modeling for speechrecognition.Journal of the Acoustical Society of Japan (E), 21:79–86, 2000.

[22] Q. Huo and C-H. Lee. A Bayesian predictive classification approach to robust speech recog-nition. IEEE Transactions on Speech and Audio Processing, 8:200–204, 2000.

[23] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variationalmethods for graphical models.Machine Learning, 37:183–233, 1997.

[24] S. Waterhouse, D. MacKay, and T. Robinson.Bayesian methods for mixtures of experts.NIPS 7, MIT Press, 1995.

[25] H. Attias. Inferring parameters and structure of latent variable models by variational Bayes.In Proc. Uncertainty in Artificial Intelligence (UAI) 15, 1999.

[26] N. Ueda and Z. Ghahramani. Bayesian model search for mixture models based on optimizingvariational bounds.Neural Networks, 15:1223–1241, 2002.

[27] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda.Application of variational Bayesianapproach to speech recognition. NIPS 2002, MIT Press, 2002.

BIBLIOGRAPHY 89

[28] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Variational Bayesian estimation andclustering for speech recognition.IEEE Transactions on Speech and Audio Processing,12:365–381, 2004.

[29] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Constructing shared-state hiddenMarkov models based on a Bayesian approach. InProc. ICSLP2002, volume 4, pages 2669–2672, 2002.

[30] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Selection of shared-states hiddenMarkov model structure using Bayesian criterion.IEICE Transactions on Information andSystems, J86-D-II:776–786, 2003. (in Japanese).

[31] S. Watanabe, A. Sako, and A. Nakamura. Automatic determination of acoustic model topol-ogy using variational Bayesian estimation and clustering. InProc. ICASSP2004, volume 1,pages 813–816, 2004.

[32] S. Watanabe, A. Sako, and A. Nakamura. Automatic determination of acoustic model topol-ogy using variational Bayesian estimation and clustering for large vocabulary continuousspeech recognition.IEEE Transactions on Audio, Speech, and Language Processing, 14,2006. (in press).

[33] S. Watanabe and A. Nakamura. Effects of Bayesian predictive classification using variationalBayesian posteriors for sparse training data in speech recognition. InProc. Interspeech ’2005- Eurospeech, pages 1105–1108, 2005.

[34] S. Watanabe and A. Nakamura. Speech recognition based on Student’s t-distribution de-rived from total Bayesian framework.IEICE Transactions on Information and Systems, E89-D:970–980, 2006.

[35] S. Kullback and R. A. Leibler. On information and sufficiency.Annals of MathematicalStatistics, 22:79–86, 1951.

[36] H. Akaike. Likelihood and the Bayes procedure. In J. M. Bernardo, M. H. DeGroot, D. V.Lindley, and A. F. M. Smith, editors,Bayesian Statistics, pages 143–166. University Press,Valencia, Spain, 1980.

[37] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery.Numerical Recipes in C:The Art of Scientific Computing, Second Edition. Cambridge University Press, 1993.

[38] S. Sagayama. Phoneme environment clustering for speech recognition. InProc.ICASSP1989, volume 1, pages 397–400, 1989.

[39] J. Takami and S. Sagayama. A successive state splitting algorithm for efficient allophonemodeling. InProc. ICASSP1992, volume 1, pages 573–576, 1992.

90 BIBLIOGRAPHY

[40] M. Ostendorf and H. Singer. HMM topology design using maximum likelihood successivestate splitting.Computer Speech and Language, 11:17–41, 1997.

[41] J. J. Odell.The use of context in large vocabulary speech recognition. PhD thesis, CambridgeUniversity, 1995.

[42] H. Akaike. A new look at the statistical model identification.IEEE Transactions on AutomaticControl, 19:716–723, 1974.

[43] J. Rissanen. Universal coding, information, prediction and estimation.IEEE Transactions onInformation Theory, 30:629–636, 1984.

[44] G. Schwarz. Estimating the dimension of a model.The Annals of Statistics, 6:461–464, 1978.

[45] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speechrecognition. InProc. Eurospeech1997, volume 1, pages 99–102, 1997.

[46] W. Chou and W. Reichl. Decision tree state tying based on penalized Bayesian informationcriterion. InProc. ICASSP1999, volume 1, pages 345–348, 1999.

[47] S. Chen and R. Gopinath. Model selection in acoustic modeling. InProc. Eurospeech1999,volume 3, pages 1087–1090, 1999.

[48] K. Shinoda and K. Iso. Efficient reduction of Gaussian components using MDL criterion forHMM-based speech recognition. InProc. ICASSP2001, volume 1, pages 869–872, 2001.

[49] T. Jitsuhiro and S. Nakamura. Automatic generation of non-uniform HMM structures basedon variational Bayesian approach. InProc. ICASSP2004, volume 1, pages 805–808, 2004.

[50] H. Attias. A variational Bayesian framework for graphical models. NIPS 2000, MIT Press,2000.

[51] P. Somervuo. Speech modeling using variational Bayesian mixture of Gaussians. InProc.ICSLP2002, volume 2, pages 1245–1248, 2002.

[52] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Application of variational Bayesianapproach to speech recognition. InProc. Fall Meeting of ASJ 2002, volume 1, pages 127–128,2002. (in Japanese).

[53] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Bayesian acoustic modeling for spon-taneous speech recognition. InProc. SSPR2003, pages 47–50, 2003.

[54] F. Valente and C. Wellekens. Variational Bayesian GMM for speech recognition. InProc.Eurospeech2003, pages 441–444, 2003.

[55] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Application of variational Bayesianestimation and clustering to acoustic model adaptation. InProc. ICASSP2003, volume 1,pages 568–571, 2003.

BIBLIOGRAPHY 91

[56] T. Kato, S. Kuroiwa, T. Shimizu, and N. Higuchi. Efficient mixture Gaussian synthesis fordecision tree based state tying. InProc. ICASSP2001, volume 1, pages 493–496, 2001.

[57] S. Watanabe and A. Nakamura. Robustness of acoustic model topology determined by vari-ational Bayesian estimation and clustering for speech recognition for different speech datasets. InProc. IEICE International Workshop of Beyond HMM, SP2004-90, pages 55–60,2004.

[58] H. Jiang, K. Hirose, and Q. Huo. Robust speech recognition based on a Bayesian predictionapproach.IEEE Transactions on Speech and Audio Processing, 7:426–440, 1999.

[59] T. Kawahara, H. Nanjo, T. Shinozaki, and S. Furui. Benchmark test for speech recognitionusing the Corpus of Spontaneous Japanese. InProc. SSPR2003, pages 135–138, 2003.

[60] A. Nakamura. Acoustic modeling for speech recognition based on a generalized Laplacianmixture distribution.IEICE Transactions on Information and Systems, J83-D-II:2118–2127,2000. (in Japanese).

[61] S. Basu, C. A. Micchelli, and P. Olsen. Power exponential densities for the training andclassification of acoustic feature vectors in speech recognition.Journal of Computational &Graphical Statistics, 10:158–184, 2001.

[62] A. C. Surendran and C-H. Lee. Transformation-based Bayesian prediction for adaptation ofHMMs. Speech Communication, 34:159–174, 2001.

[63] U. E. Makov and A. F. M. Smith. A quasi-Bayes unsupervised learning procedure for priors.IEEE Transactions on Information Theory, 23:761–764, 1977.

[64] O. Kwon, T.-W. Lee, and K. Chan. Application of variational Bayesian PCA for speechfeature extraction. InProc. ICASSP2002, volume 1, pages 825–828, 2002.

[65] F. Valente and C. Wellekens. Variational Bayesian feature selection for Gaussian mixturemodels. InProc. ICASSP2004, volume 1, pages 513–516, 2004.

[66] S. Watanabe and A. Nakamura. Acoustic model adaptation based on coarse-fine training oftransfer vectors and its application to speaker adaptation task. InProc. ICSLP2004, volume 4,pages 2933–2936, 2004.

[67] K. Yu and M. J. F. Gales. Bayesian adaptation and adaptively trained systems. InProc.Automatic Speech Recognition and Understanding Workshop (ASRU) 2005, pages 209–214,2005.

[68] A. Ogawa, Y. Yamaguchi, and S. Takahashi. Reduction of mixture components using newGaussian distance measure. InProc. Fall Meeting of ASJ 2004, volume 1, pages 81–82, 2004.(in Japanese).

92 BIBLIOGRAPHY

[69] T. Mishina and M. Yamamoto. Context adaptation using variational Bayesian learning forngram models based on probabilistic LSA.IEICE Transactions on Information and Systems,J87-D-II:1409–1417, 2004. (in Japanese).

[70] Y.-C. Tam and T. Schultz. Dynamic language model adaptation using variational Bayesinference. InProc. Interspeech ’2005 - Eurospeech, pages 5–8, 2005.

[71] Q. Huo and C-H. Lee. On-line adaptive learning of the correlated continuous density hiddenMarkov models for speech recognition.IEEE Transactions on Speech and Audio Processing,6:386–397, 1998.

[72] K. Shinoda and C-H. Lee. A structural Bayes approach to speaker adaptation.IEEE Trans-actions on Speech and Audio Processing, 9:276–287, 2001.

[73] M. Ostendorf, V. Digalakis, and O.A. Kimball. From HMMs to segment models.IEEETransactions on Speech and Audio Processing, 4:360–378, 1996.

LIST OF WORK

Journal papers

[J1] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, “Selection of shared-state hiddenMarkov model structure using Bayesian criterion,” (in Japanese),IEICE Transactions onInformation and Systems, vol. J86-D-II, no. 6, pp. 776–786, (2003) (received the best paperaward from IEICE).

[J2] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, “Variational Bayesian estimation andclustering for speech recognition,”IEEE Transactions on Speech and Audio Processing, vol.12, pp. 365–381, (2004).

[J3] S. Watanabe, A. Sako and A. Nakamura, “Automatic determination of acoustic model topol-ogy using variational Bayesian estimation and clustering for large vocabulary continuousspeech recognition,”IEEE Transactions on Audio, Speech, and Language Processing, vol.14, (in press).

[J4] S. Watanabe and A. Nakamura, “Speech recognition based on Student’s t-distribution de-rived from total Bayesian framework,”IEICE Transactions on Information and Systems, vol.E89-D, pp. 970–980, (2006).

Letters

[L1] S. Watanabe and A. Nakamura, “Acoustic model adaptation based on coarse/fine training oftransfer vectors,” (in Japanese),Information Technology Letters.

International conferences

[IC1] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, “Constructing shared-state hiddenMarkov models based on a Bayesian approach,” InProc. ICSLP’02, vol. 4, pp. 2669–2672,(2002).

[IC2] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, “Application of variational Bayesianapproach to speech recognition,” InProc. NIPS15MIT Press, (2002).

93

94 LIST OF WORK

[IC3] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, “Application of variational Bayesianestimation and clustering to acoustic model adaptation,” InProc. ICASSP’03, vol. 1, pp.568–571, (2003).

[IC4] S. Watanabe, A. Sako and A. Nakamura, “Automatic determination of acoustic model topol-ogy using variational Bayesian estimation and clustering,” InProc. ICASSP’04, vol. 1, pp.813–816, (2004).

[IC5] P. Zolfaghari, S. Watanabe, A. Nakamura and S. Katagiri, “Bayesian modelling of the speechspectrum using mixture of Gaussians,” InProc. ICASSP’04, vol. 1, pp. 553–556, (2004).

[IC6] S. Watanabe and A. Nakamura, “Acoustic model adaptation based on coarse-fine training oftransfer vectors and its application to speaker adaptation task,” InProc. ICSLP’04, vol. 4,pp. 2933–2936, (2004).

[IC7] S. Watanabe and A. Nakamura, “Effects of Bayesian predictive classification using varia-tional Bayesian posteriors for sparse training data in speech recognition,” InProc. Inter-speech ’2005 Eurospeech, pp. 1105–1109, (2005).

International workshops

[IW1] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, “Bayesian acoustic modeling forSpontaneous speech recognition,” InProc. SSPR’03, pp. 47–50, (2003).

[IW2] P. Zolfaghari, H. Kato, S. Watanabe and S. Katagiri, “Speech spectral modelling usingmixture of Gaussians, ” InProc. Special Workshop In Maui (SWIM), (2004)

[IW3] S. Watanabe and A. Nakamura, “Robustness of acoustic model topology determined byVBEC (Variational Bayesian Estimation and Clustering for speech recognition) for differentspeech data sets,” InProc. Workshop on Statistical Modeling Approach for Speech Recog-nition - Beyond HMM, pp. 55–60, (2004).

Domestic conferences (in Japanese)

[DC1] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, “Application of variational Bayesianmethod to speech recognition,” InProc. Fall Meeting of ASJ 2002, 1-9-23, pp. 45–46,(2002.9) (recieved the Awaya prise from the ASJ).

[DC2] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, “Application of variational Bayesianestimation and clustering to acoustic model adaptation,” InProc. Spring Meeting of ASJ2003, 3-3-12, pp. 127–128, (2003.3).

[DC3] S. Watanabe, T. Hori, “N-gram language modeling using Bayesian approach,” InProc. FallMeeting of ASJ 2003, 2-6-10, pp. 79-80, (2003.9).

LIST OF WORK 95

[DC4] P. Zolfaghari, S. Watanabe and S. Katagiri, “Bayesian modeling of the spectrum usingGaussian mixtures,” InProc. Fall Meeting of ASJ 2003, 2-Q-10, pp. 331–332, (2003.9).

[DC5] S. Watanabe, A. Sako, A. Nakamura, “Automatic determination of acoustic model topologyusing variational Bayesian estimation and clustering,” InProc. Spring Meeting of ASJ 2004,1-8-6, pp. 11–12, (2004.3).

[DC6] S. Watanabe, T. Hori, E. McDermott, Y. Minami, A. Nakamura, “An evaluation of speechrecognition system“SOLON” using corpus of spontaneous Japanese,” InProc. Spring Meet-ing of ASJ 2004, 2-8-7, pp. 73-74, (2004.3).

[DC7] S. Watanabe, A. Nakamura, “A supervised acoustic model adaptation based on coarse/finetraining of transfer vectors,” InProc. Fall Meeting of ASJ 2004, 2-4-11, pp. 107–108,(2004.9).

[DC8] S. Watanabe, T. Hori, “A perplexity for spoken language processing using joint probabilitiesof HMM states and words,” InProc. Spring Meeting of ASJ 2005, 1-5-23, pp. 45-46,(2005.3)

Domestic workshops (in Japanese)

[DW1] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, “Training of shared states in hiddenMarkov model based on Bayesian approach,” InTechnical Report of IEICE, SP2002-14, pp.43–48, (2002).

[DW2] S. Watanabe, [Invited talk] “VBEC: robust speech recognition based on a Bayesian ap-proach,” InProc. 5th Young Researcher Meeting of ASJ Kansai section, I-2, (2003)

[DW3] T. Hori, S. Watanabe, E. McDermott, Y. Minami, A. Nakamura, “Evaluation of the speechrecognizer SOLON using the corpus of spontaneous Japanese,” InProc. Workshop of Spon-taneous Speech Science and Engineering, pp. 85–92, (2004).

[DW4] S. Watanabe, [Tutorial talk] “Speech recognition based on a Bayesian approach,” InTechni-cal Report of IEICE, SP2004-74, pp. 13–20, (2004).

[DW5] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, [Talk celebrating the best paperaward] “Selection of shared-state hidden Markov model structure using Bayesian criterion,”In Technical Report of IEICE, SP2004-149, pp. 25–30 (2005).

[DW6] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, “Speech recognition using variationalBayes,” InProc. 8th workshop of Information-Based Induction Sciences (IBIS2005), pp.269–274, (2005).

96 LIST OF WORK

Others

[O1] The Awaya prize from the ASJ in 2003

[O2] The best paper award from the IEICE in 2004

[O3] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, “Selection of shared-state hiddenMarkov model structure using Bayesian criterion,” (English translation paper of IEICETransactions on Information and Systems, vol. J86-D-II, no. 6, pp. 776–786 [J1])IEICETransactions on Information and Systems, vol. E88-D, no. 1, pp. 1–9, (2005).

APPENDICES

A.1 Upper bound of Kullback-Leibler divergence for posteriordistributions

This section derives the upper bound of the KL divergence between a true posterior distribution andan arbitrary distribution. Jensen’s inequality is important as regards deriving the upper bound. Witha continuous function, iff is a concave function andp is a distribution function (

∫p(x)dx = 1),

thenf

(〈g(x)〉p(x)

)≥ 〈f(g(x))〉p(x) (A.1)

Heref(x) = log x andg(x) = h(x)/p(x). Then,

log

(⟨h(x)

p(x)

⟩p(x)

)= log

(∫h(x)dx

)≥

⟨log

(h(x)

p(x)

)⟩p(x)

. (A.2)

Similarly, with a discrete function, iff is a concave function andp is a distribution function(∑

l pl = 1), then

log

(⟨hl

pl

⟩pl

)= log

(∑l

hl

)≥

⟨log

(hl

pl

)⟩pl

. (A.3)

These inequalities are used to derive the upper bounds of the KL divergences.In this section we simplify the arbitrary posterior distributionsq(Θ|O,m), q(Z|O,m), p(Θ|O,m),

andp(Z|O,m) asq(Θ), q(Z), p(Θ) andp(Z) to avoid complicated equation forms.

A.1.1 Model parameter

We first consider the KL divergence between an arbitrary posterior distribution for model parame-tersq(Θ(c)) and the true posterior distribution for model parametersp(Θ(c)):

KL [q(Θ(c))|p(Θ(c))] ≡∫

q(Θ(c)) logq(Θ(c))

p(Θ(c))dΘ(c). (A.4)

Substituting Eq. (2.4) into Eq. (A.4), the KL divergence is rewritten as follows:

KL [q(Θ(c))|p(Θ(c))]

=

∫q(Θ(c)) log

q(Θ(c))∑Z

∫ p(O,Z|Θ,m)p(Θ|m)p(O|m)

dΘ(−c)dΘ(c)

= log p(O|m) −∫

q(Θ(c)) log

∑Z

∫p(O, Z|Θ,m)p(Θ|m)dΘ(−c)

q(Θ(c))dΘ(c)

(A.5)

97

98 APENDICES

Then applying the continuous Jensen’s inequality Eq. (A.2) to Eq. (A.5), the following inequalityis obtained:

KL [q(Θ(c))|p(Θ(c))]

≤ log p(O|m) −∑

Z

∫q(Θ(c))q(Θ(−c))q(Z) log

p(O, Z|Θ,m)p(Θ|m)

q(Θ(c))q(Θ(−c))q(Z)dΘ(c)dΘ(−c)

= log p(O|m) −∑

Z

∫q(Θ)q(Z) log

p(O, Z|Θ,m)p(Θ|m)

q(Θ)q(Z)p(O|m)dΘ

= log p(O|m) −⟨


q(Θ)q(Z)

⟩.

(A.6)

From the 3rd to the 4th line, we use the definitiondΘ(c)dΘ(−c) ≡ dΘ and the relationq(Θ(c))q(Θ(−c)) =

q(Θ), which is derived from Eq. (2.7). Using theFm definition in Eq. (2.10), the upper bound ofthe KL divergence is represented based onFm as follows:

KL [q(Θ(c))|p(Θ(c))] ≤ log p(O|m) −Fm[q(Θ), q(Z)], (A.7)

This inequality corresponds to Eq. (2.9) if the omitted notations are recovered.

A.1.2 Latent variable

Similar to Section A.1.1, we consider the KL divergence between an arbitrary posterior distributionfor model parametersq(Z(c)) and the true posterior distribution for model parametersp(Z(c)):

KL [q(Z(c))|p(Z(c))] ≡∑Z(c)

q(Z(c)) logq(Z(c))

p(Z(c)). (A.8)


KL [q(Z(c))|p(Z(c))] = log p(O|m) −∑Z(c)

q(Z(c)) log

∑Z(−c)

∫p(O, Z|Θ,m)p(Θ|m)dΘ

q(Z(c))(A.9)

Then by applying Jensen’s inequality Eq. (A.3) to Eq. (A.9), the following inequality is obtained:

KL [q(Z(c))|p(Z(c))] ≤ log p(O|m) −⟨


q(Θ)q(Z)

⟩. (A.10)

To derive Eq. (A.10), we use the definition∑

Z(c)

∑Z(−c) ≡

∑Z and the relationq(Z(c))q(Z(−c)) =

q(Z), which is derived from Eq. (2.7). Using theFm definition in Eq. (2.10), the upper bound ofthe KL divergence is represented based onFm as follows:

KL [q(Z(c))|p(Z(c))] ≤ log p(O|m) −Fm[q(Θ), q(Z)], (A.11)

This inequality corresponds to Eq. (2.14) if the omitted notations are recovered. From Eqs. (A.7)and (A.11), we find that the KL divergences of the model parameters and latent variables have thesame upper boundlog p(O|m)−Fm[q(Θ), q(Z)]. This guarantees that the arbitrary posterior dis-tributions for model parameters and latent variables (q(Θ(c)) andq(Z(c)) ) have the same objectivefunctionalFm[q(Θ), q(Z)].

A.2 Variational calculation for VB posterior distributions 99

A.1.3 Model structure

Similar to Section A.1.1 and A.1.2, we consider the KL divergence between an arbitrary posteriordistribution for model structureq(m|O) and the true posterior distribution for model structurep(m|O):

KL [q(m|O)|p(m|O)] ≡∑m

q(m|O) logq(m|O)

p(m|O). (A.12)


KL [q(m|O)|p(m|O)]

=∑m

q(m|O) logq(m|O)∑

Z

∫ p(O,Z|Θ,m)p(Θ|m)p(m)p(O)

dΘ

= log p(O|m) +∑m

q(m|O)

(log

q(m|O)

p(m)− log

∑Z

∫p(O, Z|Θ,m)p(Θ|m)dΘ

).

(A.13)

Then by applying Jensen’s inequality Eq. (A.2) to Eq. (A.13), the following inequality is obtained:

KL [q(m|O)|p(m|O)]

≤ log p(O|m) +∑m

q(m|O)

(log

q(m|O)

p(m)−

∑Z

∫q(Θ)q(Z) log

p(O, Z|Θ,m)p(Θ|m)

q(Θ)q(Z)dΘ

).

(A.14)

Using theFm definition in Eq. (2.10), the upper bound of the KL divergence is represented basedonFm as follows:

KL [q(m|O)|p(m|O)] ≤ log p(O) +

⟨log

q(m|O)

p(m)−Fm[q(Θ), q(Z)]

⟩q(m|O)

. (A.15)

This inequality corresponds to Eq. (2.18) if the omitted notations are recovered.

A.2 Variational calculation for VB posterior distributions

Functional differentiation is a technique for obtaining an extremal function based on a variationalcalculation, and is defined as follows:Continuous function case

δ

δg(y)H[g(x)] = lim

ε→0

H[g(x) + εδ(x − y)] −H[g(x)]

ε(A.16)

Discrete function caseδ

δgl

H[gn] = limε→0

H[gn + εδnl] −H[gn]

ε(A.17)

In this section we simplify the arbitrary posterior distributionsq(Θ|O, m) andq(Z|O,m) asq(Θ)

and q(Z), and objective functionFm[q] asFm to avoid complicated equation forms, and omitcategory indexc.

100 APENDICES


If we consider∫

q(Θ)dΘ = 1 constraint, the functional differentiation in Eq. (2.12) is representedby substituting respectivelyFm andq(Θ) intoH andg(y) in Eq. (A.16) as follows:

δ

δq(Θ′)

(Fm[q(Θ), q(Z)] + K

(∫q(Θ)dΘ − 1

))= lim

ε→0

1

ε

(⟨log

p(O, Z|Θ,m)p(Θ|m)

(q(Θ) + εδ(Θ − Θ′)) q(Z)

⟩q(Θ)+εδ(Θ−Θ′),q(Z)

−Fm

+ K

(∫(q(Θ) + εδ(Θ − Θ′)) dΘ − 1

)− K

(∫q(Θ)dΘ − 1

)),

(A.18)

whereK is a Lagrange’s undetermined multiplier. We focus on the first term in the brackets () inthe 2nd line of Eq. (A.18). By expanding the expectation, the first term is represented as follows:⟨


(q(Θ) + εδ(Θ − Θ′)) q(Z)

⟩q(Θ)+εδ(Θ−Θ′),q(Z)

=

∫(q(Θ) + εδ(Θ − Θ′))

⟨log

p(O, Z|Θ,m)p(Θ|m)

(q(Θ) + εδ(Θ − Θ′)) q(Z)

⟩q(Z)

dΘ

=

∫(q(Θ) + εδ(Θ − Θ′))

⟨log

p(O, Z|Θ,m)p(Θ|m)

q(Θ)q(Z)− log

(1 + ε

δ(Θ − Θ′)

q(Θ)

)⟩q(Z)

dΘ.

(A.19)

By expanding the logarithmic term in Eq. (A.19) with respect toε, Eq. (A.19) is represented as:

Equation (A.19)

=

∫(q(Θ) + εδ(Θ − Θ′))

(⟨log

p(O, Z|Θ,m)p(Θ|m)

q(Θ)q(Z)

⟩q(Z)

− εδ(Θ − Θ′)

q(Θ)

)dΘ + o(ε2)

= Fm − ε + ε

⟨log

p(O, Z|Θ,m)p(Θ|m)

q(Θ)q(Z)

⟩q(Z)

+ o(ε2)

= Fm + ε(−1 + 〈log p(O, Z|Θ,m)〉q(Z) − 〈log q(Z)〉q(Z) + log p(Θ|m) − log q(Θ)

)+ o(ε2)

(A.20)

whereo(ε2) denotes a set of terms of more than the 2nd power ofε. Therefore, by substituting Eq.(A.20) into Eq. (A.18), Eq. (A.18) is represented as:

Equation (A.18)

= limε→0

1

ε

(ε

(−1 + 〈log p(O, Z|Θ,m)〉q(Z) − 〈log q(Z)〉q(Z)

+ log p(Θ|m) − log q(Θ) + K

)+ o(ε2)

)= −1 + 〈log p(O, Z|Θ,m)〉q(Z) − 〈log q(Z)〉q(Z) + log p(Θ|m) − log q(Θ) + K.

(A.21)

A.2 Variational calculation for VB posterior distributions 101

Therefore, the optimal posterior (VB posterior)q(Θ) satisfies the relation whereby Eq. (A.21)= 0

from Eq. (2.12), and is obtained as:

log q(Θ) = −1 + 〈log p(O, Z|Θ,m)〉q(Z) − 〈log q(Z)〉q(Z) + log p(Θ|m) + K. (A.22)

By disregarding the normalization constant, the optimal VB posterior is finally derived as:

q(Θ) ∝ p(Θ|m) exp(〈log p(O, Z|Θ,m)〉q(Z)

)(A.23)

which corresponds to Eq. (2.13) if the omitted notations are recovered.


Similar to Section A.2.1, if we consider∑

Z q(Z) = 1 constraint, the functional differentiation isrepresented by substituting respectivelyFm andq(Z) into H andgn in Eq. (A.17) as follows:

δ

δq(Z ′)

(Fm[q(Θ), q(Z)] + K

(∑Z

q(Z) − 1

))

= limε→0

1

ε

(⟨log

p(O, Z|Θ,m)p(Θ|m)

(q(Z) + εδZZ′) q(Θ)

⟩q(Z)+εδZZ′ ,q(Θ)

−Fm

+ K

(∑Z

(q(Z) + εδZZ′) − 1

)− K

(∑Z

q(Z) − 1

)),

(A.24)

whereK is a Lagrange’s undetermined multiplier. We focus on the first term in the brackets () inthe 2nd line of Eq. (A.24). By expanding the expectation, the first term is represented as follows:⟨


(q(Z) + εδZZ′) q(Θ)

⟩q(Z)+εδZZ′ ,q(Θ)

=∑

Z

(q(Z) + εδZZ′)

⟨log

p(O, Z|Θ,m)p(Θ|m)

q(Θ)q(Z)− log

(1 + ε

δZZ′

q(Z)

)⟩q(Θ)

.

(A.25)


Equation (A.25)

= Fm + ε

(−1 + 〈log p(O, Z|Θ,m)〉q(Θ) +

⟨log

p(Θ|m)

q(Θ)

⟩q(Θ)

− log q(Z)

)+ o(ε2)

(A.26)

Therefore, by substituting Eq. (A.26) into Eq. (A.24), Eq. (A.24) is represented as:

Equation (A.24)

= −1 + 〈log p(O, Z|Θ,m)〉q(Θ) +

⟨log

p(Θ|m)

q(Θ)

⟩q(Z)

− log q(Z) + K.(A.27)

102 APENDICES

Therefore, the optimal posterior (VB posterior)q(Z) satisfies the relation whereby Eq. (A.27)= 0

from Eq. (2.12), and is obtained as:

log q(Z) = −1 + 〈log p(O, Z|Θ,m)〉q(Θ) +

⟨log

p(Θ|m)

q(Θ)

⟩q(Θ)

+ K. (A.28)


q(Z) ∝ exp(〈log p(O, Z|Θ,m)〉q(Θ)

)(A.29)


A.2.3 Model structure

Similar to Sections A.2.1 and A.2.2, if we consider∑

m q(m|O) = 1 constraint, the functionaldifferentiation is represented by substituting respectivelyFm andq(m|O) into H andgn in Eq.(A.17) as follows:

δ

δq(m′|O)

(⟨log

q(m|O)

p(m)−Fm

⟩q(m|O)

+ K

(∑m

q(m|O) − 1

))

= limε→0

1

ε

(⟨log

q(m|O) + εδmm′

p(m)−Fm

⟩q(m|O)+εδmm′

−⟨

logq(m|O)

p(m)−Fm

⟩q(m|O)

+ K

(∑m

q(m|O) + εδmm′ − 1

)− K

(∑m

q(m|O) − 1

))(A.30)

whereK is a Lagrange’s undetermined multiplier. We focus on the fist term in the brackets () inthe 2nd line of Eq. (A.30). By expanding the expectation, the first term is represented as follows:⟨

logq(m|O) + εδmm′

p(m)−Fm

⟩q(m|O)+εδmm′

=∑m

(q(m|O) + εδmm′)

(log

q(m|O)

p(m)+ log

(1 + ε

δmm′

q(m|O)

)−Fm

) (A.31)


Equation (A.31)

=

⟨log

q(m|O)

p(m)−Fm

⟩q(m|O)

+ ε

(log

q(m′|O)

p(m′)−Fm′

+ 1

)+ o(ε2)

(A.32)

Therefore, by substituting Eq. (A.32) into Eq. (A.30), Eq. (A.30) is represented as:

Equation (A.30)= logq(m′|O)

p(m′)−Fm′

+ 1 + K. (A.33)

A.3 VB posterior calculation 103

Therefore, the optimal posterior (VB posterior)q(m|O) satisfies the relation whereby Eq. (A.33)= 0, and is obtained as:

logq(m|O)

p(m)−Fm + 1 + K = 0. (A.34)


q(m|O) ∝ p(m) exp (Fm) , (A.35)


A.3 VB posterior calculation


From the output distribution and prior distribution in Section 2.3.1, we can obtain the optimal VBposterior distribution for model parametersq(Θ|O,m). We first derive the concrete form of theexpectation of the logarithmic output distributionlog p(O, S, V |Θ,m) with respect to VB posteriorfor latent variableslog p(O, S, V |Θ,m) as follows:

〈log p(O, S, V |Θ,m)〉eq(S,V |O,m) =

∑i,j

γij log aij +∑j,k

(ζjk log wjk +

∑e,t

ζte,jk log bjk(O

te)

)(A.36)

This is obtained by substituting Eq. (2.22) into Eq. (A.36) and regrouping the terms in the sum-mations according to state transitions and observed symbols. By using Eq. (A.36),q(Θ|O,m) isrepresented as follows:

q(Θ|O,m)

∝ p(Θ|m) exp(〈log p(O, S, V |Θ,m)〉

eq(S,V |O,m)

)= p(Θ|m) exp

(∑i,j

γij log aij +∑j,k

(ζjk log wjk +

∑e,t

ζte,jk log bjk(O

te)

)) (A.37)

where γt

e,ij ≡ q(st−1e = i, st

e = j|O,m)

γij ≡∑E

e=1

∑Te

t=1 γte,ij

ζte,jk ≡ q(st

e = j, vte = k|O, m)

ζjk ≡∑E

e=1

∑Te

t=1 ζte,jk

. (A.38)

Here,γte,ij is a VB transition posterior distribution, which denotes the transition probability from

statei to statej at a framet of an examplee, andζte,jk is a VB occupation posterior distribution,

which denotes the occupation probability of mixture componentk in statej at a framet of anexamplee, with the VB approach.

104 APENDICES

From Eqs. (2.24) and (A.37), the factors of the optimal VB posterior distributions, each ofwhich depends onaij, wjk andbjk(O

te), is decomposed as follows:

q({aij}Jj=1|O,m) ∝ p({aij}J

j=1|m)∏

j(aij)eγij

q({wjk}Lk=1|O,m) ∝ p({wjk}L

k=1|m)∏

k(wjk)eζjk .

q(bjk|O,m) ∝ p(bjk|m)∏

e,t(bjk(Ote))

eζte,jk

(A.39)

Therefore we can derive the concrete form of each factor.

State transition probability a

By focusing on a term that depends on a probabilistic variableaij, the concrete form ofq({aij}Jj=1|O,m)

can be calculated from Eqs. (2.25) and (A.39) as follows:

q({aij}Jj=1|O,m) ∝

∏j

(aij)eφij−1, (A.40)

whereφij ≡ φ0ij + γij. Therefore, by considering the normalization constant,q({aij}J

j=1|O,m) isobtained as follows:

q({aij}Jj=1|O,m) = CD({φij}J

j=1)∏

j

(aij)eφij−1

= D({aij}Jj=1|{φij}J

j=1),

(A.41)

where

CD({φij}Jj=1) ≡

Γ(∑J

j=1 φij)∏Jj=1 Γ(φij)

. (A.42)

Weight factor w

Similarly, the concrete form ofq({wjk}Lk=1|O,m) is obtained from Eqs. (2.25) and (A.39) as

follows:

q({wjk}Lk=1|O,m) = CD({ϕjk}L

k=1)∏

k

(wjk)eϕjk−1

= D({wjk}Lk=1|{ϕjk}L

k=1),

(A.43)

whereϕjk ≡ ϕ0jk + ζjk and

CD({ϕjk}Lk=1) ≡

Γ(∑L

k=1 ϕjk)∏Lk=1 Γ(ϕjk)

. (A.44)

A.3 VB posterior calculation 105

Gaussian parametersµ and Σ

Finally, the concrete form ofq(bjk|O,m) can be derived from Eqs. (2.25) and (A.39) as follows:Since the calculation is more complicated than the two previous calculations, the indexesj andk

are removed to simplify the derivation.

q(b|O,m) ∝∏

d

Σ− 1

2d exp

(−1

2

(ξΣ−1

d (µd − νd)2))

(Σ−1d )

eη2−1 exp

(− Rd

2Σd

), (A.45)

where ξ ≡ ξ0 + ζ, ν ≡(ξ0ν0 +

∑e,t ζ

teO

te

)/ξ, η ≡ η0 + ζ, Rd ≡ R0

d + ξ0(ν0d − νd)

2 +∑e,t ζ

te(O

te,d − νd)

2. Consequently, by considering normalization constants, the concrete form ofq(b|O,m) is obtained as follows:

q(b|O,m)

=

(CN (ξ)

∏d

exp

(−1

2

(ξΣ−1

d (µd − νd)2)))

·

(∏d

CG(η, Rd)(Σ−1d )

eη2−1 exp

(− Rd

2Σd

))= N (µ|ν, ξ−1Σ)

∏d

G(Σ−1d |η, Rd),

(A.46)

where CN (ξ) ≡ (ξ/2π)D2

CG(η, Rd) ≡(Rd/2

) eη2/Γ(η/2)

. (A.47)

Thus, the VB posterior distribution for model parameters are analytically obtained as (2.27), (2.28)and (2.30) by summarizing the calculation results.


From the output distribution and prior distribution in Section 2.3.1, the optimal VB posterior dis-tribution for latent variablesq(S, V |O,m) is represented by substituting Eqs. (2.22) and (2.27)into Eq. (2.16) as follows:

q(S, V |O,m) ∝∏e,t

exp(〈log ast−1

e ste〉

eq({aij′}Jj′=1

|O,m)

)× exp

(〈log wst

evte〉

eq({wjk′}Lk′=1

|O,m)

)exp

(〈log bst

evte(Ot

e)〉eq(bjk|O,m)

).

(A.48)

We calculate each factor in this equation, changing the suffixset to i or j, and the suffixve

t to k tosimplify the derivation.

Weight factor a

First, the integral overaij is solved from Eq. (A.41) by using a partial integral technique and anormalization constant, andaij, which denotes the state transition probability from statei to state

106 APENDICES

j in the VB approach, is defined as follows:

〈log aij〉eq({aij′}J

j′=1|O,m) = CD({φij′}J

j′=1)

∫log aij

∏j′

(aij′)eφij′−1daij′

= CD({φij′}Jj′=1)

∂

∂φij

1

CD({φij′}Jj′=1)

= Ψ(φij) − Ψ(∑

j′φij′

)≡ log aij,

(A.49)

whereΨ(y) is a digamma function defined asΨ(y) ≡ ∂/∂y log Γ(y).

Weight factor w

In a way similar to that used foraij, the integral overwjk is solved from Eq. (A.43), andwjk,which denotes thek-th weight factor of the Gaussian mixture for statej in the VB approach, isdefined as follows:

〈log wjk〉eq({wjk′}J

k′=1|O,m) = Ψ(ϕjk) − Ψ

(∑k′

ϕjk′

)≡ log wjk. (A.50)

Gaussian parametersµ and Σ

Finally, the integrals overbjk(= {µjk, Σ−1jk }) are solved from Eqs. (2.23) and (A.46), andbjk(O

te)

is defined. Since the calculation is more complicated than the two previous calculations, the in-dexesj andk are removed to simplify the derivation.

〈log b(Ote)〉eq(b|O,m)

=

∫ ∏d′

N (µd′|νd′ , ξ−1Σd′)G(Σ−1

d′ |η, Rd′)

×

(−1

2

∑d

(log 2π − log(Σ−1

d ) + Σ−1d (Ot

e,d − µd)2))

dµd′d(Σ−1d′ )

= −1

2

∫ ∏d′

G(Σ−1d′ |η, Rd′)

∑d

(log 2π +

1

ξ− log(Σ−1

d ) + Σ−1d (Ot

e,d − νd)2

)d(Σ−1

d′ )

= −D

2

(log 2π +

1

ξ− Ψ

(η

2

))− 1

2

∑d

(log

Rd

2+ (Ot

e,d − νd)2 η

Rd

)≡ log b(Ot

e).

(A.51)

A.4 Student’s t-distribution using VB posteriors

In this section, we explain the derivation of the Student’s t-distribution using the VB posteriors inEq. (5.10). The indexes of phoneme categoryc and framet are removed to simplify the derivation.

A.4 Student’s t-distribution using VB posteriors 107

Substituting the VB posteriors of Eq. (2.27) into Eq. (5.10), we can obtain the following equation:∫p(x|Θij)q(Θij|O)dΘij

=

∫aij

∑k

wjkN (x|µjk, Σjk)D({aij}j|{ϕij}j)D({wjk}k|{ϕjk}k)

×N (µjk|νjk, (ξjk)−1Σjk)

∏d

G(Σ−1jk,d|ηjk, Rjk,d)daijdwjkdµjkd(Σ−1

jk ).

(A.52)

The concrete forms of the posterior distribution functions are shown in Eq. (2.28). Therefore, wesubstitute the factors in Eq. (2.28) that depend on the integral variables into Eq. (A.52). Integratingwith respect toaij andwjk and grouping the arguments of the exponential functions, the equationis represented as follows:∫

p(x|Θij)q(Θij|O)dΘij

∝∫

aij

∏j′

(aij′)φij′−1daij

∫ ∑k

wjk

∏k′

(wjk′)ϕjk′−1dwjk∫ ∏d

(Σjk,d)− 1

2 exp

(−1

2(Σ−1

jk,d)(xd − µjk,d)2

)∏

d

exp

(−ξjk

2(Σ−1

jk,d)(µjk,d − νjk,d)2

) (Σ−1

jk,d

) ηjk2

−1exp

(−(Σ−1

jk,d)Rjk,d

2

)dµjkd(Σ−1

jk )

∝ φij∑j φij

∑k

ϕjk∑k ϕjk∫ ∏

d

(Σ−1

jk,d

) ηjk2 exp

(−

Σ−1jk,d

2

((xd − µjk,d)

2 + ξjk(µjk,d − νjk,d)2 + Rjk,d

))dµjkd(Σ−1

jk ).

(A.53)

We focus on the integral with respect toµjk,d and sjk,d ≡ Σ−1jk,d The indexes of statei, j and

mixture componentk are removed to simplify the derivation. In addition, we adopt a diagonalcovariance matrix, which does not consider the correlation between dimensions, and the integrationcan be performed for each feature dimension independently. Therefore we also remove the indexof dimensiond. First, we focus on the integration with respect toµ, and completing the squarewith respect toµ. Then, by integrating with respect toµ, and arranging the equation, the followingequation is obtained:∫

sη2 exp

(−s

2

((x − µ)2 + ξ(µ − ν)2 + r

))dµ

=

∫s

η2 exp

(−s

2

((1 + ξ)

(µ − x + ξν

1 + ξ

)2

− (x + ξν)2

1 + ξ+ x2 + ξν2 + r

))dµ

∝ sη−12 exp

(−s

2

(−(x + ξν)2

1 + ξ+ x2 + ξν2 + r

))= s

η+12

−1 exp

(−s

ξ(x − ν)2 + (1 + ξ)r

2(1 + ξ)

).

(A.54)

108 APENDICES

Here we discuss the case when the VB posterior for variance is the Dirac delta function in theVB-BPC-MEAN discussion in Section 5.2.1, and the argument of its Dirac delta function is themaximum value of the VB posterior. Then, the result of the integration with respect tos is obtainedby changings to ηr−1 in Eq. (A.54). Therefore, by substituting it into Eq. (A.53) and adding theremoved indexesi, j, k, c, t andd, the following equation is obtained:

φ(c)ij∑

j φ(c)ij

∑k

ϕ(c)jk∑

k ϕ(c)jk

N

(xt

∣∣∣∣∣ν(c)jk ,

(1 + ξ(c)jk )R

(c)jk

ξ(c)jk η

(c)jk

). (A.55)

This is the analytical result of the predictive distribution for VB-BPC-MEAN based on the mixtureof Gaussian distributions.

In Eq. (A.54), by integrating with respect tos, and arranging the equation, the followingequation is obtained: ∫

sη+12

−1 exp

(−s

ξ(x − ν)2 + (1 + ξ)r

2(1 + ξ)

)ds

∝(

ξ(x − ν)2 + (1 + ξ)r

2(1 + ξ)

)− η+12

∝(

1 +ξ

(1 + ξ)r(x − ν)2

)− η+12

.

(A.56)

Here we refer to the concrete form of the Student’s t-distribution given in Eq. (5.11). The pa-rametersω, κ andλ of the Student’s t-distribution correspond to those of the above equation asfollows:

ω = νκ = η

λ = (1+ξ)rξη

. (A.57)

Thus, the result of integrating with respect toµ ands is represented as the Student’s t-distribution.

St

(x

∣∣∣∣ν, (1 + ξ)r

ξη, η

). (A.58)

Finally, by substituting Eq. (A.58) into Eq. (A.53) and adding the removed indexesi, j, k, c, t

andd, we can obtain the analytical result of the predictive distribution using the VB posteriors asfollows: ∫

p(xt|Θ(c)ij )q(Θ

(c)ij |O)dΘ

(c)ij

=φ

(c)ij∑

j φ(c)ij

∑k

ϕ(c)jk∑

k ϕ(c)jk

∏d

St

(xt

d

∣∣∣∣∣ν(c)jk,d,

(1 + ξ(c)jk )R

(c)jk,d

ξ(c)jk η

(c)jk

, η(c)jk

).

(A.59)

Thus, the predictive distribution for VB-BPC is obtained analytically based on the mixture of theStudent’s t-distributions.

Documents

Shinji Watanabe Phd Thesis