
The Variational Bayes Method in Signal Processing



Springer Series on

Signals and Communication Technology


Circuits and Systems Based on Delta Modulation: Linear, Nonlinear and Mixed Mode Processing. D.G. Zrilic. ISBN 3-540-23751-8

Functional Structures in Networks: AMLn – A Language for Model Driven Development of Telecom Systems. T. Muth. ISBN 3-540-22545-5

Radio Wave Propagation for Telecommunication Applications. H. Sizun. ISBN 3-540-40758-8

Electronic Noise and Interfering Signals: Principles and Applications. G. Vasilescu. ISBN 3-540-40741-3

DVB: The Family of International Standards for Digital Video Broadcasting, 2nd ed. U. Reimers. ISBN 3-540-43545-X

Digital Interactive TV and Metadata: Future Broadcast Multimedia. A. Lugmayr, S. Niiranen, and S. Kalli. ISBN 3-387-20843-7

Adaptive Antenna Arrays: Trends and Applications. S. Chandran (Ed.). ISBN 3-540-20199-8

Digital Signal Processing with Field Programmable Gate Arrays. U. Meyer-Baese. ISBN 3-540-21119-5

Neuro-Fuzzy and Fuzzy Neural Applications in Telecommunications. P. Stavroulakis (Ed.). ISBN 3-540-40759-6

SDMA for Multipath Wireless Channels: Limiting Characteristics and Stochastic Models. I.P. Kovalyov. ISBN 3-540-40225-X

Digital Television: A Practical Guide for Engineers. W. Fischer. ISBN 3-540-01155-2

Multimedia Communication Technology: Representation, Transmission and Identification of Multimedia Signals. J.R. Ohm. ISBN 3-540-01249-4

Information Measures: Information and its Description in Science and Engineering. C. Arndt. ISBN 3-540-40855-X

Processing of SAR Data: Fundamentals, Signal Processing, Interferometry. A. Hein. ISBN 3-540-05043-4

Chaos-Based Digital Communication Systems: Operating Principles, Analysis Methods, and Performance Evaluation. F.C.M. Lau and C.K. Tse. ISBN 3-540-00602-8

Adaptive Signal Processing: Application to Real-World Problems. J. Benesty and Y. Huang (Eds.). ISBN 3-540-00051-8

Multimedia Information Retrieval and Management: Technological Fundamentals and Applications. D. Feng, W.C. Siu, and H.J. Zhang (Eds.). ISBN 3-540-00244-8

Structured Cable Systems. A.B. Semenov, S.K. Strizhakov, and I.R. Suncheley. ISBN 3-540-43000-8

UMTS: The Physical Layer of the Universal Mobile Telecommunications System. A. Springer and R. Weigel. ISBN 3-540-42162-9

Advanced Theory of Signal Detection: Weak Signal Detection in Generalized Observations. I. Song, J. Bae, and S.Y. Kim. ISBN 3-540-43064-4

Wireless Internet Access over GSM and UMTS. M. Taferner and E. Bonek. ISBN 3-540-42551-9

The Variational Bayes Method in Signal Processing. V. Šmídl and A. Quinn. ISBN 3-540-28819-8


Václav Šmídl
Anthony Quinn

The Variational Bayes Method in Signal Processing

With 65 Figures



Dr. Václav Šmídl
Institute of Information Theory and Automation
Academy of Sciences of the Czech Republic, Department of Adaptive Systems
PO Box 18, 18208 Praha 8, Czech Republic
E-mail: [email protected]

Dr. Anthony Quinn
Department of Electronic and Electrical Engineering
University of Dublin, Trinity College
Dublin 2, Ireland
E-mail: [email protected]

ISBN-10 3-540-28819-8 Springer Berlin Heidelberg New York

ISBN-13 978-3-540-28819-0 Springer Berlin Heidelberg New York

Library of Congress Control Number: 2005934475

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media.

springer.com

© Springer-Verlag Berlin Heidelberg 2006
Printed in Germany

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting and production: SPI Publisher Services
Cover design: design & production GmbH, Heidelberg

Printed on acid-free paper SPIN: 11370918 62/3100/SPI - 5 4 3 2 1 0


Do mo Thuismitheoirí

A.Q.


Preface

Gaussian linear modelling cannot address current signal processing demands. In modern contexts, such as Independent Component Analysis (ICA), progress has been made specifically by imposing non-Gaussian and/or non-linear assumptions. Hence, standard Wiener and Kalman theories no longer enjoy their traditional hegemony in the field as the standard computational engines for these problems. In their place, diverse principles have been explored, leading to a consequent diversity in the implied computational algorithms. The traditional on-line and data-intensive preoccupations of signal processing continue to demand that these algorithms be tractable.

Increasingly, full probability modelling (the so-called Bayesian approach)—or partial probability modelling using the likelihood function—is the pathway for design of these algorithms. However, the results are often intractable, and so the area of distributional approximation is of increasing relevance in signal processing. The Expectation-Maximization (EM) algorithm and Laplace approximation, for example, are standard approaches to handling difficult models, but these approximations (certainty equivalence, and Gaussian, respectively) are often too drastic to handle the high-dimensional, multi-modal and/or strongly correlated problems that are encountered. Since the 1990s, stochastic simulation methods have come to dominate Bayesian signal processing. Markov Chain Monte Carlo (MCMC) sampling, and related methods, are appreciated for their ability to simulate possibly high-dimensional distributions to arbitrary levels of accuracy. More recently, the particle filtering approach has addressed on-line stochastic simulation. Nevertheless, the wider acceptability of these methods—and, to some extent, Bayesian signal processing itself—has been undermined by the large computational demands they typically make.

The Variational Bayes (VB) method of distributional approximation originates—as does the MCMC method—in statistical physics, in the area known as Mean Field Theory. Its method of approximation is easy to understand: conditional independence is enforced as a functional constraint in the approximating distribution, and the best such approximation is found by minimization of a Kullback-Leibler divergence (KLD). The exact—but intractable—multivariate distribution is therefore factorized into a product of tractable marginal distributions, the so-called VB-marginals. This straightforward proposal for approximating a distribution enjoys certain optimality properties. What is of more pragmatic concern to the signal processing community, however, is that the VB-approximation conveniently addresses the following key tasks:

1. The inference is focused (or, more formally, marginalized) onto selected subsets of parameters of interest in the model: this one-shot (i.e. off-line) use of the VB method can replace numerically intensive marginalization strategies based, for example, on stochastic sampling.

2. Parameter inferences can be arranged to have an invariant functional form when updated in the light of incoming data: this leads to feasible on-line tracking algorithms involving the update of fixed- and finite-dimensional statistics. In the language of the Bayesian, conjugacy can be achieved under the VB-approximation. There is no reliance on propagating certainty equivalents, stochastically-generated particles, etc.

Unusually for a modern Bayesian approach, then, no stochastic sampling is required for the VB method. In its place, the shaping parameters of the VB-marginals are found by iterating a set of implicit equations to convergence. This Iterative Variational Bayes (IVB) algorithm enjoys a decisive advantage over the EM algorithm, whose computational flow is similar: by design, the VB method yields distributions in place of the point estimates emerging from the EM algorithm. Hence, in common with all Bayesian approaches, the VB method provides, for example, measures of uncertainty for any point estimates of interest, inferences of model order/rank, etc.

The machine learning community has led the way in exploiting the VB method in model-based inference, notably in inference for graphical models. It is timely, however, to examine the VB method in the context of signal processing where, to date, little work has been reported. In this book, at all times, we are concerned with the way in which the VB method can lead to the design of tractable computational schemes for tasks such as (i) dimensionality reduction, (ii) factor analysis for medical imagery, (iii) on-line filtering of outliers and other non-Gaussian noise processes, (iv) tracking of non-stationary processes, etc. Our aim in presenting these VB algorithms is not just to reveal new flows-of-control for these problems, but—perhaps more significantly—to understand the strengths and weaknesses of the VB-approximation in model-based signal processing. In this way, we hope to dismantle the current psychology of dependence in the Bayesian signal processing community on stochastic sampling methods. Without doubt, the ability to model complex problems to arbitrary levels of accuracy will ensure that stochastic sampling methods—such as MCMC—will remain the gold standard for distributional approximation. Notwithstanding this, our purpose here is to show that the VB method of approximation can yield highly effective Bayesian inference algorithms at low computational cost. In showing this, we hope that Bayesian methods might become accessible to a much broader constituency than has been achieved to date.

Praha, Dublin                                    Václav Šmídl
October 2005                                     Anthony Quinn


Contents

1 Introduction ..... 1
1.1 How to be a Bayesian ..... 1
1.2 The Variational Bayes (VB) Method ..... 2
1.3 A First Example of the VB Method: Scalar Additive Decomposition ..... 3
1.3.1 A First Choice of Prior ..... 3
1.3.2 The Prior Choice Revisited ..... 4
1.4 The VB Method in its Context ..... 6
1.5 VB as a Distributional Approximation ..... 8
1.6 Layout of the Work ..... 10
1.7 Acknowledgement ..... 11

2 Bayesian Theory ..... 13
2.1 Bayesian Benefits ..... 13
2.1.1 Off-line vs. On-line Parametric Inference ..... 14
2.2 Bayesian Parametric Inference: the Off-Line Case ..... 15
2.2.1 The Subjective Philosophy ..... 16
2.2.2 Posterior Inferences and Decisions ..... 16
2.2.3 Prior Elicitation ..... 18
2.2.3.1 Conjugate priors ..... 19
2.3 Bayesian Parametric Inference: the On-line Case ..... 19
2.3.1 Time-invariant Parameterization ..... 20
2.3.2 Time-variant Parameterization ..... 20
2.3.3 Prediction ..... 22
2.4 Summary ..... 22

3 Off-line Distributional Approximations and the Variational Bayes Method ..... 25
3.1 Distributional Approximation ..... 25
3.2 How to Choose a Distributional Approximation ..... 26
3.2.1 Distributional Approximation as an Optimization Problem ..... 26
3.2.2 The Bayesian Approach to Distributional Approximation ..... 27


3.3 The Variational Bayes (VB) Method of Distributional Approximation ..... 28
3.3.1 The VB Theorem ..... 28
3.3.2 The VB Method of Approximation as an Operator ..... 32
3.3.3 The VB Method ..... 33
3.3.4 The VB Method for Scalar Additive Decomposition ..... 37
3.4 VB-related Distributional Approximations ..... 39
3.4.1 Optimization with Minimum-Risk KL Divergence ..... 39
3.4.2 Fixed-form (FF) Approximation ..... 40
3.4.3 Restricted VB (RVB) Approximation ..... 40
3.4.3.1 Adaptation of the VB method for the RVB Approximation ..... 41
3.4.3.2 The Quasi-Bayes (QB) Approximation ..... 42
3.4.4 The Expectation-Maximization (EM) Algorithm ..... 44
3.5 Other Deterministic Distributional Approximations ..... 45
3.5.1 The Certainty Equivalence Approximation ..... 45
3.5.2 The Laplace Approximation ..... 45
3.5.3 The Maximum Entropy (MaxEnt) Approximation ..... 45
3.6 Stochastic Distributional Approximations ..... 46
3.6.1 Distributional Estimation ..... 47
3.7 Example: Scalar Multiplicative Decomposition ..... 48
3.7.1 Classical Modelling ..... 48
3.7.2 The Bayesian Formulation ..... 48
3.7.3 Full Bayesian Solution ..... 49
3.7.4 The Variational Bayes (VB) Approximation ..... 51
3.7.5 Comparison with Other Techniques ..... 54
3.8 Conclusion ..... 56

4 Principal Component Analysis and Matrix Decompositions ..... 57
4.1 Probabilistic Principal Component Analysis (PPCA) ..... 58
4.1.1 Maximum Likelihood (ML) Estimation for the PPCA Model ..... 59
4.1.2 Marginal Likelihood Inference of A ..... 61
4.1.3 Exact Bayesian Analysis ..... 61
4.1.4 The Laplace Approximation ..... 62
4.2 The Variational Bayes (VB) Method for the PPCA Model ..... 62
4.3 Orthogonal Variational PCA (OVPCA) ..... 69
4.3.1 The Orthogonal PPCA Model ..... 70
4.3.2 The VB Method for the Orthogonal PPCA Model ..... 70
4.3.3 Inference of Rank ..... 77
4.3.4 Moments of the Model Parameters ..... 78
4.4 Simulation Studies ..... 79
4.4.1 Convergence to Orthogonal Solutions: VPCA vs. FVPCA ..... 79
4.4.2 Local Minima in FVPCA and OVPCA ..... 82
4.4.3 Comparison of Methods for Inference of Rank ..... 83
4.5 Application: Inference of Rank in a Medical Image Sequence ..... 85
4.6 Conclusion ..... 87


5 Functional Analysis of Medical Image Sequences ..... 89
5.1 A Physical Model for Medical Image Sequences ..... 90
5.1.1 Classical Inference of the Physiological Model ..... 92
5.2 The FAMIS Observation Model ..... 92
5.2.1 Bayesian Inference of FAMIS and Related Models ..... 94
5.3 The VB Method for the FAMIS Model ..... 94
5.4 The VB Method for FAMIS: Alternative Priors ..... 99
5.5 Analysis of Clinical Data Using the FAMIS Model ..... 102
5.6 Conclusion ..... 107

6 On-line Inference of Time-Invariant Parameters ..... 109
6.1 Recursive Inference ..... 110
6.2 Bayesian Recursive Inference ..... 110
6.2.1 The Dynamic Exponential Family (DEF) ..... 112
6.2.2 Example: The AutoRegressive (AR) Model ..... 114
6.2.3 Recursive Inference of non-DEF models ..... 117
6.3 The VB Approximation in On-Line Scenarios ..... 118
6.3.1 Scenario I: VB-Marginalization for Conjugate Updates ..... 118
6.3.2 Scenario II: The VB Method in One-Step Approximation ..... 121
6.3.3 Scenario III: Achieving Conjugacy in non-DEF Models via the VB Approximation ..... 123
6.3.4 The VB Method in the On-Line Scenarios ..... 126
6.4 Related Distributional Approximations ..... 127
6.4.1 The Quasi-Bayes (QB) Approximation in On-Line Scenarios ..... 128
6.4.2 Global Approximation via the Geometric Approach ..... 128
6.4.3 One-step Fixed-Form (FF) Approximation ..... 129
6.5 On-line Inference of a Mixture of AutoRegressive (AR) Models ..... 130
6.5.1 The VB Method for AR Mixtures ..... 130
6.5.2 Related Distributional Approximations for AR Mixtures ..... 133
6.5.2.1 The Quasi-Bayes (QB) Approximation ..... 133
6.5.2.2 One-step Fixed-Form (FF) Approximation ..... 135
6.5.3 Simulation Study: On-line Inference of a Static Mixture ..... 135
6.5.3.1 Inference of a Many-Component Mixture ..... 136
6.5.3.2 Inference of a Two-Component Mixture ..... 136
6.5.4 Data-Intensive Applications of Dynamic Mixtures ..... 139
6.5.4.1 Urban Vehicular Traffic Prediction ..... 141
6.6 Conclusion ..... 143

7 On-line Inference of Time-Variant Parameters ..... 145
7.1 Exact Bayesian Filtering ..... 145
7.2 The VB-Approximation in Bayesian Filtering ..... 147
7.2.1 The VB Method for Bayesian Filtering ..... 149
7.3 Other Approximation Techniques for Bayesian Filtering ..... 150
7.3.1 Restricted VB (RVB) Approximation ..... 150
7.3.2 Particle Filtering ..... 152


7.3.3 Stabilized Forgetting ..... 153
7.3.3.1 The Choice of the Forgetting Factor ..... 154
7.4 The VB-Approximation in Kalman Filtering ..... 155
7.4.1 The VB Method ..... 156
7.4.2 Loss of Moment Information in the VB Approximation ..... 158
7.5 VB-Filtering for the Hidden Markov Model (HMM) ..... 158
7.5.1 Exact Bayesian Filtering for Known T ..... 159
7.5.2 The VB Method for the HMM Model with Known T ..... 160
7.5.3 The VB Method for the HMM Model with Unknown T ..... 162
7.5.4 Other Approximate Inference Techniques ..... 164
7.5.4.1 Particle Filtering ..... 164
7.5.4.2 Certainty Equivalence Approach ..... 165
7.5.5 Simulation Study: Inference of Soft Bits ..... 166
7.6 The VB-Approximation for an Unknown Forgetting Factor ..... 168
7.6.1 Inference of a Univariate AR Model with Time-Variant Parameters ..... 169
7.6.2 Simulation Study: Non-stationary AR Model Inference via Unknown Forgetting ..... 173
7.6.2.1 Inference of an AR Process with Switching Parameters ..... 173
7.6.2.2 Initialization of Inference for a Stationary AR Process ..... 174
7.7 Conclusion ..... 176

8 The Mixture-based Extension of the AR Model (MEAR) ..... 179
8.1 The Extended AR (EAR) Model ..... 179
8.1.1 Bayesian Inference of the EAR Model ..... 181
8.1.2 Computational Issues ..... 182
8.2 The EAR Model with Unknown Transformation: the MEAR Model ..... 182
8.3 The VB Method for the MEAR Model ..... 183
8.4 Related Distributional Approximations for MEAR ..... 186
8.4.1 The Quasi-Bayes (QB) Approximation ..... 186
8.4.2 The Viterbi-Like (VL) Approximation ..... 187
8.5 Computational Issues ..... 188
8.6 The MEAR Model with Time-Variant Parameters ..... 191
8.7 Application: Inference of an AR Model Robust to Outliers ..... 192
8.7.1 Design of the Filter-bank ..... 192
8.7.2 Simulation Study ..... 193
8.8 Application: Inference of an AR Model Robust to Burst Noise ..... 196
8.8.1 Design of the Filter-Bank ..... 196
8.8.2 Simulation Study ..... 197
8.8.3 Application in Speech Reconstruction ..... 201
8.9 Conclusion ..... 201


9 Concluding Remarks ..... 205
9.1 The VB Method ..... 205
9.2 Contributions of the Work ..... 206
9.3 Current Issues ..... 206
9.4 Future Prospects for the VB Method ..... 207

Required Probability Distributions ..... 209
A.1 Multivariate Normal distribution ..... 209
A.2 Matrix Normal distribution ..... 209
A.3 Normal-inverse-Wishart (NiW_{A,Ω}) Distribution ..... 210
A.4 Truncated Normal Distribution ..... 211
A.5 Gamma Distribution ..... 212
A.6 Von Mises-Fisher Matrix distribution ..... 212
A.6.1 Definition ..... 213
A.6.2 First Moment ..... 213
A.6.3 Second Moment and Uncertainty Bounds ..... 214
A.7 Multinomial Distribution ..... 215
A.8 Dirichlet Distribution ..... 215
A.9 Truncated Exponential Distribution ..... 216

References ..... 217

Index ..... 225


Notational Conventions

Linear Algebra

R, X, Θ∗          Set of real numbers, set of elements x, and set of elements θ, respectively.
x                 x ∈ R, a real scalar.
A ∈ Rn×m          Matrix of dimensions n × m, generally denoted by a capital letter.
ai, ai,D          ith column of matrix A, AD, respectively.
ai,j, ai,j,D      (i, j)th element of matrix A, AD, respectively, i = 1, ..., n, j = 1, ..., m.
bi, bi,D          ith element of vector b, bD, respectively.
diag (·)          A = diag (a), a ∈ Rq, yields ai,j = ai if i = j, and ai,j = 0 if i ≠ j, i, j = 1, ..., q.
a                 Diagonal vector of a given matrix A (the context will distinguish this from a scalar, a (see 2nd entry, above)).
diag−1 (·)        a = diag−1 (A), A ∈ Rn×m, yields a = [a1,1, ..., aq,q]′, q = min (n, m).
A;r, AD;r         Operator selecting the first r columns of matrix A, AD, respectively.
A;r,r, AD;r,r     Operator selecting the r × r upper-left sub-block of matrix A, AD, respectively.
a;r, aD;r         Operator extracting the upper length-r sub-vector of vector a, aD, respectively.
A(r) ∈ Rn×m       Subscript (r) denotes matrix A with restricted rank, rank (A) = r ≤ min (n, m).
A′                Transpose of matrix A.
Ir ∈ Rr×r         Square identity matrix.
1p,q, 0p,q        Matrix of size p × q with all elements equal to one, zero, respectively.
tr (A)            Trace of matrix A.


a = vec (A)       Operator restructuring the elements of A = [a1, ..., an] into a vector a = [a1′, ..., an′]′.
A = vect (a, p)   Operator restructuring the elements of vector a ∈ Rpn into a matrix A ∈ Rp×n, whose kth column is [a(k−1)p+1, ..., akp]′, k = 1, ..., n.
A = UA LA VA′     Singular Value Decomposition (SVD) of matrix A ∈ Rn×m. In this monograph, the SVD is expressed in the ‘economic’ form, where UA ∈ Rn×q, LA ∈ Rq×q, VA ∈ Rm×q, q = min (n, m).
[A ⊗ B] ∈ Rnp×mq  Kronecker product of matrices A ∈ Rn×m and B ∈ Rp×q, i.e. the block matrix whose (i, j)th block is ai,jB.
[A ∘ B] ∈ Rn×m    Hadamard (element-wise) product of matrices A ∈ Rn×m and B ∈ Rn×m, whose (i, j)th element is ai,jbi,j.

Set Algebra

Ac                Set of objects A with cardinality c.
A(i)              ith element of set Ac, i = 1, ..., c.

Analysis

χX (·)            Indicator (characteristic) function of set X.
erf (x)           Error function: erf (x) = (2/√π) ∫_0^x exp (−t²) dt.
ln (A), exp (A)   Natural logarithm and exponential of matrix A, respectively. Both operations are performed on the elements of the matrix (or vector), e.g. ln ([a1, a2]′) = [ln a1, ln a2]′.
Γ (x)             Gamma function, Γ (x) = ∫_0^∞ t^(x−1) exp (−t) dt, x > 0.
ψΓ (x)            Digamma (psi) function, ψΓ (x) = ∂/∂x ln Γ (x).


Γr (½p)           Multi-gamma function: Γr (½p) = π^(r(r−1)/4) ∏_{j=1}^r Γ (½ (p − j + 1)), r ≤ p.
0F1 (a, AA′)      Hypergeometric function, pFq (·), with p = 0, q = 1, scalar parameter a, and symmetric matrix parameter, AA′.
δ (x)             δ-type function. The exact meaning is determined by the type of the argument, x. If x is a continuous variable, then δ (x) is the Dirac δ-function: ∫_X δ (x − x0) g (x) dx = g (x0), where x, x0 ∈ X. If x is an integer, then δ (x) is the Kronecker function: δ (x) = 1 if x = 0, and δ (x) = 0 otherwise.
εp (i)            ith elementary vector of Rp, i = 1, ..., p: εp (i) = [δ (i − 1), δ (i − 2), ..., δ (i − p)]′.
I(a,b]            Interval (a, b] in R.

Probability Calculus

Pr (·)            Probability of the given argument.
f (x|θ)           Distribution of (discrete or continuous) random variable x, conditioned by known θ.
f (x)             Variable distribution to be optimized (‘wildcard’ in functional optimization).
x[i], f[i] (x)    x and f (x) in the ith iteration of an iterative algorithm.
θ̂                 Point estimate of unknown parameter θ.
E_{f(x)} [·]      Expected value of the argument with respect to the distribution, f (x).
ĝ (x)             Simplified notation for E_{f(x)} [g (x)].
x̄, x̲              Upper bound, lower bound, respectively, on the range of random variable x.
Nx (µ, r)         Scalar Normal distribution of x with mean value, µ, and variance, r.
Nx (µ, Σ)         Multivariate Normal distribution of x with mean value, µ, and covariance matrix, Σ.
NX (M, Σp ⊗ Σn)   Matrix Normal distribution of X with mean value, M, and covariance matrices, Σp and Σn.


tNx (µ, r; X)     Truncated scalar Normal distribution of x, of type N (µ, r), confined to the support set X ⊂ R.
MX (F)            Von Mises-Fisher matrix distribution of X with matrix parameter, F.
Gx (α, β)         Scalar Gamma distribution of x with parameters, α and β.
Ux (X)            Scalar Uniform distribution of x on the support set X ⊂ R.


List of Acronyms

AR      AutoRegressive (model, process)
ARD     Automatic Rank Determination (property)
CDEF    Conjugate (parameter) distribution to a DEF (observation) model
DEF     Dynamic Exponential Family
DEFS    Dynamic Exponential Family with Separable parameters
DEFH    Dynamic Exponential Family with Hidden variables
EAR     Extended AutoRegressive (model, process)
FA      Factor Analysis
FAMIS   Functional Analysis of Medical Image Sequences (model)
FVPCA   Fast Variational Principal Component Analysis (algorithm)
HMM     Hidden Markov Model
HPD     Highest Posterior Density (region)
ICA     Independent Component Analysis
IVB     Iterative Variational Bayes (algorithm)
KF      Kalman Filter
KLD     Kullback-Leibler Divergence
LPF     Low-Pass Filter
FF      Fixed Form (approximation)
MAP     Maximum A Posteriori
MCMC    Markov Chain Monte Carlo
MEAR    Mixture-based Extension of the AutoRegressive model
ML      Maximum Likelihood
OVPCA   Orthogonal Variational Principal Component Analysis
PCA     Principal Component Analysis
PE      Prediction Error
PPCA    Probabilistic Principal Component Analysis
QB      Quasi-Bayes
RLS     Recursive Least Squares
RVB     Restricted Variational Bayes


SNR     Signal-to-Noise Ratio
SVD     Singular Value Decomposition
TI      Time-Invariant
TV      Time-Variant
VB      Variational Bayes
VL      Viterbi-Like (algorithm)
VMF     Von Mises-Fisher (distribution)
VPCA    Variational PCA (algorithm)


1 Introduction

1.1 How to be a Bayesian

In signal processing, as in all quantitative sciences, we are concerned with data, D, and how we can learn about the system or source which generated D. We will often refer to learning as inference. In this book, we will model the data parametrically, so that a set, θ, of unknown parameters describes the data-generating system. In deterministic problems, knowledge of θ determines D under some notional rule, D = g(θ). This accounts for very few of the data contexts in which we must work. In particular, when D is information-bearing, then we must model the uncertainty (sometimes called the randomness) of the process. The defining characteristic of Bayesian methods is that we use probabilities to quantify our beliefs amid uncertainty, and the calculus of probability to manipulate these quantitative beliefs [1–3]. Hence, our beliefs about the data are completely expressed via the parametric probabilistic observation model, f(D|θ). In this way, knowledge of θ determines our beliefs about D, not D themselves.

In practice, the result of an observational experiment is that we are given D, and our problem is to use them to learn about the system—summarized by the unknown parameters, θ—which generated them. This learning amid uncertainty is known as inductive inference [3], and it is solved by constructing the distribution f(θ|D), namely, the distribution which quantifies our a posteriori beliefs about the system, given a specific set of data, D. The simple prescription of Bayes' rule solves the implied inverse problem [4], allowing us to reverse the order of the conditioning in the observation model, f(D|θ):

f(θ|D) ∝ f(D|θ)f(θ). (1.1)

Bayes' rule specifies how our prior beliefs, quantified by the prior distribution, f(θ), are updated in the light of D. Hence, a Bayesian treatment requires prior quantification of our beliefs about the unknown parameters, θ, whether or not θ is by nature fixed or randomly realized. The signal processing community, in particular, has been resistant to the philosophy of strong Bayesian inference [3], which assigns probabilities to fixed, as well as random, unknown quantities. Hence, they relegate Bayesian methods to inference problems involving only random quantities [5, 6]. This book adheres to the strong Bayesian philosophy.

Tractability is a primary concern to any signal processing expert seeking to develop a parametric inference algorithm, both in the off-line case and, particularly, on-line. The Bayesian approach provides f(θ|D) as the complete inference of θ, and this must be manipulated in order to solve problems of interest. For example, we may wish to concentrate the inference onto a subset, θ1, by marginalizing over their complement, θ2:

f(θ1|D) ∝ ∫_{Θ∗2} f(θ|D) dθ2. (1.2)

A decision, such as a point estimate, may be required. The mean a posteriori estimate may then be justified:

θ̂1 = ∫_{Θ∗1} θ1 f(θ1|D) dθ1. (1.3)

Finally, we might wish to select a model from a set of candidates, M1, . . . , Mc, via computation of the marginal probability of D with respect to each candidate:

f(Ml|D) ∝ Pr[Ml] ∫_{Θ∗l} f(D|θl, Ml) dθl. (1.4)

Here, θl ∈ Θ∗l are the parameters of the competing models, and Pr[Ml] is the necessary prior on those models.
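The marginalization, estimation and model-comparison operations in (1.2)–(1.4) are all integrals over the posterior. As a point of reference for what they involve computationally, here is a minimal numerical sketch, assuming a hypothetical two-parameter model whose unnormalized joint posterior is tabulated on a grid; it approximates (1.2) and (1.3) by Riemann sums. The toy density and all grid settings are illustrative only, and are not a model used in this book.

import numpy as np

# Hypothetical unnormalized joint posterior f(theta1, theta2 | D) on a grid,
# standing in for the output of Bayes' rule (1.1); the density is illustrative.
theta1 = np.linspace(-5.0, 5.0, 401)
theta2 = np.linspace(0.01, 10.0, 400)
T1, T2 = np.meshgrid(theta1, theta2, indexing="ij")
joint = np.exp(-0.5 * T2 * (T1 - 1.0) ** 2) * T2 * np.exp(-T2)   # unnormalized

# Normalize numerically (the grid replaces the analytic normalizing constant).
d1, d2 = theta1[1] - theta1[0], theta2[1] - theta2[0]
joint /= joint.sum() * d1 * d2

# (1.2): marginalize over theta2 to concentrate the inference on theta1.
marg_theta1 = joint.sum(axis=1) * d2

# (1.3): mean a posteriori point estimate of theta1.
theta1_hat = (theta1 * marg_theta1).sum() * d1
print(theta1_hat)

Grid-based integration of this kind scales exponentially with the number of parameters, which is precisely why the approximations discussed below and in Chapter 3 are needed.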

1.2 The Variational Bayes (VB) Method

The integrations required in (1.2)–(1.4) will often present computational burdens that compromise the tractability of the signal processing algorithm. In Chapter 3, we will review some of the approximations which can help to address these problems, but the aim of this book is to advocate the use of the Variational Bayes (VB) approximation as an effective pathway to the design of tractable signal processing algorithms for parametric inference. These VB solutions will be shown, in many cases, to be novel and attractive alternatives to currently available Bayesian inference algorithms.

The central idea of the VB method is to approximate f(θ|D), ab initio, in terms of approximate marginals:

f(θ|D) ≈ f̃(θ|D) = f̃(θ1|D) f̃(θ2|D). (1.5)

In essence, the approximation forces posterior independence between subsets of parameters in a particular partition of θ chosen by the designer. The optimal such approximation is chosen by minimizing a particular measure of divergence from f̃(θ|D) to f(θ|D), namely, a particular Kullback-Leibler Divergence (KLD), which we will call KLDVB in Section 3.2.2:


f̃(θ|D) = arg min KL( f(θ1|D)f(θ2|D) || f(θ|D) ), (1.6)

where the minimum is taken over distributions of the factorized form f(θ1|·)f(θ2|·).

In practical terms, functional optimization of (1.6) yields a known functional form for f̃(θ1|D) and f̃(θ2|D), which will be known as the VB-marginals. However, the shaping parameters associated with each of these VB-marginals are expressed via particular moments of the others. Therefore, the approximation is possible if all moments required in the shaping parameters can be evaluated. Mutual interaction of the VB-marginals via their moments presents an obstacle to evaluation of their shaping parameters, since a closed-form solution is available only for a limited number of problems. However, a generic iterative algorithm for evaluation of the VB-moments and shaping parameters is available for tractable VB-marginals (i.e. marginals whose moments can be evaluated). This algorithm—reminiscent of the classical Expectation-Maximization (EM) algorithm—will be called the Iterative Variational Bayes (IVB) algorithm in this book. Hence, the computational burden of the VB-approximation is confined to iterations of the IVB algorithm. The result is a set of moments and shaping parameters, defining the VB-approximation (1.5).

1.3 A First Example of the VB Method: Scalar Additive Decomposition

Consider the following additive model:

d = m + e, (1.7)

f (e) = Ne(0, ω−1). (1.8)

The implied observation model is f (d|m,ω) = Nd(m, ω−1). The task is to infer the two unknown parameters—i.e. the mean, m, and precision, ω—of the Normal distribution, N, given just one scalar data point, d. This constitutes a stressful regime for inference. In order to ‘be a Bayesian’, we assign a prior distribution to m and ω. Given the poverty of data, we can expect our choice to have some influence on our posterior inference. We will now consider two choices for prior elicitation.

1.3.1 A First Choice of Prior

The following choice seems reasonable:

f (m) = Nm(0, φ−1), (1.9)

f (ω) = Gω(α, β). (1.10)

In (1.9), the zero mean expresses our lack of knowledge of the polarity of m, and the precision parameter, φ > 0, is used to penalize extremely large values. For φ → 0, (1.9) becomes flatter. The Gamma distribution, G, in (1.10) was chosen to reflect the positivity of ω. Its parameters, α > 0 and β > 0, may again be chosen to yield a non-informative prior. For α → 0 and β → 0, (1.10) approaches Jeffreys' improper prior on scale parameters, 1/ω [7].

Joint inference of the Normal mean and precision, m and ω respectively, is well studied in the literature [8, 9]. From Bayes' rule, the posterior distribution is

f (m,ω|d, α, β, φ) ∝ Nd(m, ω−1) Nm(0, φ−1) Gω(α, β). (1.11)

The basic properties of the Normal (N) and Gamma (G) distributions are summarized in Appendices A.2 and A.5 respectively. Even in this simple case, evaluation of the marginal distribution of the mean, m, i.e. f (m|d, α, β, φ), is not tractable. Hence, we seek the best approximation in the class of conditionally independent posteriors on m and ω, by minimizing KLDVB (1.6), this being the VB-approximation. The solution can be found in the following form:

f̃ (m|d, α, β, φ) = Nm((ω̂ + φ)−1 ω̂ d, (ω̂ + φ)−1), (1.12)

f̃ (ω|d, α, β, φ) = Gω(α + 1/2, 1/2 [m̂² − 2dm̂ + d² + 2β]). (1.13)

The shaping parameters of (1.12) and (1.13) are mutually dependent via their moments, as follows:

ω̂ = E_{f̃(ω|d,·)}[ω] = (α + 1/2) / (1/2 [m̂² − 2dm̂ + d² + 2β]),

m̂ = E_{f̃(m|d,·)}[m] = (ω̂ + φ)−1 ω̂ d, (1.14)

m̂² = E_{f̃(m|d,·)}[m²] = (ω̂ + φ)−1 + (m̂)².

The VB-moments (1.14) fully determine the VB-marginals, (1.12) and (1.13). It can be shown that this set of VB-equations (1.14) has three possible solutions (being roots of a 3rd-order polynomial), only one of which satisfies ω̂ > 0. Hence, the optimized KLDVB has three ‘critical’ points for this model. The exact distribution and its VB-approximation are compared in Fig. 1.1.

Fig. 1.1. The VB-approximation, (1.12) and (1.13), for the scalar additive decomposition (dash-dotted contour). Full contour lines denote the exact posterior distribution (1.11).
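To make the IVB iteration concrete, the following is a minimal sketch of the fixed-point equations (1.14), assuming an illustrative datum d and near non-informative hyperparameters α, β, φ; none of these numerical values comes from the text. The loop alternates between the VB-moments of f̃(m|d) and f̃(ω|d) until they stop changing.

# A sketch of the Iterative Variational Bayes (IVB) algorithm for (1.14).
# The datum d and the hyperparameters alpha, beta, phi are illustrative.
d = 1.5
alpha, beta, phi = 1e-2, 1e-2, 1e-2

omega_hat = 1.0                       # initial guess for the VB-moment E[omega]
for _ in range(200):
    # Moments of the VB-marginal f~(m|d) in (1.12).
    var_m = 1.0 / (omega_hat + phi)   # posterior variance of m
    m_hat = var_m * omega_hat * d     # E[m]
    m2_hat = var_m + m_hat ** 2       # E[m^2]
    # Moment of the VB-marginal f~(omega|d) in (1.13).
    omega_new = (alpha + 0.5) / (0.5 * (m2_hat - 2.0 * d * m_hat + d ** 2 + 2.0 * beta))
    if abs(omega_new - omega_hat) < 1e-12:
        break
    omega_hat = omega_new

print(m_hat, m2_hat, omega_hat)       # converged VB-moments defining (1.12)-(1.13)

Which of the three critical points mentioned above the iteration reaches can, in general, depend on the initialization of ω̂.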

1.3.2 The Prior Choice Revisited

For comparison, we now consider a different choice of the priors:

f (m|ω) = Nm(0, (γω)−1), (1.15)

f (ω) = Gω(α, β). (1.16)

Here, (1.16) is the same as (1.10), but (1.15) has been parameterized differently from (1.9). It still expresses our lack of knowledge of the polarity of m, and it still penalizes extreme values of m if γ → 0. Hence, both prior structures, (1.9) and (1.15), can express non-informative prior knowledge. However, the precision parameter, γω, of m is now chosen proportional to the precision parameter, ω, of the noise (1.8).

From Bayes’ rule, the posterior distribution is now

f (m,ω|d, α, β, γ) ∝ Nd(m, ω−1) Nm(0, (γω)−1) Gω(α, β), (1.17)

f (m,ω|d, α, β, γ) = Nm((γ + 1)−1 d, ((γ + 1)ω)−1) × Gω(α + 1/2, β + γd²/(2(1 + γ))). (1.18)

Note that the posterior distribution, in this case, has the same functional form as the prior, (1.15) and (1.16), namely a product of Normal and Gamma distributions. This is known as conjugacy. The (exact) marginal distributions of (1.17) are now readily available:

f (m|d, α, β, γ) = Stm( d/(γ + 1), (d²γ + 2β(1 + γ))/(2α), 2α ),

f (ω|d, α, β, γ) = Gω( α + 1/2, β + γd²/(2(1 + γ)) ),

where Stm denotes Student's t-distribution with 2α degrees of freedom.


In this case, the VB-marginals have the following forms:

f̃ (m|d, α, β, γ) = Nm((1 + γ)−1 d, ((1 + γ) ω̂)−1), (1.19)

f̃ (ω|d, α, β, γ) = Gω(α + 1, β + 1/2 [(1 + γ) m̂² − 2dm̂ + d²]). (1.20)

The shaping parameters of (1.19) and (1.20) are therefore mutually dependent via the following VB-moments:

ω̂ = E_{f̃(ω|d,·)}[ω] = (α + 1) / (β + 1/2 [(1 + γ) m̂² − 2dm̂ + d²]),

m̂ = E_{f̃(m|d,·)}[m] = (1 + γ)−1 d, (1.21)

m̂² = E_{f̃(m|d,·)}[m²] = (1 + γ)−1 ω̂−1 + (m̂)².

In this case, (1.21) has a simple, unique, closed-form solution, as follows:

ω̂ = (1 + 2α)(1 + γ) / (d²γ + 2β(1 + γ)),

m̂ = d / (1 + γ), (1.22)

m̂² = (d²(1 + γ + 2α) + 2β(1 + γ)) / ((1 + γ)²(1 + 2α)).

The exact and VB-approximated posterior distributions are compared in Fig. 1.2.
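As a check on the algebra, the closed-form moments (1.22) can be evaluated directly, with no IVB iteration; the numerical values below are illustrative and chosen only to parallel the sketch given for the first prior.

# A sketch evaluating the closed-form VB-moments (1.22); d, alpha, beta, gamma
# are illustrative values, not taken from the text.
d = 1.5
alpha, beta, gamma = 1e-2, 1e-2, 1e-2

omega_hat = (1 + 2 * alpha) * (1 + gamma) / (d ** 2 * gamma + 2 * beta * (1 + gamma))
m_hat = d / (1 + gamma)
m2_hat = (d ** 2 * (1 + gamma + 2 * alpha) + 2 * beta * (1 + gamma)) / (
    (1 + gamma) ** 2 * (1 + 2 * alpha)
)

# Consistency check against (1.21): E[m^2] = (1+gamma)^-1 * omega_hat^-1 + m_hat^2.
assert abs(m2_hat - (1.0 / ((1 + gamma) * omega_hat) + m_hat ** 2)) < 1e-9

print(m_hat, m2_hat, omega_hat)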

Remark 1.1 (Choice of priors for the VB-approximation). Even in the stressful regime of this example (one datum, two unknowns), each set of priors had a similar influence on the posterior distribution. In more realistic contexts, the distinctions will be even less, as the influence of the data—via f(D|θ) in (1.1)—begins to dominate the prior, f(θ). However, from an analytical point-of-view, the effects of the prior choice can be very different, as we have seen in this example. Recall that the moments of the exact posterior distribution were tractable in the case of the second prior (1.17), but were not tractable in the first case (1.11). This distinction carried through to the respective VB-approximations. Once again, the second set of priors implied a far simpler solution (1.22) than the first (1.14). Therefore, in this book, we will take care to design priors which can facilitate the task of VB-approximation. We will always be in a position to ensure that our choice is non-informative.

1.4 The VB Method in its Context

Fig. 1.2. The VB-approximation, (1.19) and (1.20), for the scalar additive decomposition (dash-dotted contour), using alternative priors, (1.15) and (1.16). Full contour lines denote the exact posterior distribution (1.17).

Statistical physics has long been concerned with high-dimensional probability functions and their simplification [10]. Typically, the physicist is considering a system of many interacting particles and wishes to infer the state, θ, of this system. Boltzmann's law [11] relates the energy of the state to its probability, f (θ). If we wish to infer a sub-state, θi, we must evaluate the associated marginal, f (θi). Progress can be made by replacing the exact probability model, f (θ), with an approximation, f̃ (θ). Typically, this requires us to neglect interactions in the physical system, by setting many such interactions to zero. The optimal such approximate distribution, f̃ (θ), can be chosen using the variational method [12], which seeks a free-form solution within the approximating class that minimizes some measure of disparity between f (θ) and f̃ (θ). Strong physical justification can be advanced for minimization of a Kullback-Leibler divergence (1.6), which is interpretable as a relative entropy. The Variational Bayes (VB) approximation is one example of such an approximation, where independence between all θi is enforced (1.5). In this case, the approximating marginals depend on expectations of the remaining states. Mean Field Theory (MFT) [10] generalizes this approach, exploring many such choices for the approximating function, f̃ (θ), and its disparity with respect to f (θ). Once the variational approximation has been obtained, the exact system is studied by means of this approximation [13].

The machine learning community has adopted Mean Field Theory [12] as a way to cope with problems of learning and belief propagation in complex systems such as neural networks [14–16]. Ensemble learning [17] is an example of the use of the VB-approximation in this area. Communication between the machine learning and physics communities has been enhanced by the language of graphical models [18–20]. The Expectation-Maximization (EM) algorithm [21] is another important point of tangency, and was re-derived in [22] using KLDVB minimization. The EM algorithm has long been known in the signal processing community as a means of finding the Maximum Likelihood (ML) solution in high-dimensional problems—such as image segmentation—involving hidden variables. Replacement of the EM equations with Variational EM (i.e. IVB) [23] equations allows distributional approximations to be used in place of point estimates.

In signal processing, the VB method has proved to be of importance in addressing problems of model structure inference, such as the inference of rank in Principal Component Analysis (PCA) [24] and Factor Analysis [20, 25], and in the inference of the number of components in a mixture [26]. It has been used for identification of non-Gaussian AutoRegressive (AR) models [27, 28], for unsupervised blind source separation [29], and for pattern recognition of hand-written characters [15].

1.5 VB as a Distributional Approximation

The VB method of approximation is one of many techniques for approximation of probability functions. In the VB method, the approximating family is taken as the set of all possible distributions expressed as the product of required marginals, with the optimal such choice made by minimization of a KLD. The following are among the many other approximations—deterministic and stochastic—that have been used in signal processing:

Point-based approximations: examples include the Maximum a Posteriori (MAP) and ML estimates. These are typically used as certainty equivalents [30] in decision problems, leading to highly tractable procedures. Their inability to take account of uncertainty is their principal drawback.

Local approximations: the Laplace approximation [31], for example, performs a Taylor expansion at a point, typically the ML estimate. This method is known to the signal processing community in the context of criteria for model order selection, such as the Schwarz criterion and Bayes' Information Criterion (BIC), both of which were derived using the Laplace method [31]. Their principal disadvantage is their inability to cope with multimodal probability functions.

Spline approximations: tractable approximations of the probability function may be proposed on a sufficiently refined partition of the support. The computational load associated with integrations typically increases exponentially with the number of dimensions.

MaxEnt and moment matching: the approximating distribution may be chosen to match a selected set of the moments of the true distribution [32]. Under the MaxEnt principle [33], the optimal such moment-matching distribution is the one possessing maximum entropy subject to these moment constraints.

Empirical approximations: a random sample is generated from the probability function, and the distributional approximation is simply a set of point masses placed at these independent, identically-distributed (i.i.d.) sampling points. The key technical challenge is efficient generation of i.i.d. samples from the true distribution. In recent years, stochastic sampling techniques [34]—particularly the class known as Markov Chain Monte Carlo (MCMC) methods [35]—have overtaken deterministic methods as the gold standard for distributional approximation. They can yield approximations to an arbitrary level of accuracy, but typically incur major computational overheads. It can be instructive to examine the performance of any deterministic method—such as the VB method—in terms of the accuracy-vs-complexity trade-off achieved by these stochastic sampling techniques.

The VB method has the potential to offer an excellent trade-off between computational complexity and accuracy of the distributional approximation. This is suggested in Fig. 1.3. The main computational burden associated with the VB method is the need to solve iteratively—via the IVB algorithm—a set of simultaneous equations in order to reveal the required moments of the VB-marginals. If computational cost is of concern, VB-marginals may be replaced by simpler approximations, or the evaluation of moments can be approximated, without, hopefully, diminishing the overall quality of approximation significantly. This pathway of approximation is suggested by the dotted arrow in Fig. 1.3, and will be traversed in some of the signal processing applications presented in this book. Should the need exist to increase accuracy, the VB method is sited in the flexible context of Mean Field Theory, which offers more sophisticated techniques that might be explored.
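One way to place a given approximation on the quality axis of Fig. 1.3 is to evaluate its KLD from the exact posterior numerically. The following rough sketch does this for the scalar additive decomposition of Section 1.3.2, comparing the product of VB-marginals (1.19)–(1.20), with the closed-form moments (1.22), against the exact posterior (1.17) on a grid; the grid limits and the values of d, α, β, γ are illustrative, and the Riemann-sum KLD is only indicative of the quantity in (1.6).

import numpy as np
from math import lgamma, pi

# Illustrative values; not taken from the text.
d, alpha, beta, gamma = 1.5, 1e-2, 1e-2, 1e-2

def norm_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * pi * var)

def gamma_pdf(x, a, b):  # shape a, rate b
    return np.exp(a * np.log(b) + (a - 1) * np.log(x) - b * x - lgamma(a))

m = np.linspace(-4.0, 6.0, 500)
w = np.linspace(1e-3, 120.0, 1200)
M, W = np.meshgrid(m, w, indexing="ij")
dm, dw = m[1] - m[0], w[1] - w[0]

# Exact posterior (1.17), normalized numerically over the grid.
exact = norm_pdf(d, M, 1.0 / W) * norm_pdf(M, 0.0, 1.0 / (gamma * W)) * gamma_pdf(W, alpha, beta)
exact /= exact.sum() * dm * dw

# VB-approximation: product of (1.19) and (1.20) with the moments (1.22).
w_hat = (1 + 2 * alpha) * (1 + gamma) / (d ** 2 * gamma + 2 * beta * (1 + gamma))
m_hat = d / (1 + gamma)
m2_hat = (d ** 2 * (1 + gamma + 2 * alpha) + 2 * beta * (1 + gamma)) / ((1 + gamma) ** 2 * (1 + 2 * alpha))
vb = norm_pdf(M, d / (1 + gamma), 1.0 / ((1 + gamma) * w_hat)) * \
     gamma_pdf(W, alpha + 1, beta + 0.5 * ((1 + gamma) * m2_hat - 2 * d * m_hat + d ** 2))
vb /= vb.sum() * dm * dw

# KLD of (1.6), approximated by a Riemann sum over the grid.
mask = (vb > 0) & (exact > 0)
kld = np.sum(vb[mask] * np.log(vb[mask] / exact[mask])) * dm * dw
print(kld)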

Fig. 1.3. The accuracy-vs-complexity trade-off in the VB method. (Schematic: quality of approximation versus computational cost, locating certainty-equivalent methods, deterministic methods, the EM algorithm, Variational Bayes (IVB), mean field theory and sampling methods.)


1.6 Layout of the Work

We now briefly summarize the main content of the Chapters of this book.

Chapter 2. This provides an introduction to Bayesian theory relevant for distributional approximation. We review the philosophical framework, and we introduce basic probability calculus which will be used in the remainder of the book. The important distinction between off-line and on-line inference is outlined.

Chapter 3. Here, we are concerned with the problem of distributional approximation. The VB-approximation is defined, and from it we synthesize an ergonomic procedure for deducing these VB-approximations. This is known as the VB method. Related distributional approximations are briefly reviewed and compared to the VB method. A simple inference problem—scalar multiplicative decomposition—is considered.

Chapter 4. The VB method is applied to the problem of matrix multiplicative decompositions. The VB-approximation for these models reveals interesting properties of the method, such as initialization of the Iterative VB algorithm (IVB) and the existence of local minima. These models are closely related to Principal Component Analysis (PCA), and we show that the VB inference provides solutions to problems not successfully addressed by PCA, such as the inference of rank.

Chapter 5. We use our experience from Chapter 4 to derive the VB-approximation for the inference of physiological factors in medical image sequences. The physical nature of the problem imposes additional restrictions which are successfully handled by the VB method.

Chapter 6. The VB method is explored in the context of recursive inference of signal processes. In this Chapter, we confine ourselves to time-invariant parameter models. We isolate three fundamental scenarios, each of which constitutes a recursive inference task where the VB-approximation is tractable and adds value. We apply the VB method to the recursive identification of mixtures of AR models. The practical application of this work in prediction of urban traffic flow is outlined.

Chapter 7. The time-invariant parameter assumption from Chapter 6 is relaxed. Hence, we are concerned here with Bayesian filtering. The use of the VB method in this context reveals interesting computational properties in the resulting algorithm, while also pointing to some of the difficulties which can be encountered.

Chapter 8. We address a practical signal processing task, namely, the reconstruction of AR processes corrupted by unknown transformation and noise distortions. The use of the VB method in this ambitious context requires synthesis of experience gained in Chapters 6 and 7. The resulting VB inference is shown to be successful in optimal data pre-processing tasks such as outlier removal and suppression of burst noise. An application in speech denoising is presented.

Chapter 9. We summarize the main findings of the work, and point to some interesting future prospects.


1.7 Acknowledgement

The first author acknowledges the support of Grants AV ČR 1ET 100 750 401 and MŠMT 1M6798555601.


2 Bayesian Theory

In this Chapter, we review the key identities of probability calculus relevant to Bayesian inference. We then examine three fundamental contexts in parametric modelling, namely (i) off-line inference, (ii) on-line inference of time-invariant parameters, and (iii) on-line inference of time-variant parameters. In each case, we use the Bayesian framework to derive the formal solution. Each context will be examined in detail in later Chapters.

2.1 Bayesian Benefits

A Bayesian is someone who uses only probabilities to quantify degrees of belief inan uncertain hypothesis, and uses only the rules of probability as the calculus foroperating on these degrees of belief [7, 8, 36, 37]. At the very least, this approach toinductive inference is consistent, since the calculus of probability is consistent, i.e.any valid use of the rules of probability will lead to a unique conclusion. This is nottrue of classical approaches to inference, where degrees of belief are quantified usingone of a vast range of criteria, such as relative frequency of occurrence, distance ina normed space, etc. If the Bayesian’s probability model is chosen to reflect suchcriteria, then we might expect close correspondence between Bayesian and classicalmethods. However, a vital distinction remains. Since probability is a measure func-tion on the space of possibilities, the marginalization operator (i.e. integration) is apowerful inferential tool uniquely at the service of the Bayesian. Careful comparisonof Bayesian and classical solutions will reveal that the real added value of Bayesianmethods derives from being able to integrate, thereby concentrating the inferenceonto a selected subset of quantities of interest. In this way, Bayesian methods natu-rally embrace the following key problems, all problematical for the non-Bayesian:

1. projection into a desired subset of the hypothesis space;
2. reduction of the number of parameters appearing in the probability function (so-called ‘elimination of nuisance parameters’ [38]);
3. quantification of the risk associated with a data-informed decision;
4. evaluation of expected values and moments;
5. comparison of competing model structures and penalization of complexity (Ockham’s Razor) [39, 40];
6. prediction of future data.

All of these tasks require integration with respect to the probability measure on the space of possibilities. In the case of 5. above, competing model structures are measured, leading to consistent quantification of model complexity. This natural engendering of Ockham’s razor is among the most powerful features of the Bayesian framework.

Why, then, are Bayesian methods still so often avoided in application contexts such as statistical signal processing? The answer is mistrust of the prior, and philosophical angst about (i) its right to exist, and (ii) its right to influence a decision or algorithm. With regard to (i), it is argued by non-Bayesians that probabilities may only be attached to objects or hypotheses that vary randomly in repeatable experiments [41]. With regard to (ii), the non-Bayesian (objectivist) perspective is that inferences should be based only on data, and never on prior knowledge. Preoccupation with these issues is to miss where the action really is: the ability to marginalize in the Bayesian framework. In our work, we will eschew detailed philosophical arguments in favour of a policy that minimizes the influence of the priors we use, and points to the practical added value over frequentist methods that arises from use of probability calculus.

2.1.1 Off-line vs. On-line Parametric Inference

In an observational experiment, we may wish to infer knowledge of an unknown quantity only after all data, D, have been gathered. This batch-based inference will be called the off-line scenario, and Bayesian methods must be used to update our beliefs given no data (i.e. our prior), to beliefs given D. It is the typical situation arising in database analysis. In contrast, we may wish to interleave the process of observing data with the process of updating our beliefs. This on-line scenario is important in control and decision tasks, for example. For convenience, we refer to the independent variable indexing the occasions (temporal, spatial, etc.) when our inferences must be updated, as time, t = 0, 1, .... The incremental data observed between inference times is dt, and the aggregate of all data observed up to and including time t is denoted by Dt. Hence:

Dt = Dt−1 ∪ dt, t = 1, ...,

with D0 = ∅, by definition. For convenience, we will usually assume that dt ∈ Rp×1, p ∈ N+, ∀t, and so Dt can be structured into a matrix of dimension p × t, with the incremental data, dt, as its columns:

Dt = [Dt−1, dt] . (2.1)

In this on-line scenario, Bayesian methods are required to update our state of knowledge conditioned by Dt−1, to our state of knowledge conditioned by Dt. Of course, the update is achieved using exactly the same ‘inference machine’, namely Bayes’ rule (1.1). Indeed, one step of on-line inference is equivalent to an off-line step, with D = dt, and with the prior at time t being conditioned on Dt−1. Nevertheless, it will be convenient to handle the off-line and on-line scenarios separately, and we now review the Bayesian probability calculus appropriate to each case.

2.2 Bayesian Parametric Inference: the Off-Line Case

Let the measured data be denoted by D. A parametric probabilistic model of the data is given by the probability distribution, f (D|θ), conditioned by knowledge of the parameters, θ. In this book, the notation f (·) can represent either a probability density function for continuous random variables, or a probability mass function for discrete random variables. We will refer to f (·) as a probability distribution in both cases. In this way a significant harmonization of formulas and nomenclature can be achieved. We need only keep in mind that integrations should be replaced by summations whenever the argument is discrete¹.

Our prior state of knowledge of θ is quantified by the prior distribution, f(θ). Our state of knowledge of θ after observing D is quantified by the posterior distribution, f(θ|D). These functions are related via Bayes’ rule,

f (θ|D) = f (θ,D) / f (D) = [f (D|θ) f (θ)] / [∫Θ∗ f (D|θ) f (θ) dθ],   (2.2)

where Θ∗ is the space of θ. We will refer to f (θ,D) as the joint distribution of parameters and data, or, more concisely, as the joint distribution. We will refer to f (D|θ) as the observation model. If this is viewed as a (non-measure) function of θ, it is known as the likelihood function [3, 43–45]:

l(θ|D) ≡ f (D|θ) . (2.3)

ζ = f (D) is the normalizing constant, sometimes known as the partition function in the physics literature [46]:

ζ = f (D) = ∫Θ∗ f (θ,D) dθ = ∫Θ∗ f (D|θ) f (θ) dθ.   (2.4)

Bayes’ rule (2.2) can therefore be re-written as

f (θ|D) = (1/ζ) f (D|θ) f (θ) ∝ f (D|θ) f (θ),   (2.5)

where ∝ means equal up to the normalizing constant, ζ. The posterior is fully determined by the product f (D|θ) f (θ), since the normalizing constant follows from the requirement that f (θ|D) be a probability distribution; i.e. ∫Θ∗ f (θ|D) = 1. Evaluation of ζ (2.4) can be computationally expensive, or even intractable. If the integral in (2.4) does not converge, the distribution is called improper [47]. The posterior distribution with explicitly known normalization (2.5) will be called the normalized distribution. In Fig. 2.1, we represent Bayes’ rule (2.2) as an operator, B, transforming the prior into the posterior, via the observation model, f (D|θ).

¹ This can also be achieved via measure theory, operating in a consistent way for both discrete and continuous distributions, with probability densities generalized in the Radon-Nikodym sense [42]. The practical effect is the same, and so we will avoid this formality.

Fig. 2.1. Bayes’ rule as an operator.
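As a concrete illustration of the normalization step (2.4) when ζ has no convenient closed form, the following minimal Python sketch evaluates Bayes’ rule on a grid for a scalar parameter. The Bernoulli/Beta model and all numerical values are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Grid-based evaluation of Bayes' rule (2.2) for a scalar parameter theta.
# Illustrative model: D = 5 coin flips with 3 heads, theta = probability of
# heads, prior Beta(2, 2).
theta = np.linspace(1e-6, 1 - 1e-6, 2001)
prior = theta**(2 - 1) * (1 - theta)**(2 - 1)      # unnormalized Beta(2,2) prior
likelihood = theta**3 * (1 - theta)**2             # f(D|theta): 3 heads in 5 flips

joint = likelihood * prior                         # f(D|theta) f(theta)
zeta = np.trapz(joint, theta)                      # normalizing constant, eq. (2.4)
posterior = joint / zeta                           # f(theta|D), eq. (2.5)

print(f"zeta = f(D) = {zeta:.5f}")
print(f"posterior mean = {np.trapz(theta * posterior, theta):.4f}")
```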

2.2.1 The Subjective Philosophy

All our beliefs about θ, and their associated quantifiers via f(θ), f(θ|D), etc., are conditioned on the parametric probability model, f(θ,D), chosen by us a priori (2.2). Its ingredients are (i) the deterministic structure relating D to an unknown parameter set, θ, i.e. the observation model f(D|θ), and (ii) a chosen measure on the space, Θ, of this parameter set, i.e. the prior measure f(θ). In this sense, Bayesian methods are born from a subjective philosophy, which conditions all inference on the prior knowledge of the observer [2, 36]. Jeffreys’ notation [7], I, is used to condition all probability functions explicitly on this corpus of prior knowledge; e.g. f(θ) → f(θ|I). For convenience, we will not use this notation, nor will we forget the fact that this conditioning is always present. In model comparison (1.4), where we examine competing model assumptions, fl(θl, D), l = 1, . . . , c, this conditioning becomes more explicit, via the indicator variable or pointer, l ∈ {1, 2, ..., c}, but once again we will suppress the implied Jeffreys’ notation.

2.2.2 Posterior Inferences and Decisions

The task of evaluating the full posterior distribution (2.5) will be called parameter inference in this book. We favour this phrase over the alternative—density estimation—used in some decision theory texts [48]. The full posterior distribution is a complete description of our uncertainty about the parameters of the observation model (2.3), given prior knowledge, f(θ), and all available data, D. For many practical tasks, we need to derive conditional and marginal distributions of model parameters, and their moments. Consider the (vector of) model parameters to be partitioned into two subsets, θ = [θ′1, θ′2]′. Then, the marginal distribution of θ1 is

f (θ1|D) = ∫Θ∗2 f (θ1, θ2|D) dθ2.   (2.6)

Fig. 2.2. The marginalization operator.

In Fig. 2.2, we represent (2.6) as an operator. This graphical representation will be convenient in later Chapters.

The moments of the posterior distribution—i.e. the expected or mean value of known functions, g (θ), of the parameter—will be denoted by

Ef(θ|D) [g (θ)] = ∫Θ∗ g (θ) f (θ|D) dθ.   (2.7)

In general, we will use the notation ĝ (θ) to refer to a posterior point estimate of g(θ). Hence, for the choice (2.7), we have

ĝ (θ) ≡ Ef(θ|D) [g (θ)].   (2.8)

The posterior mean (2.7) is only one of many decisions that can be made in choosing a point estimate, ĝ (θ), of g(θ). Bayesian decision theory [30, 48–52] allows an optimal such choice to be made. The Bayesian model, f (θ,D) (2.2), is supplemented by a loss function, L(g, ĝ) ∈ [0,∞), quantifying the loss associated with estimating g ≡ g(θ) by ĝ ≡ ĝ (θ). The minimum Bayes risk estimate is found by minimizing the posterior expected loss,

ĝ (θ) = arg min_ĝ Ef(θ|D) [L(g, ĝ)].   (2.9)

The quadratic loss function, L(g, ĝ) = (g(θ) − ĝ (θ))′ Q (g(θ) − ĝ (θ)), Q positive definite, leads to the choice of the posterior mean (2.7). Other standard loss functions lead to other standard point estimates, such as the maximum and median a posteriori estimates [37]. The Maximum a Posteriori (MAP) estimate is defined as follows:

θMAP = arg maxθ f(θ|D).   (2.10)

In the special case where f(θ) = const., i.e. the improper uniform prior, then, from (2.2) and (2.3),

θMAP = θML = arg maxθ l(θ|D).   (2.11)

Here, θML denotes the Maximum Likelihood (ML) estimate. ML estimation [43] is the workhorse of classical inference, since it avoids the issue of defining a prior over the space of possibilities. In particular, it is the dominant tool for probabilistic methods in signal processing [5, 53, 54]. Consider the special case of an additive Gaussian noise model for vector data, D = d ∈ Rp, with

d = s(θ) + e,

e ∼ N (0, Σ) ,

where Σ is known, and s(θ) is the (non-linearly parameterized) signal model. In this case, θML = θLS, the traditional non-linear, weighted Least-Squares (LS) estimate [55] of θ. From the Bayesian perspective, these classical estimators—θML and θLS—can be justified only to the extent that a uniform prior over Θ∗ might be justified. When Θ∗ has infinite Lebesgue measure, this prior is improper, leading to technical and philosophical difficulties [3, 8]. In this book, it is the strongly Bayesian choice, ĝ (θ) = Ef(θ|D) [g (θ)] (2.8), which predominates. Hence, the notation ĝ ≡ ĝ (θ) will always denote the posterior mean of g(θ), unless explicitly stated otherwise.
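The link between θML and θLS under additive Gaussian noise with known covariance can be checked numerically. The sketch below uses an illustrative scalar signal model s(θ) = sin(θ), an assumption for demonstration only, and evaluates both criteria on a grid; the maximizer of the log-likelihood coincides with the minimizer of the weighted least-squares cost.

```python
import numpy as np

# Illustrative check that theta_ML = theta_LS for d = s(theta) + e, e ~ N(0, sigma^2 I).
np.random.seed(2)
theta_true, sigma = 0.8, 0.2
d = np.sin(theta_true) + sigma * np.random.randn(20)     # observed data vector

theta = np.linspace(0, np.pi / 2, 2001)                  # grid over the parameter
resid = d[:, None] - np.sin(theta)[None, :]              # residuals d - s(theta)
loglik = -0.5 * np.sum(resid**2, axis=0) / sigma**2      # log f(D|theta) up to a constant
ls_cost = np.sum(resid**2, axis=0)                       # weighted LS criterion

# Both criteria are monotone transforms of each other, so the extrema coincide.
print(theta[np.argmax(loglik)], theta[np.argmin(ls_cost)])
```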

As an alternative to point estimation, the Bayesian may choose to describe a continuous posterior distribution, f (θ|D) (2.2), in terms of a region or interval within which θ has a high probability of occurrence. These credible regions [37] replace the confidence intervals of classical inference, and have an intuitive appeal. The following special case provides a unique specification, and will be used in this book.

Definition 2.1 (Highest Posterior Density (HPD) Region). R ⊂ Θ∗ is the 100(1−α)% HPD region of (continuous) distribution, f (θ|D), where α ∈ (0, 1), if (i) ∫R f (θ|D) = 1 − α, and if (ii) almost surely (a.s.) for any θ1 ∈ R and θ2 /∈ R, then f (θ1|D) ≥ f (θ2|D).
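Definition 2.1 can be approximated numerically by ranking grid points by posterior density and accumulating probability mass until 1 − α is reached. The sketch below is a minimal illustration for a one-dimensional posterior; the Gaussian used as f(θ|D) and the helper name hpd_region are assumptions for demonstration.

```python
import numpy as np

def hpd_region(theta, pdf, alpha=0.05):
    """Approximate 100(1-alpha)% HPD region of a 1-D density given on a uniform grid."""
    dtheta = theta[1] - theta[0]
    order = np.argsort(pdf)[::-1]                  # highest density first
    mass = np.cumsum(pdf[order]) * dtheta          # accumulated probability
    keep = order[: np.searchsorted(mass, 1.0 - alpha) + 1]
    inside = theta[keep]
    return inside.min(), inside.max()              # interval if the region is connected

# Example: posterior N(1, 0.5^2) on a grid (illustrative values).
theta = np.linspace(-2, 4, 4001)
pdf = np.exp(-0.5 * ((theta - 1.0) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))
print(hpd_region(theta, pdf))                      # close to 1 -/+ 1.96 * 0.5
```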

2.2.3 Prior Elicitation

The prior distribution (2.2) required by Bayes’ rule is a function that must be elicited by the designer of the model. It is an important part of the inference problem, and can significantly influence posterior inferences and decisions (Section 2.2.2). General methods for prior elicitation have been considered extensively in the literature [7, 8, 37, 56], as well as the problem of choosing priors for specific signal models in Bayesian signal processing [3, 35, 57]. In this book, we are concerned with the practical impact of prior choices on the inference algorithms which we develop. The prior distribution will be used in the following ways:

1. To supplement the data, D, in order to obtain a reliable posterior estimate, in cases where there are insufficient data and/or a poorly defined model. This will be called regularization (via the prior);

2. To impose various restrictions on the parameter θ, reflecting physical constraints such as positivity. Note, from (2.2), that if the prior distribution on a subset of the parameter support, Θ∗, is zero, then the posterior distribution will also be zero on this subset;

3. To express prior ignorance about θ. If the data are assumed to be informative enough, we prefer to choose a non-informative prior (i.e. a prior with minimal impact on the posterior distribution). Philosophical and analytical challenges are encountered in the design of non-informative priors, as discussed, for example, in [7, 46].

In this book, we will typically choose our prior from a family of distributions providing analytical tractability during the Bayes update (Fig. 2.1). Notably, we will work with conjugate priors, as defined in the next Section. In such cases, we will design our non-informative prior by choosing its parameters to have minimal impact on the parameters of the posterior distribution.

2.2.3.1 Conjugate priors

In parametric inference, all distributions, f(·), have a known functional form, and are completely determined once the associated shaping parameters are known. Hence, the shaping parameters of the posterior distribution, f (θ|D, s0) (2.5), are, in general, the complete data record, D, and any shaping parameters, s0, of the prior, f0 (θ|s0). Hence, a massive increase in the degrees-of-freedom of the inference may occur during the prior-to-posterior update. It will be computationally advantageous if the form of the posterior distribution is identical to the form of prior, f0(·|s0), i.e. the inference is functionally invariant with respect to Bayes’ rule, and is determined from a finite-dimensional vector shaping parameter:

s = s (D, s0) , s ∈ Rq, q < ∞,

with s(∅, s0) ≡ s0, a priori. Then Bayes’ rule (2.5) becomes

f0 (θ|s) ∝ f (D|θ) f0 (θ|s0) . (2.12)

Such a distribution, f0, is known as self-replicating [42], or as the conjugate distribution to the observation model, f (D|θ) [37]. s are known as the sufficient statistics of the distribution, f0. The principle of conjugacy may be used in designing the prior; i.e. if there exists a family of conjugate distributions, Fs, whose elements are indexed by s ∈ Rq, then the prior is chosen as

f (θ) ≡ f0 (θ|s0) ∈ Fs, (2.13)

with s0 forming the parameters of the prior. If s0 are unknown, then they are called hyper-parameters [37], and are assigned a hyperprior, s0 ∼ f(s0). As we will see in Chapter 6, the choice of conjugate priors is of key importance in the design of tractable Bayesian recursive algorithms, since they confine the shaping parameters to Rq, and prevent a linear increase in the number of degrees-of-freedom with Dt (2.1). From now on, we will not use the subscript ‘0’ in f0. The fixed functional form will be implied by the conditioning on sufficient statistics s.
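A minimal sketch of a conjugate update (2.12): for a Bernoulli observation model with a Beta prior, the sufficient statistics remain two-dimensional however much data is observed. The model and the numbers below are illustrative assumptions, not drawn from the text.

```python
import numpy as np

# Conjugate update (2.12) for Bernoulli data with a Beta prior.
# The sufficient statistics s = (a, b) stay two-dimensional for any amount of data.
def beta_bernoulli_update(s0, D):
    a0, b0 = s0                                  # prior shaping parameters s_0
    D = np.asarray(D)
    return a0 + D.sum(), b0 + (1 - D).sum()      # posterior statistics s = s(D, s0)

s0 = (1.0, 1.0)                                  # uniform Beta(1,1) prior (illustrative)
D = [1, 0, 1, 1, 0, 1]                           # observed Bernoulli data
a, b = beta_bernoulli_update(s0, D)
print(f"posterior is Beta({a:.0f}, {b:.0f}); posterior mean = {a / (a + b):.3f}")
```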

2.3 Bayesian Parametric Inference: the On-line Case

We now specialize Bayesian inference to the case of learning in tandem with data acquisition, i.e. we wish to update our inference in the light of incremental data, dt (Section 2.1.1). We distinguish two situations, namely time-invariant and time-variant parameterizations.


2.3.1 Time-invariant Parameterization

The observations at time t, namely dt (2.1), lead to the following update of our knowledge, according to Bayes’ rule (2.2):

f(θ|dt, Dt−1) = f(θ|Dt) ∝ f (dt|θ,Dt−1) f (θ|Dt−1) , t = 1, 2, ..., (2.14)

where f (θ|D0) ≡ f (θ), the parameter prior (2.2). This scenario is illustrated in Fig. 2.3.

Fig. 2.3. The Bayes’ rule operator in the on-line scenario with time-invariant parameterization.

The observation model, f (dt|θ,Dt−1), at time t is related to the observation model for the accumulated data, Dt—which we can interpret as the likelihood function of θ (2.3)—via the chain rule of probability:

l(θ|Dt) ≡ f (Dt|θ) = ∏_{τ=1}^{t} f (dτ |θ, Dτ−1).   (2.15)

Conjugate updates (Section 2.2.3) are essential in ensuring tractable on-line algorithms in this context. This will be the subject of Chapter 6.
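The role of conjugacy in the on-line update (2.14) can be illustrated as follows: feeding the data one observation at a time, with the posterior at time t − 1 acting as the prior at time t, reproduces the off-line (batch) result exactly. The Beta-Bernoulli model and values below are illustrative assumptions.

```python
import numpy as np

# On-line conjugate update (2.14): the posterior after d_t becomes the prior
# for d_{t+1}. The sufficient statistics stay finite-dimensional, and the
# recursion reproduces the batch (off-line) result exactly.
np.random.seed(0)
data = np.random.binomial(1, 0.7, size=100)      # illustrative Bernoulli stream

a, b = 1.0, 1.0                                  # Beta prior at t = 0
for d_t in data:                                 # one Bayes step per datum
    a, b = a + d_t, b + (1 - d_t)

a_batch, b_batch = 1.0 + data.sum(), 1.0 + (1 - data).sum()
assert (a, b) == (a_batch, b_batch)              # recursive result == off-line result
print(f"E[theta|D_t] = {a / (a + b):.3f} after t = {data.size} observations")
```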

2.3.2 Time-variant Parameterization

In this case, new parameters, θt, are required to explain dt, i.e. the observation model, f(dt|θt, Dt−1), t = 1, 2, ..., is an explicitly time-varying function. For convenience, we assume that θt ∈ Rr, ∀t, and we aggregate the parameters into a matrix, Θt, as we did the data (2.1):

Θt = [Θt−1, θt] , (2.16)

with Θ0 = ∅, by definition. Once again, Bayes’ rule (2.2) is used to update our knowledge of Θt in the light of new data, dt:

f(Θt|dt, Dt−1) = f(Θt|Dt), t = 1, 2, ...,
∝ f (dt|Θt, Dt−1) f (Θt|Dt−1)   (2.17)
= f (dt|Θt, Dt−1) f (θt|Dt−1, Θt−1) f (Θt−1|Dt−1),

where we have used the chain rule to expand the last term in (2.17), via (2.16). Typically, we want to concentrate the inference into the newly generated parameter, θt, which we do via marginalization (2.6):

f (θt|Dt) = ∫Θ∗t−1 · · · ∫Θ∗1 f (θt, Θt−1|Dt) dΘt−1   (2.18)
∝ ∫Θ∗t−1 · · · ∫Θ∗1 f (dt|Θt, Dt−1) f (θt|Θt−1, Dt−1) f (Θt−1|Dt−1) dΘt−1.

Note that the dimension of the integration is r(t − 1) at time t. If the integrations need to be carried out numerically, this increasing dimensionality proves prohibitive in real-time applications. Therefore, the following simplifying assumptions are typically adopted [42]:

Proposition 2.1 (Markov observation model and parameter evolution models). The observation model is to be simplified as follows:

f (dt|Θt, Dt−1) = f (dt|θt, Dt−1) , (2.19)

i.e. dt is conditionally independent of Θt−1, given θt. The parameter evolution model is to be simplified as follows:

f (θt|Θt−1, Dt−1) = f (θt|θt−1) . (2.20)

In many applications, (2.20) may depend on exogenous (observed) data, ξt, which can be seen as shaping parameters, and need not be explicitly listed in the conditioning part of the notation.

This Markov model (2.20) is the required extra ingredient for Bayesian time-variant on-line inference. Employing Proposition 2.1 in (2.18), and noting that ∫Θ∗t−2 f (Θt−1|Dt−1) dΘt−2 = f (θt−1|Dt−1), then the following equations emerge:

The time update (prediction) of Bayesian filtering:

f (θt|Dt−1) ≡ f (θ1) , t = 1,

f (θt|Dt−1) = ∫Θ∗t−1 f (θt|θt−1, Dt−1) f (θt−1|Dt−1) dθt−1, t = 2, 3, ....   (2.21)

The data update of Bayesian filtering:

f (θt|Dt) ∝ f (dt|θt, Dt−1) f (θt|Dt−1) , t = 1, 2, .... (2.22)

Note, therefore, that the integration dimension is fixed at r, ∀t (2.21). We will refer to this two-step update for Bayesian on-line inference of θt as Bayesian filtering, in analogy to Kalman filtering which involves the same two-step procedure, and which is, in fact, a specialization to the case of Gaussian observation (2.19) and parameter evolution (2.20) models. On-line inference of time-variant parameters is illustrated in schematic form in Fig. 2.4. In Chapter 7, the problem of designing tractable Bayesian recursive filtering algorithms will be addressed for a wide class of models, (2.19) and (2.20), using Variational Bayes (VB) techniques.
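As a concrete sketch of the two-step recursion (2.21)–(2.22), the following scalar Kalman filter implements the Gaussian special case mentioned above. The state and observation models, and all numerical values, are illustrative assumptions.

```python
import numpy as np

# Bayesian filtering (2.21)-(2.22) in the Gaussian special case: a scalar Kalman
# filter. Illustrative assumptions:
#   theta_t = a * theta_{t-1} + w_t,  w_t ~ N(0, q)   (parameter evolution, (2.20))
#   d_t     = theta_t + e_t,          e_t ~ N(0, r)   (observation model, (2.19))
def kalman_step(mean, var, d_t, a=0.95, q=0.1, r=0.5):
    # time update (prediction), eq. (2.21)
    mean_pred = a * mean
    var_pred = a * a * var + q
    # data update, eq. (2.22)
    gain = var_pred / (var_pred + r)
    mean_post = mean_pred + gain * (d_t - mean_pred)
    var_post = (1.0 - gain) * var_pred
    return mean_post, var_post

np.random.seed(1)
theta, mean, var = 0.0, 0.0, 1.0
for t in range(5):
    theta = 0.95 * theta + np.sqrt(0.1) * np.random.randn()   # simulate theta_t
    d_t = theta + np.sqrt(0.5) * np.random.randn()            # simulate d_t
    mean, var = kalman_step(mean, var, d_t)
    print(f"t={t+1}: d_t={d_t:+.2f}, E[theta_t|D_t]={mean:+.2f}, var={var:.2f}")
```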

Fig. 2.4. The inferential scheme for Bayesian filtering. The operator ‘×’ denotes multiplication of distributions.

2.3.3 Prediction

Our purpose in on-line inference of parameters will often be to predict future data. In the Bayesian paradigm, k-steps-ahead prediction is achieved by eliciting the following distribution:

dt+k ∼ f (dt+k|Dt) . (2.23)

This will be known as the predictor.

The one-step-ahead predictor (i.e. k = 1 in (2.23)) for a model with time-invariant parameters (2.14) is as follows:

f (dt+1|Dt) = ∫Θ∗ f (dt+1|θ) f (θ|Dt) dθ,   (2.24)
= [∫Θ∗ f (dt+1|θ) f (θ,Dt) dθ] / [∫Θ∗ f (θ,Dt) dθ],   (2.25)
= f (Dt+1) / f (Dt) = ζt+1 / ζt,   (2.26)

using (2.2) and (2.4). Hence, the one-step-ahead predictor is simply a ratio of normalizing constants.
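The identity (2.26) can be checked numerically: the ratio of normalizing constants equals the predictor obtained by direct marginalization (2.24). A minimal sketch, assuming an illustrative Bernoulli model with a uniform prior:

```python
import numpy as np

# Numerical check of (2.26): the one-step-ahead predictor equals zeta_(t+1)/zeta_t.
theta = np.linspace(1e-6, 1 - 1e-6, 4001)
prior = np.ones_like(theta)                       # f(theta), uniform on (0,1)
D_t = [1, 0, 1, 1]                                # data up to time t (illustrative)
d_next = 1                                        # candidate value of d_{t+1}

def lik(data):                                    # f(D|theta) for Bernoulli data
    return np.prod([theta**d * (1 - theta)**(1 - d) for d in data], axis=0)

zeta_t = np.trapz(lik(D_t) * prior, theta)                 # f(D_t), eq. (2.4)
zeta_t1 = np.trapz(lik(D_t + [d_next]) * prior, theta)     # f(D_{t+1})

posterior = lik(D_t) * prior / zeta_t
direct = np.trapz(theta**d_next * (1 - theta)**(1 - d_next) * posterior, theta)
print(f"zeta ratio = {zeta_t1 / zeta_t:.4f}, direct predictor (2.24) = {direct:.4f}")
```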

Evaluation of the k-steps-ahead predictor, k > 1, involves integration over future data, dt+1, . . ., dt+k−1, which may require numerical methods. For models with time-variant parameters (2.17), marginalization over the parameter trajectory, θt+1, . . . , θt+k−1, is also required.

2.4 Summary

In later Chapters, we will study the use of the Variational Bayes (VB) approximation in all three contexts of Bayesian learning reviewed in this Chapter, namely:

1. off-line parameter inference (Section 2.2), in Chapter 3;
2. on-line inference of Time-Invariant (TI) parameters (Section 2.3.1), in Chapter 6;
3. on-line inference of Time-Variant (TV) parameters (Section 2.3.2), in Chapter 7.


Context Observation model Model of state evolution posterior prior

off-line f (D|θ) — f (θ|D) f (θ)

on-line TI f (dt|θ, Dt−1) — f (θ|Dt) f (θ|D0) ≡ f (θ)

on-line TV f (dt|θt, Dt−1) f (θt|θt−1, Dt−1) f (θt|Dt) f (θ1|D0) ≡ f (θ1)

Table 2.1. The distributions arising in three key contexts of Bayesian inference.

The key probability distributions arising in each context are summarized in Table 2.1. The VB approximation will be employed consistently in each context, but with different effect. Each will imply distinct criteria for the design of tractable Bayesian learning algorithms.


3 Off-line Distributional Approximations and the Variational Bayes Method

In this Chapter, we formalize the problem of approximating intractable parametric distributions via tractable alternatives. Our aim will be to generate good approximations of the posterior marginals and moments which are unavailable from the exact distribution. Our main focus will be the Variational Bayes method for distributional approximation. Among the key deliverables will be (i) an iterative algorithm, called IVB, which is guaranteed to converge to a local minimizer of the disparity function; and (ii) the VB method, which provides a set of clear and systematic steps for calculating VB-approximations for a wide range of Bayesian models. We will compare the VB-approximation to related and rival alternatives. Later in this Chapter, we will apply the VB method to an insightful toy problem, namely the multiplicative decomposition of a scalar.

3.1 Distributional Approximation

Tractability of the full Bayesian analysis—i.e. application of Bayes’ rule (2.2), normalization (2.4), marginalization (2.6), and evaluation of moments of posterior distributions (2.7)—is assured only for a limited class of models. Numerical integration can be used, but it is often computationally expensive, especially in higher dimensions.

The problem can be overcome by approximating the true posterior distribution by a distribution that is computationally tractable:

f (θ|D) ≈ A [f (θ|D)] ≡ f̃ (θ|D).   (3.1)

In Fig. 3.1, we interpret the task of distributional approximation as an operator, A. Once f (θ|D) is replaced by f̃ (θ|D) (3.1), then, notionally, all the inferential operations listed above may be performed tractably. Many approximation strategies have been developed in the literature. In this Chapter, we review some of those most relevant in signal processing. Note that in the off-line context of this Chapter, all distributional approximations will operate on the posterior distribution, i.e. after the update by Bayes’ rule (2.5). Hence, we will not need to examine the prior and observation models separately, and we will not need to invoke the principle of conjugacy (Section 2.2.3.1).

Fig. 3.1. Distributional approximation as an operator.

3.2 How to Choose a Distributional Approximation

It will be convenient to classify all distributional approximation methods into one of two types:

Deterministic distributional approximations: the approximation, f̃ (θ|D) (3.1), is obtained by application of a deterministic rule; i.e. f̃ (θ|D) is uniquely determined by f (θ|D). The following are deterministic methods of distributional approximation: (i) certainty equivalence [30], which includes maximum likelihood and Maximum a posteriori (MAP) point inference [47] as special cases; (ii) the Laplace approximation [58]; (iii) the MaxEnt approximation [59, 60]; and (iv) fixed-form minimization [32]. The latter will be reviewed in Section 3.4.2, and the others in Section 3.5.

Stochastic distributional approximations: the approximation is developed via a random sample of realizations from f (θ|D). The fundamental distributional approximation in this class is the empirical distribution from nonparametric statistics [61] (see (3.59)). The main focus of attention is on the numerically efficient generation of realizations from f (θ|D). An immediate consequence of stochastic approximation is that f̃ (θ|D) will vary with repeated use of the method. We briefly review this class of approximations in Section 3.6.

Our main focus of attention will be the Variational Bayes (VB) approximation, which—as we will see in Section 3.3—is a deterministic, free-form distributional approximation.

3.2.1 Distributional Approximation as an Optimization Problem

In general, the task is to choose an optimal distribution, f̆ (θ|D) ∈ F, from the space, F, of all possible distributions. f̆ (θ|D) should be (i) tractable, and (ii) ‘close’ to the true posterior, f (θ|D), in some sense. The task can be formalized as an optimization problem requiring the following elements:

1. A subspace of distributions, Fc ⊂ F, such that all functions, f̃ ∈ Fc, are regarded as tractable. Here, f̃ (θ|D) denotes a ‘wildcard’ or candidate tractable distribution from the space Fc.

2. A proximity measure, ∆(f || f̃), between the true distribution and any tractable approximation. ∆(f || f̃) must be defined on F × Fc, such that it accepts two distributions, f ∈ F and f̃ ∈ Fc, as input arguments, yields a positive scalar as its value, and has f̃ = f (θ|D) as its (unique) minimizer.

Then, the optimal choice of the approximating function must satisfy

f̆ (θ|D) = arg min_{f̃∈Fc} ∆(f (θ|D) || f̃ (θ|D)),   (3.2)

where we denote the optimal distributional approximation by f̆ (θ|D).

3.2.2 The Bayesian Approach to Distributional Approximation

From the Bayesian point of view, choosing an approximation, f̃ (θ|D) (3.1), can be seen as a decision-making problem (Section 2.2.2). Hence, the designer chooses a loss function (2.9) (negative utility function [37]) measuring the loss associated with choosing each possible f̃ (θ|D) ∈ Fc, when the ‘true’ distribution is f (θ|D). In [62], a logarithmic loss function was shown to be optimal if we wish to extract maximum information from the data. Use of the logarithmic loss function leads to the Kullback-Leibler (KL) divergence [63] (also known as the cross-entropy) as an appropriate assignment for ∆ in (3.2):

∆(f (θ|D) || f̃ (θ|D)) = KL(f (θ|D) || f̃ (θ|D)).   (3.3)

The Kullback-Leibler (KL) divergence from f (θ|D) to f̃ (θ|D) is defined as:

KL(f (θ|D) || f̃ (θ|D)) = ∫Θ∗ f (θ|D) ln [f (θ|D) / f̃ (θ|D)] dθ = Ef(θ|D) [ln (f (θ|D) / f̃ (θ|D))].   (3.4)

It has the following properties:

1. KL(f (θ|D) || f̃ (θ|D)) ≥ 0;
2. KL(f (θ|D) || f̃ (θ|D)) = 0 iff f (θ|D) = f̃ (θ|D) almost everywhere;
3. KL(f (θ|D) || f̃ (θ|D)) = ∞ iff, on a set of positive measure, f (θ|D) > 0 and f̃ (θ|D) = 0;
4. KL(f (θ|D) || f̃ (θ|D)) ≠ KL(f̃ (θ|D) || f (θ|D)) in general, and the KL divergence does not obey the triangle inequality.

Given 4., care is needed in the syntax describing KL (·). We say that (3.4) is from f (θ|D) to f̃ (θ|D). This distinction will be important in what follows. For future purposes, we therefore distinguish between the two possible orderings of the arguments in the KL divergence:


KL divergence for Minimum Risk (MR) calculations, as defined in (3.4):

KLDMR ≡ KL(f (θ|D) || f̃ (θ|D)).   (3.5)

KL divergence for Variational Bayes (VB) calculations:

KLDVB ≡ KL(f̃ (θ|D) || f (θ|D)).   (3.6)

The notations KLDMR and KLDVB imply the order of their arguments, which are, therefore, not stated explicitly.
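The practical difference between the two orderings can be seen numerically. The sketch below computes both KL divergences by quadrature for an illustrative pair of densities, a two-component Gaussian mixture as the true f and a single Gaussian as the approximation; the asymmetry stated in Property 4 is immediately visible. The densities are assumptions for demonstration only.

```python
import numpy as np

# Numerical illustration of Property 4: the KL divergence (3.4) is not symmetric,
# so the MR ordering (3.5) and the VB ordering (3.6) generally differ.
x = np.linspace(-10, 10, 20001)
norm = lambda m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
f_true = 0.5 * norm(-2.0, 1.0) + 0.5 * norm(2.0, 1.0)   # illustrative true density
f_approx = norm(0.0, 2.0)                               # illustrative approximation

def kl(p, q):                                           # KL(p || q) by quadrature
    return np.trapz(p * np.log(p / q), x)

print(f"KL(f || f_approx) = {kl(f_true, f_approx):.3f}   (minimum-risk ordering, 3.5)")
print(f"KL(f_approx || f) = {kl(f_approx, f_true):.3f}   (VB ordering, 3.6)")
```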

3.3 The Variational Bayes (VB) Method of Distributional Approximation

The Variational Bayes (VB) method of distributional approximation is an optimization technique (Section 3.2.1) with the following elements:

The space of tractable distributions Fc is chosen as the space of conditionally independent distributions:

Fc ≡ { f̃(θ1, θ2|D) : f̃(θ1, θ2|D) = f̃(θ1|D) f̃(θ2|D) }.   (3.7)

A necessary condition for applicability of the VB approximation is therefore that Θ be multivariate.

The proximity measure is assigned as (3.6):

∆(f (θ|D) || f̃ (θ|D)) = KL(f̃ (θ|D) || f (θ|D)) = KLDVB.   (3.8)

Since the divergence, KLDMR (3.4, 3.5), is not used, the VB approximation, f̆ (θ|D), defined from (3.2) and (3.8), is not the minimum Bayes risk distributional approximation. A schematic illustrating the VB method of distributional approximation is given in Fig. 3.2.

3.3.1 The VB Theorem

Theorem 3.1 (Variational Bayes). Let f (θ|D) be the posterior distribution of multivariate parameter, θ. The latter is partitioned into q sub-vectors of parameters:

θ = [θ′1, θ′2, . . . , θ′q]′.   (3.9)

Let f̃ (θ|D) be an approximate distribution restricted to the set of conditionally independent distributions for θ1, θ2, . . . , θq:

f̃ (θ|D) = f̃ (θ1, θ2, . . . , θq|D) = ∏qi=1 f̃ (θi|D).   (3.10)

Fig. 3.2. Schematic illustrating the VB method of distributional approximation. The minimum Bayes’ risk approximation is also illustrated for comparison.

Then, the minimum of KLDVB, i.e.

f̆ (θ|D) = arg min_{f̃(·)} KL(f̃ (θ|D) || f (θ|D)),   (3.11)

is reached for

f̆ (θi|D) ∝ exp(Ef̆(θ/i|D) [ln (f (θ, D))]), i = 1, . . . , q,   (3.12)

where θ/i denotes the complement of θi in θ, and f̆(θ/i|D) = ∏qj=1, j≠i f̆ (θj |D).

We will refer to f̆ (θ|D) (3.11) as the VB-approximation, and f̆ (θi|D) (3.12) as the VB-marginals.

Proof: KLDVB can be rewritten, using the definition (3.4), as follows:

KL(f̃ (θ|D) || f (θ|D)) =
= ∫Θ∗ f̃ (θi|D) f̃ (θ/i|D) ln [ f̃ (θi|D) f̃ (θ/i|D) / f (θ|D) · f (D) / f (D) ] dθ
= ∫Θ∗ f̃ (θi|D) f̃ (θ/i|D) ln f̃ (θi|D) dθ − ∫Θ∗ f̃ (θi|D) f̃ (θ/i|D) ln f (θ, D) dθ + ∫Θ∗ f̃ (θi|D) f̃ (θ/i|D) [ln f̃ (θ/i|D) + ln f (D)] dθ
= ∫Θ∗i f̃ (θi|D) ln f̃ (θi|D) dθi + ln f (D) + ηi − ∫Θ∗i f̃ (θi|D) [∫Θ∗/i f̃ (θ/i|D) ln f (θ, D) dθ/i] dθi.   (3.13)

Here, ηi = Ef̃(θ/i|D) [ln (f̃ (θ/i|D))].

For any known non-zero scalars, ζi ≠ 0, i = 1, . . . , q, it holds that

KL(f̃ (θ|D) || f (θ|D)) =
= ∫Θ∗i f̃ (θi|D) ln f̃ (θi|D) dθi + ln f (D) + ηi − ∫Θ∗i f̃ (θi|D) ln [ (ζi/ζi) exp Ef̃(θ/i|D) [ln f (θ, D)] ] dθi
= ∫Θ∗i f̃ (θi|D) ln { f̃ (θi|D) / [ (1/ζi) exp Ef̃(θ/i|D) [ln f (θ, D)] ] } dθi + ln f (D) − ln (ζi) + ηi.   (3.14)

(3.14) is true ∀i ∈ {1, . . . , q}. We choose each ζi, i = 1, . . . , q respectively, as the following normalizing constant for exp E·[·] in the denominator of (3.14):

ζi = ∫Θ∗i exp Ef̃(θ/i|D) [ln f (θ, D)] dθi, i = 1, . . . , q.

Then, the last equality in (3.14) can be rewritten in terms of a KL divergence, ∀i ∈ {1, . . . , q}:

KL(f̃ (θ|D) || f (θ|D)) = KL( f̃ (θi|D) || (1/ζi) exp Ef̃(θ/i|D) [ln f (θ, D)] ) + ln f (D) − ln (ζi) + ηi.   (3.15)

The only term on the right-hand side of (3.15) dependent on f̃ (·) is the KL divergence. Hence, minimization of (3.15) with respect to f̃ (θi|D), ∀i ∈ {1, . . . , q}, keeping f̃ (θ/i|D) fixed, is achieved by minimization of the first term. Invoking non-negativity (Property 1) of the KL divergence (Section 3.2.2), the minimum of the first term is zero. The minimizer is almost surely f̃ (θi|D) = f̆ (θi|D) ∝ exp(Ef̆(θ/i|D) [ln (f (θ, D))]), i.e. (3.12), via the second property of the KL divergence (Section 3.2.2).

We note the following:

• The VB-approximation (3.11) is a deterministic, free-form distributional approximation, as asserted in Section 3.2. The term in italics refers to the fact that no functional form for f̃ is prescribed a priori.

• The posterior distribution of the parameters, f (θ|D), and the joint distribution, f (θ, D), differ only in the normalizing constant, ζ (2.4), as seen in (2.5). Furthermore, ζ is independent of θ. Hence, (3.12) can also be written in terms of ln f (θi, θ/i|D), in place of ln f (θi, θ/i, D). We prefer to use the latter, as it emphasizes the fact that the normalizing constant does not need to be known, and, in fact, the VB method can be used with improper distributions.

• Theorem 3.1 can be proved in many other ways. See, for example, [26] and [29]. Our proof was designed so as to use only basic probabilistic calculus, and the stated properties of the KL divergence.

• Uniqueness of the VB-approximation (3.12)—i.e. uniqueness of the minimizer of the KL divergence (3.11)—is not guaranteed in general [64]. Therefore, an extra test for uniqueness may be required. This will be seen in some of the applications addressed later in the book.

• We emphasize the fact that the VB-approximation, (3.10) and (3.12), i.e.

f̆ (θ|D) = f̆ (θ1, θ2, . . . , θq|D) = ∏qi=1 f̆ (θi|D),   (3.16)

enforces posterior conditional independence between the partitioned parameters. Hence:

– the VB approximation can only be used for multivariate models;
– cross-correlations between the θi’s are not present in the approximation.

The degree of partitioning, q, must therefore be chosen judiciously. The larger it is, the more correlation between parameters will be lost in approximation. Hence, the achieved minimum of KLDVB will be greater (3.11), and the approximation will be poorer. The guiding principle must be to choose q sufficiently large to achieve tractable VB-marginals (3.12), but no larger.

Remark 3.1 (Lower bound on the marginal of D via the VB-approximation). The VB-approximation is often interpreted in terms of a lower bound on f(D), the marginal distribution of the observed data [24, 25]. For an arbitrary approximating distribution, f̃ (θ|D), it is true that

ln f (D) = ln ∫Θ∗ f (θ, D) dθ = ln ∫Θ∗ [f̃ (θ|D) / f̃ (θ|D)] f (θ, D) dθ
≥ ∫Θ∗ f̃ (θ|D) ln [ f (θ|D) f (D) / f̃ (θ|D) ] dθ ≡ ln f̃ (D),   (3.17)

ln f̃ (D) = ln f (D) − KL(f̃ (θ|D) || f (θ|D)),   (3.18)

using Jensen’s inequality [24]. Minimizing the KL divergence on the right-hand side of (3.17)—e.g. using the result of Theorem 3.1—the error in the approximation is minimized.

The main computational problem of the VB-approximation (3.12) is that it is not given in closed form. For example, with q = 2, we note from (3.12) that f̆ (θ1|D) is needed for evaluation of f̆ (θ2|D), and vice-versa. A solution of (3.12) is usually found iteratively. In such a case the following general result can be established.

Algorithm 1 (Iterative VB (IVB) algorithm). Consider the q = 2 case for convenience, i.e. θ = [θ′1, θ′2]′. Then cyclic iteration of the following steps, n = 2, 3, . . ., monotonically decreases KLDVB (3.6) in (3.11):

1. Compute the current update of the VB-marginal of θ2 at iteration n, via (3.12):

f [n] (θ2|D) ∝ exp ∫Θ∗1 f [n−1] (θ1|D) ln f (θ1, θ2, D) dθ1.   (3.19)

2. Use the result of the previous step to compute the current update of the VB-marginal of θ1 at iteration n, via (3.12):

f [n] (θ1|D) ∝ exp ∫Θ∗2 f [n] (θ2|D) ln f (θ1, θ2, D) dθ2.   (3.20)

Here, the initializer, f [1] (θ1|D), may be chosen freely. Convergence of the algorithm to fixed VB-marginals, f [∞] (θi|D), ∀i, was proven in [26].

This Bayesian alternating algorithm is clearly reminiscent of the EM algorithm from classical inference [21], which we will review in Section 3.4.4. In the EM algorithm, maximization is used in place of one of the expectation steps in Algorithm 1. For this reason, Algorithm 1 is also known as (i) an ‘EM-like algorithm’ [19], (ii) the ‘VB algorithm with E-step and M-step’ [26], which is misleading, and (iii) the ‘Variational EM (VEM)’ algorithm [23]. We will favour the nomenclature ‘IVB algorithm’.

Algorithm 1 is an example of a gradient descent algorithm, using the natural gradient technique [65]. In general, the algorithm requires q steps—one for each θi, i = 1, . . . , q—in each iteration.

3.3.2 The VB Method of Approximation as an Operator

The VB method of approximation is a special case of the distributional approximation expressed in operator form in Fig. 3.1. Therefore, in Fig. 3.3, we represent the VB method of approximation via the operator V. It is the principal purpose of this book to examine the consequences of replacing A by V.

Fig. 3.3. The VB method of distributional approximation, represented as an operator, V, for q = 2. The operator ‘×’ denotes multiplication of distributions.

The conditional independence enforced by the VB-approximation (3.3) has the convenient property that marginalization of the VB-approximation is achieved simply via selection of the relevant VB-marginal(s). The remaining VB-marginals are then ignored, corresponding to the fact that they integrate to unity in (3.16). Graphically, we simply ‘drop off’ the VB-marginals of the marginalized parameters from the inferential schematic, as illustrated in Fig. 3.4. We call this VB-marginalization. Throughout this book, we will approximate the task of (intractable) marginalization in this way.

Fig. 3.4. VB-marginalization via the VB-operator, V, for q = 2.

3.3.3 The VB Method

In this Section, we present a systematic procedure for applying Theorem 3.1 in problems of distributional approximation for Bayesian parametric inference. It is specifically this 8-step procedure—represented by the flowchart in Fig. 3.5—that we will be invoking from now on, when we refer to the Variational Bayes (VB) method. Our aim is to formulate the VB method with the culture and the needs of signal processing in mind.

Step 1: Choose a Bayesian (probability) model: Construct the joint distribution of model parameters and observed data, f (θ, D) (2.2). This step embraces the choice of an observation model, f(D|θ) (2.3), and a prior on its parameters, f(θ). Recall, from Section 3.3.1, that the VB method is applicable to improper joint distributions. We assume that analytical marginalization of the posterior distribution (2.6) is intractable, as is evaluation of posterior moments (2.7). This creates a ‘market’ for the VB-approximation which follows.

Step 2: Partition the parameters: Partition θ into q sub-vectors (3.9). For convenience, we will assume that q = 2 in this Section. Check if

ln f (θ1, θ2, D) = g (θ1, D)′ h (θ2, D) , (3.21)

where g (θ1, D), and h (θ2, D) are p-dimensional vectors (p < ∞) of compatible dimension. If (3.21) holds, the joint distribution, f(θ, D), is said to be a member of the separable-in-parameters family.

This step is, typically, the crucial test which must be satisfied if a Bayesian model is to be amenable to VB approximation. If the logarithm of the joint distribution cannot be written in the form (3.21)—i.e. as a scalar product of g (θ1, D) and h (θ2, D)—the VB method will not be tractable. This will become evident in the following steps. In such a case, the model must be reformulated or approximated by other means.

Step 3: Write down the VB-marginals: Application of Theorem 3.1 to (3.21) is straightforward. The VB-marginals are:

f̆ (θ1|D) ∝ exp(Ef̆(θ2|D) [ln f (θ1, θ2, D)]) ∝ exp(g (θ1, D)′ ĥ (θ2, D)),   (3.22)

f̆ (θ2|D) ∝ exp(Ef̆(θ1|D) [ln f (θ1, θ2, D)]) ∝ exp(ĝ (θ1, D)′ h (θ2, D)).   (3.23)

The induced expectations are (2.7)

ĝ (θ1, D) ≡ Ef̆(θ1|D) [g (θ1, D)],   (3.24)

ĥ (θ2, D) ≡ Ef̆(θ2|D) [h (θ2, D)].   (3.25)

Step 4: Identify standard distributional forms: Identify the functional forms of (3.22) and (3.23) as those of standard parametric distributions. For this purpose, we take the expectations, ĝ (θ1, D) and ĥ (θ2, D), as constants in (3.22) and (3.23) respectively. These standard distributions are denoted by

f̆ (θ1|D) ≡ f(θ1 | ar1),   (3.26)

f̆ (θ2|D) ≡ f(θ2 | br2),   (3.27)

where ar1 and br2 will be called the shaping parameters of the respective VB-marginals. They depend on the arguments of those VB-marginals as follows:

a(j) = a(j)(ĥ, D), j = 1, . . . , r1,   (3.28)

b(j) = b(j)(ĝ, D), j = 1, . . . , r2,   (3.29)

where ĝ and ĥ are shorthand notation for (3.24) and (3.25) respectively. This step can be difficult in some situations, since it requires familiarity with the catalogue of standard distributions. Note that the form of the VB-marginals yielded by Step 3. may be heavily disguised versions of the standard form. In the sequel, we will assume that standard parametric distributions can be identified. If this is not the case, we can still proceed using symbolic or numerical integration, or, alternatively, via further approximation of the VB-marginals generated in Step 3.

Step 5: Formulate necessary VB-moments: Typically, shaping parameters (3.28) and (3.29) are functions of only a subset of the expectations (3.24), (3.25), being ĝ (θ1), where ĝ = [ĝi, i ∈ Ig ⊆ {1, . . . , p}], and ĥ (θ2), ĥ = [ĥj , j ∈ Ih ⊆ {1, . . . , p}]. These necessary moments are, themselves, functions of the shaping parameters, (3.28) and (3.29), and are typically listed in tables of standard parametric distributions:

ĝ (θ1) = g(ar1),   (3.30)

ĥ (θ2) = h(br2).   (3.31)

These necessary moments, (3.30) and (3.31), will be what we refer to as the VB-moments.

The set of VB-equations (3.28)–(3.31) fully determines the VB-approximation (3.12). Hence, any solution of this set—achieved by any technique—yields the VB-approximation (3.12).

Step 6: Reduce the VB-equations: Reduce the equations (3.28)–(3.31) to a set providing an implicit solution for a reduced number of unknowns. The remaining unknowns therefore have an explicit solution. Since the shaping parameters, (3.28) and (3.29), are explicit functions of the VB moments, (3.30) and (3.31), and vice versa, we can always reformulate the equations in terms of either shaping parameters alone, or VB-moments alone, simply by substitution. Usually, there will be a choice in how far we wish to go in reducing the VB-equations. The choice will be influenced by numerical considerations, namely, an assessment of the computational load and convenience associated with solving a reduced set of VB-equations. The reduction techniques will be problem-specific, and will typically require some knowledge of solution methods for sets of non-linear equations. If no reduction is possible, we can proceed to the next step.

In rare cases, a full analytical solution of the VB-equations (3.28)–(3.31) can be found. This closed-form solution must be tested to see if it is, indeed, the global minimizer of (3.11). This is because KLDVB (3.6) can exhibit local minima or maxima, as well as saddle points. In all cases, (3.12) is satisfied [64].


Step 7: Run the IVB Algorithm (Algorithm 1): The reduced set of VB-equations must be solved iteratively. Here, we can exploit Algorithm 1, which guides us in the order for evaluating the reduced equations. Assuming iteration on the unreduced VB-equations (3.28)–(3.31) for convenience, then, at the nth iteration, we evaluate the following:

Shaping parameters of f [n](θ2 | b[n]r2):

ĝ (θ1)[n−1] = g(a[n−1]r1),   (3.32)
(b(j))[n] = b(j)(ĝ (θ1)[n−1], D), j = 1, . . . , r2.   (3.33)

Shaping parameters of f [n](θ1 | a[n]r1):

ĥ (θ2)[n] = h(b[n]r2),   (3.34)
(a(j))[n] = a(j)(ĥ (θ2)[n], D), j = 1, . . . , r1.   (3.35)

Special care is required when choosing initial shaping parameters a[1]r1 (3.28) (the remaining unknowns do not have to be initialized). In most of the literature in this area, e.g. [20, 24], these are chosen randomly. However, we will demonstrate that a carefully designed choice of initial values may lead to significant computational savings in the associated IVB algorithm (Chapter 4). In general, the IVB algorithm is initialized by setting all of the shaping parameters of just one of the q VB-marginals.

Step 8: Report the VB-marginals: Report the VB-approximation (3.16) in the form of shaping parameters, (3.28) and (3.29), and/or the VB-moments, (3.30) and (3.31). Note, therefore, an intrinsic convenience of the VB method: its output is in the form of the approximate marginal distributions and approximate posterior moments for which we have been searching.

Remarks on the VB method:

(i) The separable parameter requirement of (3.21) forms a family of distributions that is closely related to the exponential family with hidden variables (also known as ‘hidden data’) [26, 65], which we will encounter in Section 6.3.3. This latter family of distributions is revealed via the assignment:

g (θ1, D) = g (θ1) .

In this case, θ2 constitutes the hidden variable in (3.21). In the VB method, no formal distinction is imposed between the parameters and the hidden data. Family (3.21) extends the exponential family with hidden variables by allowing dependence of g (θ1) on D.


(ii) Requirement (3.21) may not be the most general case for which the VB theorem can be applied. However, if a distribution cannot be expressed in this form, all subsequent operations are far less systematic: i.e. the VB method as defined above cannot be used.

(iii) Step 3 of the VB-method makes clear how easy it is to write down the functional form of the VB-marginals. All that is required is to expand the joint distribution into separable form (3.21) and then to use a ‘cover-up rule’ to hide all terms associated with θ2, immediately revealing f̆ (θ1|D) (3.22). When q > 2, we cover up all terms associated with θ2, . . . , θq. The same is true when writing down f̆ (θ2|D), etc. This ‘cover-up’ corresponds to the substitution of those terms by their VB-moments (Step 5).

(iv) The IVB algorithm works by propagating VB-statistics from each currently-updated VB-marginal into all the others. The larger the amount of partitioning, q (3.9), the less correlation is available to ‘steer’ the update towards a local minimum of KLDVB (3.11). Added to this is the fact that the number of IVB equations—of the type (3.19) and (3.20)—increases as q. For these reasons, convergence of the IVB algorithm will be negatively affected by a large choice of q. Once again, the principle should be to choose q no larger than is required to ensure tractable VB-marginals (i.e. standard forms with available VB-moments).

(v) In the remainder of this book, we will follow the steps of the VB method closely, but we will not always list all the implied mathematical objects. Notably, we will often not define the auxiliary functions g (θ1) and h (θ2) in (3.21).

(vi) The outcome of the VB method is a set of evaluated shaping parameters, (3.28) and (3.29). It should be remembered, however, that these parameters fully determine the VB-marginals, and so the VB method reports distributions on all considered unknowns. If we are interested in moments of these distributions, some of them may have been provided by the set of necessary VB-moments, (3.30) and (3.31), while others can be evaluated via the shaping parameters using standard results. Note that VB-moments alone will usually not be sufficient for full description of the posteriors.

(vii) If the solution was obtained using the IVB algorithm (Algorithm 1), then we do not have to be concerned with local maxima or saddle points, as is the case for an analytical solution (see remark in Step 6. above). This is because the IVB algorithm is a gradient descent method. Of course, the algorithm may still converge to a local minimum of KLDVB (3.6).

(viii) The flow of control for the VB method is displayed in Fig. 3.5.

3.3.4 The VB Method for Scalar Additive Decomposition

Consider the model for scalar decomposition which we introduced in Section 1.3.2. We now verify the 8 steps of the VB method in this case.

Step 1: From (1.17), θ = [m,ω]′, and the joint distribution is

f (m, ω, d|ε) ∝ ω^α exp(−(1/2)(d − m)² ω − (1/2) m² γω − βω),

Fig. 3.5. Flowchart of the VB method.

where ε = (α, β, γ)′, which, for conciseness, we will not show in the following equations.

Step 2: The form (3.21) is revealed for assignments θ1 = m, θ2 = ω, and

g (m, d) = [α, −(1/2)(d² + 2β), md, −(1/2)(1 + γ)m²]′,

h (ω, d) = [ln(ω), ω, ω, ω]′.

Step 3: (3.22)–(3.23) immediately have the form:


f̆ (m|d) ∝ exp(−(1/2) ω̂ (−2 m d + m² (1 + γ))),   (3.36)

f̆ (ω|d) ∝ exp(α ln ω − (1/2) ω (d² + 2β − 2 d m̂ + m̂² (1 + γ))).   (3.37)

Step 4: (3.36) can be easily recognized to be Normal: f̆ (m|d) = N (a1, a2). Here, the shaping parameters are a1 (the mean) and a2 (the variance). Similarly, (3.37) is in the form of the Gamma distribution: f̆ (ω|d) = G (b1, b2), with shaping parameters b1 and b2. The shaping parameters are assigned from (3.36) and (3.37), as follows:

a1 = (1 + γ)−1 d,
a2 = ((1 + γ) ω̂)−1,
b1 = α + 1,
b2 = β + (1/2)[(1 + γ) m̂² − 2 m̂ d + d²].

Step 5: The required VB-moments are summarized in (1.21). Note that only moments ĥ1, ĝ3 and ĝ4 are required; i.e. Ig = {1} and Ih = {3, 4}.

Step 6: Equations (1.21) can be analytically reduced to a single linear equation which yields the solution in the form of (1.22).

Step 7: In the case of the first variant of the scalar decomposition model (Section 1.3), the set (1.14) can be reduced to a cubic equation. The solution is far more complicated, and, furthermore, we have to test which of the roots minimizes (3.11).

If an iterative solution is required—as was the case in the first variant of the model (Section 1.3)—the VB equations are evaluated in the order given by (3.32)–(3.35); i.e. m̂ and m̂² are evaluated in the first step, and ω̂ in the second step. In this case, we can initialize a[1]1 = d, and a[1]2 = (2φ)−1.

Step 8: Report shaping parameters a1, a2, b1, b2 and/or the VB-moments (1.21) if they are of interest.
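A minimal IVB sketch for this model, following the evaluation order (3.32)–(3.35) and the shaping parameters of Step 4. It assumes the shape/rate parameterization of G(b1, b2), so that ω̂ = b1/b2, and takes the second VB-moment of m under N(a1, a2) as a2 + a1²; the data value and prior constants are illustrative assumptions.

```python
import numpy as np

# IVB iteration for the scalar decomposition model of Section 3.3.4, using the
# shaping parameters of Step 4. Gamma G(b1, b2) is taken as shape/rate, so its
# mean is b1/b2. Data value and prior constants below are illustrative.
d, alpha, beta, gamma = 1.5, 2.0, 2.0, 0.1

a1, a2 = d, 1.0                                   # initial shaping parameters of f(m|d)
for n in range(20):
    # VB-moments of m under N(a1, a2): mean and second moment
    m_hat, m2_hat = a1, a2 + a1**2
    # shaping parameters of f(omega|d) = G(b1, b2), cf. Step 4
    b1 = alpha + 1.0
    b2 = beta + 0.5 * ((1.0 + gamma) * m2_hat - 2.0 * m_hat * d + d**2)
    # VB-moment of omega, then shaping parameters of f(m|d) = N(a1, a2)
    omega_hat = b1 / b2
    a1 = d / (1.0 + gamma)
    a2 = 1.0 / ((1.0 + gamma) * omega_hat)

print(f"f(m|d) = N({a1:.3f}, {a2:.3f}),  f(omega|d) = G({b1:.1f}, {b2:.3f})")
```

Consistent with Step 6, a1 does not depend on ω̂ here, so the iteration settles after a single pass; the loop is kept only to mirror the general IVB flow.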

3.4 VB-related Distributional Approximations

3.4.1 Optimization with Minimum-Risk KL Divergence

In Section 3.2.2, we explained that minimization of KLDMR (3.5) provides the minimum Bayes risk distributional approximation. Optimization of KLDMR under the assumption of conditional independence (3.10) has the following solution.

Remark 3.2 (Conditional independence approximation under KLDMR). Consider minimization of the following KL divergence for any i ∈ {1, . . . , q}:

f̆ (θi|D) = arg min_{f̃(θi|D)} KL(f (θ|D) || f̃ (θi|D) f̃ (θ/i|D))   (3.38)
= arg min_{f̃(θi|D)} ∫Θ∗i ∫Θ∗/i f (θi, θ/i|D) ln [1 / f̃ (θi|D)] dθ/i dθi
= arg min_{f̃(θi|D)} KL(f (θi|D) || f̃ (θi|D)) = f (θi|D).

Hence, the best conditionally independent approximation of the posterior under KLDMR is the product of the analytical marginals. This, in general, will differ from the VB-approximation (3.12), as suggested in Fig. 3.2.

While this result is intuitively appealing, it may nevertheless be computationally intractable. Recall that the need for an approximation arose when operations such as normalization and marginalization on the true posterior distribution proved intractable. This situation was already demonstrated in the first model for scalar decomposition (Section 1.7), and will be encountered again.

3.4.2 Fixed-form (FF) Approximation

Another approach to KLDMR minimization is to choose Fc = Fβ in (3.7), being a family of tractable parametric distributions, with members f̃ (θ|D) ≡ f0 (θ|β). Here, the approximating family members are indexed by an unknown (shaping) parameter, β, but their distributional form, f0(·), is set a priori. The optimal approximation, f̆ (θ|D) = f0(θ|β̂), is then determined via

β̂ = arg minβ KL(f (θ|D) || f0 (θ|β)).   (3.39)

In many applications, measures other than KLDMR (3.39) are used for specific problems. Examples include the Levy, chi-squared and L2 norms. These are reviewed in [32].
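A minimal fixed-form sketch of (3.39): the shaping parameter β = (µ, ln σ) of a Gaussian f0(θ|β) is fitted to an illustrative skewed target by minimizing KLDMR evaluated on a grid. The target density and optimizer choice are assumptions for demonstration only.

```python
import numpy as np
from scipy.optimize import minimize

# Fixed-form sketch of (3.39): fit a Gaussian f0(theta|beta), beta = (mu, log sigma),
# to an illustrative Gamma-shaped target by minimizing KLD_MR by quadrature.
x = np.linspace(1e-4, 20, 20001)
f = x * np.exp(-x) / np.trapz(x * np.exp(-x), x)          # target density f(theta|D)

def kld_mr(beta):
    mu, log_sigma = beta
    sigma = np.exp(log_sigma)
    f0 = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.trapz(f * np.log(f / np.maximum(f0, 1e-300)), x)

beta_hat = minimize(kld_mr, x0=np.array([1.0, 0.0])).x
print(f"mu = {beta_hat[0]:.3f}, sigma = {np.exp(beta_hat[1]):.3f}")   # ~ moment matching
```

For a Gaussian fixed form, the KLDMR minimizer matches the target's mean and variance, which the printed values confirm numerically.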

3.4.3 Restricted VB (RVB) Approximation

The iterative evaluation of the VB-approximation via the IVB algorithm (Algorithm 1) may be prohibitive, e.g. in the on-line scenarios which will follow in Chapters 6 and 7. Therefore, we now seek a modification of the original VB-approximation that yields a closed-form solution.

Corollary 3.1 (of Theorem 3.1: Restricted Variational Bayes (RVB)). Let f (θ|D) be the posterior distribution of multivariate parameter θ = [θ′1, θ′2]′, i.e. we consider a binary partitioning of the global parameter set, θ. Let f (θ2|D) be a posterior distribution of θ2 of fixed functional form. Let f̃ (θ|D) be a conditionally-independent approximation of f (θ|D) of the kind

f̃ (θ|D) = f̃ (θ1, θ2|D) = f̃ (θ1|D) f (θ2|D).   (3.40)

Then, the minimum of KLDVB (3.6)—i.e. KL(f̃ (θ|D) || f (θ|D))—is reached for

f̆ (θ1|D) ∝ exp(Ef(θ2|D) [ln (f (θ, D))]).   (3.41)

Proof: Follows from the proof of Theorem 3.1.

Note that Corollary 3.1 is equivalent to the first step of the IVB algorithm (Algorithm 1). However, with distribution f (θ2|D) being known, the equation (3.41) now constitutes a closed-form solution. Furthermore, since it is chosen by the designer, its moments, required for (3.41), will be available. The RVB approximation can greatly reduce the computational load needed for distributional approximation, since there is now no IVB algorithm, i.e. no iterations are required. Note however that, since f (θ2|D) is fixed, the minimum value of KLDVB achieved by the RVB-approximation (3.41) will be greater than or equal to that achieved by the VB-approximation (3.12). In this sense, RVB can be seen as a sub-optimal approximation.

The quality of the RVB approximation (3.41) strongly depends on the choice of the fixed approximating distribution f (θ2|D) in (3.40). If f (θ2|D) is chosen close to the VB-optimal posterior (3.12), i.e. f (θ2|D) ≈ f̆ (θ2|D) (3.16), then just one step of the RVB algorithm can replace many iterations of the original IVB algorithm. In Section 3.4.3.2, we will outline one important strategy for choice of f (θ2|D). First, however, we review the steps of the VB method (Section 3.3.3), specialized to the RVB approximation.

3.4.3.1 Adaptation of the VB method for the RVB Approximation

The choice of the restricted distribution f(θ\1|D) can be a powerful tool, not only in reducing the number of IVB iterations, but any time the VB method yields excessively complicated results (e.g. difficult evaluation of VB-moments). Various restrictions can be imposed, such as neglecting some of the terms in (3.22) and (3.23). One such scenario for RVB approximation is described next, using the steps designed for the VB method (Section 3.3.3).

Step 1: We have assumed that the chosen Bayesian model is intractable. If the analytical form of one of the marginals is available, we can use it as the restricted marginal f̃ (θ_{\1}|D).

Step 2: We assume the Bayesian model is separable in parameters (3.21).

Step 3: Write down the VB-marginals. In this step, we replace some of the VB-marginals by a known standard form. For simplicity, we take q = 2, and

f̃ (θ2|D) ≡ f (θ2|b_{r2}).

Since this distribution is known, we can immediately evaluate its moments,

h (θ2, D) = h (b_{r2}).

Step 4: Identify standard forms. Only the form of f̃ (θ1|D) = f (θ1|a_{r1}) needs to be identified. Its shaping parameters can be evaluated in closed form as follows:

a(j) = a(j) (h, D),   j = 1, . . . , r1.

Note that the form of the shaping parameters is identical to the VB-solution (3.28); only the values of the moments h have changed.

Step 5: Note that no moments of θ1 are required, i.e. Ig = ∅.

Steps 6–7: Do not arise.

Step 8: Report the shaping parameters, a_{r1}, b_{r2}.

Remark 3.3 (Partial Restriction of the VB Method). Consider the multiple partitioning, θ = [θ1′, θ2′, . . . , θq′]′, q > 2, of the global parameter set, θ (3.9). There exists a range of choices for partial free-form approximation of f (θ|D), between the extreme cases of (full) VB-approximation (3.16) and the RVB-approximation (3.41). Specifically, we can search for an optimal approximation, minimizing KLDVB (3.8), within the class

f̃ (θ|D) = ∏_{i∈I} f̃ (θi|D) f̃ (∪_{i∉I} θi|D),   (3.42)

where I ⊆ {1, . . . , q}. When I = {1, . . . , q}, the full VB approximation is generated via the VB method (3.16). When I is a singleton, the RVB approximation (3.41) is produced in closed form, i.e. without IVB cycles. In intermediate cases, the optimized distributional approximation is

f̃ (θ|D) = ∏_{i∈I} f̃ (θi|D) f̃ (∪_{i∉I} θi|D),   (3.43)

generated via a reduced set of IVB equations, where the unknown moments of ∪_{i∉I} θi in the (full) VB method have been replaced by the known moments of the fixed distribution, f̃ (∪_{i∉I} θi|D). In this manner, the designer has great flexibility in trading an optimal approximation of parameter subsets for a reduced number of IVB equations. This flexibility can be important in on-line scenarios where the computational load of the IVB algorithm must be strictly controlled. While all these choices (3.42) can be seen as restricted VB-approximations, we will reserve the term RVB-approximation for the closed-form choice (3.41).

3.4.3.2 The Quasi-Bayes (QB) Approximation

The RVB solution (3.41) holds for any tractable choice of distribution, f̃ (θ2|D). We seek a reasonable choice for this function, such that the minimum of KLDVB (3.6) achieved by the RVB-approximation approaches that achieved by the VB-approximation (3.11). Hence, we rewrite the KL divergence in (3.11) as follows, using (3.10):

KL( f̃ (θ|D) || f (θ|D) )
  = ∫_{Θ*} f̃ (θ1|D) f̃ (θ2|D) ln [ f̃ (θ1|D) f̃ (θ2|D) / ( f (θ1|θ2, D) f (θ2|D) ) ] dθ
  = ∫_{Θ*} f̃ (θ1|D) f̃ (θ2|D) ln [ f̃ (θ1|D) / f (θ1|θ2, D) ] dθ
    + ∫_{Θ2*} f̃ (θ2|D) ln [ f̃ (θ2|D) / f (θ2|D) ] dθ2.   (3.44)

We note that the second term in (3.44) is KL( f̃ (θ2|D) || f (θ2|D) ), which is minimized for the restricted assignment (see (3.40))

f̃ (θ2|D) ≡ f (θ2|D) = ∫_{Θ1*} f (θ|D) dθ1,   (3.45)

i.e. the exact marginal distribution of the joint posterior f (θ|D) (2.6). The global minimum of (3.44) with respect to f̃ (θ2|D) is not reached for this choice, since the first term in (3.44) is also dependent on f̃ (θ2|D). Nevertheless, we consider (3.45) to be the best analytical choice for the restricted assignment, f̃ (θ2|D), that we can make. It is also consistent with the minimizer of KLDMR (Remark 3.2). From (3.41) and (3.45), the Quasi-Bayes (QB) approximation is therefore

f̃ (θ1|D) ∝ exp( E_{f(θ2|D)} [ln f (θ, D)] ).   (3.46)

The name Quasi-Bayes (QB) was first used in the context of finite mixture models [32], to refer to this type of approximation. In [32], the marginal for θ1 was approximated by conditioning the joint posterior distribution on θ2, which was assigned as the true posterior mean of θ2:

θ̂2 = E_{f(θ2|D)} [θ2].   (3.47)

Returning, for a moment, to the RVB approximation of Corollary 3.1, we note that if ln f (θ1, θ2, D) is linear in θ2, then, using (3.46), the RVB approximation, f̃ (θ1|D), is obtained by replacing all occurrences of θ2 by its expectation, E_{f(θ2|D)} [θ2]. In this case, therefore, the RVB approximation is the following conditional distribution:

f̃ (θ1|D) ≡ f (θ1|D, θ̂2).

This corresponds to a certainty equivalence approximation for inference of θ1 (Section 3.5.1). The choice (3.45) yields (3.47) as the certainty equivalent for θ2, this being the original definition of the QB approximation in [32]. In this sense, the RVB setting for QB (3.46) is a generalization of the QB idea expressed in [32].
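The linear-in-θ2 claim above can be made explicit with a one-line calculation (added here as an illustration; the specific functions g and h are generic placeholders, not notation from the book):

```latex
% If ln f is linear in \theta_2, taking the expectation under f(\theta_2|D)
% amounts to substituting the posterior mean \widehat{\theta}_2:
\ln f(\theta_1,\theta_2,D) = g(\theta_1,D)\,\theta_2 + h(\theta_1,D)
\;\Longrightarrow\;
\mathrm{E}_{f(\theta_2|D)}\!\left[\ln f(\theta_1,\theta_2,D)\right]
  = g(\theta_1,D)\,\widehat{\theta}_2 + h(\theta_1,D)
  = \ln f(\theta_1,\widehat{\theta}_2,D).
```

Exponentiating and normalizing over θ1 then gives f̃ (θ1|D) ∝ f (θ1, θ̂2, D) ∝ f (θ1|θ̂2, D), i.e. exactly the conditional distribution at the certainty equivalent, as stated.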


3.4.4 The Expectation-Maximization (EM) Algorithm

The Expectation-Maximization (EM) algorithm is a well known algorithm for Maximum Likelihood (ML) estimation (2.11)—and, by extension, for MAP estimation (2.10)—of the subset θ2 of the model parameters θ = [θ1, θ2] [21]. Here, we follow an alternative derivation of EM via distributional approximations [22]. The task is to estimate the parameter θ2 by maximization of the (intractable) marginal posterior distribution:

θ̂2 = arg max_{θ2} f (θ2|D).

This task can be reformulated as an optimization problem (Section 3.2.1) for the constrained distributional family:

Fc = { f̃ (θ1, θ2|D) : f̃ (θ1, θ2|D) = f (θ1|D, θ2) δ(θ2 − θ̂2) }.

Here, δ(·) denotes the Dirac δ-function,

∫_X δ (x − x0) g (x) dx = g (x0),   (3.48)

if x ∈ X is a continuous variable, and the Kronecker function,

δ (x) = 1 if x = 0, and δ (x) = 0 otherwise,

if x is integer. We optimize over the family Fc with respect to KLDVB (3.6) [22]. Hence, we recognize that this method of distributional approximation is a special case of the VB approximation (Theorem 3.1), with the functional restrictions f̃ (θ1|D) = f (θ1|D, θ2) and f̃ (θ2|D) = δ(θ2 − θ̂2). The functional optimization for f̃ (θ1|D) is trivial since, in (3.12), all moments and expectations with respect to δ(θ2 − θ̂2) simply result in replacement of θ2 by θ̂2 in the joint distribution. The resulting distributional algorithm is then a cyclic iteration (alternating algorithm) of two steps:

Algorithm 2 (The Expectation-Maximization (EM) Algorithm).

E-step: Compute the approximate marginal distribution of θ1, at iteration i:

f̃^[i] (θ1|D) = f (θ1|D, θ̂2^[i−1]).   (3.49)

M-step: Use the approximate marginal distribution of θ1 from the E-step to update the certainty equivalent for θ2:

θ̂2^[i] = arg max_{θ2} ∫_{Θ1*} f̃^[i] (θ1|D) ln f (θ1, θ2, D) dθ1.   (3.50)

In the context of uniform priors (i.e. ML estimation (2.11)), it was proved in [66] that this algorithm monotonically increases the marginal likelihood, f (D|θ2), of θ2, and therefore converges to a local maximum [66].
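The alternation of (3.49) and (3.50) is easy to implement. The sketch below (an illustrative example, not taken from the book) runs EM for a two-component Gaussian mixture with known unit variances and equal weights; the latent component labels play the role of θ1 and the two component means play the role of θ2.

```python
import numpy as np

# Minimal EM sketch: estimate the two means of a 1-D Gaussian mixture
# (known unit variances, known equal weights). theta_1 = latent labels,
# theta_2 = (m0, m1); all names here are illustrative.
rng = np.random.default_rng(0)
D = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

m = np.array([-1.0, 1.0])                 # initial certainty equivalent for theta_2
for _ in range(100):
    # E-step: f(theta_1 | D, theta_2_hat) -- posterior responsibilities
    lik = np.exp(-0.5 * (D[:, None] - m[None, :]) ** 2)    # constants cancel below
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: maximize E[ln f(theta_1, theta_2, D)] over theta_2
    m_new = (resp * D[:, None]).sum(axis=0) / resp.sum(axis=0)
    if np.max(np.abs(m_new - m)) < 1e-8:
        m = m_new
        break
    m = m_new

print(m)    # approaches the generating means (-2, 3)
```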


3.5 Other Deterministic Distributional Approximations

3.5.1 The Certainty Equivalence Approximation

In many engineering problems, full distributions (2.2) are avoided. Instead, a point estimate, θ̂, is used to summarize the full state of knowledge expressed by the posterior distribution (Section 2.2.2).

The point estimate, θ̂ = θ̂ (D), can be interpreted as an extreme approximation of the posterior distribution, replacing f (θ|D) by a suitably located Dirac δ-function:

f (θ|D) ≈ f̃ (θ|D) = δ(θ − θ̂(D)),   (3.51)

where θ̂ is the chosen point estimate of parameter θ.

The approximation (3.51) is known as the certainty equivalence principle [30], and we have already encountered it in the QB (Section 3.4.3.2) and EM (Section 3.4.4) approximations. It remains to determine an optimal assignment for the point estimate. The Bayesian decision-theoretic framework for the design of point estimates was mentioned in Section 2.2.2, where, also, popular choices such as the MAP, ML and mean a posteriori estimates were reviewed.

3.5.2 The Laplace Approximation

This method is based on a local approximation of the posterior distribution, f (θ|D), around its MAP estimate, θ̂, using a Gaussian distribution. Formally, the posterior distribution (2.2) is approximated as follows:

f (θ|D) ≈ N(θ̂, H⁻¹).   (3.52)

θ̂ is the MAP estimate (2.10) of θ ∈ R^p, and H ∈ R^{p×p} is the (negative) Hessian matrix of the logarithm of the joint distribution, f (θ, D), with respect to θ, evaluated at θ = θ̂:

H = − [ ∂² ln f (θ, D) / (∂θi ∂θj) ]_{θ=θ̂},   i, j = 1, . . . , p.   (3.53)

The asymptotic error of approximation was studied in [31].
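A generic numerical recipe for (3.52)–(3.53) is to locate the MAP estimate by optimization and to evaluate the Hessian by finite differences. The following sketch is a hedged illustration (the function `laplace_approximation` and its arguments are our own constructions, not the book's); it recovers a correlated Gaussian exactly, which is a convenient self-check.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(log_joint, theta0, eps=1e-5):
    """Return (theta_MAP, H^{-1}) for the Gaussian approximation (3.52)."""
    res = minimize(lambda th: -log_joint(th), theta0)     # MAP via optimization
    theta_map = res.x
    p = theta_map.size
    H = np.zeros((p, p))                                   # negative Hessian (3.53)
    for i in range(p):
        for j in range(p):
            e_i, e_j = np.eye(p)[i] * eps, np.eye(p)[j] * eps
            H[i, j] = -(log_joint(theta_map + e_i + e_j)
                        - log_joint(theta_map + e_i - e_j)
                        - log_joint(theta_map - e_i + e_j)
                        + log_joint(theta_map - e_i - e_j)) / (4 * eps ** 2)
    return theta_map, np.linalg.inv(H)

# Self-check on a correlated 2-D Gaussian log-density (recovered exactly).
S = np.array([[2.0, 0.8], [0.8, 1.0]])
mu, cov = laplace_approximation(lambda th: -0.5 * th @ np.linalg.solve(S, th),
                                np.array([1.0, 1.0]))
print(mu, cov)    # mu ~ [0, 0], cov ~ S
```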

3.5.3 The Maximum Entropy (MaxEnt) Approximation

The Maximum Entropy Method of distributional approximation [60, 67–70] is a free-form method (Section 3.3.1), in common with the VB method, since a known distributional form is not stipulated a priori. Instead, the approximation f̃ (θ|D) ∈ Fc (Section 3.2.1) is chosen which maximizes the entropy,

H_{f̃} = − ∫_{Θ*} ln f̃ (θ|D) dF̃ (θ|D),   (3.54)


constrained by any known moments (2.8) of f:

mi = ĝi (θ) = E_{f(θ|D)} [gi (θ)] = ∫_{Θ*} gi (θ) f (θ|D) dθ.   (3.55)

In the context of MaxEnt, (3.55) are known as the mean constraints. The MaxEnt distributional approximation is of the form

f̃ (θ|D) ∝ exp[ − Σ_i αi(D) gi (θ) ],   (3.56)

where the αi are chosen—using, for example, the method of Lagrange multipliers for constrained optimization—to satisfy the mean constraints (3.55) and the normalization requirement for f̃ (θ|D). Since its entropy (3.54) has been maximized, (3.56) may be interpreted as the smoothest (minimally informative) distribution matching the known moments of f (3.55). The MaxEnt approximation has been widely used in solving inverse problems [71, 72], notably in reconstruction of non-negative data sets, such as in Burg's method for power spectrum estimation [5, 6] and in image reconstruction [67].
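For a single mean constraint on a discrete support, the Lagrange multiplier in (3.56) can be found by one-dimensional root finding. The sketch below is an assumed toy example (support, constraint value and function names are ours, not the book's).

```python
import numpy as np
from scipy.optimize import brentq

# MaxEnt on the support {0,...,10} with one mean constraint E[x] = 3,
# i.e. f(x) ∝ exp(-alpha * x); solve for the Lagrange multiplier alpha.
x = np.arange(11)
m_target = 3.0

def mean_of(alpha):
    w = np.exp(-alpha * x)
    return (w * x).sum() / w.sum()

alpha = brentq(lambda a: mean_of(a) - m_target, -5.0, 5.0)   # sign change bracket
f = np.exp(-alpha * x)
f /= f.sum()
print(alpha, (f * x).sum())     # the constrained mean is reproduced: ~3.0
```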

3.6 Stochastic Distributional Approximations

A stochastic distributional approximation maps f to a randomly-generated approximation, f̃ (Fig. 3.1), in contrast to all the methods we have reviewed so far, where f̃ is uniquely determined by f and the rules of the approximation procedure. The computational engine for stochastic methods is therefore the generation of an independent, identically-distributed (i.i.d.) sample set (i.e. a random sample),

θ^(i) ∼ f (θ|D),   (3.57)
θn = { θ^(1), . . . , θ^(n) }.   (3.58)

The classical stochastic distributional approximation is the empirical distribution [61],

f̃ (θ|D) = (1/n) Σ_{i=1}^n δ(θ − θ^(i)),   (3.59)

where δ(·) is the Dirac δ-function (3.48) located at θ^(i). The posterior moments (2.8) of f (θ|D) under the empirical approximation (3.59) are therefore

E_{f̃(θ|D)} [gj (θ)] = (1/n) Σ_{i=1}^n gj (θ^(i)).   (3.60)

Note that marginal distributions and measures are also generated with ease under approximation (3.59), via appropriate summations.
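As a small numerical illustration (an assumed example, not from the book), the empirical moments (3.60) of a Gamma "posterior" converge to the exact values:

```python
import numpy as np

# Empirical approximation (3.59)-(3.60) of a Gamma(3, scale=2) distribution,
# with moment functions g1(theta) = theta and g2(theta) = theta^2.
rng = np.random.default_rng(1)
theta_samples = rng.gamma(shape=3.0, scale=2.0, size=50_000)   # theta^(i) ~ f(theta|D)

mean_hat = theta_samples.mean()                 # (3.60) with g(theta) = theta
second_moment_hat = (theta_samples ** 2).mean()
print(mean_hat, second_moment_hat)              # close to the exact values 6 and 48
```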


For low-dimensional θ, it may be possible to generate the i.i.d. set θn using one of a vast range of standard stochastic sampling methods [34, 35]. The real challenge being addressed by modern stochastic sampling techniques is to generate a representative random sample (3.58) for difficult—notably high-dimensional—distributions. Markov-Chain Monte Carlo (MCMC) methods [73, 74] refer to a class of stochastic sampling algorithms that generate a correlated sequence of samples, θ^(0), θ^(1), θ^(2), . . . , θ^(k), . . ., from a first-order (Markov) kernel, f (θ^(k)|θ^(k−1), D). Under mild regularity conditions on f (·|·), θ^(k) ∼ fs (θ|D) as k → ∞, where fs (θ|D) is the stationary distribution of the Markov process with this kernel. This convergence in distribution is independent of the initialization, θ^(0), of the Markov chain [75]. Careful choice of the kernel can ensure that fs (θ|D) = f (θ|D). Hence, repeated simulation from the Markov process, i = 1, 2, . . ., with n sufficiently large, generates the required random sample (3.58) for construction of the empirical approximation (3.59). Typically, the associated computational burden is large, and can be prohibitive in cases of high-dimensional θ. Nevertheless, the very general and flexible way in which MCMC methods have been defined has helped to establish them as the gold standard for (Bayesian) distributional approximation. In the on-line scenario (i.e. Bayesian filtering, see Chapter 6), sequential Monte Carlo techniques, such as particle filtering [74], have been developed for recursive updating of the empirical distribution (3.59) via MCMC-based sampling.

3.6.1 Distributional Estimation

The problem of distributional approximation is closely related to that of distributional (e.g. density) estimation. In approximation—which is our concern in this book—the parametric distribution, f (θ|D), is known a priori, as emphasized in Fig. 3.1. This means that a known (i.e. deterministic) observation model, f (D|θ) (2.2), parametrized by a finite set of parameters, θ, forms part of our prior knowledge base, I (Section 2.2.1). In contrast, nonparametric inference [61] addresses the more general problem of an unknown distribution, f (D), on the space, D, of observations, D. Bayesian nonparametrics proceeds by distributing unknown f via a nonparametric prior: f ∼ F0 [76, 77]. The distribution is learned via i.i.d. sampling (3.58) from D, which in the Bayesian context yields a nonparametric posterior distribution, f | Dn ∼ Fn. An appropriate distributional estimate, f̂ (D), can then be generated. Once again, it is the empirical distribution, f̂ = f̃ (3.59), which is the basic nonparametric density estimator [61]. In fact, this estimate is formally justified as the posterior expected distribution under i.i.d. sampling, if F0 = D0, where D0 is the nonparametric Dirichlet process prior [76]. Hence, the stochastic sampling techniques reviewed above may validly be interpreted as distributional estimation techniques.

The MaxEnt method (Section 3.5.3) also has a rôle to play in distributional estimation. Given a set of sample moments, E_{f̃(D)} [gj (D)], where f̃ (D) is the empirical distribution (3.59) built from i.i.d. samples, Di ∈ D, then the MaxEnt distributional estimate is f̂ (D) ∝ exp[ − Σ_j αj (Dn) gj (D) ] (3.56).


For completeness, we note that the VB method (Section 3.3) is a strictly parametric distributional approximation technique, and has no rôle to play in distributional estimation.

3.7 Example: Scalar Multiplicative Decomposition

In previous Sections, we reviewed several distributional approximations and formulated the VB method. In this Section, we study these approximations for a simple model. The main emphasis is, naturally, on the VB method. The properties of the VB approximation will be compared to those of competing techniques.

3.7.1 Classical Modelling

We consider the following scalar model:

d = ax + e. (3.61)

Model (3.61) is over-parameterized, with three unknown parameters (a, x, e) explaining just one measurement, d. (3.61) expresses any additive-multiplicative decomposition of a real number. Separation of the 'signal', ax, from the 'noise', e, is not possible without further information. In other words, the model must be regularized. Towards this end, let us assume that e is distributed as N (0, re). Then,

f (d|a, x, re) = N (ax, re) , (3.62)

where the variance re is assumed to be known.

The likelihood function for this model, for d = 1 and re = 1, is displayed in the upper row of Fig. 3.6, in both surface plot (left) and contour plot (right) forms. The ML solution (2.11) is located anywhere in the manifold defined by the signal estimate:

ax = d.   (3.63)

This indeterminacy with respect to a and x will be known as scaling ambiguity, and will be encountered again in matrix decompositions in Chapters 4 and 5. Further regularization is clearly required.

3.7.2 The Bayesian Formulation

It can also be appreciated from Fig. 3.6 (upper-left) that the volume (i.e. integral) under the likelihood function is infinite. This means that f (a, x|d, re) ∝ f (d|a, x, re) f (a, x) is improper (unnormalizable) when the parameter prior f (a, x) is itself improper (e.g. uniform in R²). Prior-based regularization is clearly required in order to achieve a proper posterior distribution via Bayes' rule. Under the assignment,


f (a|ra) = N (0, ra) , (3.64)

f (x|rx) = N (0, rx) , (3.65)

the posterior distribution is

f (a, x|d, re, ra, rx) ∝ exp( −(1/2) (ax − d)²/re − (1/2) a²/ra − (1/2) x²/rx ).   (3.66)

In what follows, we will generally suppress the notational dependence on the known prior parameters. (3.66) is displayed in the lower row of Fig. 3.6, for d = 1, re = 1, ra = 10, rx = 20. The posterior distribution (3.66) is now normalizable (proper), with point maximizers (MAP estimates) as follows:

1. For d > re/√(ra rx), then

   x̂ = ± ( d √(rx/ra) − re/ra )^{1/2},   (3.67)
   â = ± ( d √(ra/rx) − re/rx )^{1/2}.   (3.68)

   Note that the product of the maxima is

   â x̂ = d − re/√(ra rx).   (3.69)

   Comparing (3.69) to (3.63), we see that the signal estimate has been shifted towards the coordinate origin. For the choice ra ≪ re and rx ≪ re, the prior strongly influences the posterior and is therefore said to be an informative prior (Section 2.2.3). For the choice ra ≫ re and rx ≫ re, the prior has negligible influence on the posterior and can be considered as non-informative. Scaling ambiguity (3.63) has been reduced, now, to a sign ambiguity, characteristic of multiplicative decompositions (Chapter 5).

2. For d ≤ re/√(ra rx), then x̂ = â = 0.

Clearly, then, the quantity dMAP = re/√(ra rx) constitutes an important inferential breakpoint. For d > dMAP, a non-zero signal is inferred, while for d ≤ dMAP, the observation is inferred to be purely noise.
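The MAP formulas (3.67)–(3.69) are easy to check numerically. The following convenience sketch (not from the book) uses the prior values quoted in the text, d = 1, re = 1, ra = 10, rx = 20, and verifies that the product of the joint MAP estimates equals d − re/√(ra rx):

```python
import numpy as np

d, re, ra, rx = 1.0, 1.0, 10.0, 20.0

d_map = re / np.sqrt(ra * rx)                     # inferential breakpoint (~0.07)
assert d > d_map                                  # so a non-zero signal is inferred
x_hat = np.sqrt(d * np.sqrt(rx / ra) - re / ra)   # (3.67), positive root
a_hat = np.sqrt(d * np.sqrt(ra / rx) - re / rx)   # (3.68), positive root

print(d_map, a_hat, x_hat)
print(a_hat * x_hat, d - re / np.sqrt(ra * rx))   # both ~0.9293, verifying (3.69)
```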

3.7.3 Full Bayesian Solution

The posterior distribution (3.66) is normalizable, but the normalizing constant cannot be expressed in closed form. Integration of (3.66) over x ∈ R yields the following marginal distribution for a:


[Figure 3.6 — panels: probability surface (left) and contour plot (right); rows: unregularized likelihood (top) and prior-regularized posterior (bottom); axes a and x.]

Fig. 3.6. Illustration of scaling ambiguity in the scalar multiplicative decomposition. Upper row: the likelihood function, f (d|a, x, re), for d = 1, re = 1 (dash-dotted line denotes the manifold of maxima). Lower row: the posterior distribution, f (a, x|d, re, ra, rx), for d = 1 and prior parameters re = 1, ra = 10, rx = 20. Cross-marks denote maxima.

f (a|d, re, ra, rx) ∝ (1/2) exp( −(1/2) (d² ra + a⁴ rx + a²) / (ra (a² rx + 1)) ) [ (π/2) ra (a² rx + 1) ]^{−1/2}.   (3.70)

The normalizing constant, ζa, for (3.70) is not available in closed form. Structural symmetry with respect to a and x in (3.61) implies that the marginal inference for x has the same form as (3.70).

The maximum of the posterior marginal (3.70) is reached for

â = ± [ ( −ra rx − 2re + √( ra rx (ra rx + 4d²) ) ) / (2 rx) ]^{1/2}   if d > √( re²/(ra rx) + re ) = dm,

â = 0   if d ≤ √( re²/(ra rx) + re ).   (3.71)

The same symbol, â, is used to denote the (distinct) joint (3.68) and marginal (3.71) MAP estimates. No confusion will be encountered. Both cases of (3.71) are illustrated in Fig. 3.7, for d = 1 (left) and d = 2 (right). The curves were normalized by numerical integration. Once again, there remains a sign ambiguity in the estimate of a.


[Figure 3.7 — two contour panels over the (x, a) plane.]

Fig. 3.7. Analytical marginals (for the distribution in Fig. 3.6). re = 1, ra = 10, rx = 20, for which case the inferential breakpoint is dm = 1.0025 (3.71). Both modes of solution are displayed: d = 1 < dm (left), d = 2 > dm (right).

The unavailability of ζa in closed form means that maximization (3.71) is the only operation that can be carried out analytically on the posterior marginal (3.70). Most importantly, the moments of the posterior must be evaluated using numerical methods. In this sense, (3.70) is intractable. Hence, we now seek its approximation using the VB method of Section 3.3.3.
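For reference, the numerical evaluation mentioned above amounts to simple quadrature. The sketch below (our own illustration, not from the book) normalizes the a-dependent part of (3.70)—which is written for re = 1, so a-independent constants can be dropped—and computes its first two moments on a grid:

```python
import numpy as np

d, ra, rx = 2.0, 10.0, 20.0          # (3.70) is written with r_e = 1

def unnormalised_marginal(a):
    expo = -0.5 * (d ** 2 * ra + a ** 4 * rx + a ** 2) / (ra * (a ** 2 * rx + 1.0))
    return np.exp(expo) * (a ** 2 * rx + 1.0) ** -0.5   # constants dropped

a_grid = np.linspace(-5.0, 5.0, 20_001)
da = a_grid[1] - a_grid[0]
f_a = unnormalised_marginal(a_grid)
f_a /= f_a.sum() * da                                   # numerical normalization (zeta_a)

print((a_grid * f_a).sum() * da)          # posterior mean of a (~0, by symmetry)
print((a_grid ** 2 * f_a).sum() * da)     # posterior second moment of a
```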

Remark 3.4 (Multivariate extension). Extension of the model (3.61) to the multivariate case yields the model known as the factor analysis model (Chapter 5). The full Bayesian solution presented in this Section can, indeed, be extended to this multivariate case [78]. The multivariate posterior distributions suffer the same difficulties as those of the scalar decomposition. Specifically, the normalizing constants of the marginal posteriors and, consequently, their moments must be evaluated using numerical methods, such as MCMC approximations (Section 3.6) [79]. We will study the VB-approximation of related matrix decompositions in Chapters 4 and 5.

3.7.4 The Variational Bayes (VB) Approximation

In this Section, we follow the VB method of Section 3.3.3, in order to obtain approximate inferences of the parameters a and x.

Step 1: The joint distribution is already available in (3.66).

Step 2: Since there are only two parameters in (3.66), the only available partitioning is θ1 = a, θ2 = x. All terms in the exponent of (3.66) are a linear combination of a and x. Hence, (3.66) is in the form (3.21), and the VB-approximation will be straightforward.

Step 3: Application of the VB theorem yields

f (a|d) ∝ exp( −(1/(2 re ra)) ( (ra x̂² + re) a² − 2 a x̂ d ra − d² ra ) ),

f (x|d) ∝ exp( −(1/(2 re rx)) ( (rx â² + re) x² − 2 x â d rx − d² rx ) ).


Step 4: The distributions in step 3 are readily recognized to be Normal:

f (a|d) = N (µa, φa) ,

f (x|d) = N (µx, φx) . (3.72)

Here, the shaping parameters, µa, φa, µx and φx, are assigned as follows:

µa = re⁻¹ d ( re⁻¹ x̂² + ra⁻¹ )⁻¹ x̂,   (3.73)

µx = re⁻¹ d ( re⁻¹ â² + rx⁻¹ )⁻¹ â,   (3.74)

φa = ( re⁻¹ x̂² + ra⁻¹ )⁻¹,   (3.75)

φx = ( re⁻¹ â² + rx⁻¹ )⁻¹.   (3.76)

Step 5: The necessary moments of the Normal distributions, required in Step 4, are readily available:

â = µa,   â² = φa + µa²,   (3.77)

x̂ = µx,   x̂² = φx + µx².   (3.78)

From Steps 4 and 5, the VB-approximation is determined by a set of eight equations in eight unknowns.
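The fixed-point structure of (3.73)–(3.78) can be iterated directly. The sketch below (a convenience illustration; variable names ending in `_mean` and `_sq` denote the moments â, â², x̂, x̂²) cycles the updates for d = 2, re = 1, ra = 10, rx = 20:

```python
import numpy as np

d, re, ra, rx = 2.0, 1.0, 10.0, 20.0
x_mean, x_sq = 1.0, 1.0                      # arbitrary non-zero initialisation

for _ in range(200):
    phi_a = 1.0 / (x_sq / re + 1.0 / ra)     # (3.75)
    mu_a = (d / re) * phi_a * x_mean         # (3.73)
    a_mean, a_sq = mu_a, phi_a + mu_a ** 2   # (3.77)
    phi_x = 1.0 / (a_sq / re + 1.0 / rx)     # (3.76)
    mu_x = (d / re) * phi_x * a_mean         # (3.74)
    x_mean, x_sq = mu_x, phi_x + mu_x ** 2   # (3.78)

# For these values d exceeds the VB breakpoint, so the iterations settle on the
# non-zero mode; the limits agree with the closed-form solution given in Step 6.
print(mu_a, phi_a, mu_x, phi_x)
```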

Step 6: In this example, the set of equations can be reduced to one cubic equation, whose three roots are expressed in closed form; the corresponding values of the shaping parameters from Step 4 are as follows:

1. Zero-signal inference:

µa = 0,   (3.79)
µx = 0,
φa = ( re/(2rx) ) ( √(1 + 4 ra rx/re) − 1 ),
φx = ( re/(2ra) ) ( √(1 + 4 ra rx/re) − 1 ).   (3.80)

2. and 3. Non-zero signal inference:

µa = ± [ ( (d² − re) √(ra rx) − d re ) / (d rx) ]^{1/2},   (3.81)
µx = ± [ ( (d² − re) √(ra rx) − d re ) / (d ra) ]^{1/2},   (3.82)
φa = (re/d) √(ra/rx) sgn(d² − re),   (3.83)
φx = (re/d) √(rx/ra) sgn(d² − re).   (3.84)

Here, sgn (·) returns the sign of the argument. The remaining task is to determine which of these roots is the true minimizer of the KL divergence (3.11). Roots 2. and 3. will be treated as one option, since the value of the KL divergence is clearly identical for both of them. From (3.83), we note that solutions 2. and 3. are non-complex if d > √re. However, (3.81) collapses to µa = 0 (i.e. to the zero-signal inference) for

d = dVB = (1/2) ( re + √( re (re + 4 ra rx) ) ) / √(ra rx) ≈ √re + re/(2√(ra rx)) > √re,   (3.85)

and has complex values for d < dVB. Hence, (3.85) denotes the VB-based breakpoint. For d > dVB, a non-zero signal is inferred (cases 2. and 3.), while for d ≤ dVB the observation is considered to be purely noise. For improved insight, Fig. 3.8 demonstrates graphically how the modes of the VB-approximation arise.

[Figure 3.8 — KLDVB plotted against f̃(a, x|d), for d < dVB (left) and d > dVB (right).]

Fig. 3.8. The notional shape of the KL divergence (KLDVB) as a function of the observed data. The two modes are illustrated.

Step 7: As an alternative to the analytical solution above, the IVB algorithm may be adopted (Algorithm 1). In this case, the equations in Steps 4 and 5 are evaluated in the order suggested by (3.32)–(3.35). The trajectories of the iterations for the posterior VB-means, µa (3.73) and µx (3.74), are shown in Fig. 3.9, for d = 2 (left) and d = 1 (right). For the chosen priors, the inferential breakpoint (3.85) is at dVB = 1.0025. Hence, these two cases demonstrate the two distinct modes of solution of the equations (3.73)–(3.76). Being a gradient descent algorithm, the IVB algorithm has no difficulty with multimodality of the analytical solution (Fig. 3.8), since the remaining modes are complex-valued (for d < dVB) or are local maxima (for d > dVB), as discussed in Step 6 above. For this reason, the IVB algorithm converges to the global minimizer of KLDVB (3.6) independently of the initial conditions.

Step 8: Ultimately, the VB-approximation, f (a, x|d), is given by the product of the VB-marginals in (3.72).

[Figure 3.9 — two contour panels over the (x, a) plane, with iteration trajectories.]

Fig. 3.9. VB-approximation of the posterior distribution for the scalar multiplicative decomposition, using the IVB algorithm. The dashed line denotes the initial VB-approximation; the full line denotes the converged VB-approximation; the trajectory of the posterior VB-means, µa (3.73) and µx (3.74), is also illustrated. The prior parameters are re = 1, ra = 10, rx = 10. Left: (non-zero-signal mode) d = 2. Right: (zero-signal mode) d = 1.

3.7.5 Comparison with Other Techniques

The VB-approximation can be compared to other deterministic methods considered in this Chapter:

KLDMR under the assumption of conditional independence (Section 3.4.1) is minimized for the product of the true marginal posterior distributions (3.70). Since these are not tractable, their evaluation must be undertaken numerically, as was the case in Fig. 3.7. The results are illustrated in Fig. 3.10 (left).

QB approximation for the model cannot be undertaken because of the intractability of the true marginal distribution (3.70).

Laplace approximation (Section 3.5.2) is applied at the MAP estimate, (3.67) and (3.68). The result is displayed in Fig. 3.10 (middle). Unlike the VB-approximation and the KLDMR-based method of Section 3.4.1, the Laplace approximation does model cross-correlation between variables. However, it is dependent on the MAP estimates, and so its inferential breakpoint, dMAP (Section 3.7.2), is the same as that for MAP estimation.

VB-approximation is illustrated in Fig. 3.10 (right). This result illustrates a key consequence of the VB approximation, namely the absence of any cross-correlation between variables, owing to the conditional independence assumption which underlies the approximation (3.10).

[Figure 3.10 — three contour panels over the (x, a) plane.]

Fig. 3.10. Comparison of approximation techniques for the scalar multiplicative decomposition (3.66). re = 1, ra = 10, rx = 20, d = 2. Left: KLDMR-based approximation. Centre: the Laplace approximation. Right: the VB-approximation, which infers a non-zero signal (3.81), since d > dVB. In the last two cases, the ellipse corresponding to the 2-standard-deviation boundary of the approximating Normal distribution is illustrated.

These results suggest the following:

• The prior distribution is indispensable in regularizing the model, and in ensuring that finite VB-moments are generated. With uniform priors, i.e. ra → ∞ and rx → ∞, none of the derived solutions is valid.

• From (3.81) and (3.82), the ratio of the posterior expected values, â/x̂, is fixed by the priors at √(ra/rx). This is a direct consequence of the scaling ambiguity of the model: the observed data do not bring any information about the ratio of the mean values of a and x. Hence, the scale of these parameters is fixed by the prior.

The inferential breakpoint—i.e. the value of d above which a non-zero signal is inferred—is different for each approximation. For the ML and MAP approaches, the inferred signal is non-zero even for very small data: dML = 0, and dMAP = 0.07 for the chosen priors. After exact marginalization, the signal is inferred to be non-zero only if the observed data are greater than dm = √re = 1 for uniform priors, or dm = 1.0025 for the priors used in Fig. 3.7. The VB-approximation infers a non-zero signal only above dVB = 1.036.


3.8 Conclusion

The VB-approximation is a deterministic, free-form distributional approximation. It replaces the true distribution with one for which correlation between the partitioned parameters is suppressed. An immediate convenience of the approximation is that its output is naturally in the form of marginals and key moments, answering many of the questions which inspire the use of approximations for intractable distributions in the first place. While an analytical solution may be found in special cases, the general approach to finding the VB-approximation is to iterate the steps of the IVB algorithm. The IVB algorithm may be seen as a Bayesian counterpart of the classical EM algorithm, generating distributions rather than simply point estimates. An advantage of the IVB approach to finding the VB-approximation is that the solution is guaranteed to be a local minimizer of KLDVB.

In this Chapter, we systematized the procedure for VB-approximation into the 8-step VB method, which will be followed in all the later Chapters. It will prove to be a powerful generic tool for distributional approximation in many signal processing contexts.

4

Principal Component Analysis and Matrix Decompositions

Principal Component Analysis (PCA) is one of the classical data analysis tools for dimensionality reduction. It is used in many application areas, including data compression, denoising, pattern recognition, shape analysis and spectral analysis. For an overview of its use, see [80]. Typical applications in signal processing include spectral analysis [81] and image compression [82].

PCA was originally developed from a geometric perspective [83]. It can also be derived from the additive decomposition of a matrix of data, D, into a low-rank matrix, M(r), of rank r, representing a 'signal' of interest, and noise, E:

D = M(r) + E. (4.1)

If E has a Gaussian distribution, and M(r) is decomposed into a product of two lower-dimensional matrices—i.e. M(r) = AX′—then Maximum Likelihood (ML) estimation of M(r) gives the same results as PCA [84, 85]. This is known as the Probabilistic PCA (PPCA) model. In this Chapter, we introduce an alternative model, parameterizing M(r) in terms of the Singular Value Decomposition (SVD): i.e. M(r) = ALX′. We will call this the Orthogonal PPCA model (see Fig. 4.1). The ML estimation of M(r) again gives the same results as PCA.

We will study the Bayesian inference of the parameters of both of these models, and find that the required integrations are intractable. Hence, we will use the VB-approximation of Section 3.3 to achieve tractable Bayesian inference. Three algorithms will emerge: (i) Variational PCA (VPCA), (ii) Fast Variational PCA (FVPCA), and (iii) Orthogonal Variational PCA (OVPCA). (i) and (ii) will be used for inference of the PPCA parameters. (iii) will be used for inference of the Orthogonal PPCA model. The layout of the Chapter is summarized in Fig. 4.1.

The Bayesian methodology allows us to address important tasks that are not successfully addressed by the ML solution (i.e. PCA). These are:

Uncertainty bounds: PCA provides point estimates of parameters. Since the results of Bayesian inference are probability distributions, uncertainty bounds on the inferred parameters can easily be derived.


[Figure 4.1 — hierarchy: matrix decomposition D = M(r) + E; PPCA model M(r) = AX′ (VPCA and FVPCA algorithms); Orthogonal PPCA model M(r) = ALX′ (OVPCA algorithm).]

Fig. 4.1. Models and algorithms for Principal Component Analysis.

Inference of rank: in PCA, the rank, r, of the matrix M(r) must be known a priori. Only ad hoc and asymptotic results are available for guidance in choosing this important parameter. In the Bayesian paradigm, we treat unknown r as a random variable, and we derive its marginal posterior distribution.

4.1 Probabilistic Principal Component Analysis (PPCA)

The PPCA observation model [85] is

D = AX ′ + E, (4.2)

as discussed above. Here, D ∈ R^{p×n} are the observed data, A ∈ R^{p×r} and X ∈ R^{n×r} are unknown parameters, and E ∈ R^{p×n} is independent, identically-distributed (i.i.d.) Gaussian noise with unknown but common variance, ω⁻¹:

f (E|ω) = ∏_{i=1}^p ∏_{j=1}^n N_{ei,j} (0, ω⁻¹).   (4.3)

ω is known as the precision parameter, and has the meaning of inverse variance: var (ei,j) = ω⁻¹. In this Chapter, we will make use of the matrix Normal distribution (Appendix A.2):

f (E|ω) = N_E (0_{p,n}, ω⁻¹ Ip ⊗ In).   (4.4)

This is identical to (4.3).

The model (4.2) and (4.4) can be written as follows:

f (D|A, X, ω) = N (AX′, ω⁻¹ Ip ⊗ In).   (4.5)


Note that (4.5) can be seen as a Normal distribution with low-rank mean value,

M(r) = AX ′, (4.6)

where r is the rank of M(r), and we assume that r < min (p, n). Matrices A and X are assumed to have full rank; i.e. rank (X) = rank (A) = r.

The original PPCA model of [85] contains an extra parameter, µ, modelling a common mean value for the columns, mi, of M(r). This parameter can be seen as an extra column in A if we augment X by a column 1_{n,1}. Hence, it is a restriction of X. In this work, we do not impose this restriction of a common mean value; i.e. we assume that the common mean value is µ = 0_{p,1}.

In the sequel, we will often invoke the theorem that any matrix can be decomposed into singular vectors and singular values [86]. We will use the following form of this theorem.

Definition 4.1. The Singular Value Decomposition (SVD) of matrix D ∈ R^{p×n} is defined as follows:

D = UDLDV ′D, (4.7)

where UD ∈ R^{p×p} and VD ∈ R^{n×n} are orthonormal matrices, such that U′D UD = Ip, V′D VD = In, and LD ∈ R^{p×n} is a matrix with diagonal elements lD ∈ R^{min(p,n)} and zeros elsewhere. The columns of UD and VD are known as the left- and right-singular vectors, respectively. The elements of lD are known as the singular values.

Unless stated otherwise, we will assume (without loss of generality) that r < p ≤ n. This assumption allows us to use the economic SVD, which uses only p right singular vectors of VD. Therefore, in the sequel, we will assume that LD ∈ R^{p×p} and VD ∈ R^{n×p}.

Since LD is a square diagonal matrix, we will often work with its diagonal elements only. In general, the diagonal elements of a diagonal matrix will be denoted by the same lower-case letter as the original matrix. In particular, we use the notation

LD = diag (lD),   lD = diag⁻¹ (LD),

where lD ∈ R^p.

4.1.1 Maximum Likelihood (ML) Estimation for the PPCA Model

The ML estimates of the parameters of (4.5)—i.e. M(r) and ω—are defined as follows:

{ M̂(r), ω̂ } = arg max_{M(r), ω}  ω^{pn/2} exp( −(1/2) ω tr( (D − M(r)) (D − M(r))′ ) ).   (4.8)

Here, tr (·) denotes the trace of the matrix. Note that (4.8) is conditioned on a known rank, r.


Using the SVD of the data matrix, D (4.7), the maximum of (4.8) is reached for

M̂(r) = UD;r LD;r,r V′D;r,   ω̂ = pn / Σ_{i=r+1}^p l²i,D.   (4.9)

Here, UD;r and VD;r are the first r columns of the matrices UD and VD, respectively, and LD;r,r is the r × r upper-left sub-block of matrix LD.

Remark 4.1 (Rotational ambiguity). The ML estimates of A and X in (4.5), using (4.9), are not unique, because (4.5) exhibits a multiplicative degeneracy; i.e.

M(r) = AX′ = (AT) (T⁻¹X′) = ÃX̃′,   (4.10)

for any invertible matrix, T ∈ R^{r×r}. This is known as rotational ambiguity in the factor analysis literature [84].

Remark 4.2 (Relation of the ML estimate to PCA). Principal Component Analysis (PCA) is concerned with projections of p-dimensional vectors dj, j = 1, . . . , n, into an r-dimensional subspace. Optimality of the projection was studied from both a maximum-variation [87] and a least-squares [83] point of view. In both cases, the optimal projection was found via the eigendecomposition of the sample covariance matrix:

S = (1/(n − 1)) DD′ = U Λ U′.   (4.11)

Here, Λ = diag (λ), with λ = [λ1, . . . , λp]′, is a matrix of eigenvalues of S, and U is the matrix of associated eigenvectors. The columns uj, j = 1, . . . , r, of U, corresponding to the largest eigenvalues, λ1 > λ2 > . . . > λr, form a basis for the optimal projection sub-space.

These results are related to the ML solution (4.9), as follows. From (4.7),

DD′ = UD LD V′D VD LD U′D = UD LD LD U′D.   (4.12)

The ML estimate (4.9) can be decomposed into A and X , as follows:

Â = UD;r,   X̂′ = LD;r,r V′D;r.   (4.13)

Hence, comparing (4.11) with (4.12), and using (4.13), the following equalities emerge:

Â = UD;r = U;r,   LD = (n − 1)^{1/2} Λ^{1/2}.   (4.14)

Equalities (4.14) formalize the relation between PCA and ML estimation of the PPCA model.
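The ML estimate (4.9) and the relation (4.14) are both one-liners in numerical software. The following sketch (synthetic data and names are our own, not the book's) builds a rank-2 signal plus noise, forms the ML estimates from the economic SVD, and confirms that the eigenvalues of the sample covariance equal l²D/(n − 1):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, r = 5, 200, 2
D = rng.normal(size=(p, r)) @ rng.normal(size=(n, r)).T + 0.1 * rng.normal(size=(p, n))

U_D, l_D, Vt_D = np.linalg.svd(D, full_matrices=False)     # economic SVD (4.7)
M_r = (U_D[:, :r] * l_D[:r]) @ Vt_D[:r, :]                  # ML estimate of M_(r), (4.9)
omega_hat = p * n / np.sum(l_D[r:] ** 2)                    # ML noise precision, (4.9)

# Relation (4.14): eigenvalues of S = DD'/(n-1) equal l_D^2 / (n-1)
eigvals = np.sort(np.linalg.eigvalsh(D @ D.T / (n - 1)))[::-1]
print(np.allclose(eigvals, l_D ** 2 / (n - 1)))             # True
print(omega_hat)                                            # ML precision for this r
```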

Remark 4.3 (Ad hoc choice of rank r). The rank, r, has been assumed to be known a priori. If this is not the case, the ML solution—and, therefore, PCA—fails, since the likelihood function in (4.8) increases with r in general. This is a typical problem with classical estimation, since no Ockham-based regularization (i.e. penalization of complexity) is available in this non-measure-based approach [3].

Many heuristic methods for selection of rank do, however, exist [80]. One is based on the asymptotic properties of the noise E (4.4). Specifically, from (4.5),

E_{f(D|M(r),ω)} [DD′] = M(r) M′(r) + n ω⁻¹ Ip.   (4.15)

Using the SVD (Definition 4.1), M(r) M′(r) = UM L²M U′M. Noting the equality, UM U′M = Ip, then, from (4.15) and (4.12),

lim_{n→∞} UD L²D U′D = UM L²M U′M + n ω⁻¹ UM U′M.   (4.16)

It follows that lim_{n→∞} UD = UM, and that

lim_{n→∞} l²i,D = l²i,M + n ω⁻¹ for i ≤ r, and lim_{n→∞} l²i,D = n ω⁻¹ for i > r.   (4.17)

Hence, the index, r, for which the singular values, lim_{n→∞} li,D, i > r, are constant is considered to be an estimate of the rank r. In finite samples, however, (4.17) is only approximately true. Therefore, the estimate can be chosen by visual examination of the graphed singular values [80], looking for the characteristic 'knee' in the graph.

4.1.2 Marginal Likelihood Inference of A

An alternative inference of the parameters of the PPCA model (4.5) complements (4.5) with a Gaussian prior on X,

f (X) = N_X (0_{n,r}, In ⊗ Ir),   (4.18)

and uses this to marginalize over X in the likelihood function (4.5) [84, 85]. The resulting maximum of the marginal likelihood, conditioned by r, is then reached for ω given by (4.9), and for

Â = UD;r (L²D;r,r − ω̂⁻¹ Ir) R.   (4.19)

Here, UD;r and LD;r,r are given by (4.7), and R ∈ R^{r×r} is any orthogonal (i.e. rotation) matrix. In this case, the indeterminacy of the model is reduced from an arbitrary invertible matrix, T (4.10), to an orthogonal matrix, R. This reduction is a direct consequence of the restriction imposed on the model by the prior on X (4.18).

4.1.3 Exact Bayesian Analysis

Bayesian inference for the PPCA model can be found in [78]. Note that the special case of the PPCA model (4.5) for p = n = r = 1 was studied in Section 3.7. We found that the inference procedure was not tractable. Specifically, the marginals of a and x (scalar versions of (4.6)) could not be normalized analytically, and so moments were unavailable. The same is true of the marginal inferences of A and X derived in [78]. Their numerical evaluation was accomplished using Gibbs sampling in [79, 88].


4.1.4 The Laplace Approximation

Estimation of the rank of the PPCA model via the Laplace approximation (Section 3.5.2) was published in [89]. There, the parameter A (4.5) was restricted by the orthogonality constraint, A′A = Ir. The parameter X was integrated out, as in Section 4.1.2 above.

4.2 The Variational Bayes (VB) Method for the PPCA Model

The VB-approximation for the PPCA model (4.5) was introduced in [24]. Here, we use the VB method to obtain the necessary approximation (Section 3.3.3) and some interesting variants. Note that this development extends the scalar decomposition example of Section 3.7 to the matrix case.

Step 1: Choose a Bayesian Model

The observation model (4.5) is complemented by the following priors:

f (A|Υ) = N_A (0_{p,r}, Ip ⊗ Υ⁻¹),   (4.20)

Υ = diag (υ),   υ = [υ1, . . . , υr]′,

f (υ|α0, β0) = ∏_{i=1}^r G_{υi} (αi,0, βi,0),   (4.21)

f (X) = N_X (0_{n,r}, In ⊗ Ir),   (4.22)

f (ω|ϑ0, ρ0) = G_ω (ϑ0, ρ0).   (4.23)

In (4.21), α0 = [α1,0, . . . , αr,0]′ and β0 = [β1,0, . . . , βr,0]′. Here, Υ ∈ R^{r×r} is a diagonal matrix of hyper-parameters, υi, distributed as (4.21). The remaining prior parameters, ϑ0 and ρ0, are known scalar parameters. Complementing (4.5) with (4.20)–(4.23), the joint distribution is

f (D, A, X, ω, Υ|α0, β0, ϑ0, ρ0, r)
  = N_D (AX′, ω⁻¹ Ip ⊗ In) N_A (0_{p,r}, Ip ⊗ Υ⁻¹) N_X (0_{n,r}, In ⊗ Ir)
    × G_ω (ϑ0, ρ0) ∏_{i=1}^r G_{υi} (αi,0, βi,0).   (4.24)

In the sequel, the conditioning on α0, β0, ϑ0, ρ0 will be dropped for convenience.

Step 2: Partition the parameters

The logarithm of the joint distribution is as follows:


ln f (D, A, X, ω, Υ|r) = (pn/2) ln ω − (1/2) ω tr( (D − AX′)(D − AX′)′ )
  + (p/2) Σ_{i=1}^r ln υi − (1/2) tr (Υ A′A) − (1/2) tr (XX′) + Σ_{i=1}^r (αi,0 − 1) ln υi
  − Σ_{i=1}^r βi,0 υi + (ϑ0 − 1) ln ω − ρ0 ω + γ.   (4.25)

Here, γ gathers together all terms that do not depend on A, X, ω or Υ.

We partition the model (4.24) as follows: θ1 = A, θ2 = X, θ3 = ω and θ4 = Υ.

Hence, (4.25) is a member of the separable-in-parameters family (3.21). The detailed assignment of the functions g (·) and h (·) is omitted for brevity.

Step 3: Write down the VB-marginals

Any minimizer of KLDVB must have the following form (Theorem 3.1):

f (A|D, r) ∝ exp[ −(1/2) ω tr(−2 A X′ D′) − (1/2) tr( A (ω X′X) A′ ) − (1/2) tr( A Υ A′ ) ],

f (X|D, r) ∝ exp[ −(1/2) ω tr(−2 A X′ D′) − (1/2) tr( X (ω A′A) X′ ) − (1/2) tr( X′X ) ],

f (υ|D, r) ∝ exp[ Σ_{i=1}^r ( (p/2 + αi,0 − 1) ln υi − ( βi,0 + (1/2) a′i ai ) υi ) ],

f (ω|D, r) ∝ exp[ (pn/2 + ϑ0 − 1) ln ω
  − ω ( ρ0 + (1/2) tr( DD′ − A X′ D′ − D X A′ ) + (1/2) tr( A′A X′X ) ) ].

Here, the terms involving A, X, ω, υ on the right-hand sides are evaluated at the VB-moments of the respective complementary distributions.

Step 4: Identify standard forms

The VB-marginals from the previous step are recognized to have the following forms:

f (A|D, r) = NA (µA, Ip ⊗ΣA) , (4.26)

f (X|D, r) = NX (µX , In ⊗ΣX) , (4.27)

f (υ|D, r) =r∏

i=1

Gυi(αi, βi) , (4.28)

f (ω|D, r) = Gω (ϑ, ρ) . (4.29)

The associated shaping parameters are as follows:


µA = ω D X ( ω X′X + Υ )⁻¹,   (4.30)

ΣA = ( ω X′X + Υ )⁻¹,   (4.31)

µX = ω D′ A ( ω A′A + Ir )⁻¹,   (4.32)

ΣX = ( ω A′A + Ir )⁻¹,   (4.33)

α = α0 + (p/2) 1_{r,1},   (4.34)

β = β0 + (1/2) diag ( A′A ),   (4.35)

ϑ = ϑ0 + np/2,   (4.36)

ρ = ρ0 + (1/2) tr( DD′ − A X′ D′ − D X A′ ) + (1/2) tr( A′A X′X ).   (4.37)

Step 5: Formulate necessary VB-moments

Using standard results for the matrix Normal and Gamma distributions (Appendices A.2 and A.5, respectively), the necessary moments of distributions (4.26)–(4.29) can be expressed as functions of their shaping parameters, (4.30)–(4.37), as follows:

A = µA,   (4.38)
A′A = p ΣA + µ′A µA,
X = µX,
X′X = n ΣX + µ′X µX,
υi = αi/βi,   i = 1, . . . , r,   (4.39)
ω = ϑ/ρ.   (4.40)

(4.39) can be written in vector form as follows:

υ = α ∘ β⁻¹.

Here, '∘' denotes the Hadamard product (the '.∗' operator in MATLAB: see Notational Conventions, Page XVI) and

β⁻¹ ≡ [β1⁻¹, . . . , βr⁻¹]′.

This notation will be useful in the sequel.
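One sweep of the IVB iterations for (4.30)–(4.40) is straightforward to code. The sketch below (a compact illustration with assumed synthetic data and broad priors, not the book's reference implementation) cycles the shaping-parameter and moment updates:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, r = 8, 100, 3                        # r is the assumed maximum rank
D = rng.normal(size=(p, 2)) @ rng.normal(size=(n, 2)).T + 0.1 * rng.normal(size=(p, n))

alpha0 = beta0 = 1e-6 * np.ones(r)         # broad Gamma priors on upsilon
vartheta0 = rho0 = 1e-6                    # broad Gamma prior on omega

mu_X, Sigma_X = rng.normal(size=(n, r)), np.eye(r)
omega, upsilon = 1.0, np.ones(r)

for _ in range(200):
    XX = n * Sigma_X + mu_X.T @ mu_X                         # moment of X'X
    Sigma_A = np.linalg.inv(omega * XX + np.diag(upsilon))   # (4.31)
    mu_A = omega * D @ mu_X @ Sigma_A                        # (4.30)
    AA = p * Sigma_A + mu_A.T @ mu_A                         # moment of A'A
    Sigma_X = np.linalg.inv(omega * AA + np.eye(r))          # (4.33)
    mu_X = omega * D.T @ mu_A @ Sigma_X                      # (4.32)
    XX = n * Sigma_X + mu_X.T @ mu_X
    beta = beta0 + 0.5 * np.diag(AA)                         # (4.35); alpha from (4.34)
    upsilon = (alpha0 + 0.5 * p) / beta                      # (4.39)
    rho = rho0 + 0.5 * np.trace(D @ D.T - mu_A @ mu_X.T @ D.T
                                - D @ mu_X @ mu_A.T) + 0.5 * np.trace(AA @ XX)  # (4.37)
    omega = (vartheta0 + 0.5 * n * p) / rho                  # (4.36), (4.40)

print(omega, upsilon)   # large upsilon_i indicates a pruned (redundant) component
```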

Step 6: Reduce the VB-equations

As mentioned in Section 3.3.3, the VB-equations, (4.30)–(4.40), can always be expressed in terms of moments or shaping parameters only. Then, the IVB algorithm can be used to find the solution. This approach was used in [24]. Next, we show that a reduction of the VB-equations can be achieved using re-parameterization of the model, resulting in far fewer operations per IVB cycle.

The IVB algorithm used to solve the VB-equations is a gradient search method in the multidimensional space of the shaping parameters and moments above. If we can identify a lower-dimensional subspace in which the solution exists, the iterations can be performed in this subspace. The challenge is to define such a subspace.

Recall that both the ML solution (4.8) and the marginal ML solution (4.19) of the PPCA model (4.5) have the form of scaled singular vectors, UD;r, of the data matrix, D (4.7). The scaling coefficients for these singular vectors are different for each method of inference. Intuitively, therefore, it makes sense to locate our VB-approximate distributions in the space of scaled singular vectors. In other words, we should search for a solution in the space of scaling coefficients of UD. This idea is formalized in the following conjecture.

Conjecture 4.1 (Soft orthogonality constraints). The prior distributions on A and X—i.e. (4.20) and (4.22)—were chosen with diagonal covariance matrices. In other words, the expected values of A′A and X′X are diagonal. Hence, this choice favours those matrices, A and X, which are orthogonal. We conjecture that the VB-approximation (3.12) converges to posterior distributions with orthogonal mean value, even if the IVB algorithm was initialized with non-orthogonal matrices.

If this Conjecture is true, then it suffices to search for a VB-approximation only in the space of orthogonal mean values and diagonal covariance matrices. Validity of the conjecture will be tested in simulations.

Proposition 4.1 (Orthogonal solution of the VB-equations). Consider a special case of distributions (4.26) and (4.27) for matrices A and X respectively, in which we restrict the first and second moments as follows:

µA = UD;r KA,   (4.41)
µX = VD;r KX,   (4.42)
KA = diag (kA),
KX = diag (kX),
ΣA = diag (σA),   (4.43)
ΣX = diag (σX).   (4.44)

The first moments, µA (4.41) and µX (4.42), are formed from scaled singular vectors of the data matrix, D (4.7), multiplied by diagonal proportionality matrices, KA ∈ R^{r×r} and KX ∈ R^{r×r}. The second moments, (4.43) and (4.44), are restricted to have a diagonal form. Then the VB-marginals, (4.26)–(4.29), are fully determined by the following set of equations:


kA = ω lD;r ∘ kX ∘ σA,   (4.45)
σA = ( ω n σX + ω kX ∘ kX + α ∘ β⁻¹ )⁻¹,
kX = ω σX ∘ kA ∘ lD;r,
σX = ( ω p σA + ω kA ∘ kA + 1_{1,r} )⁻¹,   (4.46)
αi = αi,0 + p/2,   i = 1, . . . , r,
βi = βi,0 + (1/2) ( p σi,A + k²i,A ),   i = 1, . . . , r,   (4.47)
ϑ = ϑ0 + np/2,   (4.48)
ρ = ρ0 + (1/2) ( (lD − kA ∘ kX)′ (lD − kA ∘ kX) )
  + (1/2) ( p σ′A (kX ∘ kX) + p n σ′A σX + n σ′X (kA ∘ kA) ),   (4.49)
ω = ϑ/ρ.   (4.50)

Proof: From (4.30) and (4.31),

µA = ω D X ΣA.   (4.51)

Substituting (4.41), (4.43) and (4.7) into (4.51) it follows that

UD;rKA = ωUDLDV ′DVD;rKXdiag (σA) .

Hence,

diag (kA) = ω diag (lD;r) diag (kX) diag (σA),

using the orthogonality of the matrices UD and VD. Rewriting all diagonal matrices using identities of the kind

KA KX = diag (kA) diag (kX) = diag (kA ∘ kX),

and extracting their diagonal elements, we obtain (4.45). The identities (4.46)–(4.49) all follow in the same way.

The key result of Proposition 4.1 is that the distributions of A and X are completely determined by the constants of proportionality, kA and kX, and variances, σA and σX, respectively. Therefore, the number of scalar unknowns arising in the VB-equations, (4.41)–(4.50), is now only 5r + 1, being the number of terms in the set kA, kX, σA, σX, β and ρ. In the original VB-equations, (4.30)–(4.40), the number of scalar unknowns was much higher—specifically, r (p + n + 2r) + r + 1—for the parameter set µA, µX, ΣA, ΣX, β and ρ.

In fact, the simplification achieved in the VB-equations is even greater than discussed above. Specifically, the vectors σA, σX, kA and kX now interact with each other only through ω (4.50). Therefore, if ω is fixed, the vector identities in (4.45)–(4.47) decouple element-wise, i = 1, . . . , r, into scalar identities. The complexity of each such scalar identity is merely that of the scalar decomposition in Section 3.7, which had an analytical solution.

Proposition 4.2 (Analytical solution). Let the posterior expected value, ω (4.50), of ω be fixed. Then the scalar identities, i = 1, . . . , r, implied by the VB-equations (4.45)–(4.47) have an analytical solution with one of two modes, determined, for each i, by an associated inferential breakpoint, as follows:

l̄i,D = ( (√p + √n)/√ω ) √(1 − βi,0 ω).   (4.52)

1. Zero-centred solution, for each i such that li,D ≤ l̄i,D, where li,D is the ith singular value of D (4.7):

ki,A = 0,   (4.53)
ki,X = 0,
σi,X = (1/2) [ 2n − (n − p) βi,0 ω − √( β²i,0 ω² (n − p)² + 4 βi,0 n p ω ) ] / ( n (1 − βi,0 ω) ),
σi,A = (1 − σi,X) / (σi,X p ω),
βi = ( (αi,0 + p/2)(σi,X − 1) ) / ( ( n (σi,X − 1) + p ) σi,X ω ).   (4.54)

2. Non-zero solution, for each i such that li,D > l̄i,D:

ki,A = √( ( −b + √(b² − 4ac) ) / (2a) ),   (4.55)
ki,X = ( ω l²i,D − p ) ki,A / ( li,D ( ω k²i,A + 1 ) ),
σi,A = ( ω k²i,A + 1 ) / ( ω ( ω l²i,D − p ) ),
σi,X = ( ω l²i,D − p ) / ( ω li,D ( ω k²i,A + 1 ) ),
βi = ( (αi,0 + p/2) ( ω k²i,A ( (1 − ω βi,0)(n − p) + li,D ) + n + βi,0 ω² l²i,D ) ) / ( ω k²i,A (p − n) − n + ω l²i,D ).   (4.56)

Note that ki,A in (4.55) is the positive root of the quadratic equation,

a k²i,A + b ki,A + c = 0,

whose coefficients are

a = n ω³ l²i,D,
b = n ω p − p² ω + 2 p ω² l²i,D + n ω² l²i,D + βi,0 n ω³ l²i,D − ω³ l⁴i,D − βi,0 n ω² p − βi,0 ω³ l²i,D p + βi,0 ω² p²,
c = n p − βi,0 ω³ l⁴i,D − βi,0 n ω p + βi,0 ω² l²i,D p + βi,0 n ω² l²i,D.

Proof: The identities were established using the symbolic software package, Maple. For further details, see [90].

Note that the element-wise inferential breakpoints, l̄i,D (4.52), differ only with respect to the known prior parameters, βi,0 (4.21). It is reasonable to choose these equal—i.e. β0 = β1,0 1_{r,1}—in which case the r breakpoints are identical.

Step 7: Run the IVB Algorithm

From Step 6, there are two approaches available to us for finding the VB-marginals (4.26)–(4.29). If we search for a solution without imposing orthogonality constraints, we must run the IVB algorithm in order to find a solution to the full set of VB-equations (4.30)–(4.40). This was the approach presented in [24], and will be known as Variational PCA (VPCA).

In contrast, if we search for an orthogonal solution using Conjecture 4.1, then we can exploit the analytical solution (Proposition 4.2), greatly simplifying the IVB algorithm, as follows. This will be known as Fast Variational PCA (FVPCA).

Algorithm 3 (Fast VPCA (FVPCA)).

1. Perform the SVD (4.7) of the data matrix, D.
2. Choose the initial value of ω as ω^[1] = n/l²p,D (as explained shortly). Set the iteration counter to j = 1.
3. Evaluate the inferential breakpoints,
   l̄i,D = ( (√p + √n)/√(ω^[j]) ) √(1 − βi,0 ω^[j]).
4. Partition lD into lz = { li,D : li,D ≤ l̄i,D } and lnz = { li,D : li,D > l̄i,D }.
5. Evaluate solutions (4.53)–(4.54) for lz, and solutions (4.55)–(4.56) for lnz.
6. Update the iteration counter (j = j + 1), and estimate ω^[j] = ϑ/ρ^[j], using (4.48) and (4.49).
7. Test with a stopping rule; e.g. if |ω^[j] − ω^[j−1]| > ε, ε small, go to step 3; otherwise end.

Remark 4.4 (Automatic Rank Determination (ARD) Property). The shaping parameters, α and β, can be used for rank determination. It is observed that, for some values of the index, i, the posterior expected values, υi = αi/βi (4.39), converge (with the number of IVB iterations) to the prior expected values, υi → αi,0/βi,0. This can be understood as a prior-dominated inference [8]; i.e. the observations are not informative in those dimensions. Therefore, the rank can be determined as the number of υi that are significantly different from the prior value, αi,0/βi,0. This behaviour will be called the Automatic Rank Determination (ARD) property.¹

Remark 4.5 (Ad hoc choice of initial value, ω^[1], of ω). As n → ∞ (4.2), the singular values, li,D, of D are given by (4.17). In finite samples, (4.17) holds only approximately, as follows:

(1/(p − r)) Σ_{i=r+1}^p l²i,D ≈ n ω⁻¹,   (4.57)

Σ_{i=1}^p l²i,D ≈ Σ_{i=1}^r l²i,M + p n ω⁻¹,   (4.58)

where r is the unknown rank. From the ordering of the singular values, l1,D > l2,D > . . . > lp,D, it follows that l²p,D < (1/(p − r)) Σ_{i=r+1}^p l²i,D (i.e. the mean is greater than the minimal value). From (4.57), it follows that l²p,D < n ω⁻¹. From (4.58), it is true that Σ_{i=1}^p l²i,D > p n ω⁻¹. These considerations lead to the following choice of interval for ω:

pn / (l′D lD) < ω < n / l²p,D.   (4.59)

Recall that ω is the precision parameter in the PPCA model (4.5). Hence, we initialize ω at the upper bound in (4.59)—i.e. ω^[1] = n/l²p,D—encouraging convergence to a higher-precision solution. We will examine this choice in simulation, in Section 4.4.2.
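The interval (4.59) and the corresponding initialisation are trivial to compute from the singular values. The helper below (an assumed convenience function, not part of the book's algorithms) returns both bounds:

```python
import numpy as np

def omega_bounds(D):
    """Return the interval (4.59) for omega, given data D with p <= n."""
    p, n = D.shape
    l_D = np.linalg.svd(D, compute_uv=False)     # singular values, l_1 >= ... >= l_p
    lower = p * n / (l_D @ l_D)                  # p*n / (l_D' l_D)
    upper = n / l_D[-1] ** 2                     # n / l_{p,D}^2
    return lower, upper

rng = np.random.default_rng(5)
D = rng.normal(size=(6, 80))
low, high = omega_bounds(D)
print(low, high)    # FVPCA initialises at the upper bound, omega^[1] = high
```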

Step 8: Report the VB-marginals

The VB-marginals are given by (4.26)–(4.29). Their shaping parameters and moments are inferred using either the VPCA algorithm (4.30)–(4.40) or the FVPCA algorithm (Algorithm 3).

4.3 Orthogonal Variational PCA (OVPCA)

In Section 4.1, we explained that Maximum Likelihood (ML) estimation of the parameters A and X in the PPCA model suffers from rotational ambiguity (Remark 4.1). This is a consequence of inappropriate modelling of the low-rank matrix, M(r), which is clearly over-parameterized. It leads to complications in the Bayesian approach, where the inference of A and X must be regularized via priors, as noted in Section 3.7. From an analytical point of view, model (4.6) contains redundant parameters. In this Section, we re-parameterize the model in a more compact way.

¹ In the machine learning community, this property is known as the Automatic Relevance Determination property [24]. In our work, the inferred number of relevant parameters is associated with the rank of M(r).


4.3.1 The Orthogonal PPCA Model

We now apply the economic SVD (Definition 4.1) to the low-rank matrix, M(r). In this case, p − r singular vectors will be irrelevant, since they will be multiplied by singular values equal to zero. This allows us to write M(r) as the following product:

M(r) = ALX ′. (4.60)

Here, the matrices A ∈ R^{p×r} and X ∈ R^{n×r} have orthogonality restrictions, A′A = Ir and X′X = Ir. Also,

L = diag (l),

is a diagonal matrix of non-zero singular values, l = [l1, . . . , lr]′, ordered, without loss of generality, as

l1 > l2 > . . . > lr > 0.   (4.61)

The decomposition (4.60) is unique, up to the sign of the r singular vectors (i.e. there are 2^r possible decompositions (4.60) satisfying the orthogonality and ordering (4.61) constraints, all equal to within a sign ambiguity [86]).

From (4.2), (4.4) and (4.60), the orthogonal PPCA model is

f (D|A,L,X, ω, r) = N (ALX ′, ω−1Ip ⊗ In

). (4.62)

This model, and its VB inference, were first reported in [91]. The ML estimates of the model parameters, conditioned by known r, are
\[
\left(\hat A, \hat L, \hat X, \hat\omega\right) = \arg\max_{A,L,X,\omega} f(D|A,L,X,\omega,r),
\]
with assignments
\[
\hat A = U_{D;r}, \quad \hat L = L_{D;r,r}, \quad \hat X = V_{D;r}, \quad \hat\omega = \frac{pn}{\sum_{i=r+1}^{p} l_{i,D}^2}, \qquad (4.63)
\]
using (4.7).
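A sketch (not from the book) of the ML assignments (4.63), computed from the economic SVD of D with numpy; r is assumed known here.

```python
import numpy as np

def ml_orthogonal_ppca(D, r):
    """ML estimates (4.63) of the orthogonal PPCA model, for a given rank r < p."""
    p, n = D.shape
    U, l, Vt = np.linalg.svd(D, full_matrices=False)  # economic SVD: D = U diag(l) Vt
    A_ml = U[:, :r]                                   # U_{D;r}
    L_ml = np.diag(l[:r])                             # L_{D;r,r}
    X_ml = Vt[:r, :].T                                # V_{D;r}
    omega_ml = p * n / np.sum(l[r:] ** 2)             # pn / sum_{i>r} l_{i,D}^2
    return A_ml, L_ml, X_ml, omega_ml
```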

4.3.2 The VB Method for the Orthogonal PPCA Model

Here, we follow the VB method (Section 3.3.3) to obtain VB-marginals of the parameters in the orthogonal PPCA model (4.62).

Step 1: Choose a Bayesian Model

The reduction of rotational ambiguity to merely a sign-based ambiguity is an advantage gained at the expense of orthogonal restrictions which are generally difficult to handle. Specifically, parameters A and X are restricted to having orthonormal columns, i.e. A′A = I_r and X′X = I_r, respectively.


Intuitively, each column a_i, i = 1 ... r, of A belongs to the unit hyperball in p dimensions, i.e. a_i ∈ H_p. Hence, A ∈ H_p^r, the Cartesian product of r p-dimensional unit hyperballs. However, the requirement of orthogonality, i.e. a_i′a_j = 0, ∀i ≠ j, confines the space further. The orthonormally constrained subset, S_{p,r} ⊂ H_p^r, is known as the Stiefel manifold [92]. S_{p,r} has finite area, which will be denoted as α(p, r), as follows:
\[
\alpha(p,r) = \frac{2^r \pi^{\frac{1}{2}pr}}{\pi^{\frac{1}{4}r(r-1)}\prod_{j=1}^{r}\Gamma\!\left(\frac{1}{2}(p-j+1)\right)}. \qquad (4.64)
\]
Here, Γ(·) denotes the Gamma function [93]. Both the prior and posterior distributions have a support confined to S_{p,r}.
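For reference, (4.64) can be evaluated stably in log form using the log-gamma function. The helper below is an illustrative sketch (not from the book); the printed sanity check is the area of the unit sphere in R^3, i.e. α(3, 1) = 4π.

```python
import numpy as np
from scipy.special import gammaln

def log_stiefel_area(p, r):
    """log alpha(p, r): log-area of the Stiefel manifold S_{p,r} as in (4.64)."""
    log_num = r * np.log(2.0) + 0.5 * p * r * np.log(np.pi)
    log_den = 0.25 * r * (r - 1) * np.log(np.pi) \
        + sum(gammaln(0.5 * (p - j + 1)) for j in range(1, r + 1))
    return log_num - log_den

print(np.exp(log_stiefel_area(3, 1)))   # 4*pi = 12.566...
```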

We choose the priors on A and X to be the least informative, i.e. uniform on S_{p,r} and S_{n,r} respectively:
\[
f(A) = \mathcal{U}_A(S_{p,r}) = \alpha(p,r)^{-1}\chi_{S_{p,r}}(A), \qquad (4.65)
\]
\[
f(X) = \mathcal{U}_X(S_{n,r}) = \alpha(n,r)^{-1}\chi_{S_{n,r}}(X). \qquad (4.66)
\]
There is no upper bound on ω > 0 (4.4). Hence, an appropriate prior is (the improper) Jeffreys' prior on scale parameters [7]:
\[
f(\omega) \propto \omega^{-1}. \qquad (4.67)
\]

Remark 4.6 (Prior on l). Suppose that the sum of squares of the elements of D is bounded from above; e.g.
\[
\sum_{i=1}^{p}\sum_{j=1}^{n} d_{i,j}^2 = \operatorname{tr}(DD') \le 1. \qquad (4.68)
\]
This can easily be achieved, for example, by preprocessing of the data. (4.68) can be expressed, using (4.7), as
\[
\operatorname{tr}(DD') = \operatorname{tr}\left(U_D L_D L_D U_D'\right) = \sum_{i=1}^{p} l_{i,D}^2 \le 1. \qquad (4.69)
\]
Note that tr(M_{(r)}M_{(r)}′) ≤ tr(DD′) (4.2). Hence, using (4.69),
\[
\sum_{i=1}^{r} l_i^2 \le \sum_{i=1}^{p} l_{i,D}^2 \le 1. \qquad (4.70)
\]
This, together with (4.61), confines l to the space
\[
\mathcal{L}_r = \left\{ l \;\middle|\; l_1 > l_2 > \ldots > l_r > 0,\ \sum_{i=1}^{r} l_i^2 \le 1 \right\}, \qquad (4.71)
\]
which is a sector of the unit hyperball. Constraint (4.70) forms a full unit hyperball, H_r ⊂ R^r, with hypervolume
\[
h_r = \pi^{\frac{r}{2}} / \Gamma\!\left(\frac{r}{2}+1\right). \qquad (4.72)
\]
Positivity constraints restrict this further to h_r/2^r, while hyperplanes, l_i = l_j, ∀i, j = 1 ... r, partition the positive sector of the hyperball into r! sectors, each of equal hypervolume, only one of which satisfies condition (4.61). Hence, the hypervolume of the support (4.71) is
\[
\xi_r = h_r \frac{1}{2^r (r!)} = \frac{\pi^{\frac{r}{2}}}{\Gamma\!\left(\frac{r}{2}+1\right) 2^r (r!)}. \qquad (4.73)
\]
We choose the prior distribution on l to be non-committal, i.e. uniform, on support (4.71). Using (4.73),
\[
f(l) = \mathcal{U}_l(\mathcal{L}_r) = \xi_r^{-1}\chi_{\mathcal{L}_r}(l). \qquad (4.74)
\]
Multiplying (4.62) by (4.65), (4.66), (4.67) and (4.74), and using the chain rule of probability, we obtain the joint distribution,
\[
f(D,A,X,L,\omega|r) = \mathcal{N}\left(ALX',\ \omega^{-1} I_p \otimes I_n\right)\times \alpha(p,r)^{-1}\alpha(n,r)^{-1}\xi_r^{-1}\omega^{-1}\,\chi_{\Theta^*}(\theta). \qquad (4.75)
\]
Here, θ = {A, X, L, ω} with support
\[
\Theta^* = S_{p,r} \times S_{n,r} \times \mathcal{L}_r \times \mathbb{R}^+.
\]

Step 2: Partition the Parameters

We choose to partition the parameters of model (4.62) as follows: θ_1 = A, θ_2 = X, θ_3 = L, θ_4 = ω. The logarithm of the joint distribution, restricted to zero outside the support Θ*, is given by
\[
\ln f(D,A,X,L,\omega|r) = \left(\frac{pn}{2}-1\right)\ln\omega - \frac{1}{2}\omega\,\operatorname{tr}\left((D-ALX')(D-ALX')'\right) + \gamma
\]
\[
= \left(\frac{pn}{2}-1\right)\ln\omega - \frac{1}{2}\omega\,\operatorname{tr}\left(DD' - 2ALX'D'\right) - \frac{1}{2}\omega\,\operatorname{tr}\left(LL'\right) + \gamma, \qquad (4.76)
\]
using orthogonality of the matrices A and X. Once again, γ denotes the accumulation of all terms independent of A, X, L and ω.

Note that the chosen priors, (4.65), (4.66), (4.67) and (4.74), do not affect the functional form of (4.76) but, instead, they restrict the support of the posterior. Therefore, the function appears very simple, but evaluation of its moments will be complicated.


Step 3: Inspect the VB-marginals

Application of the VB theorem to (4.76) yields the following VB-marginals:
\[
f(A,X,L,\omega|D,r) = f(A|D,r)\,f(X|D,r)\,f(L|D,r)\,f(\omega|D,r), \qquad (4.77)
\]
\[
f(A|D,r) \propto \exp\left[\widehat{\omega}\,\operatorname{tr}\left(A\widehat{L}\widehat{X}'D'\right)\right]\chi_{S_{p,r}}(A), \qquad (4.78)
\]
\[
f(X|D,r) \propto \exp\left[\widehat{\omega}\,\operatorname{tr}\left(X'D'\widehat{A}\widehat{L}\right)\right]\chi_{S_{n,r}}(X),
\]
\[
f(l|D,r) \propto \exp\left[\widehat{\omega}\,l'\operatorname{diag}^{-1}\left(\widehat{X}'D'\widehat{A}\right) - \frac{1}{2}\widehat{\omega}\,l'l\right]\chi_{\mathcal{L}_r}(l),
\]
\[
f(\omega|D,r) \propto \exp\left[\left(\frac{pn}{2}-1\right)\ln\omega - \frac{1}{2}\omega\,\operatorname{tr}\left(DD' - 2\widehat{A}\widehat{L}\widehat{X}'D'\right) - \frac{1}{2}\omega\,\widehat{l'l}\right]\chi_{\mathbb{R}^+}(\omega). \qquad (4.79)
\]
Recall that the operator diag^{-1}(L) = l extracts the diagonal elements of the matrix argument into a vector (see Notational Conventions on Page XV).

Step 4: Identify standard forms

The VB-marginals, (4.78)–(4.79), are recognized to have the following standard forms:
\[
f(A|D,r) = \mathcal{M}(F_A), \qquad (4.80)
\]
\[
f(X|D,r) = \mathcal{M}(F_X), \qquad (4.81)
\]
\[
f(l|D,r) = t\mathcal{N}\left(\mu_l, \phi I_r; \mathcal{L}_r\right), \qquad (4.82)
\]
\[
f(\omega|D,r) = \mathcal{G}(\vartheta, \rho). \qquad (4.83)
\]
Here, M(·) denotes the von Mises-Fisher distribution (i.e. the Normal distribution restricted to the Stiefel manifold (4.64) [92]). Its matrix parameter is F_A ∈ R^{p×r} in (4.80), and F_X ∈ R^{n×r} in (4.81). tN(·) denotes the truncated Normal distribution on the stated support.

The shaping parameters of (4.80)–(4.83) are
\[
F_A = \widehat{\omega}\, D \widehat{X} \widehat{L}, \qquad (4.84)
\]
\[
F_X = \widehat{\omega}\, D' \widehat{A} \widehat{L}, \qquad (4.85)
\]
\[
\mu_l = \operatorname{diag}^{-1}\left(\widehat{X}'D'\widehat{A}\right), \qquad (4.86)
\]
\[
\phi = \widehat{\omega}^{-1}, \qquad (4.87)
\]
\[
\vartheta = \frac{pn}{2}, \qquad (4.88)
\]
\[
\rho = \frac{1}{2}\operatorname{tr}\left(DD' - 2D\widehat{X}\widehat{L}\widehat{A}'\right) + \frac{1}{2}\widehat{l'l}. \qquad (4.89)
\]


Step 5: Formulate the necessary VB-moments

The necessary VB-moments involved in (4.84)–(4.89) are Â, X̂, l̂, the moment of l′l, and ω̂, where, by definition, L̂ = diag(l̂). Moments Â and X̂ are expressed via the economic SVD (Definition 4.1) of the parameters F_A (4.84) and F_X (4.85),
\[
F_A = U_{F_A} L_{F_A} V_{F_A}', \qquad (4.90)
\]
\[
F_X = U_{F_X} L_{F_X} V_{F_X}', \qquad (4.91)
\]
with L_{F_X} and L_{F_A} both in R^{r×r}. Then,
\[
\widehat{A} = U_{F_A}\, G\!\left(p, L_{F_A}\right) V_{F_A}', \qquad (4.92)
\]
\[
\widehat{X} = U_{F_X}\, G\!\left(n, L_{F_X}\right) V_{F_X}', \qquad (4.93)
\]
\[
\widehat{l} = \mu_l + \sqrt{\phi}\,\varphi\!\left(\mu_l, \phi\right), \qquad (4.94)
\]
\[
\widehat{l'l} = r\phi + \mu_l'\,\widehat{l} - \sqrt{\phi}\,\kappa\!\left(\mu_l, \phi\right)' 1_{r,1}, \qquad (4.95)
\]
\[
\widehat{\omega} = \frac{\vartheta}{\rho}. \qquad (4.96)
\]
Moments of tN(·) and M(·), from which (4.92)–(4.95) are derived, are reviewed in Appendices A.4 and A.6 respectively. Functions G(·,·), ϕ(·,·) and κ(·,·) are also defined there. Note that each of G(·,·), ϕ(·,·) and κ(·,·) returns a multivariate value with dimensions equal to those of the multivariate argument. Multivariate arguments of the functions ϕ(·,·) and κ(·,·) are evaluated element-wise, using (A.26) and (A.27).

Remark 4.7 (Approximate support for l). Moments of the exact VB-marginal for l (4.82) are difficult to evaluate, as L_r (4.71) forms a non-trivial subspace of R^r. Therefore, we approximate the support, L_r, by an envelope, \bar{\mathcal{L}}_r ≈ \mathcal{L}_r. Note that (4.70) is maximized for each i = 1, . . . , r if l_1 = l_2 = . . . = l_i, l_{i+1} = l_{i+2} = . . . = l_r = 0. In this case, Σ_{j=1}^r l_j^2 = i l_i^2 ≤ 1, which defines an upper bound, l_i ≤ \bar{l}_i, where \bar{l}_i = i^{-1/2}. Hence, (4.71) has a rectangular envelope,
\[
\bar{\mathcal{L}}_r = \left\{ l : 0 < l_i \le \bar{l}_i = i^{-\frac{1}{2}},\ i = 1 \ldots r \right\}. \qquad (4.97)
\]
(4.82) is then approximated by
\[
f(l|D,r) = \prod_{i=1}^{r} t\mathcal{N}\!\left(\mu_{i,l}, \phi;\ \left[0, i^{-\frac{1}{2}}\right]\right). \qquad (4.98)
\]
Moments of the truncated Normal distribution in (4.98) are available via the error function, erf(·) (Appendix A.4). The error of approximation in (4.97) is largest at the boundaries, l_i = l_j, i ≠ j, i, j ∈ {1 . . . r}, and is negligible when no two l_i's are equal.
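The moments needed under the envelope approximation (4.98) are those of scalar Normal distributions truncated to [0, i^{-1/2}]. The sketch below (not from the book) evaluates them with scipy's truncated Normal, as an alternative to the erf-based formulae of Appendix A.4; the example values of µ_l and φ are arbitrary.

```python
import numpy as np
from scipy.stats import truncnorm

def truncated_l_moments(mu_l, phi):
    """First and second non-central moments of l_i ~ N(mu_{i,l}, phi),
    truncated to [0, i^{-1/2}], as in the envelope approximation (4.98)."""
    mu_l = np.asarray(mu_l, dtype=float)
    sd = np.sqrt(phi)
    upper = 1.0 / np.sqrt(np.arange(1, mu_l.size + 1))   # bar{l}_i = i^{-1/2}
    a = (0.0 - mu_l) / sd                                 # standardized lower bounds
    b = (upper - mu_l) / sd                               # standardized upper bounds
    dist = truncnorm(a, b, loc=mu_l, scale=sd)
    l_hat = dist.mean()
    l_sq = dist.var() + l_hat ** 2                        # E[l_i^2]
    return l_hat, l_sq

print(truncated_l_moments([0.8, 0.5, 0.2], phi=0.01))
```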


Step 6: Reduce the VB-equations

The set of VB-equations, (4.84)–(4.96), is non-linear and must be evaluated numerically. No analytical simplification is available. One possibility is to run the IVB algorithm (Algorithm 1) on the full set (4.84)–(4.96). It can be shown [91] that initialization of the full IVB algorithm via the ML estimates (4.63) yields VB-moments, (4.92)–(4.93), that are collinear with the ML solution (4.63). Note that this is the same space that was used for restriction of the PPCA solution (Proposition 4.1), and from which the FVPCA algorithm was derived. Therefore, in this step, we impose the restriction of collinearity with the ML solution on the moments, (4.92)–(4.93), thereby obtaining reduced VB-equations.

Proposition 4.3. We search for a solution of Â (4.92) and X̂ (4.93) in the space of scaled singular vectors of the matrix D (4.7):
\[
\widehat{A} = U_{D;r} K_A, \qquad (4.99)
\]
\[
\widehat{X} = V_{D;r} K_X. \qquad (4.100)
\]
U_D and V_D are given by the economic SVD of D (4.7). K_A = diag(k_A) ∈ R^{r×r} and K_X = diag(k_X) ∈ R^{r×r} denote matrix constants of proportionality which must be inferred. Then, the distributions (4.84)–(4.96) are fully determined by the following equations:
\[
k_A = G\!\left(p,\ \widehat{\omega}\, l_{D;r} \circ k_X \circ \widehat{l}\,\right), \qquad (4.101)
\]
\[
k_X = G\!\left(n,\ \widehat{\omega}\, l_{D;r} \circ k_A \circ \widehat{l}\,\right), \qquad (4.102)
\]
\[
\mu_l = k_X \circ l_{D;r} \circ k_A, \qquad (4.103)
\]
\[
\phi = \widehat{\omega}^{-1}, \qquad (4.104)
\]
\[
\widehat{l} = \mu_l + \sqrt{\phi}\,\varphi\!\left(\mu_l, \phi\right), \qquad (4.105)
\]
\[
\widehat{l'l} = r\phi + \mu_l'\,\widehat{l} - \sqrt{\phi}\,\kappa\!\left(\mu_l, \phi\right)' 1_{r,1}, \qquad (4.106)
\]
\[
\widehat{\omega} = pn\left[l_D' l_D - 2\left(k_X \circ \widehat{l} \circ k_A\right)' l_{D;r} + \widehat{l'l}\right]^{-1}. \qquad (4.107)
\]

Proof: Substituting (4.100) into (4.84), and using (4.7), we obtain
\[
F_A = \widehat{\omega}\left(U_D L_D V_D'\right) V_{D;r} K_X \widehat{L} = \widehat{\omega}\, U_{D;r} L_{D;r,r} K_X \widehat{L}. \qquad (4.108)
\]
This is in the form of the SVD of F_A (4.90), with assignments
\[
U_{F_A} = U_{D;r}, \qquad L_{F_A} = \widehat{\omega}\, L_{D;r,r} K_X \widehat{L}, \qquad V_{F_A} = I_r. \qquad (4.109)
\]
Substituting (4.109) into (4.92), then
\[
\widehat{A} = U_{D;r}\, G\!\left(p,\ \widehat{\omega}\, L_{D;r,r} K_X \widehat{L}\right) I_r. \qquad (4.110)
\]
Note that G(·,·) is a diagonal matrix since its multivariate argument is diagonal. Hence, (4.110) has the form (4.99) under the assignment
\[
K_A = G\!\left(p,\ \widehat{\omega}\, L_{D;r,r} K_X \widehat{L}\right). \qquad (4.111)
\]
Equating the diagonals in (4.111), we obtain (4.101). (4.102) is found in exactly the same way. (4.103)–(4.107) can easily be obtained by substituting (4.99)–(4.100) into (4.86), (4.89), (4.94) and (4.95), and exploiting the orthogonality of U_D and V_D.

Step 7: Run the IVB Algorithm

Using Proposition 4.3, the reduced set of VB-equations is now (4.101)–(4.107). The IVB algorithm implied by this set will be known as Orthogonal Variational PCA (OVPCA).

Under Proposition 4.3, we note that the optimal values of Â and X̂ are determined, up to the inferred constants of proportionality, k_A and k_X, by U_D and V_D respectively. The iterative algorithm is then greatly simplified, since we need only iterate on the 2r degrees-of-freedom constituting K_A and K_X together, and not on A and X with $r\left(p+n-\frac{r-1}{2}\right)$ degrees-of-freedom. The ML solution is a reasonable choice as an orthogonal initializer, and is conveniently available via the SVD of D (4.63). With this choice, the required initializers of the matrices K_A and K_X are K_A^{[1]} = K_X^{[1]} = I_r, via (4.99) and (4.100).

Remark 4.8 (Automatic Rank Determination (ARD) property of the OVPCA algorithm). Typically, k_{i,A} and k_{i,X} converge to zero for i > r̄, for some empirical upper bound, r̄. A similar property was used as a rank selection criterion for the FVPCA algorithm (Remark 4.4). There, the rank was chosen as r̂ = r̄ [24]. This property will be used when comparing OVPCA and FVPCA (Algorithm 3). Nevertheless, the full posterior distribution of r, i.e. f(r|D), will be derived for the orthogonal PPCA model shortly (see Section 4.3.3).

Remark 4.9. Equations (4.101)–(4.103) are satisfied for
\[
k_A = k_X = \mu_l = 0_{r,1}, \qquad (4.112)
\]
independently of the data, i.e. independently of l_D. Therefore, (4.112) will appear as a critical point of KLD_{VB} (3.6) in the VB approximation. (4.112) is appropriate when M_{(r)} = 0_{p,n} (4.1), in which case r = 0. Of course, such an inference is to be avoided when a signal is known to be present.

Remark 4.10 (Uniqueness of solution). In Section 4.3.1, we noted 2^r cases of the SVD decomposition (4.60), differing only in the signs of the singular vectors. Note, however, that Proposition 4.3 separates the posterior mean values, Â (4.99) and X̂ (4.100), into orthogonal and proportional terms. Only the proportional terms (k_A and k_X) are estimated using the IVB algorithm. Since all diagonal elements of the function G(·,·) are confined to the interval [0, 1] (Appendix A.6.2), the converged values of k_A and k_X are always positive. The VB solution is therefore unimodal, approximating only one of the possible 2^r modes. This is important, as the all-mode distribution of A is symmetric around the coordinate origin, which would consign the posterior mean to Â = 0_{p,r}.

Step 8: Report the VB-marginals

The VB-marginals are given by (4.80)–(4.83), with shaping parameters F_A, F_X, µ_l, φ, ϑ and ρ (4.84)–(4.89). In classical PCA applications, estimates Â are required. In our Bayesian context, complete distributions of the parameters are available, and so other moments and uncertainty measures can be evaluated. Moreover, we can also report an estimate, r̂ = r̄, of the number of relevant principal components, using the ARD property (Remark 4.8).

More ambitious tasks, such as inference of rank, may be addressed using the inferred VB-marginals. These tasks will be described in the following subsections.

4.3.3 Inference of Rank

In the foregoing, we assumed that the rank, r, of the model (4.60) was known a priori. If this is not the case, then inference of this parameter can be made using Bayes' rule:
\[
f(r|D) \propto f(D|r)\, f(r), \qquad (4.113)
\]
where f(r) denotes the prior on r, typically uniform on 1 ≤ r ≤ p ≤ n. (4.113) is constructed by marginalizing over the parameters of the model, yielding a complexity-penalizing inference. This 'Ockham sensitivity' is a valuable feature of the Bayesian approach.

The marginal distribution of D, i.e. f(D|r), can be approximated by a lower bound, using Jensen's inequality (see Remark 3.1):
\[
\ln f(D|r) \approx \ln f(D|r) - \mathrm{KL}\!\left(\tilde f(\theta|D,r)\,\middle\|\, f(\theta|D,r)\right) \qquad (4.114)
\]
\[
= \int_{\Theta^*} \tilde f(\theta|D,r)\left(\ln f(D,\theta|r) - \ln \tilde f(\theta|D,r)\right) d\theta.
\]
The parameters are θ = {A, X, L, ω}, and f(D, θ|r) is given by (4.75). From (4.114), the optimal approximation, \tilde f(θ|D,r), under a conditional independence assumption is the VB-approximation (4.77). Substituting (4.80)–(4.86) into (4.77), and the result, along with (4.75), into (4.114), then (4.113) yields
\[
f(r|D) \propto \exp\Big\{ -\frac{r}{2}\ln\pi + r\ln 2 + \ln\Gamma\!\left(\frac{r}{2}+1\right) + \ln(r!) \qquad (4.115)
\]
\[
+ \frac{1}{2}\phi^{-1}\left(\mu_l'\mu_l - \widehat{l}'\mu_l - \mu_l'\widehat{l} + \widehat{l'l}\right)
\]
\[
+ \ln\, {}_0F_1\!\left(\frac{1}{2}p,\ \frac{1}{4}F_A F_A'\right) - \widehat{\omega}\left(k_X \circ \widehat{l} \circ k_A\right)' l_{D;r}
\]
\[
+ \ln\, {}_0F_1\!\left(\frac{1}{2}n,\ \frac{1}{4}F_X F_X'\right) - \widehat{\omega}\left(k_X \circ \widehat{l} \circ k_A\right)' l_{D;r}
\]
\[
+ \sum_{j=1}^{r}\ln\left[\operatorname{erf}\!\left(\frac{\bar l_j - \mu_{j,l}}{\sqrt{2\phi}}\right) + \operatorname{erf}\!\left(\frac{\mu_{j,l}}{\sqrt{2\phi}}\right)\right]
\]
\[
+ r\ln\!\left(\sqrt{\pi\phi/2}\right) - (\vartheta + 1)\ln\rho \Big\},
\]
where k_A, k_X, µ_l, φ, l̂, the moment of l′l, and ω̂ are the converged solutions of the OVPCA algorithm, and F_A and F_X are functions of these via, for example, (4.108). \bar l_j, j = 1, . . . , r, are the upper bounds on l_j in the envelope \bar{\mathcal{L}}_r (4.97).

One of the main algorithmic advantages of PCA is that a single evaluation of all p eigenvectors, i.e. U (4.11), provides with ease the PCA solution for any rank r < p, via the simple extraction of the first r columns, U_{;r} (4.7), of U = U_D (4.14). The OVPCA algorithm also enjoys this property, thanks to the linear dependence of Â (4.99) on U_D. Furthermore, X̂ enjoys the same property, via (4.100). Therefore, in the OVPCA procedure, the solution for a given rank is obtained by simple extraction of U_{D;r} and V_{D;r}, followed by iterations involving only the scaling coefficients, k_A and k_X. Hence, p × (p + n) values (those of U_D and V_D) are determined rank-independently via the SVD (4.7), and only 4r + 3 values (those of k_A, k_X, µ_l, φ, l̂, the moment of l′l, and ω̂ together) are involved in the rank-dependent iterations (4.101)–(4.107).
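The reuse of a single SVD across candidate ranks can be sketched as follows (not from the book). Only the slicing of the rank-independent factors is shown; the rank-dependent OVPCA iterations on (4.101)–(4.107) are not implemented here.

```python
import numpy as np

def svd_once(D):
    """Rank-independent part: computed once and reused for every candidate rank r."""
    U, l, Vt = np.linalg.svd(D, full_matrices=False)
    return U, l, Vt.T

def rank_r_inputs(U, l, V, r):
    """Rank-dependent inputs of the OVPCA iterations: U_{D;r}, V_{D;r}, l_{D;r},
    obtained by simple column extraction."""
    return U[:, :r], V[:, :r], l[:r]

rng = np.random.default_rng(1)
D = rng.normal(size=(10, 100))
U, l, V = svd_once(D)
for r in (1, 2, 3):
    U_r, V_r, l_r = rank_r_inputs(U, l, V, r)
    print(r, U_r.shape, V_r.shape, l_r.shape)
```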

4.3.4 Moments of the Model Parameters

The Bayesian solution provides an approximate posterior distribution of all involved parameters, (4.80)–(4.83) and (4.115). In principle, moments and uncertainty bounds can then be inferred from these distributions, in common with any Bayesian method.

The first moments of all involved parameters have already been presented, (4.92)–(4.94) and (4.96), since they are necessary VB-moments. The second non-central moment of l, i.e. the moment of l′l, was also generated. Parameter ω is Gamma-distributed (4.83), and so its confidence intervals are available.

The difficult task is to determine uncertainty bounds on the orthogonal parameters, A and X, which are von Mises-Fisher distributed, (4.80) and (4.81). Such confidence intervals are not available. Therefore, we develop approximate uncertainty bounds in Appendix A.6.3, using a Gaussian approximation. The distribution of X ∈ R^{n×r} (4.81) is fully determined by the r-dimensional vector, y_X, defined as follows:
\[
y_X(X) = \operatorname{diag}^{-1}\left(U_{F_X}' X V_{F_X}\right) = \operatorname{diag}^{-1}\left(V_{D;r}' X\right). \qquad (4.116)
\]
This result is from (A.35). Therefore, confidence intervals on X can be mapped to confidence intervals on y_X, via (4.116), as shown in Appendix A.6.2. The idea is illustrated graphically for p = 2 and r = 1 in Fig. 4.2. It follows that the HPD region (Definition 2.1) of the von Mises-Fisher distribution is bounded by \underline{X}, where
\[
\underline{X} = \left\{ X : y_X(X) = \underline{y}_X \right\}, \qquad (4.117)
\]
using (4.116) and with \underline{y}_X given in Appendix A.6.3. Since A has the same distribution, \underline{A} is defined analogously.

[Figure 4.2. Legend: space of X (thickness is proportional to f(X)); direction of the VMF maximum, which is also the axis of y_X; maximum of the VMF distribution; mean value; example of the projection X → y_X; confidence interval for f(y_X); projection of the uncertainty bounds y_X → X.]

Fig. 4.2. Illustration of the properties of the von Mises-Fisher (VMF) distribution, X ∼ M(F), for X, F ∈ R^{2×1}.

4.4 Simulation Studies

We have developed three algorithms for VB inference of matrix decomposition models in this Chapter, namely VPCA, FVPCA and OVPCA (see Fig. 4.1). We will now compare their performance in simulations. Data were generated using model (4.5) with p = 10, n = 100 and r = 3. These data are displayed in Fig. 4.3. Three noise levels, E (4.4), were considered: (i) ω = 100 (SIM1), (ii) ω = 25 (SIM2) and (iii) ω = 10 (SIM3). Note, therefore, that the noise level is increasing from SIM1 to SIM3.

4.4.1 Convergence to Orthogonal Solutions: VPCA vs. FVPCA

VB-based inference of the PPCA model (4.5) was presented in Section 4.2. Recall that the PPCA model does not impose any restrictions of orthogonality. However, we have formulated Conjecture 4.1, which states that a solution of the VB-equations can be found in the orthogonal space spanned by the singular vectors of the data matrix D (4.7). In this Section, we explore the validity of this conjecture.

Recall, from Step 7 of the VB method in Section 4.2, that two algorithms exist for evaluation of the shaping parameters of the VB-marginals (4.26)–(4.29):

[Figure 4.3. Panels: an example data realization d_{i,:}, i = 1 . . . 10 (SIM1); simulated values a_1, a_2, a_3 (p = 10); simulated values x_1, x_2, x_3 (n = 100).]

Fig. 4.3. Simulated data, D, used for testing the PCA-based VB-approximation algorithms. SIM1 data are illustrated, for which ω = 100.

FVPCA algorithm (Algorithm 3), which uses Conjecture 4.1, and is deterministically initialized via the ML solution (4.13).

VPCA algorithm, which does not use Conjecture 4.1, and is initialized randomly.

The validity of the conjecture is tested by comparing the results of both algorithms via a Monte Carlo simulation using many random initializations in the VPCA case. If Conjecture 4.1 is true, then the posterior moments, µ_X (4.32) and µ_A (4.30), inferred by the VPCA algorithm should converge to orthogonal (but not orthonormal) matrices for any initialization, including non-orthogonal ones. Therefore, a Monte Carlo study was undertaken, involving 100 runs of the VPCA algorithm (4.30)–(4.40). During the iterations, we tested orthogonality of the posterior moments, µ_X (4.32), via the assignment
\[
Q(\mu_X) = \left\| \mu_X' \mu_X \right\|,
\]
where ||A|| ≡ [|a_{i,j}|], ∀i, j, denotes the matrix of absolute values of its elements. The following criterion of diagonality of Q(·), being, therefore, a criterion of orthogonality of µ_X, is then used:
\[
q(\mu_X) = \frac{1_{r,1}'\, Q(\mu_X)\, 1_{r,1}}{1_{r,1}'\, \operatorname{diag}^{-1}\!\left(Q(\mu_X)\right)}; \qquad (4.118)
\]
i.e. the ratio of the sum of all elements of Q(µ_X) to the sum of its diagonal elements. Obviously, q(µ_X) = 1 for a diagonal matrix, and q(µ_X) > 1 for a non-diagonal matrix, µ_X. We changed the stopping rule in the IVB algorithm (step 7, Algorithm 3) to
\[
q(\mu_X) < 1.01. \qquad (4.119)
\]
Hence, the absolute value of the non-diagonal elements must be less than 1% of the diagonal elements for stopping to occur. Note that all the experiments were performed with the initial value ω[1] from Remark 4.5.
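A direct transcription of (4.118)–(4.119) is sketched below (not from the book), assuming µ_X is available as a numpy array; the 1.01 threshold is the stopping value quoted above, and the example matrix is an arbitrary orthogonal (but not orthonormal) illustration.

```python
import numpy as np

def q_criterion(mu_X):
    """Diagonality criterion q(mu_X) of (4.118): sum of all elements of
    Q = |mu_X' mu_X| divided by the sum of its diagonal elements."""
    Q = np.abs(mu_X.T @ mu_X)
    return Q.sum() / np.trace(Q)

def stop_rule(mu_X, threshold=1.01):
    """Stopping rule (4.119) used in the Monte Carlo study."""
    return q_criterion(mu_X) < threshold

# orthogonal (but not orthonormal) columns give q = 1 exactly
mu_X = np.array([[2.0, 0.0], [0.0, 3.0], [0.0, 0.0]])
print(q_criterion(mu_X), stop_rule(mu_X))
```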

In all MC runs, (4.119) was satisfied, though it typically took many iterations of the VPCA algorithm. The results are displayed in Fig. 4.4. Histograms of q(µ_X) (4.118) for the initializing matrices, µ_X^{[1]}, are displayed in the left panel, while q(µ_X) for the converged µ_X^{[m]} are displayed in the middle panel. In the right panel, the histogram of the number of iterations required to satisfy the stopping rule is displayed. We conclude that Conjecture 4.1 is verified; i.e. the solution exists in the orthogonal space (4.41)–(4.42).

[Figure 4.4. Three histograms: q(µ_X^{[1]}); q(µ_X^{[m]}); number of iterations (in thousands).]

Fig. 4.4. Monte Carlo study (100 trials) illustrating the convergence of the VPCA algorithm to an orthogonal solution. Left: criterion for initial values, q(µ_X^{[1]}) (4.118). Middle: criterion for converged values, q(µ_X^{[m]}). Right: number of iterations required for convergence.


From now on, we will assume validity of Conjecture 4.1, such that the VPCA and FVPCA algorithms provide the same inference of A, X and ω. In the case of the SIM1 data, we display, in Table 4.1, the median value of k_A obtained by projecting the VPCA inference, µ_A, into the orthogonal space (4.41)–(4.42). This compares very closely to k_A (4.45) inferred directly via the FVPCA algorithm (Algorithm 3).

Table 4.1. Comparison of converged values of k_A obtained via the VPCA and FVPCA algorithms.

              VPCA (median)    FVPCA
  k_{1,A}         9.989        9.985
  k_{2,A}         9.956        9.960
  k_{3,A}         8.385        8.386

In subsequent simulations, we will use only the FVPCA algorithm for inference of the PPCA model, since it is faster and its initialization is deterministic.

4.4.2 Local Minima in FVPCA and OVPCA

In this simulation study, we design experiments which reveal the existence of local minima in the VB-approximation for the PPCA model (evaluated by FVPCA (Algorithm 3)), and for the orthogonal PPCA model (evaluated by the OVPCA algorithm (Section 4.3.2, Step 7)). Recall that these algorithms have the following properties:

(i) D enters each algorithm only via its singular values, l_D (4.7).
(ii) For each setting of ω, the remaining parameters are determined analytically.

From (ii), we need to search for a minimum of KLD_VB (3.6) only in the one-dimensional space of ω. Using the asymptotic properties of PCA (Remark 4.5), we already have a reasonable interval estimate for ω, as given in (4.59). We will test the initialization of both algorithms using values from this interval. If KLD_VB for these models is unimodal, the inferences should converge to the same value for all possible initializations. Since all other VB-moments are obtained deterministically once ω is known, we monitor convergence only via ω.

In the case of the datasets SIM1 and SIM3, ω converged to the same value for all tested initial values, ω[1], in the interval (4.59). This was true for both the FVPCA and OVPCA algorithms. However, for the dataset SIM2, the results of the two algorithms differ, as displayed in Fig. 4.5. The terminal values, ω[m], were the same for all initializations using FVPCA. However, the terminal ω[m] using OVPCA exhibits two different modes: (i) for the two lowest tested values of ω[1], where ω[m] was almost identical with FVPCA; and (ii) all other values of ω[1], where ω[m] was very close to the simulated value. In (i), the ARD property of OVPCA (Remark 4.8) gave r̂ = 2, while in (ii), r̂ = 3. This result suggests that there are two local minima in KLD_VB for the orthogonal PPCA model (4.62), for the range of initializers, ω[1], considered (4.59).


[Figure 4.5. Converged value of ω versus the initial value ω[1]; curves for FVPCA and OVPCA, and the simulated value.]

Fig. 4.5. Converged value of ω using the dataset SIM2, for different initial values, ω[1].

For the three datasets, SIM1 to SIM3, there was only one minimizer of KLD_VB in the case of the PPCA model. We now wish to examine whether, in fact, the VB-approximation of the PPCA model can exhibit multiple modes for other possible datasets. Recall that the data enter the associated FVPCA algorithm only via their singular values, l_D. Hence, we need only simulate multiple realizations of l_D, rather than full observation matrices, D. Therefore, in a Monte Carlo study involving 1000 runs, we generated 10-dimensional vectors, l_D^{(i)} = exp(l_MC^{(i)}), where each l_MC^{(i)} was drawn from the Uniform distribution, U([0, 1]^{10}). For each of the 1000 vectors of singular values, l_D^{(i)}, we generated the converged estimate, ω[m], using the FVPCA algorithm. As in the previous simulation, we examined the range of initializers, ω[1], given by (4.59). For about 2% of the datasets, non-unique VB-approximations were obtained.

In the light of these two experiments, we conclude the following:

• There are local minima in KLD_VB for both the PPCA and orthogonal PPCA models, leading to non-unique VB-approximations.

• Non-unique behaviour occurs only rarely.

• Each local minimum corresponds to a different inferred r̂ arising from the ARD property of the associated IVB algorithm (Remark 4.4).

4.4.3 Comparison of Methods for Inference of Rank

We now study the inference of rank using the FVPCA and OVPCA algorithms. The true rank of the simulated data is r = 3. Many heuristic methods for the choice of the number of relevant principal components have been proposed in the literature [80]. These methods are valuable since they provide an intuitive insight into the problem. Using Remark 4.3, we will consider the criterion of cumulative variance:
\[
v_{i,c} = \frac{\sum_{j=1}^{i}\lambda_j}{\sum_{j=1}^{p}\lambda_j} \times 100\%. \qquad (4.120)
\]
Here, λ_j are the eigenvalues (4.11) of the simulated data, D. This criterion is displayed in the third and fourth columns of Fig. 4.6. As before, we test the methods using the three datasets, SIM1 to SIM3, introduced at the beginning of the Section.
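A sketch (not from the book) of the cumulative-variance criterion (4.120), computed from the eigenvalues of DD′; the synthetic rank-3 data below are an arbitrary illustration.

```python
import numpy as np

def cumulative_variance(D):
    """v_{i,c} of (4.120): cumulative percentage of total variation explained
    by the first i eigenvalues of D D'."""
    lam = np.linalg.eigvalsh(D @ D.T)[::-1]        # eigenvalues, descending
    return 100.0 * np.cumsum(lam) / np.sum(lam)

rng = np.random.default_rng(2)
A = rng.normal(size=(10, 3))
X = rng.normal(size=(100, 3))
D = A @ X.T + 0.1 * rng.normal(size=(10, 100))     # rank-3 signal plus noise
print(np.round(cumulative_variance(D), 2))
```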

[Figure 4.6. For each of SIM1, SIM2 and SIM3 (rows), four panels: eigenvalues λ; eigenvalues λ (detail); cumulative variance; cumulative variance (detail).]

Fig. 4.6. Ad hoc methods for estimation of rank for three simulated datasets, each with different variance of noise. The method of visual examination (Remark 4.3) is applied to the eigenvalue graphs, λ = eig(DD′), and to the cumulative variance graphs.

For all datasets, the first two eigenvalues are dominant (first column), while the third eigenvalue is relatively small (it contains only 1% of total variation, as seen in Fig. 4.6 (right)). In the first row, i.e. dataset SIM1, the third eigenvalue is clearly distinct from the remaining ones. In the second dataset (SIM2), the difference is not so obvious, and it is completely lost in the third row (SIM3). Ad hoc choices of rank using (i) visual inspection (Remark 4.3) and (ii) the method of cumulative variance are summarized in Table 4.2. This result underlines the subjective nature of these ad hoc techniques.

Table 4.2. Estimation of rank in simulated data using ad hoc methods.

                        SIM1    SIM2    SIM3
  visual inspection       3     2-3       2
  cumulative variance   2-3       2       2

Next, we analyze the same three datasets using formal methods. The results of FVPCA (ARD property, Remark 4.4), OVPCA (ARD property, Remark 4.8, and posterior distribution, f(r|D) (4.115)), and the Laplace approximation (the posterior distribution, f_L(r|D), as discussed in Section 4.1.4), are compared in Table 4.3.

Table 4.3. Comparison of formal methods for inference of rank in simulated data.

          FVPCA   OVPCA       OVPCA, f(r|D), r =         Laplace, f_L(r|D), r =
           ARD     ARD       2      3      4     5        2      3      4      5
  SIM1      3       3        0    98.2    1.7   0.1       0     82     13      2
  SIM2      2       3       96     3.5    0.2   0        70     25      3      0.5
  SIM3      2       2       97     3.9    0.1   0        94      5      0.5    0.0

  Note: values of f(r|D) and f_L(r|D) not shown in the table are very close to zero, i.e. < 0.001.

Note that for low noise levels (SIM1) and high noise levels (SIM3), all methods inferred the rank consistently. In this case, data were simulated using the underlying model. We therefore regard the results of all methods to be correct. The differences between posterior probabilities caused by the different approximations are, in this case, insignificant. The differences will, however, become important for real data, as we will see in the next Section.

4.5 Application: Inference of Rank in a Medical Image Sequence

PCA is widely used as a dimensionality reduction tool in the analysis of medical image sequences [94]. This will be the major topic of Chapter 5. Here, in this Section, we address just one task of the analysis, namely inference of the number of physiological factors (defined as the rank, r, of matrix M(r) (4.1)) using the PPCA (4.6) and orthogonal PPCA (4.60) models.

For this purpose, we consider a scintigraphic dynamic image sequence of the kidneys. It contains n = 120 images, each of size 64 × 64 pixels. These images were preprocessed as follows:


• A rectangular area of p = 525 pixels was chosen as the region of interest at the same location in each image.

• Data were scaled by the correspondence analysis method [95] (see Section 5.1.1 of Chapter 5). With this scaling, the noise on the preprocessed data is approximately additive, isotropic and Gaussian [95], satisfying the model assumptions (4.4).

Note that the true rank of M(r) is therefore r ≤ min(p, n) = 120. We compare the following three methods for inference of rank:

(i) The posterior inference of rank, f(r|D) (4.115), developed using the VB-approximation, and using the OVPCA algorithm.
(ii) The ARD property (Remark 4.8) of OVPCA.
(iii) The ARD property (Remark 4.8) of FVPCA.

The various inferences are compared in Table 4.4. For comparison, we also inferred the rank via the criterion of cumulative variance (4.120) (Fig. 4.7), in which case r̂ = 5 was inferred.

Table 4.4. Inference of rank for a scintigraphic image sequence (p = 525 and n = 120).

  OVPCA, f(r|D)               OVPCA, ARD property    FVPCA, ARD property
  Pr(r = 17|D) = 0.0004        r̂ = 45                 r̂ = 26
  Pr(r = 18|D) = 0.2761
  Pr(r = 19|D) = 0.7232
  Pr(r = 20|D) = 0.0002

  Note: where not listed, Pr(r|D) < 3 × 10^{-7}.

It is difficult to compare the performance of the methods since no 'true' rank is available. From a medical point-of-view, the number of physiological factors, r, should be 4 or 5. This estimate is supported by the ad hoc criterion (Fig. 4.7). From this perspective, the formal methods appear to over-estimate significantly the number of factors. The reason for this can be understood by reconstructing the data using the inferred number of factors recommended by each method (Table 4.5). Four consecutive frames of the actual scintigraphic data are displayed in the first row. Though the signal-to-noise ratio is poor, functional variation is clearly visible in the central part of the left kidney, and in the upper part of the right kidney, which cannot be accounted for by noise. The same frames of the sequence, reconstructed using r = 5 factors, as recommended by medical experts, are displayed in Table 4.5 (second row). This reconstruction fails to capture the observed functional behaviour. In contrast, the functional information is apparent in the sequence reconstructed using the f(r|D) inference of OVPCA (i.e. r = 19 factors). The same is true of sequences reconstructed using r > 19 factors, such as the r = 45 choice suggested by the ARD property of OVPCA (Table 4.5, last row).


[Figure 4.7. Cumulative variance (%) versus the number of principal components.]

Fig. 4.7. Cumulative variance for the scintigraphic data. For clarity, only the first 20 of n = 120 points are shown.

Table 4.5. Reconstruction of a scintigraphic image sequence using different numbers, r, of factors.

  number of factors, r            frames 48-51 of the dynamic image sequence
  original images (r = 120)       [images]
  ad hoc criterion (r = 5)        [images]
  f(r|D) (OVPCA) (r = 19)         [images]
  ARD (OVPCA) (r = 45)            [images]

4.6 Conclusion

The VB method has been used to study matrix decompositions in this Chapter. Our aim was to find Bayesian extensions for Principal Component Analysis (PCA), which is the classical data analysis technique for dimensionality reduction. We explained that PCA is the ML solution for inference of a low-rank mean value, M(r), in the isotropic Normal distribution (4.4). This mean value, M(r), can be modelled in two ways (Fig. 4.1): (i) as a product of full-rank matrices, M(r) = AX′ (PPCA model), or (ii) via its SVD, M(r) = ALX′ (orthogonal PPCA model). The main drawback of the ML solution is that it does not provide inference of rank, r, nor uncertainty bounds on the point estimates, Â, X̂, etc. These tasks are the natural constituency of Bayesian inference, but this proves analytically intractable for both models.

Therefore, we applied the VB method to the PPCA and orthogonal PPCA models. In each case, the resulting VB-marginals, f(A|D) etc., addressed the weaknesses of the ML solution, providing inference of rank and parameter uncertainties. We should not miss the insights into the nature of the VB method itself, gained as a result of this work. We list some of these now:

• The VB-approximation is a generic tool for distributional approximation. It has been applied in this Chapter to an 'exotic' example, yielding the von Mises-Fisher distribution. The VB method always requires evaluation of the necessary VB-moments, which, in this case, had to be done approximately.

• Careful inspection of the implied set of VB-equations (Step 6 of the VB method) revealed that an orthogonality constraint could significantly reduce this set of equations, yielding much faster variants of the IVB algorithm, namely FVPCA and OVPCA. Further study of the VB-equations allowed partial analytical solutions to be found. The resulting inference algorithms then iterated in just one variable (ω̂ (4.40) in this case).

• The VB-approximation is found by minimization of KLD_VB (3.6), which is known to have local minima [64] (see Section 3.16). We took care to study this issue in simulation (Section 4.4.2). We discovered local minima in the VB-approximation of both the PPCA model (via FVPCA) and the orthogonal PPCA model (via OVPCA). Hence, convergence to a local (or global) minimum is sensitive to initialization of the IVB algorithm (via ω[1] in this case). Therefore, an appropriate choice of initial conditions is an important step in the VB method.

• We have shown that the VB-approximation is suitable for high-dimensional problems. With reasonably chosen initial conditions, the IVB algorithm converged within a moderate number of steps. This may provide significant benefits when compared to stochastic sampling techniques, since there is no requirement to draw large numbers of samples from the implied high-dimensional parameter space.

In the next Chapter, we will build on this experience, using VB-approximations to infer diagnostic quantities in medical image sequences.


5

Functional Analysis of Medical Image Sequences

Functional analysis is an important area of enquiry in medical imaging. Its aim is to analyze physiological function (i.e. behaviour or activity over time which can convey diagnostic information) of biological organs in living creatures. The physiological function is typically measured by the volume of a liquid involved in the physiological process. This liquid is marked by a contrast material (e.g. a radiotracer in nuclear medicine [96]) and a sequence of images is obtained over time. A typical such sequence is illustrated in Fig. 5.1.

The following assumptions are made:

(i) There is no relative movement between the camera (e.g. the scintigraphic camera in nuclear medicine) and the imaged tissues.
(ii) Physiological organs do not change their shape.
(iii) Changes in the volume of a radiotracer within an organ cause a linear response in activity, uniformly across the whole organ.

Under these assumptions, the observed images can be modelled as a linear combination of underlying time-invariant organ images [94]. The task is then to identify these organ images, and their changing intensity over time. The idea is illustrated in Fig. 5.1 for a scintigraphic study of human kidneys (a renal study). In Fig. 5.1, the complete sequence of 120 pictures is interpreted as a linear combination of these underlying organ images, namely the left and right kidneys, and the urinary bladder. The activity of the radiotracer in each organ is displayed below the corresponding organ image.

In this Chapter, we proceed as follows: (i) we use physical modelling of the dynamic image data to build an appropriate mathematical model, inference of which is intractable; (ii) we replace this model by a linear Gaussian model (called the FAMIS model), exact inference of which remains intractable; (iii) we use the VB method of Chapter 3 to obtain approximate inference of the FAMIS model parameters; and (iv) we examine the performance of the VB-approximation for the FAMIS model in the context of real clinical data.


[Figure 5.1. A renal scintigraphic sequence decomposed by functional analysis into three factors, labelled Left kidney, Right kidney and Urinary bladder; each factor image is shown together with its activity curve over the 120 time points.]

Fig. 5.1. Functional analysis of a medical image sequence (renal study).

5.1 A Physical Model for Medical Image Sequences

The task is to analyze a sequence of n medical images taken at times t = 1, . . . , n. Here, we use t as the time-index. The relation of t to real time, τ_t ∈ R, is significant for the clinician, but we will assume that this mapping to real time is handled separately. Each image is composed of p pixels, stored column-wise as a p-dimensional vector of observations, d_t. The entire sequence forms the matrix D ∈ R^{p×n}. Typically p ≫ n, even for poor-quality imaging modalities such as scintigraphy [96]. It is assumed that each image in the sequence is formed from a linear combination of r < n < p underlying images, a_j, j = 1, . . . , r, of the physiological organs. Formally,
\[
d_t = \sum_{j=1}^{r} a_j x_{t,j} + e_t. \qquad (5.1)
\]


Here, a_j, j = 1, . . . , r, are the underlying time-invariant p-dimensional image vectors, known as the factor images, and x_{t,j} is the weight assigned to the jth factor image at time t. The vector, x_j = [x_{1,j}, . . . , x_{n,j}]′, of weights over time is known as the factor curve or the activity curve of the jth factor image. The product a_j x_j′ is known as the jth factor. Vector e_t ∈ R^p models the observation noise.

The physiological model (5.1) can be written in the form of the following matrix decomposition (4.2):
\[
D = AX' + E, \qquad (5.2)
\]
where A = [a_1, . . . , a_r] ∈ R^{p×r}, X = [x_1, . . . , x_r] ∈ R^{n×r} and E = [e_1, . . . , e_n] ∈ R^{p×n}. The organization of the image data into matrix objects is illustrated in Fig. 5.2.

[Figure 5.2. Schematic of the decomposition D = AX′ + E.]

Fig. 5.2. The matrix representation of a medical image sequence.

It would appear that the matrix decomposition tools of Chapter 4 could be used here. However, those results cannot be directly applied since the physical nature of the problem imposes extra restrictions on the model (5.2). In the context of nuclear medicine, each pixel of d_t is acquired as a count of radioactive particles. This has the following consequences:

1. Finite counts of radioactive particles are known to be Poisson-distributed [95]:
\[
f(D|A,X) = \prod_{i=1}^{p}\prod_{t=1}^{n} \operatorname{Po}\!\left(\sum_{j=1}^{r} a_{i,j} x_{t,j}\right), \qquad (5.3)
\]
where Po(·) denotes the Poisson distribution [97]. From (5.2) and (5.3), we conclude that E is a signal-dependent noise, which is characteristic of imaging with finite counts (see the simulation sketch after this list).

2. All pixels, d_{i,t}, aggregated in the matrix, D, are non-negative. The factor images, a_j, being columns of A, are interpreted as observations of isolated physiological organs, and so the a_j's are also assumed to be non-negative. The factor curves, x_j, being the columns of X, are interpreted as the variable intensity of the associated factor images, which, at each time t, acts to multiply each pixel of a_j by the same amount, x_{t,j}. Therefore, the x_j's are also assumed to be non-negative. In summary,
\[
a_{i,j} \ge 0, \quad i = 1, \ldots, p, \quad j = 1, \ldots, r, \qquad (5.4)
\]
\[
x_{t,j} \ge 0, \quad t = 1, \ldots, n, \quad j = 1, \ldots, r.
\]
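To make the data model concrete, the sketch below (not from the book) generates a synthetic sequence from non-negative factors according to (5.3)–(5.4); the Gamma-distributed factor images and Gaussian-shaped factor curves are arbitrary illustrative choices, not the physiological factors of a real study.

```python
import numpy as np

def simulate_sequence(p=64, n=120, r=3, seed=0):
    """Draw D ~ Po(A X') with non-negative factor images A and factor curves X."""
    rng = np.random.default_rng(seed)
    A = rng.gamma(shape=2.0, scale=5.0, size=(p, r))      # non-negative factor images
    t = np.arange(n)[:, None]
    peaks = np.array([20.0, 50.0, 90.0])
    X = np.exp(-0.5 * ((t - peaks) / 15.0) ** 2)          # non-negative factor curves
    mean = A @ X.T                                        # p x n mean counts
    D = rng.poisson(mean)                                 # signal-dependent counting noise
    return D, A, X

D, A, X = simulate_sequence()
print(D.shape, D.min(), D.dtype)
```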

5.1.1 Classical Inference of the Physiological Model

The traditional inference procedure for A and X in (5.2) consists of the following three steps [94, 95]:

Correspondence analysis: a typical first step in classical inference is scaling. This refers to the pre-processing of the data matrix, D, in order to whiten the observation noise, E. The problem of noise whitening has been studied theoretically for various noise distributions [98–101]. In the case of the Poisson distribution (5.3), the following transformation is considered to be optimal [95]:
\[
\tilde D = \operatorname{diag}\!\left(\left(D 1_{n,1}\right)^{-\frac{1}{2}}\right) D\, \operatorname{diag}\!\left(\left(D' 1_{p,1}\right)^{-\frac{1}{2}}\right). \qquad (5.5)
\]
In (5.5), the notation v^k, v a vector, denotes the vector of powers of elements, [v_i^k] (see Notational Conventions, Page XV). Pre-processing the data via (5.5) is called correspondence analysis [95] (a numerical sketch of this and the following step is given after this list).

Orthogonal analysis is the step used to infer a low-rank signal, AX′, from the pre-processed data, \tilde D (5.5). PCA is used to decompose \tilde D into orthogonal matrices, \tilde A and \tilde X (Section 4.1). The corresponding estimates of A and X in the original model (4.5) are therefore
\[
\hat A = \operatorname{diag}\!\left(\left(D 1_{n,1}\right)^{\frac{1}{2}}\right)\tilde A, \qquad (5.6)
\]
\[
\hat X = \operatorname{diag}\!\left(\left(D' 1_{p,1}\right)^{\frac{1}{2}}\right)\tilde X. \qquad (5.7)
\]
Note, however, that these solutions are orthogonal and cannot, therefore, satisfy the positivity constraints for factor images and factor curves (5.4). Instead, they are used as the basis of the space in which the positive solution is to be found.

Oblique analysis is the procedure used to find the positive solution in the space determined by \hat A and \hat X from the orthogonal analysis step above. The existence and uniqueness of this solution are guaranteed under conditions given in [94, 102, 103].
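A sketch (not from the book) of the first two steps: the scaling (5.5), followed by PCA of the scaled data and the un-scaling (5.6)–(5.7). The way the singular values are split between the scaled-domain factors is one conventional choice, assumed here for illustration; the oblique-analysis step, which enforces positivity, is omitted.

```python
import numpy as np

def correspondence_scaling(D):
    """Scaling (5.5): approximately whitens the Poisson noise in D."""
    row = D.sum(axis=1)            # D 1_{n,1}
    col = D.sum(axis=0)            # D' 1_{p,1}
    D_tilde = (row ** -0.5)[:, None] * D * (col ** -0.5)[None, :]
    return D_tilde, row, col

def orthogonal_analysis(D, r):
    """Orthogonal analysis: PCA of the scaled data, then un-scaling as in (5.6)-(5.7)."""
    D_tilde, row, col = correspondence_scaling(D)
    U, l, Vt = np.linalg.svd(D_tilde, full_matrices=False)
    A_tilde = U[:, :r] * l[:r]     # assumed split: singular values absorbed into A_tilde
    X_tilde = Vt[:r, :].T
    A_hat = (row ** 0.5)[:, None] * A_tilde
    X_hat = (col ** 0.5)[:, None] * X_tilde
    return A_hat, X_hat

rng = np.random.default_rng(3)
D = rng.poisson(5.0, size=(525, 120)) + 1   # strictly positive toy counts
A_hat, X_hat = orthogonal_analysis(D, r=4)
print(A_hat.shape, X_hat.shape)
```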

5.2 The FAMIS Observation Model

The first two steps of the classical inference procedure above are equivalent to ML estimation for the following observation model:
\[
f(\tilde D|\tilde A, \tilde X, \omega) = \mathcal{N}\left(\tilde A \tilde X',\ \omega^{-1} I_p \otimes I_n\right), \qquad (5.8)
\]
for any scalar precision parameter, ω (Section 4.1.1 in Chapter 4). Hence, an additive, isotropic Gaussian noise model is implied for \tilde D. Using (5.5)–(5.7) in (5.8), the observation model for D is
\[
f(D|A,X,\omega) \propto \exp\left(-\frac{1}{2}\omega\,\operatorname{tr}\left[\Omega_p (D - AX')\,\Omega_n (D - AX')'\right]\right), \qquad (5.9)
\]
\[
\Omega_p = \operatorname{diag}\left(D 1_{n,1}\right)^{-1}, \qquad (5.10)
\]
\[
\Omega_n = \operatorname{diag}\left(D' 1_{p,1}\right)^{-1}. \qquad (5.11)
\]

Note that D enters (5.9) through the standard matrix Normal distributional form and via the data-dependent terms, Ω_p and Ω_n. If we now choose Ω_p and Ω_n to be parameters of the model, independent of D, then we achieve two goals:

(i) The tractable matrix Normal distribution is revealed.
(ii) These constants can be relaxed via prior distributions, yielding a flexible Bayesian model for analysis of medical image sequences. In particular, (5.10) and (5.11) can then be understood as ad hoc estimates for which, now, Bayesian alternatives will be available.

To summarize, the appropriate observation model, inspired by classical inference of medical image sequences, is as follows:
\[
f(D|A,X,\Omega_p,\Omega_n) = \mathcal{N}\left(AX',\ \Omega_p^{-1} \otimes \Omega_n^{-1}\right), \qquad (5.12)
\]
\[
\Omega_p = \operatorname{diag}(\omega_p), \qquad (5.13)
\]
\[
\Omega_n = \operatorname{diag}(\omega_n). \qquad (5.14)
\]
Here, A ∈ (R^+)^{p×r}, X ∈ (R^+)^{n×r}, ω_p ∈ (R^+)^p and ω_n ∈ (R^+)^n are the unknown parameters of the model. The positivity of A and X reflects their positivity in the physiological model (5.4). We note the following:

• Model (5.12), with its associated positivity constraints, will be known as the model for Functional Analysis of Medical Image Sequences (the FAMIS model). It is consistent with the matrix multiplicative decomposition (5.2) with white Gaussian noise, E.

• The imposed structure of the covariance matrix in (5.12) has the following interpretation. Each pixel of each image in the sequence, d_{i,t}, i = 1, . . . , p, t = 1, . . . , n, is independently distributed as
\[
f(d_{i,t}|A,X,\omega_{i,p},\omega_{t,n}) = \mathcal{N}\left(\sum_{j=1}^{r} a_{i,j} x_{t,j},\ \omega_{t,n}^{-1}\omega_{i,p}^{-1}\right).
\]
However, there is dependence in the covariance structure. Specifically, the variance profile is the same across the pixels of every image, d_t, being a scaled version of ω_p^{-1}. Similarly, the variance profile is the same for the intensity over time of every pixel, d_{i,:}, being a scaled version of ω_n^{-1}.


• We have confined ourselves to diagonal precision matrices and a Kronecker product structure. This involves just p + n degrees-of-freedom in the noise covariance matrix. Inference of the model with unrestricted precision matrices is not feasible because the number of parameters is then higher than the number, pn, of available data.

• As mentioned at the beginning of this Chapter, the model (5.1) is tenable only if the images are captured under the same conditions. This is rarely true in practice, resulting in the presence of artifacts in the inferred factors. The relaxation of known Ω_p and Ω_n, (5.10) and (5.11), allows these artifacts to be captured and modelled as noise. This property will be illustrated on clinical data later in this Chapter.

5.2.1 Bayesian Inference of FAMIS and Related Models

Exact Bayesian inference for the FAMIS model is not available. However, Bayesian solutions for related models can be found in the literature, and are briefly reviewed here.

Factor analysis model: this model [84] is closely related to the FAMIS model, with the following simplifications: (i) positivity of A and X is relaxed; and (ii) matrix Ω_p is assumed to be fixed at Ω_p = I_p. Bayesian inference of the factor analysis model was presented in [78]. The solution suffers the same difficulties as those of the scalar multiplicative decomposition (Section 3.7). Specifically, the marginal posterior distributions, and their moments, are not tractable.

Independent Component Analysis (ICA) model: in its most general form, noisy ICA [104] is a generalization of the FAMIS model. In fact, the implied matrix decomposition, D = AX′ + E (4.2), is identical. In ICA, A is known as the mixing matrix and the rows of X are known as sources. Therefore, in a sense, any method, such as the procedures which follow in this Chapter, may be called ICA. However, the keyword of ICA is the word 'independent'. The aim of ICA is to infer statistically independent sources, X:
\[
f(X|D) = \prod_{j=1}^{r} f(x_j|D). \qquad (5.15)
\]
Note that this assumption does not imply any particular functional form of the probability distribution. In fact, for a Gaussian distribution, the model (5.15) is identical to the one implied by PCA (4.4). A VB-approximation for (5.15), with priors, f(x_j), chosen as a mixture of Gaussian distributions, was presented in [29].

Note that this assumption does not imply any particular functional form of theprobability distribution. In fact, for a Gaussian distribution, the model (5.15)is identical to the one implied by PCA (4.4). A VB-approximation for (5.15),with priors, f (xj), chosen as a mixture of Gaussian distributions, was presentedin [29].

5.3 The VB Method for the FAMIS Model

The FAMIS observation model (5.12) is a special case of the matrix decompositionmodel (4.2) with additional assumptions. We can therefore expect similarities to theVB method of Section 4.2.


Step 1: Choose a Bayesian model

We must complement the observation model (5.12) with priors on the model parameters. The prior distributions on the precision parameters are chosen as follows:
\[
f(\omega_p|\vartheta_{p0},\rho_{p0}) = \prod_{i=1}^{p} \mathcal{G}_{\omega_{i,p}}\left(\vartheta_{i,p0}, \rho_{i,p0}\right), \qquad (5.16)
\]
\[
f(\omega_n|\vartheta_{n0},\rho_{n0}) = \prod_{t=1}^{n} \mathcal{G}_{\omega_{t,n}}\left(\vartheta_{t,n0}, \rho_{t,n0}\right),
\]
with vector shaping parameters, ϑ_{p0} = [ϑ_{1,p0}, . . . , ϑ_{p,p0}]′, ρ_{p0} = [ρ_{1,p0}, . . . , ρ_{p,p0}]′, ϑ_{n0} = [ϑ_{1,n0}, . . . , ϑ_{n,n0}]′ and ρ_{n0} = [ρ_{1,n0}, . . . , ρ_{n,n0}]′. These parameters can be chosen to yield a non-informative prior. Alternatively, asymptotic properties of the noise (Section 5.1.1) can be used to elicit the prior parameters. These shaping parameters can be seen as 'knobs' to tune the method to suit clinical practice.

The parameters, A and X, are modelled in the same way as those in PPCA, with the additional restriction of positivity. The prior distributions, (4.20) and (4.22), then become
\[
f(A|\upsilon) = t\mathcal{N}\left(0_{p,r},\ I_p \otimes \Upsilon^{-1},\ \left(\mathbb{R}^+\right)^{p\times r}\right), \qquad (5.17)
\]
\[
\Upsilon = \operatorname{diag}(\upsilon), \quad \upsilon = [\upsilon_1, \ldots, \upsilon_r],
\]
\[
f(\upsilon_j) = \mathcal{G}\left(\alpha_{j,0}, \beta_{j,0}\right), \quad j = 1, \ldots, r, \qquad (5.18)
\]
\[
f(X) = t\mathcal{N}\left(0_{n,r},\ I_n \otimes I_r,\ \left(\mathbb{R}^+\right)^{n\times r}\right). \qquad (5.19)
\]
The hyper-parameter, Υ, plays an important rôle in inference of rank via the ARD property (Remark 4.4). The shaping parameters, α_0 = [α_{1,0}, . . . , α_{r,0}]′ and β_0 = [β_{1,0}, . . . , β_{r,0}]′, are chosen to yield a non-informative prior.

Step 2: Partition the parameters

We partition the parameters into θ_1 = A, θ_2 = X, θ_3 = ω_p, θ_4 = ω_n and θ_5 = υ. The logarithm of the joint distribution is then
\[
\ln f(D,A,X,\omega_p,\omega_n,\upsilon) = \frac{n}{2}\ln|\Omega_p| + \frac{p}{2}\ln|\Omega_n| - \frac{1}{2}\operatorname{tr}\left(\Omega_p (D-AX')\,\Omega_n (D-AX')'\right)
\]
\[
+ \frac{p}{2}\ln|\Upsilon| - \frac{1}{2}\operatorname{tr}\left(A\Upsilon A'\right) - \frac{1}{2}\operatorname{tr}\left(XX'\right) + \sum_{i=1}^{r}\left((\alpha_0 - 1)\ln\upsilon_i - \beta_0\upsilon_i\right)
\]
\[
+ \sum_{i=1}^{n}\left((\vartheta_{n0} - 1)\ln\omega_{i,n} - \rho_{n0}\omega_{i,n}\right) + \sum_{i=1}^{p}\left((\vartheta_{p0} - 1)\ln\omega_{i,p} - \rho_{p0}\omega_{i,p}\right) + \gamma, \qquad (5.20)
\]
where γ denotes all those terms independent of the model parameters, θ.


Step 3: Write down the VB-marginals

The five VB-marginals, for A, X, ω_p, ω_n and υ, can be obtained from (5.20) by replacing, in each case, the terms involving the remaining parameters by their expectations. This step is easy, but yields repeated lengthy distributions, which will not be written down here.

Step 4: Identify standard forms

In the VB-marginals of Step 3, the Kronecker-product form of the prior covariance matrices for A (5.17) and X (5.19) has been lost. Therefore, we identify the VB-marginals of A and X in their vec forms. The conversion of the matrix Normal distribution into its vec form is given in Appendix A.2. We use the notation a = vec(A) and x = vec(X) (see Notational Conventions on Page XV).

The VB-marginals are recognized to have the following standard forms:
\[
f(A|D) = f(a|D) = t\mathcal{N}_A\left(\mu_A, \Sigma_A, \left(\mathbb{R}^+\right)^{pr}\right), \qquad (5.21)
\]
\[
f(X|D) = f(x|D) = t\mathcal{N}_X\left(\mu_X, \Sigma_X, \left(\mathbb{R}^+\right)^{nr}\right), \qquad (5.22)
\]
\[
f(\upsilon|D) = \prod_{j=1}^{r} \mathcal{G}_{\upsilon_j}\left(\alpha_j, \beta_j\right), \qquad (5.23)
\]
\[
f(\omega_p|D) = \prod_{i=1}^{p} \mathcal{G}_{\omega_{i,p}}\left(\vartheta_{i,p}, \rho_{i,p}\right), \qquad (5.24)
\]
\[
f(\omega_n|D) = \prod_{t=1}^{n} \mathcal{G}_{\omega_{t,n}}\left(\vartheta_{t,n}, \rho_{t,n}\right). \qquad (5.25)
\]

The associated shaping parameters are as follows:


\[
\mu_A = \Sigma_A\,\operatorname{vec}\left(\widehat{\Omega}_p D \widehat{\Omega}_n \widehat{X}\right), \qquad (5.26)
\]
\[
\Sigma_A = \left(\mathrm{E}_{f(X|D)}\!\left[X'\widehat{\Omega}_n X\right] \otimes \widehat{\Omega}_p + \widehat{\Upsilon} \otimes I_p\right)^{-1}, \qquad (5.27)
\]
\[
\mu_X = \Sigma_X\,\operatorname{vec}\left(\widehat{A}'\widehat{\Omega}_p D \widehat{\Omega}_n\right), \qquad (5.28)
\]
\[
\Sigma_X = \left(\mathrm{E}_{f(A|D)}\!\left[A'\widehat{\Omega}_p A\right] \otimes \widehat{\Omega}_n + I_r \otimes I_n\right)^{-1}, \qquad (5.29)
\]
\[
\alpha = \alpha_0 + \frac{1}{2}p\,1_{r,1}, \qquad
\beta = \beta_0 + \frac{1}{2}\operatorname{diag}^{-1}\left(\mathrm{E}_{f(A|D)}\left[A'A\right]\right),
\]
\[
\vartheta_p = \vartheta_{p0} + \frac{n}{2}1_{p,1},
\]
\[
\rho_p = \rho_{p0} + \frac{1}{2}\operatorname{diag}^{-1}\Big(D\widehat{\Omega}_n D' - \widehat{A}\widehat{X}'\widehat{\Omega}_n D' - D\widehat{\Omega}_n\widehat{X}\widehat{A}' + \mathrm{E}_{f(A|D)}\!\left[A\,\mathrm{E}_{f(X|D)}\!\left[X'\widehat{\Omega}_n X\right]A'\right]\Big), \qquad (5.30)
\]
\[
\vartheta_n = \vartheta_{n0} + \frac{p}{2}1_{n,1},
\]
\[
\rho_n = \rho_{n0} + \frac{1}{2}\operatorname{diag}^{-1}\Big(D'\widehat{\Omega}_p D - D'\widehat{\Omega}_p\widehat{A}\widehat{X}' - \widehat{X}\widehat{A}'\widehat{\Omega}_p D + \mathrm{E}_{f(X|D)}\!\left[X\,\mathrm{E}_{f(A|D)}\!\left[A'\widehat{\Omega}_p A\right]X'\right]\Big). \qquad (5.31)
\]
Note that Σ_A (5.27) is used in (5.26), and Σ_X (5.29) is used in (5.28), for conciseness.

Step 5: Formulate necessary VB-moments

The necessary VB-moments of the Gamma distributions, (5.23)–(5.25), are
\[
\widehat{\upsilon} = \alpha \circ \beta^{-1}, \qquad (5.32)
\]
\[
\widehat{\omega}_p = \vartheta_p \circ \rho_p^{-1}, \qquad (5.33)
\]
\[
\widehat{\omega}_n = \vartheta_n \circ \rho_n^{-1}. \qquad (5.34)
\]
Hence, the moments of the associated full matrices are
\[
\widehat{\Upsilon} = \operatorname{diag}(\widehat{\upsilon}), \qquad \widehat{\Omega}_p = \operatorname{diag}(\widehat{\omega}_p), \qquad \widehat{\Omega}_n = \operatorname{diag}(\widehat{\omega}_n).
\]
The necessary moments of the truncated multivariate Normal distributions, (5.21) and (5.22), are

Page 115: The Variational Bayes Method in Signal Processing

98 5 Functional Analysis of Medical Image Sequences

\[
\widehat{A},\quad \widehat{X},\quad
\mathrm{E}_{f(X|D)}\!\left[X'\widehat{\Omega}_n X\right],\quad
\mathrm{E}_{f(A|D)}\!\left[A'\widehat{\Omega}_p A\right],\quad
\mathrm{E}_{f(A|D)}\!\left[A\,\mathrm{E}_{f(X|D)}\!\left[X'\widehat{\Omega}_n X\right]A'\right],\quad
\mathrm{E}_{f(X|D)}\!\left[X\,\mathrm{E}_{f(A|D)}\!\left[A'\widehat{\Omega}_p A\right]X'\right]. \qquad (5.35)
\]
These are not available in closed form. Therefore, we do not evaluate (5.35) with respect to (5.21) and (5.22), but rather with respect to the following independence approximations:
\[
f(a|D) \approx t\mathcal{N}_A\left(\mu_A, \operatorname{diag}(\sigma_A), \left(\mathbb{R}^+\right)^{pr}\right), \qquad (5.36)
\]
\[
f(x|D) \approx t\mathcal{N}_X\left(\mu_X, \operatorname{diag}(\sigma_X), \left(\mathbb{R}^+\right)^{nr}\right), \qquad (5.37)
\]
where
\[
\sigma_A = \operatorname{diag}^{-1}\left(\Sigma_A\right), \qquad \sigma_X = \operatorname{diag}^{-1}\left(\Sigma_X\right).
\]
Recall that the operator, diag^{-1}(Σ_A) = σ_A, extracts the diagonal elements of the matrix argument into a vector (see Notational Conventions on Page XV). This choice neglects all correlation between the pixels in A, and between the weights in X. Hence, for example, (5.36) can be written as the product of scalar truncated Normal distributions:
\[
f(a|D) \approx \prod_{k=1}^{pr} t\mathcal{N}_{a_k}\left(\mu_{k,A}, \sigma_{k,A}, \mathbb{R}^+\right),
\]
with moments
\[
\widehat{a} = \mu_A - \sigma_A^{-\frac{1}{2}} \circ \varphi\left(\mu_A, \sigma_A\right), \qquad (5.38)
\]
\[
\widehat{a \circ a} = \sigma_A + \mu_A \circ \widehat{a} - \sigma_A^{-\frac{1}{2}} \circ \kappa\left(\mu_A, \sigma_A\right),
\]
where the functions ϕ(·) and κ(·) are defined in Appendix A.4, and '∘' is the Hadamard product. Using (5.38), the necessary matrix moments listed in (5.35) can all be constructed from the following identities:
\[
\widehat{A} = \operatorname{vect}\left(\widehat{a}, p\right), \qquad (5.39)
\]
\[
\mathrm{E}_{f(A|D)}\left[A'ZA\right] = \operatorname{diag}\left(\operatorname{vect}\left(\widehat{a \circ a}, p\right)'\operatorname{diag}^{-1}(Z)\right) + \widehat{A}'Z\widehat{A}. \qquad (5.40)
\]
Here, Z denotes any of the constant matrices arising in (5.35). Exactly the same identities hold for (5.37).
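The scalar moments assembled in (5.38) are those of Normal distributions truncated to R+. A sketch (not from the book) using scipy's truncated Normal, as an alternative to the ϕ(·), κ(·) functions of Appendix A.4; the example arguments are arbitrary.

```python
import numpy as np
from scipy.stats import truncnorm

def positive_truncnorm_moments(mu, sigma2):
    """Element-wise first and second non-central moments of N(mu_k, sigma2_k)
    truncated to R+, i.e. the quantities assembled in (5.38)."""
    mu = np.asarray(mu, dtype=float)
    sd = np.sqrt(np.asarray(sigma2, dtype=float))
    a = (0.0 - mu) / sd                  # standardized lower bound
    b = np.full_like(mu, np.inf)         # no upper bound
    dist = truncnorm(a, b, loc=mu, scale=sd)
    first = dist.mean()                  # element-wise a_hat
    second = dist.var() + first ** 2     # element-wise (a o a)_hat
    return first, second

print(positive_truncnorm_moments([-0.5, 0.0, 2.0], [1.0, 1.0, 0.25]))
```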

Step 6: Reduce the VB-equations

No simplification of the VB-equations was found in this case. Hence, the full set of VB-equations is (5.26)–(5.31), (5.32)–(5.34), and the VB-moments (5.35), where the latter are evaluated via (5.38)–(5.40).


Step 7: Run the IVB algorithm

The IVB algorithm is run on the full set of VB-equations from the previous step. A and X (5.39) can be initialized with any matrices of appropriate dimensions with positive elements. One such choice is given by the results of classical inference (Section 5.1.1). Simulation studies suggest that the resulting VB-approximation is insensitive to this initialization. However, special care should be taken in choosing the initial values of Ω_p (5.33) and Ω_n (5.34). Various choices of initialization were tested in simulation, as we will see in Section 5.5. The most reliable results were achieved using the results of classical inference (Section 5.1.1) as the initializers. Overall, initialization of A, X, Ω_p and Ω_n using the classical inference results yields significant computational savings [105].

Remark 5.1 (Automatic Rank Determination (ARD) Property). The shaping parameters, α and β, fulfil the same rôle as those in PPCA, i.e. inference of the number of relevant factors (Remark 4.4).

Step 8: Report the VB-marginals

The VB-marginals are given by (5.21)–(5.25). Their shaping parameters and moments are inferred using the IVB algorithm in Step 7. Typically, in medical applications, only the expected values, Â and X̂, of the factor images and the factor curves respectively, are of interest. From Remark 5.1, values of r̂ can also be reported.

5.4 The VB Method for FAMIS: Alternative Priors

The choice of covariance matrix, I_p ⊗ Υ^{-1} in (5.17) and I_n ⊗ I_r in (5.19), is intuitively appealing. It is a simple choice which imposes the same prior independently on each pixel of the factor image, a_j, and the same prior on all weights in X. However, evaluation of the posterior distributions, (5.21) and (5.22), under this prior choice is computationally expensive, as we will explain in the next paragraph. Therefore, we seek another functional form for the priors, for which the VB-approximation yields a computationally simpler posterior (Remark 1.1). This will be achieved using properties of the Kronecker product [106].

Consider a typical Kronecker product, e.g. C ⊗ F, arising in the matrix Normal distribution. Here, C ∈ R^{p×p} and F ∈ R^{n×n} are both invertible. The Kronecker product is a computationally advantageous form since the following identity holds:
\[
(C \otimes F)^{-1} = C^{-1} \otimes F^{-1}. \qquad (5.41)
\]
Hence, the inverse of a (potentially large) pn × pn matrix, arising, for instance, in the FAMIS observation model (5.12), can be replaced by inversions of two smaller matrices of dimensions p × p and n × n respectively.

This computational advantage is lost in (5.27) and (5.29), since they require inversion of a sum of two Kronecker products, which cannot be reduced. Specifically, the posterior covariance structure for A (5.27) is


ΣA = ( E (·) ⊗ Ωp + Υ ⊗ Ip )−1 .   (5.42)

Hence, identity (5.41) cannot be used and the full pn × pn inversion must be evaluated. Note, however, that the second term in the sum above is the precision matrix of the prior (5.17), which can be chosen by the designer. Hence, a computationally advantageous form of the posterior can be restored if we replace Ip in (5.42) by Ωp. This replacement corresponds to replacement of Ip ⊗ Υ−1 in (5.17) by Ωp−1 ⊗ Υ−1. Under this choice, the posterior covariance (5.42) is as follows:

( E (·) ⊗ Ωp + Υ ⊗ Ωp )−1 = ( ( E (·) + Υ ) ⊗ Ωp )−1 = ( E (·) + Υ )−1 ⊗ Ωp−1.

This is exactly the multivariate case of the alternative prior which we adopted in the scalar decomposition (Section 1.3), in order to facilitate the VB-approximation. In both cases (Section 1.3.2 and above), computational savings are achieved if the prior precision parameter of m or µA—i.e. φ (1.9) or Ip ⊗ Υ (5.17), respectively—is replaced by a precision parameter proportional to the precision parameter of the observation model. Hence, we replace φ by γω (1.15) in the scalar case of Section 1.3.2, and Ip by Ωp above.
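The computational point can be checked numerically. In the sketch below, E, Υ and Ωp are random diagonal stand-ins with toy dimensions (not values from the study): the sum in (5.42) requires a full pr × pr inversion, whereas sharing Ωp collapses the sum to a single Kronecker product, so that (5.41) applies and only an r × r and a p × p inverse are needed.

```python
import numpy as np

p, r = 200, 4                               # toy sizes; the clinical study uses p = 525
rng = np.random.default_rng(0)
E = np.diag(rng.uniform(1.0, 2.0, r))       # stand-in for E(.) (r x r)
Ups = np.diag(rng.uniform(1.0, 2.0, r))     # stand-in for Upsilon (r x r)
Om_p = np.diag(rng.uniform(1.0, 2.0, p))    # stand-in for Omega_p (p x p)

# Original prior (5.17): the sum of two Kronecker products in (5.42) does not factorize.
Sigma_full = np.linalg.inv(np.kron(E, Om_p) + np.kron(Ups, np.eye(p)))

# Alternative prior: the shared factor Omega_p restores identity (5.41).
Sigma_alt_full = np.linalg.inv(np.kron(E + Ups, Om_p))                 # direct pr x pr inverse
Sigma_alt_kron = np.kron(np.linalg.inv(E + Ups), np.linalg.inv(Om_p))  # two small inverses

print(np.allclose(Sigma_alt_full, Sigma_alt_kron))   # True: (5.41) applies
print(Sigma_full.shape)                              # (800, 800): no such shortcut here
```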

The same rationale can be used to adapt the prior covariance structure of X from In ⊗ Ir in (5.19) to Ωn−1 ⊗ Ir.

We now re-apply the VB method using these analytically convenient priors.

Step 1: Choose a Bayesian Model

Given the considerations above, we replace the priors, (5.17) and (5.19), by the following:

f (A|υ, Ωp) = tN ( 0p,r, Ωp−1 ⊗ Υ−1, (R+)^{p×r} ),   (5.43)
f (X|Ωn) = tN ( 0n,r, Ωn−1 ⊗ Ir, (R+)^{n×r} ),   (5.44)

where Ωp and Ωn are parameters now common to the observation model (5.12) andthe priors. The rest of the model, i.e. (5.16) and (5.18), is unchanged.

Step 2: Partition the parameters

The same partitioning is chosen as in Section 5.3. The logarithm of the joint distribution is now


ln f (D, A, X, ωp, ωn, υ) =
  (n + r)/2 ln |Ωp| + (p + r)/2 ln |Ωn| − 1/2 tr ( Ωp (D − AX′) Ωn (D − AX′)′ )
  + p/2 ln |Υ| − 1/2 tr (Ωp A Υ A′) − 1/2 tr (X′ Ωn X) + ∑_{j=1}^{r} ( (αj,0 − 1) ln υj − βj,0 υj )
  + ∑_{t=1}^{n} ( (ϑt,n0 − 1) ln ωt,n − ρt,n0 ωt,n ) + ∑_{i=1}^{p} ( (ϑi,p0 − 1) ln ωi,p − ρi,p0 ωi,p ) + γ.   (5.45)

Step 3: Write down the VB-marginals

Once again, this step is easy, yielding five distributions, which we will omit for brevity. Their standard forms are identified next.

Step 4: Identify standard forms

The posterior distributions, (5.23)–(5.25), are unchanged, while (5.21) and (5.22) are now in the form of the truncated matrix Normal distribution (Appendix A.2):

f (A|D) = tNA ( MA, ΦAp−1 ⊗ ΦAr−1, (R+)^{p×r} ),   (5.46)
f (X|D) = tNX ( MX, ΦXn−1 ⊗ ΦXr−1, (R+)^{n×r} ).   (5.47)

The shaping parameters of the VB-marginals are now as follows:

MA = D Ωn X′ ΦAr−1,   (5.48)
ΦAr = Ef(X|D) [X′ΩnX] + diag (υ) ,
ΦAp = Ωp,
MX = ΦXr−1 A′ Ωp D,
ΦXr = Ef(A|D) [A′ΩpA] + Ir,
ΦXn = Ωn,
β = β0 + (1/2) diag−1 ( Ef(A|D) [A′ΩpA] ) ,
ρp = ρp0 + (1/2) diag−1 ( D Ωn D′ − A X′ Ωn D′ − D Ωn X A′ + Ef(A|D) [ A Ef(X|D) [X′ΩnX + Υ] A′ ] ) ,
ρn = ρn0 + (1/2) diag−1 ( D′ Ωp D − D′ Ωp A X′ − X A′ Ωp D + Ef(X|D) [ X Ef(A|D) [A′ΩpA + Ir] X′ ] ) .   (5.49)


Step 5: Formulate necessary VB-moments

The moments of (5.23)–(5.25)—i.e. (5.32)–(5.34)—are unchanged. The moments of the truncated Normal distributions, (5.21) and (5.22), must be evaluated via independence approximations of the kind in (5.36) and (5.37).

Steps 6-8:

As in Section 5.3.

5.5 Analysis of Clinical Data Using the FAMIS Model

In this study, a radiotracer has been administered to the patient to highlight the kidneys and bladder in a sequence of scintigraphic images. This scintigraphic study was already considered in Section 4.5 of the previous Chapter. Our main aim here is to study the performance of the VB-approximation in the recovery of factors (Section 5.1) from this scintigraphic image sequence, via the FAMIS model (Section 4.5). We wish to explore the features of the VB-approximation that distinguish it from classical inference (Section 5.1.1). These are as follows:

• The precision matrices of the FAMIS observation model (5.12), Ωp and Ωn, are inferred from the data in tandem with the factors. This unifies the first two steps of classical inference (Section 5.1.1).
• The oblique analysis step of classical inference (Section 5.1.1) is eliminated, since expected values (5.38) of truncated Normal distributions, (5.21) and (5.22), are guaranteed to be positive.
• Inference of rank is achieved via the ARD property of the implied IVB algorithm (Remark 5.1).

Fig. 5.3. Scintigraphic medical image sequence of the human renal system, showing every 4th image from a sequence of n = 120 images. These are displayed row-wise, starting in the upper-left corner.

The scintigraphic image sequence is displayed in Fig. 5.3. The sequence consists of n = 120 images, each of size 64 × 64, with a selected region-of-interest of size


p = 525 pixels highlighting the kidneys. Note that we adopt the alternative set of priors, outlined in Section 5.4, because of their analytical convenience. Recall that this choice also constitutes a non-informative prior. Our experience indicates that the effect on the posterior inference of either set of priors (Sections 5.3 and 5.4 respectively) is negligible.

In the sequel, we will compare the factor inferences emerging from four cases of the VB-approximation, corresponding to four noise modelling strategies. These refer to four choices for the precision matrices of the FAMIS observation model (5.12):

Case 1: precision matrices, Ωp and Ωn, were fixed a priori at Ωp = Ip and Ωn = In. This choice will be known as the isotropic noise assumption.
Case 2: Ωp and Ωn were again fixed a priori, at Ωp = diag (D 1n,1)−1 and Ωn = diag (D′ 1p,1)−1. This choice is consistent with the correspondence analysis (5.5) pre-processing step of classical inference.
Case 3: Ωp and Ωn were inferred using the VB method. Their initial values in the IVB algorithm were chosen as in Case 1; i.e. the isotropic choice, Ωp[1] = Ip and Ωn[1] = In.
Case 4: Ωp and Ωn were again inferred using the VB method. Their initial values in the IVB algorithm were chosen as in Case 2; i.e. the correspondence analysis choice, Ωp[1] = diag (D 1n,1)−1 and Ωn[1] = diag (D′ 1p,1)−1.

In Cases 3 and 4, where the precision matrices were inferred from the data, the shaping parameters of the prior distributions (5.16) were chosen as non-committal; i.e. ϑp0 = ρp0 = 10−10 1p,1, ϑn0 = ρn0 = 10−10 1n,1.
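For concreteness, the four noise modelling choices can be set up as in the sketch below. The data matrix D is synthetic (Poisson counts of arbitrary intensity), standing in for the p × n region-of-interest data; only the construction of the diagonal precision matrices and the non-committal shaping parameters follows the text.

```python
import numpy as np

p, n = 525, 120                                  # region-of-interest pixels x images
rng = np.random.default_rng(1)
D = rng.poisson(lam=5.0, size=(p, n)).astype(float) + 1e-6   # synthetic counts, kept positive

# Cases 1 and 3: isotropic noise assumption (Case 3 uses these only as IVB initializers).
Omega_p_iso = np.eye(p)
Omega_n_iso = np.eye(n)

# Cases 2 and 4: correspondence-analysis scaling (Case 4 uses these only as IVB initializers).
Omega_p_ca = np.diag(1.0 / (D @ np.ones(n)))     # diag(D 1_{n,1})^{-1}
Omega_n_ca = np.diag(1.0 / (D.T @ np.ones(p)))   # diag(D' 1_{p,1})^{-1}

# Non-committal Gamma shaping parameters for Cases 3 and 4.
theta_p0 = 1e-10 * np.ones(p)
rho_p0 = 1e-10 * np.ones(p)
theta_n0 = 1e-10 * np.ones(n)
rho_n0 = 1e-10 * np.ones(n)
```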

The results of these experiments are displayed in Fig. 5.4. The known-precision cases (Cases 1 and 2) are displayed in the first column, while the unknown-precision cases (Cases 3 and 4) are displayed in the second column. We can evaluate the factor inferences with respect to template factor curves for healthy and pathological kidneys [107], as displayed in Fig. 5.5. For this particular dataset, therefore, we can conclude from Case 4 (Fig. 5.4, bottom-right) that the inferred factors correspond to the following physiological organs and liquids, starting from the top: (i) pathological case of the right pelvis; (ii) a linear combination of the left pelvis and the right parenchyma; (iii) the left parenchyma; and (iv) arterial blood.

Turning our attention to the VB-approximation in Fig. 5.4, we conclude the following:

The number of relevant factors, r, estimated using the ARD property (Remark 5.1) associated with the FAMIS model (5.12), is r = 3 in Case 1, and r = 4 in all remaining cases (Fig. 5.4). This is much smaller than the rank estimated using the PPCA model (Table 4.4, Section 4.5). Furthermore, it corresponds to the value expected by the medical expert, as expressed by the templates in Fig. 5.5. Note that this reduction from the PPCA result has been achieved for all considered noise modelling strategies. Hence, this result is a consequence of the regularization of the inference achieved by the positivity constraints (5.4). This will be discussed later (Remark 5.2).


[Figure 5.4 appears here: a 2 × 2 arrangement of panels (Cases 1–4). Columns: a priori known Ωp and Ωn (left); inferred Ωp and Ωn, with the IVB algorithm initialized via the stated noise modelling strategy (right). Rows (noise modelling strategy): isotropic noise assumption (top); correspondence analysis (bottom). Each panel shows four factor images and the corresponding factor curves (intensity versus time [images]).]

Fig. 5.4. Posterior expected factor images and factor curves for four noise models.


[Figure 5.5 appears here: two panels, healthy kidney (left) and pathological kidney (right), each plotting counts against time [min] for the parenchyma, pelvis and arterial blood curves.]

Fig. 5.5. Typical activity curves in a healthy kidney (left) and a pathological kidney (right). In this Figure, factor curves are plotted against the actual time of acquisition of the images and cannot be directly compared to the discrete-time indices in Fig. 5.4. It is the overall shape of the curve that is important in clinical diagnosis.

The precision matrices, Ωp and Ωn, are important parameters in the analysis. If we compare the results achieved using known precision matrices (left column of Fig. 5.4) with the cases where we infer the precision matrices via (5.33) and (5.34) (right column of Fig. 5.4), then we can conclude the following:

• The expected values, Ωp and Ωn, of the posterior distributions of the precision matrices, (5.24) and (5.25) respectively, are similar for both cases of initialization (Cases 3 and 4), and are, in fact, close to the estimates obtained using correspondence analysis (5.5) from classical inference. This is displayed in Fig. 5.6. This finding supports the assumption underlying the classical inference, namely that (5.5) is the optimal pre-processing step for scintigraphic data [95].

• In Cases 1 and 2—i.e. those with known precision matrices—the inferred factor curves have sharp peaks at times t = 25 and t = 37, respectively. This behaviour is not physiologically possible. These peaks are significantly reduced in those cases where precision is inferred; i.e. in Cases 3 and 4. Note that the precision estimates, ωt,n, at times t = 25 and t = 37, are significantly lower than those at other times (Fig. 5.6). Thus, in effect, images observed at times t = 25 and t = 37 will be automatically suppressed in the inference procedure, since they will receive lower weight in (5.48).

• The posterior expected values, A and X, are sensitive to the IVB initialization of the precision, Ωp[1] and Ωn[1] (Cases 3 and 4). Note that the inferred factor images in Case 3 (Fig. 5.4, top-right) are similar to those in Case 1 (Fig. 5.4, top-left). Specifically, the image of arterial blood (4th image in Cases 2 and 4) is not present in either Case 1 or 3. The similarity between the factor images in Cases 2 and 4 (Fig. 5.4, bottom row) is also obvious. This suggests that there are many local minima of KLDVB, and that the initial values of the precision, ωp[1] and ωn[1], determine which one will be reached by the IVB algorithm. This feature of the VB-approximation has already been studied in simulation for the PPCA model, in Section 4.4.2.

[Figure 5.6 appears here: three rows of paired plots of ωp (against pixels, left) and ωn (against time, right): correspondence analysis; converged expected posterior values (isotropic noise initialization); converged expected posterior values (initialization using correspondence analysis).]

Fig. 5.6. Comparison of the posterior expected value of the precision matrices, ωp (left) and ωn (right). In the first row, the result of correspondence analysis (classical inference) is shown. In the second and third rows, the VB-approximations are shown for two different choices of IVB initialization.

It is difficult to compare these results to classical inference, since the latter does not have an Automatic Rank Determination (ARD) property (Remark 5.1), nor does it infer precision matrices. It also requires many tuning knobs. Hence, for an experienced expert, it would be possible to produce results similar to those presented in Fig. 5.4.

Remark 5.2 (Reduction of r via positivity constraints). In Section 4.5, we used the same dataset as that in Fig. 5.3 to infer the rank of the mean matrix, M (rank unknown), in the matrix decomposition (4.1). All of the estimates of rank displayed in Table 4.4 are significantly higher than the number of underlying physiological objects expected by the medical expert. In contrast, the estimates of rank provided by the FAMIS model (Fig. 5.4) were, in all four cases, in agreement with medical opinion. This result can be informally explained by considering how each model is matched to the actual medical data. The real scintigraphic data are composed of three


elements:

D = M + N + E.

M and E are the elements modelled by PPCA or FAMIS. In the case of PPCA, M is unconstrained except for an unknown rank restriction, while M observes positivity constraints in the FAMIS model (5.12). E is the Normally-distributed white noise (4.4). Informally, N is an unmodelled matrix of non-Gaussian noise and physiological residuals. Inevitably, then, the PPCA and FAMIS models yield estimates of the modelled parameters, M and E, that are corrupted by the residuals, N, as follows:

M̂ = M + NM,
Ê = E + NE.

NM and NE are method-dependent parts of the residuals, such that N = NM + NE. If the criteria of separation are (i) rank-restriction of M with unknown rank, r, and (ii) Gaussianity of the noise, E, then only a small part of N fulfills (ii), but a large part of N fulfills (i) as M has unrestricted rank. Consequently, this rank, r, is significantly overestimated. However, if we now impose a third constraint, namely (iii) positivity of the signal M (5.4), we can expect that only a small part of N fulfills (iii), ‘pushing’ the larger part, NE, of N into the noise estimate, Ê.

5.6 Conclusion

In this Chapter, we have applied the VB method to the problem of functional analysis of medical image sequences. Mathematical modelling for these medical image sequences was reviewed. Bayesian inference of the resulting model was found to be intractable. Therefore, we introduced the FAMIS model as its suitable approximation. Exact Bayesian inference remained intractable, but the VB method yielded tractable approximate posterior inference of the model parameters. Our experience with the VB method in this context can be summarized as follows:

Choice of priors: we considered two different choices of non-informative priors. These were (i) isotropic i.i.d. distributions (Section 5.3), and (ii) priors whose parameters were shared with the observation model (Section 5.4). We designed the latter (alternative) prior specifically to facilitate a major simplification in the resulting VB inference. This yielded significant computational savings without any significant influence on the resulting inferences.

IVB initialization: the implied VB-equations do not have a closed-form solution. Hence, the IVB algorithm must be used. Initialization of the IVB algorithm is an important issue since it determines (i) the speed of convergence of the algorithm, and (ii) which of the (non-unique) local minima of KLDVB are found. Initialization of the IVB algorithm via correspondence analysis from classical inference provides a reliable choice, conferring satisfactory performance in terms of speed and accuracy.


The performance of the VB-approximation for the FAMIS model was tested in the context of a real medical image sequence (a renal scintigraphic study). The approximation provides satisfactory inference of the underlying biological processes. Contrary to classical inference—which is based on the certainty equivalence approach—the method was able to infer (i) the number of underlying physiological factors, and (ii) uncertainty bounds for these factors.


6

On-line Inference of Time-Invariant Parameters

On-line inference is exactly that: the observer of the data is ‘on’ the ‘line’ that is generating the data, and has access to the data items (observations), dt, as they are generated. In this sense, the data-generating system is still in operation while learning is taking place [108]. This contrasts with the batch mode of data acquisition, D, examined in Chapters 4 and 5. In cases where a large archive of data already exists, then on-line learning refers to the update of our knowledge in light of sequential retrieval of data items from the database.

Any useful learning or decision algorithm must exploit the opportunities presented by this setting. In other words, inferences should be updated in response to each newly observed data item. In the Bayesian setting, we have already noted the essential technology for achieving this, namely repetitive Bayesian updates (2.14), generating f(θ|Dt) from f(θ|Dt−1) as each new observation, dt, arrives. The ambition to implement such updates at all t will probably be frustrated, for two reasons: (i) we have a strictly limited time resource in which to complete the update, namely ∆t, the sampling period between data items (we have uniformly sampled discrete-time data in mind); and (ii) the number of degrees-of-freedom in the inference, f(θ|Dt), is generally of the order of the number of degrees-of-freedom in the aggregated data record, Dt (2.1), and is growing without bound as t → ∞. The requirement to update f(θ|Dt), and to extract moments, marginals and/or other decisions from it, will eventually ‘break the bank’ of ∆t!

Only a strictly limited class of observation models is amenable to tractable Bayesian on-line inference, and we will review this class (the Dynamic Exponential Family) in Section 6.2.1. Our main aim in this Chapter is to examine how the Variational Bayes (VB) approximation can greatly increase the possibilities for successful implementation of Bayesian on-line learning. We will take care to define these contexts as generally as possible (via the three so-called ‘scenarios’ of Section 6.3), but also to give worked examples of each context. In this Chapter, we assume stationary parametric modelling via a time-invariant vector of parameters, θ. The time-variant case leads to distinct application scenarios for VB, and will be handled in the next Chapter.


6.1 Recursive Inference

On-line inference and learning algorithms are essential whenever the observer is required to react or interact in light of the sequentially generated observations, dt. Examples can be found whenever time-critical decisions must be made, such as in the following areas:

Prediction: knowledge about future outputs should be improved in light of each new observation [53, 109]; a classical application context is econometrics [56].

Adaptive control: the observer wishes to ‘close the loop’, and design the best control input in the next sampling time [52, 110]; in predictive control [111], this task is accomplished via effective on-line prediction.

Fault detection: possible faults in the observed system must be recognized as soon as possible [112].

Signal processing: adaptive filters must be designed to suppress noise and reconstruct signals in real time [113–115].

All the above tasks must satisfy the following constraints:

(i) the inference algorithm must use only the data record, Dt, available at the current time; i.e. the algorithm must be causal.

(ii) the computational complexity of the inference algorithm should not grow with time.

These two requirements imply that the data, Dt, must be represented within a finite-dimensional ‘knowledge base’, as t → ∞. This knowledge base may also summarize side information, prior and expert knowledge, etc. In the Bayesian context, this extra conditioning knowledge, beyond Dt, was represented by Jeffreys' notation, I (Section 2.2.1), and the knowledge base was the posterior distribution, f(θ|Dt, I). In the general setting, the mapping from Dt to the knowledge base should be independent of t, and can be interpreted as a data compression process. This compression of all available data has been formalized by two concepts in the literature: (a) the state-space approach [108, 116], and (b) sufficient statistics [42, 117]. In both cases, the inference task (parameter inference in our case) must address the following two sub-tasks (in no particular order):

(a) infer the parameters from the compressed data history and the current data record (observation), dt;

(b) assimilate the current observation into the compressed data history.

An algorithm that performs these two tasks is known as a recursive algorithm [108].

6.2 Bayesian Recursive Inference

In this Section, we review models for which exact, computationally tractable, Bayesian recursive inference algorithms exist. The defining equation of Bayesian on-line


inference is (2.14): the posterior distribution at time t − 1, i.e. f(θ|Dt−1), is updated to the posterior distribution at time t, i.e. f(θ|Dt), via the observation model, f (dt|θ, Dt−1), at time t. Invoking the principles for recursive inference above, the following constraints must apply:

1. The observation model must be conditioned on only a finite number, ∂, of past observations. This follows from (2.14). These past observations enter the observation model via an m-dimensional time-invariant mapping, ψ, of the following kind:

ψt = ψ (dt−1, . . . , dt−∂) , ψ ∈ Rm, (6.1)

with m < ∞ and ∂ < ∞. Under this condition,

f (dt|θ,Dt−1) = f (dt|θ, ψt) , (6.2)

and ψt is known as the regressor at time t. An observation model of this type, with regression onto the data (observation) history, is called an autoregressive model. In general, ψt (6.1) can only be evaluated for t > ∂. Therefore, the recursive inference algorithm typically starts at t = ∂ + 1, allowing the regressor, ψ∂+1 = ψ (D∂), to be assembled in the first step. The recursive update (2.14) now has the following form:

f (θ|Dt) ∝ f (dt|θ, ψt) f (θ|Dt−1) , t > ∂. (6.3)

2. The knowledge base, f(·), must be finite-dimensional, ∀t. In the Bayesian paradigm, the knowledge base is represented by f(θ|Dt). The requirement for a time-invariant, finite-dimensional mapping from Dt to the knowledge base is achieved by the principle of conjugacy (Section 2.2.3.1). Therefore, we restate the conjugacy relation (2.12) in the on-line context, and we impose (6.2), as follows:

f (θ|st) ∝ f (dt|θ, ψt) f (θ|st−1) , t > ∂. (6.4)

The distribution, f (θ|st), is said to be conjugate to the observation model, f (dt|θ, ψt). The notation, f (·|st), denotes a fixed functional form, updated only via the finite-dimensional sufficient statistics, st. The posterior distribution is therefore uniquely determined by st, and the functional recursion (6.3) can be replaced by the following algebraic recursion in st:

st = s (st−1, ψt, dt) , t > ∂. (6.5)

A recursive algorithm has been achieved via compression of the complete current data record, Dt, into a finite-dimensional representation, st. The algebraic recursion (6.5) achieves Bayesian inference of θ, ∀t, as illustrated in Fig. 6.1. Of course, the mappings ψ (6.1) and s (6.5) must be implementable within the sampling period, ∆t, if the recursive algorithm is to be operated in real time. If the observation model does not have a conjugate distribution on parameters, the computational complexity of full Bayesian inference is condemned to grow with time t. Hence, our main aim is to design Bayesian recursive algorithms by verifying—or restoring—conjugacy.
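Schematically, (6.1) and (6.5) prescribe a fixed-memory loop of the following form. The regressor map psi and the statistics update s_update are problem-specific (a concrete instance follows in Section 6.2.2); the running-mean usage at the bottom is purely illustrative and not an example from the text.

```python
def recursive_inference(data, psi, s_update, s0, depth):
    """Generic causal, fixed-memory recursion: s_t = s(s_{t-1}, psi_t, d_t), cf. (6.1) and (6.5)."""
    s = s0
    for t in range(depth, len(data)):
        psi_t = psi(data[t - depth:t])    # regressor built from the last 'depth' observations
        s = s_update(s, psi_t, data[t])   # algebraic update of the sufficient statistics
    return s

# Illustrative use: accumulate a running sum and count (no regressor, depth = 0).
total, count = recursive_inference(
    [1.0, 2.0, 3.0],
    psi=lambda window: None,
    s_update=lambda s, psi_t, d_t: (s[0] + d_t, s[1] + 1),
    s0=(0.0, 0),
    depth=0,
)
print(total / count)   # 2.0
```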


Conjugacy must hold even in the first step of the recursion, i.e. at t = ∂ + 1 (6.3). Therefore, the prior, f (θ|s∂) ≡ f (θ|s0), must also be conjugate to the observation model (6.2).

[Figure 6.1 appears here: two block diagrams of the Bayes' rule operator B, mapping f(θ|Dt−1) and f(dt|θ, Dt−1) to f(θ|Dt) (upper), and mapping st−1, dt and ψt to st (lower).]

Fig. 6.1. Illustration of Bayes' rule in the recursive on-line scenario with time-invariant parameterization. Upper diagram: update of posterior distribution via (6.3). Lower diagram: update of sufficient statistics (6.5).

6.2.1 The Dynamic Exponential Family (DEF)

A conjugate distribution exists for every observation model belonging to the exponential family [117]:

f (dt|θ) = exp ( q (θ)′ u (dt) − ζd (θ) ) .   (6.6)

This result is not directly useful in recursive on-line inference, since autoregressive models (i.e. models with dependence on a data regressor, ψt (6.2)) are not included. Therefore, in the following proposition, we extend conjugacy to autoregressive models [52].

Proposition 6.1 (Conjugacy for the Dynamic Exponential Family (DEF)). Let the autoregressive observation model (6.2) be of the following form:

f (dt|θ, ψt) = exp ( q (θ)′ u (dt, ψt) − ζdt (θ) ) .   (6.7)

Here, q (θ) and u (dt, ψt) are η-dimensional vector functions, and

exp ζdt (θ) = ∫D exp ( q (θ)′ u (dt, ψt) ) d dt,   (6.8)

where ζdt (θ) is a normalizing constant, which must be independent of ψt. Then the distribution

f (θ|st) = exp ( q (θ)′ vt − νt ζdt (θ) − ζθ (st) ) ,   (6.9)

with vt ∈ Rη, νt ∈ R+, and sufficient statistics st = [vt′, νt]′ ∈ Rη+1, is conjugate to the observation model (6.7). Here,

exp ζθ (st) = ∫Θ∗ exp ( q (θ)′ vt − νt ζdt (θ) ) dθ,


where ζθ (st) is the normalizing constant for f (θ|st).

Inserting (6.9) at time t − 1 and (6.7) into Bayes' rule (6.4), the following update

equations emerge for the sufficient statistics of the posterior distribution:

νt = νt−1 + 1, (6.10)

vt = vt−1 + u (dt, ψt) . (6.11)
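As a minimal concrete instance of Proposition 6.1 (with an empty regressor, so that (6.7) reduces to (6.6)), consider i.i.d. Bernoulli observations, f(dt|θ) = θ^dt (1 − θ)^(1−dt). Here q(θ) = ln(θ/(1 − θ)), u(dt) = dt and ζd(θ) = −ln(1 − θ), so the conjugate distribution (6.9) is a Beta distribution and the updates (6.10)–(6.11) become simple counters. This illustration is ours, not the book's; the book's worked instance is the AR model of Section 6.2.2.

```python
def bernoulli_conjugate_update(v, nu, d_t):
    """One step of (6.10)-(6.11) for a Bernoulli observation model: u(d_t) = d_t."""
    return v + d_t, nu + 1

# Starting from (v0, nu0) = (0, 0), i.e. a uniform Beta(1, 1) prior, the posterior
# after processing the data below is Beta(v + 1, nu - v + 1) = Beta(4, 3).
v, nu = 0, 0
for d_t in [1, 0, 1, 1, 0]:
    v, nu = bernoulli_conjugate_update(v, nu, d_t)
print(v, nu)   # 3 5
```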

Remark 6.1 (Normalizing constants, ζ). In this book, the symbol ζ will be reserved for normalizing constants. Its subscript will declare the random variable whose full conditional distribution is generated when the joint distribution is divided by this quantity. Thus, for example, f(θ|D) = ζθ−1 f(θ, D), and so ζθ = f(D). Occasionally, we will denote the functional dependence of ζ on the conditioning quantity explicitly. Hence, in the case considered, we have ζθ ≡ ζθ(D).

Notes:

• The proof of Proposition 6.1 is available in [117].
• In the sequel, family (6.7) will be known as the Dynamic Exponential Family (DEF). Their respective conjugate distributions (6.9) will be known as CDEF parameter distributions. Comparing with (6.6), the DEF family can be understood as an extension of the exponential family to cases with autoregression. (6.6) is revealed for ψt = ∅.
• If the observation model satisfies two properties, namely (i) smoothness, and (ii) a support that does not depend on θ, then the (dynamic) exponential family is the only family for which a conjugate distribution exists [117]. If these properties do not hold, then sufficient statistics may also exist. For example, sufficient statistics exist for the uniform distribution parameterized by its boundaries, and for the Pareto distribution [51]. However, there are no generalizable classes of such observation models. Therefore, we will restrict our attention to the DEF family in this book.

• In general, the integral on the right-hand side of (6.8) may depend on the regressor, yielding ζdt (θ, ψt). The DEF family, however, requires that ζdt (θ, ψt) = ζdt (θ), constituting a major restriction. As a result, the following models are (almost) the only members of the family [52]:

1. Normal (Gaussian) linear-in-parameters models in the case of continuous parameters;
2. Markov chain models in the case of discrete variables.

These are also the models of greatest relevance to signal processing.
• The regressor ψt (6.1) may also contain known (observed) external variables, ξt; i.e.

ψt = ψ (dt−1, . . . , dt−∂ , ξt) .   (6.12)

ξt are known as exogenous variables in the control literature [118], and (6.2), with (6.12), defines an AutoRegressive model with eXogenous variables (ARX). In this case, the DEF model can admit ζdt (θ, ψt) = ζ (θ) ν (ξt), where ν (ξt) is a scalar function of ξt. Then, update (6.10) becomes


νt = νt−1 + ν (ξt) .

• (6.9) was defined as the conjugate distribution for (6.7) with sufficient statistics of minimum length. In fact, the following distribution is also conjugate to (6.7):

f (θ|st) = exp ( q̄ (θ)′ v̄t − νt ζdt (θ) − ζθ (st) ) .   (6.13)

Here, q̄ (θ) = [q (θ)′, q0 (θ)′]′, and v̄t = [vt′, v0′]′ is an extended vector of sufficient statistics. Once again, vt (which is now only a subset of the extended sufficient statistics, v̄t) is updated by (6.11). v0 is not, however, updated by data and is therefore inherited unchanged from the prior. Therefore, v0 is a function of the prior parameters. This design of the conjugate distribution may be useful if the prior is to fulfil a subsidiary rôle, such as regularization of the model (Section 3.7.2). However, if we wish to design conjugate inference with non-informative priors, we invariably work with the unextended conjugate distribution (6.9).

6.2.2 Example: The AutoRegressive (AR) Model

[Figure 6.2 appears here: signal flowgraph with delay elements z−1, . . . , z−m, coefficients a1, . . . , am, output dt and input σet.]

Fig. 6.2. The signal flowgraph of the AutoRegressive (AR) model.

The univariate (scalar), time-invariant AutoRegressive (AR) model is defined as follows:

dt = ∑_{k=1}^{m} ak dt−k + σet,   (6.14)

where m ≥ 1, et denotes the input (innovations process), and dt the output (i.e. observation) of the system. The standard signal flowgraph is illustrated in Fig. 6.2 [119]. The problem is to infer recursively the fixed, unknown, real parameters, r = σ² (the variance of the innovations) and a = [a1, . . . , am]′. The Bayesian approach to this problem is based on the assumption that the innovations sequence, et, is i.i.d. (and


therefore white) with a Normal distribution: f (et) = N (0, 1). The fully probabilistic univariate AR observation model is therefore

f (dt|a, r, ψt) = Nd (a′ψt, r)   (6.15)
= (1/√(2πr)) exp ( −(1/(2r)) (dt − a′ψt)² ) ,   (6.16)

where t > m = ∂, and ψt = [dt−1, . . . , dt−m]′ is the regressor (6.1). The observation model (6.15) belongs to the DEF family (6.7) under the assignments, θ = [a′, r]′, and

q (θ) = −(1/2) r−1 vec ( [1, −a′]′ [1, −a′] ) ,   (6.17)
u (dt, ψt) = vec ( [dt, ψt′]′ [dt, ψt′] ) = vec (ϕt ϕt′) ,   (6.18)
ζdt (θ) = ln ( √(2πr) ) .   (6.19)

In (6.18),

ϕt = [dt, ψt′]′   (6.20)

is the extended regressor. We will refer to the outer product of ϕt in (6.18) as a dyad. Using (6.9), the conjugate distribution has the following form:

f (a, r|vt, νt) ∝ exp ( −(1/2) r−1 vec ( [1, −a′]′ [1, −a′] )′ vt − νt ln ( √(2πr) ) ) ,   (6.21)

where vt ∈ Rη, η = (m + 1)², and st = [vt′, νt]′ ∈ Rη+1 (Section 6.2.1) is the vector of sufficient statistics. Distribution (6.21) can be recognized as Normal-inverse-Gamma (NiG) [37]. We prefer to write (6.21) in its usual matrix form, using the extended information matrix, Vt ∈ R(m+1)×(m+1):

Vt = vect (vt, m + 1) ,
f (a, r|st) = NiGa,r (Vt, νt) ,

where vect (·) denotes the vec-transpose operator (see Notational Conventions on Page XV) [106], st = {Vt, νt}, and

NiGa,r (Vt, νt) ≡ ( r^{−0.5νt} / ζa,r (Vt, νt) ) exp { −(1/2) r−1 [−1, a′] Vt [−1, a′]′ } ,   (6.22)
ζa,r (Vt, νt) = Γ (0.5νt) λt^{−0.5νt} |Vaa,t|^{−0.5} (2π)^{0.5},   (6.23)
Vt = [ V11,t  Va1,t′ ; Va1,t  Vaa,t ] ,  λt = V11,t − Va1,t′ Vaa,t−1 Va1,t.   (6.24)

(6.24) denotes the partitioning of Vt ∈ R(m+1)×(m+1) into blocks, where V11,t is the (1, 1) element. From (6.10) and (6.11), the updates for the statistics are


Vt = Vt−1 + ϕt ϕt′ = V0 + ∑_{i=m+1}^{t} ϕi ϕi′,   (6.25)
νt = νt−1 + 1 = ν0 + t − m,  t > m.

Here, the prior has been chosen to be the conjugate distribution, NiGa,r (a, r|V0, ν0). Note, from Appendix A.3, that V0 must be positive definite, and ν0 > m + p + 1. The vector form of sufficient statistics (6.11) is revealed if we apply the vec (·) operator (see Page XV) on both sides of (6.25).

The following posterior inferences will be useful in later work:

Posterior means:

at ≡ ENiGa,r [a] = Vaa,t−1 Va1,t,   (6.26)
rt ≡ ENiGa,r [r] = λt / (νt − m − 4) .

Predictive distribution: the one-step-ahead predictor (2.23) is given by the ratio of normalizing constants (6.23), a result established in general in Section 2.3.3. In the case of the AR model, then, using (6.23),

f (dt+1|Dt) = ζa,r ( Vt + [dt+1, ψt+1′]′ [dt+1, ψt+1′], νt + 1 ) / ( √(2π) ζa,r (Vt, νt) ) .   (6.27)

This is Student's t-distribution [97] with νt − m − 2 degrees of freedom. The mean value of this distribution is readily found to be

dt+1 = a′tψt+1, (6.28)

using (6.26).
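The AR updates above translate directly into a few lines of code. The sketch below simulates a toy AR(2) process (parameter values are illustrative only) and applies (6.25) and (6.26); the weakly informative prior statistics are our choice, subject only to the positive-definiteness and ν0 conditions stated below (6.25).

```python
import numpy as np

def ar_update(V, nu, d_t, psi_t):
    """One conjugate Bayesian update (6.25) of the NiG statistics of the AR model."""
    phi = np.concatenate(([d_t], psi_t))     # extended regressor (6.20)
    return V + np.outer(phi, phi), nu + 1.0

def ar_posterior_mean(V):
    """Posterior mean of the AR coefficients (6.26): V_aa^{-1} V_a1."""
    return np.linalg.solve(V[1:, 1:], V[1:, 0])

# Simulate a toy AR(2) process.
rng = np.random.default_rng(0)
a_true, sigma, m = np.array([1.5, -0.7]), 0.1, 2
d = np.zeros(500)
for t in range(m, len(d)):
    d[t] = a_true @ d[t - m:t][::-1] + sigma * rng.standard_normal()

# Recursive inference with a weakly informative conjugate prior.
V, nu = 1e-6 * np.eye(m + 1), float(m + 3)
for t in range(m, len(d)):
    psi_t = d[t - m:t][::-1]                 # psi_t = [d_{t-1}, ..., d_{t-m}]'
    V, nu = ar_update(V, nu, d[t], psi_t)

a_hat = ar_posterior_mean(V)
print(a_hat)                                  # close to a_true
print(a_hat @ d[-m:][::-1])                   # one-step-ahead predictive mean, cf. (6.28)
```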

Remark 6.2 (Classical AR estimation). The classical approach to estimation of a and r in (6.15) is based on the Minimum Mean Squared-Error (MMSE) criterion, also called the Wiener criterion, which is closely related to Least Squares (LS) estimation (Section 2.2.2) [54]. Parameter estimates are obtained by solution of the normal equations. Two principal approaches to evaluation of these equations are known as the covariance and correlation methods respectively [53]. In fact, the posterior mean (6.26), for prior parameter V0 (6.25) set to zero, is equivalent to the result of the covariance method of parameter estimation [120]. There are many techniques for the numerical solution of these equations, including recursive ones, such as the Recursive Least Squares (RLS) algorithm [108].

Remark 6.3 (Multivariate AutoRegressive (AR) model). Bayesian inference of the AR model can be easily adapted for multivariate (p > 1) observations. Then, the observation model (6.14) can be written in matrix form:

dt = Aψt + R^{1/2} et.   (6.29)


This can be formalized probabilistically as the multivariate Normal distribution (Appendix A.1):

f (dt|A, R, ψt) = Ndt (Aψt, R) .   (6.30)

Here, dt ∈ Rp are the vectors of data, A ∈ Rp×m is a matrix of regression coefficients, and ψt ∈ Rm is the regressor at time t, which may be composed of previous observations, exogenous variables (6.12) and constants. R^{1/2} denotes the matrix square root of R; i.e. R = R^{1/2} R^{1/2}′ ∈ Rp×p is the covariance matrix of the vector innovations sequence, et (6.29). Occasionally, it will prove analytically more convenient to work with the inverse covariance matrix, Ω = R−1, which is known as the precision matrix.

The posterior distribution on parameters A and R is of the Normal-inverse-Wishart (NiW) type [37] (Appendix A.3). The sufficient statistics of the posterior distribution are again of the form {Vt, νt}, updated as given by (6.25), where, now, ϕt = [dt′, ψt′]′ ∈ Rp+m (6.20). One immediate consequence is that Vt ∈ R(p+m)×(p+m), and is partitioned as follows:

Vt = [ Vdd,t  Vad,t′ ; Vad,t  Vaa,t ] ,  Λt = Vdd,t − Vad,t′ Vaa,t−1 Vad,t,   (6.31)

where Vdd,t is the (p × p) upper-left sub-block of Vt. Objects Λt, Vdd,t and Vad,t are used in the posterior distribution in place of λt, V11,t and Va1,t respectively (Appendix A.3).

6.2.3 Recursive Inference of non-DEF models

If the observation model is not from the DEF family (6.7), the exact posterior parameter inference does not have finite-dimensional statistics, and thus recursive inference is not feasible (Section 6.2). In such cases, we seek an approximate recursive inference procedure. Two principal approaches to approximation are as follows:

Global approximation: this seeks an optimal approximation for the composition of an indeterminate number of Bayesian updates (6.3). In general, this approach leads to analytically complicated solutions. However, in special scenarios, computationally tractable solutions can be found [121, 122].

One-step approximation: this seeks an optimal approximation for just one Bayesian update (6.3), as illustrated in Fig. 6.3. In general, this approach is less demanding from an analytical point of view, but it suffers from the following limitations:
• The error of approximation may grow with time. Typically, the quality of the approximation can only be studied asymptotically, i.e. for t → ∞.
• In the specific case of on-line inference given a set of i.i.d. observations, this approach yields different results depending on the order in which the data are processed [32].

One-step approximation is a special case of local approximation, with the latter embracing two- and multi-step approximations. In all cases, the problems mentioned above still pertain. In this book, we will be concerned with one-step approximations, and, in particular, one-step approximation via the VB method (Section 3.5).


[Figure 6.3 appears here: block diagram composing the on-line Bayes' rule operator B with a distributional approximation operator A.]

Fig. 6.3. Illustration of the one-step approximation of Bayes' rule in the on-line scenario with time-invariant parameterization. B denotes the (on-line) Bayes' rule operator (Fig. 6.1) and A denotes a distributional approximation operator (Fig. 3.1). The double tilde notation is used to emphasize the fact that the approximations are accumulated in each step. From now on, only the single tilde notation will be used, for convenience.

6.3 The VB Approximation in On-Line Scenarios

The VB method of approximation can only be applied to distributions which are separable in parameters (3.21). We have seen how to use the VB operator both for approximation of the full distribution (Fig. 3.3) and for generation of VB-marginals (Fig. 3.4). In the on-line scenario, these can be composed with the Bayes' rule update of Fig. 6.1 (i.e. equation (2.14) or (6.3)) in many ways, each leading to a distinct algorithmic design. We now isolate and study three key scenarios in on-line signal processing where the VB-approximation can add value, and where it can lead to the design of useful, tractable recursive algorithms. The three scenarios are summarized in Fig. 6.4 via the composition of the Bayes' rule (B, see Fig. 6.1) and VB-approximation (V, see Fig. 3.3) operators.

We now examine each of these scenarios carefully, emphasizing the following issues:

1. The exact Bayesian inference task which is being addressed;
2. The nature of the resulting approximation algorithm;
3. The family of observation models for which each scenario is feasible;
4. The possible signal processing applications.

6.3.1 Scenario I: VB-Marginalization for Conjugate Updates

Consider the Bayesian recursive inference context (Section 6.2), where (i) sufficient statistics are available, and (ii) we are interested in inferring marginal distributions (2.6) and posterior moments (2.7) at each step t. The formal scheme is given in Fig. 6.5 (top). We assume that q = 2 here (3.9) for convenience. In the lower schematic, the VB operator has replaced the marginalization operators (see Fig. 3.4).

Since sufficient statistics are available, the VB method is used in the same way as in the off-line case (Section 3.3.3). In particular, iterations of the IVB algorithm (Algorithm 1) are typically required in each time step, in order to generate the VB-marginals.

This approach can be used if the following conditions hold:

(i) The observation model is from the Dynamic Exponential Family (DEF) (6.7):

f (dt|θ1, θ2, ψt) ∝ exp ( q (θ1, θ2)′ u (dt, ψt) ) .


[Figure 6.4 appears here: three operator-composition diagrams, entitled ‘Inference of marginals’, ‘Propagation of VB-approximation’, and ‘Inference of marginals for observation models with hidden variables’, each built from the B and V operators.]

Fig. 6.4. Three possible scenarios for application of the VB operator in on-line inference.

[Figure 6.5 appears here: two block diagrams in which the Bayes' rule operator B produces f(θ|Dt) from f(θ|Dt−1) and f(dt|θ, Dt−1), followed in one case by exact marginalization (∫) and in the other by the VB operator V, yielding f(θ1|Dt) and f(θ2|Dt).]

Fig. 6.5. Generation of marginals (q = 2) in conjugate Bayesian recursive inference for time-invariant parameters, using the VB-approximation.


(ii) The posterior distribution is separable in parameters (3.21):

f (θ1, θ2|Dt) ∝ exp ( g (θ1, Dt)′ h (θ2, Dt) ) .

These two conditions are satisfied iff

f (dt|θ1, θ2, ψt) ∝ exp ( (q1 (θ1) ◦ q2 (θ2))′ u (dt, ψt) ) ,   (6.32)

where ◦ denotes the Hadamard product (i.e. an element-wise product of vectors, ‘.∗’ in MATLAB notation; see Page XV). The set of observation models (6.32) will be important in future applications of the VB method in this book, and so we will refer to it as the Dynamic Exponential Family with Separable parameters (DEFS).

This scenario can be used in applications where sufficient statistics are available, but where the posterior distribution is not tractable (e.g. marginals and moments of interest are unavailable, perhaps due to the unavailability of the normalizing constant ζθ (2.4)).

Example 6.1 (Recursive inference for the scalar multiplicative decomposition (Section 3.7)). We consider the problem of generating—on-line—the posterior marginal distributions and moments of a and x in the model dt = ax + et (3.61). We assume that the residuals et are i.i.d. N (0, re). Hence, the observation model is

f (dt|a, x, re) = f (dt|θ, re) = N (θ, re) , (6.33)

i.e. the scalar Normal distribution with unknown mean θ = ax. Note that (6.33) is a member of the DEFS family (6.32) since, trivially, θ = a ◦ x, and it is therefore a candidate for inference under scenario I. Indeed, recursive inference of θ in this case is a rudimentary task in statistics [8]. The posterior distribution, t ≥ 1, has the following form:

f (a, x|Dt, re, ra, rx) ∝ Na,x ( (1/t) ∑_{τ=1}^{t} dτ , (1/t) re ) f (a, x|ra, rx) .   (6.34)

Under the conjugate prior assignment, (3.64) and (3.65), this posterior distribution is also Normal, and is given by (3.66) under the following substitution:

d → (1/t) ∑_{τ=1}^{t} dτ ,
re → (1/t) re.

In this case, the extended sufficient statistics vector, v̄t = [vt′, v0′]′ (6.13), is given by

vt = [ (1/t) ∑_{τ=1}^{t} dτ , (1/t) re ] ,   (6.35)
v0 = [ra, rx] .


This is an example of a non-minimal conjugate distribution, as discussed in Section 6.2.1 (see Eq. 6.13). vt is updated recursively via (6.35), and the VB-marginals and moments, (3.72)–(3.76), are elicited at each time, t, via the VB method (Section 3.3.3), using either the analytical solution of Step 6, or the IVB algorithm of Step 7.

This example is representative of the signal processing contexts where scenario I may add value: sufficient statistics for recursive inference are available, but marginalization of the posterior distribution is not tractable.
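In this example, the only on-line work is the recursive accumulation of the statistic (6.35); a minimal sketch is given below. The VB/IVB step that turns the accumulated statistic into the VB-marginals (3.72)–(3.76) is omitted, since those equations are defined earlier in the book and are not reproduced here; re is treated as known, as in the example, and the data values are illustrative only.

```python
def update_statistic(d_mean, t, d_t, r_e):
    """Recursive update of the data-dependent part of (6.35): running mean of d and r_e / t."""
    t += 1
    d_mean += (d_t - d_mean) / t
    return d_mean, t, r_e / t

# Illustrative run with r_e = 0.1; the VB-marginals of a and x would be recomputed from
# (d_mean, r_e / t) at each time step via the off-line VB method of Section 3.3.3.
d_mean, t, r_e = 0.0, 0, 0.1
for d_t in [1.2, 0.9, 1.1, 1.0]:
    d_mean, t, scaled_r_e = update_statistic(d_mean, t, d_t, r_e)
print(d_mean, scaled_r_e)   # 1.05 0.025
```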

6.3.2 Scenario II: The VB Method in One-Step Approximation

We replace the one-step approximation operator, A, in Fig. 6.3 with the VB-approximation operator, V (Fig. 3.3). The resulting composition of operators is illustrated in Fig. 6.6.

[Figure 6.6 appears here: block diagrams in which the Bayes' rule operator B is followed by the generic approximation operator A (top) or by the VB operator V (bottom), the latter producing f(θ1|Dt) and f(θ2|Dt) whose product forms the approximate posterior f(θ|Dt).]

Fig. 6.6. One-step approximation of Bayes' rule in on-line, time-invariant inference, using the VB-approximation (q = 2).

In this scenario, we are propagating the VB-marginals, and so the distribution at time t − 1 has already been factorized by the VB-approximation (3.16). Hence, using (3.12) in (2.14), the joint distribution at time t is

f (θ, dt|Dt−1) ∝ f (dt|θ,Dt−1) f (θ1|Dt−1) f (θ2|Dt−1) . (6.36)

Using Theorem 3.1, the minimum of KLDVB is reached for

f (θi|Dt) ∝ exp ( Ef(θ\i|Dt) [ln f (dt|θ, Dt−1)] + ln f (θi|Dt−1) )
∝ exp ( Ef(θ\i|Dt) [ln f (dt|θ, Dt−1)] ) f (θi|Dt−1)
= f (dt|θi, Dt−1) f (θi|Dt−1) ,  i = 1, 2,   (6.37)

where

f (dt|θi, Dt−1) ∝ exp ( Ef(θ\i|Dt) [ln f (dt|θ, Dt−1)] ) .   (6.38)

Remark 6.4 (VB-observation model). The VB-observation model for θi is given by (6.38), i = 1, . . . , q. It is parameterized by the VB-moments of f(θ\i|Dt), but does not have to be constructed explicitly as part of the VB-approximation. Instead, it is an object that we inspect in order to design appropriate conjugate priors for recursive inference, as explained in Section 6.2; i.e.:


1. We can test if the VB-observation models (6.38) are members of the DEF family (6.7). If they are, then a Bayesian recursive algorithm is available for the proposed one-step VB-approximation (Fig. 6.6).

2. If, for each i = 1, . . . , q, f (θi|Dt−1) in (6.37) is conjugate to the respective VB-observation model, f (dt|θi, Dt−1) in (6.38), then these VB-marginal functional forms are replicated at all subsequent times, t, t + 1, . . . . In particular, if we choose the prior for each θi to be conjugate to the respective f (dt|θi, Dt−1), then conjugate Bayesian recursive inference of θ = [θ1′, θ2′, . . . , θq′]′ (3.9) is achieved ∀t.

Hence, each VB-marginal can be seen as being generated by an exact Bayes' rule update, B, using the respective VB-observation model in a bank of such models. The concept is illustrated in Fig. 6.7.

[Figure 6.7 appears here: block diagram in which f(θ|Dt−1) is fanned out into its VB-marginals, each updated by a parallel Bayes' rule operator B driven by its own VB-observation model, and the results multiplied to form f(θ|Dt).]

Fig. 6.7. The one-step VB-approximation for on-line time-invariant inference, shown as parallel Bayes' rule updates of the VB-marginals (q = 2). Operator ‘<’ denotes factorization (‘fanning out’) into the VB-marginals available from the previous time-step.

Recall that each VB-observation model (6.38) depends on the VB-moments of all the other VB-marginals. If there exists an analytical solution for the VB-equations, then the computational flow of the recursive algorithm is exactly as implied in Fig. 6.7; i.e. the sufficient statistics of each VB-marginal at time t are expressed recursively via parallel (decoupled) update equations, si,t−1 → si,t, i = 1, . . . , q, using (6.10) and (6.11). Solution of the VB-equations at time t yields the shaping parameters, (3.28) and (3.29), of each VB-marginal in terms of these updated statistics (specifically, the VB-moments). More commonly, however, the VB-moments are evaluated iteratively via cycles of the IVB algorithm (Algorithm 1). In each such cycle, these updated VB-moments must be propagated back through the recursive sufficient statistics updates for each VB-observation model. Essentially, these recursive statistics updates form part of the set of VB-equations. This flow of VB-moments during iterations of the IVB algorithm is illustrated by the dotted lines in Fig. 6.8, and occurs many times in each time-step, once for each cycle of the IVB algorithm.

Next, we seek the most general form of observation model, f(Dt|θ), for which the one-step VB-approximation (scenario II) of Fig. 6.6 can be realized tractably. From the foregoing, we must satisfy two conditions (assuming q = 2 for convenience):


[Figure 6.8 appears here: two parallel Bayes' rule updates, f(θ1|Dt−1) → f(θ1|Dt) and f(θ2|Dt−1) → f(θ2|Dt), each driven by its own VB-observation model, with dotted arrows indicating the exchange of VB-moments between them.]

Fig. 6.8. The one-step VB-approximation for on-line time-invariant inference. The transmission of VB-moments via IVB cycles (q = 2) is indicated by dotted arrows.

(i) Each VB-observation model must be from the DEF family (6.7):

f (dt|θi, Dt) = f (dt|θi, ψi,t) ∝ exp ( qi (θi)′ ui (dt, ψi,t) ) ,  i = 1, 2.

(ii) The joint distribution at time t (6.36) must be separable in parameters (3.21). Hence, the (exact) observation model must fulfil the same condition:

f (dt|θ1, θ2, Dt−1) ∝ exp ( g (θ1, Dt)′ h (θ2, Dt) ) .

These two conditions are satisfied iff g (θ1, Dt) = q1 (θ1) ◦ u1 (dt, ψ1,t), and h (θ2, Dt) = q2 (θ2) ◦ u2 (dt, ψ2,t), which is consistent with the definition of the DEFS family (6.32), under the assignment u (dt, ψt) = u1 (dt, ψ1,t) ◦ u2 (dt, ψ2,t). Therefore, this scenario can be used for the same class of models as scenario I (Section 6.3.1).

The key distinction between this scenario and scenario I (Section 6.3.1) is that the exact sufficient statistics, (6.10) and (6.11), are not collected. Instead, the approximate statistics, si,t, i = 1, . . . , q, of the parallel VB-observation model updates (Remark 6.4) are collected as part of the VB method at each step. Ultimately, the VB-marginals and their shaping parameters represent all the information that we carry forward about the unknown parameters. This may be useful in situations where the exact sufficient statistics are too large to be collected within the sampling period, ∆t, of the on-line process. The imposition of conditional independence in the VB-approximation (3.16) will, in effect, remove all terms modelling cross-correlations from the statistics. The free-form optimization achieved by the VB method—minimizing KLDVB (3.6)—will adjust the values of the remaining statistics at each time t, in order to emulate the original statistics as closely as possible.

6.3.3 Scenario III: Achieving Conjugacy in non-DEF Models via the VB Approximation

In this Section, we study the use of VB-approximation for non-DEF observation models. Specifically, we will study on-line Bayesian inference of θ in an observation model expressed in marginalized form, as follows:


f (dt|θ, Dt−1) = ∫Θ∗2,t f (dt, θ2,t|θ, Dt−1) dθ2,t.   (6.39)

The auxiliary parameter, θ2,t, has been used to augment the model. We will refer to θ2,t as the hidden variables in the model, and the integrand, f (dt, θ2,t|θ, Dt−1), as the augmented model. θ2,t correspond to the missing data terms in the EM algorithm [21] (Section 3.4.4). Note that θ2,t is generated locally at each time t, and so correlation between these variables at different times is not modelled; i.e. they are (conditionally) independent.

[Figure 6.9 appears here: two equivalent block diagrams; in one, the augmented model f(dt, θ2,t|θ, Dt−1) is marginalized before the Bayes' rule update of f(θ|Dt−1); in the other, the Bayes' rule update produces f(θ, θ2,t|Dt), which is then marginalized to f(θ|Dt).]

Fig. 6.9. On-line inference of θ when the observation model has hidden variables, θ2,t. Two equivalent operator compositions are given.

On-line inference of θ (6.39) is illustrated in Fig. 6.9. We note that the hidden variables are ‘injected’ into the observation model at each time t, but marginalization ensures that they do not appear in the posterior distribution. As discussed in Section 6.7, exact Bayesian recursive inference is available iff (6.39) is a member of the DEF family (6.7).

In cases where (6.39) is not a member of the DEF family, it may be possible to achieve a recursive algorithm using an approximation. To this end, we now use the VB-approximation to replace the exact marginalization with VB-marginalization (Fig. 3.4). We concentrate on the operator composition in the lower schematic of Fig. 6.9, which yields the approximate update in Fig. 6.10. In this case, the VB-marginal of θ is propagated to the next step, while that of the hidden variables, θ2,t, is ‘dropped off’.

From Fig. 6.10, we apply the VB-approximation to the joint distribution, f (θ, θ2,t|Dt); i.e. we seek its approximation in the family (3.7)

Fc = { f (θ, θ2,t|Dt) : f (θ, θ2,t|Dt) = f (θ|Dt) f (θ2,t|Dt) } .


[Figure 6.10 appears here: block diagram in which the Bayes' rule operator B forms f(θ, θ2,t|Dt) from f(θ|Dt−1) and the augmented model f(dt, θ2,t|θ, Dt−1); the VB operator V then produces f(θ|Dt), which is propagated, and f(θ2,t|Dt), which is not.]

Fig. 6.10. The one-step VB-approximation for on-line inference with hidden variables, showing the propagation of the VB-marginal of θ.

Using Theorem 3.1, the minimum of KLDVB (3.6) is reached for

f (θ|Dt) ∝ exp ( Ef(θ2,t|Dt) [ln f (dt, θ2,t|θ, Dt−1)] + ln f (θ|Dt−1) )
∝ exp ( Ef(θ2,t|Dt) [ln f (dt, θ2,t|θ, Dt−1)] ) f (θ|Dt−1)
∝ f (dt|θ, Dt−1) f (θ|Dt−1) ,   (6.40)

where

f (dt|θ, Dt−1) ∝ exp ( Ef(θ2,t|Dt) [ln f (dt, θ2,t|θ, Dt−1)] ) ,   (6.41)

is the VB-observation model, defined in Remark 6.4. Hence, a recursive inference algorithm for θ emerges if the prior, f(θ), is chosen conjugate to (6.41), in which case the VB-posterior, f (θ|Dt), is functionally invariant ∀t.

The VB-marginal for the hidden variables, θ2,t, i.e.

f (θ2,t|Dt) ∝ exp ( Ef(θ|Dt) [ln f (dt, θ2,t|θ, Dt−1)] ) ,   (6.42)

is not propagated, and no VB-observation model need be formalized for it. Its purpose is purely to provide the VB-moments which are required in formulating the VB-observation model for θ (6.41). The latter is used in exactly the same way as in scenario II: its recursive update equations form part of the set iterated by the IVB algorithm. The implied computational flow is illustrated in Fig. 6.11. Comparing with Fig. 6.9, we see how the VB-approximation has replaced the Bayes' update and integration with just a Bayes' update, with the statistics computed via IVB cycles at each time step.

If the scheme above is to be tractable, then the observation model, f (dt|θ, Dt−1) (6.39), must satisfy the following two conditions:

(i) The VB-observation model (6.41) is from the Dynamic Exponential Family (6.7):

f (dt|θ, Dt−1) = f (dt|θ, ψt) ∝ exp ( q (θ)′ u (dt, ψt) ) .

(ii) The true augmented observation model in (6.39) is separable in parameters (3.21):

f (dt, θ2,t|θ, Dt−1) ∝ exp ( g (θ, Dt)′ h (θ2,t, Dt) ) .


[Figure 6.11 appears here: block diagram of the Bayes' rule update f(θ|Dt−1) → f(θ|Dt), driven by the VB-observation model, with the VB-marginal f(θ2,t|Dt) supplying VB-moments via IVB cycles.]

Fig. 6.11. The VB-approximation for on-line inference with hidden variables, illustrating the flow of VB-moments via IVB cycles.

These two conditions are satisfied iff g (θ, Dt) = q (θ) ◦ u (dt, ψt); i.e.

f (dt, θ2,t|θ, Dt−1) ∝ exp ( q (θ)′ ū (θ2,t, Dt) ) ,   (6.43)

where

ū (θ2,t, Dt) = h (θ2,t, Dt) ◦ u (dt, ψt) .

Family (6.43) will be called the Dynamic Exponential Family with Hidden variables (DEFH), since it extends the Exponential Family with Hidden variables (EFH) [26, 65] to autoregressive cases (6.2). In contrast to the DEFS family (6.32) of the previous two scenarios, DEFH observation models do not require separability of θ2,t from the data Dt, as can be seen in the second argument on the right-hand side of (6.43).

This scenario is useful in situations where the observation model does not admit a conjugate distribution and, therefore, sufficient statistics are not available. Then, the VB method of approximation outlined above yields a recursive algorithm if the observation model is amenable to augmentation in such a way that (6.43) is satisfied; i.e. if the observation model (6.39) can be diagnosed as DEFH. This step may be difficult: the required augmentation must be handled on a case-by-case basis.

6.3.4 The VB Method in the On-Line Scenarios

We follow the 8 steps of the VB method, as developed in Section 3.3.3, but we adapt them slightly, as follows, for use in the on-line scenario:

Step 1: Choose a Bayesian model: In the off-line scenario, the full Bayesian model (observation model and prior) was chosen (6.3). In the on-line case, the prior distribution is typically chosen conjugate to the observation model, and the latter may be adapted by later steps in the VB-method. Hence, in this step, we choose only the observation model (6.2). Using this, we decide which scenario (Sections 6.3.1–6.3.3) we will follow. In scenario III, we assume that we have access to the augmented observation model (6.39).

Step 2: Partition the parameters: Since the VB-observation model must be a member of the DEF family (6.1), the requirement of parameter separability (3.21) in the joint distribution is replaced by the following stronger conditions:

• The observation model is in the DEF family (6.32) for scenarios I and II;


• The observation model is in the DEFH family (6.43) for scenario III.

Step 3: Write down the VB-marginals: This step is scenario-dependent:
Scenario I: Write down the VB-marginals, f(θ_i|D_t).
Scenario II: Write down the VB-observation model, f(d_t|θ_i, D_{t-1}) (6.4), for each θ_i.
Scenario III: Write down the VB-observation model, f(d_t|θ_i, D_{t-1}) (6.41), for each θ_i in θ = [θ_1', θ_2', ..., θ_q']', and write down the VB-marginal, f(θ_{2,t}|D_t) (6.42), of the hidden variables, θ_{2,t}.

Step 4: Identify standard forms: We identify standard forms for all VB-distributions listed in the previous step. The posterior distribution for each θ_i is chosen conjugate to the standard form of the respective VB-observation model (Remark 6.4).

Step 5: Unchanged.

Step 6: Unchanged.

Step 7: Run the IVB algorithm: the steps of the IVB algorithm are iterated until a convergence criterion is satisfied. This may be a computationally prohibitive requirement in the on-line case. Therefore, an upper bound on the number of iterations in each time step is typically set. For the DEFH family (6.43), consistency in identifying θ via the IVB algorithm, using just one IVB iteration per time step, was proved in [26].

For time-invariant parameters, θ, it can be expected that—for t large—each new observation, d_t, will perturb the statistics only slightly. Hence, as t increases, we can expect the IVB algorithm to converge faster, eventually requiring as few as two cycles.
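The per-time-step computation described in this step can be summarized by the following schematic sketch (Python; not from the book). The callables data_update and moment_update, and the initial moments, are hypothetical, model-specific placeholders; for the AR mixture of Section 6.5 they would correspond to the recursions (6.58)–(6.60) and the moments (6.61)–(6.63).

```python
def ivb_time_step(stats_prev, d, data_update, moment_update, init_moments,
                  max_iter=20, tol=1e-8):
    """One time step of the on-line IVB algorithm (schematic).

    Alternates between (i) the conjugate data update of the statistics of the
    VB-posterior of theta via the VB-observation model, and (ii) re-evaluation
    of the VB-moments of the hidden variables theta_2,t, until the moments stop
    changing or an iteration cap is reached."""
    moments = list(init_moments)
    stats = stats_prev
    for _ in range(max_iter):
        # always restart the data update from the time t-1 statistics,
        # so that the current observation is not accumulated more than once
        stats = data_update(stats_prev, d, moments)
        new_moments = moment_update(stats, d)
        delta = max(abs(a - b) for a, b in zip(new_moments, moments))
        moments = list(new_moments)
        if delta < tol:
            break
    return stats, moments
```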

Step 8: Unchanged.

Note that steps 1–6 can be completed off-line. The only steps performed on-line are 7 and 8.

In signal processing applications, we expect that the on-line VB method presented above can be useful for various extensions of (i) AR processes for continuous observations, and (ii) Markov chain processes for discrete observations. It is significant, however, that the factorized nature of the VB approximation (3.16) allows joint inference of both continuous and discrete variables, a task which is not computationally tractable in exact Bayesian inference [52].

6.4 Related Distributional Approximations

In this Section, we review other techniques for distributional approximation which rely on KLD minimization. First, we outline the Quasi-Bayes (QB) approximation (Section 3.4.3), specialized to the on-line context. Since this is a Restricted VB (RVB) technique, DEF-type observation models are still required. The non-VB-based approximation techniques in Sections 6.4.2 and 6.4.3 can potentially be used for inference with non-DEF observation models.

6.4.1 The Quasi-Bayes (QB) Approximation in On-Line Scenarios

The Quasi-Bayes (QB) approximation (Section 3.4.3) was characterized as a special case of the Restricted VB (RVB) approximation (Section 3.4.3). As such, it has a closed-form solution without the need for IVB iterations (Section 3.4.3.1), which may recommend it for use in on-line scenarios where the (unrestricted) VB method is unsuitable. Recall, from (3.45), that the QB-approximation requires that one (or q − 1 in the general case: see Remark 3.3) of the VB-marginals be replaced by the exact marginal. Naturally, this can be done only if (i) such a closed-form marginal is available, and (ii) this marginal is tractable, in the sense that the necessary moments can be evaluated analytically.

The QB-approximation can be applied in the three on-line scenarios outlined in Section 6.3, as follows:

Scenario I: If one of the marginals in Fig. 6.5 (upper schematic) is tractable, the QB-approximation can be used for inference of the other.

Scenario II: In this case, both marginals are propagated to the next step. Therefore, the QB-approximation is feasible in this scenario only if the analytical marginal has a conjugate update, i.e. only if it belongs to the CDEF family defined in (6.9). This imposes an additional constraint beyond the usual requirement for a DEF-type observation model (Section 6.3.2).

Scenario III: Since f(θ_{2,t}|D_t) is not propagated, it is the natural choice for replacement by the exact marginal:
\[
f(\theta_{2,t}|D_t) \equiv f(\theta_{2,t}|D_t) = \int_{\Theta^*} f(\theta, \theta_{2,t}|D_t)\, d\theta.
\tag{6.44}
\]
In this case, the only requirement on (6.44) is tractability, as in scenario I above. Conjugate updates for f(θ|D_t) are guaranteed, as before, for DEFH observation models (6.43).

6.4.2 Global Approximation via the Geometric Approach

The problem of recursive inference with non-DEF observation models, under a limited memory constraint, was addressed in general in [123, 124]. This geometric approach to approximation is an example of global approximation (Section 6.2.3). The approximate inference procedure is found by projecting the space of posterior distributions at any t into the space of distributions with finite-dimensional statistics, w_t ∈ R^{η×1}:
\[
f(\theta|D_t) \approx f(\theta|w_t),
\tag{6.45}
\]
where η ≥ 1 is assigned a priori. The risk-based Kullback-Leibler divergence, KLD_MR (Section 3.2.2), is used as the proximity measure [121].

It was shown in [121, 125] that the only family which can be globally optimized—i.e. the optimum is with respect to the parameter inference at any time t—is a probabilistic mixture of η fixed (known) distributions, f_i(θ), i = 1, ..., η, weighted by the elements of w_t (6.45). These statistics, w_t, are updated by an appropriately chosen functional, l(·):
\[
w_{i,t} = w_{i,t-1} + l\left( f_i(\theta),\, f(d_t, \theta|D_{t-1}) \right), \quad i = 1, \ldots, \eta.
\tag{6.46}
\]
Alternatively, the choice of η fixed distributions, f_i(θ), can be replaced by the choice of η functions, l_i(·), of the data, such that
\[
w_{i,t} = w_{i,t-1} + l_i(D_t).
\]
Practical use of the approximation is, however, rather limited. The method requires time- and data-invariant linear operators, l_i(·) (or f_i(θ) and l(·)), to be chosen a priori. Design criteria for these operators are available only in special cases, and the method is feasible only for low-dimensional problems.

6.4.3 One-step Fixed-Form (FF) Approximation

The fixed-form approximation of Section 3.4.2 can be applied after each Bayes' rule update in on-line inference (Section 6.3). As such, it is an example of one-step approximation (Fig. 6.3). The form of the approximate parametric posterior distribution, f(θ|D_t) ∈ F_β = {f_0(θ|β), ∀β}, is set a priori, and fixed for all t. It is assumed that the prior is also assigned from this parametric class: f(θ) = f_0(θ|β_0). After a (non-conjugate) Bayes' rule update (6.3), the posterior distribution is
\[
f(\theta|\beta_{t-1}, D_t) \propto f(d_t|\theta, D_{t-1})\, f_0(\theta|\beta_{t-1}).
\tag{6.47}
\]
Hence, f(θ|β_{t-1}, D_t) ∉ F_β, and so an approximation, A[·] (3.1), is found by projecting this posterior distribution into F_β:
\[
f(\theta|D_t) \equiv A\left[ f(\theta|\beta_{t-1}, D_t) \right] = f_0(\theta|\beta_t).
\tag{6.48}
\]

The approximation (6.48) is used as the prior in the next step. In this on-line scenario, A[·] (and, hence, the update rule for β_t) is usually defined in one of two ways [32]:

1. Probability fitting: the approximation (6.48) is optimized with respect to a chosen divergence (Section 3.4.2), such as KLD_MR (3.5).

2. Moment fitting (also known as the probabilistic editor [32]): parameters β_t are chosen so that selected moments of the approximating distribution, f_0(θ|β_t) (6.48), match those of the exactly updated posterior, f(θ|β_{t-1}, D_t) (6.47).

This specialization of the one-step approximation (Fig. 6.3) is illustrated in Fig. 6.12. Note, from Figs. 6.6 and 6.10, that scenarios II and III for on-line application of the VB method are also instances of one-step approximation. The key distinction between the VB-approximation and the one above is the free-form nature of the VB approach: i.e. the VB approximation yields not only parameters but also the form of the approximating distribution.

Fig. 6.12. One-step fixed-form approximation in the on-line scenario with time-invariant parameterization.

6.5 On-line Inference of a Mixture of AutoRegressive (AR) Models

The AR model was introduced in Section 6.2.2. Mixtures of these models are used in non-linear time-series modelling [126], control design [127], etc. AR mixtures find wide application in speech recognition [128], classification [129], spectrum modelling [130], etc. Mixtures of Normal distributions are used as a universal approximation for a wide class of distributions in neural computing [131]. These Normal mixtures are a special case of AR mixtures, where the regressors (6.12) are independent of the history of d_t.

A mixture of c multivariate AR components (i.e. models) is defined as follows (Remark 6.3):
\[
f(d_t \mid \{A\}_c, \{R\}_c, \alpha, \{\psi_t\}_c) = \sum_{i=1}^{c} \alpha_i\, \mathcal{N}_{d_t}\!\big( A^{(i)}\psi_t^{(i)}, R^{(i)} \big).
\tag{6.49}
\]
The notation {·}_c is used to represent a set of c elements of the same kind; e.g. {A}_c = {A^{(1)}, ..., A^{(c)}}. In (6.49), d_t is a p-dimensional vector of observations and ψ_t^{(i)} is an m_i-dimensional regressor. A^{(i)} ∈ R^{p×m_i}, i = 1, ..., c, is the regression coefficient matrix of the ith AR component, and R^{(i)} ∈ R^{p×p} is the covariance matrix of the innovations process associated with each component. α_i ∈ I_{[0,1]} denotes the time-invariant weight of the ith AR component. Mixture models—such as the AR mixture in (6.49)—do not belong to the DEF family (6.7), since the logarithm of the sum cannot be separated into the required scalar product. As a consequence, the number of terms in the posterior at time t is c^t. This exponential increase is an example of combinatoric explosion, typical of probabilistic inference involving mixtures [132].

6.5.1 The VB Method for AR Mixtures

We now derive the VB-approximation for posterior inference of the parameters of the mixture model (6.49). Since the observation model (6.49) is not a member of the DEF family, our only choice is to seek an augmented observation model for which (6.49) is its marginal (6.39). This would allow us to proceed with scenario III (Section 6.3.3).

Step 1: (6.49) can be diagnosed as a member of the DEFH family (6.43), as follows:


\[
f(d_t \mid \{A\}_c, \{R\}_c, \alpha, \{\psi_t\}_c) = \int_{l_t} f\big(d_t \mid A^{(i)}, R^{(i)}, l_t, \psi_t^{(i)}\big)\, f(l_t|\alpha)\, dl_t,
\]
\[
f(d_t \mid \{A\}_c, \{R\}_c, l_t, \{\psi_t\}_c) = \prod_{i=1}^{c} \mathcal{N}_{d_t}\!\big( A^{(i)}\psi_t^{(i)}, R^{(i)} \big)^{l_{i,t}},
\tag{6.50}
\]
\[
f(l_t|\alpha) = \mathrm{Mu}_{l_t}(\alpha).
\]
The time-invariant parameters are θ = {{A}_c, {R}_c, α}, and the hidden variable is θ_{2,t} = l_t = [l_{1,t}, ..., l_{c,t}]' ∈ {ε_c(1), ..., ε_c(c)}, where ε_c(i) is the ith elementary vector in R^c (see Notational Conventions on page XV). Hence, l_t is the vector pointer into the component which is active at time t (6.50).

Step 2: The logarithm of the augmented observation model is
\[
\begin{aligned}
\ln f(d_t, l_t \mid \{A\}_c, \{R\}_c, \alpha, \{\psi_t\}_c)
&= \sum_{i=1}^{c} \Big[ -\tfrac{1}{2} l_{i,t} \ln\big|R^{(i)}\big|
-\tfrac{1}{2} l_{i,t} \big(d_t - A^{(i)}\psi_t^{(i)}\big)' \big(R^{(i)}\big)^{-1} \big(d_t - A^{(i)}\psi_t^{(i)}\big) \\
&\qquad\; + \ln\Gamma(l_{i,t}) + l_{i,t}\ln(\alpha_i) \Big] + \gamma \\
&= \sum_{i=1}^{c} \Big[ -\tfrac{1}{2}\,\mathrm{tr}\Big( \big(\big[I_p, -A^{(i)}\big]' R^{(i)-1} \big[I_p, -A^{(i)}\big]\big) \big(l_{i,t}\,\varphi_t^{(i)}\varphi_t^{(i)\prime}\big) \Big) \\
&\qquad\; -\tfrac{1}{2} l_{i,t}\ln\big|R^{(i)}\big| + \ln\Gamma(l_{i,t}) + l_{i,t}\ln(\alpha_i) \Big] + \gamma.
\end{aligned}
\tag{6.51}
\]
Here, γ aggregates all terms which do not depend on the parameters, and φ_t^{(i)} = [d_t', ψ_t^{(i)'}]' ∈ R^{p+m_i} is the ith extended regressor (see (6.20) and Remark 6.3). From (6.51), we see that the augmented observation model is from the DEFH family (6.43).

Step 3: The VB-observation model for θ is as follows:
\[
\begin{aligned}
f(d_t \mid \{A\}_c, \{R\}_c, \alpha, \{\psi_t\}_c) \propto \exp \sum_{i=1}^{c} \Big[
&-\tfrac{1}{2}\, \widehat{l}_{i,t} \ln\big|R^{(i)}\big| \\
&-\tfrac{1}{2}\,\mathrm{tr}\Big( \big(\big[I_p, -A^{(i)}\big]' R^{(i)-1}\big[I_p, -A^{(i)}\big]\big)\big(\widehat{l}_{i,t}\,\varphi_t^{(i)}\varphi_t^{(i)\prime}\big)\Big)
+ \widehat{l}_{i,t}\ln(\alpha_i) \Big].
\end{aligned}
\tag{6.52}
\]
The VB-marginal on the hidden variable, θ_{2,t} = l_t, is
\[
\begin{aligned}
f(l_t|D_t) \propto \exp \sum_{i=1}^{c} \Big[ \ln\Gamma(l_{i,t}) + l_{i,t}\Big(
&-\tfrac{1}{2}\,\widehat{\ln\big|R^{(i)}\big|}_t + \widehat{\ln(\alpha_i)}_t \\
&-\tfrac{1}{2}\,\mathrm{tr}\Big( \mathrm{E}_{f(\theta|D_t)}\Big[\big[I_p,-A^{(i)}\big]' R^{(i)-1}\big[I_p,-A^{(i)}\big]\Big]\, \varphi_t^{(i)}\varphi_t^{(i)\prime} \Big) \Big)\Big],
\end{aligned}
\tag{6.53}
\]
where
\[
\mathrm{E}_{f(\theta|D_t)}\Big[\big[I_p,-A^{(i)}\big]' R^{(i)-1}\big[I_p,-A^{(i)}\big]\Big]
= \begin{bmatrix}
\widehat{R^{(i)-1}}_t & -\widehat{R^{(i)-1}}_t\, \widehat{A^{(i)}}_t \\[2pt]
-\widehat{A^{(i)}}_t{}'\, \widehat{R^{(i)-1}}_t & \mathrm{E}_{f(\theta|D_t)}\big[ A^{(i)\prime} R^{(i)-1} A^{(i)} \big]
\end{bmatrix}.
\tag{6.54}
\]
Here, we are using the notation $\widehat{\theta}_t = \mathrm{E}_{f(\theta|D_t)}[\theta]$ to denote the (time-variant) posterior expectation (i.e. the VB-moment (3.30)) of a time-invariant parameter. When the parameter is time-variant—such as l_t—its VB-moment at time t will be denoted by $\widehat{l}_t$; i.e. the second t-subscript will be omitted.

Step 4: Since we are working in scenario III (Section 6.3.3), we must identify standard forms for (i) the VB-observation model (6.52), and (ii) the VB-marginal (6.53). With respect to the VB-observation model, we note that it can be factorized into c + 1 conditionally independent distributions, and therefore written as the following product of standard forms:
\[
f\big(d_t, \widehat{l}_t \mid \{A\}_c, \{R\}_c, \alpha, \{\psi_t\}_c\big) \propto
\mathrm{Mu}_{\widehat{l}_t}(\alpha) \prod_{i=1}^{c} \mathcal{N}_{d_t}\!\big( A^{(i)}\psi_t^{(i)}, R^{(i)} \big),
\tag{6.55}
\]
where Mu(·) is the Multinomial distribution (Appendix A.7). We use the notation f(d_t, $\widehat{l}_t$| ...) to emphasize the fact that the moments, $\widehat{l}_t$, of the hidden variables, l_t, are entering (6.55) in the same way as the actual data, d_t. This property is not true in general settings for the VB-observation model, since many VB-moments of the hidden variables, θ_{2,t}, may need to be substituted into the exact observation model (6.41). In this particular context of mixture-modelling, only the first moment, $\widehat{l}_t$, is required. An equivalent behaviour in the classical EM algorithm (Section 3.4.4) has encouraged the use of the phrase 'missing data' to describe the hidden variables, θ_{2,t}, in that context [18].

In order to establish a conjugate recursive update of θ = {{A}_c, {R}_c, α} (Remark 6.4), the VB-marginal is chosen as conjugate to (6.55):
\[
f(\{A\}_c, \{R\}_c, \alpha \mid D_t) = \mathrm{Di}_{\alpha}(\kappa_t) \prod_{i=1}^{c} \mathcal{N}i\mathcal{W}_{A^{(i)},R^{(i)}}\!\big( V_t^{(i)}, \nu_t^{(i)} \big).
\tag{6.56}
\]
Here, NiW(·) denotes the Normal-inverse-Wishart distribution (see Remark 6.3 and Appendix A.3), which is conjugate to $\mathcal{N}_{d_t}(\cdot)$ in (6.55), and Di(·) denotes the Dirichlet distribution (Appendix A.8), which is conjugate to $\mathrm{Mu}_{\widehat{l}_t}(\cdot)$ in (6.55).

The VB-marginal of l_t (6.53) is recognized to have the following standard form:
\[
f(l_t|D_t) = \mathrm{Mu}(w_t).
\tag{6.57}
\]
The shaping parameters of (6.56) and (6.57) are as follows:
\[
V_t^{(i)} = V_{t-1}^{(i)} + \widehat{l}_{i,t}\, \varphi_t^{(i)} \varphi_t^{(i)\prime},
\tag{6.58}
\]
\[
\nu_t^{(i)} = \nu_{t-1}^{(i)} + \widehat{l}_{i,t},
\qquad
\kappa_t = \kappa_{t-1} + \widehat{l}_t,
\tag{6.59}
\]
\[
\begin{aligned}
w_{i,t} \propto \exp\Big[ &-\tfrac{1}{2}\, \widehat{\ln\big|R^{(i)}\big|}_t + \widehat{\ln(\alpha_i)}_t \\
&-\tfrac{1}{2}\,\mathrm{tr}\Big( \mathrm{E}_{f(\theta|D_t)}\Big[\big[I_p,-A^{(i)}\big]' R^{(i)-1}\big[I_p,-A^{(i)}\big]\Big]\, \varphi_t^{(i)}\varphi_t^{(i)\prime} \Big) \Big],
\end{aligned}
\tag{6.60}
\]
where the expectation in (6.60) is given by (6.54). Note that ν_t and κ_t experience the same update, via $\widehat{l}_{i,t}$. This is because ν_t^{(i)} and κ_t act as update counters for their respective distributions, and all these distributions are being updated together in (6.56), via (6.55).

Step 5: The required VB-moments of the VB-marginals, (6.56) and (6.57), are as follows:
\[
\widehat{A^{(i)}}_t = V_{ad,t}^{(i)\prime}\, V_{aa,t}^{(i)-1},
\tag{6.61}
\]
\[
\widehat{R^{(i)-1}}_t = \nu_t^{(i)}\, \Lambda_t^{(i)-1},
\]
\[
\mathrm{E}_{f(\theta|D_t)}\big[ A^{(i)\prime} R^{(i)-1} A^{(i)} \big]
= p\, I_{m_i} + \widehat{A^{(i)}}_t{}'\, \widehat{R^{(i)-1}}_t\, \widehat{A^{(i)}}_t,
\]
\[
\widehat{\ln\big|R^{(i)}\big|}_t = -p\ln 2 - \sum_{j=1}^{p} \psi_\Gamma\big( \nu_t^{(i)} - m_i - p - j \big) + \ln\big|\Lambda_t^{(i)}\big|,
\]
\[
\widehat{\ln(\alpha_i)}_t = \psi_\Gamma(\kappa_{i,t}) - \psi_\Gamma\Big( \sum_{j=1}^{c} \kappa_{j,t} \Big),
\tag{6.62}
\]
\[
\widehat{l}_{i,t} = \frac{\kappa_{i,t}}{\sum_{j=1}^{c} \kappa_{j,t}},
\tag{6.63}
\]
where $V_{dd,t}^{(i)}$, $V_{ad,t}^{(i)}$, $V_{aa,t}^{(i)}$ and $\Lambda_t^{(i)}$ are submatrices of $V_t^{(i)}$ (see (6.24) and Remark 6.3), and $\psi_\Gamma(\cdot)$ is the digamma (psi) function [93] (Appendix A.8).

Step 6: No reduction of the VB-equations (6.58)–(6.63) was found.

Step 7: The IVB algorithm (Algorithm 1) is therefore used to find an iterative solution of the VB-equations (6.58)–(6.63).

Step 8: We report the VB-marginal (6.56) of θ = {{A}_c, {R}_c, α}, via its shaping parameters (6.58)–(6.59).
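As an illustration only, the following sketch (Python; not from the book) implements the IVB recursion at a single time step for a deliberately simplified special case of the above: a static Gaussian mixture with known identity covariances R^{(i)} = I_p and known weights α, so that the unknowns reduce to the component means with Normal VB-marginals. The full Normal-inverse-Wishart/Dirichlet recursions (6.58)–(6.63) follow the same alternating pattern, but with richer moments.

```python
import numpy as np

def vb_mixture_time_step(d, mu_prev, Sigma_prev, log_alpha, n_ivb=5):
    """One on-line VB time step (scenario III) for a simplified static
    Gaussian mixture: known identity covariances, known weights alpha.
    mu_prev (c, p) and Sigma_prev (c, p, p) are the Normal VB-marginals
    of the component means carried over from time t-1."""
    c, p = mu_prev.shape
    prec_prev = np.array([np.linalg.inv(S) for S in Sigma_prev])
    mu = mu_prev.copy()
    Sigma = np.array([S.copy() for S in Sigma_prev])
    w = np.full(c, 1.0 / c)                       # initial moment of l_t
    for _ in range(n_ivb):                        # IVB cycles
        # VB-marginal of the label l_t (cf. (6.57), (6.60)):
        # E_A[ln N(d | A^(i), I)] = -0.5 (||d - mu_i||^2 + tr Sigma_i) + const
        logits = np.array([log_alpha[i]
                           - 0.5 * (np.sum((d - mu[i]) ** 2) + np.trace(Sigma[i]))
                           for i in range(c)])
        w = np.exp(logits - logits.max())
        w /= w.sum()
        # Conjugate (Normal) data update of each mean, weighted by w[i],
        # always restarting from the time t-1 statistics (cf. (6.58)):
        for i in range(c):
            prec_i = prec_prev[i] + w[i] * np.eye(p)
            Sigma[i] = np.linalg.inv(prec_i)
            mu[i] = Sigma[i] @ (prec_prev[i] @ mu_prev[i] + w[i] * d)
    return mu, Sigma, w
```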

6.5.2 Related Distributional Approximations for AR Mixtures

6.5.2.1 The Quasi-Bayes (QB) Approximation

Recall, from Sections 3.4.3.2 and 6.4.1, that the QB method of approximation is a special case of the Restricted VB (RVB) approximation where one of the VB-marginals is replaced by the true marginal. In Section 6.4.1, we noted that this method may be particularly suitable in on-line inference, since it provides a closed-form distributional approximation for the remaining random variables, obviating the need for iterations of the IVB algorithm at each time step, t. We specialized the VB method to RVB approximation in Section 3.4.3.1, and we follow these steps now.

Step 1: We are deriving the QB-approximation for scenario III of on-line inference. Following the recommendation in Section 6.4.1 (see (6.44)), the VB-marginal on the hidden variables, l_t, is fixed as the exact marginal:
\[
f(l_t|D_t) \equiv f(l_t|D_t) = \int_{\Theta^*} f(\theta, l_t|D_t)\, d\theta
\propto \int_{\Theta^*} f(\theta|D_{t-1})\, f(d_t, l_t|\theta, D_{t-1})\, d\theta,
\tag{6.64}
\]

where θ = {{A}_c, {R}_c, α} are the AR mixture parameters. The integrand, f(θ, l_t|D_t), in (6.64) is therefore the one-step updating of f(θ|D_{t-1}) (i.e. the VB-marginal from the previous step (6.56)) via the exact observation model (6.50). Hence,
\[
f(\theta, l_t|D_t) \propto \mathrm{Di}_{\alpha}(\kappa_{t-1} + l_t) \times
\prod_{i=1}^{c} \left[ \mathcal{N}i\mathcal{W}_{A^{(i)},R^{(i)}}\!\big( V_{t-1}^{(i)} + \varphi_t^{(i)}\varphi_t^{(i)\prime},\; \nu_{t-1}^{(i)} + 1 \big) \right]^{l_{i,t}}.
\tag{6.65}
\]

Marginalization of (6.65) over the AR parameters, θ, yields
\[
f(l_t|D_t) = \mathrm{Mu}_{l_t}(w_t) \propto \prod_{i=1}^{c} w_{i,t}^{l_{i,t}},
\tag{6.66}
\]
\[
w_{i,t} \propto \zeta_{\alpha}\!\big( \kappa_{t-1} + \varepsilon_c(i) \big)\; \zeta_{A^{(i)},R^{(i)}}\!\big( V_{t-1}^{(i)} + \varphi_t^{(i)}\varphi_t^{(i)\prime},\; \nu_{t-1}^{(i)} + 1 \big),
\]
where ζ_α(·) denotes the normalizing constant (Remark 6.1) of the Dirichlet distribution (A.49), and ζ_{A^{(i)},R^{(i)}}(·) denotes the normalizing constant of the Normal-inverse-Wishart distribution (A.11). Note, finally, that the w_{i,t} are probabilities, and so $\sum_{i=1}^{c} w_{i,t} = 1$, providing the constant of proportionality required in the last expression.

Step 2: Identical to Step 2. of the VB method (Section 6.5.1).

Step 3: f(θ|D_t) has the functional form given in (6.56), and f(l_t|D_t) ≡ f(l_t|D_t) (6.64) is given by (6.66).

Step 4: The standard form of f(θ|D_t) is given by (6.56), with shaping parameters (6.58)–(6.59). The standard form of f(l_t|D_t) is Multinomial (6.66), with shaping parameters w_t.

Step 5: f(l_t|D_t) (6.66) is tractable, and so its necessary (first-order) moments, $\widehat{l}_t$, are available in closed form (from (6.66)):
\[
\widehat{l}_t = w_t.
\tag{6.67}
\]
Hence, the shaping parameters, (6.58)–(6.59), of f(θ|D_t) are updated in closed form with respect to (6.67).

Step 6–7: Do not arise.

Step 8: Identical to step 8 of the VB method (Section 6.5.1).
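For the same simplified static Gaussian mixture used in the VB sketch above (known identity covariances and known weights—assumptions introduced for illustration only, not the full model of the text), the QB time step looks as follows: the label moment is the exact marginal f(l_t|D_t), obtained by integrating the component means out of the one-step update, and the conjugate statistics are then updated once, without IVB iterations (cf. (6.66)–(6.67)).

```python
import numpy as np
from scipy.stats import multivariate_normal

def qb_mixture_time_step(d, mu_prev, Sigma_prev, log_alpha):
    """One Quasi-Bayes time step for the simplified static Gaussian mixture
    (known identity covariances, known weights): exact label marginal,
    then a single weighted conjugate update of each component mean."""
    c, p = mu_prev.shape
    # exact label marginal: w_i proportional to alpha_i N(d; mu_i, I + Sigma_i)
    logw = np.array([log_alpha[i]
                     + multivariate_normal.logpdf(d, mean=mu_prev[i],
                                                  cov=np.eye(p) + Sigma_prev[i])
                     for i in range(c)])
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # single conjugate update of each component mean, weighted by w[i]
    mu = mu_prev.copy()
    Sigma = np.array([S.copy() for S in Sigma_prev])
    for i in range(c):
        prec_prev = np.linalg.inv(Sigma_prev[i])
        prec = prec_prev + w[i] * np.eye(p)
        Sigma[i] = np.linalg.inv(prec)
        mu[i] = Sigma[i] @ (prec_prev @ mu_prev[i] + w[i] * d)
    return mu, Sigma, w
```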

6.5.2.2 One-step Fixed-Form (FF) Approximation

The one-step FF approximation (Section 6.4.3) for a mixture of AR models (6.49) was derived in [133] via minimization of KLD_MR (3.5). The approximating posterior distribution was chosen to be of the same form as the VB-marginal (6.56):
\[
f_0(\{A\}_c, \{R\}_c, \alpha \mid \beta_t) = \mathrm{Di}_{\alpha}(\kappa_t) \prod_{i=1}^{c} \mathcal{N}i\mathcal{W}_{A^{(i)},R^{(i)}}\!\big( V_t^{(i)}, \nu_t^{(i)} \big).
\tag{6.68}
\]
Hence, β_t = {{V_t}_c, {ν_t}_c, κ_t} are the parameters to be optimized. The Bayes' rule update of (6.68) via (6.49) yields a mixture of c components, which is not in the form of (6.68). Hence, we approximate this exact update by the distribution of the kind (6.68) that is closest to it in the minimum-KLD_MR sense.

In this approach, the statistics V_t and ν_t are not additively updated, as they were in the VB-approximation (6.58)–(6.59). Instead, they are found as solutions of implicit equations [133]. This implies a considerably increased computational load per time step.

Both the VB-approximation (6.56) and FF-approximation (6.68) are one-step approximations, each projecting the exactly updated posterior distribution into the same family. Hence, the main distinction between the two methods is the criterion of optimality used. The VB-approximation minimizes KLD_VB, and is susceptible to local minima (see the remark in Step 6 on Page 35), while the FF-approximation minimizes KLD_MR, with a guaranteed global minimum [64]. For these reasons, the two methods of approximation produce different results, as illustrated in Fig. 3.2, and as we will see in the simulations that follow.

6.5.3 Simulation Study: On-line Inference of a Static Mixture

In this Section, we illustrate the properties of the VB-approximation for inference of static mixture models. A time-invariant, static mixture (also known as a mixture of Gaussians [134]) is a special case of the AR mixture model (6.49) under the assignment m_i = 1, ψ_t^{(i)} = 1, ∀i = 1, ..., c. Then, from (6.49),
\[
f(d_t \mid \{A\}_c, \{R\}_c, \alpha, \{\psi_t\}_c) = \sum_{i=1}^{c} \alpha_i\, \mathcal{N}_{d_t}\!\big( A^{(i)}, R^{(i)} \big),
\tag{6.69}
\]
where A^{(i)} ∈ R^{p×1}, and other symbols have their usual meaning.

One of the main challenges in on-line inference arises at the beginning of the procedure, i.e. when t is small. This constitutes, intrinsically, a stressful regime [135] for identification, since the number of available data is much smaller than the number of inferred parameters. This problem is known as initialization of the recursive algorithm [52]. Its Bayesian solution involves careful design of the prior distribution and updates which are robust to early mismodelling.

Bayesian recursive inference for the DEF family involves additive accumulation of the observations into the sufficient statistics (6.11). The same is true of the updates involved in the VB-approximation and QB-approximation (6.58) for AR mixtures. Therefore, any data record that has already been added to the sufficient statistics at an earlier t cannot be removed subsequently. This property is harmful in recursive identification of mixture models if the prior distribution is far from the posterior, since early misclassifications (into the wrong component (6.69)) influence all later parameter inferences. In the following simulations, we examine the influence of initialization on the finite-t performance of the various approximations.

6.5.3.1 Inference of a Many-Component Mixture

A mixture of 30 Gaussian distributions was used to construct the 2-dimensional (i.e. p = 2) 'true' observation model displayed in Fig. 6.13 (top-left), from which the data were generated. A modelling mismatch was simulated by choosing the number of components in the observation model to be c = 10. The prior distribution on the mixture parameters, θ = {{A}_c, {R}_c, α}, was chosen as the conjugate distribution (6.56) for the VB-observation model (6.55), with shaping parameters ν_0^{(i)} = 10, ∀i, α = (1/c) 1_{c,1}, and V_0^{(i)} as randomly generated positive definite matrices for each component, i = 1, ..., c. This ensures that the 10 components are not coincident a priori. If they were, they could not be separated during subsequent learning.

The task is to infer the mixture parameters, θ = {{A}_c, {R}_c, α}, on-line, i.e. at each time t = 1, 2, 3, .... In Fig. 6.13, we compare the HPD regions (Definition 2.1) of the approximate posterior distributions (6.56), updated via the VB, QB, and FF methods respectively. The inferences at times t = 5, 20 and 200 are displayed.

As expected, the distinctions between the methods of approximation are most obvious for small t. With increasing numbers of observations, all methods approximate the 'true' observation model well. Note that all methods were initialized with the same prior; the differences between them therefore manifest their different sensitivities to the prior. Furthermore, changes in the choice of prior can be expected to lead to different behaviours in all three methods during the small-t phase of on-line learning. We will not pursue this issue further, but the interested reader is referred to [32, 52].

Fig. 6.13. Comparison of performance of the VB, QB and FF approximations for recursive inference of the parameters of a 2D static mixture (panels: simulated mixture; prior; and, per column, the data and the VB, QB and FF inferences). The approximate inferences after t = 5, 20 and 200 observations are shown in rows 2–4 respectively.

6.5.3.2 Inference of a Two-Component Mixture

This simulation was designed to examine the sensitivity of the approximation methods to unavoidable misclassification in the early stages of on-line inference. We simulated a mixture of c = 2 static Gaussian components (6.69), with equal covariances, R^{(1)} = R^{(2)} = I_2, and with mean values A_1 = [0, 1]', A_2 = [0, a_2]', where we tested various settings of a_2 ∈ R_{[-1,1]}. Hence, the distance between the component means is δ_A = ||A_1 − A_2|| = 1 − a_2 ∈ R_{[0,2]}.

Our aim is to test the sensitivity of the VB, QB, and FF approximations to δ_A. These approximations, f(θ|D_t, I_i), for i = 1 (VB), i = 2 (QB) and i = 3 (FF), can be interpreted as competing models for the mixture parameters, θ = {{A}_c, {R}_c, α}. Hence, we can compare them using standard model comparison techniques [70]. The competing models are summarized in Table 6.1.

Table 6.1. Approximate inferences of (static) AR mixture parameters.

  Method   Model (6.56)                              Update of statistics
  VB       f(θ | {V_t}_c, {ν_t}_c, κ_t, I_1)         (6.58)–(6.63)
  QB       f(θ | {V_t}_c, {ν_t}_c, κ_t, I_2)         (6.58)–(6.62), (6.67)
  FF       f(θ | {V_t}_c, {ν_t}_c, κ_t, I_3)         see [133]

The posterior probability of each model is proportional to the predictor of D_t, assuming a uniform model prior [42, 70, 136]:
\[
f(I_i|D_t) \propto f(D_t|I_i) = \int_{\Theta^*} f(D_t|\theta)\, f\big(\theta \mid \{V_t\}_c, \{\nu_t\}_c, \kappa_t, I_i\big)\, d\theta, \quad i = 1, \ldots, 3.
\tag{6.70}
\]
Exact evaluation of (6.70) is not tractable, since f(D_t|θ) is a mixture of 2^t components. Therefore, we approximate (6.70) by
\[
f(D_t|I_i) \equiv \prod_{\tau=1}^{t} \int_{\Theta^*} f(d_\tau|\theta)\, f\big(\theta \mid \{V_t\}_c, \{\nu_t\}_c, \kappa_t, I_i\big)\, d\theta, \quad i = 1, \ldots, 3,
\tag{6.71}
\]
i.e. using a step-wise marginalization procedure. In evaluating (6.71), we first elicit the approximate distributions, f(θ | {V_t}_c, {ν_t}_c, κ_t, I_i), using all t observations, and then use these terminal approximations in the step-wise marginalizations. This allows the terminal approximations to be compared.

Since we are using an approximate marginalization in (6.71), the evaluated probabilities may not be accurate. However, the form of the approximate posterior distribution (6.56) is the same for all methods (Table 6.1), and so it is reasonable to assume that the approximation affects all models in the same way. Therefore, we can at least expect that the ranking of the models, using (6.71), will be reliable.
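A minimal sketch (Python; the predictive_logpdfs callables are hypothetical placeholders) of this step-wise evaluation, assuming each candidate model supplies a routine returning the log predictive density of a single observation under its terminal approximate posterior:

```python
import numpy as np

def compare_models(data, predictive_logpdfs):
    """Step-wise approximation (6.71) of the evidence of each candidate model
    I_i, combined with a uniform model prior as in (6.70).
    predictive_logpdfs[i](d) is assumed to return the log of
    integral f(d|theta) f(theta | {V_t}, {nu_t}, kappa_t, I_i) dtheta."""
    log_ev = np.array([sum(lp(d) for d in data) for lp in predictive_logpdfs])
    post = np.exp(log_ev - log_ev.max())          # avoid numerical underflow
    return log_ev, post / post.sum()              # log-evidences and f(I_i | D_t)
```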

The terminal predictions (6.71) after t = 50 observations, using each of the three approximations, were examined in a Monte Carlo study. For each setting of the inter-cluster distance, δ_A ∈ R_{[0,2]}, 40 randomly initialized inference cycles were undertaken. The sample mean of (6.71), i = 1, 2, 3, was plotted in Fig. 6.14 as a function of δ_A.

Misclassifications of observations for small t affect the terminal (t = 50) approximate distributions. The amount of misclassification depends on the inter-component distance, δ_A. For δ_A < 0.3, the data are generated from a quasi-one-component observation model, misclassifications are rare, and all methods perform well. With increasing δ_A, low-t misclassifications proliferate, incorporating greater errors in the accumulating statistics. The main finding from Fig. 6.14 is that the three approximations have different sensitivities to these early misclassifications. The QB-approximation is the most sensitive to initialization, while the VB-approximation is more robust. Recall that both of these approximations update their statistics additively (6.58). The FF method of approximation is far more robust to initialization. The main reason for this is non-additive updating of statistics (Section 6.5.2.2).

Fig. 6.14. Comparison of the prediction performance of the VB, QB, and FF methods of approximation, as a function of increasing inter-component distance for a c = 2-component static mixture (approximate log-predictor, f(D_50|I_i), versus inter-component distance, δ_A, for VB (i = 1), QB (i = 2) and FF (i = 3)).

Remark 6.5 (Time-variant parameterization). The problem of initialization can be addressed by treating the parameters θ as time-variant. Under this assumption, various techniques are available for eliminating statistics introduced by historic (low-t) data:

Discounting the contribution made by historic data [26]; this is a variant of forgetting [52], which will be discussed in Chapter 7.

Repetitive runs of the inference procedure, with the prior in the next run modified by the posterior from the previous run [52]. This technique is, naturally, appropriate only in off-line mode.

6.5.4 Data-Intensive Applications of Dynamic Mixtures

The mixture model in (6.49) is highly flexible, and can be used to capture correlation between data channels, d_{i,t}, i = 1, ..., p, via the component covariance matrices, R^{(i)}, as well as temporal correlations, via regression onto the data history (6.12):
\[
\psi_t^{(i)} = \psi_i\big( d_{t-1}, \ldots, d_{t-\partial_i}, \xi_t^{(i)} \big), \quad i = 1, \ldots, c.
\tag{6.72}
\]
Here, as usual, ∂_i denotes the maximum data memory in the ith component, and ξ_t^{(i)} are the exogenous variables. We refer to a component as dynamic if ∂_i ≥ 1, and static if ψ_t^{(i)} is a function only of exogenous variables, ξ_t^{(i)}, and not of past data, an example of which was studied in Section 6.5.3.

The use of multiple components is appropriate when capturing distinct modes in the data. Consider the case of a Process-Operator Loop (POL), where not only the process under operator control is subject to intrinsic changes in behaviour, but the human operator, too, may typically interact with the system in a finite number of different ways. The resulting switches in the dynamics of the POL can be captured by the mixture model (6.49) with appropriate channel and temporal correlations.

Several data-intensive applications of dynamic mixture modelling were examined by the EU IST project, ProDaCTool [137]. In each application, the number of channels, p, was large (typically several dozen). The task was to identify an appropriate structure and parameterization for the dynamic mixture model, sufficient to capture the various inter-channel and inter-sample dependences. To this end, a large off-line database of observations,
\[
D_t = \{d_1, \ldots, d_t\},
\]
was gathered during the off-line phase. Typically, the number of data, t, was of the order of thousands. The challenging task of model identification in these high-dimensional, large-database problems was successfully addressed using recursive variational inference methods [127]. In order to avoid IVB iterations (Algorithm 1), the Quasi-Bayes (QB) method of approximation (Section 6.4.1) was chosen. A MATLAB toolbox, called MixTools [138], was developed as part of the ProDaCTool project, for identification in these multicomponent, high-dimensional, data-intensive dynamic mixture modelling problems. It is in data-intensive applications like these that stochastic sampling-based methods are prohibitive.

The estimated mixtures were studied for their ability to predict future data emerging from the POL. If the database was sufficiently extensive to capture all key inter-mode transitions, as well as the significant correlations within each mode, then the inferred mixture model would be capable of accurate prediction of future behaviour. This predictive capability of the mixture was to be used as a resource in designing an advisory system [127]. The advisory system could be used on-line to generate optimized recommendations appropriate for the current mode of operation of the POL. These recommendations were made available to the operator via an appropriately designed Graphical User Interface (GUI) [139].

Experience in the following three ProDaCTool application domains was reported in [139]:

1. Urban vehicular traffic prediction;
2. An advisory system for radiotherapy of thyroid carcinoma;
3. An advisory system for a metal rolling mill.

In all cases, the MixTools suite was used in the off-line phase to identify a dynamic or static mixture model as appropriate, using the QB-approximation. In applications 2 and 3, the inferred model was used in the off-line design of an appropriate advisory system, which then generated recommendations during the on-line phase. In application 1, no advisory design was undertaken, but the ability of the inferred mixture models to predict future traffic states was examined. This application is now briefly reviewed.


6.5.4.1 Urban Vehicular Traffic Prediction

From (6.27) and (6.49), the one-step-ahead predictor for the dynamic mixture is a linear combination of Student's t-distributions:
\[
f(d_{t+1}|D_t) = \sum_{i=1}^{c} \frac{\kappa_{i,t}}{\kappa_t' \mathbf{1}_{c,1}}\;
\frac{\zeta_{A^{(i)},R^{(i)}}\Big( V_t^{(i)} + \big[d_{t+1}', \psi_{t+1}^{(i)\prime}\big]'\big[d_{t+1}', \psi_{t+1}^{(i)\prime}\big],\; \nu_t^{(i)} + 1 \Big)}
{\sqrt{2\pi}\;\zeta_{A^{(i)},R^{(i)}}\big( V_t^{(i)}, \nu_t^{(i)} \big)},
\tag{6.73}
\]
where the statistics, {V_t}_c, {ν_t}_c, κ_t, are updated via the QB-approximation, (6.58)–(6.59) and (6.67).

On the other hand, static mixtures (Section 6.5.3) cannot, formally, provide temporal predictions. Nevertheless, if the parameters are assumed to be one-step invariant, then the data at time t + 1 can be predicted informally via an estimated static observation model (Section 6.5.3) at time t:
\[
f(d_{t+1}|D_t) \approx \sum_{i=1}^{c} \widehat{l}_{i,t}\, \mathcal{N}_{d_{t+1}}\big( \widehat{A^{(i)}}_t, \widehat{R^{(i)}}_t \big).
\tag{6.74}
\]
Here, $\widehat{A^{(i)}}_t$ and $\widehat{R^{(i)}}_t$ are the terminal posterior means (6.26) of the Normal component parameters, and $\widehat{l}_t$ are the estimated component weights (6.67) at time t.
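A small illustrative sketch (Python; uses SciPy's multivariate normal density) of the informal plug-in predictor (6.74):

```python
import numpy as np
from scipy.stats import multivariate_normal

def static_mixture_predictor(d_next, A_hat, R_hat, l_hat):
    """Plug-in predictive density (6.74) for a static mixture: a mixture of
    Normals using the terminal posterior means A_hat[i], R_hat[i] of the
    component parameters and the estimated weights l_hat (6.67)."""
    return sum(l_hat[i] * multivariate_normal.pdf(d_next,
                                                  mean=A_hat[i], cov=R_hat[i])
               for i in range(len(l_hat)))
```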

Fig. 6.15. Traffic intensity and density records for the Strahov tunnel, Prague (panels: time course of intensity; time course of density).

In this study, a snapshot, d_t, of the traffic state in the busy Strahov traffic tunnel is provided via an array of synchronized sensors placed at 5 regularly-spaced points along each of two northbound lanes of the tunnel. Each sensor detects a traffic intensity measure, q, and density measure, ρ (Fig. 6.15). Hence, the number of channels is p = 2 × 2 × 5 = 20. Measurements are recorded every 5 minutes for 4 weeks, generating an archive of t = 8064 observations. The anticipated daily and weekly periodicities are evident in Fig. 6.15. An extra one day's data (288 items) are reserved for validation of predictions. The task examined here is to predict the future traffic state (system output), y_t = [d_{1,t}, d_{2,t}]', at the northbound exit of the Strahov Tunnel, using this off-line archive. The system 'input' is then considered to be the traffic states recorded by the intra-tunnel sensors: u_t = [d_{3,t}, ..., d_{20,t}]'. Clearly, there will be strong correlation (i) between sensors, encouraging a regression model between the channels, d_{i,t}, and (ii) from snapshot to snapshot, encouraging a dynamic model. The prospect of capturing multimodal behaviour (Section 6.5.4) via mixtures of these models is examined. The estimation, c, of the number of components within the MixTools suite will not be described here. Marginalization over u_{t+1} in (6.73) or (6.74) generates the required tunnel output predictor, f(y_{t+1}|D_t). These predictors were evaluated in the context of the 288 on-line data items mentioned already.

Fig. 6.16. Static predictions of traffic intensity at the tunnel output. Left: one-component regression; right: mixture of c components.

The performance of the predictors under static modelling (6.74) is displayed in Fig. 6.16. The one-component regression cannot react to changing modes in the data, and simply predicts the conditional mean of the traffic intensity. The mixture model predictor incorporates the posterior mean estimate, $\widehat{l}_t$, of the active component label (6.67), permitting far superior prediction of the multimodal data. For static one- and multicomponent models, Table 6.2 shows the Prediction Error (PE) coefficient, expressed as the ratio of the standard deviation of the prediction error to that of the respective output channel. It confirms the enhanced predictive capability of the mixture. For completeness, the τ > 1-step-ahead predictor was also examined. The PE increases, as expected, with τ.

The predictive capabilities of the dynamic mixture model (6.73) were then investigated. In this case, a first-order temporal regression is assumed for all components; i.e. ∂_i = 1, i = 1, ..., c (6.72). A single dynamic component is again compared to a mixture of dynamic components. In Table 6.3, output channel PEs are recorded for both the 2-D predictive marginal and for the full 20-D prediction. Multi-step-ahead predictors are, again, also considered.

Table 6.2. Prediction Error (PE) for τ-step-ahead static predictions of traffic tunnel output.

         c = 1                    c > 1
  τ      Intensity   Density      Intensity   Density
  1      1.021       1.020        0.418       0.422
  6      1.025       1.024        0.512       0.516
  12     1.031       1.030        0.634       0.638
  18     1.028       1.037        0.756       0.760

Table 6.3. Prediction Error (PE) for τ-step-ahead dynamic predictions of traffic tunnel output (based on marginal prediction, y_{t+τ}, and full state prediction, d_{t+τ}, respectively).

         marginal prediction of output                joint prediction of full Tunnel state
         c = 1                  c > 1                 c = 1                  c > 1
  τ      Intensity  Density     Intensity  Density    Intensity  Density     Intensity  Density
  1      0.376      0.378       0.391      0.392      0.330      0.331       0.328      0.329
  6      0.526      0.521       0.487      0.483      0.444      0.436       0.421      0.416
  12     0.711      0.706       0.614      0.615      0.576      0.566       0.524      0.526
  18     0.840      0.836       0.758      0.766      0.697      0.688       0.654      0.644

Once again, the improvement offered by the mixture model over the single-component regression is evident. In general, PE increases with τ, as before. Comparing with Table 6.2, the following conclusions are reached:

• PEs for the dynamic models are significantly lower than for the static ones. The improvement is more marked for the single-component regressions than for the mixtures, and tends to diminish with increasing τ;

• Mixtures lead to better prediction than single regressions under both the static and dynamic modelling assumptions, though the improvement is less marked in the dynamic case (being between about 1 and 5% in terms of PE). In off-line identification of the model, some components had small weights, but their omission caused a clearly observable reduction in prediction quality.

The current study clearly indicates the benefits of dynamic mixture modelling for high-dimensional traffic data analysis. Dynamic mixtures are capable of accurate prediction many steps (e.g. 90 minutes) ahead. Hence, it is reasonable to expect that they can serve for effective advisory design, since components arise, not only due to intrinsic changes in traffic state, but also as a result of traffic sequence choices made by the operator [137].

6.6 Conclusion

The rôle of the VB-approximation in recursive inference of time-invariant parametric models was explored in this Chapter. Three fundamental scenarios were examined carefully, leading, respectively, to (i) approximations of intractable marginals at each time-step, (ii) propagation of reduced sufficient statistics for dynamic exponential family (DEF) models, and (iii) achievement of conjugate updates in DEF models with hidden variables (i.e. DEFH models) where exact on-line inference cannot be achieved recursively. The VB-observation model proved to be a key mathematical object arising from use of the VB-approximation in on-line inference. The Bayes' rule updates in scenarios II and III are with respect to this object, not the original observation model. Conjugate distributions (and, therefore, Bayesian recursive inference) may, therefore, be available in cases (like scenario III) where none exists under an exact analysis. We applied the VB-approximation to on-line inference of mixtures of autoregressive models, this being an important instance of scenario III. In fact, the VB-approximation is a flexible tool, potentially allowing other scenarios to be examined. In Chapter 7, the important extension to time-variant parameters will be studied.

7 On-line Inference of Time-Variant Parameters

The concept of Bayesian filtering was introduced in Section 2.3.2. The Kalman filter is the most famous example of a Bayesian filtering technique [113]. It is widely used in various engineering and scientific areas, including communications, control [118], machine learning [28], economics, finance, and many others. The assumptions of the Kalman filter—namely linear relations between states and a Gaussian distribution for disturbances—are, however, too restrictive for many practical problems. Many extensions of Kalman filtering theory have been proposed. See, for example, [140] for an overview. These extensions are not exhaustive, and new approaches tend to emphasize two priorities: (i) higher accuracy, and (ii) computational simplicity.

Recently, the area of non-linear filtering has received a lot of attention and various approaches have been proposed, most of them based on Monte Carlo sampling techniques, such as particle filtering [141]. The main advantage of these sampling techniques is the arbitrarily high accuracy they can achieve given a sufficient amount of sampling. However, the associated computational cost can be excessive. In this Chapter, we study properties of the Variational Bayes approximation in the context of Bayesian filtering. The class of problems to which the VB method will be applied is smaller than that of particle filtering. The accuracy of the method can be undermined if the approximated distribution is highly correlated. Then, the assumption of conditional independence—which is intrinsic to the VB method—is too severe a restriction. In other cases, the method can provide inferential schemes with interesting computational properties. Various adaptations of the VB method—such as the Restricted VB method (Section 3.4.3)—can be used to improve computational efficiency at the expense of minor loss of accuracy.

7.1 Exact Bayesian Filtering

All aspects of on-line inference of parameters discussed in Chapter 6—i.e. restrictions on the 'knowledge base' of data and computational complexity of each step—are also relevant here. Note that the techniques of Chapter 6 can be seen as a special case of Bayesian filtering, with the trivial parameter-evolution model (2.20), i.e. θ_t = θ_{t-1}. In the case of time-variant parameters, the problem of parameter inference becomes more demanding, since it involves one extra operation, i.e. marginalization of previous parameters (2.21), which we called the time-update or prediction. The computational flow of Bayesian filtering is displayed in Fig. 2.4.

Once again, a recursive algorithm for exact Bayesian filtering is possible if the inference is functionally invariant at each time t. This can be achieved if

(i) The observation model (2.19) is a member of the DEF family. Then, there exists a CDEF parameter predictor (Section 6.2.1), f(θ_t|D_{t-1}), which enjoys a conjugate data update.

(ii) The parameter evolution model is such that the filtering output from the previous step, f(θ_{t-1}|D_{t-1}), is updated to a parameter predictor, f(θ_t|D_{t-1}), of the same form. Then, the same functional form is achieved in both updates of Bayesian filtering (Fig. 2.4).

Analysis of models (2.19), (2.20)—which exhibit these properties—was undertaken in [142, 143]. The class of such models is relatively small—the best known example is the Kalman filter (below)—and, therefore, approximate methods of Bayesian filtering must be used for many practical scenarios.

Remark 7.1 (Kalman filter). The famous Kalman filter [113] arises when (2.19) and (2.20) are linear in parameters with Normally distributed noise, as follows:
\[
f(\theta_t|\theta_{t-1}) = \mathcal{N}_{\theta_t}(A\theta_{t-1}, R_\theta),
\tag{7.1}
\]
\[
f(d_t|\theta_t, D_{t-1}) = \mathcal{N}_{d_t}(C\theta_t, R_d),
\tag{7.2}
\]
where matrices A, R_θ, C, and R_d are shaping parameters, which must be known a priori. Both conditions of tractability above are satisfied, as follows:

(i) The observation model (7.2) is from the DEF family, for which the conjugate distribution is Normal, i.e. the posterior is also Normal,
\[
f(\theta_t|D_t) = \mathcal{N}_{\theta_t}(\mu_t, \Sigma_t),
\tag{7.3}
\]
with shaping parameters μ_t and Σ_t. This implies that the prior f(θ_1) is typically chosen in the form of (7.3), with prior parameters μ_1 and Σ_1.

(ii) The distribution f(θ_t, θ_{t-1}|D_{t-1}) is a Normal distribution, whose marginal, f(θ_t|D_{t-1}), is also Normal, and, therefore, in the form of (7.3).

The task of Bayesian filtering is then fully determined by evolution of the shaping parameters, μ_t and Σ_t, which form the sufficient statistics of f(θ_t|D_t) (7.3). Evaluation of the distributions can, therefore, be transformed into recursions on {μ_t, Σ_t} [42, 118].
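These recursions are the familiar Kalman filter equations. A minimal sketch (Python) of one filtering step under the model (7.1)–(7.3):

```python
import numpy as np

def kalman_step(mu, Sigma, d, A, R_theta, C, R_d):
    """One Bayesian-filtering step for (7.1)-(7.2): the time update
    (marginalization over theta_{t-1}) followed by the Bayes' rule data
    update, expressed as recursions on the shaping parameters of (7.3)."""
    # time update: f(theta_t | D_{t-1}) = N(A mu, A Sigma A' + R_theta)
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + R_theta
    # data update via Bayes' rule with the Normal observation model (7.2)
    S = C @ Sigma_pred @ C.T + R_d            # predictive covariance of d_t
    K = Sigma_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    mu_new = mu_pred + K @ (d - C @ mu_pred)
    Sigma_new = Sigma_pred - K @ C @ Sigma_pred
    return mu_new, Sigma_new
```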

Remark 7.2 (Bayesian filtering and smoothing). As we saw in Section 6.3.3, the integration operator can be used after the Bayes' rule operator, marginalizing the posterior distribution, f(θ_t, θ_{t-1}|D_t). The two cases are (i) integration over θ_{t-1}, yielding f(θ_t|D_t) (this is known as Bayesian filtering); and (ii) integration over θ_t, yielding f(θ_{t-1}|D_t) (this is known as Bayesian smoothing). Both options are displayed in Fig. 7.1. Note that f(θ_{t-1}|D_t) is not propagated to the next step.

Fig. 7.1. Bayesian filtering and smoothing.

7.2 The VB-Approximation in Bayesian Filtering

We have seen how to use the VB operator both for approximation of the full distribution (Fig. 3.3) and for generation of VB-marginals (Fig. 3.4). In the case of Bayesian filtering, we replace the marginalization operators in Fig. 7.1 with the VB-approximation operator, V (Fig. 3.4). The resulting composition of operators is illustrated in Fig. 7.2.

Fig. 7.2. The VB approximation in Bayesian filtering.

In this scenario, the joint distribution is given by Proposition 2.1:
\[
f(d_t, \theta_t, \theta_{t-1}|D_{t-1}) = f(d_t|\theta_t, D_{t-1})\, f(\theta_t|\theta_{t-1})\, f(\theta_{t-1}|D_{t-1}),
\tag{7.4}
\]
where we have used the VB-marginal from the previous step, f(θ_{t-1}|D_{t-1}). We seek VB-marginals of θ_t and θ_{t-1} (Fig. 7.2). Using Theorem 3.1, the minimum of KLD_VB is reached for
\[
\begin{aligned}
f(\theta_t|D_t) &\propto \exp\big( \mathrm{E}_{f(\theta_{t-1}|D_t)}\big[\ln f(d_t, \theta_t, \theta_{t-1}|D_{t-1})\big] \big) \\
&\propto \exp\big( \mathrm{E}_{f(\theta_{t-1}|D_t)}\big[\ln f(d_t|\theta_t, D_{t-1}) + \ln f(\theta_t|\theta_{t-1}) + \ln f(\theta_{t-1}|D_{t-1})\big] \big) \\
&\propto f(d_t|\theta_t, D_{t-1})\, \exp\big( \mathrm{E}_{f(\theta_{t-1}|D_t)}\big[\ln f(\theta_t|\theta_{t-1})\big] \big),
\end{aligned}
\tag{7.5}
\]
\[
\begin{aligned}
f(\theta_{t-1}|D_t) &\propto \exp\big( \mathrm{E}_{f(\theta_t|D_t)}\big[\ln f(d_t, \theta_t, \theta_{t-1}|D_{t-1})\big] \big) \\
&\propto \exp\big( \mathrm{E}_{f(\theta_t|D_t)}\big[\ln f(d_t|\theta_t, D_{t-1}) + \ln f(\theta_t|\theta_{t-1}) + \ln f(\theta_{t-1}|D_{t-1})\big] \big) \\
&\propto \exp\big( \mathrm{E}_{f(\theta_t|D_t)}\big[\ln f(\theta_t|\theta_{t-1})\big] \big)\, f(\theta_{t-1}|D_{t-1}).
\end{aligned}
\tag{7.6}
\]

Hence, the VB-approximation for Bayesian filtering—which we will call VB-filtering—can be seen as two parallel Bayes' rule updates:
\[
f(\theta_t|D_t) \propto f(d_t|\theta_t, D_{t-1})\, f(\theta_t|D_{t-1}),
\tag{7.7}
\]
\[
f(\theta_{t-1}|D_t) \propto f(d_t|\theta_{t-1}, D_{t-1})\, f(\theta_{t-1}|D_{t-1}).
\tag{7.8}
\]
The following approximate distributions are involved:
\[
f(\theta_t|D_{t-1}) \propto \exp\big( \mathrm{E}_{f(\theta_{t-1}|D_t)}\big[\ln f(\theta_t|\theta_{t-1})\big] \big),
\tag{7.9}
\]
\[
f(d_t|\theta_{t-1}, D_{t-1}) \propto \exp\big( \mathrm{E}_{f(\theta_t|D_t)}\big[\ln f(\theta_t|\theta_{t-1})\big] \big).
\tag{7.10}
\]

Fig. 7.3. VB-filtering, indicating the flow of VB-moments via IVB cycles.

Remark 7.3 (VB-filtering and VB-smoothing). The objects generated by the VB approximation are as follows:

VB-parameter predictor, f(θ_t|D_{t-1}) (7.9). This is generated from the parameter evolution model (2.20) by substitution of VB-moments from (7.8). It is updated by the (exact) observation model to obtain the VB-filtering distribution, f(θ_t|D_t) (see the upper part of Fig. 7.3).

VB-observation model, f(d_t|θ_{t-1}, D_{t-1}) (7.10). Once again, this is generated from the parameter evolution model (2.20), this time by substitution of VB-moments from (7.7). It has the rôle of the observation model in the lower Bayes' rule update in Fig. 7.3, updating f(θ_{t-1}|D_{t-1}) from the previous time step to obtain the VB-smoothing distribution, f(θ_{t-1}|D_t).

Notes on VB-filtering and VB-smoothing

• The time update and data (Bayes' rule) update of (exact) Bayesian filtering (Fig. 2.4) are replaced by two Bayes' rule updates (Fig. 7.3).

• The VB-smoothing distribution is not propagated, but its moments are used to generate shaping parameters of the VB-parameter predictor, f(θ_t|D_{t-1}) (7.9).

• The functional forms of both inputs to the Bayes' rule operators, B, in Fig. 7.3 are fixed. Namely, (i) the VB-parameter predictor (7.9) for the VB-filtering update is determined by the parameter evolution model (2.20), and (ii) the posterior distribution, f(θ_{t-1}|D_{t-1}), is propagated from the previous step. Therefore, there is no distribution in the VB-filtering scheme (Fig. 7.3) which can be assigned using the conjugacy principle (Section 6.2).

• The same functional form is preserved in the VB-filtering distribution—i.e. (7.7) via (7.9)—at each time step, t; i.e. f(θ_{t-1}|D_{t-1}) is mapped to the same functional form, f(θ_t|D_t). This will be known as VB-conjugacy.

• Only the VB-marginals, (7.7) and (7.8), are needed to formulate the VB-equations, which are solved via the IVB algorithm (Algorithm 1).

The VB-approximation yields a tractable recursive inference algorithm if the joint distribution, ln f(d_t, θ_t, θ_{t-1}|D_{t-1}) (7.4), is separable in parameters (3.21). However, as a consequence of Proposition 2.1, the only object affected by the VB-approximation is the parameter evolution model, f(θ_t|θ_{t-1}) (see (7.5)–(7.6)). Hence, we require separability only for this distribution:
\[
f(\theta_t|\theta_{t-1}) = \exp\big( g(\theta_t)'\, h(\theta_{t-1}) \big).
\tag{7.11}
\]
The only additional requirement is that all necessary moments of the VB-smoothing and VB-filtering distributions, (7.7) and (7.8), be tractable.
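To make the two coupled Bayes' rule updates of Fig. 7.3 concrete, the following sketch (Python; an illustration, not from the book) applies them to a scalar random-walk model with known variances, θ_t = θ_{t-1} + N(0, q), d_t = θ_t + N(0, r). This model is, of course, exactly tractable (it is a special case of the Kalman filter of Remark 7.1); it is used here purely to show the IVB cycles between the VB-filtering and VB-smoothing updates.

```python
def vb_filter_step(mu_prev, s2_prev, d, q, r, n_ivb=10):
    """VB-filtering (7.7)-(7.10) for a scalar random walk with known
    state-noise variance q and observation-noise variance r.
    (mu_prev, s2_prev): Normal VB-filtering marginal of theta_{t-1}."""
    m_t = mu_prev            # current guess for E[theta_t | D_t]
    m_s = mu_prev            # E[theta_{t-1} | D_t] (VB-smoothing moment)
    s2_t = s2_prev
    for _ in range(n_ivb):   # IVB cycles
        # VB-filtering update (7.7): exact observation model N(d; theta_t, r)
        # times the VB-parameter predictor (7.9) = N(theta_t; m_s, q)
        s2_t = 1.0 / (1.0 / r + 1.0 / q)
        m_t = s2_t * (d / r + m_s / q)
        # VB-smoothing update (7.8): N(theta_{t-1}; mu_prev, s2_prev) times
        # the VB-observation model (7.10) = N(theta_{t-1}; m_t, q)
        s2_s = 1.0 / (1.0 / s2_prev + 1.0 / q)
        m_s = s2_s * (mu_prev / s2_prev + m_t / q)
    return m_t, s2_t
```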

Remark 7.4 (VB-approximation after time update). In this Section, we have studied the replacement of the marginalization operator in Fig. 7.1 by the VB-approximation (i.e. it follows the Bayes' rule update). Approximation of the marginalization operator before the Bayes' rule update (Fig. 2.4) yields results in the same form as (7.5)–(7.6). The only difference is that the VB-moments are with respect to f(θ_t|D_{t-1}) and f(θ_{t-1}|D_{t-1}), instead of f(θ_t|D_t) and f(θ_{t-1}|D_t), respectively. In this case, the current observation, d_t, has no influence on the VB-marginals, its influence being deferred to the subsequent (exact) data update step.

Remark 7.5 (Possible scenarios). We could formulate many other scenarios for VB-filtering using the principles developed in Chapter 6. Specifically, (i) the models could be augmented using auxiliary variables; and (ii) one-step approximations could be introduced at any point in the inference procedure. The number of possible scenarios grows with the number of variables involved in the computations. In applications, the VB-objects—namely the VB-marginals, VB-observation models and VB-parameter predictors—may be used freely, and combined with other approximations.

7.2.1 The VB method for Bayesian Filtering

The VB method was derived in detail for off-line inference in Section 3.3.3. It was adapted for on-line inference of time-invariant parameters in Section 6.3.4. In this Section, we summarize the VB method in Bayesian filtering.

Step 1: Choose a Bayesian Model. We assume that we have (i) the parameter evolution model (2.20), and (ii) the observation model (2.19). The prior on θ_t, t = 1—which will be in the same form as the VB-filtering distribution f(θ_t|D_t) (7.7)—will be determined by the method in Step 4.

Step 2: Partition the parameters, choosing θ_1 = θ_t and θ_2 = θ_{t-1} (3.9). However, in what follows, we will use the standard notation, θ_t and θ_{t-1}. The parameter evolution model, f(θ_t|θ_{t-1}) (2.20), must have separable parameters (7.11).

Step 3: Write down the VB-marginals. We must inspect the VB-filtering (7.7) and VB-smoothing (7.8) distributions.

Step 4: Identify standard forms. Using Fig. 7.3, the standard forms are identified in the following order:

a) The VB-filtering distribution (7.7) is obtained via the VB-parameter predictor (7.9) and the (exact) observation model (2.19). The prior, f(θ_1), is chosen with the same functional form as this VB-filtering distribution.

b) The output of a) from the previous time step, t − 1, is multiplied by the VB-observation model (7.10), yielding the VB-smoothing distribution (7.8).

Step 5: Unchanged.

Step 6: Unchanged.

Step 7: Run the IVB algorithm. An upper bound on the number of iterations of the IVB algorithm may be enforced, as was the case in Section 6.3.4. However, no asymptotic convergence proof is known for the time-variant parameter context of VB-filtering.

Step 8: Unchanged.

7.3 Other Approximation Techniques for Bayesian Filtering

Many methods of approximation exist for Bayesian filtering [140], and extensive research is currently being undertaken in this area [74, 141]. We now review three techniques which we will use for comparison with the VB approximation later in this Chapter.

7.3.1 Restricted VB (RVB) Approximation

The RVB approximation was introduced in Section 3.4.3. Recall that the Quasi-Bayes (QB) approximation is a special case, and this was specialized to the on-line time-invariant case in Section 6.4.1. The key benefit of the RVB approach is that the solution to the VB-equations is found in closed form, obviating the need for the IVB algorithm (Algorithm 1). The challenge is to propose a suitable restriction.

As suggested in Section 6.4.1, we prefer to impose the restriction on the VB-marginal which is not propagated to the next step. In this case, it is the VB-smoothing distribution, f(θ_{t-1}|D_t) (7.8), which is not propagated (Fig. 7.3). We propose two possible restrictions:

Non-smoothing restriction: the current observation, d_t, is not used to update the distribution of θ_{t-1}; i.e.
\[
f(\theta_{t-1}|D_t) = f(\theta_{t-1}|D_t) \equiv f(\theta_{t-1}|D_{t-1}).
\]
This choice avoids the VB-smoothing update (i.e. the lower part of Fig. 7.3), implying the inference scheme in Fig. 7.4.

Fig. 7.4. The RVB approximation for Bayesian filtering, using the non-smoothing restriction.

QB restriction: the true marginal of (7.4) is used in place of the VB-smoothing distribution:
\[
f(\theta_{t-1}|D_t) \equiv f(\theta_{t-1}|D_t) \propto \int_{\Theta_t^*} f(d_t, \theta_t, \theta_{t-1}|D_{t-1})\, d\theta_t.
\tag{7.12}
\]
Note that (7.12) is now the exact smoothing distribution arising from Bayesian filtering (Fig. 7.1). The VB-filtering scheme in Fig. 7.3 is adapted to the form illustrated in Fig. 7.5.

Fig. 7.5. The QB-approximation for Bayesian filtering.

These restrictions modify the VB method as described in Section 3.4.3.1.


7.3.2 Particle Filtering

The particle filtering technique [141] is also known as the sequential Monte Carlo method [74]. It provides the necessary extension of stochastic approximations (Section 3.6) to the on-line case. Recall that the intractable posterior distribution (2.22) is approximated by the empirical distribution (3.59),

\[
f(\theta_t|D_t) = \frac{1}{n}\sum_{i=1}^{n} \delta\!\left(\theta_t - \theta_t^{(i)}\right), \qquad (7.13)
\]
\[
\theta_t^{(i)} \sim f(\theta_t|D_t). \qquad (7.14)
\]

Here, the i.i.d. samples, {θt(i)}, i = 1, . . . , n, are known as the particles. The posterior moments of (7.13), its marginals, etc., are evaluated with ease via summations (3.60):
\[
\mathrm{E}_{f(\theta_t|D_t)}\!\left[ g(\theta_t) \right] = \int_{\Theta_t^*} g(\theta_t)\, f(\theta_t|D_t)\, d\theta_t = \frac{1}{n}\sum_{i=1}^{n} g\!\left(\theta_t^{(i)}\right). \qquad (7.15)
\]

Typically, the true filtering distribution, f (θt|Dt), is intractable, and so the required particles (7.14) cannot be generated from it. Instead, a distribution fa (θt|Dt) is introduced in order to evaluate the required posterior moments:
\[
\mathrm{E}_{f(\theta_t|D_t)}\!\left[ g(\theta_t) \right] = \int_{\Theta_t^*} g(\theta_t)\, \frac{f(\theta_t|D_t)}{f_a(\theta_t|D_t)}\, f_a(\theta_t|D_t)\, d\theta_t. \qquad (7.16)
\]

The distribution fa (θt|Dt) is known as the importance function, and is chosen—among other considerations—to be tractable, in the sense that i.i.d. sampling from it is possible. By drawing the random sample, {θt(i)}, i = 1, . . . , n, from fa (θt|Dt), (7.16) can be approximated by
\[
\mathrm{E}_{f(\theta_t|D_t)}\!\left[ g(\theta_t) \right] \approx \sum_{i=1}^{n} w_{i,t}\, g\!\left(\theta_t^{(i)}\right),
\]
where
\[
w_{i,t} \propto \frac{f\!\left(D_t, \theta_t^{(i)}\right)}{f_a\!\left(\theta_t^{(i)}|D_t\right)}. \qquad (7.17)
\]

The constant of proportionality is provided by the constraint that the weights sum to unity, Σi wi,t = 1.

A recursive algorithm for inference of wi,t can be achieved if the importance function satisfies the following property with respect to the parameter trajectory, Θt = [Θt−1, θt] (2.16):
\[
f_a(\Theta_t|D_t) = f_a(\theta_t|D_t, \Theta_{t-1})\, f_a(\Theta_{t-1}|D_{t-1}). \qquad (7.18)
\]
This is a weaker restriction than the Markov restriction imposed on f (Θt|Dt) by Proposition 2.1. Using (7.18), the weights (7.17) can be updated recursively, as follows:


\[
w_{i,t} \propto w_{i,t-1}\, \frac{f\!\left(d_t|\theta_t^{(i)}\right) f\!\left(\theta_t^{(i)}|\theta_{t-1}^{(i)}\right)}{f_a\!\left(\theta_t^{(i)}|\Theta_{t-1}^{(i)}, D_t\right)}.
\]

This step finalizes the particle filtering algorithm. The quality of approximation is strongly dependent on the choice of the importance function, fa (θt|Dt). There is a rich literature examining various choices of importance function for different applications [74, 140, 141].
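The following is a minimal sketch of the weight recursion above, under assumptions not taken from the text: a scalar random-walk parameter evolution, a Gaussian observation model, and the common "bootstrap" choice of importance function fa(θt|·) = f(θt|θt−1), for which the evolution and importance terms cancel and only the likelihood remains in the update.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def particle_filter_step(particles, weights, d_t, q=0.1, r=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Draw theta_t^(i) from the importance function (here, the evolution model).
    particles = particles + rng.normal(0.0, np.sqrt(q), size=particles.shape)
    # Weight recursion: only the likelihood term survives for this choice of f_a.
    weights = weights * gauss_pdf(d_t, particles, r)
    weights /= weights.sum()              # enforce sum_i w_{i,t} = 1
    return particles, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    particles = rng.normal(0.0, 1.0, n)   # theta_1^(i) drawn from a toy prior
    weights = np.full(n, 1.0 / n)
    for d_t in [0.3, 0.5, 0.4, 0.7]:      # illustrative observation stream
        particles, weights = particle_filter_step(particles, weights, d_t, rng=rng)
    print("posterior mean estimate:", np.dot(weights, particles))
```

The posterior moment in the final line is the weighted sum of (7.15)-(7.17); richer importance functions only change the line computing the weights.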

7.3.3 Stabilized Forgetting

In Section 7.1, the two requirements for computational tractability in Bayesian filtering were given as (i) choice of a DEF observation model, and (ii) invariance of the associated CDEF distribution under the time update. This second requirement may be hard to satisfy.

In the forgetting technique, the exact time update operator (Fig. 2.4) is replaced by an approximate operator without the need for an explicit parameter-evolution model (2.20). This technique was developed using heuristic arguments [42, 144]. Its Bayesian interpretation, which involves optimization of KLDVB, was presented in [145]. As a result, the time update can be replaced by the following probabilistic operator:

\[
f(\theta_t|D_{t-1},\phi_t) \propto \left[ f(\theta_{t-1}|D_{t-1})_{\theta_t} \right]^{\phi_t} \times \left[ \bar f(\theta_t|D_{t-1}) \right]^{1-\phi_t}. \qquad (7.19)
\]

The notation f (·)θt indicates the replacement of the argument of f (·) by θt. f̄ (·) is a chosen alternative distribution, expressing auxiliary knowledge about θt at time t. The coefficient φt, 0 ≤ φt ≤ 1, is known as the forgetting factor. The implied approximate Bayesian filtering scheme is given in Fig. 7.6.

Fig. 7.6. Bayesian filtering, with the time-update step approximated by the forgetting operator, ‘φ’.

If both f̄ (θt|Dt−1) and f (θt−1|Dt−1) are conjugate to the observation model (2.19), then their geometric mean under (7.19) is also conjugate to (2.19). Then, by definition, both f (θt−1|Dt−1) and f (θt|Dt) have the same functional form.

Remark 7.6 (Forgetting for the DEF family). Consider an observation model from the DEF family (6.7), and associated CDEF distributions, as follows (the normalizing constants, ζ, are defined in the usual way (Proposition 6.1)):


\[
f(d_t|\theta_t,\psi_t) = \exp\left( q(\theta_t)'\, u(d_t,\psi_t) - \zeta_{d_t}(\theta_t) \right), \qquad (7.20)
\]
\[
f(\theta_{t-1}|s_{t-1}) = \exp\left( q(\theta_{t-1})'\, v_{t-1} - \nu_{t-1}\,\zeta_{d_t}(\theta_{t-1}) - \zeta_{\theta_{t-1}}(s_{t-1}) \right), \qquad (7.21)
\]
\[
\bar f(\theta_t|\bar s_{t-1}) = \exp\left( q(\theta_t)'\, \bar v_{t-1} - \bar\nu_{t-1}\,\zeta_{d_t}(\theta_t) - \zeta_{\theta_t}(\bar s_{t-1}) \right). \qquad (7.22)
\]

Approximation of the time update (2.21) via the forgetting operator (7.19) yields

\[
f(\theta_t|s_{t-1},\bar s_{t-1}) = \exp\Big[ q(\theta_t)'\left( \phi_t v_{t-1} + (1-\phi_t)\bar v_{t-1} \right) - \left( \phi_t \nu_{t-1} + (1-\phi_t)\bar\nu_{t-1} \right)\zeta_{d_t}(\theta_t) - \zeta_{\theta_t}(s_{t-1},\bar s_{t-1}) \Big]. \qquad (7.23)
\]

The data update step of Bayesian filtering, i.e. using (7.20) and (7.23) in (2.22), yields
\[
f(\theta_t|s_t) = \exp\left( q(\theta_t)'\, v_t - \nu_t\,\zeta_{d_t}(\theta_t) - \zeta_{\theta_t}(s_t) \right), \qquad (7.24)
\]
with sufficient statistics, st = [vt′, νt]′ (Proposition 6.1), updated via
\[
v_t = \phi_t v_{t-1} + u(d_t,\psi_t) + (1-\phi_t)\,\bar v_{t-1}, \qquad (7.25)
\]
\[
\nu_t = \phi_t \nu_{t-1} + 1 + (1-\phi_t)\,\bar\nu_{t-1}. \qquad (7.26)
\]

The prior distribution, f (θ1) (2.21), is chosen in the form of (7.24), with a suitable choice of s1 = [v1′, ν1]′.

For the case v1 = 0η,1, ν1 = 0, and φt = φ, v̄t = v̄, ν̄t = ν̄, the method is known as exponential forgetting, since (7.25) and (7.26) imply a sum of data vectors weighted by a discrete exponential sequence:
\[
v_t = \sum_{\tau=\partial+1}^{t} \phi^{t-\tau}\, u(d_\tau,\psi_\tau) + \bar v, \qquad (7.27)
\]
\[
\nu_t = \sum_{\tau=\partial+1}^{t} \phi^{t-\tau} + \bar\nu, \qquad (7.28)
\]
for t > ∂. The alternative distribution contributes to the statistics via v̄, ν̄ at all times t. This regularizes the inference algorithm in the case of poor data.
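A minimal sketch of the statistics updates (7.25)-(7.26) is given below. The data function u(dt, ψt), the alternative statistics (v̄, ν̄) and the regressor stream are illustrative placeholders; here u is taken as the dyad of an AR-style extended regressor, which is one DEF example among many.

```python
import numpy as np

def forgetting_update(v, nu, u_t, phi, v_bar, nu_bar):
    """One step of (7.25)-(7.26): blend the old statistics with the alternative."""
    v_new = phi * v + u_t + (1.0 - phi) * v_bar
    nu_new = phi * nu + 1.0 + (1.0 - phi) * nu_bar
    return v_new, nu_new

if __name__ == "__main__":
    m = 3                               # size of the extended regressor
    v = np.zeros((m, m))                # v_1 = 0
    nu = 0.0                            # nu_1 = 0
    v_bar = 0.001 * np.eye(m)           # flat alternative statistics (assumed)
    nu_bar = 1.0
    phi = 0.95                          # slowly varying parameters
    rng = np.random.default_rng(0)
    for _ in range(100):
        varphi = rng.normal(size=m)     # stand-in extended regressor
        u_t = np.outer(varphi, varphi)  # u(d_t, psi_t) = varphi varphi'
        v, nu = forgetting_update(v, nu, u_t, phi, v_bar, nu_bar)
    print("effective sample size nu_t ~", nu)   # -> 1/(1-phi) + nu_bar for large t
```

The printed degrees-of-freedom illustrate the finite memory induced by φ < 1, which is the property exploited in the heuristic of Remark 7.7 below.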

7.3.3.1 The Choice of the Forgetting Factor

The forgetting factor φt is assumed to be known a priori. It can be interpreted as a tuning knob of the forgetting technique. Its approximate inference is described in [146, 147]. Its limits, 0 ≤ φt ≤ 1, are interpreted as follows:

For φt = 1, then
\[
f(\theta_t|D_{t-1}, \phi_t=1) = f(\theta_{t-1}|D_{t-1})_{\theta_t}.
\]
This is consistent with the assumption that θt = θt−1, i.e. the time-invariant parameter assumption.


For φt = 0, then
\[
f(\theta_t|D_{t-1}, \phi_t=0) = \bar f(\theta_t|D_{t-1}).
\]
This is consistent with the choice of independence between θt and θt−1; i.e.
\[
f(\theta_t,\theta_{t-1}|D_{t-1}) = \bar f(\theta_t|D_{t-1})\, f(\theta_{t-1}|D_{t-1}).
\]

Typically, the forgetting factor is chosen by the designer of the model a priori. A choice of φt close to 1 models slowly varying parameters. A choice of φt close to 0 models rapidly varying parameters. We now review a heuristic technique for setting φ.

Remark 7.7 (Heuristic choice of φ). We compare the exponential forgetting technique for time-variant parameters with inference based on a pseudo-stationary assumption on a sliding observation window of length h [148]. Specifically, we consider the following two scenarios:

1. time-invariant parameter inference on a sliding pseudo-stationary window of length h at any time t;
2. inference of time-variant parameters as t → ∞, using forgetting with ν̄ = ν1.

Under 1., the degrees-of-freedom parameter of the posterior distribution is, from (6.10),
\[
\nu_t = h - \partial + \nu_1, \qquad (7.29)
\]
where ν1 is the prior degrees-of-freedom. Here, we are using the fact that the recursion starts after ∂ observations (Section 6.2).

Under 2., the degrees-of-freedom parameter of the posterior distribution is, from (7.26),
\[
\nu_t = \frac{1-\phi^{t-\partial}}{1-\phi} + \nu_1 \;\longrightarrow\; \frac{1}{1-\phi} + \nu_1 \quad \text{as } t \to \infty. \qquad (7.30)
\]

Equating (7.29) and (7.30), then
\[
\phi = 1 - \frac{1}{h-\partial}. \qquad (7.31)
\]

This choice of φ yields Bayesian posterior estimates at large t which—under both scenarios—have an equal number of degrees of freedom in their uncertainty. Hence, the choice of φ reflects our prior assumptions about the number of samples, h, for which dt is assumed to be pseudo-stationary. For example, a notional pseudo-stationary window of h = 100 samples with ∂ = 2 corresponds to φ = 1 − 1/98 ≈ 0.99.

7.4 The VB-Approximation in Kalman Filtering

Recall, from Remark 7.1, that Bayesian filtering is tractable under the modelling assumptions for the Kalman filter. Here, we examine the nature of the VB-approximation for this classical model.


7.4.1 The VB method

Step 1: the Bayesian model is given by (7.1)–(7.2). The prior distribution will be determined in Step 4.

Step 2: the logarithm of the parameter evolution model (7.1) is
\[
\ln f(\theta_t|\theta_{t-1}) = -\tfrac{1}{2}\,(\theta_t - A\theta_{t-1})' R_\theta^{-1} (\theta_t - A\theta_{t-1}) + \gamma, \qquad (7.32)
\]
where γ denotes terms independent of θt and θt−1. The condition of separability (7.11) is fulfilled in this case.

Step 3: the VB-parameter predictor (7.9) and the VB-observation model (7.10) are, respectively,
\[
f(\theta_t|D_{t-1}) \propto \exp\Big[-\tfrac{1}{2}\big( \theta_t' A' R_\theta^{-1} A\, \theta_t - \widehat\theta_{t-1}' A' R_\theta^{-1}\theta_t - \theta_t' R_\theta^{-1} A\, \widehat\theta_{t-1} \big)\Big], \qquad (7.33)
\]
\[
f(d_t|\theta_{t-1}) \propto \exp\Big[-\tfrac{1}{2}\big( -\theta_{t-1}' A' R_\theta^{-1}\widehat\theta_t - \widehat\theta_t' R_\theta^{-1} A\, \theta_{t-1} + \theta_{t-1}' A' R_\theta^{-1} A\, \theta_{t-1} \big)\Big].
\]
Recall that the overhat, ·̂, denotes expectation of the argument with respect to either f (θt|Dt) or f (θt−1|Dt), as appropriate. Note that the expectation of the quadratic term $\theta_{t-1}' A' R_\theta^{-1} A\,\theta_{t-1}$ in (7.32) does not appear in (7.33). This term does not depend on θt and thus becomes part of the normalizing constant. Using the above formulae, and (7.2), the VB-marginals, (7.7) and (7.8), are, respectively,
\[
f(\theta_t|D_t) \propto \exp\Big[ -\tfrac{1}{2}\big( \theta_t' A' R_\theta^{-1} A\, \theta_t - \widehat\theta_{t-1}' A' R_\theta^{-1}\theta_t - \theta_t' R_\theta^{-1} A\, \widehat\theta_{t-1} \big) - \tfrac{1}{2}\big( -\theta_t' C' R_d^{-1} d_t - d_t' R_d^{-1} C\, \theta_t + \theta_t' C' R_d^{-1} C\, \theta_t \big) \Big], \qquad (7.34)
\]
\[
f(\theta_{t-1}|D_t) \propto \exp\Big[ -\tfrac{1}{2}\big( -\theta_{t-1}' A' R_\theta^{-1}\widehat\theta_t - \widehat\theta_t' R_\theta^{-1} A\, \theta_{t-1} + \theta_{t-1}' A' R_\theta^{-1} A\, \theta_{t-1} \big)\Big] \times f(\theta_{t-1}|D_{t-1}). \qquad (7.35)
\]

Step 4: we seek standard forms for a) the VB-filtering distribution (7.34), and b) the VB-smoothing distribution (7.35), using a). For a), (7.34) can be recognized as having the following standard form:
\[
f(\theta_t|D_t) = \mathcal{N}_{\theta_t}(\mu_t, \Sigma_t), \qquad (7.36)
\]
with shaping parameters
\[
\mu_t = \left( C' R_d^{-1} C + A' R_\theta^{-1} A \right)^{-1} \left( C' R_d^{-1} d_t + R_\theta^{-1} A\, \widehat\theta_{t-1} \right), \qquad (7.37)
\]
\[
\Sigma_t = \left( C' R_d^{-1} C + A' R_\theta^{-1} A \right)^{-1}. \qquad (7.38)
\]


Hence, the prior is chosen in the form of (7.36), with initial values µ1, Σ1. For b), we substitute (7.36) at time t − 1 into (7.35), which is then recognized to have the following form:
\[
f(\theta_{t-1}|D_t) = \mathcal{N}_{\theta_{t-1}}\!\left( \mu_{t-1|t}, \Sigma_{t-1|t} \right),
\]
with shaping parameters
\[
\mu_{t-1|t} = \left( A' R_\theta^{-1} A + \Sigma_{t-1}^{-1} \right)^{-1} \left( A' R_\theta^{-1}\, \widehat\theta_t + \Sigma_{t-1}^{-1}\mu_{t-1} \right), \qquad (7.39)
\]
\[
\Sigma_{t-1|t} = \left( A' R_\theta^{-1} A + \Sigma_{t-1}^{-1} \right)^{-1}. \qquad (7.40)
\]

Step 5: the only VB-moments required are the first moments of the Normal distributions:
\[
\widehat\theta_{t-1} = \mu_{t-1|t}, \qquad \widehat\theta_t = \mu_t.
\]

Step 6: using the VB-moments from Step 5, and substituting (7.37) into (7.39), it follows that
\[
\mu_{t-1|t} = Z\left( C' R_d^{-1} d_t + R_\theta^{-1} A\, \widehat\theta_{t-1} \right) + w_t,
\]
\[
Z = \left( A' R_\theta^{-1} A \right)^{-1} A' R_\theta^{-1} \left( C' R_d^{-1} C + A' R_\theta^{-1} A \right)^{-1},
\]
\[
w_t = \left( A' R_\theta^{-1} A + \Sigma_{t-1}^{-1} \right)^{-1} \Sigma_{t-1}^{-1}\,\mu_{t-1}.
\]
These can be reduced to the following set of linear equations:
\[
\left( I - Z R_\theta^{-1} A \right) \widehat\theta_{t-1} = Z C' R_d^{-1} d_t + w_t,
\]
with explicit solution
\[
\widehat\theta_{t-1} = \left( I - Z R_\theta^{-1} A \right)^{-1} \left( Z C' R_d^{-1} d_t + w_t \right). \qquad (7.41)
\]

Step 7: the IVB algorithm does not arise, since the solution has been found analytically (7.41).

Step 8: report the VB-filtering distribution, f (θt|Dt) (7.36), via its shaping parameters, µt, Σt.
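A minimal sketch of one closed-form VB step, following the expressions (7.37)-(7.41) as reconstructed above, is given below. The state-space matrices A, C and the noise precisions are illustrative placeholders, not values from the text.

```python
import numpy as np

def vb_kalman_step(d_t, mu_prev, Sigma_prev, A, C, R_theta_inv, R_d_inv):
    # (7.38): VB-filtering covariance (data- and time-independent; see Remark 7.8)
    Sigma_t = np.linalg.inv(C.T @ R_d_inv @ C + A.T @ R_theta_inv @ A)
    # Quantities entering the explicit solution (7.41)
    Z = np.linalg.inv(A.T @ R_theta_inv @ A) @ A.T @ R_theta_inv @ Sigma_t
    w_t = np.linalg.solve(A.T @ R_theta_inv @ A + np.linalg.inv(Sigma_prev),
                          np.linalg.inv(Sigma_prev) @ mu_prev)
    theta_prev_hat = np.linalg.solve(np.eye(len(mu_prev)) - Z @ R_theta_inv @ A,
                                     Z @ C.T @ R_d_inv @ d_t + w_t)
    # (7.37): VB-filtering mean, using the smoothed first moment of theta_{t-1}
    mu_t = Sigma_t @ (C.T @ R_d_inv @ d_t + R_theta_inv @ A @ theta_prev_hat)
    return mu_t, Sigma_t

if __name__ == "__main__":
    A = np.array([[0.9]]); C = np.array([[1.0]])
    R_theta_inv = np.array([[10.0]]); R_d_inv = np.array([[4.0]])
    mu, Sigma = np.zeros(1), np.eye(1)
    for d_t in [0.5, 0.8, 1.1]:
        mu, Sigma = vb_kalman_step(np.atleast_1d(d_t), mu, Sigma,
                                   A, C, R_theta_inv, R_d_inv)
    print("VB-filtering mean:", mu, "covariance:", Sigma)
```

Note that Sigma_t never changes with the data, which is exactly the inconsistency discussed in Remark 7.8.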

Remark 7.8 (Inconsistency of the VB-approximation for Kalman filtering). The covariance matrices, Σt and Σt−1|t, of the VB-marginals, (7.34) and (7.35) respectively, are data- and time-independent. Therefore, the VB-approximation yields inconsistent inferences for the Kalman filter. This is caused by the enforced conditional independence between θt and θt−1. These parameters are, in fact, strongly correlated via the (exact) parameter evolution model (7.1). In general, this correlation is incorporated in the VB-approximation via the substitution of VB-moments (Fig. 7.3). However, in the case of the Kalman filter, (7.1) and (7.2), only the first moments are substituted (Step 5 above), and therefore correlations are not propagated.


7.4.2 Loss of Moment Information in the VB Approximation

Note that the VB-approximation is a sequence of two operations (Theorem 3.1): (i) the expectation operator, and (ii) normalization. Examining the q = 2 case, for example, these operations are applied to the logarithm of the joint distribution, ln f (D, θ1, θ2). In effect, the normalization operation removes all terms independent of the random variable from the VB-marginal. This can be illustrated with a simple example.

Consider the following parameter evolution model:
\[
\ln f(\theta_1|\theta_2) = a\theta_1^2 + b\theta_1\theta_2 + a\theta_2^2. \qquad (7.42)
\]

This has the quadratic form characteristic of Gaussian modelling (see, for example, (7.32) for the Kalman filter). Taking expectations, then
\[
f(\theta_1) \propto \exp\left( a\theta_1^2 + b\theta_1\widehat\theta_2 + a\widehat{\theta_2^2} \right) \propto \exp\left( a\theta_1^2 + b\theta_1\widehat\theta_2 + \gamma \right),
\]
where the constant term, $\gamma = a\widehat{\theta_2^2}$, will be consigned to the normalizing constant. In this way, second-moment information about θ2—which could have been carried via $\widehat{\theta_2^2}$—is lost.

Writing (7.42) in general form, (3.24) and (3.25), then
\[
g(\theta_1) = \left[ \theta_1^2,\; b\theta_1,\; a \right], \qquad
h(\theta_2) = \left[ a,\; \theta_2,\; \theta_2^2 \right].
\]

Note that the moment information is lost in respect of terms in g (·) or h (·) whose corresponding entry in the other vector is a constant. We conclude that the VB-approximation will perform best with models exhibiting a minimum number of constants in g (·) and h (·).

In fact, the considerations above can provide us with guidelines towards a reformulation of the model which might lead to a more successful subsequent VB-approximation. In (7.42), we might relax the constant, a, via f (a). Then, the lost moment, $\widehat{\theta_2^2}$, will be carried through to f (θ1) via $\widehat a = \widehat a\big(\widehat{\theta_2^2}\big)$.

In the case of the Kalman filter, the quadratic terms in θt and θt−1 enter the distribution via the known constant, Rθ (7.32). Hence, this parameter is a natural candidate for relaxation and probabilistic modelling.

7.5 VB-Filtering for the Hidden Markov Model (HMM)

Consider a Hidden Markov Model (HMM) with the following two constituents: (i) a first-order Markov chain on the unobserved discrete (label) variable lt, with c possible states; and (ii) an observation process in which the labels are observed via a continuous c-dimensional variable dt ∈ I^c_(0,1), which denotes the probability of each state at time t. The task is to infer lt from the observations dt.

For analytical convenience, we denote each state of lt by a c-dimensional elementary basis vector εc(i) (see Notational Conventions on page XV); i.e. lt ∈ {εc(1), . . . , εc(c)}. The probability of transition from the jth to the ith state, 1 ≤ i, j ≤ c, is
\[
\Pr\left( l_t = \varepsilon_c(i) \,|\, l_{t-1} = \varepsilon_c(j) \right) = t_{i,j},
\]
where 0 < ti,j < 1, i, j ∈ {1, . . . , c}. These transition probabilities are aggregated into the stochastic transition matrix, T, such that the column sums are unity [149], i.e. ti′1c,1 = 1.

The following Bayesian filtering models are implied (known parameters are suppressed from the conditioning):
\[
f(l_t|l_{t-1}) = \mathrm{Mu}_{l_t}(T l_{t-1}) \propto l_t' T l_{t-1} = \exp\left( l_t' \ln T\, l_{t-1} \right), \qquad (7.43)
\]
\[
f(d_t|l_t) = \mathrm{Di}_{d_t}(\rho l_t + 1_{c,1}) \propto \exp\left( \rho \ln(d_t)'\, l_t \right). \qquad (7.44)
\]
Here, Mu (·) denotes the Multinomial distribution, and Di (·) denotes the Dirichlet distribution (see Appendices A.7 and A.8, respectively). In (7.43), ln T denotes the matrix of log-elements, i.e. ln T = [ln ti,j] ∈ R^{c×c}. Note that (7.43) and (7.44) satisfy Proposition 2.1. In (7.44), the parameter ρ controls the uncertainty with which lt may be inferred via dt. For large values of ρ, the observed data, dt, have higher probability of being close to the actual labels, lt (see Fig. 7.7).

Fig. 7.7. The Dirichlet distribution, f (dt|lt), for c = 2 and ρ = 1, 2, 10, illustrated via its scalar conditional distribution, f (d1,t|lt = ε2(1)) (full line) and f (d1,t|lt = ε2(2)) (dashed line).

7.5.1 Exact Bayesian filtering for known T

Since the observation model (7.44) is from the DEF family (6.7), its conjugate distribution exists, as follows:
\[
f(l_t|D_t) = \mathrm{Mu}_{l_t}(\beta_t). \qquad (7.45)
\]

The time update of Bayesian filtering (Fig. 2.4)—i.e. multiplying (7.43) by (7.45) for lt−1, followed by integration over lt−1—yields
\[
f(l_t|D_{t-1}) \propto \sum_{l_{t-1}} \mathrm{Mu}_{l_t}(T l_{t-1})\,\mathrm{Mu}_{l_{t-1}}(\beta_{t-1}) \propto \mathrm{Mu}_{l_t}(T\beta_{t-1}) = \mathrm{Mu}_{l_t}\!\left( \beta_{t|t-1} \right), \qquad (7.46)
\]
\[
\beta_{t|t-1} = T\beta_{t-1}.
\]

The data update step—i.e. multiplying (7.46) by (7.44), and applying Bayes’ rule—yields
\[
f(l_t|D_t) \propto \mathrm{Di}_{d_t}(\rho l_t + 1_{c,1})\,\mathrm{Mu}_{l_t}\!\left( \beta_{t|t-1} \right) = \mathrm{Mu}_{l_t}(\beta_t),
\]
\[
\beta_t = d_t^{\rho} \circ \beta_{t|t-1} = d_t^{\rho} \circ T\beta_{t-1}, \qquad (7.47)
\]
where ‘∘’ denotes the Hadamard product (the ‘.*’ operation in Matlab). Hence, exact recursive inference has been achieved, but this depends on an assumption of known T.
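A minimal sketch of the exact filtering recursion (7.47) for known T follows; the transition matrix, ρ and the toy soft-observation stream are illustrative choices, not values from the text.

```python
import numpy as np

def hmm_filter_step(beta_prev, d_t, T, rho):
    beta = (d_t ** rho) * (T @ beta_prev)   # Hadamard product with T beta_{t-1}
    return beta / beta.sum()                # normalize to a probability vector

if __name__ == "__main__":
    T = np.array([[0.6, 0.2],
                  [0.4, 0.8]])              # columns sum to one
    rho = 2.0
    beta = np.array([0.5, 0.5])             # uniform prior over the c = 2 labels
    for y in [0.9, 0.8, 0.3, 0.1]:          # soft observations of the first label
        d_t = np.array([y, 1.0 - y])
        beta = hmm_filter_step(beta, d_t, T, rho)
        print("f(l_t = eps(1) | D_t) ~", round(beta[0], 3))
```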

Remark 7.9 (Inference of T when lt are observed). In the case of directly observed labels, i.e. dt = lt, distribution (7.43) can be seen as an observation model:
\[
f(d_t|T, d_{t-1}) = f(l_t|T, l_{t-1}) = \mathrm{Mu}_{l_t}(T l_{t-1}).
\]
Since Mu (·) is a member of the DEF family, it has the following conjugate posterior distribution for T:
\[
f(T|D_t) = \mathrm{Di}_T(Q_t + 1_{c,c}). \qquad (7.48)
\]
Here, DiT (Qt) denotes the matrix Dirichlet distribution (Appendix A.8). Its sufficient statistics, Qt ∈ R^{c×c}, are updated as follows:
\[
Q_t = Q_{t-1} + d_t d_{t-1}' = Q_{t-1} + l_t l_{t-1}'. \qquad (7.49)
\]
This result will be useful later, when inferring T.

7.5.2 The VB Method for the HMM Model with Known T

In the previous Section, we derived an exact Bayesian filtering inference in the case of the HMM model with known T. Once again, it will be interesting to examine the nature of the VB-approximation in this case.

Step 1: the parametric model is given by (7.43) and (7.44).

Step 2: the logarithm of the parameter evolution model is

ln f (lt|lt−1, T ) = l′t lnT lt−1. (7.50)


(7.50) is linear in both lt and lt−1. Hence, it is separable. However, the explicit forms of g (lt) ∈ R^{c²×1} and h (lt−1) ∈ R^{c²×1} from (3.21) are omitted for brevity. Note that these do not include any constants, and so we can expect a more successful VB-approximation in this case than was possible for the Kalman filter (see Section 7.4.2).

Step 3: the implied VB-parameter predictor (7.9) and the VB-observation model (7.10) are, respectively,
\[
f(l_t|D_{t-1}) \propto \exp\left( l_t' \ln T\, \widehat l_{t-1} \right),
\]
\[
f(d_t|l_{t-1}) \propto \exp\left( l_{t-1}' \ln T'\, \widehat l_t \right).
\]

From (7.7) (using (7.44)) and (7.8) respectively, the implied VB-marginals are
\[
f(l_t|D_t) \propto \exp\left( l_t'\, \rho \ln(d_t) \right) \exp\left( l_t' \ln T\, \widehat l_{t-1} \right), \qquad (7.51)
\]
\[
f(l_{t-1}|D_t) \propto \exp\left( \widehat l_t' \ln T\, l_{t-1} \right) f(l_{t-1}|D_{t-1}). \qquad (7.52)
\]

Step 4: the VB-filtering distribution (7.51) is recognized to have the form
\[
f(l_t|D_t) = \mathrm{Mu}_{l_t}(\alpha_t), \qquad (7.53)
\]
with shaping parameter
\[
\alpha_t \propto d_t^{\rho} \circ \exp\left( \ln T\, \widehat l_{t-1} \right), \qquad (7.54)
\]
addressing Step 4 a) (see Section 7.2.1). Here, the normalizing constant is found from $\sum_{i=1}^{c} \alpha_{i,t} = 1$, since αt is a vector of probabilities under the definition of Mu (·). Substituting (7.53) at time t − 1 into (7.52), the VB-smoothing distribution is recognized to have the form
\[
f(l_{t-1}|D_t) = \mathrm{Mu}_{l_{t-1}}(\beta_t), \qquad (7.55)
\]
with shaping parameter
\[
\beta_t \propto \alpha_{t-1} \circ \exp\left( \ln T'\, \widehat l_t \right), \qquad (7.56)
\]
addressing Step 4 b). In this case, note that the posterior distributions, (7.53) and (7.55), have the same form. This is a consequence of symmetry in the observation model (7.50). This property does not hold in general.

Step 5: from (7.53) and (7.55), the necessary VB-moments are
\[
\widehat l_t = \alpha_t, \qquad (7.57)
\]
\[
\widehat l_{t-1} = \beta_t, \qquad (7.58)
\]
using the first moment of the Multinomial distribution (Appendix A.7).

Step 6: substitution of the VB-moments, (7.57) and (7.58), into (7.56) and (7.54) respectively yields the following set of VB-equations:
\[
\alpha_t \propto d_t^{\rho} \circ \exp\left( \ln T\, \beta_t \right),
\]
\[
\beta_t \propto \alpha_{t-1} \circ \exp\left( \ln T'\, \alpha_t \right).
\]

Step 7: run the IVB algorithm on the two VB equations in Step 6.

Step 8: report the VB-filtering distribution, f (lt|Dt) (7.53), and the VB-smoothing distribution, f (lt−1|Dt) (7.55), via their shaping parameters, αt and βt respectively. Their first moments are equal to these shaping parameters.
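Below is a minimal sketch of the IVB iterations for the two VB-equations of Step 6 (HMM with known T): the αt and βt updates are alternated until the shaping parameters stop changing. The transition matrix, ρ, the convergence tolerance and the observation stream are illustrative assumptions.

```python
import numpy as np

def ivb_hmm_step(d_t, alpha_prev, T, rho, n_iter=50, tol=1e-8):
    logT = np.log(T)
    alpha = np.full_like(alpha_prev, 1.0 / len(alpha_prev))  # initial guess
    for _ in range(n_iter):
        beta = alpha_prev * np.exp(logT.T @ alpha)            # VB-smoothing
        beta /= beta.sum()
        alpha_new = (d_t ** rho) * np.exp(logT @ beta)        # VB-filtering
        alpha_new /= alpha_new.sum()
        if np.max(np.abs(alpha_new - alpha)) < tol:
            alpha = alpha_new
            break
        alpha = alpha_new
    return alpha, beta

if __name__ == "__main__":
    T = np.array([[0.6, 0.2], [0.4, 0.8]])
    alpha = np.array([0.5, 0.5])
    for y in [0.9, 0.7, 0.2]:
        alpha, beta = ivb_hmm_step(np.array([y, 1 - y]), alpha, T, rho=2.0)
        print("VB-filtering weights alpha_t ~", np.round(alpha, 3))
```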

Remark 7.10 (Restricted VB (RVB) for the HMM model with known T). In Section 7.3.1, we outlined the non-smoothing restriction which, in this case, requires
\[
f(l_{t-1}|D_t) = f(l_{t-1}|D_{t-1}).
\]
In this case, βt = αt−1. Hence, the entire recursive algorithm is replaced by the following equation:
\[
\alpha_t \propto d_t^{\rho} \circ \exp\left( \ln T\, \alpha_{t-1} \right). \qquad (7.59)
\]

In (7.47), we note that the weights, βi,t, of the exact solution are updated by the algebraic mean of ti,j, weighted by βt−1:
\[
\beta_{i,t} = d_{i,t}^{\rho} \sum_{j=1}^{c} \beta_{j,t-1}\, t_{i,j}, \qquad i = 1,\ldots,c.
\]
In contrast, we note that the weights of the RVB-marginal (7.59) are updated by a weighted geometric mean:
\[
\alpha_{i,t} = d_{i,t}^{\rho} \prod_{j=1}^{c} t_{i,j}^{\,\alpha_{j,t-1}}, \qquad i = 1,\ldots,c.
\]

The distinction between these cases will be studied in simulation, in Section 7.5.5.

7.5.3 The VB Method for the HMM Model with Unknown T

Exact Bayesian filtering for the HMM model with unknown T is intractable, and so, on this occasion, there is a ‘market’ for use of the VB approximation. The exact inference scheme is displayed in Fig. 7.8. Note that the model now contains both time-variant and time-invariant parameters. Hence, it does not fit exactly into any of the scenarios of Chapters 6 or 7. We will therefore freely mix VB-objects arising from the various scenarios.


Fig. 7.8. Exact Bayesian filtering for the HMM model with unknown T.

Step 1: the parametric model is given by (7.43) and (7.44). However, in this case, we show the conditioning of the observation model (7.44) on unknown T:

f (lt|lt−1, T ) = Mult (T lt−1) . (7.60)

The prior distribution on T will be assigned in Step 4.

Step 2: the logarithm of the parameter evolution model (7.50) is unchanged, and it is separable in the parameters, lt, lt−1 and T.

Step 3: the VB-parameter predictor (7.9) and the VB-observation model (7.10) are, respectively,
\[
f(l_t|D_{t-1}) \propto \exp\left( l_t'\, \widehat{\ln T}\, \widehat l_{t-1} \right),
\]
\[
f(d_t|l_{t-1}) \propto \exp\left( \widehat l_t'\, \widehat{\ln T}\, l_{t-1} \right).
\]
The resulting VB-marginals, f (lt|Dt) and f (lt−1|Dt), have the same form as (7.51) and (7.52) respectively, under the substitution $\ln T \to \widehat{\ln T}$. The VB-observation model (6.38) for the time-invariant parameter T is
\[
f(d_t|T) \propto \exp\left( \mathrm{tr}\!\left( \widehat l_t\, \widehat l_{t-1}'\, \ln T' \right) \right). \qquad (7.61)
\]

Step 4: using the results from Step 4 of the previous Section, the VB-marginals, (7.51) and (7.52), have the form (7.53) and (7.55) respectively, with the following shaping parameters:
\[
\alpha_t \propto d_t^{\rho} \circ \exp\left( \widehat{\ln T}\, \widehat l_{t-1} \right), \qquad (7.62)
\]
\[
\beta_t \propto \alpha_{t-1} \circ \exp\left( \widehat{\ln T}{}'\, \widehat l_t \right). \qquad (7.63)
\]
The VB-observation model (7.61) is recognized to be in the form of the Multinomial distribution of continuous argument (Appendix A.7):
\[
f(d_t|T) = \mathrm{Mu}_{d_t}(\mathrm{vec}(T)).
\]


Hence, we can use the results in Remark 7.9; i.e. the conjugate distribution is matrix Dirichlet (Appendix A.8),
\[
f(T|D_t) = \mathrm{Di}_T(Q_t), \qquad (7.64)
\]
with shaping parameter
\[
Q_t = Q_{t-1} + \widehat l_t\, \widehat l_{t-1}'.
\]
From (7.64), we set the prior for T to be f (T) = DiT (Q0) (see Step 1 above).

Step 5: the necessary VB-moments of f (lt|Dt) and f (lt−1|Dt) are given by (7.57) and (7.58), using the shaping parameters (7.62) and (7.63). The only necessary VB-moment of the Dirichlet distribution (7.64) is $\widehat{\ln T}$, which has elements (A.51)
\[
\widehat{\ln}(t_{i,j}) = \psi_\Gamma(q_{i,j,t}) - \psi_\Gamma\!\left( 1_{c,1}'\, Q_t\, 1_{c,1} \right), \qquad (7.65)
\]
where ψΓ (·) is the digamma (psi) function [93] (Appendix A.8).

Step 6: the reduced set of VB-equations is as follows:
\[
\alpha_t \propto d_t^{\rho} \circ \exp\left( \widehat{\ln T}\, \beta_t \right),
\]
\[
\beta_t \propto \alpha_{t-1} \circ \exp\left( \widehat{\ln T}{}'\, \alpha_t \right),
\]
\[
Q_t = Q_{t-1} + \alpha_t \beta_t',
\]
along with (7.65).

Step 7: run the IVB algorithm on the four VB equations in Step 6.

Step 8: report the VB-filtering distribution, f (lt|Dt) (7.53), the VB-smoothing distribution, f (lt−1|Dt) (7.55), and the VB-marginal of T, i.e. f (T|Dt) (7.64), via their shaping parameters, αt, βt and Qt, respectively.
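A minimal sketch of the IVB recursion of Step 6 (HMM with unknown T) is given below. The digamma-based moment follows (7.65) as reconstructed here (normalization by the digamma of the sum of all elements of Qt); ρ, the prior statistic Q0 and the toy observations are illustrative assumptions.

```python
import numpy as np
from scipy.special import digamma

def ln_T_hat(Q):
    # Elementwise VB-moment of ln t_ij under Di_T(Q), per (7.65) as reconstructed
    return digamma(Q) - digamma(Q.sum())

def ivb_unknown_T_step(d_t, alpha_prev, Q_prev, rho, n_iter=30):
    alpha = np.full_like(alpha_prev, 1.0 / alpha_prev.size)
    Q = Q_prev.copy()
    for _ in range(n_iter):
        L = ln_T_hat(Q)                            # current iterate of ln-T moment
        beta = alpha_prev * np.exp(L.T @ alpha)    # VB-smoothing shaping parameter
        beta /= beta.sum()
        alpha = (d_t ** rho) * np.exp(L @ beta)    # VB-filtering shaping parameter
        alpha /= alpha.sum()
        Q = Q_prev + np.outer(alpha, beta)         # Q_t = Q_{t-1} + alpha_t beta_t'
    return alpha, beta, Q

if __name__ == "__main__":
    c, rho = 2, 2.0
    Q = np.ones((c, c))                            # Di_T(Q_0) prior with Q_0 = 1_{c,c}
    alpha = np.full(c, 1.0 / c)
    for y in [0.9, 0.8, 0.2, 0.1, 0.15]:
        alpha, beta, Q = ivb_unknown_T_step(np.array([y, 1 - y]), alpha, Q, rho)
    T_hat = Q / Q.sum(axis=0, keepdims=True)       # crude column-normalized estimate
    print("estimated transition matrix:\n", np.round(T_hat, 2))
```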

7.5.4 Other Approximate Inference Techniques

We now review two other approaches to approximate inference of the HMM model with unknown T.

7.5.4.1 Particle Filtering

An outline of the particle filtering approach to Bayesian filtering was presented in Section 7.3.2. HMM model inference with unknown T requires combined inference of time-variant and time-invariant parameters. This context was addressed in [74, Chapter 10]. In the basic version presented there, the time-invariant parameter—T in our case—was handled via inference of a time-variant quantity, Tt, with the trivial parameter evolution model, Tt = Tt−1.

The following properties of the HMM model, (7.43) and (7.44) (with T unknown), are relevant:


• The hidden parameter, lt, has only c possible states. Hence, it is sufficient to generate just nl = c particles, {lt(i)}.
• The space of T is continuous. Hence, we generate nT particles, {T(j)}, satisfying the restrictions on T, namely ti,j ∈ I(0,1), ti′1c,1 = 1, i = 1, . . . , c.
• When T is known, Bayesian filtering has an analytical solution (Section 7.5.1) with sufficient statistics βt = βt(T) (7.47). Hence, the Rao-Blackwellization technique [74] can be used to establish the following recursive updates for the particle weights:
\[
w_{j,t} \propto w_{j,t-1}\, f\!\left( d_t \,|\, \beta_{t-1}, T^{(j)} \right),
\]
\[
\widehat T_t = \sum_{j=1}^{n_T} w_{j,t}\, T^{(j)}, \qquad
\beta_t = d_t^{\rho} \circ \widehat T_t\, \beta_{t-1}.
\]
Here, wj,t are the particle weights (7.17), and
\[
f\!\left( d_t \,|\, \beta_{t-1}, T^{(j)} \right) = \sum_{i=1}^{c} f\left( d_t \,|\, l_t = \varepsilon_c(i) \right) f\!\left( l_t = \varepsilon_c(i) \,|\, \beta_{t-1}, T^{(j)} \right), \qquad (7.66)
\]
using (7.44) and (7.60). (7.66) is known as the optimal importance function [74]. Note that the particles, {T(j)}, themselves are fixed in this simple implementation of the particle filter, ∀t.

7.5.4.2 Certainty Equivalence Approach

In Section 7.5.1, we obtained exact inference of the HMM model in two different cases:

(i) when T was known, the sufficient statistics for unknown lt were βt (7.45);
(ii) when lt was known, the sufficient statistics for unknown T were Qt (7.48).

Hence, a naïve approach to parameter inference of lt and T together is to use the certainty equivalence approach (Section 3.5.1). From Appendices A.8 and A.7 respectively, we use the following first moments as the certainty equivalents:
\[
\widehat T_t = \left[ 1_{c,1}\left( 1_{c,1}'\, Q_t \right)^{-1} \right] \circ Q_t, \qquad
\widehat l_t = \beta_t,
\]
where the notation v^{-1}, v a vector, denotes the vector of reciprocals, [vi^{-1}] (see Notational Conventions, Page XV). Using these certainty equivalents in cases (i) (i.e. (7.47)) and (ii) (i.e. (7.49)) above, we obtain the following recursive updates:
\[
\beta_t = d_t^{\rho} \circ \widehat T_{t-1}\, \beta_{t-1}, \qquad
Q_t = Q_{t-1} + \beta_t\, \beta_{t-1}'.
\]
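A minimal sketch of these certainty-equivalence recursions follows: at each step the unknown T is replaced by the column-normalized point estimate built from Qt−1, and the label by βt. The values of ρ, Q0 and the observations are placeholders.

```python
import numpy as np

def certainty_equivalence_step(d_t, beta_prev, Q_prev, rho):
    T_hat = Q_prev / Q_prev.sum(axis=0, keepdims=True)   # column-normalized Q
    beta = (d_t ** rho) * (T_hat @ beta_prev)
    beta /= beta.sum()
    Q = Q_prev + np.outer(beta, beta_prev)               # Q_t = Q_{t-1} + beta_t beta_{t-1}'
    return beta, Q

if __name__ == "__main__":
    c, rho = 2, 2.0
    beta = np.full(c, 1.0 / c)
    Q = np.ones((c, c))
    for y in [0.9, 0.85, 0.2, 0.1]:
        beta, Q = certainty_equivalence_step(np.array([y, 1 - y]), beta, Q, rho)
    print("label estimate:", np.round(beta, 3))
```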


7.5.5 Simulation Study: Inference of Soft Bits

In this simulation, we apply all of the inference techniques above to the problem of reconstructing a binary sequence, xt ∈ {0, 1}, from a soft-bit sequence yt ∈ I(0,1), where yt is defined as the Dirichlet observation process (7.44). The problem is a special case of the HMM, (7.43) and (7.44), where the hidden field has c = 2 states, and where we use the following assignments:

lt = [xt, 1 − xt] , (7.67)

dt = [yt, 1 − yt] .

Four settings of the HMM parameters were considered. Each involved one of two settings of T, i.e. T ∈ {T1, T2} (7.43), with
\[
T_1 = \begin{bmatrix} 0.6 & 0.2 \\ 0.4 & 0.8 \end{bmatrix}, \qquad
T_2 = \begin{bmatrix} 0.1 & 0.2 \\ 0.9 & 0.8 \end{bmatrix},
\]

and one of two settings of ρ (7.44), i.e. ρ = 1 and ρ = 2. For each of these settings, we undertook a Monte Carlo study, generating 200 soft-bit sequences, yt, each of length t̄ = 100. For each of the four HMM settings, we examined the following inference methods:

Threshold: we infer xt by testing if yt is closer to 0 or 1. Hence, the bit-stream estimate is
\[
\widehat x_t = \mathrm{round}(y_t) = \begin{cases} 1 & \text{if } y_t > 0.5, \\ 0 & \text{if } y_t \le 0.5. \end{cases} \qquad (7.68)
\]

This constitutes ML estimation of xt (2.11), ignoring the Markov chain model for xt (see Fig. 7.7).

Unknown T: T is inferred from the observed data, via one of the following techniques:
  Naïve: the certainty equivalence approach of Section 7.5.4.2;
  VBT: the VB-approximation for unknown T, as derived in Section 7.5.3;
  PF100: particle filtering (Section 7.5.4), with nT = 100 particles;
  PF500: particle filtering (Section 7.5.4), with nT = 500 particles.

Known T: we use the true value, T1 or T2, with each of the following techniques:
  Exact: exact Bayesian filtering, as derived in Section 7.5.1;
  VB: the VB-approximation with known T, as derived in Section 7.5.2.

The performance of each of these methods was quantified via the following criteria, where—in all but the threshold method (7.68)—x̂t = l̂1,t (7.67) denotes the posterior mean of xt:

1. Total Squared Error (TSE):
\[
\mathrm{TSE} = \sum_{t=1}^{\bar t} \left( \widehat x_t - x_t \right)^2.
\]


In the case of the threshold inference (7.68), for example, this criterion is equal to the Hamming distance between the true and inferred bit-streams.

2. Misclassification counts (M):
\[
M = \sum_{t=1}^{\bar t} \left| \mathrm{round}(\widehat x_t) - x_t \right|,
\]
where round(·) is defined in (7.68).

Total Squared Error (TSE)
                       Unknown T                              Known T
T    ρ   Threshold   Naïve    VBT     PF100   PF500    Exact    VB
T1   1   25.37       16.20    15.81   15.68   15.66    14.36    14.40
T1   2   12.67        8.25     8.20    8.15    8.11     7.45     7.46
T2   1   24.88       13.65    12.29   12.65   12.54    10.69    10.69
T2   2   12.34        6.87     6.65    6.86    6.83     6.09     6.09

Misclassification counts (M)
                       Unknown T                              Known T
T    ρ   Threshold   Naïve    VBT     PF100   PF500    Exact    VB
T1   1   25.37       24.60    23.59   23.26   23.12    21.08    21.11
T1   2   12.67       11.49    11.37   11.35   11.22    10.33    10.38
T2   1   24.88       19.48    17.34   18.10   17.97    14.73    14.75
T2   2   12.34        9.27     9.03    9.28    9.23     8.35     8.35

Table 7.1. Performance of bit-stream (HMM) inference methods in a Monte Carlo study.

The results of the Monte Carlo study are displayed in Table 7.1, from which we draw the following conclusions:

• The Threshold method—which ignores the Markov chain dynamics in the underlying bit-stream—yields the worst performance.

• The performance of the VB-filtering method with known transition matrix, T, is comparable to the exact solution. Therefore, in this case, the VB-moments successfully capture the necessary information from previous time steps, in contrast to VB-filtering in the Kalman context (Remark 7.8).

• All inference methods with unknown T perform better than the Threshold method. None, however, reaches the performance of the exact solution with known T. Specifically:

– For all settings of parameters, the Naïve method exhibits the worst performance.

– In the case of T2 (the ‘narrowband’ case, where there are few transitions in xt), the VB-approximation outperforms the particle filtering approach for both 100 and 500 particles. However, in the case of T1 (the ‘broadband’ case, where there are many transitions in xt), the particle filter outperforms the VB-approximation even for 100 particles. In all cases, however, the differences in VB and particle filtering performances are modest.

– Increasing the number of particles in our particle filtering approach has only a minor influence on performance. This is a consequence of the choice of fixed particles, {T(j)} (Section 7.5.4.1), since the weights of most of the particles are inferred close to 0. Clearly, more sophisticated approaches to the generation of particles—such as resampling [74], kernel smoothing [74, Chapter 10], etc.—would improve performance.

7.6 The VB-Approximation for an Unknown Forgetting Factor

The technique of stabilized forgetting was introduced in Section 7.3.3 as an approximation of Bayesian filtering for DEF models with slowly-varying parameters. In this technique, the time update of Bayesian filtering (2.21) is replaced by a probabilistic operator (7.19). Parameter evolution is no longer modelled explicitly via (2.20), but by an alternative distribution, f̄ (θt|Dt−1), and a forgetting factor, φt.

The forgetting factor, φt, is an important tuning knob of the technique, as discussed in Section 7.3.3.1. Typically, it is chosen as a constant, φt = φ, using heuristics of the kind in Remark 7.7. Its choice expresses our belief in the degree of variability of the model parameters, and, consequently, in the stationarity properties of the observation process, dt. In many situations, however, the process will migrate between epochs of slowly non-stationary (or fully stationary) behaviour and rapidly non-stationary behaviour. Hence, we would like to infer the forgetting factor on-line using the observed data. In the context of the Recursive Least Squares (RLS) algorithm, a rich literature on the inference of φt exists [147]. In this Section, we will consider a VB approach to Bayesian recursive inference of φt [150].

In common with all Bayesian techniques, we treat the unknown quantity, φt, as a random variable, and so we supplement model (7.19) with a prior distribution on φt at each time t:
\[
\phi_t \sim f(\phi_t|D_{t-1}) \equiv f(\phi_t). \qquad (7.69)
\]
Here, we assume (i) that the chosen prior is not data-informed [8], and (ii) that φt is an independent, identically-distributed (i.i.d.) process. This yields an augmented parameter predictor, f (θt, φt|Dt−1). The posterior distribution on θt is then obtained via a Bayes’ rule update and marginalization over φt, as outlined in Fig. 7.9 (upper schematic). However, the required marginalization is intractable. We overcome this problem by replacing the marginalization with VB-marginalization (Fig. 3.4). The resulting VB inference scheme is given in Fig. 7.9 (lower schematic), where the rôle of φ̂t will be explained shortly.

From (7.19) and (7.69), the joint distribution at time t is
\[
f(d_t, \theta_t, \phi_t|D_{t-1}) \propto f(d_t|\theta_t, D_{t-1})\, \frac{1}{\zeta_{\theta_t}(\phi_t)} \left[ f(\theta_{t-1}|D_{t-1})_{\theta_t} \right]^{\phi_t} \times \qquad (7.70)
\]
\[
\times \left[ \bar f(\theta_t|D_{t-1}) \right]^{1-\phi_t} f(\phi_t). \qquad (7.71)
\]


Fig. 7.9. Bayesian filtering with an unknown forgetting factor, φt, and its VB-approximation.

Note that the normalizing constant, ζθt(φt), must now be explicitly stated, since it is a function of the unknown φt. We require the VB-marginals for θt and φt. Using Theorem 3.1, the minimum of KLDVB (3.6) is reached for
\[
f(\theta_t|D_t) \propto f(d_t|\theta_t, D_{t-1}) \times \exp\left( \mathrm{E}_{f(\phi_t|D_t)}\!\left[ \phi_t \ln f(\theta_{t-1}|D_{t-1})_{\theta_t} + (1-\phi_t)\ln \bar f(\theta_t|D_{t-1}) \right] \right)
\]
\[
\propto f(d_t|\theta_t, D_{t-1}) \left[ f(\theta_{t-1}|D_{t-1})_{\theta_t} \right]^{\widehat\phi_t} \left[ \bar f(\theta_t|D_{t-1}) \right]^{1-\widehat\phi_t}, \qquad (7.72)
\]
\[
f(\phi_t|D_t) \propto \frac{1}{\zeta_{\theta_t}(\phi_t)}\, f(\phi_t) \times \exp\left( \mathrm{E}_{f(\theta_t|D_t)}\!\left[ \phi_t \ln f(\theta_{t-1}|D_{t-1})_{\theta_t} + (1-\phi_t)\ln \bar f(\theta_t|D_{t-1}) \right] \right). \qquad (7.73)
\]

Comparing (7.72) with (7.19), we note that the VB-filtering distribution now has the form of the forgetting operator, with the unknown forgetting factor replaced by its VB-moment, φ̂t, calculated from (7.73). No simplification can be found for this VB-marginal on φt. The implied VB inference scheme for Bayesian filtering is given in Fig. 7.10. Note that it has exactly the same form as the stabilized forgetting scheme in Fig. 7.6. Now, however, IVB iterations (Algorithm 1) are required at each time t in order to elicit φ̂t. This inference scheme is tractable for Bayesian filtering with a DEF observation model and CDEF parameter distributions (Remark 7.6). The only additional requirement is that the first moment of the VB-marginal, f (φt|Dt), be available.

7.6.1 Inference of a Univariate AR Model with Time-Variant Parameters

In this Section, we apply the Bayesian filtering scheme with unknown forgetting to an AR model with time-variant parameters. We will exploit the VB inference scheme derived above (Fig. 7.10) in order to achieve recursive identification.


Fig. 7.10. The VB inference scheme for Bayesian filtering with an unknown forgetting factor, φt, expressed in the form of stabilized forgetting, with φ̂t generated via IVB cycles.

Step 1: Choose a Bayesian model

The AR observation model (6.15) is from the DEF family, and so inference with a known forgetting factor is available via Remark 7.6. Furthermore, the conjugate distribution to the AR model is Normal-inverse-Gamma (NiG) (6.21). Using Remark 7.6, the update of statistics is as follows:
\[
f(a_t, r_t|V_t(\phi_t), \nu_t(\phi_t)) = \mathcal{N}i\mathcal{G}\left( V_t(\phi_t), \nu_t(\phi_t) \right), \qquad (7.74)
\]
\[
V_t(\phi_t) = \phi_t V_{t-1} + \varphi_t\varphi_t' + (1-\phi_t)\,\bar V,
\]
\[
\nu_t(\phi_t) = \phi_t \nu_{t-1} + 1 + (1-\phi_t)\,\bar\nu.
\]
The distribution of φt will be revealed in Step 4.

Step 2: Partition the parameters

We choose θ1 = {at, rt} and θ2 = φt. From (7.74), and using (6.22), the logarithm of the joint posterior distribution is
\[
\ln f(a_t, r_t, \phi_t|D_t) = -\ln\zeta_{a_t,r_t}\!\left( V_t(\phi_t), \nu_t(\phi_t) \right) + \ln f(\phi_t)
- 0.5\left( \phi_t\nu_{t-1} + 1 + (1-\phi_t)\bar\nu \right)\ln r_t
- \tfrac{1}{2}\, r_t^{-1}\, [-1, a_t'] \left( \phi_t V_{t-1} + \varphi_t\varphi_t' + (1-\phi_t)\bar V \right) [-1, a_t']'.
\]
This is separable in θ1 and θ2.

Step 3: Write down the VB-marginals

Recall, from (7.72), that the VB-marginal of the AR parameters is in the form of the forgetting operator (7.19). Hence, we need only replace φt with its VB-moment, φ̂t, in (7.74):
\[
f(a_t, r_t|D_t) = \mathcal{N}i\mathcal{G}\left( V_t(\widehat\phi_t), \nu_t(\widehat\phi_t) \right). \qquad (7.75)
\]

The VB-marginal of φt is
\[
f(\phi_t|D_t) = \exp\Big[ -\ln\zeta_{a_t,r_t}\!\left( V_t(\phi_t), \nu_t(\phi_t) \right) - 0.5\,\phi_t\left( \nu_{t-1} + \bar\nu \right)\widehat{\ln r_t} - \tfrac{1}{2}\,\mathrm{tr}\Big( \phi_t\left( V_{t-1} + \bar V \right)\mathrm{E}_{f(a_t,r_t|D_t)}\!\left[ [-1, a_t']'\, r_t^{-1}\, [-1, a_t'] \right] \Big) + \ln f(\phi_t) \Big], \qquad (7.76)
\]
where
\[
\mathrm{E}_{f(a_t,r_t|D_t)}\!\left[ [-1, a_t']'\, r_t^{-1}\, [-1, a_t'] \right]
= \begin{bmatrix} \widehat{r_t^{-1}} & -\widehat{r_t^{-1} a_t}{}' \\ -\widehat{r_t^{-1} a_t} & \widehat{a_t r_t^{-1} a_t'} \end{bmatrix}. \qquad (7.77)
\]

Step 4: Identify standard forms

From (7.74), the shaping parameters of (7.75) are
\[
V_t(\widehat\phi_t) = \widehat\phi_t V_{t-1} + \varphi_t\varphi_t' + \left( 1 - \widehat\phi_t \right)\bar V, \qquad (7.78)
\]
\[
\nu_t(\widehat\phi_t) = \widehat\phi_t \nu_{t-1} + 1 + \left( 1 - \widehat\phi_t \right)\bar\nu. \qquad (7.79)
\]

However, (7.76) does not have a standard form. This is because of the normalizing constant, ζat,rt(Vt(φt), νt(φt)) ≡ ζat,rt(φt). Specifically, we note from (6.23) that ζat,rt(·) is a function of |Vt(φt)|, which is a polynomial in φt of order m + 1. In order to proceed, we will have to find a suitable approximation, ζ̃at,rt(φt).

Proposition 7.1 (Approximation of the Normalizing Constant, ζat,rt(·)). We will match the extreme points of ζat,rt(Vt(φt), νt(φt)) at φt = 0 and φt = 1. From (7.78),
\[
\zeta_{a_t,r_t}\!\left( V_t(0), \nu_t(0) \right) = \zeta_{a_t,r_t}\!\left( \bar V, \bar\nu \right), \qquad (7.80)
\]
\[
\zeta_{a_t,r_t}\!\left( V_t(1), \nu_t(1) \right) = \zeta_{a_t,r_t}\!\left( V_{t-1} + \varphi_t\varphi_t', \nu_{t-1} + 1 \right). \qquad (7.81)
\]

Next, we interpolate ζat,rt with the following function:
\[
\zeta_{a_t,r_t}(\phi_t) \approx \tilde\zeta_{a_t,r_t}(\phi_t) = \exp\left( h_1 + h_2\phi_t \right), \qquad (7.82)
\]
where h1 and h2 are unknown constants. Matching (7.82) at the extrema, (7.80) and (7.81), we obtain
\[
h_1 = \ln\zeta_{a_t,r_t}\!\left( \bar V, \bar\nu \right), \qquad (7.83)
\]
\[
h_2 = \ln\zeta_{a_t,r_t}\!\left( V_{t-1} + \varphi_t\varphi_t', \nu_{t-1} + 1 \right) - \ln\zeta_{a_t,r_t}\!\left( \bar V, \bar\nu \right). \qquad (7.84)
\]

Using (7.82) in (7.76), the approximate VB-marginal of φt is
\[
f(\phi_t|D_t) \approx \mathrm{tExp}_{\phi_t}\left( b_t, [0,1] \right) f(\phi_t), \qquad (7.85)
\]


where tExp(bt, [0, 1]) denotes the truncated Exponential distribution with shaping parameter bt, restricted to the support I[0,1] (Appendix A.9). From (7.76) and (7.82)–(7.84),
\[
b_t = \ln\zeta_{a_t,r_t}\!\left( V_{t-1} + \varphi_t\varphi_t', \nu_{t-1} + 1 \right) - \ln\zeta_{a_t,r_t}\!\left( \bar V, \bar\nu \right) - 0.5\left( \nu_{t-1} + \bar\nu \right)\widehat{\ln r_t} - \tfrac{1}{2}\,\mathrm{tr}\Big( \left( V_{t-1} + \bar V \right)\mathrm{E}_{f(a_t,r_t|D_t)}\!\left[ [-1, a_t']'\, r_t^{-1}\, [-1, a_t'] \right] \Big), \qquad (7.86)
\]

where E·[·] is given by (7.77). It is at this stage that we can choose the prior, f (φt) (see Step 1 above). We recognize tExpφt(bt, [0, 1]) in (7.85) as the VB-observation model for φt (Remark 6.4). Its conjugate distribution is
\[
f(\phi_t) = \mathrm{tExp}_{\phi_t}\left( b_0, [0,1] \right), \qquad (7.87)
\]
with chosen shaping parameter, b0. Substituting (7.87) into (7.85), then
\[
f(\phi_t|D_t) \approx \mathrm{tExp}_{\phi_t}\left( b_t + b_0, [0,1] \right).
\]
Note that, for b0 = 0, the prior (7.87) is a uniform distribution, f (φt) = Uφt([0, 1]). This is the non-informative conjugate prior choice (Section 2.2.3).

Step 5: Formulate necessary VB-moments

From Appendices A.3 and A.9, we can assemble all the necessary VB-moments as follows:
\[
\widehat a_t = V_{aa,t}^{-1} V_{a1,t}, \qquad (7.88)
\]
\[
\widehat{r_t^{-1}} = \nu_t \Lambda_t^{-1},
\]
\[
\widehat{a_t r_t^{-1} a_t'} = 1 + \widehat a_t'\, \widehat{r_t^{-1}}\, \widehat a_t,
\]
\[
\widehat{\ln r_t} = -\ln 2 - \psi_\Gamma(\nu_t - 1) + \ln\lambda_t,
\]
\[
\widehat\phi_t = \frac{\exp(b_t + b_0)\,(1 - b_t - b_0) - 1}{(b_t + b_0)\left( 1 - \exp(b_t + b_0) \right)}, \qquad (7.89)
\]
where the matrix Vt is partitioned as in (6.24).
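A minimal sketch of the VB-moment (7.89) is given below: it is the mean of a truncated Exponential distribution on [0, 1] with shaping parameter b = bt + b0, i.e. a density proportional to exp(bφ) on the unit interval. The test values of b are illustrative only.

```python
import numpy as np

def phi_hat(b):
    """Mean of tExp(b, [0,1]); reduces to 0.5 as b -> 0 (uniform prior case)."""
    if abs(b) < 1e-8:
        return 0.5
    return (np.exp(b) * (1.0 - b) - 1.0) / (b * (1.0 - np.exp(b)))

if __name__ == "__main__":
    for b in [-20.0, -1.0, 0.0, 1.0, 20.0]:
        print(f"b = {b:6.1f}  ->  phi_hat = {phi_hat(b):.3f}")
    # Large negative b favours phi near 0 (rapid forgetting, e.g. at a changepoint);
    # large positive b favours phi near 1 (effectively time-invariant parameters).
```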

Step 6: Reduce the VB-equations

No simplification of the VB equations (7.78), (7.79), (7.86) and (7.88)–(7.89) was found.

Step 7: Run the IVB algorithm

The IVB algorithm iterates on the equations listed in Step 6.


Step 8: Report the VB-marginals

The VB-marginal of the AR parameters, f (θt|Dt) (7.75), is reported via its shaping parameters, Vt and νt. There may also be interest in the VB inference of φt, in which case its VB-marginal, f (φt|Dt) (7.85), is reported via bt.

7.6.2 Simulation Study: Non-stationary AR Model Inference via Unknown Forgetting

At the beginning of Section 7.6, we pointed to the need for a time-variant forgetting factor, φt, in cases where the process, dt, exhibits both pseudo-stationary and non-stationary epochs. There is then no opportunity for prior tuning of the forgetting factor to a constant value, φt = φ, reflecting a notional pseudo-stationary window length, h (Remark 7.7). This is the difficulty we encounter in the important context of changepoints [151], such as arise in speech. Here, the parameters switch rapidly between pseudo-constant values. A related issue is that of initialization in recursive inference of stationary processes (Section 6.5.3). In this problem, the on-line data may initially be incompatible with the chosen prior distribution, and so the distribution must adapt rapidly, via a low forgetting factor, φt. The mismatch diminishes with increasing data, and so the parameter evolution should gradually be switched off, by setting φt → 1.

We now design experiments to study these two contexts for unknown time-variant forgetting:

1. Inference of an AR process with switching parameters (changepoints), where we study the parameter tracking abilities of the VB inference scheme in Fig. 7.9.

2. Inference of an AR process with time-invariant parameters (stationary process), where we study the behaviour of the VB inference scheme in Fig. 7.9, in the context of initialization.

7.6.2.1 Inference of an AR Process with Switching Parameters

A univariate, second-order (m = 2, such that ψt = [dt−1, dt−2]′) stable AR model was simulated with parameters rt = r = 1, and
\[
a_t \in \left\{ \begin{bmatrix} 1.8 \\ -0.98 \end{bmatrix}, \begin{bmatrix} -0.29 \\ -0.98 \end{bmatrix} \right\}.
\]

Abrupt switching between these two cases occurs every 30 samples. The model was identified via the VB-marginals, (7.75) and (7.85), as derived above. The prior distribution is chosen equal to the alternative distribution. Their statistics were chosen as
\[
V_1 = \bar V = \mathrm{diag}\left( [1, 0.001, 0.001]' \right), \qquad \nu_1 = \bar\nu = 10, \qquad (7.90)
\]
corresponding to the prior estimates, â1 = [0, 0]′, var(a1,1) = var(a2,1) = 1000 and r̂1 = 10. We have tested two variants of the IVB algorithm:


(i) the IVB(full) algorithm, which was stopped when |φ̂t[i] − φ̂t[i−1]| < 0.001;

The forgetting factor was initialized with φ[1]t = 0.7 in both cases, ∀t. The identifi-

cation results are displayed in Fig. 7.11.Note that the VB inference—to within one time-step—correctly detects a change

of parameters. At these changepoints, the forgetting factor is estimated as low asφt = 0.05 (which occurred at t = 33). This achieves a ‘memory-dump’, virtuallyreplacing Vt and νt by the alternative (prior) values, V and ν. Thus, identificationis restarted. Note that the number of iterations of the IVB algorithm (Algorithm 1)is significantly higher at each changepoint. Therefore, at these points, the expectedvalue of the forgetting factor, φt, obtained using the truncated IVB(2) algorithm,remains too high compared to the converged value obtained by the IVB(full) algo-rithm (Fig. 7.11). As a result, tracking is sluggish at the changepoints when using theIVB(2) algorithm.

The results of identification using a time-invariant forgetting factor, φt = φ = 0.9, are also displayed in Fig. 7.11. Overall, the best parameter tracking is achieved using the VB-marginal, f (at, r|Dt), evaluated using the IVB(full) algorithm. Identification of the process using the IVB(2) algorithm is acceptable only if the parameter variations are not too rapid.

7.6.2.2 Initialization of Inference for a Stationary AR Process

We have already seen in Section 6.5.3 how the prior, f (θ), can damage the sufficient statistics used for identification of time-invariant parameters, and this effect can be felt even for large t. The forgetting technique can help here. If the forgetting factor is low when t is small, the influence of the prior can be removed rapidly. For large t, there is no need for this. Hence, there is a rôle for a time-variant forgetting factor even for time-invariant parameters. We will use the VB-approximation (Fig. 7.9) to infer the time-variant forgetting factor in this context. In other work, discount schedules have been proposed to overcome this problem of initialization. The one proposed in [26] was
\[
\phi_t^{DS} = 1 - \frac{1}{\eta_1 (t-2) + \eta_2}, \qquad (7.91)
\]
where η1 > 0 and η2 > 0 are a priori chosen constants which control the rate of forgetting for t > 2. Note that φtDS → 1 as t → ∞. We will next examine the performance of this pre-set discount schedule in overcoming the initialization problem, and compare it to the performance of the on-line VB inference scheme (Fig. 7.9), where φt is inferred from the data.

A stationary, univariate, second-order (m = 2), stable AR model was simulated with parameters a = [1.8, −0.98]′ and r = 1. The results of parameter identification are displayed in Fig. 7.12. The prior and alternative distributions were once again set via (7.90), and the forgetting factor was initialized with φ̂t[1] = 0.7, ∀t. Identification using the discount factor (7.91) is also illustrated in Fig. 7.12, with η1 = η2 = 1.


Fig. 7.11. Identification of an AR process exhibiting changepoints, using the VB inference scheme with unknown, time-variant forgetting. In sub-figures 2–4, full lines denote simulated values of parameters, dashed lines denote posterior expected values, and dotted lines denote uncertainty bounds. In sub-figure 6, the number of IVB cycles required for convergence at each time-step is plotted.


Fig. 7.12. Identification of a stationary AR process using the VB inference scheme with unknown, time-variant forgetting. In the second sub-figure, the full line denotes simulated values of parameters, the dashed line denotes posterior expected values, and dotted lines denote uncertainty bounds.

Note that, for t < 15, the expected value of the unknown forgetting factor, φ̂t, is very close to the discount factor. However, as t → ∞, φ̂t does not converge to unity but to a smaller, invariant value (0.92 in this simulation, for t > 20). This is a consequence of the stationary alternative distribution, f̄ (·), since V̄ and ν̄ are always present in Vt−1 and νt−1 (see (7.27) and (7.28) for the φt = φ case).

The priors which are used in the VB inference scheme can be elicited using expert knowledge, side information, etc. In practical applications, it may well be easier to choose these priors (via V̄ and ν̄) than to tune a discount schedule via its parameters, η1 and η2.

Note that the number of iterations of the IVB(full) algorithm is low (typically four, as seen in Fig. 7.12). Hence, the truncation of the number of iterations to two in the IVB(2) algorithm of the previous Section yields almost identical results to the IVB(full) algorithm.

7.7 Conclusion

In this Chapter, we applied the VB-approximation to the important problem of Bayesian filtering. The interesting outcome was that the time update and data (Bayes’ rule) update were replaced by two Bayes’ rule updates. These generated a VB-smoothing and a VB-filtering inference.

The VB-approximation imposes a conditional independence assumption. Specifically, in the case of on-line inference of time-variant parameters, the conditional independence was imposed between the parameters at times t and t − 1. The possible correlation between the partitioned parameters is incorporated in the VB-approximation via the interacting VB-moments. We have studied the VB-approximation for two models: (i) the Kalman filter model, and (ii) the HMM model. We experienced variable levels of success. The Kalman filter uses Gaussian distributions with known covariance, and is not approximated well using the VB approach, for reasons explained in Section 7.4.2. Relaxing the covariance matrices via appropriate distributions might well remedy this problem. In the HMM model, no such problems were encountered, and we were able to report a successful VB-approximation.

We reviewed the case where the time update of Bayesian filtering is replaced by a forgetting operator. Later in the Chapter, we studied the VB-approximation in the case of an unknown, time-variant forgetting factor, φt. This parameter behaves like a hidden variable, augmenting the observation model. In adopting an independent distribution for φt at each time t, the required VB-approximation is exactly the one used in scenario III of Chapter 6. A potential advantage of this approach is that the independence assumption underlying the VB-approximation is compatible with this independent modelling of φt.

The VB inference algorithm with unknown forgetting was successfully applied to inference of an AR model with switching parameters. It also performed well in controlling the influence of the prior in the early stages of inference of a stationary AR process.


8

The Mixture-based Extension of the AR Model (MEAR)

In previous Chapters, we introduced a number of fundamental scenarios for use of the VB-approximation. These enabled tractable (i.e. recursive) on-line inference in a number of important time-invariant and time-variant parameter models. These scenarios were designed to point out the key inferential objects implied by the VB-approximation, and to understand their rôle in the resulting inference schemes.

In this Chapter, we apply the VB-approximation to a more substantial signal processing problem, namely, inference of the parameters of an AR model under an unknown transformation/distortion. Our two principal aims will be (i) to show how the VB-approximation leads to an elegant recursive identification algorithm where none exists in an exact analysis; and (ii) to explain how such an algorithm can address a number of significant signal processing tasks, notably

• inference of unknown transformations of the data;
• data pre-processing, i.e. removal of unwanted artifacts such as outliers.

This extended example will reinforce our experience with the VB method in signal processing.

8.1 The Extended AR (EAR) Model

On-line inference of the parameters of the univariate (scalar) AutoRegressive (AR) model (6.14) was reviewed in Section 6.2.2. Its signal flowgraph is repeated in Fig. 8.1 for completeness. We concluded that the AR observation model is a member of the Dynamic Exponential Family (DEF), for which a conjugate distribution exists in the form of the Normal-inverse-Gamma (NiG) distribution (6.22). In other words, NiG is a CDEF distribution (Proposition 6.1). However, the NiG distribution is conjugate to a much richer class of observation models than merely the AR model, and so all such models are recursively identifiable. We now introduce this more general class, known as the Extended AR (EAR) model.

Let us inspect the NiG distribution, which has the following CDEF form (6.9):
\[
f(a, r|D_t) \propto \exp\left( q(a,r)'\, v_t - \nu_t\,\zeta_{d_t}(a,r) \right). \qquad (8.1)
\]

Following Proposition 6.1, the most general associated DEF observation model (6.7) for scalar observation, dt, is built from q(a, r) above:
\[
f(d_t|a, r, \psi_t) = \exp\left( q(a,r)'\, u(d_t, \psi_t) - \zeta_{d_t}(a,r) \right). \qquad (8.2)
\]

Here, the data function, u(dt, ψt), may be constructed more generally than for the AR model. However, for algorithmic convenience, we will preserve its dyadic form (6.18):
\[
u(d_t, \psi_t) = \mathrm{vec}\left( \varphi_t\varphi_t' \right). \qquad (8.3)
\]

In the AR model, the extended regressor was ϕt = [dt, ψt′]′. We are free, however, to choose a known transformation of the data in defining ϕt:
\[
\psi_t = \psi(d_{t-1}, \ldots, d_{t-\partial}, \xi_t) \in \mathbb{R}^m, \qquad (8.4)
\]
\[
\varphi_{1,t} = \varphi_1(d_t, \ldots, d_{t-\partial}, \xi_t) \in \mathbb{R}, \qquad (8.5)
\]
\[
\varphi_t = [\varphi_{1,t}, \psi_t']' \in \mathbb{R}^{m+1}, \qquad (8.6)
\]

where ψt is an m-dimensional regressor, ϕt is an (m + 1)-dimensional extended regressor, and ξt is a known exogenous variable. The mapping,
\[
\varphi = [\varphi_1, \psi']', \qquad (8.7)
\]
must be known. Using (8.4) in (8.2), the Extended AR (EAR) observation model is therefore defined as

\[
f(d_t|a, r, \psi_t, \varphi) = |J_t(d_t)|\; \mathcal{N}_{\varphi_1(d_t)}\!\left( a'\psi_t, r \right), \qquad (8.8)
\]
where Jt (·) is the Jacobian of the transformation ϕ1 (8.4):
\[
J_t(d_t) = \frac{\mathrm{d}\varphi_1(d_t, \ldots, d_{t-\partial}, \xi_t)}{\mathrm{d}d_t}. \qquad (8.9)
\]

This creates an additional restriction that ϕ1 (8.5) be a differentiable, one-to-one mapping for each setting of ξt. Moreover, ϕ1 must explicitly be a function of dt in order that Jt ≠ 0. This ensures uncertainty propagation from et to dt (Fig. 8.1):
\[
d_t = \varphi_1^{-1}\left( \varphi_{1,t}, d_{t-1}, \ldots, d_{t-\partial}, \xi_t \right).
\]
The signal flowgraph of the scalar EAR model is displayed in Fig. 8.1 (right). Recall, from Section 6.2.2, that et is the innovations process, and r = σ².

In the case of multivariate observations, dt ∈ R^p, with parameters A and R, all identities are adapted as in Remark 6.3. Furthermore, (8.9) is then replaced by a Jacobian matrix of partial derivatives with respect to di,t.
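As a concrete scalar illustration (an assumption, not an example from the text), consider an AR process observed through an exponential nonlinearity, so that ϕ1(dt) = ln dt is the inverse transformation, ψt collects lagged log-data, and the Jacobian (8.9) is Jt(dt) = 1/dt. A minimal sketch:

```python
import numpy as np

def ear_regressor(d_hist, m=2):
    """Build the extended regressor [phi_1(d_t), psi_t']' and the Jacobian
    for the illustrative log-EAR example (d_hist = [d_t, d_{t-1}, ...])."""
    phi_1 = np.log(d_hist[0])                  # transformed 'output'
    psi = np.log(d_hist[1:m + 1])              # transformed regressor
    jacobian = 1.0 / d_hist[0]                 # d(ln d_t)/d d_t
    return np.concatenate(([phi_1], psi)), jacobian

if __name__ == "__main__":
    d_hist = np.array([1.7, 1.5, 1.2])         # d_t, d_{t-1}, d_{t-2} (toy values)
    varphi, J = ear_regressor(d_hist)
    V_update = np.outer(varphi, varphi)        # dyad entering the update (8.10)
    print("extended regressor:", np.round(varphi, 3), " Jacobian:", round(J, 3))
```

Identification then proceeds exactly as for the AR model, with this ϕt entering the dyadic update (8.10); the Jacobian is needed only for prediction (Remark 8.1).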


Fig. 8.1. The signal flowgraph of the univariate AR model (left) and the univariate Extended AR (EAR) model (right).

8.1.1 Bayesian Inference of the EAR Model

Bayesian inference of the EAR model is, by design, exactly the same as for the AR model (6.15)–(6.24). The only change is that ϕt—used in the dyadic update of the extended information matrix, Vt (6.25)—is now defined more generally via (8.6):
\[
V_t = V_{t-1} + \varphi_t\varphi_t', \qquad t > \partial. \qquad (8.10)
\]
The update for νt (6.25) is unchanged:
\[
\nu_t = \nu_{t-1} + 1, \qquad t > \partial.
\]

The EAR model class embraces, by design, the AR model (6.14), and the standard AutoRegressive model with eXogenous variables (ARX), where ψt = [dt−1, . . . , dt−m, ξt′]′ in (6.15). The following important cases are also included [42]: (i) the ARMA model with a known MA part; (ii) an AR process, ϕ1,t, observed via a known bijective non-linear transformation, dt = ϕ1⁻¹(ϕ1,t); (iii) the incremental AR process with the regression defined on increments of the measurement process.

Remark 8.1 (Prediction). Jacobians are not required in EAR identification, but they become important in prediction. The one-step-ahead predictor is given by the ratio of normalizing coefficients (6.23), a result established in general in Section 2.3.3. Hence, we can adapt the predictor for the AR model (6.27) to the EAR case using (8.8) and (8.4):
\[
f(d_{t+1}|D_t) = |J_t(d_{t+1})|\, \frac{\zeta_{a,r}\!\left( V_t + \varphi_{t+1}\varphi_{t+1}', \nu_t + 1 \right)}{\sqrt{2\pi}\,\zeta_{a,r}(V_t, \nu_t)}, \qquad (8.11)
\]
where ϕt+1 = [ϕ1(dt+1), ψt+1′]′.


8.1.2 Computational Issues

A numerically efficient evaluation of (8.10), and subsequent evaluation of the moments (6.26), is based on the LD decomposition [86]; i.e. Vt = LtTtLt′, where Lt is lower-triangular and Tt is diagonal. The update of the extended information matrix (8.10) is replaced by recursions on Lt and Tt [152] (a factored rank-one update is sketched in code after the list below). This approach is superior to accumulation of the full matrix Vt, for the following reasons:

1. Compactness: all operations are performed on triangular matrices involving just(p + m) (p + m) /2 elements, compared to (p + m)2 for full Vt.

2. Computational Efficiency: the dyadic update (8.10) requires O((p + m)2

)op-

erations in each step to re-evaluate Lt, Tt, followed by evaluation of the nor-malizing constant (6.23), with complexity O (p + m). The evaluation of the first

moment, at (6.26), involves complexity O((p + m)2

). In contrast, these op-

erations are O((p + m)3

), using the standard inversion of the full matrix Vt.

Nevertheless, if the matrix inversion lemma [42, 108] is used, the operationshave the same complexity as for the LD decomposition.

3. Regularity: elements of Tt are certain to be positive, which guarantees positive-definiteness of Vt. This property is unique to the LD decomposition.
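The LD update algorithm of [152] is not reproduced here. As a hedged sketch of the same idea, the closely related Cholesky rank-1 update below maintains a triangular factor of V_t under the dyadic update (8.10) in O((p + m)²) operations per dyad; it is a standard algorithm, substituted for the book's LD recursion.

```python
import numpy as np

def chol_rank1_update(C, phi):
    """Return the lower-triangular factor of C C' + phi phi'.

    Standard Cholesky rank-1 update, used here as a stand-in for the
    L-D update of [152]: one dyadic update (8.10) costs O((m+1)^2).
    """
    C = C.copy()
    x = phi.astype(float).copy()
    n = x.size
    for k in range(n):
        r = np.hypot(C[k, k], x[k])
        c, s = r / C[k, k], x[k] / C[k, k]
        C[k, k] = r
        if k + 1 < n:
            C[k + 1:, k] = (C[k + 1:, k] + s * x[k + 1:]) / c
            x[k + 1:] = c * x[k + 1:] - s * C[k + 1:, k]
    return C

# usage sketch: accumulate V_t = V_{t-1} + phi_t phi_t' via its factor
C = np.linalg.cholesky(np.eye(4) * 0.1)    # factor of the prior statistic V_0
phi_t = np.array([1.0, 0.5, -0.2, 0.3])    # an extended regressor (8.6)
C = chol_rank1_update(C, phi_t)
V_t = C @ C.T                              # only needed for checking
```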

8.2 The EAR Model with Unknown Transformation: the MEAR Model

We now relax the EAR assumption which requires ϕ (8.7) to be known. Instead, we consider a finite set, ϕ_c, of possible transformations, called the filter-bank:

ϕ_c = {ϕ^{(1)}, ϕ^{(2)}, ..., ϕ^{(c)}}.   (8.12)

We assume that the observation, d_t, at each time t, was generated by one element of ϕ_c. (8.8) can be rewritten as

f(d_t | a, r, ψ_{t,c}, ϕ_c, l_t) = ∏_{i=1}^{c} f(d_t | a, r, ψ_t^{(i)}, ϕ^{(i)})^{l_{i,t}}.   (8.13)

Here, the active transformation is labelled by a discrete variable, l_t = [l_{1,t}, ..., l_{c,t}]' ∈ {ε_c(1), ..., ε_c(c)}, where ε_c(i) is the i-th elementary vector in R^c (see Notational Conventions on page XVII). l_t constitutes a field of hidden variables which we model via a first-order homogeneous Markov chain (7.43), with transition matrix T ∈ I_{c×c}(0,1):

f(l_t | T, l_{t-1}) = Mu_{l_t}(T l_{t-1}) ∝ exp( l_t' ln T l_{t-1} ).   (8.14)

Here, ln T is a matrix of log-elements of T; i.e. ln T = [ln t_{i,j}]. From (8.13) and (8.14), the observation model is augmented by these hidden variables, l_t, as follows:


f(d_t, l_t | a, r, T, l_{t-1}, ψ_{t,c}, ϕ_c) = f(d_t | a, r, ψ_{t,c}, ϕ_c, l_t) f(l_t | T, l_{t-1}).   (8.15)

Marginalization over l_t yields an observation model in the form of a probabilistic mixture of EAR components with common AR parameterization, a, r:

f(d_t | a, r, T, l_{t-1}, ψ_{t,c}, ϕ_c) = ∑_{i=1}^{c} f(d_t, l_t = ε_c(i) | a, r, T, l_{t-1}, ψ_{t,c}, ϕ_c).   (8.16)

This defines the Mixture-based Extended AutoRegressive (MEAR) model. We note the following:

• The MEAR model involves both time-variant parameters, l_t, and time-invariant parameters, a, r and T.

• Model (8.16) is reminiscent of the DEFH model (6.39) of scenario III (Section 6.3.3). However, the MEAR model is not, in fact, a DEFH model (6.43), since the labels, l_t, are not independently distributed at each time, t, but evolve as a Markov chain (8.14).

• This Markov chain parameter evolution model (8.14) is the same as the one (7.43) studied in Section 7.5.3. However, we have now replaced the Dirichlet observation model (7.44) with the EAR observation model (8.13).

• In common with all mixture-based observation models, exact Bayesian recursive identification of the MEAR model is impossible because it is not a member of the DEF family. We have already seen an example of this in Section 6.5.

We have lost conjugacy (and, therefore, recursion) in our ambition to extend the AR model. We will now regain conjugacy for this extended model via the VB approximation.

8.3 The VB Method for the MEAR Model

The MEAR parameter evolution model (8.14) for l_t is the same as the one studied in Section 7.5.3. Therefore, some of the results derived in Section 7.5.3 can be used here.

Step 1: Choose a Bayesian model

The MEAR model is defined by (8.13) and (8.14). As before, the prior distributions, f(a, r), f(T) and f(l_t), will be derived in Step 4, using conjugacy.

Step 2: Partition the parameters

The logarithm of the augmented observation model (8.15), using (8.13) and (8.14), is


ln f(d_t, l_t | a, r, T, l_{t-1}, ψ_{t,c}, ϕ_c) ∝ ∑_{i=1}^{c} l_{i,t} ( ln|J_t^{(i)}| − ½ ln r − ½ r^{-1} [−1, a'] ϕ_t^{(i)} ϕ_t^{(i)}' [−1, a']' ) + l_t' ln T l_{t-1},   (8.17)

where the J_t^{(i)} are given by (8.9). We choose to partition the parameters of (8.17) as θ_1 = {a, r, T}, θ_2 = l_t and θ_3 = l_{t-1}.

Step 3: Write down the VB-marginals

Using (7.10), the VB-observation model parameterized by the time-variant parameter, l_{t-1}, is

f(d_t | l_{t-1}) ∝ exp( \hat{l}_t' \widehat{ln T}_t l_{t-1} ).   (8.18)

Using (6.38), the VB-observation models parameterized by the time-invariant parameters, a, r and T, are

f(d_t | a, r) ∝ exp( −½ ln r ) exp( −½ r^{-1} [−1, a'] ( ∑_{i=1}^{c} \hat{l}_{i,t} ϕ_t^{(i)} ϕ_t^{(i)}' ) [−1, a']' ),   (8.19)

f(d_t | T) ∝ exp( tr( \hat{l}_t \hat{l}_{t-1}' ln T' ) ).   (8.20)

The VB-parameter predictor (7.9) for l_t is

f(l_t | D_{t-1}) ∝ exp( l_t' ( υ_t + \widehat{ln T}_t \hat{l}_{t-1} ) ),   (8.21)

where υ_t ∈ R^c with elements

υ_{i,t} = ln|J_t^{(i)}| − ½ \widehat{ln r}_t − ½ tr( ϕ_t^{(i)} ϕ_t^{(i)}' E_{f(a,r|D_t)}[ [−1, a']' r^{-1} [−1, a'] ] ),   (8.22)

and

E_{f(a,r|D_t)}[ [−1, a']' r^{-1} [−1, a'] ] = [ \widehat{r^{-1}}_t , −\widehat{r^{-1}}_t \hat{a}_t' ; −\hat{a}_t \widehat{r^{-1}}_t , \widehat{a r^{-1} a'}_t ].


Step 4: Identify standard forms

The VB-parameter predictor (8.21) can be recognized as having the following form:

f(l_t | D_{t-1}) = Mu_{l_t}(α_t),   (8.23)

i.e. the Multinomial distribution (Appendix A.7). Using (8.22), its shaping parameter is

α_t ∝ exp( υ_t + \widehat{ln T}_t \hat{l}_{t-1} ).   (8.24)

The VB-observation models, (8.18)–(8.20), can be recognized as Multinomial, Normal and Multinomial of continuous argument, respectively. Their conjugate distributions are as follows:

f(l_{t-1} | D_t) = Di_{l_{t-1}}(β_t),   (8.25)
f(a, r | D_t) = N iG_{a,r}(V_t, ν_t),   (8.26)
f(T | D_t) = Di_T(Q_t).   (8.27)

Therefore, the posteriors at time t − 1 are chosen to have the same forms, (8.25)–(8.27), with shaping parameters V_{t-1} and ν_{t-1} for (8.26), Q_{t-1} for (8.27), and α_{t-1} for (8.25) (see Section 7.5.2 for reasoning). From (8.19)–(8.22), the posterior shaping parameters are therefore

V_t = V_{t-1} + ∑_{i=1}^{c} \hat{l}_{i,t} ϕ_t^{(i)} ϕ_t^{(i)}',   (8.28)
ν_t = ν_{t-1} + 1,
Q_t = Q_{t-1} + \hat{l}_t \hat{l}_{t-1}',   (8.29)
β_t ∝ α_{t-1} exp( \widehat{ln T}_t' \hat{l}_t ).   (8.30)

Note, also, that the parameter priors, f(l_t), f(a, r) and f(T), are chosen to have the same form as (8.25)–(8.27) (see Step 1).

Step 5: Formulate necessary VB-moments

The necessary VB-moments in (8.24) and (8.28)–(8.30) are \hat{a}_t, \widehat{aa'}_t, \widehat{r^{-1}}_t, \widehat{ln r}_t, \widehat{ln T}_t, \hat{l}_t and \hat{l}_{t-1}. From Appendices A.3, A.7 and A.8, these are as follows:

\hat{a}_t = ( V_{a1,t}' V_{aa,t}^{-1} )',   (8.31)
\widehat{aa'}_t = V_{aa,t}^{-1} + ( V_{a1,t}' V_{aa,t}^{-1} )' ( V_{a1,t}' V_{aa,t}^{-1} ),
\widehat{r^{-1}}_t = ν_t λ_t^{-1},
\widehat{ln r}_t = −ln 2 − ψ_Γ(ν_t − 1) + ln|λ_t|,
\widehat{ln t_{i,j}}_t = ψ_Γ(q_{i,j,t}) − ψ_Γ( 1_{c,1}' Q_t 1_{c,1} ),
\hat{l}_t = α_t,
\hat{l}_{t-1} = β_t,   (8.32)


where λ_t, V_{a1,t} and V_{aa,t} are functions of the matrix V_t, as defined by (6.24). q_{i,j,t} denotes the (i,j)-th element of matrix Q_t.

Step 6: Reduce the VB-equations

As in Sections 7.5.2 and 7.5.3, trivial substitutions for expectations of the labels can be made. However, no significant reduction of the VB-equations was found. The VB-equations are therefore (8.24) and (8.28)–(8.32).

Step 7: Run the IVB algorithm

The IVB algorithm is iterated on the full set of VB-equations from Step 6. We choose to initialize the IVB algorithm by setting the shaping parameters, α_t, of f(l_t|D_t). A convenient choice is the vector of terminal values from the previous step.
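A minimal sketch of one time step of the resulting IVB recursion is given below, assuming a scalar (p = 1) model and a fixed number of IVB iterations in place of a convergence test. The moments follow (8.31)–(8.32) as printed; terms common to all components, which cancel in the normalization of α_t, are omitted, and all function and variable names are illustrative.

```python
import numpy as np
from scipy.special import digamma

def ivb_step_mear(V, nu, Q, alpha_prev, Phi, logJ, n_iter=10):
    """One time step of VB identification of the MEAR model (Section 8.3).

    Inputs: statistics V_{t-1}, nu_{t-1}, Q_{t-1}; previous weights alpha_{t-1};
    the c candidate extended regressors Phi (c x (m+1)); their log-Jacobians logJ.
    Returns the updated statistics and the converged weights.
    """
    c, m1 = Phi.shape
    alpha = alpha_prev.copy()       # initialise moment of l_t (Step 7)
    beta = alpha_prev.copy()        # initialise moment of l_{t-1}
    for _ in range(n_iter):         # IVB iterations (fixed count for simplicity)
        # shaping-parameter updates (8.28)-(8.30)
        Vt = V + (Phi * alpha[:, None]).T @ Phi      # sum_i alpha_i phi_i phi_i'
        nut = nu + 1.0
        Qt = Q + np.outer(alpha, beta)
        # VB-moments of f(a, r | D_t), (8.31)-(8.32), with V_t partitioned as in (6.24)
        Vaa, Va1 = Vt[1:, 1:], Vt[1:, 0]
        lam = Vt[0, 0] - Va1 @ np.linalg.solve(Vaa, Va1)
        a_hat = np.linalg.solve(Vaa, Va1)
        r_inv = nut / lam
        # E[[-1,a']' r^{-1} [-1,a']]; lower-right block evaluated under the NiG
        # posterior (assumption: a|r Normal with covariance r Vaa^{-1})
        E = np.empty((m1, m1))
        E[0, 0] = r_inv
        E[0, 1:] = -r_inv * a_hat
        E[1:, 0] = -r_inv * a_hat
        E[1:, 1:] = np.linalg.inv(Vaa) + r_inv * np.outer(a_hat, a_hat)
        # VB-moment of ln T, following (8.32) as printed
        lnT = digamma(Qt) - digamma(Qt.sum())
        # weights: upsilon (8.22); the common -0.5*E[ln r] term is dropped
        ups = logJ - 0.5 * np.einsum('ij,jk,ik->i', Phi, E, Phi)
        s = ups + lnT @ beta                          # (8.24)
        alpha = np.exp(s - s.max())
        alpha /= alpha.sum()
        beta = alpha_prev * np.exp(lnT.T @ alpha)     # (8.30)
        beta /= beta.sum()
    return Vt, nut, Qt, alpha, beta
```

In practice the loop would be terminated by monitoring the change in α_t between iterations rather than running a fixed number of passes.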

Step 8: Report the VB-marginals

The VB-marginals for the MEAR model parameters, i.e. f(a, r|D_t) (8.26) and f(T|D_t) (8.27), are reported via their shaping parameters, V_t, ν_t and Q_t respectively. There may also be interest in inferring the hidden field, l_t, via f(l_t|D_t) (8.23), with shaping parameter α_t.

8.4 Related Distributional Approximations for MEAR

8.4.1 The Quasi-Bayes (QB) Approximation

The Quasi-Bayes (QB) approximation was defined in Section 3.4.3 as a special case of the Restricted VB (RVB) approximation. Recall, from Remark 3.3, that we are required to fix all but one of the VB-marginals in order to avoid IVB iterations at each time step. The true marginal, f(l_t|D_t), is available, and so this will be used as one restriction. Furthermore, as discussed in Section 6.4.1, we also wish to fix distributions of quantities that are not propagated to the next time step. For this reason, we will also fix f(l_{t-1}|D_t) in the MEAR model.

Using these considerations, we now follow the steps of the VB method, adapting them as in Section 3.4.3.1 to this restricted (closed-form) case.

Step 1: The MEAR model is defined by (8.13) and (8.14).

Step 2: The parameters are partitioned as before.

Step 3: We restrict the VB-marginals on the label field as follows:

f̃(l_t | D_t) ≡ f(l_t | D_t)   (8.33)
    ∝ ∫_{Θ*} ∑_{i=1}^{c} f(d_t, l_t | θ, l_{t-1} = ε_c(i), D_{t-1}) f(θ | D_{t-1}) f(l_{t-1} = ε_c(i) | D_{t-1}) dθ,

f̃(l_{t-1} | D_t) = f(l_{t-1} | D_{t-1}),   (8.34)


where θ = {a, r, T}. (8.33) is the exact marginal of the one-step update, being, therefore, a QB restriction (Section 7.3.1). (8.34) is the non-smoothing restriction which we introduced in Section 7.3.1. Using (8.13), (8.14) and (8.23), and summing over l_{t-1}, we obtain

f(d_t, θ, l_t | D_{t-1}) = ∑_{i=1}^{c} α_{i,t-1} Di_{t_i}(q_{i,t-1} + l_t) ∏_{i=1}^{c} N iG_{a,r}( V_{t-1} + ϕ_t^{(i)} ϕ_t^{(i)}', ν_{t-1} + 1 )^{l_{i,t}}.   (8.35)

Here, t_i is the i-th column of matrix T, and q_{i,t-1} is the i-th column of matrix Q_{t-1}. Finally, we marginalize (8.35) over θ, to yield

f(l_t | D_t) = Mu_{l_t}(α_t) ∝ ∏_{i=1}^{c} α_{i,t}^{l_{i,t}},   (8.36)

α_{i,t} ∝ ∑_{j=1}^{c} α_{j,t-1} ζ_{t_j}(q_{j,t-1} + ε_c(i)) ζ_{a,r}( V_{t-1} + ϕ_t^{(i)} ϕ_t^{(i)}', ν_{t-1} + 1 ).

ζ_{t_j}(·) denotes the normalizing constant of the Dirichlet distribution (A.49), and ζ_{a,r}(·) denotes the normalizing constant of the Normal-inverse-Gamma (N iG) distribution (6.23). Using (8.36), we can immediately write down the required first moments of (8.33) and (8.34), via (A.47):

\hat{l}_{t-1} = α_{t-1},   (8.37)
\hat{l}_t = α_t.   (8.38)

The remaining VB-marginal, i.e. f(θ | D_t), is the same as the one derived via the VB method in Section 8.3, i.e. (8.26) and (8.27).

Step 4: The standard forms of the VB-marginals, (8.26) and (8.27), are as before. However, their shaping parameters, (8.28)–(8.30), are now available in closed form, using (8.37) and (8.38) in place of (8.32).

Steps 5–7: Do not arise.

Step 8: Identical to Section 8.3.
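For comparison, the closed-form QB weight computation (8.36) can be sketched as follows. The Dirichlet normalizing constant is taken from (A.49); since (6.23) is not reproduced in this chapter, the N iG normalizing constant below is assumed to be the p = 1 case of (A.11), and constant factors common to all components are irrelevant after normalization. The sketch follows the printed form of (8.36), which does not show the Jacobian factors |J_t^{(i)}|; all names are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def log_zeta_dir(beta):
    """log normalising constant of the Dirichlet distribution, (A.49)."""
    return gammaln(beta).sum() - gammaln(beta.sum())

def log_zeta_nig(V, nu):
    """log normalising constant of the NiG distribution.

    Assumption: taken as the p = 1 case of (A.11); the book's own form (6.23)
    may differ by constant factors, which cancel in the weight normalisation.
    """
    m = V.shape[0] - 1
    Vaa, Va1 = V[1:, 1:], V[1:, 0]
    lam = V[0, 0] - Va1 @ np.linalg.solve(Vaa, Va1)
    eta = nu - m - 2
    return (gammaln(0.5 * eta) - 0.5 * eta * np.log(lam)
            - 0.5 * np.linalg.slogdet(Vaa)[1]
            + 0.5 * (nu - 2) * np.log(2.0) + 0.5 * m * np.log(np.pi))

def qb_weights(V, nu, Q, alpha_prev, Phi):
    """Closed-form QB weights alpha_t, following (8.36)."""
    c = Phi.shape[0]
    log_alpha = np.empty(c)
    for i in range(c):
        lz_nig = log_zeta_nig(V + np.outer(Phi[i], Phi[i]), nu + 1.0)
        terms = np.array([np.log(alpha_prev[j])
                          + log_zeta_dir(Q[:, j] + np.eye(c)[:, i])
                          + lz_nig
                          for j in range(c)])
        log_alpha[i] = np.logaddexp.reduce(terms)   # log of the sum over j
    alpha = np.exp(log_alpha - log_alpha.max())
    return alpha / alpha.sum()
```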

8.4.2 The Viterbi-Like (VL) Approximation

From (8.28), note that V_t is updated via c dyads, weighted by the elements, \hat{l}_{i,t}, of \hat{l}_t. Dyadic updates (8.28) are expensive—O((1 + m)²), as discussed in Section 8.1.2—especially when the extended regression vectors, ϕ_t, are long. In situations where one weight, \hat{l}_{i,t}, is dominant, it may be unnecessary to perform dyadic updates using the remaining c − 1 dyads. This motivates the following ad hoc proposition.


Proposition 8.1 (Viterbi-Like (VL) Approximation). Further simplification of the QB-approximation (Section 8.4.1) may be achieved using an even coarser approximation of the label-field distribution, namely, certainty equivalence (Section 3.5.1):

f̃(l_t | D_t) = δ( l_t − \hat{l}_t ).   (8.39)

This replaces (8.33). Here, \hat{l}_t is the MAP estimate from (8.36); i.e.

\hat{l}_t = arg max_{l_t} f(l_t | D_t).   (8.40)

This corresponds to the choice of one 'active' component with index ı_t ∈ {1, ..., c}, such that \hat{l}_t = ε_c(ı_t). The idea is related to the Viterbi algorithm [153]. Substituting (8.40) into (8.28)–(8.30), we obtain the shaping parameters of (8.26) and (8.27), as follows:

V_t = V_{t-1} + ϕ_t^{(ı_t)} ϕ_t^{(ı_t)}',   (8.41)
ν_t = ν_{t-1} + 1,   (8.42)
Q_t = Q_{t-1} + α_t α_{t-1}'.   (8.43)

Note that the update of V_t (8.41) now involves only one outer product. Weights α_t (8.38) have already been generated for evaluation of (8.39). Hence, the coarser VL approximation (8.40) is not used in the Q_t update (8.29). Instead, Q_t is updated using (8.38).
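A sketch of the corresponding VL update is almost trivial; it reuses the QB weights already computed for time t and replaces the c-fold dyadic update by a single one. Variable names are illustrative.

```python
import numpy as np

def vl_update(V, nu, Q, alpha, alpha_prev, Phi):
    """Viterbi-like update (8.41)-(8.43): a single dyad from the MAP component.

    alpha are the QB weights (8.36) for time t, alpha_prev those for time t-1;
    Phi holds the c candidate extended regressors, one per row.
    """
    i_map = int(np.argmax(alpha))                 # MAP label, (8.40)
    Vt = V + np.outer(Phi[i_map], Phi[i_map])     # (8.41): one dyadic update
    nut = nu + 1.0                                # (8.42)
    Qt = Q + np.outer(alpha, alpha_prev)          # (8.43): uses weights, not the MAP label
    return Vt, nut, Qt
```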

8.5 Computational Issues

We have now introduced three approximation methods for Bayesian recursive inference of the MEAR model:

1. the Variational Bayes (VB) inference (Section 8.3);
2. the Quasi-Bayes (QB) inference (Section 8.4.1);
3. the Viterbi-Like (VL) inference (Proposition 8.1).

The computational flow is the same for all algorithms, involving updates of the statistics V_t, ν_t and Q_t. The recursive scheme for computation of (8.28)–(8.30) via the VB-approximation is displayed in Fig. 8.2.

The computational scheme for the QB algorithm is, in principle, the same, but the weights \hat{l}_t are evaluated using (8.36) and (8.37), in place of (8.24) and (8.30), respectively. In the VL approximation, only one of the parallel updates is active at each time. Hence, the main points of interest in comparing the computational loads of the various schemes are (i) the weight evaluation, and (ii) the dyadic updates.

Weights are computed via (8.24) for the VB scheme, and by (8.36) for the QB and VL schemes. The operations required for this step are as follows:



Fig. 8.2. The recursive summed-dyad signal flowgraph for VB inference of the MEAR model. 'O.P.' denotes outer product (dyad). 'mux' denotes the assembly of the vector, α_t, from its elements α_{i,t}. Accumulators are represented by 1/(1 − z^{-1}). Equation references are given as appropriate. The transmission of the VB-moments is not shown, for clarity.


VB: (i) evaluation of \hat{a}_t, \hat{r}_t, V_{aa,t} (8.31)–(8.32); (ii) evaluation of the c terms in (8.24). All operations in (i), and each element-wise operation in (ii), are of complexity O((1 + m)²). Moreover, these evaluations must be repeated for each step of the IVB algorithm.

QB: (i) c-fold update of V_t in (8.36), and (ii) c-fold evaluation of the corresponding normalizing constant, ζ_{a,r}, in (8.36). The computational complexity of each normalization is O(1 + m).

VL: the same computational load as QB, plus one maximization (8.40).

The weight updates can be done in parallel for each candidate transformation, in all three cases.

Update of V_t is done via dyadic updates (8.28) for the VB and QB schemes, and (8.41) for the VL scheme. We assume that each dyadic update is undertaken using the LD decomposition (Section 8.1.2), where the required update algorithm [42] performs one dyadic update:

VB: c-fold dyadic update of V_t,
QB: c-fold dyadic update of V_t,
VL: one dyadic update of V_t.

The dyadic updates must be done sequentially for each component.

The overall computational complexity of the three schemes is summarized in Table 8.1.

Table 8.1. Computational complexity of Bayesian recursive inference schemes for the MEAR model.

Scheme   Computational complexity for one time step
VB       n(2c + 1) × O((1 + m)²)
QB       2c × O((1 + m)²) + c × O(1 + m)
VL       (c + 1) × O((1 + m)²) + c × O(1 + m)

n is the number of iterations of the IVB algorithm (for VB only)
c is the number of components in the MEAR model
m is the dimension of the regressor

The main drawback of the VB algorithm is that the number of iterations, n, of the IVB algorithm at each step, t, is unknown a priori.

Remark 8.2 (Spanning property). Fig. 8.2 assumes linear independence between the candidate filter-bank transformations (8.12), i.e. between the extended regressors, ϕ_t^{(i)} (8.4), generated by each candidate filter. In this general case, (8.28) is a rank-c update of the extended information matrix. Special cases may arise, depending on the filter-bank candidates that are chosen a priori. For example, if the ϕ_t^{(i)} are all linearly dependent, then the update in (8.28) is rank-1. In this case, too, only one outer product operation is required, allowing the c parallel paths in Fig. 8.2 to be


collapsed, and reducing the number of dyadic updates for V_t to one. In all cases, though, the convex combination of filter-dependent dyads (8.28) resides in the simplex whose c vertices are these filter-dependent dyads. Thus, the algorithm allows exploration of a space, ϕ, of EAR transformations (8.7) whose implied dyadic updates (8.10) are elements of this simplex. This will be known as the spanning property of MEAR identification.

8.6 The MEAR Model with Time-Variant Parameters

Inference of time-variant parameters was the subject of Chapter 7. We now apply these ideas to the MEAR model with time-variant parameters. The MEAR model, (8.13) and (8.14), is adapted as follows:

f(d_t | a_t, r_t, ψ_t^{(i)}, ϕ^{(i)}) = |J_t^{(i)}(d_t)| N_{ϕ_1^{(i)}(d_t)}( −a_t'ψ_t^{(i)}, r_t ),
f(l_t | T_t, l_{t-1}) = Mu_{l_t}(T_t l_{t-1}).

The VB-approximation of Section 8.3 achieved conjugate recursive updating of time-invariant MEAR parameters. Recall, from Section 7.3.3, that the forgetting operator preserves these conjugate updates in the case of time-variant parameters, if the alternative parameter distribution is chosen from the CDEF family (Proposition 6.1). From (8.26) and (8.27), we note that a and r are conditionally independent of T. Hence, we choose distinct, known, time-invariant forgetting factors, φ_{a,r} and φ_T, respectively. Then the parameter predictor of a_t, r_t and T_t at time t − 1 is chosen as

f(a_t, r_t, T_t | D_{t-1}) = [ f(a_{t-1}, r_{t-1} | V_{t-1}, ν_{t-1})|_{a_t,r_t} ]^{φ_{a,r}} [ f(a_t, r_t | \bar{V}, \bar{ν}) ]^{1−φ_{a,r}} × [ f(T_{t-1} | Q_{t-1})|_{T_t} ]^{φ_T} [ f(T_t | \bar{Q}) ]^{1−φ_T},   (8.44)

where the statistics \bar{V}, \bar{ν} and \bar{Q} of the alternative distributions, f(a_t, r_t | \bar{V}, \bar{ν}) and f(T_t | \bar{Q}), are chosen by the designer.

Under this choice, the VB method for inference of time-variant MEAR parameters is the same as the VB method for time-invariant parameters (Section 8.3), under the substitutions

a → a_t,  r → r_t,  T → T_t.

Step 4 is modified as follows.

4. The distribution f(a_t, r_t, T_t | D_{t-1}) is chosen in the form (8.44), with known time-invariant forgetting factors, φ_{a,r} and φ_T. This is conjugate to the VB-observation models, (8.19) and (8.20). Therefore, the posterior distributions have


the same form as before, i.e. (8.26) and (8.27), with modified shaping parameters, as follows:

V_t = φ_{a,r} V_{t-1} + ∑_{i=1}^{c} \hat{l}_{i,t} ϕ_t^{(i)} ϕ_t^{(i)}' + (1 − φ_{a,r}) \bar{V},   (8.45)
ν_t = φ_{a,r} ν_{t-1} + 1 + (1 − φ_{a,r}) \bar{ν},   (8.46)
Q_t = φ_T Q_{t-1} + \hat{l}_t \hat{l}_{t-1}' + (1 − φ_T) \bar{Q}.   (8.47)

These are the required updates for the VB and QB inference schemes with forgetting. For the VL scheme (Proposition 8.1), the sum in (8.45) is reduced to one dyad, as in (8.41).
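The forgetting updates (8.45)–(8.47) translate directly into code. The sketch below assumes that the VB-moments of the labels (α_t, β_t) have already been obtained, and that the alternative statistics are supplied by the designer; all names are illustrative.

```python
import numpy as np

def forgetting_update(V, nu, Q, alpha, beta, Phi,
                      phi_ar, phi_T, V_bar, nu_bar, Q_bar):
    """Shaping-parameter updates (8.45)-(8.47) with exponential forgetting.

    V, nu, Q are the statistics at t-1; alpha, beta the VB-moments of l_t and
    l_{t-1}; V_bar, nu_bar, Q_bar the statistics of the designer-chosen
    alternative distributions in (8.44).
    """
    dyads = (Phi * alpha[:, None]).T @ Phi                           # sum_i alpha_i phi_i phi_i'
    Vt = phi_ar * V + dyads + (1.0 - phi_ar) * V_bar                 # (8.45)
    nut = phi_ar * nu + 1.0 + (1.0 - phi_ar) * nu_bar                # (8.46)
    Qt = phi_T * Q + np.outer(alpha, beta) + (1.0 - phi_T) * Q_bar   # (8.47)
    return Vt, nut, Qt
```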

8.7 Application: Inference of an AR Model Robust to Outliers

One of the main limitations of the AR model is the sensitivity of parameter estimates to outliers in the measurements. In this Section, we analyze the problem of inference of a time-invariant, univariate (scalar) AR process (6.14) corrupted by isolated outliers. An isolated outlier is not modelled by the AR observation model because the outlier-affected observed value does not take part in the future regression. Instead, the process is autoregressive in the internal (i.e. not directly measured) variable z_t, where

z_t = a'ψ_t + σ e_t,  ψ_t = [z_{t-1}, ..., z_{t-m}]'.   (8.48)

Here, r = σ². The internal variable, z_t, is observed via

d_t = z_t + ω_t,   (8.49)

where ω_t denotes a possible outlier at time t. For an isolated outlier, it holds that

Pr[ω_{t±i} = 0 | ω_t ≠ 0] = 1,  i = 1, ..., m.   (8.50)

The AR model is identified via f(a, r|D_t) (i.e. not via f(a, r|Z_t)), and so the outlier has influence if and only if it enters the extended regressor, ϕ_t (6.25).

8.7.1 Design of the Filter-bank

Since ϕ_t is of finite length, m + 1, and since the outliers are isolated, a finite number of mutually exclusive cases can be defined. Each of these cases can be expressed via an EAR model (8.8) and combined together using the MEAR approach, as follows.

1. None of the values in ϕ_t is affected by an outlier; i.e. d_{t-i} = z_{t-i}, i = 0, ..., m. ϕ^{(1)} is then the identity transformation:

ϕ^{(1)}: ϕ_t^{(1)} = [d_t, ..., d_{t-m}]'.   (8.51)


2. The observed value, d_t, is affected by an outlier; from (8.50), all delayed values are unaffected; i.e. d_{t-i} = z_{t-i}, i = 1, ..., m. For convenience, ω_t can be expressed as ω_t = h_t σ e_t, where h_t is an unknown multiplier of the realized AR residual (6.14). From (6.14), (8.48) and (8.49),

d_t = a'ψ_t + (1 + h_t) σ e_t.

Dividing across by (1 + h_t) reveals the appropriate EAR transformation (8.4):

ϕ^{(2)}: ϕ_t^{(2)} = (1/(1 + h_t)) [d_t, ..., d_{t-m}]'.   (8.52)

ϕ^{(2)} is parameterized by h_t, with constant Jacobian, J_2 = 1/(1 + h_t) (8.8).

3. The k-steps-delayed observation, d_{t-k}, is affected by an outlier, k ∈ {1, ..., m}; in this case, the known transformation should replace this value by an interpolant, \hat{z}_{t-k}, which is known at time t. The set of transformations for each k is then

ϕ^{(2+k)}: ϕ_t^{(2+k)} = [d_t, ..., \hat{z}_{t-k}, ..., d_{t-m}]'.   (8.53)

ϕ^{(2+k)} is parameterized by \hat{z}_{t-k}, with Jacobian J_{2+k} = 1.

We have described an exhaustive set of c = m + 2 filters, ϕ^{(i)}, transforming the observed data, d_t, ..., d_{t-m}, into EAR regressors, ϕ_t^{(i)}, for each EAR model in the filter-bank (8.12). Parameters h_t and \hat{z}_{t-k} must be chosen. We choose h_t to be a known fixed value, h_t = h. Alternatively, if the variance of outliers is known to vary significantly, we can split ϕ^{(2)} into u > 1 candidates with respective fixed parameters h^{(1)} < h^{(2)} < ... < h^{(u)}. Next, \hat{z}_{t-k} is chosen as the k-steps-delayed value of the following causal reconstruction:

\hat{z}_t = ∑_{j=1}^{c} E_{f(z_t | l_t = ε_c(j))}[z_t] f(l_t = ε_c(j) | D_t, ϕ_{t,c})   (8.54)
    = d_t ( ∑_{j=1, j≠2}^{c} α_{j,t} ) + α_{2,t} \hat{a}_{t-1}' ψ_t^{(2)} J_2^{-1}.   (8.55)

Here, we are using (6.28), (8.52) and (8.24), and the fact that z_t = d_t for all transformations except ϕ^{(2)}.
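A sketch of how this filter-bank might be assembled at each time step is given below, assuming a fixed multiplier h and a causal reconstruction ẑ computed as in (8.54); the function name is illustrative.

```python
import numpy as np

def outlier_filter_bank(d, z_hat, t, m, h):
    """Candidate extended regressors (8.51)-(8.53) for the isolated-outlier case.

    d is the observed series, z_hat a causal reconstruction of the clean
    process (8.54), t the current time, m the AR order and h the assumed
    outlier multiplier.  Returns the (m+2) x (m+1) array Phi of candidate
    regressors and the corresponding log-Jacobians.
    """
    base = d[t - m:t + 1][::-1].astype(float)    # [d_t, d_{t-1}, ..., d_{t-m}]
    Phi, logJ = [base], [0.0]                    # phi^(1), (8.51), J = 1
    Phi.append(base / (1.0 + h))                 # phi^(2), (8.52)
    logJ.append(-np.log(1.0 + h))                # J_2 = 1/(1+h)
    for k in range(1, m + 1):                    # phi^(2+k), (8.53)
        cand = base.copy()
        cand[k] = z_hat[t - k]                   # replace d_{t-k} by its interpolant
        Phi.append(cand)
        logJ.append(0.0)                         # J_{2+k} = 1
    return np.vstack(Phi), np.array(logJ)
```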

8.7.2 Simulation Study

A second-order (i.e. m = 2) stable univariate AR process was simulated, with parameters a = [1.85, −0.95]' and r = σ² = 0.01. A random outlier was generated at every 30th sample. The total number of samples was t = 100. A segment of the simulated data (t = 55, ..., 100) is displayed in Fig. 8.3 (dotted line), along with the corrupted data (dots) and the reconstruction (solid line) (8.54). Two outliers occurred during the displayed period: a 'small' outlier at t = 60 and a 'big' outlier at t = 90.



Fig. 8.3. Reconstruction of an AR(2) process corrupted by isolated outliers. Results for VB, QB, and VL inference schemes, respectively, are shown. There are outliers at t = 60 and t = 90. In the left column, the uncorrupted AR signal is displayed via the dotted line, the corrupted data, d_t, are displayed as dots, and the reconstructed signals are displayed by the full line. Note that the reconstructed signals differ from the uncorrupted AR signal only at the second outlier, t = 90.


The filter-bank of m + 2 = 4 transformations—ϕ^{(1)} (8.51), ϕ^{(2)} (8.52) with h_t = h = 10, ϕ^{(3)} and ϕ^{(4)} (8.53)—was used for identification of the AR parameters, a and r. The prior distribution was chosen as N iG(V_0, ν_0), with

V_0 = diag([0.1, 0.001, 0.001]),  ν_0 = 1.

This choice of prior implies point estimates a_0 = [0, 0]' and r_0 = 0.1.

When an outlier occurs, all candidate filters are sequentially used, as seen in Fig. 8.3 (middle and right columns). Thus, the outlier is removed from the shaping parameters (8.28)–(8.30) very effectively. We note that all considered inference schemes—i.e. VB, QB, and VL—performed well when the first outlier occurred. The estimated weights and reconstructed values are almost identical across all three schemes. However, when the second outlier occurred, the VB scheme identified the weights more accurately than the QB and VL schemes.

The terminal—i.e. t = 100—Highest Posterior Density (HPD) region (Definition 2.1) of a (A.15) is illustrated (via the mean value and 2-standard-deviation ellipse) for the various identification methods in the left (overall performance) and right (detail) parts of Fig. 8.4. In the left diagram, the scenarios are (i) AR identification of the AR process corrupted by outliers; (ii) AR identification of the AR process uncorrupted by outliers (boxed). In the right diagram, we zoom in on the boxed area surrounding (ii) above, revealing the three MEAR-based identification scenarios: (iii) MEAR identification using the VB approximation; (iv) MEAR identification using the QB approximation; (v) MEAR estimation using the Viterbi-like (VL) approximation. Impressively, the MEAR-based strategies perform almost as well as the AR strategy with uncorrupted data, which is displayed via the full line in the right diagram. The posterior uncertainty in the estimate of a appears, therefore, to be due to the AR process itself, with all deleterious effects of the outlier process removed.

Fig. 8.4. Identification of an AR(2) process corrupted by isolated outliers. Left: comparison of the terminal moments, t = 100, of the posterior distribution of a. Right: detail of the boxed region on the left. HPD regions for QB and VL are very close together.


8.8 Application: Inference of an AR Model Robust to Burst Noise

The previous example relied on outliers being isolated (8.50), justifying the assumption that there is at most one outlier in each AR extended regressor, ϕ_t^{(i)}. In such a case, the additive decomposition (8.49) allowed successful MEAR modelling of d_t via a finite number (m + 2) of candidates.

A burst noise scenario requires more than one outlier to be considered in the regressor. We transform the underlying AR model (6.14) into state-space form [108], as follows:

z_{t+1} = A z_t + b σ e_t,   (8.56)

A = [[−a_1, −a_2, ..., −a_m], [1, 0, ..., 0], ..., [0, ..., 1, 0]],  b = [1, 0, ..., 0]',   (8.57)

such that A ∈ R^{m×m} and b ∈ R^{m×1}. The observation process with burst noise is modelled as

d_t = c'z_t + h_t σ ξ_t,   (8.58)

where c = [1, 0, ..., 0]' ∈ R^{m×1}, and ξ_t is distributed as N(0, 1), independently of e_t. The term h_t σ denotes the time-dependent standard deviation of the noise, which is assumed strictly positive during any burst, and zero otherwise. Note that (8.56) and (8.57) imply identically the same AR model as in the previous example (8.48). The only modelling difference is in the observation process (8.58), compared to (8.49) and (8.50).

8.8.1 Design of the Filter-Bank

We identify a finite number of mutually exclusive scenarios, each of which can be expressed using an EAR model:

1. The AR process is observed without distortion; i.e. h_t = h_{t-1} = ... = h_{t-m} = 0. Formally, ϕ^{(1)}: ϕ_t^{(1)} = [d_t, ..., d_{t-m}]'.

2. The measurements are all affected by constant-deviation burst noise; i.e. h_t = h_{t-1} = ... = h_{t-m} = h. The state-space model, (8.56) and (8.58), is now defined by the joint distribution

f(z_t, d_t | a, r, z_{t-1}, h) = N_{z_t,d_t}( [A z_{t-1}; c'z_t], r [bb', 0; 0, h²] ).   (8.59)

(8.59) cannot be directly modelled as an EAR observation process because it contains the unobserved state vector z_t. Using standard Kalman filter (KF) theory [42, 108], as reviewed in Section 7.4, we can multiply terms together of the kind in (8.59), and then integrate over the unobserved trajectory—i.e. over z_1, ..., z_t—to obtain the direct observation model:


f(d_t | a, r, D_{t-1}, h) = N_{d_t}( a'\hat{z}_t, r q_t ).   (8.60)

The moments in (8.60) are defined recursively as follows:

q_t = h² + c'S_{t-1}c,   (8.61)
W_t = S_{t-1} − q_t^{-1} (S_{t-1}c)(S_{t-1}c)',   (8.62)
\hat{z}_t = A \hat{z}_{t-1} + h^{-2} W_t c (d_t − c'A \hat{z}_{t-1}),   (8.63)
S_t = bb' + A W_t A'.   (8.64)

(8.60) constitutes a valid EAR model (8.8) if \hat{z}_t and q_t are independent of the unknown AR parameters, a and r. From (8.61) and (8.63), however, both q_t and \hat{z}_t are functions of A(a) (8.57). In order to obtain a valid EAR model, we replace A(a) in (8.63) and (8.64) by its expected value, \hat{A}_{t-1} = A(\hat{a}_{t-1}), using (8.31). Then, (8.60) is a valid EAR model defined by the set of transformations

ϕ^{(2)}: ϕ_t^{(2)} = (1/√q_t) [d_t, \hat{z}_t']',   (8.65)

with time-variant Jacobian, J_t = q_t^{-1/2}(D_{t-1}), evaluated recursively using (8.61). ϕ^{(2)} is parameterized by the unknown h, each setting of which defines a distinct candidate transformation. Note that ϕ_t^{(2)} in (8.65) depends on \hat{a}_{t-1} (8.31). Parameter updates are therefore correlated with previous estimates, \hat{a}_{t-1}.

3. Remaining cases; cases 1 and 2 above do not address the situation where h_k is not constant on a regression interval k ∈ {t − m, ..., t}. Complete modelling for such cases is prohibitive, since [h_{t-m}, ..., h_t]' exists in a continuous space. Nevertheless, it is anticipated that such cases might be accommodated via a weighted combination of the two cases above.

The final step is to define candidates to represent ϕ^{(2)} (ϕ^{(1)} is trivial). One candidate may be chosen for ϕ^{(2)} if the variance of burst noise is reasonably well known a priori. In other cases, we can partition ϕ^{(2)} with respect to intervals of h. Candidates are chosen as one element from each interval.
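The recursion (8.61)–(8.64) and the resulting candidate (8.65) might be implemented as in the following sketch, which uses the current AR point estimate in place of A(a), as described above; the companion-matrix sign convention follows (8.57), and the function name is illustrative.

```python
import numpy as np

def kf_candidate(d_t, a_hat, z_prev, S_prev, h):
    """One step of the KF recursion (8.61)-(8.64), yielding phi^(2) of (8.65).

    a_hat is the current point estimate of the AR parameters (used in place of
    A(a), as in the text), z_prev and S_prev the previous filtered state mean
    and statistic S_{t-1}, and h the assumed burst-noise multiplier.
    """
    m = a_hat.size
    A = np.zeros((m, m))                   # companion matrix, sign convention of (8.57)
    A[0, :] = -a_hat
    A[1:, :-1] = np.eye(m - 1)
    b = np.zeros(m); b[0] = 1.0
    c = np.zeros(m); c[0] = 1.0

    q_t = h ** 2 + c @ S_prev @ c                              # (8.61)
    W_t = S_prev - np.outer(S_prev @ c, S_prev @ c) / q_t      # (8.62)
    z_pred = A @ z_prev
    z_t = z_pred + (W_t @ c) * (d_t - c @ z_pred) / h ** 2     # (8.63)
    S_t = np.outer(b, b) + A @ W_t @ A.T                       # (8.64)

    phi2_t = np.concatenate(([d_t], z_t)) / np.sqrt(q_t)       # (8.65)
    logJ2_t = -0.5 * np.log(q_t)                               # J_t = q_t^{-1/2}
    return phi2_t, logJ2_t, z_t, S_t
```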

8.8.2 Simulation Study

A non-stationary AR(2) process was simulated, with a_{1,t} in the interval from −0.98 to −1.8 (as displayed in Fig. 8.5 (top-right)), a_{2,t} = a_2 = 0.98, r_t = r = 0.01, and t = 200. Realizations are displayed in Fig. 8.5 (top-left, solid line). For t < 95, a_{1,t} is increasing, corresponding to faster signal variations (i.e. increasing bandwidth). Thereafter, a_{1,t} decreases, yielding slower variations. These variations of a_{1,t} do not influence the absolute value of the complex poles of the system, but only their polar angle. The process was corrupted by two noise bursts (samples 50–80 and 130–180), with parameters h = 8 and h = 6 respectively. Realizations of the burst noise process imposed on the simulated signal are displayed in the second row of Fig. 8.5.



Fig. 8.5. Reconstruction and identification of a non-stationary AR(2) process corrupted by burst noise, using the KF variant of the MEAR model. In the final column, full lines denote simulated values of parameters, dashed lines denote posterior expected values, and dotted lines denote uncertainty bounds.



Fig. 8.6. Reconstruction and identification of a non-stationary AR(2) process corrupted by burst noise, using the KF+LPF variant of the MEAR model. In the final column, full lines denote simulated values of parameters, dashed lines denote posterior expected values, and dotted lines denote uncertainty bounds.


The process was inferred using c = 3 filter candidates; i.e. the unity transformation, ϕ^{(1)}, along with ϕ^{(2)} with h = 5, and ϕ^{(2)} with h = 10. Parameter inferences are displayed in the right column of Fig. 8.5, as follows: in the first row, inference using the AR model with uncorrupted data; in the third, fourth and fifth rows, the VB, QB and VL parameter inferences with corrupted data. Specifically, the 95% HPD region, via (A.18) and (A.21), of the marginal Student's t-distribution of a_{1,t} and a_{2,t} respectively, is displayed in each case. The process was identified using forgetting factors (8.44) φ_{a,r} = 0.92 and φ_T = 0.9. The non-committal, stationary, alternative N iG distribution, f(a, r) = N iG_{a,r}(\bar{V}, \bar{ν}), was chosen. Furthermore, the matrix parameter, \bar{Q}, of the stationary, alternative Di distribution, f(T) (8.44), was chosen to be diagonally dominant with ones on the diagonal. This discourages frequent transitions between filters.

Note that all methods (VB, QB and VL) achieved robust identification of the process parameters during the first burst. As already noted, \hat{z}_t^{(i)}, i = 2, 3 (which denotes the reconstructed state vector (8.63) with respect to the i-th filter), is correlated with \hat{a}_{t-1}, which may undermine the tracking of time-variant AR parameters, a_t. In this case, each Kalman component predicts observations poorly, and receives low weights, α_{2,t} and α_{3,t} (8.24), in (8.45). This means that the first component—which does not pre-process the data—has a significant weight, α_{1,t}. Clearly then, the Kalman components have not spanned the space of necessary preprocessing transformations well (Remark 8.2), and need to be supplemented.

Extra filters can be 'plugged in' in a naïve manner (in the sense that they may improve the spanning of the pre-processing space, but should simply be rejected, via (8.24), if poorly designed). During the second burst (Fig. 8.5), the process is slowing down. Therefore, we extend the bank of KF filters with a simple arithmetic-mean Low-Pass Filter (LPF) on the observed regressors:

ϕ^{(3)}: ϕ_t^{(3)} = (1/3) ( ϕ_t^{(1)} + ϕ_{t-1}^{(1)} + ϕ_{t-2}^{(1)} ).   (8.66)

(8.65) and (8.66) yield EAR models with the same AR parameterization, and so they can be used together in the MEAR filter-bank. Reconstructed values for the KF variant above are derived from (8.54), as follows:

\hat{z}_t = α_{1,t} d_t + ∑_{i=2,3} α_{i,t} \hat{a}_t' \hat{z}_{i,t},   (8.67)

using (8.31). For the KF+LPF variant, the term (α_{4,t}/3)(d_t + d_{t-1} + d_{t-2}) is added to (8.67), where α_{4,t} is the estimated weight of the LPF component (8.24).

Identification and reconstruction of the process using the KF+LPF filter-bank is displayed in Fig. 8.6, in the same layout as in Fig. 8.5. The distinction is most clearly seen in the final column of each. During the second burst, the added LPF filter received high weights, α_{4,t} (see Fig. 8.6, middle column). Hence, identification of the parameter a_t is improved during the second burst.


8.8.3 Application in Speech Reconstruction

The MEAR filter-bank for the burst noise case (KF variant) was applied in the reconstruction of speech [154]. A MEAR model with 4 components was used, involving ϕ^{(1)}, and ϕ^{(2)} with three different choices of h, specifically h ∈ {3, 6, 10}. The speech was modelled as AR with order m = 8 (6.14) [155]. The known forgetting factors (8.44) were chosen as φ_{a,r} = φ_T = 0.95. Once again, a diagonally-dominant \bar{Q} was chosen for f(T).

During periods of silence in speech, the statistics (8.45) are effectively not updated, creating difficulties for adaptive identification. Therefore, we use an informative stationary alternative distribution, f(a, r), of the N iG type (8.26) for the AR parameters in (8.44). To elicit an appropriate distribution, we identify the time-invariant alternative statistics, \bar{V} and \bar{ν}, using 1800 samples of unvoiced speech. f(a, r) was then flattened to reduce \bar{ν} from 1800 to 2. This choice moderately influences the accumulating statistics at each step, via (8.45). Specifically, after a long period of silence, the influence of data in (8.45) becomes negligible, and V_t is reduced to \bar{V}.

Three sections of the bbcnews.wav speech file, sampled at 11 kHz, were corrupted by additive noise. Since we are particularly interested in performance in non-stationary epochs, we have considered three transitional cases: (i) a voiced-to-unvoiced transition corrupted by zero-mean, white, Gaussian noise, with a realized Signal-to-Noise Ratio (SNR) of −1 dB during the burst; (ii) an unvoiced-to-voiced transition corrupted by zero-mean, white, uniform noise at −2 dB; and (iii) a silence-to-unvoiced transition corrupted by a click of type 0.25 cos(3t) exp(−0.3t), superimposed on the silence period.

Reconstructed values using the VB, QB and VL inference methods respectively are displayed in Fig. 8.7. All three methods successfully suppressed the burst in the first two cases, and the click in the third case. However, the QB and VL methods also had the deleterious effect of suppressing the unvoiced speech, a problem which was not exhibited by the VB inference.

8.9 Conclusion

This is not the first time that we have studied mixtures of AR models and cracked the problem of loss of conjugacy using the VB-approximation. In Chapter 6, a finite mixture of AR components was recognized as belonging to the DEFH family. Therefore, a VB-observation model belonging to the DEF family could be found, and effective Bayesian recursive identification of the AR parameters of every component was achieved.

In this Chapter, we have proposed a different mixture-based extension of the basic AR model. The same AR parameters appear in each component, and so each component models a different possible non-linear degradation of that AR process. Once again, the VB-approximation provided a recursive identification scheme, which we summarized with the signal flowgraph in Fig. 8.2. A consequence of the shared AR parameterization of each component was that the statistics were updated by c dyads, instead of the usual one-dyad updates characteristic of AR mixtures (Section 6.5). This is most readily appreciated by comparing (8.28) with (6.58).


Fig. 8.7. Reconstruction of three sections of the bbcnews.wav speech file. In the second row, dash-dotted vertical lines delimit the beginning and end of each burst.

Each component of the MEAR model can propose a different possible pre-processing of the data in order to recover the underlying AR process. In the applications we considered in this Chapter, careful modelling of the additive noise corrupting our AR process allowed the pre-processing filter-bank to be designed. In the burst noise example, our filter-bank design was not exhaustive, and so we 'plugged and played' an additional pre-processing filter—a low-pass filter in this case—in order to improve the reconstruction. This ad hoc design of filters is an attractive feature


of the MEAR model. The VB inference scheme simply rejects unsuccessful proposals by assigning them low inferred component weights.

Of course, we want to do all this on-line. A potentially worrying overhead of the VB-approximation is the need to iterate the IVB algorithm to convergence at each time step (Section 6.3.4). Therefore, we derived Restricted VB (RVB) approximate inference schemes (QB and VL) which yielded closed-form recursions. These achieved significant speed-ups (Table 8.1) without any serious reduction in the quality of identification of the underlying AR parameters.


9 Concluding Remarks

The Variational Bayes (VB) theorem seems far removed from the concerns of the signal processing expert. It proposes non-unique optimal approximations for parametric distributions. Our principal purpose in this book has been to build a bridge between this theory and the practical concerns of designing signal processing algorithms. That bridge is the VB method. It comprises eight well-defined and feasible steps for generating a distributional approximation for a designer's model. Recall that this VB method achieves something quite ambitious in an intriguingly simple way: it generates a parametric, free-form, optimized distributional approximation in a deterministic way. In general, free-form optimization should be a difficult task. However, it becomes remarkably simple under the fortuitous combination of assumptions demanded by the VB theorem: (i) conditional independence between partitioned parameters, and (ii) minimization of a particular choice of Kullback-Leibler divergence (KLD_VB).

9.1 The VB Method

The VB theorem yields a set of VB-marginals expressed in implicit functional form. In general, there is still some way to go in writing down a set of explicit, tractable marginals. This has been our concern in designing the steps of the VB method. If the clear requirements of each step are satisfied in turn, then we are guaranteed a tractable VB-approximation. If the requirements cannot be satisfied, then we provide guidelines for how to adapt the underlying model in order to achieve a tractable VB-approximation.

The requirements of the VB method mean that it can only be applied successfully in carefully defined signal processing contexts. We isolated three key scenarios where VB-approximations can be generated with ease. We studied these scenarios carefully, and showed how the VB-approximation can address important signal processing concerns such as the design of recursive inference procedures, tractable point estimation, model selection, etc.


9.2 Contributions of the Work

Among the key outputs of this work, we might list the following:

1. Practical signal processing algorithms for matrix decompositions and for recursive identification of stationary and non-stationary processes.
2. Definition of the key VB inference objects needed in order to design recursive inference schemes. In time-invariant parameter inference, this was the VB-observation model. In Bayesian filtering for time-variant parameters, this was supplemented by the VB-parameter predictor.
3. These VB inference objects pointed to the appropriate design of priors for tractable and numerically efficient recursive algorithms.
4. We showed that related distributional approximations—such as Quasi-Bayes (QB)—can be handled as restrictions of the VB-approximation, and are therefore amenable to the VB method. The choice of these restrictions sets the trade-off between optimality and computational efficiency.

Of course, this gain in computational efficiency using the VB-approximation comes at a cost, paid for in accuracy. Correlation between the partitioned parameters is only approximated by the shaping parameters of the independent VB-marginals. This approximation may not be good enough when correlation is a key inferential quantity. We examined model types where the VB-approximation is less successful (e.g. the Kalman filter), and this pointed the way to possible model adaptations which could improve the performance of the approximation.

9.3 Current Issues

The computational engine at the heart of the VB method is the Iterative VB (IVB) algorithm. It requires an unspecified number of iterations to yield converged VB-moments and shaping parameters. This is a potential concern in time-critical on-line signal processing applications. We examined up to three methods for controlling the number of IVB cycles:

(i) Step 6 of the VB method seeks an analytical reduction in the number of VB-equations and associated unknowns. On occasion, a full analytical solution has been possible.
(ii) Careful choice of initial values.
(iii) The Restricted VB (RVB) approximation, where a closed-form approximation is guaranteed.

Another concern which we addressed is the non-uniqueness of the KLD_VB minimizer, i.e. the non-uniqueness of the VB-approximation in many cases. As a consequence, we must be very careful in our choice of initialization for the IVB algorithm. Considerations based on asymptotics, classical estimation results, etc., can be helpful in choosing reasonable prior intervals for the initializers.


In the case of the Kalman filter, a tractable VB-approximation was possible, but it was inconsistent. The problem was that the necessary VB-moments from Step 5 were, in fact, all first-order moments, and therefore could not capture higher-order dependence on data. As already mentioned, this insight pointed the way to possible adaptations of the original model which could circumvent the problem.

9.4 Future Prospects for the VB Method

The VB method of approximation does not exist in isolation. In the Restricted VB approximation, we are required to fix all but one of the VB-marginals. How we do this is our choice, but it is, once again, a task of distributional approximation. Hence, subsidiary techniques—such as the Laplace approximation, stochastic sampling, MaxEnt, etc.—can be plugged in at this stage. In turn, the VB method itself can be used to address part of a larger distributional approximation problem. For example, we saw how the VB-marginals could be generated at each time step of a recursive scheme to replace an intractable exact marginal (time update), while the exact Bayesian data updates were not affected. Clearly there are many more opportunities for symbiosis between these distributional approximation methods.

The conditional independence assumption is characteristic of the VB-approximation and has probably not been exploited fully in this work. The ability to reduce a joint distribution in many variables to a set of optimized independent distributions—involving only a few parameters each—is a powerful facility in the design of tractable inference algorithms. Possible application areas might include the analysis of large distributed communication systems, biological systems, etc.

It is tempting to interpret the IVB algorithm—which lies at the heart of the VB-approximation—as a 'Bayesian EM algorithm'. Where the EM algorithm converges to the ML solution, and yields point estimates, the IVB algorithm converges to a set of distributions, yielding not only point estimates but their uncertainties. One of the most ergonomic aspects of the VB-approximation is that its natural outputs are parameter marginals and moments—i.e. the very objects whose unavailability forces the use of approximation in the first place. We hope that the convenient pathway to VB-approximation revealed by the VB method will encourage the Bayesian signal processing community to develop practical variational inference algorithms, both in off-line and on-line contexts. Even better, we hope that the VB-approximation might be kept in mind as a convenient tool in developing and exploring Bayesian models.


A Required Probability Distributions

A.1 Multivariate Normal distribution

The multivariate Normal distribution of x ∈ R^{p×1} is

N_x(µ, R) = (2π)^{-p/2} |R|^{-1/2} exp{ −½ [x − µ]' R^{-1} [x − µ] }.   (A.1)

The non-zero moments of (A.1) are

\hat{x} = µ,   (A.2)
\widehat{xx'} = R + µµ'.   (A.3)

The scalar Normal distribution is an important special case of (A.1):

N_x(µ, r) = (2πr)^{-1/2} exp{ −(1/(2r)) (x − µ)² }.   (A.4)

A.2 Matrix Normal distribution

The matrix Normal distribution of the matrix X ∈ R^{p×n} is

N_X(µ_X, Σ_p ⊗ Σ_n) = (2π)^{-pn/2} |Σ_p|^{-n/2} |Σ_n|^{-p/2} exp( −0.5 tr{ Σ_p^{-1} (X − µ_X) (Σ_n^{-1})' (X − µ_X)' } ),   (A.5)

where Σ_p ∈ R^{p×p} and Σ_n ∈ R^{n×n} are symmetric, positive-definite matrices, and where ⊗ denotes the Kronecker product (see Notational Conventions, Page XVI). The distribution has the following properties:

• The first moment is E_X[X] = µ_X.


• The second non-central moments are

E_X[X Z X'] = tr(Z Σ_n) Σ_p + µ_X Z µ_X',
E_X[X' Z X] = tr(Z Σ_p) Σ_n + µ_X' Z µ_X,   (A.6)

where Z is an arbitrary matrix, appropriately resized in each case.

• For any real matrices, C ∈ R^{c×p} and D ∈ R^{n×d}, it holds that

f(CXD) = N_{CXD}( C µ_X D, C Σ_p C' ⊗ D' Σ_n D ).   (A.7)

• The distribution of x = vec(X) (see Notational Conventions on Page XV) is multivariate Normal:

f(x) = N_x( µ_X, Σ_n ⊗ Σ_p ).   (A.8)

Note that the covariance matrix has changed its form compared to the matrix case (A.5). This notation is helpful as it allows us to store the pn × pn covariance matrix in p × p and n × n structures.

This matrix Normal convention greatly simplifies notation. For example, if the columns, x_i, of matrix X are independently Normally distributed with common covariance matrix, Σ, then

f(x_1, x_2, ..., x_n | µ_X, Σ) = ∏_{i=1}^{n} N_{x_i}(µ_{i,X}, Σ) ≡ N(µ_X, Σ ⊗ I_n),   (A.9)

i.e. the matrix Normal distribution (A.5) with µ_X = [µ_{1,X}, ..., µ_{n,X}]. Moreover, linear transformations of the matrix argument, X (A.7), preserve the Kronecker-product form of the covariance structure.

A.3 Normal-inverse-Wishart (N iWA,Ω) Distribution

The Normal-inverse-Wishart distribution of θ = {A, R}, A ∈ R^{p×m}, R ∈ R^{p×p}, is

N iW_{A,R}(V, ν) ≡ ( |R|^{-ν/2} / ζ_{A,R}(V, ν) ) exp{ −½ R^{-1} [−I_p, A] V [−I_p, A]' },   (A.10)

with normalizing constant,

ζ_{A,R}(V, ν) = Γ_p( ½(ν − m − p − 1) ) |Λ|^{-½(ν − m − p − 1)} |V_{aa}|^{-p/2} 2^{½ p (ν − p − 1)} π^{mp/2},   (A.11)

and parameters,

V = [ V_{dd}, V_{ad}' ; V_{ad}, V_{aa} ],  Λ = V_{dd} − V_{ad}' V_{aa}^{-1} V_{ad}.   (A.12)


(A.12) denotes the partitioning of V ∈ R^{(p+m)×(p+m)} into blocks, where V_{dd} is the upper-left sub-block of size p × p. In (A.11), Γ_p(·) denotes the multi-gamma function (see Notational Conventions on Page XV).

The conditional and marginal distributions of A and R are [42]

f(A | R, V, ν) = N_A( \hat{A}, R ⊗ V_{aa}^{-1} ),   (A.13)
f(R | V, ν) = iW_R( η, Λ ),   (A.14)
f(A | V, ν) = St_A( \hat{A}, (1/(ν − m + 2)) Λ^{-1} ⊗ V_{aa}^{-1}, ν − m + 2 ),   (A.15)

with auxiliary constants

\hat{A} = V_{aa}^{-1} V_{ad},   (A.16)
η = ν − m − p − 1.   (A.17)

St(·) denotes the matrix Student's t-distribution with ν − m + 2 degrees of freedom, and iW(·) denotes the inverse-Wishart distribution [156, 157]. The moments of these distributions are

E_{f(A|R,V,ν)}[A] = E_{f(A|V,ν)}[A] = \hat{A},   (A.18)
E_{f(R|V,ν)}[R] ≡ \hat{R} = (1/(η − p − 1)) Λ,   (A.19)
E_{f(R|V,ν)}[R^{-1}] ≡ \widehat{R^{-1}} = η Λ^{-1},   (A.20)
E_{f(A|V,ν)}[ (A − \hat{A})' (A − \hat{A}) ] = (1/(η − p − 1)) Λ V_{aa}^{-1} = \hat{R} V_{aa}^{-1},   (A.21)
E_{f(R|V,ν)}[ ln|R| ] = −∑_{j=1}^{p} ψ_Γ( ½(η − j + 1) ) + ln|Λ| − p ln 2,   (A.22)

where ψ_Γ(·) = ∂/∂β ln Γ(·) is the digamma (psi) function. In the special case where p = 1, (A.10) has the form (6.21), i.e. the Normal-inverse-Gamma (N iG) distribution.
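For reference, the point estimates and moments above might be evaluated as in the following sketch, given the statistics V and ν and the dimension p; variable names are illustrative, and the digamma sum in (A.22) follows the formula as written above.

```python
import numpy as np
from scipy.special import digamma

def niw_moments(V, nu, p):
    """Point estimate and moments (A.16)-(A.22) of the NiW(V, nu) distribution.

    V is the (p+m) x (p+m) extended information matrix partitioned as in
    (A.12); nu is the degrees-of-freedom statistic.
    """
    Vdd, Vad, Vaa = V[:p, :p], V[p:, :p], V[p:, p:]
    m = Vaa.shape[0]
    A_hat = np.linalg.solve(Vaa, Vad)            # (A.16), cf. (A.18)
    Lam = Vdd - Vad.T @ A_hat                    # Lambda, (A.12)
    eta = nu - m - p - 1                         # (A.17)
    R_hat = Lam / (eta - p - 1)                  # E[R], (A.19)
    R_inv_hat = eta * np.linalg.inv(Lam)         # E[R^{-1}], (A.20)
    ln_det_R = (-sum(digamma(0.5 * (eta - j)) for j in range(p))
                + np.linalg.slogdet(Lam)[1] - p * np.log(2.0))   # (A.22)
    return A_hat, Lam, eta, R_hat, R_inv_hat, ln_det_R
```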

A.4 Truncated Normal Distribution

The truncated Normal distribution for a scalar random variable, x, is defined as Normal—with functional form N_x(µ, r)—on the restricted support a < x ≤ b. Its distribution is

f(x | µ, r, a, b) = ( √2 exp( −(1/(2r)) (x − µ)² ) / ( √(πr) ( erf(β) − erf(α) ) ) ) χ_{(a,b]}(x),   (A.23)


where α = (a − µ)/√(2r) and β = (b − µ)/√(2r). The moments of (A.23) are

\hat{x} = µ − √r · ϕ(µ, r),   (A.24)
\widehat{x²} = r + µ \hat{x} − √r · κ(µ, r),   (A.25)

with auxiliary functions, as follows:

ϕ(µ, r) = √2 [ exp(−β²) − exp(−α²) ] / ( √π ( erf(β) − erf(α) ) ),   (A.26)
κ(µ, r) = √2 [ b exp(−β²) − a exp(−α²) ] / ( √π ( erf(β) − erf(α) ) ).   (A.27)

In the case of vector arguments µ and r, (A.26) and (A.27) are evaluated element-wise.

Confidence intervals for this distribution can also be obtained. However, for simplicity, we use the first two moments, (A.24) and (A.25), and we approximate (A.23) by a Gaussian. The Maximum Entropy (MaxEnt) principle [158] ensures that the uncertainty bounds on the MaxEnt Gaussian approximation of (A.23) enclose the uncertainty bounds of all distributions with the same first two moments. Hence,

max( a, −2 √( \widehat{x²} − \hat{x}² ) ) < x − \hat{x} < min( b, 2 √( \widehat{x²} − \hat{x}² ) ).   (A.28)
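The moments (A.24)–(A.27) can be evaluated directly, for example as in the following sketch; element-wise operation over vector arguments follows from NumPy broadcasting, and the names are illustrative.

```python
import numpy as np
from scipy.special import erf

def truncated_normal_moments(mu, r, a, b):
    """First two moments (A.24)-(A.25) of the truncated Normal (A.23).

    mu and r are the location and squared-scale of the underlying Normal,
    (a, b] the truncation interval.
    """
    alpha = (a - mu) / np.sqrt(2.0 * r)
    beta = (b - mu) / np.sqrt(2.0 * r)
    denom = np.sqrt(np.pi) * (erf(beta) - erf(alpha))
    phi = np.sqrt(2.0) * (np.exp(-beta ** 2) - np.exp(-alpha ** 2)) / denom           # (A.26)
    kappa = np.sqrt(2.0) * (b * np.exp(-beta ** 2) - a * np.exp(-alpha ** 2)) / denom  # (A.27)
    x1 = mu - np.sqrt(r) * phi                   # (A.24)
    x2 = r + mu * x1 - np.sqrt(r) * kappa        # (A.25)
    return x1, x2

# quick check: N(0,1) truncated to (0, inf) has mean sqrt(2/pi) and second moment 1
m1, m2 = truncated_normal_moments(0.0, 1.0, 0.0, 1e6)
```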

A.5 Gamma Distribution

The Gamma distribution is as follows:

f(x | a, b) = G_x(a, b) = ( b^a / Γ(a) ) x^{a−1} exp(−bx) χ_{[0,∞)}(x),   (A.29)

where a > 0 and b > 0, and Γ(a) is the Gamma function [93] evaluated at a. The first moment is

\hat{x} = a/b,

and the second central moment is

E[ (x − \hat{x})² ] = a/b².

A.6 Von Mises-Fisher Matrix distribution

Moments of the von Mises-Fisher distribution are now considered. Proofs of all unproved results are available in [92].


A.6.1 Definition

The von Mises-Fisher distribution of a matrix random variable, X ∈ R^{p×n}, restricted to X'X = I_n, is given by

f(X | F) = M(F) = (1 / ζ_X(p, FF')) exp( tr(FX') ),   (A.30)
ζ_X(p, FF') = ₀F₁( ½p, ¼FF' ) C(p, n),   (A.31)

where F ∈ R^{p×n} is a matrix parameter of the same dimensions as X, and p ≥ n. ζ_X(p, FF') is the normalizing constant. ₀F₁(·) denotes a Hypergeometric function of matrix argument FF' [159]. C(p, n) denotes the area of the relevant Stiefel manifold, S_{p,n} (4.64).

(A.30) is a Gaussian distribution with the restriction X'X = I_n, renormalized on S_{p,n}. It is governed by a single matrix parameter, F. Consider the (economic) SVD (Definition 4.1),

F = U_F L_F V_F',

of the parameter F, where U_F ∈ R^{p×n}, L_F ∈ R^{n×n}, V_F ∈ R^{n×n}. Then the maximum of (A.30) is reached at

\hat{X} = U_F V_F'.   (A.32)

The flatness of the distribution is controlled by L_F. When l_F = diag^{-1}(L_F) = 0_{n,1}, the distribution is uniform on S_{p,n} [160]. For l_{i,F} → ∞, ∀i = 1 ... n, the distribution is a Dirac δ-function at \hat{X} (A.32).

A.6.2 First Moment

Let Y_X be the transformed variable,

Y_X = U_F' X V_F.   (A.33)

It can be shown that ζ_X(p, FF') = ζ_X(p, L_F²). The distribution of Y_X is then

f(Y_X | F) = (1 / ζ_X(p, L_F²)) exp( tr(L_F Y_X) ) = (1 / ζ_X(p, L_F²)) exp( l_F' y_X ),   (A.34)

where y_X = diag^{-1}(Y_X). Hence,

f(Y_X | F) ∝ f(y_X | l_F).   (A.35)

The first moment of (A.34) is given by [92]

E_{f(Y_X|F)}[Y_X] = Ψ,   (A.36)

where Ψ = diag(ψ) is a diagonal matrix with diagonal elements


\psi_i = \frac{\partial}{\partial l_{i,F}} \ln {}_0F_1\left(\tfrac{1}{2}p, \tfrac{1}{4}L_F^2\right).   (A.37)

We will denote vector function (A.37) by

\psi = G(p, l_F).   (A.38)

The mean value of the original random variable X is then [161]

E_{f(X|F)}[X] = U_F \Psi V_F' = U_F G(p, L_F) V_F',   (A.39)

where G(p, L_F) = \mathrm{diag}(G(p, l_F)).
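
For the vector case, n = 1 (so that S_{p,1} is the unit sphere in R^p), the matrix-argument {}_0F_1 reduces to a scalar one and G(p, \kappa) becomes the modified-Bessel ratio I_{p/2}(\kappa)/I_{p/2-1}(\kappa). The following sketch illustrates (A.38)-(A.39) under that reduction only; it is not the book's code, it assumes SciPy, and the helper name G_scalar is hypothetical.

import numpy as np
from scipy.special import ive   # exponentially scaled modified Bessel function I_v

def G_scalar(p, kappa):
    # G(p, kappa) of (A.38) in the vector (n = 1) case: the Bessel ratio
    # I_{p/2}(kappa) / I_{p/2-1}(kappa). The exp(-kappa) scaling of `ive`
    # cancels in the ratio, which helps for large kappa.
    return ive(p / 2.0, kappa) / ive(p / 2.0 - 1.0, kappa)

# Mean of X on the unit sphere S_{p,1}, following (A.39).
p = 3
f = np.array([2.0, -1.0, 0.5])       # parameter F for n = 1
kappa = np.linalg.norm(f)            # l_F: the single singular value of F
u_F = f / kappa                      # U_F; V_F = 1 for n = 1
E_X = u_F * G_scalar(p, kappa)       # E[X] = U_F G(p, L_F) V_F'
print(kappa, G_scalar(p, kappa), E_X)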

A.6.3 Second Moment and Uncertainty Bounds

The second central moment of the transformed variable, y_X = \mathrm{diag}^{-1}(Y_X) (A.34), is given by

E_{f(Y_X|F)}\left[y_X y_X' - E_{f(Y_X|F)}[y_X]\, E_{f(Y_X|F)}[y_X]'\right] = \Phi,   (A.40)

with elements

\phi_{i,j} = \frac{\partial^2}{\partial l_{i,F}\,\partial l_{j,F}} \ln {}_0F_1\left(\tfrac{1}{2}p, \tfrac{1}{4}L_F^2\right),  i, j = 1, \ldots, r.   (A.41)

Transformation (A.33) is one-to-one, with unit Jacobian. Hence, boundaries of confidence intervals on the variables X and Y_X can be mutually mapped using (A.33). However, the mapping y_X = \mathrm{diag}^{-1}(Y_X) is many-to-one, and so X \to y_X is surjective (but not injective). Conversion of second moments (and uncertainty bounds) of y_X to X (via (A.33) and (A.34)) is therefore available in implicit form only. For example, the lower bound subspace of X is expressible as follows:

\underline{X} = \left\{ X \mid \mathrm{diag}^{-1}(U_F' X V_F) = \underline{y}_X \right\},

where \underline{y}_X is an appropriately chosen lower bound on y_X. The upper bound, \overline{X}, can be constructed similarly via a bound \overline{y}_X. However, due to the topology of the support of X, i.e. the Stiefel manifold (Fig. 4.2), \overline{y}_X projects into the region with highest density of X. Therefore, we consider the HPD region (Definition 2.1) to be bounded by \underline{X} only.

It remains, then, to choose an appropriate bound, \underline{y}_X, from (A.34). Exact confidence intervals for this multivariate distribution are not known. Therefore, we use the first two moments, (A.36) and (A.40), to approximate (A.34) by a Gaussian. The Maximum Entropy (MaxEnt) principle [158] ensures that uncertainty bounds on the MaxEnt Gaussian approximation of (A.34) enclose the uncertainty bounds of all distributions with the same first two moments. Confidence intervals for the Gaussian distribution, with moments (A.37) and (A.41), are well known. For example,

\Pr\left(-2\sqrt{\phi_i} < (y_{i,X} - \psi_i) < 2\sqrt{\phi_i}\right) \approx 0.95,   (A.42)


where \psi_i is given by (A.37), and \phi_i by (A.41). Therefore, we choose

\underline{y}_{i,X} = \psi_i - 2\sqrt{\phi_i}.   (A.43)

The required vector bounds are then constructed as \underline{y}_X = \left[\underline{y}_{1,X}, \ldots, \underline{y}_{r,X}\right]'. The geometric relationship between the variables X and y_X is illustrated graphically for p = 2 and n = 1 in Fig. 4.2.
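
Continuing the n = 1 illustration given after (A.39), \phi of (A.41) is the derivative of G(p, \cdot) with respect to the single singular value; the sketch below approximates it by a central finite difference and forms the bound (A.43). Again, this is an assumed illustrative reduction and hypothetical helper names, not the book's implementation.

import numpy as np
from scipy.special import ive

def G_scalar(p, kappa):
    # Bessel-ratio form of G(p, kappa) for n = 1, as in the sketch after (A.39).
    return ive(p / 2.0, kappa) / ive(p / 2.0 - 1.0, kappa)

def phi_scalar(p, kappa, h=1e-5):
    # phi of (A.41) for n = 1: the second derivative of ln 0F1 with respect to
    # the single singular value, i.e. dG/dkappa, approximated here by a
    # central finite difference.
    return (G_scalar(p, kappa + h) - G_scalar(p, kappa - h)) / (2.0 * h)

p, kappa = 3, 2.3                     # illustrative values
psi = G_scalar(p, kappa)              # (A.37)
phi = phi_scalar(p, kappa)            # (A.41)
y_lower = psi - 2.0 * np.sqrt(phi)    # (A.43): lower uncertainty bound on y_X
print(psi, phi, y_lower)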

A.7 Multinomial Distribution

The Multinomial distribution of the c-dimensional vector variable l, where l_i \in N and \sum_{i=1}^{c} l_i = \gamma, is as follows:

f(l|\alpha) = \mathrm{Mu}_l(\alpha) = \frac{1}{\zeta_l(\alpha)} \prod_{i=1}^{c} \alpha_i^{l_i}\,\chi_{N^c}(l).   (A.44)

Its vector parameter is \alpha = [\alpha_1, \alpha_2, \ldots, \alpha_c]', \alpha_i > 0, \sum_{i=1}^{c} \alpha_i = 1, and the normalizing constant is

\zeta_l(\alpha) = \frac{\prod_{i=1}^{c} l_i!}{\gamma!},   (A.45)

where '!' denotes factorial. If the argument l contains positive real numbers, i.e. l_i \in (0, \infty), then we refer to (A.44) as the Multinomial distribution of continuous argument. The only change in (A.44) is that the support is now (0, \infty)^c, and the normalizing constant is

\zeta_l(\alpha) = \frac{\prod_{i=1}^{c} \Gamma(l_i)}{\Gamma(\gamma)},   (A.46)

where \Gamma(\cdot) is the Gamma function [93]. For both variants, the first moment is given by

\overline{l} = \alpha.   (A.47)
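
The continuous-argument variant is conveniently evaluated in the log domain. A minimal sketch (illustrative only, assuming SciPy; the helper name is hypothetical) of the log-density defined by (A.44) with normalizing constant (A.46):

import numpy as np
from scipy.special import gammaln

def log_mu_cont(l, alpha):
    # Log-density of the Multinomial distribution of continuous argument:
    # (A.44) evaluated with the normalizing constant (A.46) on (0, inf)^c.
    l, alpha = np.asarray(l, float), np.asarray(alpha, float)
    log_zeta = gammaln(l).sum() - gammaln(l.sum())   # ln of (A.46), with gamma = sum(l)
    return np.sum(l * np.log(alpha)) - log_zeta

print(log_mu_cont([1.2, 0.5, 2.3], [0.2, 0.3, 0.5]))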

A.8 Dirichlet Distribution

The Dirichlet distribution of the c-dimensional vector variable, \alpha \in \Delta_c, is as follows:

f(\alpha|\beta) = \mathrm{Di}_\alpha(\beta) = \frac{1}{\zeta_\alpha(\beta)} \prod_{i=1}^{c} \alpha_i^{\beta_i - 1}\,\chi_{\Delta_c}(\alpha),   (A.48)

where


\Delta_c = \left\{ \alpha \mid \alpha_i \geq 0, \sum_{i=1}^{c} \alpha_i = 1 \right\}

is the probability simplex in R^c. The vector parameter in (A.48) is \beta = [\beta_1, \beta_2, \ldots, \beta_c]', \beta_i > 0, \sum_{i=1}^{c} \beta_i = \gamma. The normalizing constant is

\zeta_\alpha(\beta) = \frac{\prod_{i=1}^{c} \Gamma(\beta_i)}{\Gamma(\gamma)},   (A.49)

where \Gamma(\cdot) is the Gamma function [93]. The first moment is given by

\overline{\alpha}_i = E_{f(\alpha|\beta)}[\alpha_i] = \frac{\beta_i}{\gamma},  i = 1, \ldots, c.   (A.50)

The expected value of the logarithm is

\overline{\ln \alpha_i} = E_{f(\alpha|\beta)}[\ln \alpha_i] = \psi_\Gamma(\beta_i) - \psi_\Gamma(\gamma),   (A.51)

where \psi_\Gamma(\cdot) = \frac{\partial}{\partial \beta} \ln \Gamma(\cdot) is the digamma (psi) function.

For notational simplicity, we define the matrix Dirichlet distribution of matrix variable T \in R^{p \times p} as follows:

\mathrm{Di}_T(\Phi) \equiv \prod_{i=1}^{p} \mathrm{Di}_{t_i}(\phi_i),

with matrix parameter \Phi \in R^{p \times p} = [\phi_1, \ldots, \phi_p]. Here, t_i and \phi_i are the ith columns of T and \Phi respectively.
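
The moments (A.50) and (A.51) can be evaluated directly with the digamma function. A minimal sketch, assuming SciPy and a hypothetical helper name:

import numpy as np
from scipy.special import digamma

def dirichlet_moments(beta):
    # First moment (A.50) and expected logarithm (A.51) of Di_alpha(beta).
    beta = np.asarray(beta, float)
    gamma_tot = beta.sum()
    mean_alpha = beta / gamma_tot                        # (A.50)
    mean_ln_alpha = digamma(beta) - digamma(gamma_tot)   # (A.51)
    return mean_alpha, mean_ln_alpha

print(dirichlet_moments([2.0, 1.0, 5.0]))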

A.9 Truncated Exponential Distribution

The truncated Exponential distribution is as follows:

f(x|k, (a, b]) = \mathrm{tExp}_x(k, (a, b]) = \frac{k}{\exp(kb) - \exp(ka)}\,\exp(xk)\,\chi_{(a,b]}(x),   (A.52)

where a < b are the boundaries of the support. Its first moment is

\overline{x} = \frac{\exp(bk)(1 - bk) - \exp(ak)(1 - ak)}{k\left(\exp(ak) - \exp(bk)\right)},   (A.53)

which is not defined for k = 0. The limit at this point is

\lim_{k \to 0} \overline{x} = \frac{a + b}{2},

which is consistent with the fact that the distribution is then uniform on the interval (a, b].
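
A small numerical sketch of (A.53) (illustrative only, not from the book) that falls back to the uniform limit (a + b)/2 when k is numerically close to zero:

import numpy as np

def texp_mean(k, a, b, eps=1e-8):
    # First moment (A.53) of tExp_x(k, (a, b]); uses the k -> 0 limit (a + b)/2
    # when |k| is below eps to avoid the 0/0 indeterminacy.
    if abs(k) < eps:
        return 0.5 * (a + b)
    num = np.exp(b * k) * (1.0 - b * k) - np.exp(a * k) * (1.0 - a * k)
    den = k * (np.exp(a * k) - np.exp(b * k))
    return num / den

print(texp_mean(2.0, 0.0, 1.0), texp_mean(0.0, 0.0, 1.0))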

References

1. R. T. Cox, “Probability, frequency and reasonable expectation,” Am. J. Phys., vol. 14, no. 1, 1946.
2. B. de Finetti, Theory of Probability: A Critical Introductory Treatment. New York: J. Wiley, 1970.
3. A. P. Quinn, Bayesian Point Inference in Signal Processing. PhD thesis, Cambridge University Engineering Dept., 1992.
4. E. T. Jaynes, “Bayesian methods: General background,” in The Fourth Annual Workshop on Bayesian/Maximum Entropy Methods in Geophysical Inverse Problems, (Calgary), 1984.
5. S. M. Kay, Modern Spectral Estimation: Theory and Application. Prentice-Hall, 1988.
6. S. L. Marple Jr., Digital Spectral Analysis with Applications. Prentice-Hall, 1987.
7. H. Jeffreys, Theory of Probability. Oxford University Press, 3 ed., 1961.
8. G. E. P. Box and G. C. Tiao, Bayesian Inference in Statistical Analysis. Addison-Wesley, 1973.
9. P. M. Lee, Bayesian Statistics, an Introduction. Chichester, New York, Brisbane, Toronto, Singapore: John Wiley & Sons, 2 ed., 1997.
10. G. Parisi, Statistical Field Theory. Reading, Massachusetts: Addison Wesley, 1988.
11. M. Opper and O. Winther, “From naive mean field theory to the TAP equations,” in Advanced Mean Field Methods (M. Opper and D. Saad, eds.), The MIT Press, 2001.
12. M. Opper and D. Saad, Advanced Mean Field Methods: Theory and Practice. Cambridge, Massachusetts: The MIT Press, 2001.
13. R. P. Feynman, Statistical Mechanics. New York: Addison-Wesley, 1972.
14. G. E. Hinton and D. van Camp, “Keeping neural networks simple by minimizing the description length of the weights,” in Proceedings of 6th Annual Workshop on Computer Learning Theory, pp. 5–13, ACM Press, New York, NY, 1993.
15. L. K. Saul, T. S. Jaakkola, and M. I. Jordan, “Mean field theory for sigmoid belief networks,” Journal of Artificial Intelligence Research, vol. 4, pp. 61–76, 1996.
16. D. J. C. MacKay, “Free energy minimization algorithm for decoding and cryptanalysis,” Electronics Letters, vol. 31, no. 6, pp. 446–447, 1995.
17. D. J. C. MacKay, “Developments in probabilistic modelling with neural networks – ensemble learning,” in Neural Networks: Artificial Intelligence and Industrial Applications. Proceedings of the 3rd Annual Symposium on Neural Networks, Nijmegen, Netherlands, 14-15 September 1995, (Berlin), pp. 191–198, Springer, 1995.
18. M. I. Jordan, Learning in graphical models. MIT Press, 1999.


19. H. Attias, “A Variational Bayesian framework for graphical models,” in Advances in Neural Information Processing Systems (T. Leen, ed.), vol. 12, MIT Press, 2000.
20. Z. Ghahramani and M. Beal, “Graphical models and variational methods,” in Advanced Mean Field Methods (M. Opper and D. Saad, eds.), The MIT Press, 2001.
21. A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of Royal Statistical Society, Series B, vol. 39, pp. 1–38, 1977.
22. R. M. Neal and G. E. Hinton, A New View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants. NATO Science Series, Dordrecht: Kluwer Academic Publishers, 1998.
23. M. J. Beal and Z. Ghahramani, “The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures,” in Bayesian Statistics 7 (J. M. Bernardo et al., eds.), Oxford University Press, 2003.
24. C. M. Bishop, “Variational principal components,” in Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN), 1999.
25. Z. Ghahramani and M. J. Beal, “Variational inference for Bayesian mixtures of factor analyzers,” Neural Information Processing Systems, vol. 12, pp. 449–455, 2000.
26. M. Sato, “Online model selection based on the variational Bayes,” Neural Computation, vol. 13, pp. 1649–1681, 2001.
27. S. J. Roberts and W. D. Penny, “Variational Bayes for generalized autoregressive models,” IEEE Transactions on Signal Processing, vol. 50, no. 9, pp. 2245–2257, 2002.
28. P. Sykacek and S. J. Roberts, “Adaptive classification by variational Kalman filtering,” in Advances in Neural Information Processing Systems 15 (S. Thrun, S. Becker, and K. Obermayer, eds.), MIT Press, 2003.
29. J. W. Miskin, Ensemble Learning for Independent Component Analysis. PhD thesis, University of Cambridge, 2000.
30. J. Pratt, H. Raiffa, and R. Schlaifer, Introduction to Statistical Decision Theory. MIT Press, 1995.
31. R. E. Kass and A. E. Raftery, “Bayes factors,” Journal of American Statistical Association, vol. 90, pp. 773–795, 1995.
32. D. Titterington, A. Smith, and U. Makov, Statistical Analysis of Finite Mixtures. New York: John Wiley, 1985.
33. E. T. Jaynes, “Clearing up mysteries—the original goal,” in Maximum Entropy and Bayesian Methods (J. Skilling, ed.), pp. 1–27, Kluwer, 1989.
34. M. Tanner, Tools for statistical inference. New York: Springer-Verlag, 1993.
35. J. J. K. O’Ruanaidh and W. J. Fitzgerald, Numerical Bayesian Methods applied to Signal Processing. Springer, 1996.
36. B. de Finetti, Theory of Probability, vol. 2. Wiley, 1975.
37. J. Bernardo and A. Smith, Bayesian Theory. Chichester, New York, Brisbane, Toronto, Singapore: John Wiley & Sons, 2 ed., 1997.
38. G. L. Bretthorst, Bayesian Spectrum Analysis and Parameter Estimation. Springer-Verlag, 1989.
39. M. Kárný and R. Kulhavý, “Structure determination of regression-type models for adaptive prediction and control,” in Bayesian Analysis of Time Series and Dynamic Models (J. Spall, ed.), New York: Marcel Dekker, 1988. Chapter 12.
40. A. Quinn, “Regularized signal identification using Bayesian techniques,” in Signal Analysis and Prediction, Birkhäuser Boston Inc., 1998.
41. R. A. Fisher, “Theory of statistical estimation,” Proc. Camb. Phil. Soc., vol. 22(V), pp. 700–725, 1925. Reproduced in [162].


42. V. Peterka, “Bayesian approach to system identification,” in Trends and Progress in System Identification (P. Eykhoff, ed.), pp. 239–304, Oxford: Pergamon Press, 1981.
43. A. W. F. Edwards, Likelihood. Cambridge Univ. Press, 1972.
44. J. D. Kalbfleisch and D. A. Sprott, “Application of likelihood methods to models involving large numbers of parameters,” J. Royal Statist. Soc., vol. B-32, no. 2, 1970.
45. R. L. Smith and J. C. Naylor, “A comparison of maximum likelihood and Bayesian estimators for the three-parameter Weibull distribution,” Appl. Statist., vol. 36, pp. 358–369, 1987.
46. R. D. Rosenkrantz, ed., E. T. Jaynes: Papers on Probability, Statistics and Statistical Physics. D. Reidel, Dordrecht-Holland, 1983.
47. G. E. P. Box and G. C. Tiao, Bayesian Statistics. Oxford: Oxford, 1961.
48. J. Berger, Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag, 1985.
49. A. Wald, Statistical Decision Functions. New York, London: John Wiley, 1950.
50. M. DeGroot, Optimal Statistical Decisions. New York: McGraw-Hill, 1970.
51. C. P. Robert, The Bayesian Choice: A Decision Theoretic Motivation. Springer Texts in Statistics, Springer-Verlag, 1994.
52. M. Kárný, J. Böhm, T. Guy, L. Jirsa, I. Nagy, P. Nedoma, and L. Tesar, Optimized Bayesian Dynamic Advising: Theory and Algorithms. London: Springer, 2005.
53. J. Makhoul, “Linear prediction: A tutorial review,” Proceedings of the IEEE, vol. 63, no. 4, pp. 561–580, 1975.
54. S. M. Kay, Fundamentals of Statistical Signal Processing. Prentice-Hall, 1993.
55. A. P. Quinn, “The performance of Bayesian estimators in the superresolution of signal parameters,” in Proc. IEEE Int. Conf. on Acoust., Sp. and Sig. Proc. (ICASSP), (San Francisco), 1992.
56. A. Zellner, An Introduction to Bayesian Inference in Econometrics. New York: J. Wiley, 1976.
57. A. Quinn, “Novel parameter priors for Bayesian signal identification,” in Proc. IEEE Int. Conf. on Acoust., Sp. and Sig. Proc. (ICASSP), (Munich), 1997.
58. R. E. Kass and A. E. Raftery, “Bayes factors and model uncertainty,” tech. rep., University of Washington, 1994.
59. S. F. Gull, “Bayesian inductive inference and maximum entropy,” in Maximum Entropy and Bayesian Methods in Science and Engineering (G. J. Erickson and C. R. Smith, eds.), Kluwer, 1988.
60. J. Skilling, “The axioms of maximum entropy,” in Maximum Entropy and Bayesian Methods in Science and Engineering. Vol. 1 (G. J. Erickson and C. R. Smith, eds.), Kluwer, 1988.
61. D. Bosq, Nonparametric Statistics for Stochastic Processes: estimation and prediction. Springer, 1998.
62. J. M. Bernardo, “Expected information as expected utility,” The Annals of Statistics, vol. 7, no. 3, pp. 686–690, 1979.
63. S. Kullback and R. Leibler, “On information and sufficiency,” Annals of Mathematical Statistics, vol. 22, pp. 79–87, 1951.
64. S. Amari, S. Ikeda, and H. Shimokawa, “Information geometry of α-projection in mean field approximation,” in Advanced Mean Field Methods (M. Opper and D. Saad, eds.), (Cambridge, Massachusetts), The MIT Press, 2001.
65. S. Amari, Differential-Geometrical Methods in Statistics. Springer, 1985.
66. C. F. J. Wu, “On the convergence properties of the EM algorithm,” The Annals of Statistics, vol. 11, pp. 95–103, 1983.


67. S. F. Gull and J. Skilling, “Maximum entropy method in image processing,” Proc. IEE, vol. F-131, October 1984.
68. S. F. Gull and J. Skilling, Quantified Maximum Entropy. MemSys5 Users’ Manual. Maximum Entropy Data Consultants Ltd., 1991.
69. A. Papoulis, “Maximum entropy and spectral estimation: a review,” IEEE Trans. on Acoust., Sp., and Sig. Proc., vol. ASSP-29, December 1981.
70. D. J. C. MacKay, Information Theory, Inference & Learning Algorithms. Cambridge University Press, 2004.
71. G. Demoment and J. Idier, “Problèmes inverses et déconvolution,” Journal de Physique IV, pp. 929–936, 1992.
72. M. Nikolova and A. Mohammad-Djafari, “Maximum entropy image reconstruction in eddy current tomography,” in Maximum Entropy and Bayesian Methods (A. Mohammad-Djafari and G. Demoment, eds.), Kluwer, 1993.
73. W. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice. London: Chapman & Hall, 1997.
74. A. Doucet, N. de Freitas, and N. Gordon, eds., Sequential Monte Carlo Methods in Practice. Springer, 2001.
75. A. F. M. Smith and A. E. Gelfand, “Bayesian statistics without tears: a sampling-resampling perspective,” The American Statistician, vol. 46, pp. 84–88, 1992.
76. T. Ferguson, “A Bayesian analysis of some nonparametric problems,” The Annals of Statistics, vol. 1, pp. 209–230, 1973.
77. S. Walker, P. Damien, P. Laud, and A. Smith, “Bayesian nonparametric inference for random distributions and related functions,” J. R. Statist. Soc., vol. 61, pp. 485–527, 2004. With discussion.
78. S. J. Press and K. Shigemasu, “Bayesian inference in factor analysis,” in Contributions to Probability and Statistics (L. J. Glesser, ed.), ch. 15, Springer Verlag, New York, 1989.
79. D. B. Rowe and S. J. Press, “Gibbs sampling and hill climbing in Bayesian factor analysis,” tech. rep., University of California, Riverside, 1998.
80. I. Jolliffe, Principal Component Analysis. Springer-Verlag, 2nd ed., 2002.
81. S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, 1993.
82. M. E. Tipping and C. M. Bishop, “Mixtures of probabilistic principal component analyzers,” tech. rep., Aston University, 1998.
83. K. Pearson, “On lines and planes of closest fit to systems of points in space,” The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, vol. 2, pp. 559–572, 1901.
84. T. W. Anderson, An Introduction to Multivariate Statistical Analysis. John Wiley and Sons, 1971.
85. M. E. Tipping and C. M. Bishop, “Probabilistic principal component analysis,” Journal of the Royal Statistical Society, Series B, vol. 61, pp. 611–622, 1998.
86. G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, London: The Johns Hopkins University Press, 1989.
87. H. Hotelling, “Analysis of a complex of statistical variables into principal components,” Journal of Educational Psychology, vol. 24, pp. 417–441, 1933.
88. D. B. Rowe, Multivariate Bayesian Statistics: Models for Source Separation and Signal Unmixing. Boca Raton, FL, USA: CRC Press, 2002.
89. T. P. Minka, “Automatic choice of dimensionality for PCA,” tech. rep., MIT, 2000.
90. V. Šmídl, The Variational Bayes Approach in Signal Processing. PhD thesis, Trinity College Dublin, 2004.


91. V. Šmídl and A. Quinn, “Fast variational PCA for functional analysis of dynamic image sequences,” in Proceedings of the 3rd International Conference on Image and Signal Processing, ISPA 03, (Rome, Italy), September 2003.
92. C. G. Khatri and K. V. Mardia, “The von Mises-Fisher distribution in orientation statistics,” Journal of Royal Statistical Society B, vol. 39, pp. 95–106, 1977.
93. M. Abramowitz and I. Stegun, Handbook of Mathematical Functions. New York: Dover Publications, 1972.
94. I. Buvat, H. Benali, and R. Di Paola, “Statistical distribution of factors and factor images in factor analysis of medical image sequences,” Physics in Medicine and Biology, vol. 43, no. 6, pp. 1695–1711, 1998.
95. H. Benali, I. Buvat, F. Frouin, J. P. Bazin, and R. Di Paola, “A statistical model for the determination of the optimal metric in factor analysis of medical image sequences (FAMIS),” Physics in Medicine and Biology, vol. 38, no. 8, pp. 1065–1080, 1993.
96. J. Harbert, W. Eckelman, and R. Neumann, Nuclear Medicine. Diagnosis and Therapy. New York: Thieme, 1996.
97. S. Kotz and N. Johnson, Encyclopedia of statistical sciences. New York: John Wiley, 1985.
98. T. W. Anderson, “Estimating linear statistical relationships,” Annals of Statistics, vol. 12, pp. 1–45, 1984.
99. J. Fine and A. Pouse, “Asymptotic study of the multivariate functional model. Application to the metric of choice in Principal Component Analysis,” Statistics, vol. 23, pp. 63–83, 1992.
100. F. Pedersen, M. Bergstroem, E. Bengtsson, and B. Langstroem, “Principal component analysis of dynamic positron emission tomography studies,” European Journal of Nuclear Medicine, vol. 21, pp. 1285–1292, 1994.
101. F. Hermansen and A. A. Lammertsma, “Linear dimension reduction of sequences of medical images: I. optimal inner products,” Physics in Medicine and Biology, vol. 40, pp. 1909–1920, 1995.
102. M. Šámal, M. Kárný, H. Surová, E. Maríková, and Z. Dienstbier, “Rotation to simple structure in factor analysis of dynamic radionuclide studies,” Physics in Medicine and Biology, vol. 32, pp. 371–382, 1987.
103. M. Kárný, M. Šámal, and J. Böhm, “Rotation to physiological factors revised,” Kybernetika, vol. 34, no. 2, pp. 171–179, 1998.
104. A. Hyvärinen, “Survey on independent component analysis,” Neural Computing Surveys, vol. 2, pp. 94–128, 1999.
105. V. Šmídl, A. Quinn, and Y. Maniouloux, “Fully probabilistic model for functional analysis of medical image data,” in Proceedings of the Irish Signals and Systems Conference, (Belfast), pp. 201–206, University of Belfast, June 2004.
106. J. R. Magnus and H. Neudecker, Matrix Differential Calculus. Wiley, 2001.
107. M. Šámal and H. Bergmann, “Hybrid phantoms for testing the measurement of regional dynamics in dynamic renal scintigraphy,” Nuclear Medicine Communications, vol. 19, pp. 161–171, 1998.
108. L. Ljung and T. Söderström, Theory and practice of recursive identification. Cambridge; London: MIT Press, 1983.
109. E. Mosca, Optimal, Predictive, and Adaptive Control. Prentice Hall, 1994.
110. T. Söderström and R. Stoica, “Instrumental variable methods for system identification,” Lecture Notes in Control and Information Sciences, vol. 57, 1983.
111. D. Clarke, Advances in Model-Based Predictive Control. Oxford: Oxford University Press, 1994.


112. R. Patton, P. Frank, and R. Clark, Fault Diagnosis in Dynamic Systems: Theory & Applications. Prentice Hall, 1989.
113. R. Kalman, “A new approach to linear filtering and prediction problem,” Trans. ASME, Ser. D, J. Basic Eng., vol. 82, pp. 34–45, 1960.
114. F. Gustafsson, Adaptive Filtering and Change Detection. Chichester: Wiley, 2000.
115. V. Šmídl, A. Quinn, M. Kárný, and T. V. Guy, “Robust estimation of autoregressive processes using a mixture-based filter bank,” System & Control Letters, vol. 54, pp. 315–323, 2005.
116. K. Astrom and B. Wittenmark, Adaptive Control. Reading, Massachusetts: Addison-Wesley, 1989.
117. R. Koopman, “On distributions admitting a sufficient statistic,” Transactions of American Mathematical Society, vol. 39, p. 399, 1936.
118. L. Ljung, System Identification - Theory for the User. Prentice-Hall. Englewood Cliffs, N.J.: D. van Nostrand Company Inc., 1987.
119. A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Prentice-Hall, 1989.
120. V. Šmídl and A. Quinn, “Mixture-based extension of the AR model and its recursive Bayesian identification,” IEEE Transactions on Signal Processing, vol. 53, no. 9, pp. 3530–3542, 2005.
121. R. Kulhavý, “Recursive nonlinear estimation: A geometric approach,” Automatica, vol. 26, no. 3, pp. 545–555, 1990.
122. R. Kulhavý, “Implementation of Bayesian parameter estimation in adaptive control and signal processing,” The Statistician, vol. 42, pp. 471–482, 1993.
123. R. Kulhavý, “Recursive Bayesian estimation under memory limitations,” Kybernetika, vol. 26, pp. 1–20, 1990.
124. R. Kulhavý, Recursive Nonlinear Estimation: A Geometric Approach, vol. 216 of Lecture Notes in Control and Information Sciences. London: Springer-Verlag, 1996.
125. R. Kulhavý, “A Bayes-closed approximation of recursive non-linear estimation,” International Journal Adaptive Control and Signal Processing, vol. 4, pp. 271–285, 1990.
126. C. S. Wong and W. K. Li, “On a mixture autoregressive model,” Journal of the Royal Statistical Society: Series B, vol. 62, pp. 95–115, 2000.
127. M. Kárný, J. Böhm, T. Guy, and P. Nedoma, “Mixture-based adaptive probabilistic control,” International Journal of Adaptive Control and Signal Processing, vol. 17, no. 2, pp. 119–132, 2003.
128. H. Attias, J. C. Platt, A. Acero, and L. Deng, “Speech denoising and dereverberation using probabilistic models,” in Advances in Neural Information Processing Systems, pp. 758–764, 2001.
129. J. Stutz and P. Cheeseman, “AutoClass - a Bayesian approach to classification,” in Maximum Entropy and Bayesian Methods (J. Skilling and S. Sibisi, eds.), Dordrecht: Kluwer, 1995.
130. M. Funaro, M. Marinaro, A. Petrosino, and S. Scarpetta, “Finding hidden events in astrophysical data using PCA and mixture of Gaussians clustering,” Pattern Analysis & Applications, vol. 5, pp. 15–22, 2002.
131. S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan, 1994.
132. K. Warwick and M. Kárný, Computer-Intensive Methods in Control and Signal Processing: Curse of Dimensionality. Birkhauser, 1997.
133. J. Andrýsek, “Approximate recursive Bayesian estimation of dynamic probabilistic mixtures,” in Multiple Participant Decision Making (J. Andrýsek, M. Kárný, and J. Kracík, eds.), pp. 39–54, Adelaide: Advanced Knowledge International, 2004.


134. S. Roweis and Z. Ghahramani, “A unifying review of linear Gaussian models,” Neural Computation, vol. 11, pp. 305–345, 1999.
135. A. P. Quinn, “Threshold-free Bayesian estimation using censored marginal inference,” in Signal Processing VI: Proc. of the 6th European Sig. Proc. Conf. (EUSIPCO-’92). Vol. 2, (Brussels), 1992.
136. A. P. Quinn, “A consistent, numerically efficient Bayesian framework for combining the selection, detection and estimation tasks in model-based signal processing,” in Proc. IEEE Int. Conf. on Acoust., Sp. and Sig. Proc., (Minneapolis), 1993.
137. “Project IST-1999-12058, decision support tool for complex industrial processes based on probabilistic data clustering (ProDaCTool),” tech. rep., 1999–2002.
138. P. Nedoma, M. Kárný, and I. Nagy, “MixTools, MATLAB toolbox for mixtures: User’s guide,” tech. rep., ÚTIA AV CR, Praha, 2001.
139. A. Quinn, P. Ettler, L. Jirsa, I. Nagy, and P. Nedoma, “Probabilistic advisory systems for data-intensive applications,” International Journal of Adaptive Control and Signal Processing, vol. 17, no. 2, pp. 133–148, 2003.
140. Z. Chen, “Bayesian filtering: From Kalman filters to particle filters, and beyond,” tech. rep., Adaptive Syst. Lab., McMaster University, Hamilton, ON, Canada, 2003.
141. B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House Publishers, 2004.
142. E. Daum, “New exact nonlinear filters,” in Bayesian Analysis of Time Series and Dynamic Models (J. Spall, ed.), New York: Marcel Dekker, 1988.
143. P. Vidoni, “Exponential family state space models based on a conjugate latent process,” J. Roy. Statist. Soc., Ser. B, vol. 61, pp. 213–221, 1999.
144. A. H. Jazwinski, Stochastic Processes and Filtering Theory. New York: Academic Press, 1979.
145. R. Kulhavý and M. B. Zarrop, “On a general concept of forgetting,” International Journal of Control, vol. 58, no. 4, pp. 905–924, 1993.
146. R. Kulhavý, “Restricted exponential forgetting in real-time identification,” Automatica, vol. 23, no. 5, pp. 589–600, 1987.
147. C. F. So, S. C. Ng, and S. H. Leung, “Gradient based variable forgetting factor RLS algorithm,” Signal Processing, vol. 83, pp. 1163–1175, 2003.
148. R. H. Middleton, G. C. Goodwin, D. J. Hill, and D. Q. Mayne, “Design issues in adaptive control,” IEEE Transactions on Automatic Control, vol. 33, no. 1, pp. 50–58, 1988.
149. R. Elliot, L. Assoun, and J. Moore, Hidden Markov Models. New York: Springer-Verlag, 1995.
150. V. Šmídl and A. Quinn, “Bayesian estimation of non-stationary AR model parameters via an unknown forgetting factor,” in Proceedings of the IEEE Workshop on Signal Processing, (New Mexico), pp. 100–105, August 2004.
151. M. H. Vellekoop and J. M. C. Clark, “A nonlinear filtering approach to changepoint detection problems: Direct and differential-geometric methods,” SIAM Journal on Control and Optimization, vol. 42, no. 2, pp. 469–494, 2003.
152. G. Bierman, Factorization Methods for Discrete Sequential Estimation. New York: Academic Press, 1977.
153. G. D. Forney, “The Viterbi algorithm,” Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278, 1973.
154. J. Deller, J. Proakis, and J. Hansen, Discrete-Time Processing of Speech Signals. Macmillan, New York, 1993.
155. L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Prentice-Hall, 1978.


156. J. M. Bernardo, “Approximations in statistics from a decision-theoretical viewpoint,” in Probability and Bayesian Statistics (R. Viertl, ed.), pp. 53–60, New York: Plenum, 1987.
157. N. D. Le, L. Sun, and J. V. Zidek, “Bayesian spatial interpolation and backcasting using Gaussian-generalized inverted Wishart model,” tech. rep., University of British Columbia, 1999.
158. E. T. Jaynes, Probability Theory: The Logic of Science. Cambridge University Press, 2003.
159. A. T. James, “Distribution of matrix variates and latent roots derived from normal samples,” Annals of Mathematical Statistics, vol. 35, pp. 475–501, 1964.
160. K. Mardia and P. E. Jupp, Directional Statistics. Chichester, England: John Wiley and Sons, 2000.
161. T. D. Downs, “Orientational statistics,” Biometrika, vol. 59, pp. 665–676, 1972.
162. R. A. Fisher, Contributions to Mathematical Statistics. John Wiley and Sons, 1950.

Index

activity curve, 91
additive Gaussian noise model, 17
advisory system, 140
augmented model, 124
Automatic Rank Determination (ARD) property, 68, 85, 86, 99, 103
AutoRegressive (AR) model, 111, 112, 114, 130, 173, 179, 193
AutoRegressive model with eXogenous variables (ARX), 113, 181

Bayesian filtering, 47, 146
Bayesian smoothing, 146

certainty equivalence approximation, 43, 45, 188
changepoints, 173
classical estimators, 18
combinatoric explosion, 130
components, 130
conjugate distribution, 19, 111, 120
conjugate parameter distribution to DEF family (CDEF), 113, 146, 153, 179, 191
correspondence analysis, 86, 92, 103
covariance matrix, 117
covariance method, 116
cover-up rule, 37
criterion of cumulative variance, 83, 86

data update, 21, 146, 149
digamma (psi) function, 133, 164, 211
Dirac δ-function, 44
Dirichlet distribution, 132, 134, 159, 164, 215
discount schedules, 139, 174
distributional estimation, 47
dyad, 115, 181, 190
Dynamic Exponential Family (DEF), 113, 153, 179
Dynamic Exponential Family with Hidden variables (DEFH), 126
Dynamic Exponential Family with Separable parameters (DEFS), 120
dynamic mixture model, 140

economic SVD, 59
EM algorithm, 44, 124
empirical distribution, 46, 152
exogenous variables, 113, 139, 180
exponential family, 112
Exponential Family with Hidden variables (EFH), 126
exponential forgetting, 154
Extended AR (EAR) model, 179, 180
extended information matrix, 115, 181
extended regressor, 115, 131, 180

factor images, 91
factor analysis, 51
factor curve, 91
FAMIS model, 93, 94, 102
Fast Variational PCA (FVPCA), 68
filter-bank, 182, 192, 196
forgetting, 139, 191
forgetting factor, 153


Functional analysis of medical image sequences, 89

Gamma distribution, 212
geometric approach, 128
Gibbs sampling, 61
global approximation, 117, 128

Hadamard product, 64, 98, 120, 160
Hamming distance, 167
Hidden Markov Model (HMM), 158
hidden variable, 124, 182
Highest Posterior Density (HPD) region, 18, 79, 195
hyper-parameter, 19, 62, 95

importance function, 152
Independent Component Analysis (ICA), 94
independent, identically-distributed (i.i.d.) noise, 58
independent, identically-distributed (i.i.d.) process, 168
independent, identically-distributed (i.i.d.) sampling, 46, 152
inferential breakpoint, 49
informative prior, 49
initialization, 136, 174
innovations process, 114, 130, 180
inverse-Wishart distribution, 211
Iterative Variational Bayes (VB) algorithm, 32, 53, 118, 122, 134, 174

Jacobian, 180
Jeffreys’ notation, 16, 110
Jeffreys’ prior, 4, 71
Jensen’s inequality, 32, 77

Kalman filter, 146, 155, 196
KL divergence for Minimum Risk (MR) calculations, 28, 39, 128
KL divergence for Variational Bayes (VB) calculations, 28, 40, 121, 147
Kronecker function, 44
Kronecker product, 99
Kullback-Leibler (KL) divergence, 27

Laplace approximation, 62
LD decomposition, 182
Least Squares (LS) estimation, 18, 116
local approximation, 117
Low-Pass Filter (LPF), 200

Maple, 68
Markov chain, 47, 113, 158, 182
Markov-Chain Monte Carlo (MCMC) methods, 47
MATLAB, 64, 120, 140
matrix Dirichlet distribution, 160, 216
matrix Normal distribution, 58, 209
Maximum a Posteriori (MAP) estimation, 17, 44, 49, 188
Maximum Likelihood (ML) estimation, 17, 44, 57, 69
medical imaging, 89
Minimum Mean Squared-Error (MMSE) criterion, 116
missing data, 124, 132
MixTools, 140
Mixture-based Extended AutoRegressive (MEAR) model, 183, 191, 192, 196
moment fitting, 129
Monte Carlo (MC) simulation, 80, 83, 138, 166
Multinomial distribution, 132, 134, 159, 162, 185, 215
Multinomial distribution of continuous argument, 163, 185, 215
multivariate AutoRegressive (AR) model, 116
multivariate Normal distribution, 209

natural gradient technique, 32
non-informative prior, 18, 103, 114, 172
non-minimal conjugate distribution, 121
non-smoothing restriction, 151, 162, 187
nonparametric prior, 47
normal equations, 116
Normal-inverse-Gamma distribution, 115, 170, 179, 211
Normal-inverse-Wishart distribution, 132, 134, 210
normalizing constant, 15, 22, 113, 116, 134, 169, 187
nuclear medicine, 89

observation model, 21, 102, 111, 112, 146, 159, 179, 183
one-step approximation, 117, 121, 129


One-step Fixed-Form (FF) Approximation, 135
optimal importance function, 165
orthogonal PPCA model, 70
Orthogonal Variational PCA (OVPCA), 76
outliers, 192

parameter evolution model, 21, 146
particle filtering, 47, 145, 152
particles, 152
Poisson distribution, 91
precision matrix, 117
precision parameter, 58, 93
prediction, 22, 116, 141, 181
Principal Component Analysis (PCA), 57
probabilistic editor, 129
Probabilistic Principal Component Analysis (PPCA), 58
probability fitting, 129
probability simplex, 216
ProDaCTool, 140
proximity measure, 27, 128
pseudo-stationary window, 155

Quasi-Bayes (QB) approximation, 43, 128, 133, 140, 150, 186

rank, 59, 77, 83, 99
Rao-Blackwellization, 165
recursive algorithm, 110
Recursive Least Squares (RLS) algorithm, 116, 168
regressor, 111, 180
regularization, 18, 48, 114, 154
relative entropy, 7
Restricted VB (RVB) approximation, 128, 133, 186
rotational ambiguity, 60, 70

scalar additive decomposition, 3, 37
scalar multiplicative decomposition, 120
scalar Normal distribution, 120, 209
scaling, 92
scaling ambiguity, 48
separable-in-parameters family, 34, 63, 118
shaping parameters, 3, 34, 36
sign ambiguity, 49, 70
signal flowgraph, 114
Signal-to-Noise Ratio (SNR), 201
Singular Value Decomposition (SVD), 59, 213
spanning property, 191
speech reconstruction, 201
static mixture models, 135
Stiefel manifold, 71, 213
stochastic distributional approximation, 46
stressful regime, 136
Student’s t-distribution, 5, 116, 141, 200, 211
sufficient statistics, 19

time update, 21, 149, 153
transition matrix, 159
truncated Exponential distribution, 172, 216
truncated Normal distribution, 73, 211

uniform prior, 18

Variational Bayes (VB) method, 33, 126, 149
Variational EM (VEM), 32
Variational PCA (VPCA), 68
VB-approximation, 29, 51
VB-conjugacy, 149
VB-equations, 35
VB-filtering, 148
VB-marginalization, 33
VB-marginals, 3, 29, 32, 34
VB-moments, 3, 35, 36
VB-observation model, 121, 125, 148, 172, 184
VB-parameter predictor, 148, 184
VB-smoothing, 148
vec-transpose operator, 115
Viterbi-Like (VL) Approximation, 188
von Mises-Fisher distribution, 73, 212