
Digital Audio Restoration - a statistical model based approach



Simon J. Godsill and Peter J.W. Rayner

September 21, 1998


We dedicate this book to our families, especially to Rachel (+Rufus) and Ann.


Preface

The application of digital signal processing (DSP) to problems in audio has been an area of growing importance since the pioneering DSP work of the 1960s and 70s. In the 1980s, DSP micro-chips became sufficiently powerful to handle the complex processing operations required for sound restoration in real-time, or close to real-time. This led to the first commercially available restoration systems, with companies such as CEDAR Audio Ltd. in the UK and Sonic Solutions in the US selling dedicated systems world-wide to recording studios, broadcasting companies, media archives and film studios. Vast amounts of important audio material, ranging from historic recordings of the last century to relatively recent recordings on analogue or even digital tape media, were noise-reduced and re-released on CD for the increasingly quality-conscious music enthusiast. Indeed, the first restorations were a revelation in that clicks, crackles and hiss could for the first time be almost completely eliminated from recordings which might otherwise be un-releasable in CD format.

Until recently, however, digital audio processing has required high-powered computational engines which were only available to large institutions who could afford to use the sophisticated digital remastering technology. With the advent of compact disc and other digital audio formats, followed by the increased accessibility of home computing, digital audio processing is now available to anyone who owns a PC with sound card, and will be of increasing importance, in association with digital video, as the multimedia revolution continues into the next millennium. Digital audio restoration will thus find increasing application to sound recordings from the internet, home recordings and speech, and high-quality noise-reducers will become a standard part of any computer system and hifi system, alongside speech recognisers and image processors.

In this book we draw upon extensive experience in the commercial world of sound restoration¹ and in the academic community, to give a comprehensive overview of the principles behind the current technology, as implemented in the commercial restoration systems of today. Furthermore, if the current technology can be regarded as a 'first phase' in audio restoration, then the later chapters of the book outline a 'second phase' of more sophisticated statistical methods which are aimed at achieving higher fidelity to the original recorded sound and at addressing problems which cannot currently be handled by commercial systems. It is anticipated that new methods such as these will form the basis of future restoration systems.

¹Both authors were founding members of CEDAR (Computer Enhanced Digital Audio Restoration), the Cambridge-based audio restoration company.


We acknowledge with gratitude the help of numerous friends and colleagues in the production and research for this book: to Dr Bill Fitzgerald for advice and discussion on Bayesian methods; to Dr Anil Kokaram for support, friendship, humour and technical assistance throughout the research period; to Dr Saeed Vaseghi, whose doctoral research work initiated audio restoration research in Cambridge; to Dr Christopher Roads, whose unbounded energy and enthusiasm made the CEDAR project a reality; to Dave Betts, Gordon Reid and all at CEDAR; to a host of proof-readers who gave their assistance willingly and often at short notice, including Tim Clapp, Colin Campbell, Paul Walmsley, Dr Mike Davies, Dr Arnaud Doucet, Jacek Noga, Steve Armstrong and Rachel Godsill; and finally to the many others who have assisted directly and indirectly over the years.

Simon Godsill
Cambridge, 1998.


Glossary

AR      autoregressive
ARMA    autoregressive moving-average
CDF     cumulative distribution function
DFT     discrete Fourier transform
DTFT    discrete time Fourier transform
EM      expectation-maximisation
FFT     fast Fourier transform
FIR     finite impulse response
i.i.d.  independent, identically distributed
IIR     infinite impulse response
LS      least squares
MA      moving average
MAP     maximum a posteriori
MCMC    Markov chain Monte Carlo
ML      maximum likelihood
MMSE    minimum mean squared error
MSE     mean squared error
PDF     probability density function
PMF     probability mass function
RLS     recursive least squares
SNR     signal to noise ratio
w.r.t.  with respect to


Notation

Obscure mathematical notation is avoided wherever possible. However, the following glossary of basic notation, which is adopted unless otherwise stated, may be a useful reference.

Scalars                          Lower/upper case, e.g. x_t or E
Column vectors                   Bold lower case, e.g. x
Matrices                         Bold upper case, e.g. M
p(.)                             Probability distribution (density or mass function)
f(.)                             Probability density function
F(.)                             Cumulative distribution function
E[ ]                             Expected value
N(µ, σ²) or N(θ|µ, σ²)           Univariate normal distribution, mean µ, covariance σ²
N_q(µ, C) or N_q(θ|µ, C)         q-variate normal distribution
G(α, β) or G(v|α, β)             Gamma distribution
IG(α, β) or IG(v|α, β)           Inverted gamma distribution
θ ∼ p(θ)                         θ is a random sample from p(θ)
a = [a_1 a_2 ... a_P]^T          Vector of autoregressive parameters
e = [e_P e_{P+1} ... e_{N-1}]^T  Vector of autoregressive excitation samples
σ²_e                             Variance of excitation sequence
σ²_{v_t}                         Variance of t-th observation noise sample
y = [y_0 y_1 ... y_{N-1}]^T      Observed (noisy) data vector
x = [x_0 x_1 ... x_{N-1}]^T      Underlying signal vector
i = [i_0 i_1 ... i_{N-1}]^T      Vector of binary noise indicator values
I                                The identity matrix
0_n                              All-zero column vector, length n
0_{n×m}                          All-zero (n × m)-dimensional matrix
1_n                              All-unity column vector, length n
trace()                          Trace of a matrix
^T                               Transpose of a matrix
A = {a_1, ..., a_M}              The set A, containing M elements
A ∪ B                            Union
A ∩ B                            Intersection
A^c                              Complement
∅                                Empty set
a ∈ A                            a is a member of set A
A ⊂ B                            A is a subset of B
[a, b]                           Real numbers x such that a ≤ x ≤ b
(a, b)                           Real numbers x such that a < x < b
ℝ                                The real numbers: ℝ = (−∞, +∞)
ℤ                                The integers: −∞, ..., −1, 0, 1, ..., ∞
{ω : E}                          'All ω's such that expression E is True'


Contents

1 Introduction 1
  1.1 The History of recording – a brief overview 2
  1.2 Sound restoration – analogue and digital 5
  1.3 Classes of degradation and overview of the book 6
  1.4 A reader's guide 10

I Fundamentals 13

2 Digital Signal Processing 15
  2.1 The Nyquist sampling theorem 16
  2.2 The discrete time Fourier transform (DTFT) 19
  2.3 Discrete time convolution 20
  2.4 The z-transform 22
    2.4.1 Definition 23
    2.4.2 Transfer function and frequency response 23
    2.4.3 Poles, zeros and stability 23
  2.5 Digital filters 24
    2.5.1 Infinite-impulse-response (IIR) filters 24
    2.5.2 Finite-impulse-response (FIR) filters 25
  2.6 The discrete Fourier transform (DFT) 25
  2.7 Data windows 27
    2.7.1 Continuous signals 27
    2.7.2 Discrete-time signals 31
  2.8 The fast Fourier transform (FFT) 34
  2.9 Conclusion 38

3 Probability Theory and Random Processes 39
  3.1 Random events and probability 39
    3.1.1 Frequency-based interpretation of probability 40
  3.2 Probability spaces 40
  3.3 Fundamental results of event-based probability 42
    3.3.1 Conditional probability 42
    3.3.2 Bayes rule 43
    3.3.3 Total probability 43
    3.3.4 Independence 44
  3.4 Random variables 44
  3.5 Definition: random variable 45
  3.6 The probability distribution of a random variable 45
    3.6.1 Probability mass function (PMF) (discrete RVs) 46
    3.6.2 Cumulative distribution function (CDF) 46
    3.6.3 Probability density function (PDF) 46
  3.7 Conditional distributions and Bayes rule 47
    3.7.1 Conditional PMF - discrete RVs 48
    3.7.2 Conditional CDF 49
    3.7.3 Conditional PDF 49
  3.8 Expectation 52
    3.8.1 Characteristic functions 53
  3.9 Functions of random variables 54
    3.9.1 Differentiable mappings 55
  3.10 Random vectors 56
  3.11 Conditional densities and Bayes rule 58
  3.12 Functions of random vectors 58
  3.13 The multivariate Gaussian 59
  3.14 Random signals 61
  3.15 Definition: random process 64
    3.15.1 Mean values and correlation functions 64
    3.15.2 Stationarity 65
    3.15.3 Power spectra 66
    3.15.4 Linear systems and random processes 67
  3.16 Conclusion 67

4 Parameter Estimation, Model Selection and Classification 69
  4.1 Parameter estimation 70
    4.1.1 The general linear model 70
    4.1.2 Maximum likelihood (ML) estimation 71
    4.1.3 Bayesian estimation 73
      4.1.3.1 Posterior inference and Bayesian cost functions 75
      4.1.3.2 Marginalisation for elimination of unwanted parameters 77
      4.1.3.3 Choice of priors 78
    4.1.4 Bayesian Decision theory 79
      4.1.4.1 Calculation of the evidence, p(x | s_i) 80
      4.1.4.2 Determination of the MAP state estimate 82
    4.1.5 Sequential Bayesian classification 82
  4.2 Signal modelling 85
  4.3 Autoregressive (AR) modelling 86
    4.3.1 Statistical modelling and estimation of AR models 87
  4.4 State-space models, sequential estimation and the Kalman filter 90
    4.4.1 The prediction error decomposition 91
    4.4.2 Relationships with other sequential schemes 92
  4.5 Expectation-maximisation (EM) for MAP estimation 92
  4.6 Markov chain Monte Carlo (MCMC) 93
  4.7 Conclusions 95

II Basic Restoration Procedures 97

5 Removal of Clicks 99
  5.1 Modelling of clicks 100
  5.2 Interpolation of missing samples 101
    5.2.1 Interpolation for Gaussian signals with known covariance structure 103
      5.2.1.1 Incorporating a noise model 105
    5.2.2 Autoregressive (AR) model-based interpolation 106
      5.2.2.1 The least squares AR (LSAR) interpolator 106
      5.2.2.2 The MAP AR interpolator 107
      5.2.2.3 Examples of the LSAR interpolator 108
      5.2.2.4 The case of unknown AR model parameters 108
    5.2.3 Adaptations to the AR model-based interpolator 110
      5.2.3.1 Pitch-based extension to the AR interpolator 111
      5.2.3.2 Interpolation with an AR + basis function representation 111
      5.2.3.3 Random sampling methods 116
      5.2.3.4 Incorporating a noise model 119
      5.2.3.5 Sequential methods 119
    5.2.4 ARMA model-based interpolation 122
      5.2.4.1 Results from the ARMA interpolator 125
    5.2.5 Other methods 126
  5.3 Detection of clicks 127
    5.3.1 Autoregressive (AR) model-based click detection 128
      5.3.1.1 Analysis and limitations 131
      5.3.1.2 Adaptations to the basic AR detector 131
      5.3.1.3 Matched filter detector 132
      5.3.1.4 Other models 133
  5.4 Statistical methods for the treatment of clicks 133
  5.5 Discussion 134

6 Hiss Reduction 135
  6.1 Spectral domain methods 136
    6.1.1 Noise reduction functions 138
      6.1.1.1 The Wiener solution 139
      6.1.1.2 Spectral subtraction and power subtraction 140
    6.1.2 Artefacts and 'musical noise' 141
    6.1.3 Improving spectral domain methods 145
      6.1.3.1 Eliminating musical noise 145
      6.1.3.2 Advanced noise reduction functions 148
      6.1.3.3 Psychoacoustical methods 148
    6.1.4 Other transform domain methods 148
  6.2 Model-based methods 149
    6.2.1 Discussion 149

III Advanced Topics 151

7 Removal of Low Frequency Noise Pulses 153
  7.1 Existing methods 154
  7.2 Separation of AR processes 156
  7.3 Restoration of transient noise pulses 157
    7.3.1 Modified separation algorithm 158
    7.3.2 Practical considerations 160
      7.3.2.1 Detection vector i 161
      7.3.2.2 AR process for true signal x1 161
      7.3.2.3 AR process for noise transient x2 162
    7.3.3 Experimental evaluation 162
  7.4 Kalman filter implementation 163
  7.5 Conclusion 163

8 Restoration of Pitch Variation Defects 171
  8.1 Overview 172
  8.2 Frequency tracking 175
  8.3 Generation of pitch variation curve 176
    8.3.1 Bayesian estimator 178
    8.3.2 Prior models 180
      8.3.2.1 Autoregressive (AR) model 180
      8.3.2.2 Smoothness Model 181
      8.3.2.3 Deterministic prior models for the pitch variation 182
    8.3.3 Discontinuities in frequency tracks 182
    8.3.4 Experimental results in pitch curve generation 183
      8.3.4.1 Trials using extract 'Viola' (synthetic pitch variation) 184
      8.3.4.2 Trials using extract 'Midsum' (non-synthetic pitch variation) 185
  8.4 Restoration of the signal 185
  8.5 Conclusion 190

9 A Bayesian Approach to Click Removal 191
  9.1 Modelling framework for click-type degradations 192
  9.2 Bayesian detection 193
    9.2.1 Posterior probability for i 194
    9.2.2 Detection for autoregressive (AR) processes 195
      9.2.2.1 Density function for signal, p_x(.) 196
      9.2.2.2 Density function for noise amplitudes, p_{n(i)|i}(.) 197
      9.2.2.3 Likelihood for Gaussian AR data 197
      9.2.2.4 Reformulation as a probability ratio test 199
    9.2.3 Selection of optimal detection state estimate i 200
    9.2.4 Computational complexity for the block-based algorithm 201
  9.3 Extensions to the Bayes detector 201
    9.3.1 Marginalised distributions 201
    9.3.2 Noisy data 202
    9.3.3 Multi-channel detection 203
    9.3.4 Relationship to existing AR-based detectors 203
    9.3.5 Sequential Bayes detection 203
  9.4 Discussion 204

10 Bayesian Sequential Click Removal 205
  10.1 Overview of the method 206
  10.2 Recursive update for posterior state probability 207
    10.2.1 Full update scheme 209
    10.2.2 Computational complexity 210
    10.2.3 Kalman filter implementation of the likelihood update 210
    10.2.4 Choice of noise generator prior p(i) 211
  10.3 Algorithms for selection of the detection vector 211
    10.3.1 Alternative risk functions 213
  10.4 Summary 213

11 Implementation and Experimental Results for Bayesian Detection 215
  11.1 Block-based detection 216
    11.1.1 Search procedures for the MAP detection estimate 216
    11.1.2 Experimental evaluation 219
  11.2 Sequential detection 222
    11.2.1 Synthetic AR data 225
    11.2.2 Real data 226
  11.3 Conclusion 227

12 Fully Bayesian Restoration using EM and MCMC 233
  12.1 A review of some relevant work from other fields in Monte Carlo methods 234
  12.2 Model specification 235
    12.2.1 Noise specification 235
      12.2.1.1 Continuous noise source 235
    12.2.2 Signal specification 236
  12.3 Priors 237
    12.3.1 Prior distribution for noise variances 237
    12.3.2 Prior for detection indicator variables 239
    12.3.3 Prior for signal model parameters 239
  12.4 EM algorithm 239
  12.5 Gibbs sampler 241
    12.5.1 Interpolation 242
    12.5.2 Detection 244
    12.5.3 Detection in sub-blocks 246
  12.6 Results for EM and Gibbs sampler interpolation 248
  12.7 Implementation of Gibbs sampler detection/interpolation 249
    12.7.1 Sampling scheme 249
    12.7.2 Computational requirements 251
      12.7.2.1 Comparison of computation with existing techniques 253
  12.8 Results for Gibbs sampler detection/interpolation 254
    12.8.1 Evaluation with synthetic data 254
    12.8.2 Evaluation with real data 256
    12.8.3 Robust initialisation 257
    12.8.4 Processing for audio evaluation 258
    12.8.5 Discussion of MCMC applied to audio 259

13 Summary and Future Research Directions 271
  13.1 Future directions and new areas 272

A Probability Densities and Integrals 275
  A.1 Univariate Gaussian 275
  A.2 Multivariate Gaussian 275
  A.3 Gamma density 277
  A.4 Inverted Gamma distribution 277

B Matrix Inverse Updating Results and Associated Properties 279

C Exact Likelihood for AR Process 281

D Derivation of Likelihood for i 283
  D.1 Gaussian noise bursts 284

E Marginalised Bayesian Detector 285

F Derivation of Sequential Update Formulae 287
  F.1 Update for i_{n+1} = 0 287
  F.2 Update for i_{n+1} = 1 289

G Derivations for EM-based Interpolation 293
  G.1 Signal expectation, E_x 294
  G.2 Noise expectation, E_n 296
  G.3 Maximisation step 297

H Derivations for Gibbs Sampler 299
  H.1 Gaussian/inverted-gamma scale mixtures 299
  H.2 Posterior distributions 300
    H.2.1 Joint posterior 300
    H.2.2 Conditional posteriors 302
  H.3 Sampling the prior hyperparameters 303

References 305

Index 321


1 Introduction

The introduction of high quality digital audio media such as Compact Disc (CD) and Digital Audio Tape (DAT) has dramatically raised general awareness and expectations about sound quality in all types of recordings. This, combined with an upsurge of interest in historical and nostalgic material, has led to a growing requirement for the restoration of degraded sources ranging from the earliest recordings made on wax cylinders in the nineteenth century, through disc recordings (78rpm, LP, etc.) and finally magnetic tape recording technology, which has been available since the 1950s. Noise reduction may occasionally be required even in a contemporary digital recording if background noise is judged to be intrusive.

Degradation of an audio source will be considered as any undesirable modification to the audio signal which occurs as a result of (or subsequent to) the recording process. For example, in a recording made direct-to-disc from a microphone, degradations could include noise in the microphone and amplifier as well as noise in the disc cutting process. Further noise may be introduced by imperfections in the pressing material, transcription to other media or wear and tear of the medium itself. Examples of such noise can be seen in the electron micrographs shown in figures 1.1-1.3. In figure 1.1 we can clearly see the specks of dust on the groove walls and also the granularity in the pressing material, which can be seen sticking out of the walls. In figure 1.2 the groove wall signal modulation can be more clearly seen. In figure 1.3, a broken record is seen almost end-on.


FIGURE 1.1. Electron micrograph showing dust and granular particles in the grooves of a 78rpm gramophone disc.

Note the fragments of broken disc which fill the grooves¹. We do not strictly consider any noise present in the recording environment, such as audience noise at a musical performance, to be degradation, since this is part of the 'performance'. Removal of such performance interference is a related topic which is considered in other applications, such as speaker separation for hearing aid design. An ideal restoration would then reconstruct the original sound source exactly as received by the transducing equipment (microphone, acoustic horn, etc.). Of course, this ideal can never be achieved perfectly in practice, and methods can only be devised which come close according to some suitable error criterion. This should ideally be based on the perceptual characteristics of the human listener.

1.1 The History of recording – a brief overview

The ideas behind sound recording² began as early as the mid-nineteenth century with the invention by Frenchman Leon Scott of the Phonautograph in 1857, a device which could display voice waveforms on a piece of paper

¹These photographs are reproduced with acknowledgement to Mr. B.C. Breton, Scientific Imaging Group, CUED.

²For a more detailed coverage of the origins of sound recording see, for example, Peter Martland's excellent book 'Since Records Began: EMI, the First Hundred Years' [124].


FIGURE 1.2. Electron micrograph showing signal modulation of the grooves of a 78rpm gramophone disc.

FIGURE 1.3. Electron micrograph showing a breakage directly across a 78rpm gramophone disc.


via a recording horn which focussed sound onto a vibrating diaphragm. It was not until 1877, however, that Thomas Edison invented the Phonograph, a machine capable not only of storing but of reproducing sound, using a steel point stylus which cut into a tin foil rotating drum recording medium. This piece of equipment was a laboratory tool rather than a commercial product, but by 1885 Alexander Graham Bell, and two partners, C.A. Bell and C.S. Tainter, had developed the technology sufficiently to make it a commercial proposition, and along the way had experimented with technologies which would later become the gramophone disc, magnetic tape and optical soundtracks. The new technology worked by cutting into a beeswax cylinder, a method originally designed to bypass the 1878 patent of Edison. In 1888, commercial exploitation of this new wax cylinder technology began, the same year that Emile Berliner demonstrated the first flat disc record and gramophone at the Franklin Institute, using an acid etching process to cut grooves into the surface of a polished zinc plate.

Cylinder technology was cumbersome in the extreme, and no cheap method for duplicating cylinders became available until 1902. This meant that artists had to perform the same pieces many times into multiple recording horns in order to obtain a sufficient quantity for sale. Meanwhile Berliner developed the gramophone disc technology, discovering that shellac, a material derived from the Lac beetle, was a suitable medium for manufacturing records from his metal negative master discs. Shellac was used for 78rpm recordings until vinyl was invented in the middle of this century. By 1895 a catalogue of a hundred 7-inch records had been produced by the Berliner Gramophone Company, but the equipment was hand-cranked and primitive. Further developments led to a motor driven gramophone, more sensitive soundbox and a new wax disc recording process, which enabled the gramophone to become a huge commercial success.

Recording using a mechanical horn and diaphragm was a difficult and unreliable procedure, requiring performers to crowd around the recording horn in order to be heard, and some instruments could not be adequately recorded at all, owing to the poor frequency response of the system (164Hz–2000Hz). The next big development was the introduction in 1925, after various experiments from 1920 onwards, of the Western Electric electrical recording process into the Columbia studios, involving an electrical microphone and amplifier which actuates a cutting tool. The electric process led to great improvements in sound quality, including a bandwidth of 20Hz–5000Hz and reduced surface noise. The technology was much improved by Alan Blumlein, who also demonstrated the first stereo recording process in 1935. In the same year AEG in Germany demonstrated the precursor of the magnetic tape recording system, a method which eliminates the clicks, crackles and pops of disc recordings. Tape recording was developed to its modern form by 1947, allowing for the first time a practical way to edit recordings, including of course to 'cut and splice' the tape for restoration purposes. In the late forties the vinyl LP and 45rpm single were launched.


In 1954 stereophonic tapes were first manufactured and in 1958 the first stereo discs were cut. Around this time analogue electronic technology for sound modification by filtering, limiting, compressing and equalisation was being introduced, which allowed the filtering of recordings for reduction of surface noise and enhancement of selective frequencies.

The revolution which has allowed the digital processing techniques of this book to succeed is the introduction of digital sound, in particular in 1982 the compact disc (CD) format, a digital format which allows a stereo signal bandwidth up to 20kHz with 16-bit resolution. Now of course we see higher bandwidth (48kHz), better resolution (24-bit) formats being used in the recording studio, but the CD has proved itself as the first practical domestic digital format.

1.2 Sound restoration – analogue and digital

Analogue restoration techniques have been available for at least as long as magnetic tape, in the form of manual cut-and-splice editing for clicks and frequency domain equalisation for background noise (early mechanical disc playback equipment will also have this effect by virtue of its poor response at high frequencies). More sophisticated electronic click reducers were based upon high pass filtering for detection of clicks, and low pass filtering to mask their effect (see e.g. [34, 102])³. None of these methods was sophisticated enough to perform a significant degree of noise reduction without interfering with the underlying signal quality. For analogue tape recordings the pre-emphasis techniques of Dolby [47] have been very successful in reducing the levels of background noise in analogue tape, but of course the pre-emphasis has to be encoded into the signal at the recording stages.

Digital methods allow for a much greater degree of flexibility in processing, and hence greater potential for noise removal, although indiscriminate application of inappropriate digital methods can be more disastrous than analogue processing! Some of the earliest digital signal processing work for audio restoration involved deconvolution for enhancement of a solo voice (Caruso) from an acoustically recorded source (see Miller [132] and Stockham et al. [171]). Since then, research groups from many places, including Cambridge, Le Mans, Paris and the US, have worked in the area, developing sophisticated techniques for treatment of degraded audio. For a good overall text on the field of digital audio, including restoration [84], see [25]. Another text which covers many enhancement techniques related to those of this book is by Vaseghi [188].

³The well known 'Packburn' unit achieved masking within a stereo setup by switching between channels.


On the commercial side, academic research has led to spin-off companies which manufacture computer restoration equipment for use in recording studios, re-mastering houses and broadcast companies. In Cambridge, CEDAR (Computer Enhanced Digital Audio Restoration) Ltd., founded in 1988 by Dr. Peter Rayner from the Communications Group of the University's Engineering Department and the British Library's National Sound Archive, is probably the best known of these companies, providing equipment for automatic real-time noise reduction, click and crackle removal, while the California-based Sonic Solutions markets a well-known system called NoNoise. Now, with the rapid increases in cheaply available computer performance, many computer editing packages include noise and click reduction as a standard add-on.

1.3 Classes of degradation and overview of the book

There are several distinct types of degradation common in audio sources. These can be broadly classified into two groups: localised degradations and global degradations. Localised degradations are discontinuities in the waveform which affect only certain samples, including clicks, crackles, scratches, breakages and clipping. Global degradations affect all samples of the waveform and include background noise, wow and flutter and certain types of non-linear distortion. Mechanisms by which all of these defects can occur are discussed later. We distinguish the following classes of localised degradation:

• Clicks - these are short bursts of interference, random in time and amplitude. Clicks are perceived as a variety of defects, ranging from isolated 'tick' noises to the characteristic 'crackle' associated with 78rpm disc recordings.

• Low frequency noise transients - usually a larger scale defect than clicks, caused by very large scratches or breakages in the playback medium. These large discontinuities excite a low frequency resonance in the pickup apparatus which is perceived as a low frequency 'thump' noise. This type of degradation is common in gramophone disc recordings and optical film sound tracks.

Global degradations affect all samples of the waveform and the following categories may be identified:

• Broad band noise - this form of degradation is common to all recording methods and is perceived as 'hiss'.

• Wow and flutter - these are pitch variation defects which may be caused by eccentricities in the playback system or motor speed fluctuations. The effect is a very disturbing modulation of all frequency components.


• Distortion - this very general class covers a wide range of non-linear defects, from amplitude related overload (e.g. clipping) to groove wall deformation and tracing distortion.

We describe methods in this text to address the majority of the above defects (in fact all except non-linear distortion, which is a topic of current research interest in DSP for audio). In the case of localised degradations a major task is the detection of discontinuities in the waveform. In Part III we adopt a classification approach to this task which is based on Bayes Decision Theory. Using a probabilistic model-based formulation for the audio data and degradation, we use Bayes' theorem to derive optimal detection estimates for the discontinuities. In the case of global degradations an approach based on probabilistic noise and data models is applied to give estimates for the true (undistorted) data conditional on the observed (noisy) data.

Note that all audio results and experimentation presented are performed using audio signals sampled at the professional rates of 44.1kHz or 48kHz and quantised to 16 bits.

The following gives a brief outline of the ensuing chapters.

Part I - Fundamentals

In the first part of the book we provide an overview of the basic theory on which the rest of the book relies. These chapters are not intended to be a complete tutorial for a reader completely unfamiliar with the area, but rather a summary of the important results in a form which is easily accessible. The reader is assumed to be quite familiar with linear systems theory and continuous time spectral analysis. Much of the material in this first section is based upon courses taught by the authors to undergraduates at Cambridge University.

Chapter 2 - Digital Signal Processing

In this chapter we overview the basics of digital signal processing (DSP), the theory of discrete time processing of sampled signals. We include an introduction to sampling theory, convolution, spectrum analysis, the z-transform and digital filters.

Chapter 3 - Probability Theory and Random Processes

Owing to the random nature of most audio signals it is necessary to have a thorough grounding in random signal theory in order to design effective restoration methods. Indeed, most of the restoration methods presented in the text are based explicitly on probabilistic models for signals and noise. In this chapter we build up the theory of random processes, including correlation functions, power spectra and linear systems analysis, from the fundamentals of probability theory for random variables and random vectors.

Chapter 4 - Parameter Estimation, Classification and Model Selection

This chapter introduces the fundamental concepts and techniques of parameter estimation and model selection, topics which are applied throughout the text, with an emphasis upon the Bayesian Decision Theory perspective. A mathematical framework based on a linear parameter Gaussian data model is used throughout to illustrate the methods. We then consider some basic signal models which will be of use later in the text and describe some powerful numerical statistical estimation methods: expectation-maximisation (EM) and Markov chain Monte Carlo (MCMC).

Part II - Basic Restoration Procedures

Part II describes the basic methods for removing clicks and background noise from disc and tape recordings. Much of the material is a review of methods which might be used in commercial re-mastering environments; however, we also include some new material, such as interpolation using autoregressive moving-average (ARMA) models. The reader who is new to the topic of digital audio restoration will find this part of the book an invaluable introduction to the wide range of methods which can be applied.

Chapter 5 - Removal of Clicks

Clicks are the most common form of artefact encountered in old recordings and the first stage of processing in many cases will be a de-clicking process. We describe firstly a range of techniques for interpolation of missing samples in audio; this is the task required for replacing a click in the audio waveform. We then discuss methods for detection of clicks, based upon modelling the distinguishing features of audio signals and abrupt discontinuities in the waveform (clicks). The methods are illustrated throughout with graphical examples which contrast the performance of the various schemes.

Chapter 6 - Hiss Reduction

All recordings, whatever their source, are inherently contaminated with some degree of background noise, often perceived as 'hiss'. This chapter describes the technology for hiss reduction, mostly based upon a frequency domain attenuation principle. We then go on to describe how hiss reduction can be carried out in a model-based setting.

Part III - Advanced Topics

In this section we describe some more recent and active research topics in audio restoration. The research spans the period 1990-1998 and can be considered to form a 'second generation' of sophisticated techniques which can handle new problems such as 'wow' or potentially achieve improvements in the basic areas such as click removal. These methods are generally not implemented in the commercial systems of today, but are likely to form a part of future systems as computers become faster and cheaper. Many of the ideas for these research topics have come from the authors' experience in the commercial sound restoration arena.

Chapter 7 - Removal of Low Frequency Noise Pulses

Low frequency noise pulses occur in gramophone and optical film media when the playback system is driven by step-like inputs such as groove breakages or large scratches. We firstly review the existing template-based methods for restoration of such defects, then present a new signal separation-based approach in which both the audio signal and the noise pulse are modelled as autoregressive (AR) processes. A separation algorithm is developed which removes the noise pulse from the observed signal. The algorithm has more general application than existing methods which rely on a 'template' for the noise pulse. Performance is found to be better than the existing methods and the new process is more readily automated.

Chapter 8 - Restoration of Pitch Variation Defects

This chapter presents a novel technique for restoration of musical material degraded by wow and other related pitch variation defects. An initial frequency tracking stage extracts frequency tracks for all significant tonal components of the music. This is followed by an estimation procedure which identifies pitch variations which are common to all components, under the assumption of smooth pitch variation with time. Restoration is then performed by non-uniform resampling of the distorted signal. Results show that wow can be virtually eliminated from musical material which has a significant tonal component.


Chapters 9-11 - A Bayesian Approach to Click Removal

In these chapters a new approach to click detection and replacement is developed. This approach is based upon Bayes decision theory as discussed in chapter 4. A Gaussian autoregressive (AR) model is assumed for the data and a Gaussian model is also used for click amplitudes. The detector is shown to be equivalent, under certain limiting constraints, to the existing AR model-based detectors currently used in audio restoration. In chapter 10 a novel sequential implementation of the Bayesian techniques is developed and in chapter 11 results are presented demonstrating that the new methods can out-perform the methods described in chapter 5.

Chapter 12 - Fully Bayesian Restoration using EM and MCMC

The Bayesian methods of chapters 9-11 led to improvements in detection and restoration of clicks. However, a disadvantage is that system parameters must be known a priori or estimated in some ad hoc fashion from the data. In this chapter we present more sophisticated statistical methodology which addresses these limitations and allows more realistic modelling of signal and noise sources. We firstly present an expectation-maximisation (EM) method for interpolation of autoregressive signals in non-Gaussian impulsive noise. We then present a Markov chain Monte Carlo (MCMC) technique which is capable of performing interpolation jointly with detection of clicks. This is a significant advance, although the drawback is a dramatic increase in computational complexity for the algorithm. However, we believe that with the rapid advances in computational power which are constantly occurring, methods such as these will come to dominate complex statistical signal processing problem-solving in the future. The chapter provides an in-depth case study of EM and MCMC applied to click removal, but it is noted that the methods can be applied with benefit to many of the other problem areas described in the book.

1.4 A reader’s guide

This book is aimed at a wide range of readers, from the technically minded audio enthusiast through to research scientists in industrial and academic environments. For those who have little knowledge of statistical signal processing, the introductory chapters in Part I will be essential reading, and it may be necessary to refer to some of the cited texts in addition, as the coverage is of necessity rather sparse. Part II will then provide the core reading material on restoration techniques, with Part III providing some interesting developments into areas such as wow and low frequency pulse removal. For the reader with a research background in statistical signal processing, Part I will serve only as a reference for notation and terminology, although chapter 4 may provide some useful insights into the Bayesian methodology adopted for much of the book. Part II will then provide general background reading in basic restoration methods, leading to Part III which contains state-of-the-art research in the audio restoration area.

Chapters 5 and 9-12 may be read in conjunction for those interested in click and crackle treatment techniques, while chapters 6, 7 and 8 form stand-alone texts on the areas of hiss reduction, low frequency pulse removal and wow removal, respectively.


Part I - Fundamentals


2 Digital Signal Processing

Digital signal processing (DSP) is a technique for implementing operations such as signal filtering and spectrum analysis in digital form, as shown in the block diagram of figure 2.1.

FIGURE 2.1. Digital signal processing system: analogue input x(t) passes through an analogue-to-digital converter, the digital processor maps samples x(nT) to y(nT), and a digital-to-analogue converter produces the analogue output y(t).

There are many advantages in carrying out digital rather than analogue processing; amongst these are flexibility and repeatability. The flexibility stems from the fact that system parameters are simply numbers stored in the processor. Thus, for example, it is a trivial matter to change the cut-off frequency of a digital filter, whereas a lumped element analogue filter would require a different set of passive components. Indeed the ease with which system parameters can be changed has led to many adaptive techniques whereby the system parameters are modified in real time according to some algorithm. Examples of this are adaptive equalisation of transmission systems and adaptive antenna arrays which automatically steer the nulls in the polar diagram onto interfering signals. Digital signal processing enables very complex linear and non-linear processes to be implemented which would not be feasible with analogue processing. For example, it is difficult to envisage an analogue system which could be used to perform spatial filtering of an image to improve the signal to noise ratio.

DSP has been an active research area since the late 1960s, but applications tended to be only in large and expensive systems, or in non-real time where a general purpose computer could be used. However, the advent of DSP chips has enabled real-time processing to be performed at very low cost, and has enabled audio companies such as CEDAR Audio and Sonic Solutions to produce real-time restoration systems for remastering studios and record companies. The next few years will see the further integration of DSP into domestic products such as television, radio, mobile telephones and of course hi-fi equipment.

In this chapter we review the basic theory of digital signal processing (DSP) as required later in the book. We consider firstly Nyquist sampling theory, which states that a continuous time signal, such as an audio signal, which is band-limited in frequency can be represented perfectly without any information loss as a set of discrete digital sample values. The whole of digital audio, including compact disc (CD) and digital audio tape (DAT), relies heavily on the validity of this theory, and so a strong understanding is essential. The chapter then proceeds to describe signal manipulation in the digital domain, covering such important topics as Fourier analysis and the discrete Fourier transform (DFT), the z-transform, time domain data windows and the fast Fourier transform (FFT). A basic level of knowledge in continuous time linear systems, Laplace and Fourier analysis is assumed. Much of the material is of necessity given a superficial coverage and for more detailed descriptions the reader is referred to the texts by Roberts and Mullis [164], Proakis and Manolakis [157] and Oppenheim and Schafer [145].

2.1 The Nyquist sampling theorem

It is necessary to determine under what conditions a continuous signal g(t) may be unambiguously represented by a series of samples taken from the signal at uniform intervals of T. It is convenient to represent the sampling process as that of multiplying the continuous signal g(t) by a sampling signal s(t) which is an infinite train of impulse functions δ(t − nT). The sampled signal gs(t) is:

    g_s(t) = g(t)\, s(t)

where:

    s(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)


Now the impulse train s(t) is a periodic function and may be represented by a Fourier series:

    s(t) = \sum_{p=-\infty}^{\infty} c_p\, e^{j p \omega_0 t}

where

    c_p = \frac{1}{T} \int_{-T/2}^{T/2} s(t)\, e^{-j p \omega_0 t}\, dt = \frac{1}{T}
    \quad \text{and} \quad
    \omega_0 = \frac{2\pi}{T}

Hence

    g_s(t) = g(t)\, \frac{1}{T} \sum_{p=-\infty}^{\infty} e^{j p \omega_0 t}

The spectrum Gs(ω) of the sampled signal may be determined by taking the Fourier transform of gs(t). This is most readily achieved by making use of the frequency shift theorem, which states that if g(t) ↔ G(ω), then g(t) e^{jω_0 t} ↔ G(ω − ω_0). Application of this theorem gives:

    G_s(\omega) = \frac{1}{T} \sum_{p=-\infty}^{\infty} G(\omega - p\omega_0)    (2.1)

(Spectrum of sampled signal)

The above equation shows that the spectrum of the sampled signal is simply the sum of the spectra of the continuous signal repeated periodically at intervals of ω_0 = 2π/T, as shown in figure 2.2. Thus the continuous signal g(t) may be perfectly recovered from the sampled signal gs(t) provided that the sampling interval T is chosen so that:

    \omega_0 = \frac{2\pi}{T} > 2\omega_B

FIGURE 2.2. Sampled signal spectra for various different sampling frequencies ω0. (The panels show the baseband spectrum |G(ω)| with bandwidth ωB and the periodically repeated spectra |Gs(ω)| for several values of ω0.)


where ωB is the bandwidth of the continuous signal g(t). If this condition holds then there is no overlap of the individual spectra in the summation of expression (2.1), and the original signal can be perfectly reconstructed using a low-pass filter H(ω) with bandwidth ωB:

    H(\omega) = \begin{cases} 1, & |\omega| < \omega_B \\ 0, & \text{otherwise} \end{cases}

This theory underpins the whole of digital audio and means that a digital audio system can in principle be designed which loses none of the information contained in the continuous domain signal g(t). Of course the practicalities are rather different, as a band-limited input signal is assumed, and the reconstruction filter H(ω) and the A/D and D/A converters must be ideal. For more detail of this procedure see for example [99].
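As a concrete illustration of the sampling condition (a sketch of our own in Python; the function name and parameter choices are illustrative, not from the text or any audio package), the following samples a pure tone at two rates and locates the strongest FFT bin. When ω0 > 2ωB is violated, the tone reappears at an aliased frequency:

```python
import numpy as np

def dominant_frequency(f_tone, f_s, n=4096):
    """Sample an f_tone Hz sinusoid at f_s Hz; return the frequency of the FFT peak."""
    t = np.arange(n) / f_s
    x = np.sin(2 * np.pi * f_tone * t)
    spectrum = np.abs(np.fft.rfft(x * np.hanning(n)))  # window to reduce leakage
    return np.fft.rfftfreq(n, d=1.0 / f_s)[np.argmax(spectrum)]

# 5 kHz tone at 44.1 kHz: sampling condition holds, peak reported near 5000 Hz.
print(dominant_frequency(5000.0, 44100.0))
# 5 kHz tone at 8 kHz: condition violated, tone aliases to about 3000 Hz (8000 - 5000).
print(dominant_frequency(5000.0, 8000.0))
```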

2.2 The discrete time Fourier transform (DTFT)

We now proceed to calculate the sampled signal spectrum in an alternative way, which leads to the discrete time Fourier transform (DTFT). The sampled signal gs(t) is given by:

    g_s(t) = g(t) \sum_{p=-\infty}^{\infty} \delta(t - pT)

and the signal spectrum Gs(ω) can be obtained by taking the Fourier transform directly:

    G_s(\omega) = \int_{-\infty}^{\infty} g_s(t)\, e^{-j\omega t}\, dt
                = \int_{-\infty}^{\infty} g(t) \sum_{p=-\infty}^{\infty} \delta(t - pT)\, e^{-j\omega t}\, dt

Hence

    G_s(\omega) = \sum_{p=-\infty}^{\infty} g_p\, e^{-j\omega p T}    (2.2)

where we have defined gp = g(pT). Note that this is a periodic function of frequency and is usually written in the following form:

    G(e^{j\omega T}) = \sum_{p=-\infty}^{\infty} g_p\, e^{-j\omega p T}    (2.3)

(Discrete time Fourier transform, DTFT)


The signal sample values may be expressed in terms of the sampled signal spectrum by noting that equation (2.3) has the form of a Fourier series, so that the orthogonality of the complex exponential can be used to invert the transform. Multiply each side of equation (2.3) by e^{jnωT} and integrate:

    \int_0^{2\pi} G(e^{j\omega T})\, e^{jn\omega T}\, d(\omega T)
        = \int_0^{2\pi} \sum_{p=-\infty}^{\infty} g_p\, e^{-j\omega(p-n)T}\, d(\omega T)
        = \sum_{p=-\infty}^{\infty} g_p \int_0^{2\pi} e^{-j\omega(p-n)T}\, d(\omega T)

but

    \int_0^{2\pi} e^{-j(p-n)\omega T}\, d(\omega T) = \begin{cases} 0, & p \neq n \\ 2\pi, & p = n \end{cases}

The signal samples are thus obtained from the DTFT as:

    g_n = \frac{1}{2\pi} \int_0^{2\pi} G(e^{j\omega T})\, e^{jn\omega T}\, d(\omega T)    (2.4)

(Inverse DTFT)
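The pair (2.3)-(2.4) is easy to verify numerically. The sketch below (our own illustration, with assumed names) evaluates the DTFT of a short sequence on a grid of ωT values and recovers the samples by approximating the integral in (2.4) with a rectangle rule:

```python
import numpy as np

def dtft(g, wT):
    """Evaluate G(e^{jwT}) = sum_p g_p e^{-j p wT} at the normalised frequencies wT."""
    p = np.arange(len(g))
    return g @ np.exp(-1j * np.outer(p, wT))

g = np.array([1.0, 0.5, -0.25, 0.125])           # finite-length sequence g_p
wT = np.linspace(0.0, 2 * np.pi, 512, endpoint=False)
G = dtft(g, wT)                                   # periodic in wT with period 2*pi

# Inverse DTFT (2.4): g_n = (1/2pi) * integral over [0, 2pi) of G e^{j n wT} d(wT).
g_rec = np.array([np.mean(G * np.exp(1j * n * wT)) for n in range(len(g))])
print(np.allclose(g_rec.real, g))                 # True, up to numerical precision
```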

2.3 Discrete time convolution

Consider an analogue signal x(t) and its sampled representation

    x_s(t) = x(t) \sum_{n=-\infty}^{\infty} \delta(t - nT) = \sum_{n=-\infty}^{\infty} x(nT)\, \delta(t - nT).

If we apply xs(t) as the input to a linear time invariant (LTI) system with impulse response h(t), then the output signal can be written as

    y(t) = \int_{-\infty}^{\infty} h(t - \tau) \sum_{m=-\infty}^{\infty} x(mT)\, \delta(\tau - mT)\, d\tau
         = \sum_{m=-\infty}^{\infty} h(t - mT)\, x(mT)

If we now evaluate the output at times nT we have:

    y(nT) = \sum_{m=-\infty}^{\infty} h((n-m)T)\, x(mT) = \sum_{m=-\infty}^{\infty} h(mT)\, x((n-m)T)    (2.5)

in which the impulse response needs only to be evaluated at integer multiples of T. This discrete convolution can be evaluated in the digital domain as shown in figure 2.3, where x(nT) represents values from an analogue signal x(t) sampled periodically at intervals of T, and the outputs y(nT) represent sample values which may be converted into an analogue signal by means of a digital to analogue converter. We will denote the digitised sequences from now on as xn = x(nT) and yn = y(nT), and the sampled impulse response as hn = h(nT), leading to the discrete time convolution equations:

    y_n = \sum_{m=-\infty}^{\infty} h_{n-m}\, x_m = \sum_{m=-\infty}^{\infty} h_m\, x_{n-m}    (2.6)

(Discrete time convolution)

FIGURE 2.3. Digital signal processor: digital input x(nT), digital processor, digital output y(nT).
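Equation (2.6) translates directly into code. The following sketch (our own illustration; `discrete_convolution` is an assumed name) evaluates the convolution sum for finite-length sequences and confirms agreement with NumPy's library routine:

```python
import numpy as np

def discrete_convolution(h, x):
    """Direct evaluation of y_n = sum_m h_m x_{n-m} for finite-length h and x."""
    y = np.zeros(len(h) + len(x) - 1)
    for n in range(len(y)):
        for m in range(len(h)):
            if 0 <= n - m < len(x):      # terms outside the data are zero
                y[n] += h[m] * x[n - m]
    return y

h = np.array([1.0, -0.5, 0.25])          # sampled impulse response h_n
x = np.array([2.0, 0.0, 1.0, 3.0])       # input sequence x_n
print(discrete_convolution(h, x))
print(np.convolve(h, x))                 # identical result
```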

The frequency domain relationship between the system input and output may be developed from the convolution relationship as follows. Take the DTFT of each side of equation (2.6):

    Y(e^{j\omega T}) = \sum_{n=-\infty}^{\infty} y_n\, e^{-jn\omega T}
                     = \sum_{n=-\infty}^{\infty} \left( \sum_{p=-\infty}^{\infty} x_p\, h_{n-p} \right) e^{-jn\omega T}

Hence

    Y(e^{j\omega T}) = \sum_{p=-\infty}^{\infty} x_p \sum_{n=-\infty}^{\infty} h_{n-p}\, e^{-jn\omega T}

Let n − p = q; then

    Y(e^{j\omega T}) = \sum_{p=-\infty}^{\infty} x_p \sum_{q=-\infty}^{\infty} h_q\, e^{-j(q+p)\omega T}
                     = \sum_{p=-\infty}^{\infty} x_p\, e^{-jp\omega T} \sum_{q=-\infty}^{\infty} h_q\, e^{-jq\omega T}

But \sum_{p=-\infty}^{\infty} x_p e^{-jp\omega T} is the spectrum of the system input, X(e^{j\omega T}); hence

    Y(e^{j\omega T}) = X(e^{j\omega T})\, H(e^{j\omega T})    (2.7)

where

    H(e^{j\omega T}) = \sum_{q=-\infty}^{\infty} h_q\, e^{-jq\omega T}    (2.8)

is the system frequency response.

The system frequency response could also have been derived directly from equation (2.6) as follows. Let x_p = e^{j\omega p T}; then

    y_n = \sum_{p=-\infty}^{\infty} h_p\, e^{j\omega(n-p)T} = e^{j\omega n T} \sum_{p=-\infty}^{\infty} h_p\, e^{-jp\omega T}

so that

    H(e^{j\omega T}) \equiv \sum_{p=-\infty}^{\infty} h_p\, e^{-jp\omega T}

2.4 The z-transform

The Laplace transform is an important tool in continuous system theory as it enables one to deal with differential equations as algebraic equations, since it transforms convolutive functions into multiplicative functions. However, the natural mathematical description of sampled data systems is in terms of difference equations, and it would be of considerable help to develop a method whereby difference equations can be treated as algebraic equations. The z-transform accomplishes this and is also a powerful tool for general interpretation of discrete system behaviour.


2.4.1 Definition

The two-sided z-transform is defined for a sampled signal $g_n$ as:

$$G(z) = \mathcal{Z}\{g_n\} = \sum_{p=-\infty}^{\infty} g_p\, z^{-p} \qquad (2.9)$$

2-sided z-transform

The z-transform exists for all $z$ such that the summation converges absolutely, i.e.

$$\sum_{p=-\infty}^{\infty} |g_p\, z^{-p}| < \infty$$

2.4.2 Transfer function and frequency response

We can see a similarity here with the DTFT, equation (2.3), in that the two-sided z-transform is equal to the DTFT when $z = e^{j\omega T}$. Equivalent results can be shown to apply for the discrete-time convolution and frequency response in the z-domain; specifically, if we take the z-transform of the discrete time impulse response, $H(z) = \sum_{p=-\infty}^{\infty} h_p\, z^{-p}$, then the input-output relationships in the time domain and the z-domain are:

$$y_n = \sum_{m=-\infty}^{\infty} h_m\, x_{n-m} \quad \text{(time domain)}$$

$$Y(z) = H(z)\,X(z) \quad \text{(z-domain)}$$

where $H(z)$ is known as the transfer function and $Y(z)$ and $X(z)$ are the z-transforms of the output and input sequences, respectively. The frequency response of the system is:

$$H(z)\big|_{z = e^{j\omega T}}$$

Another useful result is the time-shifting theorem for z-transforms, which is used for handling difference equations and digital filters:

$$\mathcal{Z}\{g_{n-p}\} = z^{-p}\, G(z) \qquad (2.10)$$

2.4.3 Poles, zeros and stability

If we suppose that the transfer function is a rational function of $z$, i.e. the ratio of two polynomials in $z$, then $H(z)$ can be factorised as:

$$H(z) = K\, \frac{\prod_{i=1}^{M} (z - z_i)}{\prod_{j=1}^{N} (z - p_j)}$$


in which the $z_i$ are known as the zeros and the $p_j$ as the poles of the transfer function. Then, a necessary and sufficient condition for the transfer function to be bounded-input bounded-output (BIBO) stable (meaning that no finite-amplitude input sequence $x_n$ can generate an infinite-amplitude output) is that all the poles lie within the unit circle, i.e.

$$|p_j| < 1, \quad j = 1, \ldots, N$$

There are many more mathematical and practical details of the z-transform, but we omit these here as they are not used elsewhere in the text. The interested reader is referred to [164] or one of the many other excellent texts on the topic.
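By way of illustration (our own sketch, not from the text), the pole condition can be checked numerically by finding the roots of the denominator polynomial; the helper name and example coefficients below are hypothetical.

```python
import numpy as np

def is_bibo_stable(a):
    """Check BIBO stability of H(z) = B(z)/A(z) from its denominator
    coefficients a = [1, -a1, ..., -aN] (highest power of z first after
    multiplying through by z^N): all poles must lie inside the unit circle."""
    poles = np.roots(a)
    return bool(np.all(np.abs(poles) < 1.0))

# y_n = x_n + 0.9 y_{n-1}  ->  A(z) = 1 - 0.9 z^{-1}, pole at z = 0.9
print(is_bibo_stable([1.0, -0.9]))   # True: pole inside the unit circle
print(is_bibo_stable([1.0, -1.1]))   # False: pole at z = 1.1
```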

2.5 Digital filters

2.5.1 Infinite-impulse-response (IIR) filters

Consider a discrete time difference equation input-output relationship of the form:

$$y_n = \sum_{i=0}^{M} b_i\, x_{n-i} + \sum_{j=1}^{N} a_j\, y_{n-j}$$

This can be straightforwardly implemented in hardware or software using a network of delays, adders and multipliers and is known as an infinite-impulse-response (IIR) filter. Taking z-transforms of both sides using result (2.10) and rearranging, we obtain:

$$Y(z) = \frac{B(z)}{A(z)}\, X(z)$$

where

$$B(z) = \sum_{i=0}^{M} b_i\, z^{-i} \quad \text{and} \quad A(z) = 1 - \sum_{j=1}^{N} a_j\, z^{-j}$$

The transfer function is $H(z) = B(z)/A(z)$, which is stable if and only if the zeros of $A(z)$ lie within the unit circle. Furthermore, the filter is invertible (this means that a finite sequence $x_n$ can always be obtained from any finite amplitude sequence $y_n$) if and only if the zeros of $B(z)$ also lie within the unit circle. This type of filter is known as infinite-impulse-response because in general the impulse response is of infinite duration even when $M$ and $N$ are finite.
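A minimal sketch of the IIR difference equation, assuming the sign convention used above ($A(z) = 1 - \sum_j a_j z^{-j}$, so the feedback terms enter with a plus sign); the function name and example coefficients are our own.

```python
import numpy as np

def iir_filter(b, a, x):
    """y_n = sum_{i=0}^{M} b[i] x_{n-i} + sum_{j=1}^{N} a[j-1] y_{n-j}.
    b holds the feedforward coefficients, a the feedback coefficients a_1..a_N."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = sum(b[i] * x[n - i] for i in range(len(b)) if n >= i) \
             + sum(a[j - 1] * y[n - j] for j in range(1, len(a) + 1) if n >= j)
    return y

# one-pole lowpass: y_n = 0.1 x_n + 0.9 y_{n-1}; step response settles to 1
y = iir_filter(b=[0.1], a=[0.9], x=np.ones(50))
print(y[0], y[1], y[-1])   # 0.1, 0.19, ..., approaching 1.0
```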

2.5.2 Finite-impulse-response (FIR) filters

An important special case of the IIR filter is the finite-impulse-response (FIR) filter, in which $N$ is set to zero in the general filter equation:

$$y_n = \sum_{i=0}^{M} b_i\, x_{n-i}$$

This filter is unconditionally stable and has an impulse response equal to $\{b_0, b_1, \ldots, b_M, 0, 0, \ldots\}$. The transfer function is given by:

$$H(z) = B(z)$$

2.6 The discrete Fourier transform (DFT)

The Discrete Time Fourier Transform (DTFT) expresses the spectrum of a sampled signal in terms of the signal samples but is not computable on a digital machine for two reasons:

1. The frequency variable $\omega$ is continuous.

2. The summation involves an infinite number of samples.

The first problem may be overcome by simply choosing to evaluate the spectrum at a set of discrete frequencies. These may be arbitrarily chosen, but it should be remembered that the spectrum is a periodic function of frequency (see figure 2.2), so it is only necessary to consider frequencies in the range:

$$\omega T = 0 \to 2\pi$$

Although the frequencies may be arbitrarily chosen in this range, important computational advantages are to be gained from choosing $N$ uniformly spaced frequencies, i.e. at the frequencies:

$$\omega T = p\,\frac{2\pi}{N}, \quad p \in \{0, 1, \ldots, N-1\}$$

Page 44: Digital Audio Restoration - a statistical model based approach

26 2. Digital Signal Processing

Thus equation (2.3) becomes:

$$G(e^{j\frac{2\pi}{N}p}) = \sum_{n=-\infty}^{\infty} g_n\, e^{-j\frac{2\pi}{N}np}$$

Although the above equation is a function of only discrete variables, the problem remains of the infinite summation. If, however, the signal is non-zero only for samples $g_0$ to $g_{M-1}$ then the equation becomes:

$$G(e^{j\frac{2\pi}{N}p}) = \sum_{n=0}^{M-1} g_n\, e^{-j\frac{2\pi}{N}np} \qquad (2.11)$$

In general, of course, the signal will not be of finite duration. However, if it is assumed that computing the transform of only $M$ samples of the signal will lead to a reasonable estimate of the spectrum of the infinite duration signal, then the above equation is computable in a finite time. (The implications of this assumption are investigated in the next section on data windowing.) Although the number of samples $M$ is completely independent of the number of frequency points $N$, there is a considerable computational advantage to be gained from setting $M = N$, and this will become clear later when the fast Fourier transform (FFT) is considered. Under this condition, equation (2.11) becomes:

$$G(e^{j\frac{2\pi}{N}p}) = \sum_{n=0}^{N-1} g_n\, e^{-j\frac{2\pi}{N}np} \qquad (2.12)$$

It is often required to be able to compute the signal samples from the spectral values, and this may be achieved by making use of the orthogonality properties of the discrete complex exponential as follows. Multiply each side of equation (2.12) by $e^{j\frac{2\pi}{N}pq}$ and sum over $p = 0$ to $N-1$:

$$\sum_{p=0}^{N-1} G(e^{j\frac{2\pi}{N}p})\, e^{j\frac{2\pi}{N}pq} = \sum_{p=0}^{N-1} \sum_{n=0}^{N-1} g_n\, e^{-j\frac{2\pi}{N}np}\, e^{j\frac{2\pi}{N}pq} = \sum_{n=0}^{N-1} g_n \sum_{p=0}^{N-1} e^{j\frac{2\pi}{N}(q-n)p}$$

Note that:

$$\sum_{p=0}^{N-1} e^{j\frac{2\pi}{N}(q-n)p} = \begin{cases} N, & n = q \\ 0, & n \neq q \end{cases}$$

$$\therefore\ g_q = \frac{1}{N} \sum_{p=0}^{N-1} G(e^{j\frac{2\pi}{N}p})\, e^{j\frac{2\pi}{N}pq} \qquad (2.13)$$

Page 45: Digital Audio Restoration - a statistical model based approach

2.7 Data windows 27

Equations (2.12) and (2.13) are the discrete Fourier transform (DFT) pair, summarised as:

$$G(p) = \sum_{n=0}^{N-1} g_n\, e^{-j\frac{2\pi}{N}np} \qquad (2.14)$$

Discrete Fourier transform (DFT)

$$g_n = \frac{1}{N} \sum_{p=0}^{N-1} G(p)\, e^{j\frac{2\pi}{N}pn} \qquad (2.15)$$

Inverse DFT
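A short sketch (our own, not from the text) of the DFT pair as written: a direct $O(N^2)$ matrix evaluation of (2.14) and (2.15), cross-checked against a library FFT.

```python
import numpy as np

def dft(g):
    """Direct O(N^2) evaluation of equation (2.14)."""
    N = len(g)
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)  # W[p, n] = e^{-j 2pi n p / N}
    return W @ g

def idft(G):
    """Direct evaluation of the inverse DFT, equation (2.15)."""
    N = len(G)
    n = np.arange(N)
    W = np.exp(2j * np.pi * np.outer(n, n) / N)
    return (W @ G) / N

g = np.random.randn(8)
assert np.allclose(idft(dft(g)), g)        # the pair inverts correctly
assert np.allclose(dft(g), np.fft.fft(g))  # agrees with the library FFT
```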

2.7 Data windows

In the previous section the discrete Fourier transform (DFT) was derived for a signal $g_n$ which is non-zero only for $n \in \{0, 1, \ldots, N-1\}$. In most Fourier transform applications the signal will not be of finite duration. However, we could force this condition by extracting a section of the signal with duration $T_w$ and hoping that the spectrum of the finite duration signal would be a good approximation to the spectrum of the long duration signal. This can be analysed carefully and leads on to the topic of window design.

2.7.1 Continuous signals

It will be helpful firstly to consider continuous signals. Let $g(t)$ be a continuous signal defined over all time with Fourier transform $G(\omega)$:

$$g(t) \Leftrightarrow G(\omega)$$

We wish to determine what relationship the spectrum of the windowed signal $g_w(t)$, shown in figure 2.4, has to the spectrum of $g(t)$, where $g_w(t) = g(t)\,w(t)$.

Transforming to the frequency domain and using the standard convolution result for the transform of multiplied signals, we obtain

$$G_w(\omega) = \int_{-\infty}^{\infty} g_w(t)\, e^{-j\omega t}\, dt = \frac{1}{2\pi} \int_{-\infty}^{\infty} W(\lambda)\, G(\omega - \lambda)\, d\lambda \qquad (2.16)$$

Page 46: Digital Audio Restoration - a statistical model based approach

28 2. Digital Signal Processing

[Figure: a rectangular window $w(t)$ of height 1 and duration $T_w$ applied to $g(t)$ to give $g_w(t) = g(t)\,w(t)$]
FIGURE 2.4. Windowing a continuous-time signal

The spectrum of the windowed signal is thus the convolution of the window's spectrum with the un-windowed signal's spectrum. The spectrum of the window is

$$W(\omega) = \int_{-\infty}^{\infty} w(t)\, e^{-j\omega t}\, dt = \int_{-T_w/2}^{T_w/2} e^{-j\omega t}\, dt = \frac{1}{-j\omega}\left[ e^{-j\omega T_w/2} - e^{j\omega T_w/2} \right] = \frac{2\sin(\omega T_w/2)}{\omega} = T_w\, \frac{\sin(\omega T_w/2)}{(\omega T_w/2)} \qquad (2.17)$$

as shown in figure 2.5. The effect of this convolution can be appreciated by considering the signal:

$$g(t) = 1 + e^{j\omega_o t}$$

i.e. a d.c. offset added to a complex exponential with frequency $\omega_o$. The Fourier transform of this signal is:

$$G(\omega) = 2\pi\delta(\omega) + 2\pi\delta(\omega - \omega_o)$$

The spectrum of the windowed signal is then obtained as:

$$G_w(\omega) = \int_{-\infty}^{\infty} T_w\, \frac{\sin(\lambda T_w/2)}{(\lambda T_w/2)} \left\{ \delta(\omega - \lambda) + \delta(\omega - \lambda - \omega_o) \right\} d\lambda = T_w \left\{ \frac{\sin(\omega T_w/2)}{(\omega T_w/2)} + \frac{\sin\left[(\omega - \omega_o) T_w/2\right]}{(\omega - \omega_o) T_w/2} \right\}$$

Page 47: Digital Audio Restoration - a statistical model based approach

2.7 Data windows 29

[Figure]
FIGURE 2.5. Spectral magnitude of rectangular window $|W(\omega)|$ plotted as a function of normalised frequency $\omega T_w/(2\pi)$

This result is plotted in figure 2.6 and we see that there are two effects.

1. Smearing. The discrete frequencies (in this case d.c. and $\omega_o$) have become 'smeared' into a band of frequencies.

2. Spectral leakage. The component at d.c. 'leaks' into the $\omega_o$ component as a result of the sidelobes of $W(\omega)$. This can be very troublesome when trying to measure the amplitude of a small component in the presence of a large component at a nearby frequency. This is illustrated for two complex exponentials with differing amplitudes in figure 2.7.

The two effects are not independent, but roughly speaking the width of the lobes (or smearing) is an effect of the window duration, while the sidelobes are due to the discontinuous nature of the window. A technique commonly used in spectral analysis is to employ a tapered window rather than the rectangular window. One such window is the 'cosine arch' or Hanning window, given by:

$$w(t) = \begin{cases} \frac{1}{2}\left[1 - \cos\left(\frac{2\pi t}{T_w}\right)\right], & 0 \leq t \leq T_w \\ 0, & \text{otherwise} \end{cases}$$

and shown in figure 2.8 along with its spectrum, which has much reduced sidelobes but a wider central lobe than the rectangular window.


[Figure]
FIGURE 2.6. Top - spectra of individual d.c. component and exponential with normalised frequency of 4; bottom - overall spectrum $|G_w(\omega)|$.

[Figure]
FIGURE 2.7. Top - spectra of two complex exponentials with normalised frequencies 3 and 4, first component is 2/5 the amplitude of second; bottom - spectrum of the sum, showing how the first component is completely lost owing to spectral leakage.


[Figure]
FIGURE 2.8. Hanning window (top) and its spectrum (bottom)

2.7.2 Discrete-time signals

Consider now the discrete case. We can go through the calculations for the windowed spectrum in a manner analogous to the continuous-time case:

$$G(e^{j\omega T}) = \sum_{p=-\infty}^{\infty} g(pT)\, e^{-jp\omega T}$$

$$G_w(e^{j\omega T}) = \sum_{p=-\infty}^{\infty} g(pT)\, w(pT)\, e^{-jp\omega T} = \sum_{p=-\infty}^{\infty} g(pT) \left\{ \frac{1}{2\pi} \int_0^{2\pi} W(e^{j\theta})\, e^{jp\theta}\, d\theta \right\} e^{-jp\omega T} = \frac{1}{2\pi} \int_0^{2\pi} W(e^{j\theta}) \sum_{p=-\infty}^{\infty} g(pT)\, e^{-jp(\omega T - \theta)}\, d\theta$$

$$\therefore\ G_w(e^{j\omega T}) = \frac{1}{2\pi} \int_0^{2\pi} W(e^{j\theta})\, G(e^{j(\omega T - \theta)})\, d\theta$$

Once again, the spectrum of the windowed signal is the convolution of the infinite duration signal's spectrum and the window's spectrum. Note that all spectra in the discrete case are periodic functions of frequency.


As for the continuous case we can consider the use of tapered windows, and one class of window functions is the generalised Hamming window, given by

$$w(nT) = \alpha - (1 - \alpha)\cos\left(\frac{2\pi n}{N}\right), \quad n = 0, \ldots, N-1$$

A few specific values of $\alpha$ are given their own names, as follows:

α = 1: Rectangular window
α = 0.5: Hanning window (cosine arch)
α = 0.54: Hamming window

Figure 2.9 shows the window shapes and figure 2.10 the spectra (on a logarithmic scale) for several values of $\alpha$. Other commonly used data windows include the Kaiser, Bartlett, Tukey and Parzen windows, each giving a different trade-off between side-lobe height and central lobe width; the choice of window will depend upon the application. A short numerical sketch of the generalised Hamming family is given after the figures below.


[Figure]
FIGURE 2.9. Discrete window functions (rectangular, Hanning, Hamming)

[Figure]
FIGURE 2.10. Discrete window spectra
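The sketch referred to above (our own illustration, not from the text): generate the generalised Hamming family and estimate each window's peak sidelobe level from a finely sampled spectrum. The main-lobe detection is a crude walk to the first spectral null.

```python
import numpy as np

def generalised_hamming(N, alpha):
    """w_n = alpha - (1 - alpha) cos(2 pi n / N), n = 0, ..., N-1.
    alpha = 1 gives the rectangular window, 0.5 the Hanning, 0.54 the Hamming."""
    n = np.arange(N)
    return alpha - (1 - alpha) * np.cos(2 * np.pi * n / N)

for alpha, name in [(1.0, 'rectangular'), (0.5, 'Hanning'), (0.54, 'Hamming')]:
    w = generalised_hamming(64, alpha)
    # zero-pad heavily for a finely sampled spectrum; normalise d.c. gain to 0 dB
    W_db = 20 * np.log10(np.abs(np.fft.rfft(w, 4096)) / w.sum() + 1e-12)
    k = 1                                  # walk down the main lobe ...
    while k < len(W_db) - 1 and W_db[k + 1] < W_db[k]:
        k += 1                             # ... to its first null
    print(name, round(W_db[k:].max(), 1))  # peak sidelobe level in dB
```

Run as-is, this reproduces the familiar ordering of figure 2.10: the rectangular window's first sidelobe sits near -13 dB, the Hanning near -31 dB and the Hamming lower still, at the cost of a wider central lobe.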


2.8 The fast Fourier transform (FFT)

The discrete Fourier transform (DFT) of a sequence of data $x_n$ and its inverse are given by:

$$X(p) = \sum_{n=0}^{N-1} x_n\, e^{-j\frac{2\pi}{N}np} \qquad (2.18)$$

$$x_n = \frac{1}{N} \sum_{p=0}^{N-1} X(p)\, e^{j\frac{2\pi}{N}np} \qquad (2.19)$$

for $n \in \{0, \ldots, N-1\}$ and $p \in \{0, \ldots, N-1\}$.

Computation of $X(p)$ for $p = 0, 1, \ldots, N-1$ requires on the order of $N^2$ complex multiplications and additions. The fast Fourier transform algorithm reduces the required number of arithmetic operations to the order of $\frac{N}{2}\log_2(N)$, which provides a critical degree of speed-up for applications involving large $N$.

The first stage of the algorithm is to rewrite equation (2.18) in terms of the even-indexed and odd-indexed data $x_n$:

$$X(p) = \sum_{n=0}^{\frac{N}{2}-1} x_{2n}\, e^{-j\frac{2\pi}{N}(2n)p} + \sum_{n=0}^{\frac{N}{2}-1} x_{2n+1}\, e^{-j\frac{2\pi}{N}(2n+1)p} = \sum_{n=0}^{\frac{N}{2}-1} x_{2n}\, e^{-j\frac{2\pi}{N/2}np} + e^{-j\frac{2\pi}{N}p} \sum_{n=0}^{\frac{N}{2}-1} x_{2n+1}\, e^{-j\frac{2\pi}{N/2}np} \qquad (2.20)$$

It can be seen that computation of the length $N$ DFT has been reduced to the computation of two length $\frac{N}{2}$ DFTs and an additional $N$ complex multiplications for the complex exponential outside the second summation. It would appear, at first sight, that it is necessary to evaluate equation (2.20) for $p = 0, 1, \ldots, N-1$. However, this is not necessary, as may be seen by considering the equation for $p \geq \frac{N}{2}$:

$$X(p + N/2) = \sum_{n=0}^{\frac{N}{2}-1} x_{2n}\, e^{-j\frac{2\pi}{N/2}n(p+\frac{N}{2})} + e^{-j\frac{2\pi}{N}(p+\frac{N}{2})} \sum_{n=0}^{\frac{N}{2}-1} x_{2n+1}\, e^{-j\frac{2\pi}{N/2}n(p+\frac{N}{2})} = \sum_{n=0}^{\frac{N}{2}-1} x_{2n}\, e^{-j\frac{2\pi}{N/2}np} - e^{-j\frac{2\pi}{N}p} \sum_{n=0}^{\frac{N}{2}-1} x_{2n+1}\, e^{-j\frac{2\pi}{N/2}np} \qquad (2.21)$$


Comparing equation (2.21) for $X(p + \frac{N}{2})$ with equation (2.20) for $X(p)$, it can be seen that the only difference is the sign between the two summations. Thus it is necessary to evaluate equation (2.20) only for $p = 0, 1, \ldots, \frac{N}{2}-1$, storing the results of the two summations separately for each $p$. The values of $X(p)$ and $X(p + \frac{N}{2})$ can then be evaluated as the sum and difference of the two summations, as indicated by equations (2.20) and (2.21).

Thus the computational load for an $N$-point DFT has been reduced from $N^2$ complex arithmetic operations to $2(\frac{N}{2})^2 + \frac{N}{2}$, which is approximately half the computation for large $N$. This is the first stage in the FFT derivation. The rest of the derivation proceeds by re-applying the same procedure a number of times.

We now define the notation:

$$W = e^{-j\frac{2\pi}{N}}$$

and define the FFT butterfly to be as shown in figure 2.11, which maps a pair of inputs $(A, B)$ onto the pair of outputs $(A + B\,W^k,\ A - B\,W^k)$.

[Figure]
FIGURE 2.11. The FFT butterfly

The flow chart for incorporating this decomposition into the computation of an $N = 8$ point DFT is shown in figure 2.12. Assuming that $\frac{N}{2}$ is even, the same process can be carried out on each of the $\frac{N}{2}$-point DFTs to reduce the computation further. The flow chart for incorporating this extra stage of decomposition into the computation of the $N = 8$ point DFT is shown in figure 2.13.

It can be seen that if $N = 2^M$ then the process can be repeated $M$ times to reduce the computation to that of evaluating $N$ single point DFTs. Thus the flow chart for computing the $N = 8$ point DFT is as shown in figure 2.14.

Examination of the final chart shows that it is necessary to shuffle the order of the input data. This data shuffle is usually termed bit-reversal, for reasons that are clear if the indices of the shuffled data are written in binary.


[Figure]
FIGURE 2.12. First Stage of FFT Decomposition

[Figure]
FIGURE 2.13. Second Stage of FFT Decomposition


[Figure]
FIGURE 2.14. Complete FFT Decomposition for N = 8

Binary   Bit reverse   Decimal
000      000           0
001      100           4
010      010           2
011      110           6
100      001           1
101      101           5
110      011           3
111      111           7

It has been shown that the process reduces the computation by half at each stage but introduces an extra $\frac{N}{2}$ multiplications to account for the complex exponential term outside the second summation term in the reduction. Thus, for the condition $N = 2^M$, the process can be repeated $M$ times to reduce the computation to that of evaluating $N$ single point DFTs, which requires no computation. However, at each of the $M$ stages of reduction an extra $\frac{N}{2}$ multiplications are introduced, so that the total number of arithmetic operations required to evaluate an $N$-point DFT is $\frac{N}{2}\log_2(N)$.

The FFT algorithm has a further significant advantage over direct evaluation of the DFT expression in that computation can be performed in-place. This is best illustrated in the final flow chart, where it can be seen that after two data values have been processed by the butterfly structure, those data are not required again in the computation and may be replaced, in the computer memory, with the values at the output of the butterfly structure.
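The flow charts above describe an in-place, iterative implementation; the following sketch (our own, not from the text) uses a recursive formulation of the same even/odd decomposition for clarity, combining the two half-length DFTs with the sum/difference of equations (2.20) and (2.21).

```python
import numpy as np

def fft_recursive(x):
    """Radix-2 decimation-in-time FFT following equations (2.20)-(2.21):
    split into even/odd samples, take the two N/2-point DFTs, then combine
    with the twiddle factors W^p = e^{-j 2 pi p / N}. Requires N = 2^M."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x                       # a single-point DFT is the sample itself
    E = fft_recursive(x[0::2])         # DFT of even-indexed samples
    O = fft_recursive(x[1::2])         # DFT of odd-indexed samples
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([E + W * O,  # X(p),       p = 0, ..., N/2 - 1
                           E - W * O]) # X(p + N/2): the sign flip of (2.21)

x = np.random.randn(8)
assert np.allclose(fft_recursive(x), np.fft.fft(x))
```

The recursion implicitly performs the bit-reversal shuffle of the table above: unwinding the even/odd splits for $N = 8$ delivers the samples in the order $x_0, x_4, x_2, x_6, x_1, x_5, x_3, x_7$.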

2.9 Conclusion

In this chapter we have briefly described the discrete time signal theory upon which the rest of the book is based. The material thus far has considered only deterministic signals with known, fixed waveform. In the next two chapters we will introduce the idea of signals drawn at random from some large ensemble of possibilities according to the laws of probability.


3
Probability Theory and Random Processes

In this chapter the basic results of probability theory and random processes are developed. These concepts are fundamental to most of the subsequent chapters, since our methodology is generally based upon probabilistic arguments. It is recommended that this chapter is used as a reference rather than for learning about probability from scratch, as it comprises mostly a list of results used later in the book without detailed discussion. The material covered is sufficient for the purposes of this text, but for further mathematical rigour and detail see for example Gray and Davisson [87], Papoulis [148] or Doob [48].

3.1 Random events and probability

Probability theory is used to give a mathematical description of the behaviour of real-world systems which involve random events. Such a system might be as simple as a coin-flipping experiment, in which we are interested in whether 'Heads' or 'Tails' is the outcome, or it might be more complex, as in the study of random errors in a coded digital datastream (e.g. a CD recording or a digital mobile phone). Here we summarise the main results and formalise the intuitive ideas of Events and the Axioms of Probability into a Probability Space which encapsulates everything we need to know about a system of random events.

The probability results given in this section are strictly event-based, since it is relatively easy to obtain results which have all the desired properties of a probabilistic system. In later sections we will use these results as the basis for study of more advanced topics in random signal theory which involve random signals rather than individual events. We will see, however, that random signal theory can be formulated in terms of event-based results or as limiting cases of those results.

3.1.1 Frequency-based interpretation of probability

An intuitive interpretation of probability is as follows:

Suppose we observe $M$ identical experiments whose outcome is a member of the set $\Omega = \{\omega_1, \omega_2, \ldots, \omega_N\}$, and suppose the outcome is $\omega_i$ in $M_i$ of the $M$ experiments. The probability of $\omega_i$ is the proportion of experiments in which $\omega_i$ is observed, in the limit as the total number of experiments tends to infinity:

$$\Pr\{\omega_i\} = \lim_{M \to \infty} \frac{M_i}{M}$$

3.2 Probability spaces

The term Random Experiment is used to describe any situation which has a set of possible outcomes, each of which occurs with a particular probability. Any random experiment can be described completely by its Probability Space, $(\Omega, \mathcal{F}, P)$:

1. Sample Space $\Omega = \{\omega_1, \omega_2, \ldots, \omega_N\}$: the set of possible 'elementary' outcomes of the experiment.

2. Event Space $\mathcal{F}$: the space of events $F$ for which a probability is assigned, including $\emptyset$ (the empty set) and $\Omega$. Events are collections of elementary outcomes. New events can be obtained by set-theoretic operations on other events, e.g.

$$H = F \cup G, \quad H = F \cap G, \quad H = F^c$$

For our purposes, consider that $\mathcal{F}$ contains all 'physically meaningful' events which can be obtained by set-theoretic operations on the sample space.¹

¹There is a great deal of mathematics behind the generation of appropriate event spaces, which are sometimes known as a sigma field or sigma algebra. In particular, for continuous sample spaces it is possible (but difficult!) to generate events for which a consistent system of probability cannot be devised. However, such events are not physically meaningful and we can ignore them.


3. Probability Measure $P$: a function, literally a 'measure of probability', defining the probability $P(F)$ of each member of the event space. Interpret $P$ as:

$$P(F) = \Pr\{\text{'The outcome is in } F\text{'}\} = \Pr\{\omega \in F\}$$

$P$ must obey the axioms of probability:

(a) $0 \leq P(F) \leq 1$. The probability of any event must lie between 0 and 1.

(b) $P(\Omega) = 1$. The probability of the outcome lying within the sample space is 1 (i.e. certain).

(c) $P(F \cup G) = P(F) + P(G)$ for events $F$ and $G$ such that $F \cap G = \emptyset$ ('mutually exclusive'). This means that for events $F$ and $G$ in $\Omega$ with no elements in common, the probability of $F$ OR $G$ occurring is the sum of their individual probabilities.

Example. A simple example is the coin-flipping experiment, for which we can easily write down a probability space:

$$\Omega = \{H, T\}, \quad \mathcal{F} = \{\{H\}, \{T\}, \Omega, \emptyset\}$$

$$P(H) = \frac{1}{2}, \quad P(T) = \frac{1}{2}, \quad P(\Omega) = 1, \quad P(\emptyset) = 0$$

where $\emptyset$ denotes the empty set, i.e. no outcome whatsoever. The event space $\mathcal{F}$ is chosen to contain all the events we might wish to consider: the probability of heads ($P(H) = 1/2$), the probability of tails ($P(T) = 1/2$), the probability of heads OR tails ($P(\Omega = \{H, T\}) = 1$, the certain event), and the probability of no outcome at all ($P(\emptyset) = 0$).

The above is an example of a discrete probability space, in which the sample space can be mapped with a one-to-one equivalence onto a subset of the integers $\mathbb{Z}$ (the number of elements in the sample space is then referred to as countable). In much of the ensuing work, however, we will be concerned with continuous sample spaces which cannot be mapped onto the integers in this way.

Example. Consider a random thermal noise voltage $V$ measured across a resistor. The sample space $\Omega$ would in this case be the real line:

$$\Omega = \Re = (-\infty, +\infty)$$

which cannot be mapped onto the integers. The event space is now much harder to construct mathematically. However, we can answer all 'physically meaningful' questions, such as "what is the probability that the voltage lies between $-1$ and $+1$?", by including in the event space all subsets $F$ of the real line (with the exception of the difficult cases mentioned in a previous footnote, which we ignore as before).

The event probabilities can then be calculated by integrals of the probability density function (PDF) $f(v)$ over the appropriate subset $F$ of the real line (see the section on random variables):

$$P(F) = \Pr\{V \in F\} = \int_{v \in F} f(v)\, dv$$

e.g. for $F = (a, b]$ we have

$$P(F) = \Pr\{a < V \leq b\} = \int_{v=a^+}^{b} f(v)\, dv$$

The notion of a probability space formalises our description of a random experiment. In the examples given in this book we will not lay out problems in such a formal manner. Given a particular experiment with a given sample space it will usually be possible to pose and solve problems without explicitly constructing an event space. However, it is useful to remember that a Probability Space $(\Omega, \mathcal{F}, P)$ can be devised which fully describes the properties of any random system we are likely to encounter.

3.3 Fundamental results of event-based probability

3.3.1 Conditional probability

Consider events $F$ and $G$ in some probability space. Conditional probability, denoted $P(F|G)$, is defined as the probability that event $F$ occurs given the prior knowledge that event $G$ has occurred. A formula for calculating conditional probability can be obtained by reasoning from the axioms of probability as:

$$P(F|G) = \frac{P(F \cap G)}{P(G)}, \quad P(G) > 0 \qquad (3.1)$$

Conditional Probability


Note that conditional probability is undefined when $P(G) = 0$, since the event $F|G$ is meaningless if $G$ is impossible anyway! The event $F \cap G$ has the interpretation 'the outcome is in $F$ AND $G$'. Some readers may be more familiar with the notation $(F, G)$, which means the same thing.

3.3.2 Bayes rule

The equivalent formulation for $G$ conditional upon $F$ is clearly $P(G|F) = \frac{P(G \cap F)}{P(F)}$. Since $P(F \cap G) = P(G \cap F)$, we can substitute the resulting expression for $P(G \cap F)$ in place of $P(F \cap G)$ in result (3.1) to obtain Bayes rule:

$$P(F|G) = \frac{P(G|F)\, P(F)}{P(G)}$$

Bayes Rule

This equation is fundamental to many inferential procedures, since it tells us how to proceed from a probabilistic description of an event $G$, which has been directly observed, to some unobserved event $F$ whose probability we need to know. Within an inferential framework, $P(F)$ is known as the prior or a priori probability, since it reflects prior knowledge about event $F$ before event $G$ has been observed; $P(F|G)$ is known as the posterior or a posteriori probability, since it gives the probability of $F$ after event $G$ has been observed; $P(G|F)$ is known as the likelihood and $P(G)$ as the 'total' or 'marginal' probability for $G$.

3.3.3 Total probability

If the total probability $P(G)$ is unknown for a particular problem, it can be calculated by forming a partition $\{F_1, F_2, \ldots, F_n\}$ of the sample space $\Omega$. By this we mean that we find a set of events from the event space $\mathcal{F}$ which are mutually exclusive (i.e. there is no 'overlap' between any two members of the partition) and whose union is the complete sample space. For example, $\{1\}, \{2\}, \{3\}, \{4\}, \{5\}, \{6\}$ forms a partition in the die-throwing experiment and $\{H\}, \{T\}$ forms a partition of the coin-flipping experiment. The total probability is then given by


$$P(G) = \sum_{i=1}^{n} P(G|F_i)\, P(F_i) \qquad (3.2)$$

Total probability
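A tiny numerical sketch (our own, with made-up numbers) of equations (3.2) and Bayes rule together; the two-class partition and all probabilities are hypothetical.

```python
# Hypothetical two-event partition: F1 = 'click present', F2 = 'no click';
# G = 'detector fires'. All priors and likelihoods are invented for illustration.
priors = {'F1': 0.01, 'F2': 0.99}        # P(F_i), a partition of Omega
likelihood = {'F1': 0.95, 'F2': 0.05}    # P(G | F_i)

# Total probability, equation (3.2): P(G) = sum_i P(G | F_i) P(F_i)
p_G = sum(likelihood[F] * priors[F] for F in priors)

# Bayes rule: P(F1 | G) = P(G | F1) P(F1) / P(G)
posterior_F1 = likelihood['F1'] * priors['F1'] / p_G
print(p_G, posterior_F1)   # 0.059, ~0.161
```

Note how the posterior (~0.16) is far below the likelihood (0.95): the small prior dominates, which is exactly the prior/posterior interplay described above.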

3.3.4 Independence

If one event has no effect upon the outcome of another event, and vice versa, then the two events are termed independent. Mathematically this is expressed in terms of the joint probability $P(F \cap G)$ for the events, i.e. the probability that the outcome is in both $F$ and $G$. The joint probability is expressed in the way one might expect, as the product of the individual event probabilities:

$$P(F \cap G) = P(F)\, P(G) \iff F \text{ and } G \text{ independent}$$

Independent events

Note that this formula, which forms the definition of independent events, is a special case of the conditional probability expression (3.1) in which $P(F|G) = P(F)$.

3.4 Random variables

A random variable is intuitively just a numerical quantity which takes on a random value. However, in order to derive the properties of random variables they are defined in terms of an underlying probability space. Consider a probability space $(\Omega, \mathcal{F}, P)$. The sample and event spaces for this probability space are in general defined in terms of abstract quantities such as 'Heads' and 'Tails' in the coin-tossing experiment, or 'On' and 'Off' in a logic circuit. As seen in the last section, the laws of probability can be applied directly in this abstract domain. However, in many cases we will wish to study the properties of a numerical outcome such as a random voltage or current rather than these abstract quantities. A mapping $X(\omega)$ which assigns a real numerical value to each outcome $\omega \in \Omega$ is termed a Random Variable. The theory of random variables will form the basis of much of the work in this book. We consider only real-valued random variables, since it is straightforward to extend the results to complex-valued variables.


3.5 Definition: random variable

Given a probability space $(\Omega, \mathcal{F}, P)$, a random variable is a function $X(\omega)$ which maps each element $\omega$ of the sample space $\Omega$ onto a point on the real line. We will usually refer to the mapping $X(\omega)$ and the underlying probability space simply as 'random variable (or RV) $X$'.

Example: Discrete random variable. If the mapping $X(\omega)$ can take on only a countable number of values on the real line, the random variable is termed discrete. For a logic gate with states 'On' and 'Off' define the random variable:

$$X(\text{On}) = 1, \quad X(\text{Off}) = 0$$

However, if one wished to study the properties of a line transmission system which transmits an 'On' as +1.5V and an 'Off' as -1.5V, then a more natural choice of RV would be:

$$X(\text{On}) = +1.5, \quad X(\text{Off}) = -1.5$$

Example: Continuous random variable. The random noise voltage $V$ measured across a resistor $R$ is a continuous random variable. Since the sample space $\Omega$ is the measured voltage itself, i.e. $V(\omega) = \omega$, the random variable $V$ is termed 'directly given'. Alternatively, we may wish to consider the instantaneous power $V^2/R$ across the resistor, in which case we can define a RV $W(\omega) = \omega^2/R$.

3.6 The probability distribution of a random variable

The distribution $P_X$ of a RV $X$ is simply a probability measure which assigns probabilities to events on the real line. The distribution $P_X$ answers questions of the form 'what is the probability that $X$ lies in subset $F$ of the real line?'. $P_X$ is obtained via the 'inverse mapping' method:

$$P_X(F) = \Pr\{X \in F\} = P(X^{-1}(F)) \qquad (3.3)$$

where the inverse mapping is defined mathematically as:

$$X^{-1}(F) = \{\omega : X(\omega) \in F\}, \quad (F \subset \Re) \qquad (3.4)$$

In practice we will summarise $P_X$ by its Probability Mass Function (PMF, discrete variables only), Cumulative Distribution Function (CDF) or Probability Density Function (PDF).


3.6.1 Probability mass function (PMF) (discrete RVs)

Suppose the variable can take a set of $M$ real values $\{x_1, \ldots, x_i, \ldots, x_M\}$. Then the probability mass function (PMF) is defined as:

$$p_X(x_i) = \Pr\{X = x_i\} = P_X(\{x_i\}), \qquad \sum_{i=1}^{M} p_X(x_i) = 1$$

Probability mass function (PMF)

3.6.2 Cumulative distribution function (CDF)

The cumulative distribution function (CDF) can be used to describe discrete, continuous or mixed discrete/continuous distributions:

$$F_X(x) = \Pr\{X \leq x\} = P_X(\,(-\infty, x]\,)$$

Cumulative distribution function (CDF)

The following properties follow directly from the axioms of probability:

1. $0 \leq F_X(x) \leq 1$

2. $F_X(-\infty) = 0, \quad F_X(\infty) = 1$

3. $F_X(x)$ is non-decreasing as $x$ increases

4. $\Pr\{x_1 < X \leq x_2\} = F_X(x_2) - F_X(x_1)$

Note also that $\Pr\{X > x\} = 1 - F_X(x)$. Where there is no ambiguity we will usually drop the subscript '$X$' and refer to the CDF as $F(x)$.

3.6.3 Probability density function (PDF)

The probability density function (PDF) is defined as the derivative of the CDF with respect to $x$:

$$f_X(x) = \frac{\partial F_X(x)}{\partial x}$$


Probability density function (PDF)

Again, its properties follow from the axioms of probability:

1. $f_X(x) \geq 0$

2. $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$

3. $F_X(x) = \int_{-\infty}^{x} f_X(\alpha)\, d\alpha$

4. $\int_{x_1}^{x_2} f_X(\alpha)\, d\alpha = \Pr\{x_1 < X \leq x_2\}$

As for the CDF we will often drop the subscript and refer simply to $f(x)$ when no confusion can arise.

The PDF can be viewed as a continuous frequency histogram for the RV, since it represents the relative frequency with which a particular value is observed in many trials under the same conditions.

From property 4 we can obtain the following important result for the probability that $X$ lies within a slice of infinitesimal width $\delta x$:

$$\lim_{\delta x \to 0} \Pr\{x < X \leq x + \delta x\} = \lim_{\delta x \to 0} \int_{x}^{x+\delta x} f_X(\alpha)\, d\alpha = f_X(x)\, \delta x$$

Note that this probability tends to zero as $\delta x$ goes to zero, provided that $f_X$ contains no delta functions.

For purely discrete RVs:

$$f_X(x) = \sum_{i=1}^{M} p_i\, \delta(x - x_i)$$

where the $p_i$'s are the PMF 'weights' for the discrete random variable.

3.7 Conditional distributions and Bayes rule

Conditional distributions and Bayes rule are fundamental to inference in general and in particular to the analysis of complex random systems. Many technical questions will be of the form 'what can I infer about parameter $a$ in a system given that I have made the related observation $b$?'. Such questions can be answered using conditional probability.


3.7.1 Conditional PMF - discrete RVs

The results for PMFs apply for discrete random variables and are defined only at the values:

$$x \in \{x_1, \ldots, x_M\}, \quad y \in \{y_1, \ldots, y_N\}$$

Since '$X = x_i$' is an event with non-zero probability for a discrete random variable, we obtain conditional probability and related results directly from event-based probability, conditioning upon either some arbitrary event $G$ or the value of a second random variable $Y$:

$$p(x|G) = \Pr\{(X = x)\,|\,G\} = \frac{P(\,(X = x) \text{ AND } G\,)}{P(G)}$$

(conditional upon an arbitrary event $G$)

$$p(x|y) = \Pr\{(X = x)\,|\,(Y = y)\} = \frac{p(x, y)}{p(y)}$$

(conditional upon a second discrete RV $Y = y$)

Bayes rule and total probability for PMFs are then obtained directly as:

$$p(x|y) = \frac{p(y|x)\, p(x)}{p(y)}$$

Bayes Rule (PMF)

$$p(y) = \sum_{i=1}^{M} p(y|x_i)\, p(x_i)$$

Total Probability (PMF)

Note that $p(x, y)$ is the joint PMF:

$$p(x, y) = \Pr\{(X = x) \text{ AND } (Y = y)\}$$

Conditional and joint PMFs are positive and obey the same normalisation rules as ordinary PMFs:

$$\sum_{i=1}^{M} p(x_i|y) = 1, \qquad \sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) = 1$$


3.7.2 Conditional CDF

Apply the same reasoning using events of the form '$X \leq x$' and '$Y \leq y$':

$$F(x|y) = \Pr\{(X \leq x)\,|\,(Y \leq y)\} = \frac{F(x, y)}{F(y)}$$

This leads directly to Bayes rule for CDFs:

$$F(x|y) = \frac{F(y|x)\, F(x)}{F(y)}$$

Bayes Rule (CDF)

$F(x|y)$ is the conditional CDF and $F(x, y)$ is the joint CDF:

$$F(x, y) = \Pr\{(X \leq x) \text{ AND } (Y \leq y)\}$$

Conditional CDFs have similar properties to standard CDFs, i.e. $F_{X|Y}(-\infty|y) = 0$, $F_{X|Y}(+\infty|y) = 1$, etc.

3.7.3 Conditional PDF

The conditional PDF is obtained as the derivative of the conditional CDF:

$$f(x|G) = \frac{\partial F(x|G)}{\partial x}$$

It has similar properties and interpretation to the standard PDF (i.e. $f(x|G) \geq 0$, $\int f(x|G)\, dx = 1$, $P(X \in A|G) = \int_A f(x|G)\, dx$).

However, since the event '$X = x$' has zero probability for continuous random variables, the PDF conditional upon $X = x$ is not directly defined. We can obtain the required result as a limiting case:

$$\Pr\{G|X = x\} = \lim_{\delta x \to 0} \Pr\{G\,|\,x < X \leq x + \delta x\} = \lim_{\delta x \to 0} \frac{\Pr\{x < X \leq x + \delta x\,|\,G\}\, P(G)}{\Pr\{x < X \leq x + \delta x\}} = \lim_{\delta x \to 0} \frac{\left(F(x + \delta x|G) - F(x|G)\right) P(G)}{f(x)\, \delta x} = \lim_{\delta x \to 0} \left( \frac{F(x + \delta x|G) - F(x|G)}{\delta x} \right) \frac{P(G)}{f(x)} = \frac{f(x|G)\, P(G)}{f(x)}$$


Now taking $G$ as the event $Y \leq y$:

$$F(y|X = x) = \Pr\{Y \leq y\,|\,X = x\} = \frac{f(x|Y \leq y)\, F(y)}{f(x)} = \frac{\frac{\partial}{\partial x}\left(F(x|y)\right) F(y)}{f(x)} = \frac{\frac{\partial}{\partial x}\left(F(x, y)\right)}{f(x)}$$

from whence:

$$f(y|x) = \frac{\partial F(y|X = x)}{\partial y} = \frac{\frac{\partial^2}{\partial x \partial y} F(x, y)}{f(x)} = \frac{f(x, y)}{f(x)}$$

where $f(x, y)$ is the joint PDF, defined as:

$$f(x, y) = \frac{\partial^2 F(x, y)}{\partial x \partial y}$$

Joint PDF

The conditional probability rule for PDFs leads as before to Bayes rule:

$$f(x|y) = \frac{f(y|x)\, f(x)}{f(y)}$$

Bayes rule (PDF)

Total probability is obtained as an integral:

$$\int f(y|x)\, f(x)\, dx = \int \frac{f(x|y)\, f(y)}{f(x)}\, f(x)\, dx = \left( \int f(x|y)\, dx \right) f(y) = f(y)$$


i.e. the total probability $f(y)$ is given by:

$$f(y) = \int f(y|x)\, f(x)\, dx = \int f(y, x)\, dx$$

Total Probability (PDF)

This important result is often referred to as the Marginalisation Integral and $f(y)$ as the Marginal Probability.

Example: the sum of two random variables. Consider the random variable $Y$ formed as the sum of two independent random variables $X_1$ and $X_2$:

$$Y = X_1 + X_2$$

where $X_1$ has PDF $f_1(x_1)$ and $X_2$ has PDF $f_2(x_2)$. We can write the joint PDF for $y$ and $x_1$ by rewriting the conditional probability formula:

$$f(y, x_1) = f(y|x_1)\, f_1(x_1)$$

It is clear that the event '$Y$ takes the value $y$ conditional upon $X_1 = x_1$' is equivalent to $X_2$ taking a value $y - x_1$ (since $X_2 = Y - X_1$). Hence $f(y|x_1) = f_{X_2|X_1}(y - x_1|x_1) = f_2(y - x_1)$ (this 'change of variables' can be proved using results in a later section on functions of random variables), and $f(y)$ is obtained as follows:

$$f(y) = \int f(y, x_1)\, dx_1 = \int f(y|x_1)\, f_1(x_1)\, dx_1 = \int f_2(y - x_1)\, f_1(x_1)\, dx_1 = f_1 * f_2$$

where '$*$' denotes the convolution operation. In other words, when two independent random variables are added, the PDF of their sum equals the convolution of the PDFs for the two original random variables.
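A quick Monte Carlo sketch (our own illustration) of this convolution result: the sum of two independent uniform(0,1) variables has the triangular PDF on $(0, 2)$, which is exactly the convolution of two rectangular PDFs.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Y = X1 + X2 with X1, X2 ~ uniform(0,1), independent
y = rng.uniform(0, 1, N) + rng.uniform(0, 1, N)

hist, edges = np.histogram(y, bins=50, range=(0.0, 2.0), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
triangle = np.where(centres < 1.0, centres, 2.0 - centres)  # f1 * f2
print(np.max(np.abs(hist - triangle)))  # small: histogram matches f1 * f2
```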


3.8 Expectation

Expectations form a fundamental part of random signal theory. Examples of expectations are the mean and standard deviation of a random variable, although the concept is much more general than this.

The mean of a random variable $X$ is defined as:

$$E[X] = \overline{X} = \int_{-\infty}^{+\infty} x\, f_X(x)\, dx$$

Mean value of X

If a second RV $Y$ is defined to be a function $Y = g(X)$, then the expectation of $Y$ is obtained as:

$$E[Y] = \int_{-\infty}^{+\infty} g(x)\, f_X(x)\, dx$$

Expectation of Y = g(X)

Expectation is a linear operator:

$$E[a\, g_1(X) + b\, g_2(X)] = a\, E[g_1(X)] + b\, E[g_2(X)]$$

Conditional expectations are defined in a similar way:

$$E[X|G] = \int x\, f_{X|G}(x|G)\, dx, \qquad E[Y|G] = \int g(x)\, f_{X|G}(x|G)\, dx$$

Conditional expectation


Some important examples of expectation are as follows:

$$E[X^n] = \int x^n\, f_X(x)\, dx$$

nth order moment

$$E[(X - \overline{X})^n] = \int (x - \overline{X})^n\, f_X(x)\, dx$$

Central moment

The central moment with $n = 2$ gives the familiar variance of a random variable:

$$E[(X - \overline{X})^2] = \int (x - \overline{X})^2\, f_X(x)\, dx = E[X^2] - (\overline{X})^2$$

Variance

3.8.1 Characteristic functions

Another important example of expectation is the characteristic function $\Phi_X(\omega)$ of a random variable $X$:

$$\Phi_X(\omega) = E[e^{j\omega X}] = \int_{-\infty}^{+\infty} e^{j\omega x}\, f_X(x)\, dx$$

which is equivalent to the Fourier transform of the PDF evaluated at $\omega' = -\omega$. Some of the properties of the Fourier transform can be applied usefully to the characteristic function, in particular:


1. Convolution (sums of independent RVs). We know that for sums of independent random variables the distribution is obtained by convolving the distributions of the component parts:

$$Y = \sum_{i=1}^{N} X_i \quad \text{and} \quad f_Y = f_{X_1} * f_{X_2} * \ldots * f_{X_N} \implies \Phi_Y(\omega) = \prod_{i=1}^{N} \Phi_{X_i}(\omega)$$

This result allows the calculation of the characteristic function for sums of independent random variables as the product of the component characteristic functions.

2. Inversion. In order to invert back from the characteristic function to the PDF we apply the inverse Fourier transform with frequency negated:

$$f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-j\omega x}\, \Phi_X(\omega)\, d\omega$$

3. Moments.

$$\frac{\partial^n \Phi_X(\omega)}{\partial \omega^n} = \int_{-\infty}^{+\infty} (jx)^n\, e^{j\omega x}\, f_X(x)\, dx \implies E[X^n] = \frac{1}{j^n} \left. \frac{\partial^n \Phi_X(\omega)}{\partial \omega^n} \right|_{\omega = 0}$$

This result shows how to calculate moments of a random variable from its characteristic function.

3.9 Functions of random variables

Consider a random variable $Y$ which is generated as a function $Y = g(X)$, where $X$ has a known distribution $P_X(x)$. We can obtain the distribution for $Y$, the transformed random variable, directly by the 'inverse image method':


$$F_Y(y) = \Pr\{Y \leq y\} = P_X(\{x : g(x) \leq y\})$$

Derived CDF for Y = g(X)

3.9.1 Differentiable mappings

If the mapping $g(X)$ is differentiable then a simple formula can be obtained for the derived distribution of $Y$. Suppose that the equation $y = g(x)$ has $M$ solutions for a particular value of $y$:

$$g(x_1) = y, \quad g(x_2) = y, \quad \ldots, \quad g(x_M) = y$$

Defining $g'(x) = \frac{\partial g(x)}{\partial x}$, it can be shown that the derived PDF of $Y$ is given by:

$$f_Y(y) = \sum_{i=1}^{M} \left. \frac{f_X(x)}{|g'(x)|} \right|_{x = x_i}$$

PDF of Y = g(X)

Example: $Y = X^2$. Suppose that $X$ is a Gaussian random variable with mean zero:

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}} \qquad (3.5)$$

The solutions of the equation $y = x^2$ are

$$x_1 = +y^{1/2}, \quad x_2 = -y^{1/2}$$

and the modulus of the function derivative is

$$|g'(x)| = 2|x| = 2y^{1/2}$$


Substituting these into the formula for $f_Y(y)$, we obtain:

$$f_Y(y) = \sum_{i=1}^{2} \left. \frac{f_X(x)}{|g'(x)|} \right|_{x = x_i} = \frac{f_X(y^{1/2})}{2y^{1/2}} + \frac{f_X(-y^{1/2})}{2y^{1/2}} = \frac{1}{\sqrt{2\pi\sigma^2 y}}\, e^{-\frac{y}{2\sigma^2}}$$

which is the chi-squared density with one degree of freedom.
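A quick Monte Carlo check of this derived PDF (our own sketch): square standard normal samples and compare their histogram against the formula just derived. The lower edge avoids the integrable spike at $y \to 0$; bin-averaging causes a slight bias there.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
y = rng.normal(0.0, sigma, 500_000) ** 2           # samples of Y = X^2

edges = np.linspace(0.05, 4.0, 61)                  # avoid the spike at y -> 0
counts, _ = np.histogram(y, bins=edges)
hist = counts / (len(y) * np.diff(edges))           # empirical density of Y
centres = 0.5 * (edges[:-1] + edges[1:])
f_y = np.exp(-centres / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2 * centres)
print(np.max(np.abs(hist - f_y)))                   # small: matches derived PDF
```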

3.10 Random vectors

A vector random variable can be thought of simply as a collection of single random variables which will in general be interrelated statistically. As for single random variables it is conceptually useful to retain the idea of a mapping from an underlying probability space $(\Omega, \mathcal{F}, P)$ onto a real-valued random vector:

$$\mathbf{X}(\omega) = \begin{bmatrix} X_1(\omega) \\ X_2(\omega) \\ \vdots \\ X_N(\omega) \end{bmatrix}, \quad \omega \in \Omega, \quad \mathbf{X}(\omega) \in \Re^N$$

i.e. the mapping is from the (possibly highly complex) sample space $\Omega$ onto $\Re^N$, the space of $N$-dimensional real vectors.

A vector cumulative distribution function can be defined by extension of the single random variable definition:

$$F_{\mathbf{X}}(x_1, x_2, \ldots, x_N) = F_{\mathbf{X}}(\mathbf{x}) = \Pr\{X_1 \leq x_1, X_2 \leq x_2, \ldots, X_N \leq x_N\}$$

Vector (joint) cumulative distribution function (CDF)

i.e. $F_{\mathbf{X}}(\mathbf{x})$ is the probability that $X_1 \leq x_1$ AND $X_2 \leq x_2$ AND $\ldots$ $X_N \leq x_N$. The vector CDF has the following properties:

1. $0 \leq F_{\mathbf{X}}(\mathbf{x}) \leq 1$

2. $F_{\mathbf{X}}(x_1, \ldots, -\infty, \ldots, x_N) = 0, \quad F_{\mathbf{X}}(\infty, \ldots, \infty, \ldots, \infty) = 1$


3. $F_{\mathbf{X}}(\mathbf{x})$ is non-decreasing as any element $x_i$ increases

4. Total probability - e.g. for $N = 4$:

$$F(x_1, x_3) = \Pr\{X_1 \leq x_1, X_2 \leq +\infty, X_3 \leq x_3, X_4 \leq +\infty\} = F(x_1, +\infty, x_3, +\infty)$$

In a similar fashion the vector PDF is defined as:

$$f_{\mathbf{X}}(x_1, x_2, \ldots, x_N) = f_{\mathbf{X}}(\mathbf{x}) = \frac{\partial^N F_{\mathbf{X}}(\mathbf{x})}{\partial x_1 \partial x_2 \ldots \partial x_N}$$

Vector (joint) probability density function (PDF)

The vector PDF has the following properties:

1. $f_{\mathbf{X}}(\mathbf{x}) \geq 0$

2. $\int_{x_1=-\infty}^{+\infty} \ldots \int_{x_N=-\infty}^{+\infty} f_{\mathbf{X}}(\mathbf{x})\, dx_1 dx_2 \ldots dx_N = \int_{\mathbf{x}} f_{\mathbf{X}}(\mathbf{x})\, d\mathbf{x} = 1$

3. $\Pr\{\mathbf{X} \in V\} = \int_V f_{\mathbf{X}}(\mathbf{x})\, d\mathbf{x}$

4. Expectation:

$$E[\mathbf{x}] = \int_{\mathbf{x}} \mathbf{x}\, f_{\mathbf{X}}(\mathbf{x})\, d\mathbf{x}, \qquad E[g(\mathbf{x})] = \int_{\mathbf{x}} g(\mathbf{x})\, f_{\mathbf{X}}(\mathbf{x})\, d\mathbf{x}$$

5. Probability of small volumes:

$$\Pr\{x_1 < X_1 \leq x_1 + \delta x_1, \ldots, x_N < X_N \leq x_N + \delta x_N\} \approx f_{\mathbf{X}}(x_1, x_2, \ldots, x_N)\, \delta x_1 \delta x_2 \ldots \delta x_N$$

for small $\delta x_i$.

6. Marginalisation - integrate over unwanted components, e.g.

$$f(x_1, x_3) = \int_{x_2=-\infty}^{\infty} \int_{x_4=-\infty}^{\infty} f(x_1, x_2, x_3, x_4)\, dx_2 dx_4$$


3.11 Conditional densities and Bayes rule

These again are obtained as a direct generalisation of the results for single random variables. We treat only the case of PDFs for continuous random vectors, as the discrete and mixed cases follow exactly the same form.

$$f_{\mathbf{X}|\mathbf{Y}}(\mathbf{x}|\mathbf{y}) = \frac{f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y})}{f_{\mathbf{Y}}(\mathbf{y})}$$

Conditional probability (random vector)

$$f_{\mathbf{X}|\mathbf{Y}}(\mathbf{x}|\mathbf{y}) = \frac{f_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x})\, f_{\mathbf{X}}(\mathbf{x})}{f_{\mathbf{Y}}(\mathbf{y})}$$

Bayes rule (random vector)

3.12 Functions of random vectors

Consider a vector function of a vector random variable:

$$\mathbf{Y} = \begin{bmatrix} g_1(\mathbf{X}) \\ g_2(\mathbf{X}) \\ \vdots \\ g_N(\mathbf{X}) \end{bmatrix} = \mathbf{g}(\mathbf{X}), \quad \mathbf{g}(\mathbf{X}) \in \Re^N$$

What is the derived PDF $f_{\mathbf{Y}}(\mathbf{y})$, given $f_{\mathbf{X}}(\mathbf{x})$ and $\mathbf{g}()$? Assuming that $\mathbf{g}()$ is one-to-one, invertible and differentiable, the derived PDF is obtained as:

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{|\det(\mathbf{J})|}\, f_{\mathbf{X}}(\mathbf{x})\big|_{\mathbf{x} = \mathbf{g}^{-1}(\mathbf{y})}$$


where the Jacobian $|\det(\mathbf{J})|$ gives the ratio between the infinitesimal $N$-dimensional volumes in the $\mathbf{x}$ space and $\mathbf{y}$ space, with $\mathbf{J}$ defined as:

$$\mathbf{J} = \frac{\partial \mathbf{g}(\mathbf{x})}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial g_1(\mathbf{x})}{\partial x_1} & \frac{\partial g_1(\mathbf{x})}{\partial x_2} & \cdots & \frac{\partial g_1(\mathbf{x})}{\partial x_N} \\ \frac{\partial g_2(\mathbf{x})}{\partial x_1} & \frac{\partial g_2(\mathbf{x})}{\partial x_2} & \cdots & \frac{\partial g_2(\mathbf{x})}{\partial x_N} \\ \vdots & \vdots & & \vdots \\ \frac{\partial g_N(\mathbf{x})}{\partial x_1} & \frac{\partial g_N(\mathbf{x})}{\partial x_2} & \cdots & \frac{\partial g_N(\mathbf{x})}{\partial x_N} \end{bmatrix}$$

When the mapping is not one-to-one, the derived PDF can be found as a summation of terms for each root of the equation $\mathbf{y} = \mathbf{g}(\mathbf{x})$, as for the univariate case.

3.13 The multivariate Gaussian

We will firstly present the multivariate Gaussian with independent elements and then use the derived PDF result to derive the general case.

Consider $N$ independent random variables $X_i$ with 'standard' (i.e. with mean zero and unity standard deviation) normal distributions:

$$f(x_i) = N(0, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} x_i^2}$$

Since the $X_i$'s are independent we can immediately write the joint (vector) PDF for all $N$ RVs as:

$$f(x_1, x_2, \ldots, x_N) = f(x_1) f(x_2) \ldots f(x_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} x_i^2} = \frac{1}{\left(\sqrt{2\pi}\right)^N} \exp\left( -\frac{1}{2} \sum_{i=1}^{N} x_i^2 \right) = \frac{1}{\left(\sqrt{2\pi}\right)^N} \exp\left( -\frac{1}{2} \|\mathbf{x}\|_2^2 \right)$$

Now apply the general linear (invertible) matrix transformation to the vector random variable $\mathbf{X}$:

$$\mathbf{Y} = \mathbf{g}(\mathbf{X}) = \mathbf{A}\mathbf{X} + \mathbf{b}, \quad \text{where } \mathbf{A} \text{ is an invertible } N \times N \text{ matrix}$$


We wish to determine the derived PDF $f_{\mathbf{Y}}(\mathbf{y})$. Firstly find the Jacobian, $|\mathbf{J}| = \left| \frac{\partial \mathbf{g}(\mathbf{x})}{\partial \mathbf{x}} \right|$:

$$y_i = \sum_{j=1}^{N} [\mathbf{A}]_{ij}\, x_j + b_i \quad \therefore\ [\mathbf{J}]_{ij} = \frac{\partial y_i}{\partial x_j} = [\mathbf{A}]_{ij} \quad \therefore\ \mathbf{J} = \mathbf{A} \quad \therefore\ |\det(\mathbf{J})| = |\det(\mathbf{A})|$$

The matrix transformation is invertible, so we can write:

$$\mathbf{x} = \mathbf{g}^{-1}(\mathbf{y}) = \mathbf{A}^{-1}(\mathbf{y} - \mathbf{b})$$

Hence:

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{|\det(\mathbf{J})|}\, f_{\mathbf{X}}(\mathbf{x})\big|_{\mathbf{x} = \mathbf{g}^{-1}(\mathbf{y})} = \frac{1}{|\det(\mathbf{A})|} \frac{1}{\left(\sqrt{2\pi}\right)^N} \exp\left( -\frac{1}{2} \|\mathbf{A}^{-1}(\mathbf{y} - \mathbf{b})\|_2^2 \right) = \frac{1}{|\det(\mathbf{A})|} \frac{1}{\left(\sqrt{2\pi}\right)^N} \exp\left( -\frac{1}{2} (\mathbf{y} - \mathbf{b})^T (\mathbf{A}\mathbf{A}^T)^{-1} (\mathbf{y} - \mathbf{b}) \right)$$

Now define $\mathbf{C_Y} = \mathbf{A}\mathbf{A}^T$, and hence $|\det(\mathbf{A})| = |\det(\mathbf{C_Y})|^{\frac{1}{2}}$. Substituting $\mathbf{C_Y} = \mathbf{A}\mathbf{A}^T$ and $\boldsymbol{\mu} = \mathbf{b}$ we obtain finally:

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{\left(\sqrt{2\pi}\right)^N \det(\mathbf{C_Y})^{\frac{1}{2}}} \exp\left( -\frac{1}{2} (\mathbf{y} - \boldsymbol{\mu})^T \mathbf{C_Y}^{-1} (\mathbf{y} - \boldsymbol{\mu}) \right)$$

Multivariate Gaussian PDF

where $\boldsymbol{\mu} = E[\mathbf{Y}]$ is the mean vector and $\mathbf{C_Y} = E[(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})^T]$ is the covariance matrix (since $[\mathbf{C_Y}]_{ij} = c_{Y_i Y_j}$, the covariance between elements $Y_i$ and $Y_j$). Note that $f_{\mathbf{X}}(\mathbf{x})$, the PDF of the original independent data, is a special case of the multivariate Gaussian with $\mathbf{C_Y} = \mathbf{I}$, the identity matrix. The multivariate Gaussian will be used extensively throughout the remainder of this book. See figure 3.1 for contour and 3-D plots of an example of the two-dimensional Gaussian density.
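The derivation above runs in reverse as a sampling recipe (a sketch of our own): factor the desired covariance as $\mathbf{C_Y} = \mathbf{A}\mathbf{A}^T$ (e.g. by Cholesky decomposition), then transform independent standard normals through $\mathbf{Y} = \mathbf{A}\mathbf{X} + \boldsymbol{\mu}$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])          # desired covariance C_Y

# Factor C_Y = A A^T (Cholesky), then Y = A X + mu with X ~ N(0, I) --
# exactly the construction used in the derivation above.
A = np.linalg.cholesky(C)
X = rng.standard_normal((2, 100_000))
Y = A @ X + mu[:, None]

print(Y.mean(axis=1))               # close to mu
print(np.cov(Y))                    # close to C_Y
```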


3.14 Random signals

In this section we extend the ideas of previous sections to model the properties of Random Signals, i.e. waveforms which are of infinite duration and which can take on random values at all points in time.

Consider an underlying probability space $(\Omega, \mathcal{F}, P)$, as before. However, instead of mapping from the sample space $\Omega$ to a finite point or set of points such as a random variable or random vector, each member $\omega$ of the sample space now maps onto a real-valued waveform $X(t, \omega)$, with values possibly defined for all times $t$.

We can imagine a generalisation of our previous ideas about random experiments so that the outcome of an experiment can be a 'Random Object', an example of which is a waveform chosen at random (according to a probability measure $P$) from a set of possible waveforms, which we term an Ensemble.


[Figure]
FIGURE 3.1. Example of the bivariate Gaussian PDF $f_{X_1,X_2}(x_1, x_2)$. Top - 3-D mesh plot, bottom - contour plot


[Figure: waveforms $X(t, \omega_1), X(t, \omega_2), \ldots, X(t, \omega_i), X(t, \omega_{i+1})$ plotted against $t$]
FIGURE 3.2. Ensemble representation of a random process


3.15 Definition: random process

Given a probability space $(\Omega, \mathcal{F}, P)$, a random process $X(t, \omega)$ is a function which maps each element $\omega$ of the sample space $\Omega$ onto a real-valued function of time (see figure 3.2). The mapping is defined at times $t$ belonging to some set $T$ (a more rigorous notation would thus be $\{X(t, \omega);\ t \in T\}$). If $T$ is a continuous set, e.g. $\Re$ or $[0, \infty)$, then the process is termed a continuous time random process. If $T$ is a discrete set of time values, e.g. $\mathbb{Z}$, the integers, the process is termed discrete time, or a time series. We will be concerned mainly with discrete time processes in this book. As for random variables we will usually drop the explicit dependence on $\omega$ for notational convenience, referring simply to 'random process $\{X(t)\}$'. If we consider the process $\{X(t)\}$ at one particular time $t = t_1$ then we have a random variable $X(t_1)$. If we consider the process $\{X(t)\}$ at $N$ time instants $\{t_1, t_2, \ldots, t_N\}$ then we have a random vector:

$$\mathbf{X} = \begin{bmatrix} X(t_1) \\ X(t_2) \\ \vdots \\ X(t_N) \end{bmatrix}$$

We can study the properties of a random process by considering the behaviour of random variables and random vectors extracted from the process at times $t_i$. We already have the means to achieve this through the theory of random variables and random vectors.

We will limit our discussion here to discrete time random processes or time series defined at time instants $T = \{t_{-\infty}, \ldots, t_{-1}, t_0, t_1, \ldots, t_{\infty}\}$, where usually (although this is not a requirement) the process is uniformly sampled, i.e. $t_n = nT$ where $T$ is the sampling interval. We will use the notation $X_n = X(t_n)$ for random variables extracted at the $n$th time point, and the whole discrete time process will be referred to as $\{X_n\}$.

3.15.1 Mean values and correlation functions

The mean value of a discrete time random process is defined as:

$$\mu_n = E[X_n]$$

Mean value of random process

and the discrete autocorrelation function is:


$$r_{XX}(n, m) = E[X_n X_m]$$

Autocorrelation function of random process

The cross-correlation function between two processes $\{X_n\}$ and $\{Y_n\}$ is:

$$r_{XY}(n, m) = E[X_n Y_m]$$

Cross-correlation function

3.15.2 Stationarity

A stationary process has the same statistical characteristics irrespective of time shifts along the axis. To put it another way, an observer looking at the process from time $t_i$ would not be able to tell the difference in the statistical characteristics of the process if he moved to a different time $t_j$. This idea is formalised by considering the $N$th order density for the process:

$$f_{X_{n_1}, X_{n_2}, \ldots, X_{n_N}}(x_{n_1}, x_{n_2}, \ldots, x_{n_N})$$

Nth order density for a random process

which is the joint probability density function for $N$ arbitrarily chosen time indices $\{n_1, n_2, \ldots, n_N\}$. Since the probability distribution of a random vector contains all the statistical information about that random vector, we would expect the probability distribution to be unchanged if we shifted the time axis any amount to the left or the right. This is the idea behind strict-sense stationarity for a discrete random process.


A random process is strict-sense stationary if, for any finite $c$, $N$ and $\{n_1, n_2, \ldots, n_N\}$:

$$f_{X_{n_1}, X_{n_2}, \ldots, X_{n_N}}(x_{n_1}, x_{n_2}, \ldots, x_{n_N}) = f_{X_{n_1+c}, X_{n_2+c}, \ldots, X_{n_N+c}}(x_{n_1}, x_{n_2}, \ldots, x_{n_N})$$

Strict-sense stationarity for a random process

A less stringent condition which is nevertheless very useful for practical analysis is wide-sense stationarity, which only requires first and second order moments to be invariant to time shifts. A random process is wide-sense stationary if:

1. $\mu_n = E[X_n] = \mu$ (the mean is constant)

2. $r_{XX}(n, m) = r_{XX}(n - m)$ (the autocorrelation function depends only upon the difference between $n$ and $m$).

Wide-sense stationarity for a random process

Note that strict-sense stationarity implies wide-sense stationarity, but not vice versa.

3.15.3 Power spectra

For a wide-sense stationary random process $\{X_n\}$, the power spectrum is defined as the discrete-time Fourier transform (DTFT) of the discrete autocorrelation function:

$$S_X(\omega) = \sum_{m=-\infty}^{\infty} r_{XX}(m)\, e^{-jm\omega T} \qquad (3.6)$$

Power spectrum for a random process

and the autocorrelation function can thus be found from the power spectrum by inverting the transform:


$$r_{XX}(m) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_X(\omega)\, e^{jm\omega T}\, d\omega T \qquad (3.7)$$

Autocorrelation function from power spectrum

The power spectrum is a real, even and periodic function of frequency. The power spectrum can be interpreted as a density spectrum in the sense that the expected power at the output of an ideal band-pass filter with lower and upper cut-off frequencies of $\omega_l$ and $\omega_u$ is given by

$$\frac{1}{\pi} \int_{\omega_l}^{\omega_u} S_X(\omega)\, d\omega$$

3.15.4 Linear systems and random processes

When a wide-sense stationary discrete random process $\{X_n\}$ is passed through a linear time invariant (LTI) system with impulse response $\{h_n\}$, the output process $\{Y_n\}$ is also wide-sense stationary, and we can express the output correlation functions and power spectra in terms of the input statistics and the LTI system:

$$r_{YY}(l) = \sum_{k=-\infty}^{\infty} \sum_{i=-\infty}^{\infty} h_k\, h_i\, r_{XX}(l + i - k) \qquad (3.8)$$

Autocorrelation function at the output of a LTI system

$$S_Y(\omega) = |H(e^{j\omega T})|^2\, S_X(\omega) \qquad (3.9)$$

Power spectrum at the output of a LTI system
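A rough numerical sanity check of equation (3.9) (our own sketch): filter white noise (flat $S_X$) through a short FIR system and compare crude periodogram estimates of the output spectrum against $|H|^2 S_X$, averaged over frequency.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1 << 16
x = rng.standard_normal(N)            # white noise: S_X(w) ~ 1
h = np.array([1.0, -0.5, 0.25])       # FIR system h_n
y = np.convolve(x, h, mode='same')    # output process (edge effects negligible)

# Crude periodogram estimates of the two power spectra
Sx = np.abs(np.fft.rfft(x))**2 / N
Sy = np.abs(np.fft.rfft(y))**2 / N
H = np.fft.rfft(h, N)                 # frequency response on the same grid

# Equation (3.9): S_Y should track |H|^2 S_X (up to estimation noise)
ratio = Sy.mean() / (np.abs(H)**2 * Sx).mean()
print(ratio)                          # close to 1
```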

3.16 Conclusion

We have provided a brief survey of results from probability theory and random signal theory on which the remaining chapters are based. The next chapter describes the theory of detection and estimation.


4
Parameter Estimation, Model Selection and Classification

In this chapter we review the fundamental techniques and principles behind estimation of unobserved variables from observed data ('parameter estimation') and selection of a classification or form of model to represent the observed data ('classification' and 'model selection'). The term parameter estimation is used in a very general sense here. For example, we will treat both the interpolation of missing data samples in a digitised waveform and the more obvious task of coefficient estimation for some parametric data model as examples of parameter estimation. Classification is traditionally concerned with choosing from a set of possible classes the class which best represents the observed data. This concept will be extended to include directly analogous tasks such as identifying which samples in a data sequence are corrupted by some intermittent noise process. Model selection is another type of classification in which the classes are a number of different model-based formulations for data, and it is required to choose the best model for the observed data.

The last sections of the chapter introduce two related topics which will be important in later work. The first is Sequential Bayesian Classification, in which classification estimates are updated sequentially with the input of new data samples. The second is autoregressive (AR) modelling considered within a Bayesian framework, which will be a fundamental part of subsequent chapters. We then discuss state-space models and the Kalman filter, an efficient means for implementing many of the sequential schemes, and finally introduce some sophisticated methods for exploration of posterior distributions: the expectation-maximisation and Markov chain Monte Carlo methods.


In all of our work parameters and models are treated as random variables or random vectors. We are thus prepared to specify prior information concerning the processes in a probabilistic form, which leads to a Bayesian approach to the solution of these problems. The chapter is thus primarily concerned with developing the principles of Bayesian Decision Theory and Estimation as applied to signal processing. We do not intend to provide a complete review of non-Bayesian techniques, although traditional methods are referenced and described and the relationship of Bayesian methods to Maximum Likelihood (ML) and Least Squares (LS) is discussed.

4.1 Parameter estimation

In parameter estimation we suppose that a random process X_n depends in some well-defined stochastic manner upon an unobserved parameter vector θ. If we observe N data points from the random process, we can form a vector x = [x_1 x_2 \ldots x_N]^T. The parameter estimation problem is to deduce the value of θ from the observations x. In general it will not be possible to deduce the parameters exactly from a finite number of data points since the process is random, but various schemes will be described which achieve different levels of performance depending upon the amount of data available and the amount of prior information available regarding θ.

4.1.1 The general linear model

We now define a general parametric model which will be referred to frequently in this and subsequent chapters. In this model it is assumed that the data x are generated as a function of the parameters θ with an additive random modelling error term e_n:

x_n = g_n(\theta) + e_n

where g_n(.) is a deterministic and possibly non-linear function. For most of the models we will consider, g_n(.) is a linear function of the parameters so we may write

x_n = g_n^T \theta + e_n

where g_n is a P-dimensional column vector, and the expression may be written for the whole vector x as

x = G\theta + e \qquad (4.1)

General linear model


where

G = \begin{bmatrix} g_1^T \\ g_2^T \\ \vdots \\ g_N^T \end{bmatrix}

The columns of G form a fixed basis vector representation of the data, for example sinusoids of different frequencies in a signal which is known to be made up of pure frequency components in noise. A variant of this model will be seen later in the chapter, the autoregressive (AR) model, in which previous data points are used to predict the current data point x_n. The error sequence e will usually (but not necessarily) be assumed drawn from an independent, identically distributed (i.i.d.) noise distribution, that is,

p(e) = p_e(e_1)\, p_e(e_2) \cdots p_e(e_N)

where p_e(.) denotes some noise distribution which is identical for all n.¹

e_n can be viewed as a modelling error, innovation or observation noise, depending upon the type of model. We will encounter the linear form (4.1) and its variants throughout the book, so it is used as the primary example in the subsequent sections of this chapter.

4.1.2 Maximum likelihood (ML) estimation

The first method of estimation considered is the maximum likelihood (ML) estimator which treats the parameters as unknown constants about which we incorporate no prior information. The observed data x is, however, considered random and we can often then obtain the PDF for x when the value of θ is known. This PDF is termed the likelihood L(x; θ), which is defined as

L(x;θ) = p(x | θ) (4.2)

The likelihood is of course implicitly conditioned upon all of our modelling assumptions M, which could be expressed as p(x | θ, M). We do not adopt this convention, however, for reasons of notational simplicity.

The ML estimate for θ is then that value of θ which maximises the likelihood for given observations x:

¹Whenever the context makes it unambiguous we will adopt from now on a notation p(.) to denote both probability density functions (PDFs) and probability mass functions (PMFs) for random variables and vectors.


\hat{\theta}^{ML} = \arg\max_{\theta}\; p(x \mid \theta) \qquad (4.3)

Maximum likelihood (ML) estimator

The rationale behind this is that the ML solution corresponds to the parameter vector which would have generated the observed data x with highest probability. The maximisation task required for ML estimation can be achieved using standard differential calculus for well-behaved and differentiable likelihood functions, and it is often convenient analytically to maximise the log-likelihood function l(x; θ) = log(L(x; θ)) rather than L(x; θ) itself. Since log is a monotonically increasing function the two solutions are identical.

In data analysis and signal processing applications the likelihood function is arrived at through knowledge of the stochastic model for the data. For example, in the case of the general linear model (4.1) the likelihood can be obtained easily if we know the form of p(e), the joint PDF for the components of the error vector. The likelihood p(x | θ) is then found from a transformation of variables e → x where e = x − Gθ. The Jacobian (see section 3.12) for this transformation is unity, so the likelihood is:

L(x;\theta) = p(x \mid \theta) = p_e(x - G\theta) \qquad (4.4)

The elements of the error vector e are often assumed to be i.i.d. and Gaussian. If the variance of the Gaussian is \sigma_e^2 then we have:

p_e(e) = \frac{1}{(2\pi\sigma_e^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_e^2}\, e^T e \right)

Such an assumption is a reflection of the fact that many naturally occurring processes are Gaussian and that many of the subsequent results are easily obtained in closed form. The likelihood is then

L(x;\theta) = p_e(x - G\theta) = \frac{1}{(2\pi\sigma_e^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_e^2} (x - G\theta)^T (x - G\theta) \right) \qquad (4.5)

which leads to the following log-likelihood expression:

l(x;\theta) = -(N/2)\log(2\pi\sigma_e^2) - \frac{1}{2\sigma_e^2} (x - G\theta)^T (x - G\theta)

= -(N/2)\log(2\pi\sigma_e^2) - \frac{1}{2\sigma_e^2} \sum_{n=1}^{N} (x_n - g_n^T\theta)^2


Maximisation of this function w.r.t. θ is equivalent to minimising the sum-squared error of the sequence, E = \sum_{n=1}^{N} (x_n - g_n^T\theta)^2. This is exactly the criterion which is applied in the familiar least squares (LS) estimation method. The ML estimator is obtained by taking derivatives w.r.t. θ and equating to zero:

\hat{\theta}^{ML} = (G^T G)^{-1} G^T x \qquad (4.6)

Maximum likelihood for the general linear model

which is, as expected, the familiar linear least squares estimate for model parameters calculated from finite length data observations. Thus we see that the ML estimator under the i.i.d. Gaussian error assumption is exactly equivalent to the least squares (LS) solution.
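By way of illustration, the following minimal sketch (in Python with numpy; the sinusoidal basis, sample sizes and all variable names are our own choices for the example, not taken from the text) generates data from the general linear model (4.1) and recovers the ML/LS estimate (4.6):

import numpy as np

rng = np.random.default_rng(0)
N, sigma_e = 200, 0.5
n = np.arange(N)

# Basis matrix G: columns are fixed basis functions, here two sinusoids.
G = np.column_stack([np.sin(0.10 * np.pi * n), np.sin(0.25 * np.pi * n)])
theta_true = np.array([2.0, -1.0])
x = G @ theta_true + sigma_e * rng.standard_normal(N)   # model (4.1)

# ML / least squares estimate (4.6); solving the normal equations avoids
# forming the matrix inverse explicitly, which is numerically preferable.
theta_ml = np.linalg.solve(G.T @ G, G.T @ x)
print(theta_ml)   # close to theta_true for moderate N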

4.1.3 Bayesian estimation

Recall that ML methods treat parameters as unknown constants. If we are prepared to treat parameters as random variables it is possible to assign prior PDFs to the parameters. These PDFs should ideally express some prior knowledge about the relative probability of different parameter values before the data are observed. Of course if nothing is known a priori about the parameters then the prior distributions should in some sense express no initial preference for one set of parameters over any other. Note that in many cases a prior density is chosen to express some highly qualitative prior knowledge about the parameters. In such cases the prior chosen will be more a reflection of a degree of belief [97] concerning parameter values than any true modelling of an underlying random process which might have generated those parameters. This willingness to assign priors which reflect subjective information is a powerful feature and also one of the most fundamental differences between the Bayesian and ‘classical’ inferential procedures. For various expositions of the Bayesian methodology and philosophy see, for example, [97, 23, 15]. We take here a pragmatic viewpoint: if we have prior information about a problem then we will generally get better results through careful incorporation of that information. The precise form of probability distributions assigned a priori to the parameters requires careful consideration since misleading results can be obtained from erroneous priors, but in principle at least we can apply the Bayesian approach to any problem where statistical uncertainty is present.

Bayes rule is now stated as applied to estimation of random parameters θ from a random vector x of observations, known as the posterior or a posteriori probability for the parameter:


p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} \qquad (4.7)

Posterior probability

Note that all of the distributions in this expression are implicitly conditioned upon all prior modelling assumptions, as was the likelihood function earlier. The distribution p(x | θ) is the likelihood as used for ML estimation, while p(θ) is the prior or a priori distribution for the parameters. This term is one of the critical differences between Bayesian and ‘classical’ techniques. It expresses in an objective fashion the probability of various model parameter values before the data x has been observed. As we have already observed, the prior density may be an expression of highly subjective information about parameter values. This transformation from the subjective domain to an objective form for the prior can clearly be of great significance and should be considered carefully.

The term p(θ | x), the posterior or a posteriori distribution, expresses the probability of θ given the observed data x. This is now a true measure of how ‘probable’ a particular value of θ is, given the observations x. p(θ | x) is in a more intuitive form for parameter estimation than the likelihood, which expresses how probable the observations are given the parameters. The generation of the posterior distribution from the prior distribution when data x is observed can be thought of as a refinement to any previous (‘prior’) knowledge about the parameters. Before x is observed p(θ) expresses any information previously obtained concerning θ. Any new information concerning the parameters contained in x is then incorporated to give the posterior distribution. Clearly if we start off with little or no information about θ then the posterior distribution is likely to contain information obtained almost solely from x. Conversely, if p(θ) expresses a significant amount of information about θ then x will contribute less new information to the posterior distribution.

The denominator p(x), sometimes referred to as the ‘evidence’ because of its interpretation in model selection problems (see later), is constant for any given observation x and so may be ignored if we are only interested in the relative posterior probabilities of different parameters. In fact, Bayes rule is often stated in the form:


p(θ | x) ∝ p(x | θ) p(θ) (4.8)

Posterior probability (proportionality)

p(x) may be calculated, however, by integration:

p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta \qquad (4.9)

and this effectively serves as the normalising constant for the posterior density (in this and subsequent results the integration would be replaced by a summation in the case of a discrete random vector θ).

4.1.3.1 Posterior inference and Bayesian cost functions

The posterior distribution gives the probability for any chosen θ given observed data x, and as such optimally combines our prior information about θ and any additional information gained about θ from observing x. We may in principle manipulate the posterior density to infer any required statistic of θ conditional upon x. This is an advantage over ML and least squares methods which strictly give us only a single estimate of θ, known as a ‘point estimate’. However, by producing a posterior PDF with values defined for all θ the Bayesian approach gives a fully interpretable probability distribution. In principle this is as much as one could ever need to know. In signal processing problems, however, we usually require a single point estimate for θ, and a suitable way to choose this is via a ‘cost function’ C(\hat{\theta}, \theta) which expresses objectively a measure of the cost associated with a particular parameter estimate \hat{\theta} when the true parameter is θ (see e.g. [51, 15]). The form of cost function will depend on the requirements of a particular problem. A cost of 0 indicates that the estimate is perfect for our requirements (this does not necessarily imply that \hat{\theta} = \theta) while positive values indicate poorer estimates. The risk associated with a particular estimator is then defined as the expected posterior cost associated with that estimate:

R(\hat{\theta}) = E[C(\hat{\theta}, \theta)] = \int_x \int_{\theta} C(\hat{\theta}, \theta)\, p(\theta \mid x)\, p(x)\, d\theta\, dx. \qquad (4.10)

We require the estimation scheme which chooses \hat{\theta} to minimise the risk. The minimum risk is known as the ‘Bayes risk’. For non-negative cost functions it is sufficient to minimise only the inner integral

I(\hat{\theta}) = \int C(\hat{\theta}, \theta)\, p(\theta \mid x)\, d\theta \qquad (4.11)


for all x. Typical cost functions are the quadratic cost function |\hat{\theta} - \theta|^2 and the uniform cost function, defined for arbitrarily small ε as

C(\hat{\theta}, \theta) = \begin{cases} 1, & |\hat{\theta} - \theta| > \varepsilon \\ 0, & \text{otherwise} \end{cases} \qquad (4.12)

The quadratic cost function leads to the minimum mean-squared error (MMSE) estimator and as such is reasonable for many examples of parameter estimation, where we require an estimate representative of the whole posterior density. The MMSE estimate can be shown to equal the mean of the posterior distribution:

\hat{\theta} = \int \theta\, p(\theta \mid x)\, d\theta. \qquad (4.13)

Where the posterior distribution is symmetrical about its mean the posterior mean can in fact be shown to be the minimum risk solution for any cost function which is a convex function of |\hat{\theta} - \theta| [186].

The uniform cost function is useful for the ‘all or nothing’ scenario where we wish to attain the correct parameter estimate at all costs and any other estimate is of no use. Therrien [174] cites the example of a pilot landing a plane on an aircraft carrier. If he does not estimate within some small finite error he misses the ship, in which case the landing is a disaster. The uniform cost function for ε → 0 leads to the maximum a posteriori (MAP) estimate, the value of θ which maximises the posterior distribution:

\hat{\theta}^{MAP} = \arg\max_{\theta}\; p(\theta \mid x) \qquad (4.14)

Maximum a posteriori (MAP) estimator

Note that for Gaussian posterior distributions the MMSE and MAP solutions coincide, as indeed they do for any distribution symmetric about its mean with its maximum at the mean. The MAP estimate is thus appropriate for many common estimation problems, including those encountered in this book.

We now work through the MAP estimation scheme for the general linear model (4.1). Suppose that the prior on parameter vector θ is the multivariate Gaussian (A.2):

p(\theta) = \frac{1}{(2\pi)^{P/2} |C|^{1/2}} \exp\left( -\frac{1}{2} (\theta - m)^T C^{-1} (\theta - m) \right) \qquad (4.15)


where m is the parameter mean vector, C is the parameter covariance matrix and P is the number of parameters in θ. If the likelihood p(x | θ) takes the same form as before (4.5), the posterior distribution is as follows:

p(\theta \mid x) \propto \frac{1}{(2\pi)^{P/2} |C|^{1/2}} \frac{1}{(2\pi\sigma_e^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_e^2} (x - G\theta)^T (x - G\theta) - \frac{1}{2} (\theta - m)^T C^{-1} (\theta - m) \right)

and the MAP estimate \hat{\theta}^{MAP} is obtained by differentiation of the log-posterior as:

\hat{\theta}^{MAP} = \left( G^T G + \sigma_e^2 C^{-1} \right)^{-1} \left( G^T x + \sigma_e^2 C^{-1} m \right) \qquad (4.16)

MAP estimator - general linear model

In this expression we can clearly see the ‘regularising’ effect of the prior density on the ML estimate of (4.6). As the prior becomes more ‘diffuse’, i.e. the diagonal elements of C increase both in magnitude and relative to the off-diagonal elements, we impose ‘less’ prior information on the estimate. In the limit the prior tends to a uniform (‘flat’) prior with all θ equally probable. In this limit C^{-1} = 0 and the estimate is identical to the ML estimate (4.6). This important relationship demonstrates that the ML estimate may be interpreted as the MAP estimate with a uniform prior assigned to θ. The MAP estimate will also tend towards the ML estimate when the likelihood is strongly ‘peaked’ around its maximum compared with the prior. Once again the prior will have little influence on the estimate. It is in fact well known [97] that as the sample size N tends to infinity the Bayes solution tends to the ML solution. This of course says nothing about small sample parameter estimates where the effect of the prior may be very significant.
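This regularising behaviour is easily checked numerically. A minimal sketch of the MAP estimate (4.16) follows, with hypothetical values for the prior mean and covariance (all choices here are ours, for illustration only):

import numpy as np

rng = np.random.default_rng(1)
N, P, sigma_e = 50, 2, 1.0
n = np.arange(N)
G = np.column_stack([np.sin(0.1 * np.pi * n), np.cos(0.1 * np.pi * n)])
x = G @ np.array([1.5, -0.5]) + sigma_e * rng.standard_normal(N)

m = np.zeros(P)                       # prior mean vector
for scale in (0.1, 1.0, 100.0):       # increasingly diffuse priors
    Cinv = np.linalg.inv(scale * np.eye(P))
    # MAP estimate (4.16)
    theta_map = np.linalg.solve(G.T @ G + sigma_e**2 * Cinv,
                                G.T @ x + sigma_e**2 * Cinv @ m)
    print(scale, theta_map)

As the prior covariance scale grows the MAP estimate approaches the ML/LS estimate, exactly as described above.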

4.1.3.2 Marginalisation for elimination of unwanted parameters

In many cases we can formulate a likelihood function for a particular problem which depends on more unknown parameters than are actually wanted for estimation. These will often be ‘scale’ parameters such as unknown noise or excitation variances but may also be unobserved (‘missing’) data values or unwanted system parameters. A full ML procedure requires that the


likelihood be maximised w.r.t. all of these parameters and the unwanted values are then simply discarded to give the required estimate.² However, this may not in general be an appropriate procedure for obtaining only the required parameters - a cost function which depends only upon a certain subset of parameters leads to an estimator which only depends upon the marginal probability for those parameters. The Bayesian approach allows for the interpretation of these unwanted or ‘nuisance’ parameters as random variables, for which as usual we can specify prior densities. The marginalisation identity can be used to eliminate these parameters from the posterior distribution, and from this we are able to obtain a posterior distribution in terms of only the desired parameters. Consider an unwanted parameter φ which is present in the modelling assumptions. The unwanted parameter is now eliminated from the posterior expression by marginalisation:

p(\theta \mid x) = \int p(\theta, \phi \mid x)\, d\phi \propto \int p(x \mid \theta, \phi)\, p(\theta, \phi)\, d\phi \qquad (4.17)

4.1.3.3 Choice of priors

One of the major criticisms levelled at Bayesian techniques is that the choice of priors can be highly subjective and, as mentioned above, the inclusion of inappropriate priors could give misleading results. There is clearly no problem if the prior statistics are genuinely known. Difficulties arise, however, when an attempt is made to choose prior densities to express some very subjective piece of prior knowledge or when nothing at all appears to be known beforehand about the parameters.

In many engineering applications we may have genuine prior belief in terms of the observed long term behaviour of a particular parameter before the current data were observed. In these cases there is clearly very little problem with the Bayesian approach. In other cases subjective information may be available, such as an approximate range of feasible values, or a rough idea of the parameter's likely value. In these cases a pragmatic approach can be adopted: choose a prior with convenient analytical properties which expresses the type of information available to us. We have already seen an example of such a prior in the Gaussian prior assigned to the linear parameter vector (4.15). If an approximate value for θ were known, this could be chosen as the mean vector m of the Gaussian. If we can quantify our uncertainty about this mean value then this can be incorporated into the prior covariance matrix. To take another example, the variance of

²Although in time series likelihood-based analysis it is common to treat missing and other unobserved data in a Bayesian fashion, see [91] and references therein.


the error \sigma_e^2 might be unknown. However, physical considerations for the system might lead us to expect that a certain range of values is more likely than others. A suitable prior for this case might be the inverted Gamma distribution (A.11), which is defined for positive random variables and can be assigned a mean and variance. In either of these cases the Bayesian analysis of the linear model remains analytically tractable and such priors are termed conjugate priors (see [15, appendix A.2] for a comprehensive list of likelihood functions and their associated conjugate priors).

In cases where no prior information can be assumed we will wish to assign a prior which does not bias the outcome of the inference procedure. Such priors are termed non-informative and detailed discussions can be found in Box and Tiao [23] and Bernardo and Smith [15]. We do not detail the construction of non-informative priors here, but give two important examples which are much used. Firstly, a non-informative prior which is commonly used for the linear ‘location’ parameters θ in the general linear model (4.1) is the uniform prior. Secondly, the standard non-informative prior for a ‘scale’ parameter such as \sigma_e^2 in the linear model, is the Jeffreys' prior [97] which has the form p(\sigma^2) = 1/\sigma^2. These priors are designed to give invariance of the inferential procedure to re-scaling of the data. A troublesome point, however, is that both are unnormalised, which means they are not proper densities. Furthermore, there are some models for which it is not possible to construct a non-informative prior.

4.1.4 Bayesian decision theory

As for Bayesian parameter estimation we consider the unobserved variable (in this case a classification state s_i) as being generated by some random process whose prior probabilities are known. These prior probabilities are assigned to each of the possible classification states using a probability mass function (PMF) p(s_i), which expresses the prior probability of occurrence of different states given all information available except the data x. The required form of Bayes rule for this discrete estimation problem is then

p(s_i \mid x) = \frac{p(x \mid s_i)\, p(s_i)}{p(x)} \qquad (4.18)

p(x) is constant for any given x and will serve to normalise the posterior probabilities over all i in the same way that the ‘evidence’ normalised the posterior parameter distribution (4.7). In the same way that Bayes rule gave a posterior distribution for parameters θ, this expression gives the posterior probability for a particular state given the observed data x. It would seem reasonable to choose the state s_i corresponding to maximum posterior probability as our estimate for the true state (we will refer to this state estimate as the MAP estimate), and this can be shown to have the desirable property of minimum classification error rate P_E (see e.g. [51]),


that is, it has minimum probability of choosing the wrong state. Note that determination of the MAP state vector will usually involve an exhaustive search of p(s_i | x) for all feasible i, although sub-optimal schemes are developed for specific applications encountered later.

These ideas are formalised by consideration of a ‘loss function’ λ(α_i | s_j) which defines the penalty incurred by taking action α_i when the true state is s_j. Action α_i will usually refer to the action of choosing state s_i as the state estimate.

The expected risk associated with action α_i (known as the conditional risk) is then expressed as

R(\alpha_i \mid x) = \sum_{j=1}^{N_s} \lambda(\alpha_i \mid s_j)\, p(s_j \mid x). \qquad (4.19)

It can be shown that it is sufficient to minimise this conditional risk in order to achieve the optimal decision rule for a given problem and loss function.

Consider a loss function which is zero when i = j and unity otherwise. This ‘symmetric’ loss function can be viewed as the equivalent of the uniform cost function used for parameter estimation (4.12). The conditional risk is then given by:

R(\alpha_i \mid x) = \sum_{j=1,\, j \neq i}^{N_s} p(s_j \mid x) \qquad (4.20)

= 1 - p(s_i \mid x) \qquad (4.21)

The second line here is simply the conditional probability that action α_i is incorrect, and hence minimisation of the conditional risk is equivalent to minimisation of the probability of classification error, P_E. It is clear from this expression that selection of the MAP state is the optimal decision rule for the symmetric loss function.

Cases where minimum classification error rate may not be the best criterion are encountered and discussed later. However, alternative loss functions will not in general lead to estimators as simple as the MAP estimate.

4.1.4.1 Calculation of the evidence, p(x | si)

The term p(x | s_i) is equivalent to the ‘evidence’ term p(x) which was encountered in the parameter estimation section, since p(x) was implicitly conditioned on a particular model structure or state in that scheme.

If one uses a uniform state prior p(s_i) = 1/N_s, then, according to equation (4.18), it is only necessary to compare values of p(x | s_i) for model selection since the remaining terms are constant for all models. p(x | s_i) can then be viewed literally as the relative ‘evidence’ for a particular model, and two candidate models can be compared through their Bayes Factor:


BF_{ij} = \frac{p(x \mid s_i)}{p(x \mid s_j)}

Typically each model or state s_i will be expressed in a parametric form whose parameters θ_i are unknown. As for the parameter estimation case it will usually be possible to obtain the state conditional parameter likelihood p(x | θ_i, s_i). Given a state-dependent prior distribution for θ_i the evidence may be obtained by integration to eliminate θ_i from the joint probability p(x, θ_i | s_i). The evidence is then obtained using result (4.17) as

p(x \mid s_i) = \int_{\theta_i} p(x \mid \theta_i, s_i)\, p(\theta_i \mid s_i)\, d\theta_i \qquad (4.22)

If the general linear model of (4.1) is extended to the multi-model scenario we obtain:

x = G_i\theta_i + e_i \qquad (4.23)

where G_i refers to the state-dependent basis matrix and e_i is the corresponding error sequence. For this model the state dependent parameter likelihood p(x | θ_i, s_i) is (see (4.5)):

p(x \mid \theta_i, s_i) = \frac{1}{(2\pi\sigma_{e_i}^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_{e_i}^2} (x - G_i\theta_i)^T (x - G_i\theta_i) \right) \qquad (4.24)

and assuming the same Gaussian form for the state conditional parameter prior p(θ_i | s_i) (with P_i parameters) as we used for p(θ) in (4.15), the evidence is then given by:

p(x \mid s_i) = \int_{\theta_i} \frac{1}{(2\pi)^{P_i/2} |C_i|^{1/2}} \frac{1}{(2\pi\sigma_{e_i}^2)^{N/2}} \exp\left( -\frac{1}{2\sigma_{e_i}^2} (x - G_i\theta_i)^T (x - G_i\theta_i) - \frac{1}{2} (\theta_i - m_i)^T C_i^{-1} (\theta_i - m_i) \right) d\theta_i \qquad (4.25)

This multivariate Gaussian integral can be performed after some rearrangement using result (A.5) to give:

p(x \mid s_i) = \frac{1}{(2\pi)^{P_i/2} |C_i|^{1/2} |\Phi|^{1/2} (2\pi\sigma_{e_i}^2)^{(N-P_i)/2}} \exp\left( -\frac{1}{2\sigma_{e_i}^2} \left( x^T x + \sigma_{e_i}^2 m_i^T C_i^{-1} m_i - \Theta^T \hat{\theta}_i^{MAP} \right) \right) \qquad (4.26)


with terms defined as

\hat{\theta}_i^{MAP} = \Phi^{-1}\Theta \qquad (4.27)

\Phi = G_i^T G_i + \sigma_{e_i}^2 C_i^{-1} \qquad (4.28)

\Theta = G_i^T x + \sigma_{e_i}^2 C_i^{-1} m_i \qquad (4.29)

Notice that \hat{\theta}_i^{MAP} is simply the state dependent version of the MAP parameter estimate given by (4.16).

This expression for the state evidence is proportional to the joint probability p(x, θ_i | s_i) with the MAP value \hat{\theta}_i^{MAP} substituted for θ_i. This is easily verified by substitution of \hat{\theta}_i^{MAP} into the integrand of (4.25) and comparing with (4.26). The following relationship results:

p(x \mid s_i) = \frac{(2\pi\sigma_{e_i}^2)^{P_i/2}}{|\Phi|^{1/2}}\, p(x, \theta_i \mid s_i) \Big|_{\theta_i = \hat{\theta}_i^{MAP}}. \qquad (4.30)

Thus within this all-Gaussian framework the only difference between substituting in MAP parameters for θ_i and performing the full marginalisation integral is a multiplicative term (2\pi\sigma_{e_i}^2)^{P_i/2} / |\Phi|^{1/2}.

In the case where the state dependent likelihood is expressed in terms of unknown scale parameters (e.g. \sigma_{e_i}^2) it is sometimes possible to marginalise these parameters in addition to the system parameters θ_i, depending upon the prior structure chosen. The same conjugate prior as proposed in section 4.1.3.3, the inverted-gamma distribution, may be applied. If we wished to eliminate \sigma_{e_i}^2 from the evidence result of (4.26) then one prior structure which leads to analytic results is to reparameterise the prior covariance matrix for θ_i as C_i' = C_i / \sigma_{e_i}^2 in order to eliminate the dependency of Θ and Φ on \sigma_{e_i}^2. Such a reparameterisation becomes unnecessary if a uniform prior is assumed for θ_i, in which case C_i^{-1} → 0 and Θ and Φ no longer depend upon \sigma_{e_i}^2.

4.1.4.2 Determination of the MAP state estimate

Having obtained expressions for the state evidence and assigned the prior probabilities p(s_i), the posterior state probability (4.18) may then be evaluated for all i in order to identify the MAP state estimate. For large numbers of states N_s this can be highly computationally intensive, in which case some sub-optimal search procedure must be devised. This issue is addressed in later chapters for Bayesian click removal.
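A sketch of how the evidence (4.26) might be evaluated in practice for a small set of candidate models follows (Python/numpy; the function name and the use of slogdet for numerical stability are our own choices, not from the text):

import numpy as np

def log_evidence(x, G, sigma_e, m, C):
    # Log of the state evidence (4.26) for the linear Gaussian model (4.23).
    N, P = G.shape
    Cinv = np.linalg.inv(C)
    Phi = G.T @ G + sigma_e**2 * Cinv            # (4.28)
    Theta = G.T @ x + sigma_e**2 * Cinv @ m      # (4.29)
    theta_map = np.linalg.solve(Phi, Theta)      # (4.27)
    quad = x @ x + sigma_e**2 * (m @ Cinv @ m) - Theta @ theta_map
    return (-0.5 * P * np.log(2.0 * np.pi)
            - 0.5 * np.linalg.slogdet(C)[1]
            - 0.5 * np.linalg.slogdet(Phi)[1]
            - 0.5 * (N - P) * np.log(2.0 * np.pi * sigma_e**2)
            - quad / (2.0 * sigma_e**2))

# With a uniform state prior p(s_i) = 1/N_s, the MAP state is simply the
# model whose basis matrix gives the largest log-evidence, e.g.
#   i_map = np.argmax([log_evidence(x, G, s, m, C) for (G, s, m, C) in models])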

4.1.5 Sequential Bayesian classification

Sequential classification is concerned with updating classification estimates as new data elements are input. Typically x will contain successive data samples from a time series, and an example is a real-time signal processing application in which new samples are received from the measurement system one-by-one and we wish to refine some classification of the data as more data is received. Say we have observed k samples which are placed in vector x_k. The decision theory of previous sections can be used to calculate the posterior state probability p(s_i | x_k) for each state s_i. The sequential classification problem can then be stated as follows: given some state estimate for the first k data samples of a sequence, form a revised state estimate when a new data sample x_{k+1} is observed.

The optimal Bayes decision procedure requires minimisation of the conditional risk (4.19) and hence evaluation of all the state posterior probabilities p(s_i | x_k). Considering (4.18), the prior state probabilities p(s_i) do not change as more data is received, and hence the problem is reduced to that of updating the evidence term p(x | s_i) for each possible state with each incoming data sample. This update is specified in recursive form:

p(x_{k+1} \mid s_i) = g(p(x_k \mid s_i), x_{k+1}) \qquad (4.31)

where g(.) is the sequential updating function. Simple application of the product rule for probabilities gives the required update in terms of the distribution p(x_{k+1} | x_k, s_i):

p(x_{k+1} \mid s_i) = p(x_{k+1} \mid x_k, s_i)\, p(x_k \mid s_i). \qquad (4.32)

In some simple cases it is possible to write down this distribution directly from modelling considerations (for example in classification for known fixed parameter linear models). For the general multivariate Gaussian distribution with known mean vector m_x and covariance matrix C_x, the required update can be derived by considering the manner in which C_x and its inverse are modified by the addition of a new element to x (see [173, pp. 157-160]).

Here we consider the linear Gaussian model with unknown parameters (4.23) which leads to an evidence expression with the form (4.26). The simplest update is obtained by consideration of how the term \hat{\theta}_i^{MAP}, the MAP parameter estimate for a particular state or model, is modified by the input of new data point x_{k+1}. The ‘i’ notation indicating a particular state s_i is dropped from now on since the update for each state s_i can be treated as a separate problem. Subscripts now refer to the number of data samples (i.e. k or (k + 1)).

For a given model it is assumed that the basis matrix G_k corresponding to data x_k is updated to G_{k+1} with the input of x_{k+1} by the addition of a new row g_k^T:

G_{k+1} = \begin{bmatrix} G_k \\ g_k^T \end{bmatrix} \qquad (4.33)


Note that even though g_k is associated with time instant k + 1 we write it as g_k for notational simplicity.

The MAP parameter estimate with k samples of input is given by \hat{\theta}_k^{MAP} = \Phi_k^{-1}\Theta_k (4.27). The first term in \Phi_k (4.28) is G_k^T G_k. A recursive update for this term is obtained by direct multiplication using (4.33):

G_{k+1}^T G_{k+1} = G_k^T G_k + g_k g_k^T. \qquad (4.34)

The second term of \Phi_k (4.28) is \sigma_e^2 C^{-1}. This remains unchanged with the input of x_{k+1} since it contains parameters of the prior modelling assumptions which are independent of the input data samples. The recursive update for \Phi_k is thus given by

\Phi_{k+1} = \Phi_k + g_k g_k^T. \qquad (4.35)

Now consider updating \Theta_k (4.29). The first term is G_k^T x_k. Clearly x_{k+1} is given by

x_{k+1} = \begin{bmatrix} x_k \\ x_{k+1} \end{bmatrix}, \qquad (4.36)

and hence by direct multiplication:

G_{k+1}^T x_{k+1} = G_k^T x_k + g_k x_{k+1}. \qquad (4.37)

As for \Phi_k, the second term of \Theta_k (4.29) is unchanged with the input of x_{k+1}, so the recursive update for \Theta is

\Theta_{k+1} = \Theta_k + g_k x_{k+1}. \qquad (4.38)

The updates derived for \Phi_k (4.35) and \Theta_k (4.38) are now in the form required for the extended RLS formulae given in appendix B. Use of these formulae, including the determinant updating formula for |\Phi_k|, gives the following update for the evidence (note that P = \Phi^{-1}):

k_{k+1} = \frac{P_k g_k}{1 + g_k^T P_k g_k}

\alpha_{k+1} = x_{k+1} - \hat{\theta}_k^{MAP\,T} g_k

\hat{\theta}_{k+1}^{MAP} = \hat{\theta}_k^{MAP} + k_{k+1}\, \alpha_{k+1}

P_{k+1} = P_k - k_{k+1}\, g_k^T P_k

p(x_{k+1} \mid s) = p(x_k \mid s)\, \frac{1}{(2\pi\sigma_e^2)^{1/2}}\, \frac{1}{(1 + g_k^T P_k g_k)^{1/2}} \exp\left( -\frac{1}{2\sigma_e^2}\, \alpha_{k+1} \left( x_{k+1} - \hat{\theta}_{k+1}^{MAP\,T} g_k \right) \right) \qquad (4.39)


where of course all of the terms are implicitly dependent upon the particular state s. This update requires at each stage storage of the inverse matrix P_k, the MAP parameter estimate \hat{\theta}_k^{MAP} and the state probability p(x_k | s_i). Computation of the update requires O(P^2) multiplications and additions (recall that P is the number of unknown parameters in the current state or model), although significant savings can be made in special cases where g_k contains zero-valued elements. The MAP parameter estimates are obtained as a by-product of the update scheme, and this will be used in cases where joint estimation of state or model and parameters is required.

The sequential algorithms developed later in the Bayesian de-clicking chapters will be similar to the form summarised here. They not only allow for reasonably efficient real-time operation but will also help in the development of sub-optimal classification algorithms for the cases where the number of states is too large for evaluation of the evidence for all states s_i.
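A minimal sketch of a single step of this update is given below, working with the log-evidence for numerical safety (the function and variable names are ours; P_mat holds \Phi^{-1} for the current state):

import numpy as np

def evidence_update(theta_map, P_mat, log_ev, g, x_new, sigma_e):
    # One step of the recursions (4.39); g is the new basis row g_k.
    denom = 1.0 + g @ P_mat @ g
    k = (P_mat @ g) / denom                  # gain vector k_{k+1}
    alpha = x_new - theta_map @ g            # prediction error alpha_{k+1}
    theta_map = theta_map + k * alpha        # updated MAP parameter estimate
    P_mat = P_mat - np.outer(k, g @ P_mat)   # updated inverse matrix
    log_ev += (-0.5 * np.log(2.0 * np.pi * sigma_e**2)
               - 0.5 * np.log(denom)
               - alpha * (x_new - theta_map @ g) / (2.0 * sigma_e**2))
    return theta_map, P_mat, log_ev

In a multi-state classifier one such update would be run per candidate state for each incoming sample, with the MAP state read off from the largest log-evidence.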

4.2 Signal modelling

The general approach of this book for solving problems in audio is the model based approach, in which an attempt is made to formulate a statistical model for the data generation process. The model chosen is considered as part of the prior information about the problem in a similar way to the prior distribution assumed in Bayesian methods. When the model formulated is an accurate representation of the data we can expect to be rewarded with improved processing performance. We have already seen how to deal with statistical model uncertainty between a set of candidate models (or system ‘states’) within a Bayesian framework earlier in this chapter. The general linear model has been used as an example until now. Such a model applies to a time series when the data can be thought of as being composed of a finite number of basis functions (in additive noise). More typically, however, we may wish to include random inputs into the model. One parametric model which achieves this is the ARMA (autoregressive moving-average) model, in which the observed data are generated through the following general linear difference equation:

x_n = \sum_{i=1}^{P} a_i x_{n-i} + \sum_{j=0}^{Q} b_j e_{n-j} \qquad (4.40)

ARMA model


The transfer function for this model is

H(z) = \frac{B(z)}{A(z)}

where B(z) = \sum_{j=0}^{Q} b_j z^{-j} and A(z) = 1 - \sum_{i=1}^{P} a_i z^{-i}.

The model can be seen to consist of applying an IIR filter (see 2.5.1) to the ‘excitation’ or ‘innovation’ sequence e_n, which is i.i.d. noise. Generalisations to the model could include the addition of further deterministic input signals (the ARMAX model [114, 21]) or the inclusion of linear basis functions in the same way as for the general linear model:

y = x + Gθ

An important special case of the ARMA model is the autoregressive (AR) or ‘all-pole’ (since the transfer function has poles only) model in which B(z) = 1. This model is used considerably throughout the text and is considered in the next section.

4.3 Autoregressive (AR) modelling

A time series model which is fundamental to much of the work in this book is the autoregressive (AR) model, in which the data is modelled as the output of an all-pole filter excited by white noise. This model formulation is a special case of the innovations representation for a stationary random signal in which the signal X_n is modelled as the output of a linear time-invariant filter driven by white noise. In the AR case the filtering operation is restricted to a weighted sum of past output values and a white noise innovations input e_n:

x_n = \sum_{i=1}^{P} a_i x_{n-i} + e_n. \qquad (4.41)

The coefficients a_i; i = 1...P are the filter coefficients of the all-pole filter, henceforth referred to as the AR parameters, and P, the number of coefficients, is the order of the AR process. The AR model formulation is closely related to the linear prediction framework used in many fields of signal processing (see e.g. [174, 119]). AR modelling has some very useful properties as will be seen later and these will often lead to simple analytical results where a more general model such as the ARMA model (see previous section) does not. In addition, the AR model has a reasonable basis as a source-filter model for the physical sound production process in many speech and audio signals [156, 187].


4.3.1 Statistical modelling and estimation of AR models

If the probability distribution function p_e(e_n) for the innovation process is known, it is possible to incorporate the AR process into a statistical framework for classification and estimation problems. A straightforward change of variable x_n to e_n gives us the distribution for x_n conditional on the previous P data values as

p(x_n \mid x_{n-1}, x_{n-2}, \ldots, x_{n-P}) = p_e\!\left( x_n - \sum_{i=1}^{P} a_i x_{n-i} \right) \qquad (4.42)

Since the excitation sequence is i.i.d. we can write the joint probability for a contiguous block of N - P data samples x_{P+1}...x_N conditional upon the first P samples x_1...x_P as

p(x_{P+1}, x_{P+2}, \ldots, x_N \mid x_1, x_2, \ldots, x_P) = \prod_{n=P+1}^{N} p_e\!\left( x_n - \sum_{i=1}^{P} a_i x_{n-i} \right) \qquad (4.43)

This is now expressed in matrix-vector notation. The data samples x_1, ..., x_N and parameters a_1, a_2, ..., a_{P-1}, a_P are written as column vectors of length N and P, respectively:

x = [x_1\ x_2\ \ldots\ x_N]^T, \quad a = [a_1\ a_2\ \ldots\ a_{P-1}\ a_P]^T \qquad (4.44)

x is partitioned into x_0, which contains the first P samples x_1, ..., x_P, and x_1 which contains the remaining (N - P) samples x_{P+1}...x_N:

x_0 = [x_1\ x_2\ \ldots\ x_P]^T, \quad x_1 = [x_{P+1}\ \ldots\ x_N]^T \qquad (4.45)

The AR modelling equation of (4.41) is now rewritten for the block of N data samples as

x_1 = G a + e \qquad (4.46)

where e is the vector of (N - P) excitation values and the ((N - P) × P) matrix G is given by

G = \begin{bmatrix} x_P & x_{P-1} & \cdots & x_2 & x_1 \\ x_{P+1} & x_P & \cdots & x_3 & x_2 \\ \vdots & & \ddots & & \vdots \\ x_{N-2} & x_{N-3} & \cdots & x_{N-P} & x_{N-P-1} \\ x_{N-1} & x_{N-2} & \cdots & x_{N-P+1} & x_{N-P} \end{bmatrix} \qquad (4.47)

The conditional probability expression (4.43) now becomes

p(x_1 \mid x_0, a) = p_e(x_1 - Ga) \qquad (4.48)


and in the case of a zero-mean Gaussian excitation we obtain

p(x_1 \mid x_0, a) = \frac{1}{(2\pi\sigma_e^2)^{\frac{N-P}{2}}} \exp\left( -\frac{1}{2\sigma_e^2} (x_1 - Ga)^T (x_1 - Ga) \right) \qquad (4.49)

Note that this introduces a variance parameter \sigma_e^2 which is in general unknown. The PDF given is thus implicitly conditional on \sigma_e^2 as well as a and x_0.

The form of the modelling equation of (4.46) looks identical to that of the general linear parametric model used to illustrate previous sections (4.1). We have to bear in mind, however, that G here depends upon the data values themselves, which is reflected in the conditioning of the distribution of x_1 upon x_0. It can be argued that this conditioning becomes an insignificant ‘end-effect’ for N >> P [155] and we can then make an approximation to obtain the likelihood for x:

p(x \mid a) \approx p(x_1 \mid a, x_0), \quad N >> P \qquad (4.50)

How much greater N must be than P will in fact depend upon the pole positions of the AR process. Using this result an approximate ML estimator for a can be obtained by maximisation w.r.t. a, from which we obtain the well-known covariance estimate for the AR parameters,

\hat{a}^{Cov} = (G^T G)^{-1} G^T x_1 \qquad (4.51)

which is equivalent to a minimisation of the sum-squared prediction error over the block, E = \sum_{i=P+1}^{N} e_i^2, and has the same form as the ML parameter estimate in the general linear model.
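As a concrete illustration, a minimal sketch of the covariance estimate (4.51) follows, tested on synthetic data from a hypothetical stable AR(2) model (all names and values are our own choices):

import numpy as np

def ar_covariance_estimate(x, P):
    # Covariance estimate (4.51) of the AR parameters from a data block x.
    N = len(x)
    # Build G as in (4.47): each row holds the P samples preceding x1[n].
    G = np.column_stack([x[P - i: N - i] for i in range(1, P + 1)])
    x1 = x[P:]
    a = np.linalg.solve(G.T @ G, G.T @ x1)
    sigma2_e = np.mean((x1 - G @ a) ** 2)   # innovation variance estimate
    return a, sigma2_e

# Quick check on synthetic AR(2) data (poles inside the unit circle):
rng = np.random.default_rng(0)
a_true = np.array([1.5, -0.7])
x = np.zeros(2000)
for n in range(2, 2000):
    x[n] = a_true[0] * x[n - 1] + a_true[1] * x[n - 2] + rng.standard_normal()
print(ar_covariance_estimate(x, 2))   # close to a_true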

Consider now an alternative form for the vector model equation (4.46) which will be used in subsequent work for Bayesian detection of clicks and interpolation of AR data:

e = Ax (4.52)

where A is the ((N - P) × N) matrix defined as

A = \begin{bmatrix} -a_P & \cdots & -a_1 & 1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & -a_P & \cdots & -a_1 & 1 & 0 & 0 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \ddots & \ddots & \ddots & & \vdots \\ \cdots & 0 & 0 & -a_P & \cdots & -a_1 & 1 & 0 & 0 \\ 0 & \cdots & 0 & 0 & -a_P & \cdots & -a_1 & 1 & 0 \\ 0 & 0 & \cdots & 0 & 0 & -a_P & \cdots & -a_1 & 1 \end{bmatrix} \qquad (4.53)

The conditional likelihood for white Gaussian excitation is then rewritten as:

p(x_1 \mid x_0, a) = \frac{1}{(2\pi\sigma_e^2)^{\frac{N-P}{2}}} \exp\left( -\frac{1}{2\sigma_e^2}\, x^T A^T A\, x \right) \qquad (4.54)


In order to obtain the exact (i.e. not conditional upon x_0) likelihood we need the distribution p(x_0 | a), since

p(x \mid a) = p(x_1 \mid x_0, a)\, p(x_0 \mid a)

In appendix C this additional term is derived, and the exact likelihood for all elements of x is shown to require only a simple modification to the conditional likelihood, giving:

p(x \mid a) = \frac{1}{(2\pi\sigma_e^2)^{\frac{N}{2}} |M_{x_0}|^{1/2}} \exp\left( -\frac{1}{2\sigma_e^2}\, x^T M_x^{-1}\, x \right) \qquad (4.55)

where

M_x^{-1} = A^T A + \begin{bmatrix} M_{x_0}^{-1} & 0 \\ 0 & 0 \end{bmatrix} \qquad (4.56)

and M_{x_0} is the autocovariance matrix for P samples of data drawn from AR process a with unit variance excitation. Note that this result relies on the assumption of a stable AR process. As seen in the appendix, M_{x_0}^{-1} is straightforwardly obtained in terms of the AR coefficients for any given stable AR model a. In problems where the AR parameters are known beforehand but certain data elements are unknown or missing, as in click removal or interpolation problems, it is thus simple to incorporate the true likelihood function in calculations. In practice it will often not be necessary to use the exact likelihood since it will be reasonable to fix at least P ‘known’ data samples at the start of any data block. In this case the conditional likelihood (4.54) is the required quantity. Where P samples cannot be fixed it will be necessary to use the exact likelihood expression (4.55) as the conditional likelihood will perform badly in estimating missing data points within x_0.

While the exact likelihood is quite easy to incorporate in missing data or interpolation problems with known a, it is much more difficult to use for AR parameter estimation since the functions to maximise are non-linear in the parameters a. Hence the linearising approximation of equation (4.50) will usually be adopted for the likelihood when the parameters are unknown.

In this section we have shown how to calculate exact and approximate likelihoods for AR data, in two different forms: one as a quadratic form in the data x and another as a quadratic (or approximately quadratic) form in the parameters a. This likelihood will appear on many subsequent occasions throughout the book.


4.4 State-space models, sequential estimation and the Kalman filter

Most of the models encountered in this book, including the AR and ARMA models, can be expressed in state space form:

y_n = Z\alpha_n + v_n \qquad \text{(Observation equation)} \qquad (4.57)

\alpha_{n+1} = T\alpha_n + H e_n \qquad \text{(State update equation)} \qquad (4.58)

In the top line, the observation equation, the observed data y_n is expressed in terms of an unobserved state α_n and a noise term v_n. v_n is uncorrelated (i.e. E[v_n v_m^T] = 0 for n ≠ m) and zero mean, with covariance C_v. In the second line, the state update equation, the state α_n is updated to its new value α_{n+1} at time n + 1, with the addition of a second noise term e_n. e_n is uncorrelated (i.e. E[e_n e_m^T] = 0 for n ≠ m) and zero mean, with covariance C_e, and is also uncorrelated with v_n. Note that in general the state α_n, observation y_n and noise terms e_n / v_n can be column vectors and the constants Z, T and H are then matrices of the implied dimensionality. Also note that all of these constants can be made time index dependent without altering the form of the results given below.

Take, for example, an AR model x_n observed in noise v_n, so that the equations in standard form are:

y_n = x_n + v_n

x_n = \sum_{i=1}^{P} a_i x_{n-i} + e_n

One way to express this in state-space form is as follows:

\alpha_n = [x_n\ x_{n-1}\ x_{n-2}\ \ldots\ x_{n-P+1}]^T

T = \begin{bmatrix} a_1 & a_2 & \cdots & a_{P-1} & a_P \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}

H = [1\ 0\ \ldots\ 0]^T

Z = H^T

C_e = \sigma_e^2

C_v = \sigma_v^2

The state-space form is useful since some elegant results exist for the general form which can be applied in many different special cases. In particular, we will use it for sequential estimation of the state α_n. In probabilistic terms this will involve updating the posterior probability for α_n with the input of a new observation y_{n+1}:

p(\alpha_{n+1} \mid y^{n+1}) = g(p(\alpha_n \mid y^n), y_{n+1})

where y^n = [y_1, \ldots, y_n]^T and g(.) denotes the sequential updating function. Suppose that the noise sources e_n and v_n are Gaussian. Assume also that an initial state probability or prior p(α_0) exists and is Gaussian N(m_0, C_0). Then the posterior distributions are all Gaussian themselves and the posterior distribution for α_n is fully represented by its sufficient statistics: its mean a_n = E[α_n | y_0, \ldots, y_n] and covariance matrix P_n = E[(α_n - a_n)(α_n - a_n)^T | y_0, \ldots, y_n]. The Kalman filter [100, 5, 91] performs the update efficiently, as follows:

1. Initialise: a0 = m0, P0 = C0

2. Repeat for n = 1 to N :

(a) Prediction:

a_{n|n-1} = T a_{n-1}

P_{n|n-1} = T P_{n-1} T^T + H C_e H^T

(b) Correction:

a_n = a_{n|n-1} + K_n (y_n - Z a_{n|n-1})

P_n = (I - K_n Z) P_{n|n-1}

where

K_n = P_{n|n-1} Z^T (Z P_{n|n-1} Z^T + C_v)^{-1} \qquad \text{(Kalman Gain)}

Here a_{n|n-1} is the predictive mean E[\alpha_n \mid y^{n-1}], P_{n|n-1} the predictive covariance E[(\alpha_n - a_{n|n-1})(\alpha_n - a_{n|n-1})^T \mid y^{n-1}] and I denotes the (appropriately sized) identity matrix.

We have thus far given a purely probabilistic interpretation of the Kalman filter for normally distributed noise sources and initial state. A more general interpretation is available [100, 5, 91]: the Kalman filter gives the best possible linear estimator for the state in a mean-squared error sense (MSE), whatever the probability distributions.

4.4.1 The prediction error decomposition

One remarkable property of the Kalman filter is the prediction error decomposition, which allows exact sequential evaluation of the likelihood function for the observations. If we suppose that the model depends upon some hyperparameters θ, then the Kalman filter updates sequentially, for a particular value of θ, the density p(α_n | y^n, θ). We define the likelihood for θ in this context to be:

p(y^n \mid \theta) = \int p(\alpha_n, y^n \mid \theta)\, d\alpha_n

from which the ML or MAP estimator for θ can be obtained by optimisation. The prediction error decomposition [91, pp. 125-126] calculates this term from the outputs of the Kalman filter:

\log(p(y^n \mid \theta)) = -\frac{Mn}{2}\log(2\pi) - \frac{1}{2}\sum_{t=1}^{n}\log|F_t| - \frac{1}{2}\sum_{t=1}^{n} w_t^T F_t^{-1} w_t \qquad (4.59)

where

F_t = Z P_{t|t-1} Z^T + C_v

w_t = y_t - Z a_{t|t-1}

and where M is the dimension of the observation vector y_n.
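A minimal sketch combining the filtering recursions with the prediction error decomposition (4.59) for the scalar-observation case (M = 1) follows; the function name, variable names and the commented example model are our own choices, not from the text:

import numpy as np

def kalman_loglik(y, Z, T, H, Ce, Cv, m0, C0):
    # Kalman filter for the model (4.57)-(4.58) with scalar observations,
    # accumulating log p(y|theta) via the prediction error decomposition.
    a, P = m0.astype(float), C0.astype(float)
    I = np.eye(len(m0))
    loglik = 0.0
    for yt in y:
        # Prediction step
        a = T @ a
        P = T @ P @ T.T + H @ Ce @ H.T
        # Innovation w_t and its variance F_t
        w = yt - (Z @ a).item()
        F = (Z @ P @ Z.T).item() + Cv
        loglik -= 0.5 * (np.log(2.0 * np.pi) + np.log(F) + w * w / F)
        # Correction step with Kalman gain K_n
        K = (P @ Z.T) / F                  # column vector
        a = a + (K * w).ravel()
        P = (I - K @ Z) @ P
    return a, P, loglik

# Example: AR(2) signal in white observation noise (hypothetical values):
#   T = np.array([[1.5, -0.7], [1.0, 0.0]]); H = np.array([[1.0], [0.0]])
#   Z = H.T; Ce = np.array([[1.0]]); Cv = 0.1
#   a, P, ll = kalman_loglik(y, Z, T, H, Ce, Cv, np.zeros(2), np.eye(2))

Maximising the returned log-likelihood over candidate hyperparameter values then gives the ML estimator described above.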

4.4.2 Relationships with other sequential schemes

The Kalman filter is closely related to other well known schemes such as recursive least squares (RLS) [94], which can be obtained as a special case of the Kalman filter when T = I, the identity matrix, and C_e = 0 (i.e. a system with non-dynamic states - an example of this would be the general linear model used earlier in the chapter, using a time dependent observation vector Z_n = g_n^T).

4.5 Expectation-maximisation (EM) for MAP estimation

The Expectation-maximisation (EM) algorithm [43] is an iterative procedure for finding modes of a posterior distribution or likelihood function, particularly in the context of ‘missing data’. EM has been used quite extensively in the signal processing literature for maximum likelihood parameter estimation, see e.g. [59, 195, 136, 168]. Within the audio field Veldhuis [192, appendix E.2] derives the EM algorithm for AR parameter estimation in the presence of missing data, which is closely related to the audio interpolation problems of subsequent chapters. The notation used here is essentially similar to that of Tanner [172, pp. 38-57].


The problem is formulated in terms of observed data y, parameters θ and unobserved (‘latent’ or ‘missing’) data x. EM will be useful in certain cases where it is straightforward to manipulate the conditional posterior distributions p(θ | x, y) and p(x | θ, y), but perhaps not straightforward to deal with the marginal distributions p(θ | y) and p(x | y). This will be the case in many of the model-based interpolation and estimation schemes of subsequent chapters. We will typically wish to estimate missing or noisy data x in the audio restoration tasks of this book, so we consider a slightly unconventional formulation of EM which treats θ as ‘nuisance’ parameters and forms the MAP estimate for x:

x^{MAP} = \arg\max_x\; p(x \mid y) = \arg\max_x \int p(x, \theta \mid y)\, d\theta

The basic EM algorithm can be summarised as:

1. Expectation step:

Given the current estimate x^i, calculate:

Q(x, x^i) = \int \log(p(x \mid y, \theta))\, p(\theta \mid x^i, y)\, d\theta

= E[\log(p(x \mid y, \theta)) \mid x^i, y] \qquad (4.60)

2. Maximisation step:

x^{i+1} = \arg\max_x\; Q(x, x^i) \qquad (4.61)

These two steps are iterated until convergence is achieved. The algorithm is guaranteed to converge to a stationary point of p(x | y), although we must beware of convergence to local maxima when the posterior distribution is multimodal. The starting point x^0 determines which posterior mode is reached and can be critical in difficult applications.
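The mechanics can be seen in a deliberately trivial toy problem (entirely our own construction, not from the text): i.i.d. N(θ, 1) data with a flat prior on the nuisance mean θ, where both EM steps are available in closed form:

import numpy as np

# Toy illustration of the iteration (4.60)-(4.61): y is observed, x is
# missing, theta is the unknown mean.  Here Q(x, x^i) is maximised by
# setting every missing sample to E[theta | x^i, y], the posterior mean
# of theta given the currently filled-in data.
y = np.array([1.2, 0.7, 1.5, 0.9])    # observed samples (hypothetical)
x = np.zeros(3)                       # starting point x^0 for missing data
for i in range(50):
    theta_mean = np.concatenate([y, x]).mean()   # posterior mean of theta
    x = np.full_like(x, theta_mean)              # M-step maximiser
print(x)   # converges towards mean(y), a stationary point of p(x|y)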

4.6 Markov chain Monte Carlo (MCMC)

Rapid increases in available computational power over the last few years have led to a revival of interest in Markov chain Monte Carlo (MCMC) simulation methods [131, 92]. The object of these methods is to draw samples from some target distribution π(ω) which may be too complex for direct estimation procedures. The MCMC approach sets up an irreducible, aperiodic Markov chain whose stationary distribution is the target distribution of interest, π(ω). The Markov chain is then simulated from some arbitrary starting point and convergence in distribution to π(ω) is then guaranteed under mild conditions as the number of state transitions (iterations) approaches infinity [175]. Once convergence is achieved, subsequent samples from the chain form a (dependent) set of samples from the target distribution, from which Monte Carlo estimates of desired quantities related to the distribution may be calculated.

For the statistical reconstruction tasks considered in this book, the target distribution will be the joint posterior distribution for all unknowns, p(x, θ | y), from which samples of the unknowns x and θ will be drawn conditional upon the observed data y. Since the joint distribution can be factorised as p(x, θ | y) = p(θ | x, y) p(x | y) it is clear that the samples in x which are extracted from the joint distribution are equivalent to samples from the marginal posterior p(x | y). The sampling method thus implicitly performs the (generally) analytically intractable marginalisation integral w.r.t. θ.

The Gibbs Sampler [64, 63] is perhaps the most popular form of MCMC currently in use for the exploration of posterior distributions. This method, which can be derived as a special case of the more general Metropolis-Hastings method [92], requires the full specification of conditional posterior distributions for each unknown parameter or variable. Suppose that the reconstructed data and unknown parameters are split into (possibly multivariate) subsets x = {x_0, x_1, \ldots, x_{N-1}} and θ = {θ_1, θ_2, \ldots, θ_q}. Arbitrary starting values x^0 and θ^0 are assigned to the unknowns. A single iteration of the Gibbs Sampler then comprises sampling each variable from its conditional posterior with all remaining variables fixed to their current sampled value. The (i+1)th iteration of the sampler may be summarised as:

\theta_1^{i+1} \sim p(\theta_1 \mid \theta_2^i, \ldots, \theta_q^i, x_0^i, x_1^i, \ldots, x_{N-1}^i, y)

\theta_2^{i+1} \sim p(\theta_2 \mid \theta_1^{i+1}, \theta_3^i, \ldots, \theta_q^i, x_0^i, x_1^i, \ldots, x_{N-1}^i, y)

\vdots

\theta_q^{i+1} \sim p(\theta_q \mid \theta_1^{i+1}, \theta_2^{i+1}, \ldots, \theta_{q-1}^{i+1}, x_0^i, x_1^i, \ldots, x_{N-1}^i, y)

x_0^{i+1} \sim p(x_0 \mid \theta_1^{i+1}, \theta_2^{i+1}, \ldots, \theta_q^{i+1}, x_1^i, \ldots, x_{N-1}^i, y)

x_1^{i+1} \sim p(x_1 \mid \theta_1^{i+1}, \theta_2^{i+1}, \ldots, \theta_q^{i+1}, x_0^{i+1}, x_2^i, \ldots, x_{N-1}^i, y)

\vdots

x_{N-1}^{i+1} \sim p(x_{N-1} \mid \theta_1^{i+1}, \theta_2^{i+1}, \ldots, \theta_q^{i+1}, x_0^{i+1}, x_1^{i+1}, \ldots, x_{N-2}^{i+1}, y)

where the notation ‘∼’ denotes that the variable to the left is drawn as a random independent sample from the distribution to the right.

The utility of the Gibbs Sampler arises as a result of the fact that the conditional distributions, under appropriate choice of parameter and data subsets (θ_i and x_j), will be more straightforward to sample than the full posterior. Multivariate parameter and data subsets can be expected to lead to faster convergence in terms of number of iterations (see e.g. [113, 163]), but there may be a trade-off in the extra computational complexity involved per iteration. Convergence properties are a difficult and important issue and concrete results are as yet few and specialised. However, geometric convergence rates can be found in the important case where the posterior distribution is Gaussian or approximately Gaussian (see [162] and references therein). Numerous (but mostly ad hoc) convergence diagnostics have been devised for more general scenarios and a review may be found in [40].

Once the sampler has converged to the desired posterior distribution, inference can easily be made from the resulting samples. One useful means of analysis is to form histograms of any parameters of interest. These converge in the limit to the true marginal posterior distribution for those parameters and can be used to estimate MAP values and Bayesian confidence intervals, for example. Alternatively a Monte Carlo estimate can be made for the expected value of any desired posterior functional f(.) as a finite summation:

E[f(x) \mid y] \approx \frac{\sum_{i=N_0+1}^{N_{max}} f(x^i)}{N_{max} - N_0} \qquad (4.62)

where N_0 is the convergence (‘burn in’) time and N_{max} is the total number of iterations. The MMSE estimate, for example, is simply the posterior mean, estimated by setting f(x) = x in (4.62).
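As a small worked illustration (our own toy example, not from the text), consider Gibbs sampling for the mean and variance of i.i.d. Gaussian data, using the non-informative priors of section 4.1.3.3; both conditionals are then standard distributions:

import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(2.0, 1.5, size=200)    # synthetic data
N = len(y)

# Gibbs sampler for i.i.d. N(mu, sigma2) data with a flat prior on mu and
# Jeffreys' prior p(sigma2) = 1/sigma2.  The conditionals are:
#   mu | sigma2, y  ~  N(mean(y), sigma2/N)
#   sigma2 | mu, y  ~  inverted gamma, sampled here as 1/Gamma
n_iter, N0 = 2000, 500
mu, sigma2 = 0.0, 1.0                 # arbitrary starting values
samples = np.zeros((n_iter, 2))
for i in range(n_iter):
    mu = rng.normal(y.mean(), np.sqrt(sigma2 / N))
    sigma2 = 1.0 / rng.gamma(N / 2.0, 2.0 / np.sum((y - mu) ** 2))
    samples[i] = mu, sigma2

# Monte Carlo estimates (4.62) of the posterior means after burn-in:
print(samples[N0:].mean(axis=0))      # close to (2.0, 1.5**2)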

MCMC methods are highly computer-intensive and will only be applicable when off-line processing is acceptable and the problem is sufficiently complex to warrant their sophistication. However, they are currently unparalleled in ability to solve the most challenging of modelling problems.

4.7 Conclusions

We have used this chapter to review and develop the principles of parameter estimation and classification with an emphasis on the Bayesian framework. We have considered a linear Gaussian parametric model by way of illustration since the results obtained are closely related to those of some later sections of the book. The autoregressive model which is also fundamental to later chapters has been introduced and discussed within the Bayesian scheme and sequential estimation has been introduced. The Kalman filter was seen to be an appropriate method for implementation of such sequential schemes. Finally some more advanced methods for exploration of a posterior distribution were briefly described: the EM algorithm and MCMC methods, which find application in the last chapter of the book.


Part II

Basic Restoration Procedures


5

Removal of Clicks

The term ‘clicks’ is used here to refer to a generic localised type of degradation which is common to many audio media. We will classify all finite duration defects which occur at random positions in the waveform as clicks. Clicks are perceived in a number of ways by the listener, ranging from tiny ‘tick’ noises which can occur in any recording medium, including modern digital sources, through to the characteristic ‘scratch’ and ‘crackle’ noise associated with most analogue disc recording methods. For example, a poor quality 78rpm record might typically have around 2,000 clicks per second of recorded material, with durations ranging from less than 20µs up to 4ms in extreme cases. See figure 5.11 for a typical example of a recorded music waveform degraded by localised clicks. In most examples at least 90% of samples remain undegraded, so it is reasonable to hope that a convincing restoration can be achieved.

There are many mechanisms by which clicks can occur. Typical examples are specks of dirt and dust adhering to the grooves of a gramophone disc or granularity in the material used for pressing such a disc (see figures 1.1-1.3). Further click-type degradation may be caused through damage to the disc in the form of small scratches on the surface. Similar artefacts are encountered in other analogue media, including optical film sound tracks and early wax cylinder recordings, although magnetic tape recordings are generally free of clicks. Ticks can occur in digital recordings as a result of poorly concealed digital errors and timing problems.

Peak-related distortion, occurring as a result either of overload during recording or wear and tear during playback, can give rise to a similar perceived effect to clicks, but is really a different area which should receive separate attention, even though click removal systems can often go some way towards alleviating the worst effects.

5.1 Modelling of clicks

Localised defects may be modelled in many different ways. For example, a defect may be additive to the underlying audio signal, or it may replace the signal altogether for some short period. An additive model has been found to be acceptable for most surface defects in recording media, including small scratches, dust and dirt. A replacement model may be appropriate for very large scratches and breakages which completely obliterate any underlying signal information, although such defects usually excite long-term resonances in mechanical playback systems and must be treated differently (see chapter 7). Here we will consider primarily the additive model, although many of the results are at least robust to replacement noise.

An additive model for localised degradation can be expressed as:

y_t = x_t + i_t n_t    (5.1)

where x_t is the underlying audio signal, n_t is a corrupting noise process and i_t is a 0/1 ‘switching’ process which takes the value 1 only when the localised degradation is present. Clearly the value of n_t is irrelevant to the output when the switch is in position 0 and should thus be regarded only as a conceptual entity which makes the representation of (5.1) straightforward. The statistics of the switching process i_t thus govern which samples are degraded, while the statistics of n_t determine the amplitude characteristics of the corrupting process.

The model is quite general and can account for a wide variety of noise characteristics encountered in audio recordings. It assumes for the moment that there is no continuous background noise degradation at locations where i_t = 0. It assumes also that the degradation process does not interfere with the timing content of the original signal, as observed in y_t. This is reasonable for all but very severe degradations, which might temporarily upset the speed of playback, or actual breakages in the medium which have been mechanically repaired (such as a broken disc recording which has been glued back together).
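For illustration, a minimal simulation of (5.1) might look as follows; the burst rate, lengths and amplitudes are arbitrary assumptions made for the demonstration, and Gaussian white noise stands in for both the audio x_t and the click amplitude process n_t:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 44100
x = rng.standard_normal(N)               # stand-in for the clean audio x_t
i = np.zeros(N, dtype=bool)              # 0/1 switching process i_t
for start in rng.choice(N - 200, size=40, replace=False):
    i[start:start + rng.integers(1, 50)] = True   # clicks arrive in short bursts
n = 30.0 * rng.standard_normal(N)        # corrupting amplitude process n_t
y = x + i * n                            # eq. (5.1): y_t = x_t + i_t n_t
```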

Any procedure which is designed to remove localised defects in audio signals must take account of the typical characteristics of these artefacts. Some important features which are common to many click-degraded audio media include:

• Degradation tends to occur in contiguous ‘bursts’ of corrupted samples, starting at random positions in the waveform and of random duration (typically between 1 and 200 samples at 44.1 kHz sampling rates). Thus there is strong dependence between successive samples of the switching process i_t, and the noise cannot be assumed to follow a classical impulsive noise pattern in which single impulses occur independently of each other (the Bernoulli model). It is considerably more difficult to treat clusters of impulsive disturbance than single impulses, since the effects of adjacent impulses can cancel each other in the detection space (‘missed detections’) or add constructively to give the impression of more impulses (‘false alarms’).

• The amplitude of the degradation can vary greatly within the same recorded extract, owing to a range of size of physical defects. For example, in many recordings the largest click amplitudes will be well above the largest signal amplitudes, while the smallest audible defects can be more than 40dB below the local signal level (depending on psychoacoustical masking by the signal and the amount of background noise). This leads to a number of difficulties. In particular, large amplitude defects will tend to bias any parameter estimation and threshold determination procedures, leaving smaller defects undetected. As we shall see in section 5.3.1, threshold selection for some detection schemes becomes a difficult problem in this case.

Many approaches are possible for the restoration of such defects. It is clear, however, that the ideal system will process only those samples which are degraded, leaving the others untouched in the interests of fidelity to the original source. Two tasks can thus be identified for a successful click restoration system. The first is a detection procedure in which we estimate the values of the noise switching process i_t, that is, decide which samples are corrupted. The second is an estimation procedure in which we attempt to reconstruct the underlying audio data when corruption is present. A method which assumes that no useful information about the underlying signal is contained in the degraded samples will involve a pure interpolation of the audio data based on the values of the surrounding undegraded samples, while more sophisticated techniques will attempt in addition to extract information from the degraded sample values using some degree of explicit noise modelling.

5.2 Interpolation of missing samples

At the heart of most click removal methods is an interpolation scheme which replaces missing or corrupted samples with estimates of their true value. It is usually appropriate to assume that clicks have in no way interfered with the timing of the material, so the task is then to fill in the ‘gap’ with appropriate material of identical duration to the click. As discussed above, this amounts to an interpolation problem which makes use of the good data values surrounding the corruption and possibly takes account of signal information which is buried in the corrupted sections of data. An effective technique will have the ability to interpolate gap lengths from one sample up to at least 100 samples at a sampling rate of 44.1kHz.

The interpolation problem may be formulated as follows. Consider N samples of audio data, forming a vector x. The corresponding click-degraded data vector is y, and the vector of detection values i_t is i (assumed known for the time being). The audio data x may be partitioned into two sub-vectors, one containing elements whose value is known (i.e. i_t = 0), denoted by x_{-(i)}, and the second containing unknown elements which are corrupted by noise (i_t = 1), denoted by x_{(i)}. Consider, for example, the case where a section of data of length l samples starting at sample number m is known to be missing or irrevocably corrupted by noise. The data is first partitioned into three sections: the unknown section x_{(i)} = [x_m, x_{m+1}, ..., x_{m+l-1}]^T, the m − 1 samples to the left of the gap x_{-(i)a} = [x_1, x_2, ..., x_{m-1}]^T and the remaining known samples to the right of the gap x_{-(i)b} = [x_{m+l}, ..., x_N]^T:

x = [ x_{-(i)a}^T   x_{(i)}^T   x_{-(i)b}^T ]^T    (5.2)

We then form a single vector of known samples x_{-(i)} = [ x_{-(i)a}^T   x_{-(i)b}^T ]^T. For more complicated patterns of missing data a similar but more elaborate procedure is applied to obtain vectors of known and unknown samples. Vectors y and i are partitioned in a similar fashion. The replacement or interpolation problem requires the estimation of the unknown data x_{(i)}, given the observed (corrupted) data y. Interpolation will be a statistical estimation procedure for audio signals, which are stochastic in nature, and estimation methods might be chosen to satisfy criteria such as minimum mean-square error (MMSE), maximum likelihood (ML), maximum a posteriori (MAP) or some perceptually based criterion.

Numerous methods have been developed for the interpolation of corrupted or missing samples in speech and audio signals. The ‘classical’ approach is perhaps the median filter [184, 151] which can replace corrupted samples with a median value while retaining detail in the signal waveform. A suitable system is described in [101], while a hybrid autoregressive prediction/median filtering method is presented in [143]. Median filters, however, are too crude to deal with gap lengths greater than a few samples. Other techniques ‘splice’ uncorrupted data from nearby into the gap [115, 152] in such a manner that there is no signal discontinuity at the start or end of the gap. These methods rely on the periodic nature of many speech and music signals and also require a reliable estimate of pitch period.

The most effective and flexible methods to date have been model-based, allowing for the incorporation of reasonable prior information about signal characteristics. A good coverage is given by Veldhuis [192], in which a number of interpolators suited to speech and audio signals are presented. These are based on minimum variance estimation under various modelling assumptions, including sinusoidal, autoregressive and periodic.

We present firstly a general framework for interpolation of signals which can be considered as Gaussian. There then follows a detailed description of some of the more successful techniques which can be considered as special cases of this general framework.

5.2.1 Interpolation for Gaussian signals with known covariance structure

We derive firstly a general result for the interpolation of zero mean Gaussian signals with known covariance matrix. Many of the interpolators presented in subsequent sections can be regarded as special cases of this interpolator with particular constrained forms for the covariance matrix. The case of non-zero mean signals can be obtained as a straightforward extension of the basic result, but is not presented here as it is not used by the subsequent schemes.

Suppose that a block of N data samples x is partitioned according to known and unknown samples x_{-(i)} and x_{(i)}. As indicated above, the unknown samples can be specified at arbitrary positions in the block and not necessarily as one contiguous missing chunk. We can express x in terms of its known and unknown components as:

x = U x_{(i)} + K x_{-(i)}    (5.3)

Here U and K are ‘rearrangement’ matrices which reassemble x from the partitions x_{(i)} and x_{-(i)}. U and K form a columnwise partition of the (N × N) identity matrix: U contains all the columns of I for which i_t = 1 while K contains the remaining columns, for which i_t = 0.

Then the PDF for the unknown samples conditional upon the known samples is given, using the conditional probability results of chapter 3, by:

p(x_{(i)} | x_{-(i)}) = p(x) / p(x_{-(i)})    (5.4)
                      = p_x(U x_{(i)} + K x_{-(i)}) / p(x_{-(i)})    (5.5)

Given the posterior distribution for the unknown samples we can now design an estimator based upon some appropriate criterion. The MAP interpolation, for example, is then the vector x_{(i)} which maximises p_x(U x_{(i)} + K x_{-(i)}) for a given x_{-(i)}, since the denominator term in (5.5) is a constant. This posterior probability expression is quite general and applies to any form of PDF for the data vector x. For the case of Gaussian random vectors it is well known that the MAP estimator corresponds exactly to the MMSE estimator, hence we derive the MAP estimator as a suitable general purpose interpolation scheme for Gaussian audio signals.


The data block x is now assumed to be a random zero mean Gaussian vector with autocorrelation matrix R_x = E[x x^T] which is strictly positive definite. Then p(x) is given by the general zero-mean multivariate Gaussian (A.2) as:

p(x) = (1 / ((2π)^{N/2} |R_x|^{1/2})) exp( −(1/2) x^T R_x^{-1} x )    (5.6)

Rewriting x as in (5.3) gives, for two times minus the exponent:

x^T R_x^{-1} x = (U x_{(i)} + K x_{-(i)})^T R_x^{-1} (U x_{(i)} + K x_{-(i)})    (5.7)

Since the exponential function is monotonically increasing in its argument, the vector x_{(i)} which minimises this term is the MAP interpolation. We can use the results of vector-matrix differentiation to solve for the minimum, which is given by:

x_{(i)}^{MAP} = −M_{(i)(i)}^{-1} M_{(i)-(i)} x_{-(i)}    (5.8)

where

M_{(i)(i)} = U^T R_x^{-1} U   and   M_{(i)-(i)} = U^T R_x^{-1} K    (5.9)

are submatrices of R_x^{-1} defined by the pre- and post-multipliers U and K. These select particular column and row subsets of R_x^{-1} according to which samples of x are known or missing. It can readily be shown that this interpolator is also the minimum mean-square error linear interpolator for any signal vector, Gaussian or non-Gaussian, with known finite autocorrelation matrix.

Thus for a Gaussian signal an interpolation can be made based simply on autocorrelation values up to lag N, or their estimates. Wide-sense stationarity of the process would imply a Toeplitz structure on the matrix. Non-stationarity could in principle be incorporated by allowing R_x to be non-Toeplitz, hence allowing for an autocorrelation function which evolves with time in some known fashion, although estimation of the non-stationary autocorrelation function is a separate issue not considered here.

This interpolation scheme can be viewed as a possible practical interpolation method in its own right, allowing for the incorporation of high lag correlations which would not be accurately modelled in all but a very high order AR model, for example. We have obtained some very impressive results with this interpolator in signals with significant long-term correlation structure. The autocorrelation function is estimated from sections of uncorrupted data immediately prior to the missing sections. We do however require the inverse correlation matrix R_x^{-1}, which can be a highly computer intensive calculation for large N. In the stationary case the covariance matrix is Toeplitz and its inverse can be calculated in O(N^2) operations using Trench's algorithm [86] (which is similar to the Levinson recursion), but computational load may still be too high for many applications if N is large.
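A direct, unoptimised sketch of (5.8) is given below (our own illustration; it inverts R_x densely and so ignores the Toeplitz savings just mentioned):

```python
import numpy as np

def gaussian_map_interpolate(y, missing, R):
    """MAP/MMSE interpolation of eq. (5.8) for a zero-mean Gaussian signal.

    y       : observed block (values at corrupted positions are ignored)
    missing : boolean mask, True where samples are missing (i_t = 1)
    R       : (N x N) autocorrelation matrix R_x of the clean signal
    """
    Rinv = np.linalg.inv(R)                  # O(N^3); Trench's algorithm is cheaper
    M_ii = Rinv[np.ix_(missing, missing)]    # M_(i)(i)  = U^T Rx^-1 U
    M_ik = Rinv[np.ix_(missing, ~missing)]   # M_(i)-(i) = U^T Rx^-1 K
    x = np.asarray(y, dtype=float).copy()
    x[missing] = -np.linalg.solve(M_ii, M_ik @ x[~missing])   # eq. (5.8)
    return x
```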

We will see that several of the interpolators of subsequent sections are special cases of this scheme with constrained forms substituted for R_x^{-1}. In many of these cases the constrained forms for R_x^{-1} will lead to special structures in the solution, such as Toeplitz equations, which can be exploited to give computational efficiency.

5.2.1.1 Incorporating a noise model

We have assumed in the previous section that the corrupted data samples are discarded as missing. However, there may be useful information in the corrupted samples as well, particularly when the click amplitude is very small, which can improve the quality of interpolation compared with the ‘missing data’ interpolator. Re-examining our original model for click-degraded samples,

y_t = x_t + i_t n_t    (5.10)

we see that the interpolation method described so far did not include a model for the additive noise term n_t and we have discarded all observed samples for which i_t = 1. This can be remedied by including a model for n_t which represents the click amplitudes present on a particular recording and performing interpolation conditional upon the complete vector y of observed samples. The simplest assumption for the noise model is i.i.d. Gaussian with variance σ_v^2. The basic MAP interpolator is now modified to:

x_{(i)}^{MAP} = −( M_{(i)(i)} + (1/σ_v^2) I )^{-1} ( M_{(i)-(i)} x_{-(i)} − (1/σ_v^2) y_{(i)} )    (5.11)

We can see that this version of the interpolator now explicitly uses the values of the corrupted samples y_{(i)} in the interpolation. The assumption of an i.i.d. Gaussian noise process for the click amplitudes is rather crude and may lead to problems since the click amplitudes are generally drawn from some heavy-tailed non-Gaussian distribution. Furthermore, we have considered the noise variance σ_v^2 as known. However, more realistic non-Gaussian noise modelling can be achieved by the use of iterative techniques which maintain the Gaussian core of the interpolator in a similar form to (5.11) while producing interpolations which are based upon the more realistic heavy-tailed noise distribution. See chapter 12 and [83, 80, 82, 73, 72, 81] for the details of these noise models in the audio restoration problem using Expectation-maximisation (EM) and Markov chain Monte Carlo (MCMC) methods.

Noise models will be considered in more detail in the section on AR model-based interpolators and the chapter on Bayesian click detection.


5.2.2 Autoregressive (AR) model-based interpolation

An interpolation procedure which has proved highly successful is the autoregressive (AR) model-based method. This was devised first by Janssen et al. [96] for the concealment of uncorrectable errors in CD systems, but was independently derived and applied to the audio restoration problem by Vaseghi and Rayner [190, 187, 191]. As for the general Gaussian case above we present the algorithm in a matrix/vector notation in which the locations of missing samples can be arbitrarily specified within the data block through a detection switching vector i. The method can be derived from least-squares (LS) or maximum a posteriori (MAP) principles and we present both derivations here, with the least squares derivation first as it has the simplest interpretation.

5.2.2.1 The least squares AR (LSAR) interpolator

Consider first a block of data samples x drawn from an AR process with parameters a. As detailed in (4.52) and (4.53) we can write the excitation vector e in terms of the data vector:

e = A x    (5.12)
  = A (U x_{(i)} + K x_{-(i)})    (5.13)

where x has been re-expressed in terms of its partition as before. We now define A_{(i)} = AU and A_{-(i)} = AK, noting that these correspond to a columnwise partition of A corresponding to unknown and known samples, respectively, to give:

e = A_{-(i)} x_{-(i)} + A_{(i)} x_{(i)}    (5.14)

The sum squared of the prediction errors over the whole data block is given by:

E = Σ_{n=P+1}^{N} e_n^2 = e^T e

The least squares (LS) interpolator is now obtained as the interpolated data vector x_{(i)} which minimises the sum squared prediction error E, since E can be regarded as a measure of the goodness of ‘fit’ of the data to the AR model. In other words, the solution is found as that unknown data vector x_{(i)} which minimises E:

x_{(i)}^{LS} = argmin_{x_{(i)}} E.

E can be expanded and differentiated using standard vector-matrix calculus to find its minimum:

E = e^T e

∂E/∂x_{(i)} = 2 e^T ∂e/∂x_{(i)} = 2 (A_{-(i)} x_{-(i)} + A_{(i)} x_{(i)})^T A_{(i)} = 0

Hence, solving for x_{(i)}, we obtain:

x_{(i)}^{LS} = −(A_{(i)}^T A_{(i)})^{-1} A_{(i)}^T A_{-(i)} x_{-(i)}    (5.15)

This solution for the missing data involves the solution of a set of l linear equations, where l is the number of missing samples in the block. When there are at least P known samples either side of a single contiguous burst of missing samples the LSAR estimate can be efficiently realised in O(l^2) multiplies and additions using the Levinson-Durbin recursion [52, 86] since the system of equations is Toeplitz. The computation is further reduced if the banded structure of (A_{(i)}^T A_{(i)}) is taken into account. In more general cases where missing samples may occur at arbitrary positions within the data vector it is possible to exploit the Markovian structure of the AR model using the Kalman Filter [5] by expressing the AR model in state space form (see section 5.2.3.5).
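A dense-algebra sketch of (5.15) follows (our own illustration, ignoring the Toeplitz and banded structure that a practical implementation would exploit); the helper that builds the prediction matrix A is reused in later sketches:

```python
import numpy as np

def prediction_matrix(a, N):
    """(N-P) x N matrix A such that e = A x, for AR coefficients a = [a_1, ..., a_P]."""
    P = len(a)
    A = np.zeros((N - P, N))
    for n in range(P, N):
        A[n - P, n] = 1.0
        A[n - P, n - P:n] = -np.asarray(a)[::-1]   # -a_P, ..., -a_1
    return A

def lsar_interpolate(y, missing, a):
    """LSAR interpolation of eq. (5.15); `missing` is a boolean mask."""
    A = prediction_matrix(a, len(y))
    Ai, Ak = A[:, missing], A[:, ~missing]         # A_(i) and A_-(i)
    x = np.asarray(y, dtype=float).copy()
    x[missing] = -np.linalg.solve(Ai.T @ Ai, Ai.T @ (Ak @ x[~missing]))
    return x
```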

5.2.2.2 The MAP AR interpolator

We now derive the maximum a posteriori (MAP) interpolator and show that under certain conditions it is identical to the least squares interpolator. If the autoregressive process for the audio data is assumed Gaussian then (5.8) gives the required result. The inverse autocorrelation matrix of the data, R_x^{-1}, is required in order to use this result and can be obtained from the results in appendix C as:

R_x^{-1} = M_x^{-1} / σ_e^2 = (1/σ_e^2) ( A^T A + \begin{bmatrix} M_{x0}^{-1} & 0 \\ 0 & 0 \end{bmatrix} )

The required submatrices M_{(i)(i)} and M_{(i)-(i)} are thus:

M_{(i)(i)} = U^T R_x^{-1} U = (1/σ_e^2) ( A_{(i)}^T A_{(i)} + U^T \begin{bmatrix} M_{x0}^{-1} & 0 \\ 0 & 0 \end{bmatrix} U )

and

M_{(i)-(i)} = U^T R_x^{-1} K = (1/σ_e^2) ( A_{(i)}^T A_{-(i)} + U^T \begin{bmatrix} M_{x0}^{-1} & 0 \\ 0 & 0 \end{bmatrix} K )

Noting that the operation of pre-multiplying a matrix by U^T is equivalent to extracting the rows of that matrix which correspond to unknown samples, we can see that the second term in each of these expressions becomes zero when there are no corrupted samples within the first P samples of the data vector. In this case the least squares and MAP interpolators are exactly equivalent to one another:

x_{(i)}^{MAP} = x_{(i)}^{LS} = −(A_{(i)}^T A_{(i)})^{-1} A_{(i)}^T A_{-(i)} x_{-(i)}

This equivalence will usually hold in practice, since it will be sensible to choose a data vector which has a reasonably large number of uncorrupted data points before the first missing data point if the estimates are to be reliable. The least squares interpolator is found to perform poorly when missing data occurs in the initial P samples of x, which can be explained in terms of its divergence from the true MAP interpolator under that condition.

5.2.2.3 Examples of the LSAR interpolator

See figure 5.1 for examples of interpolation using the LSAR method applied to a music signal. A succession of interpolations has been performed, with increasing numbers of missing samples from left to right in the data (gap lengths increase from 25 samples up to more than 100). The autoregressive model order is 60. The shorter length interpolations are almost indistinguishable from the true signal (left-hand side of figure 5.1(a)), while the interpolation is much poorer as the number of missing samples becomes large (right-hand side of figure 5.1(b)). This is to be expected of any interpolation scheme when the data is drawn from a random process, but the situation can often be improved by use of a higher order autoregressive model. Despite poor accuracy of the interpolant for longer gap lengths, good continuity is maintained at the start and end of the missing data blocks, and the signal appears to have the right ‘character’. Thus effective removal of click artefacts in typical audio sources can usually be achieved.

FIGURE 5.1. AR-based interpolation, P=60, classical chamber music, (a) short gaps, (b) long gaps

5.2.2.4 The case of unknown AR model parameters

The basic formulation given in (5.15) assumes that the AR parameters are known a priori. In practice we may have a robust estimate of the parameters obtained during the detection stage (see section 5.3.1). This, however, is strictly sub-optimal since it will be affected by the values of the corrupted samples and we should perhaps consider interpolation methods which treat the parameters as unknown a priori. There are various possible approaches to this problem. The classical time series maximum likelihood method would integrate out the unknown data x_{(i)} to give a likelihood function for just the parameters (including the excitation variance), which can then be maximised to give the maximum likelihood solution:


(a, σ_e^2)^{ML} = argmax_{a, σ_e^2} p(x_{-(i)} | a, σ_e^2),   where   p(x_{-(i)} | a, σ_e^2) = ∫ p(x | a, σ_e^2) dx_{(i)}.

The integral can be performed analytically for the Gaussian case either directly or by use of the prediction error decomposition and the Kalman filter (see section 4.4 and [91] for details). The maximisation would be performed iteratively by use of some form of gradient ascent or the expectation-maximisation (EM) algorithm [43] (see section 4.5). A simpler iterative approach, proposed by Janssen, Veldhuis and Vries [96], performs the following steps until convergence is achieved according to some suitable criterion:

1. Initialise a and x_{(i)}.

2. Estimate a by least squares from x with the current estimate of x_{(i)} substituted.

3. Estimate x_{(i)} by least squares interpolation using the current estimate of a.

4. Goto 2.

This simple procedure is quite effective but sensitive to the initialisation and prone sometimes to convergence to local maxima. We note that σ_e^2 does not appear as steps 2. and 3. are independent of its value. More sophisticated approaches can be applied which are based upon Bayesian techniques and allow for a prior distribution on the parameters. This solution can again be obtained by numerical techniques, see chapter 12 and [129, 168] for Bayesian interpolation methods using EM and Markov chain Monte Carlo (MCMC) simulation.

In general it will not be necessary to perform an extensive iterative scheme, since a robust initial estimate for the AR parameters can easily be obtained from the raw data, see e.g. [104, 123] for suitable methods. One or two iterations may then be sufficient to give a perceptually high quality restoration.
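A minimal sketch of the iterative procedure is given below (our own illustration, reusing lsar_interpolate from the previous section; the zero initialisation of the gap, the covariance-method AR fit and the fixed iteration count are our own simple choices):

```python
import numpy as np

def janssen_interpolate(y, missing, P, n_iter=3):
    """Alternate AR estimation (step 2) and LSAR interpolation (step 3)."""
    x = np.asarray(y, dtype=float).copy()
    x[missing] = 0.0                       # step 1: crude initialisation of x_(i)
    for _ in range(n_iter):
        # step 2: least squares AR fit, x_n ~ sum_k a_k x_{n-k}
        X = np.column_stack([x[P - k:-k] for k in range(1, P + 1)])
        a, *_ = np.linalg.lstsq(X, x[P:], rcond=None)
        # step 3: LSAR interpolation with the current parameter estimate
        x = lsar_interpolate(x, missing, a)
    return x
```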

5.2.3 Adaptations to the AR model-based interpolator

The basic AR model-based interpolator performs well in most cases. However, certain classes of signal which do not fit the modelling assumptions (such as periodic pulse-driven voiced speech) and very long gap lengths can lead to an audible ‘dulling’ of the signal or unsatisfactory masking of the original corruption. Increasing the order of the AR model will usually improve the results; however, several developments to the method which are now outlined can lead to improved performance in some scenarios.


5.2.3.1 Pitch-based extension to the AR interpolator

Vaseghi and Rayner [191] propose an extended AR model to take account of signals with long-term correlation structure, such as voiced speech, singing or near-periodic music. The model, which is similar to the long term prediction schemes used in some speech coders, introduces extra predictor parameters around the pitch period T, so that the AR model equation is modified to:

x_n = Σ_{i=1}^{P} a_i x_{n-i} + Σ_{j=-Q}^{Q} b_j x_{n-T-j} + e_n,    (5.16)

where Q is typically smaller than P. Least squares/ML interpolation using this model is of a similar form to the standard LSAR interpolator, and parameter estimation is straightforwardly derived as an extension of standard AR parameter estimation methods (see section 4.3.1). The method gives a useful extra degree of support from adjacent pitch periods which can only be obtained using very high model orders in the standard AR case. As a result, the ‘under-prediction’ sometimes observed when interpolating long gaps is improved. Of course, an estimate of T is required, but results are quite robust to errors in this. Veldhuis [192, chapter 4] presents a special case of this interpolation method in which the signal is modelled by one single ‘prediction’ element at the pitch period (i.e. Q = 0 and P = 0 in the above equation).

5.2.3.2 Interpolation with an AR + basis function representation

A simple extension of the AR-based interpolator modifies the signal model to include some deterministic basis functions, such as sinusoids or wavelets. Often it will be possible to model most of the signal energy using the deterministic basis, while the AR model captures the correlation structure of the residual. The sinusoid + residual model, for example, has been applied successfully by various researchers, see e.g. [169, 158, 165, 66]. The model for x_n with AR residual can be written as:

x_n = Σ_{i=1}^{Q} c_i ψ_i[n] + r_n   where   r_n = Σ_{i=1}^{P} a_i r_{n-i} + e_n

Here ψ_i[n] is the nth element of the ith basis vector ψ_i and r_n is the residual, which is modelled as an AR process in the usual way. For example, with a sinusoidal basis we might take ψ_{2i-1}[n] = cos(ω_i n T) and ψ_{2i}[n] = sin(ω_i n T), where ω_i is the ith sinusoid frequency. Another simple example of basis functions would be a d.c. offset or polynomial trend. These can be incorporated within exactly the same model and hence the interpolator presented here is a means for dealing also with non-zero mean or smooth underlying trends.


FIGURE 5.2. Original (uncorrupted) audio extract (brass orchestral extract)

FIGURE 5.3. Audio extract with missing sections (shown as zero amplitude)

FIGURE 5.4. Sin+AR interpolated audio: dotted line - true original; solid line - interpolated

FIGURE 5.5. AR interpolated audio: dotted line - true original; solid line - interpolated

FIGURE 5.6. Comparison of both methods

If we assume for the moment that the set of basis vectors ψ_i is fixed and known for a particular data vector x then the LSAR interpolator can easily be extended to cover this case. The unknowns are now augmented by the basis coefficients, c_i. Define c as a column vector containing the c_i's and a (N × Q) matrix G such that x = G c + r, where r is the vector of residual samples. The columns of G are the basis vectors, i.e. G = [ψ_1 ... ψ_Q]. The excitation sequence can then be written in terms of x and c as e = A(x − G c), which is the same form as for the general linear model (see section 4.1). As before the solution can easily be obtained from least squares, ML and MAP criteria, and the solutions will be equivalent in most cases. We consider here the least squares solution which minimises e^T e as before, but this time with respect to both x_{(i)} and c, leading to the following estimate:

\begin{bmatrix} x_{(i)} \\ c \end{bmatrix} = \begin{bmatrix} A_{(i)}^T A_{(i)} & -A_{(i)}^T A G \\ -G^T A^T A_{(i)} & G^T A^T A G \end{bmatrix}^{-1} \begin{bmatrix} -A_{(i)}^T A_{-(i)} x_{-(i)} \\ G^T A^T A_{-(i)} x_{-(i)} \end{bmatrix}    (5.17)

This extended version of the interpolator reduces to the standard interpolator when the number of basis vectors, Q, is equal to zero. If we back-substitute for c in (5.17), the following expression is obtained for x_{(i)} alone:

x_{(i)} = −( A_{(i)}^T ( I − AG (G^T A^T A G)^{-1} G^T A^T ) A_{(i)} )^{-1} ( A_{(i)}^T A_{-(i)} x_{-(i)} − A_{(i)}^T AG (G^T A^T A G)^{-1} G^T A^T A_{-(i)} x_{-(i)} )

These two representations are equivalent to both the maximum likelihood (ML) and maximum a posteriori (MAP) interpolator (assuming a uniform prior distribution for the basis coefficients) under the same conditions as the standard AR interpolator, i.e. that no missing samples occur in the first P samples of the data vector. In cases where missing data does occur in the first P samples, a similar adaptation to the algorithm can be made as for the pure AR case. The modified interpolator involves some extra computation in estimating the basis coefficients, but as for the pure AR case many of the terms can be efficiently calculated by utilising the banded structure of the matrix A.
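In code, (5.17) amounts to one symmetric linear solve; the sketch below (our own illustration, again reusing prediction_matrix from section 5.2.2.1) returns both the interpolated block and the estimated basis coefficients:

```python
import numpy as np

def ar_basis_interpolate(y, missing, a, G):
    """Joint LS/MAP estimate of x_(i) and c from eq. (5.17).

    G : (N x Q) matrix whose columns are the basis vectors psi_i.
    """
    A = prediction_matrix(a, len(y))
    Ai, Ak = A[:, missing], A[:, ~missing]
    AG = A @ G
    lhs = np.block([[Ai.T @ Ai, -Ai.T @ AG],
                    [-AG.T @ Ai, AG.T @ AG]])
    rhs = np.concatenate([-Ai.T @ (Ak @ y[~missing]),
                          AG.T @ (Ak @ y[~missing])])
    z = np.linalg.solve(lhs, rhs)
    l = int(missing.sum())
    x = np.asarray(y, dtype=float).copy()
    x[missing] = z[:l]
    return x, z[l:]                        # interpolated signal and coefficients c
```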

We do not address the issue of basis function selection here. Multiscale and ‘elementary waveform’ representations such as wavelet bases may capture the non-stationary nature of audio signals, while a sinusoidal basis is likely to capture the character of voiced speech and the steady-state section of musical notes. Some combination of the two may well provide a good match to general audio. Procedures have been devised for selection of the number and frequency of sinusoidal basis vectors in the speech and audio literature [127, 45, 66] which involve various peak tracking and selection strategies in the discrete Fourier domain. More sophisticated and certainly more computationally intensive methods might adopt a time domain model selection strategy for selection of appropriate basis functions from some large ‘pool’ of candidates. A Bayesian approach would be a strong possibility for this task, employing some of the powerful Monte Carlo variable selection methods which are now available [65, 108]. Similar issues of iterative AR parameter estimation apply as for the standard AR interpolator in the AR plus basis function interpolation scheme.

5.2.3.2.1 Example: sinusoid+AR residual interpolation

As a simple example of how the inclusion of deterministic basis vectors can help in restoration performance we consider the interpolation of a short section of brass music, which has a strongly ‘voiced’ character, see figure 5.2. Figure 5.3 shows the same data with three missing sections, each of length 100 samples. This was used as the initialisation for the interpolation algorithm. Firstly a sinusoid + AR interpolation was applied, using 25 sinusoidal basis frequencies and an AR residual with order P = 15. The algorithm used was iterative, re-estimating the AR parameters, sinusoidal frequencies and missing data points at each step. The sinusoidal frequencies are estimated rather crudely at each step by simply selecting the 25 frequencies in the DFT of the interpolated data which have largest magnitude. The number of iterations was 5. Figure 5.4 shows the resulting interpolated data, which can be seen to be a very effective reconstruction of the original uncorrupted data. Compare this with interpolation using an AR model of order 40 (chosen to match the 25+15 parameters of the sin+AR interpolation), as shown in figure 5.5, in which the data is under-predicted quite severely over the missing sections. Finally, a zoomed-in comparison of the two methods over a short section of the same data is given in figure 5.6, showing more clearly the way in which the AR interpolator under-performs compared with the sin+AR interpolator.

5.2.3.3 Random sampling methods

A further modification to the LSAR method is concerned with the characteristics of the excitation signal. We notice that the LSAR procedure seeks to minimise the excitation energy of the signal, irrespective of its time domain autocorrelation. This is quite correct, and desirable mathematical properties result. However, figure 5.8 shows that the resulting excitation signal corresponding to the corrupted region can be correlated and well below the level of surrounding excitation. As a result, the ‘most probable’ interpolants may under-predict the true signal levels and be over-smooth compared with the surrounding signal. In other words, ML/MAP procedures do not necessarily generate interpolants which are typical for the underlying model, which is an important factor in the perceived effect of the restoration. Rayner and Godsill [161] have devised a method which addresses this problem. Instead of minimising the excitation energy, we consider interpolants with constant excitation energy. The excitation energy may be expressed as:

E = (x_{(i)} − x_{(i)}^{LS})^T A_{(i)}^T A_{(i)} (x_{(i)} − x_{(i)}^{LS}) + E_{LS},   E > E_{LS},    (5.18)

where E_{LS} is the excitation energy corresponding to the LSAR estimate x_{(i)}^{LS}. The positive definite matrix A_{(i)}^T A_{(i)} can be factorised into ‘square roots’ by Cholesky or any other suitable matrix decomposition [86] to give A_{(i)}^T A_{(i)} = M^T M, where M is a non-singular square matrix. A transformation of variables u = M (x_{(i)} − x_{(i)}^{LS}) then serves to de-correlate the missing data samples, simplifying equation (5.18) to:

E = u^T u + E_{LS},    (5.19)

from which it can be seen that the (non-unique) solutions with constant excitation energy correspond to vectors u with constant L2-norm. The resulting interpolant can be obtained by the inverse transformation x_{(i)} = M^{-1} u + x_{(i)}^{LS}.


FIGURE 5.7. Original signal and excitation (P=100)

FIGURE 5.8. LSAR interpolation and excitation (P=100)

FIGURE 5.9. Sampled AR interpolation and excitation (P=100)

One suitable criterion for selecting u might be to minimise the autocorrelation at non-zero lags of the resulting excitation signal, since the excitation is assumed to be white noise. This, however, requires a non-linear minimisation, and a practical alternative is to generate u as Gaussian white noise with variance (E − E_{LS})/l, where l is the number of corrupted samples. The resulting excitation will have approximately the desired energy and uncorrelated character. A suitable value for E is the expected excitation energy for the AR model, provided this is greater than E_{LS}, i.e. E = max(E_{LS}, N σ_e^2).

Viewed within a probabilistic framework, the case when E = E_{LS} + l σ_e^2, where l is the number of unknown sample values, is equivalent to drawing a sample from the posterior density for the missing data, p(x_{(i)} | x_{-(i)}, a, σ_e^2).

Figures 5.7-5.9 illustrate the principles involved in this sampled interpolation method. A short section taken from a modern solo vocal recording is shown in figure 5.7, alongside its estimated autoregressive excitation. The waveform has a fairly ‘noise-like’ character, and the corresponding excitation is noise-like as expected. The standard LSAR interpolation and corresponding excitation is shown in figure 5.8. The interpolated section (between the dotted vertical lines) is reasonable, but has lost the random noise-like quality of the original. Examination of the excitation signal shows that the LSAR interpolator has done ‘too good’ a job of minimising the excitation energy, producing an interpolant which, while optimal in a mean-square error sense, cannot be regarded as typical of the autoregressive process. This might be heard as a momentary change in sound quality at the point of interpolation. The sampling-based interpolator is shown in figure 5.9. Its waveform retains the random quality of the original signal, and likewise the excitation signal in the gap matches the surrounding excitation. Hence the sub-optimal interpolant is likely to sound more convincing to the listener than the LSAR reconstruction.

O Ruanaidh and Fitzgerald [146, 166] have successfully extended the idea of sampled interpolants to a full Gibbs Sampling framework [64, 63] in order to generate typical interpolants from the marginal posterior density p(x_{(i)} | x_{-(i)}). The method is iterative and involves sampling from the conditional posterior densities of x_{(i)}, a and σ_e^2 in turn, with the other unknowns fixed at their most recent sampled values. Once convergence has been achieved, the interpolation used is the last sampled estimate from p(x_{(i)} | x_{-(i)}, a, σ_e^2). See chapter 12 for a more detailed description and extensions to these methods.

5.2.3.4 Incorporating a noise model

If we incorporate an i.i.d. Gaussian noise model as in (5.11) then the modified AR interpolator, assuming no corrupted samples within the first P elements of y, is:

x_{(i)}^{MAP} = −( A_{(i)}^T A_{(i)} + (σ_e^2/σ_v^2) I )^{-1} ( A_{(i)}^T A_{-(i)} x_{-(i)} − (σ_e^2/σ_v^2) y_{(i)} )    (5.20)

(see chapter 9), where σ_v^2 is the variance of the corrupting noise, which is assumed independent and Gaussian. This interpolator has been found to give higher perceived restoration quality, particularly in the case of clicks with small amplitude. The assumed model and the resulting interpolator form a part of the Bayesian restoration schemes discussed in chapters 9-12. With this form of interpolator we now have the task of estimating the (unknown) noise variance σ_v^2 in addition to the AR parameters and the excitation variance σ_e^2. This can be performed iteratively within the same type of framework as for the standard AR interpolator (section 5.2.2.4) and details of EM and MCMC based methods can be found in [83, 80, 82, 73, 72, 81, 129] and chapter 12.

5.2.3.5 Sequential methods

The preceding algorithms all operate in a batch mode; that is, a whole vector of data is processed in one single operation. It will often be more convenient to operate in a sequential fashion, at each step inputting one new data point and updating our interpolation in the light of the information given by the new data point. The Kalman filter (see section 4.4) can be used to achieve this. We first write the AR model in state space form:

y_n = Z_n α_n + u_n
α_{n+1} = T α_n + H e_n

where

α_n = [ x_n   x_{n-1}   x_{n-2}   ...   x_{n-P+1} ]^T

T = \begin{bmatrix} a_1 & a_2 & \cdots & a_{P-1} & a_P \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}    H = [ 1   0   ...   0 ]^T

Here as before for the Kalman filter we have dropped our usual bold-face notation for matrices and vectors for the sake of simplicity, the dimensionality of each term being clear from its definition.

We allow the term Z_n to be time dependent in order to account for missing observations in a special way. For an uncorrupted data sequence with no missing data, Z_n = H^T for all n and the variance of u_n is zero, i.e. all the data points are exactly observed without error. For the case of a discarded or missing data point, we can set y_n = 0 and Z_n = 0 to indicate that the data point is not observed. The standard Kalman filter can then be run across the entire data sequence, estimating the states α_n (and hence the interpolated data sequence x_n) from the observed data y_0, ..., y_n. The recursion is summarised as:

1. Initialise: a_0, P_0.

2. Repeat for n = 1 to N:

(a) Prediction:

a_{n|n-1} = T a_{n-1}
P_{n|n-1} = T P_{n-1} T^T + H σ_e^2 H^T

(b) Correction:

a_n = a_{n|n-1} + K_n (y_n − Z_n a_{n|n-1})
P_n = (I − K_n Z_n) P_{n|n-1}

where

K_n = P_{n|n-1} Z_n^T (Z_n P_{n|n-1} Z_n^T + E[u_n^2])^{-1}    (Kalman Gain)


The observation noise variance E[u_n^2] is zero for the missing data case considered so far, but we leave it in place for later inclusion of a Gaussian observation noise model. Assuming the AR model to be stable, the initialisation is given by a_0 = E[α_0] = 0 and P_0 = E[α_0 α_0^T], where the latter term is the covariance matrix for P samples of the AR process (see appendix C). Note that many of the calculations in this recursion are redundant and can be eliminated by careful programming, since we know immediately that x_n = y_n for any uncorrupted samples with i_n = 0. This means that elements of P_n which involve such elements x_n can immediately be set to zero. Taking account of this observation the Kalman filter becomes a computationally attractive means of performing interpolation, particularly when missing samples do not occur in well spaced contiguous bursts so that the Toeplitz structure is lost in the batch based AR interpolator.

A straightforward sequential interpolator would extract at each time point the last element of a_n, which equals E[x_{n-P+1} | y_1, ..., y_n], since this is the element with the greatest degree of ‘lookahead’ or ‘smoothing’. This degree of lookahead may not be sufficient, especially when there are long bursts of corrupted samples in the data. To overcome the problem an augmented state space model can be constructed in which extra lagged data points are appended to the state α_n. The Kalman filter equations retain the same form as above but interpolation performance is improved, owing to the extra lookahead in the estimated output.

We have shown how to implement a sequential version of the basic AR interpolator. By suitable modifications to the state transition and observation equations, which we do not detail here, it is possible also to implement using the Kalman filter all of the subsequent modifications to the AR interpolator discussed in later sections of the chapter, including the sampled versions and the AR plus basis function version.

5.2.3.5.1 The Kalman Smoother

The Kalman filter may also be used to perform efficient batch-based interpolation. The filter is run forwards through the data exactly as above, but is followed by a backwards ‘smoothing’ pass through the data. The smoothing pass is a backwards recursion which calculates E[α_n | y_1, ..., y_N] for n = N − 1, N − 2, ..., 1, i.e. the estimate of the state sequence conditional upon the whole observed data sequence. We do not give the details of the smoother here but the reader is referred for example to the texts by Anderson and Moore [5] and Harvey [91]. It should be noted that the results of such a procedure are then mathematically identical to the standard batch based AR interpolation.


5.2.3.5.2 Incorporating a noise model into the Kalman filter

We have assumed thus far that the observation noise is always zero, i.e. E[u_n^2] = 0. If we allow this term to be non-zero then a Gaussian noise model can be incorporated, exactly as for the batch based algorithms. Setting Z_n = H^T for all n, with E[u_n^2] = 0 when i_n = 0 and E[u_n^2] = σ_v^2 when i_n = 1, achieves the same noise model as in section 5.2.3.4. Interpolators based upon other noise models which allow some degree of noise at all sampling instants can be used to remove background noise and clicks jointly. These ideas have been applied in [141, 142] and will be returned to in the chapters concerning Bayesian restoration.

5.2.4 ARMA model-based interpolation

A natural extension to the autoregressive interpolation models is the autoregressive moving-average (ARMA) model (see section 4.2). In this model we extend the AR model by including a weighted sum of excitation samples as well as the usual weighted sum of output samples:

x_n = Σ_{i=1}^{P} a_i x_{n-i} + Σ_{j=0}^{Q} b_j e_{n-j}

with b_0 = 1. Here the b_j are the moving-average (MA) coefficients, which introduce zeros into the model as well as poles. This can be expected to give more accurate modelling of the spectral shape of general audio signals with fewer coefficients, although it should be noted that a very high order AR model can approximate any ARMA model arbitrarily well.

In this section we present an interpolator for ARMA signals which is based on work in a recent M.Eng. dissertation by Soon Leng Ng [140]. We adopt a MAP procedure, since the probability of the initial state vector is more crucial in the ARMA case than in the pure AR case. In principle we can express the ARMA interpolator for Gaussian signals within the earlier framework for Gaussian signals with known correlation structure, but there is some extra work involved in the calculation of the inverse covariance matrix for the ARMA case [21]. In order to obtain this covariance matrix we re-parameterise the ARMA process in terms of a ‘latent’ AR process u_n cascaded with a moving-average filter:

u_n = Σ_{i=1}^{P} a_i u_{n-i} + e_n    (5.21)

x_n = Σ_{j=0}^{Q} b_j u_{n-j}    (5.22)

A diagrammatic representation is given in figure 5.10.

FIGURE 5.10. Equivalent version of ARMA model (e_n drives an AR filter of order P to give u_n, which is passed through an MA filter of order Q to give x_n)

The derivation proceeds by expressing the probability for x_n in terms of the probability of u_n, which we already have since it is an AR process. The linear transformation implied by (5.22) has unity Jacobian and hence the result is achieved by a straightforward change of variables. We then need only to account for the initial values of the latent process u_0 = [u_{-P+1} ... u_{-1} u_0]^T, which is carried out using the results of appendix C. Similar calculations can be found in time series texts, see e.g. [21, Appendix A7.3], although Box et al. derive their likelihoods in terms of initial states of both AR and MA components whereas we show that only the AR initial conditions are required. We then proceed to derive the MAP data interpolator in a similar fashion to the AR case.

For a data block with N samples, equations (5.21) and (5.22) can be expressed in matrix form as follows:

e = A u    (5.23)
x = B u    (5.24)

where:

x = [x_1 ... x_N]^T,   e = [e_1 ... e_N]^T,    (5.25)
u = [u_{-P+1} ... u_{-1} u_0 u_1 ... u_N]^T,    (5.26)

A is the N × (N+P) matrix:

A = \begin{bmatrix} -a_P & \cdots & -a_1 & 1 & 0 & \cdots & 0 \\ 0 & -a_P & \cdots & -a_1 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & -a_P & \cdots & -a_1 & 1 \end{bmatrix}    (5.27)

and B is the N × (N+P) matrix:

B = \begin{bmatrix} 0 & \cdots & 0 & b_Q & \cdots & b_1 & b_0 & 0 & \cdots & 0 \\ 0 & \cdots & \cdots & 0 & b_Q & \cdots & b_1 & b_0 & \ddots & \vdots \\ \vdots & & & & \ddots & \ddots & & \ddots & \ddots & 0 \\ 0 & \cdots & \cdots & \cdots & \cdots & 0 & b_Q & \cdots & b_1 & b_0 \end{bmatrix}    (5.28)


Suppose that u is partitioned into u_0, which contains the first P samples, and u_1, which contains the remaining N samples:

u_0 = [u_{-P+1}  u_{-P+2}  ...  u_0]^T    (5.29)
u_1 = [u_1  u_2  ...  u_N]^T    (5.30)

We can now partition the equations for e and x correspondingly:

e = A_0 u_0 + A_1 u_1    (5.31)
x = B_0 u_0 + B_1 u_1    (5.32)

where A = [A_0  A_1] and B = [B_0  B_1] are columnwise partitions of A and B.

B_1 is invertible since it is a lower triangular square matrix and b_0 = 1. Thus we can rewrite equation (5.32) as:

u_1 = B_1^{-1} (x − B_0 u_0)    (5.33)

and back-substitute into (5.31) to give:

e = C u_0 + D x    (5.34)

where

C = A_0 − A_1 B_1^{-1} B_0   and   D = A_1 B_1^{-1}    (5.35)

The probability of u_1 given u_0, the conditional probability for the latent AR process (see appendix C), is:

p(u_1 | u_0) ∝ exp( −(1/(2σ_e^2)) e^T e ) |_{e = A u}

(the conditioning upon ARMA parameters a and b is implicit from here on). Since B_1 is lower triangular with unity diagonal elements (b_0 = 1) the Jacobian of the transformation from u_1 to x conditioned upon u_0, as defined by (5.33), is unity and we can write immediately:

p(x | u_0) ∝ exp( −(1/(2σ_e^2)) e^T e ) |_{e = C u_0 + D x}

We have also the probability of u_0, again from appendix C:

p(u_0) ∝ exp( −(1/(2σ_e^2)) u_0^T M_0^{-1} u_0 )

where M_0 is the covariance matrix for P samples of the AR process with unity excitation variance. Hence the joint probability is:

p(x, u_0) = p(x | u_0) p(u_0)
          ∝ exp( −(1/(2σ_e^2)) ( (C u_0 + D x)^T (C u_0 + D x) + u_0^T M_0^{-1} u_0 ) ).


p(x) is now obtained by integration over the unwanted component u_0:

p(x) = ∫ p(x, u_0) du_0
     ∝ exp( −(1/(2σ_e^2)) ( x^T D^T (I − C (C^T C + M_0^{-1})^{-1} C^T) D x ) )

where the integral is performed using the multivariate Gaussian integral result (see appendix A.5). The resulting distribution is multivariate Gaussian, as expected, and by comparison with the general case (A.2) we obtain the inverse covariance matrix as:

σ_e^2 R_x^{-1} = D^T (I − C (C^T C + M_0^{-1})^{-1} C^T) D

We can now substitute this expression for the inverse covariance matrix into the general expression of equation (5.8) to find the MAP ARMA interpolator. The interpolation involves significantly more calculations than the AR-based interpolator (equation 5.15), although it should be noted that calculation of B_1^{-1}, which is required in (5.35), takes only NQ/2 flops since B_1 is Toeplitz, banded and lower triangular. Furthermore, many of the matrix multiplication operations, such as those involving A_1, can be realised using simple filtering operations. There is always, however, the overhead of inverting the (P × P) matrix (C^T C + M_0^{-1}).

As an alternative to the batch based method outlined above, the ARMA interpolator can be realised sequentially using the Kalman filter in a very similar way to the AR case (see section 5.2.3.5). The ARMA model can be expressed in state space form (see e.g. [5, 91]) and having done so the Kalman filter recursions follow exactly as before on the modified state space model.

5.2.4.1 Results from the ARMA interpolator

5.2.4.1.1 Synthetic ARMA process

The performance of the ARMA model-based interpolator is evaluated here using two audio signals, each of length roughly 10 seconds: a piece of fast moving piano music (‘fast’) and a section of solo vocal singing (‘diner’). Both signals were sampled at 44.1 kHz and sections of data were removed at periodic intervals from the original signal. Each missing section was 50 samples long and approximately 10% of the data was removed. The AR interpolator assumed an AR order of 15 while the ARMA interpolator assumed an ARMA order of (15, 5). The model parameters in both cases were estimated using MATLAB's in-built ARMA function ‘armax’ operating on the uncorrupted input data, allowing the parameters to change once every 1000 data points. The mean squared error (MSE) and maximum absolute deviation (MAD) in each section were calculated and the average values are tabulated in table 5.1.


Signal             Average MSE      Average MAD
‘fast’    AR       6.6428×10^5      923.74
          ARMA     6.4984×10^5      911.26
‘diner’   AR       3.6258×10^5      844.52
          ARMA     3.0257×10^5      790.59

TABLE 5.1. Comparison of error performance for tracks ‘fast’ and ‘diner’.

The errors show that the ARMA interpolator performs better than the LSAR interpolator for both types of audio signal under these conditions, although the difference is by no means dramatic. The signals restored by the ARMA interpolator were also subjectively evaluated by informal listening tests and in both cases the differences from the original (clean) audio signals were almost inaudible, but the differences between the two interpolation methods were also extremely subtle. These results indicate that it may not be worthwhile using extra computation to interpolate using ARMA models.

5.2.5 Other methods

Several transform-domain methods have been developed for click replacement. Montresor, Valiere and Baudry [135] describe a simple method for interpolating wavelet coefficients of corrupted audio signals, which involves substituting uncorrupted wavelet coefficients from nearby signal according to autocorrelation properties. This, however, does not ensure continuity of the restored waveform and is not a localised operation in the signal domain. An alternative method, based in the discrete Fourier domain, which is aimed at restoring long sections of missing data is presented by Maher [118]. In a similar manner to the sinusoidal modelling methods of McAulay and Quatieri [126, 158, 127], this technique assumes that the signal is composed as a sum of sinusoids with slowly varying frequencies and amplitudes. Spectral peak 'tracks' from either side of the gap are identified from the Discrete Fourier Transform (DFT) of successive data blocks and interpolated in frequency and amplitude to generate estimated spectra for the intervening material. The inverse DFTs of the missing data blocks are then inserted back into the signal. The method is reported to be successful for gap lengths of up to 30ms, or well over 1000 samples at audio sampling rates. A method for interpolation of signals represented by the multiple sinusoid model is given in [192, Chapter 6].

Godsill and Rayner [77, 70] have derived an interpolation method which operates in the frequency domain. This can be viewed as an alternative to the LSAR interpolator (see section 5.2.2) in which power spectral density (PSD) information is directly incorporated in the frequency domain. Real and imaginary DFT components are modelled as independent Gaussians with variance proportional to the PSD at each frequency. These assumptions of independence are known to hold exactly for periodic random processes [173], so the method is best suited to musical signals with strongly tonal content which are close to periodic. The method can, however, also be used for other stationary signals provided that a sufficiently long block length is used (e.g. 500-2000 samples) since the assumptions also improve as block length increases [148]. The maximum a posteriori solution is of a similar form and complexity to the LSAR interpolator, and is potentially useful as an alternative to the other methods when the signal has a quasi-periodic or tonal character. A robust estimate is required for the PSD, and this can usually be obtained through averaged DFTs of the surrounding clean data, although iterative methods are also possible, as in the case of the LSAR estimator.

5.3 Detection of clicks

In the last section we discussed methods for interpolation of corrupted samples. All of these methods assumed complete knowledge of the position of click-type corruption in the audio waveform. In practice of course this information is completely unknown a priori and some kind of detection procedure must be applied in order to extract the timing of the degradation. There are any number of ways by which click detection can be performed, ranging from entirely ad hoc filtering methods through to model based probabilistic detection algorithms, as described in later chapters. In this section we describe some simple techniques which have proved extremely effective in most gramophone environments. The discussion cannot ever be complete, however, as these algorithms are under constant development and it is always possible to devise some clever trick which will improve slightly over the current state of the art!

Click detection for audio signals involves the identification of samples which are not drawn from the underlying clean audio signal; in other words they are drawn from some spurious 'outlier' distribution. There are close relationships between click detection and work in robust parameter estimation and treatment of outliers in data analysis, from fields as diverse as medical signal processing, underwater signal processing and statistical data analysis. In the statistical field in particular there has been a vast amount of work in the treatment of outliers (see e.g. [13, 12] for extensive review material, and further references in the later chapters on statistical click removal techniques). Various criteria for detection are possible, including minimum probability of error and related concepts, but strictly speaking the aim of any audio restoration scheme is to remove only those artefacts which are audible to the listener. Any further processing is not only unnecessary but will increase the chance of distorting the perceived signal quality. Hence a truly optimal system should take into account the trade-off between the audibility of artefacts and perceived distortion as a result of processing, and will involve consideration of complex psychoacoustical effects in the human ear (see e.g. [137]). Such an approach, however, is difficult both to formulate and to realise, so we limit discussion here only to criteria which are currently well understood in a mathematical sense. This is not to say that new algorithms should not strive to attain the higher target of a perceptually optimal restoration.

The simplest click detection methods involve a high-pass filtering operation on the signal, the assumption being that most audio signals contain little information at high frequencies, while clicks, like impulses, have spectral content at all frequencies. Clicks are thus enhanced relative to the signal by the high-pass filtering operation and can easily be detected by thresholding the filtered output. The method has the advantage of being simple to implement and having no unknown system parameters (except for a detection threshold). This principle is the basis of most analogue de-clicking equipment [34, 102] and some simple digital click detectors [101]. Of course, the method will fail if the audio signal itself has strong high frequency content or the clicks are band-limited. Along similar lines, wavelets and multiresolution methods in general [3, 37, 38] have useful localisation properties for singularities in signals (see e.g. [120]), and a wavelet filter at a fine resolution can be used for the detection of clicks. Such methods have been studied and demonstrated successfully by Montresor, Valiere et al. [185, 135].

Other methods attempt to incorporate prior information about signal and noise into a model-based detection procedure. We will focus upon such methods since we believe that the best results can be achieved through the use of all the prior information which is available about the problem. We will describe techniques based upon AR modelling assumptions for the audio signal, the principles of which can easily be extended to some of the other models already encountered.

5.3.1 Autoregressive (AR) model-based click detection

Techniques for detection and removal of impulses from autoregressive (AR) signals have been developed in other fields of signal processing from robust filtering principles (see e.g. [9, 53]). These methods apply non-linear functions to the autoregressive excitation sequence in order to make parameter estimates robust in the presence of impulses and allow detection of impulses.

Autoregressive detection methods in the audio restoration field originated with Vaseghi and Rayner [190, 187, 191]. A sub-frame of the underlying audio data x_t, t = 1, ..., N, is assumed to be drawn from a short-term stationary autoregressive (AR) process, as for the AR-based interpolators of the previous section:

$$x_t = \sum_{i=1}^{P} a_i x_{t-i} + e_t$$

FIGURE 5.11. Click-degraded music waveform taken from 78rpm recording.

FIGURE 5.12. AR-based detection for above waveform, P = 50. (a) Prediction error filter (b) Matched filter.

The prediction error e_t should take on small values for most of the time, especially if Gaussian assumptions can be made for e_t. Now, consider what happens when an impulse replaces the signal. Of course the impulse will be unrelated to the surrounding signal values and it is likely to cause a large error if an attempt is made to predict its value from the previous values of x_t. Hence, if we apply the inverse AR filter to an impulse-corrupted AR signal y_t and observe the output prediction error sequence ε_t = y_t − Σ_{i=1}^{P} a_i y_{t−i}, we can expect large values at times when impulses are present and small values for the remainder of the time. This is the basic principle behind the simple AR model-based detectors.

If we suppose that the AR parameters and corresponding prediction error variance σ_e² are known, then a possible detection scheme would be:

For t = 1 to N

1. Calculate prediction error: ε_t = y_t − Σ_{i=1}^{P} a_i y_{t−i}

2. If |ε_t| > kσ_e then i_t = 1, else i_t = 0

end

Here k is a detection threshold which might be set to around 3 if we believed the excitation e_t to be Gaussian and applied the standard '3σ' rule for detection of unusual events. i_t is a binary detection indicator as before.

In practice the AR model parameters a and the excitation variance σ_e² are unknown and must be estimated from the corrupted data y_t using some procedure robust to impulsive noise. Robust methods for this estimation procedure are well-known (see e.g. [95, 139, 123]) and can be applied here, at least to give initial estimates of the unknown parameters. The detection process can then be iterated a few times if time permits, each time performing interpolation of the detected samples and re-estimation of the AR parameters from that restored output. The whole process can be repeated frame by frame through an entire track of audio material, leading to a click-reduced output.

5.3.1.1 Analysis and limitations

Some basic analysis of the AR detection method is instructive. Substituting for y_t from (5.1) using (4.41) gives:

$$\epsilon_t = e_t + i_t n_t - \sum_{i=1}^{P} a_i\, i_{t-i}\, n_{t-i} \qquad (5.36)$$

which is composed of the true signal excitation e_t and a weighted sum of present and past impulsive noise values. If x_t is zero mean and has variance σ_x² then e_t is white noise with variance

$$\sigma_e^2 = \frac{2\pi\sigma_x^2}{\int_{-\pi}^{\pi} \frac{1}{|A(e^{j\theta})|^2}\, d\theta}, \quad \text{where } A(z) = 1 - \sum_{i=1}^{P} a_i z^{-i}.$$

The reduction in power here from signal to excitation can be 40dB or more for highly correlated audio signals. Consideration of (5.36), however, shows that a single impulse contributes the impulse response of the prediction error filter, weighted by the impulse amplitude, to the detection signal ε_t, with maximum amplitude corresponding to the maximum in the impulse response. This means that considerable amplification of the impulse relative to the signal can be achieved for all but uncorrelated, noise-like signals. It should be noted, however, that this amplification is achieved at the expense of localisation in time of the impulse, whose effect is now spread over P + 1 samples of the detection signal ε_t. This will have adverse consequences when a number of impulses is present in the same vicinity, since their impulse responses may cancel one another out or add constructively to give false detections. More generally, threshold selection will be troublesome when impulses of widely differing amplitudes are present, since a low threshold which is appropriate for very small clicks will lead to false detections in the P detection values which follow a large impulse.

Detection can then be performed by thresholding |ε_t| to identify likely impulses. Choice of threshold will depend upon the AR model, the variance of e_t and the size of impulses present, and will reflect trade-offs between false and missed detection rates. Optimal thresholds can be obtained under certain assumptions for the noise and signal statistics, see [69, appendix H], but will in more general conditions be a difficult issue. See figure 5.12(a) for a typical example of the detection prediction error ε_t obtained using this method, which shows how the impulsive interference is strongly amplified relative to the signal component.

5.3.1.2 Adaptations to the basic AR detector

As the previous discussion will have indicated, the most basic form of AR detector described thus far is far from ideal. Depending upon the threshold chosen it will either leave a residual click noise behind or it will cause perceptual damage to the underlying audio signal. A practical scheme will have to devise improvements aimed at overcoming the limitations discussed above. We do not detail the precise structure of such an algorithm as most of the ideas are ad hoc, but rather point out some areas where improvements can be achieved:

• Clicks occur as 'bursts' of corruption. It is very rare for a single impulse to occur in the corrupted waveform. It is much more common for clicks to have some finite width, say between 5 and 100 samples at 44.1kHz sampling rates. This would correspond to the width of the physical scratch or irregularity in the recorded medium. Hence improvements can be achieved by allowing a 'buffer' zone of a few samples both before and after the clicks detected as above. There are many ways for achieving this, including pre- and post-filtering of the prediction error or the detection vector with a non-linear 'pulse broadening' filter.

• Backward prediction error. Very often a good estimate can be obtained for the position of the start of a click, but it is more difficult to distinguish the end of a click, owing in part to the forward 'smearing' effect of the prediction error filter (see figure 5.12(a)). This can be alleviated somewhat by using information from the backward prediction error ε^b_t:

$$\epsilon^b_t = y_t - \sum_{i=1}^{P} a_i y_{t+i}$$

ε^b_t can often give clean estimates of the end point of a click, so some appropriate combination of ε_t and ε^b_t will improve detection capability (a small sketch follows this list).
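
As an illustration, one possible (hypothetical) combination of the forward and backward errors is their pointwise minimum, which suppresses the one-sided 'smearing' of each filter; the book leaves the precise combination open:

import numpy as np
from scipy.signal import lfilter

def forward_backward_error(y, a):
    """Combine forward and backward AR prediction errors."""
    A = np.concatenate(([1.0], -np.asarray(a)))
    e_f = lfilter(A, [1.0], y)                    # forward error eps_t
    e_b = lfilter(A, [1.0], y[::-1])[::-1]        # backward error eps^b_t
    return np.minimum(np.abs(e_f), np.abs(e_b))   # one possible combination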

5.3.1.3 Matched filter detector

An adaptation of the basic AR detection method, also devised by Vaseghi and Rayner, considers the impulse detection problem from a matched filtering perspective [186]. The 'signal' is the impulse itself, while the autoregressive audio data is regarded as coloured additive noise. The prediction error filter described above can then be viewed as a pre-whitening stage for the autoregressive noise, and the full matched filter is given by A(z)A(z^{−1}), a non-causal filter with 2P + 1 coefficients which can be realised with P samples of lookahead. The matched filtering approach provides additional amplification of impulses relative to the signal, but further reduces localisation of impulses for a given model order. Choice between the two methods will thus depend on the range of click amplitudes present in a particular recording and the degree of separation of individual impulses in the waveform. See figure 5.12(b) for an example of detection using the matched filter. Notice that the matched filter has highlighted a few additional impulse positions, but at the expense of a much more 'smeared' response which will make accurate localisation of the clicks more difficult. Hence the prediction-error detector is usually preferred in practice.

Both the prediction error detection algorithm and the matched filtering algorithm are efficient to implement and can be operated in real time using DSP microprocessors. Results of a very high standard can be achieved if a careful strategy is adopted for extracting the precise click locations from the detection signal.
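
In code, the matched filter detection signal can be formed by running the prediction error filter forwards and then backwards, giving the overall non-causal response A(z)A(z^{−1}); a sketch (names ours, thresholding strategy left open). Note that, unlike the backward prediction error above, the second pass here filters the prediction error rather than the raw signal:

import numpy as np
from scipy.signal import lfilter

def matched_filter_detect(y, a):
    """Detection signal for the matched filter A(z)A(1/z)."""
    A = np.concatenate(([1.0], -np.asarray(a)))
    eps = lfilter(A, [1.0], y)                 # pre-whitening stage A(z)
    d = lfilter(A, [1.0], eps[::-1])[::-1]     # anti-causal stage A(1/z)
    return np.abs(d)                           # threshold to locate impulses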

5.3.1.4 Other models

The principles of the AR-based detector can be extended to some of the other audio models which were used for interpolation. For example, an ARMA model can easily replace the AR model in the detection step by applying the inverse ARMA model to the corrupted data in place of the AR prediction error filter. We have found that this can give improved localisation of clicks, although it must of course be ensured that the ARMA model is itself invertible. The advantages of the sin+AR residual model can also be exploited by performing AR-based detection directly on the estimated AR residual. This can give a greater sensitivity to small clicks and crackles than is obtainable using the standard AR-based detector and also prevents some signal-damaging false alarms in the detection process.

5.4 Statistical methods for the treatment of clicks

The detection and replacement techniques described in the preceding sections can be combined to give very successful click concealment, as demonstrated by a number of research and commercial systems which are now used for the re-mastering of old recordings. However, some of the difficulties outlined above concerning the 'masking' of smaller defects by large defects in the detection process, the poor time localisation of some detectors in the presence of impulse 'bursts' and the inadequate performance of existing interpolation methods for certain signal categories, have led to further research which considers the problem from a more fundamental statistical perspective [76, 79, 83, 82]. These methods model explicitly both signal and noise sources, using rigorous statistical methods to perform more accurate and robust removal of clicks. The basis of these techniques can be found in chapters 9 - 12. The same framework may be extended to perform joint removal of clicks and background noise in one single procedure, and some recent work on this problem can be found in [81] for autoregressive signals and in [72, 73] for autoregressive moving-average (ARMA) signals.

A fully model based statistical methodology for treatment of audio defects (not just clicks and crackles) has many advantages. One significant drawback, however, is the increase in computational requirements which results from the more sophisticated treatment. This becomes less of a problem as the speed and memory size of computer systems improves.

5.5 Discussion

In this chapter we have seen many possible methods for interpolation and detection of clicks in audio signals. The question remains as to which can be recommended in the practical situation. In answer to this it should be noted that all of the methods given have their advantages for particular types of signal, but that not all will perform well with audio signals in their generality; this is because some of the models used are aimed at signals of a particular type, such as periodic or pitch-based. The most generally applicable of the methods are the autoregressive (AR) based techniques, which can be applied quite successfully to most types of audio. The basic AR scheme can, however, lead to audible distortion of strongly 'voiced' musical extracts such as brass and vocal music. We now briefly summarise the benefits and drawbacks of the other techniques described in the chapter.

The pitch based adaptation of the basic AR scheme can give considerably better reconstructions of pitched voice and instrumental waveforms, but requires a robust estimate of the pitch period and may not be successful in unvoiced sections. The AR + basis function interpolators are more generally applicable and a simple choice of sinusoidal basis leads to successful results and improved detection compared with a pure AR model. Random sampling methods as outlined here lead to a modest improvement in interpolation of long gap lengths. They are more important, however, as the basis of some of the more sophisticated statistical methods of later chapters. The same comments apply to the interpolation methods which incorporate a noise model. Sequential methods using the Kalman filter can be applied to all of the models considered in the chapter and are an implementational detail which may be important in certain real time applications. The ARMA model has been found to give some improvements in detection and interpolation, but the small improvements probably do not justify the extra computational load.

Statistical methods were briefly mentioned in the chapter and will be considered in more detail in chapters 9 - 12. These methods employ explicit models of both the signal and degradation process, which allows a rigorous model-based methodology to be applied to both detection and interpolation of clicks. We believe that methods such as these will be a major source of future advances in the area of click treatment.

6 Hiss Reduction

Random additive background noise is a form of degradation common to all analogue measurement, storage and recording systems. In the case of audio signals the noise, which is generally perceived as 'hiss' by the listener, will be composed of electrical circuit noise, irregularities in the storage medium and ambient noise from the recording environment. The combined effect of these sources will generally be treated as one single noise process, although we note that a pure restoration should strictly not treat the ambient noise, which might be considered as a part of the original 'performance'. Random noise generally has significant components at all audio frequencies, and thus simple filtering and equalisation procedures are inadequate for restoration purposes.

Analogue tape recordings typically exhibit noise characteristics which are stationary and for most purposes white. At the other end of the scale, many early 78rpm and cylinder recordings exhibit highly non-stationary coloured noise characteristics, such that the noise can vary considerably within each revolution of the playback system. This results in the characteristic 'swishing' effect associated with some early recordings. In recording media which are also affected by local disturbances, such as clicks and low frequency noise resonances, standard practice is to restore these defects prior to any background noise treatment.

Noise reduction has been of great importance for many years in engineering disciplines. The classic least-squares work of Norbert Wiener [199] placed noise reduction on a firm analytic footing, and still forms the basis of many noise reduction methods. In the field of speech processing a large number of techniques has been developed for noise reduction, and many of these are more generally applicable to noisy audio signals. We do not attempt here to describe every possible method in detail, since these are well covered in speech processing texts (see for example [112, 110] and numerous subsequent texts). We do, however, consider some standard approaches which are appropriate for general audio signals and emerging techniques which are likely to be of use in future work. It is worth mentioning that where methods are derived from speech processing techniques, as in for example the spectral attenuation methods of section 6.1, sophisticated modifications are required in order to match the stringent fidelity requirements and signal characteristics of an audio restoration system.

6.1 Spectral domain methods

Certainly the most popular methods for noise reduction in audio signals to date are based upon short-time processing in the spectral domain. The reason for the success of these methods is that audio signals are usually composed of a number of line spectral components which correspond to fundamental pitch and partials of the notes being played. Although these line components are time-varying they can be considered as fixed over a short analysis window with a duration of perhaps 0.02 seconds or more. Hence analysing short blocks of data in the frequency domain will concentrate the signal energy into relatively few frequency 'bins' with high signal-to-noise ratio. This means that noise reduction can be performed in the frequency domain while maintaining the important parts of the signal spectrum largely unaffected. Processing is performed in a transform domain, usually the discrete Fourier transform (DFT) Y(n,m) of sub-frames in the observed data y_n:

$$Y(n,m) = \sum_{l=0}^{N-1} g_l\, y_{nM+l} \exp(-jlm2\pi/N), \quad m = 0, \ldots, N-1.$$

Index n is the sub-frame number within the data and m is the frequency component number. g_l is a time domain pre-windowing function with suitable frequency domain characteristics, such as the Hanning or Hamming window. N is the sub-frame length and M < N is the number of samples between successive sub-frames. Typically M = N/2 or M = N/4 to allow a significant overlap between successive frames. A suitable value of N which encompasses a large enough window of data while not seriously violating the assumption of short-term fixed frequency components will be chosen, and N = 1024 or N = 2048 are appropriate choices which allow calculation with the FFT. Cappe [30, 32] has performed analysis which justifies the use of block lengths of this order.

FIGURE 6.1. (a) Input sine wave x_t (b) Pre-/post-windowing functions g_l and h_l (c) Pre-windowed data (d) Reconstruction without gain compensation (e) Gain compensation window w_l (f) Reconstructed sine wave.

Processing is then performed on the spectral components Y(n,m) in order to estimate the spectrum of the 'clean' data X(n,m):

$$\hat{X}(n,m) = f(Y(n,m))$$

where f(·) is a function which performs noise reduction on the spectral components. We will discuss some possible forms for f(·) in the next section.

The estimated spectrum X̂(n,m) is then inverse DFT-ed to obtain a time domain signal estimate for sub-frame n at sample number nM + l:

$$\hat{x}^n_{nM+l} = h_l\, \frac{1}{N} \sum_{m=0}^{N-1} \hat{X}(n,m) \exp(jlm2\pi/N), \quad l = 0, \ldots, N-1$$

where h_l is a post-processing window function which is typically tapered to zero at its ends to ensure continuity of restored output. Again, Hamming or Hanning windows would be suitable choices here.

Finally, the reconstructed signal in sub-frame n is obtained by an overlap-add method:

$$\hat{x}_{nM+l} = w_l \sum_{m \in \mathcal{M}} \hat{x}^m_{nM+l}$$

where \mathcal{M} = \{m : mM \leq nM + l < mM + N\}, and w_l is a weighting function which compensates for the overall gain of the pre- and post-windowing functions g_l and h_l:

$$w_l = \frac{1}{\sum_{m \in \mathcal{M}_0} g_{mM+l}\, h_{mM+l}}$$

and \mathcal{M}_0 = \{m : 0 \leq mM + l < N\}. The summation sets in these two expressions are over all adjacent windows which overlap with the current output sample number.

This time domain pre-windowing, post-windowing and gain compensation procedure is illustrated for Hanning windows with 50% block overlap (M = N/2) in figure 6.1. The processing function is switched off, i.e. f(Y) = Y, so we wish in this case to reconstruct the input signal unmodified at the output.
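
The complete analysis, processing and resynthesis loop can be sketched as follows; a minimal Python version under stated assumptions (names are ours, g_l = h_l = Hanning, M = N/2, crude edge handling), which reproduces figure 6.1 when f is the identity:

import numpy as np

def spectral_process(y, f, N=1024, M=512):
    """Windowed DFT analysis, processing by f, and overlap-add resynthesis
    with gain compensation, following the equations above."""
    g = np.hanning(N)                        # pre-window g_l
    h = np.hanning(N)                        # post-window h_l
    x = np.zeros(len(y))
    gain = np.zeros(len(y))
    for n in range((len(y) - N) // M + 1):
        Y = np.fft.fft(g * y[n*M:n*M+N])     # Y(n, m)
        xn = np.real(np.fft.ifft(f(Y)))      # estimated sub-frame
        x[n*M:n*M+N] += h * xn               # post-window and overlap-add
        gain[n*M:n*M+N] += g * h             # accumulated window gain
    gain[gain == 0] = 1.0                    # crude handling of the edges
    return x / gain                          # gain compensation w_l

With processing switched off, f = lambda Y: Y, the input is reconstructed unmodified, as in figure 6.1.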

6.1.1 Noise reduction functions

The previous section has described a basic scheme for analysis, processing and resynthesis of noisy audio signals using the DFT, a similar method to that first presented by Allen [4]. We have not yet, however, detailed the type of processing functions f(·) which might be used to perform noise reduction for the spectral components. Many possible variants have been proposed in the literature, some based on heuristic ideas and others on a more rigorous basis such as Wiener, maximum likelihood or maximum a posteriori estimation. See [20] for a review of the various suppression rules from a statistical perspective.

6.1.1.1 The Wiener solution

If the noise is assumed to be additive and independent of the signal, then the frequency domain Wiener filter which minimises the mean-squared error of the time domain reconstruction is given by [148]:

$$H(\omega) = \frac{S_X(\omega)}{S_X(\omega) + S_N(\omega)}$$

where S_X(ω) is the power spectrum of the signal and S_N(ω) is the power spectrum of the noise. If this gain can be calculated at the frequencies corresponding to the DFT bins then it can be applied as a noise reduction filter in the DFT domain. If we interchange the continuous frequency variable ω with the DFT bin number m the following 'pseudo-Wiener' noise reduction rule is obtained:

$$f(Y(m)) = \frac{S_X(m)}{S_X(m) + S_N(m)}\, Y(m)$$

where we have dropped the sub-frame variable n for convenience.

The problem here is that we do not in general know any of the required terms in this equation except for the raw DFT data Y(m). However, in many cases it is assumed that S_N(m) is known and it only remains to obtain S_X(m) (S_N(m) can often be estimated from 'silent' sections where no music is playing, for example). A very crude estimate for S_Y(m), the power spectrum of the noisy signal, can be obtained from the amplitude squared of the raw DFT components |Y(m)|². An estimate for the signal power spectrum is then:

$$\hat{S}_X(m) = \begin{cases} |Y(m)|^2 - S_N(m), & |Y(m)|^2 > S_N(m) \\ 0, & \text{otherwise} \end{cases}$$

where the second case ensures that the power spectrum estimate is always greater than or equal to zero, leading to a noise reduction function of the form:

$$f(Y(m)) = \begin{cases} \frac{|Y(m)|^2 - S_N(m)}{|Y(m)|^2}\, Y(m), & |Y(m)|^2 > S_N(m) \\ 0, & \text{otherwise} \end{cases}$$

Defining the power signal-to-noise ratio at each frequency bin to be ρ(m) = (|Y(m)|² − S_N(m))/S_N(m), this function can be rewritten as:

$$f(Y(m)) = \begin{cases} \frac{\rho(m)}{1+\rho(m)}\, Y(m), & \rho(m) > 0 \\ 0, & \text{otherwise} \end{cases}$$

Notice that the noise reducer arising from the Wiener filter introduces zero phase shift: the restored phase equals the phase of the corrupted signal. Even though it is often quoted that the ear is insensitive to phase, this only applies to time-invariant signals [137]. Thus it can be expected that phase-related modulation effects will be observed when this and other 'phase-blind' processes are applied in the time-varying environment of real audio signals.

6.1.1.2 Spectral subtraction and power subtraction

The Wiener solution has the appeal of being based upon a well-defined optimality criterion. Many other criteria are possible, however, including the well known spectral subtraction and power subtraction methods [19, 154, 16, 125, 112].

In the spectral subtraction method an amount of noise equal to the root mean-squared noise in each frequency bin, i.e. S_N(m)^{1/2}, is subtracted from the spectral amplitude, while once again the phase is left untouched. In other words we have the following noise reduction function:

$$f(Y(m)) = \begin{cases} \frac{|Y(m)| - S_N(m)^{1/2}}{|Y(m)|}\, Y(m), & |Y(m)|^2 > S_N(m) \\ 0, & \text{otherwise} \end{cases}$$

Rewriting as above in terms of the power signal-to-noise ratio ρ(m) we obtain:

$$f(Y(m)) = \begin{cases} \left(1 - \frac{1}{(1+\rho(m))^{1/2}}\right) Y(m), & \rho(m) > 0 \\ 0, & \text{otherwise} \end{cases}$$

Another well known variant upon the same theme is the power subtraction method, in which the restored spectral power is set equal to the power of the noisy input minus the expected noise power:

$$f(Y(m)) = \begin{cases} \left(\frac{|Y(m)|^2 - S_N(m)}{|Y(m)|^2}\right)^{1/2} Y(m), & |Y(m)|^2 > S_N(m) \\ 0, & \text{otherwise} \end{cases}$$

and in terms of the power signal-to-noise ratio:

$$f(Y(m)) = \begin{cases} \left(\frac{\rho(m)}{1+\rho(m)}\right)^{1/2} Y(m), & \rho(m) > 0 \\ 0, & \text{otherwise} \end{cases}$$

The gain applied here is the square root of the Wiener gain.

The gain curves for Wiener, spectral subtraction and power subtraction methods are plotted in figure 6.2 as a function of ρ(m). The output amplitude as a function of input amplitude is plotted in figure 6.3 with S_N(m) = 1. It can be seen that spectral subtraction gives the most severe suppression of spectral amplitudes, power subtraction the least severe and the Wiener gain lies between the two. The Wiener and power subtraction rules tend much more rapidly towards unity gain at high input amplitudes. In our experience the audible differences between the three rules are quite subtle for audio signals, especially when some of the modifications outlined below are incorporated, and the Wiener rule is usually an adequate compromise.
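
The three rules can be written compactly as gain functions of ρ(m); a minimal Python sketch, with our own function and argument names:

import numpy as np

def suppression_gain(Y, S_N, rule="wiener"):
    """Gain for the Wiener, power subtraction and spectral subtraction
    rules, as functions of the power SNR rho(m)."""
    rho = np.maximum((np.abs(Y)**2 - S_N) / S_N, 0.0)  # clipped at zero
    if rule == "wiener":
        return rho / (1.0 + rho)
    if rule == "power":
        return np.sqrt(rho / (1.0 + rho))
    if rule == "spectral":
        return 1.0 - 1.0 / np.sqrt(1.0 + rho)
    raise ValueError(rule)

The restored spectrum is then gain × Y(m), with the phase left untouched; clipping ρ at zero implements the 'otherwise 0' branch of each rule.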

The de-noising procedure is now illustrated with a synthetic example in which a music signal (figure 6.4(a)) is artificially corrupted with additive white Gaussian noise (figure 6.4(b)). Figure 6.4(c) shows the noisy input after pre-windowing with a Hanning window. The data is fast Fourier transformed (FFT-ed) to give figure 6.5 in which the clean data FFT (figure 6.5(a)) is included for comparison. Note the way in which the noise floor has been elevated in the noisy case (figure 6.5(b)). Only the first 200 frequency bins have been plotted, a frequency range of roughly 8.5kHz, as there is little signal information to be seen at higher frequencies for this example. Figure 6.6 shows the filter gain calculated according to the three criteria described above. Note the more severe attenuating effect of the spectral subtraction method. This can also be seen in the reconstructed outputs of figure 6.7, in which there is clearly a progression in the amount of residual noise left in the three methods (see the next section for a discussion of residual noise effects).

6.1.2 Artefacts and ‘musical noise’

The above methods can all lead to significant reduction in background noise in audio recordings. However, there are several significant drawbacks which inhibit the practical application of these techniques without further modification. The main drawbacks are in the residual noise artefacts, the most annoying of which is known variously as 'musical noise', 'bird-song' or 'tonal noise'. We will describe this effect and suggest some ways to eliminate it, noting that attempts to remove these artefacts are likely to involve a trade-off with the introduction of other distortions into the audio.

Musical noise arises from the randomness inherent in the crude estimate for the signal power spectrum which is used in the basic form of these methods, i.e. Ŝ_X(m) = |Y(m)|² − S_N(m). Clearly the use of the raw input spectral amplitude will lead to inaccuracies in the filter gain owing to the random noise and signal fluctuations within each frequency band. This can lead to over-estimates for S_X(m) in cases where |Y(m)| is too high and under-estimates when |Y(m)| is too low, and a corresponding random distortion of the sound quality in the restored output. The generation of musical noise can be seen most clearly in the case where no signal is present (S_X(m) = 0) and the filter gain would ideally then be zero across all frequency bins. To see what actually occurs we use another synthetic example, in which a pure white Gaussian noise signal is de-noised using the Wiener suppression rule. In order to illustrate the effect best the noise power spectrum is over-estimated by a factor of two, which leads to a smaller amount of residual noise but greater distortion to the musical signal components.

FIGURE 6.2. Filter gain at frequency bin m as a function of power signal-to-noise ratio ρ(m).

FIGURE 6.3. Filter output amplitude at frequency bin m as a function of input amplitude |Y(m)|.

FIGURE 6.4. (a) Clean input signal x_n, (b) Noisy input signal y_n, (c) Pre-windowed input g_l y_n.

FIGURE 6.5. (a) FFT of clean data |X(m)|, (b) FFT of noisy data |Y(m)|.

FIGURE 6.6. Filter gain for (a) Spectral subtraction, (b) Wiener filter, (c) Power subtraction.

FIGURE 6.7. Estimated signal prior to post-windowing and overlap-add resynthesis: (a) Spectral subtraction, (b) Wiener filter, (c) Power subtraction.

Figure 6.8 shows the input noise signal, the pre-windowed noise signal and the Wiener estimated signal corresponding to this noise signal. The noise is well attenuated (roughly 10dB of attenuation can be measured) but as before a residual noise is clearly visible. Ideally we would like the residual noise to have a pleasant-sounding time-invariant characteristic. However, figure 6.9 shows that this is not the case. Figure 6.9(a) shows the spectral magnitude of the noise input signal and figure 6.9(b) gives the corresponding noise reduced spectrum. It can be seen that while low amplitude noise components have been attenuated to zero, as desired, the high amplitude components have remained largely unsuppressed, owing to the non-linear nature of the suppression rule. These components are typically isolated in the spectrum from one another and will thus be perceived as 'tones' in the restored output. When considered from frame to frame in the data these isolated components will move randomly around the spectrum, leading to rapidly time-varying bursts of tonal or 'musical' noise in the restoration. This residual can be more unpleasant than the original unattenuated noise, and we now go on to discuss ways for alleviating the problem.

6.1.3 Improving spectral domain methods

We have already stated that the main cause of musical noise is random fluctuations in the estimate of Ŝ_X(m), the signal power spectrum. Clearly one way to improve the situation will be to make a more statistically stable estimate of this term for each frame. Another way to improve the quality of restoration will be to devise alternative noise suppression rules based upon more appropriate criteria than those described above. There is a large number of techniques which have been proposed for achieving either or both of these objectives, mostly aimed at speech enhancement, and we do not attempt to describe all of them here. Rather we describe the main principles involved and list some of the key references. For a range of audio applications based upon these techniques and their variants the reader is referred to [109, 138, 187, 190, 189, 185, 58, 116].

6.1.3.1 Eliminating musical noise

The simplest way to eliminate musical noise is to over-estimate the noise power spectrum, see e.g. [187, 190, 16, 17, 116], in a similar (but more extreme) way as used in the results of figure 6.8. To achieve this we simply replace S_N(m) with αS_N(m), where α > 1 is typically between 3 and 6 for noisy audio. Of course, such a procedure simply trades off improvements in musical noise with distortion to the musical signal. Another simple idea leaves a noise 'floor' in the spectrum, masking the isolated noise peaks which lead to tonal noise. The introduction of a noise floor means less noise reduction, but this might be acceptable depending upon the application and noise levels in the original recording.

FIGURE 6.8. (a) Input noise signal y_n, (b) Pre-windowed input g_l y_n, (c) Wiener estimated output signal.

FIGURE 6.9. (a) FFT of noisy data |Y(m)|, (b) FFT of noise reduced data |f(Y(m))|.

A first means of achieving this [16] simply limits the restored spectral amplitude to be no less than βS_N(m)^{1/2}, where 0 < β < 1 determines the required noise attenuation. The resulting residual sounds slightly unnatural, and a better scheme places a lower limit on the filter gain [16, 17, 116]. The Wiener suppression rule would become, for example:

$$f(Y(m)) = \max\left(\alpha, \frac{\rho(m)}{1+\rho(m)}\right) Y(m)$$

The best means for elimination of musical noise to date, however, em-ploy temporal information from surrounding data sub-frames. It is highlyunlikely that a random noise peak will occur at the same frequency inadjacent sub-frames, while tonal components which are genuinely presentin the music are likely to remain at the same frequency for several sub-frames. Thus some degree of linear or non-linear processing can be usedto eliminate the noise spikes while retaining the more slowly varying sig-nal components intact. In [19] a spectral magnitude averaging approach issuggested in which the raw speech amplitudes |Y (n,m)| are averaged overseveral adjacent sub-frames n at each frequency bin m. This helps some-what in the stability of signal power spectral estimates, but will smoothout genuine transients in the music without leading to a very significant re-duction in musical noise. A more successful approach from the same paperuses the minimum estimated spectral amplitude from adjacent sub-framesas the restored output, although this can again lead to a ‘dulling’ of thesound quality. One simple technique which we have found to be effectivewithout any serious degradation of the sound quality involves taking a me-dian of adjacent blocks rather than an arithmetic mean or minimum value.This eliminates musical noise quite successfully and the effect on the signalquality is less severe than taking a minimum amplitude. Other techniquesinclude non-linear ‘masking’ out of frequency components whose tempo-ral neighbours are of low amplitude (and hence can be classified as noise)[189] and applying frequency domain de-noising to the temporal envelopeof individual frequency bins [29].

In the same way that there are many possible methods for performing click detection, there is no end to the alternative schemes which can be devised for refining the hiss reduction performance based upon temporal spectral envelopes. In practice a combination of several techniques and some careful algorithmic tuning will be necessary to achieve the best performance.

6.1.3.2 Advanced noise reduction functions

In addition to the basic noise suppression rules (Wiener, power/spectral subtraction) described so far, various alternative rules have been proposed, based upon criteria such as maximum likelihood [125] and minimum mean-squared error [54, 55, 56]. In [125] a maximum likelihood estimator is derived for the spectral amplitude components under Gaussian noise assumptions, with the complex phase integrated out. This estimator is extended in [55] to the minimum mean-square error estimate for the amplitudes, which incorporates Gaussian a priori assumptions for the signal components in addition to the noise. Both of these methods [125, 55] also incorporate uncertainty about the presence of signal components at a given frequency, but rely on the assumption of statistical independence between spectral components at differing frequencies. In fact, the Ephraim and Malah method [55] is well known in that it does not generate significant levels of musical noise. This can be attributed to a time domain non-linear smoothing filter which is used to estimate the a priori signal variance [31, 84].

6.1.3.3 Psychoacoustical methods

A potentially important development in spectral domain noise reduction is the incorporation of the psychoacoustical properties of human hearing. It is clearly sub-optimal to design algorithms based upon mathematical criteria which do not account for the properties of the listener. In [27, 28] a method is derived which attempts to mimic the pre-filtering of the auditory system for hearing-impaired listeners, while in [183, 116] simultaneous masking results of the human auditory system [137] are employed to predict which parts of the spectrum need not be processed, hence leading to improved fidelity in the restored output as perceived by the listener. These methods are in their infancy and do not yet incorporate other aspects of the human auditory system such as non-simultaneous masking, but it might be expected that noise reducers and restoration systems of the future will take fuller account of these features to their benefit.

6.1.4 Other transform domain methods

So far we have assumed a Fourier transform-based implementation of spectral noise reduction methods. It is possible to apply many of these methods in other domains which may be better tuned to the non-stationary and transient character of audio signals. Recent work has seen noise reduction performed in alternative basis expansions, in particular the wavelet domain [134, 185, 14] and sub-space representations [44, 57].

6.2 Model-based methods

The spectral domain methods described so far are essentially non-parametric, although some include assumptions about the probability distribution of signal and noise spectral components. In the same way as for click reduction, it should be possible to achieve better results if an appropriate signal model can be incorporated into the noise reduction procedure; however, noise reduction is an especially sensitive operation and the model must be chosen very carefully. In the speech processing field Lim and Oppenheim [111] studied noise reduction using an autoregressive signal model, deriving iterative MAP and ML procedures. These methods are computationally intensive, although the signal estimation part of the iteration is shown to have a simple frequency-domain Wiener filtering interpretation (see also [147, 107] for Kalman filtering realisations of the signal estimation step). It is felt that new and more sophisticated model-based procedures may provide noise reducers which are competitive with the well-known short-time Fourier based methods. In particular, modern Bayesian statistical methodology for solution of complex problems [172, 68] allows for more realistic signal and noise modelling, including non-Gaussianity, non-linearity and non-stationarity. Such a framework can also be used to perform joint restoration of both clicks and random noise in one single process. A Bayesian approach to this joint problem using an autoregressive signal model is described in [70] and in chapter 9 (the 'noisy data' case), while [141, 142] present an extended Kalman filter for ARMA-modelled audio signals with the same type of degradation. See also [81, 72, 73] for some recent work in this area which uses Markov chain Monte Carlo (MCMC) methods to perform joint removal of impulses and background noise for both AR and autoregressive moving-average (ARMA) signals. Chapter 12 describes in detail the principles behind these methods. The standard time series models have yielded some results of a good quality, but it is anticipated that more sophisticated and realistic models will have to be incorporated in order to exploit the model-based approach fully.

6.2.1 Discussion

A number of noise reduction methods have been described, with particular emphasis on the short-term spectral methods which have been most popular to date. However, it is expected that new statistical methodology and rapid increases in readily-available computational power will lead in the future to the use of more sophisticated methods based on realistic signal modelling assumptions and perceptual optimality criteria.

Part III

Advanced Topics

7 Removal of Low Frequency Noise Pulses

A problem which is common to several recording media, including gramophone discs and optical film sound tracks, is that of low frequency noise pulses. This form of degradation is typically associated with large scratches or even breakages in the surface of a gramophone disc. The precise form of the noise pulses depends upon the characteristics of the playback system but a typical waveform is shown in figure 7.1. A large discontinuity is observed followed by a decaying low frequency transient. The noise pulses appear to be additively superimposed upon the undistorted signal waveform (see figure 7.2).

In this chapter we briefly review some early digital technology for removal of these defects, the templating method [187, 190], noting that problems can arise with these methods since they rely on little or no variability of pulse waveforms throughout a recording. This can be quite a good assumption when defects are caused by single breakages or radial scratches on a disc recording. However pulse shapes are much less consistent when defects are caused by randomly placed scratches in disc recordings or optical film sound tracks. The templating method then becomes very difficult to apply. Many different templates are required to cover all possible pulse shapes, and automation of such a system is not straightforward. Thus it seems better to form a more general model for the noise pulses which is sufficiently flexible to allow for all the pulses encountered on a particular recording. In this chapter we propose the use of an AR model for the noise pulses, driven by impulsive excitation. An algorithm is initially derived for the separation of additively superimposed AR signals. This is then modified to give a restoration algorithm suited to our assumed models for signal and noise pulses.

7.1 Existing methods

Low frequency noise pulses appear to be the response of the pick-up system to extreme step-like or impulsive stimuli caused by breakages in the groove walls of gramophone discs or large scratches on an optical film sound track. The audible effect of this response is a percussive 'pop' or 'thump' in the recording. This type of degradation is often the most disturbing artefact present in a given extract. It is thus highly desirable to eliminate noise pulses as a first stage in the restoration process.

Since the majority of the noise pulse is of very low frequency it might be thought that some kind of high pass filtering operation would remove the defect. Unfortunately this does not work well since the discontinuity at the front of the pulse has significant high frequency content. Some success has been achieved with a combination of localised high pass filtering and interpolation to remove discontinuities. However it is generally found that significant artefacts remain after processing or that the low frequency content of the signal has been damaged.

The only existing method known to the authors is that of Vaseghi and Rayner [187, 190]. This technique, which employs a template for the noise pulse waveform, has been found to give good results for many examples of broken gramophone discs. The observation was made that the low frequency sections of successive occurrences of noise pulses in the same recording were nearly identical in envelope (to within a scale factor). This observation is true particularly when the noise pulses correspond to a single fracture running through successive grooves of a disc recording. Given the waveform of the repetitive section of the noise pulse (the 'template' t_m) it is then possible to subtract appropriately scaled versions from the corrupted signal y_n wherever pulses are detected. Any remaining samples close to the discontinuity which are irrevocably distorted can then be interpolated using a method such as the LSAR interpolator discussed earlier.

The template t_m is obtained by long term averaging of many such pulses in the corrupted waveform. The correct position and scaling for each individual pulse are obtained by cross-correlating the template with the corrupted signal.
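
The position and scale estimation can be sketched as follows; a minimal Python version for a single pulse, with a least-squares amplitude fit (the function name is ours; detection thresholds and interpolation of the obliterated discontinuity region are omitted):

import numpy as np

def subtract_template(y, t):
    """Locate one noise pulse by cross-correlating the template t with the
    corrupted signal y, fit its scale by least squares, and subtract it."""
    corr = np.correlate(y, t, mode="valid")
    pos = int(np.argmax(np.abs(corr)))          # best-matching position
    scale = corr[pos] / np.dot(t, t)            # least-squares amplitude
    restored = y.copy()
    restored[pos:pos + len(t)] -= scale * t     # remove the scaled template
    return restored, pos, scale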

The template method is limited in several important ways which prevent automation of pulse removal. While the assumption of constant template shape is good for short extracts with periodically recurring noise pulses (e.g. in the case of a broken gramophone disc) it is not a good assumption for many other recordings. Even where noise pulses do correspond to a single radial scratch or fracture on the record the pulse shape is often found to change significantly as the recording proceeds, while much more variety is found where pulses correspond to 'randomly' placed scratches and breakages on the recording. Further complications arise where several pulses become superimposed as is the case for several closely spaced scratches.

FIGURE 7.1. Noise pulse from broken gramophone disc ('lead-in' groove).

FIGURE 7.2. Signal waveform degraded by additive noise pulse.


Correct detection is also a problem. This may seem surprising since the defect is often very large relative to the signal. However, audible noise pulses do occur in high amplitude sections of the signal. In such cases the cross-correlation method of detection can give false alarms from low frequency components in the signal; in other circumstances noise pulses can be missed altogether. This is partly as a result of colouration of the signal which renders the cross-correlation procedure sub-optimal. A true matched filter for the noise pulse would take into account the signal correlations (see e.g. [186]) and perhaps achieve some improvements in detection. However this issue is not addressed here since the other limitations of the templating method are considered too severe. Rather, in the remainder of this chapter we derive a more general restoration system for noise pulses which is designed to cope with situations where pulses are of varying shape and at random locations. With such an approach we aim to restore a wider range of degraded material, including optical sound tracks, with little or no user intervention.

7.2 Separation of AR processes

Consider firstly the problem of separating two additively superimposed AR processes. This will provide the basis for the subsequent noise pulse restoration algorithm. A probabilistic derivation is given here for consistency with the remainder of the book, although a least squares analysis yields similar results.

A data block y with N samples is expressed as the sum of two constituent parts x1 and x2:

y = x1 + x2. (7.1)

We wish to estimate either x1 or x2 from the combined signal y under the assumption that both constituent signals are drawn from AR processes. The parameters and excitation variances a1, σ²_e1 and a2, σ²_e2 for the two AR processes are assumed known. The two signals x1 and x2 are assumed independent, and their individual PDFs can be written in the usual fashion (see (4.54)) under the Gaussian independence assumption as

p_{x_1}(x_1) = \frac{1}{(2\pi\sigma_{e1}^2)^{(N-P_1)/2}} \exp\left( -\frac{1}{2\sigma_{e1}^2}\, x_1^T A_1^T A_1\, x_1 \right)    (7.2)

p_{x_2}(x_2) = \frac{1}{(2\pi\sigma_{e2}^2)^{(N-P_2)/2}} \exp\left( -\frac{1}{2\sigma_{e2}^2}\, x_2^T A_2^T A_2\, x_2 \right)    (7.3)


where P_i is the AR model order for the ith process (i = 1, 2) and A_i is the prediction matrix containing the AR model coefficients from the ith model. These PDFs are, as before, strictly conditional upon the first P_i data samples for each signal component; this approximation is maintained for the sake of notational simplicity. Note, however, that as before the exact likelihoods can be incorporated by relatively straightforward modifications to the first P_i elements of the first P_i rows of the matrices A_i^T A_i (see appendix C).

Suppose we wish to find the MAP solution for x1. The probability of the combined data conditional upon the first component signal is given by:

p(y | x1) = px2(y − x1). (7.4)

We then obtain the posterior distribution for x1 using Bayes’ theorem as

p(x1 | y) ∝ p(y | x1) px1(x1)

∝ px2(y − x1) px1(x1) (7.5)

where the term p(y) can as usual be ignored as a normalising constant. Substituting for the terms in (7.5) using (7.2) and (7.3) then gives

p(x_1 | y) \propto \frac{ \exp\left( -\frac{1}{2\sigma_{e2}^2}\, (y - x_1)^T A_2^T A_2\, (y - x_1) - \frac{1}{2\sigma_{e1}^2}\, x_1^T A_1^T A_1\, x_1 \right) }{ (2\pi\sigma_{e1}^2)^{(N-P_1)/2}\, (2\pi\sigma_{e2}^2)^{(N-P_2)/2} }    (7.6)

and the MAP solution vector x_1^{MAP} is then obtained as the solution of:

\left( \frac{A_1^T A_1}{\sigma_{e1}^2} + \frac{A_2^T A_2}{\sigma_{e2}^2} \right) x_1^{MAP} = \frac{A_2^T A_2}{\sigma_{e2}^2}\, y.    (7.7)

This separation equation requires knowledge of both AR processes and as such cannot straightforwardly be applied in general ‘blind’ signal separation applications. Nevertheless, we will now show that a modified form of this approach can be applied successfully to the practical case of separating an audio signal from transient noise pulses.
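To make the separation equation concrete, a direct solution of (7.7) for a short block might be sketched as follows (Python/NumPy; the function and variable names are our own illustration rather than code from the book, and the dense solve is only sensible for modest block sizes):

import numpy as np

def prediction_matrix(a, N):
    # Rows compute the AR excitation e[n] = x[n] - sum_k a[k] x[n-k]
    # for n = P..N-1, so the matrix has shape (N-P, N).
    P = len(a)
    A = np.zeros((N - P, N))
    for row, n in enumerate(range(P, N)):
        A[row, n] = 1.0
        A[row, n - P:n] = -np.asarray(a)[::-1]
    return A

def separate_ar(y, a1, var_e1, a2, var_e2):
    # MAP separation of y = x1 + x2 into two AR processes, equation (7.7).
    N = len(y)
    A1 = prediction_matrix(a1, N)
    A2 = prediction_matrix(a2, N)
    lhs = A1.T @ A1 / var_e1 + A2.T @ A2 / var_e2
    rhs = (A2.T @ A2 / var_e2) @ y
    x1 = np.linalg.solve(lhs, rhs)
    return x1, y - x1      # estimates of the two constituent signals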

7.3 Restoration of transient noise pulses

Examination of figures 7.1 and 7.2 indicates that the noise pulse is composed of a high amplitude impulsive start-up transient followed by a low frequency decaying oscillatory tail, consistent with a ‘free-running’ system (i.e. a resonant system with no excitation). The oscillatory tail appears to be additive to the true audio signal, while the impulsive start may well obliterate the underlying signal altogether for a period of many samples (50-100 samples is typical).


This noise transient is interpreted as being the response of the mechanical playback apparatus to highly impulsive or step-like stimuli, such as would occur when a complete breakage or deep gouge is encountered by the stylus in the groove wall. The way such discontinuities are generated, in contrast with the way the cutter records the true audio signal onto the disc, means that the damaged grooves can have a much higher rate of rise or fall than could ever be encountered from the standard recording procedure (and most likely a higher rate of change than the subsequent electronic circuitry is capable of reproducing). It is postulated that these especially sharp step-like discontinuities excite the low frequency resonance observed with some scratches, while other clicks and scratches with less sharp transitions do not excite the same degree of resonance. The frequency of oscillation is consistent with the resonant frequency of the tone arm apparatus (15-30Hz, see [133]), so this may well be the source of the resonance. We do not verify this here from mechanics theory or experiments, but rather concern ourselves with the modelling of the noise transients in the observed waveform.

It can be seen clearly from figure 7.1 that the frequency of resonance decreases significantly as the transient proceeds. This is typical of noise pulses from both gramophone recordings and optical film sound tracks. Attempts to model the pulse as several superimposed decaying vibration modes with different frequencies have proved largely unsuccessful, and it seems more likely that there is non-linearity in the mechanical system which would account for this effect. This kind of change in resonant frequency is consistent with a ‘stiffening’ spring system, in which a spring-type mechanical element has increasing stiffness with increasing displacement (see e.g. [90]). Exploratory experiments indicate that discrete low order (2-3) non-linear AR models based on a reduced Volterra expansion (see e.g. [18]) lead to significantly better whitening of the prediction error than a linear AR model with the same number of coefficients. This requires further investigation and is left as future work. Here, for the sake of achieving analytic results (and hence a practical restoration system), we adopt the linear AR model as an approximation to the true mechanical system. The impulsive start-up section is modelled as a high-amplitude impulsive input to the system, and a low level white noise excitation is assumed elsewhere to allow for modelling inaccuracies resulting from the linear approximation.

7.3.1 Modified separation algorithm

The proposed model for noise transient degradation is similar to the ‘noisy data’ model for standard click degradation (see section 9.1), in which a noise signal is generated by switching between two noise processes n^1_m and n^0_m according to the value of a binary process i_m. The high amplitude noise process n^1_m corresponds to the high level impulsive input (i.e. the start-up transient), while the low level noise process n^0_m represents the low level noise excitation mentioned above. The low frequency transient model differs from the standard click model in that the switched noise process is then treated as the excitation to the AR process with parameters a2. The output of this process is then added to the input data signal. Figure 7.3 gives a schematic form for the noise model. Since this model does not specify when the high amplitude impulsive excitation process n^1_m is active, it is general enough to include the problematic scenario of many large scratches close together, causing ‘overlapping’ noise pulses in the degraded audio waveform. We now consider modifications to the AR-based separation algorithm given in the previous section. These will lead to an algorithm for restoring signals degraded by noise from the proposed model.

FIGURE 7.3. Schematic model for noise transients. [Block diagram: a binary switch i_m selects between noise amplitude processes n^0_m and n^1_m; the selected excitation n_m drives the AR filter A_2(z) to produce x^2_m, which is added to the signal x^1_m to give the observed y_m.]

In the simple AR-based separation algorithm (7.7) a constant variance Gaussian white excitation was assumed for both AR processes. However, the model discussed above indicates a switched process for the excitation to AR process a2, depending on the current ‘switch value’ i_m. We define a vector i, similar to the detection vector of chapter 9, which contains a ‘1’ wherever i_m = 1, i.e. wherever the high amplitude noise process is active, and a ‘0’ otherwise. This vector gives the switch position at each sampling instant and hence determines which of the two noise processes is applied to the AR filter a2. If the two noise processes are modelled as Gaussian and white with variances σ²_n0 and σ²_n1, the correlation matrix for the excitation sequence to the second AR model a2 is a diagonal matrix Λ. The diagonal elements λ_m of Λ are given by

\lambda_m = \sigma_{n0}^2 + i_m (\sigma_{n1}^2 - \sigma_{n0}^2).    (7.8)

The PDF for x2, the ‘noise’ signal, is then modified from the expression of (7.3) to give

p_{x_2}(x_2) = \frac{1}{(2\pi)^{(N-P_2)/2}\, |\Lambda|^{1/2}} \exp\left( -\frac{1}{2}\, x_2^T A_2^T \Lambda^{-1} A_2\, x_2 \right).    (7.9)

The derivation for the modified separation algorithm then follows the same steps as for the straightforward AR separation case, substituting the modified PDF for x2 and leading to the following posterior distribution,

p(x_1 | y) \propto \frac{ \exp\left( -\frac{1}{2}\, (y - x_1)^T A_2^T \Lambda^{-1} A_2\, (y - x_1) - \frac{1}{2\sigma_{e1}^2}\, x_1^T A_1^T A_1\, x_1 \right) }{ (2\pi\sigma_{e1}^2)^{(N-P_1)/2}\, (2\pi)^{(N-P_2)/2}\, |\Lambda|^{1/2} }    (7.10)

and the resulting MAP solution vector x_1^{MAP} is obtained as the solution of:

\left( \frac{A_1^T A_1}{\sigma_{e1}^2} + A_2^T \Lambda^{-1} A_2 \right) x_1^{MAP} = A_2^T \Lambda^{-1} A_2\, y    (7.11)

If σ²_n1 is very large, the process n^1_m tends to a uniform distribution, in which case the impulsive section will largely obscure the underlying audio waveform x1. In the limit, the elements of Λ^{-1} equal to 1/σ²_n1 tend to zero. Since σ²_n1 is usually large and awkward to estimate, we will assume the limiting case for the restoration examples given later.
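Continuing the earlier sketch (again our own illustrative code, not the book's), the modified solution (7.11) in this limiting case simply deletes from A2 the rows whose excitation sample is flagged impulsive and weights the remainder by 1/σ²_n0:

import numpy as np

def separate_transient(y, a1, var_e1, a2, var_n0, i):
    # Modified MAP separation, equation (7.11), in the limit var_n1 -> infinity:
    # entries of Lambda^{-1} where i_m = 1 vanish, so the matching rows of A2
    # drop out of the normal equations.
    N = len(y)
    A1 = prediction_matrix(a1, N)      # helper from the sketch in section 7.2
    A2 = prediction_matrix(a2, N)
    P2 = len(a2)
    keep = ~np.asarray(i, dtype=bool)[P2:]   # excitation sample m <-> row m - P2
    A2k = A2[keep]
    lhs = A1.T @ A1 / var_e1 + A2k.T @ A2k / var_n0
    rhs = (A2k.T @ A2k / var_n0) @ y
    x1 = np.linalg.solve(lhs, rhs)
    return x1, y - x1                  # restored signal and noise transient estimate

The returned pair corresponds to the restored waveform and the estimated transient, obtained by subtraction as in the experimental section below.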

The algorithm as it stands is appropriate for restoration of a whole block of data with no reference to the surroundings. Discontinuities can thus result at the boundaries between a restored data block and adjacent data. Continuity can be enforced with a windowed overlap/add approach at the ends of a processed block. However, improved results have been obtained by fixing a set of known samples at both ends of the block; continuity is then inherent in the separation stage. This approach requires a slightly modified version of equation (7.11) which accounts for the known samples, in a manner very similar to the standard AR-based interpolator (see section 5.2.2) in which known samples are fixed on either side of missing data.

7.3.2 Practical considerations

The modified algorithm proposed for separation of the true signal from the combination of true signal and noise transients requires knowledge of both AR systems, including noise variances and actual parameter values, as well as the detection vector i which indicates which noise samples have impulsive excitation and which have low level excitation. We now consider how we might obtain estimates for these unknowns and hence produce a practical restoration scheme.

7.3.2.1 Detection vector i

Identification of i is the equivalent for noise transients of click detection. We are required to estimate which samples of the noise pulse signal correspond to impulsive noise excitation. A Bayesian scheme for optimal detection can be derived based on the same principles as the click detector of chapter 9, but using the revised noise model. However, in many cases the noise transients are large relative to the audio signal, and a simpler approach will suffice.

An approach based on the inverse filtering AR detector described elsewhere (see section 5.3.1) has been found to perform adequately in most cases. The impulsive sections of the low frequency noise transients are simply treated as standard clicks of high amplitude. The inverse filtering threshold operates as for detection of standard clicks, but a second, higher threshold is chosen to ‘flag’ clicks of the low frequency transient type. These clicks are then restored using the scheme introduced above. Such a detection procedure provides good results for most types of noise pulse degradation. However, ‘false alarms’ and ‘missed detections’ are an inevitable consequence in a signal where low frequency transients can be of small amplitude or where standard clicks can be of comparable magnitude to noise transient clicks. The higher threshold must be chosen by trial and error and will reflect how tolerant we are of false alarms (which may result in distortion of the low frequency signal content) and missed detections (which will leave small transients unrestored). The more sophisticated Bayesian detection scheme should improve the situation, but is left as a topic for future investigation.
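A minimal sketch of such a two-threshold inverse filtering detector is given below (hypothetical names; the robust scale estimate and threshold values are our own illustrative choices, not prescribed by the book):

import numpy as np

def detect_transients(y, a1, k_click=3.0, k_transient=10.0):
    # Inverse-filter y through the signal AR model to obtain the prediction
    # error, then apply two thresholds: |e| > k_click * scale flags ordinary
    # clicks, and |e| > k_transient * scale flags impulsive input of the
    # low frequency transient type.
    a1 = np.asarray(a1)
    P = len(a1)
    e = np.zeros(len(y))
    for n in range(P, len(y)):
        e[n] = y[n] - a1 @ y[n - P:n][::-1]      # prediction error at sample n
    scale = 1.4826 * np.median(np.abs(e[P:]))    # robust excitation scale
    clicks = np.abs(e) > k_click * scale
    transients = np.abs(e) > k_transient * scale # candidate vector i
    return clicks, transients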

7.3.2.2 AR process for true signal x1

The AR model parameters and excitation variance a1, σ²_e1 for the undegraded audio source can usually be estimated by standard methods (see section 4.3.1) from a section of uncorrupted data in the close vicinity of the noise pulse. It is best to estimate these parameters from data prior to the noise pulse, so as to avoid the effects of the low frequency noise tail on the estimates. Local stationarity is then assumed, so that the same AR model can be used for restoration of the subsequent corrupted section of data. The AR model order is fixed beforehand and, as for click removal applications, an order of 40-60 is typically sufficient.


7.3.2.3 AR process for noise transient x2

The AR process for the noise transients may be identified in several ways. Perhaps the most effective method is to estimate the parameters by maximum likelihood from a ‘clean’ noise pulse obtained from a passage of the recording where no recorded sound is present, such as the lead-in groove of a disc recording (see figure 7.1). If a ‘clean’ noise transient is not available for any reason, it is possible to use estimates obtained from another similar recording. These estimates can of course be replaced or adapted based on restored pulses obtained as processing proceeds. Typically a very low model order of 2-3 is found to perform well. When no such estimates are available, we have found that a ‘second differencing’ AR model with parameters [2, −1], which favours signals that are smooth in terms of their second differences, leads to restorations of a satisfactory standard.
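For illustration, a covariance-method (least squares) fit of such a low order model from a clean pulse segment might be sketched as follows (our own rendering of the standard estimator of section 4.3.1; the segment indices in the usage line echo the experiment reported below):

import numpy as np

def fit_ar_covariance(x, P):
    # Least-squares (covariance method) estimate of AR(P) parameters and
    # excitation variance from a data segment x.
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Row n of X holds [x[n-1], ..., x[n-P]] for n = P..N-1.
    X = np.column_stack([x[P - k:N - k] for k in range(1, P + 1)])
    target = x[P:]
    a, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ a
    return a, np.mean(resid ** 2)

# e.g. a2, var_n0 = fit_ar_covariance(clean_pulse[500:1000], 2)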

7.3.3 Experimental evaluation

Experimental results are now presented for the signal separation algorithms derived in the previous sections. Firstly we consider a synthetic case in which a noise transient obtained from the lead-in groove of a gramophone recording is added at a known position to a ‘clean’ audio waveform. The resulting synthetically corrupted waveform is shown in figure 7.4. The corresponding restoration is shown in figure 7.5, and we can see that the degradation is completely removed as far as the eye can see. The next two figures (7.6 and 7.7) show the true noise transient and the transient estimated by the algorithm, respectively. The estimated transient is obtained by subtracting the restored waveform (figure 7.5) from the corrupted waveform (figure 7.4). There is very good correspondence between the true and estimated transient waveforms, indicating that the separation algorithm has worked well. We now consider the procedure in more detail.

The AR model parameters for the noise transient were estimated to order 2 by the covariance method from the non-impulsive section of the true noise transient (samples 500-1000 of figure 7.6). The AR parameters for the signal were estimated to order 80 from a section of 1000 data points just prior to the noise transient. The high amplitude impulsive section driving the noise transient was chosen as a run of 50 samples from 451-500, and the separation algorithm operated over the larger interval 441-720.

In this example 80 ‘restored’ samples were fixed before and after the region of signal separation (samples 441-720). The values of these fixed samples were obtained by a high-pass filtering operation on the corrupted data. Such a procedure assumes little or no spectral overlap between the desired signal and the noise transient at large distances from the impulsive central section, and leads to computational savings, since a significant proportion of the transient can be removed by standard high-pass filtering operations.


The estimated noise transient under these conditions is shown in more detail in figure 7.9. The region of signal separation is enclosed by the vertical dot-dash lines, while the region of high amplitude impulsive input is enclosed by the dotted lines. For comparison purposes the true noise pulse is shown in figure 7.8. A very close correspondence can be seen over the whole region of separation, although the peak values are very slightly under-predicted. Figures 7.10 and 7.11 show the same region in the true and estimated audio signals. Once again a very good correspondence can be observed.

Two further examples are now given which illustrate restoration performed on genuine degraded audio material. In the first example (figures 7.12 and 7.13) restoration is performed on two closely spaced noise transients. Detection of high amplitude impulsive input regions was performed using the inverse filtering detector with the threshold set very high, so as to avoid detecting standard clicks as noise transients. In the second example (figures 7.14, 7.15 and 7.16) a difficult section of data is taken in which there are many overlapping noise pulses of different amplitudes and shapes. This example would thus be particularly problematic for the templating method. Detection of impulsive sections was performed as for the previous example, and the restoration is visually very effective.

A number of passages of degraded audio material have been processed using this technique. Results have been good and usually more effective than the templating method, which requires some additional high-pass filtering to remove the last traces of degradation.

7.4 Kalman filter implementation

As for the single AR process models, it is possible to write the double AR process model in state-space form. This means that the separation procedure can be efficiently implemented using the Kalman filter (see section 4.4), which is an important consideration since the size of the matrices to be inverted in the direct implementation can run into many hundreds. Details of such an implementation can be found in [85].
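As an indication of the construction (one possible state arrangement only; the implementation of [85] may differ in detail), the two AR processes can be stacked into a single state vector:

z_n = [\, x^1_n \;\cdots\; x^1_{n-P_1+1} \;\; x^2_n \;\cdots\; x^2_{n-P_2+1} \,]^T, \qquad
z_{n+1} = \begin{bmatrix} C(a_1) & 0 \\ 0 & C(a_2) \end{bmatrix} z_n + B\, e_{n+1}, \qquad
y_n = h^T z_n,

where C(a_i) is the P_i × P_i companion matrix of the coefficients a_i, B routes the two scalar excitations (with variances σ²_e1 and λ_n respectively) into the first element of each block, and h sums the current samples of the two processes so that y_n = x^1_n + x^2_n. The Kalman filter then propagates only (P_1 + P_2)-dimensional quantities per sample, instead of inverting matrices whose dimension is the block length N.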

7.5 Conclusion

The separation algorithms developed in this chapter have been found to be a significant improvement over the templating method for removal of low frequency noise transients, both in terms of automation and the quality of the processed results. Any further work needs to address the issue of detection for low frequency noise transients, since it is still a difficult problem to distinguish between standard clicks and low frequency transients with low energy relative to the signal. In our current investigations we are looking at models which more accurately reflect the mechanics of the gramophone turntable system, which should lead to improved detection and restoration performance.

It is thought that the methods proposed in this chapter may find application in more general correlated transient detection and restoration. Another example from the audio field is that of high frequency resonances caused by a poor choice of, or faulty, playback stylus. This defect has many features in common with that of low frequency transients, although it has only rarely been a serious practical problem in our experience.

[Figures 7.4-7.16, referenced in section 7.3.3, are waveform plots of amplitude against sample number:]
FIGURE 7.4. Audio waveform synthetically corrupted by low frequency transient.
FIGURE 7.5. Restored audio data.
FIGURE 7.6. Genuine noise transient added to audio signal.
FIGURE 7.7. Noise transient estimated by restoration procedure.
FIGURE 7.8. Genuine noise transient showing areas of signal separation and interpolation.
FIGURE 7.9. Noise transient estimated by restoration procedure.
FIGURE 7.10. Original audio signal showing areas of signal separation and interpolation.
FIGURE 7.11. Signal estimated by restoration procedure.
FIGURE 7.12. Degraded audio signal with two closely spaced noise transients.
FIGURE 7.13. Restored audio signal for figure 7.12 (different scale).
FIGURE 7.14. Degraded audio signal with many closely spaced noise transients.
FIGURE 7.15. Estimated noise transients for figure 7.14.
FIGURE 7.16. Restored audio signal for figure 7.14 (different scale).


8. Restoration of Pitch Variation Defects

A form of degradation which can be encountered on almost any analogue recording medium, including discs, magnetic tape, wax cylinders and optical film sound tracks, is an overall pitch variation which was not present in the original performance. This chapter addresses the problem of smooth pitch variation over long time-scales (e.g. variations at around 78rpm for many early gramophone discs), a defect often referred to as ‘wow’ and one of the most disturbing artefacts encountered in old recordings. An associated defect known as ‘flutter’ describes a pitch variation which varies more rapidly with time. This defect can in principle be restored within the same framework, but difficulties are expected when performance effects such as tremolo or vibrato are present.

There are several mechanisms by which wow can occur. One cause is a variation of the rotational speed of the recording medium during either recording or playback; this is often the case for tape or gramophone recordings. A further cause is eccentricity in the recording or playback process for disc and cylinder recordings, for example a hole which is not punched perfectly at the centre of a gramophone disc. Lastly, it is possible for magnetic tape to become unevenly stretched during playback or storage; this too leads to pitch variation during playback. Early accounts of wow and flutter are given in [62, 10]. The fundamental principles behind the algorithms presented here have been presented in [78], while a recursive adaptation is given in [71].

In some cases it is in principle possible to make a mechanical correction for pitch defects, for example in the case of the poorly punched disc centre-hole, but such approaches are generally impractical. This chapter presents a new signal processing approach to the detection and correction of these defects, which is designed to be as general as possible in order to correct a wide range of related defects in recordings. No knowledge is assumed about the precise source of the pitch variation except its smooth variation with time. Results are presented from both synthetically degraded material and genuinely degraded sources.

8.1 Overview

The physical mechanism by which wow is produced is equivalent to a non-uniform warping of the time axis. If the undistorted time-domain waveform of the gramophone signal is written as x(t) and the time axis is warped by a function f_w(t), then the distorted signal is given by:

x_w(t) = x(f_w(t))    (8.1)

For example, consider a gramophone turntable with nominal playback angular speed ω_0. Suppose there is a periodic speed fluctuation in the turntable motor which leads to an actual speed of rotation during playback of:

\omega_w(t) = (1 + \alpha \cos(\omega_0 t))\, \omega_0    (8.2)

Since a given disc is designed to be replayed at constant angular speed ω_0, an effective time warping and resultant pitch variation will be observed on playback. The waveform recorded on the disc is x(θ/ω_0), where θ denotes the angular position around the disc groove (the range [0, 2π) indicates the first revolution of the disc, [2π, 4π) the second, etc.), and the distorted output waveform when the motor rotates the disc at a speed of ω_w(t) will be (assuming that θ = 0 at t = 0):

x_w(t) = x\left( \frac{1}{\omega_0} \int_0^t \omega_w(\tau)\, d\tau \right)
       = x\left( \int_0^t (1 + \alpha \cos(\omega_0 \tau))\, d\tau \right)
       = x\left( t + \frac{\alpha}{\omega_0} \sin(\omega_0 t) \right).    (8.3)

In this case the time warping function is f_w(t) = t + (α/ω_0) sin(ω_0 t). Similar results hold for more general speed variations. The resultant pitch variation may be seen by consideration of a sinusoidal signal component x(t) = sin(ω_s t). The distorted signal is then given by

x_w(t) = x(f_w(t)) = \sin\left( \omega_s \left( t + \frac{\alpha}{\omega_0} \sin(\omega_0 t) \right) \right),    (8.4)


which is a frequency modulated sine wave with instantaneous frequency ω_s (1 + α cos(ω_0 t)). This will be perceived as a tone with time-varying pitch provided that ω_s >> ω_0. Figure 8.1 illustrates this effect. When a number of sinusoidal components of different frequencies are present simultaneously, they are all frequency modulated in such a way that there is a perceived overall variation of pitch with time.
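The effect is easy to reproduce numerically. The following sketch (our own illustration, with arbitrarily chosen parameter values) synthesises a wow-degraded tone by evaluating the sinusoid at the warped instants of equation (8.3):

import numpy as np

fs = 44100.0                       # sample rate (Hz)
t = np.arange(int(fs)) / fs        # one second of sampling instants nT
w0 = 2 * np.pi * 78 / 60           # 78rpm rotation rate (rad/s)
ws = 2 * np.pi * 440               # signal component at 440 Hz
alpha = 0.01                       # 1% peak speed fluctuation

fw = t + (alpha / w0) * np.sin(w0 * t)   # time warping function, eq. (8.3)
x_wow = np.sin(ws * fw)                  # degraded signal, eq. (8.4)
# The instantaneous frequency ws*(1 + alpha*cos(w0*t)) sweeps +/-1% about 440 Hz.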

If the time warping function f_w(·) is known and invertible, it is possible to regenerate the undistorted waveform x(t) as

x(t) = x_w(f_w^{-1}(t))    (8.5)

A wow restoration system is thus primarily concerned with estimation of the time warping function, or equivalently the pitch variation function p_w(t) = d f_w(t)/dt, from which the original signal can be reconstructed.

We adopt here a discrete time approach in which the values of the pitch variation function are estimated from the wow-degraded data. We denote by p = [p_1 ... p_N]^T the pitch variation vector corresponding to a vector of corrupted data points x_w. If it is assumed that p is drawn from some random process which has prior distribution p(p), then Bayes’ theorem gives:

p(p | x_w) \propto p(x_w | p)\, p(p)    (8.6)

where the likelihood p(x_w | p) can in principle be obtained from the prior for the undistorted data p(x) and the pitch variation vector. In fact we adopt a pre-processing stage in which the raw data are first transformed into a time-frequency ‘map’ which tracks the pitch of the principal frequency components in the data (‘frequency tracking’). This simplifies the calculation of the likelihood and allows a practical implementation of Bayes’ theorem on the transformed data. The idea behind this is the observation that most musical signals are made up of combinations of steady tones, each comprising a fundamental pitch and overtones. The assumption is made that pitch variations which are common to all tones present may be attributed to the wow degradation, while other variations are genuine pitch variations in the musical performance. The approach can of course fail during non-tonal (‘unvoiced’) passages or if note ‘slides’ dominate the spectrum, but this has not usually been a problem.

Once frequency tracks have been generated, they are combined using the Bayesian algorithm into a single pitch variation vector p. The final restoration stage of equation (8.5) is then approximated by a digital resampling operation on the raw data, which is effectively a time-varying sample rate converter.

The complete wow identification and restoration scheme is summarised in figure 8.2.


FIGURE 8.1. Frequency modulation of sinusoid due to motor speed variation. [Three panels: (a) periodic variation of ω_w(t); (b) corresponding time-warping function f_w(t); (c) input sinusoid and resulting frequency-modulated sinusoid.]

FIGURE 8.2. Outline of wow restoration system. [Block diagram: x_w(t) → frequency tracking → generation of pitch variation curve p_w(t) → time domain resampling → x(t).]

8.2 Frequency tracking

The function of the frequency tracking stage is to identify as many tonal frequency components as possible from the data and to estimate the way their frequencies vary with time. In principle any suitable time-varying spectral estimator may be used for this purpose (see e.g. [155, 39]). For the purposes of wow estimation it has been found adequate to employ the discrete Fourier transform (DFT) for the frequency estimation task, although more sophisticated methods could easily be incorporated into the same framework. The window length is chosen to be short enough that frequency components within a single block are nearly constant, typically 2048 data points at a 44.1kHz sampling rate, and a windowing function with suitable frequency domain characteristics, such as the Hamming window, is employed. Analysis of the peaks in the DFT magnitude spectrum then leads to estimates of the instantaneous frequency of the tonal components in each data block. In order to obtain a set of contiguous frequency ‘tracks’ from block to block, a peak matching algorithm is used. This is closely related to the methods employed for sinusoidal coding of speech and audio; see [126, 158] and references therein.

In our scheme the position and amplitude of maxima (peaks) are extracted from the DFT magnitude spectrum for each new data block. The frequency at which a peak occurs can be refined to higher accuracy by evaluation of the DTFT on a finer grid of frequencies than the raw DFT, and a binary search procedure is used to achieve this. Peaks below a chosen threshold are deemed noise and discarded. The remaining peaks are split into two categories: those which fit closely in amplitude and frequency with an existing frequency ‘track’ from the previous block (these are appended to the existing track), and those which do not match any existing track (these start a new frequency track). Thus the frequency tracks evolve with time in a consistent manner, enabling the subsequent processing to estimate a pitch variation curve.
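In outline, one block of this peak picking and track continuation might be sketched as follows (hypothetical names, and a much simplified frequency-only matching rule compared with the amplitude-and-frequency matching described above; the DTFT refinement step is omitted):

import numpy as np

def find_peaks(block, fs, threshold):
    # Hamming-windowed DFT magnitude spectrum; return (frequency, amplitude)
    # pairs for local maxima above the noise threshold.
    spec = np.abs(np.fft.rfft(block * np.hamming(len(block))))
    freqs = np.fft.rfftfreq(len(block), 1.0 / fs)
    idx = [k for k in range(1, len(spec) - 1)
           if spec[k] > spec[k - 1] and spec[k] > spec[k + 1]
           and spec[k] > threshold]
    return [(freqs[k], spec[k]) for k in idx]

def extend_tracks(tracks, peaks, max_rel_jump=0.03):
    # Append each peak to the closest live track (within a small relative
    # frequency jump) or start a new track ('birth'); tracks left in
    # `unmatched` at the end have 'died'.
    unmatched = list(tracks)
    new_tracks = []
    for f, a in peaks:
        best = min(unmatched, key=lambda tr: abs(f - tr[-1]), default=None)
        if best is not None and abs(f - best[-1]) / best[-1] < max_rel_jump:
            best.append(f)
            unmatched.remove(best)
            new_tracks.append(best)
        else:
            new_tracks.append([f])
    return new_tracks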

Experimental results for frequency tracking are shown in figures 8.4 and 8.9 and discussed in the experimental section.

8.3 Generation of pitch variation curve

Once frequency tracks have been generated for a section of music, the pitch variation curve can be estimated. For the nth block of data there will be R_n frequency estimates, corresponding to the R_n tonal components which were being tracked at block n. The ith tonal component has a nominal (unknown) centre frequency F^i_0, which is assumed to remain fixed over the period of interest in the uncorrupted music data, and a measured frequency F^i_n. Variations in F^i_n are attributed to the pitch variation value P_n for block n and a noise component V^i_n. This noise component is composed both of inaccuracies in the frequency estimation stage and of genuine ‘performance’ pitch deviations (hopefully small) in the tonal components.

Two modelling forms are considered. In both cases the frequency measurements F^i_n are expressed as the product of the centre frequency F^i_0 and the pitch variation value P_n in the presence of noise. In the first case a straightforward additive form for the noise is assumed:

F^i_n = F^i_0 P_n + V^i_n,   (n = 1, ..., N,  i = 1, ..., R_n)    (8.7)

This would seem a reasonable model where the noise is assumed to be composed largely of additive frequency estimation errors from the frequency tracking stage. However, it was mentioned that the noise term must also account for true pitch variations in musical tones, which might include performance effects such as tremolo and vibrato. These may well be multiplicative, which leads to the following, perhaps more natural, model:

F^i_n = F^i_0 P_n V^i_n,   V^i_n > 0    (8.8)

This is equivalent to making logarithmic frequency measurements of the form:

f^i_n = f^i_0 + p_n + v^i_n    (8.9)

where lower case characters denote the natural logarithm of the corresponding upper case quantities. Such a model linearises the estimation tasks for the centre frequencies and the pitch variation, a fact which leads to significant computational benefits over the additive noise case. A further advantage of the multiplicative model is that all of the unknowns in the log-transformed domain can be treated as random variables over the whole real axis, which means that the Gaussian assumption can be made without fear of estimating invalid parameter values (i.e. negative frequencies and pitch variation functions).

We now consider the multiplicative noise modelling assumption in more detail, obtaining the likelihood functions which will be required for the subsequent Bayesian analysis. For full details of the use of the additive noise model, see [70, 78, 71].

For the time being we treat the number of tonal components as fixed for all blocks, with R_n = R, for notational simplicity. This restriction is removed in a later section.

For the log-transformed multiplicative model of (8.9) we can write, in vector notation, the observation equation at time n for the logarithmically transformed variables as

f_n = f_0 + p_n 1_R + v_n,   (n = 1, ..., N)    (8.10)

where 1_m is the column vector of m ones, and

f_n = [f^1_n f^2_n ... f^R_n]^T,   f_0 = [f^1_0 f^2_0 ... f^R_0]^T,   v_n = [v^1_n v^2_n ... v^R_n]^T.

The corresponding matrix notation for all N data blocks is then:

F = f_0 1_N^T + 1_R p^T + V    (8.11)

where

F = [f_1 f_2 ··· f_N],    (8.12)
p = [p_1 p_2 ··· p_N]^T,    (8.13)
V = [v_1 v_2 ··· v_N]    (8.14)

If we now assume a specific noise distribution for the log-noise terms v^i_n, it is straightforward to obtain the likelihood function for the data, p(F | p, f_0). The simplest case analytically assumes that the noise terms are i.i.d. Gaussian random variables with variance σ²_v. This is likely to be an over-simplification, especially for the ‘modelling error’ noise component, which may contain decidedly non-white effects such as vibrato. Nevertheless, this i.i.d. Gaussian assumption has been used for all of our work to date, and very successful results have been obtained. The likelihood is then given by:

p(F | p, f_0) = \frac{1}{(2\pi\sigma_v^2)^{NR/2}} \exp\left( -\frac{1}{2\sigma_v^2}\, Q \right)    (8.15)


where

Q = \sum_{n=1}^{N} \sum_{i=1}^{R} (v^i_n)^2    (8.16)

since the Jacobian of the transformation V → F conditioned upon f_0 and p, defined by (8.11), is unity. We now expand Q using (8.11) to give:

Q = \mathrm{trace}(V V^T)    (8.17)
  = \mathrm{trace}\left( F F^T - 2 F (1_N f_0^T + p\, 1_R^T) + f_0 1_N^T 1_N f_0^T + 1_R p^T p\, 1_R^T + 2\, f_0 1_N^T p\, 1_R^T \right)    (8.18)
  = \mathrm{trace}(F F^T) - 2\, f_0^T F\, 1_N - 2\, 1_R^T F\, p + N f_0^T f_0 + R\, p^T p + 2\, 1_N^T p\, 1_R^T f_0    (8.19)

where the result trace(u v^T) = v^T u has been used to obtain (8.19) from (8.18).

Differentiation w.r.t. f0 and p gives the following partial derivatives:

\frac{\partial Q}{\partial f_0} = -2 F 1_N + 2 N f_0 + 2\, 1_{R \times N}\, p    (8.20)

\frac{\partial Q}{\partial p} = -2 F^T 1_R + 2 R\, p + 2\, 1_{N \times R}\, f_0    (8.21)

where 1_{N×M} is the (N×M) matrix containing all unity elements. A simple ML estimate for the unknown parameter vectors would be obtained by setting these gradient expressions to zero. This, however, leads to a singular system of equations, and hence the ML estimate is not well-defined under the multiplicative noise assumption. Under the additive modelling assumption an ML solution does exist, but it is found to cause noise-fitting in the estimated pitch variation vector. In the methods discussed below, prior regularising information is introduced which makes the estimation problem well-posed, and we find that a linear joint estimate is possible for both f_0 and p, since (8.20) and (8.21) involve no cross-terms between the unknowns.

8.3.1 Bayesian estimator

The ML approach assumes no prior knowledge about the pitch variation curve, and as such is not particularly useful, since the solution vector tends to be very noisy under the additive noise assumption (see experimental section) or ill-posed under the multiplicative assumption. The Bayesian approach is introduced here as a means of regularising the solution vector such that noise is rejected, consistent with qualitative prior information about the wow generation process. In (8.6) the posterior distribution for the unknown parameters is expressed in terms of the raw input data. We now assume that the raw data have been pre-processed to give frequency tracks F. The joint posterior distribution for p and f_0 can then be expressed as:

p(p, f_0 | F) \propto p(F | p, f_0)\, p(p, f_0)    (8.22)

which is analogous to equation (8.6), except that the frequency track data F rather than the raw time domain signal is taken as the starting point for estimation. This is sub-optimal in the sense that some information will inevitably be lost in the pre-processing step. However, from the point of view of analytical tractability, working with F directly is highly advantageous.

Since we are concerned only with estimating p, it might be argued that f_0 should be marginalised from the a posteriori distribution. In fact, for the multiplicative noise model the marginal estimate for p can be shown to be identical to the joint estimate, since the posterior distribution is jointly Gaussian in both p and f_0.

A uniform (flat) prior is assumed for f_0 throughout, since we have no particular information about the distribution of centre frequencies which is not contained in the frequency tracks; hence we can write p(p, f_0) ∝ p(p). The prior chosen for p will depend on what is known about the source mechanism for the pitch defects. For example, if the defect is caused by a poorly punched centre-hole in a disc recording, a sinusoidal model with period close to the period of angular rotation of the disc may be appropriate, while for more random pitch variations, caused perhaps by stretched analogue tape or hand-cranked disc and cylinder recordings, a stochastic model must be used. We place most emphasis on the latter case, since a system with the ability to identify such defects will have more general application than a system constrained to identify a deterministic form of defect such as a sinusoid. One vital piece of prior information which can be used in many cases is that pitch variations are smooth with time: the mechanical speed variations which cause wow very rarely change in a sudden fashion, so we do not expect a very ‘rough’ waveform or any sharp discontinuities.

In all cases considered here the prior on p is zero-mean and Gaussian. In the general case, then, we have p(p) = N(0, C_p), the Gaussian distribution with mean vector zero and covariance matrix C_p (see appendix A.2):

p(p) = \frac{1}{(2\pi)^{N/2} |C_p|^{1/2}} \exp\left( -\frac{1}{2}\, p^T C_p^{-1} p \right)    (8.23)

The posterior distribution can be derived directly by substituting (8.15) and (8.23) into (8.22):

p(p, f_0 | F) \propto \exp\left( -\frac{1}{2\sigma_v^2}\, Q_{post} \right)    (8.24)

where

Q_{post} = Q + \sigma_v^2\, p^T C_p^{-1} p    (8.25)


and we can then obtain the partial derivatives:

\frac{\partial Q_{post}}{\partial f_0} = \frac{\partial Q}{\partial f_0} = -2 F 1_N + 2 N f_0 + 2\, 1_{R \times N}\, p    (8.26)

\frac{\partial Q_{post}}{\partial p} = \frac{\partial Q}{\partial p} + 2 \sigma_v^2\, C_p^{-1} p = -2 F^T 1_R + 2 R\, p + 2\, 1_{N \times R}\, f_0 + 2 \sigma_v^2\, C_p^{-1} p    (8.27)

If (8.26) and (8.27) are equated to zero, linear estimates for f_0 and p, each in terms of the other, result. Back-substituting for f_0 gives the following estimate for p alone:

p^{MAP} = \left( \left( R\, I - \frac{R}{N} 1_{N \times N} \right) + \sigma_v^2\, C_p^{-1} \right)^{-1} \left( I - \frac{1}{N} 1_{N \times N} \right) F^T 1_R    (8.28)

This linear form of solution is one of the principal advantages of a multiplicative noise model over an additive model for the pitch correction problem.

8.3.2 Prior models

Several specific forms of prior information are now considered. The first assumes that pitch curves can be modelled as an AR process. This is quite a general formulation which can incorporate most forms of pitch defect that we have encountered. For example, the poles of the AR process can be centred upon 78/60 Hz and placed close to the unit circle in cases where the defect is known to originate from a 78rpm recording. A second form of prior information, which can be viewed as a special case of the AR model, encodes the prior information already mentioned about the expected ‘smoothness’ of the pitch curves. It is usually the case that pitch variations occur in a smooth fashion with no sharp discontinuities, so this intuitive form of prior information is a reasonable assumption. Finally, deterministic models for pitch variation are briefly considered, which might be appropriate for modelling gradual pitch shifts over a longer period of time (such as a turntable motor which gradually slowed down over a period of several seconds during the recording or playback process). We have successfully applied all three forms of prior to the correction of pitch variations in real recordings.

8.3.2.1 Autoregressive (AR) model

This is in some sense the most general of the pitch variation models considered, allowing for a wide range of pitch defect mechanisms, from highly random to sinusoidal, depending on the parameters of the AR model and the driving excitation variance. Recall from chapter 4 that an AR process driven by a Gaussian white independent zero-mean noise process with variance σ²_e has inverse covariance matrix C_p^{-1} ≈ A^T A / σ²_e (see (4.48)), where A is the usual matrix of AR coefficients. To obtain the MAP estimate for p using an AR model prior, we simply substitute for C_p^{-1} in result (8.28):

p^{MAP} = \left( \left( R\, I - \frac{R}{N} 1_{N \times N} \right) + \frac{\sigma_v^2}{\sigma_e^2} A^T A \right)^{-1} \left( I - \frac{1}{N} 1_{N \times N} \right) F^T 1_R    (8.29)

Note, however, that the estimate obtained assumes knowledge of the AR coefficients as well as of the ratio λ = σ²_v/σ²_e. These are generally unknown for a given problem and must be determined beforehand. It is possible to estimate the values of these parameters or to marginalise them from the posterior distributions. Such an approach leads to a non-linear search for the pitch variation estimate, for which the EM (expectation-maximisation) algorithm would be a suitable method; we do not investigate this here, however, and thus the parameter values are considered as part of the prior modelling information. For the general AR modelling case this might seem impractical. However, it is often possible to choose parameters for a low order model (P = 2 or 3) by appropriate positioning of the poles of the AR filter, to reflect any periodicity in the pitch deviations (such as a 78/60 Hz effect for a 78rpm recording), and by selecting a noise variance ratio which gives suitable pitch variation estimates over a number of trial processing blocks. When sufficient prior information is not available to position the AR poles, the smoothness prior of the next section may be more appropriate. In this case all parameter values except the noise variance ratio are fixed; the noise variance ratio then expresses the expected ‘smoothness’ of the pitch curves.

8.3.2.2 Smoothness Model

This prior is specifically designed to maximise some objective measure of ‘smoothness’ in the pitch variation curve. This is reasonable since it can usually be assumed that pitch defects vary in a smooth fashion. Such an approach is well known for the regularisation of least-squares and Bayesian solutions (see, e.g. [198, 150, 117, 103]). One suitable smoothness measure for the continuous case is the integral of the squared order q derivative:

\int_0^t \left( \frac{d^q p_w(\tau)}{d\tau^q} \right)^2 d\tau    (8.30)

where the time interval for wow estimation is (0, t). In the discrete case the derivatives can be approximated by finite differencing of the time series: the first difference is given by d^1_n = p_n − p_{n−1}, the second by d^2_n = d^1_n − d^1_{n−1}, etc. Written in matrix form, we can form the sum of squares of the qth order differences as follows:

Q_S = \sum_{n=q+1}^{N} (d^q_n)^2 = p^T D_q^T D_q\, p    (8.31)

where D_q is the matrix which generates the vector of order q differences from a length N vector. An appropriate Gaussian prior which favours vectors p which are ‘smooth’ to order q is then

p(p) \propto \exp\left( -\frac{\alpha}{2}\, Q_S \right)    (8.32)

Substitution of the inverse covariance matrix C_p^{-1} = α D_q^T D_q into equation (8.28) leads to the MAP estimate for the smoothness prior model. In a similar way to the AR model, we can define a regularising constant λ = σ²_v α which reflects the degree of ‘smoothness’ expected for a particular problem.

Such a prior is thus a special case of the AR model with fixed parameters and poles on the real axis at unit radius (for q = 2, the second order case, the parameters are a = [2 −1]^T). The smoothness prior thus eliminates the need for AR coefficient estimation, although λ must still be selected. As before, λ could be estimated or marginalised from the posterior distribution, but the same arguments apply as for the AR modelling case. In practice it is usually possible to select a single value of λ which gives suitable pitch variation curve estimates for the whole of a particular musical extract.
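To make (8.28) concrete, the following sketch (our own code, with hypothetical names) computes the MAP pitch curve under the second difference smoothness prior, for the simplified case of R contiguous tracks. Note that a constant offset between p and f_0 is unobservable and unpenalised by the difference prior, so a minimum-norm solve is used to fix that degree of freedom:

import numpy as np

def pitch_curve_map(F, lam, q=2):
    # MAP estimate of the pitch variation vector, equation (8.28), with the
    # smoothness prior sigma_v^2 * Cp^{-1} = lam * Dq^T Dq.
    # F is the (R x N) matrix of log-frequency tracks (all tracks contiguous).
    R, N = F.shape
    D = np.eye(N)
    for _ in range(q):
        D = np.diff(D, axis=0)            # order-q differencing matrix Dq
    ones_NN = np.ones((N, N))
    lhs = (R * np.eye(N) - (R / N) * ones_NN) + lam * D.T @ D
    rhs = (np.eye(N) - ones_NN / N) @ F.T @ np.ones(R)
    p, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)   # minimum-norm fixes offset
    return p

# Usage: p = pitch_curve_map(np.log(freq_tracks), lam=40.0)
# exp(p) is the estimated pitch variation, close to 1 for small deviations.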

8.3.2.3 Deterministic prior models for the pitch variation

In some cases there will be specific prior information available about the form of the pitch variation curve. For example, the pitch variation may be known to be sinusoidal, or to consist of a linear pitch shift over a time interval of many seconds. In these cases deterministic basis functions can be added to the prior pitch model, in much the same way as for the AR plus basis function interpolators described in chapter 5. We do not present the details here, as they can easily be obtained through a synthesis of ideas already discussed in this and earlier chapters. We note, however, that both sinusoidal and linear pitch shift priors have been used successfully on a number of real recordings.

8.3.3 Discontinuities in frequency tracks

The above results apply only to frequency tracks which are contiguous over the whole time window of interest (recall that we fixed R_n, the number of tracks at time n, to be constant for all n). In practice, however, frequency tracks will have gaps and discontinuities, corresponding to note changes in the musical material and possibly to errors in the tracking algorithm. In some cases all tracks may disappear during pauses between notes. The likelihood expressions developed above can be straightforwardly modified to incorporate these discontinuities, which leads to a small change in the form of the MAP estimate.

Each frequency track is now assigned ‘birth’ and ‘death’ indices b_i and d_i, obtained from the frequency tracking algorithm, such that b_i denotes the first time index within the block at which f^i_0 is present (‘active’) and d_i the last (each track is then continuously ‘active’ between these indices).

The model equation can now be re-written for the multiplicative noise case as

f^i_n = \begin{cases} f^i_0 + p_n + v^i_n, & b_i \le n \le d_i \\ 0, & \text{otherwise} \end{cases}, \quad i = 1, ..., R_{max}    (8.33)

R_max is the total number of tonal components tracked in the interval 1 to N. At block n there are R_n active tracks, and the length of the ith track is given by N_i = d_i − b_i + 1. Q (see (8.16)) is as before defined as the sum of squares of all the noise terms v^i_n, and differentiation of Q yields the following modified gradient expressions (c.f. (8.20)-(8.21)):

\frac{\partial Q}{\partial f_0} = -2 F 1_N + 2\, \mathrm{diag}(N_1, N_2, ..., N_{R_{max}})\, f_0 + 2 M p    (8.34)

\frac{\partial Q}{\partial p} = -2 F^T 1_{R_{max}} + 2\, \mathrm{diag}(R_1, R_2, ..., R_N)\, p + 2 M^T f_0    (8.35)

where [F]_{pq} = f^p_q and M is the (R_max × N) matrix defined as:

[M]_{pq} = \begin{cases} 1, & [F]_{pq} > 0 \\ 0, & \text{otherwise} \end{cases}    (8.36)

In other words, the elements of M ‘flag’ whether a particular frequency track element in F is active. This modified form for Q and its derivatives is substituted directly for (8.20) and (8.21) above, to give a slightly modified form of the Bayesian estimator (8.28), which is used for pitch estimation on real data.
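As an illustration of the modified estimator (treat this as a sketch only: the back-substitution of (8.34) into (8.35) is our own working, combined here with the smoothness prior of section 8.3.2.2, and F is assumed to store positive log-frequency values with zeros marking inactive entries, as in (8.36)):

import numpy as np

def pitch_curve_map_gaps(F, lam, q=2):
    # MAP pitch curve with track births and deaths: eliminating f_0 from the
    # zeroed gradients (8.34)-(8.35) gives
    #   (diag(R_n) - M^T diag(1/N_i) M + lam Dq^T Dq) p
    #       = F^T 1_Rmax - M^T diag(1/N_i) F 1_N.
    Rmax, N = F.shape
    M = (F > 0).astype(float)          # activity mask, eq. (8.36)
    Ni = M.sum(axis=1)                 # track lengths N_i
    Rn = M.sum(axis=0)                 # active tracks per block R_n
    D = np.eye(N)
    for _ in range(q):
        D = np.diff(D, axis=0)
    W = M / Ni[:, None]                # rows of M scaled by 1/N_i
    lhs = np.diag(Rn) - M.T @ W + lam * D.T @ D
    rhs = F.T @ np.ones(Rmax) - W.T @ (F @ np.ones(N))
    p, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)   # minimum-norm fixes offset
    return p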

8.3.4 Experimental results in pitch curve generation

Some examples of processing to generate pitch variation curve estimates are now presented. Two musical extracts, sampled at 44.1kHz, are used for the experiments.

The first example, ‘Viola’, is a simple extract of string music taken from a modern CD quality digital recording. There are no significant pitch defects on the recording, so artificial pitch defects were generated to test the effectiveness of the pitch variation algorithms. The artificial pitch variation is sinusoidal (see figure 8.3). This was applied to the uncorrupted signal using a resampling operation similar to that discussed shortly in section 8.4.

The second example, ‘Midsum’, is taken from a poor quality 78rpm disc recording of orchestral music which exhibits a large amount of ‘wow’ degradation. The only available source for the recording is a second or third generation analogue tape copy of the 78rpm master, so it is not known whether the ‘wow’ originates from the disc or the tape medium, nor whether a regular or periodic degradation can be assumed. The high broad-band noise content of this source makes it a testing environment for the frequency tracking algorithm.

Both extracts were initially pre-processed to generate the frequency tracks shown in figures 8.4 and 8.9. Frequency tracking is limited to the range 100Hz-1kHz, and a new set of frequency track values is generated once every 2048 input samples for a block of 4096 data samples (i.e. a 50% block overlap scheme).

8.3.4.1 Trials using extract ‘Viola’ (synthetic pitch variation)

The first set of results, applied to extract ‘Viola’, comes from the application of the second differencing (q = 2) smoothness model (see section 8.3.2.2). This model has been found to perform well for a wide variety of pitch variation scenarios; if there is no particular reason to assume periodicity in the pitch variation, it is probably the most robust choice. Figure 8.5 shows the estimated pitch curves using this model under both additive and multiplicative noise assumptions. The noise variance ratios λ were chosen to give results with an appropriate degree of smoothness from several ‘pilot’ runs with different values of λ. The graph shows very little difference between the additive and multiplicative assumptions, both of which give a good approximation to the true pitch curve. For pitch variations with peak-to-peak deviations within a few percent, very little difference in performance is generally observed between the two noise assumptions. Thus the multiplicative assumption, with its computational advantage, is adopted for the remaining trials on extract ‘Viola’.

Figure 8.6 shows estimated pitch curves under the AR model and sinusoidal model assumptions. Both of these models are suited to pitch variation curves which exhibit some periodicity, so performance is improved for this example compared with the smoothness prior. For the AR model an order 2 system was selected, with poles close to the unit circle. The pole angle was selected by hand (θ = 0.13π in this case). This may seem artificial, but in a real-life situation the periodicity will often be known to a reasonable approximation (78rpm will be typical!), so a pole position in the correct range can easily be selected. For this example the resulting pitch curve was found to be fairly insensitive to errors in the pole angle within ±30%. As should be expected, the sinusoidal model gives the best performance of all the models in this artificial case, since the true pitch variation curve is itself sinusoidal. Nevertheless, the close agreement between the sinusoid model curve and the true pitch curve gives an indication that a reasonable formulation is being used for pitch curve estimation from the frequency tracks.

Figures 8.7 and 8.8 show the effect of varying λ over a wide range, for the smoothness model and also for the AR model with poles at angles ±0.13π on the unit circle. λ = 1 corresponds to very little prior influence on the pitch variation curve. This estimate identifies general trends in the pitch curve, but since there is very little regularisation of the solution the estimate is very noisy. Increasing λ leads to increasingly smooth solutions. In the case of the smoothness prior (figure 8.7) the solution tends to zero amplitude as λ becomes large, since its favoured solution is a straight line; hence the solution is sensitive to the choice of λ. The AR model solution (figure 8.8) is less sensitive to λ, since its favoured solution is now a sinusoid at normalised frequency θ, the pole angle. This feature is one possible benefit of using the AR model when periodic behaviour can be expected.

8.3.4.2 Trials using extract ‘Midsum’ (non-synthetic pitch variation)

Figures 8.9 and 8.10 show the frequency tracks and estimated pitch curves, respectively, for example ‘Midsum’. Pitch curves are estimated using the differential smoothness model under both additive and multiplicative noise assumptions. Once again little significant difference is observed between the two assumptions, both of which give reasonable pitch curves. The appearance of these curves, as well as the application of the AR and sinusoid models (results not shown), indicates that the smoothness model is the best choice here, since there is little periodicity evident in the frequency tracks. Listening tests performed with restorations from these pitch curves (see next section) confirm this.

8.4 Restoration of the signal

The generation of a pitch variation curve allows the final restoration stage to proceed. Equation (8.5) shows that, in principle, perfect reconstruction of the undegraded signal is possible in the continuous time case, provided the time warping function is known. In the case of band-limited discrete signals, perfect reconstruction will also be possible by non-uniform resampling of the degraded samples. The degraded signal x_w(nT) is considered to be a set of non-uniform samples of the undegraded signal x(t), with non-uniform sampling instants given by the time-warping function f_w(nT), where 1/T is the sample rate for x(t). Provided the average sampling rate of the non-uniformly sampled signal x(f_w(nT)) is greater than the Nyquist rate for x(t), it is then possible to reconstruct x(nT) perfectly from x(f_w(nT)) (see e.g. [122]). Note however that the pitch varies very slowly relative to the


FIGURE 8.3. Synthetically generated pitch variation. [Plot: pitch variation vs. time (units of 2048 samples).]

FIGURE 8.4. Frequency tracks generated for example 'Viola'. [Plot: frequency (kHz) vs. time (units of 2048 samples).]


FIGURE 8.5. Pitch variation estimates for example 'Viola' using differential smoothness model: dotted line - true variation; solid line - additive model; dashed line - multiplicative model. [Plot: pitch variation vs. time (units of 2048 samples).]

FIGURE 8.6. Pitch variation estimates for example 'Viola' using AR and sinusoid models: dotted line - true variation; solid line - AR model; dashed line - sinusoidal model. [Plot: pitch variation vs. time (units of 2048 samples).]


FIGURE 8.7. Pitch variation estimates for example 'Viola' using differential smoothness model: dotted line - true variation; solid line - λ = 1; dashed line - λ = 40; dot-dash line - λ = 400. [Plot: pitch variation vs. time (units of 2048 samples).]

FIGURE 8.8. Pitch variation estimates for example 'Viola' using AR model: dotted line - true variation; solid line - λ = 1; dashed line - λ = 40; dot-dash line - λ = 400. [Plot: pitch variation vs. time (units of 2048 samples).]


FIGURE 8.9. Frequency tracks for example 'Midsum'. [Plot: frequency (kHz) vs. time (units of 2048 samples).]

FIGURE 8.10. Pitch variation estimates for example 'Midsum' using differential smoothness model: solid line - additive model; dashed line - multiplicative model. [Plot: pitch variation vs. time (units of 2048 samples).]


Note, however, that the pitch varies very slowly relative to the sampling rate. Thus, at any given time instant it is possible to approximate the non-uniformly sampled input signal as a uniformly sampled signal with sample rate $1/T' = p_w(t)/T$. The problem is then simplified to one of sample rate conversion, for which there are well-known techniques (see e.g. [41, 159]). Since the sample rate is slowly changing with time, a conversion method must be chosen which can easily be adjusted to arbitrary rates. A method which has been found suitable under these conditions is resampling with a truncated sinc function (see e.g. [41, 149]). In this method the output sample $x(nT)$ is estimated from the closest $(2M+1)$ corrupted samples as

x(nT ) =

M∑

m=−M

w(mT ′ − τ) sinc(α(mT ′ − τ)) xw((nw −m)T ′) (8.37)

where α = min(1, T ′/T ) ensures band-limiting to the Nyquist frequency,nw is the closest corrupted sample to the required output sample n, τ =nT − fw(nwT ) is the effective time delay and w(t) is a time domain win-dowing function with suitable frequency characteristics. Such a scheme canbe implemented in real-time using standard digital signal processing chips.
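As a purely illustrative sketch of this resampling step (not the implementation used here), the following Python fragment applies (8.37) with a Hann window assumed for $w(t)$; the warping function `f_w`, the half-width `M` and the sinc normalisation are all assumptions:

```python
import numpy as np

def resample_truncated_sinc(y_w, f_w, T=1.0, M=8):
    """Sketch of non-uniform resampling in the spirit of (8.37).

    y_w : degraded samples x_w(nT), as a NumPy array
    f_w : assumed time-warping function; f_w(n*T) gives the instant
          at which sample y_w[n] was 'really' taken
    """
    N = len(y_w)
    warped = f_w(np.arange(N) * T)      # non-uniform sampling instants
    T_loc = np.gradient(warped)         # slowly varying local spacing T'
    x_hat = np.zeros(N)
    for n in range(N):
        t = n * T                                  # required output instant
        n_w = int(np.argmin(np.abs(warped - t)))   # closest corrupted sample
        tau = t - warped[n_w]                      # effective time delay
        Tp = T_loc[n_w]
        alpha = min(1.0, Tp / T)                   # band-limit to Nyquist
        acc = 0.0
        for m in range(-M, M + 1):
            k = n_w - m
            if 0 <= k < N:
                u = (m * Tp - tau) / Tp            # offset in units of T'
                w = 0.5 * (1.0 + np.cos(np.pi * u / (M + 1)))  # Hann window
                acc += w * np.sinc(alpha * u) * y_w[k]
        x_hat[n] = acc
    return x_hat
```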

8.5 Conclusion

Informal listening tests indicate that the proposed method is capable of a very high quality of restoration of gramophone recordings which would otherwise be rendered useless owing to high levels of wow. The algorithm has been tested on a wide range of pitch defects from commercial recordings and found to work robustly in many different scenarios. Improvements to the algorithms could involve a joint peak-tracking/pitch curve estimation scheme, for which some of the more sophisticated techniques discussed in chapter 12 might be applied.


9

A Bayesian Approach to Click Removal

Clicks are the most common form of localised degradation encountered in audio signals. This chapter presents techniques for the removal of clicks based on an explicit model for the process by which these artefacts are generated. This work has previously been presented in [76, 79]. A full discussion and review of existing restoration methods for clicks is given in chapter 5.

Within the framework presented here, clicks are modelled as random amplitude disturbances occurring at random positions in the waveform. When a suitable form is chosen for the random processes which generate the clicks, the proposed model can accurately represent the characteristics of degradation in real audio signals. The detection and restoration procedure then becomes a problem in decision and estimation theory, and we apply the Bayesian principles discussed in chapter 4. This approach allows for the incorporation of modelling information into the problem in a consistent fashion which we believe is not possible using more traditional methods. The problem is formulated such that each possible permutation of the corrupted samples in a particular section of data is represented by a separate state in a classification scheme. Bayes decision theory is then applied to identify which of these possible states is 'most probable'. This state is chosen as the detection estimate.

Since this work is essentially concerned with the identification of aberrant observations in a time series, there are many parallels with work which has been done in the field of model-based outlier treatment in statistical data; see for example [22, 88, 13, 1, 2] for some texts of particular relevance.


This chapter develops the fundamental theory for detection using our model of click-type discontinuities. We then consider the case of AR-modelled data and particular forms of the click generating process. Classification results are derived which form the basis of both a block-based detector and a sequential detection scheme derived in the next chapter. Extensions to the basic theory are presented which consider marginalised detectors, the case of 'noisy data' and multi-channel detection. A special case of the detector is then shown to be equivalent to the simple detectors proposed by Vaseghi and Rayner [190, 187]. Experimental results from the methods are presented in chapter 11.

9.1 Modelling framework for click-type degradations

Consider a sampled data sequence $x_m$ which is corrupted by a random, additive noise process to give a corrupted data sequence $y_m$. The noise samples are generated by two processes. The first is a binary (1/0) noise generating process, $i_m$, which controls a 'switch'. The switch is connected only when $i_m = 1$, allowing a second noise process $n_m$, the noise amplitude process, to be added to the data signal $x_m$. $y_m$ is thus expressed as:

y_m = x_m + i_m n_m \qquad (9.1)

The noise source is thus a switched additive noise process which corrupts $x_m$ only when $i_m = 1$. Note that the precise grouping of corrupted samples will depend on the statistics of $i_m$. For example, if successive values of $i_m$ are modelled by a Markov chain (see e.g. [174]), then the transition probabilities of the Markov chain will describe some degree of 'cohesion' between values of $i_m$, i.e. the corrupting noise will tend to occur in groups or 'bursts'. We now see how such a model can be used to represent click degradation in audio signals since, as we have observed before, clicks can be viewed as groups of corrupted samples, random in length and position. The amplitude of these clicks is assumed to be generated by some random process $n_m$ whose samples need not in general be i.i.d. and might also depend on the noise generating process $i_m$.
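For concreteness, the model (9.1) is easily simulated; the sketch below uses a two-state Markov chain for $i_m$ and i.i.d. Gaussian amplitudes for $n_m$ (the transition probabilities are illustrative assumptions; priors of this form are developed in chapter 10):

```python
import numpy as np

def simulate_clicks(x, p01=0.01, p11=0.8, sigma_n=0.5, rng=None):
    """Simulate y_m = x_m + i_m * n_m with Markov-chain switching.

    p01 : probability of entering a noise burst (state 0 -> 1)
    p11 : probability of remaining inside a burst (state 1 -> 1)
    """
    rng = rng or np.random.default_rng(0)
    i = np.zeros(len(x), dtype=int)
    for m in range(1, len(x)):
        p_stay = p11 if i[m - 1] else 1.0 - p01
        i[m] = i[m - 1] if rng.random() < p_stay else 1 - i[m - 1]
    n = rng.normal(0.0, sigma_n, size=len(x))   # i.i.d. Gaussian amplitudes
    return np.asarray(x, float) + i * n, i
```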

Under this model the tasks of noise detection and restoration can be defined as follows: detection identifies the noise switching values $i_m$ from the observed data $y_m$, while restoration attempts to regenerate the input data $x_m$ given the observations $y_m$.

Note that a more realistic formulation might generalise the noise process to switch between a high amplitude 'impulsive' disturbance $n_m^1$ and a low level background noise process $n_m^0$. This then represents a system with a continuous low level disturbance (such as broad band background noise) in addition to impulsive click-type disturbances. The corrupted waveform is modified to:

y_m = x_m + (1 - i_m)\, n_m^0 + i_m\, n_m^1 \qquad (9.2)

This model will be referred to as the 'noisy data' case. In a later section it is shown that the optimal detector is obtained as a straightforward extension of the detector for the purely impulsive case (9.1). Figure 9.1 shows a schematic form for the noisy data model. The standard click model of (9.1) is obtained by setting $n_m^0 = 0$.

[FIGURE 9.1. Schematic form for 'noisy data' model: the binary noise generating process $i_m$ selects between noise amplitude processes $n_m^0$ and $n_m^1$, and the selected noise sample $n_m$ is added to $x_m$ to give $y_m$.]

9.2 Bayesian detection

The detection of clicks is considered first, and the restoration procedure is found to follow as a direct consequence. For a given vector of N observed data samples $y = [y_1 \ldots y_m \ldots y_N]^T$ we may write, for the case of additive noise,

y = x + n \qquad (9.3)


where x is the corresponding data signal and n is the random additive switched noise vector containing elements $i_m n_m$ (which can of course be zero for some or all of its elements). Define a corresponding vector i with binary elements containing the corresponding samples of $i_m$. Note that complete knowledge of i constitutes perfect detection of the corrupted samples within vector y. The discrete N-vector i can take on any one of $2^N$ possible values, and each value is considered to be a state within a classification framework. For the detection procedure we must estimate the true detection state (represented by i) from the observed data sequence y.

Within a Bayesian classification scheme (see section 4.1.4) we can estimate the detection 'state' i by calculating the posterior probability $p(i \mid y)$ for a postulated detection state i. The major part of this calculation is determination of the state likelihood, $p(y \mid i)$ (see 4.18), which will be the major concern of the next few sections. A general approach is derived initially in which the underlying distributions are not specified. Subsequent sections consider data modelled as an AR process with particular noise statistics.

9.2.1 Posterior probability for i

If i is treated as a discrete random vector whose elements may take on values 0 and 1 in any combination, we may derive the probability of i, the detection state, conditional upon the corrupted data y and any other prior information available to us. This posterior probability is expressed using Bayes rule as (see 4.18):

p(i \mid y) = \frac{p(y \mid i)\, p(i)}{p(y)} \qquad (9.4)

$p(y \mid i)$ is the likelihood for state i, which will be considered shortly.

$p(i)$ is the prior probability for the discrete detection vector, which will be referred to as the noise generator prior. This discrete probability, defined only for elements of i equal to 1 or 0, reflects any knowledge we have concerning the noise generating process $i_m$. $p(i)$ will contain any prior information about the relative probability of various click lengths and their frequencies of occurrence. A 'uniform' prior assigns equal prior probability to all noise configurations. However, we know from experience that this does not represent a typical noise generation process. For example, uncorrupted samples are usually far more likely than corrupted samples (a 20:1 ratio is typical) and, as we have discussed in the previous section, the corruption tends to occur in 'bursts' of random length. A prior which expresses this knowledge may be more successful than a uniform prior which assumes all detections i are equally probable. A discussion of suitable priors on i is found in the next chapter.


$p(y)$ is a constant for any given y and serves to normalise the posterior probability $p(i \mid y)$. It can be calculated as $p(y) = \sum_{i} p(y \mid i)\, p(i)$, where the summation is over all $2^N$ possible detection states i.

For classification using a MAP procedure (see section 4.1.4) the posterior probability $p(i \mid y)$ must in principle be evaluated for all $2^N$ possible i, and the estimated state is then that which maximises the posterior probability (although in the next chapter we devise sub-optimal schemes for practical operation).

The likelihood $p(y \mid i)$ is obtained from modelling considerations of both data and noise (see later).

We now adopt a notation for partitioning matrices and vectors according to the elements of i which is identical to that defined in section 5.2. In other words, vectors are partitioned into elements which correspond to noisy ($i_m = 1$) or clean ($i_m = 0$) data. Specifically, the corrupt input data y will be partitioned as $y_{(i)}$ and $y_{-(i)}$, the true data x will be partitioned as $x_{(i)}$ and $x_{-(i)}$, and the autoregressive parameter matrix A (see 4.3 and next section) will be partitioned as $A_{(i)}$ and $A_{-(i)}$. Clearly $y_{-(i)} = x_{-(i)}$ is the section of uncorrupted data, and the corrupting noise is given by $n_{(i)} = y_{(i)} - x_{(i)}$.

Assume that the PDF for the corrupting noise during bursts is known to be $p_{n_{(i)}|i}(\cdot)$. In addition assume a PDF for the uncorrupted data, $p_x(x)$. If the noise is assumed statistically independent of the data x, the likelihood can be obtained as (see appendix D):

p(y \mid i) = \int_{x_{(i)}} p_{n_{(i)}|i}(y_{(i)} - x_{(i)} \mid i)\; p_x(U x_{(i)} + K y_{-(i)})\, dx_{(i)} \qquad (9.5)

where U and K are 'rearrangement' matrices such that $x = U x_{(i)} + K y_{-(i)}$, as defined in section 5.2.1. The likelihood expression can be seen to be analogous to the evidence calculation for the model selection scheme with unknown parameters (see 4.22), except that we are now integrating over unknown data values rather than unknown parameters.

Note that the 'zero' hypothesis, which tests whether i = 0 (i.e. no samples are corrupted), is obtained straightforwardly from equation (9.4) by considering the case of i = 0, the zero vector. No integration is required and the resulting likelihood expression is simply

p(y \mid i = 0) = p_x(y). \qquad (9.6)

In the next section we consider detection when the signal $x_m$ is modelled as a Gaussian AR process.

9.2.2 Detection for autoregressive (AR) processes

The previous section developed a general probability expression for detection of noise bursts. We now address the specific problem of detecting burst-type degradation for particular audio signal models. We consider only the likelihood calculation, since the posterior probability of (9.4) is then straightforward to evaluate. Inspection of (9.5) shows that we require knowledge of two probability densities: $p_{n_{(i)}|i}(\cdot)$, the density for noise samples in corrupted sections, and $p_x(\cdot)$, the density for the signal samples.

9.2.2.1 Density function for signal, $p_x(\cdot)$

The signal model investigated here is the autoregressive (AR) model (see section 4.3), in which data samples can be expressed as a weighted sum of P previous data samples to which a random excitation value is added:

x_m = \sum_{i=1}^{P} a_i x_{m-i} + e_m. \qquad (9.7)

The excitation sequence $e_m$ is modelled as zero-mean white Gaussian noise of variance $\sigma_e^2$, and the signal $x_m$ is assumed stationary over the region of interest. Such a model has been found suitable for many examples of speech and music, and has the appeal of being a reasonable approximation to the physical model for sound production (an excitation signal applied to a 'vocal tract' filter).

The conditional likelihood for the vector of AR samples is then given by (see section 4.3.1, result (4.49)):

p(x_1 \mid x_0) = \frac{1}{(2\pi\sigma_e^2)^{\frac{N-P}{2}}} \exp\left( -\frac{1}{2\sigma_e^2}\, x^T A^T A\, x \right) \qquad (9.8)

where the expression is implicitly conditioned on the AR parameters a and the excitation variance $\sigma_e^2$, and the total number of samples (including $x_0$) is N.

We then adopt the approximate likelihood result (see (4.50)):

p_x(x) \approx p(x_1 \mid x_0), \quad N \gg P \qquad (9.9)

As noted in section 4.3.1, this result can easily be made exact by the incorporation of the extra term $p(x_0)$, but we omit this here for reasons of notational simplicity. Furthermore, as also noted in section 4.3.1, results from using this approximation will be exact provided that the first P data points $x_0$ are uncorrupted. For the detection procedure studied here, this simplification will imply that the first P samples of i are held constant at zero, indicating no corruption over these samples. In a block-based scheme we could, for example, choose to make $x_0$ the last P samples of restored data from the previous data block, so that we would be reasonably confident of no corrupted samples within $x_0$.


9.2.2.2 Density function for noise amplitudes, $p_{n_{(i)}|i}(\cdot)$

Having considered the signal model PDF $p_x(\cdot)$, we now consider the noise amplitude density function, $p_{n_{(i)}|i}$, which is the distribution of noise amplitudes within bursts, given knowledge of the noise burst positions i.

The precise form for $p_{n_{(i)}|i}(\cdot)$ is not in general known, but one possible assumption is that the noise samples within bursts are mutually independent and drawn from a Gaussian zero-mean process of variance $\sigma_n^2$. Observation of many click-degraded audio signals indicates that such a noise prior may be reasonable in many cases. Note that if more explicit information is known about the noise PDF during bursts, this may be directly incorporated. In particular, a more general Gaussian distribution is easily incorporated into the algorithm as described here. The assumption of such a zero-mean multivariate Gaussian density with covariance matrix $R_{n_{(i)}}$ leads to a noise amplitude density function of the form (see appendix A):

p_{n_{(i)}|i}(n_{(i)} \mid i) = \frac{1}{(2\pi)^{l/2}\, |R_{n_{(i)}}|^{1/2}} \exp\left( -\frac{1}{2}\, n_{(i)}^T R_{n_{(i)}}^{-1} n_{(i)} \right) \qquad (9.10)

where $l = i^T i$ is the number of corrupted samples indicated by detection vector i.

9.2.2.3 Likelihood for Gaussian AR data

The noise and signal priors can now be substituted from (9.10) and (9.8) into the integral of equation (9.5) to give the required likelihood for i. The resulting expression is derived in full in appendix D.1 and after some manipulation may be written in the following form:

p(y \mid i) = \frac{\sigma_e^l \exp\left( -\frac{1}{2\sigma_e^2} E_{MIN} \right)}{(2\pi\sigma_e^2)^{\frac{N-P}{2}}\, |R_{n_{(i)}}|^{1/2}\, |\Phi|^{1/2}} \qquad (9.11)

where

E_{MIN} = E_0 - \theta^T x^{MAP}_{(i)} \qquad (9.12)

x^{MAP}_{(i)} = \Phi^{-1}\theta \qquad (9.13)

\Phi = A_{(i)}^T A_{(i)} + \sigma_e^2 R_{n_{(i)}}^{-1} \qquad (9.14)

\theta = -\left( A_{(i)}^T A_{-(i)}\, y_{-(i)} - \sigma_e^2 R_{n_{(i)}}^{-1} y_{(i)} \right) \qquad (9.15)

E_0 = y_{-(i)}^T A_{-(i)}^T A_{-(i)}\, y_{-(i)} + \sigma_e^2\, y_{(i)}^T R_{n_{(i)}}^{-1} y_{(i)} \qquad (9.16)

Equations (9.6) and (9.8) lead to the following straightforward result for the likelihood for no samples being erroneous (i = 0):

p(y \mid i = 0) = \frac{1}{(2\pi\sigma_e^2)^{\frac{N-P}{2}}} \exp\left( -\frac{1}{2\sigma_e^2} E_{MIN} \right) \qquad (9.17)


where

E_{MIN} = y^T A^T A\, y. \qquad (9.18)

The term $x^{MAP}_{(i)}$ (9.13) is the MAP estimate for the unknown data $x_{(i)}$ given the detection state i, and is calculated as a by-product of the likelihood calculation of (9.11)-(9.16). $x^{MAP}_{(i)}$ has already been derived and discussed in the earlier click removal chapter, in section 5.2.3.4. Under the given modelling assumptions for signal and noise, $x^{MAP}_{(i)}$ is clearly a desirable choice for restoration for a given detection state i, since it possesses many useful properties including MMSE (see 4.1.4) for a particular i. Thus we can achieve joint detection and restoration using the Bayes classification scheme by choosing the MAP detection state i and then restoring over the unknown samples for state i using $x^{MAP}_{(i)}$. This is a sub-optimal procedure, but one which is found to be perfectly adequate in practice.

A special case of the noise statistics requires the assumption that the corrupting noise samples $n_{(i)}$ are drawn from a zero-mean white Gaussian noise process of variance $\sigma_n^2$. In this case the noise autocorrelation matrix is simply $\sigma_n^2 I$, with I the $(l \times l)$ identity matrix, and thus $R_{n_{(i)}}^{-1} = \frac{1}{\sigma_n^2} I$ and $|R_{n_{(i)}}| = \sigma_n^{2l}$. In this case make the following modifications to (9.11)-(9.16):

p(y \mid i) = \frac{\mu^l \exp\left( -\frac{1}{2\sigma_e^2} E_{MIN} \right)}{(2\pi\sigma_e^2)^{\frac{N-P}{2}}\, |\Phi|^{1/2}} \qquad (9.19)

and

E_{MIN} = E_0 - \theta^T x^{MAP}_{(i)} \qquad (9.20)

x^{MAP}_{(i)} = \Phi^{-1}\theta \qquad (9.21)

\Phi = A_{(i)}^T A_{(i)} + \mu^2 I \qquad (9.22)

\theta = -\left( A_{(i)}^T A_{-(i)}\, y_{-(i)} - \mu^2\, y_{(i)} \right) \qquad (9.23)

E_0 = y_{-(i)}^T A_{-(i)}^T A_{-(i)}\, y_{-(i)} + \mu^2\, y_{(i)}^T y_{(i)} \qquad (9.24)

where $\mu = \sigma_e/\sigma_n$. This simplification is used in much of the following work and it appears to be a reasonable model for the degradation in many audio signals.
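A direct, unoptimised evaluation of (9.19)-(9.24) might look like the following sketch. The construction of A from the AR coefficients follows the prediction-error interpretation of (9.8) and is an assumption here (section 4.3 is not reproduced); columns of A are partitioned according to i:

```python
import numpy as np

def log_likelihood_state(y, i, a, sigma_e, sigma_n):
    """Log of the state likelihood p(y | i), white-noise case (9.19)-(9.24).

    Returns (log_likelihood, x_map), where x_map is the MAP restoration
    (9.21) of the samples flagged by i.
    """
    y, i, a = np.asarray(y, float), np.asarray(i, int), np.asarray(a, float)
    N, P = len(y), len(a)
    mu = sigma_e / sigma_n
    # prediction-error matrix: row m encodes e = x_{m+P} - sum_k a_k x_{m+P-k}
    A = np.zeros((N - P, N))
    for m in range(N - P):
        A[m, m + P] = 1.0
        A[m, m:m + P] = -a[::-1]
    const = -0.5 * (N - P) * np.log(2 * np.pi * sigma_e**2)
    idx, jdx = np.flatnonzero(i), np.flatnonzero(1 - i)
    if len(idx) == 0:                         # null state: (9.17)-(9.18)
        e = A @ y
        return const - (e @ e) / (2 * sigma_e**2), np.zeros(0)
    Ai, Aj = A[:, idx], A[:, jdx]
    l = len(idx)
    Phi = Ai.T @ Ai + mu**2 * np.eye(l)                        # (9.22)
    theta = -(Ai.T @ (Aj @ y[jdx]) - mu**2 * y[idx])           # (9.23)
    E0 = (Aj @ y[jdx]) @ (Aj @ y[jdx]) + mu**2 * y[idx] @ y[idx]  # (9.24)
    x_map = np.linalg.solve(Phi, theta)                        # (9.21)
    E_min = E0 - theta @ x_map                                 # (9.20)
    log_p = (l * np.log(mu) - E_min / (2 * sigma_e**2)
             + const - 0.5 * np.linalg.slogdet(Phi)[1])        # log of (9.19)
    return log_p, x_map
```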

As derived in section 5.2.2 above, the standard AR-based interpolator effectively discards the observed data over the corrupted sections. We can achieve the same result within our click modelling framework by allowing the noise variance $\sigma_n^2$ to become very large. As the ratio $\mu = \sigma_e/\sigma_n \to 0$ the noise distribution tends to uniform and the corrupted data is ignored. Taking the limit makes the following modifications to equations (9.22)-(9.24):

\Phi = A_{(i)}^T A_{(i)} \qquad (9.25)

\theta = -A_{(i)}^T A_{-(i)}\, y_{-(i)} \qquad (9.26)

E_0 = y_{-(i)}^T A_{-(i)}^T A_{-(i)}\, y_{-(i)} \qquad (9.27)

and in this case $x^{MAP}_{(i)}$ is identical to $x^{LS}_{(i)}$, the standard AR-based interpolation.

9.2.2.4 Reformulation as a probability ratio test

The detector can be reformulated as a probability ratio test: specifically, form the ratio $\Lambda(i)$ of posterior probability for detection state i to that for i = 0, the 'null' hypothesis. Using results (9.4), (9.11) and (9.17) we obtain

\Lambda(i) = \frac{p(i \mid y)}{p(i = 0 \mid y)} = \frac{p(y \mid i)\, p(i)}{p(y \mid i = 0)\, p(i = 0)} \qquad (9.28)

= \frac{p(i)}{p(i = 0)} \cdot \frac{\sigma_e^l \exp\left( -\frac{1}{2\sigma_e^2} \Delta E_{MIN} \right)}{|R_{n_{(i)}}|^{1/2}\, |\Phi|^{1/2}} \qquad (9.29)

which is the ratio of likelihoods for the two states, weighted by the relative prior probabilities $\frac{p(i)}{p(i=0)}$. $\Delta E_{MIN}$ is the difference between $E_{MIN}$ for detection state i (9.12) and $E_{MIN}$ for the null state i = 0 (9.18). The whole expression is perhaps more conveniently written as a log-probability ratio:

\log(\Lambda(i)) = \log\left( \frac{p(i)}{p(i = 0)} \right) + \frac{1}{2}\log\left( \frac{\sigma_e^{2l}}{|R_{n_{(i)}}|} \right) - \frac{1}{2}\log|\Phi| - \frac{1}{2\sigma_e^2}\Delta E_{MIN} \qquad (9.30)

where the only data-dependent term is $\Delta E_{MIN}$; the remaining terms form a constant threshold for a given state i compared with the null state i = 0. We now consider this term in some detail, since it has an intuitively appealing form and will be used later in the chapter. Expansion of $\Delta E_{MIN}$ using (9.18) and (9.12)-(9.16) and some manipulation gives the result

\Delta E_{MIN} = -\left( y_{(i)} - x^{MAP}_{(i)} \right)^T \Phi \left( y_{(i)} - x^{MAP}_{(i)} \right) \qquad (9.31)

which can be seen as the error between the corrupted data $y_{(i)}$ and the MAP data estimate $x^{MAP}_{(i)}$, 'weighted' by $\Phi$. Broadly speaking, the larger the error between the interpolated signal $x^{MAP}_{(i)}$ and the corrupted data $y_{(i)}$, the higher the probability that the data is corrupted.


An alternative form for $\Delta E_{MIN}$ is obtained by expanding $\Phi$ and $x^{MAP}_{(i)}$ using (9.13) and (9.14) to give:

\Delta E_{MIN} = -\left( y_{(i)} - (A_{(i)}^T A_{(i)})^{-1} A_{(i)}^T A_{-(i)} y_{-(i)} \right)^T (A_{(i)}^T A_{(i)})^T\, \Phi^{-1}\, (A_{(i)}^T A_{(i)}) \left( y_{(i)} - (A_{(i)}^T A_{(i)})^{-1} A_{(i)}^T A_{-(i)} y_{-(i)} \right) \qquad (9.32)

= -\left( y_{(i)} - x^{LS}_{(i)} \right)^T (A_{(i)}^T A_{(i)})^T\, \Phi^{-1}\, (A_{(i)}^T A_{(i)}) \left( y_{(i)} - x^{LS}_{(i)} \right) \qquad (9.33)

which shows a relationship with the LSAR interpolation $x^{LS}_{(i)}$ (see (5.15)). The probability ratio form of the Bayes detector may often be a more useful form for practical use, since we have eliminated the unknown scaling factor $p(y)$ from the full posterior probability expression (9.4). Probability ratios are now measured relative to the zero detection, so we will have at least some idea how 'good' a particular state estimate is relative to the zero detection.

9.2.3 Selection of optimal detection state estimate i

As discussed in section 4.1.4, the MAP state selection $i^{MAP}$ will correspond to minimum error-rate detection. In order to identify $i^{MAP}$ it will in general be necessary to evaluate the likelihood according to equations (9.11)-(9.16) for all states i and substitute into the posterior probability expression of (9.4) with some assumed (or known) noise generator prior $p(i)$. The state which achieves maximum posterior probability is selected as the final state estimate, and the corresponding MAP data estimate, $x^{MAP}_{(i)}$, may be used to restore the corrupted samples. This scheme will form the basis of the experimental work in chapter 11. Methods are described there which reduce the number of likelihood evaluations required in a practical scheme.

It is worth noting, however, that the MAP state estimate may not always be the desired solution to a problem such as this. Recall from section 4.1.4 that the MAP state estimate is the minimum-risk state estimate for a loss function which weights all errors equally. In the block-based click detection framework the MAP estimator is equally disposed to make an estimate in which all samples in the block are incorrectly detected as an estimate in which all but one sample is correctly detected. Clearly the latter case is far more acceptable in click detection for audio, and hence we should perhaps consider a modified loss function which better expresses the requirements of the problem. One possibility would be a loss function which increases in dependence upon the total number of incorrectly identified samples, given by $n_e = |i_i - i_j|^2$, where $i_i$ is the estimated state and $i_j$ is the true state. The loss function for choosing $i_i$ when the true state is $i_j$ is then given by

\lambda(\alpha_i \mid i_j) = f(n_e) \qquad (9.34)

where $f(\cdot)$ is some non-decreasing function which expresses how the loss increases with the number of incorrectly detected samples. A further sophistication might introduce some trade-off between the number of false detections $n_f$ and the number of missed detections $n_u$. These are defined such that $n_f$ is the number of samples detected as corrupted which were in fact uncorrupted, and $n_u$ is the number of corrupted samples which go undetected. If $n_f(i_i, i_j)$ is the number of false alarms when state $i_i$ is estimated under true state $i_j$, and likewise for the missed detections $n_u(i_i, i_j)$, a modified loss function might be

\lambda(\alpha_i \mid i_j) = f\big( \beta\, n_f(i_i, i_j) + (1 - \beta)\, n_u(i_i, i_j) \big) \qquad (9.35)

where β represents the trade-off between missed detections and false alarms.

Loss functions which do not lead to the MAP solution, however, are likely to require the exhaustive calculation of (4.19) for all i, which will be impractical. Sub-optimal approaches have been considered and are discussed in the next chapter on sequential detection, but for the block-based algorithm experimentation is limited to estimation of the MAP detection vector.
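For illustration, the loss (9.35) with f taken as the identity (an assumption made here purely for concreteness) reduces to a weighted count of the two error types:

```python
import numpy as np

def detection_loss(i_est, i_true, beta=0.5):
    """Loss (9.35) with f = identity: beta*n_f + (1-beta)*n_u."""
    i_est, i_true = np.asarray(i_est), np.asarray(i_true)
    n_f = np.sum((i_est == 1) & (i_true == 0))  # false alarms
    n_u = np.sum((i_est == 0) & (i_true == 1))  # missed detections
    return beta * n_f + (1 - beta) * n_u
```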

9.2.4 Computational complexity for the block-based algorithm

Computational complexity is largely determined by the matrix inversion of equation (9.13). This will in general be an $O(l^3)$ operation, repeated for each state evidence evaluation (with l varied appropriately). Note, however, that if the detection vector i is constrained to detect only single, contiguous noise bursts, and a 'guard zone' of at least P samples is maintained both before and after the longest noise burst tested, the matrix will be Toeplitz, as for the standard LSAR interpolation. Solution of the equations may then be efficiently calculated using the Levinson-Durbin recursion, requiring $O(l^2)$ floating point operations (see [52, 173]).
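Where that Toeplitz structure holds, a Levinson-type solver can be used in place of a general solve; assuming SciPy is available, the MAP solve (9.21) reduces to the following sketch:

```python
from scipy.linalg import solve_toeplitz

def map_estimate_toeplitz(phi_first_col, theta):
    """Solve Phi x = theta for symmetric Toeplitz Phi (single contiguous
    burst with P-sample guard zones): O(l^2) flops rather than O(l^3)."""
    return solve_toeplitz(phi_first_col, theta)
```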

9.3 Extensions to the Bayes detector

Several useful extensions to the theory developed in this chapter are now outlined.

9.3.1 Marginalised distributions

All of the detection results presented so far are implicitly conditioned on modelling assumptions, which may include parameter values. For the Gaussian AR data case the posterior distributions are conditional upon the values of the AR coefficients a, the excitation variance $\sigma_e^2$, and the noise sample variance $\sigma_n^2$ (when the noise samples are considered to be independent zero-mean Gaussian variates, see (9.19)-(9.24)). If these parameters are unknown a priori, as will usually be the case in real processing situations, it may be desirable to marginalise. The resulting likelihood expressions will then not depend on parameter estimates, which can be unreliable.

The authors are unaware of any analytic result for marginalising the AR coefficients a from the likelihood expression (9.19). It seems unlikely that an analytic result exists, because of the complicated structure of $E_{MIN}$ and $|\Phi|$, both of which depend upon the AR coefficients. Detection is thus performed using estimated AR coefficients (see chapter 11).

Under certain prior assumptions it is possible to marginalise analytically one of the noise variance terms ($\sigma_e^2$ or $\sigma_n^2$). The details are summarised in appendix E. Initial experiments indicate that the marginalised detector is more robust than the standard detector with fixed parameters, but a full investigation is not presented here.

9.3.2 Noisy data

Detection under the 'noisy data' model (9.2) is a simple extension of the work in this chapter. It is thought that use of this model may give some robustness to 'missed' detections, since some small amount of smoothing will be applied in restoration even when no clicks are detected. Recall that corruption is modelled as switching between two noise processes $n_m^0$ and $n_m^1$, controlled by the binary switching process $i_m$. Making the Gaussian i.i.d. assumption for noise amplitudes, the correlation matrix for the noise samples $R_{n|i}$ is equal to a diagonal matrix $\Lambda$, whose diagonal elements $\lambda_j$ are given by

\lambda_j = \sigma_0^2 + i_j(\sigma_1^2 - \sigma_0^2) \qquad (9.36)

where $\sigma_0^2$ and $\sigma_1^2$ are the variances of the white Gaussian noise processes $n_t^0$ and $n_t^1$, respectively. Following a simplified version of the derivation given in appendices D and D.1, the state likelihood is given by:

p(y \mid i) = \frac{\sigma_e^P \exp\left( -\frac{1}{2\sigma_e^2} E_{MIN} \right)}{(2\pi)^{\frac{N-P}{2}}\, |\Lambda|^{1/2}\, |\Phi|^{1/2}} \qquad (9.37)

and

E_{MIN} = E_0 - \theta^T x^{MAP} \qquad (9.38)

x^{MAP} = \Phi^{-1}\theta \qquad (9.39)

\Phi = A^T A + \sigma_e^2 \Lambda^{-1} \qquad (9.40)

\theta = \sigma_e^2 \Lambda^{-1} y \qquad (9.41)

E_0 = \sigma_e^2\, y^T \Lambda^{-1} y \qquad (9.42)
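A direct sketch of (9.36)-(9.42), reusing the assumed prediction-error construction of A from earlier; note the full $(N \times N)$ solve which makes this case expensive:

```python
import numpy as np

def restore_noisy_data(y, i, a, sigma_e, sigma_0, sigma_1):
    """MAP restoration x_MAP under the 'noisy data' model, per (9.36)-(9.42)."""
    y, i = np.asarray(y, float), np.asarray(i, int)
    N, P = len(y), len(a)
    A = np.zeros((N - P, N))
    for m in range(N - P):
        A[m, m + P] = 1.0
        A[m, m:m + P] = -np.asarray(a, float)[::-1]
    lam = sigma_0**2 + i * (sigma_1**2 - sigma_0**2)   # (9.36)
    Phi = A.T @ A + sigma_e**2 * np.diag(1.0 / lam)    # (9.40): N x N system
    theta = sigma_e**2 * (y / lam)                     # (9.41)
    return np.linalg.solve(Phi, theta)                 # (9.39)
```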


This procedure is clearly more computationally expensive than the pure click detection, since calculation of $x^{MAP}$ involves the inversion of an $(N \times N)$ matrix $\Phi$. A further difficulty is foreseen in parameter estimation for the noisy data case. The sequential algorithms presented in the next chapter are, however, easily extended to the noisy data case with a less serious increase in computation.

9.3.3 Multi-channel detection

A further topic which is of interest for click removal (and audio restoration in general) is that of multi-channel sources. By this we mean that two or more sources of the same recording are available. If degradation of these sources occurred after the production process, then defects are likely to be completely different. It would thus be expected that significant processing gains can be achieved through reference to all the sources. Problems in precise alignment of the sources are, however, non-trivial for the general case (see [187, 190]). An example where two closely aligned sources are available is where a monophonic disc recording is played back using stereophonic equipment. It is often found that artefacts in the two resulting channels are largely uncorrelated. The problem of restoration of multiple aligned sources with uncorrelated clicks has been considered from a Bayesian perspective, based on modifications to the theory of this chapter. A summary of the theory is given in [70, appendix G]. The approach is based on a simple linear filtering relationship between signals in the two channels. The resulting detector is of a similar form to the single channel detector of this chapter.

9.3.4 Relationship to existing AR-based detectors

It is shown in [70, appendix H] that the inverse filtering and matched filtering detectors (see section 5.3.1) are equivalent to special cases of the Bayesian detectors derived in this chapter. This equivalence applies when the Bayesian detectors are constrained to detect only single corrupted samples in any given section of data. It is found that the inverse filtering detector corresponds to detection with no 'lookahead', i.e. the Bayes detector is testing for corruption in the last sample of any data block. With P or more samples of lookahead, the matched filtering detector results. These results give some justification to the existing detectors, while also highlighting the reason for their inadequacy when corruption occurs in closely spaced random bursts of many consecutive samples.

9.3.5 Sequential Bayes detection

We have seen that optimal detection for the Bayesian scheme of this chapter will in general require $2^N$ expensive evaluations of a complex likelihood function. This is highly impractical for any useful value of N. Sub-optimal methods for overcoming this problem with a block-based approach are proposed in chapter 11, but these are somewhat unsatisfactory since they rely on a good initial estimate for the detection state i. In a sequential detection framework, state estimates are updated as new data samples are input. It is then possible to treat the detection procedure as a binary tree search in which each new sample can lead to two possible detection outcomes for the new data point (corrupted or uncorrupted). Paths with low posterior probabilities can be eliminated from the search, leading to a great reduction in the number of detection states visited compared with the exhaustive search. Sequential methods are derived in the next chapter and investigated experimentally in chapter 11.

9.4 Discussion

The Bayes restoration system has been developed in a general form. The system has been investigated for AR data with specific noise burst models thought to be realistic for many types of audio degradation. Experimental results and discussion of practical detection-state selection schemes for both artificial and real data are presented in chapter 11. There are limitations to the full Bayesian approach, largely as a result of the difficulty in selecting a small subset of candidate detection vectors for a given data block (see chapter 11). This difficulty means that constraints such as one single contiguous burst of degradation per block have to be applied, which reduce the generality of the method. The recursive algorithms developed in the next chapter are aimed at overcoming some of these problems, and they also place the algorithm in a sequential framework which is more natural for real-time audio applications.

In addition, it has been demonstrated that the existing AR-based detection techniques described in chapter 5 are equivalent to a special case of the full Bayesian detector. As a result of this work, new detectors with a similar form to the existing simple detectors have been developed, with explicit expressions for threshold values [70, appendix I].


10

Bayesian Sequential Click Removal

In this chapter a sequential detection scheme is presented which is founded on a recursive update for state posterior probabilities. This recursive form is more natural for real-time signal processing applications in which new data becomes available one sample at a time. However, the main motivation is to obtain a reliable search method for the optimal detection state estimate which requires few probability evaluations. The sequential algorithm achieves this by maintaining a list of 'candidate' detection estimates. With the input of a new data sample the posterior probability for each member of the list is updated into two new list elements. The first of these new elements corresponds to the detection estimate with the new sample uncorrupted, while the second element is for the corrupted case. If all states in the list are retained after each set of probability updates, the list will grow exponentially in size. The search procedure would then amount to the exhaustive search of a binary tree. We present a reduced search in which states are deleted from the list after each iteration according to posterior probability. If we retain a sufficient number of 'candidate' states at each sampling instant, it is hoped that we will only rarely discard a low probability state which would have developed into the MAP state estimate some samples later.

As for the sequential Bayesian classification schemes of section 4.1.5, the main task is obtaining a recursive update for the state likelihood $p(y \mid i)$, which is derived with the aid of the matrix inversion results given in appendix B. The resulting form is related to the Kalman filtering equations; see section 4.4.


In order to obtain the posterior state probability from the state likelihood, the noise generator prior, $p(i)$, is required. In chapter 9 a Markov chain model was proposed for noise generation. Such a model fits conveniently into the sequential framework here and is discussed in more detail in this chapter.

Given updating results for the state posterior probabilities, reduced state search algorithms are developed. Criteria for retention of state estimates are proposed which result in a method related to the reduced state Viterbi algorithms [193, 6].

The methods derived in this chapter have been presented previously as [83] and patented as [69].

10.1 Overview of the method

We have calculated in the previous chapter a posterior state probability for a candidate detection vector i:

p(i \mid y) \propto p(y \mid i)\, p(i)

and (for the MAP detection criterion) we wish to maximise this posterior probability:

i^{MAP} = \arg\max_{i} \big( p(i \mid y) \big)

Since this is infeasible for long data vectors, we devise instead a sequential procedure which allows a more efficient sub-optimal search. Assume that at time n we have a set of candidate detection vectors $I_n = \{i_n^1, i_n^2, \ldots, i_n^{M_n}\}$, along with their posterior probabilities $p(i_n^j \mid y_n)$, where the subscript n denotes a vector with n data points, e.g. $y_n = [y_1, \ldots, y_n]^T$. These candidates and their posterior probabilities are then updated with the input of a new data point $y_{n+1}$. The scheme for updating from time n to n+1 can be summarised as:

1. At time n we have $i_n^j$ and $p(i_n^j \mid y_n)$, for $j = 1, \ldots, M_n$.

2. Input a new data point $y_{n+1}$.

3. For each candidate detection vector $i_n^j$, $j = 1, \ldots, M_n$, generate two new candidate detection vectors of length n+1 and update their posterior probabilities:

i_{n+1}^{2j-1} = [i_n^{j\,T}\; 0]^T, \qquad p(i_{n+1}^{2j-1} \mid y_{n+1}) = g(p(i_n^j \mid y_n),\, y_{n+1})

i_{n+1}^{2j} = [i_n^{j\,T}\; 1]^T, \qquad p(i_{n+1}^{2j} \mid y_{n+1}) = g(p(i_n^j \mid y_n),\, y_{n+1})


4. Perform a 'culling' step to eliminate new candidates which have low posterior probability $p(i_{n+1} \mid y_{n+1})$.

5. Set n = n + 1 and go to step 1.

Details of the culling step, which maintains the set of candidates at a practicable size, are discussed later in the chapter. The computationally intensive step is the sequential updating of posterior probabilities (step 3), and most of the chapter is devoted to a derivation of this from first principles. At the end of the chapter it is pointed out that the recursive probability update can also be implemented using a Kalman filter.
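A structural skeleton of this update-and-cull loop is sketched below; the probability update is abstracted as a callable `update_prob` (standing in for the recursions derived next), and the culling rule (retain states within a log-probability threshold of the best, capped in number) anticipates section 10.3, with illustrative default values:

```python
def sequential_search(y, update_prob, log_thresh=20.0, max_states=64):
    """Skeleton of the sequential detection search (steps 1-5 above).

    update_prob(i_new, y_n1, old_logp) -> new log posterior for the
    extended state; states are (tuple_of_binary_elements, log_posterior).
    """
    states = [((), 0.0)]
    for y_n1 in y:
        # step 3: extend every candidate with i_{n+1} = 0 and i_{n+1} = 1
        new_states = []
        for i_n, logp in states:
            for bit in (0, 1):
                i_new = i_n + (bit,)
                new_states.append((i_new, update_prob(i_new, y_n1, logp)))
        # step 4: cull states far below the current best; cap the list size
        best = max(lp for _, lp in new_states)
        new_states = [s for s in new_states if s[1] > best - log_thresh]
        new_states.sort(key=lambda s: -s[1])
        states = new_states[:max_states]
    return max(states, key=lambda s: s[1])[0]   # current MAP candidate
```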

10.2 Recursive update for posterior state probability

We now derive a recursive update for the posterior detection state probability, $p(i \mid y)$ (see equation (9.4)). As before, we will be concerned mainly with a recursive update for the state likelihood, $p(y \mid i)$, since the full posterior probability is straightforwardly obtained from this term. A Gaussian AR model is assumed for the data as in the last chapter, and the corrupting noise is assumed white and Gaussian with variance $\sigma_n^2$. These assumptions correspond to the likelihood expression of (9.19)-(9.24).

As in section 4.1.5, a subscript n is introduced for all variables dependent on the number of samples in the processing block. Thus $i_n$ indicates a state detection vector i corresponding to the first n samples of data. The likelihood for state $i_n$ is then rewritten directly from (9.19)-(9.24) as

p(y_n \mid i_n) = \frac{\mu^l \exp\left( -\frac{1}{2\sigma_e^2} E_{MIN\,n} \right)}{(2\pi\sigma_e^2)^{\frac{n-P}{2}}\, |\Phi_n|^{1/2}} \qquad (10.1)

and

E_{MIN\,n} = E_{0n} - \theta_n^T x^{MAP}_{(i)n} \qquad (10.2)

x^{MAP}_{(i)n} = \Phi_n^{-1}\theta_n \qquad (10.3)

\Phi_n = A_{(i)n}^T A_{(i)n} + \mu^2 I \qquad (10.4)

\theta_n = -\left( A_{(i)n}^T A_{-(i)n}\, y_{-(i)n} - \mu^2\, y_{(i)n} \right) \qquad (10.5)

E_{0n} = y_{-(i)n}^T A_{-(i)n}^T A_{-(i)n}\, y_{-(i)n} + \mu^2\, y_{(i)n}^T y_{(i)n} \qquad (10.6)

where $\mu = \sigma_e/\sigma_n$ as before. The recursive likelihood update we require is then written in terms of the previous evidence value and a new input sample $y_{n+1}$ as

p(y_{n+1} \mid i_{n+1}) = g(p(y_n \mid i_n),\, y_{n+1}) \qquad (10.7)


and as before this update can be expressed in terms of the conditional predictive distribution for the new sample $y_{n+1}$ (see (4.32)):

p(y_{n+1} \mid i_{n+1}) = p(y_{n+1} \mid y_n, i_{n+1})\, p(y_n \mid i_n) \qquad (10.8)

since $p(y_n \mid i_n) = p(y_n \mid i_{n+1})$.

With the input of a new sample $y_{n+1}$, each detection state estimate $i_n$ is extended by one new binary element $i_{n+1}$:

i_{n+1} = [\, i_n \;\; i_{n+1} \,]^T \qquad (10.9)

Clearly for each $i_n$ there are two possibilities for the state update, corresponding to $i_{n+1} = 0$ ($y_{n+1}$ uncorrupted) and $i_{n+1} = 1$ ($y_{n+1}$ corrupted). If $i_{n+1} = 0$, sample $y_{n+1}$ is considered 'known' and l (the number of 'unknown' samples) remains the same. Correspondingly, the MAP estimate of the unknown data $x^{MAP}_{(i)(n+1)}$ is of the same length as its predecessor $x^{MAP}_{(i)n}$. In this case it will be seen that the evidence update takes a similar form to that encountered in the sequential model selection scheme discussed in section 4.1.5. However, when $i_{n+1} = 1$ the number of unknown samples increases by one. The evidence update then takes a modified form which includes a change in system 'order' from l to (l+1). The likelihood updates for both $i_{n+1} = 0$ and $i_{n+1} = 1$ are derived in appendix F and summarised in the next section.


10.2.1 Full update scheme.

The full update is now summarised in a form which highlights terms common to both cases $i_{n+1} = 0$ and $i_{n+1} = 1$. For $i_{n+1} = 0$ the update is exactly equivalent to (F.13)-(F.20), while for $i_{n+1} = 1$ the two stages outlined above are combined into one set of operations.

Firstly calculate the terms:

p_n = P_n b_{(i)n} \qquad (10.10)

\kappa_n = 1 + b_{(i)n}^T p_n \qquad (10.11)

\kappa'_n = 1 + \mu^2 \kappa_n \qquad (10.12)

\Psi_n = p_n p_n^T \qquad (10.13)

Now, for $i_{n+1} = 0$:

k_n = \frac{p_n}{\kappa_n} \qquad (10.14)

d_{n+1} = y_{n+1} - b_{-(i)n}^T y_{-(i)n} \qquad (10.15)

\alpha_{n+1} = d_{n+1} - x^{MAP\,T}_{(i)n} b_{(i)n} \qquad (10.16)

x^{MAP}_{(i)(n+1)} = x^{MAP}_{(i)n} + k_n \alpha_{n+1} \qquad (10.17)

\varepsilon_{n+1} = d_{n+1} - x^{MAP\,T}_{(i)(n+1)} b_{(i)n} \qquad (10.18)

P_{n+1} = P_n - \frac{\Psi_n}{\kappa_n} \qquad (10.19)

p(y_{n+1} \mid i_{n+1}) = \frac{p(y_n \mid i_n)}{\sqrt{2\pi\sigma_e^2}}\, \frac{1}{\kappa_n^{1/2}} \exp\left( -\frac{1}{2\sigma_e^2}\, \alpha_{n+1}\, \varepsilon_{n+1} \right) \qquad (10.20)

and, for $i_{n+1} = 1$:

x^{MAP}_{(i)(n+1)} = \begin{bmatrix} x^{MAP}_{(i)n} \\ y_{n+1} - \alpha_{n+1} \end{bmatrix} + \frac{\mu^2 \alpha_{n+1}}{\kappa'_n} \begin{bmatrix} p_n \\ \kappa_n \end{bmatrix} \qquad (10.21)

P_{n+1} = \begin{bmatrix} P_n - \mu^2 \Psi_n/\kappa'_n & p_n/\kappa'_n \\ p_n^T/\kappa'_n & \kappa_n/\kappa'_n \end{bmatrix} \qquad (10.22)

p(y_{n+1} \mid i_{n+1}) = \frac{p(y_n \mid i_n)}{\sqrt{2\pi\sigma_e^2}}\, \frac{\mu}{\kappa_n'^{1/2}} \exp\left( -\frac{1}{2\sigma_e^2}\, \frac{\mu^2}{\kappa'_n}\, \alpha_{n+1}^2 \right) \qquad (10.23)


The update for $i_{n+1} = 1$ requires very few calculations in addition to those required for $i_{n+1} = 0$, the most intensive operation being the $O(l^2)$ operation $(P_n - \mu^2 \Psi_n/\kappa'_n)$.

10.2.2 Computational complexity

We now consider the computational complexity for the full update of (10.10)-(10.23). The whole operation is $O(l^2)$ in floating point multiplications and additions, the respective totals being (ignoring single multiplies and additions) $4l^2 + 5l + n$ multiplies and $3l^2 + 3l + n$ additions. This is a very heavy computational load, especially as the effective block length n, and consequently the number of missing samples l, becomes large. There are several simple ways to alleviate this computational load.

Firstly, observe that $b_n$ (see F.2) contains only P non-zero terms. Hence the partitioned vectors $b_{(i)n}$ and $b_{-(i)n}$ will contain $l_P$ and $P - l_P$ non-zero elements respectively, where $l_P$ ($l_P \le P$) is the number of samples which are unknown from the most recent P samples. Careful programming will thus immediately reduce $O(l^2)$ operations such as (10.10) to a complexity of $O(l \times l_P)$. A further implication of the zero elements in $b_n$ is that the data-dependent terms $\alpha_{n+1}$ and $\varepsilon_{n+1}$ in the evidence updates require only the most recent $l_P$ samples of $x^{MAP}_{(i)n}$ and $x^{MAP}_{(i)(n+1)}$ respectively, since earlier sample values are zeroed in the products $b_{(i)n}^T x^{MAP}_{(i)n}$ and $b_{(i)n}^T x^{MAP}_{(i)n+1}$ (see (10.16) and (10.18)). A direct consequence of this is that we require only to update the last $l_P$ elements of the last $l_P$ rows of $A_{n+1}$. All $O(l^2)$ operations now become $O(l_P^2)$, which can be a very significant saving over the 'brute force' calculation.

10.2.3 Kalman filter implementation of the likelihood update

We have given thus far a derivation from first principles of the sequential likelihood update. An alternative implementation of this step uses the Kalman filter and the prediction error decomposition, as described in section 4.4.1. Here the 'hyperparameter' of the model is the detection vector up to time n, $i_n$, which replaces θ in the prediction error decomposition. We write the AR model in the same state-space form as given in section 5.2.3.5, but setting $Z = H^T$ for all times. Now, to complete the model, set $\sigma_v^2 = \sigma_n^2$ when $i_m = 1$ and $\sigma_v^2 = 0$ when $i_m = 0$. Running the Kalman filter and calculating the terms required for the prediction error decomposition now calculates the state likelihood recursively. The computations carried out are almost identical to those derived in the sequential update formulae earlier in this chapter.


10.2.4 Choice of noise generator prior p(i)

We have derived updates for the detection state likelihood $p(y_n \mid i_n)$. Referring back to (9.4), we see that the remaining term required for the full posterior update (to within a scale factor) is the noise generator prior $p(i_n)$. As stated above, one possibility is a uniform prior, in which case the likelihood can be used unmodified for detection. However, we will often have some knowledge of the noise generating process. For example, in click-degraded audio the degraded samples tend to be grouped together in bursts of 1-30 samples (hence the use of the term noise 'burst' in the previous sections).

One possible model would be a Bernoulli model [174], in which we could include information about the ratio of corrupted to uncorrupted samples in the state probabilities. However, this does not incorporate any information about the way samples of noise cluster together, since samples are assumed independent in such a model. A better model for this might be the Markov chain [174], where the state transition probabilities allow a preference for 'cohesion' within bursts of noise and uncorrupted sections.

The two-state Markov chain model fits neatly into the sequential framework of this chapter, since the prior probability for $i_{n+1}$ depends by definition only on the previous value $i_n$ (for the first order case). We define state transition probabilities $P_{i_n i_{n+1}}$ in terms of:

$$P_{00} = \text{Probability of remaining in state 0 (uncorrupted)} \qquad (10.24)$$

$$P_{11} = \text{Probability of remaining in state 1 (corrupted)} \qquad (10.25)$$

from which the state change transition probabilities follow as $P_{01} = 1 - P_{00}$ and $P_{10} = 1 - P_{11}$. The sequential update for the noise prior is then obtained as

$$p(\mathbf{i}_{n+1}) = P_{i_n i_{n+1}}\, p(\mathbf{i}_n) \qquad (10.26)$$

which fits well with the sequential likelihood updates derived earlier. If we are not prepared to specify any degree of cohesion between noise burst samples we can set $P_{01} = P_{11}$ and $P_{10} = P_{00}$ to give the Bernoulli model mentioned above. In this case all that is assumed is the average ratio of corrupted samples to uncorrupted samples. Many other noise generator priors are possible and could be incorporated into the sequential framework, but we have chosen the Markov chain model for subsequent experimentation since it appears to be a realistic representation of many click-type degradations.
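As an illustration, the prior update (10.26) is conveniently carried out in the log domain. The sketch below is ours (names hypothetical); the transition values are those used for the experiments of section 11.2.1.

    import numpy as np

    P00, P11 = 0.97, 0.873                  # 'cohesion' probabilities
    trans = np.array([[P00, 1.0 - P00],     # trans[j, k] = P_jk
                      [1.0 - P11, P11]])

    def extend_log_prior(log_prior, i_n, i_next):
        # Sequential update (10.26) in the log domain:
        # log p(i_{1:n+1}) = log P_{i_n i_{n+1}} + log p(i_{1:n}).
        return log_prior + np.log(trans[i_n, i_next])

Setting both rows of the transition matrix equal recovers the Bernoulli model described above.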

10.3 Algorithms for selection of the detection vector

The previous sections derived the required recursions for the posterior probability of a candidate detection vector. A scheme must now be devised for selecting the best detection vector in a sequential fashion. Given a set of $M_n$ candidate detection state estimates at sample $n$,

$$I_n = \{\mathbf{i}^1_n, \mathbf{i}^2_n, \ldots, \mathbf{i}^{M_n}_n\} \qquad (10.27)$$

for which detection probabilities $p(\mathbf{i}^j_n \mid \mathbf{y})$ have been calculated, we can select the vector with maximum probability as our current detection estimate. As a new sample $y_{n+1}$ arrives, there are two paths that each member of $I_n$ can take (corresponding to $i_{n+1} = 0$ or $i_{n+1} = 1$). The posterior probabilities for each member of $I_n$ can then be updated for each case according to the sequential updating strategy derived earlier, to give two new detection state estimates at sample number $(n+1)$. We now have a set of $2M_n$ detection vectors, each with a corresponding probability, and the new set $I_{n+1}$ must now be selected from the $2M_n$ vectors, based upon their relative probabilities. The exhaustive search procedure would involve doubling the size of $I_n$ with each increment in $n$, so a 'culling' strategy for deleting less probable state estimates must be employed after each update procedure. Such a scheme cannot guarantee the optimal (MAP) solution, but it may be possible to achieve results which are close to optimal.

For $n$ data samples there are $2^n$ possible detection permutations. However, the nature of a finite order AR model leads us to expect that new data samples will contain very little information about noise burst configurations many samples in the past. Hence the retention of vectors in the current set of candidate vectors $I_n$ with different noise configurations prior to $n_0$, where $n \gg n_0$, will lead to an unnecessarily large $M_n$, the number of candidate vectors in $I_n$. The retention of such vectors will also be detrimental to the selection of noise-burst configurations occurring more recently than sample $n_0$, since important detection vectors may then have to be deleted from $I_n$ in order to keep $M_n$ within practical limits.

Hence it is found useful to make a 'hard decision' for elements of $\mathbf{i}_n$ corresponding to samples prior to $n - d_H$, where $d_H$ is an empirically chosen delay. One suitable way to make this hard decision is to delete all vectors from $I_n$ which have elements prior to sample $n - d_H$ different from those of the most probable element in $I_n$ (i.e. the current MAP state estimate). $d_H$ should clearly depend on the AR model order $P$ as well as the level of accuracy required of the detector. Under such a scheme only the most recent $d_H$ samples of the detection state estimates in $I_n$ are allowed to vary, the earlier values being fixed. Finding the state estimate with maximum probability under this constraint is then equivalent to decoding the state of a finite state machine with $2^{d_H}$ states. Algorithms similar to the Viterbi algorithm [193, 60] are thus appropriate. However, $d_H$ (which may well be e.g. 20-30) is likely to be too large for evaluation of the posterior probability for all $2^{d_H}$ states. We thus apply a 'culling' algorithm, related to the reduced forms of the Viterbi algorithm [6], in which states whose log-probabilities fall more than a certain threshold below that of the most probable member of $I_n$ are deleted. This approach is flexible in that it will keep very few candidate vectors when the detection probability function is strongly 'peaked' around its maximum (i.e. the detector is relatively sure of the correct solution), but in critical areas of uncertainty it will maintain a larger 'stack' of possible vectors until the uncertainty is resolved with more incoming data. A further reduction procedure limits the number of elements in $I_n$ to a maximum of $M_{max}$ by deleting the least probable elements of $I_n$.
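A minimal sketch of one sequential update-and-cull step is given below. The callable log_lik_inc stands in for the sequential likelihood update of section 10.2 and is a hypothetical placeholder, as is every other name; the default threshold (50 dB) and stack size (10) are the values used later in section 11.2.1.

    import numpy as np

    def cull_step(candidates, y_next, log_lik_inc, log_trans,
                  thresh_db=50.0, M_max=10):
        # candidates : list of (i_vec, log_post) pairs -- the current set I_n
        # log_trans  : 2x2 array of log Markov transition probabilities
        branched = []
        for i_vec, lp in candidates:
            for i_next in (0, 1):       # branch on i_{n+1} = 0 and i_{n+1} = 1
                lp_new = (lp + log_trans[i_vec[-1], i_next]
                             + log_lik_inc(i_vec, i_next, y_next))
                branched.append((i_vec + [i_next], lp_new))
        # Delete states more than thresh_db (on a 10*log10 scale) below the best...
        best = max(lp for _, lp in branched)
        cutoff = best - thresh_db * np.log(10.0) / 10.0
        kept = [c for c in branched if c[1] >= cutoff]
        # ...and retain at most M_max of the most probable survivors.
        kept.sort(key=lambda c: -c[1])
        return kept[:M_max]

The hard-decision step (fixing elements older than $n - d_H$ to agree with the MAP candidate) would be applied to the returned set in the same loop.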

10.3.1 Alternative risk functions

The desirability of minimising some function other than the commonly used one-zero risk function is discussed in section 9.2.3. However, as stated in that section, it will usually be necessary to calculate $p(\mathbf{i}_n \mid \mathbf{y}_n)$ for all possible $\mathbf{i}_n$-vectors in order to achieve this goal.

In the present case we have available the set of candidate vectors $I_n$ and their associated posterior probabilities. Suppose that we were certain that $I_n$ contained the true MAP detection state. This is equivalent to assuming that any detection state estimate not present in $I_n$ has zero posterior probability, and correspondingly that there is zero risk associated with not choosing any such state estimate. The total expected risk may then be rewritten as

$$R(\alpha_i \mid \mathbf{y}_n) \propto \sum_{j:\, \mathbf{i}^j \in I_n} \lambda(\alpha_i \mid \mathbf{i}^j)\, P(\mathbf{i}^j \mid \mathbf{y}_n) \qquad (10.28)$$

where we have modified the summation of (4.19) such that only elements of $I_n$ are included.

This risk function can then be evaluated for each element of $I_n$ to give a new criterion for selection of the state estimate. We cannot guarantee that the correct state vector is within $I_n$, so such a risk calculation will be sub-optimal. It is quite possible, for example, that the culling algorithm outlined above deletes a state which would later have a relatively high probability and a low expected risk. The state estimate selected from $I_n$ might then have significantly higher risk than the deleted state. Nevertheless, initial tests with a loss function of the form (9.35) indicate that some trade-off can indeed be made between false alarms and missed detections at an acceptable overall error rate using a risk function of the form (10.28), but a thorough investigation is left as future work.
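As a sketch, selecting a state under the truncated risk (10.28) is a weighted sum over the candidate set; the representation of the candidates and the loss function below are hypothetical.

    def select_min_risk(candidates, loss):
        # candidates : list of (i_vec, post_prob) pairs -- the set I_n
        # loss(i_est, i_true) plays the role of lambda(alpha_i | i_j)
        def risk(i_est):
            return sum(loss(i_est, i_j) * p_j for i_j, p_j in candidates)
        return min((i_vec for i_vec, _ in candidates), key=risk)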

10.4 Summary

In this chapter we have derived a sequential form for Bayesian detection of clicks. Most of the work is concerned with deriving sequential updates for the detection state likelihood, which are based on matrix inversion results similar to those used in RLS and Kalman filtering. In fact, through use of the prediction error decomposition (see section 4.4), an alternative implementation of the whole scheme would be possible using a Kalman filter. In addition we have introduced a Markov chain prior for noise generation, which is used in subsequent experimental work and is designed to model the observed tendency of clicks to correspond to short 'bursts' of corrupted samples rather than isolated impulses in the data. In the next chapter we include implementational details for the Bayes detector and perform some experimental evaluation of the algorithms presented in this and the previous chapter.


11

Implementation and Experimental Results for Bayesian Detection

In this chapter we discuss implementation issues concerned with the Bayesian detection schemes derived in the last two chapters. Experimental results from several implementations of the detector are then presented which demonstrate the performance improvements attainable using the new schemes.

Firstly, the block-based approach of chapter 9 is discussed. As has been suggested earlier, the block-based scheme is awkward to implement owing to the combinatorial explosion in the number of possible detection states as the block size $N$ increases. Sub-optimal search procedures are proposed which will usually bring the number of detection states tested within reasonable computational limits. Some experimental results are presented which show that the Bayesian approach can improve upon existing techniques. It is not proposed that these methods be applied in practice, but we present results to illustrate the workings of the Bayesian principles. The reader who is only interested in the practical algorithm can omit this section.

Secondly, a fuller investigation of performance is made using the more flexible sequential methods of chapter 10. Algorithms for sequential selection of detection states are specified, and experimental results are presented for these algorithms using both artificial and real data sequences.


11.1 Block-based detection

We first consider the block-based detection scheme given in chapter 9, in which the posterior probability for a particular detection state estimate is derived for a block of $N$ corrupted data samples $\mathbf{y}$ under Gaussian AR data assumptions (see (9.4) and (9.11)-(9.16)). This section may be omitted by readers only interested in the practical implementation of the sequential scheme. In the absence of any additional information concerning the noise-burst amplitude distribution we will make the Gaussian i.i.d. assumption, which leads to the evidence result of (9.19)-(9.24). In this section a 'uniform' prior $P(\mathbf{i}) = 1/2^N$ is assumed for the detection state vector $\mathbf{i}$, i.e. for a given data block no particular state detection estimate is initially thought to be more probable than any other. Other priors, based upon the Markov chain model of click degradation, are considered in the section on sequential detection.

11.1.1 Search procedures for the MAP detection estimate

Assuming that we have estimated values for the AR system parameters, it is now possible to search for the MAP detection vector based upon the posterior probability for each $\mathbf{i}$. We have seen that for a given data block there are $2^N$ possibilities for the detection state, making the exhaustive posterior calculation for all $\mathbf{i}$ prohibitive. It is thus necessary to perform a sub-optimal search which restricts the search to a few detection states thought most probable.

It is proposed that an initial estimate for the detection state is obtained from one of the standard detection methods outlined in section 5.3.1, such as the inverse filtering method or the matched filtering alternative. A limited search of states 'close' to this estimate is then made to find a local maximum in the detection probability. Perhaps the simplest procedure makes the assumption that at most one contiguous burst of data samples is corrupted in any particular data block. This will be reasonable in cases where the corruption is fairly infrequent and the block length is not excessively large.

The starting point for the search is taken as an initial estimate given by the standard techniques. Optimisation is then performed by iterated one-dimensional searches in the space of the $\mathbf{i}$-vector. Suppose that the first corrupted sample is $m$ and the last corrupted sample is $n$ (the number of degraded samples is then given by $l = n - m + 1$). Initial estimates $\{m_0, n_0\}$ are obtained from the inverse filtering or matched filter method. $m$ is initially fixed at $m_0$ and $n$ is varied within some fixed search region around its estimate $n_0$. The posterior probability for each $\{n, m_0\}$ combination is evaluated. The value of $n$ which gives maximum posterior probability is selected as the new estimate $n_1$. With $n$ fixed at $n_1$, $m$ is now varied around $m_0$ over a similar search region to find a new estimate $\{m_1, n_1\}$ with maximum probability.


This completes one iteration. The $i$th iteration, starting with estimate $\{m_{i-1}, n_{i-1}\}$ and using a search region of $\pm\Delta$, is summarised as:

1. Evaluate the posterior probability for $m = m_{i-1}$ and $n = (n_{i-1} - \Delta), \ldots, (n_{i-1} + \Delta)$.

2. Choose $n_i$ as the value of $n$ which maximises the posterior probability in step 1.

3. Evaluate the posterior probability for $m = (m_{i-1} - \Delta), \ldots, (m_{i-1} + \Delta)$ and $n = n_i$.

4. Choose $m_i$ as the value of $m$ which maximises the posterior probability in step 3.

More than one iteration may be required, depending on the extent of the search region $\Delta$ and how close the initial estimate $\{m_0, n_0\}$ is to the MAP estimate, as sketched below.
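Here log_post(m, n) is a hypothetical wrapper around the evidence evaluations of (9.19)-(9.24); the rest follows the four steps above.

    def map_search(log_post, m0, n0, delta=3, iters=2):
        # Iterated one-dimensional search for the MAP estimate {m, n};
        # combinations with n < m are assigned zero probability (-inf).
        m, n = m0, n0
        for _ in range(iters):
            n = max(range(n - delta, n + delta + 1),    # steps 1-2
                    key=lambda nn: log_post(m, nn) if nn >= m else float("-inf"))
            m = max(range(m - delta, m + delta + 1),    # steps 3-4
                    key=lambda mm: log_post(mm, n) if mm <= n else float("-inf"))
        return m, n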

An example of this simple search procedure is given in figures 11.1-11.4. In figure 11.1 a real audio signal is shown in which a sequence of samples (34-39) has been corrupted by additive Gaussian noise ($\sigma_n/\sigma_e = 4$). Figure 11.2 shows the magnitude output of both the matched filter and inverse filter detectors (normalised by maximum magnitude). An AR model of order 20 has been used, with parameters (including $\sigma_e$) estimated directly from 1000 samples of the corrupted data by the covariance method.

We see that both filters give high amplitude output in the region of degradation and so may be thresholded to give initial estimates for the corrupted section. Figure 11.3 now shows (scaled) posterior probabilities calculated for one iteration of the one-dimensional search procedure outlined above, using the same data and AR model as before. The evidence formulation of equations (9.19)-(9.24) is used for this and subsequent trials. A uniform noise generator prior was used, which favours no one detection over any other. The dashed line shows step 1 above, in which probabilities are evaluated for a range of $n$ values with $m$ fixed at $m_0 = 31$ (an error of 3 samples). There is a clear peak in probability at $n = 39$, the last degraded sample. The full line shows step 3, in which $n$ is fixed at $n_1 = 39$ and $m$ is varied. Again there is a strong peak at the correct $m$ value of 34. Finally, figure 11.4 shows a mesh plot of probability for all possible $\{m, n\}$ combinations within the range $m: 31$-$37$, $n: 36$-$42$, which shows a strong local maximum at the desired detection estimate $\{34, 39\}$ (note that any combination where $n < m$ is assigned zero probability). The one-dimensional probability plots of the previous figure are effectively just cross-sections of this 3-d plot taken parallel to the $m$ and $n$ axes.

The above example demonstrates typical operation of the Bayesian detection algorithm and highlights some of the advantages obtainable.


In particular, it can be seen that there is some ambiguity in detection using the standard methods (see figure 11.2). It is not possible to select a threshold for this example which will correctly detect the corrupted data without leading to 'false alarms'. The Bayesian detector, on the other hand, gives a clear probability maximum which requires no heuristic thresholding or other manipulations in the detection process. A suitable probability maximum is typically found within 1-2 iterations of the search procedure, involving a total of perhaps 10-20 probability evaluations. Note that false alarms in detection are automatically tested for if we use the probability ratio version of the detector (see (9.28) and following). If no posterior probability ratio evaluated exceeds unity then the 'zero' hypothesis is selected, in which no samples are detected as corrupt.

We have seen that each direct evaluation of posterior probabilities is an $O(l^2)$ operation when the constraint of one contiguous noise burst framed by at least $P$ uncorrupted samples is made. Even using the simple search algorithms discussed above, the detection procedure is considerably more computationally intensive than the standard restoration techniques given in section 5.3.1, which require, in addition to detection, a single interpolation with $O(l^2)$ complexity. The situation is not as bad as it appears, however, since the ordered nature of the search procedures outlined above lends itself well to efficient update methods. In particular, the procedure of increasing the last degraded sample number $n$ by one, or decreasing the first sample $m$ by one, can be performed in $O(l)$ operations using techniques based upon the Levinson recursion for Toeplitz systems (see [86]). Search procedures for the MAP estimate can then be $O(l_{max}^2)$, where $l_{max}$ is the maximum length of degradation tested. Computation is then comparable to the standard methods.

The methods discussed above are somewhat limiting in that we are constrained to one single contiguous degradation in a particular data block. We require in addition a buffer of at least $P$ uncorrupted samples to frame the degradation in order to maintain a Toeplitz system leading to efficient probability evaluations. The procedure can, however, be generalised for practical use. The standard detection methods are used as an 'event' detector. An event in the signal will be one single burst of degradation which can generally be identified from the highest maxima of the detection waveform (see e.g. figure 11.2). Each event identified in this way is then framed by sufficient samples on either side to surround the whole degradation with at least $P$ samples. The sub-block formed by this framing process is then processed using the search procedures given above to give the Bayesian detection estimate and (if required) the restored signal waveform. This method will work well when degradations are fairly infrequent. It is then unlikely that more than one event will occur in any detection sub-block. In cases where multiple events do occur in sub-blocks the above algorithms will ignore all but the 'current' event and sub-optimal performance will result. Although of course more heuristics can be added to improve the situation, this practical limitation of the block-based approach is a significant motivation for the sequential processing methods developed in the last chapter and evaluated later in this chapter.

11.1.2 Experimental evaluation

Some quantitative evaluation of the block-based algorithm's performance is possible if we calculate the probability of error in detection, $P_E$. This represents the probability that the algorithm does not pick precisely the correct detection estimate in a given block of $N$ samples. $P_E$ may be estimated experimentally by performing many trials at different levels of degradation and counting the number of incorrect detections $N_E$ as a proportion of the total number of trials $N_T$:

$$\hat{P}_E = \frac{N_E}{N_T} \qquad (11.1)$$

and for $N_T$ sufficiently large, $\hat{P}_E \to P_E$.

In this set of trials, data is either synthetically generated or extracted from a real audio file, and a single noise burst of random length and amplitude is added at a random position in the data. The posterior probability is evaluated on a 7 by 7 grid of $\{m, n\}$ values such that the true values of $m$ and $n$ lie at the centre of the grid. A correct detection is registered when the posterior probability is maximum at the centre of the grid, while the error count increases by one each time the maximum is elsewhere in the grid. This is a highly constrained search procedure which effectively assumes that the initial estimate $\{m_0, n_0\}$ is close to the correct solution. It does, however, give some insight into detector performance.
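One such constrained trial can be sketched as follows, again in terms of a hypothetical log_post evaluator.

    def grid_trial(log_post, m_true, n_true, half=3):
        # Posterior evaluated on a (2*half+1)^2 grid of {m, n} centred on
        # the truth; the detection is correct iff the maximum is central.
        best, best_mn = float("-inf"), None
        for m in range(m_true - half, m_true + half + 1):
            for n in range(n_true - half, n_true + half + 1):
                lp = log_post(m, n) if n >= m else float("-inf")
                if lp > best:
                    best, best_mn = lp, (m, n)
        return best_mn == (m_true, n_true)

Averaging the returned indicator over many randomised trials gives the estimate (11.1).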

The first result, shown in figure 11.5, is for a synthetic order 10 AR model. Noise burst amplitudes are generated from a Gaussian distribution of the appropriate variance. The length of noise bursts is uniformly distributed between 0 and 15 samples. The full line gives the error probability under ideal conditions when $\mu$ is known exactly. Performance becomes steadily worse from left to right, which corresponds to decreasing noise amplitudes. This is to be expected, since smaller clicks should be harder to detect than large clicks. Error probabilities may seem quite high, but it should be noted that we are using Gaussian noise bursts. A significant proportion of noise samples will thus have near-zero amplitude, making detection by any means a near impossibility. The dashed line gives some idea of the algorithm's sensitivity to the parameter $\mu$. In this case we have fixed the detector at $\mu = 30$ while varying the true value of $\mu$ for the artificially generated noise bursts. Some degradation in performance can be seen, which increases at low noise amplitudes (since the detector's estimate for $\mu$ is most inaccurate for low noise burst amplitudes). Nevertheless performance is broadly similar to that for the ideal case. This result may be important in practice, since $\mu$ is likely to be difficult to estimate.


FIGURE 11.1. True audio signal (dashed line) and signal corrupted with additive Gaussian noise burst (solid line). (Axes: sample number vs. amplitude.)

FIGURE 11.2. Detection sequences using matched (solid line) and inverse filtering (dashed line) detectors - AR(20). (Axes: sample number vs. amplitude.)


FIGURE 11.3. Posterior probabilities for m and n with AR(20) modelling: dotted line - signal waveforms; solid line - posterior probability for m with n fixed at 39; dashed line - posterior probability for n with m fixed at 31. (Axes: sample number vs. amplitude.)

FIGURE 11.4. Mesh plot of posterior probability over a grid of {m, n} combinations (range is m: 31-37, n: 36-42; m and n increase vertically). (Axes: n and m vs. relative probability.)


The dotted line gives the error probabilities obtained using the matched filtering detector over the same set of data samples. The threshold for this detector was hand-tuned to give minimum error probability, but the result is seen to be consistently far worse than that for the Bayes detector. This result is perhaps not as poor as it appears since, while the matched filter hardly ever gives precisely the correct detection estimate owing to filter smearing effects, it is quite robust and usually gives a result close to the correct estimate. A more realistic comparison of the different methods is given later in the work on sequential detection, where the per-sample error rate is measured.

In the second graph, figure 11.6, we test the sensitivity of the detector to the Gaussian assumption for noise bursts. The noise burst amplitudes are now generated from a rectangular distribution, and the detector used is based on the Gaussian assumption with $\mu$ set according to the variance of the rectangular distribution noise samples. Performance is broadly similar to that for Gaussian noise samples, with even some improvement for high noise amplitudes. The dashed line once again gives error probabilities with the estimate of $\mu$ fixed at a value of 30.

The third graph, figure 11.7, shows results obtained using a real audio signal and artificial Gaussian noise bursts. AR parameters are estimated to order 10 by the covariance method from the uncorrupted data. Results are similar to those for the synthetic AR data, indicating that the AR model assumption is satisfactory for detection in real audio signals.

Note that these experiments give only a first impression of the performance of the block-based approach, since the degradation was limited to a single contiguous noise burst. A more useful objective evaluation is possible using the sequential algorithm of subsequent sections. Also, as discussed in section 9.2.3, $P_E$ is not necessarily the best measure of the performance of this type of algorithm. For the sequential algorithms it is possible to obtain meaningful estimates of per-sample error rates, which give a more realistic picture of detector performance.

11.2 Sequential detection

In the last section algorithms have been proposed for block-based Bayesian detection. It has been seen that several constraints have to be placed on detection to give a practically realisable system. In particular we have seen that a good initial estimate for degraded sample locations is required. This may be unreliable or impossible to achieve, especially for very low variance noise samples. The sequential algorithms derived in the last chapter allow a more flexible and general approach which does not rely on other methods or require the assumption of contiguous noise bursts occurring infrequently throughout the data.


FIGURE 11.5. Error probabilities for Bayesian detector as µ is varied, synthetic AR(10) model: solid line - Bayes with true µ value; dashed line - Bayes with µ = 30; dotted line - matched filter detector. (Axes: noise variance ratio vs. error probability, log scales.)

FIGURE 11.6. Error probabilities with uniform click distribution as µ is varied, synthetic AR(10) data: solid line - Bayes with true µ value; dashed line - Bayes with µ = 30; dotted line - matched filter detector. (Axes: noise variance ratio vs. error probability, log scales.)


FIGURE 11.7. Error probabilities for Bayesian detector as µ is varied, real audio data modelled as AR(10): solid line - Bayes detector; dashed line - matched filter detector. (Axes: noise variance ratio vs. error probability, log scales.)

As for the block-based algorithm, it is possible to make some objective evaluation of detection performance for the sequential algorithm by processing data corrupted with synthetically generated noise waveforms. Full knowledge of the corrupting noise process enables measurement of some critical quantities; in particular the following probabilities are defined:

$P_f$ = False alarm probability: the probability of detecting a corrupted sample when the sample is uncorrupted. (11.2)

$P_u$ = Missed detection probability: the probability of not detecting a corrupted sample when the sample is corrupted. (11.3)

$P_{se}$ = Per-sample error rate: the probability of an incorrect detection, whether false alarm or missed detection. (11.4)

These quantities are easily estimated in simulations by measurement of detection performance over a sufficiently large number of samples.

The first set of simulations uses a true Gaussian AR process with known parameters for the uncorrupted data signal, in order to establish the algorithm's performance under correct modelling assumptions. Further trials process genuine audio data for which system parameters are unknown and modelling assumptions are not perfect.

11.2.1 Synthetic AR data

Initially some measurements are made to determine values of the error probabilities defined in equations (11.2)-(11.4) under ideal conditions. An AR model order of 20 is used throughout. The state vector selection, or 'culling', algorithm (see section 10.3) deletes detection states which have posterior probabilities more than 50 dB below the peak value of all the candidates, and limits the maximum number of state vectors to 10 (this is the maximum value after the selection procedure). The hard-decision delay length is fixed at $d_H = P$, the AR model order. The AR model parameters for simulations are obtained by the covariance method from a typical section of audio material in order to give a realistic scenario for the simulations. The synthetic AR data is generated by filtering a white Gaussian excitation signal of known variance $\sigma^2_e$ through the AR filter. The corrupting noise process is generated from a binary Markov source with known transition probabilities ($P_{00} = 0.97$, $P_{11} = 0.873$) (see section 10.2.4), and the noise amplitudes within bursts are generated from a white Gaussian noise source of known variance $\sigma^2_n$. The sequential detection algorithm is then applied to the synthetically corrupted data over $N$ data samples. $N$ is typically very large (e.g. 100000) in order to give good estimates of error probabilities. A tally is kept of the quantities $N_f$ (the number of samples incorrectly identified as corrupted), $N_u$ (the number of corrupted samples not detected as such) and $N_c$ (the number of samples known to be corrupted). For sufficiently large $N$ the error probabilities defined in (11.2)-(11.4) may be reliably estimated as:

$$P_f = \frac{N_f}{N - N_c} \qquad (11.5)$$

$$P_u = \frac{N_u}{N_c} \qquad (11.6)$$

$$P_{se} = \frac{N_u + N_f}{N} \qquad (11.7)$$
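Given the true and estimated detection vectors from such a run, the estimates (11.5)-(11.7) follow directly; the sketch below assumes 0/1 sequences with at least one corrupted sample.

    def error_rates(i_true, i_est):
        # Per-sample tallies over the whole run of N samples.
        N = len(i_true)
        Nc = sum(1 for t in i_true if t)                           # corrupted samples
        Nf = sum(1 for t, e in zip(i_true, i_est) if e and not t)  # false alarms
        Nu = sum(1 for t, e in zip(i_true, i_est) if t and not e)  # missed detections
        return Nf / (N - Nc), Nu / Nc, (Nf + Nu) / N               # Pf, Pu, Pse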

Figures 11.8 and 11.9 illustrate an example of detection and restoration within this simulation environment for $\mu = 3$. In the first (unrestored) waveform, noise bursts can be seen as the most 'jagged' sections in the waveform; these have been completely removed in the lower (restored) waveform.

The graphical results of figure 11.10 show the variation of detection probabilities as the ratio $\mu$ is increased. As should be expected, the probability of undetected corruption $P_u$ increases as $\mu$ increases and noise bursts are of smaller amplitude. For the range of $\mu$ examined, $P_u$ is mostly below $10^{-1}$, rising to higher values for very small noise amplitudes. This may seem a high proportion of undetected samples. However, as noted earlier for the block-based detector, noise amplitudes are generated from a Gaussian PDF, whose most probable amplitude is zero. Hence a significant proportion of noise samples will be of very low amplitude, and these are unlikely to be detected by the detector. This should not worry us unduly, since such samples are least likely to leave any audible signal degradation if they remain unrestored. The rising trend in $P_u$ is the most significant component of the overall error rate $P_{se}$, while $P_f$ is consistently lower.

For comparison purposes, corresponding results are plotted in the same figure for the inverse filtering detector. The error rates plotted correspond to the threshold which gives (experimentally) the minimum per-sample error rate $P_{se}$. The inverse filtering detector is seen to give significantly higher error rates at all but extremely low $\sigma_n$ values, where the two detectors converge in performance (at these extremely low noise amplitudes both detectors are unable to perform good detection and are essentially achieving the 'random guess' error rate).

As for the block-based detector, an important issue to investigate is the sensitivity of detector performance to incorrect parameter estimates, such as the $\mu$ ratio and the Markov source transition probabilities $P_{00}$ and $P_{11}$. Figure 11.12 shows an example of the significant variation of error probabilities as the estimate of $\sigma_n$, $\hat{\sigma}_n$, is varied about its optimum. This variation amounts essentially to a trade-off between $P_u$ and $P_f$ and could be used to tailor the system performance to the needs of a particular problem. Figure 11.13 shows this trade-off in the form of a 'Detector Operating Characteristic' (DOC) (see [186]), with probability of correct detection $(1 - P_u)$ plotted against false alarm rate $(P_f)$. For comparison purposes an equivalent curve is plotted for the inverse filtering detector, in this case varying the detection threshold over a wide range of values. It can be seen that the Bayesian detector performs significantly better than the inverse filtering method. The curve for the Bayesian method follows essentially the same shape as the standard DOC, but turns back on itself for very high $\hat{\sigma}_n$. This may be explained by the nature of the Bayes detector, which is then able to model high amplitude signal components as noise bursts of large amplitude, causing $P_f$ to increase.

Considerably less performance variation has been observed when the Markov transition probabilities are varied, although once again some error-rate trade-offs can be achieved by careful choice of these parameters.

11.2.2 Real data

For real data it is possible to estimate the same error probabilities as for the synthetic case when noise bursts are generated synthetically. The further issues of parameter estimation for $\mathbf{a}$, $\sigma_n$ and $\sigma_e$ now require consideration, since these are all assumed known by the Bayesian detector. The same principles essentially apply as in the block-based case, although it is now perhaps desirable to update system parameters adaptively on a sample-by-sample basis. The autoregressive parameters are possibly best estimated adaptively [93], and one suitable approach is to use the restored waveform for the estimation. In this way the bias introduced by parameter estimation in an impulsive noise environment should be largely avoided. $\sigma_e$ can also be estimated adaptively from the excitation sequence generated by inverse filtering the restored data with the AR filter parameters. $\sigma_n$ may be estimated adaptively as the root mean square value of noise samples removed during restoration.
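A minimal sketch of these two adaptive estimates is given below; all names are ours, and a block-based update is shown rather than the sample-by-sample recursion the text suggests.

    import numpy as np

    def update_noise_stats(x_rest, y, i_est, a):
        # sigma_e: RMS of the excitation obtained by inverse filtering the
        # restored data x_rest with the AR parameters a_1..a_P.
        P, N = len(a), len(x_rest)
        X = np.column_stack([x_rest[P - k:N - k] for k in range(1, P + 1)])
        e = x_rest[P:] - X @ a
        sigma_e = np.sqrt(np.mean(e ** 2))
        # sigma_n: RMS of the noise samples removed during restoration.
        removed = (y - x_rest)[np.asarray(i_est, dtype=bool)]
        sigma_n = np.sqrt(np.mean(removed ** 2)) if removed.size else 0.0
        return sigma_e, sigma_n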

Figure 11.11 shows the same quantities as were measured for the synthetic data case, obtained by processing a short section of wide-band music sampled from a Compact Disc recording (sample rate 44.1 kHz). Markov source probabilities and the AR model order were the same as used for the purely synthetic trials, so conditions for this real data simulation should be reasonably comparable with those results. Performance is seen to be similar to that for the synthetic AR data, except for a slightly higher false alarm rate $P_f$ at low $\mu$ values. This higher false alarm rate may reflect the fact that real data is not ideally modelled as a quasi-stationary AR process, and some transient behaviour in the data is falsely detected as degradation. The corresponding error probabilities are plotted for the inverse filtering AR detector, and they are seen once more to be significantly higher.

Listening tests performed on several different sections of restored data show that an improvement in noise level reduction can be achieved by this restoration method compared with the standard AR filter-based method. In addition, the method is qualitatively judged to remove many more clicks, with lower resultant signal degradation, than when the inverse filtering or matched filtering detection method is used. Where restoration was performed on artificially corrupted data, the restored version was compared with the true undegraded original signal. Restored sound quality was judged almost identical to that of the original material.

11.3 Conclusion

Implementation issues have been discussed for the Bayesian detection methods of the previous two chapters. The block-based methods have been used to evaluate the performance of the Bayesian techniques and their sensitivity to certain system parameters. Practical schemes for implementing block-based detection have been proposed, but it has been seen that several limiting constraints must be applied in order to make the scheme usable. Sequential methods are seen to have more general application and flexibility, and some useful performance measurements have been made.


FIGURE 11.8. Corrupted input waveform for synthetic AR(20) data. (Axes: sample no. vs. amplitude.)

FIGURE 11.9. Restored output waveform for synthetic AR(20) data. (Axes: sample no. vs. amplitude.)


FIGURE 11.10. Error probabilities $P_{se}$, $P_u$ and $P_f$ (Bayes and thresholded detectors) for synthetic AR(20) data as $\sigma_e/\sigma_n$ is varied. (Axes: $\sigma_e/\sigma_n$ vs. probability, log scales.)

FIGURE 11.11. Error probabilities $P_{se}$, $P_u$ and $P_f$ (Bayes and thresholded detectors) for real data (modelled as AR(20)) as $\sigma_e/\sigma_n$ is varied. (Axes: $\sigma_e/\sigma_n$ vs. probability, log scales.)


FIGURE 11.12. Error probabilities $P_{se}$, $P_u$ and $P_f$ for synthetic AR(20) data as $\hat{\sigma}_n$ is varied around the true value ($\sigma_n = 4$, $\sigma_e = 1$) (Bayesian method). (Axes: $\hat{\sigma}_n$ vs. probability, log scales.)

FIGURE 11.13. DOC for AR(20) data as $\hat{\sigma}_n$ is varied around the true value ($\sigma_n = 4$, $\sigma_e = 1$): Bayesian method c.f. thresholded AR. (Axes: $P_f$ vs. $(1 - P_u)$, log scales; the arrow indicates increasing $\hat{\sigma}_n$ for the Bayes AR curve, between the annotated endpoints $\hat{\sigma}_n = 0.5$ and $\hat{\sigma}_n = 32$, and increasing threshold for the thresholded AR curve.)


Substantial improvements are demonstrated over the inverse filtering and matched filtering approaches to detection, which indicate that the extra computational load may be worthwhile in critical applications. The Bayesian detectors tested here are, however, limited in that they assume fixed and known AR parameters, which is never the case in reality. In the final chapter we show how to lift this restriction and use numerical Bayesian methods to solve the joint problem of detection, interpolation and parameter estimation.


12

Fully Bayesian Restoration using EM and MCMC

In this chapter we consider the restoration of click-degraded audio using advanced statistical estimation techniques. In the methods of earlier chapters it was generally not possible to eliminate the system hyperparameters, such as the AR coefficients and noise variances, without losing the analytical tractability of the interpolation and detection schemes; thus heuristic or iterative methods had to be adopted for estimation of these 'nuisance' parameters. The methods used here are formalised iterative procedures which allow all of the remaining hyperparameters to be numerically integrated out of the estimation scheme, thus forming estimators for just the quantities of interest, in this case the restored data. These methods, especially the Monte Carlo schemes, are highly computationally intensive and cannot currently be considered for on-line or real-time implementation. However, they illustrate that fully Bayesian inference, with its inherent advantages, can be performed on realistic and sophisticated models given sufficient computational power. In future years, with the advancements in available computational speed which are continually being developed, methods such as these seem likely to dominate statistical signal processing when complex models are required.

The use of advanced statistical methodology allows more realistic modelling to be applied to problems, and we consider here in particular the case of non-Gaussian impulses (recall that earlier work was restricted to Gaussian impulse distributions, which we will see can lead to inadequate performance). Modelling of non-Gaussianity is achieved here by use of scale mixtures of Gaussians, a technique which allows quite a wide range of heavy-tailed noise distributions to be modelled, and some detailed discussion of this important point is given. Having described the modelling framework for clicks and signal, an expectation-maximisation (EM) algorithm is presented for the interpolation of corrupted audio in the presence of non-Gaussian impulses. This method provides effective results for the pure interpolation problem (i.e. when the detection vector $\mathbf{i}$ is known) but cannot easily be extended to detection as well as interpolation. The Gibbs sampler provides a method for achieving this, and the majority of the chapter is devoted to a detailed description of a suitable implementation of this scheme. Finally, results are presented for both methods. The work included here has appeared previously as [80, 82, 83].

12.1 A review of some relevant work from other fields in Monte Carlo methods

Applications of MCMC methods which are of particular relevance to this current work include Carlin et al. [33], who perform MCMC calculations in a general non-linear and non-Gaussian state-space model; McCulloch and Tsay [128, 129], who use the Gibbs Sampler to detect change-points and outliers in autoregressive data; and, in the audio processing field, O Ruanaidh and Fitzgerald [146], who develop interpolators for missing data in autoregressive sequences.

Take first Carlin et al. [33]. This paper takes a non-linear, non-Gaussian state-space model as its starting point and uses a rejection-based MCMC method to sample the conditional for the state variables at each sampling time in turn. As such this is a very general formulation. However, it does not take into account any specific state-space model structure and hence could converge poorly in complex examples. Our method by contrast expresses the model for noise sources using extra latent variables (in particular the indicator variables $i_t$ and the noise variances $\sigma^2_{v_t}$, see later), which leads to a conditionally Gaussian structure [36, 170] for the reconstructed data $\mathbf{x}$ and certain model parameters. This conditionally Gaussian structure is exploited to create efficient blocking schemes in the Gibbs Sampler [113] which are found to improve convergence properties.

The signal and noise models we use are related to those used by us in chapter 9 and [76, 83], and also to those of McCulloch and Tsay [129] for analysis of autoregressive (AR) time series. Important differences between our MCMC schemes and theirs include the efficient blocking schemes we use to improve convergence (McCulloch and Tsay employ only univariate conditional sampling), our model for the continuous additive noise source $v_t$, which is based on a continuous mixture of Gaussians in order to give robustness to non-Gaussian heavy-tailed noise sources, and the discrete Markov chain prior which models the 'burst'-like nature of typical impulsive processes. A further development is the treatment of the noise distribution 'hyperparameters' as unknowns which can be sampled alongside the other parameters. This allows both the scale parameter and the degrees of freedom parameter for the impulsive noise to be estimated directly from the data.

Finally, the Gibbs Sampling and EM schemes of O Ruanaidh and Fitzgerald [146, 168] consider the interpolation of missing data in autoregressive sequences. As such they do not attempt to model or detect impulsive noise sources, assuming rather that this has been achieved beforehand. Their work can be obtained as a special case of our scheme in which the noise model and detection steps (both crucial parts of our formulation) are omitted.

12.2 Model specification

12.2.1 Noise specification

The types of degradation we are concerned with here can be regarded as additive and localised in time, and may be represented quite generally using a 'switched' noise model, as seen previously in chapters 5 and 9:

$$y_t = x_t + i_t\, v_t \qquad (12.1)$$

where $y_t$ is the observed (corrupted) waveform, $x_t$ is the underlying audio signal and $i_t v_t$ is an impulsive noise process. $i_t$ is a binary (0/1) 'indicator' variable which indicates outlier positions and $v_t$ is a continuous noise process.
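As an illustration, the switched model (12.1) is straightforward to simulate. The sketch below is ours: it uses a two-state Markov chain for $i_t$ (anticipating section 12.3.2, with the transition values quoted in chapter 11) and, purely for simplicity, a fixed-variance Gaussian $v_t$ in place of the scale-mixture noise source developed in the following sections.

    import numpy as np

    rng = np.random.default_rng(0)

    def corrupt(x, P00=0.97, P11=0.873, sigma_v=4.0):
        # y_t = x_t + i_t * v_t, with i_t a two-state Markov chain.
        i = np.zeros(len(x), dtype=int)
        for t in range(1, len(x)):
            stay = P00 if i[t - 1] == 0 else P11
            i[t] = i[t - 1] if rng.random() < stay else 1 - i[t - 1]
        v = rng.normal(0.0, sigma_v, size=len(x))
        return x + i * v, i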

12.2.1.1 Continuous noise source

An important consideration within a high-fidelity reconstruction environment will be the statistics of the continuous noise source, $v_t$. Close examination of the typical corrupted audio waveform in figure 12.15 indicates that the noise process operates over a very wide range of amplitudes within a short time interval. In the case of gramophone recordings this is a result of physical defects on the disk surface which vary dramatically in size, from microscopic surface cracks up to relatively large dust particles adhering to the groove walls and scratches in the medium, while in a communications environment variations in the size of impulses can be attributed to the power of individual noise sources and to their distance from the measuring point. It will clearly be important to model the noise sources in a fashion which is robust to the full range of defect sizes likely to be encountered. In particular, for very small-scale defects it should be possible to extract useful information about the underlying signal from corrupted sample values, while very large impulses should effectively be ignored and treated as if the data were 'missing'.


A natural first assumption for the underlying noise process $v_t$ might be the Gaussian distribution $p(v_t|\sigma^2_v) = N(0, \sigma^2_v)$. Such an assumption has been investigated for degraded audio signals within a Bayesian framework in [76, 79] and chapters 9-11. However, the standard Gaussian assumption is well known to be non-robust to noise distributions which are more 'heavy-tailed' than the Gaussian, even if the variance component $\sigma^2_v$ is made an unknown parameter a priori.

A convenient way to make the noise model robust to heavy-tailed noise sources is to retain the overall Gaussian framework but to allow the variance components of individual noise sources to vary with time index. In this way a reconstruction algorithm can adjust locally to the scale of individual defects, while the computational advantages of working within a linear Gaussian framework are retained. The noise source $v_t$ is thus modelled as Gaussian with time-varying variance parameter, i.e. $p(v_t|\sigma^2_{v_t}) = N(0, \sigma^2_{v_t})$, where $\sigma^2_{v_t}$ is dependent upon the time index $t$.

Under these assumptions the likelihood $p(\mathbf{y}|\mathbf{x},\theta)$ is then obtained from the independent, switched, additive noise modelling considerations (see equation (12.1)). Note that $\theta$ contains all the system unknowns. However, since the noise source is assumed to be independent of the signal, the likelihood is conditionally independent of the signal model parameters. Define as before $\mathbf{i}$ to be the vector of $i_t$ values corresponding to $\mathbf{x}$. Then,

$$p(\mathbf{y}|\mathbf{x},\theta) = \prod_{t:\, i_t=0} \delta(y_t - x_t) \times \prod_{t:\, i_t=1} N(y_t|x_t, \sigma^2_{v_t}) \qquad (12.2)$$

where the $\delta$-functions account for the fact that the additive noise source is precisely zero whenever the indicator switch $i_t$ is zero (see (12.1)). Note that the likelihood depends only upon the noise parameters (including $\mathbf{i}$) and is independent of any other elements in $\theta$.

12.2.2 Signal specification

Many real-life signals, including speech, music and other acoustical signals, have strong local autocorrelation structure in the time domain. This structure can be used for distinguishing between an uncorrupted audio waveform and unwanted noise artefacts. We model the autocorrelation structure in the uncorrupted data sequence $x_t$ as an autoregressive (AR) process (as in earlier chapters) whose coefficients $a_i$ are constant in the short term:

$$x_t = \sum_{i=1}^{P} x_{t-i}\, a_i + e_t \qquad (12.3)$$


where $e_t \sim N(0, \sigma^2_e)$ is an i.i.d. excitation sequence. In matrix-vector form we have, for $N$ data samples:

$$\mathbf{e} = A\mathbf{x} = \mathbf{x}_1 - X\mathbf{a} \qquad (12.4)$$

where the rows of $A$ and $X$ are constructed in such a way as to form $e_t$ in (12.3) for successive values of $t$, and the notation $\mathbf{x}_1$ denotes the vector $\mathbf{x}$ with its first $P$ elements removed. The parametric model for signal values conditional upon parameter values can now be expressed using the standard approximate likelihood for AR data (section 4.3.1 and [21, 155]):

$$p(\mathbf{x}|\theta) = p(\mathbf{x}|\mathbf{a}, \sigma^2_e) = N_N(\mathbf{0}, \sigma^2_e (A^T A)^{-1}) \qquad (12.5)$$

where $N_q(\mu, C)$ denotes as before the multivariate Gaussian with mean vector $\mu$ and covariance matrix $C$ (see appendix A).
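As a sketch, this approximate AR likelihood can be evaluated directly from the excitation sequence (cf. the exponential form (12.6) later); the construction of $X$ below is one possible realisation of the rows described above, and the names are ours.

    import numpy as np

    def ar_matrix_loglike(x, a, sigma_e2):
        # Evaluate log p(x | a, sigma_e2) via e = x1 - X a (eq. 12.4),
        # treating the excitation as i.i.d. N(0, sigma_e2).
        P, N = len(a), len(x)
        X = np.column_stack([x[P - k:N - k] for k in range(1, P + 1)])
        e = x[P:] - X @ a
        return -0.5 * ((N - P) * np.log(2 * np.pi * sigma_e2) + e @ e / sigma_e2)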

It is of course an approximation to assume that the AR parameters and excitation variance remain fixed in any given block of data. A time-varying system might be a more realistic representation of the signal and should certainly be considered in future adaptations to this work (see, e.g., [160] for a possible framework). However, we have found this assumption to be much less critical in practice than the choice of an appropriate impulsive noise model (see section 12.8.1 for comparison with a constant variance Gaussian impulse model), since the signal parameters for typical speech and audio signals typically vary slowly and smoothly with time. A further approximation is the assumption of a Gaussian excitation process $e_t$, which may not hold well for many speech or music examples. It is possible to relax this assumption and allow impulsive elements in the excitation sequence. This issue is not investigated here, since we have found that results are generally robust to the Gaussian assumption. However, some initial work in this area can be found in [72, 81].

12.3 Priors

12.3.1 Prior distribution for noise variances

Within a Bayesian framework, prior distributions can now be assigned to the unknown variance components $\sigma^2_{v_t}$. In principle any distribution which is defined over the positive axis can be chosen; the precise choice will depend on the type of prior knowledge available (if any) and will lead to different robustness properties of the reconstruction procedure. A few possible distributions which express different forms of prior knowledge are listed below:

$$p(\sigma^2_{v_t}) \propto c \qquad \text{(Uniform)}$$

$$p(\sigma^2_{v_t}) \propto 1/\sigma^2_{v_t} \qquad \text{(Jeffreys)}$$

$$p(\log(\sigma^2_{v_t})) = N(\mu_v, s_v) \qquad \text{(Log-normal)}$$

$$p(\sigma^2_{v_t}) = IG(\alpha_v, \beta_v) \propto (\sigma^2_{v_t})^{-(\alpha_v+1)} \exp(-\beta_v/\sigma^2_{v_t}) \qquad \text{(Inverted-gamma)}$$

Uniform or Jeffreys [97] priors might be used when no prior information is available about noise variances. The log-normal prior is a suitable choice when expected noise levels are known or estimable a priori on a decibel scale. The inverted-gamma (IG) prior [98] (see appendix A) is a convenient choice in that vague or specific prior information can be incorporated through suitable choice of $\alpha_v$ and $\beta_v$, while the uniform and Jeffreys priors can be obtained as improper limiting cases. Owing to its flexibility and computational simplicity (see later) the IG prior will be adopted for the remainder of this work. Note that upper and lower limits on the noise variance may well be available from physical considerations of the system and could easily be incorporated by rejection sampling [46, 153] into the Gibbs sampling scheme proposed below.

We have discussed intuitively why combinations of noise models and priors such as these can be expected to lead to robustness. Perhaps a more concrete interpretation is given by studying the marginal distribution $p(v_t)$ which is obtained when the variance term is integrated out, or marginalised, from the joint distribution $p(v_t|\sigma^2_{v_t})p(\sigma^2_{v_t})$. The resulting distributions are scale mixtures of normals, which, for example, West [196] has proposed for modelling of heavy-tailed noise sources. The family of possible distributions which can result from this scale mixture process is large, including such well-known robust distributions as the α-stable, generalised Gaussian and Student-t families [7, 197]. In fact, the Student-t distribution is known to arise as a result of assuming the IG prior (see appendix H.1 for a derivation), and we adopt this as the prior density in the remainder of the chapter, owing both to its convenient form computationally and to its intuitive appeal. The Student-t distribution is a well-known robust distribution which has frequently been proposed for the modelling of impulsive sources (see e.g. [89]). However, it should be noted that the sampling methods described here can be extended to other classes of prior distribution on the noise variances in cases where Student-t noise is not thought to be appropriate. This will lead to other robust noise distributions such as the α-stable class [144], which has received considerable attention recently as a natural extension of the Gaussian framework. α-stable noise distributions, being expressible as a scale mixture of normals [7, 197], can in principle fit into the same framework as we propose here, although the lack of an analytic form for the distribution complicates matters somewhat.
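A small sketch makes the scale-mixture construction concrete: drawing $\sigma^2_{v_t} \sim IG(\alpha_v, \beta_v)$ and then $v_t \sim N(0, \sigma^2_{v_t})$ yields marginally Student-t noise with $2\alpha_v$ degrees of freedom (under the shape/scale parameterisation assumed below; names are ours).

    import numpy as np

    rng = np.random.default_rng(0)

    def student_t_noise(n, alpha_v, beta_v):
        # IG(alpha, beta) draws via the reciprocal of Gamma(alpha, scale=1/beta).
        sigma2 = 1.0 / rng.gamma(alpha_v, 1.0 / beta_v, size=n)
        # Conditionally Gaussian given its own variance; marginally Student-t.
        return rng.normal(0.0, np.sqrt(sigma2))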


12.3.2 Prior for detection indicator variables

The statistics of the continuous noise source $v_t$ have been fully specified. It now remains to assign a prior distribution to the vector $\mathbf{i}$ of noise switching values $i_t$. The choice of this prior is not seriously limited by computational considerations, since any prior which can easily be evaluated for all $\mathbf{i}$ will fit readily into the sampling framework of the next section. Since it is known that impulses in many sources occur in 'bursts' rather than singly (see [70, 79] for discussion of the audio restoration case), we adopt a two-state Markov chain prior to model transitions between adjacent elements of $\mathbf{i}$. A model of this form has previously been proposed in [67] for burst noise in communications channels. The prior may be expressed in the form $p(\mathbf{i}) = p(i_0) \prod_t P_{i_{t-1} i_t}$, in which the clustering of outliers in time is modelled by the transition probabilities $P_{ij}$ of the Markov chain, and $p(i_0)$ is the prior on the initial state. These transition probabilities can be assigned a priori based on previous experience with similar data sets, although they could in principle be 'learnt' by the Gibbs Sampler along with the other unknowns (see later).

12.3.3 Prior for signal model parameters

Assuming that the AR parameter vector $\mathbf{a}$ is a priori independent of $\sigma^2_e$, we assign the simple improper prior $p(\mathbf{a}, \sigma^2_e) = p(\mathbf{a})p(\sigma^2_e) \propto IG(\sigma^2_e|\alpha_e, \beta_e)$, with the parameters $\alpha_e$ and $\beta_e$ chosen so that the IG distribution approaches the Jeffreys limiting case. Note that a very vague prior specification such as this should be adequate in most cases since $\mathbf{a}$ and $\sigma^2_e$ will be well determined from the data, which will contain many hundreds of elements. A slightly more general framework would allow for a proper Gaussian prior on $\mathbf{a}$. We do not include this generalisation here for the sake of notational simplicity, but it should be noted that the general mathematical form of the Gibbs sampler derived shortly remains unchanged under this more general framework.

12.4 EM algorithm

We firstly consider the application of the EM algorithm (see section 4.5) to the pure interpolation problem, i.e. estimation of $\mathbf{x}_{(i)}$ conditional upon the detection indicator vector $\mathbf{i}$ and the observed data $\mathbf{y}$. In this section we work throughout in terms of precision parameters rather than variance parameters, i.e. $\lambda = 1/\sigma^2$, assigning gamma priors rather than inverted gamma. This is purely for reasons of notational simplicity in the derivation, and a simple change of variables $\lambda = 1/\sigma^2$ gives exactly equivalent results based on the corresponding variance parameters. We will revert to the variance form in the subsequent Gibbs sampler interpolation algorithm.


For purposes of comparison, we present the EM algorithm for both the assumed noise model above, with independent variance components for each noise sample, and a common variance model as used in chapter 9, where the noise samples are Gaussian with a single common variance term $\sigma^2_v$.

Substituting $\lambda_e = 1/\sigma^2_e$, the conditional AR likelihood (12.5) is rewritten as

$$p(\mathbf{x}|\mathbf{a}, \lambda_e) = (\lambda_e/2\pi)^{\frac{N-P}{2}} \exp\left(-\lambda_e E(\mathbf{x},\mathbf{a})/2\right) \stackrel{def}{=} AR_N(\mathbf{x}|\mathbf{a}, \lambda_e) \qquad (12.6)$$

where $\lambda_e = 1/\sigma^2_e$ and $E(\mathbf{x},\mathbf{a}) = \mathbf{e}^T\mathbf{e}$ is the squared norm of the vector of excitation values.

The interpolation problem is posed in the usual way as follows. A signal $\mathbf{x}$ is corrupted by additive noise affecting $l$ specified elements of $\mathbf{x}$, indexed by a switching vector $\mathbf{i}$. The data $\mathbf{x}$ is partitioned according to 'unknown' (noise-corrupted) samples $\mathbf{x}_{(i)}$, whose time indices form a set $I$, and the remaining 'known' (uncorrupted) samples, denoted by $\mathbf{x}_{-(i)}$. The observed data is $\mathbf{y}$, and by use of a similar partitioning we may write $\mathbf{y}_{-(i)} = \mathbf{x}_{-(i)}$ and $\mathbf{y}_{(i)} = \mathbf{x}_{(i)} + \mathbf{v}_{(i)}$, where $\mathbf{v} = [v_1, \ldots, v_N]^T$ is the corrupting noise signal. It is then required to reconstruct the unknown data $\mathbf{x}_{(i)}$. Note that the noise variance, the AR coefficients and the AR excitation variance will all be unknown in general, so these will be treated as model parameters, denoted by $\theta$. We use an unconventional formulation of EM in which the parameters $\theta$ are treated as the auxiliary data and the missing data $\mathbf{x}_{(i)}$ is treated as the quantity of interest, i.e. we are performing the following MAP estimation:

$$\mathbf{x}^{MAP}_{(i)} = \operatorname*{argmax}_{\mathbf{x}_{(i)}} p(\mathbf{x}_{(i)}|\mathbf{x}_{-(i)}) = \operatorname*{argmax}_{\mathbf{x}_{(i)}} \int p(\mathbf{x}_{(i)}, \theta|\mathbf{x}_{-(i)})\, d\theta$$

This is the formulation used by ORuanaidh and Fitzgerald [168] for thepurely missing data problem (i.e. with no explicit noise model) but is dis-tinct from say [192, 136], who use EM for estimation of parameters ratherthan missing data. Here the operation of EM is very powerful, enabling usto marginalise all of the nuisance parameters, including the AR model andthe noise statistics.

The derivation of the EM interpolator is given in appendix G. It requires calculation of expected values for the signal and noise parameters, from which the updated missing data estimate x^{i+1}_(i) (iteration i+1) can be obtained from the current estimate x^i_(i) (iteration i), for the independent noise variance model, as follows:

a^i = a_MAP(x^i)     (12.7)

λ^i_e = (α_e + (N−P)/2) / (β_e + E(x^i, a^i)/2)

λ^i_{v_t} = (α_v + 1/2) / (β_v + (v^i_t)²/2),  (t ∈ I)     (12.8)

x^{i+1}_(i) = −Ψ^{−1} ( (λ^i_e A^{iT} A^i + T(x^i))_{(i)−(i)} y_−(i) − M^i y_(i) )     (12.9)

where

a_MAP(x^i) = (X^{iT} X^i)^{−1} X^{iT} x^i_1     (12.10)

M^i = diag(λ_{v_t}; t ∈ I)

Ψ = (λ^i_e A^{iT} A^i + T(x^i))_{(i)(i)} + M^i

Matrix T(x^i) is defined in appendix G, and '(i)(i)' denotes the sub-matrix containing all elements whose row and column numbers are members of I; similarly, '(i)−(i)' extracts the sub-matrix whose row numbers are in I and whose column numbers are not in I. The common variance model is obtained simply by replacing the observation noise expectation step (12.8) with

λ^i_v = (α_v + l/2) / (β_v + Σ_{t∈I} (v^i_t)²/2)

and setting M^i = λ^i_v I.

The first three lines of the iteration simply calculate expected values of the unknown system parameters (AR coefficients and noise precision values) conditional upon the current missing data estimate x^i_(i). These are then used in the calculation of x^{i+1}_(i). Each iteration of the method thus involves a sequence of simple linear estimation tasks. Each iteration is guaranteed to increase the posterior probability of the interpolated data. Initialising with a starting guess x^0_(i), the algorithm is run to convergence according to some suitable stopping criterion.

In section 12.6 some results for the EM interpolator are presented and compared with the Gibbs sampling scheme. We now consider the Gibbs sampler, both for the interpolation problem considered here and also for the more challenging case of joint detection and interpolation of click-degraded data.
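As an illustration of the expectation steps (12.7)-(12.8), the following sketch computes the posterior expectations of the precisions given the current reconstruction (illustrative Python; it assumes an AR coefficient vector a ordered by increasing lag, and omits the update (12.9) for the missing data, which requires the matrices of appendix G).

import numpy as np

def em_expectations(x, a, v, idx, alpha_e, beta_e, alpha_v, beta_v):
    """E-step expectations (12.7)-(12.8): posterior means of the precisions
    given the current reconstruction x and noise estimate v = y - x.
    `a` holds AR coefficients ordered by increasing lag; `idx` indexes the
    corrupted samples (the set I)."""
    N, P = len(x), len(a)
    # AR excitation e_t = x_t - sum_k a_k x_{t-k}, and E(x, a) = e'e
    lagged = np.column_stack([x[P - k: N - k] for k in range(1, P + 1)])
    e = x[P:] - lagged @ a
    lam_e = (alpha_e + (N - P) / 2) / (beta_e + (e @ e) / 2)
    lam_v = (alpha_v + 0.5) / (beta_v + v[idx] ** 2 / 2)   # one per t in I
    return lam_e, lam_v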

12.5 Gibbs sampler

The noise and signal probability expressions given in the previous sections are sufficient to specify the joint posterior distribution to within a constant scale factor (see appendix H.2). The Gibbs sampler now requires the full posterior conditional distributions for each unknown in turn. The standard approach in many applications is to calculate univariate conditionals for each parameter component in turn. This is the approach given in, for example, [33] for non-linear/non-Gaussian state-space analysis, and also in [129] for a statistical outlier model which is closely related to our switched noise model. This sampling approach offers generality in that it is usually feasible to sample from the univariate conditionals even for very sophisticated non-Gaussian models. However, as noted earlier, improved convergence can be expected through judicious choice of multivariate parameter subsets in the sampling steps. In particular, for the models we have chosen it is possible to sample jointly from the switching vector i and the reconstructed data vector x.

12.5.1 Interpolation

Consider firstly the conditional densities with the switching vector i fixed and known. Since no detection process is required to determine which data samples are corrupted, this may be considered as a pure interpolation procedure in which the corrupted samples are replaced by sampled realisations of their uncorrupted values. The methods outlined in this section could of course be used as a stand-alone interpolation technique in cases where impulse locations are known beforehand, although we will usually choose here to perform the interpolation and detection (see next section) jointly within the Gibbs sampling framework.

From this point on we split θ into the switching vector i and the remaining model parameters ω = {a, σ²_e, σ²_{v_t} (t = 0 … N−1)} for the sake of notational clarity. The posterior conditionals are then obtained by straightforward manipulations of the full joint posterior (see appendix H.2) and are summarised as:

p(a | x, i, ω_−(a), y) = N_P( a_MAP(x), σ²_e (XᵀX)^{−1} )     (12.11)

p(σ²_e | x, i, ω_−(σ²_e), y) = IG( α_e + (N−P)/2, β_e + E(x, a)/2 )     (12.12)

p(σ²_{v_t} | x, i, ω_−(σ²_{v_t}), y) = IG( α_v + 1/2, β_v + v_t²/2 )  (t ∈ I);  IG( α_v, β_v )  (otherwise)     (12.13)

p(x_(i) | i, ω, y) = N_l( x^MAP_(i), σ²_e Φ^{−1} )     (12.14)

x_−(i) = y_−(i)     (12.15)

where

a_MAP(x) = (XᵀX)^{−1} Xᵀ x_1

x^MAP_(i) = −Φ^{−1} ( Aᵀ_(i) A_−(i) y_−(i) − σ²_e R^{−1}_{v(i)} y_(i) )

Φ = Aᵀ_(i) A_(i) + σ²_e R^{−1}_{v(i)}

and ω_−(a) denotes all members of ω except a, etc. A similar notation for vector/matrix partitioning to that used for the EM interpolator and earlier chapters should be clear from the context: subscript '(i)' denotes elements/columns corresponding to corrupted data (i.e. i_t = 1), while '−(i)' denotes the remaining elements/columns. The terms v_t required in (12.13) are calculated for a particular x, y and i from (12.1). R_{v(i)} is the covariance matrix for the l corrupting impulses indicated by i, and is diagonal with elements σ²_{v_t} in this case. Note also that when i_t = 0, (12.13) requires that we sample from the prior on σ²_{v_t}, since there is no information available from the data or parameters about this hyperparameter when the impulsive noise source is switched 'off'. It is nevertheless important to carry out this sampling operation, since the detection procedure (see next section) requires sampled values of σ²_{v_t} for all t.

An important point about the method is that all of the sampling steps (12.11)-(12.14) involve only simple linear operations and sampling from multivariate normal and inverted-gamma distributions, for which reliable and fast methods are readily available (see e.g. [26, 46]). The scheme involves multivariate sampling of the reconstructed data x in order to achieve reliable convergence properties, in the same spirit as [161, 79, 146], and indeed the Gibbs sampling interpolator given in [146] can be derived as a special case of (12.11)-(12.14) when the corrupted samples are discarded as 'missing'.
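The inverted-gamma draws required by (12.12) and (12.13), for instance, reduce to reciprocals of gamma draws, as the following sketch shows (illustrative Python, under the parameterisation IG(α, β) ∝ x^{−(α+1)} exp(−β/x) used in this book).

import numpy as np

rng = np.random.default_rng(1)

def sample_ig(alpha, beta, size=None):
    """Draw from IG(alpha, beta) as 1 / Gamma(alpha, rate=beta)."""
    return 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta, size=size)

# Conditional (12.12) for the excitation variance:
#   sigma2_e = sample_ig(alpha_e + (N - P) / 2, beta_e + E_xa / 2)
# Conditional (12.13) for each impulse variance:
#   sigma2_vt = sample_ig(alpha_v + 0.5, beta_v + v_t**2 / 2) when i_t = 1,
#   and a draw from the prior, sample_ig(alpha_v, beta_v), otherwise.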

12.5.1.0.1 Sampling the noise hyperparameters

The above working is all implicitly conditioned upon knowledge of the noise hyperparameters α_v and β_v. Within the Gibbs sampling framework there is no reason in principle why we should not assign priors to these parameters and treat them as unknowns. The required conditional densities and sampling operations can be obtained straightforwardly and are summarised in appendix H.3, along with the form of prior distributions chosen. Whether these variables should be treated as unknowns will depend on the extent of prior information available. For example, when processing very long sections of data such as speech or music extracts, it will be advisable to split the data into manageable sub-blocks in which the signal model parameters can be regarded as constant. However, in such cases it is often reasonable to assume that the highest level noise parameters α_v and β_v remain roughly constant for all blocks within a given extract. Hence it might be best to 'learn' these parameters informally through the processing of earlier data blocks rather than starting afresh in each new data block with very vague prior information. Such a procedure certainly speeds up convergence of the Gibbs sampler, which can take some time to converge if α_v and β_v are randomly initialised and 'vague' priors are used. In order to demonstrate the full capabilities of the methods, however, the results presented later treat α_v and β_v as unknowns which are then sampled with the remaining variables.

12.5.2 Detection

Now consider the remaining unknown, the switch vector i. Sampling for i as part of the Gibbs sampling scheme will allow joint detection of impulse locations as well as reconstruction of clean data values. A straightforward Gibbs sampling approach will sample from the conditional for each univariate i_t element. This is the approach adopted in [129], and it is similar in principle to the variable selection methods for linear regression given in [65].

In [129] univariate sampling is performed from the conditionals for all the elements i_t and v_t (from which the sampled reconstruction can be obtained as x_t = y_t − i_t v_t). This scheme has two drawbacks which will affect convergence of the sampler. Firstly, sampling is univariate even though there are likely to be strong posterior correlations between successive elements of i_t and v_t. Secondly, the scheme involves sampling from the prior for v_t when i_t = 0, since v_t is conditionally independent of the input data in this situation. i_t will then only stand a reasonable chance of detecting a true outlier when v_t happens to take a sampled value close to its 'true' value, with the result that the evolution of i_t with iteration number is rather slow. Note that in equation (12.13) of our scheme it is necessary to sample from the prior on σ²_{v_t} when i_t = 0. However, at this level of the 'hierarchy' the convergence of i_t is less likely to be affected adversely. The drawbacks of univariate sampling can be overcome by sampling from the joint multivariate conditional for x and i. Note that this is distinct from the approach of Carter and Kohn for estimation of mixture noise [35], in which indicator variables (i) are sampled jointly conditional upon state variables (x), and vice versa.

The proposed scheme uses the joint multivariate conditional distribution for x and i, which can be factorised as:

p(x, i | ω, y) = p(x | i, ω, y) p(i | ω, y)     (12.16)

The first term of this has already been given in (12.14). The second term, the so-called reduced conditional, is the conditional distribution for i with x 'integrated out':

p(i | ω, y) = ∫ p(x, i | ω, y) dx     (12.17)


This integral can be performed directly using multivariate normal probability identities, as in [79, equations (10)-(15)] and section 9.2.2.3 of this book, and may be expressed as

p(i | ω, y) ∝ p(y | ω, i) p(i)

where

p(y | ω, i) = (σ²_e)^{l/2} exp( −S² |_{x_(i)=x^MAP_(i)} ) / ( (2πσ²_e)^{(N−P)/2} |R_{v(i)}|^{1/2} |Φ|^{1/2} )     (12.18)

and

S² = Σ_t e_t²/(2σ²_e) + Σ_{t: i_t=1} v_t²/(2σ²_{v_t})     (12.19)

Note that S² in (12.18), which is equivalent to E_MIN in section 9.2.2.3, is evaluated using (12.19) with the conditional MAP reconstruction x^MAP_(i) substituted for x_(i).

Joint conditional sampling of i and x could then be achieved using the method of composition (see e.g. [172]), which involves sampling i from its reduced conditional p(i | ω, y) (12.18), followed by sampling x from its full conditional p(x | i, ω, y), equation (12.14). Equation (12.18) is a multivariate discrete distribution defined for all 2^N possible permutations of i. As the normalising constant c is unknown, direct sampling from this distribution will require 2^N evaluations of (12.18). A suitable scheme is:

1. Evaluate (12.18) for all 2^N possible permutations of i.

2. Normalise the sum of probabilities to unity and calculate the cumulative distribution for i.

3. Draw a random deviate u from a uniform distribution in the range 0 < u < 1.

4. Select the value of i whose cumulative probability is closest to u on the positive side (i.e. greater than u).

Clearly such a sampling approach cannot be adopted in practice for any useful value of N, so some other scheme must be adopted. As stated earlier, MCMC methods allow a great deal of flexibility in the details of implementation. Issues such as which subsets of the parameters to sample and the precise order and frequency of sampling are left to the designer. These factors, while not affecting the asymptotic convergence of the sampler, will have great impact on short term convergence properties.

There is less computational burden in the full multidimensional sampling operation for x_(i) (12.14), which is dominated by l-dimensional matrix-vector operations, so this can be performed on an occasional basis for the whole block, in addition to the sub-block sampling described in a later section. This should eliminate any convergence problems which may arise as a result of posterior correlation between the x_t's for a given i.

12.5.3 Detection in sub-blocks

There are many possible adaptations to the direct multivariate sampling of (12.18) which will introduce trade-offs between computational complexity per iteration and the number of iterations before convergence. Here we adopt a Gibbs sampling approach which performs the sampling given by (12.16) in small sub-blocks of size q, conditional upon switch values i_t and reconstructed data x_t from outside each sub-block. A compromise can thus be achieved between the computational load of operations (12.18) and (12.14) and convergence properties. An alternative scheme has recently been proposed for general state-space models in [36], in which individual elements i_t are conditionally sampled directly from the reduced conditional, given by equation (12.17) in our case. This is an elegant approach, but quite complex to implement, and we leave a comparative evaluation as future work.

If we denote a particular sub-block of the data with length q samples by x_(q) and the corresponding sub-block of detection indicators by i_(q), then the required joint conditional for x_(q) and i_(q) is factorised as before:

p(x_(q), i_(q) | x_−(q), i_−(q), ω, y) = p(x_(q) | x_−(q), i, ω, y) p(i_(q) | x_−(q), i_−(q), ω, y)     (12.20)

where x_−(q) and i_−(q) are vectors containing the N − q data points and switch values, respectively, which lie outside the sub-block. The joint sample {x_(q), i_(q)} is then obtained by composition sampling for i_(q) and x_(q) as described above.

The required full and reduced conditional distributions for x_(q) and i_(q) are obtained as an adaptation of the results for the complete data block (equations (12.18) and (12.14)). Since both distributions in (12.20) are conditioned upon the surrounding reconstructed data points x_−(q), it is equivalent for the purposes of conditional sampling to consider x_−(q) as uncorrupted input data. In other words, we can artificially set y_−(q) = x_−(q) and fix i_−(q) = 0 to obtain:

p(y | i_(q), x_−(q), i_−(q), ω) = p(y_(q), y_−(q) = x_−(q) | i_(q), i_−(q) = 0, ω)     (12.21)

and

p(x_(q) | x_−(q), i, ω, y) = p(x_(q) | x_−(q), i_(q), i_−(q) = 0, ω, y_(q), y_−(q) = x_−(q))     (12.22)

This is a convenient form which allows direct use of equations (12.18) and (12.14) without any rearrangement. We now use the standard proportionality between joint and conditional distributions (see appendix H.2):

p(i_(q) | i_−(q), ω, y) ∝ p(i | ω, y)     (12.23)

p(x_(q) | x_−(q), i, ω, y) ∝ p(x | i, ω, y)     (12.24)

to obtain the required results:

p(i_(q) | x_−(q), i_−(q), ω, y) ∝ p(y | i, ω)|_{(i_−(q)=0, y_−(q)=x_−(q))} p(i)     (12.25)

and

p(x_(q) | x_−(q), i, ω, y) ∝ p(x | i, ω, y)|_{(i_−(q)=0, y_−(q)=x_−(q))}     (12.26)

Thus equations (12.18) and (12.14) are used directly to perform sub-block sampling. The scheme is, then:

1. Sample the reduced conditional for i_(q) using (12.25) and (12.18). i_(q) is a binary vector of length q samples, so 2^q probability evaluations must be performed for each sub-block sample.

2. Sample the conditional for x_(q) using (12.26) and (12.14), with the value of i_(q) obtained in step 1. This involves drawing a random sample from the multivariate Gaussian defined in (12.14). Since i_−(q) is constrained to be all zeros, the dimension of the Gaussian is equal to the number of non-zero elements in i_(q) (a maximum of q).

Note that in the important case q = 1 these sampling steps are very efficient, requiring no matrix inversion/factorisation and only O(P) operations per sub-block sample.
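A sketch of the composition step for a single sub-block is given below (illustrative Python; log_reduced_conditional and sample_x_given_i are hypothetical callables standing for an evaluation of (12.18) times p(i) and a draw from (12.14) respectively, whose linear algebra is not reproduced here).

import itertools
import numpy as np

rng = np.random.default_rng(2)

def sample_subblock(q, log_reduced_conditional, sample_x_given_i):
    """Composition sampling for one sub-block: enumerate all 2^q switch
    configurations, sample i_(q) from the normalised reduced conditional,
    then sample x_(q) from its full conditional given the chosen i_(q)."""
    configs = list(itertools.product([0, 1], repeat=q))
    logp = np.array([log_reduced_conditional(c) for c in configs])
    p = np.exp(logp - logp.max())          # stabilise before normalising
    p /= p.sum()
    i_q = configs[rng.choice(len(configs), p=p)]
    x_q = sample_x_given_i(i_q)
    return i_q, x_q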

12.5.3.0.2 Sampling the Markov Chain transition probabilities

In the same way as the noise hyperparameters were treated as unknowns in the last section, the Markov chain transition probabilities P01 and P10 can be sampled as part of the detection procedure. The details, summarised in appendix H.3, are straightforward and involve sampling from beta densities, which is a standard procedure [46]. We have observed that convergence of the algorithm is successful when the noise hyperparameters α_v and β_v are known a priori. However, when processing real audio data with these parameters treated as unknowns (as described in appendix H.3), performance was not good and the sampler chose very low values for P00 = 1 − P01 and high values for P11 = 1 − P10. This is most likely a problem of modelling ambiguity: the sampler can either fit a very heavy-tailed noise model at most data points (setting i_t = 1 nearly everywhere) or it can fit a high amplitude noise model at a few data points. It appears to choose the former, which is not unreasonable since there will always be some low-level white noise at every data point in a real audio signal. We are not attempting to model this low-level noise component in this work, however (see [81, 72] for discussion of a related approach to this problem). In practice, then, we fix the transition probabilities to appropriate values for the type of impulsive noise present on a particular recording. We have found the methods to be fairly insensitive to the precise values of these parameters. For example, the values P00 = 0.93 and P11 = 0.65 have given successful results in all of the examples we have processed to date. A more elaborate approach (not pursued here) might attempt to estimate the transition probabilities from a 'silent' section of audio data in which no speech or music signal is present.

12.6 Results for EM and Gibbs sampler interpolation

Results are presented for a simple interpolation of a scratch-degraded audio signal, figure 12.1. The region of interpolation is indicated by the vertical dotted lines, which includes the scratch and a 'guard zone' of uncorrupted samples on either side.

Figure 12.2 shows the results of interpolations using the Gibbs-estimated posterior mean (over 50 iterations following convergence) and the EM-estimated posterior mode (after 10 iterations). Both iterative schemes were informally judged to have converged in fewer than 10 iterations for this example. Figure 12.2(a) shows the missing data interpolator [192, 146], in which the corrupted samples are discarded prior to estimation. This interpolator makes a reasonable estimate, but clearly will not fit an interpolate close to samples in the guard zone. The common variance model (figure 12.2(b)) does a better job in the samples preceding the scratch, but is more wayward afterwards. Clearly the variance term is being biased upwards by some very high scratch amplitudes. The independent variance model (figure 12.2(c)) is able to fit the data in the guard zone almost perfectly, while performing a good interpolation of the main scratch, and even removing some very low amplitude noise material following the scratch. Thus we can expect greater fidelity to the source material with this method, and this is borne out by a 'brighter' sound to material restored in this way. There is not much difference to be seen in performance between the Gibbs and EM interpolation methods.


FIGURE 12.1. Corrupted audio waveform (78rpm disk)

12.7 Implementation of Gibbs sampler detection/interpolation

12.7.1 Sampling scheme

The Gibbs sampler as implemented for joint detection and interpolation can be summarised as follows:

1. Assign initial values to the unknowns: x⁰, i⁰, ω⁰ = {a⁰, (σ²_e)⁰, (σ²_{v_t})⁰ (t = 0 … N−1)}.

2. Repeat for i = 0 … N_max − 1:

   (a) Set ω^{i+1} = ω^i, i^{i+1} = i^i, x^{i+1} = x^i.

   (b) Draw samples for the unknown parameters:

       i. a^{i+1} ∼ p(a | i^{i+1}, x^{i+1}, ω^{i+1}_−(a), y) (see (12.11))

       ii. (σ²_e)^{i+1} ∼ p(σ²_e | i^{i+1}, x^{i+1}, ω^{i+1}_−(σ²_e), y) (see (12.12))

       iii. (σ²_{v_t})^{i+1} ∼ p(σ²_{v_t} | i^{i+1}, x^{i+1}, ω^{i+1}_−(σ²_{v_t}), y), (t = 0 … N−1) (see (12.13))

   (c) Draw samples from the reconstructed data and switch values:

       i. For all (non-overlapping) sub-blocks x_(q) of length q samples in x, and i_(q) in i:

          A. i^{i+1}_(q) ∼ p(i_(q) | x^{i+1}_−(q), i^{i+1}_−(q), ω^{i+1}, y) (see (12.25))

          B. x^{i+1}_(q) ∼ p(x_(q) | x^{i+1}_−(q), i^{i+1}, ω^{i+1}, y) (see (12.26))

       ii. If ((i mod M) = 0): x^{i+1} ∼ p(x | i^{i+1}, ω^{i+1}, y) (see (12.14))

   (d) (Optional) Sample the top level hyperparameters: α_v, β_v, P01, P10 (see appendix H.3).

3. Calculate histograms and Monte Carlo estimates of posterior functionals (see (4.62)). For example, a Monte Carlo estimate of the posterior mean is given by:

   E[x | y] ≈ ( Σ_{i=N0+1}^{N_max} x^i ) / (N_max − N0)

   where N0 is the number of iterations before convergence is achieved.

Note that the multivariate sampling operation of step 2(c)ii is carried out here once every M iterations.

FIGURE 12.2. Interpolations using various noise models and AR(40) (panels: (a) missing data interpolation; (b) common variance interpolation; (c) independent variance interpolation; traces: corrupted, Gibbs, EM)
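The scheme above translates directly into code. The following skeleton (illustrative Python; the draws object bundles hypothetical implementations of the conditional samplers (12.11)-(12.14), (12.25) and (12.26), whose linear algebra is detailed in the text) shows the control flow and the final Monte Carlo averaging.

import numpy as np

def gibbs_restore(y, draws, N_max, N0, q, M):
    """Skeleton of the sampling scheme of section 12.7.1. `draws` bundles
    hypothetical implementations of the conditional samplers: steps
    2(b)i-iii for the parameters, (12.25)/(12.26) for sub-blocks and
    (12.14) for the full data vector."""
    N = len(y)
    x = y.copy()                        # starting guess for the data
    i = np.zeros(N, dtype=int)          # all switches initially 'off'
    samples = []
    for it in range(N_max):
        omega = draws.parameters(x, i)            # steps 2(b)i-iii
        for start in range(0, N, q):              # step 2(c)i
            blk = slice(start, min(start + q, N))
            i[blk] = draws.switches(blk, x, i, omega)   # (12.25)
            x[blk] = draws.data(blk, x, i, omega)       # (12.26)
        if it % M == 0:                           # step 2(c)ii
            x = draws.full_data(i, omega)         # (12.14)
        samples.append(x.copy())
    # Step 3: Monte Carlo estimate of the posterior mean (MMSE estimate)
    return np.mean(samples[N0:], axis=0)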

12.7.2 Computational requirements

MCMC methods are generally considered to be highly computationally intensive, with typical applications requiring perhaps multiple runs, each with hundreds of iterations. Such schemes are clearly impractical in many engineering applications, particularly where large amounts of data must be processed. Such computational extremes will not, however, be necessary in cases where only point estimates of the data are required rather than a full analysis of all the posterior distributions. In the examples considered here it will be seen that useful results can be achieved after very few iterations, particularly if robust starting points are chosen for critical parameters. Nevertheless, each iteration of the scheme described in section 12.7.1 is itself quite intensive. Taking the items in the same order as section 12.7.1, we have:

• Assigning initial values: x⁰, i⁰ and ω⁰ (step 1). This step need not take any significant computation, since the unknowns can be assigned arbitrary random starting values. However, as discussed in the next section, it is beneficial to convergence of the algorithm if a reasonable starting point can be obtained using some other technique. The computation involved will usually be much smaller than the subsequent iterations of the Gibbs sampler.

• Sampling the AR parameters: a (step 2(b)i). This step requires a sample from a P-dimensional Gaussian distribution, given in equation (12.11). The most significant computation here is a square-root factorisation of the covariance matrix, which can be achieved using, for example, the Cholesky decomposition (complexity O(P³)). Details of techniques for sampling multivariate Gaussians can be found in, for example, [46] or [168, pages 207-8].


• Noise variances: σ²_e (step 2(b)ii) and σ²_{v_t} (step 2(b)iii). The conditionals for all the noise variance terms are of inverted-gamma form (see steps 2(b)ii and 2(b)iii above and appendix A.4). Sampling from the inverted-gamma distribution is straightforward (see e.g. [168], pp. 137-8 and [153]) and requires very little computation compared to other parts of the algorithm.

• Joint sampling of switch and data values (step 2(c)i). This step, as detailed in section 12.5.2, requires firstly the evaluation of the reduced conditional (equation (12.25)) for all 2^q possible values of the current detection sub-block i_(q), where q is the size of the sub-block. Each such evaluation requires the calculation of x^MAP_(i), involving the solution of an l_q × l_q matrix-vector equation (l_q is the number of corrupted samples indicated by a particular i_(q)). This can be solved using a Cholesky decomposition of Φ, which then facilitates the sampling of x_(q). The detection step rapidly becomes infeasible for all but small values of q. However, the experimental results show that small values of q give acceptable performance. In particular, the case q = 1 leads to a very efficient scheme which requires only O(NP) calculations for one complete sweep of all the sub-blocks. Note that q = 1 retains the important advantage of sampling jointly for data values x_t and switch values i_t.

Having sampled i_(q) from its reduced conditional, the joint sample is completed by sampling x_(q) from its full conditional (12.26). The significant calculations for this operation will already have been carried out while sampling i_(q), since both stages require x^MAP_(i) and the Cholesky decomposition of the matrix Φ.

The precise balance of computational requirements will depend on the choice of sub-block size q, which in turn has a strong impact on the number of iterations required for convergence. In the absence of concrete results on the rate of convergence for the scheme, it is probably best to tune the sub-block size q by trial and error to give acceptable performance for a given application.

• Sampling of the complete data vector (interpolation): x (step 2(c)ii). Joint sampling of the complete data vector x conditional upon all of the other unknowns is achieved using equation (12.14). As for the AR parameters, this involves sampling from a multivariate Gaussian distribution, this time of dimension l, where l is the number of impulse-corrupted data points indicated by i. Direct sampling, requiring matrix Cholesky factorisation, can be very intensive (O(l³)) for large l. This is significantly reduced, however, by observing that the matrix Φ is quite sparse in structure, owing to the Markov property of the assumed AR model. This sparse structure can be explicitly accounted for in the calculation of the Cholesky decomposition, which leads to a dramatic reduction in the computation (see [168], p. 137 for discussion of the closely related problem of missing data interpolation). Possibly the best and most flexible way to take advantage of this structure, however, is by use of the standard state-space form for an AR model (see e.g. [5], pp. 16-17). This then allows the use of the Kalman filter-smoother [5] for sampling of the whole data vector. Recent methods for achieving this efficiently in general state-space models can be found in [35, 61, 42]. In fact we implemented [42] because of its speed and efficient memory usage. Computations are then reduced to approximately O(lP²).
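All of the multivariate Gaussian draws above follow the same Cholesky pattern; a minimal sketch for a draw from N(m, σ²_e Φ^{−1}), as required by (12.14), follows (illustrative Python using numpy and scipy; the sparse and state-space refinements discussed above are not shown).

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(3)

def sample_gaussian(m, Phi, sigma2_e):
    """Draw from N(m, sigma2_e * Phi^{-1}). With Phi = L L' (Cholesky),
    m + sqrt(sigma2_e) * L'^{-1} z has the required covariance, since
    cov(L'^{-1} z) = (L L')^{-1} = Phi^{-1} for z ~ N(0, I)."""
    L = np.linalg.cholesky(Phi)
    z = rng.standard_normal(len(m))
    return m + np.sqrt(sigma2_e) * solve_triangular(L.T, z, lower=False)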

12.7.2.1 Comparison of computation with existing techniques

The computational requirements of the Gibbs sampling method can be compared with the detection and interpolation techniques of Vaseghi and Rayner [84, 191]. Here we have processing steps which are closely related in functionality and computation.

The parameter estimation stage of [191] involves estimation of AR parameters for the corrupted data. If a covariance-based AR parameter estimation method [119] is used then the computations will be of a similar complexity to the AR parameter sampling step in the Gibbs sampler (note, however, that an autocorrelation method of estimation [119] would make this step faster for the simple scheme). Furthermore, the detection step in [191] is quite closely related to the sampling of the detection vector and data vector in the Gibbs sampler. In particular, the computations are very similar (both O(NP) operations for one complete sweep through the data) when a sub-block length of q = 1 is chosen for the Gibbs sampler (the relationship between simple detectors and Bayesian detection schemes is discussed further in [70], appendix H). Finally, a close relationship can be found between the computation of the least squares AR-based interpolator in [191] and the Bayesian interpolation scheme. The computational requirements here are almost identical when the least squares-based interpolation is performed jointly over all corrupted samples in the data block. The only item not present in the simple scheme but required for the Gibbs sampler is the estimation of noise hyperparameters, including noise variances. These can be neglected as they take up a very small fraction of the overall computation in the Gibbs sampler.

Overall, then, one iteration of the Gibbs sampler has roughly the same complexity as one complete pass of the simple detector/interpolator, with some small overheads in setting up the equations and drawing random variates. The chief difference lies in the large number of iterations required for the Gibbs sampler, compared with perhaps fewer than five passes required for the simple scheme.


12.8 Results for Gibbs sampler detection/interpolation

12.8.1 Evaluation with synthetic data

In order to demonstrate the operation of the proposed methods, and for some comparison with other related techniques under idealised conditions, we consider firstly a synthetic AR sequence with N = 300 data points; see figure 12.3. The poles of the AR process have been selected arbitrarily, with order p = 6, and the excitation is white Gaussian noise with unit variance. The noise source v_t is Gaussian, with independent variance components generated by sampling from an IG(0.3, 40) process. The switching process i_t is a first order Markov chain with transition probabilities defined by P00 = 0.93 and P11 = 0.65. The resulting corrupted waveform is shown in figure 12.4, with the highest amplitude impulses not shown on the graph.
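A data set of this kind can be generated as in the following sketch (illustrative Python; the pole radii and angles are those quoted with figure 12.9, and the particular realisation will of course differ from the one plotted in the text).

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(4)

N, p = 300, 6
# AR(6) poles at radii 0.99, 0.9, 0.85 and angles 0.1*pi, 0.3*pi, 0.7*pi
poles = np.array([0.99, 0.9, 0.85]) * np.exp(1j * np.pi * np.array([0.1, 0.3, 0.7]))
a = np.real(np.poly(np.concatenate([poles, poles.conj()])))  # AR polynomial

x = lfilter([1.0], a, rng.standard_normal(N))    # unit-variance excitation

# Markov-switched impulsive noise with IG(0.3, 40) variance components
P00, P11 = 0.93, 0.65
i = np.zeros(N, dtype=int)
for t in range(1, N):
    stay = P00 if i[t - 1] == 0 else P11
    i[t] = i[t - 1] if rng.random() < stay else 1 - i[t - 1]
sigma2 = 1.0 / rng.gamma(0.3, 1.0 / 40.0, size=N)   # IG(0.3, 40) draws
v = rng.normal(0.0, np.sqrt(sigma2)) * i
y = x + v                                           # corrupted waveform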

The Gibbs sampler was run under two noise modelling assumptions. The first model is the independent variance case, using the variance sampling methods given in (12.13). This corresponds to the true noise generation process and so can be considered a somewhat idealised scenario. P00 and P11 are fixed at their true values, while all other parameters are sampled as unknowns. α_v and β_v are reparameterised as m_v = β_v/α_v and α_v to give a more physically meaningful model (see appendix H.3). The transformed variables are treated as a priori independent in the absence of further knowledge. α_v is treated as a discrete variable with a uniform prior, taking on values uniformly spread over the range 0.1 to 3, while m_v is assumed independent of α_v and given a very vague gamma prior G(10⁻¹⁰, 10⁻¹⁰). The second noise model assumes Gaussian noise with constant (unknown) variance, but is otherwise identical to the first. This corresponds to the impulse models investigated in [79], with the generalisation that noise variances and AR parameters are unknown. A prior IG(10⁻¹⁰, 10⁻¹⁰) is used for the common noise variance term. This noise model can be expected to give poorer results than the first, since it does not reflect the true noise generation process.

Both models were run for 1000 iterations of the Gibbs sampler following a burn-in period of 100 iterations. The burn-in period was chosen as the number of iterations required for convergence of the Markov chain, as judged by informal examination of the sampled reconstructions. This figure is intentionally over-generous and was obtained after running the sampler with many different realisations of the corrupted input data. The detection vector i was initialised to all zeros, while other parameters were initialised randomly. A minimal sub-block size q = 1 was found to be adequate for the joint sampling of the switch and reconstructed data values (step 2(c)i), while multivariate sampling of the data elements (step 2(c)ii) was performed once per iteration in sub-blocks of 100 samples.


MMSE reconstruction estimates were obtained as the arithmetic mean of the sampled reconstructions x^i following the burn-in period. For the independent variance model, results are shown in figure 12.5. As hoped, comparison with the true signal shows accurate reconstruction of the original data sequence. Where significant errors can be seen, as in the vicinity of sample number 100, the impulsive noise can be seen to take very high values. As a result, the algorithm must effectively treat the data as 'missing' at these points. The corresponding result from the common variance model can be seen in figure 12.6. While the large impulses have all been removed, many smaller scale disturbances are left untouched. As expected, the common variance model has not been able to adjust locally to the scale of the noise artefacts.

While the reconstructed data is strictly all that is required in a restoration or enhancement framework, there is much additional information which can be extracted from the parameter and noise detection samples. These will assist in the assessment of convergence of the Markov chain, allow data analysis and parameter estimation, and may ultimately help in the development of improved methods.

Consider, for example, samples drawn from the AR parameters, a^i. Pole positions calculated from these sampled values can be plotted as scatter diagrams, which gives a useful pictorial representation of the posterior pole position distribution. This is shown for both noise models in figures 12.9 and 12.10. The true poles are at radii 0.99, 0.9 and 0.85, with angles 0.1π, 0.3π and 0.7π, respectively. The scatter diagram for the independent variance model shows samples clustered tightly around the most resonant pole (radius 0.99), with more posterior uncertainty around the remaining poles. The common variance model, however, indicates pole positions well away from their true values, since small impulses are still present in the data. Pole positions calculated from the MMSE AR parameter estimates are also indicated on both plots.

The excitation variance samples for the independent variance noise model are displayed in figures 12.11 and 12.12. Plotted against iteration number (figure 12.11), the variance can be seen to converge rapidly to a value of around 1.4, which is quite close to the true value of unity, while the histogram of samples following burn-in indicates that the variance is fairly well-determined at this value.

An important diagnostic is the detection vector i. The normalised histogram of i^i samples gives an estimate of the posterior detection probabilities at each data sample. Regions of impulses are clearly indicated in figure 12.7, and a procedure which flags an impulse for all samples with probability values greater than 0.5 yields an error rate of 0.2% for this data sequence. By comparison, the common variance model gave an error rate of 5% (figure 12.8). Error rates are measured as the percentage of samples mis-classified by the detector.


Finally, histograms for α_v and m_v are shown in figures 12.13 and 12.14. The true values are 0.3 and 40/1.3 respectively, obtained from the IG(0.3, 40) process which generated the noise variances. α_v is well-centred around its true value, which is typical of all the datasets processed. m_v exhibits a much wider spread, centred at a value much larger than its true value. This can probably be accounted for by the relatively small number of noise samples used in the sampling of m_v. In any case, the precise value of m_v within roughly an order of magnitude does not seem to have great impact on reconstruction results, which is a desirable consequence of the hierarchical noise modelling structure used.

12.8.2 Evaluation with real data

For evaluation with real data, the method was tested using data digitally sampled at 44.1kHz and 16-bit integer resolution from degraded gramophone recordings containing voice and music material. The autoregressive parameters are assumed fixed for time intervals up to 25ms; hence data block lengths N of around 1100 samples are used. Processing for listening purposes can then proceed sequentially through the data, block by block, re-initialising the sampler for each new block and storing the reconstructed output data sequentially in a new data file. Continuity between blocks is maintained by using an overlap of P samples from block to block and fixing the first P elements of restored data in a new data block to equal the last P restored elements from the previous block. This is achieved by fixing the first P elements of the switch vector i in the new block to zero and the first P elements of the new data vector y to equal the last P restored output data points from the most recent block. The detection algorithm is then only operated upon elements P … N−1 of i. A sketch of this block-sequential bookkeeping is given below.
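(Illustrative Python; restore_block is a hypothetical wrapper around one run of the Gibbs sampler on a single block, holding its first n_fixed samples, and the corresponding switch values, fixed.)

import numpy as np

def process_extract(y, N, P, restore_block):
    """Block-sequential restoration with a P-sample overlap. Each new block
    has its first P samples pinned to the last P restored samples of the
    previous block; detection then operates only on elements P..N-1."""
    out = list(restore_block(y[:N].copy(), n_fixed=0))
    pos = N - P                            # next block overlaps by P samples
    while pos + P < len(y):
        block = y[pos: pos + N].copy()
        block[:P] = out[-P:]               # continuity constraint
        restored = restore_block(block, n_fixed=P)
        out.extend(restored[P:])
        pos += len(block) - P
    return np.array(out)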

Figures 12.16-12.18 show results from running one instance of the Gibbs sampler for a single block of classical music, digitised from a typical noisy 78rpm recording (figure 12.15 shows this data block). An AR model order P = 30 was chosen, which is adequate for representation of moderately complex classical music extracts. As for the synthetic example, the switch values i_t were all initialised to 'off', i.e. no impulses present, the AR parameters and excitation variance were initialised by maximum likelihood (ML) from the corrupted data, and the noise variances were sampled randomly from their prior distribution. An informative gamma prior G(2, 0.0004) was assigned to m_v, a distribution whose mean value of 5000 was roughly estimated as the mean noise variance obtained from processing earlier blocks of the same dataset. A uniform prior was assigned to α_v over the range 0.1 to 3. The Markov chain probabilities were fixed at P00 = 0.93 and P11 = 0.65, which had been found to be reasonable in earlier data.

The sampler was run for N_max = 1000 iterations, with a 'burn-in' period of N0 = 100 iterations. Figure 12.16 shows the MMSE estimate reconstruction and figure 12.17 shows the estimated detection probabilities. While most samples appear to have a small non-zero posterior probability of being an outlier (since there must be some continuous measurement noise present in all samples), time indices corresponding to high probabilities can be identified clearly with 'spikes' in the input waveform, figure 12.15. The detection procedure thus appears to be working as we would expect for the real data. All major defects visible to the naked eye have been rejected by the algorithm. Histograms for the noise variance parameters are given in figures 12.18 and 12.19.

Some qualitative assessment of various possible sampling schemes can be obtained by examining the evolution with time of σ²_e, since this parameter will be affected by impulses at any position in the waveform. The first 20 iterations of σ²_e are displayed in figures 12.20(a) and (b) under various sub-block sampling schemes, initialising i to be all zeros. Under our proposed scheme, elements of i_t and x_t are sampled jointly in sub-blocks of size q. This is contrasted with the independent sampling scheme used in [129], adapted here to use our noise model and applied also in sub-blocks. Figure 12.20(a) shows the comparison for the minimal sub-block size q = 1. The joint sampling scheme is seen to converge significantly faster than the independent scheme, which did not in fact reach the 'true' value of σ²_e for many hundreds of iterations. In figure 12.20(b) the sub-block size is 4. Once again convergence is significantly faster for the joint sampling method, although the independent method does this time converge successfully after approximately 10 iterations. Note that the initial rate of change in σ²_e does not depend strongly on the block length chosen. This is a result of the additional reconstruction operation (12.14), which is performed each iteration in large sub-blocks and helps to reduce the dependency on q. Thus we recommend that a small value of q be used in a practical situation. The convergence time of the independent sampling scheme was also found to be far less reliable than that of the joint scheme, since it can take many iterations before very large impulses are first detected under the independent scheme. The differences in convergence demonstrated here will be a significant factor in typical applications such as speech and audio processing, where speed is of the essence.

12.8.3 Robust initialisation

Initialisation of the sampler can be achieved in many ways but, if the method is to be of practical value, we should choose starting points that lead to very rapid and reliable convergence. Possibly the most critical initialisation is for the detection vector i. Results presented thus far used a completely 'blind' initialisation with all elements of i⁰ set to zero. Such a scheme shows the power of the sampling algorithm acting in 'stand-alone' mode. An alternative scheme might initialise i⁰ to a robust estimate obtained from some other simple detection procedure. A rough and ready initial value for i is obtained by thresholding the estimated AR excitation sequence corresponding to the corrupted data and initial ML parameter estimates [191, 84], with a threshold set low enough to detect all sizeable impulses. One can then perform a least squares AR interpolation of the corrupted samples as in [191, 84] and then estimate (σ²_e)⁰ and a⁰ by standard maximum likelihood from the interpolated sequence. This gives the algorithm a good starting point from which the detection vector i will converge very rapidly. Such an initialisation was used for processing of long sections of data where fast operation is essential. Convergence times were informally observed to be significantly reduced, with usable results obtained within the first 5-10 iterations. Take for example a synthetic example generated with the same modelling parameters as the example of section 12.8.1. We compare convergence properties for the proposed robust initialisation and for a random initialisation (parameters generated at random, x⁰ set equal to y). Figure 12.21 shows the evolution with time of the innovation variance σ²_e. This is plotted on a log-scale to capture the wide variations observed in the randomly initialised case. Note that in the random case σ²_e was initialised as a uniform deviate between 0 and 100, so the huge values plotted are not just the result of a very unfortunate initialisation for this parameter alone. It is clear that from the robust starting point the sampler settles down at least twice as quickly as for random initialisation. Note also that the actual squared error between the long-term MMSE estimates from the samplers and the true data (available for this synthetic example) settled down to around 0.66 per sample in both cases, indicating that both initialisations give similar performance in the long term.
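A minimal version of this residual-thresholding initialisation is sketched below (illustrative Python; the ML estimation of the AR coefficients a from the corrupted data, and the choice of threshold, are left to the surrounding application).

import numpy as np

def robust_init(y, a, threshold):
    """Rough initial detection: threshold the AR prediction residual of the
    corrupted data (computed with initial ML parameter estimates a, ordered
    by increasing lag), flagging samples with large excitation magnitude.
    The threshold should be low enough to catch all sizeable impulses."""
    P = len(a)
    lagged = np.column_stack([y[P - k: len(y) - k] for k in range(1, P + 1)])
    e = y[P:] - lagged @ a                 # estimated excitation sequence
    i0 = np.zeros(len(y), dtype=int)
    i0[P:] = (np.abs(e) > threshold).astype(int)
    return i0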

12.8.4 Processing for audio evaluation

From a computational point of view it will not be possible to run the sampler for very many iterations if the scheme is to be of any practical use with long sections of data. Longer extracts processed using the sampler were thus limited to 15 iterations per data block. Restorations were taken as either the mean of the last 5 reconstructions or the last sampled reconstruction from the chain. This latter approach is appropriate for listening purposes, since it can be regarded as a typical realisation of the restored data (see [161, 146, 168] for discussion of this point). This is an interesting matter, since most methods aim to find some estimator which minimises a mathematical risk function, such as a MAP or MMSE estimator. Here, however, we recognise that such estimators can sometimes lead to results which are atypical of the signals concerned. In short, they have to make a conservative choice of reconstruction in order to minimise the expected risk associated with the estimator. In audio processing this often manifests itself in reconstructions which are oversmooth compared to typical signals. A random sample drawn from the posterior distribution, on the other hand, will have a lower posterior probability than a MAP estimate but exhibits characteristics more typical of the signal under consideration, and is therefore preferable for listening purposes. Informal evaluation in fact shows that either the sampled scheme or the MMSE scheme leads perceptually to very high quality restorations with the models used here. The criteria for evaluation are the amount of reduction in audible clicks/crackles, etc. and the degree of audible distortion (if any) to the programme material itself compared with the corrupted input. The processed material is rendered almost entirely free from audible clicks and crackles in one single procedure, with minimal distortion of the underlying audio signal quality. The same degree of click reduction is not achievable by any other single procedure known to us for the removal of impulses from audio signals, and certainly not without much greater distortion of the sound quality.

12.8.5 Discussion of MCMC applied to audio

Markov chain Monte Carlo methods allow for inference about very sophisticated probabilistic models which cannot easily be addressed using deterministic methods. Their use here allows for joint parameter estimation and reconstruction for signals in the presence of non-Gaussian switched noise processes, which are encountered in many important applications. The algorithms developed are computationally intensive, but give high quality results which we believe are not attainable using standard techniques. In particular, compared with earlier methods such as [191, 79], these techniques can reliably detect and remove impulses occurring on many different amplitude scales in one single procedure. We have tested the methods using synthetic data and examples obtained from corrupted voice and music recordings. It is hoped, however, that the methods are sufficiently general and robust to find application in a wide range of other areas, from communication systems to statistical data processing.

In this chapter we have provided only one example of the application of MCMC to audio signals. The methods can, however, be applied to any of the other applications described in the book. In particular, it is relatively straightforward to extend the work of this chapter to performing noise reduction as well as click removal, and to extend the signal models to ARMA rather than just AR. These areas are addressed in some of our recent papers [72, 73, 81]. For further applications in image enhancement, non-linear system modelling and hidden Markov models, see [74, 105, 75, 106, 177, 178, 179, 182, 181, 180, 8, 50, 49].

FIGURE 12.3. Synthetic AR data (p = 6)

FIGURE 12.4. Noisy AR data

FIGURE 12.5. Reconstructed data (independent variance noise model)

FIGURE 12.6. Reconstructed data (common variance noise model)

FIGURE 12.7. Detection probabilities (independent variance noise model)

FIGURE 12.8. Detection probabilities (common variance noise model)

FIGURE 12.9. Reconstructed AR poles (independent variance noise model)

FIGURE 12.10. Reconstructed AR poles (common variance noise model)

FIGURE 12.11. Excitation variance samples (independent variance noise model)

FIGURE 12.12. Excitation variance histogram (independent variance noise model)

FIGURE 12.13. Histogram of α_v samples (independent variance noise model)

FIGURE 12.14. Histogram of m_v samples (independent variance noise model)

FIGURE 12.15. Noisy audio data

FIGURE 12.16. Reconstructed audio data

FIGURE 12.17. Estimated detection probabilities (real data)

FIGURE 12.18. Histogram of α_v samples (real data)

FIGURE 12.19. Histogram of m_v samples (real data)

FIGURE 12.20. Convergence of excitation variance under different sampling strategies

FIGURE 12.21. Comparison of robust and random initialisations

13
Summary and Future Research Directions

In this book a wide range of techniques has been presented for the reduction of degradation in audio recordings. In Part II an attempt was made to provide a comprehensive overview of existing methods in click and hiss reduction, while also presenting some new developments and interpretations of standard techniques. The emphasis in the section on click removal was on model-based methods, which provide the opportunity to incorporate signal-specific prior information about audio. This is believed to be one of the key ways to improve on current technology and is the approach adopted throughout Part III on advanced methods. The section on background noise reduction, however, concentrates on spectral domain methods which can be regarded as non-parametric; this is a result of the history of the subject, in which spectral domain methods have been applied almost without exception. Nevertheless, it is believed that model based methods will form a part of the future for hiss reduction. The difficulty here is that hiss reduction, being a global degradation, is a much more subtle and sensitive procedure than click removal, and model choice can be a critical consideration. Experimentation with standard speech based time series models such as the AR and ARMA models has led to some promising results for hiss reduction and joint hiss/click reduction, especially when the fully Bayesian methods of chapter 12 are employed [81, 72, 73], but it seems that to exceed the performance of the best spectral domain methods more realistic models will have to be used. These should include elements of non-stationarity, non-Gaussianity and the ability to model musical transients (note 'attacks', decays, performance noises) as well as steady state tonal sections. Combinations of elementary signals, such as wavelets or sinusoids, with stochastic models (AR, ARMA, etc.) may go some way to providing the solutions, and have already proved quite successful in high level modelling for coding, pitch transcription and synthesis [158, 169, 165, 194].

In Part III we present research work carried out over the last ten yearsinto various aspects of digital audio restoration. The approach is generallymodel based and Bayesian, ranging from empirical Bayes methods whichrely upon sub-optimal prior parameter estimation for efficiency (chapters7-11) through to highly computationally intensive fully Bayesian methodswhich apply Monte Carlo simulation techniques to the optimal extractionof signals from noise (chapter 12). We present novel solutions to the areasof automatic pitch defect correction, an area where we are not aware ofany other work, and for correction of low frequency noise pulses causedby breakages and deep scratches on gramophone discs and optical soundtracks. Finally we present a range of Bayesian techniques for handling clickand crackle removal in a very accurate way.

13.1 Future directions and new areas

We have already stated that realistic signal modelling and sophisticatedestimation procedures are likely to form the basis of future advances inaudio restoration. A further area which has not been mentioned in depth isthe incorporation of human perceptual criteria [137, 24] into the restorationprocess. It is well known that standard optimality criteria such as the mean-squared error (MSE) or MAP criterion are not tuned optimally to thehuman auditory system. Attempts have been made to incorporate someof the temporal and frequency domain (simultaneous) masking propertiesinto coding and noise reduction systems, see chapter 6. However, these useheuristic arguments for incorporation of such properties. We believe that itmay be possible to formulate a scheme using perceptually-based Bayesiancost functions, see chapter 4, combined with a Monte Carlo estimationmethod. This would formalise the approach and should show the potentialof perceptual methods, which may as yet not have been exploited to thefull. This is a topic of our current research.

We have covered in this text a very wide range of audio defects. Onearea which has scarcely been touched upon is that of non-linear distortion,a defect which is present in many recordings. This might be caused by sat-uration of electronics or magnetic recording media, groove deformation andtracing distortion in gramophone recordings or non-linearity in an analoguetransmission path, which gives rise to unpleasant artefacts in the sound ofthe recording which should ideally be corrected. In general we will not haveany very specific knowledge of the distortion mechanisms, so a fairly gen-eral modelling approach might be adopted. Many models are possible fornon-linear time series, see e.g. [176]. In our initial work [130, 177, 179, 180]

Page 291: Digital Audio Restoration - a statistical model based approach

13.1 Future directions and new areas 273

we have adopted a cascade model in which the undistorted audio xt ismodelled as a linear autoregression and the non-linear distortion processas a ‘with memory’ polynomial non-linear autoregressive (NAR) filter con-taining terms up to a particular lag p and order q, so that the observedoutput yt can be expressed as:

yt = xt +

p∑

i1=1

i1∑

i2=1

bi1i2 yt−i1 yt−i2 +

p∑

i1=1

i1∑

i2=1

i2∑

i3=1

bi1i2i3 yt−i1 yt−i2 yt−i3

+ . . .+

p∑

i1=1

i1∑

i2=1

. . .

iq−1∑

iq=1

bi1...iq yt−i1 . . . yt−iq (13.1)

The model is illustrated in figure 13.1. The problem here is to identify thesignificant terms which are present in the non-linear expansion of equation13.1 and to neglect insignificant terms, otherwise the number of terms inthe model becomes prohibitively large. Results are very good for blindrestoration of audio data which are artificially distorted with a NAR model(see e.g. figure 13.2). However, the models do not appear to be adequatefor many of the distortions generally encountered in audio. The study ofappropriate non-linear models would seem to be a fruitful area for futureresearch. More specific forms of non-linear distortion are currently beingstudied, including clipping/pure saturation and coarsely quantised data.These are much more readily modelled and some progress is expected withsuch problems in the near future.

ARy

NARe x

FIGURE 13.1. Block diagram of AR-NAR model

Page 292: Digital Audio Restoration - a statistical model based approach

274 13. Summary and Future Research Directions

Dis

tort

edIte

ratio

n 10

0

0 100 200 300 400 500 600 700 800 900 1000

Itera

tion

3000

Time

FIGURE 13.2. Example of AR-NAR restoration: Top - Non-linearly distortedinput (solid), Undistorted input (dotted); Middle - Restored data (100 iterationsof Gibbs sampler; Bottom - Restored data (3000 iterations of Gibbs sampler)

To conclude, we have presented here a set of algorithms which are basedon principles which are largely new to audio processing. As such, this workmight be regarded as a ‘new generation’ of audio restoration methods.The potential for improved performance has been demonstrated, althoughthere is a corresponding increase in computational complexity. Computingpower is still increasing rapidly, however, especially with the availabilityof high speed parallel DSP systems. In the same way that the first waveof restoration algorithms (see Part II) can now be implemented in real-time on standard DSP processors, it is hoped that the new generation ofalgorithms presented in Part III may soon be implemented in real-time onnew processing platforms.

Page 293: Digital Audio Restoration - a statistical model based approach

This is page 275Printer: Opaque this

Appendix A

Probability Densities and Integrals

A.1 Univariate Gaussian

The univariate Gaussian, or normal, density function with mean µ andvariance σ2 is defined for a real-valued random variable as:

N(x|µ, σ2) =1√

2πσ2e−

12σ2 (x−µ)2 (A.1)

Univariate normal density

A.2 Multivariate Gaussian

The multivariate Gaussian probability density function (PDF) for a columnvector x with N real-valued components is expressed in terms of the meanvector mx and the covariance matrix Cx = E[(x − mx)(x − mx)T ] as:

NN (x|m,Cx) =1

(2π)N/2 | Cx |1/2exp

(

−1

2(x − mx)

TCx

−1 (x − mx)

)

(A.2)

Multivariate Gaussian density

Page 294: Digital Audio Restoration - a statistical model based approach

276 Appendix A. Probability Densities and Integrals

An integral which is used on many occasions throughout the text is ofthe general form:

I =

y

exp

(

−1

2

(

a+ bT y + yTC y)

)

dy (A.3)

where dy is interpreted as the infinitesimal volume element:

dy =

N∏

i=1

dyi

and the integral is over the real line in all dimensions, i.e. the single inte-gration sign should be interpreted as:

y

≡∫ ∞

y1=−∞

. . .

∫ ∞

yN=−∞

For non-singular symmetric C it is possible to form a ‘perfect square’ forthe exponent and express I as:

I =

y

exp

(

−1

2

(

(y − my)T C (y − my))

)

exp

(

−1

2

(

a − bTC−1b

4

))

dy (A.4)

where

my = −C−1b

2

Comparison with the multivariate PDFof A.2 which has unity volumeleads directly to the result:

y

exp

(

−1

2

(

a+ bTy + yTC y)

)

dy

=(2π)N/2

| C |1/2exp

(

−1

2

(

a − bTC−1b

4

))(A.5)

Multivariate Gaussian integral

This result can be also be obtained directly by a transformation whichdiagonalises C and this approach then verifies the normalisation constantgiven for the PDFof A.2.

Page 295: Digital Audio Restoration - a statistical model based approach

A.3 Gamma density 277

A.3 Gamma density

Another distribution which will be of use is the two parameter gammadensity G(α, β), defined for α > 0, β > 0 as

G(y|α, β) =βα

Γ(α)yα−1 exp(−βy) (0 < y <∞) (A.6)

Gamma densityΓ() is the Gamma function (see e.g. [11]), defined for positive arguments.

This distribution with its associated normalisation enables us to performmarginalisation of scale parameters with Gaussian likelihoods and a widerange of parameter priors (including uniform, Jeffreys, Gamma and InverseGamma (see [121]) priors) which all require the following result:

∫ ∞

y=0

yα−1 exp(−βy) dy = Γ(α)/βα (A.7)

Gamma integral

Furthermore the mean, mode and variance of such a distribution areobtained as:

µ = E[Y ] = α/β (A.8)

m = argmaxy

(p(y)) = (α− 1)/β (A.9)

σ2 = E[(Y − µ)2] = α/β2 (A.10)

A.4 Inverted Gamma distribution

A closely related distribution is the inverted-gamma distribution, IG(α, β)which describes the distribution of the variable 1/Y , where Y is distributedas G(α, β):

IG(y|α, β) =βα

Γ(α)y−(α+1) exp(−β/y) (0 < y <∞) (A.11)

Inverted-gamma densityThe IG distribution has a unique maximum at β/(α + 1), mean value

β/(α− 1) (for α > 1) and variance β2/((α− 1)2(α− 2)) (for α > 2).It is straightforward to see that the improper Jeffreys prior p(x) = 1/x

is obtained in the limit as α→ 0 and β → 0.

Page 296: Digital Audio Restoration - a statistical model based approach

278 Appendix A. Probability Densities and Integrals

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 50

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

amptitude

norm

aliz

ed p

roba

bilit

y de

nsity

Inverted−gamma density

alpha=1000

alpha=0.01

FIGURE A.1. Inverted-gamma family with mode=1, α = 0.01 . . . 1000

The family of IG distributions is plotted in figure A.1 as α varies overthe range 0.01 to 1000 and with maximum value fixed at unity. The varietyof distributions available indicates that it is possible to incorporate eithervery vague or more specific prior information about variances by choice ofthe mode and degrees of freedom of the distribution. With high values of αthe prior is very tightly clustered around its mean value, indicating a highdegree of prior belief in a small range of values, while for smaller α theprior can be made very diffuse, tending in the limit to the uninformativeJeffreys prior. Values of α and β might be chosen on the basis of meanand variance information about the unknown parameter or from estimatedpercentile positions on the axis.

Page 297: Digital Audio Restoration - a statistical model based approach

This is page 279Printer: Opaque this

Appendix B

Matrix Inverse Updating Results andAssociated Properties

Here we are concerned with updates to a non-singular (N ×N) matrix Φwhich are of the form:

Φ → Φ + u vT (B.1)

for some column vectors u and v. Results given here can be found in [153](and many other texts) but since these results are used several times in themain text they are presented here in full for reference purposes.

Consider firstly a block partitioned matrix A:

A =

[

Φ UVT W

]

(B.2)

Matrix A can be inverted by row/column manipulations in two ways togive the following equivalent results:

A−1

=

[

(Φ − UW−1

VT )−1 −(Φ − UW

−1V

T )−1UW

−1

−W−1

VT (Φ − UW

−1V

T )−1 (W−1 + W−1

VT (Φ − UW

−1V

T )−1UW

−1)

]

(B.3)

=

[

(Φ−1 + Φ−1

U(W − VTΦ

−1U)−1

VTΦ

−1) −Φ−1

U(W − VTΦ

−1U)−1

−(W − VTΦ

−1U)−1

VTΦ

−1 (W − VTΦ

−1U)−1

]

(B.4)

Page 298: Digital Audio Restoration - a statistical model based approach

280 Appendix B. Matrix Inverse Updating Results and Associated Properties

An intermediate stage of this inversion procedure obtains an upper or lowerblock triangular matrix form from which the determinant of A can beobtained as

| A |=| Φ || (W − VTΦ−1U) |=| W || (Φ − U W−1VT) | (B.5)

If we equate the top left hand elements of the two inverse matrices B.3 andB.4 the result is

(Φ − U W−1VT )−1 = Φ−1 + Φ−1U (W − VT Φ−1U)−1VT Φ−1 (B.6)

which is the well known Woodbury update formula or Matrix InversionLemma.

If U = u, V = v and W = −1 the update is then as required for B.1,giving the Sherman-Morrison formula:

(Φ + u vT )−1 = Φ−1 − Φ−1u vT Φ−1

1 + vT Φ−1u(B.7)

which is an order O(N2) operation and hence more efficient than directmatrix inversion using Cholesky decomposition, an O(N3) operation.

Suppose we are in fact solving a set of linear equations at time index nsuch that Φn xn = θn and the current solution xn is known. Matrix Φn isnow updated according to:

Φn+1 = Φn + ununT (B.8)

while θn is updated as

θn+1 = θn + undn. (B.9)

Some further manipulation and the use of B.7 and B.5 leads to the followingextended form of the recursive least squares (RLS) update employed inadaptive filtering:

kn+1 =Φn

−1un

1 + unTΦn

−1un

(B.10)

αn+1 = dn − xnTun (B.11)

xn+1 = xn + kn+1 αn+1 (B.12)

Φn+1−1 = Φn

−1 − kn+1unTΦn

−1 (B.13)

| Φn+1 | = | Φn | (1 + unT Φn

−1un) (B.14)

EMIN

n+1 = EMIN

n + αn+1 (dn − xTn+1un) (B.15)

where EMINn is the minimum sum of error squares for the least squares

problem solved by xn = Φn−1θn

Page 299: Digital Audio Restoration - a statistical model based approach

This is page 281Printer: Opaque this

Appendix C

Exact Likelihood for AR Process

In this appendix we derive the exact likelihood for a stable (stationary) ARprocess with parameters a, σ2

e. The conditional likelihood has alreadybeen obtained as:

p(x1 | x0,a) =1

(2πσ2e)

N−P2

exp

(

− 1

2σ2e

xT ATA x

)

. (C.1)

In order to obtain the true likelihood for the whole data block x use theprobability chain rule:

p(x | a) = p(x0,x1 | a) = p(x1 | x0,a) p(x0 | a) (C.2)

and the missing term is thus p(x0,a), the joint PDFfor P samples of theAR process. Since the data is assumed zero mean and Gaussian we canwrite

p(x0 | a) =1

(2πσ2e)P/2 | Mx0 |1/2

exp(− 1

2σ2e

xT0 Mx0

−1x0) (C.3)

where Mx0 is the covariance matrix for P samples of data with unit varianceexcitation, which is well-defined for a stable model.

The exact likelihood expression is thus:

p(x | a) =1

(2πσ2e)

N2 | Mx0 |1/2

exp

(

− 1

2σ2e

xT Mx−1 x

)

(C.4)

Page 300: Digital Audio Restoration - a statistical model based approach

282 Appendix C. Exact Likelihood for AR Process

where

Mx−1 = AT A +

[

Mx0

−1 00 0

]

(C.5)

is the inverse covariance matrix for a block of N samples with σ2e = 1.

Matrix Mx0 can easily be calculated since it is composed of elements fromthe covariance sequence of the AR process which may be known beforehand(e.g. if an asymptotic approximation is made for the parameter estimationstage then the covariance values will be specified before estimation of theparameters).

In any case Mx0

−1 is easily obtained owing to the reversible characterof the AR process which implies that Mx

−1 must be persymmetric, i.e.symmetrical about both principal axes. Consider partitioning A for N ≥2P into three components: A0 contains the first P columns of A, A1b

contains the last P columns of A and A1a contains the remaining columns,so that

A = [A0 A1a A1b] (C.6)

and

ATA =

A0TA0 A0

TA1a A0TA1b

A1aTA0 A1a

TA1a A1aTA1b

A1bTA0 A1b

T A1a A1bTA1b

(C.7)

Comparing (C.7) with (C.5) and using the persymmetric property of Mx−1

gives the following expression for Mx0

−1:

Mx0

−1 = (A1bT A1b)

R − A0TA0 (C.8)

where the operator R reverses row and column ordering of a matrix. Sinceelements of A0 and A1b are AR coefficient values elements of Mx0

−1 arequadratic in the AR coefficients. Since AT A is also quadratic in the co-efficients it follows that the exponent of the exact likelihood expressionxT Mx

−1 x can be expressed as a quadratic form in terms of the parame-ter vector ae = [1 − a]:

xT Mx−1 x = aeT D ae (C.9)

and Box, Jenkins and Reinsel [21] show that the elements of D are sym-metric sums of squared and lagged products of the data elements. Thiswould appear to be helpful for obtaining exact ML estimates for the ARparameters since this exponent can easily be minimised w.r.t. a. Note how-ever that the determinant term | Mx0 | of the exact likelihood expressionis variable with a and its differentials w.r.t. a are complicated functions ofthe AR coefficients, making the exact ML solution intractable to analyticsolution.

Page 301: Digital Audio Restoration - a statistical model based approach

This is page 283Printer: Opaque this

Appendix D

Derivation of Likelihood for i

In this appendix the likelihood p(y | i) for a particular noise configurationi is derived for general noise amplitude and data models. Consider firstlythe PDFfor n, the additive noise component for a particular data blockof length N . Wherever im = 0 we know with certainty 1 that the noisecomponent is zero. Otherwise im = 1 and the noise takes some randomamplitude defined by the noise burst amplitude PDFpn(i)|i which is the jointdistribution for noise amplitudes at the l sample locations where im = 1.Overall we have:

pn(n) = δ(N−l)(n−(i)) pn(i)|i(n(i) | i) (D.1)

where δ(k)() is the k-dimensional delta function with unity volume and (i)

and −(i) are the partitions as defined before. This delta function expressesthe certainty that noise amplitudes are zero wherever im = 0.

We have the additive relationship

y = x + n

for the data block. If the noise n is assumed statistically independent ofthe data x the PDFfor y conditional upon x may be directly expressed as:

p(y | x, i) = pn(y − x) = δ(N−l)(y−(i) − x−(i)) pn(i)|i(y(i) − x(i)) (D.2)

For a particular data model with PDFpx the joint probability for data andobservations is then given by

p(y, x | i) = p(y | x, i) px(x) (D.3)

= δ(N−l)(y−(i) − x−(i)) pn(i)|i(y(i) − x(i) | i) px(x) (D.4)

Page 302: Digital Audio Restoration - a statistical model based approach

284 Appendix D. Derivation of Likelihood for i

where we have used the assumption that the noise generating process isindependent of the data to assign px|i = px.

The likelihood is now obtained by a marginalisation integral over x:

p(y | i) =

x

δ(N−l)(y−(i) − x−(i)) pn(i)|i(y(i) − x(i) | i) px(x) dx (D.5)

and the integration may be immediately simplified using the sifting prop-erty of the delta function as:

p(y | i) =

x(i)

pn(i)|i(y(i) − x(i) | i) px(x) |x−(i)=y

−(i)dx(i) (D.6)

=

x(i)

pn(i)|i(y(i) − x(i) | i) px(x(i), y−(i)) dx(i) (D.7)

where px(x(i), y−(i)) = px(x) |x

−(i)=y−(i)

is simply px with observed datay

−(i) substituted at the appropriate locations.This expression for evidence can now be evaluated by substitution of the

relevant functional forms for px and pn|i, both of which will be determinedfrom modelling considerations.

D.1 Gaussian noise bursts

Substituting the Gaussian noise PDF(9.10) and Gaussian AR data PDF(9.8) into the likelihood expression of equation (9.5) we arrive at the fol-lowing integral:

p(y | i) = k

x(i)

exp

(

−1

2

(

a+ bTx(i) + xT(i)C x(i)

)

)

dx(i) (D.8)

where the terms k,a,b,C are defined as

k =1

(2π)l/2 | Rn(i)|1/2 (2πσ2

e)(N−P )/2(D.9)

a =y

−(i)TA

−(i)TA

−(i) y−(i)

σ2e

+ y(i)TRn(i)

−1y(i) (D.10)

b =2A(i)

T A−(i) y

−(i)

σ2e

− 2 Rn(i)

−1y(i) (D.11)

C =AT

(i)A(i)

σ2e

+ R−1n(i)

. (D.12)

Note that in equation (D.8) p(x0) has been placed outside the integral andtreated as a multiplicative constant since, as discussed in the main text, thefirst P samples of the block are assumed to be uncorrupted. The integrandis of the same form as (A.3) and hence result (A.5) applies.

Substituting for k, a, b and C gives equations (9.11)-(9.16)

Page 303: Digital Audio Restoration - a statistical model based approach

This is page 285Printer: Opaque this

Appendix E

Marginalised Bayesian Detector

The likelihood calculations of (9.19)-(9.24) are implicitly conditional uponthe AR coefficients a, the excitation variance σ2

e , the noise variance σ2n, the

first P data samples x0 (through use of the conditional likelihood) and allremaining modelling assumptions M. For fixed σ2

e , a and x0 it was notedthat the PDFp(x0) (C.3) was a constant which could be neglected. If σ2

e isallowed to vary, this term is no longer a constant and must be accountedfor. The exact likelihood expression (4.55) can be used to overcome thisdifficulty. We use the conditional likelihood expression for compatibilitywith the results of chapter 9.

The likelihood expression of (9.19) is then rewritten as

p(y | i, σ2e , µ,a,M) ∝

µl exp(

− 12σ2

eEMIN

)

(2πσ2e)

N2 | Φ |1/2

(E.1)

while results (9.20)-(9.24) remain unchanged.We might wish to marginalise σ2

e from the evidence. Note, however, thatµ = σe

σn, which means that (9.20)-(9.24) are strongly dependent on σe. If

we are prepared to assume that µ is independent of σe then these depen-dencies disappear. This assumption implies that the click variance σ2

n isdetermined by a constant 1

µ2 times the excitation variance, i.e. click ampli-tudes are scaled relative to excitation energy. There is no reason to assumethat noise variances should be related to the signal excitation in the case ofdegraded audio sources. It is also debatable as to whether marginalisationis preferable to long-term adaptive estimation of the unknown parameters

Page 304: Digital Audio Restoration - a statistical model based approach

286 Appendix E. Marginalised Bayesian Detector

beforehand. However, it is hoped that marginalisation under this assump-tion may give improved robustness when the value of σe is uncertain.

Supposing σ2e is reparameterised as λ = 1

σ2e

and λ is independent of i,

µ, a and the modelling assumptions, then p(λ | i, µ,a,M) = p(λ), and thejoint distribution is given by:

p(y, λ | i, µ,a,M) ∝ p(λ)µl exp

(

− 12σ2

eEMIN

)

(2πσ2e)

N2 | Φ |1/2

(E.2)

Suppose λ ∼ G(α, β), i.e. Y is drawn from the two-parameter Gammadensity (see appendix A). Then the joint posterior density of (E.2) is:

p(y, λ | i, µ,a,M) ∝ λ(N/2+α−1) µl exp (−λ (β + EMIN/2))

| Φ |1/2(E.3)

Considered as a function of λ this expression is proportional to G(α +N/2, β + EMIN/2). Using result A.7 for the integration of such functions weobtain the following marginalised evidence expression:

p(y | i, µ,a,M) =

λ

p(y, λ | i, µ,a,M) dλ ∝ µl (β + EMIN/2)−(α+N/2)

| Φ |1/2

(E.4)

This expression can be used in place of (9.19) for detection. Values of α andβ must be selected beforehand. If a long-term estimate σ2

e of σ2e is available

it will be reasonable to choose α and β using (A.8)-(A.10) such that the

mean or mode of the Gamma density is λ = 1σ2

eand the covariance of the

density expresses how confident we are in the estimate λ. Initial resultsindicate that such a procedure gives improved detection performance whenvalues of σ2

e and µ are uncertain.If a long-term estimate of σ2

e is not available the general Gamma priormay be inappropriate. Comparison of (E.2) with (E.3) shows that a uniformor Jeffreys’ prior can be implemented within this framework. In particular,α = 1, β → 0 corresponds to a uniform prior on λ, while α = 0, β → 0gives a Jeffreys’ prior p(λ) = 1/λ.

Under certain assumptions we have shown that it is possible to marginaliseσ2

e from the evidence expression. It is not, however, straightforward to elim-inate µ, since this appears in the matrix expressions (9.20)-(9.24). The onlycase where marginalisation of µ is straightforward is when the approxima-tion of (9.25)-(9.27) is made, i.e. µ is very small. In this case a secondmarginalisation integral can be performed to eliminate µ from (E.4). Eval-uation of such a procedure is left as future work.

Page 305: Digital Audio Restoration - a statistical model based approach

This is page 287Printer: Opaque this

Appendix F

Derivation of Sequential UpdateFormulae

This appendix derives from first principles the sequential update for thedetection likelihood in chapter 10

Consider firstly how the (n − P ) × n matrix An (see 4.53) is modifiedwith a new input sample yn+1. When the data block increases in length byone sample, matrix An (by inspection) will have one extra row and columnappended to it as defined by:

An+1 =

[

An 0n

−bnT 1

]

(F.1)

where bn is defined as

bn =[

0(n−P )T aP aP−1 . . . a2 a1

]T(F.2)

and 0q is the column vector containing q zeros.bn is a length n column vector and can thus be partitioned just as yn

and xn into sections b(i)n and b−(i)n corresponding to unknown and known

data samples for detection state estimate in.

F.1 Update for in+1 = 0.

Consider firstly how A(i)n and A−(i)n are updated when in+1 = 0. In this

case the incoming sample yn+1 is treated as uncorrupted. Hence the true

Page 306: Digital Audio Restoration - a statistical model based approach

288 Appendix F. Derivation of Sequential Update Formulae

data sample xn+1 is considered known and A(i)(n+1) is obtained by theaddition of just a single row to A(i)n:

A(i)(n+1) =

[

A(i)n

−b(i)nT

]

, (F.3)

while A−(i)(n+1) is obtained by the addition of both a row and a column:

A−(i)(n+1) =

[

A−(i)n 0n

−b−(i)n

T 1

]

. (F.4)

In addition the partitioned input data vector is straightforwardly updatedas:

y(i)(n+1) = y(i)n, y−(i)(n+1) = [ y

−(i)nT yn+1 ]T (F.5)

We now proceed to update Φn (10.4). The first term is updated by directmultiplication of (F.3):

A(i)(n+1)T A(i)(n+1) = A(i)n

T A(i)n

+ b(i)n b(i)nT (F.6)

The second term of Φn is unchanged by the input of yn+1. Hence Φn isupdated as:

Φ(n+1) = Φn + b(i)n b(i)nT (F.7)

The first term of θn (10.5) is updated similarly using direct multiplicationof results (F.3), (F.4), (F.5) to give:

A(i)(n+1)TA

−(i)(n+1) y−(i)(n+1) = A(i)n

TA−(i)n y

−(i)n

+ b(i)n

(

b−(i)n

T y−(i)n − yn+1

)

(F.8)

As for Φn, the second term of θn is unchanged for in+1 = 0. Hence theupdate for θn is:

θ(n+1) = θn + b(i)n

(

yn+1 − b−(i)n

Ty−(i)n

)

(F.9)

= θn + b(i)n dn+1 (F.10)

where

dn+1 = yn+1 − b−(i)n

Ty−(i)n (F.11)

The recursive updates derived for Φn (F.7) and θn (F.10) are now in theform required for application of the results given in appendix B (equations

Page 307: Digital Audio Restoration - a statistical model based approach

F.2 Update for in+1 = 1. 289

(B.10)-(B.15)). Use of these results leads to recursive updates for the termsin the state likelihood (9.37), i.e. xMAP

(i)n , EMINn and | Φn |.Now, defining

Pn = Φn−1, (F.12)

the update for in+1 = 0 is given by

kn+1 =Pn b(i)n

1 + b(i)nTPn b(i)n

(F.13)

dn+1 = yn+1 − b−(i)n

Ty−(i)n (F.14)

αn+1 = dn+1 − xMAP

(i)nTb(i)n (F.15)

xMAP

(i)(n+1) = xMAP

(i)n + kn+1 αn+1 (F.16)

Pn+1 = Pn − kn+1 b(i)nTPn (F.17)

| Pn+1 | =| Pn |(

1 + b(i)nTPn b(i)n

)−1

(F.18)

EMIN(n+1) = EMINn + αn+1

(

dn+1 − xMAP

(i)(n+1)Tb(i)n

)

(F.19)

p(y(n+1) | i(n+1)) = p(yn | in)1

2πσ2e

(

1 − b(i)nTkn+1

)1/2

exp

(

− 1

2σ2e

αn+1

(

dn+1 − xMAP

(i)(n+1)Tb(i)(n)

)

)

(F.20)

Note that the values of | Pn+1 |, y−(i)n and EMIN(n+1) are not specifically

required for the likelihood update given in (F.20), but are included forcompleteness and clarity. In fact the update only requires storage of theterms Pn, xMAP

(i)n and the previous likelihood value p(yn | in) in order toperform the full evidence update when yn+1 is input.

F.2 Update for in+1 = 1.

As stated above, when in = 1 the length of the MAP data estimate xMAP

(i)(n+1)

is one higher than at the previous sample number. Correspondingly theorder of Φ(n+1) and θ(n+1) are also one higher. This step-up in solutionorder is achieved in two stages. The first stage is effectively a predictionstep which predicts the MAP data vector without knowledge of yn+1. Thesecond stage corrects this prediction based on the input sample yn+1.

Page 308: Digital Audio Restoration - a statistical model based approach

290 Appendix F. Derivation of Sequential Update Formulae

We follow a similar derivation to the in+1 = 0 case. Matrices A(i)n andA

−(i)n are updated as follows (c.f. (F.3)) and (F.4)):

A(i)(n+1) =

[

A(i)n 0n

−b(i)nT 1

]

(F.21)

and

A−(i)(n+1) =

[

A−(i)n

−b−(i)n

T

]

(F.22)

The input data partitioning is now given by (c.f. (F.5)):

y(i)(n+1) = [ y(i)nT yn+1 ]T y

−(i)(n+1) = y−(i)n (F.23)

A similar procedure of direct multiplication of the required terms leads tothe following updates:

A(i)(n+1)TA(i)(n+1) =

[(

A(i)nTA(i)n + b(i)n b(i)n

T)

−b(i)n

−b(i)nT 1

]

(F.24)

and

A(i)(n+1)TA

−(i)(n+1) y−(i)(n+1)

=

[

A(i)nTA

−(i)n y−(i)n + b(i)nb

−(i)nTy

−(i)n

b−(i)n

Ty−(i)n

]

(F.25)

By inspection the updates to Φn (10.4) and θn (10.5) are now given by:

Φ(n+1) =

[ (

Φn + b(i)n b(i)nT)

−b(i)n

−b(i)nT 1 + µ2

]

(F.26)

and

θ(n+1) =

[

θn − b(i)nb−(i)n

Ty−(i)n

b−(i)n

T y−(i)n + µ2yn+1

]

(F.27)

Various approaches are possible for updating for xMAP(i)n , etc. The approach

suggested here is in two stages. It is efficient to implement and has a usefulphysical interpretation. The first step is effectively a prediction step, inwhich the unknown data sample xn+1 is predicted from the past data basedon the model and without knowledge of the corrupted input sample yn+1.The second stage corrects the prediction to incorporate yn+1.

In the prediction step we solve for xPRED

(i)(n+1) using:

[(

Φn + b(i)n b(i)nT)

−b(i)n

−b(i)nT 1

]

xPRED

(i)(n+1)

=

[

θn − b(i)nb−(i)n

Ty−(i)n

b−(i)n

Ty−(i)n

]

(F.28)

Page 309: Digital Audio Restoration - a statistical model based approach

F.2 Update for in+1 = 1. 291

or equivalently

ΦPRED

n+1 xPRED

(i)(n+1) = θPRED

n+1 (F.29)

It is easily verified by direct multiplication and use of the identity ΦnxMAP(i)n =

θn that the prediction estimate xPRED

(i)(n+1) is given by:

xPRED

(i)(n+1) =

[

xMAP(i)n

b(i)nT xMAP

(i)n + b−(i)n

Ty−(i)n

]

(F.30)

This is exactly as should be expected, since the last element of xPRED

(i)(n+1) issimply the single step linear prediction with values from xMAP

(i)n substitutedfor unknown samples. Since the linear prediction will always have zeroexcitation the previous missing samples remain unchanged at their lastvalue xMAP

(i)n .The inverse of matrix ΦPRED

n+1 is easily obtained from the result for theinverse of a partitioned matrix (see B.3) as:

ΦPRED

(n+1)−1

=

[

Φn−1 Φn

−1b(i)n

b(i)nTΦn

−1(

1 + b(i)nTΦn

−1b(i)n

)

]

(F.31)

all of whose terms are known already from the update for in+1 = 0. Thedeterminant of ΦPRED

n+1 can be shown using result (B.5) to be unchanged bythe update. The error energy term EMIN(n+1) is also unaffected by this partof the update, since the single sample data prediction adds nothing to theexcitation energy.

The second stage of the update involves the correction to account forthe new input data sample yn+1. By comparison of (F.26) and (F.27) with(F.28) and F.29) we see that the additional update to give Φn+1 and θn+1

is:

Φn+1 = ΦPRED

n+1 +

[

0n×n 0n

0Tn µ2

]

(F.32)

and

θn+1 = θPRED

n+1 +

[

0n

µ2 yn+1

]

(F.33)

where 0n×n is the all-zero matrix of dimension (n×n). This form of updatecan be performed very efficiently using a special case of the extended RLSequations given in (B.10) (B.15) in which ‘input’ vector un and ‘desiredsignal’ dn+1 are given by:

un = [ 0nT µ ]T and dn+1 = µ yn+1 (F.34)

ΦPRED

n+1 is associated with Φn in the appendix and xPREDn+1 with xn. Significant

simplifications follow directly as a result of un having only one non-zero

Page 310: Digital Audio Restoration - a statistical model based approach

292 Appendix F. Derivation of Sequential Update Formulae

element. The majority of the terms required to make this second stage ofupdate will already have been calculated during the update for in+1 = 0and they will thus not generally need to be re-calculated. Section 10.2.1 inthe main text summarises the full update for both in+1 = 0 and in+1 = 1and explicitly shows which terms are common to both cases.

Page 311: Digital Audio Restoration - a statistical model based approach

This is page 293Printer: Opaque this

Appendix G

Derivations for EM-basedInterpolation

This appendix derives the EM update equations for the Bayesian inter-polation problem of chapter 12. Section 4.5 summarises the general formof the EM algorithm. The first stage calculates an expectation of the logaugmented posterior p(z|θ,y) conditional upon the current estimate zi. Anotation to make the expectations clear is introduced such that:

Eφ[x|y] =

φ

x p(φ|y)dφ.

We define the latent data z = x(i) for this derivation. For the interpolationmodel the expectation step can be split into separate expectations overnoise and signal parameters θv and θx:

Q(z, zi) = E[log(p(z|θ,y)|zi,y] (G.1)

= E[log(p(z,y|θ)) − log(p(y|θ))|zi,y]

= E[log(p(z,y|θ))|zi,y] − C

= E[log(p(x|θx)) + log(p(v(i)|θv))|zi,y] − C

= Ex[log(p(x|θx))|xi] + E

v[log(p(v(i)|θv)|vi)] − C

= Ex + Ev − C (G.2)

where C is constant since p(y|θ) does not depend on z. In the third linep(z,y|θ) has been expanded as:

p(z,y|θ) = p(x(i),x−(i),y(i)|θ)= p(x|θx)p(v(i)|θv)

Page 312: Digital Audio Restoration - a statistical model based approach

294 Appendix G. Derivations for EM-based Interpolation

We now present the signal and noise expectations, Ex and Ev .

G.1 Signal expectation, Ex

The signal expectation is taken over the signal parameters θx = a, λeconditional upon the current estimate of the reconstructed data xi.

The expectation over signal parameters can be rewritten as:

Ex = Ex [log(p(x|θx))|xi]

= Eλe

[

Ea

[

log(p(x|θx))|λe ,xi]

|xi]

= Eλe

[

λe/2 Ea

[

−E(x,a)|λe ,xi]

|xi]

+ D (G.3)

where D is a constant which does not depend upon x and (12.6) hasbeen substituted for p(x|θx). The resulting ‘nested’ expectations are takenw.r.t. p(a|λe ,x

i) and p(λe |xi) which can be obtained using Bayes’ Theoremas:

p(θx |x) ∝ p(x|θx) p(θx)

∝ ARN (x|a, λe)G(λe |αe, βe) (G.4)

= NP (a|aMAP(x), (λeXTX))−1

G(λe |αe + (N − P )/2, βe + E(x,aMAP(x))/2) (G.5)

= p(a|λe ,x)p(λe |x) (G.6)

where the rearrangement between (G.4) and (G.5) is achieved by expandingand noting that the resulting density must be normalised w.r.t. θx , and theterm ARN (x|a, λe) denotes the conditional likelihood of the AR process,as defined in (12.6). aMAP(x) is defined as the MAP parameter estimate fordata x: aMAP(x) = (XT X)−1XT x1.

The required densities p(a|λe ,x) and p(λe |x) can be directly identifiedfrom comparison of (G.5) and (G.6) as

p(a|λe ,x) = NP (aMAP(x), (λeXTX)−1)

and p(λe |x) = G(αe + (N − P )/2, βe + E(x,aMAP(x))/2)

The inner expectation of (G.3) is now taken w.r.t. p(a|λe ,xi), giving:

Ea

[

−E(x,a)|λe ,xi]

= −xT1 x1 + 2xT

1 XEa

[

a|λe ,xi]

− trace(

XEa

[

aaT |λe ,xi]

XT)

= −xT1 x1 + 2xT

1 Xai − trace

(

X

(

(

λeXiTXi

)−1

+ aiaiT)

XT

)

= −E(x,ai) − 1

λetrace

(

X(

XiTXi)−1

XT

)

= −E(x,ai) − T (x,xi)/λe (G.7)

Page 313: Digital Audio Restoration - a statistical model based approach

G.1 Signal expectation, Ex 295

In this expression the mean and covariance of the conditional density

p(a|λe ,xi) = NP (ai, (λeX

iTXi)−1), where ai = aMAP(xi), lead directly

to the result Ea

[

aaT |λe ,xi]

= aiaiT +(

λeXiTXi

)−1

. Note that (G.7) is

equivalent to the E-step derived in O Ruanaidh and Fitzgerald [167, 166]for missing data interpolation.T (x,xi) can be written as a quadratic form in x:

T (x,xi) = trace

(

X(

XiT Xi)−1

XT

)

= xT T(xi)x

where T(xi) =

N−P+1∑

q=1

Nq(xi) (G.8)

Nq(xi) is the all-zero (N×N) matrix with the (P×P ) sub-matrix

(

XiTXi)−1

substituted starting at the qth element of the leading diagonal:

Nq(xi) =

0q−1,q−1 . . .. . .

...(

XiT Xi)−1 ...

. . . . . . 0N−q−p+1,N−q−p+1

(G.9)

where 0nm denotes the (n × m) all-zero matrix. T(xi) is thus a band-diagonal matrix calculated as the summation of all N−P+1 configurationsof Nq(x

i). Note that the structure of T(xi) is symmetrical throughout andalmost Toeplitz except for its first and last P columns and rows.

The outer expectation of (G.3) is then taken w.r.t. p(λe |xi) to give:

Ex = Eλe

[

−λeE(x,ai)/2 − T (x,xi)/2|xi]

+D

= −E(x,ai)(αe + (N − P )/2)

2(βe + E(xi,ai)/2)− T (x,xi)

2+D

= −λieE(x,ai)

2− T (x,xi)

2+D

(G.10)

The term λie is the expected value of λe conditional upon xi and is obtained

directly as the mean of the gamma distribution p(λe |xi).The termD does not depend upon x and so may be ignored as an additive

constant which will not affect the maximisation step.

Page 314: Digital Audio Restoration - a statistical model based approach

296 Appendix G. Derivations for EM-based Interpolation

G.2 Noise expectation, Ev

The noise parameters are θv = Λ(i) where Λ = [λ1 . . . λN ]T and λt =1/σ2

vt. In the common variance noise model we have that σ2

vt= σ2

v , aconstant for all t. The noise parameter expectation depends now upon theform of noise model chosen:

1. Independent variance. The expectation is taken w.r.t. p(θv |vi(i)),

which is given by:

p(θv |v(i)) = p(Λ(i)|v(i))

∝ p(v(i)|Λ(i)) p(Λ(i))

∝ Nl(v(i)|0, (diag(Λ(i)))−1)

t∈I

G(λvt |αv, βv)

=∏

t∈I

G(λvt |αv + 1/2, βv + vt2/2) (G.11)

where we have substituted the noise likelihood and prior expressionsand noted that the resultant densities must be normalised w.r.t. λvt .Recall also that l is the number of degraded samples.

The noise expectation is then obtained as

Ev = Ev[log(p(v(i)|θv))|vi

(i)]

= −∑

t∈I

vt2Eλvt

[

λvt |vti]

/2 +G

= −N∑

t∈I

vt2λi

vt/2 +G (G.12)

where λivt

= αv+1/2βv+vt

2/2 is the expected value of the gamma distribution

G(λvt |αv + 1/2, βv + vt2/2) (see appendix A.3). G is once again a

constant which is independent of x.

2. Common variance. A similar argument gives for the common vari-ance model:

En = Eλv

[

−∑

t∈I

λvvt2/2|vi

(i)

]

+G

= −λivv

T(i)v(i)/2 +G (G.13)

where λivt

= λiv = (αn + l/2)/(βn + vi

(i)

Tvi

(i)/2) is the expected value

of the gamma distribution p(λv |vi(i)).

The two results can be summarised as:

En = −vT(i)M

iv(i)

2+G (G.14)

Page 315: Digital Audio Restoration - a statistical model based approach

G.3 Maximisation step 297

where Mi = diag(λivt

; t ∈ I) is the expected value of the noise inverse-covariance matrix conditioned upon the current estimate of the noise, vi

(i) =y(i) − xi

(i). G is an additive constant which does not depend upon x.

G.3 Maximisation step

The Q(., .) function (G.2) is now obtained as the sum of Ex (G.10) and En

(G.14). The individual components of Q(., .) are all quadratic forms in thereconstructed data x and hence also quadratic in z, which is a sub-vectorof x. Maximisation of Q(., .) is thus a linear operation, obtained as the

solution of∂Q(z, zi)

∂z= 0:

zi+1 = −Ψ−1

(

(

λieA

iTAi + T(xi))

(i)−(i)

y−(i) − Miy(i)

)

(G.15)

where Ψ =(

λieA

iT Ai + T(xi))

(i)(i)

+ Mi

and the notation ‘(i)(i)’ denotes the sub-matrix containing all elementswhose row and column numbers are members of I. Similarly, ‘(i)−(i)’ extractsa sub-matrix whose row numbers are in I and whose column numbers arenot in I.

The complete iteration for EM interpolation is summarised in the maintext.

Page 316: Digital Audio Restoration - a statistical model based approach

298 Appendix G. Derivations for EM-based Interpolation

Page 317: Digital Audio Restoration - a statistical model based approach

This is page 299Printer: Opaque this

Appendix H

Derivations for Gibbs Sampler

In this appendix results required for the Gibbs sampler detection and in-terpolation scheme are derived (see chapter 12).

H.1 Gaussian/inverted-gamma scale mixtures

With the assumption of a zero mean normal distribution with unknownvariance for a noise source v, the joint distribution for noise amplitude andnoise variance is given by:

p(v, σ2) = p(v|σ2)p(σ2)

= N(v|0, σ2)p(σ2)

The ‘effective’ distribution for v is then obtained by marginalisation ofthe joint distribution:

p(v) =

σ2

N(0, σ2)p(σ2)dσ2

The convolving effect of this mixture process will generate non-Gaussiandistributions whose properties depend on the choice of prior p(σ2). In thecase of the IG prior, simple analytic results exist, based on the use of the

Page 318: Digital Audio Restoration - a statistical model based approach

300 Appendix H. Derivations for Gibbs Sampler

IG normalising constant in (A.4):

p(v) =

σ2

N(v|0, σ2)IG(σ2|α, β)dσ2

=

σ2

1√2πσ2

exp(−v2/2σ2)

× βα

Γ(α)(σ2)−(α+1) exp(−β/σ2)dσ2

=1√2π

βα

Γ(α)

Γ(α+ 1/2)

(β + v2/2)(α+1/2)

×∫

σ2

IG(σ2|α+ 1/2, β + v2/2)dσ2

=1√2π

βα

Γ(α)

Γ(α+ 1/2)

(β + v2/2)(α+1/2)

= St(v|0, α/β, 2α)

where St(.) denotes the Student-t distribution:

St(x|µ, λ, α) = k(1 + (x− µ)2λ/α)−(α+1)/2 (H.1)

The Student distribution, which has both Gaussian and Cauchy as limitingcases, is well known to be robust to impulses, since the ‘tails’ of the distri-bution decay algebraically, unlike the much less robust Gaussian with itsexponentially decaying tails. The family of Student-t curves which arises asa result of the normal/inverted-gamma scale mixture is displayed in figureH.1, for the same set of α and β parameters as in figure A.1. It can be seenthat with suitable choice of α and β the Student distribution can assignsignificant probability mass to high amplitude impulses which would havenegligible mass in the corresponding normal distribution.

H.2 Posterior distributions

H.2.1 Joint posterior

The joint posterior distribution for all the unknowns is obtained from thelikelihood, signal model and the prior distribution p(θ). Note that elementsfrom the unknown parameter set θ = i,a, σ2

e , σ2vt

(t = 0 . . .N − 1) areassumed independent a priori , so that the prior can be expressed as:

p(θ) = p(i)p(a)p(σ2e)∏

t

p(σ2vt

) (H.2)

Page 319: Digital Audio Restoration - a statistical model based approach

H.2 Posterior distributions 301

−10 −8 −6 −4 −2 0 2 4 6 8 100

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

amptitude

norm

aliz

ed p

roba

bilit

y de

nsity

Student−t density

alpha=1000

alpha=0.01

FIGURE H.1. Normal/Inverted-gamma mixture family (Student-t) withmode=1, α = 0.01 . . . 1000

Page 320: Digital Audio Restoration - a statistical model based approach

302 Appendix H. Derivations for Gibbs Sampler

which is simply the product of the individual prior terms for each param-eter. The full posterior distribution is then obtained as:

p(x,θ|y) ∝ p(i0)∏

t:it=0

δ(yt − xt)∏

t:it=1

N(yt − xt|0, σ2vt

)

× NN (x|0, σ2e(AT A)−1)

×∏

t

Pit−1it

× IG(σ2e |αe, βe)

t

IG(σ2vt|αv , βv)

(H.3)

H.2.2 Conditional posteriors

The conditional posteriors are obtained by simple manipulation of the fullposterior (H.3). For example, the conditional for a is expressed as:

p(a|x,θ−(a),y) =

p(θ,x|y)

p(x,θ−(a)|y)

where θ−(a) denotes all elements of θ except for a. Note, however, that the

denominator term is simply a normalising constant for any given valuesof x, θ

−(a) and y. The required conditional for a can thus be obtainedsimply by grouping together all the terms in the joint posterior expressionwhich depend upon a and finding the appropriate normaliser to form aproper density. In the simplest cases this can be achieved by recognisingthat the terms depending on a can be rewritten in the form of a well-known distribution. Here, the only term depending on a is the signal modelNN (x|0, σ2

e(AT A)−1) which is easily rewritten using (12.4) to give thenormalised conditional expression (12.11). Similar manipulations lead tothe multivariate normal conditional for x (12.14), obtained by rearrangingthe product of likelihood and signal model, both of which depend upon x.

The noise variance terms can be treated similarly. For example, σ2e is

present in both its own prior density IG(σ2e |αe, βe) and the signal model.

Grouping together the exponential and power-law terms in σ2e leads to a

conditional which is still in the general form of the IG(.) density, as givenin (12.12).

The detection indicator variables are discrete, but the same principlesapply. In this case the conditional distribution has no well-known form andthe normalising constant must be determined by evaluation of the jointposterior at all 2N possible values of i. In this work, the indicator vectoris treated jointly with the reconstructed data x, as discussed more fully insection 12.5.2.

Page 321: Digital Audio Restoration - a statistical model based approach

H.3 Sampling the prior hyperparameters 303

H.3 Sampling the prior hyperparameters

Noise variance parameters αv and βv . Conditional posterior distribu-tions for the noise variance parameters αv and βv can be obtained directlyfrom the noise variance prior as follows:

p(αv , βv |x,θ−(αv ,βv ),y) = p(αv , βv |σ2vt

(t = 0 . . .N − 1))

∝ p(σ2vt

(t = 0 . . .N − 1)|αv , βv)p(αv)p(βv)

∝∏

t

IG(σ2vt|αv , βv)p(αv)p(βv)

where we have assumed αv and βv independent a prioriRearrangement of this expression leads to the following univariate con-

ditionals:

p(αv |x,θ−(αv ),y) ∝ βNαvv Π−(1+αv )

Γ(αv)Np(αv) (H.4)

and

p(βv |x,θ−(βv ),y) = G (Nαv + a,Σ + b) (H.5)

where Σ =∑

t 1/σ2vt

, Π =∏

t σ2vt

and G(.) denotes the Gamma dis-tribution, p(x|α, β) ∝ xα−1 exp(−βx), and we have assigned the priorp(βv) = G(a, b). Sampling from the distribution of equation (H.5) is astandard procedure. Non-informative or informative priors can be incor-porated as appropriate by suitable choice of a and b. (H.4) is not a wellknown distribution and must be sampled by other means. This is not diffi-cult, however, since the distribution is univariate. In any case, the precisevalue of αv is unlikely to be important, so we sample αv from a uniformgrid of discrete values with probability mass given by (H.4). In principleany prior could easily be used for αv but we choose a uniform prior in theabsence of any further information.

As an alternative to the above it may be reasonable to reparameterise αv

and βv in terms of variables with a more ‘physical’ interpretation, such asthe mean or mode of the noise variance distribution, the idea being that itis usually easier to assign a prior distribution to such parameters. Since themean of the IG(.) distribution is only defined for αv > 1 (see appendix A.4)we use the mode here, given bymv = βv/(αv+1). Another possibility wouldbe the ratio βv/αv which corresponds to the inverse precision parameter1/λ of the equivalent Student-t noise distribution (see appendix H.1).

Similar working to the above leads to the following conditionals for αv

and mv :

p(αv |x,θ−(αv ),y) ∝ (mv(αv + 1))Nαv Π−(1+αv ) exp(−(αv + 1)mvΣ)

Γ(αv)Np(αv)

(H.6)

Page 322: Digital Audio Restoration - a statistical model based approach

304 Appendix H. Derivations for Gibbs Sampler

and

p(mv |x,θ−(mv ),y) = G (Nαv + a, (αv + 1)Σ + b) (H.7)

Markov chain transition probabilities, P01 and P10. Conditionalsfor P01 and P10 are obtained in a similar fashion, since they are condition-ally dependent only upon the switching values it:

p(P01, P10|θ−(P01,P10),y) = p(P01, P10|i)∝ p(i|P01, P10)p(P01)p(P10)

∝ p(P01)p(P10)p(i0)∏

t:it=1,it−1=0

P01

t:it=0,it−1=0

(1 − P01)

t:it=0,it−1=1

P10

t:it=1,it−1=1

(1 − P10)

Rearrangement of this expression under the assumption of uniform or betadistributed priors p(P01) and p(P10) leads to univariate conditionals whichare in the form of the beta distribution, p(x|α, β) ∝ xα−1(1 − x)β−1, sam-pling from which is a standard procedure.

Page 323: Digital Audio Restoration - a statistical model based approach

This is page 305Printer: Opaque this

References

[1] B. Abraham and G. E. P. Box. Linear models and spurious observa-tions. Appl. Statist., 27:120–130, 1978.

[2] B. Abraham and G. E. P. Box. Bayesian analysis of some outlierproblems in time series. Biometrika, 66(2):229–36, 1979.

[3] A. N. Akansu and R. A. Haddad. Multiresolution Signal Decomposi-tion. Academic Press, inc., 1992.

[4] J. Allen. Short term spectral analysis, synthesis and modificationby discrete Fourier transform. IEEE Trans. Acoustics, Speech andSignal Processing, ASSP-25:235–239, June 1977.

[5] B. D. O. Anderson and J. B. Moore. Optimal Filtering. Prentice-Hall, 1979.

[6] J. B. Anderson and S. Mohan. Sequential coding algorithms: A sur-vey and cost analysis. IEEE Trans. Comms., COM-32:169–176, 1984.

[7] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distribu-tions. Journal of the Royal Statistical Society, Series B, 36:99–102,1974.

[8] C. Andrieu, A. Doucet, and P. Duvaut. Bayesian estimation offiltered point processes using mcmc. In in Proc. IEEE AsilomarConf. on Signals, Systems and Computers IEEE, Asilomar, Califor-nia, November 1997.

Page 324: Digital Audio Restoration - a statistical model based approach

306 References

[9] K. Arakawa, D.H. Fender, H. Harashima, H. Miyakawa, andY. Saitoh. Separation of a non-stationary component from the EEGby a nonlinear digital filter. IEEE Trans. Biomedical Engineering,33(7):724–726, 1986.

[10] P. E. Axon and H. Davies. A study of frequency fluctuations in soundrecording and reproducing systems. Proc. IEE, Part III, page 65,Jan. 1949.

[11] A. C. Bajpai, L. R. Mustoe, and D. Walker. Advanced EngineeringMathematics. Wiley, 1977.

[12] V. Barnett and T. Lewis. Outliers in Statistical Data. Chich-ester:Wiley, second edition, 1984.

[13] R. J. Beckman and R. D. Cook. Outlier. . . . . . . . . . s. Technometrics,25(2):119–165, May 1983.

[14] J. Berger, R. R. Coifman, and M. J. Goldberg. Removing noise frommusic using local trigonometric bases and wavelet packets. J. AudioEng. Soc., 42(10):808–818, October 1994.

[15] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. John Wiley& Sons, 1994.

[16] M. Berouti, R. Schwartz, and J. Makhoul. Enhancement of speechcorrupted by additive noise. In Proc. International Conference onAcoustics, Speech and Signal Processing, pages 208–211, 1979.

[17] M. Berouti, R. Schwartz, and J. Makhoul. Enhancement of speechcorrupted by additive noise, pages 69–73. In Lim [110], 1983.

[18] S. A. Billings. Identification of nonlinear systems - A survey. Proc.IEE, part D, 127(6):272–285, November 1980.

[19] S. F. Boll. Suppression of acoustic noise in speech using spectralsubtraction. IEEE Trans. Acoustics, Speech and Signal Processing,27(2):113–120, April 1979.

[20] S.F. Boll. Speech enhancement in the 1980’s: Noise suppression withpattern matching. In S. Furui and M.M. Sondhi, editors, Advances inSpeech Signal Processing, pages 309–325. Marcel Dekker, Inc., 1992.

[21] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel. Time Series Analysis,Forecasting and Control. Prentice Hall, 3rd edition, 1994.

[22] G. E. P. Box and G. C. Tiao. A Bayesian approach to some outlierproblems. Biometrika, 55:119–129, 1968.

Page 325: Digital Audio Restoration - a statistical model based approach

References 307

[23] G. E. P. Box and G. C. Tiao. Bayesian Inference in Statistical Anal-ysis. Addison-Wesley, 1973.

[24] K. Brandenburg. Perceptual coding of high quality digital audio. InBrandenburg and Kahrs [25], page 39.

[25] K. Brandenburg and M. Kahrs, editors. Applications of Digital SignalProcessing to Audio and Acoustics. Kluwer Academic Publishers,1998.

[26] P. Bratley, B. L. Fox, and E. L. Schrage. A Guide to Simulation.New York: Springer-Verlag, 1983.

[27] C. N. Canagarajah. A single-input hearing aid based on auditoryperceptual features to improve speech intelligibility in noise. Proc.IEEE Workshop on Audio and Acoustics, Mohonk, NY State, 1991.

[28] C. N. Canagarajah. Digital Signal Processing Techniques for SpeechEnhancement in Hearing Aids. PhD thesis, University of Cambridge,1993.

[29] O. Cappe. Enhancement of musical signals degraded by backgroundnoise using long-term behaviour of the short-term spectral compo-nents. Proc. International Conference on Acoustics, Speech and Sig-nal Processing, I:217–220, 1993.

[30] O. Cappe. Techniques de reduction de bruit pour la restaurationd’enregistrements musicaux. PhD thesis, l’Ecole Nationale Superieuredes Telecommunications, 1993.

[31] O. Cappe. Elimination of the musical noise phenomenon with theEphraim and Malah noise suppressor. IEEE Trans. on Speech andAudio Processing, 2(2):345–349, 1994.

[32] O. Cappe and J. Laroche. Evaluation of short-time spectral atten-uation techniques for the restoration of musical recordings. IEEETrans. on Speech and Audio Processing, 3(1):84–93, January 1995.

[33] B. P Carlin, N. G. Polson, and D. S. Stoffer. A Monte Carlo ap-proach to nonnormal and nonlinear state-space modeling. Journal ofAmerican Statistical Association, 87(418):493–500, June 1992.

[34] M. J. Carrey and I. Buckner. A system for reducing impulsive noiseon gramophone reproduction equipment. The Radio Electronic En-gineer, 50(7):331–336, July 1976.

[35] C. K. Carter and R. Kohn. On Gibbs Sampling for state space mod-els. Biometrika, 81(3):541–553, 1994.

Page 326: Digital Audio Restoration - a statistical model based approach

308 References

[36] C.K. Carter and R. Kohn. Markov chain Monte Carlo in conditionallyGaussian state space models. Biometrika, 83:589–601, 1996.

[37] C. K. Chui. Wavelet Analysis and its Applications, Volume 1: AnIntroduction to Wavelets. Academic Press, inc., 1992.

[38] C. K. Chui. Wavelet Analysis and its Applications, Volume 2:Wavelets: A Tutorial in Theory and Applications. Academic Press,inc., 1992.

[39] L. Cohen. Time-frequency distributions – a review. Proc. IEEE,77(7), July 1989.

[40] M. K. Cowles and B. P Carlin. Markov chain Monte Carlo con-vergence diagnostics – a comparative review. Journal of AmericanStatistical Association, (434):94–883–904, 1996.

[41] R. E. Crochiere and L. R. Rabiner. Multirate Digital Signal Process-ing. Prentice-Hall, 1983.

[42] P. de Jong and N. Shephard. The simulation smoother for time seriesmodels. Biometrika, 82(2):339–350, 1995.

[43] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihoodfrom incomplete data via the EM algorithm. Journal of the RoyalStatistical Society, Series B, 39(1):1–38, 1977.

[44] M. Dendrinos, S. Bakamidis, and G. Carayannis. Speech enhance-ment from noise: A regenerative approach. Speech Communication,10:45–57, 1991.

[45] P. Depalle, G. Garcia, and X. Rodet. Tracking of partials for additivesound synthesis using hidden Markov models. In Proc. InternationalConference on Acoustics, Speech and Signal Processing, pages 225–228. April 1993.

[46] L. Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, 1986.

[47] R. M. Dolby. An audio noise reduction system. J. Audio Eng. Soc.,15(4):383–388, 1967.

[48] J.L. Doob. Stochastic processes. John Wiley and Sons, 1953.

[49] A. Doucet and P. Duvaut. Fully Bayesian analysis of hidden Markovmodels. In in Proc. EUSIPCO, Trieste, Italy, September 1996.

[50] A. Doucet and P. Duvaut. Bayesian estimation of state space mod-els applied to deconvolution of bernoulli-gaussian processes. SignalProcessing, 57:147–161, 1997.

Page 327: Digital Audio Restoration - a statistical model based approach

References 309

[51] R. O. Duda and P. E. Hart. Pattern Classification and Scene Anal-ysis. John Wiley and Sons, 1973.

[52] J. Durbin. Efficient estimation of parameters in moving-average mod-els. Biometrika, 46:306–316, 1959.

[53] A.J. Efron and H. Jeen. Pre-whitening for detection in correlatedplus impulsive noise. Proc. International Conference on Acoustics,Speech and Signal Processing, II:469–472, 1992.

[54] Y. Ephraim and D. Malah. Speech enhancement using optimal non-linear spectral amplitude estimation. In Proc. International Confer-ence on Acoustics, Speech and Signal Processing, pages 1118–1121,Boston, 1983.

[55] Y. Ephraim and D. Malah. Speech enhancement using a minimummean-square error short-time spectral amplitude estimator. IEEETrans. Acoustics, Speech and Signal Processing, ASSP-21(6):1109–1121, December 1984.

[56] Y. Ephraim and D. Malah. Speech enhancement using a minimummean-square error log-spectral amplitude estimator. IEEE Trans.Acoustics, Speech and Signal Processing, 33(2):443–445, 1985.

[57] Y. Ephraim and H. VanTrees. A signal subspace approach for speechenhancement. Proc. International Conference on Acoustics, Speechand Signal Processing, II:359–362, 1993.

[58] W. Etter and G. S. Moschytz. Noise reduction by noise-adaptivespectral magnitude expansion. J. Audio Eng. Soc., 42(5):341–349,1994.

[59] M. Feder, A. V. Oppenheim, and E. Weinstein. Maximum likelihoodnoise cancellation using the EM algorithm. IEEE Trans. Acoustics,Speech and Signal Processing, 37(2), February 1989.

[60] G. D. Forney. The Viterbi algorithm. Proc. IEEE, 61(3):268–278,March 1973.

[61] S. Fruhwirth-Schnatter. Data augmentation and dynamic linearmodels. Journal of Time Series Analysis, 15:183–202, 1994.

[62] U. R. Furst. Periodic variations of pitch in sound reproduction byphonographs. Proc. IRE, page 887, November 1946.

[63] A. E. Gelfand and A. F. M. Smith. Sampling-based approaches tocalculating marginal densities. Journal of American Statistical As-sociation, 85:398–409, 1990.

Page 328: Digital Audio Restoration - a statistical model based approach

310 References

[64] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Analysis and Machine Intelligence, 6:721–741, 1984.

[65] E. I. George and R. E. McCulloch. Variable selection via Gibbs sampling. Journal of American Statistical Association, 88(423):881–889, September 1993.

[66] E.B. George and M.J.T. Smith. Analysis-by-synthesis/overlap-add sinusoidal modeling applied to the analysis and synthesis of musical tones. J. Audio Eng. Soc., 40(6), June 1992.

[67] E. N. Gilbert. Capacity of a burst-noise channel. Bell System Technical Journal, pages 1253–1265, 1960.

[68] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman and Hall, 1996.

[69] S. J. Godsill. Digital Signal Processing. U.K. Patent no. 2280763, 1993. Patent granted 1997.

[70] S. J. Godsill. The Restoration of Degraded Audio Signals. PhD thesis, Cambridge University Engineering Department, Cambridge, England, December 1993.

[71] S. J. Godsill. Recursive restoration of pitch variation defects in musical recordings. In Proc. International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 233–236, Adelaide, April 1994.

[72] S. J. Godsill. Bayesian enhancement of speech and audio signals which can be modelled as ARMA processes. International Statistical Review, 65(1):1–21, 1997.

[73] S. J. Godsill. Robust modelling of noisy ARMA signals. In Proc. International Conference on Acoustics, Speech and Signal Processing, April 1997.

[74] S. J. Godsill and A. C. Kokaram. Joint interpolation, motion and parameter estimation for image sequences with missing data. In Proc. EUSIPCO, September 1996.

[75] S. J. Godsill and A. C. Kokaram. Restoration of image sequences using a causal spatio-temporal model. In Proc. 17th Leeds Annual Statistics Research Workshop (The Art and Science of Bayesian Image Analysis), July 1997.

[76] S. J. Godsill and P. J. W. Rayner. A Bayesian approach to the detection and correction of bursts of errors in audio signals. In Proc. International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 261–264, San Francisco, March 1992.

[77] S. J. Godsill and P. J. W. Rayner. Frequency-domain interpolation of sampled signals. In Proc. International Conference on Acoustics, Speech and Signal Processing, volume I, pages 209–212, Minneapolis, April 1993.

[78] S. J. Godsill and P. J. W. Rayner. The restoration of pitch variation defects in gramophone recordings. In Proc. IEEE Workshop on Audio and Acoustics, Mohonk, NY State, October 1993.

[79] S. J. Godsill and P. J. W. Rayner. A Bayesian approach to the restoration of degraded audio signals. IEEE Trans. on Speech and Audio Processing, 3(4):267–278, July 1995.

[80] S. J. Godsill and P. J. W. Rayner. Robust noise modelling with application to audio restoration. In Proc. IEEE Workshop on Audio and Acoustics, Mohonk, NY State, October 1995.

[81] S. J. Godsill and P. J. W. Rayner. Robust noise reduction for speech and audio signals. In Proc. International Conference on Acoustics, Speech and Signal Processing, May 1996.

[82] S. J. Godsill and P. J. W. Rayner. Robust treatment of impulsive noise in speech and audio signals. In J.O. Berger, B. Betro, E. Moreno, L.R. Pericchi, F. Ruggeri, G. Salinetti, and L. Wasserman, editors, Bayesian Robustness - proceedings of the workshop on Bayesian robustness, May 22-25, 1995, Rimini, Italy, volume 29, pages 331–342. IMS Lecture Notes - Monograph Series, 1996.

[83] S. J. Godsill and P. J. W. Rayner. Robust reconstruction and analysis of autoregressive signals in impulsive noise using the Gibbs sampler. IEEE Trans. on Speech and Audio Processing, July 1998. Previously available as Tech. Report CUED/F-INFENG/TR.233.

[84] S. J. Godsill, P. J. W. Rayner, and O. Cappe. Digital audio restoration. In K. Brandenburg and M. Kahrs, editors, Applications of Digital Signal Processing to Audio and Acoustics. Kluwer Academic Publishers, 1998.

[85] S. J. Godsill and C. H. Tan. Removal of low frequency transient noise from old recordings using model-based signal separation techniques. In Proc. IEEE Workshop on Audio and Acoustics, Mohonk, NY State, October 1997.

[86] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1989.

[87] R.M. Gray and L.D. Davisson. Random Processes: a Mathematical Approach for Engineers. Prentice-Hall, 1986.

[88] I. Guttman. Care and handling of univariate or multivariate outliers in detecting spuriosity – a Bayesian approach. Technometrics, 15(4):723–738, November 1973.

[89] H. M. Hall. The generalized ‘t’ model for impulsive noise. Proc. International Symposium on Info. Theory, 1970.

[90] J. P. Den Hartog. Mechanical Vibrations. McGraw-Hill, 1956.

[91] A.C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1989.

[92] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.

[93] S. Haykin. Modern Filters. MacMillan, 1989.

[94] S. Haykin. Adaptive Filter Theory. Prentice-Hall, second edition, 1991.

[95] P.J. Huber. Robust Statistics. Wiley and Sons, 1981.

[96] A. J. E. M. Janssen, R. Veldhuis, and L. B. Vries. Adaptive interpolation of discrete-time signals that can be modeled as AR processes. IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-34(2):317–330, April 1986.

[97] H. Jeffreys. Theory of Probability. Oxford University Press, 1939.

[98] N. L. Johnson and S. Kotz. Distributions in Statistics: Continuous Univariate Distributions. Wiley, 1970.

[99] M. Kahrs. Digital audio system architecture. In Brandenburg and Kahrs [25], page 195.

[100] R. E. Kalman. A new approach to linear filtering and prediction problems. ASME J. Basic Eng., 82:35–45, 1960.

[101] T. Kasparis and J. Lane. Adaptive scratch noise filtering. IEEE Trans. Consumer Electronics, 39(4), 1993.

[102] G. R. Kinzie, Jr. and D. W. Gravereaux. Automatic detection of impulse noise. J. Audio Eng. Soc., 21(3):331–336, April 1973.

[103] G. Kitagawa and W. Gersch. Smoothness Priors Analysis of Time Series. Lecture notes in statistics. Springer-Verlag, 1996.

[104] B. Kleiner, R. D. Martin, and D. J. Thomson. Robust estimation of power spectra (with discussion). Journal of the Royal Statistical Society, Series B, 41(3):313–351, 1979.

[105] A. C. Kokaram and S. J. Godsill. A system for reconstruction of missing data in image sequences using sampled 3D AR models and MRF motion priors. In Computer Vision - ECCV ’96, volume II, pages 613–624. Springer Lecture Notes in Computer Science, April 1996.

[106] A. C. Kokaram and S. J. Godsill. Detection, interpolation, motion and parameter estimation for image sequences with missing data. In Proc. International Conference on Image Applications and Processing, Florence, Italy, September 1997.

[107] B. Koo, J. D. Gibson, and S. D. Gray. Filtering of coloured noise for speech enhancement and coding. Proc. International Conference on Acoustics, Speech and Signal Processing, pages 349–352, 1989.

[108] L. Kuo and B. Mallick. Variable selection for regression models. Sankhya, 1997. (to appear).

[109] R. Lagadec and D. Pelloni. Signal enhancement via digital signal processing. In Preprints of the AES 74th Convention, New York, 1983.

[110] J. S. Lim, editor. Speech Enhancement. Prentice-Hall signal processing series. Prentice-Hall, 1983.

[111] J. S. Lim and A. V. Oppenheim. All-pole modelling of degraded speech. IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-26(3), June 1978.

[112] J. S. Lim and A. V. Oppenheim. Enhancement and bandwidth compression of noisy speech. Proc. IEEE, 67(12), 1979.

[113] J. Liu, W. H. Wong, and A. Kong. Covariance structure of the Gibbs sampler with applications to the comparison of estimators and augmentation schemes. Biometrika, 81:27–40, 1994.

[114] L. Ljung. System Identification: Theory for the User. Prentice-Hall, 1987.

[115] G. B. Lockhart and D. J. Goodman. Reconstruction of missing speech packets by waveform substitution. Signal Processing 3: Theories and Applications, pages 357–360, 1986.

[116] M. Lorber and R. Hoeldrich. A combined approach for broadband noise reduction. In Proc. IEEE Workshop on Audio and Acoustics, Mohonk, NY State, October 1997.

[117] D. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, Calif. Inst. Tech., 1992.

[118] R. C. Maher. A method for extrapolation of missing digital audio data. 95th AES Convention, New York, 1993.

[119] J. Makhoul. Linear prediction: A tutorial review. Proc. IEEE, 63(4):561–580, 1975.

[120] S. Mallat and W. L. Hwang. Singularity detection and processing with wavelets. IEEE Trans. Info. Theory, pages 617–643, 1992.

[121] Manouchehr-Kheradmandnia. Aspects of Bayesian Threshold Autoregressive Modelling. PhD thesis, University of Kent, 1991.

[122] R. J. Marks II. Introduction to Shannon Sampling and Interpolation Theory. Springer-Verlag, 1991.

[123] R.D. Martin. Robust estimation of autoregressive models. In D.R. Brillinger and G.C. Tiao, editors, Directions in Time Series, pages 228–254, 1980.

[124] P. Martland. Since Records Began: EMI, the First Hundred Years. B.T. Batsford Ltd., 1997.

[125] R. J. McAulay and M. L. Malpass. Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoustics, Speech and Signal Processing, 28(2):137–145, April 1980.

[126] R. J. McAulay and T. F. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-34(4):744–754, 1986.

[127] R. J. McAulay and T. F. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. In W.B. Kleijn and K.K. Paliwal, editors, Speech Coding and Synthesis. Elsevier Science B.V., 1995.

[128] R. E. McCulloch and R. S. Tsay. Bayesian inference and prediction for mean and variance shifts in autoregressive time series. Journal of American Statistical Association, 88(423):968–978, 1993.

[129] R. E. McCulloch and R. S. Tsay. Bayesian analysis of autoregressive time series via the Gibbs sampler. Journal of Time Series Analysis, 15(2):235–250, 1994.

[130] K. J. Mercer. Identification of signal distortion models. PhD thesis, University of Cambridge, 1993.

[131] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1091, 1953.

[132] N. J. Miller. Recovery of singing voice from noise by synthesis. Thesis Tech. Rep. UTEC-CSC-74-013, Univ. Utah, May 1973.

[133] E. Molloy. High Fidelity Sound Reproduction. Newnes, 1958.

[134] S. Montresor. Etude de la transformee en ondelettes dans le cadre de la restauration d’enregistrements anciens et de la determination de la frequence fondamentale de la parole. PhD thesis, Universite du Maine, Le Mans, 1991.

[135] S. Montresor, J. C. Valiere, and M. Baudry. Detection et suppression de bruits impulsionnels appliquees a la restauration d’enregistrements anciens. Colloq. de Physique, C2, supplement au no. 2, Tome 51, Fevrier, pages 757–760, 1990.

[136] T.K. Moon. The expectation-maximization algorithm. IEEE Signal Processing Magazine, pages 47–60, November 1996.

[137] B. Moore. An Introduction to the Psychology of Hearing. Academic Press, fourth edition, 1997.

[138] J. A. Moorer and M. Berger. Linear-phase bandsplitting: Theory and applications. J. Audio Eng. Soc., 34(3):143–152, 1986.

[139] F. Mosteller and J.W. Tukey. Data Analysis and Regression. Addison-Wesley, Reading, Mass., 1977.

[140] S.L. Ng. Estimation of corrupted samples in autoregressive moving average (ARMA) data sequences. MEng dissertation, Cambridge University Engineering Dept., May 1997.

[141] M. Niedzwiecki. Recursive algorithm for elimination of measurement noise and impulsive disturbances from ARMA signals. Signal Processing VII: Theories and Applications, pages 1289–1292, 1994.

[142] M. Niedzwiecki and K. Cisowski. Adaptive scheme for elimination of broadband noise and impulsive disturbances from AR and ARMA signals. IEEE Trans. on Signal Processing, pages 528–537, 1996.

[143] A. Nieminen, M. Miettinen, P. Heinonen, and Y. Nuevo. Music restoration using median type filters with adaptive filter substructures. In V. Cappellini and A. Constantinides, editors, Digital Signal Processing-87. North-Holland, 1987.

[144] C. L. Nikias and M. Shao. Signal Processing with α-Stable Distributions and Applications. John Wiley and Sons, 1995.

[145] A.V. Oppenheim and R.W. Schafer. Digital Signal Processing. Prentice-Hall, 1975.

[146] J. J. K. O Ruanaidh and W. J. Fitzgerald. Interpolation of missing samples for audio restoration. Electronics Letters, 30(8), April 1994.

[147] K. K. Paliwal and A. Basu. A speech enhancement method based on Kalman filtering. Proc. International Conference on Acoustics, Speech and Signal Processing, pages 177–180, 1987.

[148] A. Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill, New York, 3rd edition, 1991.

[149] S. Park, G. Hillman, and R. Robles. A novel technique for real-time digital sample-rate converters with finite precision error analysis. In Proc. International Conference on Acoustics, Speech and Signal Processing, 1991.

[150] D. L. Phillips. A technique for numerical solution of certain integral equations of the first kind. J. Ass. Comput. Mach., 9:97–101, 1962.

[151] I. Pitas and A. N. Venetsanopoulos. Nonlinear Digital Filters. Kluwer Academic Publishers, 1990.

[152] H. J. Platte and V. Rowedda. A burst error concealment method for digital audio tape application. AES preprint, 2201:1–16, 1985.

[153] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C. Cambridge University Press, second edition, 1992.

[154] R.D. Preuss. A frequency domain noise cancelling preprocessor for narrowband speech communications systems. Proc. International Conference on Acoustics, Speech and Signal Processing, pages 212–215, April 1979.

[155] M. B. Priestley. Spectral Analysis and Time Series. Academic Press, 1981.

[156] J.G. Proakis, J.R. Deller, and J.H.L. Hansen. Discrete-Time Processing of Speech Signals. Macmillan, New York, 1993.

[157] J.G. Proakis and D.G. Manolakis. Introduction to Digital Signal Processing. Macmillan, 1988.

[158] T. F. Quatieri and R. J. McAulay. Audio signal processing based on sinusoidal analysis/synthesis. In Brandenburg and Kahrs [25], page 343.

[159] L. R. Rabiner. Digital techniques for changing the sampling rate of a signal. Digital Audio; collected papers from the AES premier conference, pages 79–89, June 1982.

[160] J. J. Rajan, P. J. W. Rayner, and S. J. Godsill. A Bayesian approach to parameter estimation and interpolation of time-varying autoregressive processes using the Gibbs sampler. IEE Proc. Vision, Image and Signal Processing, 144(4), August 1997.

[161] P. J. W. Rayner and S. J. Godsill. The detection and correction of artefacts in archived gramophone recordings. In Proc. IEEE Workshop on Audio and Acoustics, Mohonk, NY State, October 1991.

[162] G. O. Roberts. Markov chain concepts related to sampling algorithms. In W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors, Markov Chain Monte Carlo in Practice, pages 45–57. Chapman and Hall, 1996.

[163] G.O. Roberts and S.K. Sahu. Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society, Series B, 59:291, 1997.

[164] R.A. Roberts and C.T. Mullis. Digital Signal Processing. Addison-Wesley, 1987.

[165] X. Rodet. Musical sound signal analysis/synthesis: sinusoidal+residual and elementary waveform models. In IEEE UK Symposium on Applications of Time-Frequency and Time-Scale Methods, pages 111–120, August 1997.

[166] J. J. K. O Ruanaidh. Numerical Bayesian methods in signal processing. PhD thesis, University of Cambridge, 1994.

[167] J. J. K. O Ruanaidh and W. J. Fitzgerald. The restoration of digital audio recordings using the Gibbs’ sampler. Technical Report CUED/F-INFENG/TR.134, Cambridge University Engineering Department, 1993.

[168] J. J. K. O Ruanaidh and W. J. Fitzgerald. Numerical Bayesian Methods Applied to Signal Processing. Springer-Verlag, 1996.

[169] X. Serra. A system for sound analysis/transformation/synthesis based on a deterministic plus stochastic decomposition. PhD thesis, Stanford University, October 1989.

[170] N. Shephard. Partial non-Gaussian state space. Biometrika, 81(1):115–131, 1994.

[171] T. G. Stockham, T. M. Cannon, and R. B. Ingebretsen. Blind deconvolution through digital signal processing. Proc. IEEE, 63(4):678–692, April 1975.

[172] M. A. Tanner. Tools for Statistical Inference. Springer-Verlag, second edition, 1993.

[173] C. W. Therrien. Decision, Estimation and Classification. Wiley, 1989.

[174] C. W. Therrien. Discrete Random Signals and Statistical Signal Processing. Prentice-Hall, 1992.

[175] L. Tierney. Markov chains for exploring posterior distributions (with discussion). The Annals of Statistics, 22:1701–1762, 1994.

[176] H. Tong. Non-linear Time Series. Oxford Science Publications, 1990.

[177] P. T. Troughton and S. J. Godsill. Bayesian model selection for time series using Markov chain Monte Carlo. In Proc. International Conference on Acoustics, Speech and Signal Processing, April 1997.

[178] P. T. Troughton and S. J. Godsill. A reversible jump sampler for autoregressive time series, employing full conditionals to achieve efficient model space moves. Technical Report CUED/F-INFENG/TR.304, Cambridge University Engineering Department, 1997.

[179] P. T. Troughton and S. J. Godsill. Bayesian model selection for linear and non-linear time series using the Gibbs sampler. In John G. McWhirter, editor, Mathematics in Signal Processing IV. Oxford University Press, 1998.

[180] P.T. Troughton and S.J. Godsill. MCMC methods for restoration of nonlinearly distorted autoregressive signals. In Proc. European Conference on Signal Processing, September 1998.

[181] P.T. Troughton and S.J. Godsill. Restoration of nonlinearly distorted audio using Markov chain Monte Carlo methods. In Preprint from the 104th Convention of the Audio Engineering Society, May 1998.

[182] P.T. Troughton and S.J. Godsill. A reversible jump sampler for autoregressive time series. In Proc. International Conference on Acoustics, Speech and Signal Processing, April 1998.

[183] D. Tsoukalas, M. Paraskevas, and J. Mourjopoulos. Speech enhancement using psychoacoustic criteria. Proc. International Conference on Acoustics, Speech and Signal Processing, II:359–362, 1993.

[184] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley, 1971.

[185] J. C. Valiere. La Restauration d’Enregistrements Anciens par Traitement Numerique - Contribution a l’etude de Quelques techniques recentes. PhD thesis, Universite du Maine, Le Mans, 1991.

[186] H. Van Trees. Detection, Estimation and Modulation Theory, Part 1. Wiley and Sons, 1968.

[187] S. V. Vaseghi. Algorithms for Restoration of Archived Gramophone Recordings. PhD thesis, University of Cambridge, 1988.

[188] S. V. Vaseghi. Advanced Signal Processing and Digital Noise Reduction. Wiley, 1996.

[189] S. V. Vaseghi and R. Frayling-Cork. Restoration of old gramophone recordings. J. Audio Eng. Soc., 40(10), October 1992.

[190] S. V. Vaseghi and P. J. W. Rayner. A new application of adaptive filters for restoration of archived gramophone recordings. Proc. International Conference on Acoustics, Speech and Signal Processing, V:2548–2551, 1988.

[191] S. V. Vaseghi and P. J. W. Rayner. Detection and suppression of impulsive noise in speech communication systems. IEE Proceedings, Part 1, 137(1):38–46, February 1990.

[192] R. Veldhuis. Restoration of Lost Samples in Digital Signals. Prentice-Hall, 1990.

[193] A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Info. Theory, IT-13:260–269, April 1967.

[194] P.J. Walmsley, S.J. Godsill, and P.J.W. Rayner. Multidimensional optimisation of harmonic signals. In Proc. European Conference on Signal Processing, September 1998.

[195] E. Weinstein, A. V. Oppenheim, M. Feder, and J. R. Buck. Iterative and sequential algorithms for multisensor signal enhancement. IEEE Trans. on Signal Processing, 42(4), April 1994.

[196] M. West. Outlier models and prior distributions in Bayesian linear regression. Journal of the Royal Statistical Society, Series B, 46(3):431–439, 1984.

[197] M. West. On scale mixtures of normal distributions. Biometrika, 74(3):99–102, 1987.

[198] E. T. Whittaker. On a new method of graduation. Proc. Edinburgh Math. Soc., 41:63–75, 1923.

[199] N. Wiener. Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications. MIT Press, 1949.

Index

α-stable noise, 238
nth order moment, 53
z-transform, 22–24
   definition, 23
   frequency response, 23
   poles, 23
   stability, 23
   time shift theorem, 23
   transfer function, 23
   zeros, 23
45rpm, 4

All-pole filter, 86
Analogue de-clicking, 128
Analogue restoration methods, 5
Autocorrelation function, 64
Autoregressive (AR) model, 86–89
   approximate likelihood, 88, 196
   backward prediction error, 132
   conditional probability, 88
   covariance estimate, 88
   exact likelihood, 89, 157
   excitation energy, 116
   Gaussian excitation, 88
   observed in noise, 90
   separation of two AR processes, 156–157
   state-space form, 90
   statistical modelling and estimation, 87
Autoregressive moving average (ARMA) model, 86
Axioms of probability, 41

Background noise, 6
Background noise reduction, see Hiss reduction
Band-pass filter, 67
Bayes factor, 80
Bayes risk, 75
Bayes rule, 43, 47
   cumulative distribution function (CDF), 49
   inferential procedures, 47
   probability density function, 50
   probability mass function (PMF), 48
Bayesian decision theory, 79–85
Bayesian estimation, 73–79
Bayesian regulariser, 178
Bell, Alexander Graham, 4
Berliner, Emile, 4
Bernoulli model, 101, 211
Beta distribution, 304
Blumlein, Alan, 4
Breakage, 158
Breakages, 6, 153
Broadband noise reduction, see Hiss reduction
Burst noise, 192

Caruso, 5
CEDAR Audio, 16
Central moment, 53
Change-points, 234
Characteristic function, 53
   inverse, 54
   moments of a random variable, 54
Chi-squared, 56
Cholesky decomposition, 116
Classification error probability, 80
Click detection, see Detection of clicks
Click removal, 99–134
   Gibbs sampler, 241
   implementation/results, 248
Clicks, 6
Compact disc (CD), 5, 16
Conditional distribution, 47
   cumulative distribution function (CDF), 49
   probability density function (PDF), 49
   probability mass function (PMF), 48
Conditional probability, 42
Convolution, 20–22
   sum of random variables, 54
Convolution of PDFs, 51
Cost function
   Bayesian, 75
   quadratic, 76
   uniform, 76
Cost functions
   Bayesian, 77
Covariance matrix, 60
Crackles, 6
Cross-correlation function, 65
Cumulative distribution function (CDF), 45, 46

Data windows, see Windows, 175
Derived distribution, 55
Derived PDF, 58
Detection of clicks, 101, 127–133
   i.i.d. Gaussian noise assumption, 198
   adaptations to AR detector, 131–132
   analysis and limitations, 130–131
   ARMA model, 133
   autoregressive (AR), 128
   Bayesian, 191–204
      autoregressive (AR), 195
      complexity, 201
      loss functions, 200
      marginalised, 201
      relationship with LS interpolator, 200
      relationship with simple AR, 203
      results, 215–231
      selection of optimal state, 200
   Bayesian sequential
      complexity, 210
      Kalman filter implementation, 210
      state culling, 207
      state culling strategy, 212
      update equations, 209
   false alarms, 101
   false detection, 131
   Gibbs sampler, 244
   high-pass filter, 128
   likelihood for Gaussian AR data, 197
   Markov chain model, 192
   matched filter, 132
   matched filter detector, 132–133
   missed detection, 101
   model-based, 128
   multi-channel, 203
   noise generator prior, 194, 210
   noisy data case, 193
   noisy data model, 202
   non-standard Bayesian loss functions, 213
   probability of error, 127
   probability ratio test, 199
   recursive update for posterior probability, 207
   sequential Bayesian, 203, 205–214
   sinusoid+AR residual model, 133
   statistical methods, 134
   switched noise model, 194
   threshold selection, 101, 131
   uniform prior, 194
Deterministic signals, 38
Differential mappings, 55
Digital audio tape (DAT), 16
Digital filters, 24–25
   finite-impulse-response (FIR), 25
   infinite-impulse-response (IIR), 24
Digital re-sampling, 173
Discrete Fourier transform (DFT), 25–27, 34, 126, 136, 175
   inverse, 26
Discrete probability space, 41
Discrete time Fourier transform (DTFT), 19–20, 66
Distortion
   peak-related, 99

Edison, Thomas, 4
Electrical noise, 135
Elementary outcomes, 40
Ensemble, 61
Equalisation, 135
Event based probability, 39
   fundamental results, 42
Event space, 40
Evidence, 80
   linear model, 81
Expectation, 52–54
   function of random variable, 52
   linear operator, 52
   random variable, 52
Expectation-maximisation (EM), 92–93, 181, 234
Expected risk, 80

Fast Fourier transform (FFT), 26, 34–38, 136
   butterfly structure, 35
   in-place calculation, 37
Finite-impulse-response (FIR) filters, 25
Flutter, 6, see Pitch variation defects
Fourier series, 17
Fourier transform
   relation with characteristic function, 53
Frequency modulation, 173
Frequency response, 22, 23
Frequency shift theorem, 17
Frequency tracking, 173
   birth and death of components, 183
Fully Bayesian restoration, 233–259
Functions of random variables, 54
Functions of random vectors, 58

Gamma density, 277
Gamma function, 277
Gamma integral, 277
Gaussian, 55, 59
   multivariate, 59, 275
   multivariate integral, 276
   univariate, 275
Generalised Gaussian, 238
Gibbs sampler, 94, 119, 234
Global defects, 6
Global degradation, 6
Gramophone discs, 1
   fluctuation in turntable speed, 172
   imperfectly punched centre hole, 171

Hamming window, 32, 136
Hanning window, 29, 32, 136
Heavy-tailed noise distribution, 233
High-pass filter
   removal of low frequency noise pulses, 154
Hiss reduction, 135–149
   model based, 148–149
   noise artefacts, 141
   spectral domain, 136–148
      maximum likelihood (ML), 148
      minimum mean-squared error (MMSE), 148
      psychoacoustical methods, 148
   sub-space methods, 148
   wavelets, 148
   Wiener, 138

Impulse train, 17
Impulsive stimuli, 154
Independence, 44
Infinite-impulse-response (IIR) filters, 24
Innovation, 71
Interpolation, 101–127
   autoregressive (AR), 106–122
      adaptations to, 110
      AR plus basis function representation, 111
      examples, 108
      incorporating a noise model, 119
      least squares (LS), 106–107
      maximum a posteriori (MAP), 107–108
      pitch-based adaptation, 111
      random sampling methods, 116–119
      sequential methods, 119–122
      sinusoid plus AR, 115–116
      unknown parameters, 108–110
   autoregressive moving-average (ARMA), 122–126
   autoregressive (AR)
      AR plus basis function representation, 116
   expectation-maximisation (EM), 239
   frequency domain, 126–127
   Gaussian signals, 103–105
      incorporating a noise model, 105
   Gibbs sampler, 242
   model-based, 102
   transform domain methods, 126
Interpolation of corrupted samples, see Interpolation
Inverse image method, 54
Inverted-gamma density, 277
Inverted-gamma prior, 238

Jacobian, 59, 178
Jeffreys prior, 79, 238, 278, 286
Joint PDF, 50

Kalman filter, 90, 91, 125, 205
   evaluation of likelihood, 91
   prediction error decomposition, 91

Laplace transform, 22
Least squares (LS)
   relationship with maximum likelihood, 73
Likelihood function, 71
Linear model, general, 70
   MAP estimator, 77
Linear prediction, 86
Linear time invariant (LTI), 20
Localised defects, 6
Localised degradation, 6
Location parameter, 79
Loss function, 80
   symmetric, 80
Low frequency noise pulses, 153–170
   cross-correlation detector, 156
   detection, 156
   experimental results, 162
   Kalman filter implementation, 163
   matched filter detection, 156
   template method, 153, 154
   unknown parameter estimation, 160
Low-pass filter, 19
LP (vinyl), 5

Magnetic tapes, 1, 4, 135
   unevenly stretched, 171
MAP classification, 195
Marginal density, 119
Marginal probability, 50, 57, 179
Marginalisation, 77–78
Markov chain, 192
   noise generator model, 206
   noise generator process, 211
Markov chain Monte Carlo (MCMC), 93–95
   review of methods, 234
Masking
   simultaneous, 148
Masking of small clicks, 133
Matched filter, 132
Maximum likelihood
   general linear model, 73
Maximum likelihood (ML), 71–73, 102
   ill-posed, 178
Maximum a posteriori (MAP), 102, 127
Maximum a posteriori (MAP) estimator, 76
Median filter, 102
Minimum mean-squared error
   Kalman filter, 91
Minimum mean-squared error (MMSE), 76, 102
Minimum variance estimation, 103
Missing data, 77
Modelling of clicks, 100–101
Modelling of signals, 85–92
Multiresolution methods, 128
Musical noise, 141
   elimination, 145
   masking, 147
Musical signals
   fundamental and overtones, 173
   note slides, 173
   tracking of tonal components, 175

Noise
   additive, 100
   bursts, 100
   clicks, 99
   clustering, 100
   localised, 99, 100
   replacement, 100
Noise artefacts, 141
Noise reduction, see Hiss reduction
Non-Gaussian noise, 234
Non-linear distortions, 6
Non-linear models
   stiffening spring, 158
Non-uniform sampling, 185
Non-uniform time-warping, 172
Normal distribution, see Gaussian
Nyquist frequency, 190
Nyquist sampling theorem, 16–19

Observation equation, 90
Optical sound tracks, 153
Outliers, 127, 191, 234
Overlap-add synthesis, 138

Parameter estimation, 70–79
Partials, see Musical signals
Phase
   sensitivity of ear, 140
Phonautograph, 2
Phonograph, 4
Pitch period, 102, 111
Pitch variation defects, 171–190
   additive model, 176
   AR model, 180
   Bayesian estimator, 178
   deterministic models, 182
   discontinuities in frequency tracks, 182
   experimental results, 183
   frequency tracking, 173
   generation of pitch variation curve, 176
   log-additive model, 176
   multiplicative model, 176
   prior models, 180
   smoothness prior, 179, 181
Point estimates, 75
Pops, see Low frequency noise pulses
Posterior probability, 43, 74
   recursive update, 205
Power spectrum, 66
Power subtraction, 140
Pre-whitening filter, 132
Prediction error decomposition, 91
Prior distribution
   choice of, 78
   conjugate, 79
   Gaussian, 78
   inverted-gamma, 79
   non-informative, 79
Prior probability, 43
Probability
   frequency-based interpretation, 40
Probability density function (PDF), 45, 46
Probability mass function (PMF), 45, 46
Probability measure, 41
Probability spaces, 40
Probability theory, 39–60

Quasi-periodic signals, 127

Random experiment, 40
Random periodic process, 127
Random process, 60–67
   autocorrelation function, 64
   continuous time, 64
   cross-correlation function, 65
   definition, 64
   discrete time, 64
   linear system response, 67
   mean value, 64
   power spectrum, 66
   stationarity, 65
   strict-sense stationarity, 65
   wide-sense stationarity, 66
Random signals, 38, 61
Random variables (RVs), 44–56
   continuous, 45
   definition, 45
   discrete, 45
   probability distribution, 45
Random vectors, 56–60
   Bayes rule, 57
   CDF, 56
   conditional distribution, 57
   PDF, 57
Real time processing, 15, 16, 133
Rectangular window, 32
Recursive least squares (RLS), 84
   relationship with Kalman filter, 92
Residual noise, 141
Restoration
   statistical methods, 133
Risk function, 213

Sample rate conversion, 190
   time-varying, 173
   truncated sinc function, 190
Sample space, 40
Scale mixture of Gaussians, 233, 238
Scale parameter, 77, 79
Scratches, 6, 153
Second difference smoothness model, 162
Separation of AR processes, see Autoregressive (AR) model
Sequential Bayes, see Click detection
Sequential Bayesian classification, 82–85
   linear model, 85
Sigma algebra, 40
Sinusoidal model, 71, 126
   coding of speech and audio, 175
Smearing, 29
Smoothness prior, 181
Sonic Solutions, 16
Sound recordings
   film sound tracks, 99
Source-filter model, 86
Spectral leakage, 29
Spectral subtraction, 140
State update equation, 90
State-space model, 90–92
   non-Gaussian, 234
   non-linear, 234
Stationarity, 65
   short-term, 128
Step-like stimuli, 154
   mechanical response, 158
Stiffening spring mechanical system, 158
Stochastic processes, see random processes
Strict-sense stationarity, 65
Student-t distribution, 238
Sum of random variables, 51, 54
Switched noise model, 194

Thumps, see Low frequency noise pulses
Time series, 64
Tone arm resonance, 158
Total probability, 43
   PDF, 51
   probability mass function (PMF), 48
Transfer function, 23
Transient noise pulses, see Low frequency noise pulses
Tremolo, 176, see Pitch variation defects

Uniform prior, 179, 194, 238, 278
Uniform sampling, 64

Variance, 53
Vibrato, 176, see Pitch variation defects
Viterbi algorithm, 212
Volterra model, 158

Wavelets, 126, 128
Wax cylinders, 1, 4, 135
Wide-sense stationarity, 66
Wiener, 135
Wiener filter, 138
Windows, 27–32
   Bartlett, 32
   continuous signals, 27
   discrete-time signals, 31
   frequency smearing, 29
   generalised Hamming window, 32
   Hamming, 32, 175
   Hanning, 32
   Hanning window, 29
   Kaiser, 32
   Parzen, 32
   rectangular, 32
   spectral leakage, 29
   Tukey, 32
Wow, 6, see Pitch variation defects