Maximum Entropy and Bayesian Methods

Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods




Fundamental Theories of Physics

An International Book Series on The Fundamental Theories of Physics: Their Clarification, Development and Application

Editor: ALWYN VAN DER MERWE, University of Denver, U.S.A.

Editorial Advisory Board:

LAWRENCE P. HORWITZ, Tel-Aviv University, Israel
BRIAN D. JOSEPHSON, University of Cambridge, U.K.
CLIVE KILMISTER, University of London, U.K.
GÜNTER LUDWIG, Philipps-Universität, Marburg, Germany
ASHER PERES, Israel Institute of Technology, Israel
NATHAN ROSEN, Israel Institute of Technology, Israel
MENDEL SACHS, State University of New York at Buffalo, U.S.A.
ABDUS SALAM, International Centre for Theoretical Physics, Trieste, Italy
HANS-JÜRGEN TREDER, Zentralinstitut für Astrophysik der Akademie der Wissenschaften, Germany

Volume 70


Maximum Entropy and Bayesian Methods
Cambridge, England, 1994

Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

edited by

John Skilling
and
Sibusiso Sibisi
Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, England

KLUWER ACADEMIC PUBLISHERS
DORDRECHT / BOSTON / LONDON


A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN-13: 978-94-010-6534-4
e-ISBN-13: 978-94-009-0107-0
DOI: 10.1007/978-94-009-0107-0

Published by Kluwer Academic Publishers, P.O. Box 17, 3300 AA Dordrecht, The Netherlands.

Kluwer Academic Publishers incorporates the publishing programmes of D. Reidel, Martinus Nijhoff, Dr W. Junk and MTP Press.

Sold and distributed in the U.S.A. and Canada by Kluwer Academic Publishers, 101 Philip Drive, Norwell, MA 02061, U.S.A.

In all other countries, sold and distributed by Kluwer Academic Publishers Group, P.O. Box 322, 3300 AH Dordrecht, The Netherlands.

Printed on acid-free paper

All Rights Reserved

© 1996 Kluwer Academic Publishers

Softcover reprint of the hardcover 1st edition 1996

No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without written permission from the copyright owner.


To the ideal of rational inference


Contents

Preface  xi

APPLICATIONS

E.J. Fordham, D. Xing, J.A. Derbyshire, S.J. Gibbs, T.A. Carpenter, L.D. Hall
Flow and diffusion images from Bayesian spectral analysis of motion-encoded NMR data  1

G.J. Marseille, R. de Beer, A.F. Mehlkopf, D. van Ormondt
Bayesian estimation of MR images from incomplete raw data  13

S.M. Glidewell, B.A. Goodman, J. Skilling
Quantified maximum entropy and biological EPR spectra  23

R. Fischer, W. von der Linden, V. Dose
The vital importance of prior information for the decomposition of ion spectroscopy data  31

W. von der Linden, K. Ertl, V. Dose
Bayesian consideration of the tomography problem  41

N.J. Davidson, B.J. Cole, H.G. Miller
Using MaxEnt to determine nuclear level densities  51

V.A. Macaulay, B. Buck
A fresh look at model selection in inverse scattering  59

S. Hansen, J.J. Muller
The maximum entropy method in small-angle scattering  69

L.-H. Zou, Z. Wang, L.E. Roemer
Maximum entropy multi-resolution EM tomography by adaptive subdivision  79

Y. Cao, T.A. Prince
High resolution image construction from IRAS survey - parallelization and artifact suppression  91

F. Solms, P.G.W. van Rooyen, I.S. Kunicki
Maximum entropy performance analysis of spread-spectrum multiple-access communications  101


L. Stergioulas, A. Vourdas, G.R. Jones
Noise analysis in optical fibre sensing: A study using the maximum entropy method  109

ALGORITHMS

J. Stutz, P. Cheeseman
AutoClass - a Bayesian approach to classification  117

P. Desmedt, I. Lemahieu, K. Thielemans
Evolution reviews of BayesCalc, a MATHEMATICA package for doing Bayesian calculations  127

V. Kadirkamanathan
Bayesian inference for basis function selection in nonlinear system identification using genetic algorithms  135

M. Tribus
The meaning of the word "Probability"  143

K.M. Hanson, G.S. Cunningham
The hard truth  157

A.J.M. Garrett
Are the samples doped - If so, how much?  165

C. Rodriguez
Confidence intervals from one observation  175

G.A. Vignaux, B. Robertson
Hypothesis refinement  183

S. Sibisi, J. Skilling
Bayesian density estimation  189

S. Brette, J. Idier, A. Mohammad-Djafari
Scale-invariant Markov models for Bayesian inversion of linear inverse problems  199

M. Schramm, M. Greiner
Foundations: Indifference, independence and MaxEnt  213

J.-F. Bercher, G. Le Besnerais, G. Demoment
The maximum entropy on the mean method, noise and sensitivity  223

G.J. Daniell
The maximum entropy algorithm applied to the two-dimensional random packing problem  233


NEURAL NETWORKS

A.H. Barnett, D.J.C. MacKay
Bayesian comparison of models for images  239

D.J.C. MacKay, R. Takeuchi
Interpolation models with multiple hyperparameters  249

D.J.C. MacKay
Density networks and their application to protein modelling  259

S.P. Luttrell
The cluster expansion: A hierarchical density model  269

S.P. Luttrell
The partitioned mixture distribution: Multiple overlapping density models  279

PHYSICS

S.F. Gull, A.J.M. Garrett
Generating functional for the BBGKY hierarchy and the N-identical-body problem  287

D. Montgomery, X. Shan, W.H. Matthaeus
Entropies for continua: Fluids and magnetofluids  303

R.S. Silver
A logical foundation for real thermodynamics  315

Index  321


Preface

This volume records papers given at the fourteenth international maximum entropy conference, held at St John's College Cambridge, England. It seems hard to believe that just thirteen years have passed since the first in the series, held at the University of Wyoming in 1981, and six years have passed since the meeting last took place here in Cambridge. So much has happened.

There are two major themes at these meetings, inference and physics. The inference work uses the confluence of Bayesian and maximum entropy ideas to develop and explore a wide range of scientific applications, mostly concerning data analysis in one form or another. The physics work uses maximum entropy ideas to explore the thermodynamic world of macroscopic phenomena. Of the two, physics has the deeper historical roots, and much of the inspiration behind the inference work derives from physics. Yet it is no accident that most of the papers at these meetings are on the inference side. To develop new physics, one must use one's brains alone. To develop inference, computers are used as well, so that the stunning advances in computational power render the field open to rapid advance.

Indeed, we have seen a revolution. In the larger world of statistics beyond the maximum entropy movement as such, there is now an explosion of work in Bayesian methods, as the inherent superiority of a defensible and consistent logical structure becomes increasingly apparent in practice. In principle, the revolution was overdue by some decades, as our elder statesmen such as Edwin Jaynes and Myron Tribus will doubtless attest. Yet in practice, we needed the computers: knowing what ought to be added up is of limited value until one can actually do the sums.

Here, in this series of proceedings, one can see the revolution happen as the power and range of the work expand, and the level of understanding deepens. The movement is wary of orthodoxy, and not every author (to say nothing of the editors) will agree with every word written by every other author. So, reader, scan the pages with discernment for the jewels made for you ...

As a gesture of faith and goodwill, our publishers, Kluwer Academic Publishers, actively sponsored the meeting, and in this they were joined by Bruker Spectrospin and by MaxEnt Solutions Ltd. To these organisations, we express our gratitude and thanks. Our thanks also go to the staff of St John's College, Cambridge, for their efficiency and help in letting the meeting be worthy of its surroundings. Let us all go forward together.

John Skilling, Sibusiso Sibisi
Cavendish Laboratory, Cambridge
1995



FLOW AND DIFFUSION IMAGES FROM BAYESIAN SPECTRAL ANALYSIS OF MOTION-ENCODED NMR DATA

E.J. Fordham*, D. Xing, J.A. Derbyshire, S.J. Gibbs, T.A. Carpenter and L.D. Hall

Herchel Smith Laboratory for Medicinal Chemistry, Cambridge University School of Clinical Medicine, University Forvie Site, Robinson Way, Cambridge CB2 2PZ, U.K.

ABSTRACT. Quantitative imaging of steady laminar flow fields in up to three dimensions is achieved by NMR imaging with the addition of motion-encoding field gradient pulses; coherent flow is encoded as a phase shift, diffusive or dispersive processes as an attenuation. A sequence of images with incremented gradient pulse areas displays at each pixel a damped sinusoidal oscillation with frequency proportional to a convective flow velocity, and a Gaussian envelope dependent on local effective diffusivity. Velocity and diffusivity are obtained from a spectral analysis of such oscillations. Traditional Fourier analysis has been used with many images in such a sequence. Such approaches are not economical with data acquisition time; nor are error estimates available. The Bayesian spectral analysis of Bretthorst (1988, 1991), although currently applied mainly to spectroscopic data, permits also the routine analysis of noisy, heavily truncated, non-uniformly and sparsely sampled data. Bayesian error intervals are also available. We demonstrate a non-uniform sampling strategy that requires only four images to obtain velocity and diffusion images for various laminar liquid flows: water and a non-Newtonian polymer solution in a cylindrical pipe, and 3-dimensional flow of water in a duct of complex geometry. The latter experiment is in part made practicable by thus minimising the number of images acquired.

1. Introduction

The inherent motion-sensitivity of NMR has been known since not long after the discovery of the phenomenon (e.g. Singer (1960), Hahn (1960)). Interestingly, several proposals exist for non-invasive detection of flow and even determination of velocity distributions (e.g. Grover & Singer (1971)) which clearly predate the invention of NMR imaging (Lauterbur, Mansfield & Grannell (1973)), although like the latter they depend upon the experimental introduction of uniform field gradients and the Fourier inversion of the detected signal. In recent years NMR imaging has been applied to a variety of flows in heterogeneous media and ducts of various complexities. Imaging very slow processes (e.g. flows in porous media) is best accomplished by acquiring successive time-lapsed images (e.g. Fordham et al. (1993)(a)); however, for flow in pipes, exploitation of the motion-sensitivity of NMR is essential.

* On leave from Schlumberger Cambridge Research, Cambridge CB3 OEL, U.K.

J. Skilling and S. Sibisi (eds.), Maximum Entropy and Bayesian Methods, 1-12. © 1996 Kluwer Academic Publishers.


Many techniques exist. For quantitative measurement we use Pulsed Field Gradients (PFG), whereby (coherent) motion is encoded as a phase shift in the NMR spin-echo signal; diffusive motion averages a microscopic distribution of such phase shifts, which thus appears as an attenuation. The PFG technique can readily be applied in combination with the orthodox protocols for standard NMR imaging; a set of images with various strengths of the motion-encoding gradient is acquired, giving an additional dimension which can be analysed simultaneously for flow velocity and diffusivity.

The main defect of these techniques is that they are slow (possibly many hours), either (i) because of the high dimensionality, or (ii) because of unnecessary data acquisition in the motion-encoded dimension prompted by an inappropriate and inefficient tool (the Fourier transform) being used for the velocity (frequency parameter) analysis.

We use two strategies to improve acquisition time: (i) (experimental) an implementation of the Echo-Planar Imaging (EPI) technique of Mansfield (e.g. Mansfield & Morris (1982)), which increases the speed of scanning by $O(10^2)$, and (ii) (analysis) a Bayesian frequency parameter analysis (for flow velocity) of the motion-encoded dimension. Using (i), we have extended the PFG technique in flow imaging to full velocity vector imaging of a fully three-dimensional (albeit steady laminar) flow in a complex baffled duct (Derbyshire et al. (1994)). Using (ii), we have reduced the number of sample points in q-space from 18 (Callaghan & Xia (1991)) to four, with the additional benefit of Bayesian error intervals, of which the traditional Fourier analysis gives no clue. We show various examples, including a re-working of the 3-D flow data of Derbyshire et al., using the Bayesian analysis as well as the EPI experimental method. This experiment can now be accomplished in about half an hour of data acquisition, an entirely practical proposition.

The Bayesian frequency analysis of Bretthorst (1988, 1990(a), 1990(b), 1990(c), 1991, 1992) has been applied with success in high-resolution NMR spectroscopy, typically involving many thousands of data points. Our work uses the same methods, but applied instead to oscillatory data records which are heavily truncated, very sparsely sampled (possibly non-uniformly) and in many practical situations quite noisy (e.g. S/N ≈ 5). We outline first the physics of the measurements.

2. NMR imaging of flow and diffusion

MOTION-ENCODING OF NMR SIGNALS BY THE PULSED FIELD GRADIENT TECHNIQUE

The NMR signal may be made motion-sensitive by applied field gradient pulses. The archetype protocol involves a single r.f. excitation to produce transverse magnetization, and two field gradient pulses, of duration $\delta$ and magnitude $\mathbf{g} = \nabla B_z$, separated by a time interval $\Delta$. These occur either side of a phase-inverting 180° pulse (Fig. 1(a)). Consider a collection of nuclei initially at $\mathbf{r}'$ flowing with a velocity $\mathbf{v}$. Neglecting motion during $\delta$, transverse magnetization during the first gradient pulse evolves with a frequency $\omega = \gamma(\mathbf{g}\cdot\mathbf{r}')$ (in the heterodyne detection frame, i.e. relative to the Larmor frequency $\omega_L = \gamma B_0$) and hence advances in phase by $\gamma(\mathbf{g}\cdot\mathbf{r}')\delta$. During the time interval $\Delta$, the nuclei move to a position $\mathbf{r} = \mathbf{r}' + \mathbf{v}\Delta$, so that by the end of the second gradient pulse, the magnetization has acquired a net phase $-\mathbf{v}\cdot(\gamma\delta\mathbf{g})\Delta$. More generally, the signal from a collection of nuclei, or volume element at $\mathbf{r}$ in an MR image, is given by


$$E(\mathbf{q}, \mathbf{r}, \Delta) = \int_V P(\mathbf{R}; \Delta, \mathbf{r})\, \exp(-i\,\mathbf{q}\cdot\mathbf{R})\, d^3R \qquad (1)$$

i.e. the Fourier transform of the diffusion propagator $P(\mathbf{R}; \Delta, \mathbf{r})$ (the probability of displacement by $\mathbf{R} = \mathbf{r} - \mathbf{r}'$ from $\mathbf{r}'$ during $\Delta$); here $\mathbf{q} = \gamma\delta\mathbf{g}$ is the wavevector conjugate to the space of molecular displacements $\mathbf{R}$. This motion-sensitive dimension has been called "q-space" by Callaghan et al. ((1988), (1991)(a) & (b)) to distinguish it from "k-space" (the Fourier conjugate of the image (r-space or co-ordinate space)). Variants of the above pulse sequence may be preferred in practice. The variant (Fig. 1(b)) involving the stimulated echo (Hahn (1950)) was used by us in this work.

Fig. 1. The Pulsed Field Gradient protocol: (a) using the normal spin echo; (b) using the stimulated echo. The latter is our usual choice (see Gibbs & Johnson (1991)).

For the case of Brownian (i.e. Gaussian) diffusion superposed on laminar flow, the transform (1) is (Stejskal and Tanner (1965))

$$E(\mathbf{q}, \mathbf{r}, \Delta) \simeq \exp\!\left(-\mathbf{q}\cdot\mathsf{D}(\mathbf{r})\cdot\mathbf{q}\,\Delta' + i\,\mathbf{q}\cdot\mathbf{v}(\mathbf{r})\,\Delta\right) \qquad (2)$$

where $\mathsf{D}$ is the diffusivity tensor (typically, but by no means necessarily, isotropic). This provides the basic model, a single (complex) sinusoid with Gaussian attenuation, within which data for very limited values of $\mathbf{q}$ are analysed for the local flow $\mathbf{v}(\mathbf{r})$ and diffusivity $\mathcal{D}$ (assumed isotropic hereafter). ($\Delta' = \Delta - \delta/3$ corrects $\Delta$ for non-zero $\delta$.)
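To make the model concrete, here is a minimal Python sketch (added for illustration, not part of the original paper; the parameter values are hypothetical) of the isotropic form of Eq. (2): a single complex sinusoid of frequency $v\Delta$ with Gaussian attenuation $\exp(-q^2\mathcal{D}\Delta')$, evaluated at a handful of $q$ values.

```python
import numpy as np

def pfg_signal(q, v, D, Delta, delta):
    """Isotropic PFG signal model of Eq. (2): a single complex sinusoid of
    frequency v*Delta with Gaussian attenuation exp(-q^2 * D * Delta')."""
    Delta_prime = Delta - delta / 3.0      # correction for the finite gradient pulse width
    return np.exp(-q**2 * D * Delta_prime + 1j * q * v * Delta)

# Hypothetical values, roughly in the range of the water-flow experiments below
q = np.array([0.0, 10.0, 25.0, 200.0])     # rad/cm, four non-uniform q samples
signal = pfg_signal(q, v=0.5, D=2.0e-5, Delta=0.02, delta=0.002)   # cm/s, cm^2/s, s, s
print(signal)
```

In the notation of Section 3, this is the signal that the quadrature model is fitted to, with $\omega = v\Delta$ and $\alpha^{1/2} = (\mathcal{D}\Delta')^{1/2}$.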

RAPID IMAGING BY THE ECHO-PLANAR TECHNIQUE

The echo-planar imaging (EPI) techniques pioneered by Mansfield are amongst the fastest MRI protocols. By encoding two dimensions of spatial information for each r.f. excitation, these methods permit the acquisition of complete sets of 2-D image data in short times, typically 16-100 ms for a single image. This compares favourably with other rapid MR imaging methods such as spoiled FLASH, FAST, CE-FAST or FADE, which typically require 300 ms to 10 s.

Use of EPI in combination with a phase-encoding gradient permits the acquisition of 3D images in times similar to conventional 2D MRI techniques. A third spatial dimension is phase-encoded (see e.g. Callaghan (1991)) in the manner employed for the second dimension in ordinary MRI sequences (Fig. 2).

We show later results from such EPI sequences, used with a prepended motion-encoding sequence, to provide quantitative studies of 3-D velocity fields which would otherwise require prohibitively long experimental times.

The acquired MR signal for stationary spins, neglecting nuclear magnetic relaxation, may be expressed in terms of the spin density $\rho(\mathbf{r})$ by the imaging equation

$$S(\mathbf{k}) = \int \rho(\mathbf{r})\, \exp(-i\,\mathbf{k}\cdot\mathbf{r})\, d^3r \qquad (3)$$

where $\mathbf{k}(t) = \gamma \int_0^t \mathbf{G}(t')\, dt'$ describes a trajectory through k-space determined by the modulation history of the (imaging) gradients $\mathbf{G}$. Echo-planar methods achieve their speed by modulating the gradients such that the locus, $\mathbf{k}(t)$, samples a two-dimensional region of k-space, rather than just a line or single point, during the spin echo that follows r.f. excitation.

The most commonly implemented trajectory for $\mathbf{k}(t)$ is the MBEST technique, which scans k-space forwards and backwards along a rectangular raster. A pulse sequence for performing such an acquisition is shown, together with the k-space trajectory, in Fig. 2.

Fig. 2. (a) The Echo-Planar protocol for rapid scanning of 2-D k-space. In 2-D, the phase-encoding increments of $G_{\mathrm{slice}}$ are not used. The shaped r.f. pulses select a slice in 2-D, a thick slab in 3-D. (b) The k-space trajectory scanned by $G_1$ and $G_2$. The effect of the dephasing gradients is shown by the straight dotted line, and that of the refocusing 180° pulse is shown by the dotted arc. The ideal acquisition windows are shown by the thicker parts of the solid lines, along which the complex signal was sampled uniformly at 6 μs intervals. A 64² data matrix was collected in this work.

Since EPI obtains a snapshot observation of the spatial distribution of magnetization, it may be employed in a wide variety of magnetization preparation experiments. A module which specifically encodes some parameter to be measured in the magnetization state (in the present case a PFG sequence to encode motion) is prepended to the imaging sequence; the imaging equation (3) becomes


$$S(\mathbf{k}, \mathbf{q}) = \int \rho(\mathbf{r})\, E(\mathbf{q}, \mathbf{r}, \Delta)\, \exp(-i\,\mathbf{k}\cdot\mathbf{r})\, d^3r \qquad (4)$$

with $E(\mathbf{q}, \mathbf{r}, \Delta)$ given by (2).

3. Bayesian analysis of motion-encoding for velocity and diffusion

We retain the traditional Fourier analysis for analysis of the k-space data. This may not be the best way of handling the k-space data, but it remains the usual method in MRI for analysis of the image space, and we believe that it is always wise to attempt one new thing at a time. For our flows, $\rho(\mathbf{r})$ is uniform, so the image transform (4) delivers the motion-encoded images $E(\mathbf{r}, \mathbf{q}, \Delta)$.

The q-space analysis is an exercise in parameter estimation within the model (2), with diffusion isotropy assumed. Each Cartesian component of q-space is analysed separately. We outline the developments given by Bretthorst (1988, 1990(a), (b), (c), 1991), specialised to the present case (2) with quadrature data. Our model for the real and imaginary channels (for each image pixel) is

$$f_R(q) = \sum_{j=1}^{m} C_j U_j = e^{-\alpha q^2}\left(C_1\cos\omega q - C_2\sin\omega q\right)$$

$$f_I(q) = \sum_{j=1}^{m} C_j V_j = e^{-\alpha q^2}\left(C_1\sin\omega q + C_2\cos\omega q\right) \qquad (5)$$

where the amplitudes $(C_1, C_2)$ are effectively the uninteresting amplitude and phase (not written out in (2)). The number of linearly independent signal functions in Bretthorst's theory is denoted by $m$; in our case $m = 2$. The interesting parameters here are the non-linear parameters $\omega = v_x(\mathbf{r})\Delta$ (the flowed distance, a 'frequency' in the $q_x$ domain) and the diffused distance $\alpha^{1/2} = (\mathcal{D}\Delta')^{1/2}$ (an attenuation coefficient).

For a finite noise variance $\sigma^2$, we assign a Gaussian prior; the direct probability of the data $D \equiv \{\mathbf{d}_R, \mathbf{d}_I\}$ in each channel gives a likelihood:

$$P(D \mid C, \omega, \alpha, \sigma, I) = L(Q; \sigma, I) = \left(\frac{1}{2\pi\sigma^2}\right)^{N} \exp\!\left[-\frac{Q}{2\sigma^2}\right] \qquad (6)$$

where the quadratic form $Q$ is

$$Q = \mathbf{d}_R\cdot\mathbf{d}_R + \mathbf{d}_I\cdot\mathbf{d}_I - 2C_j\left(\mathbf{d}_R\cdot U_j + \mathbf{d}_I\cdot V_j\right) + C_j C_k\, g_{jk}$$

in which summation over repeated suffices (model signal labels) is implied, and the dot product $f\cdot g \equiv \sum_{i=1}^{N} f_i g_i$ denotes a sum over the $N$ values of encoding gradient $q_i$ (not necessarily equally spaced). For quadrature models, where $U_1\cdot U_2 + V_1\cdot V_2 \equiv 0$, the interaction matrix $g_{jk} = U_j\cdot U_k + V_j\cdot V_k = \delta_{jk}/J$ (where $1/J = e^{-2\alpha q^2}\cdot\mathbf{1}$, i.e. $\sum_i e^{-2\alpha q_i^2}$) is already diagonal. Completing the squares so that the amplitudes separate, applying Bayes' theorem with a uniform prior


for the model parameters $(C_1, C_2, \omega, \alpha)$ and marginalising over the uninteresting amplitudes $(C_1, C_2)$, the joint inference for the non-linear parameters $(\omega, \alpha)$ is then

$$P(\omega, \alpha \mid D, \sigma, I) \;\propto\; J(\alpha)\,\exp\!\left[\frac{m\overline{h^2}}{2\sigma^2}\right] \qquad (7)$$

where $2N\overline{d^2} = \mathbf{d}_R\cdot\mathbf{d}_R + \mathbf{d}_I\cdot\mathbf{d}_I$ (the magnitude of the data vector) and $\overline{h^2} = (1/m)\,h_j h_j$ is the mean square projection of the data onto the models ($h_j = \sqrt{J}\,(\mathbf{d}_R\cdot U_j + \mathbf{d}_I\cdot V_j)$). Note that $m\overline{h^2}$ is a generalisation of the power-magnitude discrete Fourier transform or periodogram: $m\overline{h^2} = J(\alpha)\left|\sum_{i=1}^{N} e^{-\alpha q_i^2}\, d_i\, e^{-i\omega q_i}\right|^2$ (where $d = d_R + i d_I$), which clarifies the role of this traditional statistic in Bretthorst's theory. If the noise variance is not known prior to the acquisition of the images, we treat $\sigma$ as another nuisance parameter with the Jeffreys prior $d\sigma/\sigma$ and marginalise again, obtaining

$$P(\omega, \alpha \mid D, I) \;\propto\; J \int_0^\infty \frac{1}{\sigma^{2N+1-m}} \exp\!\left[-\frac{2N\overline{d^2} - m\overline{h^2}}{2\sigma^2}\right] d\sigma \;\propto\; J\left(1 - \frac{m\overline{h^2}}{2N\overline{d^2}}\right)^{-(2N-m)/2} \qquad (8)$$

(Bretthorst (1990)(a)). Including a noise sample is straightforward in NMR imaging; there will be many pixels in the image corresponding to empty space which will thus contain only noise. The developments of Bretthorst (1991) carry through straightforwardly and the final result for the marginal inference is

$$P(\omega, \alpha \mid D, D_\sigma, I) \;\propto\; J(\alpha)\left(1 - \frac{m\overline{h^2}}{2N\overline{d^2} + 2N_\sigma\overline{d_\sigma^2}}\right)^{-(2N + 2N_\sigma - m)/2} \qquad (9)$$

where $D_\sigma$ denotes the $N_\sigma$ (complex) samples of noise data. This result is the basis for all of the computed examples shown; it tends smoothly to the exponential result (7) where the noise sample is sufficiently large that $\sigma$ is estimated well.

4. Examples

LAMINAR NEWTONIAN FLOW IN A CIRCULAR PIPE

The methods outlined above have been applied to the laminar flow of a Newtonian fluid (doped water) in a circular pipe; further details are given by Xing et al. (1994). Example results are shown in Figures 3 and 4; Fig. 3(a) shows typical data from a single image pixel, in our non-uniform sampling scheme. The resulting joint inference for the flow velocity and local diffusivity appears in Fig. 3(b). The collected results (flow and diffusion, best value and Bayesian error estimates) are given in Fig. 4. Examination of radial velocity profiles shows some deviations, outside the Bayesian errors, from the expected Poiseuille (parabolic) profile. This is not a major surprise for flow in an extruded PMMA tube which is unlikely to be accurately circular, or for a flow situation not guaranteed to be free of entrance effects, or local convection. The Bayesian analysis (with error estimates) demonstrates the potential for reliable detection of such deviations from ideal results.



Fig. 3. Laminar flow of water in a pipe. (a) Example quadrature data from one image pixel; four values of q only. (b) Resulting joint inference for flow velocity and diffusivity.


Fig. 4. Laminar flow of water in a pipe. Flow velocity image (a) with Bayesian error surface (b); diffusivity image (c) with associated Bayesian errors (d).


NON-NEWTONIAN FLOW IN A CIRCULAR PIPE

The methods have been applied also to laminar flow of a non-Newtonian fluid, i.e. a weak solution of a Xanthan gum, a shear-thinning fluid for which a parabolic profile is not expected. Fig. 5(a) shows the original image data, for four values of q (real channel only); the resulting velocity profiles are shown in Fig. 5(b), together with the Bayesian error bars. The superposed theoretical profile assumes a power-law rheology with an index measured in an independent rheometric (Couette) flow. Again the experimental data confirm the general shape of the theoretical profile, but there are systematic deviations. Plausible reasons are the inability of the power-law model to describe the rheogram accurately over a wide range of shears, or the presence of some thixotropy. Again the error bars suggest strongly that for high accuracy predictions, the existing rheological model is inadequate, a conclusion which would be much less credible without the Bayesian error intervals.

Fig. 5. Flow of a Xanthan gum solution in a pipe. See Gibbs et al. (1994). (a) Example image data (real channel only shown) for four values of q (10.0, 25.7, 101.7 and 203.4 rad cm⁻¹). (b) Flow velocity profiles (two radii from one diameter) with Bayesian error intervals. Theoretical curve for a power-law fluid of (rheometrically determined) index 0.11 superposed.

THREE-DIMENSIONAL FLOWS IN COMPLEX DUCTS

A fully three-dimensional (steady, laminar, Re ≈ 50) flow was created in a baffled duct sketched in Fig. 6. The PFG-EPI method was applied in this case for all three Cartesian components (directions of q) in turn, with 3-D volume image acquisition in k-space (see Fig. 2). This experiment is particularly demanding on experimental time because of the high dimensionality, even when the EPI method is employed. A major objective of the present analysis was to optimise such experiments by minimising the number of q values required for acceptable flow images.

In the version of this work published by Derbyshire et al. (1994), 17 q values were used (in each direction); in the present paper we show results from our non-uniform 4-sample scheme, which are subjectively no worse, but now have objective error estimates attached to them, as well as being achievable more than four times faster.

Fig. 6. Sketch of the baffled flow phantom used to create a 3-D laminar flow field. Construction: PMMA tube and sheet.

Example sections through the 3-D volume images of flow are shown in Fig. 7, one section for each Cartesian component and a fourth for flow speed; the latter clearly shows the expected major liquid flux regions.

A diagnostic on the reliability of such results comes from estimates of the divergence field derived from the velocity vector component estimates. These are scattered about 0, as expected for an incompressible fluid. In the previous version of this work, a nearly flat divergence field was obtained, with the results normalised against a local voxel turnover rate. In Fig. 8 we show the results of a Bayesian diagnostic; all local divergence estimates have been normalised against the local Bayesian error in the estimate (rule: variances add), and the distribution of such estimates is plotted for all non-noise voxels, out of ≈ 2.5 × 10⁵. The conformity to an approximately Gaussian shape is pleasing; the width is also remarkably close to unity. Large deviations from zero divergence are rather more frequent than Gaussian; most of these come from voxels close to the walls or in other high-shear regions, where our assumption of a single flow velocity within the voxel begins to break down.
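A minimal Python sketch of this diagnostic (added for illustration; it is not the authors' code, and the array names and test data are hypothetical): given the three velocity-component images and their Bayesian error images, it forms $\nabla\cdot\mathbf{v}$ by central differences and normalises each voxel's divergence by its propagated error, using the rule that variances add.

```python
import numpy as np

def normalised_divergence(vx, vy, vz, sx, sy, sz, spacing=1.0):
    """Divergence of (vx, vy, vz) normalised by its propagated Bayesian error.
    vx, vy, vz: 3-D velocity-component images; sx, sy, sz: their error images."""
    # Central-difference divergence, one term per Cartesian direction
    div = (np.gradient(vx, spacing, axis=0)
           + np.gradient(vy, spacing, axis=1)
           + np.gradient(vz, spacing, axis=2))
    # Central differences combine neighbours as (f[i+1] - f[i-1]) / (2h), so each term's
    # variance is (var[i+1] + var[i-1]) / (2h)^2; the variances of the three terms add.
    def term_var(s, axis):
        var = s**2
        return (np.roll(var, -1, axis) + np.roll(var, 1, axis)) / (2.0 * spacing)**2
    sigma_div = np.sqrt(term_var(sx, 0) + term_var(sy, 1) + term_var(sz, 2))
    return div / sigma_div

# Hypothetical example on a small random volume
shape = (16, 16, 16)
vx, vy, vz = (np.random.randn(*shape) * 0.01 for _ in range(3))
sx = sy = sz = np.full(shape, 0.01)
z = normalised_divergence(vx, vy, vz, sx, sy, sz)
print(z.std())   # of order unity by construction, as for the real data in Fig. 8
```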

5. Conclusions

The Bayesian spectral analysis of Bretthorst has been applied with great success in a new context: parameter estimation from data sampled very sparsely and non-uniformly, heavily truncated and possibly quite noisy. The possession of a reliable model (the single complex sinusoid with Gaussian decay) for the motion-encoded MRI data enables quantitative flow imaging to be achieved, with sufficient local accuracy for most engineering purposes, from remarkably few motion-encoded images; flow velocity is a frequency parameter in q-space. Diffusivity results (from a Gaussian decay parameter) are less reliable at this level of sampling; more data are probably necessary for diffusivity imaging.

Experiments of this type formerly used traditional Fourier analysis over 17 or 18 motion-encoded images. This is shown to be unnecessary even for accurate flow imaging; our current protocol requires only 4 (complex) motion-encoded images, a four-fold improvement in acquisition speed. Full vector velocity image data can now be acquired (for steady laminar flows) over 3 spatial dimensions in approximately half an hour. In addition, the Bayesian probabilistic analysis now provides rational error bars for the estimated parameters; these are unavailable using traditional Fourier analysis. In each of our chosen examples, the error bars have enabled plausible conclusions to be drawn about details in the flows or the accuracy of the data; without them such conclusions could not have been confidently drawn.


Fig. 7. Sections through the 3-D velocity images of flow in the baffled duct, from the Bayesian analysis. All three velocity components are available; the component scales run from -0.4 to +0.4 cm/s and the speed scale from 0 to +0.4 cm/s. Inspection of the entire 3-D result (3 velocity components at each of ≈ 2.5 × 10⁵ voxels) is done using appropriate software tools.


Fig. 8. A Bayesian diagnostic on the flow results of Fig. 7. Distribution of deviations of $\nabla\cdot\mathbf{v}$ from 0 (all voxels that are not noise-only), each deviation normalised by the local Bayesian error in $\nabla\cdot\mathbf{v}$ (computed from the Bayesian errors in the velocity components of Fig. 7). The dashed curve is a unit-width Gaussian, $0.3989\exp(-x^2/2)$.


Further improvements in the vector velocity imaging are believed to be possible; analysis of each Cartesian component independently effectively introduces 6 nuisance parameters (2 for each direction) to be removed from the analysis. However, data acquisition can easily ensure that both amplitude and phase (2 parameters) are the same for all directions. A joint inference searched for all 3 velocity parameters simultaneously may yield acceptable results with yet fewer data. Further known constraints could be applied; for example, a condition of zero divergence could be imposed, rather than using the estimated divergence as a diagnostic on performance. Taken to extremes, however, this approach hybridises experimental measurement with computational fluid dynamics. Such developments should not lose sight of the original objective of providing a direct measurement of the flow fields.

ACKNOWLEDGMENTS. We thank Dr Herchel Smith for his munificent endowment of the Herchel Smith Laboratory for Medicinal Chemistry (L.D.H., T.A.C. & S.J.G.) and for research studentships (D.X. & J.A.D.). E.J.F. thanks the Royal Society (London) and the S.E.R.C. (United Kingdom) for the award of an Industrial Fellowship during 1990-1992, which enabled him to pursue some of the ideas employed in the present work, and Schlumberger Cambridge Research for leave of absence over the same period. The suggestions for further development of the method are the result of valuable discussions with G. Larry Bretthorst at the Cambridge meeting.

REFERENCES

Bretthorst, G.L. (1988). Bayesian Spectrum Analysis and Parameter Estimation, Lecture Notes in Statistics, 48, Springer-Verlag, New York.


Bretthorst, G.L. (1990)(a). J. Magn. Reson., 88, 533-551.
Bretthorst, G.L. (1990)(b). J. Magn. Reson., 88, 552-570.
Bretthorst, G.L. (1990)(c). J. Magn. Reson., 88, 571-595.
Bretthorst, G.L. (1991). J. Magn. Reson., 93, 369-394.
Bretthorst, G.L. (1992). J. Magn. Reson., 98, 501-523.


Callaghan, P.T. (1991). Principles of Nuclear Magnetic Resonance Microscopy. Clarendon Press, Oxford, U.K.

Callaghan, P.T., C.D. Eccles & Y. Xia (1988). J. Phys. E: Sci. Instrum., 21, 820-822.
Callaghan, P.T., A. Coy, D. MacGowan, K.J. Packer & F.O. Zelaya (1991)(a). Nature, 351, 467-469.
Callaghan, P.T. & Y. Xia (1991)(b). J. Magn. Reson., 91, 326-352.
Derbyshire, J.A., S.J. Gibbs, T.A. Carpenter & L.D. Hall (1994). A.I.Ch.E. Jnl., 40, 8, 1404-1407.
Fordham, E.J., L.D. Hall, T.S. Ramakrishnan, M.R. Sharpe & C. Hall (1993). A.I.Ch.E. Jnl., 39, 9, 1431-1443.
Gibbs, S.J. & C.S. Johnson, Jr. (1991). J. Magn. Reson., 93, 395-402.
Gibbs, S.J., D. Xing, S. Ablett, I.D. Evans, W. Frith, D.E. Haycock, T.A. Carpenter & L.D. Hall (1994). J. Rheology, in press.
Hahn, E.L. (1950). Phys. Rev., 80, 4, 580-594.
Hahn, E.L. (1960). J. Geophys. Res., 65, 2, 776-777.
Lauterbur, P.C. (1973). Nature, 242, 190-191.
Mansfield, P. & P.K. Grannell (1973). J. Phys. C, 6, L422.
Mansfield, P. & P.G. Morris (1982). Adv. Magn. Reson., Suppl. 2, Academic Press.
Singer, J.R. (1960). J. Appl. Phys., 31, 125-127.
Stejskal, E.O. & J.E. Tanner (1965). J. Chem. Phys., 42, 1, 288-292.
Xing, D., S.J. Gibbs, J.A. Derbyshire, E.J. Fordham, T.A. Carpenter & L.D. Hall (1994). J. Magn. Reson., in press.


BAYESIAN ESTIMATION OF MR IMAGES FROM INCOMPLETE RAW DATA

G.J. Marseille, R. de Beer, M. Fuderer#, A.F. Mehlkopf, D. van Ormondt
Delft University of Technology, Applied Physics Laboratory, P.O. Box 5046, 2600 GA Delft, The Netherlands
# Philips Medical Systems, Best, The Netherlands. [email protected]

Keywords. Magnetic Resonance, Scan Time Reduction, Optimal Non-Uniform Sampling, Bayesian Estimation, Image Reconstruction

ABSTRACT. This work concerns reduction of the MRI scan time through optimal sampling. We derive optimal sample positions from Cramer-Rao theory. These positions are nonuniformly distributed, which hampers Fourier transformation to the image domain. With the aid of Bayesian formalism we estimate an image that satisfies prior knowledge while its inverse Fourier transform is compatible with the acquired samples. The new technique is applied successfully to a real-world MRI scan of a human brain.

1. Introduction

The raw data of a Magnetic Resonance Imaging (MRI) scanner are sampled in the two-dimensional k-space. $\mathbf{k}$ is proportional to a magnetic field gradient vector which is incremented stepwise. Typically, 256×256 complex-valued samples are acquired, i.e., one for each pair $(k_x, k_y)$. After proper phasing of the data, the real part of the 2-D Fourier transformed (FFT) data constitutes the desired image [1]. See Figure 1.

The objective of this work is to reduce the scan time of MRI. One way to achieve this is to simply truncate data acquisition prematurely. Since truncation causes ringing in the Fourier domain, the missing data should be estimated, e.g. by linear prediction [2]. Our new approach improves on this in two ways:

1. We classify sample positions in terms of their information yield, using Cramer-Rao theory [3]. This enables us to omit only those sample positions classified as least informative, rather than indiscriminately omitting only samples at the end. See Figure 2.

2. Instead of applying linear prediction, we estimate ('reconstruct') an optimal image in the Fourier domain while keeping the inverse FFT of the image compatible with the incomplete raw data. Optimality is achieved by invoking the following empirical prior knowledge, established for real-world images [4]: the probability density function of differences of grey values of adjacent image pixels (edges) has Lorentzian shape. See Figure 3. In order to facilitate handling of prior knowledge and noise, we use Bayesian estimation.

In the following we treat the derivation of an informative sample distribution from Cramer-Rao bounds, and Bayesian estimation of an optimal image. Subsequently, we apply the new approach to the real-world data of Figure 1a.


J. Skilling and S. Sibisi (eds.), Maximum Entropy and Bayesian Methods, 13-22. © 1996 Kluwer Academic Publishers.


Figure 1: a) Absolute value of the real part of the raw data of a complete turbo spin echo scan of a human head; $k_x, k_y = -128, -127, \ldots, 127$. b) Real part of the Fourier transform of the phased complete raw data. Scan time can be saved by judiciously omitting columns in Figure 1a.


Figure 2: Right halves of five alternative symmetric sample position distributions ($k_y = 0, \ldots, 128$). a) complete, b) truncated, c) exponential, d) derived from Cramer-Rao, e) adapted from d.

2. Derivation of an informative sample distribution

Before proceeding, we mention that the two dimensions of the k-space have different properties. Sampling in the $k_x$-dimension is fast, rendering omission of sample positions senseless. In the $k_y$-dimension, on the other hand, sampling is slow and therefore omission of sample positions does save scan time. Under the circumstances the problem is one-dimensional. The dimension of interest, $k_y$, is depicted horizontally. Note that the omitted $k_y$ values are the same for each $k_x$; in other words, entire columns are omitted.


Figure 3: Probability density of differences of grey values (image intensities) of adjacent pixels (edges) of the image of Figure 1b. The dotted line is a least squares fit of a Lorentzian function.


We consider the sample distributions a, b, c, d, e depicted in figure 2. The possible sample positions are constrained to a uniform grid. For practical purposes, the latter grid coincides with that obtained from inverse Fourier transform (IFFT) of the image. Distribution a represents a complete scan. Distributions b, c, d, e each have 30% fewer samples and therefore require 30% less scan time. Of these, b constitutes the conventional way of saving time. It truncates the last part of the scan abruptly, which entails loss of information about objects with sharp edges [2]. Such loss can be alleviated by spreading the last samples in an exponential manner, according to distribution c [3]. A yet more informative sample distribution can be derived in a systematic way [3], from Cramer-Rao theory [5]. This is treated in the next paragraphs. To the best of our knowledge, the approach is new. Distribution e is treated briefly in section 4.

When applying Cramer-Rao theory to images, one has to devise a model capable of representing the object in some desired way. Our objective is to establish a sample distribution that limits loss of information about sharp edges. Consequently, we seek a shape possessing such edges, but which at the same time is not too specific for any class of real-world in vivo objects. So far, our choice is a symmetric 1-D trapezium in the y-space, with sharp edges, as depicted in figure 4. The width of the trapezium at the top is $2w$, and the width of the edges is $\Delta w$. The MRI signal in the $k_y$ space, $S_{k_y}$, is real-valued because of symmetry, i.e.

$$S_{k_y} = \frac{c}{\pi k_y^2\,\Delta w}\left\{\cos(\pi k_y w) - \cos\!\big(\pi k_y (w + \Delta w)\big)\right\} \qquad (1)$$

in which c is the amplitude.

Figure 4: Trapezium-shaped model (top width $2w$, edge width $\Delta w$) for deriving an optimal sample distribution, using Cramer-Rao theory.

Cramer-Rao theory enables one to derive lower bounds on the standard deviations of the model parameters, $c$, $w$, $\Delta w$, estimated by fitting the function of Eq. (1) to the data [5]. These lower bounds depend on the signal-to-noise ratio (SNR) and the sample positions on the $k_y$-axis. This property can be used to compare the information yield of alternative sample distributions. We define the distribution that minimizes the sum of variances of the three parameters as the most informative one. In principle, one has to perform an exhaustive search over all possible ways to omit 30% of the sample positions shown in figure 2a. This requires much computation. However, given the fact that the trapezium is only an approximation of the actual object, a faster search method could be devised [6]. Using $w = 0.76$ and $\Delta w = 0.05$, we arrived at the informative distribution d depicted in figure 2.
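To make this comparison concrete, the following Python sketch (illustrative only, not the authors' code; the noise level and the two candidate distributions are hypothetical) builds the Fisher information matrix for the trapezium model of Eq. (1) from numerical parameter derivatives at a given set of $k_y$ positions, and scores a distribution by the sum of the three Cramer-Rao variance bounds.

```python
import numpy as np

def trapezium_signal(ky, c, w, dw):
    """MRI signal of Eq. (1) for a symmetric 1-D trapezium (top width 2w, edge width dw)."""
    ky = np.asarray(ky, dtype=float)
    with np.errstate(divide='ignore', invalid='ignore'):
        s = c / (np.pi * ky**2 * dw) * (np.cos(np.pi * ky * w)
                                        - np.cos(np.pi * ky * (w + dw)))
    # limit of Eq. (1) as ky -> 0
    return np.where(ky == 0.0, c * np.pi * (2 * w + dw) / 2.0, s)

def crlb_sum(ky, c=1.0, w=0.76, dw=0.05, sigma=0.01, eps=1e-6):
    """Sum of Cramer-Rao variance bounds for (c, w, dw) at sample positions ky,
    assuming white Gaussian noise of standard deviation sigma."""
    theta = np.array([c, w, dw])
    derivs = []
    for i in range(3):                       # numerical derivative w.r.t. each parameter
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps; tm[i] -= eps
        derivs.append((trapezium_signal(ky, *tp) - trapezium_signal(ky, *tm)) / (2 * eps))
    Dmat = np.stack(derivs, axis=1)          # N x 3 sensitivity matrix
    fisher = Dmat.T @ Dmat / sigma**2
    return np.trace(np.linalg.inv(fisher))   # sum of the three CRLB variances

# Hypothetical comparison: keep 90 of 129 positions (about 30% omitted)
full = np.arange(129.0)
truncated = full[:90]                        # plain truncation, like distribution "b"
spread = np.round(np.linspace(0, 128, 90))   # a crude spread-out alternative
print(crlb_sum(truncated), crlb_sum(spread))
```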

We emphasize that the sample distribution just derived pertains to a model with only a few parameters. Yet, our real-world MRI application rather requires estimation of a large number of Fourier coefficients. Moreover, since sample points have been omitted, the number of unknown Fourier coefficients exceeds the number of data. Consequently, additional information about the image is needed to regularize the problem. One possible form of regularization is to impose that the entropy of the image be maximal [7], [8]. Instead, we use the empirical finding that the histogram of differences of grey values of neighbouring pixels (edges) has approximately Lorentzian shape [4], as already mentioned in the introduction.

Finally, it should be mentioned that Cao and Levin have recently investigated alternative nonuniform sample distributions, derived from prior knowledge contained in a training set of high resolution images of a relevant body part in a number of subjects [9], [10], [11].

The next section deals with a new solution for estimating omitted samples.


3. Image estimation from incomplete k-space data

The acquired samples in $(k_x, k_y)$-space are arranged in a data matrix such that $k_x$ varies columnwise and $k_y$ rowwise. Since sampling is complete in the $k_x$-space, Fourier transformation (FFT) can be applied to all acquired columns without incurring ringing artefacts. Inverse Fourier transformation (IFFT) from $x$ to $k_x$ is not required at any later stage. Rowwise, the data are still considered as samples. The omitted samples discussed in the previous section give rise to missing columns. Initially, these columns are filled with zeros. Subsequently, the estimation is carried out for each row separately, involving only the $k_y$- and $y$-space. An image is sought using the above-mentioned Lorentzian distribution, while its IFFT from $y$- to $k_y$-space must be compatible with the acquired samples. Compatibility may be interpreted as exact equality, or as equality 'within the noise band'. A Bayesian [12] approach appears well-suited to concurrent handling of both the Lorentzian distribution and the noise band. We write

$$p(I \mid S) = \frac{p(S \mid I)\, p(I)}{p(S)} \qquad (2)$$

in which $I$ is a row of the image in the $y$-space and $S$ is the attendant sample row in the $k_y$-space. $p(I \mid S)$ is the probability of an image given the samples. $p(S \mid I)$, the probability of the samples given an image, relates to the noise distribution. $p(I)$ comprises prior knowledge about the image, such as the Lorentzian distribution. $p(S)$ is just a scaling constant once the samples have been acquired. The task is to find the image that maximizes $p(I \mid S)$.

Assuming Gaussian measurement noise with zero mean and standard deviation $\sigma$, and a unitary (I)FFT matrix, one finds for each row of the image

$$p(S \mid I) \propto \prod_j \exp\!\left(-\frac{\left|S_j - (\mathrm{IFFT}\, I)_j\right|^2}{2\sigma^2}\right) \qquad (3)$$

the index $j$ pertaining to acquired samples. The prior knowledge term $p(I)$ can be split into two parts, one for the actual object $O$, and one for the background $B$ beyond the perimeter of $O$. After phasing and retaining only the real part of $I$, we write for the object image $I_O$

$$p(I_O) \propto \prod_{k\in O} p(I_k \mid I_{k-1}) \qquad (4)$$

with $p(I_k \mid I_{k-1}) \propto \left[(I_k - I_{k-1})^2 + a^2\right]^{-1}$, which has the Lorentzian shape alluded to above, $2a$ being the width at half height. Here we have used the probability of simultaneous appearance of the events $\{I_k, I_{k+1}, \ldots, I_{k+l}\}$, namely

$$p(I_k, I_{k+1}, \ldots, I_{k+l}) = p(I_k)\prod_{j=k+1}^{k+l} p(I_j \mid I_{j-1}) \qquad (5)$$

and the already mentioned empirical prior knowledge on adjacent pixels. For the background we write

$$p(I_B) \propto \prod_{j\in B} \exp\!\left(-\frac{I_j^2}{2\sigma^2}\right) \qquad (6)$$

Furthermore, $p(I) = p(I_O)\,p(I_B)$. One may choose to ignore knowledge of object boundaries. This extends Eq. (4) to the entire image and obviates Eq. (6). In actual practice, we optimize the natural logarithm of $p(I \mid S)$, choosing relative weights $\alpha$, $\beta$ and $\gamma$ for the contributing terms $\ln p(S \mid I)$, $\ln p(I_O)$, $\ln p(I_B)$, respectively. This is an iterative process, based on conjugate gradients. Various weights can be used, depending on the application. At the SNR of the images currently investigated by us, the results appear best for $\alpha = \infty$ (implying that measured samples may not be changed), $\beta = 0.7$, $\gamma = 0.3$. This aspect needs further investigation.
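The structure of this objective is easy to state in code. The sketch below (an illustration, not the authors' implementation; the function and argument names are invented, and a finite data weight is used in place of the hard constraint $\alpha = \infty$) assembles the weighted negative log posterior for one real-valued image row: a Gaussian data term over the acquired $k_y$ samples, the Lorentzian edge prior of Eq. (4) over the object pixels, and the Gaussian background term of Eq. (6).

```python
import numpy as np

def neg_log_posterior(I_row, S_acq, acq_idx, object_mask, sigma, a,
                      alpha=1.0, beta=0.7, gamma=0.3):
    """Weighted negative log posterior for one real-valued image row I_row.
    S_acq: acquired complex k-space samples located at positions acq_idx;
    object_mask: boolean array marking the object pixels of the row."""
    # Data term: unitary IFFT of the row, compared with the acquired samples only
    S_model = np.fft.ifft(I_row, norm="ortho")
    data_term = np.sum(np.abs(S_model[acq_idx] - S_acq)**2) / (2 * sigma**2)
    # Lorentzian edge prior (Eq. 4): differences between adjacent pixels, taken
    # over pairs whose right-hand pixel lies inside the object
    edges = np.diff(I_row)[object_mask[1:]]
    edge_term = np.sum(np.log(edges**2 + a**2))
    # Gaussian background prior (Eq. 6) outside the object
    background_term = np.sum(I_row[~object_mask]**2) / (2 * sigma**2)
    return alpha * data_term + beta * edge_term + gamma * background_term
```

In the paper the measured samples are held exactly fixed ($\alpha = \infty$) and the objective is minimised row by row by conjugate gradients; with the finite $\alpha$ used here, the whole expression can instead be handed to any generic minimiser.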

4. Results

4.1. Performance measure

The most obvious way to judge the success (performance) of image estimation from raw data is to peruse the image on the screen of a workstation. However, this method is subjective. Consequently, for the sake of reporting there is a need for a simple, objective, numerical measure of the performance. We use the following functional to measure the performance,

$$P_u = 1 - \frac{\|I_a - I_u\|}{\|I_a - I_{b_0}\|} \qquad (7)$$

where $I_u$ is the real part of a 256×256 phased image, the index $u$ standing for the sample distributions b, c, d, e, as defined in figure 2. An additional 0 in an index indicates zero-filling of omitted samples. In the absence of a 0, the Bayesian estimation is implied. $I_a$ and $I_{b_0}$ are the 'high quality' and 'low quality' reference images respectively. The latter is obtained through mere FFT. The symbol $\|\cdot\|$ stands for the Frobenius norm of a matrix, which equals the square root of the sum of squares of all elements.
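A couple of lines make the measure concrete; this Python sketch (illustrative, with hypothetical test arrays) computes Eq. (7) from the high-quality reference, the zero-filled low-quality reference and the image under test, using the Frobenius norm.

```python
import numpy as np

def performance(I_a, I_u, I_lowq):
    """Eq. (7): 1 - ||I_a - I_u||_F / ||I_a - I_lowq||_F (Frobenius norms)."""
    return 1.0 - np.linalg.norm(I_a - I_u) / np.linalg.norm(I_a - I_lowq)

# Hypothetical 256x256 test images
I_a = np.random.randn(256, 256)                   # 'high quality' reference (complete scan)
I_lowq = I_a + 0.3 * np.random.randn(256, 256)    # zero-filled 'low quality' reference
I_u = I_a + 0.1 * np.random.randn(256, 256)       # estimate under test
print(performance(I_a, I_u, I_lowq))              # close to 1 - 0.1/0.3 = 0.67 here
```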

Note that the performance definition works only for test experiments where the complete scan, yielding $I_a$, is available. For the purpose of testing, one can remove columns from the data matrix at will.

The maximum performance is 1. This is reached when the Bayesian estimate $I_u$ equals the reference image $I_a$. In practice, such can happen only in the case of high-SNR simulated signals. When $\|I_a - I_u\|$ equals $\|I_a - I_{b_0}\|$, the performance is zero. Even negative performances may occur. The latter need not indicate that the estimation has failed; ringing is always substantially reduced. A negative performance indicates flattening of fine structure.

When $I_a$ is flawed by motion of the object or when the SNR is low, the performance definition is inadequate.


4.2. Performances of alternative scan strategies

Applying the Bayesian estimation method to incomplete scans b, c, and e derived from the turbo spin echo scan shown in figure 1a, we obtained the results listed in Table 1. Distribution d is not suitable since the data in question require phasing; see the footnote to Table 1. Furthermore, at the given SNR (= 25), $\alpha$ was put to $\infty$, implying that measured samples may not be changed. Knowledge of the boundaries of the object, obtained from edge detection, was invoked using $\gamma = 0.3$.

Table 1. Performances and average number of iterations per row. Incomplete raw data were derived from figure 1a by omitting columns according to the distributions of figure 2. $\alpha = \infty$, $\beta = 0.7$, $\gamma = 0.3$.

sample distribution*          b          c          e
performance                 -0.2212     0.1936     0.5514
average no. of iterations    21         15          8.5

* Sample distribution d was not used because phasing requires uniform sampling [13] in the range $k_y = -32, \ldots, 32$. Distribution e was derived from d empirically, by studying the performance for several changes of the omitted sample positions.

The performance of distribution e, which was derived from the optimal nonuniform distribution d, is by far the highest. Although distribution e is probably only suboptimal for the real-world object at hand, this result supports the theory of section 2. Traditional uniform sampling and truncation, b, clearly entails significant loss of information. Interestingly, the other nonuniform distribution, c, also supersedes b. The latter result can be understood by considering that interpolation is usually more reliable than extrapolation. Note that estimating $I_b$ amounts to extrapolation, whereas estimating $I_c$ amounts to interpolation, as can be seen in figure 2.

Another indication of the effectiveness of alternative sample distributions can be gleaned from the number of iterations needed to reach convergence. The numbers quoted in Table 1 are averages of the numbers for the separate rows of the data matrix. For distribution e, convergence appeared about twice as fast as for c, and even 2.5 times as fast as for b. We conclude from this that the optimization effort can be reduced by choosing informative samples.

The actual images $I_{b_0}$ and $I_e$ are shown in figure 5. Ringing as a result of zero-filling is clearly evident in $I_{b_0}$. Although the performance of b is negative, ringing is significantly reduced (not shown). More on this in the next subsection.

4.3. Residue images

Yet another way of judging the relative merits of alternative sample distributions is to peruse residue images, $I_a - I_u$. In a residue image, scaling is related to the amplitude of the ringing rather than to the maximum intensity of the associated image $I_u$. As a result, visibility of artefacts is strongly enhanced. Figure 6 shows the result for $u = b_0, e$, using the same scale for both. It can be seen that the iterative Bayesian image estimation reduces ringing substantially.


Figure 5: Images estimated from samples derived from the complete scan shown in figure 1a. a) $I_{b_0}$, obtained by Fourier transform of the zero-filled distribution b; b) $I_e$, obtained by Bayesian estimation from distribution e.


Figure 6: Residue images $I_a - I_u$, for a) $u = b_0$, b) $u = e$.

5. Conclusions

Summarizing, the results enable us to make the following claims:

• The Cramer-Rao bounds pertaining to a simple k-space model function, devoid of specific anatomical information about the object, yield superior sample positions for MRI scan time reduction.

• Ringing attendant on scan time reduction can be strongly reduced by Bayesian estimation, using a prior based on the Lorentzian shape of the 'edge' histogram.

Acknowledgment

This work is supported by Stichting Technische Wetenschappen, STW (project DTN 11.2507).

References

[1] F. Wehrli, The Origins and Future of Nuclear Magnetic Resonance Imaging, Physics Today, 34-42 (1992).

[2] P. Barone and G. Sebastiani, A New Method of Magnetic Resonance Image Reconstruction with Short Acquisition Time and Truncation Artifact Reduction, IEEE Trans. Med. Imag., 11, 250-259 (1992).

[3] G.J. Marseille, M. Fuderer, R. de Beer, A.F. Mehlkopf, D. van Ormondt, Reduction of MRI Scan Time through Non-Uniform Sampling and Edge Distribution Modelling, J. Magn. Reson., B103, 192-196 (1994).


[4] M. Fuderer, Ringing Artefact Reduction by an Efficient Likelihood Improvement Method, Proc. SPIE, 1137, 84-90 (1989).

[5] A. van den Bos, Parameter Estimation, Chapter 8 in: Handbook of Measurement Science, Vol. 1, P.H. Sydenham Ed., Wiley, London (1982).

[6] G.J. Marseille, R. de Beer, A.F. Mehlkopf, D. van Ormondt, Optimization of Sample Positions in Magnetic Resonance Spectroscopy, Proc. ProRISC/IEEE Workshop on Circuits, Systems, and Signal Processing, pp. 233-238, J.P. Veen and M.J. de Ket, Eds., Utrecht, STW (1993).

[7] R.T. Constable and R. Henkelman, Why MEM Does Not Work in MR Image Reconstruction, Magn. Reson. Med., 14, 12-25 (1990).

[8] Moran, Observations on Maximum Entropy Processing of MR Images, Magn. Reson. Imag., 9, 213-221 (1991).

[9] Y. Cao and D.N. Levin, Feature-Guided Acquisition and Reconstruction of MR Images, In: IPMI 93, Lecture Notes in Computer Science 687, 278-292 (1993).

[10] Y. Cao and D.N. Levin, In: SPIE, Vol. 2167, 258-270 (1994).

[11] Y. Cao and D.N. Levin, Feature-Recognizing MRI, Magn. Reson. Med., 30, 305-317 (1993).

[12] J.P. Norton, An Introduction to Identification, Academic Press, London (1986).

[13] G. McGibney, M.R. Smith, S.T. Nichols, A. Crawley, Quantitative Evaluation of Several Partial Fourier Reconstruction Algorithms Used in MRI, Magn. Reson. Med., 30, 51-59 (1993).


QUANTIFIED MAXIMUM ENTROPY AND BIOLOGICAL EPR SPECTRA

S. M. Glidewell, B. A. Goodman and J. Skilling†
Scottish Crop Research Institute, Invergowrie, Dundee, DD2 5DA.
†University of Cambridge, Cavendish Laboratory, Madingley Road, Cambridge, CB3 0HE.

ABSTRACT.

This work describes the use of a quantified maximum entropy method for the optimisation of analytical information from EPR spectra using both single and composite point spread functions. Quantified maximum entropy reconstruction of the complex multiline EPR spectrum of the perinaphthyl radical allows the accurate determination of the two hyperfine splittings. The approach is then used on a system involving unknown radical species formed by crushing lettuce in the presence of a spin trap. This reconstruction reveals the presence of at least two radicals and the parameters derived suggest that one of them is a hydroxyl radical adduct - a result not obtainable by direct inspection.

1. INTRODUCTION

Many biologically important free radicals have very short lifetimes and do not accumulate in tissues to levels which are readily detectable directly. In consequence, their presence is either determined indirectly by assay of stable end-products of radical reactions or by EPR spectroscopic detection of adducts with spin traps. This technique of spin-trapping has proved valuable in the study of reactions involving biological fluids and tissues, but its usefulness is often limited by the low levels of adducts which are formed and the requirement for small amounts of aqueous material in the cavity of the spectrometer. This work describes the use of a quantified maximum entropy method for the optimisation of analytical information from EPR spectra. The ability to use composite PSFs lends further refinement, either for the disentangling of multiline spectra or for the determination of the intensity of weak spectra of known radicals, a problem often encountered in biological EPR spectroscopy.

1.1. Electron Paramagnetic Resonance

Most molecules have even numbers of electrons and these energetically prefer to exist in a paired state. Free radicals and certain transition metals have one or more unpaired electrons; this confers a magnetic moment on the molecule or ion and makes it paramagnetic. In simple terms, the magnetic moments of paramagnetic species placed in a strong external magnetic field tend to align parallel to the applied field. They may be excited to the antiparallel alignment by the absorption of electromagnetic energy. The energy difference between the levels is given by the equation:

ΔE = g μ_B B

where B is the magnetic field strength, μ_B the Bohr magneton and g the g-factor. For a magnetic field of about 0.35 T, the resonance frequency is around 9 GHz, in the microwave region. Electron paramagnetic resonance (EPR) or electron spin resonance (ESR) spectra are normally recorded by placing the sample (usually in a quartz container) between the poles of a magnet and sweeping the magnetic field at a fixed microwave frequency. The value of g is a function of the magnetic environment of the unpaired electron in the sample molecule. For organic radicals, g ≈ 2.00, close to the value for the free electron. The resonance signal may be split by the presence of adjacent magnetic nuclei such as 1H, 14N

or 13C. EPR spectra are normally recorded as first-derivative spectra, Fig 1b, rather than the more usually encountered absorption, Fig 1a.
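As a quick numerical check of this resonance condition, the short sketch below (Python; the constants are hard-coded, and the field value of 0.35 T and g = 2.00 are merely illustrative, not taken from this paper) evaluates ν = g μ_B B / h:

MU_B = 9.274e-24      # Bohr magneton in J/T
H_PLANCK = 6.626e-34  # Planck constant in J s

def resonance_frequency(B_tesla, g=2.00):
    """EPR resonance frequency (Hz) from Delta E = g * mu_B * B = h * nu."""
    return g * MU_B * B_tesla / H_PLANCK

print(resonance_frequency(0.35))  # roughly 9.8e9 Hz, i.e. about 10 GHz (X-band)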


Figure 1: a) Lorentzian absorption spectrum b) First derivative of (a)

1.2. Maximum Entropy

All experimental data are distorted from their real values by instrumental and other factors which blur or spread the data about their true values. In order to reconstruct the most probable set of true values from the experimental data, an initial estimate of the distortion or spread must be made; this is called the point spread function (PSF) which is the prior in Bayes' equation:

inference = (prior × likelihood) / evidence

Pr(f | D, H) = Pr(f | H) × Pr(D | f, H) / Pr(D | H)

For EPR spectroscopy, the PSF takes the form of the first differential of a lineshape which can be a mixture of Lorentzian, Gaussian, winged and square wave lineshapes; this is estimated initially from an isolated peak in the spectrum. Trials are then run, varying the linewidth and lineshape profile until the value of the evidence is maximised. The MaxEnt result is a series of peaks which represent both the position and the intensity of the peaks in the data and their associated uncertainties. These are most clearly represented as a Spike Plot With Errors (see Fig. 2) in which the width of the lines represents their positional uncertainties and their height represents their intensities. The convolution of the MaxEnt result with the PSF gives the Mock Data which represent the noise-free spectrum. Subtraction of the Mock Data from the original data should therefore give only noise if the reconstruction is good.
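The mock-data check described above is easy to reproduce numerically. The sketch below (Python, NumPy) is a minimal illustration, assuming a discrete spike vector and a first-derivative Lorentzian PSF with made-up positions, intensities and widths; it is not the MemSys5 code, only the convolution-and-residual idea:

import numpy as np

def derivative_lorentzian(x, width):
    # First derivative of a Lorentzian absorption line, used here as the PSF.
    return -2.0 * x * width / (x**2 + width**2) ** 2

x = np.linspace(-5.0, 5.0, 1001)          # field axis, arbitrary units
psf = derivative_lorentzian(x, width=0.3)

spikes = np.zeros_like(x)                 # "spike plot": line positions and intensities
spikes[[300, 500, 700]] = [0.5, 1.0, 0.7]

mock = np.convolve(spikes, psf, mode="same")     # mock data = spikes convolved with PSF
data = mock + 0.02 * np.random.randn(x.size)     # simulated noisy spectrum
residual = data - mock                           # should contain only noise if the fit is good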

1.3. Spin Trapping

Free radicals are important in all biological systems, as part of normal metabolism and in pathological processes, both offensive and defensive. Although some radicals such as oxygen


Figure 2: MaxEnt results for the EPR spectrum of 4-hydroxy-TEMPO (panels, top to bottom: original spectrum; MaxEnt result; spike plot with errors; scale bar 0.5 mT).

and nitric oxide are stable, many of those involved in biology are short-lived and do not accumulate in tissue to levels directly observable by EPR spectroscopy. This problem can be ameliorated by stabilising such radicals by spin trapping. The spin trap is a molecule which forms a more stable radical adduct with the radical of interest; its EPR parameters can give an indication of the nature of the original radical species.

In the example in Fig 3a, the free radical EPR signal will be split into a 1:1:1 triplet by the 14N and further into doublets by the adjacent proton, Ha, and possibly further still by the o-ring protons and any magnetic nuclei in R. The bottom 12 lines (Fig 3b) each have only 1/12 of the intensity of the singlet. Mechanical damage to plant tissue generates free radicals. The radicals produced by the maceration of lettuce were investigated by EPR spectroscopy.

2. MATERIAL AND METHODS

Perinaphthyl was used as supplied by Bruker. Spin trapping experiments were carried out by crushing shreds of fresh lettuce (Lactuca sativa L.) leaf with a glass rod in a 500 mM solution of α-(4-pyridyl-1-oxide)-N-t-butylnitrone (POBN) (Sigma). The supernatant liquor was transferred to a quartz flat cell and the EPR spectra recorded at room temperature on a Bruker ESP 300E spectrometer. Instrumental parameters were: sweep width 10 mT; microwave power 0.01 mW; modulation frequency 100 kHz; modulation amplitude 0.014 mT;


Figure 3: Formation of POBN adduct of a radical R and its diagrammatic EPR spectrum

conversion time 2000 sec; time constant 0.01 msec for perinaphthyl; and sweep width 0.9 mT; microwave power 5 mW; modulation frequency 100 kHz; modulation amplitude 0.016 mT; conversion time 164 msec; time constant 20 msec for the POBN adducts. The maximum entropy calculations were carried out using the MemSys5 algorithm and MaxInt graphical interface as supplied by MSL Ltd, 8 The Pits, Isleham, Ely, UK.

3. RESULTS AND DISCUSSION

Figure 4a shows the EPR spectrum of the perinaphthyl radical (Fig. 5a). This has one set of 3 equivalent protons, a, and another set of 6 equivalent protons, b. The EPR spectrum will therefore be a 1:3:3:1 quartet of 1:6:15:20:15:6:1 septets or vice versa, depending on the relative magnitude of the hyperfine splittings by the two sets of protons. A single PSF (Fig. 4b insert) was optimised by varying the linewidth and shape until the evidence was maximised; the resulting spike plot with errors is Fig. 4b. To facilitate the unravelling of the multiplets, the calculated positions and relative intensities thus obtained were used to construct a composite PSF corresponding to the quartets. This is shown in the insert of Fig. 4c. The resulting spike plot with errors (Fig. 4c) is a septet of the expected relative intensities. Using a PSF constructed with these separations and relative intensities, the quartet is revealed (Fig. 4d), and combining both multiplets in a PSF which is a septet of quartets, a single line results (Fig. 4e). These spike plots are shown all to the same scale in Fig. 5, which emphasises the gain in sensitivity obtained by the use of the composite PSF. A single EPR scan has revealed the splittings as aHa = 0.182 ± 0.002 mT and aHb = 0.630 ± 0.002 mT.
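The construction of a composite PSF from a single-line PSF can be sketched as a weighted sum of shifted copies. The following Python fragment is only an illustration of that bookkeeping (a Gaussian stands in for the optimised single-line shape; the splittings and the 1:3:3:1 / 1:6:15:20:15:6:1 weights are those quoted above):

import numpy as np

def composite_psf(base, x, splitting, intensities):
    """Sum shifted copies of `base` at multiples of `splitting`, weighted by `intensities`."""
    dx = x[1] - x[0]
    offsets = (np.arange(len(intensities)) - (len(intensities) - 1) / 2.0) * splitting
    comp = np.zeros_like(base)
    for w, off in zip(intensities, offsets):
        comp += w * np.roll(base, int(round(off / dx)))
    return comp / np.sum(intensities)

x = np.linspace(-4.0, 4.0, 4001)                    # field axis in mT
base = np.exp(-0.5 * (x / 0.02) ** 2)               # stand-in for the single-line PSF
quartet = composite_psf(base, x, 0.182, [1, 3, 3, 1])            # 3 equivalent protons
septet_of_quartets = composite_psf(quartet, x, 0.630,
                                   [1, 6, 15, 20, 15, 6, 1])     # 6 equivalent protons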

The low field two-thirds of the EPR spectrum of the POBN adduct from crushed lettuce is shown in Figure 6a. This rather noisy spectrum is typical of spin trapping experiments on biological material in which the radical concentration is low, the lifetime is limited (although greatly enhanced over that of the initial radicals) and the sample is in aqueous medium which necessitates the use of small amounts of sample. An initial PSF was selected by differentiation of a parametric curve which was fitted visually to the integral of the outer


Figure 4: (Left) Perinaphthyl Result

Figure 5: (Right) As Figure 4 but all to the same scale


halves of the outermost peaks. The values of linewidth and lineshape were then refined until the evidence was maximised; the final Spike Plot with Errors is shown in Fig 6b. It is apparent by inspection that there are at least two species present, as marked on the spike plot. The data were separated by subtracting the appropriate sections of the mock data from the original data and then treated separately (Fig 6c,e).


Figure 6: MaxEnt results from EPR spectra of POBN adducts from crushed lettuce

The data in Fig 6c were reconstructed using, successively, a singlet PSF (Fig 7a), a 0.051 mT doublet (Fig 7b), a 0.515 mT doublet of 0.051 mT doublets (Fig 7c) and finally, a 1.492 mT 1:1 doublet of the previous multiplets was used to reduce the spike plot to a single line (Fig 7d). The data in Fig 6e were similarly treated: in this case, the first composite PSF was a 0.042 mT 1:2:1 triplet (Fig 8b), then a 0.183 mT doublet of triplets and finally,


the 1.494 mT doublet of these multiplets.

Figure 7: (Left) Species A. Figure 8: (Right) Species B.

By using successively more complex PSFs built up from multiplets of the original PSF, a single peak is obtained for each species, and their relative intensities and associated uncertainties are revealed. Finally, the PSFs for each species are combined in the correct proportions and the entire spectrum is reconstructed, giving the expected single line. Thus MaxEnt reconstruction of the data has allowed not only accurate determination of the splittings:

species A: aN = 1.492 ± 0.006 mT, aHa = 0.515 ± 0.004 mT, aH = 0.050 ± 0.004 mT

species B: aN = 1.493 ± 0.005 mT, aHa = 0.183 ± 0.005 mT, aH = 0.041 ± 0.005 mT


the latter being consistent with a hydroxyl radical [1]; but also an estimate of their relative concentrations: A : B = 0.76 ± 0.16.

ACKNOWLEDGEMENTS

The Scottish Office Agricultural and Fisheries Department (SOAFD) is acknowledged for support (SMG, BAG).

References

[1] G. R. Buettner, "Spin Trapping: ESR Parameters of Spin Adducts", Free Radical Biology & Medicine, Vol. 3, pp: 259-303, 1987.


THE VITAL IMPORTANCE OF PRIOR INFORMATION FOR THE DECOMPOSITION OF ION SCATTERING SPECTROSCOPY DATA

R. Fischer, W. von der Linden, and V. Dose
Max-Planck-Institut für Plasmaphysik, EURATOM Association, P.O. Box 1533, D-85740 Garching, Germany

ABSTRACT. The ubiquitous spectroscopic problem of decomposing overlapping lines is solved for ion scattering spectroscopy employing Maximum Entropy. The chosen example of Pd adsorption on a Ru surface is particularly challenging because 13 partially overlapping isotopes contribute to the total scattering signal. Proper decomposition using appropriate prior information enables accurate coverage determination.

1. Introduction

Ion scattering spectroscopy (ISS) is among the most important and successful techniques to determine the nuclear mass composition of a surface and the geometric surface structure in real space on an atomic scale [1]. ISS data, like experimental data in physics quite generally, constitute the convolution of a broadening function with the desired physical quantity. Yet the inverse problem of deriving the underlying physics is in most cases not unique and often corrupted, even ruined, by noise [2]. A frequently adopted way out of this problem is given if we know a reliable model function for the solution in terms of a few parameters. These may then be determined by well-established least squares fitting techniques. The prior knowledge which we use in this procedure is quite substantial. The solution must be of the model form; the reliability of this assumption may be subject to a χ² test. Reliable model functions are only rarely available, however, and may even obscure small effects present in the data. The way out of this dilemma is offered by Bayesian reasoning with Maximum Entropy (ME) prior information [3, 4, 5]. The ME concept, imposed on the infinite manifold of solutions which agree after convolution with the broadening function with the experimental data within error bars, chooses the single unique solution containing the least amount of information exceeding our prior knowledge. The incorporation of prior knowledge differs, however, radically from the model function approach. Prior knowledge in ME constitutes a soft constraint, which can always be overruled by the measured data. This paper exemplifies this procedure with ion scattering spectroscopy data from Pd adsorbed on Ru(001). Special attention is given to the problem of incorporating prior knowledge and the reliability of the ME solutions.

In ISS an approximately monoenergetic beam of ions, typically noble gas ions, is directed at the surface in some well-defined directions. The energy of the primary scattered ions is measured at a well-defined scattering angle. In the binary collision model the energy of the backscattered ions is determined by the scattering geometry, the masses of the incident ion and the backscattering surface atom, and the incident energy. The backscattered


ion intensity yields information on the quantitative composition of the surface. Relative coverages can be determined if the signals originating from different masses are well separated. Fitting procedures are necessary if masses are not resolved, and, especially, if the overlapping signals contain isotopes. The system Pd/Ru consists of 6 Pd (A=102-110) and 7 Ru (A=96-104) isotopes which overlap in their mass numbers A. Since no straightforward theory of the line shape for scattering of a noble gas ion off a single Pd isotope and a single Ru isotope is available, we proceed to a form-free reconstruction of these lines with the help of ME.

This brings us back to the deconvolution problem. While in general one attempts a signal reconstruction removing the smearing effect of a known broadening, decomposition of a spectrum into its constituents requires the determination of the unknown broadening. The latter comprises in the present case apparatus effects, all contributions from the binary collision event and from various possible effects like multiple scattering and/or inelastic mechanisms.

2. Formalism

First, we will specify the mathematical structure of the problem imposed by ion scattering experiments. The scattering of ions with energies ranging from 0.1 to 10 keV off surfaces can be described very well by binary collisions between the incident ion and surface atoms. The energy E1 of an ion of mass m1 and incident energy E0 scattered through an angle ϑ by an atom of mass m2 is given by [6]

E1 = E0 [ (cos ϑ + √((m2/m1)² - sin²ϑ)) / (1 + m2/m1) ]²     (1)

An energy spectrum of the scattered ions carries information on the surface composition since, for a fixed scattering angle ϑ, the energy depends only on the mass ratio m2/m1. The energy spectrum F(E) can be treated as a sum of energy spectra f(E) of the various scattering atoms, each modeling the convolution of the apparatus function with the energy distribution of the backscattered ions.

F(E) = Σ_μ a_μ Σ_ν b_ν^μ f_μ(E + E_ν^μ)     (2)

The abundance of atom species μ is a_μ and the relative abundance of isotope ν of atom species μ at the surface is b_ν^μ; b_ν^μ is taken from isotope tables of the elements. The energy distributions of the ions backscattered off the isotopes of atom species μ are assumed to be identical in form but shifted in energy by E_ν^μ according to their masses. The scattering signals f_μ are allowed to vary between different atom species μ. Such differences can arise from energy losses in the course of internal excitation of the scattering partners, which will of course depend on details of their electronic structure. f_μ in our procedure will also comprise apparatus broadening and the energy distribution of the incident noble gas ions.
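A minimal forward model corresponding to Eqs. (1) and (2) can be written in a few lines. In the Python sketch below the Gaussian line is only a stand-in for the unknown f_μ, the Ru isotope abundances are approximate natural values, and the Ar+ scattering parameters are the ones quoted later in the paper:

import numpy as np

def e1_binary(E0, m1, m2, theta):
    """Backscattered ion energy from the binary collision formula, Eq. (1)."""
    A = m2 / m1
    return E0 * ((np.cos(theta) + np.sqrt(A**2 - np.sin(theta)**2)) / (1.0 + A)) ** 2

E0, m_ar, theta = 5000.0, 40.0, np.radians(165.0)
ru_abundance = {96: 0.055, 98: 0.019, 99: 0.128, 100: 0.126,
                101: 0.171, 102: 0.316, 104: 0.186}   # mass number: abundance (approximate)

E = np.linspace(850.0, 1150.0, 600)

def line_shape(E, centre, width=15.0):                 # stand-in for the unknown f_mu
    return np.exp(-0.5 * ((E - centre) / width) ** 2)

# Eq. (2) restricted to a single atom species (clean Ru): sum over its isotopes.
F = sum(b * line_shape(E, e1_binary(E0, m_ar, m2, theta))
        for m2, b in ru_abundance.items())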

Let G_l represent the measurement of F at energy E_l. If the data G_l are assumed independent and normally distributed with error σ_l, then the likelihood function of the data


G_l, given the solution F_l, is

P(G_l | F_l, A) ∝ exp(-χ²/2)   with   χ² = Σ_l (G_l - F_l)² / σ_l²     (3)

What we need is, however, P(F_l | G_l, A), the posterior probability of a particular solution F_l in view of the measured data G_l, which we want to maximize. A stands for a set of further parameters, for example a_μ, the abundance of the atom μ in (2), or the incident energy E0. We choose the prior probability of F_l, P(F_l | A), to be entropic relative to a certain prior knowledge m

P(F_l | A) ∝ exp(αS)     (4)

with regularization parameter α and entropy S,

S = Σ_i [ f_i - m_i - f_i ln(f_i / m_i) ]     (5)

An entropic prior is appropriate for positive additive distributions such as the present ISS spectra. According to Bayes' theorem we have to maximize

P(F_l | G_l, A) ∝ P(F_l | A) P(G_l | F_l, A) ∝ exp(αS - χ²/2)     (6)

α determines the competition between a reconstruction close to the data (α = 0) and a reconstruction close to the default model m (α sufficiently large). A self-consistent choice of α is obtained by marginalizing P(α | G_l) = ∫ DF · P(F_l, α | G_l) [7] or the evidence of the data P(G_l | A) = ∫ DF · P(F_l, G_l | A) [8] over the reconstruction F_l. These quantities are proportional to each other provided the prior probability of α is uninformative, i.e. constant. The result F_l will then agree with the experimental data G_l and simultaneously approximate the default model m as closely as possible; in other words, the default model is updated by the smallest amount required by the data. It is clear that the default model should be chosen such that it incorporates as much prior information as possible. For example, the requirement that only positive functions f_μ are physically acceptable requires that m_μ > 0. If this constitutes all our prior knowledge then m_μ is taken to be a small positive number such that f_μ will tend to m_μ where the data do not provide different information. If the data are not of sufficient precision this will generally lead to "soft" solutions with correspondingly large error bars. The reliability of the solution can be very poor, as will be shown in the following sections.
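The competition expressed by Eq. (6) is easy to make concrete. The Python sketch below evaluates αS - χ²/2 for a candidate reconstruction relative to a flat default model; all arrays and the values of α are invented for illustration, and the forward mapping is simply the identity rather than the isotope convolution of Eq. (2):

import numpy as np

def entropy(f, m):
    # Entropy of a positive distribution f relative to a default model m.
    return np.sum(f - m - f * np.log(f / m))

def chi2(G, F, sigma):
    return np.sum(((G - F) / sigma) ** 2)

def log_posterior(f, G, sigma, m, alpha):
    return alpha * entropy(f, m) - 0.5 * chi2(G, f, sigma)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
G = np.exp(-0.5 * x**2) + 0.1 + 0.05 * rng.standard_normal(x.size)  # toy data
m = np.full(x.size, 0.1)            # "ignorant" flat default model
f = np.clip(G, 1e-3, None)          # a candidate (data-hugging) reconstruction
for alpha in (0.1, 1.0, 10.0):      # small alpha favours the data, large alpha the model
    print(alpha, log_posterior(f, G, 0.05, m, alpha))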

3. The clean Ru surface

Experimental data G_l, shown as error bars in fig. 1, were obtained by scattering of Ar+ ions of nominally 5 keV energy through an angle of 165°. Experimental details have been published previously [6]. The continuous solid line marks the final ME data fit F_l. The dotted curves indicate the decomposition of the overall spectrum into seven single-isotope scattering functions f_ν of the Ru surface.

Initially the data reduction was performed with a flat uninformative default model. The result for the single-isotope scattering signal f_Ru resulting from this choice is displayed in


Figure 1: ISS spectrum of the clean Ru(001) surface (error bars), the ME reconstruction (solid line), and the decomposition into the 7 Ru isotopes (dotted lines). The incident energy of the Ar+ ions, E0 = 5.1 keV, is determined via evidence analysis. The Ru isotopes cause a broad, nearly unstructured backscattering signal.

the upper panel of fig. 2. This is clearly unacceptable from our ISS experience in cases without isotope complication. It is, however, worth mentioning that the reconstruction of the overall signal from this unacceptable function fits the data in fig. 1 even closer than shown by the continuous solid line. The unacceptable shape of f_Ru is the price we pay for pretending to be more ignorant than we actually are. Practically all data points provide information beyond our default model, with the consequence of overfitting the data and little entropic smoothing. More precisely, the number of data which provide information differing from the default model approaches the number of measurements: N ≈ 2αS. Since the result of our calculation is not only the most probable function, or more specifically vector f, but rather its full probability distribution, we can proceed and calculate errors of amplitude and position of the various structures in f. They are given in fig. 2 and demonstrate that the disappointing ringing region of f is mostly insignificant.

From experience in image reconstruction it is well known that the quality of the ME solution depends progressively on the choice of default models as the noise level of the data increases [9]. We therefore proceed to the construction of a more refined default model. First of all we have obviously to account for a constant background on top of which the Ru scattering signal appears. The next point is more subtle. For the clean Ru surface equation (2) reduces to

F(E) = a_Ru Σ_ν b_ν^Ru f_Ru(E + E_ν^Ru)     (7)


Figure 2: ME solutions (solid lines) for the broadening function resulting in the ISS spectrum shown in fig. 1, with different default models (dashed lines). In the upper panel the value of the default model m is small, resulting in "ringing" with great errors in position and intensity of the structures. In the lower panel the default model is refined with a decomposition into a constant background and a Gaussian for the backscattering signal (see text).


We now take the first three moments

∫ E^n F(E) dE = Σ_ν b_ν ∫ E^n f(E + E_ν) dE,   n = 0, 1, 2     (8)

of F and approximate the left-hand side by the corresponding moments computed from the experimental data

∫ E^n F(E) dE ≈ M_n(G) = Σ_l E_l^n G_l,   n = 0, 1, 2     (9)

By change of variables and binomial expansion (8) transforms into a system of linear equations for the moments M_n(f), n = 0, 1, 2. Our ME solution should of course obey these "sum rules". But the moments tell even more. It can be shown by ME itself that the functional

S(f) = ∫ [ f(x) - m(x) - f(x) log( f(x)/m(x) ) ] dx     (10)

subject to the conditions that the first three moments are given numbers is maximum if f(x) is a Gaussian [10]. This is of course quite satisfactory since experience from simple ion scattering spectra complies with this solution. Accordingly it is a reasonable choice for the default model and we shall now examine its influence on the single-isotope scattering signal f. The lower panel of fig. 2 shows the Gaussian default function, as determined from the moments of the experimental data in fig. 1, as a dashed curve. The solid line is the ME solution on the basis of this default model. It has the same maximum position and the same half-width as the upper panel solution and is, of course, much more satisfactory. The unphysical ringing in the wings of the line is greatly reduced. Note the error bars showing the confidence of the improved solution in these regions. Furthermore, 2αS has decreased to 80% of its previous value, indicating that more measured data accord with the default model. A minor point worth mentioning is that the reconstruction in fig. 2 is for an optimal incident energy E0 = 5.1 keV determined via evidence analysis rather than at the nominal 5 keV, or in case of inelastic losses even below 5 keV. This discrepancy must be blamed on the calibration of the effective accelerating voltage of the primary ion beam and/or on minute deviations from the nominal scattering angle of 165°.
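The moment-based default model can be prototyped directly from the data. The sketch below (Python; the synthetic spectrum is a placeholder for the measured G_l, and no background subtraction is attempted) computes the zeroth, first and second moments on the measured grid and uses the Gaussian with the same norm, mean and variance as default model:

import numpy as np

E = np.linspace(850.0, 1150.0, 300)                   # energy grid (eV)
G = np.exp(-0.5 * ((E - 990.0) / 40.0) ** 2) + 0.05   # stand-in measured spectrum

dE = E[1] - E[0]
M0 = np.sum(G) * dE                                   # zeroth moment
mean = np.sum(E * G) * dE / M0                        # first moment divided by M0
var = np.sum((E - mean) ** 2 * G) * dE / M0           # central second moment

# Gaussian with the same first three moments, used as the refined default model.
default_model = M0 / np.sqrt(2.0 * np.pi * var) * np.exp(-0.5 * (E - mean) ** 2 / var)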

The single-isotope scattering signal summarizes all contributions to the unknown broadening. The effects of apparatus broadening, multiple scattering events, inelastic energy losses or even surface defects, which are always present on real surfaces, are not considered in the binary collision formula (1). Since the apparatus broadening is not known, further discrimination into the different physical effects is conjectural. Nevertheless, our main task of decomposing ISS spectra with overlapping signals into the constituents can be done without any bias.

4. The composite system Pd/Ru(001)

The treatment of the data from the clean Ru surface has already given a flavor of the importance of choosing a default model incorporating all our reliable knowledge in a complicated case like the present one. Experimental data for scattering of Ar+ off a Ru surface partially covered with Pd are shown in fig. 4 (note the different energy axis as compared


Figure 3: ME solution for the broadening function resulting in the ISS spectrum shown in fig. 4, for ions backscattered off Pd isotopes (solid line) and Ru isotopes (dash-dotted line), with different default models (dashed lines). In the upper panel the value of the default model is chosen small, resulting in a useless solution with great errors in position and intensity of the structures. In the lower panel the default models are the solutions for the cases of a clean Ru surface and a thick Pd film. Note the considerably smaller error bars. The optimal incident energies for ions backscattered off Ru and Pd are 5100 eV and 5075 eV, respectively.


to fig. 1!). Proceeding as before with a flat default model we obtain the single-isotope scattering functions f_Ru (dash-dotted) and f_Pd (solid) in the upper panel of fig. 3. They


Figure 4: ISS spectrum of some amount of Pd adsorbed on the Ru(001) surface (error bars), the ME reconstruction (solid line), and the decomposition into the 7 Ru (dotted lines) and 6 Pd isotopes (dash-dotted lines). The ME solutions are as depicted in the lower panel of fig. 3.

are even more unacceptable than the solution for the clean Ru surface. Both functions show excessive oscillations which are even less significant according to the error bars than previously. Again a much better choice of the default function is mandatory. For Ru we choose the final reconstruction of the single-isotope scattering function as derived for the clean Ru surface. For Pd we go through a similar procedure with auxiliary scattering data obtained from a thick Pd film such that no backscattering from Ru occurs. The two default models obtained are shown as dashed lines in the lower panel of fig. 3. Analysis of the scattering data from the partially covered Ru surface on the basis of these default models results in the dash-dotted curve for Ru, for an optimal incident energy E0 = 5100 eV as before, and the continuous line for Pd, for E0 = 5075 eV. The energy difference must be due to different average inelastic losses in Ar+-Pd and Ar+-Ru collisions. Note that the single-isotope scattering functions hardly deviate from their respective default models. The data for the partially covered surface provide no additional information beyond the information we obtained for the clean Ru surface or the thick Pd film. Such a coincidence is expected on the basis of the binary collision model for ISS. On the other hand, our results do not prove the binary collision model conclusively. Data of insufficient precision would lead to the same behavior.

The similarity of the single-isotope scattering functions determined here to Gaussians


may be taken as a justification for Gaussian fits to ISS spectra. Nevertheless, the remaining deviations (see fig. 2, lower panel) are sizable and large enough that a Gaussian fit to a spectrum as complicated as in fig. 1 results in systematic deviations from the data. In other words, the data support more information than is given by their first three moments M_n(G), n = 0, 1, 2.

The analysis of the Pd/Ru spectra requires the determination of still another parameter, namely a_Pd/a_Ru, the fractional coverage. This is determined by evidence analysis in much the same way as the self-consistent determination of α. The fractional Pd coverage then follows from the backscattering yields observed on the all-Ru and all-Pd surfaces. A more extensive discussion of Pd adsorption on Ru(001) will be published elsewhere [11].

Let us finally look at the data synthesis on the basis of the single-isotope scattering functions of fig. 3. This is shown as the solid line in fig. 4. In view of the meaning of Poisson errors the fit is quite good. Once again, the solutions shown in the upper panel of fig. 3 would have led to unreasonable overfitting. Dotted and dash-dotted lines indicate the contributions of Ru and Pd, respectively, to the overall spectrum.

In conclusion, the present application of ME techniques to experimental data has underlined their character as a quantitative measure of strength of belief. The reliability of the ME solutions must be assessed carefully before proceeding to physical interpretations. ME processing requires elaborate consideration and incorporation of prior knowledge, which can only be supplied by an experienced user of the subject. It appears unlikely and dangerous for the inexperienced user to develop the ME procedure into a black-box, all-purpose data processing package.

ACKNOWLEDGMENTS

We are grateful to N. Memmel and A. Steltenpohl for supplying the ISS data and helpful comments.

References

[1] E. Taglauer, "Investigation of the Local Atomic Arrangement on Surfaces Using Low-Energy Ion Scattering", Appl. Phys., Vol. A 38, pp: 161-170, 1985.

[2] W. von der Linden, M. Donath and V. Dose, "Unbiased Access to Exchange Splitting of Magnetic Bands Using the Maximum Entropy Method", Phys. Rev. Lett., Vol. 71, pp: 899-902, 1993.

[3] J. N. Kapur, "Maximum-Entropy Models in Science and Engineering", Wiley, New York, 1989.

[4] G. J. Daniell, "Of Maps and Monkeys: an Introduction to the Maximum Entropy Method", in: "Maximum Entropy in Action", Eds. B. Buck and V. A. Macaulay, Oxford University Press Inc., New York, p: 1, 1991.

[5] P. W. Anderson, "The Reverend Thomas Bayes, Needles in Haystacks, and the Fifth Force" Phys. Today, Vol. 45 (Jan.), pp: 9-11, 1992.

[6] Th. Fauster, "Surface geometry determination by large-angle ion scattering", Vacuum, Vol. 38, pp: 129-142, 1988.


[7] S. F. Gull, "Developments in Maximum Entropy Data Analysis", in: "Maximum Entropy and Bayesian Methods", Ed. J. Skilling, Kluwer Academic Publishers, Norwell, MA, pp: 53-71, 1989.

[8] J. Skilling, "Quantified Maximum Entropy", in: "Maximum Entropy and Bayesian Methods", Eds. P. F. Fougere, Kluwer Academic Publishers, Norwell, MA, p: 341, 1990.

[9] S. F. Gull and J. Skilling, "Maximum entropy method in image processing", IEE Proc., Vol. 131 F, pp: 646-659, 1984.

[10] L. R. Mead and N. Papanicolaou, "Maximum entropy in the problem of moments", J. Math. Phys., Vol. 25 (8), pp: 2404-2417, 1984.

[11] N. Memmel et al., unpublished.


BAYESIAN CONSIDERATION OF THE TOMOGRAPHY PROBLEM

W. von der Linden, K. Ertl, and V. Dose
Max-Planck-Institut für Plasmaphysik, EURATOM Association, D-85740 Garching b. München, Germany
e-mail: [email protected]

ABSTRACT. Soft X-ray tomography has become a standard diagnostic to investigate plasma profiles. Due to limitations in viewing access and detector numbers, the reconstruction of the two-dimensional emissivity profile constitutes a highly underdetermined inversion problem. We discuss the principal features of the tomography problem from the Bayesian point of view in various stages of sophistication. The approach is applied to real-world data obtained from the Wendelstein 7-AS stellarator.

1. Introduction

Soft X-ray emission tomography is used to analyze the formation and time-evolution of the poloidal plasma cross-section on tokamaks and stellarators. It depends on various plasma parameters, like temperature, density and effective atomic number. The knowledge of the emissivity profile allows one to infer, to some extent, internal properties of the plasma, like position, shape, impurity distribution, and magneto-hydrodynamic (MHD) modes. Typically, a set of detectors is used as depicted in fig. 1 to record X-rays emitted by the plasma. A


Figure 1: X-ray tomography

plasma in the temperature and density range of interest is optically thin for X-rays, so


radiation from internal points of the plasma, say pixel i, reaches the detectors without absorption. Hence, the signal S_l, recorded by detector l, is the integrated emissivity of all pixels lying in the viewing cone V_l, formed by the detector and the collimator aperture:

S_l(E) = Σ_{i ∈ V_l} g_li E_i     (1)

In addition, the signal is deteriorated by experimental errors σ_l. The matrix g_li contains: a) the physics of X-rays, such as the 1/r² intensity fall-off; b) the geometry of the detector arrangement, in particular the r² increase in volume, which cancels the aforementioned 1/r² decrease; c) characteristics of detector l such as sensitivity etc. A thorough determination of g_li is an essential part of the tomography problem. An incompatible g, used to invert Eq. 1, would lead to significant artifacts in the emissivity profile. Such problems are absent when using artificial test data, which limits the value of simulations with synthetic data.
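The forward model of Eq. (1) is just a sparse matrix-vector product. The Python sketch below uses a random sparse stand-in for the geometry matrix g_li and a random emissivity, purely to fix the shapes (72 detectors, 34 × 34 pixels as quoted later in this paper); it says nothing about the true W7-AS geometry:

import numpy as np

ndet, npix = 72, 34 * 34
rng = np.random.default_rng(1)

# Toy geometry matrix: each detector "sees" roughly 5% of the pixels.
g = rng.random((ndet, npix)) * (rng.random((ndet, npix)) < 0.05)
E_true = rng.random(npix)                            # toy emissivity profile

sigma = 0.03 * (g @ E_true) + 1e-6                   # assumed measurement errors
S = g @ E_true + sigma * rng.standard_normal(ndet)   # Eq. (1) plus Gaussian noise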

2. Qualitative discussion

In order to analyze the experimental data we employed the most accurate values for g_li we could get. For a qualitative discussion, however, it is expedient to use a widespread approximation,

S_l(E) = g ∫_{C_l} E(x) ds     (2)

which is justified if the solid angle of the viewing cones is very narrow, the spatial extent of the detector slit is much smaller than the plasma diameter, and if all detectors have identical sensitivities. In this case, signal S_l is proportional to the line-integrated emissivity along the view-chord C_l. For the qualitative discussion we assume pixels on a square lattice, as depicted in fig. 1. Furthermore, we take the liberty of installing detectors viewing parallel to the x- and y-axis, respectively, such that the mid-point of each pixel is intersected by two orthogonal viewing-chords. Put differently, the viewing-chords constitute a square lattice which is shifted relative to the pixel mesh by half a unit-cell diagonal. In this case, Eq. 2 simplifies to

S_i^x = g Σ_j E_ij,   S_j^y = g Σ_i E_ij     (3)

S_i^(x/y) stands for the signal recorded by detector i along the (x/y)-axis and E_ij represents the emissivity of the pixel at position (i, j) on the square lattice. The inversion of Eq. 3 is ill-posed, which is most easily seen in the case of 2×2 pixels and 2+2 detectors. In this case the linear problem can be written as

S = G E ,     G =
( 1 1 0 0 )
( 0 0 1 1 )
( 1 0 1 0 )
( 0 1 0 1 )     (4)

with E = (E_11, E_12, E_21, E_22)^T and S = (S_1^x, S_2^x, S_1^y, S_2^y)^T.

Obviously, Eq. 4 has no unique solution since det(G) = 0. The reason is that viewing-chords are provided for only two angles. It is of no use to install an infinite number of detectors along the two axes; the inversion would remain underdetermined. For a unique


solution a complete coverage of all angles is required. This point becomes apparent when employing the Fourier transform in polar coordinates [1, 2, 3, 4]. A first insight into the tomography problem can be obtained by ignoring experimental errors and assuming that the total emissivity E_0 = Σ_ij E_ij is known. It is advantageous to express the emissivity in terms of a probability distribution p_ij = E_ij / E_0, which we determine in the framework of Jaynes' MaxEnt [5]. The data constraints are simply

s_i^x = Σ_j p_ij,   s_j^y = Σ_i p_ij,   1 = Σ_ij p_ij     (5)

with s_i^(x/y) = S_i^(x/y) / (E_0 g). The entropy S = -Σ_ij p_ij ln p_ij is maximized subject to the data constraints, which amounts to maximizing the Lagrangian

L = -Σ_ij p_ij ln p_ij + Σ_i λ_i^x ( s_i^x - Σ_j p_ij ) + Σ_j λ_j^y ( s_j^y - Σ_i p_ij ) + λ_0 ( 1 - Σ_ij p_ij )

with respect to p_ij and the Lagrange parameters λ. The MaxEnt result reads

E_ij ∝ s_i^x s_j^y     (7)

Since the "experimental" geometry provides no correlation between x- and y-detector signals, and due to MaxEnt's maxim of being unbiased, the x- and y-components of the resulting emissivity are uncorrelated. This will be different in the real-world experiment, which introduces correlation as we will see below. The orthogonal viewing geometry leads to a significant restriction in angular resolution. This point is elucidated in table 1, where a few examples for a 3×3 pixel field with 3+3 detectors are provided. The left panel depicts possible solutions of the ambiguous inversion problem, while the right panel shows the unique MaxEnt solution. Obviously, MaxEnt yields the most symmetric solution, as far as point transformations and translations are concerned. To increase the angular resolution a set of rotated reference frames is required. This has been achieved, despite the rather limited viewing access, in the Wendelstein stellarator W7-AS as follows. In order to image the poloidal emissivity profile two sets of detector arrays are used. Each array consists of 36 detectors. The radiation is collimated in both detector arrays by rectangular slits which produce a fan-like view configuration as depicted in fig. 2. This geometry is probably the most useful for wide-angle viewing in situations with restricted access. As the next step towards the solution of the realistic tomography problem, we use the fan-like geometry. Still ignoring experimental errors and employing the line-integral approximation Eq. 2, we obtain again the MaxEnt solution Eq. 7; only the meaning of the x- and y-coordinates has been modified, as indicated in fig. 2. We denote signals and viewing-chords of the first camera by x and those of the second camera by y. We then enumerate the viewing-chords as depicted in fig. 2. The pixel geometry is chosen according to the mesh provided by the intersecting viewing-chords. The pixel at which viewing-chords C_i^x and C_j^y intersect has coordinates (i, j). Actually, the approximation Eq. 5 is not entirely correct in the present case, as it ignores the fact that the lengths of the line segments in the different pixels differ, leading to a variation in the matrix elements g_li by about 50%.
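For the error-free two-angle case the MaxEnt solution of Eq. (7) is simply an outer product of the normalized detector profiles, which automatically reproduces the constraints of Eq. (5). The short Python check below uses invented 3-detector signals:

import numpy as np

sx = np.array([0.2, 0.5, 0.3])   # normalized x-detector signals (sum to 1)
sy = np.array([0.1, 0.6, 0.3])   # normalized y-detector signals (sum to 1)

P = np.outer(sx, sy)             # p_ij = s_i^x * s_j^y, Eq. (7)

# Row and column sums recover the data constraints of Eq. (5).
assert np.allclose(P.sum(axis=1), sx)
assert np.allclose(P.sum(axis=0), sy)
assert np.isclose(P.sum(), 1.0)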

The product form is retained even if we include experimental errors. Anticipating the result of the next section, the MaxEnt solution of the tomography problem for the fan geometry, including experimental errors, in the line-integral approximation Eq. 2 reads


Table 1: Demonstration of angular resolution in two-angle viewing geometry. Left-hand columns and top rows of each table depict the x- and y-detector signals. The inner 3×3 block represents emissivities.

again

E_ij ∝ s_i^x s_j^y ,     (8)

however, the s_i are no longer the bare experimental signals.

3. Real world data and quantified MaxEnt

We are now prepared to tackle the tomography problem in full complexity, i.e. to invert Eq. 1 consistently, accounting for the statistical measurement errors. To this end, we invoke quantified MaxEnt [6, 7]. The likelihood function, assuming Gaussian error statistics, reads

P(S | E, σ) ∝ exp(-χ²/2) ;   χ² = Σ_l (S_l - S_l(E))² / σ_l²     (9)

According to Eq. 1, S_l(E) is the predicted detector signal for given emissivity E. The entropy, entering the prior P(E | α, m),

S = Σ_i [ E_i - m_i - E_i ln(E_i / m_i) ]     (10)

is measured relative to a default model m. For notational simplicity we have used a combined index i for the x- and y-coordinates. The default model plays a central role in the tomography problem and will be discussed in some detail. Above all, it is crucial to fully account


Figure 2: Viewing geometry in W7-AS. For clarity, only 10 viewing chords are depicted, covering the same angular region as the 72 viewing chords given in the experiment. C_i^a represents the line of sight of camera a belonging to detector i. The arrow indicates pixel (i, j), intersected by the view chords C_i^x and C_j^y, respectively.

for our prior knowledge that the plasma is confined to the region where the viewing chords intersect. Quantified MaxEnt can be mapped onto a dual problem [8] upon introducing a Legendre transform. The original 'potential energy' to be minimized reads

φ = χ²/2 - αS     (11)

We introduce the Legendre transform

αλ_l := ∂φ/∂S_l = (S_l - S_l(E)) / σ_l²     (12)

φ(E, S(λ)) - α Σ_l λ_l S_l = -(α²/2) Σ_l λ_l² σ_l² - α Σ_l λ_l S_l(E) - αS ≡ Ψ(E, λ) .     (13)

The sought-for emissivity is readily obtained via the Euler-Lagrange equation

E_i = m_i exp( Σ_l λ_l g_li )     (14)

The Lagrange parameters λ_l follow from Eq. 12. The result, given in Eq. 8, is a special case of Eq. 14 if the data constraints Eq. 5 are used. The regularization parameter α is determined upon maximizing the marginal posterior P(α | S, m, σ).
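A schematic fixed-point iteration consistent with Eqs. (12)-(14) is sketched below in Python. It is not the algorithm actually used here (no evidence analysis, α held fixed, naive damping), but it shows how the exponential form of the emissivity and the misfit-driven Lagrange parameters interact:

import numpy as np

def maxent_dual(S, g, sigma, m, alpha, n_iter=500, step=0.3):
    """Schematic dual iteration: E_i = m_i exp(sum_l lambda_l g_li)."""
    lam = np.zeros(len(S))
    for _ in range(n_iter):
        E = m * np.exp(g.T @ lam)                       # exponential form of the emissivity
        lam_target = (S - g @ E) / (alpha * sigma**2)   # misfit-driven Lagrange parameters
        lam += step * (lam_target - lam)                # damped update; convergence not guaranteed
    return m * np.exp(g.T @ lam)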


The idea of solving the plasma tomography problem by MaxEnt traces back to Frieden [9] and was applied in the sequel to Hα tomography [10] and soft X-ray tomography for tokamak data [11]. The application to stellarator emissivities obtained by W7-AS is more challenging as the emissivity profile is more structured and fewer experimental data are available. The MaxEnt concept offers several important advantages over other reconstruction techniques employed over the years, like those based on generalized Abel inversion [12, 13, 14, 15], linear least-squares techniques [16] and restricted Fourier analysis [1, 2, 3, 4]. It allows the reconstruction of arbitrary asymmetrical emissivity profiles without any assumptions or restrictions which are tacitly made in the other approaches. This is particularly desirable if no independent experimental evidence is given to justify these restrictions. If, on the other hand, the assumptions are valid or if additional prior knowledge is given, it can effectively be incorporated into the MaxEnt algorithm, leading to a systematic improvement of the results.

For completeness we mention the geometry parameters used in our calculation. In W7-AS the plasma diameter is about 30 cm. The reconstruction is performed on a square region of about 40 cm × 40 cm which is divided into 34 × 34 pixels, yielding a spatial resolution of 1.5 cm. The total angle covered by the detectors of one pinhole camera (see fig. 2) is about 40°. The system is highly underdetermined, as the number of pixels is about 700 - only pixels with non-zero default model enter the calculation - while the number of detectors is 72. Moreover, the relevant plasma cross section is not sampled evenly, because both cameras are located on one side of the plasma (fig. 2). Consequently, the information density is lower on the remote side of the detectors, leading to "shadow effects" in the sense that the reconstructed profile will be more reliable close to the detectors than away from them. Needless to say, this is a generic feature of the experimental set-up [15].

First we present results for synthetic data to illustrate the degree of reliability inherent to the MaxEnt reconstruction. Fig. 3a shows the sample emissivity that sketches an m=4 MHD mode. The respective detector signals are computed via Eq. 1, adding 3% Gaussian noise. The MaxEnt reconstruction is shown in fig. 3b. One finds a good overall agreement with the underlying "exact" image. But we also observe the expected shadow effect, a decline in reliability with increasing distance from the detector. It is related to the decreasing information content of the chord measurements for these parts of the plasma cross section. The m=4 symmetry is, however, clearly visible in the reconstructed image. We have repeated the reconstruction for the emissivity profile rotated around the poloidal axis. It appears that the quality gradually degrades with increasing rotation angle up to about 45°, after which it starts improving again. At about 45° the peaks of the m=4 structure are pairwise lined up in the central viewing chords of the two cameras and the experiment provides no information whatsoever to resolve the m=4 mode [15]. Consequently, the MaxEnt reconstruction exhibits a single broad peak in the center of the plasma, like in example c) of table 1.

As a real-world application, we reconstructed emissivity profiles from soft X-ray chord measurements on W7-AS for a series of snapshots at consecutive times. A representative emissivity profile is depicted in fig. 4a, which was obtained by quantified MaxEnt using a flat default model. As mentioned earlier, the default model was confined to the region of crossing viewing-chords. The reconstruction of a time series of emissivity profiles allows one to observe the temporal changes in shape and position of the plasma profile, as well as the appearance,


Figure 3: Test tomography problem. Left) Synthetic tomography data; right) MaxEnt reconstruction.

motion, locking, and disappearance of magneto-hydrodynamic modes. We mentioned above that the resolution quality of the reconstructed emissivity changes periodically, depending on the relative position of image structures to the two cameras. The analysis of time sequences allows one to bridge reconstructions of poor quality. It can be observed in fig. 4a that the reconstruction is rather spiky, due to the exaggeratedly ignorant default model. The data demand strong deviations from the model, resulting in a small regularization parameter. At the state of knowledge expressed by the flat default model, MaxEnt cares first of all about the gross structure. Details are only tackled if the coarse structure is already correctly provided by the default model. This is illustrated in fig. 4b by the result for a default model derived from the vacuum magnetic flux [15], which represents a reasonable prior knowledge. The default model has a bell shape, the position of which is unknown and is treated as a hyper-parameter. In Fig. 5a the evidence (log-marginal-posterior) for the x-coordinate of the position is depicted. Even on a logarithmic scale the probability is sharply peaked. Fig. 5b shows the response of the reconstruction to the position of the default model, summarized by the center of mass and the maximum of the emissivity profile, respectively. It is good to see that the result depends only weakly on the position of the default model; the center of mass follows the position of the default model hesitantly, while the maximum even shifts slightly in the opposite direction. The reason lies in the fact that the "shadow" decreases with increasing shift in the positive x-direction.

4. Summary

In summary, we have demonstrated that MaxEnt is perfectly suited to reconstruct emissivity profiles from X-ray chord measurements irrespective of the detailed shape of the plasma.


Figure 4: Emissivity profile reconstructed from W7-AS tomography data. Left) flat default model; right) bell-shaped default model.

It is therefore more favorable than other reconstruction methods which depend crucially on the spatial smoothness of the image. MaxEnt provides a consistent description of probabilistic inference based on Bayesian statistics. It yields the most probable and noncommittal solution consistent with available noisy data and additional prior knowledge. It should be mentioned that MaxEnt allows one to assign confidence intervals to the position and shape of the reconstructed image. This information is extremely valuable in the present case of weak data constraints to assess the predictive power of X-ray tomography in general. Along the same lines of probabilistic inference it is possible to allow for different camera sensitivities, which are probably present. MaxEnt provides systematic and controlled means to incorporate justified assumptions (prior knowledge), thus restricting the image space and yielding more informative (detailed) and more reliable results.

References

[1] A.P. Navarro, V.K. Pare, and J.L. Dunlap, Rev. Sci. Instrum. 52, 1634 (1981).

[2] R.S. Granetz and J. Camacho, Nucl. Fusion 25, 727 (1985).

[3] J. Camacho and R.S. Granetz, Rev.Sci.Instrum. 57,417 (1986).

[4] N.R. Sauthoff,K.M. McGuire, and S. von Goeler, Rev.Sci.Instrum. 57, 2139 (1986).

[5] E.T. Jaynes, (1958).

[6] S.F. Gull, in Maximum Entropy and Bayesian Methods, ed. J. Skilling (Kluwer Academic Publishers, 1989).


Figure 5: The default model is shifted rigidly along the x-axis in pixel units. a) (left) Dependence of the evidence on the shift. b) (right) Response of the center of mass and of the maximum position of the emissivity profile.

[7] J. Skilling, in Maximum Entropy and Bayesian Methods ed. P. F. Fougere, (Kluwer, Academic Publishers, 1990).

[8] R. Silver, in Maximum Entropy and Bayesian Methods ed. G. Heidbreder, (Kluwer, Academic Publishers, 1993), to be published.

[9] B. Roy Frieden, J. Opt. Soc. Am. 62, 511 (1972).

[10] G.A. Cottrell, in Maximum Entropy in Action, ed. B. Buck and V.A. Macaulay (Oxford Science Publications, Oxford, 1990).

[11] A. Holland and G.A. Navratil, Rev. Sci. Instrum. 57, 1557 (1986).

[12] K. Bockasten, J. Opt. Soc. Am. 51, 943 (1961).

[13] N.R. Sauthoff, S. von Goeler, and W. Stodiek, Nucl. Fusion 18, 1445 (1978).

[14] N.R. Sauthoff and S. von Goeler, IEEE Trans. Plasma Sci. 7, 141 (1979).

[15] A.P. Navarro, M.A. Ochando, and A. Weller, IEEE Trans. Plasma Sci. 19, 569 (1991).

[16] R. Decoste, Rev. Sci. Instrum. 56, 807 (1985).


USING MAXENT TO DETERMINE NUCLEAR LEVEL DENSITIES

N. J. Davidson (1), B. J. Cole (2) and H. G. Miller (3)

(1) Department of Mathematics, University of Manchester Institute of Science and Technology (UMIST), PO Box 88, Manchester M60 1QD, UK
(2) Department of Physics, University of the Witwatersrand, WITS 2050, South Africa
(3) Department of Physics, University of Pretoria, Pretoria 0002, South Africa

ABSTRACT. Calculations involving excited nuclei often require knowledge of the nuclear many-body density of states in regions where the analytic dependence of this quantity on the energy is not well known. We show, by means of a model calculation, that it should be possible to satisfactorily infer the energy dependence of the nuclear level density for a reasonable range of energies by use of the Maximum Entropy Principle. The prior information required is the observed number of states per energy interval at comparatively low energies, where the experimental nuclear spectra are well known. Our results suggest that the proposed method is sufficiently reliable to allow for the calculation of thermal properties of nuclei (through the partition function) over a reasonable temperature range.

1. Introduction

The nuclear many-body level density has long been a quantity of considerable physical interest. In earlier years this was mainly in the astrophysical community, but the development of relativistic heavy ion accelerators in more recent years has seen a renewal of interest from nuclear physicists. Nuclei are, however, extremely complex systems, and there has been little success with the derivation of the level density from anything remotely approaching an ab initio calculation. Most calculations still rely heavily on the extrapolation to high energies of parameterizations based on the low energy behaviour of the level density, where there is sufficient data to allow for reasonably reliable fits.

The simplest such expression for the nuclear level density has been obtained in the Fermi-gas model by Bethe [1, 2, 3] and later modified by Bloch [4, 5, 6]:

\rho_0(E) = \frac{1}{\sqrt{48}\,E} \exp\left[ 2\sqrt{a_0 E} \right] \qquad (1)

where a_0 is the level-density parameter.

There are, however, a number of shortcomings in this approach. For example, the lack of coupling to the collective part of the nuclear spectrum leads to an energy-independent level density parameter. Also, the multiple inverse Laplace transform used to determine the nuclear density of states from the grand canonical partition function of a Fermi gas appears to lead to certain inconsistencies in the folding of nuclear level densities [7].


In spite of these deficiencies Bloch's formula is widely used, particularly as a means of parameterizing experimentally determined nuclear level densities [8]. The level density parameter is often expressed in terms of the nuclear mass number A by

a_0 = \frac{A}{k_0} \qquad (2)

Generally, k_0 is taken to be constant, a popular choice being k_0 = 8 MeV. Should k_0 be energy dependent, an a priori form for this dependence must be assumed.

2. A MAXENT Approach

In what follows we will demonstrate a simple MAXENT-based method of extracting the nuclear level density from the experimental information [9]. The problem which we want to address is the determination of the distribution of states (or level density), ρ(E), for a nucleus such that

\int_0^\infty \rho(E)\, \theta(E - E_i)\, \theta(E_{i+1} - E)\, dE = N_i, \qquad i = 1, \ldots, N \qquad (3)

where N_i, the number of states in the energy interval E_i ≤ E ≤ E_{i+1}, is assumed known, and θ is the Heaviside step function. As a zeroth-order guess we assume that the distribution of states or level density is given by the Bloch formula (1). The actual distribution ρ(E) may now be determined by minimizing the information entropy of ρ(E) relative to ρ_0(E),

S(\rho, \rho_0) = \int_0^\infty \rho(E) \ln\frac{\rho(E)}{\rho_0(E)}\, dE \qquad (4)

subject to the experimental information given in eq. (3). This yields

\rho(E) = \rho_0(E) \exp\left[ \sum_{i=1}^{N} \lambda_i\, \theta(E - E_i)\, \theta(E_{i+1} - E) \right] \qquad (5)

where the Lagrange multipliers λ_i for the intervals E_i ≤ E ≤ E_{i+1} are determined from the experimental information given in eq. (3).

Now suppose we have determined the set of λ_i and we put

\lambda(\bar{E}_i) = -\lambda_i \qquad (6)

where Ē_i is the midpoint of the energy interval E_i ≤ E ≤ E_{i+1}. If the λ(Ē_i) vary smoothly with energy, a least-squares fit to this quantity should enable us to extrapolate ρ(E) beyond the region in which the experimental information is known. If this is the case the density of states may now be given by

\rho(E) = \rho_0(E)\, e^{-\lambda(E)}, \qquad (7)

or, from eqs (1) and (2),

\rho(E) = \frac{1}{\sqrt{48}\,E} \exp\left[ 2\sqrt{\frac{AE}{k_0}} - \lambda(E) \right]. \qquad (8)


The important aspect to note here is that the energy dependence of the density of states is determined from the prior experimental data.
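To make the extraction concrete, the following minimal Python sketch (our own illustration, not the authors' code; the function name, the defaults A = 56 and k_0 = 8 MeV, and the constant-order fit are assumptions chosen to match the model calculation of Section 3) solves eqs. (3) and (5) for the λ_i bin by bin and assembles the extrapolated density of eq. (8):

import numpy as np
from scipy.integrate import quad

def extrapolated_density(edges, counts, A=56, k0=8.0, deg=0):
    """Build rho(E) from binned level counts via eqs. (1)-(8) and (11).

    edges  : bin boundaries E_1 < ... < E_{N+1} (MeV)
    counts : observed numbers of states N_i per bin
    """
    a0 = A / k0                                                            # eq. (2)
    rho0 = lambda E: np.exp(2.0 * np.sqrt(a0 * E)) / (np.sqrt(48.0) * E)   # eq. (1)

    # eqs. (3) and (5): within bin i, rho = rho0 * exp(lam_i), so
    # exp(lam_i) times the integral of rho0 over the bin must equal N_i.
    lam = np.array([np.log(N / quad(rho0, lo, hi)[0])
                    for N, lo, hi in zip(counts, edges[:-1], edges[1:])])

    Emid = 0.5 * (edges[:-1] + edges[1:])                 # bin midpoints
    lam_prime = -lam / (2.0 * np.sqrt(A * Emid))          # eqs. (6) and (11)
    coeff = np.polyfit(Emid, lam_prime, deg)              # smooth low-order fit

    def rho(E):
        E = np.asarray(E, dtype=float)
        lamE = 2.0 * np.sqrt(A * E) * np.polyval(coeff, E)   # extrapolated lambda(E)
        return rho0(E) * np.exp(-lamE)                       # eqs. (7)-(8)

    return rho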

One of the main areas which requires a knowledge of the nuclear level density over a reasonably broad range of energies is the calculation of the thermal properties of nuclei. Such properties are calculated from the nuclear partition function Z(T) for fixed nuclear mass number A, where T is the nuclear temperature. The contribution to Z(T) of the continuous part of the nuclear spectrum is

Z(T) = \int_{E_{\min}}^{\infty} \rho(E)\, e^{-E/T}\, dE \qquad (9)

where E_min is some suitable lower limit.
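Given such a density, the continuum contribution of eq. (9) can be evaluated by straightforward numerical quadrature; a minimal sketch follows (the fixed E_max cutoff is our simplification of the procedure of ref. [9], and the helper name is hypothetical):

import numpy as np
from scipy.integrate import quad

def partition_function(rho, T, Emin=1.0, Emax=200.0):
    """Continuum contribution to Z(T), eq. (9), by numerical quadrature (MeV units).

    The infinite upper limit is replaced by a fixed Emax; the principled
    determination of Emax is described in ref. [9] and is not reproduced here.
    """
    value, _ = quad(lambda E: rho(E) * np.exp(-E / T), Emin, Emax, limit=200)
    return value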

Figure 1: Modified Lagrange multipliers for the case k(E) = 8. The circles indicate values computed with eq. (11); the solid line represents the fit with a polynomial of order zero, the dashed line the fit of order one, and the dash-dot line the fit of order two. The calculation of the error bars is explained in the text.

3. A Model Calculation

To test the proposed method we have performed semi-realistic model calculations, which allow us to better isolate the effects of varying parameters of the model (e.g., the number and width of the energy intervals) than would be the case if we used real experimental data. Rather than use actual experimental information for the number of states N_i in eq.


(3), we have generated the N_i by assuming that for energy E ≥ 1 MeV the experimental distribution of states is given by

\rho(E) = \frac{1}{\sqrt{48}\,E} \exp\left[ 2\sqrt{\frac{AE}{k(E)}} \right] \qquad (10)

where k = k(E) to allow the energy dependence of the density of states to deviate from the Bloch form. The size of the energy intervals E_{i+1} − E_i in eq. (3) was fixed at 0.4 MeV, with ten such intervals covering the energy range 1-5 MeV. For a particular choice of k(E) the Lagrange multipliers λ_i were calculated from eqs (3), (1), (2) and (5), with A arbitrarily set to 56 in eq. (2). A series of least-squares polynomial fits of low order was then performed for the modified Lagrange multipliers

\lambda'(\bar{E}_i) = \frac{\lambda(\bar{E}_i)}{2\sqrt{A\bar{E}_i}}. \qquad (11)

The many-body density of states and nuclear partition function were then calculated from eqs (8) and (9), respectively.

Figure 2: Calculated density of states, relative to the exact density of states, for the case k(E) = 8. The solid line is computed with λ fitted with a polynomial of order zero, the dashed line with the fit of order one, and the dash-dot line with the fit of order two.

A sample fit to λ'(Ē_i) is shown in Figure 1 for the choice

k(E) = 8 (12)


in eq. (10), corresponding to the Bloch formula (1) with k_0 = 8; a perfect fit should produce λ' = 0 identically. Since we take the number of states to be an integer, whereas the integral in (3) with the density of states given by (10) produces a real number, we do not expect to find all the Lagrange multipliers to be exactly zero. The error bars included in the figure were generated by arbitrarily changing the number of states N_i to N_i ± √N_i/2. They serve both to indicate the sensitivity of the calculation to uncertainties in the input, and also to provide a scale for λ'. The fit was also weighted according to these error bars. As Figure 1 indicates, the quality of the fit increases as the order of the polynomial is increased, but λ' = 4 × 10⁻⁵ (constant) is already satisfactory.
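The synthetic test just described is easy to reproduce; the sketch below (our own illustration, not the authors' code) generates the N_i from eq. (10), forms the error bars from N_i ± √N_i/2, and performs the weighted low-order polynomial fits to the λ' of eq. (11):

import numpy as np
from scipy.integrate import quad

A = 56
k = lambda E: 8.0                        # eq. (12); use lambda E: 6.0 + E / 2.0 for eq. (13)
rho_true = lambda E: np.exp(2.0 * np.sqrt(A * E / k(E))) / (np.sqrt(48.0) * E)   # eq. (10)
rho0 = lambda E: np.exp(2.0 * np.sqrt(A * E / 8.0)) / (np.sqrt(48.0) * E)        # eqs. (1)-(2), k0 = 8

edges = 1.0 + 0.4 * np.arange(11)        # ten 0.4 MeV bins covering 1-5 MeV
Emid = 0.5 * (edges[:-1] + edges[1:])
norm = np.array([quad(rho0, lo, hi)[0] for lo, hi in zip(edges[:-1], edges[1:])])

def lam_prime(counts):
    """Modified Lagrange multipliers of eq. (11) for a given set of bin counts."""
    lam = np.log(counts / norm)          # eqs. (3) and (5)
    return -lam / (2.0 * np.sqrt(A * Emid))

Ni = np.array([round(quad(rho_true, lo, hi)[0]) for lo, hi in zip(edges[:-1], edges[1:])])
lp = lam_prime(Ni)
sigma = np.abs(lam_prime(Ni + np.sqrt(Ni) / 2.0) - lp)   # error bars from N_i +/- sqrt(N_i)/2

for deg in (0, 1, 2):                    # weighted least-squares fits of order 0, 1, 2
    print(deg, np.polyfit(Emid, lp, deg, w=1.0 / sigma))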

Figure 3: Calculated partition function, relative to the exact partition function, for the case k(E) = 8. The solid line is computed with λ fitted with a polynomial of order zero, the dashed line with the fit of order one, and the dash-dot line with the fit of order two.

The corresponding many-body density of states was calculated from eq. (8) with E extrapolated beyond the region of fit, 1-5 MeV, to 20 MeV; the density relative to the input density of eq. (10) is shown in Figure 2. It is seen that λ' = constant provides an almost perfect fit for all energies. Polynomial fits of higher order, whilst giving a slightly better fit up to E = 5 MeV, cannot be extrapolated much beyond this energy.

The nuclear partition function was computed using eq. (9), with E_min = 1 MeV. Due to the rapid increase in the calculated density of states for higher order polynomial fits to the modified Lagrange multipliers, the infinite upper limit of the energy integration was replaced by E_max, the determination of which is described in detail in [9]. The results are plotted


in Figure 3, which shows Z(T) calculated with eq. (8) relative to the partition function calculated with the input density of states. The fit λ' = constant produces an almost perfect result up to T = 2 MeV, whereas the higher order fits are far from satisfactory except at the lowest temperatures.

Figure 4: Modified Lagrange multipliers for the case k(E) = 6 + E/2. The circles indicate values computed with eq. (8); the solid line represents the fit with a polynomial of order zero, the dashed line the fit of order one, and the dash-dot line the fit of order two. The calculation of the error bars is explained in the text.

These calculations were repeated with other choices for k(E) in eq. (10). For example, the use of

k(E) = 6 + E/2 \qquad (13)

is illustrated in Figures 4-6. This choice allows a reasonable energy dependence for the level density parameter, permitting the density of states to deviate sufficiently from the Bloch form to provide a realistic test of the method. (It should be noted that in practice the energy dependence of k(E) is somewhat constrained; too strong a dependence causes the density of states to increase rapidly at higher energies or to decrease suddenly to zero.) As seen from Figure 4, the lowest order polynomial which fits λ'(Ē_i) within the error bars is linear in energy. Figure 5 shows that this same fit provides a satisfactory description of the density of states well beyond the energy region in which the fit was produced. The quality of the calculated partition function, shown in Figure 6, is excellent.


Figure 5: Calculated density of states, relative to the exact density of states, for the case k(E) = 6 + E/2. The solid line is computed with λ fitted with a polynomial of order zero, the dashed line with the fit of order one, and the dash-dot line with the fit of order two.

The calculations have been checked for sensitivity to various details of the method, including the fitting procedure for the Lagrange multipliers and the binning of the energy levels. Overall, we have found the method to be remarkably robust to changes in these details. A more detailed discussion can be found in [9].

4. Conclusions

In the above, we have presented a simple information-theory-based approach to determining the nuclear many-body density of states from prior experimental measurements. By means of model calculations, we have shown that a good approximation to the density of states can be obtained even at higher energies, and that any energy dependence is determined from the prior experimental data. However, it should be noted that the extrapolated density of states will not necessarily be very accurate if processes occur at higher energies which are not reflected in the lower energy prior information. With this proviso, the model calculations show that the partition function can be accurately calculated over a reasonable temperature range, which suggests that the thermal properties of nuclei can be adequately described by this method.


Figure 6: Calculated partition function, relative to the exact partition function, for the case k(E) = 6 + E/2. The solid line is computed with λ fitted with a polynomial of order zero and the dashed line with the fit of order one.

Acknowledgements

BJC and HGM acknowledge the financial support of the Foundation for Research Development, Pretoria.

References

[1] H. A. Bethe, Phys. Rev. 50, 332 (1936).

[2] H. A. Bethe, Rev. Mod. Phys. 9, 69 (1937).

[3] H. A. Bethe, Phys. Rev. 53, 675 (1938).

[4] C. Bloch, Phys. Rev. 93, 1094 (1954).

[5] J. M. B. Lang and K. J. LeCouteur, Proc. Phys. Soc. (London) A67, 586 (1954).

[6] T. E. Ericson, Adv. Phys. 9, 425 (1960).

[7] D. H. E. Gross and R. Heck, Phys. Lett. B 318, 405 (1993).

[8] E. Erba, U. Facchini, and E. Saetta-Menichella, Nuov. Cim. 22, 1237 (1961).

[9] B. J. Cole, N. J. Davidson and H. G. Miller, Phys. Rev. C, in press.


A FRESH LOOK AT MODEL SELECTION IN INVERSE SCATTERING

Vincent A. Macaulay and Brian Buck
Theoretical Physics, University of Oxford
1 Keble Road, Oxford, OX1 3NP, England
vincent@thphys.ox.ac.uk and buck@thphys.ox.ac.uk

ABSTRACT. In this paper, we return to a problem which we treated at MaxEnt89 and show how the Bayesian evidence formalism provides new insights. The problem is to infer the charge distribution of a nucleus from noisy and incomplete measurements of its Fourier transform.

It is shown that one particular set of expansion functions is especially suitable. The Fourier-Bessel expansion can be derived from a variational principle involving the finite extent of nuclear charge and the sharp decline of its Fourier transform as a function of momentum transfer. We show expansions at different model orders and choose the one with the largest evidence.

The prior probability for the expansion coefficients is assigned by an application of Jaynes' principle of maximum entropy. The parameter introduced to satisfy the assumed constraint in the data can be marginalized or estimated from the posterior distribution. Alternatively, we show that it can usefully be assigned directly from a single macroscopic variable derived from the data. This approach removes the need for an optimization and is conceptually simple. In addition, it gives results which are indistinguishable from those that estimate the constant a posteriori. The prior is shown to play a useful role in preventing the overfitting of data in an over-parametrized model.

1. Introduction

In the 1950s, the internal structure of the atomic nucleus started to be probed with beams of high energy electrons. Such a technique is very powerful but the measured scattering cross-sections are not of direct use. For they are related by a potentially non-trivial transform to the physically more interesting charge distribution of the nucleus, the problem of inferring which is considered here.

The initial response to this inverse problem was to parametrize the nuclear shape in terms of certain of its gross features, e.g., its spatial extent and surface diffuseness (Hofstadter 1956). This provided a manageable but rather impoverished hypothesis space, in which the true shape almost certainly did not lie. However in that space, the parameters were well-determined and the least-squares method used to find them accurately approximated the Bayesian solution to this point estimation problem.

Subsequently (Borysowicz and Hetherington 1973), it became clear that a more flexible model could be employed to enlarge the space of conceivable shapes. Expansions of the density in sets of basis-functions were used, e.g. the Fourier-Bessel expansion employed by Dreher et al. (1974). However, within an orthodox statistical framework, it was not clear how to truncate the expansion or find useful error-bands around the reconstruction. Plausible, but essentially ad hoc, recipes were suggested to overcome the difficulties.


The present authors (Macaulay 1992; Macaulay and Buck, in prep.) increased the size of the hypothesis space still further and modelled the charge as a free-form density controlled by an entropic prior. One significant benefit is that error estimates are not restricted to a small model space. As a result, pointwise error-bands diverge but errors on large scale structures, such as the mean square radius or the predicted form factor, are finite and independent of the discretization of the model.

Here we reconsider the expansion of the density in a relatively small set of basis-functions and show how the Bayesian treatment works.

2. Scattering physics

A beam of electrons is directed at a nuclear target and the numbers emerging at selected scattering angles are recorded. These are divided by the numbers which would have been scattered at the same angles by a point charge to yield a quantity known as the form factor. This is expressed as a function of the momentum q transferred between the electron and nucleus which, as shown schematically in Fig. 1, is simply related to the scattering angle θ.

In general, the relation between the form factor F(q) and the charge density ρ(r) is non-linear. But for high energy electrons incident on light nuclei, where the Born approximation is applicable, the relation is linear, namely a 3D Fourier transform: F(q) = ∫ ρ(r) exp(iq·r) d³r. For a spherical nucleus, this simplifies further to

F(q) = \int_0^\infty \rho(r)\, j_0(qr)\, 4\pi r^2\, dr, \qquad (1)

where q = |q| and r = |r|.

The data for He4 are shown, with error-bars, in Fig. 2.

3. Modelling the density

The model we take is a Fourier-Bessel expansion (Dreher et al. 1974):

\rho(r) = \begin{cases} \sum_{n=1}^{N} c_n \phi_n(r), & r < R; \\ 0, & r \ge R, \end{cases} \qquad (2)

with

\phi_n(r) = \frac{Q_n}{\sqrt{2\pi R}}\, j_0(Q_n r), \qquad Q_n = \frac{n\pi}{R}. \qquad (3)

These functions are orthonormal with respect to the 3D volume element 4πr² dr. This implies an equivalent expansion of F(q) as Σ_n c_n Φ_n(q), where each Φ_n is the transform (1) of φ_n. There is some freedom in the choice of R; in the example below, it is taken as 5 fm.
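As a quick numerical illustration (our own sketch, not part of the original paper; j0 is implemented via np.sinc and R = 5 fm as in the text), the basis functions of eq. (3), their transforms under eq. (1), and the direct-inversion relation c_n = (2πR)^{-1/2} Q_n F(Q_n) used in Section 4 can be checked directly:

import numpy as np
from scipy.integrate import quad

R = 5.0                                           # expansion radius (fm), as in the text

def j0(x):
    return np.sinc(x / np.pi)                     # spherical Bessel j0(x) = sin(x)/x

def phi(n, r):
    """Normalized Fourier-Bessel basis function of eq. (3)."""
    Qn = n * np.pi / R
    return Qn / np.sqrt(2.0 * np.pi * R) * j0(Qn * r)

def Phi(n, q):
    """Its transform under eq. (1): Phi_n(q) = int_0^R phi_n(r) j0(qr) 4 pi r^2 dr."""
    return quad(lambda r: phi(n, r) * j0(q * r) * 4.0 * np.pi * r**2, 0.0, R)[0]

# orthonormality with respect to 4 pi r^2 dr:
overlap = lambda m, n: quad(lambda r: phi(m, r) * phi(n, r) * 4.0 * np.pi * r**2, 0.0, R)[0]
print(overlap(2, 2), overlap(2, 3))               # ~1.0 and ~0.0

# direct inversion at q = Q_n for the unit-coefficient basis function:
Q3 = 3 * np.pi / R
print(Phi(3, Q3) * Q3 / np.sqrt(2.0 * np.pi * R)) # ~1.0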

4. Motivation

There are a number of reasons for choosing Fourier-Bessel basis-functions.

• They form an orthonormal set.

• The transform Φ_n(q) of φ_n(r) has its peak near Q_n, suggesting that, for any n, information in the data about c_n comes from the form factor points with q-values near Q_n. Indeed, c_n could be determined once and for all if F(Q_n) were known exactly. For, direct inversion implies that c_n = (2πR)^{-1/2} Q_n F(Q_n). But rarely do any of the experimental qs coincide with Q_n-values, and even if they did, noise is always present in the corresponding data.


[Figure 1 diagram: q = p(in) − p(out), |q| = 2|p| sin(θ/2), |p| = |p(in)| = |p(out)|.]

Fig. 1: Schematic diagram of the scattering process. In practice, all the quantities should be converted to the centre of mass frame.

Fig. 2: Data on the He4 form factor. For sources of data, see Macaulay (1992). Note that what is accessible from experiment is F²(q). We have taken the square-root and used the obvious diffraction structure of F, with a minimum near q = 3 fm⁻¹, to reinstate the sign. This is to avoid having to use a non-linear model.

• They possess a concentration property which makes them very efficient for expanding the density.

5. Concentration

One of the most striking features of the data is the rapid decline of F(q) with increasing q, as can be seen from Fig. 2 (note the logarithmic axis). It seems sensible to expand in functions which share this property in q-space. However, because our model has ρ(r) strictly zero for r ≥ R, there is a restriction on how concentrated the Fourier transform can be. This suggests that we define the concentration of a function in a plausible way and then seek those functions which are maximally concentrated in q-space given the absolute cutoff in r-space. Ideas akin to these were developed by Slepian (1983).

We shall define the concentration C of a function, in particular F(q), by the reciprocal


of its mean square dispersion 𝒟 about zero, i.e.,

C^{-1} = \mathcal{D} = \frac{\int q^2 |F(q)|^2\, d^3q}{\int |F(q)|^2\, d^3q} = \frac{\int q^4 F^2(q)\, dq}{\int q^2 F^2(q)\, dq} \quad \text{(after angular integration)}. \qquad (4)

Then maximizing C is equivalent to minimizing 𝒟. Using (1) to express F(q) in terms of ρ(r), we can evaluate the functional derivative δ𝒟/δρ(r). Equating this to zero leads to the following equation for ρ(r):

\frac{1}{r}\frac{d^2}{dr^2}(r\rho) = -\mathcal{D}\,\rho, \qquad (5)

which has solutions ρ ∝ cos(𝒟^{1/2} r)/r or sin(𝒟^{1/2} r)/r. The cosine solutions are rejected because of their irregular behaviour at r = 0. In addition, the boundary condition ρ(R) = 0 means that 𝒟^{1/2} can only take on the discrete values Q_n = nπ/R with n = 1, 2, .... The normalized basis-functions appear in (3).
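The concentration property can also be verified numerically. Using Parseval's theorem, the dispersion (4) of a density vanishing at R can be evaluated in r-space as 𝒟 = ∫(dρ/dr)² 4πr² dr / ∫ρ² 4πr² dr, and for φ_n it should equal Q_n². The following short check is our own illustration (the r-space form of 𝒟 is an assumption justified by Parseval's theorem, not a formula quoted from the paper):

import numpy as np
from scipy.integrate import quad

R = 5.0

def phi(n, r):
    Qn = n * np.pi / R
    return Qn / np.sqrt(2.0 * np.pi * R) * np.sin(Qn * r) / (Qn * r)

def dphi(n, r):
    """Analytic radial derivative of phi_n for 0 < r < R."""
    Qn = n * np.pi / R
    x = Qn * r
    return Qn**2 / np.sqrt(2.0 * np.pi * R) * (x * np.cos(x) - np.sin(x)) / x**2

def dispersion(n):
    """D of eq. (4), evaluated in r-space via Parseval's theorem."""
    num = quad(lambda r: dphi(n, r)**2 * 4.0 * np.pi * r**2, 0.0, R)[0]
    den = quad(lambda r: phi(n, r)**2 * 4.0 * np.pi * r**2, 0.0, R)[0]
    return num / den

n = 3
print(dispersion(n), (n * np.pi / R)**2)    # both approximately Q_n**2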

6. Bayes inference

We shall be concerned with two levels of inference. These are:

• to estimate the expansion coefficients c for given model order, N. This is called parameter estimation. Additionally,

• to rank the suitability of each model order, otherwise called model selection.

Each of these levels of inference, in very different hypothesis spaces, is accomplished with an application of Bayes' theorem (posterior = prior × likelihood/evidence):

p(\mathbf{c} \mid D, N, \delta, K) = \frac{p(\mathbf{c} \mid N, \delta, K)\, p(D \mid \mathbf{c}, N, \delta, K)}{p(D \mid N, \delta, K)} \qquad (6)

p(N \mid D, \delta, K) = \frac{p(N \mid \delta, K)\, p(D \mid N, \delta, K)}{p(D \mid \delta, K)} \qquad (7)

As above, N states the model order, c the values of the coefficients. In addition, K expresses the relevant background knowledge and the class of models to be considered; δ stands for the values of any additional parameters introduced in the probabilities (which may not be relevant to them all); D represents the data, which are noisy measurements of F(q_p).

Notice that the normalizing denominator of the parameter estimation stage becomes the likelihood-like term during model selection.

6.1. Likelihood

Since the data consist of best estimates and mean dispersions, the likelihood p(D|c, N, K) is taken to be Gaussian:

p(D \mid \mathbf{c}, N, K) = \left[ \prod_p (2\pi\sigma_p^2)^{-1/2} \right] \exp\left( -\tfrac{1}{2} \sum_p \big(D_p - F(q_p)\big)^2 / \sigma_p^2 \right). \qquad (8)


6.2. Prior on coefficients

Because of the relation between ρ and F, there exists a Parseval-type integral expression between their squares:

\int \rho^2(r)\, 4\pi r^2\, dr = \frac{1}{(2\pi)^3} \int F^2(q)\, 4\pi q^2\, dq. \qquad (9)

With the substitution of the expansion (2) and (3), the LHS becomes Σ_n c_n². A fortiori, the above relation must hold when averaged over the prior distribution of

the coefficients p(c). Suppose that we could make a reproducible measurement of the RHS,

X, say. Then the above relation becomes a constraint on p(c):

\int \left( \sum_{n=1}^{N} c_n^2 \right) p(\mathbf{c})\, d^N c = X. \qquad (10)

In the presence of such a (testable) constraint, we employ the principle of maximum entropy (Jaynes 1968) to assign the most non-committal p(c) by maximizing −∫ p(c) log p(c) d^N c, subject to (10) and ∫ p(c) d^N c = 1. This produces the prior

p(\mathbf{c} \mid N, \delta) = (2\pi\delta^2)^{-N/2} \exp\left( -\frac{1}{2\delta^2} \sum_n c_n^2 \right) \quad \text{with} \quad \delta^2 = X/N, \qquad (11)

which has the interpretation that the 'power' in the data is evenly divided between the coefficients.

Three avenues are now open to us to complete the assignment of the prior. We can

• treat δ as a further parameter in our model, to be determined a posteriori; or

• marginalize (integrate) it out a posteriori as a nuisance parameter; or

• estimate X and hence δ directly from the available data. This is easy to do, but does require a preliminary coarse-grained look at the data set. We can approximate the RHS of (9) by a crude expression for the integral, such as the trapezium rule, where F is replaced by its observed values at the experimental qs (a minimal sketch of this assignment follows the list below). This method gives very similar results to those that are obtained by the first method, saving an optimization problem, and we recommend its use. It raises the possibility that other hyperparameters, such as the entropic α, might be approximately assigned almost a priori from some macroscopic observable.
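A minimal sketch of that third option (our own code; the function name and arguments are hypothetical) simply applies the trapezium rule to the RHS of (9) with the observed form-factor values and sets δ² = X/N as in eq. (11):

import numpy as np

def delta_from_data(q, F_obs, N):
    """Assign the prior width delta of eq. (11) directly from the data.

    X approximates (2 pi)**(-3) * int F^2(q) 4 pi q^2 dq  (the RHS of eq. (9))
    by the trapezium rule, with F replaced by its observed values at the
    experimental q points; then delta**2 = X / N.
    """
    y = F_obs**2 * 4.0 * np.pi * q**2
    X = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(q)) / (2.0 * np.pi)**3
    return np.sqrt(X / N)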

Although the second method is perhaps the most theoretically sound, we shall not use it here, as might have been anticipated by our leaving expressions (6) and (7) conditioned on δ. It turns out that δ is so well determined that, for estimating c, it is a good approximation to put δ equal to the value δ* that maximizes p(D|N, δ, K); this is valid if the prior on δ is reasonably flat. For estimating the model order, δ will also be put equal to δ*. For a discussion of the treatment of hyperparameters in model comparison, see MacKay (1992).

6.3. Remaining probability assignments

No model order will be preferred to any other a priori, and hence p(N|K) is assigned constant. It turns out that overly complex models are penalized in p(D|N, δ, K), providing us with a quantitative Ockham razor (MacKay 1992). The denominators of (6) and (7) are obtained by requiring normalized posterior probabilities at each stage.
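Since the model is linear in c and both the prior (11) and the likelihood (8) are Gaussian, the evidence p(D|N, δ) is a Gaussian integral with a closed form. The sketch below is our own rendering of that standard linear-Gaussian marginal likelihood, not the authors' code; G is a design matrix with entries G_{pn} = Φ_n(q_p):

import numpy as np

def log_evidence(D, sigma, G, delta):
    """log p(D | N, delta) for D = G c + noise, with c ~ N(0, delta^2 I)
    and independent Gaussian noise of standard deviations sigma."""
    C = delta**2 * (G @ G.T) + np.diag(sigma**2)     # marginal covariance of the data
    _, logdet = np.linalg.slogdet(C)
    quad_form = D @ np.linalg.solve(C, D)
    return -0.5 * (quad_form + logdet + len(D) * np.log(2.0 * np.pi))

# Model selection: compare log_evidence across model orders N (the number of
# columns of G), optimizing delta at each N as described in Section 6.2.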


7. Discussion of results

In Fig. 3 are shown reconstructions both of F(q) and ρ(r) at different model orders. The constant δ has been chosen to maximize the evidence p(D|N, δ, K) in each case. As expected, it requires larger model orders to describe the behaviour of F(q) as q increases. By N = 10, the data seem to be fitted satisfactorily. In fact, as we show below, N = 10 turns out to be the most probable model order. Beyond this, the coefficients of any additional basis-functions are not determined by the data, that is, the likelihood is insensitive to them. Thus they are essentially fixed by the prior p(c). Since this is centred on zero for each c_n, the estimates of these coefficients are driven to zero. This is what is sometimes called regularization. It emerges automatically from the method.¹

The error-bands shown on the RHS indicate expected deviations from the most probable density within the chosen hypothesis space. Hence it is possible to have very small error-bands, although the reconstruction is not very good (e.g., N = 2), if the model space is badly chosen.² The error-bands summarize only some of the information in the posterior distribution of ρ(r). For example, it is not the case that every function enclosed by the error-bands is reasonably probable. Many such functions are not even in the model space and hence have zero probability.

Note the increase in errors for N = 14. This arises from the four coefficients (n > 10) that are not fixed by the data. The most probable values of these are approximately zero, as we said above, but naturally they have dispersion about this value. And since the likelihood has nothing to say about them, that dispersion is inherited directly from the prior. The prior's width was fixed by dividing the 'power' in the data evenly between the coefficients. But, since c_n ∝ Q_n F(Q_n), c_n turns out to decrease rapidly with n. So the prior underestimated the likely spread of c_n for n small, but, more importantly, overestimated it for n large. Hence the implausibly large errors. To remedy this, it would be necessary to build a more structured prior, that assumed more about the general shape of F(q). A crude hypothesis about the smoothness of ρ has provided a starting point for this (Dreher et al. 1974; Macaulay and Buck 1990).

The evidence p(D|N, δ) as a function of the regularization parameter δ is shown in Fig. 4, for three different model orders N = 9, 10 and 11. Bearing in mind the logarithmic evidence scale,³ it is clear that p(D|N, δ) is quite sharply peaked around its maximum, and hence δ is fixed very precisely.

The qualitative shape of these curves should be noted. The rather rapid decline for δ < δ*, where the prior is overpowering the likelihood, represents an exponential penalty for not fitting the data very well. The more sedate decline for δ > δ* is a power-law penalty arising from spreading the prior too thinly over the parameter space.

¹ A non-Bayesian method such as least-squares does not display this attractive feature. Here, the extra freedom is used to fit the noise, leading to coefficients which can become arbitrarily large, and to useless reconstructions. Singular value decomposition can be used to discover the ill-determined model functions, but even then art is required to choose just which of these functions to exclude, that is, to fix the cutoff in singular values.

² Some of the motivation for the use of a free-form MaxEnt model (Macaulay 1992) was to obtain error estimates valid in a much more general space.

³ The scale is in decibels, that is 10 log₁₀[p(D|N, δ*)/p₀]. The reference level p₀ is taken as unity; it is only differences on the decibel scale that are significant here. Hence, a change of 3 db is a factor of about 2, and 10 db a factor of 10.

Fig. 3: Reconstructed F(q) (left) and ρ(r) (right) for N = 2, 6, 10, 14: reconstructions (solid lines), error-bands (dashed lines), and data points (crosses).

Fig. 4: The evidence (in db) as a function of the regularizing constant δ for model orders N = 9 (dash-dotted), 10 (solid) and 11 (dashed).

The power argument above suggested that the optimal δ should vary like N^{-1/2}. This is in qualitative agreement with the figure, which shows that the best δ does indeed decrease with increasing N.

In addition, it is clear that N = 10 is overwhelmingly the most probable model order, for any remotely probable δ. But near δ = δ*, model N = 10 beats N = 11 by about 30 db, and N = 9 by about 60 db. To put this in perspective, in Fig. 5, we show the evidence and the corresponding probability for model orders from 6 to 16. No model other than N = 10 even registers on the linear probability scale!

8. Conclusion

In this paper, we have shown how to infer a density function in terms of a set of basis-functions. The particular set used was motivated by a property of the data set. We showed how a hyperparameter occurring in the prior distribution could be estimated directly from the data. The optimal expansion order was determined from the evidence. Such ideas could be applied in other fields.

ACKNOWLEDGMENTS. VAM receives financial support from the SERC.

References

Borysowicz, J. and Hetherington, J. H. (1973). Errors on charge densities determined from electron scattering. Physical Review, C7, 2293-303.

Dreher, B., Friedrich, J., Merle, K., Rothhaas, H. and Lührs, G. (1974). The determination of the nuclear ground state and transition charge density from measured electron scattering data. Nuclear Physics, A235, 219-48.

Fig. 5: Left: the evidence (in decibels), as a function of model order, optimized over the regularizing constant δ at each N. Right: the posterior probability as a function of N.

Hofstadter, R. (1956). Electron scattering and nuclear structure. Reviews of Modern Physics, 28, 214-54.

Jaynes, E. T. (1968). Prior probabilities. IEEE Transactions on Systems Science and Cybernetics, SSC-4, Sept., 227-41.

Macaulay, V. A. (1992). Bayesian inversion with applications to physics. D. Phil. thesis. University of Oxford.

Macaulay, V. A. and Buck, B. (1990). Linear inversion by the maximum entropy method with specific non-trivial prior information. In Maximum entropy and Bayesian methods, Dartmouth, U.S.A., 1989 (ed. P. Fougere), pp. 273-80. Kluwer, Dordrecht.

Macaulay, V. A. and Buck, B. (in prep.). The determination of nuclear charge distributions using a Bayesian maximum entropy method.

MacKay, D. J. C. (1992). Bayesian methods for adaptive models. Ph.D. thesis. Caltech.

Slepian, D. (1983). Some comments on Fourier analysis, uncertainty and modeling. SIAM Review, 25, 379-93.


THE MAXIMUM-ENTROPY METHOD IN SMALL-ANGLE SCATTERING

Steen Hansen
Department of Mathematics and Physics, Royal Veterinary and Agricultural University, Thorvaldsensvej 40, DK-1871 FRB C, Denmark
and
Jürgen J. Müller
Max-Delbrück-Center for Molecular Medicine, Robert-Rössle-Strasse 10, D-13122 Berlin, Germany

ABSTRACT. The Maximum-Entropy method is applied to the determination of the distance distribution function in small-angle scattering. Alternative methods for this purpose suffer from problems caused by their ad hoc nature, but the Maximum-Entropy method has a well established theoretical foundation offering several advantages. Examples are given using simulated as well as experimental data. It is demonstrated that the "best" (most likely) choice of parameters, e.g. the noise level, the model or the regularisation method, in general can be found from the evidence in a Bayesian framework.

1. Introduction

The term small-angle scattering (SAS) is usually applied to the scattering of either X-rays (SAXS), neutrons (SANS) or light (SALS) through small angles.

Among its many possible applications, small-angle scattering is used for obtaining structural information about molecules in solution. The interest in performing solution experiments appears e.g. in biophysics, where it may be crucial to preserve the exact and functionally active structure of the biomolecule. The loss of information due to the random orientation of the molecules in the solution makes it important to extract the maximum information from the measured scattering profile. For interpretation of the experimental results it is often relevant to represent the scattering data in direct space, which requires a Fourier transformation of the data preserving the full information content of the experimental data. A direct Fourier transform is of limited use due to noise, smearing and truncation. Attempts to take these effects into account by indirect Fourier transformation (IFT) have been suggested in the literature (e.g. [10], [15], [24], [8], [18]). MaxEnt is well-suited for such underdetermined problems by its ability to take into account prior information in a logical and transparent way.

Using SAS and the various types of radiation mentioned above it is possible to obtain structural information about molecules having sizes from about 1 nm to about 1000 nm.


This region fits well with the sizes of many interesting biological molecules (proteins and viruses) and biological aggregation phenomena.

2. Theory

2.1. Small-Angle Scattering

In small-angle scattering the intensity I is measured as a function of the length of the scattering vector q = 4π sin(θ)/λ, where λ is the wavelength of the radiation and θ is half the scattering angle. For scattering from a dilute solution of monodisperse molecules of maximum dimension D, the intensity can be written in terms of the distance distribution function p(r) (see e.g. [11]):

I(q) = 4\pi \int_0^D p(r)\, \frac{\sin(qr)}{qr}\, dr. \qquad (1)

Approximating the distance distribution function p(r) by p = (p_1, ..., p_N) and measuring the intensity at discrete points q_i, eq. (1) becomes

I(q_i) = \sum_{j=1}^{N} A_{ij}\, p_j + e_i \qquad (2)

where e_i is the noise at data point i and the matrix A is given by A_{ij} = 4π Δr sin(q_i r_j)/(q_i r_j), where Δr = r_j − r_{j−1}. The aim of the indirect Fourier transformation is to restore p, which by virtue of the Fourier transform contains the full information present in the scattering profile. The distance distribution function is related to the density-density correlation γ(r) of the scattering length density ρ(r') by

(3)

where ρ(r') is the scattering contrast, given by the difference in scattering density between the scatterer ρ_sc(r') and the solvent ρ_so, i.e. ρ(r') = ρ_sc(r') − ρ_so, ⟨·⟩ means averaging over all orientations of the molecule, and V is the volume of the molecule.

For uniform scattering density of the molecule the distance distribution function is proportional to the probability distribution for the distance between two arbitrary scattering points within the molecule.

For non-uniform scattering density the distance distribution may have negative regions (if the scattering density of some region of the scatterer is less than the scattering density of the solvent). Also, high concentrations may give negative regions in the distance distribution function around the maximum size of the scatterer (see e.g. [11]). In these cases it is not possible to identify p(r) as being proportional to the probability distribution to be estimated when using MaxEnt. One very simple way around this problem is suggested in [22], assuming the probability distribution for p(r) to be a product of Gaussian distributions and applying MaxEnt to this probability distribution. Another perhaps more logical approach would be to assume that the probability distribution to be estimated was proportional to the scattering density for the molecule, which is non-negative at least for SAXS and SALS. If known, the scattering density of the solvent could then be subtracted before calculation of the distance distribution function or, if unknown, the scattering density of the solvent


could be found from the evidence. This is likely to preserve more information (and be more consistent with the physics of the problem) than allowing negative values for the distribution to be estimated directly.

For the case of polydispersity the size distribution for the molecules can be calculated from the scattering profile by a similar indirect Fourier transformation if the shape of the molecules is known (usually spheres are assumed). The mathematics of this problem is completely analogous to the problem of estimation of the distance distribution function for monodisperse systems (but at least the physics here does not allow the distribution to go negative). The actual calculations may cause additional difficulties, as the system of equations to be solved for calculation of a size distribution is often more ill-conditioned than for the determination of the distance distribution function (as will appear from a singular value decomposition of the transformation matrix). However, the numerical problems encountered are similar ([6], [7] and [16]).

2.2. Methods for IFT in SAS

First of all we mention briefly the alternatives to using MaxEnt for IFT in SAS.

Tikhonov & Arsenin
Historically the method of Tikhonov ([25]) plays an important role for IFT in SAS (see also [24]). Using this method the estimate for f is found by minimizing the expression

\chi^2 + \alpha\, \Omega(f, m, p) \qquad (4)

where α is a Lagrange multiplier, which is found by allowing the χ² to reach a predetermined value, and where the regularizer is given by the general expression

\Omega(f, m, p) = \| f - m \|^2 + p\, \| f' \|^2 \qquad (5)

The first term minimizes the deviation of f from the prior estimate m with respect to a given norm, and the second term imposes a smoothness constraint on the distribution to be estimated. The χ² is defined in the conventional manner, i.e., with the notation from above,

\chi^2 = \sum_i \frac{\big(I_m(q_i) - I(q_i)\big)^2}{\sigma_i^2} \qquad (6)

where I_m(q_i) is the measured intensity and σ_i is the standard deviation of the Gaussian noise at data point i.

A general problem for the application of the method of Tikhonov and Arsenin is how to choose the norm. Using p = 0 and a flat metric (thereby minimizing the square of the distance of the estimate f from a prior estimate m) as a constraint on IFT in SAS has previously been suggested by Glatter ([10]). However, this approach was then discarded as the influence of the Lagrange multiplier (balancing the importance of the χ² and the additional constraint) "was not negligible". But clearly it should not be negligible, as it governs the extent to which the prior estimate is allowed to influence the final solution. And by the nature of the problem, any method which is able to 'solve' underdetermined problems by imposing some additional constraint will of course influence the solution.
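For the flat-metric case (p = 0) the minimization of eqs. (4)-(5) is a linear problem with a closed-form solution. The sketch below is our own reading of that special case (assuming the functional minimized is χ² + α||f − m||²; the names are hypothetical):

import numpy as np

def tikhonov_flat(A, I_obs, sigma, m, alpha):
    """Flat-metric Tikhonov estimate: minimize chi^2 + alpha * ||f - m||^2.

    A      : forward matrix of eq. (2)
    I_obs  : measured intensities, sigma their standard deviations
    m      : prior estimate of the distribution, alpha the Lagrange multiplier
    """
    AtW = A.T / sigma**2                                  # A^T Sigma^-1
    lhs = AtW @ A + alpha * np.eye(A.shape[1])
    rhs = AtW @ I_obs + alpha * m
    return np.linalg.solve(lhs, rhs)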

Smoothness constraint
Neglecting the first part of Tikhonov and Arsenin's regularizing equation, the last term offers the ability to impose smoothness upon the solution.


The smoothness constraint can be expressed by writing the distance distribution function as a sum of smooth functions, e.g. cubic B-splines, p(r) = Σ_j a_j B_j(r); the sum A = Σ_j (a_{j+1} − a_j)² is then to be minimized subject to the constraint that the χ² takes some sensible value [10]. This problem leaves three parameters to be determined: the number of basis functions N, the maximum diameter used D, and the noise level α determining the relative weighting of the constraints from the data and the smoothness respectively.

Using this method the number of basis functions is chosen to be sufficient to accommodate the structure in the data (however, the algorithm most frequently used for IFT in SAS imposes a constraint of a maximum of 40 cubic B-splines on p(r)).

The maximum diameter D is found by plotting the forward scattering I(0) and the χ², both as a function of D, and then using a D found when I(0) has a "plateau" after the χ² has reached a sufficiently low value.

The noise level is found in a similar way by plotting ln A and again the χ², this time as a function of ln α; now a plateau in ln A has to be found when χ² has reached a low value. One practical problem with this method is that the needed plateaus may not exist.

Decomposition of p(r) into special functional systems
Another method for doing IFT in SAS is to write p(r) as above, but using basis functions with nice transformation properties ([15]). One example of this is to choose B_j(r) = 2r sin(πjr/D). Information theory gives the choice N = D q_max/π (which is often too small, so a larger N has to be used by an artificial extrapolation of the scattering curve). The maximum diameter used is found as above by the point-of-inflection method. Using this method there is no built-in stabilisation and it is very difficult to know to what extent the special choice of basis functions influences the result.

MaxEnt
The ability to include prior information in an easy, transparent and logically consistent way may be considered the major advantage of maximum entropy compared to other methods for data analysis. This includes the possibility of constraining the solution to be positive if it is known a priori that the scattering contrast has the same sign for all parts of the molecule and that no concentration effects are present. Certainly for estimation of size distributions the optional positivity constraint must be considered an advantage.

Restricting ourselves here to the situations where a positivity constraint may be imposed upon the distance distribution function (or size distribution), it can be inserted directly into the expression for the entropy (see e.g. [19] and [20])

S(\mathbf{p}, \mathbf{m}) = \sum_{j=1}^{N} \left[ -p_j \ln(p_j/m_j) + p_j - m_j \right] \qquad (7)

where m is a prior estimate of p. This is done through maximizing αS − χ²/2 subject to the constraint that the Lagrange multiplier α takes the most likely value (see below). The addition of a smoothness constraint to the maximum entropy method (in line with the method of Tikhonov and Arsenin) is of course also quite legitimate and has been tested previously [4].
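A minimal sketch of the MaxEnt estimate itself, maximizing αS − χ²/2 for fixed α with a generic bound-constrained optimizer (our own illustration; the authors' actual algorithm is not described here and may differ):

import numpy as np
from scipy.optimize import minimize

def maxent_ift(A, I_obs, sigma, m, alpha):
    """MaxEnt IFT estimate: maximize alpha * S(p, m) - chi^2 / 2, with S from eq. (7)."""
    def objective(p):
        S = np.sum(-p * np.log(p / m) + p - m)            # eq. (7)
        r = (I_obs - A @ p) / sigma
        return 0.5 * r @ r - alpha * S                    # minimize the negative
    def gradient(p):
        dS = -np.log(p / m)
        dchi2_half = -A.T @ ((I_obs - A @ p) / sigma**2)
        return dchi2_half - alpha * dS
    result = minimize(objective, x0=m.copy(), jac=gradient, method="L-BFGS-B",
                      bounds=[(1e-12, None)] * len(m))    # positivity constraint
    return result.x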

Using p = 0 and the (entropy) metric m^{-1/2} as the norm, Tikhonov and Arsenin's method is equal to a second-order approximation to MaxEnt (in m).


The estimation of the errors can be done in the conventional way using the curvature matrix for the χ², but including the additional entropy regularization term. However, when using regularization, care has to be taken to avoid dominance of the regularizing term in the error matrix, leading to underestimation of errors (see [9]).

As mentioned above, according to information theory the number of independently determinable parameters in a small-angle scattering experiment is given by q_max D/π. However, this number does not take into account the noise in the experiment. A better way to estimate the number of parameters which can be determined from a given set of data is described in [12], from the eigenvalues λ_i of the curvature matrix for the χ² viewed in the entropy metric. Directions measured well will give a steep χ² surface and consequently a large eigenvalue. The number of "good" directions N_g is determined from

N_g = \sum_i \frac{\lambda_i}{\lambda_i + \alpha}. \qquad (8)

The choice of parameters like the most likely value for the Lagrange multiplier α or the maximum diameter D of the scatterer is now done within a Bayesian framework, allowing error bars to be calculated for the parameters. This is done by maximizing the probability of measuring the data conditional on some model or hypothesis. This probability is termed the 'evidence' for the hypothesis (see [21] and [13]). For example, using the evidence to find the most likely value for α gives the equation

-2\alpha S = N_g \qquad (9)

from which α can be determined.
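In practice this becomes an iteration: solve the MaxEnt problem at the current α, compute the eigenvalues λ_i, evaluate N_g from eq. (8), and update α from eq. (9). The sketch below is our own illustration; it takes the eigenvalues from the χ²/2 curvature scaled by √p on both sides, the usual entropy-metric choice in quantified MaxEnt (an assumption, since the paper does not spell out the factor), and reuses the maxent_ift sketch above:

import numpy as np

def update_alpha(A, sigma, p, m, alpha):
    """One step of the alpha iteration, eqs. (8)-(9)."""
    B = (A.T / sigma**2) @ A                              # curvature of chi^2 / 2
    M = np.sqrt(p)[:, None] * B * np.sqrt(p)[None, :]     # viewed in the entropy metric
    lam = np.linalg.eigvalsh(M)
    Ng = np.sum(lam / (lam + alpha))                      # eq. (8)
    S = np.sum(-p * np.log(p / m) + p - m)                # entropy of eq. (7), S <= 0
    return -Ng / (2.0 * S)                                # eq. (9): -2 alpha S = N_g

# alpha = 1.0
# for _ in range(20):                      # alternate until alpha converges
#     p = maxent_ift(A, I_obs, sigma, m, alpha)
#     alpha = update_alpha(A, sigma, p, m, alpha)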

3. Results

For explicit comparison of some of the methods referred to above and for previous results of the application of MaxEnt to IFT in SAS, see [8] and [18]. Here we apply MaxEnt to the determination of the distance distribution function from simulated data and demonstrate the recent developments in the theory by using the evidence to determine the noise level, the maximum size of the scatterer (or, in general, to choose the best model) as well as the best (most likely) method for regularisation of the given data. Finally we give an example from high-resolution SAXS using experimental data.

Simulated data:
In Fig. 1a are shown simulated data calculated by [14] from an object consisting of 8 spheres. The result of an IFT using MaxEnt with a spherical prior is shown in Fig. 1b with the original distance distribution function calculated directly from the model of the scatterer. Having no prior information at all about the shape of the scattering molecule, a sphere would be a sensible first estimate. Compared to the original distance distribution function, MaxEnt reproduces the two peaks well.

In Fig. 1c is shown how the noise level for the simulated data can be found from the evidence or posterior probability for the Lagrange multiplier α. The noise was simulated to give a value for the (reduced) chi-square equal to 1, and the Bayesian estimate gives a slightly closer fit, as is seen from Fig. 1d. If the number of good parameters to be determined from

Figure 1: a) Simulated data (error bars) and MaxEnt fit (full line). b) MaxEnt estimate corresponding to a) (full line) and original distance distribution function (dashed). c) Evidence for α. d) χ² as a function of α. e) The number of good parameters N_g as a function of α. f) The evidence as a function of the diameter of the molecule.

Figure 2: a) Simulated data (error bars) and MaxEnt fit (full line). b) MaxEnt estimate corresponding to a) (full line) and original distance distribution function (dashed).

the data N_g is small, a prior distribution for α may have to be included in the calculation of the most likely value for the posterior, changing the criterion in eq. (9) slightly (see [3]). This was tested for the present example and resulted in only a negligible change in the result.

From Fig. 1e it appears that it would be possible to determine a maximum of 5 good parameters from the simulated small-angle scattering spectrum shown in Fig. 1a (using q_max D/π gives an ideal, or maximum, number of 7 parameters to be determined). This low information content is characteristic of most SAS experiments. This is also a reason why MaxEnt might be especially relevant to solving structural problems using SAS, as it is often necessary to include some kind of prior information or assumption in the analysis. Often this is done through least-squares fitting of some alternative models to the measured data, but MaxEnt offers another, more flexible option for the analysis. Furthermore, prior information from other experimental techniques like e.g. electron microscopy is frequently available, and for most of the structural problems studied by SAS it would be a fruitless task to try to solve them using SAS only.

In Fig. 1f is shown the evidence for the data when the diameter of the prior is varied; a value of about 76 A has the maximum probability. Furthermore, an error estimate on this value can be given from the figure, demonstrating yet another appealing feature of MaxEnt in a Bayesian framework.

Comparing MaxEnt with the results obtained by using a smoothness criterion like that described above and calculating the evidence for each of the two methods for regularisation gives a higher evidence for MaxEnt than for the smoothness criterion.

An example where it is the other way around is shown in Fig. 2. In Fig. 2a are shown the simulated data from a very elongated object, and in Fig. 2b the corresponding MaxEnt estimate using a spherical prior. In this case the spherical prior is so far off the original structure that the evidence is much higher for the smoothness constraint, which does not to such a large degree try to pull the estimate away from the true value. With the information that the smoothness constraint gave higher evidence, another more sensible prior can be tried, once more demonstrating that failure is an opportunity to learn ...



Figure 3: a) 5SrRNA high-resolution experimental SAXS data (dots) and MaxEnt fit to the data (full line). b) MaxEnt estimate of the distance distribution function (full line) corresponding to the data shown in Fig. 3a. The low-resolution prior is shown as long dashes. Short dashes show the p(r) for a single idealized A-RNA double helix (see text).

Experimental data: For the experimental data from E. coli 5SrRNA shown in Fig. 3a (from [17]), the corresponding MaxEnt estimate is shown in Fig. 3b. Here we used a prior derived from the experimental data. With direct methods, the experimental scattering curve was extrapolated by Porod's power law from 6.4 nm⁻¹ to infinity (for reduction of truncation errors) and then transformed using Eq. (1). The resulting p(r) function is of low resolution; only the overall shape is described by the distribution function. The ideal number of independently determinable parameters (from q_max·D/π) is 25 for the first part of the scattering curve (up to 6.4 nm⁻¹). Using MaxEnt with the low-resolution p(r) as prior for the scattering curve up to q = 16 nm⁻¹, we obtain a distance distribution with distinct features. From this last calculation the number of good parameters N_g is 21 against the ideal 60. The ripples in p(r) are not artefacts: they are also visible in the p(r) of a single idealized A-RNA double helix, when using the atomic model derived from fiber structure analysis [1] (see Fig. 3b). The 5SrRNA molecule does contain such double-helical strands (e.g. [2]). By distance analysis one can see that the maxima are caused by distances within the sugar-phosphate backbone of the RNA molecule.

4. Conclusion

Using MaxEnt provides a theoretically well-founded method for regularizing the IFT in SAS. Models, or parameters in models (e.g. the maximum diameter of the scatterer), can be determined by their evidence, and furthermore error bars can be found for these estimates. The amount of information in a set of experimental data can be quantified by calculating the number of good directions in parameter space. The noise level can be determined by the evidence. Prior knowledge (which is often available in some form when using SAS) can be included in the analysis in a transparent way.

ACKNOWLEDGMENTS. This work was supported by a grant from the Deutsche Forschungsgemeinschaft (Mu989/1-1).

References

[1] Arnott, S., Hukins, D.W.L., Dover, S.D.: 1972, 'Optimized Parameters for RNA Double-Helices', Biochem. Biophys. Res. Commun. 48, 1392-1399.

[2] Brunel, Ch., Romby, P., Westhof, E., Ehresmann, Ch., Ehresmann, B.: 1991, 'Three-dimensional Model of Escherichia coli Ribosomal 5S RNA as Deduced from Structure Probing in Solution and Computer Modelling', J. Mol. Biol. 221, 293-308.

[3] Bryan, R.K.: 1990, 'Maximum entropy analysis of oversampled data problems', Eur. Biophys. J. 18, 165-174.

[4] Charter, M.K. and Gull, S.F.: 1991, 'Maximum Entropy and Drug Absorption', J. Pharmacokin. Biopharm. 19, 497-520.

[5] Damaschun, G., Müller, J.J., Bielka, H.: 1979, 'Scattering Studies of Ribosomes and Ribosomal Components', Methods in Enzymology 59, 706-750.

[6] Potton, J. A., Daniell, G. J. & Rainford, B. D.: 1988a, 'Particle Size Distributions from SANS Using the Maximum Entropy Method', J. Appl. Cryst. 21, 663-668.

[7] Potton, J. A., Daniell, G. J. & Rainford, B. D.: 1988b, 'A New Method for the Determination of Particle Size Distributions from Small-Angle Neutron Scattering', J. Appl. Cryst. 21, 891-897.

[8] Hansen, S. & Pedersen, J. S.: 1991, 'A Comparison of Three Different Methods for Analysing Small-Angle Scattering Data', J. Appl. Cryst. 24, 541-548.

[9] Hansen, S. & Wilkins, S. W.: 1994, 'On uncertainty in maximum-entropy maps and the generalization of "classic MaxEnt"', Acta Cryst. A50, 547-550.

[10] Glatter, O.: 1977, 'A New Method for the Evaluation of Small-Angle Scattering Data', J. Appl. Cryst. 10, 415-421.

[11] Glatter, O.: 1982, in Small Angle X-ray Scattering, edited by O. Glatter and O. Kratky. Academic Press, London.

[12] Gull, S.F.: 1989, 'Developments in Maximum Entropy Data Analysis', in Maximum-Entropy and Bayesian Methods, edited by J. Skilling, pp. 53-71. Kluwer Academic Publishers, Dordrecht.

[13] MacKay, D.J.C.: 1992, 'Bayesian Interpolation', in Maximum Entropy and Bayesian Methods, Seattle, 1991, edited by C. R. Smith et al., pp. 39-66. Kluwer Academic Publishers, Netherlands.

[14] May, R. P. & Nowotny, V.: 1989, 'Distance Information Derived from Neutron Low-Q Scattering', J. Appl. Cryst. 22, 231-237.

[15] Moore, P. B.: 1980, 'Small-Angle Scattering. Information Content and Error Analysis', J. Appl. Cryst. 13, 168-175.

[16] Morrison, J. D., Corcoran, J. D. & Lewis, K. E.: 1992, 'The Determination of Particle Size Distributions in Small-Angle Scattering Using the Maximum-Entropy Method', J. Appl. Cryst. 25, 504-513.


[17] Müller, J.J., Zalkova, T.N., Zirwer, D., Misselwitz, R., Gast, K., Serdyuk, I.N., Welfle, H., Damaschun, G.: 1986, 'Comparison of the structure of ribosomal 5S RNA from E. coli and from rat liver using X-ray scattering and dynamic light scattering', Eur. Biophys. J. 13, 301-307.

[18] Müller, J. J. & Hansen, S.: 1994, 'A Study of High-Resolution X-ray Scattering Data Evaluation by the Maximum-Entropy Method', J. Appl. Cryst. 27, 257-270.

[19] Skilling, J.: 1988, 'The Axioms of Maximum Entropy', in Maximum-Entropy and Bayesian Methods in Science and Engineering (Vol. 1), edited by G. J. Erickson and C. Ray Smith, pp. 173-187. Kluwer Academic Publishers, Dordrecht.

[20] Skilling, J.: 1989, 'Classical Maximum Entropy', in Maximum-Entropy and Bayesian Methods, edited by J. Skilling, pp. 42-52. Kluwer Academic Publishers, Dordrecht.

[21] Skilling, J.: 1991, 'On parameter estimation and quantified MaxEnt', in Maximum-Entropy and Bayesian Methods, edited by Grandy and Schick, pp. 267-273. Kluwer Academic Publishers, Dordrecht.

[22] Steenstrup, S. & Hansen, S.: 1994, 'The Maximum-Entropy Method without the Positivity Constraint - Applications to the Determination of the Distance-Distribution Function in Small-Angle Scattering', J. Appl. Cryst. 27, 574-580.

[23] Svergun, D. I.: 1992, 'Determination of the Regularization Parameter in Indirect-Transform Methods Using Perceptual Criteria', J. Appl. Cryst. 25, 495-503.

[24] Svergun, D. I., Semenyuk, A. V. & Feigin, L. A.: 1988, 'Small-Angle-Scattering-Data Treatment by the Regularization Method', Acta Cryst. A44, 244-250.

[25] Tikhonov, A. N. & Arsenin, V. Ya.: 1977, Solution of Ill-Posed Problems, New York: Wiley.


MAXIMUM ENTROPY MULTI-RESOLUTION EM TOMOGRAPHY BY ADAPTIVE SUBDIVISION *

Li-He Zou, Zhengrong Wang, and Louis E. Roemer
Department of Electrical Engineering, Louisiana Tech University, Ruston, LA 71272, USA

ABSTRACT. Audio-band electromagnetic (EM) waves have great potential for success in borehole-to-borehole or surface-to-borehole tomography for geophysical exploration or environmental tests. Low resolution is generally the major limitation in EM tomography. If high resolution is sought and a least-squares error criterion is applied, many artifacts with random patterns appear in the reconstructed image. A maximum entropy constraint can certainly reduce the artifacts; however, the conflict between high resolution and few artifacts still exists. This paper proposes an adaptive procedure which produces a tomographic image with different resolution in different subdivisions, according to the details each subdivision may possess. The procedure avoids unnecessary resolution in areas where no interesting details appear, while providing high resolution in areas where interesting details may occur. Thus, the artifacts can be reduced to a minimum. Computer simulations comparing the proposed method with the least-squares error method and a conventional maximum entropy method show that the proposed method can produce higher-resolution images with significantly reduced artifacts. All experimental results are encouraging and show great potential for practical applications.

1. Introduction

Tomographic imaging has revolutionized medical X-ray diagnostics and is now a valuable technique in geophysics too. Tomographic images have been constructed from both seismic and electromagnetic (EM) measurements. In EM tomographic imaging, an electromagnetic signal is transmitted and received on an array surrounding the area of interest. Then a tomographic reconstruction process is applied which is simply an objective and systematic way to fit the measured data.

A wide range of signals has been tested for EM tomographic imaging. The spectrum covers from as low as 1 Hz to as high as 100 MHz. However, since applications of high-frequency electromagnetic waves to subsurface detection have been disappointing, owing to severe attenuation of the source waves by the medium, low-frequency (audio-band) electromagnetic waves have been found successful in tomographic imaging. The signal can propagate several hundred meters through subsurface ground. At these lower frequencies, the magnetic permeability of the medium can be treated as constant; only the resistivity of the medium needs to be determined. The subsurface resistivity distribution can be computed with very high accuracy if the reconstruction model is adequately suited

This work is partly supported by the Louisiana Education Quality Support Fund under grant 32-4121-40437, and partly by the American Gas Association under contract PR222-9218. This work also received support from the National Center for Supercomputing Applications (NCSA) at Urbana-Champaign under grant TRA930024N.



to the medium. Audio-frequency cross-borehole and surface-to-borehole resistivity testing techniques have attracted substantial attention since the late 1980s for petroleum reservoir characterization and monitoring [1].

In recent years a new technology, trenchless drilling, has found great demand from the petroleum and gas pipeline industries. The conventional open-ground technique for installing pipelines has been restricted by increasing concern for environmental protection, and trenchless drilling for pipeline installation has experienced rapid growth. However, it has been reported that many trenchless drilling projects fail when the drill head is blocked by some obstacle underground. Therefore, the detection of underground obstacles along the planned drilling path is a crucial need for a successful trenchless drilling project [2]. Electromagnetic (EM) cross-borehole tomography has great potential for aiding trenchless drilling. However, unlike petroleum reservoir characterization, the detection of obstacles for trenchless drilling needs much higher resolution, especially in the horizontal direction. This requirement makes the reconstruction problem very ill-posed. Our work is dedicated to solving such a severely ill-posed reconstruction problem by using an adaptive-subdivision maximum entropy method.

2. Description of System

In a cross-borehole system, the transmitter and receiver are placed in boreholes. There are two basic types of antenna and receiver sensor: the electric dipole and the magnetic dipole.

Figure 1: Borehole-to-borehole survey system (schematic showing current source, transmitter, ammeter, voltmeter, receiver, electrodes, ground surface and boreholes)


For the electric dipole, a number of electrodes are placed in each hole in electrical contact with the formation, as shown in Fig. 1. Two adjacent electrodes are driven by a known current, and the resulting voltage difference is measured between all other adjacent pairs of electrodes (in each of the boreholes). Then the known current is applied to two other adjacent electrodes and the voltage is again measured between all other adjacent pairs. The process is repeated until current has been applied to all pairs of adjacent electrodes (in both boreholes). A transfer resistance is the ratio of the voltage at one pair of terminals to the current causing it. The measured transfer resistances then go through a reconstruction algorithm to generate a tomographic image.
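A schematic sketch of this measurement loop is given below; the function names and the pairing scheme are illustrative placeholders rather than the actual acquisition software.

```python
# Sketch of assembling transfer resistances for the electric-dipole survey:
# drive each adjacent electrode pair in turn and record V/I at every other
# adjacent pair.  measure_voltage is a placeholder for the field measurement.
def survey(n_electrodes, drive_current, measure_voltage):
    pairs = [(k, k + 1) for k in range(n_electrodes - 1)]   # adjacent electrode pairs
    transfer_resistance = {}
    for src in pairs:
        for rcv in pairs:
            if rcv != src:
                v = measure_voltage(src, rcv)
                transfer_resistance[(src, rcv)] = v / drive_current
    return transfer_resistance   # data fed to the reconstruction algorithm
```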

For the magnetic dipole, the configuration is the same as shown in Fig. 1, with the electrodes replaced by vertical-axis coils. Usually the transmitter is placed in one borehole and the receiver in the other. The transmitted current is detected by induction, and the transfer resistance is measured as the ratio of the received inductive voltage to the transmitted current. A similar reconstruction algorithm is applied to obtain the tomographic image.

3. Modelling the Problem

The propagation of an EM wave from transmitter to receiver is described by Maxwell's equations

∇ × E + ∂B/∂t = 0    (1)

∇ × H − ∂D/∂t = J + J_s    (2)

∇ · B = 0    (3)

∇ · D = ρ    (4)

where J is the electric current density in A/m², J_s is the source current density, and ρ is the electric charge density in Coulombs/m³. There are also three basic constitutive relations, D = εE, B = μH, and J = σE, in which ε, μ and σ indicate, respectively, the dielectric permittivity, the magnetic permeability, and the electric conductivity.

In the frequency domain, by Fourier transformation, Maxwell's equations can be further reduced to the Helmholtz equation [3]

∇²E + k²E = 0    (5)

where k² = μεω² − iμσω. At low frequencies, say less than 10⁵ Hz, μεω² ≪ μσω for most earth materials, so in solving the Helmholtz equation it is accurate enough to consider only the variation of σ. This observation makes the equations easier to solve. Even so, the equations cannot be solved unless they are discretized; to solve them digitally, the medium must be appropriately modelled.
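The sketch below checks this low-frequency approximation numerically; the permittivity, conductivity and frequency are assumptions chosen only to represent a typical audio-band earth-material case.

```python
import numpy as np

# Compare the displacement term mu*eps*w^2 with the conduction term mu*sigma*w
# in k^2 = mu*eps*w^2 - i*mu*sigma*w, for assumed, typical earth parameters.
mu = 4e-7 * np.pi            # magnetic permeability [H/m] (free-space value)
eps = 10.0 * 8.854e-12       # permittivity [F/m], assumed relative permittivity 10
sigma = 0.01                 # conductivity [S/m], assumed (~100 ohm-m earth)
f = 1.0e3                    # audio-band frequency [Hz]
w = 2.0 * np.pi * f

ratio = (mu * eps * w**2) / (mu * sigma * w)
print(f"displacement/conduction ratio = {ratio:.1e}")   # << 1, so only sigma matters
```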

A one-dimensional model is a layered model which assumes that the area between the two boreholes is comprised of many horizontal layers (as shown in Fig. 2). In each layer the material is assumed to be electromagnetically homogeneous. If the EM parameters of each layer are known, Maxwell's or Helmholtz's equations for the EM wave propagation can be solved and the voltage at the receiver point can be calculated. If the calculated result


Figure 2: One-dimensional layered model

Figure 3: Two-dimensional model

matches the measured data, the EM parameters (resistivity or conductivity) of layers can be shown as the tomographic image of the subsurface structure.

In a two-dimensional model, it is assumed that the subsurface has a block structure as shown in Fig. 3. The third dimension of each block is assumed infinite. With this model the tomographic image can show not only vertical but also horizontal resolution; however, the horizontal resolution is usually much poorer than the vertical. In a similar way, a three-dimensional model can be visualized as in Fig. 4.

Of course, a multi-dimensional model can show more detail than a one-dimensional one, but the computational difficulty increases dramatically with the number of dimensions. At present, general multi-dimensional imaging is still at the development stage.

4. Least-Square (LS) Solution

By modelling the medium and using numerical techniques, the propagation equations can finally be written in matrix form [4][5] as Ŷ = AX. Here X is the EM parameter vector of the medium, X = [x₁ x₂ … xₙ]ᵗ, where xᵢ is the conductivity (or resistivity) of the ith cell and [·]ᵗ indicates vector transposition. Ŷ is the estimate of the signal intensity vector at the receiver array calculated from the propagation equations. The matrix A is a constant matrix depending upon the test configuration as well as the model structure.

Figure 4: Three-dimensional model

In the forward problem, given a model, the matrix A can be calculated. Then, assigning any value to the parameter vector X, we can estimate the received signal Ŷ. In the inverse problem, the real received signal Y is obtained from site measurements, and X is the unknown which needs to be solved for to best match the measurement Y in some sense of optimization. In the LS method, an error vector is defined as the difference between Y and Ŷ, b = Y − Ŷ = Y − AX, or

Y = AX + b    (6)

The least-squares error solution is obtained by solving the constrained optimization

min_X ‖Y − AX‖²   subject to X > 0    (7)

where ‖·‖ indicates the Euclidean norm. This is a typical constrained optimization problem and can be solved by an iterative routine.
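A minimal numerical sketch of problem (7) on synthetic data is shown below; it uses a standard non-negative least-squares routine (which enforces X ≥ 0 rather than strictly X > 0), and the sizes and noise level are assumptions taken loosely from the simulation in Section 7.

```python
import numpy as np
from scipy.optimize import nnls

# Constrained LS of eq. (7): minimize ||Y - A X||^2 subject to X >= 0,
# on a small synthetic, highly under-determined system (m << n).
rng = np.random.default_rng(0)
m, n = 64, 256
A = rng.normal(size=(m, n))
X_true = np.abs(rng.normal(size=n))
Y = A @ X_true + 0.01 * rng.normal(size=m)

X_ls, resid = nnls(A, Y)        # one of many non-negative vectors fitting the data
print(resid, np.count_nonzero(X_ls))
```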

The main weakness of the LS solution is the conflict between the resolution requirement and the artifacts occurring in the resultant tomographic image. The resolution performance depends on the fineness of the model grid. When high resolution is sought, the dimension of the vector X is high and may greatly exceed the dimension of the measurement Y, making equation (6) highly under-determined. The solution of (7) then becomes uncertain and can be expressed as

(8)


where (·)⁺ indicates the Moore-Penrose pseudoinverse of a matrix [6], and g is an arbitrary vector with the same dimension as Y, provided X > 0. Different answers can be obtained from different initial guesses (initial values assigned to X). Even when an additional constraint, such as a minimum-norm constraint, is added to the solution, the answer still may not be better than the others.

5. Maximum Entropy Principle with Adaptive Subdivision

The reason for the uncertainty in the LS solution is the lack of reasonable constraints; in other words, the LS method does not fully utilize all the information about the solution. A better way can be found through Bayes' theory and the maximum entropy principle [7][8]. Using the maximum entropy principle to assign the probability distribution of X, we assume that b in (6) represents the measurement noise, which is supposed to be additive with known covariance R_b = σ²I.

Some global information on X can be imposed in the form

E[S(X)] = s,   E[H(X)] = h    (9)

where

S(X) = Σᵢ₌₁ⁿ xᵢ    (10)

is the total intensity of the parameter X, and

H(X) = −Σᵢ₌₁ⁿ xᵢ ln xᵢ,   or   −Σᵢ₌₁ⁿ ln xᵢ    (11)

is the structural entropy of the parameter X [9]. Knowing a priori the noise variance and the constraints (9), and applying the maximum entropy principle, we obtain

p(Y|X) = C exp[−Q(X)]    (12)

and

p(X) = B exp[−λH(X) − μS(X)]    (13)

where

Q(X) = [Y − AX]ᵗ R_b⁻¹ [Y − AX]    (14)

and λ and μ are Lagrange multipliers. Applying Bayes' rule, this becomes

p(X|Y) = C exp{−[Q(X) + λH(X) + μS(X)]}    (15)

Thus, the MAP solution is given by

X̂ = arg min_{X>0} [Q(X) + λH(X) + μS(X)]    (16)

This solution is equivalent to the solution of the following constrained optimization:

min_{X>0} Q(X)   subject to   S(X) − s = 0,   H(X) − h = 0    (17)
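As a minimal sketch of criterion (16), the snippet below minimizes Q(X) + λH(X) + μS(X) over X > 0 on synthetic data with a general-purpose bounded optimizer; the matrix A, the noise level and the multiplier values are illustrative assumptions (in the paper λ and μ are estimated as described next).

```python
import numpy as np
from scipy.optimize import minimize

# MAP criterion (16) with Q(X) = (Y - AX)^t R_b^{-1} (Y - AX), R_b = sigma^2 I,
# H(X) = -sum x_i ln x_i and S(X) = sum x_i, solved on a toy problem.
rng = np.random.default_rng(1)
m, n = 16, 64
A = np.abs(rng.normal(size=(m, n)))
X_true = np.abs(rng.normal(size=n)) + 0.1
sigma = 0.05
Y = A @ X_true + sigma * rng.normal(size=m)
lam, mu = 0.1, 0.01                      # Lagrange multipliers (assumed fixed here)

def objective(X):
    r = Y - A @ X
    Q = (r @ r) / sigma**2
    H = -np.sum(X * np.log(X))
    S = np.sum(X)
    return Q + lam * H + mu * S

x0 = np.full(n, X_true.mean())           # flat, positive starting map
res = minimize(objective, x0, method="L-BFGS-B", bounds=[(1e-8, None)] * n)
print(res.fun, res.x.min())
```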


The Lagrange multipliers λ and μ can be obtained by calculating the partition function

Z = ∫ exp[−λH(X) − μS(X)] dX    (18)

and by solving the following system of equations:

∂Z/∂μ = s,   ∂Z/∂λ = h    (19)

However, λ and μ can be more easily estimated from the empirical mean and variance of X. The empirical mean of X is

e_x = (1/n) Σᵢ₌₁ⁿ xᵢ    (20)

and the variance is

v_x = (1/n) Σᵢ₌₁ⁿ (xᵢ − e_x)²    (21)

Suppose the dimension of X is large enough and that p(X) is symmetric in the xᵢ, i.e. p(X) = ∏ᵢ₌₁ⁿ p(xᵢ), so that we can reasonably assume

(22)

When H(X) is of the form −Σᵢ ln xᵢ, p(xᵢ) takes the corresponding single-variable form, and solutions for (22) can be found:

e_x = (λ + 1)/λ, …    (23)

Therefore, we find

Though e_x and v_x are still unknown, they can be further estimated from the empirical mean e_y = (1/m) Σᵢ₌₁ᵐ yᵢ and variance v_y = (1/m) Σᵢ₌₁ᵐ (yᵢ − e_y)² of the data. From equation (6) it can be seen that Σᵢ yᵢ = Σᵢ Σⱼ aᵢⱼ xⱼ + Σᵢ bᵢ, where Σᵢ bᵢ ≈ 0 because b is assumed to be zero-mean noise. So we have

e_x = m e_y / Σᵢ Σⱼ aᵢⱼ    (24)

Let E_x = E(X) = (e_x e_x … e_x)ᵗ and E_y = E(Y) = (e_y e_y … e_y)ᵗ; then E_y = AE_x + E(b) = AE_x, and

(25)


where Q is the pseudoinverse of AAᵗ.    (26)

Substituting (24) and (25) into (23), we can finally estimate λ and μ. Moreover, once λ and μ are known, the optimization (16) can easily be solved by iteration through nonlinear programming routines.

The maximum entropy principle greatly improves the quality of the tomographic image: artifacts are considerably reduced while resolution is improved. However, the conflict between high resolution and few artifacts still exists. To further reduce artifacts, an adaptive procedure is proposed in this paper which produces a tomographic image with different resolution in different subdivisions, according to the details each subdivision may possess. This procedure avoids unnecessarily high resolution in areas where no interesting details appear, while keeping high resolution in areas where interesting details may occur. Thus the dimension of the parameter vector X can be efficiently reduced; as a result, the ill-posedness of the reconstruction problem is considerably reduced and the artifacts can be kept to a minimum. This is a recursive approach, like a "process of proliferation of cells". At the first recursive stage, a coarse grid is generated in the model of the medium; for instance, four cells in a 2-D model may be a good starting point. Then the maximum entropy reconstruction process described above is applied and the first resultant image on the coarse grid is obtained. Since the number of unknowns is less than the number of measurements, i.e. the dimension of X is lower than the dimension of Y, equation (6) is overdetermined, so the result is very robust, with high reliability. However, the resolution at this stage is low. To refine the resolution, the process enters the recursive stage. In this stage, the solution X obtained in the previous stage is applied as the initial value for the present reconstruction and is also used to calculate λ and μ by (23), instead of estimating them through (24)-(25). To refine the grid, all cells in the model are placed in a refinement list (a queue). Each cell in the list is then chosen in turn and split in each dimension, making four subdivisions for a 2-D cell or eight subdivisions for a 3-D cell. The maximum entropy reconstruction algorithm is applied again on the new, refined grid. A test procedure monitors the change of the Bayesian risk function and determines whether further refinement is needed for the cell. If the test finds that no more resolution is necessary in this cell, the cell is dropped from the refinement list; otherwise the new cells are added to the list for further refinement in the next recursive pass. The refining process continues until the list is empty or a satisfactory reconstructed image is obtained. The recursive process can be summarized as follows (a schematic sketch is given after the list):

1. Set the initial data and put the medium, as one cell, into the refinement list.
2. Call the cell-splitting process.
3. Form a refined model.
4. Numerically solve Maxwell's (or Helmholtz's) equations (1)-(4) or (5).
5. Estimate λ and μ, (23)-(25).
6. Solve the maximum entropy reconstruction equation (16).
7. Test whether the cell needs further refinement. If not, skip step 8.
8. Add the new cells to the refinement list.
9. Check whether the list is empty. If yes, stop.
10. Take a new cell from the list and go to step 2.
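The following sketch outlines the refinement loop above; solve_forward, maxent_reconstruct and needs_refinement stand in for the forward solver of step 4, the MaxEnt optimization of step 6 and the Bayesian-risk test of step 7, and are placeholders rather than the paper's implementation.

```python
from collections import deque

# Schematic adaptive-subdivision loop (steps 1-10 above).  The three callables
# are user-supplied placeholders; `domain` is an object that can split itself
# into 4 (2-D) or 8 (3-D) subcells.
def adaptive_maxent(domain, Y, solve_forward, maxent_reconstruct, needs_refinement,
                    max_cells=1024):
    cells = [domain]                 # step 1: the whole medium as one cell
    queue = deque([domain])          # refinement list
    X = None
    while queue and len(cells) < max_cells:
        cell = queue.popleft()       # steps 10 and 2: take a cell and split it
        children = cell.split()
        cells = [c for c in cells if c is not cell] + children   # step 3: refined model
        A = solve_forward(cells)                                  # step 4: forward equations
        X = maxent_reconstruct(A, Y, init=X)                      # steps 5-6, warm-started
        for child in children:
            if needs_refinement(child, X, Y):                     # step 7: Bayesian-risk test
                queue.append(child)                               # step 8
    return cells, X                   # step 9: list empty (or cell budget reached)
```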


6. Solving the Problem by a Neural Network

The adaptive-subdivision MaxEnt reconstruction algorithm is a very time-consuming process. Probably the best way to solve the problem is by neural networks. Neural network models have been developed for constrained optimization [10]; these models can be further developed to solve the maximum entropy reconstruction problem (16). Let

L(X, λ, μ) = Q(X) + λH(X) + μS(X),    (27)

then a continuous linear model of the neural net can be generated, defined by the following dynamics:

⎡ ∇²_X L       ∇_X H(X)    ∇_X S(X) ⎤ ⎡ dX/dt ⎤     ⎡ ∇_X L ⎤
⎢ ∇_X H(X)ᵗ       0            0     ⎥ ⎢ dλ/dt ⎥ = − ⎢ H(X)  ⎥    (28)
⎣ ∇_X S(X)ᵗ       0            0     ⎦ ⎣ dμ/dt ⎦     ⎣ S(X)  ⎦

It can be proved that the neural net converges to the constrained equilibrium points. The dynamics are similar to the nonlinear programming process generated by Newton's algorithm. Details of this approach will be published in another paper. One advantage of the neural net approach is that there is no need to estimate the Lagrange multipliers λ and μ: they are solved for automatically by the convergence process. Further advantages are that no nonlinear programming routine is needed and that computation is fast.

7. Computer Simulation

A computer simulation has been conducted in the laboratory. The experiment simulates a site with two boreholes separated by a distance of 200 meters. Both boreholes are 120 meters deep, with an array of 16 equally spaced transmitters in the left borehole and an array of 4 equally spaced receivers in the right one. A 1000 Hz continuous sinusoidal EM wave is chosen. A 2-D model of the medium is selected, with a high-conductivity bar in the lower right part surrounded by homogeneous earth material, as shown in Fig. 5. The conductivity of the surrounding material is normalized to unity; the relative conductivity of the bar is 10. In the figure, the value of relative conductivity is represented by grey level. Fig. 6 shows the result produced by the LS reconstruction algorithm described in Section 4. The resolution is 16 × 16 in the test area, so the size of X (the unknowns) is 256 and the size of Y (the data) is 64. This is a highly under-determined case. Not only the under-determination but also the test configuration make the inverse problem severely ill-posed. Many artifacts with random patterns appear in the picture, making the solution unacceptable.

Fig. 7 shows another result produced by the same algorithm with a different initial value. The two figures show totally different pictures, indicating the serious uncertainty of the solution.

Fig. 8 shows the result obtained by maximum entropy reconstruction on a 16 × 16 grid without adaptive subdivision. Artifacts are considerably reduced compared with the LS solutions.

Fig. 9 shows the final result as well as three intermediate-stage results produced by the adaptive subdivision MaxEnt reconstruction. The bar can be clearly seen in the final image. Artifacts are greatly reduced except in the lower right corner, because the test configuration makes that corner almost unmeasurable.


Figure 5: The structure of the medium
Figure 6: LS reconstruction image
Figure 7: LS reconstruction image with different initial value
Figure 8: Maximum entropy reconstruction image

8. Conclusion and Discussion

This paper presents an adaptive-subdivision maximum entropy reconstruction algorithm for EM-wave underground tomography. The approach has a pyramid structure and shows a significant advantage in achieving high resolution with low artifacts. Our work is still preliminary and much further investigation needs to be conducted. For instance, the algorithm is very time-consuming; supercomputing is necessary even for a moderate-resolution image, and how to make the algorithm more efficient is an open problem. Another open problem is the test criterion for deciding on cell refinement: a bad criterion may miss details or, conversely, create artifacts.

References

[1] M. J. Wilt, H. F. Morrison, A. Becker, and K. H. Lee, "Cross-Borehole and Surface-to-Borehole Electromagnetic Induction for Reservoir Characterization," DOE/BC/91002253 Report, Lawrence Livermore National Lab., Livermore, CA, Aug. 1991

[2] D. T. Iseley and D. H. Cowling, "Obstacle Detection to Facilitate Horizontal Directional Drilling," Final report of AGA project PR222-9218, Pipeline Research Committee at American Gas Association, Jan. 1994

[3] M. N. O. Sadiku, "Numerical Techniques in Electromagnetics," CRC Press, 1992


Figure 9: Adaptive subdivision maximum entropy reconstruction images. (a) At the first recursive stage; (b) at the second recursive stage; (c) at the third recursive stage; (d) the final image.

[4] Q. Zhuo, "Audio Frequency Numerical Modeling and Tomographic Inversion for Reser­voir Evaluation," Ph.D dissertation, Department of Engineering Geosciences, Univer­sity of California at Berkeley, 1989

[5] W. C. Chew, and Y. M. Wang, "Reconstruction of 2-D permittivity Distribution Using the Distorted Born Iteration Method," IEEE Trans. on Medical Imaging, vol.9, no.2, pp.218-225 , June 1990

[6] A. Albert, "Regression and Moore-Penrose pseudoinverse" New York: Academic Press

[7] L. L. Scharf, "Statistical Signal Processing: Detection, Estimation, and Time Series Analysis ," Addison-Wesley Publishing Co. 1991

[8] S. F . Burch, S. F. Gull and J. Skilling, "Image Restoration by a Powerful Maximum Entropy Method," Comput. Vis . Graph. Im. Process., vo1.23, pp.113-128, 1983

[9] A. Mohammad-Djafari and G. Demoment, "Maximum Entropy Image Reconstruction in X-Ray and Diffraction Tomography," IEEE Trans. on Medical Imaging, vol. 7, no.4, pp.345-354, 1988

[10] S. Zhang, X. Zhu, and Li-He Zou, "Second Order Neural Nets for Constrained Opti­mization," IEEE Trans. on Neural Networks, vol. 3 , no.6, pp.1021-1024, 1992


HIGH RESOLUTION IMAGE CONSTRUCTION FROM IRAS SURVEY - PARALLELIZATION AND ARTIFACT SUPPRESSION

Yu Cao and Thomas A. Prince
Division of Physics, Mathematics and Astronomy, California Institute of Technology, Pasadena, CA 91125, USA

ABSTRACT. The Infrared Astronomical Satellite carried out a nearly complete survey of the infrared sky, and the survey data are important for the study of many astrophysical phenomena. However, many data sets at other wavelengths have higher resolutions than that of the co-added IRAS maps, and high resolution IRAS images are strongly desired both for their own information content and their usefulness in correlation studies.

The HIRES program was developed by the Infrared Processing and Analysis Center (IPAC) to produce high resolution (1') images from IRAS data using the Maximum Correlation Method (MCM). In this paper, we describe the port of HIRES to the Intel Paragon, a massively parallel supercomputer. A speed increase of about 7 times is achieved with 16 processors and 5 times with 8 processors for a 1° × 1° field. Equivalently, a 64 square degree field can be processed using 512 nodes, with a speedup factor of 320.

Images produced by the MCM algorithm sometimes suffer from visible striping and ringing artifacts. Correcting detector gain offsets and using a Burg entropy metric in the restoration scheme were found to be effective in suppressing these artifacts.

1. Introduction

The Infrared Astronomical Satellite (IRAS) provided our first comprehensive look at the infrared sky, producing a nearly complete survey at mid- to far-infrared wavelengths (12, 25, 60, and 100 microns). Information about IRAS relevant to this paper is given in Section 3.

The Maximum Correlation Method (MCM) algorithm [1] produces high resolution images from the survey and additional observation (AO) data, using a nonlinear iterative scheme. The resulting images have a resolution of about 1', compared to the 4'-5' subtended by the 100 μm band detectors in the IRAS focal plane. Application of the algorithm to the IRAS data has been limited largely by the computational resources available for HIRES processing. A description of the MCM algorithm is given in Section 4.

We have ported the HIRES program to the Intel Delta and Paragon systems. Each 1° × 1° field is mapped to an 8- or 16-node process grid, which shares the computation by loading different observation scans. An efficiency of 60 % is reached with 8 nodes. Section 2 further explains the motivation for this port, and Section 5 discusses the parallelization strategy, output verification, and performance analysis.

In Sections 6. and 7. we offer descriptions of artifact reduction algorithms, namely using estimates of gain offset to eliminate striping, and using a Burg entropy prior in the iterative algorithm to suppress ringing around bright point sources.


2. Scientific Motivation

The wavelength bands covered by the IRAS survey are tracers of star-forming regions and numerous other components of the interstellar medium. A variety of studies have been made to date ranging from structure on a galactic scale to detailed studies of individual molecular clouds (see [2, 11]). The strength of IRAS is the completeness of the survey. However, in many cases the spatial resolution of the comparison data sets at other wavelengths is better than for IRAS, and thus the 4' - 5' resolution of the released IRAS images (the Infrared Sky Survey Atlas, ISSA) sometimes limits the comparison. The desire for higher spatial resolution combined with the paucity of new infrared satellite missions has inspired many efforts to extract high spatial resolution information from the data (e.g. [3,4]). The products most widely accessible to the science community are the HIRES images distributed by the Infrared Processing and Analysis Center (IPAC), which are based on the Maximum Correlation Method.

The HIRES process is very demanding computationally. A 1° × 1° field of typical scan coverage takes 1-2 hours of CPU time on a Sun SPARCstation 2, for all four wavelength bands and 20 iterations (at which point artifacts limit further improvement of image quality).

As part of a program in high-performance computational science and engineering, Caltech has developed significant software and hardware capabilities for massively parallel computing. The high demand for HIRES images, along with the availability of parallel computing facilities, motivated the port of HIRES to the parallel supercomputers.

We also developed algorithms which can effectively suppress the artifacts, which allows the iteration procedure to be carried much further (hence requiring more CPU time and further justifying the parallel computing approach).

3. Relevant Information about IRAS

The IRAS focal plane was designed for the identification of point sources. It included eight staggered linear arrays subtending 30' in length, two in each of four spectral bands at 12, 25, 60, and 100 μm. Data rate considerations forced the detector sizes to be much larger than the diffraction limit of the telescope. The typical detector sizes were 45 × 267, 45 × 279, 90 × 285, and 180 × 303 arcsec (full width at half maximum response, FWHM) respectively, at the four wavelength bands. The sky was scanned in "push-broom" fashion.

This combination of focal plane, detector size, and scan pattern optimized detection of point sources in areas of the sky where the separation between sources was large compared to the sizes of the detectors. However, it complicated the construction of images of regions containing spatial structure on the scale of arcminutes.

4. The Maximum Correlation Method

Starting from a model of the sky flux distribution, the HIRES MCM algorithm folds the model through the IRAS detector responses, compares the result track-by-track to the observed flux, and calculates corrections to the model. The process is taken through about 20 iterations, at which point artifacts limit further improvement. The algorithm yields a resolution of approximately 1' at 60 μm. This represents an improvement in resolution by as much as a factor of 20 in solid angle over the previous images from the IRAS Full


Resolution Survey Coadder (FRESCO). We give a brief description of the MCM algorithm following the formalism and notations of [1].

Given an image grid f_j, with n pixels j = 1, …, n, and m detector samples (footprints) with fluxes

D_i :  i = 1, …, m,    (1)

whose centers are contained in the image grid, an image can be constructed iteratively from a zeroth estimate of the image, f_j⁰ = const. > 0 for all j. In other words, the initial guess is a uniform, flat, and positive definite map. For each footprint, a correction factor C_i is computed as

C_i = D_i / F_i,    (2)

where

F_i = Σ_j r_ij f_j,    (3)

and r_ij is the value of the ith footprint's response function at image pixel j. Therefore F_i is the current estimate of the ith footprint's flux, given the image grid f_j.

A mean correction factor for the jth image pixel is computed by projecting the correction factors for the footprints into the image domain:

C_j = [Σ_i (r_ij/σ_i²) C_i] / [Σ_i (r_ij/σ_i²)]    (4)

The weight attached to the ith correction factor for the jth pixel is r_ij/σ_i², where σ_i is the a priori noise assigned to the ith footprint.

The kth estimate of the image is then computed by multiplying the previous estimate by the mean correction factor:

f_j^(k) = f_j^(k−1) C_j    (5)

In practice, when the footprint noise σ_i is not easily estimated, an equal noise value for all footprints is assumed, and the MCM is identical to the Richardson-Lucy algorithm [10, 8].
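A minimal sketch of one such iteration is given below; R, D, sigma and the flat starting map are illustrative assumptions, and the multiplicative update follows the reconstruction of Eq. (5) above.

```python
import numpy as np

# One MCM iteration, Eqs. (2)-(5).  R is the m x n matrix of footprint response
# values r_ij, D the m measured footprint fluxes, f the current n-pixel image.
def mcm_step(f, R, D, sigma=None):
    if sigma is None:
        sigma = np.ones(len(D))                 # equal noise: Richardson-Lucy limit
    F = R @ f                                   # predicted footprint fluxes, Eq. (3)
    C = D / F                                   # footprint correction factors, Eq. (2)
    w = R / sigma[:, None] ** 2                 # weights r_ij / sigma_i^2
    C_pix = (w * C[:, None]).sum(axis=0) / w.sum(axis=0)   # mean correction, Eq. (4)
    return f * C_pix                            # multiplicative update, Eq. (5)

# usage sketch: f = np.full(n_pixels, D.mean()); iterate mcm_step about 20 times
```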

5. Parallelization

Detector data are stored in scanlines called legs, which contain individual footprints. Profiling a typical HIRES process showed that more than 95 % of the total execution time was spent within the code which calculates the footprint and image correction factors. In the parallel decomposition of the problem, each processor takes care of footprints from a set of scanlines. The reasons for doing this are:

1. Small programming effort. The essence of the original HIRES architecture is left untouched.

2. Footprints in one leg share the same response function grid, except for a translation, which is basically the reason the original code processes the data one leg at a time. Keeping the whole leg in one processor is therefore a natural choice, which minimizes local memory usage.

Table 1: Speed comparisons for the 60 μm band of M51

  Sun SPARCstation 2           720 sec
  Single node of the Paragon   640 sec
  8 nodes of the Paragon       137 sec

3. As we will discuss in Section 6, IRAS detectors have gain differences which are especially prominent for the 60 and 100 μm bands. The gain offset can be estimated from correction factors in the same leg, which came from the same detector.

Each node calculates the correction factors C_i for its share of footprints, and projects them onto the pixels covered by those footprints. A global sum over all processors of the correction factors C_j for each image pixel is performed at the end of each iteration, and the weighted average is taken, which is then applied to the image pixel value.
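A sketch of this per-iteration reduction using MPI is shown below; the function and variable names are illustrative, and mpi4py stands in for the message-passing layer actually used on the Paragon.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each node holds, for its own legs, the per-pixel sums of weighted correction
# factors (numerator) and of weights (denominator).  An Allreduce combines them
# so every node applies the same mean correction factor to the image.
def apply_global_correction(local_num, local_den, image):
    num = np.empty_like(local_num)
    den = np.empty_like(local_den)
    comm.Allreduce(local_num, num, op=MPI.SUM)   # sum_i (r_ij/sigma_i^2) C_i over all nodes
    comm.Allreduce(local_den, den, op=MPI.SUM)   # sum_i (r_ij/sigma_i^2) over all nodes
    return image * (num / den)                   # multiplicative update of every pixel
```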

Decomposition in the image pixel domain was not carried out for the 1° × 1° field, eliminating the need for ghost-boundary communication, which would be significant and complicated to code owing to the large size and irregular shape of the detector response function. This helped keep the parallel code similar in structure to the sequential one, making simultaneous upgrades relatively easy.

The efficiency of the parallel program depends on the scan coverage of the field processed. The computation time is roughly proportional to the total coverage (i.e. the total number of footprints), while the communication overhead is not related to the footprints and depends only on the image array size. So the efficiency is higher for a field with higher coverage.

For a large field (e.g. 6° × 6° of ρ Ophiuchus), the detector measurements are broken into 1° × 1° pieces with an overlap of 0.15°. Each 1.15° × 1.15° field was loaded onto a subgroup of 8 or 16 processors. The overlap was chosen to be large enough that cropping it after HIRES ensures smoothness at the boundaries. Therefore no inter-subgroup communication was needed during HIRES, at the cost of a moderate increase in computation.

The output images from the parallel computers were compared with those from the standard HIRES program running on a Sun SPARCstation. The differences are well within the range of numerical round-off errors: at the 20th iteration, the standard deviation of (NewImage − OldImage)/OldImage averages to about 10⁻⁴.

The executable code was compiled and linked with a math library conformant to the IEEE 754 standard. For the 60 μm band of M51 (baseline-removed data), a time comparison is shown in Table 1.

Efficiency is 60 % with 8 nodes for a 1° × 1° field. All 512 nodes can be used to process a 64 square degree field with a speedup factor of 320. The global sum operation, which collects pixel correction factors from the different nodes, is the primary source of overhead in the parallel program.


6. Destriping Algorithm

Stripes are the most prominent artifacts in HIRES images. HIRES takes in the IRAS detector data and, if the data are not perfectly calibrated, tries to fit the gain differences of the detectors with a striped image. The striping builds up in amplitude and sharpness along with the HIRES iterations, as the algorithm refines the "resolution" of the stripes (see Fig. 1(a) and (b)).

The IPAC program LAUNDR [5] invokes several one-dimensional flat-fielding and deglitching techniques. For the purpose of destriping, the one-dimensional algorithm works well for regions with a well-defined baseline, but the result is not satisfactory for regions where structure exists at all spatial frequencies.

Another IPAC utility, KESTER, developed by Do Kester, is similar in principle to the approach we take. The idea is to process the data with HIRES for a certain number of iterations to obtain an image, which is then used to simulate a set of detector flux measurements. The baseline offsets of the original data are then calibrated using the simulated data set.

Our approach is to combine the image construction and the destriping process. Since the striping gets amplified through the iterations, the idea of applying constraints to the correction factors is natural.

Assume that footprints in the same leg L suffer from the same unknown gain offset G_L; then

(6)

is the "true" detector flux, had the detector gain been perfectly calibrated. The GL's can be seen as extra parameters to be estimated, besides the image pixels fj. Under a Poisson framework, the maximum likelihood estimate for G L is

(7)

in which C_i′ is the gain-compensated correction factor. This choice of the unknown G_L minimizes the mutual information between the sets D_i and F_i in the leg, i.e. the resulting correction factors C_i′ will extract the minimum amount of information from the stream D_i. According to the maximum entropy principle, this is the only reasonable choice.

From another point of view, this strategy works because the procedure of averaging the C_i's to get C_j has a smoothing effect on the image, so that the image f_j, and therefore F_i, does not contain as much striping power as the footprints D_i, especially on scales smaller than one detector size.

When the legs do contain non-random gain deviations roughly periodic on a scale larger than the detector size (typically around 7' for IRAS, the distance between neighboring detectors), this destriping method sometimes fails to smooth out the wide stripes. It does eliminate high spatial frequency stripes, but may result in wide regions with discontinuous flux levels. A heuristic analogy for understanding this behavior can be made with the one-dimensional Ising model, where the energy is lowest when all the spin directions are lined up, if we compare the negative likelihood to the energy in the Ising model and the residual gain offset to the spin. Just as the spins can be trapped in a local minimum-energy state (aligned in patches of up's and down's), the gain estimation may reach a local maximum of


Figure 1: (a) 1st iteration image for a field in ρ Ophiuchus (100 μm band); (b) 20th iteration, standard HIRES; (c) 20th iteration, with uniform gain compensation; (d) 20th iteration, with local gain compensation. Size of image is 1° × 1°. Height of surface represents flux.


the likelihood function in the G_L space (the image would consist of smooth regions at different flux levels). The LAUNDR program, which is run upstream of the MCM process, is capable of detecting this kind of flux variation and correcting it. But when raw data are fed into the MCM program without the LAUNDR offsets, it is necessary first to smooth the image f_j with a large kernel (15') before trying to estimate the gain offsets.

A further complication lies in the fact that the assumption of a uniform gain offset in a given leg is only approximately true: various hysteresis effects cause the gain to drift slightly within the 1° range. The more aggressive form of the destriping algorithm estimates the gain offset locally as the geometric mean of the correction factors for nearby footprints, so the estimated gain correction for each footprint varies slowly along the leg. The local gain offset is not allowed to differ by more than 10 % from the global value, since the gain is not expected to drift that much over a 1° scale, and such variation in the computed offset average is most likely due to real local structure. We used an averaging length of 10' to estimate the local offset. Because this is larger than the spatial resolution of the first iteration image (5'), it is safe to attribute the average correction factor on that scale to gain offset. The 10' length scale is also small enough to capture the drifting behavior of the gain, as shown by visual inspection of output images as well as Fourier power spectrum analysis. Unlike the standard HIRES algorithm (in which stripes are amplified throughout the iterations), the local gain compensation decreases the striping power monotonically to a negligible level after roughly 10 iterations.
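The sketch below illustrates such a local gain estimate for a single leg: a running geometric mean of the correction factors, clipped to within 10 % of the leg's global value. The window length and the way the compensation is applied are assumptions for illustration only.

```python
import numpy as np

# Local gain offset for one leg: running geometric mean of the footprint
# correction factors C_leg, clipped to +/-10 % of the leg-wide value.
def local_gain(C_leg, window=21):
    logC = np.log(C_leg)
    kernel = np.ones(window) / window
    local = np.exp(np.convolve(logC, kernel, mode="same"))   # running geometric mean
    global_gain = np.exp(logC.mean())                         # whole-leg geometric mean
    return np.clip(local, 0.9 * global_gain, 1.1 * global_gain)

# assumed compensation: divide out the estimated gain before projecting to pixels
# C_compensated = C_leg / local_gain(C_leg)
```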

Fig. 1 demonstrates the striking effect of the destriping algorithm. Fig. 1(a) shows the first iteration image for a 1° × 1° field in ρ Ophiuchus, which is smooth (blurry). Fig. 1(b) is the 20th iteration image of the field obtained with the standard HIRES algorithm, and is contaminated with strong striping artifacts. A tremendous improvement is seen in Fig. 1(c), which is produced with uniform gain compensation, although some weak stripes are still visible. Finally, using the local gain compensation method gives a stripe-free image, Fig. 1(d). It is also apparent that Fig. 1(d) contains many high spatial frequency features that are absent in 1(a).

7. De-ringing Algorithm

For many deconvolution problems, ringing artifacts (or "ripples") appear when a bright point source sits on a non-zero background. The mechanism of the artifact can be understood as the Gibbs phenomenon: a sharp cutoff of high spatial frequencies in the signal incurs ripples in the position domain.

A variant of the Log-Entropy MART [9]

f_j^(k) = f_j^(k−1) + (f_j^(k−1))² Σ_i (r_ij/σ_i²)(D_i − F_i)    (8)

was tested on IRAS data and gave satisfactory suppression of ringing artifact (Fig. 2).

The (f_j^(k−1))² factor in the correction term indicates a Burg entropy metric in the image space and effectively boosts the correction for brighter pixels. The bright point source is therefore fitted better in the earlier iterations, which circumvents the corruption of the background caused by the misfit.
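A sketch of such an update step is shown below; since part of Eq. (8) is garbled in the source, the specific r_ij/σ_i² weighting is an assumption carried over from Eq. (4), and the step is illustrative rather than the exact HIRES implementation.

```python
import numpy as np

# Additive MART-type update with a Burg-entropy (f^2) metric.  R is the m x n
# response matrix, D the measured fluxes, sigma the per-footprint noise.
def burg_mart_step(f, R, D, sigma):
    F = R @ f                              # predicted footprint fluxes
    resid = (D - F) / sigma**2             # weighted data residuals
    return f + f**2 * (R.T @ resid)        # f^2 factor boosts corrections to bright pixels
```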

The prior knowledge implied by using the maximum Burg entropy estimation rule has


Figure 2: (a) Point source 16293-2422 in ρ Ophiuchus (100 micron), no ringing suppression; (b) same field, using entropy prior for ringing suppression. Size of image is 1° × 1°. Peak flux in (a) is 3749 MJy/ster, and 3329 MJy/ster in (b).

been discussed in [7, 6]. According to [6], the class of optical objects described by the Burg entropy prior would tend to consist of a relatively small number of randomly placed bright cells, the rest being dim, befitting the bright point source scene we're concerned with.

Suppression of ringing leads to better photometry determination of the point source, and helps solve the source confusion problem, which is especially prominent in the Galactic plane.

8. Summary

The parallelization and algorithmic enhancements of the IPAC HIRES program have been described. These efforts will soon enable production of HIRES images by IPAC from the Intel Paragon supercomputer.

It is now possible to produce complete maps of the Galactic plane (±5° latitude) at 60 and 100 μm with arcminute resolution, as well as maps of the Orion, Ophiuchus, and Taurus-Auriga cloud complexes. These maps will represent a 20-fold improvement in areal information content over current IRAS 60 and 100 μm maps and will be valuable for a wide range of scientific studies, including:

• The structure and dynamics of the interstellar medium (ISM)

• Cloud core surveys within giant molecular clouds
• Determination of initial mass functions (IMFs) of massive stars
• Study of supernova remnants (SNRs)

Additional information will come from combining the 60 and 100 μm HIRES data with the images and catalogs being produced from the 12 and 25 μm IRAS data by the Air Force Phillips Laboratory and Mission Research Corporation.


This research was performed in part using the Intel Touchstone Delta and the Intel Paragon operated by Caltech on behalf of the Concurrent Supercomputing Consortium.

We thank Tom Soifer, Joe Mazzarela, Jason Surace, Sue Terebey, John Fowler, Michael Melnyk, Chas Beichmann, Diane Engler and Ron Beck for their contributions to this project. YC also thanks Professor John Skilling for discussions during the workshop.

References

[1] H. H. Aumann, J. W. Fowler and M. Melnyk, "A Maximum Correlation Method for Image Construction of IRAS Survey Data," Astronomical Journal, Vol. 99(5), pp: 1674-1681, 1990.

[2] C. A. Beichman, "The IRAS View of the Galaxy and the Solar System," Annual Review of Astronomy and Astrophysics, Vol. 25, pp: 521-563, 1987.

[3] T. R. Bontekoe, D. J. M. Kester, S. D. Price, A. R. W. Dejonge and P. R. Wesselius, "Image Construction from the IRAS Survey," Astronomical Journal, Vol. 248(1), pp: 328-336, 1991.

[4] T. R. Bontekoe, E. Koper and D. J. M. Kester, "Pyramid Maximum-Entropy Images of IRAS Survey Data," Astronomy and Astrophysics, Vol. 284(3), pp: 1037-1053, 1994.

[5] J. W. Fowler and M. Melnyk, LAUNDR Software Design Specifications, IPAC, Caltech, 1990.

[6] B. R. Frieden, "Estimating Occurrence Laws with Maximum Probability, and the Transition to Entropic Estimators," in Maximum-Entropy and Bayesian Methods in Inverse Problems, eds. C. R. Smith and W. T. Grandy, Jr., pp: 133-170, D. Reidel Publishing Company, Dordrecht, Holland, 1985.

[7] E. T. Jaynes, "Monkeys, Kangaroos, and N," in Maximum Entropy and Bayesian Methods in Applied Statistics, ed. J. H. Justice, pp: 26-58, Cambridge University Press, 1986.

[8] L. B. Lucy, "An Iterative Technique for the Rectification of Observed Distributions," Astronomical Journal, Vol. 79, pp: 745-754, 1974.

[9] A. R. De Pierro, "Multiplicative Iterative Methods in Computed Tomography," in Mathematical Methods in Tomography, eds. G. T. Herman, A. K. Louis and F. Natterer, pp: 167-186, Springer-Verlag, 1991.

[10] W. H. Richardson, "Bayesian-Based Iterative Method of Image Restoration," Journal of the Optical Society of America, Vol. 62, pp: 55-59, 1972.

[11] B. T. Soifer, J. R. Houck and G. Neugebauer, "The IRAS View of the Extragalactic Sky," Annual Review of Astronomy and Astrophysics, Vol. 25, pp: 187-230, 1987.


MAXIMUM ENTROPY PERFORMANCE ANALYSIS OF SPREAD-SPECTRUM MULTIPLE-ACCESS COMMUNICATIONS

F. Solms
Dept. of Applied Mathematics, Rand Afrikaans University, PO Box 524, Auckland Park, 2006, South Africa
E-mail: [email protected]

P.G.W. van Rooyen
Alcatel Altech Telecoms, PO Box 286, Boksburg, 1460, South Africa

J.S. Kunicki
Dept. of Electrical Engineering, Rand Afrikaans University, PO Box 524, Auckland Park, 2006, South Africa

ABSTRACT. The Maximum Entropy Method (MEM) is used to evaluate the inter-user interference (IUI) probability distribution function (pdf) for Spread Spectrum Multiple Access (SSMA) systems. This pdf is frequently assumed to be Gaussian. We calculate the discrimination information (relative entropy) between the IUI pdf as inferred via the MEM and the "best" Gaussian pdf in order to quantitatively assess the accuracy of the Gaussian assumption. We find that the Gaussian assumption becomes more accurate as the number of users increases. The widely used Gauss-Quadrature rule (GQR) based methods usually require a very high number of moments for accurate results and often fail for low error probabilities. The MEM results, on the other hand, usually require far fewer moments and continue to give accurate results in the practically important region of low error probabilities.

1. Introduction

Spread spectrum signals lend themselves excellently to code-division multiple access (CDMA) communication. CDMA systems are widely used for secure communication systems (the receiver must know the code signal of the sender), cellular networks including mobile communication systems and indoor wireless communication systems. In CDMA systems a spread spectrum signal is obtained by spreading a low frequency data signal with a high frequency code signal. The resultant signal is modulated, sent over a noisy channel and is usually received via multiple independently fading paths. Finally the user-code is used by a correlation decoder to collapse the wide-band spread-spectrum signal to a narrow band data signal.

There are three factors which contribute to the performance degradation of the CDMA system. Each of these is modeled by an independent random variable. Firstly, we assume we are transmitting over an additive white Gaussian noise (AWGN) channel, i.e. that the noise can be modeled with a random Gaussian variable. Secondly, we follow [1] by assuming that the fading can be modelled by a random variable which is Nakagami distributed. The Nakagami fading model spans from single-sided Gaussian fading to non-fading including


Rayleigh and Rician fading as special cases. The third factor is inter-user interference (IUI). CDMA signals are usually asynchronous since it is not generally feasible to use a common timing reference for the various transmitters. Asynchronous CDMA signals are not strictly orthogonal. This results in cross correlation noise due to inter-user interference (IUI). The latter increases as the number of users increases. Furthermore, the number of subscribers can be increased by relaxing the orthogonality requirement and hence increasing the IUI. The maximum tolerated average error rate limits the number of subscribers and the number of simultaneous users.

In a previous paper [2] we have shown that the Maximum Entropy Method (MEM) and the Minimum Relative Entropy Method (MREM) [3, 4], when used to evaluate the average error probability in digital communication systems, have several advantages over the commonly used Gauss-Quadrature rule (GQR) based methods [5]. Firstly, the GQR based methods fail under frequently encountered conditions and in particular for high signal to noise ratios (i.e. for low error probabilities), while the MEM continues to give reliable results. Furthermore, in cases where the GQR method does not fail, it is found that the MEM [6] and the MREM method [2] generally require far fewer moments than the GQR based methods to obtain accurate results. This advantage is particularly significant when only a few moments of the pdf are accurately known, as is the case when the moments are obtained experimentally.

In this paper we (i) use the MEM to evaluate the performance of Spread Spectrum Multiple Access (SSMA) systems and (ii) calculate the discrimination information (also known as expected weight of evidence, relative entropy, cross entropy or Kullback-Leibler distance) [3] to quantify the accuracy of the Gaussian assumption for the IUI.

In particular, we use the MEM to infer the IUI-pdf from its moments. Kavehrad [7] gives an algorithm to generate the moments of the IUI-pdf. However, numerical instabilities often limit the number of moments that can be generated accurately. For example, for a SSMA system with codes of period N=127 only the first 16 moments of the 2-user interference pdf can be generated accurately. Consequently one requires an inference method which makes efficient use of this limited number of moments. The MEM proves to be superior to the GQR-based methods in this respect.

The IUI in SSMA systems is often assumed to be Gaussian. Making the widely known Gaussian assumption (GA) simplifies SSMA system analysis considerably [8]. However, the extent to which the GA is valid is not generally known, although it has been shown [9, 10] that the validity of the Gaussian assumption decreases with decreasing number of users. We compute the discrimination information between the "exact" IUI-pdf as inferred via the MEM and the "best" Gaussian pdf (as defined by the first three moments of the IUI-pdf) to quantify the accuracy of the GA.

2. The formalism

Consider the CDMA system for K users as depicted in figure 1. The data signal d_k(t) of user k is a sequence of unit-amplitude positive and negative rectangular pulses of duration T_d,

$$d_k(t) = \sum_{i=-\infty}^{\infty} d_i^{(k)}\, P_{T_d}(t - iT_d). \qquad (1)$$



Figure 1: CDMA communication system

where

$$P_T(t) = \begin{cases} 1 & 0 \le t < T \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

The data signal of each user is spread by a high frequency user code signal (in our analysis we used Gold codes [11] which are known to have low cross-correlation properties)

$$c_k(t) = \sum_{i=-\infty}^{\infty} c_i^{(k)}\, P_{T_c}(t - iT_c), \qquad (3)$$

which encodes each data bit with N_c = T_d/T_c chips. The user signals are modulated on a carrier wave with common carrier frequency ω

(4)

Here θ_k is the carrier phase for the transmitter of user k and A = √(2E_b/T_d) is the common carrier level, with E_b the bit energy.
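Since the modulated-signal expression (4) did not survive extraction, the sketch below assumes the standard direct-sequence BPSK form s_k(t) = A d_k(t) c_k(t) cos(ωt + θ_k) and uses random ±1 chips as a stand-in for the Gold codes of [11]; all variable names and parameter values are illustrative rather than taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    Nc = 127                      # chips per data bit (code period)
    Td = 1.0                      # data bit duration
    Tc = Td / Nc                  # chip duration
    spc = 8                       # samples per chip
    A = 1.0                       # assumed carrier level
    theta = rng.uniform(0.0, 2.0 * np.pi)
    w = 2.0 * np.pi * 20.0 / Td   # illustrative carrier frequency

    n_bits = 4
    t = np.arange(n_bits * Nc * spc) * (Tc / spc)

    # data signal d_k(t), eq. (1): one random +/-1 level per bit
    bits = rng.choice([-1.0, 1.0], size=n_bits)
    d = np.repeat(bits, Nc * spc)

    # code signal c_k(t), eq. (3): random +/-1 chips stand in for a Gold code
    chips = rng.choice([-1.0, 1.0], size=Nc)
    c = np.tile(np.repeat(chips, spc), n_bits)

    # assumed BPSK spread-spectrum signal
    s = A * d * c * np.cos(w * t + theta)

    # correlation decoding of the first bit: despread and integrate over T_d
    first = slice(0, Nc * spc)
    decision = np.sum(s[first] * c[first] * np.cos(w * t[first] + theta))
    print("sent bit:", bits[0], " decoded sign:", np.sign(decision))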


We assume that the signals are received over an additive white Gaussian noise channel via L Nakagami fading paths [1]. The impulse response of the channel is thus modelled by

$$h(t) = \sum_{l=1}^{L} \beta_l\, \delta(t - \tau_l)\, e^{j\phi_l}, \qquad (5)$$

where β_l is the Nakagami distributed random path gain, φ_l is the phase shift for path l, which we assume to be randomly distributed over [0, 2π), and τ_l is the path time delay. To simplify the analysis we follow Kavehrad [7] in assuming that the signals from the interfering users are received via a single Nakagami fading path with delay τ_k randomly distributed over one bit period.

The received signal is then given by

(6)

where n(t) is white Gaussian noise with double-sided spectral density N_0/2 and ψ_l = φ_l − ωτ_l. The first sum is the desired signal received via L paths and the second sum is the inter-user interference (IUI) term. The received signal is the input to the correlation decoder, which phase- and delay-locks onto the first received desired signal. The output of the correlator is given by

(7)

(8)

where η is a sample of the Gaussian noise with zero mean and variance σ² = N_0T_d/4,

(9)

ψ_l = (φ_l − φ_1) − ωτ_l, Θ_k = θ_k − ωτ_k, and the code correlation integrals are given by

(10)

The terms for which k_1 ≠ k_2 are the inter-user correlations (cross-correlations), which are non-zero due to the asynchronous arrival of the signals.

Page 111: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

Maximum Entropy Performance Analysis of SSMA Communications

Now, the conditional probability of a bit error is given by

$$P_{e|\alpha,\beta} = P(d_0 = +1)\cdot P\!\left[\tfrac{AT_d}{2}(\alpha + \beta_1) + \eta < 0 \,\middle|\, d_0 = +1\right] + P(d_0 = -1)\cdot P\!\left[\tfrac{AT_d}{2}(\alpha - \beta_1) + \eta > 0 \,\middle|\, d_0 = -1\right]. \qquad (11)$$

Assuming that the two data bits are equally likely, the above result reduces to

$$P_{e|\alpha,\beta} = P\!\left[\tfrac{AT_d}{2}(\alpha + \beta_1) + \eta < 0\right]. \qquad (12)$$

The above expression contains three random variables: η for the Gaussian noise, β_1 for the fading and α for the inter-user interference. Averaging over the zero-mean Gaussian noise we obtain

$$P_{e|\alpha,\beta} = \tfrac{1}{2}\,\mathrm{erfc}\!\left\{\sqrt{\tfrac{E_b}{N_0}}\,(\alpha + \beta_1)\right\}, \qquad (13)$$

where

$$\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\, dt. \qquad (14)$$

We model the multipath-fading channel by an L-tap delay line [1], where the β_l are related to the Nakagami distributed received signal-to-noise ratio via

(15)

Hence

(16)

with

$$p(\gamma_b) = \left(\frac{m}{\gamma_0}\right)^{m} \frac{\gamma_b^{\,m-1}}{\Gamma(m)}\, \exp\!\left(-\frac{m\gamma_b}{\gamma_0}\right), \qquad (17)$$

where γ_0 is the average signal-to-noise ratio and m is the parameter of the Nakagami-m distribution. For m → ∞ we have no fading and for m = 1 we obtain Rayleigh fading.

We finally obtain the average error rate P_e by evaluating

(18)

where p(α) is the distribution of the random variable α modeling the inter-user interference. α is a function of the random path delays τ_k, the random path phases φ_l, the carrier phases θ_k, the code chips c_i^{(k)} and the data bits d_i^{(k)}. We have used Kavehrad's algorithm [7] to


generate the moments of the IUI-pdf p(α), and we use the maximum entropy method to infer p(α) itself. Finally, performing the integration (18) yields the average error rate.
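As a purely illustrative sketch of the final integration step, the fragment below averages the conditional error probability (13) over a pdf of α given on a grid. A Gaussian p(α) stands in for the MEM-inferred IUI-pdf, the fading is suppressed (β_1 = 1), and the signal-to-noise value is arbitrary; none of these choices is taken from the paper.

    import numpy as np
    from scipy.special import erfc

    snr = 10 ** (8.0 / 10.0)             # illustrative E_b/N_0 of 8 dB

    alpha = np.linspace(-0.5, 0.5, 2001)
    sigma_a = 0.05                       # stand-in spread of the IUI
    p_alpha = np.exp(-alpha**2 / (2 * sigma_a**2)) / np.sqrt(2 * np.pi * sigma_a**2)

    # conditional error probability of eq. (13) with beta_1 = 1 (no fading)
    P_cond = 0.5 * erfc(np.sqrt(snr) * (alpha + 1.0))

    # average error rate, eq. (18), by trapezoidal integration over alpha
    P_e = np.trapz(p_alpha * P_cond, alpha)
    print(f"average error rate ~ {P_e:.3e}")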

In the MEM the missing information (the information entropy)

$$J(p) = -\int_a^b p(\alpha)\, \ln p(\alpha)\, d\alpha \qquad (19)$$

is maximized subject to the constraints of the normalization of the pdf and subject to the available information. In our case the expectation values of the moment operators must be equal to the measured or calculated moments. This is a standard maximum entropy moments problem which has been studied in great detail [12, 13].
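A minimal sketch of this moments problem, in the spirit of [12, 13] but not the authors' implementation: the maximum entropy pdf on an interval has the form p(α) ∝ exp(−Σ_n λ_n α^n), and the multipliers can be found by minimizing the convex dual ln Z(λ) + Σ_n λ_n m_n. The interval and the target moments below are illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize

    a, b = -0.5, 0.5
    x = np.linspace(a, b, 4001)
    K = 4
    m = np.array([0.0, 0.01, 0.0, 0.0003])        # illustrative moments m_1..m_4

    powers = np.vstack([x**n for n in range(1, K + 1)])   # shape (K, len(x))

    def dual(lam):
        # ln Z(lam) + sum_n lam_n m_n, with Z = integral of exp(-sum_n lam_n x^n)
        z = np.trapz(np.exp(-lam @ powers), x)
        return np.log(z) + lam @ m

    lam = minimize(dual, np.zeros(K), method="BFGS").x
    p = np.exp(-lam @ powers)
    p /= np.trapz(p, x)                            # normalization multiplier

    # check: the reconstructed moments should match the targets
    print([round(float(np.trapz(p * x**n, x)), 5) for n in range(1, K + 1)])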

To simplify performance analysis of CDMA systems many authors [8, 9] have invoked the Gaussian assumption, i.e. that the IUI can be approximated by a random Gaussian variable. It has been shown [9, 10] that the validity of the Gaussian assumption decreases as the number of users decreases. However, the accuracy of the Gaussian assumption has not yet been quantified.

We calculate the discrimination information (relative entropy) [3]

$$I_R = \int p(\alpha)\, \ln\frac{p(\alpha)}{q(\alpha)}\, d\alpha \qquad (20)$$

between the "exact" lUI-pdf p(a) as inferred via the MEM and the "best" Gaussian pdf q( a) and use this as a measure of accuracy of the Gaussian assumption. The discrimination information is a measure of the evidence contained in the data (the moments) discriminating against the Gaussian pdf q(a).

In the Minimum Relative Entropy method (MREM) [4, 14] one minimizes the relative entropy subject to the constraints of the available information. When the prior distribution q(x) is uniform the MREM reduces to the MEM. We have found that the inferred IUI-pdfs for the MEM and the MREM (using the Gaussian assumption for the prior distribution) are virtually identical, but that the MREM usually converges faster and that it is numerically more stable. The MREM-pdf is given by

$$p(\alpha) = \frac{q(\alpha)}{Z}\, \exp\!\Big(-\sum_{n=1}^{K} \lambda_n \alpha^n\Big), \qquad (21)$$

with partition function

$$Z = \int q(\alpha)\, \exp\!\Big(-\sum_{n=1}^{K} \lambda_n \alpha^n\Big)\, d\alpha. \qquad (22)$$

3. Results and Conclusions

For a relatively high number of users (K = 29) one finds that the Gaussian IUI-pdf as defined by the first 3 moments is very close to the "exact" pdf (as inferred via the MEM from the first 12 moments). One finds, however, that for α > 0.5 the Gaussian tail probabilities are too small. The discrimination information between the two pdfs is only 7.7 × 10⁻⁵, which reaffirms that the Gaussian assumption is very good.


Figure 2 compares the "exact" IUI-pdf (MEM, 12 moments) with the Gaussian pdf for K = 2 users. In this case the Gaussian assumption is not good at all. This is quantitatively expressed by the discrimination information I_R = 0.083, which is more than three orders of magnitude larger than for the case of 29 users. From figure 2 we see that the Gaussian breaks down completely for the tail distributions.


Figure 2: The MEM IUI-pdf (solid line) for K = 2, N_c = 127 (12 moments) compared to the "best" Gaussian (dotted line). The Gaussian assumption breaks down for α > 0.21.

Figure 3 shows the discrimination information between the "exact" IUI-pdf (as inferred via the MEM from the first 12 moments) and the Gaussian pdf (as defined by the first three moments) as a function of the number of users. The discrimination information decreases exponentially with increasing number of users.

Especially for large signal-to-noise ratios, the maximum entropy method is found to be superior to the GQR-based methods in that it provides more accurate results while using fewer moments. This is especially significant when the moments are obtained experimentally or when the moment generating algorithm becomes unstable for higher order moments. The discrimination information between the "exact" pdf as inferred via MaxEnt and the "best" Gaussian approximation (GA) can be used to quantify the accuracy of the GA. We reaffirm that the GA is good for a high number of users K and we find that it breaks down for roughly K < 15.

References

[1] P. Crepeau. Uncoded and coded performance of MFSK and DPSK in Nakagami fading channels. IEEE Trans. Commun., vol. COM-40, pp. 487-493, 1992.

[2] F. Solms, P. van Rooyen and J. Kunicki. Maximum Entropy Performance Analysis of Code-Division Multiple-Access Systems. Submitted for publication, 1994.

[3] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.



Figure 3: The discrimination information (relative entropy) between the "exact" IUI-pdf as inferred via the MEM (using 12 moments) and the Gaussian pdf decreases exponentially with the number of users K.

[4] E. T. Jaynes. Prior Probabilities. IEEE Trans. Syst. Sci. Cybern., vol. SSC-4, pp. 227-241, 1968.

[5] S. Benedetto, E. Biglieri, A. Luvison and V. Zingarelli. Moment-based performance evaluation of digital transmission systems. IEE Proceedings-I, vol. 139, pp. 258-266, 1992.

[6] M. Kavehrad and M. Joseph. Maximum Entropy and the Method of Moments in Performance Evaluation of Digital Communication Systems. IEEE Trans. Commun., vol. COM-34, pp. 1183-1189, 1986.

[7] M. Kavehrad. Performance of Nondiversity Receivers for Spread Spectrum in Indoor Wireless Communications. AT&T Technical Journal, vol. 64, pp. 1181-1210, 1985.

[8] G. L. Turin. The Effects of Multipath and Fading on the Performance of Direct-Sequence CDMA Systems. IEEE J. Selected Areas Commun., vol. SAC-2, pp. 597-603, 1984.

[9] A. R. Akinniyi. Characterization of Noncoherent Spread-Spectrum Multiple-Access Communications. IEEE Trans. Commun., vol. 42, pp. 139-148, 1994.

[10] K. Yao. Error Probability in Asynchronous Spread Spectrum Multiple Access Communication Systems. IEEE Trans. Commun., vol. COM-25, pp. 803-809, 1977.

[11] R. Gold. Optimal Binary Sequences for Spread-Spectrum Multiplexing. IEEE Trans. Inform. Theory, vol. IT-13, pp. 619-621, 1967.

[12] A. Tagliani. On the application of maximum entropy to the moments problem. J. Math. Phys., vol. 34, pp. 326-337, 1993.

[13] L. R. Mead and N. Papanicolaou. Maximum entropy in the problem of moments. J. Math. Phys., vol. 25, pp. 2404-2417, 1984.

[14] J. E. Shore and R. W. Johnson. Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy. IEEE Trans. Inform. Theory, vol. IT-26, pp. 26-37, 1980.


NOISE ANALYSIS IN OPTICAL FIBRE SENSING: A STUDY USING THE MAXIMUM ENTROPY METHOD

L. Stergioulas, A. Vourdas and G.R. Jones
Department of Electrical Engineering and Electronics, The University of Liverpool, Brownlow Hill, P.O. Box 147, Liverpool L69 3BX, UK.

ABSTRACT. The maximum entropy method is used for the reduction of noise in images at the output of an optical fibre. Assuming that the useful information is in the lowest moments and that the higher moments are influenced by noise, we construct "clean" images that have the same lower moments as the original ones and maximum entropy. From a mathematical point of view, we study the moment problem with the maximum entropy method for the case of a discrete variable that takes a finite number of values.

1. Introduction

Optical fibre techniques have been used widely for sensing. Intrinsic fibre sensors rely upon an optical signal propagating through the fibre being influenced by the parameter to be measured. One class of such intrinsic fibre sensors utilises interference between the various propagation modes in a multimode fibre to produce changes in the speckle pattern at the fibre output in response to, for instance, acoustic vibrations (e.g. [1]). One of the limitations in the sensitivity of these systems is caused by noise. In the present contribution, we examine the spatial variation of the speckle pattern as captured by a CCD camera and explore methods for reducing the noise content of the image.

Assuming that the useful information is in the lowest moments of the intensity distribution and that the higher moments are influenced mainly by noise, we construct new ("clean") images which have the same lowest moments as the original ones. The unbiased way to do this is to maximise the entropy under the constraint of having fixed lower moments [2,3,4,5].

Much of the work presented in the literature on similar problems considers the moments of distributions of a continuous variable (e.g. [6,7,8,9]). In contrast, in our case, the variable position represented by the pixel number is discrete and it takes a finite number of values. The mathematical details in these two cases are different. For example, in the case of a thermodynamic system, the maximum entropy method leads to the Bose-Einstein distribution when the Hilbert space of the system is infinite dimensional, and to the uniform distribution (p_N = 1/N) when the Hilbert space is N-dimensional. The statistics in these two cases are different and therefore a separate study of these two cases is required. Similarly, our study of distributions of a discrete variable that takes a finite number of values is different from the continuous case. Moreover, here we apply these ideas in the context of noise analysis of the output image in an optical fibre.


The optical images at the exit of a fibre are digitised by a CCD camera and partitioned into many small regions, within each of which the first two moments of the intensity distribution in the various pixels are calculated. An "improved" image is then constructed which has the same first two moments in each region and maximum entropy. The image is improved in the sense that it contains the same information as the original one in the lowest moments, which are assumed to contain the desirable information, whilst it loses the structure of the higher moments, which are assumed to be due to noise. Maximisation of the entropy is achieved by requiring its first derivative to be equal to zero. This leads to several solutions corresponding to various extrema (local minima, local maxima, saddle points). All these solutions are investigated and the global maximum is found.

2. Experimental data

The images used for the application of our method were produced in an experiment similar to the one described in [10]. The images are digitisations, by a CCD camera, of the speckle pattern produced at the output of a step index multimode optical fibre with a core diameter of 50 μm. A monochromatic source was used to excite modes within the fibre. The CCD array views the exit plane of the optical fibre and provides the far-field intensity distribution. The images captured by the CCD camera are square arrays of 496 x 496 pixels.

3. The method of moments in the case of a discrete variable that takes a finite number of values

The problem of moments involves determining the distribution of a variable from the knowledge of its moments. In the case that all moments are given, the distribution is defined uniquely. But in the case that only the first K moments are given, there are many solutions which have the given moments and which differ from each other with respect to the unknown moments. The maximum entropy method chooses among them the one with maximum entropy.

The image is divided into M square regions, each of which contains L pixels. Let I_ij be the intensity and f_ij be the normalised intensity of the j-th pixel in the i-th region:

$$f_{ij} = \frac{I_{ij}}{\sum_{i,j} I_{ij}}, \qquad (1)$$

$$\sum_{i,j} f_{ij} = 1, \qquad 0 \le f_{ij} \le 1, \qquad (2)$$

where i = 1,2, ... , M,j = 1,2, ... , L. The moments within each region are calculated from the formulae:

$$m_i^{(1)} = \frac{\sum_{j=1}^{L} f_{ij}}{L}, \qquad \sum_{i=1}^{M} m_i^{(1)} = \frac{1}{L}, \qquad (3)$$

and

$$m_i^{(N)} = \frac{\sum_{j=1}^{L} \bigl(f_{ij} - m_i^{(1)}\bigr)^{N}}{L}. \qquad (4)$$
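The per-region statistics of equations (1)-(4) are straightforward to compute; the sketch below does so for the first two moments, using a random array in place of the CCD speckle image. The function name and the region size are illustrative choices.

    import numpy as np

    def region_moments(image, region_side):
        """Split a square image into regions of L = region_side**2 pixels and
        return the per-region mean and standard deviation of the normalised
        intensities f_ij, as in eqs. (1)-(4).  Illustrative sketch."""
        f = image / image.sum()                     # normalised intensities, eq. (1)
        h, w = f.shape
        blocks = (f.reshape(h // region_side, region_side, w // region_side, region_side)
                    .swapaxes(1, 2)
                    .reshape(-1, region_side * region_side))
        mu = blocks.mean(axis=1)                                        # m_i^(1)
        sigma = np.sqrt(((blocks - mu[:, None]) ** 2).mean(axis=1))     # (m_i^(2))^(1/2)
        return blocks, mu, sigma

    rng = np.random.default_rng(1)
    image = rng.integers(0, 256, size=(496, 496)).astype(float)   # stand-in speckle image
    blocks, mu, sigma = region_moments(image, region_side=4)      # L = 16 pixels per region
    print(blocks.shape, mu[:2], sigma[:2])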

The f_ij can be viewed as probabilities and the entropy of the image is given by:

$$S = -\sum_{i,j} f_{ij}\, \ln f_{ij}. \qquad (5)$$

The expression to be maximised, in this case, is:

(6)

where λ_i^{(N)} are the Lagrange multipliers. The distribution of the new values of f_ij will have maximum entropy and at the same time it will also have the original moments m_i^{(N)} (N = 1, 2, ..., K).

4. The case of the first two moments

In this case, the constraints correspond to preservation of the mean value and the standard deviation in each region, which here we denote as μ_i (= m_i^{(1)}) and σ_i (= (m_i^{(2)})^{1/2}), respectively.

The expression to be maximised, in this case, is

$$Q = -\sum_{i,j} f_{ij}\, \ln f_{ij} + \sum_{i=1}^{M} \lambda_i \Bigl(\sum_{j=1}^{L} f_{ij} - L\mu_i\Bigr) + \sum_{i=1}^{M} \tau_i \Bigl(\sum_{j=1}^{L} \bigl(f_{ij} - \mu_i\bigr)^2 - L\sigma_i^2\Bigr), \qquad (7)$$

where λ_i is the undetermined Lagrange multiplier for the first moment term and τ_i is the Lagrange multiplier for the second moment term, referring to the i-th region (i = 1, 2, ..., M). Taking the first derivative of Q with respect to f_ij to be equal to zero, we obtain:

$$-\ln f_{ij} - 1 + \lambda_i + 2\tau_i\, (f_{ij} - \mu_i) = 0. \qquad (8)$$

The relations (3), (4) and (8) are considered for every region and they form a nonlinear system of M(L + 2) equations with M(L + 2) unknowns (ML intensity values and 2M Lagrange multipliers). This system of equations provides the solution. From relation (8), one can easily derive that:

$$f_{ij}\, \exp(-2\tau_i f_{ij}) = \exp\bigl(\lambda_i - 1 - 2\tau_i \mu_i\bigr). \qquad (9)$$

The last equation can be satisfied in all pixels within a region only if the normalised intensities within that region take one of two possible values: f_ij = A_i or B_i. If we assume that in


the i-th region there are x_i pixels with normalised intensity A_i and y_i pixels with normalised intensity B_i, then we have:

$$x_i + y_i = L, \qquad (10)$$

$$x_i A_i + y_i B_i = \mu_i L = \sum_j f_{ij} \qquad (11)$$

and

$$x_i (A_i - \mu_i)^2 + y_i (B_i - \mu_i)^2 = L\sigma_i^2. \qquad (12)$$

Equations (10), (11) and (12) give the solution:

$$A_i = \mu_i \mp \sigma_i \left(\frac{L - x_i}{x_i}\right)^{\frac{1}{2}} \qquad (13)$$

and

$$B_i = \mu_i \pm \sigma_i \left(\frac{x_i}{L - x_i}\right)^{\frac{1}{2}}. \qquad (14)$$

x_i can take all values between 1 and L - 1, but only the ones that lead to non-negative A_i, B_i are acceptable. All these solutions represent local extrema of the entropy function. An investigation of all these local extrema is needed in order to find the global maximum.
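A minimal sketch of this search for a single region follows, assuming the reconstruction of equations (13)-(14) above (the opposite sign choice is covered by relabelling x_i and y_i). The region parameters are toy values, not taken from the experimental images.

    import numpy as np

    def best_two_level(mu, sigma, L):
        """Enumerate x = 1..L-1 in eqs. (13)-(14), keep the non-negative (A, B)
        pairs and return the one maximising the region's entropy contribution
        -(x A ln A + y B ln B).  Illustrative sketch for one region."""
        def h(value, count):
            return -count * value * np.log(value) if value > 0 else 0.0

        best = None
        for x in range(1, L):
            y = L - x
            A = mu - sigma * np.sqrt(y / x)
            B = mu + sigma * np.sqrt(x / y)
            if A < 0 or B < 0:
                continue
            entropy = h(A, x) + h(B, y)
            if best is None or entropy > best[0]:
                best = (entropy, x, A, B)
        return best

    # toy region of L = 16 pixels with mean 1/256 and a small spread
    print(best_two_level(mu=1 / 256, sigma=1 / 1024, L=16))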

The method does not define which of the pixels have normalised intensity A_i and which have B_i, because the entropy remains the same if we scramble the pixels within a certain region. Fig. 1 shows a 496 x 496 256-level digital image of the speckle pattern as captured by a CCD camera. The improved versions of this image for L = 4, 16, 64 are given in Figs. 2, 3, 4. In each of these three cases we have investigated all possible values of x_i that lead to non-negative A_i, B_i and which correspond to local extrema. Table I gives the entropies of all these extrema and the images corresponding to the global maxima are plotted in Figs. 2, 3, 4.

TABLE I

Region size   x    Entropy (in nats)
L = 4         3    10.587677
L = 16             10.617421
L = 16             10.617425
L = 64        63   10.708898
L = 64        62   10.708955
L = 64        61   10.708997 *
L = 64        60   10.708917

* The asterisk indicates the MEM image

For completeness, we have also considered the simple case in which the first moment only is used as a constraint. This leads to the trivial result of substituting all intensities within a region with their average. In table II we present the entropies of the new images as reconstructed with the method of one moment and two moments. As expected the entropy is an increasing function of the region size and a decreasing function of the number of moments that are used as a constraint.



Figure 1: Original image.


Figure 2: Improved image with region size L=4.



Figure 3: Improved image with region size L=16.


Figure 4: Improved image with region size L=64.


TABLE II: ENTROPY OF MEM IMAGES (in nats)

Region size      First moment   Two moments
original image   10.584326      10.584326
L = 4            10.606380      10.587677
L = 16           10.644367      10.617425
L = 64           10.736984      10.708997

5. Conclusion

The maximum entropy method has been used to clean the images at the output of an optical fibre. Using as a constraint the first two moments, we have reconstructed a new image which has maximum entropy. We have shown that in the improved image the intensities within each region are distributed in the two levels of equations (13), (14). A detailed investigation of all local extrema has been performed in order to find the solution corresponding to the global maximum of the entropy. The method can be used for the improvement in the sensitivity of optical sensing systems operating under noisy conditions.

ACKNOWLEDGMENTS. One of us (L.S.) gratefully acknowledges financial support from the S.S.F. (Greece).

References

[1] Cosgrave, J.A., Vourdas, A., Spencer, J.W., Jones, G.R., Murphy, M.M., Wilson, A., "Acoustic monitoring of partial discharges in gas insulated substations using optical sensors", IEE Proc., vol. A140 (1993), pp. 369-374.

[2] Jaynes, E.T., "On the rationale of maximum-entropy methods", Proceedings of the IEEE, vol. 70 (1982), pp. 939-952.

[3] Bryan, R.K., Skilling, J., "Maximum entropy image reconstruction from phaseless Fourier data", Optica Acta, vol. 33 (1986), pp. 287-299.

[4] Gull, S.F., Daniell, G.J., "Image reconstruction from incomplete and noisy data", Nature, Lond., vol. 272 (1978), pp. 686-690.

[5] Livesey, A.K., Skilling, J., "Maximum entropy theory", Acta Cryst., vol. A41 (1985), pp. 113-122.

[6] Mead, L.R., Papanicolaou, N., "Maximum entropy in the problem of moments", J. Math. Phys., vol. 25 (1984), pp. 2404-2417.

[7] Dowson, D.C., Wragg, A., "Maximum entropy distributions having prescribed first and second moments", IEEE Trans. I. T., vol. 19 (1973), pp. 689-693.

[8] Ciulli, S., Mounsif, M., Gorman, N., Spearman, T.D., "On the application of maximum entropy to the moments problem", J. Math. Phys., vol. 32 (1991), pp. 1717-1719.

[9] Tagliani, A., "On the application of maximum entropy to the moments problem", J. Math. Phys., vol. 34 (1993), pp. 326-337.


[10] Smith, R., Ahmed, S., Vourdas, A., Spencer, J.W., Russell, P., Jones, G.R., "Chromatic modulation for optical fibre sensing - Electromagnetic and speckle noise analysis", Journal of Modern Optics, vol. 39 (1992), pp. 2301-2314.


AUTOCLASS - A BAYESIAN APPROACH TO CLASSIFICATION

John Stutz and Peter Cheeseman
NASA Ames Research Center, Moffett Field, CA 94035, U.S.A.

ABSTRACT. We describe a Bayesian approach to the unsupervised discovery of classes in a set of cases, sometimes called finite mixture separation or clustering. The main difference between clustering and our approach is that we search for the "best" set of class descriptions rather than grouping the cases themselves. We describe our classes in terms of probability distribution or density functions, and the locally maximal posterior probability parameters. We rate our classifications with an approximate posterior probability of the distribution function w.r.t. the data, obtained by marginalizing over all the parameters. Approximation is necessitated by the computational complexity of the joint probability, and our marginalization is w.r.t. a local maximum in the parameter space. This posterior probability rating allows direct comparison of alternate density functions that differ in number of classes and/or individual class density functions.

We discuss the rationale behind our approach to classification. We give the mathematical development for the basic mixture model, describe the approximations needed for computational tractability, give some specifics of models for several common attribute types, and describe some of the results achieved by the AutoClass program.

1. Introduction

Classification has a dual interpretation. It may mean either the assignment of instances to predefined classes, or the construction of new classes from a previously undifferentiated instance set. Machine learning for classification mirrors these interpretations. Supervised classifiers seek to characterize predefined classes by defining measures that maximize in-class similarity and out-class dissimilarity. Decision trees and Neural Networks are two currently popular approaches. Success is generally measured by their ability to recover the original classes from data similar, but not identical, to the original data.

Our work has been directed toward unsupervised classification. Unsupervised classifiers seek similarity measures without the guidance provided by predefined classes. Their goal is to discover "natural" classes reflecting underlying causal mechanisms. These mechanisms may be as trivial as sample biases, may reflect well known phenomena, confirm or disconfirm previous hypotheses, or lead to entirely new discoveries.

Unsupervised classification is known to the multivariate analysis community as cluster analysis [Dillon & Goldstein, 1984], [Mardia et al., 1979]. It is normally applied only to numerical data, and is characterized by the usual diversity of methods. One of the principal problems is the lack of any well founded or agreed upon measure of success, especially since likelihood optimization always tends to favor single instance classes. The methods based on "distance measures" have trouble with non-numerical attribute types.

We are also troubled that any classifier which partitions the data instances into sets, may exhibit brittle behavior with those instances that lie near the decision boundaries: Small


changes in decision criteria may shift such instances between classes, provoking further criteria shifts, shifting more instances ... until a radical change results.

2. Bayesian Approach

A Bayesian approach based upon finite mixture distributions [Titterington et al., 1985], [Everitt & Hand, 1981] offers several advantages over the alternatives. Each class is modeled by a probability distribution over an attribute space in which each instance is modeled as a Cartesian point located by the instance attribute vector. A compact class forms a cluster of instance points in the attribute space. A classification is a probability weighted set of such distributions. The degree of class "membership" for any instance is its normalized class probability. This is generally large for some one class and quite small for the others, but never 1 or 0 (modulo floating point roundoff error). Boundary cases cannot induce brittle behavior since they are nearly equally probable in the adjacent classes.

Finite mixture classification was developed decades ago, but was hampered by the use of maximum probability assignment of instances to classes, and crippled by reliance upon likelihoods for rating the resulting classifications. We introduce priors on the parameters and then marginalize out all parameters to get a dataset conditional posterior probability for the classification's distribution function (model). This permits direct comparison of alternative classification models, particularly models that differ only in the number of classes, for which likelihood methods never provided a satisfactory technique.

There are two directions such Bayesian classification can take. In the supervised case one possesses a training dataset of preclassified instances and seeks a "best" representation of the classes. This involves a search over alternate models for each individual class, marginalizing out the parameters to find the MAP models. These, instantiated with the MAP parameter values, then define the classification and provide probability distributions for classifying new data instances.

In the unsupervised case, one only has the database and is looking for interesting and predictive patterns in that data. Here one must search both the classification probability model space and the parameter space. The latter search involves converging to the local parameter space probability maxima, and heuristic search for the largest such maxima.

The primary object in developing our AutoClass program has been to experiment with unsupervised classification, but we have developed some facilities for supervised classification and for its use as a classifier.

3. AutoClass

Our objective is to find the (possibly several) most probable classification T, represented as a set of J class distribution functions T_j, and the corresponding MAP values of the class parameter sets V_j.

We begin with the classical [Titterington et al., 1985], [Everitt & Hand, 1981] finite mixture model assumptions: that the data X is a conditionally independent mixture from several unknown classes {V_j T_j}, of instances X_i which are, within their class, independently and identically distributed w.r.t. the class. Each instance is assumed to "belong" to only one class V_j T_j, but that class is not identified. Nor is the number J of classes given.

We limit ourselves to data for which instances can be represented as ordered vectors of


X                          the data set
i                          indexes instances, i = 1, ..., I
j                          indexes classes, j = 1, ..., J
k                          indexes attributes, k = 1, ..., K
l                          indexes discrete attribute values, l = 1, ..., L
c                          indicates inter-class probabilities & parameters
T = T_c, T_1, ..., T_J     mathematical form of a probability distribution function
V = V_c, V_1, ..., V_J     parameter set associated with a distribution function
π_j                        class mixture probability, V_c = {π_1, ..., π_J}
I                          implicit information not specifically represented

Table 1: Symbols used in this paper.


attribute values. In principle, each attribute represents a measurement of some instance property common to all instances. These are "simple" properties in the sense that they can be represented by single measurements: discrete values such as "true" or "false", or integer values, or real numbers. For example, medical case #8, described as (age = 23, blood-type = A, wt = 73.4, ...) would have X_{8,1} = 23, X_{8,2} = A, etc. We make no attempt to deal with relational data where attributes, such as "married-to", have values that are other instances. We can allow the reported value to be "unknown".

We define a classification as a set of class models T_j weighted by an interclass mixture model T_c specified to be a Bernoulli distribution. This mixture probability is the probability that any X_i is a member of class C_j, irrespective of its attribute values. The interclass model parameters V_c are the set of π_j defined as π_j ≡ P(X_i ∈ C_j | V_c, T_c, I). We specify a uniform Dirichlet¹ prior density P(V_c | T_c, I) for V_c.

Each class is modeled by a class probability distribution P(X_i | X_i ∈ C_j, V_j, T_j, I). This gives the conditional probability that an instance X_i would have attribute values X_ik if it were known that the instance is a member of class C_j. This class distribution function is a product of distributions modeling conditionally independent attributes k:

$$P(X_i \mid X_i \in C_j, V_j, T_j, I) = \prod_k P(X_{ik} \mid X_i \in C_j, V_{jk}, T_{jk}, I). \qquad (1)$$

For exposition we show all attributes as if independent, noting that the shift to partially or fully covariant attributes is only a matter of bookkeeping². Individual attribute models P(X_ik | X_i ∈ C_j, V_jk, T_jk, I) include the Bernoulli and Poisson distributions, and Gaussian densities. Such models are detailed in the next section.

The probability of any one instance is then the sum of class probabilities, i.e. the weighted sum over the class conditional probabilities. The database probability is just the product of the instance probabilities:

$$P(X \mid V, T, I) = \prod_i \Bigl[\sum_j \pi_j\, P(X_i \mid X_i \in C_j, V_j, T_j, I)\Bigr]. \qquad (2)$$

¹ The Bernoulli distribution and Dirichlet density are given in equations (9) and (10) respectively.
² A covariant subset of attributes is modeled as if it were a single new independent attribute.
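A small numerical sketch of equation (2) for a two-class mixture of one-dimensional Gaussian class models follows; the parameter values and the synthetic data are illustrative, and the product over instances is accumulated in log form to avoid underflow.

    import numpy as np
    from scipy.stats import norm

    pi = np.array([0.6, 0.4])                      # interclass mixture probabilities pi_j
    means = np.array([0.0, 3.0])                   # illustrative class parameters V_j
    sds = np.array([1.0, 0.5])

    rng = np.random.default_rng(2)
    X = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(3.0, 0.5, 40)])

    # class conditional probabilities P(X_i | X_i in C_j, V_j, T_j, I), shape (I, J)
    cond = norm.pdf(X[:, None], loc=means, scale=sds)

    # eq. (2): product over instances of the pi-weighted sums, taken in log form
    log_P_X = np.sum(np.log(cond @ pi))
    print(f"log P(X | V, T, I) = {log_P_X:.2f}")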


For fixed V and T this is the direct probability of the data, and for known assignments X_i ∈ C_j, it gives the likelihood of the model V, T.

So far we've only described a classical finite mixture model. We convert this to a Bayesian model by introducing priors, obtaining the joint probability of data and parameters:

$$P(X V \mid T I) = P(V \mid T I)\, P(X \mid V T I)$$

$$= P(V_c \mid T_c I) \times \prod_j \bigl[P(V_j \mid T_j I)\bigr] \times \prod_i \Bigl[\sum_j \pi_j\, P(X_i \mid X_i \in C_j, V_j T_j I)\Bigr] \qquad (3)$$

$$= P(V_c \mid T_c I) \times \prod_{jk} \bigl[P(V_{jk} \mid T_{jk} I)\bigr] \times \prod_i \Bigl[\sum_j \pi_j \prod_k P(X_{ik} \mid X_i \in C_j, V_{jk} T_{jk} I)\Bigr]. \qquad (4)$$

In the latter form we have expanded the joint over the attributes, again showing the attributes as if conditionally independent within the classes.

We seek two things: For given distribution form T = T_c, T_1, ..., T_J and data X, we want the parameter posterior distribution and its MAP parameter values

$$P(V \mid X T I) = \frac{P(X V \mid T I)}{P(X \mid T I)} = \frac{P(X V \mid T I)}{\iint dV\, P(X V \mid T I)}. \qquad (5)$$

Independently of the parameters, we want the posterior probability of the model form given the data:

$$P(T \mid X I) = \frac{P(T X \mid I)}{P(X \mid I)} = \frac{\iint dV\, P(X V \mid T I)\, P(T \mid I)}{P(X \mid I)} \propto \iint dV\, P(X V \mid T I) = P(X \mid T I). \qquad (6)$$

The proportionality in (6) holds when we assume P(T | I) uniform for all T. Frustratingly, attempts to directly optimize over or integrate out the parameter sets V_jk in equation (4) founder on the J^I products resulting from the product over sums.

The mixture assumption, that each instance actually derives from one and only one class, suggests a useful approach. If we knew the true class memberships, and augmented the instance vectors X_i with this information, the conditional probabilities P(X_i | X_i ∈ C_j, V_j T_j I) would be zero whenever X_i ∉ C_j. The J sums in equations (2), (3), and (4) would each degenerate into a single non-zero term. Merging the two products over k, and shifting the attribute product within, then gives

$$P(X V \mid T I) = P(V_c \mid T_c I) \prod_j \prod_{X_i \in C_j} \prod_k P(X_{ik} V_{jk} \mid T_{jk} I). \qquad (7)$$

This pure product form cleanly separates the classes with their member instances. Class parameters can be optimized or integrated over without interaction with the other class's parameters. The same holds for the independent attribute terms within each class. Clearly, for supervised classification, the optimization and rating of a model is a relatively straightforward process. Unfortunately, this does not hold for unsupervised classification.

One could fall back on the mixture assumption, applying this known assignment approach to every partitioning of the data into non-empty subsets. But the number of such partitionings is Stirling's S_I^{(J)}, which approaches J^I for small J. Clearly this technique is only useful for verifying approximations over very small data and class sets.


We are left with approximation. Since equation (2) is easily evaluated for known parameters, the obvious approach is a variation of the EM algorithm [Dempster et al., 1977], [Titterington et al., 1985]. Given the set of class distributions T_j, and the current MAP estimates of the parameter values π_j and V_j, the class conditional probabilities of equation (1) provide us with weighted assignments w_ij in the form of normalized class probabilities:

$$w_{ij} = \frac{\pi_j\, P(X_i \mid X_i \in C_j, V_j, T_j, I)}{\sum_{j'} \pi_{j'}\, P(X_i \mid X_i \in C_{j'}, V_{j'}, T_{j'}, I)}. \qquad (8)$$

We can use these instance weightings to construct weighted statistics corresponding to the known class case. For example, the weighted class mean and variance for an independent Gaussian model term are

$$\hat{\mu}_{jk} = \frac{\sum_i w_{ij}\, X_{ik}}{\sum_i w_{ij}}, \qquad \hat{\sigma}_{jk}^2 = \frac{\sum_i w_{ij}\, (X_{ik} - \hat{\mu}_{jk})^2}{\sum_i w_{ij}}.$$

Using these statistics as if they represented known assignment statistics permits reestimation of the MAP parameters. This new MAP parameter set then permits reestimation of the normalized probabilities. Cycling between the two reestimation steps carries the current MAP parameter and weight estimates toward a mutually predictive and locally maximal stationary point. Marginalizing the parameters w.r.t. the stationary point's instance weightings then approximates the local contribution to P(X | T I).
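The sketch below runs this cycle for a mixture of one-dimensional Gaussian class models. For brevity it uses plain maximum-likelihood reestimation rather than the MAP updates with AutoClass's parameter priors, and all names and starting values are illustrative.

    import numpy as np
    from scipy.stats import norm

    def em_step(X, pi, mu, sigma):
        """One reestimation cycle: normalized class probabilities w_ij (eq. 8),
        then weighted statistics standing in for known-assignment statistics."""
        joint = pi * norm.pdf(X[:, None], loc=mu, scale=sigma)     # shape (I, J)
        w = joint / joint.sum(axis=1, keepdims=True)
        Wj = w.sum(axis=0)
        new_pi = Wj / len(X)
        new_mu = (w * X[:, None]).sum(axis=0) / Wj
        new_sigma = np.sqrt((w * (X[:, None] - new_mu) ** 2).sum(axis=0) / Wj)
        return new_pi, new_mu, new_sigma

    rng = np.random.default_rng(3)
    X = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])
    pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
    for _ in range(25):
        pi, mu, sigma = em_step(X, pi, mu, sigma)
    print(pi.round(3), mu.round(3), sigma.round(3))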

Unfortunately, there is usually a great number of locally maximal stationary points. And excepting generate-and-test, we know of no method to find, or even count, these maxima, so we are reduced to search. Because the parameter space is generally too large to allow regular sampling, we generate pseudo-random points in parameter (or weight) space, converge to the local maximum, record the results, and repeat for as long as time allows.

Having collected a set of local maxima for model T, and eliminated the (often many) duplicates, we consider the local marginals P(X' | T I)_n = ∬dV P(X'V | T I). These are local in the sense of being computed w.r.t. the local weighted statistics X'. As such, they only give the local contribution to P(X | T I) from the parameter space region "near" V_n. But it is remarkable how the largest such P(X' | T I)_n can dominate the remainder. Ratios between the two top probabilities of 10^4 to 10^9 are routine when the number of attribute values, I x K, exceeds a few hundred. With a few million attribute values, the ratios of the top probabilities may easily reach e^100 ≈ 10^43. In such circumstances we feel justified in reporting the largest P(X' | T I)_n as a reasonable approximation to P(X | T I), and in using it as our approximation to P(T | X I).

Thus we rate the various models T by their best P(X' | T I)_n and report on them in terms of the corresponding parameterizations. If one model's marginal dominates all others, it is our single best choice for classifying the database. Otherwise we report the several that do dominate.

4. Class Models

Each class model is a product of conditionally independent probability distributions over singleton and/or covariant subsets of the attributes. For the previous medical example, blood type is a discrete valued attribute modeled with a Bernoulli distribution while age and weight are both scalar real numbers modeled with a log-Gaussian density.


Much of the strength of our approach lies in the diversity of attribute types that may be effectively modeled. We have provided basic models for discrete (nominal) and several types of numerical data. We have not yet identified a satisfactory distribution function for ordinal data. In each case we adopt a minimum or near minimum information prior, the choice being limited among those providing integrable marginals. This integrability limitation has seriously retarded development of the more specific models, but numerical integration is too costly for EM convergence.

The following gives a very brief description of the attribute probability distributions that we use to assemble the class models.

• Discrete valued attributes (sex, blood-type, ...) - Bernoulli distributions with uniform Dirichlet conjugate prior. For the singleton case with L_k possible values, the parameters are V_jk = {q_jk1, ..., q_jkL_k}, such that q_jkl ≥ 0, Σ_{l=1}^{L_k} q_jkl = 1, where

$$q_{jkl} \equiv P(X_{ik} = l \mid X_i \in C_j, V_{jk}, T_{jk}, I), \qquad (9)$$

$$P(V_{jk} \mid T_{jk}, I) = \Gamma(L_k + 1)\, \bigl[\Gamma(1 + \tfrac{1}{L_k})\bigr]^{-L_k} \prod_{l=1}^{L_k} q_{jkl}^{1/L_k}. \qquad (10)$$

For the covariant case, say sex and blood type jointly, we apply the above model to the cross product of individual attribute values. Thus female and type A would form a single value in the cross product. (A numerical sketch of the weighted marginal for this discrete model follows this list.)

• Integer count valued attributes - Poisson distribution with uniform prior per Loredo [1992]. No covariant form has been developed.

• Real valued location attributes (spatial locations) - Gaussian densities with either a uniform or Gaussian prior on the means. We use a Jeffreys prior on a singleton attribute's standard deviation, and the inverse Wishart distribution [Box & Tiao, 1972] as the variance prior of covariant attribute subsets.

• Real valued scalar attributes (age, weight) - Log-Gaussian density model obtained by applying the Gaussian model to the log transformation. See Aitchison [1957].

• Bounded real valued attributes (probabilities) - Gaussian model on the log-odds transform (under development).

• Circular or angular real valued attributes - von Mises-Fisher distributions on the circle and n-sphere (under development) [Mardia et al., 1979].

• Missing values - Discrete valued attributes are extended to include "missing" as an attribute value. Numerical attributes use a binary discrete probability for "missing" and "known", with the standard numerical model conditioned on the "known" side.

• Ignorable attributes - Use the standard model for the attribute type, with fixed parameters obtained by considering the entire dataset as a single class (under revision).

• Hierarchical models - represent a reorganization of the standard mixture model, from a flat structure, where each class is fully independent, to a tree structure where multiple


classes can share one or more model terms. A class is then described by the attribute model nodes along the branch between root and leaf. This makes it possible to avoid duplicating essentially identical attribute distributions common to several classes. The advantage of such hierarchical models lies in eliminating excess parameters, thereby increasing the model posterior. See [Hanson et al., 1991] for a full description of our approach. Other approaches are possible: see [Boulton & Wallace, 1973].
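As promised in the discrete-attribute item above, here is a sketch of the marginal for a single discrete attribute within one class, with the Bernoulli parameters q_jkl integrated against a symmetric Dirichlet prior. Using EM weights gives fractional counts, and the hyperparameter 1 + 1/L_k mirrors the prior of equation (10) as reconstructed here; both choices should be read as assumptions rather than the exact AutoClass defaults.

    import numpy as np
    from scipy.special import gammaln

    def log_marginal_discrete(weighted_counts, alpha=None):
        """Dirichlet-multinomial log marginal for one class and one discrete
        attribute, using weighted (possibly fractional) value counts n_l."""
        n = np.asarray(weighted_counts, dtype=float)
        Lk = len(n)
        a = np.full(Lk, 1.0 + 1.0 / Lk) if alpha is None else np.asarray(alpha, dtype=float)
        return (gammaln(a.sum()) - gammaln(a.sum() + n.sum())
                + np.sum(gammaln(a + n) - gammaln(a)))

    # e.g. weighted counts of four blood-type values accumulated within one class
    print(log_marginal_discrete([12.3, 4.1, 0.7, 8.9]))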

The only hard constraint on the class models is that all must use the same data. An attribute subset may be modeled covariantly in one class and independently in another. Attributes may be ignored, but any such attribute must then be ignored in all classes of all the classifications that are to be compared. To do otherwise would be to evaluate different classes w.r.t. different data sets, rendering the results incommensurable.

In principle, our classification model should also include a prior distribution P(T | I) on the number of classes present and the individual class model forms T_j. Currently we take this distribution to be uniform and drop it from our calculations. Thus we ignore any prior information on alternate classification model probabilities, relying solely on our parameter priors for the Occam factor preventing overfitting of the models. We find this quite sufficient.

Consider that every single parameter introduced into a model brings its own multiplicative prior to the joint probability, which always lowers the marginal. If the parameter fails to raise the marginal by increasing the direct probability by the same factor, its use is rejected. Thus simple independent attribute models are usually favored simply because they need fewer parameters than the corresponding covariant models. Consider the case of 10 binary discrete attributes. To model them independently requires only 10 parameters. A full covariance model requires 1023 parameters. One requires a great many very highly covariant instances to raise the fully covariant model's marginal above the independent model's. Similar effects accompany the introduction of additional classes to a model.

The foregoing is confirmed throughout our experience. For data sets of a few hundred to a few thousand instances, class models with large order covariant terms are generally dominated by those combining independent and/or small order covariant terms. We have yet to find a case where the most probable number of classes was not a small fraction of the number of instances classified. Nor have we found a case where the most probable number of model parameters was more than a small fraction of the total number of attribute values.

Our current treatment of missing values is somewhat limited. We initially elected to take the straightforward approach and do our classification on the data vectors at hand, and thus consider "missing" a valid attribute value. This assumption is appropriate when missing values are the result of a dependent process. For example, with medical cases the fact that a particular test was not reported, and presumably not performed, may be quite informative, at least about the doctor's perception of the patient's state. However most users are more concerned with classifying instance objects rather than their data abstractions. For such persons, "missing" represents a failure of the data collection process, and they are often tempted to substitute some "estimated" value for the missing one. We regard such "estimation" as falsification. We have experimented with an approximation that ignores any missing values in all computations.


5. Practical Details

There are still a few problems. The EM convergence is difference driven. The convergence rate decreases exponentially, thus there is a strong tradeoff between compute time and accurately locating the stationary points. We have a very real problem in deciding when to halt a convergence. Disappointingly, our attempts to speed convergence by over relaxation or exponential extrapolation have met with mixed success.

There are the usual numerical problems arising from the limited precision inherent in the floating point numerical representation. Probability calculations must be done with logs of probabilities. Commonly used functions like log-gamma, if not available from certified libraries, may require extremely careful implementation and testing. Combinatorial functions should, if possible, be computed with indefinite length integers.

The normalization implicit in equation (8) is inherently limited by the current floating point representation's ε, the largest number which when added to 1.0 yields 1.0. Any probability smaller than ε times the largest probability cannot influence the normalization sum. Such probabilities are equally unable to affect the statistical weightings. They are effectively zero and might as well be set so. Thus normalization is a fruitful source of zeros in a calculation which should never yield zero. This is only rarely a problem. But it is possible to specify extreme cases where normalization yields only one non-zero class probability for each instance. Then convergence ceases and the algorithm fails.
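One common way to keep the normalization of equation (8) in the log domain, so that very small class probabilities stay representable instead of underflowing outright, is sketched below; the values are illustrative and the function name is not from AutoClass.

    import numpy as np

    def normalize_log_probs(log_joint):
        """Normalize per-class joint probabilities given only their logarithms,
        by shifting with the per-instance maximum before exponentiating."""
        m = np.max(log_joint, axis=1, keepdims=True)
        w = np.exp(log_joint - m)                 # the largest entry becomes 1.0
        return w / w.sum(axis=1, keepdims=True)

    # exp(-1040) would underflow to exactly zero if taken directly
    log_joint = np.array([[-1000.0, -1040.0, -1005.0]])
    print(normalize_log_probs(log_joint))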

We did a bit (perhaps a lot) of handwaving to justify our use of P(X' | T I)_n as an approximation for P(T | X I). The truth is not so much that we use this approximation because it is justified, but that we justify it because we can compute it. We are not particularly satisfied with the argument, and would welcome any suggestions for a better justified, but still computable, approach, or a better justification for what we already do.

6. Results

Because of limited space, this section only contains a summary of some of our experience in using AutoClass.

On artificial data, where the true generating class models are known, AutoClass has been very successful at recovering the classes from the data alone. In one such series of tests, with a thousand random points from a 1-d Gaussian equal-mixture distribution (0.5 N(0,1) + 0.5 N(x,1)), on average AutoClass preferred the 1-class model over the 2-class model for x ≤ 2. When x > 2 the 2-class model rapidly dominates. Human observers need x > 3 to 4 before they perceive two clusters. AutoClass's ability to do better than the human eye is largely because AutoClass assumes the data is coming from a Gaussian distribution (and in this case it really is), whereas humans probably do not make precise distribution assumptions.

Real world applications have had less clear-cut but more useful results. For example, application of AutoClass to the Infrared Astronomical Satellite Low Resolution Spectrometer data produced a classification [Cheeseman et al., 1989] that revealed many known and previously unknown classes. Subsequent study has confirmed the astrophysical significance of some of these classes, as well as pointing up calibration errors in the data. An evaluation by local IR astronomers [Goebel et al., 1989] points out the kind of support AutoClass can provide a research field.


An even larger classification was performed on a 1000 x 1000 pixel Landsat Thematic Mapper image, using a parallel version of AutoClass with correlation between the attributes. The attributes in this case were the seven spectral values for each pixel. Using these seven numbers alone, we found nearly 100 distinct classes of ground cover. The interpretation of these ground cover classes is often difficult without corresponding ground observations, but the spatial distribution of the classes gives some clues. It is clear that in this domain we need to model adjacent pixel relations rather than treat each pixel independently.

On a much smaller scale, we have found distinct classes of introns in DNA data of unknown, but obviously biologically significant, origin. In performing this classification, we had to remove a number of effects before the new classification became clear. For example, one class that showed up very strongly turned out to be due to a mislabeling of the intron boundary creating essentially nonsense data. AutoClass put these cases in a separate class because they did not resemble any of the other cases. Other classes found by AutoClass turned out to be due to gene duplication events that produced introns that were close copies of each other. In order to remove the masking effect of these very tight classes, the duplicated introns were removed from the data. Only when these known but uninteresting effects were eliminated could really new classes be found. This has been our experience on many other databases, so that it is clear that use of AutoClass as an exploratory data analysis tool is not a one step process, but involves many passes with an expert interpreting the results, and new or modified data at each step. Human diagnosticians bring much additional information, beyond the instance data, to the diagnostic task. AutoClass is able to find classes in highly multidimensional data and large data sets that overwhelm human cognitive abilities, and so provide compact descriptions of the data for human interpretation.

7. Discussion

We have described a Bayesian approach to performing unsupervised classification on vector-valued databases, and given the essentials of our implementation, the AutoClass program. There are several significant advantages over most previous classification work:

• Classes are described as probability distributions over the attribute space. In principle, such distributions can be applied to any type of data, and data types can be combined without restriction.

• Class membership is given as a probability rather than as an assignment. This eliminates boundary region brittleness while identifying boundary instances for considered decision at a later time.

• Within this Bayesian approach the model form's posterior probability, P(T | X, I), provides a universally applicable criterion for rating alternative classifications, regardless of the number of classes or any other measure of model complexity. Because it incorporates the parameter priors, overfitting of the data is precluded. No such criterion is possible for conventional likelihood-based methods.

• The Bayesian approach is not limited to the mixture model of equation (2), where class membership is implicitly exclusive although unknown. Other interclass relations may be represented. Consider medical diagnosis, where multiple independent causes (country of origin, specific previous infections, etc.) may be operating simultaneously. The mixture model seeks the cross product of those causes. A model of overlapping classes interacting multiplicatively might be much more efficient.

8. References

We have given only the essential details of the class probability models. The paper by Hanson et al. gives further details. A technical report giving full details of the currently implemented system should be available by the time this is published. Address requests to [email protected].

References

[Aitchison & Brown, 1957] J. Aitchison and J. A. C. Brown. The Lognormal Distribution. University Press, Cambridge, 1957.

[Boulton & Wallace, 1973] D. M. Boulton and C. S. Wallace. An Information Measure of Hierarchic Classification. Computer Journal, 16(3), pp 57-63, 1973.

[Box & Tiao, 1972] G. E. P. Box and G. C. Tiao. Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, Mass., 1973. John Wiley & Sons, New York, 1992.

[Cheeseman et al., 1989] P. Cheeseman, J. Stutz, M. Self, W. Taylor, J. Goebel, K. Volk, H. Walker. Automatic Classification of Spectra From the Infrared Astronomical Satellite (IRAS). NASA Ref. Publ. #1217, 1989.

[Dempster et al., 1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.

[Dillon & Goldstein, 1984] W. Dillon and M. Goldstein. Multivariate Analysis: Methods and Applications, chapter 3. Wiley, 1984.

[Everitt & Hand, 1981] B. S. Everitt and D. J. Hand. Finite Mixture Distributions. Monographs on Applied Probability and Statistics, Chapman and Hall, London, England, 1981. Extensive bibliography.

[Goebel et al., 1989] J. Goebel, K. Volk, H. Walker, F. Gerbault, P. Cheeseman, M. Self, J. Stutz, and W. Taylor. A Bayesian classification of the IRAS LRS Atlas. Astron. Astrophys., 222, L5-L8, 1989.

[Hanson et al., 1991] R. Hanson, J. Stutz, and P. Cheeseman. Bayesian classification with correlation and inheritance. In 12th International Joint Conference on Artificial Intelligence, pages 692-698, Sydney, 1991.

[Loredo, 1992] Thomas Loredo. The Promise of Bayesian Inference for Astrophysics. In E. Feigelson and G. Babu (eds.), Statistical Challenges in Modern Astronomy, Springer-Verlag, 1992.

[Mardia et al., 1979] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press, New York, 1979.

[Titterington et al., 1985] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, New York, 1985.


EVOLUTION REVIEW OF BAYESCALC, A MATHEMATICA™ PACKAGE FOR DOING BAYESIAN CALCULATIONS

Paul Desmedt*, Ignace Lemahieu†
Department of Electronics and Information Systems, University of Ghent, St.-Pietersnieuwstraat 41, B-9000 Ghent, Belgium

K. Thielemans‡
Katholieke Universiteit Leuven, Instituut voor Theoretische Fysica, Celestijnenlaan 200 D, B-3001 Leuven, Belgium

ABSTRACT.

The application of Bayesian probability theory requires only a few rules: the sum rule, the product rule and the marginalization procedure. However, in practice Bayesian computations can become tedious. The package BayesCalc implements the rules governing Bayesian probability theory in a Mathematica framework. Consequently BayesCalc can help introduce Bayesian theory to newcomers and facilitate computations for regular Bayesians.

The implemented rules enable the calculation of posterior probabilities from probabilistic relations. The main rules are the product rule and the marginalization rule.

The previous version of BayesCalc dealt with symbolic calculations. Unfortunately, problems arise with some symbolic operations, especially integrations. To overcome this problem, numerical versions of many operations were added to the package. Some additional utilities are offered: decision theory, hypothesis testing and discrete ranges for parameters.

1. Introduction

To master Bayesian probability theory only a few rules are needed: the sum rule, the product rule and the marginalization procedure. Unfortunately, practical application of these rules is often complicated by the requested mathematical manipulations (e.g., integrations). The package BayesCalc implements the rules governing Bayesian probability theory in a Mathematica framework. Therefore BayesCalc is a valuable tool to learn Bayesian theory and it can also simplify mathematical manipulations in more advanced Bayesian applications.

Mathematica¹ is a program for doing symbolic mathematical manipulations by computer. These manipulations are performed according to some built-in mathematical rules. Mathematica allows the user to extend these rules. Here the package BayesCalc is presented.

* Supported by a grant from IWONL, Brussels, Belgium.
† Research associate with the NFWO, Brussels, Belgium.
‡ Current address: Theoretical Physics Group, Imperial College, London SW7 2BZ, UK.
¹ Mathematica is a trademark of Wolfram Research Inc. For details, see [1].


This package implements most of the rules needed for the application of Bayesian probability theory.

Because Bayesian probability theory uses only a restricted number of rules, it forms an excellent subject for implementation in Mathematica. Furthermore the application of the rules of Bayesian probability theory is straightforward. This elegant property further limits the complexity of the implementation.

Last year a version of the package BayesCalc was introduced [2]. The main concern of that package was the automatic symbolic calculation of posterior probabilities. These posterior probabilities are calculated from a number of probabilistic relations (prior probabilities, sampling distributions, ...) and parameter ranges.

The principal extensions of the evolved version of BayesCalc are the numerical calculation of posterior probabilities, hypothesis testing, decision theory and discrete parameter ranges.

In the next section the general features offered by the Mathematica package BayesCalc are reviewed. No implementation details are discussed in this paper. A technical review of the package is rendered in [3]. The general principles of Bayesian probability theory are found in [4, 5, 6]. Section 3 focuses on the additions to the package.

Notation: Mathematica input and output is written in typewriter font. The Mathematica user interface is simulated by preceding input lines by "In[n] :=", and output statements by "Out[n] =", where n is the Mathematica line number. If n = 1, no previous inputs to the Mathematica package are required.

2. General features of the BayesCalc package

The general notation for a conditional probability is BP[{A}, {B}], which stands for BP(A | BI). Note that the curly brackets "{}" determine which hypotheses are on the right or the left of the conditional sign. If a hypothesis consists of a set of hypotheses, i.e. the logical product of the hypotheses, the distinct hypotheses are separated by commas.

Before any posterior probabilities can be calculated, the specification of the proper information J is required. Consider first the specification of probabilistic relations. The general procedure to specify probabilistic relations has the syntax:

DefineBP[{A}, {B}, userFunc[A, B]]                                    (1)

In conventional notation this means: BP(A | BI) = userFunc(A, B). For instance

DefineBP[{x}, {mu, sigma}, Gauss[x, mu, sigma]]                       (2)

specifies that the measurement x has a Gaussian probability distribution with mean mu and variance sigma^2.

Note that the package BayesCalc assumes that two hypotheses are independent unless a probabilistic relation of the form (1) links them.

The second part of the prior information consists of the definition of the ranges of the parameters. The input

DefineRange[a, {b, c}] (3)

Page 135: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

EVOLUTION REVIEW OF BAYESCALC 129

specifies that the parameter a can have values between b and c. Once the proper information I concerning the particular problem is specified, the requested probability relation can be calculated by

Posterior[BP[{A}, {B}]] (4)

The previous form is immediately transformed with the product rule to

BP[{A, B}] / BP[{B}]                                                  (5)

The numerator will be expanded by recursive use of the product rule. This expansion will ultimately result in a product of user-defined probabilistic relations. The denominator of this expression is a normalization constant and will in general not be expandable in a product of user-specified probabilistic relations. As a consequence, the denominator will usually be returned unchanged.

If the package was unable to find an explicit expression for the joint probability, equation (5) is returned. The recursive expansion procedure is explained in [3].

In numerous applications, the result returned by Posterior will be sufficient to solve the problem at hand . However, sometimes the normalized version of this expression is desired. The procedure to obtain the normalized version is:

Normalize[probFunc[A, B], {A}] (6)

where probFunc[A,B] is a general probabilistic relation and {A} is the set of parameters for which the normalization is performed.

Here is an example of the calculation of posterior probabilities which deals with the estimation of the Poisson rate b from a radio-active source. The number of counts nb emitted by this source in a time T is measured. The BayesCalc specification of the problem is given by (7). First the package is loaded (Needs). Then the range and prior probability for the rate b are specified . Finally the direct probability for the number of counts nb is set to a Poisson distribution.

In[1] :=
    Needs["BayesCalc`"]
    DefineRange[b, {0, Infinity}]
    DefineBP[{b}, 1/b]
    DefineBP[{nb}, {b}, Poisson[nb, T b]]                             (7)

The posterior probability for the rate is obtained by

In[2] := Posterior[BP[{b}, {nb}]]                                     (8)

Out[2] = (1 / BP[{nb}]) T^nb b^(nb-1) / (nb! Exp[T b])                (9)

The normalized version is obtained by²:

In[3] := Normalize[%, b]                                              (10)

² "%" is a Mathematica shorthand for the last output.


The result of this operation is

Out[3] = T^nb b^(nb-1) / (Exp[T b] Gamma[nb])                         (11)

which equals formula (41) of [7].

Of course, once the required probabilistic relation is obtained, all built-in Mathematica routines are available for further manipulations. For instance one can obtain plots, perform differentiations, ...
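For readers without Mathematica, the same symbolic steps can be reproduced with an open-source computer algebra system. The following Python/SymPy sketch is our own illustration (it is not part of BayesCalc); it multiplies the 1/b prior by the Poisson likelihood and normalizes over b, recovering the Gamma-form posterior of equation (11).

    import sympy as sp

    # Symbols: rate b > 0, observation time T > 0, observed counts nb (positive integer).
    b, T = sp.symbols('b T', positive=True)
    nb = sp.symbols('nb', positive=True, integer=True)

    prior = 1 / b                                                  # improper 1/b prior for the rate
    likelihood = sp.exp(-T * b) * (T * b)**nb / sp.factorial(nb)   # Poisson(nb | T b)

    joint = prior * likelihood                  # proportional to the posterior
    Z = sp.integrate(joint, (b, 0, sp.oo))      # normalization constant
    posterior = sp.simplify(joint / Z)

    print(posterior)   # equivalent to T**nb * b**(nb - 1) * exp(-T*b) / gamma(nb)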

3. Extensions of the package

It should be noted that the Mathematica integration routines may fail to find the integrals needed for normalization and marginalization. Therefore, some frequently occurring integrals were precomputed and stored in BayesCalc. However, even with these precomputed integrals symbolic integrations are hazardous. Numerical versions of the integration routines are available now. The syntax of a numerical routine is derived from the symbolic routine.

For instance, the symbolic routine

Posterior[BP[{A}, {B}]] (12)

becomes

NPosterior[BP[{A}, {B}], {S}]                                         (13)

The numerical routine has an "N" prepended to the name of the symbolic routine. Furthermore an argument ({S}) is added to the function call. This argument contains the value assignments of parameters to allow the numerical computations. Assignments are given in the standard Mathematica notation, e.g.,

{a -> 10.0, b -> 0.0}                                                 (14)

Here is an example of the calculation for measurement data obtained from independent Gaussian measurements. The posterior probability is examined for the mean mu and spread sigma of the Gaussian distribution.

In (15) the set-up of the problem is given. The function Reset[] serves to delete all previously user-defined probabilistic relations. The use of Reset at the start of every new problem is recommended.

In[1] :=
    Reset[]
    DefineRange[mu, {0, 20}]
    DefineBP[{mu}, Uniform[mu]]
    DefineRange[sigma, {5, 10}]
    DefineBP[{sigma}, Jeffreys[sigma]]
    DefineBP[{data}, {mu, sigma}, Gauss[data, mu, sigma]]             (15)

The normalized posterior is requested by

NormalizedPosterior[BP[{mu, sigma}, {data}]]                          (16)


Mathematica rule                                Conventional meaning

Information specification
  Reset[]                                       clears all previously user-specified probabilistic settings
  AllDependent[]                                make all hypotheses dependent on each other
  Dependent[{A},{B}]                            explicitly make hypotheses A dependent on hypotheses B
  Independent[{A},{B}]                          explicitly make hypotheses A independent of hypotheses B
  DefineBP[{A},{B},probFunc[A,B]]               BP(A | BI) = probFunc(A, B)
  DefineBP[{A},probFunc[A]]                     BP(A | I) = probFunc(A)
  DefineRange[a,{b,c,d}]                        a is a discrete parameter with values from b up to c in increments of d
  DefineRange[a,{b,c}]                          a ∈ [b, c]

Probability relations
  Gauss[x,mu,sigma]                             (1/Sqrt[2 Pi sigma^2]) Exp[-(x-mu)^2/(2 sigma^2)]
  Poisson[x,rate]                               Exp[-rate] rate^x / x!
  Uniform[a]                                    uniform prior for parameter a
  Jeffreys[a]                                   Jeffreys prior for parameter a
  ImproperUniform[a]                            non-normalized uniform prior
  ImproperJeffreys[a]                           non-normalized Jeffreys prior

Utilities
  Posterior[BP[{A},{B}]]                        calculate the probability BP(A | BI)
  NormalizedPosterior[BP[{A},{B}]]              calculate the normalized posterior probability BP(A | BI)
  Normalize[probFunc[A,B],{C}]                  normalize probFunc[A,B] for hypotheses C
  Marginalize[probFunc[A,B],{C}]                marginalize probFunc[A,B] for hypotheses C
  Mean[probFunc[A],c]                           calculate the mean
  Stdev[probFunc[A],c]                          calculate the standard deviation
  Moment[probFunc[A],n,c]                       calculate the n-th order moment
  HypothesisTest[{M1},{N1},{M2},{N2},{D}]       calculate the ratio bp(M1|D)/bp(M2|D), with N1 the nuisance parameters of M1 and N2 those of M2
  BestEstimate[{A},{B},{C},lossFunc,{P}]        find optimal values for P, with bp(A|C) the probability obtained by marginalizing bp(AB|C) and lossFunc the loss function

Table 1: Overview of routines provided by BayesCalc


Mathematica rule                                    Conventional meaning

Numerical utilities
  NPosterior[BP[{A},{B}],{S}]                       same as NNormalizedPosterior
  NNormalizedPosterior[BP[{A},{B}],{S}]             calculate a numerical value of the posterior probability BP(A | BI) with assignments S
  NNormalize[probFunc[A,B],{C},{S}]                 numerical normalization
  NMarginalize[probFunc[A,B],{C},{S}]               numerical marginalization
  NMean[probFunc[A],c,{S}]                          numerical calculation of the mean
  NStdev[probFunc[A],c,{S}]                         numerical calculation of the standard deviation
  NMoment[probFunc[A],n,c,{S}]                      numerical calculation of the n-th order moment
  NHypothesisTest[{M1},{N1},{M2},{N2},{D},{S}]      numerical calculation of the hypothesis test

Basic tools
  GetConstant[probFunc,{A}]                         selects the terms of probFunc independent of the parameters A
  GetProportional[probFunc,{A}]                     selects the terms of probFunc dependent on the parameters A
  AssignData[name,{D}]                              makes an assignment list, i.e., {name[1] -> D[[1]], name[2] -> D[[2]], ...}

Logical "or" and logical "and"
  BP[{LogicalOr[a,b]}]                              bp(a + b | I)
  BP[{LogicalOr[a,LogicalAnd[b,c]]}]                bp(a + bc | I)
  BP[{a, b}]                                        bp(ab | I)

Table 2: Overview of routines provided by BayesCalc (continued)


Mathematica will be unable to perform the integrations required for the normalization, but the numerical normalization is feasible. For data = 10, the value of the normalized posterior probability is

In[2] := NNormalizedPosterior[BP[{mu, sigma}, {data}], {data -> 10}]

Out[2] = 1.22238 / (E^((10 - mu)^2/(2 sigma^2)) sigma Sqrt[Pi sigma^2])        (17)

The next extension deals with hypothesis testing. In hypothesis testing the ratio K of the posterior probabilities of two models (M1 and M0) conditional on some data D is calculated:

K = bp(M1 | DI) / bp(M0 | DI).                                        (18)

If K is greater than one, model M1 is more probable on the data D than model M0. In general some nuisance parameters will be present in both hypotheses. Nuisance parameters determine the ability of a model to adapt itself to measurements. The actual value of the nuisance parameters is however not of interest. If for instance the hypothesis test tries to find out whether a linear or a quadratic function fits a set of data points, the nuisance parameters are the coefficients of the linear and the quadratic function. These nuisance parameters are marginalized out. The ratio K can be calculated by the BayesCalc routine

HypothesisTest[{m1}, {nuis1}, {m0}, {nuis0}, {D}].                    (19)

Here m1 and m0 are the models to compare. The nuisance parameters of the two models are nuis1 and nuis0. The hypotheses (data) on which both m1 and m0 depend are given by D.
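The linear-versus-quadratic example just mentioned can be sketched independently of BayesCalc. In the following Python illustration (our own, with invented data; the noise level sigma and the prior scale tau are assumed known) the polynomial coefficients, the nuisance parameters, are marginalized analytically under Gaussian priors, and the ratio of the resulting evidences plays the role of K in (18).

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 30)
    y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(0.0, 0.1, x.size)   # invented data

    sigma, tau = 0.1, 2.0            # assumed known noise level and prior scale

    def log_evidence(degree):
        # Marginalizing Gaussian coefficients w ~ N(0, tau^2 I) out of a Gaussian
        # likelihood gives y ~ N(0, sigma^2 I + tau^2 Phi Phi^T).
        Phi = np.vander(x, degree + 1, increasing=True)
        cov = sigma**2 * np.eye(x.size) + tau**2 * Phi @ Phi.T
        return multivariate_normal(mean=np.zeros(x.size), cov=cov).logpdf(y)

    K = np.exp(log_evidence(2) - log_evidence(1))   # quadratic vs linear, cf. (18)
    print(f"ratio K (quadratic over linear): {K:.3g}")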

The decision theory extension of Bayesian probability is also present in the evolved version of BayesCalc. This extension allows the selection of the best estimate of a parameter given the posterior probability distribution and the loss function. The loss function lossFunc gives the penalty for making mistakes when choosing a parameter value. The routine that finds the best estimates for the parameters {P} is

BestEstimate[{A}, {B}, lossFunc, {P}]                                 (20)

This routine will first calculate the posterior probability BP[{A}, {B}]. Then it will find the values for the parameters {P} that minimize the expected loss.
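A minimal sketch of this idea (our own illustration, not BayesCalc code; the posterior used here is an assumed Gaussian on a grid) computes the expected loss on a discretized posterior and picks the estimate that minimizes it; quadratic loss leads to the posterior mean, absolute loss to the posterior median.

    import numpy as np

    theta = np.linspace(0.0, 20.0, 2001)              # parameter grid
    dtheta = theta[1] - theta[0]
    post = np.exp(-0.5 * ((theta - 7.3) / 1.5) ** 2)  # assumed (unnormalized) posterior
    post /= post.sum() * dtheta                       # normalize numerically

    def expected_loss(estimate, loss):
        return np.sum(loss(estimate, theta) * post) * dtheta

    quadratic = lambda a, t: (a - t) ** 2             # quadratic loss -> posterior mean
    absolute = lambda a, t: np.abs(a - t)             # absolute loss  -> posterior median

    for name, loss in [("quadratic", quadratic), ("absolute", absolute)]:
        best = min(theta, key=lambda a: expected_loss(a, loss))
        print(name, "loss, best estimate:", round(float(best), 2))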

A full list of the available functions and the corresponding conventional meaning is given in Tables 1 and 2. Help for the different routines can be obtained within Mathematica by typing ?FunctionName, which will give the syntax of the function together with a short explanation.

Note that many routines of BayesCalc have multiple syntaxes. Not all the possible syntaxes are listed in Tables 1 and 2. The multiple syntax offers some flexibility to the user. If for instance the posterior probability BP[{A}, {B}] was already calculated, there is no need to calculate it again to find the best estimate. The routine BestEstimate has a syntax that allows the reuse of the posterior probability:

BestEstimate[posterior, lossFunc, {P}]. (21)


Here posterior stands for the previously calculated posterior probability BP[{A}, {B}]. Also included in the latest version of BayesCalc is the handling of discrete parameters.

The parameters are discrete or continuous according to the form of DefineRange (see Table 1). If the range specified in DefineRange has three arguments, the parameter is considered discrete. With two arguments the parameter has a continuous range.

4. Obtaining the BayesCalc package

The package is freely available by anonymous ftp. The ftp address is ftp.elis.rug.ac.be (login: anonymous, password: email address). You will find the code and examples in the directory pub/MEDISIP. You are encouraged to report your comments by electronic ([email protected]) or normal mail to the first author.

5. Conclusions

The presented BayesCalc package allows the automatic calculation of probabilities once the proper specifications are given. Most operations needed in Bayesian probability theory are performed by the package. Included are marginalization, hypothesis testing and decision theory. These operations can be performed symbolically as well as numerically.

References

[1] S. Wolfram, Mathematica, A System for Doing Mathematics by Computer, Addison-Wesley Publishing Company, Inc., 1991

[2] P. Desmedt, I. Lemahieu, K. Thielemans, A Mathematica package for symbolic Bayesian calculations, to be published in proceedings of MaxEnt 1993, Santa Barbara, 1993

[3] P. Desmedt, K. Thielemans, Technical aspects of the Mathematica package BayesCalc, ELIS Technical Report DG 93-13, 1993

[4] E. T. Jaynes, Probability theory as logic, to be published

[5] J. O. Berger, Statistical decision theory and Bayesian analysis, Springer-Verlag, 1985

[6] H. Jeffreys, Theory of Probability, Oxford University Press, 1939

[7] T. J. Loredo, From Laplace to supernova SN 1987A: Bayesian inference in astrophysics, in Maximum Entropy and Bayesian Methods, Dartmouth, Reidel, 1990, pp. 81-142


BAYESIAN INFERENCE FOR BASIS FUNCTION SELECTION IN NONLINEAR SYSTEM IDENTIFICATION USING GENETIC ALGORITHMS

Visakan Kadirkamanathan
Department of Automatic Control & Systems Engineering, University of Sheffield, UK

ABSTRACT.

In this paper, an algorithm based on Bayesian model comparison is developed to determine the most probable model amongst a large number of models formed from a wider class of basis functions. The models consist of linear coefficients and nonlinear basis functions, which may themselves be parametrised, with different models constructed from different subsets of basis functions. By a suitable encoding, genetic algorithms are used to search over the space of all possible subsets of basis functions to determine the most probable model that describes the given observations.

1. Introduction

Modelling or identifying unknown nonlinear systems can be approached from an approximation or non-parametric estimation viewpoint, resulting in the use of Volterra polynomials and neural network models for such tasks [2], [8], [7]. These models construct the underlying mapping by a linear combination of a set of basis functions that are parametrised. Estimation of these nonlinearly appearing parameters leads in general to increased estimation time and can suffer from local optima. Furthermore, the number of basis functions used is critical in obtaining a good approximation to the underlying system, the problem being similar to the 'overfitting' problem in interpolation.

The Bayesian framework developed for model comparison by Gull, Skilling and others at Cambridge [4], [12], [11], was used by MacKay [8] to demonstrate how different neural networks and models can be compared (at the second level of inference) and the most probable model chosen as the best approximation to the underlying system. Here, we adopt the procedure outlined in [8] and extend the scope by developing an algorithm to exhaustively search over a wider class of models.

First, a set of basis functions is selected. A model will consist of basis functions which are a subset of this set. These basis functions have fixed parameters and are not estimated using the data. At the first level of inference, the coefficients are estimated for the model and at the second level, its evidence is computed. A search is carried out over the space defined by the possible combinations of the set of basis functions, using genetic algorithms [3], to determine the model with the largest evidence. Even for a moderate number of basis functions in a set, the number of possible models constructed from different subsets becomes very large. The application of the above procedure to data generated by a large pilot-scale liquid-level nonlinear system is provided for the Volterra polynomial and radial basis function [9] models.


2. Nonlinear System Identification

A general nonlinear system can be described in terms of a set of dynamical equations involving the input and output of the system. One particular form of description is the Nonlinear Auto Regressive Moving Average with eXogenous input (NARMAX) model [2]. For a discrete-time single input single output (SISO) system, this is given by,

y(t) = f[y(t-1), ..., y(t-ny), u(t-1), ..., u(t-nu), e(t-1), ..., e(t-ne)] + e(t)          (1)

where f(·) is a nonlinear function, y ∈ R is the output, u ∈ R is the input and e is the noise/disturbance, with the corresponding maximum delays ny, nu, ne. A sub-class of NARMAX models is the Nonlinear Auto Regressive with eXogenous input (NARX) model, considered in this paper, where,

y(t) = f[x(t)] + e(t)                                                 (2)

and x(t) = [y(t-1), ..., y(t-ny), u(t-1), ..., u(t-nu)]. The nonlinear function f(·) and the corresponding maximum delays are unknown and need to be estimated in the identification task based on the input-output observations {(x_n, y_n) | n = 1, ..., N}. The advantage of this reduced representation is that x is completely known prior to estimation, unlike in the NARMAX case, and therefore the estimation procedure is much simpler.

One approach to estimating f(·) is to use approximation techniques such as Volterra polynomials, or non-parametric estimation techniques such as radial basis functions and neural networks [2]. In these techniques, the output for the model chosen is a linear combination of a set of basis functions, given by,

f(x; p) = Σ_{k=1}^{K} w_k φ_k(x; d_k)                                 (3)

where φ_k(·) are the basis functions, d_k are the parameters in the basis functions and w_k are the linear coefficients. The vector p = [..., w_k, d_k, ...] is the parameter vector for the model chosen for identification.

The identification problem can be split into two sub-problems of model structure selection and model estimation. In model structure selection, choices are made such as the number and specific form of the basis functions, and the input vector including the maximum delay parameters. Once the model structure is selected, the relevant parameters such as the linear coefficients are estimated. Since there is little a priori knowledge about the model structure, several models must be chosen and their modelling performances compared in arriving at the most probable model that fits the system observations.
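As a concrete illustration of equations (2) and (3), the following Python sketch (our own; the lag choices, basis set and data are invented) builds the NARX regressor x(t) from lagged outputs and inputs, evaluates a fixed set of basis functions, and estimates the linear coefficients w by least squares.

    import numpy as np

    # Invented NARX example: n_y = n_u = 2 lags, and a small fixed basis set.
    rng = np.random.default_rng(1)
    N = 200
    u = rng.normal(size=N)
    y = np.zeros(N)
    for t in range(2, N):          # invented nonlinear system used to generate data
        y[t] = 0.5 * y[t - 1] - 0.2 * y[t - 2] + 0.3 * u[t - 1] ** 2 + 0.01 * rng.normal()

    def regressor(t):
        # x(t) = [y(t-1), y(t-2), u(t-1), u(t-2)]
        return np.array([y[t - 1], y[t - 2], u[t - 1], u[t - 2]])

    def basis(x):
        # Fixed basis functions phi_k(x): constant, linear terms, two squares, one cross term.
        return np.concatenate(([1.0], x, x[:2] ** 2, [x[0] * x[2]]))

    times = np.arange(2, N)
    Phi = np.array([basis(regressor(t)) for t in times])   # design matrix, one row per t
    w, *_ = np.linalg.lstsq(Phi, y[times], rcond=None)     # linear coefficients of eq. (3)
    print("estimated coefficients:", np.round(w, 3))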

3. Genetic Algorithms for Basis Function Selection

The procedure described here is applicable to a wider class of models, although the ideas are illustrated through the use of two particular classes of models. The first is the Volterra Polynomial Basis Functions (VPBF) model in which the multivariate polynomials up to order 2 are used as basis functions. The functional form of the model is given by equation (3), with the basis functions given by,

φ_k(x) ∈ {1, x_i, x_i x_j : 1 ≤ i ≤ j ≤ M}                            (4)


with the number of basis functions K0 = 1 + M + ½M(M+1) and x ∈ R^M. Note that there are no parameters associated with the basis functions and hence d_k is null. Once the input x is chosen, only the linear coefficients w need to be estimated, a computationally simpler task.

A difficulty with the use of the VPBF model is the prohibitively high number of basis functions in the set {φ}. With a finite data set of observations, the presence of inappropriate basis functions can lead to bias and greater uncertainty in the estimates for the coefficients. It is more likely that a model formed with {φ_h}, basis functions that are a subset of {φ}, will form a better approximation to the unknown nonlinearity. A search must therefore be carried out over all possible subsets {φ_h} to find the best subset, and hence the best VPBF model structure. Furthermore, as long as M1 > ny, M2 > nu and M = M1 + M2, the problem of estimating the maximum delay parameters would be absorbed into the same subset search task, as would the selection of the number of basis functions.

The second class of models is the Gaussian radial basis function (GRBF) model [9], whose functional form is also given by equation (3) with basis functions of the form,

φ_k(x; d_k) = exp{-½ (x - d_k)^T C_k (x - d_k)}                       (5)

where C_k ∈ R^{M×M} (= I, the identity matrix, here) is the weighting matrix of the kth basis function, whose centre is d_k ∈ R^M.

The basis function parameters of the RBF model are estimated in several ways. The procedure developed first in interpolation tasks is to choose d_k = x_n for all k = n = 1, 2, ..., N, the centres being placed on all the data points in the (model) input space [9]. This was extended to using either a random set of values or a subset of {x_n}, n = 1, ..., N, for the d_k [1]. Alternatively, the parameters d_k can be estimated together with the coefficients w, a nonlinear estimation task requiring an iterative optimisation scheme which is liable to suffer from local optima [6]. Yet another procedure is to add basis functions at appropriate stages of the estimation procedure [5].

Determination of the d_k may be viewed as a problem of model structure selection in which the best set of basis functions is searched for. Unlike in the case of the VPBF model, there are infinitely many possible values for d_k and hence the complete set {φ} would be infinite. A compromise on the size of {φ} can be made by choosing the d_k values to be composed of random values and a subset of {x_n}. Since the set {φ} is formed prior to estimation, linear independence of the basis functions in {φ} is ensured by appropriate selection.

Let the size of {φ}, i.e., the total number of basis functions, be K0. Then there are 2^{K0} subsets {φ_h}, h = 1, ..., 2^{K0}, including the null subset, and hence that many different models can be formed. The search for the best {φ_h} is a combinatorial optimisation task of finding the optimum vertex of the K0-dimensional hypercube. This optimisation task can be carried out using genetic algorithms (GA) [3], where each model formed with {φ_h} is expressed by a K0-bit binary model code c, i.e., a chromosome representation in GA. The 1-bits of the code c relate to the selected subset of basis functions from {φ} and the 0-bits relate to the omitted ones. For example, with {φ} = {φ_1, φ_2, ..., φ_{K0}}, a code c = [1 0 0 1 0 0 1 0 ...] represents the model

f(x; p) = w_1 φ_1(x) + w_4 φ_4(x) + w_7 φ_7(x) + ...
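A minimal sketch of this encoding (our own illustration, with an invented basis set and data): the 1-bits of the chromosome pick out columns of the full design matrix, and only the corresponding coefficients are fitted.

    import numpy as np

    # Hypothetical full basis set evaluated on invented inputs: Phi_full has one
    # column per basis function phi_k, K0 columns in total.
    rng = np.random.default_rng(2)
    N, K0 = 100, 10
    Phi_full = rng.normal(size=(N, K0))
    y = Phi_full[:, [0, 3, 6]] @ np.array([1.0, -0.5, 0.25]) + 0.05 * rng.normal(size=N)

    code = np.zeros(K0, dtype=int)
    code[[0, 3, 6]] = 1                  # chromosome c = [1 0 0 1 0 0 1 0 0 0]

    selected = np.flatnonzero(code)      # indices of the 1-bits
    Phi_h = Phi_full[:, selected]        # basis subset {phi_h} for this model
    w, *_ = np.linalg.lstsq(Phi_h, y, rcond=None)
    print(dict(zip(selected.tolist(), np.round(w, 3))))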

The use of GAs for the VPBF model is due to Fitzgerald (personal communication). While the use of GAs in structure selection of multilayer perceptron-like neural networks has been studied widely [10], its use in RBF centre selection was introduced in [7]. In [7], a real-valued chromosome is also used for estimating d_k, unlike in this paper where the basis function parameters are chosen a priori.

In a GA, each generation (iteration) consists of individual chromosomes (models formed with {φ_h}) of a particular population size (number of models). Each individual has a particular fitness measure associated with it, which in the identification task is a measure of the goodness of fit to the data or, expressed differently, the probability of that particular model being correct amongst all the other models in the population. This probability is proportional to the Bayesian evidence and is computed from the method of Bayesian model comparison developed in [4], [12], [8]. The next generation of individuals is formed by cross-over, mutation, ranking and selection operations of the GA.

4. Bayesian Model Estimation and Comparison

In this section, the theory and the computational procedure developed by Gull [4] and applied to radial basis functions and neural networks by MacKay [8] are outlined briefly. For complete details refer to [4], [8].

Bayesian inference operates at two levels. At the first is model fitting where a chosen model H_h is assumed to be the true underlying model of the system and the model parameters w are inferred from the observation data D = {(x_n, y_n) | n = 1, ..., N}, i.e.,

P(w | D, H_h) = P(D | w, H_h) P(w | H_h) / P(D | H_h)                 (6)

where P(w | H_h) is the prior for w, P(D | w, H_h) is the likelihood of the data D, P(w | D, H_h) is the posterior for w, and P(D | H_h) is called the evidence for H_h. At the second level is model comparison, where the most plausible model, amongst those chosen for estimation, that explains the data is inferred, i.e.,

P(H_h | D) ∝ P(D | H_h) P(H_h)                                        (7)

If there are no compelling reasons for assigning different model priors P(H_h), then P(H_h | D) ∝ P(D | H_h), the model evidence for the data D.

Assume that we have selected a particular {φ_h} of size K as the set of basis functions of the model H_h given by equation (3), so that the only model parameters that need estimation are the linear coefficients w. Under the assumption that e(t) is zero-mean Gaussian with noise variance 1/β, the likelihood is given by,

P(D | w, β, H_h) = (β/2π)^{N/2} exp{-β E_D(D | w, H_h)}               (8)

where

E_D(D | w, H_h) = ½ Σ_{n=1}^{N} [y_n - w^T φ_h(x_n)]²                 (9)

is quadratic in w. Maximising the likelihood to obtain an estimate w_ML may lead to problems of 'over-fitting', where the model fits the noise in the data and as a result is a poor model of the underlying system. The prior therefore reflects the smoothness expected from the model and is chosen as,

P(w | α, H_h) = (α/2π)^{K/2} exp{-α E_W(w | H_h)}                     (10)

where α is the regularising constant and

E_W(w | H_h) = ½ w^T w                                                (11)

is quadratic in w. Under the assumption that α, β are known, applying equation (6) and rewriting the posterior,

P(w | α, β, D, H_h) = (|A|^{1/2} / (2π)^{K/2}) exp{-½ (w - w_MP)^T A (w - w_MP)}          (12)

where A = αI + βB; B = Σ_{n=1}^{N} φ_h(x_n) φ_h^T(x_n); and w_MP is the posterior estimate given by,

w_MP = β A^{-1} B w_ML                                                (13)

where w_ML = B^{-1} Σ_{n=1}^{N} φ_h(x_n) y_n is the maximum likelihood estimate. The evidence for α, β and model H_h is obtained by integrating the posterior over w, which gives,

P(D | α, β, H_h) = (β/2π)^{N/2} α^{K/2} |A|^{-1/2} exp{-α E_W(w_MP) - β E_D(w_MP)}        (14)

However, α, β are unknown in general and Bayesian theory provides a means of estimating them [4], [8]. From Bayes' law,

P(α, β | D, H_h) = P(D | α, β, H_h) P(α, β) / P(D | H_h)              (15)

with a flat prior for P(α, β) gives P(α, β | D, H_h) ∝ P(D | α, β, H_h), so that maximising this evidence over different α, β gives the required estimates. They are given by the following equations which need to be solved [8]:

2αE_W = γ,    2βE_D = N - γ                                           (16)

where γ = K - α Trace(A^{-1}) is a measure of the effective number of parameters in the model. Since γ depends on α, β through A, the estimates have to be obtained by an iteration,

α^{(i+1)} = γ^{(i)} / (2 E_W^{(i)}),    β^{(i+1)} = (N - γ^{(i)}) / (2 E_D^{(i)})          (17)

where the iterations include re-evaluation of the posterior estimate for the parameters w with the improved hyperparameter estimates. For quadratic E_W, E_D, the optimum estimates are unique and hence the iterations converge fairly quickly.

The model evidence P(D | H_h) is now obtained by integrating P(D | α, β, H_h) over α, β under Gaussian approximations for log(α, β); with the peak at α̂, β̂ this gives [8],

P(D | H_h) ≈ P(D | α̂, β̂, H_h) P(α̂, β̂) (2π) (2/γ)^{1/2} (2/(N-γ))^{1/2}                   (18)

The prior P(α̂, β̂) is assigned to be the same for all of the models used for identification and hence can be ignored for model comparison.
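The following Python sketch is our own illustration of this evidence framework (it is not the author's code; the data and candidate basis set are invented). It iterates equations (13), (16) and (17) for a fixed design matrix and evaluates the logarithm of the evidence (14), which can then be compared across basis-function subsets.

    import numpy as np

    def evidence_fit(Phi, y, alpha=1.0, beta=1.0, iters=50):
        # Phi: (N, K) design matrix of basis values phi_h(x_n); y: (N,) targets.
        N, K = Phi.shape
        B = Phi.T @ Phi
        Phi_t_y = Phi.T @ y
        for _ in range(iters):
            A = alpha * np.eye(K) + beta * B                  # eq. (12)
            w_mp = beta * np.linalg.solve(A, Phi_t_y)         # eq. (13)
            E_w = 0.5 * w_mp @ w_mp                           # eq. (11)
            E_d = 0.5 * np.sum((y - Phi @ w_mp) ** 2)         # eq. (9)
            gamma = K - alpha * np.trace(np.linalg.inv(A))    # effective parameters
            alpha, beta = gamma / (2 * E_w), (N - gamma) / (2 * E_d)   # eq. (17)
        # Recompute quantities at the converged alpha, beta; take the log of eq. (14).
        A = alpha * np.eye(K) + beta * B
        w_mp = beta * np.linalg.solve(A, Phi_t_y)
        E_w, E_d = 0.5 * w_mp @ w_mp, 0.5 * np.sum((y - Phi @ w_mp) ** 2)
        log_ev = (0.5 * N * np.log(beta / (2 * np.pi)) + 0.5 * K * np.log(alpha)
                  - 0.5 * np.linalg.slogdet(A)[1] - alpha * E_w - beta * E_d)
        return w_mp, alpha, beta, log_ev

    # Invented example: three of six candidate basis functions generate the data.
    rng = np.random.default_rng(3)
    x = rng.uniform(-1, 1, size=80)
    Phi_full = np.column_stack([np.ones_like(x), x, x**2, x**3, np.sin(3*x), np.cos(3*x)])
    y = Phi_full[:, [0, 1, 4]] @ np.array([0.3, 1.0, 0.7]) + 0.1 * rng.normal(size=x.size)

    for subset in ([0, 1, 4], [0, 1, 2, 3], list(range(6))):
        *_, log_ev = evidence_fit(Phi_full[:, subset], y)
        print(subset, "log evidence:", round(log_ev, 1))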


5. The Selection and Estimation Algorithm

• Construct {φ}, the set of all basis functions.

• Create a population of binary model codes, i.e., models with {φ_h} for h = 1, ..., h_p.

• For each generation (iteration),

  - Produce new models using genetic algorithm cross-over and mutation operations on the binary model codes.

  - For each model,

    * obtain the posterior estimate w_MP and the estimates for α, β;
    * obtain the model evidence P(D | H_h).

  - Rank the models in decreasing order of model evidence.

  - Select the best h_p models for the next generation.

• Stop when convergence is achieved. (A compact sketch of this loop is given below.)
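This compact Python sketch of the loop is our own illustration (invented data and candidate basis set). For brevity the fitness is the log evidence of equation (14) with fixed hyperparameters α, β; in practice they would be re-estimated for each model as in the previous sketch.

    import numpy as np

    rng = np.random.default_rng(4)
    N, K0, pop_size, n_gen = 120, 12, 20, 40
    alpha, beta = 1.0, 100.0                       # fixed hyperparameters for brevity

    x = rng.uniform(-1, 1, size=N)
    Phi_full = np.column_stack([x**k for k in range(K0)])          # candidate basis set
    y = 0.4 * Phi_full[:, 1] - 0.9 * Phi_full[:, 3] + 0.1 * rng.normal(size=N)

    def log_evidence(code):
        sel = np.flatnonzero(code)
        if sel.size == 0:
            return -np.inf
        Phi, K = Phi_full[:, sel], sel.size
        A = alpha * np.eye(K) + beta * Phi.T @ Phi
        w_mp = beta * np.linalg.solve(A, Phi.T @ y)
        E_w, E_d = 0.5 * w_mp @ w_mp, 0.5 * np.sum((y - Phi @ w_mp) ** 2)
        return (0.5 * N * np.log(beta / (2 * np.pi)) + 0.5 * K * np.log(alpha)
                - 0.5 * np.linalg.slogdet(A)[1] - alpha * E_w - beta * E_d)

    pop = rng.integers(0, 2, size=(pop_size, K0))                  # random binary codes
    for _ in range(n_gen):
        parents = pop[rng.integers(0, pop_size, size=(pop_size, 2))]
        cut = rng.integers(1, K0, size=pop_size)                   # one-point cross-over
        children = np.where(np.arange(K0) < cut[:, None], parents[:, 0], parents[:, 1])
        children ^= (rng.random(children.shape) < 0.05).astype(children.dtype)  # mutation
        pool = np.vstack([pop, children])
        fitness = np.array([log_evidence(c) for c in pool])
        pop = pool[np.argsort(fitness)[::-1][:pop_size]]           # rank and select

    best = pop[0]
    print("selected basis functions:", np.flatnonzero(best), "log evidence:", round(log_evidence(best), 1))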

6. Identification Results

A large pilot-scale liquid-level SISO system was excited by a zero-mean Gaussian input signal and 1000 pairs of input-output data were observed. The first 500 pairs are used in model selection and estimation while the remaining 500 are used for independent validation.

For the VPBF model, maximum delays of ny = nu = 4, giving M = 8, were chosen, resulting in K0 = 45 and h0 = 2^45 ≈ 10^13. For the GRBF model, maximum delays of ny = nu = 2, giving M = 4, were chosen. The basis function set {φ} for the GRBF was formed in four different ways and experimented with for various values of K0. In the first (GRBF1), the d_k are chosen randomly and in the second (GRBF2), as a subset of the input data {x_n}. In the third (GRBF3), a set combining the above two is formed and in the fourth (GRBF4), random values for d_k and ρ_k, where C_k = ρ_k I, are used. Finally, the GRBF4 set and the VPBF model set are combined (VP+RBF) to form {φ}, resulting in a set of different classes of basis functions.

The results were quite consistent over different experiments and only partial results are provided in Table 1. The model type, the size K0 of the basis function set {φ}, the number of basis functions K_h in the model, the Bayesian evidence, the maximum absolute error (Max Err) and the root mean square error (RMSE) for the validation data, and the effective number of parameters γ are given. Note that the model evidence values can only be taken relative to other models in this set of results, as normalisation was not done.

The GRBF models consistently outperformed the VPBF model and the VP+RBF model was the best in terms of Bayesian evidence, even though the number of basis functions used was higher. Note however, that the performances as measured by the maximum absolute error and RMSE show that the VPBF model is an equally good fit, if not better. The discrepancy in the results is due to the marginal difference in behaviour of the system in the second phase compared to the first phase from which data for estimation were obtained.


Model     K0    Kh    Evidence   Max Err   RMSE     γ
VPBF      45    11    765.7      0.254     0.0508   10.84
GRBF1     40    19    781.7      0.303     0.0515   18.53
GRBF1     60    21    811.8      0.312     0.0488   20.06
GRBF1     80    23    815.5      0.334     0.0513   22.18
GRBF1    100    22    819.3      0.303     0.0513   21.44
GRBF2     40    12    809.3      0.323     0.0523   11.96
GRBF2     60     9    823.3      0.339     0.0492    8.96
GRBF3     80    18    814.5      0.330     0.0497   17.37
GRBF4     40    12    851.6      0.312     0.0487   11.82
GRBF4     60    16    848.9      0.307     0.0493   14.37
VP+RBF    40    18    815.6      0.313     0.0524   14.68
VP+RBF    60    23    842.4      0.306     0.0501   18.66
VP+RBF   100    21    861.5      0.397     0.0519   20.46

Table 1: Identification performance for the different model structures

Another reason is due to the global span of the VPBFs in contrast to the local span of the GRBFs which impacts on the type of regularisers used. Furthermore, the Gaussian noise assumption as required in the computational simplification does not hold strongly in this case. Overall however, good fit models were found using this procedure compared to the results in [7].

7. Conclusions

The Bayesian framework of model estimation and comparison allows different classes of models to be compared to one another in problems such as nonlinear system identification, where non-parametric type methods have to be used. The problem of using this framework for models such as radial basis functions is the computational complexity of not only estimating the nonlinearly appearing parameters in the model, but doing so over a number of such models. Added to this is the additional computation involving the Bayesian evidence. Furthermore, the estimation of the parameters can be plagued by local optima.

An algorithm that simplifies the identification or approximation task is presented here, in which the model estimation problem is decomposed into a model structure selection task and a linear coefficient estimation task. This is also shown to allow different classes of basis functions to be included in a larger set of basis functions from which appropriate subsets are formed for each model. Using the Bayesian evidence as a relative measure of the probability that the chosen model is the unknown system, genetic algorithms are used to search over the space of all possible subsets. The algorithm is shown to be useful in selecting the centres of the radial basis functions. Experimental results on an identification task are used to demonstrate the algorithm, in which the basis functions are formed by Volterra polynomials and Gaussian radial basis functions.


Acknowledgements

The author acknowledges the support of the Engineering and Physical Sciences Research Council (EPSRC), UK, under the grant GR/J46661 and thanks Professor S. A. Billings for the data used in the experiments.

References

[1] D. S. Broomhead & D. B. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, Vol. 2, pp. 321-355, 1988.

[2] S. Chen, S. A. Billings & P. M. Grant. Nonlinear system identification using neural networks. International Journal of Control, Vol. 51, No. 6, pp. 1191-1214, 1990.

[3] D. E. Goldberg. Genetic algorithms in search, optimization and machine learning. Addison-Wesley, MA: Reading, 1989.

[4] S. F. Gull. Developments in maximum entropy data analysis. In J. Skilling (ed.) Maximum entropy and Bayesian methods, Kluwer, pp. 53-71, 1989.

[5] V. Kadirkamanathan. A statistical inference based growth criterion for the RBF network. In Proc. IEEE Workshop on Neural Networks for Signal Processing, pp. 12-21, 1994.

[6] V. Kadirkamanathan, M. Niranjan & F. Fallside. Sequential adaptation of radial basis function neural networks and its application to time-series prediction. In D. S. Touretzky (ed.) Advances in Neural Information Processing Systems 3, Morgan Kaufmann, CA: San Mateo, pp. ???-???, 1991.

[7] G. P. Liu & V. Kadirkamanathan. Multiobjective criteria for nonlinear model selection and identification with neural networks. Research Report No. 508, Department of Automatic Control & Systems Engineering, University of Sheffield, UK, March 1994.

[8] D. J. C. MacKay. Bayesian Interpolation. Neural Computation, Vol. 4, No. 3, pp. 415-447, 1992.

[9] M. J. D. Powell. Radial basis functions for multivariable interpolation: A review. In J. C. Mason & M. G. Cox (eds.) Algorithms for approximation, Oxford University Press, Oxford, pp. 143-167, 1987.

[10] J. D. Schaffer & D. Whitley (eds.). Combinations of genetic algorithms and neural networks. IEEE Computer Society Press, CA: Los Alamitos, 1992.

[11] S. Sibisi. Bayesian Interpolation. In W. T. Grandy, Jr. (ed.) Maximum entropy and Bayesian methods, Kluwer, pp. 349-355, 1991.

[12] J. Skilling. On parameter estimation and quantified MaxEnt. In W. T. Grandy, Jr. (ed.) Maximum entropy and Bayesian methods, Kluwer, pp. 267-273, 1991.


THE MEANING OF THE WORD "PROBABILITY"

Myron Tribus
Exergy, Inc., Hayward, CA 94541, USA

1. Introduction

At the MaxEnt 94 conference, delegates were given a small card and asked to write on it their answer to the question: "What does the word 'probability' mean to you?". The replies varied, but the overwhelming response was "Degree of belief".

In this paper I shall argue that this definition is as flawed as the frequentist definition and, through a specific example, will demonstrate why the answer should be: "A numerical encoding of incomplete information".

2. Thinking about Thinking

Figure 1: Three levels of thought and discourse. (The diagram shows three levels: a top level labelled Symbol/Operator, "Pure Mathematics", the logic of logic, which critiques the level below; a middle level labelled Idea/Concept/Abstract, logical operations, "thinking"; and a bottom level labelled Concrete/Percept/Tangible/"real", experience, "experiments".)

In the above diagram, the words in the lower part are associated with what we often call the "real world", by which phrase we refer to things that are accessible to our senses or, in some cases, to the instruments we have developed to extend our senses. We can never be too sure about the "reality" of this world. Schrödinger once put it this way: "I hear a buzzing. Do you hear it too? Or is it only in my ear?". In science we insist that more than one person be able to confirm an experience before we pronounce it "real". In brief, science is concerned with reproducible experiences. If, as in astronomy, the experiences are not reproducible, the observations must be.

Above the line, in the middle of the diagram, the words are associated with what we loosely call "thinking" or "reasoning" or "analysis". In this domain we manipulate abstractions. The abstractions we usually manipulate are representations of things we recognize as "concrete" in the "real world". If it is in our head, it is not concrete, it is abstract. Ed Jaynes has referred to mistakes between these two regions as a "mind projection fallacy", which others have called "mistaking the map for the territory" .

We develop symbols to represent our concepts, writing: "Let m = the mass of" so that m becomes the symbol for the abstract idea of mass which we assign to something concrete in the "real world" . We manipulate these symbols according to various rules we have developed to characterize a field of inquiry. The box indicates this process of reasoning. The results of the process are called a conclusion and the conclusion is compared with the observation as a way to decide if the whole scheme "makes sense".

There is another level, however, and it lies above the middle ground. This level applies critical thinking to the thinking process itself and if anything is found within the middle box which does not satisfy some elementary criteria, the results of the analysis will be rejected, without regard for whether they fit the observations or not. It is at this level that the "rules for reasoning" are established. This box contains criteria for the acceptance or rejection of a scientific hypothesis which are stronger than just agreement with experiment. The inputs to this box do not necessarily have a "real world" counterpart. They are one step removed for they are associated with thinking about thinking.

3. Criteria for Acceptance of a Scientific Hypothesis

When I first published an account of the information theory basis for thermostatics and thermodynamics (Tribus, 1961), basing this work on the publications of Edwin T. Jaynes (1957) and Richard Cox (1961), I met with a great deal of resistance from people trained in classical thermodynamics and classical statistical mechanics. When they subjected my reasoning to their higher order box, they rejected the development. MaxEnt was not accepted.

Over a period of about two years I collected criticisms from various sources and tried to make sense of them. With the help of a couple of professors of philosophy at Dartmouth College, I was able to categorize these objections and, after studying the categories, developed the following four desiderata for the acceptability of a scientific hypothesis. A reasoning scheme which follows these desiderata should be acceptable. Of course it may not be, for the critic may simply be unwilling to abandon an old paradigm or may have an aesthetic preference for the old way. Few people will accept an hypothesis which obviously contradicts these desiderata. The desiderata are:

1. Consistency ... if more than one way exists to "solve a problem" they must all lead to the same answer.

2. Continuity of Method ... the methods adopted should not change abruptly, just because there is a small change in the numbers used to describe the problem.

3. Universality ... the methods should not be ad hoc, that is, conjured up for just one problem and changed when the problem statement changes. There must be one underlying method from which all solutions are developed.

4. Meaningful Statements ... all statements used in the solution process should have unambiguous meanings.

As demonstrated elsewhere (Tribus, 1969) , these statements are sufficient to provide the basis for the works of Cox and Jaynes. They provide a straightforward derivation of the equations:

p(AB | C) = p(A | BC) p(B | C)

p(A | C) + p(~A | C) = 1

As shown also in the same reference, these equations furnish the basis for the development of the Shannon entropy function, S = -Σ_i p_i log p_i.

4. How Should We Regard the Symbols "p" and "S"?

Referring again to the diagram in figure 1, into which level of the diagram shall we put the concepts of probability and entropy? Surely they do not belong in the bottom level. People who equate probability with frequency seem to believe that they do belong at the bottom, or at least are very close to it. They talk about measuring probabilities, for example. But do these concepts really belong to the middle or the top level? How should we decide?

Let us consider some concepts which belong clearly in the top level. The basic operators denoted by the symbols "+, -, ×, /" belong there. It is true that they sometimes have "real world counterparts" in that we think of adding things together or dividing a quantity of liquid into parts. But they are different from other concepts, in that they have a much greater generality. I know of no real world counterpart for division by a complex number, for example.

On the one hand, the concepts of "hard" and "soft" are used to describe every day experiences of touch. These and many other concepts which we learn from schools, the newspapers or just from conversations are usually close to "reality". We talk about "traffic jams" and "interference", connecting each concept with "real world" events or "things" . Such concepts belong to the middle portion of the diagram. They have close "real world" counterparts about which we reason.

On the other hand, there are other concepts, for example Fourier Transforms, which do not necessarily have real world manifestations. Holography comes to mind as a possible manifestation, but the transform is of greater significance than just holography. Using Fourier transforms we are able to invent concepts which are useful in describing new things to measure in the "real world". Think of the concepts which are inspired by making a Fourier transform of a wave. These concepts, such as frequency, amplitude and phase angle belong to the middle part of the diagram. They are closer to "reality". They have "real world" counterparts. The Fourier transform belongs to the top level. It is a generator of concepts.

Page 152: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

146 Tribus

People who study the sciences are simply given concepts by their teachers and have to show they have "understood" them or they cannot pass their examinations. After a while, they cannot "see" something in the "real world", if the percept lies outside their set of concepts. There is no such thing as an immaculate perception.

In the following section I shall demonstrate that probability is an operator for inductive logic and also can be used to generate concepts. Used in this way it is something much more than a statement of "belief".

5. Why Was It Necessary to Develop a Science of Thermodynamics?

To begin, it is useful to recall why the sciences of statistical and classical thermodynamics are needed at all. To keep the arguments simple, we shall consider only Newtonian mechanics and recall that the equation for one-dimensional motion

F_x = m dv/dt = m (dv/dx)(dx/dt) = m v dv/dx                          (1)

may be written

F_x dx = m v dv                                                       (2)

and integrated to give:

∫ F dx = ½ m v² + const                                               (3)

The term "energy" was described by Thomas Young in 1801:

"The term energy may be applied, with great propriety, to the product of the mass or weight of a body into the square of the number expressing its velocity ... "

In due course the constant of the motion (equation 3) was called the "energy" and it was found that in the absence of frictional forces, all manner of problems involving gravity, springs, levers, wheels, masses and fluids could be solved by invoking the principle of conservation of energy.

There was one difficulty, however. In "real" systems there was always some friction present and, therefore, there was always a difference between the conclusions from the "thought model" (the middle part of the diagram) and the "real world" experiences (the bottom part of the diagram). The difference was always in one direction. The "constant of the motion" in the "real world" was always smaller than the corresponding quantity calculated from mental imagery. A theory to explain this difference was necessary.

The principle of conservation of energy was so firmly established that in the presence of friction the only reasonable question was: "Where did the energy go?". Since the existence of atoms and molecules was firmly established by the middle 1800's, the answer obviously was, "Into the atoms and molecules".

At this point in history, there was no general method known for describing the state of motion of a very large number of particles, of order 10^23 of them. At the turn of the century a rudimentary statistical mechanics was under development, but as Planck observed in the introduction to his thermodynamics text (Planck, 1945), statistical mechanics did not provide a firm basis for the derivation of a satisfactory explanation. Planck's displeasure at having to use classical thermodynamics to explain where the energy had gone is evident in this passage from the introduction to the first edition of his book:

Page 153: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods


"This last, more inductive, treatment, which is used exclusively in this book, corresponds best to the present state of the science. It cannot be considered as final, however, but may have in time to yield to a mechanical, or perhaps an electro-magnetic theory. Although it may be of advantage for a time to consider the activities of nature - Heat, Motion, Electricity, etc., - as different in quality, and to suppress the question as to their common nature, still our aspiration after a uniform theory of nature, on a mechanical basis or otherwise, which has derived such powerful encouragement from the principle of the conservation of energy, can never be permanently repressed. Even at the pr'esent day, a recession from the assumption that all physical phenomena are of a common nature would be tantamount to renouncing the comprehension of a number of recognized laws of interaction between different spheres of natural phenomena. Of course, even then, the results we have deduced from the two laws of Thermodynamics would not be invalidated, but these two laws would not be introduced as independent, but would be deduced from other more general propositions. At present, however, no probable limit can be set to the time it will take to reach this goal."

Probability theory, as developed by Richard Cox and Ed Jaynes, culminating in the Maximum Entropy principle, provides the solution Max Planck wished to find almost a century ago. Whether we recognize it as the solution depends very much on how we regard the concept of probability.

6. Incomplete Information

Consider an observer, contemplating a collection of atoms and molecules in some kind of container. The observer knows that these atoms and molecules move, interact with one another and may even change from one kind of molecule to another. There are so many of these atoms and molecules (and other particles) that there is no way to develop a deterministic description of their state of motion. To be precise, when we say that there is no way to develop a deterministic description, we mean there is no way to write down a list of the coordinate positions and vector velocities of all the particles and then proceed to compute the resultant motions. The problem, therefore, is to develop a rational description of the "state" of the system when the information about the state is incomplete.

Our observer is advised to use the methods of maximum entropy as a way to develop such a description. Note that we are not proposing that the observer solve a specific problem for the collection of particles being observed. The objective is to develop a suitable theory which will tell the observer what to observe about the collection of particles. Just as a Fourier analysis prescribes what to measure about a signal so will MaxEnt prescribe what to measure about a system of particles.

The observer cannot measure individual particle energies. Instead the observer will have to rely on an incomplete description. From the theory of probability, the observer chooses to use expectation energy and expectation composition as descriptors.

Please take special notice: I am not saying that the observer knows these expectation values. I am saying that the observer chooses to describe the system using these concepts. If this is the observer's choice, what concepts should the observer invoke to obtain a consistent theory? The answer lies in MaxEnt.

The prescription from MaxEnt is:



Maximize:

$S = -k\sum_i p_i \ln p_i$ (A)

subject to:

$\sum_i p_i = 1$ (A1)

$\sum_i p_i\,\epsilon_i = \langle\epsilon\rangle$ (A2)

$\sum_i p_i\,n_{ic} = \langle n_c\rangle, \qquad c = a, b, \ldots$ (A3)

Note, again, that at this point the observer is not saying that the values for the expectation energy and the expectation composition are known. Rather the observer has decided that the expectation values are to be used as appropriate ways of describing systems for which it is impossible to assign values to the individual elementary particles. In short, based on the MaxEnt methodology, the observer chooses the concepts of expectation energy and composition as state variables. To what "real world" measurements they are to be attached will be decided later.

The entropy has no meaning unless the question to which it is attached has been described. A better notation in equation A would be:

$S = S(Q|X)$ (B)

where

Q = A definite question

X = What is known about Q

By a "definite question" we mean a question for which all of the possible answers are defined, but the observer does not know which of the answers is the correct one.

When we consider a well defined question, Q, to which we do not know the correct answer, we say we are "uncertain". The magnitude of our uncertainty is measured by the entropy. When we do not know what the question is, then we are more than uncertain, we are confused. No one has yet developed a measure for confusion.

The observer's X provides the following information: a) Q = "In what quantum state is the system?" b) X = "Matter is not infinitely divisible. It is discontinuous."

"The particles into which matter may be divided move. They have energy and mo­mentum. The energy and momentum are not infinitely divisible."

"The states of a system are, therefore, limited to well defined quantum states 1, iden­tified by the subscript i."

¹ I recognize that these states, in turn, are only defined statistically, but that fact is irrelevant to the development.



"The 'allowed' values of the energies in a quantum state are determined by parameters external to they system, denoted by the set {Xd. (e.g. Xl = volume, X2 = magnetic field, etc.)"

In short, X invokes the most basic principles of the quantum theory. With these definitions of the meaning attached to $p_i$, the principle of maximum entropy is used to assign values to the $p_i$. The result is, by methods well known to this audience:

$p_i = \exp\!\left(-\Omega - \beta\epsilon_i - \sum_c \alpha_c n_{ic}\right)$ (C)

where the Lagrange multipliers $\Omega$, $\beta$, $\alpha_c$ are to be defined by the theory. At this point in the development, an important step is taken. The $p_i$ are eliminated from the system of equations. The probabilities have done their work. Like the scaffold used in erecting a building, they are no longer needed once the building has been constructed. The probabilities, having been used to encode the given information, are eliminated. Substitution for $p_i$ in the set of equations denoted by A gives:

$\sum_i \exp\!\left(-\Omega - \beta\epsilon_i - \sum_c \alpha_c n_{ic}\right) = 1$ (D1)

$\sum_i \epsilon_i \exp\!\left(-\Omega - \beta\epsilon_i - \sum_c \alpha_c n_{ic}\right) = \langle\epsilon\rangle$ (D2)

$\sum_i n_{ic} \exp\!\left(-\Omega - \beta\epsilon_i - \sum_{c'} \alpha_{c'} n_{ic'}\right) = \langle n_c\rangle, \qquad c = a, b, \ldots$ (D3)

It is straightforward to find, from equation D1, that:

$S = k\left(\Omega + \beta\langle\epsilon\rangle + \sum_c \alpha_c\langle n_c\rangle\right)$ (D4)

According to the MaxEnt derivation, the Lagrange Multiplier $\Omega$ is a function of the other Lagrange Multipliers and the external parameters (which determine the "allowed" values associated with the subscript i). That is:

$\Omega = \Omega(\beta, \alpha_a, \alpha_b, \ldots, X_1, X_2, \ldots)$ (D5)

From D5 it follows that:

$d\Omega = \frac{\partial\Omega}{\partial\beta}\,d\beta + \sum_c \frac{\partial\Omega}{\partial\alpha_c}\,d\alpha_c + \sum_k \frac{\partial\Omega}{\partial X_k}\,dX_k$ (D6)

By the usual methods of MaxEnt, it follows that the derivatives indicated in equation D6 are related to the expectations in equations A.

$-\frac{\partial\Omega}{\partial\beta} = \langle\epsilon\rangle$ (D7)

$-\frac{\partial\Omega}{\partial\alpha_c} = \langle n_c\rangle$ (D8)



and $\frac{1}{\beta}\,\frac{\partial\Omega}{\partial X_k} = \langle F_k\rangle$ (D9)

Substitution of D7, D8 and D9 into D6 gives:

$d\Omega = -\langle\epsilon\rangle\,d\beta - \sum_c \langle n_c\rangle\,d\alpha_c + \beta\sum_k \langle F_k\rangle\,dX_k$ (D10)

The probabilities may be eliminated from the definition of entropy to give:

$S = k\left(\Omega + \beta\langle\epsilon\rangle + \sum_c \alpha_c\langle n_c\rangle\right)$ (D11)

Differentiation of equation D4 gives:

$dS = k\,d\Omega + k\beta\,d\langle\epsilon\rangle + k\langle\epsilon\rangle\,d\beta + k\sum_c \alpha_c\,d\langle n_c\rangle + k\sum_c \langle n_c\rangle\,d\alpha_c$ (D12)

Finally, substitution of D10 into D12 gives:

$dS = k\beta\,d\langle\epsilon\rangle + k\sum_c \alpha_c\,d\langle n_c\rangle + k\beta\sum_k \langle F_k\rangle\,dX_k$ (D13)

Since it is homogeneous in the extensive variables $\langle\epsilon\rangle$, $\langle n_c\rangle$ and $X_k$, equation D13 may be integrated to give:

$S = k\beta\langle\epsilon\rangle + k\sum_c \alpha_c\langle n_c\rangle + k\beta\sum_k \langle F_k\rangle X_k$ (D14)

Comparing D14 with D11 it is seen, also, that

$\Omega = \beta\sum_k \langle F_k\rangle X_k$ (D15)

Please note that after equation D4, the probabilities have disappeared. The scaffolding they provided is no longer needed. Equations D5 to D15 provide a set of connected concepts. What remains to be done is to connect these concepts to the "real world", i.e. to relate them to things which we can measure and compare. The process of relating the variables in equations D5 to D15 will serve to define operationally what is meant by the various concepts.
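As a purely illustrative check on this chain of concepts (my own sketch, not part of Tribus's text; the level energies, particle numbers and multiplier values below are arbitrary assumptions), a few lines of Python confirm numerically that the assignment (C) satisfies D1, and that the resulting quantities obey D7 and D11:

import numpy as np

# Toy system: a handful of quantum states with assumed energies eps_i and
# particle numbers n_{ia} for a single species c = a (arbitrary values).
eps = np.array([0.0, 1.0, 1.5, 2.3, 3.1])
n_a = np.array([0.0, 1.0, 1.0, 2.0, 2.0])

beta, alpha_a = 0.8, 0.4      # chosen values of the Lagrange multipliers
k = 1.0                       # Boltzmann's constant in arbitrary units

# Equation (C): p_i = exp(-Omega - beta*eps_i - alpha_a*n_{ia}),
# with Omega fixed by the normalisation condition (D1).
Omega = np.log(np.sum(np.exp(-beta * eps - alpha_a * n_a)))
p = np.exp(-Omega - beta * eps - alpha_a * n_a)
print(p.sum())                                   # D1: the p_i sum to 1

mean_eps = p @ eps                               # <eps>, as in (D2)
mean_n_a = p @ n_a                               # <n_a>, as in (D3)
S = -k * np.sum(p * np.log(p))                   # entropy of the assignment

# D11: S = k (Omega + beta <eps> + alpha_a <n_a>)
print(S, k * (Omega + beta * mean_eps + alpha_a * mean_n_a))

# D7: -dOmega/dbeta = <eps>, checked with a small finite difference.
h = 1e-6
Omega_h = np.log(np.sum(np.exp(-(beta + h) * eps - alpha_a * n_a)))
print(-(Omega_h - Omega) / h, mean_eps)

The same pattern extends directly to several species and to the derivatives with respect to the external parameters $X_k$.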

To begin, we shall define the concept of equilibrium as a state in which the entropy is a maximum, that is, a state which can be defined by the set of thermodynamic concepts just derived. The expectation brackets, ( ), are identified with this state of equilibrium because they are associated with results which are reproducible. That is the characteristic of an equilibrium state: it is stable and reproducible.

It is interesting to examine the literature of classical thermodynamics to see how the concept of equilibrium is defined. It will be found that either the definition is passed over lightly or, by circular logic, it is defined as the point of maximum entropy. (We have not yet identified the entropy in our equations with the entropy of Clausius. That will be demonstrated later.)



Let us now consider two systems which are allowed to communicate with one another. Because the variables $\langle\epsilon\rangle$, $\langle n_c\rangle$ and $X_k$ are all extensive, so is S. Therefore, the entropy of two systems taken together, if they are at equilibrium states, will be the sum of the two entropies, i.e. the extensive properties are additive:

$S_{\rm total} = S_1 + S_2$ (C1)

$\langle\epsilon\rangle_1 + \langle\epsilon\rangle_2 = \mathrm{constant}$ (C2)

$\langle n_c\rangle_1 + \langle n_c\rangle_2 = \mathrm{constant}, \qquad c = a, b, \ldots$ (C3)

and, limiting our considerations to k = 1,

$(X_1)_1 + (X_1)_2 = \mathrm{constant}$ (C4)

On maximizing $S_{\rm total}$ under the constraints C2 to C4, it is found by the usual methods that:

$\beta_1 = \beta_2$ (C5)

$\alpha_{c,1} = \alpha_{c,2}, \qquad c = a, b, \ldots$ (C6)

If the $\beta$'s are equal, then so are their reciprocals. We define

$\beta = 1/kT$ (C7)

and define the concept of temperature by choosing a body, any body, as a suitable reference. The most convenient body is a perfect gas, for which the quantum states may be readily related to the volume.

For example, if we define a mol of perfect gas as one containing exactly $\tilde N$ atoms then $\langle n\rangle = \tilde N$ and by simple arguments (Tribus, 1961b) it follows that the values of $d\langle n_c\rangle$ in equation D13 vanish and from D15, for a perfect gas, it follows that:

$\Omega_{\rm perfect\;gas} = \beta\langle P\rangle V$ (C8)

or

$\frac{\Omega_{\rm perfect\;gas}}{\tilde N} = \frac{\langle P\rangle V}{RT}$ (C9)

As shown in Tribus (1961b), $\Omega_{\rm perfect\;gas}/\tilde N = 1$.

Because this theory defines temperature in precisely the same way as the concept has been defined historically, the concepts are the same.

If two systems which are not in equilibrium with one another are allowed to come to equilibrium with each other, and they do not interact with other bodies, this theory also predicts that the hotter body will lose energy to the colder one.
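The following numerical sketch (my own illustration, not part of the original text; the level spacings, system sizes and total energy are arbitrary assumptions) makes that prediction concrete: for two small systems sharing a fixed total expectation energy, the split that maximizes the total entropy is the one at which the two values of $\beta$ agree, so a system prepared hotter than that (smaller $\beta$) gives up expectation energy as the pair moves toward the maximum.

import numpy as np
from scipy.optimize import brentq

# Two toy systems with equally spaced, truncated energy levels (arbitrary choice).
levels_1 = np.arange(50) * 1.0
levels_2 = np.arange(50) * 0.7


def entropy_and_beta(levels, mean_energy):
    """MaxEnt (canonical) entropy S/k and multiplier beta for a given <eps>."""
    def mean_e(beta):
        w = np.exp(-beta * levels)
        p = w / w.sum()
        return p @ levels
    # <eps>(beta) decreases monotonically; solve for the beta that matches.
    beta = brentq(lambda b: mean_e(b) - mean_energy, 1e-6, 50.0)
    w = np.exp(-beta * levels)
    p = w / w.sum()
    return -np.sum(p * np.log(p)), beta


E_total = 12.0                                   # fixed total expectation energy (C2)
splits = np.linspace(0.5, E_total - 0.5, 400)    # candidate values of <eps>_1
S_total = [entropy_and_beta(levels_1, e1)[0] +
           entropy_and_beta(levels_2, E_total - e1)[0] for e1 in splits]

e1_star = splits[int(np.argmax(S_total))]        # equilibrium split of the energy
_, beta_1 = entropy_and_beta(levels_1, e1_star)
_, beta_2 = entropy_and_beta(levels_2, E_total - e1_star)
print(e1_star, beta_1, beta_2)                   # beta_1 and beta_2 agree, as in (C5)
# A system prepared with <eps>_1 above e1_star has a smaller beta (is hotter)
# and loses expectation energy to its partner on the way to the maximum.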

Confining attention to closed systems, for which the variations in composition are not at issue, from equation A2 we can recognize two different ways to change the expectation energy of a body. Differentiate equation A2 to find:

$d\langle\epsilon\rangle = \sum_i \epsilon_i\,dp_i + \sum_i p_i\,d\epsilon_i$ (C10)



or

$d\langle\epsilon\rangle = \sum_i \epsilon_i\,dp_i + \sum_i p_i\,\frac{d\epsilon_i}{dV}\,dV$ (C11)

Let

$dQ_r = \sum_i \epsilon_i\,dp_i \qquad \mathrm{and} \qquad dW_r = -\sum_i p_i\,\frac{d\epsilon_i}{dV}\,dV = \langle P\rangle\,dV$ (C12)

The symbols Q and W have been chosen for historical reasons, but they are apt since the symbol Q refers to an action which alters the uncertainty about the Question and the symbol W refers to the Way in which the change in energy was accomplished. The subscript 'r' is also appended for historic reasons. It is a reminder that the entire process is carried out with the system in equilibrium, i.e. at maximum entropy under the constraints.

The definition of $Q_r$ still contains probabilities, so as defined in C12 it is not a macroscopic quantity. However, in the form given, it answers the question posed earlier, "Where did the energy go?" If the expectation energy of a system turns out to be different from the value expected when all actions are free of friction, that difference will be found in the $Q_r$ term.

The essence of the argument lies in a clear distinction between symbols such as $\langle\epsilon\rangle$, $\langle P\rangle$, $\langle F\rangle$ and $\epsilon$, P and F. The latter refer to the results of deterministic reasoning, such as follows from Newton's laws of motion. Terms with the brackets around them, $\langle\;\rangle$, refer to the new ideas introduced by the MaxEnt process. They are expectation values, which we identify with equilibrium. From equations D5 and D7, for example, we conclude that the expectation energy of a body follows an equation of state. In mathematical terms,

$\langle\epsilon\rangle = \langle\epsilon\rangle(\beta, \alpha_a, \alpha_b, \ldots, X_1, X_2, \ldots)$ (C13)

Equation C13 brings out an important difference between the First Law of Thermodynamics and the Law of Conservation of Energy.

The law of conservation of energy is derived from Newton's Laws of motion. It is deterministic. It provides a method for dealing with systems for which we know how to solve the equations of motion. With a computer this limit is now quite large, up to several thousand objects. But we cannot use the method for numbers of particles approaching Avogadro's number.

The first law treats this same problem statistically, using the MaxEnt process to tell us how the various averages and their measures are related, one to another. The first law tells us that the expectation energy of a body, i.e. at equilibrium, depends only on the state of the body, not how it got to its state. The first and second laws are statistical. (To be consistent we should really put brackets around the entropy and write it as $\langle S\rangle$, but for historic reasons we do not.)

7. The Relation Between Heat and Entropy

When the applied force is not equal to the equilibrium force, we shall denote it by F. Let us also define dW (without the subscript r) by

$dW = F\,dX\ (= P\,dV)$ (C14)



and for any change in expectation energy write, again, eliminating the subscript r:

$d\langle\epsilon\rangle = dQ - dW$ (C15)

Having defined both $d\langle\epsilon\rangle$ and dW, equation C15 becomes a definition of what we mean by dQ.

To bring out the distinctions between dQ and $dQ_r$ on the one hand and dW and $dW_r$ on the other, consider a change in entropy of a closed system. (Variations in $\langle n_c\rangle$ are not at issue, so terms involving $\alpha_c$ are omitted.)

From equation C15, and from C11 and C12,

$dS = k\beta\,d\langle\epsilon\rangle + k\beta\sum_k \langle F_k\rangle\,dX_k$ (C16)

$dS = k\beta\,(dQ - dW) + k\beta\sum_k \langle F_k\rangle\,dX_k$ (C17)

$dS = \frac{dQ}{T} + \sum_k \frac{\left(\langle F_k\rangle - F\right)dX_k}{T}$ (C18)

When a process is carried out in such a way that: a) $dX_k = 0$ (as in a constant volume calorimeter), or b) $\langle F_k\rangle = F$, i.e. at equilibrium, then $dQ = dQ_r$ and

$dS = dQ_r/T$ (C19)

From these arguments we conclude that what we have called Q describes a process wherein energy flows from one body to another as a consequence of a temperature difference and what we have called W describes a process whereby energy goes from one system to another by the movement of a boundary (one of the $X_k$). A study of the definitions used in classical thermodynamics shows that the way the information theory treatment has defined these quantities is precisely the way they are defined in the classical treatment. Therefore, they are the same concepts.

Equation C19 shows that the entropy in MaxEnt and the entropy defined by Clausius are the same, for they are defined that way.

8. Implications for Education

The derivation presented here is necessarily compact. The reader who is experienced with statistical thermodynamics will recognize that on the way to the development of classical thermodynamic concepts, the derivation of the Grand Canonical Distribution of Gibbs was developed as a by-product.

In a textbook of thermodynamics, intended for undergraduate students who do not know about either statistical or classical thermodynamics, it is better to begin with much simpler systems, starting, for example, with the perfect gas. Tribus (1961b) demonstrates such a presentation. It has been used successfully with undergraduate students and even in a high school. How well the approach succeeds in producing good thermodynamicists still depends upon the instructor. As often happens, if the instructor is wedded to the old ways, the new paradigm is rejected.

I believe that the matter of acceptability will ultimately rest upon how convincing the logic appears. In the 33 years since this approach was first published, only one author, and his students, have contested the approach. K.G. Denbigh, who has a reputation as one of the most competent classical thermodynamicists in the UK, has written one paper (Denbigh, 1981) and, with his son, a book (Denbigh and Denbigh, 1985). Two other students, Rowlinson (1970) and Atkins (1986), have written papers in the same vein. Ed Jaynes has replied to Rowlinson's criticism in a thorough and devastating way, countering each and every objection and making of each an object lesson in the power of the MaxEnt approach (Jaynes, 1979). He also rebutted the arguments by Atkins.

Denbigh's objection hinges directly on the interpretation of probability as a measure of strength of belief. If that is what probability is taken to mean, it is difficult to refute him. His main point was that the entropy of the steam table does not depend on his or my ignorance or willingness to believe. As the derivation here demonstrates, it is not a matter of his or my ignorance. It is a matter of developing the least prejudiced description of a system for which our knowledge is incomplete and using that description to define the circumstances under which measurements are to be made.

9. Conclusion

There are many other interesting observations to be made about the concepts of classical thermodynamics. They have been made elsewhere (Tribus, 1961a, 1961b). The point of this development has been to show why it is important to not define probability as "state of belief". Probability may, at times, in some problems, for some people, represent state of belief, but its use goes far beyond that limitation. Probability, used in a MaxEnt process, generates families of concepts which link together and generate a theory. MaxEnt provides a process for generating hypotheses which may be connected to the "real world" and are guaranteed to be the best possible encoding of limited information. Probability is not a matter of "belief". Probability provides a process of rational induction.

References

[1] Atkins, P.W.: 1986, 'Entropy in Relation to Complete Knowledge', Contemp. Phys. 17, 257.

[2] Cox, Richard T.: 1961, 'The Algebra of Probable Inference', Johns Hopkins University Press, Baltimore.

[3] Denbigh, K.G.: 1981, 'How Subjective is Entropy?', Chemistry in Britain, 17, 168-185.

[4] Denbigh, K.G. and Denbigh, J.S.: 1985, 'Entropy in Relation to Incomplete Knowledge', Cambridge Univ. Press.

[5] Jaynes, Edwin T.: 1957, 'Information Theory and Statistical Mechanics', Part I Physical Review 106, 620-630; Part II Physical Review 108, 171-190.

[6] Jaynes, Edwin T.: 1979, 'Where do We Stand on Maximum Entropy', in Raphael D. Levine and Myron Tribus (eds.), The Maximum Entropy Formalism, MIT Press, page 14.



[7] Jaynes, Edwin T.: 1987, 'Comment on a review by P.W. Atkins', Contemp. Phys. 28.

[8] Planck, Max: 1945, 'A Treatise on Thermodynamics' Dover Publications, New York. The first edition appeared in 1897; the seventh in 1922.

[9] Rowlinson, J.S.: 1970, 'Probability, Information, and Entropy', Nature 225, 1196.

[10] Tribus, Myron: March 1961a, 'Information Theory as the Basis for Thermodynamics and Thermostatics', Jour. Appl. Mech. 1-8.

[11] Tribus, Myron: 1961b, 'Thermostatics and Thermodynamics: An Introduction to Energy, Information and States of Matter, With Engineering Applications', D. Van Nostrand Company.

[12] Tribus, Myron: 1969, 'Rational Descriptions, Decisions and Designs', Pergamon Press.


THE HARD TRUTH

Kenneth M. Hanson and Gregory S. Cunningham* Los Alamos National Laboratory, MS P940, Los Alamos, New Mexico 87545, USA

ABSTRACT. Bayesian methodology provides the means to combine prior knowledge about competing models of reality and available data to draw inferences about the validity of those models. The posterior quantifies the degree of certainty one has about those models. We propose a method to determine the uncertainty in a specific feature of a Bayesian solution. Our approach is based on an analogy between the negative logarithm of the posterior and a physical potential. This analogy leads to the interpretation of the gradient of this potential as a force that acts on the model. As model parameters are perturbed from their maximum a posteriori (MAP) values, the strength of the restoring force that drives them back to the MAP solution is directly related to the uncertainty in those parameter estimates. The correlations between the uncertainties of parameter estimates can be elucidated.

1. Introduction

Bayesian analysis provides the foundation for a rich environment in which to explore inferences about models from both data and prior knowledge through the posterior probability. In an attempt to reduce an analysis problem to a manageable size, the usual approach is to present a single instantiation of the object model as "the answer", typically that which maximizes the posterior (the MAP solution). However, because of uncertainties in the measurements and/or because of a lack of sufficient data to define an unambiguous answer (in the absence of regularizing priors) [1], there is no unique answer to many real analysis problems. Rather, innumerable solutions are possible. Of course, some solutions are more probable than others. The beauty of the Bayesian approach is that it provides the probability of every possible solution, which, in a sense, ranks various solutions. The estimation of the uncertainty or reliability of the answer remains a pressing issue, particularly when the number of parameters in the model is large. Although there is a mathematically correct way to specify the covariance in the parameters, including the correlation between the uncertainties in any two parameters, it does not provide much insight.

One appealing way to get a feeling for the uncertainty in a Bayesian solution is to display a sequence of distinct solutions drawn from the posterior probability distribution. This approach was suggested by Skilling et al. [2], who produced a video display of a random walk through the posterior distribution. However, the calculational method used in that work was based on a Gaussian approximation of the posterior probability distribution in the neighborhood of the MAP solution. Later Skilling made some progress in dealing with non-Gaussian distributions [3]. While the probabilistic display of Skilling et al. provides a general impression of the overall degree of possible variation in the solution, we desire a means to probe the uncertainty in the solution in a more directed manner.

* Supported by the United States Department of Energy under contract number W-7405-ENG-36.

We propose a technique to test hypotheses regarding perturbations of the MAP solution in a fashion that allows one to ask questions of particular interest. The approach we suggest makes use of an analogy between the negative logarithm of the posterior and a physical potential. The uncertainty of a particular change of the MAP solution is revealed in a tactile way as a force that tends to pull the solution back toward the MAP solution. Correlations between the perturbed set of parameters and the remaining parameters in the model are also brought to light. This innovative Bayesian tool is tangibly demonstrated within the context of geometrically-defined object models used for tomographic reconstruction from very limited projection data.

2. Traditional approach to uncertainty

Bayesian analysis revolves around the posterior probability of a model, where the model parameters are represented by the vector a. The posterior p(a|d) incorporates data through the likelihood p(d|a), i.e. the probability of the observed data given the parameters, and prior information through a prior probability on the parameters p(a). Bayes's law gives the posterior as p(a|d) ∝ p(d|a)p(a). The most typical use of Bayesian analysis is to find the parameter values that maximize the posterior, called the MAP solution.

It is convenient to deal with the negative logarithm of the posterior, $\varphi = -\log\{p(a|d)\}$. The MAP estimate of the parameters $\hat a$ is found by minimizing $\varphi$, the condition for which is $\partial\varphi/\partial a_i = 0$ for all parameters $a_i$, providing there are no constraints on the parameters themselves. In the traditional approach to the estimation of uncertainty [4], which we only summarize here, the variances in the parameters are derived from the curvature matrix of $\varphi$, calculated at $\hat a$,

$[K]_{ij} = \left.\frac{\partial^2\varphi}{\partial a_i\,\partial a_j}\right|_{\hat a}$ (1)

Since this matrix is evaluated at the minimum of $\varphi$, it must be positive semi-definite, i.e. $(\delta a)^T K\,\delta a \ge 0$ for any $\delta a$. In the Gaussian approximation to the posterior, the error matrix E, which gives the covariances between all the parameters, $[E]_{ij} = \langle(a_i - \hat a_i)(a_j - \hat a_j)\rangle$, where the brackets indicate an ensemble average, is the inverse of the curvature matrix,

$E = K^{-1}$ (2)

Although this result is mathematically rigorous, it only provides the second moment of the parameter uncertainties and their correlations. It also suffers from not being very illuminating in terms of its consequences for the parametric model, particularly in terms of the correlations in the uncertainties of various parameters. Furthermore, for the $10^5$ to $10^6$ pixel amplitudes that are typically needed to describe a 2D image, the full error matrix contains $10^{10}$ to $10^{12}$ elements, which can neither be practically calculated nor stored. We propose another approach to provide a more tangible indication of the degree of uncertainty in the inferred model as well as the ability to directly probe the uncertainty of specific features of the model.
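As a small numerical illustration of Eqs. (1) and (2) (my own sketch; the two-parameter quadratic $\varphi$ below is an arbitrary stand-in for a real negative log posterior), the covariance follows from evaluating the curvature matrix at the MAP point and inverting it:

import numpy as np

# Assumed toy posterior: phi(a) is quadratic with a known curvature,
# standing in for the negative log posterior of a real problem.
K = np.array([[4.0, 1.5],
              [1.5, 1.0]])        # curvature matrix at the MAP point, Eq. (1)
a_map = np.array([2.0, -1.0])     # MAP estimate (arbitrary values)

def phi(a):
    d = a - a_map
    return 0.5 * d @ K @ d        # toy negative log posterior near the MAP point

# Numerical curvature at the MAP point (central differences) recovers K.
h = 1e-4
K_num = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        e_i, e_j = np.eye(2)[i] * h, np.eye(2)[j] * h
        K_num[i, j] = (phi(a_map + e_i + e_j) - phi(a_map + e_i - e_j)
                       - phi(a_map - e_i + e_j) + phi(a_map - e_i - e_j)) / (4 * h * h)

E = np.linalg.inv(K_num)          # error (covariance) matrix, Eq. (2)
print(E)                          # diagonal: variances; off-diagonal: correlations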



3. Bayesian mechanics

If one draws an analogy between $\varphi$ and a physical potential, then the gradient of $\varphi$ is analogous to a force, just as in physics. The force $f_i = -\partial\varphi/\partial a_i$ is roughly in the direction of the local minimum of $\varphi$, under suitable assumptions concerning the smoothness of the dependence of $\varphi$ on the parameters. The condition for the MAP solution, $\nabla_a\varphi = 0$, can be interpreted as stating that at the MAP operating point the forces on all the variables in the problem balance: the net force on each variable is zero. Further, when the variable $a_i$ is perturbed slightly from the MAP solution, the force $f_i$ pulls $a_i$ back towards the MAP solution. The phrase "force of the data" takes on real meaning in this context.

A quadratic approximation to $\varphi$ in the neighborhood of the MAP solution implies a linear force law, i.e. the restoring force is proportional to the displacement from equilibrium, as in a simple spring. In this quadratic approximation the curvature of $\varphi$ is the inverse of the covariance of the MAP estimate. A high curvature is analogous to a stiff spring and therefore represents a "rigid", reliable solution.

An interesting aspect of this interpretation is the possibility of decomposing the forces acting on the MAP solution into their various components. For example, the force derived from all data (through the likelihood), or even a selected set of data, may be compared to the force derived from the prior. In this way it is possible to examine the influence of the priors on the solution as well as determine which data have the largest effect on a particular feature of the solution.

We note that the notion of applying forces to model parameters in the preceding discussion must ultimately be stated in terms of pressures, that is, forces applied over regions, acting on physically meaningful quantities. The first reason is that the physical world, which we usually model, exists as a continuum: the physical quantities of interest are typically densities, which are a function of continuous spatial or temporal coordinates. Thus meaningful questions about reality should really be stated in terms of averages over regions, not as point values. Secondly, physically feasible measurements can only probe physical quantities over finite-sized regions. Point sampling is fundamentally impossible. As an example, a radiographic measurement in which the attenuation of an x-ray beam is measured is always subject to the effects of a blurring process that arises from a finite spot size for the source of x rays and the finite resolution of the x-ray detector. Thus the measured attenuation is necessarily an average over a cylinder in space. In truth, radiographic measurements can not provide line integrals of an attenuation coefficient through an object, as is often assumed as an approximation to the real process. Put succinctly, all physical measurements have limited spatial or temporal resolution that render meaningless any question about what happens in an infinitesimally small region. As a result, uncertainties in an estimated physical quantity can only be addressed in terms of the average of that quantity over a finite region. As the concepts of Bayesian analysis mature, we will learn to deal only with physical quantities that are functions of continuous independent variables and we will avoid referencing directly the underlying discrete parameters of the models.

One needs to be aware that any finite representation, which we are forced to use in computer models, has a limited resolution. Thus when one explores the model at a scale finer than the inherent resolution of the model, the model can only respond by interpolation of the underlying discrete model [5]. One can only meaningfully explore the model at resolutions coarser than this.



4. Perturbation from Equilibrium

We propose to exploit the above physical analogy to facilitate the exploration of the uncertainty in a MAP solution. For the present we will assume that in the neighborhood of the MAP point $\hat a$, $\varphi$ is well approximated by a quadratic expansion:

$\varphi(a) = \varphi_0 + \tfrac{1}{2}\,\Delta a^T K\,\Delta a$ (3)

where $\Delta a = a - \hat a$ is the displacement from the MAP point and $\varphi_0 = \varphi(\hat a)$. Suppose that we start from $\hat a$ and displace the parameter values by a small amount $\Delta a$. Then the gradient of $\varphi$, $-\nabla_a\varphi$, represents a force that pulls the parameters back toward the MAP point. The units of the force are those of the reciprocal of the parameter. The curvature, and hence the reciprocal of the variance, in the direction of $\Delta a$ is given by the ratio of $|\nabla_a\varphi|$, evaluated at $\hat a + \Delta a$, to $|\Delta a|$, for vanishingly small displacements.

As an alternative to directly displacing parameters, their perturbation may be achieved by applying an external force to the parameters. Suppose that one pulls on the parameters with a force f. Note that this force can act on just one parameter or on many. From the physical analogy, it is easy to write down the new potential:

$\varphi(a) = \varphi_0 + \tfrac{1}{2}\,\Delta a^T K\,\Delta a - f^T\Delta a$ (4)

The new minimum of $\varphi$ occurs when

$\nabla_a\varphi = 0 = K\,\Delta a - f$ (5)

Solving for the displacement in a and using Eq. (2),

$\Delta a = K^{-1}f = E\,f$ (6)

If the curvature matrix K, and hence the covariance matrix E, is not diagonal, the resulting displacement is not in the direction of the applied force. This phenomenon demonstrates the correlations between the uncertainties in all the parameters. The component of $\Delta a$ in the direction of the applied force divided by the magnitude of the force, i.e. $\Delta a^T f/|f|^2$, is the effective variance in the parameters in that direction.

Although we assumed that $\varphi$ is quadratic above, this approach can be useful even when it is nonquadratic. While it may not be feasible to express the results analytically, we obtain a feeling for the uncertainty in $\Delta a$ and the correlations between $\Delta a$ and the other parameters. Any constraints on the parameters can be explicitly seen. For nonquadratic $\varphi$ the plot of the value of $\varphi$ versus the applied force provides the means to visualize the uncertainty in $\Delta a$.
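To make Eqs. (4) to (6) concrete, the following sketch (again my own illustration; the curvature matrix, MAP point and force are arbitrary assumptions) applies an external force to a quadratic $\varphi$, locates the new minimum numerically, and reads off the effective variance in the direction of the force:

import numpy as np
from scipy.optimize import minimize

K = np.array([[4.0, 1.5],
              [1.5, 1.0]])                  # curvature at the MAP point (toy values)
E = np.linalg.inv(K)                        # covariance matrix, Eq. (2)
a_map = np.array([2.0, -1.0])

f = np.array([1.0, 0.0])                    # external force applied to a_1 only

def phi_f(a):
    d = a - a_map
    return 0.5 * d @ K @ d - f @ d          # perturbed potential, Eq. (4)

res = minimize(phi_f, a_map)                # new minimum satisfies K*da = f, Eq. (5)
da = res.x - a_map
print(da, E @ f)                            # the two displacements agree, Eq. (6)

# The displacement is not parallel to f when K (hence E) has off-diagonal terms:
# that is the correlation between the parameter uncertainties.
eff_var = (da @ f) / (f @ f)                # effective variance along f
print(eff_var, E[0, 0])                     # equals [E]_11 for a force along a_1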

5. Use with Deformable Geometric Model

The above approach takes on a poignant interpretation when the reconstructed object is defined in terms of its geometric shape. The prior on the geometry is defined in terms of the default shape together with a prescription of how to assess the probability of other possible shapes. The latter is simply done by using a Gibbs form for the probability given as exp(-βW), where W is the deformation energy, i.e. the energy required to deform the geometry from the default shape into a new shape [6-10]. The parameter β regulates the strength of the prior on the geometry.



Figure 1: An example of how a polygon (solid line) can be distorted by either pushing node A inward (dashed line) or outward (dotted line), assuming that the measurements consist of two orthogonal projections. Note the effect on the overall shape of the object, which indicates the correlations between the polygon vertices.


Figure 1 shows a polygon defined in terms of its 20 vertices or nodes. Thus there are 40 parameters in this model corresponding to the two coordinates needed to specify each vertex of the polygon. We assume that two sets of parallel projections, one vertical and one horizontal, are available and that they are subject to a very small amount of measurement noise. For simplicity we ignore the prior on the deformation described above. Starting from the known original polygon, a force is applied to the leftmost node (node A), pulling it outward. The plot of the applied force and the resulting horizontal displacement of the node is shown in Fig. 2. For positive forces node A moves outward steadily up to a breakpoint (at a displacement of 0.18), which we call point B. The dotted-line figure in Fig. 1 shows the configuration of the polygon at that point. We note that the act of displacing node A outward contradicts the vertical projections, which indicate that there is probably no material to the left of the original position of the node. Beyond point B the slope of the curve decreases substantially, principally because new configurations of the polygon are possible, which can reduce the excessive projection values to the left of the original position of node A. As an aside, the optimization procedure employed is based on a steepest descent method. We use the technique of adjoint differentiation to efficiently calculate the required derivatives with respect to the parameters [11].


[Figure 2 appears here: a plot of applied force versus displacement for node A; see the caption below.]

Figure 2: Plot of the force applied to node A of the polygon in Fig. 1 versus the resulting displacement of that node. The nonlinear nature of the force-displacement law for this problem is dramatically demonstrated. The configurations shown in Fig. 1 are at the two breakpoints in the curve: the dashed line corresponds to a force of -0.006 (inward) at point C and the dotted line to a force of 0.080 (outward) at B.

Applying the force inward (negative force values) results in quite a different behavior. With a small inward push, the displacement reaches a breakpoint, point C in Fig. 2. The configuration of the polygon at this point is shown in Fig. 1 as the dashed figure. Node A has just reached the line connecting its neighbors, one of which has moved outward to take its place in supplying the proper vertical projection. Pushing harder only makes node A slide down that line, which requires only a little force to achieve a large displacement. The position of node A is not well determined in this region. We notice that the shape of the object does not change during this process. The results for this situation are correct, but may not be what one has in mind when specifying the force. It seems desirable to avoid applying the force directly to the parameters, in this case, to the position of the nodes of the polygon. The force should instead be applied to the object and its effect translated to the parameters. Also we observe that the only reason point C is not closer to the origin is that the coarseness of our polygon object model limits the flexibility of the object to respond. With many more degrees of freedom, we would expect neighboring sections of the object boundary to move out to take the place of node A in response to a slight inward force.

The correlation between the uncertainty in the position of node A and the positions of the other nodes in the polygon is demonstrated in Fig. 1. We observe that the nodes on the right side of the polygon move to maintain the measured horizontal projection. Of course, the constraints of the vertical projection also figure into the problem, making the overall movement of the sides of the polygon rather complex. This approach nicely handles the complex interaction between all the constraints arising from measurements and prior knowledge.

For an object modeled in terms of its geometry, poor reliability of the MAP estimate means that the object is soft or squishy, pliable. Good reliability of the estimate means that the object is firm. Therefore, "truth" is hard or rigid.

6. Discussion

In the future it may be possible to use the tools of virtual reality, coupled to turbocomputation, to explore the reliability of a Bayesian solution of complex problems through direct manipulation of the computer model. Force feedback will permit one to actually "feel" the stiffness of a model. Higher dimensional correlations might be "felt" through one's various senses.

To reiterate the comments made in Sect. 3, we suggest that queries regarding physical quantities should be made in terms of averages over regions rather than in terms of their values at points. Furthermore, the uncertainties of individual parameters that, as a collection, are meant to describe a physical quantity as a function of continuous coordinates, may have little meaning. In regard to an image represented as a grid of pixels, the question "what is the rms error in a pixel value?" is impossible to answer without a clear understanding of what a pixel value represents, e.g. the average value over the area of the pixel. More meaningful questions can be made for areas larger than that of a single pixel. Furthermore, the correlations between the average value within a region and the rest of the image must be considered. Consequently, our language must change. Instead of applying forces to probe the reliability of individual parameters that are used to describe an object, we should speak of applying pressures over regions of the object. And it must be understood that when we ask about regions whose size is on the order of, or smaller than, the resolution of the discrete model of the object, we will only learn about the interpolation properties of the model.

The approach to reliability testing described above is very general and can be used in virtually any other kind of Bayesian analysis. Examples of other contexts are as follows:

Spectral estimation: In typical spectral analysis a scalar variable quantity is estimated for different discrete frequency values. Normally a single spectrum is estimated. Skilling et al. [3] probed the variability possible in the answer through their probabilistic display technique. That display gives one a true feeling for the range of answers possible for a given set of input data. With our technique, one can ask direct questions about the power over a specific range of frequencies. The mode of interaction with the spectrum might be thought of as pushing down or pulling up on a region of the spectrum. In a virtual reality setting, we can imagine that the analyst would be able to use his fingers to press upward or downward on various portions of the spectrum. The resistance to this attempted action, fed back to the user's fingers as a force, would indicate the degree of uncertainty in the solution.

Image reconstruction: The basic problem is to estimate the amplitudes in image pixels from data, each of which is a combination of many pixels, as in tomographic reconstruction from projections (strip integrals) through the image, or deconvolution of blurred images. Interaction with the image can be provided by allowing one to push or pull on the amplitudes in an area of interest. The concepts behind this technique can be used to make binary decisions, for example, to decide whether an object is present or not, or to decide between two different signals [12].

References

[1] K. M. Hanson and G. W. Wecksung. Bayesian approach to limited-angle reconstruction in computed tomography. J. Opt. Soc. Amer., 73:1501-1509, 1983.

[2] J. Skilling, D.R.T. Robinson, and S.F. Gull. Probabilistic displays. In W.T. Grandy, Jr. and L.H. Schick, editors, Maximum Entropy and Bayesian Methods, pages 365-368. Kluwer Academic, 1991.

[3] J. Skilling. Clouds. presented at the Workshop on Maximum Entropy and Bayesian Methods, July 19-24, 1992, Paris.

[4] P. R. Bevington. Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill, New York, 1969.

[5] K. M. Hanson and G. W. Wecksung. Local basis-function approach to computed tomography. Appl. Opt., 24:4028-4039, 1985.

[6] R. Szeliski. Probabilistic modeling of surfaces. Proc. SPIE, 1570:154-165, 1991.

[7] R. Szeliski and D. Terzopoulos. Physically-based and probabilistic models for computer vision. Proc. SPIE, 1570:140-152, 1991.

[8] K. M. Hanson. Reconstruction based on flexible prior models. Proc. SPIE, 1652:183-191, 1992.

[9] K. M. Hanson. Flexible prior models in Bayesian image analysis. In A. Mohammad-Djafari and G. Demoment, editors, Maximum Entropy and Bayesian Methods, pages 399-406. Kluwer Academic, 1993.

[10] K. M. Hanson. Bayesian reconstruction based on flexible prior models. J. Opt. Soc. Amer., A10:997-1004, 1993.

[11] G. S. Cunningham, K. M. Hanson, G. R. Jennings, Jr., and D. R. Wolf. An object-oriented optimization system. Proc. IEEE Int. Conf. Image Processing, III:826-830, 1994.

[12] K. M. Hanson. Making binary decisions based on the posterior probability distribution associated with tomographic reconstructions. In C. R. Smith, G. J. Erickson, and P. O. Neudorfer, editors, Maximum Entropy and Bayesian Methods, pages 313-326. Kluwer Academic, 1991.


ARE THE SAMPLES DOPED - IF SO, HOW MUCH?

Anthony J.M. Garrett Byron's Lodge, 63 High Street Grantchester Cambridge CB3 9NF England

ABSTRACT. The concentration of a chemical in a compound varies across samples, due to uncontrolled factors. A substitute for the compound likewise contains a variable concentration of this chemical. The concentration is measured in samples of the compound and the substitute, and in a manufactured version of the compound. Probability theory is applied to investigate whether the manufacturer is doping the compound with its substitute, and if so by how much.

1. Introduction

We shall analyse our problem in a single context, with the understanding that the analysis is applicable more widely. Our application concerns whether a dairy is secretly adulterating cowsmilk with goatsmilk, based on measurements of amino acid concentrations in pure cowsmilk, pure goatsmilk, and in what the dairy labels as cowsmilk. We shall assume either that the dairy is honest, or that it adulterates all cowsmilk by a fixed (but unknown) proportion of goatsmilk; these two hypotheses are to be tested against each other. If we decide that the dairy is dishonest, we should like to estimate the proportion of goatsmilk.

This application has been chosen for comparison with a tutorial work on milk substitution [1]. The analysis can also be applied to investigate, for example, whether orange juice from one source is adulterated by orange juice from another (or juice from other fruits); in testing pharmaceutical drugs; and in many other chemometric and forensic problems. Dilution is a special case, in which the adulterating compound has zero concentration of the chemical that is tested for. Another special case of practical relevance corresponds to 100% adulteration or none. This is the identification problem, in which the dairy puts out either pure cowsmilk or pure goatsmilk and we aim to discover which. These special cases are treated in section 5.

Were cowsmilk and goatsmilk always to contain unchanging concentrations of amino acids, different in each species of animal, the problem would be deterministic. Given these concentrations from a sample of each species, a test on a single sample of dairy output would tell us the situation exactly. But, alas, there is a natural variation in amino acid concentrations in both cowsmilk and goatsmilk - due, for example, to season, to breed, to diet and to health. Nevertheless, if the concentration of a particular amino acid is tightly distributed about one value in cowsmilk and about a very different value in goatsmilk, something similar can be reasoned out. Our task is to make this reasoning precise in the grey area where intuition fails us.





What we want to know is how strongly to believe that the dairy is honest or dishonest. To provide a quantitative theory we must therefore seek a numerical representation of strength of belief. This number must represent the strength of belief that a binary proposition A (say) is true, supposing that we can take a further proposition B as true. A short tutorial now follows on this notion.

The conditioned proposition we denote A|B, and the number representing strength of belief in it by p(A|B). (Though we are using the notation of probability theory, we have said nothing about probability itself.) The fact that propositions obey the Boolean calculus now induces a corresponding calculus for the numbers. This numerical calculus was first derived from the propositional calculus in 1946 by R.T. Cox [2]. Specifically, p(AB|C) must be expressible in terms of p(A|BC) and p(B|C); no other inequivalent combination makes sense. To find the relation between these, write p(AB|C) as an unknown function of the other two, and apply it twice so as to express the belief-strength in the logical product of three propositions in terms of the belief-strengths of single conditioned propositions. The decomposition can be done in two different ways but, since the logical product is associative, the results must be the same. This condition sets up a functional equation for the unknown relation between the p's. Its solution is

$p(AB|C) = p(A|BC)\,p(B|C)$ (1)

and this is known as the product rule. Likewise, associativity of the logical sum, together with de Morgan's logical relation $\overline{A + B} = \bar A\,\bar B$, gives a second relation

$p(A|B) + p(\bar A|B) = 1.$ (2)

This is known as the sum rule. In fact there is more to the analysis, since if p is a numerical representation of strength of belief then so is any function of p. This fact is reflected in a degree of arbitrariness in the solutions of the functional equations; of the many equivalent representations, one alone satisfies the rules (1) and (2), and this is the representation we work with.

These rules have long been known as 'the laws of probability'. And, like probability, a numerical representation of strength of belief is involved in those problems where there is uncertainty. On these grounds I will call p the probability; objectors, however, can simply replace the word in what follows by 'numerical representation of strength of belief'. By this means the long debate over the meaning of probability is circumvented: it is a strength of belief which is needed, it is a strength of belief which we shall find, and we could if necessary denote it so throughout and make no reference to 'probability'.

This view, based on strength of belief and agnostic over what probability is, has the further advantage that it makes no reference to the idea of a 'random' process, a notion which is notoriously slippery to define. It appears to mean 'unpredictable', but this raises the question: unpredictable by whom? Not by somebody who knows the truth of sufficient propositions about the process. The 'random' point of view tackles problems using a toolbox of methods called sampling theory, and often proves inequivalent to application of the sum and product rules. The resulting difficulties have been painstakingly documented by Jaynes [3,4]. By contrast the present viewpoint relates probabilities using only the sum and product rules - which we have seen are mandatory - and employs a further principle derived from consistency requirements to assign them (maximum entropy: see refs. [3,4,5]) .



This viewpoint is often called Bayesian but, since even this word means different things to different people, I prefer to call it consistent probability theory. It is the most general way of handling uncertainty; claims that fuzzy logic can tackle a wider range of problems are unfounded [4]. Finally, extension is routinely made to quantitative and continuous parameters by considering propositions such as 'the height of the tree is between h and h + dh'. Probability densities will be denoted in what follows by a capital P.

2. Ockham's Razor

Our problem is to test against each other two hypotheses: one, that the dairy is honest and that the fraction f of goatsmilk mixed in the cowsmilk is zero; the other, that f may be non-zero and must be estimated from the data. Since these are the only theories we are considering, they are complementary and their probabilities sum to unity. Hypothesis testing is, properly, always comparative.

In the first theory the prior probability density function for f is a δ-function located at f = 0; in the second theory this density derives from the prior information, and will be broader. The extra flexibility of the second theory allows f to be chosen to give a better fit to the data than the first theory. On the other hand the second theory is disadvantaged in placing some of the prior probability for f where the data and noise statistics indicate it is very unlikely to lie. There is therefore a trade-off between simplicity of theory and goodness of fit. Alternatively we can say that the first theory has all its eggs placed in one basket (at f = 0), and so is liable to spectacular success or failure according to whether the data and noise statistics suggest that f is close to zero.

The trade-off between simplicity of theory and goodness of fit corresponds to the aphorism known as Ockham's Razor, essentia non sunt multiplicanda praeter necessitatem (though this phrasing was coined after William of Ockham's time). The Latin is customarily translated as 'entities are not to be multiplied except of necessity', and popularly as 'prefer the simplest theory that fits the facts' - though this popular translation misses the point that some theories fit the facts better than others. A translation suited to the present view is 'parameters should not proliferate unnecessarily'. The Bayesian expression of Ockham's razor was first understood by Harold Jeffreys [6], and a tutorial on the connection has been given by the present author [7].

Denote the theory that the dairy is honest (f = 0) by $T_1$, and the theory that it is adulterating the milk by $T_2$. The data consist of amino acid concentrations in cowsmilk, goatsmilk and dairy milk, and are collectively denoted D; the prior information is I. Our task is to use the sum and product rules to synthesise the posterior probability of either theory, $p(T_i|DI)$, from the prior $p(T_i|I)$, from the noise statistics - the likelihood $p(D|fT_iI)$ according to that theory - and from the prior probability density for the adulteration parameter, $P(f|T_iI)$. We begin from the observation that, since the logical product of two propositions is commutative, the product rule can be written as

$p(T_iD|I) = p(T_i|DI)\,p(D|I) = p(D|T_iI)\,p(T_i|I)$ (3)

whence

$p(T_i|DI) = K\,p(T_i|I)\,p(D|T_iI)$ (4)

where $K^{-1} = p(D|I)$. We now multiply the sum rule, in the form

$p(T_i|DI) + p(\bar T_i|DI) = 1$ (5)

by $p(D|I)$, and employ the product rule to give

$p(D|I) = p(T_iD|I) + p(\bar T_iD|I)$ (6)

which is the rule for marginalising. Now apply the product rule 'the other way' in (6), to give

$p(D|I) = p(D|T_iI)\,p(T_i|I) + p(D|\bar T_iI)\,p(\bar T_i|I)$ (7)

Equations (4) and (7), combined, comprise Bayes' theorem for incorporating data D into prior probabilities $p(T_i|I)$ to give posterior probabilities $p(T_i|DI)$. (Of course $\bar T_1 = T_2$ and vice-versa.) However our synthesis is not yet complete: $p(D|T_iI)$, which appears on the RHS of (4)/(7), is not one of our building blocks. It is easily constructed from them using the marginalising rule

$p(D|T_iI) = \int_0^1 df\,P(Df|T_iI)$ (8)

$= \int_0^1 df\,p(D|fT_iI)\,P(f|T_iI),$ (9)

which on substitution into (4)/(7) completes our program. Alternatively, we use (4) to write the odds ratio

$\frac{p(T_i|DI)}{p(\bar T_i|DI)} = \frac{p(T_i|I)\int_0^1 df\,p(D|fT_iI)\,P(f|T_iI)}{p(\bar T_i|I)\int_0^1 df\,p(D|f\bar T_iI)\,P(f|\bar T_iI)},$ (10)

where K has cancelled. Upon writing $p(T_i|DI) = 1 - p(\bar T_i|DI)$, it is easy to isolate $p(T_i|DI)$ from (10).

Now put i = 2 in (10), write $\bar T_2 = T_1$, put $P(f|T_1I) = \delta(f)$ and perform the now-trivial integration over f in the denominator. Suppose also that the prior information I contains nothing discriminating between the two theories, so that by symmetry $p(T_1|I) = p(T_2|I) = 1/2$. Then (10) reduces to

$\frac{p(T_2|DI)}{p(T_1|DI)} = \frac{\int_0^1 df\,P(f|T_2I)\,p(D|fT_2I)}{p(D|f{=}0;\,T_1I)}$ (11)

We prefer T2 (adulteration) or T1 (honesty) according to whether the RHS of (10) is greater or less than one; how strongly to prefer either theory is determined by the value of this ratio. Expression (11) is our fundamental formula.

The presence of the prior density $P(f|T_2I)$ for f in (11) is a strength, not a weakness: the prior information I is known and it must be taken into account. To find this prior density the prior information must be specified.

The only other term in (11) is the likelihood $p(D|fT_2I)$. (The denominator is just its value at f = 0.) In combination with the prior density for f, it gives the posterior density for f according to Bayes' theorem,

p(f|D T2 I) = K′ p(f|T2 I) p(D|f T2 I)   (12)


where the factor

K′⁻¹ = ∫₀¹ df′ p(f′|T2 I) p(D|f′ T2 I)   (13)

guarantees normalisation of the posterior density and f′ is a dummy variable. This normalising factor is the same as the numerator in (11). Expression (12) encapsulates everything we know about f after the data are logged. It can always be calculated, even when (11) indicates that the dairy is honest - though in that case there is little point in finding it.

If the likelihood p(D|f T2 I), and consequently the posterior density p(f|D T2 I), is sharply peaked in f about a value well away from zero, then the probability of getting the data if f were equal to zero, which is the denominator in (11), becomes very small and T2 is strongly preferred. The likelihood is sharply peaked if, for example, the data consist of a large number of logically independent samples from the same distribution, for then the central limit theorem tells us that the probability density for the mean of the samples is accurately Gaussian and sharply peaked about the mean of the distribution, with variance equal to the variance of the distribution divided by the number N of samples. The likelihood p(D|f T2 I) is re-expressed in terms of the sample mean and N−1 other independent functions of the samples, and so contains this Gaussian as a factor.

Our formula (12) may be used in (Bayesian) parameter estimation, since from it the mean ⟨f⟩, the variance ⟨f²⟩ − ⟨f⟩², and all higher posterior moments of f can be calculated. The maximum of (12) is the value most strongly to believe - though not the value exclusively to believe. Marginalisation over every value of f is not an arbitrary averaging process, but a rigorous consequence of the sum and product rules; it is part of how to reason correctly.
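In the same spirit, a minimal sketch of (12)-(13) and the resulting posterior moments, again with a hypothetical likelihood standing in for p(D|f T2 I):

```python
import numpy as np

# Sketch of (12)-(13) for a hypothetical likelihood; all numbers are illustrative.
f = np.linspace(0.0, 1.0, 20001)
df = f[1] - f[0]
like = np.exp(-0.5 * ((f - 0.12) / 0.02) ** 2)   # hypothetical p(D|f T2 I)
prior = np.ones_like(f)                          # flat p(f|T2 I)

Kprime_inv = np.sum(prior * like) * df           # normalising factor (13)
posterior = prior * like / Kprime_inv            # posterior density (12)

mean_f = np.sum(f * posterior) * df
var_f = np.sum((f - mean_f) ** 2 * posterior) * df
print("posterior mean and sd of f:", mean_f, np.sqrt(var_f))
```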

3. The Likelihood Term

Further progress rests on how the theory T2 predicts the amino acid concentrations which comprise the data D. The connection tells us about the likelihood p(D|f T2 I) in (11) and (12). Let us partition the data into measurements made on cowsmilk, denoted D^c, measurements made on goatsmilk, D^g, and measurements made on what the dairy labels cowsmilk, D^d (for dairy). We drop T2 from the conditioning information, since by implication it is included in any calculation where f is involved. Denote by φ^c and φ^g any parameters, in the probability distributions for the amino acid concentrations in cowsmilk and goatsmilk, which are estimable from the data. For example, we might work with Gaussian (Normal) densities, with φ corresponding to the mean and variance, while if the problem is a non-parametric one φ stands for an infinite number of parameters(!) The purpose of taking measurements on pure cowsmilk and pure goatsmilk is to learn about these parameters without the added complication of mixtures of unknown composition.

The likelihood can be written

p(D|f T2 I) = p(D^c D^g D^d|f I)   (14)

            = ∫ dφ^c ∫ dφ^g p(D^c D^g D^d φ^c φ^g|f I)   (15)

(by marginalisation)

            = ∫ dφ^c ∫ dφ^g p(φ^c φ^g|f I) p(D^c D^g D^d|φ^c φ^g f I)   (16)

Page 175: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

170 A.J .M. Garrett

(by the product rule)

            = ∫ dφ^c ∫ dφ^g p(φ^c φ^g|I) p(D^c|φ^c I) p(D^g|φ^g I) p(D^d|φ^c φ^g f I)   (17)

since the data on pure cowsmilk and pure goatsmilk are irrelevant given the parameters specifying the distributions for these, and since knowledge of f is irrelevant to inference about pure cowsmilk and pure goatsmilk;

            = ∫ dφ^c p(φ^c|I) ∫ dφ^g p(φ^g|I) p(D^c|φ^c I) p(D^g|φ^g I) p(D^d|φ^c φ^g f I)   (18)

if we suppose that knowledge of cowsmilk tells us nothing about goatsmilk, and vice-versa.

Next, we suppose that the data D^c consist of measurements of amino acid concentrations in a number of samples N^c of pure cowsmilk. The concentration in the ith sample is denoted x_i^c; if several amino acids are measured at once, this quantity is a vector. The samples are now supposed to be logically independent so that, given the parameters φ^c, knowledge of one sample tells us nothing about any other. By the product rule the likelihood p(D^c|φ^c I) then decomposes into a product of identical likelihoods for the individual samples. There are similarly N^g samples of pure goatsmilk, for which the concentration of the same amino acid(s) in the jth sample is denoted y_j^g; and N^d samples of what the dairy labels cowsmilk, for which the concentration vector of the kth sample is z_k^d. We suppose that the kth dairy sample consists of a fraction 1−f of cowsmilk having amino acid concentration z_k^c before mixing, and a fraction f of goatsmilk of concentration z_k^g before mixing; the concentration x and subscript i refer to pure cowsmilk samples, y and subscript j to pure goatsmilk samples, and z and subscript k to samples from the dairy. Superscripts c, g, d refer to the source of the milk. We therefore write the likelihood (18) as

p(D|f T2 I) = ∫ dφ^c p(φ^c|I) ∫ dφ^g p(φ^g|I) ∏_{i=1}^{N^c} p(x_i^c|φ^c I) dx_i^c ∏_{j=1}^{N^g} p(y_j^g|φ^g I) dy_j^g ∏_{k=1}^{N^d} p(z_k^d|φ^c φ^g f I) dz_k^d .   (19)

The concentration of amino acid in a mixture is calculated from the concentrations in the constituents according to

z_k^d = (1 − f) z_k^c + f z_k^g .   (20)

In the expansion

p(z_k^d|φ^c φ^g f I) = ∫ dz_k^c ∫ dz_k^g p(z_k^d|z_k^c z_k^g φ^c φ^g f I) p(z_k^c z_k^g|φ^c φ^g f I),   (21)

which is a purely logical relation deriving from the product rule and the marginalising rule, we can therefore write the probability density p(z_k^d|z_k^c z_k^g φ^c φ^g f I) as the δ-function δ((1 − f)z_k^c + f z_k^g − z_k^d). This trick for finding the density for z_k^d is better than transforming from variables {z_k^c, z_k^g} to z_k^d and some other function of z_k^c, z_k^g and then marginalising over this other function. The present method involves no such arbitrary function and no calculation of a Jacobian; instead the laws of probability take the strain. The second density

Page 176: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

ARE THE SAMPLES DOPED 171

in the RHS of (21) factorises into the product p(z_k^c|φ^c I) p(z_k^g|φ^g I). Our final formula for the likelihood p(D|f T2 I), to be substituted into (11), is therefore

p(D|f T2 I) = ∫ dφ^c p(φ^c|I) ∏_{i=1}^{N^c} p(x_i^c|φ^c I) dx_i^c  ∫ dφ^g p(φ^g|I) ∏_{j=1}^{N^g} p(y_j^g|φ^g I) dy_j^g

    · ∏_{k=1}^{N^d} dz_k^d ∫ dz_k^c p(z_k^c|φ^c I) ∫ dz_k^g p(z_k^g|φ^g I) δ((1 − f)z_k^c + f z_k^g − z_k^d) .   (22)

When this expression is substituted into both numerator and denominator of (11) (putting f = 0 in the denominator), the differentials dx_i^c, dy_j^g, dz_k^d cancel. If, based on the result, we decide that the dairy is adulterating the cowsmilk with goatsmilk, substitution of (22) into (12)/(13) gives the posterior probability density for the extent of adulteration, f. The maximum of this density gives the most probable value for f.
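For a single scalar dairy measurement, the inner z-integrations in (22) are easy to carry out numerically: integrating the δ-function over z_k^g leaves a one-dimensional integral over z_k^c, giving the density of the mixed sample. The sketch below does this for hypothetical component densities (Gamma shapes, which keep concentrations positive); they merely stand in for p(z^c|φ^c I) and p(z^g|φ^g I) with the parameters held fixed.

```python
import numpy as np
from scipy import stats, integrate

# Inner integrations of (22) for one scalar dairy measurement z_d.
# Doing the z^g integral against the delta-function leaves
#   p(z_d | phi^c, phi^g, f) = (1/f) * integral over z_c of
#                              p_c(z_c) * p_g( (z_d - (1-f) z_c) / f )
p_c = stats.gamma(a=20.0, scale=0.05).pdf   # hypothetical cowsmilk concentration density
p_g = stats.gamma(a=5.0, scale=0.10).pdf    # hypothetical goatsmilk concentration density

def mixed_density(z_d, f):
    if f == 0.0:
        return p_c(z_d)                      # no adulteration: z_d is pure cowsmilk
    hi = z_d / (1.0 - f)                     # beyond this z_c the goatsmilk argument is negative
    integrand = lambda z_c: p_c(z_c) * p_g((z_d - (1.0 - f) * z_c) / f) / f
    val, _ = integrate.quad(integrand, 0.0, hi)
    return val

print(mixed_density(1.0, 0.0), mixed_density(1.0, 0.2))
```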

4. Example: Gaussians

Gaussian distributions often prove useful in practice, and lead to tractable mathematics. Accordingly we shall take the likelihoods for the data samples to be Gaussian, so that φ stands for the mean μ (a vector) and the covariance matrix V:

p(x^c|φ^c I) = (det 2πV^c)^(−1/2) exp[ −½ (x^c − μ^c) · V^c⁻¹ · (x^c − μ^c) ]   (23)

and exactly the same density for z^c. The variables y^g and z^g likewise derive from a Gaussian density with mean μ^g and covariance V^g.

The Gaussian is the density of maximum entropy on (−∞, ∞) with a uniform measure and a stated mean and covariance [5]. It should not be used in this problem if it predicts a significant probability where the concentration would be negative, or if a frequency plot of the data points appears significantly skewed about its maximum.

The integrations over z_k^c, z_k^g in (22) are most economically performed by writing the δ-function in its spectral representation ∝ ∫ dω e^{iω·q} (with dummy argument q). It might seem perverse to write the δ-function, which can be used to remove one integration in (22), in a form which adds a further integration. But, by doing this, symmetry is maintained in the integrations over z_k^c and z_k^g in (22), which separate. Both are Gaussian, and are easily performed using the formula

∫ dz (det 2πV)^(−1/2) exp[ −½ (z − μ) · V⁻¹ · (z − μ) + iω·z ] = exp( iω·μ − ½ ω·V·ω ).   (24)

The final integration, over ω, then involves the product of two Gaussians. Formula (24) can be adapted, giving the further Gaussian

p(z_k^d|μ^c V^c μ^g V^g f I) = (det 2πV^d)^(−1/2) exp[ −½ (z_k^d − μ^d) · V^d⁻¹ · (z_k^d − μ^d) ],   (25)

where the mean μ^d and covariance V^d are weighted averages of those of cowsmilk and goatsmilk:

μ^d = (1 − f) μ^c + f μ^g ,   V^d = (1 − f)² V^c + f² V^g .   (26)


Therefore the likelihood is

p(D|f T2 I) = ∫∫ dμ^c dV^c p(μ^c V^c|I) ∏_{i=1}^{N^c} (det 2πV^c)^(−1/2) exp[ −½ (x_i^c − μ^c) · V^c⁻¹ · (x_i^c − μ^c) ] dx_i^c

    · ∫∫ dμ^g dV^g p(μ^g V^g|I) ∏_{j=1}^{N^g} (det 2πV^g)^(−1/2) exp[ −½ (y_j^g − μ^g) · V^g⁻¹ · (y_j^g − μ^g) ] dy_j^g

    · ∏_{k=1}^{N^d} (det 2πV^d)^(−1/2) exp[ −½ (z_k^d − μ^d) · V^d⁻¹ · (z_k^d − μ^d) ] dz_k^d ,   (27)

in which the products all take the same form. When this expression is substituted into (11) the differentials cancel on top and bottom.

The products of exponentials in (27) can be written as exponentials of sums. The result can be expressed in terms of the sample mean and (co)variance if a single amino acid is measured, so that μ and V are scalar-valued.

If the prior densities for the parameters μ^c and μ^g are themselves Gaussian, the marginalisation over these can be carried out analytically in (27). The result is not of any striking form.
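A scalar numerical sketch of the Gaussian case may help. It assumes the cowsmilk and goatsmilk parameters are effectively known (sharply peaked priors), so the pure-milk factors cancel between the numerator and denominator of (11) and only the dairy samples matter; the dairy likelihood then uses the mixture mean (1−f)μ^c + fμ^g and variance (1−f)²V^c + f²V^g. All data and parameter values below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_c, V_c = 1.00, 0.02 ** 2       # hypothetical cowsmilk mean and variance
mu_g, V_g = 0.60, 0.05 ** 2       # hypothetical goatsmilk mean and variance

f_true = 0.10                      # pretend adulteration fraction used only to simulate data
z_d = (1 - f_true) * rng.normal(mu_c, np.sqrt(V_c), 25) \
      + f_true * rng.normal(mu_g, np.sqrt(V_g), 25)

def log_like(f):
    # Dairy-sample likelihood: z_d ~ N(mu_d, V_d) with the weighted mean and variance.
    mu_d = (1 - f) * mu_c + f * mu_g
    V_d = (1 - f) ** 2 * V_c + f ** 2 * V_g
    return np.sum(-0.5 * np.log(2 * np.pi * V_d) - 0.5 * (z_d - mu_d) ** 2 / V_d)

f_grid = np.linspace(0.0, 1.0, 2001)
log_L = np.array([log_like(f) for f in f_grid])
w = np.exp(log_L - log_L.max())                     # work relative to the peak for stability
numerator = np.sum(w) * (f_grid[1] - f_grid[0])     # flat prior p(f|T2 I) = 1 on [0, 1]
denominator = np.exp(log_like(0.0) - log_L.max())
print("odds (11), adulteration vs honesty:", numerator / denominator)
```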

5. Special Cases: Dilution and Identification

If we know that any adulteration of the cowsmilk takes place by dilution with water (or any other inert compound free of amino acids), we simply replace the probability density for the amino acids in goatsmilk by δ-functions located at zero concentration. Upon performing the now-trivial integrations over z_k^c and z_k^g in (22), the marginalisation over φ^g reduces to a normalisation, indicating the irrelevance of making measurements on an exactly specified compound. The likelihood (22) simplifies to

p(D|f T2 I) = ∫ dφ^c p(φ^c|I) ∏_{i=1}^{N^c} p(x_i^c|φ^c I) dx_i^c ∏_{j=1}^{N^g} δ(y_j^g) dy_j^g

    · ∏_{k=1}^{N^d} dz_k^d (1 − f)⁻¹ p(z_k^c = (1 − f)⁻¹ z_k^d | φ^c I)   (28)

and the ratio (11) for p(T2|DI)/p(T1|DI) to

[ ∫₀¹ df p(f|T2 I) (1 − f)^(−N^d) ∫ dφ^c p(φ^c|I) ∏_{i=1}^{N^c} p(x_i^c|φ^c I) ∏_{k=1}^{N^d} p(z_k^c = (1 − f)⁻¹ z_k^d|φ^c I) ]

    / [ ∫ dφ^c p(φ^c|I) ∏_{i=1}^{N^c} p(x_i^c|φ^c I) ∏_{k=1}^{N^d} p(z_k^c = z_k^d|φ^c I) ] .   (29)

Similar simplifications take place in finding the posterior probability density for the dilution parameter f.

For the identification problem, in which the dairy puts out either pure cowsmilk or pure goatsmilk and we investigate which, we have p(f|T2 I) = δ(f − 1), and the ratio (11)


reduces to p(D|f=1; T2 I) / p(D|f=0; T2 I). Both likelihoods in this ratio are easily found from the preceding analysis.

6. Discussion

Once the form of the probability density functions for the amino acids in cowsmilk and goatsmilk is specified and the priors for any undetermined parameters in these densities are given, together with the prior density for f, the foregoing analysis provides a complete solution, telling us how strongly to believe that the dairy is honest or dishonest, and how strongly to believe that it is adulterating cowsmilk by any fraction of goatsmilk.

The tutorial paper by Coomans and Massart on milk adulteration [1] treats a slightly different problem, in which - as here - it is not known whether the dairy is honest; but if it is dishonest then the adulteration fraction f is known. This seems a strange model to treat: if you don't know whether the dairy is adulterating, you are unlikely to know the extent of any adulteration! This case is handled by simply putting the prior density p(f|T2 I) equal to the δ-function δ(f − F), where F is the extent to which the dairy, if it is dishonest, adulterates the cowsmilk. The ratio (11) is then just the ratio of the likelihoods for the data at f = F and at f = 0. Obviously there is no longer any estimation of f.

The principal method advocated by Coomans and Massart involves setting up two sets of samples to match the possible output of the dairy. One set consists of samples of pure cowsmilk, while the other contains samples made up of a mixture of 100(1 − F)% cowsmilk and 100F% goatsmilk. These are known as 'training sets', while the samples from the dairy comprise the test set. To solve Coomans and Massart's problem using samples of this sort, the product over the j's in (22) is taken over an expression of the same form as the product over the k's, at the value F.

At first sight, the technique of making up samples according to the possible output of the dairy seems a good way to facilitate comparison. But suppose that F = 0.05, which is a plausible value. Our information about goatsmilk then comes entirely from samples in which it is swamped 19:1 by cowsmilk. Little will be learned: the posterior density for φ^g will differ from the prior far less than will the posterior for φ^c from its prior. It is shooting yourself in the foot to dilute whatever goatsmilk you have. The statistical methodology is driving the experimental methodology here, instead of vice-versa. This is why the present paper does not use Coomans and Massart's data to find numerical results; it would be to acquiesce in a flawed protocol.

Coomans and Massart's separation of the data into training samples of known composition and test samples of unknown composition is an artificial procedure. All are simply data, bearing in their own way on the question of whether the dairy is honest; D^c and D^g are data, not conditioning information. Conversely, we can learn something about cowsmilk and goatsmilk even from the dairy samples. Marginalisation over f in the posterior densities for the parameters φ^c and φ^g indicates this fact.

The Coomans-Massart approach employs a pattern recognition technique called 'hard modelling', in which a rule is used to classify each sample from the dairy squarely into one or other of the training classes. It is strange that different dairy samples can be placed in different classes, even though we know that all must lie in the same class - though which class that is is not certain. There is a confusion here between what is, and our knowledge of what is; between ontology, and epistemology [4]. Moreover the rules for classifying samples in hard modelling, in contrast to the sum and product rules, have no deeper justification.


If they are not equivalent to them they are wrong.

The hard modelling methodology also suggests why the rather strained model problem of not knowing whether the dairy is adulterating, but knowing the extent of any adulteration, has been examined: hard modelling cannot estimate f. Coomans and Massart do outline a probabilistic classification method for the dairy samples, but sample classification is not the way to tackle the problem. (Some other probability-based pattern recognition methods are called 'soft modelling', but the distinction does not seem a useful one.) To a well-posed question there can be only one correct answer, and this emerges by treating directly the question that is asked and using the sum and product rules and nothing else.

It is easy to allow for adulteration by milk (or anything else) from several other species. A separate f is defined for each compound that we suspect is tipped into the vat; equations (11) and (22) readily generalise.

The present Bayesian analysis can even be generalised to allow for different degrees of adulteration from dairy sample to sample. (Here we restrict ourselves to adulteration solely by goatsmilk.) An f_k is defined for each dairy measurement z_k^d in (22); it is easy to show that the integration in the numerator of (11) becomes a product of integrations over the f_k, according to

∫₀¹ df p(f|T2 I) … → ∏_{k=1}^{N^d} ∫₀¹ df_k p(f_k|T2 I) …   (30)

Here p(f_k|T2 I) is the distribution from which the degree of adulteration f_k is drawn. Any data-estimable parameters in this distribution are marginalised over in the same way as the φ's; their posterior distribution can be found if desired. The probability density (and hence an estimate) for each f_k is available from a corresponding generalisation of (11) and (12), taking all data - not just the kth - into account and marginalising over the f of every other sample. No alternative technique even begins to come to grips with this more realistic problem. Once more the superiority of Bayesian methods, simply letting the sum and product rules run, stands forth.

REFERENCES

1. Coomans, D. & Massart, D.L. 1992. Hard Modelling in Supervised Pattern Recognition. In: R.G. Brereton (ed), Multivariate Pattern Recognition in Chemometrics. Elsevier, Amsterdam, Netherlands. pp249-287. This work gives further references on milk classification.

2. Cox, R.T. 1946. Probability, Frequency and Reasonable Expectation. American Journal of Physics 14, 1-13.

3. Jaynes, E.T. 1983. E. T. Jaynes: Papers on Probability, Statistics and Statistical Physics. R.D. Rosenkrantz (ed). Synthese Library 158. Reidel, Dordrecht, Netherlands.

4. Jaynes, E.T. Probability Theory: the Logic of Science. Book, in preparation; interim versions available at http://omega.albany.edu:8008/JaynesBook.html.

5. Tribus, M. 1969. Rational Descriptions, Decisions and Designs. Pergamon, New York , U.S.A.

6. Jeffreys, H. 1939. Theory of Probability. Oxford University Press, Oxford, U.K.

7. Garrett, A.J.M. 1991. Ockham's Razor. In: W.T. Grandy & L.H. Schick (eds), Maximum Entropy and Bayesian Methods, Laramie, Wyoming, U.S.A., 1990. Kluwer, Dordrecht, Netherlands. pp357-364.


CONFIDENCE INTERVALS FROM ONE OBSERVATION

C. C. Rodriguez Department of Mathematics and Statistics State University of New York at Albany E-Mail: [email protected] URL: http://omega.albany.edu:8008/carlos

ABSTRACT. Robert Machol's surprising result, that from a single observation it is possible to have finite length confidence intervals for the parameters of location-scale models, is reproduced and extended. Two previously unpublished modifications are included. First, Herbert Robbins' nonparametric confidence interval is obtained. Second, I introduce a technique for obtaining confidence intervals for the scale parameter of finite length in the logarithmic metric.

1. Introduction

Let x be an observation from a N(μ, σ²) population with unknown parameters. The following statement belongs to the folklore of Statistical Science: From a single observation x we cannot gain information about the variability in the population. Thus, finite length confidence intervals for μ and/or σ are impossible even in principle.

This is not correct. For example x ± 5·|x| will cover μ at least 90% of the time and (0, 17|x|) will cover σ at least 95% of the time. If you don't believe it check it with your PC!
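For readers who do want to check it with a computer, a minimal Monte Carlo sketch (the particular μ and σ are arbitrary, since the guarantee is uniform in them):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.7, 2.5                      # any values; the claim holds for all of them
x = rng.normal(mu, sigma, 1_000_000)

cover_mu = np.mean(np.abs(x - mu) <= 5 * np.abs(x))     # does x +/- 5|x| cover mu?
cover_sigma = np.mean(sigma <= 17 * np.abs(x))          # does (0, 17|x|) cover sigma?
print(f"mu coverage: {cover_mu:.3f}   sigma coverage: {cover_sigma:.3f}")
```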

I first heard about this some years ago from Herbert Robbins. According to Robbins, this phenomenon was discovered by an electrical engineer in the 60's (Robert Machol IEEE Trans. Info. Theor., 1964) but it is still relatively unknown to statisticians.

I show Machol's idea below. The intervals for μ in the parametric case are due to him. The nonparametric improvement is due to Robbins and the intervals on σ are mine.

2. Confidence Intervals for μ, Parametric Case

Consider the following problem. Given a single observation from a r.v.

X ~ (1/σ) f((x − μ)/σ),   μ ∈ ℝ, σ > 0 unknown,

with f a known density symmetric about zero. Find a finite length 100·(1 − β)% CI for μ.

Machol's answer: Consider the event

A = [ |X − μ| > t|X − a| ],

where a ∈ ℝ is an arbitrary constant and t > 1 is given. We have

A = [ |Y| > t|Y − α| ],



Fig. 1. Illustration of event A

where

Y = (X − μ)/σ ~ f(y)   and   α = (a − μ)/σ ∈ ℝ.

The event A corresponds to the shaded piece in Fig. 1. Thus,

P(A) = P[ |Y| > t|Y − α| ] = | ∫_{αt/(t+1)}^{αt/(t−1)} f(y) dy | = β(α, t)

and

P(A) ≤ β*(t) = sup_{α∈ℝ} β(α, t).

Therefore P[ X − t|X − a| ≤ μ ≤ X + t|X − a| ] = P(A^c) ≥ 1 − β*(t).

Hence, provided that β*(t) → 0 as t → ∞ the interval X ± t|X − a| can be made to have any pre-specified confidence.

Example: Take f(y) = φ(y) ≡ pdf of N(0, 1). From the symmetry of φ about zero we can write

β(−α, t) = | ∫_{αt/(t+1)}^{αt/(t−1)} φ(z) dz | = β(α, t).

Thus,

β*(t) = sup_{α>0} β(α, t).

For α > 0 we have

β(α, t) = ∫_{αt/(t+1)}^{αt/(t−1)} φ(z) dz,

so that at the maximising α

exp[ ½ (αt/(t−1))² − ½ (αt/(t+1))² ] = (t+1)/(t−1)


and taking logs we obtain

½ α² t² [ 1/(t−1)² − 1/(t+1)² ] = log( (t+1)/(t−1) ),

from where

α* = ((t² − 1)/t) √[ (1/(2t)) log( (t+1)/(t−1) ) ]

and

β*(t) = ∫_{(t−1)√[(1/(2t)) log((t+1)/(t−1))]}^{(t+1)√[(1/(2t)) log((t+1)/(t−1))]} φ(y) dy .

with a calculator and a normal table we find that for t = 5 then α* = 1.0796, β* = .1 and the confidence is 90% for x ± 5|x|. Other intervals could be computed in a similar way. In fact this shows that

P[ X − 5|X − a| ≤ μ < X + 5|X − a| ] > .90

for all a ∈ ℝ, μ ∈ ℝ and σ > 0. The best a is the one that produces the shortest expected length. But, length = L = 2t|X − a| and

E(L) = 2t E(|X − a|) ∝ E(|X − a|)

so that the best a = a* should minimize E(|X − a|), i.e. a* must be the median of X, and since X is symmetric about μ we have a* = μ. Hence, the best a is our best a priori guess for μ. This looks like Bayesianism sneaking into classical confidence intervals!

The arbitrariness of a in the statement "x ± t|x − a| is a (1 − β*(t))·100% CI for μ"

reminds me of the Stein shrinking phenomenon. Perhaps this is part of the reason why Robbins got interested in it. Recall that Robbins' Empirical Bayesianism produces Stein's estimators as a special case.

3. Confidence Intervals for μ, Non-parametric Case

Let ℑ be the class of all unimodal, symmetric about zero densities. Given a single observation of X with X ~ f(x − μ) where both f ∈ ℑ and μ ∈ ℝ are unknown, find a 100(1 − β)% CI for μ of finite length.

Robbins' Answer: Consider first the following simple lemma:

Lemma: If f ↓ in (0, +∞) then

l(x) = (1/(b − x)) ∫_x^b f(y) dy   ↓ in (0, b).

Proof: This is obvious from the picture (see Fig. 2), since l(x) denotes the mean value of f on (x, b). Of course the algebra gives the same answer. Notice that

l(x) ≤ (1/(b − x)) f(x) (b − x) = f(x).



Fig. 2. The mean value of f(y) decreases when x approaches b

Thus, differentiating both sides of the equation

(b − x) l(x) = ∫_x^b f(y) dy,

we obtain

l′(x) = (1/(b − x)) [ l(x) − f(x) ] < 0,

i.e. l(x) decreases in (0, b).

Consider as before the event

A = [ |X − μ| > t|X − a| ]   for t > 1 and a ∈ ℝ.

Then, if Y = X − μ, we have

P(A) = P[ |Y| > t|Y − α| ]   with α = a − μ ∈ ℝ.

P(A) = β(α, t) = β(−α, t)   since f ∈ ℑ.

But now applying the Lemma for x = αt/(t + 1) > 0 and b = αt/(t − 1) we obtain

P(A) / ( αt/(t−1) − αt/(t+1) ) = l(x) ≤ l(0) = ((t−1)/(αt)) ∫_0^{αt/(t−1)} f(y) dy ≤ (t−1)/(2αt).

Hence,

P(A) ≤ 1/(t + 1)   for all α ∈ ℝ and f ∈ ℑ.

Therefore

P[ X − t|X − a| < μ < X + t|X − a| ] ≥ 1 − 1/(1 + t)


holds for all a ∈ ℝ, μ ∈ ℝ, and f ∈ ℑ.

Example: For t = 9, we have 1 − 1/(1 + t) = .9, and x ± 9|x − a| will cover μ at least 90% of

the time even if we are uncertain about f ∈ ℑ. This suggests the following game: each time you pick a function f in ℑ in any way you want, i.e. deterministically or stochastically with some distribution. Then you choose μ ∈ ℝ also in an arbitrary way, i.e. the same μ every time, or following a pre-specified sequence, or generated from a distribution, changing the distribution each time, etc. Then use the computer to show me x ~ f(x − μ). I win $1 if x ± 9|x| covers your μ and you win $5 if it doesn't. Do you want to play a couple of hundred times?
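A sketch of a few rounds of this game, with deliberately awkward (but unimodal, symmetric) choices of f and μ; the bound promises at least 90% coverage for each of them. The particular densities and μ values are my own illustrative picks, not prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000   # plays per choice of (f, mu)

# (name, true mu, samples x ~ f(x - mu)); a = 0 and t = 9 throughout.
trials = [
    ("Cauchy",            0.3,   0.3 + rng.standard_cauchy(n)),
    ("Laplace(scale=5)", -7.0,  -7.0 + rng.laplace(0.0, 5.0, n)),
    ("Uniform(-1,1)",     0.01,  0.01 + rng.uniform(-1.0, 1.0, n)),
    ("Normal(sd=0.001)", 100.0, 100.0 + rng.normal(0.0, 0.001, n)),
]

for name, mu, x in trials:
    coverage = np.mean(np.abs(x - mu) <= 9.0 * np.abs(x))   # does x +/- 9|x| cover mu?
    print(f"{name:18s} coverage of mu: {coverage:.3f}")
```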

4. Confidence Intervals for σ

We consider now the estimation of the scale parameter from a single observation. It should be noticed that the only interesting confidence intervals are those of finite length. Thus, (0, ∞) is a 100% confidence interval but useless.

The natural, invariant under re-parameterizations, measure of length for a confidence interval (a, b) for a scale parameter is not just b − a but proportional to the difference in the logarithmic scale, i.e. log b − log a. This follows by recalling the fact that the square of the element of length, on the hypothesis space of the location-scale model, along a line of constant location is given by:

ds² = g_σσ dσ²,

where g_σσ is the Fisher information amount at σ given by:

g_σσ = (k − 1)/σ²

with

k = 4 ∫_{−∞}^{∞} y² (ψ′(y))² dy

and ψ² = f in the notation of the proposition below. Hence, the geodesic distance from the probability distribution with scale "a" to the probability distribution with scale "b" is obtained by integrating the element of length and therefore proportional to the difference in the log scale as noted above. The reader unfamiliar with the geometry of hypothesis spaces may use the expression of the Kullback number between the gaussian with mean zero and standard deviation "a" and the gaussian with mean zero and standard deviation "b" as an approximation to the geodesic distance, to convince him/herself of the logarithmic nature of this length.

It is therefore necessary to consider confidence intervals with non-zero lower bounds, since σ = 0 is in fact a line at infinity. I show below that it is possible to have finite length confidence intervals for the scale parameter from a single observation, but only if we rule out a priori from the hypothesis space a bit more than the line σ = 0. It is this interplay between geometry, classical inference and bayesianism that I find appealing in this problem.

Proposition: Let f be a pdf symmetric about 0 and differentiable everywhere. Let F be the associated cdf. Let 0 < t1 < t2 ≤ ∞ with f′(t2) > f′(t1) and define

G(α, t1, t2) = F(α − t1) + F(α + t2) − F(α − t2) − F(α + t1).



Fig. 3. Illustration of event A

Let M > 0, a ∈ ℝ, μ ∈ ℝ, σ > 0 be given numbers. Then if

|μ − a| ≤ σM   and   X ~ (1/σ) f((x − μ)/σ),

we have

P[ |X − a|/t2 ≤ σ ≤ |X − a|/t1 ] ≥ 2[ F(t2) − F(t1) ] I[M ≤ M*] + I[M > M*] inf_{0<α<M} G(α, t1, t2),

where M* = min{ α > 0 : G(α, t1, t2) = G(0, t1, t2) }. If f ≡ N(0, 1) (or any other pdf with similar tails) an excellent approximation is

M* ≈ t2 + F⁻¹( 2F(t1) − 1 ).

Proof: Consider the event

A = [ |X − a|/t2 ≤ σ ≤ |X − a|/t1 ].

Let Y = (X − μ)/σ ~ f(y).

Then by adding and subtracting μ inside the absolute values and dividing through by σ we obtain

A = [ t1 ≤ |Y − α| ≤ t2 ],

where α = (a − μ)/σ is such that |α| ≤ M. Notice that the y's satisfying the inequalities that define the event A correspond to the shaded region in Fig. 3.

Hence,

P(A) = G(α, t1, t2).


Fig. 4. Illustration of the event A

Notice that for given values t1 and t2 the function G, as a function of α, is twice differentiable and symmetric about zero with a local minimum at α = 0. Since, using the fact that f(y) = f(−y), we have

∂G/∂α |_{α=0} = [ f(α − t1) − f(α − t2) + f(α + t2) − f(α + t1) ] |_{α=0} = 0

and also

∂²G/∂α² |_{α=0} = f′(−t1) − f′(−t2) + f′(t2) − f′(t1) = 2 ( f′(t2) − f′(t1) ) > 0 .

Thus,

P(A) ≥ G(0, t1, t2) = 2[ F(t2) − F(t1) ]

provided that |α| ≤ M*, i.e. if M ≤ M*. The picture (see Fig. 4) illustrates the situation.

In the gaussian case, to obtain reasonable confidences we must have t1 < 1 and t2 > 3. Hence, F(α − t1) ≈ F(α + t1) ≈ F(α) and F(α + t2) ≈ 1. From where

G(α, t1, t2) ≈ 1 − F(α − t2) = 2[1 − F(t1)] ≈ G(0, t1, t2)

and the approximation for M* is obtained by solving the central identity for α.

Remarks:

1) Notice that the lower bound of the confidence interval, i.e. |x − a|/t2, is positive only if M < ∞, i.e. if we know a priori that |μ − a| ≤ σM with M < ∞.

2) When t2 → ∞ then M* → ∞ and with no prior knowledge (i.e. |μ − a| < ∞) we still have

P( 0 ≤ σ ≤ |X − a|/t1 ) ≥ 2(1 − F(t1)).

3) The value of t2 is related to the amount of prior information. The larger t2 the weaker the prior information necessary to achieve the desired confidence. On the other hand


t1 controls the confidence associated with the interval. These remarks are illustrated with examples.

Examples: Let x be a single observation from a gaussian with unknown mean μ and unknown variance σ². Then 90% CIs for σ are:

(0, 8|x|) valid always
(|x|/4, 8|x|) valid if |μ| ≤ 2.7σ
(|x|/8, 8|x|) valid if |μ| ≤ 6.7σ

95% CIs are:

(|x|/5, 17|x|) valid if |μ| ≤ 3.3σ
(|x|/50, 17|x|) valid if |μ| ≤ 48σ
(0, 17|x|) valid always.

99% CIs are:

(|x|/5, 70|x|) valid if |μ| ≤ 2.7σ
(|x|/1000, 70|x|) valid if |μ| ≤ 997σ
(0, 70|x|) valid always.
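A Monte Carlo sketch of the proposition for the gaussian case is easy to run; the particular t1, t2, μ and σ below are my own illustrative choices (not the tabulated ones), with |μ| comfortably below σM*.

```python
import numpy as np
from scipy.stats import norm

# With a = 0 the interval (|x|/t2, |x|/t1) covers sigma with probability
# G(alpha, t1, t2), alpha = -mu/sigma, and the claimed lower bound is
# 2[F(t2) - F(t1)] whenever |mu| <= sigma*M with M <= M*.
t1, t2 = 0.126, 5.0                     # gives 2[F(t2) - F(t1)] of roughly 0.90
mu, sigma = 1.5, 1.0                    # illustrative; well inside the validity region

bound = 2 * (norm.cdf(t2) - norm.cdf(t1))

rng = np.random.default_rng(7)
x = rng.normal(mu, sigma, 1_000_000)
covered = (np.abs(x) / t2 <= sigma) & (sigma <= np.abs(x) / t1)
print(f"claimed bound: {bound:.3f}   observed coverage: {np.mean(covered):.3f}")

# Exact coverage from the G function, for comparison:
alpha = -mu / sigma
G = (norm.cdf(alpha - t1) + norm.cdf(alpha + t2)
     - norm.cdf(alpha - t2) - norm.cdf(alpha + t1))
print(f"G(alpha, t1, t2) = {G:.3f}")
```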

Almost Real Example

I'll try to show that the required prior knowledge necessary to have non-zero lower bounds for the CIs is in fact often available. Suppose that we want to measure the length of the desk in my office with a regular meter graduated in centimeters. Let x be the result of a single measurement and let μ be the true length of my desk. Then

x = μ + ε   with   ε ~ N(0, σ²)

is a reasonable and very popular assumption. Now, even before I make the measurement I can write with all confidence that for my desk μ = 2 ± 1 m, i.e. |μ − 2| ≤ 1. With the meter graduated in centimeters I will be guessing the middle line between centimeters, so I can be sure that x = μ ± at least ¼ of a centimeter. Thus,

Therefore I can be absolutely sure that

|μ − 2| ≤ 1200σ.

Hence,

( |x − 2|/1500 , 70|x − 2| )

will be a 99% CI for σ.


HYPOTHESIS REFINEMENT

G A Vignaux, Institute of Statistics and Operations Research, Victoria University, PO Box 600, Wellington, New Zealand, Tony.Vignaux@vuw.ac.nz

Bernard Robertson, Department of Business Law, Massey University, PO Box 11222, Palmerston North, New Zealand, B.W.Robertson@massey.ac.nz

ABSTRACT. The conventional portrayal of Bayes Theorem is that a likelihood ratio for evidence under two hypotheses is combined with prior odds to form posterior odds. The posterior becomes the prior to which a likelihood ratio for the next item of evidence is applied and so forth. At each stage the likelihood ratio becomes more complex as it is conditioned upon more and more earlier pieces of evidence.

Objectors to Bayesian methods claim that this presentation does not represent real thought processes and may not be possible in real-world inferential problems.

A more attractive view of the Bayesian model involves the successive refinement (or redefinition or subdivision) of hypotheses to incorporate previous items of evidence. Then at each step different hypotheses are compared. This approach is entirely consistent with the logical approach to probability while accommodating, or at least defusing, these objections.

1. Introduction

The great jurist John Henry Wigmore (1913) proposed that there were only three possible responses to an item of evidence introduced by an opponent in an adversarial trial: explanation, denial, and the introduction of a rival fact. While the stuff of television trials is denial, in reality defence is usually conducted by conceding much evidence and attempting to explain the remainder. This approach is much more difficult for the prosecution to deal with. The defence may even succeed in framing an alternative hypothesis which accounts for all the evidence. If such a hypothesis has any reasonable probability then a "reasonable doubt" will have been introduced and the defendant is entitled to be acquitted. In a civil case the same techniques are available but the defendant's task is harder since the alternative explanation must be at least as likely to be true as not.

It seems, then, that hypothesis selection may be just as important as seeking out new evidence. At the end of a trial the jury may have to choose between two very narrowly defined hypotheses both of which accommodate almost all the evidence. Does Bayesian reasoning capture this process?


2. Hypothesis refinement

Consider a criminal case in which there is eye-witness evidence about the race and height of the offender and DNA analysis of a bloodstain known to have been left by him. A suspect is arrested. Accurate observations of the suspect's race, DNA profile and height are made. We define the following hypotheses and items of evidence:
H1 = the perpetrator was the suspect, of race X, DNA profile Y, and height z.
H2 = the perpetrator was someone else.
I = all our background information.
E1 = an eye witness states the perpetrator was of race X.
E2 = the scene sample, left by the perpetrator, has DNA profile Y.
E3 = the perpetrator was of height z.

The correct logical method for updating one's belief in a hypothesis after the reception of evidence, E1, is, of course, by applying Bayes' Rule. In odds form this is:

P(H1|E1I) / P(H2|E1I) = [ P(H1|I) / P(H2|I) ] × [ P(E1|H1I) / P(E1|H2I) ]   (1)

If another piece of evidence, E2, is to be considered the posterior odds in (1) are used as the prior odds to be updated by the likelihood ratio for E2 (given E1). This gives

P(H1|E1E2I) / P(H2|E1E2I) = [ P(H1|E1I) / P(H2|E1I) ] × [ P(E2|H1E1I) / P(E2|H2E1I) ]   (2)

Introducing more than two pieces of evidence leads to an increasingly complicated equation:

P(H1|E1..EnI) / P(H2|E1..EnI) = [ P(H1|E1..En−1I) / P(H2|E1..En−1I) ] × [ P(En|H1E1..En−1I) / P(En|H2E1..En−1I) ]   (3)

Empirical research shows that this is certainly not a juror's normal mode of thought, which is to compare stories (Pennington and Hastie, 1991). Nor does it seem to be a useful prescription for analysing complex cases under the conditions in which jurors work. This makes Bayesian analysis vulnerable to the criticism that it is of no use either descriptively or prescriptively. Thus Ligertwood (1992) claimed that:

"[Bayesians] would have us conceptualising our beliefs in mathematical terms and then increasing those beliefs by adding in, piece by piece, individual items of evidence, also conceptualised as mathematical chances, until our beliefs have achieved a degree of likelihood we are prepared to accept as proof."

He then refers to "The cognitive incapacity of human beings to make the complicated and very numerous

calculations which would be required in even simple cases." This refers to the so-called "combinatorial explosion" which is used to frighten lawyers

with the difficulties that might be involved. The argument is that with 30 pieces of evidence, each of which may be true or false, there are 2^30 possible combinations to be considered and analysed by the jury. This argument, of course, ignores the simplification produced by conditional independence between different items of evidence.

These criticisms also wrongly assume that Bayes theorem is the only tool provided by Bayesian Probability Theory and that Bayesian principles cannot be applied to aggregates


of evidence as well as to individual items. In fact, they are not even criticisms of the fundamental logic of Bayes theorem but only of its conventional presentation. We intend to show that the presentation of the application of Bayes theorem to forensic and other problems can be modified to accommodate these criticisms while not, of course, departing from the requirements of its logic.

The term P(E3|H2E1E2I) means, in ordinary language, the probability of the perpetrator being of height z (E3) supposing that he was someone other than the suspect (H2), that he had been observed to be of race X (E1) and he had DNA analysis Y (E2). This raises the possibility of dependencies between, for example, race and height or race and DNA characteristics.

This is not the way the mind will or should work. If good evidence is given that the offender was of race X then the hypothesis that the offender was any other race becomes, at least for the present, worthy of little consideration. From then on the hypotheses to be compared are that the offender was the accused and that the offender was some other person of race X. The hypothesis that the offender was some other race may be resurrected if other evidence comes to light and Jaynes (1994) explains how this might occur.

As each item of evidence is considered and accepted the effect is therefore not just a change in odds but also a change in the hypotheses to be compared. A number of hypotheses may require consideration and the evidence will be of greater or lesser power in discriminating between them. Thus on the basis of evidence E1 we might consider three hypotheses:
H2 = the perpetrator was some other person. H2 = H2a + H2b, where
H2a = the perpetrator was some other person of race X.
H2b = the perpetrator was some other person of another race.

P(H1|E1I) / P(H2|E1I) = P(H1|E1I) / [ P(H2a|E1I) + P(H2b|E1I) ].   (4)

The defence may choose to attack the strength of the eye-witness evidence, E1, by exposing the witness's poor eyesight or racial prejudice but if E1 is taken as strong evidence then P(H2b|E1I) will be very small. Where P(H2b|E1I) is small (or, more accurately, very much smaller than P(H2a|E1I)) then

P(H1|E1I) / P(H2|E1I) ≈ P(H1|E1I) / P(H2a|E1I).   (5)

Now,

P(H1|E1I) / P(H2a|E1I) = [ P(H1|I) P(E1|H1I) ] / [ P(H2a|I) P(E1|H2aI) ].   (6)

If the probability of getting evidence E1 is conditionally independent of the identity of the suspect, supposing that the race is given, the likelihood ratio in (6) is 1 and (6) becomes

P(H1|E1I) / P(H2|E1I) ≈ P(H1|I) / P(H2a|I).   (7)


So when looking at the next piece of evidence we are comparing two more specific hypotheses, H1 and H2a, which include the fact that the suspect and the perpetrator are of the same race, X.

Then, when E2 (the DNA profile) is introduced as evidence,

P(H1|E1E2I) / P(H2|E1E2I) ≈ P(H1|E2I) / P(H2a|E2I).   (8)

This gives (9) rather than (2).

P(H1|E2I) / P(H2a|E2I) = [ P(H1|I) P(E2|H1I) ] / [ P(H2a|I) P(E2|H2aI) ].   (9)

H2a similarly subdivides into H2ai + H2aii:
H2ai = the perpetrator was another person of race X with DNA profile Y.
H2aii = the perpetrator was another person of race X but without DNA profile Y.

Assuming we remain confident of the DNA evidence (E2), H2aii falls out of consideration (i.e. becomes of negligible probability) and P(H2ai E2|I) = P(H2ai|I). Thus

P(H1|E1E2I) / P(H2|E1E2I) ≈ [ P(H1|I) P(E2|H1I) ] / [ P(H2ai|I) P(E2|H2aiI) ] = P(H1|I) / P(H2ai|I).   (10)

When the evidence of height, E3, is introduced, we need only compare its probabilities conditional upon H1 or H2ai.

This process of hypothesis refinement eases the problem of dependency in calculating the likelihood ratios. We are not now concerned with any "dependence" between DNA and race but only with the probability that a person of our particular known race X would have the DNA characteristics found.
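A toy numerical sketch of this refinement chain, with entirely hypothetical numbers (pool size, race proportion and DNA profile frequency) and with E1 and E2 treated as effectively certain, shows the simple arithmetic that equations (7) and (10) leave to be done:

```python
# All numbers below are hypothetical, chosen only to illustrate the arithmetic.
N = 100_000          # hypothetical pool of other people who could have left the stain
p_race = 0.20        # hypothetical proportion of race X in that pool
p_dna = 1e-6         # hypothetical frequency of DNA profile Y among race X

# Prior odds of H1 (the suspect) against H2 (someone else): 1 vs N.
odds = 1.0 / N

# Accepting E1 (eq. 7): the alternative becomes H2a, some other person of race X,
# so the alternative pool shrinks by the factor p_race.
odds /= p_race

# Accepting E2 (eq. 10): the alternative becomes H2ai, race X *and* profile Y.
odds /= p_dna

print(f"posterior odds H1 : refined alternative = {odds:.1f} : 1")
```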

3. Uncertain evidence

Suppose an item of evidence does not seem sufficiently powerful to eliminate a hypothesis from consideration. For example, we may have an eye-witness's statement about the race of a perpetrator. At the extreme we may have no confidence at all in the ability of the eye-witness to distinguish race X from the other races in the circumstances of the incident. In that case P(E1|H2aI) = P(E1|H2bI), giving a likelihood ratio between them of 1. The prior probability ratio for H2a and H2b would be unchanged and we would not bother to subdivide H2 on this basis.

The intermediate case is where P(E1|H2aI) > P(E1|H2bI) but the likelihood ratio is low. In this case E1 increases the probability of H1 relative to H2 and of H2a relative to H2b but without eliminating the latter. In order to calculate the likelihood ratio we require the prior probabilities for H2a and H2b but these priors are needed in any case to calculate the posterior odds for either hypothesis.

4. The order in which evidence is given.

Logically, and in the classical exposition of Bayes Rule, of course it does not matter in what order we consider the evidence. There is some data (e.g. Kahneman, Slovic, and Tversky, 1982) on the psychology of juries which shows that a form of "anchoring" takes place. Early


evidence "fixes" the mind of the decider and later evidence does not change it as much as it should. So the order of evidence will have an effect, illogical though this might be. But quite apart from this effect, the hypothesis refinement process suggests that one sequence of evidence items may be more economical than another.

The uncontested evidence, call it En, should be considered first. This will lead quickly to a more specific alternative hypothesis. En itself will not distinguish between the prosecution hypothesis (H1) and this more specific alternative (H2n). In other words, the likelihood ratio of the uncontested evidence, En, for H1 and H2n will be 1.0 and the ratio

P(H1|EnI) / P(H2|EnI) = P(H1|I) / P(H2n|I).   (11)

Next, the jury should consider the evidence which most strongly distinguishes between hypotheses. This will cause the subdivision of H2n, parts of it being no longer worthy of serious consideration. Finally, the jury will be left with two very specific alternative hypotheses (between which the bulk of the evidence does not effectively distinguish) and a few items of evidence which will distinguish between them to give the ultimate probability ratio.

5. Bounding rationality

As soon as we consider evidence about which there is any doubt and as soon as we subdivide hypotheses and ignore those sub-hypotheses with low probabilities, the probability ratios we obtain are no longer true odds. We hope that they are a close approximation and this will be so provided the ignored hypotheses are separated by at least 10 db of evidence from the major ones (Jaynes, 1994). The calculation of true odds for a hypothesis requires summing the probabilities for the evidence over a possibly infinite number of alternative hypotheses. This is impossible and the use of Bayes' Theorem has been criticised precisely on the ground that it appears to require one to consider an infinite number of alternative hypotheses (Allen, 1991 and response by Friedman, 1992).

Formally, these probability ratios could become true odds if we include in the specification of the problem (in I) that the hypotheses to be considered are the only possible ones. Making this statement clear at least has the value that it may prompt consideration of whether there are any other feasible hypotheses.

In the notorious Australian "Dingo baby" case, (R v. Chamberlain (1984) 51 ALR 225 (CA)), in which Mrs Chamberlain was accused of killing her daughter, Azaria, the defence was that a dingo took the baby from the tent. Gibbs CJ and Mason J said "Once the possibility that one of the children killed Azaria is rejected, as it was by common agreement at the trial, only two possible explanations of the facts remain open - either a dingo took Azaria, or Mrs Chamberlain killed her. Therefore, if the jury were satisfied beyond reasonable doubt that a dingo did not kill the baby, they were entitled to accept the only other available hypothesis, that Mrs Chamberlain was guilty of murder. However it would have been unsafe for a jury to approach the case by asking "Are we satisfied that a dingo did not do it?" because that would have diverted attention from the evidence that bore on the critical issue - whether Mrs Chamberlain killed the baby."

Since the final sentence does not follow from the preceding reasoning, it seems that the real reason that the case was not proved "beyond reasonable doubt" is that there was a


sufficient probability that the death had occurred in some third way.

6. Conclusion

The approximations made inevitably sacrifice some information. This is due not to any shortcomings of Bayes' Theorem but to the fact that, as Friedman (1992) said, "the world is a complex place and our capacities are limited". The I, and hence the prior probability for any hypothesis, will contain, to quote I. J. Good (1950), "much that is half-forgotten". It is therefore not a criticism of the hypothesis refinement process that its result deviates from true odds by an unknown amount.

The hypothesis refinement process accommodates the criticisms outlined above and makes optimum use of available information.

References

[1] R. J. Allen, "On the significance of batting averages and strikeout totals: a clarification of the "naked statistical evidence" debate, the meaning of "evidence", and the requirement of proof beyond reasonable doubt," Tulane Law Review, 65, 1093-1110, 1991.

[2] R. D. Friedman, "Infinite Strands, Infinitesimally thin: Storytelling, Bayesianism, Hearsay and other evidence," Cardozo Law Review, 14, 79 - 101, 1992.

[3] I. J. Good, Probability and the weighing of evidence, Charles Griffin & Co, London, 1950.

[4] E. T. Jaynes, Probability theory - the logic of science, in draft 1994.

[5] D. Kahneman, P. Slovic, and A. Tversky, Judgement under uncertainty: heuristics and biases, Cambridge University Press, Cambridge, 1982.

[6] A. Ligertwood, "Inference as a Judicial Function," Reform of Evidence Conference, Society for the Reform of the Criminal Law, Vancouver, 1992.

[7] N. Pennington and R. Hastie, "A cognitive theory of juror decision making: the story model," Cardozo Law Review, 13, 519-574, 1991.

[8] J. H. Wigmore, Principles of Judicial Proof, Little, Brown and Co, Boston, 1913.


BAYESIAN DENSITY ESTIMATION

Sibusiso Sibisi, John Skilling University of Cambridge, Cavendish Laboratory Madingley Road, England CB3 0HE

ABSTRACT. We develop a fully Bayesian solution to the density estimation problem. Smoothness of the estimates f is incorporated through the integral formulation f(x) = ∫ dx′ φ(x′) K(x, x′) involving an appropriately smooth kernel function K. The analysis involves integration over the underlying space of densities φ. The key to this approach lies in properly setting up a measure on this space consistent with passage to the continuum limit of continuous x. With this done, a flat prior suffices to complete a well-posed definition of the problem.

1. Introduction

Given a set {xs , s = 1 ... N} of N iid observations (samples) drawn from an unknown but presumed continuous probability density f(x), non-parametric density estimation seeks to estimate the density without invoking a parametric form for it. This basic problem is of continuing interest in statistics. The major approaches are kernel density estimation (e.g. [10]; [8]) and penalized likelihood methods (e.g. [12]). Penalized likelihood for density estimation was explicitly introduced by Good and Gaskins [3] who adopted a Bayesian interpretation. In fact, a Bayesian approach to density estimation dates back to Whittle [14]. This interpretation is often ignored, the objective being to calculate no more than a single best estimate for f through maximization. In this case the log-prior is interpreted merely as a roughness penalty, hence maximum penalized likelihood (MPL).

Bayesian analysis starts by defining a hypothesis space, covering whatever parameters or distributions are needed to define the problem. These variables may be directly observable, or they may be latent variables controlling the observable quantities. Results are obtained as sums or integrals over the hypothesis space.

Thus the next step ought to be to define a measure on this space, so that integrals become properly defined, independently of the particular coordinates being used. The measure defines how densely points are to be placed within the space, and it is part of the prior structure of the problem. The remaining part is the prior probability function which additionally weights points in the space in accordance with whatever extra background information is available. We prefer to separate the concepts of measure and prior, because this enables insights which would be less obvious if measure and prior were conflated into a single prior defined over volume elements. Both measure and prior are, of course, to be assigned independently of the likelihood function which quantifies the effect of whatever data may eventually become available.

In density estimation, our objectives (always contingent upon some background information I) are:

1. to compute sufficiently many sample densities from the posterior Pr(f|{xs}, I) that the


statistics of any property of f can be determined to reasonable precision, and

2. to compute the prior predictive value Pr({xs}|I). We shall also refer to Pr({xs}|I) as the evidence for I. Given prior models A and B, the evidence ratio Pr({xs}|A)/Pr({xs}|B) is the Bayes factor in favour of model A relative to model B.

For these we need the measure, the prior Pr(f|I) and the likelihood Pr({xs}|f, I). Irrespective of I, the likelihood is

Pr({xs}|f, I) = ∏_{s=1}^{N} f(xs)   (1)

This relates to f only at the discrete points xs and (save in the unlikely event of a multiplicity of samples at exactly the same x) carries no information whatever on the indefinitely local microscopic structure of f. Yet we wish f to be smooth, ruling out in advance any estimates restricted solely to spikes at the data points. This smoothness requirement can only be incorporated through the prior ([14]).

2. Smoothness

We impose smoothness through an integral formulation

f(x) = ∫ dx′ φ(x′) K(x, x′)   (2)

Here the kernel K is an assigned smooth function, possibly having a few width and shape parameters. φ is defined solely by integral properties. Thus it is an underlying latent density controlling f, observed only indirectly via f. It is natural to require K and φ to be non-negative, and for K to be normalized over x. Then φ is also normalized, so belongs to Lebesgue class 1, and the kernel K endows f with the requisite smoothness.

φ bears an analogy to the coefficients of a finite mixture model for f (e.g. [6], [11], [13]). However, φ being an arbitrarily detailed density rather than a set of discrete coefficients, (2) may be interpreted as a nonparametric mixture model where φ plays the role of a full spectrum of arbitrarily many mixture coefficients.

If φ(x) is simply chosen to be the empirical distribution Σ_s δ(x − xs)/N comprising a sum of δ-functions centred at the observation points, (2) becomes the standard kernel density estimator f̂(x) = Σ_s K(x, xs)/N. With minor change of notation, the simplest and most commonly used case of a translation-invariant kernel gives

f̂(x) = (1/(NW)) Σ_s K( (x − xs)/W )   (3)

in which a single estimate f̂ having smoothness width W is computed. While f̂ may be readily computable, there is no defensible probabilistic estimate of its reliability.
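For concreteness, the classical estimator (3) can be sketched in a few lines; the Gaussian kernel, the synthetic data and the hand-picked width W below are all illustrative, and the fixed W is precisely the quantity left without a probabilistic assessment.

```python
import numpy as np

rng = np.random.default_rng(3)
samples = np.concatenate([rng.normal(-2.0, 0.5, 150), rng.normal(1.0, 1.0, 350)])

def kde(x, data, W):
    # f_hat(x) = (1/(N*W)) * sum_s K((x - x_s)/W), with K the standard normal pdf.
    u = (x[:, None] - data[None, :]) / W
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / W

xs = np.linspace(-5, 5, 401)
f_hat = kde(xs, samples, W=0.3)
print("integral of the estimate (should be close to 1):", np.sum(f_hat) * (xs[1] - xs[0]))
```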

Typically, K is both translation-invariant and symmetric, so that K(x, x′) ≡ K(|x − x′|). Such functions in multidimensional space are called radial basis functions. They have received much attention in multivariate function approximation (e.g. [7]). Some popular


choices are the multiquadrics, Gaussian and the two dimensional thin-plate splines. There is an intimate relationship between density estimation and function approximation (called interpolation if the data are exact and smoothing if they are noisy) where the data are the function values as opposed to samples from a probability distribution. Silverman [9] discusses smoothing splines in non-parametric function smoothing. The smoothing spline theme in the context of density estimation is pursued by Gu [4]. Adopting a Bayesian perspective, Silverman restricts the unknown f to functions that are linear combinations of B-splines having knots at the data abscissae and places a prior on their coefficients. However, Bayesian methodology does not thus incorporate the data into the prior formulation: after all, increasing numbers of data constraints ought to progressively restrict the space of possibilities rather than endlessly extend it.

In our approach, the task of inferring f is delegated to the inferral of φ, so we must assign φ a prior. But first we need to assign a measure on φ-space. Here, we anticipate the requirements of practical computation by writing (2) in matrix form f = KΦ with the abscissa partitioned into some potentially large number M of disjoint cells {Ci, i = 1...M}, and with the latent density decomposed into corresponding amounts

Φi = ∫_{Ci} φ(x) dx .   (4)

3. The Measure

The evidence and the posterior for f are both computed from the joint probability

Pr(Φ, {xs}|I) = Pr({xs}|Φ, I) Pr(Φ|I)   (5)

as Φ-integrals

Pr({xs}|I) = ∫ dμ Pr(Φ, {xs}|I)   (6)

Pr(f|{xs}, I) = ∫ dμ δ(f − KΦ) Pr(Φ, {xs}|I) / Pr({xs}|I)   (7)

Here, dμ = μ(Φ) dΦ is the element of Lebesgue measure which defines integration over Φ. Because the latent density incorporates no spatial correlations, its measure on M cells factorizes as μ(Φ) dΦ = ∏_i μi(Φi) dΦi.

The measure cannot be ignored. In order to combine two cells Ci and Cj (Ci ∩ Cj = ∅), with measures μi and μj respectively, into a single composite cell Ck = Ci ∪ Cj with measure μk, we use the mapping ℝ₊² → ℝ₊ defined by Φk = Φi + Φj. The domain A = {Φi, Φj : Φi + Φj < Ψ} maps to the interval I = {Φk : Φk < Ψ} so that the measures obey ∫_I dμk = ∫_A dμi dμj, i.e.

∫₀^Ψ dΦk μk(Φk) = ∫∫_{Φi+Φj<Ψ} dΦi dΦj μi(Φi) μj(Φj) .   (8)

Differentiating with respect to Ψ gives the Laplace convolution

μk(Ψ) = ∫₀^Ψ dΦ μi(Φ) μj(Ψ − Φ) .   (9)


Hence the Laplace transforms (denoted by ˜ , with transform variable s) of the measure densities multiply as

μ̃k(s) = μ̃i(s) μ̃j(s)   (10)

under cell combination and conversely factorise under cell division. If μi and μj were simply taken as constant, μk(Φ) would become proportional to its argument and non-constant, so that the measure cannot be constant in general.

The simplest viable assignment is to set some finite width Δx = w (or, technically, x-measure) at which the measure density μ(Φ; w) would be constant if the cells were constructed to be of that size. We take w to be independent of x, though one may note the obvious extension which allows w to vary. We do not take w → 0, which would force Φ to be constant. Neither do we take w → ∞, which tends to the Jeffreys measure O(dΦ/Φ) which cannot be normalized. Instead, we will take w to be an ordinary Bayesian parameter, which in most problems turns out to be of the order of the kernel width W.

A cell of width w has constant measure density, so ji,(s;w} ex s-l. A n-fold uniform subdivision of such a cell gives a uniform grid of micro cells of size ~ = win. Taking these microcells to be a priori equivalent, they must each be assigned measure transform density proportional to s-l/n. Recombining m of them gives a cell of width h = wmln having measure transform density proportional to s-h/w, corresponding to

J-t(<I>; h} ex <I>-l+h/w (11)

This argument defines a consistent measure on cells of arbitrary and possibly different widths. Overall, the integration measure on such cells, incorporating the normalization Σ_i Φ_i = 1, is

dμ = μ(Φ) δ(1 − Σ_k Φ_k) ∏_k dΦ_k    (12)

where

μ(Φ) = [Γ(Σ_k a_k) / ∏_k Γ(a_k)] ∏_k Φ_k^{a_k − 1},  with a_k = h_k/w    (13)

Normalization on μ has been chosen to ensure unit mass. We recognise (13) as the Dirichlet measure density, here given a natural derivation. As cells are subdivided in the approach to the continuum, the measure density approaches but does not attain the improper form ∏_k Φ_k^{-1}.

Our derivation of the Dirichlet measure has arisen purely from the Laplace property (10) of the measure on the hypothesis space. It is independent of the likelihood, which has not yet entered the formulation. Indeed, we have not yet defined the prior. The fact that the Dirichlet form happens to be the conjugate prior for the density estimation problem is entirely fortuitous. The Dirichlet exponents a_k have the explicit meaning of being the relative cell-widths h_k/w.
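The cell-combination rule (10) can be checked numerically. The sketch below is our own illustration, not from the paper: damping the unnormalized measure densities Φ^{a−1} by e^{−Φ} turns them into Gamma densities while leaving the convolution structure intact, so the sum of two independent Gamma variates with shapes a_i = h_i/w and a_j = h_j/w should again follow the power law with exponent a_i + a_j. The widths chosen and the Kolmogorov-Smirnov check are arbitrary.

```python
# Hypothetical check of the cell-combination property (eqs 8-10): the measure
# densities Phi^(h/w - 1) convolve to Phi^((h_i+h_j)/w - 1); damping by exp(-Phi)
# gives Gamma(h/w) variates obeying exactly the same convolution rule.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
w = 1.0                      # reference width at which the measure is "flat"
h_i, h_j = 0.3, 0.5          # two cell widths (assumed values for illustration)

a_i, a_j = h_i / w, h_j / w  # exponents a_k = h_k / w
phi_i = rng.gamma(a_i, size=200_000)
phi_j = rng.gamma(a_j, size=200_000)
phi_k = phi_i + phi_j        # combined cell mass

# The combined mass should be Gamma with shape a_i + a_j, i.e. the measure of a
# cell of width h_i + h_j; a KS test against that distribution should not reject.
ks = stats.kstest(phi_k, stats.gamma(a_i + a_j).cdf)
print(f"KS statistic = {ks.statistic:.4f}, p-value = {ks.pvalue:.3f}")
```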


4. The Prior

A prior is a pointwise assignment of probability over a defined space with defined measure. As a matter of general policy, we suggest that it ought only to incorporate potentially observable quantities, for which the values could ultimately be tested by measurement. In the present case, we require the prior to depend only on integrals of φ, possibly multilinear to include correlation.

By construction though, smoothness is included through K and not φ. Hence the prior on φ does not include any spatial correlations, so only linear integrals are allowed. Only one such integral is translation-invariant, namely ∫ φ(x) dx = Σ_i Φ_i, and this is already known to be 1. Hence we assign the flat prior

Pr(Φ|I) = 1    (14)

which is correctly normalized to ∫ dμ Pr(Φ|I) = 1 and conveniently simple. The combination of Dirichlet measure and flat prior yields the well-known Dirichlet process, suggested earlier by Ferguson ([1], [2]) on grounds of formal convenience.

Towards the continuum limit of indefinitely many cells, random samples from this prior consist largely of a shimmering ocean of exponentially small values. For small h, the median per cell is 2^{-w/h}. Just occasionally, the ocean throws up a substantive outlier, about one per width w in each factor-of-e range in Φ. Fig 2 shows the evolution of a sample diffusing randomly through the prior. To any desired accuracy, any typical sample of Φ can be encoded as a finite set of point masses even in the continuum limit. For example, 99% of the overall probability shown in fig 1 is contained in the strongest 68 cells, even though the grid was arbitrarily fine, having over 1 million cells. For this reason, we propose to call our approach the Massive Inference technique.
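The point-mass character of these prior samples is easy to reproduce. The following sketch is our own (not the authors' algorithm); the grid size and the ratio of total width to w are assumed for illustration. It draws one sample from the Dirichlet measure with equal exponents a = h/w and reports how many of the strongest cells are needed to hold 99% of the mass.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
M = 1_000_000                  # number of cells (arbitrary, "close to the continuum")
total_width_over_w = 1.5       # assumed: total abscissa width divided by w
a = total_width_over_w / M     # per-cell Dirichlet exponent a = h/w

# Sample Gamma(a) variates in log space, using Gamma(a) = U^(1/a) * Gamma(a+1) in law,
# which avoids the underflow that direct sampling suffers for tiny shape a.
log_g = np.log(rng.random(M)) / a + np.log(rng.gamma(a + 1.0, size=M))
log_Phi = log_g - logsumexp(log_g)          # normalise so that sum(Phi) = 1
Phi = np.exp(log_Phi)

sorted_mass = np.sort(Phi)[::-1]
n99 = int(np.searchsorted(np.cumsum(sorted_mass), 0.99)) + 1
print(f"{n99} of {M} cells carry 99% of this prior sample's mass")
```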

5. Examples

The two datasets we discuss here are listed in [8]. In order to process them, we need background information I consisting in part of the kernel K to be used. Typically, the kernel is assigned a specified shape parameterized by an unknown width W. According to the strict Bayesian paradigm, the evidence value then involves marginalization over W using a hyperprior on W. For small datasets, though, evidence values tend to be sensitive to the choice of this hyperprior. In the interests of objective comparisons between different model kernel shapes, we wish to avoid this extra level of subjectivity. Accordingly, we adopt the empirical Bayes approach of determining the optimal width W by maximizing the evidence Pr({x_s}|W, K) over W. Our formulation also involves a measure width w, which we optimize as well. Thus, for a given kernel shape, our inferences will be conditioned upon W and w.
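Schematically, the empirical Bayes step is just a maximization of the evidence over the two widths. The sketch below is our own illustration of that outer loop; log_evidence is a stand-in (here a dummy smooth function) for the MCMC evidence computation the authors use, and the grid ranges are arbitrary.

```python
import itertools
import numpy as np

def log_evidence(W, w, data):
    # Placeholder: in the paper this is Pr({x_s} | W, w, K), computed by MCMC.
    # A dummy smooth function is used here so the sketch runs end to end.
    return -((np.log(W) - np.log(20.0)) ** 2 + (np.log(w) - np.log(90.0)) ** 2)

data = np.arange(63.0)                   # stand-in for the 63 snowfall observations

W_grid = np.geomspace(5.0, 60.0, 16)     # kernel FWHM candidates (arbitrary range)
w_grid = np.geomspace(20.0, 300.0, 16)   # measure width candidates (arbitrary range)

best = max(itertools.product(W_grid, w_grid),
           key=lambda Ww: log_evidence(Ww[0], Ww[1], data))
print("empirical Bayes choice: W = %.1f, w = %.1f" % best)
```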

For a single "result", it is natural to display the mean of the posterior Pr(f|{x_s}, W, w), along with its associated pointwise error-bars. However, a static display of this nature cannot capture the point-to-point covariance structure of the posterior. A more faithful representation is to display a number of random samples from the posterior. For ease of visualization, it seems even better to display a sample diffusing randomly through the posterior. We shall use this representation.

For both datasets, we use a wraparound-periodic Gaussian kernel parameterized by full-width-half-maximum W, on a grid sufficiently close to the continuum limit that yet finer grids induce negligible change in the results. In each case, 128 cells sufficed. The calculations were performed with a continuous time Markov chain Monte Carlo algorithm, currently being prepared for publication.
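As a concrete illustration of the setup (our own sketch, not the authors' code), the wraparound-periodic Gaussian kernel on an M-cell grid can be assembled as a circulant matrix K, so that the observable density is f = KΦ. The FWHM-to-sigma conversion is the usual Gaussian relation, and the 128-cell grid follows the figure quoted above; the FWHM value is an assumption.

```python
import numpy as np

def periodic_gaussian_kernel(M=128, fwhm_cells=10.0):
    """Circulant matrix K for a wraparound-periodic Gaussian blur on M cells."""
    sigma = fwhm_cells / (2.0 * np.sqrt(2.0 * np.log(2.0)))   # FWHM -> sigma
    d = np.arange(M)
    d = np.minimum(d, M - d)                  # wraparound (circular) distance
    row = np.exp(-0.5 * (d / sigma) ** 2)
    row /= row.sum()                          # each column of K then integrates to 1
    return np.array([np.roll(row, r) for r in range(M)])   # row r = profile shifted by r

K = periodic_gaussian_kernel()
Phi = np.zeros(128); Phi[[20, 70, 71]] = [0.5, 0.3, 0.2]   # a few point masses
f = K @ Phi                                  # smoothed density f = K Phi
print(f.sum())                               # ~1, since Phi sums to 1 and K is normalised
```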

5.1. Buffalo Snowfall Data

This dataset consists of 63 measurements of annual inches of snowfall in Buffalo. These observations are included in fig 3 as impulses in the range 0 to 150 inches/year. Each unit impulse represents a single observation.

The optimal widths are found to be W = 24 in/yr and w = 100 in/yr, corresponding to an evidence of Pr({x_s}|W, w) = 2 × 10^{-128} (in/yr)^{-63}. The mean of the posterior is displayed in fig 3 with corresponding pointwise 1-sigma error-bars. Fig 4 shows the diffusion of a sample from this empirical Bayes posterior. This is a much more instructive display. About 20% of the time the sample is unimodal, 40% of the time it is bimodal and it is trimodal in the remaining 40%. Thus the shape of the typical sample is quite variable.

This dataset has received considerable attention in the density estimation literature. As noted by Izenman ([5]) in a review of density estimation approaches, a trimodal density is usually regarded as the most reasonable solution. Our analysis suggests that bimodal or unimodal estimates are not unreasonable. Gu [4] also inclines towards a unimodal estimate.

Scott ([8]) also obtains estimates which are in reasonable agreement with this conclusion. Using two variants of cross-validation (CV), he obtains kernel widths giving one unimodal estimate and another with two gentle satellite bumps. In the latter case Scott's CV statistic has only weak discrimination against larger widths.

5.2. Old Faithful Eruption Data

This dataset consists of 107 measurements of the duration of eruptions of the Old Faithful geyser. Fig 5 includes an impulse plot of the data in the range 1 to 6 minutes. There being more data in this example, the optimal widths are more tightly determined at W = 0.3 mins and w = 1.0 mins. The corresponding evidence is 5 × 10^{-51} (min)^{-107}. The mean is shown in fig 5 along with its pointwise error-bars. The diffusing sample is shown in fig 6. From this, we observe substantial variation in the valley from 2.2 to 3.2 minutes, indicating that the two gentle bumps shown in the mean estimate are not reliable.

6. Conclusion

We have developed a fully Bayesian solution to the density estimation problem. Our first step was to set up a space of non-negative latent densities φ from which the observed densities f derive via linear integrals. All Bayesian calculations require integrals over this hypothesis space, so that a measure must be assigned there. This measure cannot be simply taken to be constant, and we derive a Dirichlet form. Only after a measure has been assigned is it meaningful to discuss a prior on the hypothesis space. As a matter of policy, we hold that the prior should only involve quantities which might in principle be observable, which here restricts us to using linear integrals over φ. In fact, in this work we simply took a flat prior.

Because we compute a whole probability distribution over estimates, we have sufficient structure to answer quantitative questions about the estimate and its reliability. A display of a diffusive sample through the posterior is a useful means of visualizing the point-to-point variability of the estimate. Finally, we believe that in estimation problems the overall evidence value for the model used should always be reported, to await objective comparison with alternative models.

While we have focussed on the density estimation problem, extension to function approximation for applications such as imaging is straightforward.

ACKNOWLEDGMENTS. We thank Prof. P. Whittle and Dr I. G. Craw for helpful advice.

References

[1] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209-230.

[2] Ferguson, T. S. (1974). Prior distributions on spaces of probability measures. Ann. Statist. 2, 615-629.

[3] Good, I. J. and Gaskins, R. A. (1971). Nonparametric roughness penalties for probability densities. Biometrika, 58, 255-277.

[4] Gu, C. (1993). Smoothing spline density estimation: a dimensionless automatic algorithm. J. Amer. Statist. Ass., 88, 495-504.

[5] Izenman, A. J. (1991). Recent developments in nonparametric density estimation. J. Amer. Statist. Ass., 86, 205-224.

[6] Neal, R. M. (1992). Bayesian mixture modelling. In Maximum Entropy and Bayesian Methods (eds C. R. Smith, G. J. Erickson and P. O. Neudorfer), 197-211, Dordrecht: Kluwer.

[7] Powell, M. J. D. (1987). Radial basis functions for multivariable interpolation: a review. In Algorithms for Approximation (eds J. C. Mason and M. G. Cox), 143-167, Oxford: Clarendon.

[8] Scott, D. W. (1992). Multivariate Density Estimation. New York: Wiley.

[9] Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. J. R. Statist. Soc. B, 47, 1-52.

[10] Silverman, B. W. (1986). Density Estimation. London: Chapman and Hall.

[11] Smith, A. F. M. and Makov, U. E. (1978). A quasi-Bayes sequential procedure for mixtures. J. R. Statist. Soc. B, 40, 106-112.

[12] Thompson, J. R. and Tapia, R. A. (1990). Nonparametric Function Estimation, Modeling and Simulation. Philadelphia: SIAM.

[13] West, M. (1992). Modelling with mixtures. In Bayesian Statistics 4 (eds J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith), 503-524, Oxford: Clarendon.

[14] Whittle, P. (1958). On the smoothing of probability density functions. J. R. Statist. Soc. B, 20, 334-343.


Figure 1: Typical prior sample of φ plotted on a linear scale (above) and as base 10 logarithms (below)

Figure 2: Evolution of a sample from the Dirichlet prior


Figure 3: Buffalo Snowfall data and density estimate with error bars.

Figure 4: Buffalo Snowfall posterior random sample diffusing


Figure 5: Old Faithful data and density estimate with error bars

Figure 6: Old Faithful posterior random sample diffusing


SCALE INVARIANT MARKOV MODELS FOR BAYESIAN INVERSION OF LINEAR INVERSE PROBLEMS

Stéphane Brette, Jérôme Idier and Ali Mohammad-Djafari
Laboratoire des Signaux et Systèmes (CNRS-ESE-UPS), École Supérieure d'Électricité, Plateau de Moulon, 91192 Gif-sur-Yvette Cedex, France

ABSTRACT. In a Bayesian approach for solving linear inverse problems one needs to specify the prior laws for calculation of the posterior law. A cost function can also be defined in order to have a common tool for various Bayesian estimators which depend on the data and the hyperparameters. The Gaussian case excepted, these estimators are not linear and so depend on the scale of the measurements. In this paper a weaker property than linearity is imposed on the Bayesian estimator, namely the scale invariance property (SIP).

First, we state some results on linear estimation and then we introduce and justify a scale invariance axiom. We show that the arbitrary choice of scale measurement can be avoided if the estimator has this SIP. Some examples of classical regularization procedures are shown to be scale invariant. Then we investigate general conditions on classes of Bayesian estimators which satisfy this SIP, as well as their consequences on the cost function and prior laws. We also show that classical methods for hyperparameter estimation (i.e., Maximum Likelihood and Generalized Maximum Likelihood) can be used here, and we verify the SIP property for them.

Finally we discuss how to choose the prior laws to obtain scale invariant Bayesian estimators. For this, we consider two cases of prior laws: entropic prior laws and first-order Markov models. In related preceding works [1, 2], the SIP constraints have been studied for the case of entropic prior laws. In this paper the extension to the case of first-order Markov models is provided.

KEY WORDS: Bayesian estimation, Scale invariance, Markov modelling, Inverse problems, Image reconstruction, Prior model selection

1. Introduction

The linear inverse problem is a common framework for many different objectives, such as reconstruction, restoration, or deconvolution of images arising in various applied areas [3]. The problem is to estimate an object x which is indirectly observed through a linear operator A; the observation is therefore noisy. We choose this linear model explicitly because its simplicity captures many of the interesting features of more complex models without their computational complexity. Such a degradation model allows the following description:

y = Ax + b,    (1)

where b includes both the modeling errors and the unavoidable noise of any physical observation system, and A represents the indirect observing system and depends on the particular application. For example, A can be diagonal or block-diagonal in deblurring, Toeplitz or block-Toeplitz in deconvolution, or have no special form of interest as in X-ray tomography.

In order to solve these problems, one may choose to minimize the quadratic residual error ||y − Ax||². That leads to the classical linear system


A^t A x̂ = A^t y.    (2)

When mathematically exact solutions exist, they are too sensitive to unavoidable noise and so are not of practical interest. This fact is due to a very high condition number of A [3]. In order to have a solution of interest, we must mathematically qualify admissible solutions.
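A small numerical sketch (ours, with an arbitrary toy operator) makes the point: for an ill-conditioned A, the exact least-squares solution of the normal equations amplifies even tiny noise, whereas a crudely regularized solution of the kind qualified below stays stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
# Toy ill-conditioned operator: a Gaussian smoothing (convolution-like) matrix.
t = np.arange(n)
A = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 1.5) ** 2)
print("condition number of A: %.2e" % np.linalg.cond(A))

x_true = np.zeros(n); x_true[[10, 25]] = [1.0, 0.7]
y = A @ x_true + 1e-3 * rng.standard_normal(n)                # tiny observation noise b

x_ls = np.linalg.solve(A.T @ A, A.T @ y)                      # normal-equation solution
x_reg = np.linalg.solve(A.T @ A + 1e-2 * np.eye(n), A.T @ y)  # crude regularization

print("|| x_ls  - x_true || =", np.linalg.norm(x_ls - x_true))   # huge: noise amplified
print("|| x_reg - x_true || =", np.linalg.norm(x_reg - x_true))  # modest
```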

The Bayesian framework is well suited for this kind of problem because it can combine information from the data y and prior knowledge on the solution. One then needs to specify the prior laws p_x(x; λ) and p_b(y − Ax; ψ) for calculation of the posterior p_{x|y}(x|y) ∝ p_x(x) p_b(y − Ax) with the Bayes rule. Most of the classical Bayesian estimators, e.g., Maximum a posteriori (MAP), Posterior Mean (PM) and Marginal MAP (MMAP), can be studied using the common tool of defining a cost function C(x*, x) for each of them. It leads to the classical Bayesian estimator

x̂(y; θ) = arg min_{x*} { ∫ C(x*, x) p(x|y; θ) dx },    (3)

depending both on the data y and the hyperparameters θ.

Choosing a prior model is a difficult task. This prior model should encode our prior knowledge. Some criteria based on information theory and the maximum entropy principle have been used for that. For example, when our prior knowledge consists of the moments of the image to be restored, application of the maximum entropy principle leads DJAFARI & DEMOMENT [4] to exact determination of the prior, including its parameters. Knowledge of the bounds (a gabarit) and the choice of a reference measure leads LE BESNERAIS [5, 6] to the construction of a model accounting for human-shaped prior information in the context of astronomical deconvolution.

We consider the case when there is no important quantitative prior information such as knowledge of moments or bounds of the solution. Then we propose to reduce the arbitrariness of the choice of prior model by applying a constraint to the resulting Bayesian estimator. The major constraint for the estimator is to be scale invariant, that is, whichever scale or physical unit we choose, estimation results must be identical. This desirable property will reduce the possible choice of prior models and make it independent of the unavoidable scale choice. In this sense, related works of JAYNES [7] or BOX & TIAO [8] on non-informative priors are close to our statement, although in these works the ignorance is not limited to the measurement scale. In our work, only qualitative information is supposed to be known (positivity excepted), so we think of choosing a parametric family of probability laws as a usual and natural way of accounting for the prior. The parameter estimation in the chosen family of laws will be done according to the data, with a Maximum Likelihood (ML) or Generalized Maximum Likelihood (GML) approach. These approaches are shown in this paper to be scale invariant.

One can criticize choosing the prior law from a desired property of the final estimator rather than from the available prior knowledge. We do not claim to have exactly chosen a model but just to restrict the available choice. Indeed, the popularity of Gaussian or convex priors is likely due to the tractability of the associated estimator rather than to the Gaussianity or convexity of the modeled process. Lastly, however good the model is, its use depends on the tradeoff between the good behavior of the final estimator and the quality of the estimation.

The paper is organized as follows. First, we state some known results on Gaussian estimators and we introduce and justify the imposition of the scale invariance property (SIP) onto the estimator. This will be done in section 2 with various examples of scale invariant models. In section 3 we prove a general theorem for a Bayesian estimator to be scale invariant. This theorem states a sufficient condition on the prior laws which can be used for reducing the choice to admissible priors. For this, we consider two cases of prior laws: entropic prior laws and first-order Markov models. In related preceding works [1, 2], the SIP constraints have been studied for the case of entropic prior laws. In this paper we extend that work to the case of first-order Markov models.

2. Linearity and scale invariance property

In order to better understand the scale invariance property (SIP), in the next subsection we consider in detail the classical case of linear estimators. First, let us define linearity as the combination of additivity:

∀(y_1, y_2),  y_1 ↦ x̂_1 and y_2 ↦ x̂_2  ⟹  y_1 + y_2 ↦ x̂_1 + x̂_2,    (4)

and the scale invariance property (SIP):

∀y,  y ↦ x̂  ⟹  ∀k,  ky ↦ k x̂.    (5)

Linearity includes the SIP and so is a stronger property. We show in a particular case how the SIP is satisfied in these linear models.

2.1. Linearity and Gaussian assumptions

Linear estimators under Gaussian assumptions have been (and probably still are) the most studied Bayesian estimators because they lead to an explicit estimation formula. In a similar way their practical interest is due to their easy implementation, such as Kalman filtering. In all these cases, prior laws have the following form:

p_x(x) ∝ exp( −(1/2) (x − m_x)^t Σ_x^{-1} (x − m_x) ),    (6)

whereas the conditional additive noise is often a zero-mean Gaussian process N(0, Σ_b). Minimization of the posterior likelihood for all three classical cost functions MAP, PM and MMAP is the same as that of a quadratic form. It leads to the general form of the solution:

x̂ = (A^t Σ_b^{-1} A + Σ_x^{-1})^{-1} (A^t Σ_b^{-1} y + Σ_x^{-1} m_x),    (7)

which is a linear estimator. Some particular cases follow:

• Case where Σ_x^{-1} = 0 and Σ_b = σ_b² I. This can be interpreted as a degenerate uniform prior on the solution. The solution is the minimum variance one and is rarely suitable due to the high condition number of A.

• Case where Σ_b = σ_b² I and Σ_x = σ_x² I. This leads to the classical Gaussian inversion formula:

x̂ = (A^t A + μ^{-1} I)^{-1} (A^t y + μ^{-1} m_x),  with μ = σ_x²/σ_b².    (8)


The signal-to-noise ratio (SNR) μ = σ_x²/σ_b² appears explicitly and serves as a scale invariant parameter. It therefore plays the meaningful role of a hyperparameter.

• The Gauss-Markov regularization case, which considers a smooth prior for the solution, is specified by setting Σ_x^{-1} = μ D^t D + σ_x^{-2} I, with D a discrete difference matrix.

For all these cases, the estimate x̂ depends on a scale. Let us look at this dependence. For that matter, suppose that we change the measurement scale. For example, if both x and y are optic images where each pixel represents the illumination (in lumen) onto the surface of an optical device, we measure the number of photons coming into this device. (This could be of practical interest for X-ray tomography.) Then we convert y into the new chosen scale and simultaneously update our parameters Σ_x, Σ_b and m_x. The estimation formula is then given by

x̂_k = (A^t k^{-2}Σ_b^{-1} A + k^{-2}Σ_x^{-1})^{-1} (A^t k^{-2}Σ_b^{-1} ky + k^{-2}Σ_x^{-1} k m_x),    (9)

or, canceling the scale factor k:

x̂_k = (A^t Σ_b^{-1} A + Σ_x^{-1})^{-1} (A^t Σ_b^{-1} ky + Σ_x^{-1} k m_x) = k x̂.    (10)

Thus, if we take care of the hyperparameters, the two restored images are physically the same. This property is rarely stated in the Gaussian case, which can be explained by the use of the SNR as a major tool of reasoning: if we set the SNR, then x̂_k and k x̂ are equal. In many cases Gaussian assumptions are fulfilled, often leading to fast algorithms for calculating the resulting linear estimator. We focus on the case where Gaussian assumptions are too strong. This is the case when Gauss-Markov models are used, leading to smoother restorations than wanted. This might be explained by the short probability distribution tails, which make discontinuities rare and which prevent wide homogeneous areas from appearing in the restored image.
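The scale bookkeeping in (9)-(10) is easy to verify numerically. The sketch below is our own, with arbitrary dimensions and hyperparameter values: it evaluates the linear estimator (7) on data y and on rescaled data ky with Σ_b → k²Σ_b, Σ_x → k²Σ_x and m_x → k m_x, and checks that the two estimates differ exactly by the factor k.

```python
import numpy as np

rng = np.random.default_rng(0)
n_y, n_x, k = 12, 8, 7.3

A = rng.standard_normal((n_y, n_x))
y = rng.standard_normal(n_y)
m_x = rng.standard_normal(n_x)
Sigma_b = 0.5 * np.eye(n_y)          # noise covariance (assumed values)
Sigma_x = 2.0 * np.eye(n_x)          # prior covariance (assumed values)

def gaussian_map(y, A, Sigma_b, Sigma_x, m_x):
    """Linear estimator of eq. (7)."""
    Sb_inv, Sx_inv = np.linalg.inv(Sigma_b), np.linalg.inv(Sigma_x)
    H = A.T @ Sb_inv @ A + Sx_inv
    return np.linalg.solve(H, A.T @ Sb_inv @ y + Sx_inv @ m_x)

x_hat = gaussian_map(y, A, Sigma_b, Sigma_x, m_x)
# Rescale the data and update the hyperparameters consistently with the new scale.
x_hat_k = gaussian_map(k * y, A, k**2 * Sigma_b, k**2 * Sigma_x, k * m_x)

print(np.allclose(x_hat_k, k * x_hat))   # True: the Gaussian estimator satisfies the SIP
```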

2.2. Scale invariance basics

Although the particular case considered above may appear obvious, it is at the base of the scale invariance axiom. In order to estimate or to compare physical parameters, we must choose a measurement scale. This can have a physically meaningful unit or only a grey-level scale in computerized optics. Anyway, we have to keep in mind that a physical unit or scale is just a practical but arbitrary tool, both common and convenient. As a consequence of this remark we state the following axiom of scale invariance:

Estimation results must not depend on the arbitrary choice of the scale measurement.

This is true when the scale measurement depends on time exposure (astronomical observations, positron emission tomography, X-ray tomography, etc.). Estimation results with two different values of time exposure must be coherent. SIP is also of practical interest when exhaustive tests are required for validation.

Let us have a look at some regularized criteria for Bayesian estimation. In all the cases, the MAP criterion is used, and the estimators take the following form:

x̂(y; ψ, λ) = arg min_x { −log p_b(y − Ax; ψ) − log p_x(x; λ) }.    (11)


Lp-norm estimators: The general form of these criteria involves an Lp-norm rather than a quadratic norm. Then, the noise models and prior models take the following form:

p_b(y − Ax; ψ) ∝ exp( −ψ ||y − Ax||_p^p )    (12)

and

p_x(x; λ) ∝ exp( −λ ||Mx||_q^q )    (13)

where M can be a difference matrix as used by BOUMAN & SAUER and BESAG in the Generalized Gauss-Markov models [9] and L1-Markov models [10]. Finally, with q = 1 and M an identity matrix, this leads to an L1-deconvolution algorithm in the context of seismic deconvolution [11].

According to the scale transformation x ↦ kx and y ↦ ky, the models change in the following way:

p_b(ky − A(kx); ψ) ∝ exp( −k^p ψ ||y − Ax||_p^p )    (14)

and

p_x(kx; λ) ∝ exp( −k^q λ ||Mx||_q^q ).    (15)

If we set (ψ_k, λ_k) = (k^{-p} ψ, k^{-q} λ), the two estimates are scale invariant. Moreover, if p = q, we can drop the scale k in the MAP criterion (eq. 11), which becomes scale invariant. This is done in [9] [11], but it makes the choice of the prior and the noise models mutually dependent. We can also remark that ψ^q / λ^p is scale invariant and can be interpreted as a generalized SNR.
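Under the parameterization assumed in the reconstruction of (12)-(15) above, the invariance can be checked numerically: minimize the Lp-norm MAP criterion for y and then for ky with (ψ_k, λ_k) = (k^{-p}ψ, k^{-q}λ), and compare the minimizers. The sketch below is our own; the problem sizes, p, q and the optimizer are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n_y, n_x, k = 6, 4, 5.0
p, q, psi, lam = 1.5, 1.5, 1.0, 0.3
A = rng.standard_normal((n_y, n_x))
M = np.diff(np.eye(n_x), axis=0)             # first-difference matrix
y = rng.standard_normal(n_y)

def criterion(x, y, psi, lam):
    # MAP criterion of eq. (11) with the Lp models (12)-(13)
    return psi * np.sum(np.abs(y - A @ x) ** p) + lam * np.sum(np.abs(M @ x) ** q)

opts = {"xatol": 1e-10, "fatol": 1e-12, "maxiter": 20000}
x_hat = minimize(criterion, np.zeros(n_x), args=(y, psi, lam),
                 method="Nelder-Mead", options=opts).x
x_hat_k = minimize(criterion, k * x_hat, args=(k * y, k**-p * psi, k**-q * lam),
                   method="Nelder-Mead", options=opts).x

print(np.allclose(x_hat_k, k * x_hat, atol=1e-3))   # True up to optimizer tolerance
```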

Maximum Entropy methods: Maximum Entropy reconstruction methods have been extensively used in the last decade. A helpful property of these methods is the positivity of the restored image. In these methods, the noise is considered zero-mean Gaussian N(0, Σ_b), while the log-prior takes different forms which look like an "entropy measure" of BURG or SHANNON. Three different forms which have been used in practical problems are considered below.

• First, in a Fourier synthesis problem, WERNECKE & D'ADDARIO [12] used the following form:

p_x(x; λ) ∝ exp[ −λ Σ_i log x_i ].    (16)

Changing the scale in this context just modifies the partition function, which is not important in the MAP criterion (eq. 11). As the noise is considered Gaussian, these authors show that if we update the λ parameter in a proper way (λ_k = k²λ), then the ME reconstruction maintains linearity with respect to the measurement scale k. Thus, this ME solution is scale invariant, although nonlinear.

• In image restoration, BURCH et al. [13] consider a prior law of the form

p_x(x; λ) ∝ exp[ −λ Σ_i x_i log x_i ].    (17)


Applying our scale change yields:

p_x(kx; λ) ∝ exp[ −λ k Σ_i x_i log x_i − λ k log k Σ_i x_i ],    (18)

which does not satisfy the scale invariance property due to the k log k Σ_i x_i term. It appears from their later papers that they introduced a data pre-scaling before the reconstruction. Then, the modified version of their entropy becomes

p_x(x; λ, s) ∝ exp[ −λ Σ_i x_i log (x_i / s) ],    (19)

where s is the pre-scaling parameter.

• Modification of the above expression with natural parameters for the exponential family leads to the "entropic laws" used later by GULL & SKILLING [14] and DJAFARI [15]:

p_x(x; λ) ∝ exp[ −λ_1 Σ_i x_i log x_i − λ_2 Σ_i x_i ].    (20)

The resulting estimator is scale invariant for the reasons stated above.

Markovian models: A new Markovian model [16] has appeared from I-divergence considerations on small translations of an image in the context of astronomical deconvolution. This model can be rewritten as a Gibbs distribution in the following form:

p_x(x; λ) ∝ exp[ −λ Σ_{(s,r)∈C} (x_s − x_r) log (x_s / x_r) ].    (21)

If we change the scale of the measurement, the scale factor k vanishes in the logarithm, and

p_x(kx; λ) ∝ exp[ −kλ Σ_{(s,r)∈C} (x_s − x_r) log (x_s / x_r) ].    (22)

Thus this particular Markov random field leads to a scale invariant estimator if we update the parameter λ so as to keep λσ_b constant (the noise is assumed Gaussian and independent). In the same way as in the Lp-norm example, λσ_b can be considered as a generalized SNR.

These examples show that the family of scale invariant laws is not a duck-billed platypus family. It includes many priors already employed in the context of image estimation. We have shown in related work that other scale invariant prior laws exist, both in the Markovian prior family [17] and in the uncorrelated prior family [2].

3. Scale invariant Bayesian estimator

Before further developing the scale invariance constraint for the estimator, we want to emphasize the role of the hyperparameters θ (i.e., the parameters of the prior laws) and to sketch their estimation from the data, which is very important in real-world applications. The estimation problem is considered globally. By globally we mean that, although we are interested in the estimation of x, we also want to take into account the estimation of the hyperparameters θ.


Scheme 1: Global scale invariance property for an estimator

To summarize the SIP of an estimator, we illustrate it by Scheme 1 above.

For more detail, let us define a scale invariant estimator in the following way:

Definition 1 An estimator x̂(y; θ) is said to be scale invariant if there exists a function θ_k = f_k(θ) such that

∀(y, θ, k > 0),  x̂(ky; θ_k) = k x̂(y; θ),    (23)

or in short

y ↦ x̂  ⟹  ∀k > 0,  ky ↦ k x̂.    (24)

In this paper, we focus only on priors which admit density laws. We then define the scale invariance property for those laws as follows:

Definition 2 A probability density function p_u(u; θ) [resp., a conditional density p_{u|v}(u|v; θ)] is said to be scale invariant if there exists a function θ_k = f_k(θ) such that

∀(u, θ, k > 0),  p_u(ku; θ_k) = k^{-N} p_u(u; θ)    (25)

[resp., ∀(u, θ, k > 0),  p_{u|v}(ku|kv; θ_k) = k^{-N} p_{u|v}(u|v; θ)], where N = dim(u).

If f_k = Id, i.e., if θ_k = θ, then p_u(u; θ) is said to be strictly scale invariant.

The above property specifies that these density laws belong to a family of laws which is closed under scale transformations. Thus, in this class, a set of pertinent parameters exists for each chosen scale.
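As a minimal concrete instance (our own example, not from the paper), the centred Gaussian family N(0, σ²) is scale invariant in the sense of Definition 2 with f_k(σ) = kσ, since p(ku; kσ) = k^{-1} p(u; σ). A short numerical check:

```python
import numpy as np
from scipy.stats import norm

u, sigma, k = np.linspace(-3, 3, 7), 1.7, 4.2
lhs = norm.pdf(k * u, scale=k * sigma)        # p(ku; sigma_k) with sigma_k = k * sigma
rhs = norm.pdf(u, scale=sigma) / k            # k^(-N) p(u; sigma), here N = 1
print(np.allclose(lhs, rhs))                  # True: the family satisfies eq. (25)
```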

We also need to state two properties of scale invariant density laws. Both concern the conservation of the SIP, one after marginalization, the other after application of the Bayes rule.


Lemma 1 If p_{x,y}(x, y; θ) is scale invariant, then the marginalized p_y(y; θ) is also scale invariant.

Lemma 2 If p_x(x; λ) and p_{y|x}(y|x; ψ) are scale invariant, then the joint law p_{x,y}(x, y; λ, ψ) is also scale invariant.

Proofs are straightforward and are found in Appendix A. Using these two definitions, we prove the following theorem which summarizes sufficient conditions for an estimator to be scale invariant:

Theorem 1 If the cost function C(x*, x) of a Bayesian estimator satisfies the condition: there exist a_k and b_k > 0 such that

∀k > 0, ∀(x*, x),  C(kx*, kx) = a_k + b_k C(x*, x),    (26)

and if the posterior law is scale invariant, i.e., there exists a function θ_k = f_k(θ) such that

∀k > 0, ∀(x, y),  p(kx|ky; θ_k) = k^{-dim(x)} p(x|y; θ),    (27)

then the resulting Bayesian estimator is scale invariant, i.e.,

x̂(ky; θ_k) = k x̂(y; θ).    (28)

See Appendix B for the proof. It is also shown there that the cost functions of the three classical Bayesian estimators, i.e., MAP, PM and MMAP, satisfy the first constraint.

Remark: In this theorem, the SIP is applied to the posterior law p(x|y; θ). However, we can separate the hyperparameters θ into two sets λ and ψ, where λ and ψ are the parameters of the prior laws p_x(x; λ) and p_b(y − Ax; ψ). In what follows, we want to make the choices of p_x and p_b independent. From Lemmas 1 and 2, if p_x and p_b satisfy the SIP then the posterior p(x|y; θ) satisfies the SIP. As a consequence θ_k must be separated according to θ_k = [λ_k, ψ_k] = [g_k(λ), h_k(ψ)].

4. Hyperparameters estimation

In the above theorem, we assumed that the hyperparameters θ are given. Thus, given the data y and the hyperparameters θ, we can calculate x̂. Now, if the scale factor k of the data has been changed, we first have to update the hyperparameters [18] according to θ_k = f_k(θ), and then we can use the SIP:

x̂(ky; θ_k) = k x̂(y; θ).    (29)

Now, let us see what happens if we have to estimate both x and θ, either by Maximum Likelihood or by Generalized Maximum Likelihood.

• The Maximum Likelihood (ML) method first estimates θ by

θ̂ = arg max_θ { L(θ) },    (30)

where

L(θ) = p(y; θ),    (31)

and then θ̂ is used to estimate x̂. At a scale k,

θ̂_k = arg max_{θ_k} { L_k(θ_k) },  with L_k(θ_k) = p(ky; θ_k).    (32)

Application of Lemma 1 implies that

L_k(f_k(θ)) = p(ky; f_k(θ)) = k^{-dim(y)} p(y; θ) = k^{-dim(y)} L(θ);    (33)

thus, the Maximum Likelihood estimator satisfies the condition

θ̂_k = f_k(θ̂).    (34)

The likelihood function (eq. 31) rarely has an explicit form, and a common algorithm for its local maximization is the EM algorithm, an iterative algorithm described briefly as follows:

Q(θ; θ̂^(i)) = E_{x|y; θ̂^(i)} { ln p(y|x; θ) },    θ̂^(i+1) = arg max_θ { Q(θ; θ̂^(i)) }.    (35)

At a scale k,

Q_k(θ_k; θ̂_k^(i)) = E_{kx|ky; θ̂_k^(i)} { ln p(ky|kx; θ_k) } = −M ln k + E_{x|y; θ̂^(i)} { ln p(y|x; θ) } = −M ln k + Q(θ; θ̂^(i)),    (36)

where M = dim(y).

Thus, if we initialize this iterative algorithm with the value θ̂_k^(0) = f_k(θ̂^(0)), then we have

θ̂_k^(i) = f_k(θ̂^(i)) for all i.    (37)

The scale invariance coherence of the hyperparameters is thus ensured during the optimization steps.

• In the Generalized Maximum Likelihood (GML) method, one estimates both θ and x by

(θ̂, x̂) = arg max_{(θ, x)} { p(x, y; θ) }.    (38)

Applying the same demonstration as above to the joint law rather than to the marginalized one leads to

(θ̂_k, x̂_k) = (f_k(θ̂), k x̂).    (39)

However, this holds if and only if the GML has a maximum. This may not always be the case, and this is a major drawback of GML. Also, in the GML method, direct resolution is rarely possible and sub-optimal techniques lead to the classical two-step estimation scheme:

x̂^(i) = arg max_x { p(x, y; θ̂^(i)) },    (40)

θ̂^(i+1) = arg max_θ { p(x̂^(i), y; θ) }.    (41)

We see that, in each iteration, the θ estimation step may be considered as the ML estimation of θ if x̂^(i) is supposed to be a realization of the prior law. Thus the coherence of the estimated hyperparameters at different scales is fulfilled during both optimization steps, and

(θ̂_k^(i), x̂_k^(i)) = (f_k(θ̂^(i)), k x̂^(i)).    (42)

Thus, if we consider the whole estimation problem (with an ML or GML approach), the SIP of the estimator is assured in both cases. It is also ensured during the iterative optimization schemes of ML or GML.

5. Markovian invariant distributions

Markovian distributions as priors in image processing allow one to introduce local characteristics and inter-pixel correlations. They are widely used, but there exist many different Markovian models and very few model selection guidelines. In this section we apply the above scale invariance considerations to prior model selection in the case of first-order homogeneous MRFs.

Let x ∈ Ω be a homogeneous Markov random field defined on the subset [1…N] × [1…M] of Z². The Markov characteristic property is:

p(x_i | x_j, j ∈ S, j ≠ i) = p(x_i | x_j, j ∈ ∂i),    (43)

where ∂i is the neighbourhood of site i, and S is the set of pixels. The Hammersley-Clifford theorem for the first-order neighbourhood reads:

p_x(x; λ) ∝ exp( −λ Σ_{{r,s}∈C} φ(x_s, x_r) ),    (44)

where C is the clique set, and φ(x, y) the clique potential. In most works [9, 19, 20, 21] a simplified model is introduced under the form φ(x, y) = φ(x − y). Here we keep a general point of view. Application of the scale invariance condition to the Markovian prior laws p_x(x; λ) leads to the two following theorems:

Theorem 2 A family of Markovian distributions is scale invariant if and only if there exist two functions f(k, λ) and β(k) such that the clique potential φ(x_s, x_r) satisfies:

f(k, λ) φ(kx_s, kx_r) = λ φ(x_s, x_r) + β(k).    (45)


Theorem 3 A necessary and sufficient condition for a Markov random field to be scale invariant is that there exists a triplet (a, b, c) such that the clique potential φ(x_s, x_r) satisfies the linear partial differential equation (PDE):

Finally, enforcing symmetry of the clique potentials, φ(x_s, x_r) = φ(x_r, x_s), the following theorem provides the set of scale invariant clique potentials:

Theorem 4 p_x(x; λ) is scale invariant if and only if φ(x_s, x_r) is chosen from one of the following vector spaces:

V_0 = { φ(x_s, x_r) | ∃ ϕ(·) even and p ∈ R,  φ(x_s, x_r) = ϕ( log |x_s/x_r| ) − p log |x_s x_r| }    (46)

V_1(p) = { φ(x_s, x_r) | ∃ ϕ(·) even,  φ(x_s, x_r) = ϕ( log |x_s/x_r| ) |x_s x_r|^p }    (47)

Moreover, V_0 is the subspace of strictly scale invariant clique potentials.

For the proof of these theorems see [22]. Among the most common models in use for image processing purposes, only a few clique potentials fall into the above set. Let us give two examples. First, the GGMRFs proposed by BOUMAN & SAUER [9] were built by a similar scale invariance approach, but under the restricted assumption that φ(x_s, x_r) = φ(x_s − x_r). The resulting expression φ(x_s, x_r) = |x_s − x_r|^p can be factored as φ(x_s, x_r) = |x_s x_r|^{p/2} |2 sinh( log(x_s/x_r)/2 )|^p, which shows that it falls in V_1(p/2).

The second example of a potential that does not reduce to a single-variable function φ(x_s − x_r) is φ(x_s, x_r) = (x_s − x_r) log(x_s/x_r). It has recently been introduced from I-divergence penalty considerations in the field of image estimation (optical deconvolution) by O'Sullivan [16]. Factoring |x_s x_r|^{1/2} leads to:

φ(x_s, x_r) = |x_s x_r|^{1/2} ϕ( log(x_s/x_r) ),    (48)

where ϕ(X) = 2X sinh(X/2) is even. This shows that φ(x_s, x_r) is in V_1(1/2) and is scale invariant. As φ(x_s, x_r) is defined only for strictly positive arguments, it applies to positive quantities. This feature is very useful in image processing, where prior positivity applies to many physical quantities.
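Both examples can be checked in a few lines (our own sketch, with arbitrary test values). The GGMRF potential |x_s − x_r|^p scales as k^p and matches the V_1(p/2) factorization, while the I-divergence potential (x_s − x_r) log(x_s/x_r) scales as k and matches the V_1(1/2) factorization with ϕ(X) = 2X sinh(X/2).

```python
import numpy as np

rng = np.random.default_rng(2)
xs, xr = rng.uniform(0.1, 5.0, size=3), rng.uniform(0.1, 5.0, size=3)
k, p = 3.7, 1.3

# GGMRF potential and its factored form |xs*xr|^(p/2) * |2 sinh(log(xs/xr)/2)|^p
ggmrf = np.abs(xs - xr) ** p
fact = np.abs(xs * xr) ** (p / 2) * np.abs(2 * np.sinh(np.log(xs / xr) / 2)) ** p
print(np.allclose(ggmrf, fact),
      np.allclose(np.abs(k * xs - k * xr) ** p, k ** p * ggmrf))   # True True

# I-divergence potential and its factored form |xs*xr|^(1/2) * 2*U*sinh(U/2), U = log(xs/xr)
idiv = (xs - xr) * np.log(xs / xr)
U = np.log(xs / xr)
fact2 = np.sqrt(xs * xr) * 2 * U * np.sinh(U / 2)
print(np.allclose(idiv, fact2),
      np.allclose((k * xs - k * xr) * np.log(xs / xr), k * idiv))   # True True
```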

6. Conclusions

In this paper we have outlined and justified a weaker property than linearity that Bayesian estimators are desired to have. We have shown that this scale invariance property (SIP) helps to avoid an arbitrary choice of the measurement scale. Some models already employed in Bayesian estimation, including Markov prior models [9, 16], entropic priors [23, 2] and generalized Gaussian models [11], have demonstrated the existence and usefulness of scale invariant models. Then we have given general conditions for a Bayesian estimator to be scale invariant. This property holds for most Bayesian estimators such as MAP, PM, MMAP under the condition that the prior laws are also scale invariant. Thus, imposition of the SIP can assist in model selection. We have also shown that classical hyperparameter estimation methods satisfy the SIP for the estimated laws.

Finally we discussed how to choose the prior laws to obtain scale invariant Bayesian estimators. For this, we considered two cases: entropic prior laws and first-order Markov models. In related preceding works [1, 2, 24], the SIP constraints have been studied for the case of entropic prior laws. In this paper we extended that work to the case of first-order Markov models and showed that many common Markov models used in image processing are special cases.

A SIP property inheritance

• Proof of Lemma 1:

Let p_{x,y}(x, y; θ) have the scale invariance property, i.e., there exists θ_k = f_k(θ) such that

p_{x,y}(kx, ky; θ_k) = k^{-(M+N)} p_{x,y}(x, y; θ),

where N = dim(x) and M = dim(y). Then, marginalizing with respect to x, we obtain

p_y(ky; θ_k) = ∫ p_{x,y}(kx, ky; θ_k) d(kx) = k^N ∫ k^{-(M+N)} p_{x,y}(x, y; θ) dx = k^{-M} p_y(y; θ),

which completes the proof.

• Proof of Lemma 2:

The definition of SIP for density laws and direct application of the Bayes rule lead to

p_{x,y}(kx, ky; λ_k, ψ_k) = p_{y|x}(ky|kx; ψ_k) p_x(kx; λ_k) = k^{-M} p_{y|x}(y|x; ψ) k^{-N} p_x(x; λ) = k^{-(M+N)} p_{x,y}(x, y; λ, ψ),

which concludes the proof.

B SIP conditions for Bayesian estimator

• Proof of the Theorem 1:

Since a Bayesian estimator is defined by

x̂ = arg min_x { ∫ C(x', x) p(x'|y; θ) dx' },

then

x̂_k = arg min_{x_k} { ∫ C(x'_k, x_k) p(x'_k|ky; θ_k) d(x'_k) }
    = k arg min_x { ∫ C(kx', kx) p(kx'|ky; θ_k) k^N dx' }
    = k arg min_x { ∫ [a_k + b_k C(x', x)] k^{-N} p(x'|y; θ) k^N dx' } = k x̂,

which proves Theorem 1.


• Conditions for cost functions:

The three classical Bayesian estimators, MAP, PM and MMAP, satisfy the condition of the cost function:

- Maximum a posteriori (MAP): C(x*_k, x_k) = 1 − δ(x*_k − x_k) = C(x*, x).

- Posterior Mean (PM): C(x*_k, x_k) = ||kx* − kx||² = k² C(x*, x).

- Marginal Maximum a Posteriori (MMAP): C(x*_k, x_k) = 1 − δ((x*_k)_i − (x_k)_i) = C(x*, x).

References

[1] A. Mohammad-Djafari and J. Idier, "Maximum entropy prior laws of images and estimation of their parameters," in Maximum Entropy and Bayesian Methods in Science and Engineering (T. Grandy, ed.), (Dordrecht, The Netherlands), MaxEnt Workshops, Kluwer Academic Publishers, 1990.

[2] A. Mohammad-Djafari and J. Idier, "Scale invariant Bayesian estimators for linear inverse problems," in Proc. of the First ISBA meeting, (San Francisco, USA), Aug. 1993.

[3] G. Demoment, "Image reconstruction and restoration: Overview of common estimation structure and problems," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, pp. 2024-2036, Dec. 1989.

[4] A. Mohammad-Djafari and G. Demoment, "Estimating priors in maximum entropy image processing," in Proceedings of IEEE ICASSP, pp. 2069-2072, IEEE, 1990.

[5] G. Le Besnerais, J. Navaza, and G. Demoment, "Aperture synthesis in astronomical radio-interferometry using maximum entropy on the mean," in SPIE Conf., Stochastic and Neural Methods in Signal Processing, Image Processing and Computer Vision (S. Chen, ed.), (San Diego), p. 11, July 1991.

[6] G. Le Besnerais, J. Navaza, and G. Demoment, "Synthèse d'ouverture en radioastronomie par maximum d'entropie sur la moyenne," in Actes du 13ème colloque GRETSI, (Juan-les-Pins, France), pp. 217-220, Sept. 1991.

[7] E. Jaynes, "Prior probabilities," IEEE Transactions on Systems Science and Cybernetics, vol. SSC-4, pp. 227-241, Sept. 1968.

[8] G. Box and G. C. Tiao, Bayesian Inference in Statistical Analysis. Addison-Wesley Publishing, 1972.

[9] C. Bouman and K. Sauer, "A generalized Gaussian image model for edge-preserving MAP estimation," IEEE Transactions on Medical Imaging, vol. MI-2, no. 3, pp. 296-310, 1993.

[10] J. Besag, "Digital image processing: Towards Bayesian image analysis," Journal of Applied Statistics, vol. 16, no. 3, pp. 395-407, 1989.


[11] D. Oldenburg, S. Levy, and K. Stinson, "Inversion of band-limited reflection seismograms: theory and practice," Proceedings of the IEEE, vol. 74, p. 3, 1986.

[12] S. Wernecke and L. D'Addario, "Maximum entropy image reconstruction," IEEE Transactions on Computers, vol. C-26, pp. 351-364, Apr. 1977.

[13] S. Burch, S. Gull, and J. Skilling, "Image restoration by a powerful maximum entropy method," Computer Vision and Graphics and Image Processing, vol. 23, pp. 113-128, 1983.

[14] S. Gull and J. Skilling, "Maximum entropy method in image processing," Proceedings of the IEE, vol. 131-F, pp. 646-659, 1984.

[15] A. Mohammad-Djafari and G. Demoment, "Maximum entropy reconstruction in X-ray and diffraction tomography," IEEE Transactions on Medical Imaging, vol. 7, no. 4, pp. 345-354, 1988.

[16] J. A. O'Sullivan, "Divergence penalty for image regularization," in Proceedings of IEEE ICASSP, vol. V, (Adelaide), pp. 541-544, Apr. 1994.

[17] S. Brette, J. Idier, and A. Mohammad-Djafari, "Scale invariant Markov models for lin­ear inverse problems," in Fifth Valencia Int. Meeting on Bayesian Statistics, (Alicante, Spain), June 1994.

[18] J. Marroquin, "Deterministic interactive particle models for image processing and com­puter graphics," Computer Vision and Graphics and Image Processing, vol. 55, no. 5, pp. 408-417, 1993.

[19] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-6, p. 2, 1984.

[20] S. Geman and G. Reynolds, "Constrained restoration and recovery of discontinuities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-14, pp. 367-383, 1992.

[21] J. Besag, "On the statistical analysis of dirty pictures," Journal of Royal Statistical Society B, vol. 48, p. 1, 1986.

[22] S. Brette, J. Idier, and A. Mohammad-Djafari, "Scale invariant Markov models for linear inverse problems," in Second ISBA meeting, vol. Bayesian Statistics, (Alicante, Spain), ISBA, American Statistical Association, June 1994.

[23] S. F. Gull, "Developments in maximum entropy data analysis," in Maximum Entropy and Bayesian Methods (J. Skilling, ed.), pp. 53-71, Dordrecht, The Netherlands: Kluwer Academic Publishers, 1989.

[24] A. Mohammad-Djafari and J. Idier, "A scale invariant Bayesian method to solve linear inverse problems," in Maximum Entropy and Bayesian Methods (G. Heidbreder, ed.), (Dordrecht, The Netherlands), The 13th Int. MaxEnt Workshop, Santa Barbara, USA, Kluwer Academic Publishers, 1993.


FOUNDATIONS: INDIFFERENCE, INDEPENDENCE & MAXENT

Manfred Schramm and Michael Greiner Institut fur Informatik der Technischen Universitiit Munchen, Germany

ABSTRACT. Through completing an underspecified probability model, Maximum Entropy (MaxEnt) supports non-monotonic inferences. Some major aspects of how this is done by MaxEnt can be understood from the background of two principles of rational decision: the concept of Indifference and the concept of Independence. In a formal specification MaxEnt can be viewed as a (conservative) extension of these principles; so these principles shed light on the "magical" decisions of MaxEnt. But the other direction is true as well: since MaxEnt is a "correct" representation of the set of models (Concentration Theorem), it elucidates these two principles (e.g. it can be shown that the knowledge of independences can be of very different information-theoretic value). These principles and their calculi are not just arbitrary ideas: when extended to work with qualitative constraints which are modelled by probability intervals, each calculus can be successfully applied to V. Lifschitz's Benchmarks of Non-Monotonic Reasoning and is able to infer some instances of them ([Lifschitz88]). Since MaxEnt is strictly stronger than the combination of the two principles, it yields a powerful tool for decisions in situations of incomplete knowledge. To give an example, a well-known problem of statistical inference (Simpson's Paradox) will serve as an illustration throughout the paper.

1. Introduction

1.1. Background

If we want to model common sense reasoning, an important step will be the development of systems which can make decisions under incomplete knowledge. These decisions should be the best possible ones given the incomplete knowledge; they will show non-monotonic behaviour when the knowledge is increasing. Recently, probability theory has become more and more accepted as an appropriate tool for that purpose, especially in connection with the notion of entropy ([Paris89], [Pearl88], [Cheeseman88]). Following [Cox79], we consider probability theory as an adequate model for one-dimensional belief of propositional expressions. Following [Adams75], we consider the conditional probability to be much more adequate than the use of the Material Implication¹ of propositional logic when modelling the common sense connective "If, then" of the language. Following [Jaynes82] we consider MaxEnt as an adequate method of choosing a probability model from an infinite set of possible models, when only linear constraints are present. Concerning MaxEnt it is still a problem to explain this method of inductive reasoning to newcomers. Surely there are various ways. One possibility is to take some intuitively plausible axioms of rational reasoning and to show how MaxEnt is a necessary consequence of these axioms. This approach has been chosen quite a few times in the literature ([Shore80], [Skilling88], [Paris90]). Here we choose a slightly different approach; we take two strong properties,

¹The Material Implication of two propositions (P₁, P₂), normally denoted by (P₁ → P₂), is false iff the first proposition (antecedens) is true and the second one is false.


strong enough to define decision principles, and we show that MaxEnt draws strictly stronger conclusions (see section 6 and the figure below) than the two principles combined. Both seem to be different from MaxEnt at first glance, and although they have been well known for a long time, they are far from clear when one looks at them in more detail:

The principle of Indifference, viewed by [Jaynes78] as a simple "demand of consistency", is sometimes mixed with the problem of modelling probabilities; this leads to arguments against this principle. Therefore we have to specify how we use this principle, especially in the presence of linear constraints. The principle of Independence is related to undirected graphs and to the Markov properties of its variables; it seems not to have been used so far as a formal principle of reasoning (but see [Pearl88]). If MaxEnt is derived from the usual axioms, only a special case of this principle is required for the proof.

So the paper proceeds from the bottom to the top of the following figure:

[Figure: hierarchy of inference systems, from bottom to top: Propositional Logic; P-Models; P-Models with Indifference and P-Models with Independence (in parallel); P-Models with Indifference and Independence; P-Models with MaxEnt.]

First, the logic on probability models (P-Models) is formally described and illustrated by use of Simpson's paradox. The principles of Indifference and Independence are then introduced as additional axioms on P-Models. Some remarks about the relation between MaxEnt and these principles conclude this short presentation.

1.2. Mathematical formulation

Consider a finite set R of random variables A, B, C, … with two possible values for each variable (e.g. V(A) = {a, ¬a}). Let Ω be the set of elementary events (also called the set of possible worlds), where an elementary event is a conjunction of values for every random variable; let A(Ω) be the algebra over Ω, defined as the power set of Ω. A (discrete) probability model (P-Model) is an assignment of non-negative numerical values to the elements of Ω, which sum up to unity. Let W_Ω be the set of all possible P-Models for Ω. We define a constraint to be a sentence which is true or false in any P-Model; let DB be a set of linear constraints on W_Ω. We define the set W_DB as the set of all elements of W_Ω which are consistent with the constraints in DB. If W_DB consists of more than one element (here equivalent to infinitely many), the information in DB is incomplete for determining a single P-Model. If W_DB is empty, the information in DB was inconsistent. We want to model incomplete information, expressed by linear constraints (premises) over a set of P-Models, so the case that there are infinitely many elements in W_DB will be our standard case. A conclusion from DB will be a sentence which is true in all P-Models of W_DB (therefore, adding a conclusion to DB won't change the set of models of W_DB). A belief in a by a system now means to us that, if no other information is given and the system is forced to decide between a and ¬a, the system will decide for a (default decision). According to the relationship between probabilities and decisions, we model the belief in a as

(P(a) = X; X ∈ (0.5, 1]) ∈ DB.

Knowledge is expressed by probability one (a is known to be true iff (P(a) = 1) ∈ DB). Therefore, if a sentence S of the form (P(a) = X; X ∈ (0.5, 1]) for some propositional expression a is a conclusion from DB (in symbols: DB ||~ S), the system will decide for a given the knowledge in DB. This interpretation of defaults is quantitative; in particular, this kind of belief means "in more than half of the cases". This is weaker than "in most cases" (similar to "normally"), but the quantitative meaning of most is context-dependent and therefore difficult to describe; the structure of the desired conclusions of most seems to be very similar to that of "more than half". So we opted for that interpretation. Conditional knowledge (belief, decisions) is of course expressed by conditional probabilities: (P(b | a) = X; X ∈ (0.5, 1]) means that if the system knows a (and nothing else), it believes (decides for) b.

1.3. Example: Default-Knowledge

Default-Knowledge: Normally animals do not fly. Birds are animals. Normally birds fly.
Desired conclusion: Animals which are not birds normally do not fly.
Formal: DB₁ := { (P(¬fl | an) = P₁; P₁ ∈ (0.5, 1]), (P(an | bi) = 1.0), (P(fl | bi) = P₂; P₂ ∈ (0.5, 1]) }
Desired conclusion: DB₁ ||~ (P(¬fl | an ∧ ¬bi) = P₃; P₃ ∈ (0.5, 1])
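Because the constraints of DB₁ are linear in the eight elementary-event probabilities, such a conclusion can be checked mechanically with a linear program: minimize P(¬fl ∧ an ∧ ¬bi) − 0.5 · P(an ∧ ¬bi) over all P-Models satisfying the premises; a strictly positive minimum means P(¬fl | an ∧ ¬bi) > 0.5 in every such model. The sketch below is our own spot check for fixed premise values P₁ = P₂ = 0.8 (the intervals of DB₁ would require sweeping these values), with a small lower bound on P(an ∧ ¬bi) so that the conditional is defined.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

p1, p2, eps = 0.8, 0.8, 0.01
worlds = list(itertools.product([0, 1], repeat=3))     # (an, bi, fl) truth values

def ind(pred):
    """Coefficient vector selecting the worlds where pred holds."""
    return np.array([1.0 if pred(an, bi, fl) else 0.0 for an, bi, fl in worlds])

# Premises of DB1 as linear equalities on the 8 world probabilities:
A_eq = np.vstack([
    ind(lambda an, bi, fl: an and not fl) - p1 * ind(lambda an, bi, fl: an),  # P(~fl|an)=p1
    ind(lambda an, bi, fl: bi and not an),                                    # P(an|bi)=1
    ind(lambda an, bi, fl: bi and fl) - p2 * ind(lambda an, bi, fl: bi),      # P(fl|bi)=p2
    np.ones(8),                                                               # normalisation
])
b_eq = np.array([0.0, 0.0, 0.0, 1.0])

# Require P(an & ~bi) >= eps so that the conditional in the conclusion is defined.
A_ub = -ind(lambda an, bi, fl: an and not bi).reshape(1, -1)
b_ub = np.array([-eps])

# Objective: P(~fl & an & ~bi) - 0.5 * P(an & ~bi); positive <=> P(~fl | an & ~bi) > 0.5.
c = ind(lambda an, bi, fl: an and not bi and not fl) - 0.5 * ind(lambda an, bi, fl: an and not bi)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 8)
print("minimum of the objective:", res.fun)   # > 0, so the default conclusion holds
```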

2. Conclusions on P-Models

This kind of logic on P-Models (P-Logic), described so far, is of course strictly stronger than propositional logic, which can be embedded into P-Logic as follows: take the premises of propositional logic as knowledge with probability 1 into DB and look for expressions which are true in all remaining possible worlds. This P-Logic is surely useful when modelling certain examples of reasoning. For example, this logic supports the desired conclusion from DB₁. Moreover, the use of conditional probabilities instead of Material Implication avoids some of the well-known modelling problems with the Material Implication. Also, P-Logic allows for a richer language than propositional logic, but it still has the property of being monotonic (additional knowledge won't revise earlier decisions). However, we aim at something which is much stronger, because too many conclusions which seem to be intuitively true are not supported by this P-Logic.


Example DB₂ [Weak version of Simpson's Paradox ([Blyth73], [Neapolitan90])]:

DB₂ = { (P(c | a) = P₁; P₁ ∈ (0.5, 1]), (P(c | b) = P₂; P₂ ∈ (0.5, 1]) }²

Desired conclusions:
(c1) DB₂ ||~ (P(c | a ∨ b) = P₃; P₃ ∈ (0.5, 1])
(c2) DB₂ ||~ (P(c | a ∧ b) = P₄; P₄ ∈ (0.5, 1])

These conclusions seem intuitively obvious although they are not true in P-Logic (or in statistics): we construct a proof by means of P-Models which fulfil the premises but not the conclusions.

not (c1): (P(abc) = 6/19, P(ab¬c) = 1/19, P(a¬bc) = 1/19, P(a¬b¬c) = 5/19, P(¬abc) = 1/19, P(¬ab¬c) = 5/19, P(¬a¬bc) = 0, P(¬a¬b¬c) = 0)

not (c2): (P(abc) = 1/20, P(ab¬c) = 5/20, P(a¬bc) = 6/20, P(a¬b¬c) = 1/20, P(¬abc) = 6/20, P(¬ab¬c) = 1/20, P(¬a¬bc) = 0, P(¬a¬b¬c) = 0)
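These two counterexamples can be verified mechanically; the short check below (our own) recomputes the relevant conditional probabilities from the listed joint distributions, using exact fractions.

```python
from fractions import Fraction as F

worlds = [(a, b, c) for a in (1, 0) for b in (1, 0) for c in (1, 0)]

def cond(P, event, given):
    """Conditional probability P(event | given) from a joint distribution P over worlds."""
    num = sum(p for w, p in zip(worlds, P) if event(*w) and given(*w))
    den = sum(p for w, p in zip(worlds, P) if given(*w))
    return num / den

# Order of worlds: abc, ab~c, a~bc, a~b~c, ~abc, ~ab~c, ~a~bc, ~a~b~c
P1 = [F(6,19), F(1,19), F(1,19), F(5,19), F(1,19), F(5,19), F(0), F(0)]   # refutes (c1)
P2 = [F(1,20), F(5,20), F(6,20), F(1,20), F(6,20), F(1,20), F(0), F(0)]   # refutes (c2)

for P in (P1, P2):
    print("P(c|a) =", cond(P, lambda a,b,c: c, lambda a,b,c: a),
          " P(c|b) =", cond(P, lambda a,b,c: c, lambda a,b,c: b),
          " P(c|a or b) =", cond(P, lambda a,b,c: c, lambda a,b,c: a or b),
          " P(c|a and b) =", cond(P, lambda a,b,c: c, lambda a,b,c: a and b))
# First model: premises > 1/2 but P(c|a or b) = 8/19 < 1/2; second: P(c|a and b) = 1/6 < 1/2.
```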

This makes the Simpson problem a common sense paradox. Probability theory is too fine-grained to model common sense reasoning in general. The remaining degrees of freedom have to be filled up; to do this without adding information is still a problem, last but not least addressed by the MaxEnt program of Jaynes. Filling the degrees of freedom with correct methods will help to overcome the mistrust in statistics which can be found even among scientifically educated people. So our goal is to look for additional (context-sensitive) constraints (resp. principles) which are able to support rational decisions with incomplete knowledge (e.g. the desired conclusions of the last example DB₂). This will be done in the next sections.

3. Conclusions on P-Models with Indifference

3.1. What does Indifference mean?

The history of this famous principle goes back to Laplace and Keynes. Let us quote [Jaynes78] for a short and informal version of this principle:

"If the available evidence gives us no reason to consider proposition al either more or less likely than a2, then the only honest way we can describe that state of knowledge is to assign them equal probabilities: P(at} = P(a2) ."

Three questions arise here:

a) How to make formally precise that a system has no reason to consider a1 either more or less likely than a2 in the presence of linear constraints?

b) Why should we use this principle?

c) Given a set of linear constraints of DB: is it possible to decide on the basis of this set which elementary events (and therefore which complex events) will be considered to be indifferent?

We will address these questions on the following two pages.

² if the system knows a, it believes (decides for) c; if the system knows b, it believes (decides for) c


3.2. Mathematical formulation

Let W_DB be the set of P-Models of DB, V_DB the set of vectors of P-Models of DB and v ∈ V_DB a single vector. Now look for permutations Π with ∀v ∈ V_DB ∃v' ∈ V_DB : Π(v) = v', written in short form as Π(V_DB) = V_DB. It is well known that any permutation can be expressed by writing down its cycles, so we express Π by describing its cycles. The principle of Indifference now demands that all variables (we express the unknown probabilities of elementary events by variables) within the same cycle get the same value. We define the set I_DB as the collection of all the equations of any Π with the property Π(V_DB) = V_DB. S is a consequence of a set of linear constraints with the help of the principle of Indifference iff the following relation is valid: DB ∪ I_DB ||~ S.

3.3. The main argument for using Indifference: Consistency

If W_DB contains P-Models with the property P(a1) < P(a2) and P-Models with P(a1) > P(a2), and a1 is indifferent to a2 as defined above, an unknown future decision process based on this set of P-Models might once choose a model with the property P(a1) < P(a2) and might choose a P-Model with P(a1) > P(a2) at another time. Both models contain information which is not present in the database. On the basis of V_DB we notice that we won't be able to recognize whether a permutation Π (of the kind Π(V_DB) = V_DB) has happened inside our machine which switches the values of some variables (this is equivalent to renaming the variables) and changes a model with the property P(a1) < P(a2) into a model with the opposite property. Of course we don't want something we can't notice to have any influence on future (rational) decisions. That is what the principle of Indifference is able to prevent: it disposes of those degrees of freedom which our constraints do not address and which we therefore are not able to control in a rational manner.

3.4. Another argument for using Indifference: Model Quantification

Take W_I-DB as the set of all P-Models which satisfy the constraints in DB and the equations in I_DB; take V_I-DB as the corresponding set of all vectors of P-Models. Given that the MaxEnt solution of a problem with linear constraints is the correct representation of the set of P-Models (which was proved by [Jaynes82] via the Concentration Theorem), it is possible to consider every Indifference model w^i ∈ W_I-DB as the MaxEnt solution of a subproblem DB_i, where W_DB_i is an element of a certain partition of W_DB (the partition is formed by varying the values of additional constraints derived from models in W_I-DB). Then this P-Model w^i is of course a correct representation of the set W_DB_i. If this is the case, only a minimum amount of information is necessary to replace the set W_DB_i by the model w^i (the amount tends to zero if the problem is modelled by a random experiment of size N and N grows large), and only a minimum of information is contained in I_DB. This means that statistically all models in W_I-DB have a special representation status.

3.5. How to detect indifferent events by the matrix M of linear constraints.

A sufficient condition for Π to have the property Π(V_DB) = V_DB is the existence of a permutation M_Π of the columns of M which, followed by a permutation M_A of the rows of M, reproduces M (formally: M_A · M · M_Π = M). Proof: systems with the same matrix of equations have the same set of solutions.

Example: Let us take DB3 := DB2 ∪ {p1 = p2 = p}. The matrix of linear constraints has the entries


v1 := P(abc), v2 := P(ab¬c), v3 := P(a¬bc), v4 := P(a¬b¬c), v5 := P(¬abc), v6 := P(¬ab¬c), v7 := P(¬a¬bc), v8 := P(¬a¬b¬c)

M = [  1     1     1     1     1     1    1    1
      1-p   -p    1-p   -p     0     0    0    0
      1-p   -p     0     0    1-p   -p    0    0 ]

(the rows correspond to the normalization and to the constraints P(c | a) = p and P(c | b) = p).

We obtain Π(V_DB3) = V_DB3 for the permutation Π = (v1)(v2)(v3 v5)(v4 v6)(v7 v8).

Equations in I_DB3: {v3 = v5, v4 = v6, v7 = v8}.
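As an illustration of the sufficient condition of 3.5 (our own sketch, not from the original text; the value of p is arbitrary), one can check numerically that the column permutation induced by Π can be undone by a row permutation:

    import numpy as np

    p = 0.7   # any value in (0.5, 1]; the symmetry holds for every p

    # rows: normalization, P(c|a) = p, P(c|b) = p; columns: v1..v8 as defined above
    M = np.array([
        [1.0,   1.0, 1.0,   1.0, 1.0,   1.0, 1.0, 1.0],
        [1 - p, -p,  1 - p, -p,  0.0,   0.0, 0.0, 0.0],
        [1 - p, -p,  0.0,   0.0, 1 - p, -p,  0.0, 0.0],
    ])

    col_perm = [0, 1, 4, 5, 2, 3, 7, 6]   # column permutation Pi = (v1)(v2)(v3 v5)(v4 v6)(v7 v8)
    row_perm = [0, 2, 1]                  # swapping the two conditional constraints restores M
    assert np.allclose(M[:, col_perm][row_perm, :], M)   # M_A . M . M_Pi = M

So Π(V_DB3) = V_DB3, and the equations v3 = v5, v4 = v6, v7 = v8 indeed belong to I_DB3.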

3.6. Examples (no rules) of the use of indifference

• n = |Ω| implies: ∅ ∪ I_∅ ||~ (P(ω_i) = 1/n) ∀ω_i ∈ Ω.

• Take DB4 as equal to { (P(b | a) = p1; p1 ∈ (0.5,1]) }. Conclusion: DB4 ∪ I_DB4 ||~ (P(b | a ∧ c) = p2; p2 ∈ (0.5,1]).³

• Take DB5 as equal to { (P(b | a ∧ c) = p1; p1 ∈ (0.5,1]) }. Conclusion: DB5 ∪ I_DB5 ||~ (P(b | a) = p2; p2 ∈ (0.5,1]).⁴

3.7. Summary (Indifference)

Two important arguments (consistency, quantification of possible worlds) justify the use of the principle of Indifference when decisions are necessary. For us it seems clear that there is no way around it. Of course it does not solve the problem of modelling, which is the problem of defining Ω and encoding our knowledge. Some paradoxes of the use of Indifference are related to the selection of different Ω's and therefore different results of the principle of Indifference (see e.g. [Neapolitan90], [Howson93]). The consistency (i.e. V_DB ≠ ∅ ⇒ V_I-DB ≠ ∅) of this principle can be proven by the convexity of V_DB in any component of the vectors v (∈ V_DB). Moreover, the MaxEnt model fulfils all the equations of I_DB (which means that the MaxEnt model w* is an element of W_I-DB). The decisions based on P-Models and the principle of Indifference are of course strictly stronger than those based on pure P-Models. The decisions already have the property of being non-monotonic when additional information becomes available (indifferences might disappear when new knowledge comes in).

4. Conclusions on P-Models with Independence

4.1. Basics

From the point of view of information theory, Independence of two events a and b in a P-Model w is given if any knowledge about the event a (such as: a has (or has not) happened) does not change the probability of b (and vice versa) in w (formally P(b | a) = P(b)). With the knowledge of Independence of the two events, the probability of the combined event becomes a function of the probabilities of the single events. If this is the case not only for single events but for all values of a random variable, Independence allows one to reduce the complexity (of calculating) and the space (for storing) probability models [Lewis59]. In Bayesian Reasoning, Independence is well known and commonly used when completing

³ Indifference demands the equations P(abc) = P(ab¬c), P(a¬bc) = P(a¬b¬c), P(¬abc) = P(¬ab¬c) = P(¬a¬bc) = P(¬a¬b¬c)

⁴ Indifference demands P(ab¬c) = P(a¬b¬c) = P(¬abc) = P(¬ab¬c) = P(¬a¬bc) = P(¬a¬b¬c)


incomplete knowledge or when simplifying calculations (see e.g. [Pearl88]). In our context the following questions arise:

a) How to make formally precise which kind of (conditional) Independence a system should demand?

b) Why should we use this principle?

c) Given a set of linear constraints of DB: is it possible to decide on the basis of this set which events will become independent?

4.2. Mathematical formulation

The principle of Independence is based on the construction of an undirected graph from the constraints in DB by the following rule: take every variable from R as a node and connect two variables by an edge iff the two variables are both mentioned in the same constraint. Consider the resulting undirected graph as an Independence map (I-Map; see [Pearl88]). We take all the statements of (conditional) Independence of the map and translate them into (non-linear) equations between events of Ω. We define U_DB as the set of all these equations. (The set U_DB expresses many possible independences between subalgebras of A(Ω).) S is a consequence of a set of linear constraints with the help of the principle of Independence when the following relation is valid: DB ∪ U_DB ||~ S.

Example DB2: R = {A, B, C}

The Independence map of DB2 is the chain A — C — B (edges A–C and B–C, and no edge between A and B).

This Independence map now demands that any event of A_{A} is (conditionally) independent from any event of A_{B}, conditioned on an elementary event of Ω_{C}.
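The construction of the I-Map is purely syntactic, as the following small sketch illustrates (our own code, not from the paper; representing a constraint by the set of variables it mentions is an assumption made for the illustration):

    from itertools import combinations

    def independence_map(constraints):
        """Nodes are the variables of R; two variables are joined by an edge
        iff they are mentioned together in at least one constraint."""
        nodes, edges = set(), set()
        for mentioned in constraints:
            nodes |= set(mentioned)
            edges |= {frozenset(pair) for pair in combinations(sorted(mentioned), 2)}
        return nodes, edges

    # DB2: P(c|a) = p1 mentions {A, C}; P(c|b) = p2 mentions {B, C}
    nodes, edges = independence_map([{"A", "C"}, {"B", "C"}])
    print(sorted(nodes))                                    # ['A', 'B', 'C']
    print(sorted(tuple(sorted(e)) for e in edges))          # [('A', 'C'), ('B', 'C')] -- no edge A-B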

4.3. First argument (intuitive graphical representation)

Some years ago, conditional Independence relations in P-Models were identified as a model for a set of axioms which describe (and allow one to conclude) connections on undirected graphs (an introduction to this topic can be obtained from [Pearl88]). This means that (conditional) Independence relations can be detected by only qualitative information about a P-Model: the quantitative information, encoded in the numerical values of its events, is not necessary (see e.g. [Pearl88]). We find this approach very important for MaxEnt, because it clarifies the relation between MaxEnt and (conditional) Independence.⁵

4.4. Second argument (Quantification of possible worlds)

Take W_U-DB as the set of all P-Models which fulfil the constraints in DB and the equations in U_DB; take V_U-DB as the corresponding set of all vectors of P-Models. Given that the MaxEnt solution of a problem with linear constraints is the correct representation of the set of P-Models, it is possible to consider every Independence model w^u (∈ W_U-DB) as the MaxEnt solution of a subproblem DB_u, where W_DB_u is an element of a partition of W_DB

(the partition is formed by varying the values of additional constraints derived from models

⁵ An exact knowledge of this is useful when the solution of a problem is to be found by the computer itself. This knowledge allows one to separate "active" (independence) constraints from "inactive" constraints. The active constraints are necessary for the system, because they will change the result of the reasoning process; the inactive ones are fulfilled anyway by the reasoning process.


in W_U-DB). Then this P-Model w^u is of course a correct representation of the set W_DB_u.

If this is the case, only a minimum amount of information is necessary to change from the set W_DB_u to the model w^u, and only a minimum of information is contained in U_DB. This means that statistically all models in W_U-DB have a special representation status.

4.5. Example (Model Quantification)

Consider an urn with N balls, R of which are red. Let us take out n balls without replacement. What is the most probable frequency of red balls in the sample? We model this question with a Hypergeometric distribution, and counting models shows that the maximum is attained in the case of Independence (as is to be expected from the Independence map).

4.6. Summary (Independence)

Besides the important argument of reducing complexity, two more arguments (intuitive graphical representation, quantification of possible worlds) justify the use of the principle of Independence when decisions are necessary. All demands of Independence contained in U_DB describe constraints of only little information-theoretic value to the problem; if the decisions are based on the method of MaxEnt, these constraints in U_DB have no influence on the decisions. So assumptions of Independence can be informative or not, depending on their relation to the I-Map of the constraints. The consistency (i.e. V_DB ≠ ∅ ⇒ V_U-DB ≠ ∅) of this principle can be proven by the MaxEnt model, which fulfils all the non-linear equations of U_DB (which means that the MaxEnt model is an element of V_U-DB). The set U_DB

(resp. the I-Maps) clarifies the relation between MaxEnt and Independence. The decisions based on P-Models and the principle of Independence are of course strictly stronger than those based on pure P-Models. The decisions already have the property of being non-monotonic when additional information becomes available.

5. Conclusions on P-Models with Indifference and Independence

It can be shown that a system using both the principle of Indifference and the principle of Independence draws strictly stronger conclusions than the systems with the isolated principles. An example of this is again Simpson's Paradox: both conclusions of DB2 become true in the joint system, but they are not supported in the single systems.

6. Conclusions on P-Models with MaxEnt

We expect MaxEnt to be well-known to readers of this volume. So we just recall the following items:

a) MaxEnt has a unique solution, given linear constraints.

b) MaxEnt complies with the demands of the principle of Indifference. If it did not, there would be a different P-Model (apply Π!) with the same entropy as a second candidate for the solution; but this would be inconsistent with a).

c) MaxEnt complies with the demands of the principle of Independence. Idea of the proof: all the equations in U_DB have the form (v_i · v_j = v_k · v_l). Sufficient for such an equation to hold is the validity of (φ_i + φ_j = φ_k + φ_l) for every row φ of the matrix M (φ_i denoting the entry of row φ in column i), which can easily be shown by using undirected graphs.


d) MaxEnt decides strictly stronger than the joint principles of Indifference and Independence, because V_IU-DB⁶ contains in most cases more than one P-Model (i.e. infinitely many).

e) MaxEnt has the best possible justification for decisions by the Concentration Theorem.

f) MaxEnt problems with linear constraints can easily be handled by numerical optimization algorithms. The knowledge of U_DB helps to avoid unnecessary (i.e. inactive) non-linear constraints.
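As a concrete illustration of item f) (our own sketch, not part of the original text; the value p = 0.8 and the use of scipy's SLSQP solver are arbitrary choices), the MaxEnt model of DB3 can be computed directly from the linear constraint matrix of section 3.5:

    import numpy as np
    from scipy.optimize import minimize

    p = 0.8   # illustrative value for p1 = p2 = p in (0.5, 1]

    # columns v1..v8 as in section 3.5; rows: normalization, P(c|a)=p, P(c|b)=p
    A_eq = np.array([
        [1.0,   1.0, 1.0,   1.0, 1.0,   1.0, 1.0, 1.0],
        [1 - p, -p,  1 - p, -p,  0.0,   0.0, 0.0, 0.0],
        [1 - p, -p,  0.0,   0.0, 1 - p, -p,  0.0, 0.0],
    ])
    b_eq = np.array([1.0, 0.0, 0.0])

    def neg_entropy(v):
        v = np.clip(v, 1e-12, None)
        return float(np.sum(v * np.log(v)))

    res = minimize(neg_entropy, np.full(8, 1 / 8), method="SLSQP",
                   bounds=[(0.0, 1.0)] * 8,
                   constraints=[{"type": "eq", "fun": lambda v: A_eq @ v - b_eq}])
    v = res.x
    print("P(c | a v b) =", v[[0, 2, 4]].sum() / v[:6].sum())   # atoms of (a v b) ^ c over a v b
    print("P(c | a ^ b) =", v[0] / v[[0, 1]].sum())

Checking that both printed conditional probabilities exceed 0.5 reproduces the desired Simpson conclusions (c1) and (c2) for this choice of p.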

7. Conclusions

The 5 logics (P-Models, P-Models with Indifference, P-Models with Independence, P-Models with both principles, P-Models with MaxEnt) do not only clarify some theoretical relations between MaxEnt and these principles; they also make sense on their own and are not an ad hoc concept: when applied to a special set of benchmarks for non-monotonic logics collected by V. Lifschitz [Lifschitz88], each logic can solve some of the problems (MaxEnt, being strictly stronger, of course solves nearly all of them). This gives additional information about a problem; it makes explicit which assumptions are necessary to reach the desired conclusions. Concerning our background aim of modelling common sense reasoning, we do not argue that in everyday reasoning humans calculate the MaxEnt distribution. Rather, we argue that this is the formal solution of a general problem, parts of which might be solved informally (with less accuracy) very fast; a first idea for this is given by the qualitative reasoning in undirected graphs.

References

[Adams75] E.W. Adams, "The Logic of Conditionals", D. Reidel, Dordrecht, Netherlands, 1975.

[Bacchus90] F. Bacchus, "Lp - A Logic for Statistical Information", Uncertainty in Artificial Intelligence 5, pp. 3-14, Elsevier Science, ed.: M. Henrion, R.D. Shachter, L.N. Kanal, J.F. Lemmer, 1990.

[Bacchus94] F. Bacchus, A.J. Grove, J.Y. Halpern, D. Koller, "From Statistical Knowledge Bases to Degrees of Belief", Technical Report (available via ftp at logos.uwaterloo.ca:/pub/bacchus), 1994.

[Blyth73] C. Blyth, "Simpson's Paradox and mutually favourable Events", Journal of the American Statistical Association, Vol. 68, p. 746, 1973.

[Cheeseman88] P. Cheeseman, "An Inquiry into Computer Understanding", Computational Intelligence, Vol. 4, pp. 58-66, 1988.

[Cox79] R.T. Cox, "Of Inference and Inquiry - An Essay in Inductive Logic", in: The Maximum Entropy Formalism, MIT Press, ed.: Levine & Tribus, pp. 119-167, 1979.

[Howson93] C. Howson, P. Urbach, "Scientific Reasoning: The Bayesian Approach", 2nd Edition, Open Court, 1993.

⁶ i.e. the set of all vectors of P-Models which fulfil the constraints in DB and the equations in I_DB and U_DB


[Jaynes78] E.T. Jaynes, "Where do we stand on Maximum Entropy?", 1978, in: E.T. Jaynes: Papers on Probability, Statistics and Statistical Physics, pp. 210-314, Kluwer Academic Publishers, ed.: R.D. Rosenkrantz, 1989.

[Jaynes82] E.T. Jaynes, "On the Rationale of Maximum-Entropy Methods", Proceedings of the IEEE, Vol. 70, No.9, pp. 939-952, 1982.

[Lewis59] P.M. Lewis, "Approximating Probability Distributions to Reduce Storage Requirements", Information and Control 2, pp. 214-225, 1959.

[Lifschitz88] V. Lifschitz, "Benchmark Problems for Formal nonmonotonic Reasoning", Lecture Notes in Artificial Intelligence: Non-Monotonic Reasoning, Vol. 346, pp. 202-219, ed.: Reinfrank et al., 1988.

[Neapolitan90] R.E. Neapolitan, "Probabilistic Reasoning in Expert Systems: Theory and Algorithms", John Wiley & Sons, 1990.

[Paris89] J.B. Paris, A. Vencovska, "On the Applicability of Maximum Entropy to Inexact Reasoning", Int. Journal of Approximate Reasoning, Vol. 3, pp. 1-34, 1989.

[Paris90] J.B. Paris, A. Vencovska, "A note on the Inevitability of Maximum Entropy", Int. Journal of Approximate Reasoning, Vol. 4, pp. 183-223, 1990.

[Pearl88] J. Pearl, "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference", Kaufmann, San Mateo, CA, 1988.

[Shore80] J.E. Shore, R.W. Johnson, "Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross Entropy", IEEE Transactions on Information Theory, Vol. IT-26, No. 1, pp. 26-37, 1980.

[Skilling88] J. Skilling, "The Axioms of Maximum Entropy", in: Maximum-Entropy and Bayesian Methods in Science and Engineering, Vol. 1 - Foundations, Kluwer Academic, ed.: G.J. Erickson, C.R. Smith, Seattle Univ., Washington, 1988.

About the authors

Manfred Schramm (e-mail: [email protected]) has been working since 1990 in an automated reasoning group. His research interests include reasoning with incomplete knowledge, common sense reasoning, logics for belief and knowledge, non-monotonic reasoning and modelling with probability.

Michael Greiner (e-mail: [email protected]) has been working since 1992 on the comparison and development of numerical optimization methods for the performance analysis of computer systems. His research interests include statistics, probability theory and genetic algorithms.

This short article is based on a technical report by the two authors published at the TU München in 1994.


THE MAXIMUM ENTROPY ON THE MEAN METHOD, NOISE AND SENSITIVITY

Jean-François BERCHER, Guy LE BESNERAIS and Guy DEMOMENT

Laboratoire des Signaux et Systèmes (CNRS-ESE-UPS), École Supérieure d'Électricité, Plateau de Moulon, 91192 Gif-sur-Yvette Cedex, France

ABSTRACT. In this paper we address the problem of building convenient criteria to solve linear and noisy inverse problems of the form y = Ax + n. Our approach is based on the specification of constraints on the solution x through its belonging to a given convex set C. The solution is chosen as the mean of the distribution which is the closest to a reference measure μ on C with respect to the Kullback divergence, or cross-entropy. This is therefore called the Maximum Entropy on the Mean Method (MEMM). This problem is shown to be equivalent to the convex one x = arg min_x F(x) subject to y = Ax (in the noiseless case). Many classical criteria are found to be particular solutions with different reference measures μ. But except for some measures, these primal criteria have no explicit expression. Nevertheless, taking advantage of a dual formulation of the problem, the MEMM enables us to compute a solution in such cases. This indicates that such criteria could hardly have been derived without the MEMM. In order to integrate the presence of additive noise in the MEMM scheme, the object and noise are searched for simultaneously in an appropriate convex set. The MEMM then gives a criterion of the form x = arg min_x F(x) + g(y − Ax), where F and g are convex, without constraints. The functional g is related to the prior distribution of the noise, and may be used to account for specific noise distributions. Using the regularity of the criterion, the sensitivity of the solution to variations of the data is also derived.

1. Problem statement

In many applications, one often faces the inverse problem y = Ax + n, which consists in estimating a vector x ∈ R^N from an indirect and noisy observation vector y. The observation matrix A is supposed to be known, together with some statistical characteristics of the noise n. When the observation matrix A is either not regular or ill-conditioned, the problem is ill-posed and one has to complete the data with a priori knowledge or constraints on the solution in order to select a physically acceptable solution. Such information may be given in the form of the convex constraint

x ∈ C, (1)

where C is a convex set. Examples of this situation are plentiful; let us only cite the problem of imaging positive intensity distributions, which arises in spectral analysis, astronomy, spectrometry, etc. In other specific problems, such as crystallography or tomography, lower and upper bounds on the image are known and have to be taken into account in the reconstruction process. Such constraints may be specified by the belonging of the object to the convex set C (where the bounds a_k and b_k are given) and include the positivity constraint as a special case,

C = { x ∈ R^N : a_k ≤ x_k ≤ b_k, k = 1, ..., N }. (2)


2. Methods for solving linear inverse problems

In the case of an ill-posed problem, the generalized inverse solution is unsatisfactory because of the dramatic amplification of any observation noise. Quadratic regularization makes it possible to get rid of ill-posedness effects, but it leads to linear estimates, and therefore cannot provide any guarantee with respect to the support constraint (2).

Possible answers are given by set theoretic estimation (for a review see [1]) and projection onto convex sets algorithms. Although good reconstructions can be obtained, they are often computationally expensive and do not lead to a unique and well-defined solution.

Other approaches use regularized criteria, which are usually written as a compound criterion made of two terms, one which enforces some fidelity of the solution to the data, the other which ensures that some desirable properties are met. Such regularized criteria will be noted under the generic form

J(x) = F(x) + α g(y − Ax),   α ≥ 0. (3)

Many of these regularized criteria may be interpreted in a Bayesian setting. Indeed, if the functionals −F and x ↦ −α g(y − Ax) are respectively a log-prior and a log-likelihood, then the minimization of J provides the maximum a posteriori (MAP) estimator. However, in a given problem, the ab initio choice of a good model is a difficult task, for which there is no general answer (see [6] for a discussion of the subject). Such situations are encountered when the only a priori knowledge is a convex constraint such as (1). Nevertheless, useful methods have been found in those cases: for instance, when reconstructing objects with positivity as the only pre-requisite, several thought processes have led different authors to the conclusion that the maximum entropy reconstruction method could be a useful answer. It consists in the optimization of a regularized criterion of the form (3) with

F(x) = Σ_{i=1}^N { x_i log(x_i / m_i) − x_i + m_i }, (4)

where m = [m_1, m_2, ..., m_N] is a prior guess. As far as the positivity constraint is concerned, criteria like (4), built upon logarithmic expressions, ensure positivity and are therefore said to be "positivity free"; another well-known example is the "log(x)" or Burg entropy used in spectral analysis.

The several good properties of the maximum entropy reconstruction method have been studied by many authors (see axiomatic studies such as [2] or [5]). The MEMM construction generalizes in some way certain aforementioned "thought processes" leading to the maximum entropy reconstruction method (4), in order to exhibit useful regularization functionals for a large class of convex constraints (1). The obtained regularizing functionals share many properties of the entropy (4).

3. The Maximum Entropy on the Mean Method

The foundations of the Maximum Entropy on the Mean Method originate from the work of J. Navaza [10], and some theoretical aspects of the method were further studied by F. Gamboa and D. Dacunha-Castelle [3]. We have also studied it with special attention to its potential applications in signal and image reconstruction and restoration [8]. For the sake of simplicity, this paragraph addresses the noiseless problem. Discussion of how to account for noise will take place in §5.2.

Much emphasis must be put on our only a priori information: the convex constraint (2). The MEMM construction thus begins with the specification of the set C and a reference measure dμ(x) over it. The actual observations y are considered as the mean of a process x under a probability distribution P defined on C (this idea comes from statistical physics, where observations are average values or macrostates). The set C being convex, the mean E_P{x} under P is in C, and hence the convex constraint is automatically fulfilled by E_P{x}.

3.1. Additional information principle

Since the constraint given by (2) does not lead to a unique distribution P, we have to invoke some additional information principle. For this purpose, we use the μ-entropy K(P, μ), or Kullback-Leibler (K-L) information [7]. This information is defined for a reference measure μ and a probability measure P by

K(P, μ) = ∫ log(dP/dμ) dP (5)

if P is absolutely continuous with respect to μ (P ≪ μ), and K(P, μ) = +∞ otherwise. The distribution P is selected as the minimizer of the μ-entropy subject to the constraint "on the mean" A E_P{X} = y. In other words, P is the nearest distribution, with respect to the K-L divergence, to the reference measure μ in the set of distributions such that A E_P{X} = y. The maximum entropy on the mean problem then states as follows:

MEMM problem:   P̂ = arg min_P ∫ log(dP/dμ)(x) dP(x),   such that   y = A ∫ x dP(x).

It is well known that the solution, if it exists, belongs to the exponential family

dP_s(x) = exp( s^t x − log Z(s) ) dμ(x), (6)

and, more precisely, that its natural parameter is of the form s = A^t λ for some λ. In (6), log Z is the log-partition function, or the log-Laplace transform of the measure dμ(x); this function will be noted F* in the sequel.

3.2. The dual problem

Using results of duality theory, there is an equality between the optimum value of the previous problem and the optimum value of its dual counterpart (dual attainment):

inf_{P ∈ P_y} K(P, μ) = sup_{λ ∈ D_A} { λ^t y − F*(A^t λ) }, (7)

where P_y = {P : A E_P{X} = y} is the set of normalized distributions which satisfy the linear constraint on the mean, and D_A is the set {λ ∈ R^M : Z(A^t λ) < ∞}, which is often the whole of R^M, in which case the dual problem is unconstrained.

Once the dual problem on the right-hand side of (7) is solved, that is, the maximization of the dual functional

D(λ) = λ^t y − F*(A^t λ), (8)


yielding an optimum value λ̂, one has the expression of the density P̂ = P_{A^t λ̂}, and can calculate the reconstructed object x̂ by computing (numerically) the expectation E_P̂{X}. But this is not the most efficient way to compute the solution. Indeed, inside the exponential family (6) there is a one-to-one mapping between the natural parameter s and the mean of the associated distribution x(s):

x(s) = dF*/ds (s). (9)

Therefore, the solution x̂ is simply obtained by evaluating (9) at the optimal point A^t λ̂. Let us emphasize that the dual criterion is by construction a strictly concave functional. Efficient methods of numerical optimization, such as gradient, conjugate gradient, or second order methods (Gauss-Newton), can be used to compute the solution. They will use the gradient of D, which is easily calculated to be just y − A x(A^t λ). During the algorithm, the primal-dual relation (9) is used to compute the current reconstruction from the dual vector λ.
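To make the dual computation concrete, here is a small sketch (our own illustration, not from the original paper) for the separable Poisson reference of §4.2, for which F*(s) = Σ_j m_j (e^{s_j} − 1) and the primal-dual relation (9) reads x(s) = m e^s componentwise; the toy operator A and data y are made up:

    import numpy as np
    from scipy.optimize import minimize

    def memm_poisson(A, y, m):
        """Noiseless MEMM with a separable Poisson reference of mean m.
        Dual functional (8): D(lam) = lam^T y - F*(A^T lam); its gradient is y - A x(A^T lam),
        and the reconstruction follows from the primal-dual relation x = m * exp(A^T lam)."""
        def neg_dual(lam):
            s = A.T @ lam
            return -(lam @ y - np.sum(m * (np.exp(s) - 1.0)))
        def neg_grad(lam):
            return -(y - A @ (m * np.exp(A.T @ lam)))
        res = minimize(neg_dual, np.zeros(A.shape[0]), jac=neg_grad, method="BFGS")
        return m * np.exp(A.T @ res.x), res.x

    rng = np.random.default_rng(0)
    A = rng.uniform(0.0, 1.0, size=(3, 6))            # 3 noiseless measurements, 6 unknowns
    y = A @ np.array([0.1, 2.0, 0.3, 0.1, 1.5, 0.2])
    x_hat, lam_hat = memm_poisson(A, y, np.ones(6))   # flat prior guess m = 1
    print(np.linalg.norm(y - A @ x_hat))              # constraint y = A x is met at convergence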

3.3. Yet another primal problem

The previous development was done in the space of the dual parameters λ. The purpose of this paragraph is to come back to the natural "object space". We will exhibit a new primal criterion, which we will call an entropy. This function, not surprisingly, is intimately related to the previous dual function and the K-L information.

For each x ∈ C, consider the MEMM problem when the constraint is E_P{X} = x. We define F(x) to be the optimum value of the K-L information for this problem,

F(x) = inf_{P ∈ P_x} K(P, μ),   where P_x = {P : E_P{X} = x}.

As already seen, at the optimum we have, by dual attainment,

F(x) = sup_s { s^t x − F*(s) }. (10)

The latter equation means that F is the convex conjugate of F* and, as F* is the log-Laplace transform of μ, F is the Cramér transform of μ. Such transforms appear in various fields of statistics and in particular in Large Deviations theory, which has important connections with the MEMM [9]. Properties of Cramér transforms are listed below [4]:

• F is continuously differentiable and strictly convex on C,

• F(x) = +∞ for x ∉ C and its derivative is infinite on the boundary of C,

• F(x) ≥ 0, with equality for x = m, the mean value under the reference measure μ.

Our original MEMM problem can now be handled in a different way. If P is a candidate distribution with mean x, its K-L information with respect to μ is greater than or equal to F(x). Moreover, this lower bound can be decreased by searching for a vector x̂ minimizing F over the set C_y = {x : A x = y}. Then the MEMM problem is reformulated as

inf_{P ∈ P_y} K(P, μ) = inf_{x ∈ C_y} F(x).


If we consider the reconstruction problem in the object space, we only need to solve

x̂ = arg min_{x ∈ C_y} F(x). (11)

Note that this problem has the same dual problem as that of (8). In fact, we have exhibited another primal problem associated with (8), directly in the object space R^N. Its solution x̂ is the mean of the optimal distribution in the MEMM problem, and a solution to our reconstruction problem. This swap between primal problems is referred to as a "contraction principle" in statistical physics [4]. In this context, the functional F appears as a level-1 entropy; therefore we will simply call it an entropy in the following.

Properties of the Cramér transform are useful for reconstruction purposes when holding the entropy F as the objective function, as in (11). Strict convexity enables a simple implementation and guarantees the uniqueness of the reconstruction. The second property shows that any descent method will provide a solution in C, even if the constraint x ∈ C is not specified in the algorithm; this "C-free" property is here an analog of the "positivity free" property observed in conventional maximum entropy solutions (see above). The last property shows that F may be considered as a discrepancy measure between x and m. In the sequel, we give some examples illustrating the different points developed above.

4. A few examples of MEMM criteria

4.1. Gaussian reference

Our first example consists in a problem where no constraint is known on the object, so that C = R^N. We choose the Gaussian measure N(m, R_x) as our reference measure μ on C. A simple calculation then leads to the Cramér transform

F(x) = (1/2) (x − m)^t R_x^{-1} (x − m), (12)

which is recognized as a classical quadratic regularizing term.

4.2. The positive case

• Poisson reference and the "Shannon entropy"

Let now C be ]0, +∞[, and the reference distribution be a (separable) Poisson law with expectation m. Such a prior may correspond to the modelling of the fall of quanta of energy following a Poisson process; this modelling may be encountered in astronomy (the speckle-images of optical interferometry), for instance. The reference measure is then a product of Poisson distributions of means m_j. The entropy functional F, which measures the distance between any candidate solution x and the prior mean m, is the Cramér transform of μ, and works out to be

F(x) = Σ_{j=1}^N [ x_j log(x_j / m_j) + m_j − x_j ],

which is the generalized version of the Shannon entropy.
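As a check (our own derivation, not part of the original text), the log-Laplace transform of the separable Poisson reference and its Legendre conjugate give exactly this expression:

    \mathcal{F}^*(s) = \sum_j \log \mathbb{E}_\mu\!\left[e^{s_j X_j}\right]
                     = \sum_j m_j\left(e^{s_j}-1\right),
    \qquad
    \mathcal{F}(x) = \sup_s \bigl\{ s^t x - \mathcal{F}^*(s) \bigr\}
                   = \sum_j \Bigl[ x_j \log\frac{x_j}{m_j} - x_j + m_j \Bigr],

the supremum being attained at s_j = log(x_j / m_j).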


• Gamma reference and the Itakura-Saito discrepancy measure

The presentation of [11] in spectrum analysis happens to be exactly a MEMM approach to a well-known criterion: the Itakura-Saito discrepancy measure. The periodogram having asymptotically a χ² distribution with two degrees of freedom, the corresponding reference measure μ over the possible spectra is an exponential law with mean, i.e. prior spectrum, m. Using the Cramér transform definition, one easily obtains the entropy

F(x) = Σ_{j=1}^N [ x_j / m_j − log(x_j / m_j) − 1 ], (13)

which is the Itakura-Saito distortion between x and m. With m = 1, we measure a distance to a flat spectrum and recover the so-called "log(x)", or Burg, entropy.

Figure 1: comparison of different reconstructions in a simple Fourier synthesis problem. The test object is in (a), its Fourier transform and the available data in (b). Three reconstructions corresponding to different reference measures of the MEMM scheme, and also to different constraint sets, are given: (c) reconstruction with a Gaussian reference measure, no constraint; (d) reconstruction with a Poisson reference, positivity constraint; (e) reconstruction with a uniform measure on [0, 2], i.e. x ∈ [0, 2]^N. They show the improvement with the reduction of the set of admissible solutions.

4.3. The bounded case

We consider here the case when C has the general form of Eq. 2. Such constraints may be useful in many applied problems where the object is a priori known to lie between two bounds (tomography, filter design, crystallography).


Several reference measures can be used on the convex set C. A natural idea is indeed to use a product of uniform measures over each interval ]a_j, b_j[: dμ(x) = ⊗_{j=1}^N (1/(b_j − a_j)) 1_{]a_j, b_j[}(x_j) dx_j.

The calculation of the Cramér transform leads to implicit equations; therefore we have no analytic expression for F. Nevertheless, the primal-dual relation can be computed, and the convex problem inf_x F(x) subject to y = Ax, where F is not explicit, can still be solved using its dual formulation (8) together with the aforementioned primal-dual relation.
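For instance (our own sketch, not from the original paper), for the uniform reference the log-Laplace transform of each component is F*_j(s) = log[(e^{s b_j} − e^{s a_j}) / (s (b_j − a_j))], and its derivative, the primal-dual relation x_j(s), can be evaluated directly even though F itself is not explicit:

    import numpy as np

    def x_of_s_uniform(s, a, b):
        """Primal-dual relation x_j(s) = dF*_j/ds for a uniform reference on ]a_j, b_j[.
        The value always lies strictly between a_j and b_j: the 'C-free' property."""
        s, a, b = np.broadcast_arrays(np.asarray(s, float), a, b)
        x = 0.5 * (a + b)                                 # limit of the derivative as s -> 0
        nz = np.abs(s) > 1e-8
        eb, ea = np.exp(s[nz] * b[nz]), np.exp(s[nz] * a[nz])
        x[nz] = (b[nz] * eb - a[nz] * ea) / (eb - ea) - 1.0 / s[nz]
        return x

    print(x_of_s_uniform([-50.0, 0.0, 50.0], np.zeros(3), 2.0 * np.ones(3)))
    # approx. [0.02, 1.0, 1.98]: the reconstruction sweeps ]0, 2[ as the dual variable varies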

solved using its dual formulation (8) together with the aforementioned primal-dual relation. Other measures could be used in this case. The case of a Bernoulli measures product dJ-L(x) = ®f=1{Cl' j O'(xj - aj) + (1- Cl'j)O'(xj - bj)} (where 8 denotes the Dirac measure) is derived in a referenced work [9] and leads to a generalized version of Fermi-Dirac entropies.

Figure 1 compares reconstructions obtained with different entropies presented above.

5. Taking noise into account

So far, MEMM criteria have been derived from the maximization of the μ-entropy subject to an exact constraint. Any observation noise will ruin our exact constraint and, as a consequence, the two (primal-dual) formulations of the MEMM problem. The exact constraint was useful in interpreting the observations as a linear transform of a mean, enabling us to exhibit the discrepancy measure F. Because of the good properties of F, we will still consider the unknown object x as a mean, in order to use its entropy F(x), but we have to modify the procedure.

5.1. The χ² constraint

A classical way to account for noise is to construct a confidence region about the expected value of some statistic. For Gaussian noise, one usually uses the χ² constraint ||y − Ax||² ≤ ρ, where ρ is some constant. Then the problem becomes the minimization of F subject to the χ² constraint. There always exists a positive parameter α (in fact a Lagrange parameter corresponding to the χ² constraint) such that the previous problem reduces to

inf_x { F(x) + α ||y − Ax||² }. (14)

5.2. Accounting for general noise statistic within the MEMM procedure

Thanks to a specific entropy function, more complicated penalizations than (14) can be performed in order to account for non-Gaussian noises. Such entropies can be derived directly by the same MEMM axiomatic approach as in the noiseless case. To this end, we only need to introduce an extended object x̄ = [x, n] and consider the relation y = Ā x̄, with Ā = [A, I]. The vector x̄ evolves in the convex set C̄ of R^{N+M}, which separates into a product of the usual C and of B, C̄ = C × B, where B is the convex hull of the state space of the noise vector n.

We then use a reference measure ν over the noise set. For instance, in the case of Gaussian noise we take B = R^M and a centered Gaussian law with covariance matrix R_ν as ν. With Poisson noise we take B = R_+^M and a Poisson reference measure ν.

Now we can define a new entropy functional by using a reference measure μ̄ on C̄. If ν is the distribution of the noise, μ our object reference measure on C, and if we assume that the object and the noise are independent, we obtain μ̄ = μ ⊗ ν. The entropy function we are looking for is then the Cramér transform of μ̄, which is simply

F_μ̄(x̄) = F_μ(x) + F_ν(n).

Estimation of the extended object is conducted through a constrained minimization of F_μ̄(x̄), the constraint being y = Ā x̄ = Ax + n. It therefore reduces to the unconstrained minimization of the compound criterion

F_μ(x) + F_ν(y − Ax). (15)

A dual approach is again useful, in particular if F_μ or F_ν or both are not explicit. It is easy to show that the dual criterion is

D(λ) = λ^t y − F_μ*(A^t λ) − F_ν*(λ).

Having solved the dual problem, we come back to the primal solution by the primal-dual relation, which is, thanks to the separability of the log-Laplace transform of μ̄, the same as in the noiseless case (9).
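As an illustration (our own sketch, continuing the Poisson example above and assuming a Gaussian noise reference ν = N(0, σ²I), for which F_ν(n) = ||n||²/(2σ²) and F_ν*(λ) = σ²||λ||²/2), the noisy dual only adds a quadratic term, and the reconstruction is recovered exactly as before:

    import numpy as np
    from scipy.optimize import minimize

    def memm_poisson_gaussian(A, y, m, sigma):
        """MEMM with noise: separable Poisson object reference of mean m,
        Gaussian noise reference nu = N(0, sigma^2 I).
        Dual: D(lam) = lam^T y - F_mu*(A^T lam) - 0.5 * sigma^2 ||lam||^2;
        reconstruction: x = m * exp(A^T lam), noise estimate: n = sigma^2 * lam."""
        def neg_dual(lam):
            s = A.T @ lam
            return -(lam @ y - np.sum(m * (np.exp(s) - 1.0)) - 0.5 * sigma**2 * lam @ lam)
        def neg_grad(lam):
            return -(y - A @ (m * np.exp(A.T @ lam)) - sigma**2 * lam)
        res = minimize(neg_dual, np.zeros(A.shape[0]), jac=neg_grad, method="BFGS")
        lam = res.x
        return m * np.exp(A.T @ lam), lam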

Figure 2: Different accounts for noise. The original object (a) is convolved and corrupted by Poisson noise (b); twenty different realizations of the noise are represented. Reconstructions with Poisson (object) and Gaussian (noise) reference measures are in (c); they are outperformed by the reconstructions in (d), obtained with Poisson and Poisson reference measures, where the real nature of the noise is accounted for.

We are then able to account for specific noise distributions without losing the nice properties of our criteria: the global criterion (15) is always convex, and the convex constraint is automatically satisfied. Concerning the case of Gaussian noise, it can easily be checked, using the result of (12), that a Gaussian reference measure for the noise term leads exactly to the problem of (14), which was obtained by statistical considerations.

5.3. Sensitivity of the reconstruction

Within the limit of small variations, we can also study the stability of the reconstruction with a sensitivity analysis. This enables us to study the importance of a given data point on the reconstruction and to quantify the amount of change resulting from a perturbation of the data. The sensitivity analysis is based on the determination of the derivative dx̂/dy. Although an expression can be obtained in the direct domain, the derivation is done in the dual domain, because the primal functions may not be explicit.

The stationary point λ̂ of the dual function D(λ) verifies

y = A F_μ*'(A^t λ̂) + F_ν*'(λ̂).

Let us note F_ν*'' and F_μ*'' the diagonal matrices of the second derivatives F_ν*''(λ̂) and F_μ*''(A^t λ̂). Then we have

dy = (∂y/∂λ) dλ = [ A (∂x_λ/∂λ) + F_ν*'' ] dλ.

With the primal-dual relation x_λ = F_μ*'(A^t λ), the partial derivative of x_λ with respect to λ is simply F_μ*'' A^t. Thus

dλ = [ A F_μ*'' A^t + F_ν*'' ]^{-1} dy.

Using dx_λ = (∂x_λ/∂λ) dλ, one finally gets the relation

dx̂ = F_μ*'' A^t [ A F_μ*'' A^t + F_ν*'' ]^{-1} dy = H dy.

With E{dy dy^t} = R_y, the noise covariance matrix, we use the "sensitivity matrix" H R_y H^t, whose (square-root) diagonal terms may serve as "sensitivity bars".
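Continuing the same sketch (our own code; the Poisson-object / Gaussian-noise second derivatives F_μ*''(A^t λ̂) = diag(m e^{A^t λ̂}) and F_ν*'' = σ²I are the assumptions carried over from above), the sensitivity bars can be computed as follows:

    import numpy as np

    def sensitivity_bars(A, lam_hat, m, sigma, R_y):
        """dx = H dy with H = Fmu'' A^T (A Fmu'' A^T + Fnu'')^{-1};
        returns the square roots of the diagonal of the sensitivity matrix H R_y H^T."""
        Fmu2 = np.diag(m * np.exp(A.T @ lam_hat))   # second derivative of F_mu* at A^T lam_hat
        Fnu2 = sigma**2 * np.eye(A.shape[0])        # second derivative of F_nu* (Gaussian noise)
        H = Fmu2 @ A.T @ np.linalg.inv(A @ Fmu2 @ A.T + Fnu2)
        return np.sqrt(np.diag(H @ R_y @ H.T))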

Figure 3: Sensitivity of the reconstruction. Sensitivity "bars" are plotted in (a) and (b). In (b), we have reported 20 reconstructions of a Monte Carlo study; the variations of the reconstructions are in good agreement with the sensitivity analysis.


6. Conclusion

It is always possible to modify our reference measures to balance the two terms of the global criterion (15), which should therefore be written as

F_μ(x) + α F_ν(y − Ax),

where α is a regularization parameter. The Maximum Entropy on the Mean procedure enables us to find the generic form of regularized criteria, and to solve the problem even if the primal criteria F_μ and F_ν have no analytical expression.

Such an approach provides a new general framework for the interpretation and derivation of these criteria. Many other criteria besides those presented in §4 have been derived [9]. In particular, reference measures defined as mixtures of distributions (Gaussian, Gamma) have been successfully used for the reconstruction of blurred and noisy sparse spike trains. Poissonized sums of random variables also lead to interesting regularized procedures in connection with the general class of Bregman divergences. Work is also in progress concerning the quantification of the quality of MEMM estimates and the links with the Bayesian approach, especially with correlated a priori models such as Gibbs random fields.

References

[1] P. L. Combettes. The foundation of set theoretic estimation. Proceedings of the IEEE, 81(2):182-208, Feb. 1993.

[2] I. Csiszar. Why least-squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. The Annals of Statistics, 19(4):2032-2066, 1991.

[3] D. Dacunha-Castelle and F. Gamboa. Maximum d'entropie et problème des moments. Annales de l'Institut Henri Poincaré, 26(4):567-596, 1990.

[4] R. S. Ellis. Entropy, Large Deviations, and Statistical Mechanics. Springer-Verlag, New York, 1985.

[5] L. K. Jones and C. L. Byrne. General entropy criteria for inverse problems, with applications to data compression, pattern classification and cluster analysis. IEEE Transactions on Information Theory, 36(1):23-30, Jan. 1990.

[6] R. E. Kass and L. Wasserman. Formal Rules for Selecting Prior Distributions: A Review and Annotated Bibliography. Technical report, Department of Statistics, Carnegie Mellon University, 1994. Submitted to the Journal of the American Statistical Association.

[7] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.

[8] G. Le Besnerais. Méthode du maximum d'entropie sur la moyenne, critères de reconstruction d'image et synthèse d'ouverture en radio-astronomie. PhD thesis, University of Paris-Sud, 1993.

[9] G. Le Besnerais, J.-F. Bercher, and G. Demoment. A new look at the entropy for solving linear inverse problems. Submitted to IEEE Transactions on Information Theory, 1994.

[10] J. Navaza. The use of non-local constraints in maximum-entropy electron density reconstruction. Acta Crystallographica, pages 212-223, 1986.

[11] J. E. Shore. Minimum cross-entropy spectral analysis. IEEE Transactions on Acoustics, Speech and Signal Processing, (2):230-237, Apr. 1981.


THE MAXIMUM ENTROPY ALGORITHM APPLIED TO THE TWO-DIMENSIONAL RANDOM PACKING PROBLEM

G.J. Daniell, Physics Department, University of Southampton, Southampton SO17 1BJ, U.K.

ABSTRACT. The statistical properties of the random close packing, on a plane, of discs of two different sizes are considered. All the possible clusters of a central disc and its nearest neighbours that are allowed by geometry are determined and the proportions of these clusters are computed by the maximum entropy method. The results for the proportions of the three types of contact between the two sizes of disc are compared with a Monte Carlo simulation.

1. Introduction

Problems involving the random packing of spheres or discs occur in many branches of science and present formidable mathematical challenges. In view of these difficulties, simple approximate calculations are worth studying. Dodds [1] devised a simple model for the random close packing of discs of two different sizes and used it to predict the numbers of contacts between the different types of disc. A more recent survey of the problem is [2]. This paper shows how the assumptions inherent in Dodds' model can be incorporated into the general formalism of the maximum entropy method. The result is a more straightforward calculation, in which the assumptions involved are absolutely explicit, and which can be used to answer a wider range of questions about the packing distribution.

The Maximum Entropy Method of Jaynes [3] is a rule for assigning probabilities when certain average values are given and we otherwise wish to remain as unprejudiced as possible. We consider the same problem as Dodds: discs of two different sizes, referred to as type 1 and type 2, are arranged at random in a plane and compressed so that they are tightly packed. Let p_jk^(i) be the probability that a disc chosen at random is of type i (1 or 2) and its nearest neighbours are j of type 1 and k of type 2. The set of possible values of j and k is determined by geometrical considerations and the assumption that the structure is close packed. Since a disc will not in general be in perfect contact with all its neighbours, the set of possible values of j and k will depend on how we regard the gaps in the structure. Our results cannot be critically affected by the inclusion or omission of a pair of j and k from the list of permitted values, since the essence of both Dodds' approximation and ours is to pretend that there are no gaps.

The entropy of the distribution p_jk^(i) is then

S = − Σ_{j,k} { p_jk^(1) log(p_jk^(1) / g_jk) + p_jk^(2) log(p_jk^(2) / g_jk) }, (1)

it being understood that j and k take values from the set defined above. The symbol g_jk stands for the degeneracy factor (j + k)!/j!k!. In the maximum entropy method our



knowledge of the system is imposed as constraints on the probabilities, and the method is successful when a few simple constraints are overwhelmingly important in controlling the system. The best example is, of course, statistical mechanics where the single constraint of energy conservation is sufficient to account almost perfectly for the behaviour of a very complex system.

The constraints operating in the disc packing problem are mostly extremely subtle and involve large numbers of discs; herein lies the mathematical intractability. Nonetheless a few constraints are simple. The proportions of large and small discs f_1 and f_2 are given by

f_i = Σ_{j,k} p_jk^(i), (2)

and we also have

f_1 + f_2 = 1. (3)

The average number of contacts between discs of type 1 and type 2 can be expressed in two ways, either by counting the small discs around a large one, or by counting the large discs around a small one. If we equate these two expressions we get the constraint

Σ_{j,k} k p_jk^(1) = Σ_{j,k} j p_jk^(2). (4)

A further constraint can be constructed if we consider the arrangement of discs triangulated as in Dodds' approach. A triangle formed by one large and two small discs can be regarded either as a large disc with two small neighbours or as a small disc with one large and one small neighbour. The average number of such triangles can, as for the number of contacts, be expressed in two ways, and the resulting constraint can be written:

Σ_{j,k} a_jk p_jk^(1) = Σ_{j,k} b_jk p_jk^(2). (5)

a_jk is the number of pairs of adjacent small discs amongst the (j + k)!/j!k! arrangements of j large and k small discs surrounding a large disc, and in the same way b_jk is the number of adjacent pairs of large and small discs surrounding a small disc. The calculation of the values of a_jk and b_jk can easily be done in any numerical example, but the general formula involves some complicated combinatorics. Some values are given in the Table.

It might be thought that a further constraint exists relating the numbers of triangles formed by two large discs peripheral to a small one and a large and small disc peripheral to a large one. However calculation shows that this constraint is not a distinct one and can be derived from equations (4) and (5).

A fourth constraint is suggested by the work of Rivier and Lissowski [4]. If we draw common tangents to the discs at each contact and extrapolate these to form polygons, then the topology of the arrangement of polygons in a plane is constrained by Euler's theorem: V + F = E + 1, where V is the number of vertices, E the number of contacts and F the number of discs. The generic vertex involves three tangents meeting at a point and a tangent joins two vertices, so that V = 2E/3. Since a contact is between two discs, it follows from Euler's theorem that the average number of contacts per disc is six. This is also true in Dodds' model and is discussed more fully below.

It is important to note that because of the gaps in the structure this constraint may not be exactly obeyed. It is necessary either to involve vertices where four or more tangents meet or to count small gaps as true contacts. In terms of our probabilities this constraint can be written:

Σ_{i,j,k} (j + k) p_jk^(i) = 6. (6)

Before considering the general case it is instructive to look at the case of discs all of the same size, distinguished by colour for example. The close packed structure then has co-ordination number six, and the set of permitted j and k is the set of all pairs for which j + k = 6. The pair constraint (4) and the triangle constraint (5) are automatically satisfied, and so is the Euler constraint (6). If the entropy S given by (1) is maximised subject to the constraints (2) and (3), the obviously correct result

p_jk^(i) = f_i g_jk f_1^j f_2^k (7)

is easily obtained. In the general case the constraints (2)-(6) must be included using Lagrange multipliers. The first two of these are easily eliminated and it is found that

p_jk^(1) = f_1 g_jk e^{-λk} e^{-μ(j+k)} e^{-ν a_jk} / Σ_{j,k} g_jk e^{-λk} e^{-μ(j+k)} e^{-ν a_jk},

p_jk^(2) = f_2 g_jk e^{+λj} e^{-μ(j+k)} e^{-ν b_jk} / Σ_{j,k} g_jk e^{+λj} e^{-μ(j+k)} e^{-ν b_jk}. (8)

The constants λ, μ and ν have to be chosen so that (4), (5) and (6) are satisfied.
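A small sketch of this calculation (our own code, not from the paper; the sign convention for λ follows our reconstruction of (8), and the configuration arrays must be filled in from the Table) solves the three equations (4), (5) and (6) for λ, μ and ν:

    import numpy as np
    from scipy.optimize import fsolve

    def probabilities(theta, f1, large_cfg, small_cfg):
        """Evaluate (8) for multipliers theta = (lam, mu, nu); large_cfg / small_cfg
        are arrays whose rows are (j, k, g_jk, a_jk) resp. (j, k, g_jk, b_jk)."""
        lam, mu, nu = theta
        j1, k1, g1, a1 = large_cfg.T
        j2, k2, g2, b2 = small_cfg.T
        w1 = g1 * np.exp(-lam * k1 - mu * (j1 + k1) - nu * a1)
        w2 = g2 * np.exp(+lam * j2 - mu * (j2 + k2) - nu * b2)
        return f1 * w1 / w1.sum(), (1.0 - f1) * w2 / w2.sum()

    def residuals(theta, f1, large_cfg, small_cfg):
        """Residuals of the contact (4), triangle (5) and Euler (6) constraints."""
        p1, p2 = probabilities(theta, f1, large_cfg, small_cfg)
        j1, k1, _, a1 = large_cfg.T
        j2, k2, _, b2 = small_cfg.T
        return [(k1 * p1).sum() - (j2 * p2).sum(),
                (a1 * p1).sum() - (b2 * p2).sum(),
                ((j1 + k1) * p1).sum() + ((j2 + k2) * p2).sum() - 6.0]

    # large_cfg = np.array([...])   # rows (j, k, g_jk, a_jk) from the Table
    # small_cfg = np.array([...])   # rows (j, k, g_jk, b_jk) from the Table
    # theta = fsolve(residuals, np.zeros(3), args=(0.5, large_cfg, small_cfg))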

In order to see how successful this model is we have performed a Monte Carlo simulation. There is obviously a problem in precisely defining a random packing of discs, either in a computer simulation or in a real experiment. We have chosen to compare with a computer simulation because in that case the random distribution is defined by the algorithm used to generate it, whereas experimental conditions are impossible to describe or reproduce exactly.

Our random distributions are generated by imagining that the discs are attracted towards a central point. A disc, large or small at random, is introduced at a great radial distance and a random azimuth. It is then moved radially inwards until it comes into contact with an existing disc. It then rolls round this disc, and any that it subsequently touches, until it reaches a position of stable equilibrium under the presumed central force.


Figure: The points show the proportions of the three possible types of disc contact (small-small, large-small and large-large), as a function of the fraction of large discs, determined by a Monte Carlo simulation, with error bars indicating one standard deviation. The curves are the values predicted by our simple theory using the maximum entropy method. All the results are for a diameter ratio of 2:1.

The figure shows the proportions of the three types of contact between pairs of discs together with the predictions of our model. The diameter ratio 2:1 was used in our calculations. Geometry shows that for this ratio of diameters 17 configurations are possible, 10 with a central large disc and 7 with a small one. It is necessary to adopt some rule for deciding which gaps should be ignored. We have said that if a gap is larger in the tangential direction than radially then it should be regarded as a contact. The set of 17 configurations together with their degeneracies and the number of triangles is given in the Table.

It is important to realise that the sizes of the discs enter into the calculation only in fixing the set of permitted configurations. Superficially they enter in a more direct way in Dodds' model, and it is worth exploring this point in more detail. If we define

z_i = Σ_{j,k} (j + k) p_jk^(i), (9)

then this is the same quantity denoted by z_i by Dodds. His empirical discovery that z_1 + z_2 = 6.0 can immediately be seen to be the Euler constraint (6). Dodds' fundamental equation (his equation 4) relates the average angle subtended at the centre of a disc by a peripheral disc computed in two different ways. This equation has the same physical content as our equation (5), which connects the underlying probabilities rather than averages.

The results in the figure show that there is reasonable agreement between the predictions of the maximum entropy formulae and the Monte Carlo simulations. The error bars on the simulation values show plus and minus one standard deviation and are derived from five independent calculations. The Monte Carlo calculations can be used to test whether the Euler constraint (6) is obeyed and significant departures are indeed detectable. When the proportions of the two types of disc are roughly equal the average number of neighbours is very close to six, but when either the large or the small discs predominate the average


number of neighbours falls to about 5.7. Because of this we have chosen to show in the figure maximum entropy calculations with the Euler constraint (6) removed. Including it makes very little difference to the results; the obvious systematic departures from the Monte Carlo results are slightly changed but the overall fit is not improved. Similarly including or omitting configurations from the set considered makes changes in the theoretical predictions that are rather less than the systematic discrepancies. We can conclude that any further improvement in the theory will need to take account of the gaps in the structure in a much more detailed way. As an illustration of the additional power of our approach over Dodds' we might note that we can calculate, for example, the probability that a large disc is completely surrounded by small ones and how this varies with the concentration of small discs. Because this is a microscopic property in contrast to the number of contacts it is much more sensitive to the inclusion of the Euler constraint.

We have shown that Dodds' approach to the disc packing problem can be reformulated and extended by using the maximum entropy method. Because the assumptions involved are clear a further extension can be suggested in which larger clusters of discs are defined by their possible geometries and the proportions of these clusters determined by the maximum entropy algorithm.

ACKNOWLEDGMENTS. My thanks are due to D. Melville for introducing me to this problem and to both him and J.T. Chalker for several discussions on the subject.

References

[1] Dodds, J.A., Simplest statistical geometric model of the simplest version of the multicomponent random packing problem. Nature 256, 187-189, 1975.

[2] Bideau, D., Gervois, A., Oger, L., Troadec, J.P., Geometrical Properties of Disordered Packings of Hard Discs. J. Physique 47, 1697-1707, 1986.

[3] Jaynes, E.T., Prior Probabilities. IEEE Transactions on Systems Science and Cybernetics SSC-4, 227-240, 1968.

[4] Rivier, N. and Lissowski, A., On the correlation between sizes and shapes of cells in epithelial mosaics. J. Phys. A: Math. Gen. 15, L143-L148, 1982.


TABLE

Central Large Disc

 j  k   g_jk   a_jk
 0  9     1      9
 1  8     9     63
 2  7    36    189
 2  6    28    120
 3  5    56    160
 4  4    70    120
 5  2    21      7
 5  1     6      0
 6  0     1      0

Central Small Disc

 j  k   g_jk   b_jk
 0  6     1      0
 1  5     6     12
 2  4    15     48
 2  3    10     30
 3  2    10     30
 3  1     4      8
 4  0     1      0

The possible neighbours of large and small discs permitted by geometry are j large discs and k small ones. The degeneracy of each configuration is g_jk = (j + k)!/(j! k!). In each of these degenerate configurations a is the number of pairs of adjacent small discs around a large one and b is the number of adjacent pairs of a large and small disc occurring as the nearest neighbours of a small disc.
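The degeneracy column of the Table can be checked directly, since g_jk = (j+k)!/(j!k!) is just a binomial coefficient. A minimal sketch (our own check, not part of the original calculation):

```python
from math import comb

# (j, k) pairs read from the Table: j large and k small neighbours.
central_large = [(0, 9), (1, 8), (2, 7), (2, 6), (3, 5), (4, 4), (5, 2), (5, 1), (6, 0)]
central_small = [(0, 6), (1, 5), (2, 4), (2, 3), (3, 2), (3, 1), (4, 0)]

for label, configs in [("large", central_large), ("small", central_small)]:
    for j, k in configs:
        # comb(j + k, j) reproduces the g_jk column of the Table.
        print(f"central {label}: j={j} k={k} g_jk={comb(j + k, j)}")
```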


BAYESIAN COMPARISON OF MODELS FOR IMAGES

Alex R. Barnett and David J.C. MacKay
Cavendish Laboratory, Cambridge, CB3 0HE, United Kingdom.

ABSTRACT. Probabilistic models for images are analysed quantitatively using Bayesian hypothesis comparison on a set of image data sets. One motivation for this study is to produce models which can be used as better priors in image reconstruction problems.

The types of model vary from the simplest, where spatial correlations in the image are irrelevant, to more complicated ones based on a radial power law for the standard deviations of the coefficients produced by Fourier or Wavelet Transforms. In our experiments the Fourier model is the most successful, as its evidence is conclusively the highest. This ties in with the statistical scaling self-similarity (fractal property) of many images. We discuss the invariances of the models, and make suggestions for further investigations.

1 Introduction

This paper's aim is to devise and search for 'good' statistical descriptions of images, which are greyscale pictures digitized from a camera, stored as an array of integers (representing the intensities of light falling on the camera's sensitive array). All of the images analysed in this paper have a fixed number of greyscale levels, λ = 256, corresponding to quantization to eight bits (one byte) of information per pixel. We assume that the image data is a linear function of physical intensity, and free of noise and blurring.

The statistical properties considered in the project are so general that what the images depict is largely unimportant, and we chose easily recognisable pictures (figure 1) such as a face, natural objects, astronomical images, and the kind of images our eyes are subjected to frequently.

The development of the models is driven by intuitive ideas and by observations of real images, and is regulated by certain criteria for invariance, that is, operations on the image which should not affect its likelihood.

Bayesian analysis allows quantitative manipulation of data and prior beliefs to give a numerical result, the evidence, which reflects the probability of a hypothesis, and therefore how 'good' a model is.

Each model comprises a hypothesis H, with some free parameters denoted by the vector w = (α, β, ...), which assigns a probability density p(f|w, H), the likelihood, over the image space of f, normalized so as to integrate to unity. The density's units are those of [intensity]^(−n), since each pixel component f_i has units of [intensity].

In most models the free parameters are initially unknown (i.e. they are assigned very wide prior distributions), and we search for their best fit value w_BF, which has the largest likelihood given the image. Bayes' Theorem gives

   p(w|f, H) = p(f|w, H) p(w|H) / p(f|H).   (1)



Figure 1: The images analysed
The eight panels show 'susie', 'mouse', 'redspot', 'trees', 'sky', 'parrot', 'm100cen' and 'ngc1068'. Each image data set is an intensity array of n_x by n_y pixels (n_x and n_y being either 128 or 256 in these images).

The denominator is independent of w, the numerator is the likelihood, i.e. the probability of the observed image f as a function of w, and the final term p(w|H) is the prior distribution on the free parameters. This prior has to be assigned (based on our beliefs about images), even if seemingly arbitrarily, but has negligible effect on the w_BF found because the likelihood dominates. We know that p(w|f, H) is normalized to 1, giving an expression for the denominator of (1), which we now call the evidence for H:

   p(f|H) = ∫ p(f|w, H) p(w|H) dw.   (2)

This evidence is often dominated by the value of p(f|w_BF, H) (the best fit likelihood). The evidence is equal to the best fit likelihood multiplied by a smaller factor known as the "Occam Factor". Applying Bayes' Theorem again gives us the probability of H (to within a constant factor) as p(H|f) ∝ p(f|H) p(H). The prior p(H) can incorporate our beliefs about the validity of each hypothesis before the data arrived, but we chose all p(H_i) as equal (in fact, usually any such prior would have to be very extreme to outweigh the evidence).

So, we now have the relative plausibilities of competing hypotheses, and in this paper we evaluate p(f|H_i) for given images and a variety of H_i.

2 Description of Models

We now detail the models (roughly in order of increasing complexity), the first four of which assume independence of the pixels f_i, meaning

   p(f|H) = ∏_i p(f_i|H).   (3)


Figure 2: Some image intensity histograms, and a free-form distribution y(f) using B "bins"

RANDOM BITS MODEL (RB)

The simplest distribution we can assign is uniform across all the allowed values of all n pixels, so p(f|H) = constant within the n-dimensional hypercube from 0 to f_max on each axis, and zero elsewhere. Our integer data f_i is assumed to have been truncated from a real continuous variable in the range 0 to f_max (f_max being λ units of intensity), so although our image vector f is always quantized onto an integer lattice, we will deal with continuous density functions. This model is called "random bits" because it corresponds to a prior of exactly 1/2 for the probability of each binary bit being set in the stored image. Written in log form, the evidence is

   log p(f|H_RB) = Σ_i log(1/λ) = −n log λ.   (4)

FREE-FORM DISTRIBUTION MODEL (FF)

Figure 2 shows the frequency of occurrence of each intensity level in some images. It is clear that the distributions are far from flat, and are inconsistent from image to image, depending on properties of the camera and the digitization process. Therefore a model with a flexible, parametrized probability distribution function y(f) over intensity f would be able to fit real images better than one with a uniform distribution. The figure also shows a simple y(f) with a finite number (B) of variables, namely {y_b} = {y_1, y_2, ..., y_B}, which give the probabilities of f falling into each "bin" of width 1/K. This probability is applied independently to each f_i of the image, so that

   p(f|{y_b}, H) = K^n ∏_b y_b^(N_b),   (5)

where N_b is the number of pixels with intensity falling into bin b. This is substituted in (2) using {y_b} as the parameter vector w, and with a flat prior over all the normalized {y_b} (but zero if not normalized, as used by Gregory and Loredo (1992)).

Approximating Gamma functions using logs eventually gives,

   log p(f|H_FF) = n log K + Σ_b (N_b + 1) log((N_b + 1)/(n + B)) + [ −(1/2) Σ_b log(N_b + 1) − (1/2) log((n + B)/(2π)^(B−1)) ]


The first two terms are the best fit likelihood, and the last term (in square brackets) is the log Occam factor. In order to disregard the statistical and digitization fluctuations in the histogram, but retain some flexibility, we have chosen B = 16 for this analysis (B must be between 1 and λ). A Bayesian choice of B might also be made.
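As an illustration, a short sketch of the RB and FF log evidences is given below. It is our own code, with our own function names; for FF it evaluates the exact flat-Dirichlet integral via log-gamma functions rather than the Stirling approximation quoted above, and it assumes λ = 256 and B = 16 as in the text.

```python
import numpy as np
from scipy.special import gammaln

def log_evidence_rb(f, lam=256):
    # Equation (4): a uniform density 1/lambda per pixel.
    return -f.size * np.log(lam)

def log_evidence_ff(f, lam=256, B=16):
    # Histogram the n pixels into B equal bins of width lambda/B (i.e. 1/K).
    n = f.size
    N_b = np.bincount((f.astype(int) * B) // lam, minlength=B)
    K = B / lam
    # Exact integral of K^n * prod_b y_b^N_b against a flat Dirichlet prior on {y_b}.
    return n * np.log(K) + gammaln(B) + gammaln(N_b + 1).sum() - gammaln(n + B)

f = np.random.default_rng(1).integers(0, 256, size=(128, 128))
print(log_evidence_rb(f), log_evidence_ff(f))
```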

GAUSSIAN DISTRIBUTION MODEL (GD)

This model applies the Gaussian probability distribution N(μ, σ²) to each f_i, and is the first of a general class (that we'll call G) of Gaussian models which use the likelihood

   p(f|w, H) = (1/Z) exp(−(1/2)(f − a)ᵀ C (f − a)),   (6)

where w controls some properties of the square (order n) matrix C and the mean vector a, and Z is a normalizing constant. This gives for G models the evidence

   log p(f|w, H) = −(1/2)(f − a)ᵀ C (f − a) + (1/2)[log(det C) − n log(2π)].   (7)

In this model, GD, the values μ and σ are constant for all pixels (this is not only simplest, but desirable for invariance under spatial transformations), giving C = I/σ² and a = (μ, μ, ..., μ). The parameter w is (μ, σ), and solving ∇_w log p(f|w, H) = 0 gives the best fit values:

   μ_BF = (1/n) Σ_i f_i,   σ²_BF = (1/n) Σ_i (f_i − μ_BF)².   (8)

The approximation (usually a very good one, which we will use in all our G models - e.g. see Figure 4) that the peak about these best fit values is Gaussian makes equation 2 easy to evaluate. We assume

   p(f|w, H) = p_BF exp(−(1/2)(w − w_BF)ᵀ A (w − w_BF)),   (9)

with p_BF = p(f|w_BF, H), and A its Hessian matrix at w_BF. Substituting (9) into (2) and assuming a constant prior p(w|H)_BF near the peak gives the general G model evidence

   log p(f|H) = log p_BF − (1/2) log(det(A/2π)) + log p(w|H)_BF.   (10)

We assigned a Gaussian prior on the logarithm¹ of each component, log w_i, of standard deviation σ_{log w_i} about the best fit value log w_i^BF. For this model, normalizing p(w|H) gave the prior at best fit

   p(w|H)_BF = (2π μ_BF σ_{log μ} σ_BF σ_{log σ})⁻¹.   (11)

Substituting the best fit values, the Hessian A and this prior into (10) enabled us to calculate this model's evidence p(f|H_GD). Based on the largest and smallest conceivable μ and σ (given integer f_i from 0 to f_max), we set both the standard deviations σ_{log μ} and σ_{log σ} to 4 (a value we used in all the G models). However, in our results the prior, and indeed the whole Occam Factor, is almost completely negligible compared to the relative likelihoods of different hypotheses, so we will not devote so much rigour to assigning priors in the coming models. (One would need to constrain O(n) free parameters before this became significant.)

Although GD fits most images less well than the FF model, the above G class includes new, more powerful models (FP and WP), where pixels are no longer independent.

¹This is appropriate since we initially have an uncertainty on w of orders of magnitude.
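A minimal sketch of the GD evidence built from equations (8)-(11) follows. This is our own illustration, with the Laplace approximation taken in (μ, σ) coordinates and the prior factor of equation (11); the function name and test image are ours.

```python
import numpy as np

def log_evidence_gd(f, sigma_log_mu=4.0, sigma_log_sigma=4.0):
    f = f.ravel().astype(float)
    n = f.size
    mu, sigma = f.mean(), f.std()                      # best fit values, equation (8)
    # Best fit log likelihood for N(mu, sigma^2) applied to every pixel.
    log_p_bf = -0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 * n
    # Hessian of -log p(f|mu,sigma) at the best fit: diag(n/sigma^2, 2n/sigma^2).
    A = np.diag([n / sigma**2, 2 * n / sigma**2])
    occam = -0.5 * np.log(np.linalg.det(A / (2 * np.pi)))                           # equation (10)
    log_prior_bf = -np.log(2 * np.pi * mu * sigma_log_mu * sigma * sigma_log_sigma)  # equation (11)
    return log_p_bf + occam + log_prior_bf

f = np.random.default_rng(2).normal(128, 20, size=(128, 128)).clip(0, 255)
print(log_evidence_gd(f))
```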


Figure 3: The 2D Fourier power spectrum |F_k|² of 'susie'
On the left it is displayed as a 2D histogram-equalized image (with k = 0 central, and k_x and k_y in the range [−π, +π]), and on the right as a scatter plot of log |F(k)|² against log |k|.

GAUSSIAN DISTRIBUTION OF log f_i MODEL (LGD)

In many classes of images, notably astronomical, there are a very large number of low-intensity pixels and fewer at higher intensities, and a Gaussian distribution on f_i is clearly inappropriate. However, if we define a new image g_i = log(f_i + δ), where δ is some constant offset (to keep g_i finite in the case of f_i = 0; we chose δ as half an intensity unit), then N(μ_g, σ_g²) in g_i-space corresponds to a suitably biased smooth distribution in f_i-space, which also has the desirable property of enforcing positivity of the intensities.

To transform probability densities we use p(f) = det(J) p(g), the determinant of the Jacobian J being det(J) = ∏_i (f_i + δ)⁻¹ = e^(−n μ_g^BF), so we can use all the previous GD theory on g_i to assign log p(g|H_GD) and then add on log(det J), to get the log evidence log p(f|H_LGD).
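Given a GD evidence routine such as the hypothetical log_evidence_gd sketched above, the LGD evidence is then just a change of variables:

```python
import numpy as np

def log_evidence_lgd(f, delta=0.5):
    # Work in g = log(f + delta) and add the log Jacobian, log det(J) = -sum_i log(f_i + delta).
    # Assumes log_evidence_gd from the GD sketch above; note that the equation (11) style prior
    # there needs mean(g) > 0, so this sketch is only meaningful for images where that holds.
    g = np.log(f.ravel().astype(float) + delta)
    return log_evidence_gd(g) - np.log(f.ravel().astype(float) + delta).sum()
```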

FOURIER SPECTRUM RADIAL POWER LAW MODEL (FP)

So far none of the models have cared about spatial correlations in an image, which are after all usually what makes them recognisable. However, the 2D Discrete Fourier Transform (from fi to the complex array F(k), k = (kx, ky) ) allows us to construct a hypothesis with correlation (i.e. a non-diagonal C), within the general scheme of the G class.

Visual examination of the 2D power spectrum of a typical image shows three main features (Figure 3):

1. seemingly uncorrelated random speckle on the scale of one pixel,

2. an approximately radially symmetric upward trend towards the point k = 0, and

3. brighter lines on or near the vertical and horizontal axes (these are artifacts caused by the non-periodicity of the image, and were found to have little effect on the evidence when removed).

The observation of radial symmetry motivated a log-log plot of the spectrum as a function of radius in k-space, which shows a clear linear downward trend of mean log power with log radius. This, together with the uncorrelated nature of the speckle, led to a hypothesis


Figure 4: Peak of likelihood log p(F|m, c, H) about best fit values for image 'susie'


Figure 5: An artificial image, its 2D WT image, and a slice through a wavelet

Notice the structure of the WT: it is divided into rectangular regions of every possible binary (2^i by 2^j) size, each of which contains a map of the original image.

that the Fourier coefficients have real and imaginary parts which are both independently distributed like N(0, σ(k)²), with √2 σ(k) = c k^(−m), where c and m are the power law constants and k = |k|. We assigned σ(0) = f_0 f_max to avoid an infinity. This Gaussian distribution for the coefficients F was found to be very well justified when we histogrammed Re[F] and Im[F] for real images. Expressed as a density in F-space, equation (7) becomes

   log p(F|m, c, H) = −Σ_i |F_i|²/(2σ_i²) − Σ_i log σ_i − (n/2) log(2π),   (12)

and the orthogonality of the FT (det(J) = 1) means this is equal to log p(f|m, c, H), and from this the evidence p(f|H_FP) was found in a similar way to the GD model.

There was no simple analytic solution for m_BF and c_BF, so a Newton-Raphson iterative approach was used to find w_BF and the Hessian A in the 2D space w = (m, c). Figure 4 confirms that log p(F|m, c, H) has a Gaussian peak about w_BF.
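A minimal sketch of equation (12) for a given (m, c) is shown below. It is our own illustration: it uses the unitary 2D FFT, takes σ(k) = c k^(−m) directly (ignoring the √2 convention above), replaces the divergent k = 0 value by a large constant, and ignores the Hermitian-symmetry bookkeeping that an exact treatment of a real-valued image would require.

```python
import numpy as np

def fp_log_likelihood(f, m, c, sigma0=1e6):
    f = np.asarray(f, dtype=float)
    n = f.size
    F = np.fft.fft2(f, norm="ortho")                      # unitary transform, det(J) = 1
    kx = 2 * np.pi * np.fft.fftfreq(f.shape[0])
    ky = 2 * np.pi * np.fft.fftfreq(f.shape[1])
    k = np.sqrt(kx[:, None] ** 2 + ky[None, :] ** 2)
    k[0, 0] = 1.0                                         # placeholder; sigma(0) set separately below
    sigma = c * k ** (-m)                                 # power law sigma(k) = c k^(-m)
    sigma[0, 0] = sigma0                                  # large finite value instead of the k = 0 infinity
    # Equation (12):
    return (-(np.abs(F) ** 2 / (2 * sigma ** 2)).sum()
            - np.log(sigma).sum()
            - 0.5 * n * np.log(2 * np.pi))

f = np.random.default_rng(3).integers(0, 256, size=(128, 128))
print(fp_log_likelihood(f, m=1.5, c=10.0))
```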

WAVELET TRANSFORM POWER LAW MODEL (WP)

The Wavelet Transform (WT) is linear, orthogonal, and operates on a real vector f (of n components, an integer power of two), converting it to another real vector F of n components. For a good, practical introduction see (Press et al. 1993: section 13.10), or (Strang 1989) and (Heil and Walnut 1989) for more background.


Figure 6: Typical sample images from models' f-space distributions

On the left is a sample from any of the RB, FF, GD or LGD models, which is spatially uncorrelated. On the right is a sample from the FP model, with m = 1.5 and c = 10, showing structure at all length scales (in fact it is a fractal).

Wavelets have the property of being localized in both real and frequency space, so can efficiently represent both discontinuities and periodic features (see figure 5). They have many applications in lossy image-compression techniques, because they often reduce images to a few large coefficients and many small ones². In this final model (also in the G class) we used the 4-coefficient Daubechies WT to replace the FT from the previous model, assigning Gaussian distributions to the WT coefficients F_i but with σ_i uniform within each of the 'binary regions' evident in figure 5. For each region an approximate k_i was used, based on the minimum wavelet dimension (in x or y), and the power law σ_i = c k_i^(−m) was used as before. Skipping over details, this allowed equation (12) to be used to compute the evidence p(f|H_WP) in an identical way to the FP model.

3 Results

Table 1 presents some of the results for different images, with e_i short for the log evidence of model i. For easy comparison, the uncorrelated models FF, GD and LGD are shown as ratios to the standard RB model (so that a number greater than 1 implies more evidence than for RB). Similarly, the ratios R_F and R_W are defined as e_GD/e_FP and e_GD/e_WP respectively, since the FP and WP models are closest in form to GD (of the uncorrelated models). So R gives us a guide to how much improvement³ has been obtained by introducing correlation. Note that the evidences are extremely small numbers, and that small differences in R values correspond to huge factors of relative evidence, of the order of e^10000 in our case, so that one hypothesis is overwhelmingly the most likely for a given image. The table also gives the best fit power law gradients m_F for FP and m_W for WP.

Three random computer-generated images were first analysed: 'A', with independent pixels with a flat distribution from 0 to f_max (= λ = 256); 'B', likewise but with a Gaussian distribution of σ = 20; and 'C', with a correlated power law distribution of m = 1.5 and

²Later we show that this is exactly the criterion required in a good model.
³Note that, because of the units of f_i chosen, e_i is equal to the optimum message length in nats needed for lossless communication of image f (to a precision of one intensity unit) using an encoding based on the hypothesis H_i. Thus R is the information compression ratio.


(uncorrelated models: e_RB/e_FF, e_RB/e_GD, e_RB/e_LGD; correlated models: R_F, m_F, R_W, m_W)

image      n_x    e_RB      e_RB/e_FF  e_RB/e_GD  e_RB/e_LGD   R_F     m_F      R_W     m_W
A          256   -363409    1.003      0.969      0.931        1.000   -0.001   0.999   -0.001
B          256   -363409    1.248      1.255      1.249        1.001   -0.002   1.000   -0.004
C          256   -363409    1.202      1.204      1.189        1.796    1.488   1.635    1.453
susie      128    -90852    1.078      1.040      1.025        1.431    1.577   1.319    1.316
mouse      256   -363409    1.113      0.996      1.046        2.082    1.572   1.877    1.617
redspot    256   -363409    1.084      1.079      1.054        1.320    1.088   1.287    1.162
trees      128    -90852    1.044      0.985      0.949        1.026    0.350   1.027    0.420
sky        128    -90852    1.184      1.095      1.107        1.256    0.717   1.275    0.813
parrot     128    -90852    1.056      1.044      1.012        1.260    1.544   1.189    1.323
m100cen    128    -90852    1.163      1.113      1.094        1.261    0.858   1.240    0.900
ngc1068    128    -90852    1.158      1.033      1.137        1.549    1.215   1.490    1.269

Table 1: Log evidence results for simulated and real images

c = 10. So A, B and C are typical samples from the RB, GD and FP model distributions respectively (see Figure 6). These test images behaved as expected: for A and B we find R_F, R_W ≈ 1 (since they are uncorrelated), whereas for C, R_F ≈ 1.8 so the FP model shows a vastly higher evidence, and a best fit m close to the predicted value. For B, evidence gains in the uncorrelated models over RB are due to a better fitting of the narrower intensity range.

Analysis of the eight real images gave the general results:

• Correlated models are vastly more successful than uncorrelated, with FP consistently ahead of WP.

• R_F and R_W tend to be larger the higher the best fit gradient m is.

• m_F and m_W loosely match for a given image.

• Of the uncorrelated models, FF invariably has the most evidence (although not always by a large margin), and RB usually the least.

• LGD has no convincing advantage over GD for the last two (astronomical) images.

4 Discussion

To understand the increase in R with m, we consider a general (G class) model where distributions N(μ_i, σ_i²) are applied to the coefficients F_i produced by some orthogonal linear transform on the image f_i (FP and WP are special cases of this). Making the crude assumption that the F_i are distributed in this way implies that maximizing p_BF (and therefore the evidence) is equivalent to minimizing ∏_i σ_i (under the orthogonality constraint Σ_i σ_i² = constant). This can best be achieved by having only a few large σ_i and many small ones, i.e. choosing a transform which concentrates the image 'power' Σ_i f_i² into as few coefficients as possible, and a higher m does this better than a low one in the FP or WP model.


It is interesting to realise how the power law found in many of our images relates to a fractal property. Based on Mandelbrot's (1982) statement (p. 254) in the 1D case, we derived that for an image sampled in N dimensions which obeys a statistical scaling law f(x) ~ h^(−α) f(hx), one would expect the power spectrum ⟨|F(k)|²⟩ ∝ k^(−2m) (in the case of directional isotropy), with the relation α = m − N/2. For this case, N = 2 and m is that of the FP model, m_F. This power law spectrum is surprisingly common in much of nature, for instance the rough fracture surfaces of metals (Barnett 1993), which initially led us to investigate the FP model.

Also worthy of discussion are the invariances that were considered in regulating the choice of models for this investigation. If a model had a likelihood function invariant under translation, rotation and scaling of the image, then it could not induce unnatural preferences for particular positions, angular directions or length scales when used as a prior in image reconstruction (or other such inverse problems). Apart from the axes-dependent behaviour of the wavelets in the WP model, all the models in this paper share this invariance. However, models where correlation is introduced via a Gaussian ICF (intercorrelation function), for instance, are not scale-invariant and will be prone to favour length scales similar to the ICF radius. We believe that our FP model can be expressed in terms of an ICF, which will however have an asymptotic, power law form.

There are a huge number of directions for further investigation into models for images, but among the more fruitful we suggest:

1. Develop new models that incorporate positivity, since we are dealing with physical intensities which cannot be negative.

2. Search for new formulations of what 'correlation' is, and what makes images recognisable. Borrow ideas from good image compression techniques, as these rely on identifying correlations.

3. Investigate Gabor functions (Gabor 1946), which are forms of wavelets, and which, as Daugman (1985) discusses, seem to match the receptive fields of neurons in the primary visual cortex. We suggest that, since evolution has optimized so many biological design problems, the workings of our own perceptual system should be studied and mimicked to find good image processing and modelling techniques. It is, after all, our own perception that tells our consciousness that we are looking at a recognisable image.

5 Conclusion

A framework of simple models for images has been built up, and their Bayesian evidence has been evaluated for a set of image data. The results show a conclusively massive increase in evidence for correlated models (FP and WP) over uncorrelated (RB, FF, GD and LGD), with the FP model almost always the most successful, especially at higher m_F. This reflects a power law dependence of Fourier components apparent in images and implies a statistical scaling self-similarity, that is, a general fractal property.

ACKNOWLEDGEMENTS

AHB thanks Ross Barnett for general advice and for use of computing facilities. DJCM gratefully acknowledges the support of the Royal Society.


References

BARNETT, A. H. (1993) Statistical modelling of rough crack surfaces in metals. Internal Report for Non-Destructive Testing Applications Centre, Technology Division, Nuclear Electric plc.

DAUGMAN, J. G. (1985) Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A 2 (7): 1160-1169.

GABOR, D. (1946) Theory of communication. J. Inst. Electr. Eng. 93: 429-457.

GREGORY, P. C., and LOREDO, T. J. (1992) A new method for the detection of a periodic signal of unknown shape and period. In Maximum Entropy and Bayesian Methods, ed. by G. Erickson and C. Smith. Kluwer. Also in The Astrophysical Journal, Oct 10, 1992.

HEIL, C. E., and WALNUT, D. F. (1989) Continuous and discrete wavelet transforms. SIAM Review 31 (4): 628-666.

MANDELBROT, B. (1982) The Fractal Geometry of Nature. W.H. Freeman and Co.

PRESS, W. H., TEUKOLSKY, S. A., VETTERLING, W. T., and FLANNERY, B. P. (1993) Numerical Recipes in C, Second Edition. Cambridge.

STRANG, G. (1989) Wavelets and dilation equations: A brief introduction. SIAM Review 31 (4): 614-627.


INTERPOLATION MODELS WITH MULTIPLE HYPERPARAMETERS

Ryo Takeuchi
Waseda University, Tokyo, Japan. takeuchi@matsumoto.elec.waseda.ac.jp

David J C MacKay
Cavendish Laboratory, Cambridge, U.K. mackay@mrao.cam.ac.uk

ABSTRACT. A traditional interpolation model is characterized by the choice of regularizer applied to the interpolant, and the choice of noise model. Typically, the regularizer has a single regularization constant α, and the noise model has a single parameter β. The ratio α/β alone is responsible for determining globally all these attributes of the interpolant: its 'complexity', 'flexibility', 'smoothness', 'characteristic scale length', and 'characteristic amplitude'. We suggest that interpolation models should be able to capture more than just one flavour of simplicity and complexity. We describe Bayesian models in which the interpolant has a smoothness that varies spatially. We emphasize the importance, in practical implementation, of the concept of 'conditional convexity' when designing models with many hyperparameters.

1 Introduction

A traditional linear interpolation model 'H_1' is characterized by the choice of the regularizer R, or prior probability distribution, that is applied to the interpolant; and the choice of noise model N. The choice of basis functions A used to represent the interpolant may also be important if only a small number of basis functions are used. Typically the regularizer is a quadratic functional of the interpolant and has a single associated regularization constant α, and the noise model is also quadratic and has a single parameter β. For example, the splines prior for the function y(x) (Kimeldorf and Wahba 1970) is:¹

   log P(y(x)|α, H_1) = −(α/2) ∫ dx [y^(p)(x)]² + const.,   (1)

where y^(p) denotes the pth derivative of y. The probability of the data measurements D = {t^(m)}, m = 1...N, assuming independent Gaussian noise is:

   log P(D|y(x), β, H_1) = −(β/2) Σ_{m=1}^{N} (y(x^(m)) − t^(m))² + const.   (2)

When we use these distributions with p = 2 and find the most probable y(x) we obtain the cubic splines interpolant. For any quadratic regularizer and quadratic log likelihood, the most probable interpolant depends linearly on the data values. This is the property by which we define a 'linear' interpolation model.

¹Strictly this prior is improper since addition of an arbitrary polynomial of degree p−1 to y(x) is not constrained. It can be made proper by adding terms corresponding to boundary conditions to (1). In the present implementations of the models, we enforce the boundary conditions y(0) = 0 and, where appropriate, y'(0) = 0.
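A minimal discretized sketch of this traditional single-alpha model is given below. It is our own illustration, with y represented on a regular grid, the p-th derivative replaced by a p-th difference, and each datum crudely attached to a nearby grid point; the data and parameter values are invented for the demonstration.

```python
import numpy as np

def traditional_interpolant(x_data, t_data, alpha, beta, k=100, p=2):
    """Most probable y on a grid of k points in [0, 1] under the single-alpha spline prior."""
    grid = np.linspace(0.0, 1.0, k)
    D = np.diff(np.eye(k), n=p, axis=0)              # p-th difference operator (discrete y^(p))
    idx = np.searchsorted(grid, x_data)              # grid point associated with each datum (crude)
    M = np.zeros((len(x_data), k))
    M[np.arange(len(x_data)), idx] = 1.0
    A = alpha * D.T @ D + beta * M.T @ M             # posterior precision (Hessian) over y
    y_mp = np.linalg.solve(A, beta * M.T @ t_data)   # most probable (posterior mean) interpolant
    return grid, y_mp

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 30))
t = np.sin(8 * x) + rng.normal(0, 0.1, 30)
grid, y = traditional_interpolant(x, t, alpha=1e2, beta=1.0 / 0.1**2)
```

The single ratio alpha/beta fixes the smoothness of y_mp everywhere at once, which is exactly the limitation discussed next.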


Figure 1: An inferred spike signal from a zebra finch neuron Courtesy of M. Lewicki and A. Doupe, California Institute of Technology.

Such models may be optimized and compared using Bayesian methods as reviewed in (MacKay 1992). In such models the ratio α/β alone determines globally all the following attributes of the interpolant: its complexity, flexibility, smoothness, characteristic scale length, and characteristic amplitude. Now, whilst some of these terms may be synonyms, surely others describe distinct properties. Should not our models be able to capture more than just one flavour of simplicity and complexity? And should not the interpolant's smoothness, for example, be able to vary spatially?

EXAMPLE: NEURAL SPIKE MODELLING

An example of a function from a real system is shown in figure 1; this is the action potential of a neuron deduced from recordings of 40 distinct events (Lewicki 1994). The graph was created by fitting a simple spline model with p = 1 to the data. This function has one 'spiky' region with large characteristic amplitude and short spatial scale. Elsewhere the true function is smooth. However the fitted function, controlled by only one regularization constant α, overfits the noise on the right, having a rough appearance where it should plausibly be smooth. The value of α appropriate for fitting the spiky region is too small for the rest of the curve. It would be useful here to have a model capable of capturing the concept of local smoothness, because such a model, having a prior better matched to the real world, would require less data to yield information of the same quality. Furthermore, when different hypotheses are compared, broad priors introduce a bias toward simpler hypotheses. For example, if we ask whether one or two distinct spike functions are present in a data set, the traditional model's prior with small α will bias the conclusion in favour of the single spike function. Only with well-matched priors can the results of hypothesis comparison be trusted.

In this paper we demonstrate new interpolation models with multiple hyperparameters that capture a spatially varying smoothness in a computationally tractable way.

The interpolation models we propose might be viewed as Bayesian versions of the 'variable bandwidth' kernel regression technique (Muller and Stadtmuller 1987). The aim of our new model is also similar to the goal of inferring the locations of discontinuities in a function, studied by Blake and Zisserman (1987). Traditional interpolation models have difficulty with discontinuities: if the value of α/β is set high, then edges are blurred out in


the model; if α/β is lowered, the edge is captured, but ringing appears near the edge, and noise is overfitted everywhere. Blake and Zisserman introduce additional hyperparameters defining the locations of edges. The models they use are computationally non-convex, so that finding good representatives of the posterior distribution is challenging. They use 'graduated non-convexity' techniques to find good solutions. By contrast we attempt to create new hierarchical models that are, for practical purposes, convex.

2 Tractable hierarchical modelling: Convexity

Bayesian statistical inference is often implemented either by Gaussian approximations about modes of distributions, or by Markov Chain Monte Carlo methods (Smith 1991). Both methods clearly have a better chance of success if the posterior probability distribution over the model parameters and hyperparameters is not dominated by multiple distinct optima. If we know that most of the probability mass is in just one 'hump', then we know that we need not engage in a time-consuming search for more probable optima, and we might hope that some approximating distribution (e.g., involving the mode of the distribution) might be able to capture the key properties of that hump. Furthermore, convex conditional distributions may be easier to sample from with, say, Gibbs sampling methods (Gilks and Wild 1992). It would be useful if all the conditional and marginal probability distributions of our models were log convex:

Definition 1 A probability distribution is log convex if there is a representation x of the variables such that the matrix M defined by

   M_ij = −(∂²/∂x_i ∂x_j) log P(x)   (3)

is everywhere positive definite.

It is hard, however, to make interesting hierarchical models such that all conditional and marginal distributions are log convex. We introduce a weaker criterion:

Definition 2 A model is conditionally convex if its variables can be divided into groups such that, for every group, their distribution conditioned on any values for the other variables is log convex.

An example of a conditionally convex model is the traditional interpolation model with three groups of variables: D (data), w (parameters), and α (one hyperparameter). The probability distribution P(D|w, α) = P(D|w) is log convex over D (it is Gaussian). The distribution P(w|D, α) is log convex over w (it is Gaussian). And the distribution P(α|w, D) = P(α|w) is log convex over α (it is a Gamma distribution).

That a model is conditionally convex emphatically does not guarantee that all marginal distributions of variables are unimodal. For example the traditional model's posterior marginals P(w|D) and P(α|D) are not necessarily unimodal; but good unimodal approximations to them can often be made (MacKay 1994). So we conjecture that conditional convexity is a desirable property for a tractable model.

We now generalize the spline model of equation (1) to a model with multiple hyperparameters that is conditionally convex, and demonstrate it on the neural spike data.


3 A new interpolation model

We replace the regularizer of equation (1) by:

   log P(y(x)|α(x), H_2) = −(1/2) ∫ dx α(x) [y^(p)(x)]² + const,   (4)

where α(x) is written in terms of hyperparameters u = {u_h} thus:

   α(x) = exp( Σ_{h=1}^{H} u_h ψ_h(x) ).   (5)

The exponentiated quantity has the form of a linear interpolant using basis functions ψ_h(x). In the special case H = 1, ψ_1(x) = const., we obtain the traditional single alpha model. This representation is chosen because (1) it embodies our prior belief that α(x) should be a smooth function of x, and (2) the model is conditionally convex (a partial proof is given in (MacKay and Takeuchi 1994)).
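Continuing the discretized sketch given earlier (again our own hypothetical code), the spatially varying regularizer of equations (4) and (5) simply replaces the single alpha by per-cell values α_c = exp(Σ_h u_h ψ_h(x_c)); the Gaussian basis functions and the example setting of u below are illustrative choices, not the ones used in the paper.

```python
import numpy as np

def new_model_interpolant(x_data, t_data, u, beta, k=100, p=2):
    H = len(u)
    grid = np.linspace(0.0, 1.0, k)
    D = np.diff(np.eye(k), n=p, axis=0)                 # rows of D index the cells c
    cell_x = grid[:k - p]                               # a representative x for each cell
    centres = np.linspace(0.0, 1.0, H)                  # hump-shaped basis functions psi_h(x)
    psi = np.exp(-0.5 * ((cell_x[:, None] - centres[None, :]) * H) ** 2)
    alpha = np.exp(psi @ u)                             # equation (5)
    idx = np.searchsorted(grid, x_data)
    M = np.zeros((len(x_data), k))
    M[np.arange(len(x_data)), idx] = 1.0
    # A = sum_c alpha_c C_c + beta * (diagonal likelihood term), with C_c = d_c d_c^T.
    A = D.T @ (alpha[:, None] * D) + beta * M.T @ M
    y_mp = np.linalg.solve(A, beta * M.T @ t_data)
    return grid, y_mp, alpha

# u = 0 everywhere recovers the single-alpha model with alpha = 1; lowering u_h near a spiky
# region lets alpha(x) drop there while staying large (smooth) elsewhere.
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 30))
t = np.sin(8 * x) + rng.normal(0, 0.1, 30)
u = np.array([0.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, 0.0, 0.0])
grid, y, alpha = new_model_interpolant(x, t, u, beta=1.0 / 0.1**2)
```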

When implementing this model we optimize the hyperparameters u and β by maximizing the 'evidence',

   P(D|u, β, H_2) = ∫ d^k y P(D|y, β, H_2) P(y|u, H_2),   (6)

where k is the dimensionality of our representation y of y(x). This approach is also known as 'ML-II' and is closely related to 'generalized maximum likelihood' (Gu and Wahba 1991). The ideal Bayesian method would put a proper prior on the hyperparameters and marginalize over them, but optimization of the hyperparameters is computationally more convenient and often gives predictive distributions that are indistinguishable (MacKay 1994).

We use a discrete representation y(x) → y and α(x) → {α_c}; the Hessian of the log posterior over y is a sum of terms from the log prior and a diagonal matrix from the log likelihood, A ≡ −∇∇ log P(y|D, {α_c}, β, H_2) = Σ_{c=1}^{C} α_c C_c + β Γ. The gradient of the log evidence, which we use for the optimization, is then:

(7)

where

DEMONSTRATION

We made an artificial data set by adding Gaussian noise of standard deviation 1000 to the function depicted in figure 1. Figure 2 shows the data, interpolated using first the traditional single alpha models with p = 1 and p = 2. The hyperparameter α was optimized by maximizing the evidence, as in (Lewicki 1994). The noise level σ_ν was set to the known noise level. In order for the spiky part of the data to be fitted, the traditional model's α has to be set to a small value, and the most probable interpolant is able in both models to


Figure 2: Traditional models: p = 1 and p = 2

The points in the upper plots are the artificial data. The solid line shows the most probable interpolant found using the traditional single alpha model. The error bars are one-standard-deviation error bars. The lower plots show the errors between the interpolant and the original function to which the noise was added to make the artificial data. The predictive error bars are also shown. Contrast with figure 3.

go very close to all the data points, so there is considerable overfitting, and the predictive error bars are large.

We then interpolated the data with two new models defined by equations (4) and (5), with p = 1 and p = 2. We set the basis functions ψ to the hump-shaped functions shown in figure 3. These functions define a scale length on which the smoothness is permitted to vary. This scale length was optimized roughly by maximizing the evidence. The new models had nine hyperparameters u. These hyperparameters were set by maximizing the evidence using conjugate gradients. Because the new models are conditionally convex, we had hoped that the maximization of the evidence would lead to a unique optimum u_MP. However, there were multiple optima in the evidence as a function of the hyperparameters; but these did not cause insurmountable problems. We found different optima by using different initial conditions u for the optimization. The best evidence optima were found by initializing u in a way that corresponded to our prior knowledge that neuronal spike functions start and end with a smooth region; we set u initially to {u_h} = {0, −10, −10, −10, −10, −10, −10, 0, 0}. This prior knowledge was not formulated into an informative prior over u during the optimization.

Figure 3 shows the solutions found using the new interpolation models with p = 1 and p = 2. The inferred value of α is small in the region of the spike, but elsewhere a larger value of α is inferred, and the interpolant is correspondingly smoother. The log evidence for


Figure 3: New models with multiple hyperparameters: p = 1 and p = 2
Top row: most probable interpolant with error bars. Second row: the inferred α(x) on a log scale (contrast with the values of 5.9 × 10⁻⁷ and 2.0 × 10⁻⁶ inferred for the traditional models). The third row shows the nine basis functions ψ used to represent α(x). The bottom row shows the errors between the interpolant and the original function to which the noise was added to make the artificial data. The predictive error bars are also shown. These graphs should be compared with those of figure 2.


Model                 log Evidence    γ      RMS error   ⟨RMS error⟩
Traditional, p = 1       -886.0      54.7       730          694
Traditional, p = 2       -891.7      32.2       692          642
New model, p = 1         -859.2      23.6       509          470
New model, p = 2         -861.5      15.3       510          417

Table 1: Comparison of models on artificial data The first three columns give the evidence, the effective number of parameters, and the RMS error for each model when applied to the data shown in figures 2-3. The fourth column gives the RMS error of each model averaged over four similar data sets.

the four models is shown in table 1. The reported evidence values are log_e P(D|α_MP, H_1) and log_e P(D|u_MP, H_2). If we were to make a proper model comparison we would integrate over the hyperparameters; this integration would introduce additional small subjective Occam factors penalizing the extra hyperparameters in H_2, cf. (MacKay 1992). The root mean square errors between the interpolant and the original function to which the noise was added to make the artificial data are also given, and the errors themselves are displayed at the bottom of figures 2 and 3.

By both the evidence values and the RMS error values, the new models are significantly superior to the traditional model. Table 1 also displays the value of the 'effective number of well-determined parameters' (Gull 1988; MacKay 1992), γ, which, when the hyperparameters are optimized, is given by:

   γ = k − Trace( A⁻¹ Σ_c α_c C_c ).   (8)

The smaller the effective number of parameters, the less overfitting of noise there is, and the smaller the error bars on the interpolant become. The total number of parameters used to represent the interpolant was in all cases 100.

MODEL CRITICISM

It is interesting to assess whether the observed errors with respect to the original function are compatible with the one-standard-deviation error bars asserted by the new models. These are shown together at the bottom of figure 3. The errors are only significantly larger than the error bars at the leftmost five data points, where the small amount of noise in the original function is incompatible with the assumed boundary conditions y(0) = 0 and y'(0) = 0. Omitting those five data points, we find for the new p = 1 model that the other 95 errors have χ² = 72.5 (cf. expectation 95 ± 14), and for the p = 2 model, χ² = 122. None of the 95 errors in either case exceed 2.5 standard deviations. We therefore see no significant evidence for the observed errors to be incompatible with the predictive error bars.

DISCUSSION

These new models offer two practical benefits. First, while the new models still fit the spiky region well (indeed the errors are slightly reduced there), they give a smoother interpolant


Figure 4: Average RMS error of the traditional and new models as a function of N. To achieve the same performance as the new models, the traditional models require about 3 times more data.

elsewhere. This reduction in overfitting allows more information to be extracted from any given quantity of experimental data; neuronal spikes will be distinguishable given fewer samples. To quantify the potential savings in data we fitted the four models to fake data equivalent to N = 100, 200, ..., 1000 data points with noise level σ_ν = 1000. The figures and tables shown thus far correspond to the case N = 100. In figure 4 we show the RMS error of each model as a function of the number of data points, averaged over four runs with different artificial noise. To achieve the same performance (RMS error) as the new models, the traditional models require about three times as much data.

Second, the new models have greater values of the evidence. This does not only mean that they are more probable models. It also means that model comparison questions can be answered in a more reliable way. For example, if we wish to ask 'are two distinct spike types present in several data sets or just one?' then we must compare two hypotheses: H_B, which explains the data in terms of two spike functions, and H_A, which just uses one function. In such model comparisons, the 'Occam factors' that penalize the extra parameters of H_B are important. If we used the traditional interpolation model, we would obtain Occam factors about e^20 bigger than those obtained using the new interpolation model. Broad priors bias model comparisons toward simpler models. The new interpolation model, when optimized, produces a prior in which the effective number of degrees of freedom of the interpolant is reduced so that the prior is more concentrated on the desired set of functions.

Of course, inference is open-ended, and we expect that these models will in turn be superseded by even better ones. Future models might include a continuum of alternative values of p (non-integer values of p can be implemented in a Fourier representation). It might also make sense for the characteristic length scale of the basis functions ψ with which α(x) is represented to be shorter where α is small.

The advantages conferred by the new models are not accompanied by a significant increase in computational cost. The optimization of the hyperparameters simply requires that the Hessian matrix be inverted a small number of times.


In a longer paper (MacKay and Takeuchi 1994), we also discuss more generally the construction of hierarchical models with multiple hyperparameters, and the application of these ideas to the representation of covariance matrices.

ACKNOWLEDGEMENTS

D.J.C.M. thanks the Isaac Newton Institute and T. Matsumoto, Waseda University, for hospitality, and Radford Neal, Mike Lewicki, David Mumford and Brian Ripley for helpful discussions. R.T. thanks T. Matsumoto for his support.

References

BLAKE, A., and ZISSERMAN, A. (1987) Visual Reconstruction. Cambridge Mass.: MIT Press.

GILKS, W., and WILD, P. (1992) Adaptive rejection sampling for Gibbs sampling. Applied Statistics 41: 337-348.

GU, C., and WAHBA, G. (1991) Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM J. Sci. Stat. Comput. 12: 383-398.

GULL, S. F. (1988) Bayesian inductive inference and maximum entropy. In Maximum Entropy and Bayesian Methods in Science and Engineering, vol. 1: Foundations, ed. by G. Erickson and C. Smith, pp. 53-74, Dordrecht. Kluwer.

KIMELDORF, G. S., and WAHBA, G. (1970) A correspondence between Bayesian estimation of stochastic processes and smoothing by splines. Annals of Mathematical Statistics 41 (2): 495-502.

LEWICKI, M. (1994) Bayesian modeling and classification of neural signals. Neural Computation 6 (5): 1005-1030.

MACKAY, D. J. C. (1992) Bayesian interpolation. Neural Computation 4 (3): 415-447.

MACKAY, D. J. C. (1994) Hyperparameters: Optimize, or integrate out? In Maximum Entropy and Bayesian Methods, Santa Barbara 1993, ed. by G. Heidbreder, Dordrecht. Kluwer.

MACKAY, D. J. C., and TAKEUCHI, R., (1994) Interpolation models with multiple hyperparameters. Submitted to IEEE PAMI.

MULLER, H. G., and STADTMULLER, U. (1987) Variable bandwidth kernel estimators of regression-curves. Annals of Statistics 15 (1): 182-201.

SMITH, A. (1991) Bayesian computational methods. Philosophical Transactions of the Royal Society of London A 337: 369-386.


DENSITY NETWORKS AND THEIR APPLICATION TO PROTEIN MODELLING

David J.C. MacKay
Cavendish Laboratory, Cambridge, CB3 0HE, U.K. mackay@mrao.cam.ac.uk

ABSTRACT. I define a latent variable model in the form of a neural network for which only target outputs are specified; the inputs are unspecified. Although the inputs are missing, it is still possible to train this model by placing a simple probability distribution on the unknown inputs and maximizing the probability of the data given the parameters. The model can then discover for itself a description of the data in terms of an underlying latent variable space of lower dimensionality. I present preliminary results of the application of these models to protein data.

1 Density Modelling

The most popular supervised neural networks, multilayer perceptrons (MLPs), are well established as probabilistic models for regression and classification, both of which are conditional modelling tasks: the input variables are assumed given, and we condition on their values when modelling the distribution over the output variables; no model of the density over input variables is constructed. Density modelling (or generative modelling), on the other hand, denotes modelling tasks in which a density over all the observable quantities is constructed. Multi-layer perceptrons have not conventionally been used to create density models (though belief networks and other neural networks such as the Boltzmann machine do define density models). Various interesting research problems in this field relate to the difficulty of defining a full probabilistic model with an MLP. For example, if some inputs in a regression problem are 'missing', then traditional methods offer no principled way of filling the gaps. This paper discusses how one can use an MLP as a density model.

TRADITIONAL DENSITY MODELS

A popular class of density models are mixture models, which define the density as a sum of simpler densities. Mixture models might however be viewed as inappropriate models for high-dimensional data spaces such as images or genome sequences. The number of components in a mixture model has to scale exponentially as we add independent degrees of freedom. Consider, for example, a protein family in which there is a strong correlation between the amino acids in the first and second columns - they are either both hydrophobic, or both hydrophilic, say - and there is an independent correlation between two other amino acids elsewhere in the protein chain - when one of them has a large residue the other has a small residue, say. A mixture model would have to use four categories to capture all four combinations of these binary attributes, whereas only two independent degrees of freedom are really present. Thus a combinatorial representation of underlying variables would seem more appropriate. [Luttrell's (1994) partitioned mixture distribution is motivated similarly, but is a different form of quasi-probabilistic model.]


These observations motivate the development of density models that have components rather than categories as their 'latent variables' (Everitt 1984; Hinton and Zemel 1994). Let us denote the observables by t. If a density is defined on the latent variables x, and a parameterized mapping is defined from these latent variables to a probability distribution over the observables P(t|x, w), then when we integrate over the unknowns x, a non-trivial density over t is defined, P(t|w) = ∫ dx P(t|x, w) P(x). Simple linear models of this form in the statistics literature come under the label of 'factor analysis'. In a 'density network' (MacKay 1995) P(t|x, w) is defined by a more general non-linear parameterized mapping, and interesting priors on w may be used.

THE MODEL

The 'latent inputs' of the model are a vector x indexed by h = 1...H ('h' mnemonic for 'hidden'). The dimensionality of this hidden space is H but the effective dimensionality assigned by the model in the output space may be smaller, as some of the hidden dimensions may be effectively unused by the model. The relationship between the latent inputs and the observables, parameterized by w, has the form of a mapping from inputs to outputs y(x; w), and a probability of targets given outputs, P(t|y). The observed data are a set of target vectors D = {t^(n)}, n = 1...N. To complete the model we assign a prior P(x) to the latent inputs (an independent prior for each vector x^(n)) and a prior P(w) to the unknown parameters. [In the applications that follow the priors over w and x^(n) are assumed to be spherical Gaussians; other distributions could easily be implemented and compared, if desired.] In summary, the probability of everything is:

   P(D, {x^(n)}, w|H) = ∏_n [ P(t^(n)|x^(n), w, H) P(x^(n)|H) ] P(w|H).   (1)

It will be convenient to define 'error functions' G^(n)(x; w) as follows:

   G^(n)(x; w) ≡ log P(t^(n)|x, w, H).   (2)

The function G depends on the nature of the problem. If t consists of real variables then G might be a sum-squared error between t and y; in a 'softmax' classifier where the observations t are categorical, G is a 'cross entropy'. In general we may have many output groups of different types. The following derivation applies to all cases. Subsequently this paper concentrates on the following form of model, which may be useful to have in mind. The observable t = {ts}~=1 (e.g., a single protein sequence) consists of a number S of categorical attributes that are believed to be correlated (S will be the number of columns in the protein alignment). Each attribute can take one of a number I of discrete values, a probability over which is modelled with a softmax group (e.g., 1=20).

P(t \mid x, w) = \prod_{s=1}^{S} y_{t_s}(x; w)   (3)

where the output of each softmax group is

y_i(x; w) = \frac{\exp(a_i(x; w))}{\sum_{i'} \exp(a_{i'}(x; w))}   (4)

with the sum over i' running over the I outputs of the same softmax group.


The parameters w form a matrix of (H + 1) × S × I weights from the H latent inputs x (and one bias) to the S × I outputs:

a_i(x; w) = w_{i0} + \sum_{h=1}^{H} w_{ih} x_h   (5)

The data items t are labelled by an index n = 1 ... N, not included in the above equations, and the error function G^(n) is

G^{(n)}(x; w) = \sum_{s=1}^{S} \log y_{t_s^{(n)}}(x; w)   (6)
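For concreteness, the forward pass of this single-layer softmax density network might be sketched as follows. This is an illustration only, not the author's code; the array shapes, function names and the use of numpy are assumptions.

```python
import numpy as np

def softmax_outputs(x, W, b):
    # x: latent vector of length H; W: weights of shape (S, I, H); b: biases of shape (S, I)
    a = b + np.einsum('sih,h->si', W, x)      # activations, as in equation (5)
    a -= a.max(axis=1, keepdims=True)         # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)   # one softmax group per column, equation (4)

def error_function_G(t, x, W, b):
    # t: length-S vector of observed symbols (integers in 0..I-1)
    y = softmax_outputs(x, W, b)
    return np.sum(np.log(y[np.arange(len(t)), t]))   # log likelihood, equation (6)
```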

Having written down the probability of everything (equation 1) we can now make any desired inferences by turning the handle of probability theory. Let us aim towards the inference of the parameters w given the data D, P(w|D, H). We can obtain this quantity conveniently by distinguishing two levels of inference.

Level 1: Given w and t^(n), infer x^(n). The posterior distribution of x^(n) is

P(x^{(n)} \mid t^{(n)}, w, \mathcal{H}) = \frac{P(t^{(n)} \mid x^{(n)}, w, \mathcal{H})\, P(x^{(n)} \mid \mathcal{H})}{P(t^{(n)} \mid w, \mathcal{H})}   (7)

where the normalizing constant is:

P(t^{(n)} \mid w, \mathcal{H}) = \int d^{H}x^{(n)}\; P(t^{(n)} \mid x^{(n)}, w, \mathcal{H})\, P(x^{(n)} \mid \mathcal{H})   (8)

Level 2: Given D = {t(n)}, infer w.

P(w \mid D, \mathcal{H}) = \frac{P(D \mid w, \mathcal{H})\, P(w \mid \mathcal{H})}{P(D \mid \mathcal{H})}   (9)

The data-dependent term here is a product of the normalizing constants of the level 1 inferences:

P(D \mid w, \mathcal{H}) = \prod_{n=1}^{N} P(t^{(n)} \mid w, \mathcal{H})   (10)

The evaluation of the evidence P(t^(n)|w, H) for a particular n is a problem similar to the evaluation of the evidence for a supervised neural network (MacKay 1992). There, the inputs x are given, and the parameters w are unknown; we obtain the evidence by integrating over w. In the present problem, on the other hand, the hidden vector x^(n) is unknown, and the parameters w are conditionally fixed. For each n, we wish to integrate over x^(n) to obtain the evidence.

LEARNING: THE DERIVATIVE OF THE EVIDENCE WITH RESPECT TO w

The derivative of the log of the evidence (equation 8) is:

\frac{\partial}{\partial w} \log P(t^{(n)} \mid w, \mathcal{H}) = \frac{1}{P(t^{(n)} \mid w, \mathcal{H})} \int d^{H}x\; \exp(G^{(n)}(x; w))\, P(x \mid \mathcal{H})\, \frac{\partial}{\partial w} G^{(n)}(x; w)
= \int d^{H}x\; P(x \mid t^{(n)}, w, \mathcal{H})\, \frac{\partial}{\partial w} G^{(n)}(x; w).   (11)

This gradient can thus be written as an expectation of the traditional 'backpropagation' gradient ∂G^(n)(x; w)/∂w, averaging over the posterior distribution of x^(n) found in equation (7).


HIGHER LEVELS - PRIORS ON w

We can continue up the hierarchical model, putting a prior on w with hyperparameters {α} which are inferred by integrating over w. These priors are important from a practical point of view to limit overfitting of the data by the parameters w. These priors will also be used to bias the solutions towards ones that are easier for humans to interpret.

EVALUATION OF THE EVIDENCE AND ITS DERIVATIVES USING SIMPLE MONTE CARLO

The evidence and its derivatives with respect to w both involve integrals over the hidden components x. For a hidden vector of sufficiently small dimensionality, a simple Monte Carlo approach to the evaluation of these integrals can be effective.

Let {x^(r)}, r = 1 ... R, be random samples from P(x). Then we can approximate the log evidence by:

\log P(\{t^{(n)}\} \mid w, \mathcal{H}) = \sum_n \log \int d^{H}x\; \exp(G^{(n)}(x; w))\, P(x)
\simeq \sum_n \log \left[ \frac{1}{R} \sum_{r} \exp(G^{(n)}(x^{(r)}; w)) \right].

Similarly the derivative can be approximated by:

\frac{\partial}{\partial w} \log P(t^{(n)} \mid w, \mathcal{H}) \simeq \sum_{r} \frac{\exp(G^{(n)}(x^{(r)}; w))}{\sum_{r'} \exp(G^{(n)}(x^{(r')}; w))}\; \frac{\partial}{\partial w} G^{(n)}(x^{(r)}; w).

This simple Monte Carlo approach loses the advantage that we gained when we rejected mixture models and turned to componential models; this implementation of the componential model requires a number of samples R that is exponential in the dimension of the hidden space H. More sophisticated methods using stochastic dynamics (Neal 1993) are currently under development.
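As a concrete illustration of this simple Monte Carlo estimator, one might compute the log evidence and its gradient roughly as follows. This is a sketch only, not the author's implementation; it reuses the softmax_outputs and error_function_G helpers from the earlier sketch and assumes spherical Gaussian priors.

```python
import numpy as np

def mc_log_evidence_and_grad(T, W, b, R=1000, seed=0):
    # T: (N, S) integer array of observed symbols; W: (S, I, H) weights; b: (S, I) biases
    H = W.shape[2]
    X = np.random.default_rng(seed).standard_normal((R, H))   # samples from spherical Gaussian P(x)
    log_evidence, grad_W = 0.0, np.zeros_like(W)
    for t in T:
        G = np.array([error_function_G(t, x, W, b) for x in X])        # G^(n)(x^(r); w)
        log_evidence += np.log(np.mean(np.exp(G - G.max()))) + G.max() # log of (1/R) sum_r exp(G)
        post = np.exp(G - G.max()); post /= post.sum()                 # posterior weights over samples
        for weight, x in zip(post, X):
            y = softmax_outputs(x, W, b)
            delta = -y
            delta[np.arange(len(t)), t] += 1.0                         # dG/da for each softmax group
            grad_W += weight * delta[:, :, None] * x[None, None, :]    # posterior-weighted dG/dw
    return log_evidence, grad_W
```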

ALTERNATIVE IMPLEMENTATIONS

An alternative approach to making such componential models scale well is the free energy minimization approximation of Hinton and Zemel (1994). They introduce a distribution Q_n(x) that is intended to be similar to the posterior distribution P(x|t^(n), w, H); Q is written as a nonlinear function of the observable t^(n); the parameters of this nonlinear function are then optimized so as to make Q_n(x) the best possible approximation to P(x|t^(n), w, H) (for all n) as measured by a free energy, Σ_n ∫ dx Q log(Q/P). This method gives an approximate lower bound for the log evidence. If R random samples {x^(r)}, r = 1 ... R, from Q(x) are made, then:

\log P(t^{(n)} \mid w, \mathcal{H}) = \log \int d^{H}x\; \exp(G^{(n)}(x; w))\, P(x)
\geq \int dx\; Q(x)\, \log \left[ \frac{P(x)\, e^{G^{(n)}(x; w)}}{Q(x)} \right]
\simeq \frac{1}{R} \sum_{r=1}^{R} \left[ G^{(n)}(x^{(r)}; w) + \log P(x^{(r)}) - \log Q(x^{(r)}) \right].


An alternative formula for estimating the evidence is given by importance sampling:

P(t^{(n)} \mid w, \mathcal{H}) \simeq \frac{1}{R} \sum_{r=1}^{R} \frac{P(x^{(r)})\, e^{G^{(n)}(x^{(r)}; w)}}{Q(x^{(r)})}, \qquad x^{(r)} \sim Q(x).

2 A componential density model for a protein family

A protein is a sequence of amino acids. A protein family is a set of proteins believed to have the same physical structure but not necessarily having the same sequence of amino acids. In a multiple sequence alignment, residues of the individual sequences which occupy structurally analogous positions are aligned into columns. There are twenty different amino acids, and columns can often be characterized by a predominance of particular amino acids. Lists of marginal frequencies over amino acids in different structural contexts are given in (Nakai et al. 1988).

The development of models for protein families is useful for two reasons. The first is that a good model might be used to identify new members of an existing family, and discover new families too, in data produced by genome sequencing projects. The second reason is that a sufficiently complex model might be able to give new insight into the properties of the protein family; for example, properties of the proteins' tertiary structure might be elucidated by a model capable of discovering suspicious inter-column correlations.

The only probabilistic model that has so far been applied to protein families is a hidden Markov model (Krogh et al. 1994). This model is not inherently capable of discovering long-range correlations, as Markov models, by definition, produce no correlations between the observables, given a hidden state sequence.

The next-door neighbour of proteins, RNA, has been modelled with a 'covariance model' capable of capturing correlations between base-pairs in anti-parallel RNA strands (Eddy and Durbin 1994). The aim of the present work is to develop a model capable of discovering general correlations between multiple arbitrary columns in a protein family. E. Steeg (personal communication) has developed an efficient statistical test for discovering correlated groups of residues. The present work is complementary to Steeg's in that (1) in the density network, a residue may be influenced by more than one latent variable, whereas Steeg's test is specialised for the case where the correlated groups are non-overlapping; (2) the density networks developed here define full probabilistic models rather than statistical tests.

Here I model the protein families using a density network containing one softmax group for each column (see equations 3-6). The network has only one layer of weights connecting the latent variables x directly to the softmax groups. I have optimized w by evaluating the evidence and its gradient and feeding them into a conjugate gradient routine. The random points {x^(r)} are kept fixed, so that the objective function and its gradient are deterministic functions during the optimization. This also has the advantage of allowing one to get away with a smaller number of samples R than might be thought necessary, as the parameters w can adapt to make the best use of the empirical distribution over x.
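An outline of such an optimization loop might look as follows; this is an illustrative sketch rather than the author's routine, assuming scipy's conjugate-gradient minimizer and a user-supplied objective (for example a wrapper around the Monte Carlo sketch above).

```python
import numpy as np
from scipy.optimize import minimize

def fit_weights(neg_log_evidence_and_grad, w0, H, R=200, seed=0):
    # neg_log_evidence_and_grad(w, X) -> (value, gradient) for a flat parameter vector w,
    # evaluated on the latent samples X (this signature is an assumption for illustration).
    X = np.random.default_rng(seed).standard_normal((R, H))   # drawn once, then held fixed,
                                                               # so the objective is deterministic
    result = minimize(lambda w: neg_log_evidence_and_grad(w, X),
                      w0, jac=True, method='CG')               # conjugate gradient routine
    return result.x
```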

REGULARIZATION SCHEMES

A human prejudice towards comprehensible solutions gives an additional motivation for regularizing the model, beyond the usual reasons for having priors. Here I encourage the model to be comprehensible in two ways:


1. There is a redundancy in the model regarding where it gets its randomness from. Assume that a particular output is actually random and uncorrelated with other outputs. This could be modelled in two ways: its weights from the latent inputs could be set to zero, and the biases could be set to the log probabilities; or alternatively the biases could be fixed to arbitrary values, with appropriate connections to unused latent inputs being used to create the required probabilities, on marginalization over the latent variables. In predictive terms, these two models would be identical, but we prefer the first solution, finding it more intelligible. To encourage such solutions I use a prior which weakly regularizes the biases, so that they are 'cheap' relative to the other parameters.

2. If the distribution P(x) is rotationally invariant, then the predictive distribution is invariant under corresponding transformations of the parameters w. If a solution can be expressed in terms of parameter vectors aligned with some of the axes (i.e. so that some parameters are zero), then we would prefer that representation. Here I create a non-spherical prior on the parameters by using multiple undetermined regularization constants {α_c}, each one associated with a class of weights (cf. the automatic relevance determination model (MacKay and Neal 1994)). A weight class consists of all the weights from one latent input to one softmax group, so that for a protein with S columns modelled using H latent variables, I introduced SH regularization constants, each specifying whether a particular latent variable has an influence on a particular column. Given α_c, the prior on the parameters in class c is Gaussian with variance 1/α_c. This prior favours solutions in which one latent input has non-zero connections to all the units in some softmax groups (corresponding to small α_c), and negligible connections to other softmax groups (large α_c). The resulting solutions can easily be interpreted in terms of correlations between columns.

METHOD FOR OPTIMIZATION OF HYPERPARAMETERS

For given values of {α_c}, the parameters w were optimized to maximize the posterior probability. No explicit Gaussian approximation was made to the posterior distribution of w; rather, the hyperparameters {α_c} were adapted during the optimization of the parameters w, using a cheap and cheerful method motivated by Gaussian approximations (MacKay 1992), thus:

\alpha_c := f\, \frac{k_c}{\sum_{i \in c} w_i^2}   (13)

Here k_c is the number of parameters in class c and f is a 'fudge factor' incorporated to imitate the effect of integrating over w (set to a value between 0.1 and 1.0).
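A sketch of this re-estimation step (illustrative only; the bookkeeping of classes via index arrays is an assumption):

```python
import numpy as np

def update_alphas(w, classes, f=0.5):
    # w: flat parameter vector; classes: list of index arrays, one per regularization class c
    # implements alpha_c := f * k_c / sum_{i in c} w_i^2, as in equation (13)
    return np.array([f * len(idx) / (np.sum(w[idx] ** 2) + 1e-12) for idx in classes])
```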

This algorithm could be converted to a correct 'stochastic dynamics' Monte Carlo method (Neal 1993) by adding an appropriate amount of noise to gradient descent on w and setting f = 1.

TOY DATA

A toy data set was created imitating a protein family with four columns, each containing one of five amino acids. The 27 data items (Table 1) were constructed to exhibit two correlations between the columns: the first and second columns have a tendency both to be amino acid E together, and the third and fourth columns are correlated as follows.


EEAB EECB EEBC EECC EEAA EEBA EEBB EECD EEDC EEDD AACD DDDC CBDD CCAB BDCB ABBC CBCC EDAA ABBA BCBB DBAB AECB EBBC BDCC BCAA DABA BCBB

Table 1: Toy data for a protein family

If one of the third and fourth columns is amino acid B, then the other is likely to be A, B or C; if one is C, then the other is likely to be B, C or D; and so forth, with an underlying single dimension running through the amino acids A, B, C, D. The model is given no prior knowledge of the 'spatial relationship' of the columns, or of the ordering of the amino acids. A model that can identify the two correlations in the data is what we are hoping for.

Both regularized and unregularized density networks having four latent inputs were adapted to this data. Unregularized density networks give solutions that successfully predict the two correlations, but the parameters of those models are hard to interpret (figure 1a). There is also evidence of overfitting of the data leading to overconfident predictions by the model. The regularized models, in which all the parameters connecting one input to one softmax group are put in a regularization class with an unknown hyperparameter α_c, give interpretable solutions that clearly identify the two correlated groups of columns. Figure 1b shows the hyperparameters and parameters inferred in a typical solution using a regularized density network. Notice that two of the latent inputs are unused in this solution. Of the other two inputs, one has an influence on columns 1 and 2 only, and the other has an influence on columns 3 and 4 only. Thus this model has successfully revealed the underlying 'structure' of the proteins in this family.

RESULTS ON REAL DATA: BETA SHEETS

Beta sheets are structures in which two parts of the protein engage in a particular hydrogen-bonding interaction. It would greatly help in the solution of the protein folding problem if we could distinguish correct from incorrect alignments of beta strands.

Data on aligned antiparallel beta strands was provided by Tim Hubbard. N = 1000 examples were taken. Density networks with H = 6 latent inputs were used to model the joint distribution of the twelve residues surrounding a beta sheet hydrogen bond. Our prior expectation is that if there is any correlation among these residues, it is likely to reflect the spatial arrangement of the residues, with nearby residues being correlated. But this prior expectation was not included in the model. The hope was that meaningful physical properties such as this would be learned from the data.

ANALYSIS

The parameters of a typical optimized density network are shown in figure 2. The parameter vectors were compared, column by column, with a large number of published amino acid indices (Nakai et al. 1988) to see if they corresponded to established physical properties of amino acids. Each index was normalized by subtracting the mean from each vector and scaling it to unit length. The similarity of a parameter vector to an index was then measured by the magnitude of their inner product.
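This comparison might be implemented along the following lines; the sketch assumes both parameter vectors and indices are length-20 vectors over the amino acids, and the names are hypothetical.

```python
import numpy as np

def normalize(v):
    v = np.asarray(v, dtype=float)
    v = v - v.mean()                      # subtract the mean ...
    return v / np.linalg.norm(v)          # ... and scale to unit length

def index_similarity(param_vector, aa_index):
    # similarity measured by the magnitude of the inner product of the normalized vectors
    return abs(np.dot(normalize(param_vector), normalize(aa_index)))
```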



Figure 1: Parameters and hyperparameters inferred for the toy protein family. a) Hinton diagram showing parameters w of a model optimized without adaptive regularizers. Positive parameters are shown by black squares, negative by white. The magnitude of a parameter is proportional to the square's area. This diagram shows, in the five grey rectangles, the projective fields from the bias and the four latent variables to the outputs. In each grey rectangle the influences of one latent variable on the twenty outputs are arranged in a 5x4 grid: in each column the 5 output units correspond to the 5 amino acids. It is hard to interpret these optimized parameters. b) The hyperparameters and parameters of a hierarchical model with adaptive regularizers. The results are more intelligible and show a model that has discovered the two underlying dimensions of the data. Hyperparameters: each hyperparameter controls all the influences of one latent variable on one column. Square size denotes the value of σ² = 1/α on a log scale from 0.001 to 1.0. The model has discovered that columns 1 and 2 are correlated with each other but not with columns 3 and 4, and vice versa. Parameters: same conventions as (a). Note the sparsity of the connections, making clear the two distinct underlying dimensions of this protein family.

Two distinctive patterns reliably emerged in most adapted models, both having a meaningful physical interpretation. First, an alternating pattern can be seen in the influences in the third rectangle from the left. The influences on columns 2, 4, 9 and 11 are similar to each other, and opposite in sign to the influences on columns 3, 5, 10 and 12. This dichotomy between the residues is physically meaningful: residues 2, 4, 9 and 11 are on the opposite side of the beta sheet plane from residues 3, 5, 10 and 12; when these influence vectors were compared with the published amino acid indices, they showed the greatest similarity to Nakai et al.'s (1988) indices 57, 17, 7 and 42, which respectively describe the amino acids' polarity, the proportion of residues 100% buried, the transfer free energy to surface, and the consensus normalized hydrophobicity scale. This latent variable has clearly discovered the inside-outside characteristics of the beta sheet structure: either one face of the sheet is exposed to the solvent (high polarity) or the other face, but not both.

Second, a different pattern is apparent in the second rectangle from the right. Here the influences on residues 4, 5, 6, 7, 8 are similar and opposite to the influences on 11, 12, 1, 2. For five of these residues the influence vector shows greatest similarity with index number 21, the normalized frequency of beta-turn.


Figure 2: Parameters w of an optimized density network modelling aligned antiparallel beta strands. In each grey rectangle the twelve columns represent the twelve residues surrounding a beta hydrogen bond. The twenty rows represent the twenty amino acids, in alphabetical order (A, C, D, ...). Each rectangle shows the influences of one latent variable on the 12 x 20 probabilities. The top left rectangle shows the biases of all the output units. There is an additional 21st row in this rectangle for the biases of the output units corresponding to 'no amino acid'. The latent variables were defined to have no influence on these outputs to inhibit the wasting of latent variables on the modelling of dull correlations. The other six rectangles contain the influences of the 6 latent variables on the output units, of which the second and fifth are discussed in the text.

What this latent variable has discovered, therefore, is that a beta turn may happen at one end or the other of two anti-parallel beta strands, but not both.

Both of these patterns have the character of an 'exclusive-or' problem. One might imagine that an alternative way to model aligned beta sheets would be to train a discriminative model such as a neural network binary classifier to distinguish 'aligned beta sheet' from 'not aligned beta sheet'. However, such a model would have difficulty learning these exclusive-or patterns. Exclusive-or can be learnt by a neural network with one hidden layer and two layers of weights, but it is not a natural function readily produced by such a network. In contrast these patterns are easily captured by the density networks presented here, which have only one layer of weights.

It is interesting to note that the two effects discovered above involve competing correlations between large numbers of residues. The inside-outside latent variable produces a positive correlation between columns 4 and 11, for example, while the beta turn latent variable produces a negative correlation between those two columns. These results, although they do not constitute new discoveries, suggest that this technique shows considerable promise.

FUTURE WORK

More complex models under development will include additional layers of processing between the latent variables and the observables. If some of the parameters of a second layer were communal to all columns of the protein, the model would be able to generalize amino acid equivalences from one column to another.

It would be interesting to attempt to represent protein evolution as taking place in the latent variable space of a density network.

It is hoped that a density network adapted to beta sheet data will eventually be useful


for discriminating correct from incorrect alignments of beta strands. The present work is not of sufficient numerical accuracy to achieve this, but possibly by introducing superior sampling methods in tandem with free energy minimization (Hinton and Zemel 1994), these models may make a contribution to the protein folding problem.

References

EDDY, S. R., and DURBIN, R. (1994) RNA sequence analysis using covariance models. NAR, in press.

EVERITT, B. S. (1984) An Introduction to Latent Variable Models. London: Chapman and Hall.

HINTON, G. E., and ZEMEL, R. S. (1994) Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems 6, ed. by J. D. Cowan, G. Tesauro, and J. Alspector, San Mateo, California. Morgan Kaufmann.

KROGH, A., BROWN, M., MIAN, I. S., SJOLANDER, K., and HAUSSLER, D. (1994) Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology 235: 1501-1531.

LUTTRELL, S. P. (1994) The partitioned mixture distribution: an adaptive Bayesian network for low-level image processing. To appear.

MACKAY, D. J. C. (1992) A practical Bayesian framework for backpropagation networks. Neural Computation 4 (3): 448-472.

MACKAY, D. J. C. (1995) Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, Section A.

MACKAY, D. J. C., and NEAL, R. M. (1994) Automatic relevance determination for neural networks. Technical Report in preparation, Cambridge University.

NAKAI, K., KIDERA, A., and KANEHISA, M. (1988) Cluster analysis of amino acid indices for prediction of protein structure and function. Prot. Eng. 2: 93-100.

NEAL, R. M. (1993) Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems 5, ed. by C. L. Giles, S. J. Hanson, and J. D. Cowan, pp. 475-482, San Mateo, California. Morgan Kaufmann.

I thank Radford Neal, Geoff Hinton, Sean Eddy, Richard Durbin, Tim Hubbard and Graeme Mitchison for invaluable discussions. I gratefully acknowledge the support of this work by the Royal Society Smithson Research Fellowship.


THE CLUSTER EXPANSION: A HIERARCHICAL DENSITY MODEL

Stephen P Luttrell Defence Research Agency St Andrews Rd, Malvern, Worcestershire, WR14 3PS, United Kingdom [email protected]

© British Crown Copyright 1994 / DRA Published with the permission of the Controller of Her Britannic Majesty's Stationery Office

ABSTRACT. Density modelling in high-dimensional spaces is a difficult problem. In this paper a new model, called the cluster expansion, is proposed and discussed. The cluster expansion scales well to high-dimensional spaces, and it allows the integrals over model parameters that arise in Bayesian predictive distributions to be evaluated explicitly.

1. Introduction

The basic idea behind the cluster expansion is as follows. Density models in subspaces (or clusters of pixels) of a high-dimensional input space are first built, and these are then linked together to form clusters-of-clusters, which are further linked, etc. This type of hierarchical approach is computationally very efficient.

The purpose of this paper is to present a Bayesian derivation of the cluster expansion model. This supplements the rather non-Bayesian discussions presented in [1, 2, 3].

2. Notation

The following notation is used in this paper. M = model, D = training set of data, D' = training set of data plus one extra sample (i.e. the test sample), N = number of samples in the training set, x = input vector, s = parameter vector of the model, y(...) = transformation function (or mapping), δ(...) = Dirac delta function, L = number of layers in the model, l = layer index, c = cluster index, Γ(...) = Gamma function, ν = order parameter, m = number of counts in a histogram bin, n = number of histogram bins per dimension, Y = whole transformation function, [dY] = integration measure over transformation functions.

3. The Cluster Expansion Model

The standard Bayesian expansion for the predictive distribution Pr(x|D, M) is

\Pr(x \mid D, M) = \int ds\; \Pr(x \mid s, M)\, \Pr(s \mid D, M)   (1)


where the integral over s is usually difficult to evaluate. The main purpose of this paper is to present a model that has the following two properties: (1) M scales sensibly to high-dimensional x, (2) M and Pr(x|D, M) factorise into a product of loosely coupled pieces. These two goals must be achieved without simply making an independence assumption. This type of M is called a "cluster expansion", and it was briefly discussed in a rather non-Bayesian way in [1].

The simplest non-trivial example of a cluster expansion is

\Pr(x \mid s, M) = \frac{\Pr(x_1 \mid s_1, M_1)}{\Pr(y_1(x_1) \mid s_1, M_1)}\; \frac{\Pr(x_2 \mid s_2, M_2)}{\Pr(y_2(x_2) \mid s_2, M_2)}\; \Pr(y_1(x_1), y_2(x_2) \mid s_{12}, M_{12})   (2)

where M is broken into 3 pieces as M = M_1 × M_2 × M_12, and the parameter vector is appropriately partitioned as s = (s_1, s_2, s_12). The input vector is partitioned into two non-overlapping pieces as x = (x_1, x_2), where M_1 models the density of x_1 and M_2 models the density of x_2. The input vector is then transformed to produce another pair of vectors y_1(x_1) and y_2(x_2) whose joint density is modelled by M_12. The denominators are normalisation factors that may be obtained from their respective numerators as follows

\Pr(y_k(x_k) \mid s_k, M_k) = \int dx_k'\; \delta(y_k(x_k') - y_k(x_k))\, \Pr(x_k' \mid s_k, M_k)   (3)

where the delta function constrains the integral over input space to include only those points that map to y_k(x_k). The normalisation of the cluster expansion can then be checked by integrating over all x using

\int dx\; (\cdots) = \int dy_1\, dy_2 \int dx_1\, dx_2\; \delta(y_1(x_1) - y_1)\, \delta(y_2(x_2) - y_2)\; (\cdots)   (4)

which first integrates over the x_k that map to a particular y_k, and then integrates over the y_k. The cluster expansion model may also be derived as the solution to the following maximum entropy problem [1]: (1) Assume that x_1, x_2, y_1(x_1), and y_2(x_2) are defined as above, (2) Supply the three marginal PDFs Pr(x_1), Pr(x_2), and Pr(y_1, y_2) as constraints on the maximum entropy Pr(x), (3) Seek the maximum entropy Pr(x) that satisfies these constraints. In the special case where Pr(y_1, y_2) is not supplied as a constraint, the maximum entropy solution is Pr(x_1) Pr(x_2), as expected. When Pr(y_1, y_2) is supplied as a constraint, this solution acquires an extra factor Pr(y_1(x_1), y_2(x_2)) / (Pr(y_1(x_1)) Pr(y_2(x_2))) to account for any correlations between y_1 and y_2 that are found in Pr(y_1, y_2). Although this maximum entropy style of argument was the way in which the cluster expansion was originally introduced in [1], in this paper the main justification for the cluster expansion is that it allows Bayesian computations (e.g. integrations over model parameters) to be performed readily.

The cluster expansion model may be visualised as shown in Figure 1, which shows the model structure for a 4-dimensional input vector x = (x_1, x_2), where x_1 = (x_11, x_12) and x_2 = (x_21, x_22). The clusters are each 2-dimensional subspaces of the full 4-dimensional input space, and Figure 1 shows each of these subspaces discretised into bins. The bins highlighted in black in Figure 1 correspond to a single representative input vector x.


Figure 1: The cluster expansion for a 4-dimensional input vector.

The transformation y_k(x_k) maps x_k to y_k, and Figure 1 shows examples of how a subset of bins in each input subspace maps to the same bin in the corresponding part of output space. Although these subsets of bins are shown in Figure 1 as consisting of contiguous bins, this need not necessarily be the case. Pr(x_k|s_k, M_k) is the likelihood that M_k computes for input vector x_k using parameters s_k. Pr(y_k(x_k)|s_k, M_k) is the sum of the likelihoods Pr(x_k|s_k, M_k) over all those x_k that map to the same y_k(x_k), as shown in Figure 1. Pr(y_1(x_1), y_2(x_2)|s_12, M_12) is the likelihood that M_12 computes for output vector (y_1(x_1), y_2(x_2)) using parameters s_12. Thus M_1 has responsibility for modelling the bottom left hand part of Figure 1, M_2 the bottom right hand part, and M_12 the top part.

In practice, the cluster expansion model in Figure 1 may be applied to a situation where the data arrives at a pair of sensors (as x_1 and x_2). One Bayesian is located at each of the two sensors, and each constructs a likelihood (Pr(x_1|s_1, M_1) and Pr(x_2|s_2, M_2)) for her own data. However, they decide to pass on the responsibility for modelling the mutual correlations between their data to a third Bayesian, who receives a transformed version of the data (y_1(x_1), y_2(x_2)) and constructs a likelihood Pr(y_1(x_1), y_2(x_2)|s_12, M_12) for this data. The full cluster expansion in Equation 2 then combines these three likelihoods to form a consistent joint probability.

The ratio Pr(x_k|s_k, M_k)/Pr(y_k(x_k)|s_k, M_k) that appears in the cluster expansion is a normalised likelihood that sums to 1 over the subset of bins that map to the same y_k(x_k), as seen in Figure 1. If Bayes' theorem is used in the form

\Pr(z \mid y(x)) = \frac{\Pr(y(x) \mid z)\, \Pr(z)}{\Pr(y(x))} = \delta(y(x) - y(z))\, \frac{\Pr(z)}{\Pr(y(x))}   (5)


then

\Pr(z_k \mid y_k(x_k), s_k, M_k) = \delta(y_k(z_k) - y_k(x_k))\, \frac{\Pr(z_k \mid s_k, M_k)}{\Pr(y_k(x_k) \mid s_k, M_k)}   (6)

So Pr(x_k|s_k, M_k)/Pr(y_k(x_k)|s_k, M_k) corresponds to Pr(z_k|y_k(x_k), s_k, M_k), which is non-zero only for y_k(z_k) = y_k(x_k). This observation leads to the "top-down" interpretation of the cluster expansion: (1) Top level: Pr(y_1(x_1), y_2(x_2)|s_12, M_12) models the joint PDF in y-space, (2) Bottom level: each of the ratios Pr(x_k|s_k, M_k)/Pr(y_k(x_k)|s_k, M_k) is a conditional PDF that generates the possible x_k that correspond to y_k(x_k).

Using Equation 1, the cluster expansion yields the following predictive distribution after integration over the parameters s_1, s_2 and s_12

\Pr(x \mid D, M) = \int ds_1\, ds_2\, ds_{12}\; \frac{\Pr(x_1 \mid s_1, M_1)}{\Pr(y_1(x_1) \mid s_1, M_1)}\; \frac{\Pr(x_2 \mid s_2, M_2)}{\Pr(y_2(x_2) \mid s_2, M_2)}\; \Pr(y_1(x_1), y_2(x_2) \mid s_{12}, M_{12})\; \Pr(s_1, s_2, s_{12} \mid D, M)   (7)

where the final factor is Pr(s_1, s_2, s_12|D, M), and the prior Pr(s_1, s_2, s_12|M) is assumed to be a product of three independent pieces. Assuming x and D are independent, Equation 7 can be rewritten as

\Pr(x \mid D, M) = \frac{\int ds_1\, ds_2\, ds_{12}\; \Pr(s_1 \mid M_1)\, \Pr(s_2 \mid M_2)\, \Pr(s_{12} \mid M_{12})\; \Pr(D' \mid s_1, s_2, s_{12}, M)}{\int ds_1\, ds_2\, ds_{12}\; \Pr(s_1 \mid M_1)\, \Pr(s_2 \mid M_2)\, \Pr(s_{12} \mid M_{12})\; \Pr(D \mid s_1, s_2, s_{12}, M)}   (8)

where D' is the training data D plus one extra sample x. Because the numerator and denominator have the same structure, only the denominator needs to be evaluated in detail. For an N sample training set the denominator reduces to

\left[\int ds_1\, \Pr(s_1 \mid M_1) \prod_{n=1}^{N} \frac{\Pr(x_1^{(n)} \mid s_1, M_1)}{\Pr(y_1(x_1^{(n)}) \mid s_1, M_1)}\right] \left[\int ds_2\, \Pr(s_2 \mid M_2) \prod_{n=1}^{N} \frac{\Pr(x_2^{(n)} \mid s_2, M_2)}{\Pr(y_2(x_2^{(n)}) \mid s_2, M_2)}\right] \left[\int ds_{12}\, \Pr(s_{12} \mid M_{12}) \prod_{n=1}^{N} \Pr(y_1(x_1^{(n)}), y_2(x_2^{(n)}) \mid s_{12}, M_{12})\right]   (9)

Note that Equation 9 is a product of three separate integrals, which makes it relatively simple to evaluate.

4. A Detailed Model

During the derivation of the predictive distribution (both here and in Appendix A) a somewhat cavalier notation will be used in order to avoid overloading the equations with indices. However, the final result will be stated with all indices correctly included (as defined in Appendix B).

There are three separate pieces of information that are needed to specify a model so that a predictive distribution can be computed. (1) The model M: Assume that all of the variables have been discretised into bins (see Figure 1), and assume that each part (M_1, M_2, and M_12 in Figure 1) of the overall model is parameterised by a separate set of bin occupation probabilities s_1, s_2, ..., s_n. (2) The priors Pr(s_1|M_1), Pr(s_2|M_2), and


Pr(s_12|M_12): Assume that the s_i each have a Dirichlet prior distribution (see Appendix A). (3) The training set D: Assume that the training data is supplied as a histogram of observed bin occupancies m_1, m_2, ..., m_n. The probability of producing these observed occupancies from the underlying occupation probabilities is s_1^{m_1} s_2^{m_2} ⋯ s_n^{m_n} (times a combinatorial factor that cancels out when computing the predictive distribution, see e.g. Equation 8).

With these assumptions Equation 9 reduces to a product of two types of factor

Type 1: \int ds\; \delta\!\left(\sum_{i=1}^{n} s_i - 1\right) \prod_{i=1}^{n} s_i^{m_i+\nu_i-1}, \qquad Type 2: \int ds\; \delta\!\left(\sum_{i=1}^{n} s_i - 1\right) \prod_{i=1}^{n} \frac{s_i^{m_i+\nu_i-1}}{\bigl(\sum_{j:\, y(j)=y(i)} s_j\bigr)^{m_i}}   (10)

where m_i, ν_i and s_i are physically different variables in the type 1 and type 2 expressions. The numerator in Equation 8 is effectively the same as the denominator with the addition of one extra training sample. If this extra sample falls into bin k, then the numerator can be obtained by modifying the factors in Equation 10 as follows: include an extra s_k (type 1), or include an extra s_k / Σ_{j: y(j)=y(k)} s_j (type 2). Both of these cases may readily be implemented by appropriately modifying the powers inside the product term in Equation 10.

When the type 1 and type 2 terms are evaluated, and the ratio of the numerator and corresponding denominator contributions in Equation 8 is taken, the following simple results emerge (see Appendix A).

Type 1: \frac{m_k+\nu_k}{\sum_{i=1}^{n}(m_i+\nu_i)}, \qquad Type 2: \frac{m_k+\nu_k}{\sum_{j:\, y(j)=y(k)}(m_j+\nu_j)}   (11)

These results may be gathered together to yield

\Pr(i, j, k, l \mid D, M) \propto \frac{\bigl(m^{1}_{1;i,j}+\nu^{1}_{1;i,j}\bigr)\bigl(m^{1}_{2;k,l}+\nu^{1}_{2;k,l}\bigr)\bigl(m^{2}_{1,2;y_1(i,j),y_2(k,l)}+\nu^{2}_{1,2;y_1(i,j),y_2(k,l)}\bigr)}{\bigl(m^{2}_{1;y_1(i,j)}+\nu^{2}_{1;y_1(i,j)}\bigr)\bigl(m^{2}_{2;y_2(k,l)}+\nu^{2}_{2;y_2(k,l)}\bigr)}   (12)

where detailed indices have been used to ensure that this result is unambiguously expressed (see Appendix B).
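One way the predictive probability of Equation 12 might be computed from the layer-1 and layer-2 histograms and order parameters is sketched below. The array layout is an assumption made purely for illustration, and the coarse marginals are obtained by summing the joint layer-2 arrays, which is consistent with the relations of Appendix B.

```python
import numpy as np

def cluster_expansion_predictive(i, j, k, l, m1_1, nu1_1, m1_2, nu1_2,
                                 m2_12, nu2_12, y1, y2):
    # m1_1, nu1_1: layer-1 counts/order parameters for cluster 1, indexed [i, j]
    # m1_2, nu1_2: the same for cluster 2, indexed [k, l]
    # m2_12, nu2_12: layer-2 joint counts/order parameters, indexed [y1_bin, y2_bin]
    # y1, y2: mappings from fine bins to coarse bins, e.g. y1[i, j] -> y1_bin
    a, b = y1[i, j], y2[k, l]
    joint = m2_12 + nu2_12
    marg1 = joint.sum(axis=1)     # coarse counts + order parameters for cluster 1
    marg2 = joint.sum(axis=0)     # ... and for cluster 2
    term1 = (m1_1[i, j] + nu1_1[i, j]) / marg1[a]
    term2 = (m1_2[k, l] + nu1_2[k, l]) / marg2[b]
    term3 = joint[a, b] / joint.sum()   # normalizing over all bins of the model
    return term1 * term2 * term3
```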

5. Integration Over the Layer to Layer Transformations

The predictive distribution in Equation 1 (and Equation 7) assumes that the layer-to-layer transformations are held fixed, i.e. they are not variable parameters of the model. In effect, the predictive distribution Pr(x|D, M) thus far calculated should strictly be written as Pr(x|D, Y, M), where the Y denotes the set of layer-to-layer transformations that is used. The full predictive distribution may now be written as

\Pr(x \mid D, M) = \int [dY]\; \Pr(x \mid D, Y, M)\, \Pr(Y \mid D, M)   (13)


where Pr(x|D, Y, M) is computed by the right hand side of Equation 1. It is important to determine which particular transformations dominate in Equation 13 so that the integral can be approximated. No attempt will be made to analyse Equation 13 in detail; rather, an attempt will be made to extract some rules of thumb which will assist in its interpretation.

A simplification can be made by rewriting Equation 13 as follows

\Pr(x \mid D, M) = \int dy_1\, dy_2 \int [dY]\; \delta(Y_1(x_1) - y_1)\, \delta(Y_2(x_2) - y_2)\; \Pr(x \mid D, Y, M)\, \Pr(Y \mid D, M)   (14)

If all points in output space are a priori equivalent, then the inner integral over Y yields a result that is independent of y_1 and y_2. The outer integral (over y_1 and y_2) merely introduces a constant overall factor, so henceforth y_1(x_1) and y_2(x_2) will be held constant in the Y integration in Equation 13. In other words, it does not matter to which particular y_1 and y_2 one decides to map x_1 and x_2, respectively.

There are two cases of Equation 14 where the Y integral may readily be simplified further: (1) the Pr(x|D, Y, M) factor dominates, or (2) the Pr(Y|D, M) factor dominates (as is generally the case). These two cases will be considered below.

5.1. Case 1

Equation 8 (which is now to be read as computing Pr(x|D, Y, M)) has all of its x dependence in the numerator, in the Pr(D'|s_1, s_2, s_12, M) factor. The transformations appear in three separate places: (1) y_1(x_1) appears in Pr(x_1|s_1, M_1)/Pr(y_1(x_1)|s_1, M_1). y_1(x_1) is the point in output space to which x_1 maps. The output PDF Pr(y_1(x_1)|s_1, M_1) is the integral of the input PDF Pr(x_1|s_1, M_1) over all those input vectors that map to y_1(x_1). In the Y integration this term is large for those Y that map only a small number of input vectors to the given y_1(x_1), because then the PDF Pr(y_1(x_1)|s_1, M_1) is small. (2) There is a similar term for y_2(x_2). (3) y_1(x_1) and y_2(x_2) both appear in Pr(y_1(x_1), y_2(x_2)|s_12, M_12). Typically (see the example below), the output PDF Pr(y_1(x_1), y_2(x_2)|s_12, M_12) is large for those Y that map a large number of members of the training set to the given y_1(x_1) and y_2(x_2), which are assumed to be held constant as discussed after Equation 14.

These observations lead to two opposing effects. (1) and (2) above require that few other input vectors map to y_1(x_1) and y_2(x_2), whereas (3) above typically requires that many members of the training set map to y_1(x_1) and y_2(x_2).

Suppose that the training data is drawn from a distribution in which x_1 and x_2 are related to each other as x_1 = x_1(x_2) + ε_1(x_2) and x_2 = x_2(x_1) + ε_2(x_1), where ε_1(x_2) and ε_2(x_1) are localised noise processes. If x_1 is known then x_2 is also known approximately, and vice versa. The dominant contributions to the Y integral then come from transformations that have the general form shown in Figure 1. The set of input vectors that map to y_1(x_1) and y_2(x_2) typically occupy patches of input space in the neighbourhoods of x_1 and x_2, respectively, whereas all other regions of input space typically map to other output values (i.e. not to y_1(x_1) and y_2(x_2)). In effect, Y is unconstrained, apart from in the vicinity of x_1 and x_2 where Y is constrained to map to the given y_1(x_1) and y_2(x_2). Actually, it is a little more complicated than this, because the border between the constrained and unconstrained regions is somewhat blurred, and it depends on how much training data there is, but the general behaviour is typically as described.


5.2. Case 2

Using Bayes' theorem, Pr(Y|D, M) may be written as

\Pr(Y \mid D, M) = \frac{\Pr(D \mid Y, M)\, \Pr(Y \mid M)}{\Pr(D \mid M)}   (15)

The Y dependence is contained entirely in the numerator, and it will be assumed to be dominated by the training set likelihood term Pr(D|Y, M). The problem of choosing a single dominant mapping thus may be written as

Y_0 = \underset{Y}{\operatorname{argmax}} \sum_{i=1}^{N} \log \Pr(x_i \mid Y, M)   (16)

For the cluster expansion shown in Figure 1, Equation 12 may be used to reduce this to

Y_0 = \underset{Y}{\operatorname{argmax}} \sum_{\text{training set}} \log \left( \frac{m^{2}_{1,2;y_1(i,j),y_2(k,l)} + \nu^{2}_{1;y_1(i,j)}\, \nu^{2}_{2;y_2(k,l)}}{\bigl(m^{2}_{1;y_1(i,j)}+\nu^{2}_{1;y_1(i,j)}\bigr)\bigl(m^{2}_{2;y_2(k,l)}+\nu^{2}_{2;y_2(k,l)}\bigr)} \right)   (17)

where each training vector has been expressed in bin notation (( i, j), (k, l)) (as described in Appendix B). In the limit where the number of histogram counts dominates the size of the order parameters this reduces to

Y_0 \simeq \underset{Y}{\operatorname{argmax}} \sum_{\text{training set}} \log \left( \frac{m^{2}_{1,2;y_1(i,j),y_2(k,l)}}{m^{2}_{1;y_1(i,j)}\; m^{2}_{2;y_2(k,l)}} \right)   (18)

which is equivalent to maximising the mutual information between the outputs y_1(x_1) and y_2(x_2).
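The criterion of Equation 18 can be evaluated from the layer-2 histograms alone; a brief sketch follows (the helper name and the availability of the joint histogram of coarse outputs over the training set are assumptions).

```python
import numpy as np

def transverse_mutual_information(joint_counts):
    # joint_counts: 2-D histogram m2[y1, y2] of the coarse outputs over the training set
    m = np.asarray(joint_counts, dtype=float)
    N = m.sum()
    p12 = m / N
    p1 = p12.sum(axis=1, keepdims=True)
    p2 = p12.sum(axis=0, keepdims=True)
    mask = p12 > 0
    # N * I(y1; y2), which equals the sum in equation (18) plus the Y-independent constant N*log(N)
    return N * np.sum(p12[mask] * np.log(p12[mask] / (p1 @ p2)[mask]))
```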

6. Extensions to the Cluster Expansion Model

In [2] a similar mutual information result was obtained for an L layer cluster expansion model. A global optimisation criterion was used, which was equivalent to the sum of all of the transverse mutual informations throughout the network. In effect, the network actively adjusted its layer-to-layer transformations to "lock onto" correlations between different subspaces of the input data.

In [3] the optimisation of an L layer cluster expansion model was investigated in detail. Because the optimisation criterion depended simultaneously on what each layer was doing, the layer-to-layer mappings had to co-operate in order to find their optimum configuration. Two types of information flowed through the network: (1) Bottom-up flow of data from layer to layer, as already described above, (2) Top-down flow of control signals from layer to layer, required to implement the optimisation. In effect the higher layers of the cluster expansion model controlled the lower layers during the optimisation process. This effect is called "self-supervision" because it is exactly like standard "supervision" of a multilayer neural network, except that here the backward propagating signals are internally generated within the network itself.

In [4] it was shown how to embed many cluster expansion models into a single translation invariant layered network structure. This is very useful for image processing applications.


7. Conclusions

This paper presents a Bayesian treatment of the cluster expansion model. It shows that the cluster expansion approach to density modelling allows Bayesian predictive distributions to be derived in closed form, if it is assumed that the underlying histograms follow Dirichlet prior distributions. The form of the results thus obtained is essentially the same as those previously derived using more ad hoc approaches (see e.g. [1]).

A The Dirichlet distribution

A more detailed discussion on the use of Dirichlet prior distributions is given by Skilling and Sibisi in these proceedings.

The Dirichlet distribution for n bins is defined as

\Pr_{\mathrm{Dirichlet}}(s \mid M) = \frac{\Gamma(\nu_1+\cdots+\nu_n)}{\Gamma(\nu_1)\cdots\Gamma(\nu_n)}\; \delta\!\left(\sum_{i=1}^{n} s_i - 1\right) \prod_{i=1}^{n} s_i^{\nu_i-1}   (19)

where s_i and ν_i are the occupation probability and the order parameter assigned to bin i, respectively. Note that ∫ ds Pr_Dirichlet(s|M) = 1 and ∫ ds Pr_Dirichlet(s|M) s_i = ν_i/(ν_1 + ... + ν_n).

The cluster expansion requires that the following integrals be evaluated (see Equation 10)

Type 1: \int ds\; \delta\!\left(\sum_{i=1}^{n} s_i - 1\right) \prod_{i=1}^{n} s_i^{m_i+\nu_i-1}, \qquad Type 2: \int ds\; \delta\!\left(\sum_{i=1}^{n} s_i - 1\right) \prod_{i=1}^{n} \frac{s_i^{m_i+\nu_i-1}}{\bigl(\sum_{j:\, y(j)=y(i)} s_j\bigr)^{m_i}}   (20)

The normalisation property of the Dirichlet distribution may be used to evaluate the type 1 integral to give the results (for both the numerator and denominator contributions to Equation 8)

Type 1 denominator = \frac{\prod_{i=1}^{n}\Gamma(m_i+\nu_i)}{\Gamma\!\left(\sum_{i=1}^{n}(m_i+\nu_i)\right)}, \qquad Type 1 numerator = \frac{\prod_{i=1}^{n}\Gamma(m_i+\nu_i+\delta_{ik})}{\Gamma\!\left(\sum_{i=1}^{n}(m_i+\nu_i)+1\right)}   (21)
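These Gamma-function ratios are conveniently evaluated in log space; a brief sketch using scipy.special.gammaln is given below as an illustration (it is not part of the original paper).

```python
import numpy as np
from scipy.special import gammaln

def log_type1_factor(m, nu, extra_bin=None):
    # log of the Type 1 factor of equation (21); if extra_bin is given, the histogram is
    # augmented with one extra count in that bin (the 'numerator' case)
    counts = np.asarray(m, dtype=float) + np.asarray(nu, dtype=float)
    if extra_bin is not None:
        counts = counts.copy()
        counts[extra_bin] += 1.0
    return np.sum(gammaln(counts)) - gammaln(np.sum(counts))

# The type 1 predictive ratio (m_k + nu_k) / sum_i (m_i + nu_i) is then
# exp(log_type1_factor(m, nu, k) - log_type1_factor(m, nu)).
```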

The type 2 integral requires rather more effort to evaluate. Use the result

1 = \int dT_y\; \delta\!\left(\sum_{i:\, y(i)=y} s_i - T_y\right)   (22)

and introduce scaled coordinates

t_i = \frac{s_i}{T_{y(i)}}   (23)

to transform the type 2 integral of Equation 20 into the form

\int \prod_{y} dT_y \prod_{i=1}^{n} dt_i\; \delta\!\left(\sum_{y} T_y - 1\right) \prod_{y} \delta\!\left(\sum_{i:\, y(i)=y} t_i - 1\right) \prod_{y} T_y^{\left(\sum_{i:\, y(i)=y}\nu_i\right)-1} \prod_{i=1}^{n} t_i^{m_i+\nu_i-1}   (24)


The type 2 integral then becomes

\int \prod_{y} dT_y\; \delta\!\left(\sum_{y} T_y - 1\right) \prod_{y} T_y^{\left(\sum_{i:\, y(i)=y}\nu_i\right)-1}
\times \int \prod_{i=1}^{n} dt_i\; \prod_{y} \delta\!\left(\sum_{i:\, y(i)=y} t_i - 1\right) \prod_{i=1}^{n} t_i^{m_i+\nu_i-1}   (25)

where δ(Σ_{i=1}^{n} s_i − 1) = δ(Σ_y T_y − 1) and ds_i = dt_i T_{y(i)} have been used. Note that the first and second lines of this result are separate integrals. The first line cancels between the numerator and denominator contributions in Equation 8, so only the second line needs to be retained, and it may be evaluated by making use of the normalisation property of the Dirichlet distribution, to yield the results

Type 2 denominator = \prod_{y} \frac{\prod_{i:\, y(i)=y}\Gamma(m_i+\nu_i)}{\Gamma\!\left(\sum_{i:\, y(i)=y}(m_i+\nu_i)\right)}, \qquad Type 2 numerator = \prod_{y} \frac{\prod_{i:\, y(i)=y}\Gamma(m_i+\nu_i+\delta_{ik})}{\Gamma\!\left(\sum_{i:\, y(i)=y}(m_i+\nu_i)+\delta_{y,y(k)}\right)}   (26)

B Coupling the layers of the cluster expansion together

The notation that will be used is

m^l_{c;k} = counts in bin k of cluster c in layer l, \qquad ν^l_{c;k} = order parameter corresponding to bin k of cluster c in layer l   (27)

So, for instance, the notation used for the cluster expansion in Figure 1 consists of m^1_{1;i,j} and m^1_{2;k,l} (layer 1, clusters 1 and 2, bins (i,j) and (k,l) respectively), and m^2_{1,2;y_1,y_2} (layer 2, cluster (1,2), bin (y_1, y_2)). There is an analogous notation for the order parameters. Note that the histogram counts and order parameters in the two layers of the cluster expansion in Figure 1 are related.

The histogram counts in layer 2 are determined as follows

m^{2}_{1,2;y_1,y_2} = \sum_{i,j:\, y_1(i,j)=y_1}\; \sum_{k,l:\, y_2(k,l)=y_2} m^{1}_{1,2;i,j,k,l}   (28)

where m^1_{1,2;i,j,k,l} is the full joint histogram in layer 1, from which the histogram counts in layer 1 may be determined by marginalisation

m^{1}_{1;i,j} = \sum_{k,l} m^{1}_{1,2;i,j,k,l}, \qquad m^{1}_{2;k,l} = \sum_{i,j} m^{1}_{1,2;i,j,k,l}   (29)

Thus m^2_{1,2;y_1,y_2} is related to m^1_{1;i,j} and m^1_{2;k,l} via m^1_{1,2;i,j,k,l}.

The order parameters in layers 1 and 2 are related in a less obvious way than the histogram counts. Referring to Figure 1, regard the transformation from layer 1 to layer 2 as a concatenation of the following two operations:


(1) Each cluster histogram is rebinned into coarser bins; (2) a joint histogram is formed from 2 or more coarse-binned histograms. The Dirichlet distribution has the property that when two or more bins are combined to create a larger bin, the resulting distribution is still Dirichlet, but with an order parameter equal to the sum of the order parameters of the original bins. Thus step 1 above produces the following summed order parameters

Cluster 1: \sum_{i,j:\, y_1(i,j)=y_1} \nu^{1}_{1;i,j}, \qquad Cluster 2: \sum_{k,l:\, y_2(k,l)=y_2} \nu^{1}_{2;k,l}   (30)

Step 2 forces these summed order parameters to refer to the two marginalised versions of layer 2 of Figure 1, i.e. summing the layer 2 bins down the columns or along the rows, respectively. In order for the order parameters for layers 1 and 2 to be consistent with each other they must therefore satisfy the constraints

Cluster 1: \sum_{y_2} \nu^{2}_{1,2;y_1,y_2} = \sum_{i,j:\, y_1(i,j)=y_1} \nu^{1}_{1;i,j}, \qquad Cluster 2: \sum_{y_1} \nu^{2}_{1,2;y_1,y_2} = \sum_{k,l:\, y_2(k,l)=y_2} \nu^{1}_{2;k,l}   (31)

Note that these are insufficient constraints to completely determine ν^2_{1,2;y_1,y_2} from ν^1_{1;i,j} and ν^1_{2;k,l}.

References

[1] S. P. Luttrell, "The use of Bayesian and entropic methods in neural network theory", Maximum Entropy and Bayesian Methods, ed. J. Skilling, Kluwer, pp: 363-370, 1989.

[2] S. P. Luttrell, "A hierarchical network for clutter and texture modelling", Proceedings of the SPIE Conference on Adaptive Signal Processing, ed. S. Haykin, San Diego, Vol. 1565, pp: 518-628, 1991.

[3] S. P. Luttrell, "Adaptive Bayesian networks", Proceedings of the SPIE Conference on Adaptive and Learning Systems, ed. F. A. Sadjadi, Orlando, Vol. 1706, pp: 140-151, 1992.

[4] S. P. Luttrell, "A trainable texture anomaly detector using the Adaptive Cluster Ex­pansion (ACE) method", RSRE Memorandum, No. 4437, 1990.


THE PARTITIONED MIXTURE DISTRIBUTION: MULTIPLE OVERLAPPING DENSITY MODELS

Stephen P Luttrell Defence Research Agency St Andrews Rd, Malvern, Worcestershire, WR14 3PS, United Kingdom [email protected]

©British Crown Copyright 1994 / DRA Published with the permission of the Controller of Her Britannic Majesty's Stationery Office

ABSTRACT. In image processing problems density models are often used to characterise the local image statistics. In this paper a layered network structure is proposed, which consists of a large number of overlapping mixture distributions. This type of network is called a partitioned mixture distribution (PMD), and it may be used to apply mixture distribution models simultaneously to many different patches of an image.

1. Introduction

A partitioned mixture distribution (PMD) [1] is a set of overlapping mixture distributions, which is used in the simultaneous density modelling of many low-dimensional subspaces of a high-dimensional dataset. This type of problem can arise in image processing, for instance. The theory of standard mixture distributions is discussed, and then extended to encompass PMDs. An expectation-maximisation (EM) optimisation scheme is derived.

2. Notation

The following notation is used in this paper. M = model, D = training set of data, N = number of samples in the training set, x = input vector, s = parameter vector, Q(x|t) = parametric PDF used for fitting a Bayesian predictive distribution, t = parameter vector, Σ_c Q(x|t_c) Q(c) = mixture distribution form of Q(x|t), Q(x|t_c) = class PDF, Q(c) = prior PDF, c = class label, G = relative entropy between the fitting PDF and the Bayesian predictive distribution, G_0 = relative entropy between the fitting PDF and the training data, δ(...) = Dirac delta function, ν = noise parameter, n = (odd) number of components in a mixture distribution, n/2 = (rounded down to nearest integer) half width of a mixture window in a PMD, n_0 = size of PMD (or number of embedded mixture distributions), S = entire set of PMD parameters, Q(x, c|S) = joint PDF of input and class in a PMD, Q_c(x|S) = mixture distribution centred at location c in a PMD, Q_c(c'|x, S) = posterior probability for class c' in the mixture distribution centred at location c in a PMD, Q(c|x, S) = average over overlapping mixture distributions of the posterior probability of class c in


a PMD, m_c = mean of a Gaussian class PDF, Λ_c = covariance of a Gaussian class PDF, M^0_c = zeroth moment of the training data for class c, M^1_c = first moment of the training data for class c, M^2_c = second moment of the training data for class c, Q̄(c|S) = version of Q(c|S) converted to PMD posterior probability form (i.e. the same functional form as Q(c|x, S)), ε = leakage factor, ΔG = difference in relative entropy for two different choices of parameter values, ΔG_0 = lower bound for ΔG.

3. Fitting a Predictive Distribution with a Mixture Distribution

In general a Bayesian predictive distribution can be written as

\Pr(x \mid D, M) = \int ds\; \Pr(x \mid s, M)\, \Pr(s \mid D, M)   (1)

where the integral over parameters s is usually difficult to do exactly. There are many ways to alleviate this s integral problem, such as the cluster expansion model discussed by Luttrell in these proceedings. However, one possible approach would be to approximate Pr(x|D, M) by a simple parametric PDF Q(x|t), whose parameter vector t is adjusted so that Q(x|t) ≈ Pr(x|D, M) according to some appropriate fitting criterion. Note that it is not appropriate to think of Q(x|t) as based on a model, because it is used solely as a numerical trick to speed up computations at the cost of some loss of accuracy. This is the reason that the notation Q is used in preference to Pr.

The goodness of fit criterion that will be used here is the relative entropy G defined as

G = -\int dx\; \Pr(x \mid D, M)\, \log\!\left(\frac{\Pr(x \mid D, M)}{Q(x \mid t)}\right)   (2)

where G ≤ 0, and G = 0 iff Q(x|t) = Pr(x|D, M) for all x. The goal is to find the parameter vector t that maximises G, and which yields an optimal approximation Q(x|t) to Pr(x|D, M). During the fitting process itself the predictive distribution Pr(x|D, M) must be evaluated many times, which can be computationally expensive. However, once the optimum parameter vector has been located, the approximation Q(x|t) is used thereafter in place of Pr(x|D, M).

In this paper Q(x|t) will be a mixture distribution [2]

Q(x \mid t) = \sum_{c=1}^{n} Q(x \mid t_c)\, Q(c)   (3)

where Q(x|t_c) is a normalised PDF, and the coefficients Q(c) sum up to 1. The relative entropy defined in Equation 2 may then be written as

G = \int dx\; \Pr(x \mid D, M)\, \log\!\left(\sum_{c=1}^{n} Q(x \mid t_c)\, Q(c)\right) + \text{constant}   (4)

If the constant term is ignored, then G has the following frequentist interpretation. It is proportional to the logarithmic likelihood of drawing from the mixture distribution a large number of samples with a distribution given by the Bayesian predictive distribution.

Figure 1 shows a three layer network representation of a mixture distribution.


Figure 1: A mixture distribution represented as a simple type of neural network.

the hidden layer by "links" that make use of the parameters tc to compute the likelihoods Q(xltc}, which are then stored in the hidden layer. The nodes in the hidden layer are then connected to the node in the output layer via conventional links weighted by theQ(c}, which do a weighted sum of the likelihoods to produce the mixture probability Q(xlt), which is the required output. This "neural network" is trained to maximise the relative entropy criterion in Equation 4. Note that the hidden-to-output weights sum up to one.

4. Frequentist Versus Bayesian Mixture Optimisation

It is useful to note that the frequentist way of optimising a mixture distribution is to maximise the logarithm of the likelihood that the mixture distribution can generate the training data, which can be written as

G_0 = \frac{1}{N} \sum_{i=1}^{N} \log\!\left(\sum_{c=1}^{n} Q(x_i \mid t_c)\, Q(c)\right)   (5)

where i ranges over the training set. Note that G_0 and G are very similar except for the choice of measure to use in the integral over x-space. These measures are, respectively

\frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i) \quad \text{(frequentist)}, \qquad \Pr(x \mid D, M) \quad \text{(Bayesian)}   (6)

The Bayesian method uses the model to "fill in" the gaps between the delta functions, and in the limit of a sufficiently large amount of appropriately distributed training data will produce the same result as the frequentist method.
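For a Gaussian mixture, the frequentist objective of Equation 5 might be computed as in the following sketch. The parameterization by means, covariances and mixing weights is an assumption used purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def frequentist_objective_G0(X, means, covs, weights):
    # X: (N, d) training vectors; means[c], covs[c], weights[c] parameterize component c
    likelihoods = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=C)
        for m, C, w in zip(means, covs, weights)])      # Q(x_i | t_c) Q(c)
    return np.mean(np.log(likelihoods.sum(axis=1)))     # equation (5)
```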

If a non-parametric model with a Dirichlet prior is used, then the predictive distribution reduces to

\Pr(x \mid D, M) = \frac{1-\nu}{N} \sum_{i=1}^{N} \delta(x - x_i) + \nu   (7)

The procedure for maximising G is then very similar to maximising G_0, whatever the size of the training set. The constant ν term corresponds to injecting a flat background distribution of training vectors into the training set.


5. The Partitioned Mixture Distribution

A set of overlapping mixture distributions is called a partitioned mixture distribution (PMD). In order to avoid unnecessary complications, the PMD will be assumed to have periodic boundary conditions. The derivations may readily be modified to account for other types of boundary condition.

Figure 2: A partitioned mixture distribution viewed as a neural network.

Figure 2 shows the PMD that is suggested by the mixture distribution shown in Figure 1. It is simplest to inspect Figure 2 starting from one of the output nodes. Each output node depends on a number of hidden nodes (three in this case), each of which in turn depends on a number of input nodes (five in this case). The nodes on which the highlighted output node depends are also highlighted in Figure 2. The computations performed by each layer and by the links between the layers are the same as in a standard mixture distribution (see Figure 1). Thus the highlighted output node computes a mixture distribution approximation to the marginal PDF of the highlighted nodes in the input layer. Overall, the PMD computes separate mixture distribution approximations to each of the marginal PDFs that can be obtained by laterally shifting the highlighted region in the input layer. Because these overlapping mixture distributions share parameters they are not independent of each other. This is the price that has to be paid for embedding a large number of mixture distributions in a single-layered network structure.

The generalisation of Equation 4 is (n/2 is rounded down to the nearest integer)

G = \sum_{c=1}^{n_0} \int dx\; \Pr(x \mid D, M)\, \log\!\left( \frac{\sum_{c'=c-n/2}^{c+n/2} Q(x \mid t_{c'})\, Q(c')}{\sum_{c'=c-n/2}^{c+n/2} Q(c')} \right) + \text{constant}   (8)

where the outer summation over c scans across the PMD, and the inner summations over c' compute the mixture distribution that is centred at location c in the PMD. For simplicity, boundary effects have been ignored (in effect, circular boundary conditions have been assumed). A normalisation factor has been included at each location c to ensure that the hidden-to-output layer weights sum up to unity for each mixture distribution. Note that each mixture distribution "sees" only a marginal PDF of the full predictive distribution Pr(x|D, M).

The relative entropy in Equation 8 is then maximised with respect to the t_c parameters and the Q(c) weights. The eventual result of this is a set of mixture distributions that approximate the marginal PDFs of the predictive distribution Pr(x|D, M).


6. Optimising a Partitioned Mixture Distribution

The relative entropy in Equation 8 can be maximised by using the so-ealled expectation maximisation (EM) method [3]. After some algebra (see the Appendix and [1,4]) this yields the following iterative update algorithm for modifying the PMD parameters

Snew = arg ;ax ~ J dx Pr(xID, M) (Q(CIX, Sold) log Q(x, ciS) - log e'~~:/2 Q(C'IS))

(9) where S stands for the entire set ofPMD parameters S == {te,Q(c): C = 1,2,· · ·,no}, and the subscripts "old" and "new" refer to the values of these parameters on successive iterations of the EM algorithm.

There are several additional pieces of notation used in this EM update equation.

Q(cIS) Q(c) as specified in the parameter set S

Q(xlc, S) Q(xltc) with te as specified in the parameter set S (10)

Q(clx, S) Q( IS) ,\,e+n/2 1 X , C wc'=e-n/2 ,\,CI+n / 2 Q(X e"iS)

L..Jc"=c l -n j 2 )

The first two of these definitions are straightforward, but the definition of Q( clx, S) requires some clarification. It is the sum of the n different posterior probabilities of c given x that are computed by the mixture distributions that overlap node c in the PMD.

Apart from these details, Equation 9 is identical to the EM update equation for a standard mixture distribution.

7. Gaussian Partitioned Mixture Distribution

A Gaussian PMD uses class likelihood functions defined as follows

Q(xlc, S)== Jdet127rAe exp(-~(x-mefA;;-l(x-me)) (11 )

where me is the class mean and Ae is the class covariance. The EM update equation can readily be implemented by defining the following moments

J dx Pr(xID , M) Q(clx , Sold)

J dx Pr(xID, M) Q(clx, Sold) x

J dx Pr(xID, M) Q(clx, Sold) x x T

whence the update equations reduce to

M;(Sold) (S) (S )T MO(S ) - me new me new c old

(12)

(13)

Page 288: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

284 s. P. Luttrell

where Q(cISnew) is defined as

(14)

The second and third update equations yield mc(Snew) and Ac(Snew) directly, but the first update equation must be solved iteratively to obtain Q(cISnew ).

8. Conclusions

A PMD is a convenient way of embedding many mixture distributions into a single-layered network structure. For instance, it may be used to build density models of a large number of separate patches of an image simultaneously. This convenience is bought at the cost of forcing neighbouring mixture distributions to share parameters. The EM method of optimising a standard mixture distribution may readily be extended to a PMD.

A The Expectation-Maximisation (EM) Algorithm for PMDs

The maximisation of the relative entropy in Equation 8 needs to be converted into an explicit training algorithm. There are two basic types of algorithm: the expectation-maximisation (EM) method [3], and various gradient ascent methods. The former is suitable for batch training, whereas the latter is suitable for on-line training. In this paper only batch training is considered.

Use the notation S to denote the entire set of PMD parameters S == {tc , Q(c) : c =

1, 2, . . . ,no}, and the subscripts "old" and "new" denote the values of these parameters on successive iterations of the EM algorithm.

First of all define the difference (:)'G(S, Sold) between the relative entropy evaluated for two different choices of the ,parameters S and Sold

(15)

Define some convenient notation. The mixture distribution evaluated at location c in the PMD has a d ensity given by

Q (xiS) == '£~-:::!~n/2 Q(xlc', S) Q(e'IS)

c '£~~~~n/2 Q(c'IS) (16)

and the posterior probability for class c' in the mixture distribution centred at location c in the PMD is

{

Q(xld ,S) Q(dIS) ,£c+n/2 Q(xld' S) Q(d'IS)

Qc(c'lx, S) == 0 c"=c - n/2 '

Ie' - cl ~ n!2 (17)

Ie' - cl > n!2

Page 289: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

The Partitioned Mixture Distribution 285

Then the EM update prescription can be derived as follows

!'::,.G(S S) ""no J d P ( ID M) I Wc'=c-n/2 C , C ( ""c+n/2 Q(xl' S) Q( 'IS) )

, old = wc=l X r X, og Qc(XISold) L::;;:~~n/2 Q(c"IS)

L:~~1 J dx Pr(xID, M)

I (""c+n/2 Q ('I S) Q(XIc',S) Q(c'IS) ) X og wd=c-n/2 c C X, old Qc(X,dISo1d ) L::;;:~~n/2 Q(d'IS)

(18)

where !'::,.Go(S, Sold) is given by

!'::,.Go(S, Sold) = I:~~l J dx Pr(xID, M)

X (Q(clx, Sold) log Q(x, ciS) -log I:~~n2n/2 Q(c'IS)) (19)

+a term that is independent of S

The penultimate step of .the derivation in Equation 18 was obtained by using Jensen's inequality for convex functions, and the last step was obtained by using the following results

c+n/2 L Qc(c'lx, S) = 1 (20)

d=c-n/2

and

(21)

I:~O=l Q(cllx, S) ( ... )

where the term ( ... ) does not depend on c. Combining the above results yields

(22)

where !'::,.G(S, S) = O. Thus flG(S, Sold) + G(Sold) can be used to locate the greatest lower bound of G(S), so a single update of the EM algorithm is implemented as

argmax Snew = S flG(S, Sold) (23)

Page 290: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

286 S. P. Luttrell

which may be written out in full as

Snew = arg;ax f J dx Pr(xID,M) c=l

( c+n/2 )

Q(clx, Sold) log Q(x, ciS) - log c'=~n/2 Q(c'IS)

(24)

References

[1) S. P. Luttrell, "The partitioned mixture distribution: an adaptive Bayesian network for low-level image processing", lEE Proceedings on Vision, Image and Signal Processing, Vol. 141, No.4, pp: 251-260, 1994.

[2] D. M. Titterington, A. F. M. Smith and U. E. Makov, "Statistical analysis of finite mixture distributions", Wiley, 1985.

[3) A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society Series B, Vol. 39, pp: 1-37, 1977.

[4] S. P. Luttrell, "An adaptive Bayesian network for low-level image processing", Proceed­ings of the 3rd lEE International Conference on Artificial Neural Networks, Brighton, pp: 61-65, 1993.

Page 291: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

GENERATING FUNCTIONAL FOR THE BBGKY HIERARCHY AND THE N-IDENTICAL-BODY PROBLEM

Stephen F. Gull and Anthony J.M. Garrett Mullard Radio Astronomy Observatory Cavendish Laboratory Madingley Road Cambridge CB3 ORE, U.K.

ABSTRACT. We formulate the N -body problem for identical particles by identifying a generat­ing functional for the reduced n-particle probability density functions, n :::; N. The evolution equation for the generating functional follows by substituting the corresponding representation of the Liouville density PN, called the de Finetti representation, into Liouville's equation. We illus­trate this process in model problems, which indicate that there is a difficulty when the flux of the generator depends on its properties away from the constraint plane of fixed normalisation. The evolution equation for the generating functional then depends crucially on N, the actual number of particles, which in real problems is seldom known exactly. For fixed transition rates, and for bosons satisfying detailed balance, the difficulty does not arise, and the de Finetti representation provides useful insights. For the general case of classical statistical mechanics, the difficulty forces us to conclude that the de Finetti generator is not a useful alternative to the BBGKY hierarchy.

1. Introduction

In this paper we develop a novel approach to the N-body problem, following a sugges­tion of Jaynes (1986). The essential new idea is to try to remove the number of particles from the problem, by developing a generating functional whose successive moments corre­spond to the number of particles. This approach has applications wherever the N-body problem plays a role: the kinetic theory of gases (Balescu 1975), plasma physics (Clemmow & Dougherty 1969) and non-relativistic gravitational systems, such as the violent relaxation of galaxy mergers (Lynden-Bell 1967, Saslaw 1985).

We begin by reviewing, in Section 2, the conventional approach that leads to the development of the BBGKY hierarchy and its associated closure problem. In Section 3, we look at the problem of correlated distributions using model problems. We explore the defects of closure schemes by giving examples, and introduce our new approach based upon a generating function for the N-particle probability density function (pdf). By considering the time evolution of very simple systems, we show how partial differential equations for the generating function can be derived and solved. Remarkably, the equation for the evolution of the generator is far simpler than for the pdfs. We give a condition that determines whether it is possible to formulate the evolution equation for the generating function in a manner independent of the number N of particles present - a quantity which is rarely

287

J. Skilling and S. Sibisi (eds.). Maximum Entropy and Bayesian Methods. 287-301. © 1996 Kluwer Academic Publishers.

Page 292: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

288 S.F. Gull and A.J .M. Garrett

known exactly. When this is possible, the extra complexity of the N-body problem over the two-body problem is thrown into the routine generation ofthe pdfs from the generating function. We then derive (Section 4) evolution equations for the general case of particles interacting via two-body forces. It is not possible to eliminate the number of particles from this problem, so that the generating function approach to time evolution is unpromising. The choice of a measure in function space is considered in Section 5, and results of using a Gaussian generating function are presented for various choices of measure. Second-order Mayer cluster closure and MaxEnt closure arise as special cases. Section 6 presents our conclusions.

2. Kinetic theory and the BBGKY hierarchy

Our problem concerns a system of N particles interacting through conservative two­particle forces. The particles are 'identical' in the sense that the Hamiltonian of the system is symmetric with respect to exchange of any two particles. Specifically, we study a classical system governed by the Hamiltonian

H = L tm lv il 2 + L V(IXi - Xjl), i,j<i

(2.1)

where {Xi, Vi} are the positions and velocities of the particles, m their mass and V the potential energy function . Statistical mechanics of this system, following Gibbs, treats the evolution in time t of the joint pdf PN ( {Xi, V;}; t) of all N particles in a 6N -dimensional phase space. In this large phase space the system corresponds to a point, which moves under the influences of the velocities {v;} and accelerations {ai}. Our pdf PN for this point evolves according to Liouville's equation

where

dPN == {)PN + LVi. {)PN + Lai. {)PN = 0, dt {)t . {)Xi . {)v i , ,

mai = LFij, j#.i

(2.2)

(2.3)

The characteristics of Liouville's equation are Hamilton's equations. Liouville's equation further implies that the Gibbs entropy S(PN) == - J dr PN log PN of the joint pdf is a constant of the motion. We denote the full phase space r == 11 EB 12 EB .... EB IN, where I == {x, v} is the single-particle phase space~

The dimensionality of this equation, typically quoted as '" 1023 , makes its solution so daunting that it is traditional to give up on exact techniques and reduce the dimensionality of the problem by marginalisation, usually to the one-particle ·marginal distribution

(2.4)

If the initial pdf PN(t = 0) and the Hamiltonian are invariant under exchange of any two particles, it follows that PN(t) retains this symmetry at all times; it is said to be exchange­able. In consequence, it does not matter which of the N -1 coordinates are integrated out, so that on marginalisation

J "v,, . {) PN = N v. a PI CiI) . d'2 d'3 ... d'N L.- a . Xi ax •

(2.5)

Page 293: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

GENERATING FUNCTIONAL FOR THE BBGKY HIERARCHY 289

Now, the force on any particle depends on where the other particles actually are, which in turn is influenced by the position of the original particle. As a result the term involving the two-particle force Fij cannot be marginalised further than P2 without losing vital information. Marginalisation to obtain an evolution equation for PI (,1; t) therefore involves a term which is an integral over P2(!1,,2;t); likewise the equation for the evolution of P2 depends upon P3 , and so on back up to PN, which alone determines its own future.

The resulting set of equations is known as the BBGKY hierarchy, after the initials of its inventors (Balescu 1975), and various methods have been proposed to close this set. This is the 'closure problem'. A standard method of closure is the Mayer cluster expansion (see, for example, Balescu 1975) which, with a further abbreviation of notation P(12) == P2 ( 11, 12), can be written as

P(12) = P(1)P(2) + M(12), (2.6)

P(123) = P(1)P(2)P(3) + M(12)P(3) + M(23)P(1) + M(31)P(2) + M(123),

and so on for P(1234) and higher distributions. Closure at first order involves setting M(12) and all higher correlations to zero, so that all the higher distributions are indepen­dent. Closure at second order is less crude, setting M(123) and above to zero. Whilst this is a sensible ad hoc scheme for representing small perturbations about independent distributions, the method lacks a coherent physical rationale.

Strictly, the very existence of a closure problem means that the attempt to reduce the dimensionality of the problem by marginalisation has failed. No matter how successfully we can approximate the physical content of the original equations via closure schemes, it is still useful to look for a new approach in which the closure problem does not arise.

3. Correlations in a toy problem

Before discussing the general problem of correlated particles in statistical mechanics, it is useful to consider a greatly simplified 'toy' problem, in which we distinguish only two separate regions of phase space. Suppose that the numbers of particles in each region are nI, n2, with nl + n2 = N; we study the joint pdf P( nl, n2). If the particles are independent, this pdf is just the binomial one:

(3.1)

where II, h are the one~particle probabilities satisfying II + h = 1. Any deviation from this distribution implies that the particles are correlated. If the distribution of P(nI,n2) is broader than the binomial distribution having the same mean then the particles are positively correlated; this situation arises if the forces between the particles are attractive. If the forces are repulsive, on the other hand, negative correlations are induced; these tend to equalise the number of particles in the two regions, making P( nl, n2) narrower than its equivalent binomial. Examples of independent and correlated distributions are shown in Figure 1: the distribution represented by the bold bars is uncorrelated, and is binomial; the 'flat' distribution is positively correlated; and the distribution represented by the narrow bars is negatively correlated. In fact, the first two distributions are the stationary solutions of the model problems described in Sections 3.3 and 3.4 respectively.

Page 294: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

290 S.F. Gull and A.J.M. Garrett

C IN Z . II 0 N

C

~

r

- r--

h o o

I I I I I I ll .Hl In D r H 1 J I I I I I I o 10

n, Fig. 1. Examples of correlated distributions for N = 30.

3.1. Closure schemes for the toy problem

20 30

We can illustrate the deficiencies of various closure schemes by applying them to our simple example. Suppose, first, that the joint distribution of two particles is completely correlated, and is given by

1 3 P(2,0) = 4' P(0,2) = 4' P(1,1) = O. (3.2)

Evaluating the correlation coefficients required by the Mayer scheme, we find

M(ll) = M(22) = :2' M(12) = M(21) = - 332, (3.3)

If we impose second-order Mayer closure, setting M(123) = 0, we find for the third-order probabilities

( 1)3 1 3 5 27 P(3,0) = - + 3 X - X - = -, P(0,3) =-, ·4 4 16 32 32

3 3 P(2, 1) = 32' P(1,2) = - 32' (3.4)

This last result - a negative probability - reveals a clear failure of the method. While Mayer closure may be a reasonable approximation for weak correlations, it clearly cannot stand as a general method.

An alternative scheme, after the manner of Inguva et al. (1987) and Karkheck (1989), might be to use MaxEnt closure, maximising S( PN ) under appropriate constraints. For lowest order closure PI ---+ P2 the result coincides with Mayer closure. In the present problem the constraint is the value of the reduced distribution P2 • This procedure always yields non-negative probabilities and, for our simple example, MaxEnt closure gives fully correlated higher pdfs as expected. More generally, for MaxEnt closure at second order, we obtain the formula

(3.5)

Page 295: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

GENERATING FUNCTIONAL FOR THE BBGKY HIERARCHY 291

for some consistently determined correlation coefficient R12. Though elegant, this formula is not convincing, because the closure procedure is not unique. Consider, for example, possible closure schemes for N = 6:

or P2 -+ P3 -+ P4 -+ Ps -+ P6 •

Closure all the way from P2 to P6 in one step yields, unfortunately, a different result from that obtained by closing from P2 to P3 and using this to close to P4, and so on. If we were forced to use a closure scheme, we would have to advocate MaxEnt closure directly to PN, where N is the actual number of particles in the system. Once again, the situation is unsatisfactory; it seems that we are trying to repair damage which should never have been done in the first place.

3.2. The de Finetti generating function

We have seen above that closure schemes generally suffer from deficiencies or ambigui­ties. This is a natural consequence of the attempt to assign a large probability distribution from a smaller one. Marginalisation, on the other hand, goes from a large pdf to a smaller one and is therefore a deterministic process. Jaynes (1986) showed how to use a theorem due to de Finetti to represent the joint pdf of N particles in such a way that its form remains unchanged by marginalisation. In this approach we write P( nl, n2) as a mixture of binomial distributions:

(3.6)

where the domain of integration is {J;} ~ 0 and normalisation is guaranteed by the condi­tion

j dh dh g(h,h) 6(1-"L,f) = 1. (3.7)

To cope with the possibility of negative correlations for finite N, the generating function g(h,h) must be allowed to take on negative values (Jaynes 1986). If g(h,h) is indeed negative somewhere in j-space the representation cannot be extended to N -+ 00 without generating some negative P( nI, n2). The form (3.6) survives marginalisation because the probability of anyone realisation of {nl' n2} is the sum of the probabilities of getting one realisation of {nl + 1,n2} and one realisation of {nl,n2 + I}: from the sum and product rules, and by symmetry,

The de Finetti representation then gives

j dh dh (J~1+1 1;2 + j~l 1;2+1) g(h, h) 6(1 - "L,f)

= jdh dh g lI;2g(h,h)6(1-"L,j), (3.9)

Page 296: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

292 S.F. Gull and A.J.M. Garrett

where the b-function ensures h + h = 1. The de Finetti generator is therefore a natural way of representing a correlated distribution.

The presence of the b-function in this generator is troublesome, because its implied restriction of the domain of the function will interfere with the use of important integral theorems in f-space. It would be more convenient to do without it, but we still require normalisation, so that

L P(nl,n2) = 1= jdhdfdh+h)Ng(h,h). n,+n2=N

(3.10)

This more general form is still a possible representation of the N -particle distribution, but it unfortunately no longer has the same form when marginalised to N ' < N particles:

N'l j I I I

P(n~,n~)=---;-ri-; dhdhf;'f;2 g(h,h)(h+h)N-N. nl ·n2 •

(3.11)

In what follows we shall write the generator without the b-function. We have also identified an important mathematical condition:

the de Finetti representation is independent of the number of particles (N) if and only if the genemting function gUl, h) can be taken to include a factor of b(l - 2:.1).

We shall see that this is sometimes, but not always, possible. This condition will tell us whether we are able to eliminate the number of particles - which for real macroscopic systems we never know exactly - from our formulation.

3.3. Evolution equation for the generating function

We now show how the generating function g(h, h) defined above can help us transform the equations for the evolution of P( nl, n2) into a simpler one in a larger space. Let us return to the toy problem and suppose now that each particle has a probability a12dt for making a change of state from region 1 to region 2 in a time interval dt, and a transition rate a21 for the change 2--1. If these transition rates are independent of the number of particles in the regions, we find a set of N + 1 evolution equations

oP( nl, n2) ot (3.12)

This set of equations is not particularly difficult to solve; it is found that all correlations decay exponentially, so that the distribution approaches the independent one (3.1), with a12it = a2lh· We emphasise, though, that the set of equations becomes more complicated as N increases, and that the solution contains a mixture of decay rates because certain types of correlation decay faster than others. In fact, there are exactly N non-zero decay rates for the system, spaced equally by a12 + a21.

The underlying behaviour is nevertheless very simple, and this is clearly revealed by using the de Finetti representation. Substituting (3.11) into the RHS of the evolution equation we obtain

(3.13)

Page 297: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

GENERATING FUNCTIONAL FOR THE BBGKY HIERARCHY 293

We eliminate the explicit dependence on the number of particles, using the result

2 2

ni II I;k = Ii a~i II I;k (no summation on i). k=l k=l

(3.14)

The RHS then becomes

which we can integrate by parts to obtain

assuming that 9 is sufficiently well-behaved for the integrated part to vanish. A sufficient condition for the representation to satisfy our evolution equation for all N is that the generating function g(11, 12; t) shall itself satisfy an evolution equation, which we write as

. ag(11, h) (a a ) 9 == at = afr - ah (a12fr - a21h)g(fr,h)

We can write this as a simple advection equation in terms of a conservative flux:

where

We note the following points:

Ji == "I)aijli - aj;Jj)g. ji.i

(3.15)

(3.16)

(3.17)

• The flux of 9 lies in our constraint line fr + 12 = 1: the scalar product of the flux with the normal to this line is 2:: Ji, which vanishes. Accordingly, the function g(1l, h) can be taken to include a factor of 8(1 - 2::1); if we were to include it explicitly, it could move through the differential operator in the RHS and be cancelled on both sides .

• The equation has real characteristics given by the solutions of

so that

ii + i.:)aij!; - aji!i) = 0, ji.i

(3.18)

Page 298: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

294 S.F. Gull and A.J.M. Garrett

If) r---0

..... ~ 0

If) N 0

C? 0

0.0 0 .3 0.5 0 .8 1.3 1.5 t

Fig. 2. Characteristics of /I for the model problem with 0'12 = 0'2l.

In Figure 2 we show these characteristics for the case 0'12 = 0'21 = 1. Any initial distribution of g(/I, h) is carried along the characteristics and is eventually squeezed into a b-function at 0'12/I = 0'2l 12. This is further illustrated in Figure 3, showing the evolution of 9 for a finite system of N = 30 particles. The distribution was initially fully correlated, with nl = 30, and the distribution P( nl, n2) was allowed to evolve according to equations (3.12). Because a finite system cannot completely determine the continuous function gU1, h), the distribution of P( nl, n2) was projected onto the set of Nth order polynomials. We see that the distribution of g(/I,12 = 1- It) is indeed swept along, so that the correlations all eventually decay. The approach to equilibrium now involves only the single decay rate 0'12 + 0'2l.

We conclude that the use of the generating function leads to a solution which is fully independent of the number of particles, and displays a simpler behaviour than is apparent from the joint pdf of N particles. We can easily extend this type of solution to the case of M regions of phase space, having one-particle distribution {I;}, leading to a form identical to equation (3.16).

Finally, let us make the ominous remark that our success in eliminating the number of particles is not surprising, because the particles were all behaving independently.

3.4. A new toy problem - boson statistics

The technique used in our first simple example will deal with any first-order conserva­tive terms in the evolution equations. There is, however, more to be learnt from this simple system before venturing to the general case. We now complicate the problem by supposing that the transition rates depend upon the occupancy, as in the stimulated emission found in boson systems. The evolution equation for P(nl>n2) now contains quadratic terms in nl and n2:

ap(nl,n2) at = 0'12 [en} + 1)n2 P(nl + 1,n2 -1) - (1 + n2)n}P(nl,n2)]

(3.19) + 0'21 [(n2 + l)n}P(n} - 1,n2 + 1) - (1 + nJ)n2P(n},n2)].

Page 299: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

GENERATING FUNCTIONAL FOR THE BBGKY HIERARCHY 295

~~~~~---=~~=-~-==-~~~~~~~~-r~~~~~~~ o ~~~~-=-= __ =-~ ____ =--= __ ==-=~~~~-,~--~~~~~~

0.25 0.5 f

0.75

Fig. 3. Evolution of gU; t), plotted as ordinate at a sequence of equally spaced times, for the model problem of Section 3.3.

Substituting the de Finetti representation into the RHS above and eliminating the {ni} as before, we find this time a second-order partial differential equation for the evolution of g(h,hit):

!J + (0~1 - 0~2) hh (0 12 0~1 -02l0~2) 9 = O. (3.20)

We can again write this in terms of a conservative flux:

(3.21)

where the flux is

(3.22)

(In this form the equation is v3.lld for any number of regions of phase space.) This is now a diffusion equation for g(h, 12; t), so that any initial distribution tends to spread out. We note that the flux Ji again lies in the constraint line: the scalar product of the flux with the normal is I:i Ji, which vanishes. This time, however, there is a serious problem. If 012 # 02l, the magnitude of the flux of g depends on the gradient across I:i!i = 1. Suppose, for example, we take the gradient of g as Og/O!i = G'(f); then we evaluate

Page 300: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

296 S.F. Gull and A.J.M. Garrett

2.:#i( aij - aji)G', which is non-zero. If we attempt to include the b-function explicitly, we find an extra term proportional to its derivative, which prevents us from isolating an evolution equation for g. We conclude that the problem is not truly independent of N unless the condition aij = a ji holds for all i and j, a condition known as detailed balance.

This apparently disturbing behaviour illustrates an extremely important point for the bosonic system. The stationary solution of the evolution equations for P( nl, n2) is

P(nl + 1,n2 -1) P(nb n2)

(3.23)

so that, unless al2 = a21, the system settles more and more certainly into one region or another as N is increased. Our generating function must, therefore, depend upon N. A suitable stationary form can be found by taking

(3.24)

with !31al2 = !32a21. For the important case of physical interest in which the condition of detailed balance holds, the evolution equation for 9(ft, h; t) has a flux of 9 which lies in the constraint surface and depends entirely on the behaviour of 9 in this surface. We are therefore able to include a factor of 15(1 - 2.:1) in 9 and our solution is again fully independent of the number of particles.

Putting 1 = iI, h = 1- 1, a = al2 = a21, we find

9 = a :1 (1(1- 1) ~~) , (3.25)

which has a complete set of separable solutions proportional to the Legendre polynomials:

(3.26)

All polynomials of 1 decay except for the zeroth-order, so that the system approaches 9(1) = constant.

The conclusion is that second-order terms due to particle interactions cause severe technical difficulties to this method, unless very restrictive conditions apply.

3.5. Generalisation

We now generalise to the case where the spin degeneracies are ](1 and ](2 respectively, and also consider fermi statistics as well as bose. We find

ap(nl,n2) _ ~ at = al2 [(nl + 1)(Ii2 - 1 ± n2)P(nl + 1,n2 -1) - (R2 ± n2)nlP(nl,n2)]

+ a21 [(n2 + 1)(](1 -1 ± ndP(nl - 1,n2 + 1) - (](l ± nJ}n2P(nl,n2)], (3.27)

where the upper sign refers to bosons and the lower to fermions. Proceeding as before, we obtain the same evolution equation (3.21), with flux given by

(3.28)

Page 301: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

GENERATING FUNCTIONAL FOR THE BBGKY HIERARCHY 297

Again, this form holds for M regions of phase space, and the representation will be indepen­dent of the number of particles if the 'transition matrix' is symmetrical, so that O!ij = O!ji

for all i,j. With this form for the flux, (3.21) is no longer self-adjoint and, as it stands, can have

polynomial solutions with a runaway time dependence. To see this, set 0!1 = 0!2 == O!,

h = j, 12 = 1- j as before: for bosons we find that

(3.29)

Substituting a trial uniform generator g(f; t) = get) in this equation, we see it has a time­dependence as e(1(, +K:,-2)at. This runaway behaviour is not present in the solution of the N + 1 first-order differential equations (3.19), and warns us that the partial differential evolution equation we have derived can have solutions inappropriate to the original problem. The very fact that we have replaced a finite set of variables with a continuum shows that, whatever simplicity we have obtained, we have introduced vastly more degrees of freedom to the problem than there were before, and we must control them carefully. The source of the trouble is that the trial solution violates the requirement that the integrated parts of the j-space integrals vanish. We can rescue the situation by making (3.29) self-adjoint and casting it into Sturm-Liouville form. The weight function is j[(,-l (1 - 1)1(2-1 , and the eigensolutions are

(3.30)

where pie,e) are the Jacobi polynomials, generalising the Legendre polynomials of (3.26). The stationary solution is now proportional to the weight function, and for an arbitrary number of regions of phase space generalises to

j1(; -1

g(f) = II r'(I(;) r(2:J(;). ,

(3.31 )

This procedure does not work for the fermionic case, because the representation is restricted to N ::; 1(1 + 1(2 by the rigid constraint P(1(l ,1(2) = 1. Although the evolution equation can be formally derived, its stationary solution IIi ji-(1(;+l) cannot be normalised and anyway appears to induce positive correlations rather than the required negative ones. As Jaynes points out, there are functionals that generate negative correlations, and therefore represent the stationary solution to the physical problem; for example, 9 = -9 + 60j(1- j) when 1(1 = 1(2 = 1. Unfortunately such functionals do not satisfy the evolution equation or the requirement that the integrated parts vanish.

We have not yet found a plausible de Finetti representation of a fermionic system, and are pessimistic of doing so.

4. The de Finetti generator and the BBGKY hierarchy

We are now ready to investigate whether the de Finetti representation can help us with the BBGKY hierarchy. We extend our simple problem to M regions of phase space

(4.1)

Page 302: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

298 S.F. Gull and A.J.M. Garrett

and let M -> 00 so that the {n;} are all zero except for a set of N regions hi} which each contain a single particle. At the same time we extend the notation (though with no difference in content) to write the representation as a functional integral:

N

P( h;}) = N! J(Df) II Jerk) g [fl· k=l

(4.2)

We now substitute this form into the Liouville equation and eliminate the dependence on J at the specified coordinates h;} in favour of a dependency on JCI) at unspecified (dummy) coordinates "I. Because we are considering several spaces of different size, it is worth emphasising our notation. We write J'Y to indicate the "I-component of the vector f, but J("I) when J is considered as a function of phase-space coordinate "I. The procedure is exactly the same as that already demonstrated for our simple example, and the substitution algorithm, with '¢ and II! representing arbitrary functions of phase space, is

(4.3)

(4.4)

We find an evolution equation for g in terms of a second-order functional differential oper-ator:

(4.5)

where we invoke a summation convention on repeated indices "I, "I' to imply a phase-space integration. We write this in the form of a conservation equation

(4.6)

and note that the flux consists of two terms:

• A first-order advection term. The flux is AI) == -y. 2f,:g: its scalar product with the

normal to the constraint plane is J d"lAI) , which again vanishes . • A second-order diffusion term which, because of a phase-space integration, can be

written as

J (2) - 8J'Y Jd'f ( ') 8g "I = 8y' "I 'Y,a "1,"1 8J'Y" (4.7)

The flux J(2) is again in the constraint plane provided that tv . a( "I, "I'), a condition which is satisfied for all velocity-dependent forces likely to be of interest (e.g. magnetic fields). Sadly, however, the flux depends on the gradient of g across this plane, as can be seen by setting 8g / 8 J'Y = G'. We obtain

(4.8)

Page 303: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

GENERATING FUNCTIONAL FOR THE BBGKY HIERARCHY 299

which depends on G' , unless the average acceleration of a particle vanishes everywhere (an uninteresting special case). We conclude that the representation is not fully independent of the number of particles, because the generator cannot be taken to include the necessary o-function.

5. The measure problem and Gaussian approximations

There are two technicalities involved in writing the generating functional in the form (4.2). First, there is the way in which we subdivide the phase space 1 into small regions and assign them a dimension in the functional integral. Here we are content to follow the traditional rule - "one mode per h3 of phase space" - and assign a uniform measure over I' The second technicality is deeper, and demands further comment. The variables if;} are themselves continuous, and a measure over f space must be specified before the volume element (Df) is well-defined.

Above, we took a Euclidean measure with

(Df) == II dfi (5.1)

over non-negative {Ii}. Our evolution equation for g[fj t] will not be seriously affected by our choice, which only concerns the proper division of J(Df) g [f] into generating function and volume element. Whatever volume factor p(f) we put into (Df), we must then compensate for by using g / p as the generating function. The choice of volume element does, though, influence the way in which we view f-space geometrically, and also implies a choice of metric.

To illustrate this point, we mention some results obtained by using a Gaussian gen­erating function g[f] for various choices of metric. We first consider the Euclidean metric and volume element, setting

(5.2)

where fo is the centre of the distribution and K,"!"!, the inverse matrix of curvatures. If to is sufficiently far from the unphysical regions of phase space where any of the {f"!} are negative, we can accurately approximate this integral by Gaussian integration over the whole of f space. We then find, through a cumulant generating functional, that

PI(Jd = fo(JI),

P2(/1,/2) = fo(/dfo(J2) + K,(JI,/2), (5.3)

and so on, which is exactly the result given by second-order Mayer closure. Because it has only two independent moments, a Gaussian approximation to the generating function yields, in Euclidean f-space, only second-order correlations. Because we have had to integrate over unphysical parts of this space to obtain this result, we gain some insight into the reason why Mayer closure can yield negative probabilities. Our Gaussian form can include, in this metric, the factor 8(1 - J d, I), which is a hyperplane and is therefore a special case of the ellipsoids generated by K,TY"

Page 304: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

300 S.F. Gull and A.J.M. Garrett

A different choice for the volume element follows on demanding invariance of form under repeated subdivision of i-space (Skilling & Sibisi 1995). This leads to

dj; (Df) = II 7: == II d17;,

, . (5.4)

where 17; == log ii and the integral is taken over the whole of 'Q-space. It is more natural now to take, for our Gaussian generator,

(5.5)

We can integrate this directly to obtain a form for PN that corresponds to second-order MaxEnt closure; PI and P2 are both expressible in terms of 'I/O, K, and

(5.6)

where Q3 is determined implicitly, in terms of P2 , by marginalising. Similarly, P4 is the product of six Q 4 functions, having arguments '/'1, '/'2, '/'3, '/'4 taken two at a time; and so on. This time, however, the constraint 0(1 - J d,/, 1) cannot be included, since the transform f -+ '" has destroyed its simple geometry. This means that a Gaussian approximation in this metric is not fully independent of N, a fact that affects the normalisation of the reduced distributions.

An alternative choice, which may have some merit in this case (because there is a pre­ferred choice for the subdivision of phase space), is the so-called 'entropy metric' proposed by Skilling (1989):

(Df) = II d~ i it

(5.7)

(see also Rodriguez (1989) for a further, supporting argument). There is then a Euclidean metric in fl/2-space with a simple geometric interpretation of the normalisation constraint, which becomes the surface of the unit hypersphere.

6. Conclusions

We have developed Jaynes' (1986) suggestion of a novel alternative to the BBGKY hierarchy.

• In some circumstances we can obtain an exact, linear differential equation of motion for a function of a one-particle distribution, which is able to generate the N -particle distributions for a system interacting under two-body forces. When this is possible, the particle number has been eliminated from the evolution equation and the problem of closure does not arise.

• This is so when the transition rates between regions of phase space are constant, and also for bosons satisfying detailed balance. The behaviour of such systems is more clearly seen using the de Finetti representation, because the resulting partial differential evolution equation can be solved without specifying the actual number of particles.

• For the full problem of classical statistical mechanics, the particle number cannot be eliminated, and the de Finetti representation in its original form does not seem to be a useful way to proceed.

Page 305: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

GENERATING FUNCTIONAL FOR THE BBGKY HIERARCHY 301

REFERENCES

Balescu, R.: 1975, Equilibrium and Non-equilibrium Statistical Mechanics, Wiley, New York, U.S.A. Chapters 3 and 14.

Clemmow, P.C. & Dougherty, J.P.: 1969, Electrodynamics of Particles and Plasmas, Addi­son-Wesley, Reading, Mass., U.S.A.

Inguva, R., Smith, C.R., Huber, T.M. & Erickson, G.: 1987. 'Variational Method for Clas­sical Fluids', in: C.R. Smith & G.J. Erickson (eds), Maximum Entropy and Bayesian Spectral Analysis and Estimation Problems, Reidel, Dordrecht, Netherlands. pp 295-304.

Jaynes, E.T.: 1986, 'Some Applications and Extensions of the de Finetti Representation Theorem', in: P. Goel & A. Zellner (eds), Bayesian Inference and Decision Techniques, Studies in Bayesian Econometrics 6, Kluwer, Dordrecht, Netherlands. pp 31-42.

Karkheck, J.: 1989. 'Kinetic Theory and Ensembles of Maximum Entropy', in: J. Skilling (ed.), Maximum Entropy and Bayesian Methods, Cambridge, England, 1988, Kluwer, Dordrecht, Netherlands. pp 491-496.

Lynden-Bell, D.: 1967. 'Statistical Mechanics of Violent Relaxation in Stellar Systems', Mon. Not. Roy. Astron. Soc., 136, 101-12l.

Rodriguez, C.C.: 1989. 'The Metrics Induced by the Kullback Number', in: J. Skilling (ed.), Maximum Entropy and Bayesian Methods, Cambridge, England, 1988, Kluwer, Dordrecht, Netherlands. pp 415-422.

Saslaw, W.C.: 1985, Gravitational Physics of Stellar and Galactic Systems, Cambridge University Press, Cambridge, U.K.

Skilling, J.: 1989. 'Classic Maximum Entropy', in: J. Skilling (ed.), Maximum Entropy and Bayesian Methods, Cambridge, England, 1988, Kluwer, Dordrecht, Netherlands. pp 45-52.

Skilling, J. & Sibisi, S.: 1995. 'Your title, please ... ', in: J. Skilling & Sibisi, S. (eds), Maxi­mum Entropy and Bayesian Methods, Cambridge, England, 1994, Kluwer, Dordrecht, Netherlands. pp xxx-xxx.

Page 306: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

ENTROPIES FOR CONTINUA: FLUIDS AND MAGNETOFLUIDS

D. Montgomery and X. Shan Dept. of Physics & Astronomy Dartmouth College Hanover, NH 03755-3528, USA

W.H. Matthaeus Bartol Research Institute

University of Delaware Newark, DE 19716, USA

ABSTRACT. The greatest single use of maximum entropy methods at present seems to be in situations related to data analysis. However, for over twenty years, it has also appeared that considerations of maximum entropy might have dynamical implications for dissipative continuum mechanics that go beyond the class of statements that can be made from the traditional statis­tical mechanics of discrete particles. Inquiry into the extent to which a meaningfully increasing entropy can be defined for an evolving dissipative continuum has been to a considerable degree an "experimental" investigation, with numerical solution of the relevant dynamical equations (e.g., Navier-Stokes, magnetohydrodynamic, geostrophic, or plasma "drift" equations) as the relevant ex­perimental tool. Here, we review various suggested formulations and the accumulated numerical evidence. We also suggest theoretical and computational problems currently thought to be poten­tially illuminating and ripe for solution.

1. INTRODUCTION

There is no denying that the dominant emphasis in maximum entropy theory in recent years has been on applications to data analysis. Despite the extraordinary success of this agenda, there remains, in our opinion, a second program to be developed that derives from Jaynes's earliest perspectives [1] on entropy maximization. We take that to be a likelihood that our notions of the probability of the physical state of a system might be generalized far beyond the elegant but rather confining ones that originate in the classical statistical mechanics of point particles, due to Boltzmann and Gibbs. The idea that physical evolution of a system might represent in many cases the passage from a "less probable" to a "more probable" state does not seem to be fully exploited by the classical emphasis on phase spaces, Liouville theorems, Hamiltonian dynamics, ergodic or "mixing" theorems, equality of time and phase space averages, and so on. Valuable as these pushes toward precision and rigor have been, they may have diverted attention away from cases where a statistical­mechanical perspective has real, if less sharply formulated, predictive power.

We wish to discuss here an example of a physical system whose mathematical description has often been thought to put it outside the proper area of inquiry of classical statistical mechanics but which nonetheless has recently seemed to confirm some maximum-entropy predictions for it: two dimensional, Navier-Stokes (hereafter, "2D NS") fluid turbulence. Section II reviews briefly some recent analytical and numerical calculations which seem to us to support this assertion. Section III is devoted to some discussion of different possibilities for introducing probabilistic concepts into the dynamics. Section IV proposes several new numerical computations for related physical systems that seem to be likely candidates for advancing our understanding of the statistics of dissipative continua further.

303

1. Skilling and S. Sibisi (eds.), Maximum Entropy and Bayesian Methods, 303-314. © 1996 Kluwer Academic Publishers.

Page 307: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

304 D. MONTGOMERY, X. SHAN AND W.H. MATTHAES

This work takes it as axiomatic that the robust appearance of readily-recognizable states in dissipative continua implies that they have an essentially thermodynamic behavior, even though that thermodynamics falls outside the traditional energy-conserving formulation. We believe that it mostly remains to be discovered, is not fully accounted for by past hypotheses such as "minimum rate of entropy production," and will require enlargement of some of our favorite ideas of classical statistical mechanics. The example treated here is one in which the use of information-theoretic entropies has shown a predictive power in deterministic, time-developing situations that we believe to be a genuinely new departure in physics.

2. 2D NS TURBULENT FLOWS

A somewhat artificial construct that is nonetheless of considerable meteorological and oceanographic relevance [2] is the flow of an incompressible Navier-Stokes fluid like wa­ter or air, in the case in which the velocity field has only two components, v = (vx,vy,O), say, with both Vx and Vy independent of the third coordinate z. In such a case, the velocity field v is expressible in terms of a stream function 'lj;, according to v = 'V'lj; x ez , with ez a unit vector in the z direction. The vorticity w = 'V x v = (0,0, w) then serves as a pseudoscalar source for the velocity field through the Biot-Savart law, and can form the basic dynamical variable. Thus the Navier-Stokes equation for v can be replaced by the "vorticity representation," which evolves w according to

aw 2 -+v·'Vw=v'V w at (1 )

while Poisson's equation relates w to 'lj;: 'V 2 'lj; = ~w. In Eq. (1), the constant v is the kinematic viscosity, or in a favorite set of dimensionless units, may be thought of as the reciprocal of the large-scale Reynolds number.

The fact that v is "small" in the cases of interest makes Eq. (1) look deceptively close to the Euler equation of an "ideal" fluid, which we write separately for emphasis:

aw at + V· 'Vw = O. (2)

It is difficult to overestimate the numbers of misunderstandings that have been generated by assuming that because v is "small," the differences between the solutions of Eqs. (1) and (2) ought also to be "small."

The point is that Eq. (2), with smooth initial data, provides a continuously infinite family of conservation laws, being a pointwise conservation law itself for the value of w

associated with any fluid element. Several consequences of these conservation laws are immediate. First, contours of constant w can never cross or intersect, as long as the solutions continue to exist at all. Closed contours nested inside each other must stay nested, and all local maxima or minima must remain isolated maxima or minima forever. In the presence of reasonable boundary conditions (such as spatially periodic or perfectly reflecting ones), all integrals of any differentiable function F(w) are constants of the motion. This vast number of conservation laws almost, but not quite, locks any evolution into the initial conditions; at least topologically speaking, this is not too strong a statement.

Page 308: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

ENTROPIES FOR CONTINUA: FLUIDS AND MAGNETOFLUIDS 305

Eq. (1) avoids this infinity of conservation laws because of the well-demonstrated prop­erty of developing steep spatial gradients in w, so that no matter how small v is (or how high the Reynolds number), the dissipative term in Eq. (1) soon becomes effective. The topological constraints are broken, and many of the ideal global invariants begin to dissi­pate. It has been speculated that at high enough Reynolds number (low enough v ), the dissipation of functions of w such as the enstrophy (mean square vorticity) or integrals of higher powers of w might be postponed indefinitely, but the numerical evidence for such a postponement is, to date, notably unconvincing; no one has suggested how the number of eddy turnover times required for the viscosity to become effective might scale with Reynolds number.

The disappearance of the conservation laws, in passing from Eq. (2) to Eq. (1), permits a great deal of dynamical behavior to occur which was frozen out in the Euler description. In particular, the merger with or capture of one vortex by another permits the topography in a surface of w(x, y) at fixed time to change drastically from one time to another. The fluid can "tryout" configurations that the Euler equations would not permit. It has been repeatedly suggested that a "coarse graining" of the solutions of the Euler equations will perhaps render the two different time evolutions indistinguishable, but we have not so far found such a scenario to have been convincingly documented in any but the most impressionistic sense. We will return to this point later in the discussion.

For one class of initial conditions, though, the topological conservation law difficulties disappear. Namely, they lose much of their restrictiveness for a collection of discrete, delta­function ideal "line" vortices, in which the vorticity distribution appears as the singular sum w = L:i KiO(X - x;). Eq. (2) also applies. Here, the location of the ith vortex of strength Ki is Xi = (Xi, Yi), and it is swept around the X, Y plane in the velocity field that the other vortices self-consistently generate. This "particle" model of 2D hydrodynamics is non-dissipative, of course, but has other desirable properties (e.g, see Lin [3] or Onsager [4]). In particular, it is a Hamiltonian system with canonically conjugate coordinates which are just proportional to the X and y coordinates of the particles (Onsager [4]), and invites consideration by means of classical statistical mechanical techniques. Many of the answers turn out to look strange, however, because of the finiteness of the total phase space, an unusual condition that can lead to the phenomenon that came to be called, in the 1950s, "negative temperatures." The system can behave in ways that are noticeably different from the continuous, square-integrable Euler formulation, in that the "topological" invariants are of no consequence. In particular, there is no limit to the number of the line vortices that can occupy any small area in the xy plane: the vorticity maximum, averaged over an arbitrarily small area, can continue to increase with time, which it cannot for the continuum interpretation of Eq. (2).

Mean-field maximum entropy predictions for the statistical evolution of the line-vortex case can be deduced by postulating a classical Boltzmann entropy for a cellular occupation­number representation of the state of the system, using Stirling's approximation on the factorials, then maximizing the entropy subject to constraints [5,6]. The constraints implied by the dynamics are conservation of particles and total interaction energy (for this system, the interaction energy is formally the Coulomb interaction energy of a collection of parallel line charges). There results an exponential dependence of the mean vorticity of both signs on the stream function: w = exp(-a+ - fJ1jJ) - exp(-a- + fJ1jJ). Here, the a± and the fJ

Page 309: Maximum Entropy and Bayesian Methods: Cambridge, England, 1994 Proceedings of the Fourteenth International Workshop on Maximum Entropy and Bayesian Methods

306 D. MONTGOMERY, X. SHAN AND W.H. MATTHAES

are Lagrange multipliers associated with conservation of the positive (negative) vorticity, and the energy, respectively. j3 is an inverse "temperature" and may be negative.

Demands of self consistency of w may be imposed by bringing in Poisson's equation (2), to give

"i72'¢ = - exp[-o:+ - j3'¢] + exp[-o:- + j3'¢]. (3)

as a nonlinear partial differential equation for the "most probable" stream function '¢. A further assumption of positive and negative symmetry (not invariably justified) replaces the right hand side by a hyperbolic sine, and Eq. (3) then becomes

(4)

Qualitative agreement between the predictions of Eq. (4) and numerical integration of the equations of motion of a few thousand line vortices was established in the 1970s ([5-19]). Because of the absence of any viscous dissipation in the system, it was not thought to be a useful predictor of the true Navier-Stokes behavior exhibited by Eq. (1). Analytical and numerical solutions of the "sinh-Poisson" equation were given, and various generalizations to magnetohydrodynamics and collections of self-gravitating point-masses were written down. Our attention gradually moved on to what were thought to be fluid descriptions (e.g., NS and dissipative MHD equations) of more physical interest. The maximum-entropy mean­field description appeared, at the time of this meeting in 1981, as a perhaps amusing sideshow in fluid mechanics [20].

In the late 1980s, Matthaeus et al [21,22] undertook what in some ways was the most ambitious numerical attempt to date to solve Eq. (1) at high Reynolds numbers and for long times. Highly turbulent initial conditions in periodic geometry were allowed to evolve for about 400 initial large scale eddy turnover times at a large-scale Reynolds number of 14,286. The spatial resolution was 512x512, and the time step was 1/2048 of an eddy turnover time. Spatially periodic boundary conditions were employed, the Orszag-Patterson dealiased pseudospectral method was used [23], and extensive diagnostic and graphic pro­cedures were exploited to monitor the evolution.

Much of the evolution was familiar. For reasons which are well known [2], energy decayed slowly compared to higher order ideal invariants such as integrals of the powers of vorticity. Like-sign vortex merger was the most prominent dynamical event, and hundreds of initial maxima and minima in the vorticity field eventually were reduced by mergers to a single maximum and a single minimum. The varied details of each merger event were so distinctive and intriguing that it was easy to spend more time on the details than they merited. The time evolution is summarized in Figure 1. What had not been at all suspected was that after around 300 eddy turnover times, the computed two-vortex final state would be accurately predicted by Eq. (4). A scatter-plot of vorticity vs. stream function is shown in Figure 2, and the curve drawn through the points is the hyperbolic sine predicted by Eq. (4). Except for a slow decay on the energy decay time scale (of the order of 14,000 eddy turnover times), any interesting time development ceased. Further diagnostics are shown in reference [24]. This sharply defined, slowly decaying, "sinh-Poisson" state defied all expectations, since most of the enstrophy had decayed by this time and a small but not negligible fraction of the energy had, too. Any connections to the ideal line-vortex model seemed tenuous at best. It was clear that maximum entropy arguments had, in some sense,


made an uninvited but thoroughly non-trivial statement about evolving 2D NS turbulence in a dissipative continuum. There seemed to be every incentive to interrogate the subject further to try to see what had happened, and to appreciate whatever connection it had with statistical mechanics.
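
As a purely illustrative sketch of the kind of least-squares fit drawn through the Figure 2 scatter (not the authors' actual diagnostic code), the snippet below generates a synthetic (ψ, ω) scatter obeying a sinh law plus noise and recovers the two coefficients with scipy; the functional form ω = −c sinh(bψ) and all parameter values are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic "relaxed-state" scatter: omega = -c * sinh(b * psi) plus noise.
# c and b stand in for the Lagrange-multiplier combination appearing in the
# sinh-Poisson relation; the values used here are illustrative only.
rng = np.random.default_rng(0)
psi = rng.uniform(-3.0, 3.0, 2000)
omega = -1.2 * np.sinh(1.1 * psi) + 0.3 * rng.normal(size=psi.size)

def sinh_law(psi, c, b):
    """omega(psi) = -c * sinh(b * psi), the form fitted through the scatter plot."""
    return -c * np.sinh(b * psi)

(c_fit, b_fit), _ = curve_fit(sinh_law, psi, omega, p0=(1.0, 1.0))
print(f"fitted c = {c_fit:.3f}, b = {b_fit:.3f}")
```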

There seem to be two not unrelated problems in defining a serious entropy function for a dissipative continuum (rather than as a mean-field model of a discrete vortex system), such as the one described by Eq. (1). First, there is no immediate conservation law for vorticity itself, since for both periodic boundary conditions and for stationary material rigid boundaries, the net integrated vorticity must remain zero. Even though no net vorticity is conserved in a non-trivial way, there is manifestly a damping of some kind for which the viscosity accounts. Secondly, there is the problem of enumerating possible states, which must remain available to a time-dependent evolution even though viscous dissipation is going on. This is a physical necessity that is preliminary to any assignment of an entropy function.

Both difficulties are to some extent answered [24,25] by the replacement of Eq. (1) by a "two-fluid" dynamics for two conserved vorticity fluxes, ω⁺ and ω⁻:

\left( \frac{\partial}{\partial t} + \mathbf{v}\cdot\nabla \right) \omega^{\pm} = \nu \nabla^2 \omega^{\pm}    (5)

Here, ω⁺ and ω⁻ are both > 0. The physical vorticity ω is interpreted as the difference, ω = ω⁺ − ω⁻. The stream function ψ is generated through Poisson's equation by this physical vorticity ω, and the velocity field v which enters into Eqs. (5) is derived from ψ and is the same in both equations. Subtracting the second of Eqs. (5) from the first leads back to Eq. (1), with the same interpretation for all symbols, so it is clear that the physical 2D NS dynamics has been embedded in the two-fluid dynamics contained in Eqs. (5): when they are solved, a 2D NS solution is immediately available upon subtraction. Viscous dissipation amounts to the interpenetration of the two fields, and when they have finally become equal at all points, all dynamical activity has ceased. The advantage is that the non-negative fields ω⁺ and ω⁻ have fluxes that are separately rigorously conserved, for periodic boundary conditions, as can be seen by simple integration. Since the 2D NS dynamics is still in effect, any integral invariants like energy that are almost conserved at high Reynolds numbers by 2D NS flow are still conserved for the two-fluid system. We then have three constants of the motion, two exact (⟨ω⁺⟩ and ⟨ω⁻⟩, where the angle brackets ⟨·⟩ mean a spatial average) and one approximate, the energy E = ½∫(ω⁺ − ω⁻)ψ dx dy. The states-counting problem is then no more acute or mysterious than it is for any conserved incompressible fluid which is convected around and diffused. Though Boltzmann, Gibbs, and von Neumann have given us no sanction for doing so, it is probably not shocking to any information theorist that we decided to try the following expression for a measure of the probability of any given arrangement of the two fields ω⁺ and ω⁻: S = −∫(ω⁺ ln ω⁺ + ω⁻ ln ω⁻) dx dy. Seeking the most probable state, subject to the three constraints, is a straightforward exercise, and leads at once to the predictions

\omega^{\pm} = \exp(-\alpha^{\pm} \mp \beta\psi)    (6)

for the most probable values of the ω⁺ and ω⁻ fields. The α⁺, α⁻, and β are, of course, Lagrange multipliers to be chosen so that the constraints are satisfied.
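
As a minimal illustration of the quantities entering this constrained maximization (not the authors' code), the sketch below discretizes the entropy S and the three constraints on a doubly periodic grid, using fields chosen so that the stream function is known analytically under the assumed sign convention ∇²ψ = −ω; the field shapes and grid size are illustrative only.

```python
import numpy as np

# Illustrative smooth, strictly positive vorticity fluxes on a doubly periodic box,
# chosen so that psi = 0.5*cos(x)*cos(y) solves laplacian(psi) = -omega exactly.
N, L = 64, 2 * np.pi
dx = L / N
x = np.arange(N) * dx
X, Y = np.meshgrid(x, x, indexing="ij")

w_plus = 1.0 + 0.5 * np.cos(X) * np.cos(Y)      # omega^+ > 0
w_minus = 1.0 - 0.5 * np.cos(X) * np.cos(Y)     # omega^- > 0
omega = w_plus - w_minus                         # physical vorticity
psi = 0.5 * np.cos(X) * np.cos(Y)                # analytic stream function

area = dx * dx
S = -np.sum(w_plus * np.log(w_plus) + w_minus * np.log(w_minus)) * area
flux_plus = np.sum(w_plus) * area                # conserved flux of omega^+
flux_minus = np.sum(w_minus) * area              # conserved flux of omega^-
E = 0.5 * np.sum(omega * psi) * area             # E = (1/2) * integral of omega * psi

print(f"S = {S:.4f}  flux+ = {flux_plus:.4f}  flux- = {flux_minus:.4f}  E = {E:.4f}")
```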


Whether the predictions (6) are in fact accurate is, at this stage of the subject, an "experimental" question. Shan [25] has written a 2D pseudospectral code to solve Eqs. (5), starting from turbulent initial conditions similar to those that were evolved to reach Figures 1 and 2. The inferred 2D NS dynamics, obtained by subtraction, were virtually identical to those previously seen. Scatter plots of ω⁺ and ω⁻ vs. ψ at late times, however, did not confirm Eqs. (6), despite their regularity. Typical cases are shown in Figure 3. The downward hook to the left in the ω⁺ plot and the downward hook to the right in the ω⁻ plot cannot be inferred from the most probable state (6), and it seemed clear that some revision was called for.

A consideration of what might be needed followed from the observation that Eqs. (5) and all their predictions for ω, v, and ψ are invariant to the addition of any positive constant to ω⁺ and ω⁻. There is no absolute normalization implied in the two-fluid partition of the physical initial ω into ω⁺ and ω⁻, other than the relatively mild requirement that they remain non-negative. Yet the new expression for S, as well as the Lagrange multipliers in (6), do change upon the addition of constant mean fluxes ⟨ω⁺⟩ and ⟨ω⁻⟩ to ω⁺ and ω⁻. "Drone" states, totally uninvolved in any possible dynamical evolution, have been contaminating the set of possible physical states we have been assigning to the system, and thereby skewing the weightings of those that do participate. It seems natural to try to redefine the entropy in a way that is invariant to the addition of an arbitrary ⟨ω⁺⟩ and ⟨ω⁻⟩.

A further and natural subdivision of the fluxes, according to ω = ω⁺ − ω⁻ = ⟨ω⁺⟩ + ω⁺⁺ − ω⁺⁻ − ⟨ω⁻⟩ − ω⁻⁻ + ω⁻⁺, seems to eliminate the difficulties. Here, ω⁺⁺ is, for example, the part of the flux of ω⁺ that lies above ⟨ω⁺⟩, whatever it may be. Similar statements apply to the other three non-negative "vorticities," ω⁺⁻, ω⁻⁻, and ω⁻⁺.

The individual dynamics of the four fields can be readily taken to be

\left( \frac{\partial}{\partial t} + \mathbf{v}\cdot\nabla \right) \omega^{ij} = \nu \nabla^2 \omega^{ij}    (7)

where i and j take on the values + and −. The velocity field v is common to all four equations, and is derived from the stream function ψ, where Poisson's equation now becomes ∇²ψ = −ω⁺⁺ + ω⁺⁻ + ω⁻⁻ − ω⁻⁺. Eq. (1) is again immediately recoverable by combining the appropriate linear relations (7). A natural entropy, now independent of ⟨ω⁺⟩ and ⟨ω⁻⟩, can readily be seen to be S = −Σ_{i,j} ∫ ω^{ij} ln ω^{ij} dx dy, where the sum runs over the four fields. Differentiating S, using (7), and integrating by parts leads at once to the relation

\frac{dS}{dt} = \nu \sum_{i,j} \int \frac{(\nabla \omega^{ij})^2}{\omega^{ij}} \, dx\, dy \;\geq\; 0    (8)

Eq. (8) almost, but not quite, amounts to an "H theorem" for the system. It predicts a monotonic increase in S until all spatial gradients have disappeared, and if the time scales in the evolution of S could be shown to be, in some precise sense, in between the (slow) decay of energy and the (much more rapid) decay of enstrophy and the other higher-order invariants, it would amount to an unambiguous proof that the system is driven toward a state to be obtained by maximizing S subject to conservation of energy and the four fluxes of the ω^{ij} only. So far, no such demonstration has been possible.
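
The non-negativity asserted in Eq. (8) can be checked directly on a grid. The following is a minimal sketch (not the authors' diagnostics): it evaluates the entropy-production integral for four illustrative positive fields standing in for ω^{ij}, using centred differences on a doubly periodic grid; grid size, viscosity, and field shapes are assumptions for illustration.

```python
import numpy as np

N, L, nu = 64, 2 * np.pi, 1.0e-3
dx = L / N
x = np.arange(N) * dx
X, Y = np.meshgrid(x, x, indexing="ij")

def periodic_grad(f, dx):
    """Centred-difference gradient of a doubly periodic field."""
    fx = (np.roll(f, -1, axis=0) - np.roll(f, 1, axis=0)) / (2.0 * dx)
    fy = (np.roll(f, -1, axis=1) - np.roll(f, 1, axis=1)) / (2.0 * dx)
    return fx, fy

# Four illustrative strictly positive fields standing in for omega^{ij}.
fields = [1.0 + 0.4 * np.cos(X + i) * np.cos(Y - i) for i in range(4)]

dS_dt = 0.0
for w in fields:
    wx, wy = periodic_grad(w, dx)
    integrand = (wx**2 + wy**2) / w      # pointwise non-negative because w > 0
    assert np.all(integrand >= 0.0)
    dS_dt += nu * np.sum(integrand) * dx * dx

print(f"entropy production dS/dt = {dS_dt:.6e}  (non-negative, as Eq. (8) requires)")
```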


Nevertheless, it is of interest to compare the numerical solutions for ⟨ω⁺⟩ + ω⁺⁺ − ω⁺⁻ and ⟨ω⁻⟩ + ω⁻⁻ − ω⁻⁺ with the maximum-entropy predictions, which are, clearly:

\omega^{ij} = \exp(-\alpha^{ij} \mp \beta\psi).

In these expressions for ω^{ij}, the α^{ij} and β are the appropriate Lagrange multipliers associated with the conservation of the four fluxes and the energy, respectively. A quite acceptable fit to the computed fields is achieved, and samples are shown in Figure 4. It will be apparent that there is no incentive for any further subdivision of the vorticities: no more general expressions than ω^{ij} can result.

One additional feature of the evolution which had not been anticipated has revealed itself: a lack of strict equipartition of energy in the two final-state vortices. Upon reflection, this seems perhaps unsurprising, since there is no a priori reason why exactly equal amounts of energy should be given to the positive and negative parts of the vorticity field. A difference in the two final-state energies of the order of five per cent was observed, and if the difference changed, it changed only on a time scale that was perhaps as slow as the energy decay itself. This reflects itself in a departure from exact symmetry in the Lagrange multipliers in ω^{ij}, and an even better fit to the computed relaxed state was achieved than the "sinh-Poisson" scatter plot of Figure 2. It is perhaps not exaggerating to say that at this moment, there are no computed details of relaxing 2D NS turbulence that do not fit comfortably into the maximum-entropy picture as it has been elaborated.

A natural effort to extend the picture to driven (as opposed to decaying) systems at finite Reynolds numbers has met with less success. W.B. Jones and the writer [26] studied numerically the problem of 2D NS flow between parallel planes under a uniform pressure gradient (plane Poiseuille flow). Above a critical Reynolds number of about 2600 or so, the one-dimensional parabolic velocity profile is supplemented by a second set of preferred periodic 2D solutions which have been seen as the catalyst for 3D turbulent transitions for quite some time ([27-29]). By keeping the problem two-dimensional at artificially high Reynolds numbers (up to 15,000, based on the pressure gradient), it was possible to achieve, after transients had decayed, a vortex-street configuration that, when viewed in the rest frame of the vortices, was accurately time-independent, and a candidate for a maximum-entropy analysis. The flow divided naturally into three regions: (1) a central stripe, containing most of the mass flux, that wandered periodically in space in the cross-stream direction; (2) a vortex-street arrangement of uniform-vorticity vortices that rested almost against the walls; and (3) boundary layers, between the walls and the vortices, and between the vortices and the central stripe. In the first two of these, an accurate pointwise dependence of ω upon ψ was observed, which in the maximum-entropy language corresponded to two different "temperatures." In the boundary layers, there was no regularity to be observed between ω and ψ, and there was, of course, no sense in which the viscous term could be counted as formally "small" in estimating the terms of Eq. (1). A scatter plot with the boundary-layer contributions removed shows a linear central section corresponding to the central stripe, bounded by gaps on the other sides of which are two flat ("infinite temperature") sections corresponding to the vortex street. At this point, it is an unrealized hope to integrate these very regular results into a unified maximum-entropy analysis; interested parties are invited to try [26].


3. ALTERNATIVE FORMULATIONS AND PROBLEMS

It has been mentioned previously that there is more than one way to formulate a mean-field theory of the Euler equations. Instead of "line" vortices, one may postulate mutually exclusive, flat, vortex "patches" which occupy finite areas, and are separated by empty space or by patches carrying other (piecewise constant) values of vorticity. The statistics of these deformable but still non-interpenetrating vortex patches can be treated with reference to a cellular division of space, according to methods first enunciated by Lynden-Bell [30]. Formally, a classical analogue of Fermi-Dirac statistics results. The 2D Euler equation analogue of this procedure has been given by Robert and Sommeria [31,32] and by Miller et al [33]. In this approach, the vorticity distribution remains ragged and singular, and correspondence with smooth Navier-Stokes variables is attempted through "coarse graining": replacing the vorticity distribution by its average over local regions of phase space. What results, as a "most probable" vorticity dependence upon stream function, is formally a Fermi distribution: ω = a exp(−α − βψ)/[1 + exp(−α − βψ)], (ω > 0), which, it will be noted, has typically a concave, rather than convex, dependence at the maxima.
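
To make the contrast concrete, here is a minimal sketch (with assumed, illustrative parameter values, not taken from any of the cited computations) comparing the Gibbsian exponential ω(ψ) law of the point-vortex and two-fluid pictures with the Fermi-type law quoted above: the exponential is convex in ψ everywhere, while the Fermi-type form saturates and so is concave where ω approaches its maximum.

```python
import numpy as np

psi = np.linspace(-3.0, 3.0, 601)
alpha, beta, a = 0.0, -1.0, 2.0      # illustrative multipliers (beta < 0, as in the text)

w_gibbs = np.exp(-alpha - beta * psi)                                             # convex in psi
w_fermi = a * np.exp(-alpha - beta * psi) / (1.0 + np.exp(-alpha - beta * psi))   # saturates at a

def second_derivative(y, x):
    return np.gradient(np.gradient(y, x), x)

# The exponential form has positive curvature everywhere; the Fermi-type form
# flattens out (negative curvature) where omega approaches its maximum value a.
print("min curvature, exponential form:", second_derivative(w_gibbs, psi)[5:-5].min())
print("curvature of Fermi form near its maximum:", second_derivative(w_fermi, psi)[-10])
```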

Apart from the simple-minded question of whether the Fermi-Dirac or Gibbsian dependences fit turbulent Navier-Stokes computations better (the latter do), there remains a more basic question of whether and to what extent "coarse graining" the products of Eq. (2) should accurately reproduce the solutions of Eq. (1). The diffusive viscous Laplacian decay term does definite things to any shape (either in spectral space or in configuration space) of vorticity distribution it acts upon. In particular, it tends to fill in zeros at the first time step, if any exist in the spatial vorticity distribution. Its effect is far more detailed than a mere "smearing out" of sharp spatial gradients. Whether its effects can be satisfactorily mimicked by "coarse graining" a dynamics which permits arbitrarily steep gradients to survive remains to be demonstrated, but may be true. Our own impulses at this stage have led us to try instead to construct a maximum entropy formulation that admitted dissipation from the beginning, and to treat it as one dynamical process among others that can contribute to evolution towards a more probable state.

4. SUGGESTED NEW APPLICATIONS AND TESTS

There seem to us to be at present two areas in particular where maximum entropy statistical mechanical predictions can be made that are ripe for numerical test. The first is two-dimensional magnetohydrodynamics (2D MHD); the second is self-gravitating systems.

The equations of 2D MHD are a natural generalization of Eq. (1):

\frac{\partial \omega}{\partial t} + \mathbf{v}\cdot\nabla\omega - \mathbf{B}\cdot\nabla j = \nu \nabla^2 \omega, \qquad \frac{\partial A}{\partial t} + \mathbf{v}\cdot\nabla A = \eta \nabla^2 A    (9)

Here, the velocity field is v = ∇ψ × ê_z, where ψ is again the two-dimensional stream function, and B = ∇A × ê_z, where B is the two-dimensional magnetic field and A is a (one-component) magnetic vector potential. The vorticity is again ω = ∇ × v, and the current density is j; both are directed in the z-direction, and are related to ψ and A by Poisson's equation:

\nabla^2 \psi = -\omega, \qquad \nabla^2 A = -j    (10)

The kinematic viscosity is ν, while the magnetic diffusivity is η. In dimensionless ("Alfvénic") units, both are interpretable as Reynolds-like numbers.
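
A minimal kinematic sketch of Eqs. (9)-(10), under the assumed sign convention ∇²ψ = −ω and ∇²A = −j: given illustrative ω and j fields on a doubly periodic grid, ψ and A are recovered by FFT Poisson inversion, and v = ∇ψ × ê_z and B = ∇A × ê_z are formed spectrally. Grid size, field choices, and function names are assumptions for illustration.

```python
import numpy as np

def invert_poisson(src, dx):
    """Solve laplacian(f) = -src on a doubly periodic grid (mean of f set to zero)."""
    n = src.shape[0]
    k = np.fft.fftfreq(n, d=dx) * 2 * np.pi
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                       # avoid division by zero for the mean mode
    f_hat = np.fft.fft2(src) / k2
    f_hat[0, 0] = 0.0
    return np.real(np.fft.ifft2(f_hat))

def skew_gradient(f, dx):
    """Return grad(f) x e_z = (df/dy, -df/dx), computed spectrally."""
    n = f.shape[0]
    k = np.fft.fftfreq(n, d=dx) * 2 * np.pi
    kx, ky = np.meshgrid(k, k, indexing="ij")
    f_hat = np.fft.fft2(f)
    fx = np.real(np.fft.ifft2(1j * kx * f_hat))
    fy = np.real(np.fft.ifft2(1j * ky * f_hat))
    return fy, -fx

N, L = 64, 2 * np.pi
dx = L / N
x = np.arange(N) * dx
X, Y = np.meshgrid(x, x, indexing="ij")

omega = np.cos(X) * np.sin(Y)            # illustrative vorticity field
j = np.sin(2 * X) * np.cos(Y)            # illustrative current density

psi = invert_poisson(omega, dx)          # stream function
A = invert_poisson(j, dx)                # magnetic vector potential (z-component)
vx, vy = skew_gradient(psi, dx)          # velocity field
Bx, By = skew_gradient(A, dx)            # magnetic field
print("max |v| =", np.hypot(vx, vy).max(), " max |B| =", np.hypot(Bx, By).max())
```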


There are now two "source" fields (ω and j) in terms of which an entropy may be defined [34,35]. One has, at this point, cut loose, however, from any Euler-like description whereby convected, conserved delta-function sources may be made to appear by dropping the viscosity and magnetic diffusivity. Any entropy which can emerge from this problem cannot bear a close relation to a conservative system of straightforward Hamiltonian particle mechanics. It is a particularly promising arena in which the information theoretic formulation of entropy may be tested. Some preliminary results have already been reported by Biskamp [36], though the computations were far short in duration of what would be required for any approach to an asymptotic state like that considered in Refs. [21,22,24,25].

The second area where relaxation toward maximum entropy states may be expected to be testable for a non-standard system is that of gravitationally-interacting masses (e.g., [30]). Sharp differences between point masses and phase-space Vlasov distributions of mass may be expected unless some kind of viscous or other constraint-breaking dissipative terms are added to the latter. We do not have in mind here spatially-periodic boundary conditions of the kind currently popular in cosmological simulations, but rather isolated, Newtonian, self-attracting collections of "stars" which may also possess an angular momentum integral [37].

We may be at a propitious moment for a significant conceptual increase in the class of dynamical systems that may be considered susceptible to a statistical mechanical analysis based upon ideas of maximum entropy.

ACKNOWLEDGMENTS

This work was supported in part by NASA Grant NAG-W-710 and USDoE Grant DE-FG02-85ER53194 at Dartmouth, by the U.S. Department of Energy at Los Alamos, and by NSF Grant ATM-89131627 and NASA Grant NGT-50338 at the Bartol Research Foundation.

References

[1] E.T. Jaynes, Phys. Rev. 106, 620 (1957) and 108, 171 (1957).
[2] e.g., R.H. Kraichnan and D. Montgomery, Repts. on Progress in Physics 43, 547 (1980).
[3] C.C. Lin, "On the Motion of Vortices in Two Dimensions" (University of Toronto Press, Toronto, 1943).
[4] L. Onsager, Nuovo Cimento Suppl. 6, 279 (1949).
[5] G. Joyce and D. Montgomery, J. Plasma Phys. 10, 107 (1973).
[6] D. Montgomery and G. Joyce, Phys. Fluids 17, 1139 (1974).
[7] C.E. Seyler, Jr., Phys. Rev. Lett. 32, 515 (1974) and Phys. Fluids 19, 1336 (1976).
[8] B.E. McDonald, J. Comp. Phys. 16, 630 (1974).
[9] D.L. Book, S. Fisher, and B.E. McDonald, Phys. Rev. Lett. 34, 4 (1975).
[10] Y.B. Pointin and T.S. Lundgren, Phys. Fluids 19, 1459 (1976).
[11] T.S. Lundgren and Y.B. Pointin, Phys. Fluids 20, 356 (1977).
[12] T.S. Lundgren and Y.B. Pointin, J. Stat. Phys. 17, 323 (1978).
[13] J.H. Williamson, J. Plasma Phys. 17, 85 (1977).
[14] G.A. Kriegsmann and E.L. Reiss, Phys. Fluids 21, 258 (1978).
[15] A.C. Ting, H.H. Chen, and Y.C. Lee, Physica D 26, 37 (1987).
[16] R.A. Smith, Phys. Rev. Lett. 63, 1479 (1989).
[17] R.A. Smith and T. O'Neil, Phys. Fluids B 2, 2961 (1990).
[18] R.A. Smith, Phys. Rev. A 43, 1126 (1991).
[19] L.J. Campbell and K. O'Neill, J. Stat. Phys. 65, 495 (1991).
[20] D. Montgomery, in "Maximum-Entropy and Bayesian Methods in Inverse Problems," ed. by C. Ray Smith and W.T. Grandy, Jr. (D. Reidel, Dordrecht, 1985), pp. 455 ff.
[21] W.H. Matthaeus, W.T. Stribling, D. Martinez, S. Oughton, and D. Montgomery, Phys. Rev. Lett. 66, 2731 (1991).
[22] W.H. Matthaeus, W.T. Stribling, D. Martinez, S. Oughton, and D. Montgomery, Physica D 51, 531 (1991).
[23] D. Gottlieb and S.A. Orszag, "Numerical Analysis of Spectral Methods," NSF-CBMS Monograph No. 26 (SIAM, Philadelphia, 1977).
[24] D. Montgomery, W.H. Matthaeus, W.T. Stribling, D. Martinez, and S. Oughton, Phys. Fluids A 4, 3 (1992).
[25] D. Montgomery, X. Shan, and W.H. Matthaeus, Phys. Fluids A 5, 2207 (1993).
[26] W.B. Jones and D. Montgomery, Physica D 73, 227 (1994).
[27] T. Herbert, Fluid Dyn. Trans. 11, 77 (1983).
[28] S.A. Orszag and L.C. Kells, J. Fluid Mech. 96, 159 (1980).
[29] S.A. Orszag and A.T. Patera, J. Fluid Mech. 128, 347 (1983).
[30] D. Lynden-Bell, Mon. Not. R. Astr. Soc. 136, 101 (1967).
[31] R. Robert and J. Sommeria, J. Fluid Mech. 229, 291 (1991).
[32] R. Robert and J. Sommeria, Phys. Rev. Lett. 69, 2776 (1992).
[33] J. Miller, P. Weichman, and M. Cross, Phys. Rev. A 45, 2328 (1992).
[34] D. Montgomery, L. Turner, and G. Vahala, J. Plasma Phys. 21, 239 (1979).
[35] G. Vahala and J. Ambrosiano, Phys. Fluids 24, 2253 (1981).
[36] D. Biskamp, Phys. Fluids B 5, 3893 (1993).
[37] D. Montgomery and Y.C. Lee, Ap. J. 368, 380 (1990).


Figure 1. Computer-drawn perspective plot of the vorticity field vs. x and y at six successive times, showing the eventual merger of all like-sign vortices (from Matthaeus et al [22]).


Figure 2. Scatter plot of vorticity vs. stream function after 374 initial large-scale eddy turnover times (from [24]). The dashed line drawn through the scatter plot is a least-squares fit of the hyperbolic sine term in Eq. (4).

Figure 3. Scatter plot of the positive and negative parts of the vorticity field obeying Eqs. (5), as computed by Shan [25] at time t = 390 eddy turnover times, for initial large-scale Reynolds number 10,000. The downward hooks to the left and right do not agree with the predictions of Eqs. (6).


Figure 4. Same scatter plots as shown in Figure 3, compared with a least-squares fit to the four-flux predictions of Eqs. (17), from [25]. The dashed lines represent Eqs. (17), and over most of the interval coincide so closely with the scatter plots as to be indistinguishable from them. The fit is limited apparently only by the slightly unequal energies of the two final vortices.


A LOGICAL FOUNDATION FOR REAL THERMODYNAMICS

R.S. Silver
Professor Emeritus, James Watt Chair
Department of Mechanical Engineering, University of Glasgow
Glasgow G12 8QQ, Scotland

ABSTRACT. During the 1940's a new approach to engineering thermodynamics had been begun by J.H. Keenan of M.I.T. in postulating thermodynamic activity as process transferring substance. This did not come out clearly for some time, but was greatly clarified in 1961 by Myron Tribus, whose work firmly established the concept of process transferring substance. But neither the Keenan nor the Tribus initiatives, nor any other systematic thermodynamic analysis, attempted to include frictional dissipation. Everyone knew that friction effects are ever present, and indeed that the equivalence between heat and work had been settled by Joule's frictional experiments. Yet no textbook of thermodynamics included formal analysis of the real inherent effects due to friction. That is why, in doing so, this paper includes the adjective "real" in its title.

It was about 1960 that I started trying to introduce realism into introductory thermodynamic teaching. I had graduated in Natural Philosophy in 1934 and became a research physicist in industry. For the next 29 years I was continuously employed in various branches of engineering industry, all of which happened to be concerned with thermodynamics. I had worked on explosions and explosives, and on combustion and other chemical reaction equipment, and had designed and operated boilers, turbines, evaporators, condensers, compressors, refrigerators, heat exchangers and desalination equipment, and had published quite a few papers. I had kept in touch with the Institute of Physics and the Institution of Mechanical Engineers, and took particular interest in the Education groups of those bodies.

But I had become very critical, in the period 1950-1960, of the prevailing academic thermodynamics teaching, particularly to engineers. I found the treatments in thermodynamics textbooks quite unrealistic, and of very little relevance to the many activities in which I had actively worked. The academic textbook presentations were full of redundancies and circular arguments in which the end swallowed the beginning. An exception to such texts was a thermodynamics book published by J.H. Keenan, engineering professor at M.I.T. in the 1940's during the war years [1].

I did not see Keenan's book until the mid-1950's but immediately realised it was free of the generic faults in the books previously known to me. Keenan had a new way of looking at the subject and was trying to find a distinction between substance and process. It is easy to be wise after someone else has shown the way, even if he has not shown it completely. In the late 1950's most of us in the thermodynamics field knew that Keenan had done something important but it was still not clear what.



However, from my varied practical work, I considered that the worst feature in all the usual books was the absence of a realistic treatment of the phenomenon of frictional dissipation inevitable in any real thermodynamic apparatus. There was no systematic general analysis of friction, but mere random recordings of ad hoc "efficiencies". And to this particular matter of friction the Keenan approach had not given any more attention than the other texts. By 1960 I had determined that I must attempt a proper account, within the theory of thermodynamics, of the frictional dissipation that is inherent in real apparatus.

In 1961 the book "Thermostatics and Thermodynamics" by Myron Tribus appeared [2] and differed from all the previous engineering thermodynamic texts, because Myron began with statistical thermodynamics which he had built on the information theory approach of Jaynes [3]. I was very happy with the Jaynes/Tribus statistical thermodynamics because its logical basis was more convincing than the traditional statistical treatment of Fowler and Guggenheim [4]. But while I greatly admired it as an exposition of microscopic thermodynamics, I did not regard it as a useful way into the thermodynamics of practical equipment.

It is customary to refer to "macroscopic" thermodynamics for the bulk phenomena and to "microscopic" for the statistical level. However in lectures a speaker has frequently to deliver his words rather laboriously to distinguish between "microscopic" and "macroscopic" and listeners may mistake one for the other. I prefer to avoid the difficulty altogether by using the term "megascopic" for the bulk phenomena and "microscopic" for the statistical. No pronunciation can make "mega" sound like "micro"!

At the time my main objective was to get frictional dissipation into the analysis, and I did not realise just how much else could be done. I was having problems with the three words energy, heat, and work, and of course was not alone in this. Keenan himself had not clearly established the relation between the three, although he had certainly initiated the matter, but his formulations were not clear.

The breakthrough to clarity came in 1961 from Myron Tribus in his book "Thermostatics and Thermodynamics" [2], but not at all from his advocacy of microscopic statistical thermodynamics. Instead it was a masterpiece of what I would call megascopic thermodynamics. It happened on his page 140, where he asked the question "How much rain is in the reservoir?" To which he gave the answer "None"! He added that heat cannot be stored in a body. "Only energy can be stored. Heat is the name given to the transferral process". So Tribus used water as an analogue of energy being transferred by a process which is an analogue of heat. Water is a substance in the reservoir. Rain is a process which transfers that substance to the reservoir from some other source. That analogy by Tribus was the vision of a brilliant teacher and began an entirely new outlook for thermodynamics.

Strangely Tribus did not specifically mention the corresponding process, work, which is like heat in that work also cannot be stored in a body. Work is the name given to a transferral process, a process different from heat, but which also transfers energy and is as worthy of analogy as is heat.

Once those two analogues due to Tribus are recognised there is no difficulty in conceiving heat to be a process for transferring energy from one mass to another, and in conceiving work likewise. Energy is the property, possessed by substance, that is transferred by either process, heat or work, or both. Either of these two processes, work or heat, can transfer energy to or from a reference mass from or to another reference mass, and both processes may occur simultaneously in the same or in opposite directions.


The Tribus vision introduced an important semantic feature into thermodynamics, because rain, the process, is qualitatively different from water, the substance. But an essential aspect of analytical science is to be quantitative, and for the process/substance pair the essential point is that the quantity of water received by the reservoir must be equal to the quantity of rain transferred.

These distinctions between property and process were actually implied in Keenan's 1941 initiative but had not come out clearly until later. Indeed in retrospect it can be seen that all the founding fathers of thermodynamics, Duhem, Clausius, Kelvin, Gibbs, etc. had struggled in vain to formulate equations without recognising the qualitative difference between property and process.

To exploit effectively the Tribus analogues a notation was required which would equate the quantities of the qualitatively different process rain and substance water. This quantitative equality of the distinct qualities was what I introduced in 1971 by using two clearly different increment symbols, one for process and the other for substance/property.

I sought to choose increment symbols which would each correspond to familiar algebraic connotations, and which would be clearly distinct from each other in a way conceptually matching the difference between process and substance.

The distinct increment symbols which I introduced were:
(1) For the process increment symbol I used the capital Greek delta, Δ.
(2) For substance/property I used the simple lower-case d, as commonly used for an increment symbol.

Those two increment symbols were clearly distinct from each other and were used in my 1971 textbook [5]. The effectiveness of that choice of distinct increment symbols is shown in the following equation in terms of the Tribus analogues, Rain R and Water Substance W, thus:

\Delta R_t = dW    (1)

Note that the symbol I use for process not only has ΔR but also a suffix t. That suffix is present to denote that the process is transferring to the reference unit mass. (An alternative convention has ΔR_t as the process transferring from the reference unit mass, with a change in sign of ΔR_t throughout.)

Turning now from the analogy to actual thermodynamics, we have to deal with the processes involved, and with the "substance" concept Energy E which is transferred by the processes. I begin by considering only mechanical phenomena and ignoring thermal phenomena completely.

In mechanics alone, energy is defined as the product of a force and distance in its direction of motion. The consequence of that definition is recognised in two forms, potential energy for which we shall use the symbol φ, and kinetic energy ½V². Both potential and kinetic energy are properties of a unit reference mass.

A megascopic unit mass is a sum of smaller bits and each bit carries its share of the potential and kinetic energy. But it cannot be assumed that such allocation takes full account of the summed energy of all the bits. The unit reference mass may contain, in addition to the observable kinetic and potential energy, another form of energy internally. This, which can be regarded, as Tribus has said, as "whatever other energy there is there", is denoted by U, the unknown energy which may exist in the reference mass.


Thus the total content E of energy in the reference mass is given by the equation

E = \phi + \tfrac{1}{2}V^2 + U    (2)

and therefore that

dE = d\phi + d(\tfrac{1}{2}V^2) + dU    (3)

I denote by Δψ_t the total sum of energy added to the reference mass by all such transfer processes, whatever they may be. A negative value of Δψ_t will denote that energy has been taken from the reference mass. Thus I write the general equation

\Delta\psi_t = dE    (4)

and therefore that

\Delta\psi_t = d\phi + d(\tfrac{1}{2}V^2) + dU    (5)

The mechanical process of energy transmission by the work mode is denoted ΔW_t and is given by:

\Delta W_t = p\,dV - d(\tfrac{1}{2}V^2) - d\phi - \Delta W_f    (6)

where p dV is the work output from the reference mass expanding against ambient pressure. The d(½V²) and dφ terms, which enter with negative signs, are work outputs obtained from diminishing kinetic energy and potential energy. The last term, ΔW_f, is the work dissipated against frictional resistance. The negative sign preceding it shows it always reduces the work output from what would have been obtained if there were no friction.

Thus equation (6) gives the total work ΔW_t done by the unit reference mass, i.e. the total energy output from the reference mass by the process of force times distance.

Equation (6) can be rearranged to give:

d\phi + d(\tfrac{1}{2}V^2) = p\,dV - \Delta W_t - \Delta W_f    (7)

And equation (5) rearranged to give:

dU = \Delta\psi_t - d\phi - d(\tfrac{1}{2}V^2)    (8)

Thus, from (7) and (8) together,

\Delta\psi_t = dU + p\,dV - \Delta W_f - \Delta W_t    (9)

Equation (9) can be put arbitrarily into the form of two processes by bracketing the first three terms of the R.H.S. of (9) as

\Delta\psi_t = (dU + p\,dV - \Delta W_f) - \Delta W_t    (10)

I now choose to give the bracketed terms a process form with the symbol ΔQ_t, using Q simply to imply "Question" as I do not yet know what it really is.

Therefore equation (9) is now composed of two processes, according to

\Delta\psi_t = \Delta Q_t - \Delta W_t    (11)


Equation (11) shows two distinct processes, confirming the value of introducing Δψ_t after equation (3) and explicitly allowing for more than one process of energy transfer. Equation (11) shows that the work mode of process, −ΔW_t, is negative, i.e. that this work process is being done by the reference mass. The other mode, ΔQ_t, if positive as assumed, will be transferring energy to the reference mass.

The remaining task is to identify with certainty the unknown transfer process ΔQ_t.

The identification is made as follows. The transfer process ΔQ_t is identical with the bracketed terms in equation (10), so

that we have

\Delta Q_t = dU + p\,dV - \Delta W_f    (12)

It follows that

dU + p\,dV = \Delta Q_t + \Delta W_f    (13)

The identification of the presently unknown ΔQ_t mode of energy transfer is now easily obtained because we see from equation (13) that, whatever ΔQ_t is, the results which it has on the internal energy and specific volume of the reference mass are of the same nature as those observed when frictional dissipation occurs.
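
A small symbolic check of the bookkeeping in equations (5), (6), (11) and (12) as reconstructed above; the Python names are illustrative stand-ins for the increments, and the check simply confirms that the identification Δψ_t = ΔQ_t − ΔW_t holds identically.

```python
import sympy as sp

# Increments of the property terms (dU, dphi, dKE, p dV) and the frictional dissipation dWf.
dU, dphi, dKE, pdV, dWf = sp.symbols('dU dphi dKE pdV dWf')

dWt = pdV - dKE - dphi - dWf        # Eq. (6): total work output, reduced by friction
dpsi_t = dphi + dKE + dU            # Eq. (5): total energy added by all transfer processes
dQt = dU + pdV - dWf                # Eq. (12): the bracketed "question" process

# Eq. (11), Delta psi_t = Delta Q_t - Delta W_t, should then hold identically:
print(sp.simplify(dpsi_t - (dQt - dWt)))    # prints 0
```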

What is observed when friction occurs is most frequently the phenomenon which we call a rise in temperature. I have not introduced temperature previously in my procedure and it does not need to be precisely measurable nor even defined at this stage. It is an unmistakable phenomenon which constitutes a qualitative indication that a change has been caused in a reference mass by frictional dissipation.

That is the first clue to the nature of ΔQ_t. It has to be something giving rise to the phenomenon which we call rise of temperature, but in the absence of frictional dissipation, and which is recognisably transmission from something outside the reference mass. The second clue is that, while in its nature the frictional dissipation ΔW_f is always positive, and must always give an indication in the same direction - observed to be rise of temperature - the process element ΔQ_t is transmitted energy and therefore must include the case of transmission from, as well as transmission to, the reference mass. Since the result of transmission to has been identified as rise of temperature, the result of transmission from must be fall of temperature. Therefore we have to seek in our experience for conditions where a fall of temperature is observed.

This enables unmistakable identification, for one of our most common experiences is that when two bodies are at different temperatures, and if, so far as we can tell, they are isolated from everything except each other, then the lower temperature will increase and the higher will decrease. Alternatively we observe that if the higher is maintained the lower will continue to increase until the two are equal, while if the lower is maintained the higher will fall to equality. This common phenomenon is what we call "heating". Thus the process which transfers the energy amount ΔQ_t is uniquely defined as "heating". But for our technical vocabulary definition we have chosen to speak of the mechanical transmission process as "work", not as "working". Thus for consistency we now give the other transfer process ΔQ_t the name "heat" in preference to "heating". Thus we now see ΔW_t as the mode of energy process by work, and ΔQ_t as the mode of energy process by heat.

Note that the presence of the frictional term ΔW_f in the equations is crucial to the identification of the ΔQ_t process. Without it you might speculate, but cannot be certain of the identification. Leave it in and the conclusion is completely determined.


There is a further important point which I think I should include. In my teaching textbook, I give a careful account of completing a thermodynamic cycle, including frictional dissipation throughout, and also a careful derivation of precise absolute temperature, and am then able to prove that the cyclic integral

\oint \frac{dU + p\,dV}{T} = 0    (14)

This confirms the existence of a state property of the reference unit mass which is identified with - or defined as - entropy, denoted as usual by S, and shows that

dS = \frac{dU + p\,dV}{T}    (15)

Then the relation to the heat process is seen to be

T\,dS = \Delta Q_t + \Delta W_f    (16)

or perhaps more informatively:

dS = \frac{\Delta Q_t}{T} + \frac{\Delta W_f}{T}    (17)

Thus, since ΔW_f is always positive, equation (17) is the proper form of the Clausius inequality

dS \geq \frac{\Delta Q_t}{T}    (18)

The equality form (17) is more useful than the traditional inequality (18).
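
As a purely illustrative numerical reading of (17) and (18), with assumed values T = 300 K, ΔQ_t = 600 J and ΔW_f = 30 J (none of which appear in the paper):

dS = \frac{\Delta Q_t + \Delta W_f}{T} = \frac{600\ \mathrm{J} + 30\ \mathrm{J}}{300\ \mathrm{K}} = 2.1\ \mathrm{J/K} \;>\; \frac{\Delta Q_t}{T} = 2.0\ \mathrm{J/K},

so the inequality (18) is recovered automatically from the positivity of ΔW_f.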

ACKNOWLEDGMENTS. I should like to express my thanks to Anthony Garrett and Steve Gull for the interest they have expressed in my approach to thermodynamics, and to them and John Skilling for assisting me to give the present paper, including overhead projection, during the proceedings of MAXENT 94. A very gracious permission, considering that the statistical content of my paper is truly microscopic.

REFERENCES

1. Keenan, J.H. 1941. Thermodynamics. Wiley, New York, U.S.A.
2. Tribus, M. 1961. Thermostatics and Thermodynamics. Van Nostrand, Princeton, New Jersey, U.S.A.
3. Jaynes, E.T. 1957. Information Theory and Statistical Mechanics I & II. Physical Review 106, 620-630 & 108, 171-190.
4. Fowler, R.H. & Guggenheim, E.A. 1939. Statistical Thermodynamics. Cambridge University Press, Cambridge, U.K.
5. Silver, R.S. 1971. An Introduction to Thermodynamics. Cambridge University Press, Cambridge, U.K.


Index

Abel inversion 36
Adaptive grid 86
Adulteration 173
Algorithm - EM 121, 283
Algorithm - genetic 140
Algorithm - maximum entropy 89
Algorithm - Richardson-Lucy 93
Amino acids 165, 263
Atomic nucleus 51, 59
Bayes' theorem 24, 168, 184, 239
Bayes factor 190
BayesCalc 127
Bayesian analysis 5, 157
Bayesian calculus 128
Bayesian classification 117
Bayesian cluster expansion 269
Bayesian error bars 109
Bayesian estimation 13, 101, 189
Bayesian evidence 64, 73, 138, 240
Bayesian inference 62, 135
Bayesian maximum entropy 31
Bayesian mechanics 159
Bayesian model comparison 135, 239
Bayesian model estimation 138
Bayesian predictive distribution 280
Bayesian reasoning 183
Bayesian statistical inference 158
BBGKY hierarchy 287
Belief 143, 166
Bernoulli distribution 119
Beta sheet - protein structure 265
Bloch formula 51
Boolean calculus 166
Boson statistics 294
Buffalo snowfall 194
CCD camera 110
Chemical doping 165
Civil law 183
Classification 117
Closure problem 287
Cluster expansion model 269
Combining data from multiple sensors 271


Communication systems 103
Communications - spread-spectrum 101
Complexity - computational 117
Confidence intervals 177
Conservation laws 304
Convexity 251
Correlation 243, 287
Cox axioms 166
Cramer transform 226
Cramer-Rao bounds 13
Criminal law 184
de Finetti generating function 291
Decision tree 117
Deconvolution 32
Default model 32, 38
Density estimation 189
Density modelling 259, 269
Diffusion imaging 1
Digital communication 103
Dingo baby 187
Dirichlet density 119
Dirichlet measure 192
Dirichlet prior probability 272
Dirichlet process 193
Disc packing 233
Discrimination information 102
Dissipation - dynamical 303
Dissipation - frictional 315
Distribution - distance 70
Distribution - fractal 247
Distribution - Bernoulli 119
DNA analysis 184
Duality 223
Dynamics 303
Echo-planar technique 3
Efficiency of heat engine 316
Electromagnetic waves 81
EM algorithm 121, 283
Entropy 145
Entropic prior 63
EPR spectroscopy 24
Equilibrium 150


Error rates 102
ESR spectroscopy 24
Evidence 252
Evolution equations 288
Fermi gas 51
Finite mixture 118
Flow imaging 1
Fluid flow 8
Fluid turbulence 304
Fourier-Bessel expansion 59
Fractal distribution 247
Free energy 262
Free radicals 24
Friction 146, 315
Gaussian partitioned mixture distribution 284
Generating function 287
Genetic algorithm 140
Geometry 160
Graduated non-convexity 251
H theorem 308
Heat 153, 319
Hierarchical density model 269
HIRES program 91
Hyperparameters 249
Hypothesis 144, 183, 186
Image correlation 243
Image models 239
Image - NMR 13
Image pixel intensities 239
Image reconstruction 17, 109, 199
Incomplete data 17
Incomplete knowledge 147, 213
Induction 155
Inference - classical 179
Interpolation 249
Inverse problems 31, 199, 224
Inverse scattering 59
Ion scattering 31
IRAS (InfraRed Astronomy Satellite) 91, 124
Kinetic theory 287
Labelled photographs 52
Lagrange multipliers 54, 73, 85, 149
Lagrange parameter 35
Laminar flow 7
Landsat 125
Laplace transform 51
Latent variable 260


Law case - civil 183
Law case - criminal 184
Law case - dingo baby 187
Layer-to-layer transformations 273
Legendre transform 35
Lettuce 27
Location model 175
Lorentzian distribution 17
Magnetic Resonance imaging 13
Magnetohydrodynamics 31, 310
MAP (maximum a posteriori) 84, 200
MAP estimates - dynamical model 158
Markov models 201
Markov random fields 209
Massive Inference 193
Mathematica 127
MaxEnt closure 288
Maximum correlation method 92
Maximum entropy 51, 69, 84, 106, 110, 148, 213, 233
Maximum entropy - algorithm 89
Maximum entropy - Bayesian 31
Maximum entropy - quantified 28, 34
Maximum entropy cluster expansion 270
Maximum entropy constraints 234
Maximum entropy on the mean 223
Maximum entropy predictions 305
Mayer cluster closure 288
Measure 191
Measure problem 300
Mechanics - Bayesian 159
Minimum relative entropy 52, 106
Mixture distribution as neural network 281
Mixture model 259
Mixture separation 117
Model - density 279
Model - dynamical 158
Model - location and scale 175
Moments - method of 110
Monte Carlo method 262
Monte Carlo simulation 236
Motion encoding 5
Multi-resolution data 79
Multilayer perceptron 259
Mutual information maximisation 275
Nakagami distribution 101
NARMAX autoregressive model 136


Navier-Stokes equations 304
Neural network 87, 117, 135, 259, 281
Neural spike 205
Newton-Raphson 244
NMR motion-encoding 1
Noise 109, 229
Non-convexity 251
Nuclear level density 51
Nuclear partition function 53
Numerical calculation 127
Observation - single 175
Occam factor 123, 242
Ockham's razor 63, 167
Odds - prior and posterior 183
Old Faithful 194
Optical fibre 110
Overlapping density models 279
Particle scattering 61
Partitioned mixture distribution 279, 282
Perceptron 259
Perinaphthyl radical 28
Plasma tomography 31
Point spread function - composite 28
Power law 243
Power spectrum 243
Predictive distribution approximation 280
Principle of independence 218
Principle of indifference 216
Prior information 31
Prior model selection 62, 209
Process - Dirichlet 193
Protein structure 263
Radial basis function 136
Random packing 233
Reasoning - non-monotonic 218
Reasoning - statistical 213
Reconstruction - ill-posed 80
Regression 249
Regularization 71, 223, 249, 263


Relative entropy maximisation 281
Relevance determination 264
Richardson-Lucy algorithm 93
Sampling - non-uniform 13
Scale invariance 201, 247
Scale model 175
Scan time reduction 13
Scattering - particle 61
Scattering - small-angle 69
Self-supervising multi-layer networks 275
Sensitivity analysis 223
Simpson's paradox 216
Simulations 45
Softmax classifier 260
Spatial correlation 189
Speckle pattern 112
Spectroscopy - EPR, ESR 24
Spectroscopy - ion scattering 31
Spike plot 26
Spin trapping 26
Spline 249
Steepest descent 161
Stein estimates 177
Surface structure 31
Symbolic calculation 127


System identification - nonlinear 135
Thermodynamic process and substance 317
Thermodynamics 146, 315
Thinking 143
Three-dimensional flows 8
Tomography 31, 80
Trenchless drilling 80
Truth - presentation of 157
Uncertainty 157
Volterra polynomial 136
Wavelet transform 244
Wigmore 183
X-rays 31