
arXiv:0809.3437v1 [astro-ph] 19 Sep 2008

Mon. Not. R. Astron. Soc. 000, 1–14 (2008)    Printed 19 September 2008    (MN LaTeX style file v2.2)

MULTINEST: an efficient and robust Bayesian inference tool for cosmology and particle physics

F. Feroz⋆, M.P. Hobson and M. Bridges
Astrophysics Group, Cavendish Laboratory, JJ Thomson Avenue, Cambridge CB3 0HE, UK

Accepted —. Received —; in original form 19 September 2008

ABSTRACT
We present further development and the first public release of our multimodal nested sampling algorithm, called MULTINEST. This Bayesian inference tool calculates the evidence, with an associated error estimate, and produces posterior samples from distributions that may contain multiple modes and pronounced (curving) degeneracies in high dimensions. The developments presented here lead to further substantial improvements in sampling efficiency and robustness, as compared to the original algorithm presented in Feroz & Hobson (2008), which itself significantly outperformed existing MCMC techniques in a wide range of astrophysical inference problems. The accuracy and economy of the MULTINEST algorithm are demonstrated by application to two toy problems and to a cosmological inference problem focussing on the extension of the vanilla ΛCDM model to include spatial curvature and a varying equation of state for dark energy. The MULTINEST software, which is fully parallelized using MPI and includes an interface to CosmoMC, is available at http://www.mrao.cam.ac.uk/software/multinest/. It will also be released as part of the SuperBayeS package, for the analysis of supersymmetric theories of particle physics, at http://www.superbayes.org

Key words: methods: data analysis – methods: statistical

1 INTRODUCTION

Bayesian analysis methods are already widely used in astrophysics and cosmology, and are now beginning to gain acceptance in particle physics phenomenology. As a consequence, considerable effort has been made to develop efficient and robust methods for performing such analyses. Bayesian inference is usually considered to divide into two categories: parameter estimation and model selection. Parameter estimation is typically performed using Markov chain Monte Carlo (MCMC) sampling, most often based on the standard Metropolis–Hastings algorithm or its variants, such as Gibbs or Hamiltonian sampling (see e.g. Mackay 2003). Such methods can be very computationally intensive, however, and often experience problems in sampling efficiently from a multimodal posterior distribution or one with large (curving) degeneracies between parameters, particularly in high dimensions. Moreover, MCMC methods often require careful tuning of the proposal distribution to sample efficiently, and testing for convergence can be problematic. Bayesian model selection has been further hindered by the even greater computational expense involved in the calculation to sufficient precision of the key ingredient, the Bayesian evidence (also called the marginalized likelihood or the marginal density of the data).

⋆ E-mail: [email protected]

As the average likelihood of a model over its prior probability space, the evidence can be used to assign relative probabilities to different models (for a review of cosmological applications, see Mukherjee et al. 2006). The existing preferred evidence-evaluation method, again based on MCMC techniques, is thermodynamic integration (see e.g. Ó Ruanaidh & Fitzgerald 1996), which is extremely computationally intensive but has been used successfully in astronomical applications (see e.g. Hobson & McLachlan 2003; Marshall et al. 2003; Slosar et al. 2003; Niarchou et al. 2004; Bassett et al. 2004; Trotta 2007; Beltrán et al. 2005; Bridges et al. 2006). Some fast approximate methods have been used for evidence evaluation, such as treating the posterior as a multivariate Gaussian centred at its peak (see e.g. Hobson et al. 2002), but this approximation is clearly a poor one for multimodal posteriors (except perhaps if one performs a separate Gaussian approximation at each mode). The Savage–Dickey density ratio has also been proposed (Trotta 2005) as an exact, and potentially faster, means of evaluating evidences, but is restricted to the special case of nested hypotheses and a separable prior on the model parameters. Various alternative information criteria for astrophysical model selection are discussed by Liddle (2007), but the evidence remains the preferred method.

Nested sampling (Skilling 2004) is a Monte Carlo method targeted at the efficient calculation of the evidence, but it also produces posterior inferences as a by-product. In cosmological applications, Mukherjee et al. (2006) showed that their implementation of the method requires a factor of ∼100 fewer posterior evaluations than thermodynamic integration. To achieve an improved acceptance ratio and efficiency, their algorithm uses an elliptical bound containing the current point set at each stage of the process to restrict the region around the posterior peak from which new samples are drawn. Shaw et al. (2007) point out that this method becomes highly inefficient for multimodal posteriors, and hence introduce the notion of clustered nested sampling, in which multiple peaks in the posterior are detected and isolated, and separate ellipsoidal bounds are constructed around each mode. This approach significantly increases the sampling efficiency. The overall computational load is reduced still further by the use of an improved error calculation (Skilling 2004) on the final evidence result that produces a mean and standard error in one sampling, eliminating the need for multiple runs. In our previous paper (Feroz & Hobson 2008, hereinafter FH08), we built on the work of Shaw et al. (2007) by pursuing further the notion of detecting and characterising multiple modes in the posterior from the distribution of nested samples, and presented a number of innovations that resulted in a substantial improvement in sampling efficiency and robustness, leading to an algorithm that constituted a viable, general replacement for traditional MCMC sampling techniques in astronomical data analysis.

In this paper, we present further substantial development of the method discussed in FH08 and make the first public release of the resulting Bayesian inference tool, called MULTINEST. In particular, we propose fundamental changes to the 'simultaneous ellipsoidal sampling' method described in FH08, which result in a substantially improved and fully parallelized algorithm for calculating the evidence and obtaining posterior samples from distributions with (an unknown number of) multiple modes and/or pronounced (curving) degeneracies between parameters. The algorithm also naturally identifies individual modes of a distribution, allowing for the evaluation of the 'local' evidence and parameter constraints associated with each mode separately.

The outline of the paper is as follows. In Section 2, we briefly review the basic aspects of Bayesian inference for parameter estimation and model selection. In Section 3, we introduce nested sampling, and discuss the use of ellipsoidal bounds in Section 4. In Section 5, we present the MULTINEST algorithm. In Section 6, we apply our new algorithms to two toy problems to demonstrate the accuracy and efficiency of the evidence calculation and parameter estimation as compared with other techniques. In Section 7, we consider the use of our new algorithm for cosmological model selection, focussed on the extension of the vanilla ΛCDM model to include spatial curvature and a varying equation of state for dark energy. We compare the efficiency of MULTINEST and standard MCMC techniques for cosmological parameter estimation in Section 7.3. Finally, our conclusions are presented in Section 8.

2 BAYESIAN INFERENCE

Bayesian inference methods provide a consistent approach to the estimation of a set of parameters Θ in a model (or hypothesis) H for the data D. Bayes' theorem states that

\[ \Pr(\Theta|D, H) = \frac{\Pr(D|\Theta, H)\,\Pr(\Theta|H)}{\Pr(D|H)}, \qquad (1) \]

where Pr(Θ|D, H) ≡ P(Θ) is the posterior probability distribution of the parameters, Pr(D|Θ, H) ≡ L(Θ) is the likelihood, Pr(Θ|H) ≡ π(Θ) is the prior, and Pr(D|H) ≡ Z is the Bayesian evidence.

In parameter estimation, the normalising evidence factor is usually ignored, since it is independent of the parameters Θ, and inferences are obtained by taking samples from the (unnormalised) posterior using standard MCMC sampling methods, where at equilibrium the chain contains a set of samples from the parameter space distributed according to the posterior. This posterior constitutes the complete Bayesian inference of the parameter values, and can be marginalised over each parameter to obtain individual parameter constraints.

In contrast to parameter estimation problems, in model selection the evidence takes the central role and is simply the factor required to normalize the posterior over Θ:

\[ Z = \int \mathcal{L}(\Theta)\,\pi(\Theta)\, d^D\Theta, \qquad (2) \]

where D is the dimensionality of the parameter space. As the average of the likelihood over the prior, the evidence automatically implements Occam's razor: a simpler theory with a compact parameter space will have a larger evidence than a more complicated one, unless the latter is significantly better at explaining the data. The question of model selection between two models H_0 and H_1 can then be decided by comparing their respective posterior probabilities given the observed data set D, as follows:

\[ \frac{\Pr(H_1|D)}{\Pr(H_0|D)} = \frac{\Pr(D|H_1)\,\Pr(H_1)}{\Pr(D|H_0)\,\Pr(H_0)} = \frac{Z_1}{Z_0}\,\frac{\Pr(H_1)}{\Pr(H_0)}, \qquad (3) \]

where Pr(H_1)/Pr(H_0) is the a priori probability ratio for the two models, which can often be set to unity but occasionally requires further consideration.

Evaluation of the multidimensional integral (2) is a challenging numerical task. The standard technique of thermodynamic integration draws MCMC samples not from the posterior directly but from L^λ π, where λ is an inverse temperature that is slowly raised from ≈ 0 to 1 according to some annealing schedule. It is possible to obtain accuracies of within 0.5 units in log-evidence via this method, but in cosmological model selection applications it typically requires of order 10^6 samples per chain (with around 10 chains required to determine a sampling error). This makes evidence evaluation at least an order of magnitude more costly than parameter estimation.
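To make the cost concrete, the following toy sketch (ours, in Python; not taken from the paper) estimates a log-evidence by thermodynamic integration for a one-dimensional normalised Gaussian likelihood with a uniform prior on [0, 1], so the true log Z is ≈ 0. The annealing schedule, chain length and Metropolis step width are all illustrative choices.

    import math, random

    sigma = 0.1
    def loglike(x):
        return (-0.5 * ((x - 0.5) / sigma) ** 2
                - 0.5 * math.log(2 * math.pi * sigma ** 2))

    def mean_loglike(lam, steps=20000, width=0.2):
        # Metropolis-Hastings chain targeting L^lam * pi on [0, 1]
        x, lx = 0.5, loglike(0.5)
        total = 0.0
        for _ in range(steps):
            y = x + width * (random.random() - 0.5)
            if 0.0 <= y <= 1.0:          # proposals outside the prior are rejected
                ly = loglike(y)
                if random.random() < math.exp(min(0.0, lam * (ly - lx))):
                    x, lx = y, ly
            total += lx
        return total / steps

    # log Z = integral over lambda of E[log L]; schedule denser near lambda = 0
    lams = [(k / 32) ** 4 for k in range(33)]
    Es = [mean_loglike(l) for l in lams]
    logZ = sum(0.5 * (Es[k] + Es[k + 1]) * (lams[k + 1] - lams[k])
               for k in range(32))
    print("log Z ~", logZ)               # true value ~ 0 for this toy problem

Even this one-dimensional toy spends several hundred thousand likelihood calls across the annealing schedule, which illustrates why the method becomes so expensive in realistic settings.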

3 NESTED SAMPLING

Nested sampling (Skilling 2004) is a Monte Carlo technique aimed at efficient evaluation of the Bayesian evidence, but it also produces posterior inferences as a by-product. A full discussion of the method is given in FH08, so we give only a brief description here, following the notation of FH08.

Nested sampling exploits the relation between the likelihood and prior volume to transform the multidimensional evidence integral (Eq. 2) into a one-dimensional integral. The 'prior volume' X is defined by dX = π(Θ) d^D Θ, so that

\[ X(\lambda) = \int_{\mathcal{L}(\Theta) > \lambda} \pi(\Theta)\, d^D\Theta, \qquad (4) \]

where the integral extends over the region(s) of parameter space contained within the iso-likelihood contour L(Θ) = λ. The evidence integral (Eq. 2) can then be written as

\[ Z = \int_0^1 \mathcal{L}(X)\, dX, \qquad (5) \]

where L(X), the inverse of Eq. 4, is a monotonically decreasing function of X. Thus, if one can evaluate the likelihoods L_i = L(X_i), where X_i is a sequence of decreasing values,

\[ 0 < X_M < \cdots < X_2 < X_1 < X_0 = 1, \qquad (6) \]

as shown schematically in Fig. 1, the evidence can be approximated numerically using standard quadrature methods as a weighted sum

\[ Z = \sum_{i=1}^{M} L_i w_i. \qquad (7) \]

In the following we will use the simple trapezium rule, for which the weights are given by w_i = (X_{i−1} − X_{i+1})/2. An example of a posterior in two dimensions and its associated function L(X) is shown in Fig. 1.

Figure 1. Cartoon illustrating (a) the posterior of a two-dimensional problem; and (b) the transformed L(X) function, where the prior volumes X_i are associated with each likelihood L_i.

The summation (Eq. 7) is performed as follows. The iteration counter is first set to i = 0 and N 'active' (or 'live') samples are drawn from the full prior π(Θ) (which is often simply the uniform distribution over the prior range), so the initial prior volume is X_0 = 1. The samples are then sorted in order of their likelihood and the smallest (with likelihood L_0) is removed from the active set (hence becoming 'inactive') and replaced by a point drawn from the prior subject to the constraint that the point has a likelihood L > L_0. The corresponding prior volume contained within this iso-likelihood contour will be a random variable given by X_1 = t_1 X_0, where t_1 follows the distribution Pr(t) = N t^{N−1} (i.e. the probability distribution for the largest of N samples drawn uniformly from the interval [0, 1]). At each subsequent iteration i, the removal of the lowest-likelihood point L_i in the active set, the drawing of a replacement with L > L_i and the reduction of the corresponding prior volume X_i = t_i X_{i−1} are repeated, until the entire prior volume has been traversed. The algorithm thus travels through nested shells of likelihood as the prior volume is reduced. The mean and standard deviation of log t, which dominates the geometrical exploration, are E[log t] = −1/N and σ[log t] = 1/N. Since each value of log t is independent, after i iterations the prior volume will shrink down such that log X_i ≈ −(i ± √i)/N. Thus, one takes X_i = exp(−i/N).

The algorithm is terminated on determining the evidence to some specified precision (we use 0.5 in log-evidence): at iteration i, the largest evidence contribution that can be made by the remaining portion of the posterior is ∆Z_i = L_max X_i, where L_max is the maximum likelihood in the current set of active points. The evidence estimate (Eq. 7) may then be refined by adding a final increment from the set of N active points, which is given by

\[ \Delta Z = \sum_{j=1}^{N} L_j w_{M+j}, \qquad (8) \]

where w_{M+j} = X_M/N for all j. The final uncertainty on the calculated evidence may be straightforwardly estimated from a single run of the nested sampling algorithm by calculating the relative entropy of the full sequence of samples (see FH08).

Once the evidence Z is found, posterior inferences can be easily generated using the full sequence of (inactive and active) points generated in the nested sampling process. Each such point is simply assigned the weight

\[ p_j = \frac{L_j w_j}{Z}, \qquad (9) \]

where the sample index j runs from 1 to 𝒩 = M + N, the total number of sampled points. These samples can then be used to calculate inferences of posterior parameters such as means, standard deviations, covariances and so on, or to construct marginalised posterior distributions.
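The following minimal sketch (ours, in Python) implements the basic loop of this section (Eqs. 5-8) for a toy one-dimensional problem: a uniform prior on [0, 1] and a normalised Gaussian likelihood, so the true evidence is Z ≈ 1. The naive rejection draw of the replacement point and the simple termination threshold are illustrative stand-ins for MULTINEST's ellipsoidal sampling and stopping criterion.

    import math, random

    sigma = 0.1
    def loglike(x):
        return (-0.5 * ((x - 0.5) / sigma) ** 2
                - 0.5 * math.log(2 * math.pi * sigma ** 2))

    N = 100                                   # number of active points
    live = [random.random() for _ in range(N)]
    logL = [loglike(x) for x in live]
    Z, i = 0.0, 0
    while True:
        i += 1
        worst = min(range(N), key=lambda k: logL[k])
        Li = logL[worst]
        # trapezium weight w_i = (X_{i-1} - X_{i+1})/2 with X_i = exp(-i/N)
        w = 0.5 * (math.exp(-(i - 1) / N) - math.exp(-(i + 1) / N))
        Z += math.exp(Li) * w
        # stop once the largest possible remaining contribution is negligible
        if math.exp(max(logL)) * math.exp(-i / N) < 1e-3 * Z:
            break
        # draw a replacement from the prior subject to L > L_i; MULTINEST
        # does this within ellipsoidal bounds (Section 5), here we draw naively
        while True:
            x = random.random()
            if loglike(x) > Li:
                live[worst], logL[worst] = x, loglike(x)
                break

    # final increment from the remaining active points (Eq. 8)
    Z += sum(math.exp(L) for L in logL) * math.exp(-i / N) / N
    print("log Z ~", math.log(Z))             # true value ~ 0 here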

4 ELLIPSOIDAL NESTED SAMPLING

The most challenging task in implementing the nested sampling algorithm is drawing samples from the prior within the hard constraint L > L_i at each iteration i. Employing a naive approach that draws blindly from the prior would result in a steady decrease in the acceptance rate of new samples with decreasing prior volume (and increasing likelihood).

Ellipsoidal nested sampling (Mukherjee et al. 2006) tries to overcome the above problem by approximating the iso-likelihood contour L = L_i by a D-dimensional ellipsoid determined from the covariance matrix of the current set of active points. New points are then selected from the prior within this ellipsoidal bound (usually enlarged slightly by some user-defined factor) until one is obtained that has a likelihood exceeding that of the removed lowest-likelihood point. In the limit that the ellipsoid coincides with the true iso-likelihood contour, the acceptance rate tends to unity.

Ellipsoidal nested sampling as described above is efficient for simple unimodal posterior distributions without pronounced degeneracies, but is not well suited to multimodal distributions. As advocated by Shaw et al. (2007) and shown in Fig. 2, the sampling efficiency can be substantially improved by identifying distinct clusters of active points that are well separated and constructing an individual (enlarged) ellipsoid bound for each cluster. In some problems, however, some modes of the posterior may exhibit a pronounced curving degeneracy, so that a mode more closely resembles a (multi-dimensional) 'banana'. Such features are problematic for all sampling methods, including that of Shaw et al. (2007).

In FH08, we made several improvements to the sampling method of Shaw et al. (2007), which significantly improved its efficiency and robustness. Among these, we proposed a solution to the above problem by partitioning the set of active points into as many sub-clusters as possible to allow maximum flexibility in following the degeneracy. These clusters are then enclosed in ellipsoids, and a new point is then drawn from the set of these 'overlapping' ellipsoids, correctly taking into account the overlaps. Although this sub-clustering approach provides maximum efficiency for highly degenerate distributions, it can result in lower efficiencies for relatively simpler problems owing to the overlap between the ellipsoids. Also, the factor by which each ellipsoid was enlarged was chosen arbitrarily. Another problem with our previous approach was in separating modes with elongated curving degeneracies. We now propose solutions to all these problems, along with some additional modifications to improve efficiency and robustness still further, in the MULTINEST algorithm presented in the following section.

Figure 2. Cartoon of ellipsoidal nested sampling from a simple bimodal distribution. In (a) we see that the ellipsoid represents a good bound to the active region. In (b)-(d), as we nest inward we can see that the acceptance rate will rapidly decrease as the bound steadily worsens. Figure (e) illustrates the increase in efficiency obtained by sampling from each clustered region separately.

5 THE MULTINEST ALGORITHM

The MULTINEST algorithm builds upon the 'simultaneous ellipsoidal nested sampling method' presented in FH08, but incorporates a number of improvements. In short, at each iteration i of the nested sampling process, the full set of N active points is partitioned and ellipsoidal bounds constructed using a new algorithm presented in Section 5.2 below. This new algorithm is far more efficient and robust than the method used in FH08 and automatically accommodates elongated curving degeneracies, while maintaining high efficiency for simpler problems. This results in a set of (possibly overlapping) ellipsoids. The lowest-likelihood point from the full set of N active points is then removed (hence becoming 'inactive') and replaced by a new point drawn from the set of ellipsoids, correctly taking into account any overlaps. Once a point becomes inactive it plays no further part in the nested sampling process, but its details remain stored. We now discuss the MULTINEST algorithm in detail.

5.1 Unit hypercube sampling space

The new algorithm for partitioning the active points into clusters and constructing ellipsoidal bounds requires the points to be uniformly distributed in the parameter space. To satisfy this requirement, the MULTINEST 'native' space is taken as a D-dimensional unit hypercube (each parameter value varies from 0 to 1) in which samples are drawn uniformly. All partitioning of points into clusters, construction of ellipsoidal bounds and sampling are performed in the unit hypercube.

In order to conserve probability mass, the point u = (u_1, u_2, ..., u_D) in the unit hypercube should be transformed to the point Θ = (θ_1, θ_2, ..., θ_D) in the 'physical' parameter space, such that

\[ \int \pi(\theta_1, \theta_2, \ldots, \theta_D)\, d\theta_1\, d\theta_2 \cdots d\theta_D = \int du_1\, du_2 \cdots du_D. \qquad (10) \]

In the simple case that the prior π(Θ) is separable,

\[ \pi(\theta_1, \theta_2, \ldots, \theta_D) = \pi_1(\theta_1)\,\pi_2(\theta_2)\cdots\pi_D(\theta_D), \qquad (11) \]

one can satisfy Eq. 10 by setting

\[ \pi_j(\theta_j)\, d\theta_j = du_j. \qquad (12) \]

Therefore, for a given value of u_j, the corresponding value of θ_j can be found by solving

\[ u_j = \int_{-\infty}^{\theta_j} \pi_j(\theta'_j)\, d\theta'_j. \qquad (13) \]

In the more general case in which the prior π(Θ) is not separable, one instead writes

\[ \pi(\theta_1, \theta_2, \ldots, \theta_D) = \pi_1(\theta_1)\,\pi_2(\theta_2|\theta_1)\cdots\pi_D(\theta_D|\theta_1, \theta_2, \ldots, \theta_{D-1}), \qquad (14) \]

where we define

\[ \pi_j(\theta_j|\theta_1, \ldots, \theta_{j-1}) = \int \pi(\theta_1, \ldots, \theta_{j-1}, \theta_j, \theta_{j+1}, \ldots, \theta_D)\, d\theta_{j+1} \cdots d\theta_D. \qquad (15) \]

The physical parameters Θ corresponding to the parameters u in the unit hypercube can then be found by replacing the distributions π_j in Eq. 13 with those defined in Eq. 15 and solving for θ_j. The corresponding physical parameters Θ are then used to calculate the likelihood value of the point u in the unit hypercube.

It is worth mentioning that in many problems the prior π(Θ) is uniform, in which case the unit hypercube and the physical parameter space coincide. Even when this is not so, one is often able to solve Eq. 13 analytically, resulting in virtually no computational overhead. For more complicated problems, two alternative approaches are possible. First, one may solve Eq. 13 numerically, most often using look-up tables to reduce the computational cost. Alternatively, one can re-cast the inference problem so that the conversion between the unit hypercube and the physical parameter space becomes trivial. This is straightforwardly achieved by, for example, defining the new 'likelihood' L′(Θ) ≡ L(Θ)π(Θ) and 'prior' π′(Θ) ≡ constant. The latter approach does, however, have the potential to be inefficient, since it does not make use of the true prior π(Θ) to guide the sampling of the active points.
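As an illustration of Eq. 13, the sketch below (ours) maps unit-hypercube coordinates to physical parameters for two common separable priors; the ranges and values are arbitrary examples, not tied to any particular problem in the paper.

    import math
    from statistics import NormalDist

    def uniform_prior(u, lo, hi):
        # solves u = (theta - lo) / (hi - lo) for theta
        return lo + u * (hi - lo)

    def gaussian_prior(u, mu, sigma):
        # theta is the u-quantile of N(mu, sigma^2), i.e. the inverse CDF
        return NormalDist(mu, sigma).inv_cdf(u)

    # map a unit-hypercube point u to physical parameters theta
    u = (0.3, 0.9)
    theta = (uniform_prior(u[0], 0.0, 10.0 * math.pi),
             gaussian_prior(u[1], 0.0, 1.0))
    print(theta)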

5.2 Partitioning and construction of ellipsoidal bounds

In FH08, the partitioning of the set of N active points at each iteration was performed in two stages. First, X-means (Pelleg et al. 2000) was used to partition the set into the number of clusters that optimised the Bayesian Information Criterion (BIC). Second, to accommodate modes with elongated, curving degeneracies, each cluster identified by X-means was divided into sub-clusters to follow the degeneracy. To allow maximum flexibility, this was performed using a modified, iterative k-means algorithm with k = 2 to produce as many sub-clusters as possible, consistent with there being at least D + 1 points in any sub-cluster, where D is the dimensionality of the parameter space. As mentioned above, however, this approach can lead to inefficiencies for simpler problems in which the iso-likelihood contour is well described by a few (well-separated) ellipsoidal bounds, owing to large overlaps between the ellipsoids enclosing each sub-cluster. Moreover, the factor f by which each ellipsoid was enlarged was chosen arbitrarily.

We now address these problems by using a new method to partition the active points into clusters and simultaneously construct the ellipsoidal bound for each cluster (this also makes redundant the notion of sub-clustering). At the ith iteration of the nested sampling process, an 'expectation-maximization' (EM) approach is used to find the optimal ellipsoidal decomposition of N active points distributed uniformly in a region enclosing prior volume X_i, as set out below.

Let us begin by denoting the set of N active points in the unit hypercube by S = {u_1, u_2, ..., u_N} and some partitioning of the set into K clusters (called the set's K-partition) by {S_k}_{k=1}^{K}, where K ≥ 1 and ∪_{k=1}^{K} S_k = S. For a cluster (or subset) S_k containing n_k points, a reasonably accurate and computationally efficient approximation to its minimum-volume bounding ellipsoid is given by

\[ E_k = \{ u \in \mathbb{R}^D \,|\, (u - \mu_k)^{\mathrm T} (f_k C_k)^{-1} (u - \mu_k) \leq 1 \}, \qquad (16) \]

where

\[ C_k = \frac{1}{n_k} \sum_{j=1}^{n_k} (u_j - \mu_k)(u_j - \mu_k)^{\mathrm T} \qquad (17) \]

is the empirical covariance matrix of the subset S_k and μ_k = (1/n_k) Σ_{j=1}^{n_k} u_j is its centre of mass. The enlargement factor f_k ensures that E_k is a bounding ellipsoid for the subset S_k. The volume of this ellipsoid, denoted by V(E_k), is then proportional to √det(f_k C_k).

Suppose for the moment that we know the volume V(S) of the region from which the set S is uniformly sampled, and let us define the function

\[ F(S) \equiv \frac{1}{V(S)} \sum_{k=1}^{K} V(E_k). \qquad (18) \]

The minimisation of F(S), subject to the constraint F(S) ≥ 1, with respect to K-partitionings {S_k}_{k=1}^{K} will then yield an 'optimal' decomposition into K ellipsoids of the original sampled region. The minimisation of F(S) is most easily performed using an 'expectation-minimization' scheme as set out below. This approach makes use of the result (Lu et al. 2007) that, for uniformly distributed points, the variation in F(S) resulting from reassigning a point with position u from the subset S_k to the subset S_{k'} is given by

\[ \Delta F(S)_{k,k'} \approx \gamma \left[ \frac{V(E_{k'})\, d(u, S_{k'})}{V(S_{k'})} - \frac{V(E_k)\, d(u, S_k)}{V(S_k)} \right], \qquad (19) \]

where γ is a constant,

\[ d(u, S_k) = (u - \mu_k)^{\mathrm T} (f_k C_k)^{-1} (u - \mu_k) \qquad (20) \]

is the Mahalanobis distance from u to the centroid μ_k of ellipsoid E_k defined in Eq. 16, and

\[ V(S_k) = \frac{n_k V(S)}{N} \qquad (21) \]

may be considered as the true volume from which the subset of points S_k were drawn uniformly. The approach we have adopted in fact differs slightly from that outlined above, since we make further use of Eq. 21 to impose the constraint that the volume V(E_k) of the kth ellipsoid should never be less than the 'true' volume V(S_k) occupied by the subset S_k. This can easily be achieved by enlarging the ellipsoid E_k by a factor f_k, such that its volume V(E_k) = max[V(E_k), V(S_k)], before evaluating Eqs. 18 and 19.

In our case, however, at each iteration i of the nested sampling process, V(S) corresponds to the true remaining prior volume X_i, which is not known. Nonetheless, as discussed in Section 3, we do know the expectation value of this random variable. We thus take V(S) = exp(−i/N), which, in turn, allows us to define V(S_k) according to Eq. 21.
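A minimal sketch of the bounding-ellipsoid construction of Eqs. 16-17, including the enlargement of V(E_k) up to a target volume as described above, might look as follows; this is a numpy illustration of ours, not MULTINEST's actual implementation.

    import math
    import numpy as np

    def bounding_ellipsoid(points, target_volume=0.0):
        n, D = points.shape
        mu = points.mean(axis=0)                    # centre of mass
        C = (points - mu).T @ (points - mu) / n     # empirical covariance, Eq. 17
        Cinv = np.linalg.inv(C)
        # Mahalanobis distances of Eq. 20 (with f_k = 1)
        diff = points - mu
        d = np.einsum('ij,jk,ik->i', diff, Cinv, diff)
        f = d.max()                 # smallest enlargement enclosing all points
        # volume of {u : (u-mu)^T (fC)^{-1} (u-mu) <= 1} is V_D * sqrt(det(fC))
        V_D = math.pi ** (D / 2) / math.gamma(D / 2 + 1)
        vol = V_D * math.sqrt(np.linalg.det(f * C))
        if vol < target_volume:     # enforce V(E_k) >= V(S_k), cf. Eq. 21
            f *= (target_volume / vol) ** (2.0 / D)
            vol = target_volume
        return mu, f * C, vol       # ellipsoid (u-mu)^T (fC)^{-1} (u-mu) <= 1

    mu, A, vol = bounding_ellipsoid(np.random.rand(50, 3), target_volume=0.02)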

From Eq. 19, we see that defining

\[ h_k(u) = \frac{V(E_k)\, d(u, S_k)}{V(S_k)} \qquad (22) \]

for a point u ∈ S, and reassigning u ∈ S_k to S_{k'} only if h_{k'}(u) < h_k(u) for all k' ≠ k, is equivalent to minimizing F(S) using the variational formula (Eq. 19). Thus, a weighted Mahalanobis metric can be used in the k-means framework to optimize the functional F(S). In order to find the optimal number of ellipsoids, K, a recursive scheme can be used which starts with K = 2, optimizes this 2-partition using the metric in Eq. 22 and recursively partitions the resulting ellipsoids. For each iteration of this recursion, we employ the optimization scheme set out in Algorithm 1.

Algorithm 1: Minimizing F(S), subject to F(S) ≥ 1, for points S = {u_1, u_2, ..., u_N} uniformly distributed in a volume V(S).

1: calculate the bounding ellipsoid E and its volume V(E)
2: enlarge E so that V(E) = max[V(E), V(S)]
3: partition S into S_1 and S_2, containing n_1 and n_2 points respectively, by applying the k-means clustering algorithm with K = 2
4: calculate E_1, E_2 and their volumes V(E_1) and V(E_2) respectively
5: enlarge E_k (k = 1, 2) so that V(E_k) = max[V(E_k), V(S_k)]
6: for all u ∈ S do
7:   assign u to S_k such that h_k(u) = min[h_1(u), h_2(u)]
8: end for
9: if no point has been reassigned then
10:   go to step 14
11: else
12:   go to step 4
13: end if
14: if V(E_1) + V(E_2) < V(E) or V(E) > 2V(S) then
15:   partition S into S_1 and S_2 and repeat the entire algorithm for each subset S_1 and S_2
16: else
17:   return E as the optimal ellipsoid of the point set S
18: end if

In step 14 of Algorithm 1 we partition the point set S with ellipsoidal volume V(E) into subsets S_1 and S_2 with ellipsoidal volumes V(E_1) and V(E_2) even if V(E_1) + V(E_2) > V(E), provided V(E) > 2V(S). This is required since, as discussed in Lu et al. (2007), the minimizer of F(S) can be over-conservative and the partition should still be performed if the ellipsoidal volume is greater than the true volume by some factor (we use 2).

The above EM algorithm can be quite computationally expensive, especially in higher dimensions, due to the number of eigenvalue and eigenvector evaluations required to calculate ellipsoidal volumes. Fortunately, MULTINEST does not need to perform the full partitioning algorithm at each iteration of the nested sampling process. Once partitioning of the active points and construction of the ellipsoidal bounds has been performed using Algorithm 1, the resulting ellipsoids can then be evolved through scaling at subsequent iterations so that their volumes are max[V(E_k), X_{i+1} n_k/N], where X_{i+1} is the remaining prior volume in the next nested sampling iteration and n_k is the number of points in the subset S_k at the end of the ith iteration. As the MULTINEST algorithm moves to higher likelihood regions, one would expect the ellipsoidal decomposition calculated at some earlier iteration to become less optimal. We therefore perform a full re-partitioning of the active points using Algorithm 1 if F(S) > h; we typically use h = 1.1.

Figure 3. Illustrations of the ellipsoidal decompositions returned by Algorithm 1: the points given as input are overlaid on the resulting ellipsoids. 1000 points were sampled uniformly from: (a) two non-intersecting ellipsoids; and (b) a torus.

The approach outlined above allows maximum flexibility and sampling efficiency by breaking up a posterior mode resembling a Gaussian into relatively few ellipsoids, but a mode possessing a pronounced curving degeneracy into a relatively large number of small 'overlapping' ellipsoids. In Fig. 3 we show the results of applying Algorithm 1 to two different problems in three dimensions: in (a) the iso-likelihood surface consists of two non-overlapping ellipsoids, one of which contains correlations between the parameters; and in (b) the iso-likelihood surface is a torus. In each case, 1000 points generated uniformly inside the iso-likelihood surface were used as the starting set S in Algorithm 1. In case (a), Algorithm 1 correctly partitions the point set into two non-overlapping ellipsoids with F(S) = 1.1, while in case (b) the point set is partitioned into 23 overlapping ellipsoids with F(S) = 1.2.

In our nested sampling application, it is possible that the ellipsoids found by Algorithm 1 might not enclose the entire iso-likelihood contour, even though the sum of their volumes is constrained to exceed the prior volume X. This is because the ellipsoidal approximation to a region in the prior space might not be perfect. It might therefore be desirable to sample from a region with volume greater than the prior volume. This can easily be achieved by using X/e as the desired minimum volume in Algorithm 1, where X is the prior volume and e the desired sampling efficiency (1/e is the enlargement factor). We also note that if the desired sampling efficiency e is set to be greater than unity, then the prior can be under-sampled. Indeed, setting e > 1 can be useful if one is not interested in the evidence values, but wants only to have a general idea of the posterior structure in relatively few likelihood evaluations. We note that, regardless of the value of e, it is always ensured that the ellipsoids E_k enclosing the subsets S_k remain bounding ellipsoids.

5.3 Sampling from overlapping ellipsoids

Once the ellipsoidal bounds have been constructed at some iteration of the nested sampling process, one must then draw a new point uniformly from the union of these ellipsoids, many of which may be overlapping. This is achieved using the method presented in FH08, which is summarised below for completeness.

Suppose that at iteration i of the nested sampling algorithm one has K ellipsoids E_k. One ellipsoid is then chosen with probability p_k equal to its volume fraction,

\[ p_k = V(E_k)/V_{\mathrm{tot}}, \qquad (23) \]

where V_tot = Σ_{k=1}^{K} V(E_k). Samples are then drawn uniformly from the chosen ellipsoid until a sample is found for which the hard constraint L > L_i is satisfied, where L_i is the lowest-likelihood value among all the active points at that iteration. There is, of course, a possibility that the chosen ellipsoid overlaps with one or more other ellipsoids. In order to take account of this possibility, we find the number of ellipsoids, n_e, in which the sample lies and only accept the sample with probability 1/n_e. This provides a consistent sampling procedure in all cases.
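A sketch of this procedure is given below; the (mu, A) representation of each ellipsoid E = {u : (u − mu)^T A^{-1} (u − mu) ≤ 1} and the helper names are our own, and the uniform-in-ball draw uses the standard Gaussian-direction construction.

    import numpy as np

    rng = np.random.default_rng()

    def sample_ellipsoid(mu, A):
        # uniform point in the unit ball, mapped through a Cholesky factor of A
        D = len(mu)
        x = rng.standard_normal(D)
        x *= rng.random() ** (1.0 / D) / np.linalg.norm(x)
        return mu + np.linalg.cholesky(A) @ x

    def n_enclosing(u, ellipsoids):
        # number n_e of ellipsoids containing the point u
        return sum((u - mu) @ np.linalg.solve(A, u - mu) <= 1.0
                   for mu, A in ellipsoids)

    def draw_from_union(ellipsoids):
        vols = np.array([np.sqrt(np.linalg.det(A)) for _, A in ellipsoids])
        p = vols / vols.sum()                  # Eq. 23: p_k = V(E_k)/V_tot
        while True:
            k = rng.choice(len(ellipsoids), p=p)
            u = sample_ellipsoid(*ellipsoids[k])
            if rng.random() < 1.0 / n_enclosing(u, ellipsoids):
                return u                       # accept with probability 1/n_e

    ells = [(np.zeros(2), np.eye(2)), (np.array([1.0, 0.0]), np.eye(2))]
    print(draw_from_union(ells))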

5.4 Decreasing the number of active points

For highly multimodal problems, the nested sampling algorithm would require a large number N of active points to ensure that all the modes are detected; this would consequently result in very slow convergence of the algorithm. In such cases, it would be desirable to decrease the number of active points as the algorithm proceeds to higher likelihood levels, since the number of isolated regions in the iso-likelihood surface is expected to decrease with increasing likelihood. Fortunately, nested sampling does not require the number of active points to remain constant, provided the fraction by which the prior volume is decreased after each iteration is adjusted accordingly. Without knowing anything about the posterior, we can use the largest evidence contribution that can be made by the remaining portion of the posterior at the ith iteration, ∆Z_i = L_max X_i, as the guide in reducing the number of active points, by assuming that the change in ∆Z is locally linear. We thus set the number of active points N_i at the ith iteration to be

\[ N_i = N_{i-1} - N_{\mathrm{min}} \frac{\Delta Z_{i-1} - \Delta Z_i}{\Delta Z_i - \mathrm{tol}}, \qquad (24) \]

subject to the constraint N_min ≤ N_i ≤ N_{i−1}, where N_min is the minimum number of active points allowed and tol is the tolerance on the final evidence used in the stopping criterion.
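As a small illustration of Eq. 24 and its constraint (with made-up numbers), one might write:

    # illustrative values only; not taken from any run in the paper
    def next_num_active(N_prev, N_min, dZ_prev, dZ, tol):
        N = N_prev - N_min * (dZ_prev - dZ) / (dZ - tol)   # Eq. 24
        return int(max(N_min, min(N_prev, N)))             # N_min <= N_i <= N_{i-1}

    print(next_num_active(1000, 50, 2.0e-3, 1.5e-3, 1.0e-4))   # -> 982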

5.5 Parallelization

Even with the enlargement factor e set to unity (see Section 5.2), the typical sampling efficiency obtained for most problems in astrophysics and particle physics is around 20–30 per cent, for two main reasons. First, the ellipsoidal approximation to the iso-likelihood surface at any iteration is not perfect and there may be regions of the parameter space lying inside the union of the ellipsoids but outside the true iso-likelihood surface; samples falling in such regions will be rejected, resulting in a sampling efficiency less than unity. Second, if the number of ellipsoids at any given iteration is greater than one, then they may overlap, resulting in some samples with L > L_i falling inside a region shared by n_e ellipsoids; such points are accepted only with probability 1/n_e, which consequently lowers the sampling efficiency. Since the sampling efficiency is typically less than unity, the MULTINEST algorithm can be usefully (and easily) parallelized by, at each nested sampling iteration, drawing a potential replacement point on each of N_CPU processors, where 1/N_CPU is an estimate of the sampling efficiency.

5.6 Identification of modes

As discussed in FH08, for multimodal posteriors it can prove useful to identify which samples 'belong' to which mode. There is inevitably some arbitrariness in this process, since modes of the posterior necessarily sit on top of some general 'background' in the probability distribution. Moreover, modes lying close to one another in the parameter space may only 'separate out' at relatively high likelihood levels. Nonetheless, for well-defined, 'isolated' modes, a reasonable estimate of the posterior mass that each contains (and hence the associated 'local' evidence) can be defined, together with the posterior parameter constraints associated with each mode. To perform such a calculation, once the nested sampling algorithm has progressed to a likelihood level such that (at least locally) the 'footprint' of the mode is well defined, one needs to identify at each subsequent iteration those points in the active set belonging to that mode. The partitioning and ellipsoid-construction algorithm described in Section 5.2 provides a much more efficient and reliable method for performing this identification, as compared with the methods employed in FH08.

At the beginning of the nested sampling process, all the active points are assigned to one 'group' G_1. As outlined above, at subsequent iterations, the set of N active points is partitioned into K subsets {S_k} and their corresponding ellipsoids {E_k} constructed. To perform mode identification, at each iteration, one of the subsets S_k is then picked at random: its members become the first members of the 'temporary set' T, and its associated ellipsoid E_k becomes the first member of the set of ellipsoids E. All the other ellipsoids E_{k'} (k' ≠ k) are then checked for intersection with E_k using an exact algorithm proposed by Alfano & Greer (2003). Any ellipsoid found to intersect with E_k is itself added to E, and the members of the corresponding subset S_{k'} are added to T. The set E (and consequently the set T) is then iteratively built up by adding to it any ellipsoid not in E that intersects with any of those already in E, until no more ellipsoids can be added. Once this is completed, if no more ellipsoids remain, then all the points in T are (re)assigned to G_1, a new active point is drawn from the union of the ellipsoids {E_k} (and also assigned to G_1), and the nested sampling process proceeds to its next iteration.

If, however, there remain ellipsoids E_k not belonging to E, then this indicates the presence of (at least) two isolated regions contained within the iso-likelihood surface. In this event, the points in T are (re)assigned to the group G_2 and the remaining active points are (re)assigned to the group G_3. The original group G_1 then contains only the inactive points generated up to this nested sampling iteration and is not modified further. The group G_3 is then analysed in a similar manner to see if it can be split further (into G_3 and G_4), and the process continued until no further splitting is possible. Thus, in this case, one is left with an 'inactive' group G_1 and a collection of 'active' groups G_2, G_3, .... A new active point is then drawn from the union of the ellipsoids {E_k}, assigned to the appropriate active group, and the nested sampling process proceeds to its next iteration.

At subsequent nested sampling iterations, each of the active groups present at that iteration is analysed in the same manner to see if it can be split. If so, then the active points in that group are (re)assigned to new active groups generated as above, and the original group becomes inactive, retaining only those of its points that are inactive. In order to minimize the computational cost, we take advantage of the fact that the ellipsoids created by Algorithm 1 can only get smaller in later iterations. Hence, within each active group, if two ellipsoids are found not to intersect at some iteration, they are not checked for intersection in later iterations. This makes the computational expense involved in separating out the modes negligible.

At the end of the nested sampling process, one thus obtains a set of inactive groups and a set of active groups, which between them partition the full set of (inactive and active) sample points generated. It is worth noting that, as the nested sampling process reaches higher likelihood levels, the number of active points in any particular active group may dwindle to zero, but such a group is still considered active since it remains unsplit at the end of the nested sampling run. Each active group is then promoted to a 'mode', resulting in a set of L (say) such modes M_l.

As a concrete example, consider the two-dimensional illustration shown in Fig. 4, in which the solid circles denote active points at the nested sampling iteration i = i_2, and the open circles are the inactive points at this stage. In this illustration, the first group G_1 remains unsplit until iteration i = i_1 of the nested sampling process, at which stage it is split into G_2, G_3 and G_4. The group G_3 then remains unsplit until iteration i = i_2, when it is split into G_5, G_6 and G_7. The group G_4 remains unsplit until iteration i = i_2, when it is split into G_8 and G_9. The group G_2 remains unsplit at iteration i = i_2, but the number of active points it contains has fallen to zero, since it is a low-lying region of the likelihood function. Thus, at the iteration i = i_2, the inactive groups are G_1, G_3 and G_4, and the active groups are G_2, G_5, G_6, G_7, G_8 and G_9. If (say) all of the latter collection of groups were to remain active until the end of the nested sampling process, each would then be promoted to a mode according to M_1 = G_2, M_2 = G_5, M_3 = G_6, ..., M_6 = G_9.

5.7 Evaluating ‘local’ evidences

The reliable identification of isolated modes M_l allows one to evaluate the local evidence associated with each mode much more efficiently and robustly than the methods presented in FH08. Suppose the lth mode M_l contains the points u_j (j = 1, ..., n_l). In the simplest approach, the local evidence of this mode is given by

\[ Z_l = \sum_{j=1}^{n_l} L_j w_j, \qquad (25) \]

where (as in Eq. 8) w_j = X_M/N for each active point in M_l, and for each inactive point w_j = (X_{i−1} − X_{i+1})/2, in which i is the nested sampling iteration at which the inactive point was discarded. In a similar manner, the posterior inferences resulting from the lth mode are obtained by weighting each point in M_l by p_j = L_j w_j / Z_l.

Figure 4. Cartoon illustrating the assignment of points to groups; see text for details. The iso-likelihood contours L = L_{i_1} and L = L_{i_2} are shown as the dashed lines and dotted lines respectively. The solid circles denote active points at the nested sampling iteration i = i_2, and the open circles are the inactive points at this stage.

As outlined in FH08, however, there remain some problems with this approach for modes that are sufficiently close to one another in the parameter space that they are only identified as isolated regions once the algorithm has proceeded to likelihood values somewhat larger than the value at which the modes actually separate. The 'local' evidence of each mode will then be underestimated by Eq. 25. In such cases, this problem can be overcome by also making use of the points contained in the inactive groups at the end of the nested sampling process, as follows.

For each mode M_l, the expression Eq. 25 for the local evidence is replaced by

\[ Z_l = \sum_{j=1}^{n_l} L_j w_j + \sum_{g} L_g w_g \alpha_g^{(l)}, \qquad (26) \]

where the additional summation over g includes all the points in the inactive groups, the weight w_g = (X_{i−1} − X_{i+1})/2, where i is the nested sampling iteration in which the gth point was discarded, and the additional factors α_g^{(l)} are calculated as set out below. Similarly, posterior inferences from the lth mode are obtained by weighting each point in M_l by p_j = L_j w_j / Z_l and each point in the inactive groups by p_g = L_g w_g α_g^{(l)} / Z_l.

The factors α_g^{(l)} can be determined in a number of ways. The most straightforward approach is essentially to reverse the process illustrated in Fig. 4, as follows. Each mode M_l is simply an active group G that has been renamed. Moreover, one can identify the inactive group G′ that split to form G at the nested sampling iteration i. All points in the inactive group G′ are then assigned the factor

\[ \alpha_g^{(l)} = \frac{n_G^{(A)}(i)}{n_{G'}^{(A)}(i)}, \qquad (27) \]

where n_G^{(A)}(i) is the number of active points in G at nested sampling iteration i, and similarly for n_{G′}^{(A)}(i). Now, the group G′ may itself have been formed when an inactive group G″ split at an earlier nested sampling iteration i′ < i, in which case all the points in G″ are assigned the factor

\[ \alpha_g^{(l)} = \frac{n_G^{(A)}(i)}{n_{G'}^{(A)}(i)}\, \frac{n_{G'}^{(A)}(i')}{n_{G''}^{(A)}(i')}. \qquad (28) \]

The process is continued until the recursion terminates. Finally, all points in inactive groups not already assigned have α_g^{(l)} = 0.

As a concrete example, consider M_2 = G_5 in Fig. 4. In this case, the factors assigned to the members of all the inactive groups G_1, G_3 and G_4 are

\[ \alpha_g^{(2)} = \begin{cases} \dfrac{n_{G_5}^{(A)}(i_2)}{n_{G_3}^{(A)}(i_2)} & \text{for } g \in G_3, \\[1.5ex] \dfrac{n_{G_5}^{(A)}(i_2)}{n_{G_3}^{(A)}(i_2)}\, \dfrac{n_{G_3}^{(A)}(i_1)}{n_{G_1}^{(A)}(i_1)} & \text{for } g \in G_1, \\[1.5ex] 0 & \text{for } g \in G_4. \end{cases} \qquad (29) \]

It is easy to check that the general prescription (Eqs. 27 and 28) ensures that

\[ \sum_{l=1}^{L} Z_l = Z, \qquad (30) \]

i.e. the sum of the local evidences for each mode is equal to the global evidence. An alternative method for setting the factors α_g^{(l)}, for which Eq. 30 again holds, is to use a mixture model to analyse the full set of points (active and inactive) produced, as outlined in Appendix A.
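The recursion behind Eqs. 27-28 can be sketched as follows; parent, split_iter and nA are our own illustrative data structures, with the demo values chosen to mirror the Fig. 4 example of Eq. 29 (the point counts themselves are made up).

    def alpha(mode_group, g, parent, split_iter, nA):
        # walk up the split ancestry of the mode's group; accumulate one
        # ratio of active-point counts per split (Eqs. 27-28)
        a, child = 1.0, mode_group
        while child in parent:
            anc, i = parent[child], split_iter[child]
            a *= nA[child][i] / nA[anc][i]
            if anc == g:
                return a
            child = anc
        return 0.0                                # g not an ancestor: alpha = 0

    parent = {'G5': 'G3', 'G3': 'G1'}             # G5 split from G3, G3 from G1
    split_iter = {'G5': 'i2', 'G3': 'i1'}
    nA = {'G5': {'i2': 10}, 'G3': {'i2': 30, 'i1': 40}, 'G1': {'i1': 100}}
    print(alpha('G5', 'G3', parent, split_iter, nA))   # 10/30, cf. Eq. 29
    print(alpha('G5', 'G1', parent, split_iter, nA))   # (10/30)*(40/100)
    print(alpha('G5', 'G4', parent, split_iter, nA))   # 0.0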

6 APPLICATIONS

In this section we apply the MULTINEST algorithm described above to two toy problems to demonstrate that it indeed calculates the Bayesian evidence and makes posterior inferences accurately and efficiently. These toy examples are chosen to have features that resemble those that can occur in real inference problems in astro- and particle physics.

6.1 Toy model 1: egg-box likelihood

We first demonstrate the application of MULTINEST to a highly multimodal two-dimensional problem, for which the likelihood resembles an egg-box. The un-normalized likelihood is defined as

\[ \mathcal{L}(\theta_1, \theta_2) = \exp\left[ 2 + \cos\left(\frac{\theta_1}{2}\right) \cos\left(\frac{\theta_2}{2}\right) \right]^5, \qquad (31) \]

and we assume a uniform prior U(0, 10π) for both θ_1 and θ_2. A plot of the log-likelihood is shown in Fig. 5, and the prior ranges are chosen such that some of the modes are truncated. Hence, although only two-dimensional, this toy example is a particularly challenging problem, not only for mode identification but also for evaluating the local evidence of each mode accurately. Indeed, even obtaining posterior samples efficiently from such a distribution can present a challenge for standard Metropolis–Hastings MCMC samplers. We note that distributions of this sort can occur in astronomical object detection applications (see FH08).

Figure 5. Toy model 1: (a) two-dimensional plot of the likelihood function defined in Eq. 31; (b) dots denoting the points with the lowest likelihood at successive iterations of the MULTINEST algorithm. Different colours denote points assigned to different isolated modes as the algorithm progresses.

Owing to the highly multimodal nature of this problem, we use 2000 active points. The results obtained with MULTINEST are illustrated in Fig. 5, in which the dots show the points with the lowest likelihood at successive iterations of the nested sampling process, and different colours indicate points assigned to different isolated modes as the algorithm progresses. MULTINEST required ∼30,000 likelihood evaluations and evaluated the global log-evidence value to be 235.86 ± 0.06, which compares favourably with the log-evidence value of 235.88 obtained through numerical integration on a fine grid. The local log-evidence values of each mode, calculated through numerical integration on a fine grid (denoted as 'true log(Z)') and using MULTINEST, are listed in Table 1. We see that there is good agreement between the two estimates.
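For reference, a brute-force check of the grid value quoted above can be sketched as follows; the grid resolution n is an arbitrary choice of ours.

    import numpy as np

    # egg-box log-likelihood of Eq. 31
    def log_likelihood(t1, t2):
        return (2.0 + np.cos(t1 / 2.0) * np.cos(t2 / 2.0)) ** 5

    n = 2000
    t = np.linspace(0.0, 10.0 * np.pi, n)
    T1, T2 = np.meshgrid(t, t)
    logL = log_likelihood(T1, T2)
    # Z = mean of L over the uniform prior U(0, 10*pi)^2, computed stably
    m = logL.max()
    logZ = m + np.log(np.exp(logL - m).mean())
    print("log Z ~", logZ)   # ~ 235.88, cf. the fine-grid value quoted above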

6.2 Toy model 2: Gaussian shells likelihood

We now illustrate the capabilities of MULTINEST in sampling from a posterior containing multiple modes with pronounced (curving) degeneracies, and extend our analysis to parameter spaces of high dimension.

Our toy problem here is the same one used in FH08 and Allanach & Lester (2007). The likelihood function in this model is defined as

\[ \mathcal{L}(\theta) = \mathrm{circ}(\theta; c_1, r_1, w_1) + \mathrm{circ}(\theta; c_2, r_2, w_2), \qquad (32) \]

where

\[ \mathrm{circ}(\theta; c, r, w) = \frac{1}{\sqrt{2\pi w^2}} \exp\left[ -\frac{(|\theta - c| - r)^2}{2 w^2} \right]. \qquad (33) \]

Table 1. The local log-evidence values of each mode for toy model 1, described in Section 6.1, calculated through numerical integration on a fine grid (the 'true log(Z)') and using the MULTINEST algorithm.

Mode   true local log(Z)   MULTINEST local log(Z)
  1        233.33              233.20 ± 0.08
  2        233.33              233.10 ± 0.06
  3        233.33              233.48 ± 0.05
  4        233.33              233.43 ± 0.05
  5        233.33              233.65 ± 0.05
  6        233.33              233.27 ± 0.05
  7        233.33              233.14 ± 0.06
  8        233.33              233.81 ± 0.04
  9        232.64              232.65 ± 0.12
 10        232.64              232.43 ± 0.16
 11        232.64              232.11 ± 0.14
 12        232.64              232.44 ± 0.11
 13        232.64              232.68 ± 0.11
 14        232.64              232.84 ± 0.09
 15        232.64              233.02 ± 0.09
 16        232.64              231.65 ± 0.29
 17        231.94              231.49 ± 0.27
 18        231.94              230.46 ± 0.36

In two dimensions, this toy distribution represents two well-separated rings, centred on the points c_1 and c_2 respectively, each of radius r and with a Gaussian radial profile of width w (see Fig. 6). With a sufficiently small w value, this distribution is representative of the likelihood functions one might encounter in analysing forthcoming particle physics experiments in the context of beyond-the-Standard-Model paradigms; in such models the bulk of the probability lies within thin sheets or hypersurfaces through the full parameter space.
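A direct transcription of Eqs. 32-33 might read as follows; the dimension and centre coordinates below are illustrative choices consistent with the setup described in the next paragraph (centres 7 units apart, r = 2, w = 0.1), and the test point is our own.

    import numpy as np

    def circ(theta, c, r=2.0, w=0.1):
        # Gaussian shell of radius r and width w around centre c, Eq. 33
        return (np.exp(-(np.linalg.norm(theta - c) - r) ** 2 / (2.0 * w ** 2))
                / np.sqrt(2.0 * np.pi * w ** 2))

    def likelihood(theta, c1, c2):
        return circ(theta, c1) + circ(theta, c2)     # Eq. 32

    D = 2
    c1, c2 = np.zeros(D), np.zeros(D)
    c1[0], c2[0] = -3.5, 3.5                         # centres 7 units apart
    print(likelihood(np.array([3.5, 2.0]), c1, c2))  # on the rim of ring 2: ~4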

We investigate the above distribution up to a30-dimensionalparameter spaceΘ with MULTI NEST. In all cases, the centres ofthe two rings are separated by7 units in the parameter space, andwe takew1 = w2 = 0.1 and r1 = r2 = 2. We maker1 andr2 equal, since in higher dimensions any slight difference betweenthese two values would result in a vast difference between the vol-umes occupied by the rings and consequently the ring with thesmallerr value would occupy a vanishingly small fraction of thetotal probability volume, making its detection almost impossible.It should also be noted that settingw = 0.1 means the rings havean extremely narrow Gaussian profile and hence they represent an‘optimally difficult’ problem for our ellipsoidal nested sampling al-gorithm, since many tiny ellipsoids are required to obtain asuffi-ciently accurate representation of the iso-likelihood surfaces. For

c© 2008 RAS, MNRAS000, 1–14

Page 10: M N : an efficient and robust Bayesian inference tool for … · 2 F. Feroz, M.P. Hobson & M. Bridges than thermodynamic integration. To achieve an improved accep-tance ratio and

10 F. Feroz, M.P. Hobson & M. Bridges

-5-2.5

02.5

5x

-5

-2.50

2.55

y

0

1

2

3

4

5

L

-5-2.5

02.5

5x

(a)

-6-4

-2 0

2 4

6 -6-4

-2 0

2 4

6 0

1

2

3

4

5

Likelihood

Common PointsPeak 1Peak 2

x y

Likelihood

(b)

Figure 6. Toy model 2: (a) two-dimensional plot of the likelihood function defined in Eqs. (32) and (33); (b) dots denoting the points with the lowest likelihoodat successive iterations of the MULTI NESTalgorithm. Different colours denote points assigned to different isolated modes as the algorithm progresses.

       Analytical                 MULTINEST
D      log(Z)    local log(Z)    log(Z)          local log(Z_1)   local log(Z_2)
2      −1.75     −2.44           −1.72 ± 0.05    −2.28 ± 0.08     −2.56 ± 0.08
5      −5.67     −6.36           −5.75 ± 0.08    −6.34 ± 0.10     −6.57 ± 0.11
10     −14.59    −15.28          −14.69 ± 0.12   −15.41 ± 0.15    −15.36 ± 0.15
20     −36.09    −36.78          −35.93 ± 0.19   −37.13 ± 0.23    −36.28 ± 0.22
30     −60.13    −60.82          −59.94 ± 0.24   −60.70 ± 0.30    −60.57 ± 0.32

Table 2. The true and estimated global and local log(Z) for toy model 2, as a function of the dimension D of the parameter space, using MULTINEST.

For the two-dimensional case, with the parameters described above, the likelihood is shown in Fig. 6.

In analysing this problem using the methods presented in FH08, we showed that the sampling efficiency dropped significantly with increasing dimensionality: in 10 dimensions the efficiency was less than 2 per cent, and almost 600,000 likelihood evaluations were required to estimate the evidence to the required accuracy. Using 1000 active points in MULTINEST, we list the evaluated and analytical evidence values in Table 2. The total number of likelihood evaluations and the sampling efficiencies are listed in Table 3. For comparison, we also list the number of likelihood evaluations and the sampling efficiencies for the ellipsoidal nested sampling method proposed in FH08. One sees that MULTINEST requires an order of magnitude fewer likelihood evaluations than the method of FH08. In fact, the relative computational cost of MULTINEST is even less than this comparison suggests, since it no longer performs an eigen-analysis at each iteration, as discussed in Section 5.2. Indeed, for this toy problem, the EM partitioning algorithm discussed in Section 5.2 was called on average only once per 1000 iterations of the MULTINEST algorithm.

7 COSMOLOGICAL PARAMETER ESTIMATION AND MODEL SELECTION

Likelihood functions resembling those used in our toy models do occur in real inference problems in astro- and particle physics, such as object detection in astronomy (see e.g. Hobson & McLachlan 2003; FH08) and analysis of beyond-the-Standard-Model theories in particle physics phenomenology (see e.g. Feroz et al. 2008).

       from FH08                  MULTINEST
D      Nlike        Efficiency    Nlike      Efficiency
2      27,658       15.98%        7,370      70.77%
5      69,094       9.57%         17,967     51.02%
10     579,208      1.82%         52,901     34.28%
20     43,093,230   0.05%         255,092    15.49%
30     –            –             753,789    8.39%

Table 3. The number of likelihood evaluations and sampling efficiency for the ellipsoidal nested sampling algorithm of FH08 and for MULTINEST, when applied to toy model 2 as a function of the dimension D of the parameter space.

Nonetheless, not all likelihood functions are as challenging, and it is important to demonstrate that MULTINEST is more efficient (and certainly no less so) than standard Metropolis–Hastings MCMC sampling even in more straightforward inference problems.

An important area of inference in astrophysics is that of cosmological parameter estimation and model selection, for which the likelihood functions are usually quite benign, often resembling a single, broad multivariate Gaussian in the allowed parameter space. Therefore, in this section, we apply the MULTINEST algorithm to analyse two related extensions of the standard cosmological model: non-flat spatial curvature and a varying equation of state of dark energy.

The complete set of cosmological parameters and the ranges of the uniform priors assumed for them are given in Table 4, where the parameters have their usual meanings.

c© 2008 RAS, MNRAS000, 1–14

Page 11: M N : an efficient and robust Bayesian inference tool for … · 2 F. Feroz, M.P. Hobson & M. Bridges than thermodynamic integration. To achieve an improved accep-tance ratio and

MULTI NEST: efficient and robust Bayesian inference11

0.018 ≤ Ωb h² ≤ 0.032
0.04 ≤ Ωcdm h² ≤ 0.16
0.98 ≤ Θ ≤ 1.1
0.01 ≤ τ ≤ 0.5
−0.1 ≤ Ωk ≤ 0.1
−1.5 ≤ w ≤ −0.5
0.8 ≤ ns ≤ 1.2
2.6 ≤ log[10¹⁰ As] ≤ 4.2

Table 4. Cosmological parameters and uniform prior ranges for the vanilla ΛCDM model, plus spatial curvature Ωk and the dark energy equation-of-state parameter w.
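Since nested sampling draws points uniformly within the unit hypercube, these separable uniform priors correspond to a simple linear rescaling of each coordinate. A minimal Python sketch follows; the function name and parameter ordering are ours, not the CosmoMC interface.

    # Uniform prior ranges from Table 4, ordered as
    # (Omega_b h^2, Omega_cdm h^2, Theta, tau, Omega_k, w, n_s, log[1e10 A_s]).
    PRIOR_RANGES = [
        (0.018, 0.032), (0.04, 0.16), (0.98, 1.1), (0.01, 0.5),
        (-0.1, 0.1), (-1.5, -0.5), (0.8, 1.2), (2.6, 4.2),
    ]

    def prior_transform(u):
        """Map a point u in the unit hypercube [0,1]^8 to physical
        parameters via the separable uniform priors above."""
        return [lo + (hi - lo) * ui for ui, (lo, hi) in zip(u, PRIOR_RANGES)]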

With Ωk = 0 and w = −1, this model represents the 'vanilla' ΛCDM cosmology. In addition, mirroring the recent analysis of the WMAP 5-year (WMAP5) data (Dunkley et al. 2008), a Sunyaev–Zel'dovich amplitude is introduced, with a uniform prior in the range [0, 2]. We have chosen three basic sets of data: CMB observations alone; CMB plus the Hubble Space Telescope (HST) constraint on H0 (Freedman et al. 2001); and CMB plus large-scale structure (LSS) constraints on the matter power spectrum derived from the luminous red galaxy (LRG) subset of the Sloan Digital Sky Survey (SDSS; Tegmark et al. 2006; Tegmark et al. 2004) and the two-degree field survey (2dF; Cole et al. 2005). In addition, for the dark energy analysis we include distance measures from supernovae Ia data (Kowalski et al. 2008). The CMB data comprise WMAP5 observations (Hinshaw et al. 2008) plus higher resolution datasets from the Arcminute Cosmology Bolometer Array (ACBAR; Reichardt et al. 2008), the Cosmic Background Imager (CBI; Readhead et al. 2004; Sievers J. L. et al. 2007; CBI Supplementary Data 2006) and Balloon Observations of Millimetric Extragalactic Radiation and Geophysics (BOOMERANG; Piacentini et al. 2006; Jones et al. 2006; Montroy et al. 2006).

Observations of the first CMB acoustic peak cannot in themselves constrain the spatial curvature Ωk. Curvature could be constrained using the angular scale of the first acoustic peak coupled with a knowledge of the distance to the last scattering surface, but the latter is a function of the entire expansion history of the universe, and so there is a significant degeneracy between Ωk and the Hubble parameter H(z). This dependence, often termed the 'geometric degeneracy', can be broken, however, since measurements at different redshifts constrain sufficiently different functions of Ωk and H. Thus, the combination of CMB data with measurements of the acoustic peak structure in the matter power spectrum derived from large-scale structure surveys, such as the LRG subset of SDSS, can place much tighter constraints on curvature than either dataset alone (see e.g. Eisenstein et al. 2005; Tegmark et al. 2006; Komatsu et al. 2008).

Inflation generically predicts a flat universe (Guth 1981). The tightest current constraint suggests Ωk ≈ 10⁻², whereas inflation lasting over 60 e-folds would produce flatness at the level of 10⁻⁵ (Komatsu et al. 2008). Thus, at present, the data is not capable of refuting or confirming such an inflationary picture. From a Bayesian point of view, however, one can still assess whether the data currently prefers the inclusion of such a physical parameter.

Figure 7. Breaking of the 'geometric' degeneracy between Ωk and H0 in CMB data (blue) via the addition of HST (green) and large-scale structure data (red).

Dataset \ Model   vanilla + Ωk   vanilla + Ωk + w
CMB alone         −0.29 ± 0.27   –
CMB + HST         −1.56 ± 0.27   –
ALL               −2.92 ± 0.27   −1.29 ± 0.27

Table 5. Differences of log-evidences for both models and the three datasets described in the text. Negative (positive) values represent lower (higher) preference for the parameterisation change.

The algorithmic parameters of MULTINEST were appropriately chosen given our a priori knowledge of the uni-modal form of typical cosmological posteriors, the dimensionality of the problem and some empirical testing. The number of active points was set to N = 400, and a sampling efficiency e of 0.3 means that MPI parallelisation across 4 CPUs is optimal (with a further 4 OpenMP threads per MPI CPU used by CAMB's multithreading facility). This represents a relatively modest computational investment. All of the inferences obtained in this section required between 40,000 and 50,000 likelihood evaluations.
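These costs are consistent with the standard nested-sampling scaling (Skilling 2004), in which roughly N × H iterations are needed to compress the prior by the information H, and each iteration costs on average 1/e likelihood evaluations. A back-of-the-envelope check, with H inferred rather than measured:

    # Rough nested-sampling cost model: n_like ~ N * H / e, so H ~ n_like * e / N.
    N, e = 400, 0.3
    for n_like in (40_000, 50_000):
        print(f"{n_like} evaluations -> implied information H ~ {n_like * e / N:.0f} nats")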

7.1 Results: spatial curvature

Fig. 7 illustrates the progressively tighter constraints placed on Ωk and H0 produced by combining CMB data with other cosmological probes of large-scale structure. The geometric degeneracy is clearly unbroken with CMB data alone, but the independent constraint on H0 from HST is seen to tighten the constraint somewhat. Including LSS data, specifically the LRG data, markedly reduces the uncertainty on the curvature, so that at 1-σ we can limit the curvature to the range −0.043 ≤ Ωk ≤ 0.004. The asymmetry of this constraint leaves a larger negative tail of Ωk, resulting in a mean value that is only slightly closed. Even with these most stringent parameter constraints available, however, we see no statistically significant deviation from a spatially flat universe. The Bayesian evidence, in penalising excessive model complexity, should tell us whether relaxing the constraint on flatness is preferred by the data. Our results (Table 5) very clearly rule out the necessity for such an addition for anything other than CMB data alone. This implies that the inclusion of spatial curvature is an unnecessary complication in cosmological model building, given the currently available data.

7.2 Results: varying equation of state of dark energy

The properties of the largest universal density component are still largely unknown, yet such a component seems crucial for the universe to be spatially flat.


Figure 8. Joint constraints on universal geometry Ωk and the equation of state of dark energy w using WMAP5 + HST + LSS + supernovae data.

It has thus been argued by some (Wright 2006; Tegmark et al. 2006) that it is inappropriate to assume spatial flatness when attempting to vary the properties of the dark energy component beyond those of a simple cosmological constant. Here we allow for the variation of the dark energy equation-of-state parameter w. We will therefore proceed by placing joint constraints on both w and Ωk, as performed in Komatsu et al. (2008). Once again we encounter a serious degeneracy, this time between Ωk and w. For the geometric degeneracy, combining cosmological observations of the universe at different redshifts was sufficient to break the dependence, but when dark energy is dynamical we must use at least a third, independent dataset. In this analysis, we have therefore included distance measures from type Ia supernovae observations (Kowalski et al. 2008). Using this combination of data produces impressively tight constraints on both w and Ωk (Fig. 8); indeed, the resulting constraints on the spatial curvature are tighter than those obtained in the previous section, for which w was held constant at w = −1. This is primarily due to the near orthogonality of the constraints provided by supernovae and the CMB. Once again we find little evidence to support a departure from the basic vanilla cosmology (see Table 5). To within the estimated uncertainty, the Bayesian evidence is at least one log unit greater for a flat universe with dark energy in the form of a cosmological constant.

7.3 Comparison of MULTINEST and MCMC 'quick look' parameter constraints

The above results are in good agreement with those found by Komatsu et al. (2008) using more traditional MCMC methods, and indicate that MULTINEST has produced reliable inferences, both in terms of the estimated evidence values and the derived posterior parameter constraints. It is often the case, however, that cosmologists wish only to obtain a broad idea of the posterior distribution of parameters, using short MCMC chains and hence relatively few likelihood evaluations. In this section, we show that the MULTINEST algorithm can also perform this task by setting the number of active points, N, to a smaller value.

In order to illustrate this functionality, we analyse the WMAP5 CMB dataset in the context of the vanilla ΛCDM cosmology using both MULTINEST and the publicly available CosmoMC package (Lewis & Bridle 2002), which uses an MCMC sampler based on a tailored version of the Metropolis–Hastings method.

Figure 9. 1-D marginalized posteriors for Ωb h², Ωcdm h², Θ, τ, ns and log(10¹⁰ As) in the flat-ΛCDM cosmology, obtained with: CosmoMC using 48,000 likelihood evaluations (solid black); CosmoMC using 3,200 likelihood evaluations (dotted pink); and MULTINEST using 3,100 likelihood evaluations (dashed blue).

We imposed uniform priors on all the parameters. The prior ranges for Ωb h², Ωcdm h², Θ and τ are listed in Table 4. In addition, ns was allowed to vary between 0.8 and 1.2, log(10¹⁰ As) between 2.6 and 4.2, and the Sunyaev–Zel'dovich amplitude between 0 and 2.

To facilitate later comparisons, we first obtained an accurate determination of the posterior distribution using the traditional method by running CosmoMC to produce 4 long MCMC chains. The Gelman–Rubin statistic R returned by CosmoMC indicated that the chains had converged to within R ≈ 1.1 after about 6,000 steps per chain, resulting in ∼ 24,000 likelihood evaluations. To be certain of determining the 'true' posterior to high accuracy, we then ran the MCMC chains for a further 6,000 samples per chain, resulting in a total of 48,000 likelihood evaluations, at which point the convergence statistic was R ≈ 1.05.
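For reference, the Gelman–Rubin diagnostic compares within-chain and between-chain variances. The sketch below follows the standard single-parameter definition; CosmoMC's exact variant may differ in detail.

    import numpy as np

    def gelman_rubin(chains):
        """Potential scale reduction factor R for one parameter.
        chains: array of shape (m, n), i.e. m chains of n post-burn-in samples."""
        m, n = chains.shape
        means = chains.mean(axis=1)
        W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
        B_over_n = means.var(ddof=1)            # variance of the chain means
        var_hat = (n - 1) / n * W + B_over_n    # pooled posterior-variance estimate
        return np.sqrt(var_hat / W)             # convergence when R -> 1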

As stated above, however, one often wishes to obtain only a 'quick-and-dirty' estimate of the posterior in the first stages of the data analysis. As might be typical of such analyses, we ran CosmoMC again using 4 MCMC chains, but with only 800 steps per chain, resulting in a total of 3,200 likelihood evaluations. As a comparison, we also ran the MULTINEST algorithm using only 50 active points and with the sampling efficiency e set to unity; this required a total of 3,100 likelihood evaluations. In Fig. 9, we plot the 1-D marginalized posteriors derived from the two analyses, together with the results of the longer CosmoMC analysis described above. It is clear from these plots that both the MULTINEST and MCMC 'quick-look' results compare well with the 'true' posterior obtained from the more costly rigorous analysis.


8 DISCUSSION AND CONCLUSIONS

We have described a highly efficient Bayesian inference tool, called MULTINEST, which we have now made freely available for academic purposes as a plugin that is easily incorporated into the CosmoMC software. On challenging toy models that resemble real inference problems in astro- and particle physics, we have demonstrated that MULTINEST produces reliable estimates of the evidence, and its uncertainty, and accurate posterior inferences from distributions with multiple modes and pronounced curving degeneracies in high dimensions. We have also demonstrated in cosmological inference problems that MULTINEST produces accurate parameter constraints on similar time-scales to standard MCMC methods and, with negligible extra computational effort, also yields very accurate Bayesian evidences for model selection. As a cosmological application, we have considered two extensions of the basic vanilla ΛCDM cosmology: non-zero spatial curvature and a varying equation of state of dark energy. Both extensions are determined to be unnecessary for the modelling of existing data via the evidence criterion, confirming that, even with the advent of five years of WMAP observations, the data is still well described by a ΛCDM cosmology.

As a guide for potential users, we conclude by noting that the MULTINEST algorithm is controlled by two main parameters: (i) the number of active points N; and (ii) the maximum efficiency e (see Section 5.2). These values can be chosen quite easily, as outlined below. First, N should be large enough that, in the initial sampling from the full prior space, there is a high probability that at least one point lies in the 'basin of attraction' of each mode of the posterior (a simple quantification of this requirement is sketched below). In later iterations, active points will then tend to populate these modes. It should be remembered, of course, that N must always exceed the dimensionality D of the parameter space. Also, in order to calculate the evidence accurately, N should be sufficiently large that all the regions of the parameter space are sampled adequately. For parameter estimation only, one can use far fewer active points. For cosmological data analysis, we found 400 and 50 active points to be adequate for evidence evaluation and parameter estimation respectively. The parameter e controls the sampling volume at each iteration, which is equal to the sum of the volumes of the ellipsoids enclosing the active point set. For parameter estimation problems, e should be set to 1 to obtain maximum efficiency without undersampling, or to a lower value if one wants to get a general idea of the posterior very quickly. For evidence evaluation in cosmology, we found that setting e ∼ 0.3 ensures an accurate evidence value.
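The requirement on N can be made concrete: if a mode's basin of attraction occupies a fraction f of the prior volume, the probability that at least one of the N initial prior draws lands in it is 1 − (1 − f)^N. A small illustration, with hypothetical values of f:

    # P(at least one of N uniform prior draws hits a basin of prior fraction f).
    def p_hit(f, N):
        return 1.0 - (1.0 - f) ** N

    for f in (1e-2, 1e-3):          # hypothetical basin fractions
        for N in (50, 400, 2000):
            print(f"f = {f:g}, N = {N}: P = {p_hit(f, N):.3f}")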

ACKNOWLEDGEMENTS

This work was carried out largely on the Cambridge High Performance Computing Cluster Darwin, and the authors would like to thank Dr. Stuart Rankin for computational assistance. FF is supported by studentships from the Cambridge Commonwealth Trust, Isaac Newton and the Pakistan Higher Education Commission Fellowships. MB is supported by STFC.

REFERENCES

Allanach B. C., Lester C. G., 2007, Sampling using a 'bank' of clues
Bassett B. A., Corasaniti P. S., Kunz M., 2004, ApJ, 617, L1
Beltrán M., García-Bellido J., Lesgourgues J., Liddle A. R., Slosar A., 2005, Phys. Rev. D, 71, 063532
Bridges M., Lasenby A. N., Hobson M. P., 2006, MNRAS, 369, 1123
CBI Supplementary Data, 2006
Cole S., et al., 2005, MNRAS, 362, 505
Dempster A. P., Laird N. M., Rubin D. B., 1977, Journal of the Royal Statistical Society, Series B (Methodological), 39, 1
Dunkley J., Komatsu E., Nolta M. R., Spergel D. N., Larson D., Hinshaw G., Page L., Bennett C. L., Gold B., Jarosik N., Weiland J. L., Halpern M., Hill R. S., Kogut A., Limon M., Meyer S. S., Tucker G. S., Wollack E., Wright E. L., 2008, ArXiv e-prints, 803
Eisenstein D. J., Zehavi I., Hogg D. W., Scoccimarro R., Blanton M. R., Nichol R. C., Scranton R., Seo H.-J., Tegmark M., Zheng Z., Anderson S. F., 2005, ApJ, 633, 560
Feroz F., Allanach B. C., Hobson M., AbdusSalam S. S., Trotta R., Weber A. M., 2008, Bayesian Selection of sign(mu) within mSUGRA in Global Fits Including WMAP5 Results
Feroz F., Hobson M. P., 2008, MNRAS, 384, 449
Freedman W. L., Madore B. F., Gibson B. K., Ferrarese L., Kelson D. D., Sakai S., Mould J. R., Kennicutt Jr. R. C., Ford H. C., Graham J. A., Huchra J. P., Hughes S. M. G., Illingworth G. D., Macri L. M., Stetson P. B., 2001, ApJ, 553, 47
Guth A. H., 1981, Phys. Rev. D, 23, 347
Hinshaw G., Weiland J. L., Hill R. S., Odegard N., Larson D., Bennett C. L., Dunkley J., Gold B., 2008, ArXiv e-prints, 803
Hobson M. P., Bridle S. L., Lahav O., 2002, MNRAS, 335, 377
Hobson M. P., McLachlan C., 2003, MNRAS, 338, 765
Jones W. C., et al., 2006, ApJ, 647, 823
Komatsu E., Dunkley J., Nolta M. R., Bennett C. L., Gold B., Hinshaw G., Jarosik N., Larson D., Limon M., Page L., Spergel D. N., Halpern M., Hill R. S., Kogut A., Meyer S. S., Tucker G. S., Weiland J. L., Wollack E., Wright E. L., 2008, ArXiv e-prints, 803
Kowalski M., et al., 2008
Lewis A., Bridle S., 2002, Phys. Rev. D, 66, 103511
Liddle A. R., 2007, ArXiv Astrophysics e-prints
Lu L., Choi Y.-K., Wang W., Kim M.-S., 2007, Computer Graphics Forum, 26, 329
MacKay D. J. C., 2003, Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge
Marshall P. J., Hobson M. P., Slosar A., 2003, MNRAS, 346, 489
Montroy T. E., et al., 2006, ApJ, 647, 813
Mukherjee P., Parkinson D., Liddle A. R., 2006, ApJ, 638, L51
Niarchou A., Jaffe A. H., Pogosian L., 2004, Phys. Rev. D, 69, 063515
Ó Ruanaidh J., Fitzgerald W., 1996, Numerical Bayesian Methods Applied to Signal Processing. Springer-Verlag, New York
Piacentini F., et al., 2006, ApJ, 647, 833
Readhead A. C. S., Mason B. S., Contaldi C. R., Pearson T. J., Bond J. R., Myers S. T., Padin S., Sievers J. L., Cartwright J. K., 2004, ApJ, 609, 498
Reichardt C. L., et al., 2008
Shaw J. R., Bridges M., Hobson M. P., 2007, MNRAS, 378, 1365
Sievers J. L., et al., 2007, ApJ, 660, 976
Skilling J., 2004, in Fischer R., Preuss R., Toussaint U. V., eds, American Institute of Physics Conference Series, Nested Sampling. pp 395–405


Slosar A., et al., 2003, MNRAS, 341, L29
Tegmark M., Eisenstein D., Strauss M., Weinberg D., Blanton M., Frieman J., Fukugita M., Gunn J., Hamilton A., Knapp G., Nichol R., Ostriker J., 2006, ArXiv Astrophysics e-prints
Tegmark M., et al., 2004, ApJ, 606, 702
Tegmark M., et al., 2006, Phys. Rev. D, 74, 123507
Trotta R., 2007, MNRAS, 378, 72
Wright E. L., 2006, ArXiv Astrophysics e-prints

APPENDIX A: LOCAL EVIDENCE EVALUATION USING A GAUSSIAN MIXTURE MODEL

As mentioned in Section 5.7, an alternative method for determining the local evidence and posterior constraints associated with each identified mode M_l is to analyse the full set of samples using a mixture model. In what follows, we assume that the modes of the 'implied' posterior P(u) in the unit hypercube space are each well described by a multivariate Gaussian, leading to a Gaussian mixture model, but the model for the mode shape can easily be changed to another distribution.

Let us assume that, at the end of the nested sampling process, the full set of N (inactive and active) points obtained is u_1, u_2, ..., u_N, and that L modes M_1, M_2, ..., M_L have been identified. The basic goal of the method is to assign an extra factor \alpha_j^{(l)} to every point (j = 1 to N), so that the estimate (Eq. 26) for the local evidence associated with the mode M_l is replaced by

Z_l = \sum_{j=1}^{N} L_j w_j \alpha_j^{(l)},   (A1)

where the weights w_j for the inactive and active points are the same as those used in Eq. 26. Similarly, posterior inferences from the mode M_l are obtained by weighting every point (j = 1 to N) by p_j = L_j w_j \alpha_j^{(l)} / Z_l.

The factors \alpha_j^{(l)} are determined by modelling each mode M_l as a multivariate Gaussian with 'normalised amplitude' A_l, mean \mu_l and covariance matrix C_l, such that

P(u) \propto \sum_{l=1}^{L} A_l \, G(u; \mu_l, C_l),   (A2)

where the unit-volume Gaussian G(u; \mu_l, C_l) is given by

G(u; \mu_l, C_l) = \frac{1}{(2\pi)^{D/2} |C_l|^{1/2}} \exp\left[ -\tfrac{1}{2} (u - \mu_l)^{\rm T} C_l^{-1} (u - \mu_l) \right],   (A3)

and the values of the parameters \Theta \equiv \{A_l, \mu_l, C_l\} are to be determined from the sample points u_j. Since the scaling in Eq. A2 is arbitrary, it is convenient to set \sum_{l=1}^{L} A_l = 1. For a given set of parameter values \Theta, our required factors are

\alpha_j^{(l)}(\Theta) = \Pr(M_l | u_j, \Theta)
= \frac{\Pr(u_j | M_l, \Theta) \Pr(M_l | \Theta)}{\sum_{l'=1}^{L} \Pr(u_j | M_{l'}, \Theta) \Pr(M_{l'} | \Theta)}
= \frac{A_l \, G(u_j; \mu_l, C_l)}{\sum_{l'=1}^{L} A_{l'} \, G(u_j; \mu_{l'}, C_{l'})}.   (A4)

Our approach is to optimise the parameters \Theta, and hence determine the factors \alpha_j^{(l)}(\Theta), using an expectation-maximization (EM) algorithm. The algorithm is initialised by setting

\alpha_j^{(l)} = 1 if u_j \in M_l, and 0 otherwise,   (A5)

and calculating the initial values of each Z_l using Eq. A1. In the M-step of the algorithm one then obtains the maximum-likelihood estimates \hat{\Theta} of the parameters. These are easily derived (see e.g. Dempster et al. 1977) to be

\hat{A}_l = \frac{n_l}{n},   (A6)

\hat{\mu}_l = \frac{1}{n_l} \sum_{j=1}^{N} \alpha_j^{(l)} \bar{u}_j,   (A7)

\hat{C}_l = \frac{1}{n_l} \sum_{j=1}^{N} \alpha_j^{(l)} (\bar{u}_j - \hat{\mu}_l)(\bar{u}_j - \hat{\mu}_l)^{\rm T},   (A8)

where n_l = \sum_{j=1}^{N} \alpha_j^{(l)}, n = \sum_{l=1}^{L} n_l and \bar{u}_j = u_j L_j w_j / Z_l are the locally posterior-weighted sample points. In the subsequent E-step of the algorithm one then updates the \alpha_j^{(l)} values using Eq. A4 and updates Z_l using Eq. A1. We further impose the constraint that \alpha_j^{(l)} = 0 if u_j \notin M_l and its likelihood L_j is greater than the lowest likelihood of the points in M_l. The EM algorithm is then iterated to convergence.
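The scheme of Eqs. (A1) and (A4)-(A8) can be summarised in a short Python sketch. The variable names are ours; the initial mode labels and nested-sampling weights are assumed to come from the main MULTINEST run, and the renormalisation after imposing the likelihood-ordering constraint is one possible handling, not prescribed above.

    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    def em_local_evidences(u, L, w, member, n_iter=100):
        """EM estimation of the local evidences Z_l (Eqs. A1, A4-A8).
        u: (N, D) points in the unit hypercube; L: (N,) likelihood values;
        w: (N,) nested-sampling weights; member: (N,) initial mode labels."""
        N, D = u.shape
        K = member.max() + 1
        Lw = L * w
        alpha = (member[:, None] == np.arange(K)).astype(float)      # Eq. A5
        Lmin = np.array([L[member == k].min() for k in range(K)])
        # Points outside M_k whose likelihood exceeds the lowest in M_k:
        forbidden = (member[:, None] != np.arange(K)) & (L[:, None] > Lmin)
        for _ in range(n_iter):
            Z = alpha.T @ Lw                          # local evidences (Eq. A1)
            dens = np.empty((N, K))
            n_k = alpha.sum(axis=0)
            for k in range(K):
                ubar = u * (Lw / Z[k])[:, None]       # posterior-weighted points
                mu = alpha[:, k] @ ubar / n_k[k]                      # Eq. A7
                diff = ubar - mu
                C = (alpha[:, k, None] * diff).T @ diff / n_k[k]      # Eq. A8
                A = n_k[k] / n_k.sum()                                # Eq. A6
                dens[:, k] = A * mvn.pdf(u, mean=mu, cov=C, allow_singular=True)
            alpha = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)  # Eq. A4
            alpha[forbidden] = 0.0                    # likelihood-ordering constraint
            alpha /= np.maximum(alpha.sum(axis=1, keepdims=True), 1e-300)
        return alpha.T @ Lw                           # final local evidence estimates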
