
Distilling Importance Sampling

Dennis Prangle1

1School of Mathematics, University of Bristol, United Kingdom

Abstract

Many complicated Bayesian posteriors are difficult to approximate by either sampling or optimisation methods. Therefore we propose a novel approach combining features of both. We use a flexible parameterised family of densities, such as a normalising flow. Given a density from this family approximating the posterior, we use importance sampling to produce a weighted sample from a more accurate posterior approximation. This sample is then used in optimisation to update the parameters of the approximate density, which we view as distilling the importance sampling results. We iterate these steps and gradually improve the quality of the posterior approximation. We illustrate our method in two challenging examples: a queueing model and a stochastic differential equation model.

1 INTRODUCTION

Bayesian inference has had great success in recent decades [Green et al., 2015], but remains challenging in models with a complex posterior dependence structure, e.g. those involving latent variables. Monte Carlo methods are one state-of-the-art approach. These produce samples from the posterior distribution. However in many settings it remains challenging to design good mechanisms to propose plausible samples, despite many advances (e.g. Cappé et al., 2004, Cornuet et al., 2012, Graham and Storkey, 2017, Whitaker et al., 2017).

We focus on one simple Monte Carlo method: importance sampling (IS). This weights draws from a proposal distribution so that the weighted sample can be viewed as representing a target distribution, such as the posterior. IS can be used

1This work was completed while the author was employed at Newcastle University, UK.

in almost any setting, including in the presence of strong posterior dependence or discrete random variables. However it only achieves a representative weighted sample at a feasible cost if the proposal is a reasonable approximation to the target distribution.

An alternative to Monte Carlo is to use optimisation to find the best approximation to the posterior from a family of distributions. Typically this is done in the framework of variational inference (VI). VI is computationally efficient but has the drawback that it often produces poor approximations to the posterior distribution, e.g. through over-concentration [Turner et al., 2008, Yao et al., 2018].

A recent improvement in VI is due to the development of a range of flexible and computationally tractable distributional families using normalising flows [Dinh et al., 2016, Papamakarios et al., 2019a]. These transform a simple base random distribution to a complex distribution, using a sequence of learnable transformations.

We propose an alternative to variational inference for training the parameters of an approximate posterior density, typically a normalising flow, which we call the distilled density. This alternates two steps. The first is importance sampling, with the current distilled density as the proposal. The target distribution is an approximate posterior, based on tempering, which is an improvement on the proposal. The second step is to use the resulting weighted sample to train the distilled density further. Following Li et al. [2017], we refer to this as distilling the importance sampling results. By iteratively distilling IS results, we can target increasingly accurate posterior approximations, i.e. reduce the tempering.

Each step of our distilled importance sampling (DIS) method aims to reduce the Kullback-Leibler (KL) divergence of the distilled density from the current tempered posterior. This is known as the inclusive KL divergence, as minimising it tends to produce a density which is over-dispersed compared to the tempered posterior. Such a distribution is well suited to be an IS proposal distribution. Variational inference, on the other hand, uses the exclusive


KL divergence, which tends to produce over-concentrated distributions that cannot easily be corrected using importance sampling.

In the remainder of the paper, Section 2 presents background material. Sections 3 and 4 describe our method. Section 5 illustrates it on a simple two dimensional inference task. Sections 6 and 7 give more challenging examples: queueing and time series models. Section 8 concludes with a discussion, including limitations and opportunities for future improvements. Code for the examples can be found at https://github.com/dennisprangle/DistillingImportanceSampling. All examples were run on a 6-core desktop PC.

1.1 RELATED WORK AND NOVELTY

Several recent papers [Müller et al., 2019, Cotter et al., 2020, Duan, 2019] learn a density defined via a transformation to use as an importance sampling proposal. Since the first version of our paper, more work has also looked at optimising the inclusive KL divergence. Dhaka et al. [2021] show that a naive implementation performs poorly. Naesseth et al. [2020] use conditional importance sampling to give convergence guarantees. Jerfel et al. [2021] propose a boosting approach in which a Gaussian mixture density is improved by sequentially adding mixture components. In comparison to the above, a novelty of our work is using a sequential approach based on tempering, and an application to likelihood-free inference.

A related approach is to distill Markov chain Monte Carlo output, but this turns out to be more difficult than for IS. One reason is that optimising the KL divergence typically requires unbiased estimates of it or related quantities (e.g. its gradient), but MCMC only provides unbiased estimates asymptotically. Li et al. [2017] and Parno and Marzouk [2018] proceed by using biased estimates, while Ruiz and Titsias [2019] introduce an alternative, more tractable divergence. However IS, as we shall see, can produce unbiased estimates of the required KL gradient.

Approximate Bayesian computation (ABC) methods [Marin et al., 2012, Del Moral et al., 2012] involve simulating datasets under various parameters to find those which produce close matches to the observations. However, close matches are rare unless the observations are low dimensional. Hence ABC typically uses dimension reduction of the observations through summary statistics, which reduces inference accuracy. Our method can instead learn a joint proposal distribution for the parameters and all the random variables used in simulating a dataset (see Section 6 for details). Hence it can control the simulation process to frequently output data similar to the full observations.

Conditional density estimation methods (see e.g. Le et al., 2017, Papamakarios et al., 2019b, Grazian and Fan, 2019) fit

a joint distribution to parameters and data from simulations. Then one can condition on the observed data to approximate its posterior distribution. These methods also sometimes require dimension reduction, and can perform poorly when the observed data is unlike the simulations. Our approach avoids these difficulties by directly finding parameters which can reproduce the full observations.

More broadly, DIS has connections to several inference methods. Concentrating on its IS component, it is closely related to adaptive importance sampling [Cornuet et al., 2012] and sequential Monte Carlo (SMC) [Del Moral et al., 2006]. Concentrating on training an approximate density, it can be seen as a version of the cross-entropy method [Rubinstein, 1999], an estimation of distribution algorithm [Larrañaga and Lozano, 2002], or reweighted wake-sleep [Bornschein and Bengio, 2014].

2 BACKGROUND

2.1 BAYESIAN FRAMEWORK

We observe data y, assumed to be the output of a probability model p(y|θ) under some parameters θ. Given a prior density π(θ) we aim to find the corresponding posterior p(θ|y).

Many probability models involve latent variables x, so that p(y|θ) = ∫ p(y|θ, x) p(x|θ) dx. To avoid computing this integral we'll attempt to infer the joint posterior p(θ, x|y), and marginalise to get p(θ|y). For convenience we introduce ξ = (θ, x) to represent the collection of parameters and latent variables. For models without latent variables ξ = θ.

We now wish to infer p(ξ|y). Typically we can only evaluate an unnormalised version,

p̄(ξ|y) = p(y|θ, x) p(x|θ) π(θ).

Then p(ξ|y) = p̄(ξ|y)/Z where Z = ∫ p̄(ξ|y) dξ is an intractable normalising constant.

2.2 TEMPERING

We'll use a tempered target density p_ε(ξ) such that p_0 is the posterior and ε > 0 gives an approximation. As for the posterior, we can often only evaluate an unnormalised version p̄_ε(ξ). Then p_ε(ξ) = p̄_ε(ξ)/Z_ε where Z_ε = ∫ p̄_ε(ξ) dξ. We use various tempering schemes later in the paper. See the supplement (Section H) for a summary.

2.3 IMPORTANCE SAMPLING

Let p(ξ) be a target density, such as a tempered posterior, where p(ξ) = p̄(ξ)/Z and only p̄(ξ) can be evaluated. Importance sampling (IS) is a Monte Carlo method to estimate expectations of the form

I = E_{ξ∼p}[h(ξ)],

for some function h. Here we give an overview of relevant aspects. For full details see e.g. Robert and Casella [2013] and Rubinstein and Kroese [2016].

IS requires a proposal density λ(ξ) which can easily be sampled from, and must satisfy

supp(p) ⊆ supp(λ),    (1)

where supp denotes support. Then

I = E_{ξ∼λ}[ (p(ξ)/λ(ξ)) h(ξ) ].    (2)

So an unbiased Monte Carlo estimate of I is

I_1 = (1/(NZ)) Σ_{i=1}^N w_i h(ξ^(i)),    (3)

where ξ^(1), ξ^(2), ..., ξ^(N) are independent samples from λ, and w_i = p̄(ξ^(i))/λ(ξ^(i)) is an importance weight.

Typically Z is estimated as (1/N) Σ_{i=1}^N w_i, giving

I_2 = Σ_{i=1}^N w_i h(ξ^(i)) / Σ_{i=1}^N w_i,    (4)

a biased, but consistent, estimate of I. Equivalently

I_2 = Σ_{i=1}^N s_i h(ξ^(i)),

for normalised importance weights s_i = w_i / Σ_{i=1}^N w_i.

A drawback of IS is that it can produce estimates with large, or infinite, variance if λ is a poor approximation to p. Hence diagnostics for the quality of the results are useful. A popular diagnostic is the effective sample size (ESS),

N_ESS = (Σ_{i=1}^N w_i)² / Σ_{i=1}^N w_i².    (5)

For most functions h, Var(I_2) roughly equals the variance of an idealised Monte Carlo estimate based on N_ESS independent samples from p(ξ) [Liu, 1996].
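To make estimates (4) and (5) concrete, the following is a minimal numpy/scipy sketch on a hypothetical one-dimensional problem; the target and Gaussian proposal are illustrative choices, not from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical unnormalised target, here proportional to a N(1, 0.5^2) density.
def log_p_bar(x):
    return -0.5 * ((x - 1.0) / 0.5) ** 2

proposal = stats.norm(0.0, 1.0)
N = 10_000
x = proposal.rvs(N, random_state=rng)
logw = log_p_bar(x) - proposal.logpdf(x)    # log importance weights
w = np.exp(logw - logw.max())               # stabilised weights (the scale cancels below)

I2 = np.sum(w * x) / np.sum(w)              # self-normalised estimate (4) of E[x]
ess = w.sum() ** 2 / np.sum(w ** 2)         # effective sample size (5)
print(f"I2 = {I2:.3f} (true mean 1.0), ESS = {ess:.0f} of {N}")
```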

2.4 NORMALISING FLOWS

A normalising flow represents a random vector ξ with a complicated distribution as an invertible transformation of a random vector z with a simple base distribution, typically N(0, I).

Recent research has developed flexible learnable families of normalising flows. See Papamakarios et al. [2019a] for a review. We focus on real NVP ("non-volume preserving") flows [Dinh et al., 2016] (other flows with similar properties, such as Durkan et al., 2019, could also be used). These compose several transformations of z. One type is a coupling layer, which transforms input vector u to output vector v, both of dimension D, by

v_{1:d} = u_{1:d},   v_{d+1:D} = μ + exp(σ) ⊙ u_{d+1:D},
μ = f_μ(u_{1:d}),   σ = f_σ(u_{1:d}),

where ⊙ and exp are elementwise multiplication and exponentiation. Here the first d elements of u are copied unchanged. We typically take d = ⌊D/2⌋. The other elements are scaled by the vector exp(σ) then shifted by the vector μ, where μ and σ are functions of u_{1:d}. This transformation is invertible, and allows quick computation of the density of v from that of u, as the Jacobian determinant is simply ∏_{i=d+1}^D exp(σ_i). Coupling layers are alternated with permutations so that different variables are copied in successive coupling layers. Real NVP typically uses order-reversing or random permutations.

The functions f_μ and f_σ are neural network outputs. Each coupling layer has its own neural network. The collection of all weights and biases, φ, can be trained for particular tasks, e.g. density estimation of images. Permutations are fixed in advance and not learnable.

Real NVP produces a flexible family of densities q(ξ; φ) with two useful properties for this paper. Firstly, samples can be drawn rapidly. Secondly, it is reasonably fast to compute ∇_φ log q(ξ; φ) for any ξ.
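As an illustration, here is a minimal numpy sketch of one coupling layer and its log Jacobian determinant; the affine maps standing in for f_μ and f_σ are simplified placeholders for the neural networks described above.

```python
import numpy as np

def coupling_layer(u, W_mu, b_mu, W_sig, b_sig):
    """One real NVP coupling layer: copy u_{1:d}, shift and scale u_{d+1:D}.

    The affine maps (W, b) are simplified stand-ins for f_mu and f_sigma.
    """
    D = u.shape[-1]
    d = D // 2
    mu = u[..., :d] @ W_mu + b_mu                 # f_mu(u_{1:d})
    sigma = u[..., :d] @ W_sig + b_sig            # f_sigma(u_{1:d})
    v = np.concatenate([u[..., :d], mu + np.exp(sigma) * u[..., d:]], axis=-1)
    log_det_jac = sigma.sum(axis=-1)              # log of prod_{i>d} exp(sigma_i)
    return v, log_det_jac

# Toy usage with D = 4, so d = 2 inputs drive 2 shift/scale outputs.
rng = np.random.default_rng(1)
u = rng.normal(size=(5, 4))
W = 0.01 * rng.normal(size=(2, 2))
v, ldj = coupling_layer(u, W, np.zeros(2), W, np.zeros(2))
```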

3 OBJECTIVE AND GRADIENT

Given an approximate family of densities q(ξ; φ), such as normalising flows, this section introduces objective functions to judge how well q approximates a tempered target p_ε. It then discusses how to estimate the gradient of this objective with respect to φ. Section 4 presents our algorithm using these gradients to update φ while also reducing ε.

3.1 OBJECTIVE

Given p_ε, we aim to minimise the inclusive Kullback-Leibler (KL) divergence,

KL(p_ε || q) = E_{ξ∼p_ε}[log p_ε(ξ) − log q(ξ; φ)].

This is equivalent to maximising a scaled negative cross-entropy, which we use as our objective,

J_ε(φ) = Z_ε E_{ξ∼p_ε}[log q(ξ; φ)].

(We scale by Z_ε to avoid this intractable constant appearing in our gradient estimates below.)

The inclusive KL divergence penalises φ values which produce small q(ξ; φ) when p_ε(ξ) is large. Hence the optimal φ tends to make q(ξ; φ) non-negligible where p_ε(ξ) is non-negligible, known as the zero-avoiding property. This is an intuitively attractive feature for importance sampling proposal distributions. Indeed recent theoretical work shows that, under some conditions, the sample size required in importance sampling scales exponentially with the inclusive KL divergence [Chatterjee and Diaconis, 2018].

Our work could be adapted to use the χ² divergence [Dieng et al., 2017, Müller et al., 2019], which also has theoretical links to the sample size needed by IS [Agapiou et al., 2017].

3.2 BASIC GRADIENT ESTIMATE

Assuming standard regularity conditions [Mohamed et al., 2020, Section 4.3.1], the objective has gradient

∇J_ε(φ) = Z_ε E_{ξ∼p_ε}[∇ log q(ξ; φ)].

Using (2), an importance sampling form is

∇J_ε(φ) = E_{ξ∼λ}[ (p̄_ε(ξ)/λ(ξ)) ∇ log q(ξ; φ) ],

where λ(ξ) is a proposal density. We will take λ(ξ) = q(ξ; φ*) for some φ*. (In our main algorithm, φ* will be the output of a previous optimisation step.) Note we use choices of q with full support, so (1) is satisfied.

An unbiased Monte Carlo gradient estimate is

g_1 = (1/N) Σ_{i=1}^N w_i ∇ log q(ξ^(i); φ),    (6)

where ξ^(i) ∼ λ(ξ) are independent samples and w_i = p̄_ε(ξ^(i))/λ(ξ^(i)) are importance sampling weights.

We calculate ∇ log q(ξ^(i); φ) by backpropagation. Note we backpropagate with respect to φ, but not φ*, which is treated as a constant. Hence the ξ^(i) values are themselves constant.
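A PyTorch-style sketch of computing g_1 by backpropagation follows; here `q` is a placeholder for the distilled density (with a `sample` method and a `log_prob` differentiable in φ) and `log_p_bar_eps` for the unnormalised tempered target. Neither name is from the paper's code.

```python
import torch

def gradient_estimate_g1(q, log_p_bar_eps, N):
    """Backpropagated gradient estimate g1 of eq (6).

    q and log_p_bar_eps are placeholders for the distilled density and
    the unnormalised tempered target.
    """
    with torch.no_grad():                    # proposal phi* treated as constant
        xi = q.sample((N,))
        logw = log_p_bar_eps(xi) - q.log_prob(xi)
        w = torch.exp(logw - logw.max())     # rescales g1 by a constant, harmless for Adam
    loss = -(w * q.log_prob(xi)).mean()      # minus (1/N) sum_i w_i log q(xi_i; phi)
    loss.backward()                          # gradients of -g1 accumulate in phi
```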

3.3 IMPROVED GRADIENT ESTIMATES

Here we discuss reducing the variance and cost of g_1.

Clipping Weights  To avoid high variance gradient estimates we apply truncated importance sampling [Ionides, 2008]. This clips the weights at a maximum value ω, producing truncated importance weights w̃_i = min(w_i, ω). The resulting gradient estimate is

g_2 = (1/N) Σ_{i=1}^N w̃_i ∇ log q(ξ^(i); φ).

This typically has lower variance than g_1, but has some bias. See the supplement (Section A) for more details and discussion, including how we choose ω automatically.

Algorithm 1 Distilled importance sampling (DIS)

1: Input: importance sampling size N, target ESS M, batch size n, initial tempering parameter ε_0
2: Initialise φ_0 (followed by pretraining if necessary).
3: for t = 1, 2, ... do
4:   Sample (ξ^(i))_{1≤i≤N} from q(ξ; φ_{t−1}).
5:   Select a new tempering parameter ε_t ≤ ε_{t−1} (see Section 4.2 for details).
6:   Calculate weights w_i = p̄_{ε_t}(ξ^(i))/q(ξ^(i); φ_{t−1}) and truncate to w̃_i s (see the supplement, Section A, for details).
7:   for j = 1, 2, ..., B do
8:     Resample (ξ̃^(j))_{1≤j≤n} from (ξ^(i))_{1≤i≤N} using the normalised w̃_i s as probabilities, with replacement.
9:     Calculate gradient estimate g_3 using (7).
10:    Update φ using stochastic gradient optimisation. We use the Adam algorithm.
11:  end for
12: end for

A more sophisticated alternative method is to Pareto smooth the largest importance weights [Yao et al., 2018]. We did not use this as it is more expensive to implement than clipping, but it would be interesting to investigate in future work.

Resampling  Calculating g_2 requires evaluating ∇ log q(ξ^(i); φ) for 1 ≤ i ≤ N. Each of these has a computational cost, but often many receive small weights and so contribute little to g_2.

To reduce this cost we can discard many low weight samples, by using importance resampling [Smith and Gelfand, 1992] as follows. We sample n ≪ N times, with replacement, from the ξ^(i)s with probabilities s_j = w̃_j/S where S = Σ_{i=1}^N w̃_i. Denote the resulting samples as ξ̃^(j). The following is then an unbiased estimate of g_2,

g_3 = (S/(nN)) Σ_{j=1}^n ∇ log q(ξ̃^(j); φ).    (7)

4 ALGORITHM

Our approach to inference is as follows. Given a current distilled density approximating the posterior, q(ξ; φ_t), we use this as λ(ξ) in (7) to produce a gradient estimate. (In the notation of Section 3.2, we take φ* = φ_t.) We then update φ_t to φ_{t+1} by stochastic gradient ascent, aiming to increase J_ε. As t increases, we also reduce the tempering in our target density p_ε(ξ) by reducing ε, slowly enough to avoid high variance gradient estimates.

Algorithm 1 gives our implementation of this approach. The remainder of the section discusses various details of it; see the supplement (Section H) for a summary of tuning choices.

We fix the training batch size to n = 100, and the number of batches B to M/n. So steps 4–6 perform importance sampling with a target ESS of M. Then steps 7–11 use M of its outputs (sampled with replacement) for training. The idea is to avoid overfitting by too much reuse of the same training data.

In our experiments later, we run the algorithm until ε = 0 is reached or for a fixed runtime. Alternatively, the termination decision could be based on approximate inference diagnostics (e.g. Yao et al., 2018, Huggins et al., 2020).

We investigate other tuning choices for the algorithm in Section 6. For now note that N must be reasonably large, since our method to update ε_t relies on making an accurate ESS estimate, as detailed in Section 4.2.
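The following Python sketch summarises the structure of Algorithm 1; `q`, `optimiser`, `log_p_bar`, `select_epsilon` and `truncate_weights` are placeholders for the components described in Section 3, Sections 4.1–4.2 and the supplement, and a careful implementation would work with log weights throughout.

```python
import torch

def dis(q, optimiser, log_p_bar, select_epsilon, truncate_weights,
        N=50_000, M=2_500, n=100, eps0=10.0, iterations=1_000):
    """Schematic of Algorithm 1. All function arguments are placeholders
    for the model- and flow-specific components described in the text."""
    eps, B = eps0, M // n                        # B batches of size n per iteration
    for t in range(iterations):
        with torch.no_grad():
            xi = q.sample((N,))                  # step 4
            eps = select_epsilon(xi, q, eps, M)  # step 5 (Section 4.2)
            w = torch.exp(log_p_bar(xi, eps) - q.log_prob(xi))
            w = truncate_weights(w)              # step 6 (supplement, Section A)
        for _ in range(B):                       # steps 7-11
            idx = torch.multinomial(w, n, replacement=True)
            loss = -(w.sum() / (n * N)) * q.log_prob(xi[idx]).sum()
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()                     # Adam update of phi
        if eps <= 0.0:                           # stop once the posterior is targeted
            break
    return q
```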

4.1 INITIALISATION AND PRETRAINING

The initial q should be similar to the initial target p_{ε_0}. Otherwise the first gradient estimates produced by importance sampling are likely to be high variance. This can sometimes be achieved by initialising φ to give q a particular distribution, and designing our tempering scheme to have a similar initial target. See the supplement (Section B) for details, and Sections 5–7 for examples.

Pretraining can also be used to improve the match between the initial q and p_{ε_0} when it is possible to sample from the latter. This iterates the following steps:

1. Sample (ξ^(i))_{1≤i≤n} from p_{ε_0}(ξ).
2. Update φ using gradient (1/n) Σ_{i=1}^n ∇ log q(ξ^(i); φ).

This maximises the negative cross-entropy E_{ξ∼p_{ε_0}}[log q(ξ; φ)]. We use n = 100, and terminate once q(ξ; φ) achieves a reasonable ESS (e.g. half the actual sample size) when targeting p_{ε_0} in importance sampling.

4.2 SELECTING ε_t

We select ε_t using effective sample size, as in Del Moral et al. [2012]. Given (ξ^(i))_{1≤i≤N} sampled from q(ξ; φ_{t−1}), the ESS value for target p_ε(ξ) is

N_ESS(ε) = [Σ_{i=1}^N w(ξ^(i), ε)]² / Σ_{i=1}^N w(ξ^(i), ε)²,  where  w(ξ, ε) = p̄_ε(ξ)/q(ξ; φ_{t−1}).

In step 5 of Algorithm 1 we first check whether N_ESS(ε_{t−1}) < M, a target ESS value. If so we set ε_t = ε_{t−1}. Otherwise we set ε_t to an estimate of the minimal ε such that N_ESS(ε) ≥ M, computed by a bisection algorithm.
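A sketch of this selection step, assuming a placeholder `log_p_bar(xi, eps)` for the unnormalised tempered target and that N_ESS(ε) increases with ε:

```python
import torch

def select_epsilon(xi, q, eps_prev, M, tol=1e-6):
    """Step 5 of Algorithm 1: smallest eps <= eps_prev with ESS >= M.

    log_p_bar(xi, eps) is a placeholder for the unnormalised tempered
    target, defined elsewhere.
    """
    def ess(eps):
        logw = log_p_bar(xi, eps) - q.log_prob(xi)
        w = torch.exp(logw - logw.max())
        return (w.sum() ** 2 / (w ** 2).sum()).item()

    if ess(eps_prev) < M:            # keep eps fixed this iteration
        return eps_prev
    if ess(0.0) >= M:                # can jump straight to the posterior
        return 0.0
    lo, hi = 0.0, eps_prev           # ESS(hi) >= M > ESS(lo): bisect
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if ess(mid) >= M else (mid, hi)
    return hi
```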

5 EXAMPLE: SINUSOIDAL DISTRIBUTION

As a simple illustration, consider θ_1 ∼ U(−π, π), θ_2|θ_1 ∼ N(sin(θ_1), 1/200), giving unnormalised target density

p̄(θ) = exp{−100[θ_2 − sin(θ_1)]²} 1[|θ_1| < π],

where 1 is an indicator function. (Note earlier sections infer ξ = (θ, x), where x are latent variables. This example has no latent variables, so we simply infer θ.)

We use the unnormalised tempered target

p̄_ε(θ) = p_1(θ)^ε p̄(θ)^{1−ε},    (8)

initialising ε at 1 and reducing it to 0 during the algorithm.

As initial distribution we take θ_1 and θ_2 to have independent N(0, σ_0²) distributions. We use σ_0 = 2 to give a reasonable match to the standard deviation of θ_1 under the target. Hence

p_1(θ) = (1/(2πσ_0²)) exp[−(θ_1² + θ_2²)/(2σ_0²)].

We use real NVP for q(θ; φ), with 4 coupling layers, alternated with permutation layers swapping θ_1 and θ_2. Each coupling layer uses a neural network with 3 hidden layers of 10 hidden units each and ELU activation. We initialise q close to a N(0, I) distribution, as in the supplement (Section B), then pretrain so q approximates p_1.

We use Algorithm 1 with N = 4000 training samples and a target ESS of M = 2000. These values give a clear visual illustration: we investigate efficient tuning choices later.

Figure 1 shows our results. The distilled density quickly adapts to meet the importance sampling results, and ε = 0 is reached by 90 iterations. This took roughly 1.5 minutes.

6 EXAMPLE: M/G/1 QUEUE

This section describes an application to likelihood-free inference [Marin et al., 2012, Papamakarios and Murray, 2016]. Here a generative model or simulator is specified, typically by computer code. This maps parameters θ and pseudo-random draws x to data y(θ, x).

Given observations y_0, we aim to infer the joint posterior of θ and x, p(ξ|y_0). Here we approximate this with a black-box choice of q(ξ; φ), i.e. a generic normalising flow. This approach could be applied to any simulator model without modifying its computer code, instead overriding the random number generator to use x values proposed by q, as in Baydin et al. [2019]. For higher dim(ξ) a black-box approach becomes impractical. Section 7 outlines an alternative: using knowledge of the simulator to inform the choice of q.

Figure 1: Sinusoidal example output, with panels at iterations 15, 30, 45, 60, 75 and 90 (ε = 0.956, 0.937, 0.877, 0.812, 0.004, 0.000). Each frame shows 300 samples from the current distilled density. Red crosses show a subsample of 150 targeting p_ε(θ) for the current ε value, selected by importance resampling (see Section 3.3). Blue dots show the remaining points.

6.1 MODEL

We consider an M/G/1 queueing model of a single queue of customers. Times between arrivals at the back of the queue are Exp(θ_1). On reaching the front of the queue, a customer's service time is U(θ_2, θ_3). All these random variables are independent.

We consider a setting where only inter-departure times are observed: times between departures from the queue. This model is a common benchmark for likelihood-free inference, which can often provide fast approximate inference. See Papamakarios et al. [2019b] for a detailed comparison. An advantage of DIS over these methods is that it does not require using low dimensional summary statistics, which lose some information from the data. Near-exact posterior inference is also possible for this model using a sophisticated MCMC scheme [Shestopaloff and Neal, 2014].

We sample a synthetic dataset of m = 20 observations from parameter values θ_1 = 0.1, θ_2 = 4, θ_3 = 5. We attempt to infer these parameters under the prior θ_1 ∼ U(0, 1/3), θ_2 ∼ U(0, 10), θ_3 − θ_2 ∼ U(0, 10) (all independent).

6.2 DIS IMPLEMENTATION

Latent Variables and Simulator  We introduce ϑ and x, vectors of length 3 and 2m, and take ξ as the collection (ϑ, x). Our simulator transforms these inputs to θ(ϑ) and y(ξ) in such a way that when ξ ∼ N(0, I), (θ, y) is a sample from the prior and the model. See the supplement (Section C) for details.

Tempered Target  Our unnormalised tempered target is

p̄_ε(ξ) = π(ξ) exp[−d(ξ)²/(2ε²)],  for ε > 0,    (9)

where π(ξ) is the N(0, I) density and d(ξ) = ||y(ξ) − y_0||_2 is the Euclidean distance between the simulated and observed data. This target is often used in ABC and corresponds to the posterior under the assumption that the data is observed with independent N(0, ε²) errors [Wilkinson, 2013]. We use initial value ε_0 = 10: a scale for errors which is large relative to our observations. Note that DIS cannot reach the exact target here. Instead, like ABC, it produces increasingly good posterior approximations as ε → 0.

Approximate Family  We use a real NVP architecture for q(ξ; φ) with 16 coupling layers alternated with random permutations. Each coupling layer uses a neural network with 3 hidden layers of 100, 100, 50 units and ELU activation. We initialise q close to a N(0, I) distribution, as described in the supplement (Section B): close enough to the initial target that no pretraining was needed.

6.3 RESULTS

Figure 2 (left) compares different choices of N (number of importance samples) and M (target ESS). It shows that ε reduces more quickly for larger N or smaller M. Our choice of N was restricted by memory requirements: the largest value we used was 50,000. Our choice of M was restricted by numerical stability: values below a few hundred often produced numerical overflow errors in the normalising flow.

This tuning study suggests using large N and small M subject to these restrictions. We use this guidance here and in Section 7. In both examples, the cost of a single evaluation of the target density is low. For more expensive models efficient tuning choices may differ.

Figure 2 (right) shows that DIS results with N = 50,000 and M = 2500 are a close match to near-exact MCMC output using the algorithm of Shestopaloff and Neal [2014]. The main difference is that DIS lacks the sharp truncation at θ_2 = 4. The DIS results are far more accurate than an ABC baseline, as detailed in the supplement (Section D).

7 EXAMPLE: LORENZ MODEL

Here we consider a time series application. Noisy and infrequent observations y of a time series path x are available. The black-box approach of Section 6 is not practical here: dim(x) is too large to learn (θ, x) directly. Instead we take an amortised approach, exploiting knowledge of the model structure by just learning q(θ) (marginal parameters) and q(x_{i+1}|x_i, θ) (next step of the time series).

7.1 MODEL

We consider the time series model

x_{i+1} = x_i + α(x_i, θ)Δt + √(10Δt) ε_i,    (10)

α(x_i, θ) = (θ_1(x_{i,2} − x_{i,1}),  θ_2 x_{i,1} − x_{i,2} − x_{i,1} x_{i,3},  x_{i,1} x_{i,2} − θ_3 x_{i,3}),    (11)

for 0 ≤ i ≤ m. Here x_i is a vector (x_{i,1}, x_{i,2}, x_{i,3}) and ε_i ∼ N(0, I) is also a vector of length 3. This is a stochastic differential equation (SDE) version of the Lorenz 63 dynamical system [Lorenz, 1963] from Vrettas et al. [2015], after applying Euler-Maruyama discretisation.

We take m = 100 and assume independent observations y_i ∼ N(x_i, σ²I) at i = 20, 40, 60, 80, 100. We fix x_0 = (−30, 0, 30) and Δt = 0.02. The unknown parameters are θ = (θ_1, θ_2, θ_3, σ). We simulate synthetic data from this discretised model for θ = (10, 28, 8/3, 2) (see Figure 3).

7.2 DIS IMPLEMENTATION

Tempered Target  Our unnormalised tempered target is

p̄_ε(ξ) = π(θ) p(x|θ) p(y|x, θ)^{1−ε}.    (12)

We initialise ε at 1. So the initial target p_1 is the prior and time series model unconditioned by y. As ε is reduced, more agreement with y is enforced.

Approximate Family  Following Ryder et al. [2018], we use an autoregressive approximate family

q(ξ; φ) = q(θ; φ_θ) ∏_{i=0}^{m−1} q(x_{i+1}|x_i, θ; φ_x).

We define q(x_{i+1}|x_i, θ; φ_x) generatively as

x_{i+1} = x_i + [α(x_i, θ) + β]Δt + √(γΔt) ε_i.    (13)

This modifies (10) by adding an extra term β to α and replacing 10 with γ. We take β and γ to be outputs of a neural network with inputs (i, x_i, θ) and weights and biases φ_x. See the supplement (Section E) for more details.

In the limit Δt → 0, interpreted in an appropriate fashion, (13) with γ = 10, and the correct choice of β, gives the conditional distribution p(x|θ, y) (see e.g. Opper, 2019). However, for Δt > 0 it is useful to allow γ to vary [Durham and Gallant, 2002]. Intuitively the idea is that the random variation in x_{i+1} − x_i may need to be smaller when i + 1 is close to an observation time, to ensure that the simulation stays close to the observed value.

Tuning Choices  We use a real NVP architecture for q(θ; φ_θ), made up of 8 coupling layers, alternated with random permutation layers. Each coupling layer uses a neural network with 3 hidden layers of 30 units each and ELU activation. We can thus initialise q(θ; φ_θ) close to the prior π(θ) via the procedure described in the supplement (Section B).

The network for β and γ has three hidden layers of 80 units each and ELU activation. We initialise the neural network to give x dynamics approximating those of the initial target p_1 (the SDE without conditioning on y). See the supplement (Section E) for details.

In the DIS algorithm we take N = 50,000, M = 2500, following the general tuning suggestions in Section 6. See the supplement (Section E) for more details of tuning choices, and methods to avoid numerical instability.

Gradients  Our approximating family parameters are φ = (φ_θ, φ_x). We calculate ∇_φ log q(θ, x; φ) by backpropagation. Simulating from q involves looping through (13) m times, each time using the neural network for β and γ. Backpropagation requires unrolling this. See Li et al. [2020] for recent progress on more efficient gradient calculation for SDE models.

7.3 RESULTS

First we consider an example which is easy for other methods, to verify that DIS gives correct results. Here we perform inference for θ under independent Exp(0.1) priors on each θ_i and σ.

Figure 2: M/G/1 results. Left: The ε value reached by DIS on the M/G/1 example against computation time, for various choices of N (number of importance samples) and M/N (ratio of target effective sample size to N). Right: Marginal posterior histograms for the M/G/1 example. The DIS output shown is for ε = 0.283, which took 180 minutes to reach.

Figure 3: Output for the Lorenz example with σ unknown. Left: marginal posterior histograms for parameters θ from particle MCMC and DIS output. Vertical lines show the true values. Right: paths x (red dot-dash x_{i,1}, blue dotted x_{i,2}, green dashed x_{i,3}) and observations (circles). Solid lines show true paths. DIS plots are based on subsamples (1000 for θ, 30 for x) selected by resampling (see Section 3.3) from the final importance sampling output with ε = 0, targeting the posterior distribution.

Figure 3 shows the results using DIS, which took 49 minutes, and using particle MCMC [Andrieu et al., 2010], as detailed in the supplement (Section G). The two methods produce very similar output, although MCMC is much faster, taking 4 minutes.

Secondly we investigate an example with the observation scale fixed to σ = 0.2, well below the true value of 2, and perform inference for θ_1, θ_2, θ_3. This is a simple illustration of how particle filtering based methods such as PMCMC can become expensive under model misspecification, which occurs often in practice [Akyildiz and Míguez, 2020]. Now DIS takes 443 minutes to target the posterior. Particle MCMC was infeasible with our available memory, but we estimate it would take at least 80,000 minutes. See the supplement (Sections F and G) for details.

These results illustrate that DIS can be implemented in a setting with high dim(x), and it outperforms MCMC methods for some problems. Also, unlike recent work in variational inference for SDEs [Ryder et al., 2018, Opper, 2019, Li et al., 2020], DIS directly targets the posterior distribution, not a variational approximation.

8 CONCLUSION

We've presented distilled importance sampling, and shown its application as an approximate Bayesian inference method for a likelihood-free example, and also as an exact method for a challenging time series model.

There are interesting opportunities to extend DIS. Firstly, it's not required that p̄_ε is differentiable, so discrete parameter inference is plausible. Secondly, we can use random-weight importance sampling [Fearnhead et al., 2010] in DIS, and replace p̄_ε with an unbiased estimate.

Acknowledgements

Thanks to Alex Shestopaloff for providing MCMC code for the M/G/1 model and to Andrew Golightly and Chris Williams for helpful suggestions.

References

Sergios Agapiou, Omiros Papaspiliopoulos, Daniel Sanz-Alonso, and Andrew M. Stuart. Importance sampling: Intrinsic dimension and computational cost. Statistical Science, 32(3):405–431, 2017.

Ömer Deniz Akyildiz and Joaquín Míguez. Nudging the particle filter. Statistics and Computing, 30(2):305–330, 2020.

Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 72(3):269–342, 2010.

Atilim Gunes Baydin, Lei Shao, Wahid Bhimji, Lukas Heinrich, Saeid Naderiparizi, Andreas Munk, Jialin Liu, Bradley Gram-Hansen, Gilles Louppe, Lawrence Meadows, Philip Torr, Victor Lee, Kyle Cranmer, Prabhat, and Frank Wood. Efficient probabilistic inference in the quest for physics beyond the standard model. In Advances in Neural Information Processing Systems, 2019.

Mark A. Beaumont, Jean-Marie Cornuet, Jean-Michel Marin, and Christian P. Robert. Adaptive approximate Bayesian computation. Biometrika, pages 2025–2035, 2009.

Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. arXiv preprint arXiv:1406.2751, 2014.

Olivier Cappé, Arnaud Guillin, Jean-Michel Marin, and Christian P. Robert. Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907–929, 2004.

Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099–1135, 2018.

Jean-Marie Cornuet, Jean-Michel Marin, Antonietta Mira, and Christian P. Robert. Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812, 2012.

Simon L. Cotter, Ioannis G. Kevrekidis, and Paul Russell. Transport map accelerated adaptive importance sampling, and application to inverse problems arising from multiscale stochastic reaction networks. SIAM/ASA Journal on Uncertainty Quantification, 8(4):1383–1413, 2020.

Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B, 68(3):411–436, 2006.

Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing, 22(5):1009–1020, 2012.

Akash Kumar Dhaka, Alejandro Catalina, Manushi Welandawe, Michael Riis Andersen, Jonathan Huggins, and Aki Vehtari. Challenges and opportunities in high-dimensional variational inference. arXiv preprint arXiv:2103.01085, 2021.

Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David Blei. Variational inference via χ upper bound minimization. In Advances in Neural Information Processing Systems, 2017.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, 2016.

Arnaud Doucet, Michael K. Pitt, George Deligiannidis, and Robert Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 102(2):295–313, 2015.

Leo L. Duan. Transport Monte Carlo. arXiv preprint arXiv:1907.10448, 2019.

Garland B. Durham and A. Ronald Gallant. Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes. Journal of Business & Economic Statistics, 20(3):297–338, 2002.

Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems, 2019.

Paul Fearnhead, Omiros Papaspiliopoulos, Gareth O. Roberts, and Andrew Stuart. Random-weight particle filtering of continuous time processes. Journal of the Royal Statistical Society: Series B, 72(4):497–512, 2010.

Matthew M. Graham and Amos J. Storkey. Asymptotically exact inference in differentiable generative models. Electronic Journal of Statistics, 11(2):5105–5164, 2017.

Clara Grazian and Yanan Fan. A review of approximate Bayesian computation methods via density estimation: Inference for simulator-models. Wiley Interdisciplinary Reviews: Computational Statistics, 2019.

Peter J. Green, Krzysztof Łatuszynski, Marcelo Pereyra, and Christian P. Robert. Bayesian computation: a summary of the current state, and samples backwards and forwards. Statistics and Computing, 25(4):835–862, 2015.

Jonathan H. Huggins, Mikołaj Kasprzak, Trevor Campbell, and Tamara Broderick. Practical posterior error bounds from variational objectives. In Artificial Intelligence and Statistics, 2020.

Edward L. Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311, 2008.

Ghassen Jerfel, Serena Wang, Clara Fannjiang, Katherine A. Heller, Yian Ma, and Michael I. Jordan. Variational refinement for importance sampling using the forward Kullback-Leibler divergence. arXiv preprint arXiv:2106.15980, 2021.

Aaron A. King, Dao Nguyen, and Edward L. Ionides. Statistical inference for partially observed Markov processes via the R package pomp. Journal of Statistical Software, 69(12), 2016.

Pedro Larrañaga and Jose A. Lozano. Estimation of distribution algorithms: A new tool for evolutionary computation. Springer, 2002.

Tuan Anh Le, Atilim Gunes Baydin, and Frank Wood. Inference compilation and universal probabilistic programming. In Artificial Intelligence and Statistics, 2017.

Xuechen Li, Ting-Kam Leonard Wong, Ricky T. Q. Chen, and David Duvenaud. Scalable gradients for stochastic differential equations. In Artificial Intelligence and Statistics, 2020.

Yingzhen Li, Richard E. Turner, and Qiang Liu. Approximate inference with amortised MCMC. arXiv preprint arXiv:1702.08343, 2017.

Dennis V. Lindley. The theory of queues with a single server. Mathematical Proceedings of the Cambridge Philosophical Society, 48(2):277–289, 1952.

Jun S. Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113–119, 1996.

Edward N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2):130–141, 1963.

Jean-Michel Marin, Pierre Pudlo, Christian P. Robert, and Robin J. Ryder. Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180, 2012.

Luca Martino, Víctor Elvira, and Francisco Louzada. Effective sample size for importance sampling based on discrepancy measures. Signal Processing, 131:386–401, 2017.

Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132):1–62, 2020.

Thomas Müller, Brian McWilliams, Fabrice Rousselle, Markus Gross, and Jan Novák. Neural importance sampling. ACM Transactions on Graphics (TOG), 38(5):1–19, 2019.

Christian A. Naesseth, Fredrik Lindsten, and David Blei. Markovian score climbing: Variational inference with KL(p||q). arXiv preprint arXiv:2003.10374, 2020.

Manfred Opper. Variational inference for stochastic differential equations. Annalen der Physik, 531(3):1800233, 2019.

George Papamakarios and Iain Murray. Fast ε-free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems, 2016.

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019a.

George Papamakarios, David Sterratt, and Iain Murray. Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. In Artificial Intelligence and Statistics, 2019b.

Matthew D. Parno and Youssef M. Marzouk. Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682, 2018.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 2013.

Dennis Prangle, Richard G. Everitt, and Theodore Kypraios. A rare event approach to high-dimensional approximate Bayesian computation. Statistics and Computing, 28(4):819–834, 2018.

Christian P. Robert and George Casella. Monte Carlo statistical methods. Springer, 2013.

Reuven Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1(2):127–190, 1999.

Reuven Y. Rubinstein and Dirk P. Kroese. Simulation and the Monte Carlo method. John Wiley & Sons, 2016.

Francisco Ruiz and Michalis Titsias. A contrastive divergence for combining variational inference and MCMC. In International Conference on Machine Learning, 2019.

Thomas Ryder, Andrew Golightly, A. Stephen McGough, and Dennis Prangle. Black-box variational inference for stochastic differential equations. In International Conference on Machine Learning, 2018.

Chris Sherlock, Alexandre H. Thiery, Gareth O. Roberts, and Jeffrey S. Rosenthal. On the efficiency of pseudo-marginal random walk Metropolis algorithms. The Annals of Statistics, 43(1):238–275, 2015.

Alexander Y. Shestopaloff and Radford M. Neal. On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548, 2014.

Scott A. Sisson, Yanan Fan, and Mark M. Tanaka. Correction: Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 106(39):16889–16890, 2009.

Adrian F. M. Smith and Alan E. Gelfand. Bayesian statistics without tears: a sampling–resampling perspective. The American Statistician, 46(2):84–88, 1992.

Tina Toni, David Welch, Natalja Strelkowa, Andreas Ipsen, and Michael Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of The Royal Society Interface, 6(31):187–202, 2009.

Richard E. Turner, Pietro Berkes, Maneesh Sahani, and David J. C. MacKay. Counterexamples to variational free energy compactness folk theorems. Technical report, University College London, 2008.

Dootika Vats, James M. Flegal, and Galin L. Jones. Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106(2):321–337, 2019.

Michail D. Vrettas, Manfred Opper, and Dan Cornford. Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations. Physical Review E, 91, 2015.

Gavin A. Whitaker, Andrew Golightly, Richard J. Boys, and Chris Sherlock. Improved bridge constructs for stochastic differential equations. Statistics and Computing, 27(4):885–900, 2017.

Richard D. Wilkinson. Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Statistical Applications in Genetics and Molecular Biology, 12(2):129–141, 2013.

Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman. Yes, but did it work?: Evaluating variational inference. In International Conference on Machine Learning, 2018.

SUPPLEMENTARY MATERIAL

A TRUNCATING IMPORTANCE WEIGHTS

Importance sampling estimates can have large or infinite variance if λ is a poor approximation to p. The practical manifestation of this is a small number of importance weights being very large relative to the others. To reduce the variance, Ionides [2008] introduced truncated importance sampling. This replaces each importance weight w_i with w̃_i = min(w_i, ω) given some threshold ω. The truncated weights are then used in estimates (3) or (4). Truncating in this way typically reduces variance, at the price of increasing bias.

Gradient clipping [Pascanu et al., 2013] is common practice in stochastic gradient optimisation of an objective J(φ) to prevent occasional large gradient estimates from destabilising optimisation. Truncating importance weights in Algorithm 1 has a similar effect of reducing the variability of gradient estimates. A potential drawback of either method is that gradients lose the property of unbiasedness, which is theoretically required for convergence to an optimum of the objective. Pascanu et al. [2013] give heuristic arguments for good optimiser performance when using truncated gradients, and we make the following similar case for using clipped weights. Firstly, even after truncation, the gradient is likely to point in a direction increasing the objective. In our approach, it should still increase the q density at ξ^(i) values with large w_i weights, which is desirable. Secondly, we expect there is a region near the optimum for φ where truncation is extremely rare, and therefore gradient estimates have very low bias once this region is reached. Finally, we also observe good empirical behaviour in our examples, showing that optimisation with truncated weights can find very good importance sampling proposals.

We could use gradient clipping directly in Algorithm 1. However we prefer truncating importance weights, as there is an automated way to choose the threshold ω, as follows. We select ω to reduce the maximum normalised importance weight, max_i w̃_i / Σ_{i=1}^N w̃_i, to a prespecified value: throughout we use 0.1. The required ω can easily be calculated e.g. by bisection. (Occasionally no such ω exists, i.e. if most w_i s are zero. In this case we set ω to the smallest positive w_i.)
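A numpy sketch of this truncation rule, with the bisection and the fallback made explicit; details such as the iteration count are illustrative choices.

```python
import numpy as np

def truncate_weights(w, max_norm_weight=0.1, iters=100):
    """Truncate weights at omega, chosen by bisection so the largest
    normalised truncated weight is close to max_norm_weight."""
    w = np.asarray(w, dtype=float)
    positive = w[w > 0]
    if positive.size == 0:
        return w

    def max_norm(omega):
        wt = np.minimum(w, omega)
        return wt.max() / wt.sum()

    if max_norm(w.max()) <= max_norm_weight:       # no truncation needed
        return w
    if 1.0 / positive.size > max_norm_weight:      # no suitable omega exists:
        return np.minimum(w, positive.min())       # fall back to smallest positive weight
    lo, hi = positive.min(), w.max()               # max_norm increases with omega
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if max_norm(mid) < max_norm_weight else (lo, mid)
    return np.minimum(w, hi)
```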

B APPROXIMATE DENSITY INITIALISATION

As discussed in Section 4.1, we wish to initialise q(ξ; φ) close to its initial target distribution p_{ε_0}. This avoids importance sampling initially producing high variance gradient estimates.

An approach we use in our examples is to select real NVP parameters so that all its coupling layers are approximately the identity transformation. Then the resulting distribution approximates its base distribution N(0, I). We can often design our initial target to equal this, or to be sufficiently similar that only a few iterations of pretraining are needed.

To achieve this we initialise all the neural network weights and biases used in real NVP to be approximately zero. In more detail, we set biases to zero and sample weights from N(0, 0.001²) distributions truncated to two standard deviations from the mean. We also ensure that we use activation functions which map zero to zero. Then the neural network outputs μ and σ are also approximately zero. Thus each coupling layer has shift vector μ ≈ 0 and scale vector exp(σ) ≈ 1.
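A PyTorch sketch of this initialisation, assuming the coupling networks are built from `nn.Linear` layers:

```python
import torch.nn as nn

def init_near_identity(module):
    """Make each real NVP coupling layer approximately the identity map:
    biases zero, weights N(0, 0.001^2) truncated at two standard deviations."""
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, std=0.001, a=-0.002, b=0.002)
        nn.init.zeros_(module.bias)

# Usage: flow.apply(init_near_identity), for a flow whose coupling
# networks use activation functions mapping zero to zero (e.g. ELU).
```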

C M/G/1 MODEL

This section describes the M/G/1 queueing model, in particular how to simulate from it.

Recall that the parameters are θ_1, θ_2, θ_3, with independent prior distributions θ_1 ∼ U(0, 1/3), θ_2 ∼ U(0, 10), θ_3 − θ_2 ∼ U(0, 10). We introduce a reparameterised version of our parameters: ϑ_1, ϑ_2, ϑ_3 with independent N(0, 1) priors. Then we can take θ_1 = Φ(ϑ_1)/3, θ_2 = 10Φ(ϑ_2) and θ_3 = θ_2 + 10Φ(ϑ_3), where Φ is the N(0, 1) cumulative distribution function.

The model involves independent latent variables x_i ∼ N(0, 1) for 1 ≤ i ≤ 2m, where m is the number of observations. These generate

a_i = −(1/θ_1) log Φ(x_i),  (inter-arrival times)
s_i = θ_2 + (θ_3 − θ_2) Φ(x_{i+m}).  (service times)

Inter-departure times can be calculated through the following recursion [Lindley, 1952]

d_i = s_i + max(0, A_i − D_{i−1}),  (inter-departure times)

where A_i = Σ_{j=1}^i a_j (arrival times) and D_i = Σ_{j=1}^i d_j (departure times).

D ABC ANALYSIS OF M/G/1 EXAMPLE

As a baseline comparison, we perform inference for the M/G/1 example using approximate Bayesian computation (ABC).

D.1 ALGORITHM

We implement ABC using Algorithm A below, a version of the popular ABC-PMC approach [Toni et al., 2009, Sisson et al., 2009, Beaumont et al., 2009] modified to target

p̄_ε(θ) = π(θ) ∫ exp[−||y(θ, x) − y_0||²/(2ε²)] π(x) dx,

where ||·|| is the Euclidean norm, and π(x) is the density of the x variables used in the simulator. This is the target used in the main paper for θ in this example, i.e. it is the θ marginal of the joint target (9).

Each iteration of the algorithm produces a weighted sample (θ_i^t, w_i^t)_{1≤i≤N}. This targets p_{ε_t}(θ) in the same way as importance sampling output. Over the course of the algorithm the value of ε_t is reduced, and the number of simulated datasets required for an iteration tends to increase.

Algorithm A ABC-PMC

1: Initialise ε_1 = ∞.
2: for t = 1, 2, ... do
3:   Let i = 0 (number of acceptances).
4:   while i < N do
5:     Sample θ* from density λ_t(θ) (see below).
6:     if π(θ*) > 0 then
7:       Sample x* from π(x) and let y* = y(θ*, x*).
8:       Let d* = ||y* − y_0||.
9:       Let α = exp[−d*²/(2ε_t²)].
10:      With probability α accept: let θ_i^t = θ*, d_i^t = d*, w_i^t = π(θ*)/λ_t(θ*) and increment i by 1.
11:    end if
12:  end while
13:  Calculate ε_{t+1} (see below).
14: end for

Step 5 of the algorithm samples from the density

λ_t(θ) = π(θ) if t = 1, and otherwise
λ_t(θ) = Σ_{i=1}^N w_i^{t−1} K_t(θ|θ_i^{t−1}) / Σ_{i=1}^N w_i^{t−1}.

In the first iteration λ_t(θ) is the prior. After this a kernel density estimate is used, based on the previous weighted sample. We follow Beaumont et al. [2009] in using

K_t(θ|θ′) = ϕ(θ; θ′, 2Σ_{t−1}),

where ϕ(·; m, V) is the density of a normal distribution with mean m and variance V, and Σ_{t−1} is the empirical variance matrix of (θ_i^{t−1})_{1≤i≤N} calculated using weights (w_i^{t−1})_{1≤i≤N}.

To implement step 13 of the algorithm, we select ε_{t+1} so that the acceptance probability of a typical member of the previous weighted sample is reduced by a prespecified factor k. Specifically, we define d̄ as the median of the d_i^t values, and find ε_{t+1} by solving

α(d̄, ε_{t+1}) = k α(d̄, ε_t),  where  α(d, ε) = exp[−d²/(2ε²)].

D.2 RESULTS

We ran Algorithm A on the M/G/1 example with k = 0.7 and N = 500. It terminates (in step 2) once the number of simulated datasets used in the run of DIS reported in the main paper (1.65×10^8) is exceeded. The final ABC-PMC ε value is 7.15, much worse than the final value for DIS, ε = 0.28. Recall that the approximate posterior p_ε(θ) is the exact posterior under the assumption of extra Gaussian observation error with scale ε [Wilkinson, 2013]. At ε = 7.15 this extra error is large compared to the data (whose values range from 4 to 34), hence it seems likely to produce significant approximation error. This is reflected in very wide posterior marginals (see Figure 4). Further, the asymptotic cost per output sample of ABC is O(ε^{−dim(y)}) [Prangle et al., 2018], suggesting that reaching ε = 0.28 using ABC requires a factor of (0.28/7.15)^{−20} ≈ 10^{28} more simulations, which is computationally infeasible.

To reduce the computational cost of ABC, the data are often replaced by low dimensional summary statistics. We investigate using quartile summaries in the M/G/1 example, following Papamakarios and Murray [2016]. This is done by adding a final step to the simulator which converts the raw data to quartiles, and letting y_0 be the observed quartiles. We ran ABC-PMC with quartile statistics using the same tuning details as above. The final ε value is now 0.31. Figure 4 shows that the posterior marginals are much improved. However the maximum service time marginal is still a poor approximation of the near-exact MCMC results, suggesting that the summaries have lost information from the raw data.

To reduce the computational cost of ABC, the data are of-ten replaced by low dimensional summary statistics. Weinvestigate using quartile summaries in the M/G/1 example,following Papamakarios and Murray [2016]. This done byadding a final step to the simulator which converts the rawdata to quartiles, and letting y0 be the observed quartiles.We ran ABC-PMC with quartile statistics using the sametuning details as above. The final ε value is now 0.31. Fig-ure 4 shows that the posterior marginals are much improved.However the maximum service time marginal is still a poorapproximation of the near-exact MCMC results, suggestingthat the summaries have lost information from the raw data.

E LORENZ EXAMPLE TUNING DETAILS

Here we describe further details of our implementation of DIS for the Lorenz example.

E.1 NUMERICAL STABILITY

The Lorenz model (10) can produce large x_i values which cause numerical difficulties in our implementation. To avoid this our code gives zero importance weight to any simulation where max_{i,j} |x_{i,j}| > 1000. Effectively this is a weak extra prior constraint. Our final posteriors in Figures 3 and 5 show no x_i values near this bound, verifying that this constraint has a negligible effect on the final results.

E.2 NEURAL NETWORK INPUT

The inputs to the neural network for β and γ, used in (13), are:

• Parameters θ
• Current time i
• Current state x_i

and also the following derived features:

• Current α(x_i, θ), from (11)
• Time until next observation
• Next observation value

Figure 4: Marginal posterior histograms for the M/G/1 example. Top: MCMC output. Middle: ABC output without summary statistics, for ε = 7.14. Bottom: ABC output with quartile summaries, for ε = 0.32. The ABC histograms are based on samples drawn from ABC-PMC output by importance resampling.

E.3 NEURAL NETWORK INITIALISATION ANDOUTPUT

As discussed in Section 7, we aim to initialise the neuralnetwork for β and γ so that (13) produces x dynamics similarto those of the initial target. This requires β ≈ (0,0,0) andγ ≈ 10. To achieve this we initialise as follows.

We initialise the biases to zero and the weights close to zero, as in Section B. Our final neural network layer has 4 outputs. Under our initialisation these will be close to zero. Three are used as β, providing the required initial values.

The remaining output of the neural network, η, is used to produce the multiplier γ through

γ = (10 / log 2) softplus(η).

We apply the softplus transform since γ must be positive. The constant factor ensures that η = 0 produces γ = 10, as required (since softplus(0) = log 2).
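As a sketch (PyTorch-style; the paper's implementation details may differ):

import math
import torch
import torch.nn.functional as F

def gamma_from_eta(eta: torch.Tensor) -> torch.Tensor:
    # softplus keeps gamma positive; since softplus(0) = log 2, the
    # constant 10 / log 2 gives gamma = 10 when eta = 0, matching the
    # dynamics of the initial target at initialisation.
    return (10.0 / math.log(2.0)) * F.softplus(eta)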

F DIS RESULTS FOR FIXED σ LORENZ EXAMPLE

Figure 5 shows the DIS results in the Lorenz example with σ fixed at 0.2. The results shown are after 1500 iterations, taking 652 minutes.

[Figure 5: Output for Lorenz example with σ = 0.2. Left: parameters θ (theta1, theta2, theta3). Diagonal plots show histograms of marginals, with vertical lines showing the true values. Off-diagonal plots show bivariate scatter plots. Right: paths x (red dot-dash x_{i,1}, blue dotted x_{i,2}, green dashed x_{i,3}) against time i, with observations shown as dots. Solid lines show the true paths. Both panels are subsamples (1000 left, 30 right) selected by resampling (see Section 3.3) from the final importance sampling output with ε = 0, targeting the posterior distribution. Darker points/lines indicate more frequent resampling.]

G PARTICLE MCMC ANALYSIS OF LORENZ EXAMPLES

As a comparison to DIS, we also investigate inference for the Lorenz example using particle Markov chain Monte Carlo (PMCMC) [Andrieu et al., 2010], a near-exact inference method. This runs a Metropolis-Hastings MCMC algorithm for the model parameters θ. For each proposed θ, the likelihood is estimated by running a particle filter for the x variables with N_PF particles. A particle filter involves forward simulating the unconditioned time series model to the next observation time and weighting the simulated paths based on the density of the observation given the endpoint. We implement PMCMC and particle filters using the pomp R package [King et al., 2016].
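We use pomp's implementation; purely as an illustration of the scheme just described, a bootstrap particle filter log-likelihood estimate can be sketched as follows (simulate_step, obs_logpdf and x0_sampler are model-specific placeholders returning numpy arrays):

import numpy as np

def particle_filter_loglik(theta, y_obs, n_pf, simulate_step,
                           obs_logpdf, x0_sampler, rng):
    x = x0_sampler(theta, n_pf, rng)          # initial particles
    loglik = 0.0
    for y in y_obs:
        x = simulate_step(x, theta, rng)      # forward simulate to next obs
        logw = obs_logpdf(y, x, theta)        # weight by observation density
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())        # stable log of mean weight
        idx = rng.choice(n_pf, size=n_pf, p=w / w.sum())
        x = x[idx]                            # multinomial resampling
    return loglik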

One PMCMC tuning choice is the proposal distribution for θ. We use a normal proposal, with variance equal to a posterior covariance matrix estimate from a pilot PMCMC run. Another choice is where to initialise the MCMC chain. For simplicity we use the true parameter values. In a real analysis initialisation is often more difficult.

A key remaining tuning choice is selecting N_PF so that the particle filter produces likelihood estimates which are accurate enough for the MCMC algorithm to be efficient. We follow the theoretically derived tuning advice of Sherlock et al. [2015] and Doucet et al. [2015]: we choose N_PF so that the log-likelihood estimates at a representative parameter value (we use the true parameter values) have a standard deviation s of roughly 1.5.
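A sketch of this tuning rule (loglik_estimator is a placeholder for a particle filter run; the doubling search is our choice, not from the paper):

import numpy as np

def tune_n_pf(theta_true, y_obs, loglik_estimator, rng,
              target_sd=1.5, n_reps=50, n_pf=50):
    # Increase N_PF until repeated log-likelihood estimates at a
    # representative parameter value have standard deviation <= 1.5.
    while True:
        ests = [loglik_estimator(theta_true, y_obs, n_pf, rng)
                for _ in range(n_reps)]
        if np.std(ests) <= target_sd:
            return n_pf
        n_pf *= 2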

In our first example σ, the observation noise scale, is an unknown parameter whose true value is 2. For σ = 2 it is relatively common for forward simulations to produce high observation densities. This is reflected in the fact that we need only N_PF = 50 to attain s ≈ 1.5, and so PMCMC was fast to run. We ran 80,000 PMCMC iterations, taking 4 minutes. This produces an effective sample size – calculated using the method of Vats et al. [2019] – of 2328, comparable to the target ESS of 2500 used by DIS.

However in our second example we fix σ = 0.2. At this level of noise it is rare for forward simulations to produce high observation densities, particularly for the observations in Figure 3 which lie furthest from the true paths. Hence even N_PF = 10^6, near the largest choice we could use with our available memory, produced s ≈ 6. We can estimate a lower bound on the time cost of PMCMC for this example without memory constraints by considering how long 10^6 particles would take. The computational cost of this particle filter implementation is O(N_PF) [King et al., 2016], and using 50 particles in PMCMC took 4 minutes. So PMCMC using 10^6 particles would take approximately 4 × 10^6/50 = 80,000 minutes, which is orders of magnitude longer than DIS.

H TUNING RECOMMENDATIONS

Here we summarise recommendations on tuning DIS which appear throughout the paper, and add some further comments.

H.1 HYPERPARAMETERS

For the M/G/1 and Lorenz examples we recommend using:

• N = 50,000 (importance sampling sample size)

• M = 2,500 (target effective sample size)

• n = 100 (training batch size)

• B = M/n (number of batches)

The sinusoidal example is much simpler, and we found N = 4000, M = 2000 to be sufficient here.
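For reference, these defaults can be collected in a small configuration object (a sketch; the class and names are ours):

from dataclasses import dataclass

@dataclass
class DISConfig:
    N: int = 50_000  # importance sampling sample size
    M: int = 2_500   # target effective sample size
    n: int = 100     # training batch size

    @property
    def B(self) -> int:
        # Number of batches, B = M / n.
        return self.M // self.n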

In general we expect optimising these hyperparameters for particular tasks is likely to be useful. In particular, this paper focuses on models where the cost of producing a single evaluation of the target density is low. For more expensive models, the optimal tuning choices could be qualitatively different.

H.2 NORMALISING FLOW ARCHITECTURE

We tune the normalising flow architecture by trial and error. In general we expect more complex and higher dimensional targets to require more layers and hidden units.

One possible more general approach is to begin with a small number of layers with few hidden units. It may become apparent while running DIS that the decrease in ε reaches a barrier due to an insufficiently flexible class of distributions available in the approximating family. Whenever this happens, extra layers with more hidden units can be added to the flow, initialised to be close to identity transformations as in Section B above.
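A sketch of this growth step for an affine coupling flow, assuming (as is common) that zeroing the conditioner's final linear layer makes the coupling approximately an identity map; the module structure is illustrative:

import torch.nn as nn

def append_near_identity_layer(flow_layers: nn.ModuleList,
                               new_coupling: nn.Module) -> None:
    # Zero the final linear layer of the new coupling's conditioner so
    # the enlarged flow initially reproduces the current distilled density.
    final_linear = [m for m in new_coupling.modules()
                    if isinstance(m, nn.Linear)][-1]
    nn.init.zeros_(final_linear.weight)
    nn.init.zeros_(final_linear.bias)
    flow_layers.append(new_coupling)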

H.3 REGULARISATION

We speculate that L1 regularisation on neural network weights may be helpful to encourage simple density approximations, which are especially appropriate early in training when ε is large. Exploratory work found that L1 regularisation improved the results in our most complicated example, on the Lorenz model, but not the others.
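In PyTorch terms this amounts to adding a weighted L1 penalty to the training loss (a sketch; model is an nn.Module and the weight lam is illustrative, needing tuning per example):

def loss_with_l1(base_loss, model, lam=1e-4):
    # L1 penalty on all network weights encourages simple density
    # approximations, especially early in training when eps is large.
    l1 = sum(p.abs().sum() for p in model.parameters())
    return base_loss + lam * l1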

H.4 SELECTING ε

The method outlined in the main paper to select ε is free of tuning parameters. However, alternative methods to select ε could be used, for instance by using variations on the standard effective sample size [see e.g. Martino et al., 2017].
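For instance, the standard ESS can be computed from the log importance weights, and ε chosen to keep it above a target (a sketch):

import numpy as np

def effective_sample_size(log_w):
    # Standard IS effective sample size: (sum w)^2 / sum(w^2).
    # Subtracting the max log weight avoids overflow; the unknown
    # normalising constant cancels in the ratio.
    w = np.exp(log_w - np.max(log_w))
    return w.sum() ** 2 / (w ** 2).sum()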

H.5 TEMPERING SCHEME

We have not compared the performance of different tempering schemes and for now recommend choosing one based on convenience. In the sinusoidal example we used

p_ε(ξ) = p_1(ξ)^ε p(ξ)^{1−ε},  (14)

which is convenient when the initial target p_1(ξ) and final unnormalised target p(ξ) can be evaluated. Here it is crucial that q(ξ; φ) can be easily pretrained to match the initial target.

In the likelihood-free example we used

p_ε(ξ) = π(ξ) exp[−d(ξ)^2/(2ε^2)],  (15)

which is convenient when the likelihood cannot be evaluated but it is possible to simulate data y(ξ) and calculate d(ξ) = ||y(ξ) − y_0||, the Euclidean distance to the observations y_0. It is also necessary to be able to easily train q(ξ; φ) to match the prior π(ξ) on the parameters and random variables involved in the simulator. Many variations on this target are possible, such as changing the choice of distance function or replacing the exponential term with K(d(ξ)/ε) for some density K with mode zero [Wilkinson, 2013].

In the SDE example we used

p_ε(ξ) = π(θ) p(x|θ) p(y|x,θ)^{1−ε}.  (16)

Recall y represents observed data, x latent variables and θ model parameters. This is convenient when it is straightforward to simulate from an unconditioned model for ξ = (θ, x) and to pretrain q(ξ; φ) to approximate its density, and the observation density p(y|x,θ) is known.

There are close relations between these three tempering schemes. Firstly, (16) is the special case of (14) where the unconditioned model is used as the initial target. Secondly, note that (15) and (16) are similar. In (16) the model has an observation density component, and the tempering scheme inflates the observation error. As ε → 0 the true observation density is recovered. In (15) an artificial observation error density is introduced, which converges to a point mass as ε → 0.
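To make the correspondence concrete, the three unnormalised log targets can be written as follows (a sketch operating on precomputed log density terms; ε > 0 is assumed in the second case):

def log_target_14(log_p1, log_p, eps):
    # (14): geometric bridge from initial target p1 to final target p.
    return eps * log_p1 + (1.0 - eps) * log_p

def log_target_15(log_prior, d, eps):
    # (15): prior times a Gaussian-type ABC kernel in the distance d.
    return log_prior - 0.5 * d ** 2 / eps ** 2

def log_target_16(log_prior_theta, log_p_x, log_p_obs, eps):
    # (16): unconditioned model with the observation density tempered.
    return log_prior_theta + log_p_x + (1.0 - eps) * log_p_obs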