
PAWL - GPU meeting @ Warwick


Page 1: PAWL - GPU meeting @ Warwick


Parallel Adaptive Wang–Landau Algorithm

Pierre E. Jacob

CEREMADE - Université Paris Dauphine, funded by AXA Research

GPU in Computational Statistics, January 25th, 2012

Joint work with Luke Bornn (UBC), Arnaud Doucet (Oxford), Pierre Del Moral (INRIA & Université de Bordeaux) and Robin J. Ryder (Dauphine).

Page 2: PAWL - GPU meeting @ Warwick

Outline

1 Wang–Landau algorithm

2 Improvements: Automatic Binning, Adaptive proposals, Parallel Interacting Chains

3 Example: variable selection

4 Conclusion

Page 3: PAWL - GPU meeting @ Warwick

Wang–Landau

Context

An unnormalized target density $\pi$ on a state space $\mathcal{X}$.

A kind of adaptive MCMC algorithm

It iteratively generates a sequence $X_t$.

The stationary distribution is not π itself.

At each iteration a different stationary distribution is targeted.

Page 4: PAWL - GPU meeting @ Warwick

Wang–Landau

Partition the space

The state space $\mathcal{X}$ is cut into $d$ bins:

$$\mathcal{X} = \bigcup_{i=1}^{d} \mathcal{X}_i \qquad \text{and} \qquad \forall i \neq j,\quad \mathcal{X}_i \cap \mathcal{X}_j = \emptyset$$

Goal

The generated sequence spends a desired proportion $\phi_i$ of time in each bin $\mathcal{X}_i$,

within each bin $\mathcal{X}_i$ the sequence is asymptotically distributed according to the restriction of $\pi$ to $\mathcal{X}_i$.

Page 5: PAWL - GPU meeting @ Warwick

Wang–Landau

Stationary distribution

Define the mass of $\pi$ over $\mathcal{X}_i$ by:

$$\psi_i = \int_{\mathcal{X}_i} \pi(x)\,dx$$

The stationary distribution of the WL algorithm is:

$$\tilde\pi(x) \propto \pi(x) \times \frac{\phi_{J(x)}}{\psi_{J(x)}}$$

where $J(x)$ is the index such that $x \in \mathcal{X}_{J(x)}$.
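To make the biasing concrete, here is a minimal R sketch (not from the slides) evaluating $\tilde\pi$ for a bimodal univariate target, with bins along $x$ and $\phi_i = 1/d$; the mixture, the bin edges and the plotting range are arbitrary choices for illustration.

```r
# Unnormalized bimodal target: a two-component Gaussian mixture (arbitrary).
log_pi <- function(x) log(0.5 * dnorm(x, mean = 0, sd = 1) +
                          0.5 * dnorm(x, mean = 10, sd = 1))

# Partition of the real line into d bins along x (arbitrary edges).
bin_edges <- c(-Inf, 2.5, 5, 7.5, Inf)
d <- length(bin_edges) - 1
phi <- rep(1 / d, d)                                  # desired proportions phi_i
J <- function(x) findInterval(x, bin_edges)           # bin index J(x)

# Bin masses psi_i, here computable by 1-D numerical integration.
psi <- sapply(seq_len(d), function(i) {
  integrate(function(x) exp(log_pi(x)), bin_edges[i], bin_edges[i + 1])$value
})

# Log of the biased target: log pi~(x) = log pi(x) + log phi_J(x) - log psi_J(x).
log_pi_biased <- function(x) log_pi(x) + log(phi[J(x)]) - log(psi[J(x)])

curve(log_pi_biased, from = -5, to = 15, ylab = "log biased density")
```

Under this bias each of the $d$ bins carries total mass $\phi_i$, which is what makes the low-density region between the two modes easy to cross.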

Page 6: PAWL - GPU meeting @ Warwick

Wang–Landau

Example with a bimodal, univariate target density: $\pi$ and two $\tilde\pi$ corresponding to different partitions. Here $\phi_i = d^{-1}$.

[Figure: log density against $x$ over the range $-5$ to $15$, in three panels: "Original Density, with partition lines", "Biased by X", and "Biased by Log Density".]

Page 7: PAWL - GPU meeting @ Warwick

Wang–Landau

Plugging estimates

In practice we cannot compute $\psi_i$ analytically. Instead we plug in estimates $\theta_t(i)$ of $\psi_i/\phi_i$ at iteration $t$, and define the distribution $\pi_{\theta_t}$ by:

$$\pi_{\theta_t}(x) \propto \pi(x) \times \frac{1}{\theta_t(J(x))}$$

Metropolis–Hastings

The algorithm performs a Metropolis–Hastings step targeting $\pi_{\theta_t}$ at iteration $t$, generating a new point $X_t$, updating $\theta_t$, and so on.

Page 8: PAWL - GPU meeting @ Warwick

Wang–Landau

Estimate of the bias

The update of the estimated bias $\theta_t(i)$ is done according to:

$$\theta_t(i) \leftarrow \theta_{t-1}(i)\left[1 + \gamma_t\left(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i\right)\right]$$

with $d$ the number of bins and $\gamma_t$ a decreasing sequence or "step size", e.g. $\gamma_t = 1/t$.

If $\mathbb{1}_{\mathcal{X}_i}(X_t) = 1$ then $\theta_t(i)$ increases; otherwise $\theta_t(i)$ decreases.
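A minimal R sketch of this single update step (the names `theta`, `bin_of_x` and `gamma_t` are illustrative, not the PAWL package's API):

```r
# One bias update: theta[i] <- theta[i] * (1 + gamma_t * (indicator - phi[i])).
update_bias <- function(theta, bin_of_x, phi, gamma_t) {
  indicator <- as.numeric(seq_along(theta) == bin_of_x)   # 1 only for the occupied bin
  theta * (1 + gamma_t * (indicator - phi))
}

# Example: 4 bins, uniform desired proportions, chain currently in bin 2.
theta <- rep(1, 4)
phi <- rep(0.25, 4)
update_bias(theta, bin_of_x = 2, phi = phi, gamma_t = 1 / 10)
```

With $\phi_i = 1/d$, the occupied bin is inflated and all others slightly deflated, so the chain is pushed away from over-visited bins.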

Page 9: PAWL - GPU meeting @ Warwick

Wang–Landau

The algorithm itself

1: First, $\forall i \in \{1, \ldots, d\}$ set $\theta_0(i) \leftarrow 1$.
2: Choose a decreasing sequence $\{\gamma_t\}$, typically $\gamma_t = 1/t$.
3: Sample $X_0$ from an initial distribution $\pi_0$.
4: for $t = 1$ to $T$ do
5:    Sample $X_t$ from $P_{t-1}(X_{t-1}, \cdot)$, a MH kernel with invariant distribution $\pi_{\theta_{t-1}}(x)$.
6:    Update the bias: $\theta_t(i) \leftarrow \theta_{t-1}(i)\left[1 + \gamma_t\left(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i\right)\right]$.
7: end for
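Putting the pieces together, here is a self-contained R sketch of these steps on the bimodal example from the earlier slide, with a plain random-walk MH kernel; the proposal scale, bin edges and number of iterations are illustrative choices, and this is not the PAWL package's implementation.

```r
set.seed(1)
log_pi <- function(x) log(0.5 * dnorm(x, 0, 1) + 0.5 * dnorm(x, 10, 1))

bin_edges <- c(-Inf, 2.5, 5, 7.5, Inf)        # partition along x (arbitrary)
d <- length(bin_edges) - 1
phi <- rep(1 / d, d)                          # desired proportions phi_i
J <- function(x) findInterval(x, bin_edges)   # bin index J(x)

T_iter <- 50000
theta <- rep(1, d)                            # step 1: theta_0(i) <- 1
gamma_t <- function(t) 1 / t                  # step 2: decreasing step size
x <- rnorm(1)                                 # step 3: X_0 from an initial distribution
chain <- numeric(T_iter)
log_target <- function(x, theta) log_pi(x) - log(theta[J(x)])   # log pi_theta (unnormalized)

for (t in seq_len(T_iter)) {
  # Step 5: one random-walk Metropolis-Hastings move targeting pi_theta.
  prop <- x + rnorm(1, sd = 2)
  if (log(runif(1)) < log_target(prop, theta) - log_target(x, theta)) x <- prop
  chain[t] <- x
  # Step 6: multiplicative update of the bias estimates.
  theta <- theta * (1 + gamma_t(t) * (as.numeric(seq_len(d) == J(x)) - phi))
}

table(factor(J(chain), levels = seq_len(d))) / T_iter   # occupation proportions, ideally close to phi
theta                                                   # estimates of psi_i / phi_i, up to a constant
```

In a real run one would monitor the occupation proportions against $\phi$ and the stabilization of $\log \theta_t$, as in the plots later in the deck.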

Page 10: PAWL - GPU meeting @ Warwick

Wang–Landau

Result

In the end we get:

a sequence $X_t$ asymptotically following the biased target $\tilde\pi$,

as well as estimates $\theta_t(i)$ of $\psi_i/\phi_i$.

Page 11: PAWL - GPU meeting @ Warwick

Wang–Landau

Usual improvement: Flat Histogram

Wait for the FH criterion to occur before decreasing $\gamma_t$:

$$(\text{FH}) \qquad \max_{i=1,\ldots,d} \left| \frac{\nu_t(i)}{t} - \phi_i \right| < c$$

where $\nu_t(i) = \sum_{k=1}^{t} \mathbb{1}_{\mathcal{X}_i}(X_k)$ and $c > 0$.

WL with stochastic schedule

Let $\kappa_t$ be the number of times FH has been reached by iteration $t$. Use $\gamma_{\kappa_t}$ at iteration $t$ instead of $\gamma_t$. Each time FH is reached, reset $\nu_t(i)$ to 0.
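A small R helper for the FH check, in the same illustrative notation as the sketches above (`nu` holds the bin visit counts since the last reset, `tol` plays the role of $c$):

```r
# Flat Histogram criterion: max_i | nu_t(i)/t - phi_i | < c.
flat_histogram <- function(nu, phi, tol = 0.1) {
  t <- sum(nu)                       # iterations since the last reset of nu
  t > 0 && max(abs(nu / t - phi)) < tol
}

# Example with 4 bins and uniform phi: balanced counts pass, unbalanced fail.
phi <- rep(0.25, 4)
flat_histogram(nu = c(240, 260, 250, 250), phi = phi)   # TRUE
flat_histogram(nu = c(400, 100, 250, 250), phi = phi)   # FALSE
```

In the stochastic-schedule variant, each TRUE increments $\kappa_t$, resets the counts $\nu_t(i)$ to 0, and the algorithm keeps using $\gamma_{\kappa_t}$ until the criterion is met again.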

Page 12: PAWL - GPU meeting @ Warwick

Wang–Landau

Theoretical Understanding of WL with deterministic schedule

The schedule $\gamma_t$ decreases at each iteration, hence $\theta_t$ converges, hence $P_t(\cdot, \cdot)$ converges… ≈ "diminishing adaptation".

Theoretical Understanding of WL with stochastic schedule

Flat Histogram is reached in finite time for any $\gamma$, $\phi$, $c$ if one uses the following update:

$$\log \theta_t(i) \leftarrow \log \theta_{t-1}(i) + \gamma\left(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i\right)$$

instead of

$$\theta_t(i) \leftarrow \theta_{t-1}(i)\left[1 + \gamma\left(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i\right)\right]$$

Page 13: PAWL - GPU meeting @ Warwick

Automatic Binning

Maintain some kind of uniformity within bins: if the sample histogram within a bin is non-uniform, split the bin (a possible splitting rule is sketched after the figure).

[Figure: histograms of frequency against log density, (a) before the split and (b) after the split.]
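The slides do not spell out the splitting rule, so the following R sketch is only one possible reading: split a log-density bin at its midpoint when too large a fraction of the within-bin samples falls on one side; the midpoint rule and the 0.75 threshold are assumptions made for illustration.

```r
# Split bin i if its recent samples are too unbalanced around the bin midpoint.
# 'edges' are the current bin edges along log pi(x); 'logdens' are the log
# densities of recent samples that fell in bin i. Purely illustrative rule.
maybe_split_bin <- function(edges, i, logdens, threshold = 0.75) {
  mid <- (edges[i] + edges[i + 1]) / 2
  frac_low <- mean(logdens < mid)
  if (max(frac_low, 1 - frac_low) > threshold) {
    sort(c(edges, mid))              # non-uniform: insert the midpoint as a new edge
  } else {
    edges                            # uniform enough: keep the partition as is
  }
}

# Example: samples piled up in the lower half of bin 2 trigger a split.
maybe_split_bin(edges = c(-20, -10, 0, 10), i = 2, logdens = runif(100, -10, -6))
```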

Page 14: PAWL - GPU meeting @ Warwick

Adaptive proposals

Target a specific acceptance rate:

$$\sigma_{t+1} = \sigma_t + \rho_t\left(2\,\mathbb{1}(A > 0.234) - 1\right)$$

Or use the empirical covariance of the already-generated chain:

$$\Sigma_t = \delta \times \mathrm{Cov}(X_1, \ldots, X_t)$$
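A hedged R sketch of both adaptation rules, assuming $A$ denotes the recently observed acceptance rate and $\rho_t$ a small decreasing step; the default $\delta = 2.38^2/\mathrm{dim}$ below is the usual adaptive-Metropolis scaling, not a value given in the slides.

```r
# Rule 1: nudge the random-walk scale sigma towards a 0.234 acceptance rate.
adapt_scale <- function(sigma, accept_rate, rho_t) {
  sigma + rho_t * (2 * (accept_rate > 0.234) - 1)
}

# Rule 2: proposal covariance from the chain history (iterations in rows),
# scaled by a factor delta.
adapt_covariance <- function(chain_matrix, delta = 2.38^2 / ncol(chain_matrix)) {
  delta * cov(chain_matrix)
}

adapt_scale(sigma = 1, accept_rate = 0.10, rho_t = 0.05)   # too few acceptances: shrink to 0.95
adapt_covariance(matrix(rnorm(200), ncol = 2))             # 2 x 2 proposal covariance
```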

Page 15: PAWL - GPU meeting @ Warwick

Parallel Interacting Chains

$N$ chains $(X_t^{(1)}, \ldots, X_t^{(N)})$ instead of one,

targeting the same biased distribution $\pi_{\theta_t}$ at iteration $t$,

sharing the same estimated bias $\theta_t$ at iteration $t$.

The update of the estimated bias becomes:

$$\log \theta_t(i) \leftarrow \log \theta_{t-1}(i) + \gamma_{\kappa_t}\left( \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}_{\mathcal{X}_i}\!\left(X_t^{(j)}\right) - \phi_i \right)$$
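A minimal R sketch of this interacting update, with `xs` the current states of the $N$ chains and `J` the bin-index function from the earlier sketches (illustrative names, not PAWL's API):

```r
# Shared bias update from N chains: the average bin occupancy across chains
# replaces the single-chain indicator, on the log scale.
update_bias_parallel <- function(log_theta, xs, J, phi, gamma_kappa) {
  occupancy <- tabulate(J(xs), nbins = length(log_theta)) / length(xs)
  log_theta + gamma_kappa * (occupancy - phi)
}

# Example: 4 bins, 100 chains, most of them currently sitting in bin 1.
J <- function(x) findInterval(x, c(-Inf, 2.5, 5, 7.5, Inf))
update_bias_parallel(log_theta = rep(0, 4), xs = rnorm(100, mean = 1), J = J,
                     phi = rep(0.25, 4), gamma_kappa = 0.5)
```

The occupancy vector is the only quantity the chains need to exchange at each iteration, which is why the per-iteration synchronization cost stays small (next slide).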

Page 16: PAWL - GPU meeting @ Warwick

Parallel Interacting Chains

How “parallel” is PAWL?

The algorithm's additional cost compared to independent parallel MCMC chains lies in:

getting the proportions $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}_{\mathcal{X}_i}\!\left(X_t^{(j)}\right)$,

updating $(\theta_t(1), \ldots, \theta_t(d))$.

Page 17: PAWL - GPU meeting @ Warwick

Parallel Interacting Chains

Example: Normal distribution

[Figure: histogram of the binned coordinate (density against the coordinate, over the range $-4$ to $4$).]

Page 18: PAWL - GPU meeting @ Warwick

Parallel Interacting Chains

Reaching Flat Histogram

[Figure: number of times the Flat Histogram criterion is reached (#FH) against iterations (up to 10,000), for $N = 1$, $N = 10$ and $N = 100$.]

Page 19: PAWL - GPU meeting @ Warwick

Parallel Interacting Chains

Stabilization of the log penalties


Figure: $\log \theta_t$ against $t$, for $N = 1$.

Page 20: PAWL - GPU meeting @ Warwick

Parallel Interacting Chains

Stabilization of the log penalties


Figure: $\log \theta_t$ against $t$, for $N = 10$.

Page 21: PAWL - GPU meeting @ Warwick

Parallel Interacting Chains

Stabilization of the log penalties


Figure: $\log \theta_t$ against $t$, for $N = 100$.

Page 22: PAWL - GPU meeting @ Warwick

Parallel Interacting Chains

Multiple effects of parallel chains

$$\log \theta_t(i) \leftarrow \log \theta_{t-1}(i) + \gamma_{\kappa_t}\left( \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}_{\mathcal{X}_i}\!\left(X_t^{(j)}\right) - \phi_i \right)$$

FH is reached more often when $N$ increases, hence $\gamma_{\kappa_t}$ decreases more quickly;

$\log \theta_t$ tends to vary much less when $N$ increases, even for a fixed value of $\gamma$.

Page 23: PAWL - GPU meeting @ Warwick

Variable selection

Settings

Pollution data as in McDonald & Schwing (1973). For 60 metropolitan areas:

15 possible explanatory variables (including precipitation, population per household, …), denoted by $X$,

the response variable Y is the age-adjusted mortality rate.

This leads to $2^{15} = 32{,}768$ possible models to explain the data.

Page 24: PAWL - GPU meeting @ Warwick

Variable selection

Introduce

$\gamma \in \{0, 1\}^p$ the "variable selector",

$q_\gamma$ represents the number of variables in model "$\gamma$",

$g$ some large value ($g$-prior, see Zellner 1986, Marin & Robert 2007).

Posterior distribution

$$\pi(\gamma \mid y, X) \propto (g + 1)^{-(q_\gamma+1)/2}\left[\, y^{T}y - \frac{g}{g + 1}\, y^{T} X_\gamma \left(X_\gamma^{T} X_\gamma\right)^{-1} X_\gamma^{T}\, y \,\right]^{-n/2}.$$
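A hedged R sketch of this log posterior for a given model $\gamma$ (unnormalized; intercept handling is assumed to be folded into `X`, and the names are illustrative):

```r
# Unnormalized log posterior log pi(gamma | y, X) under Zellner's g-prior.
log_post_gamma <- function(gamma, y, X, g) {
  n <- length(y)
  q <- sum(gamma)                               # number of included variables
  bracket <- sum(y^2)                           # y'y
  if (q > 0) {
    Xg <- X[, gamma == 1, drop = FALSE]
    fit <- Xg %*% solve(crossprod(Xg), crossprod(Xg, y))
    bracket <- bracket - g / (g + 1) * sum(y * fit)   # y'y - g/(g+1) y'X(X'X)^-1 X'y
  }
  -(q + 1) / 2 * log(g + 1) - n / 2 * log(bracket)
}

# Example on simulated data with p = 15 candidate regressors and g = n.
set.seed(1); n <- 60; p <- 15
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n)
log_post_gamma(gamma = c(1, 1, rep(0, p - 2)), y = y, X = X, g = n)
```

The Wang–Landau run described next bins states according to this log posterior value.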

Page 25: PAWL - GPU meeting @ Warwick

Variable selection

Most naive MH algorithm

The proposal flips one variable on/off at random at each iteration.

Binning

Along values of $\log \pi(x)$, found with a preliminary exploration, in 20 bins.
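A minimal R sketch of the flip proposal on the inclusion vector $\gamma$ (illustrative helper name):

```r
# Naive proposal: pick one of the p variables uniformly at random and flip it.
flip_proposal <- function(gamma) {
  j <- sample(length(gamma), 1)
  gamma[j] <- 1 - gamma[j]
  gamma
}

flip_proposal(c(1, 1, rep(0, 13)))   # turns one variable on or off
```

The proposal is symmetric, so the MH acceptance ratio only involves the (biased) posterior values of the two models.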

Page 26: PAWL - GPU meeting @ Warwick

Variable selection

[Figure: $\log \theta$ against iteration, in three panels: $N = 1$ (up to 80,000 iterations), $N = 10$ (up to 25,000) and $N = 100$ (up to 3,500).]

Figure: Each run took 2 minutes (± 5 seconds). Dotted lines show the real $\psi$.

Page 27: PAWL - GPU meeting @ Warwick

Variable selection

[Figure: model saturation against iteration (500 to 3,500), in four panels: Wang–Landau, and Metropolis–Hastings at Temp = 1, 10 and 100.]

Figure: $q_\gamma/p$ (mean and 95% interval) along iterations, for $N = 100$.

Page 28: PAWL - GPU meeting @ Warwick

Conclusion

Automatic binning, but…

We still have to define a range of plausible (or "interesting") values.

Parallel Chains

It seems reasonable to use more than $N = 1$ chain, with or without GPUs. There is no theoretical validation of this yet. What is the optimal $N$ for a given computational effort?

Need for a stochastic schedule?

It seems that using a large $N$ makes the use, and hence the choice, of $\gamma_t$ irrelevant.

Page 29: PAWL - GPU meeting @ Warwick

Would you like to know more?

Article: An Adaptive Interacting Wang-Landau Algorithm for Automatic Density Exploration, with L. Bornn, P. Del Moral, A. Doucet.

Article: The Wang-Landau algorithm reaches the Flat Histogram criterion in finite time, with R. Ryder.

Software: PAWL, available on CRAN:

install.packages("PAWL")

References:

F. Wang & D. Landau, Physical Review E, 64(5):056101.

Y. Atchadé & J. Liu, Statistica Sinica, 20:209–233.
