On the convergence properties of the Wang-Landau algorithm

Wang–Landau algorithmFlat Histogram in finite time

Parallel Wang–Landau: asymptotic behaviour

On the convergence properties of theWang-Landau Algorithm

Robin J. RyderCEREMADE, Universite Paris Dauphine

and CREST

ENSAE – 31 January 2012

joint work with Pierre Jacob (National University of Singapore)and Pierre Del Moral (INRIA & Universite de Bordeaux)

R. Ryder (Dauphine) & CREST Wang-Landau 1/ 40



Outline

1 Wang–Landau algorithm

2 Flat Histogram in finite time

3 Parallel Wang–Landau: asymptotic behaviour




Motivation

X

dens

ity

0.0

0.1

0.2

0.3

0.4

0.5

−5 0 5 10 15

Figure: A mixture of well-separated normal distributions.




Motivation

X

dens

ity

0.0

0.1

0.2

0.3

0.4

0.5

−5 0 5 10 15





Motivation

X

dens

ity

0.0

0.1

0.2

0.3

0.4

0.5

−5 0 5 10 15





Setting

Partition the state space

X =d⋃

i=1

Xi

Desired frequencies

φ = (φ1, . . . , φd)

Penalized distribution

πθ(x) ∝ π(x)

θ(J(x))

where J(x) such that x ∈ XJ(x).




First algorithm

Algorithm 1 Wang-Landau with deterministic schedule (γt)

1: Init θ0 > 0, X0 ∈ X .2: for t = 1 to T do3: Sample Xt from Kθt−1(Xt−1, ·), MH kernel targeting πθt−1 .

4: Update the penalties:

log θt(i)← log θt−1(i) + γt (1IXi(Xt)− φi )

5: end for




Video




Video




Flat Histogram

Issues with the first version

Choice of γ has a huge impact on the results.

Flat Histogram

Define the counters:

νt(i) :=t∑

n=1

1IXi(Xn)

Flat Histogram (FH) is reached when:

maxi∈{1,...,d}

∣∣∣∣νt(i)

t− φi

∣∣∣∣ < c




Flat Histogram

Idea

Instead of decreasing γt at each time step t, decrease only whenthe Flat Histogram criterion is reached.

In practice

Denote by κt the number of FH criteria reached up to time t.

Use γκt instead of γt at time t.

If FH is reached at time t, reset νt(i) to 0 for all i .




Wang–Landau with Flat Histogram

Algorithm 2 Wang-Landau with stochastic schedule (γκt )

1: Init θ0 > 0, X0 ∈ X .2: Init κ0 ← 0.3: for t = 1 to T do4: Sample Xt from Kθt−1(Xt−1, ·), MH kernel targeting πθt−1 .5: If (FH) then κt ← κt−1 + 1, otherwise κt ← κt−1.

6: Update the penalties:

log θt(i)← log θt−1(i) + γκt (1IXi(Xt)− φi )

7: end for




FH is met in finite time

To be sure that eventually, for any c > 0:

maxi∈{1,...,d}

∣∣∣∣νt(i)

t− φi

∣∣∣∣ < c

we want to prove:

∀i ∈ {1, . . . , d} νt(i)

tP−−−→

t→∞φi

for any fixed γ > 0.




Various updates

Right update

log θt(i)← log θt−1(i) + γ (1IXi(Xt)− φi ) (1)

Using this update, FH is met in finite time.

Wrong update

θt(i)← θt−1(i) [1 + γ (1IXi(Xt)− φi )]

⇔log θt(i)← log θt−1(i) + log [1 + γ (1IXi

(Xt)− φi )] (2)

Here FH is not necessarily met in finite time.Special case: ∀i φi = 1

d .




Assumptions

In this talk, there are only two bins: d = 2. Additionally:

Assumption

The bins are not empty with respect to µ and π:

∀i ∈ {1, 2} µ(Xi ) > 0 and π(Xi ) > 0

Assumption

The state space X is compact.




Assumptions

Assumption

The proposal distribution Q(x , y) is such that:

∃qmin > 0 ∀x ∈ X ∀y ∈ X Q(x , y) > qmin

Assumption

The MH acceptance ratio is bounded from both sides:

∃m > 0 ∃M > 0 ∀x ∈ X ∀y ∈ X m <π(y)

π(x)

Q(y , x)

Q(x , y)< M




Theorem

Theorem

Consider the sequence of penalties θt introduced in the WLalgorithm. We define:

Zt = logθt(1)

θt(2)= log θt(1)− log θt(2)

Then:Zt

t

L1−−−→t→∞

0

and consequently, with update (1) (FH) is reached in finite timefor any precision threshold c, whereas this is not guaranteed forupdate (2).




Consequence

Using update (1), and starting from Z0 = 0:

Zt = log θt(1)(

1IXX

XX

)− log θt(2)

(1IX

X

XX

)




Consequence


Zt = log θ0(1) + γ∑t

n=1(1IX1(Xn)− φ1)(

1IXX

XX

)− log θ0(2) + γ

∑tn=1(1IX2(Xn)− φ2)

(1IX

X

XX

)




Consequence


Zt = γ(νt(1)− tφ1)(

1IXX

XX

)−γ(νt(2)− tφ2)

(1IX

X

XX

)




Consequence


Zt = γ(νt(1)− tφ1)(

1IXX

XX

)−γ(t − νt(1)− t(1− φ1))

(1IX

X

XX

)




Consequence


Zt = γ(νt(1)− tφ1)(

1IXX

XX

)+γ(νt(1)− tφ1))

(1IX

X

XX

)




Consequence


Zt = 2γ(νt(1)− tφ1)(

1IXX

XX

)(1IX

X

XX

)




Consequence


Ztt = 2γ

(νt(1)t − φ1

)(1IX

X

XX

)(1IX

X

XX

)




Consequence


Ztt = 2γ

(νt(1)t − φ1

)L1−−−→

t→∞0(

1IXX

XX

)⇒ νt(1)

t

L1−−−→t→∞

φ1

(1IX

X

XX

)




Consequence


Ztt = 2γ

(νt(1)t − φ1

)L1−−−→

t→∞0(

1IXX

XX

)⇒ νt(1)

t

L1−−−→t→∞

φ1

(1IX

X

XX

)Using update (2), these simplifications do not occur: νt(i)/tconverges, but not to φi in general.




Proof

To prove thatZt

t

L1−−−→t→∞

0,

first notice that

Zt+1 − Zt = 2γ (νt+1(1)− νt(1)− φ1)

can only take two different values, which we note +a and −b.




Proof

Introduce Ut , the increment of Zt :

Zt+1 = Zt + Ut

Lemma

With the introduced processes Zt and Ut , there exists ε > 0 suchthat for all η > 0, there exists Zhi such that, if Zt ≥ Zhi , we havethe following two inequalities:

P[Ut+1 = −b|Ut = +a,Zt ] > ε

P[Ut+1 = −b|Ut = −b,Zt ] > 1− η.

(and the symmetric corollary)




Proof

time

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

Zs

Zs+T

Z~

s+T~

5 10 15 20 25 30 35

Figure: We prove that Zt returns below a given horizontal bar wheneverit goes above it, and it does so in finite time. This implies Zt/t → 0.




Proof

Introduce a crossing time s: Zs−1 ≤ Zhi and Zs ≥ Zhi .Introduce a new process Zt :

Zs = Zs and Zt+1 = Zt + Vt

We can choose Vt such that Zt ≥ Zt and

Lemma

(Vt) is a Markov chain over the space {+a,−b} with transitionmatrix (

1− ε εη 1− η

)where the first state corresponds to +a and the second state to−b.




Proof

time

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

Zs

Zs+T

Z~

s+T~

5 10 15 20 25 30 35

Figure: We prove that Zt returns below a given horizontal bar wheneverit goes above it, and it does so in finite time. This implies Zt/t → 0.




Case d = 2

This shows that Zt/t → 0, hence (FH) is met in finite time.For the case d > 2, we shall use only update 1.




Case d > 2

To prove that (FH) is met in finite time in the case with d > 2, weneed an extra assumption:

Assumption

The desired frequencies are all rational numbers:

∀i , φi ∈ Q




Irreducible Markov chain

Under the assumption of rationality, we have the following lemma:

Lemma

Let Θ be the following subset of Rd :

Θ = {z ∈ Rd : ∃(n1, . . . , nd) ∈ Nd zi = ni − φid∑

j=1

nj}

Then denoting by λ the product of the Lebesgue measure µ on Xand of the counting measure on Θ, (Xt , log θt) is λ-irreducible.

Proof: Bezout’s lemma.




A limit exists

Since (Xt , log θt) is λ-irreducible, the proportion of visits to anyλ-measurable set of X ×Θ converges to a limit in [0, 1]. Hencethe vector (ν(i)/t) converges to some limit pi . We now need toprove that this limit corresponds to (FH).




Reductio ad absurdum

Suppose (reductio ad absurdum) that p 6= φ. Then there exist iand j such that

pi − φi < pj − φjand hence

Z j ,it = −νt(i) + νt(j) + t(φi − φj) ∼ t(−pi + φi + pj − φj)→∞

As before, we can construct Ut and Ut and show that Ut decreaseson average, which contradicts the assumption that Z j ,i

t →∞.




Conclusion for d > 2

Conclusion: using update 1, (FH) is met in finite time for anynumber of bins d .




Illustration

X

dens

ity

0.0

0.1

0.2

0.3

0.4

0.5

−4 −2 0 2 4

(a) Histogram of the gen-erated sample

iterations

Pro

port

ions

of v

isits

to e

ach

bin

0.0

0.2

0.4

0.6

0.8

1.0

0 1000 2000 3000 4000 5000

(b) Convergence of theproportions of visits toeach bin, using the rightupdate

iterations

Pro

port

ions

of v

isits

to e

ach

bin

0.0

0.2

0.4

0.6

0.8

1.0

0 1000 2000 3000 4000 5000

(c) Convergence of theproportions of visits toeach bin, using the wrongupdate




Parallel chains

N chains (X(1)t , . . . ,X

(N)t ) instead of one.

targeting the same biased distribution πθt at iteration t,

sharing the same estimated bias θt at iteration t.

Old update of the penalties

log θt(i)← log θt−1(i) + γ (1IXi(Xt)− φi )

(1IX

X

XX

)

(L. Bornn, P. Jacob, P. Del Moral, A. Doucet)




Parallel chains

N chains (X(1)t , . . . ,X

(N)t ) instead of one.

targeting the same biased distribution πθt at iteration t,

sharing the same estimated bias θt at iteration t.

New update of the penalties

log θt(i)← log θt−1(i) + γ(

1N

∑Nk=1 1IXi

(X(k)t )− φi

) (1IX

X

XX

)

(L. Bornn, P. Jacob, P. Del Moral, A. Doucet)




Infinite number of chains

Suppose φi = 1/d . We can use either update and choose the“wrong” update

θt(i)← θt−1(i)

[1 + γ

(1

N

N∑k=1

1IXi(X

(k)t )− 1

d

)]

Note that if (X(k)t )Nk=1

iid∼ πθt then

1

N

N∑k=1

1IXi(X

(k)t )

a.s.−−−−→N→∞

∫Xi

πθt (x)dx =ψi

θt−1(i)

(∑k

ψk

θ(k)

)−1




This corresponds to update

θt(i)← θt−1(i)

[1 + γ

(∫Xi

πθt (x)dx − 1

d

)]

⇔ θt(i)← θt−1(i)

1 + γ

ψi

θt−1(i)

(∑k

ψk

θ(k)

)−1

− 1

d

Normalize:

θt(i)← θt(i)∑j θt(j)

Then the penalties (θt)t≥0 should converge towards ψ.




θt(i)←θt−1(i)

[1+γ

(ψi

θt−1(i)(∑

kψkθ(k))

−1− 1

d

)](1IXX

XX

)∑d

j=1 θt−1(j)

[1+γ

(ψj

θt−1(j)(∑

kψkθ(k))

−1− 1

d

)](1IXX

XX

)




θt(i)←θt−1(i)

[1+γ

(ψi

θt−1(i) M−1ψ (θt−1)− 1

d

)](1IXX

XX

)∑d

j=1 θt−1(j)[1+γ

(ψj

θt−1(j) M−1ψ (θt−1)− 1

d

)](1IXX

XX

)




θt(i)←θt−1(i)(1−γd ) + ψiγM−1

ψ (θt−1)

(1IXX

XX

)∑d

j=1 θt−1(j)(1−γd ) + ψjγM−1ψ (θt−1)

(1IXX

XX

)




θt(i)←θt−1(i)(1−γd ) + ψiγM−1

ψ (θt−1)

(1IXX

XX

)(1−γd )+γM−1

ψ (θt−1)

(1IXX

XX

)




θt(i)← θt−1(i)1− γ

d

1− γd

+γM−1ψ (θt−1)

+ ψiγM−1

ψ (θt−1)

1− γd

+γM−1ψ (θt−1)




θt(i)← θt−1(i)× (1− α(θt−1)) + ψi × α(θt−1)




Hence the update can be written as a convex combination

θt(i)← θt−1(i)× (1− α(θt−1)) + ψi × α(θt−1)

where

α(θt−1) =γM−1

ψ (θt−1)

1− γd + γM−1

ψ (θt−1)

It can easily be shown that

∀t ≥ 0 0 < αmin ≤ α(θt) ≤ αmax < 1

provided that ∀i , θ0(i) > 0, ψ(i) > 0.

Of course the assumptions N →∞ andiid∼ sampling are strong.

What happens in practice?




Animation

Toy example

X = {1, 2, 3, 4, 5}, π = (π1, . . . , π5)

We take the partition: Xi = {i} and launch the Wang–Landaualgorithm for N = 1, 10, 100 and γ = 1.In this case we know the value of ψi : ψi = πi , so we can computethe deterministic sequence of penalties corresponding to N =∞.

Animation




Animation

N =∞R. Ryder (Dauphine) & CREST Wang-Landau 36/ 40



Animation

N = 100R. Ryder (Dauphine) & CREST Wang-Landau 36/ 40



Animation




Animation




Animation

“




New take on the Parallel Wang–Landau

Call fψ the function such that:

θt ← fψ(θt−1)

As we saw:fψ(θ) = θ(1− α(θ)) + ψα(θ))

where

α(θ) =γM−1

ψ (θ)

1− γd + γM−1

ψ (θ)




New take on the Parallel Wang–Landau (in progress)

In the ideal case we have

θn = fψ ◦ . . . ◦ fψ︸︷︷︸×n

(θ0) = f nψ (θ0)

If we add perturbations to ψ we also add perturbations to θ, asfollows:

θ1 = f εψ(θ0) = fψ(θ0) + V0






θn = fψ ◦ . . . ◦ fψ︸︷︷︸×n

(θ0) = f nψ (θ0)


θ1 = f εψ(θ0) = fψ(θ0) + V0

θ2 = f εψ(θ1) = f εψ(fψ(θ0) + V0) = f 2ψ (θ0) + V1

. . .






θn = fψ ◦ . . . ◦ fψ︸︷︷︸×n

(θ0) = f nψ (θ0)


θ1 = f εψ(θ0) = fψ(θ0) + V0

θ2 = f εψ(θ1) = f εψ(fψ(θ0) + V0) = f 2ψ (θ0) + V1

. . .

The properties of fψ should help control the noise terms(Vn)n≥0. . .




Bibliography

Atchade, Y. and Liu, J. (2010). The Wang-Landau algorithmin general state spaces: applications and convergence analysis.Statistica Sinica, 20:209–233.

Wang, F. and Landau, D. (2001). Determining the density ofstates for classical statistical models: A random walkalgorithm to produce a flat histogram. Physical Review E,64(5):56101.

An Adaptive Interacting Wang-Landau Algorithm forAutomatic Density Exploration, L. Bornn, P. Jacob, P. Del

Moral, A. Doucet, on arXiv.

The Wang-Landau algorithm reaches the Flat Histogramcriterion in finite time, P. Jacob & RR, on arXiv.


Documents

On the convergence properties of the Wang-Landau algorithm