Computable Exponential Bounds for Markov Chains and MCMC Simulation

Ioannis Kontoyiannis, Athens University of Economics & Business
Joint work with S.P. Meyn and L.A. Lastras-Montano
Probability Seminar, Columbia University, December 2007
Outline

1. Nonasymptotic Bounds for Markov Chains
   Motivation: Markov Chain Monte Carlo

2. A General Information-Theoretic Bound
   Csiszar's Lemma and Jensen's inequality

3. Large Deviations Bounds: Analysis & Optimization
   • Doeblin chains
   • An (MCMC) example: the Gibbs sampler
   • Geometrically ergodic chains
   • Controlling averages and excursions
   • A general MCMC sampling criterion

4. The i.i.d. case: A geometrical explanation
Motivation

A Common Task
Calculate the expectation E_π(F) = Σ_{x∈S} π(x)F(x) of a given F : S → R

In many cases, the distribution π = (π(x) ; x ∈ S) is known explicitly,
but it is impossible to calculate its values in practice
Typical in Bayesian statistics, statistical mechanics,
networks, image processing, ...

Markov Chain Monte Carlo
It is often simple to construct an ergodic Markov chain {X1, X2, ...}
with stationary distribution π
In that case, we estimate E_π(F) by the partial sums (1/n) Σ_{i=1}^n F(X_i)

Problem
How long a simulation sample n do we need for an accurate estimate?
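The estimator above can be sketched in a few lines. This is a minimal illustration with a hypothetical two-state chain whose kernel, stationary distribution, and function F are all made-up values (not from the talk), chosen so that E_π(F) is known exactly:

```python
import numpy as np

# Hypothetical two-state example: estimate E_pi(F) by the ergodic
# averages (1/n) sum_i F(X_i) of a chain with stationary distribution pi.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # transition kernel P(x, y)
pi = np.array([2/3, 1/3])           # stationary: pi = pi P
F = np.array([1.0, 4.0])            # F evaluated at the two states

rng = np.random.default_rng(0)
n, x = 200_000, 0
total = 0.0
for _ in range(n):
    total += F[x]
    x = rng.choice(2, p=P[x])       # step the chain: X_{i+1} ~ P(x, .)
estimate = total / n                # the partial-sum estimator

print(estimate, pi @ F)             # ergodic average vs exact E_pi(F) = 2.0
```

The question the talk addresses is exactly how large n must be before `estimate` is within ε of `pi @ F` with high probability.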
The Setting: Deviation Bounds for Markov Chains

We have
Ergodic Markov chain {X1, X2, ...}, discrete state-space S [for simplicity]
Transition kernel P(x, y) = Pr{X_{n+1} = y | X_n = x}, initial condition x1 ∈ S
Stationary distribution π = (π(x) ; x ∈ S)

Goal
Find explicit, computable, nonasymptotic bounds on

    Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_π(F) + ε }

• In MCMC, this leads to precise performance guarantees
  and sampling criteria (or stopping rules)
• Similar questions appear in numerous other applications
A General Information-Theoretic Bound

Let H(P‖Q) = Σ_{x∈S} P(x) log[P(x)/Q(x)] = relative entropy
    ‖P − Q‖ = Σ_{x∈S} |P(x) − Q(x)| = 2 × [total variation distance]

Theorem 1
For any Markov chain {Xn}, any function F : S → R bounded above,
any c > 0 and any initial condition X1 = x1, we have

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ c } ≤ −(n − 1) H(W‖W¹ × P)

for some bivariate distribution W = (W(x, y)) on S × S
with marginals W¹ and W² that satisfy

    ‖W¹ − W²‖ ≤ 2/(n − 1)   and   E_{W¹}(F) ≥ c − sup_x F(x)/(n − 1)

where W¹ × P denotes the bivariate distribution (W¹ × P)(x, y) = W¹(x)P(x, y)
Interpretation

Our result
To use the above bound, we need to look at

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ c } ≤ −(n − 1) inf_W H(W‖W¹ × P)

over all W s.t. ‖W¹ − W²‖ ≤ 2/(n − 1) and E_{W¹}(F) ≥ c − sup_x F(x)/(n − 1)

Donsker and Varadhan's classic result
For a very restricted class of chains, asymptotically in n,

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ c } ≈ −n inf_W H(W‖W¹ × P)

over all W s.t. W¹ = W² and E_{W¹}(F) ≥ c
Remarks

• Theorem 1 offers an elementary yet general explanation
  of Donsker and Varadhan's exponent and their upper bound
• The result and proof are outrageously general and simple

Proof.
Step I. Csiszar's Lemma. Let p be an arbitrary probability measure
on any probability space, and E any event with p(E) > 0. Let p|E denote
the corresponding conditional measure. Then:

    log p(E) = −H(p|E ‖ p)

With p = distribution of (X1, X2, ..., Xn)
and E = { (1/n) Σ_{i=1}^n F(X_i) ≥ c }:

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ c } = −H(p|E ‖ p)
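Csiszar's Lemma is an exact identity, so it can be checked numerically. A minimal sketch on a toy four-point space (the distribution p and the event E are arbitrary made-up values):

```python
import numpy as np

# Numerical check of Csiszar's Lemma: log p(E) = -H(p|E || p),
# where p|E is p conditioned on the event E.
p = np.array([0.1, 0.2, 0.3, 0.4])        # an arbitrary distribution
E = np.array([False, True, False, True])  # an event with p(E) > 0

pE = p[E].sum()                           # p(E)
p_cond = np.where(E, p, 0.0) / pE         # conditional measure p|E

# relative entropy H(p|E || p) = sum_x p|E(x) log(p|E(x)/p(x))
mask = p_cond > 0
H = np.sum(p_cond[mask] * np.log(p_cond[mask] / p[mask]))

print(np.log(pE), -H)                     # the two sides coincide
```

The identity is immediate once written out: on E, p|E(x)/p(x) = 1/p(E), so the relative entropy collapses to −log p(E).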
Proof cont'd

Step II.
Write p|E as a product of conditionals and p as a product
of bivariate conditionals
Expanding the log in H(p|E ‖ p) ("chain rule")
transforms this relative entropy between n-dimensional distributions
into a sum of relative entropies between bivariate ones:

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ c } = −Σ_{i=1}^{n−1} H(p_{i,i+1} ‖ p_i × P)

Step III.
Use convexity (Jensen) to simplify and combine into

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ c } ≤ −(n − 1) H(W‖W¹ × P)

Check that W has the required properties. □
The "Nicest" Chains

Doeblin chains
Defn A Markov chain {Xn} on a general alphabet is called
a Doeblin chain iff it converges to equilibrium exponentially fast,
uniformly in the initial condition X1 = x1, i.e., iff

    sup_{x∈S} Σ_{y∈S} |Pⁿ(x, y) − π(y)| → 0 exponentially fast

Equivalent characterization There exist a number of steps m,
a probability measure ρ, and α > 0, such that:

    Pr{X_m ∈ E | X1 = x1} ≥ α ρ(E)   for all x1, E

• Doeblin chains don't satisfy the Donsker-Varadhan conditions
• They don't even satisfy the usual large deviations principle!
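For a finite chain, a minorization with m = 1 can be extracted directly from the kernel: take ρ proportional to the column-wise minimum of P. A sketch on a made-up 3-state kernel (the matrix values are illustrative, not from the talk), also exhibiting the uniform exponential convergence:

```python
import numpy as np

# Doeblin minorization with m = 1: rho(y) ~ min_x P(x, y),
# so that P(x, .) >= alpha * rho(.) for every x.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
col_min = P.min(axis=0)
alpha = col_min.sum()                 # Doeblin constant alpha > 0
rho = col_min / alpha                 # minorizing probability measure

# stationary distribution: left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()

# sup_x sum_y |P^n(x, y) - pi(y)| decays exponentially fast
Pn = np.eye(3)
tvs = []
for n in range(1, 6):
    Pn = Pn @ P
    tvs.append(np.abs(Pn - pi).sum(axis=1).max())
    print(n, tvs[-1])
```

For this kernel alpha = 0.6, and the printed sup-distance shrinks geometrically with n, as the definition requires.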
A Bound for Doeblin Chains

Theorem 2
For any Doeblin chain {Xn}, any bounded function F : S → R, any ε > 0,
and any initial condition X1 = x1, we have

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_π(F) + ε }
        ≤ −(n − 1) (1/2) [ (α/(m F_max)) ε − 3/(n − 1) ]²

where F_max = sup_x |F(x)|

• In the case of i.i.d. {Xn}, Theorem 2 essentially
  reduces to Hoeffding's bound, which is tight in that case
• In the general case, this is the best bound known to date,
  improving [Glynn & Ormoneit 2002] by a factor of 2 in the exponent
Note

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_π(F) + ε }
        ≤ −(n − 1) (1/2) [ (α/(m F_max)) ε − 3/(n − 1) ]²

• Bound only depends on F via its maximum
• Explicit exponent, quadratic in ε
• Bound only depends on the chain via α, m
• Good convergence estimates ⇒ good bounds on α, m ⇒ better exponents
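Since the exponent depends only on (α, m, F_max, ε, n), the bound is computable by direct substitution. A sketch with illustrative parameter values (alpha, m, Fmax, eps below are made-up, not from the talk):

```python
# Evaluating the Theorem 2 exponent by plugging in the Doeblin
# parameters alpha, m, the sup-norm Fmax, and the accuracy eps.
def doeblin_log_bound(n, alpha, m, Fmax, eps):
    """log of the Theorem 2 bound on Pr{avg >= E_pi(F) + eps}."""
    t = (alpha / (m * Fmax)) * eps - 3.0 / (n - 1)
    return -(n - 1) * 0.5 * t * t

# hypothetical values: alpha = 0.2, m = 1, Fmax = 1, eps = 0.1
for n in (10**3, 10**4, 10**5):
    print(n, doeblin_log_bound(n, 0.2, 1, 1.0, 0.1))
```

Such a computation can be inverted: given a target error probability, solve for the smallest n, which is exactly the kind of sampling guarantee MCMC practice asks for.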
Proof outline

Step I. From Theorem 1 we get

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_π(F) + ε } ≤ −(n − 1) H(W‖W¹ × P)

for an appropriate W

Step II. Using Pinsker's and then Jensen's inequality we bound

    H(W‖W¹ × P) ≥ (1/2) [ Σ_{x,y} W¹(x) |P(x, y) − W(y|x)| ]²    (∗)

Step III. Lemma. For any row vector v with Σ_x v(x) = 0, we have

    ‖v(I − P)‖ ≥ (α/m) ‖v‖

Step IV. Get bounds on the dual of an LP related to (∗). □
Extend to Geometrically Ergodic Chains?

• In many applications, we are interested in unbounded functions F
• Most chains found in applications (like MCMC)
  are not Doeblin, but geometrically ergodic

Defn A Markov chain {Xn} is geometrically ergodic
iff it converges to equilibrium exponentially fast,
not necessarily uniformly in the initial condition

• The most general class for which exponential bounds might hold
• The same bounds cannot hold exactly as before
• But: There is a different exponential bound in this case
• The following example motivates its form ...
A Hard Example for the Gibbs Sampler: The Witch's Hat

[Figure: the "witch's hat" distribution on the unit square — a peak
carrying mass (1 − ε), located near the point (1/3, 1/3), sitting on a
brim ~ ε·Unif over the rest of the square]

Setting: Use the (randomized) Gibbs sampler
to compute the average of F(x, y) = e^{5x} + e^{5y}
w.r.t. the "witch's hat" distribution with ε = 1/251

Problem: Estimates are very sensitive
to the rare visits to the "brim"

[Plot: running averages of F over 10,000 steps, fluctuating between
roughly 4.7 and 5.5]

Idea: Consider the new function

    U(x) = F(x) − E[F(X2) | X1 = x]

and note that E_π(U) = 0    [Cf. Henderson (1997)]
A Sampling Criterion for this Gibbs Sampler

Idea: Together with the averages of F,
also compute the averages of U

[Plots: running averages of F with sampling times marked in purple, and
running averages of U with the time spent at the peak marked in red,
over 7,000 steps]

We know: E_π(U) = 0

Sampling Criterion:
Sample the F-averages
only when the U-averages
are between ±u, for some small u > 0
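The screening rule itself is generic and takes only a few lines. A minimal sketch: the (F(X_i), U(X_i)) sequences below are stand-in synthetic data (any chain producing such pairs would do; here F-values have mean 2 and U-values have mean 0 by construction):

```python
import numpy as np

# Screening rule: record the running F-average only at times when the
# running U-average lies within +/- u of zero.
def screened_estimates(F_vals, U_vals, u):
    n = np.arange(1, len(F_vals) + 1)
    F_avg = np.cumsum(F_vals) / n        # (1/n) sum_i F(X_i)
    U_avg = np.cumsum(U_vals) / n        # (1/n) sum_i U(X_i)
    ok = np.abs(U_avg) <= u              # the sampling times
    return F_avg[ok], ok

rng = np.random.default_rng(1)
F_vals = 2.0 + rng.standard_normal(5000)  # stand-in F(X_i) sequence
U_vals = rng.standard_normal(5000)        # stand-in U(X_i), E_pi(U) = 0
est, ok = screened_estimates(F_vals, U_vals, u=0.05)
print(len(est), est[-1] if len(est) else None)
```

Times shortly after an excursion are rejected automatically, because the excursion pushes the running U-average away from zero.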
More Simulation Results from the Witch's Hat

[Plots: two further runs of 7,000 steps, each showing the running
averages of F with sampling times marked in purple, and the running
averages of U with the time spent at the peak marked in red]
Generally: Geometrically Ergodic Chains

Defn A Markov chain {Xn} is geometrically ergodic
iff it converges to equilibrium exponentially fast,
not necessarily uniformly in the initial condition

Equivalent characterization There exist a function V : S → R,
a finite set S0 ⊂ S, and positive constants b, δ, such that:

    E[V(X2) | X1 = x] − V(x) ≤ −δ V(x) + b I_{S0}(x)   for all x

Bounds
Suppose the function of interest F : S → R is possibly unbounded
but with ‖F²‖_V := sup_x F(x)²/V(x) < ∞
Define a screening function U(x) = V(x) − E[V(X2) | X1 = x]
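On a finite state space the drift condition can be verified mechanically: compute PV, and read off the largest δ that works off S0 and the smallest b that absorbs the excess on S0. A sketch on a made-up 3-state kernel with an illustrative candidate V and small set S0 (none of these values are from the talk):

```python
import numpy as np

# Checking the drift condition numerically: find delta, b with
# (PV)(x) - V(x) <= -delta*V(x) + b*1_{S0}(x) for all x.
P = np.array([[0.6, 0.4, 0.0],
              [0.5, 0.3, 0.2],
              [0.0, 0.6, 0.4]])
V = np.array([1.0, 2.0, 4.0])          # candidate Lyapunov function
S0 = np.array([True, False, False])    # candidate "small" set

PV = P @ V                             # E[V(X2) | X1 = x]
drift = PV - V
# outside S0 the drift must be <= -delta*V; inside, b absorbs the excess
delta = (-drift[~S0] / V[~S0]).min()
b = max(0.0, (drift + delta * V)[S0].max())
assert np.all(drift <= -delta * V + b * S0 + 1e-12)

U = V - PV                             # the screening function U(x)
print(delta, b, U)
```

For these numbers delta = 0.05 and b = 0.45, and U comes out for free as V − PV, exactly the screening function the bound uses.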
An Exponential Bound for Geometrically Ergodic Chains

Theorem 3
For any geometrically ergodic chain {Xn},
any function F : S → R as above, any ε, u > 0,
and any initial condition X1 = x1:

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_π(F) + ε  &  |(1/n) Σ_{i=1}^n U(X_i)| ≤ u  &  X_n ∈ S0 }

        ≤ −(n − 1) (1/2) [ (δ/(8ξ‖F²‖_V)) ( (ε − F_{max,0}/(n − 1)) / (u + b + U_{max,0}/(n − 1)) )² − 2/(n − 1) ]²

where F_{max,0} = max_{x∈S0} |F(x)|, U_{max,0} = max_{x∈S0} |U(x)|,
and ξ is the "convergence parameter" of the chain
General Sampling Criterion for Geometrically Ergodic Chains

Note: Apart from the fact that the above bound is explicitly computable,
it naturally leads us to formulate the following sampling criterion

Given: A geometrically ergodic chain {Xn}
       Its parameters V, b, δ, S0
       A function F s.t. F² ≤ CV

Set: The screening function U(x) := V(x) − E[V(X2) | X1 = x]
     A "small" threshold u > 0

Sampling Criterion: Sample the results of the chain only at times n
when X_n ∈ S0 and |(1/n) Σ_{i=1}^n U(X_i)| ≤ u

Explanation: Control averages and excursions
Comments on the Sampling Criterion

• Geometric ergodicity is in general easy to verify
• Many choices for V(x), and V ≈ F often works
• To apply the sampling criterion, the screening function
  U(x) = V(x) − E[V(X2) | X1 = x]
  needs to be analytically computable
• Easily so for the Gibbs sampler,
  some versions of the Metropolis algorithm ...
Comments on Theorem 3

• Why is the exponent in Theorem 3 of O(ε²) and not O(ε⁴)?
• Proof outline similar to the one for the Doeblin case
• Theorem 3 applies even to cases where

      Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_π(F) + ε }

  decays sub-exponentially (e.g., the discrete M/M/1 queue)
  How is it that the addition of two non-rare events

      { |(1/n) Σ_{i=1}^n U(X_i)| ≤ u } ∩ { X_n ∈ S0 }

  makes the probability exponentially small?!
• Specialize to the i.i.d. case for an explanation ...
An "i.i.d. version" of Theorem 3

Setting: Estimate E_P(F), where F is "heavy tailed",
from i.i.d. samples X1, X2, ... ∼ P

Suppose we have a U with known E_P(U) = 0, s.t.
U "dominates" F: ess sup[F(X) − βU(X)] < ∞ for all β > 0
Assume E_P(F²), E_P(U²) both finite

Theorem 4
(i) The "standard" error probability is subexponential: ∀ε > 0:

    lim_{n→∞} −(1/n) log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_P(F) + ε } = 0

(ii) The "screening" error probability is exponential: ∀ε, u > 0:

    lim_{n→∞} −(1/n) log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_P(F) + ε  &  |(1/n) Σ_{i=1}^n U(X_i)| ≤ u } > 0
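The contrast between (i) and (ii) can be seen in a short Monte Carlo experiment. The concrete choices below are illustrative, not from the talk: X ~ Exp(rate 3), F(x) = x² (heavy-tailed in the relevant sense: E exp(tF(X)) = ∞ for every t > 0), and U(x) = eˣ − 3/2, which has E_P(U) = 0 and dominates F, with E_P(F²) and E_P(U²) both finite:

```python
import numpy as np

# Frequency of the "standard" error event versus the "screening" error
# event, estimated over many independent runs of length n.
rng = np.random.default_rng(2)
n, trials, eps, u = 50, 20_000, 0.1, 0.05
EF = 2.0 / 9.0                        # E_P(F) = E[X^2] for X ~ Exp(3)

X = rng.exponential(1/3, size=(trials, n))
F_avg = (X**2).mean(axis=1)           # (1/n) sum F(X_i), per run
U_avg = (np.exp(X) - 1.5).mean(axis=1)  # (1/n) sum U(X_i), per run

standard = (F_avg >= EF + eps).mean()
screened = ((F_avg >= EF + eps) & (np.abs(U_avg) <= u)).mean()
print(standard, screened)
```

The point of Theorem 4 is the mechanism: a large F-average is typically driven by one large sample, which simultaneously blows up the U-average, so the screening constraint excludes precisely those heavy-tailed excursions.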
Geometrical Explanation of Theorem 4

(i) Pr{standard error} ≈ exp{ −n inf_{Q∈Σ} H(Q‖P) },
where Σ = {Q : E_Q(F) ≥ E_P(F) + ε} and the infimum is = 0

[Figure: the distribution P, the set Σ, its subset E, and the
minimizer Q∗]

(ii) Pr{screening error} ≈ exp{ −n inf_{Q∈E} H(Q‖P) } = exp{ −n H(Q∗‖P) },
where E = {Q : E_Q(F) ≥ E_P(F) + ε, |E_Q(U)| < u}
and the infimum is > 0
Theorem 4 cont'd

(iii) The "screening" error probability satisfies:
Let K > 0 be arbitrary. Then ∀ε > 0 and 0 < u ≤ Kε:

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_P(F) + ε  &  |(1/n) Σ_{i=1}^n U(X_i)| ≤ u }

        ≤ −(n/2) [ M / (M² + (1 + 1/(2K))²) ]² ε²

where M = ess sup[F(X) − (1/(2K)) U(X)]
Theorem 4: A Heavy-Tailed Simulation Example

[Plots: four runs of 5,000 steps showing the running averages
(1/k) Σ_{i=1}^k X_i^{3/4} together with the sampling times]
Concluding Remarks
Information-Theoretic Methods
Convexity, elementary properties
Strikingly effective in a brutally technical area...
Markov Chain Bounds
Doeblin chains
Geometrically ergodic chains
Functional analysis and optimization
A new sampling criterion
Further applications in MCMC...
Simulating a Simple Queue in Discrete Time

Consider: The chain X_{n+1} = [X_n − S_{n+1}]⁺ + A_{n+1}, where:
{A_n} i.i.d. ∼ (1 + κ)α · Bern(1/(1 + κ)) and {S_n} i.i.d. ∼ 2μ · Bern(1/2),
the load ρ = E(A_n)/E(S_n) = α/μ is heavy, ρ ≈ 1, and F(x) = x

Then: {Xn} is geometrically ergodic with V(x) = e^{εx}
U(x) = V(x) − E[V(X2) | X1 = x] is an easily computable quadratic
No exponential bound can be proved on the error probability!
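The recursion is straightforward to simulate. A minimal sketch with illustrative parameter values (kappa, alpha, mu below are made-up choices giving load ρ = 0.95, not the values used in the talk's figures):

```python
import numpy as np

# Simulating X_{n+1} = max(X_n - S_{n+1}, 0) + A_{n+1} with
# A_n ~ (1+kappa)*alpha * Bern(1/(1+kappa)) and S_n ~ 2*mu * Bern(1/2),
# so that E(A_n) = alpha, E(S_n) = mu, and the load is rho = alpha/mu.
rng = np.random.default_rng(3)
kappa, alpha, mu, n = 2.0, 0.95, 1.0, 100_000

A = (1 + kappa) * alpha * (rng.random(n) < 1 / (1 + kappa))
S = 2 * mu * (rng.random(n) < 0.5)

X = np.empty(n)
x = 0.0
for i in range(n):
    x = max(x - S[i], 0.0) + A[i]     # reflected random walk
    X[i] = x
print(X.mean())                        # running estimate of E_pi(F), F(x) = x
```

At heavy load the running average converges very slowly, which is exactly why the screening criterion, rather than the raw average, is needed here.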
[Plots: running estimates Γ_n(F) over 10⁶ steps, compared with the true
value π(F), for κ = 1, 2, 4]