Computable Exponential Bounds for Markov Chains and MCMC Simulation

Ioannis Kontoyiannis, Athens University of Economics & Business
Joint work with S.P. Meyn and L.A. Lastras-Montano
Probability Seminar, Columbia University, December 2007
Outline

1. Nonasymptotic Bounds for Markov Chains
   Motivation: Markov Chain Monte Carlo

2. A General Information-Theoretic Bound
   Csiszar's Lemma and Jensen's inequality

3. Large Deviations Bounds: Analysis & Optimization
   • Doeblin chains
   • An (MCMC) example: the Gibbs sampler
   • Geometrically ergodic chains
   • Controlling averages and excursions
   • A general MCMC sampling criterion

4. The i.i.d. case: A geometrical explanation
Motivation

A Common Task
Calculate the expectation E_π(F) = Σ_{x∈S} π(x)F(x) of a given F : S → R

In many cases, the distribution π = (π(x) ; x ∈ S) is known explicitly,
but it is impossible to calculate its values in practice
Typical in Bayesian statistics, statistical mechanics,
networks, image processing, ...

Markov Chain Monte Carlo
It is often simple to construct an ergodic Markov chain {X1, X2, ...}
with stationary distribution π
In that case, we estimate E_π(F) by the partial sums (1/n) Σ_{i=1}^n F(X_i)

Problem
How long a simulation sample n do we need for an accurate estimate?
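The estimator above can be sketched in a few lines. This is a minimal illustration with a hypothetical two-state chain whose kernel, stationary distribution, and function F are all made-up values (not from the talk), chosen so that E_π(F) is known exactly:

```python
import numpy as np

# Hypothetical two-state example: estimate E_pi(F) by the ergodic
# averages (1/n) sum_i F(X_i) of a chain with stationary distribution pi.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # transition kernel P(x, y)
pi = np.array([2/3, 1/3])           # stationary: pi = pi P
F = np.array([1.0, 4.0])            # F evaluated at the two states

rng = np.random.default_rng(0)
n, x = 200_000, 0
total = 0.0
for _ in range(n):
    total += F[x]
    x = rng.choice(2, p=P[x])       # step the chain: X_{i+1} ~ P(x, .)
estimate = total / n                # the partial-sum estimator

print(estimate, pi @ F)             # ergodic average vs exact E_pi(F) = 2.0
```

The question the talk addresses is exactly how large n must be before `estimate` is within ε of `pi @ F` with high probability.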
The Setting: Deviation Bounds for Markov Chains

We have
Ergodic Markov chain {X1, X2, ...}, discrete state-space S [for simplicity]
Transition kernel P(x, y) = Pr{X_{n+1} = y | X_n = x}, initial condition x1 ∈ S
Stationary distribution π = (π(x) ; x ∈ S)

Goal
Find explicit, computable, nonasymptotic bounds on

    Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_π(F) + ε }

• In MCMC, this leads to precise performance guarantees
  and sampling criteria (or stopping rules)
• Similar questions appear in numerous other applications
A General Information-Theoretic Bound

Let H(P‖Q) = Σ_{x∈S} P(x) log[P(x)/Q(x)] = relative entropy
    ‖P − Q‖ = Σ_{x∈S} |P(x) − Q(x)| = 2 × [total variation distance]

Theorem 1
For any Markov chain {Xn}, any function F : S → R bounded above,
any c > 0 and any initial condition X1 = x1, we have

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ c } ≤ −(n − 1) H(W‖W¹ × P)

for some bivariate distribution W = (W(x, y)) on S × S
with marginals W¹ and W² that satisfy

    ‖W¹ − W²‖ ≤ 2/(n − 1)   and   E_{W¹}(F) ≥ c − sup_x F(x)/(n − 1)

where W¹ × P denotes the bivariate distribution (W¹ × P)(x, y) = W¹(x)P(x, y)
Interpretation

Our result
To use the above bound, we need to look at

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ c } ≤ −(n − 1) inf_W H(W‖W¹ × P)

over all W s.t. ‖W¹ − W²‖ ≤ 2/(n − 1) and E_{W¹}(F) ≥ c − sup_x F(x)/(n − 1)

Donsker and Varadhan's classic result
For a very restricted class of chains, asymptotically in n,

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ c } ≈ −n inf_W H(W‖W¹ × P)

over all W s.t. W¹ = W² and E_{W¹}(F) ≥ c
Remarks

• Theorem 1 offers an elementary yet general explanation
  of Donsker and Varadhan's exponent and their upper bound
• The result and proof are outrageously general and simple

Proof.
Step I. Csiszar's Lemma. Let p be an arbitrary probability measure
on any probability space, and E any event with p(E) > 0. Let p|E denote
the corresponding conditional measure. Then:

    log p(E) = −H(p|E ‖ p)

With p = distribution of (X1, X2, ..., Xn)
and E = { (1/n) Σ_{i=1}^n F(X_i) ≥ c }:

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ c } = −H(p|E ‖ p)
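Csiszar's Lemma is an exact identity, so it can be checked numerically. A minimal sketch on a toy four-point space (the distribution p and the event E are arbitrary made-up values):

```python
import numpy as np

# Numerical check of Csiszar's Lemma: log p(E) = -H(p|E || p),
# where p|E is p conditioned on the event E.
p = np.array([0.1, 0.2, 0.3, 0.4])        # an arbitrary distribution
E = np.array([False, True, False, True])  # an event with p(E) > 0

pE = p[E].sum()                           # p(E)
p_cond = np.where(E, p, 0.0) / pE         # conditional measure p|E

# relative entropy H(p|E || p) = sum_x p|E(x) log(p|E(x)/p(x))
mask = p_cond > 0
H = np.sum(p_cond[mask] * np.log(p_cond[mask] / p[mask]))

print(np.log(pE), -H)                     # the two sides coincide
```

The identity is immediate once written out: on E, p|E(x)/p(x) = 1/p(E), so the relative entropy collapses to −log p(E).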
Proof cont'd

Step II.
Write p|E as a product of conditionals and p as a product
of bivariate conditionals
Expanding the log in H(p|E ‖ p) ("chain rule")
transforms this relative entropy between n-dimensional distributions
into a sum of relative entropies between bivariate ones:

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ c } = −Σ_{i=1}^{n−1} H(p_{i,i+1} ‖ p_i × P)

Step III.
Use convexity (Jensen) to simplify and combine into

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ c } ≤ −(n − 1) H(W‖W¹ × P)

Check that W has the required properties. □
The "Nicest" Chains

Doeblin chains
Defn A Markov chain {Xn} on a general alphabet is called
a Doeblin chain iff it converges to equilibrium exponentially fast,
uniformly in the initial condition X1 = x1, i.e., iff

    sup_{x∈S} Σ_{y∈S} |Pⁿ(x, y) − π(y)| → 0 exponentially fast

Equivalent characterization There exist a number of steps m,
a probability measure ρ, and α > 0, such that:

    Pr{X_m ∈ E | X1 = x1} ≥ α ρ(E)   for all x1, E

• Doeblin chains don't satisfy the Donsker-Varadhan conditions
• They don't even satisfy the usual large deviations principle!
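For a finite chain, a minorization with m = 1 can be extracted directly from the kernel: take ρ proportional to the column-wise minimum of P. A sketch on a made-up 3-state kernel (the matrix values are illustrative, not from the talk), also exhibiting the uniform exponential convergence:

```python
import numpy as np

# Doeblin minorization with m = 1: rho(y) ~ min_x P(x, y),
# so that P(x, .) >= alpha * rho(.) for every x.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
col_min = P.min(axis=0)
alpha = col_min.sum()                 # Doeblin constant alpha > 0
rho = col_min / alpha                 # minorizing probability measure

# stationary distribution: left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()

# sup_x sum_y |P^n(x, y) - pi(y)| decays exponentially fast
Pn = np.eye(3)
tvs = []
for n in range(1, 6):
    Pn = Pn @ P
    tvs.append(np.abs(Pn - pi).sum(axis=1).max())
    print(n, tvs[-1])
```

For this kernel alpha = 0.6, and the printed sup-distance shrinks geometrically with n, as the definition requires.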
A Bound for Doeblin Chains

Theorem 2
For any Doeblin chain {Xn}, any bounded function F : S → R, any ε > 0,
and any initial condition X1 = x1, we have

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_π(F) + ε }
        ≤ −(n − 1) (1/2) [ (α/(m F_max)) ε − 3/(n − 1) ]²

where F_max = sup_x |F(x)|

• In the case of i.i.d. {Xn}, Theorem 2 essentially
  reduces to Hoeffding's bound, which is tight in that case
• In the general case, this is the best bound known to date,
  improving [Glynn & Ormoneit 2002] by a factor of 2 in the exponent
Note

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_π(F) + ε }
        ≤ −(n − 1) (1/2) [ (α/(m F_max)) ε − 3/(n − 1) ]²

• Bound only depends on F via its maximum
• Explicit exponent, quadratic in ε
• Bound only depends on the chain via α, m
• Good convergence estimates ⇒ good bounds on α, m ⇒ better exponents
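Since the exponent depends only on (α, m, F_max, ε, n), the bound is computable by direct substitution. A sketch with illustrative parameter values (alpha, m, Fmax, eps below are made-up, not from the talk):

```python
# Evaluating the Theorem 2 exponent by plugging in the Doeblin
# parameters alpha, m, the sup-norm Fmax, and the accuracy eps.
def doeblin_log_bound(n, alpha, m, Fmax, eps):
    """log of the Theorem 2 bound on Pr{avg >= E_pi(F) + eps}."""
    t = (alpha / (m * Fmax)) * eps - 3.0 / (n - 1)
    return -(n - 1) * 0.5 * t * t

# hypothetical values: alpha = 0.2, m = 1, Fmax = 1, eps = 0.1
for n in (10**3, 10**4, 10**5):
    print(n, doeblin_log_bound(n, 0.2, 1, 1.0, 0.1))
```

Such a computation can be inverted: given a target error probability, solve for the smallest n, which is exactly the kind of sampling guarantee MCMC practice asks for.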
Proof outline

Step I. From Theorem 1 we get

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_π(F) + ε } ≤ −(n − 1) H(W‖W¹ × P)

for an appropriate W

Step II. Using Pinsker's and then Jensen's inequality we bound

    H(W‖W¹ × P) ≥ (1/2) [ Σ_{x,y} W¹(x) |P(x, y) − W(y|x)| ]²    (∗)

Step III. Lemma. For any row vector v with Σ_x v(x) = 0, we have

    ‖v(I − P)‖ ≥ (α/m) ‖v‖

Step IV. Get bounds on the dual of an LP related to (∗). □
Extend to Geometrically Ergodic Chains?

• In many applications, we are interested in unbounded functions F
• Most chains found in applications (like MCMC)
  are not Doeblin, but geometrically ergodic

Defn A Markov chain {Xn} is geometrically ergodic
iff it converges to equilibrium exponentially fast,
not necessarily uniformly in the initial condition

• The most general class for which exponential bounds might hold
• The same bounds cannot hold exactly as before
• But: There is a different exponential bound in this case
• The following example motivates its form ...
A Hard Example for the Gibbs Sampler: The Witch's Hat

[Figure: the "witch's hat" distribution on the unit square — a peak
carrying mass (1 − ε), located near the point (1/3, 1/3), sitting on a
brim ~ ε·Unif over the rest of the square]

Setting: Use the (randomized) Gibbs sampler
to compute the average of F(x, y) = e^{5x} + e^{5y}
w.r.t. the "witch's hat" distribution with ε = 1/251

Problem: Estimates are very sensitive
to the rare visits to the "brim"

[Plot: running averages of F over 10,000 steps, fluctuating between
roughly 4.7 and 5.5]

Idea: Consider the new function

    U(x) = F(x) − E[F(X2) | X1 = x]

and note that E_π(U) = 0    [Cf. Henderson (1997)]
A Sampling Criterion for this Gibbs Sampler

Idea: Together with the averages of F,
also compute the averages of U

[Plots: running averages of F with sampling times marked in purple, and
running averages of U with the time spent at the peak marked in red,
over 7,000 steps]

We know: E_π(U) = 0

Sampling Criterion:
Sample the F-averages
only when the U-averages
are between ±u, for some small u > 0
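The screening rule itself is generic and takes only a few lines. A minimal sketch: the (F(X_i), U(X_i)) sequences below are stand-in synthetic data (any chain producing such pairs would do; here F-values have mean 2 and U-values have mean 0 by construction):

```python
import numpy as np

# Screening rule: record the running F-average only at times when the
# running U-average lies within +/- u of zero.
def screened_estimates(F_vals, U_vals, u):
    n = np.arange(1, len(F_vals) + 1)
    F_avg = np.cumsum(F_vals) / n        # (1/n) sum_i F(X_i)
    U_avg = np.cumsum(U_vals) / n        # (1/n) sum_i U(X_i)
    ok = np.abs(U_avg) <= u              # the sampling times
    return F_avg[ok], ok

rng = np.random.default_rng(1)
F_vals = 2.0 + rng.standard_normal(5000)  # stand-in F(X_i) sequence
U_vals = rng.standard_normal(5000)        # stand-in U(X_i), E_pi(U) = 0
est, ok = screened_estimates(F_vals, U_vals, u=0.05)
print(len(est), est[-1] if len(est) else None)
```

Times shortly after an excursion are rejected automatically, because the excursion pushes the running U-average away from zero.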
More Simulation Results from the Witch's Hat

[Plots: two further runs of 7,000 steps, each showing the running
averages of F with sampling times marked in purple, and the running
averages of U with the time spent at the peak marked in red]
Generally: Geometrically Ergodic Chains

Defn A Markov chain {Xn} is geometrically ergodic
iff it converges to equilibrium exponentially fast,
not necessarily uniformly in the initial condition

Equivalent characterization There exist a function V : S → R,
a finite set S0 ⊂ S, and positive constants b, δ, such that:

    E[V(X2) | X1 = x] − V(x) ≤ −δ V(x) + b I_{S0}(x)   for all x

Bounds
Suppose the function of interest F : S → R is possibly unbounded
but with ‖F²‖_V := sup_x F(x)²/V(x) < ∞
Define a screening function U(x) = V(x) − E[V(X2) | X1 = x]
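On a finite state space the drift condition can be verified mechanically: compute PV, and read off the largest δ that works off S0 and the smallest b that absorbs the excess on S0. A sketch on a made-up 3-state kernel with an illustrative candidate V and small set S0 (none of these values are from the talk):

```python
import numpy as np

# Checking the drift condition numerically: find delta, b with
# (PV)(x) - V(x) <= -delta*V(x) + b*1_{S0}(x) for all x.
P = np.array([[0.6, 0.4, 0.0],
              [0.5, 0.3, 0.2],
              [0.0, 0.6, 0.4]])
V = np.array([1.0, 2.0, 4.0])          # candidate Lyapunov function
S0 = np.array([True, False, False])    # candidate "small" set

PV = P @ V                             # E[V(X2) | X1 = x]
drift = PV - V
# outside S0 the drift must be <= -delta*V; inside, b absorbs the excess
delta = (-drift[~S0] / V[~S0]).min()
b = max(0.0, (drift + delta * V)[S0].max())
assert np.all(drift <= -delta * V + b * S0 + 1e-12)

U = V - PV                             # the screening function U(x)
print(delta, b, U)
```

For these numbers delta = 0.05 and b = 0.45, and U comes out for free as V − PV, exactly the screening function the bound uses.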
An Exponential Bound for Geometrically Ergodic Chains

Theorem 3
For any geometrically ergodic chain {Xn},
any function F : S → R as above, any ε, u > 0,
and any initial condition X1 = x1:

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_π(F) + ε  &  |(1/n) Σ_{i=1}^n U(X_i)| ≤ u  &  X_n ∈ S0 }

        ≤ −(n − 1) (1/2) [ (δ/(8ξ‖F²‖_V)) ( (ε − F_{max,0}/(n − 1)) / (u + b + U_{max,0}/(n − 1)) )² − 2/(n − 1) ]²

where F_{max,0} = max_{x∈S0} |F(x)|, U_{max,0} = max_{x∈S0} |U(x)|,
and ξ is the "convergence parameter" of the chain
General Sampling Criterion for Geometrically Ergodic Chains

Note: Apart from the fact that the above bound is explicitly computable,
it naturally leads us to formulate the following sampling criterion

Given: A geometrically ergodic chain {Xn}
       Its parameters V, b, δ, S0
       A function F s.t. F² ≤ CV

Set: The screening function U(x) := V(x) − E[V(X2) | X1 = x]
     A "small" threshold u > 0

Sampling Criterion: Sample the results of the chain only at times n
when X_n ∈ S0 and |(1/n) Σ_{i=1}^n U(X_i)| ≤ u

Explanation: Control averages and excursions
Comments on the Sampling Criterion

• Geometric ergodicity is in general easy to verify
• Many choices for V(x), and V ≈ F often works
• To apply the sampling criterion, the screening function
  U(x) = V(x) − E[V(X2) | X1 = x]
  needs to be analytically computable
• Easily so for the Gibbs sampler,
  some versions of the Metropolis algorithm ...
Comments on Theorem 3

• Why is the exponent in Theorem 3 of O(ε²) and not O(ε⁴)?
• Proof outline similar to the one for the Doeblin case
• Theorem 3 applies even to cases where

      Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_π(F) + ε }

  decays sub-exponentially (e.g., the discrete M/M/1 queue)
  How is it that the addition of two non-rare events

      { |(1/n) Σ_{i=1}^n U(X_i)| ≤ u } ∩ { X_n ∈ S0 }

  makes the probability exponentially small?!
• Specialize to the i.i.d. case for an explanation ...
An "i.i.d. version" of Theorem 3

Setting: Estimate E_P(F), where F is "heavy tailed",
from i.i.d. samples X1, X2, ... ∼ P

Suppose we have a U with known E_P(U) = 0, s.t.
U "dominates" F: ess sup[F(X) − βU(X)] < ∞ for all β > 0
Assume E_P(F²), E_P(U²) both finite

Theorem 4
(i) The "standard" error probability is subexponential: ∀ε > 0:

    lim_{n→∞} −(1/n) log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_P(F) + ε } = 0

(ii) The "screening" error probability is exponential: ∀ε, u > 0:

    lim_{n→∞} −(1/n) log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_P(F) + ε  &  |(1/n) Σ_{i=1}^n U(X_i)| ≤ u } > 0
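The contrast between (i) and (ii) can be seen in a short Monte Carlo experiment. The concrete choices below are illustrative, not from the talk: X ~ Exp(rate 3), F(x) = x² (heavy-tailed in the relevant sense: E exp(tF(X)) = ∞ for every t > 0), and U(x) = eˣ − 3/2, which has E_P(U) = 0 and dominates F, with E_P(F²) and E_P(U²) both finite:

```python
import numpy as np

# Frequency of the "standard" error event versus the "screening" error
# event, estimated over many independent runs of length n.
rng = np.random.default_rng(2)
n, trials, eps, u = 50, 20_000, 0.1, 0.05
EF = 2.0 / 9.0                        # E_P(F) = E[X^2] for X ~ Exp(3)

X = rng.exponential(1/3, size=(trials, n))
F_avg = (X**2).mean(axis=1)           # (1/n) sum F(X_i), per run
U_avg = (np.exp(X) - 1.5).mean(axis=1)  # (1/n) sum U(X_i), per run

standard = (F_avg >= EF + eps).mean()
screened = ((F_avg >= EF + eps) & (np.abs(U_avg) <= u)).mean()
print(standard, screened)
```

The point of Theorem 4 is the mechanism: a large F-average is typically driven by one large sample, which simultaneously blows up the U-average, so the screening constraint excludes precisely those heavy-tailed excursions.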
Geometrical Explanation of Theorem 4

(i) Pr{standard error} ≈ exp{ −n inf_{Q∈Σ} H(Q‖P) },
where Σ = {Q : E_Q(F) ≥ E_P(F) + ε} and the infimum is = 0

[Figure: the distribution P, the set Σ, its subset E, and the
minimizer Q∗]

(ii) Pr{screening error} ≈ exp{ −n inf_{Q∈E} H(Q‖P) } = exp{ −n H(Q∗‖P) },
where E = {Q : E_Q(F) ≥ E_P(F) + ε, |E_Q(U)| < u}
and the infimum is > 0
Theorem 4 cont'd

(iii) The "screening" error probability satisfies:
Let K > 0 be arbitrary. Then ∀ε > 0 and 0 < u ≤ Kε:

    log Pr{ (1/n) Σ_{i=1}^n F(X_i) ≥ E_P(F) + ε  &  |(1/n) Σ_{i=1}^n U(X_i)| ≤ u }

        ≤ −(n/2) [ M / (M² + (1 + 1/(2K))²) ]² ε²

where M = ess sup[F(X) − (1/(2K)) U(X)]
Theorem 4: A Heavy-Tailed Simulation Example

[Plots: four runs of 5,000 steps showing the running averages
(1/k) Σ_{i=1}^k X_i^{3/4} together with the sampling times]
Concluding Remarks
Information-Theoretic Methods
Convexity, elementary properties
Strikingly effective in a brutally technical area...
Markov Chain Bounds
Doeblin chains
Geometrically ergodic chains
Functional analysis and optimization
A new sampling criterion
Further applications in MCMC...
Simulating a Simple Queue in Discrete Time

Consider: The chain X_{n+1} = [X_n − S_{n+1}]⁺ + A_{n+1}, where:
{A_n} i.i.d. ∼ (1 + κ)α · Bern(1/(1 + κ)) and {S_n} i.i.d. ∼ 2μ · Bern(1/2),
the load ρ = E(A_n)/E(S_n) = α/μ is heavy, ρ ≈ 1, and F(x) = x

Then: {Xn} is geometrically ergodic with V(x) = e^{εx}
U(x) = V(x) − E[V(X2) | X1 = x] is an easily computable quadratic
No exponential bound can be proved on the error probability!
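The recursion is straightforward to simulate. A minimal sketch with illustrative parameter values (kappa, alpha, mu below are made-up choices giving load ρ = 0.95, not the values used in the talk's figures):

```python
import numpy as np

# Simulating X_{n+1} = max(X_n - S_{n+1}, 0) + A_{n+1} with
# A_n ~ (1+kappa)*alpha * Bern(1/(1+kappa)) and S_n ~ 2*mu * Bern(1/2),
# so that E(A_n) = alpha, E(S_n) = mu, and the load is rho = alpha/mu.
rng = np.random.default_rng(3)
kappa, alpha, mu, n = 2.0, 0.95, 1.0, 100_000

A = (1 + kappa) * alpha * (rng.random(n) < 1 / (1 + kappa))
S = 2 * mu * (rng.random(n) < 0.5)

X = np.empty(n)
x = 0.0
for i in range(n):
    x = max(x - S[i], 0.0) + A[i]     # reflected random walk
    X[i] = x
print(X.mean())                        # running estimate of E_pi(F), F(x) = x
```

At heavy load the running average converges very slowly, which is exactly why the screening criterion, rather than the raw average, is needed here.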
[Plots: running estimates Γ_n(F) over 10⁶ steps, compared with the true
value π(F), for κ = 1, 2, 4]