
Fast Mixing in a Markov Chain

László Lovász and Peter Winkler

December 2007

Contents

1 Introduction and preliminaries
  1.1 Notation
  1.2 Examples

2 Access times between distributions
  2.1 Hitting times
  2.2 General Stopping rules
  2.3 Access times
  2.4 Mixing time
  2.5 Exit Frequencies
  2.6 *Exit discrepancy and recurrent potential
  2.7 Four stopping rules
    2.7.1 Local rule
    2.7.2 Filling rule
    2.7.3 Threshold rule
    2.7.4 Chain rule
  2.8 Halting states and optimal rules
  2.9 *Strong stopping rules
  2.10 *Matrix formulas
  2.11 *Comparison of stopping rules
  2.12 *Continuous time and space: an example

3 *Time reversal
  3.1 Reverse chains
  3.2 Forget time and reset time
  3.3 Optimal and pessimal starting states
  3.4 Exit frequency matrices

4 Blind rules
  4.1 *Exact blind rules
  4.2 Averaging rules
  4.3 *Approximate blind mixing times
  4.4 *Pointwise approximation

5 *Groups of mixing times
  5.1 The relaxation group
  5.2 The forget group
  5.3 The reset group
  5.4 The mixing group
  5.5 The maxing group
  5.6 A reverse inequality

6 Estimating the mixing time
  6.1 *A linear programming bound
  6.2 Conductance
  6.3 *Coupling

Abstract

The critical issue in the complexity of Markov chain sampling techniques has been "mixing time", the number of steps of the chain needed to reach its stationary distribution. It turns out that there are many ways to define mixing time (more than a dozen are considered here), but they fall into a small number of classes. The parameters in each class lie within constant multiples of one another, independent of the chain.

1 Introduction and preliminaries

In the past 20 years there have been numerous applications (see, e.g., [3], [18], [16]) for sampling via Markov chains. Typically a random walk on a graph is run for a fixed number of steps after which the distribution of the current state is nearly stationary.

There is no particular reason why such a walk must be run for a fixed number of steps; in fact, more general stopping rules, some of which "look where they are going," are capable of achieving the stationary distribution exactly.

It turns out to be useful to consider stopping rules that achieve any given distribution, when starting from some other given distribution. From a theoretical standpoint, our stopping rules result in a generalization of the notion of "hitting time" to state-distributions. This measure of distance between distributions behaves quite nicely.

By considering the least expected number of steps required by rules of a particular kind to reach the stationary distribution (or some approximation thereof) we obtain several measures of mixing time. To these we add some additional parameters, related to eigenvalues, set-hitting time, and the reverse of the given chain. Altogether we obtain a substantial collection of numbers each of which has mixing time implications.

Our objective is to place these numbers into a few equivalence classes, within which numbers differ only by constant factors independent of the chain. To do so we develop a calculus of "exit frequencies" and some linear algebraic tools which we hope shed some light on the mechanism of mixing.

In what follows we assume that the Markov chain has finitely many states (although see [6] for more general results) but is not generally reversible. For some of our stopping rules the transition matrix of the Markov chain must be known, making their use as sampling mechanisms unlikely; but such rules are useful in analysis and, as we shall see, are often replaceable by simpler ones without loss in efficiency. In any case we put no restriction on the amount of computation needed to implement a stopping rule.

1.1 Notation

Throughout this paper we will assume a fixed irreducible Markov chain with transition matrix M = (p_{ij}), with a finite state space V of cardinality n (see [6] for extensions to infinite state space). If we say "distribution" without specifying an underlying set, we mean a distribution on V. If σ is a distribution on V and A ⊆ V, then σ(A) = ∑_{i∈A} σ_i is the probability of A. We denote by π the stationary distribution of the chain. The ergodic flow consists of the values

$$Q_{ij} = \pi_i p_{ij}.$$

1.2 Examples

In this section we discuss examples which show that "intelligent" stopping rules can sometimes achieve specified distributions in an elegant or surprising manner.


Example 1.1 (Cycle) The following is an interesting fact from folklore. Let G be a cycle of length n and start a random walk on G from a node u. Then the probability that v is the last node visited (i.e., the random walk visits every other node before hitting v) is the same for each v ≠ u.

While this is not an efficient way to generate a uniform random point of the cycle, it indicates that there are entirely different ways to use random walks for sampling than walking a given number of steps. This particular method does not generalize; in fact, apart from the complete graph, the cycle is the only graph which enjoys this property (see [21]).
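The folklore fact is easy to check empirically. The following is a minimal simulation sketch (ours, not part of the original text; the helper name last_vertex is made up):

```python
import random
from collections import Counter

def last_vertex(n, start=0):
    """Random walk on an n-cycle from `start`; return the last new node visited."""
    pos, last = start, start
    unvisited = set(range(n)) - {start}
    while unvisited:
        pos = (pos + random.choice((-1, 1))) % n   # step to a uniform random neighbour
        if pos in unvisited:
            unvisited.remove(pos)
            last = pos
    return last

# Each v != start should occur with frequency about 1/(n-1):
print(Counter(last_vertex(7) for _ in range(70000)))
```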

Example 1.2 (Cube) Consider another quite simple graph, the cube, which we view as the graph of vertices and edges of [0, 1]^n. Let us do a random walk on it as follows: at each vertex, we select a direction (that is, one of the n coordinate indices) at random, then flip a coin. If we get "heads" we walk along the incident edge corresponding to that direction; if "tails" we stay where we are. We stop when we have selected every direction at least once (whether or not we walked along the edge).

It is trivial that after each of the n directions has been selected, the corresponding coordinate will be 0 or 1 with equal probability, independently of the rest of the coordinates. So the vertex we stop at will be uniformly distributed over all vertices.

This method takes about n ln n coin flips on the average, thus about n ln n/2 actual steps, so it is a quite efficient way to generate a random vertex of the cube (assuming we insist on using random walks; otherwise choosing the coordinates independently is simpler and faster). We will see that it is in fact optimal.
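A sketch of this stopping rule in code (our own illustration; the names are invented):

```python
import random

def cube_rule(n):
    """'Draw and flip' on the n-cube: draw a direction, flip a coin; on heads
    move along that edge, on tails stay put.  Stop once every direction has
    been drawn at least once; the final vertex is exactly uniform."""
    v = [0] * n                     # start at an arbitrary vertex, say all zeros
    drawn = set()
    steps = 0
    while len(drawn) < n:
        d = random.randrange(n)     # select a direction (coordinate index)
        drawn.add(d)
        if random.random() < 0.5:   # "heads": walk along the incident edge
            v[d] ^= 1
            steps += 1              # only actual moves count as walk steps
    return tuple(v), steps
```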

Example 1.3 (Card Shuffling) A classic application of Markov chain mixing is shuffling a deck of playing cards; see e.g. [9] where Bayer and Diaconis argue that seven riffle shuffles are necessary and sufficient to mix a deck of 52 cards. In Aldous and Diaconis [5], the following simple shuffling algorithm is analyzed: a card is removed from the top of an n-card deck and replaced with equal probability in any of the n slots among the remaining n − 1 cards.

It is not difficult to see that if we note when the card originally at the bottom of the deck has reached the top and perform just one more shuffle, then the deck will be precisely uniformly random! Again, this method takes about n ln n steps on the average, and, as in the cube case, constitutes a "strong" stopping rule in the sense of [5] (the ending distribution is uniform even when conditioned on the stopping rule having taken some fixed number of steps). However, we will see later that this shuffling rule is not optimal.
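In code, the rule reads roughly as follows (a sketch; the deck representation is ours):

```python
import random

def top_to_random_sample(n):
    """Shuffle by moving the top card to a uniformly random slot; once the
    card that started at the bottom reaches the top, perform one more
    shuffle and stop.  The resulting deck is exactly uniformly random."""
    deck = list(range(n))
    bottom = deck[-1]                     # the card initially at the bottom
    shuffles = 0
    while True:
        last = (deck[0] == bottom)        # bottom card has reached the top
        card = deck.pop(0)
        deck.insert(random.randrange(n), card)   # n slots among n-1 cards
        shuffles += 1
        if last:
            return deck, shuffles         # about n*ln(n) shuffles on average
```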

Without going into the details, let us remark that the algorithm of Aldous [4] and Broder [10] can be regarded as a stopping rule for a Markov chain on trees that generates the uniform distribution.

Finally, we note that to stop after a fixed number of steps in the "continuous" time model of Markov chains (where the time needed by a step is an exponentially distributed random variable) can be viewed as choosing a random number T from a Poisson distribution and stopping after T steps. So here a (randomized) stopping rule is considered which, in many respects, has better properties than the "stop after t steps" rule. A similar remark applies to stopping a "lazy" walk as defined, e.g., in [20].

2 Access times between distributions

2.1 Hitting times

Let H(i, j) denote the expected hitting time from i to j (the mean number of steps to reach j from i). Hitting times satisfy the following identity:

$$H(i,j) = \begin{cases} 1 + \sum_k p_{ik} H(k,j), & \text{if } i \neq j,\\ 0, & \text{if } i = j. \end{cases} \tag{1}$$
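Identity (1) determines the hitting times as the solution of a linear system, one target state at a time. A minimal numerical sketch (NumPy; the function name hitting_times is ours):

```python
import numpy as np

def hitting_times(M):
    """Solve identity (1): H(j,j) = 0 and H(i,j) = 1 + sum_k p_ik H(k,j) for i != j."""
    n = M.shape[0]
    H = np.zeros((n, n))
    for j in range(n):
        idx = [i for i in range(n) if i != j]
        A = np.eye(n - 1) - M[np.ix_(idx, idx)]   # (I - M) restricted to V \ {j}
        H[idx, j] = np.linalg.solve(A, np.ones(n - 1))
    return H
```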

A more surprising identity is the "Random Target Lemma" (also known as the right averaging principle, e.g. in [1]), which says that there is a constant N for which

$$\sum_j \pi_j H(i,j) = \mathcal{N} \tag{2}$$

for every i.

Hitting times in a reversible chain satisfy the following "cycle-reversing" identity of Coppersmith et al. [13]:

$$H(i,j) + H(j,k) + H(k,i) = H(i,k) + H(k,j) + H(j,i). \tag{3}$$

Our goal is to generalize this by defining access times between distributions.

2.2 General Stopping rules

Throughout this paper we will assume a fixed irreducible Markov chain with transition matrix M = (p_{ij}) and initial distribution σ, over a state space V of finite cardinality n. Most of our results generalize in a straightforward manner in the presence of transient states and/or a countably infinite state space, but with messier conditions and arguments which we prefer to avoid. We assume discrete-time transitions but, again, most of our results can be modified to apply to the continuous-time case.

We denote by V* the space of finite "walks" on V, that is, the set of finite strings w = (w_0, w_1, . . . , w_t), w_i ∈ V. Relative to the Markov chain we then have

$$\mathbf{P}(w) = \sigma_{w_0} \prod_{i=0}^{t-1} p_{w_i, w_{i+1}}.$$

The distribution of w_t will be denoted by σ^t, so that σ^0 = σ and σ^t_i = ∑_{w_t = i} P((w_0, . . . , w_t)).

It will be useful to us to regard a stopping rule Γ both as a random stopping time and as a (partial) function from V* to [0, 1] giving the probability of continuing from a walk w. Formally:

Definition. A stopping rule Γ is a partial map from V* to [0, 1], such that Γ(w) is defined for w = (w_0, . . . , w_t) just when P(w) > 0 and Γ(w_0, . . . , w_i) is defined and non-zero for each i, 0 ≤ i < t.

We interpret Γ(w) as the probability of continuing given that w is the walk so far observed, each such stop-or-go decision being made independently. We can also regard Γ as a random variable with values in {0, 1, . . .}, so that we stop at w_Γ. The probability of stopping at state j, given starting distribution σ, is

$$\sigma^\Gamma_j := \sum_{w = (w_0, \ldots, w_t = j)} \sigma_{w_0} \left(\prod_{i=0}^{t-1} \Gamma(w_0, \ldots, w_i)\, p_{w_i, w_{i+1}}\right) (1 - \Gamma(w)).$$

The mean length EΓ of the stopping rule Γ is its expected duration; if EΓ < ∞ then with probability 1 the walk eventually stops, and thus σ^Γ is a probability distribution. A stopping rule Γ for which σ^Γ = τ is also called a stopping rule from σ to τ.

Note that if X is a random variable whose values are stopping rules, then X is equivalent to a stopping rule Γ whose probability of continuing at w is a sum or integral of X(w) conditioned on X not having stopped the chain so far. Thus our stopping rules are not less general for insisting on independent randomization at each step. For example, the stopping rule given above in Example 1.2 (Cube) does not at first appear to fit our model, but we may simply walk on the (loopless) cube and stop according to the probability that the "draw and flip" rule would stop, given that it reached the current time and state.

Note that for any Markov chain and any τ, there is at least one finite stopping rule Γ such that σ^Γ = τ; namely, we select a target state j in accordance with τ and walk until we reach j. We call this the "naive" stopping rule Ω_{σ,τ}. The mean length of the naive rule is given by

$$\mathcal{N}(\sigma, \tau) = \mathsf{E}\,\Omega_{\sigma,\tau} = \sum_{i,j} \sigma_i \tau_j H(i, j).$$

When the target is the stationary distribution π for the chain, EΩ_{σ,π} is independent of the starting distribution σ, on account of the Random Target Lemma (2), so that N = N(σ, π).

The maximum length max(Γ) of a stopping rule Γ is the maximum length of a walk that has positive probability. Γ is said to be bounded (for σ) if this quantity is finite. It is clear that for a fixed M and Γ, max(Γ) only depends on the support of σ. Bounded stopping rules are available when the target distribution τ has sufficiently large support, thus in particular when τ = π.

We often think of a stopping rule Γ as a means of moving from a starting distribution σ to a given target distribution τ = σ^Γ. Such a Γ is said to be mean-optimal or simply optimal (for σ and τ) if EΓ is minimal, max-optimal if max(Γ) is minimal.

2.3 Access times

The mean length of a mean-optimal stopping rule from σ to τ will be called the access time from σ to τ and denoted H(σ, τ). As suggested by the notation, H(σ, τ) generalizes the hitting times H(i, j).

Trivially, H(σ, τ) = 0 if and only if σ = τ. It is easy to see that the following triangle inequality is satisfied for any three distributions α, β and γ:

$$H(\alpha, \gamma) \le H(\alpha, \beta) + H(\beta, \gamma). \tag{4}$$

(To generate γ from α, we can first use an optimal rule to generate β from α and then use the state obtained as a starting state for an optimal rule generating γ from β.) We should warn the reader, however, that H(σ, τ) ≠ H(τ, σ) in general.

We have seen that the access time H(σ, τ) has the properties of a metric on the space of state-distributions, except for symmetry; the latter is of course too much to expect since the ordinary hitting time, even for a reversible chain, is not generally symmetric.

Hence it would appear reasonable to expect that the cycle-reversing identity (3) extends to distributions in the case of reversible chains. However, this is not the case. A counterexample may be found on the 5-path: let α = (0, 0, 1, 0, 0), β = (0, 1/2, 0, 1/2, 0) and γ = (1/4, 0, 1/2, 0, 1/4). Then it is easily checked that H(α, β) + H(β, γ) + H(γ, α) = 1 + 1 + 2 but the reverse route gives 2 + 1 + 3.

If τ is concentrated at j (for which we write, rather carelessly, "τ = j") then

$$H(\sigma, j) = \sum_i \sigma_i H(i, j) \tag{5}$$

since clearly the only optimal stopping rule in this case is Ω_{σ,j}, "walk until state j is reached." This is therefore both mean-optimal and max-optimal, although max(Ω_{σ,j}) will generally be infinite. In particular, we have

$$\mathcal{N} = \sum_j \pi_j H(\pi, j). \tag{6}$$

By considering the naive rule Ω_{σ,τ}, we get the inequality

$$H(\sigma, \tau) \le \sum_{i,j} \sigma_i \tau_j H(i, j). \tag{7}$$

This may be quite far from equality; for example, H(σ, σ) = 0 for any σ. The access time H(σ, τ) is convex in both its arguments, i.e.,

$$H(c\sigma + (1 - c)\sigma', \tau) \le c H(\sigma, \tau) + (1 - c) H(\sigma', \tau) \tag{8}$$

and

$$H(\sigma, c\tau + (1 - c)\tau') \le c H(\sigma, \tau) + (1 - c) H(\sigma, \tau') \tag{9}$$

for 0 ≤ c ≤ 1 and any distributions σ, σ′, τ, τ′ on V.

To see (8), let i be the starting point drawn from distribution cσ + (1 − c)σ′. Flip a biased coin; with probability cσ_i/(cσ_i + (1 − c)σ′_i) follow an optimal stopping rule from σ to τ, else follow an optimal stopping rule from σ′ to τ. It is easy to check that this gives a stopping rule from cσ + (1 − c)σ′ to τ, which may not be optimal, but gives an upper bound on the length of an optimal rule.

We derive (9) similarly: with probability c follow an optimal stopping rule from σ to τ, else one from σ to τ′.

We mention one more useful elementary inequality. Let σ and τ be two distributions and denote by d(σ, τ) their total variation distance: d(σ, τ) = ∑_i max{0, σ_i − τ_i}. Also define the distributions

$$(\sigma\setminus\tau)_i := \frac{\max\{0,\ \sigma_i - \tau_i\}}{d(\sigma, \tau)}, \qquad (\sigma\wedge\tau)_i := \frac{\min\{\sigma_i,\ \tau_i\}}{1 - d(\sigma, \tau)}.$$

Then

$$H(\sigma, \tau) \le d(\sigma, \tau)\, H(\sigma\setminus\tau,\ \tau\setminus\sigma). \tag{10}$$

This can be seen by stopping with probability min{σ_i, τ_i}/σ_i at time 0, else following an optimal stopping rule to get from σ\τ to τ\σ.

2.4 Mixing time

Before defining our first notions of mixing time it will be useful to establish a reasonably consistent notation scheme. We will use script letters (like H or N) to denote mixing parameters, with different letters often used to distinguish classes of stopping rules. The dependence of a parameter on the (fixed) transition matrix of a Markov chain will always be understood.

As before we permit initial and target distributions as parameters, e.g. H(σ, τ) for the access time from σ to τ, but we will extend the scope of the arguments even further, to sets of distributions. In the case of access time, if A and B are two sets of state-distributions, we define

$$\mathcal{H}(A, B) := \min_{\beta \in B}\, \max_{\alpha \in A}\, H(\alpha, \beta).$$

For example, H(V, ≤2π) would be the minimum over all distributions τ ≤ 2π of the access time to τ from a worst single state (equivalently, in this case, a worst distribution of states).

Often mixing times will be computed from the worst starting distribution among all distributions (as in the above case, this is generally equivalent to the worst single state), and that will be the assumption if there is only one argument to H. If there are no arguments to H then the starting distribution is worst-case and the target distribution is π, as it is in most applications. In symbols:

$$\mathcal{H} := \mathcal{H}(\pi) = \max_{\sigma} H(\sigma, \pi) = \max_{s \in V} H(s, \pi) = \mathcal{H}(V, \pi).$$

We bestow upon H the honor of calling it the mixing time of the chain.

We have already seen the notation

$$\mathcal{N} := \mathcal{N}(\pi) := \mathcal{N}(\sigma, \pi)$$

for the constant in the Random Target Lemma, where σ is any distribution.

We reserve the letter M for the minimum over all stopping rules Γ (with appropriately restricted initial and target distributions) of the maximum number of steps, instead of the mean, taken by Γ. Thus,

$$\mathcal{M}(\sigma, \tau) := \min_{\Gamma:\ \sigma^\Gamma = \tau} \max(\Gamma),$$

and

$$\mathcal{M} := \mathcal{M}(\pi) := \max_{\sigma} \mathcal{M}(\sigma, \pi)$$

is the related mixing measure, which we cannot resist calling the "maxing time" of the chain.

A stopping rule is said to be independent if the state at which it stops the chain is independent of the one in which the chain was started. If we write

$$\mathcal{I}(\sigma, \tau) := \min\{\mathsf{E}\Gamma : \Gamma \text{ independent},\ \sigma^\Gamma = \tau\}$$

then we have

$$\mathcal{I}(\sigma, \tau) = \sum_{i \in V} \sigma_i H(i, \tau)$$

since an independent rule must produce the target distribution "separately" from each starting state.

In many applications repeated independent samples are required from the stationary distribution of a chain, and in that case it is natural to start the second and subsequent walks from π itself instead of a "worst" state. Hence, we let both the default initial and target distributions for an independent rule be π:

$$\mathcal{I} := \mathcal{I}(\pi, \pi) = \sum_i \pi_i H(i, \pi),$$

obtaining what we call the reset time of the chain. (This is only formally similar to formula (6) for N; in general, N is much larger.)

The last mixing measure which we introduce at this stage does not overtly involve the stationary distribution; instead, we look for the best target distribution, in the sense of expected access time from a worst starting state for that distribution. We call this parameter the forget time because it measures, in a sense, the least time it can take to "forget" what state the chain was started in. We denote the forget time by F, and this time no arguments are needed:

$$\mathcal{F} := \min_{\tau}\, \max_{s \in V}\, H(s, \tau) = \min_{\tau}\, \max_{\sigma}\, H(\sigma, \tau).$$

The target distribution τ which minimizes max_s H(s, τ) will be called the "forget distribution" and denoted by ϕ.

2.5 Exit Frequencies

Let us now fix a transition matrix M , an initial distribution σ and a finite stopping rule Γ.

Definition. The exit frequencies x_i = x_i(Γ) (i ∈ V) are defined by setting x_i equal to the expected number of times the walk leaves state i before stopping. The scaled exit frequencies are defined by y_i = x_i/π_i. In the finite case, the use of x_i or y_i is only a matter of taste; in the case of Markov chains with infinite state space, the difference between x_i and y_i is quite significant.

Exit frequencies are a special case of what Pitman [26] calls "pre-T occupation measures" for stopping times T. Indeed, versions of Lemma 2.2 and Theorem 2.4 below are proved in [26] using Pitman's "occupation measure identity."

To warm up, let us determine the exit frequencies in some simple cases. The first result is from Aldous [1]. Several related formulas could be derived using relations to electrical networks, as in [12], [14], or [29].

Lemma 2.1 The scaled exit frequencies for the naive stopping rule from state i to state j are given by

$$y_k(\Omega_{ij}) = H(i,j) + H(j,k) - H(i,k).$$

More generally, the scaled exit frequencies for the naive stopping rule from σ to τ are given by

$$y_k(\Omega_{\sigma,\tau}) = \sum_{i,j} \sigma_i \tau_j \big(H(i,j) + H(j,k) - H(i,k)\big) = \sum_{i,j} \sigma_i \tau_j H(i,j) + H(\tau, k) - H(\sigma, k).$$

Proof. The mean number of k-visits in a walk from i to j to k and back to i must be the stationary probability of k, multiplied by the mean length of such a walk, i.e. π_k(H(i, j) + H(j, k) + H(k, i)), since one can partition a doubly-infinite walk into such pieces. Similarly the number of k-visits (counting the last but not the 0-th) on a k-to-i-to-k round trip is π_k(H(k, i) + H(i, k)).

From i to k or from j to k we have exactly one visit to k, so altogether we get that the mean number of k-visits in a walk from i to j is

$$\pi_k\big(H(i,j) + H(j,k) + H(k,i)\big) + 1 - \Big(\pi_k\big(H(k,i) + H(i,k)\big) + 1\Big),$$

giving the desired result. □

Exit frequencies are related to the starting and ending distributions by a simple formula, found in Pitman [26]:

Lemma 2.2 The exit frequencies of any stopping rule Γ that reaches distribution τ from distribution σ satisfy the equation

$$\sum_i p_{ij}\, x_i(\Gamma) - x_j(\Gamma) = \tau_j - \sigma_j.$$

Proof. The probability of stopping at state j is the expected number of times j is entered minus the expected number of times j is left. □

One can rewrite this identity as follows:

$$\sum_i p_{ij}\, x_i(\Gamma) - \sum_i p_{ji}\, x_j(\Gamma) = \tau_j - \sigma_j.$$

In other words, the values p_{ij} x_i(Γ) can be viewed as values of a flow through the underlying graph, from supply σ to demand τ. This is also easily seen by observing that p_{ij} x_i(Γ) is the expected number of passes from i to j while following Γ.

Theorem 2.3 Fix two distributions σ and τ, and let Γ and Γ′ be two finite stopping rules from σ to τ. Then x(Γ′) − x(Γ) = (EΓ′ − EΓ)π.

Conversely, if Γ and Γ′ are two stopping rules such that, when starting from σ, x(Γ) and x(Γ′) differ by a multiple of π, then σ^Γ = σ^{Γ′}.

Proof. Let x_i = x_i(Γ), x′_i = x_i(Γ′). By Lemma 2.2,

$$\sigma^\Gamma_j = \sigma_j + \sum_i p_{ij} x_i - x_j.$$

In case σ^Γ = σ^{Γ′}, we can subtract the similar identity involving x′ to get

$$x'_j - x_j = \sum_i p_{ij}(x'_i - x_i),$$

which means that x′ − x is a multiple of the stationary distribution, say x′ − x = Dπ. Since ∑_j x_j = EΓ, the difference in mean length is D. The converse statement follows by the same formulas. □

It follows from Theorem 2.3 that the exit frequencies of any mean-optimal stopping rule from σ to τ are the same. We denote them by x_i(σ, τ), and the corresponding scaled exit frequencies by y_i(σ, τ). We can compute these using Theorem 2.3 and the naive rule:

$$y_k(\sigma, \tau) = H(\sigma, \tau) + \sum_{i,j}\sigma_i\tau_j H(i,j) + H(\tau, k) - H(\sigma, k) - \sum_{i,j}\sigma_i\tau_j H(i,j) = H(\sigma, \tau) + H(\tau, k) - H(\sigma, k). \tag{11}$$

Thus we proved:

Theorem 2.4 The scaled exit frequencies of a mean-optimal stopping rule from σ to τ are given by

$$y_k(\sigma, \tau) = H(\tau, k) - H(\sigma, k) + H(\sigma, \tau).$$


2.6 *Exit discrepancy and recurrent potential

Another way of formulating Theorem 2.3 is to say that the values y_i(Γ) − EΓ are the same for every stopping rule Γ (optimal or not) from a given σ to a given τ. We call these values the exit discrepancies and denote them by z_i(σ, τ). They will be very convenient quantities to use.

Formula (11) implies that

$$z_k(\sigma, \tau) = H(\tau, k) - H(\sigma, k),$$

which yields immediately a number of useful properties of z:

$$z_k(\sigma, \tau) + z_k(\tau, \rho) = z_k(\sigma, \rho),$$

and

$$z_k(\sigma, \tau) = -z_k(\tau, \sigma).$$

Also note that z_k(σ, τ) is a linear function in each of σ and τ.

The matrix R_{ik} = z_i(k, π) is called the recurrent potential (from starting state k). The above properties of exit discrepancies imply that the recurrent potential contains all the information needed to express exit discrepancies, exit frequencies, and access times.

Lemma 2.5 For every fixed state k, the recurrent potential R_{ik} is the (unique) function on V satisfying the conditions

$$\sum_j p_{ji}\pi_j R_{jk} - \pi_i R_{ik} = \begin{cases} \pi_i & \text{if } i \neq k,\\ \pi_i - 1 & \text{if } i = k, \end{cases}$$

and

$$\sum_i \pi_i R_{ik} = 0.$$

Furthermore, we have the formulas

$$z_i(\sigma, \tau) = \sum_k (\sigma_k - \tau_k) R_{ik},$$

$$H(\sigma, \tau) = -\min_i z_i(\sigma, \tau) = \max_i \sum_k (\tau_k - \sigma_k) R_{ik},$$

and

$$y_i(\sigma, \tau) = z_i(\sigma, \tau) + H(\sigma, \tau) = \sum_k (\sigma_k - \tau_k) R_{ik} - \min_j \sum_k (\sigma_k - \tau_k) R_{jk}.$$

Proof.

$$z_i(\sigma, \tau) = z_i(\sigma, \pi) - z_i(\tau, \pi) = \sum_k(\sigma_k - \tau_k)\, z_i(k, \pi) = \sum_k(\sigma_k - \tau_k)\, R_{ik}. \qquad \square$$

By substituting using (2.4), it is easy to verify the following identity:

$$z_k(\sigma, \rho) = d(\sigma, \rho)\, z_k(\sigma\setminus\rho,\ \rho\setminus\sigma). \tag{12}$$

Let ‖·‖_π denote the ℓ₁ norm in R^n with weight function π, i.e., ‖z(σ, τ)‖_π = ∑_{i∈V} π_i |z_i(σ, τ)|. We denote by Z the maximum of (1/2)‖z(σ, π)‖_π over all distributions σ, and call it the discrepancy of the Markov chain. Clearly, the maximum is attained when σ is a singleton. Then

$$\|z(\sigma, \rho)\|_\pi = d(\sigma, \rho)\,\|z(\sigma\setminus\rho,\ \rho\setminus\sigma)\|_\pi \le d(\sigma, \rho)\big(\|z(\sigma\setminus\rho,\ \pi)\|_\pi + \|z(\rho\setminus\sigma,\ \pi)\|_\pi\big) \le 4\, d(\sigma, \rho)\, Z. \tag{13}$$

We obtain another interesting formula for z by considering a simple stopping rule: walk one step! Its exit frequencies are x_i = σ_i, and hence

$$z_k(\sigma, \sigma^1) = \frac{\sigma_k}{\pi_k} - 1.$$

It follows that

$$z_k(\sigma, \sigma^t) = \sum_{m=0}^{t-1}\left(\frac{\sigma^m_k}{\pi_k} - 1\right). \tag{14}$$

If the chain is ergodic, then we can let t → ∞ and obtain

$$z_k(\sigma, \pi) = \sum_{m=0}^{\infty}\left(\frac{\sigma^m_k}{\pi_k} - 1\right)$$

(it is easy to see that the series on the right is absolutely convergent). Hence we get that for any two distributions σ and τ,

$$z_k(\sigma, \tau) = \sum_{m=0}^{\infty}\left(\frac{\sigma^m_k}{\pi_k} - \frac{\tau^m_k}{\pi_k}\right).$$

Exit discrepancies are related to access times by some simple but useful inequalities. Trivially

$$z_i(\sigma, \tau) \ge -H(\sigma, \tau) \tag{15}$$

and hence

$$z_i(\sigma, \tau) \le H(\tau, \sigma). \tag{16}$$

In terms of H(σ, τ), we only have the much weaker inequality

$$z_i(\sigma, \tau) \le \left(\frac{1}{\pi_i} - 1\right) H(\sigma, \tau). \tag{17}$$

To get a stronger inequality, note that since ∑_i π_i z_i(σ, τ) = 0, we have

$$\max_A \sum_{i\in A} \pi_i z_i(\sigma, \tau) = \sum_i \pi_i \max\{0,\ z_i(\sigma, \tau)\} = \frac12 \|z(\sigma, \tau)\|_\pi.$$

Then for the sum of the z_i over any subset A:

$$\sum_{i\in A} \pi_i z_i(\sigma, \tau) = \sum_{i\in A} x_i(\sigma, \tau) - \pi_A H(\sigma, \tau) \le H(\sigma, \tau) - \pi_A H(\sigma, \tau). \tag{18}$$

Hence

$$\frac12\|z(\sigma, \tau)\|_\pi \le H(\sigma, \tau). \tag{19}$$

2.7 Four stopping rules

Using the notion of exit frequencies, we describe four stopping rules that, with the right choice of their parameters, can generate any target distribution from any starting distribution. We'll see in the next section that these rules are optimal (again, if the parameters are chosen right). Throughout, let σ and τ be arbitrary distributions and let x ∈ R^V be such that x_i ≥ 0 and

$$\sum_j p_{ji} x_j - x_i = \tau_i - \sigma_i. \tag{20}$$

(Such numbers x_i can be obtained, for example, by considering the exit frequencies of any stopping rule from σ to τ, and then adding to them any real multiple of π, as long as non-negativity is preserved.)

2.7.1 Local rule

The following "local" (or "Markovian") rule Λ is perhaps the easiest to describe: if we are at node i, we stop with probability τ_i/(x_i + τ_i), and move on with probability x_i/(x_i + τ_i) (if x_i + τ_i = 0 the stopping probability does not need to be defined). Thus the probability of stopping under Λ depends only on the current state, not the time; in effect, it treats "stop" like a new (absorbing) state of the Markov chain. It is clear that each walk stops eventually with probability 1.
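A direct path-by-path simulation of the local rule, assuming exit frequencies x satisfying (20) are given (a sketch; the names are ours):

```python
import random

def local_rule(M, sigma, x, tau):
    """Run the local rule Lambda once: at state i, stop with probability
    tau_i / (x_i + tau_i), else take one more step of the chain."""
    n = len(tau)
    i = random.choices(range(n), weights=sigma)[0]   # draw the starting state
    while True:
        # states with x_i + tau_i = 0 are never entered when (20) holds
        if random.random() < tau[i] / (x[i] + tau[i]):
            return i                                  # stop at the current state
        i = random.choices(range(n), weights=M[i])[0] # move on: one chain step
```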

Theorem 2.6 The local rule generates τ , i.e., σΛ = τ .

Proof. Let x′_1, . . . , x′_n be the exit frequencies of Λ, let τ′ be the distribution it produces, and assume that τ′ ≠ τ. Clearly τ′_i = 0 if τ_i = 0, and x′_i = 0 if x_i = 0. Since any time we are at state i with τ_i ≠ 0, the probability of moving on is x_i/τ_i times that of stopping, we have that

$$x'_i = \frac{x_i}{\tau_i}\,\tau'_i.$$

For each state i, let

$$\alpha_i = \begin{cases} x'_i/x_i, & \text{if } x_i \neq 0,\\ \tau'_i/\tau_i, & \text{if } \tau_i \neq 0,\\ 0, & \text{if } x_i = \tau_i = 0, \end{cases}$$

and

$$\alpha = \max_i \alpha_i.$$

Let I = {i : α_i = α} and I_1 = {i : α_i = α, x_i > 0}.

Since τ′ ≠ τ, we have α > 1. Moreover, x′_i ≤ αx_i and τ′_i ≤ ατ_i for all i, and x′_i = αx_i and τ′_i = ατ_i for all i ∈ I.

Let i ∈ I; then applying Lemma 2.2 to Λ and using the conditions on the x_i, we get

$$\sigma_i = \tau'_i + x'_i - \sum_j p_{ji} x'_j \ge \alpha\tau_i + \alpha x_i - \sum_j p_{ji}\,\alpha x_j = \alpha\sigma_i.$$

This can only hold if σ_i = 0 and x′_j = αx_j for all j with p_{ji} > 0. For such a j, we have either x_j = 0 or j ∈ I_1. But we cannot have x_j = 0 for all these j, since then (20) implies that x_i = τ_i = 0, contradicting the choice i ∈ I. Thus I_1 ≠ ∅. It also follows that if we follow Λ, then we don't start in I_1 and never enter I_1. But this is clearly impossible, since states in I_1 have x′_i = αx_i > 0 and hence must be visited with positive probability. □

2.7.2 Filling rule

This is the discrete version of the "filling scheme," introduced by Chacon and Ornstein [11] and shown by Baxter and Chacon [8] to minimize the expected number of steps. We call it the filling rule Φ = Φ_{σ,τ} and define it recursively as follows. Let p^t_i be the probability of being at state i after t steps (and thus not having stopped at a prior step); let q^t_i be the probability of stopping at state i in fewer than t steps. Then if we are at state i after step t, we stop with probability min(1, (τ_i − q^t_i)/p^t_i).

Thus, Φ stops myopically as soon as it can without overshooting the target probability of its current state. The probability of ever stopping at state i is clearly at most τ_i. We will see that Φ is a finite stopping rule and thus it does in fact achieve τ when started at σ.

Let S^t denote the set of states i such that q^t_i < τ_i. Observe that S^1 ⊇ S^2 ⊇ . . . . Let S^m be the last non-empty set in this sequence (it may be that S^m = S^{m+1} = . . . , or S^{m+1} = ∅).

We claim that if we hit S^m we stop. Indeed, suppose that we hit a state j ∈ S^m at step t and don't stop. This means that j will be filled after this step (else, we would stop with probability 1), and so j ∉ S^{t+1}. By the definition of S^m, this means that S^{t+1} is empty. But then every state is filled, which means that by then we must have stopped with probability ∑_i q^{t+1}_i = ∑_i τ_i = 1, so again we could not move on.

Since we hit S^m with probability 1, this proves that the filling rule is finite.

Let us note that the filling rule Φ can also be described by "deadlines" g_i ≥ 0. For a given state i, we define g_i as follows. Consider the largest t for which q^t_i < τ_i, and let

$$g_i = t + \frac{\tau_i - q^t_i}{p^t_i}.$$

Since q^{t+1}_i = τ_i ≤ q^t_i + p^t_i, we have p^t_i > 0 and g_i ≤ t + 1.

In terms of the deadlines g_i, we can describe the filling rule as follows: roughly speaking, we stop at state i if we hit it before its deadline, and move on if we hit it after the deadline. Close to the deadline we have to be a bit more careful. To be exact, we stop if we hit it after t steps and t ≤ g_i − 1; we stop with probability g_i − t if g_i − 1 < t ≤ g_i; and we don't stop if g_i < t.

It is not hard to see that the deadlines are uniquely determined by σ and τ except for states where we never stop, or states where we always stop. For these, we can take g_i = 0 and g_i = ∞, respectively, to make the deadlines well-defined.
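Since Φ's stop-or-go probabilities depend only on the current state and the time, the rule can be evolved at the level of distributions rather than sample paths. A sketch of that computation (ours; filling_rule is a made-up name):

```python
import numpy as np

def filling_rule(M, sigma, tau, tmax=100000):
    """Evolve the filling rule Phi in distribution space: p holds the p_i^t
    (mass still walking), q holds the q_i^t (mass already stopped)."""
    tau = np.asarray(tau, dtype=float)
    p = np.asarray(sigma, dtype=float).copy()
    q = np.zeros_like(p)
    mean_len = 0.0
    for t in range(tmax):
        stop = np.minimum(p, tau - q)   # stop as much as possible without overshooting
        q += stop
        p = (p - stop) @ M              # the surviving mass takes one more step
        mean_len += p.sum()             # each surviving unit of mass walks one step
        if p.sum() < 1e-12:
            break
    return q, mean_len                  # q converges to tau; mean_len to E(Phi)
```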

2.7.3 Threshold rule

A threshold rule is the antithesis of the filling rule: it keeps moving whenever possible without overshooting the prescribed exit frequencies x_i.

To be precise, we describe the threshold rule recursively as follows. Again, let p^t_i be the probability of being at state i after t steps (and thus not having stopped at a prior step); let q^t_i be the probability of stopping at state i in fewer than t steps. Let x^t_i be the expected number of exits from state i, again in fewer than t steps. We'll maintain that x^t_i ≤ x_i. Then if we are at state i after step t, we continue with probability min{1, (x_i − x^t_i)/p^t_i}.

It is clear that x^{t+1}_i ≤ x_i remains valid. Let us also observe that q^t_j ≤ τ_j remains valid. Indeed, if we stop at j by time t at all, then we have x^{t+1}_j = x_j; but then the expected number of times j is entered by time t + 1 is

$$\sigma_j + \sum_i p_{ij} x^t_i \le \sigma_j + \sum_i p_{ij} x_i = \tau_j + x_j = \tau_j + x^{t+1}_j,$$

and hence the expected number of times we stop there is at most τ_j.

However, Θ stops with probability 1 (since the sum of its exit frequencies is bounded) and does achieve an ending distribution. Thus this distribution must be τ.

Another way to describe a threshold rule is dual to the "deadline" description of the filling rule.

We have a "threshold" or "release time" 0 ≤ h_i ≤ ∞ associated with each state i. The stopping rule is defined by

$$\Gamma(w_0, \ldots, w_k) = \begin{cases} 0 & \text{if } k \ge h_{w_k},\\ 1 & \text{if } k \le h_{w_k} - 1,\\ h_{w_k} - k & \text{otherwise}. \end{cases}$$

Thus if we hit a state past the release time we stop; otherwise we continue, except that if we're within 1, we use the fractional part to randomize.

For a given state i, we can define the threshold (or release time) h_i as follows. Consider the largest t for which x^t_i < x_i, and let

$$h_i = t + \frac{x_i - x^t_i}{p^t_i}.$$

If x^t_i < x_i for all t, then h_i = ∞. If t is finite, then x^{t+1}_i = x_i ≤ x^t_i + p^t_i, and hence p^t_i > 0 and h_i ≤ t + 1.

The threshold vector may not be uniquely determined by a threshold rule Γ (e.g. all possible thresholds h_i smaller than the time before any possible walk reaches i are equivalent), but by convention we always associate with Γ the vector each of whose coordinates is minimal. Then h is finite if and only if Γ is bounded.


While the threshold rule may appear to be just a variation on the filling rule, it has many nice properties. In particular, it turns out to be often bounded in time, and this time bound is the least among all bounded rules.
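The same distribution-level evolution used for the filling rule works for the threshold rule, given exit frequencies x satisfying (20) (again only a sketch under those assumptions):

```python
import numpy as np

def threshold_rule(M, sigma, x, tmax=100000):
    """Evolve the threshold rule Theta: at state i, continue with probability
    min(1, (x_i - used_i) / p_i), so expected exits never exceed x_i."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(sigma, dtype=float).copy()   # mass still walking
    q = np.zeros_like(p)                        # mass stopped, by state
    used = np.zeros_like(p)                     # exits already spent (the x_i^t)
    for t in range(tmax):
        go = np.minimum(p, x - used)   # mass that moves on from each state
        q += p - go                    # the rest stops
        used += go
        p = go @ M
        if p.sum() < 1e-12:
            break
    return q                           # equals tau when x satisfies (20)
```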

We now show that the threshold rule is bounded whenever τ has sufficient support.

Theorem 2.7 Suppose there is no possible closed walk which passes only through states j with τ_j = 0. Then every threshold rule producing τ is bounded.

Proof. Trivially every state j with τ_j > 0 has a finite threshold (else, we could never stop there). Let t be an upper bound on all finite thresholds. If any walk survives past time t + n, then between time t and t + n it must have visited states with infinite threshold and, consequently, with τ_j = 0 only. But during this period, the walk must have traversed a closed walk, which contradicts the assumption of the theorem. □

Assume now that there is a closed walk through states with τ_j = 0. If σ_j > 0 for at least one state on such a walk, then clearly every threshold rule is unbounded. Furthermore, for any σ, we can choose x_i satisfying (20) so that the corresponding threshold rule will be unbounded; it suffices to choose the x_i large enough (by adding a large multiple of π to x) so that we reach the bad closed walk with positive probability. But some threshold rules may be bounded even in this case; for example, a threshold rule from any σ to itself is trivially bounded.

2.7.4 Chain rule

Let ρ be a probability distribution on the subsets of the state space V, and observe that it provides a stopping rule: "choose a subset U from ρ, and walk until some state in U is hit." The naive rule is of course a special case, with ρ concentrated on singletons.

Theorem 2.8 For every target distribution τ, there exists a unique distribution ρ which is concentrated on a chain of subsets and gives an optimal stopping rule for generating τ.

Proof. Starting the chain from distribution σ, let α^U be the distribution of the first node of the walk in U, for every nonempty subset U. For example, α^V = σ. Let U_1 = V, µ^0 = τ. We define distinct nodes i_1, . . . , i_n, non-negative numbers ρ_1, . . . , ρ_n, and non-negative vectors µ^1, . . . , µ^n ∈ R^V by induction. Assume that we have defined i_j, ρ_j and µ^j for j ≤ k. Let U_{k+1} = V \ {i_1, . . . , i_k}. Select a node i_{k+1} ∈ U_{k+1} which minimizes µ^k_i / α^{U_{k+1}}_i, and let

$$\rho_{k+1} = \mu^k_{i_{k+1}} \big/ \alpha^{U_{k+1}}_{i_{k+1}}$$

and

$$\mu^{k+1}_i = \mu^k_i - \rho_{k+1}\, \alpha^{U_{k+1}}_i.$$

Note that ρ_1 = τ_{i_1}/σ_{i_1}.

It follows by induction and by the choice of i_{k+1} that ρ_{k+1} ≥ 0, µ^{k+1} ≥ 0, and µ^{k+1}_i = 0 for i ∉ U_{k+2}. Moreover, we have

$$\sum_{k=1}^{n} \rho_k = 1. \qquad \square$$

We call this rule the "chain rule" and denote it by Ξ. A rather neat way to think of this rule is to assign a "price", a real value r(i) = ∑{ρ_U : i ∉ U}, to each state i. The rule is then implemented by choosing a random real "budget" b uniformly from [0, 1] and walking until a state j with r(j) ≤ b (an item that we can buy) is reached.


2.8 Halting states and optimal rules

Any state j for which x_j = 0 is called a halting state. By definition we stop immediately if and when any halting state is entered. (Of course we may stop in other states too, just not all the time.)

As an example, suppose we take a random walk starting at vertex 3 of the path 1–2–3–4–5, with target distribution τ = (1/4, 0, 1/2, 0, 1/4). There is a stopping rule here that is both mean-optimal and max-optimal, namely "take two steps and quit." In contrast the filling rule stops dead at the starting point half the time, otherwise it walks until it reaches an endpoint (which takes 4 steps on the average). It thus achieves the same mean length 2, but unbounded maximum length. For both rules the endpoints are halting states. Note that the naive rule Ω_{3,τ} is worse, choosing its stopping vertex beforehand to get mean length H(3, 1)/4 + H(3, 5)/4 = 6.

The following theorem gives an extremely useful characterization of optimality.

Theorem 2.9 A stopping rule Γ is mean-optimal for starting distribution σ and target distribution τ if and only if it has a halting state.

Proof. Since x_j cannot be negative, it is a trivial corollary of Theorem 2.3 that if some x_j = 0 then Γ is mean-optimal. The converse is less obvious: we must provide, for any two distributions σ and τ, a stopping rule that has a halting state. We can use any of the four rules described in Section 2.7 for this purpose. We need the observation that we can find x_i such that (20) is satisfied and min_i x_i = 0 (just subtract an appropriate multiple of π from the exit frequencies of any stopping rule from σ to τ). Then

1. the local rule will halt in every state with x_j = 0, by definition;

2. the filling rule will halt in every state of the last non-empty S^t, as discussed in the construction;

3. the threshold rule will halt if x_j = 0 since, as used in the construction, its exit frequencies are bounded by the x_j;

4. the chain rule will halt in every state of the least set in the chain of non-empty sets on which ρ is concentrated.

□

Corollary 2.10 Under the condition that min_i x_i = 0, each of the four rules described in Section 2.7 is optimal.

Let us apply Theorem 2.9 to our initial examples. Our stopping rule for Example 1.1 (Cycle) has no halting state (except for n = 2) and is therefore not optimal; in fact we will see that it is asymptotically six times as long as necessary for generating the uniform distribution among non-starting states.

Our rule for Example 1.2 (Cube) must stop when the antipodal point is hit, since then all directions must have been considered. Hence it is optimal.

For Example 1.3 (Card Shuffling) the permutations with least exit frequency are those with the initial bottom card at the top; each of these is equally likely to be exited on the last step of the chain, but at no other time, thus achieving exit frequency exactly 1/(n − 1)!. The stationary distribution is uniform by symmetry; thus to produce a halting state each exit frequency must be reduced by 1/(n − 1)! = nπ_i. The mean stopping time is the sum of the exit frequencies which, for our stopping rule, is thus exactly n shuffles more than optimal.

Theorem 2.11 For all distributions σ and τ,

$$H(\sigma, \tau) = \max_j\big(H(\sigma, j) - H(\tau, j)\big).$$

This maximum is achieved precisely when j is a halting state of some optimal rule which stops at τ when started at σ.

Proof. The inequality H(σ, τ) ≥ H(σ, j) − H(τ, j) is a special case of the triangle inequality (4). On the other hand, if j is a halting state for (σ, τ) then it costs nothing to pass through τ on the way from σ to j, i.e. H(σ, j) = H(σ, τ) + H(τ, j). □

Of course Theorem 2.11 implies that for fixed M, σ and τ, any two optimal stopping rules have the same halting states, namely those states j which maximize H(σ, j) − H(τ, j).

Since H(σ, j) is linear in σ, we can write the formula in the theorem as

$$H(\sigma, \tau) = \max_j \sum_i (\sigma_i - \tau_i) H(i, j).$$

As an immediate corollary we obtain the following useful fact:

Corollary 2.12 Let α, β, γ, δ be four distributions such that α− β = c(γ − δ). Then

H(α, β) = cH(γ, δ).
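Theorem 2.11 reduces access times to the hitting-time matrix. A one-line sketch on top of the earlier hitting_times helper (both names are our own illustrations):

```python
import numpy as np

def access_time(H, sigma, tau):
    """H(sigma, tau) = max_j sum_i (sigma_i - tau_i) H(i, j)  (Theorem 2.11)."""
    return ((np.asarray(sigma) - np.asarray(tau)) @ H).max()
```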

Theorem 2.11 gives an explicit description of the "asymmetric metric" H(σ, τ). Let H denote the matrix of hitting times: H_{ij} = H(i, j). Then H defines an affine map σ ↦ σH (with σ viewed as a row vector) of the simplex with the unit vectors as vertices (the space of state-distributions) onto some other simplex S. Using the Random Target Lemma (2), we find that S lies in the hyperplane ∑_i π_i x_i = N. Furthermore, (5) implies that (σH)_j = H(σ, j) for each distribution σ.

For x, y in this hyperplane, we can define an asymmetric "one-sided maximum norm" by ⟪x, y⟫ := max_i (x_i − y_i). Since ∑_i π_i(x_i − y_i) = 0, we have ⟪x, y⟫ > 0 for x ≠ y. This "norm" also satisfies the triangle inequality, and from Theorem 2.11 we have

$$H(\sigma, \tau) = \langle\!\langle\, \sigma H,\ \tau H \,\rangle\!\rangle.$$

Another way of stating Theorem 2.9 is the following:

$$H(\sigma, \tau) = \max_j\big(-z_j(\sigma, \tau)\big). \tag{21}$$

Next we discuss the special case of time-reversible chains. In this case we can use the cycle-reversing identity (3) to obtain the following formula for the exit frequencies of the naive rule:

$$x_k = \pi_k \sum_{i,j}\sigma_i\tau_j\big(H(j,i) + H(k,j) - H(k,i)\big) = \pi_k\Big(\sum_{i,j}\sigma_i\tau_j H(j,i) + \sum_i(\tau_i - \sigma_i)H(k,i)\Big),$$

and so, proceeding as above, the exit frequencies of a mean-optimal rule from σ to τ are

$$x_k(\sigma, \tau) = \pi_k\Big(\sum_i(\tau_i - \sigma_i)H(k,i) - \min_j\sum_i(\tau_i - \sigma_i)H(j,i)\Big)$$

and

$$H(\sigma, \tau) = \sum_k x_k(\sigma, \tau) = \sum_i(\tau_i - \sigma_i)H(\pi, i) - \min_j\sum_i(\tau_i - \sigma_i)H(j, i).$$

In the case when τ = π and σ is concentrated at a single state i (which is perhaps the most common case in applications of Markov chain techniques to sampling), this formula can be simplified using the Random Target Lemma (2) to get that any mean-optimal rule from i to π has exit frequencies

$$x_k = \pi_k\big(\max_j H(j, i) - H(k, i)\big) \tag{22}$$

and

$$H(i, \pi) = \max_j H(j, i) - H(\pi, i). \tag{23}$$

We have thus identified the halting state of a reversible chain, in attaining the stationary distribution from a fixed state i, as the state j from which the mean time to hit i is greatest. This seems slightly perverse in that we are interested in getting from i to j, not the other way 'round!

Time-reversibility is also reflected by the exit discrepancies in a very natural way:


Lemma 2.13 The Markov chain is reversible if and only if R_{ik} = R_{ki} for all i, k ∈ V.

Proof. By (22) and (23), we have

$$R_{ik} = H(\pi, k) - H(i, k).$$

On the other hand, (11) implies

$$R_{ki} = H(\pi, k) - H(i, k),$$

which proves the lemma. □

As an application, we prove a curious fact. Call a state s pessimal if it maximizes H(s, π), i.e., if H(s, π) = H.

Corollary 2.14 Let i be a pessimal state of a time-reversible Markov chain and let j be a halting state from i to π. Then j is also pessimal, and i is a halting state for H(j, π). In particular, every time-reversible chain with at least two states has at least two pessimal states.

Proof. By Lemma 2.13,

$$R_{ij} = R_{ji}.$$

Using that x_j(i, π) = 0, we can write this equation as

$$-H(i, \pi) = y_i(j, \pi) - H(j, \pi).$$

This implies that H(i, π) ≤ H(j, π). Since i is pessimal, we must have H(j, π) = H(i, π) and y_i(j, π) = 0. □

Theorem 2.15 The threshold rule Θ_{σ,τ} is max-optimal.

Proof. It is perhaps already plausible to the reader that by getting all its exiting "over with" as soon as possible, Θ_{σ,τ} achieves the least maximum number of steps.

Let Γ^t(i) denote the probability that a stopping rule Γ continues at time t given that the chain has reached state i at time t (and thus has not yet been stopped). Two stopping rules Γ₁ and Γ₂ are said to be equivalent if Γ₁^t(i) = Γ₂^t(i) for all i and t such that both conditional probabilities are defined. In that case σ^{Γ₁} = σ^{Γ₂}, EΓ₁ = EΓ₂ and max Γ₁ = max Γ₂, so the rules behave similarly for our purposes. It is easy to see that every stopping rule Γ is equivalent to a unique balanced rule Γ* whose continuation probability depends only on the current state and the time; Γ* is defined simply by Γ*(w₀, w₁, . . . , w_t) = Γ^t(w_t).

To prove the theorem, consider any balanced, bounded mean-optimal stopping rule Γ which generates τ, and suppose there are times s < t and a state j such that Γ^s(j) < 1 and Γ^t(j) > 0. Let p and q be the (positive) probabilities of being at state j at times s and t, respectively, under Γ, and put ε = min{p(1 − Γ^s(j)), qΓ^t(j)}. Then increasing Γ^s(j) by ε/p and decreasing Γ^t(j) by ε/q produces a new stopping rule θΓ with the same exit frequencies as Γ but with either θΓ^s(j) = 1 or θΓ^t(j) = 0; and max(θΓ) ≤ max(Γ).

Finitely many repeated applications of the operator θ produce a threshold rule Θ with max(Θ) ≤ max(Γ). Since Θ has the same exit frequencies as Γ it also generates τ, concluding the proof of the theorem. □

We conclude this section by computing the exit frequencies in some specific cases.


Example 2.16 (Path) Consider first the classic case of a random walk on the path of length n, with nodes labeled 0, 1, . . . , n. We begin at 0, with the object of terminating at the stationary distribution.

The hitting times from the endpoints are H(0, j) = H(n, n − j) = j², and the stationary distribution is

$$\pi = \left(\frac{1}{2n},\ \frac{1}{n},\ \frac{1}{n},\ \ldots,\ \frac{1}{n},\ \frac{1}{2n}\right),$$

since for any random walk on a graph π_i is proportional to the degree of node i. Owing to the special topology of the path, the filling rule and the chain rule are equivalent; moreover, since we begin at an endpoint, both rules are equivalent to the naive stopping rule Ω, for which node n is a halting state. From (22) we have

$$x_k = \pi_k\big(H(n, 0) - H(k, 0)\big) = \pi_k H(n, k) = \pi_k (n - k)^2$$

and

$$\mathcal{H} = H(0, \pi) = \sum_{k=0}^{n} x_k = \frac{1}{2n}\, n^2 + \sum_{k=1}^{n-1}\frac{1}{n}(n - k)^2 = \frac{n^2}{3} + \frac{1}{6}.$$

It follows that for a cycle of length n, n even,

$$\mathcal{H} = \frac{n^2}{12} + \frac{1}{6},$$

as compared with expected time

$$\frac{n-1}{n}\cdot\frac{n(n-1)}{2} = \frac{(n-1)^2}{2}$$

for staying at 0 with probability 1/n, else walking until the last new vertex is hit, as in Example 1.1.
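These values are easy to confirm numerically with the earlier sketches (hitting_times and access_time, both our own illustrative helpers):

```python
import numpy as np

def path_chain(n):
    """Simple random walk on the path with nodes 0, 1, ..., n."""
    M = np.zeros((n + 1, n + 1))
    M[0, 1] = M[n, n - 1] = 1.0
    for i in range(1, n):
        M[i, i - 1] = M[i, i + 1] = 0.5
    return M

n = 10
M = path_chain(n)
pi = np.array([1.0] + [2.0] * (n - 1) + [1.0]) / (2 * n)  # degree-proportional
delta0 = np.eye(n + 1)[0]                                 # start at node 0
# both printed values should equal n^2/3 + 1/6:
print(access_time(hitting_times(M), delta0, pi), n ** 2 / 3 + 1 / 6)
```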

Example 2.17 (Winning Streak) The following Markov process, sometimes called the "winning streak" chain, will be quite useful later on. Let V = {0, 1, . . . , n − 1}, 0 < c < 1, and define the transition probabilities by

$$p_{ij} = \begin{cases} c & \text{if } j = i + 1,\\ 1 - c & \text{if } j = 0 \text{ and } 0 \le i \le n - 2,\\ 1 & \text{if } j = 0 \text{ and } i = n - 1,\\ 0 & \text{otherwise}. \end{cases}$$

It is easy to check that the stationary distribution is

$$\pi_i = c^i\,\frac{1 - c}{1 - c^n}.$$

Hence the exit frequencies x_i for an optimal rule from 0 to π can be determined using Lemma 2.2, working backwards from i = n − 1, n − 2, . . . , obtaining

$$x_i = (n - i - 1)\, c^i\, \frac{1 - c}{1 - c^n}.$$

Summing over all states, we get

$$H(0, \pi) = \frac{n - nc - 1 + c^n}{(1 - c)(1 - c^n)} = n + \frac{1}{c - 1} + O(n c^n)$$

(if c is fixed and n → ∞).
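A quick numerical check of the winning-streak formulas (our own sketch):

```python
import numpy as np

def winning_streak(n, c):
    """Transition matrix of the winning-streak chain on {0, ..., n-1}."""
    M = np.zeros((n, n))
    for i in range(n - 1):
        M[i, i + 1] = c        # extend the streak
        M[i, 0] = 1 - c        # bust: fall back to state 0
    M[n - 1, 0] = 1.0          # forced restart from the top state
    return M

n, c = 8, 0.5
pi = c ** np.arange(n) * (1 - c) / (1 - c ** n)     # stationary distribution
print(np.allclose(pi @ winning_streak(n, c), pi))   # True: pi is stationary
x = (n - 1 - np.arange(n)) * pi                     # optimal exit frequencies from 0
print(x.sum(), (n - n * c - 1 + c ** n) / ((1 - c) * (1 - c ** n)))  # both H(0, pi)
```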


2.9 *Strong stopping rules

A stopping rule Γ is said to be strong if for any time t with P(Γ = t) > 0, the conditional distribution σ^t | Γ = t is the same as σ^Γ. In other words, the final state w_Γ is independent of the number of steps Γ.

A balanced strong stopping rule Ψ is determined completely by the numbers q_t, defined as the probability that the chain is stopped by Ψ precisely at time t. To see that this is so, let q_t(i) be the conditional probability of stopping at time t given that the chain has reached time t and is in state i; and let σ^t_i(Ψ) be the probability that the chain has survived to time t and enters state i at that time. For any t with q_t > 0, the state distribution given that the chain stops at that point is the target distribution τ. Hence we must have q_t(i) σ^t_i(Ψ) = q_t τ_i for each state i. It follows that q_t(i) is determined for the first t for which q_t > 0, and similarly for all subsequent such t.

Assuming that the chain is ergodic, there will always be strong stopping rules; in fact we can stop the chain at any point we wish, as long as it has positive probability of being in any state i for which τ_i > 0, with any probability q_t up to min_i(σ^t_i(Ψ)/τ_i). In fact, suppose that the chain achieves at least ½π after k steps from any state, and let c = ½ min_i(π_i/τ_i); then Ψ can stop the chain every k steps with probability at least c times the probability it is still alive. Hence EΨ ≤ k/c.

If the chain is periodic, with period k, let S_j be the set of states accessible at times t ≡ j (mod k) starting from state 1. Then for a strong stopping rule to exist it is necessary and sufficient that for some fixed integer m and real r,

$$\sum_{i \in S_{j+m}} \tau_i \in \Big\{0,\ \ r\cdot\sum_{i \in S_j}\sigma_i\Big\}$$

for all j = 0, . . . , k − 1, with indices taken modulo k.

The "greedy" strong stopping rule stops as soon as it can and with maximum probability. This cannot cost anything if the target distribution is π; for, suppose Ψ is not greedy and let t be the first time that q_t falls short of min_i(σ^t_i(Ψ)/π_i). Then there is a positive portion of π within σ^t which we may think of as a new chain starting, and thus remaining, in state π. But, obviously, the new chain must be stopped immediately with probability 1 to minimize length.

As an example, consider a random walk on the graph K₃, starting at vertex 1, and aiming for the (uniform) stationary distribution. The greedy strong stopping rule takes two steps and stops, unless the walk has returned to vertex 1, in which case it stops with probability 1/2 and otherwise repeats the experiment. Thus its expected length is 2/(1 − 1/4) = 8/3, which is strictly greater than the access time H(1, π) = 2/3.

From the above analysis it is clear that for a strong stopping rule Ψ to be mean-optimal over all rules, it must be the greedy strong stopping rule, and moreover at every time t the states i minimizing σ^t_i(Ψ)/π_i must include all of the halting states. If in the K₃ example we add loops to the vertices, so that the probability of remaining at the same vertex in a given step is some fixed p ≥ 1/3, then the two halting states remain underdogs to vertex 1, and the greedy strong stopping rule is optimal.

Let us say that a state i minimizing σ^t_i(Ψ)/τ_i is "hot" at time t. Notice that when a strong rule (or no rule at all) is in effect and the target is π, the list L_t of "hot" states at time t is independent of Ψ. The reason is that if s_t(Ψ) is the probability that the chain has been stopped prior to time t, then σ^t_i(Ψ) = σ^t_i − s_t(Ψ)π_i, since the stationary distribution persists. Thus σ^t_i(Ψ)/π_i = σ^t_i/π_i − s_t(Ψ), affecting all states equally. Putting these various facts together, we have:

Theorem 2.18 For any Markov chain, and any initial distribution σ and target distribution τ, the greedy strong stopping rule is the only one (up to equivalence) which can be mean-optimal among all stopping rules. If τ = π then the greedy strong stopping rule is mean-optimal if and only if L_t contains the halting states for all t.

2.10 *Matrix formulas

We describe some useful formulas connecting the transition matrix M with the matrix H whose entry in position (i, j) is the hitting time H(i, j). So in particular the diagonal of H is 0. We denote by R the diagonal matrix with the return time 1/π_i in the i-th position of the diagonal.


It is easy to see that these matrices satisfy the equation

$$(I - M)H = J - R. \tag{24}$$

Unfortunately, the matrix I − M is singular, and so (24) does not uniquely determine H. But the only left eigenvector of I − M with eigenvalue 0 is π^T, and the only right eigenvector with eigenvalue 0 is 1; hence I − M + 1π^T is non-singular. Moreover, we have

$$(I - M + \mathbf{1}\pi^{\mathsf{T}})(I - \mathbf{1}\pi^{\mathsf{T}}) = I - M$$

(using that M1 = 1 and π^T M = π^T). Hence

$$(I - M + \mathbf{1}\pi^{\mathsf{T}})(I - \mathbf{1}\pi^{\mathsf{T}})H = (I - M)H = J - R,$$

and thus

$$H = (I - M + \mathbf{1}\pi^{\mathsf{T}})^{-1}(J - R) + \mathbf{1}\pi^{\mathsf{T}} H. \tag{25}$$

For convenience, let us denote the matrix (I − M + 1π^T)^{−1}(J − R) = J − (I − M + 1π^T)^{−1}R by G; this matrix turns out to carry a lot of information about combinatorial properties of the random walk. It is not difficult to see that for time-reversible walks, G is a symmetric matrix.

Equation (25) is still not the right formula for H, since H also occurs on the right hand side. But we can determine π^T H by looking at the diagonal: 0 = G_{ii} + (π^T H)_i and hence

$$(\pi^{\mathsf{T}} H)_i = -G_{ii}.$$

So the negative of the i-th diagonal entry of G gives the expected time H(π, i) needed to hit state i, starting from the stationary distribution. But then we can express H purely from the matrix G. In fact, (25) implies that

$$H_{ij} = G_{ij} - G_{jj}.$$

Now consider the mixing time. Using Theorem 2.11, we get

$$H(s, \pi) = \max_t\big(H(s, t) - H(\pi, t)\big) = \max_t\big(G_{st} - G_{tt} + G_{tt}\big) = \max_t G_{st}.$$

Hence the mixing time is the largest entry of G.
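Putting Section 2.10 into code (a sketch; fundamental_G is our name for it):

```python
import numpy as np

def fundamental_G(M, pi):
    """G = J - (I - M + 1 pi^T)^{-1} R; then H_ij = G_ij - G_jj and the
    mixing time is the largest entry of G."""
    n = M.shape[0]
    one = np.ones(n)
    R = np.diag(1.0 / pi)                      # diagonal matrix of return times
    A = np.eye(n) - M + np.outer(one, pi)      # I - M + 1 pi^T (non-singular)
    G = np.ones((n, n)) - np.linalg.solve(A, R)
    H = G - one[:, None] * np.diag(G)[None, :] # hitting-time matrix H_ij
    return G, H

# e.g. with the path or winning-streak chains above:
# G, H = fundamental_G(M, pi); print(G.max())   # the mixing time
```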

2.11 *Comparison of stopping rules

Each of the stopping rules can be implemented in time polynomial in the number of states. In fact, the constructions of the rules as described in Section 2.7 yield polynomial-time algorithms to compute the exit frequencies, deadlines, release times, and prices, and these quantities are enough to implement the stopping rules. (Unfortunately, this is not good enough in a typical application of these techniques to sampling, where the size of the state space is exponential.)

The chain rule Ξ shares with the filling rule Φ the “now-or-never” property that once a state isexited, it can never be the state at which the rule stops. In fact, once Ξ exits state ik it can neverstop at any ij for j ≤ k in the notation of the proof.

Note that the chain rule is not generally balanced; for example, we cannot stop at state j if somestate i with r(i) ≤ r(j) has already been hit.

The fact that this rule is optimal follows from Theorem 2.9 since the state in is never exited.The uniqueness of ρ follows easily by induction on k.

It is interesting that this result defines an ordering (with ties—technically, a “preorder”) ofthe states for every σ and τ , starting with the state having the largest surplus (σi/τi maximum),and ending with the halting state. (This ordering is in general different from the ordering by exitfrequencies, the ordering implicit in the filling rule, or the ordering by thresholds defined in the nextsection.)

The continuous-time versions of both Φ and Θ enjoy more elegant threshold descriptions: in thefirst case we stop whenever we transfer to a state ahead of its threshold, in the second whenever athreshold is reached while we sit in the corresponding state. In both cases the thresholds themselvesare uniquely defined.

19

2.12 *Continuous time and space: an example

Although we have not presented statements or proofs of our results either for continuous time or forcontinuous state spaces, the temptation to examine our repertoire of stopping rules in the contextof Brownian motion is too difficult to resist. Let n −→ ∞ in Example 2.16, the path of lengthn, while time contracts at rate n2. Then in the limit we have Brownian motion B(t) in the unitinterval [0, 1] with reflecting barriers, and hitting time normalized to H(0, 1) = 1. Suppose that weagain wish to start at 0 and stop at the stationary (uniform) distribution.

Skorokhod [28], Dubins [15] and Root [27] have proposed different rules for stopping B(t) at agiven distribution. All three rules are optimal, running in expected time 1/3 in our case. Skorokhod’srule is the filling rule (or equivalently, the chain rule) applied to Brownian motion; here it amountsmerely to choosing u uniformly from [0, 1] and stopping as soon as B(t) = u.

Dubin’s is a strong stopping rule, operating in our case as follows. Define a sequence U1, U2, . . .of state-sets by

Ui :=1 + 2j

2i: j = 1, . . . , 2i−1 − 1

.

Put T1 := inft : B(t) ∈ U1, Ti := inft > Ti−1 : B(t) ∈ Ui. When we stop at T = limi→∞ Ti ,the bits of the binary expansion of B(T ) will be fair and independent, hence B(T ) is uniformlydistributed. (Note that there is no strong stopping rule, optimal or otherwise, for the discrete path.To get one we would need to append loops to the nodes or to operate in continuous time.)

Root’s is a threshold rule: there is a lovely, smooth function h on the unit interval with h(1) = 0such that B(inft : h(B(t)) = t) is uniform. Unfortunately, no one seems to have an explicitdescription of h.

To these three we can add a local rule. The continuous analog of exit frequency is given byx(u) = (1 − u)2, thus we can attain the uniform distribution by stopping B(t) between T andT + dt, given that we have not stopped before, with probability

dt

1 + (1−B(t))2.

From limiting forms of our results it follows that each of these stopping rules is unique in theappropriate context: for example, Skorokhod’s is the only optimal chain rule, Dubin’s the onlyoptimal strong rule, Root’s the only rule which minimizes the maximum stopping time, and oursthe only optimal rule whose infinitesimal stopping probability does not depend on time.

3 *Time reversal

3.1 Reverse chains

Given a Markov chain with transition probabilities pij , we define the reverse chain as the Markovchain on the same set of states, with transition probabilities ←−p ij = πjpji/πi. We will generally usethe reverse arrow over a symbol to indicate that it refers to this reverse chain.

We start with the trivial but important remark that the reverse chain has the same stationarydistribution as the original. This follows by straightforward substitution.

The key to the proof of many properties of the reverse chain is the following general “dualityformula.”

Lemma 3.1 Let α, β, γ, δ be four distributions on the states. Then∑

i

(βi − αi)←−y i(γ, δ) =∑

i

(δi − γi)yi(α, β).

Proof. Let v0, v1, . . . be a walk in the forward chain started from α, and consider the randomvariables

Yt = ←−y vt+1(γ, δ)−←−y vt(γ, δ) + (γvt − δvt)/πvt .

20

From the conservation equation (Lemma 2.2) for exit frequencies applied to the reverse chain, wehave

←−y i(γ, δ)− (γi + δi)/πi =1πi

j

←−p ji←−x j(γ, δ)

=∑

j

pij

πj

←−x j(γ, δ) =∑

j

pij←−y j(δ, γ) ,

and therefore

E(Yt|vt = i) =∑

j

pij←−y j(γ, δ)−

j

pij←−y j(γ, δ) = 0

for each i, thus EYt = 0 a priori. In particular, if T is the number of steps taken by some optimalstopping rule from α to β, then the sum

∑T−1t=0 Yt has zero expectation; but

T−1∑t=0

Yt = ←−y vT(γ, δ)−←−y v0(γ, δ) +

T−1∑

i

(γi − δi

πi

)Xi

where Xi is the number of times state i is exited. Taking expectation, the lemma follows. ¤We derive some corollaries of this lemma.

Corollary 3.2 For every state j, we have

H(π, j) =←−H(π, j).

Hence also←−N = N .

Proof. Choose α = γ = π and β = δ = j in Lemma 3.1. ¤

Corollary 3.3 For every Markov chain, the mixing time and reverse mixing time are the same:

←−H = H.

Moreover, if i is a pessimal state for the first chain and j is a halting state from i to π, then j is apessimal state for the reverse chain and i is a halting state from j to π in the dual chain.

Proof. Let i be a pessimal starting point for the primal chain and let j be a halting state from ito π. Apply the lemma with α = i, γ = j, and β = δ = π. Then we get

←−H(j, π)−←−y i(j, π) = H(i, π)− yj(i, π) = H(i, π).

Hence←−H ≥ ←−H(j, π) ≥ H(i, π) = H.

The reverse inequality follows similarly. ¤

Corollary 3.4 For every two states i and j, we have

←−H(i, j) = H(j, i)−H(π, i) +H(π, j).

21

A consequence worth mentioning is that if the Markov chain has a state-transitive automorphismgroup, then H(i, j) =

←−H(j, i).

Proof. Apply the lemma with α = i, β = π, γ = j and δ = i. Then we get

←−H(i, j) = yj(j, π)− yi(j, π). (26)

Now Theorem 2.4 implies the assertion. ¤

Corollary 3.5 Let i be any state. A state j is a halting state from i to π if and only if←−H(j, i) ≥←−H(u, i) for every state k. In this case,

←−H(j, i) = H(i, π) +H(π, i)

Proof. By (26),

←−H(k, i) = yi(i, π)− yk(i, π).

It follows that the left hand side is maximized when k is a halting state j from i to π. Note thatthen

←−H(j, i) = yi(i, π). (27)

Now apply the lemma with α = δ = i and β = γ = π. Then we get

←−H(π, i) = yi(i, π)−H(i, π).

Combining with Corollary 3.2 and (27), we get the second statement of the corollary. ¤

Remark. Several results about reversible Markov chains can be extended by using the notion ofreverse chain. For example, the “cycle reversing” identity of [13] can be generalized (with a virtuallyidentical proof) as

H(i, j) +H(j, k) +H(k, i) =←−H(i, k) +

←−H(k, j) +←−H(j, i)

for any three states i, j and k. (From here we could get another way of deriving, among others,Corollary 3.4 above.)

All the linear algebra formulas also relate nicely to the reverse chain. Clearly this has transitionmatrix

←−M = RMTR−1, and simple substitution gives

←−G = GT. We could use this fact to derive the

results of the previous section in an algebraic way.

3.2 Forget time and reset time

We begin with a surprising identity: that the forget time of a chain is equal to the reset time of itsreverse.

Theorem 3.6 For every finite Markov chain, F =←−I and I =

←−F . In particular, if the Markovchain is time-reversible, then F = I.

First, we describe a reformulation of the forget time F as the optimum value of a linear program,and derive a few consequences. Let us rewrite the definition, using Theorem 2.11:

F = minτ

maxsH(s, τ) = min

τmax

smax

j(H(s, j)−H(τ, j)) = min

τmax

j(H(j′, j)−H(τ, j)) ,

22

where j′ is an “antipode” of j, i.e., a state for which H(j′, j) is maximal among all values H(k, j).We can write this expression as a linear program:

F = minimize tsubject to τi ≥ 0 (i ∈ V ),∑

i τi = 1,t +

∑i τiH(i, j) ≥ H(j′, j) (j ∈ V ).

(28)

Let us formulate the dual program; we have a variable r for the equation and a variable ρj for eachj ∈ V , and then the dual program is:

F = maximize r +∑

j ρjH(j′, j)subject to ρj ≥ 0 (j ∈ V ),∑

j ρj = 1r +

∑j ρjH(i, j) ≤ 0 (i ∈ V ).

(29)

So ρ is a probability distribution; moreover, the best value of r is clearly

r = mini

j

ρjH(i, j)

, (30)

and so

F = maxρ

mini

j

ρj

(H(j′, j)−H(i, j)). (31)

Any probability distribution ρ gives a lower bound on the forget time. An attractive choice is ρ = π,in which case we can use the Random Target Lemma (2) to see that r = N and the minimum in(30) is attained for all i. Hence

F ≥∑

j

πj(H(j′, j)−H(π, j)). (32)

The key result here is that equality holds in (32).

Theorem 3.7 The forget time of any Markov chain can be computed by

F =∑

j

πj(H(j′, j)−H(π, j)).

Moreover, the forget distribution ϕ is uniquely determined and is given by the formula

ϕj = πj

(1−H(j′, j) +H(π, j)

)+

∑m

pmjπm

(H(m′,m)−H(π, m)). (33)

Remark: Using the reverse chain and Corollary 3.5, we can write these formulas in the followingneater forms:

F =∑

j

πj←−H(j, π) ,

and

ϕj = πj

(1−←−H(j, π) +

∑m

←−p jm←−H(m,π)

). (34)

In particular, Theorem 3.7 will imply Theorem 3.6.

23

Proof. It suffices to show that ρ = π and r = −N form an optimal solution of (29). For this, itsuffices to exhibit a solution (ϕ, t) of the primal such that the complementary slackness conditionshold:

ϕi > 0 ⇒ −N +∑

j

πjH(i, j) = 0, (35)

and

πj > 0 ⇒ t +∑

i

ϕiH(i, j) = H(j′, j). (36)

We choose t =∑

j πj(H(j′, j)−H(π, j)) (recall that we want equality in (32)). To choose the rightϕ, observe that the first set of conditions is fulfilled, on account of the Random Target Lemma (2),independently of the choice of ϕ. Since π > 0, the second set applies for every j. This gives n = |V |linear equations on ϕ (and we have one more from the original system, namely

∑i ϕi = 1)). We

show that this (seemingly overdetermined) system has a unique solution, which is non-negative, andhence is a solution of the linear program (28). This will prove the theorem.

Let hj = H(j′, j). Then (36) can be re-written as

HTϕ = h− t1. (37)

and we also must have

1Tϕ = 1. (38)

We know from (25) that

H = J − (I −M + 1πT)−1R + 1πTH ,

and hence

HTϕ = Jϕ + R(I −MT + π1T)−1ϕ + HTπ1Tϕ = 1−R(I −MT + π1T)−1ϕ + HTπ.

Equating the two expressions for HTµ and rearranging, we can express ϕ:

ϕ = (I −MT + π1T)R−1((1 + t)1 + HTπ − h) = π + (I −MT)R−1(HTπ − h). (39)

Sustitution shows that this µ satisfies (37) and (38). The fact that it is non-negative follows bynoticing that (39) is equivalent to equation (33), which in turn is equivalent to (34), where thenon-negativity of the right hand side is trivial:

←−H(j, π) ≤ 1 +∑m

←−p jm←−H(m,π) ,

since making one step from i and then following an optimal rule to π is a (not necessarily optimal)stopping rule from j to π.

Thus we have a feasible solution of (28), which satisfies, together with (r, π), the complementaryslackness conditions. Our argument also proves the uniqueness of the optimizing distribution, aswell as the optimality of π as a dual solution. ¤

3.3 Optimal and pessimal starting states

We conclude with some further links between the forget time and mixing-optimal and mixing-pessimal states.

Lemma 3.8 The exit frequencies from the stationary distribution π to the forget distribution ϕ,and vice versa, can be expressed by the formulas

yi(π, ϕ) =←−H(i, π)−min

s∈V

←−H(s, π)

and

yi(ϕ, π) = H−←−H(i, π).

24

Proof. By Theorem 33, we have

ϕi = πi

1 +

j

←−p ij←−H(j, π)−←−H(i, π)

and hence∑

j

pjiπj←−H(j, π)− πi

←−H(i, π) = ϕi − πi.

This equation means the numbers yi =←−H(i, π) are the scaled exit frequencies of some (not neces-

sarily optimal) stopping rule from π to ϕ, and so the numbers←−H(i, π) −mins

←−H(s, π) are the exitfrequencies of an optimal stopping rule. The second equation follows similarly. ¤

The previous lemma has the following corollaries.

Corollary 3.9 A state j is a halting state from π to ϕ if and only if it is a mixing-optimal statefor the reverse chain. Moreover, for the reverse mixing time from j we have

mins

←−H(s, π) = F −H(π, ϕ).

Corollary 3.10 A state j is a halting state from ϕ to π if and only if it is a mixing-pessimal statefor the reverse chain. Moreover, for the mixing time we have

H = F +H(ϕ, π).

So from a pessimal point, a best stopping rule to get to π is to follow an optimum stopping ruleto the forget distribution ϕ, and then follow an optimum rule from there. It also follows that

Corollary 3.11 Every mixing-pessimal state is forget-pessimal.

We conclude with a discussion of our second introductory example, the winning streak chain.The stopping rules to π exhibited for the reverse chain are optimal, since each has a halting state.Hence

←−H = H = n− 1.

To compute the forget time, we use Theorem 3.7. It can easily be checked that H(0, i) = 2i+1 − 2for all i and H(j, i) = 2i+1 for all i < j ≤ n− 1. It is clear that the state i′ which maximizes H(i′, i)is i + 1 (mod n). Using the Random Target Lemma we have that for any state i,

N =n−1∑

j=0

πjH(i, j) =n−1∑

j=0

πjH(0, j) = n− 1

andn−1∑

j=0

πjH(j′, j) = n + 1− 2−(n−2) ,

so that by Theorem 3.7, the forget time is

F = 2− 2−(n−2).

This is indeed the value for the reverse reset time←−I calculated in the introduction. The forget

distribution ϕ is given by

ϕi =

1− 2−(n−1), if i = 0,2−(n−1), if i = n− 1,0, otherwise.

25

In fact, H(i, ϕ) = 2 − 2−(n−2) for all i (walk until you hit 0 or make n − 1 steps, whichever comesfirst).

For the reverse forget time, a more tedious calculation shows that

←−F = n− k + 1− n− k

2k− 1

2n−1,

where k is the largest integer with 2k ≤ n − k + 1 (so k ≈ log2 n; H(i, π) behaves differently fori < k and i ≥ k). This value is strictly smaller than the mixing time H, but still asymptotically n.

3.4 Exit frequency matrices

4 Blind rules

4.1 *Exact blind rules

We now turn temporarily to a type of stopping rule which is not generally capable of achievingarbitrary target distributions, and is almost never optimal. We call a stopping rule Γ blind if it isbalanced and Γt(i) depends only on t. The simplest blind stopping rule is the stopping rule usedmost often: “stop after t steps.” Several other practical methods to generate elements from thestationary distribution (approximately) can also be viewed as blind rules. For example, the method(to be discussed below) of walking u steps and then choosing one of the exited points uniformlycan be viewed as a blind rule: after t steps, we stop with probability 1/(u− t). Lazy random walkshave been considered (see Lovasz and Simonovits [20]) because they have better convergence to thestationary distribution; the lazy version of a Markov chain is obtained by flipping a coin before eachmove and staying where we are if we see “heads.” Stopping the lazy version of a Markov chain afters steps is equivalent to following the original walk (w0, w1, . . . , wu) for u steps and then choosing awt according to the binomial distribution, i.e. with probability

(ut

)2−u. This is again equivalent to

a blind rule, where we stop after t steps with probability(

u

t

)/((u

t

)+

(u

t + 1

)+ · · ·+

(u

u

)).

It is often more convenient to describe a blind stopping rule by the probabilities at that statevt is selected (not conditioning on not having stopped before). Trivially at ≥ 0 and the finitetermination of the rule is equivalent to

∑t at = 1. The rule is bounded if and only if the sequence

at has a finite number of non-0 terms. Thus a blind rule can always be thought of as an averaging,using the distribution (at).

One cannot generate any distribution by a blind stopping rule; for example, starting from thestationary distribution, every blind rule generates the stationary distribution itself. We shall restrictour attention to stopping rules generating the stationary distribution (or at least approximations ofit).

Let λ1, . . . , λq be the eigenvalues of M , λ1 = 1. From the Perron-Frobenius Theorem we knowthat |λi| ≤ 1. Our next theorem gives a characterization for the existence of a blind stopping rulefor the stationary distribution.

Theorem 4.1 (a) If λk is positive real for some k ≥ 2, then there exists a state s from which noblind stopping rule can generate π.

(b) If every λk, k ≥ 2, is either non-real, negative or zero, then there is a finite blind stoppingrule that generates π from any starting distribution.

Proof. (a) Assume that there exists a blind stopping rule, described by a0, a1, . . . (this sequencemay be finite or infinite), that generates π from a starting distribution σ. Let z be a complexvariable, and φ the expansion

φ(z) =∑

t

atzt.

26

Since at ≥ 0 and∑

t at = 1, this series is convergent for |z| ≤ 1. Moreover, by the definition of thestopping rule,

∑t

atσTM t = πT ,

or

σTφ(M) = πT. (40)

Let uk denote a right eigenvector of M belonging to λk; note that πTuk = 0 for k ≥ 2. Apply(40) to uk (where k ≥ 2) to get that

φ(λk)σTuk = σTφ(M)uk = πTuk = 0.

Choose σ here so that σTuk 6= 0 for k = 2, . . . , n (such a σ exists even among those distributionsconcentrated on single nodes). Then it follows that

φ(λk) = 0

for all k = 2, . . . , n. Clearly, this cannot hold if any λk is positive.

(b) Conversely, assume that λ2, . . . , λp are negative or zero and λp+1, . . . , λq are non-real. Foreach λk, k > p, we choose a positive integer bk such that

Re λbk

k ≤ 0.

Let mk denote the multiplicity of eigenvalue λk. The polynomial

φ(z) =p∏

k=2

(z − λk

1− λk

)mk q∏

k=p+1

(zbk − λbk

k

1− λbk

k

)mk(

zbk − λbk

k

1− λbk

k

)mk

= a0 + a1z + a2z2 + . . .

has non-negative coefficients with∑

t at = φ(1) = 1. Considering, e.g., the Jordan normal form ofM , we can see that φ(M) = 1πT. Thus for any starting distribution σ, φ(MT)σ = π. ¤

Interestingly, the condition formulated in the theorem is most restrictive for time-reversiblechains; then all the eigenvalues are real, and typically many of them are positive. For example, forrandom walks on graphs, only complete multipartite graphs give a spectrum with just one positiveeigenvalue. More generally, it is easy to argue that for any two states i and j that are not “twins”(i.e., pik 6= pjk or pki 6= pkj for some k) we must have pij > 0. (If the chain is symmetric and pii = 0for all i, then the condition is equivalent to saying that that states can be represented by points insome euclidean space so that the distance between i and j is √pij .)

But one can show that for every chain, there is an “almost blind” stopping rule which achievesthe stationary distribution exactly: namely, a rule in which the probability of stopping after thewalk w1, w2, . . . , wT depends only on which pairs ws, wt are equal (walking in an unknown citywith no map, we recognize a corner where we have already been). In fact, the result of [7] andits improvement in [22] implies the existence of such a rule whose description depends only on thenumber of states in the Markov chain. We do not go into details in this paper.

It is virtually never possible to find a blind rule for generating π which is mean-optimal. If weconsider only stopping at some fixed time, even approximating π may take far more time than amean-optimal rule (for example when the chain is almost periodic). However, the blind approximatestopping rule discussed in the next section will take time not much more thanH(σ, π) for any startingdistribution σ. We should also remark that if the chain is time-reversible and all the eigenvaluesλ2, . . . , λn other than λ1 are non-positive, then the blind rule described in the proof of Theorem 4.1is optimal among blind rules. Assuming for simplicity that these eigenvalues are distinct, an easycomputation shows that its mean length is

∑nk=1 1/(1 − λk). This value is just the same as N in

(2), which is an average hitting time, so in general it is much larger than the mixing time.

27

4.2 Averaging rules

Now we turn to the issue of finding practical rules for generating the stationary distribution whichare optimal or near-optimal. We will describe simple, easily implementable (in particular, blind)rules and prove that they give a good approximation of the stationary distribution, in expected timeonly a constant factor more than the mixing time.

As a preliminary remark, we note that making a move independently of where we are does nothurt; more exactly, recalling that d is total variation distance and σ1 is the state distribution afterone step of the chain, we have

Lemma 4.2

d(σ1, π) ≤ d(σ, π); H(σ1, π) ≤ H(σ, π).

Proof. The first inequality is easily checked. To prove the second, consider the following rule fromσ to π: make one step, then follow an optimal rule from σ1 to π. Comparing this with an optimalrule from σ to π using Theorem 2.3, we get

πi

(H(σ, π)−H(σ′, π))

= xi(σ, π)− xi(σ′, π) + πi − σi ≥ −xi(σ′, π)

by Lemma 2.2. Since there is a state i for which the right hand side is 0, the second inequality alsofollows. ¤

The uniform averaging rule Υ = Υt (t ≥ 0) is defined as follows: choose a random integer Yuniformly from the interval 0 ≤ Z ≤ t−1, and stop after Y steps. (To describe this as a stoppingrule: stop after the u-th step with probability 1/(t−u) (j = 0, . . . , t−1).) We shall give estimates onhow close the distribution σΥ of the state generated by the averaging rule is to the stationary. Tothis end, we derive an explicit formula for the distribution of the state produced by the averagingrule.

Lemma 4.3 Let Y be chosen uniformly from 0, . . . , t− 1. Then

σYi = πi

(1 +

1tzi(σ, σt)

).

Proof. Let xi denote the expected number of times state i is exited during the first t steps. Notethat σY

i = xi/t. Then the xi are the exit frequencies of a (σ, σt) rule, and hence by Theorem 2.3,we have

yi − t = zi(σ, σt) ,

whence the lemma follows. ¤

Theorem 4.4 Let Y be chosen uniformly from 0, . . . , t−1. Then for any starting distribution σand 0 ≤ ε ≤ 1,

d(σY , π) ≤ ε +1tHε(σ, π).

In particular,

d(σY , π) ≤ 1tH(σ, π).

Proof. Let τ be a distribution such that d(τ, π) ≤ ε and H(σ, τ) ≤ Hε(σ, π). Then

z(σ, σt) = z(σ, τ) + z(τ, τ t) + z(τ t, σt)

28

and hence for every U ⊆ V ,

k∈U

(σYk − πk) =

1t

k∈U

πkzk(σ, σt) =1t

k∈U

πkzk(σ, τ) +1t

k∈U

πkzk(τ, τ t) +1t

k∈U

πkzk(τ t, σt).

The first term above is at most (1− π(U))H(σ, τ) and the last term is at most

π(U)H(σt, τ t) ≤ π(U)H(σ, τ)

by (18) and (16). For the middle term, we use the rough bound

k∈U

πkzk(τ, τ t) =∑

k∈U

t−1∑m=0

(τmk − πk) ≤

t−1∑m=0

d(τm, π) ≤ td(τ, π) ≤ tε.

Substituting these bounds, the theorem follows. (See [23] for a simple direct proof.) ¤An immediate consequence of this lemma is the following converse to Lemma 5.20. Choosing

t = (2/ε)Z, the uniform averaging rule yields a distribution which is within variation distance ε ofπ.

4.3 *Approximate blind mixing times

We introduce the letter B to denote mixing time using blind rules, so that

B(σ, τ) := minΓ blind, σΓ=τ

and for distribution-sets A and B,

B(A,B) := minβ∈B

maxα∈A

B(α, β).

Let dε(τ) denote the ball of radius ε and center τ in the total variation metric. We introducethe shorthand notation

Bε(A, τ) := B(A, dε(τ))

and similarly for H and other mixing measures. In particular,

Bε := Bε(π) = maxσB(σ, dε(π))

and

Hε := Hε(π) = maxσH(σ, dε(π)) = max

smin

τ : d(τ,π)<εH(s, τ).

In terminology, Hε becomes the “approximate mixing time” and Bε the “blind approximatemixing time.” Since the averaging rule Υ is a blind rule, we have shown

Corollary 4.5 For every 0 ≤ ε ≤ 1,

Hε ≤ Bε ≤ 2εZ.

29

4.4 *Pointwise approximation

Theorem 4.4 asserts closeness in the total variation distance; approaching π pointwise (as noted, forthe reversible case, in [1]) turns out to be somewhat harder in general.

We indicate pointwise approximation from above and from below by a bar over or under thesubscript, as follows:

Hε := H(≤(1 + ε)π)

and

Hε := H(≥(1− ε)π).

The former will be called the “dispersion time” and the latter the “filling time.”Below we establish results about the pointwise distance of distributions obtained by various

averaging rules. However, we shall have to use the worst-case bound on the mixing time.The following example, simplified from [1], illustrates the problem in trying to approach π

pointwise from both above and below. Suppose there are just two states a and b, with paa = pab = 12 ,

pba = ε, pbb = 1− ε. Then πa = ε/(ε + 12 ), and so starting from a it takes about log(1/ε) steps to

decrease the probability of being in a to twice its stationary probability. On the other hand, clearlyH1/2 ≤ 1 and so H ≤ 2. Thus we cannot generally control the quotient σY

i /πi by averaging overtimes of order H. In fact, we’ll see in Section 5.5 that the time needed to make a two-sided approachto π using averaging is at least the order of the maximum length of a max-optimal stopping rulethat achieves π exactly.

We can show, however, that in time O(H), the excess of σY over π can be decreased to anarbitrarily small fraction of its original value.

Theorem 4.6 Assume that σ ≤ Kπ. Then

σY ≤(

1 +1tKH

)π.

Proof. By Lemma 4.3, the triangle inequality, and Lemma 4.2,

σYi = πi

(1 +

1tzi(σ, σt)

)= πi

(1 +

1tzi(σt, σ)

)

≤ πi

(1 +

1tH(σt, σ)

)≤ πi

(1 +

1t

(H(σt, π) +H(π, σ)))

≤ πi

(1 +

1t

(H(σ, π) +H(π, σ)))

.

Here we know that π ≥ 1K σ, and so by Lemma 5.6 we get

H(π, σ) ≤ (K − 1)H ,

and thus

σYi ≤

(1 +

1tKH

)πi.

¤Our next goal is to describe a rule giving a point that has a probability of at least (1− ε)πi of

being at state i. We assume that the starting distribution is already close to π in the total variationdistance (this can be achieved, by the above, using the averaging rule), and do another averaging.As before, let Y be chosen uniformly from 0, . . . , t−1. The main result is the following.

Theorem 4.7 Let 0 ≤ δ ≤ ε ≤ 1 and assume that d(σ, π) = ε and σ ≥ (1− δ)π. Then

σY ≥(

1− ε− δ

tH

)π.

30

Proof. Let α = σ\π and β = π\σ, so that we can write

σ = π + εα− δβ.

Clearly, the supports of α and β are disjoint, and hence σ ≥ (1 − δ)π implies that β ≤ (δ/ε)π.Clearly

σY = πY + εαY − εβY = π + εαY − εβY ≥ π − εβY .

Now by Theorem 4.6, we have

βY ≤(

1 +δ

εtH

)π ,

and hence

σY ≥(

1− ε− δ

tH

as claimed. ¤It follows that in order to fill π up to a factor 1 − ε, it suffices to choose two integers Y1

and Y2 uniformly and independently from 0, . . . ,H/ε, and then do Y1 + Y2 steps; symbolically,Hε ≤ Bε ≤ H/ε.

This result is not entirely satisfactory, however; one would like to see that the error diminishesexponentially with t or, in other words, that the time needed is proportional to log(1/ε) rather thanto 1/ε. Next we describe another simple averaging rule that achieves this. The result below alsofollows by adaptation of the “multiplicativity property” in Aldous [1].

Let M > 0, t = 8dHe, and let X be the sum of M independent random variables Y1, . . . , YM ,distributed uniformly in 0, . . . , t− 1. Stop at vX .

To analyze this rule, let Xk = Y1 + . . . Yk.

Lemma 4.8 For all k ≥ 1, we have

d(σXk , π) ≤ 2−(k+1)

and

σXk ≥(1− 2−(k−1)

)π.

Proof. We prove this inequality by induction on k. For k = 1 the inequality follows by Theorem4.4. Assume that k > 1. Then σXk can be obtained from σXk−1 by the uniform averaging rule, andhence we get by Theorem 4.7 that

σXk ≥(

1− 2−k − 182−(k−2)

)π > (1− 2−(k−1))π. (41)

On the other hand, we have by the induction hypothesis and Lemma 5.6 that

H(σXk−1 , π) ≤ 2−(k−2)H ,

and hence by Theorem 4.4,

d(σXk , π) ≤ 182−(k−2) = 2−(k+1).

This completes the induction. ¤Choose M = dlog(1/ε)e, and denote the resulting rule by Γε. Then we have, as an immediate

consequence of Lemma 4.8, the following theorem.

31

Theorem 4.9 For any starting distribution σ, the rule Γε produces a distribution τ satisfying

τ ≥ (1− ε)π ,

and has mean length O(H log(1/ε)).

Lemma 4.10 Assume that for some 0 < δ < 1, for any two states i and j, pij ≤ (1+ δ)πj. Let thestarting distribution σ satisfy σ > (1− ε)π for some 0 < ε < 1. Then

σ1 ≤ (1 + δε)π.

Proof. Consider the starting distribution ρ = (σ − (1 − ε)π)/ε. Then clearly ρ1 ≤ (1 + δ)π. Onthe other hand, ρ1 = (σ1 − (1− ε)π)/ε. Thus

σ1 − (1− ε)πε

≤ (1 + δ)π ,

whence the assertion follows immediately. ¤The following two assertions follow along the same lines:

Lemma 4.11 (a) Assume that for some 0 < ε < 1, for any two states i and j, pij ≥ (1− ε)πj. Letthe starting distribution σ satisfy σ > (1− ε)π. Then

σ1 ≥ (1− ε2)π.

(b) Assume that for some 0 ≤ δ < 1, for any two states i and j, pij ≤ (1 + δ)π. Let the startingdistribution σ satisfy σ < (1 + δ)π. Then

σ1 ≥ (1− δ2)π.

Corollary 4.12 Assume that for some 0 ≤ δ < 1, for any two states i and j, pij ≤ (1+δ)πj. Thenfor any starting distribution σ, and all k ≥ 2,

σk ≤ (1 + δ(1− δ2)k−2

)π.

Lemma 4.13 Suppose that there exists an averaging rule Y and a 0 < δ < 1 such that

σY ≤ (1 + δ)π

for every starting distribution σ. Then there exists an averaging rule W such that max W ≤ (2/δ)EYand

σW ≤ (1 + 3δ)π

for every starting distribution σ.

Proof. Let t be the smallest integer such that P(Y > t) ≤ δ/2. Define a non-negative integervalued random variable W by

P(W = k) = P(Y = k | Y ≤ t).

Then clearly

σWi = P(vW = i) ≤ σY

i

P(Y ≤ t)≤ σY

i

(1 + δ

1− δ/2

)π.

¤

32

5 *Groups of mixing times

Let σ and τ be two distributions on S, 0 ≤ ε ≤ 1.Recall that Hε(σ, τ, ) is the minimum mean length of any stopping rule Γ such that σΓ

i ≥ (1−ε)τi

for all i, and analagously for Hε, while Hε(σ, τ) requires d(σΓ, τ) ≤ ε. As usual, absence of the firstargument indicates worst-case starting distribution, so that for example

Hε(τ) := maxσHε(σ, τ).

We can define similar approximate versions of the forget time, by Fε := minτ Hε(τ) etc. Ofcourse, when ε = 0 all these versions are the same:

H0 = H0 = H0 = H

and

F0 = F0 = F0 = F .

It is clear that

Hε(σ, τ) ≤ Hε(σ, τ) ≤ H(σ, τ)

and hence

Hε ≤ Hε ≤ H

and

Fε ≤ Fε ≤ F .

Similar inequalities hold for the disperse times.We can also restrict the rules used in the definition of these mixing times, to get closer to

practically implementable algorithms. To the letter B, reserved for blind rules, we can add the evenmore restrictive U for uniform averaging rules (Υ). The exact forms B(σ, τ) and U(σ, τ) are notgenerally defined but the approximate forms, e.g. Uε, make sense.

From matrix algebra we easily derive:

Theorem 5.1

←−B ε = Bε

and←−U ε = Uε.

At this point, the distinction between mixing in the “filling”, “disperse” and “total variation”sense may seem pedantic, but in fact these three mixing measures behave quite differently.

The next lemma describes a folklore procedure to access a distribution, if we can access a positivefraction of it.

Lemma 5.2 For every two distributions τ and σ, and every 0 ≤ ε < 1,

H(σ, τ) ≤ Hε(σ, τ) +ε

1− εHε(τ).

Proof. Let σ be any starting distribution, and consider the following stopping rule: follow anoptimal stopping rule Γ rule from σ to σΓ such that σΓ > (1− ε)τ . We get a random state w from

33

distribution σΓ. We flip a biased coin and stop with probability (1− ε)τw/σΓw (by our assumption

on Γ, this is at most 1). The probability that we stop at state i is

σΓi

(1− ε)τi

σΓi

= (1− ε)τi ,

and hence the probability that we stop at all is 1− ε.If we do not stop, the continuation of the walk can be considered as a walk starting from the

distribution σ′ = 1εσΓ − 1−ε

ε τ . We follow an optimal stopping rule Γ′ from σ′ to (σ′)Γ′, such that

(σ′)Γ′> (1−ε)τ , to get a random state j; there we stop with probability (1−ε)τj/(σ′)Γ

′j , else follow

an optimal stopping rule Γ′′ from σ′′ = 1ε (σ′)Γ

′ − 1−εε τ to get close to τ etc. The probability that

we stop (eventually) at state i is

(1− ε)τi + ε(1− ε)τi + ε2(1− ε)τi + · · · = τi.

These numbers add up to 1, hence we stop with probability 1. The expected number of steps is

EΓ + εEΓ′ + ε2EΓ′′ + · · · ≤ 11− ε

Hε(τ).

¤

Corollary 5.3 For any target distribution τ and 0 ≤ ε < 1,

H(τ) ≤ 11− ε

Hε(τ).

Another corollary concerns exact access times, and will be useful later.

Corollary 5.4 Let τ and τ ′ be distributions and 0 < ε < 1 such that τ ′ ≥ (1− ε)τ . Then for anystarting distribution σ,

H(σ, τ) ≤ H(σ, τ ′) +ε

1− εH(τ ′).

In particular,

H(τ) ≤ 11− ε

H(τ ′).

Another simple lemma:

Lemma 5.5 For any three distributions σ and σ and τ there exists a distribution τ such that

d(τ, τ) ≤ d(σ, σ)

and

H(σ, τ) ≤ H(σ, τ).

Another way of saying this is that if d(σ, σ) = ε then for every τ ,

Hε(σ, τ) ≤ H(σ, τ).

Proof. Let Γ be an optimal stopping rule from σ to τ . Let v0 be a state from distribution σ andv0, a state from distribution σ, coupled so that

P(v0 = v0) ≥ 1− ε.

34

Define a new stopping rule Γ by

Γ =

Γ, if v0 = v0,,0 otherwise.

Then vΓ = vΓ with probability at least 1− ε, and so

d(σΓ, σΓ) = d(σΓ, τ) ≤ ε.

Since trivially EΓ ≤ EΓ, this proves the lemma. ¤Let us state here a simple lemma analogous to Corollary 5.4 in the sense that it relates access

times if the starting distributions are close. Recall that M(τ) is the maximum over σ of M(σ, τ),which in turn is the least maximum number of steps taken by any stopping rule from σ to τ .

Lemma 5.6 Assume that σ′ ≥ (1− ε)σ. Then for every target distribution τ ,

H(σ′, τ) ≤ (1− ε)H(σ, τ) + εM(τ).

In particular,

H(σ′, σ) ≤ εM(σ).

Proof. This follows from the decomposition

σ = (1− ε)ρ + ε

(1εσ −

(1ε− 1

).

and the convexity of H(σ, τ) as a function of σ. ¤

5.1 The relaxation group

We now consider the effect of a “warm start,” where the starting distribution σ is required to bebounded by a constant multiple of the stationary distribution. We will indicate a warm start in ourmixing time notation by placing a tilde over the caligraphic letter, but for the value of the constantthe reader will need to refer to the precise definitions.

Let

Z := maxσ≤2π

12‖z(σ, π)‖.

Let

Hε := H(≤ 2π, d|eps(π)) = maxσ

σ≤2π

minτ

d(τ,π)<c

H(σ, τ) ,

and

H := H0.

Let 1 = λ1 ≥ λ2 > · · · ≥ λn denote the eigenvalues of the matrix L = (1/4)√

D(M+I)TD−1(M+I)√

D, where D = diag(1/π1, . . . , 1/πn) is the diagonal matrix of return times. The relaxation timeis defined by

L =1

1− λ2

(we use L to remind us of λ).

35

We say that two random variables X and Y (with values from a set V ) are ε-independent, if forevery two sets A ⊆ V and B ⊆ V , we have

∣∣∣P(X ∈ A, Y ∈ B)− P(X ∈ A)P(Y ∈ B)∣∣∣ ≤ ε.

This is a very weak notion of independence, but in some applications of Markov chain techniquesto sampling [19], this is exactly what is needed.

Let the warm reset time Iε be the smallest t such that for every starting distribution σ withσ ≤ (1 + ε)π, there exists a stopping rule Γ with EΓ ≤ t such that if v0 is from σ then vΓ isε-independent of the starting state, and σΓ ≤ (1 + ε)π.

Theorem 5.7 We have

Hε ≤ ln(1/ε)L

and

I16ε ≤ 1εHε ≤ Z.

For two distributions α, β on V such that β > 0, we define their χ2-distance by

χ2(α, β) =∑

i

(αi − βi)2

βi=

i

α2i

βi− 1

(this is not a proper distance function since it is not symmetric, but this will not matter). Notethat trivially χ2 gives an upper bound on the total variation distance:

d(α, β) ≤ 12χ2(α, β).

The χ2-distance is particularly well suited for the study of convergence to the stationary distribution,because of the following nice property proved by Fill [17]:

Lemma 5.8 For every starting distribution σ we have

χ2(σ + σ1

2, π) ≤ λ2χ

2(σ, π).

Corollary 5.9 Let t ≥ L. Let Y be the sum of t independent coin flips. Then

χ2(σY , π) <1eχ2(σ, π).

Corollary 5.10 Let t ≥ kL. Let Y be the sum of t independent coin flips. Assume that σ ≤ 2π.Then

χ2(σY , π) <1ek

.

Corollary 5.11 Let Y be the sum of t independent coin flips. Assume that σ ≤ 2π. Then

d(σY , π) ≤ λt2.

Corollary 5.12

Hε ≤ (ln ε)L.

36

Lemma 5.13 Let Y be chosen uniformly from the set 0, 1, . . . , t− 1. Then

d(σY , π) ≤ 1tZ.

Proof.

σYi − πi =

1tπizi(σ, σt) ,

and so with an appropriate A ⊆ V ,

d(σY , π) =∑

i∈A

(σYi − πi) =

i∈A

1tπizi(σ, σt) ≤ 2

tZ.

¤Defining the warm uniform mixing time Uε in a manner analagous to Hε but for uniform stopping

rules, we have

Corollary 5.14

Uε ≤ εZ.

The next lemma bounds Hε(ρ, π) if we only know that ρ ≤ rπ for some c > 2.

Lemma 5.15 Let ρ be a distribution such that ρ ≤ rπ for some c ≥ 2. Then

Hε(ρ, π) ≤ (c− 1)Hε/(2c−3).

Proof. Let

h =ε

2c− 3and α =

ρ + (c− 2)πc− 1

.

Then α is a distribution with α ≤ 2π, and hence there exists a distribution τ such that d(τ, π) = hand H(α, τ) ≤ Hh. Then we can write τ = π − hβ + hγ for some distributions β, γ.

Let

u =1

1 + h(c− 2),

and consider the distributions ρ′ = uρ + (1− u)β and τ ′ = uτ + (1− u)γ. Then

ρ′ − τ ′ = u(ρ− τ) + (1− u)(β − γ) =c− 1

1 + h(c− 2)(α− τ).

Hence

H(ρ′, τ ′) =c− 1

1 + h(c− 2)H(α, τ) < (c− 1)Hh.

But d(ρ, ρ′) ≤ 1 − u < h(c − 2), and thus by Lemma 5.5, there exists a distribution τ ′′ such thatd(τ ′′, τ ′) < h(c− 2) and H(ρ, τ ′′) ≤ H(ρ′τ ′). Now we have

d(τ ′′, π) ≤ d(τ ′′, τ ′) + d(τ ′, τ) + d(τ, π) ≤ h(c− 2) + h(c− 2) + h = (2c− 3)h = ε.

Thus

Hε(ρ, π) ≤ H(ρ, τ ′′) ≤ H(ρ′, τ ′) < (c− 1)Hh.

¤

37

Lemma 5.16 Let 0 < ε < 1 and t ≥ (1/ε)Hε. Let v0 be chosen from a starting distribution σ ≤ 2πand let Y be chosen uniformly from 0, . . . , t− 1. Then v0 and vY are (16ε)-independent.

Note that Theorem 4.4 implies that the distribution of vY is closer than 2ε to π in total variationdistance.

Proof. The assertion is trivial if σA < 16ε, so suppose that σA ≥ 16ε. Then we have∣∣∣P(v0 ∈ A, vY ∈ B)− P(v0 ∈ A)P(vY ∈ B)

∣∣∣ = σA

∣∣∣P(wY ∈ B | w0 ∈ A)− P(wY ∈ B)∣∣∣

= σA

∣∣∣ρY (B)− σY (B)∣∣∣ ≤ σAd(ρY , σY )

where ρ is the restriction of σ to A: ρS = σS∩A/σA. Notice that ρ ≤ cπ, where c = 2/σA ≤ 1/(8ε).By Theorem 4.4, we have

d(σY , π) ≤ 2cε +1tH2cε(σ, π)

and

d(ρY , π) ≤ 2cε +1tH2cε(σ, π).

Here

H2cε(σ, π) ≤ H2cε < Hε.

Furthermore, by Lemma 5.15,

H2cε(ρ, π) ≤ (c− 1)H2cε/(2c−3) < (c− 1)Hε.

Hence

d(ρY , σY ) ≤ d(ρY , π) + d(σY , π) ≤ 4cε +1tHε +

1t(c− 1)Hε ≤ 4cε +

c

tHε ≤ 8cε ,

and thus∣∣∣P(v0 ∈ A, vY ∈ B)− P(v0 ∈ A)P(vY ∈ B)

∣∣∣ ≤ σAd(ρY , σY ) ≤ 2c8cε = 16ε.

¤

Corollary 5.17

I16ε ≤ 1εHε.

5.2 The forget group

This group contains a number of interesting invariants: the forget time F ; the discrepancy Z; forany fixed ε > 0, the approximate mixing time Hε; and both approximate forget times Fε and Fε.To these we add a new parameter, the “set access time” defined below. These parameters are allwithin constant factors of each other (depending only on ε). Note that the approximate mixingtime Hε is not in this group.

The set access time S is the mean number of steps required to hit the “toughest” set of states(adjusted by the stationary probability of the set) from the worst start:

S := maxs∈V, U⊆V

πUH(s, U).

38

Theorem 5.18 For every finite Markov chain and 0 < ε < 1/2, we have the following inequalities.

(1− 2ε)Z ≤ Fε ≤ Fε ≤ 2Hε/2 ≤4εS ≤ 4

εZ.

In addition,

Z ≤ F ≤ 11− ε

Fε.

Corollary 5.19

S ≤ Z ≤ F ≤ 16S.

We break down the proof into several steps, one for each inequality claimed. The first step is(up to the constant) a stronger version of (19):

Lemma 5.20 Let 0 < ε < 1/2, then

Z ≤ 11− 2ε

Fε.

Proof. Let ρ be any distribution and let σ be a distribution maximizing ‖z(σ, π)‖. By definition,there exists a distribution µ such that d(µ, ρ) ≤ ε and H(σ, µ) ≤ Hε(ρ). Then for an appropriateset A ⊆ V , we have

Z =∑

i∈A

πizi(σ, π) =∑

i∈A

πizi(σ, µ) +∑

i∈A

πizi(µ, ρ).

Here by (19),∑

i∈A

πizi(σ, µ) =∑

i∈V \Aπizi(µ, σ) ≤ (1− πA)H(σ, µ) ≤ (1− πA)Hε(ρ).

To estimate the other term, we have by (13)

i∈A

πizi(µ, ρ) ≤ 12‖z(µ, ρ)‖π ≤ 2d(µ, ρ)Z ≤ 2εZ.

Hence

Z ≤ Hε(ρ) + 2εZ ,

and the lemma follows. ¤The inequality

Fε ≤ Fε

is trivial.

Lemma 5.21 For ε < 1/2,

F2ε ≤ 2Hε.

Proof. For each state k, let τ(k) be a distribution such that d(τ(k), π) ≤ ε and H(k, τ(k)) ≤ Hε.Let

fi = minµ

k

τ(k)iµk ,

39

where µ ranges over all distributions with d(µ, π) ≤ ε. Let µ(i) be the distribution achievingthis minimum. Write µ(i)k = πk + α(i)k and τ(k)i = πi + β(k)i. Let a(i)k = min(0, α(i)k) andb(k)i = min(0, β(k)i). Then

τ(k)iµ(i)k ≥ (πi + a(i)k)(πk + b(k)i) ≥ πiπk + πia(i)k + πkb(k)i

and hence∑

i

fi =∑

i,k

τ(k)iµ(i)k ≥ 1 +∑

i

πi

k

a(i)k +∑

k

πk

i

b(k)i.

Now here, for any fixed i,∑

k

a(i)k = −d(π, µ(i)) ≥ −ε ,

and similarly∑

i

b(k)i = −d(π, τ(k)) ≥ −ε.

Hence∑

i

fi ≥ 1−∑

i

πiε−∑

k

πkε = 1− 2ε.

Consider the distribution

ρi = fi

/∑

j

fj ,

and the following stopping rule: starting at state k, follow an optimal rule from k to τ(k); if youend at j, then follow an optimal rule from j to τ(j). Let θ(k) be the distribution produced. Then

θ(k)i =∑

j

τ(k)jτ(j)i ≥ fi ≥ (1− 2ε)ρi.

Thus H2ε(ρ) ≤ 2Hε. ¤

Lemma 5.22 For every 0 < ε < 1,

εHε ≤ S.

Proof. We want to prove that for any starting state s,

Hε(s, π) ≤ 1εS.

Consider the chain rule from s to π. There exists a labelling 0, . . . , n − 1 of the nodes and aprobability distribution ρ such that selecting Sk = k, k+1, . . . , n−1 with probability ρk and thenwalking until Sk is reached, we generate π. Choose k such that πSk+1 ≤ ε < πSk

, and consider themodified chain rule Ak that selects Sj with probability ρj for j = 1, . . . , k − 1, but selects Sk withthe remaining probability ρ′k = 1− ρ1 − · · · − ρk−1. Let this rule generate distribution τ(k). Thenτ(k)i = πi for i /∈ Sk, and τ(k)k ≥ πk. Hence

d(π, τ(k)) ≤ πSk+1 ≤ ε.

The mean length of this rule is

k−1∑

j=1

ρjH(s, Sj) + ρ′kH(s, Sk) ≤ H(s, Sk) ≤ 1πSk

S <1εS.

¤

40

Lemma 5.23

S ≤ Z.

Proof. Let σ be any distribution and let A ⊆ V . Let ρ be the distribution of the first elementin A, starting the chain from σ. Note that H(σ, ρ) = H(σ,A) is just the expected number of stepsbefore hitting A, starting from σ. Also note that trivially xi(σ, ρ) = 0 for i ∈ A, and hence

H(σ, ρ) = −zi(σ, ρ) = zi(ρ, σ).

Hence

πAH(σ,A) =∑

i∈A

piizi(ρ, σ) ≤ Z.

¤To complete the proof of Theorem 5.18, it suffices to notice that letting ε → 0 in Lemma 5.20

we get

Z ≤ F ,

while Corollary 5.3 implies that

F ≤ H(ρ) ≤ 11− ε

Hε(ρ) =1

1− εFε.

5.3 The reset group

Let←−Z = max

v∈VS⊆V

i∈S

←−z i(π, v)

be the discrepancy of the reverse chain.

Theorem 5.24

132I ≤ 1

2←−Z ≤ H ≤ 2I.

Proof. The first inequality is immediate by Corollary 5.19 and Theorem 3.6. To prove the second,let u, v ∈ V and S ⊂ V attain the maximum in the definition of

←−Z . Let A = i ∈ V : H(i, v) ≥

H(π, v).Case 1. Assume that πA ≥ 1/2. Define

ρi =

πi/πA, if i ∈ A,0 otherwise.

Then ρ is a distribution with ρ ≤ 2π. Moreover,

H(ρ, π) ≥ H(ρ, v)−H(π, v) =∑

i∈A

πi

πA

(H(i, v)−H(π, v))

=1

πA

i∈A

πizi(v, π) =1

πAZ > Z.

Case 2. Assume that πA < 1/2. Set t = (1− 2πA)/(1− πA) and define

ρi =

2πi, if i ∈ A,tπi, otherwise.

41

Then again ρ is a distribution with ρ ≤ 2π. Moreover,

H(ρ, π) ≥ H(ρ, v)−H(π, v)

=∑

i∈A

2πi

(H(i, v)−H(π, v))+

i∈V \Atπi

(H(i, v)−H(π, v))

= (2− t)∑

i∈A

πi

(H(i, v)−H(π, v))

=1

1− πA

i∈A

πizi(v, π) =1

1− πAZ > Z.

This proves the second inequality.Finally, the third inequality is easy. Consider the following rule from σ to π: look at the starting

state and follow an optimal rule from that state. Hence

H(σ, π) ≤∑

i

σiH(i, π) ≤ 2∑

i

πiH(i, π) = I.

¤

5.4 The mixing group

Theorem 5.25

Hε ≤ H ≤ 11− ε

Hε.

Proof. The first inequality is trivial; the second follows immediately from corollary 5.3. ¤

Theorem 5.26

H ≤ 2(S + I

).

Proof. Let S := j : H(j, π) ≤ 2I. Then

I =∑

i

πiH(i, π) ≥∑

i∈V \SπiH(i, π) ≥ π(V \ S)(2I),

and hence π(S) ≥ 1/2. Now for any starting state s, we can walk from s until we hit some statej ∈ S, and then follow an optimal rule from j to π. The expected length of this walk is at most

H(s, S) + 2I ≤ 1π(S)

S + 2I ≤ 2S + 2I.

¤Using the inequality S ≤ F (see corollary 5.19), we get

Corollary 5.27

maxF , I ≤ H ≤ 2(F + I

).

42

5.5 The maxing group

We return once more to M(σ, τ), the maximum length of a max-optimal stopping rule from σ to τ .

Theorem 5.28 Suppose that for any initial distribution σ,

4π/5 ≤ σYi ≤ 5π/4.

where Y is chosen uniformly from 0, . . . , t−1. Then for any σ there is a (blind) stopping rule Γwith σΓ = π and max(Γ) < 2t.

Proof. Replacing the transition matrix M by∑t−1

j=0 M j preserves the stationary distribution,while the condition of the theorem becomes

4π/5 ≤ σ1 ≤ 5π/4.

for all σ. It now suffices to produce a stopping rule Γ with σΓ = π and max(Γ) ≤ 2; to do this webound the exit frequencies xi(θ, π) when θ is close to π.

Note that for our new fast-mixing Markov chain, the conditions of Corollary *** obtain witht = 1 and ε = 1

5 , so for any σ, H(σ, π) ≤ M = H ≤ 54 . Suppose that 4π/5 ≤ θ ≤ 5π/4; then we

may substitute θ for τ and π for τ ′ in Corollary 5.4 to get

M(θ) ≤ 54M≤ 25

16.

Now we can apply Lemma 5.6 twice to get

H(θ, π) ≤ 15M≤ 1

4

and

H(π, θ) ≤ 15M(θ) ≤ 5

16.

It follows from (16) that for each state i,

yi(θ, π) = zi(θ, π) +H(θ, π) ≤ H(π, θ) +H(θ, π) ≤ 916

.

Let Θ be the unique mean-optimal threshold rule for initial distribution θ and target π, con-structed as in Section 2.7.3. From above we have that for every i,

θi ≥ 45πi >

916

πi ≥ xi(θ, π)

so that the exit frequency for every state is used up already on the first step of the construction. Itfollows that all of the thresholds are less than 1 and therefore max(Θ) ≤ 1.

We may now define Γ for arbitrary starting state σ simply by taking one step, then settingθ = σ1 and implementing Θ. ¤

Lemma 5.29 Let Φ : σ → τ and Ψ : σ → ρ be two stopping rules such that Φ ≤ Ψ for any walk.Then

H(τ, ρ) ≤ EΨ− EΦ.

Proof. Define a stopping rule from τ to ρ as follows: the starting state from distribution τ mayas well be generated by starting from σ and following Φ. But then we can just consider our walkas a continuation and stop it following the rule Ψ. This way we get a rule from τ to ρ with meanlength EΨ− EΦ. ¤

43

Lemma 5.30 Assume that t = M(σ, τ) is finite. Then

H(σ, τ) +H(τ, σt) ≤ t.

Proof. Since H(σ, τ) is the mean time of the threshold rule Θσ,τ , t is the mean time of the rule“walk t steps”, and Θσ,τ ≤ t, Lemma 5.29 implies that H(τ, σt) ≤ t−H(τ, σt). ¤

Theorem 5.31 Let t = M(σ, π). Choose a random starting state from σ. Choose a random integerY uniformly from the interval [t, 2t−1], and walk for Y steps. Then the probability of being at statei is at most 2πi.

Proof. The probability that we stop at state i is

1t

2t−1∑

k=t

σki = πi +

1tπizi(σt, σ2t) = πi − 1

tπizi(π, σt)− 1

tπizi(σ2t, π).

Here we use that

zi(π, σt) = yi(π, σt)−H(π, σt) ≥ −H(π, σt) ≥ H(σ, π)− t

(by Lemma 5.30), and

zi(σ2t, π) = yi(σ2t, π)−H(σ2t, π) ≥ −H(σ2t, π) ≥ −H(σ, π)

(by Lemma 4.2). Hence we get that

1t

2t−1∑

k=t

σki ≤ πi − 1

tπi(H(σ, π)− t) +

1tπiH(σ, π) = 2πi.

¤

5.6 A reverse inequality

For any state distribution µ, let µ := miniµi. The log of the smallest probability in the stationarydistribution links the maxing and relaxation times, and also the mixing and set access times.

Theorem 5.32

M≤ ln(1/π)L.

Proof. *** ¤

Theorem 5.33

H ≤ ln(1/π)S.

We prove the inequality in a more general form, fixing the starting distribution σ, and allowingan arbitrary target distribution. We define

S(σ, τ) = maxA⊆V

τAH(σ,A).

Then we have the following lemma.

Lemma 5.34 Let σ, τ be two arbitrary distributions; then

H(σ, τ) ≤ ln(1/τ)S(σ, τ).

44

Proof. We use the chain rule, more exactly Theorem 2.8. This implies that for every startingstate s there exists a labelling 0, . . . , n− 1 of the nodes and a probability distribution ρ such that

H(σ, τ) =∑

k

ρkH(σ, Sk), (42)

where Sk = k, k + 1, . . . , n − 1 and H(σ, S) denotes the expected number of steps before a walksarting from σ hits the set S. (One could formulate a similar expression for the exit frequencies.)Setting ωk =

∑m≥k ρm, (42) can be rewritten as

H(σ, τ) =∑

k

ωk(H(σ, Sk)−H(σ, Sk−1)). (43)

Now we use that

ωk ≤ τSk(44)

(if we choose any Sm, m ≥ k in the chain rule as our target set, we certainly end up in Sk), andhence

H(σ, τ) ≤∑

k

τSk(H(σ, Sk)−H(σ, Sk−1) =

k

τkH(σ, Sk) ≤ S(σ, τ)∑

k

τk

τk + · · ·+ τn.

Using the inequality x ≤ ln 1/(1− x), we get

H(σ, τ) ≤ S(σ, τ)∑

k

lnτk+1 + · · ·+ τn

τk + · · ·+ τn= ln(1/τn)Tset(σ, τ).

¤Using a similar argument, one could replace S by maxA⊆S πAH(π, A) in this theorem. This

latter quantity seems to be related to the “conductance” of the chain ([18]), but we have not beenable to prove an explicit connection.

The winning streak chain shows that the logarithmic factor in the upper bound in the theoremcannot be saved even if the target distribution is the stationary distribution π.

6 Estimating the mixing time

6.1 *A linear programming bound

Theorem 6.1 Assume that for some T ≥ 0 and for every u ∈ V , there exists a vector r ∈ IRV suchthat

r ≥ 0; (45)

j

pijrj − ri ≤ 1 (for all i 6= u); (46)

i

πiri ≤ T. (47)

Then H ≤ T .

45

Proof. The proof is a variation on the proof of Lemma 3.1. Let s be a pessimal starting state andu, a halting state from s to π. Let v0 = s, v1, . . . be a walk in the chain, and consider the randomvariables

Xt = rvt+1 − rvt + 1 +1πu

(i = u) ,

where the ri satisfy (45)-(47) for this choice of u. From (46) we get, for all i 6= u, that

E(Xt|vt = i) =∑

j

pijrj − ri − 1 ≥ 0.

Thus Xt is a supermartingale, and hence if T is the number of steps taken by some optimal stoppingrule from u to π, then the sum

∑T−1t=0 Xt has non-negative expectation. But (using that u is never

exited during the first T steps), this implies

0 ≤T−1∑t=0

EXt = E(rvT)− ru − ET =∑

i

πiri − ru −H(u, π) ≤ T−H.

¤

6.2 Conductance

Recall that the conductance of a Markov chain is defined by

φ = minA

∑i∈A

∑j∈V \A pijπi

πAπV \A,

where the minimum is extended over all non-empty proper subsets A of V . Standard results,first obtained by Jerrum and Sinclair [18], and extended in various directions ([25], [20]) use theconductance φ to bound the mixing time from above. Let us start from distribution σ. Define

K = maxi∈V

σi

πi.

Choose Y from the binomial distribution with mean t (in other words, do a lazy random walk for2t steps); then Corollary 1.5 in [20] says that

d(σY , π) ≤√

K

(1− φ2

8

)2t

.

Hence

Hε ≤ 4φ2

logK

ε. (48)

This implies for time-reversible chains, by Lemma 5.20, that

H ≤ 64φ2

log2π

.

Below we prove that this relation (with a better constant even) holds for all chains. We in factprove a more general result.

Define the conductance function by

Φ(x) = minS⊂V,

0<π(S)≤x

Q(S, S)π(S)π(S)

.

(here Q(S, T ) =∑

i∈S,j∈T Qij . Obviously, 0 < Φ(x) ≤ 2, and Φ(x) is monotone decreasing as afunction of x. For x ≥ 1/2, we have Φ(x) = Φ.

Our main theorem about mixing in general Markov chains is the following.

46

Theorem 6.2 The mixing time H of any Markov chain can be bounded by

H ≤ 30∫ 1

π

dx

xΦ(x)2

Corollary 6.3

H(σ, π) ≤ 30φ2

log1π

.

Lemma 6.4 Let 0 = y1 ≤ y2 ≤ · · · ≤ yn be the scaled exit frequencies of an optimal stopping rulefrom s to π. Let 1 ≤ k < m ≤ n, and set A = 1, . . . , k, and C = m, . . . , n. Then

ym − yk ≤ π(A)Q(C, A)

Proof. It is easy to see that s = n. Let B = k + 1, . . . , m− 1. We start with the identity∑

i≤k

j>k

yjQji −∑

i≤k

j>k

yiQij = π(A), (49)

since the left hand side counts the expected number of steps from V \ A to A, less the expectednumber of steps from A to V \ A, when following an optimal stopping rule from s to π. Since wenever start in A but stop in A with probability π(A), this proves (49).

Now we estimate the left hand side of (49) as follows:∑

i≤k

j>k

yjQji ≥∑i≤kj≥m

ymQji +∑i≤k

k<j<m

ykQji = ymQ(C, A) + ykQ(B, A),

and∑

i≤k

j>k

yiQij ≤∑

i≤k

j>k

ykQij = ykQ(A,B ∪ C) = ykQ(B ∪ C, A).

Substituting in (49), we get

ymQ(C, A) + ykQ(B, A)− ykQ(B ∪ C, A) = (ym − yk)Q(C,A) ≤ π(A),

which proves the lemma. ¤Our other preliminary observation is that

H(s, π) =∑

i

πiyi =∑

j

(yj+1 − yj)π(> j). (50)

Proof. [of Theorem 6.2] Let 1 = m0 < m1 < · · · < mk < mk+1. Set Ti = [1..mi], T i = V \ Ti, andai = π(Ti). We choose the sequence (mi) recursively so that

ai+1 − πmi+1 < ai(1 +14Φ(ai)) ≤ ai+1.

We stop with

ak ≤ 1/2 < ak+1.

We bound a portion of the sum in (50) as follows:

mi+1−1∑

j=mi

(yj+1 − yj)π(> j) ≤ (1− ai)(ymi+1 − ymi)

47

≤ ai(1− ai)Q(T i+1 ∪mi+1, Ti)

.

Now here

Q(T i+1 ∪mi+1, Ti) = Q(T i, Ti)−Q(T i \ T i+1 \mi+1, Ti) ≥ Q(T i, Ti)− π(T i \ T i+1 \mi+1)

≥ Φ(ai)ai(1− ai)− ai+1 + πmi+1 + ai >12Φ(ai)ai(1− ai).

Hence we get

mi+1−1∑

j=mi

(yj+1 − yj)π(> j) ≤ 2Φ(ai)

.

Next we show that

2Φ(ai)

≤ 10∫ ai+1

ai

dx

xΦ(x)2.

Indeed,∫ ai+1

ai

dx

xΦ(x)2≥ 1

Φ(ai)2

∫ ai+1

ai

dx

x=

1Φ(ai)2

lnai+1

ai≥ 1

Φ(ai)2ln

(1 +

14Φ(ai)

)≥ 1

5Φ(ai)

(using that Φ(ai) ≤ 2). Summing up, we get

mk+1∑

j=1

(yj+1 − yj)π(> j) ≤ 10∫ ak+1

π

dx

xΦ(x)2< 10

∫ 1

π

dx

xΦ(x)2(51)

The estimate on the other half of the sum (50) is similar. We define a sequence n0 = n > n1 >· · · > nr, and set Si = [ni...n], Si = V \ Si and bi = π(Si). We choose the numbers ni recursivelyso that

bi+1 − πni+1 < bi(1 +14Φ(bi)) ≤ bi+1.

We stop with

br ≤ 1/2 < br+1.

Similarly as before, we consider the partial sum

ni−1∑

j=ni+1

(yj+1 − yj)π(> j) ≤ (bi+1 − pni+1)(yni − yni+1),

and we use again lemma 6.4 to estimate it:

≤ (bi+1 − pni+1)(1− bi)Q(Si+1 ∪ ni+1, Si)

Here

Q(Si+1 ∪ ni+1, Si) = Q(Si, Si)−Q(Si \ Si+1 \ ni+1, Si) ≥ Q(Si, Si)− π(Si+1 \ Si − πni+1)

≥ Φ(bi)bi(1− bi)− bi+1 + πni+1 + bi >12Φ(bi)bi(1− bi).

Hence

ni−1∑

j=ni+1

(yj+1 − yj)π(> j) ≤ 2(bi+1 − πni+1)bi

1Φ(bi)

48

As before,

2Φ(bi)

≤ 10∫ bi+1

bi

dx

xΦ(x)2,

Moreover,

bi+1 − πni+1 < bi(1 +14Φ(bi)) < 2bi.

Summing up, we get

n−1∑

j=nr+1

(yj+1 − yj)π(> j) ≤ 20∫ br+1

πn

dx

xΦ(x)2< 20

∫ 1

π

dx

xΦ(x)2. (52)

Since obviously nr+1 < mk+1, all terms in (50) have been accounted for, and it follows that

H ≤ 30∫ 1

π

dx

xΦ(x)2,

which proves the theorem. ¤

6.3 *Coupling

A coupling in a Markov chain is a sequence of random variables ((v0, w0), (v1, w1), . . . ) such thateach of the sequences (v0, v1, . . . ) and (w0, w1, . . . ) is a walk of the chain. For a given coupling, thecollision time is the expectation of the first index t with vt = wt. The coupling time C(α, β) betweentwo distributions α and β is the minimum collision time of couplings where v0 is from distributionα and w0 is from distribution β.

We shall only be concerned with Markovian coupling, which is defined as a Markov chain onV × V with tranisition probabilities pij,kl such that for all (i, j) ∈ V × V , and every l ∈ V , we have∑

k pij,kl = pjl and∑

l pij,kl = pik. Let D = (i, i) : i ∈ V be the diagonal. The collision time fora given Markovian coupling is the maximum hitting time to D from any starting state (i, j).

Lemma 6.5 For every two distributions,

‖z(α, β)‖π ≤ C(α, β).

Proof. Consider an optimal coupling for α and β, and let τ be the distribution of the state wherethe two walks first collide. The coupling rule defines a stopping rule Γ from α to τ (by constructingthe chain (w0, w1, . . . ) only in the background, to provide a stopping criterion), and similarly, itdefines a stopping rule Γ′ from β to τ . Trivially, EΓ = EΓ′ = C(α, β).

Thus we have

zi(α, β) = zi(α, τ)− zi(β, τ) =(yi(Γ)− EΓ

)− (yi(Γ′)− EΓ′

)= yi(Γ)− yi(Γ′) ≤ yi(Γ) ,

and hence

‖z(α, β)‖π ≤∑

i

yi(Γ) = EΓ = C(α, β).

¤

49

References

[1] D.J. Aldous and J. Fill, Reversible Markov Chains and Random Walks on Graphs (book), toappear. URL for draft at http://www.stat.Berkeley.EDU/users/aldous/book.html.

[2] D.J. Aldous, Some inequalities for reversible Markov chains, J. London Math. Soc. 25 (1982),564–576.

[3] D.J. Aldous, Applications of random walks on graphs, preprint (1989).

[4] D.J. Aldous, The random walk construction for spanning trees and uniform labelled trees,SIAM J. Discrete Math. 3 (1990), 450–465.

[5] D.J. Aldous and P. Diaconis, Shuffling cards and stopping times, Amer. Math. Monthly 93 #5(1986), 333–348.

[6] D.J. Aldous, L. Lovasz and P. Winkler, Mixing times for uniformly ergodic Markov chains,Stochastic Processes and their Applications, to appear.

[7] S. Asmussen, P.W. Glynn and H. Thorisson, Stationary detection in the initial transient prob-lem, ACM Transactions on Modeling and Computer Simulation 2 (1992), 130–157.

[8] J.R. Baxter and R.V. Chacon, Stopping times for recurrent Markov processes, Illinois J. Math.20 (1976), 467–475.

[9] D. Bayer and P. Diaconis, Trailing the dovetail shuffle to its lair, Ann. Appl. Probab. 2 (1992),294–313.

[10] A. Broder, Generating random spanning trees, Proc. 30th Annual Symp. on Found. of ComputerScience, IEEE Computer Soc. (1989), 442–447.

[11] R.V. Chacon and D.S. Ornstein, A general ergodic theorem, Illinois J. Math. 4 (1960), 153–160.

[12] A.K. Chandra, P. Raghavan, W.L. Ruzzo, R. Smolensky, and P. Tiwari, The Electrical Re-sistance of a Graph Captures its Commute and Cover Times, Proceedings of the 21st AnnualACM Symposium on Theory of Computing, May (1989).

[13] D. Coppersmith, P. Tetali, and P. Winkler, Collisions among Random Walks on a Graph, SIAMJ. on Discrete Mathematics 6 #3 (1993), 363–374.

[14] P.G. Doyle and J.L. Snell, Random Walks and Electric Networks, Mathematical Assoc. ofAmerica, Washington, DC 1984.

[15] L.E. Dubins, On a theorem of Skorokhod, Ann. Math. Statist. 39 (1968), 2094–2097.

[16] M. Dyer, A. Frieze and R. Kannan, A random polynomial time algorithm for estimating volumesof convex bodies, Proc. 21st Annual ACM Symposium on the Theory of Computing (1989), 375–381.

[17] J. Fill, ***

[18] M. Jerrum and A. Sinclair, Conductance and the rapid mixing property for Markov chains: theapproximation of the permanent resolved, Proc. 20nd Annual ACM Symposium on Theory ofComputing (1988), 235–243.

[19] R. Kannan, L. Lovasz and M. Simonovits, ***

[20] L. Lovasz and M. Simonovits, Random walks in a convex body and an improved volumealgorithm, Random Structures and Alg. 4 (1993), 359–412.

[21] L. Lovasz and P. Winkler, A note on the last new vertex visited by a random walk, J. GraphTheory 17 (1993), 593–596.

50

[22] L. Lovasz and P. Winkler, Exact mixing in an unknown Markov chain, Electronic J. Comb. 2,(1995), Paper R15.

[23] L. Lovasz and P. Winkler, Mixing of random walks and other diffusions on a graph, Surveys inCombinatorics, 1995, P. Rowlinson, ed., London Math. Soc. Lecture Note Series 218, CambridgeU. Press (1995), 119–154.

[24] L. Lovasz and P. Winkler, Reversal of Markov chains and the forget time, Combinatorics,Probability and Computing, to appear.

[25] M. Mihail, ***

[26] J.W. Pitman, Occupation measures for Markov chains, Adv. Appl. Prob. 9 (1977), 69–86.

[27] D.H. Root, The existence of certain stopping times on Brownian motion, Ann. Math. Statist.40 (1969), 715–718.

[28] A. Skorokhod, Studies in the Theory of Random Processes, orig. pub. Addison-Wesley (1965),2nd ed. Dover, New York 1982.

[29] P. Tetali, Random walks and effective resistance of networks, J. Theoretical Prob. #1 (1991),101–109.

51