Contents
1 Basic notions 1
1.1 A few definitions from measure theory . . . . . . . . . . . . . 1
1.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Some basic facts . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Convergence of random variables 11
2.1 Types of convergence . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Weak law of large numbers . . . . . . . . . . . . . . . . . . . . 13
2.3 The strong law of large numbers . . . . . . . . . . . . . . . . . 13
2.4 Techniques related to a.s. convergence . . . . . . . . . . . . . . 19
2.5 Uniform integrability . . . . . . . . . . . . . . . . . . . . . . . 22
3 Conditional expectations 25
4 Martingales 29
4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Stopping times . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Optional stopping . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Doob’s inequalities . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5 Martingale convergence theorems . . . . . . . . . . . . . . . . 34
4.6 Applications of martingales . . . . . . . . . . . . . . . . . . . 36
5 Weak convergence 41
6 Characteristic functions 47
6.1 Inversion formula . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.2 Continuity theorem . . . . . . . . . . . . . . . . . . . . . . . . 53
7 Central limit theorem 59
8 Gaussian sequences 67
9 Kolmogorov extension theorem 69
10 Brownian motion 73
10.1 Definition and construction . . . . . . . . . . . . . . . . . . . 73
10.2 Nowhere differentiability . . . . . . . . . . . . . . . . . . . . . 78
11 Markov chains 81
11.1 Framework for Markov chains . . . . . . . . . . . . . . . . . . 81
11.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
11.3 Markov properties . . . . . . . . . . . . . . . . . . . . . . . . . 87
11.4 Recurrence and transience . . . . . . . . . . . . . . . . . . . . 91
11.5 Stationary measures . . . . . . . . . . . . . . . . . . . . . . . 94
11.6 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Chapter 1
Basic notions
1.1 A few definitions from measure theory
Given a set X, a σ-algebra on X is a collection A of subsets of X such that
(1) ∅ ∈ A;
(2) if A ∈ A, then A^c ∈ A, where A^c = X \ A;
(3) if A_1, A_2, . . . ∈ A, then ∪_{i=1}^∞ A_i and ∩_{i=1}^∞ A_i are both in A.
A measure on a pair (X, A) is a function µ : A → [0, ∞] such that
(1) µ(∅) = 0;
(2) µ(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ(A_i) whenever the A_i are in A and are pairwise disjoint.
A function f : X → R is measurable if the set {x : f(x) > a} ∈ A for all a ∈ R.
A property holds almost everywhere, written “a.e.,” if the set where it fails has measure 0. For example, f = g a.e. if µ({x : f(x) ≠ g(x)}) = 0.
The characteristic function χ_A of a set A ∈ A is the function that is 1 when x ∈ A and is zero otherwise. A function of the form Σ_{i=1}^n a_i χ_{A_i} is called a simple function.
If f is simple and of the above form, we define
∫ f dµ = Σ_{i=1}^n a_i µ(A_i).
If f is non-negative and measurable, we define
∫ f dµ = sup{∫ s dµ : 0 ≤ s ≤ f, s simple}.
Provided ∫ |f| dµ < ∞, we define
∫ f dµ = ∫ f⁺ dµ − ∫ f⁻ dµ,
where f⁺ = max(f, 0) and f⁻ = max(−f, 0).
1.2 Definitions
A probability or probability measure is a measure whose total mass is one. Because the origins of probability are in statistics rather than analysis, some of the terminology is different. For example, instead of denoting a measure space by (X, A, µ), probabilists use (Ω, F, P). So here Ω is a set, F is called a σ-field (which is the same thing as a σ-algebra), and P is a measure with P(Ω) = 1. Elements of F are called events. Elements of Ω are denoted ω.
Instead of saying a property occurs almost everywhere, we talk about properties occurring almost surely, written a.s. Real-valued measurable functions from Ω to R are called random variables and are usually denoted by X or Y or other capital letters. We often abbreviate “random variable” by r.v.
We let A^c = {ω ∈ Ω : ω ∉ A} (called the complement of A) and B − A = B ∩ A^c.
Integration (in the sense of Lebesgue) is called expectation or expected value, and we write EX for ∫ X dP. The notation E[X; A] is often used for ∫_A X dP.
The random variable 1_A is the function that is one if ω ∈ A and zero otherwise. It is called the indicator of A (the name characteristic function in probability refers to the Fourier transform). Events such as {ω : X(ω) > a} are almost always abbreviated by (X > a).
Given a random variable X, we can define a probability on R by
PX(A) = P(X ∈ A), A ⊂ R. (1.1)
The probability P_X is called the law of X or the distribution of X. We define F_X : R → [0, 1] by
FX(x) = PX((−∞, x]) = P(X ≤ x). (1.2)
The function FX is called the distribution function of X.
As an example, let Ω = {H, T}, let F be the collection of all subsets of Ω (there are 4 of them), and let P(H) = P(T) = 1/2. Let X(H) = 1 and X(T) = 0. Then P_X = (1/2)δ_0 + (1/2)δ_1, where δ_x is point mass at x, that is, δ_x(A) = 1 if x ∈ A and 0 otherwise. Also F_X(a) = 0 if a < 0, 1/2 if 0 ≤ a < 1, and 1 if a ≥ 1.
Proposition 1.1 The distribution function F_X of a random variable X satisfies:
(a) F_X is nondecreasing;
(b) F_X is right continuous with left limits;
(c) lim_{x→∞} F_X(x) = 1 and lim_{x→−∞} F_X(x) = 0.
Proof. We prove the first part of (b) and leave the others to the reader. If x_n ↓ x, then (X ≤ x_n) ↓ (X ≤ x), and so P(X ≤ x_n) ↓ P(X ≤ x) since P is a measure.

Note that if x_n ↑ x, then (X ≤ x_n) ↑ (X < x), and so F_X(x_n) ↑ P(X < x).

Any function F : R → [0, 1] satisfying (a)–(c) of Proposition 1.1 is called a distribution function, whether or not it comes from a random variable.
Proposition 1.2 Suppose F is a distribution function. There exists a random variable X such that F = F_X.
Proof. Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Define X(ω) = sup{x : F(x) < ω}. Here the Borel σ-field is the smallest σ-field containing all the open sets.

We check that F_X = F. Suppose X(ω) ≤ y. Then F(z) ≥ ω if z > y. By the right continuity of F we have F(y) ≥ ω. Hence (X(ω) ≤ y) ⊂ (ω ≤ F(y)).
Suppose ω ≤ F(y). If X(ω) > y, then by the definition of X there exists z > y such that F(z) < ω. But then ω ≤ F(y) ≤ F(z), a contradiction. Therefore (ω ≤ F(y)) ⊂ (X(ω) ≤ y).
We then have
P(X(ω) ≤ y) = P(ω ≤ F (y)) = F (y).
In the above proof, essentially X = F^{−1}. However F may have jumps or be constant over some intervals, so some care is needed in defining X.
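The construction in this proof, X(ω) = sup{x : F(x) < ω} with ω uniform on [0, 1], is exactly the inverse transform method for simulating a random variable with a given distribution function. A minimal Python sketch, not part of the original notes (the bisection search and the Exponential(1) example are illustrative choices):

```python
import math
import random

def inverse_transform_sample(F, lo, hi, tol=1e-9):
    """Draw X with distribution function F via X(omega) = sup{x : F(x) < omega},
    locating the sup by bisection; assumes F is near 0 at lo and near 1 at hi."""
    omega = random.random()  # omega uniform on (0, 1): Lebesgue measure on [0, 1]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < omega:
            lo = mid   # mid still lies in {x : F(x) < omega}
        else:
            hi = mid
    return lo

# Illustrative choice: F of an Exponential(1) variable, F(x) = 1 - e^{-x} for x > 0.
def F_exp(x):
    return 1 - math.exp(-x) if x > 0 else 0.0

random.seed(0)
samples = [inverse_transform_sample(F_exp, 0.0, 100.0) for _ in range(2000)]
print(sum(samples) / len(samples))  # close to EX = 1
```

For continuous strictly increasing F this X is just F^{−1}(ω); the sup handles jumps and flat stretches, as the preceding remark notes.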
Certain distributions or laws are very common. We list some of them.
(a) Bernoulli. A random variable is Bernoulli if P(X = 1) = p, P(X =0) = 1− p for some p ∈ [0, 1].
(b) Binomial. This is defined by P(X = k) = (n choose k) p^k (1 − p)^{n−k}, where n is a positive integer, 0 ≤ k ≤ n, and p ∈ [0, 1].
(c) Geometric. For p ∈ (0, 1) we set P(X = k) = (1 − p)p^k. Here k is a nonnegative integer.
(d) Poisson. For λ > 0 we set P(X = k) = e^{−λ} λ^k / k!. Again k is a nonnegative integer.
(e) Uniform. For some positive integer n, set P(X = k) = 1/n for 1 ≤ k ≤ n.
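As a quick sanity check (not in the original notes), each of the probability mass functions above has total mass one; the parameter values below are arbitrary illustrative choices, and the geometric and Poisson sums are truncated:

```python
import math

# Illustrative parameters; geometric and Poisson sums truncated at 100 terms.
p, n, lam = 0.3, 10, 2.0

bernoulli = [1 - p, p]
binomial  = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
geometric = [(1 - p) * p**k for k in range(100)]   # P(X = k) = (1 - p) p^k as above
poisson   = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(100)]

for name, pmf in (("Bernoulli", bernoulli), ("binomial", binomial),
                  ("geometric", geometric), ("Poisson", poisson)):
    print(name, sum(pmf))  # each total mass is 1 up to truncation and rounding
```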
Suppose F is absolutely continuous. This is not the definition, but being absolutely continuous is equivalent to the existence of a function f such that
F(x) = ∫_{−∞}^x f(y) dy
for all x. We call f = F′ the density of F. Some examples of distributions characterized by densities are the following.
(f) Uniform on [a, b]. Define f(x) = (b − a)^{−1} 1_{[a,b]}(x). This means that if X has a uniform distribution, then
P(X ∈ A) = ∫_A (1/(b − a)) 1_{[a,b]}(x) dx.
(g) Exponential. For x > 0 let f(x) = λe^{−λx}.
(h) Standard normal. Define f(x) = (1/√(2π)) e^{−x²/2}. So
P(X ∈ A) = (1/√(2π)) ∫_A e^{−x²/2} dx.
(i) N(µ, σ²). We shall see later that a standard normal has mean zero and variance one. If Z is a standard normal, then a N(µ, σ²) random variable has the same distribution as µ + σZ. It is an exercise in calculus to check that such a random variable has density
(1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}. (1.3)
(j) Cauchy. Here
f(x) = 1/(π(1 + x²)).
1.3 Some basic facts
We can use the law of a random variable to calculate expectations.
Proposition 1.3 If g is nonnegative or if E|g(X)| < ∞, then
E g(X) = ∫ g(x) P_X(dx).
Proof. If g is the indicator of an event A, this is just the definition of P_X. By linearity, the result holds for simple functions. By the monotone convergence theorem, the result holds for nonnegative functions, and by writing g = g⁺ − g⁻, it holds for g such that E|g(X)| < ∞.
If F_X has a density f, then P_X(dx) = f(x) dx. So, for example, EX = ∫ x f(x) dx and EX² = ∫ x² f(x) dx. (We need E|X| finite to justify this if X is not necessarily nonnegative.) We define the mean of a random variable to be its expectation, and the variance of a random variable is defined by
Var X = E(X − EX)².
For example, it is routine to see that the mean of a standard normal is zero and its variance is one.

Note
Var X = E(X² − 2X EX + (EX)²) = EX² − (EX)².
Another equality that is useful is the following.
Proposition 1.4 If X ≥ 0 a.s. and p > 0, then
EX^p = ∫_0^∞ pλ^{p−1} P(X > λ) dλ.
The proof will show that this equality is also valid if we replace P(X > λ) by P(X ≥ λ).
Proof. Use Fubini's theorem and write
∫_0^∞ pλ^{p−1} P(X > λ) dλ = E ∫_0^∞ pλ^{p−1} 1_{(λ,∞)}(X) dλ = E ∫_0^X pλ^{p−1} dλ = EX^p.
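The identity just proved can be checked numerically. As an illustration (not part of the notes), take X Exponential(1), so P(X > λ) = e^{−λ} and EX² = 2; a crude Riemann sum of the right-hand side with p = 2 recovers this:

```python
import math

# X ~ Exponential(1): P(X > lam) = e^{-lam} and EX^2 = 2.  Evaluate the
# right-hand side of Proposition 1.4 for p = 2 by a left Riemann sum on [0, 40].
p = 2
dx = 1e-4
rhs = sum(p * (i * dx) ** (p - 1) * math.exp(-i * dx) * dx
          for i in range(1, 400001))
print(rhs)  # close to EX^2 = 2
```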
We need two elementary inequalities.
Proposition 1.5 (Chebyshev's inequality) If X ≥ 0,
P(X ≥ a) ≤ EX/a.
Proof. We write
P(X ≥ a) = E[1_{[a,∞)}(X)] ≤ E[(X/a) 1_{[a,∞)}(X)] ≤ EX/a,
since X/a is at least 1 when X ∈ [a, ∞).
If we apply this to X = (Y − EY)², we obtain
P(|Y − EY| ≥ a) = P((Y − EY)² ≥ a²) ≤ Var Y/a². (1.4)
This special case of Chebyshev's inequality is sometimes itself referred to as Chebyshev's inequality, while Proposition 1.5 is sometimes called the Markov inequality.
The second inequality we need is Jensen's inequality, not to be confused with Jensen's formula from complex analysis.
Proposition 1.6 Suppose g is convex and X and g(X) are both integrable. Then
g(EX) ≤ E g(X).
Proof. One property of convex functions is that they lie above their tangent lines, and more generally their support lines. So if x_0 ∈ R, we have
g(x) ≥ g(x0) + c(x− x0)
for some constant c. Take x = X(ω) and take expectations to obtain
E g(X) ≥ g(x0) + c(EX − x0).
Now set x0 equal to EX.
If A_n is a sequence of sets, define (A_n i.o.), read “A_n infinitely often,” by
(A_n i.o.) = ∩_{n=1}^∞ ∪_{i=n}^∞ A_i.
This set consists of those ω that are in infinitely many of the An.
A simple but very important proposition is the Borel-Cantelli lemma. It has two parts, and we prove the first part here, leaving the second part to the next section.
Proposition 1.7 (Borel-Cantelli lemma) If Σ_n P(A_n) < ∞, then P(A_n i.o.) = 0.
Proof. We have
P(A_n i.o.) = lim_{n→∞} P(∪_{i=n}^∞ A_i).
However,
P(∪_{i=n}^∞ A_i) ≤ Σ_{i=n}^∞ P(A_i),
which tends to zero as n → ∞.
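A small simulation (an illustration, not from the notes) of the first Borel-Cantelli lemma: take independent events A_n with P(A_n) = 1/n², so Σ P(A_n) < ∞, and observe that only finitely many A_n occur along a sampled ω:

```python
import random

random.seed(1)
# Independent events A_n with P(A_n) = 1/n^2, so sum_n P(A_n) < infinity.
N = 100000
occurred = [n for n in range(1, N + 1) if random.random() < 1 / n**2]
print(occurred)  # a short list: only finitely many A_n occur along this omega
```

Note A_1 always occurs since P(A_1) = 1; the point is that the occurrences stop.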
1.4 Independence
Let us say two events A and B are independent if P(A ∩ B) = P(A)P(B). The events A_1, . . . , A_n are independent if
P(A_{i_1} ∩ A_{i_2} ∩ · · · ∩ A_{i_j}) = P(A_{i_1})P(A_{i_2}) · · · P(A_{i_j})
for every subset {i_1, . . . , i_j} of {1, 2, . . . , n}.
Proposition 1.8 If A and B are independent, then A^c and B are independent.
Proof. We write
P(A^c ∩ B) = P(B) − P(A ∩ B) = P(B) − P(A)P(B) = P(B)(1 − P(A)) = P(B)P(A^c).
We say two σ-fields F and G are independent if A and B are independent whenever A ∈ F and B ∈ G. The σ-field generated by a random variable X, written σ(X), is given by {(X ∈ A) : A a Borel subset of R}. Two random variables X and Y are independent if the σ-field generated by X and the σ-field generated by Y are independent. We define the independence of n σ-fields or n random variables in the obvious way.
Proposition 1.8 tells us that A and B are independent if the random variables 1_A and 1_B are independent, so the definitions above are consistent.
If f and g are Borel functions and X and Y are independent, then f(X) and g(Y) are independent. This follows because the σ-field generated by f(X) is a sub-σ-field of the one generated by X, and similarly for g(Y).
Let F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) denote the joint distribution function of X and Y. (The comma inside the set means “and.”)
Proposition 1.9 F_{X,Y}(x, y) = F_X(x)F_Y(y) for all x and y if and only if X and Y are independent.
Proof. If X and Y are independent, then 1_{(−∞,x]}(X) and 1_{(−∞,y]}(Y) are independent by the above comments. Using the definition of independence, this shows F_{X,Y}(x, y) = F_X(x)F_Y(y).
Conversely, suppose the equality holds. Fix y and let M_y denote the collection of sets A for which P(X ∈ A, Y ≤ y) = P(X ∈ A)P(Y ≤ y). M_y contains all sets of the form (−∞, x]. It follows by linearity that M_y contains all sets of the form (x, z], and then by linearity again, all sets that are finite unions of such half-open, half-closed intervals. Note that the collection of finite unions of such intervals, A, is an algebra generating the Borel σ-field. It is clear that M_y is a monotone class, so by the monotone class lemma, M_y contains the Borel σ-field.
For a fixed set A, let M_A denote the collection of sets B for which P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B). Again, M_A is a monotone class and by the preceding paragraph contains the σ-field generated by the collection of finite unions of intervals of the form (x, z], hence contains the Borel sets. Therefore X and Y are independent.
The following is known as the multiplication theorem.
Proposition 1.10 If X, Y, and XY are integrable and X and Y are independent, then E[XY] = (EX)(EY).
Proof. Consider the random variables in σ(X) (the σ-field generated by X) and σ(Y) for which the multiplication theorem is true. It holds for indicators by the definition of X and Y being independent. It holds for simple random variables, that is, linear combinations of indicators, by linearity of both sides.
It holds for nonnegative random variables by monotone convergence. And it holds for integrable random variables by linearity again.
If X_1, . . . , X_n are independent, then so are X_1 − EX_1, . . . , X_n − EX_n. Assuming everything is integrable,
E[(X_1 − EX_1) + · · · + (X_n − EX_n)]² = E(X_1 − EX_1)² + · · · + E(X_n − EX_n)²,
using the multiplication theorem to show that the expectations of the cross product terms are zero. We have thus shown
Var(X_1 + · · · + X_n) = Var X_1 + · · · + Var X_n. (1.5)
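Equation (1.5) can be checked by Monte Carlo (an illustration, not from the notes); the three distributions below are arbitrary choices, drawn independently:

```python
import math
import random

random.seed(2)
TRIALS, n = 200000, 3

def var(xs):
    """Empirical variance of a sample."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Three independent draws per trial: uniform on [0,1], Exponential(1), fair coin.
samples = [(random.random(), -math.log(1.0 - random.random()), random.choice((0, 1)))
           for _ in range(TRIALS)]
sum_of_vars = sum(var([s[i] for s in samples]) for i in range(n))
var_of_sum = var([s[0] + s[1] + s[2] for s in samples])
print(sum_of_vars, var_of_sum)  # agree up to Monte Carlo error, per (1.5)
```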
We finish up this section by proving the second half of the Borel-Cantelli lemma.
Proposition 1.11 Suppose A_n is a sequence of independent events. If we have Σ_n P(A_n) = ∞, then P(A_n i.o.) = 1.
Note that here the An are independent, while in the first half of the Borel-Cantelli lemma no such assumption was necessary.
Proof. Note
P(∪_{i=n}^N A_i) = 1 − P(∩_{i=n}^N A_i^c) = 1 − Π_{i=n}^N P(A_i^c) = 1 − Π_{i=n}^N (1 − P(A_i)).
By the mean value theorem, 1 − x ≤ e^{−x}, so the right hand side is greater than or equal to 1 − exp(−Σ_{i=n}^N P(A_i)). As N → ∞, this tends to 1, so P(∪_{i=n}^∞ A_i) = 1. This holds for all n, which proves the result.
Chapter 2
Convergence of random variables
2.1 Types of convergence
In this section we consider three ways a sequence of random variables X_n can converge.

We say X_n converges to X almost surely if
P(X_n ↛ X) = 0.

X_n converges to X in probability if for each ε > 0,
P(|X_n − X| > ε) → 0
as n → ∞.

X_n converges to X in L^p if
E|X_n − X|^p → 0
as n → ∞.
The following proposition shows some relationships among the types of convergence.
Proposition 2.1 (1) If X_n → X a.s., then X_n → X in probability.
(2) If X_n → X in L^p, then X_n → X in probability.
(3) If X_n → X in probability, there exists a subsequence n_j such that X_{n_j} converges to X almost surely.
Proof. To prove (1), note X_n − X tends to 0 almost surely, so 1_{(−ε,ε)^c}(X_n − X) also converges to 0 almost surely. Now apply the dominated convergence theorem.
(2) comes from Chebyshev's inequality:
P(|X_n − X| > ε) = P(|X_n − X|^p > ε^p) ≤ E|X_n − X|^p/ε^p → 0
as n → ∞.
To prove (3), choose n_j larger than n_{j−1} such that P(|X_n − X| > 2^{−j}) < 2^{−j} whenever n ≥ n_j. So if we let A_j = (|X_{n_j} − X| > 2^{−j}), then P(A_j) ≤ 2^{−j}. By the Borel-Cantelli lemma, P(A_j i.o.) = 0. This implies there exists J (depending on ω) such that if j ≥ J, then ω ∉ A_j. Then for j ≥ J we have |X_{n_j}(ω) − X(ω)| ≤ 2^{−j}. Therefore X_{n_j}(ω) converges to X(ω) if ω ∉ (A_j i.o.).
Let us give some examples to show there need not be any other implications among the three types of convergence.
Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Let X_n = e^n 1_{(0,1/n)}. Then clearly X_n converges to 0 almost surely and in probability, but EX_n^p = e^{np}/n → ∞ for any p.
Let Ω be the unit circle, and let P be Lebesgue measure on the circle normalized to have total mass 1. Let t_n = Σ_{i=1}^n i^{−1}, and let
A_n = {e^{iθ} : t_{n−1} ≤ θ < t_n}.
Let X_n = 1_{A_n}. Any point on the unit circle will be in infinitely many A_n, so X_n does not converge almost surely to 0. But P(A_n) = 1/(2πn) → 0, so X_n → 0 in probability and in L^p.
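The rotating-arc example can be explored numerically (an illustration, not from the notes): fixing the point θ = 1 on the circle, the arcs A_n keep returning to it as t_n → ∞, even though the individual arcs shrink:

```python
import math

# Follow a single fixed point theta on the circle through the arcs A_n:
# theta is covered by A_n when t_{n-1} <= theta + 2*pi*k < t_n for some integer k.
theta = 1.0
t = 0.0
hits = []
for n in range(1, 1000001):
    t_prev, t = t, t + 1 / n               # t_n = sum_{i <= n} 1/i
    k = math.ceil((t_prev - theta) / (2 * math.pi))
    if t_prev <= theta + 2 * math.pi * k < t:
        hits.append(n)
print(hits)  # new hits keep appearing as t_n -> infinity: X_n(theta) = 1 i.o.
```

Hits are sparse because t_n grows only logarithmically, but they never stop, which is exactly why X_n(θ) fails to converge to 0 pointwise.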
2.2 Weak law of large numbers
Suppose X_n is a sequence of independent random variables. Suppose also that they all have the same distribution, that is, F_{X_n} = F_{X_1} for all n. This situation comes up so often it has a name: independent, identically distributed, which is abbreviated i.i.d.
Define S_n = Σ_{i=1}^n X_i. S_n is called a partial sum process, and S_n/n is the average value of the first n of the X_i's.
Theorem 2.2 (Weak law of large numbers = WLLN) Suppose the X_i are i.i.d. and EX_1² < ∞. Then S_n/n → EX_1 in probability.
Proof. Since the X_i are i.i.d., they all have the same expectation, and so ES_n = nEX_1. Hence E(S_n/n − EX_1)² is the variance of S_n/n. If ε > 0, by Chebyshev's inequality,
P(|S_n/n − EX_1| > ε) ≤ Var(S_n/n)/ε² = (Σ_{i=1}^n Var X_i)/(n²ε²) = n Var X_1/(n²ε²). (2.1)
Since EX_1² < ∞, then Var X_1 < ∞, and the result follows by letting n → ∞.
The weak law of large numbers can be improved greatly; it is enough that xP(|X_1| > x) → 0 as x → ∞.
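A simulation of the weak law (illustrative, not from the notes) for X_i uniform on [0, 1], where EX_1 = 1/2: the deviation probability bounded in (2.1) visibly decreases with n. The tolerance ε = 0.05 and trial count are arbitrary choices:

```python
import random

random.seed(3)

def deviation_prob(n, eps=0.05, trials=2000):
    """Estimate P(|S_n/n - EX_1| > eps) for X_i uniform on [0, 1]."""
    bad = 0
    for _ in range(trials):
        s = sum(random.random() for _ in range(n))
        if abs(s / n - 0.5) > eps:
            bad += 1
    return bad / trials

for n in (10, 100, 1000):
    print(n, deviation_prob(n))  # shrinks toward 0 as n grows, as (2.1) predicts
```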
2.3 The strong law of large numbers
Our aim is the strong law of large numbers (SLLN), which says that S_n/n converges to EX_1 almost surely if E|X_1| < ∞.
The strong law of large numbers is the mathematical formulation of the law of averages. If one tosses a fair coin over and over, the proportion of heads should converge to 1/2. Mathematically, if X_i is 1 if the ith toss turns up heads and 0 otherwise, then we want S_n/n to converge with probability one to 1/2, where S_n = X_1 + · · · + X_n.
Before stating and proving the strong law of large numbers for i.i.d. random variables with finite first moment, we need three facts from calculus. First, recall that if b_n → b are real numbers, then
(b_1 + · · · + b_n)/n → b. (2.2)
Second, there exists a constant c_1 such that
Σ_{k=n}^∞ 1/k² ≤ c_1/n. (2.3)
(To prove this, recall the proof of the integral test and compare the sum to ∫_{n−1}^∞ x^{−2} dx when n ≥ 2.) Third, suppose a > 1 and k_n is the largest integer less than or equal to a^n. Note k_n ≥ a^n/2. Then
Σ_{n: k_n ≥ j} 1/k_n² ≤ Σ_{n: a^n ≥ j} 4/a^{2n} ≤ (4/j²) · 1/(1 − a^{−2}) (2.4)
by the formula for the sum of a geometric series.
We also need two probability estimates.
Lemma 2.3 If X ≥ 0 a.s., then EX < ∞ if and only if
Σ_{n=1}^∞ P(X ≥ n) < ∞.
Proof. Suppose EX is finite. Since P(X ≥ x) increases as x decreases,
Σ_{n=1}^∞ P(X ≥ n) ≤ Σ_{n=1}^∞ ∫_{n−1}^n P(X ≥ x) dx = ∫_0^∞ P(X ≥ x) dx = EX,
which is finite.
If we now suppose Σ_{n=1}^∞ P(X ≥ n) < ∞, write
EX = ∫_0^∞ P(X ≥ x) dx ≤ 1 + Σ_{n=1}^∞ ∫_n^{n+1} P(X ≥ x) dx ≤ 1 + Σ_{n=1}^∞ P(X ≥ n) < ∞.
Lemma 2.4 Let X_n be an i.i.d. sequence with each X_n ≥ 0 a.s. and EX_1 < ∞. Define
Y_n = X_n 1_{(X_n ≤ n)}.
Then
Σ_{k=1}^∞ Var Y_k/k² < ∞.
Proof. Since Var Y_k ≤ EY_k², and EY_k² = ∫_0^∞ 2x P(Y_k > x) dx,
Σ_{k=1}^∞ Var Y_k/k² ≤ Σ_{k=1}^∞ (1/k²) ∫_0^∞ 2x P(Y_k > x) dx
= Σ_{k=1}^∞ (1/k²) ∫_0^k 2x P(Y_k > x) dx
= Σ_{k=1}^∞ (1/k²) ∫_0^∞ 1_{(x≤k)} 2x P(Y_k > x) dx
≤ Σ_{k=1}^∞ (1/k²) ∫_0^∞ 1_{(x≤k)} 2x P(X_k > x) dx
= ∫_0^∞ (Σ_{k=1}^∞ (1/k²) 1_{(x≤k)}) 2x P(X_1 > x) dx
≤ c_1 ∫_0^∞ (1/x) · 2x P(X_1 > x) dx
= 2c_1 ∫_0^∞ P(X_1 > x) dx = 2c_1 EX_1 < ∞.
We used the fact that the X_k are i.i.d., the Fubini theorem, and (2.3).
Before proving the strong law, let us first show that we cannot weaken the hypotheses.

Theorem 2.5 Suppose the X_i are i.i.d. and S_n/n converges a.s. Then E|X_1| < ∞.
Proof. Since S_n/n converges, then
X_n/n = S_n/n − (S_{n−1}/(n−1)) · ((n−1)/n) → 0
almost surely. Let A_n = (|X_n| > n). If Σ_{n=1}^∞ P(A_n) = ∞, then by using that the A_n's are independent, the Borel-Cantelli lemma (second part) tells us that P(A_n i.o.) = 1, which contradicts X_n/n → 0 a.s. Therefore Σ_{n=1}^∞ P(A_n) < ∞.
Since the X_i are identically distributed, we then have
Σ_{n=1}^∞ P(|X_1| > n) < ∞,
and E|X_1| < ∞ follows by Lemma 2.3.
We now state and prove the strong law of large numbers.
Theorem 2.6 Suppose X_i is an i.i.d. sequence with E|X_1| < ∞. Let S_n = Σ_{i=1}^n X_i. Then
S_n/n → EX_1, a.s.
Proof. By writing each X_n as X_n⁺ − X_n⁻ and considering the positive and negative parts separately, it suffices to suppose each X_n ≥ 0. Define Y_k = X_k 1_{(X_k ≤ k)} and let T_n = Σ_{i=1}^n Y_i. The main part of the argument is to prove that T_n/n → EX_1 a.s.
Step 1. Let a > 1 and let k_n be the largest integer less than or equal to a^n. Let ε > 0 and let
A_n = (|T_{k_n} − ET_{k_n}|/k_n > ε).
Then
P(A_n) ≤ Var(T_{k_n}/k_n)/ε² = Var T_{k_n}/(k_n² ε²) = (Σ_{j=1}^{k_n} Var Y_j)/(k_n² ε²).
Then
Σ_{n=1}^∞ P(A_n) ≤ Σ_{n=1}^∞ Σ_{j=1}^{k_n} Var Y_j/(k_n² ε²)
= (1/ε²) Σ_{j=1}^∞ (Σ_{n: k_n ≥ j} 1/k_n²) Var Y_j
≤ (4(1 − a^{−2})^{−1}/ε²) Σ_{j=1}^∞ Var Y_j/j²
by (2.4). By Lemma 2.4, Σ_{n=1}^∞ P(A_n) < ∞, and by the Borel-Cantelli lemma, P(A_n i.o.) = 0. This means that for each ω except those in a null set, there exists N(ω) such that if n ≥ N(ω), then |T_{k_n}(ω) − ET_{k_n}|/k_n < ε. Applying this with ε = 1/m, m = 1, 2, . . ., we conclude
(T_{k_n} − ET_{k_n})/k_n → 0, a.s.
Step 2. Since
EY_j = E[X_j; X_j ≤ j] = E[X_1; X_1 ≤ j] → EX_1
by the dominated convergence theorem as j → ∞, then by (2.2)
ET_{k_n}/k_n = (Σ_{j=1}^{k_n} EY_j)/k_n → EX_1.
Therefore T_{k_n}/k_n → EX_1 a.s.
Step 3. If k_n ≤ k ≤ k_{n+1}, then
T_k/k ≤ (T_{k_{n+1}}/k_{n+1}) · (k_{n+1}/k_n)
since we are assuming that the X_k are non-negative. Therefore
lim sup_{k→∞} T_k/k ≤ aEX_1, a.s.
Similarly, lim inf_{k→∞} T_k/k ≥ (1/a)EX_1 a.s. Since a > 1 is arbitrary,
T_k/k → EX_1, a.s.
Step 4. Finally,
Σ_{n=1}^∞ P(Y_n ≠ X_n) = Σ_{n=1}^∞ P(X_n > n) = Σ_{n=1}^∞ P(X_1 > n) < ∞
by Lemma 2.3. By the Borel-Cantelli lemma, P(Y_n ≠ X_n i.o.) = 0. In particular, Y_n − X_n → 0 a.s. By (2.2) we have
(T_n − S_n)/n = (Σ_{i=1}^n (Y_i − X_i))/n → 0, a.s.,
hence S_n/n → EX_1 a.s.
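The strong law is a statement about a single sample path. The following sketch (illustrative, not from the notes) follows one path of S_n/n for Exponential(1) draws, generated by inverse transform, and watches it settle at EX_1 = 1:

```python
import math
import random

random.seed(4)
# One sample path of S_n/n for i.i.d. Exponential(1) draws (EX_1 = 1);
# -log(1 - U) with U uniform is the inverse transform of Exponential(1).
s = 0.0
path = {}
for n in range(1, 1000001):
    s += -math.log(1.0 - random.random())
    if n in (10, 1000, 1000000):
        path[n] = s / n
print(path)  # this single path of S_n/n settles near EX_1 = 1
```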
2.4 Techniques related to a.s. convergence
If X_1, . . . , X_n are random variables, σ(X_1, . . . , X_n) is defined to be the smallest σ-field with respect to which X_1, . . . , X_n are measurable. This definition extends to countably many random variables.
If X_i is a sequence of random variables, the tail σ-field is defined by
T = ∩_{n=1}^∞ σ(X_n, X_{n+1}, . . .).
An example of an event in the tail σ-field is (lim sup_{n→∞} X_n > a). Another example is (lim sup_{n→∞} S_n/n > a). The reason for this is that if k < n is fixed,
S_n/n = S_k/n + (Σ_{i=k+1}^n X_i)/n.
The first term on the right tends to 0 as n → ∞. So lim sup S_n/n = lim sup (Σ_{i=k+1}^n X_i)/n, which is in σ(X_{k+1}, X_{k+2}, . . .). This holds for each k. The set (lim sup S_n > a) is easily seen not to be in the tail σ-field.
Theorem 2.7 (Kolmogorov 0-1 law) If the X_i are independent, then the events in the tail σ-field have probability 0 or 1.

This implies that in the case of i.i.d. random variables, if S_n/n has a limit with positive probability, then it has a limit with probability one, and the limit must be a constant.
Proof. Let M be the collection of sets in σ(X_{n+1}, . . .) that are independent of every set in σ(X_1, . . . , X_n). M is easily seen to be a monotone class and it contains σ(X_{n+1}, . . . , X_N) for every N > n. Therefore M must be equal to σ(X_{n+1}, . . .).

If A is in the tail σ-field, then A is independent of σ(X_1, . . . , X_n) for each n. The class M_A of sets independent of A is a monotone class, hence is a σ-field containing σ(X_1, . . . , X_n) for each n. Therefore M_A contains σ(X_1, . . .).
We thus have that the event A is independent of itself, or
P(A) = P(A ∩ A) = P(A)P(A) = P(A)².
This implies P(A) is zero or one.
Next is Kolmogorov’s inequality, a special case of Doob’s inequality.
Proposition 2.8 Suppose the X_i are independent and EX_i = 0 for each i. Then
P(max_{1≤i≤n} |S_i| ≥ λ) ≤ ES_n²/λ².
Proof. Let A_k = (|S_k| ≥ λ, |S_1| < λ, . . . , |S_{k−1}| < λ). Note the A_k are disjoint and that A_k ∈ σ(X_1, . . . , X_k). Therefore A_k is independent of S_n − S_k. Then
ES_n² ≥ Σ_{k=1}^n E[S_n²; A_k]
= Σ_{k=1}^n E[(S_k² + 2S_k(S_n − S_k) + (S_n − S_k)²); A_k]
≥ Σ_{k=1}^n E[S_k²; A_k] + 2 Σ_{k=1}^n E[S_k(S_n − S_k); A_k].
Using the independence, E[S_k(S_n − S_k)1_{A_k}] = E[S_k 1_{A_k}] E[S_n − S_k] = 0. Therefore
ES_n² ≥ Σ_{k=1}^n E[S_k²; A_k] ≥ Σ_{k=1}^n λ²P(A_k) = λ²P(max_{1≤k≤n} |S_k| ≥ λ).
Our result is immediate from this.
We look at a special case of what is known as Kronecker’s lemma.
Proposition 2.9 Suppose x_i are real numbers and s_n = Σ_{i=1}^n x_i. If the sum Σ_{j=1}^∞ (x_j/j) converges, then s_n/n → 0.
Proof. Let b_n = Σ_{j=1}^n (x_j/j), b_0 = 0, and suppose b_n → b. As we have seen, this implies (Σ_{i=1}^n b_i)/n → b. We have i(b_i − b_{i−1}) = x_i, so
s_n/n = (Σ_{i=1}^n i(b_i − b_{i−1}))/n = (Σ_{i=1}^n i b_i − Σ_{i=1}^{n−1} (i + 1)b_i)/n
= b_n − (Σ_{i=1}^{n−1} b_i/(n − 1)) · ((n − 1)/n) → b − b = 0.
Lemma 2.10 Suppose V_i is a sequence of independent random variables, each with mean 0. Let W_n = Σ_{i=1}^n V_i. If Σ_{i=1}^∞ Var V_i < ∞, then W_n converges almost surely.
Proof. Choose n_j > n_{j−1} such that Σ_{i=n_j}^∞ Var V_i < 2^{−3j}. If n > n_j, then applying Kolmogorov's inequality shows that
P(max_{n_j ≤ i ≤ n} |W_i − W_{n_j}| > 2^{−j}) ≤ 2^{−3j}/2^{−2j} = 2^{−j}.
Letting n → ∞, we have P(A_j) ≤ 2^{−j}, where
A_j = (sup_{n_j ≤ i} |W_i − W_{n_j}| > 2^{−j}).
By the Borel-Cantelli lemma, P(A_j i.o.) = 0.

Suppose ω ∉ (A_j i.o.). Let ε > 0. Choose j large enough so that 2^{−j+1} < ε and ω ∉ A_j. If n, m > n_j, then
|W_n − W_m| ≤ |W_n − W_{n_j}| + |W_m − W_{n_j}| ≤ 2^{−j+1} < ε.
Since ε is arbitrary, W_n(ω) is a Cauchy sequence, and hence converges.
We now consider the “three series criterion.” We prove the “if” portionhere and defer the “only if” to Section 20.
Theorem 2.11 Let X_i be a sequence of independent random variables, let A > 0, and let Y_i = X_i 1_{(|X_i| ≤ A)}. Then Σ X_i converges if and only if all of the following three series converge: (a) Σ_n P(|X_n| > A); (b) Σ_i EY_i; (c) Σ_i Var Y_i.
Proof of “if” part. Since (c) holds, Σ_i (Y_i − EY_i) converges by Lemma 2.10. Since (b) holds, taking the difference shows Σ_i Y_i converges. Since (a) holds, Σ_i P(X_i ≠ Y_i) = Σ_i P(|X_i| > A) < ∞, so by Borel-Cantelli, P(X_i ≠ Y_i i.o.) = 0. It follows that Σ_i X_i converges.
2.5 Uniform integrability
Before proceeding to an extension of the SLLN, we discuss uniform integrability. A sequence of random variables X_i is uniformly integrable if
sup_i ∫_{(|X_i|>M)} |X_i| dP → 0
as M → ∞.
Proposition 2.12 Suppose there exists ϕ : [0, ∞) → [0, ∞) such that ϕ is nondecreasing, ϕ(x)/x → ∞ as x → ∞, and sup_i Eϕ(|X_i|) < ∞. Then the X_i are uniformly integrable.
Proof. Let ε > 0 and choose x_0 such that x/ϕ(x) < ε if x ≥ x_0. If M ≥ x_0,
∫_{(|X_i|>M)} |X_i| dP = ∫ (|X_i|/ϕ(|X_i|)) ϕ(|X_i|) 1_{(|X_i|>M)} dP ≤ ε ∫ ϕ(|X_i|) dP ≤ ε sup_i Eϕ(|X_i|).
Proposition 2.13 If X_n and Y_n are two uniformly integrable sequences, then X_n + Y_n is also a uniformly integrable sequence.
Proof. Since there exists M such that sup_n E[|X_n|; |X_n| > M] < 1 and sup_n E[|Y_n|; |Y_n| > M] < 1, we have sup_n E|X_n| ≤ M + 1, and similarly for the Y_n.

Note
P(|X_n| + |Y_n| > K) ≤ (E|X_n| + E|Y_n|)/K ≤ 2(1 + M)/K
by Chebyshev's inequality.

Now let ε > 0 and choose L such that
E[|X_n|; |X_n| > L] < ε
for all n, and the same when X_n is replaced by Y_n. Then
E[|X_n + Y_n|; |X_n + Y_n| > K]
≤ E[|X_n|; |X_n + Y_n| > K] + E[|Y_n|; |X_n + Y_n| > K]
≤ E[|X_n|; |X_n| > L] + E[|X_n|; |X_n| ≤ L, |X_n + Y_n| > K]
+ E[|Y_n|; |Y_n| > L] + E[|Y_n|; |Y_n| ≤ L, |X_n + Y_n| > K]
≤ ε + LP(|X_n + Y_n| > K) + ε + LP(|X_n + Y_n| > K)
≤ 2ε + 4L(1 + M)/K.

Given ε we have already chosen L. Choose K large enough that 4L(1 + M)/K < ε. We then get that E[|X_n + Y_n|; |X_n + Y_n| > K] is bounded by 3ε.
The main result we need in this section is Vitali’s convergence theorem.
Theorem 2.14 If X_n → X almost surely and the X_n are uniformly integrable, then E|X_n − X| → 0.
Proof. By the above proposition, X_n − X is uniformly integrable and tends to 0 a.s., so without loss of generality we may assume X = 0. Let ε > 0 and choose M such that sup_n E[|X_n|; |X_n| > M] < ε. Then
E|X_n| ≤ E[|X_n|; |X_n| > M] + E[|X_n|; |X_n| ≤ M] ≤ ε + E[|X_n| 1_{(|X_n| ≤ M)}].
The second term on the right goes to 0 by dominated convergence.
Proposition 2.15 Suppose X_i is an i.i.d. sequence and E|X_1| < ∞. Then
E|S_n/n − EX_1| → 0.
Proof. Without loss of generality we may assume EX_1 = 0. By the SLLN, S_n/n → 0 a.s. So we need to show that the sequence S_n/n is uniformly integrable.
Pick M such that E[|X_1|; |X_1| > M] < ε. Pick N = ME|X_1|/ε. Then
P(|S_n/n| > N) ≤ E|S_n|/(nN) ≤ E|X_1|/N = ε/M.
We used here
E|S_n| ≤ Σ_{i=1}^n E|X_i| = nE|X_1|.
We then have
E[|X_i|; |S_n/n| > N] ≤ E[|X_i|; |X_i| > M] + E[|X_i|; |X_i| ≤ M, |S_n/n| > N] ≤ ε + MP(|S_n/n| > N) ≤ 2ε.
Finally,
E[|S_n/n|; |S_n/n| > N] ≤ (1/n) Σ_{i=1}^n E[|X_i|; |S_n/n| > N] ≤ 2ε.
Chapter 3
Conditional expectations
If F ⊆ G are two σ-fields and X is an integrable G measurable random variable, the conditional expectation of X given F, written E[X | F] and read as “the expectation (or expected value) of X given F,” is any F measurable random variable Y such that E[Y; A] = E[X; A] for every A ∈ F. The conditional probability of A ∈ G given F is defined by P(A | F) = E[1_A | F].
If Y_1, Y_2 are two F measurable random variables with E[Y_1; A] = E[Y_2; A] for all A ∈ F, then Y_1 = Y_2 a.s.; that is, conditional expectation is unique up to a.s. equivalence.
In the case X is already F measurable, E[X | F] = X. If X is independent of F, E[X | F] = EX. Both of these facts follow immediately from the definition. For another example, which ties this definition with the one used in elementary probability courses, if A_i is a finite collection of disjoint sets whose union is Ω, P(A_i) > 0 for all i, and F is the σ-field generated by the A_i's, then
P(A | F) = Σ_i (P(A ∩ A_i)/P(A_i)) 1_{A_i}.
This follows since the right-hand side is F measurable and its expectation over any set A_i is P(A ∩ A_i).
As an example, suppose we toss a fair coin independently 5 times and let X_i be 1 or 0 depending on whether the ith toss was a heads or tails. Let A be the event that there were 5 heads and let F_i = σ(X_1, . . . , X_i). Then P(A) = 1/32, while P(A | F_1) is equal to 1/16 on the event (X_1 = 1) and 0 on the event (X_1 = 0). P(A | F_2) is equal to 1/8 on the event (X_1 = 1, X_2 = 1) and 0 otherwise.
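The coin-tossing example can be reproduced by brute-force enumeration (an illustration, not from the notes), using the elementary formula for P(A | F) displayed above:

```python
from itertools import product

# All 32 equally likely outcomes of five fair tosses; A = "five heads".
omegas = list(product((0, 1), repeat=5))

def cond_prob_all_heads(w, i):
    """P(A | F_i) evaluated at outcome w: average of 1_A over the outcomes
    agreeing with w in the first i tosses (the elementary formula above)."""
    agree = [v for v in omegas if v[:i] == w[:i]]
    return sum(all(x == 1 for x in v) for v in agree) / len(agree)

print([cond_prob_all_heads((1, 1, 1, 1, 1), i) for i in range(6)])
# -> [0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0], i.e. 1/32, 1/16, 1/8, 1/4, 1/2, 1
```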
We have
E[E[X | F]] = EX (3.1)
because E[E[X | F]] = E[E[X | F]; Ω] = E[X; Ω] = EX.
The following is easy to establish.
Proposition 3.1 (a) If X ≥ Y are both integrable, then E[X | F] ≥ E[Y | F] a.s.
(b) If X and Y are integrable and a ∈ R, then E[aX + Y | F] = aE[X | F] + E[Y | F].
It is easy to check that limit theorems such as monotone convergence and dominated convergence have conditional expectation versions, as do inequalities like Jensen's and Chebyshev's inequalities. Thus, for example, we have the following.
Proposition 3.2 (Jensen's inequality for conditional expectations) If g is convex and X and g(X) are integrable,
E [g(X) | F ] ≥ g(E [X | F ]), a.s.
A key fact is the following.
Proposition 3.3 If X and XY are integrable and Y is measurable with respect to F, then
E [XY | F ] = Y E [X | F ]. (3.2)
Proof. If A ∈ F, then for any B ∈ F,
E[1_A E[X | F]; B] = E[E[X | F]; A ∩ B] = E[X; A ∩ B] = E[1_A X; B].
Since 1_A E[X | F] is F measurable, this shows that (3.2) holds when Y = 1_A and A ∈ F. Using linearity and taking limits shows that (3.2) holds whenever Y is F measurable and X and XY are integrable.
Two other equalities follow.
Proposition 3.4 If E ⊆ F ⊆ G, then
E[E[X | F] | E] = E[X | E] = E[E[X | E] | F].

Proof. The right equality holds because E[X | E] is E measurable, hence F measurable. To show the left equality, let A ∈ E. Then since A is also in F,
E[E[E[X | F] | E]; A] = E[E[X | F]; A] = E[X; A] = E[E[X | E]; A].
Since both sides are E measurable, the equality follows.
To show the existence of E [X | F ], we proceed as follows.
Proposition 3.5 If X is integrable, then E [X | F ] exists.
Proof. Using linearity, we need only consider X ≥ 0. Define a measure Q on F by Q(A) = E[X; A] for A ∈ F. This is trivially absolutely continuous with respect to P|_F, the restriction of P to F. Let E[X | F] be the Radon-Nikodym derivative of Q with respect to P|_F. The Radon-Nikodym derivative is F measurable by construction and so provides the desired random variable.
When F = σ(Y ), one usually writes E [X | Y ] for E [X | F ]. Notation that is commonly used (however, we will use it only very occasionally and only for heuristic purposes) is E [X | Y = y]. The definition is as follows. If A ∈ σ(Y ), then A = (Y ∈ B) for some Borel set B by the definition of σ(Y ), or 1A = 1B(Y ). By linearity and taking limits, if Z is σ(Y ) measurable, then Z = f(Y ) for some Borel measurable function f . Set Z = E [X | Y ] and choose f Borel measurable so that Z = f(Y ). Then E [X | Y = y] is defined to be f(y).
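On a finite sample space, E [X | Y ] can be computed concretely by averaging X over the level sets of Y . The sketch below is an illustrative example of mine (the space, X, and Y are hypothetical choices, not from the text); it also checks the identity E [E [X | Y ]] = EX noted above.

```python
from fractions import Fraction

# Finite sample space omega = {0,...,5} with uniform probability.
omega = range(6)
P = {w: Fraction(1, 6) for w in omega}
X = {w: w * w for w in omega}   # X(w) = w^2
Y = {w: w % 2 for w in omega}   # Y(w) = parity of w

def cond_exp(X, Y, P):
    """E[X | sigma(Y)] on a finite space: average X over each level set of Y."""
    avg = {}
    for y in set(Y.values()):
        block = [w for w in Y if Y[w] == y]
        mass = sum(P[w] for w in block)
        avg[y] = sum(X[w] * P[w] for w in block) / mass
    # The conditional expectation is the function f(Y), i.e. constant on blocks.
    return {w: avg[Y[w]] for w in P}

Z = cond_exp(X, Y, P)
EX = sum(X[w] * P[w] for w in omega)
EZ = sum(Z[w] * P[w] for w in omega)
print(EX, EZ)   # tower property: E[E[X|Y]] = EX
```

Here Z = f(Y ) with f(0) the average of X over the even points and f(1) the average over the odd points, matching the E [X | Y = y] = f(y) convention above.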
If X ∈ L2 and M = {Y ∈ L2 : Y is F -measurable}, one can show that E [X | F ] is equal to the projection of X onto the subspace M. We will not use this in these notes.
Chapter 4
Martingales
4.1 Definitions
In this section we consider martingales. Let Fn be an increasing sequence of σ-fields. A sequence of random variables Mn is adapted to Fn if for each n, Mn is Fn measurable.
Mn is a martingale if Mn is adapted to Fn, Mn is integrable for all n, and
E [Mn | Fn−1] = Mn−1, a.s., n = 2, 3, . . . . (4.1)
If we have E [Mn | Fn−1] ≥ Mn−1 a.s. for every n, then Mn is a submartingale. If we have E [Mn | Fn−1] ≤ Mn−1, we have a supermartingale. Submartingales have a tendency to increase.
Let us take a moment to look at some examples. If Xi is a sequence of mean zero i.i.d. random variables and Sn is the partial sum process, then Mn = Sn is a martingale, since E [Mn | Fn−1] = Mn−1 + E [Mn − Mn−1 | Fn−1] = Mn−1 + E [Mn − Mn−1] = Mn−1, using independence. If the Xi's have variance one and Mn = Sn^2 − n, then

E [Sn^2 | Fn−1] = E [(Sn − Sn−1)^2 | Fn−1] + 2Sn−1 E [Sn | Fn−1] − Sn−1^2 = 1 + Sn−1^2,

using independence. It follows that Mn is a martingale.
Another example is the following: if X ∈ L1 and Mn = E [X | Fn], then Mn is a martingale.
If Mn is a martingale and Hn ∈ Fn−1 for each n, it is easy to check that Nn = ∑_{i=1}^n Hi(Mi − Mi−1) is also a martingale.
If Mn is a martingale, g is convex, and g(Mn) is integrable for each n, then by Jensen's inequality

E [g(Mn+1) | Fn] ≥ g(E [Mn+1 | Fn]) = g(Mn),

or g(Mn) is a submartingale. Similarly, if g is convex and nondecreasing on [0,∞) and Mn is a positive submartingale, then g(Mn) is a submartingale because

E [g(Mn+1) | Fn] ≥ g(E [Mn+1 | Fn]) ≥ g(Mn).
4.2 Stopping times
We next want to talk about stopping times. Suppose we have a sequence of σ-fields Fi such that Fi ⊂ Fi+1 for each i. An example would be if Fi = σ(X1, . . . , Xi). A random mapping N from Ω to {0, 1, 2, . . .} is called a stopping time if for each n, (N ≤ n) ∈ Fn. A stopping time is also called an optional time in the Markov theory literature.
The intuition is that the sequence knows whether N has happened by time n by looking at Fn. Suppose some motorists are told to drive north on Highway 99 in Seattle and stop at the first motorcycle shop past the second realtor after the city limits. So they drive north, pass the city limits, pass two realtors, come to the next motorcycle shop, and stop. That is a stopping time. If they are instead told to stop at the third stop light before the city limits (and they had not been there before), they would need to drive to the city limits, then turn around and return past three stop lights. That is not a stopping time, because they have to go ahead of where they wanted to stop to know to stop there.
We use the notation a∧ b = min(a, b) and a∨ b = max(a, b). The proof ofthe following is immediate from the definitions.
Proposition 4.1 (a) Fixed times n are stopping times.
(b) If N1 and N2 are stopping times, then so are N1 ∧ N2 and N1 ∨ N2.
(c) If Nn is a nondecreasing sequence of stopping times, then so is N = supn Nn.
(d) If Nn is a nonincreasing sequence of stopping times, then so is N = infn Nn.
(e) If N is a stopping time, then so is N + n.
We define FN = {A : A ∩ (N ≤ n) ∈ Fn for all n}.
4.3 Optional stopping
Note that if one takes expectations in (4.1), one has EMn = EMn−1, and by induction EMn = EM0. The theorem about martingales that lies at the basis of all other results is Doob's optional stopping theorem, which says that the same is true if we replace n by a stopping time N . There are various versions, depending on what conditions one puts on the stopping times.
Theorem 4.2 If N is a bounded stopping time with respect to Fn and Mn is a martingale, then EMN = EM0.
Proof. Since N is bounded, let K be the largest value N takes. We write

EMN = ∑_{k=0}^K E [MN ;N = k] = ∑_{k=0}^K E [Mk;N = k].
Note (N = k) is Fj measurable if j ≥ k, so
E [Mk;N = k] = E [Mk+1;N = k]
= E [Mk+2;N = k] = . . . = E [MK ;N = k].
Hence

EMN = ∑_{k=0}^K E [MK ;N = k] = EMK = EM0.
This completes the proof.
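Theorem 4.2 can be checked by brute force on a short example (my own illustrative choice, not from the text): for the fair ±1 walk Mn with the bounded stopping time N = min{i : Mi = 1} ∧ 3, enumerating all eight equally likely paths gives EMN = 0 = EM0 exactly.

```python
from itertools import product
from fractions import Fraction

# Enumerate every path of a 3-step fair +-1 walk; each has probability 1/8.
total = Fraction(0)
for steps in product([1, -1], repeat=3):
    M = [0]
    for s in steps:
        M.append(M[-1] + s)
    # Bounded stopping time: first hit of 1, capped at time 3.
    N = min(next((i for i, m in enumerate(M) if m == 1), 3), 3)
    total += Fraction(1, 8) * M[N]
print(total)   # → 0, i.e. E M_N = E M_0
```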
The assumption that N be bounded cannot be entirely dispensed with. For example, let Mn be the partial sums of a sequence of i.i.d. random variables that take the values ±1, each with probability 1/2. If N = min{i : Mi = 1}, we will see later on that N < ∞ a.s., but EMN = 1 ≠ 0 = EM0. The same proof as that in Theorem 4.2 gives the following corollary.
Corollary 4.3 If N is bounded by K and Mn is a submartingale, then EMN ≤ EMK.
Also the same proof gives
Corollary 4.4 If N is bounded by K, A ∈ FN , and Mn is a submartingale, then E [MN ;A] ≤ E [MK ;A].
Proposition 4.5 If N1 ≤ N2 are stopping times bounded by K and M is a martingale, then E [MN2 | FN1 ] = MN1 , a.s.
Proof. Suppose A ∈ FN1 . We need to show E [MN1 ;A] = E [MN2 ;A]. Define a new stopping time N3 by N3(ω) = N1(ω) if ω ∈ A and N3(ω) = N2(ω) if ω ∉ A. It is easy to check that N3 is a stopping time, so EMN3 = EMK = EMN2 implies

E [MN1 ;A] + E [MN2 ;A^c] = E [MN2 ].

Subtracting E [MN2 ;A^c] from each side completes the proof.
The following is known as the Doob decomposition.
Proposition 4.6 Suppose Xk is a submartingale with respect to an increasing sequence of σ-fields Fk. Then we can write Xk = Mk + Ak such that Mk is a martingale adapted to the Fk and Ak is a sequence of random variables with Ak being Fk−1-measurable and A0 ≤ A1 ≤ · · · .
Proof. Let ak = E [Xk | Fk−1] − Xk−1 for k = 1, 2, . . . . Since Xk is a submartingale, each ak ≥ 0. Then let Ak = ∑_{i=1}^k ai. The fact that the Ak are increasing and measurable with respect to Fk−1 is clear. Set Mk = Xk − Ak. Then
E [Mk+1 −Mk | Fk] = E [Xk+1 −Xk | Fk]− ak+1 = 0,
or Mk is a martingale.
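For the fair-coin submartingale Xk = Sk^2, the recipe in the proof gives ak = E [Sk^2 | Fk−1] − Sk−1^2 = 1, so Ak = k and Mk = Sk^2 − k, recovering the earlier example. A small enumeration (an illustrative check of mine) confirms E [Mk+1 | Fk] = Mk by averaging over the two continuations of each prefix:

```python
from itertools import product
from fractions import Fraction

n = 4
half = Fraction(1, 2)
for k in range(n):
    for prefix in product([1, -1], repeat=k):
        S_k = sum(prefix)
        M_k = S_k**2 - k
        # E[M_{k+1} | F_k]: average over the two equally likely next steps.
        cond = sum(half * ((S_k + s)**2 - (k + 1)) for s in (1, -1))
        assert cond == M_k   # the Doob decomposition leaves a martingale
print("E[M_{k+1} | F_k] = M_k verified for all prefixes with k <", n)
```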
Combining Propositions 4.5 and 4.6 we have
Corollary 4.7 Suppose Xk is a submartingale, and N1 ≤ N2 are bounded stopping times. Then

E [XN2 | FN1 ] ≥ XN1 .
4.4 Doob’s inequalities
The first interesting consequences of the optional stopping theorems are Doob's inequalities. If Mn is a martingale, denote M∗n = max_{i≤n} |Mi|.
Theorem 4.8 If Mn is a martingale or a positive submartingale,

P(M∗n ≥ a) ≤ E [|Mn|;M∗n ≥ a]/a ≤ E |Mn|/a.
Proof. Set Mn+1 = Mn. Let N = min{j : |Mj | ≥ a} ∧ (n + 1). Since | · | is convex, |Mn| is a submartingale. If A = (M∗n ≥ a), then A ∈ FN because

A ∩ (N ≤ j) = (N ≤ n) ∩ (N ≤ j) = (N ≤ j) ∈ Fj .
Since |MN | ≥ a on A, by Corollary 4.4 applied to the submartingale |Mj |,

P(M∗n ≥ a) ≤ (1/a) E [|MN |;A] ≤ (1/a) E [|Mn+1|;A] = (1/a) E [|Mn|;A] ≤ (1/a) E |Mn|.
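The chain of inequalities in Theorem 4.8 can be verified exactly for a short fair-coin walk by enumerating all paths (an illustrative check of mine; n = 10 and a = 3 are arbitrary choices):

```python
from itertools import product
from fractions import Fraction

n, a = 10, 3
p_path = Fraction(1, 2**n)
prob = trunc_exp = full_exp = Fraction(0)
for steps in product([1, -1], repeat=n):
    S, Mstar = 0, 0
    for s in steps:
        S += s
        Mstar = max(Mstar, abs(S))
    full_exp += p_path * abs(S)          # builds E|M_n|
    if Mstar >= a:
        prob += p_path                   # builds P(M*_n >= a)
        trunc_exp += p_path * abs(S)     # builds E[|M_n|; M*_n >= a]
assert prob <= trunc_exp / a <= full_exp / a   # Doob's maximal inequality
print(prob, trunc_exp / a, full_exp / a)
```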
For p > 1, we have the following inequality.
Theorem 4.9 If p > 1 and E |Mi|^p < ∞ for i ≤ n, then

E (M∗n)^p ≤ (p/(p − 1))^p E |Mn|^p.
Proof. Note M∗n ≤ ∑_{i=1}^n |Mi|, hence M∗n ∈ Lp. We write, using Theorem 4.8 for the first inequality,

E (M∗n)^p = ∫_0^∞ p a^{p−1} P(M∗n > a) da ≤ ∫_0^∞ p a^{p−1} E [|Mn| 1_{(M∗n≥a)}/a] da
= E ∫_0^{M∗n} p a^{p−2} |Mn| da = (p/(p − 1)) E [(M∗n)^{p−1} |Mn|]
≤ (p/(p − 1)) (E (M∗n)^p)^{(p−1)/p} (E |Mn|^p)^{1/p}.
The last inequality follows by Hölder's inequality. Now divide both sides by the quantity (E (M∗n)^p)^{(p−1)/p}.
4.5 Martingale convergence theorems
The martingale convergence theorems are another set of important consequences of optional stopping. The main step is the upcrossing lemma. The number of upcrossings of an interval [a, b] is the number of times a process crosses from below a to above b.
To be more exact, let

S1 = min{k : Xk ≤ a}, T1 = min{k > S1 : Xk ≥ b},

and

Si+1 = min{k > Ti : Xk ≤ a}, Ti+1 = min{k > Si+1 : Xk ≥ b}.

The number of upcrossings Un before time n is Un = max{j : Tj ≤ n}.
Theorem 4.10 (Upcrossing lemma) If Xk is a submartingale,
EUn ≤ (b− a)−1E [(Xn − a)+].
Proof. The number of upcrossings of [a, b] by Xk is the same as the number of upcrossings of [0, b − a] by Yk = (Xk − a)^+. Moreover Yk is still a submartingale. If we obtain the inequality for the number of upcrossings of the interval [0, b − a] by the process Yk, we will have the desired inequality for upcrossings of X.
So we may assume a = 0. Fix n and define Yn+1 = Yn. This will still be a submartingale. Define the Si, Ti as above, and let S′i = Si ∧ (n + 1), T′i = Ti ∧ (n + 1). Since Ti+1 > Si+1 > Ti, then T′n+1 = n + 1.
We write

EYn+1 = EYS′1 + ∑_{i=1}^{n+1} E [YT′i − YS′i ] + ∑_{i=1}^{n} E [YS′_{i+1} − YT′i ].
All the summands in the third term on the right are nonnegative since Yk is a submartingale. For the jth upcrossing, YT′j − YS′j ≥ b − a, while YT′i − YS′i is always greater than or equal to 0. So

∑_{i=1}^{n+1} (YT′i − YS′i ) ≥ (b − a)Un.
So

EUn ≤ EYn+1/(b − a) = E [(Xn − a)^+]/(b − a). (4.2)
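The upcrossing times Si, Ti can be computed mechanically. The function below is an illustrative sketch of mine (the name `upcrossings` and the sample path are hypothetical); it counts the completed upcrossings Un of [a, b] by scanning for the alternating times at or below a and at or above b.

```python
def upcrossings(X, a, b):
    """Count completed upcrossings of [a, b] by the finite sequence X."""
    count, looking_for_low = 0, True
    for x in X:
        if looking_for_low and x <= a:
            looking_for_low = False      # found S_i: path is at or below a
        elif not looking_for_low and x >= b:
            looking_for_low = True       # found T_i: an upcrossing is complete
            count += 1
    return count

path = [0, 2, 0, 3, 1, 4]
print(upcrossings(path, 0.5, 1.5))   # → 2
```

For this path, S1 = 0, T1 = 1, S2 = 2, T2 = 3, and no further Si exists, so U = 2.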
This leads to the martingale convergence theorem.
Theorem 4.11 If Xn is a submartingale such that supn E Xn^+ < ∞, then Xn converges a.s. as n → ∞.
Proof. Let U(a, b) = lim_{n→∞} Un. For each pair of rationals a < b, by monotone convergence,

EU(a, b) ≤ (b − a)^{−1} supn E [(Xn − a)^+] < ∞.
So U(a, b) < ∞, a.s. Taking the union over all pairs of rationals a, b, we see that a.s. the sequence Xn(ω) cannot have lim sup Xn > lim inf Xn. Therefore Xn converges a.s., although we still have to rule out the possibility of the limit being infinite. Since Xn is a submartingale, EXn ≥ EX0, and thus

E |Xn| = E Xn^+ + E Xn^− = 2E Xn^+ − EXn ≤ 2E Xn^+ − EX0.
By Fatou’s lemma, E limn |Xn| ≤ supn E |Xn| < ∞, or Xn converges a.s. toa finite limit.
Corollary 4.12 If Xn is a positive supermartingale or a martingale bounded above or below, then Xn converges a.s.
Proof. If Xn is a positive supermartingale, −Xn is a submartingale bounded above by 0. Now apply Theorem 4.11.
If Xn is a martingale bounded above, by considering −Xn, we may assume Xn is bounded below. Looking at Xn + M for fixed M will not affect the convergence, so we may assume Xn is bounded below by 0. Now apply the first assertion of the corollary.
Proposition 4.13 If Xn is a martingale with supn E |Xn|^p < ∞ for some p > 1, then the convergence is in Lp as well as a.s. This is also true when Xn is a submartingale. If Xn is a uniformly integrable martingale, then the convergence is in L1. If Xn → X∞ in L1, then Xn = E [X∞ | Fn].
Xn is a uniformly integrable martingale if the collection of random variables {Xn} is uniformly integrable.
Proof. The Lp convergence assertion follows by using Doob's inequality (Theorem 4.9) and dominated convergence. The L1 convergence assertion follows since a.s. convergence together with uniform integrability implies L1 convergence. Finally, if j < n, we have Xj = E [Xn | Fj ]. If A ∈ Fj ,

E [Xj ;A] = E [Xn;A] → E [X∞;A]

by the L1 convergence of Xn to X∞. Since this is true for all A ∈ Fj , Xj = E [X∞ | Fj ].
4.6 Applications of martingales
One application of martingale techniques is Wald’s identities.
Proposition 4.14 Suppose the Yi are i.i.d. with E |Y1| < ∞, N is a stopping time with EN < ∞, and N is independent of the Yi. Then ESN = (EN)(EY1), where the Sn are the partial sums of the Yi.
Proof. Sn − n(EY1) is a martingale, so ESn∧N = E (n ∧ N) EY1 by optional stopping. The right hand side tends to (EN)(EY1) by monotone convergence. Sn∧N converges almost surely to SN , and we need to show the expected values converge.
Note

|Sn∧N | = ∑_{k=0}^∞ |Sn∧k| 1_{(N=k)} ≤ ∑_{k=0}^∞ ∑_{j=1}^{n∧k} |Yj | 1_{(N=k)}
= ∑_{j=1}^n ∑_{k≥j} |Yj | 1_{(N=k)} = ∑_{j=1}^n |Yj | 1_{(N≥j)} ≤ ∑_{j=1}^∞ |Yj | 1_{(N≥j)}.
The last expression, using the independence, has expected value

∑_{j=1}^∞ (E |Yj |) P(N ≥ j) = (E |Y1|) EN < ∞.
So by dominated convergence, we have ESn∧N → ESN .
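Wald's first identity is easy to sanity-check by simulation. The sketch below uses illustrative parameters of my own choosing: Yi uniform on [0, 2] (so EY1 = 1) and N geometric with mean 4, drawn independently of the Yi.

```python
import random

random.seed(7)

def sample_SN():
    # N geometric on {1, 2, ...} with success probability 1/4, so E N = 4.
    N = 1
    while random.random() > 0.25:
        N += 1
    # Y_i uniform on [0, 2], independent of N, with E Y_1 = 1.
    return sum(random.uniform(0.0, 2.0) for _ in range(N))

trials = 20000
est = sum(sample_SN() for _ in range(trials)) / trials
# Wald: E S_N = (E N)(E Y_1) = 4 * 1 = 4; est should be close to 4.
print(est)
```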
Wald’s second identity is a similar expression for the variance of SN .
Let Sn be your fortune at time n. In a fair casino, E [Sn+1 | Fn] = Sn. If N is a stopping time, the optional stopping theorem says that ESN = ES0; in other words, no matter what stopping time you use and what method of betting, you will do no better on average than ending up with what you started with.
We can use martingales to find certain hitting probabilities.
Proposition 4.15 Suppose the Yi are i.i.d. with P(Y1 = 1) = 1/2, P(Y1 = −1) = 1/2, and Sn the partial sum process. Suppose a and b are positive integers. Then

P(Sn hits −a before b) = b/(a + b).

If N = min{n : Sn ∈ {−a, b}}, then EN = ab.
Proof. Sn^2 − n is a martingale, so E Sn∧N^2 = E (n ∧ N). Let n → ∞. The right hand side converges to EN by monotone convergence. Since Sn∧N is bounded in absolute value by a + b, the left hand side converges by dominated convergence to E SN^2, which is finite. So EN is finite, hence N is finite almost surely.
Sn is a martingale, so ESn∧N = ES0 = 0. By dominated convergence, and the fact that N < ∞ a.s., hence Sn∧N → SN , we have ESN = 0, or

−aP(SN = −a) + bP(SN = b) = 0.

We also have

P(SN = −a) + P(SN = b) = 1.
Solving these two equations for P(SN = −a) and P(SN = b) yields the first result. Since EN = E SN^2 = a^2 P(SN = −a) + b^2 P(SN = b), substituting gives the second result.
Based on this proposition, if we let a → ∞, we see that P(Nb < ∞) = 1 and ENb = ∞, where Nb = min{n : Sn = b}.
An elegant application of martingales is a proof of the SLLN. Fix N large. Let Yi be i.i.d. Let Zn = E [Y1 | Sn, Sn+1, . . . , SN ]. We claim Zn = Sn/n. Certainly Sn/n is σ(Sn, . . . , SN ) measurable. If A ∈ σ(Sn, . . . , SN ), then A = ((Sn, . . . , SN ) ∈ B) for some Borel subset B of R^{N−n+1}. Since the Yi are i.i.d., for each k ≤ n,
E [Y1; (Sn, . . . , SN) ∈ B] = E [Yk; (Sn, . . . , SN) ∈ B].
Summing over k and dividing by n,
E [Y1; (Sn, . . . , SN) ∈ B] = E [Sn/n; (Sn, . . . , SN) ∈ B].
Therefore E [Y1;A] = E [Sn/n;A] for every A ∈ σ(Sn, . . . , SN ). Thus Zn = Sn/n.
Let Xk = ZN−k, and let Fk = σ(SN−k, SN−k+1, . . . , SN ). Note Fk gets larger as k gets larger, and by the above Xk = E [Y1 | Fk]. This shows that Xk is a martingale (cf. the example Mn = E [X | Fn] in Section 4.1). By Doob's upcrossing inequality, if Un^X is the number of upcrossings of [a, b] by X, then EU_{N−1}^X ≤ E X_{N−1}^+/(b − a) ≤ E |Z1|/(b − a) = E |Y1|/(b − a). This differs by at most one from the number of upcrossings of [a, b] by Z1, . . . , ZN . So the expected number of upcrossings of [a, b] by Zk for k ≤ N is bounded by 1 + E |Y1|/(b − a). Now let N → ∞. By Fatou's lemma, the expected number of upcrossings of [a, b] by Z1, Z2, . . . is finite. Arguing as in the proof of the martingale convergence theorem, this says that Zn = Sn/n does not oscillate.
It is conceivable that |Sn/n| → ∞. But by Fatou's lemma,

E [lim |Sn/n|] ≤ lim inf E |Sn/n| ≤ lim inf n E |Y1|/n = E |Y1| < ∞.
Chapter 5
Weak convergence
We will see later that if the Xi are i.i.d. with mean zero and variance one, then Sn/√n converges in the sense

P(Sn/√n ∈ [a, b]) → P(Z ∈ [a, b]),

where Z is a standard normal. If Sn/√n converged in probability or almost surely, then by the Kolmogorov zero-one law it would converge to a constant, contradicting the above. We want to generalize the above type of convergence.
We say Fn converges weakly to F if Fn(x) → F (x) for all x at which F is continuous. Here Fn and F are distribution functions. We say Xn converges weakly to X if FXn converges weakly to FX . We sometimes say Xn converges in distribution or converges in law to X. Probabilities µn converge weakly if their corresponding distribution functions converge, that is, if Fµn(x) = µn(−∞, x] converges weakly.
An example that illustrates why we restrict the convergence to continuity points of F is the following. Let Xn = 1/n with probability one, and X = 0 with probability one. FXn(x) is 0 if x < 1/n and 1 otherwise. FXn(x) converges to FX(x) for all x except x = 0.
Proposition 5.1 Xn converges weakly to X if and only if E g(Xn)→ E g(X)for all g bounded and continuous.
The idea that E g(Xn) converges to E g(X) for all g bounded and continuous makes sense for any metric space and is used as a definition of weak convergence for Xn taking values in general metric spaces.
Proof. First suppose E g(Xn) converges to E g(X). Let x be a continuity point of F , let ε > 0, and choose δ such that |F (y) − F (x)| < ε if |y − x| < δ. Choose g continuous such that g is one on (−∞, x], takes values between 0 and 1, and is 0 on [x + δ,∞). Then FXn(x) ≤ E g(Xn) → E g(X) ≤ FX(x + δ) ≤ F (x) + ε.

Similarly, if h is a continuous function taking values between 0 and 1 that is 1 on (−∞, x − δ] and 0 on [x,∞), then FXn(x) ≥ Eh(Xn) → Eh(X) ≥ FX(x − δ) ≥ F (x) − ε. Since ε is arbitrary, FXn(x) → FX(x).
Now suppose Xn converges weakly to X. We start by making some observations. First, if we have a distribution function, it is increasing and so the number of points at which it has a discontinuity is at most countable. Second, if g is a continuous function on a closed bounded interval, it can be approximated uniformly on the interval by step functions. Using the uniform continuity of g on the interval, we may even choose the step function so that the places where it jumps are not in some pre-specified countable set. Our third observation is that

P(X < x) = lim_{k→∞} P(X ≤ x − 1/k) = lim_{y→x−} FX(y);

thus if FX is continuous at x, then P(X = x) = FX(x) − P(X < x) = 0.
Suppose g is bounded and continuous, and we want to prove that E g(Xn) → E g(X). By multiplying by a constant, we may suppose that |g| is bounded by 1. Let ε > 0 and choose M such that FX(M) > 1 − ε and FX(−M) < ε and so that M and −M are continuity points of FX . We see that

P(Xn ≤ −M) = FXn(−M) → FX(−M) < ε,

and so for large enough n, we have P(Xn ≤ −M) ≤ 2ε. Similarly, for large enough n, P(Xn > M) ≤ 2ε. Therefore for large enough n, E g(Xn) differs from E (g1[−M,M ])(Xn) by at most 4ε. Also, E g(X) differs from E (g1[−M,M ])(X) by at most 2ε.
Let h be a step function such that sup_{|x|≤M} |h(x) − g(x)| < ε and h is 0 outside of [−M, M ]. We choose h so that the places where h jumps are continuity points of F and of all the Fn. Then E (g1[−M,M ])(Xn) differs from Eh(Xn) by at most ε, and the same when Xn is replaced by X.
If we show Eh(Xn) → Eh(X), then

lim sup_{n→∞} |E g(Xn) − E g(X)| ≤ 8ε,
and since ε is arbitrary, we will be done.
h is of the form ∑_{i=1}^m ci 1_{Ii}, where each Ii is an interval, so by linearity it is enough to show

E 1I(Xn) → E 1I(X)

when I is an interval whose endpoints are continuity points of all the Fn and of F . If the endpoints of I are a < b, then by our third observation above, P(Xn = a) = 0, and the same when Xn is replaced by X and when a is replaced by b. We then have
E 1I(Xn) = P(a < Xn ≤ b) = FXn(b) − FXn(a) → FX(b) − FX(a) = P(a < X ≤ b) = E 1I(X),
as required.
Let us examine the relationship between weak convergence and convergence in probability. The example of Sn/√n shows that one can have weak convergence without convergence in probability.
Proposition 5.2 (a) If Xn converges to X in probability, then it converges weakly.
(b) If Xn converges weakly to a constant, it converges in probability.
(c) (Slutsky's theorem) If Xn converges weakly to X and Yn converges weakly to a constant c, then Xn + Yn converges weakly to X + c and XnYn converges weakly to cX.
Proof. To prove (a), let g be a bounded and continuous function. If nj is any subsequence, then there exists a further subsequence njk such that X(njk) converges almost surely to X. Then by dominated convergence, E g(X(njk)) → E g(X). That suffices to show E g(Xn) converges to E g(X).
For (b), if Xn converges weakly to c,
P(Xn− c > ε) = P(Xn > c+ ε) = 1−P(Xn ≤ c+ ε)→ 1−P(c ≤ c+ ε) = 0.
We use the fact that if Y ≡ c, then c + ε is a point of continuity for FY . A similar equation shows P(Xn − c ≤ −ε) → 0, so P(|Xn − c| > ε) → 0.
We now prove the first part of (c), leaving the second part for the reader. Let x be a point such that x − c is a continuity point of FX . Choose ε so that x − c + ε is again a continuity point. Then

P(Xn + Yn ≤ x) ≤ P(Xn + c ≤ x + ε) + P(|Yn − c| > ε) → P(X ≤ x − c + ε).

So lim sup P(Xn + Yn ≤ x) ≤ P(X + c ≤ x + ε). Since ε can be as small as we like and x − c is a continuity point of FX , then lim sup P(Xn + Yn ≤ x) ≤ P(X + c ≤ x). The lim inf is done similarly.
Here is an example where Xn converges weakly but not in probability. Let X1, X2, . . . be an i.i.d. sequence with P(X1 = 1) = 1/2 and P(X1 = 0) = 1/2. Since the FXn are all equal, we have weak convergence.
We claim the Xn do not converge in probability. If they did, we could find a subsequence nj such that Xnj converges a.s. Let Aj = (Xnj = 1). These are independent sets with P(Aj) = 1/2, so ∑_j P(Aj) = ∞. By the Borel–Cantelli lemma, P(Aj i.o.) = 1, which means that Xnj is equal to 1 infinitely often with probability one. The same argument shows that Xnj is equal to 0 infinitely often with probability one, which implies that Xnj does not converge almost surely, a contradiction.
We say a sequence of distribution functions Fn is tight if for each ε > 0 there exists M such that Fn(M) ≥ 1 − ε and Fn(−M) ≤ ε for all n. A sequence of random variables is tight if the corresponding distribution functions are tight; this is equivalent to P(|Xn| ≥ M) ≤ ε for all n.
We give an easily checked criterion for tightness.
Proposition 5.3 Suppose there exists ϕ : [0,∞) → [0,∞) that is increasing and ϕ(x) → ∞ as x → ∞. If c = supn Eϕ(|Xn|) < ∞, then the Xn are tight.
Proof. Let ε > 0. Choose M such that ϕ(x) ≥ c/ε if x > M . Then

P(|Xn| > M) ≤ ∫ (ϕ(|Xn|)/(c/ε)) 1_{(|Xn|>M)} dP ≤ (ε/c) Eϕ(|Xn|) ≤ ε.
Theorem 5.4 (Helly's theorem) Let Fn be a sequence of distribution functions that is tight. There exists a subsequence nj and a distribution function F such that Fnj converges weakly to F .
What could happen is that Xn = n, so that FXn → 0; the tightness precludes this.
Proof. Let qk be an enumeration of the rationals. Since Fn(qk) ∈ [0, 1], any subsequence has a further subsequence that converges. Use the diagonalization procedure so that Fnj (qk) converges for each qk and call the limit F (qk). F is nondecreasing; define F (x) = inf_{qk≥x} F (qk). So F is right continuous and nondecreasing.
If x is a point of continuity of F and ε > 0, then there exist rationals r and s with r < x < s such that F (s) − F (x) < ε and F (x) − F (r) < ε. Then

Fnj (x) ≥ Fnj (r) → F (r) > F (x) − ε

and

Fnj (x) ≤ Fnj (s) → F (s) < F (x) + ε.
Since ε is arbitrary, Fnj(x)→ F (x).
Since the Fn are tight, there exists M such that Fn(−M) < ε for all n. Then F (−M) ≤ ε, which implies lim_{x→−∞} F (x) = 0. Showing lim_{x→∞} F (x) = 1 is similar. Therefore F is in fact a distribution function.
Chapter 6
Characteristic functions
We define the characteristic function of a random variable X by ϕX(t) = E e^{itX} for t ∈ R.
Note that ϕX(t) = ∫ e^{itx} PX(dx). So if X and Y have the same law, they have the same characteristic function. Also, if the law of X has a density, that is, PX(dx) = fX(x) dx, then ϕX(t) = ∫ e^{itx} fX(x) dx, so in this case the characteristic function is the same as (one definition of) the Fourier transform of fX .
Proposition 6.1 ϕ(0) = 1, |ϕ(t)| ≤ 1, ϕ(−t) is the complex conjugate of ϕ(t), and ϕ is uniformly continuous.
Proof. Since |e^{itx}| ≤ 1, everything follows immediately from the definitions except the uniform continuity. For that we write

|ϕ(t + h) − ϕ(t)| = |E e^{i(t+h)X} − E e^{itX}| ≤ E |e^{itX}(e^{ihX} − 1)| = E |e^{ihX} − 1|.

|e^{ihX} − 1| tends to 0 almost surely as h → 0, so the right hand side tends to 0 by dominated convergence. Note that the right hand side is independent of t.
Proposition 6.2 ϕaX(t) = ϕX(at) and ϕX+b(t) = e^{itb} ϕX(t).
Proof. The first follows from E eit(aX) = E ei(at)X , and the second is similar.
Proposition 6.3 If X and Y are independent, then ϕX+Y (t) = ϕX(t)ϕY (t).
Proof. From the multiplication theorem,
E eit(X+Y ) = E eitXeitY = E eitXE eitY .
Note that if X1 and X2 are independent and identically distributed, then
ϕX1−X2(t) = ϕX1(t)ϕ−X2(t) = ϕX1(t)ϕX2(−t) = |ϕX1(t)|^2, since ϕX2(−t) is the complex conjugate of ϕX2(t) = ϕX1(t).
Let us look at some examples of characteristic functions.
(a) Bernoulli: By direct computation, this is p e^{it} + (1 − p) = 1 − p(1 − e^{it}).

(b) Coin flip (i.e., P(X = +1) = P(X = −1) = 1/2): We have (1/2)e^{it} + (1/2)e^{−it} = cos t.
(c) Poisson:

E e^{itX} = ∑_{k=0}^∞ e^{itk} e^{−λ} λ^k/k! = e^{−λ} ∑_k (λe^{it})^k/k! = e^{−λ} e^{λe^{it}} = e^{λ(e^{it}−1)}.
(d) Point mass at a: E eitX = eita. Note that when a = 0, then ϕ ≡ 1.
(e) Binomial: Write X as the sum of n independent Bernoulli random variables Bi. So

ϕX(t) = ∏_{i=1}^n ϕBi(t) = [ϕB1(t)]^n = [1 − p(1 − e^{it})]^n.
(f) Geometric:

ϕ(t) = ∑_{k=0}^∞ p(1 − p)^k e^{itk} = p ∑_k ((1 − p)e^{it})^k = p/(1 − (1 − p)e^{it}).
(g) Uniform on [a, b]:

ϕ(t) = (1/(b − a)) ∫_a^b e^{itx} dx = (e^{itb} − e^{ita})/((b − a)it).

Note that when a = −b this reduces to sin(bt)/bt.
(h) Exponential:

∫_0^∞ λ e^{itx} e^{−λx} dx = λ ∫_0^∞ e^{(it−λ)x} dx = λ/(λ − it).
(i) Standard normal:

ϕ(t) = (1/√(2π)) ∫_{−∞}^{∞} e^{itx} e^{−x^2/2} dx.

This can be done by completing the square and then doing a contour integration. Alternately, ϕ′(t) = (1/√(2π)) ∫_{−∞}^{∞} ix e^{itx} e^{−x^2/2} dx (do the real and imaginary parts separately, and use the dominated convergence theorem to justify taking the derivative inside). Integrating by parts (again treating the real and imaginary parts separately), this is equal to −tϕ(t). The only solution to ϕ′(t) = −tϕ(t) with ϕ(0) = 1 is ϕ(t) = e^{−t^2/2}.
(j) Normal with mean µ and variance σ^2: Writing X = σZ + µ, where Z is a standard normal, then ϕX(t) = e^{iµt} ϕZ(σt) = e^{iµt−σ^2t^2/2}.
(k) Cauchy: We have

ϕ(t) = (1/π) ∫ (e^{itx}/(1 + x^2)) dx.

This is a standard exercise in contour integration in complex analysis. The answer is e^{−|t|}.
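The closed forms above can be spot-checked numerically by computing E e^{itX} directly (a small illustrative script of mine; the parameters and the test point t are arbitrary choices):

```python
import cmath
import math

t, p, lam, n = 0.7, 0.3, 2.5, 8

# Binomial(n, p): direct sum of e^{itk} P(X = k) vs [1 - p(1 - e^{it})]^n.
direct_binom = sum(
    cmath.exp(1j * t * k) * math.comb(n, k) * p**k * (1 - p)**(n - k)
    for k in range(n + 1)
)
formula_binom = (1 - p * (1 - cmath.exp(1j * t)))**n

# Poisson(lam): truncated series vs exp(lam (e^{it} - 1)); the tail is negligible.
direct_pois = sum(
    cmath.exp(1j * t * k) * math.exp(-lam) * lam**k / math.factorial(k)
    for k in range(60)
)
formula_pois = cmath.exp(lam * (cmath.exp(1j * t) - 1))

print(abs(direct_binom - formula_binom), abs(direct_pois - formula_pois))
```

Both differences are at the level of floating-point roundoff.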
6.1 Inversion formula
We need a preliminary real variable lemma, and then we can proceed to the inversion formula, which gives a formula for the distribution function in terms of the characteristic function.
Lemma 6.4 (a) ∫_0^N (sin(Ax)/x) dx → sgn(A) π/2 as N → ∞.
(b) sup_a |∫_0^a (sin(Ax)/x) dx| < ∞.
Proof. If A = 0, this is clear. The case A < 0 reduces to the case A > 0 by the fact that sin is an odd function. By a change of variables y = Ax, we reduce to the case A = 1. Part (a) is a standard result in contour integration, and part (b) comes from the fact that the integral can be written as an alternating series.
An alternate proof of (a) is the following. The function e^{−xy} sin x is integrable on {(x, y) : 0 < x < a, 0 < y < ∞}. So

∫_0^a (sin x/x) dx = ∫_0^a ∫_0^∞ e^{−xy} sin x dy dx
= ∫_0^∞ ∫_0^a e^{−xy} sin x dx dy
= ∫_0^∞ [ (e^{−xy}/(y^2 + 1)) (−y sin x − cos x) ]_0^a dy
= ∫_0^∞ [ (e^{−ay}/(y^2 + 1)) (−y sin a − cos a) + 1/(y^2 + 1) ] dy
= π/2 − sin a ∫_0^∞ (y e^{−ay}/(y^2 + 1)) dy − cos a ∫_0^∞ (e^{−ay}/(y^2 + 1)) dy.
The last two integrals tend to 0 as a → ∞ since the integrand is bounded by (1 + y)e^{−y} if a ≥ 1.
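Part (a) with A = 1 is the Dirichlet integral ∫_0^∞ (sin x/x) dx = π/2, which a simple midpoint-rule computation illustrates (my own numerical check; the cutoff N and step count are arbitrary, and the truncation error is O(1/N)):

```python
import math

# Midpoint rule for the Dirichlet integral over [0, N].
N, steps = 200.0, 200000
dx = N / steps
total = sum(
    math.sin((k + 0.5) * dx) / ((k + 0.5) * dx) for k in range(steps)
) * dx
print(total, math.pi / 2)   # the two values agree to a few thousandths
```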
Theorem 6.5 (Inversion formula) Let µ be a probability measure and let ϕ(t) = ∫ e^{itx} µ(dx). If a < b, then

lim_{T→∞} (1/(2π)) ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = µ(a, b) + (1/2)µ({a}) + (1/2)µ({b}).
The example where µ is point mass at 0, so ϕ(t) = 1, shows that one needs to take a limit, since the integrand in this case is 2 sin t/t, which is not integrable.
Proof. By Fubini,

∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = ∫_{−T}^{T} ∫ ((e^{−ita} − e^{−itb})/(it)) e^{itx} µ(dx) dt
= ∫ ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) e^{itx} dt µ(dx).

To justify this, we bound the integrand using the mean value theorem: |(e^{−ita} − e^{−itb})/(it)| ≤ b − a.
Expanding e^{−itb} and e^{−ita} using Euler's formula, and using the fact that cos is an even function and sin is odd, we are left with

∫ 2[ ∫_0^T (sin(t(x − a))/t) dt − ∫_0^T (sin(t(x − b))/t) dt ] µ(dx).
Using Lemma 6.4 and dominated convergence, this tends to

∫ [π sgn(x − a) − π sgn(x − b)] µ(dx).

The integrand equals 2π for x ∈ (a, b), π at x = a and at x = b, and 0 outside [a, b]; dividing by 2π gives the result.
Theorem 6.6 If ∫ |ϕ(t)| dt < ∞, then µ has a bounded density f and

f(y) = (1/(2π)) ∫ e^{−ity} ϕ(t) dt.
Proof. By the inversion formula,

µ(a, b) + (1/2)µ({a}) + (1/2)µ({b}) = lim_{T→∞} (1/(2π)) ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt
= (1/(2π)) ∫_{−∞}^{∞} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt
≤ ((b − a)/(2π)) ∫ |ϕ(t)| dt.
Letting b→ a shows that µ has no point masses.
We now write

µ(x, x + h) = (1/(2π)) ∫ ((e^{−itx} − e^{−it(x+h)})/(it)) ϕ(t) dt
= (1/(2π)) ∫ ( ∫_x^{x+h} e^{−ity} dy ) ϕ(t) dt
= ∫_x^{x+h} ( (1/(2π)) ∫ e^{−ity} ϕ(t) dt ) dy.
So µ has density f(y) = (1/(2π)) ∫ e^{−ity} ϕ(t) dt. As in the proof of Proposition 6.1, we see f is continuous.
A corollary to the inversion formula is the uniqueness theorem.
Theorem 6.7 If ϕX = ϕY , then PX = PY .
Proof. Let a and b be such that P(X = a) = P(X = b) = 0, and the same when X is replaced by Y . We have

FX(b) − FX(a) = P(a < X ≤ b) = PX((a, b]) = PX((a, b)),

and the same when X is replaced by Y . By the inversion theorem, PX((a, b)) = PY ((a, b)). Now let a → −∞ along a sequence of points that are continuity points for FX and FY . We obtain FX(b) = FY (b) if b is a continuity point for X and Y . Since FX and FY are right continuous, FX(x) = FY (x) for all x.
The following proposition can be proved directly, but the proof using characteristic functions is much easier.
Proposition 6.8 (a) If X and Y are independent, X is normal with mean a and variance b^2, and Y is normal with mean c and variance d^2, then X + Y is normal with mean a + c and variance b^2 + d^2.
(b) If X and Y are independent, X is Poisson with parameter λ1, and Y is Poisson with parameter λ2, then X + Y is Poisson with parameter λ1 + λ2.
(c) If the Xi are i.i.d. Cauchy, then Sn/n is Cauchy.
Proof. For (a),

ϕX+Y (t) = ϕX(t)ϕY (t) = e^{iat−b^2t^2/2} e^{ict−d^2t^2/2} = e^{i(a+c)t−(b^2+d^2)t^2/2}.
Now use the uniqueness theorem.
Parts (b) and (c) are proved similarly.
6.2 Continuity theorem
Lemma 6.9 Suppose ϕ is the characteristic function of a probability µ. Then

µ([−2A, 2A]) ≥ A |∫_{−1/A}^{1/A} ϕ(t) dt| − 1.
Proof. Note

(1/(2T)) ∫_{−T}^{T} ϕ(t) dt = (1/(2T)) ∫_{−T}^{T} ∫ e^{itx} µ(dx) dt = ∫ ∫ (1/(2T)) 1_{[−T,T ]}(t) e^{itx} dt µ(dx) = ∫ (sin(Tx)/(Tx)) µ(dx).
Since |sin(Tx)| ≤ 1, we have |sin(Tx)/(Tx)| ≤ 1/(2TA) if |x| ≥ 2A. Since also |sin(Tx)/(Tx)| ≤ 1, we then have

|∫ (sin(Tx)/(Tx)) µ(dx)| ≤ µ([−2A, 2A]) + ∫_{[−2A,2A]^c} (1/(2TA)) µ(dx)
= µ([−2A, 2A]) + (1/(2TA))(1 − µ([−2A, 2A]))
= 1/(2TA) + (1 − 1/(2TA)) µ([−2A, 2A]).
Setting T = 1/A,

|(A/2) ∫_{−1/A}^{1/A} ϕ(t) dt| ≤ 1/2 + (1/2) µ([−2A, 2A]).
Now multiply both sides by 2.
Proposition 6.10 If µn converges weakly to µ, then ϕn converges to ϕ uniformly on every finite interval.
Proof. Let ε > 0 and choose M large so that µ([−M, M ]^c) < ε. Define f to be 1 on [−M, M ], 0 on [−M − 1, M + 1]^c, and linear in between. Since ∫ f dµn → ∫ f dµ, then if n is large enough,

∫ (1 − f) dµn ≤ 2ε.
We have

|ϕn(t + h) − ϕn(t)| ≤ ∫ |e^{ihx} − 1| µn(dx) ≤ 2 ∫ (1 − f) dµn + |h| ∫ |x| f(x) µn(dx) ≤ 4ε + |h|(M + 1).

So for n large enough and |h| ≤ ε/(M + 1), we have

|ϕn(t + h) − ϕn(t)| ≤ 5ε,
which implies that the ϕn are equicontinuous. Therefore the convergence is uniform on finite intervals.
The interesting result of this section is the converse, Lévy's continuity theorem.
Theorem 6.11 Suppose µn are probabilities, ϕn(t) converges to a function ϕ(t) for each t, and ϕ is continuous at 0. Then ϕ is the characteristic function of a probability µ and µn converges weakly to µ.
Proof. Let ε > 0. Since ϕ is continuous at 0, choose δ small so that

|(1/(2δ)) ∫_{−δ}^{δ} ϕ(t) dt − 1| < ε.
Using the dominated convergence theorem, choose N such that

(1/(2δ)) ∫_{−δ}^{δ} |ϕn(t) − ϕ(t)| dt < ε

if n ≥ N . So if n ≥ N ,

|(1/(2δ)) ∫_{−δ}^{δ} ϕn(t) dt| ≥ |(1/(2δ)) ∫_{−δ}^{δ} ϕ(t) dt| − (1/(2δ)) ∫_{−δ}^{δ} |ϕn(t) − ϕ(t)| dt ≥ 1 − 2ε.
By Lemma 6.9 with A = 1/δ, for such n,

µn([−2/δ, 2/δ]) ≥ 2(1 − 2ε) − 1 = 1 − 4ε.
This shows the µn are tight.
Let nj be a subsequence such that µnj converges weakly, say to µ; such a subsequence exists by tightness and Helly's theorem. Then ϕnj (t) → ϕµ(t), hence ϕ(t) = ϕµ(t), or ϕ is the characteristic function of a probability µ. If µ′ is any subsequential weak limit point of µn, then ϕµ′(t) = ϕ(t) = ϕµ(t), so µ′ must equal µ. Hence µn converges weakly to µ.
We need the following estimate on moments.
Proposition 6.12 If E |X|^k < ∞ for an integer k, then ϕX has a continuous derivative of order k and

ϕ^{(k)}(t) = ∫ (ix)^k e^{itx} PX(dx).

In particular, ϕ^{(k)}(0) = i^k E X^k.
Proof. Write

(ϕ(t + h) − ϕ(t))/h = ∫ ((e^{i(t+h)x} − e^{itx})/h) PX(dx).
The integrand is bounded by |x|. So if ∫ |x| PX(dx) < ∞, we can use dominated convergence to obtain the desired formula for ϕ′(t). As in the proof of Proposition 6.1, we see ϕ′(t) is continuous. We do the case of general k by induction. Evaluating ϕ^{(k)} at 0 gives the particular case.
Here is a converse.
Proposition 6.13 If ϕ is the characteristic function of a random variable X and ϕ′′(0) exists, then E |X|^2 < ∞.
Proof. Note
\[
\frac{e^{ihx} - 2 + e^{-ihx}}{h^2} = -2\,\frac{1 - \cos hx}{h^2} \le 0,
\]
and 2(1 − cos hx)/h² converges to x² as h → 0. So by Fatou's lemma,
\[
\int x^2\, P_X(dx) \le 2\liminf_{h\to 0} \int \frac{1 - \cos hx}{h^2}\, P_X(dx)
= -\limsup_{h\to 0} \frac{\varphi(h) - 2\varphi(0) + \varphi(-h)}{h^2}
= -\varphi''(0) < \infty,
\]
since the symmetric second difference quotient of ϕ converges to ϕ′′(0) when ϕ′′(0) exists.
One nice application of the continuity theorem is a proof of the weak law of large numbers; its proof is very similar to the proof of the central limit theorem, which we give in the next chapter. Another nice use of characteristic functions and martingales is the following.
Proposition 6.14 Suppose Xi is a sequence of independent random variables and Sn = X1 + · · · + Xn converges weakly. Then Sn converges almost surely.
Proof. We first rule out the possibility that |Sn| → ∞ with positive probability. If this happens with positive probability ε, then given M there exists N depending on M such that if n ≥ N, then P(|Sn| > M) > ε. But then the limit law of the P_{Sn}, say P∞, would have P∞([−M, M]^c) ≥ ε for all M, a contradiction. Let N1 be the set of ω for which |Sn(ω)| → ∞; we have just shown P(N1) = 0.
Suppose Sn converges weakly to W. Then ϕSn(t) → ϕW(t) uniformly on compact sets by Proposition 6.10. Since ϕW(0) = 1 and ϕW is continuous, there exists δ such that |ϕW(t) − 1| < 1/2 if |t| < δ. So for n large, |ϕSn(t)| ≥ 1/4 if |t| < δ.
Note
\[
E[e^{itS_n} \mid X_1, \ldots, X_{n-1}] = e^{itS_{n-1}}\, E[e^{itX_n} \mid X_1, \ldots, X_{n-1}] = e^{itS_{n-1}}\varphi_{X_n}(t).
\]
Since ϕSn(t) = ∏ϕXi(t), it follows that e^{itSn}/ϕSn(t) is a martingale. Therefore for |t| < δ and n large, e^{itSn}/ϕSn(t) is a bounded martingale, and hence converges almost surely. Since ϕSn(t) → ϕW(t) ≠ 0, then e^{itSn} converges almost surely if |t| < δ.
Let A = {(ω, t) ∈ Ω × (−δ, δ) : e^{itSn(ω)} does not converge}. For each t we have almost sure convergence, so ∫ 1_A(ω, t) P(dω) = 0. Therefore ∫_{−δ}^{δ} ∫ 1_A dP dt = 0, and by Fubini, ∫ ∫_{−δ}^{δ} 1_A dt dP = 0. Hence almost surely ∫ 1_A(ω, t) dt = 0. This means there exists a set N2 with P(N2) = 0 such that if ω ∉ N2, then e^{itSn(ω)} converges for almost every t ∈ (−δ, δ). Call the limit, when it exists, L(t).
Fix ω /∈ N1 ∪N2.
Suppose one subsequence of Sn(ω) converges to Q and another subsequence to R. Then L(a) = e^{iaR} = e^{iaQ} for a.e. a small, which implies that R = Q.
Suppose one subsequence converges to R ≠ 0 and another to ±∞. By integrating e^{itSn} and using dominated convergence, we see that
\[
\int_0^a e^{itS_n}\, dt = \frac{e^{iaS_n} - 1}{iS_n}
\]
converges. We then obtain
\[
\frac{e^{iaR} - 1}{iR} = 0
\]
for a.e. a small, which implies R = 0.
Finally suppose one subsequence converges to 0 and another to ±∞. Since (e^{iaS_{n_j}} − 1)/iS_{n_j} → a along the subsequence converging to 0, we obtain a = 0 for a.e. a small, which is impossible.
We conclude Sn must converge.
Chapter 7
Central limit theorem
The simplest case of the central limit theorem (CLT) is the case when the Xi are i.i.d. with mean zero and variance one; the CLT then says that Sn/√n converges weakly to a standard normal. We first prove this case.
Lemma 7.1 If cn are complex numbers with cn → c, then
\[
\Big(1 + \frac{c_n}{n}\Big)^n \to e^c.
\]
Proof. If cn → c, there exists a real number R such that |cn| ≤ R for all n, and then |c| ≤ R. We use the fact (from the Taylor series expansion of e^x) that for x > 0,
\[
e^x = 1 + x + x^2/2! + \cdots \ge 1 + x.
\]
We also use the identity
\[
a^n - b^n = (a - b)(a^{n-1} + a^{n-2}b + \cdots + b^{n-1}),
\]
which implies
\[
|a^n - b^n| \le |a - b|\, n\, (|a| \vee |b|)^{n-1},
\]
where |a| ∨ |b| = max(|a|, |b|). Using this inequality with a = 1 + cn/n and b = 1 + c/n, we have
\[
|(1 + c_n/n)^n - (1 + c/n)^n| \le |c_n - c|\,(1 + R/n)^{n-1} \le |c_n - c|\,(1 + R/n)^n \le |c_n - c|\,(e^{R/n})^n = |c_n - c|\, e^R \to 0
\]
as n→∞.
Now using the inequality with a = 1 + c/n and b = e^{c/n}, and noting that |a| ∨ |b| ≤ e^{R/n}, we have
\[
|(1 + c/n)^n - (e^{c/n})^n| \le |(1 + c/n) - e^{c/n}|\, n\,(e^{R/n})^{n-1} \le |(1 + c/n) - e^{c/n}|\, n\, e^R.
\]
A Taylor series expansion or l'Hôpital's rule shows that n|(1 + c/n) − e^{c/n}| → 0, which proves the lemma.
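Lemma 7.1 is easy to check numerically; in this sketch the sequence cn = c + 1/n and the particular complex c are arbitrary choices for illustration.

```python
import numpy as np

# Check Lemma 7.1: if c_n -> c (complex), then (1 + c_n/n)^n -> e^c.
c = 1.0 + 2.0j
for n in [10, 100, 10_000, 1_000_000]:
    cn = c + 1.0 / n                     # a sequence with c_n -> c
    print(n, abs((1 + cn / n) ** n - np.exp(c)))
```

The printed errors shrink toward 0 as n grows, as the lemma predicts.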
Lemma 7.2 If f is twice continuously differentiable, then
\[
\frac{\big| f(x+h) - [f(x) + f'(x)h + \tfrac12 f''(x)h^2] \big|}{h^2} \to 0
\]
as h → 0.
Proof. Write
\[
f(x+h) - [f(x) + f'(x)h + \tfrac12 f''(x)h^2]
= \int_x^{x+h} [f'(y) - f'(x) - f''(x)(y - x)]\, dy
= \int_x^{x+h}\int_x^y [f''(z) - f''(x)]\, dz\, dy.
\]
So the above is bounded in absolute value by
\[
\int_x^{x+h}\int_x^y A(h)\, dz\, dy \le A(h)\int_x^{x+h} h\, dy \le A(h)h^2,
\]
where
\[
A(h) = \sup_{|w - x| \le h} |f''(w) - f''(x)|.
\]
Since f′′ is continuous, A(h) → 0 as h → 0, which proves the lemma.
Theorem 7.3 Suppose the Xi are i.i.d., mean zero, and variance one. Then Sn/√n converges weakly to a standard normal.
Proof. Since X1 has finite second moment, ϕX1 has a continuous second derivative. By Taylor's theorem,
\[
\varphi_{X_1}(t) = \varphi_{X_1}(0) + \varphi_{X_1}'(0)t + \varphi_{X_1}''(0)t^2/2 + R(t),
\]
where |R(t)|/t² → 0 as |t| → 0. Since EX1 = 0 and EX1² = 1, this says
\[
\varphi_{X_1}(t) = 1 - t^2/2 + R(t).
\]
Then
\[
\varphi_{S_n/\sqrt n}(t) = \varphi_{S_n}(t/\sqrt n) = \big(\varphi_{X_1}(t/\sqrt n)\big)^n = \Big[1 - \frac{t^2}{2n} + R(t/\sqrt n)\Big]^n.
\]
Since t/√n converges to zero as n → ∞, we have nR(t/√n) → 0, and so by Lemma 7.1
\[
\varphi_{S_n/\sqrt n}(t) \to e^{-t^2/2}.
\]
Now apply the continuity theorem.
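Theorem 7.3 can also be illustrated by simulation; in this sketch the Xi are uniform on [−√3, √3], an arbitrary mean-zero, variance-one choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 200, 20_000

# S_n / sqrt(n) for i.i.d. mean-zero, variance-one X_i
x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(trials, n))
s = x.sum(axis=1) / np.sqrt(n)

# Compare P(S_n/sqrt(n) <= 1) with the standard normal value ~0.8413.
print(s.mean(), s.var(), (s <= 1).mean())
```

The sample mean, variance, and distribution function agree with those of a standard normal to the simulation accuracy.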
Let us give another proof of this simple CLT that does not use characteristic functions. For simplicity let Xi be i.i.d. mean zero, variance one random variables with E|Xi|³ < ∞.
Proposition 7.4 With Xi as above, Sn/√n converges weakly to a standard
normal.
Proof. Let Y1, . . . , Yn be i.i.d. standard normal random variables that are independent of the Xi. Let Z1 = Y2 + · · · + Yn, Z2 = X1 + Y3 + · · · + Yn, Z3 = X1 + X2 + Y4 + · · · + Yn, etc.
Let us suppose g ∈ C³ with compact support and let W be a standard normal. Our first goal is to show
\[
|E\, g(S_n/\sqrt n) - E\, g(W)| \to 0. \tag{7.1}
\]
We have
\[
E\, g(S_n/\sqrt n) - E\, g(W) = E\, g(S_n/\sqrt n) - E\, g\Big(\sum_{i=1}^n Y_i/\sqrt n\Big)
= \sum_{i=1}^n \Big[ E\, g\Big(\frac{X_i + Z_i}{\sqrt n}\Big) - E\, g\Big(\frac{Y_i + Z_i}{\sqrt n}\Big) \Big].
\]
By Taylor's theorem,
\[
g\Big(\frac{X_i + Z_i}{\sqrt n}\Big) = g(Z_i/\sqrt n) + g'(Z_i/\sqrt n)\frac{X_i}{\sqrt n} + \frac12 g''(Z_i/\sqrt n)\frac{X_i^2}{n} + R_n,
\]
where |Rn| ≤ ‖g′′′‖∞|Xi|³/n^{3/2}. Taking expectations and using the independence of Xi and Zi together with EXi = 0 and EXi² = 1,
\[
E\, g\Big(\frac{X_i + Z_i}{\sqrt n}\Big) = E\, g(Z_i/\sqrt n) + 0 + \frac{1}{2n} E\, g''(Z_i/\sqrt n) + E\, R_n.
\]
We have a very similar expression for E g((Yi + Zi)/√n). Taking the difference,
\[
\Big| E\, g\Big(\frac{X_i + Z_i}{\sqrt n}\Big) - E\, g\Big(\frac{Y_i + Z_i}{\sqrt n}\Big) \Big| \le \|g'''\|_\infty\, \frac{E|X_i|^3 + E|Y_i|^3}{n^{3/2}}.
\]
Summing over i from 1 to n, we have (7.1).
By approximating continuous functions with compact support by C³ functions with compact support, we have (7.1) for such g. Since E(Sn/√n)² = 1, the sequence Sn/√n is tight. So given ε there exists M such that P(|Sn/√n| > M) < ε for all n. By taking M larger if necessary, we also have P(|W| > M) < ε. Suppose g is bounded and continuous. Let ψ be a continuous function with compact support that is bounded by one, is nonnegative, and equals 1 on [−M, M]. By (7.1) applied to gψ,
|E (gψ)(Sn/√n)− E (gψ)(W )| → 0.
However,
\[
|E\, g(S_n/\sqrt n) - E\, (g\psi)(S_n/\sqrt n)| \le \|g\|_\infty\, P(|S_n/\sqrt n| > M) < \varepsilon\|g\|_\infty,
\]
and similarly |E g(W) − E (gψ)(W)| < ε‖g‖∞. Since ε is arbitrary, this proves (7.1) for bounded continuous g. By Proposition 5.1, this proves our proposition.
We give another example of the use of characteristic functions.
Proposition 7.5 Suppose for each n the random variables Xni, i = 1, . . . , n, are i.i.d. Bernoullis with parameter pn. If npn → λ and Sn = ∑_{i=1}^n Xni, then Sn converges weakly to a Poisson random variable with parameter λ.
Proof. We write
\[
\varphi_{S_n}(t) = [\varphi_{X_{n1}}(t)]^n = \big[1 + p_n(e^{it} - 1)\big]^n = \Big[1 + \frac{np_n}{n}(e^{it} - 1)\Big]^n \to e^{\lambda(e^{it}-1)}
\]
by Lemma 7.1.
Now apply the continuity theorem.
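Proposition 7.5 is easy to see in a simulation; the values of λ and n below are arbitrary choices for illustration.

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(2)

# Binomial(n, p_n) with n p_n = lambda approximates Poisson(lambda).
lam, n = 3.0, 10_000
s = rng.binomial(n, lam / n, size=200_000)

for k in range(6):
    # empirical frequency vs. the Poisson probability e^{-lam} lam^k / k!
    print(k, (s == k).mean(), exp(-lam) * lam**k / factorial(k))
```

Each empirical frequency matches the corresponding Poisson probability to within the simulation error.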
A much more general theorem than Theorem 7.3 is the Lindeberg–Feller theorem.
Theorem 7.6 Suppose for each n, Xni, i = 1, . . . , n, are mean zero independent random variables. Suppose
(a) ∑_{i=1}^n E X_{ni}² → σ² > 0, and
(b) for each ε > 0, ∑_{i=1}^n E[ |X_{ni}|²; |X_{ni}| > ε ] → 0.
Let Sn = ∑_{i=1}^n Xni. Then Sn converges weakly to a normal random variable with mean zero and variance σ².
Note nothing is said about independence of the Xni for different n.
Let us look at Theorem 7.3 in light of this theorem. Suppose the Yi are i.i.d. and let Xni = Yi/√n. Then
\[
\sum_{i=1}^n E(Y_i/\sqrt n)^2 = E\, Y_1^2
\]
and
\[
\sum_{i=1}^n E[|X_{ni}|^2; |X_{ni}| > \varepsilon] = n E[|Y_1|^2/n; |Y_1| > \sqrt n\,\varepsilon] = E[|Y_1|^2; |Y_1| > \sqrt n\,\varepsilon],
\]
which tends to 0 by the dominated convergence theorem.
If the Yi are independent with mean 0 and
\[
\frac{\sum_{i=1}^n E|Y_i|^3}{(\mathrm{Var}\, S_n)^{3/2}} \to 0,
\]
then Sn/(Var Sn)^{1/2} converges weakly to a standard normal. This is known as Lyapounov's theorem; we leave its derivation from the Lindeberg–Feller theorem as an exercise for the reader.
Proof. Let ϕni be the characteristic function of Xni and let σ²ni be the variance of Xni. We need to show
\[
\prod_{i=1}^n \varphi_{ni}(t) \to e^{-t^2\sigma^2/2}. \tag{7.2}
\]
Using Taylor series, |e^{ib} − 1 − ib + b²/2| ≤ c|b|³ for a constant c. Also,
\[
|e^{ib} - 1 - ib + b^2/2| \le |e^{ib} - 1 - ib| + |b|^2/2 \le c|b|^2.
\]
If we apply these bounds to a random variable tY and take expectations,
\[
|\varphi_Y(t) - (1 + it\, EY - t^2 EY^2/2)| \le c\, E\big[\, |t|^3|Y|^3 \wedge t^2 Y^2 \,\big].
\]
Applying this to Y = Xni,
\[
|\varphi_{ni}(t) - (1 - t^2\sigma_{ni}^2/2)| \le c\, E\big[\, |t|^3|X_{ni}|^3 \wedge t^2|X_{ni}|^2 \,\big].
\]
The right-hand side is less than or equal to
\[
c\,E\big[\,|t|^3|X_{ni}|^3; |X_{ni}| \le \varepsilon\,\big] + c\,E\big[\,t^2|X_{ni}|^2; |X_{ni}| > \varepsilon\,\big]
\le c\varepsilon |t|^3\, E[|X_{ni}|^2] + c t^2\, E[|X_{ni}|^2; |X_{ni}| > \varepsilon].
\]
Summing over i we obtain
\[
\sum_{i=1}^n |\varphi_{ni}(t) - (1 - t^2\sigma_{ni}^2/2)| \le c\varepsilon |t|^3 \sum_i E[|X_{ni}|^2] + c t^2 \sum_i E[|X_{ni}|^2; |X_{ni}| > \varepsilon].
\]
We need the following inequality: if |ai|, |bi| ≤ 1, then
\[
\Big| \prod_{i=1}^n a_i - \prod_{i=1}^n b_i \Big| \le \sum_{i=1}^n |a_i - b_i|.
\]
To prove this, note
\[
\prod_i a_i - \prod_i b_i = (a_n - b_n)\prod_{i<n} b_i + a_n\Big( \prod_{i<n} a_i - \prod_{i<n} b_i \Big)
\]
and use induction.
Note |ϕni(t)| ≤ 1, and |1 − t²σ²ni/2| ≤ 1 because σ²ni ≤ ε² + E[|Xni|²; |Xni| > ε] < 1/t² if we take ε small enough and n large enough. So
\[
\Big| \prod_{i=1}^n \varphi_{ni}(t) - \prod_{i=1}^n (1 - t^2\sigma_{ni}^2/2) \Big| \le c\varepsilon |t|^3 \sum_i E[|X_{ni}|^2] + c t^2 \sum_i E[|X_{ni}|^2; |X_{ni}| > \varepsilon].
\]
Since sup_i σ²ni → 0, log(1 − t²σ²ni/2) is asymptotic to −t²σ²ni/2, and so
\[
\prod_i (1 - t^2\sigma_{ni}^2/2) = \exp\Big( \sum_i \log(1 - t^2\sigma_{ni}^2/2) \Big)
\]
is asymptotically equal to
\[
\exp\Big( -t^2 \sum_i \sigma_{ni}^2/2 \Big) = e^{-t^2\sigma^2/2}.
\]
Since ε is arbitrary, the proof is complete.
We now complete the proof of Theorem 2.11.
Proof of “only if” part of Theorem 2.11. Since ∑Xn converges, Xn must converge to zero a.s., and so P(|Xn| > A i.o.) = 0. Since the Xn are independent, the second Borel–Cantelli lemma then says ∑P(|Xn| > A) < ∞. Since P(Xn ≠ Yn i.o.) = 0, we also conclude that ∑Yn converges.
Let cn = ∑_{i=1}^n Var Yi and suppose cn → ∞. Let Z_{nm} = (Ym − EYm)/√cn. Then ∑_{m=1}^n Var Z_{nm} = (1/cn) ∑_{m=1}^n Var Ym = 1. If ε > 0, then for n large we have 2A/√cn < ε. Since |Ym| ≤ A and hence |EYm| ≤ A, then |Z_{nm}| ≤ 2A/√cn < ε. It follows that ∑_{m=1}^n E(|Z_{nm}|²; |Z_{nm}| > ε) = 0 for large n. By Theorem 7.6, ∑_{m=1}^n (Ym − EYm)/√cn converges weakly to a standard normal. However, ∑_{m=1}^n Ym converges and cn → ∞, so ∑_{m=1}^n Ym/√cn must converge to 0. The quantities ∑_{m=1}^n EYm/√cn are nonrandom, so there is no way the difference can converge to a standard normal, a contradiction. We conclude cn does not converge to infinity.
Let Vi = Yi − EYi. Since |Vi| ≤ 2A, EVi = 0, and Var Vi = Var Yi, which is summable, the “if” part of the three series criterion shows that ∑Vi converges. Since ∑Yi converges, taking the difference shows ∑EYi converges.
Chapter 8
Gaussian sequences
We first prove a converse to Proposition 6.3.
Proposition 8.1 If E e^{i(uX+vY)} = E e^{iuX} E e^{ivY} for all u and v, then X and Y are independent random variables.
Proof. Let X′ be a random variable with the same law as X, Y′ one with the same law as Y, and X′, Y′ independent. (We let Ω = [0, 1]², P Lebesgue measure, X′ a function of the first variable, and Y′ a function of the second variable, defined as in Proposition 1.2.) Then E e^{i(uX′+vY′)} = E e^{iuX′} E e^{ivY′}. Since X, X′ have the same law, they have the same characteristic function, and similarly for Y, Y′. Therefore (X′, Y′) has the same joint characteristic function as (X, Y). By the uniqueness of the Fourier transform, (X′, Y′) has the same joint law as (X, Y), which is easily seen to imply that X and Y are independent.
A sequence of random variables X1, . . . , Xn is said to be jointly normal if there exists a sequence of independent standard normal random variables Z1, . . . , Zm and constants b_{ij} and a_i such that Xi = ∑_{j=1}^m b_{ij} Zj + a_i, i = 1, . . . , n. In matrix notation, X = BZ + A. For simplicity, in what follows let us take A = 0; the modifications for the general case are easy. The covariance of two random variables X and Y is defined to be E[(X − EX)(Y − EY)]. Since we are assuming our normal random variables are mean 0, we can omit the centering at expectations. Given a sequence of mean 0 random
variables, we can talk about the covariance matrix, which is Cov(X) = E XX^t, where X^t denotes the transpose of the vector X. In the above case, we see Cov(X) = E[(BZ)(BZ)^t] = E[BZZ^tB^t] = BB^t, since E ZZ^t = I, the identity.
Let us compute the joint characteristic function E e^{iu^tX} of the vector X, where u is an n-dimensional vector. First, if v is an m-dimensional vector,
\[
E\, e^{iv^t Z} = E \prod_{j=1}^m e^{iv_j Z_j} = \prod_{j=1}^m E\, e^{iv_j Z_j} = \prod_{j=1}^m e^{-v_j^2/2} = e^{-v^t v/2},
\]
using the independence of the Z's. Setting v = B^t u,
\[
E\, e^{iu^t X} = E\, e^{iu^t BZ} = e^{-u^t BB^t u/2}.
\]
By taking u = (0, . . . , 0, a, 0, . . . , 0) to be a constant times the unit vector in the jth coordinate direction, we deduce that each of the X's is indeed normal.
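The identity Cov(X) = BB^t is easy to check by simulation; the matrix B below is an arbitrary example, not from the text.

```python
import numpy as np

rng = np.random.default_rng(3)

# X = B Z, where Z is a vector of independent standard normals.
B = np.array([[1.0, 0.5],
              [0.0, 2.0]])
Z = rng.standard_normal((2, 500_000))   # columns are independent samples of Z
X = B @ Z

sample_cov = X @ X.T / X.shape[1]       # estimates E[X X^t]; the X's have mean 0
print(sample_cov)
print(B @ B.T)
```

The sample covariance matrix of X matches BB^t up to Monte Carlo error.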
Proposition 8.2 If the Xi are jointly normal and Cov(Xi, Xj) = 0 for i ≠ j, then the Xi are independent.
Proof. If Cov(X) = BB^t is a diagonal matrix, then the joint characteristic function of the X's factors into a product over the coordinates, and so by Proposition 8.1 the X's are independent.
Chapter 9
Kolmogorov extension theorem
The goal of this section is to show how to construct probability measures on R^N = R × R × · · · . We may view R^N as the set of sequences (x1, x2, . . .) of elements of R. Given an element x = (x1, x2, . . .) of R^N, we define τn(x) = (x1, . . . , xn) ∈ R^n. A cylindrical set in R^N is a set of the form A × R^N, where A is a Borel subset of R^n for some n ≥ 1. Another way of phrasing this is to say a cylindrical set is one of the form τ_n^{−1}(A), where n ≥ 1 and A is a Borel subset of R^n. We furnish R^N with the product topology, the smallest topology that contains all cylindrical sets. We use the σ-field on R^N generated by the cylindrical sets; thus the σ-field we use is the same as the Borel σ-field on R^N. We use B^n to denote the Borel σ-field on R^n.
We suppose that for each n we have a probability measure µn defined on(Rn,Bn). The µn are consistent if µn+1(A × R) = µn(A) whenever A ∈ Bn.The Kolmogorov extension theorem is the following.
Theorem 9.1 Suppose for each n we have a probability measure µn on (R^n, B^n). Suppose the µn are consistent. Then there exists a probability measure µ on R^N such that µ(A × R^N) = µn(A) for all A ∈ B^n.
Proof. Define µ on cylindrical sets by µ(A × R^N) = µm(A) if A ∈ B^m. By the consistency assumption, µ is well defined. If A0 is the collection of cylindrical sets, it is easy to see that A0 is an algebra of sets and that µ is finitely additive on A0. If we can show that µ is countably additive on A0,
then by the Carathéodory extension theorem, we can extend µ to the σ-field generated by the cylindrical sets. It suffices to show that whenever An ↓ ∅ with An ∈ A0, then µ(An) → 0.
Suppose the An are cylindrical sets decreasing to ∅ but µ(An) does not tend to 0; by taking a subsequence we may assume without loss of generality that there exists ε > 0 such that µ(An) ≥ ε for all n. We will obtain a contradiction.
It is possible that An might depend on fewer or more than n coordinates. It will be more convenient to arrange things so that An depends on exactly n coordinates; that is, we want An = τ_n^{−1}(A′_n) for some Borel subset A′_n of R^n. Suppose An is of the form An = τ_{j_n}^{−1}(Dn) for some Dn ⊂ R^{j_n}; in other words, An depends on j_n coordinates. By letting A0 = R^N and replacing our original sequence by A0, . . . , A0, A1, . . . , A1, A2, . . . , A2, . . ., where we repeat each Ai sufficiently many times, we may without loss of generality suppose that j_n ≤ n. On the other hand, if j_n < n and An = τ_{j_n}^{−1}(Dn), we may write An = τ_n^{−1}(D′_n) with D′_n = Dn × R^{n−j_n}. Thus we may without loss of generality suppose that An depends on exactly n coordinates.
We set A′_n = τn(An), a Borel subset of R^n. For each n, choose a compact B′_n ⊂ A′_n and set Bn = τ_n^{−1}(B′_n), so that µ(An − Bn) ≤ ε/2^{n+1}. To do this, first choose M such that µn(([−M, M]^n)^c) < ε/2^{n+2}, and then choose a compact subset B′_n of A′_n ∩ [−M, M]^n such that µn(A′_n ∩ [−M, M]^n − B′_n) ≤ ε/2^{n+2}. Let Cn = B1 ∩ · · · ∩ Bn. Hence Cn ⊂ Bn ⊂ An and Cn ↓ ∅. Since A1 ∩ · · · ∩ An − B1 ∩ · · · ∩ Bn ⊂ ∪_{i=1}^n (Ai − Bi) and A1 ∩ · · · ∩ An = An, we have
\[
\mu(C_n) \ge \mu(A_n) - \sum_{i=1}^n \mu(A_i - B_i) \ge \varepsilon/2.
\]
Also C′_n = τn(Cn), the projection of Cn onto R^n, is compact.

We will find x = (x1, . . . , xn, . . .) ∈ ∩nCn and thereby obtain our contradiction, since Cn ↓ ∅. For each n choose a point y(n) ∈ Cn. The first coordinates of the y(n), namely the y1(n), form a sequence contained in the compact set C′_1, hence there is a convergent subsequence y1(nk); let x1 be the limit. The first and second coordinates of the y(nk) form a sequence contained in the compact set C′_2, so a further subsequence (y1(n_{k_j}), y2(n_{k_j})) converges to a point in C′_2. Since n_{k_j} is a subsequence of nk, the first coordinate of the limit is x1. Therefore the limit point is of the form (x1, x2), and this point is in C′_2. We continue this procedure to obtain x = (x1, x2, . . . , xn, . . .). By our construction, (x1, . . . , xn) ∈ C′_n for each n, hence x ∈ Cn for each n, so x ∈ ∩nCn, a contradiction.
A typical application of this theorem is to construct a countable sequence of independent random variables. For each n we can construct X1, . . . , Xn, a collection of n independent random variables. Let µn be the joint law of (X1, . . . , Xn); it is easy to check that the µn form a consistent family. We use Theorem 9.1 to obtain a probability measure µ on R^N. To get random variables out of this, we let Xi(ω) = ωi if ω = (ω1, ω2, . . .). If we set P = µ, then the law of (X1, X2, . . .) under P is µ.
Chapter 10
Brownian motion
10.1 Definition and construction
In this section we construct Brownian motion and define Wiener measure.
Let (Ω, F, P) be a probability space and let B be the Borel σ-field on [0, ∞). A stochastic process, denoted X(t, ω) or Xt(ω) or just Xt, is a map from [0, ∞) × Ω to R that is measurable with respect to the product σ-field of B and F.
Definition 10.1 A stochastic process Xt is a one-dimensional Brownian motion started at 0 if
(1) X0 = 0 a.s.;
(2) for all s ≤ t, Xt − Xs is a mean zero normal random variable with variance t − s;
(3) the random variables X_{r_i} − X_{r_{i−1}}, i = 1, . . . , n, are independent whenever 0 ≤ r0 ≤ r1 ≤ · · · ≤ rn;
(4) there exists a null set N such that if ω ∉ N, then the map t → Xt(ω) is continuous.
Brownian motion has a useful scaling property.
Proposition 10.2 If Xt is a Brownian motion started at 0, a > 0, and Yt = a^{−1}X_{a²t}, then Yt is a Brownian motion started at 0.
Proof. Clearly Y has continuous paths and Y0 = 0. We observe that Yt − Ys = a^{−1}(X_{a²t} − X_{a²s}) is a mean zero normal with variance a^{−2}(a²t − a²s) = t − s. If 0 = r0 ≤ r1 < · · · < rn, then a²r0 ≤ a²r1 < · · · < a²rn, so the X_{a²r_i} − X_{a²r_{i−1}} are independent, and hence the Y_{r_i} − Y_{r_{i−1}} are independent.
Let us show that there exists a Brownian motion. We give the Haar function construction, which is one of the quickest routes to Brownian motion.
For i = 1, 2, . . . and j = 1, 2, . . . , 2^{i−1}, let ϕij be the function on [0, 1] defined by
\[
\varphi_{ij}(x) =
\begin{cases}
2^{(i-1)/2}, & x \in \big[\tfrac{2j-2}{2^i}, \tfrac{2j-1}{2^i}\big);\\[2pt]
-2^{(i-1)/2}, & x \in \big[\tfrac{2j-1}{2^i}, \tfrac{2j}{2^i}\big);\\[2pt]
0, & \text{otherwise}.
\end{cases}
\]
Let ϕ00 be the function that is identically 1. The ϕij are called the Haar functions. If 〈·, ·〉 denotes the inner product in L²([0, 1]), that is, 〈f, g〉 = ∫_0^1 f(x)g(x) dx, note the ϕij are orthogonal and have norm 1. It is also easy to see that they form a complete orthonormal system for L²: ϕ00 ≡ 1; 1_{[0,1/2)} and 1_{[1/2,1)} are both linear combinations of ϕ00 and ϕ11; 1_{[0,1/4)} and 1_{[1/4,1/2)} are both linear combinations of 1_{[0,1/2)}, ϕ21, and ϕ22. Continuing in this way, we see that 1_{[k/2^n,(k+1)/2^n)} is a linear combination of the ϕij for each n and each k ≤ 2^n. Since any continuous function can be uniformly approximated by step functions whose jumps are at the dyadic rationals, linear combinations of the Haar functions are dense in the set of continuous functions, which in turn is dense in L²([0, 1]).
Let ψij(t) = ∫_0^t ϕij(r) dr. Let Yij be a sequence of independent identically distributed standard normal random variables. Set
\[
V_0(t) = Y_{00}\psi_{00}(t), \qquad V_i(t) = \sum_{j=1}^{2^{i-1}} Y_{ij}\psi_{ij}(t), \quad i \ge 1.
\]
If ei, i = 1, . . . , N, is a finite orthonormal set and f = ∑_{i=1}^N a_i e_i, then
\[
\langle f, e_j\rangle = \sum_{i=1}^N a_i \langle e_i, e_j\rangle = a_j,
\]
or
\[
f = \sum_{i=1}^N \langle f, e_i\rangle e_i.
\]
If f = ∑_{i=1}^N a_i e_i and g = ∑_{j=1}^N b_j e_j, then
\[
\langle f, g\rangle = \sum_{i=1}^N \sum_{j=1}^N a_i b_j \langle e_i, e_j\rangle = \sum_{i=1}^N a_i b_i,
\]
or
\[
\langle f, g\rangle = \sum_{i=1}^N \langle f, e_i\rangle\langle g, e_i\rangle.
\]
We need one more preliminary. If Z is a standard normal random variable, then Z has density (2π)^{−1/2}e^{−x²/2}. Since ∫ x⁴ e^{−x²/2} dx < ∞, then E Z⁴ < ∞. We then have
\[
P(|Z| > \lambda) = P(Z^4 > \lambda^4) \le \frac{E Z^4}{\lambda^4}. \tag{10.1}
\]
Theorem 10.3 ∑_{i=0}^∞ Vi(t) converges uniformly in t a.s. If we call the sum Xt, then Xt is a Brownian motion started at 0.
Proof. Step 1. We first prove convergence of the series. Let
\[
A_i = \big(|V_i(t)| > i^{-2} \text{ for some } t \in [0,1]\big).
\]
We will show ∑_{i=1}^∞ P(Ai) < ∞. Then by the Borel–Cantelli lemma, except for ω in a null set, there exists i0(ω) such that if i ≥ i0(ω), we have sup_t |Vi(t)(ω)| ≤ i^{−2}. This will show ∑_{i=0}^I Vi(t)(ω) converges as I → ∞, uniformly over t ∈ [0, 1]. Moreover, since each ψij(t) is continuous in t, so is each Vi(t)(ω), and we thus deduce that Xt(ω) is continuous in t.
Now for i ≥ 1 and j1 ≠ j2, for each t at least one of ψ_{ij_1}(t) and ψ_{ij_2}(t) is zero. Also, the maximum value of ψij is 2^{−(i+1)/2}. Hence
\[
P(|V_i(t)| > i^{-2} \text{ for some } t \in [0,1])
\le P\big(|Y_{ij}|\psi_{ij}(t) > i^{-2} \text{ for some } t \in [0,1] \text{ and some } j \le 2^{i-1}\big)
\le \sum_{j=1}^{2^{i-1}} P\big(|Y_{ij}|\, 2^{-(i+1)/2} > i^{-2}\big)
= 2^{i-1}\, P\big(|Z| > 2^{(i+1)/2} i^{-2}\big),
\]
where Z is a standard normal random variable. Using (10.1),
\[
P(A_i) \le 2^{i-1}\, \frac{E Z^4\, i^8}{2^{2(i+1)}} \le c\, i^8\, 2^{-i},
\]
which is summable in i.
Step 2. Next we show that the limit, Xt, satisfies the definition of Brownianmotion. It is obvious that each Xt has mean zero and that X0 = 0.
Let D = {k/2^n : n ≥ 0, 0 ≤ k ≤ 2^n} be the dyadic rationals in [0, 1]. If s, t ∈ D and s < t, then there exists N such that s = k/2^N, t = ℓ/2^N, and then, since ψij(s) = ψij(t) = 0 for i > N,
\[
E[X_s X_t] = \sum_{i=0}^N \sum_j \sum_{k=0}^N \sum_\ell E[Y_{ij} Y_{k\ell}]\, \psi_{ij}(s)\psi_{k\ell}(t)
= \sum_{i=0}^N \sum_j \psi_{ij}(s)\psi_{ij}(t)
= \sum_{i=0}^N \sum_j \langle 1_{[0,s]}, \varphi_{ij}\rangle \langle 1_{[0,t]}, \varphi_{ij}\rangle
= \langle 1_{[0,s]}, 1_{[0,t]}\rangle = s.
\]
Thus Cov(Xs, Xt) = s. Applying this also with s replaced by t and t replaced by s, we see that Var Xt = t and Var Xs = s, and therefore
\[
\mathrm{Var}(X_t - X_s) = t - 2s + s = t - s.
\]
Clearly Xt − Xs is normal, hence
\[
E\, e^{iu(X_t - X_s)} = e^{-u^2(t-s)/2}.
\]
For general s, t ∈ [0, 1], choose sm < tm in D such that sm → s and tm → t. Using dominated convergence and the fact that X has continuous paths,
\[
E\, e^{iu(X_t - X_s)} = \lim_{m\to\infty} E\, e^{iu(X_{t_m} - X_{s_m})} = \lim_{m\to\infty} e^{-u^2(t_m - s_m)/2} = e^{-u^2(t-s)/2}.
\]
This proves (2).
We can work backwards from this to see that if s < t, then
\[
t - s = \mathrm{Var}(X_t - X_s) = \mathrm{Var}\, X_t - 2\,\mathrm{Cov}(X_s, X_t) + \mathrm{Var}\, X_s = s + t - 2\,\mathrm{Cov}(X_s, X_t),
\]
or Cov(Xs, Xt) = s.
Suppose 0 ≤ r1 < · · · < rn are in D. There exists N such that each ri = mi/2^N. So X_{r_k} = ∑_{i=0}^N ∑_j Y_{ij} ψ_{ij}(r_k), and hence the X_{r_k} are jointly normal. If ri < rj, then
\[
\mathrm{Cov}(X_{r_i} - X_{r_{i-1}},\, X_{r_j} - X_{r_{j-1}}) = (r_i - r_i) - (r_{i-1} - r_{i-1}) = 0,
\]
and hence the increments are independent, being jointly normal with zero covariances. We therefore have
\[
E\, e^{i\sum_{k=1}^n u_k (X_{r_k} - X_{r_{k-1}})} = \prod_{k=1}^n E\, e^{iu_k (X_{r_k} - X_{r_{k-1}})}. \tag{10.2}
\]
Now suppose the rk are in [0, 1]. Take r_k^m in D converging to rk as m → ∞ with r_1^m < · · · < r_n^m. Replacing rk by r_k^m, letting m → ∞, and using dominated convergence, we have (10.2). This proves independent increments.
The stochastic process Xt induces a measure on C([0, 1]). We say A ⊂ C([0, 1]) is a cylindrical set if
\[
A = \{ f \in C([0,1]) : (f(r_1), \ldots, f(r_n)) \in B \}
\]
for some n ≥ 1, r1 ≤ · · · ≤ rn, and B a Borel subset of R^n. For A a cylindrical set, define µ(A) = P(X·(ω) ∈ A), where X is a Brownian motion and X·(ω) is the function t → Xt(ω). We extend µ to the σ-field generated
by the cylindrical sets. If B is in this σ-field, then µ(B) = P(X· ∈ B). Theprobability measure µ is called Wiener measure.
We defined Brownian motion for t ∈ [0, 1]. To define Brownian motion for t ∈ [0, ∞), take a sequence X^n_t of independent Brownian motions on [0, 1] and piece them together as follows. Define Xt = X^1_t for 0 ≤ t ≤ 1. For 1 < t ≤ 2, define Xt = X1 + X^2_{t−1}. For 2 < t ≤ 3, let Xt = X2 + X^3_{t−2}, and so on.
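The Haar construction of Theorem 10.3 translates directly into code. The following is an illustrative sketch; the truncation level of the series and the grid size are arbitrary choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(4)

def psi(i, j, t):
    # psi_ij(t) = integral of the Haar function phi_ij from 0 to t
    left = (2 * j - 2) / 2**i
    mid = (2 * j - 1) / 2**i
    right = (2 * j) / 2**i
    up = np.clip(t, left, mid) - left      # time spent on the +2^{(i-1)/2} piece
    down = np.clip(t, mid, right) - mid    # time spent on the -2^{(i-1)/2} piece
    return 2 ** ((i - 1) / 2) * (up - down)

t = np.linspace(0.0, 1.0, 513)
levels = 12                                # truncation level of the series
X = rng.standard_normal() * t              # V_0(t) = Y_00 psi_00(t) = Y_00 t
for i in range(1, levels + 1):
    for j in range(1, 2 ** (i - 1) + 1):
        X = X + rng.standard_normal() * psi(i, j, t)

print(X[0], X[-1])                         # X_0 = 0; X_1 is standard normal
```

One sanity check, following the covariance computation in the proof: by Parseval, ∑ ψij(t)² over the included terms equals t exactly once the truncation level exceeds the dyadic level of t, so the truncated sum has Var Xt = t on the grid.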
10.2 Nowhere differentiability
We have the following result, which says that except for ω in a null set, t → Xt(ω) is a function that does not have a derivative at any time t ∈ [0, 1].
Theorem 10.4 With probability one, the paths of Brownian motion are no-where differentiable.
Proof. Note that if Z is a normal random variable with mean 0 and variance 1, then
\[
P(|Z| \le r) = \frac{1}{\sqrt{2\pi}} \int_{-r}^r e^{-x^2/2}\, dx \le 2r. \tag{10.3}
\]
Let M, h > 0 and let
\[
A_{M,h} = \{\exists\, s \in [0,1] : |X_t - X_s| \le M|t-s| \text{ if } |t-s| \le h\},
\]
\[
B_n = \{\exists\, k \le 2n : |X_{k/n} - X_{(k-1)/n}| \le 4M/n,\ |X_{(k+1)/n} - X_{k/n}| \le 4M/n,\ |X_{(k+2)/n} - X_{(k+1)/n}| \le 4M/n\}.
\]
We check that A_{M,h} ⊂ Bn if n ≥ 2/h. To see this, if ω ∈ A_{M,h}, there exists an s such that |Xt − Xs| ≤ M|t − s| if |t − s| ≤ 2/n; let k/n be the largest multiple of 1/n less than or equal to s. Then
\[
|(k+2)/n - s| \le 2/n \quad\text{and}\quad |(k+1)/n - s| \le 2/n,
\]
and therefore
\[
|X_{(k+2)/n} - X_{(k+1)/n}| \le |X_{(k+2)/n} - X_s| + |X_s - X_{(k+1)/n}| \le 2M/n + 2M/n = 4M/n.
\]
Similarly, |X_{(k+1)/n} − X_{k/n}| and |X_{k/n} − X_{(k−1)/n}| are at most 4M/n.
Using the independent increments property, the stationary increments property, and (10.3),
\[
\begin{aligned}
P(B_n) &\le 2n \sup_{k \le 2n} P\big(|X_{k/n} - X_{(k-1)/n}| \le 4M/n,\ |X_{(k+1)/n} - X_{k/n}| \le 4M/n,\ |X_{(k+2)/n} - X_{(k+1)/n}| \le 4M/n\big) \\
&\le 2n\, P(|X_{1/n}| \le 4M/n)\, P(|X_{2/n} - X_{1/n}| \le 4M/n)\, P(|X_{3/n} - X_{2/n}| \le 4M/n) \\
&= 2n\, \big(P(|X_{1/n}| \le 4M/n)\big)^3 \le cn \Big(\frac{4M}{\sqrt n}\Big)^3,
\end{aligned}
\]
which tends to 0 as n→∞. Hence for each M and h,
\[
P(A_{M,h}) \le \limsup_{n\to\infty} P(B_n) = 0.
\]
This implies that the probability that there exists s ≤ 1 such that
\[
\limsup_{h\to 0} \frac{|X_{s+h} - X_s|}{|h|} \le M
\]
is zero. Since M is arbitrary, this proves the theorem.
Chapter 11
Markov chains
11.1 Framework for Markov chains
Suppose S is a set with some topological structure that we will use as ourstate space. Think of S as being Rd or the positive integers, for example. Asequence of random variables X0, X1, . . ., is a Markov chain if
P(Xn+1 ∈ A | X0, . . . , Xn) = P(Xn+1 ∈ A | Xn) (11.1)
for all n and all measurable sets A. The definition of Markov chain has this intuition: to predict the probability that Xn+1 is in any set, we need only know where we currently are; how we got there gives no additional information.
Let's make some additional comments. First of all, we previously considered random variables as mappings from Ω to R. Now we want to extend our definition by allowing a random variable to be a map X from Ω to S, where (X ∈ A) is F measurable for all open sets A. This agrees with the definition of random variable in the case S = R.
Although there is quite a theory developed for Markov chains with arbitrary state spaces, we will confine our attention to the case where either S is finite, in which case we will usually suppose S = {1, 2, . . . , n}, or countable and discrete, in which case we will usually suppose S is the set of positive integers.
We are going to further restrict our attention to Markov chains where
P(Xn+1 ∈ A | Xn = x) = P(X1 ∈ A | X0 = x),
that is, where the probabilities do not depend on n. Such Markov chains are said to have stationary transition probabilities.
Define the initial distribution of a Markov chain with stationary transitionprobabilities by µ(i) = P(X0 = i). Define the transition probabilities byp(i, j) = P(Xn+1 = j | Xn = i). Since the transition probabilities arestationary, p(i, j) does not depend on n.
In this case we can use the definition of conditional probability given inundergraduate classes. If P(Xn = i) = 0 for all n, that means we never visiti and we could drop the point i from the state space.
Proposition 11.1 Let X be a Markov chain with initial distribution µ and transition probabilities p(i, j). Then
\[
P(X_n = i_n, X_{n-1} = i_{n-1}, \ldots, X_1 = i_1, X_0 = i_0) = \mu(i_0)\, p(i_0, i_1) \cdots p(i_{n-1}, i_n). \tag{11.2}
\]
Proof. We use induction on n. It is clearly true for n = 0 by the definition of µ(i). Suppose it holds for n; we show it holds for n + 1. For simplicity, we will do the case n + 1 = 3. Then
\[
\begin{aligned}
P(X_3 = i_3, X_2 = i_2, X_1 = i_1, X_0 = i_0)
&= E[P(X_3 = i_3 \mid X_0 = i_0, X_1 = i_1, X_2 = i_2);\ X_2 = i_2, X_1 = i_1, X_0 = i_0] \\
&= E[P(X_3 = i_3 \mid X_2 = i_2);\ X_2 = i_2, X_1 = i_1, X_0 = i_0] \\
&= p(i_2, i_3)\, P(X_2 = i_2, X_1 = i_1, X_0 = i_0).
\end{aligned}
\]
Now by the induction hypothesis,
\[
P(X_2 = i_2, X_1 = i_1, X_0 = i_0) = \mu(i_0)\, p(i_0, i_1)\, p(i_1, i_2).
\]
Substituting establishes the claim for n + 1 = 3.
The above proposition says that the law of the Markov chain is determined by the µ(i) and p(i, j). The formula (11.2) also gives a prescription for constructing a Markov chain given the µ(i) and p(i, j).
Proposition 11.2 Suppose µ(i) is a sequence of nonnegative numbers with ∑_i µ(i) = 1 and for each i the sequence p(i, j) is nonnegative and sums to 1. Then there exists a Markov chain with µ(i) as its initial distribution and p(i, j) as the transition probabilities.
Proof. Define Ω = S^∞. An element ω of Ω is a sequence (i0, i1, . . .). Let F be the σ-field generated by the sets {ω = (i0, i1, . . .) : i0 = j0, . . . , in = jn} for n ≥ 0 and j0, . . . , jn ∈ S. Define Xj(ω) = ij if ω = (i0, i1, . . .). Define P(X0 = i0, . . . , Xn = in) by (11.2). Using the Kolmogorov extension theorem, one can show that P can be extended to a probability on Ω.
The above framework is rather abstract, but it is clear that under P the sequence Xn has initial distribution µ(i); what we need to show is that Xn is a Markov chain and that
\[
P(X_{n+1} = i_{n+1} \mid X_0 = i_0, \ldots, X_n = i_n) = P(X_{n+1} = i_{n+1} \mid X_n = i_n) = p(i_n, i_{n+1}). \tag{11.3}
\]
By the definition of conditional probability, the left hand side of (11.3) is
\[
P(X_{n+1} = i_{n+1} \mid X_0 = i_0, \ldots, X_n = i_n)
= \frac{P(X_{n+1} = i_{n+1}, X_n = i_n, \ldots, X_0 = i_0)}{P(X_n = i_n, \ldots, X_0 = i_0)}
= \frac{\mu(i_0) \cdots p(i_{n-1}, i_n)\, p(i_n, i_{n+1})}{\mu(i_0) \cdots p(i_{n-1}, i_n)}
= p(i_n, i_{n+1}), \tag{11.4}
\]
as desired.
To complete the proof we need to show
\[
\frac{P(X_{n+1} = i_{n+1}, X_n = i_n)}{P(X_n = i_n)} = p(i_n, i_{n+1}),
\]
or
\[
P(X_{n+1} = i_{n+1}, X_n = i_n) = p(i_n, i_{n+1})\, P(X_n = i_n). \tag{11.5}
\]
Now
\[
P(X_n = i_n) = \sum_{i_0, \ldots, i_{n-1}} P(X_n = i_n, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = \sum_{i_0, \ldots, i_{n-1}} \mu(i_0) \cdots p(i_{n-1}, i_n)
\]
and similarly
\[
P(X_{n+1} = i_{n+1}, X_n = i_n) = \sum_{i_0, \ldots, i_{n-1}} P(X_{n+1} = i_{n+1}, X_n = i_n, \ldots, X_0 = i_0) = p(i_n, i_{n+1}) \sum_{i_0, \ldots, i_{n-1}} \mu(i_0) \cdots p(i_{n-1}, i_n).
\]
Equation (11.5) now follows.
Note in this construction that the Xn sequence is fixed and does not depend on µ or p. Let p(i, j) be fixed. The probability we constructed above is often denoted Pµ. If µ is point mass at a point i or x, it is denoted Pi or Px. So we have one probability space, one sequence Xn, but a whole family of probabilities Pµ.
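The prescription behind Proposition 11.2 amounts to an obvious sampling procedure: draw X0 from µ, then each Xn+1 from the row p(Xn, ·). A sketch, with a made-up µ and p on three states:

```python
import numpy as np

rng = np.random.default_rng(5)

mu = np.array([0.5, 0.3, 0.2])           # initial distribution (illustrative)
p = np.array([[0.1, 0.6, 0.3],           # transition matrix (illustrative)
              [0.4, 0.4, 0.2],
              [0.5, 0.0, 0.5]])

def sample_chain(n_steps):
    x = [rng.choice(3, p=mu)]            # X_0 ~ mu
    for _ in range(n_steps):
        x.append(rng.choice(3, p=p[x[-1]]))  # X_{n+1} ~ p(X_n, .)
    return np.array(x)

chain = sample_chain(50_000)
# Empirical transition frequencies out of state 0 approximate the row p[0].
from_0 = chain[:-1] == 0
print(np.bincount(chain[1:][from_0], minlength=3) / from_0.sum())
```

The empirical transition frequencies recover the row of p, consistent with (11.2).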
11.2 Examples
Random walk on the integers
We let Yi be an i.i.d. sequence of random variables with p = P(Yi = 1) and 1 − p = P(Yi = −1). Let Xn = X0 + ∑_{i=1}^n Yi. Then the Xn can be viewed as a Markov chain with p(i, i + 1) = p, p(i, i − 1) = 1 − p, and p(i, j) = 0 if |j − i| ≠ 1. More general random walks on the integers also fit into this framework. To check that the random walk is Markov,
\[
P(X_{n+1} = i_{n+1} \mid X_0 = i_0, \ldots, X_n = i_n) = P(X_{n+1} - X_n = i_{n+1} - i_n \mid X_0 = i_0, \ldots, X_n = i_n) = P(X_{n+1} - X_n = i_{n+1} - i_n),
\]
using the independence, while
\[
P(X_{n+1} = i_{n+1} \mid X_n = i_n) = P(X_{n+1} - X_n = i_{n+1} - i_n \mid X_n = i_n) = P(X_{n+1} - X_n = i_{n+1} - i_n).
\]
Random walks on graphs
Suppose we have n points, and from each point there is some probability of going to another point. For example, suppose there are 5 points and we have p(1, 2) = 1/2, p(1, 3) = 1/2, p(2, 1) = 1/4, p(2, 3) = 1/2, p(2, 5) = 1/4, p(3, 1) = 1/4, p(3, 2) = 1/4, p(3, 3) = 1/2, p(4, 1) = 1, p(5, 1) = 1/2, p(5, 5) = 1/2. The p(i, j) are often arranged into a matrix:
\[
P = \begin{pmatrix}
0 & 1/2 & 1/2 & 0 & 0 \\
1/4 & 0 & 1/2 & 0 & 1/4 \\
1/4 & 1/4 & 1/2 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
1/2 & 0 & 0 & 0 & 1/2
\end{pmatrix}.
\]
Note the rows must sum to 1 since
\[
\sum_{j=1}^5 p(i, j) = \sum_{j=1}^5 P(X_1 = j \mid X_0 = i) = P(X_1 \in S \mid X_0 = i) = 1.
\]
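The row-sum property is easy to check numerically for this matrix. The remark about matrix powers below is a standard fact about n-step transition probabilities (Chapman–Kolmogorov), not proved in this section:

```python
import numpy as np

# The 5-state transition matrix from the example above.
P = np.array([
    [0,   1/2, 1/2, 0, 0  ],
    [1/4, 0,   1/2, 0, 1/4],
    [1/4, 1/4, 1/2, 0, 0  ],
    [1,   0,   0,   0, 0  ],
    [1/2, 0,   0,   0, 1/2],
])

print(P.sum(axis=1))   # each row sums to 1

# n-step transition probabilities are entries of the matrix power P^n,
# e.g. P(X_2 = 3 | X_0 = 1) is (P @ P)[0, 2] with 0-based indexing.
print((P @ P)[0, 2])   # p(1,2)p(2,3) + p(1,3)p(3,3) = 1/4 + 1/4 = 0.5
```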
Renewal processes
Let Yi be i.i.d. with P(Yi = k) = ak, where the ak are nonnegative and sum to 1. Let T0 = i0 and Tn = T0 + ∑_{i=1}^n Yi. We think of Yn as the lifetime of the nth light bulb and Tn as the time when the nth light bulb burns out. (We replace a light bulb as soon as it burns out.) Let
\[
X_n = \min\{m - n : m \ge n,\ T_i = m \text{ for some } i\}.
\]
So Xn is the amount of time after time n until the current light bulb burns out.
If Xn = j and j > 0, then Ti = n + j for some i but Ti does not equal n, n + 1, . . . , n + j − 1 for any i. So Ti = (n + 1) + (j − 1) for some i and Ti does not equal (n + 1), (n + 1) + 1, . . . , (n + 1) + (j − 2) for any i. Therefore Xn+1 = j − 1. So p(i, i − 1) = 1 if i ≥ 1.
If $X_n = 0$, then a light bulb burned out at time $n$ and a new one is installed; if the new bulb has lifetime $j$, which happens with probability $a_j$, then $X_{n+1} = j - 1$. So $p(0, j) = a_{j+1}$. All the other $p(i, j)$'s are 0.
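The transition rule just derived ($p(i, i-1) = 1$ for $i \geq 1$ and $p(0, j) = a_{j+1}$) can be written out as a small sketch; the lifetime distribution `a` below is a made-up example, not from the text:

```python
def renewal_step_prob(i, j, a):
    """Transition probability p(i, j) for the renewal chain; a[k] = P(lifetime = k)."""
    if i >= 1:
        return 1.0 if j == i - 1 else 0.0
    # from state 0, a new bulb of lifetime j + 1 leaves residual time j
    return a.get(j + 1, 0.0)

a = {1: 0.5, 2: 0.3, 3: 0.2}   # hypothetical lifetime distribution
assert renewal_step_prob(4, 3, a) == 1.0
assert renewal_step_prob(0, 0, a) == 0.5   # new bulb of lifetime 1
assert renewal_step_prob(0, 2, a) == 0.2   # new bulb of lifetime 3
assert abs(sum(renewal_step_prob(0, j, a) for j in range(5)) - 1.0) < 1e-12
```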
Branching processes
Consider $k$ particles. At the next time interval, some of them die, and some of them split into several particles. The probability that a given particle will split into $j$ particles is $a_j$, $j = 0, 1, \ldots$, where the $a_j$ are nonnegative and sum to 1. The behavior of each particle is independent of the behavior of all the other particles. If $X_n$ is the number of particles at time $n$, then $X_n$ is a Markov chain. Let $Y_i$ be i.i.d. random variables with $P(Y_i = j) = a_j$. The $p(i, j)$ for $X_n$ are somewhat complicated, and can be defined by
\[
p(i, j) = P\Big(\sum_{m=1}^{i} Y_m = j\Big).
\]
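Since $p(i, \cdot)$ is the law of a sum of $i$ i.i.d. offspring counts, it can be computed by convolving the offspring distribution with itself $i$ times; a sketch with a hypothetical offspring law on $\{0, 1, 2\}$:

```python
import numpy as np

def branching_kernel_row(i, a, max_j):
    """p(i, .) = law of Y_1 + ... + Y_i with P(Y = j) = a[j], via repeated convolution."""
    row = np.zeros(max_j + 1)
    row[0] = 1.0                      # the empty sum is 0 with probability 1
    for _ in range(i):
        row = np.convolve(row, a)[: max_j + 1]
    return row

a = np.array([0.25, 0.5, 0.25])       # hypothetical offspring law on {0, 1, 2}
row = branching_kernel_row(2, a, 4)
# two particles: total offspring ranges over 0..4, e.g. P(total = 0) = 0.25^2
assert abs(float(row[0]) - 0.0625) < 1e-12
assert abs(float(row.sum()) - 1.0) < 1e-12
```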
Queues
We will discuss briefly the M/G/1 queue. The M refers to the fact that the customers arrive according to a Poisson process: the probability that the number of customers arriving in a time interval of length $t$ is $k$ is $e^{-\lambda t}(\lambda t)^k/k!$. The G refers to the fact that the length of time it takes to serve a customer is given by a distribution that is not necessarily exponential. The 1 refers to the fact that there is one server.

Suppose the length of time to serve one customer has distribution function $F$ with density $f$. The probability that $k$ customers arrive during the time it takes to serve one customer is
\[
a_k = \int_0^\infty \frac{e^{-\lambda t}(\lambda t)^k}{k!}\, f(t)\, dt.
\]
Let the $Y_i$ be i.i.d. with $P(Y_i = k - 1) = a_k$, so $Y_i + 1$ is the number of customers arriving during the time it takes to serve one customer. Let $X_{n+1} = (X_n + Y_{n+1})^+$ be the number of customers waiting. Then $X_n$ is a Markov chain with $p(0, 0) = a_0 + a_1$ and $p(i, i - 1 + k) = a_k$ if $i \geq 1$, $k \geq 0$.
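For a concrete service distribution the integral defining $a_k$ can be checked numerically. The sketch below assumes exponential service with rate $\mu = 2$ (an illustration, not part of the text), for which the integral has the closed form $\mu \lambda^k/(\lambda + \mu)^{k+1}$:

```python
import math

def a_k(k, lam, f, t_max=50.0, n=100000):
    """a_k = integral_0^inf e^{-lam t} (lam t)^k / k! * f(t) dt, by a crude trapezoid rule."""
    h = t_max / n
    fact = math.factorial(k)
    total = 0.0
    for i in range(n + 1):
        t = i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * math.exp(-lam * t) * (lam * t) ** k / fact * f(t)
    return total * h

lam, mu = 1.0, 2.0
f = lambda t: mu * math.exp(-mu * t)                # assumed exponential service density
for k in range(4):
    exact = mu * lam ** k / (lam + mu) ** (k + 1)  # closed form for this special case
    assert abs(a_k(k, lam, f) - exact) < 1e-4
```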
Ehrenfest urns
Suppose we have two urns with a total of r balls, k in one and r − k inthe other. Pick one of the r balls at random and move it to the other urn.Let Xn be the number of balls in the first urn. Xn is a Markov chain withp(k, k + 1) = (r − k)/r, p(k, k − 1) = k/r, and p(i, j) = 0 otherwise.
One model for this is to consider two containers of air with a thin tubeconnecting them. Suppose a few molecules of a foreign substance are in-troduced. Then the number of molecules in the first container is like anEhrenfest urn. We shall see that all states in this model are recurrent, soinfinitely often all the molecules of the foreign substance will be in the firsturn. Yet there is a tendency towards equilibrium, so on average there will beabout the same number of molecules in each container for all large times.
Birth and death processes
Suppose there are $i$ particles, the probability of a birth is $a_i$, and the probability of a death is $b_i$, where $a_i, b_i \geq 0$ and $a_i + b_i \leq 1$. Setting $X_n$ equal to the number of particles, $X_n$ is a Markov chain with $p(i, i+1) = a_i$, $p(i, i-1) = b_i$, and $p(i, i) = 1 - a_i - b_i$.
11.3 Markov properties
Proposition 11.3 $P_x(X_{n+1} = z \mid X_1, \ldots, X_n) = P_{X_n}(X_1 = z)$, $P_x$-a.s.
Proof. The right hand side is measurable with respect to the $\sigma$-field generated by $X_n$. We therefore need to show that if $A$ is in $\sigma(X_n)$, then
\[
P_x(X_{n+1} = z, A) = E_x[P_{X_n}(X_1 = z); A].
\]
$A$ is of the form $(X_n \in B)$ for some $B \subset S$, so it suffices to show
\[
P_x(X_{n+1} = z, X_n = y) = E_x[P_{X_n}(X_1 = z); X_n = y] \tag{11.6}
\]
and sum over $y \in B$.

Note the right hand side of (11.6) is equal to
\[
E_x[P_y(X_1 = z); X_n = y],
\]
while
\[
P_y(X_1 = z) = \sum_i P_y(X_0 = i, X_1 = z) = \sum_i 1_y(i)\, p(i, z) = p(y, z).
\]
Therefore the right hand side of (11.6) is
\[
p(y, z)\, P_x(X_n = y).
\]
On the other hand, the left hand side of (11.6) equals
\[
P_x(X_{n+1} = z \mid X_n = y)\, P_x(X_n = y) = p(y, z)\, P_x(X_n = y).
\]
Theorem 11.4
\[
P_x(X_{n+1} = i_1, \ldots, X_{n+m} = i_m \mid X_1, \ldots, X_n) = P_{X_n}(X_1 = i_1, \ldots, X_m = i_m).
\]
Proof. Note this is equivalent to
\[
E_x[f_1(X_{n+1}) \cdots f_m(X_{n+m}) \mid X_1, \ldots, X_n] = E_{X_n}[f_1(X_1) \cdots f_m(X_m)].
\]
To go one way, we let $f_k = 1_{i_k}$; to go the other, we multiply the conditional probability result by $f_1(i_1) \cdots f_m(i_m)$ and sum over all possible values of $i_1, \ldots, i_m$.
We’ll use induction. The case m = 1 is the previous proposition. Let ussuppose the result holds for m and prove that it holds for m+ 1. Let
\[
C = (X_{n+1} = i_1, \ldots, X_{n+m-1} = i_{m-1})
\]
and
\[
D = (X_1 = i_1, \ldots, X_{m-1} = i_{m-1}).
\]
Set
\[
\varphi(z) = P_z(X_1 = i_{m+1}).
\]
Let Fn = σ(X1, . . . , Xn). We then have
\[
\begin{aligned}
P_x(X_{n+1} = i_1, \ldots, X_{n+m} = i_m, X_{n+m+1} = i_{m+1} \mid \mathcal{F}_n)
&= E_x[P_x(C, X_{n+m} = i_m, X_{n+m+1} = i_{m+1} \mid \mathcal{F}_{n+m}) \mid \mathcal{F}_n] \\
&= E_x[1_C\, 1_{i_m}(X_{n+m})\, P_x(X_{n+m+1} = i_{m+1} \mid \mathcal{F}_{n+m}) \mid \mathcal{F}_n] \\
&= E_x[1_C\, (\varphi 1_{i_m})(X_{n+m}) \mid \mathcal{F}_n].
\end{aligned}
\]
By the induction hypothesis, this is equal to
\[
E_{X_n}[1_D\, (\varphi 1_{i_m})(X_m)].
\]
For any $y$,
\[
\begin{aligned}
E_y[1_D\, (\varphi 1_{i_m})(X_m)]
&= E_y[1_D\, 1_{i_m}(X_m)\, P_{X_m}(X_1 = i_{m+1})] \\
&= E_y[1_D\, 1_{i_m}(X_m)\, P_y(X_{m+1} = i_{m+1} \mid \mathcal{F}_m)] \\
&= E_y[1_D\, 1_{i_m}(X_m)\, 1_{i_{m+1}}(X_{m+1})].
\end{aligned}
\]
The strong Markov property is the same as the Markov property, but where the fixed time $n$ is replaced by a stopping time $N$.
Theorem 11.5 If $N$ is a finite stopping time, then
\[
P_x(X_{N+1} = i_1 \mid \mathcal{F}_N) = P_{X_N}(X_1 = i_1)
\]
and
\[
P_x(X_{N+1} = i_1, \ldots, X_{N+m} = i_m \mid \mathcal{F}_N) = P_{X_N}(X_1 = i_1, \ldots, X_m = i_m),
\]
both $P_x$-a.s.
Proof. We will show
\[
P_x(X_{N+1} = j \mid \mathcal{F}_N) = P_{X_N}(X_1 = j).
\]
Once we have this, we can proceed as in the proof of the previous theorem to obtain the second result. To show the above equality, we need to show that if $B \in \mathcal{F}_N$, then
\[
P_x(X_{N+1} = j, B) = E_x[P_{X_N}(X_1 = j); B]. \tag{11.7}
\]
Recall that since $B \in \mathcal{F}_N$, then $B \cap (N = k) \in \mathcal{F}_k$. We have
\[
\begin{aligned}
P_x(X_{N+1} = j, B, N = k) &= P_x(X_{k+1} = j, B, N = k) \\
&= E_x[P_x(X_{k+1} = j \mid \mathcal{F}_k); B, N = k] \\
&= E_x[P_{X_k}(X_1 = j); B, N = k] \\
&= E_x[P_{X_N}(X_1 = j); B, N = k].
\end{aligned}
\]
Now sum over k; since N is finite, we obtain our desired result.
We will need the following corollary. We use the notation $T_y = \min\{n > 0 : X_n = y\}$, the first time the Markov chain hits the state $y$.
Corollary 11.6 Let $U$ be a finite stopping time. Let $V = \min\{n > U : X_n = y\}$, the first time after $U$ that the chain hits $y$. Then
\[
P_x(V < \infty \mid \mathcal{F}_U) = P_{X_U}(T_y < \infty).
\]
Proof. We can write
\[
(V < \infty) = \cup_{n=1}^\infty (V = U + n)
= \cup_{n=1}^\infty (X_{U+1} \neq y, \ldots, X_{U+n-1} \neq y, X_{U+n} = y).
\]
By the theorem,
\[
P_x(X_{U+1} \neq y, \ldots, X_{U+n-1} \neq y, X_{U+n} = y \mid \mathcal{F}_U)
= P_{X_U}(X_1 \neq y, \ldots, X_{n-1} \neq y, X_n = y)
= P_{X_U}(T_y = n).
\]
Now sum over $n$.
Another way of expressing the Markov property is through the Chapman-Kolmogorov equations. Let $p_n(i, j) = P(X_n = j \mid X_0 = i)$.
Proposition 11.7 For all $i, j, m, n$ we have
\[
p_{n+m}(i, j) = \sum_{k \in S} p_n(i, k)\, p_m(k, j).
\]
Proof. We write
\[
\begin{aligned}
P(X_{n+m} = j, X_0 = i) &= \sum_k P(X_{n+m} = j, X_n = k, X_0 = i) \\
&= \sum_k P(X_{n+m} = j \mid X_n = k, X_0 = i)\, P(X_n = k \mid X_0 = i)\, P(X_0 = i) \\
&= \sum_k P(X_{n+m} = j \mid X_n = k)\, p_n(i, k)\, P(X_0 = i) \\
&= \sum_k p_m(k, j)\, p_n(i, k)\, P(X_0 = i).
\end{aligned}
\]
If we divide both sides by P(X0 = i), we have our result.
Note the resemblance to matrix multiplication. It is clear that if $P$ is the matrix made up of the $p(i, j)$, then $P^n$ will be the matrix whose $(i, j)$ entry is $p_n(i, j)$.
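The identification of $p_n(i, j)$ with the $(i, j)$ entry of $P^n$ is easy to confirm numerically; a sketch with a small two-state matrix, chosen here purely for illustration:

```python
import numpy as np

# p_n(i, j) is the (i, j) entry of the matrix power P^n
P = np.array([[0.0, 1.0],
              [0.5, 0.5]])
P2 = P @ P
P4 = np.linalg.matrix_power(P, 4)
# Chapman-Kolmogorov with n = m = 2: p_4(i, j) = sum_k p_2(i, k) p_2(k, j)
assert np.allclose(P4, P2 @ P2)
# each P^n is again a transition matrix
assert np.allclose(P4.sum(axis=1), 1.0)
```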
11.4 Recurrence and transience
Let
\[
T_y = \min\{i > 0 : X_i = y\}.
\]
This is the first time that $X_i$ hits the point $y$; even if $X_0 = y$ we would have $T_y > 0$. We let $T_y^k$ be the $k$th time that the Markov chain hits $y$, and we set
\[
r(x, y) = P_x(T_y < \infty),
\]
the probability starting at $x$ that the Markov chain ever hits $y$.
Proposition 11.8 $P_x(T_y^k < \infty) = r(x, y)\, r(y, y)^{k-1}$.
Proof. The case $k = 1$ is just the definition, so suppose $k > 1$. Let $U = T_y^{k-1}$ and let $V$ be the first time the chain hits $y$ after $U$. Using the corollary to the strong Markov property,
\[
\begin{aligned}
P_x(T_y^k < \infty) &= P_x(V < \infty, T_y^{k-1} < \infty) \\
&= E_x[P_x(V < \infty \mid \mathcal{F}_{T_y^{k-1}}); T_y^{k-1} < \infty] \\
&= E_x[P_{X(T_y^{k-1})}(T_y < \infty); T_y^{k-1} < \infty] \\
&= E_x[P_y(T_y < \infty); T_y^{k-1} < \infty] \\
&= r(y, y)\, P_x(T_y^{k-1} < \infty).
\end{aligned}
\]
We used here the fact that at time $T_y^{k-1}$ the Markov chain must be at the point $y$. Repeating this argument $k - 2$ times yields the result.
We say that $y$ is recurrent if $r(y, y) = 1$; otherwise we say $y$ is transient. Let
\[
N(y) = \sum_{n=1}^\infty 1_{(X_n = y)}.
\]
Proposition 11.9 $y$ is recurrent if and only if $E_y N(y) = \infty$.

Proof. Note
\[
E_y N(y) = \sum_{k=1}^\infty P_y(N(y) \geq k) = \sum_{k=1}^\infty P_y(T_y^k < \infty) = \sum_{k=1}^\infty r(y, y)^k.
\]
We used the fact that $N(y)$ is the number of visits to $y$, and the number of visits being at least $k$ is the same as the time of the $k$th visit being finite. Since $r(y, y) \leq 1$, the left hand side will be finite if and only if $r(y, y) < 1$.
Observe that
\[
E_y N(y) = \sum_n P_y(X_n = y) = \sum_n p_n(y, y).
\]
If we consider simple symmetric random walk on the integers, then $p_n(0, 0)$ is 0 if $n$ is odd and equal to $\binom{n}{n/2} 2^{-n}$ if $n$ is even. This is because in order to be at 0 after $n$ steps, the walk must have had $n/2$ positive steps and $n/2$ negative steps; the probability of this is given by the binomial distribution. Using Stirling's approximation, we see that $p_n(0, 0) \sim c/\sqrt{n}$ for $n$ even. Hence $\sum_n p_n(0, 0)$ diverges, and so simple random walk in one dimension is recurrent.
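The formula $p_n(0,0) = \binom{n}{n/2}2^{-n}$ and the Stirling asymptotic can be checked numerically; the constant $\sqrt{2/\pi}$ below is the standard value of $c$, not stated in the text, and the computation is done in log space to avoid overflow:

```python
import math

def p_return(n):
    """p_n(0,0) = C(n, n/2) 2^{-n} for even n, 0 for odd n; computed in log space."""
    if n % 2 == 1:
        return 0.0
    log_p = math.lgamma(n + 1) - 2 * math.lgamma(n // 2 + 1) - n * math.log(2)
    return math.exp(log_p)

assert p_return(7) == 0.0
assert abs(p_return(2) - 0.5) < 1e-9
# Stirling gives p_n(0,0) ~ sqrt(2/(pi n)) for even n, so sum_n p_n(0,0) diverges
n = 10000
assert abs(p_return(n) / math.sqrt(2 / (math.pi * n)) - 1.0) < 1e-3
```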
Similar arguments show that simple symmetric random walk is also recurrent in 2 dimensions but transient in 3 or more dimensions.
Proposition 11.10 If $x$ is recurrent and $r(x, y) > 0$, then $y$ is recurrent and $r(y, x) = 1$.
Proof. First we show $r(y, x) = 1$. Suppose not. Since $r(x, y) > 0$, there is a smallest $n$ and $y_1, \ldots, y_{n-1}$ such that $p(x, y_1)\, p(y_1, y_2) \cdots p(y_{n-1}, y) > 0$. Since this is the smallest $n$, none of the $y_i$ can equal $x$. Then
\[
P_x(T_x = \infty) \geq p(x, y_1) \cdots p(y_{n-1}, y)\,(1 - r(y, x)) > 0,
\]
a contradiction to x being recurrent.
Next we show that $y$ is recurrent. Since $r(y, x) > 0$, there exists $L$ such that $p_L(y, x) > 0$, and since $r(x, y) > 0$, there exists $K$ such that $p_K(x, y) > 0$. Then
\[
p_{L+n+K}(y, y) \geq p_L(y, x)\, p_n(x, x)\, p_K(x, y).
\]
Summing over $n$,
\[
\sum_n p_{L+n+K}(y, y) \geq p_L(y, x)\, p_K(x, y) \sum_n p_n(x, x) = \infty.
\]
We say a subset $C$ of $S$ is closed if $x \in C$ and $r(x, y) > 0$ implies $y \in C$. A subset $D$ is irreducible if $x, y \in D$ implies $r(x, y) > 0$.
Proposition 11.11 Let $C$ be finite and closed. Then $C$ contains a recurrent state.
Combined with the preceding proposition: if $C$ is finite, closed, and irreducible, then all states in $C$ are recurrent.
Proof. If not, then for all $y \in C$ we have $r(y, y) < 1$ and
\[
E_x N(y) = \sum_{k=1}^\infty P_x(N(y) \geq k) = \sum_{k=1}^\infty P_x(T_y^k < \infty)
= \sum_{k=1}^\infty r(x, y)\, r(y, y)^{k-1} = \frac{r(x, y)}{1 - r(y, y)} < \infty.
\]
Since $C$ is finite, then $\sum_y E_x N(y) < \infty$. But that is a contradiction, since
\[
\sum_y E_x N(y) = \sum_y \sum_n p_n(x, y) = \sum_n \sum_y p_n(x, y)
= \sum_n P_x(X_n \in C) = \sum_n 1 = \infty.
\]
Theorem 11.12 Let $R = \{x : r(x, x) = 1\}$, the set of recurrent states. Then $R = \cup_{i=1}^\infty R_i$, where each $R_i$ is closed and irreducible.
Proof. Say $x \sim y$ if $r(x, y) > 0$. Since every state in $R$ is recurrent, $x \sim x$, and if $x \sim y$, then $y \sim x$ by Proposition 11.10. If $x \sim y$ and $y \sim z$, then $p_n(x, y) > 0$ and $p_m(y, z) > 0$ for some $n$ and $m$; then $p_{n+m}(x, z) > 0$, so $x \sim z$. Therefore we have an equivalence relation, and we let the $R_i$ be the equivalence classes.
Looking at our examples, it is easy to see that in the Ehrenfest urn model all states are recurrent. For the branching process model, suppose $p(x, 0) > 0$ for all $x$. Then 0 is recurrent and all the other states are transient. In the renewal chain there are two cases: if $\{k : a_k > 0\}$ is unbounded, all states are recurrent; if $K = \max\{k : a_k > 0\} < \infty$, then $0, 1, \ldots, K - 1$ are recurrent states and the rest are transient.
For the queueing model, let $\mu = \sum_k k a_k$, the expected number of people arriving during one customer's service time. We may view this as a branching process by letting all the customers arriving during one person's service time be considered the progeny of that customer. It turns out that if $\mu \leq 1$, then 0 is recurrent and all other states are also. If $\mu > 1$, all states are transient.
11.5 Stationary measures
A probability $\mu$ is a stationary distribution if
\[
\sum_x \mu(x)\, p(x, y) = \mu(y). \tag{11.8}
\]
In matrix notation this is $\mu P = \mu$; that is, $\mu$ is a left eigenvector corresponding to the eigenvalue 1. If $X_0$ has a stationary distribution $\mu$, then $P_\mu(X_1 = y) = \mu(y)$, which implies that $X_1, X_2, \ldots$ all have the same distribution. We can use (11.8) when $\mu$ is a measure rather than a probability, in which case it is called a stationary measure. Note
\[
\mu P^n = (\mu P) P^{n-1} = \mu P^{n-1} = \cdots = \mu.
\]
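A stationary distribution can be found numerically as the limit of $\mu P^n$; this is valid when the chain is irreducible and aperiodic, as in the illustrative two-state example below (our choice, not from the text):

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [0.5, 0.5]])     # illustrative chain; irreducible and aperiodic
mu = np.array([1.0, 0.0])
for _ in range(200):           # mu P^n converges here; the limit satisfies mu P = mu
    mu = mu @ P
assert np.allclose(mu @ P, mu)
assert np.allclose(mu, [1/3, 2/3])
```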
If we have a random walk on the integers, $\mu(x) = 1$ for all $x$ serves as a stationary measure. In the case of an asymmetric random walk, $p(i, i+1) = p$, $p(i, i-1) = q = 1 - p$ with $p \neq q$, setting $\mu(x) = (p/q)^x$ also works. To check this, note
\[
\mu P(x) = \sum_i \mu(i)\, p(i, x) = \mu(x-1)\, p(x-1, x) + \mu(x+1)\, p(x+1, x)
= \Big(\frac{p}{q}\Big)^{x-1} p + \Big(\frac{p}{q}\Big)^{x+1} q
= \frac{p^x}{q^x}\, q + \frac{p^x}{q^x}\, p = \Big(\frac{p}{q}\Big)^x.
\]
In the Ehrenfest urn model, $\mu(x) = 2^{-r} \binom{r}{x}$ works. One way to see this is that $\mu$ is the distribution one gets if one flips $r$ coins and puts a coin in the first urn when the coin is heads. A transition corresponds to picking a coin at random and turning it over.
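The claim that the binomial distribution is stationary for the Ehrenfest chain is easy to verify numerically; a sketch with $r = 6$, an arbitrary choice for the check:

```python
import math
import numpy as np

r = 6                                 # arbitrary number of balls for the check
P = np.zeros((r + 1, r + 1))
for k in range(r + 1):
    if k < r:
        P[k, k + 1] = (r - k) / r     # a ball from the second urn is chosen
    if k > 0:
        P[k, k - 1] = k / r           # a ball from the first urn is chosen
mu = np.array([math.comb(r, x) * 2.0 ** (-r) for x in range(r + 1)])
assert np.allclose(mu @ P, mu)        # binomial(r, 1/2) is stationary
assert abs(float(mu.sum()) - 1.0) < 1e-12
```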
Proposition 11.13 Let $a$ be recurrent and let $T = T_a$. Set
\[
\mu(y) = E_a \sum_{n=0}^{T-1} 1_{(X_n = y)}.
\]
Then $\mu$ is a stationary measure.
The idea of the proof is that $\mu(y)$ is the expected number of visits to $y$ by the sequence $X_0, \ldots, X_{T-1}$, while $\mu P$ is the expected number of visits to $y$ by $X_1, \ldots, X_T$. These should be the same because $X_T = X_0 = a$.
Proof. For this proof only, let $p_n(a, y) = P_a(X_n = y, T > n)$. So
\[
\mu(y) = \sum_{n=0}^\infty P_a(X_n = y, T > n) = \sum_{n=0}^\infty p_n(a, y)
\]
and
\[
\sum_y \mu(y)\, p(y, z) = \sum_y \sum_{n=0}^\infty p_n(a, y)\, p(y, z).
\]
First we consider the case $z \neq a$. Then
\[
\sum_y p_n(a, y)\, p(y, z)
= \sum_y P_a(\text{hit } y \text{ in } n \text{ steps without first returning to } a, \text{ then go to } z \text{ in one step})
= p_{n+1}(a, z).
\]
So
\[
\sum_y \mu(y)\, p(y, z) = \sum_n \sum_y p_n(a, y)\, p(y, z)
= \sum_{n=0}^\infty p_{n+1}(a, z) = \sum_{n=0}^\infty p_n(a, z) = \mu(z),
\]
since $p_0(a, z) = 0$.
Second we consider the case $z = a$. Then
\[
\sum_y p_n(a, y)\, p(y, a)
= \sum_y P_a(\text{hit } y \text{ in } n \text{ steps without first returning to } a, \text{ then go to } a \text{ in one step})
= P_a(T = n + 1).
\]
Recall $P_a(T = 0) = 0$, and since $a$ is recurrent, $T < \infty$ a.s. So
\[
\sum_y \mu(y)\, p(y, a) = \sum_n \sum_y p_n(a, y)\, p(y, a)
= \sum_{n=0}^\infty P_a(T = n + 1) = \sum_{n=0}^\infty P_a(T = n) = 1.
\]
On the other hand,
\[
\sum_{n=0}^{T-1} 1_{(X_n = a)} = 1_{(X_0 = a)} = 1,
\]
hence $\mu(a) = 1$. Therefore, whether $z \neq a$ or $z = a$, we have $\mu P(z) = \mu(z)$.
Finally, we show $\mu(y) < \infty$. If $r(a, y) = 0$, then $\mu(y) = 0$. If $r(a, y) > 0$, then since $a$ is recurrent, Proposition 11.10 gives $r(y, a) = 1$, so $P_y(X_n = a) > 0$ for some $n$. Since $\mu$ is stationary,
\[
1 = \mu(a) = \sum_z \mu(z)\, P_z(X_n = a) \geq \mu(y)\, P_y(X_n = a),
\]
which implies $\mu(y) < \infty$.
We next turn to uniqueness of the stationary measure. We give the stationary measure constructed in Proposition 11.13 the name $\mu_a$. We showed $\mu_a(a) = 1$.
Proposition 11.14 If the Markov chain is irreducible and all states are re-current, then the stationary measure is unique up to a constant multiple.
Proof. Fix $a \in S$. Let $\mu_a$ be the stationary measure constructed above and let $\nu$ be any other stationary measure.
Since $\nu = \nu P$,
\[
\begin{aligned}
\nu(z) &= \nu(a)\, p(a, z) + \sum_{y \neq a} \nu(y)\, p(y, z) \\
&= \nu(a)\, p(a, z) + \sum_{y \neq a} \nu(a)\, p(a, y)\, p(y, z) + \sum_{x \neq a} \sum_{y \neq a} \nu(x)\, p(x, y)\, p(y, z) \\
&= \nu(a)\, P_a(X_1 = z) + \nu(a)\, P_a(X_1 \neq a, X_2 = z) + P_\nu(X_0 \neq a, X_1 \neq a, X_2 = z).
\end{aligned}
\]
Continuing,
\[
\begin{aligned}
\nu(z) &= \nu(a) \sum_{m=1}^n P_a(X_1 \neq a, X_2 \neq a, \ldots, X_{m-1} \neq a, X_m = z)
+ P_\nu(X_0 \neq a, X_1 \neq a, \ldots, X_{n-1} \neq a, X_n = z) \\
&\geq \nu(a) \sum_{m=1}^n P_a(X_1 \neq a, X_2 \neq a, \ldots, X_{m-1} \neq a, X_m = z).
\end{aligned}
\]
Letting $n \to \infty$, we obtain
\[
\nu(z) \geq \nu(a)\, \mu_a(z).
\]
We have
\[
\nu(a) = \sum_x \nu(x)\, p_n(x, a) \geq \nu(a) \sum_x \mu_a(x)\, p_n(x, a) = \nu(a)\, \mu_a(a) = \nu(a),
\]
since $\mu_a(a) = 1$ (see the proof of Proposition 11.13). This means that we have equality, and so for each $n$ and $x$, either $p_n(x, a) = 0$ or
\[
\nu(x) = \nu(a)\, \mu_a(x).
\]
Since $r(x, a) > 0$, then $p_n(x, a) > 0$ for some $n$. Consequently
\[
\frac{\nu(x)}{\nu(a)} = \mu_a(x).
\]
Proposition 11.15 If a stationary distribution $\mu$ exists, then $\mu(y) > 0$ implies $y$ is recurrent.
Proof. If $\mu(y) > 0$, then
\[
\begin{aligned}
\infty = \sum_{n=1}^\infty \mu(y) &= \sum_{n=1}^\infty \sum_x \mu(x)\, p_n(x, y) = \sum_x \mu(x) \sum_{n=1}^\infty p_n(x, y) \\
&= \sum_x \mu(x) \sum_{n=1}^\infty P_x(X_n = y) = \sum_x \mu(x)\, E_x N(y) \\
&= \sum_x \mu(x)\, r(x, y)\, [1 + r(y, y) + r(y, y)^2 + \cdots].
\end{aligned}
\]
Since $r(x, y) \leq 1$ and $\mu$ is a probability measure, this is at most
\[
\sum_x \mu(x)\, (1 + r(y, y) + \cdots) \leq 1 + r(y, y) + \cdots.
\]
Hence r(y, y) must equal 1.
Recall that Tx is the first time to hit x.
Proposition 11.16 If the Markov chain is irreducible and has stationary distribution $\mu$, then
\[
\mu(x) = \frac{1}{E_x T_x}.
\]
Proof. We have $\mu(x) > 0$ for some $x$. If $y \in S$, then $r(x, y) > 0$ and so $p_n(x, y) > 0$ for some $n$. Hence
\[
\mu(y) = \sum_z \mu(z)\, p_n(z, y) \geq \mu(x)\, p_n(x, y) > 0.
\]
Hence by Proposition 11.15, all states are recurrent. By the uniqueness of the stationary measure, $\mu_x$ is a constant multiple of $\mu$, i.e., $\mu_x = c\mu$. Recall
\[
\mu_x(y) = \sum_{n=0}^\infty P_x(X_n = y, T_x > n),
\]
and so
\[
\sum_y \mu_x(y) = \sum_y \sum_{n=0}^\infty P_x(X_n = y, T_x > n)
= \sum_n \sum_y P_x(X_n = y, T_x > n) = \sum_n P_x(T_x > n) = E_x T_x.
\]
Thus $c = E_x T_x$. Recalling that $\mu_x(x) = 1$,
\[
\mu(x) = \frac{\mu_x(x)}{c} = \frac{1}{E_x T_x}.
\]
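The identity $\mu(x) = 1/E_x T_x$ can be checked by Monte Carlo on a small chain; for the illustrative two-state chain below, $\pi(0) = 1/3$, so $E_0 T_0 = 3$. The helper name and parameters are ours, not from the text:

```python
import random

def mean_return_time(rows, x, n_trials, rng):
    """Monte Carlo estimate of E_x T_x for a finite chain with transition rows `rows`."""
    total = 0
    for _ in range(n_trials):
        state, steps = x, 0
        while True:
            state = rng.choices(range(len(rows)), weights=rows[state])[0]
            steps += 1
            if state == x:
                break
        total += steps
    return total / n_trials

rows = [[0.0, 1.0], [0.5, 0.5]]       # illustrative chain with pi = (1/3, 2/3)
rng = random.Random(42)
est = mean_return_time(rows, 0, 20000, rng)
assert abs(est - 3.0) < 0.1           # E_0 T_0 = 1 / pi(0) = 3
```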
We make the following distinction for recurrent states. If $E_x T_x < \infty$, then $x$ is said to be positive recurrent. If $x$ is recurrent but $E_x T_x = \infty$, then $x$ is null recurrent.
An example of null recurrent states is the simple random walk on the integers. If we let $g(y) = E_y T_x$, the Markov property tells us that
\[
g(y) = 1 + \tfrac12 g(y-1) + \tfrac12 g(y+1).
\]
Some algebra translates this to
\[
g(y) - g(y-1) = 2 + g(y+1) - g(y).
\]
If $d(y) = g(y) - g(y-1)$, we have $d(y+1) = d(y) - 2$. If $d(y_0)$ is finite for some $y_0$, then $d(y)$ will be less than $-1$ for all $y$ larger than some $y_1$, which implies that $g(y)$ will be negative for sufficiently large $y$, a contradiction. We conclude that $g$ is identically infinite.
Proposition 11.17 Suppose a chain is irreducible.
(a) If there exists a positive recurrent state, then there is a stationary distribution.
(b) If there is a stationary distribution, all states are positive recurrent.
(c) If there exists a transient state, all states are transient.
(d) If there exists a null recurrent state, all states are null recurrent.
Proof. To show (a), suppose $x$ is positive recurrent. We have seen that
\[
\mu_x(y) = E_x \sum_{n=0}^{T_x - 1} 1_{(X_n = y)}
\]
is a stationary measure. Then
\[
\mu_x(S) = \sum_y \mu_x(y) = E_x \sum_{n=0}^{T_x - 1} 1 = E_x T_x < \infty.
\]
Therefore $\mu(y) = \mu_x(y)/E_x T_x$ will be a stationary distribution. From the definition of $\mu_x$ we have $\mu_x(x) = 1$, hence $\mu(x) > 0$.
For (b), suppose $\mu(x) > 0$ for some $x$. If $y$ is another state, choose $n$ so that $p_n(x, y) > 0$; then from
\[
\mu(y) = \sum_z \mu(z)\, p_n(z, y) \geq \mu(x)\, p_n(x, y)
\]
we conclude that $\mu(y) > 0$. Then $0 < \mu(y) = 1/E_y T_y$, which implies $E_y T_y < \infty$.
We showed that if $x$ is recurrent and $r(x, y) > 0$, then $y$ is recurrent. So (c) follows.
Suppose there exists a null recurrent state. If there exists a positive recurrent or transient state as well, then by (a) and (b) or by (c) all states are positive recurrent or transient, a contradiction, and (d) follows.
11.6 Convergence
Our goal is to show that under certain conditions $p_n(x, y) \to \pi(y)$, where $\pi$ is the stationary distribution. (In the null recurrent case $p_n(x, y) \to 0$.)
Consider a random walk on the set $\{0, 1\}$, where with probability one on each step the chain moves to the other state. Then $p_n(x, y) = 0$ if $x \neq y$ and $n$ is even. A less trivial case is the simple random walk on the integers. We need to eliminate this periodicity.
Suppose $x$ is recurrent, let $I_x = \{n \geq 1 : p_n(x, x) > 0\}$, and let $d_x$ be the g.c.d. (greatest common divisor) of $I_x$; $d_x$ is called the period of $x$.
Proposition 11.18 If r(x, y) > 0, then dy = dx.
Proof. Since $x$ is recurrent, $r(y, x) > 0$. Choose $K$ and $L$ such that $p_K(x, y) > 0$ and $p_L(y, x) > 0$. Then
\[
p_{K+L+n}(y, y) \geq p_L(y, x)\, p_n(x, x)\, p_K(x, y),
\]
so taking $n = 0$, we have $p_{K+L}(y, y) > 0$, so $d_y$ divides $K + L$. Hence $d_y$ divides $K + L + n$ whenever $p_n(x, x) > 0$, and therefore $d_y$ divides every element of $I_x$. Hence $d_y$ divides $d_x$. By symmetry, $d_x$ divides $d_y$.
Proposition 11.19 If $d_x = 1$, there exists $m_0$ such that $p_m(x, x) > 0$ whenever $m \geq m_0$.
Proof. First of all, $I_x$ is closed under addition: if $m, n \in I_x$,
\[
p_{m+n}(x, x) \geq p_m(x, x)\, p_n(x, x) > 0.
\]
Secondly, if there exists $N$ such that $N, N+1 \in I_x$, let $m_0 = N^2$. If $m \geq m_0$, then $m - N^2 = kN + r$ for some $0 \leq r < N$, and
\[
m = r + N^2 + kN = r(N + 1) + (N - r + k)N \in I_x.
\]
Third, pick $n_0 \in I_x$ and $k > 0$ such that $n_0 + k \in I_x$. If $k = 1$, we are done. Since $d_x = 1$, there exists $n_1 \in I_x$ such that $k$ does not divide $n_1$.
We have $n_1 = mk + r$ for some $0 < r < k$. Note $(m+1)(n_0 + k) \in I_x$ and $(m+1)n_0 + n_1 \in I_x$. The difference between these two numbers is $(m+1)k - n_1 = k - r < k$. So now we have two numbers in $I_x$ differing by at most $k - 1$. Repeating at most $k$ times, we get two numbers in $I_x$ differing by 1, and we are done.
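The additive-closure argument above is the classical coin problem: once the g.c.d. of a generating set is 1, all sufficiently large integers are sums of the generators. A sketch with generators 5 and 7 (a made-up example; the bound $(5-1)(7-1) = 24$ is the standard Frobenius-number fact, not stated in the text):

```python
import math

def sums_of(generators, limit):
    """All nonzero sums of the generators up to `limit` (truncated additive closure of I_x)."""
    reachable = {0}
    frontier = {0}
    while frontier:
        nxt = {s + g for s in frontier for g in generators if s + g <= limit}
        frontier = nxt - reachable
        reachable |= nxt
    return reachable - {0}

gens = [5, 7]                       # g.c.d. 1, mimicking an aperiodic I_x
assert math.gcd(5, 7) == 1
S = sums_of(gens, 60)
assert 23 not in S                  # the largest non-representable integer
assert all(m in S for m in range(24, 61))
```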
We write d for dx. A chain is aperiodic if d = 1.
If $d > 1$, we say $x \sim y$ if $p_{kd}(x, y) > 0$ for some $k > 0$. We divide $S$ into equivalence classes $S_1, \ldots, S_d$. Every $d$ steps the chain started in $S_i$ is back in $S_i$, so we look at $p' = p_d$ on each $S_i$.
Theorem 11.20 Suppose the chain is irreducible, aperiodic, and has a stationary distribution $\pi$. Then $p_n(x, y) \to \pi(y)$ as $n \to \infty$.
Proof. The idea is to take two copies of the chain with different starting distributions, let them run independently until they couple, i.e., hit each other, and then have them move together. So define
\[
q((x_1, y_1), (x_2, y_2)) =
\begin{cases}
p(x_1, x_2)\, p(y_1, y_2) & \text{if } x_1 \neq y_1, \\
p(x_1, x_2) & \text{if } x_1 = y_1,\ x_2 = y_2, \\
0 & \text{otherwise.}
\end{cases}
\]
Let $Z_n = (X_n, Y_n)$ and $T = \min\{i : X_i = Y_i\}$. We have
\[
P(X_n = y) = P(X_n = y, T \leq n) + P(X_n = y, T > n)
= P(Y_n = y, T \leq n) + P(X_n = y, T > n),
\]
while
\[
P(Y_n = y) = P(Y_n = y, T \leq n) + P(Y_n = y, T > n).
\]
Subtracting,
\[
P(X_n = y) - P(Y_n = y) \leq P(X_n = y, T > n) - P(Y_n = y, T > n)
\leq P(X_n = y, T > n) \leq P(T > n).
\]
Using symmetry,
\[
|P(X_n = y) - P(Y_n = y)| \leq P(T > n).
\]
Suppose we let $Y_0$ have distribution $\pi$ and $X_0 = x$. Then
\[
|p_n(x, y) - \pi(y)| \leq P(T > n).
\]
It remains to show $P(T > n) \to 0$. To do this, consider another chain $W_n = (X_n, Y_n)$, where now we take $X_n$ and $Y_n$ independent. Define
\[
r((x_1, y_1), (x_2, y_2)) = p(x_1, x_2)\, p(y_1, y_2).
\]
The chain under the transition probabilities $r$ is irreducible. To see this, there exist $K$ and $L$ such that $p_K(x_1, x_2) > 0$ and $p_L(y_1, y_2) > 0$. By aperiodicity, if $M$ is large, $p_{L+M}(x_2, x_2) > 0$ and $p_{K+M}(y_2, y_2) > 0$. So $p_{K+L+M}(x_1, x_2) > 0$ and $p_{K+L+M}(y_1, y_2) > 0$, and hence $r_{K+L+M}((x_1, y_1), (x_2, y_2)) > 0$.
It is easy to check that $\pi'(a, b) = \pi(a)\pi(b)$ is a stationary distribution for $W$. Hence $W_n$ is recurrent, so it will hit $(x, x)$, and hence the time for $W$ to hit the diagonal $\{(y, y) : y \in S\}$ is finite. However, the time for $W$ to hit the diagonal has the same distribution as $T$, so $P(T > n) \to 0$.
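The conclusion $p_n(x, y) \to \pi(y)$ can also be observed numerically: every row of $P^n$ approaches $\pi$. A sketch with the illustrative two-state chain, for which $\pi = (1/3, 2/3)$ solves $\pi P = \pi$:

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [0.5, 0.5]])           # irreducible and aperiodic
pi = np.array([1/3, 2/3])            # solves pi P = pi
Pn = np.linalg.matrix_power(P, 60)
# every row of P^n approaches pi, whatever the starting state x
assert np.allclose(Pn, np.vstack([pi, pi]), atol=1e-12)
```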