Draft
Advanced Probability Theory (Fall 2017)
J.P.Kim
Dept. of Statistics
Last modified on November 28, 2017
Preface & Disclaimer
This note is a summary of the lecture Advanced Probability Theory (326.729A) held at Seoul
National University, Fall 2017. The lecturer was Minwoo Chae, and the note was written by
J.P.Kim, a Ph.D. student. The textbooks and references for this course are the following.
• Weak Convergence and Empirical Processes with Applications to Statistics, Van der Vaart
& Wellner, Springer, 1996.
• Asymptotic Statistics, Van der Vaart, Cambridge University Press, 1998.
I also referred to the following books while writing this note. The list will be updated continuously.
• Convergence of probability measures, Billingsley, John Wiley & Sons, 2013.
• Lecture notes on Topics in Mathematics I (3341.445) held by Gerald Trutnau (spring
2015).
Finally, some examples and motivation are supplemented based on the lecture notes (summarized
by myself) of
• Probability Theory I (326.513) on spring 2016;
• Theory of Statistics II (326.522) on fall 2016,
most of which are available at https://jpkimstat.wordpress.com/notes-and-slides.
If you find typos or mistakes, please contact: [email protected]
Chapter 1
Stochastic Convergence
1.1 Motivation
Recall some basic results in asymptotics.
Theorem 1.1.1 (SLLN). Let $X_1, X_2, \cdots$ be i.i.d. random variables with $E|X_1| < \infty$. Then
\[ \frac{1}{n} \sum_{i=1}^n X_i \xrightarrow[n\to\infty]{\text{P-a.s.}} EX_1. \]
Theorem 1.1.2 (CLT). Let $X_1, X_2, \cdots$ be i.i.d. random variables with $E|X_1|^2 < \infty$. Then
\[ \frac{1}{\sqrt{n}} \sum_{i=1}^n (X_i - \mu) \xrightarrow[n\to\infty]{d} N(0, \sigma^2), \]
where $\mu = EX_1$ and $\sigma^2 = EX_1^2 - \mu^2$.
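Both results are easy to see by simulation. The following Python sketch is my own illustration (not part of the lecture); the Exponential(1) distribution and all sample sizes are arbitrary choices, made so that $\mu = \sigma^2 = 1$.

```python
import random
import statistics

random.seed(0)

# SLLN: sample means of i.i.d. Exponential(1) variables approach EX_1 = 1.
def sample_mean(n):
    return sum(random.expovariate(1.0) for _ in range(n)) / n

for n in (10, 1000, 100000):
    print(n, sample_mean(n))  # drifts toward 1 as n grows

# CLT: sqrt(n) * (sample mean - mu) over many replications looks like
# N(0, sigma^2) = N(0, 1), since sigma^2 = 1 for Exponential(1).
n, reps = 500, 2000
zs = [(sample_mean(n) - 1.0) * n ** 0.5 for _ in range(reps)]
print(round(statistics.mean(zs), 2), round(statistics.stdev(zs), 2))  # near 0 and 1
```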
From now on we will use the following notation. Let
• $(\Omega, \mathcal{A}, P)$ (or $(\Omega_i, \mathcal{A}_i, P_i)$) be the underlying probability space (or a sequence of them);
• $(\mathbb{D}, d)$ be a metric space;
• $\mathcal{D} = \mathcal{B}(\mathbb{D})$ be the Borel σ-algebra of $\mathbb{D}$;
• $C_b(\mathbb{D})$ be the set of all bounded continuous real functions on $\mathbb{D}$;
• $X$ ($X_n$, resp.) be a map from $\Omega$ ($\Omega_i$, resp.) to $\mathbb{D}$ (not necessarily measurable).
Remark 1.1.3. Note that the LLN and CLT hold for the $f(X_i)$'s, i.e.,
\[ \frac{1}{n}\sum_{i=1}^n f(X_i) \xrightarrow[n\to\infty]{\text{P-a.s.}} E f(X_1) \]
and
\[ \frac{1}{\sqrt{n}}\sum_{i=1}^n \big( f(X_i) - E f(X_i) \big) \xrightarrow[n\to\infty]{d} N(0, \sigma_f^2) \]
hold for $\sigma_f^2 = \operatorname{var} f(X_1)$, provided that $E[f(X_1)^2] < \infty$. Our questions in this course are:

• For a class $\mathcal{F}$ of real functions, do the LLN and CLT hold "uniformly" in "some" sense? For example, does
\[ \sup_{f\in\mathcal{F}} \bigg| \frac{1}{n}\sum_{i=1}^n f(X_i) - E f(X_1) \bigg| \xrightarrow[n\to\infty]{} 0 \]
hold P-a.s. or in probability?
• For finite $f_1, \cdots, f_k$, the vector
\[ \bigg( \frac{1}{\sqrt{n}}\sum_{i=1}^n \big( f_1(X_i) - E f_1(X_1) \big), \cdots, \frac{1}{\sqrt{n}}\sum_{i=1}^n \big( f_k(X_i) - E f_k(X_1) \big) \bigg)^\top \]
converges weakly to a multivariate normal. How can the convergence of the "infinite dimensional joint net"
\[ \bigg( \frac{1}{\sqrt{n}}\sum_{i=1}^n \big( f(X_i) - E f(X_1) \big) \bigg)_{f\in\mathcal{F}} \tag{1.1} \]
be defined?
For this, we introduce a more general notion of weak convergence here.
Definition 1.1.4. Let $P_n$, $P$ be Borel probability measures on $(\mathbb{D}, \mathcal{D}, d)$. Then

(i) $P_n$ converges weakly to $P$, denoted $P_n \xrightarrow[n\to\infty]{w} P$, iff
\[ \int_{\mathbb{D}} f \, dP_n \xrightarrow[n\to\infty]{} \int_{\mathbb{D}} f \, dP \quad \forall f \in C_b(\mathbb{D}). \]

(ii) If $X_n$ and $X$ are $\mathbb{D}$-valued random variables with laws $P_n$ and $P$ respectively, then $X_n$ converges weakly to $X$, denoted $X_n \xrightarrow[n\to\infty]{w} X$, iff $P_n \xrightarrow[n\to\infty]{w} P$.
For the weak convergence of (1.1), we may use Definition 1.1.4. For this, (1.1) should be embedded
into a metric space.
Example 1.1.5. Let $(\Omega_n, \mathcal{A}_n, P_n) = ([0,1], \mathcal{B}, \lambda)$, and
\[ \mathcal{F} = \{ 1_{[0,t]}(\cdot) : 0 \le t \le 1 \} \subseteq D[0,1], \]
where $\mathcal{B} = \mathcal{B}([0,1])$ is the Borel σ-algebra on $[0,1]$ and $\lambda$ denotes the Lebesgue measure. Then (1.1)
can be viewed as a $D[0,1]$-valued random variable. A natural metric on $D[0,1]$ is the uniform metric
defined as
\[ d(f_1, f_2) = \sup_{t\in[0,1]} |f_1(t) - f_2(t)| \quad \forall f_1, f_2 \in D[0,1]. \]
However, under this metric, $D[0,1]$ is not separable, which makes the space too large to work with.
Furthermore, under this metric, (1.1) may not even be measurable.
Proposition 1.1.6. The map $X : [0,1] \to D[0,1]$ defined as
\[ X(\omega) = 1_{[\omega,1]} \]
is NOT Borel measurable with respect to the uniform metric.

Proof.
[Figure 1.1: Proof of Proposition 1.1.6.]
Let $B_s$ be the open ball of radius $1/2$ in $D[0,1]$ centered at $1_{[s,1]}$. Then $G = \bigcup_{s\in S} B_s$ is an open set in $D[0,1]$ for any $S \subseteq [0,1]$. However, note that $X(\omega) \in B_s$ if and only if $\omega = s$, and hence
\[ X^{-1}(G) = (X \in G) = S \]
holds. If $X$ were Borel measurable, then every subset $S$ of $[0,1]$ would also be Borel measurable, which yields a contradiction. □
To handle this issue, we may consider alternative approaches such as:

• Consider a weaker σ-algebra, such as the ball σ-algebra, i.e., the σ-algebra generated by all open balls. If the space is separable, then the ball σ-algebra coincides with the Borel σ-algebra. Note that with a smaller σ-algebra, the measurability condition becomes weaker.

• Consider a weaker metric. This is one typical approach for dealing with empirical processes, using Skorokhod's metric. Under the Skorokhod metric, $D[0,1]$ becomes separable, and it is well known that there exists a metric, equivalent to the Skorokhod metric, under which $D[0,1]$ is also complete (Billingsley).

• Drop the measurability requirement, that is, extend the notion of weak convergence to non-measurable maps. We shall focus on this approach in this course.
1.2 Outer Integral
From now on, let $(\Omega, \mathcal{A}, P)$ be an underlying probability space. Also, let $T : \Omega \to \bar{\mathbb{R}} = [-\infty,\infty]$ be
an arbitrary map (not necessarily measurable) and $B \subseteq \Omega$ be an arbitrary set (not necessarily
measurable).
Definition 1.2.1. (i) The outer integral of $T$ w.r.t. P is defined as
\[ E^* T := \inf\{ EU : U \ge T,\ U : \Omega \to \bar{\mathbb{R}} \text{ is measurable and } EU \text{ exists} \}, \]
where "$EU$ exists" means $EU^+ < \infty$ or $EU^- < \infty$ (i.e., $EU$ is defined except in the case $\infty - \infty$).

(ii) The outer probability of $B$ is
\[ P^*(B) = \inf\{ P(A) : A \supseteq B,\ A \in \mathcal{A} \}. \]

(iii) The inner integral of $T$ w.r.t. P is defined as
\[ E_* T = -E^*(-T). \]

(iv) The inner probability of $B$ is
\[ P_*(B) = 1 - P^*(\Omega \setminus B). \]
Remark 1.2.2. Note that the definitions in (iii) and (iv) are equivalent to applying a similar construction as in (i)
and (ii), i.e.,
\[ E_*(T) = \sup\{ EU : U \le T,\ U : \Omega \to \bar{\mathbb{R}} \text{ is measurable and } EU \text{ exists} \} \]
and
\[ P_*(B) = \sup\{ P(A) : A \subseteq B,\ A \in \mathcal{A} \}. \]
It is well known that a map $T^*$ attaining the infimum in (i) always exists, provided that its expectation
exists.
Lemma 1.2.3. For any map $T : \Omega \to \bar{\mathbb{R}}$, there exists a measurable map $T^* : \Omega \to \bar{\mathbb{R}}$ with

(i) $T^* \ge T$;

(ii) $T^* \le U$ P-a.s. for any measurable $U : \Omega \to \bar{\mathbb{R}}$ with $U \ge T$ P-a.s.

Furthermore, such $T^*$ is unique up to P-null sets, and
\[ E^* T = E T^* \]
provided that $ET^*$ exists.

Definition 1.2.4. Such a function $T^*$ is called the minimal measurable majorant of $T$. Similarly, the
maximal measurable minorant $T_*$ can be defined as $T_* = -(-T)^*$.
There are several similarities between the outer integral and the ordinary one. Many concepts and propositions
of probability theory can be extended to outer-probability statements. However, there are
also several statements that do not hold in the outer-measure version. One example is Fubini's theorem.
Lemma 1.2.5 (Fubini's theorem for outer integrals). Let $T$ be a real-valued function on the product space
$(\Omega_1 \times \Omega_2, \mathcal{A}_1 \otimes \mathcal{A}_2, P_1 \otimes P_2)$. Then
\[ E_* T \le E_{1*} E_{2*} T \le E_1^* E_2^* T \le E^* T, \]
where $E_2^*$ is defined as
\[ (E_2^* T)(\omega_1) = \inf\{ E_2 U : U(\omega_2) \ge T(\omega_1, \omega_2)\ \forall \omega_2,\ U : \Omega_2 \to \bar{\mathbb{R}} \text{ is measurable and } E_2 U \text{ exists} \} \]
for $\omega_1 \in \Omega_1$, and the other operators are defined analogously.
Now we will extend the notion of weak convergence to non-measurable maps.
1.3 Weak Convergence
Definition 1.3.1. (i) A Borel probability measure $L$ on $\mathbb{D}$ is tight if
\[ \forall \varepsilon > 0\ \exists \text{compact } K \ \text{with}\ L(K) \ge 1-\varepsilon. \]

(ii) A Borel measurable map $X : \Omega \to \mathbb{D}$ is tight if the law of $X$, $\mathcal{L}(X) := P \circ X^{-1}$, is tight.

(iii) $L$ (or $X$) is separable if there exists a separable measurable set with probability 1, i.e.,
\[ \exists \text{separable measurable } A \subseteq \mathbb{D} \ \text{s.t.}\ L(A) = 1 \ (\text{or } P(X \in A) = 1). \]
Lemma 1.3.2. (i) If L (or X) is tight, then L (or X) is separable.
(ii) The converse is true if D is complete. That is, given that D is complete, separability of L (or X)
implies tightness.
Now we are ready to define weak convergence of "arbitrary" maps $X_n$.

Definition 1.3.3 (Weak Convergence). Let $(\Omega_n, \mathcal{A}_n, P_n)$ be a sequence of probability spaces and $X_n :
\Omega_n \to \mathbb{D}$ be arbitrary maps (possibly non-measurable). Then $X_n$ is said to converge weakly to a
"Borel measure" $L$, denoted $X_n \xrightarrow[n\to\infty]{w} L$, if
\[ E^* f(X_n) \xrightarrow[n\to\infty]{} \int f \, dL \quad \forall f \in C_b(\mathbb{D}). \]
Furthermore, if there is a "Borel measurable" map $X$ with law $L$, i.e., $\mathcal{L}(X) = L$, then we also write
$X_n \xrightarrow[n\to\infty]{w} X$.
Similar characterizations to those for weak convergence of measurable maps continue to hold.
Theorem 1.3.4 (Portmanteau). TFAE.

(i) $X_n \xrightarrow[n\to\infty]{w} L$;

(ii) $\liminf_n P_*(X_n \in G) \ge L(G)$ for any open set $G$;

(iii) $\limsup_n P^*(X_n \in F) \le L(F)$ for any closed set $F$;

(iv) $\liminf_n E_* f(X_n) \ge \int f \, dL$ for any function $f$ which is l.s.c. and bounded below;

(v) $\limsup_n E^* f(X_n) \le \int f \, dL$ for any function $f$ which is u.s.c. and bounded above;

(vi) $\lim P^*(X_n \in B) = \lim P_*(X_n \in B) = L(B)$ for any $L$-continuity set $B$ (i.e., $L(\partial B) = 0$);

(vii) $\liminf_n E_* f(X_n) \ge \int f \, dL$ for any function $f$ which is bounded, Lipschitz continuous, and nonnegative.
Recall that a function $f$ is lower semicontinuous (l.s.c.) if
\[ \liminf_{x\to x_0} f(x) \ge f(x_0), \]
and upper semicontinuity is defined analogously. Our first important result is the continuous mapping theorem.
Theorem 1.3.5 (Continuous mapping theorem). Let $(\mathbb{D}, d)$ and $(\mathbb{E}, e)$ be metric spaces and $g : \mathbb{D} \to \mathbb{E}$
be continuous at every point of a set $\mathbb{D}_0 \subseteq \mathbb{D}$. If $X_n \xrightarrow[n\to\infty]{w} X$ and $X$ takes its values in $\mathbb{D}_0$, then
$g(X_n) \xrightarrow[n\to\infty]{w} g(X)$.
Next to the continuous mapping theorem, Prokhorov's theorem (or Helly's selection principle, in a special case)
is the most important theorem on weak convergence. To formulate the result, two new concepts are
needed.
Definition 1.3.6. (i) $(X_n)$ is asymptotically measurable if
\[ E^* f(X_n) - E_* f(X_n) \xrightarrow[n\to\infty]{} 0 \quad \forall f \in C_b(\mathbb{D}). \]

(ii) $(X_n)$ is asymptotically tight if
\[ \forall \varepsilon > 0\ \exists \text{compact } K \ \text{s.t.}\ \liminf_n P_*(X_n \in K^\delta) \ge 1-\varepsilon \quad \forall \delta > 0, \]
where $K^\delta := \{ y \in \mathbb{D} : d(y, K) < \delta \}$ is the "δ-enlargement" of $K$.
Remark 1.3.7. A collection of Borel measurable maps $\{X_n\}$ is (uniformly) tight if
\[ \forall \varepsilon > 0\ \exists \text{compact } K \ \text{s.t.}\ \inf_n P(X_n \in K) \ge 1-\varepsilon. \]
(An equivalent definition results if the "inf" in the last statement is replaced by "lim inf.") The δ in the definition of
asymptotic tightness may seem a bit overdone (it enlarges the set $K$), but nothing is gained by it in
simple cases:

Proposition 1.3.8. If $\mathbb{D}$ is separable and complete, then uniform tightness and asymptotic tightness are the same (for measurable maps).
The following results are useful for verifying asymptotic measurability or tightness.

Lemma 1.3.9.

(i) If $X_n \xrightarrow[n\to\infty]{w} X$, then $(X_n)$ is asymptotically measurable.

(ii) If $X_n \xrightarrow[n\to\infty]{w} X$, then
\[ (X_n) \text{ is asymptotically tight} \iff X \text{ is tight}. \]
Now we are ready to state Prokhorov's theorem.

Theorem 1.3.10 (Prokhorov).

(i) If $(X_n)$ is asymptotically tight and asymptotically measurable, then $(X_n)$ is relatively compact,
i.e., every subsequence $(X_{n'})$ has a further subsequence $(X_{n''})$ converging weakly to a tight Borel
law.

(ii) A relatively compact collection $(X_n)$ is asymptotically tight if $\mathbb{D}$ is a Polish space (i.e., separable and
complete).

Remark 1.3.11. By the previous theorem, for Borel measures on a Polish space, the concepts "relatively
compact," "asymptotically tight," and "uniformly tight" are all equivalent.
Our final extension is a Slutsky-type lemma:

Lemma 1.3.12. Let $X_n \xrightarrow[n\to\infty]{w} X$ and $Y_n \xrightarrow[n\to\infty]{w} c$, where $c$ is a constant and $X$ has separable Borel
law. Then
\[ (X_n, Y_n) \xrightarrow[n\to\infty]{w} (X, c). \]

Corollary 1.3.13. Let $X_n$ and $X$ take values in a separable Banach space and $Y_n$ and $c$ be scalars. Addition and scalar multiplication are defined and are continuous
operations on a separable Banach space, so we can get
\[ X_n + Y_n \xrightarrow[n\to\infty]{w} X + c \]
and
\[ X_n Y_n \xrightarrow[n\to\infty]{w} cX. \]
Furthermore, if $c \neq 0$, we can also obtain
\[ X_n / Y_n \xrightarrow[n\to\infty]{w} X / c. \]
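The scalar case of Corollary 1.3.13 can be seen in a simulation. The sketch below is my own illustration (not from the lecture; the Uniform(0,1) data and all sizes are arbitrary choices): $X_n = \sqrt{n}(\bar{X}_n - 1/2)$ converges weakly to $N(0, 1/12)$, the sample standard deviation $Y_n$ converges to the constant $c = \sqrt{1/12}$, so $X_n / Y_n$ converges weakly to $N(0,1)$.

```python
import random
import statistics

random.seed(4)

# Slutsky-type illustration with Uniform(0,1) data: mu = 1/2, sigma^2 = 1/12.
n, reps = 400, 2000
ratios = []
for _ in range(reps):
    xs = [random.random() for _ in range(n)]
    x_n = n ** 0.5 * (statistics.fmean(xs) - 0.5)  # ~> N(0, 1/12)
    y_n = statistics.stdev(xs)                     # -> c = sqrt(1/12)
    ratios.append(x_n / y_n)                       # ~> N(0, 1) by Slutsky
print(round(statistics.stdev(ratios), 2))  # near 1
```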
1.4 Spaces of Bounded Functions
Definition 1.4.1. Let $T$ be an arbitrary set. Then the space $\ell^\infty(T)$ is defined as
\[ \ell^\infty(T) = \{ \text{all functions } f : T \to \mathbb{R} \ \text{s.t.}\ \|f\|_\infty < \infty \}, \]
where $\|f\|_\infty = \sup_{t\in T} |f(t)|$.
It is well known that $\ell^\infty(T)$ is a Banach space.
Definition 1.4.2 (Stochastic Process). A collection $\{X(t) : t \in T\}$ of random variables defined on
the same probability space $(\Omega, \mathcal{A}, P)$ is called a stochastic process.

Note that, if every sample path $t \mapsto X(t, \omega)$ belongs to $\ell^\infty(T)$, i.e., every sample path is bounded,
then $X$ can be viewed as a (random) map from $\Omega$ to $\ell^\infty(T)$. For an arbitrary map $X : \Omega \to \ell^\infty(T)$,
it is natural to call the finite-dimensional projections $(X(t_1), X(t_2), \cdots, X(t_k))$, for $t_1, t_2, \cdots, t_k \in T$, the
marginals.
Our interest, of course, is to find equivalent conditions for asymptotic tightness and weak convergence
of a sequence of random maps $(X_n)$. Before starting, we introduce the following two lemmas, which will
be used later.
Lemma 1.4.3. Let $X_n : \Omega_n \to \ell^\infty(T)$ be asymptotically tight. Then
\[ (X_n) \text{ is asymptotically measurable} \iff (X_n(t)) \text{ is asymptotically measurable for every } t \in T. \]
This implies that every asymptotically tight sequence of stochastic processes is asymptotically measurable: each marginal is a random
variable and hence measurable.

Lemma 1.4.4. Let $X, Y$ be tight Borel measurable maps into $\ell^\infty(T)$. Then
\[ \mathcal{L}(X) = \mathcal{L}(Y) \iff \text{all marginals are equal in law}. \]
That is, for tight measurable maps, the laws of all marginals determine the (joint) law. Now we
are ready to introduce our first result.
Theorem 1.4.5. Let $X_n : \Omega_n \to \ell^\infty(T)$, $n = 1, 2, \cdots$ be arbitrary maps. Then $X_n$ converges weakly
to a tight limit if and only if

(1) $(X_n)$ is asymptotically tight;

(2) every marginal converges weakly to a limit.

Proof. (⇒) (1) is trivial from Lemma 1.3.9. Next, note that, for any fixed $t_1, t_2, \cdots, t_k \in T$, the projection
\[ g : \ell^\infty(T) \to \mathbb{R}^k, \quad z \mapsto (z(t_1), z(t_2), \cdots, z(t_k))^\top \]
is a continuous function on $\ell^\infty(T)$. Thus the continuous mapping theorem implies (2).

(⇐) Let $t \in T$ be arbitrarily chosen. Then condition (2) implies that $(X_n(t))$ is asymptotically
measurable by Lemma 1.3.9. Since $t \in T$ was arbitrary, $(X_n)$ is asymptotically measurable by Lemma
1.4.3. Then by Prokhorov's theorem, every subsequence $\{n'\} \subseteq \{n\}$ has a further subsequence $\{n''\} \subseteq
\{n'\}$ along which $(X_{n''})$ converges weakly. If all such limits are equal, then $(X_n)$ converges weakly.
This follows from convergence of every marginal (condition (2)) and Lemma 1.4.4. In detail, for any
subsequence $\{n'\} \subseteq \{n\}$, there exist a further subsequence $\{n''\} \subseteq \{n'\}$ and $Y = Y(n')$ such that
$X_{n''} \xrightarrow[n''\to\infty]{w} Y$. Note that $Y$ is tight by condition (1) and Lemma 1.3.9, and by Lemma 1.4.4, every $Y$
has the same law for any choice of subsequence $\{n'\}$. Let $X$ be a tight r.v. s.t. $\mathcal{L}(X) = \mathcal{L}(Y)$. Then we
get
\[ \forall \{n'\} \subseteq \{n\}\ \exists \{n''\} \subseteq \{n'\} \ \text{s.t.}\ X_{n''} \xrightarrow[n''\to\infty]{w} X, \]
and therefore $X_n \xrightarrow[n\to\infty]{w} X$. □
Theorem 1.4.5 tells us that weak convergence of a sequence of random maps is equivalent to asymptotic
tightness plus marginal convergence. Marginal convergence can be established by any of the well-known
methods for proving weak convergence on Euclidean space. (Asymptotic) tightness can be given a
more concrete form, either through finite approximation or through an (essentially) Arzelà-Ascoli characterization.
The second approach is related to asymptotic continuity of the sample paths.
Definition 1.4.6. A map ρ : T × T → R is called a semimetric (or pseudometric) if
(1) ρ(x, y) ≥ 0 and x = y implies ρ(x, y) = 0;
(2) ρ(x, y) = ρ(y, x);
(3) ρ(x, z) ≤ ρ(x, y) + ρ(y, z).
(It may not satisfy ρ(x, y) = 0 =⇒ x = y)
Definition 1.4.7. Let $X_n : \Omega_n \to \ell^\infty(T)$ be a sequence of maps and $\rho$ a semimetric on $T$. Then
$(X_n)$ is called asymptotically uniformly $\rho$-equicontinuous in probability if
\[ \forall \varepsilon, \eta > 0\ \exists \delta > 0 \ \text{s.t.}\ \limsup_{n\to\infty} P^*\Big( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| > \varepsilon \Big) < \eta. \]

Recall that a collection $\{f_n : T \to \mathbb{R}\}$ of functions is "uniformly equicontinuous" if
\[ \forall \varepsilon > 0\ \exists \delta > 0 \ \text{s.t.}\ \sup_{\rho(s,t)<\delta} |f_n(s) - f_n(t)| < \varepsilon \text{ uniformly in } n. \]
Definition 1.4.7 modifies this notion slightly to make it a statement in probability. Now we are ready to see
some equivalent conditions for asymptotic tightness, which is one of the goals of this section.
Theorem 1.4.8. TFAE.

(i) $(X_n)$ is asymptotically tight.

(ii) (1) $(X_n(t))$ is asymptotically tight for every $t \in T$;
(2) there exists a semimetric $\rho$ on $T$ such that $(T, \rho)$ is totally bounded and $(X_n)$ is asymptotically uniformly
$\rho$-equicontinuous in probability.

(iii) (1) and (3) hold, where
(3) for all $\varepsilon, \eta > 0$ there exists a finite partition $T_1, \cdots, T_k$ of $T$ s.t.
\[ \limsup_{n\to\infty} P^*\Big( \max_i \sup_{s,t\in T_i} |X_n(s) - X_n(t)| > \varepsilon \Big) < \eta. \tag{1.2} \]
Remark 1.4.9. Condition (ii) is related to an Arzelà-Ascoli characterization of the space, while (iii) is related to
a finite approximation of the index set $T$. Condition (iii) means that for any $\varepsilon > 0$, $T$ can be partitioned into
finitely many subsets $T_i$ such that (asymptotically) the oscillation of the sample paths $t \mapsto X_n(t)$ is less
than $\varepsilon$ on every $T_i$.
Proof. (i) ⇒ (ii). First we show (1). Let $\pi_t : x \mapsto x(t)$ be the coordinate projection. Given $\varepsilon > 0$, there
exists a compact set $K$ s.t.
\[ \liminf_n P_*(X_n \in K^\delta) > 1 - \varepsilon \quad \forall \delta > 0. \]
From
\[ a \in K^\delta \implies \exists b \in K \ \text{s.t.}\ \|b - a\|_\infty < \delta \implies |\pi_t(b) - \pi_t(a)| \le \|b - a\|_\infty < \delta \implies \pi_t(a) \in (\pi_t(K))^\delta, \]
we get
\[ \liminf_n P_*\big( X_n(t) \in (\pi_t(K))^\delta \big) \ge \liminf_n P_*(X_n \in K^\delta) > 1 - \varepsilon \quad \forall \delta > 0. \]
As $\pi_t$ is continuous, $\pi_t(K)$ is compact, and it is the desired compact set.

Now we show (2). Let $K_1 \subseteq K_2 \subseteq \cdots$ be a sequence of compact subsets of $\ell^\infty(T)$
satisfying
\[ \liminf_{n\to\infty} P_*\big( X_n \in K_m^\varepsilon \big) \ge 1 - \frac{1}{m} \quad \forall \varepsilon > 0. \quad (\text{asymptotic tightness}) \]
Now for each $m$, define $\rho_m$ as
\[ \rho_m(s,t) = \sup_{z \in K_m} |z(s) - z(t)|. \]
Claim 1.) $(T, \rho_m)$ is totally bounded.

Remark 1.4.10. "Totally bounded" means that for every $\varepsilon > 0$, $T$ can be covered by finitely many balls of radius $\varepsilon$
w.r.t. $\rho$. Equivalently:
\[ \forall \varepsilon > 0\ \exists \text{a finite subset of } T \text{ whose distance from any element of } T \text{ is less than } \varepsilon. \]
Proof of Claim 1. For given $\eta > 0$, choose $z_1, z_2, \cdots, z_k \in \ell^\infty(T)$ s.t.
\[ K_m \subseteq \bigcup_{j=1}^k B_\eta(z_j) \]
(possible by compactness of $K_m$). Since each $z_j$ is a bounded function, $A := \{ (z_1(t), \cdots, z_k(t))^\top :
t \in T \} \subseteq \mathbb{R}^k$ is a bounded set, hence totally bounded (a totally bounded space is bounded; the converse also holds in Euclidean space), and so there exist $t_1, t_2, \cdots, t_p \in T$ s.t.
\[ A \subseteq \bigcup_{i=1}^p B_\eta\big( (z_1(t_i), \cdots, z_k(t_i)) \big). \]
This gives that for any $t \in T$ there is $t_i$ s.t.
\[ (z_1(t), \cdots, z_k(t))^\top \in B_\eta\big( (z_1(t_i), \cdots, z_k(t_i)) \big), \]
and hence we get
\begin{align*}
\rho_m(t, t_i) &= \sup_{z\in K_m} |z(t) - z(t_i)| \\
&\le \sup_{z\in K_m} \min_{1\le j\le k} \Big( \underbrace{|z(t) - z_j(t)|}_{\le \|z - z_j\|_\infty} + |z_j(t) - z_j(t_i)| + \underbrace{|z_j(t_i) - z(t_i)|}_{\le \|z - z_j\|_\infty} \Big) \\
&\le 2 \sup_{z\in K_m} \min_{1\le j\le k} \underbrace{\|z - z_j\|_\infty}_{\le \eta\ (\because \text{def. of } z_j)} + \max_{1\le j\le k} \underbrace{|z_j(t) - z_j(t_i)|}_{\le \eta\ (\because \text{def. of } t_i)} \\
&\le 3\eta.
\end{align*}
In summary, for every $\eta > 0$ there exist $t_1, \cdots, t_p$ s.t.
\[ \forall t \in T\ \exists t_i \ \text{s.t.}\ \rho_m(t, t_i) \le 3\eta, \]
which gives total boundedness of $(T, \rho_m)$. (Claim 1)
Claim 2.) $(T, \rho)$ is totally bounded, where
\[ \rho(s,t) = \sum_{m=1}^\infty 2^{-m} (\rho_m(s,t) \wedge 1). \]

Proof of Claim 2. Note that $\rho_m$ increases in $m$ by the definition. For $\eta > 0$, take $m$ s.t.
$2^{-m} < \eta$. Then by Claim 1, there exist $t_1, t_2, \cdots, t_p$ s.t.
\[ T \subseteq \bigcup_{i=1}^p B_\eta(t_i; \rho_m). \]
Then for every $t \in T$ there is $t_i$ s.t. $\rho_m(t, t_i) < \eta$, and so
\[ \rho(t, t_i) \le \sum_{k=1}^m 2^{-k} \underbrace{\rho_k(t, t_i)}_{\le \rho_m(t, t_i) < \eta} + \underbrace{\sum_{k=m+1}^\infty 2^{-k} \cdot 1}_{= 2^{-m} < \eta} \le \eta + \eta = 2\eta. \]
This means that for every $\eta > 0$ there exist $t_1, t_2, \cdots, t_p$ s.t.
\[ \forall t \in T\ \exists t_i \ \text{s.t.}\ \rho(t, t_i) \le 2\eta, \]
which gives that $(T, \rho)$ is totally bounded. (Claim 2)
Claim 3.) $(X_n)$ is asymptotically uniformly $\rho$-equicontinuous in probability.

Proof of Claim 3. Let $\varepsilon > 0$. If $\|z - z_0\|_\infty < \varepsilon$ for some $z_0 \in K_m$, then
\[ |z(s) - z(t)| \le \underbrace{|z(s) - z_0(s)|}_{<\varepsilon} + \underbrace{|z_0(s) - z_0(t)|}_{\le \sup_{z'\in K_m} |z'(s)-z'(t)| = \rho_m(s,t)} + \underbrace{|z_0(t) - z(t)|}_{<\varepsilon} \le 2\varepsilon + \rho_m(s,t). \]
If $\rho(s,t) < 2^{-m}\varepsilon$ (with $\varepsilon < 1$, which we may assume), then
\[ \rho_m(s,t) \wedge 1 \le 2^m \rho(s,t) < \varepsilon, \]
which gives $\rho_m(s,t) < \varepsilon$. Thus, provided that $\rho(s,t) < 2^{-m}\varepsilon$,
\[ z \in K_m^\varepsilon \implies \exists z_0 \in K_m \ \text{s.t.}\ \|z - z_0\|_\infty < \varepsilon \implies |z(s) - z(t)| \le 2\varepsilon + \rho_m(s,t) \le 3\varepsilon. \]
Therefore,
\[ K_m^\varepsilon \subseteq \Big\{ z : \sup_{\rho(s,t) < 2^{-m}\varepsilon} |z(s) - z(t)| \le 3\varepsilon \Big\}. \]
Now letting $\delta < 2^{-m}\varepsilon$, we get
\[ \liminf_{n\to\infty} P_*\Big( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| \le 3\varepsilon \Big) \ge \liminf_{n\to\infty} P_*\big( X_n \in K_m^\varepsilon \big) \ge 1 - \frac{1}{m}. \]
In summary, for every $m \in \mathbb{N}$ and every $\varepsilon > 0$ there exists $\delta > 0$ s.t.
\[ \liminf_{n\to\infty} P_*\Big( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| \le 3\varepsilon \Big) \ge 1 - \frac{1}{m}, \]
which implies that $(X_n)$ is asymptotically uniformly $\rho$-equicontinuous in probability. (Claim 3)
(ii) ⇒ (iii). By the assumption, given $\varepsilon, \eta > 0$, there exists $\delta > 0$ s.t.
\[ \limsup_{n\to\infty} P^*\Big( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| > \varepsilon \Big) < \eta. \]
Since $(T, \rho)$ is totally bounded, there exists a finite set $\{t_1, t_2, \cdots, t_p\} \subseteq T$ s.t.
\[ T \subseteq \bigcup_{i=1}^p B_{\delta/2}(t_i; \rho). \]
Now letting $T_i = B_{\delta/2}(t_i; \rho)$ (disjointified, if necessary, so that they form a partition), we get
\[ s, t \in T_i \implies \rho(s,t) \le \rho(s,t_i) + \rho(t_i,t) < \delta, \]
and therefore
\[ \sup_{s,t\in T_i} |z(s) - z(t)| \le \sup_{\rho(s,t)<\delta} |z(s) - z(t)| \]
for any $i = 1, 2, \cdots, p$. This implies the conclusion:
\[ \limsup_{n\to\infty} P^*\Big( \max_i \sup_{s,t\in T_i} |X_n(s) - X_n(t)| > \varepsilon \Big) \le \limsup_{n\to\infty} P^*\Big( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| > \varepsilon \Big) < \eta. \]
(iii) ⇒ (i). Given $\varepsilon, \eta > 0$, let $T_1, \cdots, T_p$ be a partition of $T$ such that
\[ \limsup_{n\to\infty} P^*\Big( \max_i \sup_{s,t\in T_i} |X_n(s) - X_n(t)| > \varepsilon \Big) < \eta \]
holds. Note that, for a fixed $t_i \in T_i$,
\[ \sup_{s,t\in T_i} |X_n(s) - X_n(t)| \le \varepsilon \implies \sup_{s\in T_i} |X_n(s)| \le \sup_{s\in T_i} |X_n(s) - X_n(t_i)| + |X_n(t_i)| \le |X_n(t_i)| + \varepsilon, \]
and hence
\[ \liminf_{n\to\infty} P_*\Big( \|X_n\|_\infty \le \max_{1\le i\le p} |X_n(t_i)| + \varepsilon \Big) \ge \liminf_{n\to\infty} P_*\Big( \max_{1\le i\le p} \sup_{s,t\in T_i} |X_n(s) - X_n(t)| \le \varepsilon \Big) \ge 1 - \eta. \]
This implies that $(\|X_n\|_\infty)$ is asymptotically tight. Why? First note that by (1), for each $i$ there exists $M_i > 0$ s.t.
\[ \liminf_{n\to\infty} P_*\big( |X_n(t_i)| < M_i + \varepsilon \big) \ge 1 - \eta \quad \forall \varepsilon > 0. \]
Letting $M = \max_i M_i$, we get
\[ \limsup_{n\to\infty} P^*\Big( \max_i |X_n(t_i)| \ge M + \varepsilon \Big) \le \limsup_{n\to\infty} \sum_i P^*\big( |X_n(t_i)| \ge M + \varepsilon \big) \le p\eta \quad \forall \varepsilon > 0, \]
i.e., $(\max_i |X_n(t_i)|)$ is asymptotically tight. Now let $K$ be a compact set s.t.
\[ \liminf_{n\to\infty} P_*\Big( \max_i |X_n(t_i)| \in K^\varepsilon \Big) \ge 1 - \eta \quad \forall \varepsilon > 0. \]
Then, since $\|X_n\|_\infty \ge \max_i |X_n(t_i)|$ always,
\begin{align*}
\liminf_{n\to\infty} P_*\Big( \|X_n\|_\infty \le \max_i |X_n(t_i)| + \varepsilon,\ \max_i |X_n(t_i)| \in K^\varepsilon \Big)
&= \liminf_{n\to\infty} P_*\Big( \Big| \|X_n\|_\infty - \max_i |X_n(t_i)| \Big| \le \varepsilon,\ \max_i |X_n(t_i)| \in K^\varepsilon \Big) \\
&\le \liminf_{n\to\infty} P_*\big( \|X_n\|_\infty \in K^{3\varepsilon} \big),
\end{align*}
which implies
\[ \liminf_{n\to\infty} P_*\big( \|X_n\|_\infty \in K^{3\varepsilon} \big) \ge 1 - 2\eta, \]
i.e., $(\|X_n\|_\infty)$ is asymptotically tight.

Now, let $\zeta > 0$ and a sequence $\varepsilon_m \downarrow 0$ be given. Choose $M > 0$ s.t.
\[ \limsup_{n\to\infty} P^*(\|X_n\|_\infty > M) \le \zeta. \]
For $\varepsilon_m$ and $\eta = 2^{-m}\zeta$, let
\[ T = \bigcup_{i=1}^{k_m} T_{m,i} \]
be a partition satisfying
\[ \limsup_{n\to\infty} P^*\Big( \max_{1\le i\le k_m} \sup_{s,t\in T_{m,i}} |X_n(s) - X_n(t)| > \varepsilon_m \Big) < \eta. \]
Now, let $z_{m,1}, z_{m,2}, \cdots, z_{m,P_m}$ be the set of all functions in $\ell^\infty(T)$ that are constant on each $T_{m,i}$,
taking values in
\[ \{ 0, \pm\varepsilon_m, \pm 2\varepsilon_m, \cdots, \pm \lfloor M/\varepsilon_m \rfloor \varepsilon_m \}. \]
[Figure 1.2: (a) The functions $z_{m,i}$; (b) approximating elements of $\ell^\infty(T)$ with the $z_{m,i}$'s.]
Now let
\[ K_m = \bigcup_{i=1}^{P_m} B_{\varepsilon_m}(z_{m,i}) \quad\text{and}\quad K = \bigcap_{m=1}^\infty K_m. \]
By construction, for each $m$,
\[ \|X_n\|_\infty \le M \ \text{and}\ \max_i \sup_{s,t\in T_{m,i}} |X_n(s) - X_n(t)| < \varepsilon_m \implies X_n \in K_m. \tag{1.3} \]
Since $K$ is closed (and hence complete, as a closed subset of the complete space $\ell^\infty(T)$) and totally bounded, it is compact. Thus our claim is:

Claim.) $\forall \delta > 0\ \exists m$ s.t. $K^\delta \supseteq \bigcap_{i=1}^m K_i$.

Proof of Claim. Assume not. Then $\exists \delta > 0$ s.t. $\forall m$, $K^\delta \nsupseteq \bigcap_{i=1}^m K_i$. That is,
\[ \exists (z_m) \ \text{s.t.}\ z_m \in \bigcap_{i=1}^m K_i \ \text{but}\ z_m \notin K^\delta. \]
Now we use the Arzelà-Ascoli formulation: note that $\{z_n\} \subseteq K_1 = \bigcup_{i=1}^{P_1} B_{\varepsilon_1}(z_{1,i})$, i.e., an infinite
number of the $z_n$ belong to finitely many balls, which means that at least one of the balls contains an infinite
number of the $z_n$. Consider a subsequence $(z_n^*)$ of $(z_n)$ lying in $B_{\varepsilon_1}(z_{1,i_1})$ for some $i_1$. In the same way,
there exists a further subsequence $(z_n^{**})$ lying in $B_{\varepsilon_2}(z_{2,i_2})$ for some $i_2$, and so on:
\[ \begin{array}{cccc} z_1 & z_2 & z_3 & \cdots \\ z_1^* & z_2^* & z_3^* & \cdots \\ z_1^{**} & z_2^{**} & z_3^{**} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{array} \]
Now define $(z_l)$ as the diagonal sequence $z_1, z_2^*, z_3^{**}, \cdots$; then $(z_l)$ is a Cauchy sequence, since $\varepsilon_m \downarrow 0$. Since $\ell^\infty(T)$
is complete, $(z_l)$ converges. Now note that
\[ z_l \in \bigcap_{i=1}^l K_i \subseteq \bigcap_{i=1}^m K_i \]
by construction for any $l \ge m$, and $\bigcap_{i=1}^m K_i$ is closed, so the limit $z$ of $(z_l)$ belongs to $\bigcap_{i=1}^m K_i$ for every
$m$. It implies that $z \in K$, which contradicts $z_m \notin K^\delta$ for all $m$. (Claim)

Now, by the Claim, for given $\delta > 0$ choose $m$ as in the Claim; then
\begin{align*}
\limsup_{n\to\infty} P^*(X_n \notin K^\delta) &\le \limsup_{n\to\infty} P^*\Big( X_n \notin \bigcap_{i=1}^m K_i \Big) \\
&\overset{(1.3)}{\le} \limsup_{n\to\infty} P^*\Big( \|X_n\|_\infty > M \ \text{or}\ \max_i \sup_{s,t\in T_{m',i}} |X_n(s) - X_n(t)| > \varepsilon_{m'} \ \text{for some}\ m' \le m \Big) \\
&\le \limsup_{n\to\infty} P^*(\|X_n\|_\infty > M) + \sum_{m'=1}^m \limsup_{n\to\infty} P^*\Big( \max_i \sup_{s,t\in T_{m',i}} |X_n(s) - X_n(t)| > \varepsilon_{m'} \Big) \\
&\le \zeta + \sum_{m'=1}^m 2^{-m'}\zeta < 2\zeta. \qquad\Box
\end{align*}
If the condition "asymptotic tightness" is replaced with "weak convergence," then we can obtain a
stronger statement.

Proposition 1.4.11. If $X_n \xrightarrow[n\to\infty]{w} X$, where $X$ is tight, then the sample path $t \mapsto X(t, \omega)$ is uniformly
$\rho$-continuous a.s., where $\rho$ is the semimetric constructed in the proof of the (i)⇒(ii) part of Theorem 1.4.8.

Proof. We keep the notation of that proof. We get
\begin{align*}
P(X \in K_m^\varepsilon) &\ge \limsup_{n\to\infty} P^*(X_n \in K_m^\varepsilon) \quad (\because \text{Portmanteau}) \\
&\ge \liminf_{n\to\infty} P_*(X_n \in K_m^\varepsilon) \ge 1 - \frac{1}{m}
\end{align*}
for every $m$ and $\varepsilon > 0$. By letting $\varepsilon \downarrow 0$, we get
\[ P(X \in K_m) \ge 1 - \frac{1}{m}, \]
which gives
\[ P\Big( X \in \bigcup_{m=1}^\infty K_m \Big) = 1. \tag{1.4} \]
Hence, for
\[ \rho_m(s,t) = \sup_{z\in K_m} |z(s) - z(t)| \quad\text{and}\quad \rho(s,t) = \sum_{m=1}^\infty 2^{-m} (\rho_m(s,t) \wedge 1), \]
we get
\[ z \in K_m \implies |z(s) - z(t)| \le \rho_m(s,t) \quad \forall s, t \in T. \]
Also, from $1 \wedge \rho_m(s,t) \le 2^m \rho(s,t)$, we get
\[ \rho(s,t) < \delta \implies \rho_m(s,t) < \varepsilon \]
for any $\delta < 2^{-m}\varepsilon$ with $\varepsilon < 1$. Therefore we get the conclusion: by (1.4), for a.e. $\omega$ there is $m = m(\omega)$ s.t. $X(\omega) \in K_m$, and then for every $\varepsilon > 0$ there exists $\delta = \delta(m)$
s.t.
\[ \sup_{\substack{s,t\in T \\ \rho(s,t)<\delta}} |X(s) - X(t)| < \varepsilon. \qquad\Box \]
Proposition 1.4.12. If $X_n \xrightarrow[n\to\infty]{w} X$, $(T, \rho)$ is totally bounded, and the sample paths $t \mapsto X(t, \omega)$ are
uniformly $\rho$-continuous P-a.s., then $(X_n)$ is asymptotically tight and asymptotically uniformly $\rho$-equicontinuous in probability.

Remark 1.4.13. Before the proof, note that:

"The set of uniformly continuous functions on a totally bounded set is complete and separable in the
uniform metric."

A brief proof is as follows. It is well known that $C(T)$ is complete, and that $C(T)$ is separable if and only if $T$
is compact. This gives that the set of continuous functions is complete and separable if $T$ is compact.
Meanwhile, the following facts are also well known:

• a totally bounded, complete set is compact;

• a uniformly continuous function can be extended to a continuous function on the completion.

In other words, a uniformly continuous function on a totally bounded set is equivalent to a continuous
function on a compact set (the completion).
Proof. Note that $(T, \rho)$ is totally bounded and $t \mapsto X(t, \omega)$ is uniformly $\rho$-continuous a.s., so the
realizations of $X$ lie in a complete separable subset of $\ell^\infty(T)$. Since a r.v. on a complete separable space is tight, $X$
is tight, which implies that $(X_n)$ is asymptotically tight (Lemma 1.3.9). Since $X$ is tight and uniformly
$\rho$-continuous a.s.,
\[ \forall \eta > 0\ \exists K: \text{compact set of uniformly $\rho$-continuous functions s.t. } P(X \in K) \ge 1 - \eta. \]
Note that, from the Portmanteau lemma,
\[ \liminf_{n\to\infty} P_*(X_n \in K^\varepsilon) \ge P(X \in K^\varepsilon) \ge 1 - \eta \quad \forall \varepsilon > 0. \]
Since $K$ is totally bounded, for every $\varepsilon > 0$ there exist $z_1, z_2, \cdots, z_k \in K$ s.t. $K \subseteq \bigcup_{i=1}^k B_\varepsilon(z_i)$, which implies
\[ K^\varepsilon \subseteq \bigcup_{i=1}^k B_{2\varepsilon}(z_i). \]
Since each $z_i$ is uniformly continuous,
\[ \exists \delta > 0 \ \text{s.t.}\ \rho(s,t) < \delta \implies \max_{1\le i\le k} |z_i(s) - z_i(t)| < \varepsilon. \]
Then we get, for $z \in B_{2\varepsilon}(z_i)$,
\[ \rho(s,t) < \delta \implies |z(s) - z(t)| \le |z(s) - z_i(s)| + |z_i(s) - z_i(t)| + |z_i(t) - z(t)| \le 2\varepsilon + \varepsilon + 2\varepsilon = 5\varepsilon, \]
and therefore
\[ \liminf_{n\to\infty} P_*\Big( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| \le 5\varepsilon \Big) \ge \liminf_{n\to\infty} P_*\Big( X_n \in \bigcup_{i=1}^k B_{2\varepsilon}(z_i) \Big) \ge \liminf_{n\to\infty} P_*(X_n \in K^\varepsilon) \ge 1 - \eta. \qquad\Box \]
A concluding remark for this chapter: in most cases we are interested in the situation where the limit
process is Gaussian, whose finite-dimensional convergence is obtained from the CLT. In this case the semimetric $\rho$
in Proposition 1.4.12 can be taken to be an $L_p$-norm of increments (see Remark 1.4.15).

Definition 1.4.14. A stochastic process is called Gaussian if each marginal has a multivariate normal
distribution.
Remark 1.4.15. Note that if $X_n \xrightarrow[n\to\infty]{w} X$, where $X$ is a tight Gaussian process, then the semimetric
\[ \rho(s,t) = \rho_p(s,t) = \big( E|X(s) - X(t)|^p \big)^{1/p}, \quad p \ge 1, \]
makes $(X_n)$ asymptotically uniformly $\rho$-equicontinuous in probability.
Chapter 2
Maximal Inequalities and
Symmetrization
2.1 Introduction
Here we use the following notation. Let $(\mathcal{X}, \mathcal{B}, P)$ be a baseline probability space, and $(\mathcal{X}^\infty, \mathcal{B}^\infty, P^\infty)$
the product space. We consider the projection onto the $i$th coordinate, $X_i : \mathcal{X}^\infty \to \mathcal{X}$. Then
$X_1, X_2, \cdots$ become i.i.d. r.v.'s with distribution $P$.
Definition 2.1.1. Denote
\[ \mathbb{P}_n := \frac{1}{n} \sum_{i=1}^n \delta_{X_i} \quad (\text{the empirical measure}) \]
and
\[ \mathbb{G}_n := \frac{1}{\sqrt{n}} \sum_{i=1}^n (\delta_{X_i} - P). \quad (\text{the empirical process}) \]
Here $\delta_x$ denotes the Dirac measure at $x$.
Remark 2.1.2. Often, $\mathbb{G}_n$ denotes the stochastic process $f \mapsto \mathbb{G}_n f$ (i.e., $(\mathbb{G}_n f)_{f\in\mathcal{F}}$), where $\mathcal{F}$ is
a collection of measurable functions and $Qf$ denotes $Qf = \int f \, dQ$ for a measurable function $f$ and
signed measure $Q$. Note that
\[ \mathbb{G}_n f = \frac{1}{\sqrt{n}} \sum_{i=1}^n (f(X_i) - Pf). \]
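The objects $\mathbb{P}_n f$ and $\mathbb{G}_n f$ are concrete and easy to compute. The sketch below is my own illustration (not from the lecture; the choice $P = \mathrm{Uniform}(0,1)$ and $f(x) = x^2$, for which $Pf = 1/3$ and $P(f - Pf)^2 = 4/45$, is arbitrary).

```python
import math
import random

random.seed(1)

# Empirical measure and empirical process evaluated at a single f:
# P = Uniform(0,1), f(x) = x^2, so Pf = 1/3 and P(f - Pf)^2 = 4/45.
n = 10000
xs = [random.random() for _ in range(n)]

f = lambda x: x * x
Pf = 1.0 / 3.0

Pn_f = sum(f(x) for x in xs) / n     # P_n f: average of f over the sample
Gn_f = math.sqrt(n) * (Pn_f - Pf)    # G_n f: one draw, roughly N(0, 4/45)

print(Pn_f, Gn_f)
```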
Definition 2.1.3. For a signed measure $Q$, define
\[ \|Q\|_{\mathcal{F}} := \sup\{ |Qf| : f \in \mathcal{F} \}. \]
Our first step consists of very well-known results:

Proposition 2.1.4. For each $f \in \mathcal{F}$,

(i) $\mathbb{P}_n f \to Pf$ a.s. (SLLN);

(ii) $\mathbb{G}_n f \xrightarrow[n\to\infty]{d} N(0, P(f - Pf)^2)$. (CLT)
We are interested in uniform versions of the previous proposition. The uniform version of (i) becomes:
\[ \|\mathbb{P}_n - P\|_{\mathcal{F}} \xrightarrow[n\to\infty]{P^*} 0. \tag{2.1} \]
Here $P^*$ denotes outer probability.

Definition 2.1.5. A collection of integrable (measurable) functions $\mathcal{F}$ satisfying (2.1) is called a P-Glivenko-Cantelli class.
Next, a uniform version of (ii) can be obtained as follows. Assume that
\[ \sup_{f\in\mathcal{F}} |f(x) - Pf| < \infty \quad \forall x \in \mathcal{X}. \]
Then $f \mapsto \mathbb{G}_n f$ can be viewed as a map into $\ell^\infty(\mathcal{F})$. If $\mathbb{G}_n$ is asymptotically tight in $\ell^\infty(\mathcal{F})$, then $\mathbb{G}_n$
converges weakly to a tight Borel measurable map $\mathbb{G}$ in $\ell^\infty(\mathcal{F})$, by a CLT-like argument and Theorem
1.4.5.

Definition 2.1.6. A class $\mathcal{F}$ of square-integrable (measurable) functions is called a P-Donsker class
if $(\mathbb{G}_n)$ is asymptotically tight.

Remark 2.1.7. A finite collection $\mathcal{F}$ of integrable functions is trivially P-Glivenko-Cantelli. Furthermore, a finite collection $\mathcal{F}$ of square-integrable functions is P-Donsker (by the (iii)⇒(i) part of Theorem
1.4.8).
Example 2.1.8. Let $X_1, X_2, \cdots$ be i.i.d. r.v.'s in $\mathbb{R}$, and
\[ \mathcal{F} := \{ 1_{(-\infty,t]} : t \in \mathbb{R} \}. \]
Then
\[ \|\mathbb{P}_n - P\|_{\mathcal{F}} = \sup_{t\in\mathbb{R}} |F_n(t) - F(t)| \xrightarrow[n\to\infty]{} 0 \quad \text{a.s.} \]
for any probability measure $P$ on $\mathbb{R}$. This gives that $\mathcal{F}$ is P-Glivenko-Cantelli for every $P$.

To show that $\mathcal{F}$ is P-Donsker, we must show asymptotic tightness of $(\mathbb{G}_n)$, which is obtained by
controlling the "supremum over a finite partition." For this, we need some maximal inequalities and
techniques for controlling the variation, which will be covered in the rest of this chapter.
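The quantity $\sup_t |F_n(t) - F(t)|$ is easy to compute exactly. The sketch below is my own illustration (not from the lecture; the choice $P = \mathrm{Uniform}(0,1)$, where $F(t) = t$, is arbitrary but makes the supremum computable in closed form from the order statistics).

```python
import random

random.seed(2)

# ||P_n - P||_F = sup_t |F_n(t) - F(t)| for P = Uniform(0,1), where F(t) = t.
# For sorted data x_(1) <= ... <= x_(n), the supremum over t is attained at a
# jump of F_n, just before or just after some x_(i):
def ks_uniform(xs):
    xs = sorted(xs)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

results = {}
for n in (100, 10000):
    results[n] = ks_uniform([random.random() for _ in range(n)])
    print(n, results[n])  # shrinks as n grows, as Glivenko-Cantelli predicts
```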
2.2 Tail and Concentration Bounds
The simplest case is well known to us:

Theorem 2.2.1 (Markov inequality). Let $X$ be a r.v. with mean $\mu$. Then
\[ P(|X - \mu| \ge t) \le \frac{E|X - \mu|^k}{t^k} \quad \forall t, k > 0. \]

[Figure 2.1: The supremum over an infinite set can be controlled as an aggregation of the supremum over a finite net and the variation in each small ball.]

This gives a "polynomial bound" for the tail probability. However, such a result is often too rough to be useful. Results giving "exponential bounds" are also well known; these are often called
concentration inequalities.
Theorem 2.2.2 (Chernoff bound).
\[ P(X - \mu \ge t) \le \frac{E\big( e^{\lambda(X-\mu)} \big)}{e^{\lambda t}} \quad \forall \lambda > 0\ \forall t \in \mathbb{R}. \]

Proof. Clear from
\[ 1(X - \mu \ge t) = 1\big( e^{\lambda(X-\mu)} \ge e^{\lambda t} \big) \le \frac{e^{\lambda(X-\mu)}}{e^{\lambda t}}. \qquad\Box \]
Example 2.2.3 (Gaussian tail bound). Let $X \sim N(\mu, \sigma^2)$ be a Gaussian r.v. Then by the Chernoff inequality,
\[ P(X - \mu \ge t) \le e^{-\lambda t} E e^{\lambda(X-\mu)} = \exp\Big( -\lambda t + \frac{\sigma^2}{2}\lambda^2 \Big) \quad \text{for any } t > 0, \lambda > 0. \]
Hence, optimizing over $\lambda$ (the minimum is attained at $\lambda = t/\sigma^2$), we get
\[ P(X - \mu \ge t) \le \inf_{\lambda>0} \exp\Big( -\lambda t + \frac{\sigma^2}{2}\lambda^2 \Big) = e^{-t^2/2\sigma^2} \quad \forall t > 0. \]
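As a quick numerical sanity check (my own addition; the grid of $t$ values is arbitrary), the optimized bound can be compared with the exact Gaussian tail, which for $\mu = 0$, $\sigma = 1$ equals $\frac{1}{2}\operatorname{erfc}(t/\sqrt{2})$:

```python
import math

# The optimized Chernoff bound exp(-t^2 / (2 sigma^2)) versus the exact standard
# Gaussian upper tail P(X >= t) = 0.5 * erfc(t / sqrt(2)), for sigma = 1, mu = 0.
for t in (0.5, 1.0, 2.0, 3.0):
    exact = 0.5 * math.erfc(t / math.sqrt(2))
    bound = math.exp(-t * t / 2)
    print(t, exact, bound)  # the bound always dominates the exact tail
```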
As shown, a Gaussian random variable has a squared-exponential tail bound. In general, the collection
of distributions with such tails is named sub-Gaussian.

Definition 2.2.4. A r.v. $X$ with mean $EX = \mu$ is called sub-Gaussian if $\exists \sigma > 0$ s.t.
\[ E\big( e^{\lambda(X-\mu)} \big) \le e^{\sigma^2\lambda^2/2} \quad \forall \lambda \in \mathbb{R}. \]
Remark 2.2.5. Note that the right-hand side in Definition 2.2.4 is the mgf of $N(0, \sigma^2)$. Thus sub-Gaussianity means "an mgf of smaller scale than that of a Gaussian distribution," i.e., "a tail which
decays at least as fast as a Gaussian one."

Remark 2.2.6. Obviously, if $X$ is sub-Gaussian, we get
\[ P(X - \mu \ge t) \le \exp\Big( -\frac{t^2}{2\sigma^2} \Big) \quad \forall t \ge 0. \]
Furthermore, if $X$ is sub-Gaussian with parameter $\sigma$, so is $-X$, and hence
\[ P(|X - \mu| \ge t) = P(X - \mu \ge t) + P\big( (-X) - (-\mu) \ge t \big) \le 2\exp\Big( -\frac{t^2}{2\sigma^2} \Big). \]
Example 2.2.7. A r.v. $\varepsilon$ is called Rademacher if
\[ P(\varepsilon = 1) = P(\varepsilon = -1) = \frac{1}{2}. \]
In this case,
\[ E e^{\lambda\varepsilon} = \frac{e^\lambda + e^{-\lambda}}{2} = \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!} \le \sum_{k=0}^\infty \frac{\lambda^{2k}}{2^k k!} = \exp\Big( \frac{\lambda^2}{2} \Big), \]
and hence $\varepsilon$ is sub-Gaussian with $\sigma = 1$.
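The inequality in the example, $\cosh\lambda \le e^{\lambda^2/2}$, is easy to check numerically. The grid below is my own arbitrary choice; the inequality itself holds for all $\lambda$, term by term in the series above.

```python
import math

# Check of E e^{lambda * eps} = cosh(lambda) <= exp(lambda^2 / 2) on a grid,
# the inequality behind sub-Gaussianity of a Rademacher variable (sigma = 1).
for k in range(-50, 51):
    lam = k / 10
    assert math.cosh(lam) <= math.exp(lam * lam / 2)
print("cosh(lambda) <= exp(lambda^2 / 2) on the grid")
```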
Actually, this result is not so surprising: a distribution with bounded support has an extremely
light tail, clearly lighter than the Gaussian one. We can formalize this observation as follows.
Example 2.2.8. Let X be a r.v. with EX = µ and P(a ≤ X ≤ b) = 1. Then X is sub-Gaussian with
σ =b− a
2. To show this, define
ψ(λ) = logEeλX . (“cgf”)
Then ψ(0) = 0, ψ′(0) = µ and
ψ′′(λ) = Eλ(X2)− (Eλ(X))2
where
Eλf(X) :=E(f(X)eλX)
E(eλX).
Note that Eλ can be viewed as an expectation operator w.r.t weight proportional to eλX . Now note
that:
If a ≤ Y ≤ b a.s., then
var(Y ) = miny
E(Y − y)2 ≤ E(Y − b+ a
2
)2
≤(b− a
2
)2
holds.
24
Draft
Advanced Probability Theory J.P.Kim
Since ψ″(λ) can be viewed as a variance (of X under the tilted measure, whose support is still contained in [a, b]), we get
ψ″(λ) ≤ ((b − a)/2)²  ∀λ ∈ R.
Thus we obtain
sup_{λ∈R} ψ″(λ) ≤ ((b − a)/2)²,
and hence, by Taylor expansion around 0 with some ξ between 0 and λ,
ψ(λ) = ψ(0) + ψ′(0)λ + ψ″(ξ)λ²/2 ≤ λµ + (λ²/2)((b − a)/2)²,
which yields
E(e^{λ(X−µ)}) = e^{−λµ + ψ(λ)} ≤ exp((λ²/2)((b − a)/2)²).
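As a quick numerical sanity check of this bound (the Bernoulli family and the λ grid are my own choices, not from the lecture): for X ∼ Bernoulli(p) we have a = 0, b = 1, so the claim is E e^{λ(X−p)} ≤ e^{λ²/8}.

```python
import math

# Centered mgf of Bernoulli(p): p*e^{lam(1-p)} + (1-p)*e^{-lam*p}.
def centered_mgf(p, lam):
    return p * math.exp(lam * (1 - p)) + (1 - p) * math.exp(-lam * p)

ok = True
for p in (0.1, 0.3, 0.5, 0.9):
    for lam in [x / 4 for x in range(-20, 21)]:
        # sub-Gaussian bound with sigma = (b - a)/2 = 1/2
        ok = ok and centered_mgf(p, lam) <= math.exp(lam * lam / 8) + 1e-12
```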
Our next result is that an independent sum of sub-Gaussian random variables is again sub-Gaussian.
Theorem 2.2.9 (Hoeffding's inequality). Let X_i be independent r.v.'s with EX_i = µ_i, each X_i sub-Gaussian with parameter σ_i. Then ∑_{i=1}^n X_i is also sub-Gaussian with parameter (∑_{i=1}^n σ_i²)^{1/2}; in particular,
P(∑_{i=1}^n (X_i − µ_i) ≥ t) ≤ exp(−t²/(2∑_{i=1}^n σ_i²))  ∀t ≥ 0.
Proof. It is sufficient to show that
X_1 + X_2 is sub-Gaussian with σ² = σ_1² + σ_2².
It is clear from (using independence)
E(e^{λ(X_1+X_2−(µ_1+µ_2))}) = E(e^{λ(X_1−µ_1)})E(e^{λ(X_2−µ_2)}) ≤ exp(σ_1²λ²/2)exp(σ_2²λ²/2) = exp((σ_1² + σ_2²)λ²/2).
The following corollary is immediate from Hoeffding's inequality, but it is a very useful result. It will also be widely used in this course.
Corollary 2.2.10. If the X_i are bounded and independent, i.e., P(a_i ≤ X_i ≤ b_i) = 1, then
P(∑_{i=1}^n (X_i − µ_i) ≥ t) ≤ exp(−2t² / ∑_{i=1}^n (b_i − a_i)²).
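A Monte Carlo illustration of the corollary (the uniform distribution, sample size, and deviation level t are arbitrary choices): for X_i ∼ Uniform[0,1] we have a_i = 0, b_i = 1, µ_i = 1/2, so the bound is exp(−2t²/n).

```python
import math
import random

random.seed(0)
n, reps, t = 50, 20000, 5.0          # t is a deviation level for the sum
hits = 0
for _ in range(reps):
    s = sum(random.random() for _ in range(n)) - n / 2
    if s >= t:
        hits += 1
emp = hits / reps                     # empirical P(sum(X_i - mu_i) >= t)
bound = math.exp(-2 * t * t / n)      # Hoeffding: exp(-2 t^2 / sum (b_i - a_i)^2)
```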
Before moving on, let us check some equivalent conditions for sub-Gaussianity.
Theorem 2.2.11. For any X with EX = 0, TFAE.
(i) ∃σ > 0 s.t. Ee^{λX} ≤ exp(λ²σ²/2) ∀λ ∈ R (i.e., X is sub-Gaussian).
(ii) ∃c ≥ 1 and a Gaussian r.v. Z ∼ N(0, τ²) s.t. P(|X| ≥ s) ≤ cP(|Z| ≥ s) ∀s ≥ 0.
(iii) ∃θ ≥ 0 s.t. EX^{2k} ≤ ((2k)!/(2^k k!))θ^{2k} ∀k = 1, 2, ···.
(iv) ∃σ > 0 s.t. Ee^{λX²/(2σ²)} ≤ 1/√(1 − λ) ∀λ ∈ [0, 1).
Now we turn to another notion. Sub-Gaussianity is fairly restrictive, so it is natural to consider various relaxations of it. The class of so-called sub-exponential r.v.'s is defined by a slightly milder condition on the mgf, and accordingly allows a slower rate of tail decay.
Definition 2.2.12. A random variable X is called sub-exponential if ∃ν, b > 0 s.t.
Ee^{λ(X−µ)} ≤ e^{ν²λ²/2}  ∀λ: |λ| ≤ 1/b.
Obviously, sub-Gaussianity implies sub-exponentiality. The converse is not true; sub-Gaussianity is a strictly stronger condition.
Example 2.2.13. Let Z ∼ N(0, 1) and X = Z². Then
Ee^{λ(X−1)} = e^{−λ}/√(1 − 2λ)  for λ < 1/2,
and the mgf does not exist for λ ≥ 1/2, so X cannot be sub-Gaussian. With a simple calculation, we can verify that
e^{−λ}/√(1 − 2λ) ≤ e^{2λ²}  ∀λ: |λ| < 1/4.
Therefore, X is sub-exponential (with ν = 2 and b = 4), but not sub-Gaussian.
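The "simple calculation" can be replaced by a numerical grid check (the grid resolution is an arbitrary choice):

```python
import math

# Check e^{-lam} / sqrt(1 - 2*lam) <= e^{2*lam^2} on a grid of |lam| < 1/4.
lams = [x / 1000 for x in range(-249, 250)]
ok = all(
    math.exp(-l) / math.sqrt(1 - 2 * l) <= math.exp(2 * l * l) + 1e-12
    for l in lams
)
```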
Theorem 2.2.14. For any X with EX = 0, TFAE.
(i) ∃ν, b > 0 s.t. Ee^{λX} ≤ e^{λ²ν²/2} ∀λ: |λ| ≤ 1/b (i.e., X is sub-exponential).
(ii) ∃c₀ > 0 s.t. Ee^{λX} < ∞ ∀λ: |λ| ≤ c₀.
(iii) ∃c₁, c₂ > 0 s.t. P(|X| ≥ t) ≤ c₁e^{−c₂t} ∀t > 0.
(iv) ∃σ, M > 0 s.t. E|X|^k ≤ (σ²/2)k!M^{k−2} ∀k = 2, 3, ··· ("Bernstein condition").
Condition (iv) is called the Bernstein condition. It is known that:
Lemma 2.2.15. If EX = 0 and X satisfies the Bernstein condition, then
P(|X| ≥ t) ≤ 2e^{−t²/(2(σ² + Mt))}  ∀t > 0.
Proof. Note that
Ee^{λX} = ∑_{k=0}^∞ λ^k EX^k/k! = 1 + ∑_{k=2}^∞ λ^k EX^k/k! ≤ 1 + ∑_{k=2}^∞ |λ|^k (σ²/2)k!M^{k−2}/k! = 1 + (λ²σ²/2) ∑_{k=2}^∞ (|λ|M)^{k−2}
holds, which implies
Ee^{λX} ≤ 1 + (λ²σ²/2)·1/(1 − |λ|M) ≤ exp(λ²σ²/(2(1 − |λ|M)))
provided that |λ| < 1/M. It gives
P(X ≥ t) ≤ e^{−λt + λ²σ²/(2(1−|λ|M))}  ∀λ: |λ| < 1/M,
and hence, choosing λ = t/(σ² + Mt) (which indeed satisfies |λ|M < 1),
P(X ≥ t) ≤ inf_{|λ|<1/M} e^{−λt + λ²σ²/(2(1−|λ|M))} ≤ e^{−t²/(2(σ² + Mt))}.
The same technique applied to −X gives the conclusion
P(|X| ≥ t) ≤ 2e^{−t²/(2(σ² + Mt))}.
We can easily extend this result to independent sums of random variables.
Corollary 2.2.16 (Bernstein's inequality). Let X_i be independent random variables satisfying the Bernstein condition
EX_i = 0 and E|X_i|^k ≤ (σ_i²/2)k!M^{k−2}, k = 2, 3, ···.
Then
P(|X_1 + ··· + X_n| ≥ t) ≤ 2 exp(−t²/(2(∑_{i=1}^n σ_i² + Mt))).
Proof. By the Chernoff argument, for |λ| < 1/M we get
P(|X_1 + ··· + X_n| ≥ t) ≤ 2e^{−λt}Ee^{λ∑_{i=1}^n X_i} ≤ 2 exp(−λt + ∑_{i=1}^n λ²σ_i²/(2(1 − |λ|M))).
We get the conclusion by letting λ = t/(Mt + ∑_{i=1}^n σ_i²).
Example 2.2.17. Let Z_k i.i.d. ∼ N(0, 1). Then, using the mgf bound of example 2.2.13, for any λ > 0 with λ/n < 1/4,
P(|(1/n)∑_{k=1}^n Z_k² − 1| ≥ t) ≤ 2e^{−λt}E exp(λ((1/n)∑_{k=1}^n Z_k² − 1)) = 2e^{−λt}(e^{−λ/n}/√(1 − 2λ/n))^n ≤ 2e^{−λt}e^{2n(λ/n)²} = 2e^{−λt + 2λ²/n}.
Since
min_{0<λ<n/4}(2λ²/n − λt) = −nt²/8  (attained at λ = nt/4, admissible for t < 1),
we get
P(|(1/n)∑_{k=1}^n Z_k² − 1| ≥ t) ≤ 2e^{−nt²/8}  for t ∈ (0, 1).
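A Monte Carlo look at this chi-square concentration bound (n, t, and the number of replications are arbitrary choices; the bound is quite loose here):

```python
import math
import random

random.seed(1)
n, reps, t = 40, 5000, 0.5
hits = 0
for _ in range(reps):
    m = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n
    if abs(m - 1.0) >= t:
        hits += 1
emp = hits / reps                       # empirical P(|mean of Z_k^2 - 1| >= t)
bound = 2 * math.exp(-n * t * t / 8)    # 2 e^{-n t^2 / 8}, valid for t in (0, 1)
```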
Example 2.2.18 (Johnson-Lindenstrauss embedding). Let u_i ∈ R^d, i = 1, 2, ···, m, be extremely high-dimensional vectors (i.e., d is very large). We want to find a map F : R^d → R^n with n ≪ d and
(1 − δ)‖u_i − u_j‖₂² ≤ ‖F(u_i) − F(u_j)‖₂² ≤ (1 + δ)‖u_i − u_j‖₂²
for some δ ∈ (0, 1) ("embedding into a low-dimensional space while approximately preserving distances").
Remark 2.2.19. Such an embedding can be useful when using, for example, a clustering algorithm. There are various distance-based clustering methods such as K-means. If one handles extremely high-dimensional data, then computing the distances between all pairs of data points may require heavy computation. For this reason, one can first embed the data into a low-dimensional subspace, approximately preserving distances, and then treat the data as low-dimensional.
(Example continued) Define F : R^d → R^n by
F(u) = Xu/√n, where X = (x_{ij}) ∈ R^{n×d} with x_{ij} i.i.d. ∼ N(0, 1).
Then
‖F(u)‖₂²/‖u‖₂² = ‖Xu‖₂²/(n‖u‖₂²) = ∑_{i=1}^n ⟨X_i, u⟩²/(n‖u‖₂²) = (1/n)∑_{i=1}^n ⟨X_i, u/‖u‖₂⟩²,
where X_i is the ith row vector of X. Note that for any fixed u ≠ 0,
n‖F(u)‖₂²/‖u‖₂² = ∑_{i=1}^n ⟨X_i, u/‖u‖₂⟩² ∼ χ²(n)
holds, and hence we get
P(‖F(u)‖₂²/‖u‖₂² ∉ [1 − δ, 1 + δ]) ≤ 2e^{−nδ²/8}
by the previous example. Thus, using F(u_i − u_j) = F(u_i) − F(u_j), we get
P(‖F(u_i) − F(u_j)‖₂²/‖u_i − u_j‖₂² ∉ [1 − δ, 1 + δ] for some i ≠ j)
≤ ∑_{i≠j} P(‖F(u_i) − F(u_j)‖₂²/‖u_i − u_j‖₂² ∉ [1 − δ, 1 + δ])
≤ 2 (m choose 2) e^{−nδ²/8}.
Finally, for any ε ∈ (0, 1) and m ≥ 2,
2 (m choose 2) e^{−nδ²/8} ≤ ε if n > (16/δ²) log(m/ε),
so for such n, we can find a map F (indeed, the random map above works with probability at least 1 − ε). ♦
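The construction can be simulated directly; a small sketch (dimensions, δ, ε, and the seed are arbitrary choices, and m random points stand in for the u_i):

```python
import math
import random

random.seed(2)
d, m, delta, eps = 300, 10, 0.5, 0.1
n = int(16 / delta ** 2 * math.log(m / eps)) + 1   # n > (16/delta^2) log(m/eps)

pts = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

def embed(u):
    # F(u) = X u / sqrt(n)
    return [sum(x * v for x, v in zip(row, u)) / math.sqrt(n) for row in X]

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

emb = [embed(u) for u in pts]
ratios = [
    sqdist(emb[i], emb[j]) / sqdist(pts[i], pts[j])
    for i in range(m) for j in range(i + 1, m)
]
ok = all(1 - delta <= r <= 1 + delta for r in ratios)
```

With these parameters n ≈ 295, so the embedding pays off only when d is much larger; the failure probability is at most ε = 0.1 by the union bound, and in practice far smaller.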
From now on, we return to our original interest. Whether a given class F is a Glivenko-Cantelli (Donsker) class depends on the size of the class. A finite class of square integrable functions is always Donsker (by theorem 1.4.8), while at the other extreme the class of all square integrable uniformly-bounded functions is almost never Donsker. A relatively simple way to measure the size of a class is to use entropy numbers, which are essentially the logarithms of the numbers of "balls" or "brackets" of size ε needed to cover F. Let (F, ‖·‖) be a subset of a normed space of functions f : X → R.
Definition 2.2.20 (Covering number).
• The covering number N(ε, F, ‖·‖) is the minimum number of balls {g : ‖g − f‖ < ε} of radius ε needed to cover F. The centers f of the balls need not belong to F.
• The entropy is the logarithm of the covering number N(ε, F, ‖·‖).
Definition 2.2.21 (Bracketing number).
• Given two functions l and u, the bracket [l, u] is the set of all functions f with l ≤ f ≤ u.
• An ε-bracket is a bracket [l, u] with ‖u − l‖ < ε.
• The bracketing number N_[](ε, F, ‖·‖) is the minimum number of ε-brackets needed to cover F. The functions u and l need not belong to F.
• The bracketing entropy (entropy with bracketing) is the logarithm of the bracketing number N_[](ε, F, ‖·‖).
We only consider norms ‖·‖ with the property
|f| ≤ |g| =⇒ ‖f‖ ≤ ‖g‖.
For example, the L_r(Q) norm
‖f‖_{Q,r} = (∫|f|^r dQ)^{1/r}
satisfies this property.
Remark 2.2.22. Note that
N(ε, F, ‖·‖) ≤ N_[](2ε, F, ‖·‖)
is satisfied, because
f ∈ [l, u], ‖u − l‖ < 2ε =⇒ f ∈ B_ε((u + l)/2)
holds, i.e., every 2ε-bracket is contained in some ε-ball.
Definition 2.2.23. An envelope function of F is any function F s.t.
|f(x)| ≤ F (x) ∀x ∈ X ∀f ∈ F .
2.3 Maximal Inequalities
In this section, we will obtain bounds on the expectation of a maximum, for example the maximum variation of a stochastic process within a small time interval. For this we introduce the notion of the Orlicz norm.
Definition 2.3.1. For ψ : [0, ∞) → [0, ∞) strictly increasing and convex with ψ(0) = 0, and a random variable X, the Orlicz norm ‖X‖_ψ is defined as
‖X‖_ψ = inf{C > 0 : Eψ(|X|/C) ≤ 1}.
Of course, we should check that the Orlicz norm is actually a "norm."
Proposition 2.3.2. ‖ · ‖ψ is a norm on the set of all random variables with ‖X‖ψ <∞, i.e.,
(i) ‖aX‖ψ = |a| · ‖X‖ψ ∀a ∈ R;
(ii) ‖X‖ψ = 0 ⇐⇒ X = 0 a.s.;
(iii) ‖X + Y ‖ψ ≤ ‖X‖ψ + ‖Y ‖ψ.
Proof. (i) Trivial.
(ii) The ⇐ part is trivial. Assume that ‖X‖_ψ = 0. It means that
Eψ(|X|/C) ≤ 1  ∀C > 0.  (∗)
Note that, as C ↓ 0,
ψ(|X|/C) → ψ(∞) = ∞ on {X ≠ 0},  ψ(|X|/C) → ψ(0) = 0 on {X = 0}
(ψ(∞) = ∞ because ψ is a convex, strictly increasing function). If P(X ≠ 0) > 0, then by the monotone convergence theorem,
lim_{C↓0} Eψ(|X|/C) = ∞,
which contradicts (∗).
(iii) It suffices to show that
Eψ(|X|/C₁) ∨ Eψ(|Y|/C₂) ≤ 1 =⇒ Eψ(|X + Y|/(C₁ + C₂)) ≤ 1.
(∵ Let Eψ(|X|/C₁) ≤ 1 and Eψ(|Y|/C₂) ≤ 1. Then under this claim ‖X + Y‖_ψ ≤ C₁ + C₂ holds; taking infima w.r.t. C₁ and C₂ sequentially, we get the desired result.) It comes from:
ψ(|X + Y|/(C₁ + C₂)) ≤ ψ((|X| + |Y|)/(C₁ + C₂))  (∵ ψ is increasing)
= ψ((C₁/(C₁ + C₂))·(|X|/C₁) + (C₂/(C₁ + C₂))·(|Y|/C₂))
≤ (C₁/(C₁ + C₂))ψ(|X|/C₁) + (C₂/(C₁ + C₂))ψ(|Y|/C₂)  (∵ ψ is convex).
There are two often-used Orlicz norms.
Example 2.3.3. Let ψ(x) = x^p, p ≥ 1. Then ψ trivially satisfies the conditions in definition 2.3.1 and
‖X‖_ψ = inf{C > 0 : E(|X|/C)^p ≤ 1} = inf{C > 0 : E|X|^p ≤ C^p} = (E|X|^p)^{1/p} =: ‖X‖_p,
i.e., the Orlicz norm w.r.t. ψ(x) = x^p is the L_p-norm.
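The infimum in the definition can be computed numerically by bisection, since C ↦ Eψ(|X|/C) is decreasing; a sketch that recovers the L₂ norm on the empirical distribution of a fixed sample (the sample values are arbitrary):

```python
import math

sample = [0.5, -1.2, 2.0, 0.1, -0.7, 1.5]

def orlicz_norm(xs, psi, lo=1e-6, hi=1e6, iters=200):
    # bisect on C, keeping the invariant: E psi(|x|/hi) <= 1 < E psi(|x|/lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(psi(abs(x) / mid) for x in xs) / len(xs) <= 1:
            hi = mid
        else:
            lo = mid
    return hi

l2 = math.sqrt(sum(x * x for x in sample) / len(sample))
orl = orlicz_norm(sample, lambda x: x * x)   # psi(x) = x^2 should give the L2 norm
```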
Example 2.3.4. Let ψ_p(x) := e^{x^p} − 1, p ≥ 1. Then ψ_p trivially satisfies the conditions in definition 2.3.1, and ψ_p(x) ≥ x^p (since e^t − 1 ≥ t). Hence
‖X‖_p ≤ ‖X‖_{ψ_p}.
Remark 2.3.5. Note that, for ‖X‖_p or ‖X‖_{ψ_p} to be finite,
Eψ(|X|/C) < ∞ or Eψ_p(|X|/C) < ∞
should hold for some C > 0, respectively. The former requires a polynomial-order tail bound, while the latter requires an exponential-order (p = 1) or squared-exponential-order (p = 2) one. In general, the following holds.
Proposition 2.3.6 (Tail bound). If ‖X‖_ψ < ∞, then
P(|X| > x) ≤ 1/ψ(x/‖X‖_ψ).
Proof. Since ψ is continuous (from convexity),
Eψ(|X|/‖X‖_ψ) = E lim_{C↓‖X‖_ψ} ψ(|X|/C) = lim_{C↓‖X‖_ψ} Eψ(|X|/C) ≤ 1  (2.2)
holds by the MCT (actually "=" holds). Now the Markov inequality gives
P(|X| > x) = P(ψ(|X|/‖X‖_ψ) ≥ ψ(x/‖X‖_ψ)) ≤ Eψ(|X|/‖X‖_ψ)/ψ(x/‖X‖_ψ) ≤ 1/ψ(x/‖X‖_ψ).
This proposition gives a necessary condition for ‖X‖_ψ < ∞. Then what is a sufficient condition? In other words, is there a tail-bound condition which implies ‖X‖_ψ < ∞?
Proposition 2.3.7. If P(|X| > x) ≤ C/x^{p+δ} for p ≥ 1 and C, δ > 0, then ‖X‖_p < ∞.
Proof.
E|X|^p = ∫_0^∞ P(|X|^p > x)dx ≤ 1 + ∫_1^∞ C/x^{1+δ/p} dx < ∞.
Proposition 2.3.8. If P(|X| > x) ≤ Ke−Cxp for p ≥ 1 and C,K > 0, then ‖X‖ψp <∞.
Proof. Note that
E(e^{D|X|^p} − 1) = E∫_0^{|X|^p} De^{Ds}ds = E∫_0^∞ I(s < |X|^p)De^{Ds}ds = ∫_0^∞ P(s < |X|^p)De^{Ds}ds ≤ ∫_0^∞ Ke^{−Cs}De^{Ds}ds = KD∫_0^∞ e^{−(C−D)s}ds ≤ 1
holds for sufficiently small D > 0. It gives that
Eψ_p(|X|/D^{−1/p}) ≤ 1
for sufficiently small D > 0 (precisely, if D ≤ C/(K + 1)), i.e., ‖X‖_{ψ_p} < ∞ (precisely, ‖X‖_{ψ_p} ≤ ((K + 1)/C)^{1/p}).
Remark 2.3.9. Proposition 2.3.7 gives that if the tail probability is bounded by a polynomial of order p "+δ", then the p-norm ‖X‖_p is finite; proposition 2.3.8 gives that if the tail probability is bounded by a squared-exponential (exponential, resp.) term, i.e., the random variable has a sub-Gaussian (sub-exponential, resp.) distribution, then ‖X‖_{ψ₂} < ∞ (‖X‖_{ψ₁} < ∞, resp.).
Our original goal in this section is to obtain bounds for maxima of random variables. Such maximal inequalities can be derived from the basic properties of the Orlicz norm. Before starting, note the following naive bound:
E(max_{1≤i≤m}|X_i|) ≤ ∑_{i=1}^m E|X_i| ≤ m·max_{1≤i≤m}E|X_i|,
or similarly,
‖max_{1≤i≤m}|X_i|‖_p = (E max_{1≤i≤m}|X_i|^p)^{1/p} ≤ (∑_{i=1}^m E|X_i|^p)^{1/p} ≤ (m·max_{1≤i≤m}E|X_i|^p)^{1/p} = m^{1/p} max_{1≤i≤m}‖X_i‖_p.
Thus if the random variables have lighter tails (so that E max|X_i|^p < ∞ for a larger p), then a tighter bound for the maximum is obtained (m^{1/p} instead of m). The following theorem gives a generalized bound.
Theorem 2.3.10. Let ψ be a convex, strictly increasing function with ψ(0) = 0. Further, assume that ψ satisfies
limsup_{x,y→∞} ψ(x)ψ(y)/ψ(cxy) < ∞  for some c > 0.  (2.3)
Then for any random variables X₁, X₂, ···, X_m,
‖max_{1≤i≤m}|X_i|‖_ψ ≤ Kψ^{−1}(m) max_{1≤i≤m}‖X_i‖_ψ,
where K is a constant depending only on ψ.
Remark 2.3.11. Note that:
• The factor "m^{1/p}" in the naive bound corresponds to "ψ^{−1}(m)". If ψ increases fast, then ψ^{−1}(m) is smaller, which gives a smaller bound.
• The theorem holds for any random variables X₁, ···, X_m; it does not require an additional assumption such as independence.
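For ψ = ψ₂, the factor ψ₂^{−1}(m) = √(log(1 + m)) predicts the familiar logarithmic growth of Gaussian maxima; a Monte Carlo sketch (sample sizes and seed are arbitrary choices):

```python
import math
import random

random.seed(3)
reps = 2000

def mean_max_abs(m):
    # estimate E max_{i <= m} |X_i| for i.i.d. N(0,1) variables
    total = 0.0
    for _ in range(reps):
        total += max(abs(random.gauss(0, 1)) for _ in range(m))
    return total / reps

small, big = mean_max_abs(10), mean_max_abs(1000)
ratio = big / small
pred = math.sqrt(math.log(1 + 1000) / math.log(1 + 10))  # growth of psi_2^{-1}(m)
```

The observed ratio of the two averages is close to the ratio of the ψ₂^{−1} values, in line with the theorem (up to the universal constant K).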
Proof. Firstly, we assume that
ψ(x)ψ(y) ≤ ψ(cxy)  ∀x, y ≥ 1
and ψ(1) ≤ 1/2. In this case,
ψ(x/y) ≤ ψ(cx)/ψ(y)  ∀x ≥ y ≥ 1.
Thus, for y ≥ 1 and any C > 0,
max_{1≤i≤m} ψ(|X_i|/(Cy)) ≤ max_{1≤i≤m} [ (ψ(c|X_i|/C)/ψ(y)) I(|X_i|/(Cy) ≥ 1) + ψ(|X_i|/(Cy)) I(|X_i|/(Cy) < 1) ]
≤ max_{1≤i≤m} ψ(c|X_i|/C)/ψ(y) + ψ(1)
≤ ∑_{i=1}^m ψ(c|X_i|/C)/ψ(y) + 1/2
holds, where we used ψ(|X_i|/(Cy)) ≤ ψ(1) on {|X_i|/(Cy) < 1}. Taking expectations with C = c·max_{1≤i≤m}‖X_i‖_ψ and y = ψ^{−1}(2m), we get
E[ψ(max_{1≤i≤m}|X_i|/(Cy))] = E[max_{1≤i≤m} ψ(|X_i|/(Cy))]
≤ ∑_{i=1}^m Eψ(c|X_i|/C)/ψ(y) + 1/2
= ∑_{i=1}^m Eψ(|X_i|/max_j‖X_j‖_ψ)/(2m) + 1/2
≤ ∑_{i=1}^m Eψ(|X_i|/‖X_i‖_ψ)/(2m) + 1/2  (each expectation ≤ 1 by (2.2))
≤ 1/2 + 1/2 = 1,
and therefore
‖max_{1≤i≤m}|X_i|‖_ψ ≤ Cy = cψ^{−1}(2m) max_{1≤i≤m}‖X_i‖_ψ ≤ 2cψ^{−1}(m) max_{1≤i≤m}‖X_i‖_ψ
holds from ψ^{−1}(2m) ≤ 2ψ^{−1}(m), which comes from
m = (ψ(0) + ψ(ψ^{−1}(2m)))/2 ≥ ψ((0 + ψ^{−1}(2m))/2)  (convexity of ψ)
and the monotonicity of ψ^{−1}.
Now we treat general ψ. Define φ(x) = σψ(τx). If τ > 0 is large enough, ∃K > 0 s.t.
φ(x)φ(y) = σ²ψ(τx)ψ(τy) ≤ Kσ²ψ(cτ²xy) = Kσφ(cτxy)  ∀x, y ≥ 0  (∵ (2.3)),
so if σ < 1 is small enough, we get φ(x)φ(y) ≤ φ(cτxy) and φ(1) = σψ(τ) ≤ 1/2, i.e., φ satisfies the conditions of the first part. Also note that
‖X‖_ψ = inf{C > 0 : Eψ(|X|/C) ≤ 1} = inf{C > 0 : (1/σ)Eφ(|X|/(τC)) ≤ 1}.
Putting C = ‖X‖_φ/(στ) gives
(1/σ)Eφ(|X|/(τC)) = (1/σ)Eφ(σ|X|/‖X‖_φ) ≤_(∗) Eφ(|X|/‖X‖_φ) ≤ 1  (by (2.2)),
while (∗) holds from φ(σx + (1 − σ)·0) ≤ σφ(x) + (1 − σ)φ(0) = σφ(x). Hence we get
‖X‖_ψ ≤ ‖X‖_φ/(στ).
On the other hand,
‖X‖_φ = inf{C > 0 : σEψ(τ|X|/C) ≤ 1},
and putting C = τ‖X‖_ψ we get
σEψ(τ|X|/C) = σEψ(|X|/‖X‖_ψ) ≤ 1  (by (2.2)),
which implies
‖X‖_φ ≤ τ‖X‖_ψ.
Therefore we have
‖max_{1≤i≤m}|X_i|‖_ψ ≤ (1/(στ))‖max_{1≤i≤m}|X_i|‖_φ ≤ (K₁/(στ))φ^{−1}(m) max_{1≤i≤m}‖X_i‖_φ ≤_(∗) (K₁/(στ)²)ψ^{−1}(m) max_{1≤i≤m} τ‖X_i‖_ψ = K′ψ^{−1}(m) max_{1≤i≤m}‖X_i‖_ψ.
In (∗), it was used that from φ^{−1}(x) = τ^{−1}ψ^{−1}(σ^{−1}x) and
ψ(σψ^{−1}(x/σ)) ≤ σψ(ψ^{−1}(x/σ)) = x,
we have
φ^{−1}(x) = (1/τ)ψ^{−1}(x/σ) ≤ (1/(στ))ψ^{−1}(x).
Remark 2.3.12. Using the previous theorem, we can obtain bounds for the supremum of a stochastic process. A common technique for handling a supremum is to partition the underlying space into a finite net and control the variation on small balls, for example,
sup_{t∈T}|X_t| ≤ max_{1≤i≤m}|X_{t_i}| + sup_{d(t,t_i)<δ}|X_{t_i} − X_t|.
Partitioning the space into δ-balls is deeply related to the covering number, and it will affect the bound: as δ becomes small, the variation on each δ-ball becomes smaller, while controlling the maximum over the finite net becomes more challenging.
Definition 2.3.13. Let (T, d) be an arbitrary semi-metric space. Then
• the covering number N(ε) is the minimum number of balls of radius ε needed to cover T ;
• a collection of points is ε-separated if the distance between each pair of points is strictly larger
than ε;
• the packing number D(ε) is the maximum number of ε-separated points in T .
We can naturally guess that the packing number D(ε) has a similar value to the covering number N(ε).
Proposition 2.3.14. N(ε) ≤ D(ε) ≤ N(ε/2).
Proof. First, for D = D(ε), let t₁, t₂, ···, t_D be a maximal set of ε-separated points. Since the set {t₁, ···, t_D} is "maximal," adding any other point of T makes the set no longer ε-separated. That is,
∀t ∈ T ∃t_i s.t. d(t, t_i) ≤ ε.
It means that
T ⊆ ⋃_{j=1}^D B_ε(t_j),
i.e., N(ε) ≤ D(ε).
Next, let D = D(ε) and N = N(ε/2), and assume that D > N. Then ∃t₁, ···, t_D which are ε-separated, and ∃s₁, ···, s_N whose ε/2-balls cover T, i.e., T ⊆ ⋃_{j=1}^N B_{ε/2}(s_j). Because D > N, by the pigeonhole principle two points t_i and t_{i′} belong to the same ball B_{ε/2}(s_j); but then d(t_i, t_{i′}) ≤ d(t_i, s_j) + d(s_j, t_{i′}) ≤ ε, which contradicts that t_i and t_{i′} are ε-separated (their distance must be strictly larger than ε). Therefore D ≤ N.
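In one dimension both quantities can be computed exactly by greedy sweeps, which gives a concrete check of the proposition (the point set and ε below are arbitrary choices):

```python
T = [i / 20 for i in range(21)]   # grid 0, 0.05, ..., 1

def covering_number(pts, eps):
    # minimal number of intervals of radius eps covering the points;
    # centers may lie anywhere (a greedy sweep is optimal on the line)
    pts = sorted(pts)
    count, i = 0, 0
    while i < len(pts):
        count += 1
        right = pts[i] + 2 * eps   # ball centered at pts[i] + eps
        while i < len(pts) and pts[i] <= right:
            i += 1
    return count

def packing_number(pts, eps):
    # maximal number of points with pairwise distance strictly larger than eps
    pts = sorted(pts)
    chosen = [pts[0]]
    for p in pts[1:]:
        if p - chosen[-1] > eps:
            chosen.append(p)
    return len(chosen)

eps = 0.12
N = covering_number(T, eps)
D = packing_number(T, eps)
N_half = covering_number(T, eps / 2)
ok = N <= D <= N_half
```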
Now we are ready for our main result, the maximal inequality.
Definition 2.3.15. A stochastic process (X_t)_{t∈T} is separable if for any countable dense subset T₀ ⊆ T and δ > 0,
sup_{d(s,t)<δ, s,t∈T}|X_s − X_t| = sup_{d(s,t)<δ, s,t∈T₀}|X_s − X_t|  a.s.
Lemma 2.3.16. If 0 ≤ X_n ↑ X, then ‖X_n‖_ψ ↑ ‖X‖_ψ.
Proof. First, it is obvious that
0 ≤ X ≤ Y =⇒ ‖X‖_ψ ≤ ‖Y‖_ψ.
Now, for any C < ‖X‖_ψ, by definition,
lim_{n→∞} Eψ(X_n/C) = Eψ(X/C) > 1  (MCT).
It implies that Eψ(X_n/C) > 1, i.e., ‖X_n‖_ψ ≥ C, for large n, and hence
liminf_{n→∞}‖X_n‖_ψ ≥ C.
Since C < ‖X‖_ψ was arbitrary, we get
liminf_{n→∞}‖X_n‖_ψ ≥ ‖X‖_ψ.
Meanwhile, X_n ≤ X implies ‖X_n‖_ψ ≤ ‖X‖_ψ, which gives
lim_{n→∞}‖X_n‖_ψ = ‖X‖_ψ.
Theorem 2.3.17 (Maximal inequality). Let ψ be a convex, strictly increasing function satisfying ψ(0) = 0 and (2.3). Also assume that the stochastic process (X_t)_{t∈T} is separable and satisfies
‖X_s − X_t‖_ψ ≤ C·d(s, t)  ∀s, t ∈ T.  (2.4)
Then for any η, δ > 0,
‖sup_{d(s,t)≤δ}|X_s − X_t|‖_ψ ≤ K[ ∫_0^η ψ^{−1}(D(ε))dε + δψ^{−1}(D²(η)) ]
holds, where K is a constant depending only on C and ψ.
Proof. Construct T₀ ⊆ T₁ ⊆ ··· ⊆ T recursively so that T_j is a maximal η·2^{−j}-separated set containing T_{j−1}. Then by the definition of the packing number,
card(T_j) ≤ D(η·2^{−j}).
Note that (by maximality) ∀t_{j+1} ∈ T_{j+1} ∃t_j ∈ T_j s.t. d(t_j, t_{j+1}) ≤ η·2^{−j}. Link every t_{j+1} ∈ T_{j+1} to a "unique" t_j ∈ T_j s.t. d(t_j, t_{j+1}) ≤ η·2^{−j} (fix any mapping satisfying this; "how" it is chosen is not our concern). Call the resulting sequence t_{k+1}, t_k, ···, t₀ a "chain." Note that ⋃_{k=1}^∞ T_k is a countable dense subset of T (by construction). Since (X_t)_{t∈T} is separable,
‖sup_{d(s,t)≤δ}|X_s − X_t|‖_ψ = ‖sup_{d(s,t)≤δ, s,t∈⋃_k T_k}|X_s − X_t|‖_ψ = lim_{k→∞}‖sup_{d(s,t)≤δ, s,t∈T_{k+1}}|X_s − X_t|‖_ψ  (MCT and lemma 2.3.16).
Now let s_{k+1} − s_k − ··· − s₀ and t_{k+1} − t_k − ··· − t₀ be chains. Then
|X_{s_{k+1}} − X_{t_{k+1}}| ≤ |(X_{s_{k+1}} − X_{s₀}) − (X_{t_{k+1}} − X_{t₀})| (=: (∗∗)) + |X_{s₀} − X_{t₀}|
holds. Now we get
(∗∗) = |∑_{j=0}^k [(X_{s_{j+1}} − X_{s_j}) − (X_{t_{j+1}} − X_{t_j})]| ≤ 2∑_{j=0}^k max_{(u,v)∈L_j}|X_u − X_v|,
where L_j is the set of all links from T_{j+1} to T_j. Then we get
card(L_j) ≤ D(η·2^{−j−1}),
and hence by theorem 2.3.10,
‖sup_{s_{k+1},t_{k+1}∈T_{k+1}}(∗∗)‖_ψ ≤ 2∑_{j=0}^k ‖max_{(u,v)∈L_j}|X_u − X_v|‖_ψ
≤ 2K′∑_{j=0}^k ψ^{−1}(card(L_j)) max_{(u,v)∈L_j}‖X_u − X_v‖_ψ
≤ K∑_{j=0}^k ψ^{−1}(D(η·2^{−j−1}))·η·2^{−j−2}·4  (using ψ^{−1}(card(L_j)) ≤ ψ^{−1}(D(η·2^{−j−1})) and ‖X_u − X_v‖_ψ ≤ C·d(u,v) ≤ Cη·2^{−j})
≤ 4K ∫_0^{η/2} ψ^{−1}(D(ε))dε ≤ 4K ∫_0^η ψ^{−1}(D(ε))dε.
(Figure 2.2 illustrates the Riemann-sum comparison ∑_{j=0}^k ψ^{−1}(D(η·2^{−j−1}))·η·2^{−j−2} ≤ ∫_0^{η/2} ψ^{−1}(D(ε))dε.)
Now, to control |X_{s₀} − X_{t₀}|, conversely for each pair of end points (s₀, t₀) choose a "unique" pair s_{k+1}, t_{k+1} ∈ T_{k+1} (different from the pairs in the previous paragraph; there is some abuse of notation). Then
|X_{s₀} − X_{t₀}| ≤ (∗∗) + |X_{s_{k+1}} − X_{t_{k+1}}|
again, and hence
‖max_{(s₀,t₀)∈T₀}|X_{s₀} − X_{t₀}|‖_ψ ≤ ‖max_{s_{k+1},t_{k+1}∈T_{k+1}}(∗∗)‖_ψ + ‖max|X_{s_{k+1}} − X_{t_{k+1}}|‖_ψ
≤ 4K ∫_0^η ψ^{−1}(D(ε))dε + ‖max|X_{s_{k+1}} − X_{t_{k+1}}|‖_ψ.
Note that the number of possible pairs (s₀, t₀) (and consequently (s_{k+1}, t_{k+1})) is at most card(T₀)² ≤ (D(η))², and thus by theorem 2.3.10 again,
‖max|X_{s_{k+1}} − X_{t_{k+1}}|‖_ψ ≤ K′ψ^{−1}(D²(η)) max‖X_{s_{k+1}} − X_{t_{k+1}}‖_ψ.
Since ‖X_{s_{k+1}} − X_{t_{k+1}}‖_ψ ≤ C·d(s_{k+1}, t_{k+1}), we get
‖max_{d(s,t)≤δ, s,t∈T_{k+1}}|X_s − X_t|‖_ψ ≤ 8K ∫_0^η ψ^{−1}(D(ε))dε + K′ψ^{−1}(D²(η))·Cδ = 8K ∫_0^η ψ^{−1}(D(ε))dε + Kδψ^{−1}(D²(η)).
Remark 2.3.18. Why did we decompose |X_{s_{k+1}} − X_{t_{k+1}}| into (∗∗) and |X_{s₀} − X_{t₀}|, and then decompose |X_{s₀} − X_{t₀}| again? If we bounded the maximum of |X_{s_{k+1}} − X_{t_{k+1}}| directly by a similar argument, we would obtain a bound containing the terms ψ^{−1}(D²(η·2^{−j−1})), which might not be so useful.
How can such a maximal inequality be used? The following is one example, which gives a bound for sub-Gaussian stochastic processes. Before we start, we should define sub-Gaussianity of a stochastic process.
Definition 2.3.19. A stochastic process (X_t)_{t∈T} is sub-Gaussian with respect to a semi-metric d if
P(|X_s − X_t| > x) ≤ 2 exp(−x²/(2d²(s, t)))  ∀x > 0, ∀s, t ∈ T.
Example 2.3.20. Any zero-mean Gaussian process is sub-Gaussian with respect to the L₂-distance
d(s, t) = σ(X_s − X_t) = √(E(X_s − X_t)²).
Example 2.3.21. Let ε₁, ε₂, ···, ε_n be Rademacher r.v.'s and
X_a = ∑_{i=1}^n a_iε_i,  a ∈ Rⁿ.
Then by Hoeffding's inequality,
P(|∑_{i=1}^n a_iε_i| ≥ x) ≤ 2 exp(−x²/(2‖a‖₂²)).
Since X_a − X_b = X_{a−b}, it implies that (X_a)_{a∈Rⁿ} is a sub-Gaussian stochastic process with respect to the Euclidean distance d(a, b) = ‖a − b‖₂.
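For a small n, the sub-Gaussian tail of X_a can be checked exactly by enumerating all 2ⁿ sign vectors (the vector a below is an arbitrary choice):

```python
import itertools
import math

a = [0.3, -1.0, 0.5, 0.8, -0.2, 0.6, 1.1, -0.4]
norm_sq = sum(v * v for v in a)
n = len(a)

def tail(x):
    # exact P(|sum a_i eps_i| >= x) over the 2^n equally likely sign vectors
    hits = sum(
        1 for signs in itertools.product((-1, 1), repeat=n)
        if abs(sum(s * v for s, v in zip(signs, a))) >= x
    )
    return hits / 2 ** n

ok = all(
    tail(x) <= 2 * math.exp(-x * x / (2 * norm_sq)) + 1e-12
    for x in (0.5, 1.0, 2.0, 3.0)
)
```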
To apply the maximal inequality, we should verify condition (2.4).
Proposition 2.3.22. For a sub-Gaussian stochastic process (X_t)_{t∈T} and ψ₂(x) = e^{x²} − 1,
‖X_s − X_t‖_{ψ₂} ≤ √6 d(s, t).
Proof. It suffices to show that
Eψ₂(|X_s − X_t|/(√6 d(s, t))) ≤ 1.
It comes from
Eψ₂(|X_s − X_t|/(√6 d(s, t))) = E(exp(|X_s − X_t|²/(6d²(s, t))) − 1)
= ∫_0^∞ P(exp(|X_s − X_t|²/(6d²(s, t))) − 1 > x)dx
= ∫_0^∞ P(|X_s − X_t| > √6 d(s, t)√(log(1 + x)))dx
≤ ∫_0^∞ 2 exp(−(1/2)·6d²(s, t)log(1 + x)/d²(s, t))dx
= ∫_0^∞ 2e^{−3 log(1+x)}dx = ∫_0^∞ 2(1 + x)^{−3}dx = 1.
Now we get the desired result.
Corollary 2.3.23. Let (X_t)_{t∈T} be a separable sub-Gaussian stochastic process. Then
E sup_{d(s,t)≤δ}|X_s − X_t| ≲ ∫_0^δ √(log D(ε))dε  ∀δ > 0.
Remark 2.3.24. From now on, A ≲ B denotes that
A ≤ c·B for some universal constant c > 0.
Proof. Apply theorem 2.3.17 with ψ = ψ₂ and η = δ. Since the constant K in theorem 2.3.17 depends only on ψ and C, which are both fixed here, K is universal. Therefore we get
E sup_{d(s,t)≤δ}|X_s − X_t| = ‖sup_{d(s,t)≤δ}|X_s − X_t|‖₁
≤ ‖sup_{d(s,t)≤δ}|X_s − X_t|‖₂ ≤ ‖sup_{d(s,t)≤δ}|X_s − X_t|‖_{ψ₂}
≲ ∫_0^δ ψ₂^{−1}(D(ε))dε + δψ₂^{−1}(D²(δ))
≲ ∫_0^δ ψ₂^{−1}(D(ε))dε + δψ₂^{−1}(D(δ))
(∵ ψ₂^{−1}(x) = √(log(1 + x)), and hence ψ₂^{−1}(x²) ≤ √2 ψ₂^{−1}(x) for x ≥ 0)
≲ ∫_0^δ ψ₂^{−1}(D(ε))dε
(∵ ψ₂^{−1} is increasing while D is decreasing, and hence δψ₂^{−1}(D(δ)) ≤ ∫_0^δ ψ₂^{−1}(D(ε))dε)
= ∫_0^δ √(log(1 + D(ε)))dε ≲ ∫_0^δ √(log D(ε))dε
(∵ log(1 + x) ≤ 2 log x for x ≥ 2).
Remark 2.3.25. Note that log D(ε) is an "entropy." Thus, whether the bound (the integral) is finite or not depends on how fast the entropy grows as ε goes to 0.
2.4 Symmetrization
In empirical process theory, our final goals are the Glivenko-Cantelli and Donsker theorems. They can be obtained by measuring the size of the class F via covering numbers or bracketing numbers. The former approach requires the symmetrization technique, while the latter requires Bernstein's inequality, as follows.
Lemma 2.4.1. Let X₁, ···, X_m be arbitrary r.v.'s with
P(|X_i| > x) ≤ 2e^{−x²/(2(b + ax))}  (x > 0)
for some a, b > 0. Then
‖max_{1≤i≤m}X_i‖_{ψ₁} ≲ a log(1 + m) + √(b log(1 + m)).
Remark 2.4.2. The bound can also be represented as aψ₁^{−1}(m) + √b ψ₂^{−1}(m).
Proof. First note that
‖X‖_{ψ_p} ≤ (log 2)^{1/q − 1/p}‖X‖_{ψ_q}
holds for 1 ≤ p ≤ q. (∵ Define φ by
ψ_p(x(log 2)^{1/p}) = φ(ψ_q(x(log 2)^{1/q})),
i.e., φ = ψ̄_p ∘ ψ̄_q^{−1} with ψ̄_r(x) := 2^{x^r} − 1 = ψ_r(x(log 2)^{1/r}); φ is a concave function with φ(1) = 1, and hence by Jensen,
1 = φ(1) ≥ φ(Eψ_q((log 2)^{1/q}·(log 2)^{−1/q}|X|/‖X‖_{ψ_q})) ≥ Eφ(ψ_q((log 2)^{1/q}·(log 2)^{−1/q}|X|/‖X‖_{ψ_q})) = Eψ_p((log 2)^{1/p − 1/q}|X|/‖X‖_{ψ_q}),
which gives (log 2)^{1/q − 1/p}‖X‖_{ψ_q} ≥ ‖X‖_{ψ_p}.) Now
P(|X_i| > x) ≤ 2e^{−x²/(2(b + ax))} ≤ 2e^{−x²/(4b)} for 0 ≤ x ≤ b/a,  and ≤ 2e^{−x/(4a)} for x > b/a
holds. Now recall that
P(|X| > x) ≤ Ke^{−Cx^p}, p ≥ 1 ⟹ ‖X‖_{ψ_p} ≤ ((K + 1)/C)^{1/p}  (proposition 2.3.8).
Thus for the decomposition
|X_i| = |X_i|I(|X_i| ≤ b/a) (=: (∗)) + |X_i|I(|X_i| > b/a) (=: (∗∗)),
we get
P((∗) > x) ≤ 2e^{−x²/(4b)} and hence ‖(∗)‖_{ψ₂} ≲ √b;
P((∗∗) > x) ≤ 2e^{−x/(4a)} and hence ‖(∗∗)‖_{ψ₁} ≲ a.
Therefore we have
‖max_{1≤i≤m}X_i‖_{ψ₁} ≤ ‖max_{1≤i≤m}(∗)‖_{ψ₁} + ‖max_{1≤i≤m}(∗∗)‖_{ψ₁}
≲ ‖max_{1≤i≤m}(∗)‖_{ψ₂} + ‖max_{1≤i≤m}(∗∗)‖_{ψ₁}
≲ ψ₂^{−1}(m) max_{1≤i≤m}‖(∗)‖_{ψ₂} + ψ₁^{−1}(m) max_{1≤i≤m}‖(∗∗)‖_{ψ₁}  (theorem 2.3.10)
≲ ψ₂^{−1}(m)√b + ψ₁^{−1}(m)a.
Now we look at a very useful technique, called symmetrization. Recall that in empirical process theory we consider the following setting:
X₁, X₂, ···, X_n i.i.d. ∼ P,
P_nf = (1/n)∑_{i=1}^n f(X_i),
G_nf = (1/√n)∑_{i=1}^n (f(X_i) − Pf).
The symmetrization technique is based on the fact that, for Rademacher random variables ε₁, ···, ε_n independent of the data, f ↦ (P_n − P)f behaves similarly to
f ↦ P_n⁰f := (1/n)∑_{i=1}^n ε_if(X_i).
Theorem 2.4.3 (Symmetrization). Let φ be a convex non-decreasing function and F a class of measurable functions. Then
E∗φ(‖P_n − P‖_F) ≤ E∗φ(2‖P_n⁰‖_F).
Proof. We prove it only under the measurability condition; recall that under measurability we can use the Fubini theorem. Let Y₁, Y₂, ···, Y_n be independent copies of X₁, X₂, ···, X_n. Then
‖P_n − P‖_F = sup_{f∈F} (1/n)|∑_{i=1}^n (f(X_i) − Ef(X_i))| = sup_{f∈F} (1/n)|∑_{i=1}^n (f(X_i) − E_Y f(Y_i))| ≤ E_Y sup_{f∈F} (1/n)|∑_{i=1}^n (f(X_i) − f(Y_i))|,
and hence by the non-decreasingness of φ,
Eφ(‖P_n − P‖_F) ≤ E_X φ(E_Y sup_{f∈F}|(1/n)∑(f(X_i) − f(Y_i))|) ≤ E_X E_Y φ(sup_{f∈F}|(1/n)∑(f(X_i) − f(Y_i))|)  (Jensen)
holds. Now note that
f(X_i) − f(Y_i) =_d f(Y_i) − f(X_i)
by symmetry, and hence
(f(X_i) − f(Y_i))_{i≤n} =_d (e_i(f(X_i) − f(Y_i)))_{i≤n}
for any (e₁, ···, e_n) ∈ {−1, 1}ⁿ ("symmetrization!"). Consequently, we have
sup_{f∈F}|(1/n)∑(f(X_i) − f(Y_i))| =_d sup_{f∈F}|(1/n)∑e_i(f(X_i) − f(Y_i))|
for any (e₁, ···, e_n) ∈ {−1, 1}ⁿ. Therefore we get
Eφ(‖P_n − P‖_F) ≤ E_ε E_{X,Y|ε} φ(sup_{f∈F}|(1/n)∑ε_i(f(X_i) − f(Y_i))|)
≤ E_ε E_{X,Y} φ((1/2)[sup_{f∈F}(2/n)|∑ε_if(X_i)| + sup_{f∈F}(2/n)|∑ε_if(Y_i)|])
≤ (1/2)E_ε[E_{X,Y} φ(sup_{f∈F}(2/n)|∑ε_if(X_i)|) + E_{X,Y} φ(sup_{f∈F}(2/n)|∑ε_if(Y_i)|)]  (convexity of φ)
= (1/2)E_ε·2E_X φ(sup_{f∈F}(2/n)|∑ε_if(X_i)|)
= Eφ(sup_{f∈F}(2/n)|∑ε_if(X_i)|) = Eφ(2‖P_n⁰‖_F).
Example 2.4.4. Consider φ(x) = x^m, m ≥ 1. Then
E∗‖P_n − P‖_F^m ≤ 2^m E∗‖P_n⁰‖_F^m
by symmetrization. If ‖P_n⁰‖_F is measurable, then
E∗‖P_n⁰‖_F = E‖P_n⁰‖_F = E_X E_{ε|X} sup_{f∈F}|(1/n)∑_{i=1}^n ε_if(X_i)|
holds. The term
E_{ε|X} sup_{f∈F}|(1/n)∑ε_if(X_i)|
can be viewed as the expected supremum of the stochastic process n^{−1}∑a_iε_i over constants a_i = f(X_i), and hence its bound can be obtained via, for instance, Hoeffding's inequality. Note that such an argument requires measurability! Thus restricting attention to classes of functions which make the target process measurable is a natural procedure.
Definition 2.4.5. A class F of measurable functions f : X → R on (X, A, P) is called a "P-measurable class" if
(X₁, ···, X_n) ↦ ‖∑_{i=1}^n e_if(X_i)‖_F
is measurable on the completion of (Xⁿ, Aⁿ, Pⁿ) for every n and every (e₁, ···, e_n) ∈ {−1, 1}ⁿ.
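For a P-measurable class such as the indicators below, the symmetrization inequality (with φ(x) = x) can be observed in simulation; a sketch with P = Uniform[0,1] and F = {1_{(−∞,c]} : c in a grid} (all parameters are arbitrary choices):

```python
import random

random.seed(4)
n, reps = 100, 400
grid = [i / 20 for i in range(1, 20)]

emp, sym = 0.0, 0.0
for _ in range(reps):
    xs = [random.random() for _ in range(n)]
    signs = [random.choice((-1, 1)) for _ in range(n)]
    # ||P_n - P||_F : here Pf = P(X <= c) = c
    emp += max(abs(sum(x <= c for x in xs) / n - c) for c in grid)
    # ||P_n^0||_F with Rademacher weights
    sym += max(abs(sum(s * (x <= c) for s, x in zip(signs, xs)) / n) for c in grid)
emp /= reps
sym /= reps
```

The theorem promises E‖P_n − P‖_F ≤ 2E‖P_n⁰‖_F, and the two Monte Carlo averages indeed satisfy this with room to spare.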
Chapter 3
Applications for Empirical Process
3.1 Glivenko-Cantelli Theorems
Now we are ready for our first goal in empirical process theory: a uniform LLN. First we use a bracketing argument; it does not require measurability.
Theorem 3.1.1 (Bracketing Glivenko-Cantelli). If N_[](ε, F, L1(P)) < ∞ ∀ε > 0, then F is Glivenko-Cantelli, i.e.,
‖P_n − P‖_F → 0 in outer probability as n → ∞.
Proof. First note that an ε-bracket w.r.t. the L1(P) norm is a bracket [l, u] with l ≤ f ≤ u and
‖u − l‖ = ∫|u − l|dP = P|u − l| < ε.
For given ε > 0, choose finitely many ε-brackets [l_i, u_i], 1 ≤ i ≤ N, covering F. For each f ∈ F, ∃i s.t. l_i ≤ f ≤ u_i, and then
(P_n − P)f = P_nf − Pf ≤ P_nu_i − Pf = (P_n − P)u_i + P(u_i − f) < (P_n − P)u_i + ε.
If f is fixed, then i is also fixed, and hence by the SLLN, (P_n − P)u_i → 0 almost surely. Since there are only finitely many i, we have
max_{1≤i≤N}(P_n − P)u_i → 0 almost surely,
and therefore
sup_{f∈F}(P_n − P)f < max_{1≤i≤N}(P_n − P)u_i + ε,
where the maximum tends to 0 a.s. Similarly we get
inf_{f∈F}(P_n − P)f > −ε + min_{1≤i≤N}(P_n − P)l_i,
where the minimum tends to 0 a.s., and combining both we obtain
limsup_{n→∞}‖P_n − P‖∗_F ≤ ε almost surely.
Since ε > 0 was arbitrary, we get
limsup_{n→∞}‖P_n − P‖∗_F = 0 a.s., i.e., ‖P_n − P‖_F → 0 P∗-a.s.
Example 3.1.2. Let P be a probability measure on R and
F = {1_{(−∞,c]} : c ∈ R}.
For given ε > 0, choose
−∞ = t₀ < t₁ < ··· < t_m = ∞ with P((t_i, t_{i+1})) < ε ∀i.
Then the brackets
[1_{(−∞,t_i]}, 1_{(−∞,t_{i+1})}], 0 ≤ i < m,
are ε-brackets covering F, and hence we recover the classical Glivenko-Cantelli theorem:
sup_t |F_n(t) − F(t)| → 0 almost surely as n → ∞.
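The classical statement is easy to watch in simulation; for P = Uniform[0,1] the supremum over t is attained at the jump points of F_n (sample sizes and seed are arbitrary choices):

```python
import random

random.seed(5)

def ks_uniform(n):
    # sup_t |F_n(t) - t| for a Uniform[0,1] sample, computed at the jumps
    xs = sorted(random.random() for _ in range(n))
    return max(
        max(abs((i + 1) / n - x), abs(i / n - x)) for i, x in enumerate(xs)
    )

d_small, d_big = ks_uniform(100), ks_uniform(100000)
```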
The next argument, giving another type of Glivenko-Cantelli theorem, uses the symmetrization technique. As mentioned in example 2.4.4, we need a measurability condition here.
Theorem 3.1.3 (Covering Glivenko-Cantelli). Let F be P-measurable and let F be an envelope of F with P∗F < ∞. Furthermore assume that
log N(ε, F_M, L1(P_n)) = o_{P∗}(n)  ∀M, ε > 0,
where
F_M = {f·1{F ≤ M} : f ∈ F}.
Then
E∗‖P_n − P‖_F = o(1) (in particular, ‖P_n − P‖_F → 0 in outer probability).
Proof. Denote ‖g(f)‖_F = sup_{f∈F}|g(f)|. Then by symmetrization (measurability!),
E∗‖P_n − P‖_F ≤ 2E_X E_ε ‖(1/n)∑_{i=1}^n ε_if(X_i)‖_F
= 2E_X E_ε ‖(1/n)∑ε_if(X_i)I(F(X_i) ≤ M) + (1/n)∑ε_if(X_i)I(F(X_i) > M)‖_F
≤ 2E_X E_ε ‖(1/n)∑ε_if(X_i)‖_{F_M} + 2E_X E_ε ‖(1/n)∑ε_if(X_i)I(F(X_i) > M)‖_F
holds; call the supremum inside the last term (∗). Note that
(∗) = sup_{f∈F}|(1/n)∑ε_if(X_i)I(F(X_i) > M)| ≤ (1/n)∑|f(X_i)|I(F(X_i) > M) ≤ (1/n)∑F(X_i)I(F(X_i) > M),
and hence we get
E∗‖P_n − P‖_F ≤ 2E_X E_ε ‖(1/n)∑ε_if(X_i)‖_{F_M} + 2P∗F I(F > M),
where the second term tends to 0 as M → ∞ because P∗F < ∞.
Now, for given X₁, ···, X_n and ε > 0, let G be an ε-covering of F_M s.t. card(G) = N(ε, F_M, L1(P_n)). Note that
∀f ∈ F_M ∃g ∈ G s.t. P_n|g − f| < ε,
and hence
|(1/n)∑_{i=1}^n ε_if(X_i)| ≤ |(1/n)∑ε_ig(X_i)| + |(1/n)∑ε_i(g(X_i) − f(X_i))|,
where the second term is at most (1/n)∑|g(X_i) − f(X_i)| = P_n|g − f| < ε. It gives
E_ε‖(1/n)∑ε_if(X_i)‖_{F_M} ≤ E_ε‖(1/n)∑ε_if(X_i)‖_G + ε
= E_ε max_{f∈G}|(1/n)∑ε_if(X_i)| + ε
≲ ‖max_{f∈G}|(1/n)∑ε_if(X_i)|‖_{ψ₂|X} + ε  (∵ ‖·‖_{1|X} ≤ ‖·‖_{ψ₁|X} ≲ ‖·‖_{ψ₂|X})
≲ √(1 + log card(G)) max_{f∈G}‖(1/n)∑ε_if(X_i)‖_{ψ₂|X} + ε  (theorem 2.3.10)
≲ √(log N(ε, F_M, L1(P_n))) max_{f∈G}(1/n)(∑f(X_i)²)^{1/2} + ε
= √(log N(ε, F_M, L1(P_n))) max_{f∈G}(1/√n)(P_nf²)^{1/2} + ε
≤ √(log N(ε, F_M, L1(P_n)))·M/√n + ε
= o_{P∗}(1) + ε
by the assumption log N(ε, F_M, L1(P_n)) = o_{P∗}(n). In the "≲" step, the following bound on the conditional ψ₂-norm was used: for constants a_i, Hoeffding's inequality (theorem 2.2.9) gives
P(|(1/n)∑_{i=1}^n a_iε_i| > x) ≤ 2 exp(−n²x²/(2∑a_i²)),
so that proposition 2.3.8 (with K = 2 and C = n²/(2∑a_i²)) yields
‖(1/n)∑_{i=1}^n a_iε_i‖_{ψ₂} ≤ (√6/n)(∑a_i²)^{1/2};
with a_i = f(X_i) this is √6·(P_nf²)^{1/2}/√n. (See also the following remark.) Since ε > 0 was arbitrary, we get
E_ε‖(1/n)∑ε_if(X_i)‖_{F_M} = o_{P∗}(1).
Note that
E_ε‖(1/n)∑ε_if(X_i)‖_{F_M} ≤ M;
and therefore by the BCT, we get
E_X E_ε‖(1/n)∑ε_if(X_i)‖_{F_M} = o(1) as n → ∞.
Remark 3.1.4. The ψ₂-bound in the "≲" part is not specific to Rademacher variables; it holds for any sub-Gaussian summands via Hoeffding's inequality. Since the a_iε_i are sub-Gaussian, Hoeffding's inequality gives K and C s.t.
P(|(1/n)∑a_iε_i| > x) ≤ Ke^{−Cx²}.
Now proposition 2.3.8 gives
‖(1/n)∑a_iε_i‖_{ψ₂} ≤ ((K + 1)/C)^{1/2},
where precisely K = 2 and C^{−1} = 2∑a_i²/n².
Remark 3.1.5. To make the step max_{f∈G}(P_nf²)^{1/2} ≤ M in the proof of the previous theorem rigorous, one should construct G to satisfy |g| ≤ M for every g ∈ G. This can be assumed without loss of generality: if not, truncate each g as ḡ = (g ∧ M) ∨ (−M), so that the truncated functions satisfy |ḡ| ≤ M. One only has to check that they still form an ε-covering of F_M. Let f ∈ F_M and g ∈ G s.t. P_n|g − f| < ε. Then for ḡ = (g ∧ M) ∨ (−M),
P_n|ḡ − f| = (1/n)[∑_{i:−M≤g(X_i)≤M}|g(X_i) − f(X_i)| + ∑_{i:g(X_i)>M}(M − f(X_i)) + ∑_{i:g(X_i)<−M}(f(X_i) + M)]
≤ (1/n)[∑_{i:−M≤g(X_i)≤M}|g(X_i) − f(X_i)| + ∑_{i:g(X_i)>M}(g(X_i) − f(X_i)) + ∑_{i:g(X_i)<−M}(f(X_i) − g(X_i))]
= (1/n)∑_{i=1}^n |g(X_i) − f(X_i)| = P_n|g − f| < ε
holds.
3.2 Donsker Theorems
Here we consider two versions of Donsker's theorem. From now on, ‖·‖_{Q,2} denotes
‖f‖_{Q,2} = (∫f²dQ)^{1/2}
for a probability measure Q.
Theorem 3.2.1 (Covering Donsker). Let F_δ := {f − g : f, g ∈ F, ‖f − g‖_{P,2} < δ} be P-measurable for every δ ∈ (0, ∞] and let F be an envelope of F with P∗F² < ∞. If
∫_0^∞ sup_Q √(log N(ε‖F‖_{Q,2}, F, L2(Q)))dε < ∞,  (3.1)
where the supremum is taken over all finitely discrete probability measures Q, then F is P-Donsker.
Proof. It suffices to prove that G_n is asymptotically tight, where G_n = {G_nf : f ∈ F} is regarded as a stochastic process (with index set F). By theorem 1.4.8 (note that each G_nf converges weakly by the classical CLT, which implies asymptotic tightness of each marginal) it is enough to show that:
(i) F is totally bounded in the L2(P) norm;
(ii) G_n is asymptotically uniformly L2(P)-equicontinuous in probability.
For this, we need the following lemma.
Lemma 3.2.2. Let a_n : [0, 1] → [0, ∞) be a sequence of non-decreasing functions. Then
lim_{δ→0} limsup_{n→∞} a_n(δ) = 0 ⇐⇒ a_n(δ_n) = o(1) ∀δ_n ↓ 0.
Proof of lemma. (⟹) Let δₙ be a nonincreasing sequence converging to 0. For every ε > 0 there exists δ₀ > 0 s.t.

limsup_{n→∞} aₙ(δ₀) < ε/2,

and hence ∃N s.t. n ≥ N ⟹ aₙ(δ₀) < ε. Since δₙ ↓ 0, ∃N′ s.t. n ≥ N′ ⟹ δₙ < δ₀. Thus, since each aₙ is nondecreasing,

n ≥ N ∨ N′ ⟹ δₙ < δ₀ ⟹ aₙ(δₙ) ≤ aₙ(δ₀) < ε.
(⟸) It is sufficient to show that

∃δₙ ↓ 0 s.t. limsup_{n→∞} aₙ(δₙ) = lim_{δ→0} limsup_{n→∞} aₙ(δ).

Let C = lim_{δ→0} limsup_{n→∞} aₙ(δ). Then for any δ > 0 we get

limsup_{n→∞} aₙ(δ) ≥ C,

because limsup_{n→∞} aₙ(δ) decreases as δ ↓ 0. It gives that for any δ > 0 and any ε > 0,

aₙ(δ) > C − ε  i.o.
Thus, for every fixed m,

aₙ(1/m) > C − 1/m  i.o.,

i.e., there exist N₁ < N₂ < N₃ < ⋯ s.t.

a_{N₁}(1) > C − 1,
a_{N₂}(1/2) > C − 1/2,
a_{N₃}(1/3) > C − 1/3,

and so on. Take δₙ as the sequence

1, …, 1 (N₁ times), 1/2, …, 1/2 (N₂ − N₁ times), 1/3, …, 1/3 (N₃ − N₂ times), ….

Then by definition,

a_{N_k}(δ_{N_k}) > C − 1/k

holds, which gives that

limsup_{n→∞} aₙ(δₙ) ≥ C.
However, since aₙ(δₙ) ≤ aₙ(δ) for any fixed δ > 0 and all n large enough (δₙ ≤ δ eventually), we have

limsup_{n→∞} aₙ(δₙ) ≤ limsup_{n→∞} aₙ(δ)  ∀δ > 0,

which gives

limsup_{n→∞} aₙ(δₙ) ≤ C.

Therefore we get

limsup_{n→∞} aₙ(δₙ) = C = lim_{δ↓0} limsup_{n→∞} aₙ(δ).  ∎ (Lemma)
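A toy illustration of the lemma (my own example, not from the lecture): take aₙ(δ) = δ + 1/n, which is non-decreasing in δ with lim_{δ→0} limsup_{n→∞} aₙ(δ) = 0; then aₙ(δₙ) vanishes along any δₙ ↓ 0, e.g. δₙ = n^{−1/2}:

```python
def a(n, delta):
    # toy sequence: non-decreasing in delta, lim_delta limsup_n a_n(delta) = 0
    return delta + 1.0 / n

# evaluate along delta_n = n^{-1/2}, which decreases to 0
vals = [a(n, n ** -0.5) for n in (10, 100, 10_000, 10 ** 8)]
print(vals)  # decreasing toward 0
```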
Now we show (ii) first. (ii) is equivalent to:

∀x, η > 0 ∃δ > 0 s.t. limsup_{n→∞} P∗( sup_{‖f−g‖_{P,2}<δ} |Gnf − Gng| > x ) < η.

Note that by definition

sup_{‖f−g‖_{P,2}<δ} |Gnf − Gng| = ‖Gn‖_{F_δ};

thus (ii) is again equivalent to

∀x, η > 0 ∃δ > 0 s.t. limsup_{n→∞} P∗(‖Gn‖_{F_δ} > x) < η.

Note that ‖Gn‖_{F_δ} decreases as δ ↓ 0, which makes P∗(‖Gn‖_{F_δ} > x) non-decreasing in δ. Thus (ii) is equivalent to

lim_{δ→0} limsup_{n→∞} P∗(‖Gn‖_{F_δ} > x) = 0  ∀x > 0,

which is in turn the same as

lim_{n→∞} P∗(‖Gn‖_{F_{δₙ}} > x) = 0  ∀x > 0, ∀δₙ ↓ 0  (3.2)
by the lemma. Now we will show (3.2) instead of (ii). For given x > 0 and δₙ ↓ 0,

P∗(‖Gn‖_{F_{δₙ}} > x) ≤ (1/x) E∗‖Gn‖_{F_{δₙ}} ≤ (2/x) E ‖ (1/√n)∑_{i=1}^n εᵢf(Xᵢ) ‖_{F_{δₙ}}  (symmetrization)

holds; E∗ becomes E in the last expression by the measurability of F_{δₙ}. Now note that

P_{ε|X}( |(1/√n)∑_{i=1}^n εᵢ(f(Xᵢ) − g(Xᵢ))| > x ) ≤ 2 exp( −(1/2) x² / ‖f − g‖ₙ² ),

where ‖f‖ₙ = ( (1/n)∑_{i=1}^n f(Xᵢ)² )^{1/2}, by Hoeffding's inequality (cf. example 2.3.21), which implies that the stochastic process f ↦ (1/√n)∑_{i=1}^n εᵢf(Xᵢ) is sub-Gaussian w.r.t. ‖·‖ₙ. Then by the maximal inequality (corollary 2.3.23),

E_{ε|X} ‖(1/√n)∑ᵢ εᵢf(Xᵢ)‖_{F_{δₙ}} ≤ E_{ε|X} sup_{f∈F_{δₙ}: ‖f−g‖ₙ<δ} |(1/√n)∑ᵢ εᵢ(f(Xᵢ) − g(Xᵢ))| + E_{ε|X} |(1/√n)∑ᵢ εᵢg(Xᵢ)|
≲ ∫₀^δ √(log D(ε, F_{δₙ}, ‖·‖ₙ)) dε + E_{ε|X} |(1/√n)∑ᵢ εᵢg(Xᵢ)|

holds for any δ > 0 and g ∈ F_{δₙ}. Using 0 ∈ F_{δₙ} (take g = 0) and letting δ ↑ ∞ (MCT), we can obtain

E_{ε|X} ‖(1/√n)∑ᵢ εᵢf(Xᵢ)‖_{F_{δₙ}} ≲ ∫₀^∞ √(log D(ε, F_{δₙ}, ‖·‖ₙ)) dε.

Now using D(ε) ≤ N(ε/2), we can obtain

E_{ε|X} ‖(1/√n)∑ᵢ εᵢf(Xᵢ)‖_{F_{δₙ}} ≲ ∫₀^∞ √(log N(ε, F_{δₙ}, ‖·‖ₙ)) dε
= ∫₀^{θₙ} √(log N(ε, F_{δₙ}, ‖·‖ₙ)) dε    (θₙ := sup_{f∈F_{δₙ}} ‖f‖ₙ; N(ε, F_{δₙ}, ‖·‖ₙ) = 1 for large ε)
≤ ∫₀^{θₙ/‖F‖ₙ} √(log N(ε‖F‖ₙ, F_∞, ‖·‖ₙ)) dε · ‖F‖ₙ    (∵ F_{δₙ} ⊆ F_∞)
≤ ∫₀^{θₙ/‖F‖ₙ} sup_Q √(log N(ε‖F‖_{Q,2}, F_∞, L₂(Q))) dε · ‖F‖ₙ
≲ ∫₀^{θₙ/‖F‖ₙ} sup_Q √(log N( (ε/2)‖F‖_{Q,2}, F, L₂(Q))) dε · ‖F‖ₙ
    (∵ ‖f − f′‖_{Q,2} < ε and ‖g − g′‖_{Q,2} < ε imply ‖(f−g) − (f′−g′)‖_{Q,2} < 2ε, which implies N(2ε, F_∞, L₂(Q)) ≤ N²(ε, F, L₂(Q)))
= ∫₀^{θₙ/2‖F‖ₙ} sup_Q √(log N(ε‖F‖_{Q,2}, F, L₂(Q))) dε · 2‖F‖ₙ
≲ ∫₀^{θₙ/‖F‖ₙ} sup_Q √(log N(ε‖F‖_{Q,2}, F, L₂(Q))) dε · ‖F‖ₙ.
Note that ‖F‖ₙ = ( (1/n)∑_{i=1}^n F(Xᵢ)² )^{1/2} converges to a positive constant by the SLLN and the assumption E∗‖F‖ₙ² = P∗F² < ∞. Hence we get:

E_X [ ∫₀^{θₙ/‖F‖ₙ} sup_Q √(log N(ε‖F‖_{Q,2}, F, L₂(Q))) dε · ‖F‖ₙ ]
= E_X ∫₀^∞ sup_Q √(log N(ε‖F‖_{Q,2}, F, L₂(Q))) ‖F‖ₙ I( θₙ/‖F‖ₙ > ε ) dε
= ∫₀^∞ sup_Q √(log N(ε‖F‖_{Q,2}, F, L₂(Q))) E[ ‖F‖ₙ I( θₙ/‖F‖ₙ > ε ) ] dε.

By the uniform entropy condition and the DCT, the last term converges to 0 as n → ∞ if

E[ ‖F‖ₙ I( θₙ/‖F‖ₙ > ε ) ] → 0  ∀ε > 0.

If θₙ/‖F‖ₙ converges to 0 in probability, then Cauchy–Schwarz gives

E[ ‖F‖ₙ I(θₙ > ε‖F‖ₙ) ] ≤ ( E∗‖F‖ₙ² )^{1/2} ( P∗(θₙ > ε‖F‖ₙ) )^{1/2} → 0,

since E∗‖F‖ₙ² < ∞ and P∗(θₙ > ε‖F‖ₙ) → 0; this gives the desired result. Thus our claim is that θₙ/‖F‖ₙ → 0 in outer probability. However, ‖F‖ₙ converges to a positive constant; therefore our final claim is:

Claim.) θₙ = o_{P∗}(1).
By definition,

θₙ² = sup_{f∈F_{δₙ}} ‖f‖ₙ² = sup_{f∈F_{δₙ}} Pnf² ≤ sup_{f∈F_{δₙ}} (Pn − P)f² + sup_{f∈F_{δₙ}} Pf² ≤ sup_{f∈F_∞} (Pn − P)f² + sup_{f∈F_{δₙ}} Pf²

and

sup_{f∈F_{δₙ}} Pf² ≤ δₙ² → 0  (∵ def. of F_δ)

hold. Furthermore, since 4F² is an integrable envelope of G_∞ = {f² : f ∈ F_∞}, we get for f, g ∈ F_∞ (using |f| ≤ 2F on F_∞)

Pn|f² − g²| = Pn( |f − g| |f + g| ) ≤ Pn( |f − g| · 4F ) ≤ ‖f − g‖ₙ · ‖4F‖ₙ  (∵ Cauchy–Schwarz)

and hence

N(ε‖2F‖ₙ², G_∞, L₁(Pn)) ≤ N(ε‖F‖ₙ, F_∞, ‖·‖ₙ) ≤ sup_Q N(ε‖F‖_{Q,2}, F_∞, ‖·‖_{Q,2})

(∵ ‖f − g‖ₙ ≤ ε‖F‖ₙ ⟹ Pn|f² − g²| ≤ ‖f − g‖ₙ · ‖4F‖ₙ ≤ ε‖F‖ₙ‖4F‖ₙ = ε‖2F‖ₙ²). Hence N(ε‖2F‖ₙ², G_∞, L₁(Pn)) is bounded by a fixed number depending only on ε, i.e.,

N(ε‖2F‖ₙ², G_∞, L₁(Pn)) = O_{P∗}(1)  ∀ε > 0.

It implies that

log N(ε, G_∞, L₁(Pn)) = o_{P∗}(n)  ∀ε > 0

(cf. the following remark), which implies that G_∞ is Glivenko–Cantelli (thm 3.1.3), i.e.,

sup_{f∈G_∞} (Pn − P)f = sup_{f∈F_∞} (Pn − P)f² → 0 in P∗.  (Claim)
Remark 3.2.3. Assume that N(ε‖2F‖ₙ², G_∞, L₁(Pn)) = O_{P∗}(1) for every ε > 0. Since ‖2F‖ₙ² converges a.s. (SLLN), for a.e. ω there exist M > 0 and N s.t.

n > N ⟹ ‖2F‖ₙ²(ω) ≤ M,

and hence

N(εM, G_∞, L₁(Pn)) ≤ N(ε‖2F‖ₙ², G_∞, L₁(Pn))

for such M and n. Therefore N(ε‖2F‖ₙ², G_∞, L₁(Pn)) = O_{P∗}(1) for every ε > 0 implies log N(ε, G_∞, L₁(Pn)) = O_{P∗}(1) = o_{P∗}(n) for every ε > 0.
Proof (Cont'd). Now we show (i). Since G_∞ is Glivenko–Cantelli, there exists a finitely discrete measure Pₙ with

‖(Pₙ − P)f²‖_{F_∞} → 0 as n → ∞.

Meanwhile, by the uniform entropy condition, we get

∫₀^∞ √(log N(ε‖F‖_{Pₙ,2}, F, L₂(Pₙ))) dε = (1/‖F‖_{Pₙ,2}) ∫₀^∞ √(log N(ε, F, L₂(Pₙ))) dε < ∞,

i.e.,

N(ε, F, L₂(Pₙ)) < ∞  ∀ε > 0.
For f, g ∈ F_∞, Pₙ(f − g)² < ε² implies

P(f − g)² = (P − Pₙ)(f − g)² + Pₙ(f − g)² ≤ (P − Pₙ)(2f² + 2g²) + Pₙ(f − g)² ≤ 4‖(Pₙ − P)f²‖_{F_∞} + ε² ≤ ε² + ε² = 2ε²

for n large enough that ‖(Pₙ − P)f²‖_{F_∞} ≤ ε²/4. It implies that

‖f − g‖_{Pₙ,2} ≤ ε ⟹ ‖f − g‖_{P,2} ≤ √2 ε

for large n, i.e.,

N(ε, F, L₂(P)) ≤ N( ε/√2, F, L₂(Pₙ) ) < ∞

for large n. Therefore we obtain

N(ε, F, L₂(P)) < ∞  ∀ε > 0,

i.e., F is totally bounded w.r.t. the L₂(P)-norm.  ∎
Next we consider the bracketing version of Donsker's theorem, whose proof uses Bernstein's inequality. From now on, let F be a class of measurable functions with envelope F satisfying P∗F² < ∞.

Lemma 3.2.4. If |F| < ∞ and ‖f‖_∞ < ∞ for every f ∈ F, then

E‖Gn‖_F ≲ max_{f∈F} (‖f‖_∞/√n) log|F| + max_{f∈F} ‖f‖_{P,2} √(log|F|).  (3.3)
Proof. Note that

|Gnf| = | (1/√n)∑_{i=1}^n (f(Xᵢ) − Pf) |.

Each (f(Xᵢ) − Pf)/√n has mean zero and satisfies the Bernstein condition:

E| (f(Xᵢ) − Pf)/√n |^k = E[ ( (f(Xᵢ) − Pf)/√n )² | (f(Xᵢ) − Pf)/√n |^{k−2} ]
≤ E[ ( 2‖f‖_∞/√n )^{k−2} · 2( f²(Xᵢ) + (Pf)² )/n ]
≤ (2/n) Pf² · 2^{k−1} · ( ‖f‖_∞/√n )^{k−2}
≤ (k!/2)(4Pf²/n)( ‖f‖_∞/√n )^{k−2}    (∵ 2^{k−1} ≤ k!)
holds. Thus by Bernstein's inequality,

P(|Gnf| > x) ≤ 2 exp( −(1/2) x² / ( 4Pf² + (‖f‖_∞/√n) x ) ) ≤ 2 exp( −(1/2) x² / ( 4 max_{f∈F} Pf² + max_{f∈F} (‖f‖_∞/√n) x ) )

holds for any x > 0. Now the maximal inequality (lemma 2.4.1) gives the conclusion:

E‖Gn‖_F ≤ ‖ max_{f∈F} |Gnf| ‖_{ψ₁}
≲ max_{f∈F} (‖f‖_∞/√n) log(1 + |F|) + √( 4 max_{f∈F} Pf² ) √( log(1 + |F|) )
≲ max_{f∈F} (‖f‖_∞/√n) log|F| + √( max_{f∈F} Pf² ) √( log|F| ).  ∎
Theorem 3.2.5 (Bracketing Donsker). If

∫₀^∞ √( log N_{[ ]}(ε, F, L₂(P)) ) dε < ∞,

then F is P-Donsker.

Remark 3.2.6. We use the chaining technique and the previous lemma in the proof. However, as the condition ‖f‖_∞ < ∞ is required to apply the lemma, we should truncate the terms at the order satisfying

√(log|F|) ‖f‖_∞ / √n ∼ ‖f‖_{P,2},

so that the two terms on the RHS of (3.3) have equal order.
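Spelling out the heuristic of the remark (my own rewriting, not verbatim from the notes): equating the two terms of (3.3) under a truncation at level ‖f‖_∞ ≈ M gives

```latex
\frac{M}{\sqrt n}\,\log|\mathcal F|
  \;\asymp\; \|f\|_{P,2}\,\sqrt{\log|\mathcal F|}
\quad\Longleftrightarrow\quad
M \;\asymp\; \frac{\sqrt n\,\|f\|_{P,2}}{\sqrt{\log|\mathcal F|}},
\qquad\text{i.e.}\qquad
\frac{\sqrt{\log|\mathcal F|}\,\|f\|_\infty}{\sqrt n} \;\sim\; \|f\|_{P,2},
```

which is exactly the truncation order used in the proof below.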
Proof. There exists an envelope F of F with P∗F² < ∞. (Recall remark 2.2.22: the bracketing number is larger than the covering number with the same 'diameter'. Finiteness of the integral gives that N_{[ ]}(ε, F, L₂(P)) = 1 for large ε; let [l, u] be the single bracket covering F, with ‖u − l‖_{P,2} < M. We also get

∫₀^∞ √( log N(ε, F, L₂(P)) ) dε < ∞,

i.e., N(ε, F, L₂(P)) is finite for every ε > 0. It implies that F is totally bounded, and hence bounded, in L₂(P). Thus

P(|u| + |l|)² ≤ 2P(|u|² + |l|²)

and

‖u‖_{P,2} ≤ ‖u − f‖_{P,2} + ‖f‖_{P,2} < ∞,  ‖l‖_{P,2} ≤ ‖f − l‖_{P,2} + ‖f‖_{P,2} < ∞

for f ∈ F imply that P(|u| + |l|)² < ∞. Letting F = sup(|u|, |l|) ≤ |u| + |l|, we get an envelope F of F with P∗F² < ∞.) For q ≥ 1, construct a sequence of nested partitions

F = ⋃_{i=1}^{N_q} F_{q,i}

s.t. each F_{q,i} is a 2^{−q}-bracket in ‖·‖_{P,2} and

∑_{q=1}^∞ 2^{−q} √(log N_q) < ∞.  (3.4)
Figure 3.1: Nested partition ⋃ᵢ F_{q,i}.
Of course we have to show that such a partition satisfying (3.4) exists. Note that N_q is at most N_{q−1} times the number of brackets needed to refine each F_{q−1,i}, i.e.,

N_q ≤ N_{q−1} · N_{[ ]}(2^{−q}, F, L₂(P)).
Figure 3.2: Relationship between Nq−1 and Nq.
It implies that

√(log N_q) ≤ √( log N_{q−1} + log N_{[ ]}(2^{−q}, F, L₂(P)) )
≤ √(log N_{q−1}) + √(log N_{[ ]}(2^{−q}, F, L₂(P)))    (√(a + b) ≤ √a + √b)
≤ √(log N_{q−2}) + √(log N_{[ ]}(2^{−(q−1)}, F, L₂(P))) + √(log N_{[ ]}(2^{−q}, F, L₂(P)))
≤ ⋯
≤ √(log N₁) + ∑_{p=2}^q √(log N_{[ ]}(2^{−p}, F, L₂(P))),

and therefore

∑_{q=1}^∞ 2^{−q} √(log N_q) ≤ ∑_{q=1}^∞ 2^{−q} ( √(log N₁) + ∑_{p=2}^q √(log N_{[ ]}(2^{−p}, F, L₂(P))) )
≤ √(log N₁) + ∑_{q=1}^∞ ∑_{p=1}^q 2^{−q} √(log N_{[ ]}(2^{−p}, F, L₂(P)))
= √(log N₁) + ∑_{p=1}^∞ ∑_{q=p}^∞ 2^{−q} √(log N_{[ ]}(2^{−p}, F, L₂(P)))
= √(log N₁) + ∑_{p=1}^∞ 2^{−(p−1)} √(log N_{[ ]}(2^{−p}, F, L₂(P)))
≲ √(log N₁) + ∫₀^∞ √(log N_{[ ]}(ε, F, L₂(P))) dε < ∞

holds, which yields (3.4). Now, fix f_{q,i} ∈ F_{q,i} (fix "representatives" of each cell), and for f ∈ F_{q,i} define

π_q f := f_{q,i}    ("projection" to the space of representatives)
Δ_q f := ( sup_{g,h∈F_{q,i}} |g − h| )∗.    ("variation" on each cell)

Since each F_{q,i} is a 2^{−q}-bracket, we get

√( P(Δ_q f)² ) ≤ 2^{−q}.
Figure 3.3: Fq,i and representative fq,i.
Now note that, by theorem 1.4.8, it suffices to show that ∀ε, η > 0 there exists a finite partition ⋃_{i=1}^{N_{q₀}} F_{q₀,i} satisfying

limsup_{n→∞} P∗( max_{1≤i≤N_{q₀}} sup_{f,g∈F_{q₀,i}} |Gnf − Gng| > ε ) < η.

For f, g ∈ F_{q₀,i} in the same cell, π_{q₀}f = π_{q₀}g, and hence

Gnf − Gng = Gn(f − π_{q₀}f + π_{q₀}g − g) = Gn(f − π_{q₀}f) + Gn(π_{q₀}g − g).

Therefore, if we can show that

∀ε > 0 ∃q₀ s.t. E∗‖Gn(f − π_{q₀}f)‖_F < ε for n large enough,

then

P∗( max_{1≤i≤N_{q₀}} sup_{f,g∈F_{q₀,i}} |Gnf − Gng| > ε )
≤ P∗( max_{1≤i≤N_{q₀}} sup_{f,g∈F_{q₀,i}} ( |Gn(f − π_{q₀}f)| + |Gn(g − π_{q₀}g)| ) > ε )
≤ (1/ε) E∗( max_{1≤i≤N_{q₀}} sup_{f,g∈F_{q₀,i}} ( |Gn(f − π_{q₀}f)| + |Gn(g − π_{q₀}g)| ) )
≤ (2/ε) E∗ max_{1≤i≤N_{q₀}} sup_{f∈F_{q₀,i}} |Gn(f − π_{q₀}f)|
≤ (2/ε) E∗‖Gn(f − π_{q₀}f)‖_F
< η for n large enough

holds for arbitrarily given η > 0 (apply the display above with ε replaced by εη/2).
Claim.) ∀ε > 0 ∃q₀ s.t. E∗‖Gn(f − π_{q₀}f)‖_F < ε for n large enough.

Define a sequence

a_q = 2^{−q} / √(log N_{q+1})

and indicator functions

A_{q−1}f = I( Δ_{q₀}f ≤ √n a_{q₀}, …, Δ_{q−1}f ≤ √n a_{q−1} ),
B_q f = I( Δ_{q₀}f ≤ √n a_{q₀}, …, Δ_{q−1}f ≤ √n a_{q−1}, Δ_q f > √n a_q ),
B_{q₀}f = I( Δ_{q₀}f > √n a_{q₀} )
for q > q₀. Then from A_{q−1}f = A_q f + B_q f we get

(f − π_{q−1}f)A_{q−1}f = (f − π_q f + π_q f − π_{q−1}f)A_{q−1}f
= (f − π_q f)(A_q f + B_q f) + (π_q f − π_{q−1}f)A_{q−1}f
= (f − π_q f)B_q f + (f − π_q f)A_q f + (π_q f − π_{q−1}f)A_{q−1}f.

Furthermore, we have B_{q₀}f + A_{q₀}f = 1, and hence

f − π_{q₀}f = (f − π_{q₀}f)(A_{q₀}f + B_{q₀}f)
= (f − π_{q₀}f)B_{q₀}f + (f − π_{q₀}f)A_{q₀}f
= (f − π_{q₀}f)B_{q₀}f + (f − π_{q₀+1}f)B_{q₀+1}f + (f − π_{q₀+1}f)A_{q₀+1}f + (π_{q₀+1}f − π_{q₀}f)A_{q₀}f
= ⋯
= (f − π_{q₀}f)B_{q₀}f + ∑_{q=q₀+1}^Q (f − π_q f)B_q f + ∑_{q=q₀+1}^Q (π_q f − π_{q−1}f)A_{q−1}f + (f − π_Q f)A_Q f

for any Q > q₀. Note that

|(f − π_Q f)A_Q f| ≤ Δ_Q f A_Q f ≤ √n a_Q = √n 2^{−Q}/√(log N_{Q+1}) → 0 as Q → ∞    (def. of A_Q; n fixed),

and finally we have

f − π_{q₀}f = (f − π_{q₀}f)B_{q₀}f [(I)] + ∑_{q=q₀+1}^∞ (f − π_q f)B_q f [(II)] + ∑_{q=q₀+1}^∞ (π_q f − π_{q−1}f)A_{q−1}f [(III)].

Now we have to control the three terms (I), (II), and (III). (Sketch: (I) f is dominated by the square-integrable envelope F and I(F > √n a_{q₀}) → 0 as n → ∞ for fixed q₀; (II) since the events defining the B_q f are disjoint, at most one term B_q f is non-zero; (III) chaining technique on the π_q f.)
(I) Since |(f − π_{q₀}f)B_{q₀}f| ≤ 2F I(2F ≥ √n a_{q₀}), writing (∗) := (f − π_{q₀}f)B_{q₀}f,

E∗‖Gn(∗)‖_F = E∗‖ (1/√n)∑_{i=1}^n ( (∗)(Xᵢ) − P(∗) ) ‖_F
≤ (2/√n) ∑_{i=1}^n P( 2F I(2F ≥ √n a_{q₀}) )
= 4√n P( F I(2F ≥ √n a_{q₀}) )    (we should eliminate the √n term)
≤ 4√n P( F · (2F/(√n a_{q₀})) I(2F ≥ √n a_{q₀}) )
= (8/a_{q₀}) P( F² I(2F ≥ √n a_{q₀}) )
→ 0 as n → ∞ for any fixed q₀  (∵ PF² < ∞).
(III) Note that both π_q f and π_{q−1}f belong to F_{q−1,i} when f ∈ F_{q−1,i}, because each partition is a refinement of the previous one. It gives that

|π_q f − π_{q−1}f| ≤ Δ_{q−1}f.

Hence we get

‖(π_q f − π_{q−1}f)A_{q−1}f‖_∞ ≤ ‖Δ_{q−1}f A_{q−1}f‖_∞ ≤ √n a_{q−1}    (def. of A_{q−1})

and

‖(π_q f − π_{q−1}f)A_{q−1}f‖_{P,2} ≤ ‖π_q f − π_{q−1}f‖_{P,2} ≤ 2^{−(q−1)}.

Note that the last inequality comes from π_q f, π_{q−1}f ∈ F_{q−1,i}, which is a 2^{−(q−1)}-bracket. Thus we have

E∗‖ Gn ∑_{q=q₀+1}^∞ (π_q f − π_{q−1}f)A_{q−1}f ‖_F ≤ ∑_{q=q₀+1}^∞ E∗‖ Gn (π_q f − π_{q−1}f)A_{q−1}f ‖_F
≲ ∑_{q=q₀+1}^∞ [ (√n a_{q−1}/√n) log N_q + 2^{−q} √(log N_q) ]

by lemma 3.2.4. (Lemma 3.2.4 can be applied only to function classes of finite cardinality, but (π_q f − π_{q−1}f)A_{q−1}f takes at most N_q distinct values.) Then, since a_{q−1} = 2^{−(q−1)}/√(log N_q), so that a_{q−1} log N_q = 2^{−(q−1)} √(log N_q), we get

E∗‖ Gn ∑_{q=q₀+1}^∞ (π_q f − π_{q−1}f)A_{q−1}f ‖_F ≲ ∑_{q=q₀+1}^∞ 2^{−q} √(log N_q) → 0 as q₀ → ∞

regardless of n. Thus we can obtain the claim by first finding q₀ that makes the term (III) (and (II), eventually, as will be shown) small enough, and then, for such fixed q₀, letting n be so large that (I) becomes small. Therefore the remaining part is showing that (II) becomes small for q₀ large enough, regardless of n.
Claim (for (II)): the contribution of (II) → 0 as q₀ → ∞, regardless of n.

Since |f − π_q f| B_q f ≤ Δ_q f B_q f, writing (∗∗) := (f − π_q f)B_q f, we have

|Gn(∗∗)| = | (1/√n)∑_{i=1}^n ( (∗∗)(Xᵢ) − P(∗∗) ) |
≤ | (1/√n)∑_{i=1}^n ( Δ_q f B_q f(Xᵢ) − P(Δ_q f B_q f) ) | + | √n P(Δ_q f B_q f) | + | √n P(∗∗) |
≤ |Gn(Δ_q f B_q f)| + 2√n P(Δ_q f B_q f),  (3.5)

using |P(∗∗)| ≤ P(Δ_q f B_q f) in the last step.
Note that ‖Δ_q f B_q f‖_∞ ≤ ‖Δ_{q−1}f B_q f‖_∞ ≤ √n a_{q−1} and

P(Δ_q f B_q f)² ≤ P( Δ_q f Δ_{q−1}f B_q f )    (∵ Δ_q f ≤ Δ_{q−1}f, (B_q f)² = B_q f)
≤ √n a_{q−1} P(Δ_q f B_q f)
≤ √n a_{q−1} P( Δ_q f I(Δ_q f > √n a_q) )
≤ √n a_{q−1} P( Δ_q f · Δ_q f/(√n a_q) )
= (a_{q−1}/a_q) P(Δ_q f)² ≤ (a_{q−1}/a_q) 2^{−2q}

hold (the last inequality comes from ‖Δ_q f‖_{P,2} ≤ 2^{−q}). Thus by lemma 3.2.4 we get

E∗‖Gn(Δ_q f B_q f)‖_F ≲ ( max_f ‖Δ_q f B_q f‖_∞/√n ) log N_q + max_f √( P(Δ_q f B_q f)² ) √(log N_q)
≤ a_{q−1} log N_q + √(a_{q−1}/a_q) 2^{−q} √(log N_q)
≤ a_{q−1} log N_q + (a_{q−1}/a_q) 2^{−q} √(log N_q)    (∵ a_{q−1}/a_q ≥ 1 ⟹ √(a_{q−1}/a_q) ≤ a_{q−1}/a_q)
= 2^{−(q−1)} √(log N_q) + 2^{−(q−1)} √(log N_{q+1})  =: (A),

using a_{q−1} = 2^{−(q−1)}/√(log N_q) and a_{q−1}/a_q = 2√(log N_{q+1})/√(log N_q). Note that the final bound (A) is summable over q > q₀. On the other hand,

√n a_q P(Δ_q f B_q f) ≤ P(Δ_q f)² ≤ 2^{−2q}

by def. of B_q f (Δ_q f > √n a_q on B_q f), which gives

2√n P(Δ_q f B_q f) ≲ 2^{−2q}/a_q = 2^{−q} √(log N_{q+1})  =: (B),

whose final bound (B) is also summable over q > q₀. Putting these in (3.5), we get
E∗‖ ∑_{q=q₀+1}^∞ Gn(f − π_q f)B_q f ‖_F ≤ ∑_{q=q₀+1}^∞ E∗‖ Gn(f − π_q f)B_q f ‖_F
≤ ∑_{q=q₀+1}^∞ [ E∗‖Gn(Δ_q f B_q f)‖_F + 2√n ‖P(Δ_q f B_q f)‖_F ]
≲ ∑_{q=q₀+1}^∞ [ 2^{−(q−1)} √(log N_q) + 2^{−(q−1)} √(log N_{q+1}) + 2^{−q} √(log N_{q+1}) ] → 0 as q₀ → ∞

regardless of n.  ∎
Chapter 4
Uniform Entropy & Bracketing Numbers
Note that F is a Donsker class if (there exists a square-integrable envelope F and)

∫₀^∞ sup_Q √( log N(ε‖F‖_{Q,2}, F, L₂(Q)) ) dε < ∞

by the covering Donsker theorem. For this, it suffices to have

sup_Q log N(ε‖F‖_{Q,2}, F, L₂(Q)) ≤ K (1/ε)^{2−δ}

(at least for small ε) for some δ > 0. However, many classes of functions satisfy the much stronger condition

sup_Q N(ε‖F‖_{Q,2}, F, L₂(Q)) ≤ K (1/ε)^V,  0 < ε < 1,

for some number V. In this chapter, we consider classes of sets or functions called VC classes, named after Vapnik and Chervonenkis, and some properties of uniform covering (bracketing) numbers. We also consider special classes of functions with uniform bracketing numbers.
4.1 VC class and Uniform Covering Numbers
4.1.1 VC class of sets
Let C be a collection of subsets of X, and let {x₁, x₂, …, xₙ} be an arbitrary set of n points in X.

Definition 4.1.1 (VC index & class).

(i) C picks out a certain subset A of {x₁, …, xₙ} if A = C ∩ {x₁, …, xₙ} for some C ∈ C.

(ii) C shatters {x₁, …, xₙ} if each of its 2ⁿ subsets can be picked out.

(iii) The VC index V(C) of C is the smallest n for which no set of size n is shattered by C. The VC dimension is defined as V(C) − 1.

(iv) A collection of measurable sets C is called a VC class if V(C) < ∞.

(v) Also define Δₙ(C, x₁, x₂, …, xₙ) := |{ C ∩ {x₁, …, xₙ} : C ∈ C }|.
Example 4.1.2. Let X = R and C = {(−∞, c] : c ∈ R} ("half-intervals"). Then every singleton {c} ⊆ R can be picked out. However, {c₁, c₂} (c₁ < c₂) cannot be shattered, as {c₂} cannot be picked out. Thus V(C) = 2.

Example 4.1.3. Let X = R and C = {(a, b] : a, b ∈ R}. Then every two-point set {c₁, c₂} ⊆ R is shattered. However, {c₁, c₂, c₃} (c₁ < c₂ < c₃) cannot be shattered, as {c₁, c₃} cannot be picked out. Thus V(C) = 3.
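Examples 4.1.2 and 4.1.3 can be verified by brute force (a sketch with illustrative helper names `picked_subsets`/`shatters`, checking candidate endpoints on a small grid):

```python
def picked_subsets(points, predicates):
    # subsets of `points` picked out by the membership predicates
    return {frozenset(p for p in points if member(p)) for member in predicates}

def shatters(points, predicates):
    return len(picked_subsets(points, predicates)) == 2 ** len(points)

pts = [1.0, 2.0, 3.0]
cuts = [0.5, 1.5, 2.5, 3.5]
half = [lambda x, c=c: x <= c for c in cuts]                        # (-inf, c]
ivs = [lambda x, a=a, b=b: a < x <= b for a in cuts for b in cuts]  # (a, b]

print(shatters(pts[:1], half), shatters(pts[:2], half))  # True False
print(shatters(pts[:2], ivs), shatters(pts, ivs))        # True False
```

Half-intervals shatter singletons but not pairs (V = 2); intervals shatter pairs but not triples (V = 3), matching the examples.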
Example 4.1.4. In general, let X = R^d and

C₁ = {(−∞, c] : c ∈ R^d},  C₂ = {(a, b] : a, b ∈ R^d}.

Then V(C₁) = d + 1 and V(C₂) = 2d + 1. A sketch of the proof is the following: (To be added)
The following lemma is a result from combinatorics; the proof in fact gives that the number of subsets picked out by C, Δₙ(C, x₁, …, xₙ), is bounded above by the number of subsets of {x₁, …, xₙ} shattered by C.

Lemma 4.1.5 (Sauer). Let C be a VC class. Then for any n points x₁, …, xₙ with n ≥ V(C) − 1, we get

Δₙ(C, x₁, …, xₙ) ≤ ∑_{j=0}^{V(C)−1} (n choose j) ≤ ( ne/(V(C) − 1) )^{V(C)−1}.
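A quick numerical check of Sauer's lemma (my own illustration) for the interval class C = {(a, b]} with V(C) = 3: on n = 10 points, Δₙ equals the binomial sum exactly (intervals pick out the empty set and all consecutive runs), and both sit below (ne/(V − 1))^{V−1}:

```python
from math import comb, e

def delta_n(points):
    # number of subsets of `points` picked out by intervals (a, b]
    pts = sorted(points)
    picked = {frozenset()}
    n = len(pts)
    for i in range(n):
        for j in range(i, n):
            picked.add(frozenset(pts[i:j + 1]))  # (a, b] picks a consecutive run
    return len(picked)

n, V = 10, 3
pts = list(range(n))
d = delta_n(pts)
sauer = sum(comb(n, j) for j in range(V))  # sum_{j=0}^{V-1} C(n, j)
print(d, sauer, (n * e / (V - 1)) ** (V - 1))  # 56 56 ~184.7
```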
Proof. WLOG we may assume that C ⊆ 2^{{x₁,…,xₙ}}, i.e., C consists of subsets of {x₁, …, xₙ}, so that

Δₙ(C, x₁, …, xₙ) = |{ C ∩ {x₁, …, xₙ} : C ∈ C }| = |C|.

For C ∈ C and i ∈ {1, …, n}, let

T_i(C) = C \ {x_i} if C \ {x_i} ∉ C, and T_i(C) = C otherwise.

Then C ↦ T_i(C) is one-to-one on C. Indeed, assume that T_i(C₁) = T_i(C₂); there are three possible cases:

i) if C₁ \ {x_i} ∉ C and C₂ \ {x_i} ∉ C, then C₁ \ {x_i} = C₂ \ {x_i}, and since C₁, C₂ ∈ C we get x_i ∈ C₁ and x_i ∈ C₂ (otherwise C_j \ {x_i} = C_j ∈ C, a contradiction); thus C₁ = C₂;

ii) if C₁ \ {x_i} ∈ C and C₂ \ {x_i} ∈ C, then clearly C₁ = C₂ by def. of T_i;

iii) if C₁ \ {x_i} ∉ C and C₂ \ {x_i} ∈ C (or vice versa), then C₁ \ {x_i} = T_i(C₁) = T_i(C₂) = C₂ ∈ C, which yields a contradiction;

and hence we get the result. It gives that |C| = |T_i(C)|, where T_i(C) := {T_i(C) : C ∈ C}.

Claim.) If A ⊆ {x₁, …, xₙ} is shattered by T_i(C), then A is shattered by C.

If x_i ∉ A, then A ∩ C = A ∩ T_i(C) for any C ∈ C (T_i only removes x_i), so the claim is clear. Assume that x_i ∈ A and T_i(C) shatters A. Then for any B ⊆ A, also B ∪ {x_i} ⊆ A, and hence ∃C ∈ C s.t.

B ∪ {x_i} = A ∩ T_i(C).

Then we get x_i ∈ T_i(C) and hence T_i(C) = C, i.e., C \ {x_i} ∈ C. Thus both

B ∪ {x_i} = A ∩ T_i(C) = A ∩ C

and

B \ {x_i} = (A ∩ C) \ {x_i} = A ∩ (C \ {x_i})

are picked out by C, one of which equals B.  (Claim)
Now apply the operators T₁, T₂, …, Tₙ repeatedly, until the collection of sets does not change anymore; call the resulting collection D. Since each C ↦ T_i(C) is one-to-one, |D| = |C|. By the claim,

(# of sets shattered by C) ≥ (# of sets shattered by T_i(C)) ≥ ⋯ ≥ (# of sets shattered by D),  (4.1)

and by the construction, D \ {x_i} ∈ D for every D ∈ D and every x_i, which implies that D is closed under taking subsets (for any D ∈ D, all subsets of D belong to D, i.e., 2^D ⊆ D). Hence every D ∈ D is shattered by D, and any set shattered by D must belong to D.
Hence we get

(# of sets shattered by D) = |D| = |C|.  (4.2)

Combining (4.1) and (4.2), we get

Δₙ(C, x₁, …, xₙ) = |C| ≤ (# of sets shattered by C).

Note that C can only shatter sets with fewer than V(C) elements;

(# of sets shattered by C) ≤ ∑_{j=0}^{V(C)−1} (n choose j).

Therefore we get

|C| ≤ ∑_{j=0}^{V(C)−1} (n choose j).
Now the remaining part is to show that

∑_{j=0}^k (n choose j) ≤ (ne/k)^k for any k ≤ n,

which can be obtained from the following simple calculation:

∑_{j=0}^k (n choose j) = ∑_{j=0}^k n(n−1)⋯(n−j+1)/j!
≤ ∑_{j=0}^k n^j/j!
≤ ∑_{j=0}^k (n^j/j!)(n/k)^{k−j}    (∵ n/k ≥ 1)
= (n/k)^k ∑_{j=0}^k k^j/j!
≤ (n/k)^k e^k.  ∎
For two measurable sets C and D, note that the L_r(Q)-norm of the difference of their indicators is

‖1_C − 1_D‖_{Q,r} = Q^{1/r}(C △ D).  (4.3)

From now on we consider the L_r(Q)-distance between sets C and D in the sense of (4.3).
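Identity (4.3) with r = 1 can be illustrated on a finite probability space (an illustrative discrete Q; the sets C, D are arbitrary):

```python
# Finite space with probability weights (illustrative): Q({i}) = w[i]
w = {1: 0.2, 2: 0.3, 3: 0.1, 4: 0.4}
C = {1, 2}
D = {2, 3}

l1 = sum(w[x] * abs((x in C) - (x in D)) for x in w)  # ||1_C - 1_D||_{Q,1}
sym = sum(w[x] for x in C ^ D)                        # Q(C Δ D)
print(l1, sym)  # both 0.3
```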
Theorem 4.1.6 (Uniform Covering Numbers). Let C be a VC class of sets. Then for any r ≥ 1, 0 < ε < 1 and probability measure Q,

N(ε, C, L_r(Q)) ≤ K (1/ε)^{r(V(C)−1)}

holds, where K is a constant depending only on V(C).
Proof. We will only prove the milder version of the statement,

N(ε, C, L_r(Q)) ≤ K (1/ε)^{r(V(C)−1+δ)}  ∀δ > 0.

Take C₁, C₂, …, C_m from C s.t. Q(C_i △ C_j) > ε for i ≠ j (i.e., the C_i are ε^{1/r}-separated in L_r(Q)). If one shows that

m ≲ (1/ε)^{V(C)−1+δ}  ∀δ > 0,

then, as the packing number dominates the covering number, we can obtain the conclusion

N(ε, C, L_r(Q)) ≤ D(ε) ≲ (1/ε^r)^{V(C)−1+δ}.

Let X₁, X₂, …, Xₙ iid∼ Q (remark: the fact that Q is a probability measure is used here!). Note that

C_i and C_j pick out the same subset from {X₁, …, Xₙ}
⟺ C_i ∩ {X₁, …, Xₙ} = C_j ∩ {X₁, …, Xₙ}
⟺ X_k ∉ C_i △ C_j ∀k.

Thus

∀(i, j) ∃k s.t. X_k ∈ C_i △ C_j
⟹ each C_i (i = 1, 2, …, m) picks out a different subset from {X₁, …, Xₙ}
⟹ C picks out at least m subsets from {X₁, …, Xₙ}.

Let

E = { ∀(i, j) ∃k s.t. X_k ∈ C_i △ C_j } = ⋂_{(i,j)} ⋃_{k=1}^n { X_k ∈ C_i △ C_j }.

Then

P(E^c) ≤ ∑_{i<j} P( X_k ∉ C_i △ C_j ∀k )
= ∑_{i<j} ( 1 − P(X_k ∈ C_i △ C_j) )^n = ∑_{i<j} ( 1 − Q(C_i △ C_j) )^n ≤ ∑_{i<j} (1 − ε)^n = (m choose 2)(1 − ε)^n.

It gives that P(E^c) < 1 for sufficiently large n, i.e., P(E) > 0. It implies that there exist x₁, x₂, …, xₙ (∈ supp(Q)) satisfying E, i.e., C picks out at least m subsets from {x₁, …, xₙ}. It implies that

m ≤ max_{x₁,…,xₙ} Δₙ(C, x₁, …, xₙ).

By the previous lemma, we get Δₙ(C, x₁, …, xₙ) ≲ n^{V(C)−1}, i.e.,

m ≲ n^{V(C)−1}

provided that (m choose 2)(1 − ε)^n < 1. Put n = ⌈3 log m / ε⌉ (sufficiently large; then (1 − ε)^n ≤ e^{−εn} ≤ m^{−3}, so (m choose 2)(1 − ε)^n < 1). Then we get

m ≲ ( log m / ε )^{V(C)−1},

i.e., m is bounded by a power of log m times a power of 1/ε, which implies

m ≲ (1/ε)^{V(C)−1+δ}  ∀δ > 0,

since (log m)^{V(C)−1} is bounded by a constant times m^{δ′} for any δ′ > 0.
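The polynomial growth in theorem 4.1.6 can be watched empirically for half-intervals (V(C) = 2, r = 1) under a uniform discrete Q; the greedy packing routine below is an illustrative sketch (names are my own), using that Q(C △ C′) for two half-intervals equals the difference of their CDF values:

```python
def packing_number(masses, eps):
    # greedy eps-packing of half-intervals (-inf, c] under the distance
    # Q(C △ C'), which for half-intervals is |Q((-inf, c]) - Q((-inf, c'])|
    cum, total = [0.0], 0.0
    for m in masses:
        total += m
        cum.append(total)  # CDF values at each threshold
    chosen = []
    for c in cum:
        if all(abs(c - d) > eps for d in chosen):
            chosen.append(c)
    return len(chosen)

masses = [1.0 / 200] * 200          # uniform Q on 200 points
eps_list = [0.21, 0.11, 0.052]
counts = [packing_number(masses, eps) for eps in eps_list]
print(list(zip(eps_list, counts)))  # roughly proportional to 1/eps
```

The packing (and hence covering) number stays of order 1/ε, i.e., (1/ε)^{r(V(C)−1)} with r = 1, V(C) = 2.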
4.1.2 VC class of functions

We can extend the definition of a VC class to function classes by considering the subgraph of each function.

Definition 4.1.7.

(i) The subgraph of a function f : X → R is {(x, t) : t < f(x)}.

(ii) A collection F of measurable functions is called a VC-subgraph class (or VC class, in short) if C = {C_f : f ∈ F} is a VC class of sets, where C_f denotes the subgraph of f. In this case, the VC index of F is defined as V(F) = V(C).
Theorem 4.1.8. Let F be a VC class with measurable envelope F, and let Q be a probability measure with ‖F‖_{Q,r} > 0, r ≥ 1. Then

N(ε‖F‖_{Q,r}, F, L_r(Q)) ≲ (1/ε)^{r(V(F)−1)}.
Proof. First we consider the case r = 1. Let C_f be the subgraph of f ∈ F and C = {C_f : f ∈ F}. Then

Q|f − g| = ∫∫ 1_{C_f △ C_g}(x, t) dt dQ(x) = (Q ⊗ λ)(C_f △ C_g),

where λ denotes Lebesgue measure on R. Note that Q ⊗ λ is not a probability measure anymore; we have to construct a probability measure to apply the result for a VC class of sets. Let

P := (Q ⊗ λ) / (2QF).

Then P is a probability measure on {(x, t) : |t| ≤ F(x)}, and hence we can obtain

N(ε · 2QF, F, L₁(Q)) = N(ε, C, L₁(P)) ≲ (1/ε)^{V(C)−1}.

(Editor's note: P is a probability measure on {(x, t) : |t| ≤ F(x)}, but we should consider a subgraph class on {(x, t) : t ≤ F(x)}. My solution for this: considering C and considering

C∗ := { C ∩ {(x, t) : |t| ≤ F(x)} : C ∈ C }

is equivalent in the sense of C_f △ C_g = C∗_f △ C∗_g, where

C∗_f = C_f ∩ {(x, t) : |t| ≤ F(x)} (= C_f ∩ {(x, t) : t > −F(x)}) ∈ C∗.

Then it can be said that P is a probability measure on the space containing C∗.)

For general r > 1, note that

Q|f − g|^r ≤ Q( |f − g| (2F)^{r−1} ) = 2^{r−1} Q( |f − g| F^{r−1} ) = 2^{r−1} R|f − g| · QF^{r−1},

where R is a probability measure defined as

R(A) = ∫_A F^{r−1} dQ / QF^{r−1}.

Also note that

R|f − g| < ε^r RF ⟹ Q|f − g|^r ≤ 2^{r−1} R|f − g| · QF^{r−1} < 2^{r−1} ε^r RF · QF^{r−1} = 2^{r−1} ε^r QF^r < (2ε‖F‖_{Q,r})^r,
which gives

N(2ε‖F‖_{Q,r}, F, L_r(Q)) ≤ N(ε^r RF, F, L₁(R)).

Then, by the argument for r = 1, we get

N(ε^r RF, F, L₁(R)) ≲ (1/ε^r)^{V(F)−1},

which gives the conclusion.  ∎
4.2 VC-Hull Class and Uniform Entropy

Unfortunately, in many cases it is difficult to show that a given function class is VC (and often it simply is not). However, the convex hull of a class can represent a much larger class. For example, the class of normal density functions can only represent Gaussian distributions, but its convex hull includes all finite Gaussian mixtures, which can approximate almost any continuous distribution on R. With this motivation: even when it is hard to show that a given function class is VC, it may belong to a VC-hull class.

Definition 4.2.1.

(i) (Convex hull) conv F = { ∑_{i=1}^m αᵢfᵢ : αᵢ > 0, ∑_{i=1}^m αᵢ = 1, fᵢ ∈ F }

(ii) (Symmetric convex hull) sconv F := { ∑_{i=1}^m αᵢfᵢ : ∑_{i=1}^m |αᵢ| ≤ 1, fᵢ ∈ F }

(iii) (Pointwise limit) Let conv̄ F or sconv̄ F be the pointwise sequential closure of conv F or sconv F, respectively; it contains pointwise limits of finite combinations (and eventually of infinite combinations).

(iv) F is a VC-hull class if it is contained in the pointwise sequential closure of the symmetric convex hull of a VC class of functions, i.e., F ⊆ sconv̄ G for a VC class G of functions.
The following theorems give uniform entropy bounds for VC-hull classes.

Theorem 4.2.2. Let Q be a probability measure and F a class of measurable functions with measurable square-integrable envelope F, i.e., QF² < ∞. If

N(ε‖F‖_{Q,2}, F, L₂(Q)) ≤ C (1/ε)^V,  0 < ε < 1,

then there exists a constant K depending only on C and V such that

log N(ε‖F‖_{Q,2}, conv̄ F, L₂(Q)) ≤ K (1/ε)^{2V/(V+2)}.

Remark 4.2.3. It says that if the covering number has a polynomial bound, then the entropy of the convex hull has a polynomial bound of order smaller than 2. By Donsker's theorem, the convex hull then becomes Donsker.
We can obtain a similar statement for VC-hull classes.

Corollary 4.2.4 (Uniform Entropy). Let G be a VC class and F = sconv̄ G the corresponding VC-hull class. Then

log N(ε‖F‖_{Q,2}, F, L₂(Q)) ≲ (1/ε)^{2(1−V(G)^{−1})}

for small ε, and hence F is Donsker.

It can be proved using that sconv G is contained in the convex hull of G ∪ (−G) ∪ {0}.
4.2.1 Examples: VC(-Hull) Classes

Lemma 4.2.5. Let ψ : R → R be a fixed monotone function. Then

{ x ↦ ψ(x − h) : h ∈ R }    ("translations")

is a VC class of index 2.

Figure 4.1: A set of two points cannot be shattered.

Lemma 4.2.6. Let F be a finite-dimensional vector space of measurable functions. Then V(F) ≤ dim(F) + 2, i.e., a finite-dimensional vector space of measurable functions is a VC class.
Proof. Let dim(F) = d and n = d + 2. Take any n points (x₁, t₁), …, (xₙ, tₙ) from X × R. For a basis b₁, …, b_d of F, each f ∈ F can be written f = ∑_{i=1}^d aᵢbᵢ, so the vector

( f(x₁) − t₁, …, f(xₙ) − tₙ )

lies in the subspace of Rⁿ spanned by the n − 1 (= d + 1) vectors (t₁, …, tₙ), (b₁(x₁), …, b₁(xₙ)), …, (b_d(x₁), …, b_d(xₙ)). Thus the points (f(x₁) − t₁, …, f(xₙ) − tₙ), f ∈ F, lie in an (n − 1)-dimensional subspace W of Rⁿ, and there exists a nonzero vector a ∈ Rⁿ\{0} s.t. a ⊥ W. WLOG at least one component aᵢ is strictly positive, and hence we get

∑_{aᵢ>0} aᵢ( f(xᵢ) − tᵢ ) = ∑_{aᵢ≤0} (−aᵢ)( f(xᵢ) − tᵢ ).  (4.4)

If {(xᵢ, tᵢ) : aᵢ > 0} (which is non-empty by the assumption) could be picked out by a subgraph, then ∃f ∈ F s.t.

{(xᵢ, tᵢ) : aᵢ > 0} = {(x, t) : t < f(x)} ∩ {(xᵢ, tᵢ) : i = 1, 2, …, n} = {(xᵢ, tᵢ) : tᵢ < f(xᵢ), i = 1, 2, …, n}.

It gives

{(xᵢ, tᵢ) : aᵢ ≤ 0} = {(xᵢ, tᵢ) : tᵢ ≥ f(xᵢ)},

but then

∑_{aᵢ>0} aᵢ( f(xᵢ) − tᵢ ) > 0 and ∑_{aᵢ≤0} (−aᵢ)( f(xᵢ) − tᵢ ) ≤ 0

hold, which contradicts (4.4). Hence the set {(xᵢ, tᵢ) : aᵢ > 0} cannot be picked out, i.e., {(x₁, t₁), …, (xₙ, tₙ)} cannot be shattered.  ∎
The following lemma gives some basic properties of VC classes of sets.

Lemma 4.2.7. Let C, D be VC classes of sets in X, and let E be a VC class of sets in Y. Also let φ : X → Y and ψ : Z → X be fixed functions. Then:

(i) C^c := {C^c : C ∈ C} is VC of index V(C).

(ii) C ⊓ D := {C ∩ D : C ∈ C, D ∈ D} is VC of index ≤ V(C) + V(D) − 1.

(iii) C ⊔ D := {C ∪ D : C ∈ C, D ∈ D} is VC of index ≤ V(C) + V(D) − 1.

(iv) D × E is VC of index ≤ V(D) + V(E) − 1.

(v) φ(C) is VC of index V(C) if φ is one-to-one.

(vi) ψ⁻¹(C) (inverse images) is VC of index ≤ V(C).
Proof. (i) A ⊆ {x₁, …, xₙ} is picked out by C
⟺ ∃C ∈ C s.t. A = C ∩ {x₁, …, xₙ}
⟺ ∃C^c ∈ C^c s.t. {x₁, …, xₙ}\A = C^c ∩ {x₁, …, xₙ}
⟺ {x₁, …, xₙ}\A is picked out by C^c,

and hence {x₁, …, xₙ} is shattered by C ⟺ {x₁, …, xₙ} is shattered by C^c.

(ii) C ⊓ D becomes VC (see the textbook), but I'm not sure that its index is bounded above by V(C) + V(D) − 1. The following (iii) and (iv) use (ii) in their proofs, and so they also become uncertain.

(iii) Note that C ⊔ D = (C^c ⊓ D^c)^c. Then (i) and (ii) end the proof.

(iv) (Regard D × E as {D × E : D ∈ D, E ∈ E}.) Note that

D × E = (D × Y) ∩ (X × E),

and hence by (ii) we just have to show that the class {D × Y : D ∈ D} is VC with index V(D), which seems obvious. Consider (x₁, y₁), …, (xₙ, yₙ) ∈ X × Y. Then

A ⊆ {(x₁, y₁), …, (xₙ, yₙ)} is picked out by {D × Y : D ∈ D}
⟹ ∃D ∈ D s.t. A = (D × Y) ∩ {(x₁, y₁), …, (xₙ, yₙ)} = {(xᵢ, yᵢ) : xᵢ ∈ D}
⟹ ∃D ∈ D s.t. {xᵢ : (xᵢ, yᵢ) ∈ A} = D ∩ {x₁, …, xₙ}
⟹ {xᵢ : (xᵢ, yᵢ) ∈ A} is picked out by D,

and vice versa. Thus {x₁, x₂, …, xₙ} is shattered ⟺ {(x₁, y₁), …, (xₙ, yₙ)} is shattered.

(v) A ⊆ {x₁, …, xₙ} is picked out by C
⟺ ∃C ∈ C s.t. A = C ∩ {x₁, …, xₙ}
⟺ ∃φ(C) ∈ φ(C) s.t. φ(A) = φ(C) ∩ φ({x₁, …, xₙ}),

because φ is one-to-one. Thus {x₁, …, xₙ} is shattered by C ⟺ {φ(x₁), …, φ(xₙ)} is shattered by φ(C).

(vi) Assume that A ⊆ {z₁, …, zₙ} is picked out by ψ⁻¹(C). Then ∃C ∈ C s.t. A = ψ⁻¹(C) ∩ {z₁, …, zₙ}. It gives that

ψ(A) = ψ( ψ⁻¹(C) ∩ {z₁, …, zₙ} ) = ψ(ψ⁻¹(C)) ∩ ψ({z₁, …, zₙ}) = C ∩ {ψ(z₁), …, ψ(zₙ)},

i.e., ψ(A) ⊆ {ψ(z₁), …, ψ(zₙ)} is picked out by C. Hence

{z₁, …, zₙ} is shattered by ψ⁻¹(C) ⟹ {ψ(z₁), …, ψ(zₙ)} is shattered by C.
Also note that, if {z₁, …, zₙ} is shattered by ψ⁻¹(C), then

∀A ⊆ {z₁, …, zₙ} ∃C ∈ C s.t. A = ψ⁻¹(C) ∩ {z₁, …, zₙ}.

If ψ(zᵢ) = ψ(z_j) for some i ≠ j, then {zᵢ} cannot be picked out; hence ψ(zᵢ) ≠ ψ(z_j) for i ≠ j. It yields that {z₁, …, zₙ} is not shattered by ψ⁻¹(C) for n = V(C); if it were, then {ψ(z₁), …, ψ(z_{V(C)})} would be shattered by C, contradicting the definition of V(C).
Lemma 4.2.8. Let F, G be VC classes of functions on X, and let g : X → R, φ : R → R, ψ : Z → X be fixed functions. Then:

(i) F ∧ G := {f ∧ g : f ∈ F, g ∈ G} (pointwise minima) is VC of index ≤ V(F) + V(G) − 1.

(ii) F ∨ G := {f ∨ g : f ∈ F, g ∈ G} (pointwise maxima) is VC of index ≤ V(F) + V(G) − 1.

(iii) {F > 0} := { {x : f(x) > 0} : f ∈ F } is a VC class of sets with index ≤ V(F).

(iv) −F is VC of index V(F).

(v) F + g = {f + g : f ∈ F} is VC of index V(F).

(vi) g · F = {g · f : f ∈ F} is VC of index ≤ 2V(F) − 1.

(vii) F ∘ ψ = {f(ψ) : f ∈ F} is VC of index ≤ V(F).

(viii) φ ∘ F = {φ(f) : f ∈ F} is VC of index ≤ V(F) if φ is monotone.
Proof. (i) The subgraph of f ∧ g is

{(x, t) : t < (f ∧ g)(x)} = {(x, t) : t < f(x)} ∩ {(x, t) : t < g(x)},

and hence (ii) of the previous lemma gives the assertion.

(ii) The subgraph of f ∨ g is

{(x, t) : t < (f ∨ g)(x)} = {(x, t) : t < f(x)} ∪ {(x, t) : t < g(x)},

and hence (iii) of the previous lemma gives the assertion.

(iii) For the one-to-one map φ : (x, 0) ↦ x, the collection of the sets {x : f(x) > 0} = φ({(x, 0) : 0 < f(x)}) = φ( {(x, t) : t < f(x)} ∩ (X × {0}) ) is a VC class by (ii) and (v) of the previous lemma.

(iv) For this, it suffices to show that the "closed subgraphs"

{(x, t) : t ≤ f(x)}, f ∈ F,

form a VC class of index V(F) (∵ {(x, t) : t < −f(x)} = {(x, −t) : t > f(x)} = ψ( {(x, t) : t ≤ f(x)}^c ) for the one-to-one map ψ(x, t) = (x, −t)). Suppose that the closed subgraphs shatter {(x₁, t₁), …, (xₙ, tₙ)}. Then
∃f₁, f₂, …, f_m, m = 2ⁿ, which pick out each subset of {(x₁, t₁), …, (xₙ, tₙ)}. Set 2ε := inf{ tᵢ − f_j(xᵢ) : tᵢ − f_j(xᵢ) > 0 }. Then {(x₁, t₁ − ε), …, (xₙ, tₙ − ε)} is shattered by the open subgraphs.

Figure 4.2: {(xᵢ, tᵢ) : i ∈ I} is picked out by closed subgraphs ⟹ {(xᵢ, tᵢ − ε) : i ∈ I} is picked out by open subgraphs.

Conversely, assume that the open subgraphs shatter {(x₁, t₁), …, (xₙ, tₙ)}. Then again ∃f₁, …, f_m which pick out each subset. Set 2ε := inf{ f_j(xᵢ) − tᵢ : f_j(xᵢ) − tᵢ > 0 }. Then {(x₁, t₁ + ε), …, (xₙ, tₙ + ε)} is shattered by the closed subgraphs.

(v) F + g shatters {(x₁, t₁), …, (xₙ, tₙ)}
⟺ ∀A ⊆ {(x₁, t₁), …, (xₙ, tₙ)} ∃f ∈ F s.t. A = {(x, t) : t < f(x) + g(x)} ∩ {(x₁, t₁), …, (xₙ, tₙ)}
⟺ ∀A ⊆ {(x₁, t₁), …, (xₙ, tₙ)} ∃f ∈ F s.t. A = {(xᵢ, tᵢ) : tᵢ − g(xᵢ) < f(xᵢ)}
⟺ ∀A ⊆ {(xᵢ, tᵢ − g(xᵢ)) : i = 1, 2, …, n} ∃f ∈ F s.t. A = {(xᵢ, tᵢ − g(xᵢ)) : tᵢ − g(xᵢ) < f(xᵢ)}
⟺ F shatters {(x₁, t₁ − g(x₁)), …, (xₙ, tₙ − g(xₙ))}.
(vi) The subgraph of g · f is

{(x, t) : t < f(x)g(x)} = {(x, t) : t < f(x)g(x), g(x) > 0} (=: C_f⁺)
∪ {(x, t) : t < f(x)g(x), g(x) < 0} (=: C_f⁻)
∪ {(x, t) : t < f(x)g(x), g(x) = 0} (=: C_f⁰).

Note that

{C_f⁺ : f ∈ F} shatters {(x₁, t₁), …, (xₙ, tₙ)} ⊆ {x : g(x) > 0} × R
⟺ ∀A ⊆ {(x₁, t₁), …, (xₙ, tₙ)} ∃f ∈ F s.t. A = {(xᵢ, tᵢ) : tᵢ < f(xᵢ)g(xᵢ), g(xᵢ) > 0} = {(xᵢ, tᵢ) : tᵢ/g(xᵢ) < f(xᵢ), g(xᵢ) > 0}
⟺ the subgraphs of F shatter {(x₁, t₁/g(x₁)), …, (xₙ, tₙ/g(xₙ))},

and hence {C_f⁺ : f ∈ F} is VC in {x : g(x) > 0} × R. Similarly, {C_f⁻ : f ∈ F} is VC in {x : g(x) < 0} × R.
Finally,
C0_f = {(x, t) : t < 0, g(x) = 0} = (X ∩ {g = 0}) × (−∞, 0)
is VC of index ≤ 2.
(vii) The subgraph of f(ψ) is {(x, t) : t < f(ψ(x))}, which is the inverse image of {(x, t) : t < f(x)} under the map (x, t) 7→ (ψ(x), t).
(viii) Let φ ◦ F shatter (x1, t1), · · · , (xn, tn). Then ∃f1, · · · , fm ∈ F which pick out each of the m subsets. If we define si = max{fj(xi) : ti ≥ φ(fj(xi))}, then
si ≥ fj(xi) ⇐⇒ ti ≥ φ(fj(xi)),
i.e.,
si < fj(xi) ⇐⇒ ti < φ(fj(xi)).
Thus for Aj = {(xi, ti) : ti < φ(fj(xi))} for each j = 1, 2, · · · , m, fj picks out {(xi, si) : (xi, ti) ∈ Aj}. It implies that F shatters (x1, s1), · · · , (xn, sn).
Lemma 4.2.9. Let C be a class of sets in X and F = {1_C : C ∈ C}. Then
C is VC ⇐⇒ F is VC, and V(C) = V(F).
Proof. Note that the subgraph of 1_C is
{(x, t) : t < 1_C(x)} = (C × (−∞, 1)) ∪ (C^c × (−∞, 0)).
Assume that C shatters {x1, · · · , xn}. Then for each subset Aj of {x1, · · · , xn}, ∃Cj ∈ C s.t.
Aj = Cj ∩ {x1, · · · , xn}
by definition. Then
{(xi, 1/2) : xi ∈ Aj} = [(Cj × (−∞, 1)) ∪ (Cj^c × (−∞, 0))] ∩ {(x1, 1/2), · · · , (xn, 1/2)}
holds, which gives that F shatters (x1, 1/2), · · · , (xn, 1/2). ∴ V(F) ≥ V(C).
Conversely, let F shatter (x1, t1), · · · , (xn, tn). If ∃ti s.t. ti ≥ 1, then (xi, ti) cannot be picked out; if ∃ti s.t. ti < 0, then (xj, tj) cannot be picked out for j ≠ i. Hence 0 ≤ ti < 1 (which also implies xi ≠ xj for i ≠ j), and ∀Aj ⊆ {(x1, t1), · · · , (xn, tn)} ∃Cj ∈ C s.t.
Aj = {(xi, ti) : ti < 1_{Cj}(xi)} = [(Cj × (−∞, 1)) ∪ (Cj^c × (−∞, 0))] ∩ {(x1, t1), · · · , (xn, tn)}.
It gives that
{xi : (xi, ti) ∈ Aj} = Cj ∩ {x1, · · · , xn}.
Thus C shatters {x1, · · · , xn}, which gives V(F) ≤ V(C). Therefore, we get V(C) = V(F).
Figure 4.3: (a) Proof of V(C) ≤ V(F): F shatters the points (xi, 1/2). (b) Proof of V(F) ≤ V(C): if F shatters {(xi, ti)}_{i=1}^n, then all ti ∈ [0, 1).
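Lemma 4.2.9 can be illustrated by brute force for one concrete class, say the half-lines C = {(−∞, c] : c ∈ R} on the line; the function names and the choice of C are mine, not from the notes:

```python
def pickouts_sets(points):
    """Subsets of `points` picked out by the half-lines (-inf, c]."""
    pts = sorted(points)
    cands = ([pts[0] - 1.0]
             + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
             + [pts[-1] + 1.0])
    return {frozenset(p for p in points if p <= c) for c in cands}

def pickouts_subgraphs(points):
    """Subsets of {(x, 1/2)} picked out by subgraphs of the indicators 1_C."""
    pts = sorted(points)
    cands = ([pts[0] - 1.0]
             + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
             + [pts[-1] + 1.0])
    # (x, 1/2) lies in the subgraph of 1_{(-inf,c]} iff 1/2 < 1_{(-inf,c]}(x)
    return {frozenset(x for x in points if 0.5 < (1.0 if x <= c else 0.0))
            for c in cands}

# the sets and the indicator subgraphs pick out the same subsets, as in the proof
assert pickouts_sets([1.0, 2.0]) == pickouts_subgraphs([1.0, 2.0])
assert len(pickouts_sets([3.0])) == 2       # one point is shattered
assert len(pickouts_sets([1.0, 2.0])) == 3  # two points are not: V(C) = 2
```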
Example 4.2.10. Let F be the set of all monotone functions f : R → [0, 1]. Then F is not a VC class, but it is contained in a VC-hull class.
Proof. F is not a VC class iff V(F) = ∞, i.e., if for any n there exists a set {(xi, yi)}_{i=1}^n that can be shattered by F.
Figure 4.4: F is not a VC class.
Putting x1 < x2 < · · · < xn and 0 < y1 < y2 < · · · < yn < 1, we can find a monotone function which picks out each subset of {(x1, y1), · · · , (xn, yn)}. Thus F is not a VC class.
However, F is contained in a VC-hull class. To show this, we have to show that F ⊆ sconv G for some VC class G.
Figure 4.5: F is contained in a VC-hull class.
Let G = {1_{[a,∞)} : a ∈ R̄}, which is a VC class. Define
xi = inf{x : f(x) ≥ i/n}.
Then f(xi) ≥ i/n, and on the interval (xi, xi+1),
i/n ≤ f(x) < (i + 1)/n, i.e., 0 ≤ f(x) − i/n < 1/n.
Thus for fn(x) := Σ_{i=1}^{n−1} (i/n) 1_{[xi, xi+1)}(x) + 1_{[xn,∞)}(x), we get
|f(x) − fn(x)| ≤ 1/n, and therefore fn(x) → f(x) as n → ∞.
Now note that, for xn+1 = ∞,
fn(x) = Σ_{i=1}^n (i/n) 1_{[xi, xi+1)}(x) = Σ_{i=1}^n Σ_{j=1}^i (1/n) 1_{[xi, xi+1)}(x) = Σ_{j=1}^n Σ_{i=j}^n (1/n) 1_{[xi, xi+1)}(x) = Σ_{j=1}^n (1/n) 1_{[xj, ∞)}(x),
and therefore fn ∈ sconv G (for convenience, [∞, ∞) is regarded as ∅), i.e., f ∈ sconv G.
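The approximation step can be checked numerically for one concrete monotone f; the choice f(x) = min(max(x, 0), 1) is my own illustrative assumption, for which xi = i/n exactly:

```python
n = 10
xs = [i / n for i in range(1, n + 1)]   # x_i = inf{x : f(x) >= i/n} = i/n here

def f(x):
    return min(max(x, 0.0), 1.0)        # a concrete monotone f : R -> [0, 1]

def f_n(x):
    # f_n = sum_{j=1}^n (1/n) 1_{[x_j, infinity)}, an element of sconv(G)
    return sum(1.0 / n for xj in xs if x >= xj)

grid = [k / 1000 for k in range(-200, 1201)]
err = max(abs(f(x) - f_n(x)) for x in grid)
assert err <= 1.0 / n + 1e-12           # |f - f_n| <= 1/n, as claimed
```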
Remark 4.2.11. Later we will show that for any r ≥ 1 and probability measure Q,
logN_{[ ]}(ε, F, L_r(Q)) ≤ K (1/ε)
holds, where K depends only on r. This gives a tighter bound than Theorem 4.2.2.
Example 4.2.12 (Half space). Let C = {{x ∈ R^d : ⟨x, u⟩ ≤ c} : u ∈ R^d, c ∈ R} be the collection of half-spaces. Then V(C) = d + 2.
Proof. It is easy to show that C shatters {0, e1, · · · , ed}, where ei is the ith standard basis vector. It gives V(C) ≥ d + 2.
Figure 4.6: For any subset A ⊆ {0, e1, · · · , ed} we can find a hyperplane separating A and {0, e1, · · · , ed} \ A.
On the other hand, by Radon’s theorem, any d + 2 points in Rd can be partitioned into two sets,
whose convex hulls have non-empty intersection.
Figure 4.7: Radon’s theorem.
These two subsets cannot be separated by a hyperplane: if they could, then since each half-space is convex, their convex hulls would be disjoint, contradicting Radon's theorem. Thus no d + 2 points can be shattered, which gives V(C) ≤ d + 2.
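The lower bound V(C) ≥ d + 2 can be verified by exhaustive search in a small dimension; the sketch below takes d = 2, and the finite grid of normals is a convenience of mine (it happens to suffice for these three points):

```python
from itertools import product

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # 0, e1, e2: d + 1 = 3 points

pickouts = set()
for u in product((-1.0, 0.0, 1.0), repeat=2):      # small grid of normals u
    vals = sorted({u[0] * p[0] + u[1] * p[1] for p in pts})
    cands = ([vals[0] - 1.0]
             + [(a + b) / 2 for a, b in zip(vals, vals[1:])]
             + [vals[-1] + 1.0])                   # thresholds c between values
    for c in cands:
        pickouts.add(frozenset(p for p in pts
                               if u[0] * p[0] + u[1] * p[1] <= c))

assert len(pickouts) == 2 ** len(pts)   # all 8 subsets appear: shattered
```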
4.3 Bracketing Numbers
In this section, we see several examples of function classes which satisfy a uniform bracketing entropy condition.
4.3.1 Monotone Functions
Theorem 4.3.1 (Monotone Functions). Let F be the set of all monotone functions from R to [0, 1]. Then (for small ε)
logN_{[ ]}(ε, F, L_r(Q)) ≤ K (1/ε)   ∀r ≥ 1 and every probability measure Q,
where K is a constant depending only on r.
Sketch of proof. Because the full bound is too difficult to prove here, we show the milder version
logN_{[ ]}(ε, F, L_r(Q)) ≤ K (1/ε) log(1/ε).
Let ε, r, and Q be given, and just for convenience assume that Q is continuous. Partition R into
−∞ = t0 < t1 < · · · < t_{N^{r+1}} = ∞   s.t. Q(ti, ti+1] = ε^{r+1},
where N = ⌊1/ε⌋ + 1. Also let
S = {all monotone step functions jumping only at the tj's, with jumps of size k/N (≈ kε), k = 1, 2, · · · , N}.
Figure 4.8: One element of S.
Note that
|S| = (# of ways of choosing the "jump locations", with repetition, among the N^{r+1} + 1 points tj) = (N^{r+1} + N choose N),
and so the cardinality of the collection {[l, u] : l, u ∈ S} of brackets is smaller than |S × S| = |S|^2 = (N^{r+1} + N choose N)^2, and
log |S × S| ≲ log (N^{r+1} + N choose N) ≤ N log(N^{r+1} + N) ≲ ((r + 1)/ε) log(1/ε),
because N ≥ 1/ε and N^r ≥ N hold. For given f ∈ F, let l, u ∈ S be s.t.
l ≤ f ≤ u and ‖u − l‖_{Q,r} is minimized.
Figure 4.9: f ∈ F and the bracket [l, u] on a specific interval (ti, ti+1).
Let
N1 = #{j : f(tj+1) − f(tj) ≤ 1/N}   ("# of jumps less than ε"),
N2 = #{j : 1/N < f(tj+1) − f(tj) ≤ 2/N}   ("# of jumps less than 2ε"),
and so on. Then clearly we get
Σ_j Nj = N^{r+1}   ("total partition")   and   Σ_j Nj (j/N) ≤ 1   ("total jump size").
Then we have
Q(u − l)^r = ∫ (u − l)^r dQ = Σ_j ∫_{tj}^{tj+1} (u − l)^r dQ   (*)
 ≤ Σ_j ((j + 2)/N)^r Nj ε^{r+1}   (∵ Q(tj, tj+1] = ε^{r+1})   (**)
 ≤ 3^r ε^{r+1} Σ_j (j/N)^r Nj
 ≤ 3^r ε^{r+1} Σ_j (j/N) Nj   (∵ (j/N)^r ≤ j/N, from j/N ≤ 1)
 ≤ 3^r ε^{r+1} ≤ ε^r
for small ε > 0. (Note that the index j in (*) and (**) has different meanings: in (*) it indexes the tj's, while in (**) it indexes the jump-size classes.) For the inequality (**), see Figure 4.9: on a cell in jump class j, u − l ≤ (j + 2)/N. In the next inequality, simply j + 2 ≤ 3j is used. Thus we get
‖u − l‖_{Q,r} ≤ ε
for small ε (precisely, ε < 3^{−r}). It gives
logN_{[ ]}(ε, F, L_r(Q)) ≤ log |S × S|,
which is smaller than ε^{−1} log ε^{−1} up to a constant multiple, the constant depending only on r.
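The bracket construction above can be run in a simplified form (r = 1, Q uniform on [0, 1], equal-mass cells, a monotone f of my choosing; all simplifications are mine, not the notes'):

```python
import math

N = 50
t = [j / N for j in range(N + 1)]        # cells (t_j, t_{j+1}] of Q-mass 1/N

def f(x):
    return x * x                          # an illustrative monotone f

def l(x):                                 # lower step function on the k/N grid
    j = min(int(x * N), N - 1)
    return math.floor(N * f(t[j])) / N

def u(x):                                 # upper step function on the k/N grid
    j = min(int(x * N), N - 1)
    return math.ceil(N * f(t[j + 1])) / N

grid = [k / 5000 for k in range(5001)]
assert all(l(x) <= f(x) <= u(x) for x in grid)        # [l, u] brackets f
l1_size = sum(u(x) - l(x) for x in grid) / len(grid)  # ~ ||u - l||_{Q,1}
assert l1_size <= 3.0 / N   # total variation 1 plus two grid steps per cell
```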
4.3.2 Smooth functions and sets
Let X be a bounded convex subset of Rd with nonempty interior.
Definition 4.3.2 (α-Hölder continuity). For 0 < α ≤ 1, a function f : X → R is called α-Hölder continuous if
sup_{x≠y} |f(x) − f(y)| / ‖x − y‖^α < ∞.
In particular, if α = 1, then f is Lipschitz continuous. Note also that larger α corresponds to greater smoothness. It is then natural to try to extend Definition 4.3.2 to α > 1.
Remark 4.3.3. However, such an extension is not straightforward. Suppose that for some α > 1,
sup_{x≠y} |f(x) − f(y)| / ‖x − y‖^α < ∞
holds. Then
|f(x + th) − f(x)| / t ≲ t^α / t → 0 as t → 0
for any unit vector h ∈ R^d. Thus every directional derivative of f is 0, i.e., f is a constant function. In summary, α-Hölder continuity for α > 1 is satisfied only in trivial cases.
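Remark 4.3.3 can be seen numerically: for a non-constant smooth f (here sin, my choice) the Hölder quotient blows up as the points approach each other when α > 1:

```python
import math

alpha = 1.5                     # any alpha > 1

def quotient(h, x=0.0):
    # the Holder quotient |f(x+h) - f(x)| / h^alpha for f = sin
    return abs(math.sin(x + h) - math.sin(x)) / h ** alpha

# the quotient grows without bound as h -> 0 (here like h^{-1/2}),
# so the sup over x != y is infinite unless f is constant
assert quotient(1e-6) > quotient(1e-3) > quotient(1e-1)
```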
Hence we extend the concept in another way: we consider α-Hölder continuity of derivatives.
Definition 4.3.4. From now on, let α̲ denote the largest integer strictly smaller than α. For example, α̲ = 0
for α = 1. Also, for a vector k = (k1, · · · , kd) of d nonnegative integers, define
D^k = ∂^{k·} / (∂x1^{k1} · · · ∂xd^{kd}),   where k· = Σ_{j=1}^d kj.
Definition 4.3.5 (α-Hölder norm). Let α > 0 and let f be a function with uniformly bounded partial derivatives up to order α̲ whose kth-order derivatives D^k f with k· = α̲ are (α − α̲)-Hölder continuous. Then the α-Hölder norm of f is defined as
‖f‖_α := max_{k : k· ≤ α̲} ‖D^k f‖_∞ + max_{k : k· = α̲} sup_{x≠y} |D^k f(x) − D^k f(y)| / ‖x − y‖^{α−α̲}.
Also, let C^α_M(X) be the set of all continuous functions f : X → R with ‖f‖_α ≤ M.
The following two results (Theorem 4.3.6 and Corollary 4.3.7) bound the covering and bracketing entropy of C^α_1(X), the case M = 1. Note that for the bound to be finite, the domain X should be a bounded subset of R^d.
Theorem 4.3.6.
logN(ε, C^α_1(X), ‖ · ‖_∞) ≤ K λ(X^1) (1/ε)^{d/α},
where K is a constant depending only on α and d, λ is the Lebesgue measure, and X^1 = {x ∈ R^d : ‖x − X‖ < 1}.
Proof. We prove the assertion for d = 1 (the proof for general d is similar). Since the functions in C^α_1(X) are continuous on X, WLOG we may assume that X is open, so that Taylor's theorem can be applied everywhere on X. Let ε ∈ (0, 1] be given and δ = ε^{1/α}. Let x1 < x2 < · · · < xm be a δ-net for X s.t. m ≲ λ(X^1)/δ. For nonnegative integers k ≤ α̲ and f ∈ C^α_1(X), define
A_k f := (⌊f^{(k)}(x1)/δ^{α−k}⌋, · · · , ⌊f^{(k)}(xm)/δ^{α−k}⌋).
Then the vector δ^{α−k} A_k f consists of the values f^{(k)}(xi) discretized on a grid of mesh-width δ^{α−k}.
Claim.) For f, g ∈ C^α_1(X), if A_k f = A_k g ∀k ≤ α̲, then ‖f − g‖_∞ ≲ ε.
Proof of Claim. By Taylor's theorem, for x near the net point xi,
(f − g)(x) = Σ_{k=0}^{α̲−1} ((f^{(k)} − g^{(k)})(xi)/k!) (x − xi)^k + (1/α̲!) (f^{(α̲)} − g^{(α̲)})(x̄) (x − xi)^{α̲}
for some x̄ lying between x and xi. Equivalently,
(f − g)(x) = Σ_{k=0}^{α̲} ((f^{(k)} − g^{(k)})(xi)/k!) (x − xi)^k + (1/α̲!) ((f^{(α̲)} − g^{(α̲)})(x̄) − (f^{(α̲)} − g^{(α̲)})(xi)) (x − xi)^{α̲}.
Note that for f ∈ C^α_1(X), ‖f‖_α ≤ 1, which gives
sup_{x≠y} |f^{(α̲)}(x) − f^{(α̲)}(y)| / |x − y|^{α−α̲} ≤ 1,
and hence we get
|f^{(α̲)}(x̄) − f^{(α̲)}(xi)| ≤ |x̄ − xi|^{α−α̲} ≤ δ^{α−α̲}.
Similarly,
|g^{(α̲)}(x̄) − g^{(α̲)}(xi)| ≤ δ^{α−α̲}
holds for g ∈ C^α_1(X). Since A_k f = A_k g, we get
⌊f^{(k)}(xi)/δ^{α−k}⌋ = ⌊g^{(k)}(xi)/δ^{α−k}⌋   ∀xi,
which gives |(f^{(k)} − g^{(k)})(xi)| ≤ δ^{α−k} for any k ≤ α̲ and i. Thus we get
|(f − g)(x)| ≤ Σ_{k=0}^{α̲} (δ^{α−k}/k!) |x − xi|^k + (1/α̲!) · 2δ^{α−α̲} |x − xi|^{α̲}
 ≤ Σ_{k=0}^{α̲} δ^α/k! + (2/α̲!) δ^α   (∵ |x − xi| ≤ δ)
 ≲ δ^α (more precisely, ≤ (2 + e)δ^α), and δ^α = ε.   (Claim)
By the claim, there exists a constant C depending only on α s.t.
N(Cε, C^α_1(X), ‖ · ‖_∞) ≤ |{Af : f ∈ C^α_1(X)}|,
where Af is the matrix with rows A_0 f, A_1 f, · · · , A_{α̲} f. Note that the number of possible values of ⌊f^{(k)}(xi)/δ^{α−k}⌋ is smaller than 2/δ^{α−k} + 1 (∵ ‖f‖_α ≤ 1 yields |f^{(k)}(xi)| ≤ 1 by definition), which does not exceed 2δ^{−α} + 1. Thus each column of Af can take at most (2δ^{−α} + 1)^{α̲+1} different values. It gives
|{Af : f ∈ C^α_1(X)}| ≤ (2δ^{−α} + 1)^{(α̲+1)m},
i.e.,
log |{Af : f ∈ C^α_1(X)}| ≤ m(α̲ + 1) log(2δ^{−α} + 1) ≲ (λ(X^1)/δ)(α̲ + 1) log δ^{−α} ≲ λ(X^1) (1/ε)^{1/α} log(1/ε).
Now our goal is to remove the log(1/ε) term. This uses the smoothness of the functions: f cannot blow up or blow down over a short interval.
Note that
f^{(k)}(xi+1) = Σ_{l=0}^{α̲−k} f^{(k+l)}(xi) (xi+1 − xi)^l / l! + R
with
|R| ≲ (xi+1 − xi)^{α−k} ≤ δ^{α−k}.
Then we get (motivation: f^{(k+l)}(xi) ≈ ⌊f^{(k+l)}(xi)/δ^{α−(k+l)}⌋ δ^{α−(k+l)})
| f^{(k)}(xi+1) − Σ_{l=0}^{α̲−k} ⌊f^{(k+l)}(xi)/δ^{α−(k+l)}⌋ δ^{α−(k+l)} (xi+1 − xi)^l / l! |
 ≤ | f^{(k)}(xi+1) − Σ_{l=0}^{α̲−k} f^{(k+l)}(xi) (xi+1 − xi)^l / l! |   (= |R|)
  + | Σ_{l=0}^{α̲−k} (f^{(k+l)}(xi) − ⌊f^{(k+l)}(xi)/δ^{α−(k+l)}⌋ δ^{α−(k+l)}) (xi+1 − xi)^l / l! |
 ≲ δ^{α−k} + Σ_{l=0}^{α̲−k} δ^{α−(k+l)} | f^{(k+l)}(xi)/δ^{α−(k+l)} − ⌊f^{(k+l)}(xi)/δ^{α−(k+l)}⌋ | (xi+1 − xi)^l / l!   (the middle factor is ≤ 1)
 ≤ δ^{α−k} + Σ_{l=0}^{α̲−k} δ^{α−(k+l)} δ^l / l!   (each term equals δ^{α−k}/l!)
 ≲ δ^{α−k}.
It implies that, given the ith column of Af, the number of possible values of the (i + 1)th column is bounded by a constant K depending only on α. Thus we get
N(Cε, C^α_1(X), ‖ · ‖_∞) ≤ |{Af : f ∈ C^α_1(X)}| ≲ (2δ^{−α} + 1)^{α̲+1} (first column) × K^{m−1} (remaining columns),
i.e.,
logN(Cε, C^α_1(X), ‖ · ‖_∞) ≲ log(1/ε) + m ≲ log(1/ε) + (1/ε)^{1/α} λ(X^1) ≲ (1/ε)^{1/α} λ(X^1).
Corollary 4.3.7.
logN_{[ ]}(ε, C^α_1(X), L_r(Q)) ≲ λ(X^1) (1/ε)^{d/α}.
Proof. The basic idea is that, with respect to the ‖ · ‖_∞ norm, bracketing entropy is equivalent to (covering) entropy. Let f1, · · · , fp be the centers of ‖ · ‖_∞-balls of radius ε that cover C^α_1(X). Then the brackets [fi − ε, fi + ε] cover C^α_1(X), and each bracket has L_r(Q)-size at most 2ε. Thus we have
logN_{[ ]}(ε, C^α_1(X), L_r(Q)) ≤ logN(ε/2, C^α_1(X), ‖ · ‖_∞) ≲ λ(X^1) (2/ε)^{d/α} ≲ λ(X^1) (1/ε)^{d/α}.
The corollary implies that C^α_1[0, 1]^d is universally Donsker for α > d/2, by the bracketing Donsker theorem. For instance, on the unit interval in the line (d = 1), uniform boundedness and Hölder continuity of order > 1/2 suffice, and on the unit square it suffices that the partial derivatives exist and satisfy a Lipschitz condition.
The previous results are restricted to bounded subsets of Euclidean space (the bound involves the term λ(X^1)). Under appropriate conditions on the tails of the underlying distribution, they can be extended to classes of functions on the whole of Euclidean space.
Corollary 4.3.8. Let R^d = ∪_{j=1}^∞ Ij be a partition of R^d into bounded convex sets Ij with nonempty interior, and let F be a class of functions f : R^d → R s.t. the restrictions f|_{Ij} belong to C^α_{Mj}(Ij) for every j. Then ∃K depending only on α, V, r and d such that
logN_{[ ]}(ε, F, L_r(Q)) ≤ K (1/ε)^V ( Σ_{j=1}^∞ λ(Ij^1)^{r/(V+r)} Mj^{Vr/(V+r)} Q(Ij)^{r/(V+r)} )^{(V+r)/r}
for every ε > 0, V ≥ d/α and probability measure Q.
For the collection of subgraphs to be Donsker, a smoothness condition on the underlying measure is needed, in addition to sufficient smoothness of the graphs. The following result implies that the subgraphs (contained in R^{d+1}) of the functions in C^α_1[0, 1]^d are P-Donsker for any Lebesgue-dominated measure P with bounded density, provided α > d. For instance, for the sets cut out of the plane by functions f : [0, 1] → [0, 1], a uniform Lipschitz condition of any order on the derivatives suffices.
Corollary 4.3.9. Let C_{α,d} be the collection of subgraphs of C^α_1[0, 1]^d. Then there exists a constant K depending only on α and d s.t.
logN_{[ ]}(ε, C_{α,d}, L_r(Q)) ≤ K ‖q‖_∞^{d/α} (1/ε)^{dr/α}
for any r ≥ 1, ε > 0, and probability measure Q with bounded Lebesgue density q on R^{d+1}. Here, "bracketing" C_{α,d} is understood in the sense that brackets [Ci, Di] of sets cover C_{α,d} when their indicator functions bracket the indicator functions of the sets (= subgraphs) in C_{α,d}.
4.3.3 Closed convex sets and Convex functions
Theorem 4.3.10. Let A be a bounded subset of R^d, d ≥ 2. Define
C = {all compact convex subsets of A},
and let Q be a Lebesgue absolutely continuous probability measure. Then ∃ a constant K depending only on A, Q, and d s.t.
logN_{[ ]}(ε, C, L_r(Q)) ≤ K (1/ε)^{(d−1)r/2}.
Remark 4.3.11. Note that C is not a VC class: for any n, there exist n points which can be shattered.
Figure 4.10: For such n points, any subset can be picked out by a compact convex set.
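Remark 4.3.11 can be verified numerically, assuming (my illustrative configuration) n points on the unit circle: each point is linearly separated from the convex hull of the others, so hull(A) ∩ {points} = A for every subset A:

```python
import math

n = 12
pts = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
       for k in range(n)]

for p in pts:
    others = [q for q in pts if q != p]
    # the linear functional x -> <x, p> separates p from hull(others):
    # <p, p> = 1 while <q, p> = cos(angle between them) < 1 for q != p
    assert max(q[0] * p[0] + q[1] * p[1] for q in others) < 1.0 - 1e-9

# hence every subset A equals hull(A) ∩ {points}: n points are shattered
# by compact convex sets for every n, so the class is not VC
```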
The following theorem is a function-class version of Theorem 4.3.10.
Theorem 4.3.12. Let A be a compact convex subset of R^d (d ≥ 2), and
F = {all convex functions f : A → [0, 1] s.t. |f(x) − f(y)| ≤ L‖x − y‖ ∀x, y ∈ A}.
Then ∃K depending only on d and A s.t.
logN(ε, F, ‖ · ‖_∞) ≤ K (1 + L)^{d/2} (1/ε)^{d/2}.
Theorem 4.3.13. Let (T, d) be a semimetric space and F = {ft : X → R | t ∈ T} s.t.
|fs(x) − ft(x)| ≤ F(x) d(s, t)   ∀s, t ∈ T, x ∈ X
for some fixed function F (i.e., F is a class of functions x 7→ ft(x) that are Lipschitz in the index parameter t ∈ T). Then
N_{[ ]}(2ε‖F‖, F, ‖ · ‖) ≤ N(ε, T, d)
for any norm ‖ · ‖.
Proof. Let t1, · · · , tp be an ε-net for (T, d). Then the brackets [fti − εF, fti + εF] cover F (∵ ∀s ∈ T ∃ti s.t. |fti(x) − fs(x)| ≤ F(x) d(ti, s) < εF(x), which gives fti − εF ≤ fs ≤ fti + εF), and these brackets have size 2ε‖F‖.
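Theorem 4.3.13 can be checked on the concrete class ft(x) = |x − t|, t ∈ T = [0, 1] (an illustrative choice of mine), which is Lipschitz in t with envelope F ≡ 1:

```python
import random

eps = 0.05
net = [eps * (2 * i + 1) for i in range(int(1 / (2 * eps)))]  # eps-net of [0,1]
grid = [k / 100 for k in range(101)]

random.seed(0)
for _ in range(300):
    s = random.random()
    ti = min(net, key=lambda t: abs(t - s))
    assert abs(ti - s) <= eps + 1e-12            # net property
    # the bracket [f_ti - eps*F, f_ti + eps*F] contains f_s pointwise
    assert all(abs(x - ti) - eps - 1e-12 <= abs(x - s) <= abs(x - ti) + eps + 1e-12
               for x in grid)
```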
4.4 Further topic: Tail bounds
In this section, we derive moment and tail bounds for the supremum ‖Gn‖F of the empirical process. Thus we consider non-asymptotic, finite-sample behavior, in the spirit of concentration inequalities. Note that if F is P-Donsker, then ‖Gn‖F = OP(1). Also note that
Gn f →d N(0, P(f − Pf)^2) as n → ∞.
Let F be an envelope of F, and define
J(δ, F) = sup_Q ∫_0^δ √(1 + logN(ε‖F‖_{Q,2}, F, L2(Q))) dε,
J_{[ ]}(δ, F) = ∫_0^δ √(1 + logN_{[ ]}(ε‖F‖_{P,2}, F, L2(P))) dε,
where the supremum is taken over all discrete probability measures Q with ‖F‖_{Q,2} > 0.
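For a class with bracketing entropy of order 1/ε (such as the monotone class of Theorem 4.3.1), the entropy integral is finite, since ∫_0^1 √(1/ε) dε < ∞. A crude midpoint-rule check with the constant set to 1 (the constant is arbitrary here):

```python
# midpoint-rule approximation of integral_0^1 sqrt(1 + 1/eps) d eps
n = 10_000
J = sum((1.0 + 1.0 / ((i + 0.5) / n)) ** 0.5 for i in range(n)) / n

# exact value is sqrt(2) + ln(1 + sqrt(2)) ≈ 2.296: the integral converges
assert 2.0 < J < 3.0
```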
Theorem 4.4.1. Let F be a P-measurable class of measurable functions with measurable envelope F. Then
‖‖Gn‖∗F‖_{P,p} ≤ K J(1, F) ‖F‖_{P, 2∨p}   ∀p ≥ 1,
where K depends only on p.
Theorem 4.4.2. Let F be a class of measurable functions with measurable envelope function F. Then
‖‖Gn‖∗F‖_{P,1} ≲ J_{[ ]}(1, F) ‖F‖_{P,2}.
Theorem 4.4.3. Let F be a class of measurable functions with measurable envelope function F. Then
‖‖Gn‖∗F‖_{P,p} ≲ ‖‖Gn‖∗F‖_{P,1} + n^{−1/2+1/p} ‖F‖_{P,p}   (p ≥ 2),
‖‖Gn‖∗F‖_{P,ψp} ≲ ‖‖Gn‖∗F‖_{P,1} + n^{−1/2}(1 + log n)^{1/p} ‖F‖_{P,ψp}   (0 < p ≤ 1),
‖‖Gn‖∗F‖_{P,ψp} ≲ ‖‖Gn‖∗F‖_{P,1} + n^{−1/2+1/q} ‖F‖_{P,ψp}   (1 < p ≤ 2),
where q is the Hölder conjugate of p and the constants in the inequalities ≲ depend only on the type of norm involved in the statement.