Draft

Advanced Probability Theory (Fall 2017)

J.P.Kim, Dept. of Statistics

Last modified: November 28, 2017



Preface & Disclaimer

This note is a summary of the lecture Advanced Probability Theory (326.729A) held at Seoul National University in Fall 2017. The lecturer was Minwoo Chae, and the note was written by J.P.Kim, a Ph.D. student. The textbooks and references for this course are the following.

• Weak Convergence and Empirical Processes with Applications to Statistics, Van der Vaart & Wellner, Springer, 1996.

• Asymptotic Statistics, Van der Vaart, Cambridge University Press, 1998.

I also referred to the following books while writing this note. The list will be updated continuously.

• Convergence of probability measures, Billingsley, John Wiley & Sons, 2013.

• Lecture notes on Topics in Mathematics I (3341.445) held by Gerald Trutnau (spring 2015).

Finally, some examples and motivation are supplemented based on my own lecture notes from

• Probability Theory I (326.513) on spring 2016;

• Theory of Statistics II (326.522) on fall 2016,

most of which are available at https://jpkimstat.wordpress.com/notes-and-slides.

To report typos or mistakes, please contact: [email protected]


Chapter 1

Stochastic Convergence

1.1 Motivation

Recall some basic results in asymptotics.

Theorem 1.1.1 (SLLN). Let $X_1, X_2, \cdots, X_n$ be i.i.d. random variables with $E|X_1| < \infty$. Then
\[ \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow[n\to\infty]{P\text{-a.s.}} E X_1. \]

Theorem 1.1.2 (CLT). Let $X_1, X_2, \cdots, X_n$ be i.i.d. random variables with $E|X_1|^2 < \infty$. Then
\[ \frac{1}{\sqrt{n}}\sum_{i=1}^{n} (X_i - \mu) \xrightarrow[n\to\infty]{d} N(0, \sigma^2), \]
where $\mu = EX_1$ and $\sigma^2 = EX_1^2 - \mu^2$.
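As a quick numerical illustration (a sketch of mine, not part of the lecture; numpy is assumed), the standardized sums of Theorem 1.1.2 can be checked by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# CLT sketch: standardized sums of i.i.d. Exponential(1) draws
# (mu = 1, sigma^2 = 1) should be approximately N(0, 1) for large n.
n, reps = 500, 2_000
x = rng.exponential(scale=1.0, size=(reps, n))
z = (x.sum(axis=1) - n * 1.0) / np.sqrt(n)   # (sum X_i - n*mu) / sqrt(n)

# The sample mean of z should be near 0 and its variance near sigma^2 = 1.
```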

From now on we will use the following notations. Let

• (Ω,A,P) or (Ωi,Ai,Pi) be underlying probability space or sequence of them;

• (D, d) be a metric space;

• D = B(D) be a Borel σ-algebra of D;

• Cb(D) be the set of all bounded continuous real functions on D;

• X (Xn, resp.) be a map from Ω (Ωi, resp.) to D (not necessarily measurable).

Remark 1.1.3. Note that the LLN and CLT hold for $f(X_i)$'s, i.e.,
\[ \frac{1}{n}\sum_{i=1}^{n} f(X_i) \xrightarrow[n\to\infty]{P\text{-a.s.}} E f(X_1) \]
and
\[ \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \bigl( f(X_i) - E f(X_i) \bigr) \xrightarrow[n\to\infty]{d} N(0, \sigma_f^2) \]
hold for $\sigma_f^2 = \operatorname{var} f(X_1)$, provided that $E[f(X_1)^2] < \infty$. Our question in this course is:

• For a class $\mathcal{F}$ of real functions, do the LLN and CLT hold “uniformly” in “some” sense? For example, does
\[ \sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^{n} f(X_i) - E f(X_1) \right| \xrightarrow[n\to\infty]{} 0 \]
hold $P$-a.s. or in probability?

• For finitely many $f_1, \cdots, f_k$,
\[ \left( \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \bigl( f_1(X_i) - E f_1(X_1) \bigr), \cdots, \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \bigl( f_k(X_i) - E f_k(X_1) \bigr) \right)^{\top}
\]
converges weakly to a multivariate normal. How can the weak convergence of the “infinite-dimensional joint net”
\[ \left( \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \bigl( f(X_i) - E f(X_1) \bigr) \right)_{f\in\mathcal{F}} \tag{1.1} \]
be defined?
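The first question can be previewed numerically (a sketch of mine, not from the lecture; numpy is assumed). For the indicator class $\mathcal{F} = \{1_{[0,t]} : 0 \le t \le 1\}$ of Example 1.1.5 below, the supremum over $\mathcal{F}$ is the Kolmogorov–Smirnov statistic, computable exactly from the order statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_deviation(n):
    """sup over F = {1_[0,t] : 0 <= t <= 1} of |(1/n) sum f(X_i) - E f(X_1)|
    for n i.i.d. Uniform(0,1) draws, i.e. the Kolmogorov-Smirnov statistic."""
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    # The supremum over t is attained at the jump points X_(i).
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

devs = {n: sup_deviation(n) for n in (100, 10_000)}
```

The supremum shrinks as $n$ grows, which is exactly the uniform LLN established later (Glivenko–Cantelli).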

For this, we first review a more general notion of weak convergence.

Definition 1.1.4. Let $P_n$, $P$ be Borel probability measures on $(D, \mathcal{D}, d)$. Then

(i) $P_n$ converges weakly to $P$, denoted as $P_n \xrightarrow[n\to\infty]{w} P$, iff
\[ \int_D f \, dP_n \xrightarrow[n\to\infty]{} \int_D f \, dP \quad \forall f \in C_b(D). \]

(ii) If $X_n$ and $X$ are $D$-valued random variables with laws $P_n$ and $P$ respectively, then $X_n$ converges weakly to $X$, denoted as $X_n \xrightarrow[n\to\infty]{w} X$, iff $P_n \xrightarrow[n\to\infty]{w} P$.
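A minimal sanity check of Definition 1.1.4 (a standard example of mine, not from the lecture):

```latex
% Point masses P_n = \delta_{1/n} on D = \mathbb{R} satisfy P_n \xrightarrow{w} \delta_0:
% for every f \in C_b(\mathbb{R}),
\int_{\mathbb{R}} f \, dP_n = f(1/n) \longrightarrow f(0) = \int_{\mathbb{R}} f \, d\delta_0
% by continuity of f. Note that P_n(\{0\}) = 0 does not converge to \delta_0(\{0\}) = 1,
% so weak convergence does not imply setwise convergence.
```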

For weak convergence of (1.1), we may use Definition 1.1.4. For this, (1.1) should be embedded into a metric space.

Example 1.1.5. Let $(\Omega_n, \mathcal{A}_n, P_n) = ([0,1], \mathcal{B}, \lambda)$, and
\[ \mathcal{F} = \{ 1_{[0,t]}(\cdot) : 0 \le t \le 1 \} \subseteq D[0,1], \]
where $\mathcal{B} = \mathcal{B}([0,1])$ is the Borel $\sigma$-algebra on $[0,1]$ and $\lambda$ denotes the Lebesgue measure. Then (1.1) can be viewed as a $D[0,1]$-valued random variable. A natural metric on $D[0,1]$ is the uniform metric


defined as

\[ d(f_1, f_2) = \sup_{t\in[0,1]} |f_1(t) - f_2(t)| \quad \forall f_1, f_2 \in D[0,1]. \]

However, under this metric, $D[0,1]$ is not separable, which makes the space too large to work with. Furthermore, under this metric, (1.1) may even fail to be measurable.

Proposition 1.1.6. The map $X : [0,1] \to D[0,1]$ defined as
\[ X(\omega) = 1_{[\omega,1]} \]
is NOT Borel measurable with respect to the uniform metric.

Proof. [Figure 1.1: Proof of Proposition 1.1.6.]

Let $B_s$ be the open ball of radius $1/2$ in $D[0,1]$ centered at $1_{[s,1]}$. Then $G = \bigcup_{s\in S} B_s$ is an open set in $D[0,1]$ for any $S \subseteq [0,1]$. However, note that $X(\omega) \in B_s$ if and only if $\omega = s$, and hence
\[ X^{-1}(G) = \{X \in G\} = S \]
holds. If $X$ were Borel measurable, then every subset $S$ of $[0,1]$ would also be Borel measurable, which yields a contradiction.

To handle this issue, we may consider some alternative approaches:

• Consider a weaker $\sigma$-algebra, such as the ball $\sigma$-algebra, i.e., the $\sigma$-algebra generated by all open balls. If the space is separable, then the ball $\sigma$-algebra coincides with the Borel $\sigma$-algebra. Note that with a smaller $\sigma$-algebra, the measurability condition becomes weaker.

• Consider a weaker metric. This is one typical approach for dealing with empirical processes, using Skorokhod's metric. Under the Skorokhod metric, $D[0,1]$ becomes separable, and it is well known that there exists a metric equivalent to the Skorokhod metric that makes $D[0,1]$ also complete (Billingsley).

• Drop the measurability requirement, that is, extend the notion of weak convergence to non-measurable maps. We shall focus on this approach in this course.

1.2 Outer Integral

From now on, let $(\Omega, \mathcal{A}, P)$ be an underlying probability space. Also, let $T : \Omega \to \bar{\mathbb{R}} = [-\infty, \infty]$ be an arbitrary map (not necessarily measurable) and $B \subseteq \Omega$ be an arbitrary set (not necessarily measurable).

Definition 1.2.1. (i) The outer integral of $T$ w.r.t. $P$ is defined as
\[ E^* T := \inf\{ EU : U \ge T,\ U : \Omega \to \bar{\mathbb{R}} \text{ is measurable and } EU \text{ exists} \}, \]
where “$EU$ exists” means $EU^+ < \infty$ or $EU^- < \infty$ (note that it is defined except in the case $\infty - \infty$).

(ii) The outer probability of $B$ is
\[ P^*(B) = \inf\{ P(A) : A \supseteq B,\ A \in \mathcal{A} \}. \]

(iii) The inner integral of $T$ w.r.t. $P$ is defined as
\[ E_* T = -E^*(-T). \]

(iv) The inner probability of $B$ is
\[ P_*(B) = 1 - P^*(\Omega \setminus B). \]

Remark 1.2.2. Note that the definitions in (iii) and (iv) are equivalent to definitions built in the same way as (i) and (ii), i.e.,
\[ E_*(T) = \sup\{ EU : U \le T,\ U : \Omega \to \bar{\mathbb{R}} \text{ is measurable and } EU \text{ exists} \} \]
and
\[ P_*(B) = \sup\{ P(A) : A \subseteq B,\ A \in \mathcal{A} \}. \]
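A toy computation of Definition 1.2.1(ii) and Remark 1.2.2 on a four-point space (an illustration of mine, not from the lecture):

```python
# Outer/inner probability on Omega = {0,1,2,3} with the sigma-algebra
# generated by the partition {0,1} | {2,3}, and P uniform.
omega = frozenset({0, 1, 2, 3})
sigma_algebra = [frozenset(), frozenset({0, 1}), frozenset({2, 3}), omega]

def prob(a):
    return len(a) / 4  # uniform P on four points

B = frozenset({1, 2})  # NOT measurable: it splits both partition blocks
outer = min(prob(a) for a in sigma_algebra if B <= a)  # inf over measurable supersets
inner = max(prob(a) for a in sigma_algebra if a <= B)  # sup over measurable subsets
# outer = 1.0 and inner = 0.0, so a non-measurable set can have a large
# gap P^*(B) - P_*(B); also P_*(B) = 1 - P^*(Omega \ B) checks out.
```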

It is well known that a map $T^*$ achieving the infimum in the definition of $E^*T$ always exists, provided that its expectation exists.

Lemma 1.2.3. For any map $T : \Omega \to \bar{\mathbb{R}}$, there exists a measurable map $T^* : \Omega \to \bar{\mathbb{R}}$ with

(i) $T^* \ge T$;

(ii) $T^* \le U$ $P$-a.s. for any measurable $U : \Omega \to \bar{\mathbb{R}}$ with $U \ge T$ $P$-a.s.

Furthermore, such $T^*$ is unique up to $P$-null sets, and
\[ E^* T = E T^* \]
provided that $ET^*$ exists.

Definition 1.2.4. Such a function $T^*$ is called the minimal measurable majorant of $T$. Similarly, the maximal measurable minorant $T_*$ can be defined as $T_* = -(-T)^*$.


There are several similarities between the outer integral and the ordinary one. Many concepts and propositions in probability theory can be extended to outer-probability statements. However, there are also several statements that do not hold in the outer-measure version. One example is Fubini's theorem.

Lemma 1.2.5 (Fubini's theorem for outer integrals). Let $T$ be a real-valued function on the product space $(\Omega_1 \times \Omega_2, \mathcal{A}_1 \otimes \mathcal{A}_2, P_1 \otimes P_2)$. Then
\[ E_* T \le E_{1*} E_{2*} T \le E_1^* E_2^* T \le E^* T, \]
where $E_2^*$ is defined as
\[ (E_2^* T)(\omega_1) = \inf\{ E_2 U : U(\omega_2) \ge T(\omega_1, \omega_2),\ U : \Omega_2 \to \bar{\mathbb{R}} \text{ is measurable and } E_2 U \text{ exists} \} \]
for $\omega_1 \in \Omega_1$, and similarly for the others.

Now we will extend the notion of weak convergence to non-measurable maps.

1.3 Weak Convergence

Definition 1.3.1. (i) A Borel probability measure $L$ on $D$ is tight if
\[ \forall \varepsilon > 0\ \exists \text{compact } K \subseteq D \text{ with } L(K) \ge 1 - \varepsilon. \]

(ii) A Borel measurable map $X : \Omega \to D$ is tight if the law of $X$, $L(X) := P \circ X^{-1}$, is tight.

(iii) $L$ (or $X$) is separable if there exists a separable measurable set with probability 1, i.e.,
\[ \exists \text{separable measurable } A \subseteq D \text{ s.t. } L(A) = 1 \text{ or } P(X \in A) = 1. \]

Lemma 1.3.2. (i) If $L$ (or $X$) is tight, then $L$ (or $X$) is separable.

(ii) The converse is true if $D$ is complete. That is, given that $D$ is complete, separability of $L$ (or $X$) implies tightness.

Now we are ready to define weak convergence of “arbitrary” map Xn.

Definition 1.3.3 (Weak Convergence). Let $(\Omega_n, \mathcal{A}_n, P_n)$ be a sequence of probability spaces and $X_n : \Omega_n \to D$ be arbitrary maps (possibly non-measurable). Then $X_n$ is said to converge weakly to a Borel measure $L$, denoted as $X_n \xrightarrow[n\to\infty]{w} L$, if
\[ E^* f(X_n) \xrightarrow[n\to\infty]{} \int f \, dL \quad \forall f \in C_b(D). \]


Furthermore, if there is a Borel measurable map $X$ with law $L$, i.e., $L(X) = L$, then this is also denoted as $X_n \xrightarrow[n\to\infty]{w} X$.

Analogues of the classical results on weak convergence of measurable maps continue to hold in this setting.

Theorem 1.3.4 (Portmanteau). TFAE.

(i) $X_n \xrightarrow[n\to\infty]{w} L$;

(ii) $\liminf_n P_*(X_n \in G) \ge L(G)$ for any open set $G$;

(iii) $\limsup_n P^*(X_n \in F) \le L(F)$ for any closed set $F$;

(iv) $\liminf_n E_* f(X_n) \ge \int f \, dL$ for any function $f$ which is l.s.c. and bounded below;

(v) $\limsup_n E^* f(X_n) \le \int f \, dL$ for any function $f$ which is u.s.c. and bounded above;

(vi) $\lim P^*(X_n \in B) = \lim P_*(X_n \in B) = L(B)$ for any $L$-continuity set $B$ (i.e., $L(\partial B) = 0$);

(vii) $\liminf_n E_* f(X_n) \ge \int f \, dL$ for any function $f$ which is bounded, Lipschitz continuous, and nonnegative.
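A standard example (of mine, not from the lecture) showing why (ii) and (iii) are only inequalities:

```latex
% Take X_n \equiv 1/n, so X_n \xrightarrow{w} X \equiv 0, i.e. L = \delta_0.
% For the open set G = (0, 1):
\liminf_n P(X_n \in G) = 1 > 0 = L(G),
% and for the closed set F = \{0\}:
\limsup_n P(X_n \in F) = 0 < 1 = L(F).
% Here \partial G \ni 0 carries L-mass, so G is not an L-continuity set and
% the two-sided statement (vi) does not apply to it.
```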

Recall that a function $f$ is lower semicontinuous (l.s.c.) if
\[ \liminf_{x\to x_0} f(x) \ge f(x_0) \quad \forall x_0, \]
and upper semicontinuous (u.s.c.) if the analogous inequality holds with $\limsup$ and $\le$. Our first important result is the continuous mapping theorem.

Theorem 1.3.5 (Continuous mapping theorem). Let $(D, d)$ and $(E, e)$ be metric spaces and $g : D \to E$ be continuous at every point of a set $D_0 \subseteq D$. If $X_n \xrightarrow[n\to\infty]{w} X$ and $X$ takes its values in $D_0$, then $g(X_n) \xrightarrow[n\to\infty]{w} g(X)$.

Next to the continuous mapping theorem, Prokhorov's theorem (with Helly's selection principle as a special case) is the most important theorem on weak convergence. To formulate the result, two new concepts are needed.

Definition 1.3.6. (i) $X_n$ is asymptotically measurable if
\[ E^* f(X_n) - E_* f(X_n) \xrightarrow[n\to\infty]{} 0 \quad \forall f \in C_b(D). \]

(ii) $X_n$ is asymptotically tight if
\[ \forall \varepsilon > 0\ \exists \text{compact } K \text{ s.t. } \liminf_n P_*(X_n \in K^\delta) \ge 1 - \varepsilon \quad \forall \delta > 0, \]
where $K^\delta := \{ y \in D : d(y, K) < \delta \}$ is the “$\delta$-enlargement” of $K$.


Remark 1.3.7. A collection of Borel measurable maps $X_n$ is (uniformly) tight if
\[ \forall \varepsilon > 0\ \exists \text{compact } K \text{ s.t. } \inf_n P(X_n \in K) \ge 1 - \varepsilon. \]
(It is equivalent if “$\inf$” in the last statement is replaced by “$\liminf$.”) The $\delta$ in the definition of asymptotic tightness may seem excessive ($\because$ it enlarges the set $K$), but in simple cases it gains nothing:

Proposition 1.3.8. If $D$ is separable and complete, then uniform tightness and asymptotic tightness coincide (for measurable maps).

The following result is useful for verifying asymptotic measurability or tightness.

Lemma 1.3.9.

(i) If $X_n \xrightarrow[n\to\infty]{w} X$, then $(X_n)$ is asymptotically measurable.

(ii) If $X_n \xrightarrow[n\to\infty]{w} X$, then
\[ (X_n) \text{ is asymptotically tight} \iff X \text{ is tight}. \]

Now we are ready to state Prokhorov's theorem.

Theorem 1.3.10 (Prokhorov).

(i) If $(X_n)$ is asymptotically tight and asymptotically measurable, then $(X_n)$ is relatively compact, i.e., every subsequence $(X_{n'})$ has a further subsequence $(X_{n''})$ converging weakly to a tight Borel law.

(ii) A relatively compact collection $(X_n)$ is asymptotically tight if $D$ is a Polish space (i.e., separable and complete).

Remark 1.3.11. By the previous theorem, for Borel measures on a Polish space, the concepts “relatively compact,” “asymptotically tight,” and “uniformly tight” are all equivalent.

Our final extension is:

Lemma 1.3.12. Let $X_n \xrightarrow[n\to\infty]{w} X$ and $Y_n \xrightarrow[n\to\infty]{w} c$, where $c$ is a constant and $X$ has separable Borel law. Then
\[ (X_n, Y_n) \xrightarrow[n\to\infty]{w} (X, c). \]

Corollary 1.3.13. Let $X_n$ and $X$ take values in a separable Banach space (a topological vector space), and let $Y_n$ and $c$ be scalars. Then addition and scalar multiplication are defined and are continuous operations on a separable Banach space. Thus we get
\[ X_n + Y_n \xrightarrow[n\to\infty]{w} X + c \]
and
\[ X_n Y_n \xrightarrow[n\to\infty]{w} cX. \]
Furthermore, if $c \neq 0$, we also obtain
\[ X_n / Y_n \xrightarrow[n\to\infty]{w} X / c. \]
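Corollary 1.3.13 is the usual asymptotic justification of the t-statistic; a simulation sketch (mine, not from the lecture; numpy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# X_n = sqrt(n)(mean - mu) -> N(0, sigma^2) by the CLT, and Y_n = sample sd
# -> sigma (a constant), so X_n / Y_n -> N(0, 1) by the corollary.
n, reps = 2_000, 2_000
x = rng.exponential(scale=2.0, size=(reps, n))          # mu = 2, sigma = 2
t = np.sqrt(n) * (x.mean(axis=1) - 2.0) / x.std(axis=1, ddof=1)
# t should be approximately standard normal for large n.
```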

1.4 Spaces of Bounded Functions

Definition 1.4.1. Let $T$ be an arbitrary set. Then the space $\ell^\infty(T)$ is defined as
\[ \ell^\infty(T) = \{ \text{all functions } f : T \to \mathbb{R} \text{ s.t. } \|f\|_\infty < \infty \}, \]
where $\|f\|_\infty = \sup_{t\in T} |f(t)|$.

It is well known that $\ell^\infty(T)$ is a Banach space.

Definition 1.4.2 (Stochastic Process). A collection X(t) : t ∈ T of random variables defined on

the same probability space (Ω,A,P) is called a stochastic process.

Note that if every sample path $t \mapsto X(t, \omega)$ belongs to $\ell^\infty(T)$, i.e., every sample path is bounded, then $X$ can be viewed as a (random) map from $\Omega$ to $\ell^\infty(T)$. For an arbitrary map $X : \Omega \to \ell^\infty(T)$, it is natural to call a finite-dimensional projection $(X(t_1), X(t_2), \cdots, X(t_k))$ for $t_1, t_2, \cdots, t_k \in T$ a marginal.

Our interest is to find equivalent conditions for asymptotic tightness and weak convergence of a sequence of random maps $(X_n)$. Before starting, we introduce the following two lemmas, which will be used later.

Lemma 1.4.3. Let $X_n : \Omega_n \to \ell^\infty(T)$ be asymptotically tight. Then
\[ (X_n) \text{ is asymptotically measurable} \iff (X_n(t)) \text{ is asymptotically measurable for every } t \in T. \]

It implies that every (asymptotically tight) stochastic process is asymptotically measurable: each marginal is a random variable and hence measurable.

Lemma 1.4.4. Let $X, Y$ be tight Borel measurable maps into $\ell^\infty(T)$. Then
\[ L(X) = L(Y) \iff \text{all marginals of } X \text{ and } Y \text{ are equal in law}. \]

That is, for tight measurable maps, the laws of all marginals determine the (joint) law. Now we are ready to state our first main result.


Theorem 1.4.5. Let $X_n : \Omega_n \to \ell^\infty(T)$, $n = 1, 2, \cdots$ be arbitrary maps. Then $X_n$ converges weakly to a tight limit if and only if

(1) $(X_n)$ is asymptotically tight;

(2) every marginal converges weakly to a limit.

Proof. ($\Rightarrow$) (1) is immediate from Lemma 1.3.9. Next, note that for any fixed $t_1, t_2, \cdots, t_k \in T$, the projection
\[ g : \ell^\infty(T) \to \mathbb{R}^k, \quad z \mapsto (z(t_1), z(t_2), \cdots, z(t_k))^\top \]
is continuous on $\ell^\infty(T)$. Thus the continuous mapping theorem implies (2).

($\Leftarrow$) Let $t \in T$ be arbitrary. Condition (2) implies that $(X_n(t))$ is asymptotically measurable by Lemma 1.3.9. Since $t \in T$ was arbitrary, $(X_n)$ is asymptotically measurable by Lemma 1.4.3. Then by Prokhorov's theorem, every subsequence $\{n'\} \subseteq \{n\}$ has a further subsequence $\{n''\} \subseteq \{n'\}$ along which $(X_{n''})$ converges weakly. If all such limits are equal, then $(X_n)$ converges weakly. This follows from convergence of every marginal (condition (2)) and Lemma 1.4.4. In detail, for any subsequence $\{n'\} \subseteq \{n\}$, there exist a further subsequence $\{n''\} \subseteq \{n'\}$ and $Y = Y(n')$ such that $X_{n''} \xrightarrow[n''\to\infty]{w} Y$. Note that $Y$ is tight by condition (1) and Lemma 1.3.9, and by Lemma 1.4.4, every $Y$ has the same law regardless of the choice of subsequence $\{n'\}$. Let $X$ be a tight random variable with $L(X) = L(Y)$. Then we get
\[ \forall \{n'\} \subseteq \{n\}\ \exists \{n''\} \subseteq \{n'\} \text{ s.t. } X_{n''} \xrightarrow[n''\to\infty]{w} X, \]
and therefore $X_n \xrightarrow[n\to\infty]{w} X$. $\square$

Theorem 1.4.5 says that weak convergence of a sequence of random maps amounts to asymptotic tightness plus marginal convergence. Marginal convergence can be established by any of the well-known methods for proving weak convergence on Euclidean space. (Asymptotic) tightness can be given a more concrete form, either through finite approximation or through an (essentially) Arzelà–Ascoli characterization. The second approach is related to asymptotic continuity of the sample paths.

Definition 1.4.6. A map $\rho : T \times T \to \mathbb{R}$ is called a semimetric (or pseudometric) if

(1) $\rho(x, y) \ge 0$, and $x = y$ implies $\rho(x, y) = 0$;

(2) $\rho(x, y) = \rho(y, x)$;

(3) $\rho(x, z) \le \rho(x, y) + \rho(y, z)$.

(It need not satisfy $\rho(x, y) = 0 \implies x = y$.)
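A standard example of a semimetric that is not a metric (mine, not from the lecture), and the one most relevant later:

```latex
% On a class of square-integrable functions, put
\rho(f, g) = \left( E\,|f(X) - g(X)|^{2} \right)^{1/2}.
% Properties (1)-(3) hold ((3) is the triangle inequality in L^2(P)), but if
% f = g P-almost surely with f \neq g pointwise, then \rho(f, g) = 0
% although f \neq g.
```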


Definition 1.4.7. Let $X_n : \Omega_n \to \ell^\infty(T)$ be a sequence of maps and $\rho$ be a semimetric on $T$. Then $(X_n)$ is called asymptotically uniformly $\rho$-equicontinuous in probability if
\[ \forall \varepsilon, \eta > 0\ \exists \delta > 0 \text{ s.t. } \limsup_{n\to\infty} P^*\left( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| > \varepsilon \right) < \eta. \]

Recall that a collection $\{f_n : T \to \mathbb{R}\}$ of functions is “uniformly equicontinuous” if
\[ \forall \varepsilon > 0\ \exists \delta > 0 \text{ s.t. } \sup_{\rho(s,t)<\delta} |f_n(s) - f_n(t)| < \varepsilon \text{ uniformly in } n. \]

Definition 1.4.7 modifies this notion to hold in probability. Now we are ready to state equivalent conditions for asymptotic tightness, which is one of the goals of this section.

Theorem 1.4.8. TFAE.

(i) $(X_n)$ is asymptotically tight.

(ii) (1) $(X_n(t))$ is asymptotically tight for every $t \in T$;
    (2) there exists a semimetric $\rho$ on $T$ such that $(T, \rho)$ is totally bounded and $(X_n)$ is asymptotically uniformly $\rho$-equicontinuous in probability.

(iii) (1) and (3) hold, where
    (3) $\forall \varepsilon, \eta > 0$ there exists a finite partition $\{T_1, \cdots, T_k\}$ of $T$ s.t.
\[ \limsup_{n\to\infty} P^*\left( \max_i \sup_{s,t\in T_i} |X_n(s) - X_n(t)| > \varepsilon \right) < \eta. \tag{1.2} \]

Remark 1.4.9. (ii) is related to an Arzelà–Ascoli characterization of the space, while (iii) is related to finite approximation of the index set $T$. (iii) means that for any $\varepsilon > 0$, $T$ can be partitioned into finitely many subsets $T_i$ such that (asymptotically) the variation of the sample paths $t \mapsto X_n(t)$ is less than $\varepsilon$ on every $T_i$.

Proof. (i) $\Rightarrow$ (ii). First we show (1). Let $\pi_t : z \mapsto z(t)$ be the coordinate projection. Given $\varepsilon > 0$, there exists a compact set $K$ s.t.
\[ \liminf_n P_*(X_n \in K^\delta) > 1 - \varepsilon \quad \forall \delta > 0. \]
From
\[ a \in K^\delta \implies \exists b \in K \text{ s.t. } \|b - a\|_\infty < \delta \implies |\pi_t(b) - \pi_t(a)| \le \|b - a\|_\infty < \delta \implies \pi_t(a) \in (\pi_t(K))^\delta, \]
we get
\[ \liminf_n P_*\bigl( X_n(t) \in (\pi_t(K))^\delta \bigr) \ge \liminf_n P_*(X_n \in K^\delta) > 1 - \varepsilon \quad \forall \delta > 0. \]
As $\pi_t$ is continuous, $\pi_t(K)$ is compact, and it is the desired compact set.

Now we show (2). Let $\varepsilon > 0$ be given, and let $K_1 \subseteq K_2 \subseteq \cdots$ be a sequence of compact subsets of $\ell^\infty(T)$ satisfying
\[ \liminf_{n\to\infty} P_*\left( X_n \in K_m^\varepsilon \right) \ge 1 - \frac{1}{m} \quad (\text{“asymptotic tightness”}). \]
For each $m$, define $\rho_m$ as
\[ \rho_m(s, t) = \sup_{z\in K_m} |z(s) - z(t)|. \]

Claim 1.) $(T, \rho_m)$ is totally bounded.

Remark 1.4.10. “Totally bounded” means that for every $\varepsilon > 0$, $T$ can be covered by finitely many radius-$\varepsilon$ balls w.r.t. $\rho$. Equivalently:
\[ \forall \varepsilon > 0\ \exists \text{finite subset of } T \text{ whose distance from any element of } T \text{ is less than } \varepsilon. \]

Proof of Claim 1. For given $\eta > 0$, choose $z_1, z_2, \cdots, z_k \in \ell^\infty(T)$ s.t.
\[ K_m \subseteq \bigcup_{j=1}^{k} B_\eta(z_j) \]
(possible by compactness of $K_m$). Since each $z_j$ is a bounded function, $A := \{ (z_1(t), \cdots, z_k(t))^\top : t \in T \} \subseteq \mathbb{R}^k$ is a bounded set, hence totally bounded (a totally bounded set is bounded; the converse also holds in Euclidean space). Thus there exist $t_1, t_2, \cdots, t_p \in T$ s.t.
\[ A \subseteq \bigcup_{i=1}^{p} B_\eta\bigl( (z_1(t_i), \cdots, z_k(t_i))^\top \bigr). \]
It follows that for any $t \in T$ there exists $t_i$ s.t. $(z_1(t), \cdots, z_k(t))^\top \in B_\eta\bigl( (z_1(t_i), \cdots, z_k(t_i))^\top \bigr)$, and hence
\begin{align*}
\rho_m(t, t_i) &= \sup_{z\in K_m} |z(t) - z(t_i)| \\
&\le \sup_{z\in K_m} \min_{1\le j\le k} \bigl( |z(t) - z_j(t)| + |z_j(t) - z_j(t_i)| + |z_j(t_i) - z(t_i)| \bigr) \\
&\le 2 \sup_{z\in K_m} \min_{1\le j\le k} \|z - z_j\|_\infty + \max_{1\le j\le k} |z_j(t) - z_j(t_i)| \le 2\eta + \eta = 3\eta,
\end{align*}
where the last bound uses $\|z - z_j\|_\infty \le \eta$ for a suitable $j$ (by the choice of the $z_j$'s) and $|z_j(t) - z_j(t_i)| \le \eta$ (by the choice of the $t_i$'s). In summary, $\forall \eta > 0\ \exists t_1, \cdots, t_p$ s.t.
\[ \forall t \in T\ \exists t_i \text{ s.t. } \rho_m(t, t_i) \le 3\eta, \]
which gives total boundedness of $(T, \rho_m)$. $\square$ (Claim 1)

Claim 2.) $(T, \rho)$ is totally bounded, where
\[ \rho(s, t) = \sum_{m=1}^{\infty} 2^{-m} \bigl( \rho_m(s, t) \wedge 1 \bigr). \]

Proof of Claim 2. Note that $\rho_m$ increases in $m$ by definition. For $\eta > 0$, take $m$ s.t. $2^{-m} < \eta$. Then by Claim 1, there exist $t_1, t_2, \cdots, t_p$ s.t.
\[ T \subseteq \bigcup_{i=1}^{p} B_\eta(t_i; \rho_m). \]
Then for every $t \in T$ there exists $t_i$ s.t. $\rho_m(t, t_i) < \eta$, and so
\[ \rho(t, t_i) \le \sum_{k=1}^{m} 2^{-k} \rho_k(t, t_i) + \sum_{k=m+1}^{\infty} 2^{-k} \le \rho_m(t, t_i) + 2^{-m} \le \eta + \eta = 2\eta, \]
using $\rho_k(t, t_i) \le \rho_m(t, t_i)$ for $k \le m$ and $\sum_{k=m+1}^{\infty} 2^{-k} = 2^{-m} < \eta$. It follows that $\forall \eta > 0\ \exists t_1, t_2, \cdots, t_p$ s.t.
\[ \forall t \in T\ \exists t_i \text{ s.t. } \rho(t, t_i) \le 2\eta, \]
which gives that $(T, \rho)$ is totally bounded. $\square$ (Claim 2)

Claim 3.) $(X_n)$ is asymptotically uniformly $\rho$-equicontinuous in probability.

Proof of Claim 3. Let $0 < \varepsilon \le 1$. If $\|z - z_0\|_\infty < \varepsilon$ for some $z_0 \in K_m$, then
\[ |z(s) - z(t)| \le |z(s) - z_0(s)| + |z_0(s) - z_0(t)| + |z_0(t) - z(t)| \le 2\varepsilon + \rho_m(s, t), \]
since $|z_0(s) - z_0(t)| \le \sup_{z\in K_m} |z(s) - z(t)| = \rho_m(s, t)$. If $\rho(s, t) < 2^{-m}\varepsilon$, then
\[ \rho_m(s, t) \wedge 1 \le 2^m \rho(s, t) < \varepsilon, \]
which gives $\rho_m(s, t) < \varepsilon$. Thus, provided that $\rho(s, t) < 2^{-m}\varepsilon$,
\[ z \in K_m^\varepsilon \implies \exists z_0 \in K_m \text{ s.t. } \|z - z_0\|_\infty < \varepsilon \implies |z(s) - z(t)| \le 2\varepsilon + \rho_m(s, t) \le 3\varepsilon. \]
Therefore,
\[ K_m^\varepsilon \subseteq \left\{ z : \sup_{\rho(s,t)<2^{-m}\varepsilon} |z(s) - z(t)| \le 3\varepsilon \right\}. \]
Now taking $\delta < 2^{-m}\varepsilon$, we get
\[ \liminf_{n\to\infty} P_*\left( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| \le 3\varepsilon \right) \ge \liminf_{n\to\infty} P_*(X_n \in K_m^\varepsilon) \ge 1 - \frac{1}{m}. \]
In summary, $\forall m \in \mathbb{N}\ \forall \varepsilon > 0\ \exists \delta > 0$ s.t.
\[ \liminf_{n\to\infty} P_*\left( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| \le 3\varepsilon \right) \ge 1 - \frac{1}{m}, \]
which implies that $(X_n)$ is asymptotically uniformly $\rho$-equicontinuous in probability. $\square$ (Claim 3)

(ii) $\Rightarrow$ (iii). By assumption, given $\varepsilon, \eta > 0$, there exists $\delta > 0$ s.t.
\[ \limsup_{n\to\infty} P^*\left( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| > \varepsilon \right) < \eta. \]
Since $(T, \rho)$ is totally bounded, there exists a finite set $\{t_1, t_2, \cdots, t_p\} \subseteq T$ s.t.
\[ T \subseteq \bigcup_{i=1}^{p} B_{\delta/2}(t_i; \rho). \]
Now letting $T_i = B_{\delta/2}(t_i; \rho)$ (disjointified if necessary to form a partition), we get
\[ s, t \in T_i \implies \rho(s, t) \le \rho(s, t_i) + \rho(t_i, t) < \delta, \]
and therefore
\[ \sup_{s,t\in T_i} |z(s) - z(t)| \le \sup_{\rho(s,t)<\delta} |z(s) - z(t)| \]
for any $i = 1, 2, \cdots, p$. This implies the conclusion:
\[ \limsup_{n\to\infty} P^*\left( \max_i \sup_{s,t\in T_i} |X_n(s) - X_n(t)| > \varepsilon \right) \le \limsup_{n\to\infty} P^*\left( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| > \varepsilon \right) < \eta. \]


(iii) $\Rightarrow$ (i). Suppose that for given $\varepsilon, \eta > 0$,
\[ \limsup_{n\to\infty} P^*\left( \max_i \sup_{s,t\in T_i} |X_n(s) - X_n(t)| > \varepsilon \right) < \eta \]
holds. Note that, for a fixed $t_i \in T_i$,
\[ \sup_{s,t\in T_i} |X_n(s) - X_n(t)| \le \varepsilon \implies \sup_{s\in T_i} |X_n(s)| \le \sup_{s\in T_i} |X_n(s) - X_n(t_i)| + |X_n(t_i)| \le |X_n(t_i)| + \varepsilon, \]
and hence
\[ \liminf_{n\to\infty} P_*\left( \|X_n\|_\infty \le \max_{1\le i\le p} |X_n(t_i)| + \varepsilon \right) \ge \liminf_{n\to\infty} P_*\left( \max_{1\le i\le p} \sup_{s,t\in T_i} |X_n(s) - X_n(t)| \le \varepsilon \right) \ge 1 - \eta. \]
This implies that $(\|X_n\|_\infty)$ is asymptotically tight. Why? First note that, since each $(X_n(t_i))$ is asymptotically tight, for each $i$ there exists $M_i > 0$ s.t.
\[ \liminf_{n\to\infty} P_*\left( |X_n(t_i)| < M_i + \varepsilon \right) \ge 1 - \eta \quad \forall \varepsilon > 0. \]
Letting $M = \max_i M_i$, we get
\[ \limsup_{n\to\infty} P^*\left( \max_i |X_n(t_i)| \ge M + \varepsilon \right) \le \limsup_{n\to\infty} \sum_i P^*\bigl( |X_n(t_i)| \ge M + \varepsilon \bigr) \le p\eta \quad \forall \varepsilon > 0, \]
i.e., $(\max_i |X_n(t_i)|)$ is asymptotically tight. Now let $K$ be a compact set s.t.
\[ \liminf_{n\to\infty} P_*\left( \max_i |X_n(t_i)| \in K^\varepsilon \right) \ge 1 - \eta \quad \forall \varepsilon > 0. \]
Then, since $\|X_n\|_\infty \ge \max_i |X_n(t_i)|$ always, the event $\{\|X_n\|_\infty \le \max_i |X_n(t_i)| + \varepsilon\}$ equals $\{|\|X_n\|_\infty - \max_i |X_n(t_i)|| \le \varepsilon\}$, and
\[ \liminf_{n\to\infty} P_*\left( \|X_n\|_\infty \le \max_i |X_n(t_i)| + \varepsilon,\ \max_i |X_n(t_i)| \in K^\varepsilon \right) \le \liminf_{n\to\infty} P_*\left( \|X_n\|_\infty \in K^{3\varepsilon} \right), \]
which implies
\[ \liminf_{n\to\infty} P_*\left( \|X_n\|_\infty \in K^{3\varepsilon} \right) \ge 1 - 2\eta, \]
i.e., $(\|X_n\|_\infty)$ is asymptotically tight. Now, let $\zeta > 0$ and a sequence $\varepsilon_m \downarrow 0$ be given. Choose $M > 0$ s.t.
\[ \limsup_{n\to\infty} P^*(\|X_n\|_\infty > M) \le \zeta. \]


For $\varepsilon_m$ and $\eta = 2^{-m}\zeta$, let
\[ T = \bigcup_{i=1}^{k_m} T_{m,i} \]
be a partition satisfying
\[ \limsup_{n\to\infty} P^*\left( \max_{1\le i\le k_m} \sup_{s,t\in T_{m,i}} |X_n(s) - X_n(t)| > \varepsilon_m \right) < \eta. \]
Now, let $z_{m,1}, z_{m,2}, \cdots, z_{m,P_m}$ be the set of all functions in $\ell^\infty(T)$ that are constant on each $T_{m,i}$, taking values in
\[ \left\{ 0, \pm\varepsilon_m, \pm 2\varepsilon_m, \cdots, \pm\left\lfloor \frac{M}{\varepsilon_m} \right\rfloor \varepsilon_m \right\}. \]

[Figure 1.2: The functions $z_{m,i}$ and approximation of elements of $\ell^\infty(T)$ by them.]

Now let
\[ K_m = \bigcup_{i=1}^{P_m} B_{\varepsilon_m}(z_{m,i}) \quad \text{and} \quad K = \bigcap_{m=1}^{\infty} K_m. \]
By construction, for each $m$,
\[ \|X_n\|_\infty \le M \ \text{ and } \ \max_i \sup_{s,t\in T_{m,i}} |X_n(s) - X_n(t)| < \varepsilon_m \ \text{ implies } \ X_n \in K_m. \tag{1.3} \]

Since $K$ is closed (and hence complete, as a closed subset of a complete space) and totally bounded, it is compact. Thus our claim is:

Claim.) $\forall \delta > 0\ \exists m$ s.t. $K^\delta \supseteq \bigcap_{i=1}^{m} K_i$.

Proof of Claim. Assume not. Then $\exists \delta > 0$ s.t. $\forall m\ K^\delta \nsupseteq \bigcap_{i=1}^{m} K_i$. That is,
\[ \exists (z_m) \text{ s.t. } z_m \in \bigcap_{i=1}^{m} K_i \text{ but } z_m \notin K^\delta. \]
Now we use the Arzelà–Ascoli diagonal argument. Note that $\{z_n\} \subseteq K_1 = \bigcup_{i=1}^{P_1} B_{\varepsilon_1}(z_{1,i})$, i.e., infinitely many of the $z_n$ belong to finitely many balls, so at least one of the balls contains infinitely many of them. Consider a subsequence $(z_n^*)$ of $(z_n)$ contained in $B_{\varepsilon_1}(z_{1,i_1})$ for some $i_1$. In the same way,


there exists a further subsequence $(z_n^{**})$ contained in $B_{\varepsilon_2}(z_{2,i_2})$ for some $i_2$, and so on:
\[ z_1,\ z_2,\ z_3,\ \cdots \quad \supseteq \quad z_1^*,\ z_2^*,\ z_3^*,\ \cdots \quad \supseteq \quad z_1^{**},\ z_2^{**},\ z_3^{**},\ \cdots \]
Now define $(\tilde{z}_l)$ as the diagonal sequence $z_1, z_2^*, z_3^{**}, \cdots$; then $(\tilde{z}_l)$ is a Cauchy sequence. Since $\ell^\infty(T)$ is complete, $(\tilde{z}_l)$ converges. Note that
\[ \tilde{z}_l \in \bigcap_{i=1}^{l} K_i \subseteq \bigcap_{i=1}^{m} K_i \]
by construction for any $l \ge m$, and since $\bigcap_{i=1}^{m} K_i$ is closed, the limit $z$ of $(\tilde{z}_l)$ belongs to $\bigcap_{i=1}^{m} K_i$ for any $m$. It follows that $z \in K$; but the $\tilde{z}_l$ are among the $z_m$'s and converge to $z \in K$, which contradicts $z_m \notin K^\delta\ \forall m$. $\square$ (Claim)

Now, by the Claim, for any $\delta > 0$, choosing $m$ with $K^\delta \supseteq \bigcap_{i=1}^{m} K_i$,
\begin{align*}
\limsup_{n\to\infty} P^*(X_n \notin K^\delta)
&\le \limsup_{n\to\infty} P^*\left( X_n \notin \bigcap_{i=1}^{m} K_i \right) \\
&\overset{(1.3)}{\le} \limsup_{n\to\infty} P^*\left( \|X_n\|_\infty > M \ \text{ or } \ \max_i \sup_{s,t\in T_{m',i}} |X_n(s) - X_n(t)| > \varepsilon_{m'} \text{ for some } m' \le m \right) \\
&\le \limsup_{n\to\infty} P^*(\|X_n\|_\infty > M) + \sum_{m'=1}^{m} \limsup_{n\to\infty} P^*\left( \max_i \sup_{s,t\in T_{m',i}} |X_n(s) - X_n(t)| > \varepsilon_{m'} \right) \\
&\le \zeta + \sum_{m'=1}^{m} 2^{-m'}\zeta < 2\zeta. \qquad \square
\end{align*}

If “asymptotic tightness” is strengthened to “weak convergence,” then we can obtain a stronger conclusion.

Proposition 1.4.11. If $X_n \xrightarrow[n\to\infty]{w} X$, where $X$ is tight, then the sample paths $t \mapsto X(t, \omega)$ are uniformly $\rho$-continuous a.s., where $\rho$ is the semimetric constructed in the (i)$\Rightarrow$(ii) part of the proof of Theorem 1.4.8.

Proof. We continue with the notation of that proof. For the closed set $\overline{K_m^\varepsilon}$ we get
\begin{align*}
P(X \in \overline{K_m^\varepsilon}) &\ge \limsup_{n\to\infty} P^*(X_n \in \overline{K_m^\varepsilon}) \quad (\because \text{Portmanteau}) \\
&\ge \liminf_{n\to\infty} P_*(X_n \in K_m^\varepsilon) \ge 1 - \frac{1}{m}
\end{align*}


for any $m$ and $\varepsilon > 0$. Letting $\varepsilon \downarrow 0$ (the $K_m$ are compact, hence closed), we get
\[ P(X \in K_m) \ge 1 - \frac{1}{m}, \]
which gives
\[ P\left( X \in \bigcup_{m=1}^{\infty} K_m \right) = 1. \tag{1.4} \]
Hence, for
\[ \rho_m(s, t) = \sup_{z\in K_m} |z(s) - z(t)| \quad \text{and} \quad \rho(s, t) = \sum_{m=1}^{\infty} 2^{-m} \bigl( \rho_m(s, t) \wedge 1 \bigr), \]
we get
\[ z \in K_m \implies |z(s) - z(t)| \le \rho_m(s, t) \quad \forall s, t \in T. \]
Also, from $1 \wedge \rho_m(s, t) \le 2^m \rho(s, t)$, we get
\[ \rho(s, t) < \delta \implies \rho_m(s, t) < \varepsilon \]
for any $\delta < 2^{-m}\varepsilon$ (with $\varepsilon \le 1$). Therefore we get the conclusion: for $m = m(\omega)$ s.t. $X(\omega) \in K_m$, $\forall \varepsilon > 0\ \exists \delta = \delta(m)$ s.t.
\[ \sup_{\substack{s,t\in T \\ \rho(s,t)<\delta}} |X(s, \omega) - X(t, \omega)| < \varepsilon. \qquad \square \]

Proposition 1.4.12. If $X_n \xrightarrow[n\to\infty]{w} X$, $(T, \rho)$ is totally bounded, and the sample paths $t \mapsto X(t, \omega)$ are uniformly $\rho$-continuous $P$-a.s., then $(X_n)$ is asymptotically tight and asymptotically uniformly $\rho$-equicontinuous in probability.

Remark 1.4.13. Before the proof, note that:

“The set of uniformly continuous functions on a totally bounded set is complete and separable in the uniform metric.”

A brief proof is as follows. It is well known that $C(T)$ is complete, and that $C(T)$ is separable if and only if $T$ is compact. Hence the set of continuous functions on a compact set is complete and separable. Meanwhile, the following are also well known:

• a totally bounded, complete set is compact;

• a uniformly continuous function extends to a continuous function on the completion.

In other words, a uniformly continuous function on a totally bounded set is equivalent to a continuous function on a compact set (the completion).


Proof. Since $(T, \rho)$ is totally bounded and $t \mapsto X(t, \omega)$ is uniformly $\rho$-continuous, the set of realizations of such $X$ is complete and separable. Since a random variable on a complete separable space is tight, $X$ is tight, which implies that $(X_n)$ is asymptotically tight (Lemma 1.3.9). Since $X$ is tight and uniformly $\rho$-continuous a.s.,
\[ \forall \eta > 0\ \exists \text{compact set } K \text{ of uniformly } \rho\text{-continuous functions s.t. } P(X \in K) \ge 1 - \eta. \]
From the Portmanteau lemma,
\[ \liminf_{n\to\infty} P_*(X_n \in K^\varepsilon) \ge P(X \in K^\varepsilon) \ge P(X \in K) \ge 1 - \eta \quad \forall \varepsilon > 0. \]
Since $K$ is totally bounded, $\forall \varepsilon > 0\ \exists z_1, z_2, \cdots, z_k \in K$ s.t. $K \subseteq \bigcup_{i=1}^{k} B_\varepsilon(z_i)$, which implies
\[ K^\varepsilon \subseteq \bigcup_{i=1}^{k} B_{2\varepsilon}(z_i). \]
Since each $z_i$ is uniformly continuous,
\[ \exists \delta > 0 \text{ s.t. } \rho(s, t) < \delta \implies \max_{1\le i\le k} |z_i(s) - z_i(t)| < \varepsilon. \]
Then, for $z \in B_{2\varepsilon}(z_i)$,
\[ \rho(s, t) < \delta \implies |z(s) - z(t)| \le |z(s) - z_i(s)| + |z_i(s) - z_i(t)| + |z_i(t) - z(t)| \le 2\varepsilon + \varepsilon + 2\varepsilon = 5\varepsilon, \]
and therefore
\[ \liminf_{n\to\infty} P_*\left( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| \le 5\varepsilon \right) \ge \liminf_{n\to\infty} P_*\left( X_n \in \bigcup_{i=1}^{k} B_{2\varepsilon}(z_i) \right) \ge \liminf_{n\to\infty} P_*(X_n \in K^\varepsilon) \ge 1 - \eta. \qquad \square \]

The concluding remark of this chapter is that, in most cases of interest, the limit process is Gaussian, whose finite-dimensional convergence is obtained from the CLT. In this case, the semimetric $\rho$ of Proposition 1.4.12 can be taken to be the $L_p$-type semimetric below.

Definition 1.4.14. A stochastic process is called Gaussian if each marginal has a multivariate normal distribution.


Remark 1.4.15. Note that if $X_n \xrightarrow[n\to\infty]{w} X$, where $X$ is a tight Gaussian process, then the semimetric
\[ \rho(s, t) = \rho_p(s, t) = \left( E|X(s) - X(t)|^p \right)^{1/p}, \quad p \ge 1, \]
makes $(X_n)$ asymptotically uniformly $\rho$-equicontinuous in probability.
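For instance (a standard fact, not from the lecture), for standard Brownian motion $X$ on $T = [0,1]$ one has $E|X(s) - X(t)|^2 = |s - t|$, so

```latex
\rho_2(s, t) = \left( E\,|X(s) - X(t)|^{2} \right)^{1/2} = |s - t|^{1/2},
% under which (T, \rho_2) is totally bounded; sample-path continuity in \rho_2
% is exactly what the equicontinuity condition of Theorem 1.4.8(ii) asks for.
```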


Chapter 2

Maximal Inequalities and Symmetrization

2.1 Introduction

Here we use the following notation. Let $(\mathcal{X}, \mathcal{B}, P)$ be a baseline probability space, and $(\mathcal{X}^\infty, \mathcal{B}^\infty, P^\infty)$ be the product space. We consider the projection onto the $i$th coordinate, $X_i : \mathcal{X}^\infty \to \mathcal{X}$. Then $X_1, X_2, \cdots$ are i.i.d. random variables with distribution $P$.

Definition 2.1.1. Denote

Pn := (1/n) ∑_{i=1}^n δ_{Xi} (“empirical measure”)

and

Gn := (1/√n) ∑_{i=1}^n (δ_{Xi} − P). (“empirical process”)

Here δ_X denotes the Dirac measure at X.

Remark 2.1.2. Often, Gn denotes the stochastic process f ↦ Gnf (i.e., (Gnf)_{f∈F}), where F is a collection of measurable functions and Qf denotes Qf = ∫ f dQ for a measurable function f and a signed measure Q. Note that

Gnf = (1/√n) ∑_{i=1}^n (f(Xi) − Pf).

Definition 2.1.3. For a signed measure Q, define

‖Q‖_F := sup{ |Qf| : f ∈ F }.

Our first step consists of the following very well-known results:

Proposition 2.1.4. For each f ∈ F ,

(i) Pnf → Pf a.s.. (SLLN)


(ii) Gnf →d N(0, P(f − Pf)²) as n → ∞. (CLT)

We are interested in uniform versions of the previous proposition. The uniform version of (i) becomes:

‖Pn − P‖_F →P∗ 0 as n → ∞. (2.1)

Here P∗ denotes outer probability.

Definition 2.1.5. A collection F of integrable (measurable) functions satisfying (2.1) is called a P-Glivenko-Cantelli class.

Next, a uniform version of (ii) can be obtained as follows. Assume that

sup_{f∈F} |f(x) − Pf| < ∞ ∀x ∈ X.

Then f ↦ Gnf can be viewed as a map into ℓ^∞(F). If Gn is asymptotically tight in ℓ^∞(F), then Gn converges weakly to a tight Borel measurable map G in ℓ^∞(F), by a CLT-like argument and theorem 1.4.5.

Definition 2.1.6. A class F of square-integrable (measurable) functions is called a P-Donsker class if (Gn) is asymptotically tight.

Remark 2.1.7. A finite collection F of integrable functions is trivially P-Glivenko-Cantelli. Furthermore, a finite collection F of square-integrable functions is P-Donsker (∵ (iii)⇒(i) part of theorem 1.4.8).

Example 2.1.8. Let X1, X2, · · · be i.i.d. r.v.'s in R, and

F := { 1_{(−∞,t]} : t ∈ R }.

Then

‖Pn − P‖_F = sup_{t∈R} |Fn(t) − F(t)| → 0 a.s. as n → ∞

for any probability measure P on R. It follows that F is P-Glivenko-Cantelli for any P.
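This uniform convergence is easy to watch numerically. The sketch below (an illustration with our own helper name, not part of the lecture) computes sup_t |Fn(t) − F(t)| exactly for Uniform(0,1) samples, where F(t) = t and the supremum is attained at the order statistics:

```python
import random

def ecdf_sup_distance(n, seed=0):
    """sup_t |F_n(t) - F(t)| for n Uniform(0,1) samples (F(t) = t).

    For the ECDF the supremum is attained at the order statistics: at the
    i-th sorted sample x the ECDF jumps from (i-1)/n to i/n."""
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    return max(max(i / n - x, x - (i - 1) / n) for i, x in enumerate(xs, 1))

# The sup-distance shrinks as n grows, illustrating that the class
# {1_(-inf, t] : t in R} is P-Glivenko-Cantelli.
d_small = ecdf_sup_distance(100)
d_large = ecdf_sup_distance(100_000)
print(d_small, d_large)
```

The printed values reflect the familiar 1/√n decay of the Kolmogorov distance.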

To show that F is P-Donsker, we should show asymptotic tightness of (Gn), which is obtained by controlling the “supremum on a finite partition.” For this, we need some maximal inequalities and techniques for controlling variation, which will be covered in the rest of this chapter.

2.2 Tail and Concentration Bounds

The simplest case is well known to us:


Figure 2.1: The supremum over an infinite set might be controlled as an aggregation of the supremum on a finite net and the variation in each small ball.

Theorem 2.2.1 (Markov inequality). Let X be a r.v. with mean µ. Then

P(|X − µ| ≥ t) ≤ E|X − µ|^k / t^k ∀t, k > 0.

It gives a “polynomial bound” for the tail probability. However, such a result may not be so useful because of its roughness. Some results about “exponential bounds” are also well known; these are often called concentration inequalities.

Theorem 2.2.2 (Chernoff bound).

P(X − µ ≥ t) ≤ E(e^{λ(X−µ)}) / e^{λt} ∀λ > 0 ∀t ∈ R.

Proof. Clear from

I(X − µ ≥ t) = I(e^{λ(X−µ)} ≥ e^{λt}) ≤ e^{λ(X−µ)} / e^{λt}.

Example 2.2.3 (Gaussian tail bound). Let X ∼ N(µ, σ²) be a Gaussian r.v.. Then by the Chernoff inequality,

P(X − µ ≥ t) ≤ e^{−λt} E e^{λ(X−µ)} = exp( −λt + σ²λ²/2 ) for any t > 0, λ > 0.

Hence, we get

P(X − µ ≥ t) ≤ inf_{λ>0} exp( −λt + σ²λ²/2 ) = e^{−t²/2σ²} ∀t > 0.
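A quick numeric sanity check of this bound (illustrative only; the function names are ours): the optimized Chernoff bound e^{−t²/2σ²} dominates the exact Gaussian tail, which for any σ equals (1/2)erfc(t/(σ√2)).

```python
import math

def gaussian_tail_exact(t, sigma=1.0):
    """P(X - mu >= t) for X ~ N(mu, sigma^2), via the complementary error function."""
    return 0.5 * math.erfc(t / (sigma * math.sqrt(2)))

def chernoff_bound(t, sigma=1.0):
    """The optimized Chernoff bound exp(-t^2 / (2 sigma^2))."""
    return math.exp(-t * t / (2 * sigma * sigma))

# The bound dominates the exact tail probability at every t > 0.
for t in [0.5, 1.0, 2.0, 4.0]:
    assert gaussian_tail_exact(t) <= chernoff_bound(t)
print(gaussian_tail_exact(2.0), chernoff_bound(2.0))
```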

As shown, a Gaussian random variable has a squared-exponential tail bound. In general, the collection of such distributions is named sub-Gaussian.

Definition 2.2.4. A r.v. X with mean EX = µ is called sub-Gaussian if ∃σ > 0 s.t.

E(e^{λ(X−µ)}) ≤ e^{σ²λ²/2} ∀λ ∈ R.


Remark 2.2.5. Note that the right-hand side of definition 2.2.4 is the mgf of N(0, σ²). Thus sub-Gaussianity means “a smaller mgf than that of the Gaussian distribution,” i.e., “having a tail which decays at least as fast as the Gaussian scale.”

Remark 2.2.6. Obviously, if X is sub-Gaussian, we get

P(X − µ ≥ t) ≤ exp( −t²/2σ² ) ∀t ≥ 0.

Furthermore, if X is sub-Gaussian with parameter σ, so is −X, and hence

P(|X − µ| ≥ t) = P(X − µ ≥ t) + P((−X) − (−µ) ≥ t) ≤ 2 exp( −t²/2σ² ).

Example 2.2.7. A r.v. ε is called Rademacher if

P(ε = 1) = P(ε = −1) = 1/2.

In this case,

E e^{λε} = (e^λ + e^{−λ})/2 = ∑_{k=0}^∞ λ^{2k}/(2k)! ≤ ∑_{k=0}^∞ λ^{2k}/(2^k k!) = exp( λ²/2 ),

and hence ε is sub-Gaussian with σ = 1.
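The mgf comparison in this example can be checked directly: E e^{λε} = cosh λ, and the series bound says cosh λ ≤ e^{λ²/2} for every λ (a small sketch with our own function name):

```python
import math

def rademacher_mgf(lam):
    """E exp(lam * eps) = cosh(lam) for a Rademacher variable eps."""
    return math.cosh(lam)

# cosh(lam) <= exp(lam^2 / 2): eps is sub-Gaussian with sigma = 1.
for lam in [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]:
    assert rademacher_mgf(lam) <= math.exp(lam * lam / 2)
print(rademacher_mgf(1.0), math.exp(0.5))
```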

Actually, this result is not so surprising, because a distribution with bounded support has an extremely light tail, clearly lighter than that of a Gaussian. We can formalize this conjecture as follows:

Example 2.2.8. Let X be a r.v. with EX = µ and P(a ≤ X ≤ b) = 1. Then X is sub-Gaussian with σ = (b − a)/2. To show this, define

ψ(λ) = log E e^{λX}. (“cgf”)

Then ψ(0) = 0, ψ′(0) = µ and

ψ′′(λ) = E_λ(X²) − (E_λ(X))²,

where

E_λ f(X) := E(f(X) e^{λX}) / E(e^{λX}).

Note that E_λ can be viewed as an expectation operator w.r.t. a weight proportional to e^{λX}. Now note that:

If a ≤ Y ≤ b a.s., then

var(Y) = min_y E(Y − y)² ≤ E(Y − (a + b)/2)² ≤ ((b − a)/2)²

holds.


Since ψ′′(λ) can be viewed as a variance, we get

ψ′′(λ) ≤ ((b − a)/2)² ∀λ ∈ R.

Thus we obtain

sup_{λ∈R} ψ′′(λ) ≤ ((b − a)/2)²,

and hence, by Taylor expansion with some ξ between 0 and λ,

ψ(λ) = ψ(0) + ψ′(0)λ + (ψ′′(ξ)/2)λ² ≤ ψ(0) + ψ′(0)λ + (λ²/2)((b − a)/2)² = λµ + (λ²/2)((b − a)/2)²,

which yields

E(e^{λ(X−µ)}) = e^{−λµ+ψ(λ)} ≤ exp( (λ²/2)((b − a)/2)² ).

Our next result is that an independent sum of sub-Gaussian random variables is also sub-Gaussian.

Theorem 2.2.9 (Hoeffding's inequality). Let Xi be independent r.v.'s with EXi = µi, where each Xi is sub-Gaussian with σ = σi. Then ∑_{i=1}^n Xi is also sub-Gaussian with parameter (∑_{i=1}^n σi²)^{1/2}, i.e.,

P( ∑_{i=1}^n (Xi − µi) ≥ t ) ≤ exp( −t² / (2 ∑_{i=1}^n σi²) ) ∀t ≥ 0.

Proof. It is sufficient to show that

X1 + X2 is sub-Gaussian with σ² = σ1² + σ2².

It is clear from

E( e^{λ(X1+X2−(µ1+µ2))} ) = E( e^{λ(X1−µ1)} ) E( e^{λ(X2−µ2)} ) ≤ exp( σ1²λ²/2 ) exp( σ2²λ²/2 ) = exp( (σ1² + σ2²)λ²/2 ).

The following corollary is clear from Hoeffding's inequality, but it is a very useful result. It will also be widely used in this course.

Corollary 2.2.10. If the Xi are bounded and independent, i.e., P(ai ≤ Xi ≤ bi) = 1, then

P( ∑_{i=1}^n (Xi − µi) ≥ t ) ≤ exp( −2t² / ∑_{i=1}^n (bi − ai)² ).
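The corollary can be watched in a small Monte Carlo experiment (a sketch with hypothetical names, using centered Uniform(0,1) summands so that bi − ai = 1 for every i): the empirical tail frequency should sit below exp(−2t²/n).

```python
import math
import random

def hoeffding_check(n=50, t=5.0, trials=20000, seed=1):
    """Empirical P(sum_i (X_i - mu_i) >= t) for X_i ~ Uniform(0,1),
    compared with the Hoeffding bound exp(-2 t^2 / sum_i (b_i - a_i)^2)."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(trials)
        if sum(rng.random() - 0.5 for _ in range(n)) >= t  # centered summands
    )
    empirical = hits / trials
    bound = math.exp(-2 * t * t / n)  # here b_i - a_i = 1 for every i
    return empirical, bound

emp, bnd = hoeffding_check()
print(emp, bnd)
```

The bound is not sharp for this distribution, so the empirical frequency typically sits far below it.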


Before we move on, let us check some equivalent conditions for sub-Gaussianity.

Theorem 2.2.11. For any X with EX = 0, TFAE.

(i) ∃σ > 0 s.t. E e^{λX} ≤ exp( λ²σ²/2 ) ∀λ ∈ R (i.e., X is sub-Gaussian).

(ii) ∃c ≥ 1 and a Gaussian r.v. Z ∼ N(0, τ²) s.t. P(|X| ≥ s) ≤ cP(|Z| ≥ s) ∀s ≥ 0.

(iii) ∃θ ≥ 0 s.t. EX^{2k} ≤ ((2k)!/(2^k k!)) θ^{2k} ∀k = 1, 2, · · · .

(iv) ∃σ > 0 s.t. E e^{λX²/2σ²} ≤ 1/√(1 − λ) ∀λ ∈ [0, 1).

Now we introduce another notion. The notion of sub-Gaussianity is fairly restrictive, so it is natural to consider various relaxations of it. The class of sub-exponential r.v.'s is defined by a slightly milder condition on the mgf, and hence allows a slower tail decay rate.

Definition 2.2.12. A random variable X is called sub-exponential if ∃ν, b > 0 s.t.

E e^{λ(X−µ)} ≤ e^{ν²λ²/2} ∀λ : |λ| ≤ 1/b.

Obviously, sub-Gaussianity implies sub-exponentiality. The converse is not true; sub-Gaussianity is the strictly stronger condition.

Example 2.2.13. Let Z ∼ N(0, 1) and X = Z². Then

E e^{λ(X−1)} = e^{−λ}/√(1 − 2λ) for λ < 1/2,

and the mgf does not exist for λ ≥ 1/2. With a simple calculation, we can verify that

e^{−λ}/√(1 − 2λ) ≤ e^{2λ²} ∀λ : |λ| < 1/4.

Therefore, X is sub-exponential, but not sub-Gaussian.
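The “simple calculation” can be delegated to a grid check (illustrative; the pair (ν, b) = (2, 4) in the comment is the standard choice implied by the displayed bound):

```python
import math

def chi2_mgf_centered(lam):
    """E exp(lam (Z^2 - 1)) for Z ~ N(0,1); finite only for lam < 1/2."""
    return math.exp(-lam) / math.sqrt(1 - 2 * lam)

# On |lam| < 1/4 the centered mgf is bounded by exp(2 lam^2), i.e. Z^2 is
# sub-exponential with nu = 2, b = 4; but the mgf blows up at lam = 1/2,
# so Z^2 cannot be sub-Gaussian.
for k in range(-24, 25):
    lam = k / 100  # grid over (-0.24, 0.24)
    assert chi2_mgf_centered(lam) <= math.exp(2 * lam * lam)
print(chi2_mgf_centered(0.2), math.exp(2 * 0.2 * 0.2))
```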

Theorem 2.2.14. For any X with EX = 0, TFAE.

(i) ∃ν, b > 0 s.t. E e^{λX} ≤ e^{λ²ν²/2} ∀λ : |λ| ≤ 1/b (i.e., X is sub-exponential).

(ii) ∃c0 > 0 s.t. E e^{λX} < ∞ ∀λ : |λ| ≤ c0.

(iii) ∃c1, c2 > 0 s.t. P(|X| ≥ t) ≤ c1 e^{−c2 t} ∀t > 0.

(iv) ∃σ, M > 0 s.t. E|X|^k ≤ (σ²/2) k! M^{k−2} ∀k = 2, 3, · · · . (“Bernstein condition”)

The condition (iv) is called the Bernstein condition. It is known that:


Lemma 2.2.15. If EX = 0 and X satisfies the Bernstein condition, then

P(|X| ≥ t) ≤ 2 e^{−t²/(2(σ² + Mt))} ∀t > 0.

Proof. Note that

E e^{λX} = ∑_{k=0}^∞ λ^k EX^k/k! = 1 + ∑_{k=2}^∞ λ^k EX^k/k! ≤ 1 + ∑_{k=2}^∞ |λ|^k (σ²/2) k! M^{k−2}/k! = 1 + (λ²σ²/2) ∑_{k=2}^∞ (|λ|M)^{k−2}

holds, which implies

E e^{λX} ≤ 1 + (λ²σ²/2) · 1/(1 − |λ|M) ≤ exp( λ²σ²/(2(1 − |λ|M)) )

provided that |λ| < 1/M. It gives

P(X ≥ t) ≤ exp( −λt + λ²σ²/(2(1 − |λ|M)) ) ∀λ : |λ| < 1/M,

and hence, choosing λ = t/(σ² + Mt) ∈ (0, 1/M),

P(X ≥ t) ≤ exp( −λt + λ²σ²/(2(1 − |λ|M)) ) = exp( −t²/(2(σ² + Mt)) ).

The same technique applied to −X gives the conclusion

P(|X| ≥ t) ≤ 2 e^{−t²/(2(σ² + Mt))}.

We can easily extend the result to an independent sum of random variables.

Corollary 2.2.16 (Bernstein's inequality). Let Xi be independent random variables satisfying the Bernstein condition

EXi = 0 and E|Xi|^k ≤ (σi²/2) k! M^{k−2}, k = 2, 3, · · · .

Then

P(|X1 + · · · + Xn| ≥ t) ≤ 2 exp( −(1/2) t² / (∑_{i=1}^n σi² + Mt) ).


Proof. By the Chernoff inequality, we get

P(|X1 + · · · + Xn| ≥ t) ≤ 2 e^{−λt} E e^{λ ∑_{i=1}^n Xi} ≤ 2 exp( −λt + ∑_{i=1}^n λ²σi² / (2(1 − |λ|M)) ).

We get the conclusion by letting λ = t / (Mt + ∑_{i=1}^n σi²).

Example 2.2.17. Let Zk ∼iid N(0, 1). Then, by example 2.2.13,

P( |(1/n) ∑_{k=1}^n Zk² − 1| ≥ t ) ≤ 2 e^{−λt} E exp( λ( (1/n) ∑_{k=1}^n Zk² − 1 ) ) = 2 e^{−λt} ( e^{−λ/n}/√(1 − 2λ/n) )^n ≤ 2 e^{−λt} e^{2n(λ/n)²} = 2 e^{−λt + 2λ²/n}

for any λ s.t. |λ|/n < 1/4. Since, for t ∈ (0, 1),

min_{|λ|<n/4} ( 2λ²/n − λt ) = −nt²/8,

we get

P( |(1/n) ∑_{k=1}^n Zk² − 1| ≥ t ) ≤ 2 e^{−nt²/8}.
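A Monte Carlo sanity check of this chi-square concentration bound (a sketch with our own function name; t is kept inside (0, 1)):

```python
import math
import random

def chi2_concentration(n=100, t=0.5, trials=10000, seed=2):
    """Empirical P(|(1/n) sum_k Z_k^2 - 1| >= t) versus the bound 2 exp(-n t^2 / 8)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean_sq = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n
        if abs(mean_sq - 1.0) >= t:
            hits += 1
    return hits / trials, 2 * math.exp(-n * t * t / 8)

emp, bnd = chi2_concentration()
print(emp, bnd)
```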

Example 2.2.18 (Johnson-Lindenstrauss embedding). Let ui ∈ R^d, i = 1, 2, · · · , m be extremely high-dimensional vectors (i.e., d is very large). We want to find a map F : R^d → R^n with n ≪ d and

(1 − δ)‖ui − uj‖2² ≤ ‖F(ui) − F(uj)‖2² ≤ (1 + δ)‖ui − uj‖2²

for some δ ∈ (0, 1) (“embedding into a low-dimensional space while approximately preserving distances”).

Remark 2.2.19. Such an embedding might be useful when using, for example, a clustering algorithm. There are various distance-based clustering methods such as K-means. If one handles extremely high-dimensional data, then obtaining the distances between all pairs of data points might require heavy computation. For this reason, one can first embed the data into a low-dimensional space, preserving distances, and then treat the data as low-dimensional.

(Example continued) Define F : R^d → R^n by

F(u) = Xu/√n, where X = (xij) ∈ R^{n×d} with xij ∼iid N(0, 1).


Then

‖F(u)‖2²/‖u‖2² = ‖Xu‖2²/(n‖u‖2²) = ∑_{i=1}^n ⟨Xi, u⟩²/(n‖u‖2²) = (1/n) ∑_{i=1}^n ⟨Xi, u/‖u‖2⟩²,

where Xi is the ith row vector of X. Note that for any fixed u,

n‖F(u)‖2²/‖u‖2² = ∑_{i=1}^n ⟨Xi, u/‖u‖2⟩² ∼ χ²(n)

holds, and hence we get

P( ‖F(u)‖2²/‖u‖2² ∉ [1 − δ, 1 + δ] ) ≤ 2 e^{−nδ²/8}

for any u ≠ 0 by the previous example. Thus, using F(ui − uj) = F(ui) − F(uj), we get

P( ‖F(ui) − F(uj)‖2²/‖ui − uj‖2² ∉ [1 − δ, 1 + δ] for some i ≠ j ) ≤ ∑_{i≠j} P( ‖F(ui) − F(uj)‖2²/‖ui − uj‖2² ∉ [1 − δ, 1 + δ] ) ≤ 2 (m choose 2) e^{−nδ²/8}.

Finally, for any ε ∈ (0, 1) and m ≥ 2,

2 (m choose 2) e^{−nδ²/8} ≤ ε if n > (16/δ²) log(m/ε),

so for such n, we can find a map F. ♦
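The whole construction fits in a few lines of code. The sketch below (illustrative dimensions and names; a plain-Python implementation, so the sizes are kept modest) draws m random high-dimensional points, projects them with F(u) = Xu/√n, and reports the worst pairwise distortion of squared distances:

```python
import math
import random

def jl_worst_distortion(d=1000, n=300, m=8, seed=3):
    """Embed m random points of R^d into R^n via F(u) = X u / sqrt(n) with
    i.i.d. N(0,1) entries of X; return max over pairs of |dist ratio - 1|."""
    rng = random.Random(seed)
    pts = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
    X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

    def project(u):
        return [sum(x * v for x, v in zip(row, u)) / math.sqrt(n) for row in X]

    proj = [project(u) for u in pts]
    worst = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            orig = sum((a - b) ** 2 for a, b in zip(pts[i], pts[j]))
            emb = sum((a - b) ** 2 for a, b in zip(proj[i], proj[j]))
            worst = max(worst, abs(emb / orig - 1.0))
    return worst

worst = jl_worst_distortion()
print(worst)
```

With these sizes the union bound 2(m choose 2)e^{−nδ²/8} already guarantees small distortion with high probability.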

From now on, we focus on our original interest. Whether a given class F is a Glivenko-Cantelli (Donsker) class depends on the size of the class. A finite class of square-integrable functions is always Donsker (by theorem 1.4.8), while at the other extreme the class of all square-integrable uniformly bounded functions is almost never Donsker. A relatively simple way to measure the size of a class is to use entropy numbers, which are essentially the logarithm of the number of “balls” or “brackets” of size ε needed to cover F. Let (F, ‖·‖) be a subset of a normed space of functions f : X → R.

Definition 2.2.20 (Covering number).

• The covering number N(ε, F, ‖·‖) is the minimum number of balls {g : ‖g − f‖ < ε} of radius ε needed to cover F. The centers f of the balls need not belong to F.

• The entropy is the logarithm of the covering number N(ε, F, ‖·‖).

Definition 2.2.21 (Bracketing number).

• Given two functions l and u, the bracket [l, u] is the set of all functions f with l ≤ f ≤ u.

• An ε-bracket is a bracket [l, u] with ‖u − l‖ < ε.

• The bracketing number N_{[]}(ε, F, ‖·‖) is the minimum number of ε-brackets needed to cover F. The functions u and l need not belong to F.


• The bracketing entropy (entropy with bracketing) is the logarithm of the bracketing number N_{[]}(ε, F, ‖·‖).

We only consider norms ‖·‖ with the property

|f| ≤ |g| =⇒ ‖f‖ ≤ ‖g‖.

For example, the L_r(Q) norm

‖f‖_{Q,r} = ( ∫ |f|^r dQ )^{1/r}

satisfies this property.

Remark 2.2.22. Note that

N(ε, F, ‖·‖) ≤ N_{[]}(2ε, F, ‖·‖)

is satisfied, because

f ∈ [l, u], ‖u − l‖ < 2ε =⇒ f ∈ B_ε( (u + l)/2 )

holds, i.e., every 2ε-bracket is contained in some ε-ball.

Definition 2.2.23. An envelope function of F is any function F s.t.

|f(x)| ≤ F(x) ∀x ∈ X, ∀f ∈ F.

2.3 Maximal Inequalities

In this section, we will obtain bounds on the expectation of a maximum, for example, the maximal variation of a stochastic process within a small time interval. For this we introduce the notion of the Orlicz norm.

Definition 2.3.1. For ψ : [0, ∞) → [0, ∞), where ψ is a strictly increasing and convex function with ψ(0) = 0, and a random variable X, the Orlicz norm ‖X‖_ψ is defined as

‖X‖_ψ = inf{ C > 0 : Eψ(|X|/C) ≤ 1 }.

Of course, one may wonder whether the Orlicz norm is actually a “norm.”

Proposition 2.3.2. ‖ · ‖ψ is a norm on the set of all random variables with ‖X‖ψ <∞, i.e.,

(i) ‖aX‖ψ = |a| · ‖X‖ψ ∀a ∈ R;

(ii) ‖X‖ψ = 0 ⇐⇒ X = 0 a.s.;

(iii) ‖X + Y ‖ψ ≤ ‖X‖ψ + ‖Y ‖ψ.


Proof. (i) Trivial.

(ii) The ⇐ part is trivial. Assume that ‖X‖_ψ = 0. It means that

Eψ(|X|/C) ≤ 1 ∀C > 0. (∗)

Note that

ψ(|X|/C) → ψ(∞) = ∞ as C ↓ 0 on {X ≠ 0};
ψ(|X|/C) → ψ(0) = 0 as C ↓ 0 on {X = 0}.

(ψ(∞) = ∞ because ψ is a convex, strictly increasing function.) If P(X ≠ 0) > 0, then by the monotone convergence theorem,

lim_{C↓0} Eψ(|X|/C) = ∞,

which contradicts (∗).

(iii) It suffices to show that

Eψ(|X|/C1) ∨ Eψ(|Y|/C2) ≤ 1 =⇒ Eψ( |X + Y|/(C1 + C2) ) ≤ 1.

(∵ Let Eψ(|X|/C1) ≤ 1 and Eψ(|Y|/C2) ≤ 1. Then under our claim ‖X + Y‖_ψ ≤ C1 + C2 holds. Taking the infimum w.r.t. C1 and C2 sequentially, we get the desired result.) It comes from:

ψ( |X + Y|/(C1 + C2) ) ≤ ψ( (|X| + |Y|)/(C1 + C2) ) (∵ ψ is strictly increasing)
= ψ( (C1/(C1 + C2)) · |X|/C1 + (C2/(C1 + C2)) · |Y|/C2 )
≤ (C1/(C1 + C2)) ψ(|X|/C1) + (C2/(C1 + C2)) ψ(|Y|/C2) (∵ ψ is convex).

There are two frequently used families of Orlicz norms.

Example 2.3.3. Let ψ(x) = x^p, p ≥ 1. Then ψ trivially satisfies the conditions in definition 2.3.1 and

‖X‖_ψ = inf{ C > 0 : E(|X|/C)^p ≤ 1 } = inf{ C > 0 : E|X|^p ≤ C^p } = (E|X|^p)^{1/p} =: ‖X‖_p,

i.e., the Orlicz norm w.r.t. ψ(x) = x^p is the L_p-norm.
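The infimum defining ‖X‖_ψ can also be found numerically, since C ↦ Eψ(|X|/C) is decreasing in C. A bisection sketch (our own helper, applied to an empirical distribution) recovers the sample L2-norm when ψ(x) = x²:

```python
import math
import random

def orlicz_norm(sample, psi, lo=1e-6, hi=1e6, iters=100):
    """Smallest C with (1/N) sum psi(|x|/C) <= 1, found by bisection
    (the map C -> E psi(|X|/C) is decreasing in C)."""
    def admissible(C):
        return sum(psi(abs(x) / C) for x in sample) / len(sample) <= 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if admissible(mid):
            hi = mid
        else:
            lo = mid
    return hi

rng = random.Random(4)
sample = [rng.gauss(0, 1) for _ in range(5000)]

# With psi(x) = x^2, the Orlicz norm equals the (empirical) L2-norm.
l2 = math.sqrt(sum(x * x for x in sample) / len(sample))
o2 = orlicz_norm(sample, lambda x: x * x)
print(l2, o2)
```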


Example 2.3.4. Let ψ_p(x) := e^{x^p} − 1, p ≥ 1. Then ψ_p trivially satisfies the conditions in definition 2.3.1 and ψ_p(x) ≥ x^p. Hence

‖X‖_p ≤ ‖X‖_{ψ_p}.

Remark 2.3.5. Note that, for ‖X‖_p or ‖X‖_{ψ_p} to exist,

Eψ(|X|/C) < ∞ or Eψ_p(|X|/C) < ∞

should hold for some C > 0, respectively. The former requires a polynomial-order tail bound, while the latter requires an exponential-order (p = 1) or squared-exponential-order (p = 2) one. In general, the following holds.

Proposition 2.3.6 (Tail bound). If ‖X‖_ψ < ∞, then

P(|X| > x) ≤ 1 / ψ( x/‖X‖_ψ ).

Proof. Since ψ is continuous (from convexity),

Eψ( |X|/‖X‖_ψ ) = E lim_{C↓‖X‖_ψ} ψ(|X|/C) = lim_{C↓‖X‖_ψ} Eψ(|X|/C) ≤ 1 (2.2)

holds by the MCT (actually “=” holds). Now the Markov inequality gives

P(|X| > x) = P( ψ(|X|/‖X‖_ψ) ≥ ψ(x/‖X‖_ψ) ) ≤ Eψ(|X|/‖X‖_ψ) / ψ(x/‖X‖_ψ) ≤ 1 / ψ( x/‖X‖_ψ ).

This proposition gives a necessary condition for ‖X‖_ψ < ∞. Then what is a sufficient condition? In other words, is there any condition on the tail bound which implies ‖X‖_ψ < ∞?

Proposition 2.3.7. If P(|X| > x) ≤ C/x^{p+δ} for p ≥ 1 and C, δ > 0, then ‖X‖_p < ∞.

Proof.

E|X|^p = ∫_0^∞ P(|X|^p > x) dx ≤ 1 + ∫_1^∞ C/x^{1+δ/p} dx < ∞.

Proposition 2.3.8. If P(|X| > x) ≤ K e^{−Cx^p} for p ≥ 1 and C, K > 0, then ‖X‖_{ψ_p} < ∞.


Proof. Note that

E( e^{D|X|^p} − 1 ) = E ∫_0^{|X|^p} D e^{Ds} ds = E ∫_0^∞ I(s < |X|^p) D e^{Ds} ds = ∫_0^∞ P(s < |X|^p) D e^{Ds} ds ≤ ∫_0^∞ K e^{−Cs} D e^{Ds} ds = KD ∫_0^∞ e^{−(C−D)s} ds ≤ 1

holds for sufficiently small D > 0. It gives that

Eψ_p( |X| / D^{−1/p} ) ≤ 1

for sufficiently small D > 0 (precisely, if D ≤ C/(K + 1)), i.e., ‖X‖_{ψ_p} < ∞ (precisely, ‖X‖_{ψ_p} ≤ ((K + 1)/C)^{1/p}).

Remark 2.3.9. Proposition 2.3.7 gives that, if the tail probability is bounded with polynomial order p“+δ”, then the p-norm ‖X‖_p is finite; proposition 2.3.8 gives that if the tail probability is bounded with a squared-exponential (exponential, resp.) rate, i.e., the random variable has a sub-Gaussian (sub-exponential, resp.) distribution, then ‖X‖_{ψ_2} < ∞ (‖X‖_{ψ_1} < ∞, resp.) is satisfied.

Our original goal in this section is to obtain bounds for maxima of random variables. Such maximal inequalities can be found from the basic properties of the Orlicz norm. Before starting, note the following naive bound

E( max_{1≤i≤m} |Xi| ) ≤ ∑_{i=1}^m E|Xi| ≤ m · max_{1≤i≤m} E|Xi|,

or similarly,

‖ max_{1≤i≤m} |Xi| ‖_p = ( E max_{1≤i≤m} |Xi|^p )^{1/p} ≤ ( ∑_{i=1}^m E|Xi|^p )^{1/p} ≤ ( m · max_{1≤i≤m} E|Xi|^p )^{1/p} = m^{1/p} max_{1≤i≤m} ‖Xi‖_p.

Thus if the random variables have a smaller tail probability (E max|Xi|^p < ∞ for larger p), then a tighter bound for the maximum is obtained (m^{1/p}). The following theorem gives a generalized bound.

Theorem 2.3.10. Let ψ be a convex, strictly increasing function with ψ(0) = 0. Further, assume that ψ satisfies

lim sup_{x,y→∞} ψ(x)ψ(y)/ψ(cxy) < ∞ for some c > 0. (2.3)


Then for any random variables X1, X2, · · · , Xm,

‖ max_{1≤i≤m} |Xi| ‖_ψ ≤ K ψ^{−1}(m) max_{1≤i≤m} ‖Xi‖_ψ,

where K is a constant depending only on ψ.

Remark 2.3.11. Note that:

• The “m^{1/p}” in the naive bound corresponds to “ψ^{−1}(m)”. If ψ increases fast, then ψ^{−1}(m) becomes smaller, which gives a smaller bound.

• It holds for any random variables X1, · · · , Xm; it does not require additional assumptions such as independence.

Proof. Firstly, we assume that

ψ(x)ψ(y) ≤ ψ(cxy) ∀x, y ≥ 1

and ψ(1) ≤ 1/2. In this case,

ψ(x/y) ≤ ψ(cx)/ψ(y) ∀x ≥ y ≥ 1.

Thus, for y ≥ 1 and any C > 0,

max_{1≤i≤m} ψ( |Xi|/(Cy) ) ≤ max_{1≤i≤m} [ (ψ(c|Xi|/C)/ψ(y)) I( |Xi|/(Cy) ≥ 1 ) + ψ( |Xi|/(Cy) ) I( |Xi|/(Cy) < 1 ) ] ≤ max_{1≤i≤m} ψ(c|Xi|/C)/ψ(y) + ψ(1) ≤ ∑_{i=1}^m ψ(c|Xi|/C)/ψ(y) + 1/2

(using ψ(|Xi|/(Cy)) ≤ ψ(1) on {|Xi|/(Cy) < 1} in the middle step)

holds. Taking expectations with C = c max_{1≤i≤m} ‖Xi‖_ψ and y = ψ^{−1}(2m), we get

E[ ψ( max_{1≤i≤m} |Xi| / (Cy) ) ] = E[ max_{1≤i≤m} ψ( |Xi|/(Cy) ) ] ≤ ∑_{i=1}^m Eψ( |Xi| / max_j ‖Xj‖_ψ ) / ψ(y) + 1/2 ≤ ∑_{i=1}^m Eψ( |Xi|/‖Xi‖_ψ ) / (2m) + 1/2 ≤ m · (1/(2m)) + 1/2 = 1,

where each Eψ(|Xi|/‖Xi‖_ψ) ≤ 1 by (2.2). Therefore

‖ max_{1≤i≤m} |Xi| ‖_ψ ≤ Cy = c ψ^{−1}(2m) max_{1≤i≤m} ‖Xi‖_ψ ≤ 2c ψ^{−1}(m) max_{1≤i≤m} ‖Xi‖_ψ

holds from ψ^{−1}(2m) ≤ 2ψ^{−1}(m), which comes from

m = ( ψ(0) + ψ(ψ^{−1}(2m)) )/2 ≥ ψ( (0 + ψ^{−1}(2m))/2 )

and the monotonicity of ψ^{−1}.

Now we treat general ψ. Define φ(x) = σψ(τx). If τ > 0 is large enough, ∃K > 0 s.t.

∀x, y ≥ 0: φ(x)φ(y) = σ²ψ(τx)ψ(τy) ≤ Kσ²ψ(cτ²xy) = Kσ φ(cτxy) (∵ (2.3)),

so if σ < 1 is small enough, we get φ(x)φ(y) ≤ φ(cτxy) and φ(1) = σψ(τ) ≤ 1/2. Also note that

‖X‖_ψ = inf{ C > 0 : Eψ(|X|/C) ≤ 1 } = inf{ C > 0 : (1/σ) Eφ( |X|/(τC) ) ≤ 1 }.

Putting C = ‖X‖_φ/(στ) gives

(1/σ) Eφ( |X|/(τC) ) = (1/σ) Eφ( σ|X|/‖X‖_φ ) ≤(∗) Eφ( |X|/‖X‖_φ ) ≤ 1 by (2.2),

where (∗) holds from φ(σx + (1 − σ)·0) ≤ σφ(x) + (1 − σ)φ(0) = σφ(x). Hence we get

‖X‖_ψ ≤ ‖X‖_φ/(στ).

On the other hand,

‖X‖_φ = inf{ C > 0 : σ Eψ( τ|X|/C ) ≤ 1 },

and putting C = τ‖X‖_ψ we get

σ Eψ( τ|X|/C ) = σ Eψ( |X|/‖X‖_ψ ) ≤ σ ≤ 1 by (2.2),

which implies

‖X‖_φ ≤ τ‖X‖_ψ.


Therefore we have

‖ max_{1≤i≤m} |Xi| ‖_ψ ≤ (1/(στ)) ‖ max_{1≤i≤m} |Xi| ‖_φ ≤ (K1/(στ)) φ^{−1}(m) max_{1≤i≤m} ‖Xi‖_φ ≤(∗) (K1/(στ)²) ψ^{−1}(m) max_{1≤i≤m} τ‖Xi‖_ψ = K′ ψ^{−1}(m) max_{1≤i≤m} ‖Xi‖_ψ.

In (∗), we used φ^{−1}(x) = τ^{−1}ψ^{−1}(σ^{−1}x) together with

ψ( σψ^{−1}(x/σ) ) ≤ σψ( ψ^{−1}(x/σ) ) = x,

which gives

φ^{−1}(x) = (1/τ) ψ^{−1}(x/σ) ≤ (1/(στ)) ψ^{−1}(x).
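To see the ψ^{−1}(m) factor concretely: for ψ2(x) = e^{x²} − 1 (the sub-Gaussian Orlicz norm from example 2.3.4) we have ψ2^{−1}(m) = √(log(1 + m)), so the theorem predicts that the expected maximum of m standard Gaussians grows only logarithmically in m. A Monte Carlo sketch (illustrative only; the comparison constant is an empirical observation, not the theorem's K):

```python
import math
import random

def expected_max_abs(m, trials=2000, seed=5):
    """Monte Carlo estimate of E max_{i <= m} |X_i| for i.i.d. N(0,1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(abs(rng.gauss(0.0, 1.0)) for _ in range(m))
    return total / trials

# The estimates track psi_2^{-1}(m) = sqrt(log(1 + m)) up to a small constant.
results = {m: expected_max_abs(m) for m in [10, 100, 1000]}
for m, est in results.items():
    print(m, est, math.sqrt(math.log(1 + m)))
```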

Remark 2.3.12. Using the previous theorem, we can obtain bounds for the maximum of a stochastic process. A common technique to handle the maximum term is to partition the underlying space into a finite net and control the variation on small balls, for example,

sup_{t∈T} |Xt| ≤ max_{1≤i≤m} |X_{ti}| + sup_{d(t,ti)<δ} |X_{ti} − Xt|.

Partitioning the space into δ-balls is deeply related to the covering number; it will affect the bound. As δ becomes small, the variation on each δ-ball becomes smaller, while controlling the maximum over the finite net becomes more challenging.

Definition 2.3.13. Let (T, d) be an arbitrary semi-metric space. Then

• the covering number N(ε) is the minimum number of balls of radius ε needed to cover T;

• a collection of points is ε-separated if the distance between each pair of points is strictly larger than ε;

• the packing number D(ε) is the maximum number of ε-separated points in T.

We can naturally guess that the packing number D(ε) has a value similar to the covering number N(ε).

Proposition 2.3.14. N(ε) ≤ D(ε) ≤ N(ε/2).


Proof. First, for D = D(ε), let t1, t2, · · · , tD be maximal ε-separated points. Since the set {t1, t2, · · · , tD} is “maximal,” adding any other point of T makes the set not ε-separated. That is,

∀t ∈ T ∃ti s.t. d(t, ti) ≤ ε.

It means that

T ⊆ ⋃_{j=1}^D B_ε(tj),

i.e., N(ε) ≤ D(ε).

Next, let D = D(ε) and N = N(ε/2). Assume that D > N. Then ∃t1, t2, · · · , tD which are ε-separated points, and ∃s1, s2, · · · , sN whose (ε/2)-balls cover T, i.e., T ⊆ ⋃_{j=1}^N B_{ε/2}(sj). Because we assumed that D > N, there exist two points ti and ti′ belonging to the same ball, ti, ti′ ∈ B_{ε/2}(sj), so that d(ti, ti′) < ε. This contradicts the assumption that the ti are ε-separated. Therefore D ≤ N.
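On the real line both quantities can be computed exactly by greedy algorithms (greedy interval covering and greedy point selection are optimal in one dimension), which lets us watch the sandwich N(ε) ≤ D(ε) ≤ N(ε/2) numerically. A sketch with our own helper names:

```python
import random

def covering_number_1d(xs, eps):
    """Exact N(eps) for a finite subset of R: greedily cover the sorted
    points with intervals of length 2*eps (ball centers may lie anywhere)."""
    xs = sorted(xs)
    count, i = 0, 0
    while i < len(xs):
        count += 1
        right = xs[i] + 2 * eps  # ball centered at xs[i] + eps
        while i < len(xs) and xs[i] <= right:
            i += 1
    return count

def packing_number_1d(xs, eps):
    """Exact D(eps): greedy maximal subset with pairwise distances > eps."""
    count, last = 0, None
    for x in sorted(xs):
        if last is None or x - last > eps:
            count += 1
            last = x
    return count

rng = random.Random(6)
xs = [rng.random() * 10 for _ in range(200)]
results = {}
for eps in [0.05, 0.1, 0.5, 1.0]:
    results[eps] = (covering_number_1d(xs, eps),
                    packing_number_1d(xs, eps),
                    covering_number_1d(xs, eps / 2))
    print(eps, results[eps])
```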

Now we are ready for our main result on maximal inequalities.

Definition 2.3.15. A stochastic process (Xt)_{t∈T} is separable if for any countable dense subset T0 ⊆ T and δ > 0,

sup_{d(s,t)<δ; s,t∈T} |Xs − Xt| = sup_{d(s,t)<δ; s,t∈T0} |Xs − Xt| a.s..

Lemma 2.3.16. If 0 ≤ Xn ↑ X, then ‖Xn‖_ψ ↑ ‖X‖_ψ.

Proof. First, it is obvious that

0 ≤ X ≤ Y =⇒ ‖X‖_ψ ≤ ‖Y‖_ψ.

Now, for any C < ‖X‖_ψ, by definition,

lim_{n→∞} Eψ(Xn/C) = Eψ(X/C) > 1 (by the MCT).

It implies that ‖Xn‖_ψ ≥ C for large n, i.e.,

lim inf_{n→∞} ‖Xn‖_ψ ≥ C.

Since C < ‖X‖_ψ was arbitrary, we get

lim inf_{n→∞} ‖Xn‖_ψ ≥ ‖X‖_ψ.


Meanwhile, Xn ≤ X implies ‖Xn‖_ψ ≤ ‖X‖_ψ, which gives

lim_{n→∞} ‖Xn‖_ψ = ‖X‖_ψ.

Theorem 2.3.17 (Maximal Inequality). Let ψ be a convex, strictly increasing function satisfying ψ(0) = 0 and (2.3). Also assume that the stochastic process (Xt)_{t∈T} is separable and satisfies

‖Xs − Xt‖_ψ ≤ C · d(s, t) ∀s, t ∈ T. (2.4)

Then for any η, δ > 0,

‖ sup_{d(s,t)≤δ} |Xs − Xt| ‖_ψ ≤ K[ ∫_0^η ψ^{−1}(D(ε)) dε + δ ψ^{−1}(D²(η)) ]

holds, where K is a constant depending only on C and ψ.

Proof. Construct T0 ⊆ T1 ⊆ · · · ⊆ T recursively so that Tj is a maximal η·2^{−j}-separated set containing Tj−1. Then by the definition of the packing number,

card(Tj) ≤ D(η·2^{−j}).

Note that (by maximality) ∀tj+1 ∈ Tj+1 ∃tj ∈ Tj s.t. d(tj, tj+1) ≤ η·2^{−j}. Link every tj+1 ∈ Tj+1 to a “unique” tj ∈ Tj s.t. d(tj, tj+1) ≤ η·2^{−j} (make any mapping which satisfies this; “how” it can be done is not our interest). Now call tk+1, tk, · · · , t0 a “chain.” Note that ⋃_{k=1}^∞ Tk is a countable dense subset of T (by construction). Since (Xt)_{t∈T} is separable,

‖ sup_{d(s,t)≤δ} |Xs − Xt| ‖_ψ = ‖ sup_{d(s,t)≤δ; s,t∈⋃_{k=1}^∞ Tk} |Xs − Xt| ‖_ψ = lim_{k→∞} ‖ sup_{d(s,t)≤δ; s,t∈Tk+1} |Xs − Xt| ‖_ψ (by the MCT, via lemma 2.3.16).

Now let sk+1 − sk − · · · − s0 and tk+1 − tk − · · · − t0 be chains. Then

|X_{sk+1} − X_{tk+1}| ≤ |(X_{sk+1} − X_{s0}) − (X_{tk+1} − X_{t0})| + |X_{s0} − X_{t0}| =: (∗∗) + |X_{s0} − X_{t0}|


holds. Now we get

(∗∗) = | ∑_{j=0}^k [ (X_{sj+1} − X_{sj}) − (X_{tj+1} − X_{tj}) ] | ≤ 2 ∑_{j=0}^k max_{(u,v)∈Lj} |Xu − Xv|,

where Lj is the set of all links from Tj+1 to Tj. Then we get

card(Lj) ≤ D(η·2^{−j−1}),

and hence by theorem 2.3.10,

‖ sup_{sk+1,tk+1∈Tk+1} (∗∗) ‖_ψ ≤ 2 ∑_{j=0}^k ‖ max_{(u,v)∈Lj} |Xu − Xv| ‖_ψ ≤ 2K′ ∑_{j=0}^k ψ^{−1}(card(Lj)) max_{(u,v)∈Lj} ‖Xu − Xv‖_ψ ≤ K ∑_{j=0}^k ψ^{−1}( D(η·2^{−j−1}) ) · η·2^{−j−2} · 4 ≤ 4K ∫_0^{η/2} ψ^{−1}(D(ε)) dε ≤ 4K ∫_0^η ψ^{−1}(D(ε)) dε,

using ψ^{−1}(card(Lj)) ≤ ψ^{−1}(D(η·2^{−j−1})) and ‖Xu − Xv‖_ψ ≤ C·d(u, v) ≤ Cη·2^{−j}.

Figure 2.2: ∑_{j=0}^k ψ^{−1}(D(η·2^{−j−1})) · η·2^{−j−2} ≤ ∫_0^{η/2} ψ^{−1}(D(ε)) dε.

Now, to control |X_{s0} − X_{t0}|, conversely for each pair of end points (s0, t0), choose a “unique” pair sk+1, tk+1 ∈ Tk+1 (which is different from those in the previous paragraph; there is some abuse of notation). Then

|X_{s0} − X_{t0}| ≤ (∗∗) + |X_{sk+1} − X_{tk+1}|


again, and hence

‖ max_{(s0,t0)∈T0} |X_{s0} − X_{t0}| ‖_ψ ≤ ‖ max_{sk+1,tk+1∈Tk+1} (∗∗) ‖_ψ + ‖ max |X_{sk+1} − X_{tk+1}| ‖_ψ ≤ 4K ∫_0^η ψ^{−1}(D(ε)) dε + ‖ max |X_{sk+1} − X_{tk+1}| ‖_ψ.

Note that the number of possible pairs (s0, t0) (and consequently (sk+1, tk+1)) is at most card(T0)² ≤ (D(η))², and thus by theorem 2.3.10 again,

‖ max |X_{sk+1} − X_{tk+1}| ‖_ψ ≤ K′ ψ^{−1}(D²(η)) max ‖X_{sk+1} − X_{tk+1}‖_ψ.

Since ‖X_{sk+1} − X_{tk+1}‖_ψ ≤ C·d(sk+1, tk+1), we get

‖ max_{d(s,t)≤δ; s,t∈Tk+1} |Xs − Xt| ‖_ψ ≤ 8K ∫_0^η ψ^{−1}(D(ε)) dε + K′ ψ^{−1}(D²(η)) · Cδ = 8K ∫_0^η ψ^{−1}(D(ε)) dε + K δ ψ^{−1}(D²(η)).

Remark 2.3.18. Why did we decompose |X_{sk+1} − X_{tk+1}| as (∗∗) and |X_{s0} − X_{t0}|, and then decompose |X_{s0} − X_{t0}| again? If we bounded |X_{sk+1} − X_{tk+1}| directly with a similar argument, then we would obtain a bound with the term ψ^{−1}(D²(η·2^{−j−1})), which might not be so useful.

How can such a maximal inequality be used? The following is one example, which gives a bound for sub-Gaussian stochastic processes. Before we start, we should define sub-Gaussianity of a stochastic process.

Definition 2.3.19. A stochastic process (Xt)_{t∈T} is sub-Gaussian with respect to a semi-metric d if

P(|Xs − Xt| > x) ≤ 2 exp( −x²/(2d²(s, t)) ) ∀x > 0.

Example 2.3.20. Any zero-mean Gaussian process is sub-Gaussian with respect to the L2-distance

d(s, t) = σ(Xs − Xt) = √(E(Xs − Xt)²).

Example 2.3.21. Let ε1, ε2, · · · , εn be Rademacher r.v.'s and

Xa = ∑_{i=1}^n ai εi, a ∈ R^n.


Then by Hoeffding's inequality,

P( |∑_{i=1}^n ai εi| ≥ x ) ≤ 2 exp( −x²/(2‖a‖²) ).

It implies that (Xa)_{a∈R^n} is a sub-Gaussian stochastic process with respect to the Euclidean distance d(a, b) = ‖a − b‖2.

To apply the maximal inequality, we should verify condition (2.4).

Proposition 2.3.22. For a sub-Gaussian stochastic process (Xt)_{t∈T} and ψ2(x) = e^{x²} − 1,

‖Xs − Xt‖_{ψ2} ≤ √6 d(s, t).

Proof. It suffices to show that

Eψ2( |Xs − Xt| / (√6 d(s, t)) ) ≤ 1.

It comes from

Eψ2( |Xs − Xt| / (√6 d(s, t)) ) = E[ exp( |Xs − Xt|²/(6d²(s, t)) ) − 1 ]
= ∫_0^∞ P( exp( |Xs − Xt|²/(6d²(s, t)) ) − 1 > x ) dx
= ∫_0^∞ P( |Xs − Xt| > √6 d(s, t) √(log(1 + x)) ) dx
≤ ∫_0^∞ 2 exp( −(1/2) · 6 d²(s, t) log(1 + x)/d²(s, t) ) dx
= ∫_0^∞ 2 e^{−3 log(1+x)} dx = ∫_0^∞ 2(1 + x)^{−3} dx = 1.

Now we get the desired result.

Corollary 2.3.23. Let (Xt)_{t∈T} be a separable sub-Gaussian stochastic process. Then

E sup_{d(s,t)≤δ} |Xs − Xt| ≲ ∫_0^δ √(log D(ε)) dε ∀δ > 0.

Remark 2.3.24. From now on, A ≲ B denotes that

A ≤ c·B for some universal constant c > 0.

Proof. Apply theorem 2.3.17 with ψ = ψ2 and η = δ. Since the constant K in theorem 2.3.17 depends only on ψ and C, which are both fixed in this example, K is universal. Therefore we get

E sup_{d(s,t)≤δ} |Xs − Xt| = ‖ sup_{d(s,t)≤δ} |Xs − Xt| ‖_1
≤ ‖ sup_{d(s,t)≤δ} |Xs − Xt| ‖_2
≤ ‖ sup_{d(s,t)≤δ} |Xs − Xt| ‖_{ψ2}
≲ ∫_0^δ ψ2^{−1}(D(ε)) dε + δ ψ2^{−1}(D²(δ))
≲ ∫_0^δ ψ2^{−1}(D(ε)) dε + δ ψ2^{−1}(D(δ)) (∵ ψ2^{−1}(x) = √(log(1 + x)) and hence ψ2^{−1}(x²) ≤ √2 ψ2^{−1}(x) for x ≥ 0)
≲ ∫_0^δ ψ2^{−1}(D(ε)) dε (∵ ψ2^{−1} is increasing while D is decreasing, and hence δ ψ2^{−1}(D(δ)) ≤ ∫_0^δ ψ2^{−1}(D(ε)) dε)
= ∫_0^δ √(log(1 + D(ε))) dε
≲ ∫_0^δ √(log D(ε)) dε (∵ log(1 + x) ≤ 2 log x for sufficiently large x).

Remark 2.3.25. Note that log D(ε) is an “entropy.” Thus, whether the bound (the integral) is finite or not depends on how fast the entropy grows as ε goes to 0.

2.4 Symmetrization

In empirical process theory, our final goal is to obtain Glivenko-Cantelli and Donsker theorems. They can be obtained by measuring the size of the class F via covering numbers or bracketing numbers. The former requires the symmetrization technique, while the latter requires the Bernstein inequality, as follows.

Lemma 2.4.1. Let X1, · · · , Xm be arbitrary r.v.’s with

P(|Xi| > x) ≤ 2e−12

x2

b+ax (x > 0)

42

Page 44: Advanced Probability Theory (Fall 2017) · Weak Convergence and Empirical Processes with Applications to Statistics, Van der Vaart & Wellner, Springer, 1996. Asymptotic Statistics,

Draft

Advanced Probability Theory J.P.Kim

for a, b > 0. Then ∥∥∥∥ max1≤i≤m

Xi

∥∥∥∥ψ1

. a log(1 +m) +√b log(1 +m).

Remark 2.4.2. The bound can be also represented as aψ−11 (m) +

√bψ−1

2 (m).

Proof. First note that
\[ \|X\|_{\psi_p} \le \|X\|_{\psi_q}\,(\log 2)^{\frac1q - \frac1p} \]
holds for $1 \le p \le q$. Indeed, the function $\varphi$ defined by
\[ \psi_p\big(x(\log 2)^{1/p}\big) = \varphi\big(\psi_q\big(x(\log 2)^{1/q}\big)\big), \]
i.e. $\varphi = \tilde\psi_p\circ\tilde\psi_q^{-1}$ where $\tilde\psi_p(x) = 2^{x^p}-1 = \psi_p\big(x(\log2)^{1/p}\big)$, is a concave non-decreasing function with $\varphi(1)=1$, and hence, using $\mathbb{E}\,\psi_q(|X|/\|X\|_{\psi_q})\le1$ and Jensen's inequality,
\[ 1 = \varphi(1) \ge \varphi\Big(\mathbb{E}\,\psi_q\Big((\log2)^{1/q}(\log2)^{-1/q}\frac{|X|}{\|X\|_{\psi_q}}\Big)\Big) \ge \mathbb{E}\,\varphi\Big(\psi_q\Big((\log2)^{1/q}(\log2)^{-1/q}\frac{|X|}{\|X\|_{\psi_q}}\Big)\Big) = \mathbb{E}\,\psi_p\Big((\log2)^{\frac1p-\frac1q}\frac{|X|}{\|X\|_{\psi_q}}\Big), \]
which gives $(\log2)^{\frac1q-\frac1p}\|X\|_{\psi_q} \ge \|X\|_{\psi_p}$. Now
\[ \mathbb{P}(|X_i|>x) \le 2e^{-\frac12\frac{x^2}{b+ax}} \le
\begin{cases} 2e^{-\frac{x^2}{4b}} & \big(0\le x\le \frac ba\big) \\ 2e^{-\frac{x}{4a}} & \big(x > \frac ba\big) \end{cases} \]
holds. Now recall that
\[ \mathbb{P}(|X|>x) \le Ke^{-Cx^p},\ p\ge1 \implies \|X\|_{\psi_p} \le \Big(\frac{K+1}{C}\Big)^{1/p} \]
(proposition 2.3.8). Thus for the decomposition
\[ |X_i| = \underbrace{|X_i|\,I\Big(|X_i|\le\frac ba\Big)}_{(*)} + \underbrace{|X_i|\,I\Big(|X_i|>\frac ba\Big)}_{(**)}, \]
we get
\[ \mathbb{P}((*)>x) \le 2e^{-\frac{x^2}{4b}} \text{ and hence } \|(*)\|_{\psi_2} \lesssim \sqrt b; \qquad \mathbb{P}((**)>x) \le 2e^{-\frac{x}{4a}} \text{ and hence } \|(**)\|_{\psi_1} \lesssim a. \]
Therefore we have
\begin{align*}
\Big\|\max_{1\le i\le m} X_i\Big\|_{\psi_1}
&\le \Big\|\max_{1\le i\le m}(*)\Big\|_{\psi_1} + \Big\|\max_{1\le i\le m}(**)\Big\|_{\psi_1}
\lesssim \Big\|\max_{1\le i\le m}(*)\Big\|_{\psi_2} + \Big\|\max_{1\le i\le m}(**)\Big\|_{\psi_1} \\
&\lesssim \psi_2^{-1}(m)\max_{1\le i\le m}\|(*)\|_{\psi_2} + \psi_1^{-1}(m)\max_{1\le i\le m}\|(**)\|_{\psi_1}
\lesssim \psi_2^{-1}(m)\sqrt b + \psi_1^{-1}(m)\,a. \qquad\square
\end{align*}

Now we see a very useful technique called symmetrization. Recall that in empirical process theory we consider the following setting:
\[ X_1, X_2, \cdots, X_n \overset{\text{i.i.d.}}{\sim} P, \qquad
\mathbb{P}_n f = \frac1n\sum_{i=1}^n f(X_i), \qquad
\mathbb{G}_n f = \frac{1}{\sqrt n}\sum_{i=1}^n (f(X_i) - Pf). \]
The symmetrization technique is based on the fact that, for Rademacher random variables $\varepsilon_1,\cdots,\varepsilon_n$, the process $f \mapsto (\mathbb{P}_n - P)f$ behaves similarly to
\[ f \mapsto \mathbb{P}_n^0 f := \frac1n\sum_{i=1}^n \varepsilon_i f(X_i). \]

Theorem 2.4.3 (Symmetrization). Let $\phi$ be a convex non-decreasing function and $\mathcal F$ be a class of measurable functions. Then
\[ \mathbb{E}^*\phi\big(\|\mathbb{P}_n - P\|_{\mathcal F}\big) \le \mathbb{E}^*\phi\big(2\|\mathbb{P}_n^0\|_{\mathcal F}\big). \]

Proof. We prove it only under the measurability condition; recall that under measurability we can use the Fubini theorem. Let $Y_1,\cdots,Y_n$ be independent copies of $X_1,\cdots,X_n$. Then
\begin{align*}
\|\mathbb{P}_n - P\|_{\mathcal F}
&= \sup_{f\in\mathcal F}\frac1n\Big|\sum_{i=1}^n \big(f(X_i)-\mathbb{E}f(X_i)\big)\Big|
= \sup_{f\in\mathcal F}\frac1n\Big|\sum_{i=1}^n \big(f(X_i)-\mathbb{E}_Y f(Y_i)\big)\Big| \\
&\le \mathbb{E}_Y \sup_{f\in\mathcal F}\frac1n\Big|\sum_{i=1}^n \big(f(X_i)-f(Y_i)\big)\Big|,
\end{align*}
and hence, by the non-decreasingness of $\phi$,
\[ \mathbb{E}\phi\big(\|\mathbb{P}_n-P\|_{\mathcal F}\big)
\le \mathbb{E}_X\phi\Big(\mathbb{E}_Y\sup_{f\in\mathcal F}\Big|\frac1n\sum_{i=1}^n(f(X_i)-f(Y_i))\Big|\Big)
\le \mathbb{E}_X\mathbb{E}_Y\phi\Big(\sup_{f\in\mathcal F}\Big|\frac1n\sum_{i=1}^n(f(X_i)-f(Y_i))\Big|\Big) \quad (\text{Jensen}) \]
holds. Now note that
\[ f(X_i)-f(Y_i) \overset{d}{\equiv} f(Y_i)-f(X_i) \]
by symmetry, and hence
\[ f(X_i)-f(Y_i) \overset{d}{\equiv} e_i\big(f(X_i)-f(Y_i)\big) \]
for any $e_i \in \{-1,1\}$ ("symmetrization!"). Consequently, we have
\[ \sup_{f\in\mathcal F}\Big|\frac1n\sum_{i=1}^n \big(f(X_i)-f(Y_i)\big)\Big| \overset{d}{\equiv} \sup_{f\in\mathcal F}\Big|\frac1n\sum_{i=1}^n e_i\big(f(X_i)-f(Y_i)\big)\Big| \]
for any $(e_1,\cdots,e_n) \in \{-1,1\}^n$. Therefore we get
\begin{align*}
\mathbb{E}\phi\big(\|\mathbb{P}_n-P\|_{\mathcal F}\big)
&\le \mathbb{E}_\varepsilon \mathbb{E}_{X,Y|\varepsilon}\phi\Big(\sup_{f\in\mathcal F}\Big|\frac1n\sum_{i=1}^n \varepsilon_i(f(X_i)-f(Y_i))\Big|\Big) \\
&\le \mathbb{E}_\varepsilon \mathbb{E}_{X,Y}\phi\Big(\frac12\Big[\sup_{f\in\mathcal F}\frac2n\Big|\sum_{i=1}^n\varepsilon_i f(X_i)\Big| + \sup_{f\in\mathcal F}\frac2n\Big|\sum_{i=1}^n\varepsilon_i f(Y_i)\Big|\Big]\Big) \\
&\le \frac12\mathbb{E}_\varepsilon\Big[\mathbb{E}_{X,Y}\phi\Big(\sup_{f\in\mathcal F}\frac2n\Big|\sum_{i=1}^n\varepsilon_i f(X_i)\Big|\Big) + \mathbb{E}_{X,Y}\phi\Big(\sup_{f\in\mathcal F}\frac2n\Big|\sum_{i=1}^n\varepsilon_i f(Y_i)\Big|\Big)\Big] \quad (\text{convexity of }\phi) \\
&= \frac12\,\mathbb{E}_\varepsilon\, 2\,\mathbb{E}_X\phi\Big(\sup_{f\in\mathcal F}\frac2n\Big|\sum_{i=1}^n\varepsilon_i f(X_i)\Big|\Big)
= \mathbb{E}\phi\Big(\sup_{f\in\mathcal F}\frac2n\Big|\sum_{i=1}^n\varepsilon_i f(X_i)\Big|\Big)
= \mathbb{E}\phi\big(2\|\mathbb{P}_n^0\|_{\mathcal F}\big). \qquad\square
\end{align*}

Example 2.4.4. Consider $\phi(x) = x^m$, $m \ge 1$. Then
\[ \mathbb{E}^*\|\mathbb{P}_n - P\|^m_{\mathcal F} \le 2^m\,\mathbb{E}^*\|\mathbb{P}_n^0\|^m_{\mathcal F} \]
by symmetrization. If $\|\mathbb{P}_n^0\|_{\mathcal F}$ is measurable, then
\[ \mathbb{E}^*\|\mathbb{P}_n^0\|_{\mathcal F} = \mathbb{E}\|\mathbb{P}_n^0\|_{\mathcal F} = \mathbb{E}_X\mathbb{E}_{\varepsilon|X}\sup_{f\in\mathcal F}\Big|\frac1n\sum_{i=1}^n\varepsilon_i f(X_i)\Big| \]
holds. The term
\[ \mathbb{E}_{\varepsilon|X}\sup_{f\in\mathcal F}\Big|\frac1n\sum_{i=1}^n\varepsilon_i f(X_i)\Big| \]
can be viewed as the supremum of a stochastic process $n^{-1}\sum a_i\varepsilon_i$ for constants $a_i$, and hence its bound can be obtained via, for instance, Hoeffding's inequality. Note that such an argument requires measurability! Thus considering a class of functions which makes the target process measurable is a natural procedure.
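The case $m=1$ of this example, $\mathbb{E}^*\|\mathbb{P}_n-P\|_{\mathcal F} \le 2\,\mathbb{E}^*\|\mathbb{P}_n^0\|_{\mathcal F}$, can be checked by simulation. The sketch below makes the illustrative assumptions $P = \mathrm{Uniform}(0,1)$ and $\mathcal F = \{1_{(-\infty,c]} : c\in\mathbb R\}$, so that $\|\mathbb{P}_n - P\|_{\mathcal F}$ is the Kolmogorov-Smirnov statistic and the supremum defining $\|\mathbb{P}_n^0\|_{\mathcal F}$ runs over prefixes of the sample sorted by $X_i$:

```python
import random

# Monte Carlo illustration of Example 2.4.4 with phi(x) = x (m = 1):
#   E ||P_n - P||_F  <=  2 E ||P_n^0||_F.
# Illustrative choices: P = Uniform(0, 1), F = {1_{(-inf, c]} : c in R}.
random.seed(1)
n, reps = 200, 2000

def ks_stat(xs):
    # ||P_n - P||_F = sup_c |F_n(c) - c| for a Uniform(0, 1) sample
    xs = sorted(xs)
    m = len(xs)
    return max(max((i + 1) / m - x, x - i / m) for i, x in enumerate(xs))

def rademacher_sup(m):
    # ||P_n^0||_F = sup_c |m^{-1} sum_i eps_i 1(X_i <= c)|: since the
    # eps_i are independent of the X_i, only the running prefix sums of
    # an i.i.d. Rademacher sequence matter.
    s, best = 0, 0
    for _ in range(m):
        s += random.choice((-1, 1))
        best = max(best, abs(s))
    return best / m

lhs = sum(ks_stat([random.random() for _ in range(n)]) for _ in range(reps)) / reps
rhs = 2 * sum(rademacher_sup(n) for _ in range(reps)) / reps
print(f"E||Pn - P||_F ~ {lhs:.4f},  2 E||P0n||_F ~ {rhs:.4f}")
assert lhs <= rhs
```

In this fully measurable example both sides are honest expectations, which is exactly the situation the example describes.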

Definition 2.4.5. A class $\mathcal F$ of measurable functions $f : \mathcal X \to \mathbb R$ on $(\mathcal X, \mathcal A, P)$ is called a "$P$-measurable class" if
\[ (X_1,\cdots,X_n) \mapsto \Big\|\sum_{i=1}^n e_i f(X_i)\Big\|_{\mathcal F} \]
is measurable on the completion of $(\mathcal X^n, \mathcal A^n, P^n)$ for every $n$ and every $(e_1,\cdots,e_n) \in \{-1,1\}^n$.


Chapter 3

Applications for Empirical Process

3.1 Glivenko-Cantelli Theorems

Now we are ready for our first goal in empirical process theory: a uniform LLN. First we use a bracketing argument; it does not require measurability.

Theorem 3.1.1 (Bracketing Glivenko-Cantelli). If $N_{[\,]}(\varepsilon,\mathcal F, L_1(P)) < \infty$ for all $\varepsilon>0$, then $\mathcal F$ is Glivenko-Cantelli, i.e.,
\[ \|\mathbb{P}_n - P\|_{\mathcal F} \xrightarrow[n\to\infty]{P^*} 0. \]

Proof. First note that an $\varepsilon$-bracket w.r.t. the $L_1(P)$ norm is $[l,u]$ with $l \le f \le u$ and
\[ \|u-l\| = \int|u-l|\,dP = P|u-l| < \varepsilon. \]
For given $\varepsilon>0$, choose finitely many $\varepsilon$-brackets $[l_i,u_i]$, $1\le i\le N$, covering $\mathcal F$. For each $f \in \mathcal F$, $\exists i$ s.t.
\[ (\mathbb{P}_n-P)f = \mathbb{P}_n f - Pf \le \mathbb{P}_n u_i - Pf = (\mathbb{P}_n-P)u_i + P(u_i - f) < (\mathbb{P}_n-P)u_i + \varepsilon. \]
If $f$ is fixed, then $i$ is also fixed, and hence by the SLLN, $(\mathbb{P}_n-P)u_i \to 0$ almost surely. Since there are only finitely many $i$, we have
\[ \max_{1\le i\le N}(\mathbb{P}_n-P)u_i \to 0 \text{ almost surely}, \]
and therefore,
\[ \sup_{f\in\mathcal F}(\mathbb{P}_n-P)f < \underbrace{\max_{1\le i\le N}(\mathbb{P}_n-P)u_i}_{\to\,0\ (n\to\infty)} + \varepsilon. \]
Similarly we get
\[ \inf_{f\in\mathcal F}(\mathbb{P}_n-P)f > -\varepsilon + \underbrace{\min_{1\le i\le N}(\mathbb{P}_n-P)l_i}_{\to\,0\ (n\to\infty)}, \]
and combining both we obtain
\[ \limsup_{n\to\infty}\|\mathbb{P}_n-P\|^*_{\mathcal F} \le \varepsilon \text{ almost surely}. \]
Since $\varepsilon>0$ was arbitrary, we get
\[ \limsup_{n\to\infty}\|\mathbb{P}_n-P\|^*_{\mathcal F} = 0 \text{ a.s.,} \quad\text{i.e.}\quad \|\mathbb{P}_n-P\|_{\mathcal F} \xrightarrow[n\to\infty]{P^*\text{-a.s.}} 0. \qquad\square \]

Example 3.1.2. Let $P$ be a probability measure on $\mathbb R$ and
\[ \mathcal F = \{1_{(-\infty,c]} : c \in \mathbb R\}. \]
For given $\varepsilon>0$, choose points
\[ -\infty = t_0 < t_1 < \cdots < t_m = \infty \quad\text{with}\quad P(t_i, t_{i+1}) < \varepsilon \ \forall i. \]
Then the pairs
\[ \big[1_{(-\infty,t_i]},\, 1_{(-\infty,t_{i+1})}\big] \]
are $\varepsilon$-brackets covering $\mathcal F$, and hence we get the classical "Glivenko-Cantelli theorem,"
\[ \sup_t |F_n(t) - F(t)| \xrightarrow[n\to\infty]{} 0 \text{ almost surely}. \]
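A quick numerical illustration of this uniform convergence, under the illustrative assumption $P=\mathrm{Exp}(1)$ so that $F(t)=1-e^{-t}$:

```python
import math
import random

# Numerical illustration of the classical Glivenko-Cantelli theorem of
# Example 3.1.2: sup_t |F_n(t) - F(t)| -> 0 almost surely.
# Illustrative choice: P = Exp(1), so F(t) = 1 - exp(-t).
random.seed(2)

def ks_distance(n):
    xs = sorted(random.expovariate(1.0) for _ in range(n))
    cdf = [1.0 - math.exp(-x) for x in xs]   # F at the order statistics
    return max(max((i + 1) / n - u, u - i / n) for i, u in enumerate(cdf))

dists = {n: ks_distance(n) for n in (100, 1000, 10000, 100000)}
for n, d in dists.items():
    print(f"n = {n:>6}:  sup_t |F_n(t) - F(t)| = {d:.5f}")
assert dists[100000] < dists[100]
```

The supremum over all $t$ is attained at the order statistics, which is why a single sorted pass suffices; the observed decay is of order $n^{-1/2}$, consistent with the Donsker theorems below.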

The next argument, for the other type of Glivenko-Cantelli theorem, uses the symmetrization technique. As mentioned in example 2.4.4, we need a measurability condition here.

Theorem 3.1.3 (Covering Glivenko-Cantelli). Let $\mathcal F$ be $P$-measurable and $F$ be an envelope of $\mathcal F$ with $P^*F < \infty$. Furthermore assume that
\[ \log N(\varepsilon, \mathcal F_M, L_1(\mathbb{P}_n)) = o^*_P(n) \quad \forall M, \varepsilon > 0, \]
where
\[ \mathcal F_M = \{f\,1_{\{F \le M\}} : f \in \mathcal F\}. \]
Then
\[ \mathbb{E}^*\|\mathbb{P}_n - P\|_{\mathcal F} = o(1) \quad \Big(\text{which in particular implies } \|\mathbb{P}_n-P\|_{\mathcal F} \xrightarrow[n\to\infty]{P^*} 0\Big). \]

Proof. Denote $\|g(f)\|_{\mathcal F} = \sup_{f\in\mathcal F}|g(f)|$. Then by symmetrization,
\begin{align*}
\mathbb{E}^*\|\mathbb{P}_n-P\|_{\mathcal F}
&\le 2\,\mathbb{E}_X\mathbb{E}_\varepsilon\Big\|\frac1n\sum_{i=1}^n \varepsilon_i f(X_i)\Big\|_{\mathcal F} \quad (\text{measurability!}) \\
&= 2\,\mathbb{E}_X\mathbb{E}_\varepsilon\Big\|\frac1n\sum_{i=1}^n \varepsilon_i f(X_i)I(F(X_i)\le M) + \frac1n\sum_{i=1}^n \varepsilon_i f(X_i)I(F(X_i)>M)\Big\|_{\mathcal F} \\
&\le 2\,\mathbb{E}_X\mathbb{E}_\varepsilon\Big\|\frac1n\sum_{i=1}^n \varepsilon_i f(X_i)\Big\|_{\mathcal F_M} + 2\,\mathbb{E}_X\mathbb{E}_\varepsilon\underbrace{\Big\|\frac1n\sum_{i=1}^n \varepsilon_i f(X_i)I(F(X_i)>M)\Big\|_{\mathcal F}}_{(*)}
\end{align*}
holds. Note that
\[ (*) = \sup_{f\in\mathcal F}\Big|\frac1n\sum_{i=1}^n\varepsilon_i f(X_i)I(F(X_i)>M)\Big|
\le \frac1n\sum_{i=1}^n|f(X_i)|I(F(X_i)>M)
\le \frac1n\sum_{i=1}^n F(X_i)I(F(X_i)>M), \]
and hence we get
\[ \mathbb{E}^*\|\mathbb{P}_n-P\|_{\mathcal F} \le 2\,\mathbb{E}_X\mathbb{E}_\varepsilon\Big\|\frac1n\sum_{i=1}^n\varepsilon_i f(X_i)\Big\|_{\mathcal F_M} + \underbrace{2\,\mathbb{E}^*_X F(X_i)I(F(X_i)>M)}_{=\,2P^*FI(F>M)\ \to\ 0 \text{ as } M\to\infty\ (\because\,P^*F<\infty)}. \]
Now, for given $X_1,\cdots,X_n$ and $\varepsilon>0$, let $\mathcal G$ be an $\varepsilon$-covering of $\mathcal F_M$ s.t. $\mathrm{card}(\mathcal G) = N(\varepsilon,\mathcal F_M, L_1(\mathbb{P}_n))$. Note that
\[ \forall f\in\mathcal F_M\ \exists g\in\mathcal G \text{ s.t. } \mathbb{P}_n|g-f| < \varepsilon, \]
and hence
\[ \Big|\frac1n\sum_{i=1}^n\varepsilon_if(X_i)\Big| \le \Big|\frac1n\sum_{i=1}^n\varepsilon_ig(X_i)\Big| + \underbrace{\Big|\frac1n\sum_{i=1}^n\varepsilon_i(g(X_i)-f(X_i))\Big|}_{\le\,\frac1n\sum_i|g(X_i)-f(X_i)|\,=\,\mathbb{P}_n|g-f|\,<\,\varepsilon}. \]
It gives
\[ \mathbb{E}_\varepsilon\Big\|\frac1n\sum_{i=1}^n\varepsilon_if(X_i)\Big\|_{\mathcal F_M} \le \mathbb{E}_\varepsilon\Big\|\frac1n\sum_{i=1}^n\varepsilon_if(X_i)\Big\|_{\mathcal G} + \varepsilon = \mathbb{E}_\varepsilon\max_{f\in\mathcal G}\Big|\frac1n\sum_{i=1}^n\varepsilon_if(X_i)\Big| + \varepsilon, \]
where the conditional Orlicz norms satisfy $\|\cdot\|_{1|X} \le \|\cdot\|_{\psi_1|X} \lesssim \|\cdot\|_{\psi_2|X}$.

For the maximum over $\mathcal G$ we therefore have
\begin{align*}
\mathbb{E}_\varepsilon\max_{f\in\mathcal G}\Big|\frac1n\sum_{i=1}^n\varepsilon_if(X_i)\Big| + \varepsilon
&\lesssim \Big\|\max_{f\in\mathcal G}\Big|\frac1n\sum_{i=1}^n\varepsilon_if(X_i)\Big|\Big\|_{\psi_2|X} + \varepsilon \\
&\lesssim \sqrt{1+\log|\mathcal G|}\ \max_{f\in\mathcal G}\Big\|\frac1n\sum_{i=1}^n\varepsilon_if(X_i)\Big\|_{\psi_2|X} + \varepsilon \\
&\lesssim \sqrt{\log N(\varepsilon,\mathcal F_M,L_1(\mathbb{P}_n))}\ \max_{f\in\mathcal G}\frac1n\Big(\sum_{i=1}^n f(X_i)^2\Big)^{1/2} + \varepsilon \\
&= \sqrt{\log N(\varepsilon,\mathcal F_M,L_1(\mathbb{P}_n))}\ \max_{f\in\mathcal G}\frac{1}{\sqrt n}\underbrace{\Big(\frac1n\sum_{i=1}^n f(X_i)^2\Big)^{1/2}}_{=(\mathbb{P}_nf^2)^{1/2}} + \varepsilon \\
&\le \sqrt{\log N(\varepsilon,\mathcal F_M,L_1(\mathbb{P}_n))}\ \frac{M}{\sqrt n} + \varepsilon = o^*_P(1) + \varepsilon
\end{align*}

by the assumption $\log N(\varepsilon,\mathcal F_M,L_1(\mathbb{P}_n)) = o^*_P(n)$. In the $\lesssim$ part, the following bound on Rademacher averages is used: for constants $a_i$,
\[ \Big\|\frac1n\sum_{i=1}^n a_i\varepsilon_i\Big\|_{\psi_2} \lesssim \frac1n\Big(\sum_{i=1}^n a_i^2\Big)^{1/2}. \]
Indeed, since $\varepsilon_i = \pm1$ with probability $\tfrac12$ each (so $\varepsilon_i^2 = 1$), comparing even Taylor coefficients gives
\[ \mathbb{E}\exp\Big(\lambda\sum_{i=1}^n a_i\varepsilon_i\Big) = \prod_{i=1}^n \cosh(\lambda a_i) \le \prod_{i=1}^n e^{\lambda^2a_i^2/2} = \exp\Big(\frac{\lambda^2}{2}\sum_{i=1}^n a_i^2\Big) \quad \forall\lambda\in\mathbb R, \]
so that by the Chernoff bound
\[ \mathbb{P}\Big(\Big|\frac1n\sum_{i=1}^n a_i\varepsilon_i\Big| > x\Big) \le 2\exp\Big(-\frac{n^2x^2}{2\sum_{i=1}^n a_i^2}\Big), \]
and in consequence $\mathbb{E}\,\psi_2\big(\frac1C\cdot\frac1n\sum_{i=1}^n a_i\varepsilon_i\big) \le 1$ holds whenever $C \gtrsim \frac1n\big(\sum_{i=1}^n a_i^2\big)^{1/2}$, by proposition 2.3.8.

(Or we can use a more general argument via Hoeffding's inequality; see the following remark.) Since $\varepsilon > 0$ was arbitrary, we get
\[ \mathbb{E}_\varepsilon\Big\|\frac1n\sum_{i=1}^n\varepsilon_i f(X_i)\Big\|_{\mathcal F_M} = o^*_P(1). \]
Note that
\[ \Big\|\frac1n\sum_{i=1}^n\varepsilon_i f(X_i)\Big\|_{\mathcal F_M} \le M, \]
and therefore by the BCT we get
\[ \mathbb{E}_X\mathbb{E}_\varepsilon\Big\|\frac1n\sum_{i=1}^n\varepsilon_i f(X_i)\Big\|_{\mathcal F_M} = o(1) \quad\text{as } n\to\infty. \qquad\square \]

Remark 3.1.4. In the $\lesssim$ part we used an argument that applies only to Rademacher $\varepsilon_i$'s. However, one can also give a more general argument using Hoeffding's inequality. Since the $a_i\varepsilon_i$ are sub-Gaussian, Hoeffding's inequality provides $K$ and $C$ s.t.
\[ \mathbb{P}\Big(\Big|\frac1n\sum_{i=1}^n a_i\varepsilon_i\Big| > x\Big) \le Ke^{-Cx^2}. \]
Now proposition 2.3.8 gives
\[ \Big\|\frac1n\sum_{i=1}^n a_i\varepsilon_i\Big\|_{\psi_2} \le \Big(\frac{K+1}{C}\Big)^{1/2}, \]
where precisely $K = 2$ and $C^{-1} = 2n^{-2}\sum_{i=1}^n a_i^2$.

Remark 3.1.5. To make the step $(\mathbb{P}_nf^2)^{1/2}\le M$ for $f\in\mathcal G$ in the proof of the previous theorem rigorous, one should construct $\mathcal G$ so that $|f| \le M$ for every $f\in\mathcal G$. This can be assumed without loss of generality: if not, one can truncate each function as $(f\wedge M)\vee(-M)$, so that the truncated class still covers $\mathcal F_M$ and satisfies $|f|\le M$. One only has to check that it is still an $\varepsilon$-covering of $\mathcal F_M$. Let $f\in\mathcal F_M$ and $g\in\mathcal G$ s.t. $\mathbb{P}_n|g-f| < \varepsilon$. Then for $\bar g = (g\wedge M)\vee(-M)$,
\begin{align*}
\mathbb{P}_n|\bar g - f| &= \frac1n\Big[\sum_{i:\,-M\le g(X_i)\le M}|g(X_i)-f(X_i)| + \sum_{i:\,g(X_i)>M}(M - f(X_i)) + \sum_{i:\,g(X_i)<-M}(f(X_i)+M)\Big] \\
&\le \frac1n\Big[\sum_{i:\,-M\le g(X_i)\le M}|g(X_i)-f(X_i)| + \sum_{i:\,g(X_i)>M}(g(X_i)-f(X_i)) + \sum_{i:\,g(X_i)<-M}(f(X_i)-g(X_i))\Big] \\
&= \frac1n\sum_{i=1}^n|g(X_i)-f(X_i)| = \mathbb{P}_n|g-f| < \varepsilon
\end{align*}
holds (here $|f|\le M$, valid for $f\in\mathcal F_M$, is used for the two truncated sums).

3.2 Donsker Theorems

Here we consider two versions of Donsker's theorem. From now on, $\|\cdot\|_{Q,2}$ denotes
\[ \|f\|_{Q,2} = \Big(\int f^2\,dQ\Big)^{1/2} \]
for a probability measure $Q$.

Theorem 3.2.1 (Covering Donsker). Let $\mathcal F_\delta := \{f-g : f,g\in\mathcal F,\ \|f-g\|_{P,2} < \delta\}$ be $P$-measurable for every $\delta \in (0,\infty]$ and let $F$ be an envelope of $\mathcal F$ with $P^*F^2 < \infty$. If
\[ \int_0^\infty \sup_Q \sqrt{\log N(\varepsilon\|F\|_{Q,2}, \mathcal F, L_2(Q))}\,d\varepsilon < \infty, \tag{3.1} \]
where the supremum is taken over all finitely discrete probability measures, then $\mathcal F$ is $P$-Donsker.

Proof. It suffices to prove that $\mathbb{G}_n$ is asymptotically tight, where $\mathbb{G}_n = \{\mathbb{G}_nf : f\in\mathcal F\}$ is regarded as a stochastic process with index set $\mathcal F$. By theorem 1.4.8 (note that each $\mathbb{G}_nf$ converges weakly by the classical CLT, which implies asymptotic tightness of each marginal), it is enough to show that:

(i) $\mathcal F$ is totally bounded in the $L_2(P)$ norm;

(ii) $\mathbb{G}_n$ is asymptotically uniformly $L_2(P)$-equicontinuous in probability.

For this, we need the following lemma.

Lemma 3.2.2. Let $a_n : [0,1] \to [0,\infty)$ be a sequence of non-decreasing functions. Then
\[ \lim_{\delta\to0}\limsup_{n\to\infty} a_n(\delta) = 0 \iff a_n(\delta_n) = o(1)\ \ \forall\, \delta_n \downarrow 0. \]

Proof of lemma. ($\Longrightarrow$) Let $\delta_n$ be a nonincreasing sequence converging to $0$. For every $\varepsilon>0$, $\exists\delta_0>0$ s.t.
\[ \limsup_{n\to\infty} a_n(\delta_0) < \frac\varepsilon2, \]
and hence $\exists N$ s.t. $n\ge N \Rightarrow a_n(\delta_0) < \varepsilon$. Since $\delta_n\downarrow0$, $\exists N'$ s.t. $n\ge N' \Rightarrow \delta_n < \delta_0$. Thus, as each $a_n$ is non-decreasing,
\[ n \ge N\vee N' \implies \delta_n < \delta_0 \implies a_n(\delta_n) \le a_n(\delta_0) < \varepsilon. \]

($\Longleftarrow$) It is sufficient to show that
\[ \exists\, \delta_n\downarrow0 \text{ s.t. } \limsup_{n\to\infty} a_n(\delta_n) = \lim_{\delta\to0}\limsup_{n\to\infty}a_n(\delta). \]
Let $C = \lim_{\delta\to0}\limsup_{n\to\infty}a_n(\delta)$. Then for any $\delta>0$ we get
\[ \limsup_{n\to\infty}a_n(\delta) \ge C, \]
because $\limsup_n a_n(\delta)$ decreases to $C$ as $\delta\downarrow0$. It gives that for any $\delta>0$ and any $\varepsilon>0$,
\[ a_n(\delta) > C - \varepsilon \quad\text{i.o.} \]
Thus, for every fixed $m$,
\[ a_n\Big(\frac1m\Big) > C - \frac1m \quad\text{i.o.,} \]
i.e., $\exists\, N_1 < N_2 < N_3 < \cdots$ s.t.
\[ a_{N_1}(1) > C-1, \qquad a_{N_2}\Big(\frac12\Big) > C - \frac12, \qquad a_{N_3}\Big(\frac13\Big) > C-\frac13, \]
and so on. Take $\delta_n$ as
\[ \underbrace{1,\cdots,1}_{N_1},\ \underbrace{\tfrac12,\cdots,\tfrac12}_{N_2-N_1},\ \underbrace{\tfrac13,\cdots,\tfrac13}_{N_3-N_2},\ \cdots. \]
Then by definition,
\[ a_{N_k}(\delta_{N_k}) > C - \frac1k \]
holds, which gives
\[ \limsup_{n\to\infty} a_n(\delta_n) \ge C. \]
However, since $a_n(\delta_n) \le a_n(\delta)$ for any fixed $\delta>0$ and $n$ large enough, we have
\[ \limsup_{n\to\infty} a_n(\delta_n) \le \limsup_{n\to\infty} a_n(\delta) \quad \forall\delta>0, \]
which gives $\limsup_n a_n(\delta_n) \le C$. Therefore we get
\[ \limsup_{n\to\infty} a_n(\delta_n) = C = \lim_{\delta\downarrow0}\limsup_{n\to\infty}a_n(\delta). \qquad (\text{Lemma}) \]

Now we show (ii) first. (ii) is equivalent to:
\[ \forall x,\eta>0\ \exists\delta>0 \text{ s.t. } \limsup_{n\to\infty}\mathbb{P}^*\Big(\sup_{\|f-g\|_{P,2}<\delta}|\mathbb{G}_nf-\mathbb{G}_ng| > x\Big) < \eta. \]
Note that by definition
\[ \sup_{\|f-g\|_{P,2}<\delta}|\mathbb{G}_nf-\mathbb{G}_ng| = \|\mathbb{G}_n\|_{\mathcal F_\delta}; \]
thus (ii) is again equivalent to
\[ \forall x,\eta>0\ \exists\delta>0 \text{ s.t. } \limsup_{n\to\infty}\mathbb{P}^*\big(\|\mathbb{G}_n\|_{\mathcal F_\delta} > x\big) < \eta. \]
Note that $\|\mathbb{G}_n\|_{\mathcal F_\delta}$ decreases as $\delta\downarrow0$, which makes $\mathbb{P}^*(\|\mathbb{G}_n\|_{\mathcal F_\delta} > x)$ non-decreasing in $\delta$. Thus (ii) is equivalent to
\[ \lim_{\delta\to0}\limsup_{n\to\infty}\mathbb{P}^*\big(\|\mathbb{G}_n\|_{\mathcal F_\delta} > x\big) = 0 \quad\forall x>0, \]
which, by the lemma, is the same as
\[ \lim_{n\to\infty}\mathbb{P}^*\big(\|\mathbb{G}_n\|_{\mathcal F_{\delta_n}} > x\big) = 0 \quad \forall x>0,\ \forall\,\delta_n\downarrow0. \tag{3.2} \]
Now we will show (3.2) instead of (ii). For given $x>0$ and $\delta_n\downarrow0$,
\[ \mathbb{P}^*\big(\|\mathbb{G}_n\|_{\mathcal F_{\delta_n}} > x\big) \le \frac1x\,\mathbb{E}^*\|\mathbb{G}_n\|_{\mathcal F_{\delta_n}} \le \frac2x\,\mathbb{E}\Big\|\frac{1}{\sqrt n}\sum_{i=1}^n\varepsilon_if(X_i)\Big\|_{\mathcal F_{\delta_n}} \quad(\text{symmetrization}) \]

holds. Note that $\mathbb{E}^*$ becomes $\mathbb{E}$ in the symmetrized term by the measurability of $\mathcal F_{\delta_n}$. Now, note that
\[ \mathbb{P}_{\varepsilon|X}\Big(\Big|\frac{1}{\sqrt n}\sum_{i=1}^n\varepsilon_i(f(X_i)-g(X_i))\Big| > x\Big) \le 2\exp\Big(-\frac12\frac{x^2}{\|f-g\|_n^2}\Big), \]
where $\|f\|_n = \sqrt{\frac1n\sum_{i=1}^n f(X_i)^2}$, by Hoeffding's inequality (cf. example 2.3.21). This implies that, conditionally on $X$, the stochastic process $f\mapsto \frac{1}{\sqrt n}\sum_{i=1}^n\varepsilon_if(X_i)$ is sub-Gaussian w.r.t. $\|\cdot\|_n$. Then by the maximal inequality (corollary 2.3.23),
\begin{align*}
\mathbb{E}_{\varepsilon|X}\Big\|\frac{1}{\sqrt n}\sum_{i=1}^n\varepsilon_if(X_i)\Big\|_{\mathcal F_{\delta_n}}
&\le \mathbb{E}_{\varepsilon|X}\sup_{\substack{f\in\mathcal F_{\delta_n}\\ \|f-g\|_n<\delta}}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n\varepsilon_i(f(X_i)-g(X_i))\Big| + \mathbb{E}_{\varepsilon|X}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n\varepsilon_ig(X_i)\Big| \\
&\lesssim \int_0^\delta\sqrt{\log D(\varepsilon,\mathcal F_{\delta_n},\|\cdot\|_n)}\,d\varepsilon + \mathbb{E}_{\varepsilon|X}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n\varepsilon_ig(X_i)\Big|
\end{align*}
holds for any $\delta>0$ and $g\in\mathcal F_{\delta_n}$. Using $0\in\mathcal F_{\delta_n}$ (take $g = 0$) and letting $\delta$ become very big (MCT), we can obtain
\[ \mathbb{E}_{\varepsilon|X}\Big\|\frac{1}{\sqrt n}\sum_{i=1}^n\varepsilon_if(X_i)\Big\|_{\mathcal F_{\delta_n}} \lesssim \int_0^\infty\sqrt{\log D(\varepsilon,\mathcal F_{\delta_n},\|\cdot\|_n)}\,d\varepsilon. \]

Now, using $D(\varepsilon) \le N(\varepsilon/2)$, we can obtain
\begin{align*}
\mathbb{E}_{\varepsilon|X}\Big\|\frac{1}{\sqrt n}\sum_{i=1}^n\varepsilon_if(X_i)\Big\|_{\mathcal F_{\delta_n}}
&\lesssim \int_0^\infty\sqrt{\log N(\varepsilon,\mathcal F_{\delta_n},\|\cdot\|_n)}\,d\varepsilon \\
&= \int_0^{\theta_n}\sqrt{\log N(\varepsilon,\mathcal F_{\delta_n},\|\cdot\|_n)}\,d\varepsilon
&& \Big(\theta_n = \sup_{f\in\mathcal F_{\delta_n}}\|f\|_n;\ \because\ N(\varepsilon,\mathcal F_{\delta_n},\|\cdot\|_n) = 1 \text{ for larger } \varepsilon\Big) \\
&\le \int_0^{\theta_n/\|F\|_n}\sqrt{\log N(\varepsilon\|F\|_n,\mathcal F_\infty,\|\cdot\|_n)}\,d\varepsilon\cdot\|F\|_n
&& (\because\ \mathcal F_{\delta_n}\subseteq\mathcal F_\infty) \\
&\le \int_0^{\theta_n/\|F\|_n}\sup_Q\sqrt{\log N(\varepsilon\|F\|_{Q,2},\mathcal F_\infty,L_2(Q))}\,d\varepsilon\cdot\|F\|_n \\
&\lesssim \int_0^{\theta_n/\|F\|_n}\sup_Q\sqrt{\log N\Big(\frac\varepsilon2\|F\|_{Q,2},\mathcal F,L_2(Q)\Big)}\,d\varepsilon\cdot\|F\|_n
\end{align*}
(for the last step: if $f,f_0$ and $g,g_0$ are within $\varepsilon$ in $\|\cdot\|_{Q,2}$, then $\|(f-g)-(f_0-g_0)\|_{Q,2} < 2\varepsilon$, so covering $\mathcal F$ twice covers the differences, which implies $N(2\varepsilon,\mathcal F_\infty,L_2(Q)) \le N^2(\varepsilon,\mathcal F,L_2(Q))$)
\begin{align*}
&= \int_0^{\theta_n/2\|F\|_n}\sup_Q\sqrt{\log N(\varepsilon\|F\|_{Q,2},\mathcal F,L_2(Q))}\,d\varepsilon\cdot 2\|F\|_n \\
&\lesssim \int_0^{\theta_n/\|F\|_n}\sup_Q\sqrt{\log N(\varepsilon\|F\|_{Q,2},\mathcal F,L_2(Q))}\,d\varepsilon\cdot\|F\|_n.
\end{align*}

Note that $\|F\|_n = \sqrt{\frac1n\sum_{i=1}^nF(X_i)^2}$ converges to a positive constant by the SLLN and the assumption $\mathbb{E}^*_X\|F\|_n^2 = P^*F^2 < \infty$. Hence we get:
\begin{align*}
\mathbb{E}_X\Big[\int_0^{\theta_n/\|F\|_n}\sup_Q\sqrt{\log N(\varepsilon\|F\|_{Q,2},\mathcal F,L_2(Q))}\,d\varepsilon\cdot\|F\|_n\Big]
&= \mathbb{E}_X\int_0^\infty \sup_Q\sqrt{\log N(\varepsilon\|F\|_{Q,2},\mathcal F,L_2(Q))}\,\|F\|_n\,I\Big(\frac{\theta_n}{\|F\|_n}>\varepsilon\Big)\,d\varepsilon \\
&= \int_0^\infty \sup_Q\sqrt{\log N(\varepsilon\|F\|_{Q,2},\mathcal F,L_2(Q))}\,\mathbb{E}\Big[\|F\|_n\,I\Big(\frac{\theta_n}{\|F\|_n}>\varepsilon\Big)\Big]\,d\varepsilon.
\end{align*}
By the uniform entropy condition and the DCT, the last term converges to $0$ as $n\to\infty$ provided
\[ \mathbb{E}\Big[\|F\|_n\,I\Big(\frac{\theta_n}{\|F\|_n}>\varepsilon\Big)\Big] \xrightarrow[n\to\infty]{} 0 \quad \forall\varepsilon>0. \]
If $\theta_n/\|F\|_n$ converges to $0$ in probability, then Cauchy-Schwarz gives
\[ \mathbb{E}\|F\|_nI(\theta_n>\varepsilon\|F\|_n) \le \underbrace{\big(\mathbb{E}^*\|F\|_n^2\big)^{1/2}}_{<\infty}\ \underbrace{\big(\mathbb{P}^*(\theta_n>\varepsilon\|F\|_n)\big)^{1/2}}_{\to\,0\ (n\to\infty)} \xrightarrow[n\to\infty]{} 0, \]
which gives the desired result. Thus our claim is that $\theta_n/\|F\|_n \xrightarrow[n\to\infty]{P^*} 0$. However, note that $\|F\|_n$ converges to a positive constant; therefore our final claim is:

Claim.) $\theta_n = o_{P^*}(1)$.

By definition,
\[ \theta_n^2 = \sup_{f\in\mathcal F_{\delta_n}}\|f\|_n^2 = \sup_{f\in\mathcal F_{\delta_n}}\mathbb{P}_nf^2 \le \sup_{f\in\mathcal F_{\delta_n}}(\mathbb{P}_n-P)f^2 + \sup_{f\in\mathcal F_{\delta_n}}Pf^2 \le \sup_{f\in\mathcal F_\infty}(\mathbb{P}_n-P)f^2 + \sup_{f\in\mathcal F_{\delta_n}}Pf^2 \]
and
\[ \sup_{f\in\mathcal F_{\delta_n}}Pf^2 \le \delta_n^2 \xrightarrow[n\to\infty]{} 0 \quad (\because\ \text{def.\ of } \mathcal F_\delta) \]
hold. Furthermore, since $4F^2$ is an integrable envelope of $\mathcal G_\infty = \{f^2 : f\in\mathcal F_\infty\}$, we get, for $f,g\in\mathcal F_\infty$,
\[ \mathbb{P}_n|f^2-g^2| = \mathbb{P}_n\big(|f-g|\cdot|f+g|\big) \underset{|f|\le2F}{\le} \mathbb{P}_n\big(|f-g|\cdot4F\big) \le \|f-g\|_n\cdot\|4F\|_n \quad(\because\ \text{Cauchy-Schwarz}) \]
and hence
\[ N\big(\varepsilon\|2F\|_n^2,\mathcal G_\infty,L_1(\mathbb{P}_n)\big) \le N\big(\varepsilon\|F\|_n,\mathcal F_\infty,\|\cdot\|_n\big) \le \sup_Q N\big(\varepsilon\|F\|_{Q,2},\mathcal F_\infty,\|\cdot\|_{Q,2}\big) \]
$\big(\because\ \|f-g\|_n\le\varepsilon\|F\|_n \Rightarrow \mathbb{P}_n|f^2-g^2| \le \|f-g\|_n\|4F\|_n \le \varepsilon\|F\|_n\|4F\|_n = \varepsilon\|2F\|_n^2\big)$. Hence $N(\varepsilon\|2F\|_n^2,\mathcal G_\infty,L_1(\mathbb{P}_n))$ is bounded by a fixed number depending only on $\varepsilon$, i.e.,
\[ N\big(\varepsilon\|2F\|_n^2,\mathcal G_\infty,L_1(\mathbb{P}_n)\big) = O_{P^*}(1) \quad\forall\varepsilon>0. \]
It implies that
\[ \log N\big(\varepsilon,\mathcal G_\infty,L_1(\mathbb{P}_n)\big) = o_{P^*}(n) \quad\forall\varepsilon>0 \]
(cf. the following remark), which implies that $\mathcal G_\infty$ is Glivenko-Cantelli (thm 3.1.3), i.e.,
\[ \sup_{g\in\mathcal G_\infty}(\mathbb{P}_n-P)g = \sup_{f\in\mathcal F_\infty}(\mathbb{P}_n-P)f^2 \xrightarrow[n\to\infty]{P^*} 0. \qquad(\text{Claim}) \]

Remark 3.2.3. Assume that $N(\varepsilon\|2F\|_n^2,\mathcal G_\infty,L_1(\mathbb{P}_n)) = O_{P^*}(1)$ for any $\varepsilon>0$. For each $\omega$ (outside a null set), $\exists M>0$ and $\exists N$ s.t.
\[ n>N \implies \|2F\|_n^2(\omega) \le M, \]
and hence
\[ N\big(\varepsilon M,\mathcal G_\infty,L_1(\mathbb{P}_n)\big) \le N\big(\varepsilon\|2F\|_n^2,\mathcal G_\infty,L_1(\mathbb{P}_n)\big) \]
for such $M$ and $n$. Since $\varepsilon M$ ranges over all positive numbers, the boundedness $N(\varepsilon\|2F\|_n^2,\mathcal G_\infty,L_1(\mathbb{P}_n)) = O_{P^*}(1)$ $\forall\varepsilon>0$ implies that $\log N(\varepsilon,\mathcal G_\infty,L_1(\mathbb{P}_n)) = o_{P^*}(n)$ $\forall\varepsilon>0$.

Proof (Cont'd). Now we show (i). Since $\mathcal G_\infty$ is Glivenko-Cantelli, there exist finitely discrete measures $\mathbb{P}_n$ (realizations of the empirical measure) with
\[ \|(\mathbb{P}_n-P)f^2\|_{\mathcal F_\infty} \xrightarrow[n\to\infty]{} 0. \]
Meanwhile, by the uniform entropy condition, we get
\[ \int_0^\infty\sqrt{\log N(\varepsilon\|F\|_{\mathbb{P}_n,2},\mathcal F,L_2(\mathbb{P}_n))}\,d\varepsilon = \frac{1}{\|F\|_{\mathbb{P}_n,2}}\int_0^\infty\sqrt{\log N(\varepsilon,\mathcal F,L_2(\mathbb{P}_n))}\,d\varepsilon < \infty, \]
i.e.,
\[ N(\varepsilon,\mathcal F,L_2(\mathbb{P}_n)) < \infty \quad\forall\varepsilon>0. \]
For $f,g\in\mathcal F$, $\mathbb{P}_n(f-g)^2 < \varepsilon^2$ implies
\[ P(f-g)^2 = (P-\mathbb{P}_n)(f-g)^2 + \mathbb{P}_n(f-g)^2 \le \underbrace{(P-\mathbb{P}_n)(2f^2+2g^2)}_{\le\,4\|(\mathbb{P}_n-P)f^2\|_{\mathcal F_\infty}} + \mathbb{P}_n(f-g)^2 \le \varepsilon^2 + \varepsilon^2 = 2\varepsilon^2 \]
for $n$ large enough that $\|(\mathbb{P}_n-P)f^2\|_{\mathcal F_\infty} \le \varepsilon^2/4$. It implies that
\[ \|f-g\|_{\mathbb{P}_n,2} \le \varepsilon \implies \|f-g\|_{P,2} \le \sqrt2\,\varepsilon \]
for large $n$, i.e.,
\[ N(\varepsilon,\mathcal F,L_2(P)) \le N\Big(\frac{\varepsilon}{\sqrt2},\mathcal F,L_2(\mathbb{P}_n)\Big) < \infty \]
for large $n$. Therefore we obtain
\[ N(\varepsilon,\mathcal F,L_2(P)) < \infty \quad\forall\varepsilon>0, \]
i.e., $\mathcal F$ is totally bounded w.r.t. the $L_2(P)$-norm. $\square$

Next we consider the bracketing Donsker theorem. It uses Bernstein's inequality in the proof. From now on, let $\mathcal F$ be a set of measurable functions with envelope $F$ satisfying $P^*F^2 < \infty$.

Lemma 3.2.4. If $|\mathcal F| < \infty$ and $\|f\|_\infty < \infty$ for any $f\in\mathcal F$, then
\[ \mathbb{E}\|\mathbb{G}_n\|_{\mathcal F} \lesssim \max_{f\in\mathcal F}\frac{\|f\|_\infty}{\sqrt n}\log|\mathcal F| + \max_{f\in\mathcal F}\|f\|_{P,2}\sqrt{\log|\mathcal F|}. \tag{3.3} \]

Proof. Note that
\[ |\mathbb{G}_nf| = \Big|\frac{1}{\sqrt n}\sum_{i=1}^n\big(f(X_i)-Pf\big)\Big|. \]
Each $(f(X_i)-Pf)/\sqrt n$ has mean zero and satisfies the Bernstein condition: for $k\ge2$,
\begin{align*}
\mathbb{E}\Big|\frac{f(X_i)-Pf}{\sqrt n}\Big|^k
&= \mathbb{E}\Big[\Big(\frac{f(X_i)-Pf}{\sqrt n}\Big)^2\Big|\frac{f(X_i)-Pf}{\sqrt n}\Big|^{k-2}\Big]
\le \mathbb{E}\Big[\Big(\frac{2\|f\|_\infty}{\sqrt n}\Big)^{k-2}\frac{2\big(f^2(X_i)+(Pf)^2\big)}{n}\Big] \\
&\le \frac2n\,Pf^2\cdot2^{k-1}\cdot\Big(\frac{\|f\|_\infty}{\sqrt n}\Big)^{k-2}
\underset{2^{k-1}\le k!}{\le} \frac{4Pf^2}{2n}\,k!\,\Big(\frac{\|f\|_\infty}{\sqrt n}\Big)^{k-2}
\end{align*}
holds. Thus by Bernstein's inequality,
\[ \mathbb{P}(|\mathbb{G}_nf|>x) \le 2\exp\Bigg(-\frac12\,\frac{x^2}{4Pf^2 + \frac{\|f\|_\infty}{\sqrt n}x}\Bigg) \le 2\exp\Bigg(-\frac12\,\frac{x^2}{4\max_{f\in\mathcal F}Pf^2 + \max_{f\in\mathcal F}\frac{\|f\|_\infty}{\sqrt n}x}\Bigg) \]
holds for any $x>0$. Now the maximal inequality (lemma 2.4.1), applied with $a = \max_f\|f\|_\infty/\sqrt n$ and $b = 4\max_fPf^2$, gives the conclusion:
\begin{align*}
\mathbb{E}\|\mathbb{G}_n\|_{\mathcal F} \le \Big\|\max_{f\in\mathcal F}|\mathbb{G}_nf|\Big\|_{\psi_1}
&\lesssim \max_{f\in\mathcal F}\frac{\|f\|_\infty}{\sqrt n}\log(1+|\mathcal F|) + \sqrt{4\max_{f\in\mathcal F}Pf^2}\,\sqrt{\log(1+|\mathcal F|)} \\
&\lesssim \max_{f\in\mathcal F}\frac{\|f\|_\infty}{\sqrt n}\log|\mathcal F| + \sqrt{\max_{f\in\mathcal F}Pf^2}\,\sqrt{\log|\mathcal F|}. \qquad\square
\end{align*}
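The $\sqrt{\log|\mathcal F|}$ growth predicted by (3.3) can be observed numerically. The sketch below makes illustrative choices (indicators $1_{(-\infty,c]}$ over a grid of size $m$, $P = \mathrm{Uniform}(0,1)$, so $\|f\|_\infty \le 1$ and the first term of (3.3) is lower order) and estimates $\mathbb{E}\|\mathbb{G}_n\|_{\mathcal F}$ as $|\mathcal F|$ grows:

```python
import math
import random

# Monte Carlo illustration of Lemma 3.2.4: for finite classes of
# indicators F = {1_{(-inf,c]} : c in a grid of size m} under
# P = Uniform(0,1), E||G_n||_F grows at most like sqrt(log |F|),
# matching the second term of (3.3) since ||f||_oo <= 1 here.
random.seed(3)
n, reps = 400, 500

def mean_sup_gn(m):
    grid = [(j + 1) / (m + 1) for j in range(m)]
    total = 0.0
    for _ in range(reps):
        xs = sorted(random.random() for _ in range(n))
        i, best = 0, 0.0
        for c in grid:                 # one sweep: grid and sample both sorted
            while i < n and xs[i] <= c:
                i += 1
            # |G_n f| at f = 1_{(-inf,c]} is sqrt(n) |F_n(c) - c|
            best = max(best, math.sqrt(n) * abs(i / n - c))
        total += best
    return total / reps

results = {m: mean_sup_gn(m) for m in (2, 16, 128)}
for m, v in results.items():
    print(f"|F| = {m:>3}:  E||G_n||_F ~ {v:.3f}  (sqrt(log|F|) = {math.sqrt(math.log(m)):.3f})")
```

The estimates increase with $|\mathcal F|$ but remain bounded by a constant multiple of $\sqrt{\log|\mathcal F|}$, as the lemma asserts; for large grids they approach the mean of the Kolmogorov statistic, consistent with the Donsker limit.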

Theorem 3.2.5 (Bracketing Donsker). If
\[ \int_0^\infty\sqrt{\log N_{[\,]}(\varepsilon,\mathcal F,L_2(P))}\,d\varepsilon < \infty, \]
then $\mathcal F$ is $P$-Donsker.

Remark 3.2.6. We use the chaining technique and the previous lemma in the proof. However, as the condition $\|f\|_\infty<\infty$ is required to apply the lemma, we should truncate the terms at an order satisfying
\[ \sqrt{\log|\mathcal F|}\,\frac{\|f\|_\infty}{\sqrt n} \sim \|f\|_{P,2}, \]
so that the two terms on the RHS of (3.3) have equal order.

Proof. There exists an envelope $F$ of $\mathcal F$ with $P^*F^2<\infty$. (Recall remark 2.2.22: the bracketing number is larger than the covering number with the same 'diameter.' Finiteness of the integral gives that $N_{[\,]}(\varepsilon,\mathcal F,L_2(P)) = 1$ for large $\varepsilon$; let $[l,u]$ be the single such bracket covering $\mathcal F$, with $\|u-l\|_{P,2}<M$. Also we get
\[ \int_0^\infty\sqrt{\log N(\varepsilon,\mathcal F,L_2(P))}\,d\varepsilon < \infty, \]
i.e., $N(\varepsilon,\mathcal F,L_2(P))$ is finite for any $\varepsilon>0$. It implies that $\mathcal F$ is totally bounded in $L_2(P)$, so $\mathcal F$ is bounded. Thus $P(|u|+|l|)^2 \le 2P(|u|^2+|l|^2)$ together with
\[ \|u\|_{P,2} \le \|u-f\|_{P,2}+\|f\|_{P,2} < \infty, \qquad \|l\|_{P,2} \le \|f-l\|_{P,2}+\|f\|_{P,2} < \infty \]
for $f\in\mathcal F$ implies that $P(|u|+|l|)^2 < \infty$. Letting $F = \sup(|u|,|l|) \le |u|+|l|$, we get an envelope $F$ of $\mathcal F$ with $P^*F^2<\infty$.) For $q\ge1$, construct a sequence of nested partitions
\[ \mathcal F = \bigcup_{i=1}^{N_q}\mathcal F_{q,i} \]
s.t. each $\mathcal F_{q,i}$ is contained in a $2^{-q}$-bracket in $\|\cdot\|_{P,2}$ and
\[ \sum_{q=1}^\infty 2^{-q}\sqrt{\log N_q} < \infty. \tag{3.4} \]

Figure 3.1: Nested partition $\bigcup_i\mathcal F_{q,i}$.

Of course we have to show that we can find such a partition satisfying (3.4). Note that $N_q$ is at most the total number of pieces into which the $\mathcal F_{q-1,i}$ are split, i.e.,
\[ N_q \le N_{q-1}\cdot N_{[\,]}\big(2^{-q},\mathcal F,L_2(P)\big). \]

Figure 3.2: Relationship between $N_{q-1}$ and $N_q$.

It implies that
\begin{align*}
\sqrt{\log N_q} &\le \sqrt{\log N_{q-1} + \log N_{[\,]}(2^{-q},\mathcal F,L_2(P))} \\
&\le \sqrt{\log N_{q-1}} + \sqrt{\log N_{[\,]}(2^{-q},\mathcal F,L_2(P))} \qquad \big(\sqrt{a+b}\le\sqrt a+\sqrt b\big) \\
&\le \sqrt{\log N_{q-2}} + \sqrt{\log N_{[\,]}(2^{-(q-1)},\mathcal F,L_2(P))} + \sqrt{\log N_{[\,]}(2^{-q},\mathcal F,L_2(P))} \\
&\le \cdots \le \sqrt{\log N_1} + \sum_{p=2}^q\sqrt{\log N_{[\,]}(2^{-p},\mathcal F,L_2(P))},
\end{align*}
and therefore
\begin{align*}
\sum_{q=1}^\infty 2^{-q}\sqrt{\log N_q}
&\le \sum_{q=1}^\infty 2^{-q}\Big[\sqrt{\log N_1}+\sum_{p=2}^q\sqrt{\log N_{[\,]}(2^{-p},\mathcal F,L_2(P))}\Big] \\
&\le \sqrt{\log N_1} + \sum_{q=1}^\infty\sum_{p=1}^q 2^{-q}\sqrt{\log N_{[\,]}(2^{-p},\mathcal F,L_2(P))} \\
&= \sqrt{\log N_1} + \sum_{p=1}^\infty\sum_{q=p}^\infty 2^{-q}\sqrt{\log N_{[\,]}(2^{-p},\mathcal F,L_2(P))} \\
&= \sqrt{\log N_1} + \sum_{p=1}^\infty 2^{-(p-1)}\sqrt{\log N_{[\,]}(2^{-p},\mathcal F,L_2(P))} \\
&\lesssim \sqrt{\log N_1} + \int_0^\infty\sqrt{\log N_{[\,]}(\varepsilon,\mathcal F,L_2(P))}\,d\varepsilon < \infty
\end{align*}

holds, which yields (3.4). Now, fix $f_{q,i}\in\mathcal F_{q,i}$ (fix "representatives" of each partition), and for $f\in\mathcal F_{q,i}$ define
\begin{align*}
\pi_qf &:= f_{q,i} && (\text{"projection" onto the space of representatives}) \\
\Delta_qf &:= \Big(\sup_{g,h\in\mathcal F_{q,i}}|g-h|\Big)^{\!*} && (\text{"variation" on each partition}).
\end{align*}
Since each $\mathcal F_{q,i}$ lies in a $2^{-q}$-bracket, $\|g-h\|_{P,2}\le2^{-q}$ for $g,h\in\mathcal F_{q,i}$, and hence
\[ \sqrt{P(\Delta_qf)^2} \le 2^{-q}. \]

Figure 3.3: $\mathcal F_{q,i}$ and representative $f_{q,i}$.

Now note that, by theorem 1.4.8, it suffices to show that $\forall\varepsilon,\eta>0$ there exists a finite partition $\bigcup_{i=1}^{N_{q_0}}\mathcal F_{q_0,i}$ satisfying
\[ \limsup_{n\to\infty}\mathbb{P}^*\Big(\max_{1\le i\le N_{q_0}}\sup_{f,g\in\mathcal F_{q_0,i}}|\mathbb{G}_nf-\mathbb{G}_ng| > \varepsilon\Big) < \eta. \]
For $f,g\in\mathcal F_{q_0,i}$ in the same partition, $\pi_{q_0}f = \pi_{q_0}g$, and hence
\[ \mathbb{G}_nf-\mathbb{G}_ng = \mathbb{G}_n(f-\pi_{q_0}f+\pi_{q_0}g-g) = \mathbb{G}_n(f-\pi_{q_0}f) + \mathbb{G}_n(\pi_{q_0}g-g). \]
Therefore, if we can show that
\[ \forall\varepsilon>0\ \exists q_0 \text{ s.t. } \mathbb{E}^*\|\mathbb{G}_n(f-\pi_{q_0}f)\|_{\mathcal F} < \varepsilon \text{ for } n \text{ large enough,} \]
then
\begin{align*}
\mathbb{P}^*\Big(\max_{1\le i\le N_{q_0}}\sup_{f,g\in\mathcal F_{q_0,i}}|\mathbb{G}_nf-\mathbb{G}_ng| > \varepsilon\Big)
&\le \mathbb{P}^*\Big(\max_{1\le i\le N_{q_0}}\sup_{f,g\in\mathcal F_{q_0,i}}\big(|\mathbb{G}_n(f-\pi_{q_0}f)|+|\mathbb{G}_n(g-\pi_{q_0}g)|\big) > \varepsilon\Big) \\
&\le \frac1\varepsilon\,\mathbb{E}^*\Big(\max_{1\le i\le N_{q_0}}\sup_{f,g\in\mathcal F_{q_0,i}}\big(|\mathbb{G}_n(f-\pi_{q_0}f)|+|\mathbb{G}_n(g-\pi_{q_0}g)|\big)\Big) \\
&\le \frac2\varepsilon\,\mathbb{E}^*\max_{1\le i\le N_{q_0}}\sup_{f\in\mathcal F_{q_0,i}}|\mathbb{G}_n(f-\pi_{q_0}f)|
\le \frac2\varepsilon\,\mathbb{E}^*\|\mathbb{G}_n(f-\pi_{q_0}f)\|_{\mathcal F} < \eta
\end{align*}
for $n$ large enough, for arbitrarily given $\eta>0$.

Claim.) $\forall\varepsilon>0\ \exists q_0$ s.t. $\mathbb{E}^*\|\mathbb{G}_n(f-\pi_{q_0}f)\|_{\mathcal F} < \varepsilon$ for $n$ large enough.

Define a sequence
\[ a_q = \frac{2^{-q}}{\sqrt{\log N_{q+1}}} \]
and, for $q > q_0$, the indicators
\begin{align*}
A_{q-1}f &= I\big(\Delta_{q_0}f\le\sqrt n\,a_{q_0},\ \cdots,\ \Delta_{q-1}f\le\sqrt n\,a_{q-1}\big), \\
B_qf &= I\big(\Delta_{q_0}f\le\sqrt n\,a_{q_0},\ \cdots,\ \Delta_{q-1}f\le\sqrt n\,a_{q-1},\ \Delta_qf>\sqrt n\,a_q\big), \\
B_{q_0}f &= I\big(\Delta_{q_0}f>\sqrt n\,a_{q_0}\big).
\end{align*}

Then from $A_{q-1}f = A_qf + B_qf$ we get
\begin{align*}
(f-\pi_{q-1}f)A_{q-1}f &= (f-\pi_qf+\pi_qf-\pi_{q-1}f)A_{q-1}f \\
&= (f-\pi_qf)(A_qf+B_qf) + (\pi_qf-\pi_{q-1}f)A_{q-1}f \\
&= (f-\pi_qf)B_qf + (f-\pi_qf)A_qf + (\pi_qf-\pi_{q-1}f)A_{q-1}f.
\end{align*}
Furthermore, we have $B_{q_0}f + A_{q_0}f = 1$, and hence
\begin{align*}
f-\pi_{q_0}f &= (f-\pi_{q_0}f)(A_{q_0}f+B_{q_0}f) \\
&= (f-\pi_{q_0}f)B_{q_0}f + (f-\pi_{q_0}f)A_{q_0}f \\
&= (f-\pi_{q_0}f)B_{q_0}f + (f-\pi_{q_0+1}f)B_{q_0+1}f + (f-\pi_{q_0+1}f)A_{q_0+1}f + (\pi_{q_0+1}f-\pi_{q_0}f)A_{q_0}f \\
&= \cdots = (f-\pi_{q_0}f)B_{q_0}f + \sum_{q=q_0+1}^Q(f-\pi_qf)B_qf + \sum_{q=q_0+1}^Q(\pi_qf-\pi_{q-1}f)A_{q-1}f + (f-\pi_Qf)A_Qf
\end{align*}
for any $Q>q_0$. Note that
\[ |(f-\pi_Qf)A_Qf| \le \Delta_QfA_Qf \underset{\text{def.\ of }A_Q}{\le} \sqrt n\,a_Q = \frac{\sqrt n\,2^{-Q}}{\sqrt{\log N_{Q+1}}} \xrightarrow[Q\to\infty]{} 0 \]
for fixed $n$, and finally we have
\[ f-\pi_{q_0}f = \underbrace{(f-\pi_{q_0}f)B_{q_0}f}_{(\mathrm{I})} + \underbrace{\sum_{q=q_0+1}^\infty(f-\pi_qf)B_qf}_{(\mathrm{II})} + \underbrace{\sum_{q=q_0+1}^\infty(\pi_qf-\pi_{q-1}f)A_{q-1}f}_{(\mathrm{III})}. \]

Now we have to control the three terms (I), (II), and (III). (Sketch: for (I), $f$ is dominated by the square-integrable envelope $F$ and $I(F>\sqrt n\,a_{q_0})\to0$ as $n\to\infty$ for fixed $q_0$; for (II), since the indicators defining the $B_qf$ are disjoint, at most one term $B_qf$ is non-zero; for (III), apply the chaining technique to the $\pi_qf$'s.)

(I) Since $|(f-\pi_{q_0}f)B_{q_0}f| \le 2F\,I(2F\ge\sqrt n\,a_{q_0})$,
\begin{align*}
\mathbb{E}^*\big\|\mathbb{G}_n\underbrace{(f-\pi_{q_0}f)B_{q_0}f}_{=:(*)}\big\|_{\mathcal F}
&= \mathbb{E}^*\Big\|\frac{1}{\sqrt n}\sum_{i=1}^n\big((*)(X_i)-P(*)\big)\Big\|_{\mathcal F} \\
&\le \frac{2}{\sqrt n}\sum_{i=1}^nP\,2F\,I(2F\ge\sqrt n\,a_{q_0}) = 4\sqrt n\,P\big(F\,I(2F\ge\sqrt n\,a_{q_0})\big) \\
&\le 4\sqrt n\,P\Big(F\cdot\frac{2F}{\sqrt n\,a_{q_0}}\,I(2F\ge\sqrt n\,a_{q_0})\Big) \qquad(\text{to eliminate the }\sqrt n\text{ factor}) \\
&= \frac{8}{a_{q_0}}\,PF^2I(2F\ge\sqrt n\,a_{q_0}) \xrightarrow[n\to\infty]{} 0 \quad\text{for any fixed } q_0\ (\because\ PF^2<\infty).
\end{align*}

(III) Note that both $\pi_qf$ and $\pi_{q-1}f$ belong to $\mathcal F_{q-1,i}$ if $f\in\mathcal F_{q-1,i}$, because each partition is a refinement of the previous one. It gives that
\[ |\pi_qf-\pi_{q-1}f| \le \Delta_{q-1}f. \]
Hence we get
\[ \|(\pi_qf-\pi_{q-1}f)A_{q-1}f\|_\infty \le \|\Delta_{q-1}fA_{q-1}f\|_\infty \underset{\text{def.\ of }A_{q-1}}{\le} \sqrt n\,a_{q-1} \]
and
\[ \|(\pi_qf-\pi_{q-1}f)A_{q-1}f\|_{P,2} \le \|\pi_qf-\pi_{q-1}f\|_{P,2} \le 2^{-(q-1)}; \]
the last inequality comes from $\pi_qf,\pi_{q-1}f\in\mathcal F_{q-1,i}$, which lies in a $2^{-(q-1)}$-bracket. Thus we have
\[ \mathbb{E}^*\Big\|\mathbb{G}_n\sum_{q=q_0+1}^\infty(\pi_qf-\pi_{q-1}f)A_{q-1}f\Big\|_{\mathcal F}
\le \sum_{q=q_0+1}^\infty\mathbb{E}^*\|\mathbb{G}_n(\pi_qf-\pi_{q-1}f)A_{q-1}f\|_{\mathcal F}
\lesssim \sum_{q=q_0+1}^\infty\Big[\frac{\sqrt n\,a_{q-1}}{\sqrt n}\log N_q + 2^{-(q-1)}\sqrt{\log N_q}\Big] \]
by lemma 3.2.4. (Note that lemma 3.2.4 applies only to function classes of finite cardinality, but the number of distinct values of $(\pi_qf-\pi_{q-1}f)A_{q-1}f$ is at most $N_q$.) Then, since $a_{q-1} = 2^{-(q-1)}/\sqrt{\log N_q} \le 2^{-(q-1)}$, we get
\[ \mathbb{E}^*\Big\|\mathbb{G}_n\sum_{q=q_0+1}^\infty(\pi_qf-\pi_{q-1}f)A_{q-1}f\Big\|_{\mathcal F} \lesssim \sum_{q=q_0+1}^\infty 2^{-q}\sqrt{\log N_q} \xrightarrow[q_0\to\infty]{} 0 \]
regardless of $n$. Thus we can obtain the claim by first finding $q_0$ that makes the term (III) (and (II), eventually, as will be shown) small enough, and then, for such fixed $q_0$, letting $n$ be very large so that (I) becomes small. Therefore the remaining part is showing that (II) becomes small for $q_0$ large enough, regardless of $n$.

Claim (for (II)). (II) $\to 0$ as $q_0\to\infty$, regardless of $n$.

Since $|f-\pi_qf|B_qf \le \Delta_qfB_qf$, we have
\begin{align*}
\big|\mathbb{G}_n\underbrace{(f-\pi_qf)B_qf}_{(**)}\big| &= \Big|\frac{1}{\sqrt n}\sum_{i=1}^n\big((**)(X_i)-P(**)\big)\Big| \\
&\le \Big|\underbrace{\frac{1}{\sqrt n}\sum_{i=1}^n\big(\Delta_qfB_qf(X_i)-P\Delta_qfB_qf\big)}_{=\,\mathbb{G}_n\Delta_qfB_qf}\Big| + \Big|\frac{1}{\sqrt n}\sum_{i=1}^nP\Delta_qfB_qf\Big| + \underbrace{\Big|\frac{1}{\sqrt n}\sum_{i=1}^nP(**)\Big|}_{\le\,\sqrt n\,P|(**)|} \\
&\le |\mathbb{G}_n\Delta_qfB_qf| + 2\sqrt n\,P\Delta_qfB_qf. \tag{3.5}
\end{align*}

Note that $\|\Delta_qfB_qf\|_\infty \le \|\Delta_{q-1}fB_qf\|_\infty \le \sqrt n\,a_{q-1}$ and
\begin{align*}
P(\Delta_qfB_qf)^2 &\le P\big(\Delta_qf\Delta_{q-1}fB_qf\big) \qquad\big(\because\ \Delta_qf\le\Delta_{q-1}f,\ (B_qf)^2=B_qf\big) \\
&\le \sqrt n\,a_{q-1}\,P\Delta_qfB_qf \le \sqrt n\,a_{q-1}\,P\big(\Delta_qf\,I(\Delta_qf>\sqrt n\,a_q)\big) \\
&\le \sqrt n\,a_{q-1}\,P\Big(\Delta_qf\cdot\frac{\Delta_qf}{\sqrt n\,a_q}\Big) = \frac{a_{q-1}}{a_q}\,P(\Delta_qf)^2 \le \frac{a_{q-1}}{a_q}\,2^{-2q}
\end{align*}
hold (the last inequality comes from $\|\Delta_qf\|_{P,2}\le2^{-q}$). Thus, by lemma 3.2.4 again,
\begin{align*}
\mathbb{E}^*\|\mathbb{G}_n\Delta_qfB_qf\|_{\mathcal F}
&\lesssim \frac{\max_f\|\Delta_qfB_qf\|_\infty}{\sqrt n}\log N_q + \max_f\sqrt{P(\Delta_qfB_qf)^2}\,\sqrt{\log N_q} \\
&\le a_{q-1}\log N_q + \underbrace{\sqrt{\frac{a_{q-1}}{a_q}}}_{\frac{a_{q-1}}{a_q}\ge1\ \Rightarrow\ \le\,\frac{a_{q-1}}{a_q}}\,2^{-q}\sqrt{\log N_q} \\
&\le \underbrace{a_{q-1}}_{=\,2^{-(q-1)}/\sqrt{\log N_q}}\log N_q + \underbrace{\frac{a_{q-1}}{a_q}}_{=\,2\sqrt{\log N_{q+1}}/\sqrt{\log N_q}}\,2^{-q}\sqrt{\log N_q} \\
&= \underbrace{2^{-(q-1)}\sqrt{\log N_q} + 2^{-(q-1)}\sqrt{\log N_{q+1}}}_{=(\mathrm A)}.
\end{align*}

Note that the final bound (A) is summable over $q\ge q_0+1$. On the other hand,
\[ \sqrt n\,a_q\,P(\Delta_qfB_qf) \le P(\Delta_qf)^2 \le 2^{-2q} \]
by the definition of $B_qf$, which gives
\[ 2\sqrt n\,P\Delta_qfB_qf \lesssim \frac{2^{-2q}}{a_q} = \underbrace{2^{-q}\sqrt{\log N_{q+1}}}_{=(\mathrm B)}, \]
whose final bound (B) is also summable over $q\ge q_0+1$. Putting these into (3.5), we get
\begin{align*}
\mathbb{E}^*\Big\|\sum_{q=q_0+1}^\infty\mathbb{G}_n(f-\pi_qf)B_qf\Big\|_{\mathcal F}
&\le \sum_{q=q_0+1}^\infty\mathbb{E}^*\|\mathbb{G}_n(f-\pi_qf)B_qf\|_{\mathcal F} \\
&\le \sum_{q=q_0+1}^\infty\Big[\mathbb{E}^*\|\mathbb{G}_n\Delta_qfB_qf\|_{\mathcal F} + 2\sqrt n\,\|P\Delta_qfB_qf\|_{\mathcal F}\Big] \\
&\lesssim \sum_{q=q_0+1}^\infty\Big[2^{-(q-1)}\sqrt{\log N_q} + 2^{-(q-1)}\sqrt{\log N_{q+1}} + 2^{-q}\sqrt{\log N_{q+1}}\Big] \xrightarrow[q_0\to\infty]{} 0
\end{align*}
regardless of $n$. $\square$

Chapter 4

Uniform Entropy & Bracketing Numbers

Note that $\mathcal F$ is a Donsker class if (there exists a square-integrable envelope $F$ and)
\[ \int_0^\infty\sup_Q\sqrt{\log N(\varepsilon\|F\|_{Q,2},\mathcal F,L_2(Q))}\,d\varepsilon < \infty \]
by the covering Donsker theorem. For this, one has to obtain a bound of the form
\[ \sup_Q\log N\big(\varepsilon\|F\|_{Q,2},\mathcal F,L_2(Q)\big) \le K\Big(\frac1\varepsilon\Big)^{2-\delta} \]
(at least for small $\varepsilon$) for some $\delta>0$. However, many classes of functions satisfy the much stronger condition
\[ \sup_QN\big(\varepsilon\|F\|_{Q,2},\mathcal F,L_2(Q)\big) \le K\Big(\frac1\varepsilon\Big)^V, \quad 0<\varepsilon<1, \]
for some number $V$. In this chapter, we consider a class of sets or functions called a VC class, named after Vapnik and Chervonenkis, and some properties of uniform covering (and bracketing) numbers. We also consider a set of special functions with uniform bracketing numbers.

4.1 VC class and Uniform Covering Numbers

4.1.1 VC class of sets

Let $\mathcal C$ be a collection of subsets of $\mathcal X$, and let $\{x_1,x_2,\cdots,x_n\}$ be an arbitrary set of $n$ points in $\mathcal X$.

Definition 4.1.1 (VC index & class).

(i) $\mathcal C$ picks out a certain subset, say $A$, from $\{x_1,\cdots,x_n\}$ if $A = C\cap\{x_1,\cdots,x_n\}$ for some $C\in\mathcal C$.

(ii) $\mathcal C$ shatters $\{x_1,\cdots,x_n\}$ if each of its $2^n$ subsets can be picked out.

(iii) The VC index $V(\mathcal C)$ of $\mathcal C$ is the smallest $n$ for which no set of size $n$ is shattered by $\mathcal C$. The VC dimension is defined as $V(\mathcal C)-1$.

(iv) A collection of measurable sets $\mathcal C$ is called a VC class if $V(\mathcal C)<\infty$.

(v) Also define $\Delta_n(\mathcal C, x_1,x_2,\cdots,x_n) := \big|\{C\cap\{x_1,\cdots,x_n\} : C\in\mathcal C\}\big|$.

Example 4.1.2. Let X = R and C = {(−∞, c] : c ∈ R} ("half-intervals"). Then every singleton {c} ⊆ R can be picked out. However, {c₁, c₂} (c₁ < c₂) cannot be shattered, as {c₂} is not picked out. Thus V(C) = 2.

Example 4.1.3. Let X = R and C = {(a, b] : a, b ∈ R}. Then every set {c₁, c₂} ⊆ R is shattered. However, {c₁, c₂, c₃} (c₁ < c₂ < c₃) cannot be shattered, as {c₁, c₃} cannot be picked out. Thus V(C) = 3.

Example 4.1.4. In general, let X = R^d and

C₁ = {(−∞, c] : c ∈ R^d},  C₂ = {(a, b] : a, b ∈ R^d}.

Then V(C₁) = d + 1 and V(C₂) = 2d + 1. A sketch of the proof is as follows: (To be added)
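The shattering claims in Examples 4.1.2–4.1.3 are easy to confirm computationally. A minimal brute-force sketch (the helper names `shatters` and `vc_index` are ours, not from the text), restricting each class to a small ground set:

```python
from itertools import combinations

def shatters(points, classes):
    """True if every subset of `points` is picked out by some C in `classes`."""
    picked = {frozenset(points) & C for C in classes}
    return len(picked) == 2 ** len(points)

def vc_index(ground, classes):
    """Smallest n for which no n-point subset of `ground` is shattered."""
    for n in range(1, len(ground) + 2):
        if not any(shatters(pts, classes) for pts in combinations(ground, n)):
            return n

ground = [1, 2, 3, 4]
# half-intervals (-inf, c] of Example 4.1.2, restricted to the ground set
half = [frozenset(x for x in ground if x <= c) for c in range(0, 5)]
# intervals (a, b] of Example 4.1.3, restricted to the ground set
ints = [frozenset(x for x in ground if a < x <= b)
        for a in range(0, 5) for b in range(a, 5)]

assert vc_index(ground, half) == 2   # matches Example 4.1.2
assert vc_index(ground, ints) == 3   # matches Example 4.1.3
```

Restricting to finitely many thresholds loses no picked-out subsets here, since each class's trace on a finite point set is realized by finitely many parameter choices.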

The following lemma is a result from combinatorics; the proof in fact shows that the number of subsets picked out by C, namely ∆ₙ(C, x₁, …, xₙ), is bounded above by the number of subsets of {x₁, …, xₙ} shattered by C.

Lemma 4.1.5 (Sauer). Let C be a VC class. Then for any n points x₁, …, xₙ with n ≥ V(C) − 1, we get

∆ₙ(C, x₁, …, xₙ) ≤ ∑_{j=0}^{V(C)−1} (n choose j) ≤ (ne/(V(C) − 1))^{V(C)−1}.


Proof. WLOG we may assume that C ⊆ 2^{{x₁,…,xₙ}}, i.e., C consists of subsets of {x₁, …, xₙ}, so that

∆ₙ(C, x₁, …, xₙ) = |{C ∩ {x₁, …, xₙ} : C ∈ C}| = |C|.

For C ∈ C, let

Tᵢ(C) = C \ {xᵢ} if C \ {xᵢ} ∉ C, and Tᵢ(C) = C otherwise.

Then C ↦ Tᵢ(C) is one-to-one on C: assume Tᵢ(C₁) = Tᵢ(C₂); then there are the following possible cases:

i) if C₁ \ {xᵢ} ∉ C and C₂ \ {xᵢ} ∉ C, then C₁ \ {xᵢ} = C₂ \ {xᵢ}, and since C₁, C₂ ∈ C, we get xᵢ ∈ C₁ and xᵢ ∈ C₂ (otherwise, e.g., C₁ \ {xᵢ} = C₁ ∈ C, a contradiction). Thus C₁ = C₂;

ii) if C₁ \ {xᵢ} ∈ C and C₂ \ {xᵢ} ∈ C, then clearly C₁ = C₂ by the definition of Tᵢ;

iii) if C₁ \ {xᵢ} ∉ C and C₂ \ {xᵢ} ∈ C (or vice versa), then C₁ \ {xᵢ} = C₂ ∈ C, which yields a contradiction;

and hence we get the result. It gives that |C| = |Tᵢ(C)|, where Tᵢ(C) := {Tᵢ(C) : C ∈ C}.

Claim.) If A ⊆ {x₁, …, xₙ} is shattered by Tᵢ(C), then A is shattered by C.

If xᵢ ∉ A then it is clear that A ∩ C = A ∩ Tᵢ(C) for any C ∈ C. So assume that xᵢ ∈ A and Tᵢ(C) shatters A. Then for any B ⊆ A, we have B ∪ {xᵢ} ⊆ A, and hence ∃C ∈ C s.t.

B ∪ {xᵢ} = A ∩ Tᵢ(C).

Then we get xᵢ ∈ Tᵢ(C), and hence Tᵢ(C) = C, i.e., C \ {xᵢ} ∈ C. Thus both

B ∪ {xᵢ} = A ∩ Tᵢ(C) = A ∩ C

and

B \ {xᵢ} = (A ∩ C) \ {xᵢ} = A ∩ (C \ {xᵢ})

are picked out by C, one of which equals B. (Claim)

Now apply the operators T₁, T₂, …, Tₙ repeatedly, until the collection of sets does not change anymore; call the resulting collection D. Since each C ↦ Tᵢ(C) is one-to-one, |D| = |C|. Now by the claim,

(# of sets shattered by C) ≥ (# of sets shattered by Tᵢ(C)) ≥ · · · ≥ (# of sets shattered by D),   (4.1)

and by the construction, D \ {xᵢ} ∈ D for any D ∈ D and any xᵢ, which implies that D is closed under taking subsets (∵ for any D ∈ D, all subsets of D should belong to D, i.e., 2^D ⊆ D). Then every D ∈ D is shattered by D (each subset B ⊆ D belongs to D and picks out itself from D), and conversely any set shattered by D is contained in some member of D, hence belongs to D.


Hence we get

(# of sets shattered by D) = |D| = |C|.   (4.2)

Combining both (4.1) and (4.2), we get

∆ₙ(C, x₁, …, xₙ) = |C| ≤ (# of sets shattered by C).

Note that C can only shatter sets with fewer elements than V(C);

(# of sets shattered by C) ≤ ∑_{j=0}^{V(C)−1} (n choose j).

Therefore we get

|C| ≤ ∑_{j=0}^{V(C)−1} (n choose j).

Now the remaining part is to show

∑_{j=0}^{k} (n choose j) ≤ (ne/k)^k for any k ≤ n,

which can be obtained from the following simple calculation:

∑_{j=0}^{k} (n choose j) = ∑_{j=0}^{k} n(n−1)···(n−j+1)/j!
 ≤ ∑_{j=0}^{k} nʲ/j!
 ≤ ∑_{j=0}^{k} (nʲ/j!) (n/k)^{k−j}   (∵ n/k ≥ 1)
 = (n/k)^k ∑_{j=0}^{k} kʲ/j!
 ≤ (n/k)^k e^k.
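The final binomial bound ∑_{j=0}^{k} (n choose j) ≤ (ne/k)^k can be sanity-checked numerically (our own loop; requires Python 3.8+ for math.comb):

```python
from math import comb, e

# verify  sum_{j=0}^{k} C(n, j) <= (n e / k)^k  for all 1 <= k <= n
for n in range(1, 40):
    for k in range(1, n + 1):
        assert sum(comb(n, j) for j in range(k + 1)) <= (n * e / k) ** k
```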

For two measurable sets, note that the Lr(Q) norm of indicators is obtained as

‖1_C − 1_D‖_{Q,r} = Q^{1/r}(C △ D).   (4.3)

Now we consider the Lr(Q)-distance between sets C and D in the sense of (4.3).


Theorem 4.1.6 (Uniform Covering Numbers). Let C be a VC class of sets. Then for any r ≥ 1, 0 < ε < 1 and probability measure Q,

N(ε, C, Lr(Q)) ≤ K (1/ε)^{r(V(C)−1)}

holds, where K is a constant depending only on V(C).

Proof. We will only prove the milder version of the statement:

N(ε, C, Lr(Q)) ≤ K (1/ε)^{r(V(C)−1+δ)} for all δ > 0.

Take C₁, C₂, …, C_m from C s.t. Q(Cᵢ △ Cⱼ) > ε for i ≠ j (i.e., the Cᵢ are ε^{1/r}-separated in Lr(Q)). If one shows that

m ≲ (1/ε)^{V(C)−1+δ} for all δ > 0,

then, since the packing number dominates the covering number, we can obtain the conclusion

N(ε, C, Lr(Q)) ≤ D(ε, C, Lr(Q)) ≲ (1/ε^r)^{V(C)−1+δ},

where D denotes the packing number (an Lr(Q)-distance ε corresponds to Q(C △ D) = ε^r, whence ε is replaced by ε^r in the bound for m).

Let X₁, X₂, …, Xₙ i.i.d. ∼ Q. (Remark: the fact that "Q is a probability measure" is used here!) Note that

Cᵢ and Cⱼ pick out the same subset from {X₁, …, Xₙ}
⇔ Cᵢ ∩ {X₁, …, Xₙ} = Cⱼ ∩ {X₁, …, Xₙ}
⇔ Xₖ ∉ Cᵢ △ Cⱼ ∀k.

Thus

∀(i, j) ∃k s.t. Xₖ ∈ Cᵢ △ Cⱼ
⟹ each of the Cᵢ (i = 1, 2, …, m) picks out a different subset from {X₁, …, Xₙ}
⟹ C picks out at least m subsets from {X₁, …, Xₙ}.

Let

E = {∀(i, j) ∃k s.t. Xₖ ∈ Cᵢ △ Cⱼ} = ⋂_{(i,j)} ⋃_{k=1}^{n} {Xₖ ∈ Cᵢ △ Cⱼ}.

Then

P(Eᶜ) ≤ ∑_{i<j} P(Xₖ ∉ Cᵢ △ Cⱼ ∀k) = ∑_{i<j} (1 − P(X₁ ∈ Cᵢ △ Cⱼ))ⁿ = ∑_{i<j} (1 − Q(Cᵢ △ Cⱼ))ⁿ ≤ ∑_{i<j} (1 − ε)ⁿ = (m choose 2)(1 − ε)ⁿ.

It gives that P(Eᶜ) < 1 for sufficiently large n, i.e., P(E) > 0. Hence there should exist x₁, x₂, …, xₙ (∈ supp(Q)) satisfying E, i.e., C picks out at least m subsets from {x₁, …, xₙ}. It implies that

m ≤ max_{x₁,…,xₙ} ∆ₙ(C, x₁, …, xₙ).

By the previous lemma, we get ∆ₙ(C, x₁, …, xₙ) ≲ n^{V(C)−1}, i.e.,

m ≲ n^{V(C)−1},

provided that (m choose 2)(1 − ε)ⁿ < 1. Put n = ⌈3 log m / ε⌉ (sufficiently large: (m choose 2)(1 − ε)ⁿ ≤ m² e^{−nε} ≤ m^{−1} < 1 for this choice); then we get

m ≲ (log m / ε)^{V(C)−1},

so m is bounded by a power of log m times ε^{−(V(C)−1)}, which implies

m ≲ (1/ε)^{V(C)−1+δ} for all δ > 0,

since (log m)^{V(C)−1} is bounded by a constant times m^{δ′} for any δ′ > 0.
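The polynomial rate in Theorem 4.1.6 is essentially attained already for V(C) = 2: for the half-intervals of Example 4.1.2 and Q = Uniform[0, 1] one has Q(C_s △ C_t) = |s − t|, so a maximal ε-separated family has on the order of 1/ε = (1/ε)^{V(C)−1} members. A small sketch (the helper name is ours):

```python
from fractions import Fraction

def max_separated(eps):
    """Max number of thresholds in [0, 1] with pairwise distance > eps,
    i.e. a maximal family of half-intervals with Q(C_s spread C_t) > eps
    under Q = Uniform[0, 1].  k points require (k - 1) * eps < 1."""
    k = 1
    while k * eps < 1:  # returns the largest k with (k - 1) * eps < 1
        k += 1
    return k

assert max_separated(Fraction(1, 10)) == 10    # ~ 1/eps members
assert max_separated(Fraction(1, 100)) == 100
```

Exact rational arithmetic (`Fraction`) avoids floating-point edge cases at the separation boundary.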

4.1.2 VC class of functions

We can extend the definition of a VC class to function classes by considering the subgraph of each function.

Definition 4.1.7.

(i) The subgraph of a function f : X → R is {(x, t) : t < f(x)}.

(ii) A collection F of measurable functions is called a VC-subgraph class (or VC class, in short) if C = {C_f : f ∈ F} is a VC class of sets, where C_f denotes the subgraph of f. In this case, the VC index of F is defined as V(F) = V(C).

Theorem 4.1.8. Let F be a VC class with measurable envelope F, and let Q be a probability measure with ‖F‖_{Q,r} > 0, r ≥ 1. Then

N(ε‖F‖_{Q,r}, F, Lr(Q)) ≲ (1/ε)^{r(V(F)−1)}.

Proof. First we treat the case r = 1. Let C_f be the subgraph of f ∈ F and C = {C_f : f ∈ F}. Then

Q|f − g| = ∫∫ 1_{C_f △ C_g}(x, t) dt dQ(x) = (Q ⊗ λ)(C_f △ C_g),

where λ denotes the Lebesgue measure on R. Note that Q ⊗ λ is not a probability measure anymore; we have to construct a probability measure in order to apply the result for a VC class of sets. Let

P := (Q ⊗ λ) / (2QF).

Then P is a probability measure on {(x, t) : |t| ≤ F(x)}, and hence we can obtain

N(ε · 2QF, F, L₁(Q)) = N(ε, C, L₁(P)) ≲ (1/ε)^{V(C)−1}.

(Editor's note: P is a probability measure on {(x, t) : |t| ≤ F(x)}, but we should consider a subgraph class, which lives in {(x, t) : t ≤ F(x)}. My solution for this is that considering C and considering

C* := {C ∩ {(x, t) : |t| ≤ F(x)} : C ∈ C}

are equivalent, in the sense that C_f △ C_g = C*_f △ C*_g, where

C*_f = C_f ∩ {(x, t) : |t| ≤ F(x)} (= C_f ∩ {(x, t) : t > −F(x)}) ∈ C*.

Then it can be said that P is a probability measure on the space containing C*.)

For general r > 1, note that

Q|f − g|^r ≤ Q(|f − g| (2F)^{r−1}) = 2^{r−1} Q(|f − g| F^{r−1}) = 2^{r−1} R|f − g| · QF^{r−1},

where R is the probability measure defined by

R(A) = ∫_A F^{r−1} dQ / QF^{r−1}.

Also note that

R|f − g| < ε^r RF ⟹ Q|f − g|^r ≤ 2^{r−1} R|f − g| · QF^{r−1} < 2^{r−1} ε^r RF · QF^{r−1} = 2^{r−1} ε^r QF^r < (2ε‖F‖_{Q,r})^r,


which gives

N(2ε‖F‖_{Q,r}, F, Lr(Q)) ≤ N(ε^r RF, F, L₁(R)).

Then, by the argument for r = 1, we get

N(ε^r RF, F, L₁(R)) ≲ (1/ε^r)^{V(F)−1},

which gives the conclusion.

4.2 VC-Hull Class and Uniform Entropy

Unfortunately, in many cases it is difficult to show that a given function class is VC (and often it is not). However, the convex hull of a class can represent a much larger class. For example, a class of normal density functions can only represent Gaussian distributions, but its convex hull includes many Gaussian mixture distributions, which can approximate almost all continuous distributions on R. With this motivation: even when a given function class cannot be expected to be VC, it may belong to a VC-hull class.

Definition 4.2.1.

(i) (Convex hull) convF = {∑_{i=1}^{m} αᵢfᵢ : αᵢ > 0, ∑_{i=1}^{m} αᵢ = 1, fᵢ ∈ F}.

(ii) (Symmetric convex hull) sconvF := {∑_{i=1}^{m} αᵢfᵢ : ∑_{i=1}^{m} |αᵢ| ≤ 1, fᵢ ∈ F}.

(iii) (Pointwise limits) Let \overline{conv}F and \overline{sconv}F be the closures of convF and sconvF under pointwise convergence; they contain pointwise limits of finite combinations (and, in the limit, infinite combinations).

(iv) F is a VC-hull class if it is contained in the pointwise sequential closure of the symmetric convex hull of a VC class of functions, i.e., F ⊆ \overline{sconv}G for a VC class G of functions.

The following theorems give uniform entropy conditions for VC-hull classes.

Theorem 4.2.2. Let Q be a probability measure and F a class of measurable functions with measurable square-integrable envelope F, i.e., QF² < ∞. If

N(ε‖F‖_{Q,2}, F, L₂(Q)) ≤ C (1/ε)^V,  0 < ε < 1,

then there exists a constant K depending only on C and V such that

log N(ε‖F‖_{Q,2}, \overline{conv}F, L₂(Q)) ≤ K (1/ε)^{2V/(V+2)}.

Remark 4.2.3. This says that if the covering number admits a bound of polynomial order, then the entropy of the convex hull admits a bound of polynomial order smaller than 2. By Donsker's theorem, such a convex hull is then Donsker.

We can obtain a similar statement for VC-hull classes.

Corollary 4.2.4 (Uniform Entropy). Let G be a VC class and F = \overline{sconv}G the corresponding VC-hull class. Then

log N(ε‖F‖_{Q,2}, F, L₂(Q)) ≲ (1/ε)^{2(1 − 1/V(G))}

for small ε, and hence F is Donsker.

It can be proved using the fact that sconvG is contained in the convex hull of G ∪ (−G) ∪ {0}.

4.2.1 Examples: VC(-Hull) Classes

Lemma 4.2.5. Let ψ : R → R be a fixed monotone function. Then

{x ↦ ψ(x − h) : h ∈ R}   ("translations")

is a VC class of index 2.

Figure 4.1: A set of two points cannot be shattered.

Lemma 4.2.6. Let F be a finite-dimensional vector space of measurable functions. Then V(F) ≤ dim(F) + 2; i.e., a finite-dimensional vector space of measurable functions is a VC class.

Proof. Let dim(F) = d and n = d + 2. Take any n points (x₁, t₁), …, (xₙ, tₙ) from X × R. For a basis b₁, …, b_d of F, every f ∈ F can be written as f = ∑_{i=1}^{d} aᵢbᵢ for some a₁, a₂, …, a_d, i.e.,

(f(x₁) − t₁, …, f(xₙ) − tₙ)ᵀ = [ tᵢ  b₁(xᵢ)  ···  b_d(xᵢ) ]_{i=1,…,n} (−1, a₁, …, a_d)ᵀ   (∗)

holds for f ∈ F. Since the matrix (∗) is n × (n − 1), the point (f(x₁) − t₁, …, f(xₙ) − tₙ) lies in an (n − 1)-dimensional subspace W of Rⁿ. Thus there exists a nonzero vector a ∈ Rⁿ \ {0} s.t. a ⊥ W. WLOG at least one component aᵢ is strictly positive, and hence we get

∑_{aᵢ>0} aᵢ(f(xᵢ) − tᵢ) = ∑_{aᵢ≤0} (−aᵢ)(f(xᵢ) − tᵢ).   (4.4)

If {(xᵢ, tᵢ) : aᵢ > 0} (which is non-empty by the assumption) can be picked out by a subgraph, then ∃f ∈ F s.t.

{(xᵢ, tᵢ) : aᵢ > 0} = {(x, t) : t < f(x)} ∩ {(xᵢ, tᵢ) : i = 1, 2, …, n} = {(xᵢ, tᵢ) : tᵢ < f(xᵢ), i = 1, 2, …, n}.

It gives

{(xᵢ, tᵢ) : aᵢ ≤ 0} = {(xᵢ, tᵢ) : tᵢ ≥ f(xᵢ)},

but then

∑_{aᵢ>0} aᵢ(f(xᵢ) − tᵢ) > 0 and ∑_{aᵢ≤0} (−aᵢ)(f(xᵢ) − tᵢ) ≤ 0

hold, which contradicts (4.4). Hence the set {(xᵢ, tᵢ) : aᵢ > 0} cannot be picked out, i.e., {(x₁, t₁), …, (xₙ, tₙ)} cannot be shattered.

The following lemma gives some basic properties of VC classes of sets.

Lemma 4.2.7. Let C, D be VC classes of sets in X, and let E be a VC class of sets in Y. Also let φ : X → Y and ψ : Z → X be fixed functions. Then:

(i) Cᶜ := {Cᶜ : C ∈ C} is VC of index V(C).

(ii) C ⊓ D := {C ∩ D : C ∈ C, D ∈ D} is VC of index ≤ V(C) + V(D) − 1.


(iii) C ⊔ D := {C ∪ D : C ∈ C, D ∈ D} is VC of index ≤ V(C) + V(D) − 1.

(iv) D × E := {D × E : D ∈ D, E ∈ E} is VC of index ≤ V(D) + V(E) − 1.

(v) φ(C) is VC of index V(C) if φ is one-to-one.

(vi) ψ⁻¹(C) := {ψ⁻¹(C) : C ∈ C} is VC of index ≤ V(C).

Proof. (i) A ⊆ {x₁, …, xₙ} is picked out by C
⇔ ∃C ∈ C s.t. A = C ∩ {x₁, …, xₙ}
⇔ ∃Cᶜ ∈ Cᶜ s.t. {x₁, …, xₙ} \ A = Cᶜ ∩ {x₁, …, xₙ}
⇔ {x₁, …, xₙ} \ A is picked out by Cᶜ,
and hence {x₁, …, xₙ} is shattered by C ⇔ {x₁, …, xₙ} is shattered by Cᶜ.

(ii) C ⊓ D is VC (see the textbook), but I'm not sure that its index is bounded above by V(C) + V(D) − 1. The following (iii) and (iv) use (ii) in their proofs, so the stated index bounds there are correspondingly uncertain.

(iii) Note that C ⊔ D = (Cᶜ ⊓ Dᶜ)ᶜ. Then (i) and (ii) end the proof.

(iv) (Regard D × E as {D × E : D ∈ D, E ∈ E}.) Note that

D × E = (D × Y) ∩ (X × E),

and hence we just have to show that the class {D × Y : D ∈ D} is VC with index V(D), which is straightforward. Consider (x₁, y₁), …, (xₙ, yₙ) ∈ X × Y. Then

A ⊆ {(x₁, y₁), …, (xₙ, yₙ)} is picked out by {D × Y : D ∈ D}
⟹ ∃D ∈ D s.t. A = (D × Y) ∩ {(x₁, y₁), …, (xₙ, yₙ)} = {(xᵢ, yᵢ) : xᵢ ∈ D}
⟹ ∃D ∈ D s.t. {xᵢ : (xᵢ, yᵢ) ∈ A} = D ∩ {x₁, …, xₙ}
⟹ {xᵢ : (xᵢ, yᵢ) ∈ A} is picked out by D,

and vice versa. Thus {x₁, x₂, …, xₙ} is shattered ⇔ {(x₁, y₁), …, (xₙ, yₙ)} is shattered.

(v) A ⊆ {x₁, …, xₙ} is picked out by C
⇔ ∃C ∈ C s.t. A = C ∩ {x₁, …, xₙ}
⇔ ∃φ(C) ∈ φ(C) s.t. φ(A) = φ(C) ∩ φ({x₁, …, xₙ}),
because φ is one-to-one. Thus {x₁, …, xₙ} is shattered by C ⇔ {φ(x₁), …, φ(xₙ)} is shattered by φ(C).

(vi) Assume that A ⊆ {z₁, …, zₙ} is picked out by ψ⁻¹(C). Then ∃C ∈ C s.t. A = ψ⁻¹(C) ∩ {z₁, …, zₙ}. It gives that

ψ(A) = ψ(ψ⁻¹(C) ∩ {z₁, …, zₙ}) = ψ(ψ⁻¹(C)) ∩ ψ({z₁, …, zₙ}) = C ∩ {ψ(z₁), …, ψ(zₙ)},

i.e., ψ(A) ⊆ {ψ(z₁), …, ψ(zₙ)} is picked out by C. Hence

{z₁, …, zₙ} is shattered by ψ⁻¹(C) ⟹ {ψ(z₁), …, ψ(zₙ)} is shattered by C.


Also note that, if {z₁, …, zₙ} is shattered by ψ⁻¹(C), then

∀A ⊆ {z₁, …, zₙ} ∃C ∈ C s.t. A = ψ⁻¹(C) ∩ {z₁, …, zₙ}.

If ψ(zᵢ) = ψ(zⱼ) for some i ≠ j, then {zᵢ} cannot be picked out; hence ψ(zᵢ) ≠ ψ(zⱼ) for i ≠ j. It follows that {z₁, …, zₙ} is not shattered by ψ⁻¹(C) for n = V(C); if it were, then {ψ(z₁), …, ψ(z_{V(C)})} would be shattered by C, contradicting the definition of V(C).
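Property (i) is easy to confirm computationally on a finite ground set; a brute-force sketch (the helper names are ours, not from the text):

```python
import random
from itertools import combinations

def vc_index(ground, classes):
    """Smallest n such that no n-point subset of `ground` is shattered."""
    def shatters(pts):
        return len({frozenset(pts) & C for C in classes}) == 2 ** len(pts)
    n = 1
    while any(shatters(p) for p in combinations(ground, n)):
        n += 1
    return n

random.seed(0)
ground = frozenset(range(6))
C = [frozenset(x for x in ground if random.random() < 0.5) for _ in range(8)]
Cc = [ground - c for c in C]  # the complement class C^c of part (i)

# (i): complementation preserves the VC index
assert vc_index(ground, Cc) == vc_index(ground, C)
```

The check works for any finite class, since a set is shattered by Cᶜ exactly when it is shattered by C, as the proof of (i) shows.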

Lemma 4.2.8. Let F, G be VC classes of functions on X, and let g : X → R, φ : R → R, ψ : Z → X be fixed functions. Then:

(i) F ∧ G := {f ∧ g : f ∈ F, g ∈ G} (pointwise minimum) is VC of index ≤ V(F) + V(G) − 1.

(ii) F ∨ G := {f ∨ g : f ∈ F, g ∈ G} (pointwise maximum) is VC of index ≤ V(F) + V(G) − 1.

(iii) {F > 0} := {{x : f(x) > 0} : f ∈ F} is a VC class of sets with index ≤ V(F).

(iv) −F is VC of index V(F).

(v) F + g := {f + g : f ∈ F} is VC of index V(F).

(vi) g · F := {g · f : f ∈ F} is VC of index ≤ 2V(F) − 1.

(vii) F ∘ ψ := {f(ψ) : f ∈ F} is VC of index ≤ V(F).

(viii) φ ∘ F := {φ(f) : f ∈ F} is VC of index ≤ V(F) if φ is monotone.

Proof. (i) The subgraph of f ∧ g is

{(x, t) : t < (f ∧ g)(x)} = {(x, t) : t < f(x)} ∩ {(x, t) : t < g(x)},

and hence (ii) of the previous lemma gives the assertion.

(ii) The subgraph of f ∨ g is

{(x, t) : t < (f ∨ g)(x)} = {(x, t) : t < f(x)} ∪ {(x, t) : t < g(x)},

and hence (iii) of the previous lemma gives the assertion.

(iii) For the one-to-one function φ : (x, 0) ↦ x, the collection of the sets {x : f(x) > 0} = φ({(x, 0) : 0 < f(x)}) = φ({(x, t) : t < f(x)} ∩ (X × {0})) is a VC class, by (ii) and (v) of the previous lemma.

(iv) For this, it suffices to show that the "closed subgraphs"

{(x, t) : t ≤ f(x)}, f ∈ F,

form a VC class of index V(F) (∵ {(x, t) : t < −f(x)} = {(x, −t) : t > f(x)} = φ({(x, t) : t ≤ f(x)}ᶜ) for the one-to-one map φ(x, t) = (x, −t)). Suppose that the closed subgraphs shatter {(x₁, t₁), …, (xₙ, tₙ)}. Then ∃f₁, f₂, …, f_m, m = 2ⁿ, which pick out each subset of {(x₁, t₁), …, (xₙ, tₙ)}. Set 2ε := min{tᵢ − fⱼ(xᵢ) : tᵢ − fⱼ(xᵢ) > 0}. Then {(x₁, t₁ − ε), …, (xₙ, tₙ − ε)} is shattered by the open subgraphs.

Figure 4.2: If {(xᵢ, tᵢ) : i ∈ I} is picked out by closed subgraphs, then {(xᵢ, tᵢ − ε) : i ∈ I} is picked out by open subgraphs.

Conversely, assume that the open subgraphs shatter {(x₁, t₁), …, (xₙ, tₙ)}. Then again ∃f₁, …, f_m which pick out each subset. Set 2ε := min{fⱼ(xᵢ) − tᵢ : fⱼ(xᵢ) − tᵢ > 0}. Then {(x₁, t₁ + ε), …, (xₙ, tₙ + ε)} is shattered by the closed subgraphs.

(v) F + g shatters {(x₁, t₁), …, (xₙ, tₙ)}
⇔ ∀A ⊆ {(x₁, t₁), …, (xₙ, tₙ)} ∃f ∈ F s.t. A = {(x, t) : t < f(x) + g(x)} ∩ {(x₁, t₁), …, (xₙ, tₙ)}
⇔ ∀A ⊆ {(x₁, t₁), …, (xₙ, tₙ)} ∃f ∈ F s.t. A = {(xᵢ, tᵢ) : tᵢ − g(xᵢ) < f(xᵢ)}
⇔ F shatters {(x₁, t₁ − g(x₁)), …, (xₙ, tₙ − g(xₙ))}.

(vi) The subgraph of g · f is

{(x, t) : t < f(x)g(x)} = {(x, t) : t < f(x)g(x), g(x) > 0} (=: C⁺_f)
  ∪ {(x, t) : t < f(x)g(x), g(x) < 0} (=: C⁻_f)
  ∪ {(x, t) : t < f(x)g(x), g(x) = 0} (=: C⁰_f).

Note that

{C⁺_f : f ∈ F} shatters {(x₁, t₁), …, (xₙ, tₙ)} ⊆ {x : g(x) > 0} × R
⇔ ∀A ⊆ {(x₁, t₁), …, (xₙ, tₙ)} ∃f ∈ F s.t. A = {(xᵢ, tᵢ) : tᵢ < f(xᵢ)g(xᵢ), g(xᵢ) > 0} = {(xᵢ, tᵢ) : tᵢ/g(xᵢ) < f(xᵢ), g(xᵢ) > 0}
⇔ the subgraphs of F shatter {(x₁, t₁/g(x₁)), …, (xₙ, tₙ/g(xₙ))},

and hence {C⁺_f : f ∈ F} is VC in {x : g(x) > 0} × R. Similarly, {C⁻_f : f ∈ F} is VC in {x : g(x) < 0} × R.


Finally,

C⁰_f = {(x, t) : t < 0, g(x) = 0} = {x : g(x) = 0} × (−∞, 0)

is VC of index ≤ 2.

(vii) The subgraph of f(ψ) is {(z, t) : t < f(ψ(z))}, which is the inverse image of the subgraph {(x, t) : t < f(x)} under the map (z, t) ↦ (ψ(z), t).

(viii) Let φ ∘ F shatter {(x₁, t₁), …, (xₙ, tₙ)}. Then ∃f₁, …, f_m ∈ F, m = 2ⁿ, which pick out the m subsets. If we define sᵢ = max{fⱼ(xᵢ) : tᵢ ≥ φ(fⱼ(xᵢ))}, then

sᵢ ≥ fⱼ(xᵢ) ⇔ tᵢ ≥ φ(fⱼ(xᵢ)), i.e., sᵢ < fⱼ(xᵢ) ⇔ tᵢ < φ(fⱼ(xᵢ)).

Thus for Aⱼ = {(xᵢ, tᵢ) : tᵢ < φ(fⱼ(xᵢ))}, j = 1, 2, …, m, fⱼ picks out {(xᵢ, sᵢ) : (xᵢ, tᵢ) ∈ Aⱼ}. It implies that F shatters {(x₁, s₁), …, (xₙ, sₙ)}.

Lemma 4.2.9. Let C be a class of sets in X and F = {1_C : C ∈ C}. Then

C is VC ⇔ F is VC, and V(C) = V(F).

Proof. Note that the subgraph of 1_C is

{(x, t) : t < 1_C(x)} = (C × (−∞, 1)) ∪ (Cᶜ × (−∞, 0)).

Assume that C shatters {x₁, …, xₙ}. Then for each subset Aⱼ of {x₁, …, xₙ}, ∃Cⱼ ∈ C s.t.

Aⱼ = Cⱼ ∩ {x₁, …, xₙ}

by definition. Then

{(xᵢ, 1/2) : xᵢ ∈ Aⱼ} = [(Cⱼ × (−∞, 1)) ∪ (Cⱼᶜ × (−∞, 0))] ∩ {(x₁, 1/2), …, (xₙ, 1/2)}

holds, which gives that F shatters {(x₁, 1/2), …, (xₙ, 1/2)}. ∴ V(F) ≥ V(C).

Conversely, let F shatter {(x₁, t₁), …, (xₙ, tₙ)}. If ∃tᵢ with tᵢ ≥ 1, then (xᵢ, tᵢ) cannot be picked out; if ∃tᵢ with tᵢ < 0, then no subset omitting (xᵢ, tᵢ) can be picked out. Hence 0 ≤ tᵢ < 1 (which also implies xᵢ ≠ xⱼ for i ≠ j), and ∀Aⱼ ⊆ {(x₁, t₁), …, (xₙ, tₙ)} ∃Cⱼ ∈ C s.t.

Aⱼ = {(xᵢ, tᵢ) : tᵢ < 1_{Cⱼ}(xᵢ)} = [(Cⱼ × (−∞, 1)) ∪ (Cⱼᶜ × (−∞, 0))] ∩ {(x₁, t₁), …, (xₙ, tₙ)}.

It gives that

{xᵢ : (xᵢ, tᵢ) ∈ Aⱼ} = Cⱼ ∩ {x₁, …, xₙ}.

Thus C shatters {x₁, …, xₙ}, which gives V(F) ≤ V(C). Therefore, we get V(C) = V(F).

Figure 4.3: (a) Proof of V(C) ≤ V(F): F shatters the points (xᵢ, 1/2). (b) Proof of V(F) ≤ V(C): if F shatters {(xᵢ, tᵢ)}ⁿᵢ₌₁, then all tᵢ ∈ [0, 1).

Example 4.2.10. Let F be the set of all monotone functions f : R → [0, 1]. Then F is not a VC class, but it is contained in a VC-hull class.

Proof. F is not a VC class when V(C) = ∞ for the subgraph class C, i.e., when for any n there exists a set {(xᵢ, yᵢ)}ⁿᵢ₌₁ that can be shattered by the subgraphs of F.

Figure 4.4: F is not a VC class.

Taking x₁ < x₂ < · · · < xₙ and 0 < y₁ < y₂ < · · · < yₙ < 1, we can find a monotone function which picks out each subset of {(x₁, y₁), …, (xₙ, yₙ)}. Thus F is not VC.

However, F is contained in a VC-hull class. To show this, we show that F ⊆ \overline{sconv}G for some VC class G.

Figure 4.5: F is contained in a VC-hull class.

Let G = {1_{[a,∞)} : a ∈ R}, a VC class (cf. Lemma 4.2.5). For f ∈ F and n ∈ N, define

xᵢ = inf{x : f(x) ≥ i/n}.

Then f(xᵢ) ≥ i/n, and on the interval (xᵢ, xᵢ₊₁),

i/n ≤ f(x) < (i + 1)/n, i.e., 0 ≤ f(x) − i/n < 1/n.

Thus for

fₙ(x) := ∑_{i=1}^{n−1} (i/n) 1_{[xᵢ, xᵢ₊₁)}(x) + 1_{[xₙ, ∞)}(x),

we get

|f(x) − fₙ(x)| ≤ 1/n, and therefore fₙ(x) → f(x) as n → ∞.


Now note that, setting xₙ₊₁ = ∞,

fₙ(x) = ∑_{i=1}^{n} (i/n) 1_{[xᵢ, xᵢ₊₁)}(x) = ∑_{i=1}^{n} ∑_{j=1}^{i} (1/n) 1_{[xᵢ, xᵢ₊₁)}(x) = ∑_{j=1}^{n} ∑_{i=j}^{n} (1/n) 1_{[xᵢ, xᵢ₊₁)}(x) = ∑_{j=1}^{n} (1/n) 1_{[xⱼ, ∞)}(x),

and therefore fₙ ∈ sconvG (for convenience, [∞, ∞) is regarded as ∅), i.e., f ∈ \overline{sconv}G.
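The approximation step above can be checked numerically; a small grid-based sketch (the helper names are ours, and the grid computation of xᵢ only approximates the infimum):

```python
import math

def step_approx(f, n, grid):
    """The f_n of the proof: x_i = inf{x : f(x) >= i/n}, approximated on a grid."""
    xs = []
    for i in range(1, n + 1):
        cand = [x for x in grid if f(x) >= i / n]
        xs.append(min(cand) if cand else math.inf)
    # f_n(x) = (number of jump points x_i with x >= x_i) / n
    return lambda x: sum(1 / n for xi in xs if x >= xi)

def f(x):  # an arbitrary monotone function into [0, 1]
    return x * x

grid = [k / 1000 for k in range(1001)]
fn = step_approx(f, 10, grid)
assert max(abs(f(x) - fn(x)) for x in grid) <= 1 / 10 + 1e-9
```

On the grid, fₙ(x) equals ⌊n·f(x)⌋/n, so the deviation bound 1/n from the proof is visible directly.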

Remark 4.2.11. Later, we will show that for any r ≥ 1 and probability measure Q,

log N_{[ ]}(ε, F, Lr(Q)) ≤ K (1/ε)

holds, where K depends only on r. This is a tighter bound than the one from Theorem 4.2.2.

Example 4.2.12 (Half-spaces). Let C = {{x ∈ R^d : ⟨x, u⟩ ≤ c} : u ∈ R^d, c ∈ R} be the class of half-spaces. Then V(C) = d + 2.

Proof. It is easy to show that C shatters {0, e₁, …, e_d}, where eᵢ is the i-th standard unit vector. It gives V(C) ≥ d + 2.

Figure 4.6: For any subset A ⊆ {0, e₁, …, e_d} we can find a hyperplane separating A and {0, e₁, …, e_d} \ A.

On the other hand, by Radon's theorem, any d + 2 points in R^d can be partitioned into two sets whose convex hulls have non-empty intersection.

Figure 4.7: Radon's theorem.

These two subsets cannot be separated by half-spaces: if they could, then since each half-space is convex, their convex hulls would be disjoint, contradicting Radon's theorem. Thus no d + 2 points can be shattered, which gives V(C) ≤ d + 2.
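Radon's theorem itself admits a short linear-algebra proof; for completeness, a sketch of the standard argument (not part of the lecture):

```latex
% Sketch: Radon's theorem via linear algebra.
Given $x_1,\dots,x_{d+2}\in\mathbb{R}^d$, the system
\[
  \sum_{i=1}^{d+2}\lambda_i = 0, \qquad \sum_{i=1}^{d+2}\lambda_i x_i = 0
\]
consists of $d+1$ linear equations in $d+2$ unknowns, hence admits a nonzero
solution $\lambda$. Put $I=\{i:\lambda_i>0\}$ (nonempty, since a nonzero
$\lambda$ summing to zero must have a positive entry), $J=\{i:\lambda_i\le 0\}$,
and $s=\sum_{i\in I}\lambda_i>0$. Then
\[
  \sum_{i\in I}\frac{\lambda_i}{s}\,x_i \;=\; \sum_{j\in J}\frac{-\lambda_j}{s}\,x_j,
\]
and both sides are convex combinations (nonnegative coefficients summing to $1$),
so the convex hulls of $\{x_i : i\in I\}$ and $\{x_j : j\in J\}$ intersect.
```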


4.3 Bracketing Numbers

In this section, we give several examples of function classes satisfying a uniform bracketing entropy condition.

4.3.1 Monotone Functions

Theorem 4.3.1 (Monotone Functions). Let F be the set of all monotone functions from R to [0, 1]. Then (for small ε)

log N_{[ ]}(ε, F, Lr(Q)) ≤ K (1/ε) for all r ≥ 1 and probability measures Q,

where K is a constant depending only on r.

Sketch of proof. Because the full statement is too difficult to prove here, we show the milder version

log N_{[ ]}(ε, F, Lr(Q)) ≤ K (1/ε) log(1/ε).

Let ε, r, and Q be given, and just for convenience assume that Q is continuous. Partition R into

−∞ = t₀ < t₁ < · · · < t_{N^{r+1}} = ∞ s.t. Q(tᵢ, tᵢ₊₁] = ε^{r+1},

where N = ⌊1/ε⌋ + 1. Also let

S = {all monotone step functions jumping only at the tⱼ's, with jump sizes k/N (≈ kε), k = 1, 2, …, N, and values among 0, 1/N, …, 1}.

Figure 4.8: One element of S.

Note that

|S| = (# of ways of choosing the "jump locations" among the N^{r+1} + 1 points tⱼ, i.e., of placing N jumps of size 1/N with repetition allowed) = (N^{r+1} + N choose N),

and so the cardinality of the collection {[l, u] : l, u ∈ S} of brackets is smaller than |S × S| = |S|² = (N^{r+1} + N choose N)², with

log |S × S| ≲ log (N^{r+1} + N choose N) ≤ N log(N^{r+1} + N) ≲ ((r + 1)/ε) log(1/ε),

because N ≥ 1/ε and N^r ≥ N hold. For given f ∈ F, let l, u ∈ S be s.t.

l ≤ f ≤ u and ‖u − l‖_{Q,r} is minimized.

Figure 4.9: f ∈ F and a bracket [l, u] on a specific interval (tᵢ, tᵢ₊₁).

Let

N₁ = (# of j's s.t. f(tⱼ₊₁) − f(tⱼ) ≤ 1/N)   ("jump at most ε"),
N₂ = (# of j's s.t. 1/N < f(tⱼ₊₁) − f(tⱼ) ≤ 2/N)   ("jump at most 2ε"),

and so on. Then clearly we get

∑ⱼ Nⱼ = N^{r+1}   ("total number of intervals")  and  ∑ⱼ Nⱼ · j/N ≤ 1   ("total jump size").

Then we have

Q(u − l)^r = ∫ (u − l)^r dQ
 = ∑ⱼ ∫_{tⱼ}^{tⱼ₊₁} (u − l)^r dQ   (∗)
 ≤ ∑ⱼ ((j + 2)/N)^r Nⱼ ε^{r+1}   (∵ Q(tⱼ, tⱼ₊₁] = ε^{r+1})   (∗∗)
 ≤ 3^r ε^{r+1} ∑ⱼ (j/N)^r Nⱼ   (using j + 2 ≤ 3j)
 ≤ 3^r ε^{r+1} ∑ⱼ (j/N) Nⱼ   (∵ (j/N)^r ≤ j/N, since j/N ≤ 1)
 ≤ 3^r ε^{r+1} ≤ ε^r

for small ε > 0, precisely for ε ≤ 3^{−r}. (Note that the index j in (∗) and (∗∗) has different meanings: in (∗) it indexes the tⱼ, while in (∗∗) it indexes the jump-size classes Nⱼ.) For the inequality (∗∗), which bounds u − l by (j + 2)/N on a class-j interval, see Figure 4.9. Thus we get

‖u − l‖_{Q,r} ≤ ε

for small ε. It gives

log N_{[ ]}(ε, F, Lr(Q)) ≤ log |S × S|,

which is smaller than ε⁻¹ log ε⁻¹ up to a constant multiple, the constant depending only on r.
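The counting bound log|S| ≤ N log(N^{r+1} + N) used above can be sanity-checked numerically (our own loop; requires Python 3.8+ for math.comb):

```python
from math import comb, log

# |S| = C(N^{r+1} + N, N);  check  log|S| <= N log(N^{r+1} + N)
r = 2
for inv_eps in (10, 20, 50):
    N = inv_eps + 1                 # N = floor(1/eps) + 1
    M = N ** (r + 1) + N
    assert log(comb(M, N)) <= N * log(M)
```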

4.3.2 Smooth functions and sets

Let X be a bounded convex subset of Rd with nonempty interior.

Definition 4.3.2 (α-Hölder continuity). For 0 < α ≤ 1, a function f : X → R is called α-Hölder continuous if

sup_{x ≠ y} |f(x) − f(y)| / ‖x − y‖^α < ∞.

In particular, if α = 1, then f is Lipschitz continuous. Also note that larger α represents greater smoothness. It is then natural to try to extend Definition 4.3.2 to α > 1.

Remark 4.3.3. However, such an extension is not straightforward. Suppose that for some α > 1,

sup_{x ≠ y} |f(x) − f(y)| / ‖x − y‖^α < ∞

holds. Then

|f(x + th) − f(x)| / t ≲ t^α / t → 0 as t → 0

for any unit vector h ∈ R^d. Thus every directional derivative of f is 0, i.e., f is a constant function. In summary, α-Hölder continuity for α > 1 is satisfied only in trivial cases.

Hence we extend the concept in another way: we consider α-Hölder continuity of derivatives.

Definition 4.3.4. From now on, let α̲ denote the largest integer strictly smaller than α; for example, α̲ = 0 for α = 1. Also, for a vector k = (k₁, …, k_d) of d nonnegative integers, define

D^k = ∂^{k·} / (∂x₁^{k₁} · · · ∂x_d^{k_d}), where k· = ∑_{j=1}^{d} kⱼ.

Definition 4.3.5 (α-Hölder norm). Let α > 0, and let f be a function with uniformly bounded partial derivatives up to order α̲ whose k-th derivatives D^k f, k· = α̲, are (α − α̲)-Hölder continuous. Then the α-Hölder norm of f is defined as

‖f‖_α := max_{k : k· ≤ α̲} ‖D^k f‖_∞ + max_{k : k· = α̲} sup_{x ≠ y} |D^k f(x) − D^k f(y)| / ‖x − y‖^{α−α̲}.

Also, let C^α_M(X) be the set of all continuous functions f : X → R with ‖f‖_α ≤ M.

The following two results (Theorem 4.3.6 and Corollary 4.3.7) bound the covering and bracketing entropy of C^α_1(X), the case M = 1. Note that for the bound to be finite, the domain X should be a bounded subset (of R^d).

Theorem 4.3.6.

log N(ε, C^α_1(X), ‖·‖_∞) ≤ K λ(X¹) (1/ε)^{d/α},

where K is a constant depending only on α and d, λ is the Lebesgue measure, and X¹ = {x ∈ R^d : ‖x − X‖ < 1}.

Proof. We will prove the assertion for d = 1 (the proof for general d is similar). Since the functions in C^α_1(X) are continuous on X, WLOG we may assume that X is open, so that Taylor's theorem can be applied everywhere on X. Let ε ∈ (0, 1] be given and δ = ε^{1/α}. Let x₁ < x₂ < · · · < x_m be a δ-net for X with m ≲ λ(X¹)/δ. For a nonnegative integer k ≤ α̲ and f ∈ C^α_1(X), define

A_k f := (⌊f^{(k)}(x₁)/δ^{α−k}⌋, …, ⌊f^{(k)}(x_m)/δ^{α−k}⌋).

Then the vector δ^{α−k} A_k f consists of the values f^{(k)}(xᵢ) discretized on a grid of mesh width δ^{α−k}.

Claim.) For f, g ∈ C^α_1(X): if A_k f = A_k g for all k ≤ α̲, then ‖f − g‖_∞ ≲ ε.

Proof of Claim. By Taylor's theorem, for x ∈ X and a net point xᵢ with |x − xᵢ| ≤ δ,

(f − g)(x) = ∑_{k=0}^{α̲−1} ((f^{(k)} − g^{(k)})(xᵢ)/k!)(x − xᵢ)^k + (1/α̲!)(f^{(α̲)} − g^{(α̲)})(x̃)(x − xᵢ)^{α̲}

for some x̃ lying between x and xᵢ. Then

(f − g)(x) = ∑_{k=0}^{α̲} ((f^{(k)} − g^{(k)})(xᵢ)/k!)(x − xᵢ)^k + (1/α̲!)((f^{(α̲)} − g^{(α̲)})(x̃) − (f^{(α̲)} − g^{(α̲)})(xᵢ))(x − xᵢ)^{α̲}.

Note that ‖f‖_α ≤ 1 for f ∈ C^α_1(X), which gives

sup_{x ≠ y} |f^{(α̲)}(x) − f^{(α̲)}(y)| / |x − y|^{α−α̲} ≤ 1,

and hence we get

|f^{(α̲)}(x̃) − f^{(α̲)}(xᵢ)| ≤ |x̃ − xᵢ|^{α−α̲} ≤ δ^{α−α̲}.

Similarly,

|g^{(α̲)}(x̃) − g^{(α̲)}(xᵢ)| ≤ δ^{α−α̲}

holds for g ∈ C^α_1(X). Since A_k f = A_k g, we get

⌊f^{(k)}(xᵢ)/δ^{α−k}⌋ = ⌊g^{(k)}(xᵢ)/δ^{α−k}⌋ for all xᵢ,

which gives |(f^{(k)} − g^{(k)})(xᵢ)| ≤ δ^{α−k} for all k ≤ α̲ and all i. Thus we get

|(f − g)(x)| ≤ ∑_{k=0}^{α̲} (δ^{α−k}/k!) |x − xᵢ|^k + (1/α̲!) · 2δ^{α−α̲} |x − xᵢ|^{α̲}
 ≤ ∑_{k=0}^{α̲} δ^α/k! + (2/α̲!) δ^α   (using |x − xᵢ|^k ≤ δ^k)
 ≲ δ^α (indeed, ≤ (2 + e) δ^α)
 = ε up to the constant (2 + e).   (Claim)

By the claim, there exists a constant C depending only on α s.t.

N(Cε, C^α_1(X), ‖·‖_∞) ≤ |{Af : f ∈ C^α_1(X)}|,

where Af denotes the matrix with rows A₀f, A₁f, …, A_{α̲}f (so the i-th column collects the discretized derivative values at xᵢ).

Note that the number of possible values of ⌊f^{(k)}(xᵢ)/δ^{α−k}⌋ is smaller than 2/δ^{α−k} + 1 (∵ ‖f‖_α ≤ 1 yields |f^{(k)}(xᵢ)| ≤ 1 by definition), which does not exceed 2δ^{−α} + 1. Thus each column of Af can take at most (2δ^{−α} + 1)^{α̲+1} different values. It gives

|{Af : f ∈ C^α_1(X)}| ≤ (2δ^{−α} + 1)^{(α̲+1)m},

i.e.,

log |{Af : f ∈ C^α_1(X)}| ≤ m(α̲ + 1) log(2δ^{−α} + 1) ≲ (λ(X¹)/δ)(α̲ + 1) log δ^{−α} ≲ λ(X¹) (1/ε)^{1/α} log(1/ε).

Now our goal is to remove the log(1/ε) term. This uses the smoothness of the functions: f cannot blow up or blow down over a short interval.

f (k)(xi+1) =

α−k∑k=0

f (k+l)(xi)(xi+1 − xi)l

l!+R

with

|R| . (xi+1 − xi)α−k ≤ δα−k.

Then we get (Motivation: f (k+l)(xi) ≈

⌊f (k+l)(xi)

δα−(k+l)

⌋δα−(k+l))

∣∣∣∣∣∣f (k)(xi+1)−α−k∑k=0

⌊f (k+l)(xi)

δα−(k+l)

⌋δα−(k+l) (xi+1 − xi)l

l!

∣∣∣∣∣∣≤

∣∣∣∣∣∣f (k)(xi+1)−α−k∑l=0

f (k+l)(xi)(xi+1 − xi)l

l!

∣∣∣∣∣∣︸ ︷︷ ︸=|R|

+

∣∣∣∣∣∣α−k∑l=0

(f (k+l)(xi)−

⌊f (k+l)(xi)

δα−(k+l)

⌋δα−(k+l)

)(xi+1 − xi)l

l!

∣∣∣∣∣∣. δα−k +

α−k∑l=0

δα−(k+l)

∣∣∣∣∣f (k+l)(xi)

δα−(k+l)−

⌊f (k+l)(xi)

δα−(k+l)

⌋∣∣∣∣∣︸ ︷︷ ︸≤1

(xi+1 − xi)l

l!

≤ δα−k +

α−k∑l=0

δα−(k+l) δl

l!︸ ︷︷ ︸=δα−k/l!

. δα−k.

This implies that, given the $i$th column of $Af$, the number of possible values of the $(i+1)$th column is bounded by a constant $K$ depending only on $\alpha$. Thus we get

$$N(C\varepsilon,\, C_1^\alpha(\mathcal{X}),\, \|\cdot\|_\infty) \le |\{Af : f \in C_1^\alpha(\mathcal{X})\}| \lesssim \underbrace{(2\delta^{-\alpha}+1)^{\underline\alpha+1}}_{\text{first column}}\ \underbrace{K^{m-1}}_{\text{remaining columns}},$$


i.e.,

$$\log N(C\varepsilon,\, C_1^\alpha(\mathcal{X}),\, \|\cdot\|_\infty) \lesssim \log\frac{1}{\varepsilon} + m \lesssim \log\frac{1}{\varepsilon} + \left(\frac{1}{\varepsilon}\right)^{1/\alpha}\lambda(\mathcal{X}^1) \lesssim \left(\frac{1}{\varepsilon}\right)^{1/\alpha}\lambda(\mathcal{X}^1).$$

Corollary 4.3.7.

$$\log N_{[\,]}(\varepsilon,\, C_1^\alpha(\mathcal{X}),\, L_r(Q)) \lesssim \lambda(\mathcal{X}^1)\left(\frac{1}{\varepsilon}\right)^{d/\alpha}.$$

Proof. The basic idea is that, with respect to the $\|\cdot\|_\infty$ norm, bracketing entropy is equivalent to (covering) entropy. Let $f_1, \cdots, f_p$ be the centers of $\|\cdot\|_\infty$-balls of radius $\varepsilon$ that cover $C_1^\alpha(\mathcal{X})$. Then the brackets $[f_i - \varepsilon, f_i + \varepsilon]$ cover $C_1^\alpha(\mathcal{X})$, and each bracket has $L_r(Q)$-size at most $2\varepsilon$. Thus we have

$$\log N_{[\,]}(\varepsilon,\, C_1^\alpha(\mathcal{X}),\, L_r(Q)) \le \log N\!\left(\frac{\varepsilon}{2},\, C_1^\alpha(\mathcal{X}),\, \|\cdot\|_\infty\right) \lesssim \lambda(\mathcal{X}^1)\left(\frac{2}{\varepsilon}\right)^{d/\alpha} \lesssim \lambda(\mathcal{X}^1)\left(\frac{1}{\varepsilon}\right)^{d/\alpha}.$$

The corollary implies that $C_1^\alpha[0,1]^d$ is universally Donsker for $\alpha > d/2$, by the bracketing Donsker theorem. For instance, on the unit interval in the line ($d = 1$), uniform boundedness and Hölder continuity of order $> 1/2$ suffice; on the unit square ($d = 2$) it suffices that the partial derivatives exist and satisfy a Lipschitz condition.
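The role of the exponent $\alpha > d/2$ can be seen numerically: the bracketing Donsker theorem requires the entropy integral $\int_0^1 \sqrt{\log N_{[\,]}(\varepsilon)}\,d\varepsilon$ to be finite, and with the bound above the integrand is of order $\varepsilon^{-d/(2\alpha)}$, integrable near $0$ exactly when $\alpha > d/2$. A crude Riemann-sum illustration (the cutoff and grid below are my own choices, not from the text):

```python
# Midpoint Riemann sum of eps^(-d/(2*alpha)) over [cutoff, 1]: this mimics the
# Donsker entropy integral built from  log N_[] ~ (1/eps)^(d/alpha).
def entropy_integral(alpha, d, cutoff, n=100_000):
    """Approximate the integral of eps^(-d/(2*alpha)) on [cutoff, 1]."""
    h = (1.0 - cutoff) / n
    return sum((cutoff + (i + 0.5) * h) ** (-d / (2 * alpha)) * h for i in range(n))

d = 1
convergent = entropy_integral(1.0, d, cutoff=1e-6)   # alpha = 1 > d/2
divergent = entropy_integral(0.25, d, cutoff=1e-6)   # alpha = 1/4 < d/2
print(convergent)  # stays near 2 as the cutoff shrinks: integral converges
print(divergent)   # grows without bound as the cutoff shrinks: not Donsker-usable
```

For $\alpha = 1$, $d = 1$ the exact value is $\int_0^1 \varepsilon^{-1/2}\,d\varepsilon = 2$; for $\alpha = 1/4$ the integral diverges like $1/\text{cutoff}$.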

The previous results are restricted to bounded subsets of Euclidean space (the bound contains the term $\lambda(\mathcal{X}^1)$). Under appropriate conditions on the tails of the underlying distribution, they can be extended to classes of functions on the whole of Euclidean space.

Corollary 4.3.8. Let $\mathbb{R}^d = \bigcup_{j=1}^\infty I_j$ be a partition of $\mathbb{R}^d$ into bounded convex sets $I_j$ with nonempty interior, and let $\mathcal{F}$ be a class of functions $f : \mathbb{R}^d \to \mathbb{R}$ s.t. the restrictions $f|_{I_j}$ belong to $C_{M_j}^\alpha(I_j)$ for every $j$. Then $\exists K$ depending only on $\alpha$, $V$, $r$ and $d$ such that

$$\log N_{[\,]}(\varepsilon,\, \mathcal{F},\, L_r(Q)) \le K\left(\frac{1}{\varepsilon}\right)^V \left( \sum_{j=1}^\infty \lambda(I_j^1)^{\frac{r}{V+r}}\, M_j^{\frac{Vr}{V+r}}\, Q(I_j)^{\frac{r}{V+r}} \right)^{\frac{V+r}{r}}$$

for every $\varepsilon > 0$, $V \ge \dfrac{d}{\alpha}$ and probability measure $Q$.

For the collection of subgraphs to be Donsker, a smoothness condition on the underlying measure is needed, in addition to sufficient smoothness of the graphs. The following result implies that the subgraphs (contained in $\mathbb{R}^{d+1}$) of the functions in $C_1^\alpha[0,1]^d$ are $P$-Donsker for any Lebesgue-dominated measure


$P$ with bounded density, provided $\alpha > d$. For instance, for the sets cut out in the plane by functions $f : [0,1] \to [0,1]$, a uniform Lipschitz condition of any order on the derivatives suffices.

Corollary 4.3.9. Let $\mathcal{C}_{\alpha,d}$ be the collection of subgraphs of $C_1^\alpha[0,1]^d$. Then there exists a constant $K$ depending only on $\alpha$ and $d$ s.t.

$$\log N_{[\,]}(\varepsilon,\, \mathcal{C}_{\alpha,d},\, L_r(Q)) \le K\|q\|_\infty^{d/\alpha}\left(\frac{1}{\varepsilon}\right)^{dr/\alpha}$$

for any $r \ge 1$, $\varepsilon > 0$, and probability measure $Q$ with bounded Lebesgue density $q$ on $\mathbb{R}^{d+1}$. Here, "covering" $\mathcal{C}_{\alpha,d}$ is understood in the sense that the brackets $[C_i, D_i]$ (for subgraphs $C_i$, $D_i$) cover $\mathcal{C}_{\alpha,d}$ when their indicator functions bracket the indicator functions of the sets (= subgraphs) in $\mathcal{C}_{\alpha,d}$.

4.3.3 Closed convex sets and Convex functions

Theorem 4.3.10. Let $A$ be a bounded subset of $\mathbb{R}^d$, $d \ge 2$. Define

$$\mathcal{C} = \{\text{all compact convex subsets of } A\}$$

and let $Q$ be a Lebesgue absolutely continuous probability measure. Then $\exists$ a constant $K$ depending only on $A$, $Q$, and $d$ s.t.

$$\log N_{[\,]}(\varepsilon,\, \mathcal{C},\, L_r(Q)) \le K\left(\frac{1}{\varepsilon}\right)^{(d-1)r/2}.$$

Remark 4.3.11. Note that $\mathcal{C}$ is not a VC class: for any $n$, there exist $n$ points (e.g. $n$ points on a circle) that can be shattered.

Figure 4.10: For such $n$ points, any subset can be picked out by a compact convex set.
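The remark can be checked computationally, using the standard configuration of $n$ points on a circle (my choice of example): the convex hull of any subset $S$ contains no circle point outside $S$, so every subset is picked out by a compact convex set and $\mathcal{C}$ shatters arbitrarily large finite sets.

```python
import itertools
import math

def convex_hull_contains(poly, q, tol=1e-9):
    """Is q in the convex hull of `poly`?  `poly` is assumed to be listed in
    counterclockwise angular order, which holds for circle points sorted by angle."""
    m = len(poly)
    if m == 1:
        return math.dist(poly[0], q) < tol
    if m == 2:  # hull is a segment; check collinearity, then the parameter range
        (x0, y0), (x1, y1) = poly
        if abs((x1 - x0) * (q[1] - y0) - (y1 - y0) * (q[0] - x0)) > tol:
            return False
        dot = (q[0] - x0) * (x1 - x0) + (q[1] - y0) * (y1 - y0)
        return -tol <= dot <= (x1 - x0) ** 2 + (y1 - y0) ** 2 + tol
    for i in range(m):  # convex CCW polygon: inside iff left of every edge
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % m]
        if (x1 - x0) * (q[1] - y0) - (y1 - y0) * (q[0] - x0) < -tol:
            return False
    return True

n = 8
pts = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)) for k in range(n)]

# Every proper subset S is "picked out" by conv(S): no excluded point falls inside.
shattered = all(
    not convex_hull_contains([pts[i] for i in S], pts[j])
    for r in range(1, n)
    for S in itertools.combinations(range(n), r)
    for j in range(n) if j not in S
)
print(shattered)  # True: the n circle points are shattered by convex sets
```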

The following theorem is a function-class version of Theorem 4.3.10.

Theorem 4.3.12. Let $A$ be a compact convex subset of $\mathbb{R}^d$ ($d \ge 2$), and

$$\mathcal{F} = \{\text{all convex functions } f : A \to [0,1] \text{ s.t. } |f(x)-f(y)| \le L\|x-y\| \ \forall x, y \in A\}.$$


Then $\exists K$ depending only on $d$ and $A$ s.t.

$$\log N(\varepsilon,\, \mathcal{F},\, \|\cdot\|_\infty) \le K(1+L)^{d/2}\left(\frac{1}{\varepsilon}\right)^{d/2}.$$

Theorem 4.3.13. Let $(T, d)$ be a semimetric space and $\mathcal{F} = \{f_t : \mathcal{X} \to \mathbb{R} \mid t \in T\}$ s.t.

$$|f_s(x) - f_t(x)| \le F(x)\, d(s,t) \quad \forall s, t \in T,\ x \in \mathcal{X}$$

for some fixed function $F$ (i.e., $\mathcal{F}$ is a class of functions $x \mapsto f_t(x)$ that are Lipschitz in the index parameter $t \in T$). Then

$$N_{[\,]}(2\varepsilon\|F\|,\, \mathcal{F},\, \|\cdot\|) \le N(\varepsilon, T, d)$$

for any norm $\|\cdot\|$.

Proof. Let $t_1, \cdots, t_p$ be an $\varepsilon$-net for $(T, d)$. Then the brackets $[f_{t_i} - \varepsilon F,\, f_{t_i} + \varepsilon F]$ cover $\mathcal{F}$ ($\because$ $\forall s \in T$ $\exists t_i$ s.t. $d(t_i, s) < \varepsilon$, whence $|f_{t_i}(x) - f_s(x)| \le F(x)\, d(t_i, s) < \varepsilon F(x)$, which gives $f_{t_i} - \varepsilon F \le f_s \le f_{t_i} + \varepsilon F$), and each bracket has size $2\varepsilon\|F\|$.
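The proof is constructive and easy to check on a concrete Lipschitz-in-parameter class. The class below ($f_t(x) = \sin(tx)$ on $\mathcal{X} = [0,2]$ with $T = [0,1]$, envelope factor $F(x) = |x|$, since $|\sin(sx) - \sin(tx)| \le |x|\,|s-t|$) and the grid sizes are my own illustration, not from the text.

```python
import math

# Sketch of Theorem 4.3.13: an eps-net on the index set T yields brackets
# [f_{t_i} - eps*F, f_{t_i} + eps*F] that capture every f_s in the class.
eps = 0.05
xs = [2 * j / 400 for j in range(401)]              # fine grid on X = [0, 2]
net = [eps * (i + 0.5) for i in range(round(1 / eps))]  # eps-net for T = [0, 1]

def in_bracket(s):
    """Check that f_s lies in the bracket built at the nearest net point t_i."""
    ti = min(net, key=lambda t: abs(t - s))
    return all(abs(math.sin(s * x) - math.sin(ti * x)) <= eps * abs(x) + 1e-12
               for x in xs)

ok = all(in_bracket(s) for s in (k / 200 for k in range(201)))
print(ok)  # True: N(eps, T, d) brackets of size 2*eps*||F|| suffice
```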

4.4 Further topic: Tail bounds

In this section, we derive moment and tail bounds for the supremum $\|\mathbb{G}_n\|_{\mathcal{F}}$ of the empirical process. Thus we consider non-asymptotic, finite-sample behavior, in the spirit of concentration inequalities. Note that if $\mathcal{F}$ is $P$-Donsker, then $\|\mathbb{G}_n\|_{\mathcal{F}} = O_P(1)$. Also note that

$$\mathbb{G}_n f \xrightarrow[n\to\infty]{\ d\ } N\!\left(0,\, P(f - Pf)^2\right).$$

Let $F$ be an envelope of $\mathcal{F}$, and define

$$J(\delta, \mathcal{F}) = \sup_Q \int_0^\delta \sqrt{1 + \log N(\varepsilon\|F\|_{Q,2},\, \mathcal{F},\, L_2(Q))}\; d\varepsilon,$$

$$J_{[\,]}(\delta, \mathcal{F}) = \int_0^\delta \sqrt{1 + \log N_{[\,]}(\varepsilon\|F\|_{P,2},\, \mathcal{F},\, L_2(P))}\; d\varepsilon,$$

where the supremum is taken over all discrete probability measures $Q$ with $\|F\|_{Q,2} > 0$.

Theorem 4.4.1. Let $\mathcal{F}$ be a $P$-measurable class of measurable functions with measurable envelope $F$. Then

$$\left\| \|\mathbb{G}_n\|_{\mathcal{F}}^* \right\|_{P,p} \le K\, J(1, \mathcal{F})\, \|F\|_{P,\, 2 \vee p} \quad \forall p \ge 1,$$

where $K$ depends only on $p$.


Theorem 4.4.2. Let $\mathcal{F}$ be a class of measurable functions with measurable envelope function $F$. Then

$$\left\| \|\mathbb{G}_n\|_{\mathcal{F}}^* \right\|_{P,1} \lesssim J_{[\,]}(1, \mathcal{F})\, \|F\|_{P,2}.$$

Theorem 4.4.3. Let $\mathcal{F}$ be a class of measurable functions with measurable envelope function $F$. Then

$$\left\| \|\mathbb{G}_n\|_{\mathcal{F}}^* \right\|_{P,p} \lesssim \left\| \|\mathbb{G}_n\|_{\mathcal{F}}^* \right\|_{P,1} + n^{-1/2+1/p}\|F\|_{P,p} \quad (p \ge 2),$$

$$\left\| \|\mathbb{G}_n\|_{\mathcal{F}}^* \right\|_{P,\psi_p} \lesssim \left\| \|\mathbb{G}_n\|_{\mathcal{F}}^* \right\|_{P,1} + n^{-1/2}(1 + \log n)^{1/p}\|F\|_{P,\psi_p} \quad (0 < p \le 1),$$

$$\left\| \|\mathbb{G}_n\|_{\mathcal{F}}^* \right\|_{P,\psi_p} \lesssim \left\| \|\mathbb{G}_n\|_{\mathcal{F}}^* \right\|_{P,1} + n^{-1/2+1/q}\|F\|_{P,\psi_p} \quad (1 < p \le 2),$$

where $q$ is the Hölder conjugate of $p$, and the constants in the inequalities $\lesssim$ depend only on the type of norm involved in the statement.
