Unrealizable (Agnostic) Learning
• Given a training sample $(x_1, c(x_1)), \dots, (x_m, c(x_m))$ and a concept class $C$.
• No hypothesis in the concept class $C$ is consistent with the entire training set ($c \notin C$).
• Relaxed goal: let $c$ be the correct concept. Find $c' \in C$ such that
$$\Pr_D\bigl(c'(x) \neq c(x)\bigr) \le \inf_{h \in C} \Pr_D\bigl(h(x) \neq c(x)\bigr) + \epsilon.$$
• If we estimate each $\widehat{\Pr}(h(x) \neq c(x)) = \frac{|\{i : h(x_i) \neq c(x_i)\}|}{m}$ such that
$$\Bigl|\widehat{\Pr}(h(x) \neq c(x)) - \Pr_{x \sim D}(h(x) \neq c(x))\Bigr| \le \frac{\epsilon}{2},$$
then $\hat{h} = \arg\min_{h \in C} \widehat{\Pr}(h(x) \neq c(x))$ is $\epsilon$-optimal.
• The correct concept $c$ may not even exist; in that case $c(x)$ is a random function.
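As a concrete illustration (a minimal sketch, not from the slides: the threshold class, the noise model, and all names are assumptions), the plug-in estimate $\widehat{\Pr}$ and the resulting ERM rule look like this:

```python
import numpy as np

# Minimal sketch (illustrative setup): plug-in empirical error and ERM over a
# finite concept class of threshold classifiers h_t(x) = 1{x > t}.
rng = np.random.default_rng(0)

m = 2000
x = rng.uniform(-1, 1, size=m)
# Noisy labels: no threshold is consistent with the whole sample (agnostic case).
y = (x + rng.normal(scale=0.3, size=m) > 0).astype(int)

thresholds = np.linspace(-1, 1, 41)          # finite concept class C

def empirical_error(t):
    """Plug-in estimate of Pr(h_t(x) != c(x)) on the training sample."""
    return np.mean((x > t).astype(int) != y)

errors = [empirical_error(t) for t in thresholds]
best_t = thresholds[int(np.argmin(errors))]  # the (epsilon-)ERM hypothesis
print(f"ERM threshold: {best_t:.2f}, empirical error: {min(errors):.3f}")
```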
Illustrating Agnostic Learning
We want a classifier to distinguish between cats and dogs
       Image 1   Image 2   Image 3     Image 4
x      (image)   (image)   (image)     (image)
c(x)   Cat       Dog       Ostrich?    Sensor Error?
Uniform Convergence for Binary Classification
Definition
A range space $(X, R)$ (domain $X$, set of subsets $R$) has the uniform convergence property if for every $\epsilon, \delta > 0$ there is a sample size $m = m(\epsilon, \delta)$ such that for every distribution $D$ over $X$, if $S \sim D^m$, then, with probability at least $1 - \delta$, $S$ is an $\epsilon$-sample for $R$ with respect to $D$.
• Holds uniformly over:
  • All distributions $D$
  • All range space elements $r \in R$
• Convergence of confidence intervals to points (w.h.p. as $m \to \infty$)
Theorem
The following three conditions are equivalent:
1 A concept class $C$ over a domain $X$ is agnostic PAC learnable.
2 The range space $(X, C)$ has the uniform convergence property.
3 The range space $(X, C)$ has a finite VC dimension.
Uniform Convergence in General
Definition
A set of functions $F$ has the uniform convergence property with respect to a domain $Z$ if there is a function $m_F(\epsilon, \delta)$ such that for any $\epsilon, \delta > 0$, $m_F(\epsilon, \delta) < \infty$, and for any distribution $D$ on $Z$, a sample $z_1, \dots, z_m$ of size $m = m_F(\epsilon, \delta)$ satisfies
$$\Pr\left(\sup_{f \in F} \left|\frac{1}{m}\sum_{i=1}^m f(z_i) - \mathbb{E}_D[f]\right| \le \epsilon\right) \ge 1 - \delta.$$
The general supervised learning scheme: $Z = X \times Y$ (labeled pairs):
• Have a loss function $\ell : Y \times Y \to \mathbb{R}$ that measures prediction error
• Classification: 0-1 error: $\ell(y, \hat{y}) = \begin{cases} 0 & \hat{y} = y \\ 1 & \hat{y} \neq y \end{cases}$
• Regression: square error: $\ell(y, \hat{y}) = (y - \hat{y})^2$
• Let $(\ell \circ h)(z_i) = \ell(y_i, h(x_i))$ on $z_i = (x_i, y_i)$
• Let $F = \ell \circ H = \{\ell \circ h \mid h \in H\}$
• Uniform convergence $\Rightarrow$ the training loss of every $h \in H$ approximates its true loss
• $\epsilon$-ERM (Empirical Risk Minimization): find $\hat{h} \in H$ such that $\hat{h}$ is $\epsilon$-optimal (see the sketch below)
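A minimal sketch of this composition (the linear hypothesis class, the grid of slopes, and the data are illustrative assumptions): build $F = \ell \circ H$ for the square loss and run ERM on the empirical risk.

```python
import numpy as np

# Minimal sketch: compose a hypothesis class H with a loss to get F = loss∘H,
# then run ERM on the empirical risk. The class of slopes is illustrative.
rng = np.random.default_rng(1)

m = 1000
x = rng.uniform(-1, 1, size=m)
y = 0.5 * x + rng.normal(scale=0.1, size=m)     # regression targets

def sq_loss(y_true, y_pred):
    """Square-error loss l(y, y_hat) = (y - y_hat)^2."""
    return (y_true - y_pred) ** 2

slopes = np.linspace(-2, 2, 81)                  # H: h_a(x) = a*x

# Empirical risk of each h in H, i.e. the sample mean of (loss ∘ h).
risks = [np.mean(sq_loss(y, a * x)) for a in slopes]
a_hat = slopes[int(np.argmin(risks))]
print(f"ERM slope: {a_hat:.2f}, empirical risk: {min(risks):.4f}")
```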
Is Uniform Convergence Necessary?
Definition
A set of functions $F$ has the uniform convergence property with respect to a domain $Z$ if there is a function $m_F(\epsilon, \delta)$ such that for any $\epsilon, \delta > 0$, $m_F(\epsilon, \delta) < \infty$, and for any distribution $D$ on $Z$, a sample $z_1, \dots, z_m$ of size $m = m_F(\epsilon, \delta)$ satisfies
$$\Pr\left(\sup_{f \in F} \left|\frac{1}{m}\sum_{i=1}^m f(z_i) - \mathbb{E}_D[f]\right| \le \epsilon\right) \ge 1 - \delta.$$
• We don’t need uniform convergence for every distribution $D$, just for the input (training set) distribution – Rademacher average
• We don’t need a tight estimate for all functions, only for functions in a neighborhood of the optimal function – local Rademacher average
Jensen’s Inequality
Definition
A function $f : \mathbb{R}^m \to \mathbb{R}$ is said to be convex if, for any $x_1, x_2$ and $0 \le \lambda \le 1$,
$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$
Theorem (Jensen’s Inequality)
If f is a convex function, then
$$\mathbb{E}[f(X)] \ge f(\mathbb{E}[X]).$$
In particular, $\sup_{f \in F} \mathbb{E}[f] \le \mathbb{E}\bigl[\sup_{f \in F} f\bigr]$.
Rademacher Averages (Complexity)
Definition
The Empirical Rademacher Average of $F$ on a sample $S = \{z_1, \dots, z_m\}$ is
$$R_m(F, S) = \mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right],$$
where the $\sigma_i$ are i.i.d. Rademacher variables (uniform on $\pm 1$).
Definition
The Rademacher Average of $F$ over a distribution $D$ is defined as
$$R_m(F, D) = \mathbb{E}_{S \sim D^m}\bigl[R_m(F, S)\bigr] = \mathbb{E}_{S \sim D^m,\, \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right].$$
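The empirical quantity can be approximated directly by Monte Carlo over the Rademacher noise. A minimal sketch (the threshold family and the sample are illustrative assumptions):

```python
import numpy as np

# Minimal sketch: Monte Carlo estimate of the empirical Rademacher average
# R_m(F, S) for the threshold family F = {1{x > t} : t in R} on a fixed sample.
rng = np.random.default_rng(2)

m = 500
x = np.sort(rng.uniform(-1, 1, size=m))      # fixed sample S (sorted)

def empirical_rademacher(n_trials=5000):
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)
        # Since x is sorted, every f in F selects a suffix of the sample, so the
        # sup of (1/m) sum_i sigma_i f(x_i) is the largest suffix sum (or 0).
        suffix_sums = np.concatenate(([0.0], np.cumsum(sigma[::-1])))
        total += suffix_sums.max() / m
    return total / n_trials

print(f"Monte Carlo estimate of R_m(F, S): {empirical_rademacher():.4f}")
```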
A Symmetrization Inequality
Theorem
Suppose $z \sim D^m$ over domain $Z$ and $F \subseteq Z \to \mathbb{R}$. We bound the supremum deviation
$$\mathbb{E}_z\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m f(z_i) - \mathbb{E}_D[f]\right] \le 2 R_m(F, D).$$
We will later convert this expectation bound into a high-probability tail bound, showing uniform convergence.
Symmetrization Inequality: Proof
Function family: $F \subseteq Z \to \mathbb{R}$; samples: $z, z' \sim D^m$; $\sigma \sim \text{Rademacher}^m$
$$\begin{aligned}
\mathbb{E}_z\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m f(z_i) - \mathbb{E}_D[f]\right]
&= \mathbb{E}_z\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m f(z_i) - \mathbb{E}_{z'}\left[\frac{1}{m}\sum_{i=1}^m f(z'_i)\right]\right] && \text{Linearity of Expectation} \\
&\le \mathbb{E}_{z, z'}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \bigl(f(z_i) - f(z'_i)\bigr)\right] && \text{Jensen's Inequality} \\
&= \mathbb{E}_{z, z', \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i \bigl(f(z_i) - f(z'_i)\bigr)\right] && \text{Symmetry} \\
&\le \mathbb{E}_{z, \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right] + \mathbb{E}_{z', \sigma}\left[\sup_{f \in F} -\frac{1}{m}\sum_{i=1}^m \sigma_i f(z'_i)\right] && \text{Subadditivity} \\
&= \mathbb{E}_{z, \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right] + \mathbb{E}_{z', \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z'_i)\right] && \Pr(\sigma) = \Pr(-\sigma) \\
&= 2\, \mathbb{E}_{z, \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right] = 2 R_m(F, D) && z, z' \sim D^m
\end{aligned}$$
Interpreting the Rademacher Average
Empirical average: $\frac{1}{m}\sum_{i=1}^m f(z_i)$ (mean estimate)
Empirical supremum: $\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m f(z_i)$ (max estimate)
$R_m(F, z)$: $\mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right]$ (sample-noise correlation)
$R_m(F, D)$: $\mathbb{E}_{z \sim D^m}\left[\mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right]\right]$ (distribution-noise correlation)
Exponential Tail Bounds with Rademacher Averages
Theorem
Let $F \subseteq X \to [0, 1]$, let $S = \{z_1, \dots, z_m\}$ be a sample from $D$, and let $\epsilon > 0$.
1 Bounding the generalization error using the Rademacher complexity:
$$\Pr_S\left(\sup_{f \in F}\left\{\mathbb{E}_D[f] - \frac{1}{m}\sum_{i=1}^m f(z_i)\right\} \ge 2 R_m(F, D) + \epsilon\right) \le e^{-2m\epsilon^2}$$
2 Bounding the generalization error using the empirical Rademacher complexity:
$$\Pr_S\left(\sup_{f \in F}\left\{\mathbb{E}_D[f] - \frac{1}{m}\sum_{i=1}^m f(z_i)\right\} \ge 2 R_m(F, S) + 3\epsilon\right) \le 2 e^{-2m\epsilon^2}$$
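As a quick numeric illustration (a sketch, with made-up numbers), the first bound can be inverted: choosing $\epsilon = \sqrt{\ln(1/\delta)/(2m)}$ makes the failure probability at most $\delta$, so the generalization gap is at most $2R_m(F, D) + \epsilon$ with probability at least $1 - \delta$.

```python
import math

# Minimal sketch: invert the first tail bound. Setting e^{-2 m eps^2} = delta
# gives eps = sqrt(ln(1/delta) / (2m)), so with probability >= 1 - delta the
# uniform generalization gap is at most 2*R_m + eps.
def deviation_bound(rademacher_avg, m, delta):
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return 2.0 * rademacher_avg + eps

# Example numbers (purely illustrative): R_m = 0.02, m = 10000, delta = 0.05.
print(f"gap <= {deviation_bound(0.02, 10_000, 0.05):.4f} with prob. >= 0.95")
```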
Recap: McDiarmid’s Inequality
Applying the Azuma inequality to Doob’s martingale:
Theorem
Let $z_1, \dots, z_m$ be independent random variables and let $h(z_1, \dots, z_m)$ be a function such that changing $z_i$ can change the value of $h(\dots)$ by no more than $c_i$:
$$\sup_{z_1, \dots, z_m, z'_i} |h(z_1, \dots, z_i, \dots, z_m) - h(z_1, \dots, z'_i, \dots, z_m)| \le c_i.$$
Then for any $\epsilon > 0$,
$$\Pr_{z_1, \dots, z_m}\Bigl(h(z_1, \dots, z_m) - \mathbb{E}_{z_1, \dots, z_m}[h(z_1, \dots, z_m)] \ge \epsilon\Bigr) \le e^{-2\epsilon^2 / \sum_{i=1}^m c_i^2}.$$
Exponential Tail Bounds with Rademacher Averages: Proof
• Suppose $F \subseteq X \to [0, 1]$
• The generalization error (empirical supremum deviation)
$$D(z_1, \dots, z_m) = \sup_{f \in F}\left(\mathbb{E}_D[f] - \frac{1}{m}\sum_{i=1}^m f(z_i)\right)$$
measures how badly the empirical loss underestimates the true loss for any $f \in F$
• Changing $z_i$ changes $D(z_1, \dots, z_m)$ by no more than $\frac{1}{m}$
• The Empirical Rademacher Average
$$R_m(F, S) = \mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right]$$
is also a function of $z_1, \dots, z_m$, with bounded differences $\frac{1}{m}$
• McDiarmid’s inequality yields the result
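Spelling out that last step (a brief sketch consistent with the previous slides): apply McDiarmid to $D(z_1, \dots, z_m)$ with $c_i = 1/m$ and combine with the symmetrization bound $\mathbb{E}[D] \le 2R_m(F, D)$:
$$\Pr\bigl(D(z_1, \dots, z_m) \ge \mathbb{E}[D] + \epsilon\bigr) \le e^{-2\epsilon^2 / \sum_{i=1}^m (1/m)^2} = e^{-2m\epsilon^2}, \qquad \mathbb{E}[D] \le 2R_m(F, D),$$
which together give the first bound; the second follows by additionally applying McDiarmid to $R_m(F, S)$ to relate it to $R_m(F, D)$, at the cost of the extra $\epsilon$ terms and the factor of 2 from the union bound.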
Why Data Dependent Bounds?
• The Vapnik-Chervonenkis Dimension:
  • Applies only to binary classification
  • Complicated generalizations for regression or multi-class classification
  • Combinatorial bound
  • Ignores the data distribution (worst-case over all distributions)
  • Can be hard to compute
• Rademacher Averages:
  • Handle general learning problems
  • Only need a loss function
  • Classification, regression, clustering, data mining
  • Sensitive to the data distribution (distribution-dependent)
  • Approximated with the training sample (data-dependent)
  • Always at least as good a bound as the VC dimension
  • Still hard to compute
Computing Rademacher Averages: Example 1
• Take $F = \{f\}$ for any $f : X \to \mathbb{R}$. Then
$$R_m(F, x) = \mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] = \mathbb{E}_\sigma\left[\frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] = \frac{1}{m}\sum_{i=1}^m \mathbb{E}_{\sigma_i}[\sigma_i f(x_i)] = 0$$
• Is this useful?
  • No, but it's educational
  • A single function cannot overfit (no inductive bias)
  • Recover Hoeffding's inequality:
$$\Pr\left(\mathbb{E}_D[f] - \frac{1}{m}\sum_{i=1}^m f(z_i) \ge 2R_m(F, D) + \epsilon\right) = \Pr\left(\mathbb{E}_D[f] - \frac{1}{m}\sum_{i=1}^m f(z_i) \ge \epsilon\right) \le e^{-2m\epsilon^2}.$$
Computing Rademacher Averages: Example 2
• Take $F = \{f(x) = x,\ f(x) = -x\}$, with $X = \{\pm 1\}$
$$R_m(F, x) = \mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] = \mathbb{E}_\sigma\left[\left|\frac{1}{m}\sum_{i=1}^m \sigma_i x_i\right|\right] = \mathbb{E}_\sigma\left[\left|\frac{1}{m}\sum_{i=1}^m \sigma_i\right|\right]$$
because $x_i \sigma_i \sim$ Rademacher. By Jensen's inequality ($f(x) = x^2$ is convex):
$$= \sqrt{\left(\mathbb{E}_\sigma\left[\left|\frac{1}{m}\sum_{i=1}^m \sigma_i\right|\right]\right)^2} \le \sqrt{\mathbb{E}_\sigma\left[\left(\frac{1}{m}\sum_{i=1}^m \sigma_i\right)^2\right]} = \frac{1}{\sqrt{m}}.$$
Connection: distance after an $m$-step random walk; binomial standard deviation (see the quick simulation check below)
• Is this useful?
  • Yes, it can be used to construct interesting classifiers
  • Idea: pick labels for independent subspaces independently
  • Use the above analysis to bound the Rademacher averages
• Also gives lower bounds on Rademacher averages, using this result:
$$R(F_1, z) \le R(F_1 \cup F_2, z)$$
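A minimal simulation sketch of Example 2 (parameters are illustrative): estimate $\mathbb{E}\bigl|\frac{1}{m}\sum_i \sigma_i\bigr|$ by Monte Carlo and compare it with the $1/\sqrt{m}$ bound.

```python
import numpy as np

# Minimal sketch: Monte Carlo check of Example 2. For F = {x, -x} on X = {+-1},
# R_m(F, x) = E|1/m sum_i sigma_i|, which Jensen's inequality bounds by 1/sqrt(m).
rng = np.random.default_rng(3)

m, n_trials = 400, 20_000
sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))
estimate = np.abs(sigma.mean(axis=1)).mean()
print(f"E|mean sigma| ~ {estimate:.4f}, bound 1/sqrt(m) = {1 / np.sqrt(m):.4f}")
```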
Bounding Rademacher Averages: Massart’s Inequality
Theorem (Massart’s Finite Class Inequality)
Assume that $|F|$ is finite. Let $S = \{z_1, \dots, z_m\}$ be a sample. Then
$$R_m(F, S) \le \sqrt{\sup_{f \in F} \sum_{i=1}^m f^2(z_i)} \cdot \frac{\sqrt{2\ln|F|}}{m}.$$
Corollary
Therefore, if $F \subseteq X \to [-1, 1]$, then $\sqrt{\sup_{f \in F} \sum_{i=1}^m f^2(z_i)} \le \sqrt{m}$, thus
$$R_m(F, D) \le \sqrt{\frac{2\ln|F|}{m}}.$$
Massart’s Inequality: Proof Volume I
For any $\lambda > 0$:
$$\begin{aligned}
\exp\bigl(\lambda m R_m(F, S)\bigr)
&= \exp\left(\lambda\, \mathbb{E}_\sigma\left[\sup_{f \in F} \sum_{i=1}^m \sigma_i f(z_i)\right]\right) && \text{Definition} \\
&\le \mathbb{E}_\sigma\left[\exp\left(\lambda \sup_{f \in F} \sum_{i=1}^m \sigma_i f(z_i)\right)\right] && \exp(\mathbb{E}[\cdot]) \le \mathbb{E}[\exp(\cdot)] \text{ (Jensen's Inequality)} \\
&= \mathbb{E}_\sigma\left[\sup_{f \in F} \exp\left(\sum_{i=1}^m \lambda \sigma_i f(z_i)\right)\right] && \sup(\exp(\cdot)) = \exp(\sup(\cdot)) \text{ (Monotonicity)} \\
&\le \sum_{f \in F} \mathbb{E}_\sigma\left[\exp\left(\sum_{i=1}^m \lambda \sigma_i f(z_i)\right)\right] && \sup_{f \in F}(\cdot) \le \sum_{f \in F}(\cdot) \text{ when positive} \\
&= \sum_{f \in F} \mathbb{E}_\sigma\left[\prod_{i=1}^m \exp(\lambda \sigma_i f(z_i))\right] && \text{Properties of the Exponent} \\
&= \sum_{f \in F} \prod_{i=1}^m \mathbb{E}_\sigma\bigl[\exp(\lambda \sigma_i f(z_i))\bigr] && \text{Independence}
\end{aligned}$$
Massart’s Inequality: Proof Volume II
Take $B^2 = \sup_{f \in F} \sum_{i=1}^m f^2(z_i)$. For any $\lambda > 0$:
$$\begin{aligned}
\exp\bigl(\lambda m R_m(F, S)\bigr)
&= \sum_{f \in F} \prod_{i=1}^m \mathbb{E}_\sigma\bigl[\exp(\lambda \sigma_i f(z_i))\bigr] && \text{Previous Slide} \\
&= \sum_{f \in F} \prod_{i=1}^m \frac{\exp(\lambda f(z_i)) + \exp(-\lambda f(z_i))}{2} && \text{Definition of Expectation} \\
&\le \sum_{f \in F} \prod_{i=1}^m \exp\left(\frac{\lambda^2}{2} f^2(z_i)\right) && \text{Hyperbolic Cosine Inequality} \\
&= \sum_{f \in F} \exp\left(\frac{\lambda^2}{2} \sum_{i=1}^m f^2(z_i)\right) && \text{Exponent Laws} \\
&\le \sum_{f \in F} \exp\left(\frac{\lambda^2 B^2}{2}\right) && \forall f \in F:\ \sum_{i=1}^m f^2(z_i) \le \sup_{f \in F} \sum_{i=1}^m f^2(z_i) = B^2 \\
&\le |F| \exp\left(\frac{\lambda^2 B^2}{2}\right) && \text{Summation}
\end{aligned}$$
Massart’s Inequality: Proof Volume III
Take $B^2 = \sup_{f \in F} \sum_{i=1}^m f^2(z_i)$. For any $\lambda > 0$:
Previously: $\exp\bigl(\lambda m R_m(F, S)\bigr) \le |F| \exp\left(\frac{\lambda^2 B^2}{2}\right)$
Take logarithms, rearrange, and minimize over $\lambda$ to obtain $\lambda = \frac{\sqrt{2\ln|F|}}{B}$:
$$R_m(F, S) \le \inf_\lambda \frac{1}{m}\left(\frac{\ln|F|}{\lambda} + \frac{\lambda B^2}{2}\right) = \frac{B\sqrt{2\ln|F|}}{m} = \sqrt{\sup_{f \in F} \sum_{i=1}^m f^2(z_i)} \cdot \frac{\sqrt{2\ln|F|}}{m}$$
Takeaways:
• Lots of steps; all very simple
• Proof similar to Azuma-Hoeffding inequality
• Often loose, but lays foundation for better bounds
Application: The Dvoretzky-Kiefer-Wolfowitz Inequality
• Suppose a sample $x \sim D^m$ over $\mathbb{R}$; let $F$ be the CDF of $D$
• Define the empirical CDF $\hat{F}(t) = \frac{|\{i \in 1, \dots, m : x_i \le t\}|}{m}$
• For any $t \in \mathbb{R}$: $\mathbb{E}_x[\hat{F}(t)] = F(t)$
• DKW inequality (simulated in the sketch below): $\Pr_x\left(\sup_{t \in \mathbb{R}} \bigl|\hat{F}(t) - F(t)\bigr| \ge \epsilon\right) \le 2e^{-2m\epsilon^2}$
• Elegant result, with a complicated proof
  • 36 years to show the optimal constants!
  • Can we get close, in much less time?
• Compare to Hoeffding's Inequality:
$$\forall t \in \mathbb{R} : \Pr\left(\bigl|\hat{F}(t) - F(t)\bigr| \ge \epsilon\right) \le 2e^{-2m\epsilon^2}$$
• Can we prove DKW with a union bound?
  • No, it needs to hold at infinitely many points
  • Need uniform convergence!
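A minimal simulation sketch (the Uniform(0,1) sample and the parameters are illustrative assumptions) comparing the empirical tail of $\sup_t |\hat{F}(t) - F(t)|$ with the DKW bound:

```python
import numpy as np

# Minimal sketch: simulate the sup deviation of the empirical CDF for a
# Uniform(0,1) sample (so F(t) = t) and compare against the DKW tail bound.
rng = np.random.default_rng(4)

m, eps, n_trials = 1000, 0.05, 2000
exceed = 0
for _ in range(n_trials):
    x = np.sort(rng.uniform(size=m))
    # The sup over t of |F_hat(t) - F(t)| is attained at the sample points:
    # compare i/m and (i-1)/m against F(x_(i)) = x_(i).
    upper = np.arange(1, m + 1) / m - x
    lower = x - np.arange(0, m) / m
    if max(upper.max(), lower.max()) >= eps:
        exceed += 1

print(f"empirical tail: {exceed / n_trials:.4f}, "
      f"DKW bound: {2 * np.exp(-2 * m * eps ** 2):.4f}")
```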
Application: The Dvoretzky-Kiefer-Wolfowitz Inequality
• Want to show $\Pr_x\left(\sup_{t \in \mathbb{R}} \bigl|\hat{F}(t) - F(t)\bigr| \ge \epsilon\right) \le 2e^{-2m\epsilon^2}$
• With Massart's inequality:
  • The function family is the indicator functions $\mathcal{F} = \{f(x) = \mathbb{1}_{x \le t} : t \in \mathbb{R}\}$
  • Problem: $R_m(\mathcal{F}, x) \le \sqrt{\sup_{f \in \mathcal{F}} \sum_{i=1}^m f^2(x_i)} \cdot \frac{\sqrt{2\ln|\mathcal{F}|}}{m} = \infty$
  • Pick $\mathcal{F}' \subseteq \mathcal{F}$ where $\forall f \in \mathcal{F}$, $\exists f' \in \mathcal{F}'$ s.t. $f(x_i) = f'(x_i)$ at every $x_i$
$$R_m(\mathcal{F}, x) = \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] = \mathbb{E}_\sigma\left[\sup_{f' \in \mathcal{F}'} \frac{1}{m}\sum_{i=1}^m \sigma_i f'(x_i)\right] = R_m(\mathcal{F}', S)$$
• $|\mathcal{F}'|$ is the empirical shattering coefficient of $\mathcal{F}$
  • No worst case over $z$, unlike in the VC dimension!
  • Minimal size $|\mathcal{F}'| = m + 1$. Massart's lemma:
$$\Pr\left(\sup_{t \in \mathbb{R}} \bigl|\hat{F}(t) - F(t)\bigr| \ge 2\sqrt{\frac{2\ln(m+1)}{m}} + \epsilon\right) \le 2e^{-2m\epsilon^2}$$
Application: The Dvoretzky-Kiefer-Wolfowitz Inequality
• Want to show $\Pr_x\left(\sup_{t \in \mathbb{R}} \bigl|\hat{F}(t) - F(t)\bigr| \ge \epsilon\right) \le 2e^{-2m\epsilon^2}$
• Improving the result:
  • Lots of $f'_1, f'_2 \in \mathcal{F}'$ differ on only a few $x_i$.
  • Define $\mathcal{F}'_k \subseteq \mathcal{F}'$ s.t. each $f' \in \mathcal{F}'$ has an $f'_k \in \mathcal{F}'_k$ that disagrees on $\le k$ of the $x_i$
  • Minimal size $|\mathcal{F}'_k| \le 1 + \frac{m+1}{2k+1} \approx \frac{m}{2k}$
$$R_m(\mathcal{F}, x) = \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] \le \mathbb{E}_\sigma\left[\sup_{f'_k \in \mathcal{F}'_k} \frac{1}{m}\sum_{i=1}^m \sigma_i f'_k(x_i)\right] + \frac{k}{m} \le \sqrt{\frac{2\ln\bigl(1 + \frac{m+1}{2k+1}\bigr)}{m}} + \frac{k}{m}$$
• Each $f \in \mathcal{F}$ is approximated by some $f'_k \in \mathcal{F}'_k$ with error $\le \frac{k}{m}$
• Discretization method: replace the (possibly infinite) $\mathcal{F}$ with the discrete approximation $\mathcal{F}'_k$
• Conclusions:
  • Massart's inequality is a powerful tool for proving simple bounds
  • Refinements can get much tighter bounds
Application: Agnostic Learning a Binary Classifier
Let $C$ be a binary concept class defined on a domain $X$, and let $D$ be a probability distribution on $X$. For each $x \in X$ let $c(x)$ be the correct classification of $x$. For each hypothesis $h \in C$ we define a function $f_h(x)$ by
$$f_h(x) = \begin{cases} 1 & \text{if } h(x) = c(x) \\ 0 & \text{otherwise} \end{cases}$$
Let $F = \{f_h \mid h \in C\}$. We want to find $h' \in C$ such that with probability $\ge 1 - \delta$,
$$\mathbb{E}[f_{h'}] \ge \sup_{f_h \in F} \mathbb{E}[f_h] - \alpha.$$
$\mathbb{E}[f_{h'}]$ measures the accuracy of $h'$ over the distribution; we want an $\alpha$-optimal $h'$. We give an upper bound on the required training set size with Rademacher averages.
For each hypothesis $h \in C$ we define a function $f_h(x)$ by
$$f_h(x) = \begin{cases} 1 & \text{if } h(x) = c(x) \\ 0 & \text{otherwise} \end{cases}$$
Let $S = z_1, \dots, z_m$ be a sample of size $m$. Then
$$R_m(F, S) \le \sqrt{\frac{2\ln|F|}{m}},$$
thus we have
$$\Pr\left(\sup_{f \in F}\left(\mathbb{E}_D[f] - \frac{1}{m}\sum_{i=1}^m f(z_i)\right) \ge 2R_m(F, S) + 3\epsilon\right) \le 2e^{-2m\epsilon^2}.$$
We need to pick $m, \epsilon$ such that $2e^{-2m\epsilon^2} \le \delta$ and $2\sqrt{\frac{2\ln|F|}{m}} + 3\epsilon \le \alpha$ (a small numeric sketch follows).
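A minimal sketch (illustrative values for $|F|$, $\alpha$, $\delta$) that searches for a sample size $m$ and a matching $\epsilon$ satisfying both constraints:

```python
import math

# Minimal sketch: find a sample size m and an epsilon satisfying both
# 2*exp(-2*m*eps^2) <= delta and 2*sqrt(2*ln|F|/m) + 3*eps <= alpha.
def sample_size(n_hypotheses, alpha, delta, m_max=10_000_000):
    for m in range(10, m_max):
        # Smallest eps that already gives 2*exp(-2*m*eps^2) <= delta.
        eps = math.sqrt(math.log(2.0 / delta) / (2.0 * m))
        if 2.0 * math.sqrt(2.0 * math.log(n_hypotheses) / m) + 3.0 * eps <= alpha:
            return m, eps
    raise ValueError("no feasible m found in range")

m, eps = sample_size(n_hypotheses=1000, alpha=0.1, delta=0.05)
print(f"m = {m}, eps = {eps:.4f}")
```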
Relation to VC-dimension
We express this bound in terms of the VC dimension of the concept class $C$. Each function $f_h \in F$ corresponds to a hypothesis $h \in C$. Let $d$ be the VC dimension of $C$. The projection of the range space $(X, C)$ on a sample of size $m$ has no more than $m^d$ different sets. Thus, the number of different functions we need to consider is bounded by $m^d$, and
$$R_m(F, S) \le \sqrt{\frac{2\ln(m^d)}{m}} = \sqrt{\frac{2d\ln m}{m}}.$$
Exercise: compare the bounds obtained using the VC-dimension and the Rademacher complexity methods.
Talagrand’s Contraction Inequality
Definition (Lipschitz-Continuous)
A function $\phi : \mathbb{R} \to \mathbb{R}$ is $\lambda$-Lipschitz if for any $a, b \in \mathbb{R}$ it holds that $|\phi(a) - \phi(b)| \le \lambda|a - b|$.
• A 1-Lipschitz function is called a contraction, because it can only contract the distance between $a$ and $b$.
Theorem (Talagrand’s Contraction Inequality)
Suppose $\phi$ is a $\lambda$-Lipschitz function, $F \subseteq Z \to \mathbb{R}$, and $z \in Z^m$. Then
$$R_m(\phi \circ F, z) \le \lambda R_m(F, z).$$
Talagrand’s Contraction Inequality: Proof
• Suppose $\phi$ is a $\lambda$-Lipschitz function (i.e. $|\phi(a) - \phi(b)| \le \lambda|a - b|$)
$$\begin{aligned}
m R_m(\phi \circ F, z) &= \mathbb{E}_\sigma\left[\sup_{f \in F} \sum_{i=1}^m \sigma_i \phi(f(z_i))\right] = \mathbb{E}_\sigma\left[\sup_{f \in F} \sigma_1 \phi(f(z_1)) + \sum_{i=2}^m \sigma_i \phi(f(z_i))\right] \\
&= \frac{1}{2}\, \mathbb{E}_{\sigma_2, \dots, \sigma_m}\left[\left(\sup_{f \in F} \phi(f(z_1)) + \sum_{i=2}^m \sigma_i \phi(f(z_i))\right) + \left(\sup_{f \in F} -\phi(f(z_1)) + \sum_{i=2}^m \sigma_i \phi(f(z_i))\right)\right] \\
&= \frac{1}{2}\, \mathbb{E}_{\sigma_2, \dots, \sigma_m}\left[\sup_{f, f' \in F} \bigl(\phi(f(z_1)) - \phi(f'(z_1))\bigr) + \sum_{i=2}^m \sigma_i \bigl(\phi(f(z_i)) + \phi(f'(z_i))\bigr)\right] \\
&\le \frac{1}{2}\, \mathbb{E}_{\sigma_2, \dots, \sigma_m}\left[\sup_{f, f' \in F} \lambda\bigl|f(z_1) - f'(z_1)\bigr| + \sum_{i=2}^m \sigma_i \bigl(\phi(f(z_i)) + \phi(f'(z_i))\bigr)\right] && \text{Lipschitz} \\
&= \frac{1}{2}\, \mathbb{E}_{\sigma_2, \dots, \sigma_m}\left[\sup_{f, f' \in F} \lambda\bigl(f(z_1) - f'(z_1)\bigr) + \sum_{i=2}^m \sigma_i \bigl(\phi(f(z_i)) + \phi(f'(z_i))\bigr)\right] && \text{Symmetry} \\
&= \mathbb{E}_\sigma\left[\sup_{f \in F} \sigma_1 \lambda f(z_1) + \sum_{i=2}^m \sigma_i \phi(f(z_i))\right] && \text{Linearity} \\
&\le m\lambda R_m(F, z) && \text{Repeat for } i \in \{2, \dots, m\}
\end{aligned}$$
Application: Regression and Loss Functions
• Suppose the loss function is $\ell(y, \hat{y}) = (y - \hat{y})^2$
• Fix $y \in [-1, 1]$. $\ell(y, \hat{y})$ is 4-Lipschitz in $\hat{y}$ for $\hat{y} \in [-1, 1]$
  • Can prove by looking at derivatives (checked numerically in the sketch below)
• Suppose $H \subseteq X \to \{-1, 1\}$, and let $\overline{H}$ be its convex closure
  • Each $h \in \overline{H}$ linearly interpolates between members of $H$
$$R_m(H, z) = \mathbb{E}_\sigma\left[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i h(z_i)\right] = \mathbb{E}_\sigma\left[\sup_{\bar{h} \in \overline{H}} \frac{1}{m}\sum_{i=1}^m \sigma_i \bar{h}(z_i)\right] = R_m(\overline{H}, z)$$
Interpolating can't increase the supremum!
• By the contraction inequality, for any $z \in (X \times [-1, 1])^m$:
$$R(\ell \circ \overline{H}, z) \le 4 R(\overline{H}, z)$$
• Discrete model → continuous model → regression bounds
• Usually easier to analyze the hypothesis class than the loss-function family
• Flexible approach: works for any $H$, any (Lipschitz) loss
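A small numeric check (illustrative grid; not from the slides) of the 4-Lipschitz claim for the square loss on $[-1, 1]$:

```python
import numpy as np

# Minimal sketch: numerically check that l(y, y_hat) = (y - y_hat)^2 with
# y, y_hat in [-1, 1] is 4-Lipschitz in y_hat (derivative 2|y_hat - y| <= 4).
y_hat_grid = np.linspace(-1, 1, 201)
max_slope = 0.0
for y in np.linspace(-1, 1, 201):
    loss = (y - y_hat_grid) ** 2
    slopes = np.abs(np.diff(loss) / np.diff(y_hat_grid))
    max_slope = max(max_slope, slopes.max())

print(f"max empirical slope: {max_slope:.3f} (bounded by 4)")
```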
Application: Regression and Loss Functions
[Figure: loss functions plotted against $y_\Delta = y - \hat{y}$ on $[-3, 3]$: square loss $\ell(y, \hat{y}) = y_\Delta^2$, absolute loss $\ell(y, \hat{y}) = |y_\Delta|$, log-cosh loss $\ell(y, \hat{y}) = 2\ln(\cosh(y_\Delta))$, and $\ell(y, \hat{y}) = \arctan(y_\Delta^2)$.]
Rademacher Averages in the Wild
• Used to refine the Vapnik-Chervonenkis inequality
  • Assume $H$ with VC dimension $d$
  • Classical: $\epsilon \in O\left(\sqrt{\frac{d\ln m}{m}}\right)$
    • Shown with the $\epsilon$-net theorem, or with Massart's lemma and shattering coefficients
  • Modern: $\epsilon \in O\left(\sqrt{\frac{d}{m}}\right)$
    • Shown with a refinement of discretization, called chaining
• Classification and regression problems
  • Support Vector Machines
  • Neural networks
  • Data mining problems
• Other ways to bound Rademacher averages
  • Covering numbers essentially refine Massart's lemma
  • Monte-Carlo methods approximate $R_m(F, z)$ directly
Conclusions
• Uniform convergence shows bounds for learning tasks
  • Expectation bounds with symmetrization
  • Tail bounds with martingale inequalities
• Vapnik-Chervonenkis Dimension
  • Distribution-free bounds
  • Limited to binary classification
• Rademacher averages
  • Data-dependent bounds
  • Continuous function families
  • Bound complicated function families with simple tools
    • Massart's finite class inequality
    • Talagrand's contraction inequality
    • Many other tools
  • Simpler than the $\epsilon$-net theorem