Unrealizable (Agnostic) Learning
• Given a training sample $(x_1, c(x_1)), \dots, (x_m, c(x_m))$ and a concept class $C$.
• No hypothesis in the concept class $C$ is consistent with the entire training set ($c \notin C$).
• Relaxed goal: let $c$ be the correct concept. Find $c' \in C$ such that
$$\Pr_D\bigl(c'(x) \neq c(x)\bigr) \le \inf_{h \in C} \Pr_D\bigl(h(x) \neq c(x)\bigr) + \epsilon.$$
• If we estimate each $\widehat{\Pr}(h(x) \neq c(x)) = \frac{|\{i : h(x_i) \neq c(x_i)\}|}{m}$ such that
$$\Bigl|\widehat{\Pr}(h(x) \neq c(x)) - \Pr_{x \sim D}(h(x) \neq c(x))\Bigr| \le \frac{\epsilon}{2},$$
then $\hat{h} = \arg\min_{h \in C} \widehat{\Pr}(h(x) \neq c(x))$ is $\epsilon$-optimal.
• The correct concept $c$ may not even exist; in that case $c(x)$ is a random function.
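As a concrete illustration (a minimal sketch, not from the slides: the threshold class, the noise model, and all names are assumptions), the plug-in estimate $\widehat{\Pr}$ and the resulting ERM rule look like this:

```python
import numpy as np

# Minimal sketch (illustrative setup): plug-in empirical error and ERM over a
# finite concept class of threshold classifiers h_t(x) = 1{x > t}.
rng = np.random.default_rng(0)

m = 2000
x = rng.uniform(-1, 1, size=m)
# Noisy labels: no threshold is consistent with the whole sample (agnostic case).
y = (x + rng.normal(scale=0.3, size=m) > 0).astype(int)

thresholds = np.linspace(-1, 1, 41)          # finite concept class C

def empirical_error(t):
    """Plug-in estimate of Pr(h_t(x) != c(x)) on the training sample."""
    return np.mean((x > t).astype(int) != y)

errors = [empirical_error(t) for t in thresholds]
best_t = thresholds[int(np.argmin(errors))]  # the (epsilon-)ERM hypothesis
print(f"ERM threshold: {best_t:.2f}, empirical error: {min(errors):.3f}")
```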
Illustrating Agnostic Learning
We want a classifier to distinguish between cats and dogs
       Image 1   Image 2   Image 3     Image 4
x      (image)   (image)   (image)     (image)
c(x)   Cat       Dog       Ostrich?    Sensor Error?
Uniform Convergence for Binary Classification
Definition
A range space $(X, R)$ (domain $X$, set of subsets $R$) has the uniform convergence property if for every $\epsilon, \delta > 0$ there is a sample size $m = m(\epsilon, \delta)$ such that for every distribution $D$ over $X$, if $S \sim D^m$, then, with probability at least $1 - \delta$, $S$ is an $\epsilon$-sample for $R$ with respect to $D$.
• Holds uniformly over:
  • All distributions $D$
  • All range space elements $r \in R$
• Convergence of confidence intervals to points (w.h.p. as $m \to \infty$)
Theorem
The following three conditions are equivalent:
1 A concept class $C$ over a domain $X$ is agnostic PAC learnable.
2 The range space $(X, C)$ has the uniform convergence property.
3 The range space $(X, C)$ has a finite VC dimension.
Uniform Convergence in General
Definition
A set of functions $F$ has the uniform convergence property with respect to a domain $Z$ if there is a function $m_F(\epsilon, \delta)$ such that for any $\epsilon, \delta > 0$, $m_F(\epsilon, \delta) < \infty$, and for any distribution $D$ on $Z$, a sample $z_1, \dots, z_m$ of size $m = m_F(\epsilon, \delta)$ satisfies
$$\Pr\left(\sup_{f \in F} \left|\frac{1}{m}\sum_{i=1}^m f(z_i) - \mathbb{E}_D[f]\right| \le \epsilon\right) \ge 1 - \delta.$$
The general supervised learning scheme: $Z = X \times Y$ (labeled pairs):
• Have a loss function $\ell : Y \times Y \to \mathbb{R}$ that measures prediction error
• Classification: 0-1 error: $\ell(y, \hat{y}) = \begin{cases} 0 & \hat{y} = y \\ 1 & \hat{y} \neq y \end{cases}$
• Regression: square error: $\ell(y, \hat{y}) = (y - \hat{y})^2$
• Let $(\ell \circ h)(z_i) = \ell(y_i, h(x_i))$ on $z_i = (x_i, y_i)$
• Let $F = \ell \circ H = \{\ell \circ h \mid h \in H\}$
• Uniform convergence $\Rightarrow$ the training loss of every $h \in H$ approximates its true loss
• $\epsilon$-ERM (Empirical Risk Minimization): find $\hat{h} \in H$ such that $\hat{h}$ is $\epsilon$-optimal (see the sketch below)
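A minimal sketch of this composition (the linear hypothesis class, the grid of slopes, and the data are illustrative assumptions): build $F = \ell \circ H$ for the square loss and run ERM on the empirical risk.

```python
import numpy as np

# Minimal sketch: compose a hypothesis class H with a loss to get F = loss∘H,
# then run ERM on the empirical risk. The class of slopes is illustrative.
rng = np.random.default_rng(1)

m = 1000
x = rng.uniform(-1, 1, size=m)
y = 0.5 * x + rng.normal(scale=0.1, size=m)     # regression targets

def sq_loss(y_true, y_pred):
    """Square-error loss l(y, y_hat) = (y - y_hat)^2."""
    return (y_true - y_pred) ** 2

slopes = np.linspace(-2, 2, 81)                  # H: h_a(x) = a*x

# Empirical risk of each h in H, i.e. the sample mean of (loss ∘ h).
risks = [np.mean(sq_loss(y, a * x)) for a in slopes]
a_hat = slopes[int(np.argmin(risks))]
print(f"ERM slope: {a_hat:.2f}, empirical risk: {min(risks):.4f}")
```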
Is Uniform Convergence Necessary?
Definition
A set of functions $F$ has the uniform convergence property with respect to a domain $Z$ if there is a function $m_F(\epsilon, \delta)$ such that for any $\epsilon, \delta > 0$, $m_F(\epsilon, \delta) < \infty$, and for any distribution $D$ on $Z$, a sample $z_1, \dots, z_m$ of size $m = m_F(\epsilon, \delta)$ satisfies
$$\Pr\left(\sup_{f \in F} \left|\frac{1}{m}\sum_{i=1}^m f(z_i) - \mathbb{E}_D[f]\right| \le \epsilon\right) \ge 1 - \delta.$$
• We don’t need uniform convergence for every distribution $D$, just for the input (training set) distribution – Rademacher average
• We don’t need a tight estimate for all functions, only for functions in a neighborhood of the optimal function – local Rademacher average
Jensen’s Inequality
Definition
A function $f : \mathbb{R}^m \to \mathbb{R}$ is said to be convex if, for any $x_1, x_2$ and $0 \le \lambda \le 1$,
$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$
Theorem (Jensen’s Inequality)
If f is a convex function, then
$$\mathbb{E}[f(X)] \ge f(\mathbb{E}[X]).$$
In particular, $\sup_{f \in F} \mathbb{E}[f] \le \mathbb{E}\bigl[\sup_{f \in F} f\bigr]$.
Rademacher Averages (Complexity)
Definition
The Empirical Rademacher Average of $F$ on a sample $S = \{z_1, \dots, z_m\}$ is
$$R_m(F, S) = \mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right],$$
where the $\sigma_i$ are i.i.d. Rademacher variables (uniform on $\pm 1$).
Definition
The Rademacher Average of $F$ over a distribution $D$ is defined as
$$R_m(F, D) = \mathbb{E}_{S \sim D^m}\bigl[R_m(F, S)\bigr] = \mathbb{E}_{S \sim D^m,\, \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right].$$
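The empirical quantity can be approximated directly by Monte Carlo over the Rademacher noise. A minimal sketch (the threshold family and the sample are illustrative assumptions):

```python
import numpy as np

# Minimal sketch: Monte Carlo estimate of the empirical Rademacher average
# R_m(F, S) for the threshold family F = {1{x > t} : t in R} on a fixed sample.
rng = np.random.default_rng(2)

m = 500
x = np.sort(rng.uniform(-1, 1, size=m))      # fixed sample S (sorted)

def empirical_rademacher(n_trials=5000):
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)
        # Since x is sorted, every f in F selects a suffix of the sample, so the
        # sup of (1/m) sum_i sigma_i f(x_i) is the largest suffix sum (or 0).
        suffix_sums = np.concatenate(([0.0], np.cumsum(sigma[::-1])))
        total += suffix_sums.max() / m
    return total / n_trials

print(f"Monte Carlo estimate of R_m(F, S): {empirical_rademacher():.4f}")
```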
A Symmetrization Inequality
Theorem
Suppose $z \sim D^m$ over domain $Z$ and $F \subseteq Z \to \mathbb{R}$. We bound the supremum deviation
$$\mathbb{E}_z\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m f(z_i) - \mathbb{E}_D[f]\right] \le 2 R_m(F, D).$$
We will later convert this expectation bound into a high-probability tail bound, showing uniform convergence.
Symmetrization Inequality: Proof
Function family: $F \subseteq Z \to \mathbb{R}$; samples: $z, z' \sim D^m$; $\sigma \sim \text{Rademacher}^m$
$$\begin{aligned}
\mathbb{E}_z\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m f(z_i) - \mathbb{E}_D[f]\right]
&= \mathbb{E}_z\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m f(z_i) - \mathbb{E}_{z'}\left[\frac{1}{m}\sum_{i=1}^m f(z'_i)\right]\right] && \text{Linearity of Expectation} \\
&\le \mathbb{E}_{z, z'}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \bigl(f(z_i) - f(z'_i)\bigr)\right] && \text{Jensen's Inequality} \\
&= \mathbb{E}_{z, z', \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i \bigl(f(z_i) - f(z'_i)\bigr)\right] && \text{Symmetry} \\
&\le \mathbb{E}_{z, \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right] + \mathbb{E}_{z', \sigma}\left[\sup_{f \in F} -\frac{1}{m}\sum_{i=1}^m \sigma_i f(z'_i)\right] && \text{Subadditivity} \\
&= \mathbb{E}_{z, \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right] + \mathbb{E}_{z', \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z'_i)\right] && \Pr(\sigma) = \Pr(-\sigma) \\
&= 2\, \mathbb{E}_{z, \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right] = 2 R_m(F, D) && z, z' \sim D^m
\end{aligned}$$
Interpreting the Rademacher Average
Empirical average: $\frac{1}{m}\sum_{i=1}^m f(z_i)$ (mean estimate)
Empirical supremum: $\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m f(z_i)$ (max estimate)
$R_m(F, z)$: $\mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right]$ (sample-noise correlation)
$R_m(F, D)$: $\mathbb{E}_{z \sim D^m}\left[\mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right]\right]$ (distribution-noise correlation)
Exponential Tail Bounds with Rademacher Averages
Theorem
Let $F \subseteq X \to [0, 1]$, let $S = \{z_1, \dots, z_m\}$ be a sample from $D$, and let $\epsilon > 0$.
1 Bounding the generalization error using the Rademacher complexity:
$$\Pr_S\left(\sup_{f \in F}\left\{\mathbb{E}_D[f] - \frac{1}{m}\sum_{i=1}^m f(z_i)\right\} \ge 2 R_m(F, D) + \epsilon\right) \le e^{-2m\epsilon^2}$$
2 Bounding the generalization error using the empirical Rademacher complexity:
$$\Pr_S\left(\sup_{f \in F}\left\{\mathbb{E}_D[f] - \frac{1}{m}\sum_{i=1}^m f(z_i)\right\} \ge 2 R_m(F, S) + 3\epsilon\right) \le 2 e^{-2m\epsilon^2}$$
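As a quick numeric illustration (a sketch, with made-up numbers), the first bound can be inverted: choosing $\epsilon = \sqrt{\ln(1/\delta)/(2m)}$ makes the failure probability at most $\delta$, so the generalization gap is at most $2R_m(F, D) + \epsilon$ with probability at least $1 - \delta$.

```python
import math

# Minimal sketch: invert the first tail bound. Setting e^{-2 m eps^2} = delta
# gives eps = sqrt(ln(1/delta) / (2m)), so with probability >= 1 - delta the
# uniform generalization gap is at most 2*R_m + eps.
def deviation_bound(rademacher_avg, m, delta):
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return 2.0 * rademacher_avg + eps

# Example numbers (purely illustrative): R_m = 0.02, m = 10000, delta = 0.05.
print(f"gap <= {deviation_bound(0.02, 10_000, 0.05):.4f} with prob. >= 0.95")
```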
Recap: McDiarmid’s Inequality
Applying the Azuma inequality to Doob’s martingale:
Theorem
Let $z_1, \dots, z_m$ be independent random variables and let $h(z_1, \dots, z_m)$ be a function such that changing $z_i$ can change the value of $h(\dots)$ by no more than $c_i$:
$$\sup_{z_1, \dots, z_m, z'_i} |h(z_1, \dots, z_i, \dots, z_m) - h(z_1, \dots, z'_i, \dots, z_m)| \le c_i.$$
Then for any $\epsilon > 0$,
$$\Pr_{z_1, \dots, z_m}\Bigl(h(z_1, \dots, z_m) - \mathbb{E}_{z_1, \dots, z_m}[h(z_1, \dots, z_m)] \ge \epsilon\Bigr) \le e^{-2\epsilon^2 / \sum_{i=1}^m c_i^2}.$$
Exponential Tail Bounds with Rademacher Averages: Proof
• Suppose $F \subseteq X \to [0, 1]$
• The generalization error (empirical supremum deviation)
$$D(z_1, \dots, z_m) = \sup_{f \in F}\left(\mathbb{E}_D[f] - \frac{1}{m}\sum_{i=1}^m f(z_i)\right)$$
measures how badly the empirical loss underestimates the true loss for any $f \in F$
• Changing $z_i$ changes $D(z_1, \dots, z_m)$ by no more than $\frac{1}{m}$
• The Empirical Rademacher Average
$$R_m(F, S) = \mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right]$$
is also a function of $z_1, \dots, z_m$, with bounded differences $\frac{1}{m}$
• McDiarmid’s inequality yields the result
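Spelling out that last step (a brief sketch consistent with the previous slides): apply McDiarmid to $D(z_1, \dots, z_m)$ with $c_i = 1/m$ and combine with the symmetrization bound $\mathbb{E}[D] \le 2R_m(F, D)$:
$$\Pr\bigl(D(z_1, \dots, z_m) \ge \mathbb{E}[D] + \epsilon\bigr) \le e^{-2\epsilon^2 / \sum_{i=1}^m (1/m)^2} = e^{-2m\epsilon^2}, \qquad \mathbb{E}[D] \le 2R_m(F, D),$$
which together give the first bound; the second follows by additionally applying McDiarmid to $R_m(F, S)$ to relate it to $R_m(F, D)$, at the cost of the extra $\epsilon$ terms and the factor of 2 from the union bound.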
Why Data Dependent Bounds?
• The Vapnik-Chervonenkis Dimension:
  • Applies only to binary classification
  • Complicated generalizations for regression or multi-class classification
  • Combinatorial bound
  • Ignores the data distribution (worst-case over all distributions)
  • Can be hard to compute
• Rademacher Averages:
  • Handle general learning problems
  • Only need a loss function
  • Classification, regression, clustering, data mining
  • Sensitive to the data distribution (distribution-dependent)
  • Approximated with the training sample (data-dependent)
  • Always at least as good a bound as the VC dimension
  • Still hard to compute
Computing Rademacher Averages: Example 1
• Take $F = \{f\}$ for any $f : X \to \mathbb{R}$. Then
$$R_m(F, x) = \mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] = \mathbb{E}_\sigma\left[\frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] = \frac{1}{m}\sum_{i=1}^m \mathbb{E}_{\sigma_i}[\sigma_i f(x_i)] = 0$$
• Is this useful?
  • No, but it's educational
  • A single function cannot overfit (no inductive bias)
  • Recover Hoeffding's inequality:
$$\Pr\left(\mathbb{E}_D[f] - \frac{1}{m}\sum_{i=1}^m f(z_i) \ge 2R_m(F, D) + \epsilon\right) = \Pr\left(\mathbb{E}_D[f] - \frac{1}{m}\sum_{i=1}^m f(z_i) \ge \epsilon\right) \le e^{-2m\epsilon^2}.$$
Computing Rademacher Averages: Example 2
• Take $F = \{f(x) = x,\ f(x) = -x\}$, with $X = \{\pm 1\}$
$$R_m(F, x) = \mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] = \mathbb{E}_\sigma\left[\left|\frac{1}{m}\sum_{i=1}^m \sigma_i x_i\right|\right] = \mathbb{E}_\sigma\left[\left|\frac{1}{m}\sum_{i=1}^m \sigma_i\right|\right]$$
because $x_i \sigma_i \sim$ Rademacher. By Jensen's inequality ($f(x) = x^2$ is convex):
$$= \sqrt{\left(\mathbb{E}_\sigma\left[\left|\frac{1}{m}\sum_{i=1}^m \sigma_i\right|\right]\right)^2} \le \sqrt{\mathbb{E}_\sigma\left[\left(\frac{1}{m}\sum_{i=1}^m \sigma_i\right)^2\right]} = \frac{1}{\sqrt{m}}.$$
Connection: distance after an $m$-step random walk; binomial standard deviation (see the quick simulation check below)
• Is this useful?
  • Yes, it can be used to construct interesting classifiers
  • Idea: pick labels for independent subspaces independently
  • Use the above analysis to bound the Rademacher averages
• Also gives lower bounds on Rademacher averages, using this result:
$$R(F_1, z) \le R(F_1 \cup F_2, z)$$
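A minimal simulation sketch of Example 2 (parameters are illustrative): estimate $\mathbb{E}\bigl|\frac{1}{m}\sum_i \sigma_i\bigr|$ by Monte Carlo and compare it with the $1/\sqrt{m}$ bound.

```python
import numpy as np

# Minimal sketch: Monte Carlo check of Example 2. For F = {x, -x} on X = {+-1},
# R_m(F, x) = E|1/m sum_i sigma_i|, which Jensen's inequality bounds by 1/sqrt(m).
rng = np.random.default_rng(3)

m, n_trials = 400, 20_000
sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))
estimate = np.abs(sigma.mean(axis=1)).mean()
print(f"E|mean sigma| ~ {estimate:.4f}, bound 1/sqrt(m) = {1 / np.sqrt(m):.4f}")
```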
Bounding Rademacher Averages: Massart’s Inequality
Theorem (Massart’s Finite Class Inequality)
Assume that $|F|$ is finite. Let $S = \{z_1, \dots, z_m\}$ be a sample. Then
$$R_m(F, S) \le \sqrt{\sup_{f \in F} \sum_{i=1}^m f^2(z_i)} \cdot \frac{\sqrt{2\ln|F|}}{m}.$$
Corollary
Therefore, if $F \subseteq X \to [-1, 1]$, then $\sqrt{\sup_{f \in F} \sum_{i=1}^m f^2(z_i)} \le \sqrt{m}$, thus
$$R_m(F, D) \le \sqrt{\frac{2\ln|F|}{m}}.$$
Massart’s Inequality: Proof Volume I
For any $\lambda > 0$:
$$\begin{aligned}
\exp\bigl(\lambda m R_m(F, S)\bigr)
&= \exp\left(\lambda\, \mathbb{E}_\sigma\left[\sup_{f \in F} \sum_{i=1}^m \sigma_i f(z_i)\right]\right) && \text{Definition} \\
&\le \mathbb{E}_\sigma\left[\exp\left(\lambda \sup_{f \in F} \sum_{i=1}^m \sigma_i f(z_i)\right)\right] && \exp(\mathbb{E}[\cdot]) \le \mathbb{E}[\exp(\cdot)] \text{ (Jensen's Inequality)} \\
&= \mathbb{E}_\sigma\left[\sup_{f \in F} \exp\left(\sum_{i=1}^m \lambda \sigma_i f(z_i)\right)\right] && \sup(\exp(\cdot)) = \exp(\sup(\cdot)) \text{ (Monotonicity)} \\
&\le \sum_{f \in F} \mathbb{E}_\sigma\left[\exp\left(\sum_{i=1}^m \lambda \sigma_i f(z_i)\right)\right] && \sup_{f \in F}(\cdot) \le \sum_{f \in F}(\cdot) \text{ when positive} \\
&= \sum_{f \in F} \mathbb{E}_\sigma\left[\prod_{i=1}^m \exp(\lambda \sigma_i f(z_i))\right] && \text{Properties of the Exponent} \\
&= \sum_{f \in F} \prod_{i=1}^m \mathbb{E}_\sigma\bigl[\exp(\lambda \sigma_i f(z_i))\bigr] && \text{Independence}
\end{aligned}$$
Massart’s Inequality: Proof Volume II
Take $B^2 = \sup_{f \in F} \sum_{i=1}^m f^2(z_i)$. For any $\lambda > 0$:
$$\begin{aligned}
\exp\bigl(\lambda m R_m(F, S)\bigr)
&= \sum_{f \in F} \prod_{i=1}^m \mathbb{E}_\sigma\bigl[\exp(\lambda \sigma_i f(z_i))\bigr] && \text{Previous Slide} \\
&= \sum_{f \in F} \prod_{i=1}^m \frac{\exp(\lambda f(z_i)) + \exp(-\lambda f(z_i))}{2} && \text{Definition of Expectation} \\
&\le \sum_{f \in F} \prod_{i=1}^m \exp\left(\frac{\lambda^2}{2} f^2(z_i)\right) && \text{Hyperbolic Cosine Inequality} \\
&= \sum_{f \in F} \exp\left(\frac{\lambda^2}{2} \sum_{i=1}^m f^2(z_i)\right) && \text{Exponent Laws} \\
&\le \sum_{f \in F} \exp\left(\frac{\lambda^2 B^2}{2}\right) && \forall f \in F:\ \sum_{i=1}^m f^2(z_i) \le \sup_{f \in F} \sum_{i=1}^m f^2(z_i) = B^2 \\
&\le |F| \exp\left(\frac{\lambda^2 B^2}{2}\right) && \text{Summation}
\end{aligned}$$
Massart’s Inequality: Proof Volume III
Take $B^2 = \sup_{f \in F} \sum_{i=1}^m f^2(z_i)$. For any $\lambda > 0$:
Previously: $\exp\bigl(\lambda m R_m(F, S)\bigr) \le |F| \exp\left(\frac{\lambda^2 B^2}{2}\right)$
Take logarithms, rearrange, and minimize over $\lambda$ to obtain $\lambda = \frac{\sqrt{2\ln|F|}}{B}$:
$$R_m(F, S) \le \inf_\lambda \frac{1}{m}\left(\frac{\ln|F|}{\lambda} + \frac{\lambda B^2}{2}\right) = \frac{B\sqrt{2\ln|F|}}{m} = \sqrt{\sup_{f \in F} \sum_{i=1}^m f^2(z_i)} \cdot \frac{\sqrt{2\ln|F|}}{m}$$
Takeaways:
• Lots of steps; all very simple
• Proof similar to Azuma-Hoeffding inequality
• Often loose, but lays foundation for better bounds
Application: The Dvoretzky-Kiefer-Wolfowitz Inequality
• Suppose a sample $x \sim D^m$ over $\mathbb{R}$; let $F$ be the CDF of $D$
• Define the empirical CDF $\hat{F}(t) = \frac{|\{i \in 1, \dots, m : x_i \le t\}|}{m}$
• For any $t \in \mathbb{R}$: $\mathbb{E}_x[\hat{F}(t)] = F(t)$
• DKW inequality (simulated in the sketch below): $\Pr_x\left(\sup_{t \in \mathbb{R}} \bigl|\hat{F}(t) - F(t)\bigr| \ge \epsilon\right) \le 2e^{-2m\epsilon^2}$
• Elegant result, with a complicated proof
  • 36 years to show the optimal constants!
  • Can we get close, in much less time?
• Compare to Hoeffding's Inequality:
$$\forall t \in \mathbb{R} : \Pr\left(\bigl|\hat{F}(t) - F(t)\bigr| \ge \epsilon\right) \le 2e^{-2m\epsilon^2}$$
• Can we prove DKW with a union bound?
  • No, it needs to hold at infinitely many points
  • Need uniform convergence!
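A minimal simulation sketch (the Uniform(0,1) sample and the parameters are illustrative assumptions) comparing the empirical tail of $\sup_t |\hat{F}(t) - F(t)|$ with the DKW bound:

```python
import numpy as np

# Minimal sketch: simulate the sup deviation of the empirical CDF for a
# Uniform(0,1) sample (so F(t) = t) and compare against the DKW tail bound.
rng = np.random.default_rng(4)

m, eps, n_trials = 1000, 0.05, 2000
exceed = 0
for _ in range(n_trials):
    x = np.sort(rng.uniform(size=m))
    # The sup over t of |F_hat(t) - F(t)| is attained at the sample points:
    # compare i/m and (i-1)/m against F(x_(i)) = x_(i).
    upper = np.arange(1, m + 1) / m - x
    lower = x - np.arange(0, m) / m
    if max(upper.max(), lower.max()) >= eps:
        exceed += 1

print(f"empirical tail: {exceed / n_trials:.4f}, "
      f"DKW bound: {2 * np.exp(-2 * m * eps ** 2):.4f}")
```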
Application: The Dvoretzky-Kiefer-Wolfowitz Inequality
• Want to show $\Pr_x\left(\sup_{t \in \mathbb{R}} \bigl|\hat{F}(t) - F(t)\bigr| \ge \epsilon\right) \le 2e^{-2m\epsilon^2}$
• With Massart's inequality:
  • The function family is the indicator functions $\mathcal{F} = \{f(x) = \mathbb{1}_{x \le t} : t \in \mathbb{R}\}$
  • Problem: $R_m(\mathcal{F}, x) \le \sqrt{\sup_{f \in \mathcal{F}} \sum_{i=1}^m f^2(x_i)} \cdot \frac{\sqrt{2\ln|\mathcal{F}|}}{m} = \infty$
  • Pick $\mathcal{F}' \subseteq \mathcal{F}$ where $\forall f \in \mathcal{F}$, $\exists f' \in \mathcal{F}'$ s.t. $f(x_i) = f'(x_i)$ at every $x_i$
$$R_m(\mathcal{F}, x) = \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] = \mathbb{E}_\sigma\left[\sup_{f' \in \mathcal{F}'} \frac{1}{m}\sum_{i=1}^m \sigma_i f'(x_i)\right] = R_m(\mathcal{F}', S)$$
• $|\mathcal{F}'|$ is the empirical shattering coefficient of $\mathcal{F}$
  • No worst case over $z$, unlike in the VC dimension!
  • Minimal size $|\mathcal{F}'| = m + 1$. Massart's lemma:
$$\Pr\left(\sup_{t \in \mathbb{R}} \bigl|\hat{F}(t) - F(t)\bigr| \ge 2\sqrt{\frac{2\ln(m+1)}{m}} + \epsilon\right) \le 2e^{-2m\epsilon^2}$$
Application: The Dvoretzky-Kiefer-Wolfowitz Inequality
• Want to show $\Pr_x\left(\sup_{t \in \mathbb{R}} \bigl|\hat{F}(t) - F(t)\bigr| \ge \epsilon\right) \le 2e^{-2m\epsilon^2}$
• Improving the result:
  • Lots of $f'_1, f'_2 \in \mathcal{F}'$ differ on only a few $x_i$.
  • Define $\mathcal{F}'_k \subseteq \mathcal{F}'$ s.t. each $f' \in \mathcal{F}'$ has an $f'_k \in \mathcal{F}'_k$ that disagrees on $\le k$ of the $x_i$
  • Minimal size $|\mathcal{F}'_k| \le 1 + \frac{m+1}{2k+1} \approx \frac{m}{2k}$
$$R_m(\mathcal{F}, x) = \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] \le \mathbb{E}_\sigma\left[\sup_{f'_k \in \mathcal{F}'_k} \frac{1}{m}\sum_{i=1}^m \sigma_i f'_k(x_i)\right] + \frac{k}{m} \le \sqrt{\frac{2\ln\bigl(1 + \frac{m+1}{2k+1}\bigr)}{m}} + \frac{k}{m}$$
• Each $f \in \mathcal{F}$ is approximated by some $f'_k \in \mathcal{F}'_k$ with error $\le \frac{k}{m}$
• Discretization method: replace the (possibly infinite) $\mathcal{F}$ with the discrete approximation $\mathcal{F}'_k$
• Conclusions:
  • Massart's inequality is a powerful tool for proving simple bounds
  • Refinements can get much tighter bounds
Application: Agnostic Learning a Binary Classifier
Let $C$ be a binary concept class defined on a domain $X$, and let $D$ be a probability distribution on $X$. For each $x \in X$ let $c(x)$ be the correct classification of $x$. For each hypothesis $h \in C$ we define a function $f_h(x)$ by
$$f_h(x) = \begin{cases} 1 & \text{if } h(x) = c(x) \\ 0 & \text{otherwise} \end{cases}$$
Let $F = \{f_h \mid h \in C\}$. We want to find $h' \in C$ such that with probability $\ge 1 - \delta$,
$$\mathbb{E}[f_{h'}] \ge \sup_{f_h \in F} \mathbb{E}[f_h] - \alpha.$$
$\mathbb{E}[f_{h'}]$ measures the accuracy of $h'$ over the distribution; we want an $\alpha$-optimal $h'$. We give an upper bound on the required training set size with Rademacher averages.
For each hypothesis $h \in C$ we define a function $f_h(x)$ by
$$f_h(x) = \begin{cases} 1 & \text{if } h(x) = c(x) \\ 0 & \text{otherwise} \end{cases}$$
Let $S = z_1, \dots, z_m$ be a sample of size $m$. Then
$$R_m(F, S) \le \sqrt{\frac{2\ln|F|}{m}},$$
thus we have
$$\Pr\left(\sup_{f \in F}\left(\mathbb{E}_D[f] - \frac{1}{m}\sum_{i=1}^m f(z_i)\right) \ge 2R_m(F, S) + 3\epsilon\right) \le 2e^{-2m\epsilon^2}.$$
We need to pick $m, \epsilon$ such that $2e^{-2m\epsilon^2} \le \delta$ and $2\sqrt{\frac{2\ln|F|}{m}} + 3\epsilon \le \alpha$ (a small numeric sketch follows).
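A minimal sketch (illustrative values for $|F|$, $\alpha$, $\delta$) that searches for a sample size $m$ and a matching $\epsilon$ satisfying both constraints:

```python
import math

# Minimal sketch: find a sample size m and an epsilon satisfying both
# 2*exp(-2*m*eps^2) <= delta and 2*sqrt(2*ln|F|/m) + 3*eps <= alpha.
def sample_size(n_hypotheses, alpha, delta, m_max=10_000_000):
    for m in range(10, m_max):
        # Smallest eps that already gives 2*exp(-2*m*eps^2) <= delta.
        eps = math.sqrt(math.log(2.0 / delta) / (2.0 * m))
        if 2.0 * math.sqrt(2.0 * math.log(n_hypotheses) / m) + 3.0 * eps <= alpha:
            return m, eps
    raise ValueError("no feasible m found in range")

m, eps = sample_size(n_hypotheses=1000, alpha=0.1, delta=0.05)
print(f"m = {m}, eps = {eps:.4f}")
```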
Relation to VC-dimension
We express this bound in terms of the VC dimension of the concept class $C$. Each function $f_h \in F$ corresponds to a hypothesis $h \in C$. Let $d$ be the VC dimension of $C$. The projection of the range space $(X, C)$ on a sample of size $m$ has no more than $m^d$ different sets. Thus, the number of different functions we need to consider is bounded by $m^d$, and
$$R_m(F, S) \le \sqrt{\frac{2\ln(m^d)}{m}} = \sqrt{\frac{2d\ln m}{m}}.$$
Exercise: compare the bounds obtained using the VC-dimension and the Rademacher complexity methods.
Talagrand’s Contraction Inequality
Definition (Lipschitz-Continuous)
A function $\phi : \mathbb{R} \to \mathbb{R}$ is $\lambda$-Lipschitz if for any $a, b \in \mathbb{R}$ it holds that $|\phi(a) - \phi(b)| \le \lambda|a - b|$.
• A 1-Lipschitz function is called a contraction, because it can only contract the distance between $a$ and $b$.
Theorem (Talagrand’s Contraction Inequality)
Suppose $\phi$ is a $\lambda$-Lipschitz function, $F \subseteq Z \to \mathbb{R}$, and $z \in Z^m$. Then
$$R_m(\phi \circ F, z) \le \lambda R_m(F, z).$$
Talagrand’s Contraction Inequality: Proof
• Suppose $\phi$ is a $\lambda$-Lipschitz function (i.e. $|\phi(a) - \phi(b)| \le \lambda|a - b|$)
$$\begin{aligned}
m R_m(\phi \circ F, z) &= \mathbb{E}_\sigma\left[\sup_{f \in F} \sum_{i=1}^m \sigma_i \phi(f(z_i))\right] = \mathbb{E}_\sigma\left[\sup_{f \in F} \sigma_1 \phi(f(z_1)) + \sum_{i=2}^m \sigma_i \phi(f(z_i))\right] \\
&= \frac{1}{2}\, \mathbb{E}_{\sigma_2, \dots, \sigma_m}\left[\left(\sup_{f \in F} \phi(f(z_1)) + \sum_{i=2}^m \sigma_i \phi(f(z_i))\right) + \left(\sup_{f \in F} -\phi(f(z_1)) + \sum_{i=2}^m \sigma_i \phi(f(z_i))\right)\right] \\
&= \frac{1}{2}\, \mathbb{E}_{\sigma_2, \dots, \sigma_m}\left[\sup_{f, f' \in F} \bigl(\phi(f(z_1)) - \phi(f'(z_1))\bigr) + \sum_{i=2}^m \sigma_i \bigl(\phi(f(z_i)) + \phi(f'(z_i))\bigr)\right] \\
&\le \frac{1}{2}\, \mathbb{E}_{\sigma_2, \dots, \sigma_m}\left[\sup_{f, f' \in F} \lambda\bigl|f(z_1) - f'(z_1)\bigr| + \sum_{i=2}^m \sigma_i \bigl(\phi(f(z_i)) + \phi(f'(z_i))\bigr)\right] && \text{Lipschitz} \\
&= \frac{1}{2}\, \mathbb{E}_{\sigma_2, \dots, \sigma_m}\left[\sup_{f, f' \in F} \lambda\bigl(f(z_1) - f'(z_1)\bigr) + \sum_{i=2}^m \sigma_i \bigl(\phi(f(z_i)) + \phi(f'(z_i))\bigr)\right] && \text{Symmetry} \\
&= \mathbb{E}_\sigma\left[\sup_{f \in F} \sigma_1 \lambda f(z_1) + \sum_{i=2}^m \sigma_i \phi(f(z_i))\right] && \text{Linearity} \\
&\le m\lambda R_m(F, z) && \text{Repeat for } i \in \{2, \dots, m\}
\end{aligned}$$
Application: Regression and Loss Functions
• Suppose the loss function is $\ell(y, \hat{y}) = (y - \hat{y})^2$
• Fix $y \in [-1, 1]$. $\ell(y, \hat{y})$ is 4-Lipschitz in $\hat{y}$ for $\hat{y} \in [-1, 1]$
  • Can prove by looking at derivatives (checked numerically in the sketch below)
• Suppose $H \subseteq X \to \{-1, 1\}$, and let $\overline{H}$ be its convex closure
  • Each $h \in \overline{H}$ linearly interpolates between members of $H$
$$R_m(H, z) = \mathbb{E}_\sigma\left[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i h(z_i)\right] = \mathbb{E}_\sigma\left[\sup_{\bar{h} \in \overline{H}} \frac{1}{m}\sum_{i=1}^m \sigma_i \bar{h}(z_i)\right] = R_m(\overline{H}, z)$$
Interpolating can't increase the supremum!
• By the contraction inequality, for any $z \in (X \times [-1, 1])^m$:
$$R(\ell \circ \overline{H}, z) \le 4 R(\overline{H}, z)$$
• Discrete model → continuous model → regression bounds
• Usually easier to analyze the hypothesis class than the loss-function family
• Flexible approach: works for any $H$, any (Lipschitz) loss
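A small numeric check (illustrative grid; not from the slides) of the 4-Lipschitz claim for the square loss on $[-1, 1]$:

```python
import numpy as np

# Minimal sketch: numerically check that l(y, y_hat) = (y - y_hat)^2 with
# y, y_hat in [-1, 1] is 4-Lipschitz in y_hat (derivative 2|y_hat - y| <= 4).
y_hat_grid = np.linspace(-1, 1, 201)
max_slope = 0.0
for y in np.linspace(-1, 1, 201):
    loss = (y - y_hat_grid) ** 2
    slopes = np.abs(np.diff(loss) / np.diff(y_hat_grid))
    max_slope = max(max_slope, slopes.max())

print(f"max empirical slope: {max_slope:.3f} (bounded by 4)")
```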
Application: Regression and Loss Functions
[Figure: loss functions plotted against $y_\Delta = y - \hat{y}$ on $[-3, 3]$: square loss $\ell(y, \hat{y}) = y_\Delta^2$, absolute loss $\ell(y, \hat{y}) = |y_\Delta|$, log-cosh loss $\ell(y, \hat{y}) = 2\ln(\cosh(y_\Delta))$, and $\ell(y, \hat{y}) = \arctan(y_\Delta^2)$.]
Rademacher Averages in the Wild
• Used to refine the Vapnik-Chervonenkis inequality
  • Assume $H$ with VC dimension $d$
  • Classical: $\epsilon \in O\left(\sqrt{\frac{d\ln m}{m}}\right)$
    • Shown with the $\epsilon$-net theorem, or with Massart's lemma and shattering coefficients
  • Modern: $\epsilon \in O\left(\sqrt{\frac{d}{m}}\right)$
    • Shown with a refinement of discretization, called chaining
• Classification and regression problems
  • Support Vector Machines
  • Neural networks
  • Data mining problems
• Other ways to bound Rademacher averages
  • Covering numbers essentially refine Massart's lemma
  • Monte-Carlo methods approximate $R_m(F, z)$ directly
Conclusions
• Uniform convergence shows bounds for learning tasks
  • Expectation bounds with symmetrization
  • Tail bounds with martingale inequalities
• Vapnik-Chervonenkis Dimension
  • Distribution-free bounds
  • Limited to binary classification
• Rademacher averages
  • Data-dependent bounds
  • Continuous function families
  • Bound complicated function families with simple tools
    • Massart's finite class inequality
    • Talagrand's contraction inequality
    • Many other tools
  • Simpler than the $\epsilon$-net theorem