
Maria-Florina (Nina) Balcan

03/10/2018

Sample Complexity for Function Approximation. Model Selection.

Two Core Aspects of Machine Learning

• Algorithm Design. How to optimize? [Computation]
  Automatically generate rules that do well on observed data.
  E.g.: logistic regression, SVM, Adaboost, etc.

• Confidence Bounds, Generalization [(Labeled) Data]
  Confidence for rule effectiveness on future data.

PAC/SLT models for Supervised Classification

[Diagram: Data Source (Distribution D on X) → Expert / Oracle (c* : X → Y) → Labeled Examples (x1,c*(x1)),…, (xm,c*(xm)) → Learning Algorithm → Alg. outputs h : X → Y, e.g., a decision tree (x1 > 5? x6 > 2?) with leaves labeled +1 / -1.]

• Algo sees training sample S: (x1,c*(x1)),…, (xm,c*(xm)), xi i.i.d. from D
  – labeled examples, drawn i.i.d. from D and labeled by target c*
  – labels ∈ {-1,1}: binary classification

• Algo does optimization over S, finds hypothesis h.

• Goal: h has small error over D:

  errD(h) = Pr_{x∼D}( h(x) ≠ c*(x) )

  (a small simulation illustrating errS vs. errD appears after this slide)

[Figure: instance space X with + / - examples; the shaded region is where h and c* disagree.]

PAC/SLT models for Supervised Learning

• X – feature/instance space; distribution D over X
  e.g., X = R^d or X = {0,1}^d

• Fix hypothesis space H [whose complexity is not too large]
  – Realizable: c* ∈ H.
  – Agnostic: c* “close to” H.
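To make the errS / errD distinction concrete, here is a minimal simulation sketch; the distribution D, target c*, and hypothesis h below are made-up choices for illustration, not anything from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example: X = R, D = uniform on [0, 1], target c*(x) = sign(x - 0.5).
def c_star(x):
    return np.where(x > 0.5, 1, -1)

# A (deliberately imperfect) hypothesis h(x) = sign(x - 0.4).
def h(x):
    return np.where(x > 0.4, 1, -1)

def empirical_error(hyp, x, y):
    """errS(h): fraction of sample points the hypothesis mislabels."""
    return np.mean(hyp(x) != y)

# Training sample S of m i.i.d. draws from D, labeled by c*.
m = 50
x_train = rng.uniform(0, 1, size=m)
y_train = c_star(x_train)

# Approximate errD(h) = Pr_{x~D}[h(x) != c*(x)] with a large fresh sample.
x_big = rng.uniform(0, 1, size=1_000_000)
print("errS(h) on m=50 points:", empirical_error(h, x_train, y_train))
print("errD(h) (Monte Carlo): ", empirical_error(h, x_big, c_star(x_big)))
# errD(h) is 0.1 here (h and c* disagree on (0.4, 0.5]); errS fluctuates around it.
```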

Sample Complexity for Supervised Learning

Realizable Case, finite H

Consistent Learner
• Input: S: (x1,c*(x1)),…, (xm,c*(xm))
• Output: Find h in H consistent with S (if one exists).

Theorem: m ≥ (1/𝜖)·( ln|H| + ln(1/𝛿) ) labeled examples are sufficient so that, with probability at least 1 − 𝛿 over the draw of the m training examples, every h ∈ H consistent with S has errD(h) ≤ 𝜖.

Note: the probability is over different samples of m training examples, and the bound is linear in 1/𝜖.
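A tiny sketch that just plugs numbers into the finite-H realizable bound above; the function name and example values are illustrative:

```python
import math

def m_realizable_finite_H(H_size, eps, delta):
    """Sample size sufficient (finite H, realizable case) so that w.p. >= 1 - delta
    every h in H consistent with S has errD(h) <= eps."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

# e.g., |H| = 2**20 hypotheses, eps = 0.05, delta = 0.01:
print(m_realizable_finite_H(2**20, 0.05, 0.01))   # ~370 examples
```

Note that the requirement is linear in 1/𝜖 but only logarithmic in |H| and 1/𝛿.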

Sample Complexity: Infinite Hypothesis Spaces

Realizable Case

Theorem: m = O( (1/𝜖)·( VCdim(H)·ln(1/𝜖) + ln(1/𝛿) ) ) labeled examples are sufficient so that, with probability at least 1 − 𝛿, every h ∈ H consistent with S has errD(h) ≤ 𝜖.

E.g., H = linear separators in R^d: VCdim(H) = d + 1, so the sample complexity is linear in d.

So, if we double the number of features, we only need roughly twice the number of samples to do well.

[Figure: + / - points separated by a linear separator in the plane.]
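To illustrate the "double the features, roughly double the samples" remark, a small sketch that evaluates the realizable VC bound for linear separators in R^d; the leading constant C is a placeholder, since the bound only fixes the order of growth:

```python
import math

def m_realizable_vc(vcdim, eps, delta, C=1.0):
    """Sample size from the realizable VC bound m = O((1/eps)(VCdim*ln(1/eps) + ln(1/delta))).
    C stands in for the unspecified constant in the O(.)."""
    return math.ceil(C / eps * (vcdim * math.log(1 / eps) + math.log(1 / delta)))

eps, delta = 0.05, 0.01
for d in (10, 20, 40):                     # linear separators in R^d: VCdim = d + 1
    print(d, m_realizable_vc(d + 1, eps, delta))
# Doubling d roughly doubles the sufficient sample size
# (the ln(1/delta) term is lower order).
```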

Sample Complexity: Uniform Convergence

Agnostic Case

Empirical Risk Minimization (ERM)
• Input: S: (x1,c*(x1)),…, (xm,c*(xm))
• Output: Find h in H with smallest errS(h).

1/𝜖² dependence [as opposed to 1/𝜖 for the realizable case].

Sample Complexity: Finite Hypothesis Spaces

Agnostic Case

1) How many examples suffice to get uniform convergence w.h.p. (and hence success for ERM)?

2) Statistical Learning Theory style: with prob. at least 1 − 𝛿, for all h ∈ H:

   errD(h) ≤ errS(h) + √( ( ln(2|H|) + ln(1/𝛿) ) / (2m) )

1/𝜖² dependence [as opposed to 1/𝜖 for the realizable case], but we get something stronger: a bound that holds for every h ∈ H, with the gap shrinking as 1/√m as opposed to 1/m in the realizable case.
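For reference, a short derivation sketch (the standard argument; it is not written out in the extracted text) showing where the bound above comes from: Hoeffding's inequality for a single hypothesis plus a union bound over the finite class H.

```latex
% Sketch of the finite-H agnostic bound: Hoeffding's inequality + a union bound.
\begin{align*}
&\text{Fix } h \in H.\ err_S(h) \text{ is an average of } m \text{ i.i.d. } \{0,1\}\text{-variables with mean } err_D(h), \text{ so}\\
&\qquad \Pr\bigl[\,|err_D(h) - err_S(h)| > \epsilon\,\bigr] \le 2e^{-2m\epsilon^2} \quad\text{(Hoeffding)}.\\
&\text{Union bound over all } h \in H:\quad
  \Pr\bigl[\exists h \in H:\ |err_D(h) - err_S(h)| > \epsilon\bigr] \le 2|H|\,e^{-2m\epsilon^2}.\\
&\text{Set } 2|H|\,e^{-2m\epsilon^2} = \delta \;\Rightarrow\;
  \epsilon = \sqrt{\tfrac{1}{2m}\bigl(\ln(2|H|) + \ln\tfrac{1}{\delta}\bigr)},\\
&\text{so with prob. } \ge 1-\delta,\ \forall h \in H:\quad
  err_D(h) \le err_S(h) + \sqrt{\tfrac{1}{2m}\bigl(\ln(2|H|) + \ln\tfrac{1}{\delta}\bigr)}.
\end{align*}
```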

Sample Complexity: Infinite Hypothesis Spaces

Agnostic Case

1) How many examples suffice to get uniform convergence w.h.p. (and hence success for ERM)?

2) Statistical Learning Theory style: with prob. at least 1 − 𝛿, for all h ∈ H:

   errD(h) ≤ errS(h) + O( √( (1/(2m)) · ( VCdim(H)·ln( em / VCdim(H) ) + ln(1/𝛿) ) ) )

VCdimension Generalization Bounds

With prob. at least 1 − 𝛿, for all h ∈ H:

   errD(h) ≤ errS(h) + O( √( (1/(2m)) · ( VCdim(H)·ln( em / VCdim(H) ) + ln(1/𝛿) ) ) )

VC bounds: distribution independent bounds

• Generic: hold for any concept class and any distribution. [Nearly tight in the worst case over the choice of D.]

• Might be very loose for specific distributions that are more benign than the worst case.

• Hold only for binary classification; we want bounds for function approximation in general (e.g., multiclass classification and regression).
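To see the 1/√m behaviour of the VC bound concretely, a tiny sketch that evaluates its deviation term for a few sample sizes; the constant C is a placeholder because the bound is only stated up to O(·):

```python
import math

def vc_agnostic_gap(vcdim, m, delta, C=1.0):
    """The sqrt((1/(2m))(VCdim*ln(em/VCdim) + ln(1/delta))) term from the agnostic VC bound,
    with an unspecified constant C standing in for the O(.)."""
    return C * math.sqrt((vcdim * math.log(math.e * m / vcdim) + math.log(1 / delta)) / (2 * m))

for m in (1_000, 10_000, 100_000):
    print(m, round(vc_agnostic_gap(vcdim=11, m=m, delta=0.05), 3))
# The gap shrinks roughly like 1/sqrt(m) (up to the log factor),
# versus roughly 1/m in the realizable case.
```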

Rademacher Complexity: Binary Classification

Theorem: For any H, any distr. D, w.h.p. ≥ 1 − 𝛿 all h ∈ H satisfy:

   errD(h) ≤ errS(h) + Rm(H) + 3·√( ln(2/𝛿) / (2m) )        [generalization bound]

H = {h: X → Y} hyp. space (e.g., lin. sep.), F = L(H), d = VCdim(H):

Fact: RS(F) ≤ √( ln(2|H[S]|) / m ).

So, by Sauer’s lemma, RS(F) ≤ √( 2d·ln(em/d) / m ), and hence

   errD(h) ≤ errS(h) + √( 2d·ln(em/d) / m ) + 3·√( ln(2/𝛿) / (2m) ).

Many more uses!!! Margin bounds for SVM, boosting, regression bounds, etc.
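A small numerical sketch (not from the slides) of the empirical Rademacher complexity RS(H) = E_σ[ max_{h∈H} (1/m) Σ_i σ_i h(x_i) ], estimated by Monte Carlo over random sign vectors σ; the hypothesis set (threshold classifiers on [0,1]) is a made-up illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed sample S of m real-valued points and a finite hypothesis set:
# threshold classifiers h_t(x) = sign(x - t) for 51 thresholds t (made-up example).
m = 200
S = rng.uniform(0, 1, size=m)
thresholds = np.linspace(0, 1, 51)
predictions = np.where(S[None, :] > thresholds[:, None], 1, -1)   # shape (51, m)

def empirical_rademacher(predictions, n_trials=2000, rng=rng):
    """Monte Carlo estimate of RS(H) = E_sigma[ max_h (1/m) sum_i sigma_i h(x_i) ]."""
    m = predictions.shape[1]
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1, 1], size=m)        # random Rademacher (coin-flip) labels
        total += np.max(predictions @ sigma) / m   # best correlation any h in H achieves
    return total / n_trials

print("estimated RS(H):", empirical_rademacher(predictions))
# One-sided thresholds have VCdim = 1, so this estimate is small
# (on the order of sqrt(ln m / m)), and the bound
#   errD(h) <= errS(h) + Rm(H) + 3*sqrt(ln(2/delta)/(2m))
# adds only a small complexity penalty.
```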

Can we use our bounds for model selection?

True Error, Training Error, Overfitting

[Figure: error vs. complexity; the train error curve keeps decreasing as complexity grows, while the generalization error curve eventually turns back up.]

errD(h) ≤ errS(h) + Rm(H) + …

Model selection: trade-off between decreasing training error and keeping H simple.

Structural Risk Minimization (SRM)

𝐻1 ⊆ 𝐻2 ⊆ 𝐻3 ⊆ ⋯ ⊆ 𝐻𝑖 ⊆ …

[Figure: error rate vs. hypothesis complexity; empirical error decreases with complexity while the true error eventually rises again (overfitting).]

What happens if we increase m?

The black curve will stay close to the red curve for longer; everything shifts to the right.


Structural Risk Minimization (SRM)

• 𝐻1 ⊆ 𝐻2 ⊆ 𝐻3 ⊆ ⋯ ⊆ 𝐻𝑖 ⊆ …

• hk = argmin_{h∈Hk} { errS(h) }

• k̂ = argmin_{k≥1} { errS(hk) + complexity(Hk) }

• Output ĥ = h_k̂   (a toy sketch of this rule appears after the proof below)

As k increases, errS(hk) goes down but the complexity term goes up.

Claim: W.h.p., errD(ĥ) ≤ min_{k*} [ min_{h*∈Hk*} errD(h*) + 2·complexity(Hk*) ]

Proof: Fix any k* and any h* ∈ Hk*.
• We chose ĥ s.t. errS(ĥ) + complexity(Hk̂) ≤ errS(h*) + complexity(Hk*).
• Whp, errD(ĥ) ≤ errS(ĥ) + complexity(Hk̂).
• Whp, errS(h*) ≤ errD(h*) + complexity(Hk*).
• Chaining these: errD(ĥ) ≤ errS(h*) + complexity(Hk*) ≤ errD(h*) + 2·complexity(Hk*).
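A minimal sketch of this SRM rule on a toy problem. The nested family (sign of a degree-k polynomial, fit by least squares as a stand-in for ERM) and the penalty complexity(Hk) = √((VCdim(Hk) + ln(1/𝛿))/m) with VCdim(Hk) ≈ k + 1 are illustrative assumptions, not the lecture's choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1-D inputs, labels from a cubic-threshold target plus 10% label noise.
m = 300
x = rng.uniform(-1, 1, size=m)
y = np.where(x**3 - 0.2 * x > 0, 1, -1)
y[rng.random(m) < 0.1] *= -1

def fit_hk(x, y, k):
    """ERM-ish fit inside H_k: sign of a degree-k least-squares polynomial (illustrative stand-in)."""
    coeffs = np.polyfit(x, y, deg=k)
    return lambda t: np.where(np.polyval(coeffs, t) >= 0, 1, -1)

def err(h, x, y):
    return np.mean(h(x) != y)

def complexity(k, m, delta=0.05):
    # Assumed penalty of the form sqrt((VCdim(H_k) + ln(1/delta)) / m), with VCdim(H_k) ~ k + 1.
    return np.sqrt(((k + 1) + np.log(1 / delta)) / m)

# SRM: h_k = argmin_{h in H_k} errS(h); pick k minimizing errS(h_k) + complexity(H_k).
scores = []
for k in range(1, 11):
    hk = fit_hk(x, y, k)
    scores.append((err(hk, x, y) + complexity(k, m), k))
best_score, best_k = min(scores)
print("SRM picks degree k =", best_k)
```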

Techniques to Handle Overfitting

• Cross Validation:
  – Hold out part of the training data and use it as a proxy for the generalization error.

• Structural Risk Minimization (SRM): minimize the generalization bound.
  – 𝐻1 ⊆ 𝐻2 ⊆ ⋯ ⊆ 𝐻𝑖 ⊆ …
  – ĥ = argmin_{k≥1} { errS(hk) + complexity(Hk) }
  – Often computationally hard…
  – A nice case where it is possible: M. Kearns, Y. Mansour, ICML’98, “A Fast, Bottom-Up Decision Tree Pruning Algorithm with Near-Optimal Generalization”.

• Regularization: a general family closely related to SRM.
  – Minimizes expressions of the form: errS(h) + λ‖h‖²
  – ‖·‖ is some norm when H is a vector space (e.g., the L2 norm); λ is picked through cross validation.
  – E.g., SVM, regularized logistic regression, etc.

(A sketch combining regularization with hold-out selection of λ follows below.)
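A short sketch combining the last two ideas: minimize a regularized objective of the form loss + λ‖w‖² (here a ridge-regularized least-squares surrogate, solved in closed form) and pick λ on a held-out split used as a proxy for the generalization error. The data, λ grid, and surrogate loss are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary data: labels in {-1, +1} from a noisy linear rule (illustrative).
n, d = 400, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true + 0.5 * rng.normal(size=n))

# Hold out part of the training data as a proxy for the generalization error.
n_train = 300
X_tr, y_tr = X[:n_train], y[:n_train]
X_ho, y_ho = X[n_train:], y[n_train:]

def fit_regularized(X, y, lam):
    """argmin_w  (1/n)||Xw - y||^2 + lam * ||w||^2  (ridge surrogate; closed form)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def zero_one_error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

# Pick lambda by hold-out validation.
lambdas = [10.0, 1.0, 0.1, 0.01, 0.001]
best_lam = min(lambdas,
               key=lambda lam: zero_one_error(fit_regularized(X_tr, y_tr, lam), X_ho, y_ho))
w_hat = fit_regularized(X_tr, y_tr, best_lam)
print("chosen lambda:", best_lam, " hold-out error:", zero_one_error(w_hat, X_ho, y_ho))
```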

What you should know

• Shattering, VC dimension as a measure of complexity, Sauer’s lemma, form of the VC bounds (upper and lower bounds).

• Notion of sample complexity.

• Understand reasoning behind the simple sample complexity bound for finite H [exam question!].

• Model Selection, Structural Risk Minimization.

• Rademacher Complexity.

Recommended