CS260: Machine Learning Algorithms
Lecture 7: VC Dimension

Cho-Jui Hsieh
UCLA

Jan 30, 2019


Reducing M to a finite number


Where did the M come from?

The bad events Bm: “|Etr(hm) − E(hm)| > ε”, each with probability ≤ 2e^(−2ε²N)

The union bound:

P[B1 or B2 or · · · or BM]
  ≤ P[B1] + P[B2] + · · · + P[BM]   (worst case: no overlaps)
  ≤ 2M e^(−2ε²N)
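To make the numbers concrete, here is a small Monte Carlo sketch (my own illustration, not from the slides; the constants M, N, ε and the coin-flip “hypotheses” are all assumptions) comparing the empirical probability of some bad event with the union bound 2M e^(−2ε²N).

```python
import numpy as np

# Monte Carlo check of the union bound (illustrative setup, not from the lecture).
# Each "hypothesis" h_m is a fair coin: E(h_m) = 0.5, Etr(h_m) = sample mean of N flips.
rng = np.random.default_rng(0)
M, N, eps, trials = 100, 200, 0.15, 2000

bad = 0
for _ in range(trials):
    flips = rng.integers(0, 2, size=(M, N))     # M hypotheses, N samples each
    dev = np.abs(flips.mean(axis=1) - 0.5)      # |Etr(h_m) - E(h_m)|
    bad += np.any(dev > eps)                    # event B_1 or B_2 or ... or B_M

print("empirical P[B_1 or ... or B_M] :", bad / trials)
print("union bound 2*M*exp(-2*eps^2*N):", 2 * M * np.exp(-2 * eps**2 * N))
```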


Can we improve on M?

The events |Etr(h1) − E(h1)| > ε and |Etr(h2) − E(h2)| > ε largely overlap.


What can we replace M with?

Instead of the whole input space, let’s consider a finite set of input points.
How many patterns of colors (i.e., labelings of the points) can you get?



Dichotomies: mini-hypotheses

A hypothesis: h : X → {−1, +1}
A dichotomy: h : {x1, x2, · · · , xN} → {−1, +1}

Number of hypotheses |H| can be infinite

Number of dichotomies |H(x1, x2, · · · , xN)|: at most 2^N

⇒ Candidate for replacing M



The growth function

The growth function counts the most dichotomies on any N points:

mH(N) = max_{x1,··· ,xN ∈ X} |H(x1, · · · , xN)|

The growth function satisfies:

mH(N) ≤ 2^N



Growth function for linear classifiers

Compute mH(3) in 2-D space when H is the perceptron (linear hyperplanes)

What’s |H(x1, x2, x3)|?


mH(3) = 8


It doesn’t matter if some placements of the three points (e.g., collinear points) give fewer than 8 dichotomies, because the growth function only counts the most dichotomies over all placements



Growth function for linear classifiers

What’s mH(4)?

(At least) missing two dichotomies:

mH(4) = 14 < 2^4

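These counts are easy to verify by brute force. The sketch below (my own, assuming NumPy and SciPy are available) tests every labeling of a point set for linear separability with a small feasibility LP; for three non-collinear points it finds 8 dichotomies and for four points in general position it finds 14.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Is the labeling y in {-1,+1}^n realizable by sign(w.x + b)?
    Feasibility LP: y_i * (w.x_i + b) >= 1 for all i."""
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # rows: -y_i * (x_i, 1)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def num_dichotomies(X):
    return sum(separable(X, np.array(y))
               for y in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(num_dichotomies(three), num_dichotomies(four))   # 8 14
```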


Example I: positive rays


Example II: positive intervals


Example III: convex sets

H is the set of h : R² → {−1, +1} such that the region {x : h(x) = +1} is convex

How many dichotomies can we generate?


mH(N) = 2^N for any N (e.g., place the N points on a circle; for any labeling, a convex region containing exactly the +1 points realizes it)

⇒ We say the N points are “shattered” by H


The 3 growth functions

H is positive rays: mH(N) = N + 1

H is positive intervals: mH(N) = (1/2)N² + (1/2)N + 1

H is convex sets: mH(N) = 2^N
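These formulas can be checked by direct enumeration; below is a minimal sketch (my own, with NumPy) that counts the distinct dichotomies positive rays and positive intervals produce on N random points on the line and compares them with N + 1 and (1/2)N² + (1/2)N + 1.

```python
import numpy as np

def ray_dichotomies(x):
    """Positive rays h(x) = sign(x - a): label +1 to the right of a threshold a."""
    x = np.sort(x)
    cuts = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
    return {tuple(np.where(x > a, 1, -1)) for a in cuts}

def interval_dichotomies(x):
    """Positive intervals: label +1 inside (a, b), -1 outside."""
    x = np.sort(x)
    cuts = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
    return {tuple(np.where((x > a) & (x < b), 1, -1))
            for i, a in enumerate(cuts) for b in cuts[i:]}

for N in (1, 2, 3, 5, 8):
    x = np.random.default_rng(N).random(N)
    print(N, len(ray_dichotomies(x)), N + 1,                   # matches N + 1
          len(interval_dichotomies(x)), N * (N + 1) // 2 + 1)  # matches N^2/2 + N/2 + 1
```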


What’s next?

Remember the inequality

P[|Ein − Eout| > ε] ≤ 2M e^(−2ε²N)

What happens if we replace M by mH(N)?

mH(N) polynomial ⇒ Good!

How to show mH(N) is polynomial?


When will mH(N) be polynomial?


Break point of H

If no data set of size k can be shattered by H, then k is a break point for H:

mH(k) < 2^k

VC dimension of H: the most points H can shatter; if k is the smallest break point, dVC = k − 1

For the 2-D perceptron: k = 4, VC dimension = 3


Break point - examples

Positive rays: mH(N) = N + 1 ⇒ break point k = 2, dVC = 1

Positive intervals: mH(N) = (1/2)N² + (1/2)N + 1 ⇒ break point k = 3, dVC = 2

Convex sets: mH(N) = 2^N ⇒ break point k = ∞, dVC = ∞


We will show

No break point ⇒ mH(N) = 2^N

Any break point ⇒ mH(N) is polynomial in N


Puzzle

Break point is k = 2: how many dichotomies can we write down on a set of points before some pair of points gets shattered?


Bounding mH(N)

Key quantity:

B(N, k): the maximum number of dichotomies on N points, given break point k

If the hypothesis space has break point k, then

mH(N) ≤ B(N, k)


Recursive bound on B(N, k)

For any “valid” set of dichotomies (a table of distinct rows achieving B(N, k)), reorganize the rows into:
S1: rows whose pattern on x1, · · · , xN−1 appears only once (α rows)
S2 = S2+ ∪ S2−: rows whose pattern on x1, · · · , xN−1 appears twice, once with xN = +1 and once with xN = −1 (β rows in each half)

B(N, k) = α + 2β


Focus on the x1, x2, · · · , xN−1 columns: the α + β patterns appearing there are all distinct, so

α + β ≤ B(N − 1, k)


Now focus on the S2 = S2+ ∪ S2− rows, restricted to the x1, · · · , xN−1 columns: if these β patterns shattered some k − 1 points, adding xN (which appears with both labels) would shatter k points, so

β ≤ B(N − 1, k − 1)


B(N, k) = α + β + β ≤ B(N − 1, k) + B(N − 1, k − 1)

What’s the upper bound for B(N, k)?


Analytic solution for the B(N, k) bound

B(N, k) is upper bounded by C(N, k):

C(N, 1) = 1,  N = 1, 2, · · ·
C(1, k) = 2,  k = 2, 3, · · ·
C(N, k) = C(N − 1, k) + C(N − 1, k − 1)

Theorem: C(N, k) = Σ_{i=0}^{k−1} (N choose i)

Boundary conditions: easy to check

Induction (count the ways to select fewer than k out of N items):

Σ_{i=0}^{k−1} (N choose i)   [select < k from N items]
  =  Σ_{i=0}^{k−1} (N−1 choose i)   [N-th item not chosen]
  +  Σ_{i=0}^{k−2} (N−1 choose i)   [N-th item chosen]
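The recursion and the closed form are easy to verify numerically; here is a minimal sketch (my own, standard library only).

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def C(N, k):
    """C(N, k) defined by the boundary conditions and recursion above."""
    if k == 1:
        return 1
    if N == 1:
        return 2
    return C(N - 1, k) + C(N - 1, k - 1)

for N in range(1, 12):
    for k in range(1, 7):
        assert C(N, k) == sum(comb(N, i) for i in range(k)), (N, k)
print("C(N, k) = sum_{i=0}^{k-1} (N choose i) for all tested N, k")
```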


It is polynomial!

For a given H, the break point k is fixed:

mH(N) ≤ Σ_{i=0}^{k−1} (N choose i), a polynomial in N of degree k − 1


H is positive rays: (break point k = 2)

mH(N) = N + 1


H is 2D perceptrons: (break point k = 4)


mH(N) ≤ (1/6)N³ + (5/6)N + 1


Replace M by mH(N)

Original bound:

P[∃h ∈ H s.t. |Etr(h) − E(h)| > ε] ≤ 2M e^(−2ε²N)

Replace M by mH(N):

P[∃h ∈ H s.t. |Etr(h) − E(h)| > ε]   (the BAD event)   ≤   2 · 2 mH(2N) · e^(−(1/8)ε²N)

Vapnik-Chervonenkis (VC) bound


VC Dimension


Definition

The VC dimension of a hypothesis set H, denoted by dVC(H), is

the largest value of N for which mH(N) = 2^N

“the most points H can shatter”

N ≤ dVC(H) ⇒ H can shatter some set of N points

k > dVC(H) ⇒ H cannot shatter any set of k points (k is a break point)

The smallest break point is one above the VC dimension: k = dVC(H) + 1


The growth function

In terms of a break point k:

mH(N) ≤ Σ_{i=0}^{k−1} (N choose i)

In terms of the VC dimension dVC:

mH(N) ≤ Σ_{i=0}^{dVC} (N choose i)


VC dimension of linear classifiers

For d = 2, dVC = 3

What if d > 2?

In general, dVC = d + 1

We will prove dVC ≥ d + 1 and dVC ≤ d + 1


To prove dVC ≥ d + 1

Exhibit a set of N = d + 1 points in R^d that is shattered by linear hyperplanes

Stack the points (each with a leading 1 for the bias coordinate) as the rows of a (d + 1) × (d + 1) matrix X, and choose the points so that X is invertible!


Can we shatter the dataset?

For any y = (y1, y2, · · · , yd+1) with each yi = ±1, can we find w satisfying

sign(Xw) = y ?

Easy! Just set w = X^(−1) y

So, dVC ≥ d + 1
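Here is a concrete version of this construction in NumPy (my own sketch; the particular point set, the origin plus the standard basis vectors with a leading bias coordinate of 1, is an assumption since the slide's figure is not reproduced in the text).

```python
import itertools
import numpy as np

d = 3
pts = np.vstack([np.zeros(d), np.eye(d)])      # d+1 points in R^d: origin and basis vectors
X = np.hstack([np.ones((d + 1, 1)), pts])      # prepend bias coordinate; X is (d+1) x (d+1)
assert np.linalg.matrix_rank(X) == d + 1       # X is invertible

for y in itertools.product([-1.0, 1.0], repeat=d + 1):
    y = np.array(y)
    w = np.linalg.solve(X, y)                  # w = X^{-1} y
    assert np.array_equal(np.sign(X @ w), y)   # this dichotomy is realized

print(f"all {2 ** (d + 1)} dichotomies realized, so dVC >= d + 1 = {d + 1}")
```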


To show dVC ≤ d + 1, we need to show

We cannot shatter any set of d + 2 points

For any d + 2 points

x1, x2, · · · , xd+1, xd+2

More points than dimensions ⇒ linearly dependent:

xj = Σ_{i≠j} ai xi ,  where not all of the ai are zero


Now we construct a dichotomy that cannot be generated:

yi = sign(ai) for i ≠ j ,   yj = −1

For all i ≠ j, suppose some w produces these labels: sign(ai) = sign(wTxi) ⇒ ai wTxi > 0

For the j-th point,

wTxj = Σ_{i≠j} ai wTxi > 0

Therefore, yj = sign(wTxj) = +1 (cannot be −1)
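This argument can also be checked numerically. The sketch below (my own, assuming NumPy and SciPy) draws d + 2 random points, computes the dependence coefficients ai, builds the forbidden labeling from the proof, and confirms with a feasibility LP that no linear classifier produces it.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d = 4
X = np.hstack([np.ones((d + 2, 1)), rng.standard_normal((d + 2, d))])  # d+2 augmented points in R^{d+1}

# d+2 vectors in R^{d+1} are linearly dependent: find v != 0 with sum_i v_i x_i = 0.
v = np.linalg.svd(X.T)[2][-1]          # null-space direction of X^T
j = int(np.argmax(np.abs(v)))
a = -v / v[j]                          # x_j = sum_{i != j} a_i x_i  (and a_j = -1)
y = np.sign(a)
y[j] = -1.0                            # the dichotomy the proof says cannot be generated

# Feasibility LP: is there a w with y_i * (w . x_i) >= 1 for all i?
res = linprog(c=np.zeros(d + 1), A_ub=-y[:, None] * X, b_ub=-np.ones(d + 2),
              bounds=[(None, None)] * (d + 1))
print("forbidden dichotomy realizable?", res.success)   # expected: False
```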


Putting it together

We proved that for d-dimensional linear hyperplanes

dVC ≤ d + 1 and dVC ≥ d + 1 ⇒ dVC = d + 1

Number of parameters: w0, · · · , wd

d + 1 parameters!

Parameters create degrees of freedom


Examples

Positive rays: 1 parameter, dVC = 1

Positive intervals: 2 parameters, dVC = 2

The correspondence is not always exact · · · dVC measures the effective number of parameters


Number of data points needed

P[|Ein(g) − Eout(g)| > ε] ≤ 4 mH(2N) e^(−(1/8)ε²N)  =: δ

If we want certain ε and δ, how does N depend on dVC?

Need N^(dVC) e^(−N) to be a small value

N is almost linear in dVC
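To get a feel for the numbers, here is a small search (my own sketch; ε = 0.1 and δ = 0.1 are illustrative, and the polynomial bound Σ_{i≤dVC} (2N choose i) stands in for the exact mH(2N)).

```python
from math import comb, exp

def vc_bound(N, dvc, eps):
    m = sum(comb(2 * N, i) for i in range(dvc + 1))   # polynomial bound on mH(2N)
    return 4 * m * exp(-eps ** 2 * N / 8)

def samples_needed(dvc, eps=0.1, delta=0.1):
    N = 1
    while vc_bound(N, dvc, eps) > delta:
        N += 1
    return N

for dvc in (1, 2, 3, 5):
    print(dvc, samples_needed(dvc))   # N grows roughly linearly in dVC
```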


Regularization


The polynomial model

HQ : polynomials of order Q

HQ = { Σ_{q=0}^{Q} wq Lq(x) }

Linear regression in the Z space with

z = [1, L1(x), · · · , LQ(x)]


Unconstrained solution

Input (x1, y1), · · · , (xN, yN) → (z1, y1), · · · , (zN, yN)

Linear regression:

Minimize  Etr(w) = (1/N) Σ_{n=1}^{N} (wTzn − yn)²  =  (1/N)(Zw − y)T(Zw − y)

Solution: wtr = (ZTZ)^(−1)ZTy
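A minimal NumPy sketch of this pipeline (my own; plain powers x^q stand in for the slides' basis functions Lq(x), and the target function and noise level are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(1)
N, Q = 30, 10
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(N)   # noisy target (illustrative)

Z = np.vander(x, Q + 1, increasing=True)    # z_n = [1, x_n, x_n^2, ..., x_n^Q]
w_tr = np.linalg.solve(Z.T @ Z, Z.T @ y)    # w_tr = (Z^T Z)^{-1} Z^T y
print("Etr(w_tr) =", np.mean((Z @ w_tr - y) ** 2))
```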


Constraining the weights

Hard constraint: H2 is a constrained version of H10

(with wq = 0 for q > 2)

Soft-order constraint: Σ_{q=0}^{Q} wq² ≤ C

The problem given the soft-order constraint:

Minimize  (1/N)(Zw − y)T(Zw − y)   s.t.  wTw ≤ C   (a smaller hypothesis space)

Solution wreg instead of wtr


Equivalent to the unconstrained version

Constrained version:

min_w  Etr(w) = (1/N)(Zw − y)T(Zw − y)   s.t.  wTw ≤ C

Optimal when ∇Etr(wreg) ∝ −wreg

Why? If −∇Etr(w) and w are not parallel, we can move along the constraint boundary to decrease Etr(w) without violating the constraint


Assume ∇Etr(wreg) = −2 (λ/N) wreg

⇒ ∇Etr(wreg) + 2 (λ/N) wreg = 0

wreg is also the solution of the unconstrained problem

min_w  Etr(w) + (λ/N) wTw

(Ridge regression!)

C ↑  ⇔  λ ↓


Ridge regression solution

min_w  Ereg(w) = (1/N) ( (Zw − y)T(Zw − y) + λwTw )

∇Ereg(w) = 0 ⇒ ZT(Zw − y) + λw = 0

So, wreg = (ZTZ + λI)^(−1)ZTy (with regularization)

as opposed to wtr = (ZTZ)^(−1)ZTy (without regularization)
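In code, the regularized solution is one extra term in the normal equations. A self-contained sketch (my own; the data and λ values are illustrative), showing that larger λ shrinks ‖w‖:

```python
import numpy as np

def ridge(Z, y, lam):
    """w_reg = (Z^T Z + lam * I)^{-1} Z^T y; lam = 0 recovers w_tr."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 20)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(20)
Z = np.vander(x, 11, increasing=True)          # 10th-order polynomial features

for lam in (0.0, 0.01, 1.0):
    w = ridge(Z, y, lam)
    print(lam, float(np.linalg.norm(w)))       # larger lambda -> smaller ||w||
```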


The result

min_w  Etr(w) + (λ/N) wTw


Equivalent to “weight decay”

Consider the general case

min_w  Etr(w) + (λ/N) wTw

Gradient descent:

wt+1 = wt − η (∇Etr(wt) + 2 (λ/N) wt)
     = wt (1 − 2η λ/N) − η ∇Etr(wt)

The factor (1 − 2η λ/N) shrinks the weights at every step: “weight decay”
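The two forms of the update are algebraically identical; a quick sketch (my own, with an arbitrary least-squares Etr and illustrative constants) confirms they produce the same iterates.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 50, 5
Z = rng.standard_normal((N, d))
y = rng.standard_normal(N)
lam, eta = 0.5, 0.01

def grad_Etr(w):
    return 2.0 / N * Z.T @ (Z @ w - y)          # gradient of (1/N) ||Zw - y||^2

w1 = np.zeros(d)
w2 = np.zeros(d)
for _ in range(100):
    # gradient step on the regularized objective Etr(w) + (lam/N) w^T w
    w1 = w1 - eta * (grad_Etr(w1) + 2 * lam / N * w1)
    # same step written as weight decay: shrink the weights, then a plain gradient step
    w2 = w2 * (1 - 2 * eta * lam / N) - eta * grad_Etr(w2)

print(np.allclose(w1, w2))   # True
```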


Variations of weight decay

Emphasis of certain weights:

Σ_{q=0}^{Q} γq wq²

Example 1: γq = 2^q ⇒ low-order fit
Example 2: γq = 2^(−q) ⇒ high-order fit

General Tikhonov regularizer:

wTHw

with a positive semi-definite H


General form of regularizer

Calling the regularizer Ω = Ω(h), we minimize

Ereg(h) = Etr(h) + (λ/N) Ω(h)

In general, Ω(h) can be any measurement for the “size” of h


L2 vs L1 regularizer

L1-regularizer: Ω(w) = ‖w‖1 = Σ_q |wq|

Usually leads to a sparse solution

(only a few wq will be nonzero)
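A quick comparison (my own sketch, assuming scikit-learn is available; the data, alpha values, and threshold are illustrative) of how many weights survive under L2 versus L1 regularization:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                  # only 3 features actually matter
y = X @ w_true + 0.1 * rng.standard_normal(100)

l2 = Ridge(alpha=1.0).fit(X, y)
l1 = Lasso(alpha=0.1).fit(X, y)
print("nonzero weights with L2:", int(np.sum(np.abs(l2.coef_) > 1e-6)))   # typically all 20
print("nonzero weights with L1:", int(np.sum(np.abs(l1.coef_) > 1e-6)))   # typically ~3
```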


Conclusions

VC dimension

Regularization

Questions?