
Page 1

Support Vector Machines

Machine Learning, Spring 2020

The slides are mainly from Vivek Srikumar

1

Page 2

Big picture

2

Linear models

Page 3

Big picture

3

Linear models

How good is a learned model?

Page 4

Big picture

4

Linear models

How good is a learned model?

Mistake-driven learning

Perceptron

Page 5

Big picture

5

Linear models

How good is a learned model?

Mistake-driven learning

PAC, Agnostic learning

Perceptron

Page 6

Big picture

6

Linear models

How good is a learned model?

Mistake-driven learning

PAC, Agnostic learning

Perceptron, Support Vector Machines

The Risk Minimization Principle

Page 7

Big picture

7

Linear models

How good is a learned model?

Mistake-driven learning

PAC, Agnostic learning

Perceptron, Support Vector Machines

…

…

The Risk Minimization Principle

Page 8

This lecture: Support vector machines

• Training by maximizing margin

• The SVM objective

• Solving the SVM optimization problem

• Support vectors, duals and kernels

8

Page 9

This lecture: Support vector machines

• Training by maximizing margin

• The SVM objective

• Solving the SVM optimization problem

• Support vectors, duals and kernels

9

Page 10

VC dimensions and linear classifiers

What we know so far:
1. If we have m examples, then with probability 1 - 𝛿, the true error of a hypothesis h with training error errS(h) is bounded by

Generalization error ≤ training error + a function of the VC dimension (low VC dimension gives a tighter bound)

10

Page 11

VC dimensions and linear classifiers

What we know so far:
1. If we have m examples, then with probability 1 - 𝛿, the true error of a hypothesis h with training error errS(h) is bounded by

Generalization error ≤ training error + a function of the VC dimension (low VC dimension gives a tighter bound)

2. VC dimension of a linear classifier in d dimensions = d + 1

11

Page 12

VC dimensions and linear classifiers

What we know so far:
1. If we have m examples, then with probability 1 - 𝛿, the true error of a hypothesis h with training error errS(h) is bounded by

Generalization error ≤ training error + a function of the VC dimension (low VC dimension gives a tighter bound)

2. VC dimension of a linear classifier in d dimensions = d + 1

12

But are all linear classifiers the same?

Page 13

Recall: Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

13

[Figure: positive (+) and negative (−) examples separated by a hyperplane]

Page 14

Recall: Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

14

[Figure: positive (+) and negative (−) examples separated by a hyperplane]

Margin with respect to this hyperplane

Page 15

15

Which hyperplane is better?

h1

h2

Recall: Margin

Page 16

16

Which hyperplane is better?

h1

h2

The farther the hyperplane is from the data points, the smaller the chance of making a wrong prediction

Recall: Margin

Page 17

Data dependent VC dimension

• Intuitively, larger margins are better

• Suppose we only consider linear classifiers with margins 𝛾1 and 𝛾2 for the training dataset
– H1 = linear classifiers that have a margin at least 𝛾1

– H2 = linear classifiers that have a margin at least 𝛾2

– And 𝛾1 > 𝛾2

• The entire set of functions H1 is “better”

17

Page 18

Data dependent VC dimension

Theorem (Vapnik):
– Let H be the set of linear classifiers that separate the training set by a margin of at least 𝛾
– Then VC(H) ≤ min(R²/𝛾², d) + 1
– R is the radius of the smallest sphere containing the data

18

Page 19

Data dependent VC dimension

Theorem (Vapnik):
– Let H be the set of linear classifiers that separate the training set by a margin of at least 𝛾
– Then VC(H) ≤ min(R²/𝛾², d) + 1
– R is the radius of the smallest sphere containing the data

Larger margin ⟹ Lower VC dimension

Lower VC dimension ⟹ Better generalization bound

19

Page 20

Learning strategy

Find the linear classifier that maximizes the margin

20

Page 21

This lecture: Support vector machines

• Training by maximizing margin

• The SVM objective

• Solving the SVM optimization problem

• Support vectors, duals and kernels

21

Page 22

Recall: The geometry of a linear classifier

22

Prediction = sgn(b + w1x1 + w2x2)

[Figure: positive (+) and negative (−) examples separated by the line b + w1x1 + w2x2 = 0]

b + w1x1 + w2x2 = 0

Page 23

Recall: The geometry of a linear classifier

23

Prediction = sgn(b + w1x1 + w2x2)

[Figure: positive (+) and negative (−) examples separated by the line b + w1x1 + w2x2 = 0]

b + w1x1 + w2x2 = 0

We only care about the sign, not the magnitude

Page 24

Recall: The geometry of a linear classifier

24

Prediction = sgn(b + w1x1 + w2x2)

b + w1x1 + w2x2 = 0
2b + 2w1x1 + 2w2x2 = 0
1000b + 1000w1x1 + 1000w2x2 = 0

[Figure: all three equations describe the same separating line]

We only care about the sign, not the magnitude
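A quick numeric check of this point (a made-up example, not from the slides): scaling (w, b) by any positive constant changes the magnitude of wTx + b but never its sign, so the predictions, and hence the separating hyperplane, stay the same.

```python
import numpy as np

# Illustrative weights and points; any positive rescaling of (w, b)
# leaves the signs, and therefore the predictions, unchanged.
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 1.0], [-2.0, 0.5], [0.3, 3.0]])

for scale in [1, 2, 1000]:
    scores = X @ (scale * w) + scale * b
    print(scale, np.sign(scores))
# All three lines print the same sign pattern; only the magnitudes differ.
```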

Page 25

Maximizing margin

• Margin = distance of the closest point from the hyperplane

• We want max_w 𝛾

25

Some people call this the geometric margin

The numerator alone is called the functional margin

Page 26

Maximizing margin

• Margin = distance of the closest point from the hyperplane

• We want maxw,b 𝛾

26

Some people call this the geometric margin

The numerator alone is called the functional margin
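Concretely, the geometric margin of a hyperplane (w, b) on a dataset is min_i yi(wTxi + b) / ‖w‖, and the numerator yi(wTxi + b) is the functional margin of example i. A minimal sketch that computes both on made-up data (the data and function names are illustrative, not from the slides):

```python
import numpy as np

def functional_margins(w, b, X, y):
    """Per-example functional margin y_i * (w^T x_i + b)."""
    return y * (X @ w + b)

def geometric_margin(w, b, X, y):
    """Smallest functional margin divided by ||w||: the distance from the
    hyperplane to the closest point (assuming the data is separated)."""
    return functional_margins(w, b, X, y).min() / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

print(functional_margins(w, b, X, y))  # [4. 4. 2. 2.]
print(geometric_margin(w, b, X, y))    # 2 / sqrt(2), about 1.414
```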

Page 27

Recall: The geometry of a linear classifier

27

Prediction = sgn(b + w1x1 + w2x2)

b + w1x1 + w2x2 = 0

We only care about the sign, not the magnitude

[Figure: positive (+) and negative (−) examples separated by the line b + w1x1 + w2x2 = 0]

Page 28

Recall: The geometry of a linear classifier

28

Prediction = sgn(b + w1x1 + w2x2)

b + w1x1 + w2x2 = 0

We only care about the sign, not the magnitude

[Figure: positive (+) and negative (−) examples separated by the line b + w1x1 + w2x2 = 0]

We have the freedom to scale w and b up or down so that the numerator is 1.

Page 29

Maximizing margin

• Margin = distance of the closest point from the hyperplane

• We want max_{w,b} 𝛾
• We only care about the sign of wTxi + b in the end and not the magnitude
– Set the minimum functional margin to be 1 (restrict w, b to be in a small set)

29

max_{w,b} 𝛾 is equivalent to max_{w,b} 1/‖w‖ in this setting
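Spelling out the equivalence (standard reasoning, not shown explicitly on the slide): once the closest example's functional margin is fixed to 1, the geometric margin is exactly 1/‖w‖, so

$$\gamma = \min_i \frac{y_i(\mathbf{w}^\top\mathbf{x}_i + b)}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}
\quad\Longrightarrow\quad
\max_{\mathbf{w},b}\ \gamma \;\equiv\; \max_{\mathbf{w},b}\ \frac{1}{\|\mathbf{w}\|} \;\equiv\; \min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w}.$$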

Page 30

Max-margin classifiers

• Learning problem:

30

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$$

Page 31

Max-margin classifiers

• Learning problem:

31

Minimizing ½ wᵀw gives us max_w 1/‖w‖

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$$

Page 32

Max-margin classifiers

• Learning problem:

32

This condition is true for every example, specifically, for the example closest to the separator

Minimizing ½ wᵀw gives us max_w 1/‖w‖

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$$

Page 33

Max-margin classifiers

• Learning problem:

• This is called the “hard” Support Vector Machine

We will look at how to solve this optimization problem later

33

This condition is true for every example, specifically, for the example closest to the separator

Minimizing ½ wᵀw gives us max_w 1/‖w‖

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$$
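The slides defer the solution method, but for intuition: the problem above is a small quadratic program, so a generic QP solver can handle it directly on toy data. A minimal sketch using the cvxpy package (the data is made up; this is only an illustration, not the algorithm developed later in the course):

```python
import numpy as np
import cvxpy as cp

# Toy separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

d = X.shape[1]
w = cp.Variable(d)
b = cp.Variable()

# min (1/2) w^T w   s.t.  y_i (w^T x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)  # the maximum-margin separator, if the data is separable
```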

Page 34

Hard SVM

• This is a constrained optimization problem

• If the data is not separable, there is no w that will classify the data

• Infeasible problem, no solution!

What if the data is not separable?

34

Maximize margin

Every example has a functional margin of at least 1

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$$

Page 35

What if the data is not separable?

Hard SVM

• This is a constrained optimization problem

• If the data is not separable, there is no w that will classify the data

• Infeasible problem, no solution!

35

Maximize margin

Every example has a functional margin of at least 1

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$$

Page 36

Dealing with non-separable data

Key idea: Allow some examples to “break into the margin”

36

[Figure: mostly separable data, with one + and one − example on the wrong side of the separator]

Page 37

Dealing with non-separable data

Key idea: Allow some examples to “break into the margin”

37

[Figure: mostly separable data, with one + and one − example on the wrong side of the separator]

Page 38

Dealing with non-separable data

Key idea: Allow some examples to “break into the margin”

38

[Figure: mostly separable data, with one + and one − example on the wrong side of the separator]

This separator has a large enough margin that it should generalize well.

Page 39

Dealing with non-separable data

Key idea: Allow some examples to “break into the margin”

39

So, when computing margin, ignore the examples that make the margin smaller or the data inseparable.

[Figure: mostly separable data, with one + and one − example on the wrong side of the separator]

This separator has a large enough margin that it should generalize well.

Page 40

Soft SVM

• Hard SVM:

• Introduce one slack variable ξi per example
– And require yi(wTxi + b) ≥ 1 - ξi and ξi ≥ 0

40

Maximize margin

Every example has a functional margin of at least 1

Intuition: The slack variable allows examples to “break” into the margin

If the slack value is zero, then the example is either on or outside the margin

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$$

Page 41

Soft SVM

• Hard SVM:

• Introduce one slack variable ξi per example
– And require yi(wTxi + b) ≥ 1 - ξi and ξi ≥ 0

41

Maximize margin

Every example has a functional margin of at least 1

Intuition: The slack variable allows examples to “break” into the margin

If the slack value is zero, then the example is either on or outside the margin

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$$

Page 42

Soft SVM

• Hard SVM:

• Introduce one slack variable ξi per example
– And require yi(wTxi + b) ≥ 1 - ξi and ξi ≥ 0

42

Maximize margin

Every example has a functional margin of at least 1

Intuition: The slack variable allows examples to “break” into the margin

If the slack value is zero, then the example is either on or outside the margin

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$$

Page 43

Soft SVM

• Hard SVM:

• Introduce one slack variable ξi per example
– And require yi(wTxi + b) ≥ 1 - ξi and ξi ≥ 0

• New optimization problem for learning

43

Maximize margin

Every example has a functional margin of at least 1

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$$

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \xi_i \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0$$

Page 44


Soft SVM

• Hard SVM:

• Introduce one slack variable ξi per example
– And require yi(wTxi + b) ≥ 1 - ξi and ξi ≥ 0

• New optimization problem for learning

44

Maximize margin

Every example has a functional margin of at least 1

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$$

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \xi_i \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0$$

Page 45

Soft SVM

45

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \xi_i \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0$$

Page 46

Soft SVM

46

Maximize margin

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \xi_i \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0$$

Page 47

Soft SVM

47

Maximize margin

Minimize total slack (i.e., allow as few examples as possible to violate the margin)

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \xi_i \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0$$

Page 48

Soft SVM

48

Maximize margin

Minimize total slack (i.e., allow as few examples as possible to violate the margin)

Tradeoff between the two terms

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \xi_i \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0$$
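A side note on this tradeoff (not part of the slides): in off-the-shelf implementations such as scikit-learn's LinearSVC, C is exactly this knob. A hedged sketch on made-up, slightly overlapping data; smaller C tolerates more slack and tends to give a smaller ‖w‖ (wider margin), while larger C pushes toward fitting every training example:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
# Two overlapping blobs, so some slack is unavoidable.
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = LinearSVC(C=C, loss="hinge", max_iter=10000).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)   # margin implied by ||w||
    print(f"C={C:<7} margin~{margin:.3f}  train acc={clf.score(X, y):.2f}")
```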

Page 49

Soft SVM

Eliminate the slack variables to rewrite this

This form is more interpretable

49

Maximize margin

Minimize total slack (i.e., allow as few examples as possible to violate the margin)

Tradeoff between the two terms

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \xi_i \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0$$

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$

Page 50

Soft SVM

Eliminate the slack variables to rewrite this

This form is more interpretable

50

Maximize margin

Minimize total slack (i.e., allow as few examples as possible to violate the margin)

Tradeoff between the two terms

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \xi_i \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0$$

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$

ξi ≥ 1 - yi(wTxi + b)

Page 51

Soft SVM

Eliminate the slack variables to rewrite this

This form is more interpretable

51

Maximize margin

Minimize total slack (i.e., allow as few examples as possible to violate the margin)

Tradeoff between the two terms

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \xi_i \qquad \text{s.t. } \forall i,\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0$$

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$
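Why the two forms agree (a standard argument, made explicit here): for any fixed (w, b), the best choice of each slack variable is the smallest value allowed by its two constraints,

$$\xi_i \ge 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b), \qquad \xi_i \ge 0
\quad\Longrightarrow\quad
\xi_i^{*} = \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big),$$

and substituting this optimal slack into the objective gives the hinge-loss form above.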

Page 52

Maximizing margin and minimizing loss

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$

Maximize margin (first term); penalty for the prediction (second term)

Three cases:
• Example is correctly classified and is outside the margin: penalty = 0
• Example is incorrectly classified: penalty = 1 - yi(wTxi + b)
• Example is correctly classified but within the margin: penalty = 1 - yi(wTxi + b)

This is the hinge loss function

52

Page 53

Maximizing margin and minimizing loss

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$

Maximize margin (first term); penalty for the prediction (second term)

We can consider three cases:
• Example is correctly classified and is outside the margin: penalty = 0
• Example is incorrectly classified: penalty = 1 - yi(wTxi + b)
• Example is correctly classified but within the margin: penalty = 1 - yi(wTxi + b)

This is the hinge loss function

53

Page 54

Maximizing margin and minimizing loss

We can consider three cases:
• Example is correctly classified and is outside the margin: penalty = 0
• Example is incorrectly classified: penalty = 1 - yi(wTxi + b)
• Example is correctly classified but within the margin: penalty = 1 - yi(wTxi + b)

This is the hinge loss function

54

Maximize margin (first term); penalty for the prediction (second term)

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$

Page 55

Maximizing margin and minimizing loss

We can consider three cases:
• Example is correctly classified and is outside the margin: penalty = 0
• Example is incorrectly classified: penalty = 1 - yi(wTxi + b)
• Example is correctly classified but within the margin: penalty = 1 - yi(wTxi + b)

This is the hinge loss function

55

Maximize margin (first term); penalty for the prediction (second term)

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$

Page 56

Maximizing margin and minimizing loss

We can consider three cases:
• Example is correctly classified and is outside the margin: penalty = 0
• Example is incorrectly classified: penalty = 1 - yi(wTxi + b)
• Example is correctly classified but within the margin: penalty = 1 - yi(wTxi + b)

This is the hinge loss function

56

Maximize margin (first term); penalty for the prediction (second term)

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$

Page 57

Maximizing margin and minimizing loss

We can consider three cases:
• Example is correctly classified and is outside the margin: penalty = 0
• Example is incorrectly classified: penalty = 1 - yi(wTxi + b)
• Example is correctly classified but within the margin: penalty = 1 - yi(wTxi + b)

This gives us the hinge loss function

57

Maximize margin (first term); penalty for the prediction (second term)

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$

$$L_{\text{Hinge}}(y,\mathbf{x},\mathbf{w},b) = \max\big(0,\, 1 - y(\mathbf{w}^\top\mathbf{x} + b)\big)$$
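A small numeric sketch of the three cases and the hinge loss they incur (illustrative numbers, not from the slides):

```python
import numpy as np

def hinge_loss(y, x, w, b):
    """L_Hinge(y, x, w, b) = max(0, 1 - y (w^T x + b))."""
    return max(0.0, 1.0 - y * (np.dot(w, x) + b))

w, b = np.array([1.0, 1.0]), 0.0

# Correctly classified and outside the margin: y (w^T x + b) >= 1, penalty 0
print(hinge_loss(+1, np.array([2.0, 1.0]), w, b))  # 0.0

# Correctly classified but within the margin: 0 < y (w^T x + b) < 1, small penalty
print(hinge_loss(+1, np.array([0.3, 0.2]), w, b))  # 0.5

# Incorrectly classified: y (w^T x + b) < 0, penalty greater than 1
print(hinge_loss(-1, np.array([2.0, 1.0]), w, b))  # 4.0
```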

Page 58

The Hinge Loss

58

[Plot: loss as a function of y(wTx + b)]

Page 59

The Hinge Loss

59

[Plot: 0-1 loss and hinge loss as functions of y(wTx + b)]

$$L_{\text{Hinge}}(y,\mathbf{x},\mathbf{w},b) = \max\big(0,\, 1 - y(\mathbf{w}^\top\mathbf{x} + b)\big)$$

Page 60

The Hinge Loss

60

[Plot: 0-1 loss and hinge loss as functions of y(wTx + b)]

0-1 loss: If the signs of y and wTx + b are the same, then no penalty
0-1 loss: If the signs of y and wTx + b are different, then penalty = 1

$$L_{\text{Hinge}}(y,\mathbf{x},\mathbf{w},b) = \max\big(0,\, 1 - y(\mathbf{w}^\top\mathbf{x} + b)\big)$$

Page 61

The Hinge Loss

61

Hinge: Penalize predictions even if they are correct, but too close to the hyperplane
Hinge: No penalty if wTx + b is at least 1 (at most -1 for negative examples)
Hinge: Incorrect predictions get a linearly increasing penalty with |wTx + b|

[Plot: hinge loss as a function of y(wTx + b)]

$$L_{\text{Hinge}}(y,\mathbf{x},\mathbf{w},b) = \max\big(0,\, 1 - y(\mathbf{w}^\top\mathbf{x} + b)\big)$$

Page 62

Maximizing margin and minimizing loss

Three cases:
• Example is correctly classified and is outside the margin: penalty = 0
• Example is incorrectly classified: penalty = 1 - yi(wTxi + b)
• Example is correctly classified but within the margin: penalty = 1 - yi(wTxi + b)

62

Maximize margin (first term); penalty for the prediction (second term)

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$

Page 63

General learning principle

Risk minimization

63

Define the notion of “loss” over the training data as a function of a hypothesis

Learning = find the hypothesis that has the lowest loss on the training data

Page 64

General learning principle

Regularized risk minimization

64

Define the notion of “loss” over the training data as a function of a hypothesis

Learning = find the hypothesis that has the lowest [regularizer + loss on the training data]

Define a regularization function that penalizes overly complex hypotheses.

Capacity control gives better generalization

Page 65

SVM objective function

65

Regularization term:
• Maximize the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences

Empirical loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$

Page 66

SVM objective function

66

Regularization term:
• Maximize the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences

Empirical loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences

C: a hyper-parameter that controls the tradeoff between a large margin and a small hinge-loss

$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$
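The deck stops at the objective itself; as a forward-looking sketch of how such an objective can be minimized, here is plain stochastic sub-gradient descent. The per-example sub-gradient with respect to w is w when the hinge term is zero and w - C·yi·xi otherwise, matching the gradient expression that survives in the source notes. Assumptions in this sketch (not dictated by the slides): the bias is folded into w as an extra constant feature and, for brevity, is regularized along with the weights; the step size is constant.

```python
import numpy as np

def svm_sgd(X, y, C=1.0, lr=0.01, epochs=100, seed=0):
    """Stochastic sub-gradient descent on (1/2)||w||^2 + C * sum_i hinge_i.
    Assumes a constant 1 has been appended to every row of X, so b lives in w."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * np.dot(w, X[i]) < 1:       # hinge term is active
                grad = w - C * y[i] * X[i]
            else:                                # hinge term is zero
                grad = w
            w -= lr * grad
    return w

# Usage on tiny made-up data (note the appended bias column of ones):
X = np.array([[2.0, 2.0, 1.0], [3.0, 1.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, 0.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = svm_sgd(X, y)
print(np.sign(X @ w))  # expected to match y on this easy, separable toy set
```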