Support Vector Machines
Machine Learning, Spring 2020
The slides are mainly from Vivek Srikumar
Big picture

Linear models — How good is a learned model?
• Mistake-driven learning → Perceptron
• PAC, Agnostic learning → Support Vector Machines, … (the Risk Minimization Principle)
This lecture: Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
VC dimensions and linear classifiers

What we know so far:
1. If we have m examples, then with probability 1 − 𝛿, the true error of a hypothesis h with training error errS(h) is bounded by (the standard VC bound)

   errD(h) ≤ errS(h) + √[ (VC(H)·(ln(2m/VC(H)) + 1) + ln(4/𝛿)) / m ]

   Generalization error ≤ training error + a function of the VC dimension. A low VC dimension gives a tighter bound.
2. VC dimension of a linear classifier in d dimensions = d + 1

But are all linear classifiers the same?
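To get a feel for the bound, here is a small illustrative calculation (not from the slides), assuming the standard form written above; the function name is made up:

```python
import numpy as np

def vc_bound(train_err, m, vc_dim, delta=0.05):
    """Upper bound on true error:
    err_S(h) + sqrt((VC(H) * (ln(2m/VC(H)) + 1) + ln(4/delta)) / m)."""
    return train_err + np.sqrt(
        (vc_dim * (np.log(2 * m / vc_dim) + 1) + np.log(4 / delta)) / m)

# A linear classifier in d = 10 dimensions has VC dimension d + 1 = 11.
print(vc_bound(train_err=0.05, m=10_000, vc_dim=11))   # ~0.15 (tight)
print(vc_bound(train_err=0.05, m=10_000, vc_dim=500))  # ~0.53 (loose)
```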
Recall: Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

[Figure: positive (+) and negative (−) points separated by a hyperplane; the margin is measured with respect to this hyperplane, to the nearest point.]
Recall: Margin

Which hyperplane is better, h1 or h2? The farther the hyperplane is from the data points, the lower the chance of making a wrong prediction.

[Figure: the same data separated by two hyperplanes, h1 and h2.]
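As a quick sketch of this definition (not part of the deck; the function and toy data are made up), the margin of a fixed hyperplane (w, b) can be computed directly:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Distance from the hyperplane w^T x + b = 0 to its nearest data point.

    The signed distance of example i is y_i * (w^T x_i + b) / ||w||; it is
    positive only if the point is on its correct side. The minimum over the
    dataset is the margin of this hyperplane.
    """
    return (y * (X @ w + b) / np.linalg.norm(w)).min()

# Toy separable data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
print(geometric_margin(np.array([1.0, 1.0]), 0.0, X, y))  # 2*sqrt(2) ≈ 2.83
```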
Data dependent VC dimension

• Intuitively, larger margins are better
• Suppose we only consider linear classifiers with margins 𝛾1 and 𝛾2 for the training dataset
  – H1 = linear classifiers that have a margin at least 𝛾1
  – H2 = linear classifiers that have a margin at least 𝛾2
  – And 𝛾1 > 𝛾2
• The entire set of functions H1 is “better”
Data dependent VC dimension

Theorem (Vapnik):
– Let H be the set of linear classifiers that separate the training set by a margin of at least 𝛾
– Then VC(H) ≤ min(R²/𝛾², d) + 1
– R is the radius of the smallest sphere containing the data

Larger margin ⟹ lower VC dimension
Lower VC dimension ⟹ better generalization bound
Learning strategy

Find the linear classifier that maximizes the margin
This lecture: Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
Recall: The geometry of a linear classifier

Prediction = sgn(b + w1x1 + w2x2)

We only care about the sign, not the magnitude: b + w1x1 + w2x2 = 0, 2b + 2w1x1 + 2w2x2 = 0, and 1000b + 1000w1x1 + 1000w2x2 = 0 all describe the same hyperplane.

[Figure: positive and negative points separated by the hyperplane b + w1x1 + w2x2 = 0.]
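A two-line check of this scale invariance (illustrative only; the weights and data are arbitrary):

```python
import numpy as np

w, b = np.array([1.5, -0.5]), 0.25
X = np.random.default_rng(0).normal(size=(5, 2))

# Predictions are unchanged by any positive rescaling of (w, b).
for scale in (1.0, 2.0, 1000.0):
    print(np.sign(scale * b + X @ (scale * w)))  # same signs every time
```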
Maximizing margin

• Margin = distance of the closest point from the hyperplane:

  𝛾(w, b) = minᵢ yᵢ(wᵀxᵢ + b) / ‖w‖

• We want max_{w,b} 𝛾

Some people call this the geometric margin; the numerator alone, minᵢ yᵢ(wᵀxᵢ + b), is called the functional margin.
Recall: The geometry of a linear classifier

Prediction = sgn(b + w1x1 + w2x2)

We only care about the sign, not the magnitude, so we have the freedom to scale w and b up or down so that the numerator (the functional margin) is 1.

[Figure: positive and negative points separated by the hyperplane b + w1x1 + w2x2 = 0.]
Maximizing margin

• Margin = distance of the closest point from the hyperplane
• We want max_{w,b} 𝛾
• We only care about the sign of wᵀxᵢ + b in the end, not the magnitude
  – Set the minimum functional margin to be 1 (this restricts w, b to a smaller set)
• With the minimum functional margin fixed at 1, 𝛾 = 1/‖w‖, so max_{w,b} 𝛾 is equivalent to max_{w,b} 1/‖w‖ in this setting
Max-margin classifiers

• Learning problem:

  min_{w,b} ½ wᵀw
  s.t. ∀i, yᵢ(wᵀxᵢ + b) ≥ 1

  – Minimizing ½ wᵀw gives us max_w 1/‖w‖
  – The constraint holds for every example, and in particular for the example closest to the separator

• This is called the “hard” Support Vector Machine

We will look at how to solve this optimization problem later.
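As an aside (not from the slides), the hard-SVM learning problem above is a small quadratic program, so an off-the-shelf convex solver can handle it. A sketch using cvxpy, with made-up toy data:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Hard SVM: minimize (1/2) w^T w  s.t.  y_i (w^T x_i + b) >= 1 for all i.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print(problem.status, w.value, b.value)
# On non-separable data the same program is reported infeasible instead.
```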
Hard SVM

  min_{w,b} ½ wᵀw                      ← maximize margin
  s.t. ∀i, yᵢ(wᵀxᵢ + b) ≥ 1            ← every example has a functional margin of at least 1

• This is a constrained optimization problem
• What if the data is not separable?
  – If the data is not separable, there is no w that will classify the data
  – Infeasible problem, no solution!
Dealing with non-separable data

Key idea: Allow some examples to “break into the margin”

[Figure: a separator with one + and one − inside the margin; the rest of the data is cleanly separated.]

This separator has a large enough margin that it should generalize well. So, when computing the margin, ignore the examples that make the margin smaller or the data inseparable.
Soft SVM

• Hard SVM:

  min_{w,b} ½ wᵀw                      ← maximize margin
  s.t. ∀i, yᵢ(wᵀxᵢ + b) ≥ 1            ← every example has a functional margin of at least 1

• Introduce one slack variable ξᵢ per example
  – And require yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0
  – Intuition: the slack variable allows examples to “break” into the margin
  – If the slack value is zero, then the example is either on or outside the margin

• New optimization problem for learning:

  min_{w,b,ξ} ½ wᵀw + C Σᵢ ξᵢ
  s.t. ∀i, yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ
       ∀i, ξᵢ ≥ 0
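Continuing the earlier cvxpy aside, the soft formulation only adds the slack variables and the C·Σᵢξᵢ term to the same quadratic program (again a sketch with made-up, now non-separable data):

```python
import cvxpy as cp
import numpy as np

# Toy data that is NOT linearly separable: one + sits among the -'s.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
C = 1.0  # tradeoff between a large margin and a small total slack

w, b = cp.Variable(2), cp.Variable()
xi = cp.Variable(5)  # one slack variable per example

problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
    [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0])
problem.solve()
print(w.value, b.value, xi.value)  # nonzero xi marks margin violators
```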
Soft SVM

  min_{w,b,ξ} ½ wᵀw + C Σᵢ ξᵢ
  s.t. ∀i, yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ
       ∀i, ξᵢ ≥ 0

• The first term maximizes the margin; the second minimizes the total slack (i.e., allows as few examples as possible to violate the margin); C is a tradeoff between the two terms.
• Eliminate the slack variables to rewrite this in a more interpretable form: the constraints say ξᵢ ≥ 1 − yᵢ(wᵀxᵢ + b) and ξᵢ ≥ 0, so at the optimum ξᵢ = max(0, 1 − yᵢ(wᵀxᵢ + b)). This gives

  min_{w,b} ½ wᵀw + C Σᵢ max(0, 1 − yᵢ(wᵀxᵢ + b))
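To make the rewritten objective concrete, here is a minimal numpy evaluation of it (illustrative; the function name is ours):

```python
import numpy as np

def soft_svm_objective(w, b, X, y, C):
    """(1/2) w^T w + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    margins = y * (X @ w + b)                # functional margins
    slacks = np.maximum(0.0, 1.0 - margins)  # optimal xi_i for this (w, b)
    return 0.5 * w @ w + C * slacks.sum()
```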
Maximizing margin and minimizing loss

  min_{w,b} ½ wᵀw + C Σᵢ max(0, 1 − yᵢ(wᵀxᵢ + b))
  (first term: maximize margin; second term: penalty for the prediction)

We can consider three cases:
• Example is correctly classified and is outside the margin: penalty = 0
• Example is incorrectly classified: penalty = 1 − yᵢ(wᵀxᵢ + b)
• Example is correctly classified but within the margin: penalty = 1 − yᵢ(wᵀxᵢ + b)

This gives us the hinge loss function:

  LHinge(y, x, w, b) = max(0, 1 − y(wᵀx + b))
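A tiny numeric illustration of the three cases (values chosen only for illustration):

```python
def hinge(y, score):
    """max(0, 1 - y * score), where score = w^T x + b."""
    return max(0.0, 1.0 - y * score)

print(hinge(+1, 2.5))   # correct and outside the margin -> 0.0
print(hinge(+1, 0.4))   # correct but within the margin  -> 0.6
print(hinge(+1, -1.0))  # incorrect -> 2.0, growing linearly with |w^T x + b|
```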
The Hinge Loss

[Plot: loss as a function of y(wᵀx + b), showing the 0-1 loss (a step from 1 to 0 at zero) and the hinge loss max(0, 1 − y(wᵀx + b)).]

• 0-1 loss: if the signs of y and wᵀx + b agree, there is no penalty; if they differ, the penalty is 1
• Hinge loss: no penalty once y(wᵀx + b) ≥ 1, i.e., wᵀx + b is beyond 1 for positive examples (beyond −1 for negative ones)
• Hinge loss: penalizes predictions even if they are correct, when they are too close to the hyperplane
• Hinge loss: incorrect predictions get a penalty that increases linearly with |wᵀx + b|
General learning principle

Risk minimization

Define a notion of “loss” over the training data as a function of a hypothesis.
Learning = find the hypothesis that has the lowest loss on the training data.
General learning principle

Regularized risk minimization

Define a notion of “loss” over the training data as a function of a hypothesis, and a regularization function that penalizes over-complex hypotheses.
Learning = find the hypothesis that has the lowest [regularizer + loss on the training data].
Capacity control gives better generalization.
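Schematically (all names here are made up), regularized risk minimization composes a regularizer with a pluggable per-example loss:

```python
import numpy as np

def regularized_risk(w, X, y, loss, regularizer, C=1.0):
    """Generic objective: regularizer(w) + C * sum of per-example losses."""
    scores = X @ w
    return regularizer(w) + C * sum(loss(yi, si) for yi, si in zip(y, scores))

l2 = lambda w: 0.5 * w @ w                        # the SVM's regularizer
hinge = lambda y, s: max(0.0, 1.0 - y * s)        # the SVM's loss
logistic = lambda y, s: np.log1p(np.exp(-y * s))  # a different loss, same recipe

w = np.array([1.0, -1.0])
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
print(regularized_risk(w, X, y, hinge, l2))  # the SVM objective on two points
```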
SVM objective function

  min_{w,b} ½ wᵀw + C Σᵢ max(0, 1 − yᵢ(wᵀxᵢ + b))

Regularization term:
• Maximizes the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences

Empirical loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences

C is a hyper-parameter that controls the tradeoff between a large margin and a small hinge-loss.
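Solving this optimization problem is the subject of a later part of the lecture; as a hedged preview, the unconstrained form can be minimized by stochastic sub-gradient descent, since the per-example sub-gradient of the objective is w when the hinge term is zero and w − C·yᵢxᵢ otherwise. A Pegasos-style sketch (the step-size schedule and names are illustrative, not the course's prescribed recipe):

```python
import numpy as np

def svm_sgd(X, y, C=1.0, epochs=20, lr0=0.1):
    """Stochastic sub-gradient descent on
    J(w, b) = (1/2) w^T w + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            lr = lr0 / (1 + lr0 * t)  # a decaying step size (one common choice)
            if y[i] * (X[i] @ w + b) >= 1:
                w -= lr * w                      # only the regularizer acts
            else:
                w -= lr * (w - C * y[i] * X[i])  # regularizer + hinge term
                b += lr * C * y[i]               # bias is not regularized
    return w, b
```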