Support Vector Machines
Machine Learning, Spring 2020
The slides are mainly from Vivek Srikumar
Big picture

Linear models — How good is a learned model?
• Mistake-driven learning → Perceptron
• PAC, Agnostic learning → Support Vector Machines, … (the Risk Minimization Principle)
This lecture: Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
VC dimensions and linear classifiers

What we know so far:
1. If we have m examples, then with probability 1 − 𝛿, the true error of a hypothesis h with training error errS(h) is bounded by (the standard VC bound)

   errD(h) ≤ errS(h) + √[ (VC(H)·(ln(2m/VC(H)) + 1) + ln(4/𝛿)) / m ]

   Generalization error ≤ training error + a function of the VC dimension. A low VC dimension gives a tighter bound.
2. VC dimension of a linear classifier in d dimensions = d + 1

But are all linear classifiers the same?
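To get a feel for the bound, here is a small illustrative calculation (not from the slides), assuming the standard form written above; the function name is made up:

```python
import numpy as np

def vc_bound(train_err, m, vc_dim, delta=0.05):
    """Upper bound on true error:
    err_S(h) + sqrt((VC(H) * (ln(2m/VC(H)) + 1) + ln(4/delta)) / m)."""
    return train_err + np.sqrt(
        (vc_dim * (np.log(2 * m / vc_dim) + 1) + np.log(4 / delta)) / m)

# A linear classifier in d = 10 dimensions has VC dimension d + 1 = 11.
print(vc_bound(train_err=0.05, m=10_000, vc_dim=11))   # ~0.15 (tight)
print(vc_bound(train_err=0.05, m=10_000, vc_dim=500))  # ~0.53 (loose)
```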
Recall: Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

[Figure: positive (+) and negative (−) points separated by a hyperplane; the margin is measured with respect to this hyperplane, to the nearest point.]
Recall: Margin

Which hyperplane is better, h1 or h2? The farther the hyperplane is from the data points, the lower the chance of making a wrong prediction.

[Figure: the same data separated by two hyperplanes, h1 and h2.]
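As a quick sketch of this definition (not part of the deck; the function and toy data are made up), the margin of a fixed hyperplane (w, b) can be computed directly:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Distance from the hyperplane w^T x + b = 0 to its nearest data point.

    The signed distance of example i is y_i * (w^T x_i + b) / ||w||; it is
    positive only if the point is on its correct side. The minimum over the
    dataset is the margin of this hyperplane.
    """
    return (y * (X @ w + b) / np.linalg.norm(w)).min()

# Toy separable data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
print(geometric_margin(np.array([1.0, 1.0]), 0.0, X, y))  # 2*sqrt(2) ≈ 2.83
```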
Data dependent VC dimension

• Intuitively, larger margins are better
• Suppose we only consider linear classifiers with margins 𝛾1 and 𝛾2 for the training dataset
  – H1 = linear classifiers that have a margin at least 𝛾1
  – H2 = linear classifiers that have a margin at least 𝛾2
  – And 𝛾1 > 𝛾2
• The entire set of functions H1 is “better”
Data dependent VC dimension

Theorem (Vapnik):
– Let H be the set of linear classifiers that separate the training set by a margin of at least 𝛾
– Then VC(H) ≤ min(R²/𝛾², d) + 1
– R is the radius of the smallest sphere containing the data

Larger margin ⟹ lower VC dimension
Lower VC dimension ⟹ better generalization bound
Learning strategy

Find the linear classifier that maximizes the margin
This lecture: Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
Recall: The geometry of a linear classifier

Prediction = sgn(b + w1x1 + w2x2)

We only care about the sign, not the magnitude: b + w1x1 + w2x2 = 0, 2b + 2w1x1 + 2w2x2 = 0, and 1000b + 1000w1x1 + 1000w2x2 = 0 all describe the same hyperplane.

[Figure: positive and negative points separated by the hyperplane b + w1x1 + w2x2 = 0.]
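A two-line check of this scale invariance (illustrative only; the weights and data are arbitrary):

```python
import numpy as np

w, b = np.array([1.5, -0.5]), 0.25
X = np.random.default_rng(0).normal(size=(5, 2))

# Predictions are unchanged by any positive rescaling of (w, b).
for scale in (1.0, 2.0, 1000.0):
    print(np.sign(scale * b + X @ (scale * w)))  # same signs every time
```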
Maximizing margin

• Margin = distance of the closest point from the hyperplane:

  𝛾(w, b) = minᵢ yᵢ(wᵀxᵢ + b) / ‖w‖

• We want max_{w,b} 𝛾

Some people call this the geometric margin; the numerator alone, minᵢ yᵢ(wᵀxᵢ + b), is called the functional margin.
Recall: The geometry of a linear classifier

Prediction = sgn(b + w1x1 + w2x2)

We only care about the sign, not the magnitude, so we have the freedom to scale w and b up or down so that the numerator (the functional margin) is 1.

[Figure: positive and negative points separated by the hyperplane b + w1x1 + w2x2 = 0.]
Maximizing margin

• Margin = distance of the closest point from the hyperplane
• We want max_{w,b} 𝛾
• We only care about the sign of wᵀxᵢ + b in the end, not the magnitude
  – Set the minimum functional margin to be 1 (this restricts w, b to a smaller set)
• With the minimum functional margin fixed at 1, 𝛾 = 1/‖w‖, so max_{w,b} 𝛾 is equivalent to max_{w,b} 1/‖w‖ in this setting
Max-margin classifiers

• Learning problem:

  min_{w,b} ½ wᵀw
  s.t. ∀i, yᵢ(wᵀxᵢ + b) ≥ 1

  – Minimizing ½ wᵀw gives us max_w 1/‖w‖
  – The constraint holds for every example, and in particular for the example closest to the separator

• This is called the “hard” Support Vector Machine

We will look at how to solve this optimization problem later.
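As an aside (not from the slides), the hard-SVM learning problem above is a small quadratic program, so an off-the-shelf convex solver can handle it. A sketch using cvxpy, with made-up toy data:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Hard SVM: minimize (1/2) w^T w  s.t.  y_i (w^T x_i + b) >= 1 for all i.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print(problem.status, w.value, b.value)
# On non-separable data the same program is reported infeasible instead.
```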
Hard SVM

  min_{w,b} ½ wᵀw                      ← maximize margin
  s.t. ∀i, yᵢ(wᵀxᵢ + b) ≥ 1            ← every example has a functional margin of at least 1

• This is a constrained optimization problem
• What if the data is not separable?
  – If the data is not separable, there is no w that will classify the data
  – Infeasible problem, no solution!
Dealing with non-separable data

Key idea: Allow some examples to “break into the margin”

[Figure: a separator with one + and one − inside the margin; the rest of the data is cleanly separated.]

This separator has a large enough margin that it should generalize well. So, when computing the margin, ignore the examples that make the margin smaller or the data inseparable.
Soft SVM

• Hard SVM:

  min_{w,b} ½ wᵀw                      ← maximize margin
  s.t. ∀i, yᵢ(wᵀxᵢ + b) ≥ 1            ← every example has a functional margin of at least 1

• Introduce one slack variable ξᵢ per example
  – And require yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0
  – Intuition: the slack variable allows examples to “break” into the margin
  – If the slack value is zero, then the example is either on or outside the margin

• New optimization problem for learning:

  min_{w,b,ξ} ½ wᵀw + C Σᵢ ξᵢ
  s.t. ∀i, yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ
       ∀i, ξᵢ ≥ 0
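Continuing the earlier cvxpy aside, the soft formulation only adds the slack variables and the C·Σᵢξᵢ term to the same quadratic program (again a sketch with made-up, now non-separable data):

```python
import cvxpy as cp
import numpy as np

# Toy data that is NOT linearly separable: one + sits among the -'s.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
C = 1.0  # tradeoff between a large margin and a small total slack

w, b = cp.Variable(2), cp.Variable()
xi = cp.Variable(5)  # one slack variable per example

problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
    [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0])
problem.solve()
print(w.value, b.value, xi.value)  # nonzero xi marks margin violators
```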
Soft SVM

  min_{w,b,ξ} ½ wᵀw + C Σᵢ ξᵢ
  s.t. ∀i, yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ
       ∀i, ξᵢ ≥ 0

• The first term maximizes the margin; the second minimizes the total slack (i.e., allows as few examples as possible to violate the margin); C is a tradeoff between the two terms.
• Eliminate the slack variables to rewrite this in a more interpretable form: the constraints say ξᵢ ≥ 1 − yᵢ(wᵀxᵢ + b) and ξᵢ ≥ 0, so at the optimum ξᵢ = max(0, 1 − yᵢ(wᵀxᵢ + b)). This gives

  min_{w,b} ½ wᵀw + C Σᵢ max(0, 1 − yᵢ(wᵀxᵢ + b))
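To make the rewritten objective concrete, here is a minimal numpy evaluation of it (illustrative; the function name is ours):

```python
import numpy as np

def soft_svm_objective(w, b, X, y, C):
    """(1/2) w^T w + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    margins = y * (X @ w + b)                # functional margins
    slacks = np.maximum(0.0, 1.0 - margins)  # optimal xi_i for this (w, b)
    return 0.5 * w @ w + C * slacks.sum()
```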
Maximizing margin and minimizing loss

  min_{w,b} ½ wᵀw + C Σᵢ max(0, 1 − yᵢ(wᵀxᵢ + b))
  (first term: maximize margin; second term: penalty for the prediction)

We can consider three cases:
• Example is correctly classified and is outside the margin: penalty = 0
• Example is incorrectly classified: penalty = 1 − yᵢ(wᵀxᵢ + b)
• Example is correctly classified but within the margin: penalty = 1 − yᵢ(wᵀxᵢ + b)

This gives us the hinge loss function:

  LHinge(y, x, w, b) = max(0, 1 − y(wᵀx + b))
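A tiny numeric illustration of the three cases (values chosen only for illustration):

```python
def hinge(y, score):
    """max(0, 1 - y * score), where score = w^T x + b."""
    return max(0.0, 1.0 - y * score)

print(hinge(+1, 2.5))   # correct and outside the margin -> 0.0
print(hinge(+1, 0.4))   # correct but within the margin  -> 0.6
print(hinge(+1, -1.0))  # incorrect -> 2.0, growing linearly with |w^T x + b|
```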
The Hinge Loss

[Plot: loss as a function of y(wᵀx + b), showing the 0-1 loss (a step from 1 to 0 at zero) and the hinge loss max(0, 1 − y(wᵀx + b)).]

• 0-1 loss: if the signs of y and wᵀx + b agree, there is no penalty; if they differ, the penalty is 1
• Hinge loss: no penalty once y(wᵀx + b) ≥ 1, i.e., wᵀx + b is beyond 1 for positive examples (beyond −1 for negative ones)
• Hinge loss: penalizes predictions even if they are correct, when they are too close to the hyperplane
• Hinge loss: incorrect predictions get a penalty that increases linearly with |wᵀx + b|
General learning principle

Risk minimization

Define a notion of “loss” over the training data as a function of a hypothesis.
Learning = find the hypothesis that has the lowest loss on the training data.
General learning principle

Regularized risk minimization

Define a notion of “loss” over the training data as a function of a hypothesis, and a regularization function that penalizes over-complex hypotheses.
Learning = find the hypothesis that has the lowest [regularizer + loss on the training data].
Capacity control gives better generalization.
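Schematically (all names here are made up), regularized risk minimization composes a regularizer with a pluggable per-example loss:

```python
import numpy as np

def regularized_risk(w, X, y, loss, regularizer, C=1.0):
    """Generic objective: regularizer(w) + C * sum of per-example losses."""
    scores = X @ w
    return regularizer(w) + C * sum(loss(yi, si) for yi, si in zip(y, scores))

l2 = lambda w: 0.5 * w @ w                        # the SVM's regularizer
hinge = lambda y, s: max(0.0, 1.0 - y * s)        # the SVM's loss
logistic = lambda y, s: np.log1p(np.exp(-y * s))  # a different loss, same recipe

w = np.array([1.0, -1.0])
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
print(regularized_risk(w, X, y, hinge, l2))  # the SVM objective on two points
```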
SVM objective function

  min_{w,b} ½ wᵀw + C Σᵢ max(0, 1 − yᵢ(wᵀxᵢ + b))

Regularization term:
• Maximizes the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences

Empirical loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences

C is a hyper-parameter that controls the tradeoff between a large margin and a small hinge-loss.
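Solving this optimization problem is the subject of a later part of the lecture; as a hedged preview, the unconstrained form can be minimized by stochastic sub-gradient descent, since the per-example sub-gradient of the objective is w when the hinge term is zero and w − C·yᵢxᵢ otherwise. A Pegasos-style sketch (the step-size schedule and names are illustrative, not the course's prescribed recipe):

```python
import numpy as np

def svm_sgd(X, y, C=1.0, epochs=20, lr0=0.1):
    """Stochastic sub-gradient descent on
    J(w, b) = (1/2) w^T w + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            lr = lr0 / (1 + lr0 * t)  # a decaying step size (one common choice)
            if y[i] * (X[i] @ w + b) >= 1:
                w -= lr * w                      # only the regularizer acts
            else:
                w -= lr * (w - C * y[i] * X[i])  # regularizer + hinge term
                b += lr * C * y[i]               # bias is not regularized
    return w, b
```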