1
CSE 881: Data Mining
Lecture 10: Support Vector Machines
2
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
3
Support Vector Machines
One possible solution: B1
4
Support Vector Machines
Another possible solution: B2
5
Support Vector Machines
Other possible solutions (B2 among them)
6
Support Vector Machines
Which one is better? B1 or B2? How do you define better?
7
Support Vector Machines
Find the hyperplane that maximizes the margin => B1 is better than B2
(Figure: B1 and B2 with their margin hyperplanes b11, b12 and b21, b22; B1 has the larger margin.)
8
SVM vs Perceptron
Perceptron: minimizes least-squares error (gradient descent)
SVM: maximizes the margin
9
Support Vector Machines
B1
b11
b12
0 bxw
1 bxw 1 bxw
1bxw if1
1bxw if1)(
xf
||||
2 Margin
w
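A quick worked value of the margin formula (illustrative numbers, not from the slide):
$$\mathbf{w} = (3, 4)^T \;\Rightarrow\; \|\mathbf{w}\| = \sqrt{3^2 + 4^2} = 5 \;\Rightarrow\; \text{Margin} = \frac{2}{\|\mathbf{w}\|} = 0.4$$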
10
Learning Linear SVM
We want to maximize: $\text{Margin} = \dfrac{2}{\|\mathbf{w}\|}$
– Which is equivalent to minimizing: $L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2}$
– But subject to the following constraints:
$$\mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \text{ if } y_i = 1, \qquad \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \text{ if } y_i = -1$$
or, equivalently,
$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, 2, \ldots, N$$
This is a constrained optimization problem – solve it using the Lagrange multiplier method
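As a practical sketch of solving this optimization (assuming scikit-learn is available, which the lecture does not mention), a linear SVM with a very large C approximates the hard-margin problem above:

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is available

# Tiny linearly separable toy data (illustrative, not the lecture's example)
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],        # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-3.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin (no-slack) formulation
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]           # learned weight vector w
b = clf.intercept_[0]      # learned bias b
margin = 2.0 / np.linalg.norm(w)
print("w =", w, "b =", b, "margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```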
11
Unconstrained Optimization
$$\min E(w_1, w_2) = w_1^2 + w_2^2$$
$$\frac{dE}{dw_1} = 2w_1 = 0 \;\Rightarrow\; w_1 = 0, \qquad \frac{dE}{dw_2} = 2w_2 = 0 \;\Rightarrow\; w_2 = 0$$
(Figure: surface of E(w1, w2) over the (w1, w2) plane.)
w1=0, w2=0 minimizes E(w1,w2)
Minimum value is E(0,0)=0
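A minimal gradient-descent sketch for this unconstrained problem (illustrative code, not from the lecture); the gradient is (2w1, 2w2), so stepping against it converges to (0, 0):

```python
# Gradient descent on E(w1, w2) = w1**2 + w2**2 (illustrative sketch)
w1, w2 = 3.0, -2.0            # arbitrary starting point
lr = 0.1                      # step size (learning rate)
for _ in range(100):
    g1, g2 = 2 * w1, 2 * w2   # dE/dw1 = 2*w1, dE/dw2 = 2*w2
    w1, w2 = w1 - lr * g1, w2 - lr * g2
print(w1, w2)  # both values approach 0, the minimum E(0, 0) = 0
```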
12
Constrained Optimization
$\min E(w_1, w_2) = w_1^2 + w_2^2$
subject to $2w_1 + w_2 = 4$ (equality constraint)

Form the Lagrangian:
$$L(w_1, w_2, \lambda) = w_1^2 + w_2^2 - \lambda(2w_1 + w_2 - 4)$$
Set the derivatives to zero:
$$\frac{\partial L}{\partial w_1} = 2w_1 - 2\lambda = 0 \;\Rightarrow\; w_1 = \lambda$$
$$\frac{\partial L}{\partial w_2} = 2w_2 - \lambda = 0 \;\Rightarrow\; w_2 = \lambda/2$$
$$\frac{\partial L}{\partial \lambda} = -(2w_1 + w_2 - 4) = 0 \;\Rightarrow\; 2\lambda + \frac{\lambda}{2} = 4 \;\Rightarrow\; \lambda = \frac{8}{5}$$
So $w_1 = 8/5$, $w_2 = 4/5$, and $E(w_1, w_2) = 16/5$.

Dual problem: substitute $w_1 = \lambda$ and $w_2 = \lambda/2$ back into L to get a function of λ alone:
$$\max_{\lambda}\; L_D(\lambda) = 4\lambda - \frac{5}{4}\lambda^2$$
which is maximized at $\lambda = 8/5$, where $L_D = 16/5 = E(w_1, w_2)$.

(Figure: surface of E(w1, w2) over the (w1, w2) plane.)
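The same answer can be checked numerically; a short sketch using SciPy (an assumed tool, since the slide solves the problem analytically):

```python
import numpy as np
from scipy.optimize import minimize  # assumption: SciPy is available

E = lambda w: w[0]**2 + w[1]**2
constraint = {"type": "eq", "fun": lambda w: 2 * w[0] + w[1] - 4}

res = minimize(E, x0=np.zeros(2), constraints=[constraint])
print(res.x, E(res.x))   # approx. [1.6, 0.8] and 3.2, i.e. w1=8/5, w2=4/5, E=16/5
```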
13
Learning Linear SVM
Lagrange multiplier:
$$L(\mathbf{w}, b, \lambda) = \frac{\|\mathbf{w}\|^2}{2} - \sum_{i=1}^{n} \lambda_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]$$
– Take derivatives w.r.t. $\mathbf{w}$ and $b$ and set them to zero:
$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \lambda_i y_i = 0$$
– Additional constraints:
$$\lambda_i \ge 0, \qquad \lambda_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right] = 0$$
– Dual problem (maximize over the $\lambda_i$):
$$L_D = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \,(\mathbf{x}_i \cdot \mathbf{x}_j)$$
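As an illustrative sketch only (production SVM solvers use specialized QP/SMO methods), the dual can be maximized for a tiny toy data set with a generic constrained optimizer, subject to λ_i ≥ 0 and Σ_i λ_i y_i = 0:

```python
import numpy as np
from scipy.optimize import minimize  # assumption: SciPy is available

# Toy 2-D data (illustrative), labels in {+1, -1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.0], [-1.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * y[None, :]) * (X @ X.T)     # G[i,j] = y_i y_j (x_i . x_j)

# Minimize the negative of L_D = sum(lam) - 0.5 * lam' G lam
neg_LD = lambda lam: 0.5 * lam @ G @ lam - lam.sum()
cons = [{"type": "eq", "fun": lambda lam: lam @ y}]   # sum_i lam_i y_i = 0
bounds = [(0, None)] * len(y)                         # lam_i >= 0

res = minimize(neg_LD, x0=np.zeros(len(y)), bounds=bounds, constraints=cons)
lam = res.x
w = (lam * y) @ X                    # w = sum_i lam_i y_i x_i
sv = int(np.argmax(lam))             # index of one support vector (largest lam_i)
b = y[sv] - w @ X[sv]                # from y_i (w . x_i + b) = 1 with y_i in {+1,-1}
print("lambda =", lam, "w =", w, "b =", b)
```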
14
Learning Linear SVM
Bigger picture:
– Learning algorithm needs to find w and b
– To solve for w:
$$\mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i$$
– But: $\lambda_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right] = 0$, and $\lambda_i$ is zero for points that do not reside on $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$
Data points whose λ_i's are not zero are called support vectors
15
Example of Linear SVM
x1       x2       y     λ
0.3858   0.4687    1    65.5261
0.4871   0.611    -1    65.5261
0.9218   0.4103   -1    0
0.7382   0.8936   -1    0
0.1763   0.0579    1    0
0.4057   0.3529    1    0
0.9355   0.8132   -1    0
0.2146   0.0099    1    0
Support vectors: the two points with λ ≠ 0 (the first two rows)
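A small sketch (assuming the last column above holds the Lagrange multipliers λ_i) of recovering w and b from the two support vectors, using w = Σ λ_i y_i x_i and then b from y_i(w · x_i + b) = 1:

```python
import numpy as np

# The two support vectors from the table (rows with lambda != 0)
X_sv   = np.array([[0.3858, 0.4687], [0.4871, 0.611]])
y_sv   = np.array([1.0, -1.0])
lam_sv = np.array([65.5261, 65.5261])

w = (lam_sv * y_sv) @ X_sv        # w = sum_i lambda_i y_i x_i
b = y_sv[0] - w @ X_sv[0]         # from y_i (w . x_i + b) = 1
print(w, b)                       # roughly w = (-6.64, -9.32), b = 7.93
```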
16
Learning Linear SVM
Bigger picture…
– Decision boundary depends only on the support vectors: if you have a data set with the same support vectors, the decision boundary will not change
– How to classify using SVM once w and b are found? Given a test record $\mathbf{x}_i$:
$$f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \end{cases}$$
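A minimal classification sketch reusing the w and b recovered in the earlier example (in practice the sign of w · x + b is used as the prediction, which agrees with f above whenever the score lies outside the margin):

```python
import numpy as np

w = np.array([-6.64, -9.32])   # from the previous sketch
b = 7.93
x_test = np.array([0.2, 0.1])  # a hypothetical test record

score = w @ x_test + b         # w . x + b
label = 1 if score >= 0 else -1
print(score, label)
```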
17
Support Vector Machines
What if the problem is not linearly separable?
18
Support Vector Machines
What if the problem is not linearly separable?
– Introduce slack variables $\xi_i$
Need to minimize:
$$L(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{N} \xi_i^k$$
Subject to:
$$f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 + \xi_i \end{cases}$$
If k is 1 or 2, this leads to the same objective function as linear SVM but with different constraints (see textbook)
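A hedged scikit-learn sketch (library choice is an assumption): the constant C above corresponds to SVC's C parameter, trading margin width against the slack penalties:

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is available

# Overlapping (non-separable) toy data, purely illustrative
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(20, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Small C tolerates more margin violations (wider margin);
# large C penalizes violations heavily (narrower margin).
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, "support vectors:", len(clf.support_vectors_))
```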
19
Nonlinear Support Vector Machines
What if decision boundary is not linear?
20
Nonlinear Support Vector Machines
Trick: Transform data into higher dimensional space
Decision boundary:
$$\mathbf{w} \cdot \Phi(\mathbf{x}) + b = 0$$
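One standard example of such a mapping (a common textbook illustration, not necessarily the slide's figure): for 2-D input x = (x1, x2), the degree-2 polynomial map
$$\Phi(x_1, x_2) = \left( x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1 \right)$$
turns quadratic boundaries (circles, ellipses, parabolas) in the original space into linear boundaries $\mathbf{w} \cdot \Phi(\mathbf{x}) + b = 0$ in the transformed space.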
21
Learning Nonlinear SVM
Optimization problem:
Which leads to the same set of equations as before (but involving Φ(x) instead of x)
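Written out (reconstructed from the linear-SVM dual earlier, not copied from this slide), the dual becomes
$$L_D = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \,\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$$
with the same constraints $\lambda_i \ge 0$ and $\sum_i \lambda_i y_i = 0$.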
22
Learning Nonlinear SVM
Issues:
– What type of mapping function should be used?
– How to do the computation in high-dimensional space? Most computations involve the dot product Φ(x_i) · Φ(x_j)
Curse of dimensionality?
23
Learning Nonlinear SVM
Kernel Trick:
– Φ(x_i) · Φ(x_j) = K(x_i, x_j)
– K(x_i, x_j) is a kernel function (expressed in terms of the coordinates in the original space)
Examples:
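Typical kernel functions (standard textbook choices, listed as background):
$$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^{p} \quad \text{(polynomial of degree } p\text{)}$$
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right) \quad \text{(Gaussian / RBF)}$$
$$K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\kappa\, \mathbf{x}_i \cdot \mathbf{x}_j - \delta) \quad \text{(sigmoid, a valid kernel only for some } \kappa, \delta\text{)}$$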
24
Example of Nonlinear SVM
SVM with polynomial degree 2 kernel
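A sketch of fitting this kind of model with scikit-learn (library choice is an assumption), using a degree-2 polynomial kernel on data with a circular class boundary:

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is available

# Toy data with a circular decision boundary (not the lecture's data set)
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)

# Degree-2 polynomial kernel: K(x, z) = (x . z + coef0)^2
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```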
25
Learning Nonlinear SVM
Advantages of using kernel:
– Don’t have to know the mapping function Φ
– Computing the dot product Φ(x_i) · Φ(x_j) in the original space avoids the curse of dimensionality
Not all functions can be kernels
– Must make sure there is a corresponding Φ in some high-dimensional space
– Mercer’s theorem (see textbook)