1
CSE 881: Data Mining
Lecture 10: Support Vector Machines
2
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
3
Support Vector Machines
One possible solution: B1
4
Support Vector Machines
Another possible solution: B2
5
Support Vector Machines
Other possible solutions (B2 among them)
6
Support Vector Machines
Which one is better? B1 or B2? How do you define better?
7
Support Vector Machines
Find the hyperplane that maximizes the margin => B1 is better than B2
(Figure: B1 and B2 with their margin hyperplanes b11, b12 and b21, b22; B1 has the larger margin.)
8
SVM vs Perceptron
Perceptron: minimizes least-squares error (gradient descent)
SVM: maximizes the margin
9
Support Vector Machines
B1
b11
b12
0 bxw
1 bxw 1 bxw
1bxw if1
1bxw if1)(
xf
||||
2 Margin
w
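A quick worked value of the margin formula (illustrative numbers, not from the slide):
$$\mathbf{w} = (3, 4)^T \;\Rightarrow\; \|\mathbf{w}\| = \sqrt{3^2 + 4^2} = 5 \;\Rightarrow\; \text{Margin} = \frac{2}{\|\mathbf{w}\|} = 0.4$$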
10
Learning Linear SVM
We want to maximize: $\text{Margin} = \dfrac{2}{\|\mathbf{w}\|}$
– Which is equivalent to minimizing: $L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2}$
– But subject to the following constraints:
$$\mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \text{ if } y_i = 1, \qquad \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \text{ if } y_i = -1$$
or, equivalently,
$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, 2, \ldots, N$$
This is a constrained optimization problem – solve it using the Lagrange multiplier method
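As a practical sketch of solving this optimization (assuming scikit-learn is available, which the lecture does not mention), a linear SVM with a very large C approximates the hard-margin problem above:

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is available

# Tiny linearly separable toy data (illustrative, not the lecture's example)
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],        # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-3.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin (no-slack) formulation
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]           # learned weight vector w
b = clf.intercept_[0]      # learned bias b
margin = 2.0 / np.linalg.norm(w)
print("w =", w, "b =", b, "margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```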
11
Unconstrained Optimization
$$\min E(w_1, w_2) = w_1^2 + w_2^2$$
$$\frac{dE}{dw_1} = 2w_1 = 0 \;\Rightarrow\; w_1 = 0, \qquad \frac{dE}{dw_2} = 2w_2 = 0 \;\Rightarrow\; w_2 = 0$$
(Figure: surface of E(w1, w2) over the (w1, w2) plane.)
w1=0, w2=0 minimizes E(w1,w2)
Minimum value is E(0,0)=0
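A minimal gradient-descent sketch for this unconstrained problem (illustrative code, not from the lecture); the gradient is (2w1, 2w2), so stepping against it converges to (0, 0):

```python
# Gradient descent on E(w1, w2) = w1**2 + w2**2 (illustrative sketch)
w1, w2 = 3.0, -2.0            # arbitrary starting point
lr = 0.1                      # step size (learning rate)
for _ in range(100):
    g1, g2 = 2 * w1, 2 * w2   # dE/dw1 = 2*w1, dE/dw2 = 2*w2
    w1, w2 = w1 - lr * g1, w2 - lr * g2
print(w1, w2)  # both values approach 0, the minimum E(0, 0) = 0
```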
12
Constrained Optimization
$\min E(w_1, w_2) = w_1^2 + w_2^2$
subject to $2w_1 + w_2 = 4$ (equality constraint)

Form the Lagrangian:
$$L(w_1, w_2, \lambda) = w_1^2 + w_2^2 - \lambda(2w_1 + w_2 - 4)$$
Set the derivatives to zero:
$$\frac{\partial L}{\partial w_1} = 2w_1 - 2\lambda = 0 \;\Rightarrow\; w_1 = \lambda$$
$$\frac{\partial L}{\partial w_2} = 2w_2 - \lambda = 0 \;\Rightarrow\; w_2 = \lambda/2$$
$$\frac{\partial L}{\partial \lambda} = -(2w_1 + w_2 - 4) = 0 \;\Rightarrow\; 2\lambda + \frac{\lambda}{2} = 4 \;\Rightarrow\; \lambda = \frac{8}{5}$$
So $w_1 = 8/5$, $w_2 = 4/5$, and $E(w_1, w_2) = 16/5$.

Dual problem: substitute $w_1 = \lambda$ and $w_2 = \lambda/2$ back into L to get a function of λ alone:
$$\max_{\lambda}\; L_D(\lambda) = 4\lambda - \frac{5}{4}\lambda^2$$
which is maximized at $\lambda = 8/5$, where $L_D = 16/5 = E(w_1, w_2)$.

(Figure: surface of E(w1, w2) over the (w1, w2) plane.)
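The same answer can be checked numerically; a short sketch using SciPy (an assumed tool, since the slide solves the problem analytically):

```python
import numpy as np
from scipy.optimize import minimize  # assumption: SciPy is available

E = lambda w: w[0]**2 + w[1]**2
constraint = {"type": "eq", "fun": lambda w: 2 * w[0] + w[1] - 4}

res = minimize(E, x0=np.zeros(2), constraints=[constraint])
print(res.x, E(res.x))   # approx. [1.6, 0.8] and 3.2, i.e. w1=8/5, w2=4/5, E=16/5
```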
13
Learning Linear SVM
Lagrange multiplier:
$$L(\mathbf{w}, b, \lambda) = \frac{\|\mathbf{w}\|^2}{2} - \sum_{i=1}^{n} \lambda_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]$$
– Take derivatives w.r.t. $\mathbf{w}$ and $b$ and set them to zero:
$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \lambda_i y_i = 0$$
– Additional constraints:
$$\lambda_i \ge 0, \qquad \lambda_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right] = 0$$
– Dual problem (maximize over the $\lambda_i$):
$$L_D = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \,(\mathbf{x}_i \cdot \mathbf{x}_j)$$
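As an illustrative sketch only (production SVM solvers use specialized QP/SMO methods), the dual can be maximized for a tiny toy data set with a generic constrained optimizer, subject to λ_i ≥ 0 and Σ_i λ_i y_i = 0:

```python
import numpy as np
from scipy.optimize import minimize  # assumption: SciPy is available

# Toy 2-D data (illustrative), labels in {+1, -1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.0], [-1.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * y[None, :]) * (X @ X.T)     # G[i,j] = y_i y_j (x_i . x_j)

# Minimize the negative of L_D = sum(lam) - 0.5 * lam' G lam
neg_LD = lambda lam: 0.5 * lam @ G @ lam - lam.sum()
cons = [{"type": "eq", "fun": lambda lam: lam @ y}]   # sum_i lam_i y_i = 0
bounds = [(0, None)] * len(y)                         # lam_i >= 0

res = minimize(neg_LD, x0=np.zeros(len(y)), bounds=bounds, constraints=cons)
lam = res.x
w = (lam * y) @ X                    # w = sum_i lam_i y_i x_i
sv = int(np.argmax(lam))             # index of one support vector (largest lam_i)
b = y[sv] - w @ X[sv]                # from y_i (w . x_i + b) = 1 with y_i in {+1,-1}
print("lambda =", lam, "w =", w, "b =", b)
```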
14
Learning Linear SVM
Bigger picture:
– Learning algorithm needs to find w and b
– To solve for w:
$$\mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i$$
– But: $\lambda_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right] = 0$, and $\lambda_i$ is zero for points that do not reside on $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$
Data points whose λ_i's are not zero are called support vectors
15
Example of Linear SVM
x1       x2       y     λ
0.3858   0.4687    1    65.5261
0.4871   0.611    -1    65.5261
0.9218   0.4103   -1    0
0.7382   0.8936   -1    0
0.1763   0.0579    1    0
0.4057   0.3529    1    0
0.9355   0.8132   -1    0
0.2146   0.0099    1    0
Support vectors: the two points with λ ≠ 0 (the first two rows)
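A small sketch (assuming the last column above holds the Lagrange multipliers λ_i) of recovering w and b from the two support vectors, using w = Σ λ_i y_i x_i and then b from y_i(w · x_i + b) = 1:

```python
import numpy as np

# The two support vectors from the table (rows with lambda != 0)
X_sv   = np.array([[0.3858, 0.4687], [0.4871, 0.611]])
y_sv   = np.array([1.0, -1.0])
lam_sv = np.array([65.5261, 65.5261])

w = (lam_sv * y_sv) @ X_sv        # w = sum_i lambda_i y_i x_i
b = y_sv[0] - w @ X_sv[0]         # from y_i (w . x_i + b) = 1
print(w, b)                       # roughly w = (-6.64, -9.32), b = 7.93
```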
16
Learning Linear SVM
Bigger picture…
– Decision boundary depends only on the support vectors: if you have a data set with the same support vectors, the decision boundary will not change
– How to classify using SVM once w and b are found? Given a test record $\mathbf{x}_i$:
$$f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \end{cases}$$
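A minimal classification sketch reusing the w and b recovered in the earlier example (in practice the sign of w · x + b is used as the prediction, which agrees with f above whenever the score lies outside the margin):

```python
import numpy as np

w = np.array([-6.64, -9.32])   # from the previous sketch
b = 7.93
x_test = np.array([0.2, 0.1])  # a hypothetical test record

score = w @ x_test + b         # w . x + b
label = 1 if score >= 0 else -1
print(score, label)
```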
17
Support Vector Machines
What if the problem is not linearly separable?
18
Support Vector Machines
What if the problem is not linearly separable?
– Introduce slack variables $\xi_i$
Need to minimize:
$$L(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{N} \xi_i^k$$
Subject to:
$$f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 + \xi_i \end{cases}$$
If k is 1 or 2, this leads to the same objective function as linear SVM but with different constraints (see textbook)
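A hedged scikit-learn sketch (library choice is an assumption): the constant C above corresponds to SVC's C parameter, trading margin width against the slack penalties:

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is available

# Overlapping (non-separable) toy data, purely illustrative
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(20, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Small C tolerates more margin violations (wider margin);
# large C penalizes violations heavily (narrower margin).
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, "support vectors:", len(clf.support_vectors_))
```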
19
Nonlinear Support Vector Machines
What if decision boundary is not linear?
20
Nonlinear Support Vector Machines
Trick: Transform data into higher dimensional space
Decision boundary:
$$\mathbf{w} \cdot \Phi(\mathbf{x}) + b = 0$$
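One standard example of such a mapping (a common textbook illustration, not necessarily the slide's figure): for 2-D input x = (x1, x2), the degree-2 polynomial map
$$\Phi(x_1, x_2) = \left( x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1 \right)$$
turns quadratic boundaries (circles, ellipses, parabolas) in the original space into linear boundaries $\mathbf{w} \cdot \Phi(\mathbf{x}) + b = 0$ in the transformed space.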
21
Learning Nonlinear SVM
Optimization problem:
Which leads to the same set of equations as before (but involving Φ(x) instead of x)
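Written out (reconstructed from the linear-SVM dual earlier, not copied from this slide), the dual becomes
$$L_D = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \,\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$$
with the same constraints $\lambda_i \ge 0$ and $\sum_i \lambda_i y_i = 0$.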
22
Learning Nonlinear SVM
Issues:
– What type of mapping function should be used?
– How to do the computation in high-dimensional space? Most computations involve the dot product Φ(x_i) · Φ(x_j)
Curse of dimensionality?
23
Learning Nonlinear SVM
Kernel Trick:
– Φ(x_i) · Φ(x_j) = K(x_i, x_j)
– K(x_i, x_j) is a kernel function (expressed in terms of the coordinates in the original space)
Examples:
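Typical kernel functions (standard textbook choices, listed as background):
$$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^{p} \quad \text{(polynomial of degree } p\text{)}$$
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right) \quad \text{(Gaussian / RBF)}$$
$$K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\kappa\, \mathbf{x}_i \cdot \mathbf{x}_j - \delta) \quad \text{(sigmoid, a valid kernel only for some } \kappa, \delta\text{)}$$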
24
Example of Nonlinear SVM
SVM with polynomial degree 2 kernel
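A sketch of fitting this kind of model with scikit-learn (library choice is an assumption), using a degree-2 polynomial kernel on data with a circular class boundary:

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is available

# Toy data with a circular decision boundary (not the lecture's data set)
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)

# Degree-2 polynomial kernel: K(x, z) = (x . z + coef0)^2
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```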
25
Learning Nonlinear SVM
Advantages of using kernel:
– Don’t have to know the mapping function Φ
– Computing the dot product Φ(x_i) · Φ(x_j) in the original space avoids the curse of dimensionality
Not all functions can be kernels
– Must make sure there is a corresponding Φ in some high-dimensional space
– Mercer’s theorem (see textbook)