CS 8751 ML & KDD Support Vector Machines 1
Support Vector Machines (SVMs)
• Learning mechanism based on linear programming
• Chooses a separating plane based on maximizing the notion of a margin
  – Based on PAC learning
• Has mechanisms for
  – Noise
  – Non-linear separating surfaces (kernel functions)
• Notes based on those of Prof. Jude Shavlik
CS 8751 ML & KDD Support Vector Machines 2
Support Vector Machines

[Figure: two clusters of examples, A+ and A-, in feature space, with several candidate separating planes drawn between them.]

Find the best separating plane in feature space - many possibilities to choose from. Which is the best choice?
CS 8751 ML & KDD Support Vector Machines 3
SVMs – The General Idea
• How to pick the best separating plane?
• Idea:
  – Define a set of inequalities we want to satisfy
  – Use advanced optimization methods (e.g., linear programming) to find satisfying solutions
• Key issues:
  – Dealing with noise
  – What if there is no good linear separating surface?
CS 8751 ML & KDD Support Vector Machines 4
Linear Programming
• Subset of math programming
• Problem has the following form:

  function f(x1, x2, x3, …, xn) to be maximized
  subject to a set of constraints of the form:
    g(x1, x2, x3, …, xn) ≥ b

• Math programming - find a set of values for the variables x1, x2, x3, …, xn that meets all of the constraints and maximizes the function f
• Linear programming - solving math programs where the constraint functions and the function to be maximized use linear combinations of the variables
  – Generally easier than the general math programming problem
  – Well-studied problem
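As a small illustration (assuming SciPy is available; the slides do not name a particular solver), here is a tiny linear program solved with `scipy.optimize.linprog`. Since `linprog` minimizes, we negate the objective to maximize, and rewrite the ≥ constraints as ≤:

```python
from scipy.optimize import linprog

# maximize f(x1, x2) = 3*x1 + 2*x2
# subject to: x1 + x2 <= 4, x1 <= 2, x1 >= 0, x2 >= 0
c = [-3, -2]                  # linprog minimizes, so negate to maximize f
A_ub = [[1, 1], [1, 0]]       # left-hand sides of the <= constraints
b_ub = [4, 2]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal point:", res.x)     # (2, 2)
print("maximal f:", -res.fun)      # 10
```

The objective and constraints here are made up for illustration; any linear f and g would do.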
CS 8751 ML & KDD Support Vector Machines 5
Maximizing the Margin

[Figure: examples A+ and A- with the decision boundary drawn between them.]

The margin between categories - want this distance to be maximal (we’ll assume linearly separable for now).
CS 8751 ML & KDD Support Vector Machines 6
PAC Learning
• PAC – Probably Approximately Correct learning
• Theorems that can be used to define bounds for the risk (error) of a family of learning functions
• Basic formula, with probability (1 - η):

  R(α) ≤ Remp(α) + sqrt( [h(log(2N/h) + 1) - log(η/4)] / N )

• R – risk function, α is the parameters chosen by the learner, N is the number of data points, and h is the VC dimension (something like an estimate of the complexity of the class of functions)
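To get a feel for the bound, this small sketch evaluates the square-root (confidence) term for hypothetical values of h and η; note how it shrinks as the number of examples N grows:

```python
import math

def vc_confidence(N, h, eta):
    # sqrt((h * (log(2N/h) + 1) - log(eta/4)) / N)
    return math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(eta / 4)) / N)

# hypothetical values: VC dimension h = 10, confidence parameter eta = 0.05
for N in (100, 1000, 10000):
    print(N, round(vc_confidence(N, h=10, eta=0.05), 3))
```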
CS 8751 ML & KDD Support Vector Machines 7
Margins and PAC Learning
• Theorems connect PAC theory to the size of the margin
• Basically, the larger the margin, the better the expected accuracy
• See, for example, Chapter 4 of An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2000
CS 8751 ML & KDD Support Vector Machines 8
Some Equations

Separating plane:  w·x = γ
  w - weights,  x - input features,  γ - threshold

For all positive examples:  w·x ≥ γ + 1
For all negative examples:  w·x ≤ γ - 1
  (the 1s result from dividing through by a constant, for convenience)

Distance between the red and blue planes (the margin):  2 / ||w||₂
  where ||w||₂ is the Euclidean length (“2-norm”) of the weight vector
CS 8751 ML & KDD Support Vector Machines 9
What the Equations Mean

[Figure: examples A+ and A- with two parallel planes, x´w = γ + 1 and x´w = γ - 1; the examples lying on these planes are the support vectors, and the gap between the planes is the margin, of width 2 / ||w||₂.]
CS 8751 ML & KDD Support Vector Machines 10
Choosing a Separating Plane

[Figure: examples A+ and A- with a candidate separating plane marked “?”.]
CS 8751 ML & KDD Support Vector Machines 11
Our “Mathematical Program” (so far)

  min over w, γ of  ||w||₂²

  such that
    w·x ≥ γ + 1  (for pos examples)
    w·x ≤ γ - 1  (for neg examples)

• Note: w, γ are our adjustable parameters (we could, of course, use the ANN “trick” and move γ to the left side of our inequalities)
• We can now use existing math programming optimization software to find a solution to the above (a global optimal solution)
• For technical reasons it is easier to optimize this “quadratic program” (the squared norm) than the plain norm ||w||₂
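This program can be handed to a generic solver. A minimal sketch using SciPy's SLSQP method on a tiny hand-made data set (the data and the solver choice are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0],   # positive examples
              [0.0, 0.0], [1.0, 0.0]])  # negative examples
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):              # p = (w1, w2, gamma)
    w = p[:2]
    return 0.5 * (w @ w)       # minimize ||w||^2 / 2

# y_i * (w . x_i - gamma) >= 1 encodes both inequality families at once
cons = [{"type": "ineq",
         "fun": (lambda p, i=i: y[i] * (p[:2] @ X[i] - p[2]) - 1.0)}
        for i in range(len(X))]

# start from a feasible guess (w = (1, 1), gamma = 2.5)
res = minimize(objective, x0=np.array([1.0, 1.0, 2.5]), constraints=cons)
w, gamma = res.x[:2], res.x[2]
print("w =", w, "gamma =", gamma, "margin =", 2 / np.linalg.norm(w))
```

On this data the optimizer recovers the maximum-margin plane; every example ends up on or outside its margin plane.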
CS 8751 ML & KDD Support Vector Machines 12
Dealing with Non-Separable Data

We can add what is called a “slack” variable to each example. This variable can be viewed as:
  0, if the example is correctly separated
  y, the “distance” we need to move the example to make it correct (i.e., its distance from its surface)
CS 8751 ML & KDD Support Vector Machines 13
“Slack” Variables

[Figure: examples A+ and A- with the two margin planes; one example lies on the wrong side, at distance y from its plane; the examples on the planes are the support vectors.]
CS 8751 ML & KDD Support Vector Machines 14
The Math Program with Slack Variables

  min over w, γ, s of  ||w||₂² + C ||s||₁

  such that
    w·xi ≥ γ + 1 - si  (for each positive example i)
    w·xj ≤ γ - 1 + sj  (for each negative example j)
    sk ≥ 0  (for every example k)

  w - one weight for each input feature
  s - one slack variable for each example
  C - scaling constant
  ||s||₁ - “one norm”: sum of the components (all positive)

This is the “traditional” Support Vector Machine
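This traditional soft-margin formulation is essentially what off-the-shelf SVM packages solve. A sketch using scikit-learn's SVC (an assumption that scikit-learn is available; it optimizes the closely related form ||w||²/2 + C·Σ slack), with a deliberately noisy point that no plane can separate:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0],      # clearly positive
              [0.0, 0.0], [1.0, 0.0],      # clearly negative
              [0.5, 0.0]])                 # "noisy" positive on the wrong side
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:\n", clf.support_vectors_)
print(clf.predict([[2.5, 2.5], [-1.0, -1.0]]))
```

Raising C penalizes slack more heavily; lowering it tolerates more misclassified training points in exchange for a wider margin.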
CS 8751 ML & KDD Support Vector Machines 15
Why the word “Support”?
• All those examples on, or on the wrong side of, the two separating planes are the support vectors
  – We’d get the same answer if we deleted all the non-support vectors!
  – i.e., the “support vectors [examples]” support the solution
CS 8751 ML & KDD Support Vector Machines 16
PAC and the Number of Support Vectors
• The fewer the support vectors, the better the generalization will be
• Recall, non-support vectors are
  – Correctly classified
  – Don’t change the learned model if left out of the training set
• So:

  leave-one-out error rate ≤ (# support vectors) / (# training examples)
CS 8751 ML & KDD Support Vector Machines 17
Finding Non-Linear Separating Surfaces
• Map inputs into a new space

  Example features:  x1  x2
                      5   4

  becomes:           x1  x2  x1²  x2²  x1*x2
                      5   4  25   16   20

• Solve the SVM program in this new space
  – Computationally complex if there are many features
  – But a clever trick exists
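The expansion in the table above, written out as code:

```python
# the expansion from the table: (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)
def expand(x1, x2):
    return (x1, x2, x1 ** 2, x2 ** 2, x1 * x2)

print(expand(5, 4))   # (5, 4, 25, 16, 20), matching the table
```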
CS 8751 ML & KDD Support Vector Machines 18
The Kernel Trick
• Optimization problems often have a “primal” and a “dual” representation
  – So far we’ve looked at the primal formulation
  – The dual formulation is better for the case of a non-linear separating surface
CS 8751 ML & KDD Support Vector Machines 19
Perceptrons Re-Visited

In perceptrons (with T = 1 and F = -1), if example i is currently misclassified we change the weights:

  wk+1 = wk + yi xi

So:

  wfinal = winitial + Σi αi yi xi   (sum over i = 1..#examples)

where αi is the number of times we got example i wrong and changed the weights. (This assumes winitial = 0, all zeros.)
CS 8751 ML & KDD Support Vector Machines 20
Dual Form of the Perceptron Learning Rule

Output of perceptron:

  h(x) = sgn(w·x)

  sgn(z) = 1 if z ≥ 0, -1 otherwise

So:

  h(x) = sgn( Σi αi yi (xi·x) )   (sum over i = 1..#examples)

New (i.e., dual) perceptron algorithm:

  For each example i:
    if yi Σj αj yj (xj·xi) ≤ 0   (i.e., predicted ≠ teacher)
    then αi ← αi + 1   (counts errors)
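A sketch of the dual algorithm on a toy separable data set (the data are made up for illustration); note that the algorithm only ever uses dot products between training examples:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
alpha = np.zeros(len(X))        # alpha[i] = number of mistakes on example i

def score(x):
    # sum_j alpha_j * y_j * (x_j . x): only dot products with training points
    return sum(alpha[j] * y[j] * (X[j] @ x) for j in range(len(X)))

for _ in range(20):                     # a few passes over the data
    for i in range(len(X)):
        if y[i] * score(X[i]) <= 0:     # predicted != teacher
            alpha[i] += 1               # count the error

w = sum(alpha[j] * y[j] * X[j] for j in range(len(X)))  # recover primal weights
print("alpha =", alpha, "w =", w)
```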
CS 8751 ML & KDD Support Vector Machines 21
Primal versus Dual Space
• Primal – “weight space”
  – Weight features to make the output decision

    h(xnew) = sgn(w·xnew)

• Dual – “training-examples space”
  – Weight distances (which are based on the features) to the training examples

    h(xnew) = sgn( Σj αj yj (xj·xnew) )   (sum over j = 1..#examples)
CS 8751 ML & KDD Support Vector Machines 22
The Dual SVM

Let n = # training examples. The dual program:

  min over α of  (1/2) Σi Σj αi αj yi yj (xi·xj) - Σi αi   (sums over i, j = 1..n)

  such that
    Σi yi αi = 0
    αi ≥ 0

Can convert back to primal form:

  w = Σi αi yi xi
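A sketch of solving this dual program on a toy data set, again with SciPy's SLSQP (the solver and data are illustrative assumptions). We minimize the dual objective as written above, then convert back to the primal weights:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(X)
G = (y[:, None] * y[None, :]) * (X @ X.T)   # G_ij = y_i y_j (x_i . x_j)

def dual_objective(a):
    return 0.5 * a @ G @ a - a.sum()

res = minimize(dual_objective, x0=np.zeros(n),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],
               bounds=[(0, None)] * n)
alpha = res.x
w = (alpha * y) @ X                          # convert back: w = sum_i alpha_i y_i x_i
print("alpha =", alpha.round(3))
print("w =", w.round(3))
print("support vectors:", X[alpha > 1e-6])
```

Only the examples with non-zero alpha (the support vectors) contribute to w.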
CS 8751 ML & KDD Support Vector Machines 23
Non-Zero αi’s

Those examples with αi > 0 are the support vectors.

Recall:

  w = Σi αi yi xi   (sum over i = 1..n)

– only the support vectors (i.e., those with αi > 0) contribute to the weights
CS 8751 ML & KDD Support Vector Machines 24
Generalizing the Dot Product

We can generalize Dot_Product(xi, xj) to other “kernel functions” K(xi, xj).

An acceptable (usually non-linear) kernel K maps the original features into a new space implicitly:
– in this new space we’re computing a dot product
– we don’t need to explicitly know the features in the new space
– usually more efficient than directly converting to the new space
CS 8751 ML & KDD Support Vector Machines 25
The New Space for a Sample Kernel

Let K(x, z) = (x·z)² and # features = 2:

  (x·z)² = (x1 z1 + x2 z2)²
         = x1 z1 x1 z1 + x1 z1 x2 z2 + x2 z2 x1 z1 + x2 z2 x2 z2
         = ⟨x1 x1, x1 x2, x2 x1, x2 x2⟩ · ⟨z1 z1, z1 z2, z2 z1, z2 z2⟩

Our new feature space (with 4 dimensions) - we’re doing a dot product in it
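We can check this algebra numerically: the implicit kernel (x·z)² and the explicit dot product in the 4-dimensional space agree (the particular x and z values are arbitrary):

```python
import numpy as np

def phi(v):
    # explicit map to the 4-D space: (v1, v2) -> (v1*v1, v1*v2, v2*v1, v2*v2)
    return np.array([v[0] * v[0], v[0] * v[1], v[1] * v[0], v[1] * v[1]])

x = np.array([5.0, 4.0])
z = np.array([1.0, 2.0])

k_implicit = (x @ z) ** 2      # kernel computed in the original 2-D space
k_explicit = phi(x) @ phi(z)   # ordinary dot product in the derived 4-D space
print(k_implicit, k_explicit)  # both 169.0
```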
CS 8751 ML & KDD Support Vector Machines 26
Visualizing the Kernel

[Figure: in the Input Space (Original Space), + and - examples are separated by a boundary that is non-linear there; the transformation g() maps each example into the Derived Feature Space (New Space), where the separating plane is linear.]

• g() is the feature transformation function
• The process is similar to what hidden units do in ANNs, but the kernel is user-chosen
CS 8751 ML & KDD Support Vector Machines 27
More Sample Kernels

1) K(x, z) = (x·z + const)^d   (polynomial)

2) K(x, z) = e^( -||x - z||₂² / (2σ²) )
   – the Gaussian kernel; leads to an RBF network

3) K(x, z) = tanh( c (x·z) + d )
   – related to the sigmoid of ANNs

plus many more, including many designed for specific tasks (text, DNA, etc.)
CS 8751 ML & KDD Support Vector Machines 28
What Makes a Kernel

Mercer’s theorem characterizes when a function f(x, z) is a kernel.

If K1() and K2() are kernels, then so are:
1) K1() + K2()
2) c * K1(), where c is a positive constant
3) K1() * K2()
4) f(x) f(z), where f returns a real
...
CS 8751 ML & KDD Support Vector Machines 29
Key SVM Ideas
• Maximize the margin between positive and negative examples (connects to PAC theory)
• Penalize errors in the non-separable case
• Only the support vectors contribute to the solution
• Kernels map examples into a new, usually non-linear space
  – We implicitly do dot products in this new space (in the “dual” form of the SVM program)