
Copyright © 2001, Andrew W. Moore

Support Vector Machines

Andrew W. Moore, Associate Professor

School of Computer Science, Carnegie Mellon University

Support Vector Machines: Slide 2Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 3Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 4Copyright © 2001, Andrew W. Moore

Linear Classifiers

(Figure: a scatter plot in which one symbol denotes +1 and the other denotes -1, together with the diagram x → f → yest.)

f(x, w, b) = sign(w . x - b)

A learning machine f takes an input x and transforms it, using the weights w and b, into a predicted output yest = +1 or -1.

How would you classify this data?

Support Vector Machines: Slide 5Copyright © 2001, Andrew W. Moore

Linear Classifiers

f(x, w, b) = sign(w . x - b)

The perceptron rule stops only when no misclassifications remain; on each mistake it updates the weights by

Δw = η (t - y) x

Any of these lines is possible.
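As a concrete (non-slide) illustration, here is a minimal sketch of that rule in Python, assuming numpy arrays X of shape R x m, targets t in {-1, +1}, and a learning rate eta (a symbol the slide omits):

import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    R, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xk, tk in zip(X, t):
            y = 1.0 if (w @ xk - b) > 0 else -1.0   # f(x,w,b) = sign(w.x - b)
            if y != tk:
                w += eta * (tk - y) * xk            # delta_w = eta (t - y) x
                b -= eta * (tk - y)                 # bias update for the "- b" form
                mistakes += 1
        if mistakes == 0:                           # stop only when nothing is misclassified
            break
    return w, b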

Support Vector Machines: Slide 6Copyright © 2001, Andrew W. Moore

Linear Classifiers

f(x, w, b) = sign(w . x - b)

The LMS rule uses linear elements and minimizes the squared error

E = Σ (t - y)²

This does not necessarily guarantee zero misclassifications.
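Again as a sketch only (same assumed names as above), the LMS idea amounts to gradient descent on that squared error with a linear, un-thresholded output:

import numpy as np

def lms_train(X, t, eta=0.01, epochs=200):
    R, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for xk, tk in zip(X, t):
            y = w @ xk - b              # linear element: no sign() here
            w += eta * (tk - y) * xk    # step down the gradient of (t - y)^2
            b -= eta * (tk - y)
    return w, b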

Support Vector Machines: Slide 7Copyright © 2001, Andrew W. Moore

Linear Classifiers

f(x, w, b) = sign(w . x - b)

If the objective is simply to assure zero sample error, any of these boundaries would be fine…

…but which is best?

Support Vector Machines: Slide 8Copyright © 2001, Andrew W. Moore

Classifier Margin

f(x, w, b) = sign(w . x - b)

Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint. The margin of the separator is the width of separation between the classes.

Support Vector Machines: Slide 9Copyright © 2001, Andrew W. Moore

Maximum Margin

f(x, w, b) = sign(w . x - b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called an LSVM (Linear SVM).

Support vectors are the datapoints that the margin pushes up against.

Support Vector Machines: Slide 10Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 11Copyright © 2001, Andrew W. Moore

Specifying a line and margin

• How do we represent this mathematically?
• …in m input dimensions?

(Figure: the classifier boundary flanked by a parallel Plus-Plane and Minus-Plane, separating the "Predict Class = +1" zone from the "Predict Class = -1" zone.)

Support Vector Machines: Slide 12Copyright © 2001, Andrew W. Moore

Specifying a line and margin

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }

(Figure: the three parallel lines w.x+b = +1, w.x+b = 0, and w.x+b = -1, with the "Predict Class = +1" zone beyond the plus-plane and the "Predict Class = -1" zone beyond the minus-plane.)

Classify as…
  +1                if w . x + b >= 1
  -1                if w . x + b <= -1
  Universe explodes if -1 < w . x + b < 1
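A minimal sketch of that rule as code (assuming numpy arrays for w and x; not part of the slides):

import numpy as np

def classify(x, w, b):
    s = np.dot(w, x) + b
    if s >= 1:
        return +1
    if s <= -1:
        return -1
    # -1 < w.x + b < 1: strictly inside the margin, which the constraints
    # forbid for training points ("universe explodes")
    raise ValueError("point falls strictly inside the margin")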

Support Vector Machines: Slide 13Copyright © 2001, Andrew W. Moore

Computing the margin width

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }

Claim: the vector w is perpendicular to the Plus-Plane. Why? Let u and v be two vectors on the Plus-Plane. What is w . (u - v)? And so of course the vector w is also perpendicular to the Minus-Plane.

M = margin width. How do we compute M in terms of w and b?

Support Vector Machines: Slide 14Copyright © 2001, Andrew W. Moore

Computing the margin width

Let x- be any point on the Minus-Plane and let x+ be the closest Plus-Plane point to x-. The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ you travel some distance in the direction of w.
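Slides 15 and 16 carry the rest of this derivation in figures that did not survive extraction; a compact version of the standard argument, using the x- / x+ construction above (λ is a scaling factor not named in the transcript), is:

\begin{aligned}
& x^{+} = x^{-} + \lambda w, \qquad w \cdot x^{+} + b = 1, \qquad w \cdot x^{-} + b = -1 \\
& \Rightarrow\; (w \cdot x^{-} + b) + \lambda\,(w \cdot w) = 1
  \;\Rightarrow\; \lambda = \frac{2}{w \cdot w} \\
& M = \lVert x^{+} - x^{-} \rVert = \lambda \lVert w \rVert
    = \frac{2}{w \cdot w}\sqrt{w \cdot w} = \frac{2}{\sqrt{w \cdot w}}
\end{aligned}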

Support Vector Machines: Slide 15Copyright © 2001, Andrew W. Moore

Computing the margin width

Support Vector Machines: Slide 16Copyright © 2001, Andrew W. Moore

Computing the margin width

Support Vector Machines: Slide 17Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 18Copyright © 2001, Andrew W. Moore

Learning the Maximum Margin Classifier

Given a guess of w and b we can:
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?

• Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?

(Figure: the planes w.x+b = +1, 0, -1 between the "Predict Class = +1" and "Predict Class = -1" zones, with x- and x+ marking the margin.)

M = margin width = 2 / sqrt(w . w)

Support Vector Machines: Slide 19Copyright © 2001, Andrew W. Moore

Learning via Quadratic Programming

• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.
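The transcript does not keep a formal statement of such a problem; a standard general form (the symbols u, c, d, Q, A, b below are not defined elsewhere in these slides) is:

\text{find}\;\; \arg\max_{u}\; c + d^{T}u + \tfrac{1}{2}\,u^{T}Q\,u
\quad \text{subject to}\quad
A\,u \le b \;\;(n \text{ inequality constraints}), \qquad
A_{\mathrm{eq}}\,u = b_{\mathrm{eq}} \;\;(e \text{ equality constraints})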

Support Vector Machines: Slide 20Copyright © 2001, Andrew W. Moore

The Optimization Problem

Support Vector Machines: Slide 21Copyright © 2001, Andrew W. Moore

Learning the Maximum Margin Classifier

(Figure: the planes w.x+b = +1, 0, -1 with margin width M = 2 / sqrt(w . w).)

Q1: What is our quadratic optimization criterion?
  The objective is to maximize M while ensuring all data points are in the correct half-planes, i.e. minimize w . w

Q2: What are the constraints in our QP?
  • How many constraints will we have? R (one per datapoint)
  • What should they be?
      w . xk + b >= 1   if yk = +1
      w . xk + b <= -1  if yk = -1
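The slides leave the choice of solver open. As an illustration only (assuming the cvxopt package, numpy arrays X of shape R x m, labels y in {-1, +1}, and linearly separable data), the QP above can be handed to a generic solver by optimizing over u = (w, b):

import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    R, m = X.shape
    # variables u = (w_1 ... w_m, b); objective (1/2) u' P u = (1/2) w.w
    P = np.zeros((m + 1, m + 1))
    P[:m, :m] = np.eye(m)
    q = np.zeros(m + 1)
    # yk (w.xk + b) >= 1   <=>   -yk * [xk, 1] . u <= -1
    G = -y[:, None] * np.hstack([X, np.ones((R, 1))])
    h = -np.ones(R)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[:m], u[m]   # w, b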

Support Vector Machines: Slide 22Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 23Copyright © 2001, Andrew W. Moore

Uh-oh!

(Figure: a dataset in which the two classes cannot be separated by any line.)

This is going to be a problem! What should we do?

Idea: minimize
  w . w + C (distance of error points to their correct place)
i.e. minimize a trade-off between the margin and the training error.

Support Vector Machines: Slide 24Copyright © 2001, Andrew W. Moore

Learning Maximum Margin with Noise

(Figure: the planes w.x+b = +1, 0, -1 with margin 2 / sqrt(w . w); points on the wrong side are marked with their slack distances ε2, ε7, ε11.)

Our QP criterion: minimize
  (1/2) w . w + C Σk εk

• How many constraints? 2R
• What should they be?
    w . xk + b >= 1 - εk   if yk = +1
    w . xk + b <= -1 + εk  if yk = -1
    εk >= 0                for all k

Support Vector Machines: Slide 25Copyright © 2001, Andrew W. Moore

Hard/Soft Margin Separation

Support Vector Machines: Slide 26Copyright © 2001, Andrew W. Moore

Controlling Soft Margin Separation
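The figures for this slide did not survive extraction. As a hedged illustration of the C trade-off described above (scikit-learn and the synthetic data below are my assumptions, not part of the slides):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # small C: wider margin, more tolerated training error; large C: closer to hard margin
    print(C, clf.score(X, y), int(clf.n_support_.sum()))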

Support Vector Machines: Slide 27Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 28Copyright © 2001, Andrew W. Moore

Suppose we’re in 1-dimension

What would SVMs do with this data?

(Figure: datapoints along a one-dimensional axis through x = 0.)

Support Vector Machines: Slide 29Copyright © 2001, Andrew W. Moore

Suppose we’re in 1-dimension

Not a big surprise.

(Figure: the maximum-margin threshold on the axis, with the positive "plane" on one side of x = 0 and the negative "plane" on the other.)

Support Vector Machines: Slide 30Copyright © 2001, Andrew W. Moore

Harder 1-dimensional dataset

That’s wiped the smirk off SVM’s face.

What can be done about this?

(Figure: a one-dimensional dataset around x = 0 that no single threshold can separate.)

Support Vector Machines: Slide 31Copyright © 2001, Andrew W. Moore

Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:

  zk = (xk, xk²)

Support Vector Machines: Slide 32Copyright © 2001, Andrew W. Moore

Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:

  zk = (xk, xk²)

(Figure: the same data plotted in the (xk, xk²) plane, where a straight line now separates the classes.)
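A small sketch of this lifting step (scikit-learn and the specific points and labels are assumptions, chosen so that no threshold on x alone separates the classes):

import numpy as np
from sklearn.svm import SVC

x = np.array([-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0])   # 1-D inputs
y = np.array([+1,   +1,   -1,   -1,  -1,  -1,  +1,  +1])     # hypothetical labels

Z = np.column_stack([x, x ** 2])        # z_k = (x_k, x_k^2)
clf = SVC(kernel='linear', C=1e6).fit(Z, y)
print(clf.score(Z, y))                  # 1.0: separable in the lifted space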

Support Vector Machines: Slide 33Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 34Copyright © 2001, Andrew W. Moore

Common SVM basis functions

zk = ( polynomial terms of xk of degree 1 to q )

zk = ( radial basis functions of xk )

zk = ( sigmoid functions of xk )

This is sensible.

Is that the end of the story?

No…there’s one more trick!

For example, in the radial-basis case:  zk[j] = φj(xk) = KernelFn( |xk - cj| / KW )

Support Vector Machines: Slide 35Copyright © 2001, Andrew W. Moore

The “Kernel Trick”

Support Vector Machines: Slide 36Copyright © 2001, Andrew W. Moore

Quadratic Dot Products

Write the quadratic basis-function expansion of an input a as

  Φ(a) = ( 1,
           √2 a1, √2 a2, …, √2 am,
           a1², a2², …, am²,
           √2 a1a2, √2 a1a3, …, √2 am-1am )

and similarly Φ(b). Multiplying the two vectors term by term:

  Φ(a) . Φ(b) =  1                        (constant term)
               + 2 Σi ai bi               (linear terms)
               + Σi ai² bi²               (pure quadratic terms)
               + 2 Σi<j ai aj bi bj       (quadratic cross-terms)

Support Vector Machines: Slide 37Copyright © 2001, Andrew W. Moore

Quadratic Dot Products

Now expand (a . b + 1)² directly:

  (a . b + 1)² = (a . b)² + 2 (a . b) + 1
               = (Σi ai bi)² + 2 Σi ai bi + 1
               = Σi Σj ai aj bi bj + 2 Σi ai bi + 1
               = Σi ai² bi² + 2 Σi<j ai aj bi bj + 2 Σi ai bi + 1
               = Φ(a) . Φ(b)

They’re the same!

And this is only O(m) to compute!
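A quick numerical sanity check of this equivalence (the helper names Phi and quad_kernel are mine, not the slides'):

import numpy as np
from itertools import combinations

def Phi(a):
    # (1, sqrt(2) a_i, a_i^2, sqrt(2) a_i a_j for i<j): the expansion above
    cross = [np.sqrt(2) * a[i] * a[j] for i, j in combinations(range(len(a)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * a, a ** 2, cross])

def quad_kernel(a, b):
    return (np.dot(a, b) + 1.0) ** 2    # the O(m) shortcut

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)
print(np.allclose(Phi(a) @ Phi(b), quad_kernel(a, b)))   # True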

Support Vector Machines: Slide 38Copyright © 2001, Andrew W. Moore

SVM Kernel Functions

• K(a,b) = (a . b + 1)^d is an example of an SVM Kernel Function
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function
• Radial-Basis-style Kernel Function:
    K(a,b) = exp( - (a - b)² / (2 σ²) )
• Neural-net-style Kernel Function:
    K(a,b) = tanh( κ a . b - δ )
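For reference, the three kernel families above as plain Python functions (the parameter names sigma, kappa, delta follow the formulas; the defaults are arbitrary):

import numpy as np

def poly_kernel(a, b, d=2):
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    return np.tanh(kappa * np.dot(a, b) - delta)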

Support Vector Machines: Slide 39Copyright © 2001, Andrew W. Moore

Non-linear SVM with Kernels

Support Vector Machines: Slide 40Copyright © 2001, Andrew W. Moore

Doing multi-class classification

• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).

• What can be done?
• Answer: with output arity N, learn N SVMs:
  • SVM 1 learns "Output==1" vs "Output != 1"
  • SVM 2 learns "Output==2" vs "Output != 2"
  • :
  • SVM N learns "Output==N" vs "Output != N"

• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
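A minimal sketch of that one-vs-rest scheme (scikit-learn's SVC stands in for the N individual SVMs; not part of the slides):

import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes):
    # one binary SVM per class: "Output == c" vs "Output != c"
    return {c: SVC(kernel='linear').fit(X, np.where(y == c, 1, -1)) for c in classes}

def predict_one_vs_rest(models, x):
    # the SVM whose prediction is furthest into the positive region wins
    scores = {c: m.decision_function(np.atleast_2d(x))[0] for c, m in models.items()}
    return max(scores, key=scores.get)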

Support Vector Machines: Slide 41Copyright © 2001, Andrew W. Moore

What You Should Know
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you (but, for this class, you don't need to know how it does it)
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 42Copyright © 2001, Andrew W. Moore

Remaining…

• Actually, I skip:
  • ERM (Empirical Risk Minimization)
  • VC (Vapnik/Chervonenkis) dimension
• But these are needed at data-processing time:
  • Handling unbalanced datasets
  • Selecting C
  • Selecting the kernel
• What is a valid kernel? How do we construct a valid kernel?
  • These are needed at tuning time.

Now the real game starts in earnest!

Support Vector Machines: Slide 43Copyright © 2001, Andrew W. Moore

SVMlight

• http://svmlight.joachims.org (Thorsten Joachims @ Cornell Univ.)

• Commands
  • svm_learn [options] [train_file] [model_file]
      e.g.  svm_learn -c 1.5 -x 1 train_file model_file
  • svm_classify [test_file] [model_file] [prediction_file]
      e.g.  svm_classify test_file model_file prediction_file

• File Format
  <y> <featureNumber>:<value> <featureNumber>:<value> … # comment
      e.g.  1 23:0.5 105:0.1 1023:0.8 1999:0.34

  which corresponds to the dense row:

  Idx   | 1 | … | 23  | … | 105 | … | 1023 | 1024 | … | 1998 | 1999 | … | cls
  Value | 0 | 0 | 0.5 | 0 | 0.1 | 0 | 0.8  | 0    | 0 | 0    | 0.34 | 0 | 1
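A small helper (hypothetical, not part of SVMlight) that expands one line of this sparse format into the dense row shown above:

def parse_svmlight_line(line, num_features):
    body = line.split('#')[0].split()          # drop any trailing comment
    label, dense = body[0], [0.0] * (num_features + 1)   # feature ids are 1-based
    for tok in body[1:]:
        idx, val = tok.split(':')
        dense[int(idx)] = float(val)
    return label, dense

label, vec = parse_svmlight_line("1 23:0.5 105:0.1 1023:0.8 1999:0.34", 1999)
print(label, vec[23], vec[105], vec[1023], vec[1999])   # 1 0.5 0.1 0.8 0.34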

Support Vector Machines: Slide 44Copyright © 2001, Andrew W. Moore

Options

• -z <char> : selects whether the SVM is trained in classification (c) or regression (r) mode
• -c <float> : controls the trade-off between training error and margin; the lower the value of C, the more training error is tolerated. The best value of C depends on the data and must be determined empirically.
• -w <float> : width of the tube (i.e. ε) of the ε-insensitive loss function used in regression. The value must be non-negative.
• -j <float> : specifies cost-factors for the loss functions both in classification and regression.
• -b <int> : switches between a biased (1) or an unbiased (0) hyperplane.
• -i <int> : selects how training errors are treated (1-4).
• -x <int> : selects whether to compute leave-one-out estimates of the generalization performance.
• -t <int> : selects the type of kernel function (0-4).
• -q <int> : maximum size of the working set.
• -m <int> : specifies the amount of memory (in megabytes) available for caching kernel evaluations.
• ……………………………………..

Support Vector Machines: Slide 45Copyright © 2001, Andrew W. Moore

Example 1: Sense Disambiguation

• AK has 3 senses: above knee, acetate kinase, artificial kidney
• Example:
    w13, w345, w874, w8, w7345, AK(acetate kinase), w123, w13, w8, w14, w8
    => 2 w8:3 w13:2 w14:1 w123:1 w345:1 w874:1 w7345:1
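A sketch of how such a context might be turned into a training line (the helper and its behaviour are illustrative, not from the slides; real SVMlight features would use sorted numeric ids):

from collections import Counter

def context_to_line(sense_id, context_words):
    # count the context words around the ambiguous term and emit "word:count" pairs
    counts = Counter(w for w in context_words if not w.startswith("AK"))
    feats = " ".join(f"{w}:{c}" for w, c in counts.most_common())
    return f"{sense_id} {feats}"

ctx = ["w13", "w345", "w874", "w8", "w7345", "AK(acetate kinase)",
       "w123", "w13", "w8", "w14", "w8"]
print(context_to_line(2, ctx))   # e.g. "2 w8:3 w13:2 w345:1 w874:1 w7345:1 w123:1 w14:1"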

Support Vector Machines: Slide 46Copyright © 2001, Andrew W. Moore

Example of training set

1 109:0.060606 193:0.090909 211:0.052632 3079:1.000000 5882:0.200000
1 109:0.030303 143:1.000000 377:0.500000 439:0.500000 1012:0.500000 3808:1.000000 3891:1.000000 4789:0.333333 18363:0.333333 23106:1.000000

1 174:0.166667 244:0.500000 321:0.500000 332:0.500000 723:0.333333 3064:0.500000 3872:1.000000 16401:1.000000 19369:1.000000 23109:1.000000

2 108:0.250000 148:0.333333 202:0.250000 313:0.200000 380:1.000000 3303:1.000000 8944:1.000000 11513:1.000000 23110:1.000000 23111:1.000000

1 100:0.125000 109:0.030303 129:0.200000 131:0.071429 135:0.050000 543:0.500000 5880:0.200000 17153:0.100000 23112:0.100000

1 100:0.125000 109:0.060606 125:0.500000 131:0.071429 193:0.045455 2553:0.200000 4790:0.100000 5880:0.200000

2 382:1.000000 410:1.000000 790:1.000000 1160:0.500000 2052:1.000000 3260:1.000000 7323:0.500000 23113:1.000000 23114:1.000000 23115:1.000000

2 133:1.000000 147:0.250000 148:0.333333 177:0.500000 214:1.000000 960:1.000000 1131:1.000000 1359:1.000000 6378:0.333333 22842:1.000000

1 109:0.060606 917:1.000000 2747:0.500000 2748:0.500000 2749:1.000000 3252:1.000000 21699:1.000000

1 543:0.500000 557:1.000000 563:0.500000 1077:1.000000 1160:0.500000 2747:0.500000 2748:0.500000 12805:1.000000 23116:1.000000 23117:1.000000

1 593:1.000000 1124:0.500000 1615:1.000000 3607:1.000000 13443:1.000000 23118:1.000000 23119:1.000000 23120:1.000000

…………………………………………………………

Support Vector Machines: Slide 47Copyright © 2001, Andrew W. Moore

Trained model of SVM

SVM-multiclass Version V1.01
2 # number of classes
24208 # number of base features
0 # loss function
3 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty # kernel parameter -u
48584 # highest feature index
167 # number of training documents
335 # number of support vectors plus 1
0 # threshold b, each following line is a SV (starting with alpha*y)
0.010000000000000000208166817117217 24401:0.2 24407:0.071428999 24756:1 24909:0.25 25283:1 25424:0.33333299 25487:0.043478001 25773:0.5 31693:1 37411:1 #
-0.010000000000000000208166817117217 193:0.2 199:0.071428999 548:1 701:0.25 1075:1 1216:0.33333299 1279:0.043478001 1565:0.5 7485:1 13203:1 #
0.010000000000000000208166817117217 24407:0.071428999 24419:0.0625 25807:0.166667 26905:0.166667 27631:1 27664:1 35043:1 35888:1 36118:1 48085:1 #
-0.010000000000000000208166817117217 199:0.071428999 211:0.0625 1599:0.166667 2697:0.166667 3423:1 3456:1 10835:1 11680:1 11910:1 23877:1 #
……………………………………………………

Support Vector Machines: Slide 48Copyright © 2001, Andrew W. Moore

Example 2: Text Chunking

Support Vector Machines: Slide 49Copyright © 2001, Andrew W. Moore

Example 3: Text Classification

• Corpus: Reuters-21578
  • 12,902 news stories
  • 118 categories
  • ModApte split (75% training: 9,603 docs / 25% test: 3,299 docs)

Support Vector Machines: Slide 50Copyright © 2001, Andrew W. Moore

Q & A