
Copyright © 2001, Andrew W. Moore

Support Vector Machines

Andrew W. Moore, Associate Professor

School of Computer Science, Carnegie Mellon University

Support Vector Machines: Slide 2Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 3Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 4Copyright © 2001, Andrew W. Moore

Linear Classifiers

(Figure: a scatter plot in which one symbol denotes +1 and the other denotes -1, together with the diagram x → f → yest.)

f(x, w, b) = sign(w . x - b)

A learning machine f takes an input x and transforms it, using the weights w and b, into a predicted output yest = +1 or -1.

How would you classify this data?

Support Vector Machines: Slide 5Copyright © 2001, Andrew W. Moore

Linear Classifiers

f(x, w, b) = sign(w . x - b)

The perceptron rule stops only when no misclassifications remain; on each mistake it updates the weights by

Δw = η (t - y) x

Any of these lines is possible.
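As a concrete (non-slide) illustration, here is a minimal sketch of that rule in Python, assuming numpy arrays X of shape R x m, targets t in {-1, +1}, and a learning rate eta (a symbol the slide omits):

import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    R, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xk, tk in zip(X, t):
            y = 1.0 if (w @ xk - b) > 0 else -1.0   # f(x,w,b) = sign(w.x - b)
            if y != tk:
                w += eta * (tk - y) * xk            # delta_w = eta (t - y) x
                b -= eta * (tk - y)                 # bias update for the "- b" form
                mistakes += 1
        if mistakes == 0:                           # stop only when nothing is misclassified
            break
    return w, b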

Support Vector Machines: Slide 6Copyright © 2001, Andrew W. Moore

Linear Classifiers

f(x, w, b) = sign(w . x - b)

The LMS rule uses linear elements and minimizes the squared error

E = Σ (t - y)²

This does not necessarily guarantee zero misclassifications.
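Again as a sketch only (same assumed names as above), the LMS idea amounts to gradient descent on that squared error with a linear, un-thresholded output:

import numpy as np

def lms_train(X, t, eta=0.01, epochs=200):
    R, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for xk, tk in zip(X, t):
            y = w @ xk - b              # linear element: no sign() here
            w += eta * (tk - y) * xk    # step down the gradient of (t - y)^2
            b -= eta * (tk - y)
    return w, b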

Support Vector Machines: Slide 7Copyright © 2001, Andrew W. Moore

Linear Classifiers

f(x, w, b) = sign(w . x - b)

If the objective is simply to assure zero sample error, any of these boundaries would be fine…

…but which is best?

Support Vector Machines: Slide 8Copyright © 2001, Andrew W. Moore

Classifier Margin

f(x, w, b) = sign(w . x - b)

Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint. The margin of the separator is the width of separation between the classes.

Support Vector Machines: Slide 9Copyright © 2001, Andrew W. Moore

Maximum Margin

f(x, w, b) = sign(w . x - b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called an LSVM (Linear SVM).

Support vectors are the datapoints that the margin pushes up against.

Support Vector Machines: Slide 10Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 11Copyright © 2001, Andrew W. Moore

Specifying a line and margin

• How do we represent this mathematically?
• …in m input dimensions?

(Figure: the classifier boundary flanked by a parallel Plus-Plane and Minus-Plane, separating the "Predict Class = +1" zone from the "Predict Class = -1" zone.)

Support Vector Machines: Slide 12Copyright © 2001, Andrew W. Moore

Specifying a line and margin

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }

(Figure: the three parallel lines w.x+b = +1, w.x+b = 0, and w.x+b = -1, with the "Predict Class = +1" zone beyond the plus-plane and the "Predict Class = -1" zone beyond the minus-plane.)

Classify as…
  +1                if w . x + b >= 1
  -1                if w . x + b <= -1
  Universe explodes if -1 < w . x + b < 1
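A minimal sketch of that rule as code (assuming numpy arrays for w and x; not part of the slides):

import numpy as np

def classify(x, w, b):
    s = np.dot(w, x) + b
    if s >= 1:
        return +1
    if s <= -1:
        return -1
    # -1 < w.x + b < 1: strictly inside the margin, which the constraints
    # forbid for training points ("universe explodes")
    raise ValueError("point falls strictly inside the margin")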

Support Vector Machines: Slide 13Copyright © 2001, Andrew W. Moore

Computing the margin width

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }

Claim: the vector w is perpendicular to the Plus-Plane. Why? Let u and v be two vectors on the Plus-Plane. What is w . (u - v)? And so of course the vector w is also perpendicular to the Minus-Plane.

M = margin width. How do we compute M in terms of w and b?

Support Vector Machines: Slide 14Copyright © 2001, Andrew W. Moore

Computing the margin width

Let x- be any point on the Minus-Plane and let x+ be the closest Plus-Plane point to x-. The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ you travel some distance in the direction of w.
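Slides 15 and 16 carry the rest of this derivation in figures that did not survive extraction; a compact version of the standard argument, using the x- / x+ construction above (λ is a scaling factor not named in the transcript), is:

\begin{aligned}
& x^{+} = x^{-} + \lambda w, \qquad w \cdot x^{+} + b = 1, \qquad w \cdot x^{-} + b = -1 \\
& \Rightarrow\; (w \cdot x^{-} + b) + \lambda\,(w \cdot w) = 1
  \;\Rightarrow\; \lambda = \frac{2}{w \cdot w} \\
& M = \lVert x^{+} - x^{-} \rVert = \lambda \lVert w \rVert
    = \frac{2}{w \cdot w}\sqrt{w \cdot w} = \frac{2}{\sqrt{w \cdot w}}
\end{aligned}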

Support Vector Machines: Slide 15Copyright © 2001, Andrew W. Moore

Computing the margin width

Support Vector Machines: Slide 16Copyright © 2001, Andrew W. Moore

Computing the margin width

Support Vector Machines: Slide 17Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 18Copyright © 2001, Andrew W. Moore

Learning the Maximum Margin Classifier

Given a guess of w and b we can:
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?

• Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?

(Figure: the planes w.x+b = +1, 0, -1 between the "Predict Class = +1" and "Predict Class = -1" zones, with x- and x+ marking the margin.)

M = margin width = 2 / sqrt(w . w)

Support Vector Machines: Slide 19Copyright © 2001, Andrew W. Moore

Learning via Quadratic Programming

• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.
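The transcript does not keep a formal statement of such a problem; a standard general form (the symbols u, c, d, Q, A, b below are not defined elsewhere in these slides) is:

\text{find}\;\; \arg\max_{u}\; c + d^{T}u + \tfrac{1}{2}\,u^{T}Q\,u
\quad \text{subject to}\quad
A\,u \le b \;\;(n \text{ inequality constraints}), \qquad
A_{\mathrm{eq}}\,u = b_{\mathrm{eq}} \;\;(e \text{ equality constraints})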

Support Vector Machines: Slide 20Copyright © 2001, Andrew W. Moore

The Optimization Problem

Support Vector Machines: Slide 21Copyright © 2001, Andrew W. Moore

Learning the Maximum Margin Classifier

(Figure: the planes w.x+b = +1, 0, -1 with margin width M = 2 / sqrt(w . w).)

Q1: What is our quadratic optimization criterion?
  The objective is to maximize M while ensuring all data points are in the correct half-planes, i.e. minimize w . w

Q2: What are the constraints in our QP?
  • How many constraints will we have? R (one per datapoint)
  • What should they be?
      w . xk + b >= 1   if yk = +1
      w . xk + b <= -1  if yk = -1
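The slides leave the choice of solver open. As an illustration only (assuming the cvxopt package, numpy arrays X of shape R x m, labels y in {-1, +1}, and linearly separable data), the QP above can be handed to a generic solver by optimizing over u = (w, b):

import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    R, m = X.shape
    # variables u = (w_1 ... w_m, b); objective (1/2) u' P u = (1/2) w.w
    P = np.zeros((m + 1, m + 1))
    P[:m, :m] = np.eye(m)
    q = np.zeros(m + 1)
    # yk (w.xk + b) >= 1   <=>   -yk * [xk, 1] . u <= -1
    G = -y[:, None] * np.hstack([X, np.ones((R, 1))])
    h = -np.ones(R)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[:m], u[m]   # w, b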

Support Vector Machines: Slide 22Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 23Copyright © 2001, Andrew W. Moore

Uh-oh!

(Figure: a dataset in which the two classes cannot be separated by any line.)

This is going to be a problem! What should we do?

Idea: minimize
  w . w + C (distance of error points to their correct place)
i.e. minimize a trade-off between the margin and the training error.

Support Vector Machines: Slide 24Copyright © 2001, Andrew W. Moore

Learning Maximum Margin with Noise

(Figure: the planes w.x+b = +1, 0, -1 with margin 2 / sqrt(w . w); points on the wrong side are marked with their slack distances ε2, ε7, ε11.)

Our QP criterion: minimize
  (1/2) w . w + C Σk εk

• How many constraints? 2R
• What should they be?
    w . xk + b >= 1 - εk   if yk = +1
    w . xk + b <= -1 + εk  if yk = -1
    εk >= 0                for all k

Support Vector Machines: Slide 25Copyright © 2001, Andrew W. Moore

Hard/Soft Margin Separation

Support Vector Machines: Slide 26Copyright © 2001, Andrew W. Moore

Controlling Soft Margin Separation
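The figures for this slide did not survive extraction. As a hedged illustration of the C trade-off described above (scikit-learn and the synthetic data below are my assumptions, not part of the slides):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # small C: wider margin, more tolerated training error; large C: closer to hard margin
    print(C, clf.score(X, y), int(clf.n_support_.sum()))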

Support Vector Machines: Slide 27Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 28Copyright © 2001, Andrew W. Moore

Suppose we’re in 1-dimension

What would SVMs do with this data?

(Figure: datapoints along a one-dimensional axis through x = 0.)

Support Vector Machines: Slide 29Copyright © 2001, Andrew W. Moore

Suppose we’re in 1-dimension

Not a big surprise.

(Figure: the maximum-margin threshold on the axis, with the positive "plane" on one side of x = 0 and the negative "plane" on the other.)

Support Vector Machines: Slide 30Copyright © 2001, Andrew W. Moore

Harder 1-dimensional dataset

That’s wiped the smirk off SVM’s face.

What can be done about this?

(Figure: a one-dimensional dataset around x = 0 that no single threshold can separate.)

Support Vector Machines: Slide 31Copyright © 2001, Andrew W. Moore

Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:

  zk = (xk, xk²)

Support Vector Machines: Slide 32Copyright © 2001, Andrew W. Moore

Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:

  zk = (xk, xk²)

(Figure: the same data plotted in the (xk, xk²) plane, where a straight line now separates the classes.)
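A small sketch of this lifting step (scikit-learn and the specific points and labels are assumptions, chosen so that no threshold on x alone separates the classes):

import numpy as np
from sklearn.svm import SVC

x = np.array([-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0])   # 1-D inputs
y = np.array([+1,   +1,   -1,   -1,  -1,  -1,  +1,  +1])     # hypothetical labels

Z = np.column_stack([x, x ** 2])        # z_k = (x_k, x_k^2)
clf = SVC(kernel='linear', C=1e6).fit(Z, y)
print(clf.score(Z, y))                  # 1.0: separable in the lifted space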

Support Vector Machines: Slide 33Copyright © 2001, Andrew W. Moore

Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 34Copyright © 2001, Andrew W. Moore

Common SVM basis functions

zk = ( polynomial terms of xk of degree 1 to q )

zk = ( radial basis functions of xk )

zk = ( sigmoid functions of xk )

This is sensible.

Is that the end of the story?

No…there’s one more trick!

For example, in the radial-basis case:  zk[j] = φj(xk) = KernelFn( |xk - cj| / KW )

Support Vector Machines: Slide 35Copyright © 2001, Andrew W. Moore

The “Kernel Trick”

Support Vector Machines: Slide 36Copyright © 2001, Andrew W. Moore

Quadratic Dot Products

Write the quadratic basis-function expansion of an input a as

  Φ(a) = ( 1,
           √2 a1, √2 a2, …, √2 am,
           a1², a2², …, am²,
           √2 a1a2, √2 a1a3, …, √2 am-1am )

and similarly Φ(b). Multiplying the two vectors term by term:

  Φ(a) . Φ(b) =  1                        (constant term)
               + 2 Σi ai bi               (linear terms)
               + Σi ai² bi²               (pure quadratic terms)
               + 2 Σi<j ai aj bi bj       (quadratic cross-terms)

Support Vector Machines: Slide 37Copyright © 2001, Andrew W. Moore

Quadratic Dot Products

Now expand (a . b + 1)² directly:

  (a . b + 1)² = (a . b)² + 2 (a . b) + 1
               = (Σi ai bi)² + 2 Σi ai bi + 1
               = Σi Σj ai aj bi bj + 2 Σi ai bi + 1
               = Σi ai² bi² + 2 Σi<j ai aj bi bj + 2 Σi ai bi + 1
               = Φ(a) . Φ(b)

They’re the same!

And this is only O(m) to compute!
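A quick numerical sanity check of this equivalence (the helper names Phi and quad_kernel are mine, not the slides'):

import numpy as np
from itertools import combinations

def Phi(a):
    # (1, sqrt(2) a_i, a_i^2, sqrt(2) a_i a_j for i<j): the expansion above
    cross = [np.sqrt(2) * a[i] * a[j] for i, j in combinations(range(len(a)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * a, a ** 2, cross])

def quad_kernel(a, b):
    return (np.dot(a, b) + 1.0) ** 2    # the O(m) shortcut

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)
print(np.allclose(Phi(a) @ Phi(b), quad_kernel(a, b)))   # True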

Support Vector Machines: Slide 38Copyright © 2001, Andrew W. Moore

SVM Kernel Functions

• K(a,b) = (a . b + 1)^d is an example of an SVM Kernel Function
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function
• Radial-Basis-style Kernel Function:
    K(a,b) = exp( - (a - b)² / (2 σ²) )
• Neural-net-style Kernel Function:
    K(a,b) = tanh( κ a . b - δ )
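For reference, the three kernel families above as plain Python functions (the parameter names sigma, kappa, delta follow the formulas; the defaults are arbitrary):

import numpy as np

def poly_kernel(a, b, d=2):
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    return np.tanh(kappa * np.dot(a, b) - delta)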

Support Vector Machines: Slide 39Copyright © 2001, Andrew W. Moore

Non-linear SVM with Kernels

Support Vector Machines: Slide 40Copyright © 2001, Andrew W. Moore

Doing multi-class classification

• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).

• What can be done?
• Answer: with output arity N, learn N SVMs:
  • SVM 1 learns "Output==1" vs "Output != 1"
  • SVM 2 learns "Output==2" vs "Output != 2"
  • :
  • SVM N learns "Output==N" vs "Output != N"

• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
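A minimal sketch of that one-vs-rest scheme (scikit-learn's SVC stands in for the N individual SVMs; not part of the slides):

import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes):
    # one binary SVM per class: "Output == c" vs "Output != c"
    return {c: SVC(kernel='linear').fit(X, np.where(y == c, 1, -1)) for c in classes}

def predict_one_vs_rest(models, x):
    # the SVM whose prediction is furthest into the positive region wins
    scores = {c: m.decision_function(np.atleast_2d(x))[0] for c, m in models.items()}
    return max(scores, key=scores.get)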

Support Vector Machines: Slide 41Copyright © 2001, Andrew W. Moore

What You Should Know
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you (but, for this class, you don't need to know how it does it)
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms

Support Vector Machines: Slide 42Copyright © 2001, Andrew W. Moore

Remaining…

• Actually, I skip:
  • ERM (Empirical Risk Minimization)
  • VC (Vapnik/Chervonenkis) dimension
• But these are needed at data-processing time:
  • Handling unbalanced datasets
  • Selecting C
  • Selecting the kernel
• What is a valid kernel? How do we construct a valid kernel?
  • These are needed at tuning time.

Now the real game starts in earnest!

Support Vector Machines: Slide 43Copyright © 2001, Andrew W. Moore

SVMlight

• http://svmlight.joachims.org (Thorsten Joachims @ Cornell Univ.)

• Commands
  • svm_learn [options] [train_file] [model_file]
      e.g.  svm_learn -c 1.5 -x 1 train_file model_file
  • svm_classify [test_file] [model_file] [prediction_file]
      e.g.  svm_classify test_file model_file prediction_file

• File Format
  <y> <featureNumber>:<value> <featureNumber>:<value> … # comment
      e.g.  1 23:0.5 105:0.1 1023:0.8 1999:0.34

  which corresponds to the dense row:

  Idx   | 1 | … | 23  | … | 105 | … | 1023 | 1024 | … | 1998 | 1999 | … | cls
  Value | 0 | 0 | 0.5 | 0 | 0.1 | 0 | 0.8  | 0    | 0 | 0    | 0.34 | 0 | 1
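A small helper (hypothetical, not part of SVMlight) that expands one line of this sparse format into the dense row shown above:

def parse_svmlight_line(line, num_features):
    body = line.split('#')[0].split()          # drop any trailing comment
    label, dense = body[0], [0.0] * (num_features + 1)   # feature ids are 1-based
    for tok in body[1:]:
        idx, val = tok.split(':')
        dense[int(idx)] = float(val)
    return label, dense

label, vec = parse_svmlight_line("1 23:0.5 105:0.1 1023:0.8 1999:0.34", 1999)
print(label, vec[23], vec[105], vec[1023], vec[1999])   # 1 0.5 0.1 0.8 0.34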

Support Vector Machines: Slide 44Copyright © 2001, Andrew W. Moore

Options

• -z <char> : selects whether the SVM is trained in classification (c) or regression (r) mode
• -c <float> : controls the trade-off between training error and margin; the lower the value of C, the more training error is tolerated. The best value of C depends on the data and must be determined empirically.
• -w <float> : width of the tube (i.e. ε) of the ε-insensitive loss function used in regression. The value must be non-negative.
• -j <float> : specifies cost-factors for the loss functions both in classification and regression.
• -b <int> : switches between a biased (1) or an unbiased (0) hyperplane.
• -i <int> : selects how training errors are treated (1-4).
• -x <int> : selects whether to compute leave-one-out estimates of the generalization performance.
• -t <int> : selects the type of kernel function (0-4).
• -q <int> : maximum size of the working set.
• -m <int> : specifies the amount of memory (in megabytes) available for caching kernel evaluations.
• ……………………………………..

Support Vector Machines: Slide 45Copyright © 2001, Andrew W. Moore

Example 1: Sense Disambiguation

• AK has 3 senses: above knee, acetate kinase, artificial kidney
• Example:
    w13, w345, w874, w8, w7345, AK(acetate kinase), w123, w13, w8, w14, w8
    => 2 w8:3 w13:2 w14:1 w123:1 w345:1 w874:1 w7345:1
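A sketch of how such a context might be turned into a training line (the helper and its behaviour are illustrative, not from the slides; real SVMlight features would use sorted numeric ids):

from collections import Counter

def context_to_line(sense_id, context_words):
    # count the context words around the ambiguous term and emit "word:count" pairs
    counts = Counter(w for w in context_words if not w.startswith("AK"))
    feats = " ".join(f"{w}:{c}" for w, c in counts.most_common())
    return f"{sense_id} {feats}"

ctx = ["w13", "w345", "w874", "w8", "w7345", "AK(acetate kinase)",
       "w123", "w13", "w8", "w14", "w8"]
print(context_to_line(2, ctx))   # e.g. "2 w8:3 w13:2 w345:1 w874:1 w7345:1 w123:1 w14:1"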

Support Vector Machines: Slide 46Copyright © 2001, Andrew W. Moore

Example of training set

1 109:0.060606 193:0.090909 211:0.052632 3079:1.000000 5882:0.200000
1 109:0.030303 143:1.000000 377:0.500000 439:0.500000 1012:0.500000 3808:1.000000 3891:1.000000 4789:0.333333 18363:0.333333 23106:1.000000

1 174:0.166667 244:0.500000 321:0.500000 332:0.500000 723:0.333333 3064:0.500000 3872:1.000000 16401:1.000000 19369:1.000000 23109:1.000000

2 108:0.250000 148:0.333333 202:0.250000 313:0.200000 380:1.000000 3303:1.000000 8944:1.000000 11513:1.000000 23110:1.000000 23111:1.000000

1 100:0.125000 109:0.030303 129:0.200000 131:0.071429 135:0.050000 543:0.500000 5880:0.200000 17153:0.100000 23112:0.100000

1 100:0.125000 109:0.060606 125:0.500000 131:0.071429 193:0.045455 2553:0.200000 4790:0.100000 5880:0.200000

2 382:1.000000 410:1.000000 790:1.000000 1160:0.500000 2052:1.000000 3260:1.000000 7323:0.500000 23113:1.000000 23114:1.000000 23115:1.000000

2 133:1.000000 147:0.250000 148:0.333333 177:0.500000 214:1.000000 960:1.000000 1131:1.000000 1359:1.000000 6378:0.333333 22842:1.000000

1 109:0.060606 917:1.000000 2747:0.500000 2748:0.500000 2749:1.000000 3252:1.000000 21699:1.000000

1 543:0.500000 557:1.000000 563:0.500000 1077:1.000000 1160:0.500000 2747:0.500000 2748:0.500000 12805:1.000000 23116:1.000000 23117:1.000000

1 593:1.000000 1124:0.500000 1615:1.000000 3607:1.000000 13443:1.000000 23118:1.000000 23119:1.000000 23120:1.000000

…………………………………………………………

Support Vector Machines: Slide 47Copyright © 2001, Andrew W. Moore

Trained model of SVM

SVM-multiclass Version V1.01
2 # number of classes
24208 # number of base features
0 # loss function
3 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty # kernel parameter -u
48584 # highest feature index
167 # number of training documents
335 # number of support vectors plus 1
0 # threshold b, each following line is a SV (starting with alpha*y)
0.010000000000000000208166817117217 24401:0.2 24407:0.071428999 24756:1 24909:0.25 25283:1 25424:0.33333299 25487:0.043478001 25773:0.5 31693:1 37411:1 #
-0.010000000000000000208166817117217 193:0.2 199:0.071428999 548:1 701:0.25 1075:1 1216:0.33333299 1279:0.043478001 1565:0.5 7485:1 13203:1 #
0.010000000000000000208166817117217 24407:0.071428999 24419:0.0625 25807:0.166667 26905:0.166667 27631:1 27664:1 35043:1 35888:1 36118:1 48085:1 #
-0.010000000000000000208166817117217 199:0.071428999 211:0.0625 1599:0.166667 2697:0.166667 3423:1 3456:1 10835:1 11680:1 11910:1 23877:1 #
……………………………………………………

Support Vector Machines: Slide 48Copyright © 2001, Andrew W. Moore

Example 2: Text Chunking

Support Vector Machines: Slide 49Copyright © 2001, Andrew W. Moore

Example 3: Text Classification

• Corpus: Reuters-21578
  • 12,902 news stories
  • 118 categories
  • ModApte split (75% training: 9,603 docs / 25% test: 3,299 docs)

Support Vector Machines: Slide 50Copyright © 2001, Andrew W. Moore

Q & A