Lec9 SVM Nonlinear



Slide 1/37: Support Vector Machines: Nonlinear Case

Jieping Ye
Department of Computer Science and Engineering
Arizona State University
http://www.public.asu.edu/~jye02

Source: Andrew's tutorials on SVM
Slide 2/37: Outline of lecture

- Nonlinear SVM using basis functions
- Nonlinear SVM using kernels
- Extensions:
  - SVM for multi-class classification
  - SVM path
  - SVM for unbalanced data

Slide 3/37: Support Vector Machine: Linear Case

Balance the trade-off between the margin and the classification errors:

$$\{w^*, b^*\} = \arg\min_{w,b}\ \frac{1}{2}\|w\|^2 + c\sum_{i=1}^{N}\varepsilon_i$$

subject to

$$y_1(w \cdot x_1 + b) \ge 1 - \varepsilon_1,\quad \varepsilon_1 \ge 0$$
$$y_2(w \cdot x_2 + b) \ge 1 - \varepsilon_2,\quad \varepsilon_2 \ge 0$$
$$\dots$$
$$y_N(w \cdot x_N + b) \ge 1 - \varepsilon_N,\quad \varepsilon_N \ge 0$$

[Figure: two classes of points ("denotes +1", "denotes -1") with a separating hyperplane; three margin-violating points, labeled 1, 2, 3, illustrate the slack variables.]
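As a hedged illustration (my own example, not from the lecture; it assumes scikit-learn is available), LinearSVC minimizes essentially this soft-margin objective, with its C parameter playing the role of c above:

```python
# Sweeping C trades margin width against training errors: small C tolerates
# more slack (wider margin), large C penalizes misclassification harder.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(+2, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C).fit(X, y)
    print(C, clf.score(X, y))
```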

Slide 4/37: Support Vector Machine: Linear Case (the dual QP)

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,(x_k \cdot x_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k=1}^{R}\alpha_k y_k x_k$$

Then classify with: f(x, w, b) = sign(w · x - b)
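As a hedged aside (my own sketch, not part of the lecture; it assumes the cvxpy package, including its psd_wrap helper), this dual can be handed directly to an off-the-shelf QP solver:

```python
# Solve the dual SVM QP and recover w = sum_k alpha_k y_k x_k.
import numpy as np
import cvxpy as cp

def svm_dual(X, y, C=1.0):
    R = len(y)
    Q = np.outer(y, y) * (X @ X.T)       # Q_kl = y_k y_l (x_k . x_l)
    alpha = cp.Variable(R)
    obj = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, cp.psd_wrap(Q)))
    cons = [alpha >= 0, alpha <= C,      # 0 <= alpha_k <= C
            alpha @ y == 0]              # sum_k alpha_k y_k = 0
    cp.Problem(obj, cons).solve()
    w = (alpha.value * y) @ X
    return alpha.value, w
```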

Slide 5/37: Determining b

[Most of this slide was lost in extraction; the surviving text notes that finding the bias b, given the optimal multipliers, is "A linear programming problem!"]

Slide 6/37: Suppose we're in 1 dimension

What would SVMs do with this data?

[Figure: positive and negative points on a number line, with x = 0 marked.]

Slide 7/37: Suppose we're in 1 dimension

Not a big surprise.

[Figure: the max-margin split of the 1-D data, with the positive "plane" and negative "plane" on either side of x = 0.]

Slide 8/37: Harder 1-dimensional dataset

What can be done about this?

[Figure: a 1-D dataset that is not linearly separable, with x = 0 marked.]

Slide 9/37: Harder 1-dimensional dataset

Apply the following map:
$$z_k = (x_k,\ x_k^2)$$

[Figure: the original 1-D data around x = 0.]

Slide 10/37: Harder 1-dimensional dataset

Apply the following map:
$$z_k = (x_k,\ x_k^2)$$

[Figure: the lifted data in the 2-D z-space, now linearly separable.]

Slide 11/37: Harder 1-dimensional dataset

$$z_k = (x_k,\ x_k^2)$$

[Figure: the points x = -4, -3, -1, 0, 1, 3, 4 lifted onto the parabola z_2 = z_1^2, e.g. (-4, 16) and (-3, 9); a horizontal line now separates the two classes.]
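A quick numeric check (my own example, not from the slides) that the lift works:

```python
# After z_k = (x_k, x_k^2), "negatives in the middle" becomes separable in 2-D.
import numpy as np

x = np.array([-4, -3, -1, 0, 1, 3, 4])
y = np.array([+1, +1, -1, -1, -1, +1, +1])
z = np.column_stack([x, x ** 2])          # lift onto the parabola
# In z-space a horizontal line separates the classes, e.g. z_2 = 4:
print(np.all(np.sign(z[:, 1] - 4) == y))  # True
```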

Slide 12/37: Harder 2-dimensional dataset

Apply the following map:
$$z_k = (x_k,\ y_k,\ x_k y_k,\ x_k^2,\ y_k^2)$$

Slide 13/37: Common SVM basis functions

- z_k = ( polynomial terms of x_k of degree 1 to q )
- z_k = ( radial basis functions of x_k ), e.g. (sketched below)
  $$z_k[j] = \varphi_j(x_k) = \exp\!\left(-\frac{\|x_k - c_j\|^2}{2\sigma^2}\right)$$
- z_k = ( sigmoid functions of x_k )
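A small sketch of the radial-basis features above (my own code; the centers c_j and the width sigma are assumptions for illustration):

```python
# One feature per basis center: z_k[j] = exp(-||x_k - c_j||^2 / (2 sigma^2)).
import numpy as np

def rbf_features(x, centers, sigma=1.0):
    d2 = ((x - centers) ** 2).sum(axis=1)   # squared distance to each c_j
    return np.exp(-d2 / (2 * sigma ** 2))

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
print(rbf_features(np.array([0.5, 0.5]), centers))
```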

Slide 14/37: Quadratic Basis Functions

$$\Phi(x) = \big(\,1,\;\; \sqrt{2}x_1,\ \sqrt{2}x_2,\ \dots,\ \sqrt{2}x_m,\;\; x_1^2,\ x_2^2,\ \dots,\ x_m^2,\;\; \sqrt{2}x_1x_2,\ \sqrt{2}x_1x_3,\ \dots,\ \sqrt{2}x_{m-1}x_m \,\big)^{\top}$$

with the groups, in order: the constant term, the linear terms, the pure quadratic terms, and the quadratic cross-terms.

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 ≈ m^2/2.

You may be wondering what those √2's are doing. You'll find out why they're there soon.
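A sketch (my own code) of building this feature vector, in the same order as above:

```python
# Quadratic feature map: constant / linear / pure-quadratic / cross terms.
import numpy as np

def phi_quadratic(x):
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([[1.0],            # constant term
                           np.sqrt(2) * x,   # sqrt(2) x_i
                           x ** 2,           # x_i^2
                           cross])           # sqrt(2) x_i x_j, i < j

print(len(phi_quadratic(np.ones(3))))        # (m+2)(m+1)/2 = 10 for m = 3
```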

Slide 15/37: QP (old)

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,(x_k \cdot x_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k=1}^{R}\alpha_k y_k x_k$$

Then classify with: f(x, w, b) = sign(w · x - b)

Slide 16/37: QP with basis functions

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,\big(\Phi(x_k)\cdot\Phi(x_l)\big)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k \Phi(x_k)$$

Then classify with: f(x, w, b) = sign(w · Φ(x) - b)

Most important change: x → Φ(x)

Slide 17/37: QP with basis functions

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,\big(\Phi(x_k)\cdot\Phi(x_l)\big)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k \Phi(x_k)$$

Then classify with: f(x, w, b) = sign(w · Φ(x) - b)

We must do R^2/2 dot products to get this matrix ready. Each dot product requires m^2/2 additions and multiplications, so the whole thing costs R^2 m^2 / 4.

Slide 18/37: Quadratic Dot Products

Multiplying out Φ(a) · Φ(b) group by group (constant, linear, pure quadratic, cross-terms):

$$\Phi(a)\cdot\Phi(b) = 1 + \sum_{i=1}^{m} 2a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j$$

Slide 19/37: Quadratic Dot Products

$$\Phi(a)\cdot\Phi(b) = 1 + \sum_{i=1}^{m} 2a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j$$

Just out of interest, let's look at another function of a and b:

$$(a\cdot b + 1)^2 = (a\cdot b)^2 + 2\,a\cdot b + 1 = \left(\sum_{i=1}^{m} a_i b_i\right)^{\!2} + 2\sum_{i=1}^{m} a_i b_i + 1$$
$$= \sum_{i=1}^{m}\sum_{j=1}^{m} a_i b_i a_j b_j + 2\sum_{i=1}^{m} a_i b_i + 1 = \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j + 2\sum_{i=1}^{m} a_i b_i + 1$$

Slide 20/37: Quadratic Dot Products

$$\Phi(a)\cdot\Phi(b) = 1 + \sum_{i=1}^{m} 2a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j$$

Just out of interest, let's look at another function of a and b:

$$(a\cdot b + 1)^2 = (a\cdot b)^2 + 2\,a\cdot b + 1 = \left(\sum_{i=1}^{m} a_i b_i\right)^{\!2} + 2\sum_{i=1}^{m} a_i b_i + 1$$
$$= \sum_{i=1}^{m}\sum_{j=1}^{m} a_i b_i a_j b_j + 2\sum_{i=1}^{m} a_i b_i + 1 = \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j + 2\sum_{i=1}^{m} a_i b_i + 1$$

They're the same! And this is only O(m) to compute!
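A numeric confirmation of the identity (my own check, not from the slides): with the √2-scaled quadratic map Φ from slide 14, Φ(a) · Φ(b) equals (a · b + 1)^2 exactly.

```python
# Verify Phi(a).Phi(b) == (a.b + 1)^2 on arbitrary vectors.
import numpy as np

def phi(x):
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

a = np.array([1.0, -2.0, 0.5])
b = np.array([3.0, 0.0, -1.0])
print(np.isclose(phi(a) @ phi(b), (a @ b + 1) ** 2))  # True
```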

Slide 21/37: QP with basis functions

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,\big(\Phi(x_k)\cdot\Phi(x_l)\big)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k \Phi(x_k)$$

Then classify with: f(x, w, b) = sign(w · Φ(x) - b)

We must do R^2/2 dot products to get this matrix ready. Each dot product now only requires m additions and multiplications.

Slide 22/37: Higher Order Polynomials

Polynomial | Φ(x)                           | Cost to build Q_kl traditionally | Cost if 100 inputs | Φ(a)·Φ(b)   | Cost to build Q_kl sneakily | Cost if 100 inputs
-----------|--------------------------------|----------------------------------|--------------------|-------------|-----------------------------|-------------------
Quadratic  | all m^2/2 terms up to degree 2 | m^2 R^2 / 4                      | 2,500 R^2          | (a·b + 1)^2 | m R^2 / 2                   | 50 R^2
Cubic      | all m^3/6 terms up to degree 3 | m^3 R^2 / 12                     | 83,000 R^2         | (a·b + 1)^3 | m R^2 / 2                   | 50 R^2
Quartic    | all m^4/24 terms up to degree 4 | m^4 R^2 / 48                    | 1,960,000 R^2      | (a·b + 1)^4 | m R^2 / 2                   | 50 R^2

Slide 23/37: QP with Quintic basis functions

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,\big(\Phi(x_k)\cdot\Phi(x_l)\big) = y_k y_l\,(x_k \cdot x_l + 1)^5$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k \Phi(x_k)$$

so that

$$w\cdot\Phi(x) = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k\,\big(\Phi(x_k)\cdot\Phi(x)\big) = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k\,(x_k\cdot x + 1)^5$$

Then classify with: f(x, w, b) = sign(w · Φ(x) - b)

Only S·m operations (S = #support vectors).

Slide 24/37: QP with Kernel functions

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,K(x_k, x_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k \Phi(x_k)$$

Then classify with: f(x, w, b) = sign(K(w, x) - b)

Most important change:
$$\Phi(x_k)\cdot\Phi(x_l)\ \to\ K(x_k, x_l)$$
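A minimal sketch (my own code with hypothetical argument names, not the lecture's) of kernelized classification via the support-vector expansion from slide 23, f(x) = sign( Σ_k α_k y_k K(x_k, x) - b ):

```python
# Sum runs only over support vectors (alpha_k > 0).
import numpy as np

def kernel_predict(x, X_sv, y_sv, alpha_sv, b, K):
    # X_sv: support vectors; y_sv: their labels; alpha_sv: their multipliers
    k_vals = np.array([K(xk, x) for xk in X_sv])
    return np.sign(np.sum(alpha_sv * y_sv * k_vals) - b)

quintic = lambda u, v: (u @ v + 1) ** 5    # the kernel from slide 23
rbf = lambda u, v, s=1.0: np.exp(-np.sum((u - v) ** 2) / (2 * s ** 2))
```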

Slide 25/37: SVM Kernel Functions

- K(a, b) = (a · b + 1)^d is an example of an SVM kernel function.
- Beyond polynomials, there are other very high-dimensional basis functions that can be made practical by finding the right kernel function:
  - Radial-basis-style kernel function:
    $$K(a, b) = \exp\!\left(-\frac{(a - b)^2}{2\sigma^2}\right)$$
  - Sigmoidal function

Slide 26/37: Kernel Tricks

- Replacing the dot product with a kernel function.
- Not all functions are kernel functions: they need to be decomposable as
  K(a, b) = Φ(a) · Φ(b)
- Could K(a, b) = (a - b)^3 be a kernel function?
- Could K(a, b) = (a - b)^4 - (a + b)^2 be a kernel function?

Slide 27/37: Kernel Tricks: Mercer's condition

To expand a kernel function K(x, y) into a dot product, i.e. K(x, y) = Φ(x) · Φ(y), K(x, y) has to be a positive semi-definite function, i.e., for any function f(x) whose $\int f(x)^2\,dx$ is finite, the following inequality holds:

$$\iint f(x)\,K(x, y)\,f(y)\,dx\,dy \ge 0$$

Could
$$K(x, y) = \sum_i (x_i y_i)^p$$
be a kernel function?
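One practical consequence (my own sketch, not from the slides): for any valid kernel, the Gram matrix G_ij = K(x_i, x_j) must be symmetric positive semi-definite on every point set, so a single negative eigenvalue on some sample disproves kernel-hood.

```python
# Empirical Mercer check via the smallest Gram-matrix eigenvalue.
import numpy as np

def min_gram_eigenvalue(K, xs):
    G = np.array([[K(a, b) for b in xs] for a in xs])
    return np.linalg.eigvalsh(G).min()

xs = np.linspace(-2.0, 2.0, 9)
print(min_gram_eigenvalue(lambda a, b: (a * b + 1) ** 2, xs))  # ~0 or positive
# (a - b)^3 already fails symmetry, since K(a,b) = -K(b,a); and (a - b)^4,
# though symmetric, produces negative eigenvalues:
print(min_gram_eigenvalue(lambda a, b: (a - b) ** 4, xs))      # negative
```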

Slide 28/37: Kernel Tricks: Pros and Cons

Pros:
- Introduces nonlinearity into the model.
- Computationally cheap.

Cons:
- Still has potential overfitting problems.

Slide 29/37: Nonlinear Kernel (I)

[Figure: example decision boundary produced by a nonlinear kernel.]

Slide 30/37: Nonlinear Kernel (II)

[Figure: a second example decision boundary produced by a nonlinear kernel.]

Slide 31/37: SVM Performance

- Generalization theory
- General methodology for many types of problems:
  Same Program + New Kernel = New Method
- No problems with local minima
- Robust optimization methods
- Successful applications

Slide 32/37: SVM Performance

- Do SVMs scale to massive datasets?
- How to choose C and the kernel?
- What is the effect of attribute scaling?
- How to handle categorical variables?
- How to incorporate domain knowledge?

Slide 33/37: SVM for multi-class classification

- SVMs can only handle two-class outputs. What can be done?
- Answer: with output arity N, learn N SVMs:
  - SVM 1 learns "Output == 1" vs "Output != 1"
  - SVM 2 learns "Output == 2" vs "Output != 2"
  - ...
  - SVM N learns "Output == N" vs "Output != N"
- Then, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region (sketched below).
- Other approaches: pairwise SVM, multi-category SVM.
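A sketch of that one-vs-rest scheme (my own code; it assumes scikit-learn). Each binary SVM scores the test point with its signed distance to the boundary, and the class whose SVM pushes the point furthest into its positive region wins:

```python
import numpy as np
from sklearn.svm import SVC

def one_vs_rest_fit_predict(X, y, X_test, **svc_args):
    classes = np.unique(y)
    # One column of decision values per "class c vs rest" SVM.
    scores = np.column_stack([
        SVC(**svc_args).fit(X, np.where(y == c, 1, -1)).decision_function(X_test)
        for c in classes])
    return classes[np.argmax(scores, axis=1)]
```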

Slide 34/37: SVM path: model selection

The Entire Regularization Path for the Support Vector Machine (Hastie, Rosset, Tibshirani and Zhu)
http://www.jmlr.org/papers/volume5/hastie04a/hastie04a.pdf

An algorithm for computing the two-class SVM solution for all possible values of the regularization parameter C, at essentially the computational cost of a single SVM fit. Not only does this allow for efficient model selection, but it also exposes the role of regularization for SVMs.

The C being traced is the one that bounds the dual variables:

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,K(x_k, x_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$
Slide 35/37: SVM for unbalanced data

Original SVM formulation:

$$\min_{w \in \mathbb{R}^d,\, b}\ \frac{1}{2}\|w\|^2 + c\sum_{j=1}^{N}\varepsilon_j \qquad \text{s.t. } y_i\,(x_i \cdot w + b) \ge 1 - \varepsilon_i \ \text{ for all } i$$

Weighted formulation:

$$\min_{w \in \mathbb{R}^d,\, b}\ \frac{1}{2}\|w\|^2 + c_1\sum_{x_j \in C_1}\varepsilon_j + c_2\sum_{x_j \in C_2}\varepsilon_j \qquad \text{s.t. } y_i\,(x_i \cdot w + b) \ge 1 - \varepsilon_i \ \text{ for all } i$$

If the first class has a much smaller size than the second class, apply different weights to the two classes: c_1 > c_2 (a hedged sketch of this in practice follows).
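As an illustration (my own, assuming scikit-learn): SVC exposes exactly this weighted formulation through class_weight, which multiplies C per class, i.e. c_i = C · class_weight[i].

```python
# Penalize slack on the rare class 20x more heavily (c1 > c2).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (10, 2)),     # rare class (+1)
               rng.normal(+1, 1, (200, 2))])   # common class (-1)
y = np.array([+1] * 10 + [-1] * 200)

clf = SVC(kernel='linear', C=1.0, class_weight={+1: 20.0, -1: 1.0}).fit(X, y)
# class_weight='balanced' instead scales weights by inverse class frequency.
```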

Slide 36/37: References

- C.J.C. Burges. A tutorial on support vector machines for pattern recognition.
- Kristin Bennett. Support Vector Machines: Hype or Hallelujah?
- Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

Software:
- SVM-light, http://svmlight.joachims.org/, free download.
Slide 37/37: Next class

Topics:
- Multi-class SVM
- Semi-supervised clustering

Readings:
- In Defense of One-Vs-All Classification
- Constrained K-means Clustering with Background Knowledge
- Semi-supervised Clustering by Seeding
- Distance metric learning, with application to clustering with side-information