Lec9 SVM Nonlinear



Slide 1/37: Support Vector Machines: Nonlinear Case

Jieping Ye
Department of Computer Science and Engineering
Arizona State University
http://www.public.asu.edu/~jye02

Source: Andrew's tutorials on SVM
Slide 2/37: Outline of lecture

- Nonlinear SVM using basis functions
- Nonlinear SVM using kernels
- Extensions:
  - SVM for multi-class classification
  - SVM path
  - SVM for unbalanced data

Slide 3/37: Support Vector Machine: Linear Case

Balance the trade-off between the margin and the classification errors:

$$\{w^*, b^*\} = \arg\min_{w,b}\ \frac{1}{2}\|w\|^2 + c\sum_{i=1}^{N}\varepsilon_i$$

subject to

$$y_1(w \cdot x_1 + b) \ge 1 - \varepsilon_1,\quad \varepsilon_1 \ge 0$$
$$y_2(w \cdot x_2 + b) \ge 1 - \varepsilon_2,\quad \varepsilon_2 \ge 0$$
$$\dots$$
$$y_N(w \cdot x_N + b) \ge 1 - \varepsilon_N,\quad \varepsilon_N \ge 0$$

[Figure: two classes of points ("denotes +1", "denotes -1") with a separating hyperplane; three margin-violating points, labeled 1, 2, 3, illustrate the slack variables.]
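As a hedged illustration (my own example, not from the lecture; it assumes scikit-learn is available), LinearSVC minimizes essentially this soft-margin objective, with its C parameter playing the role of c above:

```python
# Sweeping C trades margin width against training errors: small C tolerates
# more slack (wider margin), large C penalizes misclassification harder.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(+2, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C).fit(X, y)
    print(C, clf.score(X, y))
```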

Slide 4/37: Support Vector Machine: Linear Case (the dual QP)

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,(x_k \cdot x_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k=1}^{R}\alpha_k y_k x_k$$

Then classify with: f(x, w, b) = sign(w · x - b)
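As a hedged aside (my own sketch, not part of the lecture; it assumes the cvxpy package, including its psd_wrap helper), this dual can be handed directly to an off-the-shelf QP solver:

```python
# Solve the dual SVM QP and recover w = sum_k alpha_k y_k x_k.
import numpy as np
import cvxpy as cp

def svm_dual(X, y, C=1.0):
    R = len(y)
    Q = np.outer(y, y) * (X @ X.T)       # Q_kl = y_k y_l (x_k . x_l)
    alpha = cp.Variable(R)
    obj = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, cp.psd_wrap(Q)))
    cons = [alpha >= 0, alpha <= C,      # 0 <= alpha_k <= C
            alpha @ y == 0]              # sum_k alpha_k y_k = 0
    cp.Problem(obj, cons).solve()
    w = (alpha.value * y) @ X
    return alpha.value, w
```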

Slide 5/37: Determining b

[Most of this slide was lost in extraction; the surviving text notes that finding the bias b, given the optimal multipliers, is "A linear programming problem!"]

Slide 6/37: Suppose we're in 1 dimension

What would SVMs do with this data?

[Figure: positive and negative points on a number line, with x = 0 marked.]

Slide 7/37: Suppose we're in 1 dimension

Not a big surprise.

[Figure: the max-margin split of the 1-D data, with the positive "plane" and negative "plane" on either side of x = 0.]

Slide 8/37: Harder 1-dimensional dataset

What can be done about this?

[Figure: a 1-D dataset that is not linearly separable, with x = 0 marked.]

Slide 9/37: Harder 1-dimensional dataset

Apply the following map:
$$z_k = (x_k,\ x_k^2)$$

[Figure: the original 1-D data around x = 0.]

Slide 10/37: Harder 1-dimensional dataset

Apply the following map:
$$z_k = (x_k,\ x_k^2)$$

[Figure: the lifted data in the 2-D z-space, now linearly separable.]

Slide 11/37: Harder 1-dimensional dataset

$$z_k = (x_k,\ x_k^2)$$

[Figure: the points x = -4, -3, -1, 0, 1, 3, 4 lifted onto the parabola z_2 = z_1^2, e.g. (-4, 16) and (-3, 9); a horizontal line now separates the two classes.]
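A quick numeric check (my own example, not from the slides) that the lift works:

```python
# After z_k = (x_k, x_k^2), "negatives in the middle" becomes separable in 2-D.
import numpy as np

x = np.array([-4, -3, -1, 0, 1, 3, 4])
y = np.array([+1, +1, -1, -1, -1, +1, +1])
z = np.column_stack([x, x ** 2])          # lift onto the parabola
# In z-space a horizontal line separates the classes, e.g. z_2 = 4:
print(np.all(np.sign(z[:, 1] - 4) == y))  # True
```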

Slide 12/37: Harder 2-dimensional dataset

Apply the following map:
$$z_k = (x_k,\ y_k,\ x_k y_k,\ x_k^2,\ y_k^2)$$

Slide 13/37: Common SVM basis functions

- z_k = ( polynomial terms of x_k of degree 1 to q )
- z_k = ( radial basis functions of x_k ), e.g. (sketched below)
  $$z_k[j] = \varphi_j(x_k) = \exp\!\left(-\frac{\|x_k - c_j\|^2}{2\sigma^2}\right)$$
- z_k = ( sigmoid functions of x_k )
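A small sketch of the radial-basis features above (my own code; the centers c_j and the width sigma are assumptions for illustration):

```python
# One feature per basis center: z_k[j] = exp(-||x_k - c_j||^2 / (2 sigma^2)).
import numpy as np

def rbf_features(x, centers, sigma=1.0):
    d2 = ((x - centers) ** 2).sum(axis=1)   # squared distance to each c_j
    return np.exp(-d2 / (2 * sigma ** 2))

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
print(rbf_features(np.array([0.5, 0.5]), centers))
```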

Slide 14/37: Quadratic Basis Functions

$$\Phi(x) = \big(\,1,\;\; \sqrt{2}x_1,\ \sqrt{2}x_2,\ \dots,\ \sqrt{2}x_m,\;\; x_1^2,\ x_2^2,\ \dots,\ x_m^2,\;\; \sqrt{2}x_1x_2,\ \sqrt{2}x_1x_3,\ \dots,\ \sqrt{2}x_{m-1}x_m \,\big)^{\top}$$

with the groups, in order: the constant term, the linear terms, the pure quadratic terms, and the quadratic cross-terms.

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 ≈ m^2/2.

You may be wondering what those √2's are doing. You'll find out why they're there soon.
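A sketch (my own code) of building this feature vector, in the same order as above:

```python
# Quadratic feature map: constant / linear / pure-quadratic / cross terms.
import numpy as np

def phi_quadratic(x):
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([[1.0],            # constant term
                           np.sqrt(2) * x,   # sqrt(2) x_i
                           x ** 2,           # x_i^2
                           cross])           # sqrt(2) x_i x_j, i < j

print(len(phi_quadratic(np.ones(3))))        # (m+2)(m+1)/2 = 10 for m = 3
```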

Slide 15/37: QP (old)

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,(x_k \cdot x_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k=1}^{R}\alpha_k y_k x_k$$

Then classify with: f(x, w, b) = sign(w · x - b)

Slide 16/37: QP with basis functions

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,\big(\Phi(x_k)\cdot\Phi(x_l)\big)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k \Phi(x_k)$$

Then classify with: f(x, w, b) = sign(w · Φ(x) - b)

Most important change: x → Φ(x)

Slide 17/37: QP with basis functions

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,\big(\Phi(x_k)\cdot\Phi(x_l)\big)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k \Phi(x_k)$$

Then classify with: f(x, w, b) = sign(w · Φ(x) - b)

We must do R^2/2 dot products to get this matrix ready. Each dot product requires m^2/2 additions and multiplications, so the whole thing costs R^2 m^2 / 4.

Slide 18/37: Quadratic Dot Products

Multiplying out Φ(a) · Φ(b) group by group (constant, linear, pure quadratic, cross-terms):

$$\Phi(a)\cdot\Phi(b) = 1 + \sum_{i=1}^{m} 2a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j$$

Slide 19/37: Quadratic Dot Products

$$\Phi(a)\cdot\Phi(b) = 1 + \sum_{i=1}^{m} 2a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j$$

Just out of interest, let's look at another function of a and b:

$$(a\cdot b + 1)^2 = (a\cdot b)^2 + 2\,a\cdot b + 1 = \left(\sum_{i=1}^{m} a_i b_i\right)^{\!2} + 2\sum_{i=1}^{m} a_i b_i + 1$$
$$= \sum_{i=1}^{m}\sum_{j=1}^{m} a_i b_i a_j b_j + 2\sum_{i=1}^{m} a_i b_i + 1 = \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j + 2\sum_{i=1}^{m} a_i b_i + 1$$

Slide 20/37: Quadratic Dot Products

$$\Phi(a)\cdot\Phi(b) = 1 + \sum_{i=1}^{m} 2a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j$$

Just out of interest, let's look at another function of a and b:

$$(a\cdot b + 1)^2 = (a\cdot b)^2 + 2\,a\cdot b + 1 = \left(\sum_{i=1}^{m} a_i b_i\right)^{\!2} + 2\sum_{i=1}^{m} a_i b_i + 1$$
$$= \sum_{i=1}^{m}\sum_{j=1}^{m} a_i b_i a_j b_j + 2\sum_{i=1}^{m} a_i b_i + 1 = \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j + 2\sum_{i=1}^{m} a_i b_i + 1$$

They're the same! And this is only O(m) to compute!
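A numeric confirmation of the identity (my own check, not from the slides): with the √2-scaled quadratic map Φ from slide 14, Φ(a) · Φ(b) equals (a · b + 1)^2 exactly.

```python
# Verify Phi(a).Phi(b) == (a.b + 1)^2 on arbitrary vectors.
import numpy as np

def phi(x):
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

a = np.array([1.0, -2.0, 0.5])
b = np.array([3.0, 0.0, -1.0])
print(np.isclose(phi(a) @ phi(b), (a @ b + 1) ** 2))  # True
```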

Slide 21/37: QP with basis functions

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,\big(\Phi(x_k)\cdot\Phi(x_l)\big)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k \Phi(x_k)$$

Then classify with: f(x, w, b) = sign(w · Φ(x) - b)

We must do R^2/2 dot products to get this matrix ready. Each dot product now only requires m additions and multiplications.

Slide 22/37: Higher Order Polynomials

Polynomial | Φ(x)                           | Cost to build Q_kl traditionally | Cost if 100 inputs | Φ(a)·Φ(b)   | Cost to build Q_kl sneakily | Cost if 100 inputs
-----------|--------------------------------|----------------------------------|--------------------|-------------|-----------------------------|-------------------
Quadratic  | all m^2/2 terms up to degree 2 | m^2 R^2 / 4                      | 2,500 R^2          | (a·b + 1)^2 | m R^2 / 2                   | 50 R^2
Cubic      | all m^3/6 terms up to degree 3 | m^3 R^2 / 12                     | 83,000 R^2         | (a·b + 1)^3 | m R^2 / 2                   | 50 R^2
Quartic    | all m^4/24 terms up to degree 4 | m^4 R^2 / 48                    | 1,960,000 R^2      | (a·b + 1)^4 | m R^2 / 2                   | 50 R^2

Slide 23/37: QP with Quintic basis functions

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,\big(\Phi(x_k)\cdot\Phi(x_l)\big) = y_k y_l\,(x_k \cdot x_l + 1)^5$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k \Phi(x_k)$$

so that

$$w\cdot\Phi(x) = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k\,\big(\Phi(x_k)\cdot\Phi(x)\big) = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k\,(x_k\cdot x + 1)^5$$

Then classify with: f(x, w, b) = sign(w · Φ(x) - b)

Only S·m operations (S = #support vectors).

Slide 24/37: QP with Kernel functions

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,K(x_k, x_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k \Phi(x_k)$$

Then classify with: f(x, w, b) = sign(K(w, x) - b)

Most important change:
$$\Phi(x_k)\cdot\Phi(x_l)\ \to\ K(x_k, x_l)$$
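A minimal sketch (my own code with hypothetical argument names, not the lecture's) of kernelized classification via the support-vector expansion from slide 23, f(x) = sign( Σ_k α_k y_k K(x_k, x) - b ):

```python
# Sum runs only over support vectors (alpha_k > 0).
import numpy as np

def kernel_predict(x, X_sv, y_sv, alpha_sv, b, K):
    # X_sv: support vectors; y_sv: their labels; alpha_sv: their multipliers
    k_vals = np.array([K(xk, x) for xk in X_sv])
    return np.sign(np.sum(alpha_sv * y_sv * k_vals) - b)

quintic = lambda u, v: (u @ v + 1) ** 5    # the kernel from slide 23
rbf = lambda u, v, s=1.0: np.exp(-np.sum((u - v) ** 2) / (2 * s ** 2))
```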

Slide 25/37: SVM Kernel Functions

- K(a, b) = (a · b + 1)^d is an example of an SVM kernel function.
- Beyond polynomials, there are other very high-dimensional basis functions that can be made practical by finding the right kernel function:
  - Radial-basis-style kernel function:
    $$K(a, b) = \exp\!\left(-\frac{(a - b)^2}{2\sigma^2}\right)$$
  - Sigmoidal function

Slide 26/37: Kernel Tricks

- Replacing the dot product with a kernel function.
- Not all functions are kernel functions: they need to be decomposable as
  K(a, b) = Φ(a) · Φ(b)
- Could K(a, b) = (a - b)^3 be a kernel function?
- Could K(a, b) = (a - b)^4 - (a + b)^2 be a kernel function?

Slide 27/37: Kernel Tricks: Mercer's condition

To expand a kernel function K(x, y) into a dot product, i.e. K(x, y) = Φ(x) · Φ(y), K(x, y) has to be a positive semi-definite function, i.e., for any function f(x) whose $\int f(x)^2\,dx$ is finite, the following inequality holds:

$$\iint f(x)\,K(x, y)\,f(y)\,dx\,dy \ge 0$$

Could
$$K(x, y) = \sum_i (x_i y_i)^p$$
be a kernel function?
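One practical consequence (my own sketch, not from the slides): for any valid kernel, the Gram matrix G_ij = K(x_i, x_j) must be symmetric positive semi-definite on every point set, so a single negative eigenvalue on some sample disproves kernel-hood.

```python
# Empirical Mercer check via the smallest Gram-matrix eigenvalue.
import numpy as np

def min_gram_eigenvalue(K, xs):
    G = np.array([[K(a, b) for b in xs] for a in xs])
    return np.linalg.eigvalsh(G).min()

xs = np.linspace(-2.0, 2.0, 9)
print(min_gram_eigenvalue(lambda a, b: (a * b + 1) ** 2, xs))  # ~0 or positive
# (a - b)^3 already fails symmetry, since K(a,b) = -K(b,a); and (a - b)^4,
# though symmetric, produces negative eigenvalues:
print(min_gram_eigenvalue(lambda a, b: (a - b) ** 4, xs))      # negative
```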

Slide 28/37: Kernel Tricks: Pros and Cons

Pros:
- Introduces nonlinearity into the model.
- Computationally cheap.

Cons:
- Still has potential overfitting problems.

Slide 29/37: Nonlinear Kernel (I)

[Figure: example decision boundary produced by a nonlinear kernel.]

Slide 30/37: Nonlinear Kernel (II)

[Figure: a second example decision boundary produced by a nonlinear kernel.]

Slide 31/37: SVM Performance

- Generalization theory
- General methodology for many types of problems:
  Same Program + New Kernel = New Method
- No problems with local minima
- Robust optimization methods
- Successful applications

Slide 32/37: SVM Performance

- Do SVMs scale to massive datasets?
- How to choose C and the kernel?
- What is the effect of attribute scaling?
- How to handle categorical variables?
- How to incorporate domain knowledge?

Slide 33/37: SVM for multi-class classification

- SVMs can only handle two-class outputs. What can be done?
- Answer: with output arity N, learn N SVMs:
  - SVM 1 learns "Output == 1" vs "Output != 1"
  - SVM 2 learns "Output == 2" vs "Output != 2"
  - ...
  - SVM N learns "Output == N" vs "Output != N"
- Then, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region (sketched below).
- Other approaches: pairwise SVM, multi-category SVM.
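A sketch of that one-vs-rest scheme (my own code; it assumes scikit-learn). Each binary SVM scores the test point with its signed distance to the boundary, and the class whose SVM pushes the point furthest into its positive region wins:

```python
import numpy as np
from sklearn.svm import SVC

def one_vs_rest_fit_predict(X, y, X_test, **svc_args):
    classes = np.unique(y)
    # One column of decision values per "class c vs rest" SVM.
    scores = np.column_stack([
        SVC(**svc_args).fit(X, np.where(y == c, 1, -1)).decision_function(X_test)
        for c in classes])
    return classes[np.argmax(scores, axis=1)]
```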

Slide 34/37: SVM path: model selection

The Entire Regularization Path for the Support Vector Machine (Hastie, Rosset, Tibshirani and Zhu)
http://www.jmlr.org/papers/volume5/hastie04a/hastie04a.pdf

An algorithm for computing the two-class SVM solution for all possible values of the regularization parameter C, at essentially the computational cost of a single SVM fit. Not only does this allow for efficient model selection, but it also exposes the role of regularization for SVMs.

The C being traced is the one that bounds the dual variables:

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \qquad \text{where } Q_{kl} = y_k y_l\,K(x_k, x_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$
Slide 35/37: SVM for unbalanced data

Original SVM formulation:

$$\min_{w \in \mathbb{R}^d,\, b}\ \frac{1}{2}\|w\|^2 + c\sum_{j=1}^{N}\varepsilon_j \qquad \text{s.t. } y_i\,(x_i \cdot w + b) \ge 1 - \varepsilon_i \ \text{ for all } i$$

Weighted formulation:

$$\min_{w \in \mathbb{R}^d,\, b}\ \frac{1}{2}\|w\|^2 + c_1\sum_{x_j \in C_1}\varepsilon_j + c_2\sum_{x_j \in C_2}\varepsilon_j \qquad \text{s.t. } y_i\,(x_i \cdot w + b) \ge 1 - \varepsilon_i \ \text{ for all } i$$

If the first class has a much smaller size than the second class, apply different weights to the two classes: c_1 > c_2 (a hedged sketch of this in practice follows).
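As an illustration (my own, assuming scikit-learn): SVC exposes exactly this weighted formulation through class_weight, which multiplies C per class, i.e. c_i = C · class_weight[i].

```python
# Penalize slack on the rare class 20x more heavily (c1 > c2).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (10, 2)),     # rare class (+1)
               rng.normal(+1, 1, (200, 2))])   # common class (-1)
y = np.array([+1] * 10 + [-1] * 200)

clf = SVC(kernel='linear', C=1.0, class_weight={+1: 20.0, -1: 1.0}).fit(X, y)
# class_weight='balanced' instead scales weights by inverse class frequency.
```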

Slide 36/37: References

- C.J.C. Burges. A tutorial on support vector machines for pattern recognition.
- Kristin Bennett. Support Vector Machines: Hype or Hallelujah?
- Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

Software:
- SVM-light, http://svmlight.joachims.org/, free download.
Slide 37/37: Next class

Topics:
- Multi-class SVM
- Semi-supervised clustering

Readings:
- In Defense of One-Vs-All Classification
- Constrained K-means Clustering with Background Knowledge
- Semi-supervised Clustering by Seeding
- Distance metric learning, with application to clustering with side-information