
    SVM by Sequential Minimal Optimization (SMO)

    Algorithm by John Platt

    Lecture by David Page, CS 760: Machine Learning


    Quick Review

    The inner product (specifically, here, the dot product) is defined as

    ⟨x, z⟩ = Σ_i x_i z_i

    Changing x to w and z to x, we may also write ⟨w, x⟩, or w · x, or w^T x.
    In our usage, x is the feature vector and w is the weight (or coefficient) vector.

    A line (or more generally a hyperplane) is written as w^T x = b, or w^T x - b = 0

    A linear separator is written as sign(w^T x - b)
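
    As a concrete illustration (not from the slides), a linear separator of this
    form can be evaluated in a few lines of NumPy; the weight vector, bias, and
    feature vector below are made-up values for demonstration only.

import numpy as np

# Hypothetical weight vector, bias, and feature vector, for illustration only.
w = np.array([2.0, -1.0, 0.5])
b = 1.0
x = np.array([1.0, 0.0, 4.0])

# Linear separator: sign(w^T x - b); returns +1 or -1 (0 exactly on the hyperplane).
prediction = np.sign(w @ x - b)
print(prediction)   # here: sign(4.0 - 1.0) = 1.0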


    Quick Review (Continued)

    Recall that in SVMs we want to maximize the margin subject to correctly
    separating the data; that is, we want to minimize ||w||^2 subject to
    y_i (w^T x_i - b) >= 1 for all i

    Here the y_i are the class values (+1 for positive and -1 for negative), so
    y_i (w^T x_i - b) >= 1 says x_i is correctly labeled with room to spare

    Recall that ||w||^2 is just w^T w
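
    Written out as a display, the standard hard-margin formulation being recalled
    here is (a reconstruction; the factor of 1/2 is the usual convention and may
    not match the original slide exactly):

        \min_{w,\,b} \;\; \tfrac{1}{2}\, w^T w
        \qquad \text{subject to} \qquad y_i\,(w^T x_i - b) \ge 1 \;\; \text{for all } i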


    Recall Full Formulation

    As the last lecture showed us, we can:

    Solve the dual more efficiently (fewer unknowns)

    Add a parameter C to allow some misclassifications

    Replace x_i^T x_j by a more general kernel term K(x_i, x_j)
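
    For reference, a standard way to write the resulting dual (in the minimization
    form that matches the derivation on the later slides; the original slide's
    exact layout may differ) is:

        \min_{\alpha} \;\; \tfrac{1}{2} \sum_i \sum_j y_i y_j\, \alpha_i \alpha_j\, K(x_i, x_j) \;-\; \sum_i \alpha_i

        \text{subject to} \quad 0 \le \alpha_i \le C \;\;\text{for all } i, \qquad \sum_i \alpha_i y_i = 0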


    Intuitive Introduction to SMO

    The perceptron learning algorithm is essentially doing the same thing:
    find a linear separator by adjusting weights on misclassified examples

    Unlike the perceptron, SMO has to maintain the sum over examples of
    example weight times example label

    Therefore, when SMO adjusts the weight of one example, it must also adjust
    the weight of another (see the constraint written out below)
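
    The sum being maintained is the equality constraint from the dual; if α1
    changes, some other α2 must absorb the change (here s = y1 y2, the notation
    used on the later slides):

        \sum_i \alpha_i y_i = 0
        \quad\Longrightarrow\quad
        y_1\,\Delta\alpha_1 + y_2\,\Delta\alpha_2 = 0
        \quad\Longrightarrow\quad
        \Delta\alpha_2 = -s\,\Delta\alpha_1, \qquad s = y_1 y_2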


    An Old Slide: Perceptron as Classifier

    Output for example x is sign(w^T x)

    Candidate hypotheses: real-valued weight vectors w

    Training: update w for each misclassified example x (target class t,
    predicted o) by:

    w_i <- w_i + η (t - o) x_i      where η is the learning-rate parameter

    Let E be the error o - t. The above can be rewritten as:

    w <- w - η E x

    So the final weight vector is a sum of weighted contributions from each
    example. The predictive model can be rewritten as a weighted combination of
    examples rather than features, where the weight on an example is the sum
    (over iterations) of terms ηE.
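
    A minimal NumPy sketch of this perceptron update, assuming ±1 labels and a
    tiny made-up dataset (function and variable names here are illustrative, not
    from the slides):

import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=20):
    """Train a perceptron with the update w <- w - eta * E * x, where E = o - t."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, t in zip(X, y):
            o = np.sign(w @ x_i)          # predicted class (sign(0) counts as a mistake)
            if o != t:                     # update only on misclassified examples
                E = o - t                  # error term E = o - t
                w = w - eta * E * x_i      # equivalently w + eta * (t - o) * x_i
    return w

# Tiny linearly separable example (illustrative data).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron_train(X, y)
print(np.sign(X @ w))                      # should reproduce y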


    Corresponding Use in Prediction

    To use the perceptron, the prediction is w^T x - b. We may treat -b as one
    more coefficient in w (and omit it here), and may take the sign of this value.

    In the alternative view of the last slide, the prediction can be written as

    Σ_j α_j (x_j^T x)

    or, revising the weight appropriately,

    Σ_j α_j y_j (x_j^T x)

    Prediction in an SVM is:

    Σ_j α_j y_j (x_j^T x) - b
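
    A sketch of this kernel-expansion prediction in NumPy, assuming a linear
    kernel and illustrative arrays (the names below are not from the slides):

import numpy as np

def svm_predict(x, X_train, y_train, alpha, b, kernel=np.dot):
    """SVM output u = sum_j alpha_j * y_j * K(x_j, x) - b; the class is sign(u)."""
    u = sum(a_j * y_j * kernel(x_j, x)
            for a_j, y_j, x_j in zip(alpha, y_train, X_train)) - b
    return np.sign(u), u

# Illustrative values only.
X_train = np.array([[1.0, 2.0], [-1.0, -1.0]])
y_train = np.array([1.0, -1.0])
alpha   = np.array([0.5, 0.5])
b       = 0.0
print(svm_predict(np.array([2.0, 2.0]), X_train, y_train, alpha, b))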


    From Perceptron Rule to SMO Rule

    Recall that the SVM optimization problem has the added requirement that:

    Σ_i α_i y_i = 0

    Therefore, if we increase one α by an amount η, in either direction, then we
    have to change another α by an equal amount in the opposite direction
    (relative to class value). We can accomplish this by changing

    w <- w - η E x,   also viewed as:   α1 <- α1 - y1 η E

    to:   α1 <- α1 - y1 η (E1 - E2)

    or equivalently:   α1 <- α1 + y1 η (E2 - E1)

    We then also adjust α2 so that y1 α1 + y2 α2 stays the same
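
    Written out together (with s = y1 y2, matching the later slides), the paired
    step that keeps y1 α1 + y2 α2 fixed is:

        \alpha_1 \leftarrow \alpha_1 + y_1\,\eta\,(E_2 - E_1), \qquad
        \alpha_2 \leftarrow \alpha_2 + y_2\,\eta\,(E_1 - E_2)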


    First of Two Other Changes

    Instead of η = 1/20 or some similar constant, we set

    η = 1 / (x1^T x1 + x2^T x2 - 2 x1^T x2)

    or, more generally,

    η = 1 / (K(x1, x1) + K(x2, x2) - 2 K(x1, x2))

    For a given difference in error between two examples, the step size is
    larger when the two examples are more similar. When x1 and x2 are similar,
    K(x1, x2) is larger (for typical kernels, including the dot product). Hence
    the denominator is smaller and the step size is larger.


    Second of Two Other Changes

    Recall that our formulation had an additional constraint:

    0 <= α_i <= C for all i

    Therefore, we clip any change we make to any α to respect this constraint

    This yields the following SMO algorithm. (So far we have motivated it
    intuitively. Afterward we will show it really minimizes our objective.)


    Updating Two αs: One SMO Step

    Given examples e1 and e2, set

    α2_new = α2 + y2 (E1 - E2) / η

    where:

    η = K(x1, x1) + K(x2, x2) - 2 K(x1, x2)

    Clip this value in the natural way: if y1 = y2 then:

    L = max(0, α1 + α2 - C),   H = min(C, α1 + α2)

    otherwise:

    L = max(0, α2 - α1),   H = min(C, C + α2 - α1)

    Set

    α1_new = α1 + s (α2 - α2_new)   (using the clipped α2_new)

    where s = y1 y2
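
    A compact NumPy sketch of this single two-α step under the formulas above
    (function and variable names are illustrative; the errors E_i = u_i - y_i
    are assumed to be supplied by the caller, the threshold update on the later
    slide is omitted here, and the degenerate case η <= 0 is simply skipped):

import numpy as np

def smo_step(i1, i2, alpha, y, K, E, C):
    """One SMO step: jointly update alpha[i1] and alpha[i2] in place.

    K is a precomputed kernel matrix, E the current error cache (E_i = u_i - y_i).
    Returns True if the alphas changed."""
    if i1 == i2:
        return False
    s = y[i1] * y[i2]
    # Feasible range [L, H] for the new alpha[i2].
    if y[i1] == y[i2]:
        L, H = max(0.0, alpha[i1] + alpha[i2] - C), min(C, alpha[i1] + alpha[i2])
    else:
        L, H = max(0.0, alpha[i2] - alpha[i1]), min(C, C + alpha[i2] - alpha[i1])
    if L == H:
        return False
    eta = K[i1, i1] + K[i2, i2] - 2.0 * K[i1, i2]
    if eta <= 0:                       # unusual case; a full implementation handles it separately
        return False
    a2_new = alpha[i2] + y[i2] * (E[i1] - E[i2]) / eta
    a2_new = np.clip(a2_new, L, H)     # clip to respect 0 <= alpha <= C and the equality constraint
    a1_new = alpha[i1] + s * (alpha[i2] - a2_new)
    alpha[i1], alpha[i2] = a1_new, a2_new
    return True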


    Karush-Kuhn-Tucker (KKT) Conditions

    It is necessary and sufficient for a solution to our objective that all αs
    satisfy the following:

    An α is 0 iff that example is correctly labeled with room to spare

    An α is C iff that example is incorrectly labeled or inside the margin

    An α properly between 0 and C (unbound) iff that example is barely
    correctly labeled (is a support vector)

    We just check KKT to within some small epsilon, typically 10^-3
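
    In terms of the margin value y_i u_i (where u_i is the SVM output for x_i),
    these three cases can be checked numerically; a small sketch, reusing the
    illustrative names from the step above:

def violates_kkt(alpha_i, y_i, u_i, C, eps=1e-3):
    """Return True if example i violates the KKT conditions by more than eps.

    alpha_i = 0       should imply y_i * u_i >= 1   (correct with room to spare)
    0 < alpha_i < C   should imply y_i * u_i == 1   (barely correct: support vector)
    alpha_i = C       should imply y_i * u_i <= 1   (inside margin or misclassified)
    """
    r = y_i * u_i - 1.0
    return (alpha_i < C and r < -eps) or (alpha_i > 0 and r > eps)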


    SMO Algorithm

    Input: C, kernel, kernel parameters, epsilon

    Initialize b and all αs to 0

    Repeat until KKT is satisfied (to within epsilon):

    Find an example e1 that violates KKT (prefer unbound examples here, and
    choose randomly among those)

    Choose a second example e2. Prefer one that maximizes the step size (in
    practice, it is faster to just maximize |E1 - E2|). If that fails to result
    in a change, randomly choose an unbound example. If that fails, randomly
    choose any example. If that fails, re-choose e1.

    Update α1 and α2 in one step

    Compute the new threshold b
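
    A simplified outer loop matching this description, reusing the hypothetical
    smo_step and violates_kkt sketches above (the name smo_train and the
    max_passes cutoff are inventions for this sketch, and Platt's full
    second-choice fallbacks, error-cache bookkeeping, and b1/b2 threshold rule
    are compressed):

import numpy as np

def smo_train(X, y, C, kernel, eps=1e-3, max_passes=100):
    """Simplified SMO outer loop (illustrative; not Platt's full heuristics)."""
    n = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha, b = np.zeros(n), 0.0
    for _ in range(max_passes):
        u = K @ (alpha * y) - b                      # current outputs u_i
        E = u - y                                    # error cache E_i = u_i - y_i
        violators = [i for i in range(n) if violates_kkt(alpha[i], y[i], u[i], C, eps)]
        if not violators:
            break                                    # KKT satisfied everywhere: done
        i1 = int(np.random.choice(violators))
        i2 = int(np.argmax(np.abs(E - E[i1])))       # second-choice heuristic: maximize |E1 - E2|
        if smo_step(i1, i2, alpha, y, K, E, C):
            # Recompute the threshold b; an average over unbound support vectors
            # is used here as a simple stand-in for the b1/b2 rule on the next slide.
            sv = (alpha > 0) & (alpha < C)
            if sv.any():
                b = float(np.mean(K[sv] @ (alpha * y) - y[sv]))
    return alpha, b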


    At End of Each Step, Need to Update b

    Compute b1 and b2 as shown below

    Choose b = (b1 + b2) / 2

    When α1 and α2 are not at bound, it is in fact guaranteed that b = b1 = b2
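
    In Platt's paper these threshold updates are computed as follows (with b the
    previous threshold and E_i = u_i - y_i; shown here for reference since the
    slide itself only names b1 and b2):

        b_1 = E_1 + y_1(\alpha_1^{new} - \alpha_1)K(x_1, x_1) + y_2(\alpha_2^{new} - \alpha_2)K(x_1, x_2) + b

        b_2 = E_2 + y_1(\alpha_1^{new} - \alpha_1)K(x_1, x_2) + y_2(\alpha_2^{new} - \alpha_2)K(x_2, x_2) + b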


    Derivation of SMO Update Rule

    Recall the objective function we want to optimize (the dual given earlier)

    We can write the objective as a function of α1 and α2 alone, grouping the
    terms that pair α1 with everybody else, the terms that pair α2 with
    everybody else, and the terms involving only the other αs (see the
    decomposition below)

    Starred values (α*) are from the end of the previous round
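
    In Platt's notation, the decomposition being described here is the following
    (a reconstruction from Platt's paper; s = y1 y2, and W_const collects all
    terms involving only the other αs):

        W(\alpha_1, \alpha_2) = \tfrac{1}{2} K_{11}\alpha_1^2 + \tfrac{1}{2} K_{22}\alpha_2^2
            + s K_{12}\alpha_1\alpha_2
            + y_1 \alpha_1 v_1 + y_2 \alpha_2 v_2 - \alpha_1 - \alpha_2 + W_{const}

        \text{where} \quad v_i = \sum_{j=3}^{n} y_j \alpha_j^* K_{ij}
            = u_i + b^* - y_1 \alpha_1^* K_{1i} - y_2 \alpha_2^* K_{2i}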


    Expressing Objective in Terms of One Lagrange Multiplier Only

    Because of the constraint Σ_i α_i y_i = 0, each update has to obey:

    α1 + s α2 = w

    where w is a constant and s is y1 y2

    This lets us express the objective function in terms of α2 alone


    Find Minimum of this Objective

    Find the extremum by taking the first derivative with respect to α2

    If the second derivative is positive (the usual case), then the minimum is
    where (by simple algebra, remembering that ss = 1):

    α2_new (K11 + K22 - 2 K12) = s (K11 - K12) w + y2 (v1 - v2) + 1 - s


    Finishing Up

    Recall that w = α1* + s α2* and v_i = u_i + b* - y1 α1* K_1i - y2 α2* K_2i

    So we can rewrite the condition from the previous slide as:

    α2_new (K11 + K22 - 2 K12)
        = s (K11 - K12)(s α2* + α1*)
          + y2 ( u1 - u2 + y1 α1* (K12 - K11) + y2 α2* (K22 - K21) )
          + y2 y2 - y1 y2

    We can rewrite this as the update rule on the next slide


    Finishing Up

    Based on the values of w and v, we rewrote the minimum condition as the long
    expression on the previous slide

    From this, we get our update rule:

    α2_new = α2* + y2 (E1 - E2) / η

    where:

    E1 = u1 - y1
    E2 = u2 - y2
    η = K11 + K22 - 2 K12