
Transcript

* Instructor: Saeed Shiry

*SVM and Kernel Methods. The SVM was introduced in 1992 by Vapnik, building on statistical learning theory. On handwritten digit recognition, SVMs reached an error rate of about 1.1%.

* A good classifier must generalize to unseen data; a decision boundary fit too closely to the training points leads to overfitting.

* The SVM chooses the separating hyperplane with the maximum margin: the boundary that stays as far as possible from the nearest points of both classes.

*"Support Vector Machines are a system for efficiently training linear learning machines in kernel-induced feature spaces, while respecting the insights of generalisation theory and exploiting optimisation theory." (Cristianini & Shawe-Taylor, 2000)

* We begin with the simplest case: linear discrimination of linearly separable data.

*Intuitions [figure: two classes of points, X and O, in the plane, with a candidate separating line; repeated over four slides with different candidate lines]

*A Good Separator [figure]

*Noise in the Observations [figure]

*Ruling Out Some Separators [figure]

*Lots of Noise [figure]

*Maximizing the Margin [figure]

* A linear separator in an n-dimensional input space is a hyperplane.

* Many different hyperplanes can separate the training data perfectly; the question is which of them to prefer.

* The SVM answer: choose the separating hyperplane with the maximum margin, i.e. the largest distance to the nearest training points of either class.

* Only the training points closest to the boundary matter; the remaining points could be removed without changing the solution.

* Maximizing the margin reduces the VC dimension of the classifier, and statistical learning theory ties a smaller VC dimension to better generalization bounds.

* [figure: points of Class 1 and Class -1 in the (x1, x2) plane; the points on the margin are marked SV (support vectors)]

* Strengths of the SVM: it handles high dimensionality without overfitting, and training reduces to a convex optimization problem, so the optimum found is global.

* Inputs x ∈ Rⁿ, labels y ∈ {-1, 1}. The linear classifier is f(x) = sign(⟨w, x⟩ + b), with weight vector w ∈ Rⁿ and bias b. The decision boundary is the hyperplane ⟨w, x⟩ + b = 0, i.e. w1x1 + w2x2 + … + wnxn + b = 0. Training must determine w and b.
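The decision rule f(x) = sign(⟨w, x⟩ + b) can be sketched in a few lines of Python (a toy illustration; the weight vector and bias below are made-up values, not a trained model):

```python
# Linear classifier f(x) = sign(<w, x> + b).
def linear_classify(w, x, b):
    s = sum(wi * xi for wi, xi in zip(w, x)) + b  # <w, x> + b
    return 1 if s >= 0 else -1

# Hypothetical 2-D hyperplane: x1 + x2 - 3 = 0
w, b = [1.0, 1.0], -3.0
print(linear_classify(w, [2.0, 2.0], b))   # point above the plane -> 1
print(linear_classify(w, [0.0, 1.0], b))   # point below the plane -> -1
```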

*Linear SVM Mathematically. Let the training set {(xi, yi)}, i = 1..n, xi ∈ Rᵈ, yi ∈ {-1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (xi, yi):

wᵀxi + b ≤ -ρ/2 if yi = -1
wᵀxi + b ≥ ρ/2 if yi = 1
i.e. yi(wᵀxi + b) ≥ ρ/2

For every support vector xs the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, the distance between each xs and the hyperplane is r = ys(wᵀxs + b)/||w|| = 1/||w||. The margin can then be expressed through the (rescaled) w and b as ρ = 2r = 2/||w||.

* [figure: the hyperplane f(x) = 0 in the (x1, x2) plane, with f(x) > 0 on one side and f(x) < 0 on the other]

* Plus-plane = { x : w . x + b = +1 }; Minus-plane = { x : w . x + b = -1 }. Classify as +1 if w . x + b ≥ 1, as -1 if w . x + b ≤ -1 (points with -1 < w . x + b < 1 fall inside the margin).

* Claim: the vector w is perpendicular to the plus-plane { x : w . x + b = +1 } and the minus-plane { x : w . x + b = -1 }. Let x⁻ be any point on the minus-plane, and let x⁺ be the point on the plus-plane closest to x⁻.

* The segment from x⁻ to x⁺ is perpendicular to both planes, so it is parallel to w. Hence x⁺ = x⁻ + λw for some value of λ.

* We now have: w . x⁺ + b = +1, w . x⁻ + b = -1, x⁺ = x⁻ + λw, and |x⁺ - x⁻| = M. From these we can express the margin width M in terms of w and b.

*

w . x⁺ + b = +1
w . x⁻ + b = -1
x⁺ = x⁻ + λw
|x⁺ - x⁻| = M

Substituting: w . (x⁻ + λw) + b = +1 ⟹ w . x⁻ + λ(w . w) + b = +1 ⟹ -1 + λ(w . w) = +1 ⟹ λ = 2/(w . w)

Then M = |x⁺ - x⁻| = |λw| = λ√(w . w) = 2/√(w . w).
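Combining |x⁺ - x⁻| = M with λ = 2/(w . w) gives the margin width M = 2/√(w . w), which is easy to check numerically (a small sketch; the vector w and the point on the minus-plane below are arbitrary examples with b = 0):

```python
import math

def margin_width(w):
    # M = 2 / sqrt(w . w): distance between the planes w.x + b = +1 and w.x + b = -1
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w = [3.0, 4.0]                    # ||w|| = 5
print(margin_width(w))            # 0.4

# Cross-check via lambda = 2 / (w . w): x+ = x- + lambda * w
lam = 2.0 / sum(wi * wi for wi in w)          # 2/25
x_minus = [-3.0 / 25.0, -4.0 / 25.0]          # lies on the minus-plane (b = 0)
x_plus = [xm + lam * wi for xm, wi in zip(x_minus, w)]
dist = math.sqrt(sum((p - m) ** 2 for p, m in zip(x_plus, x_minus)))
print(dist)                       # close to 0.4
```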

* Maximizing the margin M = 2/√(w . w) is therefore equivalent to minimizing w . w, subject to classifying every point with margin at least 1:

⟨w, xi⟩ + b ≥ +1 for yi = +1
⟨w, xi⟩ + b ≤ -1 for yi = -1
i.e. yi(⟨w, xi⟩ + b) ≥ 1 for all i

* The SVM training problem: given (xi, yi), i = 1, 2, …, N, with yi ∈ {+1, -1},

Minimise ||w||²
Subject to: yi(⟨w, xi⟩ + b) ≥ 1 for all i
Note that ||w||² = wᵀw

This is a quadratic programming (QP) problem: a quadratic objective under linear constraints, which standard solvers handle efficiently.
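A tiny sketch of what this QP optimizes (toy data and hand-picked (w, b), purely for illustration): a candidate (w, b) is feasible when every margin constraint holds, and among feasible candidates we want the smallest ||w||².

```python
def feasible(w, b, X, y):
    # Check the QP constraints y_i (<w, x_i> + b) >= 1 for all i.
    return all(yi * (sum(wi * xij for wi, xij in zip(w, xi)) + b) >= 1
               for xi, yi in zip(X, y))

def objective(w):
    return sum(wi * wi for wi in w)        # ||w||^2

X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
y = [1, 1, -1, -1]

print(feasible([0.25, 0.25], 0.0, X, y))  # True: all margin constraints hold
print(feasible([0.1, 0.1], 0.0, X, y))    # False: margin too small
print(objective([0.25, 0.25]))            # 0.125
```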

*Quadratic Programming

*Recap of Constrained Optimization. Suppose we want to minimize f(x) subject to g(x) = 0. A necessary condition for x0 to be a solution:

∇x ( f(x) + a·g(x) ) |x=x0 = 0

a: the Lagrange multiplier. For multiple constraints gi(x) = 0, i = 1, …, m, we need a Lagrange multiplier ai for each of the constraints:

∇x ( f(x) + Σi ai gi(x) ) |x=x0 = 0

*Recap of Constrained Optimization. The case of an inequality constraint gi(x) ≤ 0 is similar, except that the Lagrange multiplier ai should be non-negative. If x0 is a solution to the constrained optimization problem,

there must exist ai ≥ 0 for i = 1, …, m such that x0 satisfies

∇x ( f(x) + Σi ai gi(x) ) |x=x0 = 0

The function f(x) + Σi ai gi(x) is known as the Lagrangian; we want to set its gradient to 0.

* Construct and minimise the Lagrangian

L(w, b, a) = ½ wᵀw - Σi ai [ yi(⟨w, xi⟩ + b) - 1 ], with ai ≥ 0

Take derivatives w.r.t. w and b, equate them to 0:

∂L/∂w = 0 ⟹ w = Σi ai yi xi
∂L/∂b = 0 ⟹ Σi ai yi = 0

The Lagrange multipliers ai are called dual variables; each training point has an associated dual variable.

The parameters are expressed as a linear combination of training points, and only the SVs will have non-zero ai.

* [figure: points of Class 1 and Class 2 with their dual variables; most are zero (a2 = a3 = a4 = a5 = a7 = a9 = a10 = 0) and only the support vectors have non-zero values: a1 = 0.8, a6 = 1.4, a8 = 0.6]

*The Dual Problem. If we substitute w = Σi ai yi xi into the Lagrangian L, we have

L = Σi ai - ½ Σi Σj ai aj yi yj ⟨xi, xj⟩ - b Σi ai yi

Note that Σi ai yi = 0, so the last term vanishes.

This is a function of the ai only.

*The Dual Problem. Properties of the ai: the constraints ai ≥ 0 come from introducing the Lagrange multipliers, and Σi ai yi = 0 is the result of differentiating the original Lagrangian w.r.t. b. The new objective function is in terms of the ai only; it is known as the dual problem (if we know w, we know all ai; if we know all ai, we know w). The original problem is known as the primal problem. The objective function of the dual problem needs to be maximized! The dual problem is therefore:

*The Dual Problem

max over a: Σi ai - ½ Σi Σj ai aj yi yj ⟨xi, xj⟩
subject to: ai ≥ 0 for all i, and Σi ai yi = 0

This is a quadratic programming (QP) problem; a global maximum over the ai can always be found.

w can be recovered by w = Σi ai yi xi.

* So, plug w = Σi ai yi xi back into the Lagrangian to obtain the dual formulation. The resulting dual, solved with a QP solver:

max over a: Σi ai - ½ Σi Σj ai aj yi yj ⟨xi, xj⟩, subject to ai ≥ 0 and Σi ai yi = 0

The bias b does not appear in the dual, so it is determined separately from the initial constraints.

Data enters only in the form of dot products!
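The dual objective can be evaluated directly from dot products (a sketch with made-up ai values on a two-point toy set, not a solved QP):

```python
def dual_objective(a, y, X):
    # W(a) = sum_i a_i - 1/2 sum_i sum_j a_i a_j y_i y_j <x_i, x_j>
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    n = len(a)
    quad = sum(a[i] * a[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(a) - 0.5 * quad

X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]
a = [0.5, 0.5]                  # satisfies sum_i a_i y_i = 0
print(dual_objective(a, y, X))  # 0.5
```

For this symmetric pair, a = (0.5, 0.5) gives w = Σ ai yi xi = (1, 0), and the dual value 0.5 equals Σ ai - ½ wᵀw, as the derivation predicts.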

* Having solved the quadratic program for (a*, b*), the SVM classifies a new point x as sign[f(x, a*, b*)], where

f(x, a*, b*) = Σi ai* yi ⟨xi, x⟩ + b*

Data enters only in the form of dot products!

* The solution of the SVM, i.e. of the quadratic programming problem with linear inequality constraints, has the nice property that the data enters only in the form of dot products! Dot product (notation and memory refreshing): given x = (x1, x2, …, xn) and y = (y1, y2, …, yn), the dot product of x and y is x·y = x1y1 + x2y2 + … + xnyn. This is nice because it allows us to make SVMs non-linear without complicating the algorithm.

*The Quadratic Programming Problem. Many approaches have been proposed: Loqo, CPLEX, etc. Most are interior-point methods: start with an initial solution that may violate the constraints, then improve it by optimizing the objective function and/or reducing the amount of constraint violation. For SVMs, sequential minimal optimization (SMO) seems to be the most popular: a QP with two variables is trivial to solve, so each iteration of SMO picks a pair (ai, aj) and solves the QP in those two variables, repeating until convergence. In practice, we can just regard the QP solver as a black box without bothering how it works.

* So far we have assumed the training data is linearly separable. What if it is not, or if some points are noisy?

* Allow slack! A point xi may violate the margin, at a penalized distance from where wᵀx + b would place it.

* With slack variables ξi, i = 1, 2, …, N, the constraints yi(⟨w, xi⟩ + b) ≥ 1 are relaxed to:

yi(⟨w, xi⟩ + b) ≥ 1 - ξi, with ξi ≥ 0

A point with 0 < ξi ≤ 1 lies inside the margin but is still correctly classified; ξi > 1 means a misclassification.

* The objective then becomes:

minimize ½ wᵀw + C Σi ξi

where C > 0 is a constant that controls the trade-off between a wide margin and small slack.

* The dual of the soft-margin problem has the same form as before, except that C now upper-bounds the ai:

find the ai that maximize Σi ai - ½ Σi Σj ai aj yi yj ⟨xi, xj⟩, subject to 0 ≤ ai ≤ C and Σi ai yi = 0

*Soft Margin Hyperplane. If we minimize Σi ξi, each ξi can be computed by

ξi = max(0, 1 - yi(⟨w, xi⟩ + b))

The ξi are slack variables in the optimization. Note that ξi = 0 if there is no error for xi, and Σi ξi is an upper bound on the number of training errors. We want to minimize

½ wᵀw + C Σi ξi

with C the tradeoff parameter between error and margin. The optimization problem becomes:

minimize ½ wᵀw + C Σi ξi, subject to yi(⟨w, xi⟩ + b) ≥ 1 - ξi and ξi ≥ 0
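Since ξi = max(0, 1 - yi(⟨w, xi⟩ + b)), the soft-margin objective equals the hinge-loss form ½wᵀw + C Σi max(0, 1 - yi(⟨w, xi⟩ + b)), which can be minimized by plain (sub)gradient descent without a QP solver. A rough sketch on toy data (the learning rate, iteration count, and the data are arbitrary choices, not part of the original slides):

```python
def train_soft_margin(X, y, C=1.0, lr=0.01, iters=500):
    # Full-batch subgradient descent on (1/2)||w||^2 + C * sum_i hinge_i
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        gw, gb = list(w), 0.0                 # gradient of (1/2)||w||^2
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                for j in range(d):            # hinge subgradient: -y_i x_i
                    gw[j] -= C * yi * xi[j]
                gb -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
     [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_soft_margin(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1 for xi in X]
print(preds == y)   # True: the toy set ends up correctly separated
```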

*The Optimization Problem. The dual of this new constrained optimization problem is

max over a: Σi ai - ½ Σi Σj ai aj yi yj ⟨xi, xj⟩, subject to 0 ≤ ai ≤ C and Σi ai yi = 0

w is recovered as w = Σi ai yi xi.

This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on the ai. Once again, a QP solver can be used to find the ai.


* What if the decision boundary is not linear? Map the data into a higher-dimensional feature space where it becomes linearly separable; the kernel trick makes this affordable.

* [figure: a map φ(.) from the input space to the feature space] Note: the feature space is of higher dimension than the input space in practice.

* Computing explicitly in a very high-dimensional feature space would run into the curse of dimensionality.

* We will introduce kernels, which: solve the computational problem of working with many dimensions; can even make it possible to use infinite dimensions efficiently in time and space; and bring other advantages, both practical and conceptual.

*Transform x → φ(x). The linear algorithm depends only on ⟨x, xi⟩, hence the transformed algorithm depends only on ⟨φ(x), φ(xi)⟩. Use a kernel function K(xi, xj) such that K(xi, xj) = ⟨φ(xi), φ(xj)⟩.

*An Example for φ(.) and K(.,.). Suppose φ(.) is given as follows:

φ( (x1, x2) ) = (1, √2·x1, √2·x2, x1², x2², √2·x1x2)

An inner product in the feature space is

⟨φ((x1, x2)), φ((y1, y2))⟩ = (1 + x1y1 + x2y2)²

So, if we define the kernel function as K(x, y) = (1 + x1y1 + x2y2)², there is no need to carry out φ(.) explicitly. This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick.
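The identity ⟨φ(x), φ(y)⟩ = (1 + x1y1 + x2y2)² is easy to verify numerically (a small sketch; the two test points are arbitrary):

```python
import math

def phi(x):
    # Explicit degree-2 feature map for 2-D input
    x1, x2 = x
    return [1.0, math.sqrt(2) * x1, math.sqrt(2) * x2,
            x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2]

def K(x, y):
    # Polynomial kernel of degree 2
    return (1.0 + x[0] * y[0] + x[1] * y[1]) ** 2

x, y = [1.0, 2.0], [3.0, -1.0]
lhs = sum(a * b for a, b in zip(phi(x), phi(y)))  # <phi(x), phi(y)>
print(abs(lhs - K(x, y)) < 1e-9)                  # True: same value, no explicit phi needed
```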


*Modification Due to Kernel Function. Change all inner products to kernel functions. For training,

Original: max over a: Σi ai - ½ Σi Σj ai aj yi yj ⟨xi, xj⟩
With kernel function: max over a: Σi ai - ½ Σi Σj ai aj yi yj K(xi, xj)

*Modification Due to Kernel Function. For testing, the new data z is classified as class 1 if f ≥ 0, and as class 2 if f < 0, where

Original: f = Σsv ai yi ⟨xi, z⟩ + b
With kernel function: f = Σsv ai yi K(xi, z) + b
*Modularity. Any kernel-based learning algorithm is composed of two modules: a general-purpose learning machine and a problem-specific kernel function. Any kernel-based algorithm can be fitted with any kernel. Kernels themselves can be constructed in a modular way, which is great for software engineering (and for analysis).

* Closure properties: if K and K′ are kernels, then K + K′ is a kernel; cK is a kernel for c > 0; aK + bK′ is a kernel for a, b > 0; and so on. New kernels can thus be built from existing ones.
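These closure rules can be spot-checked on Gram matrices: a kernel's Gram matrix K satisfies vᵀKv ≥ 0 for every v, and that property survives the sums and positive scalings above. A minimal sketch (the points, coefficients, and test vectors are arbitrary choices; this checks a few v, not a full proof):

```python
def gram(kernel, pts):
    # Gram matrix G[i][j] = kernel(p_i, p_j)
    return [[kernel(p, q) for q in pts] for p in pts]

def quad_form(M, v):
    # v^T M v
    return sum(v[i] * M[i][j] * v[j] for i in range(len(v)) for j in range(len(v)))

lin = lambda x, y: x[0] * y[0] + x[1] * y[1]              # linear kernel
poly = lambda x, y: (1.0 + lin(x, y)) ** 2                # polynomial kernel, degree 2
combo = lambda x, y: 2.0 * lin(x, y) + 3.0 * poly(x, y)   # aK + bK' with a, b > 0

pts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
G = gram(combo, pts)
vs = [[1.0, -2.0, 1.0], [0.5, 0.5, -1.0], [1.0, 1.0, 1.0]]
print(all(quad_form(G, v) >= -1e-9 for v in vs))  # True: the combination is still PSD
```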

*Example. Suppose we have five 1-D data points: x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, i.e. y1=1, y2=1, y3=-1, y4=-1, y5=1. We use the polynomial kernel of degree 2, K(x, y) = (xy + 1)², and set C to 100. We first find the ai (i = 1, …, 5) by solving the dual with this kernel.

*Example

By using a QP solver, we get a1=0, a2=2.5, a3=0, a4=7.333, a5=4.833. Note that the constraints are indeed satisfied. The support vectors are {x2=2, x4=5, x5=6}. The discriminant function is

f(z) = 2.5·(2z+1)² - 7.333·(5z+1)² + 4.833·(6z+1)² + b = 0.6667·z² - 5.333·z + b

b is recovered by solving f(2)=1, or f(5)=-1, or f(6)=1, as x2 and x5 lie on the margin plane f(x) = +1 and x4 lies on f(x) = -1; all three give b = 9.
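The numbers in this example can be checked directly. Using the exact fractions a2 = 5/2, a4 = 22/3, a5 = 29/6 (the rounded 7.333 and 4.833 above) and b = 9:

```python
xs = [1.0, 2.0, 4.0, 5.0, 6.0]
ys = [1, 1, -1, -1, 1]
a  = [0.0, 2.5, 0.0, 22.0 / 3.0, 29.0 / 6.0]   # dual solution, exact fractions
b  = 9.0

def K(x, y):
    # Polynomial kernel of degree 2
    return (x * y + 1.0) ** 2

def f(z):
    # Discriminant f(z) = sum_i a_i y_i K(x_i, z) + b
    return sum(ai * yi * K(xi, z) for ai, yi, xi in zip(a, ys, xs)) + b

print(abs(sum(ai * yi for ai, yi in zip(a, ys))) < 1e-9)     # True: sum_i a_i y_i = 0
print(round(f(2.0), 6), round(f(5.0), 6), round(f(6.0), 6))  # 1.0 -1.0 1.0
```

So the support vectors sit exactly on the margin planes, confirming the recovered b.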