
Page 1: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Support Vector Machines

Lecturer: Yishay Mansour

Itay Kirshenbaum

Page 2: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Lecture Overview

In this lecture we present in detail one of the most theoretically well-motivated and practically effective classification algorithms in modern machine learning: Support Vector Machines (SVMs).

Page 3: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Lecture Overview – Cont.

We begin by building the intuition behind SVMs,

then define the SVM as an optimization problem and discuss how to solve it efficiently.

We conclude with an analysis of the error rate of SVMs using two techniques: Leave One Out and VC-dimension.

Page 4: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Introduction

The Support Vector Machine (SVM) is a supervised learning algorithm

used to learn a hyperplane that solves the binary classification problem,

which is among the most extensively studied problems in machine learning.

Page 5: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Binary Classification Problem

Input space: X = \mathbb{R}^n

Output space: Y = \{-1, +1\}

Training data: S = \{(x_1, y_1), \ldots, (x_m, y_m)\}, drawn i.i.d. from distribution D

Goal: select a hypothesis h \in H that best predicts other points drawn i.i.d. from D

Page 6: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Binary Classification – Cont.

Consider the problem of predicting the success of a new drug based on a patient's height and weight.

m ill people are selected and treated.

This generates m 2d vectors (height and weight).

Each point is assigned +1 to indicate successful treatment, or -1 otherwise.

These labeled points can be used as training data (a small synthetic sketch follows).
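To make the setup concrete, here is a minimal NumPy sketch (not from the original slides) that builds such a labeled training set; the sizes, distributions, and labels are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20                                          # number of treated patients

# illustrative (height, weight) measurements for m patients
X = np.column_stack([rng.normal(170, 10, m),    # height in cm
                     rng.normal(75, 12, m)])    # weight in kg

# illustrative labels: +1 = successful treatment, -1 = otherwise
y = np.where(rng.random(m) < 0.5, 1, -1)

S = list(zip(X, y))                             # training set S = {(x_i, y_i)}
```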

Page 7: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Binary classification – Cont.

There are infinitely many ways to classify the data. Occam's razor: simple classification rules provide better results.

We use a linear classifier, i.e., a hyperplane.

Our class of linear classifiers:

H = \{ x \mapsto \mathrm{sign}(w \cdot x + b) \mid w \in \mathbb{R}^n,\ b \in \mathbb{R} \}

A hypothesis h \in H maps x \in X to 1 if w \cdot x + b \ge 0 (and to -1 otherwise).
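As an illustration (not part of the original slides), here is a minimal NumPy sketch of evaluating such a linear classifier h(x) = sign(w·x + b):

```python
import numpy as np

def predict(w, b, X):
    """h(x) = sign(w·x + b), applied to each row of X; returns +1 / -1."""
    return np.where(X @ w + b >= 0, 1, -1)

w, b = np.array([1.0, -2.0]), 0.5               # an arbitrary hyperplane in R^2
X = np.array([[3.0, 1.0], [0.0, 2.0]])
print(predict(w, b, X))                         # -> [ 1 -1]
```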

Page 8: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Choosing a Good Hyperplane

Intuition: consider two cases of positive classification,

w \cdot x + b = 0.1 and w \cdot x + b = 100.

We are more confident in the decision made by the latter than in the former.

Choose a hyperplane with maximal margin

Page 9: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Good Hyperplane – Cont.

Definition: functional margin of the training set S.

For a linear classifier (w, b), the functional margin of example i is \hat{\gamma}_i = y_i (w \cdot x_i + b), where \mathrm{sign}(w \cdot x_i + b) is the classification of x_i according to (w, b).

The functional margin of S is \hat{\gamma} = \min_{i = 1, \ldots, m} \hat{\gamma}_i.
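A quick sketch (illustrative, not from the slides) of computing the functional margin of a training set:

```python
import numpy as np

def functional_margin(w, b, X, y):
    """gamma_hat = min_i y_i (w·x_i + b) over the training set."""
    return np.min(y * (X @ w + b))
```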

Page 10: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Maximal Margin

(w, b) can be scaled to inflate the functional margin: sign(w \cdot x + b) = sign(5w \cdot x + 5b) for all x, yet the functional margin of (5w, 5b) is 5 times greater than that of (w, b).

We can cope by adding an additional constraint: \|w\| = 1.

Page 11: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Maximal Margin – Cont.

Geometric Margin Consider the geometric distance

between the hyperplane and the closest points

Page 12: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Geometric Margin

Definition: geometric margin of the training set S.

The geometric margin of example i is \gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right), and the geometric margin of S is \gamma = \min_{i = 1, \ldots, m} \gamma_i.

Relation to the functional margin: \gamma_i = \hat{\gamma}_i / \|w\|.

Both are equal when \|w\| = 1.
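Continuing the illustrative sketch above (same numpy import and functional_margin helper), the geometric margin is simply the functional margin divided by \|w\|:

```python
def geometric_margin(w, b, X, y):
    """gamma = min_i y_i ((w/||w||)·x_i + b/||w||) = gamma_hat / ||w||."""
    return functional_margin(w, b, X, y) / np.linalg.norm(w)
```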

Page 13: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

The Algorithm

We saw two definitions of the margin, and the intuition behind seeking a margin-maximizing hyperplane.

Goal: write an optimization program that finds such a hyperplane.

We always look for (w, b) maximizing the margin.

Page 14: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

The Algorithm – Take 1

First try:

\max_{\gamma, w, b} \gamma \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge \gamma,\ i = 1, \ldots, m, \quad \|w\| = 1

Idea: maximize \gamma; for each sample the functional margin is at least \gamma; the functional and geometric margins are the same since \|w\| = 1; so \gamma is the largest possible geometric margin with respect to the training set.

Page 15: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

The Algorithm – Take 2

The first try can't be solved by any off-the-shelf optimization software: the constraint \|w\| = 1 is non-linear, and in fact it's even non-convex.

How can we discard the constraint? Use the geometric margin!

\max_{\hat{\gamma}, w, b} \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge \hat{\gamma},\ i = 1, \ldots, m

Page 16: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

The Algorithm – Take 3

We now have a non-convex objective function, so the problem remains.

Remember: we can scale (w, b) as we wish, so force the functional margin to be \hat{\gamma} = 1.

The objective function becomes \max \frac{1}{\|w\|}, which is the same as \min \frac{1}{2} \|w\|^2.

The factor of 1/2 and the power of 2 do not change the program; they just make things easier.

Page 17: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

The algorithm – Final version

The final program:

\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1,\ i = 1, \ldots, m

The objective is convex (quadratic) and all constraints are linear, so the program can be solved efficiently using standard quadratic programming (QP) software.
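For illustration only, here is a minimal sketch of this primal program solved with SciPy's general-purpose SLSQP solver (the slides assume dedicated QP software; the helper name and toy data below are assumptions, not part of the lecture):

```python
import numpy as np
from scipy.optimize import minimize

def train_svm_primal(X, y):
    """min (1/2)||w||^2  s.t.  y_i (w·x_i + b) >= 1; variable z = (w, b)."""
    m, n = X.shape
    objective = lambda z: 0.5 * z[:n] @ z[:n]
    constraints = [{'type': 'ineq',                      # SLSQP: fun(z) >= 0
                    'fun': lambda z, i=i: y[i] * (X[i] @ z[:n] + z[n]) - 1.0}
                   for i in range(m)]
    res = minimize(objective, np.zeros(n + 1), constraints=constraints,
                   method='SLSQP')
    return res.x[:n], res.x[n]                           # (w*, b*)

# assumed linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_svm_primal(X, y)
print(w, b, np.sign(X @ w + b))
```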

Page 18: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Convex Optimization

We want to solve the optimization problem more efficiently than generic QP

Solution – Use convex optimization techniques

Page 19: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Convex Optimization – Cont.

Definition: a convex function.

f is convex if for all x, y \in X and \lambda \in [0, 1]:
f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)

Theorem

Let f be a differentiable convex function. Then for all x, y \in X:
f(y) \ge f(x) + \nabla f(x) \cdot (y - x)

Page 20: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Convex Optimization Problem

Convex optimization problem

Let f, g_i : X \to \mathbb{R},\ i = 1, \ldots, m, be convex functions.

Find \min_{x \in X} f(x) \quad \text{s.t.} \quad g_i(x) \le 0,\ i = 1, \ldots, m.

We look for a value of x \in X that minimizes f(x) under the constraints g_i(x) \le 0,\ i = 1, \ldots, m.

Page 21: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Lagrange Multipliers

Used to find the maxima or minima of a function subject to constraints.

We use them to solve our optimization problem.

Definition

The Lagrangian of the function f subject to the constraints g_i,\ i = 1, \ldots, m, is
L(x, \alpha) = f(x) + \sum_{i=1}^{m} \alpha_i g_i(x), \quad x \in X,\ \alpha_i \ge 0.
The \alpha_i are called the Lagrange multipliers.

Page 22: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Primal Program

Plan: use the Lagrangian to write a program called the Primal Program. It is equal to f(x) if all the constraints are met; otherwise it is \infty.

Definition – Primal Program

\theta_P(x) = \max_{\alpha \ge 0} L(x, \alpha)

Page 23: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Primal Program – Cont.

The constraints are of the form g_i(x) \le 0.

If they are met, \sum_{i=1}^{m} \alpha_i g_i(x) is maximized when all \alpha_i are 0, and the summation is 0, so \theta_P(x) = f(x).

Otherwise, \sum_{i=1}^{m} \alpha_i g_i(x) is maximized for \alpha_i \to \infty, so \theta_P(x) = \infty.

Page 24: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Primal Program – Cont.

Our convex optimization problem is now:
\min_{x \in X} \theta_P(x) = \min_{x \in X} \max_{\alpha \ge 0} L(x, \alpha)

Define p^* = \min_{x \in X} \theta_P(x) as the value of the primal program.

Page 25: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Dual Program

We define the Dual Program as: \theta_D(\alpha) = \min_{x \in X} L(x, \alpha).

We'll look at \max_{\alpha \ge 0} \theta_D(\alpha) = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha).

Same as our primal program, except that the order of min / max is different.

Define d^* = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha) as the value of our Dual Program.

Page 26: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Dual Program – Cont.

We want to show d^* = p^*: if we find a solution to one problem, we find the solution to the second problem.

Start with d^* \le p^*: "max min" is always less than or equal to "min max":
d^* = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha) \le \min_{x \in X} \max_{\alpha \ge 0} L(x, \alpha) = p^*

Now on to p^* \le d^*.

Page 27: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Dual Program – Cont.

Claim

If there exist x^* and \alpha^* \ge 0 which are a saddle point, i.e., for every feasible x and every \alpha \ge 0:
L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*),
then p^* = d^* = L(x^*, \alpha^*) and x^* is a solution to the primal problem.

Proof

p^* = \inf_{x} \sup_{\alpha \ge 0} L(x, \alpha) \le \sup_{\alpha \ge 0} L(x^*, \alpha) = L(x^*, \alpha^*) = \inf_{x} L(x, \alpha^*) \le \sup_{\alpha \ge 0} \inf_{x} L(x, \alpha) = d^*

Conclude

Together with d^* \le p^*, we get p^* = d^*.

Page 28: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Karush-Kuhn-Tucker (KKT) conditions

KKT conditions derive a characterization of an optimal solution to a convex problem.

Theorem

Assume that f and g_i,\ i = 1, \ldots, m, are differentiable and convex. Then x^* is a solution to the optimization problem iff there exists \alpha^* \ge 0 s.t.:

1. \nabla_x L(x^*, \alpha^*) = \nabla f(x^*) + \sum_{i=1}^{m} \alpha_i^* \nabla g_i(x^*) = 0

2. \frac{\partial}{\partial \alpha_i} L(x^*, \alpha^*) = g_i(x^*) \le 0, \quad i = 1, \ldots, m

3. \alpha_i^* g_i(x^*) = 0, \quad i = 1, \ldots, m

Page 29: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

KKT Conditions – Cont.

Proof

For every feasible x:
f(x) \ge f(x^*) + \nabla f(x^*) \cdot (x - x^*)  (convexity of f)
= f(x^*) - \sum_{i=1}^{m} \alpha_i^* \nabla g_i(x^*) \cdot (x - x^*)  (condition 1)
\ge f(x^*) - \sum_{i=1}^{m} \alpha_i^* [\, g_i(x) - g_i(x^*) \,]  (convexity of g_i)
= f(x^*) - \sum_{i=1}^{m} \alpha_i^* g_i(x) \ge f(x^*)  (condition 3, then \alpha_i^* \ge 0 and g_i(x) \le 0)

The other direction holds as well.

Page 30: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

KKT Conditions – Cont.

Example: consider the following optimization problem:
\min \frac{1}{2} x^2 \quad \text{s.t.} \quad x \ge 2

We have f(x) = \frac{1}{2} x^2 and g_1(x) = 2 - x.

The Lagrangian will be L(x, \alpha) = \frac{1}{2} x^2 + \alpha (2 - x), \quad \alpha \ge 0.

Stationarity: \frac{\partial}{\partial x} L(x, \alpha) = x - \alpha = 0, so x^* = \alpha^*.

Complementary slackness: \alpha^* (2 - x^*) = 0; since x^* = 0 would violate the constraint, we need \alpha^* > 0 and the constraint is tight, giving x^* = \alpha^* = 2.
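A quick numerical sanity check of this example (illustrative only, using SciPy instead of solving the KKT system by hand):

```python
from scipy.optimize import minimize

# min (1/2) x^2  s.t.  x >= 2 ; the KKT analysis above gives x* = 2
res = minimize(lambda x: 0.5 * x[0] ** 2, x0=[0.0],
               constraints=[{'type': 'ineq', 'fun': lambda x: x[0] - 2.0}],
               method='SLSQP')
print(res.x)          # ~ [2.]
```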

Page 31: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Optimal Margin Classifier

Back to SVM: rewrite our optimization program as

\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad g_i(w, b) = 1 - y_i (w \cdot x_i + b) \le 0,\ i = 1, \ldots, m

Following the KKT conditions, \alpha_i > 0 only for points in the training set with a functional margin of exactly 1.

These are the support vectors of the training set.

Page 32: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Optimal Margin – Cont.

Optimal margin classifier and its support vectors

Page 33: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Optimal Margin – Cont.

Construct the Lagrangian:
L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i [\, y_i (w \cdot x_i + b) - 1 \,]

Find the dual form: first minimize L(w, b, \alpha) over (w, b). Do so by setting the derivatives to zero:
\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \;\Rightarrow\; w^* = \sum_{i=1}^{m} \alpha_i y_i x_i

Page 34: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Optimal Margin – Cont.

Take the derivative with respect to b:
\frac{\partial}{\partial b} L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i y_i = 0

Use w^* in the Lagrangian:
L(w^*, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - b \sum_{i=1}^{m} \alpha_i y_i

We saw the last term is zero, so
L(w^*, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) = W(\alpha)

Page 35: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Optimal Margin – Cont.

The dual optimization problem:
\max_{\alpha} W(\alpha) \quad \text{s.t.} \quad \alpha_i \ge 0,\ i = 1, \ldots, m, \quad \sum_{i=1}^{m} \alpha_i y_i = 0

The KKT conditions hold, so we can solve by finding \alpha^* that maximizes W(\alpha).

Assuming we have \alpha^*, define w^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i, the solution to the primal problem.
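An illustrative sketch (not from the slides) of solving this dual with SciPy's SLSQP; in practice a dedicated QP solver would be used. The helper name is an assumption; it also reads off the support vectors as the points with \alpha_i > 0:

```python
import numpy as np
from scipy.optimize import minimize

def train_svm_dual(X, y):
    """max W(α)  s.t.  α >= 0, Σ α_i y_i = 0; returns (α, w*, support-vector mask)."""
    m = X.shape[0]
    G = (y[:, None] * X) @ (y[:, None] * X).T        # G_ij = y_i y_j (x_i · x_j)
    neg_W = lambda a: 0.5 * a @ G @ a - a.sum()      # minimize -W(α)
    res = minimize(neg_W, np.zeros(m),
                   bounds=[(0.0, None)] * m,
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
                   method='SLSQP')
    alpha = res.x
    w = (alpha * y) @ X                              # w* = Σ α_i y_i x_i
    return alpha, w, alpha > 1e-6                    # α_i > 0 ⇔ support vector
```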

Page 36: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Optimal Margin – Cont.

Still need to find b^*.

Assume x_i is a support vector. We get:
y_i (w^* \cdot x_i + b^*) = 1
\Rightarrow w^* \cdot x_i + b^* = \frac{1}{y_i} = y_i
\Rightarrow b^* = y_i - w^* \cdot x_i
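Continuing the illustrative dual sketch above, b^* can be recovered from any support vector:

```python
alpha, w, sv = train_svm_dual(X, y)     # X, y: the toy data from the primal sketch
i = int(np.argmax(sv))                  # index of one support vector (α_i > 0)
b = y[i] - X[i] @ w                     # b* = y_i - w*·x_i
```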

Page 37: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Error Analysis Using Leave-One-Out

The Leave-One-Out (LOO) method:

Remove one point at a time from the training set.

Calculate an SVM for the remaining points.

Test our result using the removed point.

Definition

\hat{R}_{LOO} = \frac{1}{m} \sum_{i=1}^{m} I\left( h_{S \setminus \{x_i\}}(x_i) \ne y_i \right)

The indicator function I(exp) is 1 if exp is true, otherwise 0.
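An illustrative sketch of the LOO estimate, reusing the hypothetical train_svm_primal helper from the earlier sketch:

```python
import numpy as np

def loo_error(X, y, train=train_svm_primal):
    """R_LOO = (1/m) Σ_i I( h_{S\{x_i}}(x_i) != y_i )."""
    m, mistakes = X.shape[0], 0
    for i in range(m):
        keep = np.arange(m) != i                 # leave point i out
        w, b = train(X[keep], y[keep])
        if np.sign(X[i] @ w + b) != y[i]:        # test on the removed point
            mistakes += 1
    return mistakes / m
```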

Page 38: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

LOO Error Analysis – Cont.

Expected error

\mathbb{E}_{S \sim D^m}[\hat{R}_{LOO}] = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{S}\left[ I\left( h_{S \setminus \{x_i\}}(x_i) \ne y_i \right) \right] = \mathbb{E}_{S' \sim D^{m-1},\ (x,y) \sim D}\left[ I\left( h_{S'}(x) \ne y \right) \right] = \mathbb{E}_{S' \sim D^{m-1}}[\, error(h_{S'}) \,]

It follows that the expected value of the LOO estimate on a training set of size m equals the expected true error of a hypothesis trained on a set of size m-1.

Page 39: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

LOO Error Analysis – Cont.

Theorem

\mathbb{E}_{S \sim D^{m-1}}[\, error(h_S) \,] \le \mathbb{E}_{S \sim D^{m}}\left[ \frac{N_{SV}(S)}{m} \right],
where N_{SV}(S) is the number of support vectors in S.

Proof

If h_{S \setminus \{x_i\}} classifies the point x_i incorrectly, then x_i must be a support vector of h_S. Hence:
\hat{R}_{LOO} \le \frac{N_{SV}(S)}{m},
and taking expectations (using the previous slide) gives the theorem.
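A quick illustrative check of the per-sample inequality \hat{R}_{LOO} \le N_{SV}(S)/m, reusing the hypothetical helpers sketched earlier:

```python
alpha, w, sv = train_svm_dual(X, y)              # support-vector mask from the dual
print("LOO estimate:", loo_error(X, y),
      "<= bound N_SV/m:", sv.sum() / len(y))
```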

Page 40: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Generalization Bounds Using VC-dimension

Theorem

Let the points satisfy \|x_i\| \le R, and let d be the VC-dimension of the class of hyperplanes x \mapsto \mathrm{sign}(w \cdot x) that achieve geometric margin at least \gamma (i.e., y_i (w \cdot x_i) \ge 1 with \|w\| \le 1/\gamma). Then d \le \frac{R^2}{\gamma^2}.

Proof

Assume that the set x_1, \ldots, x_d is shattered. So for every y \in \{-1, +1\}^d there exists w with \|w\| \le 1/\gamma such that y_i (w \cdot x_i) \ge 1,\ i = 1, \ldots, d. Summing over d:
d \le \sum_{i=1}^{d} y_i (w \cdot x_i) = w \cdot \sum_{i=1}^{d} y_i x_i \le \|w\| \left\| \sum_{i=1}^{d} y_i x_i \right\| \le \frac{1}{\gamma} \left\| \sum_{i=1}^{d} y_i x_i \right\|

Page 41: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Generalization Bounds Using VC-dimension – Cont.

Proof – Cont.

Averaging over the y's with uniform distribution:
d \le \frac{1}{\gamma} \mathbb{E}_y\left[ \left\| \sum_{i=1}^{d} y_i x_i \right\| \right] \le \frac{1}{\gamma} \sqrt{ \mathbb{E}_y\left[ \left\| \sum_{i=1}^{d} y_i x_i \right\|^2 \right] }

Since \mathbb{E}[y_i y_j] = 0 when i \ne j and \mathbb{E}[y_i y_j] = 1 when i = j, we can conclude that:
\mathbb{E}_y\left[ \left\| \sum_{i} y_i x_i \right\|^2 \right] = \sum_{i,j} \mathbb{E}[y_i y_j] (x_i \cdot x_j) = \sum_{i} \|x_i\|^2 \le d R^2

Therefore d \le \frac{1}{\gamma} \sqrt{d R^2} = \frac{R}{\gamma} \sqrt{d}, and hence d \le \frac{R^2}{\gamma^2}.