
Page 1: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Support Vector Machines

Lecturer: Yishay Mansour

Itay Kirshenbaum

Page 2: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Lecture Overview

In this lecture we present in detail one of the most theoretically well-motivated and practically effective classification algorithms in modern machine learning: Support Vector Machines (SVMs).

Page 3: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Lecture Overview – Cont.

We begin by building the intuition behind SVMs,

then define the SVM as an optimization problem and discuss how to solve it efficiently.

We conclude with an analysis of the error rate of SVMs using two techniques: Leave One Out and VC-dimension.

Page 4: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Introduction

The Support Vector Machine (SVM) is a supervised learning algorithm

used to learn a hyperplane that solves the binary classification problem,

which is among the most extensively studied problems in machine learning.

Page 5: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Binary Classification Problem

Input space: X = \mathbb{R}^n

Output space: Y = \{-1, +1\}

Training data: S = \{(x_1, y_1), \ldots, (x_m, y_m)\}, drawn i.i.d. from distribution D

Goal: select a hypothesis h \in H that best predicts other points drawn i.i.d. from D

Page 6: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Binary Classification – Cont.

Consider the problem of predicting the success of a new drug based on a patient's height and weight.

m ill people are selected and treated.

This generates m 2d vectors (height and weight).

Each point is assigned +1 to indicate successful treatment, or -1 otherwise.

These labeled points can be used as training data (a small synthetic sketch follows).
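To make the setup concrete, here is a minimal NumPy sketch (not from the original slides) that builds such a labeled training set; the sizes, distributions, and labels are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20                                          # number of treated patients

# illustrative (height, weight) measurements for m patients
X = np.column_stack([rng.normal(170, 10, m),    # height in cm
                     rng.normal(75, 12, m)])    # weight in kg

# illustrative labels: +1 = successful treatment, -1 = otherwise
y = np.where(rng.random(m) < 0.5, 1, -1)

S = list(zip(X, y))                             # training set S = {(x_i, y_i)}
```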

Page 7: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Binary classification – Cont.

There are infinitely many ways to classify the data. Occam's razor: simple classification rules provide better results.

We use a linear classifier, i.e., a hyperplane.

Our class of linear classifiers:

H = \{ x \mapsto \mathrm{sign}(w \cdot x + b) \mid w \in \mathbb{R}^n,\ b \in \mathbb{R} \}

A hypothesis h \in H maps x \in X to 1 if w \cdot x + b \ge 0 (and to -1 otherwise).
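As an illustration (not part of the original slides), here is a minimal NumPy sketch of evaluating such a linear classifier h(x) = sign(w·x + b):

```python
import numpy as np

def predict(w, b, X):
    """h(x) = sign(w·x + b), applied to each row of X; returns +1 / -1."""
    return np.where(X @ w + b >= 0, 1, -1)

w, b = np.array([1.0, -2.0]), 0.5               # an arbitrary hyperplane in R^2
X = np.array([[3.0, 1.0], [0.0, 2.0]])
print(predict(w, b, X))                         # -> [ 1 -1]
```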

Page 8: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Choosing a Good Hyperplane

Intuition: consider two cases of positive classification,

w \cdot x + b = 0.1 and w \cdot x + b = 100.

We are more confident in the decision made by the latter than in the former.

Choose a hyperplane with maximal margin

Page 9: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Good Hyperplane – Cont.

Definition: functional margin of the training set S.

For a linear classifier (w, b), the functional margin of example i is \hat{\gamma}_i = y_i (w \cdot x_i + b), where \mathrm{sign}(w \cdot x_i + b) is the classification of x_i according to (w, b).

The functional margin of S is \hat{\gamma} = \min_{i = 1, \ldots, m} \hat{\gamma}_i.
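A quick sketch (illustrative, not from the slides) of computing the functional margin of a training set:

```python
import numpy as np

def functional_margin(w, b, X, y):
    """gamma_hat = min_i y_i (w·x_i + b) over the training set."""
    return np.min(y * (X @ w + b))
```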

Page 10: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Maximal Margin

(w, b) can be scaled to inflate the functional margin: sign(w \cdot x + b) = sign(5w \cdot x + 5b) for all x, yet the functional margin of (5w, 5b) is 5 times greater than that of (w, b).

We can cope by adding an additional constraint: \|w\| = 1.

Page 11: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Maximal Margin – Cont.

Geometric Margin Consider the geometric distance

between the hyperplane and the closest points

Page 12: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Geometric Margin

Definition: geometric margin of the training set S.

The geometric margin of example i is \gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right), and the geometric margin of S is \gamma = \min_{i = 1, \ldots, m} \gamma_i.

Relation to the functional margin: \gamma_i = \hat{\gamma}_i / \|w\|.

Both are equal when \|w\| = 1.
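Continuing the illustrative sketch above (same numpy import and functional_margin helper), the geometric margin is simply the functional margin divided by \|w\|:

```python
def geometric_margin(w, b, X, y):
    """gamma = min_i y_i ((w/||w||)·x_i + b/||w||) = gamma_hat / ||w||."""
    return functional_margin(w, b, X, y) / np.linalg.norm(w)
```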

Page 13: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

The Algorithm

We saw two definitions of the margin, and the intuition behind seeking a margin-maximizing hyperplane.

Goal: write an optimization program that finds such a hyperplane.

We always look for (w, b) maximizing the margin.

Page 14: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

The Algorithm – Take 1

First try:

\max_{\gamma, w, b} \gamma \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge \gamma,\ i = 1, \ldots, m, \quad \|w\| = 1

Idea: maximize \gamma; for each sample the functional margin is at least \gamma; the functional and geometric margins are the same since \|w\| = 1; so \gamma is the largest possible geometric margin with respect to the training set.

Page 15: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

The Algorithm – Take 2

The first try can't be solved by any off-the-shelf optimization software: the constraint \|w\| = 1 is non-linear, and in fact it's even non-convex.

How can we discard the constraint? Use the geometric margin!

\max_{\hat{\gamma}, w, b} \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge \hat{\gamma},\ i = 1, \ldots, m

Page 16: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

The Algorithm – Take 3

We now have a non-convex objective function, so the problem remains.

Remember: we can scale (w, b) as we wish, so force the functional margin to be \hat{\gamma} = 1.

The objective function becomes \max \frac{1}{\|w\|}, which is the same as \min \frac{1}{2} \|w\|^2.

The factor of 1/2 and the power of 2 do not change the program; they just make things easier.

Page 17: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

The algorithm – Final version

The final program:

\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1,\ i = 1, \ldots, m

The objective is convex (quadratic) and all constraints are linear, so the program can be solved efficiently using standard quadratic programming (QP) software.
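For illustration only, here is a minimal sketch of this primal program solved with SciPy's general-purpose SLSQP solver (the slides assume dedicated QP software; the helper name and toy data below are assumptions, not part of the lecture):

```python
import numpy as np
from scipy.optimize import minimize

def train_svm_primal(X, y):
    """min (1/2)||w||^2  s.t.  y_i (w·x_i + b) >= 1; variable z = (w, b)."""
    m, n = X.shape
    objective = lambda z: 0.5 * z[:n] @ z[:n]
    constraints = [{'type': 'ineq',                      # SLSQP: fun(z) >= 0
                    'fun': lambda z, i=i: y[i] * (X[i] @ z[:n] + z[n]) - 1.0}
                   for i in range(m)]
    res = minimize(objective, np.zeros(n + 1), constraints=constraints,
                   method='SLSQP')
    return res.x[:n], res.x[n]                           # (w*, b*)

# assumed linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_svm_primal(X, y)
print(w, b, np.sign(X @ w + b))
```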

Page 18: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Convex Optimization

We want to solve the optimization problem more efficiently than generic QP

Solution – Use convex optimization techniques

Page 19: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Convex Optimization – Cont.

Definition: a convex function.

f is convex if for all x, y \in X and \lambda \in [0, 1]:
f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)

Theorem

Let f be a differentiable convex function. Then for all x, y \in X:
f(y) \ge f(x) + \nabla f(x) \cdot (y - x)

Page 20: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Convex Optimization Problem

Convex optimization problem

Let f, g_i : X \to \mathbb{R},\ i = 1, \ldots, m, be convex functions.

Find \min_{x \in X} f(x) \quad \text{s.t.} \quad g_i(x) \le 0,\ i = 1, \ldots, m.

We look for a value of x \in X that minimizes f(x) under the constraints g_i(x) \le 0,\ i = 1, \ldots, m.

Page 21: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Lagrange Multipliers

Used to find the maxima or minima of a function subject to constraints.

We use them to solve our optimization problem.

Definition

The Lagrangian of the function f subject to the constraints g_i,\ i = 1, \ldots, m, is
L(x, \alpha) = f(x) + \sum_{i=1}^{m} \alpha_i g_i(x), \quad x \in X,\ \alpha_i \ge 0.
The \alpha_i are called the Lagrange multipliers.

Page 22: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Primal Program

Plan: use the Lagrangian to write a program called the Primal Program. It is equal to f(x) if all the constraints are met; otherwise it is \infty.

Definition – Primal Program

\theta_P(x) = \max_{\alpha \ge 0} L(x, \alpha)

Page 23: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Primal Program – Cont.

The constraints are of the form g_i(x) \le 0.

If they are met, \sum_{i=1}^{m} \alpha_i g_i(x) is maximized when all \alpha_i are 0, and the summation is 0, so \theta_P(x) = f(x).

Otherwise, \sum_{i=1}^{m} \alpha_i g_i(x) is maximized for \alpha_i \to \infty, so \theta_P(x) = \infty.

Page 24: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Primal Program – Cont.

Our convex optimization problem is now:
\min_{x \in X} \theta_P(x) = \min_{x \in X} \max_{\alpha \ge 0} L(x, \alpha)

Define p^* = \min_{x \in X} \theta_P(x) as the value of the primal program.

Page 25: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Dual Program

We define the Dual Program as: \theta_D(\alpha) = \min_{x \in X} L(x, \alpha).

We'll look at \max_{\alpha \ge 0} \theta_D(\alpha) = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha).

Same as our primal program, except that the order of min / max is different.

Define d^* = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha) as the value of our Dual Program.

Page 26: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Dual Program – Cont.

We want to show d^* = p^*: if we find a solution to one problem, we find the solution to the second problem.

Start with d^* \le p^*: "max min" is always less than or equal to "min max":
d^* = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha) \le \min_{x \in X} \max_{\alpha \ge 0} L(x, \alpha) = p^*

Now on to p^* \le d^*.

Page 27: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Dual Program – Cont.

Claim

If there exist x^* and \alpha^* \ge 0 which are a saddle point, i.e., for every feasible x and every \alpha \ge 0:
L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*),
then p^* = d^* = L(x^*, \alpha^*) and x^* is a solution to the primal problem.

Proof

p^* = \inf_{x} \sup_{\alpha \ge 0} L(x, \alpha) \le \sup_{\alpha \ge 0} L(x^*, \alpha) = L(x^*, \alpha^*) = \inf_{x} L(x, \alpha^*) \le \sup_{\alpha \ge 0} \inf_{x} L(x, \alpha) = d^*

Conclude

Together with d^* \le p^*, we get p^* = d^*.

Page 28: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Karush-Kuhn-Tucker (KKT) conditions

KKT conditions derive a characterization of an optimal solution to a convex problem.

Theorem

Assume that f and g_i,\ i = 1, \ldots, m, are differentiable and convex. Then x^* is a solution to the optimization problem iff there exists \alpha^* \ge 0 s.t.:

1. \nabla_x L(x^*, \alpha^*) = \nabla f(x^*) + \sum_{i=1}^{m} \alpha_i^* \nabla g_i(x^*) = 0

2. \frac{\partial}{\partial \alpha_i} L(x^*, \alpha^*) = g_i(x^*) \le 0, \quad i = 1, \ldots, m

3. \alpha_i^* g_i(x^*) = 0, \quad i = 1, \ldots, m

Page 29: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

KKT Conditions – Cont.

Proof

For every feasible x:
f(x) \ge f(x^*) + \nabla f(x^*) \cdot (x - x^*)  (convexity of f)
= f(x^*) - \sum_{i=1}^{m} \alpha_i^* \nabla g_i(x^*) \cdot (x - x^*)  (condition 1)
\ge f(x^*) - \sum_{i=1}^{m} \alpha_i^* [\, g_i(x) - g_i(x^*) \,]  (convexity of g_i)
= f(x^*) - \sum_{i=1}^{m} \alpha_i^* g_i(x) \ge f(x^*)  (condition 3, then \alpha_i^* \ge 0 and g_i(x) \le 0)

The other direction holds as well.

Page 30: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

KKT Conditions – Cont.

Example: consider the following optimization problem:
\min \frac{1}{2} x^2 \quad \text{s.t.} \quad x \ge 2

We have f(x) = \frac{1}{2} x^2 and g_1(x) = 2 - x.

The Lagrangian will be L(x, \alpha) = \frac{1}{2} x^2 + \alpha (2 - x), \quad \alpha \ge 0.

Stationarity: \frac{\partial}{\partial x} L(x, \alpha) = x - \alpha = 0, so x^* = \alpha^*.

Complementary slackness: \alpha^* (2 - x^*) = 0; since x^* = 0 would violate the constraint, we need \alpha^* > 0 and the constraint is tight, giving x^* = \alpha^* = 2.
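A quick numerical sanity check of this example (illustrative only, using SciPy instead of solving the KKT system by hand):

```python
from scipy.optimize import minimize

# min (1/2) x^2  s.t.  x >= 2 ; the KKT analysis above gives x* = 2
res = minimize(lambda x: 0.5 * x[0] ** 2, x0=[0.0],
               constraints=[{'type': 'ineq', 'fun': lambda x: x[0] - 2.0}],
               method='SLSQP')
print(res.x)          # ~ [2.]
```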

Page 31: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Optimal Margin Classifier

Back to SVM: rewrite our optimization program as

\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad g_i(w, b) = 1 - y_i (w \cdot x_i + b) \le 0,\ i = 1, \ldots, m

Following the KKT conditions, \alpha_i > 0 only for points in the training set with a functional margin of exactly 1.

These are the support vectors of the training set.

Page 32: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Optimal Margin – Cont.

Optimal margin classifier and its support vectors

Page 33: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Optimal Margin – Cont.

Construct the Lagrangian:
L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i [\, y_i (w \cdot x_i + b) - 1 \,]

Find the dual form: first minimize L(w, b, \alpha) over (w, b). Do so by setting the derivatives to zero:
\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \;\Rightarrow\; w^* = \sum_{i=1}^{m} \alpha_i y_i x_i

Page 34: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Optimal Margin – Cont.

Take the derivative with respect to b:
\frac{\partial}{\partial b} L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i y_i = 0

Use w^* in the Lagrangian:
L(w^*, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - b \sum_{i=1}^{m} \alpha_i y_i

We saw the last term is zero, so
L(w^*, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) = W(\alpha)

Page 35: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Optimal Margin – Cont.

The dual optimization problem:
\max_{\alpha} W(\alpha) \quad \text{s.t.} \quad \alpha_i \ge 0,\ i = 1, \ldots, m, \quad \sum_{i=1}^{m} \alpha_i y_i = 0

The KKT conditions hold, so we can solve by finding \alpha^* that maximizes W(\alpha).

Assuming we have \alpha^*, define w^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i, the solution to the primal problem.
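An illustrative sketch (not from the slides) of solving this dual with SciPy's SLSQP; in practice a dedicated QP solver would be used. The helper name is an assumption; it also reads off the support vectors as the points with \alpha_i > 0:

```python
import numpy as np
from scipy.optimize import minimize

def train_svm_dual(X, y):
    """max W(α)  s.t.  α >= 0, Σ α_i y_i = 0; returns (α, w*, support-vector mask)."""
    m = X.shape[0]
    G = (y[:, None] * X) @ (y[:, None] * X).T        # G_ij = y_i y_j (x_i · x_j)
    neg_W = lambda a: 0.5 * a @ G @ a - a.sum()      # minimize -W(α)
    res = minimize(neg_W, np.zeros(m),
                   bounds=[(0.0, None)] * m,
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
                   method='SLSQP')
    alpha = res.x
    w = (alpha * y) @ X                              # w* = Σ α_i y_i x_i
    return alpha, w, alpha > 1e-6                    # α_i > 0 ⇔ support vector
```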

Page 36: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Optimal Margin – Cont.

Still need to find b^*.

Assume x_i is a support vector. We get:
y_i (w^* \cdot x_i + b^*) = 1
\Rightarrow w^* \cdot x_i + b^* = \frac{1}{y_i} = y_i
\Rightarrow b^* = y_i - w^* \cdot x_i
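Continuing the illustrative dual sketch above, b^* can be recovered from any support vector:

```python
alpha, w, sv = train_svm_dual(X, y)     # X, y: the toy data from the primal sketch
i = int(np.argmax(sv))                  # index of one support vector (α_i > 0)
b = y[i] - X[i] @ w                     # b* = y_i - w*·x_i
```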

Page 37: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Error Analysis Using Leave-One-Out

The Leave-One-Out (LOO) method:

Remove one point at a time from the training set.

Calculate an SVM for the remaining points.

Test our result using the removed point.

Definition

\hat{R}_{LOO} = \frac{1}{m} \sum_{i=1}^{m} I\left( h_{S \setminus \{x_i\}}(x_i) \ne y_i \right)

The indicator function I(exp) is 1 if exp is true, otherwise 0.
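An illustrative sketch of the LOO estimate, reusing the hypothetical train_svm_primal helper from the earlier sketch:

```python
import numpy as np

def loo_error(X, y, train=train_svm_primal):
    """R_LOO = (1/m) Σ_i I( h_{S\{x_i}}(x_i) != y_i )."""
    m, mistakes = X.shape[0], 0
    for i in range(m):
        keep = np.arange(m) != i                 # leave point i out
        w, b = train(X[keep], y[keep])
        if np.sign(X[i] @ w + b) != y[i]:        # test on the removed point
            mistakes += 1
    return mistakes / m
```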

Page 38: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

LOO Error Analysis – Cont.

Expected error

\mathbb{E}_{S \sim D^m}[\hat{R}_{LOO}] = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{S}\left[ I\left( h_{S \setminus \{x_i\}}(x_i) \ne y_i \right) \right] = \mathbb{E}_{S' \sim D^{m-1},\ (x,y) \sim D}\left[ I\left( h_{S'}(x) \ne y \right) \right] = \mathbb{E}_{S' \sim D^{m-1}}[\, error(h_{S'}) \,]

It follows that the expected value of the LOO estimate on a training set of size m equals the expected true error of a hypothesis trained on a set of size m-1.

Page 39: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

LOO Error Analysis – Cont.

Theorem

\mathbb{E}_{S \sim D^{m-1}}[\, error(h_S) \,] \le \mathbb{E}_{S \sim D^{m}}\left[ \frac{N_{SV}(S)}{m} \right],
where N_{SV}(S) is the number of support vectors in S.

Proof

If h_{S \setminus \{x_i\}} classifies the point x_i incorrectly, then x_i must be a support vector of h_S. Hence:
\hat{R}_{LOO} \le \frac{N_{SV}(S)}{m},
and taking expectations (using the previous slide) gives the theorem.
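A quick illustrative check of the per-sample inequality \hat{R}_{LOO} \le N_{SV}(S)/m, reusing the hypothetical helpers sketched earlier:

```python
alpha, w, sv = train_svm_dual(X, y)              # support-vector mask from the dual
print("LOO estimate:", loo_error(X, y),
      "<= bound N_SV/m:", sv.sum() / len(y))
```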

Page 40: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Generalization Bounds Using VC-dimension

Theorem

Let the points satisfy \|x_i\| \le R, and let d be the VC-dimension of the class of hyperplanes x \mapsto \mathrm{sign}(w \cdot x) that achieve geometric margin at least \gamma (i.e., y_i (w \cdot x_i) \ge 1 with \|w\| \le 1/\gamma). Then d \le \frac{R^2}{\gamma^2}.

Proof

Assume that the set x_1, \ldots, x_d is shattered. So for every y \in \{-1, +1\}^d there exists w with \|w\| \le 1/\gamma such that y_i (w \cdot x_i) \ge 1,\ i = 1, \ldots, d. Summing over d:
d \le \sum_{i=1}^{d} y_i (w \cdot x_i) = w \cdot \sum_{i=1}^{d} y_i x_i \le \|w\| \left\| \sum_{i=1}^{d} y_i x_i \right\| \le \frac{1}{\gamma} \left\| \sum_{i=1}^{d} y_i x_i \right\|

Page 41: Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

Generalization Bounds Using VC-dimension – Cont.

Proof – Cont.

Averaging over the y's with uniform distribution:
d \le \frac{1}{\gamma} \mathbb{E}_y\left[ \left\| \sum_{i=1}^{d} y_i x_i \right\| \right] \le \frac{1}{\gamma} \sqrt{ \mathbb{E}_y\left[ \left\| \sum_{i=1}^{d} y_i x_i \right\|^2 \right] }

Since \mathbb{E}[y_i y_j] = 0 when i \ne j and \mathbb{E}[y_i y_j] = 1 when i = j, we can conclude that:
\mathbb{E}_y\left[ \left\| \sum_{i} y_i x_i \right\|^2 \right] = \sum_{i,j} \mathbb{E}[y_i y_j] (x_i \cdot x_j) = \sum_{i} \|x_i\|^2 \le d R^2

Therefore d \le \frac{1}{\gamma} \sqrt{d R^2} = \frac{R}{\gamma} \sqrt{d}, and hence d \le \frac{R^2}{\gamma^2}.