SVM learning
WM&R, academic year 2020/21

R. Basili
Dipartimento di Ingegneria dell'Impresa
Università di Roma "Tor Vergata"
email: basili@info.uniroma2.it
Summary
Perceptron Learning
Limitations of linear classifiers
Support Vector Machines
◦ Maximal margin classification
◦ Optimization with hard margin
◦ Optimization with soft margin
The roles of kernels in SVM-based learning
Linear Classifiers (1)
A hyperplane has equation:
    f(x) = ⟨w, x⟩ + b,   with w, x ∈ ℝⁿ and b ∈ ℝ
x is the vector of the instance to be classified
w is the hyperplane gradient
Classification function:
    h(x) = sign(f(x))
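A minimal sketch of this decision rule in Python (the values of w, b and the test instance are arbitrary assumptions, only for illustration):

import numpy as np

w = np.array([1.0, -2.0])   # hyperplane gradient (assumed values)
b = 0.5                     # offset (assumed value)

def f(x):
    return np.dot(w, x) + b      # f(x) = <w, x> + b

def h(x):
    return np.sign(f(x))         # h(x) = sign(f(x))

print(h(np.array([3.0, 1.0])))   # prints 1.0 (positive side of the hyperplane)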
Linear Classifiers (2)
Computationally simple.
Basic idea: select a hypothesis that makes no mistakes over the training set.
The separating function is equivalent to a neural net with just one neuron (a perceptron).
Notation
The functional margin of an example (x_i, y_i) with respect to a hyperplane (w, b) is:
    γ_i = y_i(⟨w, x_i⟩ + b)
The distribution of functional margins of a hyperplane with respect to a training set S is the distribution of the margins of the examples in S.
The functional margin of a hyperplane with respect to S is the minimum margin of the distribution.
Inner product and cosine distance
From
    cos(x, w) = ⟨x, w⟩ / (||x|| · ||w||)
it follows that:
    ⟨x, w⟩ / ||w|| = ||x|| · cos(x, w)
i.e. the norm of x times the cosine of the angle between x and w: the projection of x onto w.
Notations (2)
By normalizing the hyperplane equation, i.e. taking (w/||w||, b/||w||), we get the geometrical margin:
    γ_i = y_i( ⟨w/||w||, x_i⟩ + b/||w|| )
The geometrical margin corresponds to the distance of the points in S from the hyperplane.
For example, in ℝ²:
Geometrical margin vs. training set margin
[Figure: a separating hyperplane in the plane with positive (x) and negative (o) points, showing the geometric margin vs. the data points in the training set]
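A short numeric sketch of functional vs. geometrical margin (w, b and the example point below are arbitrary assumed values):

import numpy as np

w, b = np.array([3.0, 4.0]), 1.0
x_i, y_i = np.array([1.0, 1.0]), 1

functional_margin = y_i * (np.dot(w, x_i) + b)              # y_i(<w, x_i> + b) = 8.0
geometric_margin = functional_margin / np.linalg.norm(w)    # 8.0 / 5.0 = 1.6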
Notations (3)
The margin of the training set S is the maximal geometric margin over all hyperplanes.
The hyperplane that achieves this (maximal) margin is called the maximal margin hyperplane.
Perceptron: on-line algorithm

    w_0 ← 0; b_0 ← 0; k ← 0; R ← max_{1≤i≤l} ||x_i||
    Repeat
        for i = 1 to l
            if y_i(⟨w_k, x_i⟩ + b_k) ≤ 0 then          (classification error)
                w_{k+1} ← w_k + η y_i x_i               (adjustments)
                b_{k+1} ← b_k + η y_i R²
                k ← k + 1
            endif
        endfor
    until no error is found
    return (k, (w_k, b_k))
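A runnable NumPy sketch of this on-line update (the function name, learning rate and toy data below are assumptions for illustration, not part of the slides):

import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """On-line perceptron: returns the number of updates k and the hyperplane (w, b)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    k = 0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # classification error
                w = w + eta * y_i * x_i           # adjustments
                b = b + eta * y_i * R ** 2
                k += 1
                errors += 1
        if errors == 0:                           # until no error is found
            break
    return k, w, b

# toy linearly separable data (assumed)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, -2.5]])
y = np.array([1, 1, -1, -1])
k, w, b = perceptron(X, y)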
Novikoff theorem
Given a non-trivial training set S (|S| = m >> 0) and R = max_{1≤i≤m} ||x_i||:
if a vector w*, with ||w*|| = 1, exists such that
    y_i(⟨w*, x_i⟩ + b*) ≥ γ,   i = 1, ..., m
with γ > 0, then the maximal number of errors made by the perceptron is:
    t ≤ (2R / γ)²
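As a quick numeric illustration of the bound (R and γ below are arbitrary example values): with R = 10 and γ = 0.5, the perceptron makes at most t ≤ (2 · 10 / 0.5)² = 1600 mistakes, regardless of the number of training examples m.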
Consequences
The theorem states that, whatever the length of the geometrical margin, if the data instances are linearly separable then the perceptron finds a separating hyperplane in a finite number of steps.
This number is inversely proportional to the square of the margin.
This bound is invariant to the scale of the individual patterns.
The learning rate is not critical: it only affects the rate of convergence.
Duality
The decision function of linear classifiers can be written as follows:
    h(x) = sgn(⟨w, x⟩ + b) = sgn( Σ_{j=1..m} α_j y_j ⟨x_j, x⟩ + b )
as well as the adjustment (update) rule:
    if y_i ( Σ_{j=1..m} α_j y_j ⟨x_j, x_i⟩ + b ) ≤ 0 then α_i ← α_i + 1
The learning rate only affects the rescaling of the hyperplanes; it does not influence the algorithm (it can simply be fixed to η = 1).
Training data only appear in the scalar products!!
First property of SVMs
DUALITY is the first property of Support Vector Machines.
SVMs are learning machines of the kind:
    f(x) = sgn(⟨w, x⟩ + b) = sgn( Σ_{j=1..m} α_j y_j ⟨x_j, x⟩ + b )
It must be noted that (input, i.e. training & testing instances) data only appear in the scalar product.
The matrix G = ( ⟨x_i, x_j⟩ )_{i,j=1..l} is called the Gram matrix of the incoming distribution.
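A small NumPy sketch of these two objects, the Gram matrix and the dual decision function (the dual coefficients alpha below are assumed to have been learned already, e.g. by the dual perceptron):

import numpy as np

def gram_matrix(X):
    """G[i, j] = <x_i, x_j> for all pairs of training instances."""
    return X @ X.T

def decision(x, X, y, alpha, b):
    """f(x) = sgn( sum_j alpha_j * y_j * <x_j, x> + b )."""
    return np.sign(np.sum(alpha * y * (X @ x)) + b)

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
alpha = np.array([0.0, 1.0, 1.0])   # assumed dual coefficients
b = 0.0
G = gram_matrix(X)
label = decision(np.array([0.5, 2.0]), X, y, alpha, b)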
Limitations of linear classifiers
◦ Problems in dealing with non linearly separable data
◦ Treatment of noisy data
◦ Data must be in a real-valued vector formalism, i.e. an underlying metric space topology is required
Solutions
Artificial Neural Networks (ANN) approach: augment the number of neurons and organize them into layers (multilayer neural networks); learning is carried out through the Back-propagation algorithm (Rumelhart & McClelland, '91).
SVM approach: extend the representation by exploiting kernel functions (i.e. non-linear, often task-dependent functions described by the Gram matrix); a minimal sketch follows this list.
◦ In this way the learning algorithms are decoupled from the application domain, which can be coded exclusively through task-specific kernel functions.
◦ The feature modeling does not necessarily have to produce real-valued vectors, but can be derived from intrinsic properties of the training objects.
◦ Complex data structures, e.g. sequences, trees, graphs or PCA-like decompositions (e.g. LSA), can be managed by individual kernels.
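A minimal sketch of the idea: replace the scalar product in the Gram matrix with a kernel function (a polynomial kernel is used here purely as an example choice):

import numpy as np

def poly_kernel(x, z, degree=2, c=1.0):
    """k(x, z) = (<x, z> + c)^degree: an implicit non-linear feature map."""
    return (np.dot(x, z) + c) ** degree

def kernel_gram_matrix(X, kernel):
    """Gram matrix G[i, j] = k(x_i, x_j); the learner only ever sees G."""
    l = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0]])
G = kernel_gram_matrix(X, poly_kernel)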
Maximum Margin Hyperplanes
[Figure: two candidate separating hyperplanes in the (Var1, Var2) plane, each shown with its margin]
IDEA: Select the hyperplane that maximizes the margin.
How to get the maximum margin?
[Figure: hyperplane ⟨w, x⟩ + b = 0 with the boundary hyperplanes ⟨w, x⟩ + b = k and ⟨w, x⟩ + b = −k in the (Var1, Var2) plane]
The geometric margin is:
    2k / ||w||
Optimization problem
    MAX 2k / ||w||
    subject to:
        ⟨w, x⟩ + b ≥ k,    if x is a positive example
        ⟨w, x⟩ + b ≤ −k,   if x is a negative example
Scaling the hyperplane …
[Figure: hyperplanes ⟨w, x⟩ + b = 1, ⟨w, x⟩ + b = 0 and ⟨w, x⟩ + b = −1 in the (Var1, Var2) plane, with margin 1/||w|| on each side]
There is a scale for which k = 1.
The optimization problem becomes:
    max 2 / ||w||
    subject to:
        ⟨w, x⟩ + b ≥ 1,    if x is positive
        ⟨w, x⟩ + b ≤ −1,   if x is negative
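A one-step check of why the margin equals 2/||w|| (a standard derivation, added here for clarity): take x₊ on ⟨w, x⟩ + b = 1 and x₋ on ⟨w, x⟩ + b = −1; subtracting the two equations gives ⟨w, x₊ − x₋⟩ = 2, and projecting x₊ − x₋ onto the unit normal w/||w|| yields
    ⟨ w/||w||, x₊ − x₋ ⟩ = 2 / ||w||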
The optimization problem
The optimal hyperplane satisfies:
◦ Minimize:
    f(w) = (1/2)⟨w, w⟩
◦ Under:
    y_i(⟨w, x_i⟩ + b) ≥ 1,   i = 1, ..., m
The dual problem is simpler.
Definition of the Lagrangian
For the primal problem
    minimize  f(w) = (1/2)⟨w, w⟩
    subject to  y_i(⟨w, x_i⟩ + b) ≥ 1,   i = 1, ..., l
a Lagrange multiplier α_i ≥ 0 is associated with each inequality constraint.
The β multipliers are not used, as no equality constraint is needed in the primal equation.
Dual optimization problem
Notice that the equality-constraint multipliers are not used in the dual optimization problem, as no equality constraint is imposed in the primal form.
Graphically: two examples of constrained optimization (with equalities)
[Figure: level curves of the objective f(x, y) against the constraint curve g(x, y); the two examples involve x² + y², the constraint g(x, y) = 1 and the constraint g(x, y) = c]
Dual Optimization problem
Notice that the equality-constraint multipliers are not used in the targeted optimization, as no equality constraint is imposed.
Transforming into the dual
The Lagrangian corresponding to our problem becomes:
    L(w, b, α) = (1/2)⟨w, w⟩ − Σ_{i=1..m} α_i [ y_i(⟨w, x_i⟩ + b) − 1 ],   α_i ≥ 0
In order to solve the dual problem we compute the partial derivatives of L and then impose them to be 0 with respect to w and b:
    ∂L/∂w = 0  ⇒  w = Σ_{i=1..m} α_i y_i x_i
    ∂L/∂b = 0  ⇒  Σ_{i=1..m} α_i y_i = 0
Dual Optimization problem
• The formulation depends only on the set of α variables and not on w and b
• It has a simpler form
• It makes explicit the individual contributions (α_i) of (a selected set of) examples (x_i)
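For reference, the resulting dual problem takes the standard form:
    maximize   W(α) = Σ_{i=1..m} α_i − (1/2) Σ_{i,j=1..m} α_i α_j y_i y_j ⟨x_i, x_j⟩
    subject to   α_i ≥ 0,  i = 1, ..., m,   and   Σ_{i=1..m} α_i y_i = 0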
Kuhn-Tucker Theorem
Necessary (and sufficient) conditions for the existence of the optimal solution are the following, including the Karush-Kuhn-Tucker (complementarity) constraint:
    α_i [ y_i(⟨x_i, w⟩ + b) − 1 ] = 0,   i = 1, ..., m
Some consequences
Lagrange constraints:
    Σ_{i=1..m} α_i y_i = 0        w = Σ_{i=1..m} α_i y_i x_i
Karush-Kuhn-Tucker constraints:
    α_i [ y_i(⟨x_i, w⟩ + b) − 1 ] = 0,   i = 1, ..., m
The support vectors are the x_i with non-null α_i, i.e. such that y_i(⟨x_i, w⟩ + b) = 1.
They lie on the frontier (the margin hyperplanes).
b is derived through the following formula: b = y_i − ⟨w, x_i⟩, for any support vector x_i.
Non linearly separable training data
[Figure: hyperplanes ⟨w, x⟩ + b = 1, ⟨w, x⟩ + b = 0 and ⟨w, x⟩ + b = −1 in the (Var1, Var2) plane, with misclassified points at distance ξ_i from the margin]
Slack variables ξ_i are introduced.
Mistakes are allowed and the optimization function is penalized.
Soft Margin SVMs
[Figure: separating hyperplane ⟨w, x⟩ + b = 0 with margin hyperplanes ⟨w, x⟩ + b = ±1 in the (Var1, Var2) plane and slack ξ_i for points violating the margin]
New constraints:
    y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i,   ξ_i ≥ 0
Objective function:
    min (1/2)||w||² + C Σ_i ξ_i
C is the trade-off between margin and errors.
Soft Margin Support Vector Machines
    min (1/2)||w||² + C Σ_i ξ_i
    subject to  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i,   ξ_i ≥ 0
The algorithm tries to keep ξ_i = 0 while maximizing the margin.
The algorithm minimizes the sum of the distances from the hyperplane, not the number of errors (minimizing the number of errors is an NP-complete problem).
If C → ∞, the solution tends to conform to the hard-margin solution.
Attention: if C = 0 then ||w|| = 0, since it is always possible to satisfy
    y_i · b ≥ 1 − ξ_i
by choosing the ξ_i large enough.
If C grows, the number of tolerated errors tends to be limited; for C → ∞ the number of errors goes to 0, exactly as in the hard-margin formulation.
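A brief illustration of the C trade-off, assuming scikit-learn is available (the toy data and the chosen values of C are arbitrary examples):

import numpy as np
from sklearn.svm import SVC

# toy, slightly overlapping 2-D data (assumed for illustration)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C: wide margin, many tolerated errors; large C: approaches the hard-margin solution
    print(C, clf.n_support_, clf.score(X, y))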
Robustness: Soft vs Hard Margin SVMs
[Figure: two panels in the (Var1, Var2) plane comparing the hyperplane ⟨w, x⟩ + b = 0 found by a Soft Margin SVM (with slack ξ_i on an outlier) and by a Hard Margin SVM]
Soft vs Hard Margin SVMs
A Soft-Margin SVM always has a solution.
A Soft-Margin SVM is more robust with respect to odd training examples, e.g.:
◦ insufficient representation (e.g. limited vocabularies)
◦ high ambiguity of (linguistic) features
A Hard-Margin SVM requires no parameter.
References
Basili, R. and A. Moschitti, "Automatic Text Categorization: From Information Retrieval to Support Vector Learning", Aracne Editrice, Informatica, ISBN 88-548-0292-1, 2005.
C. J. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition"
◦ URL: http://www.umiacs.umd.edu/~joseph/support-vector-machines4.pdf
E. D. Sontag, "The Vapnik-Chervonenkis Dimension and the Learning Capability of Neural Nets"
◦ URL: http://www.math.rutgers.edu/~sontag/FTP_DIR/vc-expo.pdf
S. A. Goldman (Washington University, St. Louis, Missouri), "Computational Learning Theory"
◦ URL: http://www.learningtheory.org/articles/COLTSurveyArticle.ps
N. Cristianini and J. Shawe-Taylor, "An Introduction to Support Vector Machines (and other kernel-based learning methods)", Cambridge University Press.
V. N. Vapnik, "The Nature of Statistical Learning Theory", Springer Verlag, December 1999.