SVM learning
WM&R, academic year 2020/21

R. Basili
Dipartimento di Ingegneria dell'Impresa
Università di Roma "Tor Vergata"
email: basili@info.uniroma2.it
Summary
Perceptron Learning
Limitations of linear classifiers
Support Vector Machines
◦ Maximal margin classification
◦ Optimization with hard margin
◦ Optimization with soft margin
The roles of kernels in SVM-based learning
Linear Classifiers (1)
A hyperplane has equation:
    f(x) = ⟨w, x⟩ + b,   with w, x ∈ ℝⁿ and b ∈ ℝ
x is the vector of the instance to be classified
w is the hyperplane gradient
Classification function:
    h(x) = sign(f(x))
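A minimal sketch of this decision rule in Python (the values of w, b and the test instance are arbitrary assumptions, only for illustration):

import numpy as np

w = np.array([1.0, -2.0])   # hyperplane gradient (assumed values)
b = 0.5                     # offset (assumed value)

def f(x):
    return np.dot(w, x) + b      # f(x) = <w, x> + b

def h(x):
    return np.sign(f(x))         # h(x) = sign(f(x))

print(h(np.array([3.0, 1.0])))   # prints 1.0 (positive side of the hyperplane)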
Linear Classifiers (2)
Computationally simple.
Basic idea: select a hypothesis that makes no mistakes over the training set.
The separating function is equivalent to a neural net with just one neuron (a perceptron).
Notation
The functional margin of an example (x_i, y_i) with respect to a hyperplane (w, b) is:
    γ_i = y_i(⟨w, x_i⟩ + b)
The distribution of functional margins of a hyperplane with respect to a training set S is the distribution of the margins of the examples in S.
The functional margin of a hyperplane with respect to S is the minimum margin of the distribution.
Inner product and cosine distance
From
    cos(x, w) = ⟨x, w⟩ / (||x|| · ||w||)
it follows that:
    ⟨x, w⟩ / ||w|| = ||x|| · cos(x, w)
i.e. the norm of x times the cosine of the angle between x and w: the projection of x onto w.
Notations (2)
By normalizing the hyperplane equation, i.e. taking (w/||w||, b/||w||), we get the geometrical margin:
    γ_i = y_i( ⟨w/||w||, x_i⟩ + b/||w|| )
The geometrical margin corresponds to the distance of the points in S from the hyperplane.
For example, in ℝ²:
Geometrical margin vs. training set margin
[Figure: a separating hyperplane in the plane with positive (x) and negative (o) points, showing the geometric margin vs. the data points in the training set]
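A short numeric sketch of functional vs. geometrical margin (w, b and the example point below are arbitrary assumed values):

import numpy as np

w, b = np.array([3.0, 4.0]), 1.0
x_i, y_i = np.array([1.0, 1.0]), 1

functional_margin = y_i * (np.dot(w, x_i) + b)              # y_i(<w, x_i> + b) = 8.0
geometric_margin = functional_margin / np.linalg.norm(w)    # 8.0 / 5.0 = 1.6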
Notations (3)
The margin of the training set S is the maximal geometric margin over all hyperplanes.
The hyperplane that achieves this (maximal) margin is called the maximal margin hyperplane.
Perceptron: on-line algorithm

    w_0 ← 0; b_0 ← 0; k ← 0; R ← max_{1≤i≤l} ||x_i||
    Repeat
        for i = 1 to l
            if y_i(⟨w_k, x_i⟩ + b_k) ≤ 0 then          (classification error)
                w_{k+1} ← w_k + η y_i x_i               (adjustments)
                b_{k+1} ← b_k + η y_i R²
                k ← k + 1
            endif
        endfor
    until no error is found
    return (k, (w_k, b_k))
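A runnable NumPy sketch of this on-line update (the function name, learning rate and toy data below are assumptions for illustration, not part of the slides):

import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """On-line perceptron: returns the number of updates k and the hyperplane (w, b)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    k = 0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # classification error
                w = w + eta * y_i * x_i           # adjustments
                b = b + eta * y_i * R ** 2
                k += 1
                errors += 1
        if errors == 0:                           # until no error is found
            break
    return k, w, b

# toy linearly separable data (assumed)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, -2.5]])
y = np.array([1, 1, -1, -1])
k, w, b = perceptron(X, y)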
Novikoff theorem
Given a non-trivial training set S (|S| = m >> 0) and R = max_{1≤i≤m} ||x_i||:
if a vector w*, with ||w*|| = 1, exists such that
    y_i(⟨w*, x_i⟩ + b*) ≥ γ,   i = 1, ..., m
with γ > 0, then the maximal number of errors made by the perceptron is:
    t ≤ (2R / γ)²
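As a quick numeric illustration of the bound (R and γ below are arbitrary example values): with R = 10 and γ = 0.5, the perceptron makes at most t ≤ (2 · 10 / 0.5)² = 1600 mistakes, regardless of the number of training examples m.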
Consequences
The theorem states that, whatever the length of the geometrical margin, if the data instances are linearly separable then the perceptron finds a separating hyperplane in a finite number of steps.
This number is inversely proportional to the square of the margin.
This bound is invariant to the scale of the individual patterns.
The learning rate is not critical: it only affects the rate of convergence.
Duality
The decision function of linear classifiers can be written as follows:
    h(x) = sgn(⟨w, x⟩ + b) = sgn( Σ_{j=1..m} α_j y_j ⟨x_j, x⟩ + b )
as well as the adjustment (update) rule:
    if y_i ( Σ_{j=1..m} α_j y_j ⟨x_j, x_i⟩ + b ) ≤ 0 then α_i ← α_i + 1
The learning rate only affects the rescaling of the hyperplanes; it does not influence the algorithm (it can simply be fixed to η = 1).
Training data only appear in the scalar products!!
First property of SVMs
DUALITY is the first property of Support Vector Machines.
SVMs are learning machines of the kind:
    f(x) = sgn(⟨w, x⟩ + b) = sgn( Σ_{j=1..m} α_j y_j ⟨x_j, x⟩ + b )
It must be noted that (input, i.e. training & testing instances) data only appear in the scalar product.
The matrix G = ( ⟨x_i, x_j⟩ )_{i,j=1..l} is called the Gram matrix of the incoming distribution.
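A small NumPy sketch of these two objects, the Gram matrix and the dual decision function (the dual coefficients alpha below are assumed to have been learned already, e.g. by the dual perceptron):

import numpy as np

def gram_matrix(X):
    """G[i, j] = <x_i, x_j> for all pairs of training instances."""
    return X @ X.T

def decision(x, X, y, alpha, b):
    """f(x) = sgn( sum_j alpha_j * y_j * <x_j, x> + b )."""
    return np.sign(np.sum(alpha * y * (X @ x)) + b)

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
alpha = np.array([0.0, 1.0, 1.0])   # assumed dual coefficients
b = 0.0
G = gram_matrix(X)
label = decision(np.array([0.5, 2.0]), X, y, alpha, b)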
Limitations of linear classifiers
◦ Problems in dealing with non linearly separable data
◦ Treatment of noisy data
◦ Data must be in a real-valued vector formalism, i.e. an underlying metric space topology is required
Solutions
Artificial Neural Networks (ANN) approach: augment the number of neurons and organize them into layers (multilayer neural networks); learning is carried out through the Back-propagation algorithm (Rumelhart & McClelland, '91).
SVM approach: extend the representation by exploiting kernel functions (i.e. non-linear, often task-dependent functions described by the Gram matrix); a minimal sketch follows this list.
◦ In this way the learning algorithms are decoupled from the application domain, which can be coded exclusively through task-specific kernel functions.
◦ The feature modeling does not necessarily have to produce real-valued vectors, but can be derived from intrinsic properties of the training objects.
◦ Complex data structures, e.g. sequences, trees, graphs or PCA-like decompositions (e.g. LSA), can be managed by individual kernels.
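A minimal sketch of the idea: replace the scalar product in the Gram matrix with a kernel function (a polynomial kernel is used here purely as an example choice):

import numpy as np

def poly_kernel(x, z, degree=2, c=1.0):
    """k(x, z) = (<x, z> + c)^degree: an implicit non-linear feature map."""
    return (np.dot(x, z) + c) ** degree

def kernel_gram_matrix(X, kernel):
    """Gram matrix G[i, j] = k(x_i, x_j); the learner only ever sees G."""
    l = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0]])
G = kernel_gram_matrix(X, poly_kernel)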
Maximum Margin Hyperplanes
[Figure: two candidate separating hyperplanes in the (Var1, Var2) plane, each shown with its margin]
IDEA: Select the hyperplane that maximizes the margin.
How to get the maximum margin?
[Figure: hyperplane ⟨w, x⟩ + b = 0 with the boundary hyperplanes ⟨w, x⟩ + b = k and ⟨w, x⟩ + b = −k in the (Var1, Var2) plane]
The geometric margin is:
    2k / ||w||
Optimization problem
    MAX 2k / ||w||
    subject to:
        ⟨w, x⟩ + b ≥ k,    if x is a positive example
        ⟨w, x⟩ + b ≤ −k,   if x is a negative example
Scaling the hyperplane …
[Figure: hyperplanes ⟨w, x⟩ + b = 1, ⟨w, x⟩ + b = 0 and ⟨w, x⟩ + b = −1 in the (Var1, Var2) plane, with margin 1/||w|| on each side]
There is a scale for which k = 1.
The optimization problem becomes:
    max 2 / ||w||
    subject to:
        ⟨w, x⟩ + b ≥ 1,    if x is positive
        ⟨w, x⟩ + b ≤ −1,   if x is negative
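A one-step check of why the margin equals 2/||w|| (a standard derivation, added here for clarity): take x₊ on ⟨w, x⟩ + b = 1 and x₋ on ⟨w, x⟩ + b = −1; subtracting the two equations gives ⟨w, x₊ − x₋⟩ = 2, and projecting x₊ − x₋ onto the unit normal w/||w|| yields
    ⟨ w/||w||, x₊ − x₋ ⟩ = 2 / ||w||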
The optimization problem
The optimal hyperplane satisfies:
◦ Minimize:
    f(w) = (1/2)⟨w, w⟩
◦ Under:
    y_i(⟨w, x_i⟩ + b) ≥ 1,   i = 1, ..., m
The dual problem is simpler.
Definition of the Lagrangian
For the primal problem
    minimize  f(w) = (1/2)⟨w, w⟩
    subject to  y_i(⟨w, x_i⟩ + b) ≥ 1,   i = 1, ..., l
a Lagrange multiplier α_i ≥ 0 is associated with each inequality constraint.
The β multipliers are not used, as no equality constraint is needed in the primal equation.
Dual optimization problem
Notice that the equality-constraint multipliers are not used in the dual optimization problem, as no equality constraint is imposed in the primal form.
Graphically: two examples of constrained optimization (with equalities)
[Figure: level curves of the objective f(x, y) against the constraint curve g(x, y); the two examples involve x² + y², the constraint g(x, y) = 1 and the constraint g(x, y) = c]
Dual Optimization problem
Notice that the equality-constraint multipliers are not used in the targeted optimization, as no equality constraint is imposed.
Transforming into the dual
The Lagrangian corresponding to our problem becomes:
    L(w, b, α) = (1/2)⟨w, w⟩ − Σ_{i=1..m} α_i [ y_i(⟨w, x_i⟩ + b) − 1 ],   α_i ≥ 0
In order to solve the dual problem we compute the partial derivatives of L and then impose them to be 0 with respect to w and b:
    ∂L/∂w = 0  ⇒  w = Σ_{i=1..m} α_i y_i x_i
    ∂L/∂b = 0  ⇒  Σ_{i=1..m} α_i y_i = 0
Dual Optimization problem
• The formulation depends only on the set of α variables and not on w and b
• It has a simpler form
• It makes explicit the individual contributions (α_i) of (a selected set of) examples (x_i)
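For reference, the resulting dual problem takes the standard form:
    maximize   W(α) = Σ_{i=1..m} α_i − (1/2) Σ_{i,j=1..m} α_i α_j y_i y_j ⟨x_i, x_j⟩
    subject to   α_i ≥ 0,  i = 1, ..., m,   and   Σ_{i=1..m} α_i y_i = 0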
Kuhn-Tucker Theorem
Necessary (and sufficient) conditions for the existence of the optimal solution are the following, including the Karush-Kuhn-Tucker (complementarity) constraint:
    α_i [ y_i(⟨x_i, w⟩ + b) − 1 ] = 0,   i = 1, ..., m
Some consequences
Lagrange constraints:
    Σ_{i=1..m} α_i y_i = 0        w = Σ_{i=1..m} α_i y_i x_i
Karush-Kuhn-Tucker constraints:
    α_i [ y_i(⟨x_i, w⟩ + b) − 1 ] = 0,   i = 1, ..., m
The support vectors are the x_i with non-null α_i, i.e. such that y_i(⟨x_i, w⟩ + b) = 1.
They lie on the frontier (the margin hyperplanes).
b is derived through the following formula: b = y_i − ⟨w, x_i⟩, for any support vector x_i.
Non linearly separable training data
[Figure: hyperplanes ⟨w, x⟩ + b = 1, ⟨w, x⟩ + b = 0 and ⟨w, x⟩ + b = −1 in the (Var1, Var2) plane, with misclassified points at distance ξ_i from the margin]
Slack variables ξ_i are introduced.
Mistakes are allowed and the optimization function is penalized.
Soft Margin SVMs
[Figure: separating hyperplane ⟨w, x⟩ + b = 0 with margin hyperplanes ⟨w, x⟩ + b = ±1 in the (Var1, Var2) plane and slack ξ_i for points violating the margin]
New constraints:
    y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i,   ξ_i ≥ 0
Objective function:
    min (1/2)||w||² + C Σ_i ξ_i
C is the trade-off between margin and errors.
Soft Margin Support Vector Machines
    min (1/2)||w||² + C Σ_i ξ_i
    subject to  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i,   ξ_i ≥ 0
The algorithm tries to keep ξ_i = 0 while maximizing the margin.
The algorithm minimizes the sum of the distances from the hyperplane, not the number of errors (minimizing the number of errors is an NP-complete problem).
If C → ∞, the solution tends to conform to the hard-margin solution.
Attention: if C = 0 then ||w|| = 0, since it is always possible to satisfy
    y_i · b ≥ 1 − ξ_i
by choosing the ξ_i large enough.
If C grows, the number of tolerated errors tends to be limited; for C → ∞ the number of errors goes to 0, exactly as in the hard-margin formulation.
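A brief illustration of the C trade-off, assuming scikit-learn is available (the toy data and the chosen values of C are arbitrary examples):

import numpy as np
from sklearn.svm import SVC

# toy, slightly overlapping 2-D data (assumed for illustration)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C: wide margin, many tolerated errors; large C: approaches the hard-margin solution
    print(C, clf.n_support_, clf.score(X, y))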
Robustness: Soft vs Hard Margin SVMs
[Figure: two panels in the (Var1, Var2) plane comparing the hyperplane ⟨w, x⟩ + b = 0 found by a Soft Margin SVM (with slack ξ_i on an outlier) and by a Hard Margin SVM]
Soft vs Hard Margin SVMs
A Soft-Margin SVM always has a solution.
A Soft-Margin SVM is more robust with respect to odd training examples, e.g.:
◦ insufficient representation (e.g. limited vocabularies)
◦ high ambiguity of (linguistic) features
A Hard-Margin SVM requires no parameter.
References
Basili, R. and A. Moschitti, "Automatic Text Categorization: From Information Retrieval to Support Vector Learning", Aracne Editrice, Informatica, ISBN 88-548-0292-1, 2005.
C. J. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition"
◦ URL: http://www.umiacs.umd.edu/~joseph/support-vector-machines4.pdf
E. D. Sontag, "The Vapnik-Chervonenkis Dimension and the Learning Capability of Neural Nets"
◦ URL: http://www.math.rutgers.edu/~sontag/FTP_DIR/vc-expo.pdf
S. A. Goldman (Washington University, St. Louis, Missouri), "Computational Learning Theory"
◦ URL: http://www.learningtheory.org/articles/COLTSurveyArticle.ps
N. Cristianini and J. Shawe-Taylor, "An Introduction to Support Vector Machines (and other kernel-based learning methods)", Cambridge University Press.
V. N. Vapnik, "The Nature of Statistical Learning Theory", Springer Verlag, December 1999.