
Support Vector Machines and Kernel Methods


Page 1: Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods

Kenan Gençol
Department of Electrical and Electronics Engineering

Anadolu University

submitted in the course

MAT592 Seminar

Advisor: Prof. Dr. Yalçın Küçük
Department of Mathematics

Page 2: Support Vector Machines and Kernel Methods

Agenda

Linear Discriminant Functions and Decision Hyperplanes

Introduction to SVM
Support Vector Machines
Introduction to Kernels
Nonlinear SVM
Kernel Methods

Page 3: Support Vector Machines and Kernel Methods

Linear Discriminant Functions and Decision Hyperplanes

Figure 1. Two classes of patterns and a linear decision function

Page 4: Support Vector Machines and Kernel Methods

Linear Discriminant Functions and Decision Hyperplanes

Each pattern is represented by a vector.

The linear decision function has the equation

$g(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_0 = 0$

where $w_1, w_2$ are the weights and $w_0$ is the bias term.

Page 5: Support Vector Machines and Kernel Methods

Linear Discriminant Functions and Decision Hyperplanes

The general decision hyperplane equation in d-dimensional space has the form

$\mathbf{w}^{T}\mathbf{x} + w_0 = 0$

where $\mathbf{w} = [w_1\; w_2\; \dots\; w_d]^{T}$ is the weight vector and $w_0$ is the bias term.
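
To make the decision rule concrete, here is a minimal Python sketch (not from the slides; the weight and input values are made up for illustration) that evaluates g(x) = w · x + w0 and assigns a class by the sign of the result:

    import numpy as np

    # Hypothetical weight vector and bias for a 2-D example
    w = np.array([2.0, -1.0])   # weights w1, w2
    w0 = 0.5                    # bias term

    def g(x):
        """Linear decision function g(x) = w . x + w0."""
        return np.dot(w, x) + w0

    x = np.array([1.0, 3.0])
    label = 1 if g(x) >= 0 else -1   # class assigned by the sign of g(x)
    print(g(x), label)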

Page 6: Support Vector Machines and Kernel Methods

Introduction to SVM

There are many hyperplanes that separate the two classes.

Figure 2. An example of two possible classifiers

Page 7: Support Vector Machines and Kernel Methods

Introduction to SVM

THE GOAL: Our goal is to search for the direction $\mathbf{w}$ and bias $w_0$ that give the maximum possible margin, or in other words, to orient this hyperplane in such a way as to be as far as possible from the closest members of both classes.

Page 8: Support Vector Machines and Kernel Methods

SVM: Linearly Separable Case

Figure 3. Hyperplane through two linearly separable classes

Page 9: Support Vector Machines and Kernel Methods

SVM: Linearly Separable Case

Our training data is of the form:

$\{\mathbf{x}_i, y_i\}, \quad i = 1, \dots, L, \quad y_i \in \{-1, +1\}, \quad \mathbf{x}_i \in \mathbb{R}^d$

This hyperplane can be described by

$\mathbf{w} \cdot \mathbf{x} + b = 0$

and is called the separating hyperplane.

Page 10: Support Vector Machines and Kernel Methods

SVM: Linearly Separable Case

Select variables $\mathbf{w}$ and $b$ so that:

$\mathbf{x}_i \cdot \mathbf{w} + b \geq +1 \quad \text{for } y_i = +1$
$\mathbf{x}_i \cdot \mathbf{w} + b \leq -1 \quad \text{for } y_i = -1$

These equations can be combined into:

$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \geq 0 \quad \forall i$
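
As a quick numerical illustration (toy values, not from the slides), the following sketch checks the combined constraint y_i(x_i · w + b) − 1 ≥ 0 for a candidate hyperplane:

    import numpy as np

    # Toy linearly separable data: two points per class (values are illustrative)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([+1, +1, -1, -1])

    # A candidate hyperplane (w, b)
    w = np.array([1.0, 1.0])
    b = 0.0

    margins = y * (X @ w + b) - 1.0          # y_i (x_i . w + b) - 1
    print(margins, np.all(margins >= 0))     # all >= 0 -> constraints satisfied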

Page 11: Support Vector Machines and Kernel Methods

SVM: Linearly Separable Case

The points that lie closest to the separating hyperplane are called support vectors (circled points in the diagram), and the hyperplanes

$H_1: \; \mathbf{x}_i \cdot \mathbf{w} + b = +1$
$H_2: \; \mathbf{x}_i \cdot \mathbf{w} + b = -1$

are called supporting hyperplanes.

Page 12: Support Vector Machines and Kernel Methods

SVM: Linearly Separable Case

Figure 3. Hyperplane through two linearly separable classes (repeated)

Page 13: Support Vector Machines and Kernel Methods

SVM: Linearly Separable Case

The hyperplane's equidistance from $H_1$ and $H_2$ means that $d_1 = d_2$, and this quantity is known as the SVM margin:

$d_1 + d_2 = \frac{2}{\|\mathbf{w}\|}$

$d_1 = d_2 = \frac{1}{\|\mathbf{w}\|}$

Page 14: Support Vector Machines and Kernel Methods

SVM: Linearly Separable Case

Maximizing the margin $\frac{1}{\|\mathbf{w}\|}$ is equivalent to minimizing $\|\mathbf{w}\|$:

$\min \|\mathbf{w}\| \quad \text{such that} \quad y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \geq 0$

Minimizing $\|\mathbf{w}\|$ is equivalent to minimizing $\frac{1}{2}\|\mathbf{w}\|^2$, which allows Quadratic Programming (QP) optimization to be performed.
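
The following is a minimal sketch of this primal QP using a general-purpose solver (scipy's SLSQP routine); the toy data are made up for illustration and this is not a production SVM trainer:

    import numpy as np
    from scipy.optimize import minimize

    # Toy linearly separable data (illustrative values)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])

    # Decision variables packed as z = [w1, w2, b]
    def objective(z):
        w = z[:2]
        return 0.5 * np.dot(w, w)             # (1/2) ||w||^2

    def constraint(z):
        w, b = z[:2], z[2]
        return y * (X @ w + b) - 1.0          # y_i (x_i . w + b) - 1 >= 0

    res = minimize(objective, x0=np.zeros(3), method="SLSQP",
                   constraints=[{"type": "ineq", "fun": constraint}])
    w_opt, b_opt = res.x[:2], res.x[2]
    print(w_opt, b_opt)                       # maximum-margin hyperplane parameters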

Page 15: Support Vector Machines and Kernel Methods

SVM: Linearly Separable Case

Optimization problem:

Minimize

$\frac{1}{2}\|\mathbf{w}\|^2$

subject to

$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \geq 0, \quad \forall i$

Page 16: Support Vector Machines and Kernel Methods

SVM: Linearly Separable Case

This is an inequality constrained optimization problem with Lagrangian function:

$L_P \equiv \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{L} \alpha_i \left[ y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \right] \qquad (1)$

where $\alpha_i \geq 0$, $i = 1, 2, \dots, L$, are Lagrange multipliers.

Page 17: Support Vector Machines and Kernel Methods

SVM

The corresponding KKT conditions are:

$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{L} \alpha_i y_i \mathbf{x}_i \qquad (2)$

$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{L} \alpha_i y_i = 0 \qquad (3)$

Page 18: Support Vector Machines and Kernel Methods

SVM

This is a convex optimization problem. The cost function is convex, and the constraints are linear and define a convex set of feasible solutions. Such problems can be solved by considering the so-called Lagrangian Duality.

Page 19: Support Vector Machines and Kernel Methods

SVM

Substituting (2) and (3) gives a new formulation which, being dependent on α, we need to maximize.

Page 20: Support Vector Machines and Kernel Methods

SVM

This is called the Dual form (Lagrangian Dual) of the primal form. The dual form only requires the dot product of each input vector to be calculated.

This is important for the Kernel Trick, which will be described later.

Page 21: Support Vector Machines and Kernel Methods

SVM

So the problem becomes a dual problem:

Maximize

$L_D \equiv \sum_{i=1}^{L} \alpha_i - \frac{1}{2}\,\boldsymbol{\alpha}^{T} H \boldsymbol{\alpha}, \quad \text{where } H_{ij} = y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j$

subject to

$\alpha_i \geq 0 \;\; \forall i, \qquad \sum_{i=1}^{L} \alpha_i y_i = 0$
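
Here is a small sketch (illustrative data, a general-purpose solver rather than a dedicated QP package) that maximizes the dual objective and then recovers w from equation (2):

    import numpy as np
    from scipy.optimize import minimize

    # Toy linearly separable data (illustrative values)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    L = len(y)

    H = (y[:, None] * y[None, :]) * (X @ X.T)       # H_ij = y_i y_j x_i . x_j

    def neg_dual(alpha):                            # maximize L_D = minimize -L_D
        return -(alpha.sum() - 0.5 * alpha @ H @ alpha)

    res = minimize(neg_dual, x0=np.zeros(L), method="SLSQP",
                   bounds=[(0, None)] * L,                             # alpha_i >= 0
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum alpha_i y_i = 0
    alpha = res.x
    w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i   (2)
    sv = alpha > 1e-6                               # support vectors have alpha_i > 0
    b = np.mean(y[sv] - X[sv] @ w)                  # b from the support vectors
    print(alpha, w, b)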

Page 22: Support Vector Machines and Kernel Methods

SVM

Differentiating with respect to the $\alpha_i$'s and using the constraint equation, a system of equations is obtained. Solving the system, the Lagrange multipliers are found and the optimum hyperplane is given according to formula (2):

$\mathbf{w} = \sum_{i=1}^{L} \alpha_i y_i \mathbf{x}_i$

Page 23: Support Vector Machines and Kernel Methods

SVM

Some Notes:

SUPPORT VECTORS are the feature vectors with $\alpha_i > 0$, $i = 1, 2, \dots, L$.

The cost function is strictly convex.

The Hessian matrix is positive definite.

Any local minimum is also global and unique.

The optimal hyperplane classifier of an SVM is UNIQUE.

Although the solution is unique, the resulting Lagrange multipliers are not unique.

Page 24: Support Vector Machines and Kernel Methods

Kernels: Introduction

When applying our SVM to linearly separable data we started by creating a matrix $H$ from the dot product of our input variables, with

$k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j = \mathbf{x}_i^{T} \mathbf{x}_j$

being known as the Linear Kernel, an example of a family of functions called kernel functions.
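
A short sketch (toy values) of how the linear-kernel Gram matrix and the matrix H used in the dual problem are assembled:

    import numpy as np

    # Toy data (illustrative values)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0]])
    y = np.array([+1.0, +1.0, -1.0])

    def linear_kernel(xi, xj):
        """Linear kernel k(x_i, x_j) = x_i . x_j."""
        return np.dot(xi, xj)

    # Gram matrix of the linear kernel and the matrix H from the dual problem
    K = np.array([[linear_kernel(xi, xj) for xj in X] for xi in X])
    H = (y[:, None] * y[None, :]) * K          # H_ij = y_i y_j k(x_i, x_j)
    print(K)
    print(H)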

Page 25: Support Vector Machines and Kernel Methods

Kernels: Introduction

The kernel functions are all based on calculating inner products of two vectors.

This means that if the inputs are mapped to a higher-dimensional space by a nonlinear mapping function

$\Phi: \mathbf{x} \mapsto \Phi(\mathbf{x})$

only the inner products of the mapped inputs need to be determined, without needing to explicitly calculate $\Phi$.

This is called the "Kernel Trick".
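
A small numerical check, not part of the slides, illustrates the trick for the homogeneous quadratic kernel (x · z)^2 in two dimensions: the kernel value equals the inner product of an explicit feature map Φ, so Φ itself never has to be evaluated in practice.

    import numpy as np

    def phi(v):
        """Explicit feature map for the 2-D quadratic kernel (x . z)^2."""
        x1, x2 = v
        return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 0.5])

    lhs = np.dot(x, z) ** 2                 # kernel evaluated in the input space
    rhs = np.dot(phi(x), phi(z))            # inner product in the feature space
    print(lhs, rhs, np.isclose(lhs, rhs))   # identical up to rounding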

Page 26: Support Vector Machines and Kernel Methods

Kernels: Introduction

The Kernel Trick is useful because there are many classification/regression problems that are not fully separable/regressable in the input space, but are separable/regressable in a higher-dimensional space:

$\Phi: \mathbb{R}^d \rightarrow H$

$\mathbf{x}_i \cdot \mathbf{x}_j \;\rightarrow\; \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle$

Page 27: Support Vector Machines and Kernel Methods

Kernels: Introduction

Popular Kernel Families:

Radial Basis Function (RBF) Kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$

Polynomial Kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + a)^{b}$

Sigmoidal (Hyperbolic Tangent) Kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = \tanh(a\, \mathbf{x}_i \cdot \mathbf{x}_j - b)$
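
Straightforward implementations of these three families are sketched below; the parameter names and default values (sigma, a, b) are illustrative choices, not values from the slides:

    import numpy as np

    def rbf_kernel(xi, xj, sigma=1.0):
        """Radial Basis Function kernel."""
        diff = np.asarray(xi) - np.asarray(xj)
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

    def polynomial_kernel(xi, xj, a=1.0, b=3):
        """Polynomial kernel (x_i . x_j + a)^b."""
        return (np.dot(xi, xj) + a) ** b

    def sigmoid_kernel(xi, xj, a=1.0, b=0.0):
        """Sigmoidal (hyperbolic tangent) kernel."""
        return np.tanh(a * np.dot(xi, xj) - b)

    x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(rbf_kernel(x, z), polynomial_kernel(x, z), sigmoid_kernel(x, z))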

Page 28: Support Vector Machines and Kernel Methods

Nonlinear Support Vector Machines

The support vector machine with kernel functions becomes:

$L_D \equiv \sum_{i=1}^{L} \alpha_i - \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)$

and the resulting classifier:

$f(\mathbf{x}) = \operatorname{sgn}\!\left(\sum_{i=1}^{L} \alpha_i y_i\, k(\mathbf{x}_i, \mathbf{x}) + b\right)$
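
A minimal sketch of the resulting classifier, assuming the multipliers α_i and bias b have already been obtained from the kernelized dual problem (the numbers below are illustrative):

    import numpy as np

    def rbf_kernel(xi, xj, sigma=1.0):
        diff = xi - xj
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

    # Training points, labels, multipliers and bias are assumed to come from the
    # kernelized dual problem; the values here are illustrative only.
    X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0]])
    y = np.array([+1.0, +1.0, -1.0])
    alpha = np.array([0.5, 0.0, 0.5])
    b = 0.0

    def classify(x_new):
        """f(x) = sgn( sum_i alpha_i y_i k(x_i, x) + b )"""
        s = sum(a * yi * rbf_kernel(xi, x_new) for a, yi, xi in zip(alpha, y, X)) + b
        return 1 if s >= 0 else -1

    print(classify(np.array([1.5, 0.5])))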

Page 29: Support Vector Machines and Kernel Methods

Nonlinear Support Vector Machines

Figure 4. The SVM architecture employing kernel functions.

Page 30: Support Vector Machines and Kernel Methods

Kernel Methods

Recall that a kernel function computes the inner product of the images under an embedding $\Phi$ of two data points:

$k: X \times X \rightarrow \mathbb{R}, \qquad k(\mathbf{x}, \mathbf{z}) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{z}) \rangle$

$k$ is a kernel if

1. $k$ is symmetric: $k(x, y) = k(y, x)$

2. $k$ is positive semi-definite, i.e., the "Gram Matrix" $K_{ij} = k(x_i, x_j)$ is positive semi-definite.
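
A quick numerical check (illustrative only) that a Gram matrix built from a valid kernel, here the RBF kernel, is symmetric and positive semi-definite, using its eigenvalues:

    import numpy as np

    def rbf_kernel(xi, xj, sigma=1.0):
        diff = xi - xj
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))                      # 20 random 3-D points

    K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
    eigvals = np.linalg.eigvalsh(K)                   # eigenvalues of the symmetric K
    print(np.allclose(K, K.T), eigvals.min() >= -1e-10)  # symmetric and PSD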

Page 31: Support Vector Machines and Kernel Methods

Kernel Methods

The answer to the question of for which kernels there exists a pair $\{H, \varphi\}$ with the properties described above, and for which there does not, is given by Mercer's condition.

Page 32: Support Vector Machines and Kernel Methods

Mercer's condition

Let $X$ be a compact subset of $\mathbb{R}^n$ and let $\Phi$ be a mapping

$\Phi: \mathbf{x} \in X \rightarrow \Phi(\mathbf{x}) \in H$

where $H$ is a Euclidean space. Then the inner product operation has an equivalent representation

$k(\mathbf{x}, \mathbf{z}) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{z}) \rangle = \sum_{r} \phi_r(\mathbf{x})\,\phi_r(\mathbf{z})$

and $k(\mathbf{x}, \mathbf{z})$ is a symmetric function satisfying the following condition

$\int_{X}\!\int_{X} k(\mathbf{x}, \mathbf{z})\, g(\mathbf{x})\, g(\mathbf{z})\, d\mathbf{x}\, d\mathbf{z} \;\geq\; 0$

for any $g(\mathbf{x})$, $\mathbf{x} \in X$, such that

$\int_{X} g(\mathbf{x})^2\, d\mathbf{x} < +\infty$

Page 33: Support Vector Machines and Kernel Methods

Mercer's Theorem

Theorem. Suppose $K$ is a continuous symmetric non-negative definite kernel. Then there is an orthonormal basis $\{e_i\}_i$ of $L^2[a, b]$ consisting of eigenfunctions of $T_K$,

$[T_K \varphi](x) = \int_a^b K(x, s)\, \varphi(s)\, ds,$

such that the corresponding sequence of eigenvalues $\{\lambda_i\}_i$ is nonnegative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on $[a, b]$ and $K$ has the representation

$K(s, t) = \sum_{j=1}^{\infty} \lambda_j\, e_j(s)\, e_j(t)$

where the convergence is absolute and uniform.
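
As a loose numerical analogue (not part of the slides), the eigendecomposition of a kernel matrix sampled on a grid plays the role of the eigenfunction expansion: rebuilding K from its eigenvalues and eigenvectors recovers the original matrix.

    import numpy as np

    # Sample a continuous kernel K(s, t) = exp(-(s - t)^2) on a grid of [a, b] = [0, 1]
    s = np.linspace(0.0, 1.0, 50)
    K = np.exp(-(s[:, None] - s[None, :]) ** 2)

    # Discrete analogue of the Mercer expansion: K = sum_j lambda_j e_j e_j^T
    eigvals, eigvecs = np.linalg.eigh(K)
    K_rebuilt = (eigvecs * eigvals) @ eigvecs.T
    print(np.allclose(K, K_rebuilt), eigvals.min() >= -1e-10)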

Page 34: Support Vector Machines and Kernel Methods

Kernel Methods

Suppose $k_1$ and $k_2$ are valid (symmetric, positive definite) kernels on $X$. Then the following are valid kernels:

1. $k(\mathbf{x}, \mathbf{z}) = k_1(\mathbf{x}, \mathbf{z}) + k_2(\mathbf{x}, \mathbf{z})$

2. $k(\mathbf{x}, \mathbf{z}) = a\, k_1(\mathbf{x}, \mathbf{z}), \quad a > 0$

3. $k(\mathbf{x}, \mathbf{z}) = k_1(\mathbf{x}, \mathbf{z})\, k_2(\mathbf{x}, \mathbf{z})$

Page 35: Support Vector Machines and Kernel Methods

Kernel Methods

4. $k(\mathbf{x}, \mathbf{z}) = f(\mathbf{x})\, f(\mathbf{z})$, for any real-valued function $f$ on $X$

5. $k(\mathbf{x}, \mathbf{z}) = k_3\!\left(\Phi(\mathbf{x}), \Phi(\mathbf{z})\right)$, for a valid kernel $k_3$ and mapping $\Phi$

6. $k(\mathbf{x}, \mathbf{z}) = \mathbf{x}^{T} B\, \mathbf{z}$, for a symmetric positive semi-definite matrix $B$

7. $k(\mathbf{x}, \mathbf{z}) = \exp\!\left(k_1(\mathbf{x}, \mathbf{z})\right)$
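
A small sanity check (illustrative; based on the standard closure properties listed above) that the sum, a positive multiple, and the elementwise (Schur) product of valid Gram matrices remain positive semi-definite:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(15, 2))

    # Gram matrices of two valid kernels: linear and RBF
    K1 = X @ X.T
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K2 = np.exp(-sq / 2.0)

    def is_psd(K, tol=1e-10):
        return np.linalg.eigvalsh(K).min() >= -tol

    print(is_psd(K1 + K2))      # property 1: sum of kernels
    print(is_psd(3.0 * K1))     # property 2: positive scaling
    print(is_psd(K1 * K2))      # property 3: elementwise (Schur) product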

Page 36: Support Vector Machines and Kernel Methods

References

[1] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery 2, 121-167, 1998.

[2] J. P. Marques de Sá, "Pattern Recognition: Concepts, Methods and Applications", Springer, 2001.

[3] S. Theodoridis, "Pattern Recognition", Elsevier Academic Press, 2003.

Page 37: Support Vector Machines and Kernel Methods

References

[4] T. Fletcher, "Support Vector Machines Explained", UCL, March 2005.

[5] N. Cristianini, J. Shawe-Taylor, "Kernel Methods for Pattern Analysis", Cambridge University Press, 2004.

[6] "Mercer's Theorem", Wikipedia: http://en.wikipedia.org/wiki/Mercer's_theorem

Page 38: Support Vector Machines and Kernel Methods

Thank You