
2013-1 Machine Learning Lecture 05 - Vasant Honavar - Support Vector Machi…


Copyright Vasant Honavar, 2006.

Iowa State University, Department of Computer Science, Artificial Intelligence Research Laboratory

MACHINE LEARNING

Vasant Honavar
Artificial Intelligence Research Laboratory
Department of Computer Science
Bioinformatics and Computational Biology Program
Center for Computational Intelligence, Learning, & Discovery
Iowa State University
[email protected]
www.cs.iastate.edu/~honavar/
www.cild.iastate.edu/


Maximum margin classifiers

• Discriminative classifiers that maximize the margin of separation

• Support Vector Machines
  – Feature spaces
  – Kernel machines
  – VC Theory and generalization bounds
  – Maximum margin classifiers

• Comparisons with conditionally trained discriminative classifiers


Perceptrons revisited

Perceptrons:
• Can only compute threshold functions
• Can only represent linear decision surfaces
• Can only learn to classify linearly separable training data

How can we deal with non-linearly separable data?
• Map the data into a (typically higher-dimensional) feature space where the classes become separable

Two problems must be solved:
• The computational problem of working with a high-dimensional feature space
• The overfitting problem in high-dimensional feature spaces


Maximum margin model

We cannot outperform the Bayes optimal classifier when:
• The generative model assumed is correct
• The data set is large enough to ensure reliable estimation of the parameters of the model

But discriminative models may be better than generative models in practice because:
• The correct generative model is seldom known
• The data set is often simply not large enough

Maximum margin classifiers are a kind of discriminative classifier designed to circumvent the overfitting problem.


Extending Linear Classifiers – Feature spaces

Map data into a feature space where they are linearly separable

φ: X → Φ,  x ↦ φ(x)


Linear Separability in Feature spaces

The original input space can always be mapped to some higher-dimensional feature space where the training data become separable:

Φ: x→ φ(x)


Learning in the Feature Spaces

High-dimensional feature spaces

x = (x₁, …, xₙ) → φ(x) = (φ₁(x), …, φ_d(x))

where typically d >> n, solve the problem of expressing complex functions.

But this introduces a
• computational problem (working with very large vectors)
• generalization problem (curse of dimensionality)

SVMs offer an elegant solution to both problems.


The Perceptron Algorithm Revisited

• The perceptron works by adding misclassified positive examples to, or subtracting misclassified negative examples from, an initial weight vector, which (without loss of generality) we assume to be the zero vector.

• The final weight vector is a linear combination of training points

w = Σ_{i=1}^{l} α_i y_i x_i

where, since the sign of the coefficient of x_i is given by the label y_i, the α_i are positive values, proportional to the number of times misclassification of x_i has caused the weight to be updated. α_i is called the embedding strength of the pattern x_i.


Dual Representation

The decision function can be rewritten as:

h(x) = sgn(⟨w, x⟩ + b) = sgn( Σ_{j=1}^{l} α_j y_j ⟨x_j, x⟩ + b )

The update rule is: on training example (x_i, y_i),

if  y_i ( Σ_{j=1}^{l} α_j y_j ⟨x_j, x_i⟩ + b ) ≤ 0  then  α_i ← α_i + η

WLOG, we can take η = 1.


Implications of Dual Representation

When Linear Learning Machines are represented in the dual form

Data appear only inside dot products (in decision function and in training algorithm)

The matrix of pairwise dot products between training samples,

G = ( ⟨x_i, x_j⟩ )_{i,j=1}^{l}

is called the Gram matrix. The decision function depends on the data only through its entries:

h(x_i) = sgn(⟨w, x_i⟩ + b) = sgn( Σ_{j=1}^{l} α_j y_j ⟨x_j, x_i⟩ + b )


Implicit Mapping to Feature Space

Kernel Machines
• Solve the computational problem of working with many dimensions
• Can make it possible to use infinite dimensions efficiently
• Offer other advantages, both practical and conceptual


Kernel-Induced Feature Spaces

φ: X → Φ

f(x) = ⟨w, φ(x)⟩ + b
h(x) = sgn(f(x))

where φ is a non-linear map from input space to feature space.

In the dual representation, the data points only appear inside dot products:

h(x_i) = sgn(⟨w, φ(x_i)⟩ + b) = sgn( Σ_{j=1}^{l} α_j y_j ⟨φ(x_j), φ(x_i)⟩ + b )


Kernels

• Kernel function returns the value of the dot product between the images of the two arguments

• When using kernels, the dimensionality of the feature space Φ is not necessarily important because of the special properties of kernel functions

• Given a function K, it is possible to verify that it is a kernel (we will return to this later)

K(x₁, x₂) = ⟨φ(x₁), φ(x₂)⟩


Kernel Machines

We can use perceptron learning algorithm in the feature space by taking its dual representation and replacing dot products with kernels:

⟨x₁, x₂⟩ ← K(x₁, x₂) = ⟨φ(x₁), φ(x₂)⟩
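A minimal sketch of the dual (kernel) perceptron obtained by this substitution, in pure Python. The choice of a quadratic kernel and the XOR-style toy data are illustrative assumptions, not part of the slides:

```python
# Dual (kernel) perceptron: the weight vector w is never formed explicitly;
# training and prediction touch the data only through kernel evaluations.

def k_quad(u, v):
    # polynomial kernel <u, v>^2 (implicit 3-D feature space for 2-D inputs)
    return sum(a * b for a, b in zip(u, v)) ** 2

def train_kernel_perceptron(X, y, K, epochs=200):
    l = len(X)
    alpha, b = [0.0] * l, 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(l):
            s = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(l)) + b
            if y[i] * s <= 0:            # misclassified: alpha_i <- alpha_i + 1
                alpha[i] += 1.0
                b += y[i]
                mistakes += 1
        if mistakes == 0:                # converged on the training set
            break
    return alpha, b

def predict(x, X, y, alpha, b, K):
    s = sum(alpha[j] * y[j] * K(X[j], x) for j in range(len(X))) + b
    return 1 if s > 0 else -1

# XOR-style data: not linearly separable in the input space,
# but separable in the feature space induced by k_quad
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y = [-1, -1, 1, 1]
alpha, b = train_kernel_perceptron(X, y, k_quad)
```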


The Kernel Matrix

K =
⎡ K(1,1)  K(1,2)  K(1,3)  …  K(1,l) ⎤
⎢ K(2,1)  K(2,2)  K(2,3)  …  K(2,l) ⎥
⎢   …       …       …     …    …    ⎥
⎣ K(l,1)  K(l,2)  K(l,3)  …  K(l,l) ⎦

The kernel matrix is the Gram matrix in the feature space Φ (the matrix of pairwise dot products between the feature vectors corresponding to the training samples).


Properties of Kernel Matrices

It is easy to show that the Gram matrix (and hence the kernel matrix) is
• A square matrix
• Symmetric (Kᵀ = K)
• Positive semi-definite (all eigenvalues of K are non-negative; recall that the eigenvalues of a square matrix A are the values of λ that satisfy |A − λI| = 0)

Conversely, any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some feature space Φ:

K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩
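These properties are easy to check numerically. The sketch below builds a Gram matrix from explicit feature vectors and verifies symmetry plus non-negativity of the quadratic form zᵀKz (equivalent to positive semi-definiteness); the feature vectors and test directions are illustrative:

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Gram matrix of some explicit feature vectors phi(x_i)
phis = [(1.0, 2.0), (0.5, -1.0), (3.0, 0.0)]
K = [[dot(p, q) for q in phis] for p in phis]

# Symmetric: K^T = K
assert all(K[i][j] == K[j][i] for i in range(3) for j in range(3))

# Positive semi-definite: z^T K z = ||sum_i z_i phi(x_i)||^2 >= 0 for any z
random.seed(0)
for _ in range(100):
    z = [random.uniform(-1, 1) for _ in range(3)]
    quad = sum(z[i] * K[i][j] * z[j] for i in range(3) for j in range(3))
    assert quad >= -1e-9   # non-negative up to floating-point error
```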


Mercer’s Theorem: Characterization of Kernel Functions

A function K: X × X → ℜ is said to be (finitely) positive semi-definite if
• K is a symmetric function: K(x,y) = K(y,x)
• Matrices formed by restricting K to any finite subset of the space X are positive semi-definite

Characterization of kernel functions: every (finitely) positive semi-definite, symmetric function is a kernel, i.e. there exists a mapping φ such that it is possible to write:

K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩


Properties of Kernel Functions – Mercer's Theorem
The eigenvalues of the Gram matrix define an expansion of the kernel:

K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ = Σ_l λ_l φ_l(x_i) φ_l(x_j)

That is, the eigenvalues act as features!

Recall that the eigenvalues of a square matrix A correspond to the solutions of |A − λI| = 0, where I is the identity matrix.


Examples of Kernels

Simple examples of kernels:

K(x, z) = ⟨x, z⟩^d                  (polynomial kernel)

K(x, z) = e^( −‖x − z‖² / (2σ²) )   (Gaussian kernel)


Example: Polynomial Kernels

x = (x₁, x₂);  z = (z₁, z₂);

⟨x, z⟩² = (x₁z₁ + x₂z₂)²
        = x₁²z₁² + 2 x₁z₁x₂z₂ + x₂²z₂²
        = ⟨ (x₁², x₂², √2 x₁x₂), (z₁², z₂², √2 z₁z₂) ⟩
        = ⟨φ(x), φ(z)⟩
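The identity above is easy to confirm numerically; a sketch, with arbitrary sample points:

```python
import math

def k_poly2(x, z):
    # kernel form: <x, z>^2
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    # explicit feature map (x1^2, x2^2, sqrt(2) x1 x2)
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, z = (3.0, 1.0), (2.0, -1.0)
lhs = k_poly2(x, z)                               # kernel evaluation
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # <phi(x), phi(z)>
assert abs(lhs - rhs) < 1e-9
```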


Making Kernels – Closure Properties
The set of kernels is closed under certain operations. If K₁, K₂ are kernels over X × X, then the following are also kernels:

• K(x, z) = K₁(x, z) + K₂(x, z)
• K(x, z) = a K₁(x, z), where a > 0
• K(x, z) = K₁(x, z) K₂(x, z)
• K(x, z) = f(x) f(z), where f: X → ℜ
• K(x, z) = K₃(φ(x), φ(z)), where φ: X → ℜᴺ and K₃ is a kernel over ℜᴺ × ℜᴺ
• K(x, z) = xᵀ B z, where X ⊆ ℜⁿ and B is a symmetric positive semi-definite n × n matrix

We can make complex kernels from simple ones: modularity!


Kernels

• We can define kernels over arbitrary instance spaces, including:
  – Finite-dimensional vector spaces
  – Boolean spaces
  – ∑* where ∑ is a finite alphabet
  – Documents, graphs, etc.
• Kernels need not always be expressed by a closed-form formula.
• Many useful kernels can be computed by complex algorithms (e.g., diffusion kernels over graphs).


Kernels on Sets and Multi-Sets

Let V be some fixed finite domain and let X ⊆ 2^V (sets over V). For A₁, A₂ ∈ X, let

K(A₁, A₂) = 2^|A₁ ∩ A₂|

Then K(A₁, A₂) is a kernel.

Exercise: Define a Kernel on the space of multi-sets whose elements are drawn from a finite domain V
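A sketch verifying that K(A₁, A₂) = 2^|A₁ ∩ A₂| matches an explicit inner product: index the features by all subsets U ⊆ V, with φ_U(A) = 1 iff U ⊆ A, so the inner product counts subsets contained in both arguments. The domain V and the two sets used are illustrative:

```python
from itertools import chain, combinations

def subsets(V):
    # all subsets of V, as frozensets
    s = sorted(V)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

def k_set(A1, A2):
    return 2 ** len(A1 & A2)

V = {"a", "b", "c", "d"}
A1, A2 = frozenset({"a", "b", "c"}), frozenset({"b", "c", "d"})

# explicit inner product: count subsets U with U ⊆ A1 and U ⊆ A2,
# i.e. the subsets of A1 ∩ A2
inner = sum(1 for U in subsets(V) if U <= A1 and U <= A2)
assert inner == k_set(A1, A2)   # |A1 ∩ A2| = 2, so 2^2 = 4
```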


String Kernel (p-spectrum Kernel)

• The p-spectrum of a string is the histogram – vector of number of occurrences of all possible contiguous substrings – of length p

• We can define a kernel function K(s, t) over ∑* × ∑* as the inner product of the p-spectra of s and t.

Example: s = "statistics", t = "computation", p = 3.
Common substrings: "tat", "ati", so K(s, t) = 2.
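A minimal sketch of the p-spectrum kernel; running it on the example above reproduces K("statistics", "computation") = 2 for p = 3:

```python
from collections import Counter

def p_spectrum(s, p):
    # histogram of all contiguous substrings of length p
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def k_spectrum(s, t, p):
    hs, ht = p_spectrum(s, p), p_spectrum(t, p)
    # inner product of the two histograms (missing keys count as 0)
    return sum(hs[u] * ht[u] for u in hs)

assert k_spectrum("statistics", "computation", 3) == 2  # "tat" and "ati"
```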


Kernel Machines

Kernel machines are linear learning machines that:
• Use a dual representation
• Operate in a kernel-induced feature space (that is, compute a linear function in the feature space implicitly defined by the Gram matrix K corresponding to the data set):

h(x_i) = sgn(⟨w, φ(x_i)⟩ + b) = sgn( Σ_{j=1}^{l} α_j y_j ⟨φ(x_j), φ(x_i)⟩ + b )


Kernels – the good, the bad, and the ugly

Bad kernel – A kernel whose Gram (kernel) matrix is mostly diagonal -- all data points are orthogonal to each other, and hence the machine is unable to detect hidden structure in the data

[Example: a mostly diagonal kernel matrix, approximately the identity matrix.]


Kernels – the good, the bad, and the ugly

Good kernel – Corresponds to a Gram (kernel) matrix in which subsets of data points belonging to the same class are similar to each other, and hence the machine can detect hidden structure in the data

[Example: a block-structured kernel matrix in which entries within Class 1 and within Class 2 are large, while entries across the two classes are small.]


Kernels – the good, the bad, and the ugly

• If the map takes the data into a space with too many irrelevant features, the kernel matrix becomes nearly diagonal
• We therefore need some prior knowledge of the target to choose a good kernel


Learning in the Feature Spaces

High-dimensional feature spaces

x = (x₁, …, xₙ) → φ(x) = (φ₁(x), …, φ_d(x))

where typically d >> n, solve the problem of expressing complex functions.

But this introduces a
• computational problem (working with very large vectors) – solved by the kernel trick: implicit computation of dot products in kernel-induced feature spaces via dot products in the input space
• generalization problem (curse of dimensionality)


The Generalization Problem

• The curse of dimensionality: it is easy to overfit in high-dimensional spaces
• The learning problem is ill-posed: there are infinitely many hyperplanes that separate the training data
• We need a principled approach to choose an optimal hyperplane


The Generalization Problem

“Capacity” of the machine – ability to learn any training set without error – related to VC dimension

Excellent memory is not an asset when it comes to learning from limited data

“A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree”

C. Burges


History of Key Developments Leading to SVM

1958 Perceptron (Rosenblatt)
1963 Margin (Vapnik)
1964 Kernel trick (Aizerman)
1965 Optimization formulation (Mangasarian)
1971 Kernels (Wahba)
1992–1994 SVM (Vapnik)
1996–present Rapid growth, numerous applications, extensions to other problems


A Little Learning Theory
Suppose:
• We are given l training examples (x_i, y_i).
• Train and test points are drawn randomly (i.i.d.) from some unknown probability distribution D(x, y).
• The machine learns the mapping x_i → y_i and outputs a hypothesis h(x, α, b). A particular choice of (α, b) generates a "trained machine".
• The expectation of the test error, or expected risk, is

R(α, b) = ∫ ½ |y − h(x, α, b)| dD(x, y)


A Bound on the Generalization Performance
The empirical risk is:

R_emp(α, b) = (1/2l) Σ_{i=1}^{l} |y_i − h(x_i, α, b)|

Choose some δ such that 0 < δ < 1. With probability 1 − δ, the following bound – the risk bound of h(x, α) under distribution D – holds (Vapnik, 1995):

R(α, b) ≤ R_emp(α, b) + √( ( d (log(2l/d) + 1) − log(δ/4) ) / l )

where d ≥ 0 is called the VC dimension, a measure of the "capacity" of the machine.
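To get a feel for the bound, the VC confidence term can be evaluated numerically. A sketch; the particular values of d, l, and δ are illustrative:

```python
import math

def vc_confidence(d, l, delta):
    # sqrt( (d (log(2l/d) + 1) - log(delta/4)) / l )
    return math.sqrt((d * (math.log(2 * l / d) + 1) - math.log(delta / 4)) / l)

# the confidence term shrinks as l grows and grows with d
conf_many = vc_confidence(d=10, l=100000, delta=0.05)
conf_few = vc_confidence(d=10, l=1000, delta=0.05)
assert conf_many < conf_few
assert vc_confidence(d=100, l=1000, delta=0.05) > conf_few
```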


A Bound on the Generalization Performance

The second term on the right-hand side is called the VC confidence.

Three key points about the actual risk bound:
• It is independent of D(x, y)
• It is usually not possible to compute the left-hand side
• If we know d, we can compute the right-hand side

The risk bound gives us a way to compare learning machines!


The VC Dimension

Definition: the VC dimension of a set of functions H = { h(x, α, b) } is d if and only if there exists a set of d data points such that each of the 2^d possible labelings (dichotomies) of those d data points can be realized using some member of H, but no set of q > d data points has this property.


The VC Dimension

The VC dimension of H is the size of the largest subset of X shattered by H. The VC dimension measures the capacity of a set H of hypotheses (functions).

If for any number N it is possible to find N points x₁, …, x_N that can be separated in all 2^N possible ways, we say that the VC dimension of the set is infinite.


The VC Dimension Example

• Suppose that the data live in the space R², and the set {h(x, α)} consists of oriented straight lines (linear discriminants).
• It is possible to find three points that can be shattered by this set of functions.
• It is not possible to find four.
• So the VC dimension of the set of linear discriminants in R² is three.


The VC Dimension of Hyperplanes

Theorem 1: Consider some set of m points in Rⁿ. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.

Corollary: The VC dimension of the set of oriented hyperplanes in Rⁿ is n+1, since we can always choose n+1 points, and then choose one of the points as origin, such that the position vectors of the remaining n points are linearly independent, but we can never choose n+2 such points.


VC Dimension

The VC dimension can be infinite even when the number of parameters of the set of hypothesis functions is low.

Example: consider h(x, α) ≡ sgn(sin(αx)), with x, α ∈ R. For any integer l and any labels y₁, …, y_l with y_i ∈ {−1, 1}, we can find l points x₁, …, x_l and a parameter α such that those points are shattered by h(x, α). Those points are:

x_i = 10^{−i}, i = 1, …, l

and the parameter α is:

α = π ( 1 + Σ_{i=1}^{l} (1 − y_i) 10^i / 2 )
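This construction can be checked directly. The sketch below enumerates all 2³ labelings of three points x_i = 10⁻ⁱ and verifies that the stated α realizes each one:

```python
import math
from itertools import product

def h(x, alpha):
    # the one-parameter hypothesis sgn(sin(alpha * x))
    return 1 if math.sin(alpha * x) > 0 else -1

l = 3
xs = [10.0 ** (-i) for i in range(1, l + 1)]

for ys in product([-1, 1], repeat=l):
    # alpha = pi * (1 + sum_i (1 - y_i) 10^i / 2)
    alpha = math.pi * (1 + sum((1 - yv) * 10 ** i / 2
                               for i, yv in zip(range(1, l + 1), ys)))
    # every labeling of the three points is realized
    assert all(h(x, yv_alpha := alpha) == yv for x, yv in zip(xs, ys))
```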


Vapnik-Chervonenkis (VC) Dimension

• Let H be a hypothesis class over an instance space X.
• Both H and X may be infinite.
• We need a way to describe the behavior of H on a finite set of points S ⊆ X.
• For any concept class H over X, and any S = {X₁, X₂, …, X_m} ⊆ X,

Π_H(S) = { (h(X₁), …, h(X_m)) : h ∈ H }

• Equivalently, with a little abuse of notation, we can write Π_H(S) = { h ∩ S : h ∈ H }

Π_H(S) is the set of all dichotomies or behaviors on S that are induced or realized by H.


Vapnik-Chervonenkis (VC) Dimension

• If Π_H(S) = {0,1}^m, where |S| = m – or equivalently, |Π_H(S)| = 2^m – we say that S is shattered by H.
• A set S of instances is said to be shattered by a hypothesis class H if and only if for every dichotomy of S there exists a hypothesis in H that is consistent with the dichotomy.


VC Dimension of a hypothesis class

Definition: The VC dimension V(H) of a hypothesis class H defined over an instance space X is the cardinality d of the largest subset of X that is shattered by H. If arbitrarily large finite subsets of X can be shattered by H, V(H) = ∞.

How can we show that V(H) is at least d?
• Find a set of cardinality at least d that is shattered by H.

How can we show that V(H) = d?
• Show that V(H) is at least d and that no set of cardinality d+1 can be shattered by H.


VC Dimension of a Hypothesis Class - Examples

• Example: Let the instance space X be the 2-dimensional Euclidean space. Let the hypothesis space H be the set of axis-parallel rectangles in the plane.

• V(H)=4 (there exists a set of 4 points that can be shattered by a set of axis parallel rectangles but no set of 5 points can be shattered by H).


Some Useful Properties of VC Dimension

• If H is a finite concept class, then V(H) ≤ lg |H|.
• If H₁ ⊆ H₂, then V(H₁) ≤ V(H₂).
• If H₂ = { X − h : h ∈ H₁ } (the complements of the concepts in H₁), then V(H₁) = V(H₂).
• If H is formed by a union or intersection of l concepts from H₁ with V(H₁) = d, then V(H) = O(d l lg l).
• Let Π_H(m) = max { |Π_H(S)| : |S| = m }. Then Π_H(m) ≤ Φ_d(m), where Φ_d(m) = 2^m if m ≤ d, and Φ_d(m) = O(m^d) if m > d.

Proof: left as an exercise.


Risk Bound

What are the VC dimension and empirical risk of the nearest-neighbor classifier?

Any number of points, labeled arbitrarily, can be shattered by a nearest-neighbor classifier, so d = ∞ and the empirical risk is 0. The bound therefore provides no useful information in this case.


Minimizing the Bound by Minimizing d

• The VC confidence (second term) depends on d/l
• One should choose the learning machine with minimal d
• For large values of d/l the bound is not tight

With δ = 0.05:

R(α) ≤ R_emp(α) + √( ( d (log(2l/d) + 1) − log(δ/4) ) / l )


Bounds on Error of Classification

The classification error
• depends on the VC dimension of the hypothesis class
• is independent of the dimensionality of the feature space

Vapnik proved that the error ε of a classification function h for separable data sets is

ε = O(d/l)

where d is the VC dimension of the hypothesis class and l is the number of training examples.


Structural Risk Minimization
Finding a learning machine with the minimum upper bound on the actual risk
• leads us to a method of choosing an optimal machine for a given task
• is the essential idea of structural risk minimization (SRM).

Let H₁ ⊂ H₂ ⊂ H₃ ⊂ … be a sequence of nested subsets of hypotheses whose VC dimensions satisfy d₁ < d₂ < d₃ < …
• SRM consists of finding that subset of functions which minimizes the upper bound on the actual risk.
• SRM involves training a series of classifiers and choosing the classifier from the series for which the sum of empirical risk and VC confidence is minimal.


Margin Based Bounds on Error of Classification

Margin-based bound:

ε = O( (1/l) (L/γ)² )

where

f(x) = ⟨w, x⟩ + b,   γ = min_i y_i f(x_i) / ‖w‖,   L = max_i ‖x_i‖

Important insight: the error of a classifier trained on a separable data set is inversely proportional to its margin, and is independent of the dimensionality of the input space!


Maximal Margin Classifier

The bounds on the error of classification suggest the possibility of improving generalization by maximizing the margin.
Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space.
SVMs control capacity by increasing the margin, not by reducing the number of features.


Margin

Linear separation of the input space:

f(x) = ⟨w, x⟩ + b
h(x) = sign(f(x))

[Figure: a hyperplane with weight vector w and offset b/‖w‖ linearly separating positive examples (+1) from negative examples (−1).]


Functional and Geometric Margin

• The functional margin of a linear discriminant (w, b) w.r.t. a labeled pattern (x_i, y_i) ∈ ℜᵈ × {−1, 1} is defined as

γ_i ≡ y_i (⟨w, x_i⟩ + b)

• If the functional margin is negative, the pattern is incorrectly classified; if it is positive, the classifier predicts the correct label.
• The larger |γ_i|, the further away x_i is from the discriminant.
• This is made more precise in the notion of the geometric margin

γ_i / ‖w‖

which measures the Euclidean distance of a point from the decision boundary.


Geometric Margin

[Figure: the geometric margins γ_i and γ_j of two points X_i ∈ S⁺ and X_j ∈ S⁻.]


Geometric Margin

Example: let w = (1, 1), b = −1, and consider the point x_i = (1, 1) with label Y_i = 1 (so x_i ∈ S⁺):

γ_i = Y_i (⟨w, x_i⟩ + b) = 1 · (1 + 1 − 1) = 1

‖w‖ = √2, so the geometric margin of x_i is γ_i / ‖w‖ = 1/√2.
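A sketch of this computation, assuming the hyperplane w = (1, 1), b = −1 and the point (1, 1) with label +1 (illustrative values):

```python
import math

def functional_margin(w, b, x, y):
    # gamma_i = y_i (<w, x_i> + b)
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def geometric_margin(w, b, x, y):
    # gamma_i / ||w||: Euclidean distance from the decision boundary
    norm = math.sqrt(sum(wi * wi for wi in w))
    return functional_margin(w, b, x, y) / norm

w, b = (1.0, 1.0), -1.0
x, y = (1.0, 1.0), 1

assert functional_margin(w, b, x, y) == 1.0
assert abs(geometric_margin(w, b, x, y) - 1 / math.sqrt(2)) < 1e-12
```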


Margin of a training set

The functional margin of a training set:  γ = min_i γ_i  (the minimum functional margin over the examples)

The geometric margin of a training set:  γ = min_i γ_i / ‖w‖  (the minimum geometric margin over the examples)


Maximum Margin Separating Hyperplane

γ ≡ min_i γ_i is called the (functional) margin of (w, b) w.r.t. the data set S = {(x_i, y_i)}.

The margin of a training set S is the maximum geometric margin over all hyperplanes. A hyperplane realizing this maximum is a maximal margin hyperplane.


VC Dimension of Δ-margin hyperplanes

• If the data samples belong to a sphere of radius R, then the set of Δ-margin hyperplanes has VC dimension bounded by

VC(H) ≤ min(R²/Δ², n) + 1

• WLOG, we can take R = 1 by normalizing the data points of each class so that they lie within a sphere of radius 1.
• For large-margin hyperplanes, the VC dimension is controlled independently of the dimensionality n of the feature space.


Maximal Margin Classifier

The bounds on error of classification suggest the possibility of improving generalization by maximizing the margin

Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space

SVMs control capacity by increasing the margin, not by reducing the number of features


Maximizing Margin ⇔ Minimizing ‖w‖

The definition of the hyperplane (w, b) does not change if we rescale it to (σw, σb), for σ > 0.
The functional margin depends on this scaling, but the geometric margin γ does not.
If we fix (by rescaling) the functional margin to 1, the geometric margin will be equal to 1/‖w‖.
We can therefore maximize the margin by minimizing the norm ‖w‖.


Maximizing the Margin = Minimizing ||w||

Let x+ and x− be the nearest data points of the two classes. Fixing the functional margin at these points to be 1:

  ⟨w, x+⟩ + b = +1
  ⟨w, x−⟩ + b = −1
  ⟨w, x+ − x−⟩ = 2
  ⟨w/||w||, x+ − x−⟩ = 2/||w||

so the geometric margin is γ = 1/||w||.


Learning as optimization

Minimize ⟨w, w⟩

subject to: y_i(⟨w, x_i⟩ + b) ≥ 1 for all i
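This optimization can be handed to a generic constrained solver. A minimal sketch (assuming SciPy is available; the toy data and variable names are ours, not the slides'): minimizing ⟨w, w⟩ subject to y_i(⟨w, x_i⟩ + b) ≥ 1 on four 2-D points recovers the maximal margin hyperplane.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two points per class in 2-D.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables packed as [w1, w2, b]; the objective is <w, w>.
def objective(v):
    return v[0] ** 2 + v[1] ** 2

# One constraint y_i(<w, x_i> + b) - 1 >= 0 per training point.
constraints = [
    {"type": "ineq", "fun": lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b)  # maximal margin hyperplane; geometric margin = 1/||w||
```

On this data the solver returns w ≈ (0.5, 0), b ≈ 0, i.e. a margin of 2.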


Constrained Optimization

• Primal optimization problem
• Given functions f; g_i, i = 1…k; h_j, j = 1…m; defined on a domain Ω ⊆ ℝⁿ:

  minimize f(w), w ∈ Ω              (objective function)
  subject to g_i(w) ≤ 0, i = 1…k    (inequality constraints)
             h_j(w) = 0, j = 1…m    (equality constraints)

Shorthand: g(w) ≤ 0 denotes g_i(w) ≤ 0 for i = 1…k
           h(w) = 0 denotes h_j(w) = 0 for j = 1…m

Feasible region: F = {w ∈ Ω : g(w) ≤ 0, h(w) = 0}


Optimization problems

• Linear program – objective function as well as equality and inequality constraints are linear

• Quadratic program – objective function is quadratic, and the equality and inequality constraints are linear

• Inequality constraints gi(w) ≤ 0 can be active i.e. gi(w) = 0 or inactive i.e. gi(w) < 0.

• Inequality constraints are often transformed into equality constraints using slack variables: gi(w) ≤ 0 ↔ gi(w) + ξi = 0 with ξi ≥ 0

• We will be interested primarily in convex optimization problems


Convex optimization problem

A set Ω ⊂ ℝⁿ is called convex if, for all w, u ∈ Ω and any θ ∈ (0, 1), the point θw + (1 − θ)u ∈ Ω.

If the function f is convex, any local minimum w* of an unconstrained optimization problem with objective function f is also a global minimum, since f(w*) ≤ f(u) for any u ≠ w*.

A convex optimization problem is one in which the set Ω, the objective function, and all the constraints are convex.


Lagrangian Theory

Given an optimization problem with an objective function f(w) and equality constraints h_j(w) = 0, j = 1…m, we define the Lagrangian as

  L(w, β) = f(w) + Σ_{j=1}^m β_j h_j(w)

where the β_j are called Lagrange multipliers. The necessary conditions for w* to be a minimum of f(w) subject to the constraints h_j(w) = 0, j = 1…m, are

  ∂L(w*, β*)/∂w = 0,  ∂L(w*, β*)/∂β = 0

The condition is sufficient if L(w, β*) is a convex function of w.


Lagrangian Theory Example

Minimize f(x, y) = x² + y²
Subject to: x + 2y − 4 = 0

  L(x, y, λ) = x² + y² + λ(x + 2y − 4)
  ∂L/∂x = 2x + λ = 0
  ∂L/∂y = 2y + 2λ = 0
  ∂L/∂λ = x + 2y − 4 = 0

Solving the above, we have λ = −8/5, x = 4/5, y = 8/5:
f(x, y) is minimized when (x, y) = (4/5, 8/5).
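The stationarity conditions of the Lagrangian above are linear in (x, y, λ), so they can be checked with a small linear solve. A quick numeric sketch (NumPy assumed):

```python
import numpy as np

# Stationarity of L(x, y, λ) = x² + y² + λ(x + 2y − 4) gives a linear system:
#   2x + λ = 0,  2y + 2λ = 0,  x + 2y = 4   (unknowns x, y, λ)
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 2.0],
              [1.0, 2.0, 0.0]])
rhs = np.array([0.0, 0.0, 4.0])
x, y, lam = np.linalg.solve(A, rhs)
print(x, y, lam)  # x = 4/5, y = 8/5, λ = −8/5
```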


Lagrangian theory – example

Find the lengths u, v, w of the sides of the box that has the largest volume for a given surface area c:

  minimize −uvw
  subject to uv + vw + wu = c/2

  L = −uvw + β(uv + vw + wu − c/2)
  ∂L/∂u = −vw + β(v + w) = 0
  ∂L/∂v = −uw + β(u + w) = 0
  ∂L/∂w = −uv + β(u + v) = 0

Subtracting these equations pairwise gives (u − v)(β − w) = 0 and (v − w)(β − u) = 0, from which

  u = v = w = √(c/6)
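The cube solution can be sanity-checked numerically: fix the surface area, sample other box shapes satisfying the constraint, and confirm none beats the cube. A small sketch (pure standard library; constants are illustrative):

```python
import math
import random

c = 6.0  # total surface area, so the optimal cube has side √(c/6) = 1
side = math.sqrt(c / 6.0)

def volume(u, v, w):
    return u * v * w

best = volume(side, side, side)  # volume of the cube

random.seed(0)
for _ in range(1000):
    u, v = random.uniform(0.1, 1.5), random.uniform(0.1, 1.5)
    # Solve the area constraint 2(uv + vw + wu) = c for the third side w.
    w = (c / 2.0 - u * v) / (u + v)
    if w <= 0:
        continue
    # No box with the same surface area has larger volume than the cube.
    assert volume(u, v, w) <= best + 1e-9
print(best)
```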


Lagrangian Optimization - Example

• The entropy of a probability distribution p = (p_1, …, p_n) over a finite set {1, 2, …, n} is defined as

  H(p) = − Σ_{i=1}^n p_i log₂ p_i

• The maximum entropy distribution can be found by minimizing −H(p) subject to the constraints

  Σ_{i=1}^n p_i = 1,  p_i ≥ 0 for all i
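Carrying out the Lagrangian calculation (stationarity of −H(p) + β(Σ p_i − 1)) forces all p_i to be equal, i.e. the uniform distribution p_i = 1/n with entropy log₂ n. A quick numeric check of that claim (pure standard library):

```python
import math
import random

def entropy(p):
    # H(p) = -Σ p_i log2 p_i, skipping zero-probability outcomes
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

n = 4
uniform = [1.0 / n] * n      # the maximum entropy solution p_i = 1/n
h_max = entropy(uniform)     # log2(4) = 2 bits

random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in range(n)]
    s = sum(raw)
    p = [r / s for r in raw]  # a random distribution over n outcomes
    assert entropy(p) <= h_max + 1e-9
print(h_max)
```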


Generalized Lagrangian Theory

Given an optimization problem with domain Ω ⊆ ℝⁿ,

  minimize f(w), w ∈ Ω              (objective function)
  subject to g_i(w) ≤ 0, i = 1…k    (inequality constraints)
             h_j(w) = 0, j = 1…m    (equality constraints)

where f is convex and the g_i and h_j are affine, we can define the generalized Lagrangian function as

  L(w, α, β) = f(w) + Σ_{i=1}^k α_i g_i(w) + Σ_{j=1}^m β_j h_j(w)

An affine function is a linear function plus a translation: F(x) is affine if F(x) = G(x) + b, where G(x) is a linear function of x and b is a constant.


Generalized Lagrangian Theory: KKT Conditions

Given an optimization problem with domain Ω ⊆ ℝⁿ,

  minimize f(w), w ∈ Ω              (objective function)
  subject to g_i(w) ≤ 0, i = 1…k    (inequality constraints)
             h_j(w) = 0, j = 1…m    (equality constraints)

where f is convex and the g_i and h_j are affine, the necessary and sufficient conditions for w* to be an optimum are the existence of α* and β* such that

  ∂L(w*, α*, β*)/∂w = 0,  ∂L(w*, α*, β*)/∂β = 0
  α_i* g_i(w*) = 0;  g_i(w*) ≤ 0;  α_i* ≥ 0;  i = 1…k


Minimize f(x, y) = (x − 2)² + (y − 1)²
Subject to: x + y − 2 ≤ 0; −y ≤ 0

  L(x, y, λ₁, λ₂) = (x − 2)² + (y − 1)² + λ₁(x + y − 2) + λ₂(−y)
  ∂L/∂x = 2(x − 2) + λ₁ = 0
  ∂L/∂y = 2(y − 1) + λ₁ − λ₂ = 0
  λ₁ ≥ 0, λ₂ ≥ 0, λ₁(x + y − 2) = 0, λ₂y = 0

Case 1: λ₁ = 0, λ₂ = 0 gives x = 2, y = 1. This is not a feasible solution because it violates x + y − 2 ≤ 0.
Case 2: λ₁ = 0, λ₂ > 0 forces y = 0 and x = 2, but then λ₂ = −2 < 0, so it does not yield a feasible solution.
Case 3: λ₁ > 0, λ₂ = 0 forces x + y = 2; with 2(x − 2) + λ₁ = 0 and 2(y − 1) + λ₁ = 0 this yields x = 3/2, y = 1/2, λ₁ = 1, which is a feasible solution.
Case 4: λ₁ > 0, λ₂ > 0 makes both constraints active (x = 2, y = 0), forcing λ₁ = 0, a contradiction, so it does not yield a feasible solution.
Hence, the solution is (x, y) = (3/2, 1/2).
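Case analyses like the one above can be cross-checked with a numerical constrained solver. A sketch (assuming SciPy; the small quadratic problem below, with two inequality constraints, is the kind of instance the KKT conditions handle): the solver lands on the solution where only the first constraint is active.

```python
import numpy as np
from scipy.optimize import minimize

# minimize (x-2)^2 + (y-1)^2  subject to  x + y - 2 <= 0  and  -y <= 0
res = minimize(
    lambda v: (v[0] - 2.0) ** 2 + (v[1] - 1.0) ** 2,
    x0=np.array([0.0, 0.0]),
    method="SLSQP",
    constraints=[
        {"type": "ineq", "fun": lambda v: 2.0 - v[0] - v[1]},  # x + y <= 2
        {"type": "ineq", "fun": lambda v: v[1]},               # y >= 0
    ],
)
print(res.x)  # the active-constraint optimum (3/2, 1/2)
```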


Dual optimization problem

A primal problem can be transformed into a dual by setting to zero the derivatives of the Lagrangian with respect to the primal variables, and substituting the results back into the Lagrangian to remove the dependence on the primal variables, resulting in

  θ(α, β) = inf_{w ∈ Ω} L(w, α, β)

which contains only dual variables and needs to be maximized under simpler constraints.

The infimum of a set S of real numbers, denoted inf(S), is the largest real number that is smaller than or equal to every number in S. If no such number exists, then inf(S) = −∞.


Maximum Margin Hyperplane

The problem of finding the maximal margin hyperplane is a constrained optimization problem. Use Lagrange theory (extended by Karush, Kuhn, and Tucker – KKT).

Lagrangian:

  L_p = ½⟨w, w⟩ − Σ_i α_i [y_i(⟨w, x_i⟩ + b) − 1],  α_i ≥ 0


From Primal to Dual

Minimize L_p(w, b, α) with respect to (w, b), requiring that the derivatives of L_p with respect to w and b vanish, subject to the constraints α_i ≥ 0. Differentiating L_p:

  ∂L_p/∂w = 0  ⇒  w = Σ_i α_i y_i x_i
  ∂L_p/∂b = 0  ⇒  Σ_i α_i y_i = 0

Substituting these equality constraints back into L_p, we obtain the dual problem.


The Dual Problem

Maximize:

  L_D(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩

Subject to: Σ_i α_i y_i = 0 and α_i ≥ 0

The dual is obtained by substituting w = Σ_j α_j y_j x_j into L_p:

  L_D = ½ Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩ − Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩ − b Σ_i α_i y_i + Σ_i α_i
      = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩

b does not appear in the dual; it has to be solved for using the primal constraints:

  b = − ½ [ max_{y_i = −1} ⟨w, x_i⟩ + min_{y_i = +1} ⟨w, x_i⟩ ]
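The dual is a small quadratic program, so it can be solved directly on a toy instance. A sketch (assuming SciPy; the two-point data set is ours): with x₁ = (1, 0), y₁ = +1 and x₂ = (−1, 0), y₂ = −1, the analytic optimum is α = (½, ½), giving w = (1, 0).

```python
import numpy as np
from scipy.optimize import minimize

# Two-point toy problem: x1 = (1, 0) with y1 = +1, x2 = (-1, 0) with y2 = -1.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
H = (y[:, None] * y[None, :]) * (X @ X.T)  # H_ij = y_i y_j <x_i, x_j>

def neg_dual(a):  # maximize L_D  <=>  minimize -L_D
    return -(a.sum() - 0.5 * a @ H @ a)

res = minimize(
    neg_dual,
    x0=np.zeros(2),
    method="SLSQP",
    bounds=[(0.0, None)] * 2,                          # α_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # Σ α_i y_i = 0
)
alpha = res.x
w = (alpha * y) @ X  # primal weights recovered via w = Σ α_i y_i x_i
print(alpha, w)  # α = [0.5, 0.5], w = (1, 0): geometric margin 1/||w|| = 1
```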


Karush-Kuhn-Tucker Conditions

Solving for the SVM is equivalent to finding a solution to the KKT conditions:

  L_p = ½⟨w, w⟩ − Σ_i α_i [y_i(⟨w, x_i⟩ + b) − 1]
  ∂L_p/∂w = 0  ⇒  w = Σ_i α_i y_i x_i
  ∂L_p/∂b = 0  ⇒  Σ_i α_i y_i = 0
  y_i(⟨w, x_i⟩ + b) − 1 ≥ 0
  α_i ≥ 0
  α_i [y_i(⟨w, x_i⟩ + b) − 1] = 0


Karush-Kuhn-Tucker Conditions for SVM

The KKT conditions state that optimal solutions α_i, (w, b) must satisfy

  α_i [y_i(⟨w, x_i⟩ + b) − 1] = 0

Only the training samples x_i for which the functional margin is 1 can have nonzero α_i. They are called support vectors.

The optimal hyperplane can be expressed in the dual representation in terms of this subset of training samples – the support vectors:

  f(x, α, b) = Σ_{i=1}^l α_i y_i ⟨x_i, x⟩ + b = Σ_{i ∈ sv} α_i y_i ⟨x_i, x⟩ + b


For any support vector x_j, the KKT conditions give y_j(Σ_{i ∈ SV} α_i y_i ⟨x_i, x_j⟩ + b) = 1, so

  b = y_j − Σ_{i ∈ SV} α_i y_i ⟨x_i, x_j⟩

Moreover,

  ⟨w, w⟩ = Σ_{i,j ∈ SV} α_i α_j y_i y_j ⟨x_i, x_j⟩
         = Σ_{j ∈ SV} α_j y_j Σ_{i ∈ SV} α_i y_i ⟨x_i, x_j⟩
         = Σ_{j ∈ SV} α_j y_j (y_j − b)
         = Σ_{j ∈ SV} α_j − b Σ_{j ∈ SV} α_j y_j
         = Σ_{j ∈ SV} α_j          (because of the KKT conditions)

So we have the geometric margin

  γ = 1/||w|| = ( Σ_{i ∈ SV} α_i )^(−1/2)


Support Vector Machines Yield Sparse Solutions

(Figure: a maximal margin separating hyperplane with geometric margin γ; only the boundary points – the support vectors – carry nonzero α_i.)


Example: XOR

Training set:

  x         y
  [−1, −1]  −1
  [−1, +1]  +1
  [+1, −1]  +1
  [+1, +1]  −1

Kernel: K(x_i, x_j) = (1 + x_iᵀx_j)², corresponding to the feature map

  φ(x) = [1, x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂]ᵀ

  K = | 9 1 1 1 |
      | 1 9 1 1 |
      | 1 1 9 1 |
      | 1 1 1 9 |

  L(w, b, α) = ½ wᵀw − Σ_{i=1}^N α_i [y_i(wᵀφ(x_i) + b) − 1]

  Q(α) = Σ_{i=1}^N α_i − ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j K(x_i, x_j)


Example: XOR

  Q(α) = α₁ + α₂ + α₃ + α₄ − ½(9α₁² − 2α₁α₂ − 2α₁α₃ + 2α₁α₄ + 9α₂² + 2α₂α₃ − 2α₂α₄ + 9α₃² − 2α₃α₄ + 9α₄²)

Setting the derivatives of Q with respect to the dual variables to 0:

  9α₁ − α₂ − α₃ + α₄ = 1
  −α₁ + 9α₂ + α₃ − α₄ = 1
  −α₁ + α₂ + 9α₃ − α₄ = 1
  α₁ − α₂ − α₃ + 9α₄ = 1

  α°_i = 1/8 for i = 1, 2, 3, 4;  Q(α°) = 1/4

Not surprisingly, each of the training samples is a support vector.

  w° = Σ_{i=1}^N α°_i y_i φ(x_i) = [0, 0, −1/√2, 0, 0, 0]ᵀ

so the decision function is w°ᵀφ(x) = −x₁x₂, with ⟨w°, w°⟩ = ½ and margin γ = 1/||w°|| = √2.
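The XOR solution above is small enough to verify directly: build the kernel matrix, solve the stationarity system for α, and check that the resulting kernel classifier separates all four points. A sketch (NumPy assumed):

```python
import numpy as np

# XOR training set and the quadratic kernel K(x, z) = (1 + x·z)^2.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])

K = (1.0 + X @ X.T) ** 2
H = (y[:, None] * y[None, :]) * K

# Stationarity of Q(α) gives the linear system H α = 1 (the constraint
# Σ α_i y_i = 0 holds automatically for this symmetric problem).
alpha = np.linalg.solve(H, np.ones(4))
print(alpha)  # each α_i = 1/8

# Kernel decision function f(x) = Σ α_i y_i K(x_i, x); b = 0 by symmetry.
def f(x):
    return sum(a * yi * (1.0 + xi @ x) ** 2 for a, yi, xi in zip(alpha, y, X))

assert all(np.sign(f(xi)) == yi for xi, yi in zip(X, y))
```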


Example – Two Spirals


Summary of Maximal Margin Classifier

• Good generalization when the training data is noise free and separable in the kernel-induced feature space
• Excessively large Lagrange multipliers often signal outliers – data points that are most difficult to classify – so the SVM can be used for identifying outliers
• Focuses on boundary points. If the boundary points are mislabeled (noisy training data):
  – the result is a terrible classifier if the training set is separable
  – noisy training data often makes the training set non-separable
  – use of highly nonlinear kernels to make the training set separable increases the likelihood of overfitting – despite maximizing the margin


Non-Separable Case

• In the case of non-separable data in feature space, the objective function grows arbitrarily large.
• Solution: relax the constraint that all training data be correctly classified, but only when necessary to do so.
• Cortes and Vapnik (1995) introduced slack variables ξ_i, i = 1…l, in the constraints:

  ξ_i = max(0, γ − y_i(⟨w, x_i⟩ + b))


Non-Separable Case

With slack variables ξ_i = max(0, γ − y_i(⟨w, x_i⟩ + b)), the generalization error can be shown to be bounded by

  ε ≤ (1/l) · (R² + Σ_i ξ_i²) / γ²


Non-Separable Case

For an error to occur, the corresponding ξ_i must exceed γ (which was chosen to be unity). So Σ_i ξ_i is an upper bound on the number of training errors. The objective function is therefore changed from

  ||w||²/2   to   ||w||²/2 + C Σ_i ξ_i^k

where C and k are parameters (typically k = 1 or 2, and C is a large positive constant).
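With k = 1 and the functional margin fixed at 1, the slack of each example is the hinge loss ξ_i = max(0, 1 − y_i(⟨w, x_i⟩ + b)), so the soft-margin objective is easy to evaluate directly. A small sketch (NumPy assumed; data and names are illustrative):

```python
import numpy as np

# Soft-margin objective ||w||²/2 + C Σ ξ_i with k = 1, where each slack is
# ξ_i = max(0, 1 − y_i(<w, x_i> + b)), i.e. the hinge loss.
def soft_margin_objective(w, b, X, y, C):
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * slacks.sum(), slacks

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])  # last point inside margin
y = np.array([1.0, -1.0, 1.0])
obj, slacks = soft_margin_objective(np.array([0.5, 0.0]), 0.0, X, y, C=10.0)
print(slacks)  # only the margin violator has ξ > 0: [0, 0, 0.75]
```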


The Soft-Margin Classifier

Minimize:

  ½⟨w, w⟩ + C Σ_i ξ_i

Subject to:

  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i
  ξ_i ≥ 0

  L_P = ½||w||² + C Σ_i ξ_i − Σ_i α_i [y_i(⟨w, x_i⟩ + b) − 1 + ξ_i] − Σ_i μ_i ξ_i

(objective function; multipliers α_i for the inequality constraints; multipliers μ_i for the non-negativity constraints on the ξ_i)


The Soft-Margin Classifier: KKT Conditions

  α_i ≥ 0,  ξ_i ≥ 0,  μ_i ≥ 0
  α_i [y_i(⟨w, x_i⟩ + b) − 1 + ξ_i] = 0
  μ_i ξ_i = 0

α_i > 0 only if x_i is a support vector, i.e., ⟨w, x_i⟩ + b = ±1 and hence ξ_i = 0, or x_i is misclassified by (w, b) and hence ξ_i > 0.


From primal to dual

  L_P = ½||w||² + C Σ_i ξ_i − Σ_i α_i [y_i(⟨w, x_i⟩ + b) − 1 + ξ_i] − Σ_i μ_i ξ_i

Equating the derivatives of L_P with respect to the primal variables to 0:

  ∂L_P/∂w = 0  ⇒  w = Σ_i α_i y_i x_i
  ∂L_P/∂b = 0  ⇒  Σ_i α_i y_i = 0
  ∂L_P/∂ξ_i = 0  ⇒  α_i + μ_i = C  ⇒  0 ≤ α_i ≤ C

Substituting back into L_P, we get the dual to be maximized:

  L_D = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩


Soft Margin – Dual Lagrangian

Maximize:

  L_D = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩

Subject to: Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C

KKT conditions:

  α_i [y_i(⟨w, x_i⟩ + b) − 1 + ξ_i] = 0
  ξ_i (C − α_i) = 0

Slack variables are non-zero (misclassified training sample) only when α_i = C.
For true support vectors, 0 < α_i < C and ξ_i = 0.
For all other (correctly classified) training samples, α_i = 0.


Implementation Techniques

Maximizing a quadratic function, subject to a linear equality constraint and box (inequality) constraints:

  W(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
  0 ≤ α_i ≤ C
  Σ_i α_i y_i = 0

  ∂W(α)/∂α_i = 1 − y_i Σ_j α_j y_j K(x_i, x_j)


On-line algorithm for the 1-norm soft margin (bias b = 0)

Given training set D
α ← 0
repeat
  for i = 1 to l
    α_i ← α_i + η (1 − y_i Σ_j α_j y_j K(x_i, x_j))
    if α_i < 0 then α_i ← 0
    else if α_i > C then α_i ← C
  end for
until stopping criterion satisfied
return α

where, for the i-th update, η = 1 / K(x_i, x_i).

In the general setting, b ≠ 0 and we need to solve for b.
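The pseudocode above can be sketched directly in Python (NumPy assumed; the toy data, kernel, and hyperparameters are ours, not the slides'):

```python
import numpy as np

def kernel_adatron(X, y, kernel, C=10.0, epochs=100):
    """On-line 1-norm soft-margin training with b = 0, as in the pseudocode."""
    l = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])
    alpha = np.zeros(l)
    for _ in range(epochs):
        for i in range(l):
            eta = 1.0 / K[i, i]
            alpha[i] += eta * (1.0 - y[i] * np.sum(alpha * y * K[i]))
            alpha[i] = min(max(alpha[i], 0.0), C)  # clip to [0, C]
    return alpha

linear = lambda u, v: u @ v
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = kernel_adatron(X, y, linear)

def predict(x):
    return np.sign(sum(a * yi * linear(xi, x) for a, yi, xi in zip(alpha, y, X)))

print([predict(xi) for xi in X])  # all four training points classified correctly
```

On this toy set the algorithm converges to a sparse α, and the resulting kernel predictor classifies the training set correctly.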


Implementation Techniques

Use QP packages (MINOS, LOQO, quadprog from the MATLAB optimization toolbox). They are not online and require that the data be held in memory in the form of the kernel matrix.

Stochastic gradient ascent: sequentially update one weight at a time, so it is online. Gives an excellent approximation in most cases:

  α_i ← α_i + (1 / K(x_i, x_i)) (1 − y_i Σ_j α_j y_j K(x_j, x_i))


Chunking and Decomposition

Given training set D
α ← 0
Select an arbitrary working set D̂ ⊂ D
repeat
  solve the optimization problem on D̂
  select a new working set from the data not satisfying the KKT conditions
until stopping criterion satisfied
return α


Sequential Minimal Optimization (SMO)

• At each step, SMO optimizes two weights α_i, α_j simultaneously, in order not to violate the linear constraint Σ_i α_i y_i = 0
• Optimization of the two weights is performed analytically
• Realizes gradient descent without leaving the linear constraint (J. Platt)
• Online versions exist (Li, Long; Gentile)


SVM Implementations

• Details of optimization and quadratic programming
• SVMlight – one of the first practical implementations of SVM (Joachims)
• Matlab toolboxes for SVM
• LIBSVM – one of the best SVM implementations, by Chih-Chung Chang and Chih-Jen Lin of National Taiwan University http://www.csie.ntu.edu.tw/~cjlin/libsvm/
• WLSVM – LIBSVM integrated with the WEKA machine learning toolbox, by Yasser El-Manzalawy of the ISU AI Lab http://www.cs.iastate.edu/~yasser/wlsvm/


Why does SVM Work?

• The feature space is often very high dimensional. Why don't we have the curse of dimensionality?
• A classifier in a high-dimensional space has many parameters and is hard to estimate
• Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the capacity of the classifier
• Typically, a classifier with many parameters is very flexible, but there are also exceptions
  – Let x_i = 10^(−i), where i ranges from 1 to n. The classifier sign(sin(ax)) can classify all x_i correctly for every possible combination of class labels on the x_i
  – This 1-parameter classifier is very flexible


Why does SVM work?

• Vapnik argues that a classifier should be characterized not by its number of parameters, but by its capacity – the flexibility of the function class it implements
  – This is formalized by the "VC-dimension" of the classifier
• The minimization of ||w||², subject to the condition that the functional margin be 1, has the effect of restricting the VC-dimension of the classifier in the feature space
• The SVM performs structural risk minimization: the empirical risk (training error), plus a term related to the generalization ability of the classifier, is minimized
• The SVM loss function is analogous to ridge regression: the term ½||w||² "shrinks" the parameters towards zero to avoid overfitting


Choosing the Kernel Function

• Probably the trickiest part of using an SVM
• The kernel function should maximize the similarity among instances within a class while accentuating the differences between classes
• A variety of kernels have been proposed (diffusion kernel, Fisher kernel, string kernel, …) for different types of data
• In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for data that live in a fixed-dimensional input space
• Low-order Markov kernels and their relatives are good ones to consider for structured data – strings, images, etc.
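The two "good initial try" kernels mentioned above are one-liners, and a valid kernel must induce a symmetric positive semi-definite Gram matrix, which is easy to check numerically. A sketch (NumPy assumed; function names and parameters are ours):

```python
import numpy as np

# Polynomial and RBF kernels, the usual first choices for fixed-dimensional data.
def poly_kernel(u, v, degree=3, c=1.0):
    return (c + u @ v) ** degree

def rbf_kernel(u, v, width=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * width ** 2))

# A valid kernel yields a symmetric positive semi-definite Gram matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
G = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
assert np.allclose(G, G.T)
assert np.linalg.eigvalsh(G).min() > -1e-9  # PSD up to numerical error
print("RBF Gram matrix is symmetric PSD")
```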


Other Aspects of SVM

• How to use SVM for multi-class classification?
  – One can change the QP formulation to become multi-class
  – More often, multiple binary classifiers are combined
  – One can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers "intelligently"
• How to interpret the SVM discriminant function value as a probability?
  – By performing logistic regression on the SVM output over a set of data that is not used for training
• Some SVM software (like LIBSVM) has these features built in
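The probability-calibration idea above (often called Platt scaling) amounts to fitting a one-dimensional logistic model to held-out SVM scores. A rough sketch under stated assumptions (plain gradient descent on the log-likelihood; the toy scores below stand in for real SVM outputs):

```python
import math
import random

# Fit P(y = 1 | s) = 1/(1 + exp(A*s + B)) to scores s on held-out data.
def platt_scaling(scores, labels, lr=0.1, steps=2000):
    A, B = 0.0, 0.0
    for _ in range(steps):
        gA = gB = 0.0
        for s, t in zip(scores, labels):  # t in {0, 1}
            p = 1.0 / (1.0 + math.exp(A * s + B))
            gA += (p - t) * (-s)   # d(negative log-likelihood)/dA
            gB += (p - t) * (-1.0)  # d(negative log-likelihood)/dB
        A -= lr * gA / len(scores)
        B -= lr * gB / len(scores)
    return A, B

random.seed(0)
scores = [random.uniform(-3, 3) for _ in range(200)]
labels = [1 if s > 0 else 0 for s in scores]  # toy: label = sign of the score
A, B = platt_scaling(scores, labels)
prob = lambda s: 1.0 / (1.0 + math.exp(A * s + B))
print(prob(2.0), prob(-2.0))  # high for positive scores, low for negative
```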


Recap: How to Use SVM

• Prepare the data set
• Select the kernel function to use
• Select the parameters of the kernel function and the value of C
  – You can use the values suggested by the SVM software
  – You can use a tuning set to tune C
• Execute the training algorithm to obtain the α_i and b
• Unseen data can be classified using the α_i, the support vectors, and b


Strengths and Weaknesses of SVM

• Strengths
  – Training is relatively easy: no local optima
  – It scales relatively well to high-dimensional data
  – The tradeoff between classifier complexity and error can be controlled explicitly
  – Non-traditional data like strings and trees can be used as input to the SVM, instead of feature vectors
• Weaknesses
  – Need to choose a "good" kernel function


Recent developments

• Better understanding of the relation between SVM and regularized discriminative classifiers (logistic regression)
• Knowledge-based SVM – incorporating prior knowledge as constraints in SVM optimization (Shavlik et al.)
• A zoo of kernel functions
• Extensions of SVM to multi-class and structured-label classification tasks
• One-class SVM (anomaly detection)


Representative Applications of SVM

• Handwritten letter classification
• Text classification
• Image classification – e.g., face recognition
• Bioinformatics
  – Gene expression based tissue classification
  – Protein function classification
  – Protein structure classification
  – Protein sub-cellular localization
• …