Copyright Vasant Honavar, 2006.
MACHINE LEARNING
Vasant Honavar, Artificial Intelligence Research Laboratory
Department of Computer Science, Bioinformatics and Computational Biology Program
Center for Computational Intelligence, Learning, & Discovery, Iowa State University
honavar@cs.iastate.edu, www.cs.iastate.edu/~honavar/
www.cild.iastate.edu/
Maximum margin classifiers
• Discriminative classifiers that maximize the margin of separation
• Support Vector Machines
  – Feature spaces
  – Kernel machines
  – VC Theory and generalization bounds
  – Maximum margin classifiers
• Comparisons with conditionally trained discriminative classifiers
Perceptrons revisited
Perceptrons
• Can only compute threshold functions
• Can only represent linear decision surfaces
• Can only learn to classify linearly separable training data
How can we deal with non-linearly separable data?
• Map data into a (typically higher dimensional) feature space where the classes become separable
Two problems must be solved:
• Computational problem of working with high dimensional feature spaces
• Overfitting problem in high dimensional feature spaces
Maximum margin model
We cannot outperform the Bayes optimal classifier when
• The generative model assumed is correct
• The data set is large enough to ensure reliable estimation of the parameters of the model
But discriminative models may be better than generative models because
• The correct generative model is seldom known
• The data set is often simply not large enough
Maximum margin classifiers are a kind of discriminative classifier designed to circumvent the overfitting problem
Extending Linear Classifiers – Feature spaces
Map data into a feature space where they are linearly separable
$$\varphi: X \to \Phi, \qquad \mathbf{x} \mapsto \varphi(\mathbf{x})$$
Linear Separability in Feature spaces
The original input space can always be mapped to some higher-dimensional feature space where the training data become separable:
Φ: x→ φ(x)
Learning in the Feature Spaces
High dimensional feature spaces
$$\mathbf{x} = (x_1, x_2, \ldots, x_n) \to \varphi(\mathbf{x}) = \big(\varphi_1(\mathbf{x}), \varphi_2(\mathbf{x}), \ldots, \varphi_d(\mathbf{x})\big)$$
where typically d >> n, solve the problem of expressing complex functions.
But this introduces a
• computational problem (working with very large vectors)
• generalization problem (curse of dimensionality)
SVMs offer an elegant solution to both problems
The Perceptron Algorithm Revisited
• The perceptron works by adding misclassified positive or subtracting misclassified negative examples to an arbitrary weight vector, which (without loss of generality) we assumed to be the zero vector.
• The final weight vector is a linear combination of the training points
$$\mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i$$
where, since the sign of the coefficient of $\mathbf{x}_i$ is given by the label $y_i$, the $\alpha_i$ are positive values proportional to the number of times misclassification of $\mathbf{x}_i$ has caused the weight to be updated; $\alpha_i$ is called the embedding strength of the pattern $\mathbf{x}_i$.
Dual Representation
The decision function can be rewritten as:
$$h(\mathbf{x}) = \mathrm{sgn}\big(\langle\mathbf{w},\mathbf{x}\rangle + b\big) = \mathrm{sgn}\Big(\sum_{j=1}^{l}\alpha_j y_j\langle\mathbf{x}_j,\mathbf{x}\rangle + b\Big)$$
The update rule is: on training example $(\mathbf{x}_i, y_i)$,
$$\text{if } y_i\Big(\sum_{j=1}^{l}\alpha_j y_j\langle\mathbf{x}_j,\mathbf{x}_i\rangle + b\Big) \le 0 \text{ then } \alpha_i \leftarrow \alpha_i + \eta$$
WLOG, we can take $\eta = 1$.
Implications of Dual Representation
When Linear Learning Machines are represented in the dual form
Data appear only inside dot products (in decision function and in training algorithm)
The matrix
$$G = \big(\langle\mathbf{x}_i,\mathbf{x}_j\rangle\big)_{i,j=1}^{l}$$
of pair-wise dot products between training samples is called the Gram matrix.
$$h(\mathbf{x}_i) = \mathrm{sgn}\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big) = \mathrm{sgn}\Big(\sum_{j=1}^{l}\alpha_j y_j\langle\mathbf{x}_j,\mathbf{x}_i\rangle + b\Big)$$
Implicit Mapping to Feature Space
Kernel Machines
• Solve the computational problem of working with many dimensions
• Can make it possible to use infinite dimensions efficiently
• Offer other advantages, both practical and conceptual
Kernel-Induced Feature Spaces
$$f(\mathbf{x}_i) = \langle\mathbf{w},\varphi(\mathbf{x}_i)\rangle + b, \qquad h(\mathbf{x}_i) = \mathrm{sgn}\big(f(\mathbf{x}_i)\big)$$
where $\varphi: X \to \Phi$ is a non-linear map from input space to feature space.
In the dual representation, the data points only appear inside dot products:
$$h(\mathbf{x}_i) = \mathrm{sgn}\big(\langle\mathbf{w},\varphi(\mathbf{x}_i)\rangle + b\big) = \mathrm{sgn}\Big(\sum_{j=1}^{l}\alpha_j y_j\langle\varphi(\mathbf{x}_j),\varphi(\mathbf{x}_i)\rangle + b\Big)$$
Kernels
• Kernel function returns the value of the dot product between the images of the two arguments
• When using kernels, the dimensionality of the feature space Φ is not necessarily important because of the special properties of kernel functions
• Given a function K, it is possible to verify that it is a kernel (we will return to this later)
$$K(\mathbf{x}_1,\mathbf{x}_2) = \langle\varphi(\mathbf{x}_1),\varphi(\mathbf{x}_2)\rangle$$
Kernel Machines
We can use perceptron learning algorithm in the feature space by taking its dual representation and replacing dot products with kernels:
$$\langle\mathbf{x}_1,\mathbf{x}_2\rangle \;\leftarrow\; K(\mathbf{x}_1,\mathbf{x}_2) = \langle\varphi(\mathbf{x}_1),\varphi(\mathbf{x}_2)\rangle$$
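To make this concrete, here is a minimal sketch of the resulting kernel (dual) perceptron; the function and variable names are illustrative, and updating b alongside the α values is one common convention rather than something the slides prescribe.

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=10):
    """Dual perceptron: w is kept implicitly as sum_j alpha_j * y_j * phi(x_j)."""
    l = len(y)
    alpha, b = np.zeros(l), 0.0
    for _ in range(epochs):
        for i in range(l):
            # decision value uses only kernel evaluations K(x_j, x_i)
            f_i = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(l)) + b
            if y[i] * f_i <= 0:       # mistake: update alpha_i (eta = 1)
                alpha[i] += 1.0
                b += y[i]
    return alpha, b

# usage with a quadratic kernel K(x, z) = (1 + <x, z>)**2
K = lambda x, z: (1.0 + np.dot(x, z)) ** 2
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
alpha, b = kernel_perceptron(X, y, K)
```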
The Kernel Matrix
$$K = \begin{bmatrix} K(1,1) & K(1,2) & K(1,3) & \cdots & K(1,l)\\ K(2,1) & K(2,2) & K(2,3) & \cdots & K(2,l)\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ K(l,1) & K(l,2) & K(l,3) & \cdots & K(l,l) \end{bmatrix}$$
Kernel matrix is the Gram matrix in the feature space Φ
(the matrix of pair-wise dot products between feature vectors corresponding to the training samples)
Properties of Kernel Matrices
It is easy to show that the Gram matrix (and hence the kernel matrix) is
• A square matrix
• Symmetric ($K^T = K$)
• Positive semi-definite (all eigenvalues of K are non-negative; recall that the eigenvalues of a square matrix A are given by the values of λ that satisfy |A − λI| = 0)
Any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some feature space Φ
$$K(\mathbf{x}_i,\mathbf{x}_j) = \langle\varphi(\mathbf{x}_i),\varphi(\mathbf{x}_j)\rangle$$
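As an illustration (not from the slides), the Gram matrix for a kernel can be built and the properties above checked numerically; all names below are illustrative.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Pairwise kernel evaluations G[i, j] = K(x_i, x_j)."""
    l = len(X)
    G = np.empty((l, l))
    for i in range(l):
        for j in range(l):
            G[i, j] = kernel(X[i], X[j])
    return G

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2))     # a valid kernel

G = gram_matrix(X, rbf)
print(np.allclose(G, G.T))                           # symmetric
print(np.min(np.linalg.eigvalsh(G)) >= -1e-10)       # eigenvalues non-negative (PSD)
```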
Mercer’s Theorem: Characterization of Kernel Functions
A function K : X×X → ℜ is said to be (finitely) positive semi-definite if
• K is a symmetric function: K(x,y) = K(y,x)
• Matrices formed by restricting K to any finite subset of the space X are positive semi-definite
Characterization of kernel functions: every (finitely) positive semi-definite, symmetric function is a kernel, i.e. there exists a mapping ϕ such that it is possible to write
$$K(\mathbf{x}_i,\mathbf{x}_j) = \langle\varphi(\mathbf{x}_i),\varphi(\mathbf{x}_j)\rangle$$
Properties of Kernel Functions - Mercer's Theorem
Eigenvalues of the Gram matrix define an expansion of kernels:
$$K(\mathbf{x}_i,\mathbf{x}_j) = \langle\varphi(\mathbf{x}_i),\varphi(\mathbf{x}_j)\rangle = \sum_{l}\lambda_l\,\varphi_l(\mathbf{x}_i)\,\varphi_l(\mathbf{x}_j)$$
That is, the eigenvalues act as features!
Recall that the eigenvalues of a square matrix A correspond to solutions of $|A - \lambda I| = 0$, where $I$ is the identity matrix.
Examples of Kernels
Simple examples of kernels:
$$K(\mathbf{x},\mathbf{z}) = \langle\mathbf{x},\mathbf{z}\rangle^{d}$$
$$K(\mathbf{x},\mathbf{z}) = e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{2\sigma^2}}$$
Example: Polynomial Kernels
$$\mathbf{x} = (x_1, x_2); \qquad \mathbf{z} = (z_1, z_2)$$
$$\langle\mathbf{x},\mathbf{z}\rangle^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2$$
$$= \big\langle (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2),\ (z_1^2,\ z_2^2,\ \sqrt{2}\,z_1 z_2)\big\rangle = \langle\varphi(\mathbf{x}),\varphi(\mathbf{z})\rangle$$
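A short numerical check of this identity (illustrative, not part of the original slides):

```python
import numpy as np

def phi(v):
    # explicit feature map for the degree-2 polynomial kernel <x, z>^2 in 2-D
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = np.dot(x, z) ** 2             # kernel computed in the input space
rhs = np.dot(phi(x), phi(z))        # dot product in the 3-D feature space
print(lhs, rhs, np.isclose(lhs, rhs))   # 1.0 1.0 True
```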
Making Kernels – Closure Properties
The set of kernels is closed under some operations. If K1, K2 are kernels over X×X, then the following are kernels:
$$K(\mathbf{x},\mathbf{z}) = K_1(\mathbf{x},\mathbf{z}) + K_2(\mathbf{x},\mathbf{z})$$
$$K(\mathbf{x},\mathbf{z}) = a\,K_1(\mathbf{x},\mathbf{z}),\quad a > 0$$
$$K(\mathbf{x},\mathbf{z}) = K_1(\mathbf{x},\mathbf{z})\,K_2(\mathbf{x},\mathbf{z})$$
$$K(\mathbf{x},\mathbf{z}) = f(\mathbf{x})\,f(\mathbf{z}),\quad f: X \to \mathbb{R}$$
$$K(\mathbf{x},\mathbf{z}) = K_3\big(\varphi(\mathbf{x}),\varphi(\mathbf{z})\big),\quad \varphi: X \to \mathbb{R}^N,\ K_3 \text{ a kernel over } \mathbb{R}^N\times\mathbb{R}^N$$
$$K(\mathbf{x},\mathbf{z}) = \mathbf{x}^T B\,\mathbf{z},\quad B \text{ a symmetric positive semi-definite matrix},\ X \subseteq \mathbb{R}^n$$
We can make complex kernels from simple ones: modularity!
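These closure rules translate directly into code. A small sketch, with illustrative helper names, of building new kernels from existing ones:

```python
import numpy as np

def k_sum(k1, k2):            # K = K1 + K2
    return lambda x, z: k1(x, z) + k2(x, z)

def k_scale(a, k1):           # K = a * K1, with a > 0
    return lambda x, z: a * k1(x, z)

def k_prod(k1, k2):           # K = K1 * K2
    return lambda x, z: k1(x, z) * k2(x, z)

def k_from_f(f):              # K(x, z) = f(x) * f(z)
    return lambda x, z: f(x) * f(z)

linear = lambda x, z: np.dot(x, z)
# a composite kernel built modularly from simple pieces
composite = k_sum(k_scale(0.5, linear), k_prod(linear, linear))
```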
Kernels
• We can define kernels over arbitrary instance spaces, including
  – Finite dimensional vector spaces
  – Boolean spaces
  – ∑* where ∑ is a finite alphabet
  – Documents, graphs, etc.
• Kernels need not always be expressed by a closed form formula.
• Many useful kernels can be computed by complex algorithms (e.g., diffusion kernels over graphs).
Kernels on Sets and Multi-Sets
Let $X = 2^V$ for some fixed domain V, and let $A_1, A_2 \subseteq V$. Then
$$K(A_1, A_2) = 2^{|A_1 \cap A_2|}$$
is a kernel.
Exercise: Define a Kernel on the space of multi-sets whose elements are drawn from a finite domain V
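A small sketch of the set kernel defined above (the multi-set extension asked for in the exercise is left open); the function name is illustrative.

```python
def set_kernel(A, B):
    """K(A1, A2) = 2 ** |A1 intersect A2| for subsets A1, A2 of a fixed domain V."""
    return 2 ** len(set(A) & set(B))

A1 = {"a", "b", "c"}
A2 = {"b", "c", "d"}
print(set_kernel(A1, A2))   # |{b, c}| = 2, so K = 4
```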
String Kernel (p-spectrum Kernel)
• The p-spectrum of a string is the histogram – vector of number of occurrences of all possible contiguous substrings – of length p
• We can define a kernel function K(s, t) over ∑* × ∑* as the inner product of the p-spectra of s and t.
Example: p = 3, s = "statistics", t = "computation"
Common substrings of length 3: "tat", "ati"
K(s, t) = 2
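A direct implementation of the p-spectrum kernel described above (illustrative names, not from the slides):

```python
from collections import Counter

def p_spectrum(s, p):
    """Histogram of all contiguous substrings of length p."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s, t, p):
    """Inner product of the p-spectra of s and t."""
    hs, ht = p_spectrum(s, p), p_spectrum(t, p)
    return sum(hs[u] * ht[u] for u in hs)

print(spectrum_kernel("statistics", "computation", 3))   # 2 ("tat" and "ati")
```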
Kernel Machines
Kernel Machines are Linear Learning Machines that:
• Use a dual representation
• Operate in a kernel-induced feature space (that is, a linear function in the feature space implicitly defined by the Gram matrix K corresponding to the data set)
$$h(\mathbf{x}_i) = \mathrm{sgn}\big(\langle\mathbf{w},\varphi(\mathbf{x}_i)\rangle + b\big) = \mathrm{sgn}\Big(\sum_{j=1}^{l}\alpha_j y_j\langle\varphi(\mathbf{x}_j),\varphi(\mathbf{x}_i)\rangle + b\Big)$$
Kernels – the good, the bad, and the ugly
Bad kernel – A kernel whose Gram (kernel) matrix is mostly diagonal -- all data points are orthogonal to each other, and hence the machine is unable to detect hidden structure in the data
(Illustration: a mostly diagonal Gram matrix, with 1s on the diagonal and 0s elsewhere)
Kernels – the good, the bad, and the ugly
Good kernel – Corresponds to a Gram (kernel) matrix in which subsets of data points belonging to the same class are similar to each other, and hence the machine can detect hidden structure in the data
(Illustration: a block-structured Gram matrix, with large entries within the Class 1 and Class 2 blocks and small entries between classes)
Kernels – the good, the bad, and the ugly
• If we map into a space with too many irrelevant features, the kernel matrix becomes nearly diagonal
• We need some prior knowledge of the target to choose a good kernel
Learning in the Feature Spaces
High dimensional feature spaces
$$\mathbf{x} = (x_1, x_2, \ldots, x_n) \to \varphi(\mathbf{x}) = \big(\varphi_1(\mathbf{x}), \varphi_2(\mathbf{x}), \ldots, \varphi_d(\mathbf{x})\big)$$
where typically d >> n, solve the problem of expressing complex functions.
But this introduces a
• computational problem (working with very large vectors)
  – solved by the kernel trick: implicit computation of dot products in kernel-induced feature spaces via dot products in the input space
• generalization problem (curse of dimensionality)
The Generalization Problem
• The curse of dimensionality: it is easy to overfit in high dimensional spaces
• The learning problem is ill-posed: there are infinitely many hyperplanes that separate the training data
• We need a principled approach to choose an optimal hyperplane
The Generalization Problem
“Capacity” of the machine – ability to learn any training set without error – related to VC dimension
Excellent memory is not an asset when it comes to learning from limited data
“A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree”
C. Burges
History of Key Developments Leading to SVM
1958 Perceptron (Rosenblatt)
1963 Margin (Vapnik)
1964 Kernel Trick (Aizerman)
1965 Optimization formulation (Mangasarian)
1971 Kernels (Wahba)
1992-1994 SVM (Vapnik)
1996 – present Rapid growth, numerous applications
Extensions to other problems
A Little Learning Theory
Suppose:
• We are given l training examples $(\mathbf{x}_i, y_i)$
• Train and test points are drawn randomly (i.i.d.) from some unknown probability distribution D(x,y)
• The machine learns the mapping $\mathbf{x}_i \to y_i$ and outputs a hypothesis $h(\mathbf{x},\alpha,b)$. A particular choice of $(\alpha, b)$ generates a "trained machine"
• The expectation of the test error, or expected risk, is
$$R(\alpha,b) = \int \tfrac{1}{2}\,\big|y - h(\mathbf{x},\alpha,b)\big|\; dD(\mathbf{x},y)$$
A Bound on the Generalization Performance
The empirical risk is:
$$R_{emp}(\alpha,b) = \frac{1}{2l}\sum_{i=1}^{l}\big|y_i - h(\mathbf{x}_i,\alpha,b)\big|$$
Choose some δ such that 0 < δ < 1. With probability 1 − δ the following bound (the risk bound of h(x,α) in distribution D) holds (Vapnik, 1995):
$$R(\alpha,b) \le R_{emp}(\alpha,b) + \sqrt{\frac{d\left(\log\frac{2l}{d} + 1\right) - \log\frac{\delta}{4}}{l}}$$
where d ≥ 0 is called the VC dimension, and is a measure of the "capacity" of the machine.
A Bound on the Generalization Performance
The second term in the right-hand side is called VC confidence.
Three key points about the actual risk bound:
• It is independent of D(x,y)
• It is usually not possible to compute the left hand side
• If we know d, we can compute the right hand side
The risk bound gives us a way to compare learning machines!
The VC Dimension
Definition: the VC dimension of a set of functions $H = \{h(\mathbf{x},\alpha,b)\}$ is d if and only if there exists a set of d data points such that each of the $2^d$ possible labelings (dichotomies) of those d data points can be realized using some member of H, but no set of q > d points exists with this property.
The VC Dimension
VC dimension of H is size of largest subset of X shattered by H. VC dimension measures the capacity of a set H of hypotheses (functions).
If for any number N it is possible to find N points $\mathbf{x}_1,\ldots,\mathbf{x}_N$ that can be separated in all the $2^N$ possible ways, we say that the VC-dimension of the set is infinite.
The VC Dimension Example
• Suppose that the data live in $\mathbb{R}^2$, and the set $\{h(\mathbf{x},\alpha)\}$ consists of oriented straight lines (linear discriminants).
• It is possible to find three points that can be shattered by this set of functions.
• It is not possible to find four.
• So the VC dimension of the set of linear discriminants in $\mathbb{R}^2$ is three.
The VC Dimension of Hyperplanes
Theorem 1: Consider some set of m points in $\mathbb{R}^n$. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.
Corollary: The VC dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ is n+1, since we can always choose n+1 points, and then choose one of the points as origin, such that the position vectors of the remaining n points are linearly independent, but we can never choose n+2 such points.
VC Dimension
VC dimension can be infinite even when the number of parameters of the set of hypothesis functions is low.
Example: For any integer l and any labels $y_1,\ldots,y_l$, $y_i \in \{-1,1\}$, we can find l points $x_1,\ldots,x_l$ and a parameter α such that those points can be shattered by $h(x,\alpha)$, where
$$h(x,\alpha) \equiv \mathrm{sgn}\big(\sin(\alpha x)\big),\qquad x,\alpha \in \mathbb{R}$$
Those points are:
$$x_i = 10^{-i},\qquad i = 1,\ldots,l$$
and the parameter α is:
$$\alpha = \pi\left(1 + \sum_{i=1}^{l}\frac{(1 - y_i)\,10^{i}}{2}\right)$$
Vapnik-Chervonenkis (VC) Dimension
• Let H be a hypothesis class over an instance space X.
• Both H and X may be infinite.
• We need a way to describe the behavior of H on a finite set of points S ⊆ X.
• For any concept class H over X, and any $S = \{X_1, X_2, \ldots, X_m\} \subseteq X$,
$$\Pi_H(S) = \big\{\big(h(X_1),\ldots,h(X_m)\big) : h \in H\big\}$$
• Equivalently, with a little abuse of notation, we can write $\Pi_H(S) = \{h \cap S : h \in H\}$
• $\Pi_H(S)$ is the set of all dichotomies or behaviors on S that are induced or realized by H
Vapnik-Chervonenkis (VC) Dimension
• If $\Pi_H(S) = \{0,1\}^m$ where $|S| = m$, or equivalently $|\Pi_H(S)| = 2^m$, we say that S is shattered by H.
• A set S of instances is said to be shattered by a hypothesis class H if and only if for every dichotomy of S, there exists a hypothesis in H that is consistent with the dichotomy.
VC Dimension of a hypothesis class
Definition: The VC-dimension V(H), of a hypothesis class H defined over an instance space X is the cardinality d of the largest subset of X that is shattered by H. If arbitrarily large finite subsets of X can be shattered by H, V(H)=∞
How can we show that V(H) is at least d?
• Find a set of cardinality at least d that is shattered by H.
How can we show that V(H) = d?
• Show that V(H) is at least d and that no set of cardinality (d+1) can be shattered by H.
VC Dimension of a Hypothesis Class - Examples
• Example: Let the instance space X be the 2-dimensional Euclidean space. Let the hypothesis space H be the set of axis parallel rectangles in the plane.
• V(H)=4 (there exists a set of 4 points that can be shattered by a set of axis parallel rectangles but no set of 5 points can be shattered by H).
Some Useful Properties of VC Dimension
• If H is a finite concept class, $V(H) \le \lg|H|$
• $H_1 \subseteq H_2 \Rightarrow V(H_1) \le V(H_2)$
• If $H = \{X - h : h \in H_1\}$ (complements), then $V(H) = V(H_1)$
• If H is formed by a union or intersection of l concepts from $H_1$, then $V(H) \le O\big(V(H_1)\,l\,\lg l\big)$
• Growth function: $\Pi_H(m) = \max_{S : |S| = m}|\Pi_H(S)|$. If $V(H) = d$, then $\Pi_H(m) \le \Phi_d(m)$, where $\Phi_d(m) = O(m^d)$ if $m \ge d$ and $\Phi_d(m) = 2^m$ if $m < d$
Proof: Left as an exercise
Risk Bound
What are the VC dimension and the empirical risk of the nearest-neighbor classifier?
Any number of points, labeled arbitrarily, can be shattered by a nearest-neighbor classifier; thus $d = \infty$ and the empirical risk is 0. So the bound provides no useful information in this case.
Minimizing the Bound by Minimizing d
• The VC confidence (second term) depends on d/l
• One should choose the learning machine with minimal d
• For large values of d/l the bound is not tight
$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{d\left(\log\frac{2l}{d} + 1\right) - \log\frac{\delta}{4}}{l}},\qquad \delta = 0.05$$
Bounds on Error of Classification
The classification error
• depends on the VC dimension of the hypothesis class
• is independent of the dimensionality of the feature space
Vapnik proved that the error ε of a classification function h for separable data sets is
$$\varepsilon = O\!\left(\frac{d}{l}\right)$$
where d is the VC dimension of the hypothesis class and l is the number of training examples
Structural Risk Minimization
Finding a learning machine with the minimum upper bound on the actual risk
• leads us to a method of choosing an optimal machine for a given task
• is the essential idea of structural risk minimization (SRM)
Let $H_1 \subset H_2 \subset H_3 \subset \ldots$ be a sequence of nested subsets of hypotheses whose VC dimensions satisfy $d_1 < d_2 < d_3 < \ldots$
• SRM consists of finding the subset of functions which minimizes the upper bound on the actual risk
• SRM involves training a series of classifiers, and choosing the classifier from the series for which the sum of empirical risk and VC confidence is minimal
Margin Based Bounds on Error of Classification
Margin based bound:
$$\varepsilon = O\!\left(\frac{1}{l}\left(\frac{L}{\gamma}\right)^{2}\right)$$
where
$$f(\mathbf{x}) = \langle\mathbf{w},\mathbf{x}\rangle + b,\qquad \gamma = \min_i \frac{y_i\,f(\mathbf{x}_i)}{\|\mathbf{w}\|},\qquad L = \max_i\|\mathbf{x}_i\|_p$$
Important insight:
The error of the classifier trained on a separable data set is inversely proportional to its margin, and is independent of the dimensionality of the input space!
Maximal Margin Classifier
The bounds on the error of classification suggest the possibility of improving generalization by maximizing the margin.
Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space.
SVMs control capacity by increasing the margin, not by reducing the number of features.
Margin
(Figure: linear separation of the input space by a hyperplane with normal $\mathbf{w}$ and offset b)
$$f(\mathbf{x}) = \langle\mathbf{w},\mathbf{x}\rangle + b,\qquad h(\mathbf{x}) = \mathrm{sign}\big(f(\mathbf{x})\big)$$
Functional and Geometric Margin
• The functional margin of a linear discriminant (w,b) w.r.t. a labeled pattern $(\mathbf{x}_i, y_i) \in \mathbb{R}^d\times\{-1,1\}$ is defined as
$$\gamma_i \equiv y_i\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big)$$
• If the functional margin is negative, then the pattern is incorrectly classified; if it is positive, then the classifier predicts the correct label.
• The larger $|\gamma_i|$, the further away $\mathbf{x}_i$ is from the discriminant.
• This is made more precise in the notion of the geometric margin
$$\tilde{\gamma}_i \equiv \frac{\gamma_i}{\|\mathbf{w}\|}$$
which measures the Euclidean distance of a point from the decision boundary.
Geometric Margin
(Figure: the geometric margins $\gamma_i$, $\gamma_j$ of two points $X_i \in S^+$ and $X_j \in S^-$)
Geometric Margin
Example: $\mathbf{w} = (1,1)$, $b = -1$, and the point $\mathbf{x}_i = (1,1)$ with $y_i = 1$:
$$\gamma_i = y_i\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big) = 1\cdot(1\cdot 1 + 1\cdot 1 - 1) = 1$$
$$\tilde{\gamma}_i = \frac{\gamma_i}{\|\mathbf{w}\|} = \frac{1}{\sqrt{2}}$$
(Figure: points $X_i \in S^+$ and $X_j \in S^-$ with geometric margins $\gamma_i$, $\gamma_j$)
Margin of a training set
The functional margin of a training set: $\gamma = \min_i\gamma_i$
The geometric margin of a training set: $\tilde{\gamma} = \min_i\tilde{\gamma}_i$
Maximum Margin Separating Hyperplane
$\gamma \equiv \min_i\gamma_i$ is called the (functional) margin of (w,b) w.r.t. the data set S = {(x_i, y_i)}.
The margin of a training set S is the maximum geometric margin over all hyperplanes. A hyperplane realizing this maximum is a maximal margin hyperplane.
VC Dimension of Δ-margin hyperplanes
• If data samples belong to a sphere of radius R, then the set of Δ-margin hyperplanes has VC dimension bounded by
$$VC(H) \le \min\!\left(\frac{R^2}{\Delta^2},\, n\right) + 1$$
• WLOG, we can make R = 1 by normalizing the data points of each class so that they lie within a sphere of radius R.
• For large margin hyperplanes, the VC-dimension is controlled independently of the dimensionality n of the feature space.
Maximal Margin Classifier
The bounds on error of classification suggest the possibility of improving generalization by maximizing the margin
Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space
SVMs control capacity by increasing the margin, not by reducing the number of features
Maximizing the Margin ⟺ Minimizing ||w||
The definition of the hyperplane (w,b) does not change if we rescale it to (σw, σb), for σ > 0.
The functional margin γ depends on the scaling, but the geometric margin does not.
If we fix (by rescaling) the functional margin to 1, the geometric margin will be equal to 1/||w||.
We can maximize the margin by minimizing the norm ||w||.
Maximizing the Margin ⟺ Minimizing ||w||
Let $\mathbf{x}^+$ and $\mathbf{x}^-$ be the nearest data points in the sample space. Then we have, by fixing the functional margin to be 1:
$$\langle\mathbf{w},\mathbf{x}^+\rangle + b = +1$$
$$\langle\mathbf{w},\mathbf{x}^-\rangle + b = -1$$
$$\langle\mathbf{w},(\mathbf{x}^+ - \mathbf{x}^-)\rangle = 2$$
$$\gamma = \left\langle\frac{\mathbf{w}}{\|\mathbf{w}\|},\ (\mathbf{x}^+ - \mathbf{x}^-)\right\rangle = \frac{2}{\|\mathbf{w}\|}$$
Learning as optimization
Minimize
$$\langle\mathbf{w},\mathbf{w}\rangle$$
subject to:
$$y_i\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big) \ge 1$$
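This quadratic program can be handed to any convex optimization package. A minimal sketch using the cvxpy library (an assumed tool choice; the slides do not prescribe a solver):

```python
import cvxpy as cp
import numpy as np

# toy linearly separable data (placeholder values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

objective = cp.Minimize(cp.sum_squares(w))            # <w, w>
constraints = [cp.multiply(y, X @ w + b) >= 1]        # y_i(<w, x_i> + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # maximal margin hyperplane parameters
```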
Constrained Optimization
• Primal optimization problem
• Given functions f, g_i (i = 1..k) and h_j (j = 1..m), defined on a domain Ω ⊆ ℜⁿ:
$$\begin{aligned}
\text{minimize } & f(\mathbf{w}),\ \mathbf{w}\in\Omega && \text{(objective function)}\\
\text{subject to } & g_i(\mathbf{w}) \le 0,\ i = 1,\ldots,k && \text{(inequality constraints)}\\
& h_j(\mathbf{w}) = 0,\ j = 1,\ldots,m && \text{(equality constraints)}
\end{aligned}$$
Shorthand: $g(\mathbf{w}) \le 0$ denotes $g_i(\mathbf{w}) \le 0$ for $i = 1,\ldots,k$; $h(\mathbf{w}) = 0$ denotes $h_j(\mathbf{w}) = 0$ for $j = 1,\ldots,m$.
Feasible region: $F = \{\mathbf{w} \in \Omega : g(\mathbf{w}) \le 0,\ h(\mathbf{w}) = 0\}$
Optimization problems
• Linear program – objective function as well as equality and inequality constraints are linear
• Quadratic program – objective function is quadratic, and the equality and inequality constraints are linear
• Inequality constraints gi(w) ≤ 0 can be active i.e. gi(w) = 0 or inactive i.e. gi(w) < 0.
• Inequality constraints are often transformed into equality constraints using slack variables: $g_i(\mathbf{w}) \le 0 \;\leftrightarrow\; g_i(\mathbf{w}) + \xi_i = 0$ with $\xi_i \ge 0$
• We will be interested primarily in convex optimization problems
Convex optimization problem
If the function f is convex, any local minimum $\mathbf{w}^*$ of an unconstrained optimization problem with objective function f is also a global minimum, since $f(\mathbf{w}^*) \le f(\mathbf{u})$ for any $\mathbf{u} \ne \mathbf{w}^*$.
A set $\Omega \subset \mathbb{R}^n$ is called convex if, for any $\mathbf{w},\mathbf{u} \in \Omega$ and any $\theta \in (0,1)$, the point $\theta\mathbf{w} + (1-\theta)\mathbf{u} \in \Omega$.
A convex optimization problem is one in which the set Ω, the objective function, and all the constraints are convex.
Lagrangian Theory
Given an optimization problem with an objective function f(w) and equality constraints h_j(w) = 0, j = 1..m, we define the Lagrangian as
$$L(\mathbf{w},\boldsymbol{\beta}) = f(\mathbf{w}) + \sum_{j=1}^{m}\beta_j\,h_j(\mathbf{w})$$
where the β_j are called the Lagrange multipliers. The necessary conditions for w* to be a minimum of f(w) subject to the constraints h_j(w) = 0, j = 1..m, are
$$\frac{\partial L(\mathbf{w}^*,\boldsymbol{\beta}^*)}{\partial\mathbf{w}} = 0,\qquad \frac{\partial L(\mathbf{w}^*,\boldsymbol{\beta}^*)}{\partial\boldsymbol{\beta}} = 0$$
The conditions are sufficient if L(w, β*) is a convex function of w.
Lagrangian Theory Example
Minimize $x^2 + y^2$
Subject to: $x + 2y - 4 = 0$
$$L(x,y,\lambda) = x^2 + y^2 + \lambda(x + 2y - 4)$$
$$\frac{\partial L(x,y,\lambda)}{\partial x} = 2x + \lambda = 0,\qquad \frac{\partial L(x,y,\lambda)}{\partial y} = 2y + 2\lambda = 0,\qquad \frac{\partial L(x,y,\lambda)}{\partial\lambda} = x + 2y - 4 = 0$$
Solving the above, we have $\lambda = -\tfrac{8}{5},\ x = \tfrac{4}{5},\ y = \tfrac{8}{5}$.
f(x,y) is minimized when $x = \tfrac{4}{5},\ y = \tfrac{8}{5}$, with minimum value $\tfrac{16}{5}$.
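The stationarity conditions can also be checked symbolically. A small sketch using sympy (an assumed tool choice, not part of the slides):

```python
from sympy import symbols, diff, solve

x, y, lam = symbols("x y lam", real=True)
L = x**2 + y**2 + lam * (x + 2*y - 4)      # Lagrangian for the example above

sol = solve([diff(L, x), diff(L, y), diff(L, lam)], [x, y, lam], dict=True)[0]
print(sol)                                 # {x: 4/5, y: 8/5, lam: -8/5}
print((x**2 + y**2).subs(sol))             # minimum value 16/5
```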
Lagrangian theory – example
Find the lengths u, v, w of sides of the box that has the largest volume for a given surface area c
Minimize $-uvw$
subject to $uv + vw + wu = \dfrac{c}{2}$
$$L = -uvw + \beta\left(uv + vw + wu - \frac{c}{2}\right)$$
$$\frac{\partial L}{\partial u} = -vw + \beta(v + w) = 0;\qquad \frac{\partial L}{\partial v} = -uw + \beta(u + w) = 0;\qquad \frac{\partial L}{\partial w} = -uv + \beta(u + v) = 0$$
These equations imply $u = v = w$, and hence
$$u = v = w = \sqrt{\frac{c}{6}}$$
Lagrangian Optimization - Example
• The entropy of a probability distribution p = (p₁,...,pₙ) over a finite set {1, 2,..., n} is defined as
$$H(\mathbf{p}) = -\sum_{i=1}^{n} p_i\log_2 p_i$$
• The maximum entropy distribution can be found by minimizing −H(p) subject to the constraints
$$\sum_{i=1}^{n} p_i = 1,\qquad p_i \ge 0\ \ \forall i$$
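The slides leave the solution implicit; as a sketch, carrying the Lagrangian one step further (ignoring the inactive constraints $p_i \ge 0$) shows that the maximum entropy distribution is uniform:

$$L(\mathbf{p},\lambda) = \sum_{i=1}^{n} p_i\log_2 p_i + \lambda\Big(\sum_{i=1}^{n} p_i - 1\Big),\qquad
\frac{\partial L}{\partial p_i} = \log_2 p_i + \frac{1}{\ln 2} + \lambda = 0
\;\Rightarrow\; p_i = 2^{-\lambda - 1/\ln 2}\ \text{(the same for every } i\text{)}
\;\Rightarrow\; p_i = \frac{1}{n}$$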
Generalized Lagrangian Theory
Given an optimization problem with domain Ω ⊆ ℜⁿ,
$$\begin{aligned}
\text{minimize } & f(\mathbf{w}),\ \mathbf{w}\in\Omega && \text{(objective function)}\\
\text{subject to } & g_i(\mathbf{w}) \le 0,\ i = 1,\ldots,k && \text{(inequality constraints)}\\
& h_j(\mathbf{w}) = 0,\ j = 1,\ldots,m && \text{(equality constraints)}
\end{aligned}$$
where f is convex and the g_i and h_j are affine, we can define the generalized Lagrangian function as:
$$L(\mathbf{w},\boldsymbol{\alpha},\boldsymbol{\beta}) = f(\mathbf{w}) + \sum_{i=1}^{k}\alpha_i g_i(\mathbf{w}) + \sum_{j=1}^{m}\beta_j h_j(\mathbf{w})$$
An affine function is a linear function plus a translation: F(x) is affine if F(x) = G(x) + b, where G(x) is a linear function of x and b is a constant.
Generalized Lagrangian Theory: KKT Conditions
Given an optimization problem with domain Ω ⊆ ℜⁿ,
$$\begin{aligned}
\text{minimize } & f(\mathbf{w}),\ \mathbf{w}\in\Omega && \text{(objective function)}\\
\text{subject to } & g_i(\mathbf{w}) \le 0,\ i = 1,\ldots,k && \text{(inequality constraints)}\\
& h_j(\mathbf{w}) = 0,\ j = 1,\ldots,m && \text{(equality constraints)}
\end{aligned}$$
where f is convex and the g_i and h_j are affine, the necessary and sufficient conditions for w* to be an optimum are the existence of α* and β* such that
$$\frac{\partial L(\mathbf{w}^*,\boldsymbol{\alpha}^*,\boldsymbol{\beta}^*)}{\partial\mathbf{w}} = 0,\qquad \frac{\partial L(\mathbf{w}^*,\boldsymbol{\alpha}^*,\boldsymbol{\beta}^*)}{\partial\boldsymbol{\beta}} = 0$$
$$\alpha_i^*\,g_i(\mathbf{w}^*) = 0;\qquad g_i(\mathbf{w}^*) \le 0;\qquad \alpha_i^* \ge 0;\qquad i = 1,\ldots,k$$
Example (KKT case analysis): minimize a convex quadratic objective f(x,y) subject to two inequality constraints $g_1(x,y) \le 0$ and $g_2(x,y) \le 0$.
Form the generalized Lagrangian $L(x,y,\lambda_1,\lambda_2) = f(x,y) + \lambda_1 g_1(x,y) + \lambda_2 g_2(x,y)$ with $\lambda_1,\lambda_2 \ge 0$, set $\partial L/\partial x = 0$ and $\partial L/\partial y = 0$, and examine the four cases:
• Case 1 ($\lambda_1 = 0,\ \lambda_2 = 0$): not a feasible solution, because the resulting point violates one of the constraints.
• Case 2 ($\lambda_1 = 0,\ \lambda_2 \ne 0$): similarly does not yield a feasible solution.
• Case 3 ($\lambda_1 \ne 0,\ \lambda_2 = 0$): the stationarity and complementarity conditions yield values of x, y, λ₁ that satisfy all the constraints; this is a feasible solution.
• Case 4 ($\lambda_1 \ne 0,\ \lambda_2 \ne 0$): does not yield a feasible solution.
Hence the point obtained in Case 3 is the solution.
Dual optimization problem
A primal problem can be transformed into a dual by setting to zero the derivatives of the Lagrangian with respect to the primal variables, and substituting the results back into the Lagrangian to remove the dependence on the primal variables, resulting in
$$\theta(\boldsymbol{\alpha},\boldsymbol{\beta}) = \inf_{\mathbf{w}\in\Omega} L(\mathbf{w},\boldsymbol{\alpha},\boldsymbol{\beta})$$
which contains only dual variables and needs to be maximized under simpler constraints.
The infimum of a set S of real numbers is denoted by inf(S) and is defined to be the largest real number that is smaller than or equal to every number in S. If no such number exists then inf(S) = −∞.
Maximum Margin Hyperplane
The problem of finding the maximal margin hyperplane is a constrained optimization problem.
Use Lagrange theory (extended by Karush, Kuhn, and Tucker – KKT).
Lagrangian:
$$L_p = \frac{1}{2}\langle\mathbf{w},\mathbf{w}\rangle - \sum_i\alpha_i\big[y_i\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big) - 1\big],\qquad \alpha_i \ge 0$$
From Primal to Dual
Minimize L_p(w) with respect to (w, b), requiring that the derivatives of L_p(w) with respect to all the α_i vanish, subject to the constraints α_i ≥ 0.
Differentiating L_p(w):
$$\frac{\partial L_p}{\partial\mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i\alpha_i y_i\mathbf{x}_i$$
$$\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_i\alpha_i y_i = 0$$
Substituting these equality constraints back into L_p(w) we obtain the dual problem.
The Dual Problem
Substituting $\mathbf{w} = \sum_j\alpha_j y_j\mathbf{x}_j$ and $\sum_i\alpha_i y_i = 0$ into $L_p$ gives the dual objective:
$$L_D = \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\langle\mathbf{x}_i,\mathbf{x}_j\rangle - \sum_{i,j}\alpha_i\alpha_j y_i y_j\langle\mathbf{x}_i,\mathbf{x}_j\rangle + \sum_i\alpha_i = \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\langle\mathbf{x}_i,\mathbf{x}_j\rangle$$
Maximize $L_D$ subject to
$$\sum_i\alpha_i y_i = 0 \quad\text{and}\quad \alpha_i \ge 0$$
b has to be solved for using the primal constraints:
$$b = -\frac{1}{2}\Big(\max_{i:\,y_i = -1}\langle\mathbf{w},\mathbf{x}_i\rangle + \min_{i:\,y_i = +1}\langle\mathbf{w},\mathbf{x}_i\rangle\Big)$$
Karush-Kuhn-Tucker Conditions
Solving for the SVM is equivalent to finding a solution to the KKT conditions
$$L_p = \frac{1}{2}\langle\mathbf{w},\mathbf{w}\rangle - \sum_i\alpha_i\big[y_i\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big) - 1\big]$$
$$\frac{\partial L_p}{\partial\mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i\alpha_i y_i\mathbf{x}_i$$
$$\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_i\alpha_i y_i = 0$$
$$y_i\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big) - 1 \ge 0$$
$$\alpha_i\big[y_i\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big) - 1\big] = 0$$
$$\alpha_i \ge 0$$
Karush-Kuhn-Tucker Conditions for SVM
The KKT conditions state that optimal solutions $(\boldsymbol{\alpha},\mathbf{w},b)$ must satisfy
$$\alpha_i\big[y_i\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big) - 1\big] = 0$$
Only the training samples $\mathbf{x}_i$ for which the functional margin = 1 can have nonzero $\alpha_i$. They are called Support Vectors.
The optimal hyperplane can be expressed in the dual representation in terms of this subset of training samples, the support vectors:
$$f(\mathbf{x},\boldsymbol{\alpha},b) = \sum_{i=1}^{l}\alpha_i y_i\langle\mathbf{x}_i,\mathbf{x}\rangle + b = \sum_{i\in\mathrm{sv}}\alpha_i y_i\langle\mathbf{x}_i,\mathbf{x}\rangle + b$$
So we have:
For any support vector $j \in SV$, the KKT conditions give $y_j\Big(\sum_{i\in SV}\alpha_i y_i\langle\mathbf{x}_i,\mathbf{x}_j\rangle + b\Big) = 1$, and hence
$$b = y_j - \sum_{i\in SV}\alpha_i y_i\langle\mathbf{x}_i,\mathbf{x}_j\rangle$$
$$\langle\mathbf{w},\mathbf{w}\rangle = \sum_{i,j\in SV}\alpha_i\alpha_j y_i y_j\langle\mathbf{x}_i,\mathbf{x}_j\rangle = \sum_{j\in SV}\alpha_j y_j\Big(\sum_{i\in SV}\alpha_i y_i\langle\mathbf{x}_i,\mathbf{x}_j\rangle + b\Big) - b\underbrace{\sum_{j\in SV}\alpha_j y_j}_{=\,0\ \text{(KKT)}} = \sum_{j\in SV}\alpha_j y_j\,y_j = \sum_{j\in SV}\alpha_j$$
Hence the geometric margin is
$$\gamma = \frac{1}{\|\mathbf{w}\|} = \Big(\sum_{i\in SV}\alpha_i\Big)^{-1/2}$$
Support Vector Machines Yield Sparse Solutions
(Figure: separating hyperplane with margin γ; the support vectors lie on the margin)
Example: XOR
$$L(\mathbf{w},b,\boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N}\alpha_i\big[y_i\big(\mathbf{w}^T\varphi(\mathbf{x}_i) + b\big) - 1\big]$$
$$Q(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\,\varphi(\mathbf{x}_i)^T\varphi(\mathbf{x}_j)$$
Training data (X → y): [-1,-1] → -1;  [-1,+1] → +1;  [+1,-1] → +1;  [+1,+1] → -1
Kernel: $K(\mathbf{x},\mathbf{x}_i) = (1 + \mathbf{x}^T\mathbf{x}_i)^2$
Feature map: $\varphi(\mathbf{x}) = \big[1,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2\big]^T$
Kernel matrix:
$$K = \begin{bmatrix} 9 & 1 & 1 & 1\\ 1 & 9 & 1 & 1\\ 1 & 1 & 9 & 1\\ 1 & 1 & 1 & 9 \end{bmatrix}$$
Example: XOR
$$Q(\boldsymbol{\alpha}) = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \tfrac{1}{2}\big(9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2\big)$$
Setting the derivatives of Q w.r.t. the dual variables to 0:
$$\begin{aligned}
9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 &= 1\\
-\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 &= 1\\
-\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 &= 1\\
\alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 &= 1
\end{aligned}$$
$$\alpha_{o,1} = \alpha_{o,2} = \alpha_{o,3} = \alpha_{o,4} = \frac{1}{8},\qquad Q(\boldsymbol{\alpha}_o) = \frac{1}{4}$$
Not surprisingly, each of the training samples is a support vector.
$$\mathbf{w}_o = \sum_{i=1}^{N}\alpha_{o,i}\,y_i\,\varphi(\mathbf{x}_i) = \Big[0,\ 0,\ -\tfrac{1}{\sqrt{2}},\ 0,\ 0,\ 0\Big]^T,\qquad \mathbf{w}_o^T\mathbf{w}_o = \frac{1}{2}$$
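A short numerical check of this worked example (illustrative; it simply re-evaluates the quantities derived above):

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])
K = (1.0 + X @ X.T) ** 2              # kernel matrix: 9 on the diagonal, 1 elsewhere

alpha = np.full(4, 1.0 / 8.0)         # the dual solution found above
# b from any support vector j: b = y_j - sum_i alpha_i y_i K(x_i, x_j)
b = y[0] - np.sum(alpha * y * K[:, 0])

f = K.T @ (alpha * y) + b             # f(x_k) = sum_i alpha_i y_i K(x_i, x_k) + b
print(f)                              # [-1.  1.  1. -1.] -> XOR is realized
```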
Example – Two Spirals
Summary of Maximal margin classifier
• Good generalization when the training data is noise free and separable in the kernel-induced feature space
• Excessively large Lagrange multipliers often signal outliers – data points that are most difficult to classify – SVM can be used for identifying outliers
• Focuses on boundary points. If the boundary points are mislabeled (noisy training data):
  – the result is a terrible classifier if the training set is separable
  – noisy training data often makes the training set non-separable
  – use of highly nonlinear kernels to make the training set separable increases the likelihood of overfitting, despite maximizing the margin
Non-Separable Case
• In the case of non-separable data in feature space, the objective function grows arbitrarily large.
• Solution: Relax the constraint that all training data be correctly classified, but only when necessary to do so.
• Cortes and Vapnik (1995) introduced slack variables $\xi_i,\ i = 1,\ldots,l$ in the constraints:
$$\xi_i = \max\big(0,\ \gamma - y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b)\big)$$
Non-Separable Case
The generalization error can be shown to be bounded by
$$\varepsilon \le \frac{1}{l}\left(\frac{R^2 + \sum_i\xi_i^2}{\gamma^2}\right)$$
where
$$\xi_i = \max\big(0,\ \gamma - y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b)\big)$$
(Figure: slack variables $\xi_i$, $\xi_j$ for points that violate the margin)
Non-Separable Case
For an error to occur, the corresponding $\xi_i$ must exceed $\gamma$ (which was chosen to be unity). So $\sum_i\xi_i$ is an upper bound on the number of training errors. The objective function is therefore changed from
$$\frac{\|\mathbf{w}\|^2}{2}$$
to
$$\frac{\|\mathbf{w}\|^2}{2} + C\sum_i\xi_i^k$$
where C and k are parameters (typically k = 1 or 2, and C is a large positive constant).
The Soft-Margin Classifier
Minimize:
$$\frac{1}{2}\langle\mathbf{w},\mathbf{w}\rangle + C\sum_i\xi_i$$
Subject to:
$$y_i\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big) \ge 1 - \xi_i,\qquad \xi_i \ge 0$$
The primal Lagrangian is
$$L_P = \underbrace{\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i\xi_i}_{\text{objective function}} \;-\; \underbrace{\sum_i\alpha_i\big[y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) - 1 + \xi_i\big]}_{\text{inequality constraints}} \;-\; \underbrace{\sum_i\mu_i\xi_i}_{\text{non-negativity constraints on }\xi_i}$$
The Soft-Margin Classifier: KKT Conditions
$$\alpha_i \ge 0,\qquad \xi_i \ge 0,\qquad \mu_i \ge 0$$
$$y_i\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big) - 1 + \xi_i \ge 0$$
$$\alpha_i\big[y_i\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big) - 1 + \xi_i\big] = 0$$
$$\mu_i\,\xi_i = 0$$
• $\alpha_i > 0$ only if $\mathbf{x}_i$ is a support vector, i.e. $\langle\mathbf{w},\mathbf{x}_i\rangle + b = \pm 1$ and hence $\xi_i = 0$, or $\mathbf{x}_i$ is misclassified by $(\mathbf{w},b)$ and hence $\xi_i > 0$.
From primal to dual
$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i\xi_i - \sum_i\alpha_i\big[y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) - 1 + \xi_i\big] - \sum_i\mu_i\xi_i$$
Equating the derivatives of $L_P$ w.r.t. the primal variables to 0:
$$\frac{\partial L_P}{\partial\mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i\alpha_i y_i\mathbf{x}_i$$
$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_i\alpha_i y_i = 0$$
$$\frac{\partial L_P}{\partial\xi_i} = 0 \;\Rightarrow\; \alpha_i + \mu_i = C \;\Rightarrow\; 0 \le \alpha_i \le C$$
Substituting back into $L_P$ we get the dual $L_D$ to be maximized:
$$L_D = \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\langle\mathbf{x}_i,\mathbf{x}_j\rangle$$
Soft Margin-Dual Lagrangian
Maximize
$$L_D = \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\langle\mathbf{x}_i,\mathbf{x}_j\rangle$$
Subject to: $0 \le \alpha_i \le C$ and $\sum_i\alpha_i y_i = 0$
KKT conditions:
$$\alpha_i\big[y_i\big(\langle\mathbf{w},\mathbf{x}_i\rangle + b\big) - 1 + \xi_i\big] = 0,\qquad (C - \alpha_i)\,\xi_i = 0$$
Slack variables are non-zero (misclassified training sample) only when $\alpha_i = C$.
For true support vectors, $0 < \alpha_i < C$.
For all other (correctly classified) training samples, $\alpha_i = 0$.
Implementation Techniques
Maximizing a quadratic function, subject to linear equality and inequality constraints:
$$W(\boldsymbol{\alpha}) = \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i,\mathbf{x}_j)$$
$$0 \le \alpha_i \le C,\qquad \sum_i\alpha_i y_i = 0$$
$$\frac{\partial W(\boldsymbol{\alpha})}{\partial\alpha_i} = 1 - y_i\sum_j\alpha_j y_j K(\mathbf{x}_i,\mathbf{x}_j)$$
On-line algorithm for the 1-norm soft margin (bias b = 0)
Given training set D
α ← 0
repeat
  for i = 1 to l
    α_i ← α_i + η(1 − y_i Σ_j α_j y_j K(x_j, x_i)),  with η = 1 / K(x_i, x_i)
    if α_i < 0 then α_i ← 0
    else if α_i > C then α_i ← C
  end for
until stopping criterion satisfied
return α
In the general setting, b ≠ 0 and we need to solve for b.
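A runnable Python version of the algorithm above (a sketch: the fixed number of epochs stands in for the unspecified stopping criterion, and the data and kernel are placeholders):

```python
import numpy as np

def online_soft_margin(X, y, kernel, C=10.0, epochs=50):
    """1-norm soft margin with bias b = 0:
    alpha_i <- alpha_i + eta*(1 - y_i * sum_j alpha_j y_j K(x_j, x_i)),
    clipped to [0, C], with eta = 1 / K(x_i, x_i)."""
    l = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    alpha = np.zeros(l)
    for _ in range(epochs):                          # simple stopping criterion
        for i in range(l):
            eta = 1.0 / K[i, i]
            alpha[i] += eta * (1.0 - y[i] * np.sum(alpha * y * K[:, i]))
            alpha[i] = min(max(alpha[i], 0.0), C)    # clip to the box [0, C]
    return alpha

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2))
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = online_soft_margin(X, y, rbf)
```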
Implementation Techniques
Use QP packages (MINOS, LOQO, quadprog from MATLAB optimization toolbox). They are not online and require that the data are held in memory in the form of kernel matrix
Stochastic Gradient Ascent: sequentially update one weight at a time, so it is online. Gives an excellent approximation in most cases:
$$\hat{\alpha}_i \leftarrow \alpha_i + \frac{1}{K(\mathbf{x}_i,\mathbf{x}_i)}\Big(1 - y_i\sum_j\alpha_j y_j K(\mathbf{x}_i,\mathbf{x}_j)\Big)$$
Chunking and Decomposition
Given training set D
α ← 0
Select an arbitrary working set D̂ ⊂ D
repeat
  solve the optimization problem on D̂
  select a new working set D̂ from data not satisfying the KKT conditions
until stopping criterion satisfied
return α
Sequential Minimal Optimization (SMO)
At each step, SMO optimizes two weights $\alpha_i$, $\alpha_j$ simultaneously in order not to violate the linear constraint $\sum_i\alpha_i y_i = 0$.
Optimization of the two weights is performed analytically.
Realizes gradient descent without leaving the linear constraint (J. Platt).
Online versions exist (Li, Long; Gentile).
SVM Implementations
• Details of optimization and Quadratic Programming
• SVMlight – one of the first practical implementations of SVM (Joachims)
• Matlab toolboxes for SVM
• LIBSVM – one of the best SVM implementations, by Chih-Chung Chang and Chih-Jen Lin of National Taiwan University http://www.csie.ntu.edu.tw/~cjlin/libsvm/
• WLSVM – LIBSVM integrated with the WEKA machine learning toolbox, by Yasser El-Manzalawy of the ISU AI Lab http://www.cs.iastate.edu/~yasser/wlsvm/
Why does SVM Work?
• The feature space is often very high dimensional. Why don't we have the curse of dimensionality?
• A classifier in a high-dimensional space has many parameters and is hard to estimate
• Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the capacity of a classifier
• Typically, a classifier with many parameters is very flexible, but there are also exceptions
  – Let x_i = 10^i, where i ranges from 1 to n. The classifier sgn(sin(αx)) from the earlier example can classify all x_i correctly for every possible combination of class labels on the x_i
  – This 1-parameter classifier is very flexible
Why does SVM work?
• Vapnik argues that a classifier should not be characterized by the number of its parameters, but by its capacity
  – This is formalized by the "VC-dimension" of a classifier
• The minimization of ||w||² subject to the condition that the functional margin = 1 has the effect of restricting the VC-dimension of the classifier in the feature space
• The SVM performs structural risk minimization: the empirical risk (training error), plus a term related to the generalization ability of the classifier, is minimized
• SVM loss function is analogous to ridge regression. The term ½||w||2 “shrinks” the parameters towards zero to avoid overfitting
Choosing the Kernel Function
• Probably the most tricky part of using SVM
• The kernel function should maximize the similarity among instances within a class while accentuating the differences between classes
• A variety of kernels have been proposed (diffusion kernel, Fisher kernel, string kernel, ...) for different types of data
• In practice, a low degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for data that live in a fixed dimensional input space
• Low order Markov kernels and their relatives are good ones to consider for structured data – strings, images, etc.
Other Aspects of SVM
• How to use SVM for multi-class classification?
  – One can change the QP formulation to become multi-class
  – More often, multiple binary classifiers are combined
  – One can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers "intelligently"
• How to interpret the SVM discriminant function value as a probability?
  – By performing logistic regression on the SVM output for a set of data that is not used for training (see the sketch below)
• Some SVM software (like libsvm) has these features built in
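A sketch of the logistic-regression approach mentioned above, using scikit-learn as an assumed implementation vehicle (libsvm exposes an equivalent built-in option); the data are placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# hold out data that is NOT used to train the SVM
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
scores = svm.decision_function(X_cal).reshape(-1, 1)

# logistic regression maps the SVM discriminant value to a probability
calibrator = LogisticRegression().fit(scores, y_cal)
proba = calibrator.predict_proba(svm.decision_function(X[:5]).reshape(-1, 1))[:, 1]
print(proba)
```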
Recap: How to Use SVM
• Prepare the data set
• Select the kernel function to use
• Select the parameters of the kernel function and the value of C
  – You can use the values suggested by the SVM software
  – You can use a tuning set to tune C
• Execute the training algorithm and obtain the α_i and b
• Unseen data can be classified using the α_i, the support vectors, and b
Strengths and Weaknesses of SVM
• Strengths
  – Training is relatively easy: no local optima
  – It scales relatively well to high dimensional data
  – The tradeoff between classifier complexity and error can be controlled explicitly
  – Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
• Weaknesses
  – Need to choose a "good" kernel function
Recent developments
• Better understanding of the relation between SVM and regularized discriminative classifiers (logistic regression)
• Knowledge-based SVM – incorporating prior knowledge as constraints in SVM optimization (Shavlik et al)
• A zoo of kernel functions• Extensions of SVM to multi-class and structured label
classification tasks • One class SVM (anomaly detection)
Representative Applications of SVM
• Handwritten letter classification
• Text classification
• Image classification – e.g., face recognition
• Bioinformatics
  – Gene expression based tissue classification
  – Protein function classification
  – Protein structure classification
  – Protein sub-cellular localization
• ...