CS 8751 ML & KDD Support Vector Machines 1
Support Vector Machines (SVMs)
• Learning mechanism based on linear programming
• Chooses a separating plane based on maximizing the notion of a margin
  – Based on PAC learning
• Has mechanisms for
  – Noise
  – Non-linear separating surfaces (kernel functions)
• Notes based on those of Prof. Jude Shavlik
CS 8751 ML & KDD Support Vector Machines 2
Support Vector Machines

[Figure: two clusters of examples, A+ and A-, in feature space, with several candidate separating planes drawn between them.]

Find the best separating plane in feature space - many possibilities to choose from. Which is the best choice?
CS 8751 ML & KDD Support Vector Machines 3
SVMs – The General Idea
• How to pick the best separating plane?
• Idea:
  – Define a set of inequalities we want to satisfy
  – Use advanced optimization methods (e.g., linear programming) to find satisfying solutions
• Key issues:
  – Dealing with noise
  – What if there is no good linear separating surface?
CS 8751 ML & KDD Support Vector Machines 4
Linear Programming
• Subset of math programming
• Problem has the following form:

  function f(x1, x2, x3, …, xn) to be maximized
  subject to a set of constraints of the form:
    g(x1, x2, x3, …, xn) ≥ b

• Math programming - find a set of values for the variables x1, x2, x3, …, xn that meets all of the constraints and maximizes the function f
• Linear programming - solving math programs where the constraint functions and the function to be maximized use linear combinations of the variables
  – Generally easier than the general math programming problem
  – Well-studied problem
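As a small illustration (assuming SciPy is available; the slides do not name a particular solver), here is a tiny linear program solved with `scipy.optimize.linprog`. Since `linprog` minimizes, we negate the objective to maximize, and rewrite the ≥ constraints as ≤:

```python
from scipy.optimize import linprog

# maximize f(x1, x2) = 3*x1 + 2*x2
# subject to: x1 + x2 <= 4, x1 <= 2, x1 >= 0, x2 >= 0
c = [-3, -2]                  # linprog minimizes, so negate to maximize f
A_ub = [[1, 1], [1, 0]]       # left-hand sides of the <= constraints
b_ub = [4, 2]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal point:", res.x)     # (2, 2)
print("maximal f:", -res.fun)      # 10
```

The objective and constraints here are made up for illustration; any linear f and g would do.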
CS 8751 ML & KDD Support Vector Machines 5
Maximizing the Margin

[Figure: examples A+ and A- with the decision boundary drawn between them.]

The margin between categories - want this distance to be maximal (we’ll assume linearly separable for now).
CS 8751 ML & KDD Support Vector Machines 6
PAC Learning
• PAC – Probably Approximately Correct learning
• Theorems that can be used to define bounds for the risk (error) of a family of learning functions
• Basic formula, with probability (1 - η):

  R(α) ≤ Remp(α) + sqrt( [h(log(2N/h) + 1) - log(η/4)] / N )

• R – risk function, α is the parameters chosen by the learner, N is the number of data points, and h is the VC dimension (something like an estimate of the complexity of the class of functions)
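To get a feel for the bound, this small sketch evaluates the square-root (confidence) term for hypothetical values of h and η; note how it shrinks as the number of examples N grows:

```python
import math

def vc_confidence(N, h, eta):
    # sqrt((h * (log(2N/h) + 1) - log(eta/4)) / N)
    return math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(eta / 4)) / N)

# hypothetical values: VC dimension h = 10, confidence parameter eta = 0.05
for N in (100, 1000, 10000):
    print(N, round(vc_confidence(N, h=10, eta=0.05), 3))
```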
CS 8751 ML & KDD Support Vector Machines 7
Margins and PAC Learning
• Theorems connect PAC theory to the size of the margin
• Basically, the larger the margin, the better the expected accuracy
• See, for example, Chapter 4 of An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2000
CS 8751 ML & KDD Support Vector Machines 8
Some Equations

Separating plane:  w·x = γ
  w - weights,  x - input features,  γ - threshold

For all positive examples:  w·x ≥ γ + 1
For all negative examples:  w·x ≤ γ - 1
  (the 1s result from dividing through by a constant, for convenience)

Distance between the red and blue planes (the margin):  2 / ||w||₂
  where ||w||₂ is the Euclidean length (“2-norm”) of the weight vector
CS 8751 ML & KDD Support Vector Machines 9
What the Equations Mean

[Figure: examples A+ and A- with two parallel planes, x´w = γ + 1 and x´w = γ - 1; the examples lying on these planes are the support vectors, and the gap between the planes is the margin, of width 2 / ||w||₂.]
CS 8751 ML & KDD Support Vector Machines 10
Choosing a Separating Plane

[Figure: examples A+ and A- with a candidate separating plane marked “?”.]
CS 8751 ML & KDD Support Vector Machines 11
Our “Mathematical Program” (so far)

  min over w, γ of  ||w||₂²

  such that
    w·x ≥ γ + 1  (for pos examples)
    w·x ≤ γ - 1  (for neg examples)

• Note: w, γ are our adjustable parameters (we could, of course, use the ANN “trick” and move γ to the left side of our inequalities)
• We can now use existing math programming optimization software to find a solution to the above (a global optimal solution)
• For technical reasons it is easier to optimize this “quadratic program” (the squared norm) than the plain norm ||w||₂
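This program can be handed to a generic solver. A minimal sketch using SciPy's SLSQP method on a tiny hand-made data set (the data and the solver choice are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0],   # positive examples
              [0.0, 0.0], [1.0, 0.0]])  # negative examples
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):              # p = (w1, w2, gamma)
    w = p[:2]
    return 0.5 * (w @ w)       # minimize ||w||^2 / 2

# y_i * (w . x_i - gamma) >= 1 encodes both inequality families at once
cons = [{"type": "ineq",
         "fun": (lambda p, i=i: y[i] * (p[:2] @ X[i] - p[2]) - 1.0)}
        for i in range(len(X))]

# start from a feasible guess (w = (1, 1), gamma = 2.5)
res = minimize(objective, x0=np.array([1.0, 1.0, 2.5]), constraints=cons)
w, gamma = res.x[:2], res.x[2]
print("w =", w, "gamma =", gamma, "margin =", 2 / np.linalg.norm(w))
```

On this data the optimizer recovers the maximum-margin plane; every example ends up on or outside its margin plane.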
CS 8751 ML & KDD Support Vector Machines 12
Dealing with Non-Separable Data

We can add what is called a “slack” variable to each example. This variable can be viewed as:
  0, if the example is correctly separated
  y, the “distance” we need to move the example to make it correct (i.e., its distance from its surface)
CS 8751 ML & KDD Support Vector Machines 13
“Slack” Variables

[Figure: examples A+ and A- with the two margin planes; one example lies on the wrong side, at distance y from its plane; the examples on the planes are the support vectors.]
CS 8751 ML & KDD Support Vector Machines 14
The Math Program with Slack Variables

  min over w, γ, s of  ||w||₂² + C ||s||₁

  such that
    w·xi ≥ γ + 1 - si  (for each positive example i)
    w·xj ≤ γ - 1 + sj  (for each negative example j)
    sk ≥ 0  (for every example k)

  w - one weight for each input feature
  s - one slack variable for each example
  C - scaling constant
  ||s||₁ - “one norm”: sum of the components (all positive)

This is the “traditional” Support Vector Machine
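This traditional soft-margin formulation is essentially what off-the-shelf SVM packages solve. A sketch using scikit-learn's SVC (an assumption that scikit-learn is available; it optimizes the closely related form ||w||²/2 + C·Σ slack), with a deliberately noisy point that no plane can separate:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0],      # clearly positive
              [0.0, 0.0], [1.0, 0.0],      # clearly negative
              [0.5, 0.0]])                 # "noisy" positive on the wrong side
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:\n", clf.support_vectors_)
print(clf.predict([[2.5, 2.5], [-1.0, -1.0]]))
```

Raising C penalizes slack more heavily; lowering it tolerates more misclassified training points in exchange for a wider margin.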
CS 8751 ML & KDD Support Vector Machines 15
Why the word “Support”?
• All those examples on, or on the wrong side of, the two separating planes are the support vectors
  – We’d get the same answer if we deleted all the non-support vectors!
  – i.e., the “support vectors [examples]” support the solution
CS 8751 ML & KDD Support Vector Machines 16
PAC and the Number of Support Vectors
• The fewer the support vectors, the better the generalization will be
• Recall, non-support vectors are
  – Correctly classified
  – Don’t change the learned model if left out of the training set
• So:

  leave-one-out error rate ≤ (# support vectors) / (# training examples)
CS 8751 ML & KDD Support Vector Machines 17
Finding Non-Linear Separating Surfaces
• Map inputs into a new space

  Example features:  x1  x2
                      5   4

  becomes:           x1  x2  x1²  x2²  x1*x2
                      5   4  25   16   20

• Solve the SVM program in this new space
  – Computationally complex if there are many features
  – But a clever trick exists
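The expansion in the table above, written out as code:

```python
# the expansion from the table: (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)
def expand(x1, x2):
    return (x1, x2, x1 ** 2, x2 ** 2, x1 * x2)

print(expand(5, 4))   # (5, 4, 25, 16, 20), matching the table
```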
CS 8751 ML & KDD Support Vector Machines 18
The Kernel Trick
• Optimization problems often have a “primal” and a “dual” representation
  – So far we’ve looked at the primal formulation
  – The dual formulation is better for the case of a non-linear separating surface
CS 8751 ML & KDD Support Vector Machines 19
Perceptrons Re-Visited

In perceptrons (with T = 1 and F = -1), if example i is currently misclassified we change the weights:

  wk+1 = wk + yi xi

So:

  wfinal = winitial + Σi αi yi xi   (sum over i = 1..#examples)

where αi is the number of times we got example i wrong and changed the weights. (This assumes winitial = 0, all zeros.)
CS 8751 ML & KDD Support Vector Machines 20
Dual Form of the Perceptron Learning Rule

Output of perceptron:

  h(x) = sgn(w·x)

  sgn(z) = 1 if z ≥ 0, -1 otherwise

So:

  h(x) = sgn( Σi αi yi (xi·x) )   (sum over i = 1..#examples)

New (i.e., dual) perceptron algorithm:

  For each example i:
    if yi Σj αj yj (xj·xi) ≤ 0   (i.e., predicted ≠ teacher)
    then αi ← αi + 1   (counts errors)
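A sketch of the dual algorithm on a toy separable data set (the data are made up for illustration); note that the algorithm only ever uses dot products between training examples:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
alpha = np.zeros(len(X))        # alpha[i] = number of mistakes on example i

def score(x):
    # sum_j alpha_j * y_j * (x_j . x): only dot products with training points
    return sum(alpha[j] * y[j] * (X[j] @ x) for j in range(len(X)))

for _ in range(20):                     # a few passes over the data
    for i in range(len(X)):
        if y[i] * score(X[i]) <= 0:     # predicted != teacher
            alpha[i] += 1               # count the error

w = sum(alpha[j] * y[j] * X[j] for j in range(len(X)))  # recover primal weights
print("alpha =", alpha, "w =", w)
```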
CS 8751 ML & KDD Support Vector Machines 21
Primal versus Dual Space
• Primal – “weight space”
  – Weight features to make the output decision

    h(xnew) = sgn(w·xnew)

• Dual – “training-examples space”
  – Weight distances (which are based on the features) to the training examples

    h(xnew) = sgn( Σj αj yj (xj·xnew) )   (sum over j = 1..#examples)
CS 8751 ML & KDD Support Vector Machines 22
The Dual SVM

Let n = # training examples. The dual program:

  min over α of  (1/2) Σi Σj αi αj yi yj (xi·xj) - Σi αi   (sums over i, j = 1..n)

  such that
    Σi yi αi = 0
    αi ≥ 0

Can convert back to primal form:

  w = Σi αi yi xi
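A sketch of solving this dual program on a toy data set, again with SciPy's SLSQP (the solver and data are illustrative assumptions). We minimize the dual objective as written above, then convert back to the primal weights:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(X)
G = (y[:, None] * y[None, :]) * (X @ X.T)   # G_ij = y_i y_j (x_i . x_j)

def dual_objective(a):
    return 0.5 * a @ G @ a - a.sum()

res = minimize(dual_objective, x0=np.zeros(n),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],
               bounds=[(0, None)] * n)
alpha = res.x
w = (alpha * y) @ X                          # convert back: w = sum_i alpha_i y_i x_i
print("alpha =", alpha.round(3))
print("w =", w.round(3))
print("support vectors:", X[alpha > 1e-6])
```

Only the examples with non-zero alpha (the support vectors) contribute to w.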
CS 8751 ML & KDD Support Vector Machines 23
Non-Zero αi’s

Those examples with αi > 0 are the support vectors.

Recall:

  w = Σi αi yi xi   (sum over i = 1..n)

– only the support vectors (i.e., those with αi > 0) contribute to the weights
CS 8751 ML & KDD Support Vector Machines 24
Generalizing the Dot Product

We can generalize Dot_Product(xi, xj) to other “kernel functions” K(xi, xj).

An acceptable (usually non-linear) kernel K maps the original features into a new space implicitly:
– in this new space we’re computing a dot product
– we don’t need to explicitly know the features in the new space
– usually more efficient than directly converting to the new space
CS 8751 ML & KDD Support Vector Machines 25
The New Space for a Sample Kernel

Let K(x, z) = (x·z)² and # features = 2:

  (x·z)² = (x1 z1 + x2 z2)²
         = x1 z1 x1 z1 + x1 z1 x2 z2 + x2 z2 x1 z1 + x2 z2 x2 z2
         = ⟨x1 x1, x1 x2, x2 x1, x2 x2⟩ · ⟨z1 z1, z1 z2, z2 z1, z2 z2⟩

Our new feature space (with 4 dimensions) - we’re doing a dot product in it
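We can check this algebra numerically: the implicit kernel (x·z)² and the explicit dot product in the 4-dimensional space agree (the particular x and z values are arbitrary):

```python
import numpy as np

def phi(v):
    # explicit map to the 4-D space: (v1, v2) -> (v1*v1, v1*v2, v2*v1, v2*v2)
    return np.array([v[0] * v[0], v[0] * v[1], v[1] * v[0], v[1] * v[1]])

x = np.array([5.0, 4.0])
z = np.array([1.0, 2.0])

k_implicit = (x @ z) ** 2      # kernel computed in the original 2-D space
k_explicit = phi(x) @ phi(z)   # ordinary dot product in the derived 4-D space
print(k_implicit, k_explicit)  # both 169.0
```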
CS 8751 ML & KDD Support Vector Machines 26
Visualizing the Kernel

[Figure: in the Input Space (Original Space), + and - examples are separated by a boundary that is non-linear there; the transformation g() maps each example into the Derived Feature Space (New Space), where the separating plane is linear.]

• g() is the feature transformation function
• The process is similar to what hidden units do in ANNs, but the kernel is user-chosen
CS 8751 ML & KDD Support Vector Machines 27
More Sample Kernels

1) K(x, z) = (x·z + const)^d   (polynomial)

2) K(x, z) = e^( -||x - z||₂² / (2σ²) )
   – the Gaussian kernel; leads to an RBF network

3) K(x, z) = tanh( c (x·z) + d )
   – related to the sigmoid of ANNs

plus many more, including many designed for specific tasks (text, DNA, etc.)
CS 8751 ML & KDD Support Vector Machines 28
What Makes a Kernel

Mercer’s theorem characterizes when a function f(x, z) is a kernel.

If K1() and K2() are kernels, then so are:
1) K1() + K2()
2) c * K1(), where c is a positive constant
3) K1() * K2()
4) f(x) f(z), where f returns a real
...
CS 8751 ML & KDD Support Vector Machines 29
Key SVM Ideas
• Maximize the margin between positive and negative examples (connects to PAC theory)
• Penalize errors in the non-separable case
• Only the support vectors contribute to the solution
• Kernels map examples into a new, usually non-linear space
  – We implicitly do dot products in this new space (in the “dual” form of the SVM program)