Support Vector Machines
Jordan Smith
MUMT 611
14 February 2008
Topics to cover
What do Support Vector Machines (SVMs) do?
How do SVMs work?
Linear data
Non-linear data (kernel functions)
Unseparable data (added cost function)
Search optimization
Why?
What SVMs do
[Figure, built up over several slides: two classes of points separated by the optimum separating hyperplane; legend: margin, support vectors. Sherrod 230]
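To ground the picture, a hedged sketch (not from the talk; the toy points and library choice are assumptions) of an SVM doing exactly this: fit on two clusters, then classify a new point by which side of the learned hyperplane it falls on.

import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters, one per class
X = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.5],
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[1.0, 1.0], [4.0, 3.0]]))   # -> [0 1]
print(clf.decision_function([[1.0, 1.0]]))     # sign = side of the hyperplane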
Topics to cover
What do Support Vector Machines (SVMs) do?
How do SVMs work?
Linear data
Non-linear data (kernel functions)
Unseparable data (added cost function)
Search optimization
Why?
The linear, separable case
Training data {x_i, y_i}
Separating hyperplane defined by normal vector w
hyperplane equation: w·x + b = 0
distance from plane to origin: |b|/|w|
Distances from the hyperplane to the nearest point in each class are d+ and d−
Goal: maximize the margin, d+ + d−
The linear, separable The linear, separable casecase
1)1) x xii·w ·w + b ≥ +1+ b ≥ +1 (for y(for yii = +1) = +1)
2) 2) xxii·w ·w + b ≤ -1+ b ≤ -1 (for y(for yii = -1) = -1) yyii((xxii·w ·w + b) - 1 ≥ 0 + b) - 1 ≥ 0 for our support vectors, distance from originfor our support vectors, distance from origin
to plane = |1-b|/|w|to plane = |1-b|/|w|
AlgebraAlgebra d d++ + d + d-- = 2 / |w| = 2 / |w|
New goal:New goal:maximize: 2 /|w|maximize: 2 /|w| i.e.,i.e., minimize: |w|minimize: |w|
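A minimal sketch of this linear case in Python with scikit-learn (an assumption: the talk names no library, and the toy points are hypothetical). It fits an approximately hard-margin linear SVM and recovers w, b, the support vectors, and the margin width 2/|w| derived above.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
              [5.0, 5.0], [5.5, 4.0], [6.0, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C ~ hard margin

w = clf.coef_[0]       # normal vector w of the hyperplane w·x + b = 0
b = clf.intercept_[0]  # offset b
print("w =", w, " b =", b)
print("margin d+ + d- =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)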
Nonlinear SVMs
[Figure: data that no straight line can separate in the input space. Sherrod 235]
Nonlinear SVMs
Kernel trick:
Map the data into a higher-dimensional space using Φ: R^d → H
The training problem involves only dot products, so H can even be of infinite dimension
The kernel trick makes nonlinear solutions linear again!
YouTube example
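To make the trick concrete, a hedged sketch (not from the talk): a degree-2 polynomial kernel computes the dot product Φ(x)·Φ(z) in a higher-dimensional space without ever constructing Φ. The feature map phi below is one standard choice for R².

import numpy as np

def phi(x):
    # Explicit feature map for K(x, z) = (x·z)^2 on R^2
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(x, z):
    # Same quantity computed directly in R^2: the kernel trick
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(np.dot(phi(x), phi(z)))  # 121.0, dot product in the mapped space H
print(poly_kernel(x, z))       # 121.0, no mapping needed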
Nonlinear SVMs
Radial basis function: K(x_i, x_j) = exp(−γ |x_i − x_j|²)
[Figure: Sherrod 236]
Nonlinear SVMs
Sigmoid: K(x_i, x_j) = tanh(κ x_i·x_j + c)
[Figure: Sherrod 237]
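An illustrative sketch (assumed, not from the talk): both kernels just named are available by name in scikit-learn. The data below is a made-up circular-boundary problem that no linear SVM can separate.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # circular boundary

rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)  # K = exp(-gamma |xi - xj|^2)
sig = SVC(kernel="sigmoid").fit(X, y)         # K = tanh(gamma xi·xj + coef0)
print("RBF training accuracy:    ", rbf.score(X, y))
print("sigmoid training accuracy:", sig.score(X, y))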
Another demonstration
applet
The unseparable case
Classifiers need to have a balanced capacity:
Bad botanist (too much capacity, overfits): “It has 847 leaves. Not a tree!”
Bad botanist (too little capacity, underfits): “It’s green. That’s a tree!”
The unseparable case
[Figure: Sherrod 237]
The unseparable case
[Figure: separating hyperplane with a fuzzy margin; legend: error, fuzzy margin]
The unseparable case
Add a cost function, with slack variables ξ_i:
x_i·w + b ≥ +1 − ξ_i (for y_i = +1)
x_i·w + b ≤ −1 + ξ_i (for y_i = −1)
ξ_i ≥ 0
old goal: minimize |w|²/2
new goal: minimize |w|²/2 + C(∑_i ξ_i)^k
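A hedged sketch of what the cost constant C does in practice (illustrative; scikit-learn's C plays the same trade-off role, with k = 1). Small C tolerates more slack and gives a wider, fuzzier margin; large C punishes errors and narrows it.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two overlapping clusters: not perfectly separable
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),   # class -1
               rng.normal(2.0, 1.0, (100, 2))])  # class +1
y = np.array([-1] * 100 + [1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    n_sv = clf.support_vectors_.shape[0]
    print(f"C={C:7.2f}  margin={margin:.3f}  support vectors={n_sv}")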
Optimizing your search
To find the separating hyperplane, you must tune several parameters, depending on which kernel function you select:
C, the cost constant
gamma, etc.
There are two basic methods (sketched below):
Grid search
Pattern search
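A minimal grid-search sketch (an assumption: the talk shows no code). GridSearchCV evaluates every (C, gamma) pair by cross-validation; a pattern search would instead start from a coarse grid and refine around the current best point.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR-like toy problem

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters: ", search.best_params_)
print("best CV accuracy:", search.best_score_)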
Topics to cover
What do Support Vector Machines (SVMs) do?
How do SVMs work?
Linear data
Non-linear data (kernel functions)
Unseparable data (added cost function)
Search optimization
Why?
Why use SVMs?
Uses:
Optical character recognition
Spam detection
MIR:
genre, artist classification (Mandel 2004, 2005)
mood classification (Laurier 2007)
popularity classification, based on lyrics (Dhanaraj 2005)
Why use SVMs?Why use SVMs? Machine learner of choice for high-Machine learner of choice for high-
dimensional data, such as text, images, dimensional data, such as text, images, music!music!
Conceptually simple.Conceptually simple.
Generalizable and efficient.Generalizable and efficient.
Next slides: results of a benchmark study Next slides: results of a benchmark study (Meyer 2004) comparing SVMs and other (Meyer 2004) comparing SVMs and other learning techniqueslearning techniques
Questions?
Key References

Burges, C. J. C. 1998. "A tutorial on support vector machines for pattern recognition." Data Mining and Knowledge Discovery 2 (2): 121–167. http://citeseer.ist.psu.edu/burges98tutorial.html

Cortes, C., and V. Vapnik. 1995. "Support-vector networks." Machine Learning 20 (3): 273–297. http://citeseer.ist.psu.edu/cortes95supportvector.html

Sherrod, Phillip H. 2008. DTREG: Predictive Modeling Software (user's guide), 227–241. http://www.dtreg.com/DTREG.pdf

Smola, A. J., and B. Schölkopf. 1998. "A tutorial on support vector regression." NeUROCOLT Technical Report NC-TR-98-030. Royal Holloway College, London.