Support Vector Machines
Jordan Smith
MUMT 611
14 February 2008
Topics to cover
What do Support Vector Machines (SVMs) do?
How do SVMs work?
Linear data
Non-linear data (kernel functions)
Unseparable data (added cost function)
Search optimization
Why?
What SVMs do
[Figure, built up over several slides: two classes of points separated by the optimum separating hyperplane; legend: margin, support vectors. Sherrod 230]
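To ground the picture, a hedged sketch (not from the talk; the toy points and library choice are assumptions) of an SVM doing exactly this: fit on two clusters, then classify a new point by which side of the learned hyperplane it falls on.

import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters, one per class
X = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.5],
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[1.0, 1.0], [4.0, 3.0]]))   # -> [0 1]
print(clf.decision_function([[1.0, 1.0]]))     # sign = side of the hyperplane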
Topics to cover
What do Support Vector Machines (SVMs) do?
How do SVMs work?
Linear data
Non-linear data (kernel functions)
Unseparable data (added cost function)
Search optimization
Why?
The linear, separable case
Training data {x_i, y_i}
Separating hyperplane defined by normal vector w
hyperplane equation: w·x + b = 0
distance from plane to origin: |b|/|w|
Distances from the hyperplane to the nearest point in each class are d+ and d−
Goal: maximize the margin, d+ + d−
The linear, separable The linear, separable casecase
1)1) x xii·w ·w + b ≥ +1+ b ≥ +1 (for y(for yii = +1) = +1)
2) 2) xxii·w ·w + b ≤ -1+ b ≤ -1 (for y(for yii = -1) = -1) yyii((xxii·w ·w + b) - 1 ≥ 0 + b) - 1 ≥ 0 for our support vectors, distance from originfor our support vectors, distance from origin
to plane = |1-b|/|w|to plane = |1-b|/|w|
AlgebraAlgebra d d++ + d + d-- = 2 / |w| = 2 / |w|
New goal:New goal:maximize: 2 /|w|maximize: 2 /|w| i.e.,i.e., minimize: |w|minimize: |w|
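A minimal sketch of this linear case in Python with scikit-learn (an assumption: the talk names no library, and the toy points are hypothetical). It fits an approximately hard-margin linear SVM and recovers w, b, the support vectors, and the margin width 2/|w| derived above.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
              [5.0, 5.0], [5.5, 4.0], [6.0, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C ~ hard margin

w = clf.coef_[0]       # normal vector w of the hyperplane w·x + b = 0
b = clf.intercept_[0]  # offset b
print("w =", w, " b =", b)
print("margin d+ + d- =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)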
Nonlinear SVMs
[Figure: data that no straight line can separate in the input space. Sherrod 235]
Nonlinear SVMs
Kernel trick:
Map the data into a higher-dimensional space using Φ: R^d → H
The training problem involves only dot products, so H can even be of infinite dimension
The kernel trick makes nonlinear solutions linear again!
YouTube example
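To make the trick concrete, a hedged sketch (not from the talk): a degree-2 polynomial kernel computes the dot product Φ(x)·Φ(z) in a higher-dimensional space without ever constructing Φ. The feature map phi below is one standard choice for R².

import numpy as np

def phi(x):
    # Explicit feature map for K(x, z) = (x·z)^2 on R^2
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(x, z):
    # Same quantity computed directly in R^2: the kernel trick
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(np.dot(phi(x), phi(z)))  # 121.0, dot product in the mapped space H
print(poly_kernel(x, z))       # 121.0, no mapping needed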
Nonlinear SVMs
Radial basis function: K(x_i, x_j) = exp(−γ |x_i − x_j|²)
[Figure: Sherrod 236]
Nonlinear SVMs
Sigmoid: K(x_i, x_j) = tanh(κ x_i·x_j + c)
[Figure: Sherrod 237]
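An illustrative sketch (assumed, not from the talk): both kernels just named are available by name in scikit-learn. The data below is a made-up circular-boundary problem that no linear SVM can separate.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # circular boundary

rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)  # K = exp(-gamma |xi - xj|^2)
sig = SVC(kernel="sigmoid").fit(X, y)         # K = tanh(gamma xi·xj + coef0)
print("RBF training accuracy:    ", rbf.score(X, y))
print("sigmoid training accuracy:", sig.score(X, y))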
Another demonstration
applet
The unseparable case
Classifiers need to have a balanced capacity:
Bad botanist (too much capacity, overfits): “It has 847 leaves. Not a tree!”
Bad botanist (too little capacity, underfits): “It’s green. That’s a tree!”
The unseparable case
[Figure: Sherrod 237]
The unseparable case
[Figure: separating hyperplane with a fuzzy margin; legend: error, fuzzy margin]
The unseparable case
Add a cost function, with slack variables ξ_i:
x_i·w + b ≥ +1 − ξ_i (for y_i = +1)
x_i·w + b ≤ −1 + ξ_i (for y_i = −1)
ξ_i ≥ 0
old goal: minimize |w|²/2
new goal: minimize |w|²/2 + C(∑_i ξ_i)^k
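A hedged sketch of what the cost constant C does in practice (illustrative; scikit-learn's C plays the same trade-off role, with k = 1). Small C tolerates more slack and gives a wider, fuzzier margin; large C punishes errors and narrows it.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two overlapping clusters: not perfectly separable
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),   # class -1
               rng.normal(2.0, 1.0, (100, 2))])  # class +1
y = np.array([-1] * 100 + [1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    n_sv = clf.support_vectors_.shape[0]
    print(f"C={C:7.2f}  margin={margin:.3f}  support vectors={n_sv}")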
Optimizing your search
To find the separating hyperplane, you must tune several parameters, depending on which kernel function you select:
C, the cost constant
gamma, etc.
There are two basic methods (sketched below):
Grid search
Pattern search
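A minimal grid-search sketch (an assumption: the talk shows no code). GridSearchCV evaluates every (C, gamma) pair by cross-validation; a pattern search would instead start from a coarse grid and refine around the current best point.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR-like toy problem

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters: ", search.best_params_)
print("best CV accuracy:", search.best_score_)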
Topics to cover
What do Support Vector Machines (SVMs) do?
How do SVMs work?
Linear data
Non-linear data (kernel functions)
Unseparable data (added cost function)
Search optimization
Why?
Why use SVMs?
Uses:
Optical character recognition
Spam detection
MIR:
genre, artist classification (Mandel 2004, 2005)
mood classification (Laurier 2007)
popularity classification, based on lyrics (Dhanaraj 2005)
Why use SVMs?Why use SVMs? Machine learner of choice for high-Machine learner of choice for high-
dimensional data, such as text, images, dimensional data, such as text, images, music!music!
Conceptually simple.Conceptually simple.
Generalizable and efficient.Generalizable and efficient.
Next slides: results of a benchmark study Next slides: results of a benchmark study (Meyer 2004) comparing SVMs and other (Meyer 2004) comparing SVMs and other learning techniqueslearning techniques
Questions?
Key References

Burges, C. J. C. 1998. "A tutorial on support vector machines for pattern recognition." Data Mining and Knowledge Discovery 2 (2): 121–167. http://citeseer.ist.psu.edu/burges98tutorial.html

Cortes, C., and V. Vapnik. 1995. "Support-vector networks." Machine Learning 20 (3): 273–297. http://citeseer.ist.psu.edu/cortes95supportvector.html

Sherrod, Phillip H. 2008. DTREG: Predictive Modeling Software (user's guide), 227–241. http://www.dtreg.com/DTREG.pdf

Smola, A. J., and B. Schölkopf. 1998. "A tutorial on support vector regression." NeUROCOLT Technical Report NC-TR-98-030. Royal Holloway College, London.