TDT 4173 Machine Learning and Case-Based Reasoning, Lecture 6 – Support Vector Machines. Ensemble Methods. Helge Langseth and Agnar Aamodt, NTNU – IDI, Section for Intelligent Systems.


Page 1

TDT 4173 Machine Learning and Case-Based Reasoning

Lecture 6 – Support Vector Machines. Ensemble Methods

Helge Langseth and Agnar Aamodt

NTNU – IDI, Section for Intelligent Systems

Page 2

Outline

1 Wrap-up from last time

2 Ensemble methods
  Background
  Bagging
  Boosting

3 Support Vector Machines
  Background
  Linear separators
  The dual problem
  Non-separable subspaces
  Nonlinearity and kernels


Page 3

Support Vector Machines

TDT 4173 Machine Learning and CBR

Support Vector Machines (SVMs)

Kernel Methods

Paper by Bennett and Campbell

Page 4

Support Vector Machines – Background

Description of the task

Data:

1 We have a set of data $D = \{(x_1, y_1), \dots, (x_m, y_m)\}$. The instances are described by $x_i$; the class is $y_i$.

2 The data is generated by some unknown probability distribution $P(x, y)$.

Task:

1 Be able to “guess” y at a new location x.

2 For SVMs one typically states this as “find an unknown function $f(x)$ that estimates $y$ at $x$.”

3 Note! In this lesson we look at binary classification, and let $y \in \{-1, +1\}$ denote the classes.

4 We will look for linear functions, i.e., $f(x) = b + w^T x \equiv b + \sum_j w_j \cdot x_j$, where the sum runs over the components of $x$ (see the sketch below).
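A tiny illustration of such a linear decision function (a sketch only; the weight values are made up):

```python
# Minimal sketch of a linear decision function f(x) = b + w^T x;
# the predicted class is sign(f(x)). All numbers are illustrative.
import numpy as np

w = np.array([1.5, -0.5])    # weight vector (illustrative)
b = 0.25                     # bias term (illustrative)

def f(x):
    return b + w @ x         # b + w^T x

x_new = np.array([0.8, 0.1])
print(np.sign(f(x_new)))     # predicted class in {-1, +1}
```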


Page 5

Support Vector Machines – Linear separators

How to find the best linear separator

We are looking for a linear separator for this data


Page 6

Support Vector Machines – Linear separators

How to find the best linear separator

There are so many solutions. . .


Page 7

Support Vector Machines – Linear separators

How to find the best linear separator

But only one is considered the “best”!


Page 8

Support Vector Machines – Linear separators

How to find the best linear separator

SVMs are called “large margin classifiers”


Page 9

Support Vector Machines – Linear separators

How to find the best linear separator

. . . and the data points touching the lines are the support vectors


Page 10

Support Vector Machines – Linear separators

The geometry of the problem

$w$ (the normal vector of the separator)

$\{x : b + w^T x = -1\}$

$\{x : b + w^T x = 0\}$

$\{x : b + w^T x = +1\}$

Note! Since one margin line has $b + w^T x = -1$ and the other has $b + w^T x = +1$, the distance between them is $2/\|w\|$ (derivation below).
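To spell out the step (a short derivation added here, not on the original slide): take a point $x_-$ on the margin $\{x : b + w^T x = -1\}$ and a point $x_+$ on $\{x : b + w^T x = +1\}$. Subtracting the two margin equations gives

$$ w^T(x_+ - x_-) = (1 - b) - (-1 - b) = 2, $$

and the distance between the two parallel hyperplanes is the projection of $x_+ - x_-$ onto the unit normal $w/\|w\|$:

$$ \frac{w^T(x_+ - x_-)}{\|w\|} = \frac{2}{\|w\|}. $$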


Page 11

Support Vector Machines – Linear separators

An optimisation problem

Optimisation criteria:

The distance between the margins is $2/\|w\|$, so that is what we want to maximise.

Equivalently, we can minimise $\|w\|/2$.

For simplicity of the mathematics, we will rather minimise $\|w\|^2/2$.

Constraints:

The margin separates all data observations correctly:

$b + w^T x_i \le -1$ for $y_i = -1$.

$b + w^T x_i \ge +1$ for $y_i = +1$.

Alternative (equivalent) constraint set: $y_i(b + w^T x_i) \ge 1$.


Page 12

Support Vector Machines – Linear separators

An optimisation problem (2)

Mathematical Programming Setting:

Combining the above requirements, we obtain

minimize wrt. $w$ and $b$:

$$ \frac{1}{2}\|w\|^2 $$

subject to $y_i(b + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, m$

Properties:

The problem is convex.

Hence it has a unique minimum.

Efficient algorithms for solving it exist. (A sketch follows below.)
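As a concrete illustration, the primal can be handed to an off-the-shelf QP solver. A minimal sketch, assuming the cvxopt package is available (the helper name and toy data are made up):

```python
# Sketch: the hard-margin primal as a QP over z = (w_1, ..., w_d, b).
import numpy as np
from cvxopt import matrix, solvers

def fit_hard_margin(X, y):
    """X: (m, d) instances; y: (m,) labels in {-1, +1}. Returns (w, b)."""
    m, d = X.shape
    # Objective (1/2) z^T P z with P = diag(1, ..., 1, 0): this is ||w||^2 / 2,
    # with the trailing zero keeping b out of the objective.
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)
    q = np.zeros(d + 1)
    # Constraints y_i (w^T x_i + b) >= 1, rewritten in the solver's form G z <= h.
    G = -np.hstack([X * y[:, None], y[:, None]]).astype(float)
    h = -np.ones(m)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol["x"]).ravel()
    return z[:d], z[d]

# Toy usage on a separable data set:
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = fit_hard_margin(X, y)
print(w, b)
```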


Page 13

Support Vector Machines – The dual problem

The dual problem – and the convex hull

The convex hull of $\{x_j\}$:

The smallest subset of the instance space that

is convex, and

contains all the elements $\{x_j\}$,

is the convex hull of $\{x_j\}$. Find it by drawing lines between all the $x_j$ and choosing the “outermost boundary”. (It can also be computed directly; see the sketch below.)
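For intuition, the “outermost boundary” can be computed directly. A small sketch using SciPy (assumed available):

```python
# Sketch: the convex hull of a random 2-D point set via SciPy.
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
points = rng.standard_normal((30, 2))
hull = ConvexHull(points)
# hull.vertices indexes the points on the outermost boundary
# (in counter-clockwise order for 2-D input).
print(points[hull.vertices])
```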


Page 14

Support Vector Machines – The dual problem

The dual problem – and the convex hull (2)

Look at the difference between the closest points in the two convex hulls. The decision line must be orthogonal to the line between the two closest points.

[Figure: the two convex hulls, with the closest points $c$ and $d$ marked in each]

So, we want to minimise $\|c - d\|$. $c$ can be written as a weighted sum of all elements in the green class, $c = \sum_{i : y_i = \text{Class 1}} \alpha_i x_i$, and similarly for $d$.


Page 15

Support Vector Machines – The dual problem

The dual problem – and the convex hull (3)

Minimising $\|c - d\|$ is (modulo a constant) equivalent to this formulation:

minimize wrt. $\alpha$:

$$ \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i^T x_j \;-\; \sum_{i=1}^{m} \alpha_i $$

subject to

$$ \sum_{i=1}^{m} y_i \alpha_i = 0 \quad \text{and} \quad \alpha_i \ge 0, \; i = 1, \dots, m $$

Properties:

The problem is convex, hence it has a unique minimum.

It is a quadratic programming problem, with known solution methods. (A sketch follows below.)

At the solution, $\alpha_i > 0$ only if $x_i$ is a support vector.
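The dual also maps directly onto a standard QP solver. A sketch, again assuming cvxopt (the helper name is made up; $b$ is recovered using the slide's $f(x) = b + w^T x$ convention):

```python
# Sketch: the hard-margin dual as a QP, with Q_ij = y_i y_j x_i^T x_j.
import numpy as np
from cvxopt import matrix, solvers

def fit_dual(X, y):
    m = X.shape[0]
    Yx = y[:, None] * X                       # rows y_i * x_i
    Q = Yx @ Yx.T                             # Q_ij = y_i y_j x_i^T x_j
    P, q = matrix(Q), matrix(-np.ones(m))     # (1/2) a^T Q a - 1^T a
    G, h = matrix(-np.eye(m)), matrix(np.zeros(m))               # alpha_i >= 0
    A, b = matrix(y.reshape(1, -1).astype(float)), matrix(0.0)   # sum_i y_i alpha_i = 0
    alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
    sv = alpha > 1e-6                         # support vectors: alpha_i > 0
    w = ((alpha * y)[:, None] * X).sum(axis=0)
    b0 = float(np.mean(y[sv] - X[sv] @ w))    # b from the support vectors
    return alpha, w, b0
```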


Page 16

Support Vector Machines – The dual problem

“Theoretical foundation”

1 Formal proofs of SVM properties are available (but out of scope for us)

2 Large margins are smart: if there are small variations in $x$, we will still classify correctly

3 There are many “skinny” margin planes, but only one if you look for the “fattest” plane; it is thus more robust.


Page 17

Support Vector Machines – Non-separable subspaces

What if the convex hulls are overlapping?

If the convex hulls overlap, we cannot find a linear separator.

To handle this, we optimise a criterion where we maximise the distance between the lines minus a penalty for misclassifications.

This is equivalent to scaling the convex hulls and proceeding as before on the reduced convex hulls.


Page 18

Support Vector Machines – Non-separable subspaces

What if the convex hulls are overlapping? (2)

The problem with scaling is (modulo a constant) equivalent to this formulation:

minimize wrt. $\alpha$:

$$ \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i^T x_j \;-\; \sum_{i=1}^{m} \alpha_i $$

subject to

$$ \sum_{i=1}^{m} y_i \alpha_i = 0 \quad \text{and} \quad C \ge \alpha_i \ge 0, \; i = 1, \dots, m $$

Properties:

The problem is as before, but $C$ introduces the scaling; this is equivalent to incurring a cost for misclassification.

Still solvable using “standard” methods. (A sketch of the effect of $C$ follows below.)

Demo: Different values of C: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
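The same effect can be observed programmatically. A quick sketch with scikit-learn (an assumption; the slides do not prescribe a library), where a smaller $C$ tolerates more margin violations and thus keeps more support vectors:

```python
# Sketch: how C changes the number of support vectors on noisy blob data.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```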


Page 19

Support Vector Machines

TDT 4173 Machine Learning and CBR

Statistical Learning Theory

• Misclassification error and the function complexity bound the generalization error.

• Maximizing margins minimizes complexity.

• “Eliminates” overfitting.

• The solution depends only on the support vectors, not on the number of attributes.

Page 20

Support Vector Machines – Nonlinearity and kernels

Nonlinear problems – when scaling does not make sense

The problem is difficult to solve when $x = (r, s)$ has only two dimensions

. . . but if we blow it up to five dimensions, $\theta(x) = (r, s, rs, r^2, s^2)$, i.e., “invent” the mapping $\theta(\cdot) : \mathbb{R}^2 \mapsto \mathbb{R}^5$, and try to find the linear separator in $\mathbb{R}^5$, then everything is OK. (See the sketch below.)
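A sketch of exactly this lift (the data set is made up: a disc-shaped class that no line in $\mathbb{R}^2$ separates):

```python
# Sketch: lift x = (r, s) to theta(x) = (r, s, rs, r^2, s^2) and fit a
# linear separator in R^5; the circular boundary becomes linear there.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)   # not separable in R^2

def theta(X):
    r, s = X[:, 0], X[:, 1]
    return np.column_stack([r, s, r * s, r**2, s**2])

clf = SVC(kernel="linear", C=10.0).fit(theta(X), y)
print(clf.score(theta(X), y))   # close to 1.0: separable after the lift
```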


Page 21

Support Vector Machines – Nonlinearity and kernels

Solving the problem in higher dimensions

We solve this as before, but remembering to look in the higher dimension:

minimize wrt. $\alpha$:

$$ \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, \theta(x_i)^T \theta(x_j) \;-\; \sum_{i=1}^{m} \alpha_i $$

subject to

$$ \sum_{i=1}^{m} y_i \alpha_i = 0 \quad \text{and} \quad C \ge \alpha_i \ge 0, \; i = 1, \dots, m $$

Note that:

We do not need to evaluate $\theta(x)$ directly, only $\theta(x_i)^T \theta(x_j)$.

If we find a “clever way” of evaluating $\theta(x_i)^T \theta(x_j)$ (i.e., independent of the size of the target space), we can solve the problem easily, without even thinking about what $\theta(x)$ means.

We define $K(x_i, x_j) = \theta(x_i)^T \theta(x_j)$, and focus on finding $K(\cdot, \cdot)$ instead of the mapping. $K$ is called a kernel. (A quick numerical check follows below.)
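A quick numerical check of this identity for the degree-2 polynomial kernel on $\mathbb{R}^2$ (the explicit feature map below is written out by hand for this one case):

```python
# Sanity check: K(x, z) = (x^T z + 1)^2 equals theta(x)^T theta(z) for the
# explicit 6-dimensional map theta.
import numpy as np

def theta(x):
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.5])
assert np.isclose((x @ z + 1) ** 2, theta(x) @ theta(z))
print("kernel identity holds")
```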


Page 22

Support Vector Machines – Nonlinearity and kernels

Kernel functions

Common kernels $K(x_i, x_j) = \theta(x_i)^T \theta(x_j)$:

Degree-$d$ polynomial: $(x_i^T x_j + 1)^d$

Radial Basis Functions: $\exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$

Two-layer Neural Network: $\mathrm{sigmoid}(\eta \cdot x_i^T x_j + c)$

Different kernels have different properties; finding the “right” kernel is a difficult task and can be hard to visualise.

Example: The RBF kernel (implicitly) uses an infinite-dimensional representation for $\theta(\cdot)$.


Page 23

Support Vector Machines – Nonlinearity and kernels

SVMs: Algorithmic summary

Select the parameter $C$ (tradeoff between minimising training-set error and maximising the margin).

Select the kernel function and associated parameters (e.g., $\sigma$ for RBF).

Solve the optimisation problem using quadratic programming.

Find the value $b$ by using the support vectors.

Classify a new point $x$ using

$$ f(x) = \mathrm{sign}\left\{ \sum_{i=1}^{m} y_i \alpha_i K(x, x_i) - b \right\} $$

(An end-to-end sketch follows after the demo link below.)

Demo: Different kernels: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
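An end-to-end sketch of the summary above, using scikit-learn (an assumption, not the course's tool): fit an RBF-kernel SVM, rebuild the decision values from the $y_i \alpha_i$, the support vectors, and the bias, and check them against the library. Note that scikit-learn stores $y_i \alpha_i$ in dual_coef_ and uses a $+b$ convention (intercept_), which matches the slide's formula up to the sign of $b$:

```python
# Sketch: recover f(x) = sum_i y_i alpha_i K(x, x_i) + b from a fitted SVC.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
gamma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

K = rbf_kernel(X, clf.support_vectors_, gamma=gamma)   # K(x, x_i) for the SVs
f = K @ clf.dual_coef_.ravel() + clf.intercept_        # manual decision values
assert np.allclose(f, clf.decision_function(X))
print("manual decision function matches the library")
```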


Page 24

Support Vector Machines

TDT 4173 Machine Learning and CBR

SVM Extensions

• Regression
• Variable Selection
• Boosting
• Density Estimation
• Unsupervised Learning
  – Novelty/Outlier Detection
  – Feature Detection
  – Clustering

Page 25

Support Vector Machines

TDT 4173 Machine Learning and CBR

Many Other Applications

• Speech Recognition
• Database Marketing
• Quark Flavors in High Energy Physics
• Dynamic Object Recognition
• Knock Detection in Engines
• Protein Sequence Problem
• Text Categorization
• Breast Cancer Diagnosis
• See: http://www.clopinet.com/isabelle/Projects/SVM/applist.html

Page 26

Support Vector Machines

TDT 4173 Machine Learning and CBR

Hallelujah!

• Generalization theory and practice meet

• General methodology for many types of problems
• Same Program + New Kernel = New method
• No problems with local minima

• Few model parameters. Selects capacity.
• Robust optimization methods.
• Successful Applications

BUT…

Page 27

Support Vector Machines

TDT 4173 Machine Learning and CBR

HYPE?

• Will SVMs beat my best hand-tuned method Z for X?

• Do SVMs scale to massive datasets?
• How to choose C and Kernel?
• What is the effect of attribute scaling?

• How to handle categorical variables?
• How to incorporate domain knowledge?
• How to interpret results?