TDT 4173 Machine Learning and Case-Based Reasoning, Lecture 6 – Support Vector Machines. Ensemble Methods. Helge Langseth and Agnar Aamodt, NTNU – IDI, Section for Intelligent Systems.


Page 1

TDT 4173 Machine Learning and Case-Based Reasoning

Lecture 6 – Support Vector Machines. Ensemble Methods

Helge Langseth and Agnar Aamodt

NTNU – IDI, Section for Intelligent Systems

Page 2

Outline

1 Wrap-up from last time

2 Ensemble methods
  Background
  Bagging
  Boosting

3 Support Vector Machines
  Background
  Linear separators
  The dual problem
  Non-separable subspaces
  Nonlinearity and kernels


Page 3

Support Vector Machines

TDT 4173 Machine Learning and CBR

Support Vector Machines (SVMs)

Kernel Methods

Paper by Bennett and Campbell

Page 4

Support Vector Machines – Background

Description of the task

Data:

1 We have a set of data $D = \{(x_1, y_1), \dots, (x_m, y_m)\}$. The instances are described by $x_i$; the class is $y_i$.

2 The data is generated by some unknown probability distribution $P(x, y)$.

Task:

1 Be able to “guess” y at a new location x.

2 For SVMs one typically states this as “find an unknown function $f(x)$ that estimates $y$ at $x$.”

3 Note! In this lesson we look at binary classification, and let $y \in \{-1, +1\}$ denote the classes.

4 We will look for linear functions, i.e., $f(x) = b + w^T x \equiv b + \sum_j w_j \cdot x_j$, where the sum runs over the components of $x$ (see the sketch below).
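A tiny illustration of such a linear decision function (a sketch only; the weight values are made up):

```python
# Minimal sketch of a linear decision function f(x) = b + w^T x;
# the predicted class is sign(f(x)). All numbers are illustrative.
import numpy as np

w = np.array([1.5, -0.5])    # weight vector (illustrative)
b = 0.25                     # bias term (illustrative)

def f(x):
    return b + w @ x         # b + w^T x

x_new = np.array([0.8, 0.1])
print(np.sign(f(x_new)))     # predicted class in {-1, +1}
```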


Page 5

Support Vector Machines – Linear separators

How to find the best linear separator

We are looking for a linear separator for this data


Page 6

Support Vector Machines – Linear separators

How to find the best linear separator

There are so many solutions. . .


Page 7

Support Vector Machines – Linear separators

How to find the best linear separator

But only one is considered the “best”!


Page 8

Support Vector Machines – Linear separators

How to find the best linear separator

SVMs are called “large margin classifiers”


Page 9

Support Vector Machines – Linear separators

How to find the best linear separator

. . . and the data points touching the lines are the support vectors


Page 10

Support Vector Machines – Linear separators

The geometry of the problem

$w$ (the normal vector of the separator)

$\{x : b + w^T x = -1\}$

$\{x : b + w^T x = 0\}$

$\{x : b + w^T x = +1\}$

Note! Since one margin line has $b + w^T x = -1$ and the other has $b + w^T x = +1$, the distance between them is $2/\|w\|$ (derivation below).
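To spell out the step (a short derivation added here, not on the original slide): take a point $x_-$ on the margin $\{x : b + w^T x = -1\}$ and a point $x_+$ on $\{x : b + w^T x = +1\}$. Subtracting the two margin equations gives

$$ w^T(x_+ - x_-) = (1 - b) - (-1 - b) = 2, $$

and the distance between the two parallel hyperplanes is the projection of $x_+ - x_-$ onto the unit normal $w/\|w\|$:

$$ \frac{w^T(x_+ - x_-)}{\|w\|} = \frac{2}{\|w\|}. $$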


Page 11

Support Vector Machines – Linear separators

An optimisation problem

Optimisation criteria:

The distance between the margins is $2/\|w\|$, so that is what we want to maximise.

Equivalently, we can minimise $\|w\|/2$.

For simplicity of the mathematics, we will rather minimise $\|w\|^2/2$.

Constraints:

The margin separates all data observations correctly:

$b + w^T x_i \le -1$ for $y_i = -1$.

$b + w^T x_i \ge +1$ for $y_i = +1$.

Alternative (equivalent) constraint set: $y_i(b + w^T x_i) \ge 1$.


Page 12

Support Vector Machines – Linear separators

An optimisation problem (2)

Mathematical Programming Setting:

Combining the above requirements, we obtain

minimize wrt. $w$ and $b$:

$$ \frac{1}{2}\|w\|^2 $$

subject to $y_i(b + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, m$

Properties:

The problem is convex.

Hence it has a unique minimum.

Efficient algorithms for solving it exist. (A sketch follows below.)
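As a concrete illustration, the primal can be handed to an off-the-shelf QP solver. A minimal sketch, assuming the cvxopt package is available (the helper name and toy data are made up):

```python
# Sketch: the hard-margin primal as a QP over z = (w_1, ..., w_d, b).
import numpy as np
from cvxopt import matrix, solvers

def fit_hard_margin(X, y):
    """X: (m, d) instances; y: (m,) labels in {-1, +1}. Returns (w, b)."""
    m, d = X.shape
    # Objective (1/2) z^T P z with P = diag(1, ..., 1, 0): this is ||w||^2 / 2,
    # with the trailing zero keeping b out of the objective.
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)
    q = np.zeros(d + 1)
    # Constraints y_i (w^T x_i + b) >= 1, rewritten in the solver's form G z <= h.
    G = -np.hstack([X * y[:, None], y[:, None]]).astype(float)
    h = -np.ones(m)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol["x"]).ravel()
    return z[:d], z[d]

# Toy usage on a separable data set:
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = fit_hard_margin(X, y)
print(w, b)
```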


Page 13

Support Vector Machines – The dual problem

The dual problem – and the convex hull

The convex hull of $\{x_j\}$:

The smallest subset of the instance space that

is convex, and

contains all the elements $\{x_j\}$,

is the convex hull of $\{x_j\}$. Find it by drawing lines between all the $x_j$ and choosing the “outermost boundary”. (It can also be computed directly; see the sketch below.)
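For intuition, the “outermost boundary” can be computed directly. A small sketch using SciPy (assumed available):

```python
# Sketch: the convex hull of a random 2-D point set via SciPy.
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
points = rng.standard_normal((30, 2))
hull = ConvexHull(points)
# hull.vertices indexes the points on the outermost boundary
# (in counter-clockwise order for 2-D input).
print(points[hull.vertices])
```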


Page 14

Support Vector Machines – The dual problem

The dual problem – and the convex hull (2)

Look at the difference between the closest points in the two convex hulls. The decision line must be orthogonal to the line between the two closest points.

[Figure: the two convex hulls, with the closest points $c$ and $d$ marked in each]

So, we want to minimise $\|c - d\|$. $c$ can be written as a weighted sum of all elements in the green class, $c = \sum_{i : y_i = \text{Class 1}} \alpha_i x_i$, and similarly for $d$.


Page 15

Support Vector Machines – The dual problem

The dual problem – and the convex hull (3)

Minimising $\|c - d\|$ is (modulo a constant) equivalent to this formulation:

minimize wrt. $\alpha$:

$$ \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i^T x_j \;-\; \sum_{i=1}^{m} \alpha_i $$

subject to

$$ \sum_{i=1}^{m} y_i \alpha_i = 0 \quad \text{and} \quad \alpha_i \ge 0, \; i = 1, \dots, m $$

Properties:

The problem is convex, hence it has a unique minimum.

It is a quadratic programming problem, with known solution methods. (A sketch follows below.)

At the solution, $\alpha_i > 0$ only if $x_i$ is a support vector.
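The dual also maps directly onto a standard QP solver. A sketch, again assuming cvxopt (the helper name is made up; $b$ is recovered using the slide's $f(x) = b + w^T x$ convention):

```python
# Sketch: the hard-margin dual as a QP, with Q_ij = y_i y_j x_i^T x_j.
import numpy as np
from cvxopt import matrix, solvers

def fit_dual(X, y):
    m = X.shape[0]
    Yx = y[:, None] * X                       # rows y_i * x_i
    Q = Yx @ Yx.T                             # Q_ij = y_i y_j x_i^T x_j
    P, q = matrix(Q), matrix(-np.ones(m))     # (1/2) a^T Q a - 1^T a
    G, h = matrix(-np.eye(m)), matrix(np.zeros(m))               # alpha_i >= 0
    A, b = matrix(y.reshape(1, -1).astype(float)), matrix(0.0)   # sum_i y_i alpha_i = 0
    alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
    sv = alpha > 1e-6                         # support vectors: alpha_i > 0
    w = ((alpha * y)[:, None] * X).sum(axis=0)
    b0 = float(np.mean(y[sv] - X[sv] @ w))    # b from the support vectors
    return alpha, w, b0
```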


Page 16

Support Vector Machines – The dual problem

“Theoretical foundation”

1 Formal proofs of SVM properties are available (but out of scope for us)

2 Large margins are smart: if there are small variations in $x$, we will still classify correctly

3 There are many “skinny” margin planes, but only one if you look for the “fattest” plane; it is thus more robust.


Page 17

Support Vector Machines – Non-separable subspaces

What if the convex hulls are overlapping?

If the convex hulls overlap, we cannot find a linear separator.

To handle this, we optimise a criterion where we maximise the distance between the lines minus a penalty for misclassifications.

This is equivalent to scaling the convex hulls and proceeding as before on the reduced convex hulls.


Page 18

Support Vector Machines – Non-separable subspaces

What if the convex hulls are overlapping? (2)

The problem with scaling is (modulo a constant) equivalent to this formulation:

minimize wrt. $\alpha$:

$$ \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i^T x_j \;-\; \sum_{i=1}^{m} \alpha_i $$

subject to

$$ \sum_{i=1}^{m} y_i \alpha_i = 0 \quad \text{and} \quad C \ge \alpha_i \ge 0, \; i = 1, \dots, m $$

Properties:

The problem is as before, but $C$ introduces the scaling; this is equivalent to incurring a cost for misclassification.

Still solvable using “standard” methods. (A sketch of the effect of $C$ follows below.)

Demo: Different values of C: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
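The same effect can be observed programmatically. A quick sketch with scikit-learn (an assumption; the slides do not prescribe a library), where a smaller $C$ tolerates more margin violations and thus keeps more support vectors:

```python
# Sketch: how C changes the number of support vectors on noisy blob data.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```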


Page 19

Support Vector Machines

TDT 4173 Machine Learning and CBR

Statistical Learning Theory

• Misclassification error and the function complexity bound the generalization error.

• Maximizing margins minimizes complexity.

• “Eliminates” overfitting.

• The solution depends only on the support vectors, not on the number of attributes.

Page 20

Support Vector Machines – Nonlinearity and kernels

Nonlinear problems – when scaling does not make sense

The problem is difficult to solve when $x = (r, s)$ has only two dimensions

. . . but if we blow it up to five dimensions, $\theta(x) = (r, s, rs, r^2, s^2)$, i.e., “invent” the mapping $\theta(\cdot) : \mathbb{R}^2 \mapsto \mathbb{R}^5$, and try to find the linear separator in $\mathbb{R}^5$, then everything is OK. (See the sketch below.)
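A sketch of exactly this lift (the data set is made up: a disc-shaped class that no line in $\mathbb{R}^2$ separates):

```python
# Sketch: lift x = (r, s) to theta(x) = (r, s, rs, r^2, s^2) and fit a
# linear separator in R^5; the circular boundary becomes linear there.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)   # not separable in R^2

def theta(X):
    r, s = X[:, 0], X[:, 1]
    return np.column_stack([r, s, r * s, r**2, s**2])

clf = SVC(kernel="linear", C=10.0).fit(theta(X), y)
print(clf.score(theta(X), y))   # close to 1.0: separable after the lift
```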


Page 21

Support Vector Machines – Nonlinearity and kernels

Solving the problem in higher dimensions

We solve this as before, but remembering to look in the higher dimension:

minimize wrt. $\alpha$:

$$ \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, \theta(x_i)^T \theta(x_j) \;-\; \sum_{i=1}^{m} \alpha_i $$

subject to

$$ \sum_{i=1}^{m} y_i \alpha_i = 0 \quad \text{and} \quad C \ge \alpha_i \ge 0, \; i = 1, \dots, m $$

Note that:

We do not need to evaluate $\theta(x)$ directly, only $\theta(x_i)^T \theta(x_j)$.

If we find a “clever way” of evaluating $\theta(x_i)^T \theta(x_j)$ (i.e., independent of the size of the target space), we can solve the problem easily, without even thinking about what $\theta(x)$ means.

We define $K(x_i, x_j) = \theta(x_i)^T \theta(x_j)$, and focus on finding $K(\cdot, \cdot)$ instead of the mapping. $K$ is called a kernel. (A quick numerical check follows below.)
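A quick numerical check of this identity for the degree-2 polynomial kernel on $\mathbb{R}^2$ (the explicit feature map below is written out by hand for this one case):

```python
# Sanity check: K(x, z) = (x^T z + 1)^2 equals theta(x)^T theta(z) for the
# explicit 6-dimensional map theta.
import numpy as np

def theta(x):
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.5])
assert np.isclose((x @ z + 1) ** 2, theta(x) @ theta(z))
print("kernel identity holds")
```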


Page 22

Support Vector Machines – Nonlinearity and kernels

Kernel functions

Common kernels $K(x_i, x_j) = \theta(x_i)^T \theta(x_j)$:

Degree-$d$ polynomial: $(x_i^T x_j + 1)^d$

Radial Basis Functions: $\exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$

Two-layer Neural Network: $\mathrm{sigmoid}(\eta \cdot x_i^T x_j + c)$

Different kernels have different properties; finding the “right” kernel is a difficult task and can be hard to visualise.

Example: The RBF kernel (implicitly) uses an infinite-dimensional representation for $\theta(\cdot)$.


Page 23

Support Vector Machines – Nonlinearity and kernels

SVMs: Algorithmic summary

Select the parameter $C$ (tradeoff between minimising training-set error and maximising the margin).

Select the kernel function and associated parameters (e.g., $\sigma$ for RBF).

Solve the optimisation problem using quadratic programming.

Find the value $b$ by using the support vectors.

Classify a new point $x$ using

$$ f(x) = \mathrm{sign}\left\{ \sum_{i=1}^{m} y_i \alpha_i K(x, x_i) - b \right\} $$

(An end-to-end sketch follows after the demo link below.)

Demo: Different kernels: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
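An end-to-end sketch of the summary above, using scikit-learn (an assumption, not the course's tool): fit an RBF-kernel SVM, rebuild the decision values from the $y_i \alpha_i$, the support vectors, and the bias, and check them against the library. Note that scikit-learn stores $y_i \alpha_i$ in dual_coef_ and uses a $+b$ convention (intercept_), which matches the slide's formula up to the sign of $b$:

```python
# Sketch: recover f(x) = sum_i y_i alpha_i K(x, x_i) + b from a fitted SVC.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
gamma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

K = rbf_kernel(X, clf.support_vectors_, gamma=gamma)   # K(x, x_i) for the SVs
f = K @ clf.dual_coef_.ravel() + clf.intercept_        # manual decision values
assert np.allclose(f, clf.decision_function(X))
print("manual decision function matches the library")
```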


Page 24

Support Vector Machines

TDT 4173 Machine Learning and CBR

SVM Extensions

• Regression
• Variable Selection
• Boosting
• Density Estimation
• Unsupervised Learning
  – Novelty/Outlier Detection
  – Feature Detection
  – Clustering

Page 25

Support Vector Machines

TDT 4173 Machine Learning and CBR

Many Other Applications

• Speech Recognition
• Database Marketing
• Quark Flavors in High Energy Physics
• Dynamic Object Recognition
• Knock Detection in Engines
• Protein Sequence Problem
• Text Categorization
• Breast Cancer Diagnosis
• See: http://www.clopinet.com/isabelle/Projects/SVM/applist.html

Page 26

Support Vector Machines

TDT 4173 Machine Learning and CBR

Hallelujah!

• Generalization theory and practice meet

• General methodology for many types of problems
• Same Program + New Kernel = New method
• No problems with local minima

• Few model parameters. Selects capacity.
• Robust optimization methods.
• Successful Applications

BUT…

Page 27

Support Vector Machines

TDT 4173 Machine Learning and CBR

HYPE?

• Will SVMs beat my best hand-tuned method Z for X?

• Do SVMs scale to massive datasets?
• How to choose C and Kernel?
• What is the effect of attribute scaling?

• How to handle categorical variables?
• How to incorporate domain knowledge?
• How to interpret results?