
ETC3250: Support Vector Machines
Semester 1, 2020

Professor Di Cook

Econometrics and Business Statistics, Monash University

Week 7 (b)

Separating hyperplanes

In a $p$-dimensional space, a hyperplane is a flat affine subspace of dimension $p - 1$.

(ISLR: Fig 9.1)

2 / 36

Separating hyperplanes

The equation of a $p$-dimensional hyperplane is given by

$$\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p = 0$$

If $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$ for $i = 1, \ldots, n$, then

$$\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} > 0 \text{ if } y_i = 1,$$
$$\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} < 0 \text{ if } y_i = -1.$$

Equivalently,

$$y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) > 0.$$

3 / 36

Separating hyperplanes

A new observation $x^*$ is assigned a class depending on which side of the hyperplane it is located. Classify the test observation based on the sign of

$$s(x^*) = \beta_0 + \beta_1 x^*_1 + \cdots + \beta_p x^*_p$$

If $s(x^*) > 0$, class $1$, and if $s(x^*) < 0$, class $-1$, i.e. $h(x^*) = \text{sign}(s(x^*))$.
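As a small illustration of this rule, the R sketch below computes $s(x^*)$ and the class label for made-up coefficients and a made-up test point (none of these values come from the slides):

```r
# Hypothetical hyperplane coefficients and a hypothetical test observation
beta0  <- -1
beta   <- c(2, -3)          # beta_1, ..., beta_p
x_star <- c(1.5, 0.5)

# Score s(x*) = beta0 + beta_1 x*_1 + ... + beta_p x*_p
s <- beta0 + sum(beta * x_star)

# Classify by the sign of the score
h <- sign(s)
s   # the magnitude relates to how far x* is from the hyperplane
h   # predicted class: +1 or -1
```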

4 / 36

Separating hyperplanes

What about the magnitude of $s(x^*)$?

- $s(x^*)$ far from zero $\rightarrow$ $x^*$ lies far from the hyperplane, so we are more confident about our classification.
- $s(x^*)$ close to zero $\rightarrow$ $x^*$ lies near the hyperplane, so we are less confident about our classification.

5 / 36

Separating hyperplane classifiers

We will explore three different types of hyperplane classifiers, with each method generalising the one before it.

- Maximal margin classifier: for when the data is perfectly separated by a hyperplane,
- Support vector classifier (soft margin classifier): for when the data is NOT perfectly separated by a hyperplane but still has a linear decision boundary, and
- Support vector machines: used when the data has non-linear decision boundaries.

In practice, SVM is used to refer to all three methods; however, we will distinguish between the three notions in this lecture.

6 / 36

Maximal margin classifier

All are separating hyperplanes. Which is best?

7 / 36

Maximal margin classifier

Source: Machine Learning Memes for Convolutional Teens

8 / 36

From LDA to SVM

- Linear discriminant analysis uses the difference between means to set the separating hyperplane.
- Support vector machines use gaps between points on the outer edge of clusters to set the separating hyperplane.

9 / 36

SVM vs Logistic Regression

- If our data can be perfectly separated using a hyperplane, then there will in fact exist an infinite number of such hyperplanes.
- We compute the (perpendicular) distance from each training observation to a given separating hyperplane. The smallest such distance is known as the margin.
- The optimal separating hyperplane (or maximal margin hyperplane) is the separating hyperplane for which the margin is largest.
- We can then classify a test observation based on which side of the maximal margin hyperplane it lies. This is known as the maximal margin classifier.

10 / 36

Support vectors

11 / 36

Support vectors

The support vectors are equidistant from the maximal margin hyperplane and lie along the dashed lines indicating the width of the margin. They support the maximal margin hyperplane in the sense that if these points were moved slightly then the maximal margin hyperplane would move as well.

The maximal margin hyperplane depends directly on the support vectors, but not on the other observations.

12 / 36

Support vectors

13 / 36

Example: Support vectors (and slack vectors)

14 / 36

Maximal margin classifier

If $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$ for $i = 1, \ldots, n$, the separating hyperplane is defined as

$$\{x : \beta_0 + x^T \beta = 0\}$$

where $\beta = \sum_{i=1}^{s} (\alpha_i y_i) x_i$ and $s$ is the number of support vectors. Then the maximal margin hyperplane is found by

maximising $M$, subject to $\sum_{j=1}^{p} \beta_j^2 = 1$, and

$$y_i(x_i^T \beta + \beta_0) \geq M, \quad i = 1, \ldots, n.$$
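To connect this representation to software, here is a minimal R sketch using the e1071 package (an assumption for illustration; the unit's own code may use a different package). It fits a linear SVM to toy data and reconstructs $\beta = \sum_{i=1}^{s} (\alpha_i y_i) x_i$ from the fitted support vectors:

```r
library(e1071)

# Toy two-class data, made up for illustration only
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 2), ncol = 2),
           matrix(rnorm(20, mean = -2), ncol = 2))
y <- factor(c(rep(1, 10), rep(-1, 10)))

fit <- svm(x, y, kernel = "linear", cost = 10, scale = FALSE)

# fit$coefs holds alpha_i * y_i for the support vectors, fit$SV the support vectors x_i,
# so beta = sum_i (alpha_i y_i) x_i; the intercept is beta0 = -fit$rho
beta  <- t(fit$SV) %*% fit$coefs
beta0 <- -fit$rho
beta; beta0
```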

15 / 36

Non-separable case

The maximal margin classifier only works when we have perfect separability in our data.

What do we do if data is not perfectly separable by a hyperplane?

The support vector classifier allows points to either lie on the wrong side of the margin, or on the wrong side of the hyperplane altogether.

Right: ISLR Fig 9.6

16 / 36

Support vector classi�er - optimisation

Maximise $M$, subject to $\sum_{j=1}^{p} \beta_j^2 = 1$,

$$y_i(x_i^T \beta + \beta_0) \geq M(1 - \epsilon_i), \quad i = 1, \ldots, n, \quad \epsilon_i \geq 0,$$

AND $\sum_{i=1}^{n} \epsilon_i \leq C$.

$\epsilon_i$ tells us where the $i$th observation is located and $C$ is a nonnegative tuning parameter.

- $\epsilon_i = 0$: correct side of the margin,
- $\epsilon_i > 0$: wrong side of the margin (violation of the margin),
- $\epsilon_i > 1$: wrong side of the hyperplane.
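A hedged note on software: in the e1071 package (assumed here for illustration), the tuning parameter is exposed as a cost argument that penalises margin violations, which plays the opposite role to the budget $C$ above, so a large cost corresponds to tolerating few violations. A minimal sketch on made-up data:

```r
library(e1071)

# Hypothetical two-class data frame, for illustration only
set.seed(1)
df <- data.frame(x1 = rnorm(40), x2 = rnorm(40),
                 class = factor(rep(c(-1, 1), each = 20)))

# Small cost: wide margin, many violations tolerated.
# Large cost: narrow margin, few violations tolerated.
fit_small <- svm(class ~ ., data = df, kernel = "linear", cost = 0.01)
fit_large <- svm(class ~ ., data = df, kernel = "linear", cost = 100)

# Compare the number of support vectors selected under each setting
c(small_cost = length(fit_small$index), large_cost = length(fit_large$index))
```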

17 / 36

Non-separable case

Tuning parameter: decreasing the value of C

18 / 36

Non-linear boundaries

The support vector classifier doesn't work well for non-linear boundaries. What solution do we have?

19 / 36

Enlarging the feature space

Consider the following 2D non-linear classification problem. We can transform this to a linear problem separated by a maximal margin hyperplane by introducing an additional third dimension.

Source: Grace Zhang @zxr.nju
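A minimal R sketch of this idea, using the common concentric-classes example and adding $x_1^2 + x_2^2$ as the third feature (illustrative data, not the exact example in the figure):

```r
set.seed(1)
n  <- 100
x1 <- runif(n, -1, 1); x2 <- runif(n, -1, 1)
# Class depends on distance from the origin: a circular (non-linear) boundary in 2D
class <- factor(ifelse(x1^2 + x2^2 < 0.5, 1, -1))

# Add a third feature x3 = x1^2 + x2^2; in (x1, x2, x3) the classes are
# separated by the flat plane x3 = 0.5, i.e. a linear boundary
x3 <- x1^2 + x2^2
df <- data.frame(x1, x2, x3, class)
head(df)
```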

20 / 36

The inner product

Consider two $p$-vectors

$$x = (x_1, x_2, \ldots, x_p) \in \mathbb{R}^p \quad \text{and} \quad y = (y_1, y_2, \ldots, y_p) \in \mathbb{R}^p.$$

The inner product is defined as

$$\langle x, y \rangle = x_1 y_1 + x_2 y_2 + \cdots + x_p y_p = \sum_{j=1}^{p} x_j y_j$$

A linear measure of similarity, which allows geometric constructions such as the maximal margin hyperplane.

21 / 36

Kernel functions

A kernel function is an inner product of vectors mapped to a (higher dimensional) feature space $\mathcal{H} = \mathbb{R}^d$, $d > p$:

$$K(x, y) = \langle \psi(x), \psi(y) \rangle, \quad \psi : \mathbb{R}^p \rightarrow \mathcal{H}$$

A non-linear measure of similarity, which allows geometric constructions in high dimensional space.

22 / 36

Examples of kernels

Standard kernels include:

- Linear: $K(x, y) = \langle x, y \rangle$
- Polynomial: $K(x, y) = (\langle x, y \rangle + 1)^d$
- Radial: $K(x, y) = \exp(-\gamma \|x - y\|^2)$
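These three kernels are simple to write directly in R; the sketch below uses illustrative parameter names d and gamma:

```r
# Linear kernel: the ordinary inner product
k_linear <- function(x, y) sum(x * y)

# Polynomial kernel of degree d
k_poly <- function(x, y, d = 2) (sum(x * y) + 1)^d

# Radial (RBF) kernel with rate parameter gamma
k_radial <- function(x, y, gamma = 1) exp(-gamma * sum((x - y)^2))

x <- c(1, 2); y <- c(3, -1)
c(linear = k_linear(x, y), poly = k_poly(x, y), radial = k_radial(x, y))
```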

23 / 36

Support Vector Machines

The kernel trick

The linear support vector classifier can be represented as follows:

$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i \langle x, x_i \rangle.$$

We can generalise this by replacing the inner product with the kernel function as follows:

$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i K(x, x_i).$$
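A minimal sketch of evaluating this decision function in R, with the radial kernel and made-up support vectors, coefficients and intercept (SV, alpha and beta0 are hypothetical, not from a fitted model):

```r
# Hypothetical support vectors (rows), coefficients alpha_i, and intercept beta0
SV    <- rbind(c(1, 1), c(-1, -0.5), c(0.5, -1))
alpha <- c(0.8, -0.6, -0.2)
beta0 <- 0.1

k_radial <- function(x, y, gamma = 1) exp(-gamma * sum((x - y)^2))

# f(x) = beta0 + sum_i alpha_i K(x, x_i), summed over the support set
f <- function(x) beta0 + sum(sapply(seq_len(nrow(SV)),
                                    function(i) alpha[i] * k_radial(x, SV[i, ])))

f(c(0.9, 0.8))        # score for a new observation
sign(f(c(0.9, 0.8)))  # predicted class
```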

24 / 36

Your turn

Let $x$ and $y$ be vectors in $\mathbb{R}^2$. By expanding

$$K(x, y) = (1 + \langle x, y \rangle)^2$$

show that this is equivalent to an inner product in $\mathcal{H} = \mathbb{R}^6$.

Remember: $\langle x, y \rangle = \sum_{j=1}^{p} x_j y_j$.

25 / 36

Solution

$$\begin{aligned}
K(x, y) &= (1 + \langle x, y \rangle)^2 \\
&= \Big(1 + \sum_{j=1}^{2} x_j y_j\Big)^2 \\
&= (1 + x_1 y_1 + x_2 y_2)^2 \\
&= 1 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 \\
&= \langle \psi(x), \psi(y) \rangle
\end{aligned}$$

where $\psi(x) = (1, x_1^2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2)$.
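A quick numerical check of this identity in R, with arbitrary example vectors:

```r
x <- c(0.5, 2); y <- c(-1, 3)

# Kernel form: (1 + <x, y>)^2
k <- (1 + sum(x * y))^2

# Explicit feature map psi into R^6
psi <- function(v) c(1, v[1]^2, v[2]^2, sqrt(2) * v[1], sqrt(2) * v[2], sqrt(2) * v[1] * v[2])

# The inner product in the enlarged feature space matches the kernel value
c(kernel = k, feature_space = sum(psi(x) * psi(y)))
```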

26 / 36

The kernel trick - why is it a trick?

We do not need to know what the high dimensional enlarged feature space $\mathcal{H}$ really looks like.

We just need to know which kernel function is most appropriate as a measure of similarity.

The Support Vector Machine (SVM) is a maximal margin hyperplane in $\mathcal{H}$, built by using a kernel function in the low dimensional feature space $\mathbb{R}^p$.

27 / 36

Non-linear boundaries

Polynomial and radial kernel SVMs
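A sketch of fitting both kernels with the e1071 package (assumed here for illustration), on made-up data with a circular boundary rather than the data behind the figures:

```r
library(e1071)

# Illustrative two-class data with a circular decision boundary
set.seed(1)
df <- data.frame(x1 = runif(200, -1, 1), x2 = runif(200, -1, 1))
df$class <- factor(ifelse(df$x1^2 + df$x2^2 < 0.5, 1, -1))

# Polynomial kernel SVM (degree controls the polynomial order)
fit_poly   <- svm(class ~ x1 + x2, data = df, kernel = "polynomial", degree = 2, cost = 1)

# Radial kernel SVM (gamma controls how local the similarity is)
fit_radial <- svm(class ~ x1 + x2, data = df, kernel = "radial", gamma = 1, cost = 1)

# Training misclassification rates for a rough comparison
mean(predict(fit_poly, df)   != df$class)
mean(predict(fit_radial, df) != df$class)
```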

28 / 36

Non-linear boundaries

Italian olive oils: Regions 2, 3 (North and Sardinia)

29 / 36

SVM in high dimensions

Examining misclassifications and which points are selected to be support vectors

30 / 36

SVM in high dimensions

Examining boundaries

31 / 36

SVM in high dimensions

Boundaries of a radial kernel in 3D

32 / 36

SVM in high dimensions

Boundaries of a polynomial kernel in 5D

33 / 36

Comparing decision boundaries

34 / 36

Increasing the value of cost in svm

35 / 36

Made by a human with a computer. Slides at https://iml.numbat.space.

Code and data at https://github.com/numbats/iml.

Created using R Markdown with flair by xaringan, and kunoichi (female ninja) style.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

36 / 36
