Support Vector Machines Tao Tao Department of computer science University of Illinois


Support Vector Machines

Tao Tao

Department of computer science

University of Illinois

Much of the content, and some of the slides, are adapted from:

• Gentle Guide to Support Vector Machines, Ming-Hsuan Yang

• Support Vector and Kernel Methods, Thorsten Joachims

Problem

Optimal hyper-plane to classify data points

How to choose this hyper-plane?

What is “optimal”?

Intuition: to maximize the margin

What is “optimal”?

Statistically: risk minimization

• Risk function: $\mathrm{Risk}_P(h) = P(h(x) \neq y) = \int \Delta(h(x) \neq y)\, dP(x, y)$, for $h \in H$

h: hyper-plane function; x: input vector; y: label in {1, -1}; Δ: indicator function

• Minimization: $h_{opt} = \arg\min_{h \in H} \mathrm{Risk}_P(h)$

In practice…

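• Given N observations (X, Y), where the labels Y take values in {1, -1}

• Looking for a mapping: x -> f(x, α), with values in {1, -1}

• Expected risk: $R(\alpha) = \int \Delta(f(x, \alpha) \neq y)\, dP(x, y)$

• Empirical risk: $R_{emp}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} \Delta(f(x_i, \alpha) \neq y_i)$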

Question: Are they consistent in terms of minimization?
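
The empirical risk is simply the fraction of the N observations that the mapping labels incorrectly. A minimal sketch, assuming a one-dimensional input and a made-up threshold classifier (both are illustrative, not from the slides):

```python
def empirical_risk(classifier, samples):
    """Fraction of labelled samples (x, y) that the classifier gets wrong."""
    errors = sum(1 for x, y in samples if classifier(x) != y)
    return errors / len(samples)


def threshold_classifier(x):
    """Hypothetical mapping f(x): +1 for non-negative inputs, -1 otherwise."""
    return 1 if x >= 0 else -1


# Made-up observations (x, y) with labels in {1, -1}.
samples = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.5, 1), (-0.1, 1)]

risk = empirical_risk(threshold_classifier, samples)
print("empirical risk:", risk)
```

The expected risk cannot be computed this way because P(x, y) is unknown, which is why the consistency question matters.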

Vapnik/Chervonenkis (VC) dimension

Definition: The VC dimension of H is the maximum number h of examples that can be split into two sets in all $2^h$ ways using functions from H.

Example: For hyper-planes in $R^2$, the VC dimension is 3 (in $R^n$, the VC dimension is n+1).

But 4 points in $R^2$ cannot be split in all $2^4$ ways by a line.
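
The counting argument can be probed numerically. The sketch below tries every labeling of a point set and uses a plain perceptron as the separability test; the perceptron, the epoch cap, and the two point sets are assumptions chosen for illustration. Hitting the cap is only evidence of non-separability, not a proof, and the four corners of a square are just one 4-point configuration.

```python
from itertools import product


def perceptron_separates(points, labels, max_epochs=1000):
    """Return True if a perceptron finds (w, b) with y * (w.x + b) > 0 for all points.

    Reaching the epoch cap is treated as "not separable" (evidence, not proof).
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:
                # Standard perceptron update on a misclassified point.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def count_linear_splits(points):
    """Number of the 2^n labelings that a line (hyper-plane) can realize."""
    labelings = product([1, -1], repeat=len(points))
    return sum(perceptron_separates(points, labels) for labels in labelings)


three_points = [(0, 0), (1, 0), (0, 1)]
four_points = [(0, 0), (1, 0), (0, 1), (1, 1)]

print(count_linear_splits(three_points), "of", 2 ** 3, "labelings")
print(count_linear_splits(four_points), "of", 2 ** 4, "labelings")
```

For the square, the two labelings that put opposite corners in the same class are the ones no line can realize.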

Upper bound for expected risk

This bound for the expected risk holds with probability 1-η.

h: VC dimension

the first term: training error (empirical risk)

the second term: VC confidence (keeping it small avoids overfitting)
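
The formula itself did not survive on this slide. Assuming it is the standard VC bound (as stated in Vapnik's work and in common SVM tutorials), with N training examples and classifier f(x, α) as assumed notation, it would read:

```latex
R(\alpha) \;\le\; R_{emp}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\frac{\eta}{4}}{N}}
```

The square-root term grows with h and shrinks with N, which is why a rich hypothesis class needs more data before its low training error can be trusted.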

Error vs. VC dimension

Want to minimize expected risk?

• It is not enough just to minimize the empirical risk

• Need to choose an appropriate VC dimension

• Make both parts small

Solution: Structural Risk Minimization (SRM)

Structural risk minimization

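Nested structure of hypothesis space: $H_1 \subset H_2 \subset \dots \subset H_n \subset \dots$

$h(n) \le h(n+1)$, where $h(n)$ is the VC dimension of $H_n$

Tradeoff between VC dimension and empirical risk

Problem: a larger VC dimension lowers the minimum empirical risk but raises the VC confidence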
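
A minimal sketch of the selection step in SRM: for each nested class, add its minimum empirical risk to a VC confidence term and keep the class with the smallest sum. The confidence formula used here is the standard VC bound, which is an assumption about what the slides intended, and the (VC dimension, empirical risk) pairs are made-up numbers.

```python
import math


def vc_confidence(h, n, eta):
    """VC confidence term of the (assumed) standard bound, holding with prob. 1 - eta."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)


def risk_bound(emp_risk, h, n, eta):
    """Upper bound on expected risk: empirical risk plus VC confidence."""
    return emp_risk + vc_confidence(h, n, eta)


n, eta = 1000, 0.05

# Hypothetical nested classes: VC dimension -> minimum empirical risk reached.
candidates = {5: 0.30, 20: 0.10, 100: 0.02}

bounds = {h: risk_bound(r, h, n, eta) for h, r in candidates.items()}
best_h = min(bounds, key=bounds.get)

for h in sorted(bounds):
    print(f"h={h:3d}  empirical={candidates[h]:.2f}  bound={bounds[h]:.3f}")
print("selected VC dimension:", best_h)
```

With these numbers the middle class wins: the richest class has the lowest training error but pays too much in the confidence term.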

Linear SVM

Given $x_i \in R^n$

Linearly separable: there exist $w \in R^n$ and $b \in R$ such that $y_i(w \cdot x_i + b) \ge 1$

Scale (w, b) so that the distance from the hyper-plane to the closest points, say $x_j$, equals $1/\|w\|$

Optimal separating hyper-plane (OSH): the one that maximizes $1/\|w\|$

Linear SVM example

Given (x, y), find (w, b) defining the hyper-plane $\langle x \cdot w \rangle + b = 0$

additional requirement: $\min_i |\langle w \cdot x_i \rangle + b| = 1$

$f(x, w, b) = \mathrm{sgn}(x \cdot w + b)$

ID   x1  x2  x3  x4  x5  x6  x7    y
D1    1   2   0   0   2   0   2    1
D2    0   0   0   3   0   1   1   -1
D3    0   2   1   0   0   0   3    1
D4    0   0   1   1   1   1   1   -1

w     2   3  -1  -3  -1  -1   0    b = 1
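
The example can be checked by evaluating f(x, w, b) on the four rows. The sketch below does that and, since the smallest |score| for the tabulated (w, b) comes out as 5 rather than 1, it also rescales (w, b) so that the additional requirement holds. The helper names are illustrative.

```python
import math

# Data and weights copied from the example table.
X = {
    "D1": [1, 2, 0, 0, 2, 0, 2],
    "D2": [0, 0, 0, 3, 0, 1, 1],
    "D3": [0, 2, 1, 0, 0, 0, 3],
    "D4": [0, 0, 1, 1, 1, 1, 1],
}
y = {"D1": 1, "D2": -1, "D3": 1, "D4": -1}
w = [2, 3, -1, -3, -1, -1, 0]
b = 1


def score(x, w, b):
    """Value of <w . x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sgn(value):
    return 1 if value >= 0 else -1


scores = {name: score(x, w, b) for name, x in X.items()}
predictions = {name: sgn(s) for name, s in scores.items()}

# Rescale so that min_i |<w . x_i> + b| = 1 (the canonical form).
smallest = min(abs(s) for s in scores.values())
w_canonical = [wi / smallest for wi in w]
b_canonical = b / smallest
margin = 1 / math.sqrt(sum(wi * wi for wi in w_canonical))

print(scores)
print(predictions)
print("margin 1/||w|| after rescaling:", margin)
```

Rescaling changes neither the hyper-plane nor the predictions; it only fixes the scale so that 1/||w|| can be read as the margin.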

VC dimension upper bound

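Lemma [Vapnik 1995]

• Let R be the radius of the smallest ball that covers all x: $\{x : \|x - a\| < R\}$

• Let $f_{w,b} = \mathrm{sgn}((w \cdot x) + b)$ be the decision functions

• $\|w\| \le A$

Then the VC dimension satisfies $h < R^2 A^2 + 1$

$\|w\| = 1/\delta$, where δ is the margin length

(Figure: hyper-plane with normal vector w and margin δ inside a ball of radius R)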

So …

Maximizing the margin δ

═> Minimizing ||w||

═> Smallest acceptable VC dimension

═> Constructing an optimal hyper-plane

Is everything clear??

How to do it? Quadratic Programming!

Constrained quadratic programming

Minimize $\frac{1}{2} \langle w \cdot w \rangle$

Subject to $y_i(\langle w \cdot x_i \rangle + b) \ge 1$

Solve it: use Lagrange multipliers to find the saddle point

For more details, see the book:

An Introduction to Support Vector Machines (Cristianini and Shawe-Taylor)

What are “support vectors”?

$y_i(w \cdot x_i + b) \ge 1$

Most of the $x_i$ satisfy this with strict inequality;

the $x_i$ that satisfy it with equality

are called support vectors.
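
Given a hyper-plane in canonical form, the support vectors can be picked out by testing which constraints hold with equality. A minimal sketch on a made-up 2-D training set; the points and the hyper-plane (w, b) are assumptions chosen so that the arithmetic is easy to follow.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Hypothetical 2-D training set, not taken from the slides.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Assumed to be the optimal hyper-plane for this set, already in canonical form.
w, b = (0.5, 0.5), -1.0


def constraint_values(points, labels, w, b):
    """Values y_i * (w . x_i + b); each must be >= 1 for a valid canonical hyper-plane."""
    return [y * (dot(w, x) + b) for x, y in zip(points, labels)]


def support_vector_indices(points, labels, w, b, tol=1e-9):
    """Indices of the points whose constraint holds with equality."""
    values = constraint_values(points, labels, w, b)
    return [i for i, v in enumerate(values) if abs(v - 1.0) <= tol]


values = constraint_values(points, labels, w, b)
support = support_vector_indices(points, labels, w, b)

print("constraint values:", values)
print("support vectors:", [points[i] for i in support])
```

Removing any of the other points would leave the hyper-plane unchanged, which is what makes the support vectors special.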

Inseparable data

Soft margin classifier

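Loosen the margin by introducing N nonnegative slack variables $\xi = (\xi_1, \xi_2, \dots, \xi_N)$

so that $y_i(\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i$

Problem:

Minimize $\frac{1}{2} \langle w \cdot w \rangle + C \sum_i \xi_i$

Subject to $y_i(\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i$

$\xi_i \ge 0$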

C and ξ

C:

• C is small: maximize the minimum distance

• C is large: minimize the number of misclassified points

ξ:

• $\xi_i > 1$: misclassified points

• $0 < \xi_i < 1$: correctly classified, but closer than $1/\|w\|$

• $\xi_i = 0$: margin vectors
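
For a fixed hyper-plane, the smallest slack that satisfies the soft-margin constraint is ξ_i = max(0, 1 - y_i(<w·x_i> + b)). The sketch below computes the slacks, sorts the points into the three cases listed above, and evaluates the objective. The data, the hyper-plane (w, b), and C = 1 are made-up values for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def slack(x, y, w, b):
    """Smallest xi >= 0 with y * (w . x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def describe(xi):
    """Case distinction from the slide (xi exactly 1 means the point is on the hyper-plane)."""
    if xi == 0:
        return "on or beyond the margin"
    if xi < 1:
        return "correctly classified, inside the margin"
    return "misclassified"


def soft_margin_objective(points, labels, w, b, C):
    """Value of 1/2 <w . w> + C * sum_i xi_i for a given (w, b)."""
    slacks = [slack(x, y, w, b) for x, y in zip(points, labels)]
    return 0.5 * dot(w, w) + C * sum(slacks)


# Hypothetical data and hyper-plane, chosen so the numbers are easy to check.
points = [(2.0, 2.0), (1.5, 1.5), (0.5, 0.5), (0.0, 0.0)]
labels = [1, 1, 1, -1]
w, b, C = (0.5, 0.5), -1.0, 1.0

slacks = [slack(x, y, w, b) for x, y in zip(points, labels)]
for x, xi in zip(points, slacks):
    print(x, xi, describe(xi))
print("objective:", soft_margin_objective(points, labels, w, b, C))
```

Raising C makes each unit of slack more expensive, so the optimizer trades a narrower margin for fewer violations.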

Nonlinear SVM

$R^2 \to R^3$

Feature space

Φ maps the input space to the feature space:

Input space: a | b | c

Feature space: a | b | c | aa | ab | ac | bb | bc | cc
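
The mapping Φ in this example keeps the original attributes and appends every pairwise product. A minimal sketch, with `quadratic_features` as an illustrative name:

```python
from itertools import combinations_with_replacement


def quadratic_features(x):
    """Original attributes followed by all pairwise products, squares included."""
    pairs = combinations_with_replacement(range(len(x)), 2)
    return list(x) + [x[i] * x[j] for i, j in pairs]


# With (a, b, c) = (2, 3, 5) the output order is a, b, c, aa, ab, ac, bb, bc, cc.
features = quadratic_features([2, 3, 5])
print(features)
```

Three input attributes already become nine features; the count grows quickly with the number of attributes and the degree.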

Problem: Very many parameters! $O(N^p)$ attributes in feature space, for N input attributes and degree p.

Solution: Kernel methods!

Dual representations

Lagrange multipliers:

Require:

substitute
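
The formulas on this slide did not survive. A sketch of the standard derivation for the hard-margin problem (minimize ½<w·w> subject to y_i(<w·x_i> + b) ≥ 1), using multipliers α_i ≥ 0 as assumed notation:

```latex
\begin{aligned}
L(w, b, \alpha) &= \tfrac{1}{2}\langle w \cdot w \rangle
  - \sum_{i=1}^{N} \alpha_i \left[ y_i(\langle w \cdot x_i \rangle + b) - 1 \right],
  \qquad \alpha_i \ge 0 \\
\frac{\partial L}{\partial w} = 0 \;&\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i \\
\frac{\partial L}{\partial b} = 0 \;&\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 \\
W(\alpha) &= \sum_{i=1}^{N} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \alpha_i \alpha_j y_i y_j \langle x_i \cdot x_j \rangle
\end{aligned}
```

Substituting the two stationarity conditions into L removes w and b, leaving a function of α alone that is maximized subject to α_i ≥ 0 and Σ α_i y_i = 0.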

Constrained QP using dual

D is an N×N matrix such that $D_{i,j} = y_i y_j \langle x_i \cdot x_j \rangle$

Observation: the only way the data points appear in the training problem is in the form of dot products $\langle x_i \cdot x_j \rangle$
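
A minimal sketch of how D is built and used, assuming the dual objective has the standard form W(α) = Σ α_i - ½ αᵀDα (the slide's own formula is missing). The 2-D data and the vector α are made-up; α was picked by hand so that it satisfies Σ α_i y_i = 0.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def dual_matrix(X, y):
    """D[i][j] = y_i * y_j * <x_i . x_j>; the data enter only through dot products."""
    n = len(X)
    return [[y[i] * y[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]


def dual_objective(alpha, D):
    """W(alpha) = sum_i alpha_i - 1/2 * alpha^T D alpha (assumed standard form)."""
    n = len(alpha)
    quad = sum(alpha[i] * D[i][j] * alpha[j] for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


def weights_from_alpha(alpha, X, y):
    """Recover w = sum_i alpha_i * y_i * x_i."""
    dim = len(X[0])
    return [sum(a * yi * x[k] for a, yi, x in zip(alpha, y, X)) for k in range(dim)]


# Hypothetical training set and hand-picked multipliers.
X = [(2, 2), (3, 3), (0, 0), (-1, 0)]
y = [1, 1, -1, -1]
alpha = [0.25, 0.0, 0.25, 0.0]

D = dual_matrix(X, y)
print("D =", D)
print("W(alpha) =", dual_objective(alpha, D))
print("w =", weights_from_alpha(alpha, X, y))
```

Only the points with α_i > 0 contribute to w; for this toy set they are (2, 2) and (0, 0).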

Go back to nonlinear SVM…

Original:

Expanding to high dimensional space:

Problem: Φ is computationally expensive.

Fortunately: we only need $\Phi(x_i) \cdot \Phi(x_j)$

Kernel function

$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$

• Without knowing the exact Φ

• Replace $\langle x_i \cdot x_j \rangle$ by $K(x_i, x_j)$

• All previous derivations for the linear SVM still hold
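
A small numerical check of the identity K(x_i, x_j) = Φ(x_i)·Φ(x_j), using the homogeneous degree-2 polynomial kernel in two dimensions as an assumed example. The explicit map Φ(x) = (x1², √2·x1·x2, x2²) is one valid choice for this kernel.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def phi(x):
    """Explicit feature map for the degree-2 kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """K(x, z) = <x . z>^2, computed without ever forming phi."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))
implicit = poly2_kernel(x, z)
print(explicit, implicit)
```

The kernel needs one dot product in the input space, while the explicit route first builds both feature vectors; the gap widens as the dimension and degree grow.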

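How to choose a kernel function?

Mercer condition (necessary and sufficient):

K(u,v) is symmetric and positive semi-definite: $\int\!\!\int K(u, v)\, g(u)\, g(v)\, du\, dv \ge 0$ for every square-integrable g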

Some examples for kernel functions
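
The slide's own list of kernels did not survive. The sketch below shows four kernels that are commonly cited in the SVM literature; the parameter names and default values are assumptions, and the sigmoid kernel is known to satisfy the Mercer condition only for some parameter settings.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, degree=2, c=1.0):
    """(<u . v> + c)^degree"""
    return (dot(u, v) + c) ** degree


def rbf_kernel(u, v, gamma=0.5):
    """exp(-gamma * ||u - v||^2), the Gaussian radial basis function."""
    squared_distance = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, kappa=1.0, theta=0.0):
    """tanh(kappa * <u . v> + theta); not a Mercer kernel for all (kappa, theta)."""
    return math.tanh(kappa * dot(u, v) + theta)


u, v = (1.0, 2.0), (3.0, 4.0)
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(u, v))
```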

Multiple classes (k)

• One-against-the-rest: k SVMs

• One-against-one: k(k-1)/2 SVMs

• K-class SVM

• John Platt’s DAG method
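
The two counts can be seen by listing the binary problems each scheme has to train. A minimal sketch with made-up class names; the majority vote shown for one-against-one is one common way to combine the pairwise decisions, and tie-breaking is left out.

```python
from collections import Counter
from itertools import combinations

# Hypothetical class labels.
classes = ["sports", "politics", "science", "arts"]
k = len(classes)

# One-against-the-rest: one binary problem per class.
one_vs_rest = [(c, [other for other in classes if other != c]) for c in classes]

# One-against-one: one binary problem per unordered pair of classes.
one_vs_one = list(combinations(classes, 2))


def majority_vote(pairwise_winners):
    """Combine one-against-one decisions by picking the most frequent winner."""
    return Counter(pairwise_winners).most_common(1)[0][0]


# Made-up winners of the six pairwise classifiers for one test article.
winners = ["sports", "sports", "sports", "politics", "science", "science"]

print(len(one_vs_rest), "one-against-the-rest problems")
print(len(one_vs_one), "one-against-one problems")
print("predicted class:", majority_vote(winners))
```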

Application in text classification

• Counting each term in an article

• An article, therefore, becomes a vector (x)

(Figure: a sample article next to its table of term counts)

Attributes: terms

Value: occurrence or frequency
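
A minimal sketch of turning an article into such a vector. The sample sentence, the vocabulary, and the simple lowercase tokenization are assumptions made for illustration.

```python
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphabetic terms in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def to_vector(text, vocabulary):
    """One attribute per vocabulary term; the value is its number of occurrences."""
    counts = term_counts(text)
    return [counts[term] for term in vocabulary]


# Made-up article and vocabulary.
article = "The problem of linear regression is much older than the classification problem"
vocabulary = ["classification", "linear", "problem", "regression", "svm"]

x = to_vector(article, vocabulary)
print(dict(zip(vocabulary, x)))
```

Vectors built this way are high-dimensional and sparse, since most vocabulary terms do not occur in any one article.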

Conclusions

• Linear SVM

• VC dimension

• Soft margin classifier

• Dual representation

• Nonlinear SVM

• Kernel methods

• Multi-classifier

Thank you!