
  • An Introduction to Support Vector Machine and Spectral Clustering

    Jing Wang

    Department of Statistics, University of Michigan

    Feb 7th, 2008

  • Outline

    - Support Vector Machine
    - Spectral Clustering

  • Separating Hyperplanes

    Consider a two-class, linearly separable classification problem.

  • Separating Hyperplanes

    Let $(x_1, y_1), \ldots, (x_n, y_n)$ be the training data, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$.

    The data are linearly separable if there exist $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ such that

    $$y_i(w^T x_i + b) > 0$$

    for $i = 1, \ldots, n$. The pair $(w, b)$ defines a hyperplane with equation

    $$\{x : w^T x + b = 0\}$$

    called a separating hyperplane.

  • What is a good Decision Boundary?

    - Are all decision boundaries equally good?

  • Maximum Margin Hyperplane / Optimal Separating Hyperplane

    - The decision boundary should be as far away from the data of both classes as possible.
    - Maximize the margin $\rho$: the distance from the hyperplane to the closest data point,

      $$\rho(w, b) := \min_{i=1,\ldots,n} \frac{|w^T x_i + b|}{\|w\|}$$

      (a small numerical sketch follows below).
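A tiny numpy sketch of this margin computation for an arbitrary $(w, b)$ and a few synthetic points (illustration only, not from the talk):

```python
# Margin of a hyperplane (w, b) with respect to a data set:
# rho(w, b) = min_i |w^T x_i + b| / ||w||.
import numpy as np

def margin(w, b, X):
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# Example with an arbitrary hyperplane and three points.
X = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, -1.0]])
print(margin(np.array([1.0, -1.0]), 0.5, X))
```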

  • Maximum Margin Hyperplane / Optimal Separating Hyperplane

    - The maximum margin (or optimal separating) hyperplane is the solution of

      $$(w^*, b^*) = \arg\max_{w,b}\ \rho(w, b)$$

  • Finding the Hyperplane

    - Rescale $(w, b)$ so that it is in canonical form:

      $$y_i(w^T x_i + b) \ge 1 \ \text{ for all } i, \qquad y_i(w^T x_i + b) = 1 \ \text{ for some } i$$

    - The margin becomes $\rho(w, b) = \frac{1}{\|w\|}$.

  • Finding the Hyperplane

    - The optimal hyperplane is the solution of

      $$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\ \ y_i(w^T x_i + b) \ge 1,\ \ i = 1, \ldots, n$$

    - This is a constrained quadratic programming (QP) problem (a solver sketch follows below).
    - The $x_i$ such that $y_i(w^T x_i + b) = 1$ are called support vectors.
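A minimal sketch of this QP using the cvxpy modeling library (cvxpy and the synthetic, linearly separable data are assumptions here, not part of the talk):

```python
# Hard-margin SVM primal QP: min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, size=(20, 2))   # class +1
X_neg = rng.normal(loc=-2.0, size=(20, 2))   # class -1
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]               # canonical-form constraints
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

rho = 1.0 / np.linalg.norm(w.value)                          # rho(w*, b*) = 1 / ||w*||
print("w* =", w.value, "b* =", b.value, "margin =", rho)
```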

  • Optimal Soft Margin Hyperplane

    Uh-oh! There is going to be a problem! What should we do?

    Real data are often not linearly separable.

  • Optimal Soft Margin Hyperplane

    - We allow an "error" (slack) $\xi_i$ for each sample; the total slack $\sum_{i=1}^n \xi_i$ upper-bounds the number of misclassified samples.
    - We want to minimize $\sum_{i=1}^n \xi_i$, where the $\xi_i$ satisfy

      $$y_i(w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0,\ \ i = 1, \ldots, n$$

    - If $x_i$ is misclassified, then $\xi_i > 1$.
    - Hence $\frac{1}{n}\sum_{i=1}^n \xi_i \ge$ training error.

  • Optimal Soft Margin Hyperplane

    - For linearly nonseparable data, we modify the QP by introducing "slack variables" $\xi_1, \ldots, \xi_n$.
    - The optimal soft margin hyperplane $(w, b)$ is the solution of

      $$\min_{w,b,\xi_i}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.}\ \ y_i(w^T x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \ldots, n$$

    - $C$: tradeoff parameter between error and margin (see the sketch below).
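The hard-margin cvxpy sketch above extends to the soft-margin QP by adding the slack variables (again, cvxpy and the data layout are assumptions, not part of the talk):

```python
# Soft-margin SVM primal QP: min (1/2)||w||^2 + C * sum(xi)
#   s.t. y_i (w^T x_i + b) >= 1 - xi_i,  xi_i >= 0.
import numpy as np
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)                         # slack variables
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```

Larger C penalizes slack more heavily (closer to the hard-margin solution); smaller C allows a wider margin at the cost of more training errors.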

  • Support Vector Machines

    - http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
    - A support vector machine (SVM) is an extension of the optimal soft margin linear classifier that allows for nonlinear decision boundaries.

  • How does it Work?

    - Key idea: transform the $x_i$ to a higher dimensional space to "make life easier".
      1. Input space: the space where the points $x_i$ are located
      2. Feature space: the space of the $\phi(x_i)$ after transformation
    - A linear operation in the feature space is equivalent to a nonlinear operation in the input space, e.g. $X_1^2 + X_1 X_2$.

  • Transformation

  • Wait a second...

    - Computation in the feature space can be costly because it is high dimensional; e.g. the feature space of the Gaussian kernel is infinite dimensional.
    - We need to use the kernel trick.

  • Soft Margin Dual

    - Recall the primal QP for the optimal soft margin hyperplane:

      $$\min_{w,b,\xi_i}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.}\ \ y_i(w^T x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \ldots, n$$

    - The Lagrangian is

      $$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\left[y_i(w^T x_i + b) - 1 + \xi_i\right] - \sum_{i=1}^n \beta_i \xi_i$$

    - At a stationary point, $\frac{\partial L}{\partial w} = 0$, $\frac{\partial L}{\partial b} = 0$, $\frac{\partial L}{\partial \xi_i} = 0$.

  • Soft Margin Dual

    - After substitution, we get $L = -\frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^n \alpha_i$, and $w = \sum_{j=1}^n \alpha_j y_j x_j$.
    - Dual QP:

      $$\max_{\alpha,\beta}\ -\frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^n \alpha_i$$
      $$\text{s.t.}\ \ \sum_{i=1}^n \alpha_i y_i = 0, \qquad \alpha_i + \beta_i = C, \qquad \alpha_i \ge 0,\ \beta_i \ge 0,\ \ i = 1, \ldots, n$$

  • Soft Margin Dual

    - After eliminating the $\{\beta_i\}$, the dual becomes

      $$\max_{\alpha}\ -\frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^n \alpha_i \quad \text{s.t.}\ \ \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C,\ \ i = 1, \ldots, n$$

    - If $\alpha_i^*$ is optimal for the dual QP, then

      $$w^* = \sum_{i=1}^n \alpha_i^* y_i x_i$$

      is optimal for the primal QP. It is a linear combination of data points (see the sketch below).
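A small numpy sketch of this recovery, assuming the dual variables `alpha` have already been produced by some QP solver:

```python
# Recover the primal solution w* = sum_i alpha_i* y_i x_i from dual variables.
import numpy as np

def primal_from_dual(alpha, X, y):
    # Elementwise alpha_i * y_i, then a weighted sum of the rows of X.
    return (alpha * y) @ X                    # shape (d,)

# Support vectors are the points with alpha_i* > 0 (up to numerical tolerance).
def support_vector_indices(alpha, tol=1e-8):
    return np.where(alpha > tol)[0]
```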

  • Solution

    - From the KKT conditions, we have

      $$\alpha_i^*\left(1 - \xi_i^* - y_i(w^{*T} x_i + b^*)\right) = 0 \quad \forall i, \qquad \beta_i^* \xi_i^* = 0 \quad \forall i$$

    - The $x_i$ for which $y_i(w^{*T} x_i + b^*) = 1 - \xi_i^*$ are support vectors. They lie on or inside the margin of separation, form a small fraction of the data, and are hard to classify.
    - $w^* = \sum_{\text{support vectors}} \alpha_i^* y_i x_i$: a sparse representation.

  • Solution

    - $b^* = y_i - w^{*T} x_i$, for any $x_i$ such that $0 < \alpha_i^* < C$.
    - In practice, it is common to average over several $i$ with $0 < \alpha_i^* < C$ to counter numerical imprecision (see the sketch below).
    - The classifier is

      $$f(x) = \operatorname{sign}\left(\sum_{i=1}^n \alpha_i^* y_i x_i^T x + b^*\right)$$
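A numpy sketch of this averaging and of the resulting linear classifier (`alpha` is again assumed to come from a dual solver):

```python
# Compute b* by averaging y_i - w*^T x_i over the "margin" support vectors
# (0 < alpha_i* < C), then form the linear SVM decision function.
import numpy as np

def intercept_and_classifier(alpha, X, y, C, tol=1e-8):
    w = (alpha * y) @ X                                   # w* = sum_i alpha_i* y_i x_i
    on_margin = (alpha > tol) & (alpha < C - tol)         # 0 < alpha_i* < C
    b = np.mean(y[on_margin] - X[on_margin] @ w)          # average for numerical stability
    classify = lambda x_new: np.sign(x_new @ w + b)       # f(x) = sign(w*^T x + b*)
    return w, b, classify
```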

  • Nonlinear Classification

    - Map patterns into a high-dimensional feature space:

      $$X = \begin{bmatrix} X^{(1)} \\ \vdots \\ X^{(d)} \end{bmatrix} \longrightarrow \Phi(X) = \begin{bmatrix} \phi^{(1)}(X) \\ \vdots \\ \phi^{(m)}(X) \end{bmatrix}$$

      where $m > d$ and the $\phi^{(j)}: \mathbb{R}^d \to \mathbb{R}$ are nonlinear.
    - Build a linear classifier in feature space, which induces a nonlinear classifier in the original input space.
    - Example: $[x^{(1)}, x^{(2)}] \to [1,\ x^{(1)},\ x^{(2)},\ x^{(1)}x^{(2)},\ (x^{(1)})^2,\ (x^{(2)})^2]$ (see the sketch below).
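A numpy sketch of this explicit degree-2 feature map (illustration only):

```python
# Explicit feature map for 2-dimensional inputs, matching the example
# [x1, x2] -> [1, x1, x2, x1*x2, x1^2, x2^2].  A linear classifier on Phi(x)
# is a quadratic classifier on the original x.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# Map a whole data matrix X of shape (n, 2) into feature space.
def phi_matrix(X):
    return np.array([phi(x) for x in X])      # shape (n, 6)
```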

  • Nonlinear Classification

    - The optimal soft margin hyperplane in feature space is the solution of

      $$\max_{\alpha}\ -\frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \langle \Phi(x_i), \Phi(x_j) \rangle + \sum_{i=1}^n \alpha_i \quad \text{s.t.}\ \ \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C,\ \ i = 1, \ldots, n$$

    - $w^* = \sum_{i=1}^n \alpha_i^* y_i \Phi(x_i) \in \mathbb{R}^m$
    - $b^* = y_i - \sum_{j=1}^n \alpha_j^* y_j \langle \Phi(x_i), \Phi(x_j) \rangle$, for any $i$ such that $0 < \alpha_i^* < C$

  • Kernel Trick

    - In fact, we don't need to compute $w^*$:

      $$f(x) = \operatorname{sign}\{w^{*T}\Phi(x) + b^*\} = \operatorname{sign}\left\{\sum_{i=1}^n \alpha_i^* y_i \langle \Phi(x_i), \Phi(x) \rangle + b^*\right\}$$

      where $b^* = y_i - \sum_{j=1}^n \alpha_j^* y_j \langle \Phi(x_i), \Phi(x_j) \rangle$.
    - All we need are inner products, so define $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$.
    - The size of the dual QP is $n$ instead of $m$.

  • Kernel Trick

    How do we get the kernel matrix $(K(x_i, x_j))_{i,j}$?
    - Explicitly calculate it from $\Phi(x)$: $O(m \times n^2)$, and $m$ can be infinite.
    - Use a $\Phi$ such that $K(x_i, x_j)$ has a nice, simple, easily computed form (the kernel trick).

  • Inner Product Kernel

    - An example:

      $$\Phi(u) = \left[(u^{(1)})^2,\ \sqrt{2}\, u^{(1)} u^{(2)},\ (u^{(2)})^2\right]^T, \qquad K(u, v) = (u^T v)^2$$

    - Some commonly used inner product kernels (sketched numerically below):
      1. Homogeneous polynomial kernel: $K(u, v) = \langle u, v \rangle^p$, $p = 1, 2, \ldots$
      2. Inhomogeneous polynomial kernel: $K(u, v) = (\langle u, v \rangle + c)^p$, $p = 1, 2, \ldots$, $c > 0$
      3. Gaussian kernel: $K(u, v) = (2\pi\sigma^2)^{-d/2}\exp\left\{-\frac{\|u - v\|^2}{2\sigma^2}\right\}$, $\sigma > 0$; here $\Phi$ is infinite dimensional
      4. If we use $K(u, v) = u^T v$, we recover the soft margin linear classifier
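Numpy sketches of these kernels, computed on whole data matrices (the normalizing constant on the Gaussian kernel is kept as on the slide, although many references omit it):

```python
# Kernel matrices from the slide.  X has shape (n, d), Z has shape (m, d);
# each function returns an (n, m) kernel matrix.
import numpy as np

def polynomial_kernel(X, Z, p=2, c=0.0):
    # c = 0 gives the homogeneous kernel <u, v>^p; c > 0 gives (<u, v> + c)^p.
    return (X @ Z.T + c) ** p

def gaussian_kernel(X, Z, sigma=1.0):
    # Squared distances ||u - v||^2 via ||u||^2 + ||v||^2 - 2 u^T v.
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Z ** 2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    d = X.shape[1]
    return (2.0 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq_dists / (2.0 * sigma ** 2))
```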

  • Summary

    Let $K$ be an inner product kernel. The SVM classifier is

    $$f(x) = \operatorname{sign}\left\{\sum_{i=1}^n \alpha_i^* y_i K(x, x_i) + b^*\right\}$$

    where $\alpha_i^*$ is the solution of

    $$\max_{\alpha}\ -\frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^n \alpha_i \quad \text{s.t.}\ \ \sum_{i=1}^n \alpha_i y_i = 0 \ \text{ and }\ 0 \le \alpha_i \le C,\ \ i = 1, \ldots, n$$

    and $b^*$ is given by

    $$b^* = y_i - \sum_{j=1}^n \alpha_j^* y_j K(x_i, x_j)$$

    for $i$ such that $0 < \alpha_i^* < C$.
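In practice this kernelized dual is solved by library implementations. A minimal sketch using scikit-learn's SVC on synthetic data (scikit-learn is an assumption here, not something the talk references):

```python
# Kernel SVM on synthetic, nonlinearly separable data.
# SVC solves the kernelized dual QP summarized above; C and the kernel
# (here RBF / Gaussian) are the tuning choices.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 1.0, 1, -1)    # label by a circle: not linearly separable

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)            # gamma = 1 / (2 sigma^2)
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))
```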

  • Multi-class Classification with SVM

    With N outputs, learn N SVMs (see the sketch below):

    1. SVM 1 learns "output = 1" vs "output ≠ 1"
    2. SVM 2 learns "output = 2" vs "output ≠ 2"
    ...
    N. SVM N learns "output = N" vs "output ≠ N"

    Then, to predict the output for a new input, predict with each SVM and find the one that puts the prediction furthest into the positive region.
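A sketch of this one-vs-rest scheme, leaning on scikit-learn's SVC for the N binary problems (an assumption; `decision_function` plays the role of "how far into the positive region"):

```python
# One-vs-rest multi-class SVM: train one binary SVM per class, then predict the
# class whose SVM pushes the point furthest into its positive region.
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, C=1.0):
    models = {}
    for c in classes:
        y_binary = np.where(y == c, 1, -1)               # "output = c" vs "output != c"
        models[c] = SVC(kernel="rbf", C=C).fit(X, y_binary)
    return models

def predict_one_vs_rest(models, X_new):
    classes = list(models)
    # decision_function gives a signed distance to each binary decision boundary.
    scores = np.column_stack([models[c].decision_function(X_new) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]
```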

  • Strength and Weakness

    Strengths:
    - Anecdotally, works very well
    - Training is relatively easy; there are no local optima
    - The tradeoff between classifier complexity and error can be controlled explicitly

    Weakness:
    - Need to choose a "good" kernel function

  • Outline

    - Support Vector Machine
    - Spectral Clustering

  • Spectral Clustering

    The basic idea: reduce the problem of clustering to graph partitioning.

    Find a partition of the graph such that the intra-cluster weights (similarities) are high and the inter-cluster weights are low.

  • What do we need - a Similarity Graph

    - A similarity graph: a similarity matrix $W = [w_{ij}]$, where $w_{ij} \ge 0$, $w_{ij} = w_{ji}$, and $w_{ij}$ is larger if $x_i, x_j$ are more similar.
    - Good thing: $x_i$ and $x_j$ don't have to be Euclidean data.
    - An example: the $l$-nearest neighbor graph (constructed below)
      1. $x_i, x_j$ adjacent $\leftrightarrow$ $x_i$ is an $l$-nearest neighbor of $x_j$, or vice versa
      2. $w_{ij} = \exp\{-\|x_i - x_j\|^2 / 2\sigma^2\}$
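A minimal numpy sketch of the $l$-nearest-neighbor similarity graph (assumptions beyond the slide: non-adjacent pairs get weight 0, and the graph is symmetrized by taking the maximum of $w_{ij}$ and $w_{ji}$):

```python
# Symmetric l-nearest-neighbor similarity graph with Gaussian weights
# w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); w_ij = 0 if i, j are not adjacent.
import numpy as np

def knn_similarity_graph(X, l=10, sigma=1.0):
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        # l nearest neighbors of i (excluding i itself, which has distance 0).
        neighbors = np.argsort(sq_dists[i])[1 : l + 1]
        W[i, neighbors] = np.exp(-sq_dists[i, neighbors] / (2.0 * sigma ** 2))
    # Adjacent if either point is an l-nearest neighbor of the other:
    # symmetrize by taking the elementwise maximum of W and W^T.
    return np.maximum(W, W.T)
```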

  • What do we need - a Graph Laplacian

    - Degree of a vertex $x_i$: $d_i = \sum_{j=1}^n w_{ij}$
    - Degree matrix $D = \operatorname{diag}(d_1, \ldots, d_n)$
    - The unnormalized graph Laplacian is $L = D - W$, i.e. $l_{ii} = \sum_{j \ne i} w_{ij}$ and $l_{ij} = -w_{ij}$ for $i \ne j$ (computed below).
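A short numpy sketch of $L = D - W$:

```python
# Unnormalized graph Laplacian L = D - W, where D is the diagonal degree matrix.
import numpy as np

def unnormalized_laplacian(W):
    degrees = W.sum(axis=1)          # d_i = sum_j w_ij
    D = np.diag(degrees)
    return D - W
```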

  • Properties of a Graph Laplacian

    - $L$ is symmetric and positive semi-definite.
    - The smallest eigenvalue of $L$ is 0, with corresponding eigenvector $\mathbf{1} = [1, 1, \cdots, 1]^T$.
    - $L$ has $n$ non-negative, real-valued eigenvalues $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$.

  • Property Relevant for Clustering

    Let $A \subseteq \{x_1, \ldots, x_n\}$ be a cluster. Define

    $$\mathbf{1}_A = [f_1, f_2, \cdots, f_n]^T \in \mathbb{R}^n$$

    where $f_i = 1$ if $x_i \in A$ and $f_i = 0$ otherwise.

    Proposition: Suppose the graph has connected components $A_1, \ldots, A_K$. Then the null space of $L$ has dimension $K$ and is spanned by $\mathbf{1}_{A_1}, \ldots, \mathbf{1}_{A_K}$.

    If $f \in N(L)$, then $f = \sum_{k=1}^K \alpha_k \mathbf{1}_{A_k}$.

  • Spectral Clustering - Ideal Case

    How can we use this result to devise a clustering algorithm?
    - Ideal case: there are $K$ connected components and $K$ is known.
      1. Compute $L$.
      2. Compute a basis $u_1, \ldots, u_K$ for $N(L)$. They are eigenvectors of $L$ with eigenvalue 0.
      3. Let $y_i = (u_{i1}, \ldots, u_{iK})^T$, where $u_i = \sum_{k=1}^K \alpha_{ik} \mathbf{1}_{A_k}$.
      4. If $x_i, x_j$ are in the same cluster, then $y_i = y_j$.
    - How to cluster?

  • Spectral Clustering - Nonideal Case

    - Nonideal case: there are edges connecting points in different clusters.

      $$L = \begin{bmatrix} \text{large} & \text{small} \\ \text{small} & \text{large} \end{bmatrix} = \begin{bmatrix} \text{large} & 0 \\ 0 & \text{large} \end{bmatrix} + \text{small}$$

  • Changes in Algorithm

    - If we perturb a matrix by another matrix with small entries, then the eigenvalues and eigenvectors of the matrix are perturbed by a correspondingly small amount.
    - We can use the eigenvectors of $L$ with the $K$ smallest eigenvalues as an approximation to the null space of an idealized $L$ based on the true clusters.
    - What if we don't know $K$?

  • Spectral Clustering Algorithm

    1. Construct a similarity graph.

    2. Form the unnormalized graph Laplacian $L \in \mathbb{R}^{n \times n}$.

    3. Determine the $K$ smallest eigenvalues of $L$, $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_K$, and the corresponding eigenvectors $u_1, \ldots, u_K \in \mathbb{R}^n$. If $K$ is not known, find the first large gap in the spectrum.

    4. Define $y_i = (u_{i1}, \ldots, u_{iK})^T$, $i = 1, \ldots, n$.

    5. Cluster $\{y_i\}_{i=1}^n$ using K-means clustering and assign $\{x_i\}_{i=1}^n$ to the corresponding clusters. (This works in $K$-dimensional space instead of $d$-dimensional space; see the end-to-end sketch below.)
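A minimal end-to-end sketch of these five steps, reusing the `knn_similarity_graph` and `unnormalized_laplacian` sketches above and scikit-learn's KMeans for step 5 (scikit-learn is an assumption, not something named in the talk):

```python
# Unnormalized spectral clustering, following the five steps above.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, K, l=10, sigma=1.0):
    W = knn_similarity_graph(X, l=l, sigma=sigma)             # step 1: similarity graph
    L = unnormalized_laplacian(W)                             # step 2: L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)                      # step 3: eigenvalues ascending
    Y = eigvecs[:, :K]                                        # step 4: y_i = i-th row of [u_1 ... u_K]
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(Y)   # step 5: K-means in R^K
    return labels
```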

  • More

    - Normalized spectral clustering uses the normalized graph Laplacian

      $$\tilde{L} = D^{-1}L = I - D^{-1}W$$

    - Spectral clustering may be used to solve relaxations of some graph cut problems.
    - Unnormalized spectral clustering → Ratio Cut:

      $$\text{RatioCut}(A_1, \ldots, A_K) = \frac{1}{2}\sum_{k=1}^K \frac{W(A_k, \bar{A}_k)}{|A_k|}$$

    - Normalized spectral clustering → Normalized Cut:

      $$\text{Ncut}(A_1, \ldots, A_K) = \frac{1}{2}\sum_{k=1}^K \frac{W(A_k, \bar{A}_k)}{\text{vol}(A_k)}$$

  • Reference

    - Von Luxburg, U. A Tutorial on Spectral Clustering. Technical Report, MPI Tuebingen, 2006.
    - C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):955-974, 1998.
    - V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
