
  • An Introduction to Support Vector Machine and Spectral Clustering

    Jing Wang

    Department of Statistics, University of Michigan

    Feb 7th, 2008

  • Outline

    - Support Vector Machine
    - Spectral Clustering

  • Separating Hyperplanes

    Consider a two-class, linearly separable classification problem.

  • Separating Hyperplanes

    Let $(x_1, y_1), \ldots, (x_n, y_n)$ be the training data, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$.

    The data are linearly separable if there exist $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ such that

    $$y_i(w^T x_i + b) > 0$$

    for $i = 1, \ldots, n$. The pair $(w, b)$ defines a hyperplane with equation

    $$\{x : w^T x + b = 0\}$$

    called a separating hyperplane.

  • What is a good Decision Boundary?

    - Are all decision boundaries equally good?

  • Maximum Margin Hyperplane / Optimal Separating Hyperplane

    - The decision boundary should be as far away from the data of both classes as possible.
    - Maximize the margin $\rho$: the distance from the hyperplane to the closest data point,

      $$\rho(w, b) := \min_{i=1,\ldots,n} \frac{|w^T x_i + b|}{\|w\|}$$

      (a small numerical sketch follows below).
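A tiny numpy sketch of this margin computation for an arbitrary $(w, b)$ and a few synthetic points (illustration only, not from the talk):

```python
# Margin of a hyperplane (w, b) with respect to a data set:
# rho(w, b) = min_i |w^T x_i + b| / ||w||.
import numpy as np

def margin(w, b, X):
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# Example with an arbitrary hyperplane and three points.
X = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, -1.0]])
print(margin(np.array([1.0, -1.0]), 0.5, X))
```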

  • Maximum Margin Hyperplane / Optimal Separating Hyperplane

    - The maximum margin (or optimal separating) hyperplane is the solution of

      $$(w^*, b^*) = \arg\max_{w,b}\ \rho(w, b)$$

  • Finding the Hyperplane

    - Rescale $(w, b)$ so that it is in canonical form:

      $$y_i(w^T x_i + b) \ge 1 \ \text{ for all } i, \qquad y_i(w^T x_i + b) = 1 \ \text{ for some } i$$

    - The margin becomes $\rho(w, b) = \frac{1}{\|w\|}$.

  • Finding the Hyperplane

    - The optimal hyperplane is the solution of

      $$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\ \ y_i(w^T x_i + b) \ge 1,\ \ i = 1, \ldots, n$$

    - This is a constrained quadratic programming (QP) problem (a solver sketch follows below).
    - The $x_i$ such that $y_i(w^T x_i + b) = 1$ are called support vectors.
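A minimal sketch of this QP using the cvxpy modeling library (cvxpy and the synthetic, linearly separable data are assumptions here, not part of the talk):

```python
# Hard-margin SVM primal QP: min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, size=(20, 2))   # class +1
X_neg = rng.normal(loc=-2.0, size=(20, 2))   # class -1
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]               # canonical-form constraints
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

rho = 1.0 / np.linalg.norm(w.value)                          # rho(w*, b*) = 1 / ||w*||
print("w* =", w.value, "b* =", b.value, "margin =", rho)
```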

  • Optimal Soft Margin Hyperplane

    Uh-oh! There is going to be a problem! What should we do?

    Real data are often not linearly separable.

  • Optimal Soft Margin Hyperplane

    - We allow an "error" (slack) $\xi_i$ for each sample; the total slack $\sum_{i=1}^n \xi_i$ upper-bounds the number of misclassified samples.
    - We want to minimize $\sum_{i=1}^n \xi_i$, where the $\xi_i$ satisfy

      $$y_i(w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0,\ \ i = 1, \ldots, n$$

    - If $x_i$ is misclassified, then $\xi_i > 1$.
    - Hence $\frac{1}{n}\sum_{i=1}^n \xi_i \ge$ training error.

  • Optimal Soft Margin Hyperplane

    - For linearly nonseparable data, we modify the QP by introducing "slack variables" $\xi_1, \ldots, \xi_n$.
    - The optimal soft margin hyperplane $(w, b)$ is the solution of

      $$\min_{w,b,\xi_i}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.}\ \ y_i(w^T x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \ldots, n$$

    - $C$: tradeoff parameter between error and margin (see the sketch below).
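The hard-margin cvxpy sketch above extends to the soft-margin QP by adding the slack variables (again, cvxpy and the data layout are assumptions, not part of the talk):

```python
# Soft-margin SVM primal QP: min (1/2)||w||^2 + C * sum(xi)
#   s.t. y_i (w^T x_i + b) >= 1 - xi_i,  xi_i >= 0.
import numpy as np
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)                         # slack variables
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```

Larger C penalizes slack more heavily (closer to the hard-margin solution); smaller C allows a wider margin at the cost of more training errors.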

  • Support Vector Machines

    - http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
    - A support vector machine (SVM) is an extension of the optimal soft margin linear classifier that allows for nonlinear decision boundaries.

  • How does it Work?

    - Key idea: transform the $x_i$ to a higher dimensional space to "make life easier".
      1. Input space: the space where the points $x_i$ are located
      2. Feature space: the space of the $\phi(x_i)$ after transformation
    - A linear operation in the feature space is equivalent to a nonlinear operation in the input space, e.g. $X_1^2 + X_1 X_2$.

  • Transformation

  • Wait a second...

    - Computation in the feature space can be costly because it is high dimensional; e.g. the feature space of the Gaussian kernel is infinite dimensional.
    - We need to use the kernel trick.

  • Soft Margin Dual

    - Recall the primal QP for the optimal soft margin hyperplane:

      $$\min_{w,b,\xi_i}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.}\ \ y_i(w^T x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \ldots, n$$

    - The Lagrangian is

      $$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\left[y_i(w^T x_i + b) - 1 + \xi_i\right] - \sum_{i=1}^n \beta_i \xi_i$$

    - At a stationary point, $\frac{\partial L}{\partial w} = 0$, $\frac{\partial L}{\partial b} = 0$, $\frac{\partial L}{\partial \xi_i} = 0$.

  • Soft Margin Dual

    - After substitution, we get $L = -\frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^n \alpha_i$, and $w = \sum_{j=1}^n \alpha_j y_j x_j$.
    - Dual QP:

      $$\max_{\alpha,\beta}\ -\frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^n \alpha_i$$
      $$\text{s.t.}\ \ \sum_{i=1}^n \alpha_i y_i = 0, \qquad \alpha_i + \beta_i = C, \qquad \alpha_i \ge 0,\ \beta_i \ge 0,\ \ i = 1, \ldots, n$$

  • Soft Margin Dual

    - After eliminating the $\{\beta_i\}$, the dual becomes

      $$\max_{\alpha}\ -\frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^n \alpha_i \quad \text{s.t.}\ \ \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C,\ \ i = 1, \ldots, n$$

    - If $\alpha_i^*$ is optimal for the dual QP, then

      $$w^* = \sum_{i=1}^n \alpha_i^* y_i x_i$$

      is optimal for the primal QP. It is a linear combination of data points (see the sketch below).
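A small numpy sketch of this recovery, assuming the dual variables `alpha` have already been produced by some QP solver:

```python
# Recover the primal solution w* = sum_i alpha_i* y_i x_i from dual variables.
import numpy as np

def primal_from_dual(alpha, X, y):
    # Elementwise alpha_i * y_i, then a weighted sum of the rows of X.
    return (alpha * y) @ X                    # shape (d,)

# Support vectors are the points with alpha_i* > 0 (up to numerical tolerance).
def support_vector_indices(alpha, tol=1e-8):
    return np.where(alpha > tol)[0]
```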

  • Solution

    - From the KKT conditions, we have

      $$\alpha_i^*\left(1 - \xi_i^* - y_i(w^{*T} x_i + b^*)\right) = 0 \quad \forall i, \qquad \beta_i^* \xi_i^* = 0 \quad \forall i$$

    - The $x_i$ for which $y_i(w^{*T} x_i + b^*) = 1 - \xi_i^*$ are support vectors. They lie on or inside the margin of separation, form a small fraction of the data, and are hard to classify.
    - $w^* = \sum_{\text{support vectors}} \alpha_i^* y_i x_i$: a sparse representation.

  • Solution

    - $b^* = y_i - w^{*T} x_i$, for any $x_i$ such that $0 < \alpha_i^* < C$.
    - In practice, it is common to average over several $i$ with $0 < \alpha_i^* < C$ to counter numerical imprecision (see the sketch below).
    - The classifier is

      $$f(x) = \operatorname{sign}\left(\sum_{i=1}^n \alpha_i^* y_i x_i^T x + b^*\right)$$
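A numpy sketch of this averaging and of the resulting linear classifier (`alpha` is again assumed to come from a dual solver):

```python
# Compute b* by averaging y_i - w*^T x_i over the "margin" support vectors
# (0 < alpha_i* < C), then form the linear SVM decision function.
import numpy as np

def intercept_and_classifier(alpha, X, y, C, tol=1e-8):
    w = (alpha * y) @ X                                   # w* = sum_i alpha_i* y_i x_i
    on_margin = (alpha > tol) & (alpha < C - tol)         # 0 < alpha_i* < C
    b = np.mean(y[on_margin] - X[on_margin] @ w)          # average for numerical stability
    classify = lambda x_new: np.sign(x_new @ w + b)       # f(x) = sign(w*^T x + b*)
    return w, b, classify
```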

  • Nonlinear Classification

    - Map patterns into a high-dimensional feature space:

      $$X = \begin{bmatrix} X^{(1)} \\ \vdots \\ X^{(d)} \end{bmatrix} \longrightarrow \Phi(X) = \begin{bmatrix} \phi^{(1)}(X) \\ \vdots \\ \phi^{(m)}(X) \end{bmatrix}$$

      where $m > d$ and the $\phi^{(j)}: \mathbb{R}^d \to \mathbb{R}$ are nonlinear.
    - Build a linear classifier in feature space, which induces a nonlinear classifier in the original input space.
    - Example: $[x^{(1)}, x^{(2)}] \to [1,\ x^{(1)},\ x^{(2)},\ x^{(1)}x^{(2)},\ (x^{(1)})^2,\ (x^{(2)})^2]$ (see the sketch below).
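A numpy sketch of this explicit degree-2 feature map (illustration only):

```python
# Explicit feature map for 2-dimensional inputs, matching the example
# [x1, x2] -> [1, x1, x2, x1*x2, x1^2, x2^2].  A linear classifier on Phi(x)
# is a quadratic classifier on the original x.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# Map a whole data matrix X of shape (n, 2) into feature space.
def phi_matrix(X):
    return np.array([phi(x) for x in X])      # shape (n, 6)
```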

  • Nonlinear Classification

    - The optimal soft margin hyperplane in feature space is the solution of

      $$\max_{\alpha}\ -\frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \langle \Phi(x_i), \Phi(x_j) \rangle + \sum_{i=1}^n \alpha_i \quad \text{s.t.}\ \ \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C,\ \ i = 1, \ldots, n$$

    - $w^* = \sum_{i=1}^n \alpha_i^* y_i \Phi(x_i) \in \mathbb{R}^m$
    - $b^* = y_i - \sum_{j=1}^n \alpha_j^* y_j \langle \Phi(x_i), \Phi(x_j) \rangle$, for any $i$ such that $0 < \alpha_i^* < C$

  • Kernel Trick

    - In fact, we don't need to compute $w^*$:

      $$f(x) = \operatorname{sign}\{w^{*T}\Phi(x) + b^*\} = \operatorname{sign}\left\{\sum_{i=1}^n \alpha_i^* y_i \langle \Phi(x_i), \Phi(x) \rangle + b^*\right\}$$

      where $b^* = y_i - \sum_{j=1}^n \alpha_j^* y_j \langle \Phi(x_i), \Phi(x_j) \rangle$.
    - All we need are inner products, so define $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$.
    - The size of the dual QP is $n$ instead of $m$.

  • Kernel Trick

    How do we get the kernel matrix $(K(x_i, x_j))_{i,j}$?
    - Explicitly calculate it from $\Phi(x)$: $O(m \times n^2)$, and $m$ can be infinite.
    - Use a $\Phi$ such that $K(x_i, x_j)$ has a nice, simple, easily computed form (the kernel trick).

  • Inner Product Kernel

    - An example:

      $$\Phi(u) = \left[(u^{(1)})^2,\ \sqrt{2}\, u^{(1)} u^{(2)},\ (u^{(2)})^2\right]^T, \qquad K(u, v) = (u^T v)^2$$

    - Some commonly used inner product kernels (sketched numerically below):
      1. Homogeneous polynomial kernel: $K(u, v) = \langle u, v \rangle^p$, $p = 1, 2, \ldots$
      2. Inhomogeneous polynomial kernel: $K(u, v) = (\langle u, v \rangle + c)^p$, $p = 1, 2, \ldots$, $c > 0$
      3. Gaussian kernel: $K(u, v) = (2\pi\sigma^2)^{-d/2}\exp\left\{-\frac{\|u - v\|^2}{2\sigma^2}\right\}$, $\sigma > 0$; here $\Phi$ is infinite dimensional
      4. If we use $K(u, v) = u^T v$, we recover the soft margin linear classifier
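Numpy sketches of these kernels, computed on whole data matrices (the normalizing constant on the Gaussian kernel is kept as on the slide, although many references omit it):

```python
# Kernel matrices from the slide.  X has shape (n, d), Z has shape (m, d);
# each function returns an (n, m) kernel matrix.
import numpy as np

def polynomial_kernel(X, Z, p=2, c=0.0):
    # c = 0 gives the homogeneous kernel <u, v>^p; c > 0 gives (<u, v> + c)^p.
    return (X @ Z.T + c) ** p

def gaussian_kernel(X, Z, sigma=1.0):
    # Squared distances ||u - v||^2 via ||u||^2 + ||v||^2 - 2 u^T v.
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Z ** 2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    d = X.shape[1]
    return (2.0 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq_dists / (2.0 * sigma ** 2))
```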

  • Summary

    Let $K$ be an inner product kernel. The SVM classifier is

    $$f(x) = \operatorname{sign}\left\{\sum_{i=1}^n \alpha_i^* y_i K(x, x_i) + b^*\right\}$$

    where $\alpha_i^*$ is the solution of

    $$\max_{\alpha}\ -\frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^n \alpha_i \quad \text{s.t.}\ \ \sum_{i=1}^n \alpha_i y_i = 0 \ \text{ and }\ 0 \le \alpha_i \le C,\ \ i = 1, \ldots, n$$

    and $b^*$ is given by

    $$b^* = y_i - \sum_{j=1}^n \alpha_j^* y_j K(x_i, x_j)$$

    for $i$ such that $0 < \alpha_i^* < C$.
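In practice this kernelized dual is solved by library implementations. A minimal sketch using scikit-learn's SVC on synthetic data (scikit-learn is an assumption here, not something the talk references):

```python
# Kernel SVM on synthetic, nonlinearly separable data.
# SVC solves the kernelized dual QP summarized above; C and the kernel
# (here RBF / Gaussian) are the tuning choices.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 1.0, 1, -1)    # label by a circle: not linearly separable

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)            # gamma = 1 / (2 sigma^2)
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))
```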

  • Multi-class Classification with SVM

    With N outputs, learn N SVMs (see the sketch below):

    1. SVM 1 learns "output = 1" vs "output ≠ 1"
    2. SVM 2 learns "output = 2" vs "output ≠ 2"
    ...
    N. SVM N learns "output = N" vs "output ≠ N"

    Then, to predict the output for a new input, predict with each SVM and find the one that puts the prediction furthest into the positive region.
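A sketch of this one-vs-rest scheme, leaning on scikit-learn's SVC for the N binary problems (an assumption; `decision_function` plays the role of "how far into the positive region"):

```python
# One-vs-rest multi-class SVM: train one binary SVM per class, then predict the
# class whose SVM pushes the point furthest into its positive region.
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, C=1.0):
    models = {}
    for c in classes:
        y_binary = np.where(y == c, 1, -1)               # "output = c" vs "output != c"
        models[c] = SVC(kernel="rbf", C=C).fit(X, y_binary)
    return models

def predict_one_vs_rest(models, X_new):
    classes = list(models)
    # decision_function gives a signed distance to each binary decision boundary.
    scores = np.column_stack([models[c].decision_function(X_new) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]
```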

  • Strength and Weakness

    Strengths:
    - Anecdotally, works very well
    - Training is relatively easy; there are no local optima
    - The tradeoff between classifier complexity and error can be controlled explicitly

    Weakness:
    - Need to choose a "good" kernel function

  • Outline

    - Support Vector Machine
    - Spectral Clustering

  • Spectral Clustering

    The basic idea: reduce the problem of clustering to graph partitioning.

    Find a partition of the graph such that the intra-cluster weights (similarities) are high and the inter-cluster weights are low.

  • What do we need - a Similarity Graph

    - A similarity graph: a similarity matrix $W = [w_{ij}]$, where $w_{ij} \ge 0$, $w_{ij} = w_{ji}$, and $w_{ij}$ is larger if $x_i, x_j$ are more similar.
    - Good thing: $x_i$ and $x_j$ don't have to be Euclidean data.
    - An example: the $l$-nearest neighbor graph (constructed below)
      1. $x_i, x_j$ adjacent $\leftrightarrow$ $x_i$ is an $l$-nearest neighbor of $x_j$, or vice versa
      2. $w_{ij} = \exp\{-\|x_i - x_j\|^2 / 2\sigma^2\}$
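A minimal numpy sketch of the $l$-nearest-neighbor similarity graph (assumptions beyond the slide: non-adjacent pairs get weight 0, and the graph is symmetrized by taking the maximum of $w_{ij}$ and $w_{ji}$):

```python
# Symmetric l-nearest-neighbor similarity graph with Gaussian weights
# w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); w_ij = 0 if i, j are not adjacent.
import numpy as np

def knn_similarity_graph(X, l=10, sigma=1.0):
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        # l nearest neighbors of i (excluding i itself, which has distance 0).
        neighbors = np.argsort(sq_dists[i])[1 : l + 1]
        W[i, neighbors] = np.exp(-sq_dists[i, neighbors] / (2.0 * sigma ** 2))
    # Adjacent if either point is an l-nearest neighbor of the other:
    # symmetrize by taking the elementwise maximum of W and W^T.
    return np.maximum(W, W.T)
```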

  • What do we need - a Graph Laplacian

    - Degree of a vertex $x_i$: $d_i = \sum_{j=1}^n w_{ij}$
    - Degree matrix $D = \operatorname{diag}(d_1, \ldots, d_n)$
    - The unnormalized graph Laplacian is $L = D - W$, i.e. $l_{ii} = \sum_{j \ne i} w_{ij}$ and $l_{ij} = -w_{ij}$ for $i \ne j$ (computed below).
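A short numpy sketch of $L = D - W$:

```python
# Unnormalized graph Laplacian L = D - W, where D is the diagonal degree matrix.
import numpy as np

def unnormalized_laplacian(W):
    degrees = W.sum(axis=1)          # d_i = sum_j w_ij
    D = np.diag(degrees)
    return D - W
```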

  • Properties of a Graph Laplacian

    - $L$ is symmetric and positive semi-definite.
    - The smallest eigenvalue of $L$ is 0, with corresponding eigenvector $\mathbf{1} = [1, 1, \cdots, 1]^T$.
    - $L$ has $n$ non-negative, real-valued eigenvalues $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$.

  • Property Relevant for Clustering

    Let $A \subseteq \{x_1, \ldots, x_n\}$ be a cluster. Define

    $$\mathbf{1}_A = [f_1, f_2, \cdots, f_n]^T \in \mathbb{R}^n$$

    where $f_i = 1$ if $x_i \in A$ and $f_i = 0$ otherwise.

    Proposition: Suppose the graph has connected components $A_1, \ldots, A_K$. Then the null space of $L$ has dimension $K$ and is spanned by $\mathbf{1}_{A_1}, \ldots, \mathbf{1}_{A_K}$.

    If $f \in N(L)$, then $f = \sum_{k=1}^K \alpha_k \mathbf{1}_{A_k}$.

  • Spectral Clustering - Ideal Case

    How can we use this result to devise a clustering algorithm?
    - Ideal case: there are $K$ connected components and $K$ is known.
      1. Compute $L$.
      2. Compute a basis $u_1, \ldots, u_K$ for $N(L)$. They are eigenvectors of $L$ with eigenvalue 0.
      3. Let $y_i = (u_{i1}, \ldots, u_{iK})^T$, where $u_i = \sum_{k=1}^K \alpha_{ik} \mathbf{1}_{A_k}$.
      4. If $x_i, x_j$ are in the same cluster, then $y_i = y_j$.
    - How to cluster?

  • Spectral Clustering - Nonideal Case

    - Nonideal case: there are edges connecting points in different clusters.

      $$L = \begin{bmatrix} \text{large} & \text{small} \\ \text{small} & \text{large} \end{bmatrix} = \begin{bmatrix} \text{large} & 0 \\ 0 & \text{large} \end{bmatrix} + \text{small}$$

  • Changes in Algorithm

    - If we perturb a matrix by another matrix with small entries, then the eigenvalues and eigenvectors of the matrix are perturbed by a correspondingly small amount.
    - We can use the eigenvectors of $L$ with the $K$ smallest eigenvalues as an approximation to the null space of an idealized $L$ based on the true clusters.
    - What if we don't know $K$?

  • Spectral Clustering Algorithm

    1. Construct a similarity graph.

    2. Form the unnormalized graph Laplacian $L \in \mathbb{R}^{n \times n}$.

    3. Determine the $K$ smallest eigenvalues of $L$, $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_K$, and the corresponding eigenvectors $u_1, \ldots, u_K \in \mathbb{R}^n$. If $K$ is not known, find the first large gap in the spectrum.

    4. Define $y_i = (u_{i1}, \ldots, u_{iK})^T$, $i = 1, \ldots, n$.

    5. Cluster $\{y_i\}_{i=1}^n$ using K-means clustering and assign $\{x_i\}_{i=1}^n$ to the corresponding clusters. (This works in $K$-dimensional space instead of $d$-dimensional space; see the end-to-end sketch below.)
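A minimal end-to-end sketch of these five steps, reusing the `knn_similarity_graph` and `unnormalized_laplacian` sketches above and scikit-learn's KMeans for step 5 (scikit-learn is an assumption, not something named in the talk):

```python
# Unnormalized spectral clustering, following the five steps above.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, K, l=10, sigma=1.0):
    W = knn_similarity_graph(X, l=l, sigma=sigma)             # step 1: similarity graph
    L = unnormalized_laplacian(W)                             # step 2: L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)                      # step 3: eigenvalues ascending
    Y = eigvecs[:, :K]                                        # step 4: y_i = i-th row of [u_1 ... u_K]
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(Y)   # step 5: K-means in R^K
    return labels
```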

  • More

    - Normalized spectral clustering uses the normalized graph Laplacian

      $$\tilde{L} = D^{-1}L = I - D^{-1}W$$

    - Spectral clustering may be used to solve relaxations of some graph cut problems.
    - Unnormalized spectral clustering → Ratio Cut:

      $$\text{RatioCut}(A_1, \ldots, A_K) = \frac{1}{2}\sum_{k=1}^K \frac{W(A_k, \bar{A}_k)}{|A_k|}$$

    - Normalized spectral clustering → Normalized Cut:

      $$\text{Ncut}(A_1, \ldots, A_K) = \frac{1}{2}\sum_{k=1}^K \frac{W(A_k, \bar{A}_k)}{\text{vol}(A_k)}$$

  • Reference

    - Von Luxburg, U. A Tutorial on Spectral Clustering. Technical Report, MPI Tuebingen, 2006.
    - C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):955-974, 1998.
    - V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
