MLCC 2018 - Clustering
Lorenzo Rosasco, UNIGE-MIT-IIT



About this class

We will consider an unsupervised setting, and in particular the problem of clustering unlabeled data into “coherent” groups.


Supervised learning

- "Learning with a teacher"
- Data set S = {(x_1, y_1), …, (x_n, y_n)} with x_i ∈ R^d and y_i ∈ R
- X = (x_1, …, x_n)^T ∈ R^{n×d} and y = (y_1, …, y_n)^T


Unsupervised learning

- "Learning without a teacher"
- Data set S = {x_1, …, x_n} with x_i ∈ R^d


Unsupervised learning problems

- Dimensionality reduction
- Clustering
- Density estimation
- Learning association rules
- Learning adaptive data representations
- …


Supervised vs unsupervised methods

- In supervised learning we have a measure of success, based on a loss function and on a model selection procedure, e.g., cross-validation
- In unsupervised learning we don't!
  Hence the many heuristics and the proliferation of algorithms that are difficult to evaluate, for lack of theoretical grounds


Clustering

- Clustering is a widely used technique for data analysis, with applications ranging from statistics and computer science to biology and the social sciences
- Goal: grouping/segmenting a collection of objects into subsets or clusters, possibly also arranging the clusters into a natural hierarchy


Clustering examples


Clustering algorithms

- Combinatorial algorithms: work directly on the data {x_i}_{i=1}^{n} plus some notion of similarity or dissimilarity
- Mixture models: based on assumptions on the underlying probability distribution


Combinatorial clustering

- We assume some knowledge of the number of clusters K ≤ n.
  Goal: associate a cluster label k ∈ {1, …, K} with each datum, by defining an encoder C such that

  k = C(x_i)

- We look for an encoder C* that achieves the goal of clustering the data, according to some specific requirement of the algorithm and based on pairwise dissimilarities


Combinatorial clustering

- Criterion: assign similar/close data to the same cluster
- We may start from the following "loss" or energy function (within-cluster scatter):

  W(C) = (1/2) ∑_{k=1}^{K} ∑_{C(i)=k} ∑_{C(i')=k} d(x_i, x_{i'})

- C* = argmin_C W(C)
- Unfeasible in practice! The number of distinct assignments of n points to K clusters is

  S(n, K) = (1/K!) ∑_{k=1}^{K} (−1)^{K−k} (K choose k) k^n,

  and notice that S(10, 4) ≈ 3.4 × 10^4 while S(19, 4) ≈ 10^{10} (see the computation below).
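To get a sense of these numbers, the formula can be evaluated directly. A minimal Python sketch (the function name is ours):

```python
from math import comb, factorial

def num_assignments(n, K):
    """Number of distinct partitions of n items into K non-empty groups
    (the Stirling number of the second kind)."""
    return sum((-1) ** (K - k) * comb(K, k) * k ** n
               for k in range(1, K + 1)) // factorial(K)

print(num_assignments(10, 4))  # 34105, roughly 3.4e4
print(num_assignments(19, 4))  # 11259666950, roughly 1e10
```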


K-means algorithm

K-means refers specifically to the Euclidean distance:

- initialize the cluster centroids m_k, k = 1, …, K, at random
- repeat until convergence:
  1. assign each datum to the closest centroid: C(x_i) = argmin_{1≤k≤K} ||x_i − m_k||²
  2. update each centroid to the mean of the data assigned to it (a minimal sketch follows below)
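A minimal NumPy sketch of these two alternating steps (function and variable names are ours, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm on data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=K, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # 1. assignment: closest centroid in squared Euclidean distance
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)  # shape (n, K)
        C = d2.argmin(axis=1)
        # 2. update: each centroid moves to the mean of its cluster
        new_m = np.array([X[C == k].mean(axis=0) if (C == k).any() else m[k]
                          for k in range(K)])
        if np.allclose(new_m, m):  # converged: assignments will no longer change
            break
        m = new_m
    return C, m
```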


K-means functional

K-means corresponds to minimizing the functional

  J(C, m) = ∑_{k=1}^{K} ∑_{C(i)=k} ||x_i − m_k||²

The algorithm is an alternating optimization procedure: each of the two steps cannot increase J, which gives convergence guarantees in practice (but no rates).

The function J is not convex, thus K-means is not guaranteed to find a global minimum.

Computational cost (per iteration):

1. data assignment: O(Kn)
2. cluster center updates: O(n)
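Since J is not convex, a common remedy is to restart from several random initializations and keep the run with the smallest J. For example with scikit-learn, whose KMeans does exactly this via n_init (the data X here is a placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)  # value of J at the best of the 10 restarts
```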


K-means

[Figure 14.6 from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2nd Ed.), Chap. 14: successive iterations of the K-means clustering algorithm for the simulated data of Figure 14.4 (initial centroids, initial partition, and the partitions at iterations 2 and 20).]


Example: Vector Quantization

[Figure 14.9 from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2nd Ed.): Sir Ronald A. Fisher (1890-1962) was one of the founders of modern day statistics, to whom we owe maximum-likelihood, sufficiency, and many other fundamental concepts. The image on the left is a 1024×1024 grayscale image at 8 bits per pixel. The center image is the result of 2×2 block VQ, using 200 code vectors, with a compression rate of 1.9 bits/pixel. The right image uses only four code vectors, with a compression rate of 0.50 bits/pixel.]
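The compression above is just K-means applied to 2×2 pixel blocks: cluster the blocks, then replace each block by its centroid (code vector). A sketch under that reading, using scikit-learn's KMeans; img is assumed to be a grayscale array with sides divisible by the block size:

```python
import numpy as np
from sklearn.cluster import KMeans

def block_vq(img, n_codes=200, block=2):
    """Compress a grayscale image by vector-quantizing block x block patches."""
    h, w = img.shape
    # cut the image into non-overlapping patches, one patch per row
    patches = (img.reshape(h // block, block, w // block, block)
                  .transpose(0, 2, 1, 3).reshape(-1, block * block))
    km = KMeans(n_clusters=n_codes, n_init=4, random_state=0).fit(patches)
    coded = km.cluster_centers_[km.labels_]  # replace each patch by its code vector
    return (coded.reshape(h // block, w // block, block, block)
                 .transpose(0, 2, 1, 3).reshape(h, w))
```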


Spectral clustering - similarity graph

- A set of unlabeled data {x_i}_{i=1}^{n} and some notion of similarity s_ij between data pairs
- We may represent them as a similarity graph G = (V, E)
- Clustering can then be seen as a graph partitioning problem


Spectral clustering - graph notation

G = (V, E) undirected graph

- V: the data correspond to the vertices
- E: weighted adjacency matrix W = (w_ij)_{i,j=1}^{n} with w_ij ≥ 0.
  W is symmetric (w_ij = w_ji), as G is undirected.
- Degree of a vertex: d_i = ∑_{j=1}^{n} w_ij
  Degree matrix: D = diag(d_1, …, d_n)
- Sub-graphs: for A, B ⊂ V, W(A, B) = ∑_{i∈A, j∈B} w_ij
  Subgraph size:
  - |A|: number of vertices
  - vol(A) = ∑_{i∈A} d_i


Spectral clustering - how to build the graph

We use the available pairwise similarities s_ij:

- ε-neighbourhood graph: connect vertices whose similarity is larger than ε
- kNN graph: connect vertex v_i to its K nearest neighbours. Not symmetric!
- fully connected graph: s_ij = exp(−d_ij² / (2σ²)), where d is the Euclidean distance and σ > 0 controls the width of a neighborhood


Spectral clustering - how to build the graph

- n can be very large, so it is preferable that W be sparse
- In general it is better to enforce some notion of locality (a sketch of this construction follows below):

  w_ij = s_ij if j is among the K nearest neighbours of i, and 0 otherwise
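A sketch of this construction with Gaussian similarities and a symmetrized kNN sparsification (names are ours):

```python
import numpy as np

def knn_similarity_graph(X, K=10, sigma=1.0):
    """Sparse symmetric similarity graph W from data X of shape (n, d)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared distances
    S = np.exp(-d2 / (2 * sigma ** 2))                       # Gaussian similarities
    np.fill_diagonal(S, 0)                                   # no self-loops
    # keep, for every vertex i, only its K most similar neighbours
    mask = np.zeros_like(S, dtype=bool)
    idx = np.argsort(-S, axis=1)[:, :K]
    mask[np.arange(n)[:, None], idx] = True
    W = np.where(mask, S, 0.0)
    return np.maximum(W, W.T)  # symmetrize: the kNN relation itself is not symmetric
```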


Spectral clustering - graph Laplacians

Unnormalized graph Laplacian: L = D − W

Properties:

- For all f ∈ R^n,

  f^T L f = (1/2) ∑_{i,j=1}^{n} w_ij (f_i − f_j)²

  Indeed:

  f^T L f = f^T D f − f^T W f
          = ∑_i d_i f_i² − ∑_{i,j} f_i f_j w_ij
          = (1/2) [ ∑_i (∑_j w_ij) f_i² − 2 ∑_{i,j} f_i f_j w_ij + ∑_j (∑_i w_ij) f_j² ]
          = (1/2) ∑_{i,j} w_ij (f_i − f_j)²
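A quick numerical check of the identity on a random symmetric W:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 6)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W  # L = D - W
f = rng.normal(size=6)

lhs = f @ L @ f
rhs = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                for i in range(6) for j in range(6))
assert np.isclose(lhs, rhs)
```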


Spectral clustering - graph Laplacians

Unnormalized graph Laplacian: L = D − W

- For each vector f ∈ R^n,

  f^T L f = (1/2) ∑_{i,j=1}^{n} w_ij (f_i − f_j)²

  The graph Laplacian measures the variation of f on the graph (f^T L f is small if connected points have close function values f_i)
- L is symmetric and positive semi-definite
- The smallest eigenvalue of L is 0, and its corresponding eigenvector is the constant vector of ones
- L has n non-negative real-valued eigenvalues 0 = λ_1 ≤ λ_2 ≤ … ≤ λ_n

Laplacian and clustering: the multiplicity of the eigenvalue 0 equals the number of connected components of the graph
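The last property is easy to check numerically; a graph made of two disconnected edges has a null space of dimension two:

```python
import numpy as np

# adjacency of two disconnected components, {0, 1} and {2, 3}
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
print(np.linalg.eigvalsh(L))  # [0. 0. 2. 2.]: eigenvalue 0 with multiplicity 2
```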


Spectral clustering - graph Laplacians

Unnormalized graph Laplacian:

  L = D − W

Normalized graph Laplacians:

  L_n1 = D^{−1/2} L D^{−1/2} = I − D^{−1/2} W D^{−1/2}
  L_n2 = D^{−1} L = I − D^{−1} W
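Both normalizations computed directly from W (assuming all degrees d_i are strictly positive):

```python
import numpy as np

def graph_laplacians(W):
    d = W.sum(axis=1)
    L = np.diag(d) - W                   # unnormalized
    D_isqrt = np.diag(1.0 / np.sqrt(d))
    Ln1 = D_isqrt @ L @ D_isqrt          # symmetric normalization
    Ln2 = np.diag(1.0 / d) @ L           # random-walk normalization
    return L, Ln1, Ln2
```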


A spectral clustering algorithm

- Graph Laplacian:
  - compute the unnormalized graph Laplacian L (unnormalized algorithm), or
  - compute a normalized graph Laplacian L_n1 or L_n2 (normalized algorithm)
- compute the first k eigenvectors of the Laplacian (k = number of clusters to compute)
- let U_k ∈ R^{n×k} be the matrix containing the k eigenvectors as columns
- let y_j ∈ R^k be the vector given by the j-th row of U_k, j = 1, …, n, and apply k-means to {y_j} (see the sketch below)
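Putting the steps together, in the unnormalized variant; a sketch, with scikit-learn's KMeans for the final step:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Unnormalized spectral clustering of a similarity graph W of shape (n, n)."""
    L = np.diag(W.sum(axis=1)) - W
    _, eigvecs = np.linalg.eigh(L)  # eigenvectors sorted by ascending eigenvalue
    U = eigvecs[:, :k]              # first k eigenvectors as columns; rows are the y_j
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```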


A spectral clustering algorithm

Computational cost:

- eigendecomposition: O(n³)
- it may be enough to compute only the first k eigenvalues/eigenvectors, and there are algorithms for this (see below)
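With a sparse W, SciPy's iterative eigensolver can return only the k smallest eigenpairs instead of the full O(n³) decomposition; a toy example on a path graph:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# sparse Laplacian of a path graph on 100 vertices (toy stand-in for a kNN graph)
n = 100
W = sp.diags([np.ones(n - 1), np.ones(n - 1)], offsets=[-1, 1], format='csr')
L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W

eigvals, eigvecs = eigsh(L, k=3, which='SM')  # only the 3 smallest eigenpairs
```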


Example

[Figure 14.29 from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2nd Ed.): toy example illustrating spectral clustering. Data in top left are 450 points falling in three concentric clusters of 150 points each. The points are uniformly distributed in angle, with radius 1, 2.8 and 5 in the three groups, and Gaussian noise with standard deviation 0.25 added to each point. Using a k = 10 nearest-neighbor similarity graph, the eigenvector corresponding to the second and third smallest …]


The number of clusters

The eigengap heuristic: choose the number of clusters k so that the eigenvalues λ_1, …, λ_k are small while the gap to λ_{k+1} is comparatively large.

[Figure from the von Luxburg tutorial]
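A sketch of the heuristic: among the leading eigenvalues, pick the k at which the gap λ_{k+1} − λ_k is largest:

```python
import numpy as np

def eigengap_k(L, k_max=10):
    """Suggest a number of clusters from the spectrum of a graph Laplacian L."""
    eigvals = np.linalg.eigvalsh(L)[:k_max + 1]  # smallest eigenvalues, ascending
    gaps = np.diff(eigvals)                      # gaps[k-1] = lambda_{k+1} - lambda_k
    return int(np.argmax(gaps)) + 1
```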


Semi-supervised learning

Laplacian-based regularization algorithms (Belkin et al. '04)

Set of labeled examples: {(x_i, y_i)}_{i=1}^{n}
Set of unlabeled examples: {x_j}_{j=n+1}^{n+u}

  f* = argmin_{f ∈ H} (1/n) ∑_{i=1}^{n} ℓ(f(x_i), y_i) + λ_A ||f||² + (λ_I / u²) f^T L f

where L is the graph Laplacian built on all n + u points and, in the last term, f = (f(x_1), …, f(x_{n+u}))^T.
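With the square loss and a kernel expansion f = ∑_i α_i K(x_i, ·) over all n + u points, the minimizer has a closed form (Laplacian RLS, following Belkin et al.). A sketch under those assumptions, with a Gaussian kernel; all names and the regularization constants are ours:

```python
import numpy as np

def laplacian_rls(X, y_labeled, n, W, lam_A=1e-2, lam_I=1e-2, gamma=1.0):
    """Laplacian-regularized least squares.
    X: all n + u points stacked, labeled first; W: similarity graph over X."""
    N = len(X)
    u = N - n
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-gamma * d2)                      # Gaussian kernel matrix
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian over all points
    J = np.diag(np.r_[np.ones(n), np.zeros(u)])  # selects the labeled points
    y = np.r_[np.asarray(y_labeled, float), np.zeros(u)]
    # stationarity condition of the objective above gives a linear system in alpha
    A = J @ K + lam_A * n * np.eye(N) + (lam_I * n / u ** 2) * (L @ K)
    alpha = np.linalg.solve(A, y)
    return K @ alpha                             # f evaluated at all n + u points
```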


Wrapping up

In this class we introduced the concept of data clustering and sketched some of the best-known algorithms.

Reference: Ulrike von Luxburg, "A Tutorial on Spectral Clustering", Statistics and Computing, 17(4):395-416, 2007.