ECE 5984: Introduction to Machine Learning
Dhruv Batra, Virginia Tech
Topics:
– Unsupervised Learning: K-means, GMM, EM
Readings: Barber 20.1-20.3
Midsem Presentations Graded
• Mean: 8/10 = 80%
– Min: 3
– Max: 10
Tasks
• Supervised Learning:
– Classification: x → y (discrete)
– Regression: x → y (continuous)
• Unsupervised Learning:
– Clustering: x → c (discrete cluster ID)
– Dimensionality Reduction: x → z (continuous)
Unsupervised Learning
• Learning only with X
– Y not present in training data
• Some example unsupervised learning problems:
– Clustering / Factor Analysis
– Dimensionality Reduction / Embeddings
– Density Estimation with Mixture Models
New Topic: Clustering
Slide Credit: Carlos Guestrin

Synonyms
• Clustering
• Vector Quantization
• Latent Variable Models / Hidden Variable Models / Mixture Models
• Algorithms:
– K-means
– Expectation Maximization (EM)
Some Data
Slide Credit: Carlos Guestrin
K-means
1. Ask user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to (thus each center "owns" a set of datapoints).
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. ...Repeat until terminated!
Slide Credit: Carlos Guestrin
K-means
• Randomly initialize k centers:
$\mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}$
• Assign: assign each point $i \in \{1, \ldots, n\}$ to the nearest center:
$C(i) \leftarrow \arg\min_j \|x_i - \mu_j\|^2$
• Recenter: $\mu_j$ becomes the centroid of its points
Slide Credit: Carlos Guestrin
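A minimal NumPy sketch of these two alternating steps (illustrative only; the function name, initialization by sampling datapoints, and stopping test are choices of this sketch, not the course's reference code):

```python
# A minimal K-means (Lloyd's algorithm) sketch.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly initialize k centers by picking k distinct datapoints.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign: each point i gets C(i) = argmin_j ||x_i - mu_j||^2.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        C = dists.argmin(axis=1)
        # Recenter: mu_j becomes the centroid of the points it owns
        # (empty clusters keep their old center).
        new_centers = np.array([
            X[C == j].mean(axis=0) if np.any(C == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return centers, C
```

Each iteration can only decrease the objective, which is why the coordinate-descent view below guarantees convergence to a local optimum.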
K-means
• Demo:
– http://www.kovan.ceng.metu.edu.tr/~maya/kmeans/
– http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
What is K-means optimizing?
• Objective $F(\mu, C)$: function of centers $\mu$ and point allocations $C$:
$F(\mu, C) = \sum_{i=1}^{N} \|x_i - \mu_{C(i)}\|^2$
• Equivalently, with assignments $a$ in a 1-of-k encoding:
$F(\mu, a) = \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2$
• Optimal K-means:
$\min_{\mu} \min_{a} F(\mu, a)$
Coordinate descent algorithms
• Want: $\min_a \min_b F(a, b)$
• Coordinate descent:
– fix a, minimize over b
– fix b, minimize over a
– repeat
• Converges!!!
– if F is bounded
– to a (often good) local optimum
– as we saw in the applet (play with it!)
Slide Credit: Carlos Guestrin
K-means as Coordinate Descent
• K-means is a coordinate descent algorithm!
• Optimize the objective function:
$\min_{\mu_1, \ldots, \mu_k} \min_{a_1, \ldots, a_N} F(\mu, a) = \min_{\mu_1, \ldots, \mu_k} \min_{a_1, \ldots, a_N} \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2$
• Step 1: Fix $\mu$, optimize $a$ (or $C$): the Assign step.
• Step 2: Fix $a$ (or $C$), optimize $\mu$: the Recenter step.
Slide Credit: Carlos Guestrin
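Written out, one K-means iteration alternates the two partial minimizations of this objective (standard coordinate-descent notation, matching the Assign and Recenter steps above):

$$a^{(t+1)} = \arg\min_{a} F\big(\mu^{(t)}, a\big), \qquad \mu^{(t+1)} = \arg\min_{\mu} F\big(\mu, a^{(t+1)}\big)$$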
One important use of K-means
• Bag-of-words models in computer vision
Bag of Words model
• Represent a document by its vector of word counts, e.g.:
aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, ..., Zaire 0
Slide Credit: Carlos Guestrin
Object → Bag of 'words'
Slide Credit: Fei-Fei Li
Interest Point Features
1. Detect patches [Mikolajczyk and Schmid '02], [Matas et al. '02], [Sivic et al. '03]
2. Normalize patch
3. Compute SIFT descriptor [Lowe '99]
Slide credit: Josef Sivic
Patch Features
Slide credit: Josef Sivic
Dictionary formation
• Cluster the patch descriptors (usually with k-means)
• Vector quantization: each descriptor maps to its nearest cluster center (its "visual word")
Slide credit: Josef Sivic
Clustered Image Patches
Fei-Fei et al. 2005
Visual synonyms and polysemy
• Visual Polysemy: a single visual word occurring on different (but locally similar) parts of different object categories.
• Visual Synonyms: two different visual words representing a similar part of an object (e.g., the wheel of a motorbike).
Slide Credit: Andrew Zisserman
Image representation
[Figure: histogram of codeword frequencies (frequency vs. codewords)]
Slide Credit: Fei-Fei Li
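A small sketch of the last two stages above, vector quantization and the histogram representation. It assumes one image's SIFT descriptors and a K-means codebook are already computed; the function name and array shapes are illustrative assumptions:

```python
# Hypothetical sketch: turn one image's patch descriptors into a
# bag-of-visual-words histogram, given a K-means codebook.
import numpy as np

def bow_histogram(descriptors, codebook):
    """descriptors: (n, 128) SIFT descriptors; codebook: (k, 128) centers."""
    # Vector quantization: map each descriptor to its nearest codeword.
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    # Image representation: normalized histogram of codeword frequencies.
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```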
(One) bad case for k-means
• Clusters may overlap
• Some clusters may be "wider" than others
• GMM to the rescue!
Slide Credit: Carlos Guestrin
GMM
[Figure: 2D data to be modeled with a Gaussian mixture]
Figure Credit: Kevin Murphy
Recall Multi-variate Gaussians
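For reference, the density being recalled here is the standard multivariate Gaussian in $d$ dimensions (textbook definition, not transcribed from the slide):

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\Big)$$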
GMM
[Figure: Gaussian mixture fit shown as contours over 2D data]
Figure Credit: Kevin Murphy
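The mixture density behind these plots has the standard form (the mixing-weight symbol $\pi_j$ is an assumption of this note; the slides write the prior as $P(y = j)$):

$$p(x) = \sum_{j=1}^{k} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j), \qquad \pi_j = P(y = j), \quad \sum_{j=1}^{k} \pi_j = 1$$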
Hidden Data Causes Problems #1
• Fully observed (log) likelihood factorizes
• Marginal (log) likelihood doesn't factorize
• All parameters coupled!
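Concretely, a standard way to see the contrast (filling in the algebra the slide leaves implicit):

$$\log \prod_{i=1}^{N} p(x_i, z_i \mid \theta) = \sum_{i=1}^{N} \Big[\log p(z_i \mid \theta) + \log p(x_i \mid z_i, \theta)\Big]$$

which splits into separate per-parameter terms, versus

$$\log \prod_{i=1}^{N} p(x_i \mid \theta) = \sum_{i=1}^{N} \log \sum_{z} p(x_i, z \mid \theta),$$

where the log of a sum does not split, so all parameters stay coupled.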
GMM vs Gaussian Joint Bayes Classifier
• On board:
– Observed Y vs. unobserved Z
– Likelihood vs. marginal likelihood
Hidden Data Causes Problems #2: Identifiability
• Mixture labels can be swapped without changing the likelihood, so the parameters are not uniquely identifiable.
[Figure: data from a two-component mixture, and the likelihood surface over ($\mu_1$, $\mu_2$) with symmetric modes]
Figure Credit: Kevin Murphy
Hidden Data Causes Problems #3
• Likelihood has singularities if one Gaussian "collapses" onto a single datapoint (its variance shrinks toward zero and the likelihood blows up).
[Figure: density $p(x)$ with one collapsing component]
Special case: spherical Gaussians and hard assignments
• If $P(X \mid Z = k)$ is spherical, with the same $\sigma$ for all classes:
$P(x_i \mid z = j) \propto \exp\!\Big(-\frac{1}{2\sigma^2} \|x_i - \mu_j\|^2\Big)$
• If each $x_i$ belongs to one class $C(i)$ (hard assignment), the marginal likelihood becomes:
$\prod_{i=1}^{N} \sum_{j=1}^{k} P(x_i, y = j) \propto \prod_{i=1}^{N} \exp\!\Big(-\frac{1}{2\sigma^2} \|x_i - \mu_{C(i)}\|^2\Big)$
• Maximizing it minimizes $\sum_{i=1}^{N} \|x_i - \mu_{C(i)}\|^2$, so the M(M)LE is the same as K-means!!!
Slide Credit: Carlos Guestrin
The K-means GMM assumption
• There are k components
• Component i has an associated mean vector $\mu_i$
• Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$
• Each data point is generated according to the following recipe:
1. Pick a component at random: choose component i with probability $P(y = i)$
2. Datapoint $\sim N(\mu_i, \sigma^2 I)$
Slide Credit: Carlos Guestrin
The General GMM assumption
• There are k components
• Component i has an associated mean vector $\mu_i$
• Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$
• Each data point is generated according to the following recipe:
1. Pick a component at random: choose component i with probability $P(y = i)$
2. Datapoint $\sim N(\mu_i, \Sigma_i)$
Slide Credit: Carlos Guestrin
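A hypothetical NumPy sketch of this generative recipe (the function name and argument conventions are assumptions, not course code):

```python
# Sample n points from a general GMM: pick a component, then draw
# from that component's Gaussian.
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """weights: (k,) with P(y=i); means: list of (d,); covs: list of (d,d)."""
    rng = np.random.default_rng(seed)
    # 1. Pick a component at random with probability P(y = i).
    ys = rng.choice(len(weights), size=n, p=weights)
    # 2. Datapoint ~ N(mu_i, Sigma_i).
    X = np.stack([rng.multivariate_normal(means[i], covs[i]) for i in ys])
    return X, ys
```

The K-means GMM assumption above is the special case where every $\Sigma_i = \sigma^2 I$.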
K-means vs GMM
• K-Means demo: http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
• GMM demo: http://www.socr.ucla.edu/applets.dir/mixtureem.html