Network Intelligence and Analysis Lab
Clustering methods via EM algorithm
2014.07.10
Sanghyuk Chun
• Machine Learning
  • Training data
  • Learning model
• Unsupervised Learning
  • Training data without labels
  • Input data: D = \{x_1, x_2, \dots, x_N\}
  • Most unsupervised learning problems try to find hidden structure in unlabeled data
  • Examples: Clustering, Dimensionality Reduction (PCA, LDA), …
Machine Learning and Unsupervised Learning
• Clustering
  • Grouping objects in such a way that objects in the same group are more similar to each other than to objects in other groups
  • Input: a set of objects (or data) without group information
  • Output: a cluster index for each object
  • Usage: Customer Segmentation, Image Segmentation, …
Unsupervised Learning and Clustering
[Figure: input data → Clustering Algorithm → clustered output]
K-means Clustering
• Introduction
• Optimization
• Intuition: data points in the same cluster are closer to each other than to data points in other clusters
• Goal: minimize the distance between data points in the same cluster
• Objective function:
  J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2
• N is the number of data points and K is the number of clusters
• r_{nk} \in \{0, 1\} is an indicator variable describing which of the K clusters the data point x_n is assigned to
• \mu_k is a prototype associated with the k-th cluster
  • Eventually \mu_k equals the center (mean) of cluster k
K-means Clustering
• Objective function:

  \operatorname{argmin}_{\{r_{nk}, \mu_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2

• This problem can be solved through an iterative procedure
  • Step 1: minimize J with respect to r_{nk}, keeping \mu_k fixed
  • Step 2: minimize J with respect to \mu_k, keeping r_{nk} fixed
  • Repeat Steps 1 and 2 until convergence
• Does it always converge?
K-means Clustering – Optimization
• Biconvex optimization is a generalization of convex optimization in which the objective function and the constraint set may be biconvex
• f(x, y) is biconvex if, for fixed x, f_x(y) = f(x, y) is convex over Y and, for fixed y, f_y(x) = f(x, y) is convex over X
• One way to solve a biconvex optimization problem is to iteratively solve the corresponding convex subproblems
  • This does not guarantee the global optimum
  • But it always converges to some local optimum
Optional – Biconvex optimization
• Objective function:

  \operatorname{argmin}_{\{r_{nk}, \mu_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2

• Step 1: minimize J with respect to r_{nk}, keeping \mu_k fixed

  r_{nk} = \begin{cases} 1 & \text{if } k = \operatorname{argmin}_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}

• Step 2: minimize J with respect to \mu_k, keeping r_{nk} fixed
  • Setting the derivative with respect to \mu_k to zero gives

    2 \sum_n r_{nk} (x_n - \mu_k) = 0, \quad \text{so} \quad \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

  • \mu_k is equal to the mean of all the data points assigned to cluster k
K-means Clustering – Optimization
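The two alternating steps above can be sketched in a few lines of NumPy. This is a minimal illustration (random initialization from data points and a simple convergence test are simplifying assumptions), not the library `kmeans` routine the next slide mentions:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means: alternate between updating assignments r_nk and means mu_k."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest prototype (this sets r_nk)
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # N x K squared distances
        r = d.argmin(axis=1)
        # Step 2: move each prototype to the mean of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # no prototype moved: converged
            break
        mu = new_mu
    return r, mu
```

Each iteration can only decrease J, which is why the procedure converges (to a local optimum, as the biconvexity slide notes).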
• Advantages of K-means clustering
  • Easy to implement (kmeans in Matlab, kcluster in Python)
  • Works well in practice
• Disadvantages of K-means clustering
  • It can converge to a local optimum
  • Computing the Euclidean distance to every point is expensive
    • Solution: Batch K-means
  • Euclidean distance is not robust to outliers
    • Solution: K-medoids algorithms (use a different metric)
K-means Clustering – Conclusion
Mixture of Gaussians
• Mixture Model
• EM Algorithm
• EM for Gaussian Mixtures
• Assumption: there are k components: \{c_i\}_{i=1}^{k}
• Component c_i has an associated mean vector \mu_i
• Each component generates data from a Gaussian with mean \mu_i and covariance matrix \Sigma_i
Mixture of Gaussians
[Figure: five Gaussian components with means \mu_1, \dots, \mu_5]
• Represent the model as a linear combination of Gaussians
• Probability density function of a GMM:

  p(x) = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \Sigma_k)

  N(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right\}

• This is called a mixture of Gaussians or Gaussian Mixture Model (GMM)
• Each Gaussian density is called a component of the mixture and has its own mean \mu_k and covariance \Sigma_k
• The parameters \pi_k are called mixing coefficients (\sum_k \pi_k = 1)
Gaussian Mixture Model
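The density above translates directly into code. The sketch below evaluates p(x) for a mixture (the two-component parameters at the bottom are illustrative assumptions, not values from the slides):

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate normal density N(x | mu, cov)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def gmm_pdf(x, pis, mus, covs):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_pdf(x, mu, cov)
               for pi, mu, cov in zip(pis, mus, covs))

# Hypothetical two-component mixture in 2-D
pis = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2 * np.eye(2)]
density_at_origin = gmm_pdf(np.zeros(2), pis, mus, covs)
```

Note that p(x) is a valid density because the mixing coefficients sum to one and each component integrates to one.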
• p(x) = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \Sigma_k), where \sum_k \pi_k = 1
• Input:
  • The training set: \{x_i\}_{i=1}^{N}
  • The number of clusters: k
• Goal: model this data using a mixture of Gaussians
  • Mixing coefficients \pi_1, \pi_2, \dots, \pi_k
  • Means and covariances: \mu_1, \mu_2, \dots, \mu_k; \Sigma_1, \Sigma_2, \dots, \Sigma_k
Clustering using Mixture Model
• p(x \mid G) = p(x \mid \pi_1, \mu_1, \dots) = \sum_i p(x \mid c_i)\, p(c_i) = \sum_i \pi_i N(x \mid \mu_i, \Sigma_i)
• p(x_1, x_2, \dots, x_N \mid G) = \prod_i p(x_i \mid G)
• The log-likelihood function is given by

  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k)

• Goal: find the parameters which maximize the log-likelihood
• Problem: the maximum likelihood solution is hard to compute directly (the sum inside the logarithm prevents a closed form)
• Solution: use the EM algorithm
Maximum Likelihood of GMM
• The EM algorithm is an iterative procedure for finding the MLE
  • An expectation (E) step creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters
  • A maximization (M) step computes parameters maximizing the expected log-likelihood found in the E step
  • These parameter estimates are then used to determine the distribution of the latent variables in the next E step
• EM always converges to a local optimum
EM (Expectation Maximization) Algorithm
• Objective function:

  \operatorname{argmin}_{\{r_{nk}, \mu_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2

• E-Step: minimize J with respect to r_{nk}, keeping \mu_k fixed

  r_{nk} = \begin{cases} 1 & \text{if } k = \operatorname{argmin}_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}

• M-Step: minimize J with respect to \mu_k, keeping r_{nk} fixed

  \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}
K-means revisit: EM and K-means
• Let z_k be a Bernoulli random variable with probability \pi_k
  • p(z_k = 1) = \pi_k, where \sum_k z_k = 1 and \sum_k \pi_k = 1
• Because z uses a 1-of-K representation, its distribution takes the form

  p(z) = \prod_{k=1}^{K} \pi_k^{z_k}

• Similarly, the conditional distribution of x given a particular value of z is a Gaussian:

  p(x \mid z) = \prod_{k=1}^{K} N(x \mid \mu_k, \Sigma_k)^{z_k}
Latent variable for GMM
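The latent variable encodes a generative story that can be sketched by ancestral sampling: first draw z from its 1-of-K distribution, then draw x from the Gaussian that z selects. The mixture parameters below are illustrative assumptions:

```python
import numpy as np

def sample_gmm(n, pis, mus, covs, seed=0):
    """Draw n samples: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    zs = rng.choice(len(pis), size=n, p=pis)   # latent component indices (1-of-K)
    xs = np.array([rng.multivariate_normal(mus[z], covs[z]) for z in zs])
    return xs, zs

# Illustrative two-component mixture
pis = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2)]
X, Z = sample_gmm(1000, pis, mus, covs)
```

Marginalizing out z (ignoring the sampled Z and keeping only X) recovers exactly the mixture density p(x) = \sum_k \pi_k N(x \mid \mu_k, \Sigma_k) from the previous slides.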
• The joint distribution is given by p(x, z) = p(z)\, p(x \mid z)
• p(x) = \sum_z p(z)\, p(x \mid z) = \sum_k \pi_k N(x \mid \mu_k, \Sigma_k)
• Thus the marginal distribution of x is a Gaussian mixture of the above form
• Now we are able to work with the joint distribution instead of the marginal distribution
• Graphical representation of a GMM for a set of N i.i.d. data points \{x_n\} with corresponding latent variables \{z_n\}, where n = 1, \dots, N
Latent variable for GMM
[Figure: graphical model with a plate over n = 1, \dots, N containing \pi → z_n → x_n, with parameters \mu, \Sigma pointing to x_n]
• Conditional probability of z given x
• From Bayes' theorem,

  \gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k N(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x \mid \mu_j, \Sigma_j)}

• \gamma(z_k) can also be viewed as the responsibility that component k takes for 'explaining' the observation x
EM for Gaussian Mixtures (E-step)
• Likelihood function for GMM:

  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k)

• Setting the derivative of the log-likelihood with respect to the means \mu_k of the Gaussian components to zero, we obtain

  \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \quad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk})
EM for Gaussian Mixtures (M-step)
• Setting the derivative of the log-likelihood with respect to \Sigma_k to zero, we obtain

  \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^\top

• Maximizing the likelihood with respect to the mixing coefficients \pi_k using a Lagrange multiplier on the constraint \sum_k \pi_k = 1,

  \ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)

  we obtain \pi_k = \frac{N_k}{N}
EM for Gaussian Mixtures (M-step)
• \mu_k, \Sigma_k, \pi_k do not constitute a closed-form solution for the parameters of the mixture model, because the responsibilities \gamma(z_{nk}) depend on those parameters in a complex way

  \gamma(z_{nk}) = \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n \mid \mu_j, \Sigma_j)}

• In the EM algorithm for GMM, \gamma(z_{nk}) and the parameters are iteratively optimized
  • In the E step, the responsibilities (posterior probabilities) are evaluated using the current values of the parameters
  • In the M step, the means, covariances, and mixing coefficients are re-estimated using the responsibilities from the E step
EM for Gaussian Mixtures
• Initialize the means \mu_k, covariances \Sigma_k, and mixing coefficients \pi_k, and evaluate the initial value of the log-likelihood
• E step: evaluate the responsibilities using the current parameters

  \gamma(z_{nk}) = \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n \mid \mu_j, \Sigma_j)}

• M step: re-estimate the parameters using the current responsibilities

  \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n

  \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^\top

  \pi_k^{new} = \frac{N_k}{N}, \quad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk})

• Repeat the E step and M step until convergence
EM for Gaussian Mixtures
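The full E/M loop above fits in a short NumPy sketch. The initialization (random data points as means, identity covariances, uniform mixing weights) and the fixed iteration count are simplifying assumptions; a production implementation would also monitor the log-likelihood for convergence:

```python
import numpy as np

def gaussian(X, mu, cov):
    """N(x_n | mu, cov) evaluated for every row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff))

def em_gmm(X, K, n_iter=100, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialization (simplifying assumption: data-point means, identity covariances)
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    covs = np.array([np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk), an N x K matrix with rows summing to 1
        dens = np.stack([pis[k] * gaussian(X, mus[k], covs[k]) for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances, and mixing coefficients
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N
    return pis, mus, covs, gamma
```

Each iteration is guaranteed not to decrease the log-likelihood, which is the convergence property the EM slide states.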
• We can derive the K-means algorithm as a particular limit of EM for the Gaussian Mixture Model
• Consider a Gaussian mixture model whose covariance matrices are given by \varepsilon I, where \varepsilon is a variance parameter and I is the identity matrix
• If we consider the limit \varepsilon \to 0, the expected complete-data log-likelihood becomes

  E_z[\ln p(X, Z \mid \mu, \Sigma, \pi)] \to -\frac{1}{2} \sum_n \sum_k r_{nk} \|x_n - \mu_k\|^2 + C

• Thus we see that in this limit, maximizing the expected complete-data log-likelihood is equivalent to the K-means algorithm
Relationship between K-means algorithm and GMM
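This limit can be checked numerically: as \varepsilon shrinks, the responsibilities of a GMM with covariances \varepsilon I and equal mixing weights approach the hard 0/1 assignments r_{nk} used by K-means. The means and data points below are illustrative assumptions:

```python
import numpy as np

def responsibilities(X, mus, eps):
    """gamma(z_nk) for a GMM with equal mixing weights and covariances eps * I."""
    # With Sigma_k = eps*I and pi_k = 1/K, only the exponent depends on k,
    # so gamma reduces to a softmax over -||x_n - mu_k||^2 / (2*eps)
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)  # N x K squared distances
    logits = -d2 / (2 * eps)
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.2, 0.0], [2.9, 3.1]])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])

soft = responsibilities(X, mus, 1.0)    # noticeably soft assignments
hard = responsibilities(X, mus, 0.01)   # nearly one-hot, as in K-means
```

As eps decreases, each row of the responsibility matrix concentrates all of its mass on the nearest mean, recovering the E-step of K-means.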