Network Intelligence and Analysis Lab
Clustering methods via EM algorithm
2014.07.10
Sanghyuk Chun
• Machine Learning
  • Training data
  • Learning model
• Unsupervised Learning
  • Training data without labels
  • Input data: D = \{x_1, x_2, \dots, x_N\}
  • Most unsupervised learning problems try to find hidden structure in unlabeled data
  • Examples: Clustering, Dimensionality Reduction (PCA, LDA), …
Machine Learning and Unsupervised Learning
• Clustering
  • Grouping objects in such a way that objects in the same group are more similar to each other than to objects in other groups
  • Input: a set of objects (or data) without group information
  • Output: a cluster index for each object
  • Usage: Customer Segmentation, Image Segmentation, …
Unsupervised Learning and Clustering
[Figure: input data → Clustering Algorithm → clustered output]
K-means Clustering
• Introduction
• Optimization
• Intuition: data points in the same cluster are closer to each other than to data points in other clusters
• Goal: minimize the distance between data points in the same cluster
• Objective function:
  J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2
• N is the number of data points and K is the number of clusters
• r_{nk} \in \{0, 1\} is an indicator variable describing which of the K clusters the data point x_n is assigned to
• \mu_k is a prototype associated with the k-th cluster
  • Eventually \mu_k equals the center (mean) of cluster k
K-means Clustering
• Objective function:

  \operatorname{argmin}_{\{r_{nk}, \mu_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2

• This problem can be solved through an iterative procedure
  • Step 1: minimize J with respect to r_{nk}, keeping \mu_k fixed
  • Step 2: minimize J with respect to \mu_k, keeping r_{nk} fixed
  • Repeat Steps 1 and 2 until convergence
• Does it always converge?
K-means Clustering – Optimization
• Biconvex optimization is a generalization of convex optimization in which the objective function and the constraint set may be biconvex
• f(x, y) is biconvex if, for fixed x, f_x(y) = f(x, y) is convex over Y and, for fixed y, f_y(x) = f(x, y) is convex over X
• One way to solve a biconvex optimization problem is to iteratively solve the corresponding convex subproblems
  • This does not guarantee the global optimum
  • But it always converges to some local optimum
Optional – Biconvex optimization
• Objective function:

  \operatorname{argmin}_{\{r_{nk}, \mu_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2

• Step 1: minimize J with respect to r_{nk}, keeping \mu_k fixed

  r_{nk} = \begin{cases} 1 & \text{if } k = \operatorname{argmin}_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}

• Step 2: minimize J with respect to \mu_k, keeping r_{nk} fixed
  • Setting the derivative with respect to \mu_k to zero gives

    2 \sum_n r_{nk} (x_n - \mu_k) = 0, \quad \text{so} \quad \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

  • \mu_k is equal to the mean of all the data points assigned to cluster k
K-means Clustering – Optimization
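The two alternating steps above can be sketched in a few lines of NumPy. This is a minimal illustration (random initialization from data points and a simple convergence test are simplifying assumptions), not the library `kmeans` routine the next slide mentions:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means: alternate between updating assignments r_nk and means mu_k."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest prototype (this sets r_nk)
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # N x K squared distances
        r = d.argmin(axis=1)
        # Step 2: move each prototype to the mean of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # no prototype moved: converged
            break
        mu = new_mu
    return r, mu
```

Each iteration can only decrease J, which is why the procedure converges (to a local optimum, as the biconvexity slide notes).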
• Advantages of K-means clustering
  • Easy to implement (kmeans in Matlab, kcluster in Python)
  • Works well in practice
• Disadvantages of K-means clustering
  • It can converge to a local optimum
  • Computing the Euclidean distance to every point is expensive
    • Solution: Batch K-means
  • Euclidean distance is not robust to outliers
    • Solution: K-medoids algorithms (use a different metric)
K-means Clustering – Conclusion
Mixture of Gaussians
• Mixture Model
• EM Algorithm
• EM for Gaussian Mixtures
• Assumption: there are k components: \{c_i\}_{i=1}^{k}
• Component c_i has an associated mean vector \mu_i
• Each component generates data from a Gaussian with mean \mu_i and covariance matrix \Sigma_i
Mixture of Gaussians
[Figure: five Gaussian components with means \mu_1, \dots, \mu_5]
• Represent the model as a linear combination of Gaussians
• Probability density function of a GMM:

  p(x) = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \Sigma_k)

  N(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right\}

• This is called a mixture of Gaussians or Gaussian Mixture Model (GMM)
• Each Gaussian density is called a component of the mixture and has its own mean \mu_k and covariance \Sigma_k
• The parameters \pi_k are called mixing coefficients (\sum_k \pi_k = 1)
Gaussian Mixture Model
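The density above translates directly into code. The sketch below evaluates p(x) for a mixture (the two-component parameters at the bottom are illustrative assumptions, not values from the slides):

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate normal density N(x | mu, cov)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def gmm_pdf(x, pis, mus, covs):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_pdf(x, mu, cov)
               for pi, mu, cov in zip(pis, mus, covs))

# Hypothetical two-component mixture in 2-D
pis = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2 * np.eye(2)]
density_at_origin = gmm_pdf(np.zeros(2), pis, mus, covs)
```

Note that p(x) is a valid density because the mixing coefficients sum to one and each component integrates to one.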
• p(x) = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \Sigma_k), where \sum_k \pi_k = 1
• Input:
  • The training set: \{x_i\}_{i=1}^{N}
  • The number of clusters: k
• Goal: model this data using a mixture of Gaussians
  • Mixing coefficients \pi_1, \pi_2, \dots, \pi_k
  • Means and covariances: \mu_1, \mu_2, \dots, \mu_k; \Sigma_1, \Sigma_2, \dots, \Sigma_k
Clustering using Mixture Model
• p(x \mid G) = p(x \mid \pi_1, \mu_1, \dots) = \sum_i p(x \mid c_i)\, p(c_i) = \sum_i \pi_i N(x \mid \mu_i, \Sigma_i)
• p(x_1, x_2, \dots, x_N \mid G) = \prod_i p(x_i \mid G)
• The log-likelihood function is given by

  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k)

• Goal: find the parameters which maximize the log-likelihood
• Problem: the maximum likelihood solution is hard to compute directly (the sum inside the logarithm prevents a closed form)
• Solution: use the EM algorithm
Maximum Likelihood of GMM
• The EM algorithm is an iterative procedure for finding the MLE
  • An expectation (E) step creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters
  • A maximization (M) step computes parameters maximizing the expected log-likelihood found in the E step
  • These parameter estimates are then used to determine the distribution of the latent variables in the next E step
• EM always converges to a local optimum
EM (Expectation Maximization) Algorithm
• Objective function:

  \operatorname{argmin}_{\{r_{nk}, \mu_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2

• E-Step: minimize J with respect to r_{nk}, keeping \mu_k fixed

  r_{nk} = \begin{cases} 1 & \text{if } k = \operatorname{argmin}_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}

• M-Step: minimize J with respect to \mu_k, keeping r_{nk} fixed

  \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}
K-means revisit: EM and K-means
• Let z_k be a Bernoulli random variable with probability \pi_k
  • p(z_k = 1) = \pi_k, where \sum_k z_k = 1 and \sum_k \pi_k = 1
• Because z uses a 1-of-K representation, its distribution takes the form

  p(z) = \prod_{k=1}^{K} \pi_k^{z_k}

• Similarly, the conditional distribution of x given a particular value of z is a Gaussian:

  p(x \mid z) = \prod_{k=1}^{K} N(x \mid \mu_k, \Sigma_k)^{z_k}
Latent variable for GMM
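The latent variable encodes a generative story that can be sketched by ancestral sampling: first draw z from its 1-of-K distribution, then draw x from the Gaussian that z selects. The mixture parameters below are illustrative assumptions:

```python
import numpy as np

def sample_gmm(n, pis, mus, covs, seed=0):
    """Draw n samples: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    zs = rng.choice(len(pis), size=n, p=pis)   # latent component indices (1-of-K)
    xs = np.array([rng.multivariate_normal(mus[z], covs[z]) for z in zs])
    return xs, zs

# Illustrative two-component mixture
pis = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2)]
X, Z = sample_gmm(1000, pis, mus, covs)
```

Marginalizing out z (ignoring the sampled Z and keeping only X) recovers exactly the mixture density p(x) = \sum_k \pi_k N(x \mid \mu_k, \Sigma_k) from the previous slides.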
• The joint distribution is given by p(x, z) = p(z)\, p(x \mid z)
• p(x) = \sum_z p(z)\, p(x \mid z) = \sum_k \pi_k N(x \mid \mu_k, \Sigma_k)
• Thus the marginal distribution of x is a Gaussian mixture of the above form
• Now we are able to work with the joint distribution instead of the marginal distribution
• Graphical representation of a GMM for a set of N i.i.d. data points \{x_n\} with corresponding latent variables \{z_n\}, where n = 1, \dots, N
Latent variable for GMM
[Figure: graphical model with a plate over n = 1, \dots, N containing \pi → z_n → x_n, with parameters \mu, \Sigma pointing to x_n]
• Conditional probability of z given x
• From Bayes' theorem,

  \gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k N(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x \mid \mu_j, \Sigma_j)}

• \gamma(z_k) can also be viewed as the responsibility that component k takes for 'explaining' the observation x
EM for Gaussian Mixtures (E-step)
• Likelihood function for GMM:

  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k)

• Setting the derivative of the log-likelihood with respect to the means \mu_k of the Gaussian components to zero, we obtain

  \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \quad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk})
EM for Gaussian Mixtures (M-step)
• Setting the derivative of the log-likelihood with respect to \Sigma_k to zero, we obtain

  \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^\top

• Maximizing the likelihood with respect to the mixing coefficients \pi_k using a Lagrange multiplier on the constraint \sum_k \pi_k = 1,

  \ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)

  we obtain \pi_k = \frac{N_k}{N}
EM for Gaussian Mixtures (M-step)
• \mu_k, \Sigma_k, \pi_k do not constitute a closed-form solution for the parameters of the mixture model, because the responsibilities \gamma(z_{nk}) depend on those parameters in a complex way

  \gamma(z_{nk}) = \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n \mid \mu_j, \Sigma_j)}

• In the EM algorithm for GMM, \gamma(z_{nk}) and the parameters are iteratively optimized
  • In the E step, the responsibilities (posterior probabilities) are evaluated using the current values of the parameters
  • In the M step, the means, covariances, and mixing coefficients are re-estimated using the responsibilities from the E step
EM for Gaussian Mixtures
• Initialize the means \mu_k, covariances \Sigma_k, and mixing coefficients \pi_k, and evaluate the initial value of the log-likelihood
• E step: evaluate the responsibilities using the current parameters

  \gamma(z_{nk}) = \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n \mid \mu_j, \Sigma_j)}

• M step: re-estimate the parameters using the current responsibilities

  \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n

  \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^\top

  \pi_k^{new} = \frac{N_k}{N}, \quad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk})

• Repeat the E step and M step until convergence
EM for Gaussian Mixtures
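The full E/M loop above fits in a short NumPy sketch. The initialization (random data points as means, identity covariances, uniform mixing weights) and the fixed iteration count are simplifying assumptions; a production implementation would also monitor the log-likelihood for convergence:

```python
import numpy as np

def gaussian(X, mu, cov):
    """N(x_n | mu, cov) evaluated for every row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff))

def em_gmm(X, K, n_iter=100, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialization (simplifying assumption: data-point means, identity covariances)
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    covs = np.array([np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk), an N x K matrix with rows summing to 1
        dens = np.stack([pis[k] * gaussian(X, mus[k], covs[k]) for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances, and mixing coefficients
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N
    return pis, mus, covs, gamma
```

Each iteration is guaranteed not to decrease the log-likelihood, which is the convergence property the EM slide states.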
• We can derive the K-means algorithm as a particular limit of EM for the Gaussian Mixture Model
• Consider a Gaussian mixture model whose covariance matrices are given by \varepsilon I, where \varepsilon is a variance parameter and I is the identity matrix
• If we consider the limit \varepsilon \to 0, the expected complete-data log-likelihood becomes

  E_z[\ln p(X, Z \mid \mu, \Sigma, \pi)] \to -\frac{1}{2} \sum_n \sum_k r_{nk} \|x_n - \mu_k\|^2 + C

• Thus we see that in this limit, maximizing the expected complete-data log-likelihood is equivalent to the K-means algorithm
Relationship between K-means algorithm and GMM
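This limit can be checked numerically: as \varepsilon shrinks, the responsibilities of a GMM with covariances \varepsilon I and equal mixing weights approach the hard 0/1 assignments r_{nk} used by K-means. The means and data points below are illustrative assumptions:

```python
import numpy as np

def responsibilities(X, mus, eps):
    """gamma(z_nk) for a GMM with equal mixing weights and covariances eps * I."""
    # With Sigma_k = eps*I and pi_k = 1/K, only the exponent depends on k,
    # so gamma reduces to a softmax over -||x_n - mu_k||^2 / (2*eps)
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)  # N x K squared distances
    logits = -d2 / (2 * eps)
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.2, 0.0], [2.9, 3.1]])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])

soft = responsibilities(X, mus, 1.0)    # noticeably soft assignments
hard = responsibilities(X, mus, 0.01)   # nearly one-hot, as in K-means
```

As eps decreases, each row of the responsibility matrix concentrates all of its mass on the nearest mean, recovering the E-step of K-means.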