Unsupervised Learning I:
K-Means Clustering
Reading:
Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp. 487-515, 532-541, 546-552
(http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)
Unsupervised learning = No labels on training examples!
Main approach: Clustering
Example: Optdigits data set
[Figure: a handwritten-digit image with its pixels labeled f1, f2, f3, ...]
Each digit is represented by a 64-dimensional feature vector of pixel values:
x = (f1, f2, ..., f64) = (0, 2, 13, 16, 16, 16, 2, 0, 0, ...), etc.
Partitional Clustering of Optdigits
[Figures: instances plotted along Feature 1, Feature 2, and Feature 3 of the 64-dimensional space, partitioned into disjoint clusters.]
Hierarchical Clustering of Optdigits
[Figures: the same 3-D view, with clusters nested into successively larger clusters.]
Issues for clustering algorithms
• How to measure distance between pairs of instances?
• How many clusters to create?
• Should clusters be hierarchical? (i.e., clusters of clusters)
• Should clustering be “soft”? (i.e., an instance can belong to multiple clusters, with “weighted belonging”)
Most commonly used (and simplest) clustering algorithm:
K-Means Clustering
Adapted from Andrew Moore, http://www.cs.cmu.edu/~awm/tutorials
K-means clustering algorithm
Typically, the mean of the points in a cluster is used as the centroid.
Distance metric: chosen by the user. For numerical attributes, the L2 (Euclidean) distance is often used. The centroid of a cluster here refers to the mean of the points in the cluster:
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
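The algorithm-walkthrough figures from the original slides do not survive in text form, so here is a minimal sketch of the standard K-means loop (Lloyd's algorithm) in Python with NumPy; the function name kmeans and its parameters are my own, not from the slides:

```python
import numpy as np

def kmeans(X, k, max_iters=100, rng=None):
    """Minimal K-means (Lloyd's algorithm).
    X: (m, n) array of m instances with n numerical features."""
    rng = np.random.default_rng(rng)
    # Initialize centroids by picking k distinct instances at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each instance joins its nearest centroid (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```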
Example: Image segmentation by K-means clustering by color (from http://vitroz.com/Documents/Image%20Segmentation.pdf)
[Figures: segmentation results with K = 5 and K = 10, clustering pixels in RGB space.]
Clustering text documents
• A text document is represented as a feature vector of word frequencies.
• Similarity between two documents is measured as the cosine of the angle between their corresponding feature vectors.
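As a small illustration of this representation (the vocabulary size and counts below are invented for the example):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between word-frequency vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Two tiny "documents" as word-frequency vectors over a shared vocabulary.
doc1 = np.array([2.0, 0.0, 1.0, 3.0])
doc2 = np.array([1.0, 1.0, 0.0, 2.0])
print(cosine_similarity(doc1, doc2))  # close to 1.0 for similar documents
```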
Figure 4. Two-dimensional map of the PMRA cluster solution, representing nearly 29,000 clusters and over two million articles.
Boyack KW, Newman D, Duhon RJ, Klavans R, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6(3): e18029. doi:10.1371/journal.pone.0018029 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0018029
Exercise 1
How to evaluate clusters produced by K-means?
• Unsupervised evaluation
• Supervised evaluation
Unsupervised Cluster Evaluation
We don’t know the classes of the data instances
Let C denote a clustering (i.e., the set of K clusters produced by a clustering algorithm), let ci denote a cluster in C, and let |ci| denote the number of elements in ci.
[Figure: an example clustering C with three clusters: |c1| = 9, |c2| = 6, |c3| = 6.]
How can we quantify how good this clustering is?
We want to maximize the coherence of each cluster c, i.e., minimize the distance between the elements of c and the centroid µc. That is, minimize the mean square error (mse):

$$\mathrm{mse}(c) = \frac{\sum_{x \in c} d(x, \mu_c)^2}{|c|}$$

$$\text{average mse}(C) = \frac{\sum_{c \in C} \mathrm{mse}(c)}{K}$$
Note: The assigned reading uses the sum of squared errors (SSE) rather than the mean square error.
We also want to maximize the pairwise separation between clusters. That is, maximize the mean square separation (mss):

$$\mathrm{mss}(C) = \frac{\sum_{i < j} d(\mu_i, \mu_j)^2}{K(K-1)/2}$$

where the sum runs over all distinct pairs of clusters i, j ∈ C (i ≠ j).
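A minimal sketch of both measures in Python, assuming each cluster is given as a NumPy array of its points and centroids are their means; the function names are mine:

```python
import numpy as np
from itertools import combinations

def mse(cluster, centroid):
    """Mean squared distance from each point in a cluster to its centroid."""
    return np.mean(np.sum((cluster - centroid) ** 2, axis=1))

def average_mse(clusters, centroids):
    """Average mse over the K clusters of a clustering."""
    return np.mean([mse(c, mu) for c, mu in zip(clusters, centroids)])

def mss(centroids):
    """Mean squared separation over all distinct pairs of centroids."""
    pairs = list(combinations(centroids, 2))  # K(K-1)/2 unordered pairs
    return np.mean([np.sum((a - b) ** 2) for a, b in pairs])
```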
Exercises 2-3
Supervised Cluster Evaluation
Suppose we know the classes of the data instances: black, blue, green
Entropy of a cluster: The degree to which a cluster consists of objects of a single class.
$$\mathrm{entropy}(c_i) = -\sum_{j=1}^{|\text{Classes}|} p_{i,j} \log_2 p_{i,j}$$

where $p_{i,j} = m_{i,j} / m_i$ is the probability that a member of cluster $i$ belongs to class $j$, $m_{i,j}$ is the number of instances in cluster $i$ with class $j$, and $m_i$ is the number of instances in cluster $i$.
[Figure: the three-cluster example, with each instance colored by its class.]

$$\mathrm{entropy}(c_1) = -\left(\tfrac{3}{9}\log_2\tfrac{3}{9} + \tfrac{4}{9}\log_2\tfrac{4}{9} + \tfrac{2}{9}\log_2\tfrac{2}{9}\right) \approx 1.53$$

$$\mathrm{entropy}(c_2) = -\left(\tfrac{1}{6}\log_2\tfrac{1}{6} + \tfrac{4}{6}\log_2\tfrac{4}{6} + \tfrac{1}{6}\log_2\tfrac{1}{6}\right) \approx 1.25$$

$$\mathrm{entropy}(c_3) = -\left(\tfrac{4}{6}\log_2\tfrac{4}{6} + \tfrac{2}{6}\log_2\tfrac{2}{6}\right) \approx 0.92$$
Mean entropy of a clustering: the average entropy over all clusters in the clustering, weighted by the number of elements in each cluster:

$$\text{mean entropy}(C) = \sum_{i=1}^{K} \frac{m_i}{m}\, \mathrm{entropy}(c_i)$$

where $m_i$ is the number of instances in cluster $c_i$ and $m$ is the total number of instances in the clustering.
$$\text{mean entropy}(C) = \tfrac{9}{21}(1.53) + \tfrac{6}{21}(1.25) + \tfrac{6}{21}(0.92) \approx 1.28$$
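A quick check of this computation in Python (the per-class counts 3/4/2, 1/4/1, and 4/2 are read from the example figure):

```python
import numpy as np

def cluster_entropy(counts):
    """Entropy of a cluster from its per-class instance counts."""
    p = np.array(counts, dtype=float) / sum(counts)
    return -np.sum(p * np.log2(p))

counts = [[3, 4, 2], [1, 4, 1], [4, 2]]           # c1, c2, c3
sizes = [sum(c) for c in counts]                  # 9, 6, 6
entropies = [cluster_entropy(c) for c in counts]  # ~1.53, ~1.25, ~0.92
mean_entropy = sum(s * e for s, e in zip(sizes, entropies)) / sum(sizes)
print(mean_entropy)                               # ~1.28
```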
What is the mean entropy of this clustering?
Entropy Exercise
Issues for K-means
• The algorithm is only applicable when the mean is defined.
  – For categorical data, use K-modes: the centroid is represented by the most frequent values.
• The user needs to specify K.
• The algorithm is sensitive to outliers.
  – Outliers are data points that are very far away from other data points.
  – Outliers could be errors in the data recording, or special data points with very different values.
Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt
Issues for K-means: Problems with outliers
[Figure: an example where outliers distort the clusters that K-means finds.]
Dealing with outliers
• One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points (a sketch follows this list).
  – Expensive
  – Not always a good idea!
• Another method is to perform random sampling: since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
  – Assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
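A minimal sketch of the distance-based filtering idea mentioned above; the cutoff of 3x the median distance is an illustrative assumption, not a rule from the slides:

```python
import numpy as np

def filter_outliers(X, labels, centroids, factor=3.0):
    """Drop points much farther from their own centroid than is typical."""
    d = np.linalg.norm(X - centroids[labels], axis=1)  # distance to own centroid
    keep = d <= factor * np.median(d)                  # hypothetical cutoff rule
    return X[keep], labels[keep]
```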
Issues for K-means (cont …)
• The algorithm is sensitive to initial seeds.
  [Figures: with one choice of initial seeds the resulting clusters are poor; with different seeds, good results.]
  – Often can improve K-means results by doing several random restarts (a sketch follows this list).
  – Often useful to select instances from the data as initial seeds.
• The K-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
  [Figure: an example of irregularly shaped clusters.]
Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt
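A sketch of the random-restart idea, reusing the kmeans() function sketched earlier and scoring each run by its total squared error; the helper name is mine:

```python
import numpy as np

def kmeans_restarts(X, k, n_restarts=10):
    """Run K-means several times from random seeds; keep the lowest-SSE run."""
    best_sse, best = np.inf, None
    for seed in range(n_restarts):
        centroids, labels = kmeans(X, k, rng=seed)  # kmeans() sketched above
        sse = np.sum((X - centroids[labels]) ** 2)  # total squared error
        if sse < best_sse:
            best_sse, best = sse, (centroids, labels)
    return best
```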
Other Issues
• What if a cluster is empty?
  – Choose a replacement centroid:
    • At random, or
    • From the cluster that has the highest mean square error
• How to choose K?
• The assigned reading discusses several methods for improving a clustering with “postprocessing”.
Choosing the K in K-Means
• Hard problem! Often there is no “correct” answer for unlabeled data.
• Many methods have been proposed. Here are a few:
• Try several values of K and see which is best, via cross-validation.
  – Metrics: mean square error, mean square separation, penalty for too many clusters [why?]
• Start with K = 2, then try splitting each cluster.
  – New means are one sigma away from the cluster center in the direction of greatest variation.
  – Use similar metrics to the above.
• “Elbow” method:
  – Plot average MSE vs. K. Choose the K at which the MSE (or other metric) stops decreasing abruptly (a sketch follows this list).
  – However, sometimes there is no clear “elbow”.
  [Figure: a plot of average MSE vs. K, with the “elbow” marked.]
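A sketch of the elbow method, again reusing the kmeans() sketch from earlier; it plots the curve for visual inspection (with matplotlib) rather than detecting the elbow automatically:

```python
import numpy as np
import matplotlib.pyplot as plt

def elbow_plot(X, k_values):
    """Plot average MSE against K; look for where the curve flattens."""
    avg_mses = []
    for k in k_values:
        centroids, labels = kmeans(X, k)  # kmeans() sketched above
        per_cluster = [np.mean(np.sum((X[labels == j] - centroids[j]) ** 2, axis=1))
                       for j in range(k) if np.any(labels == j)]
        avg_mses.append(np.mean(per_cluster))
    plt.plot(list(k_values), avg_mses, marker="o")
    plt.xlabel("K")
    plt.ylabel("average MSE")
    plt.show()
```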