Unsupervised Learning I:
K-Means Clustering
Reading:
Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp. 487-515, 532-541, 546-552
(http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)
Unsupervised learning = No labels on training examples!
Main approach: Clustering
Example: Optdigits data set
[Figure: a handwritten-digit image with its pixels labeled f1, f2, f3, ...]
Each digit is represented by a 64-dimensional feature vector of pixel values:
x = (f1, f2, ..., f64) = (0, 2, 13, 16, 16, 16, 2, 0, 0, ...), etc.
Partitional Clustering of Optdigits
[Figures: instances plotted along Feature 1, Feature 2, and Feature 3 of the 64-dimensional space, partitioned into disjoint clusters.]
Hierarchical Clustering of Optdigits
[Figures: the same 3-D view, with clusters nested into successively larger clusters.]
Issues for clustering algorithms
• How to measure distance between pairs of instances?
• How many clusters to create?
• Should clusters be hierarchical? (i.e., clusters of clusters)
• Should clustering be “soft”? (i.e., an instance can belong to multiple clusters, with “weighted belonging”)
Most commonly used (and simplest) clustering algorithm:
K-Means Clustering
Adapted from Andrew Moore, http://www.cs.cmu.edu/~awm/tutorials
K-means clustering algorithm
Typically, the mean of the points in a cluster is used as the centroid.
Distance metric: chosen by the user. For numerical attributes, the L2 (Euclidean) distance is often used. The centroid of a cluster here refers to the mean of the points in the cluster:
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
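The algorithm-walkthrough figures from the original slides do not survive in text form, so here is a minimal sketch of the standard K-means loop (Lloyd's algorithm) in Python with NumPy; the function name kmeans and its parameters are my own, not from the slides:

```python
import numpy as np

def kmeans(X, k, max_iters=100, rng=None):
    """Minimal K-means (Lloyd's algorithm).
    X: (m, n) array of m instances with n numerical features."""
    rng = np.random.default_rng(rng)
    # Initialize centroids by picking k distinct instances at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each instance joins its nearest centroid (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```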
Example: Image segmentation by K-means clustering by color (from http://vitroz.com/Documents/Image%20Segmentation.pdf)
[Figures: segmentation results with K = 5 and K = 10, clustering pixels in RGB space.]
Clustering text documents
• A text document is represented as a feature vector of word frequencies.
• Similarity between two documents is measured as the cosine of the angle between their corresponding feature vectors.
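As a small illustration of this representation (the vocabulary size and counts below are invented for the example):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between word-frequency vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Two tiny "documents" as word-frequency vectors over a shared vocabulary.
doc1 = np.array([2.0, 0.0, 1.0, 3.0])
doc2 = np.array([1.0, 1.0, 0.0, 2.0])
print(cosine_similarity(doc1, doc2))  # close to 1.0 for similar documents
```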
Figure 4. Two-dimensional map of the PMRA cluster solution, representing nearly 29,000 clusters and over two million articles.
Boyack KW, Newman D, Duhon RJ, Klavans R, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6(3): e18029. doi:10.1371/journal.pone.0018029 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0018029
Exercise 1
How to evaluate clusters produced by K-means?
• Unsupervised evaluation
• Supervised evaluation
Unsupervised Cluster Evaluation
We don’t know the classes of the data instances
Let C denote a clustering (i.e., the set of K clusters produced by a clustering algorithm), let ci denote a cluster in C, and let |ci| denote the number of elements in ci.
[Figure: an example clustering C with three clusters: |c1| = 9, |c2| = 6, |c3| = 6.]
How can we quantify how good this clustering is?
We want to maximize the coherence of each cluster c, i.e., minimize the distance between the elements of c and the centroid µc. That is, minimize the mean square error (mse):

$$\mathrm{mse}(c) = \frac{\sum_{x \in c} d(x, \mu_c)^2}{|c|}$$

$$\text{average mse}(C) = \frac{\sum_{c \in C} \mathrm{mse}(c)}{K}$$
Note: The assigned reading uses the sum of squared errors (SSE) rather than the mean square error.
We also want to maximize the pairwise separation between clusters. That is, maximize the mean square separation (mss):

$$\mathrm{mss}(C) = \frac{\sum_{i < j} d(\mu_i, \mu_j)^2}{K(K-1)/2}$$

where the sum runs over all distinct pairs of clusters i, j ∈ C (i ≠ j).
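A minimal sketch of both measures in Python, assuming each cluster is given as a NumPy array of its points and centroids are their means; the function names are mine:

```python
import numpy as np
from itertools import combinations

def mse(cluster, centroid):
    """Mean squared distance from each point in a cluster to its centroid."""
    return np.mean(np.sum((cluster - centroid) ** 2, axis=1))

def average_mse(clusters, centroids):
    """Average mse over the K clusters of a clustering."""
    return np.mean([mse(c, mu) for c, mu in zip(clusters, centroids)])

def mss(centroids):
    """Mean squared separation over all distinct pairs of centroids."""
    pairs = list(combinations(centroids, 2))  # K(K-1)/2 unordered pairs
    return np.mean([np.sum((a - b) ** 2) for a, b in pairs])
```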
Exercises 2-3
Supervised Cluster Evaluation
Suppose we know the classes of the data instances: black, blue, green
Entropy of a cluster: The degree to which a cluster consists of objects of a single class.
$$\mathrm{entropy}(c_i) = -\sum_{j=1}^{|\text{Classes}|} p_{i,j} \log_2 p_{i,j}$$

where $p_{i,j} = m_{i,j} / m_i$ is the probability that a member of cluster $i$ belongs to class $j$, $m_{i,j}$ is the number of instances in cluster $i$ with class $j$, and $m_i$ is the number of instances in cluster $i$.
[Figure: the three-cluster example, with each instance colored by its class.]

$$\mathrm{entropy}(c_1) = -\left(\tfrac{3}{9}\log_2\tfrac{3}{9} + \tfrac{4}{9}\log_2\tfrac{4}{9} + \tfrac{2}{9}\log_2\tfrac{2}{9}\right) \approx 1.53$$

$$\mathrm{entropy}(c_2) = -\left(\tfrac{1}{6}\log_2\tfrac{1}{6} + \tfrac{4}{6}\log_2\tfrac{4}{6} + \tfrac{1}{6}\log_2\tfrac{1}{6}\right) \approx 1.25$$

$$\mathrm{entropy}(c_3) = -\left(\tfrac{4}{6}\log_2\tfrac{4}{6} + \tfrac{2}{6}\log_2\tfrac{2}{6}\right) \approx 0.92$$
Mean entropy of a clustering: the average entropy over all clusters in the clustering, weighted by the number of elements in each cluster:

$$\text{mean entropy}(C) = \sum_{i=1}^{K} \frac{m_i}{m}\, \mathrm{entropy}(c_i)$$

where $m_i$ is the number of instances in cluster $c_i$ and $m$ is the total number of instances in the clustering.
$$\text{mean entropy}(C) = \tfrac{9}{21}(1.53) + \tfrac{6}{21}(1.25) + \tfrac{6}{21}(0.92) \approx 1.28$$
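A quick check of this computation in Python (the per-class counts 3/4/2, 1/4/1, and 4/2 are read from the example figure):

```python
import numpy as np

def cluster_entropy(counts):
    """Entropy of a cluster from its per-class instance counts."""
    p = np.array(counts, dtype=float) / sum(counts)
    return -np.sum(p * np.log2(p))

counts = [[3, 4, 2], [1, 4, 1], [4, 2]]           # c1, c2, c3
sizes = [sum(c) for c in counts]                  # 9, 6, 6
entropies = [cluster_entropy(c) for c in counts]  # ~1.53, ~1.25, ~0.92
mean_entropy = sum(s * e for s, e in zip(sizes, entropies)) / sum(sizes)
print(mean_entropy)                               # ~1.28
```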
What is the mean entropy of this clustering?
Entropy Exercise
Issues for K-means
• The algorithm is only applicable when the mean is defined.
  – For categorical data, use K-modes: the centroid is represented by the most frequent values.
• The user needs to specify K.
• The algorithm is sensitive to outliers.
  – Outliers are data points that are very far away from other data points.
  – Outliers could be errors in the data recording, or special data points with very different values.
Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt
Issues for K-means: Problems with outliers
[Figure: an example where outliers distort the clusters that K-means finds.]
Dealing with outliers
• One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points (a sketch follows this list).
  – Expensive
  – Not always a good idea!
• Another method is to perform random sampling: since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
  – Assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
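A minimal sketch of the distance-based filtering idea mentioned above; the cutoff of 3x the median distance is an illustrative assumption, not a rule from the slides:

```python
import numpy as np

def filter_outliers(X, labels, centroids, factor=3.0):
    """Drop points much farther from their own centroid than is typical."""
    d = np.linalg.norm(X - centroids[labels], axis=1)  # distance to own centroid
    keep = d <= factor * np.median(d)                  # hypothetical cutoff rule
    return X[keep], labels[keep]
```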
Issues for K-means (cont …)
• The algorithm is sensitive to initial seeds.
  [Figures: with one choice of initial seeds the resulting clusters are poor; with different seeds, good results.]
  – Often can improve K-means results by doing several random restarts (a sketch follows this list).
  – Often useful to select instances from the data as initial seeds.
• The K-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
  [Figure: an example of irregularly shaped clusters.]
Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt
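A sketch of the random-restart idea, reusing the kmeans() function sketched earlier and scoring each run by its total squared error; the helper name is mine:

```python
import numpy as np

def kmeans_restarts(X, k, n_restarts=10):
    """Run K-means several times from random seeds; keep the lowest-SSE run."""
    best_sse, best = np.inf, None
    for seed in range(n_restarts):
        centroids, labels = kmeans(X, k, rng=seed)  # kmeans() sketched above
        sse = np.sum((X - centroids[labels]) ** 2)  # total squared error
        if sse < best_sse:
            best_sse, best = sse, (centroids, labels)
    return best
```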
Other Issues
• What if a cluster is empty?
  – Choose a replacement centroid:
    • At random, or
    • From the cluster that has the highest mean square error
• How to choose K?
• The assigned reading discusses several methods for improving a clustering with “postprocessing”.
Choosing the K in K-Means
• Hard problem! Often there is no “correct” answer for unlabeled data.
• Many methods have been proposed. Here are a few:
• Try several values of K and see which is best, via cross-validation.
  – Metrics: mean square error, mean square separation, penalty for too many clusters [why?]
• Start with K = 2, then try splitting each cluster.
  – New means are one sigma away from the cluster center in the direction of greatest variation.
  – Use similar metrics to the above.
• “Elbow” method:
  – Plot average MSE vs. K. Choose the K at which the MSE (or other metric) stops decreasing abruptly (a sketch follows this list).
  – However, sometimes there is no clear “elbow”.
  [Figure: a plot of average MSE vs. K, with the “elbow” marked.]
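A sketch of the elbow method, again reusing the kmeans() sketch from earlier; it plots the curve for visual inspection (with matplotlib) rather than detecting the elbow automatically:

```python
import numpy as np
import matplotlib.pyplot as plt

def elbow_plot(X, k_values):
    """Plot average MSE against K; look for where the curve flattens."""
    avg_mses = []
    for k in k_values:
        centroids, labels = kmeans(X, k)  # kmeans() sketched above
        per_cluster = [np.mean(np.sum((X[labels == j] - centroids[j]) ** 2, axis=1))
                       for j in range(k) if np.any(labels == j)]
        avg_mses.append(np.mean(per_cluster))
    plt.plot(list(k_values), avg_mses, marker="o")
    plt.xlabel("K")
    plt.ylabel("average MSE")
    plt.show()
```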