
Page 1

Unsupervised Learning

Reading:

Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp. 487-515, 532-541, 546-552

(http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)

Page 2

Unsupervised learning = No labels on training examples!

Main approach: Clustering

Page 3

Example: Optdigits data set

Page 4

Optdigits features: f1, f2, ..., f64 (one feature per pixel of the 8×8 digit image)

x = (f1, f2, ..., f64) = (0, 2, 13, 16, 16, 16, 2, 0, 0, ...), etc.

Page 5

Partitional Clustering of Optdigits

[Figure: data points plotted along Feature 1, Feature 2, and Feature 3 of the 64-dimensional feature space]


Page 7

Hierarchical Clustering of Optdigits

[Figure: data points plotted along Feature 1, Feature 2, and Feature 3 of the 64-dimensional feature space]


Page 10

Issues for clustering algorithms

•  How to measure distance between pairs of instances?

•  How many clusters to create?

•  Should clusters be hierarchical? (I.e., clusters of clusters)

•  Should clustering be “soft”? (I.e., an instance can belong to different clusters, with “weighted belonging”)

Page 11

Most commonly used (and simplest) clustering algorithm:

K-Means Clustering

Page 12

Adapted from Andrew Moore, http://www.cs.cmu.edu/~awm/tutorials

Page 17

K-means clustering algorithm

Page 18

K-means clustering algorithm

Typically, use mean of points in cluster as centroid

Page 19

K-means clustering algorithm

Distance metric: chosen by the user. For numerical attributes, this is often the L2 (Euclidean) distance. The centroid of a cluster here refers to the mean of the points in the cluster.

$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
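As a concrete illustration of the algorithm described on the last few slides, here is a minimal NumPy sketch of the K-means loop (random choice of initial centroids from the data, Euclidean distance, mean-of-points centroid update). The function name, data shape, and iteration cap are illustrative assumptions, not part of the slides:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Basic K-means: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its old centroid in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```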

Page 20

Example: Image segmentation by K-means clustering in color space (from http://vitroz.com/Documents/Image%20Segmentation.pdf)

K=5, RGB space

K=10, RGB space


Page 23

Clustering text documents

•  A text document is represented as a feature vector of word frequencies.

•  Distance between two documents is based on the cosine of the angle between their corresponding feature vectors (cosine similarity).
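A small sketch of the cosine measure between two word-frequency vectors (pure NumPy; the toy vectors and vocabulary are made up for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between word-frequency vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy word-frequency vectors for two documents over a shared 5-word vocabulary.
doc1 = np.array([3, 0, 1, 2, 0])
doc2 = np.array([1, 1, 0, 2, 0])
print(cosine_similarity(doc1, doc2))  # values near 1.0 indicate similar word usage
```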

Page 24

Figure 4. Two-dimensional map of the PMRA cluster solution, representing nearly 29,000 clusters and over two million articles.

Boyack KW, Newman D, Duhon RJ, Klavans R, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6(3): e18029. doi:10.1371/journal.pone.0018029 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0018029

Page 25

Exercise 1

Page 26

How to evaluate clusters produced by K-means?

•  Unsupervised evaluation

•  Supervised evaluation

Page 27

Unsupervised Cluster Evaluation

We don't know the classes of the data instances.

Let C denote a clustering (i.e., the set of K clusters produced by a clustering algorithm), let c denote a cluster in C, and let |c| denote the number of elements in c. We want each cluster c to be coherent, i.e., we want the elements of c to be close to the centroid µc. That is, we want to minimize the mean square error (mse):

$\mathrm{mse}(c) = \frac{\sum_{x \in c} d(x, \mu_c)^2}{|c|}$

$\text{average mse}(C) = \frac{\sum_{c \in C} \mathrm{mse}(c)}{K}$

Page 28

Note: The assigned reading uses sum square error rather than mean square error.

Page 29

Unsupervised Cluster Evaluation

We don't know the classes of the data instances.

We also want to maximize the pairwise separation of the clusters. That is, maximize the mean square separation (mss):

$\mathrm{mss}(C) = \frac{\sum_{\text{distinct pairs } i, j \in C,\; i \neq j} d(\mu_i, \mu_j)^2}{K(K-1)/2}$
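A sketch of the two unsupervised measures above, computed from cluster labels and centroids (NumPy; the argument names are illustrative, and every cluster is assumed to be non-empty):

```python
import numpy as np

def average_mse(X, labels, centroids):
    """Mean of per-cluster mean square error: coherence (lower is better)."""
    mses = []
    for j, mu in enumerate(centroids):
        members = X[labels == j]                     # assumes cluster j is non-empty
        mses.append(np.mean(np.sum((members - mu) ** 2, axis=1)))
    return np.mean(mses)

def mean_square_separation(centroids):
    """Mean squared distance over all distinct centroid pairs (higher is better)."""
    K = len(centroids)
    total = sum(np.sum((centroids[i] - centroids[j]) ** 2)
                for i in range(K) for j in range(i + 1, K))
    return total / (K * (K - 1) / 2)
```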

Page 30

Exercises 2-3

Page 31

Supervised Cluster Evaluation

Suppose we know the classes of the data instances.

Entropy of a cluster: the degree to which a cluster consists of objects of a single class.

Mean entropy of a clustering: average entropy over all clusters in the clustering.

$\mathrm{entropy}(c_i) = -\sum_{j=1}^{|\text{Classes}|} p_{i,j} \log_2 p_{i,j}$

where $p_{i,j}$ = probability that a member of cluster i belongs to class j $= \frac{m_{i,j}}{m_i}$, where $m_{i,j}$ is the number of instances in cluster i with class j and $m_i$ is the number of instances in cluster i.

$\text{mean entropy}(C) = \sum_{i=1}^{K} \frac{m_i}{m}\, \mathrm{entropy}(c_i)$

where $m_i$ is the number of instances in cluster i and m is the total number of instances in the dataset.

We want to minimize mean entropy.
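A sketch of cluster entropy and mean entropy from the definitions above (NumPy; class and cluster labels are assumed to be integer arrays, and classes absent from a cluster contribute 0, consistent with the convention 0·log 0 = 0):

```python
import numpy as np

def cluster_entropy(class_labels_in_cluster):
    """entropy(c_i) = -sum_j p_ij * log2(p_ij), over classes present in the cluster."""
    _, counts = np.unique(class_labels_in_cluster, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mean_entropy(class_labels, cluster_labels):
    """Weighted average of per-cluster entropies, with weights m_i / m."""
    m = len(class_labels)
    total = 0.0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        total += (len(members) / m) * cluster_entropy(members)
    return total
```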

Page 32

Entropy Example

Suppose there are 3 classes: 1, 2, 3. Class labels of the instances in each cluster:

Cluster 1: 1 2 1 3 1 1 3
Cluster 2: 2 3 3 3 2 3
Cluster 3: 1 1 3 2 2 3 2

$\mathrm{entropy}(c_1) = -\left(\frac{4}{7}\log_2\frac{4}{7} + \frac{1}{7}\log_2\frac{1}{7} + \frac{2}{7}\log_2\frac{2}{7}\right) = 1.37$

$\mathrm{entropy}(c_2) = -\left(0 + \frac{2}{6}\log_2\frac{2}{6} + \frac{4}{6}\log_2\frac{4}{6}\right) = 0.91$

$\mathrm{entropy}(c_3) = -\left(\frac{2}{7}\log_2\frac{2}{7} + \frac{3}{7}\log_2\frac{3}{7} + \frac{2}{7}\log_2\frac{2}{7}\right) = 1.54$

$\text{mean entropy}(C) = \frac{7}{20}(1.37) + \frac{6}{20}(0.91) + \frac{7}{20}(1.54)$

Page 33

Exercise 4

Page 34

Issues for K-means

Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt

Page 35

Issues for K-means

•  The algorithm is only applicable if the mean is defined.
   –  For categorical data, use K-modes: the centroid is represented by the most frequent values.

•  The user needs to specify K.

•  The algorithm is sensitive to outliers.
   –  Outliers are data points that are very far away from other data points.
   –  Outliers could be errors in the data recording or some special data points with very different values.

Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt

Page 38

Issues for K-means: Problems with outliers

Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt

Page 39

Dealing with outliers

•  One method is to remove some data points in the clustering process that are much further away from the centroids than other data points.
   –  Expensive
   –  Not always a good idea!

•  Another method is to perform random sampling. Since in sampling we only choose a small subset of the data points, the chance of selecting an outlier is very small.
   –  Assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.

Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt

Page 40

Issues for K-means (cont.)

•  The algorithm is sensitive to initial seeds.

Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt

Page 41

Issues for K-means (cont.)

•  If we use different seeds: good results.

•  Often can improve K-means results by doing several random restarts.

•  Often useful to select instances from the data as initial seeds.

Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt

Page 44

Issues for K-means (cont.)

•  The K-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).

Adapted from Bing Liu, UIC http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt

Page 45

Other Issues

•  What if a cluster is empty?
   –  Choose a replacement centroid: at random, or from the cluster that has the highest mean square error.

•  How to choose K?

•  The assigned reading discusses several methods for improving a clustering with "postprocessing".

Page 46

Choosing the K in K-Means

•  Hard problem! Often there is no "correct" answer for unlabeled data.

•  Many methods have been proposed. Here are a few:

•  Try several values of K and see which is best, via cross-validation.
   –  Metrics: mean square error, mean square separation, penalty for too many clusters [why?]

•  Start with K = 2, then try splitting each cluster.
   –  The new means are one sigma away from the cluster center in the direction of greatest variation.
   –  Use metrics similar to the above.

Page 47

•  "Elbow" method:
   –  Plot average mse (or SSE) vs. K. Choose the K at which the SSE (or other metric) stops decreasing abruptly (the "elbow" of the curve).
   –  However, sometimes there is no clear "elbow".

[Figure: SSE vs. K curve with the "elbow" marked]
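A sketch of the elbow procedure using scikit-learn's KMeans (assumed available) to compute SSE over a range of K; the synthetic data and the range of K are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # inertia_ = sum of squared distances to nearest centroid

# Plot (or inspect) SSE vs. K and look for the K where the curve stops dropping sharply.
for k, s in zip(range(1, 11), sse):
    print(k, round(s, 1))
```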

Page 48

Homework 5

Page 49

Quiz 4 Review

Page 50

Soft Clustering with Gaussian Mixture Models

Page 51

Soft Clustering with Gaussian mixture models

•  A “soft”, generative version of K-means clustering

•  Given: Training set S = {x1, ..., xN}, and K.

•  Assumption: Data is generated by sampling from a “mixture” (linear combination) of K Gaussians.

Page 52

Gaussian Mixture Models Assumptions

•  K clusters

•  Each cluster is modeled by a Gaussian distribution with a certain mean and standard deviation (or covariance). [This contrasts with K-means, in which each cluster is modeled only by a mean.]

•  Assume that each data instance we have was generated by the following procedure:

1. Select cluster ci with probability P(ci) = πi

2. Sample point from ci’s Gaussian distribution

Page 53

Mixture of three Gaussians (one-dimensional data)

$p(x) = \pi_1 \mathcal{N}(x \mid \mu_1, \sigma_1) + \pi_2 \mathcal{N}(x \mid \mu_2, \sigma_2) + \pi_3 \mathcal{N}(x \mid \mu_3, \sigma_3)$

where $\pi_1 + \pi_2 + \pi_3 = 1$
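A short sketch evaluating a three-component univariate mixture density like the one above (SciPy's norm.pdf; the weights, means, and standard deviations are made-up values chosen so that π1 + π2 + π3 = 1):

```python
import numpy as np
from scipy.stats import norm

pis    = [0.5, 0.3, 0.2]    # mixing coefficients, sum to 1
mus    = [-2.0, 0.0, 3.0]   # component means (illustrative)
sigmas = [0.5, 1.0, 0.8]    # component standard deviations (illustrative)

def mixture_pdf(x):
    """p(x) = sum_i pi_i * N(x | mu_i, sigma_i)"""
    return sum(p * norm.pdf(x, loc=m, scale=s) for p, m, s in zip(pis, mus, sigmas))

xs = np.linspace(-5, 6, 5)
print(mixture_pdf(xs))
```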

Page 54

Clustering via finite Gaussian mixture models

•  Clusters: Each cluster will correspond to a single Gaussian. Each point x ∈ S will have some probability distribution over the K clusters.

•  Goal: Given the data, find the Gaussians! (And their probabilities πi.) I.e., find parameters {θK} of these K Gaussians such that P(S | {θK}) is maximized.

•  This is called a Maximum Likelihood method.

–  S is the data

–  {θK} is the “hypothesis” or “model”

–  P(S | {θK}) is the “likelihood”.

Page 55

General form of a one-dimensional (univariate) Gaussian Mixture Model:

$p(x) = \sum_{i=1}^{K} \pi_i\, \mathcal{N}(x \mid \mu_i, \sigma_i), \quad \text{where } \sum_{i=1}^{K} \pi_i = 1$

Page 56

Learning a GMM — Simple Case: Maximum Likelihood for a Single Univariate Gaussian

•  Assume the training set S = {x1, ..., xN} has N values generated by a univariate Gaussian distribution:

$\mathcal{N}(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

•  Likelihood function: probability of the data given the model (or the parameters of the model):

$p(S \mid \mu, \sigma) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid \mu, \sigma)$

Page 57

•  How to estimate the parameters µ and σ from S? Maximize the likelihood function with respect to µ and σ.

•  We want the µ and σ that maximize the probability of the data:

Maximize: $p(S \mid \mu, \sigma) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

•  Problem: Individual values of $\mathcal{N}(x_i \mid \mu, \sigma)$ are typically very small. (They can underflow the numerical precision of the computer.)
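A small illustration of the underflow issue and the log-likelihood fix used on the next slide (SciPy's norm; the sample data and the parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
S = rng.normal(loc=5.0, scale=2.0, size=2000)   # arbitrary univariate sample

mu, sigma = 5.0, 2.0
likelihood = np.prod(norm.pdf(S, loc=mu, scale=sigma))        # underflows to 0.0
log_likelihood = np.sum(norm.logpdf(S, loc=mu, scale=sigma))  # stays finite

print(likelihood)       # 0.0 in double precision for this many points
print(log_likelihood)   # a finite (negative) number
```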

Page 58

•  Solution: Work with the log likelihood instead of the likelihood.

$\ln p(S \mid \mu,\sigma) = \ln \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

$= \sum_{i=1}^{N} \ln\!\left( \frac{e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}} \right) = \sum_{i=1}^{N} \left[ \ln\!\left(e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}\right) - \ln\sqrt{2\pi\sigma^2} \right]$

$= \sum_{i=1}^{N} -\!\left( \frac{(x_i-\mu)^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi) + \frac{1}{2}\ln(\sigma^2) \right)$

$= -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi)$


Page 61

Now, find the maximum likelihood parameters µ and σ². First, maximize

$\ln p(S \mid \mu,\sigma) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi)$

with respect to µ:

$\frac{d}{d\mu}\left[ -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) \right] = \frac{d}{d\mu}\left[ -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2 \right]$

$= -\frac{1}{2\sigma^2}\sum_{i=1}^{N} -2(x_i-\mu) = \frac{1}{\sigma^2}\left[ \left(\sum_{i=1}^{N} x_i\right) - N\mu \right] = 0$

Result:

$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n$   (ML = "Maximum Likelihood")


Page 66

Now, maximize

$\ln p(S \mid \mu,\sigma) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi)$

with respect to σ²: find the σ² that maximizes it.

$\frac{d}{d\sigma^2}\left[ -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) \right] = \frac{d}{d\sigma^2}\left[ -\frac{1}{2}(\sigma^2)^{-1}\sum_{i=1}^{N}(x_i-\mu)^2 - \frac{N}{2}\ln\sigma^2 \right]$

$= \frac{1}{2}(\sigma^2)^{-2}\sum_{i=1}^{N}(x_i-\mu)^2 - \frac{N}{2\sigma^2} = \frac{\sum_{i=1}^{N}(x_i-\mu)^2 - N\sigma^2}{2(\sigma^2)^2} = 0$

$\Rightarrow\; \sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{ML})^2$


Page 69

•  The resulting distribution is called a "generative model" because it can generate new data values.

•  We say that $\theta = \{\mu_{ML}, \sigma_{ML}\}$ parameterizes the model:

$\mathcal{N}(x \mid \mu_{ML}, \sigma_{ML}) = \frac{1}{\sqrt{2\pi\sigma_{ML}^2}}\, e^{-\frac{(x-\mu_{ML})^2}{2\sigma_{ML}^2}}$

•  In general, θ is used to denote the (learnable) parameters of a probabilistic model.
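A sketch of the closed-form ML estimates derived above, followed by using the fitted Gaussian as a generative model to draw new values (NumPy; the training sample is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.normal(loc=10.0, scale=3.0, size=500)   # synthetic training data

mu_ml = S.mean()                        # mu_ML = (1/N) * sum_n x_n
sigma2_ml = np.mean((S - mu_ml) ** 2)   # sigma^2_ML = (1/N) * sum_n (x_n - mu_ML)^2

# Generative use: sample new data values from N(mu_ML, sigma_ML).
new_values = rng.normal(loc=mu_ml, scale=np.sqrt(sigma2_ml), size=5)
print(mu_ml, sigma2_ml, new_values)
```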

Page 70

Learning a GMM

More general case: Multivariate Gaussian Distribution

Multivariate (D-dimensional) Gaussian:

$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})}$

where µ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| is the determinant of Σ.

Page 71

Covariance:

$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$

Variance:

$\mathrm{cov}(x, x) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})}{n} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} = \mathrm{var}(x) = \sigma^2(x)$

Covariance matrix Σ:

$\Sigma_{i,j} = \mathrm{cov}(x_i, x_j)$
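A sketch of the covariance quantities above with NumPy (note that np.cov uses the n−1 denominator by default, matching the covariance line; bias=True gives the 1/n version used in the variance line; the data are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))   # 100 samples, D = 3 features (illustrative)

Sigma = np.cov(X, rowvar=False)                     # D x D matrix, (n - 1) denominator
Sigma_biased = np.cov(X, rowvar=False, bias=True)   # 1/n denominator

print(Sigma.shape)                                          # (3, 3)
print(np.allclose(np.diag(Sigma_biased), X.var(axis=0)))    # diagonal entries = variances
```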

Page 72

•  Let S be a set of multivariate data points (vectors): S = {x1, ..., xm}.

•  General expression for a finite Gaussian mixture model:

$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k)$

•  That is, x has some probability of "membership" in multiple clusters/classes.

Page 73

Maximum Likelihood for the Multivariate Gaussian Mixture Model

•  Goal: Given S = {x1, ..., xN}, and given K, find the Gaussian mixture model (with K multivariate Gaussians) for which S has maximum log-likelihood.

•  Log likelihood function:

$\ln P(S \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \Sigma) = \sum_{n=1}^{N} \ln\!\left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k) \right)$

•  Given S, we can maximize this function to find $\{\boldsymbol{\pi}, \boldsymbol{\mu}, \Sigma\}_{ML}$.

•  But there is no closed-form solution (unlike the simple case on the previous slides).

•  In this multivariate case, we can efficiently maximize this function using the Expectation-Maximization (EM) algorithm.

Page 74

Expectation-Maximization (EM) algorithm

•  General idea:
   –  Choose random initial values for the means, covariances, and mixing coefficients. (Analogous to choosing random initial cluster centers in K-means.)
   –  Alternate between the E (expectation) and M (maximization) steps:
      •  E step: use the current parameter values to evaluate the posterior probabilities, or "responsibilities", for each data point. (Analogous to determining which cluster a point belongs to, in K-means.)
      •  M step: use these probabilities to re-estimate the means, covariances, and mixing coefficients. (Analogous to moving the cluster centers to the means of their members, in K-means.)
   –  Repeat until the log-likelihood or the parameters θ do not change significantly.

Page 75

More detailed version of the EM algorithm

1.  Let X be the set of training data. Initialize the means µk, covariances Σk, and mixing coefficients πk, and evaluate the initial value of the log likelihood:

$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \Sigma) = \sum_{n=1}^{N} \ln\!\left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k) \right)$

2.  E step. Evaluate the "responsibilities" using the current parameter values:

$r_{n,k} = \frac{\pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \Sigma_j)}$

where $r_{n,k}$ denotes the "responsibility" of the kth cluster for the nth data point.

Page 76

3.  M step. Re-estimate the parameters θ using the current responsibilities:

$\boldsymbol{\mu}_k^{new} = \frac{1}{\sum_{n=1}^{N} r_{n,k}}\, \sum_{n=1}^{N} r_{n,k}\, \mathbf{x}_n$

$\Sigma_k^{new} = \frac{1}{\sum_{n=1}^{N} r_{n,k}}\, \sum_{n=1}^{N} r_{n,k}\, (\mathbf{x}_n - \boldsymbol{\mu}_k^{new})(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})^{T}$

$\pi_k^{new} = \frac{\sum_{n=1}^{N} r_{n,k}}{N}$

Page 77

4.  Evaluate the log likelihood with the new parameters,

$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \Sigma) = \sum_{n=1}^{N} \ln\!\left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k) \right)$

and check for convergence of either the parameters or the log likelihood. If not converged, return to step 2.
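A compact sketch of steps 1-4 above for a multivariate GMM (NumPy plus SciPy's multivariate_normal; the initialization choices, the convergence tolerance, and the small covariance ridge are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=200, tol=1e-6, seed=0):
    """EM for a Gaussian mixture model, following steps 1-4 above."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Step 1: initialize means (random data points), covariances (identity),
    # and mixing coefficients (uniform).
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigmas = np.array([np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # Step 2 (E step): responsibilities r[n, k] of cluster k for point n.
        dens = np.column_stack(
            [pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
             for k in range(K)])
        r = dens / dens.sum(axis=1, keepdims=True)
        # Step 3 (M step): re-estimate parameters from the responsibilities.
        Nk = r.sum(axis=0)
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            # Small ridge (1e-6 * I) added for numerical stability; not in the slides.
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N
        # Step 4: log likelihood (under the parameters used in this E step);
        # stop when it no longer changes significantly.
        ll = np.sum(np.log(dens.sum(axis=1)))
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pis, mus, Sigmas
```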

Page 78

•  EM is much more computationally expensive than K-means.

•  Common practice: Use K-means to set the initial parameters, then improve with EM.
   –  Initial means: means of the clusters found by K-means.
   –  Initial covariances: sample covariances of the clusters found by the K-means algorithm.
   –  Initial mixture coefficients: fractions of data points assigned to the respective clusters.

Page 79

•  One can prove that EM finds local maxima of the log-likelihood function.

•  EM is a very general technique for finding maximum-likelihood solutions for probabilistic models.

Page 80

Using GMM for Classification

Assume each cluster corresponds to one of the classes. A new test example x is classified according to

$\text{class} = \arg\max_{\text{class}_i} P(y = \text{class}_i)\, P(\mathbf{x} \mid \theta_i)$

where $P(\mathbf{x} \mid \theta_i) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma_i)$ is the density of the mixture component associated with class i.

Page 81

Case Study: Text classification from labeled and unlabeled documents using EM
K. Nigam et al., Machine Learning, 2000

•  Big problem with text classification: we need labeled data.

•  What we have: lots of unlabeled data.

•  Question of this paper: Can unlabeled data be used to increase classification accuracy?

•  I.e.: Is there any information implicit in unlabeled data? Is there any way to take advantage of this implicit information?

Page 82

General idea: a version of the EM algorithm

•  Train a classifier with the small set of available labeled documents.

•  Use this classifier to assign probabilistically-weighted class labels to unlabeled documents by calculating the expectation of the missing class labels.

•  Then train a new classifier using all the documents, both originally labeled and formerly unlabeled.

•  Iterate.

Page 83

Probabilistic framework

•  Assumes data are generated with Gaussian mixture model

•  Assumes one-to-one correspondence between mixture components and classes.

•  “These assumptions rarely hold in real-world text data”

Page 84

Probabilistic framework

Let C = {c1, ..., cK} be the classes / mixture components.

Let θ = {µ1, ..., µK} ∪ {Σ1, ..., ΣK} ∪ {π1, ..., πK} be the mixture parameters.

Assumptions: A document di is created by first selecting a mixture component according to the mixture weights πj, then having this selected mixture component generate a document according to its own parameters, with distribution p(di | cj; θ).

•  Likelihood of document di:

$p(d_i \mid \theta) = \sum_{j=1}^{K} \pi_j\, p(d_i \mid c_j;\, \theta)$

Page 85

•  Now, we will apply EM to a Naive Bayes classifier. Recall the Naive Bayes classifier: assume each feature is conditionally independent, given cj.

Let $\mathbf{x} = (f_1, f_2, ..., f_N)$. We have:

$p(f_1, f_2, ..., f_N \mid c_j) = p(f_1 \mid c_j)\, p(f_2 \mid c_j) \cdots p(f_N \mid c_j)$

$p(c_j \mid \mathbf{x}) \propto p(c_j) \prod_i p(f_i \mid c_j), \quad i = 1, ..., N;\; j = 1, ..., K$
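A tiny sketch of the Naive Bayes combination above, working in log space to avoid underflow (the toy probability tables and the feature encoding are illustrative assumptions):

```python
import numpy as np

def naive_bayes_log_posterior(x, log_prior, log_likelihoods):
    """
    Unnormalized log p(c_j | x) = log p(c_j) + sum_i log p(f_i | c_j).
    x: indices of the features observed in the document (repeats allowed).
    log_prior: shape (K,). log_likelihoods: shape (K, n_features).
    """
    return log_prior + log_likelihoods[:, x].sum(axis=1)

# Toy example: 2 classes, 4 features.
log_prior = np.log(np.array([0.6, 0.4]))
log_likelihoods = np.log(np.array([[0.5, 0.2, 0.2, 0.1],
                                   [0.1, 0.3, 0.3, 0.3]]))
doc = np.array([0, 0, 2])          # feature 0 observed twice, feature 2 once
scores = naive_bayes_log_posterior(doc, log_prior, log_likelihoods)
print(scores.argmax())             # predicted class index
```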

Page 86

To "train" naive Bayes from labeled data, estimate

$p(c_j)$ and $p(f_i \mid c_j), \quad j = 1, ..., K;\; i = 1, ..., N$

These values are estimates of the parameters in θ. Call these values $\hat{\theta}$.

Page 87

Note that Naive Bayes can be thought of as a generative mixture model.

Document di is represented as a vector of word frequencies (w1, ..., w|V|), where V is the vocabulary (all known words).

The probability distribution over words associated with each class is parameterized by θ.

We need to estimate $\hat{\theta}$ in order to determine which probability distribution document di = (w1, ..., w|V|) is most likely to have come from.

Page 88

Applying EM to Naive Bayes

•  We have a small number of labeled documents, Slabeled, and a large number of unlabeled documents, Sunlabeled.

•  The initial parameters $\hat{\theta}$ are estimated from the labeled documents Slabeled.

•  Expectation step: The resulting classifier is used to assign probabilistically-weighted class labels $p(c_j \mid \mathbf{x})$ to each unlabeled document x ∈ Sunlabeled.

•  Maximization step: Re-estimate $\hat{\theta}$ using the values $p(c_j \mid \mathbf{x})$ for x ∈ Slabeled ∪ Sunlabeled.

•  Repeat until $\hat{\theta}$ or $p(c_j \mid \mathbf{x})$ has converged.
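A schematic sketch of this loop using scikit-learn's MultinomialNB as the Naive Bayes classifier (assumed available; X_lab and X_unlab are assumed to be dense word-count matrices). Duplicating each unlabeled document once per class with weight p(c_j | x) is one straightforward way to realize the probabilistically-weighted M step:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def semi_supervised_em(X_lab, y_lab, X_unlab, n_iters=10):
    """EM with Naive Bayes: E step = probabilistic labels for unlabeled docs,
    M step = retrain on labeled docs plus weighted copies of unlabeled docs."""
    clf = MultinomialNB()
    clf.fit(X_lab, y_lab)                      # initial parameters from labeled data
    classes = clf.classes_
    for _ in range(n_iters):
        # E step: class membership probabilities for each unlabeled document.
        probs = clf.predict_proba(X_unlab)     # shape (n_unlab, K), columns follow classes
        # M step: labeled docs get weight 1; each unlabeled doc is added once per
        # class with weight p(c_j | x).
        X_aug = np.vstack([X_lab] + [X_unlab for _ in classes])
        y_aug = np.concatenate([y_lab] + [np.full(len(X_unlab), c) for c in classes])
        w_aug = np.concatenate([np.ones(len(X_lab))]
                               + [probs[:, j] for j in range(len(classes))])
        clf = MultinomialNB()
        clf.fit(X_aug, y_aug, sample_weight=w_aug)
    return clf
```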

Page 89

Augmenting EM

What if basic assumptions (each document generated by one component; one-to-one mapping between components and classes) do not hold?

They tried two things to deal with this:

(1) Weighting unlabeled data less than labeled data

(2) Allow multiple mixture components per class: A document may be comprised of several different sub-topics, each best captured with a different word distribution.

Page 90

Data

•  20 UseNet newsgroups

•  Web pages (WebKB)

•  Newswire articles (Reuters)
