Transcript
Page 1

K-medoid-style Clustering Algorithms for Supervised Summary Generation

Nidal Zeidat & Christoph F. Eick

Dept. of Computer Science

University of Houston

Page 2

Talk Outline

1. What is Supervised Clustering?

2. Representative-based Clustering Algorithms

3. Benefits of Supervised Clustering

4. Algorithms for Supervised Clustering

5. Empirical Results

6. Conclusion and Areas of Future Work

Page 3

1. (Traditional) Clustering

Partition a set of objects into groups of similar objects. Each group is called a cluster.

Clustering is used to “detect classes” in a data set (“unsupervised learning”).

Clustering is based on a fitness function that relies on a distance measure and usually tries to minimize the distance between objects within a cluster.

Page 4

(Traditional) Clustering … (continued)

[Figure: a data set plotted over Attribute1 and Attribute2, partitioned into three clusters A, B, and C.]

Page 5

Supervised Clustering

Assumes that clustering is applied to classified examples.

The goal of supervised clustering is to identify class-uniform clusters that have a high probability density; i.e., it prefers clusters whose members belong to a single class (low impurity).

We also want to keep the number of clusters low.

Page 6

Supervised Clustering … (continued)

[Figure: two plots over Attribute 1 and Attribute 2, contrasting the partitions found by traditional clustering (left) and supervised clustering (right).]

Page 7

A Fitness Function for Supervised Clustering

q(X) := Impurity(X) + β·Penalty(k)

where

    Impurity(X) = (# of minority examples) / n

    Penalty(k) = √((k − c) / n)   if k ≥ c
    Penalty(k) = 0                if k < c

k: number of clusters used

n: number of examples in the dataset

c: number of classes in the dataset

β: weight for Penalty(k), 0 < β ≤ 2.0
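
For concreteness, this fitness function can be sketched in a few lines of Python (a minimal sketch; the function and variable names are ours, clusters are lists of object indices and are assumed non-empty):

```python
import math

def impurity(clusters, labels):
    """Impurity(X): fraction of minority examples, i.e. objects whose
    class differs from the majority class of their cluster."""
    n = sum(len(cluster) for cluster in clusters)
    minority = 0
    for cluster in clusters:
        counts = {}
        for obj in cluster:
            counts[labels[obj]] = counts.get(labels[obj], 0) + 1
        minority += len(cluster) - max(counts.values())  # non-majority objects
    return minority / n

def penalty(k, c, n):
    """Penalty(k): sqrt((k - c) / n) for k >= c, otherwise 0."""
    return math.sqrt((k - c) / n) if k >= c else 0.0

def q(clusters, labels, beta):
    """q(X) := Impurity(X) + beta * Penalty(k)."""
    n = sum(len(cluster) for cluster in clusters)
    c = len(set(labels))
    return impurity(clusters, labels) + beta * penalty(len(clusters), c, n)
```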

Page 8

2. Representative-Based Supervised Clustering (RSC)

Aims at finding a subset of the objects in the data set (called representatives) that best represents the whole data set. Each representative corresponds to one cluster.

The remaining objects are then clustered around these representatives: each object is assigned to the cluster of its closest representative.

Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.
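
The assignment step is easy to state precisely; a minimal sketch, assuming objects are feature vectors and `dist` is any dissimilarity function supplied by the caller (names are ours):

```python
def cluster_around(representatives, objects, dist):
    """Cluster the data set around the given representatives: each object
    joins the cluster of its closest representative.  `representatives`
    holds indices into `objects`; returns one list of indices per cluster."""
    clusters = [[] for _ in representatives]
    for i, obj in enumerate(objects):
        nearest = min(range(len(representatives)),
                      key=lambda j: dist(obj, objects[representatives[j]]))
        clusters[nearest].append(i)
    return clusters
```

Because only pairwise dissimilarities are needed, the same code works for numerical data (e.g. Euclidean distance) and for nominal attributes.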

Page 9

Representative-Based Supervised Clustering … (continued)

[Figure: a sample data set plotted over Attribute1 and Attribute2.]

Page 10

Representative-Based Supervised Clustering … (continued)

[Figure: the same data set with four objects, labeled 1 through 4, chosen as representatives.]

Page 11

Representative-Based Supervised Clustering … (continued)

[Figure: the clustering induced by representatives 1 through 4; every object is assigned to its closest representative.]

Objective of RSC: Find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).

Page 12

Why do we use Representative-Based Clustering Algorithms?

– Representatives themselves are useful:
  – can be used for summarization
  – can be used for dataset compression
– Smaller search space compared with algorithms such as k-means.
– Less sensitive to outliers.
– Can be applied to datasets that contain nominal attributes (where computing means is not feasible).

Page 13

3. Applications of Supervised Clustering

– Enhance classification algorithms:
  – use SC for dataset editing to enhance NN-classifiers [ICDM04]
  – improve simple classifiers [ICDM03]
– Learn sub-classes / summary generation.
– Distance function learning.
– Dataset compression/reduction.
– Measuring the difficulty of a classification task.

Page 14

Representative-Based Supervised Clustering: Dataset Editing

[Figure: two panels over Attribute1 and Attribute2, with clusters labeled A through F.]

a. Dataset clustered using supervised clustering.
b. Dataset edited using cluster representatives.

Page 15

Representative-Based Supervised Clustering: Enhancing Simple Classifiers

[Figure: a dataset plotted over Attribute1 and Attribute2.]

Page 16

Representative-Based Supervised Clustering: Learning Sub-classes

[Figure: a dataset over Attribute1 and Attribute2 with two classes (legend: Ford, GMC); supervised clustering uncovers sub-classes such as Ford Trucks, Ford Vans, GMC Trucks, and GMC Vans.]

Page 17

4. Clustering Algorithms Currently Investigated

1. Partitioning Around Medoids (PAM), a traditional clustering algorithm.

2. Supervised Partitioning Around Medoids (SPAM).

3. Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR).

4. Top Down Splitting Algorithm (TDS).

5. Supervised Clustering using Evolutionary Computing (SCEC).

Page 18

Algorithm SRIDHCR

REPEAT r TIMES
    curr := a randomly created set of representatives (with size between c+1 and 2*c)
    WHILE NOT DONE DO
        1. Create new solutions S by adding a single non-representative to curr and by removing a single representative from curr.
        2. Determine the element s in S for which q(s) is minimal (if there is more than one minimal element, randomly pick one).
        3. IF q(s) < q(curr) THEN curr := s
           ELSE IF q(s) = q(curr) AND |s| > |curr| THEN curr := s
           ELSE terminate and return curr as the solution for this run.
Report the best of the r solutions found.
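
Read literally, one run of the procedure looks as follows in Python (a sketch under our reading, reusing the hypothetical q() and cluster_around() helpers sketched earlier; ties are broken deterministically toward larger representative sets rather than randomly, for brevity):

```python
import random

def sridhcr(objects, labels, dist, beta, r=10):
    """Sketch of SRIDHCR: steepest-descent hill climbing over representative
    sets, restarted r times from random initial solutions."""
    n, c = len(objects), len(set(labels))
    score = lambda reps: q(cluster_around(reps, objects, dist), labels, beta)
    best, best_q = None, float("inf")
    for _ in range(r):
        # Random initial solution with between c+1 and 2*c representatives.
        curr = random.sample(range(n), random.randint(c + 1, 2 * c))
        curr_q = score(curr)
        while True:
            # Neighbors: insert one non-representative or delete one representative.
            neighbors = [curr + [o] for o in range(n) if o not in curr]
            if len(curr) > 1:  # keep at least one representative
                neighbors += [[x for x in curr if x != m] for m in curr]
            s = min(neighbors, key=lambda nb: (score(nb), -len(nb)))
            s_q = score(s)
            if s_q < curr_q or (s_q == curr_q and len(s) > len(curr)):
                curr, curr_q = s, s_q   # accept the (weakly) better neighbor
            else:
                break                   # no improvement: this run terminates
        if curr_q < best_q:
            best, best_q = curr, curr_q
    return best, best_q
```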

Page 19

Set of medoids after adding one non-medoid | q(X)   | Set of medoids after removing a medoid | q(X)
8 42 62 148 (initial solution)             | 0.086  | 42 62 148                              | 0.086
8 42 62 148 1                              | 0.091  | 8 62 148                               | 0.073
8 42 62 148 2                              | 0.091  | 8 42 148                               | 0.313
…                                          | …      | 8 42 62                                | 0.333
8 42 62 148 52                             | 0.065  |                                        |
…                                          | …      |                                        |
8 42 62 148 150                            | 0.0715 |                                        |

Trials in first part (add a non-medoid); trials in second part (drop a medoid).

Run | Set of medoids producing lowest q(X) in the run | q(X)  | Purity
0   | 8 42 62 148 (initial solution)                  | 0.086 | 0.947
1   | 8 42 62 148 52                                  | 0.065 | 0.947
2   | 8 42 62 148 52 122                              | 0.041 | 0.973
3   | 42 62 148 52 122 117                            | 0.030 | 0.987
4   | 8 62 148 52 122 117                             | 0.021 | 0.993
5   | 8 62 148 52 122 117 87                          | 0.016 | 1.000
6   | 8 62 52 122 117 87                              | 0.014 | 1.000
7   | 8 62 122 117 87                                 | 0.012 | 1.000

Page 20

Algorithm SPAM

Build initial solution curr (given the number of clusters k):
    1. Determine the medoid of the most frequent class in the dataset. Insert that object m into curr.
    2. For k−1 times, add to curr the object v from the dataset (not already in curr) that gives the lowest value of q(X) for curr ∪ {v}.

Improve initial solution curr:
DO FOREVER
    FOR ALL representative objects r in curr DO
        FOR ALL non-representative objects o in the dataset DO
            1. Create a new solution v by clustering the dataset around the representative set (curr − {r}) ∪ {o} and insert v into S.
            2. Calculate q(v) for this clustering.
    Determine the element s in S for which q(s) is minimal (if there is more than one minimal element, randomly pick one).
    IF q(s) < q(curr) THEN curr := s
    ELSE TERMINATE returning curr as the final solution.
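
SPAM's two phases admit a similarly compact sketch (again reusing the hypothetical q() and cluster_around() helpers; for brevity, the first representative is chosen greedily here rather than as the medoid of the most frequent class, a simplification of the build step above):

```python
def spam(objects, labels, dist, beta, k):
    """Sketch of SPAM: greedily build k representatives, then repeatedly
    apply the best single representative/non-representative swap."""
    n = len(objects)
    score = lambda reps: q(cluster_around(reps, objects, dist), labels, beta)
    # Build phase: grow curr one object at a time, always adding the
    # object that yields the lowest q(X).
    curr = []
    for _ in range(k):
        curr.append(min((o for o in range(n) if o not in curr),
                        key=lambda o: score(curr + [o])))
    curr_q = score(curr)
    # Improvement phase: steepest descent over all single swaps.
    # Note that k stays fixed, unlike in SRIDHCR.
    while True:
        swaps = [[o if x == r else x for x in curr]
                 for r in curr for o in range(n) if o not in curr]
        s = min(swaps, key=score)
        if score(s) < curr_q:
            curr, curr_q = s, score(s)
        else:
            return curr, curr_q
```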

Page 21

Differences between SPAM and SRIDHCR

1. SPAM tries to improve the current solution by replacing a representative with a non-representative, whereas SRIDHCR improves the current solution by removing a representative or by inserting a non-representative.

2. SPAM is run keeping the number of clusters k fixed, whereas SRIDHCR searches for a “good” value of k and therefore explores a larger solution space. However, in the case of SRIDHCR, which choices for k are good is somewhat restricted by the selection of the parameter β.

3. SRIDHCR is run r times starting from random initial solutions, whereas SPAM is run only once.

Page 22

5. Performance Measures for the Experimental Evaluation

The investigated algorithms were evaluated based on the following performance measures:

– Cluster purity (majority %).
– Value of the fitness function q(X).
– Average dissimilarity between all objects and their representatives (cluster tightness).
– Wall-clock time (WCT): the actual time, in seconds, that the algorithm took to finish the clustering task.
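
For concreteness, the tightness measure can be computed as follows (a minimal sketch with our names; `clusters` holds lists of object indices, paired with one representative index per cluster):

```python
def tightness(clusters, representatives, objects, dist):
    """Average dissimilarity between each object and the representative
    of the cluster it belongs to."""
    total, count = 0.0, 0
    for cluster, rep in zip(clusters, representatives):
        for i in cluster:
            total += dist(objects[i], objects[rep])
            count += 1
    return total / count
```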

Page 23

Algorithm | Purity | q(X)   | Tightness(X)

Iris-Plants data set, # clusters = 3
PAM       | 0.907  | 0.0933 | 0.081
SRIDHCR   | 0.981  | 0.0200 | 0.093
SPAM      | 0.973  | 0.0267 | 0.133

Vehicle data set, # clusters = 65
PAM       | 0.701  | 0.326  | 0.044
SRIDHCR   | 0.835  | 0.192  | 0.072
SPAM      | 0.764  | 0.263  | 0.097

Image-Segment data set, # clusters = 53
PAM       | 0.880  | 0.135  | 0.027
SRIDHCR   | 0.980  | 0.035  | 0.050
SPAM      | 0.944  | 0.071  | 0.061

Pima-Indian Diabetes data set, # clusters = 45
PAM       | 0.763  | 0.237  | 0.056
SRIDHCR   | 0.859  | 0.164  | 0.093
SPAM      | 0.822  | 0.202  | 0.086

(Slide annotations highlight purity improvements over PAM of 19% and 7%.)

Table 4: Traditional vs. Supervised Clustering (β = 0.1)

Page 24

Algorithm | q(X)   | Purity | Tightness(X) | WCT (sec)

IRIS-Flowers dataset, # clusters = 3
PAM       | 0.0933 | 0.907  | 0.081        | 0.06
SRIDHCR   | 0.0200 | 0.980  | 0.093        | 11.00
SPAM      | 0.0267 | 0.973  | 0.133        | 0.32

Vehicle dataset, # clusters = 65
PAM       | 0.326  | 0.701  | 0.044        | 372.00
SRIDHCR   | 0.192  | 0.835  | 0.072        | 1715.00
SPAM      | 0.263  | 0.764  | 0.097        | 1090.00

Segmentation dataset, # clusters = 53
PAM       | 0.135  | 0.880  | 0.027        | 4073.00
SRIDHCR   | 0.035  | 0.980  | 0.050        | 11250.00
SPAM      | 0.071  | 0.944  | 0.061        | 1422.00

Pima-Indians-Diabetes dataset, # clusters = 45
PAM       | 0.237  | 0.763  | 0.056        | 186.00
SRIDHCR   | 0.164  | 0.859  | 0.093        | 660.00
SPAM      | 0.202  | 0.822  | 0.086        | 58.00

Table 5: Comparative Performance of the Different Algorithms (β = 0.1)

Page 25

Algorithm | Avg. Purity | Tightness(X) | Avg. WCT (sec)

IRIS-Flowers dataset, # clusters = 3
PAM       | 0.907       | 0.081        | 0.06
SRIDHCR   | 0.959       | 0.104        | 0.18
SPAM      | 0.973       | 0.133        | 0.33

Vehicle dataset, # clusters = 56
PAM       | 0.681       | 0.046        | 505.00
SRIDHCR   | 0.762       | 0.081        | 22.58
SPAM      | 0.754       | 0.100        | 681.00

Segmentation dataset, # clusters = 32
PAM       | 0.875       | 0.032        | 1529.00
SRIDHCR   | 0.946       | 0.054        | 169.39
SPAM      | 0.940       | 0.065        | 1053.00

Pima-Indians-Diabetes dataset, # clusters = 2
PAM       | 0.656       | 0.104        | 0.97
SRIDHCR   | 0.795       | 0.109        | 5.08
SPAM      | 0.772       | 0.125        | 2.70

Table 6: Average Comparative Performance of the Different Algorithms (β = 0.4)

Page 26

Why is SRIDHCR performing so much better than SPAM?

SPAM is relatively slow compared with a single run of SRIDHCR; the same resources allow for 5-30 restarts of SRIDHCR. This enables SRIDHCR to conduct a more balanced exploration of the search space.

The fitness landscape induced by q(X) contains many plateau-like structures (q(X1) = q(X2)) and many local minima, and SPAM seems to get stuck in them more easily.

The fact that SPAM uses a fixed value of k does not seem beneficial for finding good solutions. For example, SRIDHCR might explore {u1,u2,u3,u4} … {u1,u2,u3,u4,v1,v2} … {u3,u4,v1,v2}, whereas SPAM might terminate with the sub-optimal solution {u1,u2,u3,u4} if neither the replacement of u1 by v1 nor the replacement of u2 by v2 improves q(X).

Page 27

Dataset      | k  | β       | Ties % using q(X) | Ties % using Tightness(X)
Iris-Plants  | 10 | 0.00001 | 5.8               | 0.0004
Iris-Plants  | 10 | 0.4     | 5.7               | 0.0004
Iris-Plants  | 50 | 0.00001 | 20.5              | 0.0019
Iris-Plants  | 50 | 0.4     | 20.9              | 0.0018
Vehicle      | 10 | 0.00001 | 1.04              | 0.000001
Vehicle      | 10 | 0.4     | 1.06              | 0.000001
Vehicle      | 50 | 0.00001 | 1.78              | 0.000001
Vehicle      | 50 | 0.4     | 1.84              | 0.000001
Segmentation | 10 | 0.00001 | 0.220             | 0.000000
Segmentation | 10 | 0.4     | 0.225             | 0.000001
Segmentation | 50 | 0.00001 | 0.626             | 0.000001
Segmentation | 50 | 0.4     | 0.638             | 0.000000
Diabetes     | 10 | 0.00001 | 2.06              | 0.0
Diabetes     | 10 | 0.4     | 2.05              | 0.0
Diabetes     | 50 | 0.00001 | 3.43              | 0.0002
Diabetes     | 50 | 0.4     | 3.45              | 0.0002

Table 7: Ties distribution

Page 28

[Figure: Purity (left axis, 0 to 1) and number of clusters k (right axis, 0 to 70) plotted against β, for the Vehicle and Diabetes data sets.]

Figure 2: How Purity and k Change as β Increases

Page 29

6. Conclusions

1. As expected, supervised clustering algorithms produced significantly better cluster purity than traditional clustering; improvements ranged from 7% to 19% across the data sets tested.

2. Algorithms that explore the search space too greedily, such as SPAM, do not seem very suitable for supervised clustering; in general, algorithms that explore the search space more randomly fare better.

3. Supervised clustering can be used to enhance classifiers, to summarize datasets, and to learn better distance functions.

Page 30

Future Work

1. Continue work on supervised clustering algorithms:
– find better solutions
– make the algorithms faster
– explain some of the observations

2. Using supervised clustering for summary generation / learning sub-classes.

3. Using supervised clustering to find “compressed” nearest neighbor classifiers.

4. Using supervised clustering to enhance simple classifiers.

5. Distance function learning.

Page 31

K-Means Algorithm

[Figure: a data set over Attribute1 and Attribute2 partitioned into four clusters (1-4) by k-means.]

Page 32

K-Means Algorithm

[Figure: a later step of the k-means iteration on the same data set, clusters 1-4 updated.]
