Ch. Eick: Supervised Clustering --- Algorithms and Applications
Supervised Clustering ---Algorithms and Applications
Christoph F. Eick
Department of Computer Science
University of Houston
Organization of the Talk
1. Supervised Clustering
2. Representative-based Supervised Clustering Algorithms
3. Applications: Using Supervised Clustering for
a. Dataset Editing
b. Class Decomposition
c. Distance Function Learning
d. Region Discovery in Spatial Datasets
4. Other Activities I am Involved With
List of Persons that Contributed to the Work Presented in Today’s Talk
• Tae-Wan Ryu (former PhD student; now faculty member Cal State Fullerton)
• Ricardo Vilalta (colleague at UH since 2002; Co-Director of UH's Data Mining and Knowledge Discovery Group)
• Murali Achari (former Master student)
• Alain Rouhana (former Master student)
• Abraham Bagherjeiran (current PhD student)
• Chunshen Chen (current Master student)
• Nidal Zeidat (current PhD student)
• Sujing Wang (current PhD student)
• Kim Wee (current MS student)
• Zhenghong Zhao (former Master student)
Traditional Clustering
• Partition a set of objects into groups of similar objects. Each group is called a cluster.
• Clustering is used to “detect classes” in a data set (“unsupervised learning”).
• Clustering is based on a fitness function that relies on a distance measure and usually tries to create “tight” clusters.
Different Forms of Clustering

Objective of Supervised Clustering: Minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).
Motivation: Finding Subclasses using SC

[Figure: Ford and GMC examples in attribute space; supervised clustering finds subclasses such as Ford Trucks, Ford SUVs, Ford Vans, GMC Trucks, GMC Vans, and GMC SUVs.]
Related Work: Supervised Clustering
• Sinkkonen’s [SKN02] discriminative clustering and Tishby’s information bottleneck method [TPB99, ST99] can be viewed as probabilistic supervised clustering algorithms.
• There has been a lot of work in the area of semi-supervised clustering, which centers on clustering with background information. Although the focus of that work is traditional clustering, there is considerable similarity between the techniques and algorithms it investigates and those we investigate.
2. Representative-Based Supervised Clustering
• Aims at finding a set of objects in the data set (called representatives) that best represent the objects in the data set. Each representative corresponds to a cluster.
• The remaining objects in the data set are then clustered around these representatives by assigning objects to the cluster of the closest representative.
Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.
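The assignment step described above can be sketched in a few lines of Python (a minimal illustration; the object encoding and distance function are our own choices, not from the talk):

```python
def assign_to_representatives(objects, reps, dist):
    """Partition objects around representatives: each object joins the
    cluster of its closest representative, as in k-medoids/PAM."""
    clusters = {r: [] for r in reps}
    for o in objects:
        closest = min(reps, key=lambda r: dist(o, r))
        clusters[closest].append(o)
    return clusters
```

For example, with 1D objects [1, 2, 9, 10] and representatives {2, 9}, the objects 1 and 2 cluster around 2, while 9 and 10 cluster around 9.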
Representative-Based Supervised Clustering … (continued)

[Figure: objects in attribute space clustered around four representatives, labeled 1-4; each object is assigned to its closest representative.]

Objective of RSC: Find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
SC Algorithms Currently Investigated
1. Supervised Partitioning Around Medoids (SPAM).
2. Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR).
3. Top Down Splitting Algorithm (TDS).
4. Supervised Clustering using Evolutionary Computing (SCEC)
5. Agglomerative Hierarchical Supervised Clustering (AHSC)
6. Grid-Based Supervised Clustering (GRIDSC)
Remark: For a more detailed discussion of SCEC and SRIDHCR see [EZZ04]
A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β*Penalty(k)

with

  Impurity(X) = (# of minority examples) / n
  Penalty(k)  = sqrt((k - c)/n)  if k ≥ c,  0 otherwise

where
  k: number of clusters used
  n: number of examples in the dataset
  c: number of classes in the dataset
  β: weight for Penalty(k), 0 < β ≤ 2.0
[Figure: Penalty(k) vs. k, a sub-linearly increasing curve.]
Penalty(k) increases sub-linearly because increasing the number of clusters from k to k+1 has a greater effect on the end result when k is small than when it is large; hence the formula above.
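The fitness function above can be sketched in a few lines of Python (a minimal illustration; representing clusters as lists of class labels is our own encoding, not from the talk):

```python
import math
from collections import Counter

def impurity(clusters):
    """Fraction of examples that are not in the majority class of their
    cluster. clusters: list of lists of class labels, one list per cluster."""
    n = sum(len(c) for c in clusters)
    minority = sum(len(c) - Counter(c).most_common(1)[0][1] for c in clusters)
    return minority / n

def penalty(k, c, n):
    """Sub-linear penalty for using k clusters with c classes, n examples."""
    return math.sqrt((k - c) / n) if k > c else 0.0

def q(clusters, num_classes, beta=0.25):
    """Fitness q(X) = Impurity(X) + beta * Penalty(k)."""
    n = sum(len(c) for c in clusters)
    return impurity(clusters) + beta * penalty(len(clusters), num_classes, n)
```

With two pure-majority clusters over two classes (k = c), the penalty term vanishes and q(X) reduces to the impurity alone.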
Algorithm SRIDHCR (Greedy Hill Climbing)

REPEAT r TIMES
  curr := a randomly created set of representatives (with size between c+1 and 2*c)
  WHILE NOT DONE DO
    1. Create new solutions S by adding a single non-representative to curr and by removing a single representative from curr
    2. Determine the element s in S for which q(s) is minimal (if there is more than one minimal element, randomly pick one)
    3. IF q(s) < q(curr) THEN curr := s
       ELSE IF q(s) = q(curr) AND |s| > |curr| THEN curr := s
       ELSE terminate and return curr as the solution for this run
Report the best out of the r solutions found.
Highlights:
• k is not an input parameter; SRIDHCR searches for the best k within the range that is induced by β.
• Reports the best clustering found in r runs
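The algorithm can be sketched as follows (a simplified illustration; the data encoding, the minimum of two representatives, and the fitness/distance interfaces are our own choices):

```python
import random

def cluster_around(reps, X, y, dist):
    """Assign each example to its closest representative (by index);
    return the clusters as lists of class labels."""
    clusters = {r: [] for r in reps}
    for i in range(len(X)):
        r = min(reps, key=lambda r: dist(X[i], X[r]))
        clusters[r].append(y[i])
    return list(clusters.values())

def sridhcr(X, y, q, dist, c, r=10):
    """Greedy hill climbing over sets of representative indices (sketch):
    restart r times; in each run add or remove one representative while
    q improves, preferring larger sets on ties."""
    best, best_q = None, float('inf')
    n = len(X)
    for _ in range(r):
        curr = random.sample(range(n), random.randint(c + 1, 2 * c))
        curr_q = q(cluster_around(curr, X, y, dist))
        while True:
            cands = [curr + [i] for i in range(n) if i not in curr]
            if len(curr) > 2:  # keep at least two representatives
                cands += [[j for j in curr if j != i] for i in curr]
            s = min(cands, key=lambda s: q(cluster_around(s, X, y, dist)))
            s_q = q(cluster_around(s, X, y, dist))
            if s_q < curr_q or (s_q == curr_q and len(s) > len(curr)):
                curr, curr_q = s, s_q
            else:
                break
        if curr_q < best_q:
            best, best_q = curr, curr_q
    return best, best_q
```

On a toy 1D dataset with two well-separated classes, the search finds a representative set whose induced clustering is pure.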
Supervised Clustering using Evolutionary Computing: SCEC

[Figure: an initial generation evolves into the next generation via copy, crossover, and mutation; after the final generation the result is the best solution found.]
The complete flow chart of SCEC

[Flow chart: initialize solutions; loop N times: evaluate the population (loop PS times: cluster on S[i], evaluate on S[i]), record the best solution and its quality Q as an intermediate result; create the next generation (loop PS times: K-tournament selection followed by copy, crossover, or mutation, yielding a new S'[i]); compose population S. On exit, report the best solution, Q, and a summary.]
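The evolutionary loop above can be sketched generically (the operator probabilities, tournament size, and the toy fitness in the usage note are our own illustrative choices):

```python
import random

def scec(init_pop, evaluate, mutate, crossover, ps, n_gens, k=2):
    """Evolutionary search (SCEC flow, sketched): evaluate a population of
    candidate solutions, track the best one, and build each next generation
    from copies, crossovers, and mutations of k-tournament winners."""
    pop = init_pop(ps)
    best, best_q = None, float('inf')
    for _ in range(n_gens):
        scored = [(evaluate(s), s) for s in pop]
        for q_val, s in scored:
            if q_val < best_q:
                best, best_q = s, q_val
        def tournament():
            return min(random.sample(scored, k), key=lambda t: t[0])[1]
        nxt = []
        while len(nxt) < ps:
            op = random.random()
            if op < 1 / 3:
                nxt.append(tournament())                           # copy
            elif op < 2 / 3:
                nxt.append(crossover(tournament(), tournament()))  # crossover
            else:
                nxt.append(mutate(tournament()))                   # mutation
        pop = nxt
    return best, best_q
```

Here the "solutions" are abstract; in SCEC they would be sets of representatives evaluated by q(X).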
Complex1 Dataset

[Figure: the Complex1-Reduced dataset, nine classes (Class-0 through Class-8) plotted in 2D.]
Supervised Clustering Result

[Figure: clustering of Complex-1-Reduced obtained with SRIDHCR, β=0.25, r=50.]
Supervised Clustering ---Algorithms and Applications
Organization of the Talk
1. Supervised Clustering
2. Representative-based Supervised Clustering Algorithms
3. Applications: Using Supervised Clustering for
a. Dataset Editing
b. Class Decomposition
c. Distance Function Learning
d. Region Discovery in Spatial Datasets
4. Other Activities I am Involved With
Nearest Neighbour Rule
Consider a two-class problem where each sample consists of two measurements (x, y).

k = 1: For a given query point q, assign the class of the nearest neighbour.
k = 3: Compute the k nearest neighbours and assign the class by majority vote.

Problem: requires a “good” distance function.
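A minimal implementation of the rule above (plain Euclidean distance on (x, y) pairs; the helper name and data encoding are our own):

```python
from collections import Counter

def knn_classify(q, data, k=3):
    """Classify query point q by majority vote among its k nearest
    neighbours. data: list of ((x, y), label) pairs; squared Euclidean
    distance suffices for ranking neighbours."""
    sq_dist = lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    nearest = sorted(data, key=lambda pair: sq_dist(pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

With k=1 this degenerates to the nearest-neighbour rule; any other distance function can be substituted, which is exactly where a "good" learned distance function matters.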
3a. Dataset Reduction: Editing
• Training data may contain noise and overlapping classes
• Editing seeks to remove noisy points and produce smooth decision boundaries – often by retaining points far from the decision boundaries
• Main Goal of Editing: enhance the accuracy of a classifier (% of “unseen” examples classified correctly)
• Secondary Goal of Editing: enhance the speed of a k-NN classifier
Wilson Editing

• Wilson 1972
• Remove points that do not agree with the majority of their k nearest neighbours

[Figure: original data vs. Wilson editing with k=7, shown for the earlier example and for overlapping classes.]
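Wilson's rule can be sketched as follows (the 1D distance and toy data in the test are our own setup):

```python
from collections import Counter

def wilson_edit(X, y, dist, k=3):
    """Return indices of points whose label agrees with the majority vote
    of their k nearest neighbours (Wilson's 1972 editing rule, sketched);
    all other points are dropped as noise."""
    keep = []
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: dist(X[i], X[j]))[:k]
        majority = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if majority == y[i]:
            keep.append(i)
    return keep
```

A mislabeled point sitting inside the opposite class's region disagrees with its neighbourhood vote and is removed, which smooths the decision boundary.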
RSC Dataset Editing

[Figure: a. the dataset clustered using supervised clustering (clusters A-F in attribute space); b. the dataset edited using cluster representatives.]
Experimental Evaluation
• We compared a traditional 1-NN, 1-NN using Wilson Editing, Supervised Clustering Editing (SCE), and C4.5 (that was run using its default parameter setting).
• A benchmark consisting of 8 UCI datasets was used for this purpose.
• Accuracies were computed using 10-fold cross-validation.
• SRIDHCR was used for supervised clustering.
• SCE was tested at different compression rates by associating different penalties with the number of clusters found (setting parameter β to 0.1, 0.4, and 1.0).
• Compression rates of SCE and Wilson Editing were computed using: 1-(k/n) with n being the size of the original dataset and k being the size of the edited dataset.
Table 2: Prediction Accuracy for the four classifiers.

Dataset (size)         β    SCE    Wilson  1-NN   C4.5
Glass (214)            0.1  0.636  0.607   0.692  0.677
                       0.4  0.589  0.607   0.692  0.677
                       1.0  0.575  0.607   0.692  0.677
Heart-Stat Log (270)   0.1  0.796  0.804   0.767  0.782
                       0.4  0.833  0.804   0.767  0.782
                       1.0  0.838  0.804   0.767  0.782
Diabetes (768)         0.1  0.736  0.734   0.690  0.745
                       0.4  0.736  0.734   0.690  0.745
                       1.0  0.745  0.734   0.690  0.745
Vehicle (846)          0.1  0.667  0.716   0.700  0.723
                       0.4  0.667  0.716   0.700  0.723
                       1.0  0.665  0.716   0.700  0.723
Heart-H (294)          0.1  0.755  0.809   0.783  0.802
                       0.4  0.793  0.809   0.783  0.802
                       1.0  0.809  0.809   0.783  0.802
Waveform (5000)        0.1  0.834  0.796   0.768  0.781
                       0.4  0.841  0.796   0.768  0.781
                       1.0  0.837  0.796   0.768  0.781
Iris-Plants (150)      0.1  0.947  0.936   0.947  0.947
                       0.4  0.973  0.936   0.947  0.947
                       1.0  0.953  0.936   0.947  0.947
Segmentation (2100)    0.1  0.938  0.966   0.956  0.968
                       0.4  0.919  0.966   0.956  0.968
                       1.0  0.890  0.966   0.956  0.968
Table 3: Dataset Compression Rates for SCE and Wilson Editing.

Dataset (size)         β    Avg. k [Min-Max]   SCE Compr. (%)   Wilson Compr. (%)
Glass (214)            0.1  34 [28-39]         84.3             27
                       0.4  25 [19-29]         88.4             27
                       1.0  6 [6-6]            97.2             27
Heart-Stat Log (270)   0.1  15 [12-18]         94.4             22.4
                       0.4  2 [2-2]            99.3             22.4
                       1.0  2 [2-2]            99.3             22.4
Diabetes (768)         0.1  27 [22-33]         96.5             30.0
                       0.4  9 [2-18]           98.8             30.0
                       1.0  2 [2-2]            99.7             30.0
Vehicle (846)          0.1  57 [51-65]         97.3             30.5
                       0.4  38 [26-61]         95.5             30.5
                       1.0  14 [9-22]          98.3             30.5
Heart-H (294)          0.1  14 [11-18]         95.2             21.9
                       0.4  2                  99.3             21.9
                       1.0  2                  99.3             21.9
Waveform (5000)        0.1  104 [79-117]       97.9             23.4
                       0.4  28 [20-39]         99.4             23.4
                       1.0  4 [3-6]            99.9             23.4
Iris-Plants (150)      0.1  4 [3-8]            97.3             6.0
                       0.4  3 [3-3]            98.0             6.0
                       1.0  3 [3-3]            98.0             6.0
Segmentation (2100)    0.1  57 [48-65]         97.3             2.8
                       0.4  30 [24-37]         98.6             2.8
                       1.0  14                 99.3             2.8
Summary: SCE and Wilson Editing

• Wilson editing enhances the accuracy of a traditional 1-NN classifier for six of the eight datasets tested. It achieved compression rates of approx. 25%, but much lower compression rates for “easy” datasets.
• SCE achieved very high compression rates without loss in accuracy for 6 of the 8 datasets tested.
• SCE accomplished a significant improvement in accuracy for 3 of the 8 datasets tested.
• Surprisingly, many UCI datasets can be compressed by just using a single representative per class without a significant loss in accuracy.
• SCE tends to pick representatives that are in the center of a region that is dominated by a single class; it removes examples that are classified correctly as well as examples that are classified incorrectly from the dataset. This explains its much higher compression rates.
Remark: For a more detailed evaluation of SCE, Wilson Editing, and other editing techniques see [EZV04] and [ZWE05].
Future Direction of this Research
[Diagram: a Data Set is transformed into Data Set’; the learning algorithm IDLA is applied to each, yielding Classifier C and Classifier C’ respectively.]

Goal: Find a transformation such that C’ is more accurate than C, or C and C’ have approximately the same accuracy but C’ can be learnt more quickly and/or classifies new examples more quickly.
Supervised Clustering vs. Clustering the Examples of Each Class Separately
Approaches to discover subclasses of a given class:
1. Cluster the examples of each class separately
2. Use supervised clustering
[Figure 4: Supervised clustering editing vs. clustering each class (x and o) separately; rows of o's and x's illustrate the two choices of representative.]

Remark: A traditional clustering algorithm, such as k-medoids, is “blind” to how the examples of other classes are distributed and may pick a representative that attracts points of class x, leading to misclassifications; supervised clustering instead picks a representative that avoids this, which makes it the better choice for editing.
Applications of Supervised Clustering
3.b Class Decomposition (see also [VAE03])

Simple classifiers:
• Encompass a small class of approximating functions.
• Limited flexibility in their decision boundaries.

[Figure: decomposing each class into subclasses in attribute space turns a problem with complex decision boundaries into one a simple classifier can handle.]
Naïve Bayes vs. Naïve Bayes with Class Decomposition

Dataset    Naïve Bayes (NB)   NB with Class Decomposition   Improvement
Diabetes   76.56              77.08                         0.52%
Heart-H    79.73              70.27                         9.46%
Segment    68.00              75.045                        7.05%
Vehicle    45.02              68.25                         23.23%
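The decomposition step can be sketched as follows (the (class, subclass) label encoding and the `cluster_fn` interface are our own; the approach above obtains the subclasses via supervised clustering):

```python
def decompose_classes(X, y, cluster_fn):
    """Replace each class label by a (class, subclass) pair, where the
    subclasses come from clustering that class's examples.

    cluster_fn(points) -> list of subclass ids, one per point.
    A simple classifier (e.g. Naive Bayes) is then trained on the finer
    labels, and a prediction (cls, sub) is mapped back to cls.
    """
    new_y = list(y)
    for cls in set(y):
        idx = [i for i, lab in enumerate(y) if lab == cls]
        subs = cluster_fn([X[i] for i in idx])
        for i, s in zip(idx, subs):
            new_y[i] = (cls, s)
    return new_y
```

Because each subclass is more homogeneous than its parent class, a low-variance classifier fits the finer labels more easily.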
Example: How to Find Similar Patients?
The following relation is given (with 10000 tuples):
  Patient(ssn, weight, height, cancer-sev, eye-color, age, …)

Attribute domains:
  – ssn: 9 digits
  – weight: between 30 and 650; μ_weight = 158, σ_weight = 24.20
  – height: between 0.30 and 2.20 in meters; μ_height = 1.52, σ_height = 19.2
  – cancer-sev: 4 = serious, 3 = quite serious, 2 = medium, 1 = minor
  – eye-color: {brown, blue, green, grey}
  – age: between 3 and 100; μ_age = 45, σ_age = 13.2
Task: Define Patient Similarity
3c. Using Clustering in Distance Function Learning
CAL-FULL/UH Database Clustering & Similarity Assessment Environment

[Architecture: a Data Extraction Tool pulls data from a DBMS into Object Views; a Clustering Tool (backed by a library of clustering algorithms) produces a set of clusters; a Similarity Measure Tool (backed by a library of similarity measures, default choices, domain information, and type/weight information) supplies the similarity measure; a Learning Tool uses Training Data to learn the measure (today's topic); a User Interface ties the components together.]
For more details: see [RE05]
Similarity Assessment Framework and Objectives
• Objective: Learn a good distance function for classification tasks.
• Our approach: Apply a clustering algorithm with the distance function to be evaluated; it returns a number of clusters k. The purer the obtained clusters are, the better the quality of the distance function.
• Our goal is to learn the weights of an object distance function such that all clusters are pure (or as pure as possible); for more details see the [ERBV04] and [BECV05] papers.
The object distance function is a weighted sum of per-attribute distance functions f_i with weights w_i:

  f(o, o') = ( Σ_i w_i · f_i(o, o') ) / ( Σ_i w_i )
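In Python the weighted object distance reads (a direct transcription of the formula; the per-attribute distance functions are supplied by the caller):

```python
def object_distance(o1, o2, weights, attr_dists):
    """Normalized weighted sum of per-attribute distances f_i with
    weights w_i, as in the formula above."""
    total = sum(w * f(o1, o2) for w, f in zip(weights, attr_dists))
    return total / sum(weights)
```

Learning the distance function then amounts to searching over the weight vector while the attribute-level distances stay fixed.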
Idea: Coevolving Clusters and Distance Functions

[Loop: the distance function is used to cluster X; the clustering is evaluated by q(X), which measures the goodness of the distance function; a weight-updating scheme / search strategy then adjusts the distance function and the cycle repeats.]

[Figure: with a “bad” distance function, clusters mix x's and o's; with a “good” distance function, each cluster is pure.]
Idea: Inside/Outside Weight Updating

Idea: Move examples of the majority class closer to each other.

[Figure: Cluster1's distances with respect to Att1 (majority-class o's lie close together, non-majority x's lie outside) and with respect to Att2 (o's and x's are interleaved).]

Action: Increase the weight of Att1.
Action: Decrease the weight of Att2.

o := examples belonging to the majority class
x := non-majority-class examples
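A single inside/outside update for one attribute in one cluster might be sketched as follows (the multiplicative update and learning rate are our own simplification of the scheme):

```python
from statistics import mean

def iowu_step(values, labels, weight, majority, lr=0.1):
    """One inside/outside update for one attribute in one cluster (sketch).

    If majority-class examples are closer to each other on this attribute
    than to non-majority examples, the attribute helps separate the
    classes: raise its weight; otherwise lower it."""
    inside = [abs(a - b)
              for i, a in enumerate(values) for j, b in enumerate(values)
              if i < j and labels[i] == majority and labels[j] == majority]
    outside = [abs(a - b)
               for i, a in enumerate(values) for j, b in enumerate(values)
               if i < j and (labels[i] == majority) != (labels[j] == majority)]
    if not inside or not outside:
        return weight
    return weight * (1 + lr) if mean(inside) < mean(outside) else weight * (1 - lr)
```

Iterating this over all clusters and attributes, and re-clustering with the new weights, gives the coevolution loop from the previous slide.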
Sample Run of IOWU for Diabetes Dataset
Graph produced by Abraham Bagherjeiran
Research Framework Distance Function Learning

[Framework: a weight-updating scheme / search strategy (randomized hill climbing, adaptive clustering, inside/outside weight updating, …) is paired with a distance-function evaluation component (K-means, supervised clustering, NN-classifier, work by Karypis, other research, …); see [BECV05] and [ERBV04].]
3.d Discovery of Interesting Regions for Spatial Data Mining

Task: 2D/3D datasets are given; discover interesting regions in the dataset that maximize a given fitness function. Examples of region discovery include:
– Discover regions that have significant deviations from the prior probability of a class; e.g., regions in the state of Wyoming where people are very poor or not poor at all
– Discover regions that have significant variation in income (fitness is defined based on the variance with respect to income in a region)
– Discover regions for congressional redistricting
– Discover congested regions for traffic control

Remark: We use (supervised) clustering to discover such regions; regions are implicitly defined by the set of points that belong to a cluster.
Wyoming Map
Household Income in 1999: Wyoming Park County
Clusters → Regions

Example: two clusters (red and blue) are given; regions are defined using a Voronoi diagram based on an NN classifier with k=7; regions are shown in grey and white.
An Evaluation Scheme for Discovering Regions that Deviate from the Prior Probability of a Class C
Let
  prior(C) = |C| / n
  p(c, C)  = percentage of examples in region c that belong to class C

Reward(c) is computed from p(c, C), prior(C), and the parameters γ1, γ2, R+, R-, relying on the following interpolation function (e.g. γ1 = 0.8, γ2 = 1.2, R+ = 1, R- = 1):

[Figure: Reward(c) as a function of p(c, C); the reward is 0 between prior(C)*γ1 and prior(C)*γ2, rises to R+ as p(c, C) approaches 1, and rises to R- as p(c, C) approaches 0.]

  q_C(X) = Σ_{c ∈ X} ( φ(p(c, C), prior(C), γ1, γ2, R+, R-) · |c|^β ) / n

with β > 1 (typically 1.0001 < β < 2); the idea is that increases in cluster size are rewarded nonlinearly, favoring clusters with more points as long as |c|·φ(…) increases.
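Under these definitions the reward and fitness can be sketched as follows (the piecewise-linear shape is our reading of the interpolation figure, and we assume prior·γ2 < 1):

```python
def reward(p, prior, g1, g2, r_plus, r_minus):
    """Interpolated reward for a region whose class share is p: zero in
    the band [prior*g1, prior*g2] around the prior, rising linearly to
    r_plus at p = 1 and to r_minus at p = 0 (assumes prior*g2 < 1)."""
    lo, hi = prior * g1, prior * g2
    if p > hi:
        return r_plus * (p - hi) / (1 - hi)
    if p < lo:
        return r_minus * (lo - p) / lo
    return 0.0

def q_C(clusters, prior, g1, g2, r_plus, r_minus, beta=1.0001):
    """Fitness q_C(X): size-weighted rewards with super-linear exponent
    beta, so larger deviating regions score higher.
    clusters: list of (size, p) pairs."""
    n = sum(size for size, _ in clusters)
    return sum(reward(p, prior, g1, g2, r_plus, r_minus) * size ** beta
               for size, p in clusters) / n
```

A region whose class share equals the prior earns nothing; regions strongly above or below the prior earn close to R+ or R- respectively, scaled by their size.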
Example: Discovery of “Interesting Regions” in Wyoming Census 2000 Datasets
Supervised Clustering ---Algorithms and Applications
Organization of the Talk
1. Supervised Clustering
2. Representative-based Supervised Clustering Algorithms
3. Applications: Using Supervised Clustering for
a. Dataset Editing
b. Class Decomposition
c. Distance Function Learning
d. Region Discovery in Spatial Datasets
4. Other Activities I am Involved With
An Environment for Adaptive (Supervised) Clustering for Summary Generation Applications

Idea: Development of a generic clustering/feedback/adaptation architecture whose objective is to facilitate the search for clusterings that maximize an internally and/or externally given reward function (for some initial ideas see [BECV05]).

[Architecture: a clustering algorithm turns its inputs into a clustering and a summary; an evaluation system (using predefined fitness functions such as q(X)) and a domain expert assess the quality and provide feedback; an adaptation system, drawing on past experience, changes the clustering algorithm's inputs.]
Clustering Algorithm Inputs
Data Set Examples
Data Set Feature Representation
Distance Function
Clustering Algorithm Parameters
Fitness Function Parameters
Background Knowledge
Research Topics 2005/2006

• Inductive Learning/Data Mining
  – Decision trees, nearest neighbor classifiers
  – Using clustering to enhance classification algorithms
  – Making sense of data
• Supervised Clustering
  – Learning subclasses
  – Supervised clustering algorithms that learn clusters with arbitrary shape
  – Using supervised clustering for region discovery
  – Adaptive clustering
• Tools for Similarity Assessment and Distance Function Learning
• Data Set Compression and Creating Meta Knowledge for Local Learning Techniques
  – Comparative studies
  – Creating maps and other data set signatures for datasets based on editing, SC, and other techniques
• Traditional Clustering
• Data Mining and Information Retrieval for Structured Data
• Other: Evolutionary Computing, File Prediction, Ontologies, Heuristic Search, Reinforcement Learning, Data Models

Remark: Topics that were “covered” in this talk are in blue.
Links to 7 Papers

[VAE03] R. Vilalta, M. Achari, C. Eick, Class Decomposition via Clustering: A New Framework for Low-Variance Classifiers, in Proc. IEEE International Conference on Data Mining (ICDM), Melbourne, Florida, November 2003.
http://www.cs.uh.edu/~ceick/kdd/VAE03.pdf
[EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version appeared in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004.
http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf
[EZV04] C. Eick, N. Zeidat, R. Vilalta, Using Representative-Based Clustering for Nearest Neighbor Dataset Editing, in Proc. IEEE International Conference on Data Mining (ICDM), Brighton, England, November 2004.
http://www.cs.uh.edu/~ceick/kdd/EZV04.pdf
[RE05] T. Ryu and C. Eick, A Clustering Methodology and Tool, in Information Sciences 171(1-3): 29-59 (2005).
http://www.cs.uh.edu/~ceick/kdd/RE05.doc
[ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in Proc. MLDM'05, Leipzig, Germany, July 2005.
http://www.cs.uh.edu/~ceick/kdd/ERBV05.pdf
[ZWE05] N. Zeidat, S. Wang, C. Eick, Editing Techniques: a Comparative Study, submitted for publication.
http://www.cs.uh.edu/~ceick/kdd/ZWE05.pdf
[BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen, R. Vilalta, Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience, submitted for publication.
http://www.cs.uh.edu/~ceick/kdd/BECV05.pdf