Clustering

Introduction

Clustering
Summarization of large data
– Understand large customer data
Data organization
– Manage large customer data
Outlier detection
– Find unusual customer data
Clustering

Preprocessing step before classification/association
– Find useful groupings for classes
– Association rules within a particular cluster
Problem Description

Given
– A data set of N data items, each with a d-dimensional feature vector
Task
– Determine a natural, useful partitioning of the data set into a number of clusters (k) and noise
Measure of closeness: similarity
Simple Matching:

sim(Q, D) = \sum_{j=1}^{t} w_{q_j} w_{d_j}

Cosine Coefficient:

sim(Q, D) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_j}}{\left( \sum_{j=1}^{t} w_{q_j}^2 \right)^{1/2} \left( \sum_{j=1}^{t} w_{d_j}^2 \right)^{1/2}}

Dice's Coefficient:

sim(Q, D) = \frac{2 \sum_{j=1}^{t} w_{q_j} w_{d_j}}{\sum_{j=1}^{t} w_{q_j}^2 + \sum_{j=1}^{t} w_{d_j}^2}

Jaccard's Coefficient:

sim(Q, D) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_j}}{\sum_{j=1}^{t} w_{q_j}^2 + \sum_{j=1}^{t} w_{d_j}^2 - \sum_{j=1}^{t} w_{q_j} w_{d_j}}
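As a minimal sketch, the four coefficients can be computed directly from two equal-length weight vectors (pure Python; the function names are ours, not part of the slides):

```python
import math

def simple_matching(q, d):
    # Inner product of the two weight vectors
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    # Inner product normalized by the vector lengths
    num = simple_matching(q, d)
    return num / (math.sqrt(sum(x * x for x in q)) *
                  math.sqrt(sum(x * x for x in d)))

def dice(q, d):
    # Twice the inner product over the sum of squared weights
    return 2 * simple_matching(q, d) / (sum(x * x for x in q) +
                                        sum(x * x for x in d))

def jaccard(q, d):
    # Inner product over "union" of squared weights
    num = simple_matching(q, d)
    return num / (sum(x * x for x in q) + sum(x * x for x in d) - num)
```

For binary vectors such as q = [1, 1, 0] and d = [1, 0, 1], simple matching gives 1, cosine and Dice give 0.5, and Jaccard gives 1/3.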
Measure of closeness: dissimilarity

Distance Measure
– Distance = dissimilarity
– Manhattan distance: d(x, y) = \sum_{i} |x_i - y_i|
– Euclidean distance: d(x, y) = \sqrt{(x - y)^t (x - y)}
– Minkowski metric: d(x, y) = \left( \sum_{i} |x_i - y_i|^m \right)^{1/m}
– Mahalanobis distance: d(x, y) = \sqrt{(x - y)^t S^{-1} (x - y)}
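A sketch of the four distance measures in pure Python; for Mahalanobis we assume the inverse covariance matrix S⁻¹ is supplied precomputed as a list of rows:

```python
import math

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, m):
    # m = 1 gives Manhattan distance, m = 2 gives Euclidean distance
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

def euclidean(x, y):
    return minkowski(x, y, 2)

def mahalanobis(x, y, s_inv):
    # s_inv: inverse covariance matrix (list of rows); with the
    # identity matrix this reduces to the Euclidean distance
    diff = [a - b for a, b in zip(x, y)]
    tmp = [sum(s_inv[i][j] * diff[j] for j in range(len(diff)))
           for i in range(len(diff))]
    return math.sqrt(sum(d * t for d, t in zip(diff, tmp)))
```

For x = (1, 2) and y = (4, 6) the Manhattan distance is 7 and the Euclidean distance is 5.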
Similarity Matrix

d_ij = similarity of D_i to D_j

Note that d_ij = d_ji (i.e., the matrix is symmetric). So, we only need the lower triangle part of the matrix:

        D_1    D_2    ...   D_n
D_1      -     d_12   ...   d_1n
D_2     d_21    -     ...   d_2n
...
D_n     d_n1   d_n2   ...    -

becomes

        D_1    D_2    ...   D_{n-1}
D_2     d_21
D_3     d_31   d_32
...
D_n     d_n1   d_n2   ...   d_{n,n-1}
Similarity Matrix - Example

Document-term matrix:

       T1  T2  T3  T4  T5  T6  T7  T8
Doc1    0   4   0   0   0   2   1   3
Doc2    3   1   4   3   1   2   0   1
Doc3    3   0   0   0   3   0   3   0
Doc4    0   1   0   3   0   0   2   0
Doc5    2   2   2   3   1   4   0   2

sim(T_i, T_j) = \sum_{k=1}^{N} w_{ik} w_{jk}

Term-Term Similarity Matrix:

       T1  T2  T3  T4  T5  T6  T7
T2      7
T3     16   8
T4     15  12  18
T5     14   3   6   6
T6     14  18  16  18   6
T7      9   6   0   6   9   2
T8      7  17   8   9   3  16   3
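The term-term similarity matrix can be reproduced from the document-term matrix with a few lines of Python (a sketch; `term_sim` is our name):

```python
# Document-term matrix from the example (rows = Doc1..Doc5, columns = T1..T8)
docs = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]

def term_sim(i, j):
    # sim(Ti, Tj) = sum over documents of w_ik * w_jk
    return sum(row[i] * row[j] for row in docs)

# Lower-triangle term-term similarity matrix
sim = [[term_sim(i, j) for j in range(i)] for i in range(8)]
```

For instance, sim(T1, T3) = 0·0 + 3·4 + 3·0 + 0·0 + 2·2 = 16, matching the table.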
Similarity Thresholds

A similarity threshold is used to mark pairs that are "sufficiently" similar
– The threshold value is application and collection dependent

Using a threshold value of 10 in the previous example:

       T1  T2  T3  T4  T5  T6  T7
T2      7
T3     16   8
T4     15  12  18
T5     14   3   6   6
T6     14  18  16  18   6
T7      9   6   0   6   9   2
T8      7  17   8   9   3  16   3

       T1  T2  T3  T4  T5  T6  T7
T2      0
T3      1   0
T4      1   1   1
T5      1   0   0   0
T6      1   1   1   1   0
T7      0   0   0   0   0   0
T8      0   1   0   0   0   1   0
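Applying the threshold is a one-liner per row; a sketch using the similarity values from the example (we use a strict "greater than" comparison, which is consistent with the slide since the value 10 itself never occurs):

```python
# Lower-triangle term-term similarities from the example (rows T2..T8)
sim = [
    [7],
    [16, 8],
    [15, 12, 18],
    [14, 3, 6, 6],
    [14, 18, 16, 18, 6],
    [9, 6, 0, 6, 9, 2],
    [7, 17, 8, 9, 3, 16, 3],
]

THRESHOLD = 10

# Mark pairs whose similarity exceeds the threshold
binary = [[1 if v > THRESHOLD else 0 for v in row] for row in sim]
```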
Clustering methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods
– Graph-based methods
Partitioning methods

K-means
1) Choose k objects as the initial cluster centres; set i = 0
2) Loop:
3)   For each object:
4)     Assign the data point to its nearest centroid
5)   Compute the mean of each cluster as its new centre
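The loop above can be sketched in pure Python (Euclidean distance; the initial centres are simply the first k points, an arbitrary choice for illustration):

```python
def kmeans(points, k, max_iters=100):
    # 1) choose k objects as initial centres (here: the first k points)
    centres = [list(p) for p in points[:k]]
    assign = None
    for _ in range(max_iters):
        # assign each point to its nearest centroid
        new_assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centres[c])))
            for p in points
        ]
        if new_assign == assign:   # assignments stable: converged
            break
        assign = new_assign
        # recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centres[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign, centres
```

On two well-separated blobs such as [(0,0), (0,1), (10,10), (10,11)] with k = 2, the algorithm recovers the two groups in a couple of iterations.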
Partitioning methods

[Figure: two k-means iterations (I and II); a marker denotes each cluster centroid]
Partitioning Method (Iterative method)

The basic algorithm:
1. Select M cluster representatives (centroids)
2. For i = 1 to N, assign Di to the most similar centroid
3. For j = 1 to M, recalculate the cluster centroid Cj
4. Repeat steps 2 and 3 until there is (little or) no change in clusters

Example:

Initial (arbitrary) assignment: C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6}

Cluster centroids:

       T1  T2  T3  T4  T5  T6  T7  T8 |  C1   C2   C3
Doc1    0   4   0   0   0   2   1   3 | 4/2  0/2  2/2
Doc2    3   1   4   3   1   2   0   1 | 4/2  7/2  3/2
Doc3    3   0   0   0   3   0   3   0 | 3/2  0/2  3/2
Doc4    0   1   0   3   0   0   2   0 | 1/2  3/2  0/2
Doc5    2   2   2   3   1   4   0   2 | 4/2  5/2  5/2
Partitioning Method (Iterative method)

Example (continued)

Now, using the simple similarity measure, compute the new cluster-term similarity matrix:

         T1    T2    T3    T4    T5    T6    T7    T8
Class1  29/2  29/2  24/2  27/2  17/2  32/2  15/2  24/2
Class2  31/2  20/2  38/2  45/2  12/2  34/2   6/2  17/2
Class3  28/2  21/2  22/2  24/2  17/2  30/2  11/2  19/2
Assign to: Class2 Class1 Class2 Class2 Class3 Class2 Class1 Class1

Now compute new cluster centroids using the original document-term matrix:

       T1  T2  T3  T4  T5  T6  T7  T8 |  C1    C2   C3
Doc1    0   4   0   0   0   2   1   3 | 8/3   2/4  0/1
Doc2    3   1   4   3   1   2   0   1 | 2/3  12/4  1/1
Doc3    3   0   0   0   3   0   3   0 | 3/3   3/4  3/1
Doc4    0   1   0   3   0   0   2   0 | 3/3   3/4  0/1
Doc5    2   2   2   3   1   4   0   2 | 4/3  11/4  1/1

The process is repeated until no further changes are made to the clusters.
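One reassignment step of this example can be checked exactly with fractions (a sketch; `centroid` and `sim` are our names). Note that T5 ties at 17/2 between Class1 and Class3; the slides keep it in its current cluster, Class3:

```python
from fractions import Fraction

# Original document-term matrix (rows = Doc1..Doc5, columns = T1..T8)
docs = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]

def centroid(term_idxs):
    # Per-document mean weight over the terms in the cluster
    return [Fraction(sum(row[t] for t in term_idxs), len(term_idxs))
            for row in docs]

def sim(cent, t):
    # Simple similarity: dot product of centroid with term t's column
    return sum(c * row[t] for c, row in zip(cent, docs))

# Initial assignment C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6}
cents = [centroid([0, 1]), centroid([2, 3]), centroid([4, 5])]
```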
Clustering methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods
– Graph-based methods
Hierarchical methods

Group objects into a tree of clusters

Types
– Agglomerative (bottom-up) approach
  • Single-linkage
  • Complete-linkage
  • Group-average linkage
  • Centroid-linkage
  • Ward's method
– Divisive (top-down) approach
  • Use of K-means clustering
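A naive agglomerative pass with single linkage can be sketched in a few lines (1-D points for simplicity; quadratic search for the closest cluster pair, so for illustration only):

```python
def single_linkage(points):
    # Agglomerative clustering, single linkage: repeatedly merge the two
    # clusters whose closest members are nearest, until one cluster remains.
    clusters = [[p] for p in points]
    merges = []  # (merge distance, merged cluster) per step
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
        merges.append((d, sorted(merged)))
    return merges
```

The sequence of merge distances is what a dendrogram plots on its vertical axis.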
Hierarchical methods

[Figure: dendrogram over items a, b, c, d, e. Agglomerative direction (step 0 to step 4): merge {a,b}, merge {d,e}, merge {c,d,e}, then {a,b,c,d,e}. Divisive direction runs the same steps in reverse (step 4 to step 0).]
Hierarchical methods

Ward's method
– at each step, join the cluster pair whose merger minimizes the increase in total within-group error sum of squares
Clustering methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods
– Graph-based methods
Graph Representation

The similarity matrix can be visualized as an undirected graph
– each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix)

       T1  T2  T3  T4  T5  T6  T7
T2      0
T3      1   0
T4      1   1   1
T5      1   0   0   0
T6      1   1   1   1   0
T7      0   0   0   0   0   0
T8      0   1   0   0   0   1   0

[Figure: graph with nodes T1..T8 and edges T1-T3, T1-T4, T1-T5, T1-T6, T2-T4, T2-T6, T2-T8, T3-T4, T3-T6, T4-T6, T6-T8; T7 is isolated]
Clustering Algorithms (Graph-based)

Basic clustering techniques try to determine which objects belong to the same class

Clique Method (complete link)
– all items within a cluster must be within the similarity threshold of all other items in that cluster
– clusters may overlap
– generally produces small but very tight clusters

Single Link Method
– any item in a cluster must be within the similarity threshold of at least one other item in that cluster
– produces larger but weaker clusters

Other methods
– star method: start with an item and place all related items in that cluster
– string method: start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on
Clustering Algorithms (Graph-based)

Clique Method
– a clique is a completely connected subgraph of a graph
– in the clique method, each maximal clique in the graph becomes a cluster

[Figure: the similarity graph from the previous slide]

Maximal cliques (and therefore the clusters) in the previous example are:
{T1, T3, T4, T6}
{T2, T4, T6}
{T2, T6, T8}
{T1, T5}
{T7}

Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
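Maximal cliques can be enumerated with the classic Bron-Kerbosch recursion; a minimal sketch (no pivoting) run on the example graph:

```python
def maximal_cliques(adj):
    # Bron-Kerbosch: r = current clique, p = candidates, x = excluded
    out = []
    def expand(r, p, x):
        if not p and not x:
            out.append(r)          # r is a maximal clique
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(adj), set())
    return out

# Adjacency built from the thresholded example graph
edges = [("T1","T3"), ("T1","T4"), ("T1","T5"), ("T1","T6"),
         ("T2","T4"), ("T2","T6"), ("T2","T8"), ("T3","T4"),
         ("T3","T6"), ("T4","T6"), ("T6","T8")]
adj = {f"T{i}": set() for i in range(1, 9)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

cliques = {frozenset(c) for c in maximal_cliques(adj)}
```

The result is exactly the five clusters listed above, including the singleton {T7}.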
Clustering Algorithms (Graph-based)

Single Link Method
– select an item not in a cluster and place it in a new cluster
– place all other related items in that cluster
– repeat step 2 for each item in the cluster until nothing more can be added
– repeat steps 1 to 3 for each item that remains unclustered

[Figure: the similarity graph from the previous slide]

In this case the single link method produces only two clusters:
{T1, T3, T4, T5, T6, T2, T8}
{T7}

Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
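Single-link clusters of a similarity graph are exactly its connected components, so the procedure above reduces to a graph traversal (a sketch):

```python
from collections import deque

def connected_components(adj):
    # Breadth-first traversal; each component is one single-link cluster
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            v = queue.popleft()
            if v in comp:
                continue
            comp.add(v)
            queue.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Adjacency from the thresholded example graph
edges = [("T1","T3"), ("T1","T4"), ("T1","T5"), ("T1","T6"),
         ("T2","T4"), ("T2","T6"), ("T2","T8"), ("T3","T4"),
         ("T3","T6"), ("T4","T6"), ("T6","T8")]
adj = {f"T{i}": set() for i in range(1, 9)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

comps = connected_components(adj)
```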
Clustering Algorithms (Graph-based)

Star method
– {t1, t3, t4, t5, t6}
– {t2, t8}
– {t7}

String method
– {t1, t3, t4, t2, t6, t8}
– {t5}
– {t7}

[Figure: the similarity graph from the previous slide]
Clustering methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods
– Graph-based methods
Density-based methods
Clusters: density-connected sets
DBSCAN algorithm
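A minimal DBSCAN sketch in pure Python, assuming Euclidean distance and the convention that a point counts in its own neighbourhood (labels: cluster id, or -1 for noise):

```python
import math

def dbscan(points, eps, min_pts):
    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # noise (may later become a border point)
            continue
        labels[i] = cid             # i is a core point: start a cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid     # noise reachable from a core point: border
                continue
            if labels[j] is not None:
                continue
            labels[j] = cid
            nj = neighbours(j)
            if len(nj) >= min_pts:  # j is also core: keep expanding
                queue.extend(k for k in nj if labels[k] is None)
        cid += 1
    return labels
```

Two dense groups separated by an isolated point yield two clusters plus one noise label.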
Density-based methods

Based on a set of density distribution functions

[Figure: density function over the data space]
Clustering methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods
– Graph-based methods
Grid-based methods

Organize the data space as a grid file
– Determines clusters as density-connected components of the grid
– Approximates the clusters found by DBSCAN
Clustering methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods
– Graph-based methods
Model-based methods

Optimize the fit between the given data and some mathematical model

[Figure: SOM map units, each holding an N-dimensional centroid vector]
Self-Organizing Map (SOM)

A sample data vector x is randomly chosen

BMU: best matching unit
– The map unit with centroid closest to x:

\| x - m_b \| = \min_i \| x - m_i \|

– Update the centroid vectors:

m_i(t+1) = m_i(t) + \alpha(t)\, h_{bi}(t)\, [x(t) - m_i(t)]

where \alpha(t) is the learning rate and

h_{bi}(t) = \exp\left( -\frac{\| r_b - r_i \|^2}{2\sigma^2(t)} \right)

is the neighborhood kernel function (r_b, r_i are the map positions of units b and i)
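One SOM update step can be sketched as follows, assuming a 1-D map whose unit positions are simply their indices, and fixed alpha and sigma for the illustration:

```python
import math

def som_update(centroids, x, alpha, sigma):
    # Find the BMU, then pull every unit's centroid toward x,
    # weighted by the Gaussian neighbourhood kernel h_bi.
    b = min(range(len(centroids)),
            key=lambda i: math.dist(x, centroids[i]))   # best matching unit
    for i, m in enumerate(centroids):
        h = math.exp(-((b - i) ** 2) / (2 * sigma ** 2))  # kernel h_bi
        centroids[i] = [mi + alpha * h * (xi - mi)
                        for mi, xi in zip(m, x)]
    return b

m = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
bmu = som_update(m, [0.9, 1.1], alpha=0.5, sigma=1.0)
```

Here unit 1 is the BMU, and its centroid moves halfway toward x (from (1, 1) to (0.95, 1.05)); its neighbours move by a smaller, kernel-weighted amount.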
Self-Organizing Map

[Figure: output layer of map units and an input sample, shown before and after updating]

Self-Organizing Map

[Figure: example of a trained SOM]