Clustering

Introduction

Clustering
Summarization of large data
– Understand large customer data
Data organization
– Manage large customer data
Outlier detection
– Find unusual customer data
Clustering

Preprocessing step before classification/association
– Find useful groupings for classes
– Association rules within a particular cluster
Problem Description

Given
– A data set of N data items, each with a d-dimensional feature vector
Task
– Determine a natural, useful partitioning of the data set into a number of clusters (k) and noise
Measure of closeness: similarity
Simple Matching:

sim(Q, D) = \sum_{j=1}^{t} w_{q_j} w_{d_j}

Cosine Coefficient:

sim(Q, D) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_j}}{\left( \sum_{j=1}^{t} w_{q_j}^2 \right)^{1/2} \left( \sum_{j=1}^{t} w_{d_j}^2 \right)^{1/2}}

Dice's Coefficient:

sim(Q, D) = \frac{2 \sum_{j=1}^{t} w_{q_j} w_{d_j}}{\sum_{j=1}^{t} w_{q_j}^2 + \sum_{j=1}^{t} w_{d_j}^2}

Jaccard's Coefficient:

sim(Q, D) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_j}}{\sum_{j=1}^{t} w_{q_j}^2 + \sum_{j=1}^{t} w_{d_j}^2 - \sum_{j=1}^{t} w_{q_j} w_{d_j}}
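As a minimal sketch, the four coefficients can be computed directly from two equal-length weight vectors (pure Python; the function names are ours, not part of the slides):

```python
import math

def simple_matching(q, d):
    # Inner product of the two weight vectors
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    # Inner product normalized by the vector lengths
    num = simple_matching(q, d)
    return num / (math.sqrt(sum(x * x for x in q)) *
                  math.sqrt(sum(x * x for x in d)))

def dice(q, d):
    # Twice the inner product over the sum of squared weights
    return 2 * simple_matching(q, d) / (sum(x * x for x in q) +
                                        sum(x * x for x in d))

def jaccard(q, d):
    # Inner product over "union" of squared weights
    num = simple_matching(q, d)
    return num / (sum(x * x for x in q) + sum(x * x for x in d) - num)
```

For binary vectors such as q = [1, 1, 0] and d = [1, 0, 1], simple matching gives 1, cosine and Dice give 0.5, and Jaccard gives 1/3.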
Measure of closeness: dissimilarity

Distance Measure
– Distance = dissimilarity
– Manhattan distance: d(x, y) = \sum_{i} |x_i - y_i|
– Euclidean distance: d(x, y) = \sqrt{(x - y)^t (x - y)}
– Minkowski metric: d(x, y) = \left( \sum_{i} |x_i - y_i|^m \right)^{1/m}
– Mahalanobis distance: d(x, y) = \sqrt{(x - y)^t S^{-1} (x - y)}
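A sketch of the four distance measures in pure Python; for Mahalanobis we assume the inverse covariance matrix S⁻¹ is supplied precomputed as a list of rows:

```python
import math

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, m):
    # m = 1 gives Manhattan distance, m = 2 gives Euclidean distance
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

def euclidean(x, y):
    return minkowski(x, y, 2)

def mahalanobis(x, y, s_inv):
    # s_inv: inverse covariance matrix (list of rows); with the
    # identity matrix this reduces to the Euclidean distance
    diff = [a - b for a, b in zip(x, y)]
    tmp = [sum(s_inv[i][j] * diff[j] for j in range(len(diff)))
           for i in range(len(diff))]
    return math.sqrt(sum(d * t for d, t in zip(diff, tmp)))
```

For x = (1, 2) and y = (4, 6) the Manhattan distance is 7 and the Euclidean distance is 5.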
Similarity Matrix

d_ij = similarity of D_i to D_j

Note that d_ij = d_ji (i.e., the matrix is symmetric). So, we only need the lower triangle part of the matrix:

        D_1    D_2    ...   D_n
D_1      -     d_12   ...   d_1n
D_2     d_21    -     ...   d_2n
...
D_n     d_n1   d_n2   ...    -

becomes

        D_1    D_2    ...   D_{n-1}
D_2     d_21
D_3     d_31   d_32
...
D_n     d_n1   d_n2   ...   d_{n,n-1}
Similarity Matrix - Example

Document-term matrix:

       T1  T2  T3  T4  T5  T6  T7  T8
Doc1    0   4   0   0   0   2   1   3
Doc2    3   1   4   3   1   2   0   1
Doc3    3   0   0   0   3   0   3   0
Doc4    0   1   0   3   0   0   2   0
Doc5    2   2   2   3   1   4   0   2

sim(T_i, T_j) = \sum_{k=1}^{N} w_{ik} w_{jk}

Term-Term Similarity Matrix:

       T1  T2  T3  T4  T5  T6  T7
T2      7
T3     16   8
T4     15  12  18
T5     14   3   6   6
T6     14  18  16  18   6
T7      9   6   0   6   9   2
T8      7  17   8   9   3  16   3
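The term-term similarity matrix can be reproduced from the document-term matrix with a few lines of Python (a sketch; `term_sim` is our name):

```python
# Document-term matrix from the example (rows = Doc1..Doc5, columns = T1..T8)
docs = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]

def term_sim(i, j):
    # sim(Ti, Tj) = sum over documents of w_ik * w_jk
    return sum(row[i] * row[j] for row in docs)

# Lower-triangle term-term similarity matrix
sim = [[term_sim(i, j) for j in range(i)] for i in range(8)]
```

For instance, sim(T1, T3) = 0·0 + 3·4 + 3·0 + 0·0 + 2·2 = 16, matching the table.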
Similarity Thresholds

A similarity threshold is used to mark pairs that are "sufficiently" similar
– The threshold value is application and collection dependent

Using a threshold value of 10 in the previous example:

       T1  T2  T3  T4  T5  T6  T7
T2      7
T3     16   8
T4     15  12  18
T5     14   3   6   6
T6     14  18  16  18   6
T7      9   6   0   6   9   2
T8      7  17   8   9   3  16   3

       T1  T2  T3  T4  T5  T6  T7
T2      0
T3      1   0
T4      1   1   1
T5      1   0   0   0
T6      1   1   1   1   0
T7      0   0   0   0   0   0
T8      0   1   0   0   0   1   0
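Applying the threshold is a one-liner per row; a sketch using the similarity values from the example (we use a strict "greater than" comparison, which is consistent with the slide since the value 10 itself never occurs):

```python
# Lower-triangle term-term similarities from the example (rows T2..T8)
sim = [
    [7],
    [16, 8],
    [15, 12, 18],
    [14, 3, 6, 6],
    [14, 18, 16, 18, 6],
    [9, 6, 0, 6, 9, 2],
    [7, 17, 8, 9, 3, 16, 3],
]

THRESHOLD = 10

# Mark pairs whose similarity exceeds the threshold
binary = [[1 if v > THRESHOLD else 0 for v in row] for row in sim]
```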
Clustering methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods
– Graph-based methods
Partitioning methods

K-means
1) Choose k objects as the initial cluster centres; set i = 0
2) Loop:
3)   For each object:
4)     Assign the data point to its nearest centroid
5)   Compute the mean of each cluster as its new centre
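The loop above can be sketched in pure Python (Euclidean distance; the initial centres are simply the first k points, an arbitrary choice for illustration):

```python
def kmeans(points, k, max_iters=100):
    # 1) choose k objects as initial centres (here: the first k points)
    centres = [list(p) for p in points[:k]]
    assign = None
    for _ in range(max_iters):
        # assign each point to its nearest centroid
        new_assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centres[c])))
            for p in points
        ]
        if new_assign == assign:   # assignments stable: converged
            break
        assign = new_assign
        # recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centres[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign, centres
```

On two well-separated blobs such as [(0,0), (0,1), (10,10), (10,11)] with k = 2, the algorithm recovers the two groups in a couple of iterations.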
Partitioning methods

[Figure: two k-means iterations (I and II); a marker denotes each cluster centroid]
Partitioning Method (Iterative method)

The basic algorithm:
1. Select M cluster representatives (centroids)
2. For i = 1 to N, assign Di to the most similar centroid
3. For j = 1 to M, recalculate the cluster centroid Cj
4. Repeat steps 2 and 3 until there is (little or) no change in clusters

Example:

Initial (arbitrary) assignment: C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6}

Cluster centroids:

       T1  T2  T3  T4  T5  T6  T7  T8 |  C1   C2   C3
Doc1    0   4   0   0   0   2   1   3 | 4/2  0/2  2/2
Doc2    3   1   4   3   1   2   0   1 | 4/2  7/2  3/2
Doc3    3   0   0   0   3   0   3   0 | 3/2  0/2  3/2
Doc4    0   1   0   3   0   0   2   0 | 1/2  3/2  0/2
Doc5    2   2   2   3   1   4   0   2 | 4/2  5/2  5/2
Partitioning Method (Iterative method)

Example (continued)

Now, using the simple similarity measure, compute the new cluster-term similarity matrix:

         T1    T2    T3    T4    T5    T6    T7    T8
Class1  29/2  29/2  24/2  27/2  17/2  32/2  15/2  24/2
Class2  31/2  20/2  38/2  45/2  12/2  34/2   6/2  17/2
Class3  28/2  21/2  22/2  24/2  17/2  30/2  11/2  19/2
Assign to: Class2 Class1 Class2 Class2 Class3 Class2 Class1 Class1

Now compute new cluster centroids using the original document-term matrix:

       T1  T2  T3  T4  T5  T6  T7  T8 |  C1    C2   C3
Doc1    0   4   0   0   0   2   1   3 | 8/3   2/4  0/1
Doc2    3   1   4   3   1   2   0   1 | 2/3  12/4  1/1
Doc3    3   0   0   0   3   0   3   0 | 3/3   3/4  3/1
Doc4    0   1   0   3   0   0   2   0 | 3/3   3/4  0/1
Doc5    2   2   2   3   1   4   0   2 | 4/3  11/4  1/1

The process is repeated until no further changes are made to the clusters.
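One reassignment step of this example can be checked exactly with fractions (a sketch; `centroid` and `sim` are our names). Note that T5 ties at 17/2 between Class1 and Class3; the slides keep it in its current cluster, Class3:

```python
from fractions import Fraction

# Original document-term matrix (rows = Doc1..Doc5, columns = T1..T8)
docs = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]

def centroid(term_idxs):
    # Per-document mean weight over the terms in the cluster
    return [Fraction(sum(row[t] for t in term_idxs), len(term_idxs))
            for row in docs]

def sim(cent, t):
    # Simple similarity: dot product of centroid with term t's column
    return sum(c * row[t] for c, row in zip(cent, docs))

# Initial assignment C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6}
cents = [centroid([0, 1]), centroid([2, 3]), centroid([4, 5])]
```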
Clustering methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods
– Graph-based methods
Hierarchical methods

Group objects into a tree of clusters

Types
– Agglomerative (bottom-up) approach
  • Single-linkage
  • Complete-linkage
  • Group-average linkage
  • Centroid-linkage
  • Ward's method
– Divisive (top-down) approach
  • Use of K-means clustering
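A naive agglomerative pass with single linkage can be sketched in a few lines (1-D points for simplicity; quadratic search for the closest cluster pair, so for illustration only):

```python
def single_linkage(points):
    # Agglomerative clustering, single linkage: repeatedly merge the two
    # clusters whose closest members are nearest, until one cluster remains.
    clusters = [[p] for p in points]
    merges = []  # (merge distance, merged cluster) per step
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
        merges.append((d, sorted(merged)))
    return merges
```

The sequence of merge distances is what a dendrogram plots on its vertical axis.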
Hierarchical methods

[Figure: dendrogram over items a, b, c, d, e. Agglomerative direction (step 0 to step 4): merge {a,b}, merge {d,e}, merge {c,d,e}, then {a,b,c,d,e}. Divisive direction runs the same steps in reverse (step 4 to step 0).]
Hierarchical methods

Ward's method
– at each step, join the cluster pair whose merger minimizes the increase in total within-group error sum of squares
Clustering methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods
– Graph-based methods
Graph Representation

The similarity matrix can be visualized as an undirected graph
– each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix)

       T1  T2  T3  T4  T5  T6  T7
T2      0
T3      1   0
T4      1   1   1
T5      1   0   0   0
T6      1   1   1   1   0
T7      0   0   0   0   0   0
T8      0   1   0   0   0   1   0

[Figure: graph with nodes T1..T8 and edges T1-T3, T1-T4, T1-T5, T1-T6, T2-T4, T2-T6, T2-T8, T3-T4, T3-T6, T4-T6, T6-T8; T7 is isolated]
Clustering Algorithms (Graph-based)

Basic clustering techniques try to determine which objects belong to the same class

Clique Method (complete link)
– all items within a cluster must be within the similarity threshold of all other items in that cluster
– clusters may overlap
– generally produces small but very tight clusters

Single Link Method
– any item in a cluster must be within the similarity threshold of at least one other item in that cluster
– produces larger but weaker clusters

Other methods
– star method: start with an item and place all related items in that cluster
– string method: start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on
Clustering Algorithms (Graph-based)

Clique Method
– a clique is a completely connected subgraph of a graph
– in the clique method, each maximal clique in the graph becomes a cluster

[Figure: the similarity graph from the previous slide]

Maximal cliques (and therefore the clusters) in the previous example are:
{T1, T3, T4, T6}
{T2, T4, T6}
{T2, T6, T8}
{T1, T5}
{T7}

Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
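Maximal cliques can be enumerated with the classic Bron-Kerbosch recursion; a minimal sketch (no pivoting) run on the example graph:

```python
def maximal_cliques(adj):
    # Bron-Kerbosch: r = current clique, p = candidates, x = excluded
    out = []
    def expand(r, p, x):
        if not p and not x:
            out.append(r)          # r is a maximal clique
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(adj), set())
    return out

# Adjacency built from the thresholded example graph
edges = [("T1","T3"), ("T1","T4"), ("T1","T5"), ("T1","T6"),
         ("T2","T4"), ("T2","T6"), ("T2","T8"), ("T3","T4"),
         ("T3","T6"), ("T4","T6"), ("T6","T8")]
adj = {f"T{i}": set() for i in range(1, 9)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

cliques = {frozenset(c) for c in maximal_cliques(adj)}
```

The result is exactly the five clusters listed above, including the singleton {T7}.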
Clustering Algorithms (Graph-based)

Single Link Method
– select an item not in a cluster and place it in a new cluster
– place all other related items in that cluster
– repeat step 2 for each item in the cluster until nothing more can be added
– repeat steps 1 to 3 for each item that remains unclustered

[Figure: the similarity graph from the previous slide]

In this case the single link method produces only two clusters:
{T1, T3, T4, T5, T6, T2, T8}
{T7}

Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
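Single-link clusters of a similarity graph are exactly its connected components, so the procedure above reduces to a graph traversal (a sketch):

```python
from collections import deque

def connected_components(adj):
    # Breadth-first traversal; each component is one single-link cluster
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            v = queue.popleft()
            if v in comp:
                continue
            comp.add(v)
            queue.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Adjacency from the thresholded example graph
edges = [("T1","T3"), ("T1","T4"), ("T1","T5"), ("T1","T6"),
         ("T2","T4"), ("T2","T6"), ("T2","T8"), ("T3","T4"),
         ("T3","T6"), ("T4","T6"), ("T6","T8")]
adj = {f"T{i}": set() for i in range(1, 9)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

comps = connected_components(adj)
```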
Clustering Algorithms (Graph-based)

Star method
– {t1, t3, t4, t5, t6}
– {t2, t8}
– {t7}

String method
– {t1, t3, t4, t2, t6, t8}
– {t5}
– {t7}

[Figure: the similarity graph from the previous slide]
Clustering methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods
– Graph-based methods
Density-based methods
Clusters: density-connected sets
DBSCAN algorithm
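A minimal DBSCAN sketch in pure Python, assuming Euclidean distance and the convention that a point counts in its own neighbourhood (labels: cluster id, or -1 for noise):

```python
import math

def dbscan(points, eps, min_pts):
    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # noise (may later become a border point)
            continue
        labels[i] = cid             # i is a core point: start a cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid     # noise reachable from a core point: border
                continue
            if labels[j] is not None:
                continue
            labels[j] = cid
            nj = neighbours(j)
            if len(nj) >= min_pts:  # j is also core: keep expanding
                queue.extend(k for k in nj if labels[k] is None)
        cid += 1
    return labels
```

Two dense groups separated by an isolated point yield two clusters plus one noise label.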
Density-based methods

Based on a set of density distribution functions

[Figure: density function over the data space]
Clustering methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods
– Graph-based methods
Grid-based methods

Organize the data space as a grid file
– Determines clusters as density-connected components of the grid
– Approximates the clusters found by DBSCAN
Clustering methods
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods
– Graph-based methods
Model-based methods

Optimize the fit between the given data and some mathematical model

[Figure: SOM map units, each holding an N-dimensional centroid vector]
Self-Organizing Map (SOM)

A sample data vector x is randomly chosen

BMU: best matching unit
– The map unit with centroid closest to x:

\| x - m_b \| = \min_i \| x - m_i \|

– Update the centroid vectors:

m_i(t+1) = m_i(t) + \alpha(t)\, h_{bi}(t)\, [x(t) - m_i(t)]

where \alpha(t) is the learning rate and

h_{bi}(t) = \exp\left( -\frac{\| r_b - r_i \|^2}{2\sigma^2(t)} \right)

is the neighborhood kernel function (r_b, r_i are the map positions of units b and i)
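One SOM update step can be sketched as follows, assuming a 1-D map whose unit positions are simply their indices, and fixed alpha and sigma for the illustration:

```python
import math

def som_update(centroids, x, alpha, sigma):
    # Find the BMU, then pull every unit's centroid toward x,
    # weighted by the Gaussian neighbourhood kernel h_bi.
    b = min(range(len(centroids)),
            key=lambda i: math.dist(x, centroids[i]))   # best matching unit
    for i, m in enumerate(centroids):
        h = math.exp(-((b - i) ** 2) / (2 * sigma ** 2))  # kernel h_bi
        centroids[i] = [mi + alpha * h * (xi - mi)
                        for mi, xi in zip(m, x)]
    return b

m = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
bmu = som_update(m, [0.9, 1.1], alpha=0.5, sigma=1.0)
```

Here unit 1 is the BMU, and its centroid moves halfway toward x (from (1, 1) to (0.95, 1.05)); its neighbours move by a smaller, kernel-weighted amount.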
Self-Organizing Map

[Figure: output layer of map units and an input sample, shown before and after updating]

Self-Organizing Map

[Figure: example of a trained SOM]