Clustering

Clustering. Introduction Clustering Summarization of large data –Understand the large customer data Data organization –Manage the large customer data


Page 1

Clustering

Page 2

Introduction

Page 3

Clustering

Summarization of large data
– Understand the large customer data

Data organization
– Manage the large customer data

Outlier detection
– Find unusual customer data

Page 4

Clustering

Preprocessing step before classification/association
– Find useful groupings for classes
– Find association rules within a particular cluster

Page 5

Problem Description

Given
– A data set of N data items, each with a d-dimensional feature vector

Task
– Determine a natural, useful partitioning of the data set into a number of clusters (k) and noise

Page 6

Measure of closeness: similarity

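Simple Matching
$\mathrm{sim}(Q, D) = \sum_{j=1}^{t} w_{qj} \, w_{dj}$

Cosine Coefficient
$\mathrm{sim}(Q, D) = \dfrac{\sum_{j=1}^{t} w_{qj} \, w_{dj}}{\sqrt{\sum_{j=1}^{t} w_{qj}^{2} \cdot \sum_{j=1}^{t} w_{dj}^{2}}}$

Dice’s Coefficient
$\mathrm{sim}(Q, D) = \dfrac{2 \sum_{j=1}^{t} w_{qj} \, w_{dj}}{\sum_{j=1}^{t} w_{qj}^{2} + \sum_{j=1}^{t} w_{dj}^{2}}$

Jaccard’s Coefficient
$\mathrm{sim}(Q, D) = \dfrac{\sum_{j=1}^{t} w_{qj} \, w_{dj}}{\sum_{j=1}^{t} w_{qj}^{2} + \sum_{j=1}^{t} w_{dj}^{2} - \sum_{j=1}^{t} w_{qj} \, w_{dj}}$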
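The four coefficients on this slide can be sketched in Python as follows. This is a minimal illustration, assuming Q and D are equal-length lists of term weights; the function names and the sample vectors are hypothetical and do not come from the slides.

```python
from math import sqrt


def simple_matching(q, d):
    """Inner product of two equal-length weight vectors."""
    return sum(wq * wd for wq, wd in zip(q, d))


def cosine(q, d):
    """Inner product divided by the product of the vector lengths."""
    return simple_matching(q, d) / sqrt(simple_matching(q, q) * simple_matching(d, d))


def dice(q, d):
    """Twice the inner product over the sum of squared weights."""
    return 2 * simple_matching(q, d) / (simple_matching(q, q) + simple_matching(d, d))


def jaccard(q, d):
    """Inner product over (sum of squared weights minus inner product)."""
    dot = simple_matching(q, d)
    return dot / (simple_matching(q, q) + simple_matching(d, d) - dot)


# Hypothetical weight vectors; all-zero vectors would divide by zero
# in every coefficient except simple_matching.
q = [1, 0, 1, 1]
d = [1, 1, 0, 1]
print(simple_matching(q, d), cosine(q, d), dice(q, d), jaccard(q, d))
```

Only simple matching is unnormalized; the other three divide by a size term, so they stay comparable across vectors with many or large weights.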

Page 7

Measure of closeness: dissimilarity

Distance Measure
– Distance = dissimilarity

– Manhattan distance
$d(x, y) = \sum_{i} |x_i - y_i|$

– Euclidean distance
$d(x, y) = \sqrt{(x - y)^{t} (x - y)}$

– Minkowski metric
$d(x, y) = \left( \sum_{i} |x_i - y_i|^{m} \right)^{1/m}$

– Mahalanobis distance
$d(x, y) = \sqrt{(x - y)^{t} S^{-1} (x - y)}$
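A short Python sketch of the four distance measures, assuming x and y are equal-length lists of numbers. The function names are illustrative, and the Mahalanobis version assumes the caller passes the inverse covariance matrix S^-1 already computed, as a list of rows.

```python
from math import sqrt


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def minkowski(x, y, m):
    """Generalizes Manhattan (m = 1) and Euclidean (m = 2)."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)


def mahalanobis(x, y, s_inv):
    """sqrt((x - y)^t S^-1 (x - y)); s_inv is S^-1 as a list of rows."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    quad = sum(diff[i] * s_inv[i][j] * diff[j] for i in range(n) for j in range(n))
    return sqrt(quad)


# Hypothetical points; with S^-1 equal to the identity matrix the
# Mahalanobis distance reduces to the Euclidean distance.
x, y = [0.0, 0.0], [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(manhattan(x, y), euclidean(x, y), minkowski(x, y, 2), mahalanobis(x, y, identity))
```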

Page 8

Similarity Matrix

d_ij = similarity of D_i to D_j

        D_1    D_2    ...    D_n
D_1            d_12   ...    d_1n
D_2     d_21          ...    d_2n
...
D_n     d_n1   d_n2   ...

Note that d_ij = d_ji (i.e., the matrix is symmetric), so we only need the lower triangle part of the matrix:

        D_1    D_2    ...    D_(n-1)
D_2     d_21
D_3     d_31   d_32
...
D_n     d_n1   d_n2   ...    d_n(n-1)

Page 9

Similarity Matrix - Example

Document-term matrix:

        T1  T2  T3  T4  T5  T6  T7  T8
Doc1     0   4   0   0   0   2   1   3
Doc2     3   1   4   3   1   2   0   1
Doc3     3   0   0   0   3   0   3   0
Doc4     0   1   0   3   0   0   2   0
Doc5     2   2   2   3   1   4   0   2

$\mathrm{sim}(T_i, T_j) = \sum_{k=1}^{N} (w_{ik} \times w_{jk})$

Term-Term Similarity Matrix

      T1  T2  T3  T4  T5  T6  T7
T2     7
T3    16   8
T4    15  12  18
T5    14   3   6   6
T6    14  18  16  18   6
T7     9   6   0   6   9   2
T8     7  17   8   9   3  16   3
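The term-term matrix can be recomputed from the document-term matrix with a few lines of Python. This is a sketch, assuming w_ik is the weight of term T_i in document k; the function name is illustrative.

```python
# Document-term matrix from the slide: one row per document, one column per term.
doc_term = [
    [0, 4, 0, 0, 0, 2, 1, 3],  # Doc1
    [3, 1, 4, 3, 1, 2, 0, 1],  # Doc2
    [3, 0, 0, 0, 3, 0, 3, 0],  # Doc3
    [0, 1, 0, 3, 0, 0, 2, 0],  # Doc4
    [2, 2, 2, 3, 1, 4, 0, 2],  # Doc5
]


def term_term_similarity(doc_term):
    """sim(T_i, T_j) = sum over documents k of w_ik * w_jk."""
    n_terms = len(doc_term[0])
    columns = [[row[t] for row in doc_term] for t in range(n_terms)]
    return [
        [sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(n_terms)]
        for i in range(n_terms)
    ]


sim = term_term_similarity(doc_term)

# Print only the lower triangle, since the matrix is symmetric.
for i in range(1, len(sim)):
    print("T%d" % (i + 1), *sim[i][:i])
```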

Page 10

Similarity Thresholds

A similarity threshold is used to mark pairs that are “sufficiently” similar
– The threshold value is application and collection dependent

Term-term similarity matrix from the previous example:

      T1  T2  T3  T4  T5  T6  T7
T2     7
T3    16   8
T4    15  12  18
T5    14   3   6   6
T6    14  18  16  18   6
T7     9   6   0   6   9   2
T8     7  17   8   9   3  16   3

Using a threshold value of 10 in the previous example:

      T1  T2  T3  T4  T5  T6  T7
T2     0
T3     1   0
T4     1   1   1
T5     1   0   0   0
T6     1   1   1   1   0
T7     0   0   0   0   0   0
T8     0   1   0   0   0   1   0
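Applying the threshold is a single comparison per entry, sketched below in Python. The strict `>` is an assumption: no similarity in this example equals 10, so the slide does not show whether the comparison is strict. The dictionary layout and function name are illustrative.

```python
# Lower triangle of the term-term similarity matrix (rows T2..T8, columns T1..T7).
lower = {
    "T2": [7],
    "T3": [16, 8],
    "T4": [15, 12, 18],
    "T5": [14, 3, 6, 6],
    "T6": [14, 18, 16, 18, 6],
    "T7": [9, 6, 0, 6, 9, 2],
    "T8": [7, 17, 8, 9, 3, 16, 3],
}


def apply_threshold(rows, threshold):
    """Mark each pair 1 if its similarity exceeds the threshold, else 0."""
    return {term: [1 if v > threshold else 0 for v in row] for term, row in rows.items()}


binary = apply_threshold(lower, 10)
for term, row in binary.items():
    print(term, *row)
```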

Page 11

Clustering methods

Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

Page 12

Partitioning methods

K-means
1) Choose k objects as the initial cluster centers; set i=0
2) Loop
3) For each object
4) Assign the object to its nearest centroid
5) Compute the mean of each cluster as its new center
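The K-means steps can be sketched in plain Python as follows. Several choices here are assumptions not stated on the slide: the first k points serve as initial centers, squared Euclidean distance decides the nearest centroid, the loop stops when assignments no longer change, and an empty cluster keeps its previous center. The sample points are hypothetical.

```python
def kmeans(points, k, max_iter=100):
    """Return (labels, centers) after iterating assignment and mean update."""
    # Step 1: choose k objects as the initial centers (assumed: the first k).
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):  # Step 2: loop
        new_labels = []
        for p in points:  # Step 3: for each object
            # Step 4: assign the object to its nearest centroid.
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:  # assumed stopping rule: no change
            break
        labels = new_labels
        # Step 5: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

The result depends on the initial centers, which is why practical implementations usually try several random starts.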

Page 13

Partitioning methods

[Figure: two panels, I and II, with the cluster centroids marked]

Page 14

Partitioning Method (Iterative method)

The basic algorithm:
1. select M cluster representatives (centroids)
2. for i = 1 to N, assign D_i to the most similar centroid
3. for j = 1 to M, recalculate the cluster centroid C_j
4. repeat steps 2 and 3 until there is (little or) no change in clusters

Example:

Initial (arbitrary) assignment: C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6}

        T1  T2  T3  T4  T5  T6  T7  T8    C1   C2   C3
Doc1     0   4   0   0   0   2   1   3    4/2  0/2  2/2
Doc2     3   1   4   3   1   2   0   1    4/2  7/2  3/2
Doc3     3   0   0   0   3   0   3   0    3/2  0/2  3/2
Doc4     0   1   0   3   0   0   2   0    1/2  3/2  0/2
Doc5     2   2   2   3   1   4   0   2    4/2  5/2  5/2

The columns C1, C2 and C3 are the cluster centroids.

Page 15

Partitioning Method (Iterative method)

Example (continued)

Now, using the simple similarity measure, compute the new cluster-term similarity matrix:

            T1      T2      T3      T4      T5      T6      T7      T8
Class1      29/2    29/2    24/2    27/2    17/2    32/2    15/2    24/2
Class2      31/2    20/2    38/2    45/2    12/2    34/2    6/2     17/2
Class3      28/2    21/2    22/2    24/2    17/2    30/2    11/2    19/2
Assign to   Class2  Class1  Class2  Class2  Class3  Class2  Class1  Class1

Now compute new cluster centroids using the original document-term matrix:

        T1  T2  T3  T4  T5  T6  T7  T8    C1   C2    C3
Doc1     0   4   0   0   0   2   1   3    8/3  2/4   0/1
Doc2     3   1   4   3   1   2   0   1    2/3  12/4  1/1
Doc3     3   0   0   0   3   0   3   0    3/3  3/4   3/1
Doc4     0   1   0   3   0   0   2   0    3/3  3/4   0/1
Doc5     2   2   2   3   1   4   0   2    4/3  11/4  1/1

The process is repeated until no further changes are made to the clusters.
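One pass of this worked example can be reproduced with exact fractions in Python. This is a sketch only: the function names are illustrative, and the tie-breaking rule is an assumption. T5 scores 17/2 against both Class1 and Class3, the slides do not state a rule, and the code sends ties to the higher-numbered class only because that matches the slide's choice.

```python
from fractions import Fraction

doc_term = [
    [0, 4, 0, 0, 0, 2, 1, 3],  # Doc1
    [3, 1, 4, 3, 1, 2, 0, 1],  # Doc2
    [3, 0, 0, 0, 3, 0, 3, 0],  # Doc3
    [0, 1, 0, 3, 0, 0, 2, 0],  # Doc4
    [2, 2, 2, 3, 1, 4, 0, 2],  # Doc5
]


def centroids(doc_term, clusters):
    """Per cluster: the mean of its member term columns, one value per document."""
    return [
        [Fraction(sum(row[t] for t in members), len(members)) for row in doc_term]
        for members in clusters
    ]


def cluster_term_similarity(doc_term, cents):
    """Simple (inner product) similarity of each centroid to each term column."""
    n_docs, n_terms = len(doc_term), len(doc_term[0])
    return [
        [sum(c[d] * doc_term[d][t] for d in range(n_docs)) for t in range(n_terms)]
        for c in cents
    ]


def assign(sim):
    """Send each term to its most similar cluster (assumed: ties go to the higher index)."""
    clusters = [[] for _ in sim]
    for t in range(len(sim[0])):
        best = max(range(len(sim)), key=lambda c: (sim[c][t], c))
        clusters[best].append(t)
    return clusters


initial = [[0, 1], [2, 3], [4, 5]]  # C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6}
cents = centroids(doc_term, initial)
sim = cluster_term_similarity(doc_term, cents)
clusters = assign(sim)
new_cents = centroids(doc_term, clusters)
print([["T%d" % (t + 1) for t in members] for members in clusters])
```

Repeating the last three calls until `clusters` stops changing gives the full iterative method.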

Page 16

Clustering methods

Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

Page 17

Hierarchical methods

Group objects into a tree of clusters

Types
– Agglomerative: bottom-up approach
  • Single-linkage
  • Complete-linkage
  • Group-linkage
  • Centroid-linkage
  • Ward’s method
– Divisive: top-down approach
  • Use of K-means clustering

Page 18

Hierarchical methods

[Figure: tree of clusters over the objects a, b, c, d, e, showing the clusters {a, b}, {d, e}, {c, d, e} and {a, b, c, d, e}; one axis is labelled step 0 to step 4 and the other step 4 to step 0]

Page 19

Hierarchical methods

Ward’s method
– at each step, join the cluster pair whose merger minimizes the increase in the total within-group error sum of squares
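A single merge step of Ward's method can be sketched in Python by computing the error sum of squares directly. The function names and sample clusters below are hypothetical, and this brute-force version re-evaluates every pair rather than using an incremental update.

```python
def sse(points):
    """Within-group error sum of squares around the group mean."""
    mean = [sum(col) / len(points) for col in zip(*points)]
    return sum(sum((a - m) ** 2 for a, m in zip(p, mean)) for p in points)


def merge_cost(a, b):
    """Increase in total within-group SSE caused by merging clusters a and b."""
    return sse(a + b) - sse(a) - sse(b)


def ward_step(clusters):
    """Return the index pair (i, j), i < j, whose merger is cheapest."""
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    return min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))


# Hypothetical current clusters: three singletons and one pair.
clusters = [
    [(0.0, 0.0)],
    [(0.0, 1.0)],
    [(5.0, 5.0)],
    [(6.0, 5.0), (5.0, 6.0)],
]
i, j = ward_step(clusters)
print(i, j, merge_cost(clusters[i], clusters[j]))
```

In this data the singleton (5, 5) joins the existing pair before the two closer singletons are merged, because that merger adds less to the total error.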

Page 20

Page 21

Clustering methods

Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

Page 22

Graph Representation

The similarity matrix can be visualized as an undirected graph
– each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix)

      T1  T2  T3  T4  T5  T6  T7
T2     0
T3     1   0
T4     1   1   1
T5     1   0   0   0
T6     1   1   1   1   0
T7     0   0   0   0   0   0
T8     0   1   0   0   0   1   0

[Figure: undirected graph on the nodes T1 to T8]
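Turning the threshold matrix into a graph amounts to adding one undirected edge per 1 entry. The following Python sketch assumes an adjacency-set representation; the dictionary layout and function name are illustrative.

```python
# Lower triangle of the similarity threshold matrix (rows T2..T8, columns T1..T7).
threshold_rows = {
    "T2": [0],
    "T3": [1, 0],
    "T4": [1, 1, 1],
    "T5": [1, 0, 0, 0],
    "T6": [1, 1, 1, 1, 0],
    "T7": [0, 0, 0, 0, 0, 0],
    "T8": [0, 1, 0, 0, 0, 1, 0],
}


def build_graph(rows):
    """Map each node to the set of nodes it shares an edge with."""
    nodes = ["T1"] + list(rows)
    graph = {n: set() for n in nodes}
    for term, row in rows.items():
        # Entry k of a row compares `term` with the k-th node overall.
        for other, flag in zip(nodes, row):
            if flag:
                graph[term].add(other)
                graph[other].add(term)
    return graph


graph = build_graph(threshold_rows)
for node in sorted(graph):
    print(node, sorted(graph[node]))
```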

Page 23

Clustering Algorithms (Graph-based)

Basic clustering techniques try to determine which objects belong to the same class

Clique Method (complete link)
– all items within a cluster must be within the similarity threshold of all other items in that cluster
– clusters may overlap
– generally produces small but very tight clusters

Single Link Method
– any item in a cluster must be within the similarity threshold of at least one other item in that cluster
– produces larger but weaker clusters

Other methods
– star method - start with an item and place all related items in that cluster
– string method - start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on

Page 24

Clustering Algorithms (Graph-based)

Clique Method
– a clique is a completely connected subgraph of a graph
– in the clique method, each maximal clique in the graph becomes a cluster

[Figure: the graph on the nodes T1 to T8 from the previous example]

Maximal cliques (and therefore the clusters) in the previous example are:

{T1, T3, T4, T6}
{T2, T4, T6}
{T2, T6, T8}
{T1, T5}
{T7}

Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
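The maximal cliques listed above can be checked with a short Python sketch. The slides do not say how the cliques are found; the basic Bron-Kerbosch recursion used here is a swapped-in standard technique, and the adjacency sets are transcribed from the threshold matrix.

```python
graph = {
    "T1": {"T3", "T4", "T5", "T6"},
    "T2": {"T4", "T6", "T8"},
    "T3": {"T1", "T4", "T6"},
    "T4": {"T1", "T2", "T3", "T6"},
    "T5": {"T1"},
    "T6": {"T1", "T2", "T3", "T4", "T8"},
    "T7": set(),
    "T8": {"T2", "T6"},
}


def maximal_cliques(graph):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        # No vertex can extend `current`, and none was excluded: it is maximal.
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in sorted(candidates):
            expand(current | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return sorted(cliques)


for clique in maximal_cliques(graph):
    print(clique)
```

Because T4 and T6 each appear in several cliques, the resulting clusters overlap, as the clique method allows.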

Page 25

Clustering Algorithms (Graph-based)

Single Link Method
1. select an item not in a cluster and place it in a new cluster
2. place all other related items in that cluster
3. repeat step 2 for each item in the cluster until nothing more can be added
4. repeat steps 1-3 for each item that remains unclustered

[Figure: the graph on the nodes T1 to T8 from the previous example]

In this case the single link method produces only two clusters:

{T1, T3, T4, T5, T6, T2, T8}
{T7}

Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
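On a graph, the single link steps amount to collecting connected components. A minimal Python sketch follows, assuming the adjacency sets transcribed from the threshold matrix and processing items in name order (the slides do not fix an order).

```python
graph = {
    "T1": {"T3", "T4", "T5", "T6"},
    "T2": {"T4", "T6", "T8"},
    "T3": {"T1", "T4", "T6"},
    "T4": {"T1", "T2", "T3", "T6"},
    "T5": {"T1"},
    "T6": {"T1", "T2", "T3", "T4", "T8"},
    "T7": set(),
    "T8": {"T2", "T6"},
}


def single_link(graph):
    """Grow each cluster from an unclustered item until nothing more can be added."""
    clusters, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        cluster, frontier = {start}, [start]  # step 1: new cluster
        while frontier:  # steps 2-3: keep adding related items
            item = frontier.pop()
            for other in graph[item] - cluster:
                cluster.add(other)
                frontier.append(other)
        seen |= cluster
        clusters.append(sorted(cluster))  # step 4: continue with the next unclustered item
    return clusters


print(single_link(graph))
```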

Page 26

Clustering Algorithms (Graph-based)

Star method
– {t1, t3, t4, t5, t6}
– {t2, t8}
– {t7}

String method
– {t1, t3, t4, t2, t6, t8}
– {t5}
– {t7}

[Figure: the graph on the nodes T1 to T8 from the previous example]
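Both results can be reproduced with the Python sketch below. Two rules are assumptions chosen because they match the clusters on this slide, not because the slides state them: seeds are taken in name order, an item already placed in a cluster is not reused, and the string method follows the lowest-named unclustered neighbor.

```python
graph = {
    "T1": {"T3", "T4", "T5", "T6"},
    "T2": {"T4", "T6", "T8"},
    "T3": {"T1", "T4", "T6"},
    "T4": {"T1", "T2", "T3", "T6"},
    "T5": {"T1"},
    "T6": {"T1", "T2", "T3", "T4", "T8"},
    "T7": set(),
    "T8": {"T2", "T6"},
}


def star(graph):
    """Each seed takes all of its still-unclustered neighbors."""
    clusters, used = [], set()
    for seed in sorted(graph):
        if seed in used:
            continue
        cluster = [seed] + sorted(graph[seed] - used)
        used.update(cluster)
        clusters.append(cluster)
    return clusters


def string(graph):
    """Each seed starts a chain: add one neighbor of the last item added, repeat."""
    clusters, used = [], set()
    for seed in sorted(graph):
        if seed in used:
            continue
        cluster, last = [seed], seed
        used.add(seed)
        while True:
            candidates = sorted(graph[last] - used)
            if not candidates:
                break
            last = candidates[0]  # assumed rule: lowest-named neighbor
            cluster.append(last)
            used.add(last)
        clusters.append(cluster)
    return clusters


print(star(graph))
print(string(graph))
```

Both methods depend on the processing order, so a different seed order can give different clusters.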

Page 27

Clustering methods

Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

Page 28

Density-based methods

Clusters: density-connected sets

DBSCAN algorithm

Page 29

Density-based methods

Based on a set of density distribution functions

Page 30

Density-based methods

Based on a set of density distribution functions

Density function

Page 31

Clustering methods

Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

Page 32

Grid-based methods

Organize the data space as a grid file
– Determine clusters as density-connected components of the grid
– Approximate the clusters found by DBSCAN

Page 33

Clustering methods

Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

Page 34

Model-based methods

Optimize the fit between the given data and some mathematical model

[Figure: N-dimensional centroid vectors]

Page 35

Self-Organizing Map (SOM)

A sample data vector X is randomly chosen

BMU: best matching unit
– The map unit with centroid closest to X

$\lVert x - m_b \rVert = \min_i \{ \lVert x - m_i \rVert \}$

– Update the centroid vector

$m_i(t+1) = m_i(t) + \alpha(t) \, h_{bi}(t) \, (x - m_i(t))$

$h_{bi}(t) = e^{-\lVert r_b - r_i \rVert^{2} / (2\sigma^{2})}$

Learning rate: $\alpha(t)$
Neighborhood kernel function: $h_{bi}(t)$
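One SOM training step can be sketched in Python as follows, assuming a Gaussian neighborhood kernel over the map positions r_i and fixed values for the learning rate and kernel width. The names `alpha`, `sigma`, `units` and `positions`, and the sample values, are illustrative assumptions.

```python
from math import exp


def best_matching_unit(x, units):
    """Index b of the map unit whose centroid is closest to x."""
    dists = [sum((a - m) ** 2 for a, m in zip(x, unit)) for unit in units]
    return dists.index(min(dists))


def som_update(x, units, positions, alpha, sigma):
    """Move every centroid toward x, weighted by its map distance to the BMU."""
    b = best_matching_unit(x, units)
    updated = []
    for unit, pos in zip(units, positions):
        grid_d2 = sum((rb - ri) ** 2 for rb, ri in zip(positions[b], pos))
        h = exp(-grid_d2 / (2 * sigma ** 2))  # neighborhood kernel h_bi
        updated.append([m + alpha * h * (xi - m) for m, xi in zip(unit, x)])
    return b, updated


# Hypothetical 1-D map with three units holding 2-D centroid vectors.
units = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
positions = [(0,), (1,), (2,)]
x = [0.9, 1.2]
b, new_units = som_update(x, units, positions, alpha=0.5, sigma=1.0)
print(b, new_units)
```

The BMU moves the most because its kernel value is 1; its map neighbors move by a smaller fraction, which is what keeps nearby units similar. In practice both `alpha` and `sigma` are usually decreased over time.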

Page 36

Self-Organizing Map

[Figure: the output layer and an input sample, shown before updating and after updating]

Page 37

Self-Organizing Map

SOM