Cluster Analysis by SUSHIL KULKARNI

Cluster Analysis in Data Mining


DESCRIPTION

This is a lecture given at a refresher course for statistics teachers. It contains various algorithms along with examples.


Page 1: Cluster Analysis in Data Mining

Cluster Analysis

SUSHIL KULKARNI

Page 2: Cluster Analysis in Data Mining

Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Page 3: Cluster Analysis in Data Mining

What is a Cluster?

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
–Data points in one cluster are more similar to one another.
–Data points in separate clusters are less similar to one another.

Similarity measures:
–Euclidean distance if the attributes are continuous.
–Other problem-specific measures.

Page 4: Cluster Analysis in Data Mining

Outliers

Outliers are objects that do not belong to any cluster or form clusters of very small cardinality

In some applications we are interested in discovering outliers, not clusters (outlier analysis)

[Figure: a cluster of points, with a few outliers lying apart]

Page 5: Cluster Analysis in Data Mining

Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods

Page 6: Cluster Analysis in Data Mining

Data Structures

Data matrix (two modes): the "classic" data input, with n tuples/objects as rows and p attributes/dimensions as columns:

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity or distance matrix (one mode): the desired data input to some clustering algorithms, with objects on both axes:

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
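As a concrete illustration, here is a minimal Python sketch (ours, not from the slides) that converts a two-mode data matrix into the one-mode dissimilarity matrix using Euclidean distance:

```python
import math

# Two-mode data matrix: n objects (rows) x p attributes (columns).
X = [[1.0, 2.0],
     [4.0, 6.0],
     [5.0, 8.0]]

def dissimilarity_matrix(X):
    """One-mode matrix: d(i, j) for every pair of objects, via Euclidean distance."""
    n = len(X)
    return [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]

for row in dissimilarity_matrix(X):
    print([round(v, 2) for v in row])
# [0.0, 5.0, 7.21]
# [5.0, 0.0, 2.24]
# [7.21, 2.24, 0.0]
```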

Page 7: Cluster Analysis in Data Mining

Measuring Similarity in Clustering

Dissimilarity/Similarity metric:

The dissimilarity d(i, j) between two objects i and j is expressed in terms of a distance function, which is typically a metric:

1. d(i, j) ≥ 0 (non-negativity)
2. d(i, i) = 0 (isolation)
3. d(i, j) = d(j, i) (symmetry)
4. d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)

Page 8: Cluster Analysis in Data Mining

Types of data in cluster analysis

Interval-scaled variables

e.g., salary, height

Binary variables

e.g., gender (M/F), has_cancer(T/F)

Nominal (categorical) variables

e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

Page 9: Cluster Analysis in Data Mining

Similarity and Dissimilarity Between Objects

Distance metrics are normally used to measure the similarity or dissimilarity between two data objects

Page 10: Cluster Analysis in Data Mining

Similarity and Dissimilarity Between Objects (Cont.)

Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{in}-x_{jn}|^2}$$

Properties:
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)

One can also use a weighted distance:

$$d(i,j) = \sqrt{w_1|x_{i1}-x_{j1}|^2 + w_2|x_{i2}-x_{j2}|^2 + \cdots + w_n|x_{in}-x_{jn}|^2}$$
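These two formulas translate directly into code; a minimal sketch (function names are ours):

```python
import math

def euclidean(x, y):
    """d(i,j): square root of the summed squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_euclidean(x, y, w):
    """Same, but each attribute difference is scaled by a weight w_k."""
    return math.sqrt(sum(wk * (a - b) ** 2 for a, b, wk in zip(x, y, w)))

print(euclidean((3, 4), (7, 4)))                   # 4.0
print(weighted_euclidean((3, 4), (7, 4), (2, 1)))  # 5.66: first attribute counts double
```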

Page 11: Cluster Analysis in Data Mining

Binary Variables

A binary variable has two states: 0 (absent), 1 (present).

A contingency table for binary data:

                object j
                1        0        sum
object i   1    a        b        a+b
           0    c        d        c+d
         sum    a+c      b+d      p

Jaccard coefficient distance (noninvariant if the binary variable is asymmetric):

$$d(i,j) = \frac{b + c}{a + b + c}$$

Page 12: Cluster Analysis in Data Mining

Binary Variables

A 1-1 match is a stronger indicator of similarity than a 0-0 match.
It may be possible that 0-0 matches can be ignored.
The simple matching coefficient gives equal weight to 1-1 matches and 0-0 matches.
The Jaccard coefficient ignores 0-0 matches.

Page 13: Cluster Analysis in Data Mining

Dissimilarity between Binary Variables

Example (Jaccard coefficient)

Name    Fever  Cough  Test-1  Test-2  Test-3  Test-4
Tina    1      0      1       0       0       0
Dina    1      0      1       0       1       0
Meena   1      1      0       0       0       0

Page 14: Cluster Analysis in Data Mining

Dissimilarity between Binary Variables

Name    Fever  Cough  Test-1  Test-2  Test-3  Test-4
Tina    1      0      1       0       0       0
Dina    1      0      1       0       1       0

Consider the names Tina and Dina:

a = number of attributes where Tina is 1 and Dina is 1 = 2 (Fever, Test-1)
b = number of attributes where Tina is 1 and Dina is 0 = 0
c = number of attributes where Tina is 0 and Dina is 1 = 1 (Test-3)
d = number of attributes where both names are 0 = 3 (Cough, Test-2, Test-4)

Page 15: Cluster Analysis in Data Mining

Dissimilarity between Binary Variables

Using the Jaccard coefficient distance

$$d(i,j) = \frac{b + c}{a + b + c}$$

with a = 2, b = 0, c = 1, d = 3 for (Tina, Dina), and the analogous counts for the other pairs, we get:

d(Tina, Dina) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Tina, Meena) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Meena, Dina) = (1 + 2) / (1 + 1 + 2) = 0.75

Since these coefficients are dissimilarities, the smallest value marks the most similar pair: Tina and Dina are the most similar, Meena and Dina the least similar, and the remaining pair lies between these extremes.
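The whole example can be checked with a few lines of Python (a sketch of ours; the counting follows the a, b, c definitions above):

```python
def jaccard_distance(x, y):
    """Jaccard dissimilarity for binary vectors: d = (b + c) / (a + b + c)."""
    a = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (1, 1))
    b = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (1, 0))
    c = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (0, 1))
    return (b + c) / (a + b + c)

# Attributes: Fever, Cough, Test-1, Test-2, Test-3, Test-4
tina  = (1, 0, 1, 0, 0, 0)
dina  = (1, 0, 1, 0, 1, 0)
meena = (1, 1, 0, 0, 0, 0)

print(round(jaccard_distance(tina, dina), 2))   # 0.33
print(round(jaccard_distance(tina, meena), 2))  # 0.67
print(round(jaccard_distance(meena, dina), 2))  # 0.75
```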

Page 16: Cluster Analysis in Data Mining

Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods

Page 17: Cluster Analysis in Data Mining

Major Clustering Approaches

Partitioning algorithms: construct random partitions and then iteratively refine them by some criterion.

Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.

Page 18: Cluster Analysis in Data Mining

Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods

Page 19: Cluster Analysis in Data Mining

Partitioning Algorithms: Basic Concepts

Partitioning method: construct a partition of a database D of n objects into a set of k clusters.

Methods: the k-means and k-medoids algorithms.
k-means (MacQueen '67): each cluster is represented by the center of the cluster.
k-medoids or PAM, Partitioning Around Medoids (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster.

Page 20: Cluster Analysis in Data Mining

Centroid or Medoid

[Figure: the centroid (mean point) of a cluster vs. the medoid (a centrally located object)]

Page 21: Cluster Analysis in Data Mining

The k-means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:

1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no more new assignments.

Page 22: Cluster Analysis in Data Mining

K-Means Example

Given: {2,4,10,12,3,20,30,11,25}. Assume that we want two clusters.

Write the elements in increasing order: {2,3,4,10,11,12,20,25,30}
Randomly assign means: m1 = 3, m2 = 4
K1 = {2,3}, K2 = {4,10,11,12,20,25,30}; the means become m1 = 2.5, m2 = 16
K1 = {2,3,4}, K2 = {10,11,12,20,25,30}; the means become m1 = 3, m2 = 18
K1 = {2,3,4,10}, K2 = {11,12,20,25,30}; the means become m1 = 4.75, m2 = 19.6
K1 = {2,3,4,10,11,12}, K2 = {20,25,30}; the means become m1 = 7, m2 = 25
Stop: with these means, no more jumps between K2 and K1 are possible.
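The same run can be reproduced with a small Python sketch (ours, not from the lecture); starting from the given means m1 = 3 and m2 = 4 it converges to K1 = {2,3,4,10,11,12} and K2 = {20,25,30}:

```python
def kmeans_1d(points, means):
    """Basic k-means on scalars: assign each point to the nearest mean,
    recompute the means, and stop when nothing changes."""
    while True:
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:      # no reassignment is possible: stop
            return clusters, means
        means = new_means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(data, [3.0, 4.0])
print(clusters)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
print(means)      # [7.0, 25.0]
```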

Page 23: Cluster Analysis in Data Mining

The k-means Clustering Method

Example

[Figure: four scatter plots on a 0-10 grid showing successive k-means iterations]

Page 24: Cluster Analysis in Data Mining

Comments on the k-means Method

Applicable only when a mean is defined; what about categorical data?

Need to specify k, the number of clusters, in advance

Unable to handle noisy data and outliers

Page 25: Cluster Analysis in Data Mining

The K-Medoids Clustering Method


A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point in the cluster.

Page 26: Cluster Analysis in Data Mining

The K-Medoids Clustering Algorithm


The medoid clustering algorithm is as follows (a Python sketch follows the steps):

1. Begin with an arbitrary selection of k objects as the medoid points out of the n data points (n > k).
2. After selecting the k medoid points, associate each data object in the given data set with its most similar medoid; similarity here is defined using a distance measure such as the Euclidean distance.
3. Randomly select a non-medoid object O′.
4. Compute the total cost S of swapping an initial medoid object with O′.
5. If S < 0, swap the initial medoid with the new one (there will then be a new set of medoids).
6. Repeat steps 2 to 5 until there is no change in the medoids.
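A compact Python sketch of this procedure (ours; instead of a single random O′ per round it tries every candidate swap, which is the usual PAM formulation, and it uses Manhattan distance to match the worked example that follows). Because it tests all swaps, it can settle on a lower-cost configuration than the single swap examined in the example:

```python
import random

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def total_cost(points, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    medoids = random.Random(seed).sample(points, k)   # step 1: arbitrary medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                                   # step 6: until no change
        improved = False
        for i in range(k):
            for o in points:                          # step 3: candidate O'
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(points, candidate)  # step 4: cost after the swap
                if cost < best:                       # step 5: keep it if cheaper
                    medoids, best = candidate, cost
                    improved = True
    return medoids, best

data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
        (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
print(k_medoids(data, 2))   # prints the locally optimal medoids and their total cost
```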

Page 27: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

Cluster the following data set of ten objects into two clusters, i.e., k = 2. Consider the data set of ten objects below:


Object   x   y
X1       2   6
X2       3   4
X3       3   8
X4       4   7
X5       6   2
X6       6   4
X7       7   3
X8       7   4
X9       8   5
X10      7   6

Page 28: Cluster Analysis in Data Mining

The K-Medoids Clustering Method


Page 29: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

Step 1

Initialize the k centres: let us assume c1 = (3,4) and c2 = (7,4), so c1 and c2 are selected as the medoids. Calculate the distance from each data object to its nearest medoid; the cost is calculated using the Minkowski distance metric.

Page 30: Cluster Analysis in Data Mining

Similarity and Dissimilarity Between Objects

The most popular distance measures conform to the Minkowski distance:

$$L_p(i,j) = \left( |x_{i1}-x_{j1}|^p + |x_{i2}-x_{j2}|^p + \cdots + |x_{in}-x_{jn}|^p \right)^{1/p}$$

where i = (x_{i1}, x_{i2}, ..., x_{in}) and j = (x_{j1}, x_{j2}, ..., x_{jn}) are two n-dimensional data objects, and p is a positive integer.

Page 31: Cluster Analysis in Data Mining

Similarity and Dissimilarity Between Objects

If p = 1, then L1 is the Manhattan distance:

$$L_1(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{in}-x_{jn}|$$
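Both cases come from the one general formula; a tiny sketch of ours:

```python
def minkowski(x, y, p):
    """L_p distance between two n-dimensional data objects."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

print(minkowski((3, 4), (2, 6), 1))   # 3.0 (L1, Manhattan: used for the cost tables below)
print(minkowski((3, 4), (2, 6), 2))   # about 2.24 (L2, Euclidean)
```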

Page 32: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

Costs (distances) from medoid c1 = (3,4):

Data object (Xi)    Cost (distance)
(2,6)               3
(3,8)               4
(4,7)               4
(6,2)               5
(6,4)               3
(7,3)               5
(8,5)               6
(7,6)               6

Costs (distances) from medoid c2 = (7,4):

Data object (Xi)    Cost (distance)
(2,6)               7
(3,8)               8
(4,7)               6
(6,2)               3
(6,4)               1
(7,3)               1
(8,5)               2
(7,6)               2

Page 33: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

So the clusters become:

Cluster 1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster 2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

The points (2,6), (3,8) and (4,7) are close to c1, so they form one cluster, while the remaining points form the other.

The cost between any two points is found using the formula

$$\mathrm{cost}(x, c) = \sum_{i=1}^{d} |x_i - c_i|$$

where x is any data object, c is the medoid, and d is the dimension of the objects, which in this case is 2.

The total cost is the sum of the costs of the data objects from the medoids of their clusters, so here:

Page 34: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

Total Cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))} + {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3)) + cost((7,4),(8,5)) + cost((7,4),(7,6))}
= 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20

Cluster 1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster 2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

Page 35: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

[Figure: the clusters after step 1]

Page 36: Cluster Analysis in Data Mining

The K-Medoids Clustering Method


Step 2

Randomly select a non-medoid O′: let us assume O′ = (7,3). So now the candidate medoids are c1 = (3,4) and O′ = (7,3). If c1 and O′ are the new medoids, calculate the total cost involved using the formula from step 1.

Page 37: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

Costs (distances) from medoid c1 = (3,4):

Data object (Xi)    Cost (distance)
(2,6)               3
(3,8)               4
(4,7)               4
(6,2)               5
(6,4)               3
(7,4)               4
(8,5)               6
(7,6)               6

Costs (distances) from O′ = (7,3):

Data object (Xi)    Cost (distance)
(2,6)               8
(3,8)               9
(4,7)               7
(6,2)               2
(6,4)               2
(7,4)               1
(8,5)               3
(7,6)               3

Page 38: Cluster Analysis in Data Mining

The K-Medoids Clustering Method


Page 39: Cluster Analysis in Data Mining

The K-Medoids Clustering Method


Total Cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22

So the cost of swapping the medoid from c2 to O′ is:
S = current total cost - past total cost = 22 - 20 = 2 > 0

Since S > 0, moving to O′ would be a bad idea; the previous choice was good, and the algorithm terminates here (i.e., there is no change in the medoids).
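The swap test is easy to verify numerically; a small sketch (ours), reusing the Manhattan cost:

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def total_cost(points, medoids):
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
        (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

current = total_cost(data, [(3, 4), (7, 4)])   # 20
swapped = total_cost(data, [(3, 4), (7, 3)])   # 22
print(swapped - current)                       # S = 2 > 0: reject the swap
```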

Page 40: Cluster Analysis in Data Mining

Cluster Analysis

What is a Cluster?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods

Page 41: Cluster Analysis in Data Mining

Hierarchical Clustering

Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

[Figure: over steps 0 to 4, agglomerative clustering (AGNES) merges objects a, b, c, d, e into {a,b}, {d,e}, {c,d,e} and finally {a,b,c,d,e}; divisive clustering (DIANA) runs the same steps in reverse]

Page 42: Cluster Analysis in Data Mining

AGNES (Agglomerative Nesting)

Implemented in statistical analysis packages, e.g., Splus

Use the Single-Link method and the dissimilarity matrix.

Merge objects that have the least dissimilarity

Go on in a non-descending fashion

Eventually all objects belong to the same cluster

Single-Link: each time, merge the clusters (C1, C2) that are connected by the shortest single link of objects, i.e., $\min_{p \in C_1,\, q \in C_2} \mathrm{dist}(p, q)$

[Figure: three scatter plots on a 0-10 grid showing successive single-link merges]

Page 43: Cluster Analysis in Data Mining

AGNES (Agglomerative Nesting)

• Uses the dissimilarity/distance matrix as input.
• Start with each individual item in its own cluster.
• Merge the nodes that have the least dissimilarity.
• Go on in a non-descending fashion.
• Eventually all nodes belong to the same cluster.
• The output is a dendrogram, represented as a set of ordered triples (d, k, K), where d is the threshold distance, k is the number of clusters, and K is the set of clusters.

Page 44: Cluster Analysis in Data Mining

AGNES (Agglomerative Nesting): Minimum Distance Method

[Figure: the complete graph on A, B, C, D, E labelled with all pairwise distances]

[Figure: the graph with threshold dmin = 1]

[Figure: the corresponding single-link dendrogram]

Page 45: Cluster Analysis in Data Mining

AGNES (Agglomerative Nesting): Minimum Distance Method

[Figure: the graph with threshold dmin = 2]

[Figure: the graph with threshold dmin = 3]

[Figure: the graph with threshold dmin = 4]

[Figure: the single-link dendrogram over A, B, C, D, E]

Page 46: Cluster Analysis in Data Mining

A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

E.g., level 1 gives 4 clusters: {a,b}, {c}, {d}, {e}; level 2 gives 3 clusters: {a,b}, {c}, {d,e}; level 3 gives 2 clusters: {a,b}, {c,d,e}; etc.

[Figure: dendrogram over objects a, b, c, d, e with cut levels 1 to 4]

Page 47: Cluster Analysis in Data Mining

Agglomerative Example

Distance matrix:

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

[Figure: the graph on A, B, C, D, E and the dendrogram built as the threshold grows from 1 to 5]
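A sketch (ours) of single-link agglomeration driven by this matrix; it prints each merge and the distance at which it happens, matching the dendrogram:

```python
# The distance matrix from the slide, keyed by unordered object pairs.
dist = {frozenset(p): d for p, d in [
    (("A", "B"), 1), (("A", "C"), 2), (("A", "D"), 2), (("A", "E"), 3),
    (("B", "C"), 2), (("B", "D"), 4), (("B", "E"), 3),
    (("C", "D"), 1), (("C", "E"), 5), (("D", "E"), 3),
]}

def link(c1, c2):
    """Single-link distance: the shortest object-to-object link between clusters."""
    return min(dist[frozenset((p, q))] for p in c1 for q in c2)

clusters = [{"A"}, {"B"}, {"C"}, {"D"}, {"E"}]
while len(clusters) > 1:
    # find the closest pair of clusters under the single-link distance
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
    d = link(clusters[i], clusters[j])
    print(f"merge {sorted(clusters[i])} + {sorted(clusters[j])} at distance {d}")
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] \
               + [clusters[i] | clusters[j]]
# merge ['A'] + ['B'] at distance 1
# merge ['C'] + ['D'] at distance 1
# merge ['A', 'B'] + ['C', 'D'] at distance 2
# merge ['A', 'B', 'C', 'D'] + ['E'] at distance 3
```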

Page 48: Cluster Analysis in Data Mining

MST Example

Using the same distance matrix as in the agglomerative example:

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

[Figure: a minimum spanning tree of the graph on A, B, C, D, E]

Page 49: Cluster Analysis in Data Mining

DIANA (Divisive Analysis)

Implemented in statistical analysis packages, e.g., Splus

Inverse order of AGNES

Eventually each node forms a cluster on its own

[Figure: three scatter plots on a 0-10 grid showing successive divisive splits]

Page 50: Cluster Analysis in Data Mining

DIANA (Divisive Analysis)

o Inverse order of AGNES.
o Initially all items are placed in one cluster.
o Clusters are split when some elements are not sufficiently close to other elements.
o Eventually each node forms a cluster on its own.
o A simple example of a divisive algorithm is based on the MST version of the single-link algorithm: edges are cut out of the minimum spanning tree from the largest to the smallest, as the sketch below shows.
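A sketch (ours) of that MST-based variant on the A-E example: build the minimum spanning tree with Kruskal's algorithm, then cut its edges from the largest weight down and report the connected components. Ties in the MST may be broken differently from the figures that follow, but the resulting clusterings are the same.

```python
nodes = {"A", "B", "C", "D", "E"}
edges = [(1, "A", "B"), (2, "A", "C"), (2, "A", "D"), (3, "A", "E"),
         (2, "B", "C"), (4, "B", "D"), (3, "B", "E"),
         (1, "C", "D"), (5, "C", "E"), (3, "D", "E")]

# Kruskal: repeatedly add the shortest edge that joins two different components.
parent = {n: n for n in nodes}
def find(n):
    while parent[n] != n:
        n = parent[n]
    return n

mst = []
for w, u, v in sorted(edges):
    if find(u) != find(v):
        parent[find(u)] = find(v)
        mst.append((w, u, v))

def components(edge_list):
    """Connected components of `nodes` under the given edges (depth-first search)."""
    adj = {n: [] for n in nodes}
    for _, u, v in edge_list:
        adj[u].append(v)
        adj[v].append(u)
    comps, seen = [], set()
    for n in sorted(nodes):
        if n not in seen:
            stack, comp = [n], set()
            while stack:
                x = stack.pop()
                if x not in comp:
                    comp.add(x)
                    stack.extend(adj[x])
            seen |= comp
            comps.append(sorted(comp))
    return comps

remaining = mst[:]                # already sorted smallest-first
while remaining:
    w, u, v = remaining.pop()     # cut the largest remaining MST edge
    print(f"cut {u}{v} (weight {w}):", components(remaining))
# cut AE (weight 3): [['A', 'B', 'C', 'D'], ['E']]
# cut AC (weight 2): [['A', 'B'], ['C', 'D'], ['E']]
# cut CD (weight 1): [['A', 'B'], ['C'], ['D'], ['E']]
# cut AB (weight 1): [['A'], ['B'], ['C'], ['D'], ['E']]
```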

Page 51: Cluster Analysis in Data Mining

DIANA (Divisive Analysis)

[Figure: the complete graph on A, B, C, D, E with all pairwise distances, and its minimum spanning tree with edges AB = 1, CD = 1, BC = 2, DE = 3]

Page 52: Cluster Analysis in Data Mining

DIANA (Divisive Analysis)

Cut the largest edge, ED. The cluster {A,B,C,D,E} is split into two clusters: {E} and {A,B,C,D}.

[Figure: the spanning tree after removing edge ED: E isolated; A, B, C, D still connected]

Page 53: Cluster Analysis in Data Mining

DIANA (Divisive Analysis)

The two clusters are {E} and {A,B,C,D}. Next cut the edge between B and C; the cluster {A,B,C,D} is split into {A,B} and {C,D}.

[Figure: the remaining edges after cutting BC: AB and CD]

Page 54: Cluster Analysis in Data Mining

DIANA (Divisive Analysis)

In the next step the remaining edges are cut, finally giving the clusters {E}, {A}, {B}, {C}, {D}.

[Figure: all edges removed; each object is its own cluster]

Page 55: Cluster Analysis in Data Mining

More on Hierarchical Clustering Methods

Integration of hierarchical with distance-based clustering

BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters

CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

CHAMELEON (1999): hierarchical clustering using dynamic modeling

