Cluster Analysis by SUSHIL KULKARNI

Cluster Analysis in Data Mining


DESCRIPTION

This is a lecture given at a refresher course for statistics teachers. It contains various algorithms along with examples.


Page 1: Cluster Analysis in Data Mining

Cluster Analysis

SUSHIL KULKARNI

Page 2: Cluster Analysis in Data Mining

Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Page 3: Cluster Analysis in Data Mining

What is a Cluster?

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
–Data points in one cluster are more similar to one another.
–Data points in separate clusters are less similar to one another.

Similarity measures:
–Euclidean distance if the attributes are continuous.
–Other problem-specific measures.

Page 4: Cluster Analysis in Data Mining

Outliers

Outliers are objects that do not belong to any cluster or form clusters of very small cardinality

In some applications we are interested in discovering outliers, not clusters (outlier analysis)

[Figure: a cluster of points, with a few outliers lying apart]

Page 5: Cluster Analysis in Data Mining

Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods

Page 6: Cluster Analysis in Data Mining

Data Structures

Data matrix (two modes): the "classic" data input, with n tuples/objects as rows and p attributes/dimensions as columns:

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity or distance matrix (one mode): the desired data input to some clustering algorithms, with objects on both axes:

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
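As a concrete illustration, here is a minimal Python sketch (ours, not from the slides) that converts a two-mode data matrix into the one-mode dissimilarity matrix using Euclidean distance:

```python
import math

# Two-mode data matrix: n objects (rows) x p attributes (columns).
X = [[1.0, 2.0],
     [4.0, 6.0],
     [5.0, 8.0]]

def dissimilarity_matrix(X):
    """One-mode matrix: d(i, j) for every pair of objects, via Euclidean distance."""
    n = len(X)
    return [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]

for row in dissimilarity_matrix(X):
    print([round(v, 2) for v in row])
# [0.0, 5.0, 7.21]
# [5.0, 0.0, 2.24]
# [7.21, 2.24, 0.0]
```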

Page 7: Cluster Analysis in Data Mining

Measuring Similarity in Clustering

Dissimilarity/Similarity metric:

The dissimilarity d(i, j) between two objects i and j is expressed in terms of a distance function, which is typically a metric:

1. d(i, j) ≥ 0 (non-negativity)
2. d(i, i) = 0 (isolation)
3. d(i, j) = d(j, i) (symmetry)
4. d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)

Page 8: Cluster Analysis in Data Mining

Types of data in cluster analysis

Interval-scaled variables

e.g., salary, height

Binary variables

e.g., gender (M/F), has_cancer(T/F)

Nominal (categorical) variables

e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

Page 9: Cluster Analysis in Data Mining

Similarity and Dissimilarity Between Objects

Distance metrics are normally used to measure the similarity or dissimilarity between two data objects

Page 10: Cluster Analysis in Data Mining

Similarity and Dissimilarity Between Objects (Cont.)

Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{in}-x_{jn}|^2}$$

Properties:
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)

One can also use a weighted distance:

$$d(i,j) = \sqrt{w_1|x_{i1}-x_{j1}|^2 + w_2|x_{i2}-x_{j2}|^2 + \cdots + w_n|x_{in}-x_{jn}|^2}$$
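These two formulas translate directly into code; a minimal sketch (function names are ours):

```python
import math

def euclidean(x, y):
    """d(i,j): square root of the summed squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_euclidean(x, y, w):
    """Same, but each attribute difference is scaled by a weight w_k."""
    return math.sqrt(sum(wk * (a - b) ** 2 for a, b, wk in zip(x, y, w)))

print(euclidean((3, 4), (7, 4)))                   # 4.0
print(weighted_euclidean((3, 4), (7, 4), (2, 1)))  # 5.66: first attribute counts double
```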

Page 11: Cluster Analysis in Data Mining

Binary Variables

A binary variable has two states: 0 (absent), 1 (present).

A contingency table for binary data:

                object j
                1        0        sum
object i   1    a        b        a+b
           0    c        d        c+d
         sum    a+c      b+d      p

Jaccard coefficient distance (noninvariant if the binary variable is asymmetric):

$$d(i,j) = \frac{b + c}{a + b + c}$$

Page 12: Cluster Analysis in Data Mining

Binary Variables

A 1-1 match is a stronger indicator of similarity than a 0-0 match.
It may be possible that 0-0 matches can be ignored.
The simple matching coefficient gives equal weight to 1-1 matches and 0-0 matches.
The Jaccard coefficient ignores 0-0 matches.

Page 13: Cluster Analysis in Data Mining

Dissimilarity between Binary Variables

Example (Jaccard coefficient)

Name    Fever  Cough  Test-1  Test-2  Test-3  Test-4
Tina    1      0      1       0       0       0
Dina    1      0      1       0       1       0
Meena   1      1      0       0       0       0

Page 14: Cluster Analysis in Data Mining

Dissimilarity between Binary Variables

Name    Fever  Cough  Test-1  Test-2  Test-3  Test-4
Tina    1      0      1       0       0       0
Dina    1      0      1       0       1       0

Consider the names Tina and Dina:

a = number of attributes where Tina is 1 and Dina is 1 = 2 (Fever, Test-1)
b = number of attributes where Tina is 1 and Dina is 0 = 0
c = number of attributes where Tina is 0 and Dina is 1 = 1 (Test-3)
d = number of attributes where both names are 0 = 3 (Cough, Test-2, Test-4)

Page 15: Cluster Analysis in Data Mining

Dissimilarity between Binary Variables

Using the Jaccard coefficient distance

$$d(i,j) = \frac{b + c}{a + b + c}$$

with a = 2, b = 0, c = 1, d = 3 for (Tina, Dina), and the analogous counts for the other pairs, we get:

d(Tina, Dina) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Tina, Meena) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Meena, Dina) = (1 + 2) / (1 + 1 + 2) = 0.75

Since these coefficients are dissimilarities, the smallest value marks the most similar pair: Tina and Dina are the most similar, Meena and Dina the least similar, and the remaining pair lies between these extremes.
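The whole example can be checked with a few lines of Python (a sketch of ours; the counting follows the a, b, c definitions above):

```python
def jaccard_distance(x, y):
    """Jaccard dissimilarity for binary vectors: d = (b + c) / (a + b + c)."""
    a = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (1, 1))
    b = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (1, 0))
    c = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (0, 1))
    return (b + c) / (a + b + c)

# Attributes: Fever, Cough, Test-1, Test-2, Test-3, Test-4
tina  = (1, 0, 1, 0, 0, 0)
dina  = (1, 0, 1, 0, 1, 0)
meena = (1, 1, 0, 0, 0, 0)

print(round(jaccard_distance(tina, dina), 2))   # 0.33
print(round(jaccard_distance(tina, meena), 2))  # 0.67
print(round(jaccard_distance(meena, dina), 2))  # 0.75
```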

Page 16: Cluster Analysis in Data Mining

Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods

Page 17: Cluster Analysis in Data Mining

Major Clustering Approaches

Partitioning algorithms: construct random partitions and then iteratively refine them by some criterion.

Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.

Page 18: Cluster Analysis in Data Mining

Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods

Page 19: Cluster Analysis in Data Mining

Partitioning Algorithms: Basic Concepts

Partitioning method: construct a partition of a database D of n objects into a set of k clusters.

Methods: the k-means and k-medoids algorithms.
k-means (MacQueen '67): each cluster is represented by the center of the cluster.
k-medoids or PAM, Partitioning Around Medoids (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster.

Page 20: Cluster Analysis in Data Mining

Centroid or Medoid

[Figure: the centroid (mean point) of a cluster vs. the medoid (a centrally located object)]

Page 21: Cluster Analysis in Data Mining

The k-means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:

1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no more new assignments.

Page 22: Cluster Analysis in Data Mining

K-Means Example

Given: {2,4,10,12,3,20,30,11,25}. Assume that we want two clusters.

Write the elements in increasing order: {2,3,4,10,11,12,20,25,30}
Randomly assign means: m1 = 3, m2 = 4
K1 = {2,3}, K2 = {4,10,11,12,20,25,30}; the means become m1 = 2.5, m2 = 16
K1 = {2,3,4}, K2 = {10,11,12,20,25,30}; the means become m1 = 3, m2 = 18
K1 = {2,3,4,10}, K2 = {11,12,20,25,30}; the means become m1 = 4.75, m2 = 19.6
K1 = {2,3,4,10,11,12}, K2 = {20,25,30}; the means become m1 = 7, m2 = 25
Stop: with these means, no more jumps between K2 and K1 are possible.
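The same run can be reproduced with a small Python sketch (ours, not from the lecture); starting from the given means m1 = 3 and m2 = 4 it converges to K1 = {2,3,4,10,11,12} and K2 = {20,25,30}:

```python
def kmeans_1d(points, means):
    """Basic k-means on scalars: assign each point to the nearest mean,
    recompute the means, and stop when nothing changes."""
    while True:
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:      # no reassignment is possible: stop
            return clusters, means
        means = new_means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(data, [3.0, 4.0])
print(clusters)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
print(means)      # [7.0, 25.0]
```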

Page 23: Cluster Analysis in Data Mining

The k-means Clustering Method

Example

[Figure: four scatter plots on a 0-10 grid showing successive k-means iterations]

Page 24: Cluster Analysis in Data Mining

Comments on the k-means Method

Applicable only when a mean is defined; what about categorical data?

Need to specify k, the number of clusters, in advance

Unable to handle noisy data and outliers

Page 25: Cluster Analysis in Data Mining

The K-Medoids Clustering Method


A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point in the cluster.

Page 26: Cluster Analysis in Data Mining

The K-Medoids Clustering Algorithm


The medoid clustering algorithm is as follows (a Python sketch follows the steps):

1. Begin with an arbitrary selection of k objects as the medoid points out of the n data points (n > k).
2. After selecting the k medoid points, associate each data object in the given data set with its most similar medoid; similarity here is defined using a distance measure such as the Euclidean distance.
3. Randomly select a non-medoid object O′.
4. Compute the total cost S of swapping an initial medoid object with O′.
5. If S < 0, swap the initial medoid with the new one (there will then be a new set of medoids).
6. Repeat steps 2 to 5 until there is no change in the medoids.
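A compact Python sketch of this procedure (ours; instead of a single random O′ per round it tries every candidate swap, which is the usual PAM formulation, and it uses Manhattan distance to match the worked example that follows). Because it tests all swaps, it can settle on a lower-cost configuration than the single swap examined in the example:

```python
import random

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def total_cost(points, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    medoids = random.Random(seed).sample(points, k)   # step 1: arbitrary medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                                   # step 6: until no change
        improved = False
        for i in range(k):
            for o in points:                          # step 3: candidate O'
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(points, candidate)  # step 4: cost after the swap
                if cost < best:                       # step 5: keep it if cheaper
                    medoids, best = candidate, cost
                    improved = True
    return medoids, best

data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
        (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
print(k_medoids(data, 2))   # prints the locally optimal medoids and their total cost
```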

Page 27: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

Cluster the following data set of ten objects into two clusters, i.e., k = 2. Consider the data set of ten objects below:


Object   x   y
X1       2   6
X2       3   4
X3       3   8
X4       4   7
X5       6   2
X6       6   4
X7       7   3
X8       7   4
X9       8   5
X10      7   6

Page 28: Cluster Analysis in Data Mining

The K-Medoids Clustering Method


Page 29: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

Step 1

Initialize the k centres: let us assume c1 = (3,4) and c2 = (7,4), so c1 and c2 are selected as the medoids. Calculate the distance from each data object to its nearest medoid; the cost is calculated using the Minkowski distance metric.

Page 30: Cluster Analysis in Data Mining

Similarity and Dissimilarity Between Objects

The most popular distance measures conform to the Minkowski distance:

$$L_p(i,j) = \left( |x_{i1}-x_{j1}|^p + |x_{i2}-x_{j2}|^p + \cdots + |x_{in}-x_{jn}|^p \right)^{1/p}$$

where i = (x_{i1}, x_{i2}, ..., x_{in}) and j = (x_{j1}, x_{j2}, ..., x_{jn}) are two n-dimensional data objects, and p is a positive integer.

Page 31: Cluster Analysis in Data Mining

Similarity and Dissimilarity Between Objects

If p = 1, then L1 is the Manhattan distance:

$$L_1(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{in}-x_{jn}|$$
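Both cases come from the one general formula; a tiny sketch of ours:

```python
def minkowski(x, y, p):
    """L_p distance between two n-dimensional data objects."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

print(minkowski((3, 4), (2, 6), 1))   # 3.0 (L1, Manhattan: used for the cost tables below)
print(minkowski((3, 4), (2, 6), 2))   # about 2.24 (L2, Euclidean)
```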

Page 32: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

Costs (distances) from medoid c1 = (3,4):

Data object (Xi)    Cost (distance)
(2,6)               3
(3,8)               4
(4,7)               4
(6,2)               5
(6,4)               3
(7,3)               5
(8,5)               6
(7,6)               6

Costs (distances) from medoid c2 = (7,4):

Data object (Xi)    Cost (distance)
(2,6)               7
(3,8)               8
(4,7)               6
(6,2)               3
(6,4)               1
(7,3)               1
(8,5)               2
(7,6)               2

Page 33: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

So the clusters become:

Cluster 1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster 2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

The points (2,6), (3,8) and (4,7) are close to c1, so they form one cluster, while the remaining points form the other.

The cost between any two points is found using the formula

$$\mathrm{cost}(x, c) = \sum_{i=1}^{d} |x_i - c_i|$$

where x is any data object, c is the medoid, and d is the dimension of the objects, which in this case is 2.

The total cost is the sum of the costs of the data objects from the medoids of their clusters, so here:

Page 34: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

Total Cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))} + {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3)) + cost((7,4),(8,5)) + cost((7,4),(7,6))}
= 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20

Cluster 1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster 2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

Page 35: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

[Figure: the clusters after step 1]

Page 36: Cluster Analysis in Data Mining

The K-Medoids Clustering Method


Step 2

Randomly select a non-medoid O′: let us assume O′ = (7,3). So now the candidate medoids are c1 = (3,4) and O′ = (7,3). If c1 and O′ are the new medoids, calculate the total cost involved using the formula from step 1.

Page 37: Cluster Analysis in Data Mining

The K-Medoids Clustering Method

Costs (distances) from medoid c1 = (3,4):

Data object (Xi)    Cost (distance)
(2,6)               3
(3,8)               4
(4,7)               4
(6,2)               5
(6,4)               3
(7,4)               4
(8,5)               6
(7,6)               6

Costs (distances) from O′ = (7,3):

Data object (Xi)    Cost (distance)
(2,6)               8
(3,8)               9
(4,7)               7
(6,2)               2
(6,4)               2
(7,4)               1
(8,5)               3
(7,6)               3

Page 38: Cluster Analysis in Data Mining

The K-Medoids Clustering Method


Page 39: Cluster Analysis in Data Mining

The K-Medoids Clustering Method


Total Cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22

So the cost of swapping the medoid from c2 to O′ is:
S = current total cost - past total cost = 22 - 20 = 2 > 0

Since S > 0, moving to O′ would be a bad idea; the previous choice was good, and the algorithm terminates here (i.e., there is no change in the medoids).
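The swap test is easy to verify numerically; a small sketch (ours), reusing the Manhattan cost:

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def total_cost(points, medoids):
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
        (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

current = total_cost(data, [(3, 4), (7, 4)])   # 20
swapped = total_cost(data, [(3, 4), (7, 3)])   # 22
print(swapped - current)                       # S = 2 > 0: reject the swap
```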

Page 40: Cluster Analysis in Data Mining

Cluster Analysis

What is a Cluster?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods

Page 41: Cluster Analysis in Data Mining

Hierarchical Clustering

Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

[Figure: over steps 0 to 4, agglomerative clustering (AGNES) merges objects a, b, c, d, e into {a,b}, {d,e}, {c,d,e} and finally {a,b,c,d,e}; divisive clustering (DIANA) runs the same steps in reverse]

Page 42: Cluster Analysis in Data Mining

AGNES (Agglomerative Nesting)

Implemented in statistical analysis packages, e.g., Splus

Use the Single-Link method and the dissimilarity matrix.

Merge objects that have the least dissimilarity

Go on in a non-descending fashion

Eventually all objects belong to the same cluster

Single-Link: each time, merge the clusters (C1, C2) that are connected by the shortest single link of objects, i.e., $\min_{p \in C_1,\, q \in C_2} \mathrm{dist}(p, q)$

[Figure: three scatter plots on a 0-10 grid showing successive single-link merges]

Page 43: Cluster Analysis in Data Mining

AGNES (Agglomerative Nesting)

• Uses the dissimilarity/distance matrix as input.
• Start with each individual item in its own cluster.
• Merge the nodes that have the least dissimilarity.
• Go on in a non-descending fashion.
• Eventually all nodes belong to the same cluster.
• The output is a dendrogram, represented as a set of ordered triples (d, k, K), where d is the threshold distance, k is the number of clusters, and K is the set of clusters.

Page 44: Cluster Analysis in Data Mining

AGNES (Agglomerative Nesting): Minimum Distance Method

[Figure: the complete graph on A, B, C, D, E labelled with all pairwise distances]

[Figure: the graph with threshold dmin = 1]

[Figure: the corresponding single-link dendrogram]

Page 45: Cluster Analysis in Data Mining

AGNES (Agglomerative Nesting): Minimum Distance Method

[Figure: the graph with threshold dmin = 2]

[Figure: the graph with threshold dmin = 3]

[Figure: the graph with threshold dmin = 4]

[Figure: the single-link dendrogram over A, B, C, D, E]

Page 46: Cluster Analysis in Data Mining

A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

E.g., level 1 gives 4 clusters: {a,b}, {c}, {d}, {e}; level 2 gives 3 clusters: {a,b}, {c}, {d,e}; level 3 gives 2 clusters: {a,b}, {c,d,e}; etc.

[Figure: dendrogram over objects a, b, c, d, e with cut levels 1 to 4]

Page 47: Cluster Analysis in Data Mining

Agglomerative Example

Distance matrix:

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

[Figure: the graph on A, B, C, D, E and the dendrogram built as the threshold grows from 1 to 5]
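A sketch (ours) of single-link agglomeration driven by this matrix; it prints each merge and the distance at which it happens, matching the dendrogram:

```python
# The distance matrix from the slide, keyed by unordered object pairs.
dist = {frozenset(p): d for p, d in [
    (("A", "B"), 1), (("A", "C"), 2), (("A", "D"), 2), (("A", "E"), 3),
    (("B", "C"), 2), (("B", "D"), 4), (("B", "E"), 3),
    (("C", "D"), 1), (("C", "E"), 5), (("D", "E"), 3),
]}

def link(c1, c2):
    """Single-link distance: the shortest object-to-object link between clusters."""
    return min(dist[frozenset((p, q))] for p in c1 for q in c2)

clusters = [{"A"}, {"B"}, {"C"}, {"D"}, {"E"}]
while len(clusters) > 1:
    # find the closest pair of clusters under the single-link distance
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
    d = link(clusters[i], clusters[j])
    print(f"merge {sorted(clusters[i])} + {sorted(clusters[j])} at distance {d}")
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] \
               + [clusters[i] | clusters[j]]
# merge ['A'] + ['B'] at distance 1
# merge ['C'] + ['D'] at distance 1
# merge ['A', 'B'] + ['C', 'D'] at distance 2
# merge ['A', 'B', 'C', 'D'] + ['E'] at distance 3
```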

Page 48: Cluster Analysis in Data Mining

MST Example

Using the same distance matrix as in the agglomerative example:

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

[Figure: a minimum spanning tree of the graph on A, B, C, D, E]

Page 49: Cluster Analysis in Data Mining

DIANA (Divisive Analysis)

Implemented in statistical analysis packages, e.g., Splus

Inverse order of AGNES

Eventually each node forms a cluster on its own

[Figure: three scatter plots on a 0-10 grid showing successive divisive splits]

Page 50: Cluster Analysis in Data Mining

DIANA (Divisive Analysis)

o Inverse order of AGNES.
o Initially all items are placed in one cluster.
o Clusters are split when some elements are not sufficiently close to other elements.
o Eventually each node forms a cluster on its own.
o A simple example of a divisive algorithm is based on the MST version of the single-link algorithm: edges are cut out of the minimum spanning tree from the largest to the smallest, as the sketch below shows.
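A sketch (ours) of that MST-based variant on the A-E example: build the minimum spanning tree with Kruskal's algorithm, then cut its edges from the largest weight down and report the connected components. Ties in the MST may be broken differently from the figures that follow, but the resulting clusterings are the same.

```python
nodes = {"A", "B", "C", "D", "E"}
edges = [(1, "A", "B"), (2, "A", "C"), (2, "A", "D"), (3, "A", "E"),
         (2, "B", "C"), (4, "B", "D"), (3, "B", "E"),
         (1, "C", "D"), (5, "C", "E"), (3, "D", "E")]

# Kruskal: repeatedly add the shortest edge that joins two different components.
parent = {n: n for n in nodes}
def find(n):
    while parent[n] != n:
        n = parent[n]
    return n

mst = []
for w, u, v in sorted(edges):
    if find(u) != find(v):
        parent[find(u)] = find(v)
        mst.append((w, u, v))

def components(edge_list):
    """Connected components of `nodes` under the given edges (depth-first search)."""
    adj = {n: [] for n in nodes}
    for _, u, v in edge_list:
        adj[u].append(v)
        adj[v].append(u)
    comps, seen = [], set()
    for n in sorted(nodes):
        if n not in seen:
            stack, comp = [n], set()
            while stack:
                x = stack.pop()
                if x not in comp:
                    comp.add(x)
                    stack.extend(adj[x])
            seen |= comp
            comps.append(sorted(comp))
    return comps

remaining = mst[:]                # already sorted smallest-first
while remaining:
    w, u, v = remaining.pop()     # cut the largest remaining MST edge
    print(f"cut {u}{v} (weight {w}):", components(remaining))
# cut AE (weight 3): [['A', 'B', 'C', 'D'], ['E']]
# cut AC (weight 2): [['A', 'B'], ['C', 'D'], ['E']]
# cut CD (weight 1): [['A', 'B'], ['C'], ['D'], ['E']]
# cut AB (weight 1): [['A'], ['B'], ['C'], ['D'], ['E']]
```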

Page 51: Cluster Analysis in Data Mining

DIANA (Divisive Analysis)

[Figure: the complete graph on A, B, C, D, E with all pairwise distances, and its minimum spanning tree with edges AB = 1, CD = 1, BC = 2, DE = 3]

Page 52: Cluster Analysis in Data Mining

DIANA (Divisive Analysis)

Cut the largest edge, ED. The cluster {A,B,C,D,E} is split into two clusters: {E} and {A,B,C,D}.

[Figure: the spanning tree after removing edge ED: E isolated; A, B, C, D still connected]

Page 53: Cluster Analysis in Data Mining

DIANA (Divisive Analysis)

The two clusters are {E} and {A,B,C,D}. Next cut the edge between B and C; the cluster {A,B,C,D} is split into {A,B} and {C,D}.

[Figure: the remaining edges after cutting BC: AB and CD]

Page 54: Cluster Analysis in Data Mining

DIANA (Divisive Analysis)

In the next step the remaining edges are cut, finally giving the clusters {E}, {A}, {B}, {C}, {D}.

[Figure: all edges removed; each object is its own cluster]

Page 55: Cluster Analysis in Data Mining

More on Hierarchical Clustering Methods

Integration of hierarchical with distance-based clustering

BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters

CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

CHAMELEON (1999): hierarchical clustering using dynamic modeling

