Clustering - Hierarchical Methods
Huiping Cao
Created 10/26/2016

Outline: AGNES and DIANA; BIRCH

Introduction to Data Mining, Slide 1/12




Hierarchical Methods

Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

Agglomerative hierarchical clustering; bottom-up

Divisive hierarchical clustering; top-down


AGNES (Agglomerative Nesting)

Introduced in Kaufmann and Rousseeuw (1990)

Start by letting each object form its own cluster

Iteratively merge clusters

At each step, find the two clusters that are closest to each other and merge them

Termination: all nodes belong to the same cluster

Analysis: each iteration merges two clusters, so this method requires at most n − 1 iterations.



DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990)

Start by placing all objects in one cluster

Iteratively divides clusters

Eventually (1) each node forms a cluster on its own, or (2) objects within a cluster are sufficiently similar to each other.



DIANA (Divisive Analysis)

Inverse order of AGNES

In practice, use heuristics in partitioning because there are 2^(n−1) − 1 possible ways to partition a set of n objects into two exclusive subsets.

There are many more agglomerative methods than divisive methods.
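The 2^(n−1) − 1 count can be checked by brute force: fix one object on one side so each unordered split is counted once, and require the other side to be nonempty. A quick sketch:

```python
from itertools import combinations

def num_bipartitions(n):
    # Count ways to split {0, ..., n-1} into two nonempty, unlabeled subsets
    # by enumerating the side that contains element 0 (avoids double counting).
    others = list(range(1, n))
    count = 0
    for r in range(0, n - 1):   # element 0 plus r others; other side stays nonempty
        count += sum(1 for _ in combinations(others, r))
    return count

# Matches the closed form 2^(n-1) - 1 for small n:
for n in range(2, 10):
    assert num_bipartitions(n) == 2 ** (n - 1) - 1
```

This is why exhaustive divisive splitting is infeasible for large clusters: the number of candidate splits doubles with every added object.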


Hierarchical Methods

A user can specify the desired number of clusters as a termination condition.

[Figure: Hierarchical clustering of five objects a, b, c, d, e. Agglomerative (AGNES) runs Step 0 → Step 4: first {a,b} and {d,e} form, then {c,d,e}, finally {a,b,c,d,e}. Divisive (DIANA) performs the same steps in reverse order, Step 4 → Step 0.]


Dendrogram: Shows How the Clusters are Merged

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
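Cutting a dendrogram can be sketched with a union-find: apply every merge recorded at or below the cut level, and the resulting connected components are the clusters. The merge list and heights below are hypothetical illustration data, not taken from the slides:

```python
def cut_dendrogram(n, merges, level):
    # merges: list of (i, j, height) recorded during agglomeration.
    # Apply only the merges at or below the cut level; the remaining
    # connected components are the clusters.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i, j, h in merges:
        if h <= level:
            parent[find(i)] = find(j)
    clusters = {}
    for x in range(n):
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())

# Hypothetical merge history for 5 objects a..e (indices 0..4):
merges = [(0, 1, 1.0), (3, 4, 1.0), (2, 3, 2.0), (0, 2, 4.0)]
parts = cut_dendrogram(5, merges, level=2.5)
# Cutting between heights 2.0 and 4.0 leaves two clusters: {a,b} and {c,d,e}.
```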


R: AGNES, DIANA

> library(cluster)

> help(agnes)

> help(diana)

> help(mona)

CRAN documentation: https://cran.r-project.org/web/packages/cluster/cluster.pdf

The method mona returns a list representing a divisive hierarchical clustering of a dataset with binary variables only.


Hierarchical Clustering Methods

Major weakness of agglomerative clustering methods

Do not scale well: time complexity of at least O(n^2), where n is the number of total objects

Further improvement

BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters

ROCK (1999): clustering categorical data by neighbor and link analysis

CHAMELEON (1999): hierarchical clustering using dynamic modeling


BIRCH (1996)

Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD 1996: 103-114.

Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering


Clustering Feature Vector in BIRCH

Clustering feature: CF = (N, LS, SS)

N: number of data points

LS = \sum_{i=1}^{N} X_i, the linear sum of all the points

SS = \sum_{i=1}^{N} X_i^2, the square sum of all the points

Easy to calculate centroid, radius, and diameter:

Centroid: x_0 = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{LS}{N}

Radius: R = \sqrt{\frac{\sum_{i=1}^{N} (x_i - x_0)^2}{N}} = \sqrt{\frac{N \cdot SS - 2LS^2 + LS^2}{N^2}} = \sqrt{\frac{N \cdot SS - LS^2}{N^2}}

Diameter: D = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (x_i - x_j)^2}{N(N-1)}} = \sqrt{\frac{2N \cdot SS - 2LS^2}{N(N-1)}}

Additivity theorem:

CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)
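For 1-D points these quantities can be verified directly. A minimal sketch (BIRCH keeps per-dimension linear and square sums for multidimensional data; variable names are illustrative):

```python
from math import sqrt

def cf(points):
    # Clustering feature: (N, linear sum, square sum).
    N = len(points)
    LS = sum(points)
    SS = sum(x * x for x in points)
    return (N, LS, SS)

def centroid(cf_):
    N, LS, SS = cf_
    return LS / N

def radius(cf_):
    # R = sqrt((N*SS - LS^2) / N^2): root-mean-square distance to the centroid.
    N, LS, SS = cf_
    return sqrt((N * SS - LS ** 2) / N ** 2)

def diameter(cf_):
    # D = sqrt((2N*SS - 2LS^2) / (N(N-1))): RMS pairwise distance (needs N > 1).
    N, LS, SS = cf_
    return sqrt((2 * N * SS - 2 * LS ** 2) / (N * (N - 1)))

def cf_add(a, b):
    # Additivity theorem: merging two sub-clusters just adds their CFs.
    return tuple(x + y for x, y in zip(a, b))

# The CF of the union equals the sum of the parts' CFs:
assert cf([1, 2, 3, 4]) == cf_add(cf([1, 2]), cf([3, 4]))
```

The additivity check is the point of the structure: a CF-tree node can absorb a new point or sub-cluster by adding triples, without revisiting the raw data.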


BIRCH (1996)

Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans; O(n), where n is the number of objects to be clustered.

Weaknesses

Handles only numeric data, and is sensitive to the order of the data records.

Node size is limited by the memory size.

Clustering quality degrades when clusters are not spherical.
