Outline: AGNES and DIANA; BIRCH
Clustering: Hierarchical Methods
Huiping Cao
Introduction to Data Mining, Slide 1/12
Hierarchical Methods
Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
Agglomerative hierarchical clustering: bottom-up
Divisive hierarchical clustering: top-down
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Start by letting each object form its own cluster
Iteratively merges clusters
Finds two clusters that are closest to each other
Termination: all nodes belong to the same cluster
Analysis: each iteration merges two clusters, so this method requires at most n − 1 iterations.
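The merge loop above can be sketched in a few lines. This is a minimal illustration on 1-D points with single-link (minimum) distance between clusters; the function name and the choice of single link are illustrative, not from the slides:

```python
# Minimal sketch of agglomerative nesting (AGNES), single-link distance,
# on 1-D points. Stops when k clusters remain (a termination condition).
def agnes(points, k):
    clusters = [[p] for p in points]  # each object starts in its own cluster
    while len(clusters) > k:
        best = None
        # find the pair of clusters that are closest to each other
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

print(agnes([1.0, 1.5, 5.0, 5.2, 9.0], 3))
# → [[1.0, 1.5], [5.0, 5.2], [9.0]]
```

Each pass scans all cluster pairs, which is what makes naive agglomerative clustering at least quadratic in n.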
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Start by placing all objects in one cluster
Iteratively divides clusters
Eventually (1) each node forms a cluster on its own, or (2) objects within a cluster are sufficiently similar to each other.
Inverse order of AGNES
Practically, use heuristics in partitioning because there are 2^{n−1} − 1 possible ways to partition a set of n objects into two exclusive subsets.
There are many more agglomerative methods than divisive methods.
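The 2^{n−1} − 1 count can be checked by brute force: enumerate every way to split a set of n objects into two non-empty, unordered subsets (the helper name is illustrative):

```python
# Brute-force check of the 2**(n-1) - 1 bipartition count.
from itertools import combinations

def two_way_splits(objects):
    """All ways to split `objects` into two non-empty, unordered subsets."""
    n = len(objects)
    splits = set()
    for r in range(1, n):
        for subset in combinations(objects, r):
            rest = tuple(o for o in objects if o not in subset)
            # a frozenset of the pair makes {A, B} equal to {B, A}
            splits.add(frozenset([subset, rest]))
    return splits

for n in range(2, 8):
    assert len(two_way_splits(tuple(range(n)))) == 2 ** (n - 1) - 1

print(len(two_way_splits((0, 1, 2, 3, 4))))  # → 15 for n = 5
```

The count doubles with each added object, which is why exact divisive partitioning is infeasible and heuristics are used instead.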
Hierarchical Methods
A user can specify the desired number of clusters as a termination condition.
Hierarchical Clustering
[Figure: hierarchical clustering of objects a, b, c, d, e. Reading left to right (steps 0–4), agglomerative clustering (AGNES) merges {a, b}, then {d, e}, then {c, d, e}, and finally {a, b, c, d, e}; reading right to left (steps 4–0), divisive clustering (DIANA) performs the same splits in reverse.]
Dendrogram: Shows How the Clusters are Merged
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
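Cutting a dendrogram at a desired level can be sketched with SciPy's hierarchical-clustering API (one possible tool, assuming SciPy is installed; the slides do not prescribe an implementation):

```python
# Build a dendrogram (linkage matrix) and cut it so 3 clusters remain.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0], [1.5], [5.0], [5.2], [9.0]])  # five 1-D objects
Z = linkage(X, method='single')                    # agglomerative, single link
labels = fcluster(Z, t=3, criterion='maxclust')    # cut at the 3-cluster level
print(labels)
```

Each distinct label corresponds to one connected component below the cut.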
R: AGNES, DIANA
> library(cluster)  # provides agnes(), diana(), and mona()
> help(agnes)       # agglomerative nesting
> help(diana)       # divisive analysis
> help(mona)        # monothetic analysis for binary data
CRAN documentation: https://cran.r-project.org/web/packages/cluster/cluster.pdf
The method mona returns a list representing a divisive hierarchical clustering of a dataset with binary variables only.
Hierarchical Clustering Methods
Major weakness of agglomerative clustering methods
Do not scale well: time complexity of at least O(n²), where n is the number of total objects
Further improvement
BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
ROCK (1999): clustering categorical data by neighbor and link analysis
CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD 1996: 103–114.
Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
Clustering Feature Vector in BIRCH
Clustering feature: CF = (N, LS, SS)
N: number of data points
LS = \sum_{i=1}^{N} x_i, the linear sum of all the points
SS = \sum_{i=1}^{N} x_i^2, the square sum of all the points
Easy to calculate centroid, radius, and diameter:
x_0 = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{LS}{N}
R = \sqrt{\frac{\sum_{i=1}^{N} (x_i - x_0)^2}{N}} = \sqrt{\frac{N \cdot SS - 2\,LS^2 + LS^2}{N^2}} = \sqrt{\frac{N \cdot SS - LS^2}{N^2}}
D = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (x_i - x_j)^2}{N(N-1)}} = \sqrt{\frac{2N \cdot SS - 2\,LS^2}{N(N-1)}}
Additivity theorem:
CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)
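These formulas are easy to verify numerically. The sketch below (function names are illustrative) computes a CF vector for 1-D data, derives the centroid, radius, and diameter from (N, LS, SS) alone, and checks the additivity theorem:

```python
# BIRCH clustering features on 1-D data: CF = (N, LS, SS).
import math

def cf(points):
    n = len(points)
    ls = sum(points)                  # linear sum of the points
    ss = sum(x * x for x in points)   # square sum of the points
    return n, ls, ss

def centroid(n, ls, ss):
    return ls / n                     # x0 = LS / N

def radius(n, ls, ss):
    return math.sqrt((n * ss - ls * ls) / (n * n))

def diameter(n, ls, ss):
    return math.sqrt((2 * n * ss - 2 * ls * ls) / (n * (n - 1)))

a, b = [1.0, 2.0, 3.0], [4.0, 5.0]
n1, ls1, ss1 = cf(a)
n2, ls2, ss2 = cf(b)
merged = (n1 + n2, ls1 + ls2, ss1 + ss2)  # additivity: CF1 + CF2 ...
assert merged == cf(a + b)                # ... equals the CF of the union
print(centroid(*cf(a)), radius(*cf(a)), diameter(*cf(a)))
```

Additivity is what lets BIRCH merge sub-clusters in the CF-tree without revisiting the raw data points.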
BIRCH (1996)
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans. Complexity is O(n), where n is the number of objects to be clustered.
Weaknesses
Handles only numeric data, and is sensitive to the order of the data records.
Node size is limited by the available memory.
Clustering quality suffers when clusters are not spherical, because radius and diameter are used to control a cluster's boundary.
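A quick end-to-end run can be sketched with scikit-learn's BIRCH implementation (assuming scikit-learn is installed; `threshold` and `branching_factor` below are the knobs controlling CF-tree node size):

```python
# BIRCH via scikit-learn on three well-separated spherical blobs.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# three spherical blobs in 2-D, centered at (0,0), (5,5), (10,10)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in (0.0, 5.0, 10.0)])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)  # one pass builds the CF-tree incrementally
print(len(set(labels)))
```

The spherical, well-separated blobs are the favorable case for BIRCH; elongated or intertwined clusters are where its radius/diameter criterion breaks down.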