Outline: AGNES and DIANA; BIRCH
Clustering: Hierarchical Methods
Huiping Cao
Introduction to Data Mining, Slide 1/12
Hierarchical Methods
Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
Agglomerative hierarchical clustering: bottom-up
Divisive hierarchical clustering: top-down
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Start by letting each object form its own cluster
Iteratively merges clusters
Finds two clusters that are closest to each other
Termination: all nodes belong to the same cluster
Analysis: each iteration merges two clusters, so this method requires at most n − 1 iterations.
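The merge loop above can be sketched in a few lines. This is a minimal illustration on 1-D points with single-link (minimum) distance between clusters; the function name and the choice of single link are illustrative, not from the slides:

```python
# Minimal sketch of agglomerative nesting (AGNES), single-link distance,
# on 1-D points. Stops when k clusters remain (a termination condition).
def agnes(points, k):
    clusters = [[p] for p in points]  # each object starts in its own cluster
    while len(clusters) > k:
        best = None
        # find the pair of clusters that are closest to each other
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

print(agnes([1.0, 1.5, 5.0, 5.2, 9.0], 3))
# → [[1.0, 1.5], [5.0, 5.2], [9.0]]
```

Each pass scans all cluster pairs, which is what makes naive agglomerative clustering at least quadratic in n.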
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Start by placing all objects in one cluster
Iteratively divides clusters
Eventually (1) each node forms a cluster on its own, or (2) objects within a cluster are sufficiently similar to each other.
Inverse order of AGNES
Practically, use heuristics in partitioning because there are 2^{n−1} − 1 possible ways to partition a set of n objects into two exclusive subsets.
There are many more agglomerative methods than divisive methods.
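The 2^{n−1} − 1 count can be checked by brute force: enumerate every way to split a set of n objects into two non-empty, unordered subsets (the helper name is illustrative):

```python
# Brute-force check of the 2**(n-1) - 1 bipartition count.
from itertools import combinations

def two_way_splits(objects):
    """All ways to split `objects` into two non-empty, unordered subsets."""
    n = len(objects)
    splits = set()
    for r in range(1, n):
        for subset in combinations(objects, r):
            rest = tuple(o for o in objects if o not in subset)
            # a frozenset of the pair makes {A, B} equal to {B, A}
            splits.add(frozenset([subset, rest]))
    return splits

for n in range(2, 8):
    assert len(two_way_splits(tuple(range(n)))) == 2 ** (n - 1) - 1

print(len(two_way_splits((0, 1, 2, 3, 4))))  # → 15 for n = 5
```

The count doubles with each added object, which is why exact divisive partitioning is infeasible and heuristics are used instead.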
Hierarchical Methods
A user can specify the desired number of clusters as a termination condition.
Hierarchical Clustering
[Figure: hierarchical clustering of objects a, b, c, d, e. Reading left to right (steps 0–4), agglomerative clustering (AGNES) merges {a, b}, then {d, e}, then {c, d, e}, and finally {a, b, c, d, e}; reading right to left (steps 4–0), divisive clustering (DIANA) performs the same splits in reverse.]
Dendrogram: Shows How the Clusters are Merged
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
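Cutting a dendrogram at a desired level can be sketched with SciPy's hierarchical-clustering API (one possible tool, assuming SciPy is installed; the slides do not prescribe an implementation):

```python
# Build a dendrogram (linkage matrix) and cut it so 3 clusters remain.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0], [1.5], [5.0], [5.2], [9.0]])  # five 1-D objects
Z = linkage(X, method='single')                    # agglomerative, single link
labels = fcluster(Z, t=3, criterion='maxclust')    # cut at the 3-cluster level
print(labels)
```

Each distinct label corresponds to one connected component below the cut.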
R: AGNES, DIANA
> library(cluster)  # provides agnes(), diana(), and mona()
> help(agnes)       # agglomerative nesting
> help(diana)       # divisive analysis
> help(mona)        # monothetic analysis for binary data
CRAN documentation: https://cran.r-project.org/web/packages/cluster/cluster.pdf
The method mona returns a list representing a divisive hierarchical clustering of a dataset with binary variables only.
Hierarchical Clustering Methods
Major weakness of agglomerative clustering methods
Do not scale well: time complexity of at least O(n²), where n is the number of total objects
Further improvement
BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
ROCK (1999): clustering categorical data by neighbor and link analysis
CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD 1996: 103–114.
Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
Clustering Feature Vector in BIRCH
Clustering feature: CF = (N, LS, SS)
N: number of data points
LS = \sum_{i=1}^{N} x_i, the linear sum of all the points
SS = \sum_{i=1}^{N} x_i^2, the square sum of all the points
Easy to calculate centroid, radius, and diameter:
x_0 = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{LS}{N}
R = \sqrt{\frac{\sum_{i=1}^{N} (x_i - x_0)^2}{N}} = \sqrt{\frac{N \cdot SS - 2\,LS^2 + LS^2}{N^2}} = \sqrt{\frac{N \cdot SS - LS^2}{N^2}}
D = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (x_i - x_j)^2}{N(N-1)}} = \sqrt{\frac{2N \cdot SS - 2\,LS^2}{N(N-1)}}
Additivity theorem:
CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)
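These formulas are easy to verify numerically. The sketch below (function names are illustrative) computes a CF vector for 1-D data, derives the centroid, radius, and diameter from (N, LS, SS) alone, and checks the additivity theorem:

```python
# BIRCH clustering features on 1-D data: CF = (N, LS, SS).
import math

def cf(points):
    n = len(points)
    ls = sum(points)                  # linear sum of the points
    ss = sum(x * x for x in points)   # square sum of the points
    return n, ls, ss

def centroid(n, ls, ss):
    return ls / n                     # x0 = LS / N

def radius(n, ls, ss):
    return math.sqrt((n * ss - ls * ls) / (n * n))

def diameter(n, ls, ss):
    return math.sqrt((2 * n * ss - 2 * ls * ls) / (n * (n - 1)))

a, b = [1.0, 2.0, 3.0], [4.0, 5.0]
n1, ls1, ss1 = cf(a)
n2, ls2, ss2 = cf(b)
merged = (n1 + n2, ls1 + ls2, ss1 + ss2)  # additivity: CF1 + CF2 ...
assert merged == cf(a + b)                # ... equals the CF of the union
print(centroid(*cf(a)), radius(*cf(a)), diameter(*cf(a)))
```

Additivity is what lets BIRCH merge sub-clusters in the CF-tree without revisiting the raw data points.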
BIRCH (1996)
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans. Complexity is O(n), where n is the number of objects to be clustered.
Weaknesses
Handles only numeric data, and is sensitive to the order of the data records.
Node size is limited by the available memory.
Clustering quality suffers when clusters are not spherical, because radius and diameter are used to control a cluster's boundary.
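A quick end-to-end run can be sketched with scikit-learn's BIRCH implementation (assuming scikit-learn is installed; `threshold` and `branching_factor` below are the knobs controlling CF-tree node size):

```python
# BIRCH via scikit-learn on three well-separated spherical blobs.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# three spherical blobs in 2-D, centered at (0,0), (5,5), (10,10)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in (0.0, 5.0, 10.0)])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)  # one pass builds the CF-tree incrementally
print(len(set(labels)))
```

The spherical, well-separated blobs are the favorable case for BIRCH; elongated or intertwined clusters are where its radius/diameter criterion breaks down.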