BIRCH: A New Data Clustering Algorithm and Its Applications
• Tian Zhang
• Raghu Ramakrishnan
• Miron Livny
• Presented by: Peter Vile
• Data Clustering:
Problem of dividing N data points into K groups so as to minimize an intra-group difference metric.
Many algorithms already exist.
Problem:
Due to the abundance of local minima, there is no way to find a globally minimal solution without trying all possible partitions.
• Probability-Based Clustering
• COBWEB
– uses probabilistic measurements (Category Utility) for making decisions
– hierarchical, iterative
• Disadvantages
– Category Utility takes time and memory
– tends to overfit
Other Methods

COBWEB's Limits
- assumes the probability distributions of attributes are independent
- can only handle discrete values; approximates continuous data with discretization
- storing and manipulating large sets of data becomes infeasible
The Competition
• Distance-Based Clustering
• KMEANS, CLARANS
• CLARANS
– like KMEANS
– a node is a set of medoids
– starts at a random node and moves to the closest neighbor
BIRCH
• Good
– doesn't assume attributes are independent
– minimizes memory usage
– scans the data once from disk
– can handle very large data sets (uses the concept of summarization)
– exploits the observation that not every data point is equally important for clustering purposes
Limitations of BIRCH
• Handles only metric data
Definitions

Given a cluster of N d-dimensional data points $\vec{X}_i$:

• Centroid – average value

$\vec{X}_0 = \frac{\sum_{i=1}^{N} \vec{X}_i}{N}$

• Radius – average distance from member points to the centroid (a standard deviation)

$R = \left( \frac{\sum_{i=1}^{N} (\vec{X}_i - \vec{X}_0)^2}{N} \right)^{1/2}$

• Diameter – average pairwise distance within a cluster

$D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X}_i - \vec{X}_j)^2}{N(N-1)} \right)^{1/2}$
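As a rough illustration, the centroid, radius, and diameter can be computed directly from the raw points. This is a minimal brute-force sketch (the helper names are mine, not from the paper):

```python
import math

def centroid(points):
    """Average of the points, component-wise."""
    n, d = len(points), len(points[0])
    return [sum(p[k] for p in points) / n for k in range(d)]

def radius(points):
    """Root-mean-square distance from the points to their centroid."""
    c = centroid(points)
    return math.sqrt(sum(sum((p[k] - c[k]) ** 2 for k in range(len(c)))
                         for p in points) / len(points))

def diameter(points):
    """Root of the average squared distance over all N(N-1) ordered pairs."""
    n = len(points)
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))
```

BIRCH itself never iterates over raw points like this; it recovers the same quantities from CF entries, as shown further below.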
How Far Away
• Given the centroids $\vec{X}_{01}$ and $\vec{X}_{02}$ of two clusters:
• centroid Euclidean distance D0:

$D0 = \left( (\vec{X}_{01} - \vec{X}_{02})^2 \right)^{1/2}$

• centroid Manhattan distance D1 (d is the dimension):

$D1 = |\vec{X}_{01} - \vec{X}_{02}| = \sum_{i=1}^{d} |X_{01}^{(i)} - X_{02}^{(i)}|$
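A minimal sketch of the two centroid distances (function names are my own):

```python
import math

def d0(c1, c2):
    """Centroid Euclidean distance between two cluster centroids."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def d1(c1, c2):
    """Centroid Manhattan distance: sum of per-dimension absolute differences."""
    return sum(abs(a - b) for a, b in zip(c1, c2))
```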
More Distances

For two clusters containing $N_1$ points $\{\vec{X}_i\}_{i=1}^{N_1}$ and $N_2$ points $\{\vec{X}_j\}_{j=N_1+1}^{N_1+N_2}$:

• Average Inter-cluster Distance (D2)

$D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2}{N_1 N_2} \right)^{1/2}$

• Average Intra-cluster Distance (D3)

$D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$

• Variance Increase Distance (D4)

$D4 = \sum_{k=1}^{N_1+N_2} \left( \vec{X}_k - \frac{\sum_{l=1}^{N_1+N_2} \vec{X}_l}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( \vec{X}_i - \frac{\sum_{l=1}^{N_1} \vec{X}_l}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{X}_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{X}_l}{N_2} \right)^2$
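These three inter-cluster measurements can be sketched the same way, as a brute-force illustration over raw points (not the CF-based computation BIRCH actually uses; names are mine):

```python
import math

def d2(cluster1, cluster2):
    """Average inter-cluster distance: RMS over the N1*N2 cross pairs."""
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                for p in cluster1 for q in cluster2)
    return math.sqrt(total / (len(cluster1) * len(cluster2)))

def d3(cluster1, cluster2):
    """Average intra-cluster distance of the merged cluster (its diameter)."""
    merged = list(cluster1) + list(cluster2)
    n = len(merged)
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                for p in merged for q in merged)
    return math.sqrt(total / (n * (n - 1)))

def d4(cluster1, cluster2):
    """Variance increase: squared error of the merged cluster minus the
    squared errors of the two clusters taken separately."""
    def sq_error(points):
        n = len(points)
        c = [sum(p[k] for p in points) / n for k in range(len(points[0]))]
        return sum(sum((p[k] - c[k]) ** 2 for k in range(len(c)))
                   for p in points)
    merged = list(cluster1) + list(cluster2)
    return sq_error(merged) - sq_error(list(cluster1)) - sq_error(list(cluster2))
```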
Clustering Feature (CF)
• "A Clustering Feature (CF) entry is a triple summarizing the information that we maintain about a sub-cluster of data points."
• CF Definition: CF = (N, $\vec{LS}$, SS)
– N : number of data points in the cluster
– $\vec{LS}$ : linear sum of the N data points, $\sum_{i=1}^{N} \vec{X}_i$
– SS : square sum of the N data points, $\sum_{i=1}^{N} \vec{X}_i^2$
• CF Representativity Theorem: Given the CF entries of all sub-clusters, all of the measurements above ($\vec{X}_0$, R, D, and D0–D4) can be computed accurately.
• CF Additivity Theorem: Assume that $CF_1 = (N_1, \vec{LS}_1, SS_1)$ and $CF_2 = (N_2, \vec{LS}_2, SS_2)$ are the CF entries of two disjoint sub-clusters. Then the CF entry of the sub-cluster that is formed by merging the two disjoint sub-clusters is:

$CF_1 + CF_2 = (N_1 + N_2, \vec{LS}_1 + \vec{LS}_2, SS_1 + SS_2)$
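Both theorems can be illustrated with a small sketch (class and method names are mine). Additivity is a component-wise sum; representativity lets us recover the centroid and radius of a sub-cluster from its CF alone, since expanding the radius definition gives $R^2 = SS/N - \|\vec{LS}/N\|^2$:

```python
import math

class CF:
    """Clustering Feature: (N, LS, SS) for a sub-cluster of d-dimensional points."""
    def __init__(self, n, ls, ss):
        self.n = n      # number of data points
        self.ls = ls    # linear sum, a d-vector
        self.ss = ss    # square sum, a scalar (sum of squared norms)

    @classmethod
    def from_point(cls, x):
        """CF of a single data point."""
        return cls(1, list(x), sum(v * v for v in x))

    def __add__(self, other):
        # CF Additivity Theorem: merge two disjoint sub-clusters.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2 (clamped at 0 against rounding error)
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(v * v for v in c), 0.0))
```

This is why BIRCH never needs to revisit raw points: a constant-size triple per sub-cluster is enough to evaluate the quality metrics and to merge sub-clusters.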
CF-tree Features
• Has two parameters
– 1. Branching Factor
• B - non-leaf node ( [CF_i, child_i], i = 1..B )
– child_i - pointer to its i-th child node
• L - leaf node ( [CF_j, prev, next], j = 1..L )
– 2. Threshold
• specifies the size of each leaf entry
– diameter (D) of each leaf entry < T
– or radius (R) of each leaf entry < T
CF-tree Features (Continued)
• Tree size is a function of T
– tree size = f(T)
– as T increases, tree size decreases
• Page Size (P)
– a node is required to fit in a page of size P
– P can be varied for performance tuning
• The CF-tree is built dynamically as new data objects are inserted
CF-tree
• Two algorithms are used to build a CF-tree
– 1. Insertion Algorithm
• Purpose: building a CF-tree
– 2. Rebuilding Algorithm
• Purpose: rebuild the whole tree with a larger T (smaller size)
– this happens when the CF-tree size limit is exceeded
Insertion Algorithm
• 1. Identify the appropriate leaf
– at each non-leaf node, use the distance metric to choose the closest branch
• 2. Insert the entry into the leaf node
– merge with the closest leaf entry [CF_i, prev, next]
– if T is violated, make a new leaf entry [CF_{i+1}, prev, next]
– if L is violated, split the node into two leaf nodes
• choose the two leaf entries that are farthest apart as seeds
• put each remaining entry in the leaf node with the closest seed
Insertion Algorithm
• 3. Update the tree path
– if B is violated, split the node
– each CF should be the sum of its child CFs
• 4. A Merging Refinement
– try to combine the two closest children of the node that did not split
– might free space
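Ignoring node splits and the tree descent, the core of the insertion step — find the closest entry, absorb the point if the threshold still holds, otherwise start a new entry — might look like this flat-list sketch (names and simplifications are mine):

```python
import math

def insert_point(leaf_entries, x, threshold):
    """Insert point x into the closest CF entry (each a dict n/ls/ss),
    or start a new entry if absorbing x would push the radius above T.
    Flat-list sketch; the real algorithm descends a B-branching tree."""
    def centroid(cf):
        return [v / cf["n"] for v in cf["ls"]]

    def absorbed(cf, x):
        # CF after adding x (Additivity Theorem with a single-point CF)
        return {"n": cf["n"] + 1,
                "ls": [a + b for a, b in zip(cf["ls"], x)],
                "ss": cf["ss"] + sum(v * v for v in x)}

    def radius(cf):
        c = centroid(cf)
        return math.sqrt(max(cf["ss"] / cf["n"] - sum(v * v for v in c), 0.0))

    if leaf_entries:
        # step 1: pick the closest entry by centroid Euclidean distance (D0)
        best = min(leaf_entries, key=lambda cf: math.dist(centroid(cf), x))
        merged = absorbed(best, x)
        if radius(merged) <= threshold:   # threshold T satisfied: absorb x
            best.update(merged)
            return
    # otherwise make a new leaf entry holding x alone
    leaf_entries.append({"n": 1, "ls": list(x), "ss": sum(v * v for v in x)})
```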
Rebuilding Algorithm
• When to do it?
– when the CF-tree size limit is exceeded
• What does it do?
– creates a new tree with a larger T (diameter/radius)
• larger T -> smaller tree size
• How?
– deletes a path from the old tree and adds a path to the new tree
Rebuilding Algorithm
• Procedure
– 'OldCurrentPath' starts at the leftmost branch
– 1. Create the 'NewCurrentPath' in the new tree
• copy nodes from 'OldCurrentPath' into 'NewCurrentPath'
– 2. Insert leaf entries in 'OldCurrentPath' into the new tree
• if a leaf entry does not go into 'NewCurrentPath', remove it from 'NewCurrentPath'
Rebuilding Algorithm
– 3. Free space in 'OldCurrentPath' and 'NewCurrentPath'
• delete the nodes in 'OldCurrentPath'
• if 'NewCurrentPath' is empty, delete its nodes
– 4. Process the next path in the old tree
– only needs enough extra pages to cover 'OldCurrentPath'
• usually the height of the tree
Potential Problems
• Anomaly 1: A natural cluster is split across two leaf nodes, or two distant clusters are placed in the same node
• Anomaly 2: A sub-cluster ends up in the wrong leaf node
Reducibility Theorem: Assume we rebuild CF-tree $t_{i+1}$ of threshold $T_{i+1}$ from CF-tree $t_i$ of threshold $T_i$ by the above algorithm, and let $S_i$ and $S_{i+1}$ be the sizes of $t_i$ and $t_{i+1}$ respectively. If $T_{i+1} \geq T_i$, then $S_{i+1} \leq S_i$, and the transformation from $t_i$ to $t_{i+1}$ needs at most $h$ extra pages of memory, where $h$ is the height of $t_i$.
BIRCH Clustering Algorithm
• Four Phases
– 1. Loading
• scan all data and build an initial in-memory CF-tree
– 2. Optional Condensing
• rebuild the CF-tree to make it smaller
• faster analysis, but reduced accuracy
– 3. Global Clustering
• run a clustering algorithm on the CFs (KMEANS)
• handles anomaly 1
BIRCH Clustering Algorithm
– 4. Optional Refining
• use the centroids of the clusters as seeds
• scan the data again and assign points to the closest seed
• handles anomaly 2
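The four phases above can be sketched end to end as a toy pipeline, with simplifications of my own: a flat CF list instead of a tree, no condensing phase, and plain unweighted k-means as the global clusterer.

```python
import math

def birch_sketch(points, threshold, k):
    """Toy sketch of BIRCH's phases; returns a label for every input point."""
    # Phase 1 (Loading): summarize the data into CF entries (n, ls, ss)
    # whose radius stays below the threshold T.
    entries = []
    for x in points:
        placed = False
        for cf in entries:
            n = cf["n"] + 1
            ls = [a + b for a, b in zip(cf["ls"], x)]
            ss = cf["ss"] + sum(v * v for v in x)
            c = [v / n for v in ls]
            if ss / n - sum(v * v for v in c) <= threshold ** 2:
                cf.update(n=n, ls=ls, ss=ss)     # absorb x into this entry
                placed = True
                break
        if not placed:
            entries.append({"n": 1, "ls": list(x), "ss": sum(v * v for v in x)})

    # Phase 3 (Global Clustering): k-means over the CF centroids.
    cents = [[v / cf["n"] for v in cf["ls"]] for cf in entries]
    seeds = cents[:k]
    for _ in range(20):
        groups = [[] for _ in range(k)]
        for c in cents:
            groups[min(range(k), key=lambda j: math.dist(c, seeds[j]))].append(c)
        seeds = [[sum(p[t] for p in g) / len(g) for t in range(len(g[0]))]
                 if g else seeds[i] for i, g in enumerate(groups)]

    # Phase 4 (Refining): rescan the data and assign each point to the
    # closest seed, fixing points that landed in the wrong leaf (anomaly 2).
    return [min(range(k), key=lambda j: math.dist(x, seeds[j])) for x in points]
```

A production implementation is available as `sklearn.cluster.Birch` in scikit-learn, which exposes the threshold and branching factor as parameters.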
BIRCH Phase 1
• Building the CF-tree
– Heuristic Threshold (T defaults to 0)
• when rebuilding, a new T (diameter/radius) is needed
• use the average distance of the closest leaf pairs in the same node
• should reduce the size of the tree by about half
BIRCH Phase 1
– Outlier Handling Option
• a CF with a small N (# of data points) is saved to disk
• try to reinsert it when
– we run out of memory
– we finish reading the data
• if the data is noisy, this improves runtime and accuracy
– Delay Split Option
• when about to run out of memory
• CFs that would cause the tree to split are saved to disk
How does this compare to other clustering methods?
BIRCH was run against KMEANS and CLARANS.
Results
Runtime
Conclusions
• BIRCH vs CLARANS and KMEANS
– runs faster (fewer scans)
– less order sensitive
– uses less memory
Where can I use this?
• Interactive and Iterative Pixel Classification
– MVI (Multi-band Vegetation Imager)
– BIRCH helps classify pixels through clustering
• Codebook Generation in Image Compression
– compressing visual data to save space
– code book - vector code words for image blocks
– BIRCH assigns the nearest code word to each vector
Main limitations of BIRCH?
• Ability to only handle metric data.
Name the two algorithms used to build the CF-tree in BIRCH?
1. Insertion
2. Rebuilding
What is the purpose of phase 4 in the BIRCH clustering algorithm?
- ensures all copies of a given data point go to the same cluster
- provides the option to discard outliers
- can converge to a minimum