BIRCH: A New Data Clustering Algorithm and Its Applications
• Tian Zhang
• Raghu Ramakrishnan
• Miron Livny
• Presented by: Peter Vile
• Data Clustering:
Problem of dividing N data points into K groups so as to minimize an intra-group difference metric.
Many algorithms already exist.
Problem:
Due to the abundance of local minima, there is no way to find a globally minimal solution without trying all possible partitions.
• Probability-Based Clustering
• COBWEB
– uses probabilistic measurements (Category Utility) for making decisions
– hierarchical, iterative
• Disadvantages
– Category Utility takes time and memory
– tends to overfit
Other Methods

COBWEB's Limits
- assumes the probability distributions of attributes are independent
- can only handle discrete values; approximates continuous data with discretization
- storing and manipulating large sets of data becomes infeasible
The Competition
• Distance-Based Clustering
• KMEANS, CLARANS
• CLARANS
– like KMEANS
– a node is a set of medoids
– starts at a random node and moves to the closest neighbor
BIRCH
• Good
– doesn't assume attributes are independent
– minimizes memory usage
– scans the data once from disk
– can handle very large data sets (uses the concept of summarization)
– exploits the observation that not every data point is equally important for clustering purposes
Limitations of BIRCH
• Handles only metric data
Definitions

Given a cluster of N d-dimensional data points $\vec{X}_i$:

• Centroid – average value

$\vec{X}_0 = \frac{\sum_{i=1}^{N} \vec{X}_i}{N}$

• Radius – average distance from member points to the centroid (a standard deviation)

$R = \left( \frac{\sum_{i=1}^{N} (\vec{X}_i - \vec{X}_0)^2}{N} \right)^{1/2}$

• Diameter – average pairwise distance within a cluster

$D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X}_i - \vec{X}_j)^2}{N(N-1)} \right)^{1/2}$
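As a rough illustration, the centroid, radius, and diameter can be computed directly from the raw points. This is a minimal brute-force sketch (the helper names are mine, not from the paper):

```python
import math

def centroid(points):
    """Average of the points, component-wise."""
    n, d = len(points), len(points[0])
    return [sum(p[k] for p in points) / n for k in range(d)]

def radius(points):
    """Root-mean-square distance from the points to their centroid."""
    c = centroid(points)
    return math.sqrt(sum(sum((p[k] - c[k]) ** 2 for k in range(len(c)))
                         for p in points) / len(points))

def diameter(points):
    """Root of the average squared distance over all N(N-1) ordered pairs."""
    n = len(points)
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))
```

BIRCH itself never iterates over raw points like this; it recovers the same quantities from CF entries, as shown further below.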
How Far Away
• Given the centroids $\vec{X}_{01}$ and $\vec{X}_{02}$ of two clusters:
• centroid Euclidean distance D0:

$D0 = \left( (\vec{X}_{01} - \vec{X}_{02})^2 \right)^{1/2}$

• centroid Manhattan distance D1 (d is the dimension):

$D1 = |\vec{X}_{01} - \vec{X}_{02}| = \sum_{i=1}^{d} |X_{01}^{(i)} - X_{02}^{(i)}|$
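A minimal sketch of the two centroid distances (function names are my own):

```python
import math

def d0(c1, c2):
    """Centroid Euclidean distance between two cluster centroids."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def d1(c1, c2):
    """Centroid Manhattan distance: sum of per-dimension absolute differences."""
    return sum(abs(a - b) for a, b in zip(c1, c2))
```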
More Distances

For two clusters containing $N_1$ points $\{\vec{X}_i\}_{i=1}^{N_1}$ and $N_2$ points $\{\vec{X}_j\}_{j=N_1+1}^{N_1+N_2}$:

• Average Inter-cluster Distance (D2)

$D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2}{N_1 N_2} \right)^{1/2}$

• Average Intra-cluster Distance (D3)

$D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$

• Variance Increase Distance (D4)

$D4 = \sum_{k=1}^{N_1+N_2} \left( \vec{X}_k - \frac{\sum_{l=1}^{N_1+N_2} \vec{X}_l}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( \vec{X}_i - \frac{\sum_{l=1}^{N_1} \vec{X}_l}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{X}_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{X}_l}{N_2} \right)^2$
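These three inter-cluster measurements can be sketched the same way, as a brute-force illustration over raw points (not the CF-based computation BIRCH actually uses; names are mine):

```python
import math

def d2(cluster1, cluster2):
    """Average inter-cluster distance: RMS over the N1*N2 cross pairs."""
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                for p in cluster1 for q in cluster2)
    return math.sqrt(total / (len(cluster1) * len(cluster2)))

def d3(cluster1, cluster2):
    """Average intra-cluster distance of the merged cluster (its diameter)."""
    merged = list(cluster1) + list(cluster2)
    n = len(merged)
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                for p in merged for q in merged)
    return math.sqrt(total / (n * (n - 1)))

def d4(cluster1, cluster2):
    """Variance increase: squared error of the merged cluster minus the
    squared errors of the two clusters taken separately."""
    def sq_error(points):
        n = len(points)
        c = [sum(p[k] for p in points) / n for k in range(len(points[0]))]
        return sum(sum((p[k] - c[k]) ** 2 for k in range(len(c)))
                   for p in points)
    merged = list(cluster1) + list(cluster2)
    return sq_error(merged) - sq_error(list(cluster1)) - sq_error(list(cluster2))
```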
Clustering Feature (CF)
• "A Clustering Feature (CF) entry is a triple summarizing the information that we maintain about a sub-cluster of data points."
• CF Definition: CF = (N, $\vec{LS}$, SS)
– N : number of data points in the cluster
– $\vec{LS}$ : linear sum of the N data points, $\sum_{i=1}^{N} \vec{X}_i$
– SS : square sum of the N data points, $\sum_{i=1}^{N} \vec{X}_i^2$
• CF Representativity Theorem: Given the CF entries of all sub-clusters, all of the measurements above ($\vec{X}_0$, R, D, and D0–D4) can be computed accurately.
• CF Additivity Theorem: Assume that $CF_1 = (N_1, \vec{LS}_1, SS_1)$ and $CF_2 = (N_2, \vec{LS}_2, SS_2)$ are the CF entries of two disjoint sub-clusters. Then the CF entry of the sub-cluster that is formed by merging the two disjoint sub-clusters is:

$CF_1 + CF_2 = (N_1 + N_2, \vec{LS}_1 + \vec{LS}_2, SS_1 + SS_2)$
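Both theorems can be illustrated with a small sketch (class and method names are mine). Additivity is a component-wise sum; representativity lets us recover the centroid and radius of a sub-cluster from its CF alone, since expanding the radius definition gives $R^2 = SS/N - \|\vec{LS}/N\|^2$:

```python
import math

class CF:
    """Clustering Feature: (N, LS, SS) for a sub-cluster of d-dimensional points."""
    def __init__(self, n, ls, ss):
        self.n = n      # number of data points
        self.ls = ls    # linear sum, a d-vector
        self.ss = ss    # square sum, a scalar (sum of squared norms)

    @classmethod
    def from_point(cls, x):
        """CF of a single data point."""
        return cls(1, list(x), sum(v * v for v in x))

    def __add__(self, other):
        # CF Additivity Theorem: merge two disjoint sub-clusters.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2 (clamped at 0 against rounding error)
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(v * v for v in c), 0.0))
```

This is why BIRCH never needs to revisit raw points: a constant-size triple per sub-cluster is enough to evaluate the quality metrics and to merge sub-clusters.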
CF-tree Features
• Has two parameters
– 1. Branching Factor
• B - non-leaf node ( [CF_i, child_i], i = 1..B )
– child_i - pointer to its i-th child node
• L - leaf node ( [CF_j, prev, next], j = 1..L )
– 2. Threshold
• specifies the size of each leaf entry
– diameter (D) of each leaf entry < T
– or radius (R) of each leaf entry < T
CF-tree Features (Continued)
• Tree size is a function of T
– tree size = f(T)
– as T increases, tree size decreases
• Page Size (P)
– a node is required to fit in a page of size P
– P can be varied for performance tuning
• The CF-tree is built dynamically as new data objects are inserted
CF-tree
• Two algorithms are used to build a CF-tree
– 1. Insertion Algorithm
• Purpose: building a CF-tree
– 2. Rebuilding Algorithm
• Purpose: rebuild the whole tree with a larger T (smaller size)
– this happens when the CF-tree size limit is exceeded
Insertion Algorithm
• 1. Identify the appropriate leaf
– at each non-leaf node, use the distance metric to choose the closest branch
• 2. Insert the entry into the leaf node
– merge with the closest leaf entry [CF_i, prev, next]
– if T is violated, make a new leaf entry [CF_{i+1}, prev, next]
– if L is violated, split the node into two leaf nodes
• choose the two leaf entries that are farthest apart as seeds
• put each remaining entry in the leaf node with the closest seed
Insertion Algorithm
• 3. Update the tree path
– if B is violated, split the node
– each CF should be the sum of its child CFs
• 4. A Merging Refinement
– try to combine the two closest children of the node that did not split
– might free space
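Ignoring node splits and the tree descent, the core of the insertion step — find the closest entry, absorb the point if the threshold still holds, otherwise start a new entry — might look like this flat-list sketch (names and simplifications are mine):

```python
import math

def insert_point(leaf_entries, x, threshold):
    """Insert point x into the closest CF entry (each a dict n/ls/ss),
    or start a new entry if absorbing x would push the radius above T.
    Flat-list sketch; the real algorithm descends a B-branching tree."""
    def centroid(cf):
        return [v / cf["n"] for v in cf["ls"]]

    def absorbed(cf, x):
        # CF after adding x (Additivity Theorem with a single-point CF)
        return {"n": cf["n"] + 1,
                "ls": [a + b for a, b in zip(cf["ls"], x)],
                "ss": cf["ss"] + sum(v * v for v in x)}

    def radius(cf):
        c = centroid(cf)
        return math.sqrt(max(cf["ss"] / cf["n"] - sum(v * v for v in c), 0.0))

    if leaf_entries:
        # step 1: pick the closest entry by centroid Euclidean distance (D0)
        best = min(leaf_entries, key=lambda cf: math.dist(centroid(cf), x))
        merged = absorbed(best, x)
        if radius(merged) <= threshold:   # threshold T satisfied: absorb x
            best.update(merged)
            return
    # otherwise make a new leaf entry holding x alone
    leaf_entries.append({"n": 1, "ls": list(x), "ss": sum(v * v for v in x)})
```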
Rebuilding Algorithm
• When to do it?
– when the CF-tree size limit is exceeded
• What does it do?
– creates a new tree with a larger T (diameter/radius)
• larger T -> smaller tree size
• How?
– deletes a path from the old tree and adds a path to the new tree
Rebuilding Algorithm
• Procedure
– 'OldCurrentPath' starts at the leftmost branch
– 1. Create the 'NewCurrentPath' in the new tree
• copy nodes from 'OldCurrentPath' into 'NewCurrentPath'
– 2. Insert leaf entries in 'OldCurrentPath' into the new tree
• if a leaf entry does not go into 'NewCurrentPath', remove it from 'NewCurrentPath'
Rebuilding Algorithm
– 3. Free space in 'OldCurrentPath' and 'NewCurrentPath'
• delete the nodes in 'OldCurrentPath'
• if 'NewCurrentPath' is empty, delete its nodes
– 4. Process the next path in the old tree
– only needs enough extra pages to cover 'OldCurrentPath'
• usually the height of the tree
Potential Problems
• Anomaly 1: A natural cluster is split across two leaf nodes, or two distant clusters are placed in the same node
• Anomaly 2: A sub-cluster ends up in the wrong leaf node
Reducibility Theorem: Assume we rebuild CF-tree $t_{i+1}$ of threshold $T_{i+1}$ from CF-tree $t_i$ of threshold $T_i$ by the above algorithm, and let $S_i$ and $S_{i+1}$ be the sizes of $t_i$ and $t_{i+1}$ respectively. If $T_{i+1} \geq T_i$, then $S_{i+1} \leq S_i$, and the transformation from $t_i$ to $t_{i+1}$ needs at most $h$ extra pages of memory, where $h$ is the height of $t_i$.
BIRCH Clustering Algorithm
• Four Phases
– 1. Loading
• scan all data and build an initial in-memory CF-tree
– 2. Optional Condensing
• rebuild the CF-tree to make it smaller
• faster analysis, but reduced accuracy
– 3. Global Clustering
• run a clustering algorithm on the CFs (KMEANS)
• handles anomaly 1
BIRCH Clustering Algorithm
– 4. Optional Refining
• use the centroids of the clusters as seeds
• scan the data again and assign points to the closest seed
• handles anomaly 2
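The four phases above can be sketched end to end as a toy pipeline, with simplifications of my own: a flat CF list instead of a tree, no condensing phase, and plain unweighted k-means as the global clusterer.

```python
import math

def birch_sketch(points, threshold, k):
    """Toy sketch of BIRCH's phases; returns a label for every input point."""
    # Phase 1 (Loading): summarize the data into CF entries (n, ls, ss)
    # whose radius stays below the threshold T.
    entries = []
    for x in points:
        placed = False
        for cf in entries:
            n = cf["n"] + 1
            ls = [a + b for a, b in zip(cf["ls"], x)]
            ss = cf["ss"] + sum(v * v for v in x)
            c = [v / n for v in ls]
            if ss / n - sum(v * v for v in c) <= threshold ** 2:
                cf.update(n=n, ls=ls, ss=ss)     # absorb x into this entry
                placed = True
                break
        if not placed:
            entries.append({"n": 1, "ls": list(x), "ss": sum(v * v for v in x)})

    # Phase 3 (Global Clustering): k-means over the CF centroids.
    cents = [[v / cf["n"] for v in cf["ls"]] for cf in entries]
    seeds = cents[:k]
    for _ in range(20):
        groups = [[] for _ in range(k)]
        for c in cents:
            groups[min(range(k), key=lambda j: math.dist(c, seeds[j]))].append(c)
        seeds = [[sum(p[t] for p in g) / len(g) for t in range(len(g[0]))]
                 if g else seeds[i] for i, g in enumerate(groups)]

    # Phase 4 (Refining): rescan the data and assign each point to the
    # closest seed, fixing points that landed in the wrong leaf (anomaly 2).
    return [min(range(k), key=lambda j: math.dist(x, seeds[j])) for x in points]
```

A production implementation is available as `sklearn.cluster.Birch` in scikit-learn, which exposes the threshold and branching factor as parameters.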
BIRCH Phase 1
• Building the CF-tree
– Heuristic Threshold (T defaults to 0)
• when rebuilding, a new T (diameter/radius) is needed
• use the average distance of the closest leaf pairs in the same node
• should reduce the size of the tree by about half
BIRCH Phase 1
– Outlier Handling Option
• a CF with a small N (# of data points) is saved to disk
• try to reinsert it when
– we run out of memory
– we finish reading the data
• if the data is noisy, this improves runtime and accuracy
– Delay Split Option
• when about to run out of memory
• CFs that would cause the tree to split are saved to disk
How does this compare to other clustering methods?
BIRCH was run against KMEANS and CLARANS.
Results
Runtime
Conclusions
• BIRCH vs CLARANS and KMEANS
– runs faster (fewer scans)
– less order sensitive
– uses less memory
Where can I use this?
• Interactive and Iterative Pixel Classification
– MVI (Multi-band Vegetation Imager)
– BIRCH helps classify pixels through clustering
• Codebook Generation in Image Compression
– compressing visual data to save space
– code book - vector code words for image blocks
– BIRCH assigns the nearest code word to each vector
Main limitations of BIRCH?
• Ability to only handle metric data.
Name the two algorithms used to build the CF-tree in BIRCH?
1. Insertion
2. Rebuilding
What is the purpose of phase 4 in the BIRCH clustering algorithm?
- ensures all copies of a given data point go to the same cluster
- provides the option to discard outliers
- can converge to a minimum