• Tian Zhang • Raghu Ramakrishnan • Miron Livny Presented by: Peter Vile BIRCH: A New data clustering Algorithm and Its Applications


• Tian Zhang

• Raghu Ramakrishnan

• Miron Livny

• Presented by: Peter Vile

BIRCH: A New data clustering Algorithm and Its Applications


• Data Clustering:

The problem of dividing N data points into K groups so as to minimize an intra-group difference metric.

Many algorithms already exist.

Problem:

Because local minima are abundant, there is no way to guarantee a globally minimal solution without trying all possible partitions.


Other Methods

• Probability-Based Clustering
• COBWEB
– uses probabilistic measurements (Category Utility) for making decisions
– hierarchical, iterative
• Disadvantages
– Category Utility takes time and memory to compute
– tends to overfit


COBWEB’s Limits

- assumes the probability distributions of the attributes are independent

- handles only discrete values; continuous data must be approximated by discretization

- storing and manipulating large data sets becomes infeasible


The Competition

• Distance-Based Clustering
• KMEANS, CLARANS
• CLARANS
– like KMEANS, but based on medoids
– a node is a set of K medoids
– starts at a random node and moves to a neighboring node that gives a better clustering


BIRCH

• Good
– doesn’t assume the attributes are independent
– minimizes memory usage
– scans the data once from disk
– can handle very large data sets (uses the concept of summarization)
– exploits the observation that not every data point is important for clustering purposes


Limitations of BIRCH

• Handles only metric data


Definitions

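• Centroid: the average value of the N points in the cluster

$$\vec{X0} = \frac{\sum_{i=1}^{N} \vec{X_i}}{N}$$

• Radius: the standard deviation of the points around the centroid

$$R = \left( \frac{\sum_{i=1}^{N} (\vec{X_i} - \vec{X0})^2}{N} \right)^{1/2}$$

• Diameter: the average pairwise distance within a cluster

$$D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X_i} - \vec{X_j})^2}{N(N-1)} \right)^{1/2}$$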
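As a rough illustration of the three definitions, the sketch below computes the centroid, radius, and diameter of a small cluster directly from its points. The function names and the sample cluster are illustrative assumptions, not part of the original slides.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of the points (X0)."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def radius(points):
    """Root-mean-square distance from the points to their centroid (R)."""
    c = centroid(points)
    return math.sqrt(sum(sq_dist(p, c) for p in points) / len(points))

def diameter(points):
    """Root-mean-square pairwise distance (D); needs at least two points."""
    n = len(points)
    total = sum(sq_dist(p, q) for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))

# Hypothetical sample cluster: the four corners of a 2 x 2 square.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(centroid(cluster), radius(cluster), diameter(cluster))
```

Note that both R and D are root-mean-square quantities, so they are "average" distances only in that sense.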


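How Far Away

• Given the centroids of two clusters, $\vec{X0_1}$ and $\vec{X0_2}$:
• centroid Euclidean distance D0

$$D0 = \left( (\vec{X0_1} - \vec{X0_2})^2 \right)^{1/2}$$

• centroid Manhattan distance D1

$$D1 = |\vec{X0_1} - \vec{X0_2}| = \sum_{i=1}^{d} \left| \vec{X0_1}^{(i)} - \vec{X0_2}^{(i)} \right|$$

– d is the dimension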
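A minimal sketch of the two centroid distances, assuming each centroid is given as a tuple of d coordinates; the function names `d0` and `d1` simply mirror the slide's labels.

```python
import math

def d0(c1, c2):
    """Centroid Euclidean distance between two centroids."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def d1(c1, c2):
    """Centroid Manhattan distance: sum of per-dimension absolute differences."""
    return sum(abs(a - b) for a, b in zip(c1, c2))

# Hypothetical centroids in d = 2 dimensions.
print(d0((0.0, 0.0), (3.0, 4.0)))
print(d1((0.0, 0.0), (3.0, 4.0)))
```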


More Distances

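Points 1..N1 belong to the first cluster and points N1+1..N1+N2 to the second.

• Average Inter-cluster Distance (D2)

$$D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{X_i} - \vec{X_j})^2}{N_1 N_2} \right)^{1/2}$$

• Average Intra-cluster Distance (D3)

$$D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{X_i} - \vec{X_j})^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$$

• Variance Increase Distance (D4)

$$D4 = \sum_{k=1}^{N_1+N_2} \left( \vec{X_k} - \frac{\sum_{l=1}^{N_1+N_2} \vec{X_l}}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( \vec{X_i} - \frac{\sum_{l=1}^{N_1} \vec{X_l}}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{X_j} - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{X_l}}{N_2} \right)^2$$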
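The three cluster-to-cluster distances can be sketched by brute force over the raw points, as below. This assumes D3 is the diameter of the merged cluster and D4 is the increase in total squared deviation caused by merging; the helper names and the one-dimensional sample data are illustrative only.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of the points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def scatter(points):
    """Sum of squared distances from the points to their own centroid."""
    c = mean(points)
    return sum(sq_dist(p, c) for p in points)

def d2(c1, c2):
    """Average inter-cluster distance: RMS distance over cross-cluster pairs."""
    total = sum(sq_dist(p, q) for p in c1 for q in c2)
    return math.sqrt(total / (len(c1) * len(c2)))

def d3(c1, c2):
    """Average intra-cluster distance: diameter of the merged cluster."""
    merged = list(c1) + list(c2)
    n = len(merged)
    total = sum(sq_dist(p, q) for p in merged for q in merged)
    return math.sqrt(total / (n * (n - 1)))

def d4(c1, c2):
    """Variance increase: scatter of the merged cluster minus both scatters."""
    return scatter(list(c1) + list(c2)) - scatter(c1) - scatter(c2)

# Hypothetical one-dimensional clusters.
a = [(0.0,), (2.0,)]
b = [(4.0,), (6.0,)]
print(d2(a, b), d3(a, b), d4(a, b))
```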


Clustering Feature (CF)

• “A Clustering Feature (CF) entry is a triple summarizing the information that we maintain about a sub-cluster of data points.”
• CF Definition: CF = (N, $\vec{LS}$, SS)
– N: number of data points in the cluster
– $\vec{LS}$: linear sum of the N data points, $\sum_{i=1}^{N} \vec{X_i}$
– SS: square sum of the N data points, $\sum_{i=1}^{N} \vec{X_i}^2$
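A small sketch of building a CF triple from raw points. It assumes SS is stored as a single scalar (the sum of squared norms); the `CF` type and `cf_from_points` helper are hypothetical names chosen for this example.

```python
from typing import NamedTuple, Sequence, Tuple

class CF(NamedTuple):
    n: int                    # N: number of points summarized
    ls: Tuple[float, ...]     # LS: per-dimension linear sum
    ss: float                 # SS: sum of squared norms (scalar, by assumption)

def cf_from_points(points: Sequence[Sequence[float]]) -> CF:
    """Summarize a non-empty list of points as a CF triple."""
    dim = len(points[0])
    ls = tuple(sum(p[k] for p in points) for k in range(dim))
    ss = sum(x * x for p in points for x in p)
    return CF(len(points), ls, ss)

print(cf_from_points([(1.0, 2.0), (3.0, 4.0)]))
```

The triple has a fixed size regardless of how many points it summarizes, which is what lets the tree stay small.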


• CF Representativity Theorem: Given CF entries for all sub-clusters, all the measurements, Q1 and Q2, can be computed accurately
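To make the theorem concrete, the sketch below recovers the centroid, radius, and diameter from a CF triple alone, without touching the original points. The closed forms follow from expanding the squares in the definitions: R² = SS/N − |LS|²/N² and D² = (2N·SS − 2|LS|²) / (N(N−1)). The function names are assumptions, and SS is taken to be a scalar.

```python
import math

def cf_centroid(n, ls, ss):
    """Centroid X0 = LS / N."""
    return tuple(x / n for x in ls)

def cf_radius(n, ls, ss):
    """Radius from the CF: sqrt(SS/N - |LS|^2 / N^2)."""
    norm2 = sum(x * x for x in ls)
    # max(..., 0.0) guards against tiny negative values from rounding.
    return math.sqrt(max(ss / n - norm2 / (n * n), 0.0))

def cf_diameter(n, ls, ss):
    """Diameter from the CF: sqrt((2*N*SS - 2*|LS|^2) / (N*(N-1))); needs N >= 2."""
    norm2 = sum(x * x for x in ls)
    return math.sqrt(max((2 * n * ss - 2 * norm2) / (n * (n - 1)), 0.0))

# CF of the hypothetical cluster {(0,0), (2,0), (0,2), (2,2)}.
n, ls, ss = 4, (4.0, 4.0), 16.0
print(cf_centroid(n, ls, ss), cf_radius(n, ls, ss), cf_diameter(n, ls, ss))
```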


• CF Additivity Theorem: Assume that CF1 and CF2 are the CF entries of two disjoint sub-clusters. Then the CF entry of the sub-cluster that is formed by merging the two disjoint sub-clusters is:

$$CF_1 = (N_1, \vec{LS_1}, SS_1)$$

$$CF_2 = (N_2, \vec{LS_2}, SS_2)$$

$$CF_1 + CF_2 = (N_1 + N_2, \vec{LS_1} + \vec{LS_2}, SS_1 + SS_2)$$
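A quick check of additivity, under the assumption that a CF is stored as a plain `(N, LS, SS)` tuple with scalar SS: merging two CFs component-wise gives the same triple as summarizing the union of the points. `cf_of` and `cf_add` are hypothetical helper names.

```python
def cf_of(points):
    """CF triple (N, LS, SS) of a non-empty list of points."""
    dim = len(points[0])
    ls = tuple(sum(p[k] for p in points) for k in range(dim))
    ss = sum(x * x for p in points for x in p)
    return (len(points), ls, ss)

def cf_add(a, b):
    """Merge two CFs of disjoint sub-clusters by component-wise addition."""
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

# Hypothetical disjoint sub-clusters.
p1 = [(1.0, 2.0)]
p2 = [(3.0, 4.0), (5.0, 6.0)]
print(cf_add(cf_of(p1), cf_of(p2)))
print(cf_of(p1 + p2))
```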


CF-tree Features

• Has two parameters
– 1. Branching Factor
• B: a non-leaf node holds at most B entries of the form [CF i, child i], i = 1..B
– child i: pointer to its i-th child node
• L: a leaf node holds at most L entries of the form [CF j], j = 1..L, plus prev and next pointers that chain the leaf nodes together
– 2. Threshold T
• specifies the size limit of each leaf entry
– diameter (D) of each leaf entry < T
– or radius (R) of each leaf entry < T
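One possible in-memory layout for the two node types is sketched below. The class and field names are assumptions made for illustration; the limits B and L are not enforced here, only documented in comments.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

CF = Tuple[int, Tuple[float, ...], float]  # (N, LS, SS)

@dataclass(eq=False)
class LeafNode:
    entries: List[CF] = field(default_factory=list)   # at most L entries
    prev: Optional["LeafNode"] = None                  # chain of leaf nodes
    next: Optional["LeafNode"] = None

@dataclass(eq=False)
class NonLeafNode:
    entries: List[CF] = field(default_factory=list)    # at most B entries
    children: List[object] = field(default_factory=list)  # child i for CF i

def link(left: LeafNode, right: LeafNode) -> None:
    """Chain two leaf nodes so all leaf entries can be scanned in order."""
    left.next = right
    right.prev = left

def leaf_scan(first: LeafNode) -> List[CF]:
    """Walk the leaf chain and collect every leaf entry."""
    out, node = [], first
    while node is not None:
        out.extend(node.entries)
        node = node.next
    return out

leaf_a = LeafNode(entries=[(1, (0.0,), 0.0)])
leaf_b = LeafNode(entries=[(1, (5.0,), 25.0)])
link(leaf_a, leaf_b)
print(leaf_scan(leaf_a))
```

The leaf chain is what allows a later phase to read all sub-cluster summaries without traversing the non-leaf levels.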


CF-tree Features (Continued)

• Tree size is a function of T
– tree size = f(T)
– as T increases, tree size decreases
• Page Size (P)
– A node is required to fit in a page of size P
– P can be varied for performance tuning
• The CF-tree is built dynamically as new data objects are inserted


CF-tree

• Two algorithms are used to build a CF-tree
– 1. Insertion Algorithm
• Purpose: building a CF-tree
– 2. Rebuilding Algorithm
• Purpose: rebuild the whole tree with a larger T (smaller size)
• this happens when the CF-tree size limit is exceeded


Insertion Algorithm

• 1. Identify the appropriate leaf
– starting from the root, descend through the non-leaf nodes
– at each node, use the distance metric to choose the closest branch
• 2. Insert the new entry into the leaf node
– merge it with the closest leaf entry [CF i]
– if T is violated, make a new leaf entry [CF i+1] instead
– if L is violated, split the leaf node into two leaf nodes
• choose the two entries that are farthest apart as seeds
• put each remaining entry in the leaf node with the closer seed
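Step 2 can be sketched as follows for a single leaf node. This is a simplified illustration, not the authors' implementation: it assumes the radius is used for the threshold test, D0 is used to find the closest entry, and a CF is a plain `(N, LS, SS)` tuple. All names are hypothetical.

```python
import math

def cf_point(p):
    """CF of a single point: (N, LS, SS) with SS a scalar sum of squares."""
    return (1, tuple(p), sum(x * x for x in p))

def cf_add(a, b):
    """Merge two CFs (additivity theorem)."""
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def cf_radius(cf):
    """Radius computed from the CF alone."""
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum(x * x for x in ls) / (n * n), 0.0))

def centroid_dist(a, b):
    """D0: Euclidean distance between the centroids of two CFs."""
    return math.sqrt(sum((x / a[0] - y / b[0]) ** 2 for x, y in zip(a[1], b[1])))

def insert_into_leaf(entries, point, threshold, max_entries):
    """Absorb the point into the closest entry if the merged radius stays
    below the threshold; otherwise append a new entry.
    Returns False when the leaf now exceeds max_entries and must be split."""
    new = cf_point(point)
    if entries:
        i = min(range(len(entries)), key=lambda k: centroid_dist(entries[k], new))
        merged = cf_add(entries[i], new)
        if cf_radius(merged) < threshold:
            entries[i] = merged
            return True
    entries.append(new)
    return len(entries) <= max_entries

leaf = []
results = [insert_into_leaf(leaf, p, threshold=1.0, max_entries=2)
           for p in [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (20.0, 0.0)]]
print(results, leaf)
```

In the sample run the second point is absorbed into the first entry, while the last point creates a third entry and signals that the leaf needs to be split.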


Insertion Algorithm

• 3. Update the tree path
– if B is violated, split the node
– each CF should be the sum of its child CFs
• 4. A Merging Refinement
– try to combine the two closest children of the node that did not split
– might free space
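The split and the parent update can be sketched together: pick the farthest pair of entries as seeds, distribute the rest by closeness, and summarize each half by the sum of its CFs. This assumes D0 as the distance metric and plain `(N, LS, SS)` tuples; the function names are illustrative.

```python
import math
from functools import reduce

def cf_add(a, b):
    """Merge two CFs (additivity theorem)."""
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def centroid_dist(a, b):
    """D0: Euclidean distance between the centroids of two CFs."""
    return math.sqrt(sum((x / a[0] - y / b[0]) ** 2 for x, y in zip(a[1], b[1])))

def split_entries(entries):
    """Split an overflowing node: the farthest pair become seeds and every
    other entry joins the closer seed. Needs at least two entries."""
    pairs = [(i, j) for i in range(len(entries)) for j in range(i + 1, len(entries))]
    s1, s2 = max(pairs, key=lambda ij: centroid_dist(entries[ij[0]], entries[ij[1]]))
    left, right = [], []
    for k, e in enumerate(entries):
        if k == s1:
            left.append(e)
        elif k == s2:
            right.append(e)
        elif centroid_dist(e, entries[s1]) <= centroid_dist(e, entries[s2]):
            left.append(e)
        else:
            right.append(e)
    return left, right

def summarize(entries):
    """CF stored in the parent for a child node: the sum of the child's CFs."""
    return reduce(cf_add, entries)

overfull = [(2, (1.0, 0.0), 1.0), (1, (10.0, 0.0), 100.0), (1, (20.0, 0.0), 400.0)]
left, right = split_entries(overfull)
print(left, right, summarize(left))
```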


Rebuilding Algorithm

• When to do it?
– if the CF-tree size limit is exceeded
• What does it do?
– creates a new tree with a larger T (diameter/radius)
• larger T -> smaller tree size
• How?
– deletes a path from the old tree and adds the path to the new tree


Rebuilding Algorithm

• Procedure
– ‘OldCurrentPath’ starts at the leftmost branch
– 1. Create the ‘NewCurrentPath’ in the new tree
• copy nodes from ‘OldCurrentPath’ into ‘NewCurrentPath’
– 2. Insert the leaf entries in ‘OldCurrentPath’ into the new tree
• if a leaf entry does not go into ‘NewCurrentPath’, remove it from ‘NewCurrentPath’


Rebuilding Algorithm

– 3. Free space in ‘OldCurrentPath’ and ‘NewCurrentPath’
• delete nodes in ‘OldCurrentPath’
• if ‘NewCurrentPath’ is empty, delete its nodes
– 4. Process the next path in the old tree
• Only needs enough extra pages to cover ‘OldCurrentPath’
– usually the height of the tree
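The effect of rebuilding with a larger T can be illustrated with a deliberately flattened sketch: the old leaf entries are re-inserted one by one into a fresh list, and entries that now fit under the larger threshold are absorbed into each other. This omits the path-by-path bookkeeping (‘OldCurrentPath’ / ‘NewCurrentPath’) that bounds the extra memory; it only shows why a larger T gives fewer entries. Names and data are hypothetical, and the radius is assumed for the threshold test.

```python
import math

def cf_add(a, b):
    """Merge two CFs (additivity theorem)."""
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def cf_radius(cf):
    """Radius computed from the CF alone."""
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum(x * x for x in ls) / (n * n), 0.0))

def centroid_dist(a, b):
    """D0: Euclidean distance between the centroids of two CFs."""
    return math.sqrt(sum((x / a[0] - y / b[0]) ** 2 for x, y in zip(a[1], b[1])))

def rebuild(leaf_entries, new_threshold):
    """Re-insert old leaf entries under a new threshold (flat, no tree paths)."""
    new_entries = []
    for cf in leaf_entries:
        if new_entries:
            i = min(range(len(new_entries)),
                    key=lambda k: centroid_dist(new_entries[k], cf))
            merged = cf_add(new_entries[i], cf)
            if cf_radius(merged) < new_threshold:
                new_entries[i] = merged
                continue
        new_entries.append(cf)
    return new_entries

# Hypothetical old leaf entries: four single points on a line.
old = [(1, (0.0,), 0.0), (1, (1.0,), 1.0), (1, (10.0,), 100.0), (1, (11.0,), 121.0)]
print(len(rebuild(old, 0.1)), len(rebuild(old, 1.0)))
```

No raw data is re-read: the rebuild works purely on the CF entries already held in the tree.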


Potential Problems

• Anomaly 1: Natural cluster is split across two leaf nodes, or two distant clusters are placed in the same node

• Anomaly 2: Sub cluster ends up in the wrong leaf node


Reducibility Theorem: Assume we rebuild CF-tree $t_{i+1}$ of threshold $T_{i+1}$ from CF-tree $t_i$ of threshold $T_i$ by the above algorithm, and let $S_i$ and $S_{i+1}$ be the sizes of $t_i$ and $t_{i+1}$ respectively. If $T_{i+1} \ge T_i$, then $S_{i+1} \le S_i$, and the transformation from $t_i$ to $t_{i+1}$ needs at most h extra pages of memory, where h is the height of $t_i$.


BIRCH Clustering Algorithm

• Four Phases
– 1. Loading
• scan all data and build an initial in-memory CF-tree
– 2. Optional Condensing
• rebuild the CF-tree to make it smaller
• faster analysis, but reduced accuracy
– 3. Global Clustering
• run a clustering algorithm on the CFs (KMEANS)
• handles anomaly 1


BIRCH Clustering Algorithm

– 4. Optional Refining
• use the centroids of the clusters as seeds
• scan the data again and assign points to the closest seed
• handles anomaly 2
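The refining pass of phase 4 can be sketched as one nearest-seed assignment followed by a centroid update. The seeds are assumed to be the cluster centroids produced by phase 3; the function names and sample data are illustrative, and outlier discarding is not shown.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_to_seeds(points, seeds):
    """Label each point with the index of its closest seed."""
    return [min(range(len(seeds)), key=lambda k: sq_dist(p, seeds[k]))
            for p in points]

def recompute_centroids(points, labels, seeds):
    """New centroid per cluster; an empty cluster keeps its old seed."""
    dim = len(seeds[0])
    sums = [[0.0] * dim for _ in seeds]
    counts = [0] * len(seeds)
    for p, lab in zip(points, labels):
        counts[lab] += 1
        for d in range(dim):
            sums[lab][d] += p[d]
    return [tuple(s / counts[j] for s in sums[j]) if counts[j] else tuple(seeds[j])
            for j in range(len(seeds))]

# Hypothetical data; the point (1, 0) appears twice.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
seeds = [(0.0, 0.0), (10.0, 0.0)]
labels = assign_to_seeds(points, seeds)
print(labels, recompute_centroids(points, labels, seeds))
```

Because every point is assigned independently to its closest seed, duplicate points always land in the same cluster, which is not guaranteed while the CF-tree is being built.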


BIRCH Phase 1

• Building the CF-tree
– Heuristic Threshold (default T = 0)
• when rebuilding, a new T (diameter/radius) is needed
• use the average distance between the closest pairs of leaf entries in the same node
• should reduce the size of the tree by about half


BIRCH Phase 1

– Outlier Handling Option
• a CF with small N (number of data points) is saved to disk
• try to reinsert it when
– memory runs out
– all data has been read
• if the data is noisy, this improves runtime and accuracy
– Delay Split Option
• used when memory is about to run out
• CFs that would cause the tree to split are saved to disk


How does this compare to other clustering methods?

BIRCH was run against KMEANS and CLARANS.


[Figures: Results (four slides); Runtime (one slide)]


Conclusions

• BIRCH vs CLARANS and KMEANS
– runs faster (fewer scans)
– less order sensitive
– uses less memory


Where can I use this?

• Interactive and Iterative Pixel Classification
– MVI (Multi-band Vegetation Imager)
– BIRCH helps classify pixels through clustering
• Code Generalization in Image Compression
– compressing visual data to save space
– code book: vector code words for image blocks
– BIRCH assigns the nearest code word to each vector


Main limitations of BIRCH?

• Ability to only handle metric data.


Name the two algorithms used to build the CF-tree in BIRCH clustering.

1. Insertion

2. Rebuilding


What is the purpose of phase 4 in the BIRCH clustering algorithm?

- All copies of a given data point go to the same cluster.

- It gives the option to discard outliers.

- It can converge to a minimum.