CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
Author: George et al.
Advisor: Dr. Hsu
Graduate: ZenJohn Huang
IDSL seminar 2001/10/23


Outline

• Motivation
• Objective
• Research restriction
• Literature review
  • An overview of related clustering algorithms
  • The limitations of clustering algorithms
• CHAMELEON
• Concluding remarks
• Personal opinion

Motivation

Existing clustering algorithms can break down when:
• The choice of parameters is incorrect
• The model is not adequate to capture the characteristics of clusters
  • Diverse shapes, densities, and sizes

Objective

• Presenting a novel hierarchical clustering algorithm: CHAMELEON
• Facilitating discovery of natural and homogeneous clusters
• Being applicable to all types of data

Research Restriction

In this paper, the authors do not address the issue of scaling to large data sets that cannot fit in main memory.

Literature Review

• Clustering
• An overview of related clustering algorithms
• The limitations of the recently proposed state-of-the-art clustering algorithms

Clustering

• Data is grouped so that the intracluster similarity is maximized and the intercluster similarity is minimized [Jain and Dubes, 1988]
• Clustering serves as a foundation for data mining and analysis techniques

Clustering (cont’d)

Applications
• Purchasing patterns
• Categorization of documents on the WWW [Boley, et al., 1999]
• Grouping of genes and proteins that have similar functionality [Harris, et al., 1992]
• Grouping of spatial locations prone to earthquakes [Byers and Adrian, 1998]

An Overview of Related Clustering Algorithms

• Partitional techniques
• Hierarchical techniques

Partitional Techniques

• K-means [Jain and Dubes, 1988]

Hierarchical Techniques

• CURE [Guha, Rastogi and Shim, 1998]
• ROCK [Guha, Rastogi and Shim, 1999]

Limitations of Existing Hierarchical Schemes

CURE
• Fails to take into account the special characteristics of clusters

Limitations of Existing Hierarchical Schemes (cont’d)

ROCK
• Merges clusters irrespective of their densities and shapes

CHAMELEON

• Overview
• Modeling the data
• Modeling the cluster similarity
• A two-phase clustering algorithm
• Performance analysis
• Experimental results

Overall Framework of CHAMELEON

Modeling the Data

k-nearest neighbor graphs built from the original data in 2D
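As a rough illustration of this modeling step, the sketch below builds a k-nearest neighbor graph for 2D points by brute force. The function name `knn_graph`, the sample points, and the similarity weight `1 / (1 + distance)` are assumptions made for the example; they are not taken from the paper.

```python
import math

def knn_graph(points, k):
    """Return {i: {j: weight}} linking each point to its k nearest neighbors."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        # Brute-force distances to every other point: O(n^2) overall.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            # Assumed similarity weight: closer points get heavier edges.
            w = 1.0 / (1.0 + d)
            graph[i][j] = w
            graph[j][i] = w  # treat the graph as undirected
    return graph

# Two well-separated groups of three points each (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = knn_graph(points, k=2)
```

With k = 2, each point here links only to the two other points of its own group, so the graph splits into two components that match the visible clusters.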

Modeling the Cluster Similarity

Relative inter-connectivity
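For reference, the relative inter-connectivity of the CHAMELEON paper can be written as below. Here $EC_{\{C_i, C_j\}}$ denotes the sum of the weights of the graph edges that connect $C_i$ and $C_j$, and $EC_{C_i}$ denotes the weighted sum of the edges in the min-cut bisector of $C_i$; the notation is reproduced from the paper as best recalled, so it should be checked against the original.

```latex
RI(C_i, C_j) = \frac{\left| EC_{\{C_i, C_j\}} \right|}
                    {\dfrac{\left| EC_{C_i} \right| + \left| EC_{C_j} \right|}{2}}
```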

Modeling the Cluster Similarity (cont’d)

Relative closeness
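For reference, the relative closeness of the CHAMELEON paper can be written as below. Here $\bar{S}_{EC_{\{C_i, C_j\}}}$ is the average weight of the edges connecting $C_i$ and $C_j$, and $\bar{S}_{EC_{C_i}}$ is the average weight of the edges in the min-cut bisector of $C_i$; the notation is reproduced from the paper as best recalled, so it should be checked against the original.

```latex
RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}
                    {\dfrac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}}
                     + \dfrac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}
```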

A Two-phase Clustering Algorithm

Phase I: Finding initial sub-clusters

A Two-phase Clustering Algorithm (cont’d)

Phase I: Finding initial sub-clusters
• Multilevel paradigm [Karypis & Kumar, 1999]
• hMETIS [Karypis & Kumar, 1999]

A Two-phase Clustering Algorithm (cont’d)

Phase II: Merging sub-clusters using a dynamic framework

$RI(C_i, C_j) \ge T_{RI}$ and $RC(C_i, C_j) \ge T_{RC}$

$T_{RI}$, $T_{RC}$: user-specified thresholds
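The threshold rule above can be sketched as a simple filter over candidate pairs. The function name `mergeable_pairs`, the cluster labels, and the RI and RC values are hypothetical and serve only to illustrate the rule; the sketch assumes RI and RC have already been computed for each pair.

```python
def mergeable_pairs(ri, rc, t_ri, t_rc):
    """Return cluster pairs whose RI and RC both meet the thresholds."""
    return [
        pair for pair in ri
        if ri[pair] >= t_ri and rc[pair] >= t_rc
    ]

# Hypothetical RI and RC values for three candidate pairs of sub-clusters.
ri = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.7}
rc = {("A", "B"): 0.8, ("A", "C"): 0.9, ("B", "C"): 0.3}

candidates = mergeable_pairs(ri, rc, t_ri=0.5, t_rc=0.5)
```

Only the pair (A, B) passes both thresholds in this example: (A, C) fails on inter-connectivity and (B, C) fails on closeness.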

A Two-phase Clustering Algorithm (cont’d)

Phase II: Merging sub-clusters using a dynamic framework

$RI(C_i, C_j) \cdot RC(C_i, C_j)^{\alpha}$

$\alpha$ is a user-specified parameter
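A minimal sketch of this second merging scheme: pick the pair that maximizes the product of RI and RC raised to the power α. The function name `best_pair` and the RI and RC values are hypothetical; the example only shows how α shifts the balance between inter-connectivity and closeness.

```python
def best_pair(ri, rc, alpha):
    """Return the pair maximizing RI * RC**alpha."""
    return max(ri, key=lambda pair: ri[pair] * rc[pair] ** alpha)

# Hypothetical values: (A, B) is better connected, (A, C) is closer.
ri = {("A", "B"): 0.9, ("A", "C"): 0.5}
rc = {("A", "B"): 0.5, ("A", "C"): 0.8}

choice_balanced = best_pair(ri, rc, alpha=1.0)   # scores: 0.45 vs 0.40
choice_closeness = best_pair(ri, rc, alpha=2.0)  # scores: 0.225 vs 0.32
```

With α = 1 the better-connected pair (A, B) wins, while α = 2 gives more weight to closeness and selects (A, C).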

Performance Analysis

The amount of time required to compute:
• The k-nearest neighbor graph
• The two-phase clustering

Performance Analysis(cont’d)

The amount of time required to compute the k-nearest neighbor graph:
• Low-dimensional data sets: O(n log n)
• High-dimensional data sets: O(n^2)

Performance Analysis(cont’d)

The amount of time required to compute the two-phase clustering:
• Computing internal inter-connectivity and closeness for each cluster: O(nm)
• Selecting the most similar pair of clusters: O(n log n + m^2 log m)
• Total time: O(nm + n log n + m^2 log m)

Experimental Results

Programs
• DBSCAN: a publicly available version
• CURE: a locally implemented version

Data sets

Qualitative comparison

Data Sets

• Five clusters
• Different size, shape, and density
• Noise points

• Two clusters
• Close to each other
• Different regions, different densities

• Six clusters
• Different size, shape, and orientation
• Random noise points
• Special artifacts

• Eight clusters
• Different size, shape, and orientation
• Random noise and special artifacts

• Eight clusters
• Different size, shape, density, and orientation
• Random noise points

Concluding remarks

CHAMELEON can discover natural clusters of different shapes and sizes

It is possible to use other algorithms instead of the k-nearest neighbor graph

Different domains may require different models for capturing closeness and inter-connectivity

Personal Opinion

Without further work
