1 Efficient and Effective Clustering Methods for Spatial Data Mining Raymond T. Ng, Jiawei Han Pavan Podila COSC 6341, Fall ‘04

1

Efficient and Effective Clustering Methods for Spatial Data Mining

Raymond T. Ng, Jiawei Han

Pavan Podila
COSC 6341, Fall ‘04

2

Overview

• Spatial Data Mining
• Clustering techniques
• CLARANS
• Spatial and Non-Spatial dominant CLARANS
• Observations
• Summary


4

Spatial Data Mining

• Identifying interesting relationships and characteristics that may exist implicitly in spatial databases
• Different from relational databases
  • Spatial objects store both spatial and non-spatial attributes
  • Queries (“All Walmart stores within 10 miles of UH”)
  • Spatial joins work on spatial indexes (R-tree)
  • Huge sizes (terabytes)
• GIS is a classic example


6

Partitioning Methods

Given K, the number of partitions to create, a partitioning method constructs initial partitions. It then iteratively refines the quality of these clusters so as to maximize intra-cluster similarity and inter-cluster dissimilarity.

[Quality of Clustering]: Average dissimilarity of objects from their cluster centers (medoids)

Selected algorithms:
1. K-medoids
2. PAM
3. CLARA
4. CLARANS
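The quality measure above can be sketched in a few lines. This is only an illustration: Euclidean distance is assumed as the dissimilarity, and the function name and sample points are made up for the example.

```python
import math

def average_dissimilarity(points, medoids):
    """Mean distance from each point to its nearest medoid (lower is better)."""
    total = sum(min(math.dist(p, m) for m in medoids) for p in points)
    return total / len(points)

# Two obvious groups: one near x = 0, one near x = 10 (illustrative data).
points = [(0, 0), (0, 2), (10, 0), (10, 2)]

good = average_dissimilarity(points, [(0, 0), (10, 0)])  # one medoid per group
bad = average_dissimilarity(points, [(0, 0), (0, 2)])    # both medoids in one group
```

Placing one medoid in each group gives a much lower average dissimilarity than placing both medoids in the same group, which is what the partitioning methods try to exploit.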

7

K-Medoids

• Partition-based clustering (K partitions)
• Effective, why?
  • Resistant to outliers
  • Does not depend on the order in which data points are examined
  • Cluster center is part of the dataset, unlike k-means, where the cluster center is gravity based (the mean of the cluster)
  • Experiments show that large data sets are handled efficiently

[Figure: two scatter plots on 0 to 10 axes, one labeled K-means and one labeled K-medoids]
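A small sketch of why a medoid resists outliers while a mean does not. The data and helper names are illustrative assumptions, and Euclidean distance is assumed.

```python
import math

def mean_point(points):
    """Center of gravity, as used by k-means; usually not a data point."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """The data point with the smallest total distance to all other points."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Four tightly grouped points plus one outlier at (10, 10).
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (10, 10)]

center_mean = mean_point(cluster)   # dragged toward the outlier
center_medoid = medoid(cluster)     # stays inside the tight group
```

Here the mean lands between the tight group and the outlier, while the medoid remains one of the four grouped points.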

8

PAM (Partitioning Around Medoids)

[Goal]: Find K representative objects of the data set. Each of the K objects is called a Medoid, the most centrally located object within a cluster.

9

PAM (2)

• Start with K data points designated as medoids.
• Create a cluster around each medoid by assigning every data point to its closest medoid: Oj belongs to the cluster of Oi if $d(O_j, O_i) = \min_{O_e} d(O_j, O_e)$, where Oe ranges over the current medoids.
• Iteratively replace Oi with Oh if quality of clustering improves.
• Swapping cost, Cijh, is the cost contributed by object Oj when a selected object Oi is replaced with a non-selected object Oh.

[Figure: objects Oi, Oj and Oh]
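A hedged sketch of the total swapping cost TCih for one candidate swap. For brevity it computes the cost as the difference between the total dissimilarity after and before the swap, rather than summing per-object contributions case by case; the result is the same quantity. Function names and data are assumptions for illustration.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def swap_cost(points, medoids, o_i, o_h):
    """TC_ih: change in total dissimilarity if medoid o_i is replaced by o_h."""
    swapped = [o_h if m == o_i else m for m in medoids]
    return total_cost(points, swapped) - total_cost(points, medoids)

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
medoids = [(0, 0), (10, 0)]

tc_better = swap_cost(points, medoids, (0, 0), (1, 0))  # move to the group's middle
tc_same = swap_cost(points, medoids, (0, 0), (2, 0))    # move to the other end
```

A negative value means the swap improves the clustering; zero or positive means it does not.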

10

PAM (3)

1. Select K representative objects arbitrarily.
2. Compute TCih for all pairs (Oi, Oh), where Oi is selected and Oh is not.
3. Select the pair (Oi, Oh) with minimum TCih.
4. If TCih < 0, replace Oi with Oh and go back to step 2.
5. Otherwise, for every Oj find the most representative object and stop.

* O(k(n-k)^2) for each iteration
* Good for small data sets (n=100, k=5)
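The PAM loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, distinct points stored as tuples, the first K points as the arbitrary initial medoids, and it recomputes the total cost for each candidate swap instead of using incremental updates.

```python
import math
from itertools import product

def total_cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])            # arbitrary initial choice (assumption)
    while True:
        base = total_cost(points, medoids)
        best_delta, best_swap = 0.0, None
        non_selected = [p for p in points if p not in medoids]
        # Evaluate TC_ih for every (selected, non-selected) pair.
        for i, o_h in product(range(k), non_selected):
            candidate = medoids[:i] + [o_h] + medoids[i + 1:]
            delta = total_cost(points, candidate) - base
            if delta < best_delta:
                best_delta, best_swap = delta, candidate
        if best_swap is None:             # minimum TC_ih is not negative: stop
            break
        medoids = best_swap
    # Assign every object to its most representative (closest) medoid.
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: math.dist(p, m))].append(p)
    return medoids, clusters

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
medoids, clusters = pam(points, 2)
```

On this toy input the search ends with the middle point of each group as a medoid. Every iteration scans all k(n-k) swaps, which is why the method is only practical for small data sets.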

11

CLARA (Clustering LARge Applications)

• Improvement over PAM
• Finds medoids in a sample from the dataset
• [Idea]: If the samples are sufficiently random, the medoids of the sample approximate the medoids of the dataset
• [Heuristics]: 5 samples of size 40+2k give satisfactory results
• Works well for large datasets (n=1000, k=10)
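A rough sketch of the CLARA idea: run PAM on a few random samples and keep the medoids that score best on the whole dataset. The compact `pam` helper, the seed handling and the synthetic data are assumptions made for illustration; only the sample count (5) and sample size (40+2k) come from the slide.

```python
import math
import random

def total_cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Compact steepest-descent PAM, applied to one sample."""
    medoids = list(points[:k])
    while True:
        base = total_cost(points, medoids)
        best_delta, best_swap = 0.0, None
        for i in range(k):
            for o_h in points:
                if o_h in medoids:
                    continue
                cand = medoids[:i] + [o_h] + medoids[i + 1:]
                delta = total_cost(points, cand) - base
                if delta < best_delta:
                    best_delta, best_swap = delta, cand
        if best_swap is None:
            return medoids
        medoids = best_swap

def clara(points, k, num_samples=5, seed=0):
    rng = random.Random(seed)
    sample_size = min(len(points), 40 + 2 * k)
    best_medoids, best_cost = None, math.inf
    for _ in range(num_samples):
        sample = rng.sample(points, sample_size)
        medoids = pam(sample, k)
        cost = total_cost(points, medoids)    # judged on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost / len(points)

# Synthetic data: two well-separated blobs of 150 points each.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(150)]
points += [(data_rng.gauss(20, 1), data_rng.gauss(20, 1)) for _ in range(150)]

medoids, avg_dissimilarity = clara(points, 2)
```

PAM only ever sees 44 points per run here, so the expensive swap search is independent of the full dataset size; only the final cost evaluation touches all objects.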


13

CLARANS (Clustering Large Applications based on RANdomized Search)

• A graph abstraction, Gn,k
• Each vertex is a collection of k medoids
• Two nodes S1 and S2 are neighbors if they differ in exactly one object: $|S_1 \cap S_2| = k - 1$
• Each node has k(n-k) neighbors
• Cost of each node is total dissimilarity of objects to their medoids
• PAM searches whole graph
• CLARA searches subgraph

[Figure: graph whose nodes, including neighbors S1 and S2, are medoid sets {Oa1, ..., Oak}, {Ob1, ..., Obk}, {Oc1, ..., Ock}, {Od1, ..., Odk}, {Om1, ..., Omk}]
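The neighbor relation in the graph can be checked with a short sketch. Objects are represented by integer ids here, which is an assumption for illustration, as is the function name.

```python
def neighbors(node, objects):
    """All nodes of G_{n,k} that differ from `node` in exactly one medoid."""
    node = frozenset(node)
    outside = [o for o in objects if o not in node]
    return [(node - {o_i}) | {o_h} for o_i in node for o_h in outside]

objects = list(range(6))   # n = 6 object ids
node = {0, 1}              # k = 2 medoids
nbrs = neighbors(node, objects)
```

With n = 6 and k = 2 the node has k(n-k) = 8 neighbors, and each neighbor shares k - 1 = 1 medoid with it.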

14

CLARANS (2)

Input: maxNeighbors, numLocal

1. Set i = 1, minCost = ∞, bestNode = -1.
2. Set current to a random node of Gn,k.
3. Set j = 1.
4. Pick a random neighbor S of current.
5. If TCS < TCcurrent, set current = S and go to step 3.
6. Otherwise, if j < maxNeighbors, increment j and go to step 4.
7. Otherwise, if TCcurrent < minCost, set minCost = TCcurrent and bestNode = current.
8. Increment i. If i > numLocal, output bestNode and stop; otherwise go to step 2.

Experimental values

• numLocal = 2
• maxNeighbors = max(1.25% of k(n-k), 250)
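The randomized search can be sketched as below. This is an illustrative reading of the flowchart, not the authors' code: it assumes Euclidean distance, distinct points stored as tuples, k < n, and it recomputes the full cost of each neighbor rather than an incremental cost differential. The defaults follow the experimental values on the slide.

```python
import math
import random

def total_cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def clarans(points, k, num_local=2, max_neighbors=None, seed=0):
    rng = random.Random(seed)
    n = len(points)
    if max_neighbors is None:
        max_neighbors = max(int(0.0125 * k * (n - k)), 250)
    min_cost, best_node = math.inf, None
    for _ in range(num_local):                  # i = 1 .. numLocal
        current = rng.sample(points, k)         # random node of G_{n,k}
        current_cost = total_cost(points, current)
        j = 1
        while j <= max_neighbors:
            # Random neighbor: replace one medoid with one non-medoid.
            i = rng.randrange(k)
            o_h = rng.choice(points)
            if o_h in current:
                continue                        # not a valid neighbor, redraw
            neighbor = current[:i] + [o_h] + current[i + 1:]
            neighbor_cost = total_cost(points, neighbor)
            if neighbor_cost < current_cost:
                current, current_cost = neighbor, neighbor_cost
                j = 1                           # restart the neighbor count
            else:
                j += 1
        if current_cost < min_cost:             # keep the best local minimum
            min_cost, best_node = current_cost, current
    return best_node, min_cost

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
best_node, best_cost = clarans(points, 2)
```

Unlike PAM, each step inspects only a random subset of the neighbors, and unlike CLARA the search is never confined to a fixed sample of the objects.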

15

CLARANS (3)

Outperforms PAM and CLARA in terms of running time and quality of clustering

O(n^2) for each iteration

[Figures: CLARANS vs PAM; CLARANS vs CLARA]


17

Generalization

• Useful to mine non-spatial attributes
• Process of merging tuples based on a concept hierarchy
• DBLearn – SQL query, generalization hierarchy and threshold

[Figure: concept hierarchies for Sphere(color, diameter), used to turn the initial relation into the generalized relation]
• color: red, orange, yellow, green, blue, indigo, violet are grouped under reddish, yellowish, bluish
• diameter: 1...20 is small, 21...40 is large
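A toy sketch of generalization by climbing a concept hierarchy and merging identical tuples. The diameter ranges follow the slide; the assignment of individual colors to reddish, yellowish and bluish is an assumption made only for this example, as are the sample tuples. DBLearn's threshold handling is not modeled.

```python
from collections import Counter

# ASSUMPTION: which color falls under which group is chosen for illustration.
COLOR_PARENT = {
    "red": "reddish", "orange": "reddish",
    "yellow": "yellowish", "green": "yellowish",
    "blue": "bluish", "indigo": "bluish", "violet": "bluish",
}

def generalize_diameter(d):
    """1...20 is small, 21...40 is large (ranges from the slide)."""
    return "small" if d <= 20 else "large"

def generalize(relation):
    """Climb one level in each hierarchy and merge identical tuples with counts."""
    return Counter((COLOR_PARENT[c], generalize_diameter(d)) for c, d in relation)

# Hypothetical initial relation Sphere(color, diameter).
initial = [("red", 5), ("orange", 12), ("blue", 30), ("violet", 35), ("yellow", 18)]
generalized = generalize(initial)
```

Five initial tuples collapse into three generalized tuples, each carrying the number of original tuples it covers.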

18

Silhouette

• Silhouette of object Oj determines how much Oj belongs to its cluster
  • Between -1 and 1
  • 1 indicates high degree of membership
• Silhouette width of cluster: average silhouette of all objects in the cluster
• Silhouette coefficient: average silhouette width of the k clusters

Silhouette width    Interpretation
0.71 – 1            Strong cluster
0.51 – 0.7          Reasonable cluster
0.26 – 0.5          Weak or artificial cluster
≤ 0.25              No cluster found
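The slide does not give the formula, so this sketch assumes the standard definition s = (b - a) / max(a, b), where a is the average distance from the object to the other members of its own cluster and b is the smallest average distance to the members of any other cluster. It also assumes Euclidean distance, distinct points and at least two clusters.

```python
import math

def silhouette(obj, clusters):
    """Silhouette of one object; `clusters` is a list of lists of points."""
    own = next(c for c in clusters if obj in c)
    if len(own) == 1:
        return 0.0                      # common convention for singleton clusters
    a = sum(math.dist(obj, p) for p in own if p != obj) / (len(own) - 1)
    b = min(sum(math.dist(obj, p) for p in c) / len(c)
            for c in clusters if c is not own)
    return (b - a) / max(a, b)

def silhouette_width(cluster, clusters):
    """Average silhouette of all objects in one cluster."""
    return sum(silhouette(o, clusters) for o in cluster) / len(cluster)

def silhouette_coefficient(clusters):
    """Average silhouette width over all clusters."""
    return sum(silhouette_width(c, clusters) for c in clusters) / len(clusters)

clusters = [[(0, 0), (1, 0), (2, 0)], [(10, 0), (11, 0), (12, 0)]]
coeff = silhouette_coefficient(clusters)
```

For these two well-separated groups the coefficient is about 0.87, which falls in the "strong cluster" band of the table.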

19

SD and NSD approach

• SD – Spatial Dominant
• NSD – Non-Spatial Dominant
• Clustering for spatial attributes / Generalization for non-spatial attributes
• Dominance is decided by what is carried out first (clustering/generalization)
• Second phase works on tuples from previous stage

20

SD(CLARANS)

1. Specify the learning request in the form of an SQL query and run it against the data to obtain tuples.
2. Run CLARANS on the spatial attributes to find Knat clusters.
3. For every cluster, collect the non-spatial components and apply DBLearn.

• Finds non-spatial generalizations from spatial clustering
• Value for Knat is determined through heuristics using the silhouette coefficients
• Clustering phase can be treated as finding spatial generalization hierarchy

21

NSD(CLARANS)

1. Specify the learning request as an SQL query and run it against the data to obtain tuples.
2. Apply DBLearn to the non-spatial attributes to obtain generalized tuples.
3. For every generalized tuple, collect the spatial components and run CLARANS to find Knat clusters.
4. Check if any clusters overlap. Merge them.

• Finds spatial clusters from non-spatial generalizations
• Clusters may overlap


23

Observations

In all previous methods, quality of mining depends on the SQL query

CLARANS assumes that the entire dataset is in memory. Not always the case for large data sets.

Quality of results cannot be guaranteed when N is very large – due to Randomized Search

24

Observations (2)

Other clustering algorithms proposed for Spatial Data Mining

• Hierarchical: BIRCH
• Density based: DBSCAN, GDBSCAN, DBRS
• Grid based: STING

25

Summary

A seminal paper on use of clustering for spatial data mining

CLARANS is an effective clustering technique for large datasets

SD(CLARANS)/NSD(CLARANS) are effective spatial data mining algorithms

26

References

Primary
• Raymond T. Ng, Jiawei Han. Efficient and Effective Clustering Methods for Spatial Data Mining (1994).

Secondary
• Raymond T. Ng, Jiawei Han. CLARANS: A Method for Clustering Objects for Spatial Data Mining.
• Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. Clustering for Mining in Large Spatial Databases.
• Ralf Hartmut Güting. An Introduction to Spatial Database Systems.
