Efficient and Effective Clustering Methods for Spatial Data Mining
Raymond T. Ng, Jiawei Han
Pavan Podila
COSC 6341, Fall '04
Overview
• Spatial Data Mining
• Clustering techniques
• CLARANS
• Spatial and Non-Spatial dominant CLARANS
• Observations
• Summary
Spatial Data Mining
• Identifying interesting relationships and characteristics that may exist implicitly in spatial databases
• Different from relational databases
  • Spatial objects store both spatial and non-spatial attributes
  • Queries ("All Walmart stores within 10 miles of UH")
  • Spatial joins, which work on spatial indexes (R-tree)
  • Huge sizes (terabytes)
• GIS is a classic example
Partitioning Methods
Given K, the number of partitions to create, a partitioning method constructs initial partitions. It then iteratively refines the quality of these clusters so as to maximize intra-cluster similarity and inter-cluster dissimilarity.
[Quality of Clustering]: Average dissimilarity of objects from their cluster centers (medoids)
Selected algorithms:
1. K-medoids
2. PAM
3. CLARA
4. CLARANS
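The quality measure above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance as the dissimilarity and 2-D tuples as objects; the function name `average_dissimilarity` and the sample data are hypothetical, not from the slides.

```python
from math import dist

def average_dissimilarity(points, medoids):
    """Mean distance from each point to its nearest medoid (lower is better)."""
    return sum(min(dist(p, m) for m in medoids) for p in points) / len(points)

# Two obvious groups of three points each (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [(0, 0), (10, 10)]   # one medoid per group
bad = [(0, 0), (0, 1)]      # both medoids in the same group

print(average_dissimilarity(points, good))
print(average_dissimilarity(points, bad))
```

A set of medoids that covers both groups scores lower than one that leaves a group without a nearby medoid, which is what the partitioning algorithms below try to exploit.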
K-Medoids
• Partition-based clustering (K partitions)
• Effective, why?
  • Resistant to outliers
  • Does not depend on the order in which data points are examined
  • Cluster center is part of the dataset, unlike k-means, where the cluster center is the center of gravity (mean) of the cluster
  • Experiments show that large data sets are handled efficiently
[Figure: two scatter plots on 0–10 axes, panels labeled K-means and K-medoids]
PAM (Partitioning Around Medoids)
[Goal]: Find K representative objects of the data set. Each of the K objects is called a Medoid, the most centrally located object within a cluster.
PAM (2)
• Start with K data points designated as medoids.
• Create a cluster around each medoid by assigning every data point to its closest medoid: Oj belongs to the cluster of Oi if d(Oj, Oi) = min_{Oe} d(Oj, Oe), where the minimum is taken over all medoids Oe.
• Iteratively replace Oi with Oh if the quality of clustering improves.
• Swapping cost Cijh: the change in dissimilarity of object Oj when a selected object Oi is replaced with a non-selected object Oh.
[Figure: objects Oi, Oj and Oh]
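Summing the per-object swapping costs over all objects Oj gives the total cost of a swap, written TCih in the flowchart that follows. A minimal sketch, assuming Euclidean distance and computing the total as the difference between the clustering cost after and before the swap; the names `total_cost` and `swap_cost` are illustrative.

```python
from math import dist

def total_cost(points, medoids):
    """Total dissimilarity of all points to their nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def swap_cost(points, medoids, o_i, o_h):
    """TC_ih: change in total dissimilarity if medoid o_i is replaced by o_h."""
    swapped = [o_h if m == o_i else m for m in medoids]
    return total_cost(points, swapped) - total_cost(points, medoids)

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
medoids = [(0, 0), (10, 0)]

# Moving the left medoid from (0, 0) to the more central (1, 0) lowers the cost.
print(swap_cost(points, medoids, (0, 0), (1, 0)))  # -1.0
```

A negative value means the swap improves the clustering, which is the condition PAM tests before replacing a medoid.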
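PAM (3)
1. Select K representative objects arbitrarily.
2. Compute TCih for all pairs (Oi, Oh), where Oi is a selected object and Oh is a non-selected object.
3. Select the pair (Oi, Oh) with the minimum TCih.
4. If TCih < 0, replace Oi with Oh and go back to step 2.
5. Otherwise, for every Oj find the most representative object (its closest medoid) and stop.
• O(k(n-k)^2) for each iteration
• Good for small data sets (n=100, k=5)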
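The loop above could be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes Euclidean distance, distinct points stored as tuples, and the first K points as the arbitrary initial medoids. It also recomputes the full clustering cost for every candidate swap, which is simpler but slower than the per-iteration bound quoted on the slide.

```python
from math import dist

def total_cost(points, medoids):
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])  # assumption: "arbitrary" = first k points
    while True:
        current = total_cost(points, medoids)
        best_delta, best_swap = 0.0, None
        # Evaluate TC_ih for every (medoid, non-medoid) pair.
        for i in range(k):
            for o_h in points:
                if o_h in medoids:
                    continue
                candidate = medoids[:i] + [o_h] + medoids[i + 1:]
                delta = total_cost(points, candidate) - current
                if delta < best_delta:
                    best_delta, best_swap = delta, (i, o_h)
        if best_swap is None:  # minimum TC_ih is not negative: stop
            break
        i, o_h = best_swap
        medoids[i] = o_h
    # Assign every object to its most representative object (closest medoid).
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return medoids, clusters

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
medoids, clusters = pam(points, 2)
print(sorted(medoids))  # [(1, 0), (11, 0)]
```

Because a swap is accepted only when it strictly lowers the cost, the loop cannot cycle and always terminates.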
CLARA (Clustering LARge Applications)
• Improvement over PAM
• Finds medoids in a sample from the dataset
• [Idea]: If the samples are sufficiently random, the medoids of the sample approximate the medoids of the dataset
• [Heuristics]: 5 samples of size 40+2k give satisfactory results
• Works well for large datasets (n=1000, k=10)
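A rough sketch of the sampling idea, assuming Euclidean distance and distinct points stored as tuples. The compact `pam` helper, the `seed` parameter, and the choice to score each sample's medoids against the full dataset are assumptions made for illustration; only the "5 samples of size 40+2k" heuristic comes from the slide.

```python
import random
from math import dist

def total_cost(points, medoids):
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Small greedy PAM, run on each sample."""
    medoids = list(points[:k])
    while True:
        current = total_cost(points, medoids)
        best = (0.0, None, None)
        for i in range(k):
            for o_h in points:
                if o_h in medoids:
                    continue
                cand = medoids[:i] + [o_h] + medoids[i + 1:]
                delta = total_cost(points, cand) - current
                if delta < best[0]:
                    best = (delta, i, o_h)
        if best[1] is None:
            return medoids
        medoids[best[1]] = best[2]

def clara(points, k, num_samples=5, seed=0):
    rng = random.Random(seed)
    sample_size = min(len(points), 40 + 2 * k)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(points, sample_size)
        medoids = pam(sample, k)
        cost = total_cost(points, medoids)  # judged on the whole dataset
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Two well-separated 10x10 grids of points (illustrative data).
points = [(x, y) for x in range(10) for y in range(10)]
points += [(x + 100, y) for x in range(10) for y in range(10)]
medoids, cost = clara(points, 2)
print(sorted(medoids), cost)
```

The expensive PAM search only ever sees 40+2k objects, so the run time is dominated by scoring each sample's medoids against the full dataset.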
CLARANS (Clustering Large Applications based on RANdomized Search)
• A graph abstraction, Gn,k
  • Each vertex is a collection of k medoids
  • Two nodes S1 and S2 are neighbors if |S1 ∩ S2| = k – 1
  • Each node has k(n-k) neighbors
  • Cost of each node is the total dissimilarity of objects to their medoids
• PAM searches the whole graph
• CLARA searches a subgraph
[Figure: graph with neighboring nodes S1 and S2 and medoid sets {Oa1, ..., Oak}, {Ob1, ..., Obk}, {Oc1, ..., Ock}, {Od1, ..., Odk}, {Om1, ..., Omk}]
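The neighbor relation in this graph can be checked on a toy instance. The sketch below assumes objects are identified by integers and represents a node as a frozenset of k medoid ids; the helper name `neighbors` is illustrative.

```python
from math import comb

def neighbors(node, objects):
    """All nodes that differ from `node` in exactly one medoid."""
    node = frozenset(node)
    outside = [o for o in objects if o not in node]
    return [(node - {old}) | {new} for old in node for new in outside]

n, k = 6, 2
objects = list(range(n))
node = frozenset({0, 1})
nbrs = neighbors(node, objects)

print(len(nbrs), k * (n - k))  # 8 8
print(comb(n, k))              # 15 nodes in G_{6,2}
```

Each neighbor shares exactly k - 1 medoids with the original node, which is the condition stated on the slide.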
CLARANS (2)
1. Input maxNeighbors and numLocal. Set i = 1, minCost = ∞, bestNode = -1.
2. Set current to a random node of Gn,k.
3. Set j = 1.
4. Pick a random neighbor S of current.
5. If TCS < TCcurrent, set current = S and go to step 3.
6. Otherwise, if j < maxNeighbors, increment j and go to step 4.
7. Otherwise, current is a local optimum. If TCcurrent < minCost, set minCost = TCcurrent and bestNode = current.
8. Increment i. If i > numLocal, output bestNode and stop. Otherwise go to step 2.
Experimental values
• numLocal = 2
• maxNeighbors = max(1.25% of k(n-k), 250)
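The flowchart could be sketched as below. This is an illustrative reading under stated assumptions, not the authors' code: Euclidean distance, distinct points stored as tuples, a node represented as a list of k medoids, and a `seed` parameter added for repeatability. The default for `max_neighbors` follows the experimental value on the slide.

```python
import random
from math import dist

def total_cost(points, medoids):
    return sum(min(dist(p, m) for m in medoids) for p in points)

def clarans(points, k, num_local=2, max_neighbors=None, seed=0):
    n = len(points)
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    if max_neighbors is None:
        max_neighbors = max(int(0.0125 * k * (n - k)), 250)
    rng = random.Random(seed)
    min_cost, best_node = float("inf"), None
    for _ in range(num_local):
        current = rng.sample(points, k)          # random node of G_{n,k}
        current_cost = total_cost(points, current)
        j = 1
        while j <= max_neighbors:
            # Random neighbor: swap one medoid for one non-medoid.
            i = rng.randrange(k)
            o_h = rng.choice([p for p in points if p not in current])
            neighbor = current[:i] + [o_h] + current[i + 1:]
            neighbor_cost = total_cost(points, neighbor)
            if neighbor_cost < current_cost:
                current, current_cost = neighbor, neighbor_cost
                j = 1                            # restart the neighbor count
            else:
                j += 1
        if current_cost < min_cost:              # keep the best local optimum
            min_cost, best_node = current_cost, current
    return best_node, min_cost

# Two well-separated 5x5 grids of points (illustrative data).
points = [(x, y) for x in range(5) for y in range(5)]
points += [(x + 100, y) for x in range(5) for y in range(5)]
best_node, min_cost = clarans(points, 2)
print(sorted(best_node), min_cost)
```

Unlike PAM, no iteration examines all k(n-k) neighbors; the search gives up on a node after `max_neighbors` consecutive random neighbors fail to improve it.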
CLARANS (3)
• Outperforms PAM and CLARA in terms of running time and quality of clustering
• O(n^2) for each iteration
[Figures: CLARANS vs PAM; CLARANS vs CLARA]
Generalization
• Useful to mine non-spatial attributes
• Process of merging tuples based on a concept hierarchy
• DBLearn takes an SQL query, generalization hierarchies and a threshold
• Turns an initial relation into a generalized relation
[Figure: concept hierarchies for Sphere(color, diameter)]
• color: red, orange, yellow, green, blue, indigo, violet generalize to reddish, yellowish, bluish
• diameter: 1...20 generalizes to small, 21...40 generalizes to large
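A small sketch of generalization by concept hierarchy on the Sphere(color, diameter) example. The diameter ranges come from the slide; the color-to-parent mapping is an assumption, because the figure that drew the hierarchy did not survive, and green is left out since its parent cannot be recovered. DBLearn's threshold, which controls how far generalization proceeds, is omitted here.

```python
from collections import Counter

# Assumed grouping of colors (illustrative; not recoverable from the slide).
COLOR_PARENT = {
    "red": "reddish", "orange": "reddish",
    "yellow": "yellowish",
    "blue": "bluish", "indigo": "bluish", "violet": "bluish",
}

def generalize_diameter(d):
    """Map a diameter to its parent concept, using the ranges on the slide."""
    if 1 <= d <= 20:
        return "small"
    if 21 <= d <= 40:
        return "large"
    raise ValueError(f"diameter {d} is outside the hierarchy")

def generalize(relation):
    """Replace each attribute by its parent concept and merge equal tuples."""
    return Counter(
        (COLOR_PARENT[color], generalize_diameter(diameter))
        for color, diameter in relation
    )

spheres = [("red", 5), ("orange", 12), ("blue", 30), ("violet", 35), ("indigo", 8)]
for tuple_, count in generalize(spheres).items():
    print(tuple_, count)
```

Five initial tuples collapse into three generalized tuples, each carrying a count of how many original tuples it covers.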
Silhouette
• Silhouette of object Oj determines how much Oj belongs to its cluster
  • Between -1 and 1
  • 1 indicates a high degree of membership
• Silhouette width of a cluster: average silhouette of all objects in the cluster
• Silhouette coefficient: average of the silhouette widths of the k clusters

Silhouette width   Interpretation
0.71 – 1           Strong cluster
0.51 – 0.7         Reasonable cluster
0.26 – 0.5         Weak or artificial cluster
≤ 0.25             No cluster found
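The slide does not spell out how the silhouette of a single object is computed. The sketch below assumes the standard definition s = (b - a) / max(a, b), where a is the object's mean distance to the other members of its own cluster and b is its smallest mean distance to the members of any other cluster. It also assumes Euclidean distance, at least two clusters, and distinct points; the function names are illustrative.

```python
from math import dist

def silhouette(p, own, others):
    """Silhouette of object p given its own cluster and the other clusters."""
    rest = [q for q in own if q != p]
    if not rest:
        return 0.0  # convention for a singleton cluster
    a = sum(dist(p, q) for q in rest) / len(rest)
    b = min(sum(dist(p, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

def silhouette_coefficient(clusters):
    """Average of the silhouette widths of the k clusters."""
    widths = []
    for idx, own in enumerate(clusters):
        others = clusters[:idx] + clusters[idx + 1:]
        widths.append(sum(silhouette(p, own, others) for p in own) / len(own))
    return sum(widths) / len(widths)

tight = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
mixed = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]

print(silhouette_coefficient(tight))  # falls in the "strong cluster" band
print(silhouette_coefficient(mixed))  # falls in the "no cluster found" band
```

Running the coefficient for several values of k and comparing the results is one way to use the table above when choosing the number of clusters.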
SD and NSD approach
• SD: Spatial Dominant
• NSD: Non-Spatial Dominant
• Clustering for spatial attributes, generalization for non-spatial attributes
• Dominance is decided by what is carried out first (clustering or generalization)
• The second phase works on tuples from the previous stage
SD(CLARANS)
1. Specify the learning request in the form of an SQL query; the query selects tuples from the data.
2. Run CLARANS on the spatial attributes to find Knat clusters.
3. For every cluster, collect the non-spatial components and apply DBLearn.
• Finds non-spatial generalizations from spatial clustering
• Value for Knat is determined through heuristics using the silhouette coefficients
• The clustering phase can be treated as finding a spatial generalization hierarchy
NSD(CLARANS)
1. Specify the learning request as an SQL query; the query selects tuples from the data.
2. Apply DBLearn to the non-spatial attributes to obtain generalized tuples.
3. For every generalized tuple, collect the spatial components and run CLARANS to find Knat clusters.
4. Check if any clusters overlap. Merge them.
• Finds spatial clusters from non-spatial generalizations
• Clusters may overlap
Observations
In all previous methods, quality of mining depends on the SQL query
CLARANS assumes that the entire dataset is in memory. Not always the case for large data sets.
Quality of results cannot be guaranteed when N is very large – due to Randomized Search
Observations (2)
Other clustering algorithms proposed for Spatial Data Mining
• Hierarchical: BIRCH
• Density based: DBSCAN, GDBSCAN, DBRS
• Grid based: STING
Summary
A seminal paper on use of clustering for spatial data mining
CLARANS is an effective clustering technique for large datasets
SD(CLARANS)/NSD(CLARANS) are effective spatial data mining algorithms
References
Primary
• Efficient and Effective Clustering Methods for Spatial Data Mining (1994) - Raymond T. Ng, Jiawei Han
Secondary
• CLARANS: A Method for Clustering Objects for Spatial Data Mining - Raymond T. Ng, Jiawei Han
• Clustering for Mining in Large Spatial Databases - Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu
• An Introduction to Spatial Database Systems - Ralf Hartmut Güting