Efficient and Effective Clustering Methods for Spatial Data Mining

Raymond T. Ng, Jiawei Han

Overview

• Spatial Data Mining

• Clustering techniques

• CLARANS

• Spatial and Non-Spatial dominant CLARANS

• Observations

• Summary

Spatial Data Mining

• Identifying interesting relationships and characteristics that may exist implicitly in spatial databases

• Different from relational databases

  • Spatial objects store both spatial and non-spatial attributes

  • Queries ("All Walmart stores within 10 miles of UH")

  • Spatial joins work on spatial indexes (R-tree)

  • Huge sizes (terabytes)

• GIS is a classic example
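The range query quoted above can be sketched as a linear scan over objects that carry both spatial and non-spatial attributes. This is only an illustration: the store names, coordinates, and the `sales` attribute are hypothetical, and a real spatial database would use an index such as an R-tree instead of scanning every object.

```python
import math

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def within_radius(objects, center, radius_miles):
    """Linear-scan range query; a spatial index would prune most candidates."""
    lat0, lon0 = center
    return [o for o in objects
            if haversine_miles(lat0, lon0, o["lat"], o["lon"]) <= radius_miles]

# Hypothetical spatial objects: spatial attributes (lat, lon) plus a non-spatial one (sales).
stores = [
    {"name": "store_a", "lat": 29.72, "lon": -95.34, "sales": 1.2},
    {"name": "store_b", "lat": 29.76, "lon": -95.37, "sales": 0.8},
    {"name": "store_c", "lat": 30.27, "lon": -97.74, "sales": 2.1},
]
campus = (29.72, -95.34)  # hypothetical query point
nearby = within_radius(stores, campus, 10)
```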

Partitioning Methods

Given K, the number of partitions to create, a partitioning method constructs an initial partitioning. It then iteratively refines the quality of these clusters so as to maximize intra-cluster similarity and inter-cluster dissimilarity.

[Quality of Clustering]: Average dissimilarity of objects from their cluster centers (medoids)

Selected algorithms:

1. K-medoids

2. PAM

3. CLARA

4. CLARANS
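The quality measure named above (average dissimilarity of objects from their medoids) can be written in a few lines. The sketch assumes Euclidean distance as the dissimilarity; the slides do not fix a particular distance function, and the sample points are made up.

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def average_dissimilarity(points, medoids, dist=euclidean):
    """Mean distance from each point to its nearest medoid (lower is better)."""
    return sum(min(dist(p, m) for m in medoids) for p in points) / len(points)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = average_dissimilarity(points, [(0, 0), (10, 0)])  # one medoid per group
poor = average_dissimilarity(points, [(0, 0), (0, 2)])   # both medoids in one group
```

With one medoid in each group the average is 1.0; with both medoids in the left group it rises to 5.0.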

K-Medoids

• Partition-based clustering (K partitions)

• Effective. Why?

  • Resistant to outliers

  • Does not depend on the order in which data points are examined

• The cluster center is part of the dataset, unlike k-means, where the cluster center is the center of gravity (mean) of the cluster

• Experiments show that large data sets are handled efficiently

[Figure: K-means vs. K-medoids; axis ticks run from 0 to 10]
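A one-dimensional toy example shows why a medoid resists outliers better than a mean. The data values are invented for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def medoid(values):
    """The member of the set with the smallest total distance to all members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

cluster = [1, 2, 3, 4, 100]  # 100 is an outlier
print(mean(cluster))    # 22.0, pulled toward the outlier and not a data point
print(medoid(cluster))  # 3, an actual data point near the bulk of the cluster
```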

PAM (Partitioning Around Medoids)

• [Goal]: Find K representative objects of the data set. Each of the K objects is called a medoid, the most centrally located object within a cluster.

PAM (2)

• Start with K data points designated as medoids. Form a cluster around each medoid by assigning every data point to its closest medoid:

$O_j$ belongs to the cluster of $O_i$ if $d(O_j, O_i) = \min_{O_e} d(O_j, O_e)$, where $O_e$ ranges over the current medoids

• Iteratively replace $O_i$ with $O_h$ if the quality of clustering improves.

• A swapping cost, $C_{ijh}$, is associated with replacing a selected object $O_i$ with a non-selected object $O_h$
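The assignment rule and the effect of a swap can be sketched as follows. This is a simplified illustration, not the paper's case-by-case cost formula: it assumes Euclidean distance and obtains the total swap cost by recomputing the clustering cost before and after the swap. The function names are placeholders.

```python
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def assign(points, medoids):
    """Map each medoid to the list of points that are closest to it."""
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return clusters

def swap_cost(points, medoids, o_i, o_h):
    """Change in total cost if medoid o_i is replaced by non-medoid o_h.
    A negative value means the swap improves the clustering."""
    swapped = [o_h if m == o_i else m for m in medoids]
    return total_cost(points, swapped) - total_cost(points, medoids)

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
medoids = [(0, 0), (2, 0)]                         # both medoids in the left group
delta = swap_cost(points, medoids, (2, 0), (11, 0))
clusters = assign(points, [(0, 0), (11, 0)])
```

Here the swap lowers the total cost from 28 to 5, so `delta` is -23 and the swap would be accepted.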

PAM (3)

• $O(k(n-k)^2)$ for each iteration

• Good for small data sets (n = 100, k = 5)
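A minimal PAM-style loop, under stated assumptions: Euclidean distance, the first k points as initial medoids, and the best improving swap applied in each iteration. Each iteration examines all k(n-k) (medoid, non-medoid) pairs. For brevity the sketch recomputes the full cost for every pair, which is simpler but slower than the per-iteration bound quoted above.

```python
from itertools import product

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def total_cost(points, medoids):
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Apply the best improving swap until no swap lowers the cost."""
    medoids = list(points[:k])  # arbitrary initial medoids (an assumption)
    current = total_cost(points, medoids)
    while True:
        best_cost, best_medoids = current, None
        non_medoids = [p for p in points if p not in medoids]
        # k(n-k) candidate swaps per iteration
        for i, o_h in product(range(k), non_medoids):
            candidate = medoids[:i] + [o_h] + medoids[i + 1:]
            cost = total_cost(points, candidate)
            if cost < best_cost:
                best_cost, best_medoids = cost, candidate
        if best_medoids is None:  # no improving swap: local optimum reached
            return medoids, current
        medoids, current = best_medoids, best_cost

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
medoids, cost = pam(points, 2)
```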

CLARA (Clustering LARge Applications)

• Improvement over PAM

• Finds medoids in a sample from the dataset

• [Idea]: If the samples are sufficiently random, the medoids of the sample approximate the medoids of the dataset

• [Heuristics]: 5 samples of size 40+2k give satisfactory results

• Works well for large datasets (n = 1000, k = 10)
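The sampling idea can be sketched as below. This is an illustration rather than a reference implementation: it assumes Euclidean distance, uses a compact PAM-style routine on each sample, and scores each candidate set of medoids by average dissimilarity over the whole dataset. The seed parameter and the grid-shaped test data are choices made for the example.

```python
import random
from itertools import product

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def total_cost(points, medoids):
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Compact PAM-style search: take the best improving swap until none is left."""
    medoids = list(points[:k])
    current = total_cost(points, medoids)
    while True:
        best_cost, best_medoids = current, None
        non_medoids = [p for p in points if p not in medoids]
        for i, o_h in product(range(k), non_medoids):
            candidate = medoids[:i] + [o_h] + medoids[i + 1:]
            cost = total_cost(points, candidate)
            if cost < best_cost:
                best_cost, best_medoids = cost, candidate
        if best_medoids is None:
            return medoids
        medoids, current = best_medoids, best_cost

def clara(points, k, num_samples=5, seed=0):
    """Run PAM on random samples of size 40 + 2k; keep the medoids that fit the full data best."""
    rng = random.Random(seed)
    sample_size = min(len(points), 40 + 2 * k)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(points, sample_size)
        medoids = pam(sample, k)
        cost = total_cost(points, medoids) / len(points)  # judged on the full dataset
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Two well-separated 10 x 10 grids of points (n = 200).
points = [(x, y) for x in range(10) for y in range(10)]
points += [(x + 100, y) for x in range(10) for y in range(10)]
medoids, avg_cost = clara(points, k=2)
```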

CLARANS (Clustering Large Applications based on RANdomized Search)

• A graph abstraction, $G_{n,k}$

• Each vertex (node) is a collection of k medoids

• Two nodes $S_1$ and $S_2$ are neighbors if $|S_1 \cap S_2| = k - 1$

• Each node has k(n-k) neighbors

• Cost of each node is the total dissimilarity of objects to their medoids

• PAM searches the whole graph

• CLARA searches a subgraph

[Figure: nodes of the graph, labeled {Oa1, ..., Oak}, {Ob1, ..., Obk}, {Oc1, ..., Ock}, {Od1, ..., Odk}, and {Om1, ..., Omk}]
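The neighbor relation in the graph abstraction can be checked with a small sketch. Objects are represented by integer ids here, which is an assumption made only to keep the example short.

```python
def neighbors(node, objects):
    """All nodes that differ from `node` in exactly one medoid."""
    node = frozenset(node)
    outside = [o for o in objects if o not in node]
    return [(node - {m}) | {o} for m in node for o in outside]

objects = list(range(6))  # n = 6 object ids
node = {0, 1}             # k = 2 medoids
nbrs = neighbors(node, objects)
```

For n = 6 and k = 2 this yields k(n-k) = 8 neighbors, each sharing exactly k - 1 = 1 medoid with the original node.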

CLARANS (2)

Experimental values

• numLocal = 2

• maxNeighbors = max(1.25% of k(n-k), 250)
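A sketch of the randomized search, using the two parameters above. It is written under assumptions: Euclidean distance, a seeded random generator for repeatability, and a full cost recomputation for each candidate neighbor. From a random start node it samples random neighbors, moves whenever a neighbor is cheaper (resetting the counter), and declares a local minimum after maxNeighbors consecutive failures; the best of numLocal local minima is returned.

```python
import random

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def total_cost(points, medoids):
    return sum(min(dist(p, m) for m in medoids) for p in points)

def max_neighbors(n, k):
    """max(1.25% of k(n-k), 250), computed with integer arithmetic."""
    return max(125 * k * (n - k) // 10000, 250)

def clarans(points, k, num_local=2, max_neighbor=None, seed=0):
    rng = random.Random(seed)
    if max_neighbor is None:
        max_neighbor = max_neighbors(len(points), k)
    best, best_cost = None, float("inf")
    for _ in range(num_local):
        current = rng.sample(points, k)  # random start node
        cost = total_cost(points, current)
        tries = 0
        while tries < max_neighbor:
            # Random neighbor: replace one medoid with one non-medoid.
            i = rng.randrange(k)
            o_h = rng.choice([p for p in points if p not in current])
            candidate = current[:i] + [o_h] + current[i + 1:]
            cand_cost = total_cost(points, candidate)
            if cand_cost < cost:
                current, cost, tries = candidate, cand_cost, 0  # move and reset counter
            else:
                tries += 1
        if cost < best_cost:  # keep the best local minimum found so far
            best, best_cost = current, cost
    return best, best_cost

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
medoids, cost = clarans(points, 2)
```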

CLARANS (3)

• Outperforms PAM and CLARA in terms of running time and quality of clustering

• $O(n^2)$ for each iteration

CLARANS vs PAM

CLARANS vs CLARA

Generalization

• Useful to mine non-spatial attributes

• Process of merging tuples based on a concept hierarchy

• DBLearn: takes an SQL query, generalization hierarchies, and a threshold as input

[Figure: initial relation and generalized relation for Sphere(color, diameter)]
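Merging tuples along a concept hierarchy can be illustrated with the Sphere(color, diameter) relation. Everything concrete here is hypothetical: the color hierarchy, the diameter cut-off, and the sample tuples. The sketch performs a single generalization step and omits the threshold-driven repetition that DBLearn uses.

```python
from collections import Counter

# Hypothetical concept hierarchy for color (specific value -> parent concept).
COLOR_PARENT = {"crimson": "red", "scarlet": "red", "navy": "blue", "sky": "blue"}

def generalize_diameter(d):
    return "small" if d < 5 else "large"  # hypothetical cut-off

def generalize(relation):
    """Replace each value by its parent concept and merge identical tuples, keeping counts."""
    return Counter((COLOR_PARENT.get(c, c), generalize_diameter(d)) for c, d in relation)

sphere = [("crimson", 2), ("scarlet", 3), ("navy", 8), ("sky", 9), ("scarlet", 7)]
generalized = generalize(sphere)
```

The five initial tuples collapse into three generalized tuples: (red, small) with count 2, (blue, large) with count 2, and (red, large) with count 1.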

Silhouette

Silhouette of object $O_j$

• Determines how much $O_j$ belongs to its cluster

• Between -1 and 1

• 1 indicates a high degree of membership

Silhouette width of cluster

• Average silhouette of all objects in the cluster

Silhouette coefficient

• Average silhouette width of the k clusters
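The slides do not give the formula, so the sketch below assumes the standard silhouette definition: for an object, a is its average dissimilarity to the other members of its own cluster, b is the smallest average dissimilarity to any other cluster, and the silhouette is (b - a) / max(a, b). Euclidean distance is assumed, and every cluster is assumed to have at least two members.

```python
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def silhouette(point, own, others):
    """own: the other members of the point's cluster; others: list of the remaining clusters."""
    a = sum(dist(point, q) for q in own) / len(own)
    b = min(sum(dist(point, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

def silhouette_width(clusters, idx):
    """Average silhouette of all objects in cluster number idx."""
    cluster = clusters[idx]
    others = clusters[:idx] + clusters[idx + 1:]
    scores = [silhouette(p, cluster[:i] + cluster[i + 1:], others)
              for i, p in enumerate(cluster)]
    return sum(scores) / len(scores)

def silhouette_coefficient(clusters):
    """Average of the silhouette widths of all clusters."""
    widths = [silhouette_width(clusters, i) for i in range(len(clusters))]
    return sum(widths) / len(widths)

clusters = [[(0, 0), (1, 0), (2, 0)], [(10, 0), (11, 0), (12, 0)]]
coef = silhouette_coefficient(clusters)
```

For these two well-separated groups the coefficient is about 0.87.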

Silhouette width    Interpretation
0.71 – 1            Strong cluster
0.51 – 0.7          Reasonable cluster
0.26 – 0.5          Weak or artificial cluster
≤ 0.25              No cluster found
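The interpretation table can be encoded as a small lookup. The table leaves gaps between its ranges (for example between 0.7 and 0.71), so the sketch assumes the boundaries fall at 0.70, 0.50, and 0.25.

```python
def interpret(width):
    """Map a silhouette width to the interpretation in the table (boundaries are an assumption)."""
    if width > 0.70:
        return "strong cluster"
    if width > 0.50:
        return "reasonable cluster"
    if width > 0.25:
        return "weak or artificial cluster"
    return "no cluster found"

labels = [interpret(w) for w in (0.9, 0.6, 0.3, 0.1)]
```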

SD and NSD approach

• SD – Spatial Dominant

• NSD – Non-Spatial Dominant

• Clustering for spatial attributes / Generalization for non-spatial attributes

• Dominance is decided by what is carried out first (clustering/generalization)

• Second phase works on tuples from the previous stage

SD(CLARANS)

1. Specify the learning request in the form of SQL (yields tuples $O_i$)

2. Run CLARANS on the spatial attributes (yields $K_{nat}$ clusters)

3. Collect the non-spatial components of each cluster

4. Apply DBLearn to every cluster

• Finds non-spatial generalizations from spatial clustering

• Value for $K_{nat}$ is determined through heuristics using the silhouette coefficients

• Clustering phase can be treated as finding a spatial generalization hierarchy
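The two phases of the spatial-dominant approach can be sketched as a small pipeline. This is a structural illustration only: `toy_cluster` is a stand-in for CLARANS, `price_band` is a stand-in for DBLearn generalization, the house data is invented, and the choice of $K_{nat}$ via silhouette coefficients is omitted.

```python
from collections import Counter, defaultdict

def sd_pipeline(tuples, cluster_fn, generalize_fn):
    """tuples: list of (spatial, non_spatial) pairs.
    Returns {cluster label: Counter of generalized non-spatial values}."""
    labels = cluster_fn([spatial for spatial, _ in tuples])  # phase 1: spatial clustering
    groups = defaultdict(list)
    for label, (_, non_spatial) in zip(labels, tuples):
        groups[label].append(non_spatial)                    # collect non-spatial components
    # phase 2: generalize the non-spatial values of every cluster
    return {label: Counter(generalize_fn(v) for v in values)
            for label, values in groups.items()}

def toy_cluster(coords):
    """Stand-in for CLARANS: split on the x coordinate (illustration only)."""
    return [0 if x < 5 else 1 for x, _ in coords]

def price_band(price):
    """Stand-in for DBLearn: a one-level hypothetical hierarchy on price."""
    return "expensive" if price >= 500_000 else "moderate"

# Hypothetical houses: ((x, y) location, price)
houses = [((1, 1), 650_000), ((2, 1), 700_000), ((9, 8), 200_000), ((8, 9), 250_000)]
result = sd_pipeline(houses, toy_cluster, price_band)
```

The result describes each spatial cluster by its generalized non-spatial values: cluster 0 holds two expensive houses and cluster 1 holds two moderate ones.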

NSD(CLARANS)

• Finds spatial clusters from non-spatial generalizations

• Clusters may overlap

Observations

• In all previous methods, the quality of mining depends on the SQL query

• CLARANS assumes that the entire dataset is in memory, which is not always the case for large data sets

• Quality of results cannot be guaranteed when N is very large, due to the randomized search

Observations (2)

• Other clustering algorithms proposed for Spatial Data Mining

  • Hierarchical: BIRCH

  • Density based: DBSCAN, GDBSCAN, DBRS

  • Grid based: STING

Summary

• A seminal paper on the use of clustering for spatial data mining

• CLARANS is an effective clustering technique for large datasets

• SD(CLARANS)/NSD(CLARANS) are effective spatial data mining algorithms

References

• Primary

  • Efficient and Effective Clustering Methods for Spatial Data Mining (1994) - Raymond T. Ng, Jiawei Han

• Secondary

  • CLARANS: A Method for Clustering Objects for Spatial Data Mining - Raymond T. Ng, Jiawei Han

  • Clustering for Mining in Large Spatial Databases - Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu

  • An Introduction to Spatial Database Systems - Ralf Hartmut Güting
