Clustering Spatial Data Using Random Walks

Authors : David Harel, Yehuda Koren

Graduate : Chien-Ming Hsiao

Outline

• Motivation

• Objective

• Introduction

• Basic Notions

• Modeling The Data

• Clustering Using Random Walks
– Separators and separating operators
– Clustering by separation
– Clustering spatial points

• Integration with Agglomerative Clustering

• Examples

• Conclusion

• Opinion

Motivation

• The characteristics of spatial data pose several difficulties for clustering algorithms

• The clusters may have arbitrary shapes and non-uniform sizes
– Different clusters may have different densities

• The existence of noise may interfere with the clustering process

Objective

• Present a new approach to clustering spatial data

• Seek efficient clustering algorithms

• Overcome noise and outliers

Introduction

• The heart of the method is in what we shall be calling separating operators.

• Their effect is to sharpen the distinction between the weights of inter-cluster edges and intra-cluster edges
– By decreasing the former and increasing the latter

• They can be used on their own or embedded in a classical agglomerative clustering framework.

BASIC NOTIONS

• Graph-theoretic notions
– Let $G(V, E, w)$ be a weighted graph, where $w$ is the weighting function (a higher value means more similar)
– Let $S \subseteq V$
– $V^k(S)$: the set of nodes that are connected to some node of $S$ by a path with at most $k$ edges
– $\deg(G)$: the degree of $G$
– $\langle i, j \rangle$: the edge between $i$ and $j$

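BASIC NOTIONS

• The probability of a transition from node $i$ to node $j$:

$p_{i,j} = \dfrac{w(i,j)}{d_i}$, where $d_i = \sum_{\langle i,k \rangle \in E} w(i,k)$

• The probability that a random walk originating at $s$ will reach $t$ before returning to $s$:

$P_{escape}(s,t) = \sum_{i} p_{s,i}\,\rho_i$, where $\rho_t = 1$, $\rho_s = 0$, and $\rho_i = \sum_{j} p_{i,j}\,\rho_j$ for $i \neq s$, $i \neq t$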
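As a small worked example of the escape probability, consider a hypothetical three-node path $s - a - t$ with unit edge weights (this toy graph is an assumption for illustration and does not come from the slides). A walk leaving $s$ must step to $a$, and from $a$ it moves to $t$ or back to $s$ with equal probability:

```latex
\begin{align*}
d_s &= 1, \quad d_a = 2, \quad d_t = 1 \\
p_{s,a} &= 1, \qquad p_{a,s} = p_{a,t} = \tfrac{1}{2} \\
\rho_s &= 0, \qquad \rho_t = 1 \\
\rho_a &= p_{a,s}\,\rho_s + p_{a,t}\,\rho_t
        = \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 1 = \tfrac{1}{2} \\
P_{escape}(s,t) &= p_{s,a}\,\rho_a = 1 \cdot \tfrac{1}{2} = \tfrac{1}{2}
\end{align*}
```

The result matches the direct argument: the walk reaches $t$ before returning to $s$ exactly when its second step goes right.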

MODELING THE DATA

• Delaunay triangulation (DT)
– Many O(n log n) time and O(n) space algorithms exist for computing the DT of a planar point set.

• K-mutual neighborhood
– The k-nearest neighbors of each point can be computed in O(n log n) time and O(n) space for any fixed dimension.

• The weight of the edge (a,b) is $\exp\left(-\dfrac{d(a,b)^2}{ave^2}\right)$
– d(a,b) is the Euclidean distance between a and b.
– ave is the average Euclidean distance between two adjacent points.
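The graph construction and edge weighting can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: "k-mutual" is taken to mean that each endpoint is among the other's k nearest neighbors, the neighbor search is brute force (not the O(n log n) method the slide refers to), and the function names `k_mutual_neighbor_edges` and `gaussian_edge_weights` are hypothetical.

```python
import math

def k_mutual_neighbor_edges(points, k):
    """Brute-force sketch: keep (a, b) if each point is among the
    other's k nearest neighbors (assumed meaning of 'k-mutual')."""
    n = len(points)
    knn = []
    for a in range(n):
        others = sorted((b for b in range(n) if b != a),
                        key=lambda b: math.dist(points[a], points[b]))
        knn.append(set(others[:k]))
    return [(a, b) for a in range(n) for b in range(a + 1, n)
            if b in knn[a] and a in knn[b]]

def gaussian_edge_weights(points, edges):
    """Weight each edge (a, b) by exp(-d(a, b)^2 / ave^2)."""
    # d(a, b): Euclidean distance between the endpoints of each edge.
    dist = {(a, b): math.dist(points[a], points[b]) for a, b in edges}
    # ave: average distance over adjacent (edge-connected) pairs.
    ave = sum(dist.values()) / len(dist)
    return {e: math.exp(-(d * d) / (ave * ave)) for e, d in dist.items()}

# Toy data: two tight pairs separated by a wide gap.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
edges = k_mutual_neighbor_edges(points, k=2)
weights = gaussian_edge_weights(points, edges)
```

On this toy input the long edge across the gap receives a much smaller weight than the two short edges, which is the "higher value means more similar" convention used throughout.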

CLUSTERING USING RANDOM WALKS

• One way to identify natural clusters in a graph is to compute an intimacy relation between the nodes incident to each of the graph’s edges.

• Separators are identified by an iterative process of separation.
– This is a kind of sharpening pass
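The overall flow (sharpen the weights a few times, drop the weak edges, read off the connected components) can be sketched as follows. This is only an illustration: the names `cluster_by_separation`, `separate`, `passes`, and `threshold` are hypothetical, the slides do not state how the cut-off is chosen, and the squaring operator in the usage lines is a toy stand-in that merely mimics sharpening, not one of the operators defined in this deck.

```python
def cluster_by_separation(n, weights, separate, passes=2, threshold=0.5):
    """Apply a separating operator `passes` times, then treat edges
    whose weight falls below `threshold` as separators and return a
    component label for every node (union-find)."""
    for _ in range(passes):
        weights = separate(weights)

    parent = list(range(n))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (u, v), w in weights.items():
        if w >= threshold:          # keep only non-separator edges
            parent[find(u)] = find(v)
    return [find(i) for i in range(n)]

# Toy stand-in operator (assumption): squaring pushes weak weights
# toward zero faster than strong ones.
toy_separate = lambda ws: {e: w * w for e, w in ws.items()}

labels = cluster_by_separation(
    4, {(0, 1): 0.9, (1, 2): 0.6, (2, 3): 0.9}, toy_separate)
```

Nodes that end up with the same label belong to the same cluster; here the weaker middle edge is cut, leaving two clusters.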

NS : Separation by neighborhood similarity

Definition : Let $G(V,E,w)$ be a weighted graph and $k$ be some small constant. The separation of $G$ by neighborhood similarity, denoted by $NS(G)$, is defined to be:

$NS(G) \stackrel{dfn}{=} G_s(V,E,w_s)$, where $\forall \langle v,u \rangle \in E$, $w_s(u,v) = sim^k\left(P^{\le k}_{visit}(v),\, P^{\le k}_{visit}(u)\right)$

$sim^k(x,y)$ is some similarity of the vectors $x$ and $y$

• Can be computed in time $\Theta(|E|)$ and space $\Theta(1)$
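A minimal Python sketch of the NS operator follows. It rests on two assumptions that the slide does not spell out: $P^{\le k}_{visit}(v)$ is taken to be the sum, over steps 1 to k, of the distribution of a random walk started at v, and the similarity is chosen as $\exp(2k - \|x - y\|_1) - 1$ (any vector similarity could be substituted). The function names are hypothetical, and the code walks the whole graph from every node rather than using the local, linear-time computation the slide mentions.

```python
import math

def transition_probs(n, weights):
    """p[i][j] = w(i, j) / d_i for an undirected weighted graph."""
    p = [dict() for _ in range(n)]
    for (u, v), w in weights.items():
        p[u][v] = w
        p[v][u] = w
    for i in range(n):
        d_i = sum(p[i].values())
        for j in p[i]:
            p[i][j] /= d_i
    return p

def visit_upto_k(p, v, k):
    """Sum of the walk's distributions after steps 1..k, starting at v."""
    n = len(p)
    dist = [0.0] * n
    dist[v] = 1.0
    total = [0.0] * n
    for _ in range(k):
        nxt = [0.0] * n
        for i, mass in enumerate(dist):
            if mass:
                for j, pij in p[i].items():
                    nxt[j] += mass * pij
        dist = nxt
        for j in range(n):
            total[j] += dist[j]
    return total

def ns(n, weights, k=3):
    """New weight of each edge = similarity of its endpoints' visit vectors."""
    p = transition_probs(n, weights)
    visit = [visit_upto_k(p, v, k) for v in range(n)]

    def sim(x, y):
        # Assumed similarity: large when the L1 distance is small.
        l1 = sum(abs(a - b) for a, b in zip(x, y))
        return math.exp(2 * k - l1) - 1.0

    return {(u, v): sim(visit[u], visit[v]) for (u, v) in weights}

# Two triangles {0,1,2} and {3,4,5} joined by the bridge (2, 3).
graph = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (2, 3): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0}
sharpened = ns(6, graph, k=2)
```

In this toy graph the bridge endpoints see quite different neighborhoods, so the bridge ends up with a lower weight than an edge inside either triangle.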

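CE : Separation by circular escape

Definition : Let $G(V,E,w)$ be a weighted graph and let $k$ be some small constant. The separation of $G$ by circular escape, denoted by $CE(G)$, is defined to be:

$CE(G) \stackrel{dfn}{=} G_s(V,E,w_s)$, where $\forall \langle u,v \rangle \in E$, $w_s(u,v) = CE^k(v,u)$

$CE^k(v,u) \stackrel{dfn}{=} P^k_{escape}(v,u) \cdot P^k_{escape}(u,v)$

• Can be computed in time $\Theta(|E|)$ and space $\Theta(1)$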
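The CE operator can be sketched in Python as below. The sketch assumes that $P^k_{escape}(v,u)$ is the escape probability computed inside the subgraph induced by $V^k(\{u,v\})$, with transition probabilities renormalized within that subgraph; the slide does not state this explicitly. The function names and the fixed number of Gauss-Seidel sweeps are likewise illustrative choices.

```python
def neighborhood(adj, sources, k):
    """V^k(S): nodes within at most k edges of some node in S (BFS)."""
    seen = set(sources)
    frontier = set(sources)
    for _ in range(k):
        frontier = {j for i in frontier for j in adj[i]} - seen
        seen |= frontier
    return seen

def escape_prob(adj, s, t, nodes, sweeps=200):
    """P_escape(s, t) restricted to `nodes`, via Gauss-Seidel sweeps."""
    # Transition probabilities inside the induced subgraph.
    p = {}
    for i in nodes:
        nbrs = {j: w for j, w in adj[i].items() if j in nodes}
        d_i = sum(nbrs.values())
        p[i] = {j: w / d_i for j, w in nbrs.items()}
    # rho_t = 1, rho_s = 0, and rho_i is the weighted mean of its neighbors.
    rho = {i: 0.0 for i in nodes}
    rho[t] = 1.0
    for _ in range(sweeps):
        for i in nodes:
            if i != s and i != t:
                rho[i] = sum(pij * rho[j] for j, pij in p[i].items())
    return sum(psi * rho[i] for i, psi in p[s].items())

def ce(n, weights, k=2):
    """New weight of <u, v> = P_escape(u, v) * P_escape(v, u)."""
    adj = [dict() for _ in range(n)]
    for (u, v), w in weights.items():
        adj[u][v] = w
        adj[v][u] = w
    out = {}
    for (u, v) in weights:
        nodes = neighborhood(adj, {u, v}, k)
        out[(u, v)] = (escape_prob(adj, u, v, nodes)
                       * escape_prob(adj, v, u, nodes))
    return out

# Two triangles {0,1,2} and {3,4,5} joined by the bridge (2, 3).
graph = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (2, 3): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0}
sharp = ce(6, graph, k=2)
```

For the bridge, a walk leaving node 2 reaches node 3 before returning only if it steps there directly, so each direction contributes 1/3 and the bridge weight drops to 1/9, well below the 9/16 obtained for the edge (0, 1) inside a triangle.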

Clustering spatial points

Integration with Agglomerative Clustering

• The separating operators can be used as a preprocessing step before applying agglomerative clustering to the graph.

• This can effectively prevent bad local merges that go against the graph structure.

• It is equivalent to a “single link” algorithm preceded by a separation operation

Examples

Conclusion

• The method is robust in the presence of noise and outliers, and is flexible in handling data of different densities.

• The CE operator yields better results than the NS operator.

• The time complexity of the algorithm applied to n data points is O(n log n).

Opinion

• Since the algorithm does not rely on spatial knowledge, we can try it on other types of data.

END
