Clustering Spatial Data Using Random Walks
Authors: David Harel, Yehuda Koren
Graduate: Chien-Ming Hsiao
Outline
• Motivation
• Objective
• Introduction
• Basic Notions
• Modeling The Data
• Clustering Using Random Walks
– Separators and separating operators
– Clustering by separation
– Clustering spatial points
• Integration with Agglomerative Clustering
• Examples
• Conclusion
• Opinion
Motivation
• The characteristics of spatial data pose several difficulties for clustering algorithms
• The clusters may have arbitrary shapes and non-uniform sizes
– Different clusters may have different densities
• The existence of noise may interfere with the clustering process
Objective
• Present a new approach to clustering spatial data
• Seek efficient clustering algorithms
• Overcome noise and outliers
Introduction
• The heart of the method is in what we shall be calling separating operators.
• Their effect is to sharpen the distinction between the weights of inter-cluster edges and intra-cluster edges
– By decreasing the former and increasing the latter
• They can be used on their own or embedded in a classical agglomerative clustering framework.
BASIC NOTIONS
• Graph-theoretic notions
– Let $G(V,E,w)$ be a weighted graph
– $w$: the weighting function (a higher value means more similar)
– Let $S \subseteq V$
– $V^k(S)$: the set of nodes that are connected to some node of $S$ by a path with at most $k$ edges
– $\deg(G)$: the degree of $G$
– $\langle i,j \rangle$: the edge between $i$ and $j$
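The neighborhood set V^k(S) can be sketched with a breadth-first search that stops after k edges. This is a minimal illustration, not code from the slides: the adjacency-dictionary representation and the name `k_neighborhood` are assumptions, and S itself is counted as part of V^k(S) (a path with zero edges).

```python
from collections import deque

def k_neighborhood(adj, S, k):
    """Return the nodes reachable from some node of S by a path of at most k edges.

    adj maps each node to a dict {neighbor: weight}; weights are ignored here.
    Assumption: nodes of S belong to the result (path of zero edges).
    """
    dist = {s: 0 for s in S}
    queue = deque(S)
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k edges
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

# Hypothetical example: the path graph 0 - 1 - 2 - 3 - 4 with unit weights.
path = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0},
}
print(k_neighborhood(path, {0}, 2))
print(k_neighborhood(path, {2}, 1))
```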
BASIC NOTIONS
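• The probability of a transition from node i to node j

$$p_{i,j} = \frac{w(i,j)}{d_i}, \qquad d_i = \sum_{\langle i,k \rangle \in E} w(i,k)$$

• The probability that a random walk originating at s will reach t before returning to s

$$P_{escape}(s,t) = \sum_{\langle s,i \rangle \in E} p_{s,i}\,\rho_i$$

$$\rho_s = 0, \quad \rho_t = 1, \quad \text{and} \quad \rho_i = \sum_{\langle i,j \rangle \in E} p_{i,j}\,\rho_j \quad \text{for } i \neq s,\ i \neq t$$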
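As a rough sketch of these two probabilities, the code below computes the transition probabilities by normalizing edge weights and then estimates the escape probability. Here `rho[i]` stands for the probability that a walk currently at node i reaches t before s, so `rho[s] = 0` and `rho[t] = 1`. The iterative sweep used to solve for `rho`, the function names, and the `sweeps` parameter are assumptions made for illustration; the slides do not say how the system is solved. Every node is assumed to have at least one edge.

```python
def transition_probs(w):
    """p[i][j] = w(i, j) / d_i, where d_i is the total weight of edges at i."""
    p = {}
    for i, nbrs in w.items():
        d_i = sum(nbrs.values())
        p[i] = {j: wij / d_i for j, wij in nbrs.items()}
    return p

def escape_probability(w, s, t, sweeps=2000):
    """Probability that a walk starting at s reaches t before returning to s."""
    p = transition_probs(w)
    rho = {i: 0.0 for i in w}
    rho[t] = 1.0
    # Repeatedly replace rho[i] by the average over its neighbors
    # (Gauss-Seidel style sweeps); s and t keep their boundary values.
    for _ in range(sweeps):
        for i in w:
            if i != s and i != t:
                rho[i] = sum(pij * rho[j] for j, pij in p[i].items())
    return sum(psi * rho[i] for i, psi in p[s].items())

# Hypothetical example: a triangle with unit weights.
triangle = {
    0: {1: 1.0, 2: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {0: 1.0, 1: 1.0},
}
print(escape_probability(triangle, 0, 1))
```

In the triangle, the walk leaves node 0 towards node 1 directly with probability 1/2, or towards node 2 with probability 1/2 and from there reaches 1 before 0 with probability 1/2, giving 0.75 in total.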
MODELING THE DATA
• Delaunay triangulation (DT)
– Many O(n log n) time and O(n) space algorithms exist for computing the DT of a planar point set.
• K-mutual neighborhood
– The k-nearest neighbors of each point can be computed in O(n log n) time and O(n) space for any fixed dimension.
• The weight of the edge (a,b) is $\exp\left(-\dfrac{d(a,b)^2}{ave^2}\right)$
– d(a,b) is the Euclidean distance between a and b.
– ave is the average Euclidean distance between two adjacent points.
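The graph construction and edge weighting can be sketched as follows. Two assumptions are made for illustration: the k-mutual neighborhood graph is taken to connect two points when each is among the other's k nearest neighbors, and `ave` is taken as the mean length of the resulting edges. The neighbor search here is brute force, so it does not reach the O(n log n) bound quoted above; the function names are hypothetical.

```python
import math

def k_mutual_neighbor_graph(points, k):
    """Edges (a, b), a < b, where a and b are each among the other's k nearest."""
    n = len(points)
    knn = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        knn.append(set(others[:k]))
    return sorted((a, b) for a in range(n) for b in knn[a]
                  if a < b and a in knn[b])

def edge_weights(points, edges):
    """w(a, b) = exp(-d(a, b)^2 / ave^2), with ave the mean edge length."""
    dists = {(a, b): math.dist(points[a], points[b]) for a, b in edges}
    ave = sum(dists.values()) / len(dists)
    return {e: math.exp(-(d / ave) ** 2) for e, d in dists.items()}

# Hypothetical example: two groups of points on a line.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
edges = k_mutual_neighbor_graph(points, k=2)
weights = edge_weights(points, edges)
for edge in edges:
    print(edge, round(weights[edge], 4))
```

With this weighting, short edges get weights close to 1 and long edges get weights close to 0, which matches the convention that a higher weight means more similar.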
CLUSTERING USING RANDOM WALKS
• A central step in identifying natural clusters in a graph is to find ways to compute an intimacy relation between the nodes incident to each of the graph’s edges.
• Separators are identified using an iterative process of separation.
– This is a kind of sharpening pass
NS : Separation by neighborhood similarity
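Definition:
Let $G(V,E,w)$ be a weighted graph and $k$ be some small constant. The separation of $G$ by neighborhood similarity, denoted by $NS(G)$, is defined to be:

$$NS(G) \stackrel{dfn}{=} G_s(V,E,w_s),$$

$$\text{where } \forall \langle v,u \rangle \in E, \quad w_s(u,v) = sim^k\left(P^{\le k}_{visit}(v),\ P^{\le k}_{visit}(u)\right)$$

$sim^k(x,y)$ is some similarity of the vectors $x$ and $y$.
• Can be computed in time $\Theta(|E|)$ and space $\Theta(1)$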
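A minimal sketch of one NS sharpening pass follows. It rests on assumptions that the slides do not spell out: the visit vector of a node v is taken to be the sum, over steps 1 to k, of the position distribution of a random walk started at v, and the similarity is taken to be exp(2k - L1 distance) - 1, which is non-negative because each visit vector sums to k. Other similarity measures (for example the cosine) could be substituted. All function names are hypothetical.

```python
import math

def visit_vector(w, v, k):
    """Sum over steps 1..k of the distribution of a random walk started at v."""
    current = {v: 1.0}
    total = {}
    for _ in range(k):
        nxt = {}
        for i, mass in current.items():
            d_i = sum(w[i].values())
            for j, wij in w[i].items():
                nxt[j] = nxt.get(j, 0.0) + mass * wij / d_i
        for j, mass in nxt.items():
            total[j] = total.get(j, 0.0) + mass
        current = nxt
    return total

def l1_similarity(x, y, k):
    """Assumed similarity: large when the two sparse vectors are close in L1."""
    dist = sum(abs(x.get(i, 0.0) - y.get(i, 0.0)) for i in set(x) | set(y))
    return math.exp(2 * k - dist) - 1.0

def separate_ns(w, k=3):
    """Return sharpened weights w_s(u, v) for every edge of the graph w."""
    visit = {v: visit_vector(w, v, k) for v in w}
    return {u: {v: l1_similarity(visit[u], visit[v], k) for v in w[u]}
            for u in w}

# Hypothetical example: two unit-weight triangles joined by the bridge 2 - 3.
graph = {
    0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 1.0, 1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0, 5: 1.0}, 4: {3: 1.0, 5: 1.0}, 5: {3: 1.0, 4: 1.0},
}
ws = separate_ns(graph, k=3)
print(ws[0][1], ws[0][2], ws[2][3])
```

On this example all edges start with the same weight, and after one pass the bridge edge between the two triangles receives the lowest weight, which is the sharpening effect described in the Introduction.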
CE : Separation by circular escape
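Definition:
Let $G(V,E,w)$ be a weighted graph and let $k$ be some small constant. The separation of $G$ by circular escape, denoted by $CE(G)$, is defined to be:

$$CE(G) \stackrel{dfn}{=} G_s(V,E,w_s),$$

$$\text{where } \forall \langle v,u \rangle \in E, \quad w_s(u,v) = CE^k(v,u),$$

$$CE^k(v,u) \stackrel{dfn}{=} P^{(k)}_{escape}(v,u) \cdot P^{(k)}_{escape}(u,v)$$

• Can be computed in time $\Theta(|E|)$ and space $\Theta(1)$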
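One CE sharpening pass might be sketched as below. The key assumption is the meaning of the superscript (k): the escape probability is computed for a walk confined to the subgraph induced by the nodes within k edges of u or v. The iterative solver, the `sweeps` parameter, and all function names are likewise illustrative choices rather than details taken from the slides. Node labels are assumed to be comparable (integers here).

```python
from collections import deque

def k_neighborhood(w, sources, k):
    """Nodes within k edges of any source node (sources included)."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        if dist[u] < k:
            for v in w[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    return set(dist)

def escape_probability(w, s, t, nodes, sweeps=500):
    """Escape probability from s to t for a walk confined to `nodes`."""
    sub = {i: {j: wij for j, wij in w[i].items() if j in nodes} for i in nodes}
    rho = {i: 0.0 for i in nodes}  # rho[i]: chance of reaching t before s
    rho[t] = 1.0
    for _ in range(sweeps):
        for i in nodes:
            if i != s and i != t:
                rho[i] = (sum(wij * rho[j] for j, wij in sub[i].items())
                          / sum(sub[i].values()))
    return sum(wsj * rho[j] for j, wsj in sub[s].items()) / sum(sub[s].values())

def separate_ce(w, k=2):
    """Sharpened weights: product of the two confined escape probabilities."""
    ws = {u: {} for u in w}
    for u in w:
        for v in w[u]:
            if u < v:
                nodes = k_neighborhood(w, (u, v), k)
                ce = (escape_probability(w, u, v, nodes)
                      * escape_probability(w, v, u, nodes))
                ws[u][v] = ws[v][u] = ce
    return ws

# Hypothetical example: two unit-weight triangles joined by the bridge 2 - 3.
graph = {
    0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 1.0, 1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0, 5: 1.0}, 4: {3: 1.0, 5: 1.0}, 5: {3: 1.0, 4: 1.0},
}
ws = separate_ce(graph, k=2)
print(round(ws[0][1], 4), round(ws[0][2], 4), round(ws[2][3], 4))
```

A walk that crosses the bridge has no alternative route back, so both escape probabilities across it are small and their product is the smallest sharpened weight in the graph.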
Clustering spatial points
Integration with Agglomerative Clustering
• The separation operators can be used as a preprocessing step before applying agglomerative clustering to the graph
• This preprocessing can effectively prevent bad local merges that oppose the graph structure.
• It is equivalent to a “single link” algorithm preceded by a separation operation
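To illustrate the "single link" step on already sharpened weights, the sketch below merges clusters along the most similar remaining edge until a requested number of clusters is left. The stopping rule (`n_clusters`), the union-find bookkeeping, and the edge weights in the example are all assumptions chosen for illustration; the slides do not specify them.

```python
def single_link(nodes, weights, n_clusters):
    """Single-link agglomeration on a similarity graph.

    weights maps an edge (u, v) to a similarity; higher means more similar.
    Clusters are merged along the most similar edges first.
    """
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the cluster representative,
        # shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    remaining = len(parent)
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    for (u, v), _ in ranked:
        if remaining <= n_clusters:
            break
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            remaining -= 1

    clusters = {}
    for v in nodes:
        clusters.setdefault(find(v), set()).add(v)
    return list(clusters.values())

# Illustrative sharpened weights: two triangles joined by a weak bridge (2, 3).
sharpened = {
    (0, 1): 0.56, (0, 2): 0.38, (1, 2): 0.38,
    (3, 4): 0.38, (3, 5): 0.38, (4, 5): 0.56,
    (2, 3): 0.11,
}
clusters = single_link(range(6), sharpened, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Because the bridge has the lowest similarity, it is the last edge single link would consider, so stopping at two clusters leaves the two triangles separate.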
Examples
Conclusion
• It is robust in the presence of noise and outliers, and is flexible in handling data of different densities.
• The CE operator yields better results than the NS operator
• The time complexity of our algorithm applied to n data points is O(n log n)
Opinion
• Since the algorithm does not rely on spatial knowledge, we can try it on other types of data.
END