Density-Based Clustering of Uncertain Data (KDD 2005)

Authors: Hans-Peter Kriegel and Martin Pfeifle
Presenter: Chui Chun Kit (MPhil), ckchui@cs.hku.hk, http://www.cs.hku.hk/~ckchui
Supervisor: Dr. Benjamin C.M. Kao

HKU Department of Computer Science, Database Research Seminar

18th May 2006


Presentation Outline

Introduction
  What is clustering?
  Density-based similarity measurement
  DBSCAN
Issues in moving from certain data to uncertain data
  Why does data exhibit uncertainty?
  How to represent / model data uncertainty?
  How to represent the distance between two uncertain objects?
  Theoretical foundation for changing DBSCAN to FDBSCAN
FDBSCAN
  From DBSCAN to FDBSCAN
  Computational issues
  Experimental results
Conclusions


Introduction


What is Clustering?

Problem description
  A set of objects
  A similarity measurement
  Discover groups of similar objects
  More precisely, find sets of objects for which intra-cluster similarity is high while inter-cluster similarity is relatively low.


Different Clusters Discovered by Different Similarity Measurements

Distance-based
Density-based
Pattern-based
… etc.


Density-based clustering

The main reason why we recognize the clusters is that within each cluster we have a typical density of objects which is considerably higher than outside of the cluster.

The clusters are separated by regions of low object density (noise).

[Figure: scatter plot of objects in the x-y plane, captioned "Any clusters?"]


Density-based clustering can detect arbitrary cluster shapes


Key idea of density-based clustering

Density constraint for objects to form clusters
  Intuitively, for each object of a cluster, the neighborhood of a given radius has to contain at least a minimum number of objects (the density constraint).
  I.e., the density in the neighborhood has to exceed some threshold.
Objects that do not belong to any cluster are regarded as noise.


Previous Work on Density-Based Clustering

DBSCAN
  A density-based clustering algorithm
  Works on data with no uncertainty

The uncertain-data version of DBSCAN is presented later.


DBSCAN

Two important definitions of DBSCAN
  Core object
  Directly density-reachable
  Density-reachable (skipped)
  Density-connected (skipped)

To keep the discussion short, the last two definitions are skipped.


DBSCAN Definition 1: Core Object

Given the density constraint (µ and ε), an object o is defined as a core object iff there are µ or more objects within the ε-range of o.

Basically, we can conduct a range search on object o with radius ε; if µ or more objects are returned, then o is a core object.
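The range-search test for a core object can be sketched in a few lines. This is only an illustration: the names `range_query` and `is_core_object`, the Euclidean distance, and the linear scan are assumptions, not something the slides specify (a real implementation would typically use a spatial index).

```python
import math

def range_query(points, o, eps):
    """Return every point within distance eps of o (o itself included if it is in points)."""
    return [p for p in points if math.dist(p, o) <= eps]

def is_core_object(points, o, eps, mu):
    """o is a core object iff the eps-range search returns mu or more objects."""
    return len(range_query(points, o, eps)) >= mu

# Hypothetical data: five points around the origin plus one far-away point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5), (5.0, 5.0)]
print(is_core_object(points, (0.0, 0.0), eps=1.0, mu=5))  # True
print(is_core_object(points, (5.0, 5.0), eps=1.0, mu=5))  # False
```

Note that the query object counts itself, which matches the later sample-matrix example where the cell is initialized to 1.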


DBSCAN Definition 1: Core Object

Example (µ = 5): Is o1 a core object?

Since there are 5 objects within the ε-range of o1, o1 is a core object.

Since there are 5 objects within the ε-range of o2, o2 is a core object too.

[Figure: objects o1 and o2, each shown with its ε-range.]


DBSCAN Definition 2: Directly Density-Reachable

An object p is directly density-reachable from o if the following conditions are satisfied:
  1st condition: o is a core object
  2nd condition: d(p,o) ≤ ε
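The two conditions translate directly into a predicate. The sketch below assumes Euclidean distance and hypothetical helper names; it also shows that the relation is not symmetric, because only the object being reached from has to be a core object.

```python
import math

def neighbors(points, o, eps):
    """All points within distance eps of o."""
    return [q for q in points if math.dist(q, o) <= eps]

def directly_density_reachable(points, p, o, eps, mu):
    """True iff o is a core object (1st condition) and d(p, o) <= eps (2nd condition)."""
    o_is_core = len(neighbors(points, o, eps)) >= mu
    return o_is_core and math.dist(p, o) <= eps

# Hypothetical data: a dense group around the origin and one border point.
center = (0.0, 0.0)
border = (1.0, 0.0)
points = [center, (0.3, 0.0), (0.0, 0.3), (-0.3, 0.0), (0.0, -0.3), border]

print(directly_density_reachable(points, border, center, eps=1.0, mu=5))  # True
print(directly_density_reachable(points, center, border, eps=1.0, mu=5))  # False
```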


DBSCAN Definition 2: Directly Density-Reachable

Example (µ = 5). Question: Is o2 directly density-reachable from o1?

1st condition: Is o1 a core object? Since there are 5 objects within the ε-range of o1, o1 is a core object.

2nd condition: Is d(o2,o1) ≤ ε? Yes, o2 is within the ε-range of o1.

Thus, o2 is directly density-reachable from o1.

[Figure: o1 with its ε-range, which contains o2.]


DBSCAN: How does it work?

Brief idea:
  Search for clusters by checking the ε-neighborhood of each object in the database.
  If a core object o is found, a new cluster containing o and its directly density-reachable objects is created.
  DBSCAN iteratively collects the objects that are directly density-reachable from the objects already in the cluster.
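The three steps above can be sketched as a small, unoptimized DBSCAN. This is a minimal illustration assuming Euclidean distance and a linear-scan neighborhood query; the label convention (`NOISE = -1`, cluster ids from 0) is an assumption of this sketch.

```python
import math

NOISE = -1

def dbscan(points, eps, mu):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def region(i):
        # Indices of all points in the eps-neighborhood of point i (i included).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < mu:
            labels[i] = NOISE          # not a core object; may become a border point later
            continue
        labels[i] = cluster            # core object found: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = region(j)
            if len(nbrs) >= mu:        # j is a core object too: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),
          (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1),
          (10, 10)]
print(dbscan(points, eps=0.3, mu=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```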


DBSCAN

Example (µ = 5)

Arbitrarily pick a point, e.g. o1, and check whether it is a core object.

o1 is a core object, so a cluster is created with o1 and all objects that are directly density-reachable from o1.

DBSCAN continues to "expand" the cluster by adding objects that are directly density-reachable from cluster objects.

Since a1 is not a core object, a2 is NOT directly density-reachable from a1, so a2 is NOT added to the cluster.

Pick another point for the next iteration if the current cluster does not expand.

Eventually, clusters are formed. Objects that are not assigned to any cluster are regarded as noise.

[Figure: step-by-step expansion of the cluster around o1, with the ε-ranges of the visited objects; objects a1, a2 and o2 are labeled.]


From Certain Data to Uncertain Data


From certain to uncertain data: five major issues

Why does data exhibit uncertainty?
How to represent / model data uncertainty?
How to represent the distance between two uncertain objects?
What is a core object in uncertain data?
What is directly density-reachable in uncertain data?



Why does data exhibit uncertainty?

In many modern application areas, e.g. the clustering of moving objects or sensor databases, only uncertain data is available.

For instance, in the area of mobile services, objects continuously change their positions, so exact positional information is often not available.


Why does data exhibit uncertainty?

In application areas such as clustering of distributed feature vectors, due to security aspects or to limited bandwidth, only approximated information is transmitted to a central server site.


Uncertain Data (Example)

Somewhere in a tropical rain forest…
  Location tracking of a group of about 300 chimpanzees.
  An implanted device regularly reports the location of each chimpanzee.
  However, the reported location is not precise; it only returns the area in which the chimpanzee is located.
  This area is called an uncertainty region.
  Assume that the chimpanzee is equally likely to be at any location inside the uncertainty region.


Uncertain Data (Example)

Chimpanzee society is complicated; some young chimpanzees may gather to fight against the leader.

Zoologists are interested in studying the factors that affect the formation of different groups (clusters) inside the chimpanzee society.


Uncertain Data (Example)

One observation is that chimpanzees of the same group usually stay close together.

Assume that each chimpanzee belongs to one group only.

Density-based clustering can help to discover the chimpanzee groups (clusters).


Uncertain Data (Example)

[Figure: somewhere in the tropical rain forest, the uncertainty regions of 15 chimpanzees (the location of each chimpanzee as reported by the location tracking devices) plotted in the x-y plane, with the clusters marked.]



Representing Uncertain Objects

[Figure: probability density functions of 1-D objects, plotted as probability against value (e.g. temperature), and probability density functions for 2-D objects over the x-y plane.]


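Representing Uncertain Objects

The probability that an object o has a value between a and b is the area under its probability density function between a and b: $P(a \le o \le b) = \int_a^b f_o(x)\,dx$, where $f_o$ denotes the pdf of o.

Question: What is the distance between o_uncertain and o’_uncertain?

[Figure: probability density functions of 1-D objects over value (e.g. temperature), with the area between a and b marked.]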



How to represent the distance between uncertain objects?

Distance density function pd(o,o’)
Distance distribution function Pd(o,o’)(b)
Distance expectation value Ed(o,o’)
  Aggregated value
  Information loss



Distance Density Function pd(o,o’)

Express the distance between two objects by means of a probability density function.

Let d be a distance function.
Let P(a ≤ d(o,o’) ≤ b) denote the probability that d(o,o’) is between a and b.
A probability density function pd(o,o’) is called a distance density function if the following condition holds: $P(a \le d(o,o') \le b) = \int_a^b p_d(o,o')(x)\,dx$.


Distance Density Function pd(o,o’)

The probability density functions (pdfs) of the uncertain data items are considered independent.

The distance density function expresses the distance between two uncertain objects by means of a pdf.

[Figure: pdfs of two uncertain objects over value (e.g. temperature), and the resulting distance density function plotted as probability against the distance between o and o’, with the value pd(o,o’)(dis) marked at distance dis.]


Distance Density Function pd(o,o’)

[Figure: the distance density function pd(o,o’), which represents the distance between two uncertain objects, plotted as probability against the distance between o and o’.]


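Distance Density Function pd(o,o’)

From the distance density function, the probability that the distance between two uncertain objects is between a and b is given by the area under pd(o,o’) between a and b: $P(a \le d(o,o') \le b) = \int_a^b p_d(o,o')(x)\,dx$.

The total area under pd(o,o’), from the minimum to the maximum possible distance between o and o’, is 1.

[Figure: pd(o,o’) plotted as probability against the distance between o and o’, with a, b, and the minimum and maximum possible distances marked on the distance axis.]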



Distance Distribution Function

Captures the probability that the distance between two uncertain objects is smaller than or equal to a value b.

Useful in density-based clustering for expressing the probability that d(o’,o) ≤ b, which corresponds to the 2nd condition for directly density-reachable in DBSCAN.


Distance Distribution Function

In density-based clustering, when evaluating whether an object o’ is directly density-reachable from o, we may want to ask:

What is the probability that o and o’ are close to each other, i.e. that the distance between o and o’ is smaller than or equal to b?

The distance distribution function Pd(o,o’)(b) is the answer.

[Figure: the probability density functions (pdfs) of o and o’.]


Distance Distribution Function

The distance distribution function Pd(o,o’)(b) is the integral of the distance density function pd(o,o’) from negative infinity to b: $P_d(o,o')(b) = \int_{-\infty}^{b} p_d(o,o')(x)\,dx$.

[Figure: the distance density function pd(o,o’) plotted as probability against the distance between o and o’, with b marked on the distance axis.]



Distance Expectation Value Ed(o,o’)

Represents the distance between two uncertain objects by one numerical value: the average distance between the two objects, aggregated from the distance density function, $E_d(o,o') = \int_{-\infty}^{\infty} x \cdot p_d(o,o')(x)\,dx$.

Advantage: since the distance between two uncertain objects is represented by a single value, traditional clustering algorithms, e.g. DBSCAN, work.

Disadvantage: information loss.
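A small numerical illustration of the information loss, under assumptions that are not in the slides: each 1-D uncertain object is represented by a handful of equally likely sample values, and the distance is the absolute difference. Two object pairs can share the same expected distance while having very different probabilities of being within a given range.

```python
from itertools import product

def pairwise_distances(samples_a, samples_b):
    """All distances between one sample of a and one sample of b (1-D values)."""
    return [abs(a - b) for a, b in product(samples_a, samples_b)]

def expected_distance(samples_a, samples_b):
    """Sample-based stand-in for the distance expectation value."""
    d = pairwise_distances(samples_a, samples_b)
    return sum(d) / len(d)

def prob_within(samples_a, samples_b, b):
    """Sample-based stand-in for the distance distribution function at b."""
    d = pairwise_distances(samples_a, samples_b)
    return sum(1 for x in d if x <= b) / len(d)

o = [0.0]          # hypothetical object with an exact value
p = [2.0]          # always at distance 2 from o
q = [0.5, 3.5]     # sometimes close to o, sometimes far away

print(expected_distance(o, p), expected_distance(o, q))  # 2.0 2.0
print(prob_within(o, p, 1.5), prob_within(o, q, 1.5))    # 0.0 0.5
```

Both pairs have the same expected distance, yet only q has a real chance of lying within distance 1.5 of o; a clustering algorithm that sees only the expectation value cannot tell them apart.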



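Theoretical Foundations I: Core Object Probability

Let $P^{core}_d(o)$ denote the probability that an object o is a core object.

The core object probability of an object o is given by the following formula:

$P^{core}_d(o) = \sum_{A \subseteq D,\ |A| \ge \mu} \ \prod_{p \in A} P_d(p,o)(\varepsilon) \prod_{p \in D \setminus A} \bigl(1 - P_d(p,o)(\varepsilon)\bigr)$

We start deriving this formula from the core object definition of DBSCAN…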


Theoretical Foundations I: Core Object Probability

In DBSCAN, an object o is a core object if the density constraint (µ and ε) is satisfied, i.e. there are µ or more objects p within the ε-range of o (d(p,o) ≤ ε).

The probability that an object o is a core object is the probability that the density constraint is satisfied, i.e. the probability that there are µ or more objects p with d(p,o) ≤ ε.


Theoretical Foundations I: Core Object Probability

Example (µ = 5)

If ε is large enough, the core object probability of o is obviously 1.

If ε is small, what is the core object probability of o? Sometimes d(p,o) ≤ ε and sometimes d(p,o) > ε.

[Figure: query object o with its ε-range and a neighboring object p, each drawn with its probability density function (pdf).]


Theoretical Foundations I: Core Object Probability

For each subset A of the database D with cardinality greater than or equal to µ, determine the probability that exactly the objects p of A have d(p,o) ≤ ε, and no other objects in D\A do.


Theoretical Foundations I: Core Object Probability

Recall that Pd(o,o’)(b) is the probability that the distance between two uncertain objects o and o’ is smaller than or equal to a value b.

The probability that exactly the objects p of A have d(p,o) ≤ ε, and no other objects in D\A do, is the product of two parts:

First part: the probability that ALL objects p in A have d(p,o) ≤ ε, $\prod_{p \in A} P_d(p,o)(\varepsilon)$.

Second part: the probability that NO object p in D\A has d(p,o) ≤ ε, $\prod_{p \in D \setminus A} \bigl(1 - P_d(p,o)(\varepsilon)\bigr)$.
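The subset sum can be written out literally to make the two parts concrete. The sketch below assumes that the per-object probabilities P_d(p,o)(ε) are already known and independent, and that o itself is included in D with probability 1; the function names are hypothetical. The second function is a swapped-in dynamic program (not from the slides) that yields the same value without enumerating subsets, useful as a cross-check.

```python
from itertools import combinations
from math import prod

def core_probability_bruteforce(probs, mu):
    """Sum over all subsets A with |A| >= mu; probs[i] = P_d(p_i, o)(eps)."""
    n = len(probs)
    total = 0.0
    for k in range(mu, n + 1):
        for A in combinations(range(n), k):
            in_A = set(A)
            first_part = prod(probs[i] for i in in_A)             # all of A within eps
            second_part = prod(1.0 - probs[i]
                               for i in range(n) if i not in in_A)  # none of D\A within eps
            total += first_part * second_part
    return total

def core_probability_dp(probs, mu):
    """Same quantity via a DP over 'how many objects are within eps' (an alternative, not the slides' method)."""
    dist = [1.0]  # dist[k] = probability that exactly k of the objects seen so far are within eps
    for q in probs:
        new = [0.0] * (len(dist) + 1)
        for k, pk in enumerate(dist):
            new[k] += pk * (1.0 - q)
            new[k + 1] += pk * q
        dist = new
    return sum(dist[mu:])

# Hypothetical example: o itself (probability 1) and two neighbors with probability 0.5 each.
probs = [1.0, 0.5, 0.5]
print(core_probability_bruteforce(probs, mu=2))  # 0.75
print(core_probability_dp(probs, mu=2))          # 0.75
```

The brute-force version enumerates exponentially many subsets, which is exactly the cost the later computational slides try to avoid.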



Theoretical Foundations II: Reachability Probability

Let $P^{reach}_d(p,o)$ be the probability that p is directly density-reachable from o.

In DBSCAN, an object p is directly density-reachable from o if
  1st condition: o is a core object
  2nd condition: d(p,o) ≤ ε

Simply multiplying the probabilities of the two conditions, $P^{core}_d(o) \times P_d(p,o)(\varepsilon)$, is incorrect. Why?

The two events depend on each other: these two conditions are NOT independent!


Theoretical Foundations II: Reachability Probability

Example (µ = 3)

Multiplying the core object probability of o by the probability that d(p,o) ≤ ε is incorrect. Why?

In this case, the probability that o is a core object depends on the probability that d(p,o) ≤ ε, i.e. the 1st and 2nd conditions are NOT independent.

[Figure: objects o, p and q with their probability density functions (pdfs) and the ε-range of o.]


Theoretical Foundations II: Reachability Probability

Two independent conditions, whose probabilities are multiplied:

1st condition: we consider the core object probability of o in D\p, and relax the density constraint µ by 1.

2nd condition: we consider the probability that d(p,o) ≤ ε.

Their product corresponds to the probability that at least µ objects o’ from D have d(o’,o) ≤ ε, and that object p is one of them. This corresponds to the definition of directly density-reachable in DBSCAN.

[Figure: objects o, p and q.]


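Theoretical Foundations II: Reachability Probability

The probability that at least µ-1 objects from D\p are located within the ε-range of o is the core object probability of o computed on D\p with the density constraint relaxed to µ-1, written here as $P^{core}_{d,\,D \setminus \{p\},\,\mu-1}(o)$.

The probability that the distance between p and o is smaller than or equal to ε is $P_d(p,o)(\varepsilon)$.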


Theoretical Foundations II: Reachability Probability

$P^{reach}_d(p,o) = P^{core}_{d,\,D \setminus \{p\},\,\mu-1}(o) \cdot P_d(p,o)(\varepsilon)$

First factor: the probability that at least µ-1 objects from D\p have a distance to o smaller than or equal to ε.

Second factor: the probability that the distance between p and o is smaller than or equal to ε.

The two conditions are independent. Their product corresponds to the probability that at least µ objects from D are located in the ε-range of o, and that p is one of them.
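The difference between the naive product and the corrected product can be checked numerically. The sketch assumes independent per-object probabilities P_d(·,o)(ε), includes o itself in D with probability 1, and uses a hypothetical helper `prob_at_least` (a dynamic program standing in for the subset sum).

```python
def prob_at_least(probs, k):
    """Probability that at least k of the independent events with the given probabilities occur."""
    dist = [1.0]  # dist[c] = probability that exactly c events occurred so far
    for q in probs:
        new = [0.0] * (len(dist) + 1)
        for c, pc in enumerate(dist):
            new[c] += pc * (1.0 - q)
            new[c + 1] += pc * q
        dist = new
    return sum(dist[max(k, 0):])

def naive_reachability(within, idx_p, mu):
    """Incorrect: core object probability on all of D times P(d(p,o) <= eps)."""
    return prob_at_least(within, mu) * within[idx_p]

def reachability_probability(within, idx_p, mu):
    """Core object probability on D without p, with mu relaxed by 1, times P(d(p,o) <= eps)."""
    others = [q for i, q in enumerate(within) if i != idx_p]
    return prob_at_least(others, mu - 1) * within[idx_p]

# Hypothetical example with mu = 3 and D = {o, p, q}:
# o is always within its own eps-range, q surely is, p only with probability 0.5.
within = [1.0, 0.5, 1.0]   # order: o, p, q
print(naive_reachability(within, idx_p=1, mu=3))        # 0.25
print(reachability_probability(within, idx_p=1, mu=3))  # 0.5
```

In this example all three objects must be within the ε-range for the density constraint to hold, which happens exactly when p is, so the correct value is 0.5; the naive product counts the event d(p,o) ≤ ε twice.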


How does FDBSCAN work?

The traditional DBSCAN algorithm clusters a data set by always adding to the current cluster the objects that are directly density-reachable from the current query object o.

FDBSCAN works very similarly to the traditional approach.


How does FDBSCAN work?

For each uncertain object o
  Check if it is a core object
  If yes, for each other object p
    Check the reachability of p from o
    If the reachability probability ≥ 0.5, p and o form a cluster

There are O(|DB|²) reachability probability computations.
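The loop above can be sketched as follows. Several things here are assumptions of the sketch rather than statements from the slides: the input is a precomputed matrix `P` with `P[o][p]` standing for P_d(p,o)(ε) (and `P[o][o] = 1`), probabilities are treated as independent, the cluster is expanded through a work queue in the style of DBSCAN, and the helper names are hypothetical.

```python
NOISE = -1

def prob_at_least(probs, k):
    """Probability that at least k of the independent events occur."""
    dist = [1.0]
    for q in probs:
        new = [0.0] * (len(dist) + 1)
        for c, pc in enumerate(dist):
            new[c] += pc * (1.0 - q)
            new[c + 1] += pc * q
        dist = new
    return sum(dist[max(k, 0):])

def fdbscan(P, mu, threshold=0.5):
    """Cluster objects 0..n-1 given pairwise 'within eps' probabilities P."""
    n = len(P)
    labels = [None] * n

    def core_prob(o):
        return prob_at_least(P[o], mu)

    def reach_prob(p, o):
        others = [P[o][j] for j in range(n) if j != p]
        return prob_at_least(others, mu - 1) * P[o][p]

    cluster = 0
    for o in range(n):
        if labels[o] is not None or core_prob(o) < threshold:
            continue
        labels[o] = cluster
        queue = [o]
        while queue:
            cur = queue.pop()
            if core_prob(cur) < threshold:
                continue                      # border object: member, but not expanded
            for p in range(n):
                if labels[p] is None and reach_prob(p, cur) >= threshold:
                    labels[p] = cluster
                    queue.append(p)
        cluster += 1
    return [NOISE if label is None else label for label in labels]

# Hypothetical input: objects 0-2 are likely close to each other, object 3 is far from all.
P = [[1.0, 0.9, 0.9, 0.0],
     [0.9, 1.0, 0.9, 0.0],
     [0.9, 0.9, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
print(fdbscan(P, mu=3))  # [0, 0, 0, -1]
```

Each call to `reach_prob` is one of the O(|DB|²) reachability probability computations mentioned above.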


Computational Aspect I

Computing the reachability probability


Computational Aspect I: Computing the reachability probability

[Figure: diagram linking the reachability probability, the core object probability and the distance density function, with two steps labeled "Integration".]


Computational Aspect I: Computing the reachability probability

Direction 1: Avoid calculating the integration
  Sampling
    Monte Carlo sampling
    Each uncertain object o is represented by a sequence of s sample points, i.e. <o1,o2,…,os>
    Compute the probabilities based on the sample sequences.

How can it be done? (If time allows)
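As an illustration of replacing an integral by sampling, the sketch below estimates the distance distribution function Pd(o,p)(ε) from sample sequences. It assumes uniform rectangular uncertainty regions (as in the chimpanzee example), Euclidean distance, and a comparison of all s×s sample pairs; the function names and the pairing scheme are assumptions of this sketch, not necessarily what the paper does.

```python
import math
import random

def sample_uniform_region(region, s, rng):
    """Draw s sample points uniformly from an axis-parallel rectangle ((x_lo, y_lo), (x_hi, y_hi))."""
    (x_lo, y_lo), (x_hi, y_hi) = region
    return [(rng.uniform(x_lo, x_hi), rng.uniform(y_lo, y_hi)) for _ in range(s)]

def estimate_distance_distribution(samples_o, samples_p, eps):
    """Fraction of sample pairs (o_i, p_j) with distance <= eps."""
    hits = sum(1 for a in samples_o for b in samples_p if math.dist(a, b) <= eps)
    return hits / (len(samples_o) * len(samples_p))

rng = random.Random(0)   # fixed seed so that runs are repeatable
s = 50
o = sample_uniform_region(((0.0, 0.0), (1.0, 1.0)), s, rng)
near = sample_uniform_region(((0.0, 0.0), (1.0, 1.0)), s, rng)
far = sample_uniform_region(((10.0, 10.0), (11.0, 11.0)), s, rng)

print(estimate_distance_distribution(o, near, eps=2.0))  # 1.0: the regions are always within eps
print(estimate_distance_distribution(o, far, eps=2.0))   # 0.0: the regions are never within eps
print(estimate_distance_distribution(o, near, eps=0.5))  # strictly between 0 and 1 for these samples
```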


Computational Aspect I: Computing the reachability probability

Direction 2: Reduce the number of reachability probability computations.
  Some objects may be located very far away from o and obviously have no chance of being directly density-reachable from o.
  Use MBRs (minimum bounding rectangles) to bound the object samples
    For every object o, compute MBR(o) bounding the sample points <o1,o2,…,os>
    If MBR(p) is outside the ε-range of o, p must NOT be directly density-reachable from o.
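The pruning test can be sketched with minimum and maximum distances between two MBRs. The tuple layout `(x_lo, y_lo, x_hi, y_hi)` and the function names are assumptions; the sketch compares MBR(p) against MBR(o), which is a conservative way to decide that no sample of p can be within ε of any sample of o.

```python
import math

def mbr(samples):
    """Minimum bounding rectangle (x_lo, y_lo, x_hi, y_hi) of 2-D sample points."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    return (min(xs), min(ys), max(xs), max(ys))

def min_dist(m1, m2):
    """Smallest possible distance between a point in m1 and a point in m2."""
    dx = max(m1[0] - m2[2], m2[0] - m1[2], 0.0)
    dy = max(m1[1] - m2[3], m2[1] - m1[3], 0.0)
    return math.hypot(dx, dy)

def max_dist(m1, m2):
    """Largest possible distance between a point in m1 and a point in m2."""
    dx = max(m1[2] - m2[0], m2[2] - m1[0])
    dy = max(m1[3] - m2[1], m2[3] - m1[1])
    return math.hypot(dx, dy)

def can_prune(mbr_o, mbr_p, eps):
    """True if no sample of p can lie within eps of any sample of o."""
    return min_dist(mbr_o, mbr_p) > eps

def surely_within(mbr_o, mbr_p, eps):
    """True if every sample of p lies within eps of every sample of o."""
    return max_dist(mbr_o, mbr_p) <= eps

# Hypothetical sample sequences.
mbr_o = mbr([(0.0, 0.0), (1.0, 1.0)])
mbr_p = mbr([(4.0, 0.0), (5.0, 1.0)])
print(min_dist(mbr_o, mbr_p))             # 3.0
print(can_prune(mbr_o, mbr_p, eps=2.0))   # True: p needs no reachability computation
print(surely_within(mbr_o, mbr_o, eps=2.0))  # True: the diagonal of the unit square is below 2
```

Only pairs that are neither pruned nor surely within range need their sample sequences inspected.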


Computational Aspect II (If time allows)

Computing Core Object Probability

Interesting, but complicated.


Computational Aspect II: Computing Core Object Probability

Two issues
  1st issue: there are many core object probability computations.
  2nd issue: in each core object probability computation, we have to consider exponentially many (in |DB|) subsets A of DB.


1st Issue: Many Core Object Probability Computations

For each uncertain object o
  Check the probability that o is a core object (core object probability ≥ 0.5)
  For each other object p
    Check the reachability of p from o
    If the reachability probability ≥ 0.5, p and o form a cluster

The 1st condition of the reachability probability is itself a core object probability, and it has to be computed for all p in D.


2nd Issue: Exponentially many subsets to consider for each core-object value

Furthermore, the computation of core-object values has to consider exponentially many (in |DB|) subsets A of DB: all subsets A of D with cardinality greater than or equal to µ.


2nd Issue: Exponentially many subsets to consider for each core-object value

Sampling
  Monte Carlo sampling
  Each uncertain object o is represented by a sequence of s sample points, i.e. <o1,o2,…,os>
  Compute the core-object value based on the sample sequences.

How can it be done?


Compute based on the sample sequences

s is the sample rate: each object o is represented by <o1,o2,…,os>.
Determine the core-object probability based on s² meaningful samples.
oj is called the j-th instance of o.
Dj is the collection of the j-th instances of all objects in D.

E.g. s = 5
  a1, a2, a3, a4, a5
  b1, b2, b3, b4, b5
  c1, c2, c3, c4, c5
  d1, d2, d3, d4, d5

D1 = {a1,b1,c1,d1,e1}
D2 = {a2,b2,c2,d2,e2}
…


Compute based on the sample sequences

If we want to compute the core object probability of o, create an s×s sample matrix M(o).
M(o) keeps track of the information needed for deducing the core object probability of o.
With some modification, it can also be used to deduce the reachability probability.
Each cell mi,j of M(o) indicates the number of ε-neighbors of oi in Dj.


Create sample matrix M(o) (skip)

Each cell mi,j contains the number of ε-neighbors of object sample oi in database instance Dj.

Dj consists of the j-th samples of all other objects (excluding oj).
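A minimal sketch of building M(o) from sample sequences, without any MBR pruning. Each cell starts at 1 because the sample oi counts itself, as in the worked example. The final step, turning the matrix into a core object probability estimate as the fraction of cells that reach µ, is an assumption of this sketch; the slides only say that M(o) holds the information needed for that deduction.

```python
import math

def sample_matrix(o_samples, other_samples, eps):
    """m[i][j] = number of eps-neighbors of sample o_i in database instance D_j.

    other_samples is a list of sample sequences, one per other object;
    D_j is formed by the j-th sample of each of them.
    """
    s = len(o_samples)
    m = [[1] * s for _ in range(s)]          # o_i itself is counted
    for i in range(s):
        for j in range(s):
            for seq in other_samples:
                if math.dist(o_samples[i], seq[j]) <= eps:
                    m[i][j] += 1
    return m

def core_probability_estimate(m, mu):
    """Assumed estimator: fraction of the s*s cells whose neighbor count is at least mu."""
    s = len(m)
    return sum(1 for row in m for v in row if v >= mu) / (s * s)

# Hypothetical example with sample rate s = 2 and two other objects a and b.
o_samples = [(0.0, 0.0), (10.0, 10.0)]
a = [(0.5, 0.0), (0.5, 0.0)]
b = [(0.0, 0.5), (20.0, 20.0)]

m = sample_matrix(o_samples, [a, b], eps=1.0)
print(m)                                   # [[3, 2], [1, 1]]
print(core_probability_estimate(m, mu=2))  # 0.5
```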


Create sample matrix M(o) (Example: sample rate = 3, µ = 5)

o is the query object.

All object samples are bounded by MBRs.

[Figure: query object o with its samples o1, o2 and o3, and neighboring objects a, b, c and d, each bounded by an MBR.]


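Create sample matrix M(o) (Example: sample rate = 3, µ = 5)

Build M(o)

[Figure: the same objects o, a, b, c and d, next to an empty 3×3 sample matrix M(o); rows are the instances of o (1, 2, 3), columns are the database instances (1, 2, 3).]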


Create sample matrix M(o) (Example: sample rate = 3, µ = 5)

[Figure: query object o with samples o1, o2, o3 and the objects a, b, c, d; the samples a1, a2, a3 and b1, b2, b3 are shown]

Build M(o): the rows are the instances of o (1, 2, 3) and the columns are the database instances (1, 2, 3).

We are going to fill m1,1: how many ε-neighbors of o1 are there in D1?

Since o1 itself is also counted, m1,1 is initialized to 1.

By the min-max distance test on the MBRs, we are sure that three objects contain ε-neighbors of o1 in D1, so m1,1 = 4.

MBR(b) and MBR(a) cannot be pruned, so their sample sequences are retrieved.

b1 and a1 are ε-neighbors of o1, so m1,1 = 6.

Although b2 is an ε-neighbor of o1, it is not counted because it is NOT in database instance 1.

6 is the final value. This indicates that there are 6 ε-neighbors of object sample o1 in database instance D1.
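The three outcomes of the MBR test (certainly a neighbor, certainly not, or "retrieve the sample sequence") can be sketched as follows. This is a minimal sketch assuming axis-aligned MBRs given by their lower and upper corners and a Euclidean distance; the slides do not spell out the exact test, and the function names are hypothetical.

```python
import math

def min_dist(point, lo, hi):
    """Smallest possible distance from point to any location in the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(point, lo, hi)))

def max_dist(point, lo, hi):
    """Largest possible distance from point to any location in the box [lo, hi]."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(point, lo, hi)))

def mbr_test(point, lo, hi, eps):
    """Classify an object's MBR relative to the eps-range of a sample point."""
    if min_dist(point, lo, hi) > eps:
        return "prune"    # no sample of the object can be an eps-neighbour
    if max_dist(point, lo, hi) <= eps:
        return "accept"   # every sample of the object is an eps-neighbour
    return "refine"       # undecided: the sample sequence must be retrieved

o1 = (0.0, 0.0)
print(mbr_test(o1, (3.0, 4.0), (4.0, 5.0), eps=1.0))
print(mbr_test(o1, (0.1, 0.1), (0.2, 0.2), eps=1.0))
print(mbr_test(o1, (0.5, 0.0), (2.0, 0.0), eps=1.0))
```

In the slide's example, the three objects that are known to contain ε-neighbors correspond to the "accept" case, while MBR(a) and MBR(b) fall into the "refine" case.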


Create sample matrix M(o) (Example: sample rate = 3, µ = 5)

[Figure, repeated on each of the following slides: query object o with samples o1, o2, o3 and the objects a, b, c, d; the samples a1, a2, a3 and b1, b2, b3 are shown]

Build M(o): the remaining cells are filled in the same way, one cell per slide.

m1,1 = 6, m1,2 = 5, m1,3 = 5

m2,1 = 6, m2,2 = 4, m2,3 = 5

m3,1 = 4, m3,2 = 4


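Create sample matrix M(o) (Example: sample rate = 3, µ = 5)

[Figure: query object o with samples o1, o2, o3 and the objects a, b, c, d; the samples a1, a2, a3 and b1, b2, b3 are shown]

Build M(o) (rows: instances of o; columns: database instances):

        D1  D2  D3
    o1   6   5   5
    o2   6   4   5
    o3   4   4   5

Now we have the sample matrix M(o).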


Computation based on the sample matrix M(o) (µ = 5)

For each uncertain object o
    Check the probability that o is a core object
    Core object probability ≥ 0.5

For each other object p
    Check the reachability of p from o
    If the reachability probability ≥ 0.5, p and o form a cluster


Computation based on the sample matrix M(o) (µ = 5)

Core object probability

1st Step: Count the number of elements in the sample matrix M(o) whose values are greater than or equal to µ.

2nd Step: Normalizing the count by s² yields the core object probability.

Sample matrix M(o) (rows: instances of o; columns: database instances):

        D1  D2  D3
    o1   6   5   5
    o2   6   4   5
    o3   4   4   5

1st Step: Count = 6

2nd Step: Core object probability of o = 6 / 9

Since the core object probability is > 0.5, o is treated as a core object.
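The two steps can be checked with a short Python sketch that uses the sample matrix from the example. The function name `core_object_probability` is a hypothetical helper, not something defined in the slides.

```python
def core_object_probability(matrix, mu):
    """Fraction of cells of M(o) whose neighbour count reaches mu."""
    s = len(matrix)
    hits = sum(1 for row in matrix for cell in row if cell >= mu)  # 1st step
    return hits / (s * s)                                          # 2nd step

# Sample matrix M(o) from the worked example (sample rate 3, mu = 5).
M_o = [
    [6, 5, 5],
    [6, 4, 5],
    [4, 4, 5],
]
p_core = core_object_probability(M_o, mu=5)
print(p_core, p_core >= 0.5)
```

Six of the nine cells reach µ = 5, so the sketch reproduces the 6/9 from the slide.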


Computation based on the sample matrix M(o) (µ = 5)

For each uncertain object o
    Check the probability that o is a core object
    Core object probability ≥ 0.5

For each other object p
    Check the reachability of p from o
    If the reachability probability ≥ 0.5, p and o form a cluster

The reachability probability has two parts:

The first part can be derived from M(o).

The second part can use the MBRs of the object samples for pruning.


Compute the first part

1st step: Decrease by 1 every value mi,j for which d(oi, pj) ≤ ε holds.

2nd step: Count the number of elements in the sample matrix M(o) whose values are greater than or equal to µ-1.

3rd step: Normalizing the count by s² yields the probability.


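Computing the first part

Conceptually, M(o) contains the ε-neighbor information in D; we want it to contain the information in D\a.

[Figure: query object o with samples o1, o2, o3 and the objects a, b, c, d; the samples a1, a2, a3 and b1, b2, b3 are shown]

Sample matrix M(o) before the update (rows: instances of o; columns: database instances):

        D1  D2  D3
    o1   6   5   5
    o2   6   4   5
    o3   4   4   5

1st Step: Decrease by 1 every value mi,j for which d(oi, pj) ≤ ε holds (here p = a).

Decrease m1,1 and m1,3 by 1: row 1 becomes 5 5 4.

Decrease m2,1 and m2,3 by 1: row 2 becomes 5 4 4.

Decrease m3,3 by 1: row 3 becomes 4 4 4.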


Computing the first part

[Figure: query object o with samples o1, o2, o3 and the objects a, b, c, d; the samples a1, a2, a3 and b1, b2, b3 are shown]

Sample matrix M(o) after the update (rows: instances of o; columns: database instances):

        D1  D2  D3
    o1   5   5   4
    o2   5   4   4
    o3   4   4   4

2nd Step: Count the number of elements in the sample matrix M(o) whose values are greater than or equal to µ-1.

3rd Step: Since all the cells are greater than or equal to 5 - 1 = 4, the first-part probability is equal to 9/9 = 1.
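The three steps for the first part can be sketched in Python using the numbers of the example. The helper name and the `close_pairs` argument (the 1-based index pairs (i, j) with d(oi, aj) ≤ ε, read off the slides) are illustrative assumptions.

```python
def reachability_first_part(matrix, close_pairs, mu):
    """Probability that o is still a core object once p is left out of the counts."""
    s = len(matrix)
    reduced = [row[:] for row in matrix]      # keep the original M(o) intact
    for i, j in close_pairs:                  # 1st step: remove p's contribution
        reduced[i - 1][j - 1] -= 1
    hits = sum(1 for row in reduced for cell in row if cell >= mu - 1)  # 2nd step
    return hits / (s * s)                     # 3rd step

M_o = [
    [6, 5, 5],
    [6, 4, 5],
    [4, 4, 5],
]
# Cells decreased in the example: m1,1, m1,3, m2,1, m2,3 and m3,3.
close_pairs_a = [(1, 1), (1, 3), (2, 1), (2, 3), (3, 3)]
first_part = reachability_first_part(M_o, close_pairs_a, mu=5)
print(first_part)
```

Copying the matrix before decrementing lets the same M(o) be reused for every other object p.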


Compute the second part

Count the number of events d(oi, pj) ≤ ε and normalize the count by s×s.

The MBRs of the object samples can be used for pruning.


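Computing the second part

[Figure: query object o with samples o1, o2, o3 and the objects a, b, c, d; the samples a1, a2, a3 and b1, b2, b3 are shown]

1st Step: Count the number of events d(oi, pj) ≤ ε.

Count = 2 + 2 + 1 = 5

2nd Step: Normalize the count by s², which gives 5/9 for the second part of the reachability probability of a from o.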


Reachability of a from o

Reachability probability = first part × second part = 1 × 5/9 = 5/9

Since 5/9 ≥ 0.5, p (here a) is directly density reachable from o.

p and o form a cluster.
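The second part and the final product can be sketched the same way. Exact fractions are used so the result prints as 5/9; the function names and the 1-based `close_pairs_a` list (taken from the cells decreased in the example) are illustrative assumptions.

```python
from fractions import Fraction

def reachability_second_part(close_pairs, s):
    """Fraction of the s*s sample pairs (o_i, p_j) with d(o_i, p_j) <= eps."""
    return Fraction(len(set(close_pairs)), s * s)

def reachability_probability(first_part, close_pairs, s):
    """Product of the first part and the second part."""
    return first_part * reachability_second_part(close_pairs, s)

# Pairs (i, j) with d(o_i, a_j) <= eps in the example: 2 + 2 + 1 = 5 events.
close_pairs_a = [(1, 1), (1, 3), (2, 1), (2, 3), (3, 3)]
p_reach = reachability_probability(Fraction(1), close_pairs_a, s=3)
print(p_reach, p_reach >= Fraction(1, 2))
```

With the first part equal to 1, the product is just the second part, which is why both slides show 5/9.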


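Reachability of other objects from o

[Figure: query object o with samples o1, o2, o3 and the objects a, b, c, d; the samples a1, a2, a3 and b1, b2, b3 are shown]

Sample matrix M(o) (rows: instances of o; columns: database instances):

        D1  D2  D3
    o1   6   5   5
    o2   6   4   5
    o3   4   4   5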


Experimental Evaluation


Experimental Evaluation: Datasets

Artificial data set (ART)
    1000 2-dimensional objects which are normally distributed in [0,1]
    Each object is randomly surrounded by a box having a side length of p < 1 in each dimension (data fuzziness)
    A uniform probability distribution is assumed within the box

Engineering data set (PLANE)
    5000 42-dimensional objects
    Normalized
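A data set in the spirit of ART could be generated roughly as below. This is a sketch under several assumptions the slides do not state: the normal distribution is centred at 0.5 with standard deviation 0.15 and clamped to [0,1], the box is placed at a random offset that still contains the point, and the function name `make_art_dataset` and the dictionary layout are hypothetical.

```python
import random

def make_art_dataset(n=1000, p=0.01, s=5, seed=0):
    """Generate n uncertain 2-D objects, each with a box of side p and s samples."""
    rng = random.Random(seed)
    objects = []
    for _ in range(n):
        # Assumed parameters: N(0.5, 0.15) per dimension, clamped to [0, 1].
        point = [min(1.0, max(0.0, rng.gauss(0.5, 0.15))) for _ in range(2)]
        # Box of side length p, shifted randomly but still containing the point.
        lo = [x - rng.uniform(0.0, p) for x in point]
        hi = [l + p for l in lo]
        # Uniform distribution inside the box: draw s samples per object.
        samples = [tuple(rng.uniform(l, h) for l, h in zip(lo, hi))
                   for _ in range(s)]
        objects.append({"lo": lo, "hi": hi, "samples": samples})
    return objects

art = make_art_dataset(n=1000, p=0.01, s=5)
print(len(art), len(art[0]["samples"]))
```

The default s = 5 mirrors the sample rate reported for the experiments, and p = 0.01 the low-fuzziness setting of Experiment 1.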


Experimental Evaluation

Implementation
    FDBSCAN
    EXPDBSCAN
        Represents the distance between two uncertain objects by a single distance expectation value Ed(o,o’).
        Uses the traditional DBSCAN algorithm to mine the data.
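To make the idea of a single distance expectation value concrete, the sketch below estimates Ed(o,o’) by averaging the Euclidean distances over all pairs of samples. The slides do not say how EXPDBSCAN actually computes the expectation, so both the estimator and the name `expected_distance` are assumptions.

```python
import math

def expected_distance(samples_a, samples_b):
    """Estimate the expected distance as the mean over all sample pairs."""
    total = sum(math.dist(a, b) for a in samples_a for b in samples_b)
    return total / (len(samples_a) * len(samples_b))

# Two objects with identical sample sets (invented values).
o = [(0.0, 0.0), (2.0, 0.0)]
o_prime = [(0.0, 0.0), (2.0, 0.0)]
print(expected_distance(o, o_prime))
```

The pairwise distances here are 0, 2, 2 and 0, so the whole distribution collapses to one number; this is the kind of aggregation the conclusion slide calls lossy.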


Experimental Evaluation

Implementation
    Java 1.4
    Windows platform
    730 MHz processor
    512 MB main memory
    Sample rate s = 5


Experiment 1: Efficiency of FDBSCAN

Measure the runtimes of FDBSCAN and EXPDBSCAN on the ART dataset
    p = 0.01 (little fuzziness in the dataset)

[Figure: runtime (s) of FDBSCAN and EXPDBSCAN]

Does EXPDBSCAN apply the same MBR pruning strategies as FDBSCAN?


Experiment 2: Effectiveness of FDBSCAN

Measures the relation between the quality of the clustering results and the data fuzziness for FDBSCAN and EXPDBSCAN.

How to measure the quality of clusters?
    Treat it as a black box for the time being…
    A good clustering has a quality value close to 1, and vice versa.


Experiment 2: Effectiveness of FDBSCAN

FDBSCAN returns clusters of better quality than EXPDBSCAN for all levels of data fuzziness and all numbers of dimensions, i.e. it is more effective.

On ART, EXPDBSCAN performs quite well, but for high-dimensional data its quality is much worse than that of FDBSCAN.

The quality of both EXPDBSCAN and FDBSCAN falls at high data fuzziness; however, FDBSCAN degrades less than EXPDBSCAN.


Experiment 3: Accuracy of the core object classification

How accurately do FDBSCAN and EXPDBSCAN classify core objects?

Precision and recall of core objects
    Precision shows how precise the reported set of core objects is.
        # reported real core objects / # core objects reported
    Recall shows the percentage of real core objects reported.
        # reported real core objects / total # of real core objects in D
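The two ratios translate directly into set operations. The sketch below uses invented object names and a hypothetical helper `precision_recall`; returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(reported, real):
    """Precision and recall of a reported core object set against the real one."""
    reported, real = set(reported), set(real)
    hits = len(reported & real)            # reported objects that are real core objects
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(real) if real else 0.0
    return precision, recall

reported_core = {"a", "b", "c"}            # what the algorithm returned
real_core = {"b", "c", "d", "e"}           # ground truth in D
precision, recall = precision_recall(reported_core, real_core)
print(precision, recall)
```

A method that reports only a few, very safe core objects scores high precision and low recall, which is the pattern the next slide describes for EXPDBSCAN.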


Experiment 3: Accuracy of the core object classification

FDBSCAN has higher core object precision and recall on the 2D ART dataset.

EXPDBSCAN finds very few of the real core objects; however, nearly all of the core objects it returns are real core objects.

The precision and recall are not 100% because FDBSCAN uses a sampling approach to calculate the core object probability.

The precision and recall of FDBSCAN increase in high dimensions. Why?

EXPDBSCAN has a lower recall than FDBSCAN. Why?


Why does EXPDBSCAN suffer from a low recall rate? (Example: µ = 5)

[Figure: objects numbered 1 to 10 and two core point candidates A and B; probability density function: Gaussian distribution]


Why does EXPDBSCAN suffer from a low recall rate? (Example: µ = 5)

[Figure: ε-ranges around A and B among the objects numbered 1 to 10]

A: number of ε-neighbors = 5, so A is a core object.

B: number of ε-neighbors = 4, so B is NOT a core object.


Conclusion

Demonstrated how density-based clustering can be carried out on uncertain information.

Presented the theoretical foundations for density-based clustering of uncertain data.

FDBSCAN works on the fuzzy distance function directly instead of working on lossy aggregated information.


My comments

We also want to know…

The relationship between the sample rate and the execution time: a higher sample rate should give a more accurate result, but generally at the cost of execution time. What is the relationship between these two parameters?
    Sample rate vs cluster quality
    Sample rate vs data dimensionality, which would be a reference for choosing the sample rate based on the data characteristics
    Sample rate vs fuzziness of data
        Since we represent each uncertain object by MBRs, MBR(o) bounds the samples of o.
        This means that MBR(o) may not bound the whole uncertainty region of o.
        At high data fuzziness, MBR(o) may not precisely indicate the uncertainty region of the real object o.


Something confusing…

We also want to know…

The reason for using a probability of 0.5 to determine core objects is questionable. Why not treat it as a parameter?

A higher value should lead to more false-negative core objects; a lower value to more false-positive core objects.
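Treating the threshold as a parameter is straightforward to sketch. In the snippet below only the value 6/9 for o comes from the worked example; the other probabilities, the object names and the function `core_objects` are invented for illustration.

```python
def core_objects(core_probabilities, threshold=0.5):
    """Return the objects whose core object probability reaches the threshold."""
    return {name for name, p in core_probabilities.items() if p >= threshold}

# Hypothetical core object probabilities; only o = 6/9 is from the example.
probabilities = {"o": 6 / 9, "x": 0.45, "y": 0.9}
for threshold in (0.3, 0.5, 0.8):
    print(threshold, sorted(core_objects(probabilities, threshold)))
```

Raising the threshold shrinks the core object set and lowering it grows the set, which is the false-negative versus false-positive trade-off raised above.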


The End

Thank you