A Framework for Clustering Evolving Data Streams, Yueshen Xu, Zhejiang Univ., Middleware, CCNT, ZJU, 01/19/22

Stream data mining & CluStream framework


DESCRIPTION

This deck is my learning report, which I prepared myself and presented in my lab. I hope it is helpful to you, friends.


Page 1: Stream data mining & CluStream framework

A Framework for Clustering Evolving Data Streams

Yueshen Xu

Zhejiang Univ

CCNT, Middleware

Middleware, CCNT, ZJU 04/10/23

Page 2: Stream data mining & CluStream framework

Stream Processing, Event Stream Processing, Complex Event Processing


Data Stream Mining

Event Stream Processing

Complex Event Processing

In-memory Computing

Real-time Computing

Big data

Computing Mode

Real Application

SAP…

Taobao

Yahoo: S4

Baidu

Brown & MIT

We are endeavoring!

Page 3: Stream data mining & CluStream framework

The paper itself

Published in VLDB 2003

Has been cited 635 times

By C. C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu


Affiliations: Watson, UIUC, THU, UIC

A standard, a bible as well as an obligatory reading

Expert, Pundit!

Page 4: Stream data mining & CluStream framework

Data Stream & Streaming Data

What is a data stream? A data set that behaves just like flowing water (I think)

An infinite process consisting of data which continuously evolves with time (C. C. Aggarwal, Jiawei Han et al.)

The formalized description

$X_i = (V_i, t_i)$, where $V_i = (v_i^1, \ldots, v_i^d)$

$V_i$ is a multi-dimensional record, and $t_i$ is the corresponding time stamp.

The data model has a decisive influence on the computing model. How?

Page 5: Stream data mining & CluStream framework

Principles

Very different from the principles for static data sets (my own thought)

One-pass scan: you get only one chance to see each data item

No storage for primitive data: the stream is infinite (another form of big data), and storing it is not necessary

In-memory mining: instantaneous

Preference for newly arriving data: the user's point of view


Approximate results

You must change your old ideas about traditional static data sets

Ordered, Countable, Enumerable, Infinite, no-storage

Data Model

Vital!
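The principles on this slide (one pass, no storage of raw data, in-memory, approximate summary) can be illustrated with a small sketch. It is only an illustration under my own assumptions: `synthetic_stream` and `one_pass_mean` are hypothetical names, and the Gaussian generator merely stands in for a real stream of records $(V_i, t_i)$.

```python
import random
from itertools import count, islice


def synthetic_stream(d=2, seed=0):
    """Hypothetical endless stream of records (V_i, t_i) with d dimensions."""
    rng = random.Random(seed)
    for t in count(1):
        yield tuple(rng.gauss(0.0, 1.0) for _ in range(d)), t


def one_pass_mean(records):
    """Scan each record once, keeping only per-dimension sums and a count."""
    n, sums, last_t = 0, None, None
    for values, t in records:
        if sums is None:
            sums = [0.0] * len(values)
        for j, v in enumerate(values):
            sums[j] += v
        n, last_t = n + 1, t
    mean = tuple(s / n for s in sums) if n else ()
    return mean, n, last_t


# Only a constant-size summary is held in memory, never the raw records.
mean, n, last_t = one_pass_mean(islice(synthetic_stream(), 1000))
```

The summary here is a plain running mean; the micro-clusters introduced later play the same role with richer statistics.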

Page 6: Stream data mining & CluStream framework

The Framework

The methodology: the core value of the paper

Micro- and macro-clustering process: a necessity, and inevitable under this framework

The pyramidal time frame: balancing accuracy against storage capacity


The principle of approximate results

Cluster Feature Vector: additivity

The micro-clusters

Is it sophistic?

Page 7: Stream data mining & CluStream framework

Why were they chosen?

Cluster Feature Vector


Definition

The CFV is defined as a $(2 \cdot d + 3)$-tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$

$\overline{CF2^x}$: the per-dimension sum of the squares of the data values (Sigma & Square)

$\overline{CF1^x}$: the per-dimension sum of the data values (Sigma)

$CF2^t$: the sum of the squares of the time stamps (Sigma & Square)

$CF1^t$: the sum of the time stamps (Sigma)

$n$: the number of data items belonging to the cluster

Why CFV?

User-oriented

Additivity

Not originally proposed by Prof. Han et al. (the idea extends the cluster feature vector of BIRCH)
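To make the additivity property concrete, here is a minimal sketch of a cluster feature vector in plain Python. The class name `CFV`, its field names, and the helper methods are my own illustrative choices, not code from the paper; the point is only that the summary of a union of points equals the component-wise sum of the summaries.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class CFV:
    cf2x: tuple   # per-dimension sum of squared data values
    cf1x: tuple   # per-dimension sum of data values
    cf2t: float   # sum of squared time stamps
    cf1t: float   # sum of time stamps
    n: int        # number of data items

    @classmethod
    def from_point(cls, values, t):
        return cls(tuple(v * v for v in values), tuple(values), t * t, t, 1)

    def __add__(self, other):
        # Additivity: every component is simply summed.
        return CFV(tuple(a + b for a, b in zip(self.cf2x, other.cf2x)),
                   tuple(a + b for a, b in zip(self.cf1x, other.cf1x)),
                   self.cf2t + other.cf2t, self.cf1t + other.cf1t,
                   self.n + other.n)

    def centroid(self):
        return tuple(s / self.n for s in self.cf1x)

    def rms_deviation(self):
        # Root-mean-square deviation from the centroid, derived from the sums alone.
        var = sum(s2 / self.n - (s1 / self.n) ** 2
                  for s2, s1 in zip(self.cf2x, self.cf1x))
        return sqrt(max(var, 0.0))


points = [((1.0, 2.0), 1), ((3.0, 4.0), 2), ((5.0, 6.0), 3)]
a = CFV.from_point(*points[0]) + CFV.from_point(*points[1])
b = CFV.from_point(*points[2])
merged = a + b
```

For $d = 2$ the summary holds $2 \cdot 2 + 3 = 7$ numbers regardless of how many points it absorbs, which is why it fits the no-storage principle.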

Page 8: Stream data mining & CluStream framework

Pyramidal Time Frame(1)

Snapshots are classified into different orders, which can vary from 1 to $\log_\alpha(T)$

Snapshots of the $i$-th order occur at time intervals of $\alpha^i$

Only the last $\alpha^l + 1$ snapshots of order $i$ are stored

An example: $\alpha = 2$, $l = 2$

Worse case~~

Page 9: Stream data mining & CluStream framework

Pyramidal Time Frame(2)

The difference from his book

Divisible by $\alpha^i$, but not by $\alpha^{i+1}$

The number of orders is constant

Best case: no redundancy

Why (my own thought)

The newer snapshots are kept, and the older ones are discarded

The lower orders are not friendly to old snapshots, but the higher orders are

Old snapshots are not only penalized, but also protected
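A small sketch may help to see which snapshots survive. It assumes the rule as I read it: a snapshot taken at clock time $t$ has order $i$ when $t$ is divisible by $\alpha^i$ but not by $\alpha^{i+1}$, and each order keeps only its last $\alpha^l + 1$ snapshots. The function names are hypothetical, and the code stores only clock times rather than real snapshots.

```python
from collections import defaultdict, deque


def snapshot_order(t, alpha):
    """Largest i such that alpha**i divides t (so alpha**(i+1) does not)."""
    order = 0
    while t % alpha == 0:
        t //= alpha
        order += 1
    return order


def stored_snapshots(T, alpha=2, l=2):
    """Clock times whose snapshots are still stored at time T, grouped by order."""
    capacity = alpha ** l + 1
    frame = defaultdict(lambda: deque(maxlen=capacity))
    for t in range(1, T + 1):
        # Appending to a full deque silently drops the oldest snapshot of that order.
        frame[snapshot_order(t, alpha)].append(t)
    return {order: list(reversed(times)) for order, times in sorted(frame.items())}


frame = stored_snapshots(55)
for order, times in frame.items():
    print(order, times)
```

With $\alpha = 2$, $l = 2$ and $T = 55$, only 21 clock times remain: recent times are densely covered by the low orders, while the high orders hold on to a few old ones.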

Page 10: Stream data mining & CluStream framework

Micro-Cluster(1)------Procedure

[Figure: micro-clusters (CFV) and stored snapshots along a time axis; labels t, h, h', T]

Page 11: Stream data mining & CluStream framework

Micro-Cluster(2)------Initialization

What is to be initialized?

Micro-clusters

The number of micro-clusters maintained in each snapshot is constant

Determined by the amount of memory available

Larger than the natural number of clusters, but smaller than the number of data points in the data stream

Each micro-cluster has a unique id


Supported by the experiment

Reasonable?

Page 12: Stream data mining & CluStream framework

Micro-Cluster(3)------Updating

A new data point arrives; what will be done?

Join, Delete & Merge

Join: find the nearest micro-cluster, and absorb the point if it falls within that cluster's boundary

RMS & Distance

Delete: find the oldest micro-cluster

The average time stamp of the last m data points

Take the time stamps contained in the CFV as the approximation

Merge: find the closest two clusters

They don't explain how

idlist
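The join / delete / merge decision can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: `MicroCluster`, `update`, `boundary_factor` and `stale_before` are hypothetical names, the mean time stamp is used as a crude stand-in for "the average time stamp of the last m points", and the boundary of a single-point cluster is assumed to be the distance to its closest neighbour.

```python
from itertools import combinations
from math import sqrt


def euclid(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


class MicroCluster:
    def __init__(self, point, t, cid):
        self.idlist = [cid]
        self.n = 1
        self.cf1x = list(point)
        self.cf2x = [v * v for v in point]
        self.cf1t, self.cf2t = t, t * t

    def centroid(self):
        return [s / self.n for s in self.cf1x]

    def rms(self):
        var = sum(s2 / self.n - (s1 / self.n) ** 2
                  for s1, s2 in zip(self.cf1x, self.cf2x))
        return sqrt(max(var, 0.0))

    def mean_time(self):
        return self.cf1t / self.n

    def absorb(self, point, t):
        self.n += 1
        self.cf1x = [s + v for s, v in zip(self.cf1x, point)]
        self.cf2x = [s + v * v for s, v in zip(self.cf2x, point)]
        self.cf1t += t
        self.cf2t += t * t

    def merge(self, other):
        self.idlist += other.idlist  # remember which ids were merged
        self.n += other.n
        self.cf1x = [a + b for a, b in zip(self.cf1x, other.cf1x)]
        self.cf2x = [a + b for a, b in zip(self.cf2x, other.cf2x)]
        self.cf1t += other.cf1t
        self.cf2t += other.cf2t


def update(clusters, point, t, new_id, boundary_factor=2.0, stale_before=0.0):
    """Process one point; needs at least two clusters. Returns the action taken."""
    nearest = min(clusters, key=lambda c: euclid(c.centroid(), point))
    if nearest.n > 1:
        boundary = boundary_factor * nearest.rms()
    else:  # assumption: a lone point borrows the distance to its closest neighbour
        boundary = min(euclid(nearest.centroid(), c.centroid())
                       for c in clusters if c is not nearest)
    if euclid(nearest.centroid(), point) <= boundary:
        nearest.absorb(point, t)
        return "join"
    # The point starts a new micro-cluster, so one slot must be freed first.
    oldest = min(clusters, key=lambda c: c.mean_time())
    if oldest.mean_time() < stale_before:
        clusters.remove(oldest)
        action = "delete"
    else:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: euclid(pair[0].centroid(), pair[1].centroid()))
        a.merge(b)
        clusters.remove(b)
        action = "merge"
    clusters.append(MicroCluster(point, t, new_id))
    return action


clusters = [MicroCluster((0.0, 0.0), 1, 0), MicroCluster((10.0, 10.0), 2, 1)]
actions = [
    update(clusters, (0.5, 0.0), 3, new_id=2),
    update(clusters, (100.0, 100.0), 4, new_id=3),
    update(clusters, (-50.0, -50.0), 10, new_id=4, stale_before=5.0),
]
```

Because the number of micro-clusters stays constant, every point that cannot join an existing cluster forces either a delete or a merge.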


Page 13: Stream data mining & CluStream framework

Macro-Cluster(1)------Find the approximate time stamp

What does the analyst do?

Find clusters over a past time horizon of h

All about the additivity property

I don't understand how they handle fault tolerance

Only two snapshots are necessary

What is to be clustered?

CFV


$S(t_c) - S(t_c - h')$

Not user-friendly
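The subtraction $S(t_c) - S(t_c - h')$ can be sketched as follows. The representation is my own assumption: a snapshot is a dictionary from an idlist (a frozenset of ids) to a `CFV` tuple, and an old micro-cluster is subtracted from whichever current micro-cluster contains its ids. The helper names are hypothetical.

```python
from typing import Dict, FrozenSet, NamedTuple, Tuple


class CFV(NamedTuple):
    cf2x: Tuple[float, ...]
    cf1x: Tuple[float, ...]
    cf2t: float
    cf1t: float
    n: int


Snapshot = Dict[FrozenSet[int], CFV]  # idlist -> statistics


def latest_snapshot_at_or_before(stored_times, target):
    """Pick the stored time t_c - h' that approximates the wanted time t_c - h."""
    candidates = [t for t in stored_times if t <= target]
    return max(candidates) if candidates else None


def subtract(a: CFV, b: CFV) -> CFV:
    return CFV(tuple(x - y for x, y in zip(a.cf2x, b.cf2x)),
               tuple(x - y for x, y in zip(a.cf1x, b.cf1x)),
               a.cf2t - b.cf2t, a.cf1t - b.cf1t, a.n - b.n)


def horizon_clusters(current: Snapshot, past: Snapshot) -> Snapshot:
    """Micro-clusters describing only the points that arrived after the past snapshot."""
    result = {}
    for ids, cfv in current.items():
        for old_ids, old_cfv in past.items():
            if old_ids <= ids:  # the old cluster lives on inside this one
                cfv = subtract(cfv, old_cfv)
        if cfv.n > 0:           # drop clusters that received no new points
            result[ids] = cfv
    return result


# One-dimensional toy data: ids 1 and 2 were merged and absorbed x=5 at t=9.
past = {
    frozenset({1}): CFV((1.0,), (1.0,), 1, 1, 1),
    frozenset({2}): CFV((9.0,), (3.0,), 4, 2, 1),
    frozenset({3}): CFV((400.0,), (20.0,), 9, 3, 1),
}
current = {
    frozenset({1, 2}): CFV((35.0,), (9.0,), 86, 12, 3),
    frozenset({3}): CFV((400.0,), (20.0,), 9, 3, 1),
}
t_past = latest_snapshot_at_or_before([4, 8, 9, 10], target=10 - 5)
recent = horizon_clusters(current, past)
```

Only the two snapshots are needed; the resulting CFVs are what the macro-clustering step then clusters.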

Page 14: Stream data mining & CluStream framework

Macro-Cluster(2)------modified k-means

What has been modified in k-means?

The micro-clusters are treated as pseudo-points

The seeds are no longer picked randomly

The more points, the more important

Experiments are sufficient
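The modification can be sketched as a weighted k-means over pseudo-points. This is my own minimal version under stated assumptions: each pseudo-point is a pair (centroid, number of points), seeds are drawn with probability proportional to the number of points, and each new center is the weighted mean of its members. `weighted_kmeans` is a hypothetical name.

```python
import random


def weighted_kmeans(pseudo_points, k, iters=20, seed=0):
    """pseudo_points: list of (centroid tuple, number of points n)."""
    rng = random.Random(seed)
    pool = list(pseudo_points)
    centers = []
    for _ in range(k):
        # Seeds: heavier micro-clusters are more likely to be picked.
        pick = rng.choices(range(len(pool)), weights=[n for _, n in pool])[0]
        centers.append(pool.pop(pick)[0])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for c, n in pseudo_points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(c, centers[j])))
            groups[j].append((c, n))
        new_centers = []
        for old, members in zip(centers, groups):
            if not members:          # keep an empty cluster's center unchanged
                new_centers.append(old)
                continue
            total = sum(n for _, n in members)
            new_centers.append(tuple(sum(c[d] * n for c, n in members) / total
                                     for d in range(len(old))))
        if new_centers == centers:
            break
        centers = new_centers
    return centers


pseudo_points = [((0.0,), 10), ((1.0,), 30), ((100.0,), 20), ((102.0,), 20)]
centers = sorted(weighted_kmeans(pseudo_points, k=2))
```

The left center is pulled towards the heavier pseudo-point at 1.0, which is the "more points, more important" idea in action.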


Page 15: Stream data mining & CluStream framework

Q&A
