Stream data mining & CluStream framework

A Framework for Clustering Evolving Data Streams

Yueshen Xu

Zhejiang Univ

CCNT, Middleware

Middleware, CCNT, ZJU, 04/10/23

Stream Processing, Event Stream Processing, Complex Event Processing


Data Stream Mining

Event Stream Processing

Complex Event Processing

In-memory Computing

Real-time Computing

Big data

Computing Mode

Real Application

SAP…

Taobao

Yahoo: S4

Baidu

Brown & MIT

We are endeavoring!

The paper itself

Published in VLDB 2003

Cited 635 times

By C.C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu


Watson, UIUC, THU, UIC

A standard, a bible as well as an obligatory reading

Expert, Pundit!

Data Stream & Streaming Data

What is a data stream? A data set that behaves just like flowing water (I think)

An infinite process consisting of data which continuously evolves with time (C.C. Aggarwal, Jiawei Han et al.)

The formalized description

The stream is a sequence of records $X_i = (V_i, t_i)$ with $V_i = (v_i^1, \ldots, v_i^d)$.

$V_i$ is a multi-dimensional record, and $t_i$ is the corresponding time stamp.

The data model has a decisive influence on the computing model. How?

Principles

Very different from the principles for static data sets (my own thought)

One-pass scan: you have only one chance to see each record

No storage of raw data: the stream is infinite (another form of big data), and storing it is unnecessary

In-memory mining: instantaneous

Preference for newly arriving data: the user's point of view

Approximate results

You must change your old ideas about traditional static data sets

Ordered, Countable, Enumerable, Infinite, no-storage

Data Model

Vital!

The Framework

The methodology

The core value of the paper

Micro- and macro-clustering process

Necessity and inevitability under this framework

The pyramidal time frame

Balancing accuracy against storage capacity

The principle of approximate results

Cluster Feature Vector

Additivity

The micro-clusters

Is it sophistic?

Why are they opted for?

Cluster Feature Vector


Definition

CFV is defined as a $(2d+3)$-tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$

$\overline{CF2^x}$, the sum of the squares of the data values (per dimension): Sigma & Square

$\overline{CF1^x}$, the sum of the data values (per dimension): Sigma

$CF2^t$, the sum of the squares of the time stamps: Sigma & Square

$CF1^t$, the sum of the time stamps: Sigma

$n$, the number of data items belonging to the cluster
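The tuple above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the class name `CFV`, its field names, and the helper methods are assumptions made for this example. It shows why additivity matters: merging two clusters is a component-wise sum, and the centroid and RMS radius fall out of the stored sums without revisiting any raw point.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class CFV:
    """Hypothetical container for the (2d+3)-tuple; names are illustrative."""
    cf2x: List[float]  # per-dimension sum of squared data values
    cf1x: List[float]  # per-dimension sum of data values
    cf2t: float        # sum of squared time stamps
    cf1t: float        # sum of time stamps
    n: int             # number of points in the cluster

    @classmethod
    def from_point(cls, v, t):
        """Build the CFV of a cluster holding the single record (v, t)."""
        return cls([x * x for x in v], list(v), t * t, t, 1)

    def __add__(self, other):
        """Additivity: the CFV of a union is the component-wise sum."""
        return CFV(
            [a + b for a, b in zip(self.cf2x, other.cf2x)],
            [a + b for a, b in zip(self.cf1x, other.cf1x)],
            self.cf2t + other.cf2t,
            self.cf1t + other.cf1t,
            self.n + other.n,
        )

    def centroid(self):
        return [s / self.n for s in self.cf1x]

    def rms_radius(self):
        """RMS deviation from the centroid, computed from the sums alone."""
        var = sum(s2 / self.n - (s1 / self.n) ** 2
                  for s1, s2 in zip(self.cf1x, self.cf2x))
        return math.sqrt(max(var, 0.0))  # clamp tiny negative rounding errors


a = CFV.from_point([1.0, 2.0], t=1.0)
b = CFV.from_point([3.0, 4.0], t=2.0)
merged = a + b
print(merged.centroid(), merged.rms_radius())
```

For d = 2 the object holds 2 + 2 + 3 = 7 numbers, matching the (2d+3) count on the slide.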

Why CFV?

User-oriented

Additivity

Not invented by Prof. Han et al. (the cluster feature vector comes from BIRCH)

Pyramidal Time Frame(1)

Snapshots are classified into different orders, which can vary from 1 to $\log_\alpha(T)$

Snapshots of the $i$-th order are taken at time intervals of $\alpha^i$

Only the last $\alpha^l + 1$ snapshots of order $i$ are stored

An example: $\alpha = 2$, $l = 2$

Worst case
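To make the storage rule concrete, here is a minimal sketch that lists which snapshot times would be retained at clock time T. The function name `stored_snapshots` is an assumption for this example, and the sketch also assumes the clock ticks once per time unit and that order 0 (interval $\alpha^0 = 1$) is included. Note that the same time stamp can appear under several orders, which is the redundancy this slide calls the worst case.

```python
def stored_snapshots(T, alpha=2, l=2):
    """Return {order: [snapshot times kept]} at clock time T (T >= 1).

    Order i snapshots fall on multiples of alpha**i, and only the most
    recent alpha**l + 1 of them are kept.
    """
    keep = alpha ** l + 1
    frames = {}
    order = 0
    while alpha ** order <= T:
        step = alpha ** order
        latest = (T // step) * step           # most recent multiple of step
        times = [latest - k * step for k in range(keep)]
        frames[order] = [t for t in times if t > 0]
        order += 1
    return frames


frames = stored_snapshots(55, alpha=2, l=2)
for order, times in frames.items():
    print(order, times)
```

With these assumptions, far fewer than 55 distinct snapshots are held at T = 55, and recent times are covered much more densely than old ones.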

Pyramidal Time Frame(2)

The difference from his book

Divisible by $\alpha^i$, but not by $\alpha^{i+1}$

The number of orders is constant

Best case: no redundancy
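The "divisible by $\alpha^i$ but not by $\alpha^{i+1}$" rule can be sketched as follows. The function names and the per-order capacity of $\alpha^l + 1$ are assumptions carried over for illustration; the slide does not state a capacity for this variant. Because every time stamp gets exactly one order, no snapshot is stored twice.

```python
def snapshot_order(t, alpha=2):
    """Largest i such that alpha**i divides t (t >= 1, alpha >= 2)."""
    i = 0
    while t % (alpha ** (i + 1)) == 0:
        i += 1
    return i


def stored_without_redundancy(T, alpha=2, l=2):
    """Keep the newest alpha**l + 1 snapshots per order (assumed capacity)."""
    keep = alpha ** l + 1
    frames = {}
    for t in range(T, 0, -1):                 # walk from newest to oldest
        bucket = frames.setdefault(snapshot_order(t, alpha), [])
        if len(bucket) < keep:
            bucket.append(t)
    return frames


unique_frames = stored_without_redundancy(55)
for order in sorted(unique_frames):
    print(order, unique_frames[order])
```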

Why (my own thought)

The newer snapshots are kept, and the older ones are abandoned

The lower orders are not friendly to old snapshots, but the higher ones are

The frame not only punishes older snapshots, but also protects them

Micro-Cluster(1)------Procedure


[Figure: timeline of the procedure, labeled with t, h, h', Micro cluster (CFV), Snapshots, and T]

Micro-Cluster(2)------Initialization

What is to be initialized?

Micro-clusters

The number of micro-clusters maintained in each snapshot is constant

Determined by the amount of memory available

Larger than the natural number of clusters, but smaller than the number of data points in the data stream

Each cluster owns a unique id


Supported by the experiment

Reasonable?

Micro-Cluster(3)------Updating

A new data point arrives. What will be done?

Join, Delete & Merge

Join: find the nearest micro-cluster, and absorb the point if it falls within that cluster's boundary

RMS & Distance

Delete: find the oldest one

The average time stamp of the last m data points

Take the time stamp contained in CFV as the approximation

Merge: find the closest two clusters

They don’t explain how

idlist
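The three cases can be sketched as one update routine. This is a simplified illustration under several assumptions, not the paper's exact procedure: micro-clusters are plain dicts, the boundary of a multi-point cluster is `boundary_factor` times its RMS radius, the boundary of a single-point cluster is the distance to its closest neighbour, and "oldest" is judged by the mean time stamp `cf1t / n` rather than by an estimate over the last m points. All function and parameter names are hypothetical, and `clusters` must already be initialized (non-empty).

```python
import math
from itertools import combinations


def new_cluster(cid, v, t):
    return {"ids": [cid], "cf1x": list(v), "cf2x": [x * x for x in v],
            "cf1t": t, "cf2t": t * t, "n": 1}


def centroid(c):
    return [s / c["n"] for s in c["cf1x"]]


def rms(c):
    var = sum(s2 / c["n"] - (s1 / c["n"]) ** 2
              for s1, s2 in zip(c["cf1x"], c["cf2x"]))
    return math.sqrt(max(var, 0.0))


def boundary(c, clusters, factor):
    if c["n"] > 1:
        return factor * rms(c)
    # Assumed heuristic for single-point clusters: distance to the closest
    # other micro-cluster.
    others = [math.dist(centroid(c), centroid(o)) for o in clusters if o is not c]
    return min(others) if others else 0.0


def merge(a, b):
    """Additive merge; the merged cluster keeps both id lists (the idlist)."""
    out = {"ids": a["ids"] + b["ids"]}
    for key in ("cf1x", "cf2x"):
        out[key] = [x + y for x, y in zip(a[key], b[key])]
    for key in ("cf1t", "cf2t", "n"):
        out[key] = a[key] + b[key]
    return out


def update(clusters, v, t, next_id, q, boundary_factor=2.0, stale_before=0.0):
    """Process one record (v, t) in place; return the next unused cluster id."""
    nearest = min(clusters, key=lambda c: math.dist(centroid(c), v))
    # Join: the point falls inside the nearest cluster's boundary.
    if math.dist(centroid(nearest), v) <= boundary(nearest, clusters, boundary_factor):
        nearest["cf1x"] = [s + x for s, x in zip(nearest["cf1x"], v)]
        nearest["cf2x"] = [s + x * x for s, x in zip(nearest["cf2x"], v)]
        nearest["cf1t"] += t
        nearest["cf2t"] += t * t
        nearest["n"] += 1
        return next_id
    # Otherwise the point starts a new cluster; make room if q is reached.
    if len(clusters) >= q:
        oldest = min(clusters, key=lambda c: c["cf1t"] / c["n"])
        if oldest["cf1t"] / oldest["n"] < stale_before:
            clusters.remove(oldest)            # Delete: stale enough to drop
        else:
            a, b = min(combinations(clusters, 2),
                       key=lambda p: math.dist(centroid(p[0]), centroid(p[1])))
            clusters.remove(a)                 # Merge: closest pair
            clusters.remove(b)
            clusters.append(merge(a, b))
    clusters.append(new_cluster(next_id, v, t))
    return next_id + 1
```

A short usage run: start from `[new_cluster(0, [0, 0], 1), new_cluster(1, [10, 10], 2)]` with `q=2`, feed `[1, 0]` (joined into cluster 0), then feed a far-away point, which forces either a delete or a merge depending on `stale_before`.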


Macro-Cluster(1)------Find the approximate time stamp

What’s the analyst behavior?

Find clusters over a past time horizon of h

All about: additivity property

I don’t understand how they cope with the fault tolerance

Only two snapshots are necessary

What is to be clustered?

CFV


$S(t_c) - S(t_c - h')$
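The snapshot subtraction can be sketched as follows, relying only on additivity. The data layout is an assumption for this example: each snapshot is a list of `(idlist, stats)` pairs, where `stats` is a flat tuple of additive sums with the point count `n` in the last position. An older micro-cluster is subtracted from whichever current micro-cluster has absorbed all of its ids.

```python
def subtract_snapshots(current, past):
    """Approximate the micro-clusters formed within the horizon.

    current, past: lists of (idlist, stats); stats is a tuple of additive
    sums with the point count n as its last entry (assumed layout).
    """
    result = []
    for ids, stats in current:
        remaining = list(stats)
        for old_ids, old_stats in past:
            if set(old_ids) <= set(ids):      # old cluster lives on in this one
                remaining = [a - b for a, b in zip(remaining, old_stats)]
        if remaining[-1] > 0:                 # drop clusters with no new points
            result.append((ids, tuple(remaining)))
    return result


# One-dimensional example; stats = (CF2x, CF1x, CF2t, CF1t, n).
past = [([0], (5.0, 3.0, 5.0, 3.0, 2)),        # x=1 at t=1, x=2 at t=2
        ([7], (1.0, 1.0, 1.0, 1.0, 1))]        # unchanged afterwards
current = [([0], (30.0, 10.0, 30.0, 10.0, 4)),  # plus x=3 at t=3, x=4 at t=4
           ([7], (1.0, 1.0, 1.0, 1.0, 1)),
           ([1], (100.0, 10.0, 16.0, 4.0, 1))]  # created at t=4
horizon_clusters = subtract_snapshots(current, past)
print(horizon_clusters)
```

Past clusters that were deleted before the current snapshot simply match nothing and are ignored in this sketch.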

Not user-friendly

Macro-Cluster(2)------modified k-means

What has been modified in k-means?

The micro-clusters are treated as pseudo-points

The seeds are no longer picked randomly

The more points, the more important

Experiments are sufficient
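The modifications listed above can be sketched as a weighted k-means over micro-cluster centroids. This is a minimal illustration with assumed names (`weighted_kmeans`, `points`, `weights`): seeds are drawn with probability proportional to the point count of each micro-cluster, and each center is the count-weighted mean of its pseudo-points. The fixed iteration count and the seeded random generator are conveniences for the example, not part of the method.

```python
import math
import random


def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """points: micro-cluster centroids; weights: their point counts n."""
    rng = random.Random(seed)
    # Seeding: heavier micro-clusters are more likely to be picked.
    candidates = list(range(len(points)))
    centers = []
    for _ in range(k):
        pick = rng.choices(candidates,
                           weights=[weights[i] for i in candidates])[0]
        candidates.remove(pick)               # keep the seeds distinct
        centers.append(list(points[pick]))
    dims = len(points[0])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, p in enumerate(points):
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[j].append(i)
        for j, members in enumerate(groups):
            if members:                       # leave empty groups unchanged
                w = sum(weights[i] for i in members)
                centers[j] = [sum(weights[i] * points[i][d] for i in members) / w
                              for d in range(dims)]
    return centers


pseudo_points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (12.0, 10.0)]
counts = [1, 3, 5, 5]
print(sorted(weighted_kmeans(pseudo_points, counts, k=2)))
```

In the example the left center lands at x = 1.5 rather than x = 1.0, because the pseudo-point with three members pulls harder than the one with a single member.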


Q&A



Education
