9
Clustering big data streams Citation for published version (APA): Hassani, M., & Seidl, T. (2016). Clustering big data streams: recent challenges and contributions. IT - Information Technology, 58(4), 206-213. https://doi.org/10.1515/itit-2016-0007 DOI: 10.1515/itit-2016-0007 Document status and date: Published: 01/08/2016 Document Version: Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers) Please check the document version of this publication: • A submitted manuscript is the version of the article upon submission and before peer-review. There can be important differences between the submitted version and the official published version of record. People interested in the research are advised to contact the author for the final version of the publication, or visit the DOI to the publisher's website. • The final author version and the galley proof are versions of the publication after peer review. • The final published version features the final layout of the paper including the volume, issue and page numbers. Link to publication General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal. If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, please follow below link for the End User Agreement: www.tue.nl/taverne Take down policy If you believe that this document breaches copyright please contact us at: [email protected] providing details and we will investigate your claim. Download date: 21. Jul. 2021

Clustering big data streams · DE GRUYTER OLDENBOURG M.HassaniandT.Seidl,ClusteringBigDatastreams | 207 Health Context: ECG Pulse Body Humidity Body Temperature … Environment Context:

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Clustering big data streams · DE GRUYTER OLDENBOURG M.HassaniandT.Seidl,ClusteringBigDatastreams | 207 Health Context: ECG Pulse Body Humidity Body Temperature … Environment Context:

Clustering big data streams

Citation for published version (APA):Hassani, M., & Seidl, T. (2016). Clustering big data streams: recent challenges and contributions. IT -Information Technology, 58(4), 206-213. https://doi.org/10.1515/itit-2016-0007

DOI:10.1515/itit-2016-0007

Document status and date:Published: 01/08/2016

Document Version:Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)

Please check the document version of this publication:

• A submitted manuscript is the version of the article upon submission and before peer-review. There can beimportant differences between the submitted version and the official published version of record. Peopleinterested in the research are advised to contact the author for the final version of the publication, or visit theDOI to the publisher's website.• The final author version and the galley proof are versions of the publication after peer review.• The final published version features the final layout of the paper including the volume, issue and pagenumbers.Link to publication

General rightsCopyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright ownersand it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal.

If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, pleasefollow below link for the End User Agreement:www.tue.nl/taverne

Take down policyIf you believe that this document breaches copyright please contact us at:[email protected] details and we will investigate your claim.

Download date: 21. Jul. 2021

Page 2: Clustering big data streams · DE GRUYTER OLDENBOURG M.HassaniandT.Seidl,ClusteringBigDatastreams | 207 Health Context: ECG Pulse Body Humidity Body Temperature … Environment Context:

it – Information Technology 2016; 58(4): 206–213 DE GRUYTER OLDENBOURG

Special Issue

Marwan Hassani* and Thomas Seidl

Clustering Big Data streams: recent challengesand contributionsDOI 10.1515/itit-2016-0007Received February 8, 2016; accepted May 4, 2016

Abstract: Traditional clustering algorithms merely consid-ered static data. Today’s various applications and researchissues in big data mining have however to deal with con-tinuous, possibly infinite streams of data, arriving at highvelocity. Web traffic data, surveillance data, sensor mea-surements and stock trading are only some examples ofthese daily-increasing applications.Since the growth of data volumes is accompanied by a sim-ilar raise in their dimensionalities, clusters cannot beexpected to completely appear when considering all at-tributes together. Subspace clustering is a general ap-proach that solved that issue by automatically finding thehidden clusters within different subsets of the attributesrather than considering all attributes together.In this article, novel methods for an efficient subspaceclustering of high-dimensional big data streams are pre-sented. Approaches that efficiently combine the anytimeclustering concept with the stream subspace clusteringparadigm are discussed. Additionally, efficient and adap-tive density-based clustering algorithms are presented forhigh-dimensional data streams. Novel open-source as-sessment framework and evaluation measures are addi-tionally presented for subspace stream clustering.

Keywords: Subspace clustering, Big Data, data streams.

ACM CCS: Information systems → Information systemsapplications → Data mining → Clustering, Informa-tion systems → Information systems applications →Data mining → Data streammining, Information systems→ Information systems applications → Spatial-temporalsystems → Sensor networks

*Corresponding author: Marwan Hassani, RWTH Aachen University,Data Management and Data Exploration Group, D-52074 Aachen,Germany, e-mail: [email protected] Seidl: Ludwig-Maximilians-Universität (LMU) Munich,Database Systems Group, D-80538 München, Germany

1 IntroductionClustering is a well-established data mining concept thataims at automatically grouping similar data objects whileseparating dissimilar ones. This process is strongly depen-dent on the notion of similarity, which is often based onsome distance measure. Thus, similar objects are usuallyclose to each other while dissimilar ones are far from eachother. The clustering task is performed without a previousknowledge of the data, or in an unsupervisedmanner.

During the early stages of data mining research, thewhole data objects were considered to be statically andpermanently stored in the memory. This allowed the de-signed data mining technique to perform as much pas-sages over the objects as needed to deliver the desired pat-terns. In the era of big data, the recent growth of the datasize and the easiness of collecting data made the previ-ous settings no more convenient. The size of the contin-uously generated data and the limited storage capacity al-low in many scenarios for a single passage over the data,and users are interested in gaining a real time knowledgeabout the data as they are produced.

A data stream is an ordered sequence of objects thatcan be read once or very small number of times using lim-ited processing and computing storage possibilities. Thissequence of objects can be endless and flows usually athigh speeds with a varying underlying distribution of thedata. This fast and infinite flow of data objects does notallow the traditional permanent storage of the data andthus multiple passages are not any more possible. Manydomains are dealing essentially with data streams. Themost prominent examples include network traffic data,telecommunication records, click streams, weather mon-itoring, stock trading, surveillance data, health data, cus-tomer profile data, and sensor data. There is many moreto come. A very wide spectrum of real-world streaming ap-plications is expanding. Particularly in sensor data, suchapplications spread from home scenarios like the smarthomes to environmental applications, monitoring tasks inthe health sector [14, 20], in the digital humanities us-ing eye-tracking [9] or gesture monitoring [3], but do notend with military applications. Actually, any source of

Brought to you by | Universitätsbibliothek der RWTH AachenAuthenticated

Download Date | 3/21/17 10:35 AM

Page 3: Clustering big data streams · DE GRUYTER OLDENBOURG M.HassaniandT.Seidl,ClusteringBigDatastreams | 207 Health Context: ECG Pulse Body Humidity Body Temperature … Environment Context:

DE GRUYTER OLDENBOURG M. Hassani and T. Seidl, Clustering Big Data streams | 207

Health Context:ECGPulse

Body HumidityBody Temperature

Environment Context:TemperatureHumidityCondition

Time Context:Daily TimeWeek DayLocation Context:

Passed DistanceAltitude Change

Light-weighted single-stream patterns

• Find Multiple-stream patterns• Incrementally update them

Prediction request of an interesting health parameter and reply from the server

(a)

Eye-tracking data stream

(b)

Figure 1: Two applications of miningbody-generated streaming data.(a) In a health care scenario [14, 20]where streaming sequentialbehavioral patterns are collected andused for predicting near-futureinteresting dimensions, and (b) ina translation scenario incollaboration with psycholinguists inthe humanities area [9].

information can easily be elaborated to produce a contin-uous flow of the data. Another emerging streaming datasources are social data. In a single minute, 546,000 tweetsare happening, 2,460,000 pieces of content are sharedon Facebook and 72 h of new videos are uploaded toYouTube¹. Users are interested of gaining the knowledgeout of these information during their same minute of gen-eration. A delay, say till the next minute, might result in anoutdated knowledge.

In Figure 1(a), a body sensor network is producingmultiple streams about the health status of the runner.Other sensors are collecting streams of other contextualinformation like the weather and location information.These can be processed on a local mobile device or a re-mote server to gain, for instance, some knowledge aboutthe near-future status [20]. Figure 1(b) presents anothertype of sensor streaming data where an eye-tracking sys-tem is used to record the duration and the position of eacheye fixation over the monitor during a human reading orwriting process [21]. One task could be here finding inter-esting patterns that represent important correlations be-tween eye-gazes and key strokes [9].

Stream clustering aims at detecting clusters that areformed out of the evolving streaming objects. These clus-

1 As on July 2014, Sources: domo.com and statisticbrain.com

ters must be continuously updated as the stream emergesto follow the current distribution of the data. These clus-ters represent mainly the gained knowledge out of theclustering task. In this article, advanced stream cluster-ing models are introduced. These models are mainly mo-tivated by the basic challenges that we have observed forclustering of streaming data in real world scenarios, par-ticularly sensor streaming data (cf. Figure 1).

The remainder of this article is organized as follows: inSection 2wewill list some of the challenges one has to facewhile designing stream clustering algorithms for miningbig data. In Section 3,wepresent, at a high level of abstrac-tion, recent approaches in the field of big data streammin-ing that overcame the challenges mentioned in Section 2.We spend however more time to explain two recent any-time subspace stream clustering in more details. Finally,Section 4 concludes this article with a summarization ta-ble of the most important properties of discussed streamclustering algorithms.

2 Challenges of stream clusteringof Big Data

Designing stream clustering approaches has someunique special challenges. We list in the following some

Brought to you by | Universitätsbibliothek der RWTH AachenAuthenticated

Download Date | 3/21/17 10:35 AM

Page 4: Clustering big data streams · DE GRUYTER OLDENBOURG M.HassaniandT.Seidl,ClusteringBigDatastreams | 207 Health Context: ECG Pulse Body Humidity Body Temperature … Environment Context:

208 | M. Hassani and T. Seidl, Clustering Big Data streams DE GRUYTER OLDENBOURG

paradigms that make it challenging to design a streamclustering approach.

2.1 Adaptation to the stream changes, andoutlier awareness

The algorithm must incrementally cluster the stream datapoints to detect evolving clusters over the time, while for-getting outdated data. New trends of the data must be de-tected at the same time of their appearance. Nevertheless,the algorithmmust be able to distinguishnew trends of thestream from outliers. Fulfilling the up-to-date requirementcontradicts the outlier awareness one. Thus, meeting thistradeoff is one of the basic challenges of any stream clus-tering algorithm.

2.2 Storage awareness, and high clusteringquality

Due to the huge sizes and high speeds of streaming data,any clustering algorithm must perform as few passagesover the objects as possible. In most cases, the applicationand the storage limitations allow only for a single passage.However, high quality clustering results are requested tomake the desired knowledge out of the data stream. Moststatic clustering models tend to deliver an initial, some-times random, clustering solution and then optimize it byrevisiting the objects to maximize some similarity func-tion. Although suchmultiple-passagespossibility doesnotexist for streaming algorithms, the requirement of an opti-mized, high quality clustering does still exist.

2.3 Efficient handling of high-dimensional,different-density streaming objects

The current huge increase of the sizes of data was accom-paniedwith a similar boost in their number of dimensions.This applies of course to streamingdata too. For suchkindsof data with higher dimensions, distances between the ob-jects grow more and more alike due to an effect termedcurse of dimensionality [4]. According to this effect, ap-plying traditional clustering algorithms in the full-spacemerely will result in considering almost all objects as out-liers, as the distances between them grow exponentiallywith their dimensionality 𝑑. The latter fact motivated theresearch in the area of subspace clustering over static datain the last decade, which searches for clusters in all of the2𝑑−1 subspaces of the data by excluding a subgroup of the

dimensions at each step. Apparently this implies highercomplexity of the algorithm even for static data, whichmakes it evenmore challenging when considering stream-ing data.

Additionally, as the stream evolves, the number, thedensity and the shapes of clusters may dramaticallychange. Thus, assuming a certain number of clusters likein𝑘-means-based clusteringmodels or setting a static den-sity threshold as in the DBSCAN-based clustering modelsis not convenient for a stream clustering approach. A self-adjustment to the different densities of the data is stronglyneeded while designing a stream clustering algorithm (cf.Figures 2 and 3). Again, this requirement is in conflict withthe storage awareness necessity.

2.4 Flexibility to varying time allowancesbetween streaming objects

An additional, natural characteristic of data streams (e.g.sensor data) is the fluctuating speed rate. Streaming dataobjects arrive usually with different time allowances be-tween them, although the application settings would as-sume a constant stream speed. Available stream cluster-ing approaches, called budget algorithms in this context,strongly restrict their model size to handle minimal timeallowance to be on the safe side (cf. Figure 4). In the caseof reduced stream speed, the algorithm remains idle dur-ing the rest of the time, till the next streaming object ar-rives. Anytime mining algorithms, designed recently forstatic data, try to make use of any given amount timeto deliver some result. Longer given times, imply higherclustering quality. This idea was adopted for clusteringstreamingdata. Although this setting canbe seen as an op-portunity for improving the clustering quality rather thana challenge, it is not trivial to have a flexible algorithmicmodel that is able to deliver some result evenwith very faststreams.

3 Recent contributions in the fieldof efficient clustering of Big Datastreams

In this section, we present, at a high level of abstrac-tion, our novel, efficient stream clustering algorithms thatconsider all of the above challenges mentioned in Sec-tion 2. We spend more time to explain two algorithmsthat contribute to anytime and subspace stream cluster-ing. These contributions [8] are structured in the following

Brought to you by | Universitätsbibliothek der RWTH AachenAuthenticated

Download Date | 3/21/17 10:35 AM

Page 5: Clustering big data streams · DE GRUYTER OLDENBOURG M.HassaniandT.Seidl,ClusteringBigDatastreams | 207 Health Context: ECG Pulse Body Humidity Body Temperature … Environment Context:

DE GRUYTER OLDENBOURG M. Hassani and T. Seidl, Clustering Big Data streams | 209

Input stream

Online component

Offline component

Continuous summaries (microclusters)

Final clustering over microclusters

User clustering request

Figure 2: The online-offline model of streamclustering algorithms. Decayed input objectshave lighter colors than recent ones.

Mutual reachability graph

A minimal spanning tree

Dendrogram (hierarchical clustering)

Extraction of “flat” clustering

>

∆ + (∆)

maintain updated

Figure 3: The steps of HASTREAM algorithm[17]. The incremental part [15] is explained inred arrows to maintain the clustering attimestamp 𝑡𝑗 after the red insertions anddeletions of microclusters introduced to theold ones from timestamp 𝑡𝑖 (represented byΔ). 𝐴𝑓𝑓(Δ) represents the affectedmicroclusters because of the inserted and thedeleted microclusters: Δ.

three subsections. In Section 3.1, we present novel high-dimensional density-based stream clustering techniques.In Section 3.2, we introduce and deeply explain advancedanytime stream clustering approaches. Finally, in Sec-tion 3.3, we present a unique subspace stream clusteringframework as well as the subspace cluster mapping eval-uation measure. In all of the following subsections, thefirst and the second challenges mentioned in Sections 2.1and 2.2 are carefully considered. Each of the rest of thechallenges (Sections 2.3 and 2.4) is the main focus in oneof the first two subsections in the following.

3.1 High-dimensional, density-basedstream clustering algorithms

In this line of research to address big data stream clus-tering, we present three density-based stream clusteringalgorithms. Here, the third challenge mentioned in Sec-tion 2.3 is mainly considered.

An efficient projected stream clustering algorithm isintroduced for handling high-dimensional, noisy, evolv-ing data streams is presented in PreDeConStream [16]. Thistechnique is based on a two-phasemodel available inmoststream clustering algorithms [1, 6] (cf. Figure 3). The firstphase represents the process of the online maintenance ofdata summaries, called microclusters. As the name sug-gests, those summaries are smaller clusters, where onlyminimal statistics of the streaming objects are collected.These statistics are the count, the linear sum and thesquared sum. The microclusters are then passed to an of-

fline phase for generating the final clustering. The tech-nique works on incrementally updating the output of theonline phase stored in a microcluster structure. Takingthose microclusters that are fading out over time into con-sideration speeds up the process of assigning new datapoints to existing clusters. The algorithm localizes thechange to the previous clustering result, and smartly usesa clustering validity interval to make an efficient offlinephase.

HASTREAM [17] contributes a hierarchical, self-adaptive, density-based stream clustering model (cf.Figure 2). The algorithm focuses on smoothly detectingthe varying number, densities and shapes of the streamingclusters. A cluster stability measure is applied over thesummaries of the streaming data (the microclusters inFigure 3), to extract the most stable offline clustering.

In I-HASTREAM [15], in order to improve the efficiencyof the suggested model in the offline phase, somemethodsfrom the graph theory are adopted and others were con-tributed, to incrementally update aminimal spanning treeof microclusters (cf. the red arrows in Figure 2). This tree isused to continuously extract the final clustering, by local-izing the changes that appeared in the stream, and main-taining the affected parts merely.

3.2 Advanced anytime stream clusteringalgorithms

By considering all other challenges, the main focus of thetwo algorithms presented in this section are the third and

Brought to you by | Universitätsbibliothek der RWTH AachenAuthenticated

Download Date | 3/21/17 10:35 AM

Page 6: Clustering big data streams · DE GRUYTER OLDENBOURG M.HassaniandT.Seidl,ClusteringBigDatastreams | 207 Health Context: ECG Pulse Body Humidity Body Temperature … Environment Context:

210 | M.Hassani and T. Seidl, Clustering Big Data streams DE GRUYTER OLDENBOURG

qu

ality

constant stream

varying stream

+ 1 Δ

+ 1

Figure 4: The concept of anytime stream mining algorithms.

Figure 5: The main insertion and model concept of LiarTree [13].

the fourth challenges mentioned in Sections 2.3 and 2.4.Anytime algorithms build upon the realistic assumptionof the varying time allowances between streaming objects(cf. Figure 4). They aim at increasing the quality of theiroutput if theyweregivenmore time (i.e. the timeallowanceΔ𝑡 is bigger) instead of being idle as in traditional algo-rithms.

The LiarTree algorithm [13] is contributed on the on-line phase (cf. Figure 3) to provide precise stream sum-maries and to effectively handle noise, drift and noveltyat any given time. The algorithms stores the microclustersin a tree data structure (cf. Figure 5). Themicroclusters areindexed in a hierarchical way, such that microclusters inany higher level of the tree are bigger and less in numberthan the ones in lower levels. Themost fine-grainedmicro-clusters are stored in the leaf level of the tree. Additionally,each microcluster points to the smaller microclusters thatare nested inside it. Thus, the insertion of objects inside itsmicrocluster in the leaf-level is performed in an efficientway. We prove that the runtime of the our anytime algo-rithm is logarithmic in the size of the maintained modelopposed to a linear time complexity often observed in pre-vious approaches. The anytime concept is achieved usingthe insertion of objects in the tree. If Δ𝑡

𝑖(cf. Figure 4) is

big enough, the object 𝑜𝑖can be inserted into its leaf level

microcluster, as this insertion needs more time but guar-antees a better quality. IfΔ𝑡

𝑖is however small (i.e. another

object 𝑜𝑖+1

has arrived in the stream), then 𝑜𝑖is stored tem-

porarily in a buffer at the level where it has reached, andthe insertion of 𝑜

𝑖+1is prioritized. The buffered 𝑜

𝑖is “hitch-

hiked” later once there is another insertion of a later object

Figure 6: The main insertion and model of SubClusTree [12]. Anillustrative example of a 3D dataset.

𝑜𝑗; 𝑗 > 𝑖 descending in its same insertion path to its desti-

nation.Three other main contributions of LiarTree. the first is

enabling the anytime concept to fast adapt to the newtrends of the data by allowing a liar growth of the treetemporarily in a bottom-up way to allow faster splitting,merging and birth of new microclusters. The second is fil-tering noise from new trends of the stream by allowinga local buffering of noise objects until they their densityis enough (w.r.t. neighboring microclusters) to form a newmicrocluster. The final contribution is guaranteeing a log-arithmic complexity of the insertion.

In the SubClusTree algorithm [12], another complex-ity dimension is added to the problem addressed inLiarTree [13]. The high-dimensionality paradigm of bigstreaming data (cf. Section 2.3) is considered together withthe varying arrival times and the streaming aspects of thedata (cf. Section 2.4). SubClusTree is a subspace anytimestream clustering algorithm, that can flexibly adapt to thedifferent stream speeds and makes the best use of avail-able time to provide a high quality subspace clustering. Ituses a forest of multiple liartrees as its data structure (cf.Figure 6). Each liartree represents here a subspace. Thus,in the illustrative example in Figure 6where a 3Ddataset isconsidered, 110, 010 and 001 represent the liartreeswherethe insertion is performed over the subspace that containsthe first dimension, the second and the third, respectively.As explained using the blue arrows in Figure 6, if Δ𝑡

𝑖is

big enough, the insertion of 𝑜𝑖proceeds to the subspaces

(liartrees) of two dimensions, and so on, until reaching thetree of the full space.

The insertion within a single liartree is not allowed tobe interrupted until the object reaches its leaf level. OnceΔ𝑡𝑖has finished, the inserted dimensions of 𝑜

𝑖are only

considered, and the insertion of 𝑜𝑖+1

is prioritized. Appar-ently it is not convenient for SubClusTree to have a data

Brought to you by | Universitätsbibliothek der RWTH AachenAuthenticated

Download Date | 3/21/17 10:35 AM

Page 7: Clustering big data streams · DE GRUYTER OLDENBOURG M.HassaniandT.Seidl,ClusteringBigDatastreams | 207 Health Context: ECG Pulse Body Humidity Body Temperature … Environment Context:

DE GRUYTER OLDENBOURG M. Hassani and T. Seidl, Clustering Big Data streams | 211

structure of 2𝑑−1 liartrees for processing a 𝑑-dimensionaldataset, thus liartrees of only populated subspaces areinitiated. Usually those relevant trees are considerablysmaller than the full number of trees. Popular subspacesare decided over a certain batch using a heuristic thatdecides potential higher-dimensionality subspaces as thecombination of popular lower-dimensional subspaces inan Apriori-like method.

The heuristic used by SubClusTree estimates the den-sity of flexible grids to efficiently distinguish the populatedhigher-dimensional subspaces (i.e., subspaces with clus-ters) from irrelevant ones.

3.3 A framework and an evaluation measurefor subspace stream clustering

After the development of the previous algorithms, a newgap in the literature of evaluating stream subspace clus-tering algorithms has appeared.

The first subspace clustering evaluation frameworkover data streams, called Subspace MOA, is presentedin [11]. This open-source framework is based on the MOAstream mining framework [5], and has three phases (cf.Figure 7). In the online phase, users have the possibil-ity to select one of three most famous summarizationtechniques to form the microclusters. Upon a user re-quest for a final clustering, the regeneration phase con-structs the data objects out of the current microclusters.Then, in the offline phase, one of five subspace cluster-

user requestCFA CFB

CFFCFE

CFG

CFC CFDFigure 7: The Subspace MOAmodel for streamsubspace clustering of big data. The blow arrowsrepresent the online phase, the green arrowrepresents the regeneration phase and the red arrowrepresents the offline phase.

Figure 8: (a): The runtime of the static subspace clustering algorithm: PROCLUS [2]. Beginning from a sub-dataset size of 200 K objects only,the algorithm fails to successfully finish the running. (b): A successful run of PROCLUS (and other subspace clustering algorithm) over thewhole KDD dataset [7] when applying the streaming PROCLUS using Subspace MOA with CluStream [1] in the online phase on the samemachine.

ing algorithms can be selected. In addition to the previouscombinations, the framework contains available projectedstream clustering algorithms like PreDeConStream [16]and HDDStream [19]. The framework is supported witha subspace stream generator, a visualization interface andvarious subspace clustering evaluation measures.

With the increase of the size of high-dimensional data,applying traditional subspace clustering algorithms is im-possible due to their exponential complexities. Figure 8(a)shows, for instance, that applying the static variant ofPROCLUS [2] over a small subset of the KDD dataset [7]is extremely slow even on a reasonable machine. Begin-ning froma size of 200 K objects only, the experiment doesnot finish. However, as shown in Figure 8(b), it is possibleby using Subspace MOA over the same machine to get therelatively-large dataset completely clusteredwith the samesubspace clustering algorithm (PROCLUS [2]).

In [10], a novel external evaluationmeasure for streamsubspace clustering algorithms called SubCMM: SubspaceCluster Mapping Measure, is presented. SubCMM is ableto handle errors caused by emerging, moving, or splittingsubspace clusters. This first evaluationmeasure that is de-signed to reflect the quality of stream subspace algorithmsis directly integrated in the SubspaceMOA framework. Theexperimental evaluation, performed using the SubspaceMOA framework, depicts the ability of SubCMM to reflectthe different changes happening in the subspaces of theevolving stream.

Brought to you by | Universitätsbibliothek der RWTH AachenAuthenticated

Download Date | 3/21/17 10:35 AM

Page 8: Clustering big data streams · DE GRUYTER OLDENBOURG M.HassaniandT.Seidl,ClusteringBigDatastreams | 207 Health Context: ECG Pulse Body Humidity Body Temperature … Environment Context:

212 | M.Hassani and T. Seidl, Clustering Big Data streams DE GRUYTER OLDENBOURG

Table 1: Properties of the different stream clustering algorithms discussed in this article (𝑛𝑎=not applicable).

Algorithm Onlin

e-off

linemodel

Density

-based

(inonlin

eph

ase)

Density

-based

(inoff

lineph

ase)

Hierarchical

Outlier-aware(cf.Se

ction

2.1)

Density

-adaptive(cf.Section

2.3)

Anytime(cf.Se

ction

2.4)

Increm

ental(inoff

lineph

ase)

Overlapp

ingclustersandsubspaces

Datastructureofmicroclusters

Full-space stream clustering algorithmsCluStream [1] ✓ 𝑛𝑎 𝑃𝑦𝑟𝑎𝑚𝑖𝑑𝑎𝑙 𝑡𝑖𝑚𝑒 𝑓𝑟𝑎𝑚𝑒

DenStream [6] ✓ ✓ ✓ ✓ 𝑛𝑎 𝐿𝑖𝑠𝑡

LiarTree [13] ✓ ✓ 𝑛𝑎 ✓ ✓ 𝑛𝑎 𝐼𝑛𝑑𝑒𝑥

HASTREAM [17] ✓ ✓ ✓ ✓ ✓ ✓ 𝑛𝑎 𝐼𝑛𝑑𝑒𝑥 or 𝐿𝑖𝑠𝑡I-HASTREAM [15] ✓ ✓ ✓ ✓ ✓ ✓ 𝑛𝑎 ✓ 𝑛𝑎 𝐼𝑛𝑑𝑒𝑥 or 𝐿𝑖𝑠𝑡

Subspace stream clustering algorithmsPreDeConStream [16] ✓ ✓ ✓ ✓ 𝐿𝑖𝑠𝑡

HDDStream [19] ✓ ✓ ✓ ✓ 𝐿𝑖𝑠𝑡

SubClusTree [12] ✓ ✓ 𝑛𝑎 ✓ ✓ ✓ 𝐼𝑛𝑑𝑒𝑥

4 ConclusionIn this article, we have mainly addressed the followingthree v’s of big data: velocity, volume and variety. Vari-ous streaming data applications, with a huge volumes andvarying velocities were considered in different scenariosand applications.

A list of the recent challenges that face the designerof a stream clustering algorithm was shown and deeplydiscussed. Finally, recent contributions on three main re-search lines in the area of big data stream mining, werepresented and their fulfillment to the design requirementswas highlighted.

Table 1 summarizes the properties of the stream clus-tering algorithms discussed in this article w.r.t. differentaspects. It checks whether the specific algorithm followsthe online-offline model introduced in Figure 3. Also, thealgorithms that apply a density-based clustering concepton their online or offline phases are listed. Additionally,the algorithms delivering a hierarchical final clusteringare highlighted. All algorithms are then checked whetherthey fulfill the challenges introduced in Sections 2.1, 2.3and 2.4. For the subspace streamclustering algorithms, thetable checks additionally, whether they produce simulta-neously overlapping clusters and subspaces [18].Projectedstream clustering algorithms like HDDStream [19] and Pre-DeConStream [16] produce non-redundant clusters, whereany object from the dataset can be maximally a part of

one projected cluster. Finally, the data structure used byeach algorithm for storing and accessing themicroclustersis mentioned. CluStream [1] uses a pyramidal time framedata structure to store snapshots of the microclusters atcarefully selected timestamps. The other algorithms effi-ciently use a list structure, a tree or both to access theirmicroclusters.

References1. C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for

clustering evolving data streams. In Proceedings of VLDB ’03,pp. 81–92, 2003.

2. C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park.Fast algorithms for projected clustering. In Proceedings ofSIGMOD ’99, pp. 61–72, 1999.

3. C. Beecks, M. Hassani, J. Hinnell, D. Schüller, B. Brenger, I. Mit-telberg, and T. Seidl. Spatiotemporal similarity search in 3Dmotion capture gesture streams. In Proceedings of SSTD ’15,pp. 355–372, 2015.

4. K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. Whenis “nearest neighbor” meaningful? In Proceedings of ICDT ’99,pp. 217–235, 1999.

5. A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massiveonline analysis. JMLR, 11:1601–1604, 2010.

6. F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clusteringover an evolving data stream with noise. In Proceedings of SDM’06, pp. 328–339, 2006.

7. Dataset. Network Intrusion KDD Cup Data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999.

Brought to you by | Universitätsbibliothek der RWTH AachenAuthenticated

Download Date | 3/21/17 10:35 AM

Page 9: Clustering big data streams · DE GRUYTER OLDENBOURG M.HassaniandT.Seidl,ClusteringBigDatastreams | 207 Health Context: ECG Pulse Body Humidity Body Temperature … Environment Context:

DE GRUYTER OLDENBOURG M. Hassani and T. Seidl, Clustering Big Data streams | 213

8. M. Hassani. Efficient Clustering of Big Data Streams. PhD the-sis, RWTH Aachen University, 1 2015. https://publications.rwth-aachen.de/record/478963/files/478963.pdf.

9. M. Hassani, C. Beecks, D. Töws, T. Serbina, M. Haberstroh,P. Niemietz, S. Jeschke, S. Neumann, and T. Seidl. Sequentialpattern mining of multimodal streams in the humanities. InProceedings of BTW ’15, pp. 683–686, 2015.

10. M. Hassani, Y. Kim, S. Choi, and T. Seidl. Subspace cluster-ing of data streams: new algorithms and effective evaluationmeasures. J. Intell. Inf. Syst., 45(3):319–335, 2015.

11. M. Hassani, Y. Kim, and T. Seidl. Subspace MOA: subspacestream clustering evaluation using the MOA framework. InProceedings of DASFAA ’13, pp. 446–449, 2013.

12. M. Hassani, P. Kranen, R. Saini, and T. Seidl. Subspace anytimestream clustering. In Proceedings of SSDBM’ 14, page 37, 2014.

13. M. Hassani, P. Kranen, and T. Seidl. Precise anytime clusteringof noisy sensor data with logarithmic complexity. In Proceed-ings of SensorKDD ’11 @KDD ’11, pp. 52–60, 2011.

14. M. Hassani and T. Seidl. Towards a mobile health context pre-diction: Sequential pattern mining in multiple streams. In Pro-ceedings of MDM ’11, volume 2, pp. 55–57, 2011.

15. M. Hassani, P. Spaus, A. Cuzzocrea, and T. Seidl. Adaptivestream clustering using incremental graph maintenance. InBigMine 2015 @ KDD’15, pp. 49–64, 2015.

16. M. Hassani, P. Spaus, M. M. Gaber, and T. Seidl. Density-basedprojected clustering of data streams. In Proceedings of SUM’12, pp. 311–324, 2012.

17. M. Hassani, P. Spaus, and T. Seidl. Adaptive multiple-resolution stream clustering. In Proceedings of MLDM ’14,pp. 134–148, 2014.

18. H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. TKDD, 3(1):1:1–1:58, 2009.

19. I. Ntoutsi, A. Zimek, T. Palpanas, P. Kröger, and H.-P. Kriegel.Density-based projected clustering over high dimensional datastreams. In Proceedings of SDM ’12, pp. 987–998, 2012.

20. C. Quix, J. Barnickel, S. Geisler, M. Hassani, S. Kim, X. Li,A. Lorenz, T. Quadflieg, T. Gries, M. Jarke, S. Leonhardt,U. Meyer, and T. Seidl. Healthnet: A system for mobile andwearable health information management. In Proceedings ofthe Workshop on Information Management for Mobile Applica-tions, pp. 36–43, 2013.

21. T. Vaegs, P. Niemietz, C. Tummel, A. Richert, C. Beecks,M. Haberstroh, M. Hassani, T. Meisen, M. Priesters, H. Vieritz,T. Seidl, I. Mittelberg, S. Neumann, and S. Jeschke. Fosteringinterdisciplinary integration in the e-humanities. In Proceed-ings of EDULEARN ’14, pp. 2244–2251, 2014.

BionotesDr. Marwan HassaniRWTH Aachen University, DataManagement and Data Exploration Group,D-52074 Aachen, [email protected]

Dr. Marwan Hassani is postdoc researcher and associate teachingassistant at the data management and data exploration group at theRWTH Aachen University. His research interests include stream datamining, sequential pattern mining of multiple streams, efficientanytime clustering of big data streams, streaming process miningand exploration of evolving graph data. Marwan received hid PhD(2015) from RWTH Aachen University. He received an equivalenceMaster in Computer Science from RWTH Aachen University (2009).He coauthored more than 40 scientific publications and serves onseveral program committees.

Prof. Dr. Thomas SeidlLudwig-Maximilians-Universität (LMU)Munich, Database Systems Group,D-80538 München, [email protected]

Prof. Dr. Thomas Seidl is full professor for computer science andhead of the database systems group at Ludwig-Maximilians-Universität (LMU) Munich. His research activities include data min-ing and database technology for multimedia and spatial-temporalobjects in engineering, business, life science and humanities appli-cations. He coauthored more than 300 scientific publications andserves on several program committees and scientific boards. Seidlreceived his Diplom (MSc, 1992) from TU München and his PhD(1997) and venia legendi (2001) from LMU Munich. From 2002 to2016 he headed the data management and data exploration groupas a full professor for computer science at RWTH Aachen University.

Brought to you by | Universitätsbibliothek der RWTH AachenAuthenticated

Download Date | 3/21/17 10:35 AM