Main Topics

- Learning from Data Streams
  - Motivation
  - Big Data Stream
  - Novelty detection
  - Clustering Learning
  - Predictive Learning
  - Frequent Pattern Mining
    - Counting Algorithms
    - Frequent Items
- Tools and Applications
Motivation

- Traditional datasets
- Data streams
- Novelty detection
- Algorithms
- Examples
- Challenges
Data Mining

- DM techniques were developed for, and are usually applied to, static datasets
  - All the data are available
  - A machine learning algorithm induces a static decision model
  - Small to medium datasets
Data production is changing

- Previous practice
  - Few companies generate data
  - All the rest consume data
- Current practice
  - Everybody produces data
  - Everybody consumes data
Data explosion

- Machines are continuously collecting data
  - And sending them to other machines
- Everybody is a movie maker
  - And wants a big audience
- Everybody has great taste in videos
  - And shares what they like
- Everybody is being watched
  - Everywhere and every time
Data Mining

- Real-life problems are dynamic
- Data are generated continuously and at high speed
- Medium to large size
- Data streams
- New techniques and modifications of existing techniques
Data never sleeps

https://www.domo.com
http://www.flightradar24.com

[Figures: Day 1 - Afternoon vs. Day 2 - Morning]
Real data from smartphones

- Portugal: http://www.publico.pt/ciencia/noticia/telemoveis-fornecem-quase-em-tempo-real-mapas-da-densidade-populacional-portuguesa-1677020
- France: population dynamics between the main holiday period (July and August) and working periods. Credit: Catherine Linard, http://phys.org/news/2014-10-cellphone-population-density.html#jCp
Real-time taxi demand prediction

- For each taxi in Porto, predict passenger demand
  - 30 minutes horizon
- ECML/PKDD data science challenge
Data sources

- Walmart
  - Its data center occupies 11,000 m²
  - > 1 million transactions per hour
  - Processes 40 petabytes per day
  - > 2,000 times the content of all books in the Library of Congress, the world's largest library in shelf space and number of items (> 155 million items)

André Ponce de Leon F de Carvalho
- YouTube
  - More than 1 billion users
  - Each day, billions of accesses and hundreds of millions of hours watched
  - The number of hours each person watches per month grows 50% each year
  - 300 hours of video uploaded every minute
Big Data relevance

http://hadoopadmin.com/big-data-hadoop-what-it-is-why-it-matters/sas-volume-variety-verlocity-value/

Mismanaged data cost
A World in movement

- The new characteristics of data:
  - Time and space: the objects of analysis exist in time and space, and often they are able to move
  - Dynamic environment: the objects exist in a dynamic and evolving environment
  - Information processing capability: the objects have limited information processing capabilities
- The new characteristics of data:
  - Locality: the objects know only their local spatio-temporal environment
  - Distributed environment: objects are able to exchange information with other objects
- Main goal, real-time analysis: decision models must evolve in correspondence with the evolving environment
Challenges of Real-Time Stream Mining

- These characteristics imply:
  - A switch from one-shot learning to continuously learning dynamic models that evolve over time
  - In the perspective induced by ubiquitous environments, finite training sets, static models, and stationary distributions will have to be completely rethought
  - The algorithms will have to use limited computational resources, in terms of processing, memory space and communication time
Time Series vs. Data Streams

Usual features of data streams (DS) and time series (TS):

  Feature                   DS               TS
  Task                      Classification   Regression
  Data generation           Asynchronous     Synchronous
  Labelled observations?    No               Yes
  Sequence dependence?      No               Yes
Time series sources

- Stock market
- Currency value
- Energy demand and consumption
- Hydro-electrical energy generation
- Weather forecasting
Data streams main features

- Data arrive sequentially and, usually:
  - With high speed
  - In dynamic, time-changing environments
  - Without control on the arrival order
  - With different intervals between arrivals
- Streams usually have unlimited size
- The data distribution may change over time
- Arriving objects are unlabelled
Data stream solution requirements

- Data must be accessed only once
- Data cannot be stored in memory
  - After being processed, an object is discarded
- The decision model must be continuously updated
  - It must be able to detect novelties (novelty detection)
  - Model updates must be fast (concept drift)
Incremental Learning

- DS mining can use incremental learning algorithms
  - The model is adapted as new examples become available
  - Training never stops
- Alternative: wait and train again with the expanded training set (retraining)
  - Ignores the previous model
- Several incremental learning algorithms exist
Novelty Detection

- Ability to identify new or unknown situations
  - Usually a classification task
- Novelty, anomaly and outlier detection
  - Different definitions in statistics and machine learning
- Find patterns that differ from the normal, usual patterns
Anomalies and Outliers

- Anomaly
  - Few examples that are unexpected and do not represent a new concept
  - An exception to what is known
  - A cohesive and representative group of examples representing a new concept can instead be a novelty
  - The decision model must be adapted to incorporate the anomaly
- Outlier
  - Abnormality or noise
Novelty Detection modalities

- Concept evolution: a new concept (class) emerges in the stream
- Concept drift: a change in the profile (data distribution) of an existing concept (class)
- Recurring concepts: concepts that appeared in the past and disappeared may occur again in the future
Feature Drift

[Figure: data in the (Variable 1, Variable 2) plane at Time n and at Time n + m]

Adapted from Albert Bifet, Joao Gama, Ricard Gavalda, Georg Krempl, Mykola Pechenizkiy, Bernhard Pfahringer, Myra Spiliopoulou, Indre Zliobaite, Advanced Topics in Data Stream Mining, ECML PKDD 2012
Concept Drift

[Figure: the offline phase builds an initial model on the first data; online, new data at Time n + m shift the class regions and a new model replaces the initial one]
Concept Evolution

[Figure: between Time n and Time n + m a new class emerges online, and a new model is added alongside the initial one]
Concept Re-occurrence

[Figure: across Time n, n + m, n + m + k and n + m + k + l, a previously seen concept disappears and later reappears; models are created and reused accordingly]
Profiles of changes over time

[Figure: mean of the data over time under different change profiles: abrupt change, incremental, gradual, outlier, reoccurring concepts, multiple streams]
Forgetting mechanisms

- Data may become outdated and no longer useful
  - Outdated data should be discarded
- Several mechanisms; the choice depends on:
  - How we expect the changes to occur in the data distribution
  - The trade-off between reactivity and robustness to noise
    - Faster reactivity ⇒ more abrupt forgetting ⇒ higher risk of tracking noise
- Forgetting can be:
  - Abrupt (crisp forgetting)
    - At each time, a given observation is either kept in or removed from a learning window
  - Gradual (soft forgetting)
    - All observations are kept in a full memory
    - Observations are weighted, reflecting their age (relevance)
    - The importance of an observation in the training set should decrease with aging
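The two families of forgetting can be illustrated on a running mean. This is a minimal sketch, not from the slides: the function names and the fading factor `alpha` are illustrative choices.

```python
from collections import deque

def windowed_mean(stream, w):
    """Abrupt forgetting: only the last w observations count."""
    window = deque(maxlen=w)  # old items fall off the left automatically
    for x in stream:
        window.append(x)
    return sum(window) / len(window)

def faded_mean(stream, alpha=0.9):
    """Gradual forgetting: every step multiplies old weight by alpha."""
    s = n = 0.0
    for x in stream:
        s = alpha * s + x    # weighted sum of observations
        n = alpha * n + 1.0  # sum of the weights themselves
    return s / n
```

On a stream that jumps from one regime to another, the windowed mean forgets the old regime completely once the window has turned over, while the faded mean still carries a diminishing trace of it.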
Open Set Recognition

- We do not know all the classes during training
  - Only the classes in the training set are known
  - Unknown classes can appear in the test set
- This does not mean that unknown classes did not exist when the training set was obtained:
  - Data comes in a stream
  - The data distribution changed
Data Science for Social Good

- Non-profit movements to bring social benefits to people and communities
  - Some of them adopted by companies
- How does it occur?
  - Meetings
  - Events
  - Academic internships
  - Social networks
- Current trend: data stream mining for social good
- Existing approaches:
  - Using (open) data to solve civic problems
    - Usually want the development of web/mobile apps
  - Using data science techniques to solve social problems
    - Mainly want insights from data scientists
- Data democratization
  - Allow anyone to access data
- The first U.S. Chief Data Scientist was named
  - Focus: precision medicine, open data, data-driven decisions
- Different forms of engagement:
  - Challenges and competitions
    - Predictive data analytics to prevent fires: http://ibmhadoop.devpost.com/
  - University internships
  - Volunteering
  - Part-time jobs
  - Full-time jobs

http://www.kdnuggets.com/2014/07/data-for-good-data-driven-projects-social-good.html
- Bring social benefits to people and communities:
  - Good health care for all
  - Economic development of poor countries
  - Good education for all
  - Clean and cheap energy
  - Citizenship
  - Environmental protection
  - Better and cleaner transport
Data Science for Social Good: Education

- Monitor student performance
- Support the development of better teaching platforms
  - Dynamically adapted to students' performance and needs
- Evaluate teachers and schools
- Replicate good experiences
- Act before it is too late
Data Science for Social Good: Finance

- Improve the financial health of communities
- Support small businesses
- Direct social initiatives
- Fraud detection in the use of public resources
Data Science for Social Good: Environment

- Reduce global warming
- Decrease deforestation
- Reduce the effects of droughts
- Predict natural disasters
- Detect invasive species
- Increase species diversity
Data Science for Social Good: Health care

- Monitor patient status in intensive care units
- Accelerate medical research and make it cheaper
  - Look at millions of patient records arriving in streams
- Discover epidemics
- Elderly fall prevention
Data Science for Social Good: Relevant links

- Data Science for Social Good Fellowship
- DataLook
- civisanalytics.com
- digitalhumanitarians.com
- www.data4good.co
- http://www.meetup.com/DataKind-UK
Big Data Stream Mining

Albert Bifet, Andre Carvalho, Joao Gama
[email protected]
LIAAD-INESC TEC, University of Porto, Portugal
Outline

- Learning from Data Streams
- Powerful Ideas
- Clustering Learning
- Predictive Learning
- Novelty Detection
- Frequent Pattern Mining
Data Streams

Data streams: a continuous flow of data generated at high speed in dynamic, time-changing environments. We need to maintain decision models in real time. Decision models must be capable of:

- incorporating new information at the speed the data arrives;
- detecting changes and adapting the decision models to the most recent information;
- forgetting outdated information.

Unbounded training sets, dynamic models.
Data Stream Processing

1. One example at a time, used at most once
2. Limited memory
3. Limited time
4. Anytime prediction

Approximate Algorithms
Powerful ideas

- Summarization: compact and fast summaries to store sufficient statistics
- Approximation: how much information do we need to learn, with high probability, a hypothesis Ĥ that is within a small error of the true hypothesis H?
  Pr(|H − Ĥ| < ε|H|) > 1 − δ
- Estimation: useful for change detection
Adaptive Learning Algorithms
A survey on concept drift adaptation, Gama, Zliobaite, Bifet et al, ACM-CSUR 2014
Clustering Data Streams

- New requirements in stream clustering:
  - Generate high-quality clusters in one scan
  - High-quality, efficient incremental clustering
  - Analysis for different time granularities
  - Tracking the evolution of clusters
- Clustering: a stream data reduction technique
Cluster Feature Vector

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, Zhang, Ramakrishnan, Livny, 1996

Cluster Feature Vector: CF = (N, LS, SS)

- N: number of data points
- LS = ∑_{i=1}^{N} x⃗_i, the linear sum of the points
- SS = ∑_{i=1}^{N} (x⃗_i)², the sum of squares

Constant space irrespective of the number of examples!
Micro clusters

The sufficient statistics of a cluster A are CF_A = (N, LS, SS).

- N, the number of data objects
- LS, the linear sum of the data objects
- SS, the sum of the squared data objects

Properties:

- Centroid = LS/N
- Radius = √(SS/N − (LS/N)²)
- Diameter = √((2·N·SS − 2·LS²) / (N·(N − 1)))
Given the sufficient statistics of a cluster A, CF_A = (N_A, LS_A, SS_A), the updates are:

- Incremental: a point x is added to the cluster:
  LS_A ← LS_A + x; SS_A ← SS_A + x²; N_A ← N_A + 1
- Additive: merging clusters A and B:
  LS_C ← LS_A + LS_B; SS_C ← SS_A + SS_B; N_C ← N_A + N_B
- Subtractive: CF(C1 − C2) = CF(C1) − CF(C2)
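The (N, LS, SS) bookkeeping above is easy to sketch in code. This is an illustrative minimal version for one-dimensional points (a real implementation keeps LS and SS per dimension; the class name `CF` is our own):

```python
import math

class CF:
    """Cluster Feature (N, LS, SS) for 1-D points."""
    def __init__(self):
        self.n, self.ls, self.ss = 0, 0.0, 0.0

    def add(self, x):
        """Incremental update: absorb one point."""
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        """Additive update: merge another cluster's statistics."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # max(..., 0.0) guards against tiny negative values from rounding
        return math.sqrt(max(self.ss / self.n - (self.ls / self.n) ** 2, 0.0))
```

Note that both updates touch only three numbers, which is exactly why the summary uses constant space regardless of how many points the cluster has absorbed.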
CluStream

CluStream: A Framework for Clustering Evolving Data Streams, Aggarwal, Han, Wang, Yu, VLDB 2003

- Divides the clustering process into online and offline components
  - Online: periodically stores summary statistics about the stream data
    - Micro-clustering: better quality than k-means
    - Incremental, online processing and maintenance
  - Offline: answers various user queries based on the stored summary statistics
- Tilted time framework: registers dynamic changes
- Limited overhead to achieve high efficiency, scalability, quality of results, and power of evolution/change detection
CluStream: Online Phase

Input:

- Maximum micro-cluster diameter D_max

For each x in the stream:

- Find the nearest micro-cluster M_i
- IF the diameter of (M_i ∪ x) < D_max
  - THEN assign x to that micro-cluster: M_i ← M_i ∪ x
  - ELSE start a new micro-cluster based on x
Pyramidal Time Frame

- The micro-clusters are stored at snapshots
- The snapshots follow a pyramidal pattern
- The micro-clusters might be aggregated using tilted histograms
Anytime Stream Clustering

The ClusTree: indexing micro-clusters for anytime stream mining, Kranen, Assent, Baldauf, Seidl, KAIS 2011

Properties of anytime algorithms:

- Deliver a model at any time
- Improve the model if more time is available
- Model adaptation whenever an instance arrives
- Model refinement whenever time permits
- An online component to learn micro-clusters
  - Any variety of online components can be utilized
  - Micro-clusters are subject to exponential aging
Clustering Evaluation

An effective evaluation measure for clustering on evolving data streams, Kremer, Kranen, Jansen, Seidl, Bifet, Holmes, Pfahringer, KDD 2011

- Clusters may appear, fade, move, merge
  - Missed points (unassigned)
  - Misplaced points (assigned to a different cluster)
  - Noise
- Cluster Mapping Measure (CMM)
  - External (uses the ground truth)
  - Normalized sum of penalties of these errors
Cluster Evolution

Analysis:

- Find the cluster structure in the current window
- Find the cluster structure over time ranges, with granularity confined by the specification of window size and boundary
- Put different weights on different windows to mine various kinds of weighted cluster structures
- Mine the evolution of cluster structures based on the changes of their occurrences in a sequence of windows
Bibliography: Clustering data streams

- BIRCH: an efficient data clustering method for very large databases. Zhang, T., Ramakrishnan, R., Livny, M. ACM SIGMOD 1996
- Clustering data streams: theory and practice. Guha, S., Meyerson, A., Mishra, N., Motwani, R., O'Callaghan, L. IEEE TKDE 2003
- CluStream: a framework for clustering evolving data streams. Aggarwal, C., Han, J., Wang, J., Yu, P. VLDB 2003
- MONIC: modeling and monitoring cluster transitions. Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., Schult, R. ACM SIGKDD 2006
- The ClusTree: indexing micro-clusters for anytime stream mining. Kranen, P., Assent, I., Baldauf, C., Seidl, T. KAIS 2011
- An effective evaluation measure for clustering on evolving data streams. Kremer, Kranen, Jansen, Seidl, Bifet, Holmes, Pfahringer. KDD 2011
- Data stream clustering: a survey. Silva, J. A., Faria, E., Barros, R., Hruschka, E., Carvalho, A., Gama, J. ACM Computing Surveys, 2013
Learning Decision Trees

The base idea:

- Which attribute to choose at each splitting node?
- A small sample can often be enough to choose the optimal splitting attribute
  - Collect sufficient statistics from a small set of examples
  - Estimate the merit of each attribute

How large should the sample be?

- The wrong idea: a fixed size, defined a priori without looking at the data
- The right idea: choose the sample size that allows differentiating between the alternatives
Very Fast Decision Trees

Mining High-Speed Data Streams, P. Domingos, G. Hulten, KDD 2000

The base idea: a small sample can often be enough to choose the optimal splitting attribute

- Collect sufficient statistics from a small set of examples
- Estimate the merit of each attribute
- Use the Hoeffding bound to guarantee that the best attribute is really the best
  - Statistical evidence that it is better than the second best
Very Fast Decision Trees: Main Algorithm

- Input: δ, the desired probability level
- Output: T, a decision tree
- Init: T ← empty leaf (root)
- While (TRUE)
  - Read the next example
  - Propagate the example through the tree from the root to a leaf
  - Update the sufficient statistics at the leaf
  - If leaf(#examples) > N_min
    - Evaluate the merit of each attribute
    - Let A1 be the best attribute and A2 the second best
    - Let ε = √(R² ln(1/δ) / (2n))
    - If G(A1) − G(A2) > ε
      - Install a splitting test based on A1
      - Expand the tree with two descendant leaves
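The split decision at the heart of the algorithm is just the Hoeffding bound applied to the gain gap. A minimal sketch (function names are our own; R, δ and the gain values G(A1), G(A2) are the inputs named above):

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)), for a statistic with range R
    estimated from n independent observations."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, R, delta, n):
    """Split when the observed gain gap exceeds the Hoeffding bound:
    with probability at least 1 - delta, the best attribute really is best."""
    return g_best - g_second > hoeffding_bound(R, delta, n)
```

Since ε shrinks as 1/√n, a leaf that keeps seeing examples will eventually either split (the gap wins) or reveal a genuine tie between the top attributes.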
VFDT
Concept-adapting VFDT

G. Hulten, L. Spencer, P. Domingos: Mining Time-Changing Data Streams, KDD 2001

- Model consistent with a sliding window over the stream
- Keeps sufficient statistics also at internal nodes
- Periodically rechecks whether splits pass the Hoeffding test
  - If a test fails, grows an alternate subtree and swaps it in when the accuracy of the alternate is better
- Processes updates in O(1) time, with O(W) memory
  - Increases counters for the incoming instance, decreases counters for the instance leaving the window
Hoeffding Adaptive Tree

A. Bifet, R. Gavalda: Adaptive Parameter-free Learning from Evolving Data Streams, IDA 2009

- Replaces frequency counters by estimators
  - No need for a window of examples
  - Sufficient statistics kept by estimators separately
- Parameter-free change detector + estimator with theoretical guarantees for subtree swap (ADWIN)
  - Keeps a sliding window consistent with the no-change hypothesis
Hoeffding Algorithms

- Classification: Mining high-speed data streams. P. Domingos, G. Hulten. KDD 2000
- Regression: Learning model trees from evolving data streams. Ikonomovska, Gama, Dzeroski. Data Min. Knowl. Discov. 2011
- Rules: Learning Decision Rules from Data Streams. J. Gama, P. Kosina. IJCAI 2011
- Clustering: Hierarchical Clustering of Time-Series Data Streams. Rodrigues, Gama. IEEE TKDE 20(5): 615-627 (2008)
- Multiple models:
  - Ensembles of Restricted Hoeffding Trees. Bifet, Frank, Holmes, Pfahringer. ACM TIST 2012
  - Ensembles of Adaptive Model Rules from High-Speed Data Streams. J. Duarte, J. Gama. BigMine 2014
- ...
Option Trees

Speeding-Up Hoeffding-Based Regression Trees With Options, Ikonomovska et al., ICML 2011

Use option nodes to solve ties.
Rules

Problem: very large decision trees have context that is complex and hard to understand.

- Rules: self-contained, modular, easier to interpret, no need to cover the universe
- Each rule keeps sufficient statistics (L) to:
  - make predictions
  - expand the rule
  - detect changes and anomalies
Adaptive Model Rules

Adaptive Model Rules from Data Streams, Almeida, Ferreira, Gama, ECML/PKDD 2013

- Ruleset: an ensemble of rules
- Rule prediction: mean or a linear model
- Ruleset prediction:
  - Ordered: only the first rule covers the instance
  - Unordered: weighted average of the predictions of the rules covering instance x
    - Weights inversely proportional to the error
AMRules Induction

- Rule creation: default rule expansion
- Rule expansion: split on the attribute maximizing the σ reduction
  - Hoeffding bound: ε = √(R² ln(1/δ) / (2n))
  - Expand when σ_1st/σ_2nd < 1 − ε
- Evict a rule when the Page-Hinkley (P-H) test signals an alarm
- Detect and explain local anomalies
Clustering Time-series

Hierarchical Clustering of Time-Series Data Streams. Rodrigues, Gama. TKDE 2008

Uses the Pearson correlation as the splitting criterion.
Hoeffding Algorithms: Analysis

The number of examples required to expand a node only depends on the Hoeffding bound: ε shrinks at rate O(1/√n).

- Low-variance models: stable decisions with statistical support
- Low overfitting: examples are processed only once
- No need for pruning: decisions come with statistical support
- Convergence: the model produced by a Hoeffding algorithm becomes asymptotically close to that of a batch learner. The expected disagreement is δ/p, where p is the probability that an example falls into a leaf.
Bibliography on Predictive Learning

- Mining High-Speed Data Streams. Domingos, Hulten. SIGKDD 2000
- Mining time-changing data streams. Hulten, Spencer, Domingos. KDD 2001
- Efficient Decision Tree Construction on Streaming Data. R. Jin, G. Agrawal. SIGKDD 2003
- Accurate Decision Trees for Mining High-Speed Data Streams. J. Gama, R. Rocha, P. Medas. SIGKDD 2003
- Forest trees for on-line data. J. Gama, P. Medas, R. Rocha. SAC 2004
- Learning decision trees from dynamic data streams. Gama, Medas, Rodrigues. SAC 2005
- Decision trees for mining data streams. Gama, Fernandes, Rocha. Intelligent Data Analysis, Vol. 10, 2006
- Handling Time-Changing Data with Adaptive Very Fast Decision Rules. Kosina, Gama. ECML-PKDD 2012
- Learning model trees from evolving data streams. Ikonomovska, Gama, Dzeroski. Data Min. Knowl. Discov. 2011
Definition

- Novelty detection refers to the automatic identification of unforeseen phenomena embedded in a large amount of normal data.
- Novelty is a relative concept with regard to our current knowledge:
  - It must be defined in the context of a representation of our current knowledge.
- Especially useful when novel concepts represent abnormal or unexpected conditions:
  - It is expensive to obtain abnormal examples
  - It is probably impossible to simulate all possible abnormal conditions
Context

- In real problems, as time goes by:
  - The distribution of known concepts may change
  - New concepts may appear
- By monitoring the data stream, emerging concepts may be discovered
- Emerging concepts may represent:
  - An extension to a known concept (extension)
  - A novel concept (novelty)
- Several interesting applications: early detection of faults in jet engines, intrusion detection in computer networks, breaking news in a flow of text documents (news articles), bursts of gamma rays (astronomical data)
One-Class Classification

Autoassociator Networks

Concept-learning in the absence of counter-examples: an autoassociation-based approach. Nathalie Japkowicz, 1999

- Three-layer network
- The number of neurons in the output layer is equal to that of the input layer
- Train the network such that y⃗ is equal to x⃗
- The network is trained to reproduce the input at the output layer
To classify a test example x⃗:

- Propagate x⃗ through the network and let y⃗ be the corresponding output
- If ∑_{i=1}^{k} (x_i − y_i)² < Threshold, then the example is considered to belong to the normal class
- Otherwise, x⃗ is a counter-example of the normal class
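The decision rule above is simply a threshold on the squared reconstruction error. A sketch with an illustrative stand-in reconstruction function (a trained autoassociator would take its place; names and values are ours):

```python
def is_normal(x, reconstruct, threshold):
    """Flag x as 'normal' when the squared reconstruction error is small."""
    y = reconstruct(x)
    err = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return err < threshold

# Stand-in "autoassociator": always outputs the mean of the normal class.
# A network trained only on normal data plays this role in practice.
normal_mean = [1.0, 2.0]
reconstruct = lambda x: normal_mean
```

Points near what the model can reproduce pass the test; points it cannot reproduce well are rejected as counter-examples of the normal class.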
Novelty detection

- Training set (offline phase)
  - D_tr = {(X_1, y_1), (X_2, y_2), ..., (X_m, y_m)}
  - X_i: vector of input attributes for the i-th example; y_i: target attribute
  - y_i ∈ Y_tr, where Y_tr = {c_1, c_2, ..., c_L}
- When new data arrive (online phase)
  - Given a sequence of unlabelled examples X_new
  - Goal: classify X_new in Y_all, where Y_all = {c_1, c_2, ..., c_L, ..., c_K} and K > L
Novelty Detection Systems

- ECSMiner: assumes that the class label of new examples is known
- OLINDDA: unsupervised, but restricted to binary classification problems
- MINAS (MultI-class learNing Algorithm for data Streams)
  - Does not use the class labels of new examples
  - Can deal with novelty detection in multi-class data stream problems
OLINDDA algorithm

OnLIne Novelty and Drift Detection Algorithm
Spinosa, Carvalho, Gama: OLINDDA: a cluster-based approach for detecting novelty and concept drift in data streams. SAC 2007

- Offline and online phases
- Models: normal, extension and novelty
- Each model is represented by a set of clusters
- Not suitable for multi-class problems
ECSMiner algorithm

Masud, Gao, Khan, Han, Thuraisingham: Classification and novel class detection in concept-drifting data streams under time constraints. TKDE 2011

A supervised algorithm integrating novel concepts and concept drift:

- Ensemble of classifiers
  - Creates a new model when all examples in a chunk are labeled
- Supposes that all examples in the stream will be labeled (after a delay of T_l time units)
- An instance will be classified within T_c time units of its arrival
MINAS algorithm

MINAS: Multiclass Learning Algorithm for Novelty Detection in Data Streams. E. Faria, J. Gama, A. Carvalho. DAMI (to appear)

- Unsupervised algorithm for novelty detection in multi-class data stream problems
  - Represents each known class by a set of hyperspheres
- Uses offline (training) and online phases
  - In each phase, learns one or more classes
- A cohesive set of examples is necessary to learn new concepts or extensions
  - Isolated examples are not considered as novelty
MINAS: Offline phase

- Learns a decision model based on the known concepts of the problem
  - KMeans or CluStream
- Runs only once
- Each class is represented by a set of clusters (hyperspheres)
MINAS: Online phase

- Receives new examples from the stream
- Classifies each new example
  - In one of the known classes, or
  - As unknown
- Cohesive groups of unknown examples are used to detect new classes or extensions
Novelty Detection Bibliography

- Masud, Gao, Khan, Han, Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. TKDE 2011
- Spinosa, Carvalho, Gama. OLINDDA: a cluster-based approach for detecting novelty and concept drift in data streams. SAC 2007
- Faria, Gama, Carvalho. MINAS: Multiclass Learning Algorithm for Novelty Detection in Data Streams. DAMI (to appear)
- P. Angelov, X. Zhou. Evolving fuzzy-rule-based classifiers from data streams. Trans. Fuzzy Syst. 2008
- D. Tax, R. Duin. Growing a multi-class classifier with a reject option. Pattern Recognit. Lett. 2008
- F. Denis, R. Gilleron, F. Letouzey. Learning from positive and unlabeled examples. Theoretical Comput. Sci. 2005
- D. Cardoso, F. Franca. A Bounded Neural Network for Open Set Recognition. IJCNN 2015
Introduction

- Frequent pattern mining refers to finding patterns that occur more often than a pre-specified threshold value.
- Patterns refer to items, itemsets, or sequences.
- The threshold refers to the percentage of pattern occurrences over the total number of transactions. It is termed support.
- Finding frequent patterns is the first step for the discovery of association rules of the form A → B.
- The Apriori algorithm represents pioneering work in association rule discovery:
  R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules. VLDB 1994
- An important step towards improving the performance of association rule discovery was FP-Growth:
  J. Han, J. Pei, Y. Yin. Mining Frequent Patterns without Candidate Generation. SIGMOD 2000
- Many measures have been proposed for finding the strength of rules.
- The most frequently used measure is support:
  - The support Supp(X) of an itemset X is defined as the proportion of transactions in the data set that contain the itemset.
- Another frequently used measure is confidence:
  - Confidence refers to the probability that set B exists given that A already exists in a transaction.
  - Confidence(A → B) = Supp(A ∪ B) / Supp(A)
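Both measures can be computed directly from a list of transactions. A small illustrative sketch (function names and the toy transactions are ours):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(a, b, transactions):
    """Conf(A -> B) = Supp(A union B) / Supp(A)."""
    return support(a | b, transactions) / support(a, transactions)
```

For example, with transactions {milk, bread}, {milk, butter}, {bread}, {milk, bread, butter}, the itemset {milk} has support 3/4, and the rule milk → bread has confidence (2/4)/(3/4) = 2/3.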
Frequent Pattern Mining in Data Streams

The process of frequent pattern mining over data streams differs from the conventional one as follows:

- The technique should be linear or sublinear: you have only one look.
- Variants: heavy hitters, top-k, frequent items, and itemsets.
Frequent Items (Heavy Hitters) in Data Streams

Manku and Motwani have two master algorithms in this area:

- Sticky Sampling
- Lossy Counting

G. S. Manku, R. Motwani. Approximate Frequency Counts over Data Streams. VLDB 2002, Hong Kong, China.
Sticky Sampling

Sticky sampling is a probabilistic technique.

- The user inputs three parameters:
  - Minimum support (s)
  - Admissible error (ε)
  - Probability of failure (δ)
- A simple data structure is maintained with entries of data elements and their associated frequencies (e, f).
- The sampling rate decreases gradually with the increase in the number of processed data elements: t = (1/ε) log(s⁻¹ δ⁻¹)
- For each incoming element in the data stream, the data structure is checked for an entry:
  - If an entry exists, increment the frequency.
  - Otherwise, sample the element with the current sampling rate: if selected, add a new entry, else ignore the element.
- With every change in sampling rate, an unbiased coin is tossed for each entry, decreasing the frequency with every unsuccessful coin toss.
- If the frequency goes down to zero, the entry is released.
Lossy Counting

- Lossy counting is a deterministic technique.
- The user inputs two parameters:
  - Minimum support (s)
  - Admissible error (ε)
- The data structure has entries of data elements and their associated frequencies (e, f, Δ), where Δ is the maximum possible error in f.
- The stream is conceptually divided into buckets of width w = 1/ε.
- Each bucket is labeled by a value of ⌈N/w⌉, where N starts from 1 and increases by 1.
- For a new incoming element, the data structure is checked:
  - If an entry exists, increment the frequency.
  - Otherwise, add a new entry with Δ = b_current − 1, where b_current is the current bucket label.
- When switching to a new bucket, all entries with f + Δ ≤ b_current are deleted.
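A compact sketch of the procedure above, using a dictionary and pruning at bucket boundaries (the function and variable names are ours, and the error handling of a production implementation is omitted):

```python
def lossy_counting(stream, epsilon):
    """Returns {element: [f, delta]}; undercounts by at most epsilon * N."""
    w = int(1 / epsilon)              # bucket width
    counts = {}                       # e -> [f, delta]
    for n, e in enumerate(stream, start=1):
        bucket = -(-n // w)           # ceil(n / w): current bucket label
        if e in counts:
            counts[e][0] += 1         # existing entry: bump frequency
        else:
            counts[e] = [1, bucket - 1]
        if n % w == 0:                # bucket boundary: prune weak entries
            doomed = [k for k, (f, d) in counts.items() if f + d <= bucket]
            for k in doomed:
                del counts[k]
    return counts
```

Rare elements are repeatedly inserted and pruned, so memory stays small, while any element whose true frequency exceeds εN survives with its count off by at most Δ.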
Error Analysis

Output:

- Elements with counter values exceeding s·N − ε·N

How much do we undercount?

- If the current size of the stream is N and the window size is 1/ε, then the frequency error ≤ #windows = ε·N

Approximation guarantees:

- Frequencies underestimated by at most ε·N
- No false negatives
- False positives have true frequency at least s·N − ε·N

How many counters do we need?

- Worst case: (1/ε)·log(ε·N) counters
Pattern mining: definitions

Patterns: sets with a subpattern relation ⊂

{cheese, milk} ⊂ {milk, peanuts, cheese, butter}
(search → buy) ⊂ (home → search → cart → buy → exit)

Applications: market basket analysis, intrusion detection, churn prediction, feature selection, XML query analysis, query and clickstream analysis, anomaly detection, ...
Pattern mining in streams: definitions

- The support of a pattern T in a stream S at time t is the probability that a pattern T′ drawn from S's distribution at time t is such that T ⊂ T′
- Typical task: given access to S, at all times t, produce the set of patterns T with support at least ε at time t
- A pattern is closed if no superpattern has the same support
- No information is lost if we focus only on closed patterns

Key data structure: the lattice of patterns, with counts
Fundamentals
I A priori property: t ⊆ t ′ ⇒ support(t) ≥ support(t ′)
I Closed: none of its supersets has the same support.
Can generate all freq. itemsets and their support.
I Maximal: none of its supersets is frequent.
Can generate all freq. itemsets (without support).
I Maximal ⊆ Closed ⊆ Frequent ⊆ D
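On small data, the three families and the inclusion above can be illustrated by brute force. The function below is a toy sketch for intuition only, not a stream algorithm.

```python
from itertools import combinations

def mine(transactions, min_support):
    """Brute-force toy: frequent, closed and maximal itemsets."""
    items = sorted({i for t in transactions for i in t})
    support = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = sum(1 for t in transactions if set(cand) <= t)
            if s >= min_support:
                support[frozenset(cand)] = s
    frequent = set(support)
    # closed: no frequent superset has the same support
    closed = {x for x in frequent
              if not any(x < y and support[y] == support[x] for y in frequent)}
    # maximal: no frequent superset at all, so maximal <= closed <= frequent
    maximal = {x for x in frequent if not any(x < y for y in frequent)}
    return frequent, closed, maximal
```

For transactions {milk, cheese}, {milk, cheese, butter}, {milk} with minimum support 2: {cheese} is frequent but not closed (its superset {milk, cheese} has the same support), and {milk, cheese} is the single maximal itemset.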
FP-Stream
C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: Mining frequent patterns in data streams at multiple time granularities. NGDM (2003)
I Multiple time granularities
I Based on FP-Growth (depth-first search over itemset lattice)
I Pattern-tree with tilted-time window.
Tilted-time window: logarithmically aggregated time slots (log number of levels; aggregate when a level is full and push the aggregate one level up)
I Time sensitive queries, emphasis on recent history
I High time and memory complexity
Moment
Y. Chi , H. Wang, P. Yu , R. Muntz: Moment: Maintaining Closed
Frequent Itemsets over a Stream Sliding Window. ICDM 2004
I Keeps track of the boundary below frequent itemsets
I Closed Enumeration Tree (CET) (≈ prefix tree)
I Infrequent gateway nodes (infrequent)
I Unpromising gateway nodes (infrequent, dominated)
I Intermediate nodes (frequent, dominated)
I Closed nodes (frequent)
I By adding/removing transactions, closed/infrequent nodes do not change
Itemset mining
I MOMENT (Chi+ 04) (Sliding window, frequent closed, exact)
I CLOSTREAM (Yen+ 09) (Sliding window, all closed, exact)
I MFI (Li+ 09) (Transaction-sensitive window, frequent closed, exact)
I IncMine (Cheng+ 08) (Sliding window, frequent closed, approximate; faster for moderate approximation ratios)
Sequence, trees, and graph mining
I Frequent subsequence mining: MILE (Chen+ 05), SMDS (Marascu-Masseglia 06), SSBE (Koper-Nguyen 11)
I Bifet+ 08: Frequent closed unlabeled subtree mining
I Bifet+ 11: Frequent closed labeled subtree mining; frequent closed labeled subgraph mining
Bibliography on Frequent Items
I What's Hot and What's Not: Tracking Most Frequent Items Dynamically, by G. Cormode, S. Muthukrishnan, PODS 2003.
I Dynamically Maintaining Frequent Items Over A Data Stream, by C. Jin, W. Qian, C. Sha, J. Yu, A. Zhou, CIKM 2003.
I Processing Frequent Itemset Discovery Queries by Division and Set Containment Join Operators, by R. Rantzau, DMKD 2003.
I Approximate Frequency Counts over Data Streams, by G. S. Manku, R. Motwani, VLDB 2002.
I Finding Hierarchical Heavy Hitters in Data Streams, by G. Cormode, F. Korn, S. Muthukrishnan, D. Srivastava, VLDB 2003.
I Mining Frequent Patterns without Candidate Generation, by J. Han, J. Pei, Y. Yin, SIGMOD 2000.
I Efficient Computation of Frequent and Top-k Elements in Data Streams, by A. Metwally, D. Agrawal, A. El Abbadi, ICDT 2005.
I Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window, by Y. Chi, H. Wang, P. S. Yu, R. Muntz, ICDM 2004.
I Mining Frequent Patterns in Data Streams at Multiple Time Granularities, by C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu, NGDM 2003.
Outline
1 Evaluation
2 Non Distributed Open Source Tools
3 Distributed Open Source Tools
4 Applications
Data stream classification cycle
1 Process an example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
Evaluation
1 Error estimation: Hold-out or Prequential
2 Evaluation performance measures: Accuracy or κ-statistic
3 Statistical significance validation: McNemar or Nemenyi test
Evaluation Framework
Error Estimation
Data available for testing
Hold out an independent test set
Apply the current decision model to the test set, at regular time intervals
The loss estimated in the holdout is an unbiased estimator
Holdout Evaluation
1. Error Estimation
No data available for testing
The error of a model is computed from the sequence of examples.
For each example in the stream, the current model first makes a prediction; the example is then used to update the model.
Prequential or Interleaved-Test-Then-Train
1. Error Estimation
Hold-out or Prequential?
Hold-out is more accurate, but needs data for testing.
Use prequential to approximate Hold-out
Estimate accuracy using sliding windows or fading factors
Hold-out or Prequential or Interleaved-Test-Then-Train
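Prequential evaluation with a fading factor can be sketched in a few lines of Python. The `MajorityClass` learner and the `predict`/`update` method names are illustrative; any incremental learner with that shape would do.

```python
class MajorityClass:
    """Toy learner: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def prequential_accuracy(stream, model, alpha=0.99):
    """Interleaved test-then-train accuracy, faded with factor alpha.
    alpha = 1.0 gives the plain (unfaded) prequential estimate."""
    s = b = 0.0
    history = []
    for x, y in stream:
        correct = 1.0 if model.predict(x) == y else 0.0  # test first ...
        model.update(x, y)                               # ... then train
        s = correct + alpha * s      # faded count of correct predictions
        b = 1.0 + alpha * b          # faded count of examples
        history.append(s / b)
    return history
```

An alpha below 1 weights recent examples more, so the estimate tracks the current model rather than averaging over its whole (weaker) history.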
2. Evaluation performance measures
                 Predicted Class+   Predicted Class-   Total
Correct Class+          75                  8            83
Correct Class-           7                 10            17
Total                   82                 18           100

Table: Simple confusion matrix example
Accuracy = 75/100 + 10/100 = (75/83) × (83/100) + (10/17) × (17/100) = 85%
Arithmetic mean = (75/83 + 10/17)/2 = 74.59%
Geometric mean = √((75/83) × (10/17)) = 72.90%
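These quantities can be checked with a few lines of Python; the numbers come from the confusion matrix example above.

```python
# Confusion matrix: rows = correct class, columns = predicted class
tp, fn = 75, 8     # Correct Class+ predicted as +, as -
fp, tn = 7, 10     # Correct Class- predicted as +, as -
n = tp + fn + fp + tn

accuracy = (tp + tn) / n                       # 0.85
recall_pos = tp / (tp + fn)                    # 75/83
recall_neg = tn / (fp + tn)                    # 10/17
arithmetic = (recall_pos + recall_neg) / 2     # ~74.59%
geometric = (recall_pos * recall_neg) ** 0.5   # ~72.90%
```

The gap between the 85% accuracy and the ~74.6% mean recall already hints that plain accuracy flatters the classifier on this unbalanced data.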
2. Performance Measures with Unbalanced Classes
                 Predicted Class+   Predicted Class-   Total
Correct Class+          75                  8            83
Correct Class-           7                 10            17
Total                   82                 18           100

Table: Simple confusion matrix example

                 Predicted Class+   Predicted Class-   Total
Correct Class+        68.06              14.94           83
Correct Class-        13.94               3.06           17
Total                 82                 18              100

Table: Confusion matrix for chance predictor
2. Performance Measures with Unbalanced Classes
Kappa Statistic
p0: classifier’s prequential accuracy
pc: probability that a chance classifier makes a correct prediction.
κ statistic
κ = (p0 − pc) / (1 − pc)
κ = 1 if the classifier is always correct
κ = 0 if the predictions coincide with the correct ones as often as those of the chance classifier
Forgetting mechanism for estimating prequential kappa
Sliding window of size w with the most recent observations
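The κ statistic for the example confusion matrix can be computed directly; the chance accuracy pc comes from the row and column marginals, as in the chance-predictor table.

```python
obs = [[75, 8], [7, 10]]                 # observed confusion matrix
n = 100

row = [sum(r) for r in obs]              # actual class totals:    [83, 17]
col = [sum(c) for c in zip(*obs)]        # predicted class totals: [82, 18]

p0 = (obs[0][0] + obs[1][1]) / n                    # observed accuracy 0.85
pc = sum(r * c for r, c in zip(row, col)) / n ** 2  # chance accuracy 0.7112

kappa = (p0 - pc) / (1 - pc)             # ~0.48
```

So an 85% accuracy on this unbalanced data corresponds to only κ ≈ 0.48 above chance.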
Outline
1 Evaluation
2 Non Distributed Open Source Tools
3 Distributed Open Source Tools
4 Applications
VFML
Very Fast Machine Learning
Developed by Pedro Domingos and his team
Contains first implementation of Hoeffding Tree
VFDT: Very Fast Decision Tree
CVFDT: Concept-adapting Very Fast Decision Tree
Does not contain ensembles
Implemented in C
No longer maintained (since 2003)
VW
Vowpal Wabbit
Developed by John Langford at Yahoo Research and Microsoft Research
Used in Microsoft Azure Machine Learning
Single Classifier until 2013
Distributed using MPI
Based on the Hashing Trick
Sofia-ML
Developed by David Sculley, at Google
Good design of the software
Contains
Fast online learners
Fast k-means clustering
{M}assive {O}nline {A}nalysis MOA (Bifet et al. 2010)
{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.
It is closely related to WEKA
It includes a collection of offline and online algorithms as well as tools for evaluation:
classification, regression
clustering
frequent pattern mining
Easy to extend
Easy to design and run experiments
WEKA
Waikato Environment for Knowledge Analysis
Collection of state-of-the-art machine learning algorithms and data processing tools implemented in Java
Released under the GPL
Support for the whole process of experimental data mining
Preparation of input data
Statistical evaluation of learning schemes
Visualization of input data and the result of learning
Used for education, research and applications
Complements “Data Mining” by Witten & Frank & Hall
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like theWeka, but also extinct.
Classification Experimental Setting
Classification Experimental Setting
Evaluation procedures for DataStreams
Holdout
Interleaved Test-Then-Train orPrequential
Classification Experimental Setting
Data Sources
Random Tree Generator
Random RBF Generator
LED Generator
Waveform Generator
Hyperplane
SEA Generator
STAGGER Generator
Classification Experimental Setting
Classifiers
Naive Bayes
Decision stumps
Hoeffding Tree
Hoeffding Option Tree
Bagging and Boosting
ADWIN Bagging andLeveraging Bagging
Clustering Experimental Setting
Clustering Experimental Setting
Internal measures               External measures
Gamma                           Rand statistic
C Index                         Jaccard coefficient
Point-Biserial                  Folkes and Mallow Index
Log Likelihood                  Hubert Γ statistics
Dunn's Index                    Minkowski score
Tau                             Purity
Tau A                           van Dongen criterion
Tau C                           V-measure
Somer's Gamma                   Completeness
Ratio of Repetition             Homogeneity
Modified Ratio of Repetition    Variation of information
Adjusted Ratio of Clustering    Mutual information
Fagan's Index                   Class-based entropy
Deviation Index                 Cluster-based entropy
Z-Score Index                   Precision
D Index                         Recall
Silhouette coefficient          F-measure
Table: Internal and external clustering evaluation measures.
Clustering Experimental Setting
Clusterers
StreamKM++
CluStream
ClusTree
Den-Stream
D-Stream
CobWeb
Web
http://www.moa.cms.waikato.ac.nz
Easy Design of a MOA classifier
void resetLearningImpl ()
void trainOnInstanceImpl (Instance inst)
double[] getVotesForInstance (Instance i)
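MOA itself is written in Java; the following Python analogue of the three-method contract (here a majority-class learner) only illustrates its shape. Passing the label separately from the instance is a simplification of MOA's `Instance`, which carries its class value.

```python
class MajorityClassifier:
    """Illustrative Python analogue of MOA's classifier contract:
    resetLearningImpl / trainOnInstanceImpl / getVotesForInstance."""

    def __init__(self):
        self.reset_learning()

    def reset_learning(self):
        # Forget everything learned so far
        self.counts = {}

    def train_on_instance(self, instance, label):
        # Incremental update from a single labeled example
        self.counts[label] = self.counts.get(label, 0) + 1

    def get_votes_for_instance(self, instance, labels):
        # Return one vote (here: relative frequency) per class
        total = sum(self.counts.values()) or 1
        return [self.counts.get(lbl, 0) / total for lbl in labels]
```

The point of the contract is that training is one instance at a time and prediction is available at any moment, matching the data stream classification cycle.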
Easy Design of a MOA clusterer
void resetLearningImpl ()
void trainOnInstanceImpl (Instance inst)
Clustering getClusteringResult()
Extensions of MOA
Multi-label Classification
Active Learning
Regression
Closed Frequent Graph Mining
Twitter Sentiment Analysis
streamDM C++
http://streamdm.noahlab.com.hk/
Outline
1 Evaluation
2 Non Distributed Open Source Tools
3 Distributed Open Source Tools
4 Applications
streams Framework
Developed by Christian Bockermann at University ofDortmund
Uses MOA for Machine Learning methods
Integrates with Storm
RapidMiner Streams Plugin
Apache Mahout
Scalable machine learning library
Current version runs on Hadoop
Some methods are streaming to scale
New version in Scala, to run on Spark
Jubatus
Developed by Nippon Telegraph and Telephone
Open source online machine learning and distributedcomputing framework
Implemented in C++
Apache SAMOA(De Francisci & Bifet 2015)
SAMOA is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms.
Apache SAMOA
[Architecture diagram: samoa-SPE — the SAMOA layer (algorithm and API) runs on an SPE-adapter targeting S4, Storm, and other SPEs, and on an ML-adapter wrapping MOA and other ML frameworks; deployment modules: samoa-S4, samoa-storm, samoa-other-SPEs]
Apache SAMOA
SAMOA ML Developer API
[Diagram: an algorithm is assembled from Processing Items, Processors, and Streams]
Web
http://samoa-project.net/
Apache Flink
streamDM
http://streamdm.noahlab.com.hk/
streamDM
New project specifically designed for Spark Streaming
Spark Streaming: latency in seconds
Easy to integrate in Spark systems
Designed in Scala
Classification, Regression, Clustering, Frequent Pattern Mining
Outline
1 Evaluation
2 Non Distributed Open Source Tools
3 Distributed Open Source Tools
4 Applications
Twitter: A Massive Data Stream
Web 2.0
Micro-blogging service
Built to discover what is happening at any moment in time, anywhere in the world.
3 billion requests a day via its API.
Twitter Streaming API
Twitter APIs
Streaming API
Two discrete REST APIs
Real-time access to Tweets
sampled form
filtered form
HTTP based
GET
POST
DELETE
Sentiment Analysis on Twitter
Sentiment analysis
Classifying messages into two categories depending on whether they convey positive or negative feelings
Emoticons are visual cues associated with emotional states, which can be used to define class labels for sentiment classification
Positive emoticons: :)  :-)  : )  :D  =)
Negative emoticons: :(  :-(  : (

Table: List of positive and negative emoticons.
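Emoticon-based labeling can be sketched as below; this is a toy illustration, and the helper names are my own. Stripping the emoticons before training keeps the classifier from simply memorizing the label cue.

```python
# Emoticon lists from the table above
POSITIVE = [":)", ":-)", ": )", ":D", "=)"]
NEGATIVE = [":(", ":-(", ": ("]

def emoticon_label(tweet):
    """Class label derived from emoticons; None if absent or ambiguous."""
    pos = any(e in tweet for e in POSITIVE)
    neg = any(e in tweet for e in NEGATIVE)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None            # no cue, or conflicting cues: skip the tweet

def strip_emoticons(tweet):
    """Remove the emoticons so the label is not leaked into the features."""
    for e in POSITIVE + NEGATIVE:
        tweet = tweet.replace(e, "")
    return tweet.strip()
```

Each labeled, stripped tweet can then be fed to any of the streaming classifiers discussed earlier, with the emoticon-derived label as the class.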
Outline
Final Comments
Open Challenges
Open Challenges
I Structured input and output
I Multi-target, multi-task and transfer learning
I Millions of classes
I Visualization
I Distributed Streams
I Representation learning
I Ease of use
Lessons Learned
Learning from data streams:
I Learning is not one-shot: it is an evolving process;
I We need to monitor the learning process;
I Opens the possibility of reasoning about the learning process
Reasoning about the Learning Process
Intelligent systems must:
I be able to adapt continuously to changing environmental conditions and evolving user habits and needs.
I be capable of predictive self-diagnosis.
The development of such self-configuring, self-optimizing, and self-repairing systems is a major scientific and engineering challenge.