Flink Forward
Streaming & Parallel Decision Tree in Flink
anwar.rizal @anrizal
Outline: Motivation, Architecture, Decision Trees, Implementation, Conclusion
Motivation
Need a classifier system on streaming data: the data used for learning come as a stream, and so do the data to be classified.
[Figure: two weeks of quoted fares for a flight, mostly $90 with occasional $75 discounts, spikes to $120/$150/$200, and sold-out days; the fare is predicted to increase in zero to two days, this week, or next week.]
[Figure: route dashboard. FRA–NYC needs attention (revenue decrease); FRA–LON needs attention (passenger decrease); FRA–MEX needs attention (revenue decrease, cost increase).]
MUST:
• The classifier is kept fresh: no separate batch learning/evaluation is needed, and feedback is taken into account regularly, in real time.
• The classifier can be introspected: transparent model structure (e.g. the tree itself and the information gain at each split point) and known expected performance (accuracy, precision, recall, AUC).
• Seamless support for the machine-learning workflow: data preprocessing (up/down sampling, imputation, …), feature selection, model evaluation, cross validation, …
NICE TO HAVE:
• The classifier is immediately available: it can already predict during learning, and when a learning phase terminates it starts another cycle of learning.
• The classifier has a meta-learning capability: it maintains several models with different parameters, so it is possible to learn about the learning capability of the models themselves.
[Figure: the learner's lifecycle: learning; classifying during learning; end of learning; classifying; then a new cycle of learning.]
[Figure: architecture. A Stream Learner consumes labeled points and produces a Classifier; a Classifying Application uses the Classifier to turn unlabeled points into predicted points.]
Decision Trees
From origin to recent developments
• "Understand data by asking a sequence of questions": Classification and Regression Trees (CART), Breiman et al., 1984
• "Pool decision trees to improve generalization": Random Forests, Breiman, 1999
• "Let's play: pose estimation for Xbox's Kinect": Shotton et al., 2011
Streaming Decision Trees
• "A classifier for streaming data with a bound": Hoeffding Tree (VFDT), Domingos & Hulten, 2000
• "Use of approximate histograms for decision trees": Streaming and Parallel Decision Tree, Ben-Haim & Tom-Tov, 2010
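The bound referred to in VFDT is the Hoeffding bound (not spelled out on the slide): if a real-valued quantity with range R has been observed n times, then with probability 1 - δ its true mean is within ε of the observed mean, where

\[
\varepsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}
\]

VFDT uses this bound to decide when enough points have been seen to commit to a split.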
Train a decision tree - get the intuition!
[Figure: labeled points in the reservation subspace (advance purchase vs. class: FIRST, BUSINESS, ECONOMY), with Business/Leisure supervision. Successive splits isolate groups such as busy procrastinators, tourists, foreseeing businessmen who save money for the company, and the odd outlier (Brad Pitt).]
Classifying - get the intuition!
[Figure: the same reservation subspace; a new point is routed down the tree to a Business or Leisure leaf, together with a confidence measure.]
Decision tree - node optimization
At each node, the split is chosen using the Information Gain.
[Figure: the reservation subspace (advance purchase vs. class) with candidate splits.]
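For reference, the standard definition of information gain when splitting the data D at a node into left and right parts D_L and D_R:

\[
IG(D) = H(D) - \frac{|D_L|}{|D|} H(D_L) - \frac{|D_R|}{|D|} H(D_R),
\qquad
H(D) = -\sum_{c} p_c \log_2 p_c
\]

where p_c is the fraction of points in D carrying label c.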
The batch version of decision trees requires a view of the full learning data set. In streaming, each point can be seen only once, and processing must be fast: we cannot afford much access to disk.
Streaming Decision Tree – get the intuition!
Instead of using every point, the points are compressed; the real position of each point is then approximated.
[Figure: the same training example, but each region stores only a count of points nearby instead of the points themselves; the splits into busy procrastinators, tourists, and foreseeing businessmen are recovered approximately.]
Streaming Decision Tree – the Question
"How to find split points for a decision tree?"
[Figure: histogram of feature 1 for label 1: counts over bins centered at 2, 5, 7.5, 9, 11.]
Compressing Data
An approximate histogram is built for each label/feature pair.
[Figure: per-label histograms of feature 1 (label 0; label 1 with bins at 2, 5, 7.5, 9, 11; label n with bins at 4, 6, 8.5, 10, 13) and the merged Total histogram with bins at 1, 3.5, 7, 11, 14.]
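The approximate histogram of Ben-Haim & Tom-Tov can be sketched as follows. This is a minimal Java illustration (class and method names are ours, not from the talk): keep at most maxBins (center, count) pairs, and whenever the cap is exceeded, merge the two closest bins into their weighted average. One such histogram is kept per label/feature (per tree node).

```java
import java.util.*;

// Sketch of a capped-bin approximate histogram (Ben-Haim & Tom-Tov style).
class ApproxHistogram {
    private final TreeMap<Double, Long> bins = new TreeMap<>();
    private final int maxBins;

    ApproxHistogram(int maxBins) { this.maxBins = maxBins; }

    // Add one point from the stream: insert a unit bin, then shrink if needed.
    void update(double p) {
        bins.merge(p, 1L, Long::sum);
        shrink();
    }

    // Combine with a histogram built on another partition of the stream.
    void merge(ApproxHistogram other) {
        other.bins.forEach((c, n) -> bins.merge(c, n, Long::sum));
        shrink();
    }

    // While over capacity, merge the two closest bin centers.
    private void shrink() {
        while (bins.size() > maxBins) {
            double bestGap = Double.MAX_VALUE;
            double left = bins.firstKey();
            Double prev = null;
            for (double c : bins.keySet()) {
                if (prev != null && c - prev < bestGap) { bestGap = c - prev; left = prev; }
                prev = c;
            }
            double right = bins.higherKey(left);
            long n1 = bins.remove(left), n2 = bins.remove(right);
            // weighted average of the two centers; merge in case it collides
            bins.merge((left * n1 + right * n2) / (n1 + n2), n1 + n2, Long::sum);
        }
    }

    long totalCount() { long t = 0; for (long n : bins.values()) t += n; return t; }
    int numBins() { return bins.size(); }
}
```

The merge operation is what makes the structure parallel-friendly: partial histograms built on different workers (or windows) can be combined without revisiting the data.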
Prepare Split Candidates (1/2)
For each feature, all histograms of the feature are merged. The split candidates are then chosen such that the intervals between two consecutive candidates contain the same number of points (each colored area is as large as the others).
[Figure: the Total histogram for feature 1 (bins at 1, 3.5, 7, 11, 14) with equal-mass candidate points u1, u2.]
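The equal-mass candidate selection can be sketched like this (illustrative Java, our naming; as a simplification we assume all of a bin's mass sits at its center, whereas the original paper's uniform procedure also interpolates within bins):

```java
import java.util.*;

// Pick k split candidates from a merged histogram so that roughly equal
// mass falls between consecutive candidates.
class SplitCandidates {
    static double[] candidates(double[] centers, long[] counts, int k) {
        long total = Arrays.stream(counts).sum();
        double[] out = new double[k];
        int bin = 0;
        long cum = 0;
        for (int i = 1; i <= k; i++) {
            // i-th equal-mass quantile of the total count
            double target = total * i / (double) (k + 1);
            while (bin < counts.length - 1 && cum + counts[bin] < target) {
                cum += counts[bin];
                bin++;
            }
            out[i - 1] = centers[bin]; // mass assumed concentrated at centers
        }
        return out;
    }
}
```

For the Total histogram on the slide (bins 1, 3.5, 7, 11, 14 with counts 10, 20, 40, 20, 10), three candidates land at 3.5, 7, and 11.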
Prepare Split Candidates (2/2)
Find the split point that maximizes the information gain, using the per-feature/label histograms at the candidate points.
[Figure: the Total histogram with candidates u1, u2 to be evaluated.]
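Evaluating a candidate then amounts to computing the information gain from the per-label counts falling left and right of it, as estimated from the label histograms. A minimal sketch (our naming):

```java
import java.util.*;

// Information gain of a candidate split, from per-label left/right counts.
class Gain {
    // Shannon entropy (base 2) of a label-count vector.
    static double entropy(long[] counts) {
        long total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (long c : counts) {
            if (c == 0) continue;
            double p = c / (double) total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Parent entropy minus the count-weighted entropies of the two children.
    static double infoGain(long[] left, long[] right) {
        long nl = Arrays.stream(left).sum(), nr = Arrays.stream(right).sum();
        long[] all = new long[left.length];
        for (int i = 0; i < all.length; i++) all[i] = left[i] + right[i];
        double n = nl + nr;
        return entropy(all) - (nl / n) * entropy(left) - (nr / n) * entropy(right);
    }
}
```

A perfectly separating candidate yields the full parent entropy as gain, while a candidate that leaves both sides with the parent's label mix yields zero; the learner keeps the candidate with the highest gain.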
Determine Split
Subsequent Split – get the intuition!
[Figure: the reservation subspace (advance purchase vs. class) after the first Business/Leisure split.]
The intuition is not exactly precise:
• The histograms can no longer be used for further splits
• And of course, we have already lost the original data
A different data set is used for each iteration.*
* If there are not enough data, the same data can be reinjected instead; Kafka is very good for this.
Implementation

Stream Learner
We use two Kafka streams:
• One for the labeled data stream
• One for the tree developed so far (the topic is also used by classifying applications), because we need to annotate each message with the tree so far
Code Outlines

val kafkaDataStream: DataStream[Point] = … // labeled points from Kafka
val kafkaTreeStream: DataStream[Node] = …  // tree developed so far

// annotate each message with the latest tree
val annotatedDataStream: DataStream[AnnotatedPoint] =
  (kafkaDataStream connect kafkaTreeStream)
    .flatMap(new AnnotateMessageCoFlatMap(…))

// create a histogram per feature / node, windowed by minute
val histograms = annotatedDataStream
  .map { p => toSingletonHistograms(p) }
  .timeWindowAll(Time.of(1, TimeUnit.MINUTES))
  .reduce { (n1, n2) => mergeHistogram(n1, n2) }

// merge the histograms per node
val mergedHistograms = histograms
  .keyBy(_.id)
  .reduce { (n1, n2) => mergeHistogram(n1, n2) }

// split the nodes that have enough points and are worth splitting
val newTree = mergedHistograms
  .filter(hs => haveEnoughPoints(hs) && toSplit(hs))
  .map { n =>
    val splitPoint = maxInformationGain(calculateSplitCandidates(n))
    n.splitAt(splitPoint)
  }
val Histogram → accumulate
var histogram → re-accumulate from 0
Conclusion
Summary
• Streaming algorithms based on approximate histograms have been explained.
• Streaming decision tree algorithms open the possibility of classifiers with interesting properties: freshness and continuous learning.
• Flink together with Kafka allows a clean implementation of the algorithm.
Next Steps
• Random Forests: trees with randomly selected features at each level
• Trees with different spans of data (trees with more but older data might behave worse than trees with less but fresher data: forgetting capabilities)
• Providing information on which types of trees behave better at a given period of time (meta-learning)
Thanks!
Credit to: Yiqing Yan (Eurecom) & Tianshu Yang (Telecom Bretagne), Amadeus Interns