Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Page 1: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Streaming & Parallel Decision Tree in Flink

anwar.rizal @anrizal

Page 2: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Outline

• Motivation
• Architecture
• Decision Trees
• Implementation
• Conclusion

Page 3: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Motivation


Page 4: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Motivation

Need a classifier system on streaming data:

• The data used for learning come as a stream
• So do the data to be classified

Page 5: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Motivation

Price grid (figure):

$90   $90  $120      $90       $90   $150  $200
$90   $75  $90       $90       $90   $90   $90
$120  $90  Sold out  Sold out  $75   $90   $90
$120  $90  $90       $90       $100  $90   $120

Page 6: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Motivation

Price grid (figure), with predictions:

$90   $90  $120      $90       $90   $150  $200
$90   $75  $90       $90       $90   $90   $90
$120  $90  Sold out  Sold out  $75   $90   $90
$120  $90  $90       $90       $100  $90   $120

• (predicted) to increase in zero to two days
• (predicted) to increase this week
• (predicted) to increase next week

Page 7: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Motivation

• FRA – NYC
• FRA – LON
• FRA – MEX

Page 8: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Motivation

• FRA – NYC: needs attention (revenue decrease)
• FRA – LON: needs attention (passenger decrease)
• FRA – MEX: needs attention (revenue decrease, cost increase)

Page 9: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Motivation

Need a classifier system on streaming data:

• The data used for learning come as a stream
• So do the data to be classified

Page 10: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Motivation

MUST

• The classifier is kept fresh
    • No need for separate batch learning/evaluation
    • The feedback is taken into account in real time, regularly
• The classifier can be introspected
    • Transparent model structure (e.g. know the tree and the information gain for each split point)
    • Known expected performance (accuracy, precision, recall, AUC)
• Seamless support for the machine learning workflow
    • Data preprocessing: up/down sampling, imputation, …
    • Feature selection
    • Model evaluation, cross-validation

Page 11: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Motivation

NICE TO HAVE

• The classifier is immediately available
    • The classifier can already predict during learning
    • When a learning phase terminates, it starts another cycle of learning
• The classifier has a meta-learning capability
    • The classifier has several models with different parameters
    • It is possible to learn about the learning capability of the models

Page 12: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Motivation

Figure: the cycle of learning. Learning, classifying during learning (Learning & Classifying), end of learning, classifying, new cycle of learning.

Page 13: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Motivation

Figure (architecture): labeled points feed the Stream Learner, which produces the Classifier. The Classifying Application uses the Classifier to turn unlabeled points into predicted points.

Page 14: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Decision Trees

Page 15: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Decision Trees

From origin to recent developments

• “Understand data by asking a sequence of questions”: Classification and Regression Trees (CART) by Breiman et al. in 1984
• “Pool decision trees to improve generalization”: Random Forests by Breiman in 1999
• “Let’s play: pose estimation for XBox’s Kinect”: Shotton et al. 2011

Page 16: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Decision Trees

Streaming Decision Trees

• “A classifier for streaming data with a bound”: Hoeffding Tree (VFDT), Domingos & Hulten 2000
• “Use of Approximate Histograms for Decision Tree”: Streaming and Parallel Decision Tree, Ben-Haim & Tom-Tov 2010

Page 17: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Train a decision tree - get the intuition!

Figure: reservation subspace. Axes: advance purchase; class (FIRST, BUSINESS, ECONOMY). Regions: "Busy procrastinators", "Tourists", "Foreseeing businessmen", "Brad Pitt", "Save money for the company". Legend: Business, Leisure, Supervision.

Page 18: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Classifying - get the intuition!

Figure: reservation subspace. Axes: advance purchase; class (FIRST, BUSINESS, ECONOMY). Legend: Business, Leisure, + confidence measure.

Page 19: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Decision tree - node optimization

Figure: reservation subspace. Axes: advance purchase; class (FIRST, BUSINESS, ECONOMY). Optimization criterion: Information Gain.
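The formulas on this slide did not survive extraction. As a reminder, and assuming the usual Shannon-entropy definition (the slide may have used a different notation), the information gain of splitting a node's points S at a split point u into a left part S_L and a right part S_R can be written as below, where p_k is the fraction of points in S with label k. The symbols S, S_L, S_R, u and p_k are chosen here for illustration.

```latex
H(S) = -\sum_{k} p_k \log_2 p_k
\qquad
IG(S, u) = H(S) - \frac{|S_L|}{|S|}\, H(S_L) - \frac{|S_R|}{|S|}\, H(S_R)
```

A split that separates the labels perfectly leaves both children with zero entropy, so its gain equals H(S).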

Page 20: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Decision Trees

Streaming Decision Trees

The batch version of decision trees requires a view of the full learning data set

In streaming:

• Each point can only be seen once
• The processing should be fast; we can’t afford too much disk access

Page 21: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Decision Trees

Streaming Decision Tree – get the intuition!

Instead of using every point, the points are compressed

The real position of each point is then approximated

Page 22: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Streaming decision tree - get the intuition!

Figure: reservation subspace. Axes: advance purchase; class (FIRST, BUSINESS, ECONOMY). Regions: "Busy procrastinators", "Tourists", "Foreseeing businessmen". Legend: Business, Leisure, Supervision, + count of points nearby.

Page 23: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Decision Trees

Streaming Decision Tree – the Question

“How to find split points for a decision tree?”

Page 24: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Decision Trees

Compressing Data

An approximate histogram is built for each label/feature

Figure: histogram for label 1 / feature 1. x-axis: Feature 1, with bins at 2, 5, 7.5, 9, 11. y-axis: Count, 0 to 8.
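A minimal sketch of how such an approximate histogram could be maintained, following the update idea of Ben-Haim & Tom-Tov: each new value becomes a singleton bin, and when the number of bins exceeds a budget, the two closest bins are merged into their count-weighted centroid. The names `ApproxHistogram`, `Bin`, `update`, `shrink` and `maxBins` are illustrative assumptions, not the talk's code.

```scala
object ApproxHistogram {
  // Hypothetical bin type: a centroid position and the number of points it stands for.
  final case class Bin(center: Double, count: Double)

  // While there are more than maxBins bins, merge the two adjacent bins with the
  // smallest gap into a single bin at their count-weighted centroid.
  def shrink(bins: Vector[Bin], maxBins: Int): Vector[Bin] =
    if (bins.length <= maxBins) bins
    else {
      // index i such that bins(i) and bins(i + 1) are the closest pair
      val i = (0 until bins.length - 1).minBy(k => bins(k + 1).center - bins(k).center)
      val (a, b) = (bins(i), bins(i + 1))
      val n = a.count + b.count
      val merged = Bin((a.center * a.count + b.center * b.count) / n, n)
      shrink(bins.patch(i, Seq(merged), 2), maxBins)
    }

  // Add one observed value: reuse a bin with exactly that center if there is one,
  // otherwise create a singleton bin; keep bins sorted by center, then shrink.
  def update(bins: Vector[Bin], x: Double, maxBins: Int): Vector[Bin] = {
    require(maxBins >= 1, "maxBins must be at least 1")
    val (same, others) = bins.partition(_.center == x)
    val bin = Bin(x, 1.0 + same.map(_.count).sum)
    shrink((others :+ bin).sortBy(_.center), maxBins)
  }
}
```

For example, folding the values 23, 19, 10, 16, 36, 2, 9 into an empty histogram with `maxBins = 5` first merges 16 and 19 into a bin at 17.5, then 9 and 10 into a bin at 9.5, so seven points are summarized by five bins.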

Page 25: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

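Decision Trees

Prepare Split Candidates (1/2)

For each feature, all histograms of the feature are merged

Figure: per-label histograms of feature 1 (label 0, label 1, ..., label n) merged into a Total histogram. Panels: label 1 / feature 1 (bins at 2, 5, 7.5, 9, 11; counts 0 to 8); label n / feature 1 (bins at 4, 6, 8.5, 10, 13; counts 0 to 8); Total (bins at 1, 3.5, 7, 11, 14; counts 0 to 40).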
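The merge step can be sketched as follows, assuming the merge procedure of Ben-Haim & Tom-Tov: take the union of the bins of both histograms, then repeatedly merge the two closest bins until the bin budget is met. `HistogramMerge`, `Bin`, `merge` and `maxBins` are hypothetical names for illustration.

```scala
object HistogramMerge {
  // Hypothetical bin type: centroid position and number of points represented.
  final case class Bin(center: Double, count: Double)

  // Merge two approximate histograms into one with at most maxBins bins.
  def merge(h1: Vector[Bin], h2: Vector[Bin], maxBins: Int): Vector[Bin] = {
    require(maxBins >= 1, "maxBins must be at least 1")

    @annotation.tailrec
    def shrink(bins: Vector[Bin]): Vector[Bin] =
      if (bins.length <= maxBins) bins
      else {
        // closest adjacent pair of bins
        val i = (0 until bins.length - 1).minBy(k => bins(k + 1).center - bins(k).center)
        val (a, b) = (bins(i), bins(i + 1))
        val n = a.count + b.count
        // replace the pair by its count-weighted centroid
        shrink(bins.patch(i, Seq(Bin((a.center * a.count + b.center * b.count) / n, n)), 2))
      }

    shrink((h1 ++ h2).sortBy(_.center))
  }
}
```

Because the operation only needs the two bin lists, histograms built independently (for instance on parallel workers or in separate time windows) can be combined pairwise, and the total count is preserved.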

Page 26: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Decision Trees

Prepare Split Candidates (2/2)

Get the split candidates such that the intervals between consecutive split candidates contain the same number of points (the colored areas are all equally large)

Figure: Total histogram (bins at 1, 3.5, 7, 11, 14; counts 0 to 40), shown without and with the split candidates u1 and u2.
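One way to compute such equal-mass split candidates is sketched below, assuming the trapezoid interpolation of Ben-Haim & Tom-Tov's "uniform" procedure: the estimated number of points up to a bin center is all earlier bins plus half of that bin, and between two centers the count is interpolated linearly, which leads to a quadratic equation for the candidate's position. The names (`SplitCandidates`, `uniform`, `parts`) and the clamping of targets that fall outside the first and last centers are assumptions of this sketch.

```scala
object SplitCandidates {
  // Hypothetical bin type: centroid position and (positive) number of points.
  final case class Bin(center: Double, count: Double)

  // Estimated number of points <= bins(i).center: all earlier bins plus half of bin i.
  private def cumulativeAtCenters(bins: Vector[Bin]): Vector[Double] = {
    val before = bins.scanLeft(0.0)(_ + _.count) // before(i) = total count of bins 0 until i
    bins.indices.map(i => before(i) + bins(i).count / 2).toVector
  }

  // Returns parts - 1 candidates u_1 < ... so that each interval holds about total/parts points.
  def uniform(bins: Vector[Bin], parts: Int): Vector[Double] = {
    require(bins.nonEmpty && parts >= 2)
    val total = bins.map(_.count).sum
    val cum = cumulativeAtCenters(bins)
    (1 until parts).toVector.map { j =>
      val target = total * j / parts
      if (target <= cum.head) bins.head.center      // assumption: clamp to first center
      else if (target >= cum.last) bins.last.center // assumption: clamp to last center
      else {
        val i = cum.lastIndexWhere(_ <= target)     // cum(i) <= target < cum(i + 1)
        val (lo, hi) = (bins(i), bins(i + 1))
        val d = target - cum(i)                     // mass still needed after lo.center
        val a = hi.count - lo.count
        // Solve a*z^2 + 2*lo.count*z - 2*d = 0 for the relative position z in [0, 1].
        val z =
          if (a == 0.0) d / lo.count
          else (-lo.count + math.sqrt(lo.count * lo.count + 2 * a * d)) / a
        lo.center + (hi.center - lo.center) * z
      }
    }
  }
}
```

With three equal bins at 1, 3 and 5 and `parts = 3`, the candidates land halfway between neighbouring centers, at 2 and 4.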

Page 27: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Decision Trees

Determine Split

Find the split point that maximizes the information gain, using the per-feature/label histograms at the split candidates

Figure: Total histogram (bins at 1, 3.5, 7, 11, 14; counts 0 to 40) with the split candidates u1 and u2.
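A small sketch of this selection step, assuming the per-label counts to the left of each candidate have already been estimated from the per-label histograms, and assuming Shannon entropy as the impurity measure. `BestSplit`, `entropy`, `informationGain` and `best` are illustrative names, not functions from the talk.

```scala
object BestSplit {
  // Shannon entropy (in bits) of a vector of per-label counts.
  def entropy(counts: Seq[Double]): Double = {
    val n = counts.sum
    if (n == 0.0) 0.0
    else counts.filter(_ > 0).map { c =>
      val p = c / n
      -p * math.log(p) / math.log(2)
    }.sum
  }

  // left(k) and right(k): estimated number of points with label k on each side of the split.
  // Both sequences are assumed to list the labels in the same order.
  def informationGain(left: Seq[Double], right: Seq[Double]): Double = {
    val (nl, nr) = (left.sum, right.sum)
    val n = nl + nr
    val parent = left.zip(right).map { case (l, r) => l + r }
    if (n == 0.0) 0.0
    else entropy(parent) - (nl / n) * entropy(left) - (nr / n) * entropy(right)
  }

  // candidates: (split point, estimated per-label counts to its left); must be non-empty.
  // totals: per-label counts of the whole node.
  // Returns the candidate with the highest gain, together with that gain.
  def best(candidates: Seq[(Double, Seq[Double])], totals: Seq[Double]): (Double, Double) =
    candidates.map { case (u, left) =>
      val right = totals.zip(left).map { case (t, l) => math.max(t - l, 0.0) }
      (u, informationGain(left, right))
    }.maxBy(_._2)
}
```

In a two-label node with four points per label, a candidate that puts all points of one label on its left and all points of the other on its right reaches the maximum gain of one bit.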

Page 28: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

The intuition is not exactly precise

Figure: reservation subspace. Axes: advance purchase; class (FIRST, BUSINESS, ECONOMY). Legend: Business, Leisure, Supervision.

• The histograms can no longer be used for further splits
• And of course, we have already lost the original data

Page 29: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Decision Tree

Subsequent Split – get the intuition!

A different data set is used for each iteration*

* If there are not enough data, the same data can be reinjected instead; Kafka is very good for this

Page 30: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Implementation

Page 31: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Implementation

Stream Learner

Page 32: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Implementation

Stream Learner

We use two Kafka streams:

• One for the labeled data stream
• One for the tree developed so far, because we need to annotate each message with the tree so far (this topic is also used by classifying applications)
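To make "annotate each message with the tree so far" concrete, here is a minimal sketch of what the annotation could compute: the leaf of the current tree that a labeled point falls into, so that histograms can later be accumulated per leaf. The deck names the types `Point`, `Node` and `AnnotatedPoint` but does not show them, so every field and the `leafFor` and `annotate` functions below are assumptions.

```scala
object AnnotateWithTree {
  // Hypothetical tree representation: a leaf with an id, or a binary split on one feature.
  sealed trait Node
  final case class Leaf(id: Int) extends Node
  final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

  // Hypothetical message types.
  final case class Point(features: Vector[Double], label: Int)
  final case class AnnotatedPoint(leafId: Int, point: Point)

  // Walk down the tree: go left when the feature value is below the threshold.
  @annotation.tailrec
  def leafFor(tree: Node, features: Vector[Double]): Int = tree match {
    case Leaf(id)           => id
    case Split(f, t, l, r)  => leafFor(if (features(f) < t) l else r, features)
  }

  // Tag a labeled point with the leaf it currently belongs to.
  def annotate(tree: Node, p: Point): AnnotatedPoint =
    AnnotatedPoint(leafFor(tree, p.features), p)
}
```

The same traversal is what a classifying application would run on unlabeled points, which fits the remark that the tree topic is shared with classifying applications.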

Page 33: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Implementation

Code Outline

    import java.util.concurrent.TimeUnit
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time

    // Kafka sources; their definitions are elided on the slide
    val kafkaDataStream: DataStream[Point] = ???
    val kafkaTreeStream: DataStream[Node] = ???

    // annotate each message with the latest tree
    val annotatedDataStream: DataStream[AnnotatedPoint] =
      (kafkaDataStream connect kafkaTreeStream)
        .flatMap(new AnnotateMessageCoFlatMap(…))

    // create histograms per feature / node
    val histograms = annotatedDataStream
      .map { p => toSingletonHistograms(p) }
      .timeWindowAll(Time.of(1, TimeUnit.MINUTES))
      .reduce { (n1, n2) => mergeHistogram(n1, n2) }

    // merge histograms
    val mergedHistograms = histograms
      .keyBy(_.id)
      .reduce { (n1, n2) => mergeHistogram(n1, n2) }

    // split the nodes that have enough points and should be split
    val newTree = mergedHistograms
      .filter(hs => haveEnoughPoints(hs) && toSplit(hs))
      .map { n =>
        val splitPoint = maxInformationGain(calculateSplitCandidates(n))
        n.splitAt(splitPoint)
      }

Page 34: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

val histogram ➔ accumulate

var histogram ➔ re-accumulate from 0

Page 35: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Conclusion

Page 36: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Conclusions

Summary

• Streaming algorithms based on approximate histograms were explained
• Streaming decision tree algorithms give the classifier interesting properties: freshness and continuous learning
• Flink together with Kafka allows a clean implementation of the algorithm

Page 37: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Conclusions

Next Steps

Random Forests:

• Trees with randomly selected features at each level
• Trees with different spans of data (trees with more but older data might behave worse than trees with less but fresher data: forgetting capabilities)
• Providing information on what type of trees behaves better at a given period of time (meta-learning)

Page 38: Anwar Rizal – Streaming & Parallel Decision Tree in Flink

Thanks!

Credit to: Yiqing Yan (Eurecom) & Tianshu Yang (Telecom Bretagne), Amadeus Interns