Flink Forward
Streaming & Parallel Decision Tree in Flink
anwar.rizal @anrizal
Outline: Motivation, Architecture, Decision Trees, Implementation, Conclusion
Motivation
Need a classifier system on streaming data: the data used for learning come as a stream, and so do the data to be classified.
[Figure: two weeks of quoted fares for a flight, mostly $90 with occasional $75 discounts, spikes to $120/$150/$200, and sold-out days; the fare is predicted to increase in zero to two days, this week, or next week.]
[Figure: route dashboard. FRA–NYC needs attention (revenue decrease); FRA–LON needs attention (passenger decrease); FRA–MEX needs attention (revenue decrease, cost increase).]
MUST:
• The classifier is kept fresh: no separate batch learning/evaluation is needed, and feedback is taken into account regularly, in real time.
• The classifier can be introspected: transparent model structure (e.g. the tree itself and the information gain at each split point) and known expected performance (accuracy, precision, recall, AUC).
• Seamless support for the machine-learning workflow: data preprocessing (up/down sampling, imputation, …), feature selection, model evaluation, cross validation, …
NICE TO HAVE:
• The classifier is immediately available: it can already predict during learning, and when a learning phase terminates it starts another cycle of learning.
• The classifier has a meta-learning capability: it maintains several models with different parameters, so it is possible to learn about the learning capability of the models themselves.
[Figure: the learner's lifecycle: learning; classifying during learning; end of learning; classifying; then a new cycle of learning.]
[Figure: architecture. A Stream Learner consumes labeled points and produces a Classifier; a Classifying Application uses the Classifier to turn unlabeled points into predicted points.]
Decision Trees
From origin to recent developments
• "Understand data by asking a sequence of questions": Classification and Regression Trees (CART), Breiman et al., 1984
• "Pool decision trees to improve generalization": Random Forests, Breiman, 1999
• "Let's play: pose estimation for Xbox's Kinect": Shotton et al., 2011
Streaming Decision Trees
• "A classifier for streaming data with a bound": Hoeffding Tree (VFDT), Domingos & Hulten, 2000
• "Use of approximate histograms for decision trees": Streaming and Parallel Decision Tree, Ben-Haim & Tom-Tov, 2010
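The bound referred to in VFDT is the Hoeffding bound (not spelled out on the slide): if a real-valued quantity with range R has been observed n times, then with probability 1 - δ its true mean is within ε of the observed mean, where

\[
\varepsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}
\]

VFDT uses this bound to decide when enough points have been seen to commit to a split.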
Train a decision tree - get the intuition!
[Figure: labeled points in the reservation subspace (advance purchase vs. class: FIRST, BUSINESS, ECONOMY), with Business/Leisure supervision. Successive splits isolate groups such as busy procrastinators, tourists, foreseeing businessmen who save money for the company, and the odd outlier (Brad Pitt).]
Classifying - get the intuition!
[Figure: the same reservation subspace; a new point is routed down the tree to a Business or Leisure leaf, together with a confidence measure.]
Decision tree - node optimization
At each node, the split is chosen using the Information Gain.
[Figure: the reservation subspace (advance purchase vs. class) with candidate splits.]
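For reference, the standard definition of information gain when splitting the data D at a node into left and right parts D_L and D_R:

\[
IG(D) = H(D) - \frac{|D_L|}{|D|} H(D_L) - \frac{|D_R|}{|D|} H(D_R),
\qquad
H(D) = -\sum_{c} p_c \log_2 p_c
\]

where p_c is the fraction of points in D carrying label c.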
The batch version of decision trees requires a view of the full learning data set. In streaming, each point can be seen only once, and processing must be fast: we cannot afford much access to disk.
Streaming Decision Tree – get the intuition!
Instead of using every point, the points are compressed; the real position of each point is then approximated.
[Figure: the same training example, but each region stores only a count of points nearby instead of the points themselves; the splits into busy procrastinators, tourists, and foreseeing businessmen are recovered approximately.]
Streaming Decision Tree – the Question
"How to find split points for a decision tree?"
[Figure: histogram of feature 1 for label 1: counts over bins centered at 2, 5, 7.5, 9, 11.]
Compressing Data
An approximate histogram is built for each label/feature pair.
[Figure: per-label histograms of feature 1 (label 0; label 1 with bins at 2, 5, 7.5, 9, 11; label n with bins at 4, 6, 8.5, 10, 13) and the merged Total histogram with bins at 1, 3.5, 7, 11, 14.]
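The approximate histogram of Ben-Haim & Tom-Tov can be sketched as follows. This is a minimal Java illustration (class and method names are ours, not from the talk): keep at most maxBins (center, count) pairs, and whenever the cap is exceeded, merge the two closest bins into their weighted average. One such histogram is kept per label/feature (per tree node).

```java
import java.util.*;

// Sketch of a capped-bin approximate histogram (Ben-Haim & Tom-Tov style).
class ApproxHistogram {
    private final TreeMap<Double, Long> bins = new TreeMap<>();
    private final int maxBins;

    ApproxHistogram(int maxBins) { this.maxBins = maxBins; }

    // Add one point from the stream: insert a unit bin, then shrink if needed.
    void update(double p) {
        bins.merge(p, 1L, Long::sum);
        shrink();
    }

    // Combine with a histogram built on another partition of the stream.
    void merge(ApproxHistogram other) {
        other.bins.forEach((c, n) -> bins.merge(c, n, Long::sum));
        shrink();
    }

    // While over capacity, merge the two closest bin centers.
    private void shrink() {
        while (bins.size() > maxBins) {
            double bestGap = Double.MAX_VALUE;
            double left = bins.firstKey();
            Double prev = null;
            for (double c : bins.keySet()) {
                if (prev != null && c - prev < bestGap) { bestGap = c - prev; left = prev; }
                prev = c;
            }
            double right = bins.higherKey(left);
            long n1 = bins.remove(left), n2 = bins.remove(right);
            // weighted average of the two centers; merge in case it collides
            bins.merge((left * n1 + right * n2) / (n1 + n2), n1 + n2, Long::sum);
        }
    }

    long totalCount() { long t = 0; for (long n : bins.values()) t += n; return t; }
    int numBins() { return bins.size(); }
}
```

The merge operation is what makes the structure parallel-friendly: partial histograms built on different workers (or windows) can be combined without revisiting the data.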
Prepare Split Candidates (1/2)
For each feature, all histograms of the feature are merged. The split candidates are then chosen such that the intervals between two consecutive candidates contain the same number of points (each colored area is as large as the others).
[Figure: the Total histogram for feature 1 (bins at 1, 3.5, 7, 11, 14) with equal-mass candidate points u1, u2.]
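The equal-mass candidate selection can be sketched like this (illustrative Java, our naming; as a simplification we assume all of a bin's mass sits at its center, whereas the original paper's uniform procedure also interpolates within bins):

```java
import java.util.*;

// Pick k split candidates from a merged histogram so that roughly equal
// mass falls between consecutive candidates.
class SplitCandidates {
    static double[] candidates(double[] centers, long[] counts, int k) {
        long total = Arrays.stream(counts).sum();
        double[] out = new double[k];
        int bin = 0;
        long cum = 0;
        for (int i = 1; i <= k; i++) {
            // i-th equal-mass quantile of the total count
            double target = total * i / (double) (k + 1);
            while (bin < counts.length - 1 && cum + counts[bin] < target) {
                cum += counts[bin];
                bin++;
            }
            out[i - 1] = centers[bin]; // mass assumed concentrated at centers
        }
        return out;
    }
}
```

For the Total histogram on the slide (bins 1, 3.5, 7, 11, 14 with counts 10, 20, 40, 20, 10), three candidates land at 3.5, 7, and 11.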
Prepare Split Candidates (2/2)
Find the split point that maximizes the information gain, using the per-feature/label histograms at the candidate points.
[Figure: the Total histogram with candidates u1, u2 to be evaluated.]
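Evaluating a candidate then amounts to computing the information gain from the per-label counts falling left and right of it, as estimated from the label histograms. A minimal sketch (our naming):

```java
import java.util.*;

// Information gain of a candidate split, from per-label left/right counts.
class Gain {
    // Shannon entropy (base 2) of a label-count vector.
    static double entropy(long[] counts) {
        long total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (long c : counts) {
            if (c == 0) continue;
            double p = c / (double) total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Parent entropy minus the count-weighted entropies of the two children.
    static double infoGain(long[] left, long[] right) {
        long nl = Arrays.stream(left).sum(), nr = Arrays.stream(right).sum();
        long[] all = new long[left.length];
        for (int i = 0; i < all.length; i++) all[i] = left[i] + right[i];
        double n = nl + nr;
        return entropy(all) - (nl / n) * entropy(left) - (nr / n) * entropy(right);
    }
}
```

A perfectly separating candidate yields the full parent entropy as gain, while a candidate that leaves both sides with the parent's label mix yields zero; the learner keeps the candidate with the highest gain.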
Determine Split
Subsequent Split – get the intuition!
[Figure: the reservation subspace (advance purchase vs. class) after the first Business/Leisure split.]
The intuition is not exactly precise:
• The histograms can no longer be used for further splits
• And of course, we have already lost the original data
A different data set is used for each iteration.*
* If there are not enough data, the same data can be reinjected instead; Kafka is very good for this.
Implementation

Stream Learner
We use two Kafka streams:
• One for the labeled data stream
• One for the tree developed so far (the topic is also used by classifying applications), because we need to annotate each message with the tree so far
Code Outlines

val kafkaDataStream: DataStream[Point] = … // labeled points from Kafka
val kafkaTreeStream: DataStream[Node] = …  // tree developed so far

// annotate each message with the latest tree
val annotatedDataStream: DataStream[AnnotatedPoint] =
  (kafkaDataStream connect kafkaTreeStream)
    .flatMap(new AnnotateMessageCoFlatMap(…))

// create a histogram per feature / node, windowed by minute
val histograms = annotatedDataStream
  .map { p => toSingletonHistograms(p) }
  .timeWindowAll(Time.of(1, TimeUnit.MINUTES))
  .reduce { (n1, n2) => mergeHistogram(n1, n2) }

// merge the histograms per node
val mergedHistograms = histograms
  .keyBy(_.id)
  .reduce { (n1, n2) => mergeHistogram(n1, n2) }

// split the nodes that have enough points and are worth splitting
val newTree = mergedHistograms
  .filter(hs => haveEnoughPoints(hs) && toSplit(hs))
  .map { n =>
    val splitPoint = maxInformationGain(calculateSplitCandidates(n))
    n.splitAt(splitPoint)
  }
val Histogram → accumulate
var histogram → re-accumulate from 0
Conclusion
Summary
• Streaming algorithms based on approximate histograms have been explained.
• Streaming decision tree algorithms open the possibility of classifiers with interesting properties: freshness and continuous learning.
• Flink together with Kafka allows a clean implementation of the algorithm.
Next Steps
• Random Forests: trees with randomly selected features at each level
• Trees with different spans of data (trees with more but older data might behave worse than trees with less but fresher data: forgetting capabilities)
• Providing information on which types of trees behave better at a given period of time (meta-learning)
Thanks!
Credit to: Yiqing Yan (Eurecom) & Tianshu Yang (Telecom Bretagne), Amadeus Interns