33
AMIDST Toolbox – Flink Forward 1 A Java Toolbox for Scalable Probabilistic Machine Learning 12-14 SEP 2016, Berlin Ana M. Martínez Aalborg University [email protected]

Flink Forward 2016

Embed Size (px)

Citation preview

AMIDST Toolbox – Flink Forward 1

A Java Toolbox for Scalable Probabilistic Machine Learning

12-14 SEP 2016, Berlin

Ana M. MartínezAalborg University

[email protected]

AMIDST Toolbox – Flink Forward 2

Who are we?

THE AMIDST CONSORTIUM

3AMIDST Toolbox – Flink Forward

4

RunningUse Case

AMIDST Toolbox – Flink Forward

RUNNING USE CASE

5

Predicting Defaulting Clients

Predicts probability a customer will default within 2 years

AMIDST Toolbox – Flink Forward

RUNNING USE CASE

§ Daily data for millions of clients § Tons of missing data.§ Odd distributions.

6AMIDST Toolbox – Flink Forward

7

Toolbox presentation

AMIDST Toolbox – Flink Forward

GENERAL DESCRIPTION

8AMIDST Toolbox – Flink Forward

AMIDST APPROACH

9

Data Knowledge

Openbox Models

Blackbox Inference Engine(Powered by Flink)

AMIDST Toolbox – Flink Forward

10

MainFeatures

AMIDST Toolbox – Flink Forward

PGMS

11

Probabilistic graphical models (PGMs)Specify your model using probabilistic graphical models with latent variables and temporal dependencies

AMIDST Toolbox – Flink Forward

RUNNING USE CASE

12AMIDST Toolbox – Flink Forward

PGMS

13

Custom Gaussian Mixture ModelHij defines local mixtureHi defines a global mixture.

AMIDST Toolbox – Flink Forward

PGMS

14

RUNNING CODE EXAMPLE

//Set-up Flink session.final ExecutionEnvironment env =

ExecutionEnvironment.getExecutionEnvironment();

//Load the data streamString filename = “hdfs://dataFlink_month0.arff";DataFlink<DataInstance> data =

DataFlinkLoader.loadDataFromFolder(env, filename, false);

//Build the modelModel model = new CustomGaussianMixture(data.getAttributes());

AMIDST Toolbox – Flink Forward

SCALABLE INFERENCE

15

Scalable Learning

Perform Bayesian inference on your probabilistic models with powerful approximate and scalable algorithms.

AMIDST Toolbox – Flink Forward

SCALABLE INFERENCE

16

d-VMP Algorithm - Coded as iterative map-reduce task

A state-of-the-art distributed variational message passing algorithm.

AMIDST Toolbox – Flink Forward

SCALABLE INFERENCE

17

RUNNING CODE EXAMPLE

AMIDST Toolbox – Flink Forward

//Set-up Flink session.final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

//Load the data streamString filename = “hdfs://dataFlink_month0.arff";DataFlink<DataInstance> data =

DataFlinkLoader.loadDataFromFolder(env, filename, false);

//Build the modelModel model = new CustomGaussianMixture(data.getAttributes());

//Learn the modelmodel.updateModel(data);

DATA STREAMS

18

Data Streams

Update your models when new data is available. This makes our toolbox appropriate for learning from data streams.

AMIDST Toolbox – Flink Forward

DATA STREAMS

19

//Set-up Flink session.final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

//Load the data streamString filename = “hdfs://dataFlink_month0.arff";DataFlink<DataInstance> data =

DataFlinkLoader.loadDataFromFolder(env, filename, false);

//Build the modelModel model = new CustomGaussianMixture(data.getAttributes());

//Learn the modelmodel.updateModel(data);

//Update your modelfor(int i=1; i<12; i++) {

filename = “dataFlink_month"+i+".arff";data = DataFlinkLoader.loadDataFromFolder(env, filename,false);model.updateModel(data);

}

RUNNING CODE EXAMPLE

AMIDST Toolbox – Flink Forward

RUNNING USE CASE

20

Predicting Defaulting Clients§ Old BCC’s models based on logistic regression got an AUC of 0.816.§ AMIDST’s models gets an AUC of 0.952.

AMIDST Toolbox – Flink Forward

LARGE-SCALE DATA

21

Scalability analysis

Use your defined models to process massive data sets in a distributed computer cluster using Flink.

AMIDST Toolbox – Flink Forward

LARGE-SCALE DATA

22

One billion node probabilistic modelExperiment on a Flink cluster with 16 nodes on AWS.

AMIDST Toolbox – Flink Forward

§ Speedup(withrespectto2nodes)

AMIDST Toolbox – Flink Forward 23/32

0

1

2

3

4

5

6

7

8

4nodes 8nodes 16nodes

MODULAR DESIGN

24

Modular Design

The AMIDST Toolbox has been designed following a modular structure.This makes easier:

§ The maintenance and enhancement of the software§ The integration with external software: HUGIN, MOA, Weka, R.

AMIDST Toolbox – Flink Forward

MODULAR DESIGN

25AMIDST Toolbox – Flink Forward

26

Running Use Case II

AMIDST Toolbox – Flink Forward

CONCEPT DRIFT DETECTION

27

Tracking Concept DriftDetects changes in customer profiles during Spanish financial crisis

AMIDST Toolbox – Flink Forward

CONCEPT DRIFT DETECTION

28

Hidden Variables are used to capture changes in customer profile

MODEL

AMIDST Toolbox – Flink Forward

CONCEPT DRIFT DETECTION

29

RUNNING CODE

AMIDST Toolbox – Flink Forward

//Set-up Flink session.final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

//Load the data streamString filename = “hdfs://dataFlink_month0.arff";DataFlink<DataInstance> data =

DataFlinkLoader.loadDataFromFolder(env, filename, false);

//Build the modelModel model = new ConceptDriftDetector(data.getAttributes());

//Learn the modelmodel.updateModel(data);

//Update your modelfor(int i=1; i<12; i++) {

filename = “dataFlink_month"+i+".arff";data = DataFlinkLoader.loadDataFromFolder(env, filename,false);model.updateModel(data);System.out.println(model.getPosteriorDistribution(“hiddenVar”).

toString());}

CONCEPT DRIFT DETECTION

30

Hidden Variable Captures Concept Drift

Drift Pattern: Seasonal + Global trend

RESULTS

AMIDST Toolbox – Flink Forward

CONCEPT DRIFT DETECTION

31

Unemployment Rate main driver of Concept DriftHidden Variable correlates with unemployment rate (rho = 0.961)

RESULTS

AMIDST Toolbox – Flink Forward

COLLABORATE

32

www.amidsttoolbox.com github.com/amidst/toolbox

AppacheLicense 2.0

AMIDST Toolbox – Flink Forward

AMIDST Toolbox – Flink Forward 33

Thanks for your attention

@ [email protected]

@AmidstToolbox

www www.amidsttoolbox.com