
Distributed R: The Next Generation Platform for Predictive Analytics


1

Distributed R

The Next Generation Platform for Predictive Analytics

Jorge Martinez

Vishrut Gupta

Ed Ma

April 10th, 2015

2

About me

• 2009, Barcelona: FPGAs

• 2011, Barcelona: Embedded software, GPUs

• 2013, SF: Distributed systems and ML

@jorgemarsal

http://github.com/jorgemarsal

3

The data explosion

4

The shift from BI to data science

Happens!

https://www.youtube.com/watch?v=vbb-AjiXyh0

5

Predictive analytics workflow

Build Models → Evaluate Models → Deploy Models (in-database scoring) → BI Integration

1. Ingest and prepare data by leveraging HP Vertica Analytics Platform (SQL DB).

2. Build and evaluate predictive models on large datasets using Distributed R.

3. Deploy models to Vertica and use in-database scoring to produce prediction results for BI and applications.

6

Data Scientists' Preferred Languages: R & SQL

Adoption of R increased across industries

1) http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html

2) http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html

7

R is …

“The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.”

- Bo Cowgill, Google

8

R is …

Strengths

• Popular

• Open source

• Flexible

• Extensible

Limitations

• Not scalable

• No parallel algorithms

• Limited pre/post processing

9

Horizontal scaling

Functional programming and big data

Scale-out

10

Horizontal scaling

“The future has arrived, it’s just not evenly distributed yet”

- William Gibson

Ship code to data

Functional programming

11

Distributed R: The Next Generation Platform for Predictive Analytics

12

Distributed R: A new enterprise-class predictive analytics platform

A scalable, high-performance platform for the R language

• Implemented as an R package

• Open source

• Use familiar GUIs and packages

• Analyze data too large for vanilla R

• Leverage multiple nodes for distributed processing

• Vastly improved performance

13

Distributed R: architecture

Master

• Schedules tasks across the cluster.

• Sends commands/code to workers

Workers

• Do the actual work

• Own the data

• Work on independent data partitions in parallel

Diagram: DistR Master connected to Worker 1, Worker 2, Worker 3 and Worker 4
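The master/worker split described on this slide can be sketched in a few lines of Python. This is only a conceptual illustration, not Distributed R code: the `Master` and `Worker` classes are hypothetical names, and threads on one machine stand in for processes on separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor


class Worker:
    """Owns one data partition and runs whatever function the master sends."""

    def __init__(self, name, partition):
        self.name = name
        self.partition = partition  # the data never leaves the worker

    def run(self, func):
        # The code travels to the data, not the other way around.
        return func(self.partition)


class Master:
    """Schedules one task per worker and collects the (small) results."""

    def __init__(self, workers):
        self.workers = workers

    def submit(self, func):
        with ThreadPoolExecutor(max_workers=len(self.workers)) as pool:
            futures = [pool.submit(w.run, func) for w in self.workers]
            return [f.result() for f in futures]


# Four workers, each owning three consecutive values of 1..12.
data = list(range(1, 13))
workers = [Worker(f"worker-{i + 1}", data[i * 3:(i + 1) * 3]) for i in range(4)]
master = Master(workers)

partial_sums = master.submit(sum)  # each worker sums its own partition
total = sum(partial_sums)          # the master combines the partial results
```

Only the per-partition results cross the "network" here, which is the point of letting workers own the data.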

14

Distributed R: Distributed data structures

darray

• Relies on user-defined partitioning

• Also supports distributed data frames and lists
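To make "user-defined partitioning" concrete, here is a minimal Python sketch in which the caller chooses the block size and the array is split into row blocks accordingly. The function name `partition_rows` is an assumption made for illustration; it is not the darray interface.

```python
def partition_rows(matrix, block_rows):
    """Split a matrix (list of rows) into blocks of at most `block_rows` rows.

    The caller picks the block size, mirroring the idea that the user,
    not the system, decides how the data is partitioned.
    """
    if block_rows <= 0:
        raise ValueError("block_rows must be positive")
    return [matrix[i:i + block_rows] for i in range(0, len(matrix), block_rows)]


# A 5 x 3 matrix split into blocks of 2 rows: two full blocks and one remainder.
matrix = [[r * 3 + c for c in range(3)] for r in range(5)]
blocks = partition_rows(matrix, 2)

# Concatenating the blocks in order recovers the original matrix.
reassembled = [row for block in blocks for row in block]
```

Each block could then be placed on a different worker; the last block is smaller because 5 rows do not divide evenly by 2.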

15

Distributed R: Distributed code

foreach

• Express computations over partitions

• Execute across the cluster

f(x)
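The idea of applying a function f(x) to every partition can be sketched as follows. This Python version is an illustration under stated assumptions: `foreach_partition` is a hypothetical helper, and a thread pool stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def foreach_partition(partitions, func, max_workers=4):
    """Apply `func` to every partition in parallel, keeping partition order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, partitions))


def center(partition):
    """Example f(x): subtract the partition's own mean from each value."""
    mean = sum(partition) / len(partition)
    return [x - mean for x in partition]


partitions = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]
centered = foreach_partition(partitions, center)
```

Because `center` only reads its own partition, the calls are independent and can run in any order on any worker.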

16

Distributed R basic demo

17

Distributed R: Built-in distributed algorithms

• Similar signature and accuracy as R packages

• Scalable and high performance

• E.g., regression on billions of rows in a couple of minutes

Algorithm: Use cases

• Linear Regression (GLM): Risk analysis, trend analysis, etc.

• Logistic Regression (GLM): Customer response modeling, healthcare analytics (disease analysis)

• Random Forest: Customer churn, market campaign analysis

• K-Means Clustering: Customer segmentation, fraud detection, anomaly detection

• Page Rank: Identify influencers
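One common way to make regression scale across partitions is to have each partition compute small sufficient statistics and let the master combine them. The Python sketch below shows this for a single-feature least-squares fit. It is an assumption about how such an algorithm can be structured, not a description of how Distributed R's GLM is implemented, and all function names are hypothetical.

```python
def partial_stats(partition):
    """Per-partition sufficient statistics for y = slope * x + intercept."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxy = sum(x * y for x, y in partition)
    sxx = sum(x * x for x, _ in partition)
    return n, sx, sy, sxy, sxx


def fit_line(partitions):
    """Sum the per-partition statistics, then solve the normal equations."""
    n, sx, sy, sxy, sxx = (sum(v) for v in zip(*map(partial_stats, partitions)))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Points on the line y = 2x + 1, spread over three partitions.
partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)], [(4, 9)]]
slope, intercept = fit_line(partitions)
```

Only five numbers per partition travel to the master, regardless of how many rows each partition holds.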

18

Distributed R March Madness demo

19

Parallel Random Forest Example

Random Forest: building an ensemble of deep decision trees

Need to build 100 decision trees on 4 machines

Each machine builds 25 decision trees

Can use random forest to predict March Madness Bracket

Figure: example decision tree with split nodes X7 > 5, X12 > 3.4 and X3 > 3, and leaf labels 0, 1, 1, 0
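The split of 100 trees over 4 machines can be sketched in Python as below. Several things are assumptions made to keep the example short: a one-split stump stands in for a deep decision tree, the "machines" run one after another in a single process, and every machine sees the full toy dataset. None of the names come from Distributed R.

```python
import random
from collections import Counter


def trees_per_machine(total_trees, machines):
    """Spread `total_trees` as evenly as possible over `machines`."""
    base, extra = divmod(total_trees, machines)
    return [base + (1 if i < extra else 0) for i in range(machines)]


def train_stump(rows, rng):
    """Stand-in for a deep tree: one split fitted on a bootstrap sample."""
    sample = [rng.choice(rows) for _ in rows]
    feature = rng.randrange(len(rows[0][0]))
    threshold = sum(x[feature] for x, _ in sample) / len(sample)
    left = Counter(y for x, y in sample if x[feature] <= threshold)
    right = Counter(y for x, y in sample if x[feature] > threshold)
    fallback = Counter(y for _, y in sample).most_common(1)[0][0]
    left_label = left.most_common(1)[0][0] if left else fallback
    right_label = right.most_common(1)[0][0] if right else fallback
    return feature, threshold, left_label, right_label


def build_on_machine(rows, n_trees, seed):
    """What one machine does: build its share of trees with its own seed."""
    rng = random.Random(seed)
    return [train_stump(rows, rng) for _ in range(n_trees)]


def predict(forest, x):
    """Majority vote over all trees in the combined forest."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


# Toy data: two features, label is 1 when the first feature is at least 10.
rows = [((float(i), float(i % 3)), int(i >= 10)) for i in range(20)]
counts = trees_per_machine(100, 4)
forest = [tree
          for machine, n in enumerate(counts)
          for tree in build_on_machine(rows, n, seed=machine)]
prediction = predict(forest, (19.0, 1.0))
```

Because each tree is trained independently, the per-machine builds need no communication until the trees are gathered into one forest.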

21

March Madness Bracket

Train a model to predict individual games

• Use team and opponent features to train a model: blocks, steals, assists, rebounds, free throw accuracy, field goal accuracy, 3-point accuracy

Calculate the summary statistics of each team

• Group by team and take the mean of each team’s features

Predict the result of the game

• Concatenate the summary statistics of the two teams and feed them to the model that predicts individual games

Fill out the bracket by predicting one game at a time
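The bracket-filling steps above can be sketched in Python. The data, the two toy features and the `toy_model` rule are all made up for illustration; in the talk the model is a random forest trained on real game statistics. The sketch also assumes the number of teams is a power of two.

```python
from collections import defaultdict


def team_means(games):
    """games: iterable of (team, feature_vector). Returns team -> mean vector."""
    sums, counts = {}, defaultdict(int)
    for team, feats in games:
        acc = sums.setdefault(team, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[team] += 1
    return {team: [v / counts[team] for v in acc] for team, acc in sums.items()}


def play_bracket(teams, stats, predict_win):
    """Advance winners round by round, predicting one game at a time."""
    rounds = [list(teams)]
    while len(rounds[-1]) > 1:
        current = rounds[-1]
        winners = []
        for a, b in zip(current[0::2], current[1::2]):
            features = stats[a] + stats[b]  # concatenate team and opponent stats
            winners.append(a if predict_win(features) else b)
        rounds.append(winners)
    return rounds


# Toy per-game features: [assists, rebounds].
games = [("A", [10, 30]), ("A", [14, 34]), ("B", [8, 28]),
         ("C", [12, 40]), ("D", [9, 25])]
stats = team_means(games)


def toy_model(features):
    """Hypothetical stand-in for the trained model: larger feature sum wins."""
    return sum(features[:2]) >= sum(features[2:])


rounds = play_bracket(["A", "B", "C", "D"], stats, toy_model)
```

Swapping `toy_model` for a trained classifier leaves the bracket logic unchanged, since it only needs a win/lose prediction for the concatenated feature vector.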

22

23

Distributed R Census demo using Shiny

http://15.126.194.41/public/index.html

24

Distributed R rocks!

• Regression on billions of rows in minutes

• Graph algorithms on 10B edges

• Load 400GB+ data from database to R in < 10 minutes

• Open source!

25

That’s cool… what can I do with it?

• Collaborate

• GitHub (report issues, send PRs): https://github.com/vertica/DistributedR

• Standardization with R-core: http://www.r-bloggers.com/enhancing-r-for-distributed-computing/

• Get the SW + docs: http://www.vertica.com/hp-vertica-products/hp-vertica-distributed-r/

• Buy commercial support

26

“The future has already arrived, it’s just not evenly distributed yet”

- William Gibson

Thank you

http://github.com/vertica/distributedr