1
Distributed R
The Next Generation Platform for Predictive Analytics
Jorge Martinez
Vishrut Gupta
Ed Ma
April 10th, 2015
2
About me
• 2009: FPGAs (Barcelona)
• 2011: Embedded software, GPUs (Barcelona)
• 2013: Distributed systems and ML (SF)
@jorgemarsal
http://github.com/jorgemarsal
4
The shift from BI to Data Science
The shift from BI to data science happens!
https://www.youtube.com/watch?v=vbb-AjiXyh0
5
Predictive analytics workflow
1. Ingest and prepare data by leveraging HP Vertica Analytics Platform (SQL DB)
2. Build and evaluate predictive models on large datasets using Distributed R
3. Deploy models to Vertica and use in-database scoring to produce prediction results for BI and applications
Diagram stages: Build Models, Evaluate Models, Deploy Models (in-database scoring), BI Integration
6
Data Scientists' Preferred Languages: R & SQL
Adoption of R increased across industries
1) http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html
2) http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
7
R is …
“The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.”
- Bo Cowgill, Google
8
R is …
Strengths:
• Popular
• Open source
• Flexible
• Extensible
Limitations:
• Not scalable
• No parallel algorithms
• Limited pre/post processing
10
Horizontal scaling
“The future has arrived, it’s just
not evenly distributed yet”
- William Gibson
Ship code to data,
Functional programming
12
Distributed R
A new enterprise-class predictive analytics platform
A scalable, high-performance platform for the R language
• Implemented as an R package
• Open source
• Use familiar GUIs and packages
• Analyze data too large for vanilla R
• Leverage multiple nodes for distributed processing
• Vastly improved performance
13
Distributed R: architecture
Master
• Schedules tasks across the cluster.
• Sends commands/code to workers
Workers
• Do the actual work
• Own the data
• Work on independent data partitions in parallel
Diagram: DistR Master; Worker 1, Worker 2, Worker 3, Worker 4
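The master/worker split above can be illustrated with a small single-process sketch. It is written in Python rather than R and does not use the Distributed R API; the `Worker` and `Master` classes, the thread pool standing in for cluster nodes, and the toy data are all assumptions made for illustration. The point it shows is that the master ships a function to the workers, and only small results travel back.

```python
from concurrent.futures import ThreadPoolExecutor


class Worker:
    """Owns one data partition and runs code the master sends to it."""

    def __init__(self, name, partition):
        self.name = name
        self.partition = partition  # the data stays with the worker

    def run(self, func):
        # Apply the shipped function to the locally owned partition.
        return func(self.partition)


class Master:
    """Schedules one task per worker and collects the (small) results."""

    def __init__(self, workers):
        self.workers = workers

    def run_everywhere(self, func):
        # Threads stand in for cluster nodes in this single-process sketch.
        with ThreadPoolExecutor(max_workers=len(self.workers)) as pool:
            futures = [pool.submit(w.run, func) for w in self.workers]
            return [f.result() for f in futures]


data = list(range(1, 101))  # hypothetical dataset: the integers 1..100
workers = [Worker(f"Worker {i + 1}", data[i * 25:(i + 1) * 25]) for i in range(4)]
master = Master(workers)

# Ship `sum` to every worker, then combine the four partial results.
partial_sums = master.run_everywhere(sum)
total = sum(partial_sums)
print(partial_sums, total)
```

In a real cluster the partitions would live in the memory of separate machines; the structure of the computation (ship code, compute locally, combine) stays the same.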
14
Distributed R: Distributed data structures
darray
• Relies on user-defined partitioning
• Also supports distributed data frames and lists
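To make user-defined partitioning concrete, here is a minimal Python sketch of cutting a matrix into blocks of a caller-chosen size. It is not the `darray` implementation; the helper names `block_ranges` and `partition_2d` and the 4x4 example matrix are hypothetical.

```python
def block_ranges(length, block):
    """Split range(length) into consecutive (start, stop) chunks of size `block`."""
    return [(start, min(start + block, length)) for start in range(0, length, block)]


def partition_2d(matrix, block_rows, block_cols):
    """Cut a list-of-lists matrix into blocks, listed in row-major order."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    partitions = []
    for r0, r1 in block_ranges(n_rows, block_rows):
        for c0, c1 in block_ranges(n_cols, block_cols):
            partitions.append([row[c0:c1] for row in matrix[r0:r1]])
    return partitions


matrix = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4, values 0..15

# The user picks the block shape: here, two row blocks of 2x4 each.
parts = partition_2d(matrix, block_rows=2, block_cols=4)
print(len(parts), parts[0])
```

When the block size does not divide the matrix evenly, the trailing blocks are simply smaller, so the caller controls how much data each worker would hold.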
15
Distributed R: Distributed code
foreach
• Express computations over partitions
• Execute across the cluster
f(x)
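The idea of expressing a computation over partitions can be sketched as a two-pass pattern: first each partition reports a small summary, then a global value is shipped back and applied locally. The `foreach` helper below is a hypothetical Python stand-in, not the Distributed R function of the same name, and the thread pool only simulates cluster execution.

```python
from concurrent.futures import ThreadPoolExecutor


def foreach(partitions, func):
    """Apply func to every partition in parallel; results keep partition order."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(func, partitions))


partitions = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical data

# Pass 1: each partition returns only (sum, count), not its rows.
summaries = foreach(partitions, lambda part: (sum(part), len(part)))
mean = sum(s for s, _ in summaries) / sum(n for _, n in summaries)

# Pass 2: the global mean is sent along with the code; centering happens locally.
centered = foreach(partitions, lambda part: [x - mean for x in part])
print(mean, centered)
```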
17
Distributed R: Built-in distributed algorithms
• Similar signatures and accuracy to the corresponding R packages
• Scalable and high performance
• E.g., regression on billions of rows in a couple of minutes
Algorithm | Use cases
Linear Regression (GLM) | Risk analysis, trend analysis, etc.
Logistic Regression (GLM) | Customer response modeling, healthcare analytics (disease analysis)
Random Forest | Customer churn, market campaign analysis
K-Means Clustering | Customer segmentation, fraud detection, anomaly detection
PageRank | Identify influencers
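One common way to make regression scale across partitions is to compute small sufficient statistics on each partition and add them up before solving. The slides do not say how Distributed R implements its GLM, so the Python sketch below is only an illustration of that general technique for a single-feature model y = a + b*x; the function names and the data are hypothetical.

```python
def partial_stats(partition):
    """Sufficient statistics for y = a + b*x on one partition of (x, y) pairs."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxx = sum(x * x for x, _ in partition)
    sxy = sum(x * y for x, y in partition)
    return n, sx, sy, sxx, sxy


def fit_from_stats(stats):
    """Add the per-partition statistics and solve the normal equations."""
    n, sx, sy, sxx, sxy = (sum(column) for column in zip(*stats))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x, split into three partitions.
points = [(float(x), 1.0 + 2.0 * x) for x in range(9)]
partitions = [points[0:3], points[3:6], points[6:9]]

intercept, slope = fit_from_stats([partial_stats(p) for p in partitions])
print(intercept, slope)
```

Each partition contributes five numbers regardless of how many rows it holds, which is why this style of computation can cover very large row counts.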
19
Parallel Random Forest Example
• Random Forest: building an ensemble of deep decision trees
• Need to build 100 decision trees on 4 machines
• Each machine builds 25 decision trees
• Can use random forest to predict the March Madness bracket
Diagram: example decision tree with splits X7 > 5, X12 > 3.4, X3 > 3 and leaves 0, 1, 1, 0
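The 100-trees-on-4-machines split can be sketched as follows. This Python example is a simplification made for illustration: each "tree" is a one-split stump rather than a deep decision tree, threads stand in for machines, and the data, seeds, and function names are hypothetical. What it demonstrates is the division of work (25 trees per machine) and the final majority vote.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_stump(rows, rng):
    """Fit a one-split tree (a stump) on a bootstrap sample of (x, label) rows."""
    sample = [rng.choice(rows) for _ in rows]
    best = None
    for threshold in sorted({x for x, _ in sample}):
        # The stump predicts 1 when x > threshold; count its training errors.
        errors = sum((x > threshold) != bool(label) for x, label in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]


def build_trees(rows, n_trees, seed):
    """Work done by one machine: build its share of the forest."""
    rng = random.Random(seed)
    return [train_stump(rows, rng) for _ in range(n_trees)]


def predict(forest, x):
    """Majority vote over all stumps in the combined forest."""
    votes = Counter(int(x > threshold) for threshold in forest)
    return votes.most_common(1)[0][0]


rows = [(x, int(x > 5)) for x in range(11)]  # toy data: label is 1 when x > 5
n_machines, n_trees = 4, 100
share = n_trees // n_machines                # 25 trees per machine

with ThreadPoolExecutor(max_workers=n_machines) as pool:
    parts = pool.map(build_trees, [rows] * n_machines,
                     [share] * n_machines, range(n_machines))
    forest = [tree for part in parts for tree in part]

print(len(forest), predict(forest, 9), predict(forest, 1))
```

Because the trees are trained independently, the machines never need to talk to each other until their trees are gathered into one forest.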
21
March Madness Bracket
Train a model to predict individual games
• Use team and opponent features to train the model
• Features: blocks, steals, assists, rebounds, free throw accuracy, field goal accuracy, 3-point accuracy
Calculate the summary statistics of each team
• Group by team and take the mean of each team's features
Predict the result of a game
• Concatenate the summary statistics of the team and its opponent and feed them to the model that predicts individual games
Fill out the bracket by predicting one game at a time
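The steps above can be sketched in Python as a small pipeline. The slides do not show the actual model or data, so everything concrete here is an assumption: the two-feature stat rows, the team names, and `stand_in_model`, which replaces the trained random forest with a trivial comparison. The sketch also assumes the number of teams is a power of two.

```python
from collections import defaultdict


def team_means(games):
    """Group per-game stat rows by team and average each feature."""
    grouped = defaultdict(list)
    for team, stats in games:
        grouped[team].append(stats)
    return {team: [sum(col) / len(col) for col in zip(*rows)]
            for team, rows in grouped.items()}


def game_features(means, team, opponent):
    """Concatenate the two teams' summary statistics into one feature vector."""
    return means[team] + means[opponent]


def fill_bracket(teams, means, predict_win):
    """Advance winners round by round, predicting one game at a time."""
    rounds = []
    while len(teams) > 1:
        winners = []
        for a, b in zip(teams[0::2], teams[1::2]):
            winners.append(a if predict_win(game_features(means, a, b)) else b)
        rounds.append(winners)
        teams = winners
    return rounds


def stand_in_model(features):
    """Hypothetical placeholder for a trained model: the larger stat total wins."""
    half = len(features) // 2
    return sum(features[:half]) > sum(features[half:])


# Hypothetical per-game rows: (team, [steals, assists]).
games = [("A", [4.0, 10.0]), ("A", [6.0, 12.0]), ("B", [3.0, 9.0]),
         ("C", [8.0, 14.0]), ("D", [5.0, 11.0])]
means = team_means(games)
rounds = fill_bracket(["A", "B", "C", "D"], means, stand_in_model)
print(rounds)
```

Swapping `stand_in_model` for a real classifier's prediction function would leave the rest of the pipeline unchanged.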
24
Distributed R rocks!
• Regression on billions of rows in minutes
• Graph algorithms on 10B edges
• Load 400GB+ data from database to R in < 10 minutes
• Open source!
25
That’s cool… what can I do with it?
• Collaborate
• Github (report issues, send PRs) https://github.com/vertica/DistributedR
• Standardization with R-core http://www.r-bloggers.com/enhancing-r-for-distributed-computing/
• Get the SW + docs: http://www.vertica.com/hp-vertica-products/hp-vertica-distributed-r/
• Buy commercial support