27
Building A Scalable Data Science Platform with R Mario Inchiosa, PhD Principal Software Engineer Hadoop Summit San Jose June 30, 2016

Building a Scalable Data Science Platform with R

Embed Size (px)

Citation preview

Page 1: Building a Scalable Data Science Platform with R

Building A Scalable Data Science Platform with R

Mario Inchiosa, PhDPrincipal Software EngineerHadoop Summit San JoseJune 30, 2016

Page 2: Building a Scalable Data Science Platform with R

What is

• The most popular statistical programming language• A data visualization tool• Open source

• 2.5+M users • Taught in most universities

• Thriving user groups worldwide

• 8000+ contributed packages

• New and recent grad’s use it

Language Platform

Community

Ecosystem• Rich application & platform integration

Page 3: Building a Scalable Data Science Platform with R

Common R use casesVertical Sales & Marketing Finance & Risk Customer & Channel Operations &

Workforce

Retail

Demand  ForecastingLoyalty Programs

Cross-sell & Upsell Customer Acquisition

Fraud DetectionPricing Strategy

Personalization Lifetime Customer Value Product Segmentation

Store Location DemographicsSupply Chain Management

Inventory Management

Financial Services

Customer Churn Loyalty Programs Cross-sell & Upsell

Customer Acquisition

Fraud DetectionRisk& Compliance

Loan Defaults

PersonalizationLifetime Customer Value

Call Center OptimizationPay for Performance 

Healthcare Marketing Mix Optimization

Patient Acquisition Fraud DetectionBill Collection

Population HealthPatient Demographics

Operational Efficiency Pay for Performance

ManufacturingDemand Forecasting

Marketing mix OptimizationPricing Strategy

Perf Risk Management Supply Chain Optimization

Personalization

Remote Monitoring Predictive Maintenance

Asset Management

Page 4: Building a Scalable Data Science Platform with R

R Adoption is on a Tear, but not Enterprise-class!

2015 2014

IEEE Spectrum July 2015

Data Flows Overwhelm Open Source R– In-Memory Operation– Lack of Parallelism– Expensive Data Movement &

Duplication

Not enterprise ready– Inadequacy of Community

Support– Lack of Guaranteed Support

Timeliness– No SLAs or Support models

R Adoption is on a Tear, but Open Source R is not Enterprise Class

Page 5: Building a Scalable Data Science Platform with R

R from Microsoft brings

Peace of mind

Efficiency Speed and

scalability

Flexibility

and agility

Page 6: Building a Scalable Data Science Platform with R

Portability & investment assurance

Write Once – Deploy Anywhere

R Server portfolio

Cloud• Windows• Linux• HDInsight

• SQL Server 2016 EE• SQL Server 2016 SERDBMS• Windows• LinuxDesktops & Servers

Hadoop & Spark • Hortonworks• Cloudera• MapR

EDW • SQL Server 2016• Teradata Database

R+CR

ANM

icros

oft R

Op

en

DistributedR

ScaleR

ConnectR

DeployRR Server Technology

Page 7: Building a Scalable Data Science Platform with R

R Script for Execution in MapReduceSample R Script:

rxSetComputeContext( RxHadoopMR(…) )inData <- RxTextData(“/ds/AirOnTime.csv”, fileSystem = hdfsFS)model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)

Define Compute Context

Define Data Source

Train Predictive

Model

Page 8: Building a Scalable Data Science Platform with R

Easy to Switch From MapReduce to Spark

Keep other code

unchanged

Sample R Script:

rxSetComputeContext( RxSpark(…) )inData <- RxTextData(“/ds/AirOnTime.csv”, fileSystem = hdfsFS)model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)

Change the Compute Context

Page 9: Building a Scalable Data Science Platform with R

R Server: scale-out R, Enterprise Class!• 100% compatible with open source R • Any code/package that works today with R will work in R Server

• Wide range of scalable and distributed R functions • Examples: rxDataStep(), rxSummary(), rxGlm(), rxDForest(), rxPredict()

• Ability to parallelize any R function• Ideal for parameter sweeps, simulation, scoring

Page 10: Building a Scalable Data Science Platform with R

Parallelized & Distributed Algorithms

Data import – Delimited, Fixed, SAS, SPSS, OBDC

Variable creation & transformation Recode variables Factor variables Missing value handling Sort, Merge, Split Aggregate by category (means, sums)

Chi Square Test Kendall Rank Correlation Fisher’s Exact Test Student’s t-Test

ETL Statistical Tests

Subsample (observations & variables) Random Sampling

Sampling Min / Max, Mean, Median (approx.) Quantiles (approx.) Standard Deviation Variance Correlation Covariance Sum of Squares (cross product matrix for

set variables) Pairwise Cross tabs Risk Ratio & Odds Ratio Cross-Tabulation of Data (standard tables &

long form) Marginal Summaries of Cross Tabulations

Descriptive Statistics Sum of Squares (cross product matrix for

set variables) Multiple Linear Regression Generalized Linear Models (GLM)

exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.

Covariance & Correlation Matrices Logistic Regression Predictions/scoring for models Residuals for all models

Predictive Statistics K-Means

Clustering

Decision Trees Decision Forests Gradient Boosted Decision

Trees Naïve Bayes

Machine Learning

Simulation Simulation (e.g. Monte Carlo) Parallel Random Number

Generation Custom Parallelization rxDataStep

rxExec PEMA-R API

Variable Selection Stepwise Regression

Page 11: Building a Scalable Data Science Platform with R

R Server Hadoop Architecture

R R R R R

R R R R R

Data Nodes Edge Node

Head Nodes

Data Scientists

R Server

2. R Server Distributed Processing:

Master R process on Edge Node

Apache YARN and Spark

Worker R processes on Data Nodes

1. R Server Local Processing:

Data in Distributed Storage

R process on Edge Node

Page 12: Building a Scalable Data Science Platform with R

R Server for Hadoop - ConnectivityWorker Task

R Server Master Task

Finalizer

Initiator

Edge Node

Worker Task

Worker Task

Remote Execution: ssh

Web Services DeployR

ssh or R Tools for Visual Studio

BI Tools & Applications

Jupyter Notebooks

Thin Client IDEs

https://

https://

or MapReduce

Page 13: Building a Scalable Data Science Platform with R

HDInsight + R Server: Managed Hadoop for Advanced Analytics in the Cloud

SparkR functions RevoScaleR functions

R

Spark and Hadoop

Blob Storage Data Lake Storage

• Easy setup, elastic, SLA

• Spark• Integrated notebooks experience• Upgraded to latest Version 1.6.1

• R Server• Leverage R skills with massively

scalable algorithms and statistical functions

• Reuse existing R functions over multiple machines

Page 14: Building a Scalable Data Science Platform with R

R Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows and terabytes of data

0

1,000

,000,0

00

2,000

,000,0

00

3,000

,000,0

00

4,000

,000,0

00

5,000

,000,0

00

6,000

,000,0

00

7,000

,000,0

00

8,000

,000,0

00

9,000

,000,0

00

10,00

0,000

,000

11,00

0,000

,000

12,00

0,000

,000

13,00

0,000

,000

Logistic Regression on NYC Taxi Dataset

Billions of rows

Elap

sed

Tim

e

2.2 TB

Page 15: Building a Scalable Data Science Platform with R

Prepare: Assemble, cleanse, profile and transform diverse data relevant to the subject.

OperationalizeModelPrepare

Model: Use statistical and machine learning algorithms to build classifiers and regression models

Operationalize: Make predictions and visualizations to support business applications

Typical advanced analytics lifecycle

Page 16: Building a Scalable Data Science Platform with R

• Clean/Join – Using SparkR from R Server

• Train/Score/Evaluate – Scalable R Server

functions

• Deploy/Consume – Using AzureML from R

Server

Airline Arrival Delay Prediction Demo

Page 17: Building a Scalable Data Science Platform with R

• Passenger flight on-time performance data from the US Department of Transportation’s TranStats data collection

• >20 years of data• 300+ Airports• Every carrier, every commercial flight• http://www.transtats.bts.gov

Airline data set

Page 18: Building a Scalable Data Science Platform with R

• Hourly land-based weather observations from NOAA

• > 2,000 weather stations• http://www.ncdc.noaa.gov/orders/qclcd/

Weather data set

Page 19: Building a Scalable Data Science Platform with R

Provisioning a cluster with R Server

Page 20: Building a Scalable Data Science Platform with R

Scaling a cluster

Page 21: Building a Scalable Data Science Platform with R

Clean and Join using SparkR in R Server

Page 22: Building a Scalable Data Science Platform with R

Train, Score, and Evaluate using R Server

Page 23: Building a Scalable Data Science Platform with R

Publish Web Service from R

Page 24: Building a Scalable Data Science Platform with R

• HDInsight Premium Hadoop cluster• Spark on YARN distributed computing• R Server R interpreter• SparkR data manipulation functions• RevoScaleR Statistical & Machine Learning

functions• AzureML R package and Azure ML web

service

Demo Technologies

Page 25: Building a Scalable Data Science Platform with R

Building a genetic disease risk application with R Data

• Public genome data from 1000 Genomes

• About 2TB of raw data

Processing• VariantTools R package

(Bioconductor)• Match against NHGRI GWAS catalogAnalytics• Disease Risk• AncestryPresentation• Expose as Web Service APIs• Phone app, Web page, Enterprise

applications

BAM BAM BAM BAM

VariantTools

GWAS

BAM

Platform• HDInsight Hadoop (8 clusters)

• 1500 cores, 4 data centers• Microsoft R Server

Page 26: Building a Scalable Data Science Platform with R

The Four Transformational Trendscloud

computing2011 2016 5x increase

data science

Universities filling 300,000 US talent gap

90% of the data in the world today has been created in the last two years alone

bigdata

opensourceincluding R, Linux, Hadoop

Page 27: Building a Scalable Data Science Platform with R

© 2016 Microsoft Corporation. All rights reserved.

For more information…

R Servermicrosoft.com/r-server

HDInsight Premiummicrosoft.com/hdinsight