Building a Scalable Data Science Platform with R


Mario Inchiosa, PhD
Principal Software Engineer
Hadoop Summit San Jose
June 30, 2016

What is R?

Language Platform
• The most popular statistical programming language
• A data visualization tool
• Open source

Community
• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide
• New and recent grads use it

Ecosystem
• 8000+ contributed packages
• Rich application & platform integration

Common R use cases

Retail
• Sales & Marketing: Demand Forecasting, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition
• Finance & Risk: Fraud Detection, Pricing Strategy
• Customer & Channel: Personalization, Lifetime Customer Value, Product Segmentation
• Operations & Workforce: Store Location Demographics, Supply Chain Management, Inventory Management

Financial Services
• Sales & Marketing: Customer Churn, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition
• Finance & Risk: Fraud Detection, Risk & Compliance, Loan Defaults
• Customer & Channel: Personalization, Lifetime Customer Value
• Operations & Workforce: Call Center Optimization, Pay for Performance

Healthcare
• Sales & Marketing: Marketing Mix Optimization, Patient Acquisition
• Finance & Risk: Fraud Detection, Bill Collection
• Customer & Channel: Population Health, Patient Demographics
• Operations & Workforce: Operational Efficiency, Pay for Performance

Manufacturing
• Sales & Marketing: Demand Forecasting, Marketing Mix Optimization
• Finance & Risk: Pricing Strategy, Perf Risk Management
• Customer & Channel: Supply Chain Optimization, Personalization
• Operations & Workforce: Remote Monitoring, Predictive Maintenance, Asset Management

R Adoption is on a Tear, but Open Source R is not Enterprise Class

[Chart: IEEE Spectrum, July 2015, programming language rankings for 2015 and 2014]
http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages

Data Flows Overwhelm Open Source R
– In-Memory Operation
– Lack of Parallelism
– Expensive Data Movement & Duplication

Not enterprise ready
– Inadequacy of Community Support
– Lack of Guaranteed Support Timeliness
– No SLAs or Support models

R from Microsoft brings
• Peace of mind
• Efficiency
• Speed and scalability
• Flexibility and agility
• Portability & investment assurance

Write Once – Deploy Anywhere

R Server portfolio

Cloud
• Windows
• Linux
• HDInsight

RDBMS
• SQL Server 2016 EE
• SQL Server 2016 SE

Desktops & Servers
• Windows
• Linux

Hadoop & Spark
• Hortonworks
• Cloudera
• MapR

EDW
• SQL Server 2016
• Teradata Database

R Server Technology
• R+CRAN
• Microsoft R Open
• DistributedR
• ScaleR
• ConnectR
• DeployR

R Script for Execution in MapReduce

Sample R Script:

# Define Compute Context
rxSetComputeContext( RxHadoopMR(…) )

# Define Data Source
inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)

# Train Predictive Model
model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)

Easy to Switch From MapReduce to Spark

Sample R Script:

# Change the Compute Context
rxSetComputeContext( RxSpark(…) )

# Keep other code unchanged
inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)
model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)

R Server: scale-out R, Enterprise Class!

• 100% compatible with open source R
  • Any code/package that works today with R will work in R Server
• Wide range of scalable and distributed R functions
  • Examples: rxDataStep(), rxSummary(), rxGlm(), rxDForest(), rxPredict()
• Ability to parallelize any R function
  • Ideal for parameter sweeps, simulation, scoring
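
To make the "parameter sweep" idea concrete, here is a minimal sketch that runs one model fit per parameter value in parallel. It uses base R's `parallel` package as a stand-in, not R Server's own mechanism. The data set, the `fit_one` helper, and the choice of k-means are all illustrative assumptions.

```r
library(parallel)

# Synthetic two-cluster data set (illustrative only, not from the talk).
set.seed(42)
pts <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 5), ncol = 2)
)

# One task of the sweep (hypothetical helper): fit k-means for a single k.
fit_one <- function(k, data) {
  set.seed(1)  # fixed seed so each task gives the same result on any worker
  fit <- kmeans(data, centers = k, nstart = 5)
  data.frame(k = k, tot_withinss = fit$tot.withinss)
}

# Start two local worker processes and run one task per candidate k.
cl <- makeCluster(2)
results <- parLapply(cl, 2:5, fit_one, data = pts)
stopCluster(cl)

# Combine the per-task results into one table.
sweep <- do.call(rbind, results)
print(sweep)
```

Each task is independent of the others, which is what makes sweeps, simulations, and scoring good candidates for this kind of parallelization.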

Parallelized & Distributed Algorithms

ETL
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test

Sampling
• Subsample (observations & variables)
• Random Sampling

Predictive Statistics
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM)
  • Exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie
  • Standard link functions: cauchit, identity, log, logit, probit
  • User defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Predictions/scoring for models
• Residuals for all models

Clustering
• K-Means

Machine Learning
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Custom Parallelization
• rxDataStep
• rxExec
• PEMA-R API

Variable Selection
• Stepwise Regression
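
The "Parallel Random Number Generation" item matters for simulation because workers sharing one random stream can produce overlapping draws. The sketch below shows the idea with base R's `parallel` package, used here as a stand-in for R Server's own functions. The function names `estimate_pi` and `run_simulation`, the seed, and the task sizes are illustrative assumptions.

```r
library(parallel)

# One Monte Carlo task: estimate pi from n_draws uniform points in the unit square.
estimate_pi <- function(n_draws) {
  x <- runif(n_draws)
  y <- runif(n_draws)
  4 * mean(x^2 + y^2 <= 1)
}

# Run n_tasks independent tasks on a small local cluster (hypothetical helper).
run_simulation <- function(seed, n_tasks = 8, n_draws = 1e5, n_workers = 2) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl))
  # Give each worker its own L'Ecuyer-CMRG stream so draws do not overlap.
  clusterSetRNGStream(cl, iseed = seed)
  unlist(parLapply(cl, rep(n_draws, n_tasks), estimate_pi))
}

estimates <- run_simulation(seed = 2016)
pi_hat <- mean(estimates)
print(pi_hat)
```

Because `parLapply` assigns tasks to workers in a fixed order, rerunning with the same seed and the same number of workers reproduces the same estimates.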

R Server Hadoop Architecture

[Diagram labels: Data Scientists, R Server, Edge Node, Head Nodes, Data Nodes, Data in Distributed Storage]

1. R Server Local Processing:
• R process on Edge Node

2. R Server Distributed Processing:
• Master R process on Edge Node
• Apache YARN and Spark
• Worker R processes on Data Nodes

R Server for Hadoop - Connectivity

Cluster
• Edge Node: R Server Master Task (Initiator, Finalizer)
• Worker Tasks (Spark or MapReduce)

Connection options
• Remote Execution: ssh
• ssh or R Tools for Visual Studio
• Web Services: DeployR
• BI Tools & Applications
• Jupyter Notebooks
• Thin Client IDEs
• https://

HDInsight + R Server: Managed Hadoop for Advanced Analytics in the Cloud

[Diagram labels: R (SparkR functions, RevoScaleR functions), Spark and Hadoop, Blob Storage, Data Lake Storage]

• Easy setup, elastic, SLA
• Spark
  • Integrated notebooks experience
  • Upgraded to latest Version 1.6.1
• R Server
  • Leverage R skills with massively scalable algorithms and statistical functions
  • Reuse existing R functions over multiple machines

R Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows and terabytes of data

[Chart: Logistic Regression on NYC Taxi Dataset. Axes: Elapsed Time; number of rows, 0 to 13,000,000,000 (billions of rows). Annotation: 2.2 TB]

Typical advanced analytics lifecycle

• Prepare: Assemble, cleanse, profile and transform diverse data relevant to the subject.
• Model: Use statistical and machine learning algorithms to build classifiers and regression models.
• Operationalize: Make predictions and visualizations to support business applications.
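
The three stages (Prepare, Model, Operationalize) can be sketched on a small in-memory example using open source R only. Everything here is an assumption for illustration: the data is synthetic, the column names (`dep_hour`, `wind_speed`, `delayed`) are hypothetical, and base `glm()` stands in for a scalable modeling function.

```r
# Synthetic stand-in data; not the schema or data used in the talk.
set.seed(123)
n <- 2000
flights <- data.frame(
  dep_hour   = sample(0:23, n, replace = TRUE),
  wind_speed = rexp(n, rate = 1 / 10)
)
# Simulated ground truth: later departures and stronger wind raise delay odds.
p <- plogis(-3 + 0.08 * flights$dep_hour + 0.05 * flights$wind_speed)
flights$delayed <- rbinom(n, size = 1, prob = p)
flights$wind_speed[sample(n, 50)] <- NA  # inject missing values to clean up

# Prepare: drop incomplete rows, then split into training and test sets.
clean <- flights[complete.cases(flights), ]
train_idx <- sample(nrow(clean), size = floor(0.8 * nrow(clean)))
train <- clean[train_idx, ]
test  <- clean[-train_idx, ]

# Model: fit a logistic regression classifier on the training set.
model <- glm(delayed ~ dep_hour + wind_speed, data = train, family = binomial())

# Operationalize: wrap scoring in a function an application could call.
score_flight <- function(dep_hour, wind_speed) {
  newdata <- data.frame(dep_hour = dep_hour, wind_speed = wind_speed)
  as.numeric(predict(model, newdata = newdata, type = "response"))
}

test_prob <- score_flight(test$dep_hour, test$wind_speed)
accuracy  <- mean((test_prob > 0.5) == (test$delayed == 1))
print(accuracy)
```

Keeping scoring behind a single function mirrors the Operationalize stage: the caller supplies inputs and receives a probability, without needing to know how the model was trained.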

Airline Arrival Delay Prediction Demo

• Clean/Join – Using SparkR from R Server
• Train/Score/Evaluate – Scalable R Server functions
• Deploy/Consume – Using AzureML from R Server

Airline data set

• Passenger flight on-time performance data from the US Department of Transportation’s TranStats data collection
• >20 years of data
• 300+ Airports
• Every carrier, every commercial flight
• http://www.transtats.bts.gov

Weather data set

• Hourly land-based weather observations from NOAA
• > 2,000 weather stations
• http://www.ncdc.noaa.gov/orders/qclcd/

Provisioning a cluster with R Server

Scaling a cluster

Clean and Join using SparkR in R Server

Train, Score, and Evaluate using R Server

Publish Web Service from R

Demo Technologies

• HDInsight Premium Hadoop cluster
• Spark on YARN distributed computing
• R Server R interpreter
• SparkR data manipulation functions
• RevoScaleR Statistical & Machine Learning functions
• AzureML R package and Azure ML web service

Building a genetic disease risk application with R

Data
• Public genome data from 1000 Genomes
• About 2TB of raw data

Processing
• VariantTools R package (Bioconductor)
• Match against NHGRI GWAS catalog

Analytics
• Disease Risk
• Ancestry

Presentation
• Expose as Web Service APIs
• Phone app, Web page, Enterprise applications

Platform
• HDInsight Hadoop (8 clusters)
  • 1500 cores, 4 data centers
• Microsoft R Server

[Diagram labels: BAM files, VariantTools, GWAS]

The Four Transformational Trends

• cloud computing: 5x increase, 2011 to 2016
• data science: Universities filling 300,000 US talent gap
• big data: 90% of the data in the world today has been created in the last two years alone
• open source: including R, Linux, Hadoop

© 2016 Microsoft Corporation. All rights reserved.

For more information…

R Server
microsoft.com/r-server
https://www.microsoft.com/en-us/server-cloud/products/r-server/

HDInsight Premium
microsoft.com/hdinsight
https://azure.microsoft.com/en-us/services/hdinsight/