BR005
Microsoft Machine Learning & Data Science Summit
September 26 – 27 | Atlanta, GA
Building a Scalable Data Science Platform with R on HDInsight
Debraj GuhaThakurta
Senior Data Scientist
Data Group – Algorithms and Data Science, Redmond
Email: [email protected]
Twitter: @d_guhathakurta
Co-contributors: Mario Inchiosa, Katherine Zhao, Hang Zhang, Max Kaznadi
Related talks
• R & Spark as Yin and Yang of Scalable Machine Learning in Azure HDInsight
  • Mon, Sept 26, 1:30 – 2:30 PM
  • Maxim Lukiyanov
• Big, Fast, and Data-Furious…with Spark
  • Mon, Sept 27, 12:30 – 1:30 PM
  • Maxim Lukiyanov
• Instructor-Led Lab: The Cortana Intelligence Suite - Part Two: Deep Dive
  • Mon, Sept 26, 10:30 AM – 5 PM
  • Buck Woody
• Self-Paced Lab: Microsoft R Server
  • Mon, Sept 26, 1 – 4 PM; Tue, Sept 27, 10:30 – 11:30 AM & 12:30 – 2:30 PM
  • Jeremy Reynolds
• Data Science Doesn’t Just Happen, It Takes a Process. Learn about Ours…
  • Tue, Sept 27, 3 – 4 PM
  • Hang Zhang, Jacob Spoelstra, Gopi Kumar
Key takeaways
• Microsoft R Server: Benefits
• R Server on HDInsight (Premium, Preview): Scalable analytical platform on Azure
• How to:
  • Develop an end-to-end data science process using R Server on Spark HDInsight (Premium)
  • Adopt the process and code
Agenda
• R and its benefits / limitations
• Microsoft R Server: Scalable, enterprise-class
• R Server on HDInsight (Premium) clusters
• Demo - Developing end-to-end data science processes using R Server on HDInsight Spark clusters
• Pointers to technical content: Tutorials, templates, blogs
R – its benefits and limitations
R - introduction
Community
• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide
Language Platform
• The most popular statistical programming & ML language
• Data visualization & reporting tool
• Open source, transparent
• Free
Ecosystem
• 9,000+ contributed packages
• Applications & integration
• Many use cases / business problems addressed
Preferred language by Analytics Professionals
Source: SAS, R or Python Survey 2016, by Burtch Works
Which do you prefer to use: SAS, R, or Python?
Unified IEEE Spectrum Ranking 2016: http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages
Common R use cases
Retail
• Sales & Marketing: Demand Forecasting, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition
• Finance & Risk: Fraud Detection, Pricing Strategy
• Customer & Channel: Personalization, Lifetime Customer Value, Product Segmentation
• Operations & Workforce: Store Location Demographics, Supply Chain Management, Inventory Management
Financial Services
• Sales & Marketing: Customer Churn, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition
• Finance & Risk: Fraud Detection, Risk & Compliance, Loan Defaults
• Customer & Channel: Personalization, Lifetime Customer Value
• Operations & Workforce: Call Center Optimization, Pay for Performance
Healthcare
• Sales & Marketing: Marketing Mix Optimization, Patient Acquisition
• Finance & Risk: Fraud Detection, Bill Collection
• Customer & Channel: Population Health, Patient Demographics
• Operations & Workforce: Operational Efficiency, Pay for Performance
Manufacturing
• Sales & Marketing: Demand Forecasting, Marketing Mix Optimization
• Finance & Risk: Pricing Strategy, Perf Risk Management
• Customer & Channel: Supply Chain Optimization, Personalization
• Operations & Workforce: Remote Monitoring, Predictive Maintenance, Asset Management
Processing limitations of open source R
• In-Memory Operation
• Lack of Parallelism
• Expensive Data Movement & Duplication
Open source R is not enterprise class
• Inadequacy of Community Support
• Lack of Guaranteed Support Timeliness
• No SLAs or Support Models
Microsoft R Server
R from Microsoft brings
Peace of mind, speed and scalability, efficiency, flexibility, portability & investment assurance
• Support and SLA
• Works on data in memory or on disk (scale)
• Wide range of scalable and distributed R functions
• Works in several compute contexts (incl. Hadoop, Spark, SQL Server) and data sources (incl. disk, HDFS, SQL)
R Server portfolio
• Cloud: Windows, Linux
• RDBMS: SQL Server 2016 EE, SQL Server 2016 SE
• Desktops & Servers: Windows, Linux
• Hadoop & Spark: Hortonworks, Cloudera, MapR
• EDW: SQL Server 2016, Teradata Database
R Server Technology
• R+CRAN
• Microsoft R Open
• DistributedR
• ScaleR
• ConnectR
• DeployR
ScaleR - parallel or distributed processing
Write once deploy anywhere - WODA
• On a workstation:
  • All available cores used for math operations and parallel processes
  • Hard drive capacity sets limit for data size, not RAM
  • Works directly on XDF (External data frames) on disk
• On a cluster:
  • Parallel utilization of nodes
  • Distributed file systems like HDFS greatly expand possible data sizes
ScaleR
PEMA: Parallel external memory algorithms
Source: Lee Edlefsen, PEMAs applied to GLM, http://www.slideshare.net/RevolutionAnalytics/parallel-external-memory-algorithms-applied-to-generalized-linear-models
Stream data into RAM in blocks. “Big Data” can be any data size. Can handle Megabytes to Gigabytes to Terabytes…
ScaleR algorithms work inside multiple cores / nodes in parallel at high speed
Interim results are collected and combined analytically to produce the output on the entire data set
XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
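The block-wise idea above can be sketched in plain base R. This is only an illustration under stated assumptions, not the ScaleR implementation: the data are simulated, the block size is arbitrary, and a real PEMA would read each block from disk (for example from an XDF file) rather than slice an in-memory data frame. Each block contributes its cross-product matrices, and the interim results are combined by solving the normal equations once at the end.

```r
# Sketch of the PEMA idea in base R (not RevoScaleR): fit a linear model by
# processing the data in blocks and combining small per-block summaries.
set.seed(42)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = runif(n))   # simulated data
dat$y <- 1.5 + 2.0 * dat$x1 - 0.5 * dat$x2 + rnorm(n, sd = 0.1)

chunk_size <- 1000                  # hypothetical block size
starts <- seq(1, n, by = chunk_size)

# Only these small accumulators are kept across blocks.
xtx <- matrix(0, 3, 3)              # running sum of X'X
xty <- matrix(0, 3, 1)              # running sum of X'y
for (s in starts) {
  block <- dat[s:min(s + chunk_size - 1, n), ]    # stand-in for a disk read
  X <- cbind(1, block$x1, block$x2)               # intercept + predictors
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, block$y)
}

# Combine the interim results analytically: solve the normal equations.
beta_chunked <- drop(solve(xtx, xty))

# Reference fit on the full data set, for comparison only.
beta_full <- unname(coef(lm(y ~ x1 + x2, data = dat)))
print(all.equal(beta_chunked, beta_full))
```

Because the per-block summaries are simply added, the blocks could equally be processed on different cores or nodes and summed afterwards.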
Available ScaleR distributed algorithms
• Linear regression (rxLinMod)
• Generalized linear models (rxLogit, rxGLM)
• Decision trees (rxDTree)
• Gradient boosted decision trees (rxBTrees)
• Random forests (rxDForest)
• K-means (rxKmeans)
• Naïve Bayes (rxNaiveBayes)
ScaleR distributed algorithms
ETL
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Predictive Statistics
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user defined distributions & link functions
• Covariance & Correlation Matrices
• Predictions/scoring for models
• Residuals for all models
Clustering
• K-Means
Machine Learning
• Linear regression
• Logistic regression
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Custom Parallelization
• rxExec
• PEMA-R API
Variable Selection
• Stepwise Regression
Typical uses of open source R
• Any analysis that is more complex than simple aggregations
• Analysis with data that fit in physical memory of single machines
• Creating sophisticated visualizations (e.g. ggplot, lattice)
• Creating reports (use knitr and Markdown)
• Analyses that use domain-specific tools or cutting-edge algorithms
  • e.g. forecasting, health informatics, etc.
Typical uses of R Server
• Working with big data
• Building models that take too long to run in R
• Working with clusters and distributed file systems
  • e.g. HDInsight clusters + HDFS
• Developing portable scripts for many compute contexts
Benefits of R Server
• Big Data
  • Open source R: In-memory bound
  • R Server: Hybrid memory & disk scalability
  • Benefit: Operates on bigger volumes of data
• Speed of Analysis
  • Open source R: Single threaded
  • R Server: Parallel threading
  • Benefit: Shrinks analysis time
• Enterprise Readiness
  • Open source R: Community support
  • R Server: Commercial support
  • Benefit: Delivers full service production support
• Analytic Breadth & Depth
  • Open source R: 9000+ innovative analytic packages
  • R Server: Leverage open source packages plus Big Data ready packages
  • Benefit: Supercharges R with ScaleR functions
• Commercial Viability
  • Open source R: Risk of deployment of open source
  • R Server: Commercial license
  • Benefit: Eliminate risk with open source
R Server on HDInsight (Premium)
R Server on HDInsight (Premium)
Managed Hadoop for advanced analytics in the Cloud
• Easy setup, elastic, SLA
• R Server benefits
  • Leverage R skills
  • ScaleR functions
  • ….
• Familiar & enhanced IDEs
  • Popular IDEs (RStudio, RTVS, Notebooks, etc.)
[Diagram: R (RevoScaleR; others, e.g. SparkR) on Hadoop / Spark, over Blob Storage (HDFS) and Data Lake Storage]
Provisioning HDInsight (Premium) with R Server
Elastic - Scaling HDInsight clusters
R Server on HDInsight - Architecture
[Diagram: Data Scientists; R Server on the Edge Node; Head Nodes; Data/Worker Nodes running R]
R Server on HDInsight - Connectivity
[Diagram: Thin Client IDEs and Jupyter Notebooks (https://); Remote Execution: ssh, or ssh or R Tools for Visual Studio; R Server Master Task on the Edge Node; Worker Tasks (… or MapReduce)]
R Server on HDInsight - Data processing
• Server Local Processing
  • R process on Edge Node
  • Data in Distributed Storage
• Server Distributed Processing
  • Master R process on Edge Node
  • Apache YARN / Spark
  • Worker R processes on Data Nodes
Write once deploy anywhere - WODA
Switching compute contexts
Code can be deployed from a server or edge node to run in Spark/Hadoop without any functional R model re-coding.

Compute context R script - sets where the model will run

Local Parallel processing - Linux or Windows
## SETUP LOCAL ENVIRONMENT VARIABLES ##
myLocalCC <- "localpar"
## LOCAL COMPUTE CONTEXT ##
rxSetComputeContext(myLocalCC)

In Spark
mySparkCC <- RxSpark()
rxSetComputeContext(mySparkCC)

In Hadoop
myHadoopCC <- RxHadoopMR()
rxSetComputeContext(myHadoopCC)

R script – does not need to change to run in Hadoop/Spark
## Statistical Summary
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineData, reportProgress = 1)
## Linear model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = AirlineData)
R Script for Execution in MapReduce
Sample R Script:
## Define Compute Context
rxSetComputeContext( RxHadoopMR(…) )
## Define Data Source
## (hdfsFS is not defined on the slide; assumed to be an HDFS file system object)
hdfsFS <- RxHdfsFileSystem()
inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)
## Train Predictive Model
model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)
Easy to Switch From MapReduce to Spark
Sample R Script:
## Change the Compute Context
rxSetComputeContext( RxSpark(…) )
## Keep other code unchanged
## (hdfsFS is not defined on the slide; assumed to be an HDFS file system object)
hdfsFS <- RxHdfsFileSystem()
inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)
model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)
Creating a data science process using R Server on Spark HDInsight
Apache Spark engine and its APIs
Source: Denny Lee, Databricks
[Diagram: Spark Core with Spark SQL, Spark Streaming, MLlib, GraphX]
o Scale out, fault tolerant, distributed, in-memory processing
o Multi-language API (incl. R)
o Standard libraries: ML, statistics
Spark’s use cases - Diverse industries & scenarios
Source: Databricks Spark 2015 survey report, https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html
Spark advanced analytics
Source: Databricks Spark 2015 survey report, https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html
• Advanced analytics is an important Spark feature
• R is rapidly gaining popularity (available since June 2015)
Open-source packages for ML in Spark using R
o SparkR:
  o R package - a light-weight front-end for Apache Spark from R
  o Limited in terms of ML algo bindings at this time
  o Works on MLlib functions (RDDs)
o Sparklyr-ML:
  o Developed by RStudio
  o Provides R bindings to spark.ml library
Building intelligent applications using team data science process
• Git-based repositories with templates providing a central archive
• Standardized project structure
• Document templates
• Utility scripts
• Independent of the execution environment, to allow scientists to use multiple cloud resources as needs dictate.
https://blogs.technet.microsoft.com/machinelearning/2016/09/08/building-intelligent-applications-using-the-team-data-science-process/
http://aka.ms/tdsp
Related talk: Data Science Doesn’t Just Happen, It Takes a Process. Learn about Ours… (Tue, Sept 27, 3 – 4 PM; Hang Zhang, Jacob Spoelstra, Gopi Kumar)
DS process shown in demo
Prepare → Model → Operationalize
• Prepare: Assemble, cleanse, profile and transform diverse data relevant to the subject
• Model: Use statistical and machine learning algorithms to build classifiers and regression models
• Operationalize: Make predictions and visualizations to support business applications
E2E Demo/Example
Flight arrival delay prediction
1. Provisioning clusters using PowerShell scripts
2. Prep (Clean/Join) – Using SparkR from R Server
3. Model (Train/Score/Evaluate) – ScaleR
4. Deployment – to Azure ML from R Server
End-to-end data science process example
[Diagram components: Azure Blob Storage; HDInsight with Microsoft R Server; Azure Machine Learning; Web Application; Power BI. Stages: Data Sources, Data Partition, Feature Engineering, Model Training, Predictions, Web Services, Consumption]
KDD 2016 (Tutorial: Using R on Spark): tinyurl.com/KDD2016R
Azure Machine Learning: https://azure.microsoft.com/en-us/services/machine-learning/
Technologies / services used
• Azure blob storage (HDFS)
• R Server on Spark HDInsight (Premium)
• Azure ML R package and Azure ML web service
• Power BI (optional)
Provisioning & deleting R Server Spark HDInsight clusters using Azure PowerShell cmdlets & ARM templates
## CREATE CLUSTERS USING ARM TEMPLATES
$templatePath = "https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/KDDCup2016/Scripts/Configuration/azuredeploy.json";
$hdiparams = @{clusterType="spark"; clusterName=$clustername; clusterLoginUserName="admin"; clusterLoginPassword=$clusterpasswd; sshUserName="remoteuser"; sshPassword=$clusterpasswd; clusterWorkerNodeCount=2};
New-AzureRmResourceGroupDeployment -Name $clustername -ResourceGroupName $resourcegroup -TemplateParameterObject $hdiparams -TemplateUri $templatePath;
## DELETE CLUSTERS
Remove-AzureRmHDInsightCluster -ClusterName $clustername
Script-based deployment of HDInsight clusters with R Server
Prediction task: Predict flight delays
• Predict whether a flight's arrival will be delayed by 15 mins or not (binary classification), based on features:
  • Airline, flight, airport
    • Airline carrier
    • Type of airplane / vehicle
    • Departure and arrival airports
    • Flight distance
    • Month, week, day
  • Weather
    • Wind speed
    • Visibility
    • Humidity
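As a small illustration of the binary target described above, the label can be derived from the arrival delay in minutes. This is a sketch only: the column names and the four rows are made up, and the "15 minutes or more" cutoff is an assumption chosen to match the usual meaning of the ARR_DEL15 indicator in the on-time performance data.

```r
# Hypothetical column names; arr_delay is the arrival delay in minutes.
flights <- data.frame(
  carrier   = c("AA", "DL", "UA", "WN"),
  arr_delay = c(-3, 15, 42, 14)
)

# Binary label: 1 if the flight arrived at least 15 minutes late, else 0
# (assumed cutoff, see the note above).
flights$arr_del15 <- as.integer(flights$arr_delay >= 15)
print(flights)
```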
Data-set: Airline & Weather
Airline
• Passenger flight on-time performance data from the US Department of Transportation’s TranStats data collection
• >20 years of data
• 300+ Airports
• Every carrier, every commercial flight
• http://www.transtats.bts.gov
Weather
• Hourly land-based weather observations from NOAA (National Oceanic and Atmospheric Administration)
• > 2,000 weather stations
• http://www.ncdc.noaa.gov/orders/qclcd/
Connection: Thin client → RStudio Server + glimpse of down-sampled data (19 million rows)
Data prep
Clean and Join using SparkR in R Server
• SparkR: R package - a light-weight front-end for Apache Spark from R
• Provides distributed operations like selection, filtering, aggregation using SparkSQL
• Distributed machine learning using Apache Spark’s MLlib (limited)
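The clean-and-join step can be pictured with a small base R stand-in. This is not the SparkR code used in the demo: it runs in memory on a few made-up rows, and every column name is hypothetical. It only shows the shape of the operation, which is to drop incomplete flight records and attach the weather observed at the origin airport in the departure hour.

```r
# Base R stand-in for the distributed join; all names and values are made up.
flights <- data.frame(
  origin    = c("ATL", "ATL", "SEA", "ORD"),
  dep_hour  = c(9, 17, 9, 12),
  arr_delay = c(5, 40, NA, 22)
)
weather <- data.frame(
  airport    = c("ATL", "ATL", "SEA"),
  hour       = c(9, 17, 9),
  wind_speed = c(6, 21, 11),
  visibility = c(10, 3, 8)
)

# Clean: drop flights with a missing arrival delay.
flights_clean <- flights[!is.na(flights$arr_delay), ]

# Join: inner join on (airport, hour); flights without a matching
# weather observation are dropped.
joined <- merge(
  flights_clean, weather,
  by.x = c("origin", "dep_hour"),
  by.y = c("airport", "hour")
)
print(joined)
```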
Modeling
Train, score, and evaluate using ScaleR functions
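The train / score / evaluate loop can be sketched in base R on simulated data. This is an assumption-laden stand-in: base R `glm` takes the place of the ScaleR logistic regression, the two predictors and their coefficients are invented, and AUC is computed by hand from ranks rather than with a ScaleR helper.

```r
set.seed(1)
n <- 2000
# Simulated predictors (hypothetical names and ranges).
d <- data.frame(wind_speed = runif(n, 0, 30), visibility = runif(n, 0, 10))
# Simulated label: delays are more likely with high wind and low visibility.
p <- plogis(-1 + 0.1 * d$wind_speed - 0.2 * d$visibility)
d$delayed <- rbinom(n, 1, p)

# Train on 75% of the rows and hold out the rest for scoring.
train_idx <- sample(n, size = 0.75 * n)
train <- d[train_idx, ]
test <- d[-train_idx, ]
fit <- glm(delayed ~ wind_speed + visibility, data = train, family = binomial())

# Score: predicted probability of delay on the held-out rows.
score <- predict(fit, newdata = test, type = "response")

# Evaluate: AUC via the rank-sum (Mann-Whitney) formula.
r <- rank(score)
n_pos <- sum(test$delayed == 1)
n_neg <- sum(test$delayed == 0)
auc <- (sum(r[test$delayed == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)
```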
Modeling scalability with ScaleR on Spark HDInsight
Scales linearly to hundreds of nodes, billions of rows and terabytes of data
[Chart: Logistic Regression on NYC Taxi Dataset. x-axis: billions of rows (0 to 13 billion); y-axis: elapsed time (0 to 1,800). HDInsight (Premium) Spark cluster, 100 D12 (4 core, 28 GB) worker nodes; 2.2 TB. Credit: Mario Inchiosa]
Comparison of ScaleR with open source algorithms (Preliminary)
Configuration:
• HDI cluster size: 7 nodes
• 1 Edge Node: 8 cores, 28GB
• 4 Worker Nodes: 8 cores, 28GB
• Dataset: Duplicated Airlines data (.csv)
• Number of columns: 26
[Chart: Logistic Regression (E2E - reading from csv files). x-axis: number of rows (million); y-axis: elapsed time; four series, unlabeled in the extracted text. Credit: Katherine Zhao]
Azure ML - Deploying web services for predictive analytics
• Easily build ML models
• Easily deploy models as web-services
Deployment
Publish Web Service from R Server in AzureML
azureml-settings.json
{"workspace": {
  "id": "<>",
  "authorization_token": "<>",
  "api_endpoint": "https://studioapi.azureml.net",
  "management_endpoint": "https://management.azureml.net"
}}
A prediction web service in AzureML
Adopting process and code - Resources
Tutorials - Scalable data analytics using R Server
• KDD Conference tutorial 2016
  • http://www.tinyurl.com/KDD2016R
• Public GitHub repository
Templates – Predictive solutions for business problems
Cortana Intelligence Gallery
https://gallery.cortanaintelligence.com/Tutorial/Retail-Customer-Churn-Template-using-Microsoft-R-Server-HDInsight-Spark-1
Blogs - Further examples of scalable analysis
https://blogs.msdn.microsoft.com/azuredatalake/2016/08/09/rapid-big-data-prototyping-with-microsoft-r-server-on-apache-spark-context-switching-spark-tuning/
http://blog.revolutionanalytics.com/2016/04/mrs-nyc-taxi.html
Summary & acknowledgements
Summary
• R Server on Azure HDInsight (Premium) – a managed distributed compute platform for data science
• Scalable end-to-end processes can be built on HDI clusters integrated with other Azure services
• Published resources (w/ code) available for developing analytical workflows
Acknowledgements
• Mario Inchiosa [Principal Software Engineer]
• Katherine Zhao [Data Scientist II]
• Jeremy Reynolds [Senior Data Scientist Lead]
• Max Kaznadi [Data Scientist II]
• Hang Zhang [Senior Data Scientist Manager]
Thank you!
Debraj GuhaThakurta
[email protected]
© Copyright Microsoft Corporation. All rights reserved.
Backups
R Server architecture
[Diagram: R Open (R+CRAN) and Microsoft R Server (DistributedR, ScaleR, ConnectR, DeployR), plus RTVS]
ConnectR
• High-speed & direct connectors
• Available for:
  • High-performance XDF
  • SAS, SPSS, delimited & fixed format text data files
  • Hadoop HDFS (text & XDF)
  • Teradata Database
  • EDWs and ADWs
  • ODBC
ScaleR
• Ready-to-use high-performance big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
R+CRAN
• Open source R interpreter
• Freely-available huge range of R algorithms
• Algorithms callable by Microsoft R
• Embeddable in R scripts
• 100% compatible with existing R scripts, functions and packages
Microsoft R Open
• Based on open source R
• High-performance math library to speed up linear algebra functions
• Checkpoint package to easily share R code and replicate results using specific R package versions
DeployR
• RESTful APIs for easy integration from Java, JavaScript, .NET
• Enterprise authentication & security
R Tools for Visual Studio
• State-of-the-art R Tools for Visual Studio IDE
Modeling
Train, Score, and Evaluate using R Server
Deployment
Publish Web Service from R
## Convert the ScaleR decision tree to an rpart object, then wrap scoring in a function
rpartModel <- as.rpart(dTreeModel)
scoringFn <- function(newdata) {
  library(rpart)
  predict(rpartModel, newdata = newdata)
}
azureml-settings.json
{"workspace": {
  "id": "<>",
  "authorization_token": "<>",
  "api_endpoint": "https://studioapi.azureml.net",
  "management_endpoint": "https://management.azureml.net"
}}