Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Microsoft R Family – L200Microsoft R Family – L200
Trends What is R?R Server & SQL
R ServicesR Analytics on
Hadoop
Agenda
Trends
Connected data
CLOUD
MOBILE
Microsoft R Family – L200Microsoft R Family – L200
Three major trends converging
IntelligenceCloudData
Microsoft R Family – L200
Transformational trends
Machine
Learning
PP
T R
EM
12
Churn analysis
Social network analysis
Recommenda-tion engines
Location-based tracking and services
Weather forecasting business planning
Fraud detection
Equipment monitoring
Personalized Insurance
cloud computing
2011 2016 5x increase
data explosion
90% of the data in the world today has been created in the last two years alone
Decision
$235 $158 $456 $674
Product & Service
Innovation
Customer Facing Operations Productivity
Data DividendIncremental Gains Made by Leaders in Data and Analytics
Gains in $ BillionsIDC Data Dividend Study and Survey, N=2,020, April 2014
Microsoft R Family – L200Microsoft R Family – L200
EXAMPLE SOLUTIONS
Advanced Analytics scenarios
Retail & Consumer Products
Healthcare
Financial Services & Insurance
Government
Manufacturing
€
Microsoft R Family – L200
Success requires convergence of skills
Data Science Team
Data Engineering
Data Science
Application Development
Business Acumen
Data Management
Data Dividend
What is R?
Microsoft R Family – L200
A brief history of R
1993
Research project in Auckland, NZ
1995
Open source
1997
R-core
2000
R-1.0.0
2003
R Foundation
2004
First UseR!
2009
New York Times
2015
R-3.2.0
R Consortium
Photo credit: Robert Gentleman
Microsoft R Family – L200
IEEE Spectrum July 2015
Language PopularityIEEE Spectrum Top Programming Languages
R’s popularity is growing rapidly
R Usage GrowthRexer Data Miner Survey, 2007-2013
Rexer Data Miner Survey
#9: R
Microsoft R Family – L200
People
+
Data Sources
Apps
Sensors and devices
Microsoft R Server familyFrom Data To Action On Premises
INTELLIGENCEDATA ACTION
Automated SystemsMicrosoft R Server & SQL R Services
Apps
Cortana Intelligence
Microsoft R Family – L200
What is
• A statistics programming language
• A data visualization tool
• Open source
• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide
• 7500+ free algorithms in CRAN
• Scalable to big data
• New and recent grad’s use it
Language Platform
Community
Ecosystem
• Rich application & platform integration
Microsoft R Family – L200
The Comprehensive R Archive Network
CRAN
In addition to CRAN, Bioconductor, GitHub, others distribute R packages
Microsoft R Family – L200Microsoft R Family – L200
Challenges posed by open source R
??
Lack of Commercial
Support
InadequateModeling
Performance
Complex DeploymentProcesses
Limited Data Scale
Microsoft R Family – L200Microsoft R Family – L200
R from Microsoft brings
Peace of mind
Efficiency Speed and scalability
Flexibility and agility
Included in SQL Server 2016
Reuse and optimize existing R code
Eliminate data movement
In-database deployment
Memory and disk scalability
No R memory limits
Write once, deploy anywhere
Enterprise speed and scale
Near-DB analytics
Parallel threading and processing
Reuse SQL skills for data engineering
Cost effectiveness
Scalability and choice
Simplicity and agility
Microsoft R Family – L200
SQL ServerR Services
Linux
Hadoop Teradata
Windows
Microsoft R portfolio
CommercialCommunity
R ServerR Open
Microsoft R Family – L200
DatasizeIn-memory
In-memory In-Memory or Disk Based
Speed of AnalysisSingle threaded Multi-threaded
Multi-threaded, parallel processing 1:N servers
SupportCommunity Community Community + Commercial
Analytic Breadth & Depth 7500+ innovative analytic
packages7500+ innovative analyticpackages
7500+ innovative packages + commercial parallel high-speed functions
LicenceOpen Source
Open SourceCommercial license.Supported release with indemnity
R Distribution Comparison
Copyright Microsoft Corporation. All rights reserved.
Microsoft R Family – L200
Example Solutions
Fraud detection
Sales forecasting
Warehouse efficiency
Predictive maintenance
Extensibility
Microsoft AzureMachine Learning Marketplace
New R scripts
010010
100100
010101
010010
100100
010101
010010
100100
010101
010010
100100
010101
Built-in advanced analytics In-database analytics
Built-in to SQL Server
Analytic Library
T-SQL Interface
Relational Data
010010
100100
010101
010010
100100
010101
Data ScientistInteract directlywith data
Data Developer/DBAManage data and analytics together
?
R
R Integration
R built-in to your T-SQL
In-database Advanced AnalyticsBuild intelligent applications with SQL Server R Services
Open Source R with in-memory & massive scale - multi-threading and massive parallel processing
Ad
vance
d A
nalytics
NEW
NEW
R built-in to SQL Server
Mission critical OLTP
Real-time operational analyticswithout moving the data
NEW
R Server & SQL R Services
Microsoft R Family – L200Microsoft R Family – L200
Microsoft R Products
• Microsoft R Server for Redhat Linux
• Microsoft R Server for SUSE Linux
• Microsoft R Server for Teradata DB
• Microsoft R Server for Hadoop on Redhat
Microsoft R Server
• Built in Advanced Analytics and Stand Alone Server Capability
• Leverages the Benefits of SQL 2016 Enterprise Edition
SQL Server R Services
R Server
Microsoft R Family – L200Microsoft R Family – L200
High-performance open source R plus:◦ Data source connectivity to big-data objects
◦ Big-data advanced analytics
◦ Multi-platform environment support
◦ In-Hadoop and in-Teradata predictive modeling
◦ Development and production environment support◦ IDE for data scientist developers
◦ Secure, Scalable R Deployment
DeployR
R Open R Server
DevelopR
Microsoft R Server is a broadly deployable enterprise-class analytics platform based on R that is
supported, scalable and secure. Supporting a variety of big data statistics, predictive modeling and
machine learning capabilities, R Server supports the full range of analytics – exploration, analysis, visualization and modeling
Introducing Microsoft R Server
Microsoft R Family – L200Microsoft R Family – L200
R Open Microsoft R Server
DeployRDevelopR
The Microsoft R Server Platform
ConnectR• High-speed & direct connectors
Available for:• High-performance XDF
• SAS, SPSS, delimited & fixed format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database & Aster
• EDWs and ADWs
• ODBCScaleR• Ready-to-Use high-performance
big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes
• Wide data sets supported – thousands of variables
DistributedR• Distributed computing framework
• Delivers cross-platform portability
R+CRAN• Open source R interpreter
• R 3.1.2
• Freely-available huge range of R algorithms
• Algorithms callable by RevoR
• Embeddable in R scripts
• 100% Compatible with existing R scripts, functions and packages
RevoR• Performance enhanced R
interpreter
• Based on open source R
• Adds high-performance math library to speed up linear algebra functions
Microsoft R Family – L200Microsoft R Family – L200
ScaleR – Parallel + “Big Data”
Stream data in to RAM in blocks. “Big Data” can be any data size. We handle Megabytes to Gigabytes to Terabytes…
Our ScaleR algorithms work inside multiple cores / nodes in parallel at high speed
Interim results are collected and combined analytically to produce the output on the entire data setXDF file format is optimised to work with the ScaleR library and
significantly speeds up iterative algorithm processing.
Microsoft R Family – L200Microsoft R Family – L200
Microsoft R Server
Multi-Platform Scalable RSupport Entire Analytics Lifecycle
Big Data Scale & Compute
Remote Execution in Hadoop, Teradata
AdvantagesSupported 100% R
Deploy via Web Services
Reduced Security Exposure
Enterprise R Analytics for Linux, Hadoop & Teradata
Typical Advanced Analytics Process
OperationalizeModelPrepare
SQL R Services
Microsoft R Family – L200Microsoft R Family – L200
SQL Server 2016 Enterprise Edition
SQL Server R Services
Integration Facilities:
• Component Integration• Launchers• Parameter Passing• Results Return• Console Output
Return• Parallel Data Exchange
(RTM)• Stored Procedures• Package Administration
SQL Server Enterprise Edition R Services
SQL ServerQuery
Processor
Algorithm Library
• Data Prep• Descriptive Stats• Sampling• Statistical Tests• Predictive Models
• Variable Selection• Clustering• Classification• Custom APIs for R + CRAN• Parallel Scoring
Fast, Parallel, Storage Efficient Algorithms
Microsoft R Open
• 100% Open Source R• Fully CRAN Compatible• Accelerated Math
Open Source R Interpreter
Microsoft R Family – L200Microsoft R Family – L200
Run R In-Database from TSQL
SQL Server 2016
In-Database Execution of R +
CRAN
+ SQL
In-Database Execution of:
R Code
CRAN Packages
Move the Work to the
Data
Run R From the Query Processor
Retrieve Models, Scores,
Transformed Data,
Plots/Images
Operationalize scoring/prediction
in database for data batches or
real-time
Microsoft R Family – L200Microsoft R Family – L200
SQL
In-Database Execution:
Remote Execution
Parallelized Compute SQL Server
Remote ExecutionContext
Explore and Model:
In Parallel, In-Database
Parallelize distributable R and CRAN
Operationalize:
Score In Parallel
Parallel
Worker
Tasks
Move BIG Work
to the Data
Large Data Sets in Chunks
Parallel
Algorithm
Iterate/ Sequence
Run Parallel Algorithms in Database from an R client
Demo
Microsoft R Family – L200Microsoft R Family – L200
Using SQL Server R Services
Model & Deploy In SQL16Support Entire Analytics Lifecycle
Run R Inside SQL 2016 using IDE
Embed R in T-SQL or run stored proc’s
AdvantagesScale By Eliminating Movement
Scale Using Parallelized Computation
Reduce Security Exposure
Reuse SQL Skills for Data Engineering
Reuse SQL Skills in App Development teams
Maximize Operational Stability for Applications
Supported 100% R
Enterprise R Analytics in SQL Server 2016
Typical Advanced Analytics Process
SQL 2016
OperationalizeModelPrepare
Microsoft R Family – L200Microsoft R Family – L200
Eliminate Data Movement
In-Database Example: From 5+ hours to 40 seconds
Rows
Min
ute
s
R on a server pulling data via SQL
R on a server Invoking RRE ScaleR Inside
the EDW
Microsoft R Family – L200Microsoft R Family – L200
Variable Selection
Stepwise Regression
Simulation
Simulation (e.g. Monte Carlo)
Parallel Random Number Generation
Cluster Analysis
K-Means
Classification
Decision Trees
Decision Forests
Gradient Boosted Decision Trees
Naïve Bayes
Combination
rxDataStep
rxExec
PEMA-R API Custom Algorithms
Parallelized, Remote Execution AlgorithmsData Step
Data import – Delimited, Fixed, SAS, SPSS, OBDC
Variable creation & transformation
Recode variables
Factor variables
Missing value handling
Sort, Merge, Split
Aggregate by category (means, sums)
Descriptive Statistics
Min / Max, Mean, Median (approx.)
Quantiles (approx.)
Standard Deviation
Variance
Correlation
Covariance
Sum of Squares (cross product matrix for set variables)
Pairwise Cross tabs
Risk Ratio & Odds Ratio
Cross-Tabulation of Data (standard tables & long form)
Marginal Summaries of Cross Tabulations
Statistical Tests
Chi Square Test
Kendall Rank Correlation
Fisher’s Exact Test
Student’s t-Test
Sampling
Subsample (observations & variables)
Random Sampling
Predictive Models
Sum of Squares (cross product matrix for set variables)
Quantiles (approx.)
Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
Covariance & Correlation Matrices
Logistic Regression
Classification & Regression Trees
Predictions/scoring for models
Residuals for all models
R Analytics on Hadoop
41
Microsoft R Family – L200Microsoft R Family – L200
Leveraging the cloud
Evolutionary, blended, hybrid approaches will help to avoid big disruptions
Embracing the cloud from On-Premises is complicated.
Several Alternatives
On-Premises
In the Cloud
Microsoft R Family – L200Microsoft R Family – L200
Microsoft R Scales to Big Data for Enterprises
Escapes R’s traditional memory limits
Scales predictive modeling using parallelization
Distributes computation cores & nodes
Minimizes data movement using in-database, in-MapReduce and in-Apache Spark execution
Microsoft R Family – L200Microsoft R Family – L200
Leveraging Cloud and On-Prem Data As One
OperationalizeModelPrepare
SQL 2016
OperationalizeSQL 2016
ModelPrepare
ModelSQL 2016
Operationalize
Prepare
On
-Pre
mis
esIn
th
e C
lou
d
Polybase Example: Federate CRM Data with Cloud-Borne Demographics
StretchDB Example: Cost Reduce Growth of EDW On-Prem Scoring Example: Purchase Propensity Prediction
On-Prem Scoring Example: Near-Real-Time Fraud or Anomaly Detection
Microsoft R Family – L200
HDInsight + R Server: Managed Hadoop for Advanced Analytics
Spark APIs(data manipulation)
R Server(big computation)
R
Spark and Hadoop
Blob Storage Data Lake Storage
NotebooksScala/Java
• Apache Hadoop• Fault-tolerant distributed data storage and processing
• Pluggable distributed filesystem (HDFS)
• Big Data - Structured and unstructured
• Apache Spark• Batch, Streaming, SQL, ML, Graph
• Scala, Java, Python, and R APIs
• Integrates with Hadoop
• HDInsight Hadoop Clusters• Fully managed, elastic
• HDFS-compatible Azure Blob Storage and Azure Data Lake Store
• R Server on HDInsight (Premium)• Integrates with Hadoop and Spark
• Provides scalable Statistical and ML Algorithms
• Runs arbitrary R functions and packages in parallel
Microsoft R Family – L200
R Server on HDInsight scales to billions of rows and terabytes of data
0
200
400
600
800
1000
1200
1400
1600
1800
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Elap
sed
Tim
e (
seco
nd
s)
Billions of rows
rxLogit on a 100 node HDInsight cluster
Configuration:• HDI cluster size: 100 nodes- Edge node: D14 V2 (16 cores, 112GB)- Worker Nodes: D12 (4 cores, 28GB)• Dataset: NYC Taxi dataset, filtered,
transformed, and duplicated• Number of columns: 19• Format: CSV• fs.azure.selfthrottling.read.factor=1
2.2 TB
Microsoft R Family – L200Microsoft R Family – L200
DistributedR
ScaleR
ConnectR
DevelopR
Write Once. Deploy Anywhere.
Code Portability Across Platforms
In the Cloud Azure HDI/ Spark
Workstations & Servers LinuxWindows
Clustered SystemsLinux Clusters (LSF For Now)Microsoft HPC
EDW Teradata
HadoopHortonworksClouderaMapR &HDInsight
Microsoft R Family – L200
### SETUP HADOOP ENVIRONMENT VARIABLES ###
myHadoopCC <- RxHadoopMR()
### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(myHadoopCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData(“AirlineDemoSmall.xdf”),
fileSystem = hdfsFS)
DistributedR - Revolution Code Portability
### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ArrDelay+DayOfWeek, data= AirlineDataSet, reportProgress=1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data= AirlineDataSet, means=T)
### Linear Model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0 , data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- “localpar”
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
localFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData(“AirlineDemoSmall.xdf”,
fileSystem = localFS)
Local Parallel processing – Linux or Windows In – Hadoop
ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for map-reduce
Compute context R script – sets where the model will run
Functional model R script – does not need to change to run in Hadoop
Demo