Microsoft R server Stefan Cronjaeger Technical Solution Specialist Advanced Analytics Global Blackbelt – Germany [email protected] +49 151 4406 3425

Microsoft R server Stefan Cronjaeger - Meetupfiles.meetup.com/3576292/Stefan Cronjaeger R Server.pdf · •Ready-to-Use high-performance big data big analytics ... framework •Delivers


Page 4

| | Open source R | Microsoft R Open | Microsoft R Server |
|---|---|---|---|
| Datasize | In-memory | In-memory | In-memory or disk based |
| Speed of Analysis | Single threaded | Multi-threaded | Multi-threaded, parallel processing 1:N servers |
| Support | Community | Community | Community + Commercial |
| Analytic Breadth & Depth | 7500+ innovative analytic packages | 7500+ innovative analytic packages | 7500+ innovative packages + commercial parallel high-speed functions |
| Licence | Open Source | Open Source | Commercial license. Supported release with indemnity |

Page 5

R Open / Microsoft R Server

DeployR

DevelopR

ConnectR
• High-speed & direct connectors
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database & Aster
• ODBC

ScaleR
• Ready-to-use high-performance big data big analytics
• Fully-parallelized analytics

DistributedR
• Distributed computing framework
• Delivers cross-platform portability

R+CRAN
• Open source R interpreter
• R 3.1.2
• Freely-available huge range of R algorithms
• 100% compatible with existing R scripts, functions and packages

RevoR
• Performance enhanced R interpreter
• Based on open source R
• Adds high-performance math library to speed up linear algebra functions

Page 6: ScaleR – Parallelized Algorithms & Functions

Data Preparation
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test

Sampling
• Subsample (observations & variables)
• Random Sampling

Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models

Cluster Analysis
• K-Means

Classification
• Decision Trees
• Decision Forest
• Gradient Boosted Decision Trees
• Naïve Bayes

Variable Selection
• Stepwise Regression

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Combination
• rxDataStep
• rxExec
• PEMA API

Page 7

[Diagram: Algorithm Master (distribute work, compile results); load block at a time; analyze blocks in parallel]

• Not every algorithm works in parallel
• Often there are several steps involved
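The block-at-a-time pattern in the diagram can be sketched in plain base R. This is only an illustration under stated assumptions, not how ScaleR is implemented: the helper name `chunked_summary`, the block size and the simulated delay values are all hypothetical, and the input is assumed to be a non-empty numeric vector.

```r
# Hypothetical helper (not part of Microsoft R Server): summarise a numeric
# vector block by block, keeping only small running totals between blocks.
chunked_summary <- function(x, block_size = 1000L) {
  n <- 0
  total <- 0
  total_sq <- 0
  for (s in seq(1L, length(x), by = block_size)) {
    block <- x[s:min(s + block_size - 1L, length(x))]  # "load" one block
    n <- n + length(block)                              # analyze the block
    total <- total + sum(block)
    total_sq <- total_sq + sum(block^2)
  }
  # "Compile results": turn the running totals into the final statistics.
  mean_x <- total / n
  var_x <- (total_sq - n * mean_x^2) / (n - 1)
  list(n = n, mean = mean_x, var = var_x)
}

set.seed(1)
delays <- rnorm(10500, mean = 10, sd = 30)  # simulated data, for illustration
res <- chunked_summary(delays)
```

Mean and variance combine cleanly from per-block totals. An exact median does not, which is one reason a block-wise summary may only offer an approximation for it.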

Page 8: Distributed R - How Does Remote Execution Work?

[Diagram: "Pack and Ship" requests are sent to remote environments and results are returned. In the remote environment a predictive algorithm runs on Big Data under an Algorithm Master (distribute work, compile results); load block at a time; analyze blocks in parallel]

The Results:
• Even Faster Computation
• Larger Data Set Capacity
• Fewer Security Concerns
• No Data Movement, No Copies

Microsoft R Server functions
• A compute context defines the remote connection
• Microsoft R functions are prefixed with rx
• The current compute context determines the processing location

Page 9: Parallelization or Sequential Execution

Parallel:
[Diagram: Algorithm Master (distribute work, compile results); load block at a time; analyze blocks in parallel]
• Large number of parallel workers
• Fast handling of Big Data

Sequential:
[Diagram: Algorithm Master (distribute work, compile results); load block at a time; analyze blocks sequentially]
• Just one block of data in RAM
• Handling of Big Data with moderate resources
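The two execution modes can be contrasted with a small base R sketch that uses the standard `parallel` package rather than any Microsoft R Server function. The helper names `analyze_block` and `compile_results`, the block size and the choice of two workers are assumptions made for illustration. The data here already sits in memory, whereas a real run would read each block from disk.

```r
library(parallel)

# Per-block partial result: just a count and a sum (small and easy to combine).
analyze_block <- function(block) c(n = length(block), total = sum(block))

# Master step: add up the partial results and derive the overall mean.
compile_results <- function(parts) {
  totals <- Reduce(`+`, parts)
  unname(totals["total"] / totals["n"])
}

set.seed(42)
x <- runif(8000)                                   # simulated data
blocks <- split(x, ceiling(seq_along(x) / 1000))   # 8 blocks of 1000 values

# Sequential: blocks are analyzed one after another.
seq_mean <- compile_results(lapply(blocks, analyze_block))

# Parallel: blocks are analyzed by forked workers (a single core on Windows,
# where forking is not available).
cores <- if (.Platform$OS.type == "windows") 1L else 2L
par_mean <- compile_results(mclapply(blocks, analyze_block, mc.cores = cores))
```

Both paths call the same per-block function and the same combine step. Only the scheduling of the blocks differs, which is the point of the slide.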

Page 11

With Microsoft R Server, data size is not limited by the amount of available RAM

• US flight data for 20 years
• Linear Regression on Arrival Delay
• Run on a 4-core laptop with 16 GB RAM and a 500 GB SSD
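One way a linear regression can be fitted without holding all rows in RAM is to accumulate the cross products X'X and X'y block by block and solve the normal equations at the end. The base R sketch below illustrates that idea on simulated data; it is not a description of how `rxLinMod` works internally. The helper `fit_in_blocks`, the block size and the simulated `flights` data frame are hypothetical.

```r
set.seed(7)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
flights <- data.frame(                       # simulated stand-in for flight data
  DayOfWeek = factor(sample(days, 5000, replace = TRUE), levels = days),
  ArrDelay  = rnorm(5000, mean = 10, sd = 25)
)

# Hypothetical helper: least-squares fit from per-block cross products.
fit_in_blocks <- function(formula, data, block_size = 500L) {
  xtx <- NULL
  xty <- NULL
  for (s in seq(1L, nrow(data), by = block_size)) {
    block <- data[s:min(s + block_size - 1L, nrow(data)), ]  # one block of rows
    mf <- model.frame(formula, block)
    X <- model.matrix(formula, mf)
    y <- model.response(mf)
    if (is.null(xtx)) {
      xtx <- crossprod(X)
      xty <- crossprod(X, y)
    } else {
      xtx <- xtx + crossprod(X)     # running X'X
      xty <- xty + crossprod(X, y)  # running X'y
    }
  }
  drop(solve(xtx, xty))             # solve the normal equations once
}

coef_blocks <- fit_in_blocks(ArrDelay ~ DayOfWeek + 0, flights)
coef_full   <- coef(lm(ArrDelay ~ DayOfWeek + 0, data = flights))
```

Only the small cross-product matrices persist between blocks (7 x 7 and 7 x 1 here), so memory use depends on the number of model terms rather than on the number of rows.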

Page 12

Microsoft R Server

Hadoop

Page 13

[Diagram: two rows of five R nodes; labels: ScaleR Production, RStudio Server Pro, Microsoft R Server]

1. Copy
2. Stream
3. Send

Page 14: Write Once Deploy Anywhere

ScaleR functions can run in-Hadoop, in-Spark or in-Database without any functional R recoding. The R script does not need to change to run across different platforms.

Local Parallel – Linux or Windows

# SETUP LINUX ENVIRONMENT VARIABLES
rxSetComputeContext("localpar")

# CREATE LINUX, DIRECTORY AND FILE OBJECTS
linuxFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf", fileSystem = linuxFS)

In – Hadoop

### SETUP HADOOP ENVIRONMENT VARIABLES
myHadoopCluster <- RxHadoopMR()

### HADOOP COMPUTE CONTEXT USING HDFS
rxSetComputeContext(myHadoopCluster)

### CREATE HDFS, DIRECTORY AND FILE OBJECTS
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf", fileSystem = hdfsFS)

SQL Server

# SETUP SQLSERVER ENVIRONMENT VARIABLES
mySqlServer <- RxInSqlServer()

# SQL SERVER COMPUTE CONTEXT AND TABLE REF
rxSetComputeContext(mySqlServer)
AirlineDataSet <- RxSqlServerData(table = "AirlineDemoSmall")

### ANALYTICAL PROCESSING ###

### Statistical Summary of the data
rxSummary(~ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)

### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)

### Linear Model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)

Page 15

Microsoft R Server

SQL Server R Services

Page 16

Leverage Full Capability of R:
• Rich Statistical, Visualization & Predictive Analytics
• A Large and Growing Skill Base

… including Microsoft R Server's Big Data Capabilities:
• Scalable Computation
• Scalable Data Size

… all Running In-Database:
• Divide Work Between Data Scientists and Data Engineers
• Reduce Data Duplication and Data Movement

… While Protecting Information:
• Eliminate Data Movement & Unnecessary Copying
• Leverage Database Data Protections
• Leverage Database Tools for Backup, Scheduling, …

+ SQL Server 2016 Enterprise Edition

Copyright Microsoft Corporation. All rights reserved.

Page 17: Run Parallel Algorithms in Database from an R client

In-Database Execution:
• Remote Execution
• Parallelized Compute

Explore and Model:
• In Parallel, In-Database
• Parallelize distributable R and CRAN

Operationalize:
• Score In Parallel

[Diagram labels: SQL Server; Remote Execution Context; Parallel Algorithm; Iterate/Sequence; Parallel Worker Tasks; Large Data Sets in Chunks; Move BIG Work to the Data]

Page 18: Run R In-Database from TSQL

SQL Server 2016: In-Database Execution of R + CRAN + SQL

In-Database Execution of:
• R Code
• CRAN Packages

• Move the Work to the Data
• Run R From the Query Processor
• Retrieve Models, Scores, Transformed Data, Plots/Images
• Operationalise scoring/prediction in database for data batches or real-time

Page 19

Microsoft R Server

Operationalizing R based Analytics

Page 21

Business User Layer
• Desktop Applications
• Mobile Apps
• Web Applications
• Real-Time Applications

API Client Libraries

DeployR Web Services Layer
• R Model Repository
• Enterprise Security
• R Session Management
• Resource Management

R / Statistical Modeling Expert Layer
• Revolution ScaleR engine + Open Source R with access to 7000+ packages

Page 23

• The creation and management of R runtime resource usage
• The monitoring of events on the grid and server