38
Introducing Microsoft R Server & Microsoft R Open Krit Kamtuo Technical Evangelist Microsoft (Thailand) Limited

microsoft r server for distributed computing

  • Upload
    bainida

  • View
    830

  • Download
    3

Embed Size (px)

Citation preview

Page 1: microsoft r server for distributed computing

IntroducingMicrosoft R Server &Microsoft R Open

Krit KamtuoTechnical Evangelist

Microsoft (Thailand) Limited

Page 2: microsoft r server for distributed computing

What is R?

Language

Platform

Community

Ecosystem

• A programming language for statistics, analytics, and data science

• A data visualization framework

• Provided as Open Source

• Used by 2.5M+ data scientists, statisticians and analysts

• Taught in most university statistics programs

• Active and thriving user groups across the world

• CRAN: 7000+ freely available algorithms, test data and evaluation

• Many of these are applicable to big data if scaled

• New and recent graduates prefer it

Page 3: microsoft r server for distributed computing

20152009200420032000199719951993

Research Project in

New Zealand

Open Source Project

R-Core Group

R-1.0.0 released

R Foundation

First user

New York Times article

R-3.2.0 and R Consortium (founded by Microsoft)

History of R

Page 4: microsoft r server for distributed computing

$?

Challenges posed by open source R

Uncertain

total cost of ownership

Inadequate

access to important

business data

Limited

business agility

Limited

business value

Page 5: microsoft r server for distributed computing

R from Microsoft brings

Page 6: microsoft r server for distributed computing

6

• Free and open source R distribution

• Enhanced and distributed by Revolution Analytics

Microsoft R Open

• Built in Advanced Analytics and Stand Alone Server

Capability

• Leverages the Benefits of SQL 2016 Enterprise Edition

SQL Server R Services

Microsoft R Products

Page 7: microsoft r server for distributed computing

Microsoft R Server

• Microsoft R Server for Redhat Linux

• Microsoft R Server for SUSE Linux

• Microsoft R Server for Teradata DB

• Microsoft R Server for Hadoop on Redhat

Microsoft R Server

Page 8: microsoft r server for distributed computing

Introducing SQL Server 2016 R Services

Enterprise speed and performance

Near-DB analytics

Parallel threading and processing

Model on-premises, store in cloud—or vice versa

Hybrid memory and disk scalability

Not bound by memory-enabling limits of larger

datasets

Included in SQL Server 2016

Reuse and optimize existing R code

Eliminate data movement across machines

Write once, deploy anywhere

Page 9: microsoft r server for distributed computing

Microsoft R server for distributed computing

The First NIDA Business Analytics and Data Sciences Contest/Conference

วันที่ 1-2 กันยายน 2559 ณ อาคารนวมินทราธิราช สถาบันบัณฑิตพัฒนบริหารศาสตร์

-แนะนํา Microsoft R Server

-Distributed Computing มีวิธีการอย่างไร และมีประโยชน์อย่างไร

-แนะนําวิธีการ Configuration สําหรับ Distributed Computing

https://businessanalyticsnida.wordpress.com

https://www.facebook.com/BusinessAnalyticsNIDA/

กฤษฏิ์ คําตื้อ,

Technical Evangelist,

Microsoft (Thailand)

-Distributed computing กับ Big Data

-Analytics บน R server

-สาธิตและสอนในลักษณะ workshop

Computer Lab 2 ชั้น 10 อาคารสยามบรมราชกุมารี

1 กันยายน 2559 เวลา 9.00-12.30

Page 10: microsoft r server for distributed computing

Scalable in-database analytics

Data Scientist

Interacts directly with data

Creates models

and experiments

Data Analyst/DBA

Manages data and

analytics together

Example Solutions• Fraud detection

• Sales forecasting

• Warehouse efficiency

• Predictive

maintenance

010010100100

010101

Relational Data

Extensibility

?R

R Integration

Analytic LibraryOpen Source RRevolution PEMA

T-SQL Interface

How is it Integrated?• T-SQL calls a Stored Procedure

• Script is run in SQL through

extensibility model

• Result sets sent through Web API

to database or applications

Benefits• Faster deployment of ML models

• Less data movement, faster

insights

• Work with large datasets: mitigate

R memory and scalability

limitations

Page 11: microsoft r server for distributed computing

Cost effectiveness

• Best Advanced Analytics Value

• R Services and Polybase are built-ino Part of SQL Server 2016 Enterprise Edition

• In DB analytics shrinks analysis cost and timeo No data movement reduces costs

• No Proprietary Hardware Requirement

o Can be installed in commodity hardware

• Integration between cloud and open source offerings

SQL SERVER 2016$ 648 K

+ $120 Per user for PowerBI

Costs based on a Server with2 proc/ 8 Cores

Page 12: microsoft r server for distributed computing

11

Page 13: microsoft r server for distributed computing

High-performance open source R plus:• Data source connectivity to big-data objects

• Big-data advanced analytics

• Multi-platform environment support

• In-Hadoop and in-Teradata predictive modeling

• Development and production environment support

• IDE for data scientist developers

• Secure, Scalable R Deployment

DeployR

R Open R Server

DevelopR

Microsoft R Server is a broadly deployable enterprise-class analytics platform based on R that is

supported, scalable and secure. Supporting a variety of big data statistics, predictive modeling and

machine learning capabilities, R Server supports the full range of analytics – exploration, analysis,

visualization and modeling

Introducing Microsoft R Server

Page 14: microsoft r server for distributed computing

R Open Microsoft R Server

DeployRDevelopR

The Microsoft R Server Platform

ConnectR• High-speed & direct

connectors

Available for:• High-performance XDF

• SAS, SPSS, delimited & fixed format text data files

• Hadoop HDFS (text & XDF)

• Teradata Database & Aster

• EDWs and ADWs

• ODBCScaleR• Ready-to-Use high-performance

big data big analytics

• Fully-parallelized analytics

• Data prep & data distillation

• Descriptive statistics & statistical tests

• Range of predictive functions

• User tools for distributing customized R algorithms across nodes

• Wide data sets supported – thousands of variables

DistributedR• Distributed computing framework

• Delivers cross-platform portability

R+CRAN• Open source R interpreter

• R 3.1.2

• Freely-available huge range of R algorithms

• Algorithms callable by RevoR

• Embeddable in R scripts

• 100% Compatible with existing R scripts, functions and packages

RevoR• Performance enhanced R

interpreter

• Based on open source R

• Adds high-performance math library to speed up linear algebra functions

Page 15: microsoft r server for distributed computing

ScaleR – Parallel + “Big Data”

Stream data in to RAM in blocks. “Big Data” can be any data

size. We handle Megabytes to Gigabytes to Terabytes…

Our ScaleR algorithms work

inside multiple cores / nodes

in parallel at high speed

Interim results are collected

and combined analytically to

produce the output on the

entire data set

XDF file format is optimised to work with the ScaleR library and

significantly speeds up iterative algorithm processing.

Page 16: microsoft r server for distributed computing

16

Page 17: microsoft r server for distributed computing

SQL Server 2016 Enterprise Edition

SQL Server R Services

Integration Facilities:

• Component Integration• Launchers• Parameter Passing• Results Return• Console Output

Return• Parallel Data Exchange

(RTM)• Stored Procedures• Package Administration

SQL Server

Query

Processor

Algorithm Library

• Data Prep

• Descriptive Stats

• Sampling

• Statistical Tests

• Predictive Models

• Variable Selection

• Clustering

• Classification

• Custom APIs for R + CRAN

• Parallel Scoring

Fast, Parallel, Storage Efficient Algorithms

Microsoft R Open

• 100% Open Source R• Fully CRAN Compatible• Accelerated Math

Open Source R Interpreter

Page 18: microsoft r server for distributed computing

Run R In-Database from TSQL

SQL

Server

2016

In-Database

Execution of R

+ CRAN

+ SQL

In-Database Execution of:

R Code

CRAN Packages

Move the

Work to

the Data

Run R From

the Query

Processor

Retrieve

Models,

Scores,

Transformed

Data,

Plots/Images

Operationalise

scoring/predictio

n in database for

data batches or

real-time

Page 19: microsoft r server for distributed computing

SQL

In-Database Execution:

Remote Execution

Parallelized Compute SQL

Server

Remote

Execution

Context

Explore and Model:

In Parallel, In-Database

Parallelize distributable R and CRAN

Operationlize:

Score In Parallel

ParallelWorker Tasks

Move

BIG

Work to

the DataLarge Data Sets in Chunks

Parallel Algorithm

Iterate/ Sequence

Run Parallel Algorithms in Database from an R client

Page 20: microsoft r server for distributed computing

SQL 2016

ScaleR PEMAs: Fast, Parallel, Storage Efficient Algorithms

R Interpreter

Conceptual Flow

Page 21: microsoft r server for distributed computing

SQL

Processor

Data Segments

(CTP3 is

via files)

R IDE

XSP

RTerm.exe

R.dll

(MSLP$

SQL16) BxlServer.exe

(MSLP$SQL16)

Input Data Set

via ODBC

ScaleR Master Process

Worker Process

Worker Process

Worker Process

Data Segments

Console OutSpawn WorkerProc’s.

Assemble Intermediate

Results

Iterate/ Sequence

MPI Ring

Results – Models, Data

Parallelized Algorithms in Database

Page 22: microsoft r server for distributed computing

22

Page 23: microsoft r server for distributed computing

Introducing Microsoft R Server

Page 24: microsoft r server for distributed computing

Gradient Boosted Decision Trees

Naïve Bayes

Scale R – Parallelized Algorithms & Functions

Data import – Delimited, Fixed, SAS, SPSS,

OBDC

Variable creation & transformation

Recode variables

Factor variables

Missing value handling

Sort, Merge, Split

Aggregate by category (means, sums)

Min / Max, Mean, Median (approx.)

Quantiles (approx.)

Standard Deviation

Variance

Correlation

Covariance

Sum of Squares (cross product matrix for set

variables)

Pairwise Cross tabs

Risk Ratio & Odds Ratio

Cross-Tabulation of Data (standard tables & long

form)

Marginal Summaries of Cross Tabulations

Chi Square Test

Kendall Rank Correlation

Fisher’s Exact Test

Student’s t-Test

Subsample (observations & variables)

Random Sampling

Data Preparation Statistical Tests

Sampling

Descriptive Statistics

Sum of Squares (cross product matrix for set

variables)

Multiple Linear Regression

Generalized Linear Models (GLM) exponential

family distributions: binomial, Gaussian, inverse

Gaussian, Poisson, Tweedie. Standard link

functions: cauchit, identity, log, logit, probit. User

defined distributions & link functions.

Covariance & Correlation Matrices

Logistic Regression

Classification & Regression Trees

Predictions/scoring for models

Residuals for all models

Predictive Models K-Means

Decision Trees

Decision Forests

Cluster Analysis

Classification

Simulation

Variable Selection

Stepwise Regression

Simulation (e.g. Monte Carlo)

Parallel Random Number Generation

Combination

rxDataStep

rxExec

New

PEMA-R API Custom Algorithms

Page 25: microsoft r server for distributed computing

ScaleR - Performance comparisonMicrosoft R Server has no data size limits in relation to size of available RAM. When open source R operates on data sets that exceed RAM it will fail. In contrast Microsoft R Server scales linearly well beyond RAM limits and parallel algorithms are much faster.

US flight data for 20 years

Linear Regression on Arrival Delay

Run on 4 core laptop, 16GB RAM and 500GB SSD

Page 26: microsoft r server for distributed computing

DistributedR

ScaleR

ConnectR

DevelopR

Distributed R - Model development and model compute choice: “Write Once. Deploy Anywhere.”

Code Portability Across Platforms

In the Cloud

Workstations & Servers LinuxWindows

EDW Teradata

HadoopHortonworksClouderaMapR

+ HD Insights

+ Hadoop Spark

+ R Tools for Visual Studio

+ Azure ML

Ro

ad

map

Azure Marketplace

+ SQL Server v16

Microsoft R Server

Page 27: microsoft r server for distributed computing

Distributed R - How Does Remote Execution Work?

Algorithm

Master

Big

Data

Predictive

Algorithm

Analyze

Blocks In

Parallel

Load Block

At A Time

Distribute Work,

Compile Results

The Results:

• Even Faster Computation

• Larger Data Set Capacity

• Fewer Security Concerns

• No Data Movement, No Copies

Work

“Pack and Ship” Requests

to Remote Environments

Results

Microsoft R Server functions

• A compute context defines remote connection• Microsoft R functions prefixed with rx

• Current compute context determines processing

location

Page 28: microsoft r server for distributed computing

DistributedR - Revolution Code Portability

### SETUP HADOOP ENVIRONMENT VARIABLES ###

myHadoopCCC <- RxHadoopMR()

### HADOOP COMPUTE CONTEXT ###

rxSetComputeContext(myHadoopCC)

### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###

hdfsFS <- RxHdfsFileSystem()

AirlineDataSet <-

RxXdfData(“AirlineDemoSmall/AirlineDemoSmall.xdf”)

, fileSystem = hdfsFS)

### ANALYTICAL PROCESSING ###

### Statistical Summary of the data

rxSummary(~ArrDelay+DayOfWeek, data= AirlineDataSet, reportProgress=1)

### CrossTab the data

rxCrossTabs(ArrDelay ~ DayOfWeek, data= AirlineDataSet, means=T)

### Linear Model and plot

hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0 , data = AirlineDataSet)

plot(hdfsXdfArrLateLinMod$coefficients)

### SETUP LOCAL ENVIRONMENT VARIABLES ###

myLocalCC <- “localpar”

### LOCAL COMPUTE CONTEXT ###

rxSetComputeContext(myLocalCC)

### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###

linuxFS <- RxNativeFileSystem() )

AirlineDataSet <-

RxXdfData(“AirlineDemoSmall/AirlineDemoSmall.xdf”,

fileSystem = linuxFS)

Local Parallel processing – Linux or Windows In – Hadoop

ScaleR models can be deployed from a server or edge node to run in Hadoop

without any functional R model re-coding for map-reduce

Compute

context R script

– sets where the

model will run

Functional

model R script –

does not need

to change to run

in Hadoop

Page 29: microsoft r server for distributed computing

DistributedR - In-Hadoop

Uses Hadoop nodes for R

computations

Eliminate data movement

latency on very large data

Remove data duplication

Faster model development

No MapReduce R coding

Develop better models

using all the data= Microsoft R Server

Page 30: microsoft r server for distributed computing

MRS and Hadoop Architecture options

R R R R R

R R R R R

ScaleR Production

RStudio Server Pro

Microsoft R Server

1. Copy

2. Stream

3. Send

Page 31: microsoft r server for distributed computing

DistributedR - Hadoop Processing Methods

Method 1: Local (Linux) parallel processing using all

cores on one node, copying data from HDFS to store

in local Linux file-system.

Compute Context

HadoopCompute Context

HadoopCompute Context

Local Parallel

Linux (Local) File-System

HDFS

Csv, Xdf

Processing

Data

1 Edge node 1:n data nodes

1:n disks 1:(n x number of nodes) disks

Csv, Xdf

Linux FS

Read / write

Method 1

(“Beside” or “Edge”)

Copy

to Local

File

Method 2: Local (Linux) parallel processing using all cores on one node, streaming data from / to HDFS

Compute Context

HadoopCompute Context

HadoopCompute Context

Local Parallel

Compute Context

Hadoop

Linux (Local) File-System

HDFS

Csv, Xdf

1:n nodes

1:n disks 1:(n x number of nodes) disks

1 Edge node

Page 32: microsoft r server for distributed computing

Method 3

Method 3: Hadoop (Map-Reduce) parallel processing

using all cores on n nodes, using HDFS data on each

node

Compute Context

HadoopCompute Context

HadoopCompute Context

Local Parallel

Compute Context

Hadoop

Linux (Local) File-System

HDFS

Csv, Xdf

Processing

Data

1:n nodes

1:n disks 1:(n x number of nodes) disks

Csv, Xdf

HDFS

Read / write

(“inside”)

R script

sent to data

nodes

1 Edge node

R model script sent to Master Node:

1. Starts a master process

2. Distribute work

3. Master tasks for each node

4. Master initiates distributed work1.Hadoop schedules mapper for each split2.Algorithm computes intermediate result3.Reducer combines intermediate results

5. Master process evaluates

completion

6. Iterates as required by the

algorithm

7. Returns consolidated answer to

script

Page 33: microsoft r server for distributed computing

DistributedR - What processing mode to use, when?

Analytic data set size and processing complexity (e.g. simple summary statistics vs iterative algorithm)

guide the use of Method 1 and 2 (Edge Node / Server Linux local processing) vs Method 3 (in-Hadoop

processing)

Low Medium High

Small Data

< 10GB

Medium Data

< 50GB

Bigger Data

> 50GB

Edge Node Linux

processingIn-Hadoop

processing

Local Linux

file-systemHadoop

file-system

LegendProcessing

Complexity

Data Size

Page 34: microsoft r server for distributed computing

While Open Source R delivers:

• Capability• 6500+ Algorithm &

Connector Packages Available for Free in CRAN

• Simplicity• R Skills Transfer / Lower cost

of Talent• Ease of Integration with Other

Analytics Packages & Data• Access to Huge Libraries of R

Analytical Algorithms

• Speed• Intel-Optimized Computation

• Peace of mind• Knowledge that your business is using a stable platform backed with

commercial support and services

• Platform longevity for more predictability around costs

• Speed and scalability• Faster decisions using advanced analytics that were previously unachievable

• In-Hadoop & In Teradata Analysis

• Efficiency• Continue getting returns on existing hardware and software investments

• Developers can write code once and deploy it anywhere, keeping costs low

• Flexibility and agility• Model data in a hybrid environment: on-premises, in the cloud, or both

• Scripting, modeling, and in-database analytics across platforms shrinks analysis time and enables agile response to business needs

SQL Server R Services and Microsoft R Server deliver:

Page 35: microsoft r server for distributed computing
Page 36: microsoft r server for distributed computing

Introducing Microsoft R Open• Enhanced Open Source R distribution

• Based on the latest Open Source R (3.1.2)

• Built, tested and distributed by Microsoft

• Enhanced by Intel MKL Library to speed up linear algebra functions

• Compatible with all R-related software• CRAN packages, RStudio, third-party R integrations, …

• Revolutions Open-Source R packages• Reproducible R Toolkit – Checkpoint , miniCRAN

• ParallelR – parallelise execution via ‘foreach’ loop

• Rhadoop – rhdfs, rhbase, ravro, rmr2, plyrmr

• AzureML – read/write data to AzureML, publish R code as ML API

• MRAN website mran.revolutionanalytics.com• Enhanced documentation and learning resources

• Discover 6500 free add-on R packages

• Open source (GPLv2 license) - 100% free to download, use and share

Page 37: microsoft r server for distributed computing

DatasizeIn-memory

In-memory In-Memory or Disk Based

Speed of AnalysisSingle threaded Multi-threaded

Multi-threaded, parallel

processing 1:N servers

SupportCommunity Community Community + Commercial

Analytic Breadth

& Depth 7500+ innovative analytic

packages7500+ innovative analytic

packages

7500+ innovative packages +

commercial parallel high-speed functions

LicenceOpen Source

Open Source

Commercial license.

Supported release with indemnity

CRAN, MRO, MRS ComparisonMicrosoft

R Open

Microsoft

R Server

Page 38: microsoft r server for distributed computing

More efficient and multi-threaded math computation.

Benefits math intensive processing.

No benefit to program logic and data transform

CRAN R compared to Microsoft R Open

• Matrix calculation – upto 27x faster

• Matrix functions – upto 16x faster

• Programation – 0x faster