
Microsoft R Family L100 Deck - Meetupfiles.meetup.com/6477262/20161214_DataDrivenMTL_Microsoft... · 2016-12-15 · Microsoft R Family –L200 Transformational trends Machine Learning


Microsoft R Family – L200

Agenda

• Trends
• What is R?
• R Server & SQL R Services
• R Analytics on Hadoop


Trends


Connected data

CLOUD

MOBILE

Three major trends converging

• Intelligence
• Cloud
• Data

Transformational trends

Machine learning

• Churn analysis
• Social network analysis
• Recommendation engines
• Location-based tracking and services
• Weather forecasting for business planning
• Fraud detection
• Equipment monitoring
• Personalized insurance

Cloud computing

• 5x increase, 2011 to 2016

Data explosion

• 90% of the data in the world today has been created in the last two years alone


Decision

Data Dividend: Incremental Gains Made by Leaders in Data and Analytics

Gains in $ billions (IDC Data Dividend Study and Survey, N=2,020, April 2014)

• Product & Service Innovation: $235
• Customer Facing: $158
• Operations: $456
• Productivity: $674


EXAMPLE SOLUTIONS

Advanced Analytics scenarios


Retail & Consumer Products

Healthcare

Financial Services & Insurance

Government

Manufacturing


Success requires convergence of skills

Data Science Team

Data Engineering

Data Science

Application Development

Business Acumen

Data Management

Data Dividend


What is R?

A brief history of R

• 1993: Research project in Auckland, NZ
• 1995: Open source
• 1997: R-core
• 2000: R-1.0.0
• 2003: R Foundation
• 2004: First UseR!
• 2009: New York Times
• 2015: R-3.2.0, R Consortium

Photo credit: Robert Gentleman

R’s popularity is growing rapidly

• Language Popularity: IEEE Spectrum Top Programming Languages (IEEE Spectrum, July 2015)
• #9: R
• R Usage Growth: Rexer Data Miner Survey, 2007-2013

Microsoft R Server family: From Data To Action On Premises

• DATA: Data Sources, Apps, Sensors and devices
• INTELLIGENCE: Microsoft R Server & SQL R Services, Cortana Intelligence
• ACTION: People, Apps, Automated Systems

What is R?

Language Platform

• A statistics programming language
• A data visualization tool
• Open source

Community

• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide
• New and recent grads use it

Ecosystem

• 7500+ free algorithms in CRAN
• Scalable to big data
• Rich application & platform integration

CRAN: The Comprehensive R Archive Network

In addition to CRAN, Bioconductor, GitHub, and others distribute R packages.

Challenges posed by open source R

• Lack of Commercial Support
• Inadequate Modeling Performance
• Complex Deployment Processes
• Limited Data Scale

R from Microsoft brings

• Peace of mind
• Efficiency
• Speed and scalability
• Flexibility and agility


Included in SQL Server 2016

Reuse and optimize existing R code

Eliminate data movement

In-database deployment

Memory and disk scalability

No R memory limits

Write once, deploy anywhere

Enterprise speed and scale

Near-DB analytics

Parallel threading and processing

Reuse SQL skills for data engineering

Cost effectiveness

Scalability and choice

Simplicity and agility

Microsoft R portfolio

• Community: R Open
• Commercial: R Server
• Platforms: SQL Server R Services, Windows, Linux, Hadoop, Teradata

R Distribution Comparison (open source R / Microsoft R Open / Microsoft R Server)

• Data size: in-memory / in-memory / in-memory or disk based
• Speed of analysis: single threaded / multi-threaded / multi-threaded, parallel processing 1:N servers
• Support: community / community / community + commercial
• Analytic breadth & depth: 7500+ innovative analytic packages / 7500+ innovative analytic packages / 7500+ innovative packages + commercial parallel high-speed functions
• Licence: open source / open source / commercial license, supported release with indemnity

Copyright Microsoft Corporation. All rights reserved.

Built-in advanced analytics: In-database analytics

Example Solutions

• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance

Extensibility

• Microsoft Azure Machine Learning Marketplace
• New R scripts

Built-in to SQL Server

• Analytic Library
• T-SQL Interface
• Relational Data
• R Integration

Roles

• Data Scientist: Interact directly with data
• Data Developer/DBA: Manage data and analytics together

In-database Advanced Analytics: Build intelligent applications with SQL Server R Services

Advanced Analytics

• R built-in to your T-SQL
• R built-in to SQL Server
• Open Source R with in-memory & massive scale: multi-threading and massive parallel processing
• Real-time operational analytics without moving the data
• Mission critical OLTP


R Server & SQL R Services

Microsoft R Products

Microsoft R Server

• Microsoft R Server for Red Hat Linux
• Microsoft R Server for SUSE Linux
• Microsoft R Server for Teradata DB
• Microsoft R Server for Hadoop on Red Hat

SQL Server R Services

• Built in Advanced Analytics and Stand Alone Server Capability
• Leverages the Benefits of SQL 2016 Enterprise Edition


R Server

Introducing Microsoft R Server

Microsoft R Server is a broadly deployable enterprise-class analytics platform based on R that is supported, scalable and secure. Supporting a variety of big data statistics, predictive modeling and machine learning capabilities, R Server supports the full range of analytics: exploration, analysis, visualization and modeling.

High-performance open source R plus:

◦ Data source connectivity to big-data objects
◦ Big-data advanced analytics
◦ Multi-platform environment support
◦ In-Hadoop and in-Teradata predictive modeling
◦ Development and production environment support
◦ IDE for data scientist developers
◦ Secure, Scalable R Deployment

Components: R Open, R Server, DevelopR, DeployR

The Microsoft R Server Platform

Components: R Open, Microsoft R Server, DevelopR, DeployR

ConnectR

• High-speed & direct connectors
• Available for:
  ◦ High-performance XDF
  ◦ SAS, SPSS, delimited & fixed format text data files
  ◦ Hadoop HDFS (text & XDF)
  ◦ Teradata Database & Aster
  ◦ EDWs and ADWs
  ◦ ODBC

ScaleR

• Ready-to-Use high-performance big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes
• Wide data sets supported: thousands of variables

DistributedR

• Distributed computing framework
• Delivers cross-platform portability

R+CRAN

• Open source R interpreter
• R 3.1.2
• Freely-available huge range of R algorithms
• Algorithms callable by RevoR
• Embeddable in R scripts
• 100% Compatible with existing R scripts, functions and packages

RevoR

• Performance enhanced R interpreter
• Based on open source R
• Adds high-performance math library to speed up linear algebra functions

ScaleR – Parallel + “Big Data”

• Stream data in to RAM in blocks. “Big Data” can be any data size. We handle Megabytes to Gigabytes to Terabytes…
• Our ScaleR algorithms work inside multiple cores / nodes in parallel at high speed.
• Interim results are collected and combined analytically to produce the output on the entire data set.
• XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
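
The block-wise idea on this slide (process the data in blocks, then combine interim results analytically) can be illustrated in base R. This is a minimal sketch, not ScaleR's implementation: it assumes a simulated data frame (`big_data`, hypothetical) and a linear regression, where each block contributes its cross-product matrices and the sums give the same coefficients as a fit on the full data.

```r
# Simulated data standing in for a data set that would be read block by block.
set.seed(42)
n <- 10000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
big_data <- data.frame(x = x, y = y)

# Row indices for blocks of 1,000 rows; each block stands in for one chunk
# streamed into RAM.
block_size <- 1000
block_ids <- split(seq_len(n), ceiling(seq_len(n) / block_size))

# Interim result per block: X'X and X'y for that block only.
interim <- lapply(block_ids, function(rows) {
  X <- cbind(intercept = 1, x = big_data$x[rows])
  yb <- big_data$y[rows]
  list(xtx = crossprod(X), xty = crossprod(X, yb))
})

# Combine the interim results by summation, then solve the normal equations.
xtx <- Reduce(`+`, lapply(interim, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(interim, `[[`, "xty"))
beta_blocks <- drop(solve(xtx, xty))

# Reference fit on the whole data set at once.
beta_full <- coef(lm(y ~ x, data = big_data))

print(beta_blocks)
print(beta_full)
```

Because each block's contribution is independent of the others, the `lapply` step is the part that could run on separate cores or nodes; only the small summed matrices need to be gathered in one place.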

Microsoft R Server: Enterprise R Analytics for Linux, Hadoop & Teradata

Multi-Platform Scalable R

• Support Entire Analytics Lifecycle
• Big Data Scale & Compute
• Remote Execution in Hadoop, Teradata

Advantages

• Supported 100% R
• Deploy via Web Services
• Reduced Security Exposure

Typical Advanced Analytics Process: Prepare, Model, Operationalize


SQL R Services

SQL Server 2016 Enterprise Edition: SQL Server R Services

SQL Server Query Processor

Integration Facilities:

• Component Integration
• Launchers
• Parameter Passing
• Results Return
• Console Output Return
• Parallel Data Exchange (RTM)
• Stored Procedures
• Package Administration

Algorithm Library: Fast, Parallel, Storage Efficient Algorithms

• Data Prep
• Descriptive Stats
• Sampling
• Statistical Tests
• Predictive Models
• Variable Selection
• Clustering
• Classification
• Custom APIs for R + CRAN
• Parallel Scoring

Microsoft R Open: Open Source R Interpreter

• 100% Open Source R
• Fully CRAN Compatible
• Accelerated Math

Run R In-Database from T-SQL

SQL Server 2016: In-Database Execution of R + CRAN + SQL

In-Database Execution of:

• R Code
• CRAN Packages

Move the Work to the Data

• Run R From the Query Processor
• Retrieve Models, Scores, Transformed Data, Plots/Images
• Operationalize scoring/prediction in database for data batches or real-time
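
As an illustration of what the R side of an in-database call can look like, here is a minimal sketch. It assumes the default variable names `InputDataSet` and `OutputDataSet` used by SQL Server's `sp_execute_external_script` procedure. The rows, column names and model are hypothetical, and `InputDataSet` is built by hand so the script also runs outside SQL Server.

```r
# Stand-in for the rows SQL Server would pass in from the T-SQL input query.
# Inside SQL Server this data frame is supplied for you; here it is invented.
InputDataSet <- data.frame(
  customer_id = 1:6,
  visits      = c(1, 2, 3, 4, 5, 6),
  spend       = c(12, 19, 31, 42, 48, 61)
)

# Model step: fit a simple linear model on the incoming rows.
model <- lm(spend ~ visits, data = InputDataSet)

# Scoring step: whatever is assigned to OutputDataSet is what the database
# would receive back as a result set.
OutputDataSet <- data.frame(
  customer_id     = InputDataSet$customer_id,
  predicted_spend = as.numeric(predict(model, InputDataSet))
)

print(OutputDataSet)
```

The point of the pattern is that the script only touches data frames, so the data itself never has to leave the database server.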

Run Parallel Algorithms in Database from an R client

In-Database Execution:

• Remote Execution
• Parallelized Compute

Explore and Model:

• In Parallel, In-Database
• Parallelize distributable R and CRAN

Operationalize:

• Score In Parallel

Move BIG Work to the Data

• Large Data Sets in Chunks

[Diagram labels: SQL Server, Remote Execution Context, Parallel Worker Tasks, Parallel Algorithm, Iterate/Sequence]


Demo

Using SQL Server R Services: Enterprise R Analytics in SQL Server 2016

Model & Deploy In SQL16

• Support Entire Analytics Lifecycle
• Run R Inside SQL 2016 using IDE
• Embed R in T-SQL or run stored procs

Advantages

• Scale By Eliminating Movement
• Scale Using Parallelized Computation
• Reduce Security Exposure
• Reuse SQL Skills for Data Engineering
• Reuse SQL Skills in App Development teams
• Maximize Operational Stability for Applications
• Supported 100% R

Typical Advanced Analytics Process (SQL 2016): Prepare, Model, Operationalize

Eliminate Data Movement

In-Database Example: From 5+ hours to 40 seconds

[Chart: minutes versus rows, comparing R on a server pulling data via SQL with R on a server invoking RRE ScaleR inside the EDW]


Variable Selection

Stepwise Regression

Simulation

Simulation (e.g. Monte Carlo)

Parallel Random Number Generation

Cluster Analysis

K-Means

Classification

Decision Trees

Decision Forests

Gradient Boosted Decision Trees

Naïve Bayes

Combination

rxDataStep

rxExec

PEMA-R API Custom Algorithms

Parallelized, Remote Execution Algorithms

Data Step

Data import – Delimited, Fixed, SAS, SPSS, ODBC

Variable creation & transformation

Recode variables

Factor variables

Missing value handling

Sort, Merge, Split

Aggregate by category (means, sums)

Descriptive Statistics

Min / Max, Mean, Median (approx.)

Quantiles (approx.)

Standard Deviation

Variance

Correlation

Covariance

Sum of Squares (cross product matrix for set variables)

Pairwise Cross tabs

Risk Ratio & Odds Ratio

Cross-Tabulation of Data (standard tables & long form)

Marginal Summaries of Cross Tabulations

Statistical Tests

Chi Square Test

Kendall Rank Correlation

Fisher’s Exact Test

Student’s t-Test

Sampling

Subsample (observations & variables)

Random Sampling

Predictive Models

Sum of Squares (cross product matrix for set variables)

Quantiles (approx.)

Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.

Covariance & Correlation Matrices

Logistic Regression

Classification & Regression Trees

Predictions/scoring for models

Residuals for all models


R Analytics on Hadoop



Leveraging the cloud

Evolutionary, blended, hybrid approaches will help to avoid big disruptions

Embracing the cloud from On-Premises is complicated.

Several Alternatives

On-Premises

In the Cloud


Microsoft R Scales to Big Data for Enterprises

Escapes R’s traditional memory limits

Scales predictive modeling using parallelization

Distributes computation across cores & nodes

Minimizes data movement using in-database, in-MapReduce and in-Apache Spark execution

Leveraging Cloud and On-Prem Data As One

[Diagram: Prepare, Model and Operationalize stages with SQL 2016, placed On-Premises or In the Cloud]

• Polybase Example: Federate CRM Data with Cloud-Borne Demographics
• StretchDB Example: Cost Reduce Growth of EDW
• On-Prem Scoring Example: Purchase Propensity Prediction
• On-Prem Scoring Example: Near-Real-Time Fraud or Anomaly Detection

HDInsight + R Server: Managed Hadoop for Advanced Analytics

[Diagram labels: R, Notebooks, Scala/Java, Spark APIs (data manipulation), R Server (big computation), Spark and Hadoop, Blob Storage, Data Lake Storage]

• Apache Hadoop
  ◦ Fault-tolerant distributed data storage and processing
  ◦ Pluggable distributed filesystem (HDFS)
  ◦ Big Data - Structured and unstructured
• Apache Spark
  ◦ Batch, Streaming, SQL, ML, Graph
  ◦ Scala, Java, Python, and R APIs
  ◦ Integrates with Hadoop
• HDInsight Hadoop Clusters
  ◦ Fully managed, elastic
  ◦ HDFS-compatible Azure Blob Storage and Azure Data Lake Store
• R Server on HDInsight (Premium)
  ◦ Integrates with Hadoop and Spark
  ◦ Provides scalable Statistical and ML Algorithms
  ◦ Runs arbitrary R functions and packages in parallel

R Server on HDInsight scales to billions of rows and terabytes of data

[Chart: rxLogit on a 100 node HDInsight cluster. Elapsed time (seconds, axis 0 to 1800) versus billions of rows (axis 0 to 13), annotated with 2.2 TB]

Configuration:

• HDI cluster size: 100 nodes
  ◦ Edge node: D14 V2 (16 cores, 112GB)
  ◦ Worker Nodes: D12 (4 cores, 28GB)
• Dataset: NYC Taxi dataset, filtered, transformed, and duplicated
• Number of columns: 19
• Format: CSV
• fs.azure.selfthrottling.read.factor=1

Write Once. Deploy Anywhere.

Code Portability Across Platforms (DistributedR, ScaleR, ConnectR, DevelopR)

• In the Cloud: Azure HDI / Spark
• Workstations & Servers: Linux, Windows
• Clustered Systems: Linux Clusters (LSF For Now), Microsoft HPC
• EDW: Teradata
• Hadoop: Hortonworks, Cloudera, MapR & HDInsight

DistributedR - Revolution Code Portability

ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for map-reduce.

Compute context R script: sets where the model will run

Local parallel processing (Linux or Windows):

### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"

### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)

### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
localFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf",
                            fileSystem = localFS)

In Hadoop:

### SETUP HADOOP ENVIRONMENT VARIABLES ###
myHadoopCC <- RxHadoopMR()

### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(myHadoopCC)

### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf",
                            fileSystem = hdfsFS)

Functional model R script: does not need to change to run in Hadoop

### ANALYTICAL PROCESSING ###

### Statistical Summary of the data
rxSummary(~ArrDelay+DayOfWeek, data = AirlineDataSet, reportProgress = 1)

### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)

### Linear Model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
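
The separation between a script that sets where work runs and a functional script that stays unchanged can be imitated in base R, for readers without access to ScaleR. This is only a sketch of the idea, not the RevoScaleR API. `make_context` and `mean_delay_by_day` are hypothetical helpers, the flight data is simulated, and the "forked" context uses `parallel::mclapply`, falling back to sequential execution where forking is unavailable (for example on Windows).

```r
library(parallel)

# Hypothetical stand-in for a compute context: a list holding a map function.
make_context <- function(type = c("sequential", "forked")) {
  type <- match.arg(type)
  # Forking is not available on Windows, so fall back to sequential there.
  if (type == "forked" && .Platform$OS.type != "unix") type <- "sequential"
  map <- if (type == "forked") {
    function(x, f) mclapply(x, f, mc.cores = 2L)
  } else {
    function(x, f) lapply(x, f)
  }
  list(type = type, map = map)
}

# Functional script: identical whichever context is passed in.
# Each block returns per-day sums and counts; the totals give the means.
mean_delay_by_day <- function(ctx, blocks) {
  parts <- ctx$map(blocks, function(b) {
    rbind(sums   = as.vector(xtabs(ArrDelay ~ DayOfWeek, data = b)),
          counts = as.vector(table(b$DayOfWeek)))
  })
  totals <- Reduce(`+`, parts)
  means <- totals["sums", ] / totals["counts", ]
  names(means) <- levels(blocks[[1]]$DayOfWeek)
  means
}

# Simulated flight data, split into three blocks of 100 rows.
set.seed(1)
days <- c("Mon", "Tue", "Wed")
flights <- data.frame(
  ArrDelay  = round(rnorm(300, mean = 10, sd = 20)),
  DayOfWeek = factor(sample(days, 300, replace = TRUE), levels = days)
)
blocks <- split(flights, rep(1:3, each = 100))

# Only the context changes between the two runs.
local_result  <- mean_delay_by_day(make_context("sequential"), blocks)
forked_result <- mean_delay_by_day(make_context("forked"), blocks)

print(local_result)
print(forked_result)
```

Keeping the execution choice in one small object is what lets the analytical function be written once and reused; in the slide's code, `rxSetComputeContext` plays that role.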


Demo