Revolution Confidential
Revolution R Enterprise 5.0: Scalable Data Management and Analysis for the Enterprise
Sue Ranney, VP Product Development
November 2011

New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis


DESCRIPTION

Revolution R Enterprise 5.0 is Revolution Analytics’ scalable analytics platform. At its core is Revolution Analytics’ enhanced distribution of R, the world’s most widely used project for statistical computing. In this webinar, Dr. Ranney will discuss new features and show examples of the new functionality, which extends the platform’s usability, integration and scalability.

Citation preview

Page 1: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Revolution Confidential

Revolution R Enterprise 5.0: Scalable Data Management and Analysis for the Enterprise

Sue Ranney, VP Product Development

November 2011

Page 2: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

November 17, 2011: Welcome!

• Thanks for coming!
• Text questions for Q&A after the presentation

Revolution R Enterprise 5.0 Webinar

Page 3: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

In Today’s Webinar…

• About Revolution R Enterprise 5.0
• “I don’t have big data.” Why use Revolution R Enterprise 5.0 to get started?
• “I don’t have big hardware.” Big data on your desktop.
• “I have big data, and need to be ready for tomorrow’s even bigger data.” Scaling data analysis to a cluster.
• “I need to write my own scalable analyses.” Creating your own scalable R extensions.
• Wrap-up, Q&A

Page 4: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

About Revolution R Enterprise

Page 5: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Revolution R Enterprise:

[Diagram labels: Enterprise Deployment, Performance, Productivity, Big Data Analysis, Training & Consulting, Technical Support]

• Open Source
• Performance Enhancements
• Greater Productivity & Ease of Use
• Tackle “Big Data”
• IT-Friendly Enterprise Deployment
• On-Call Experts

Page 6: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Revolution R Enterprise 5.0: What’s New?

• Distributed/Parallel Computing: distribute analytics and R functions to a Windows HPC Server cluster
• Scalable Data Management: new data import and cleaning/manipulation tools
• Expanded Scalable Analytics: principal components analysis, factor analysis, and more
• Enhanced R Productivity Environment: create and build R packages
• Integration with Hadoop: Cloudera-certified MapReduce programming in R
• Enhanced RevoDeployR Server: supports multiple compute nodes, batch scheduling and LDAP security
• Upgraded Open Source R: R 2.13.2 with new byte-compiler

Page 7: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Revolution R Enterprise: What Gets Installed?

• Latest stable version of Open-Source R (2.13.2)
• High performance math libraries
• RevoScaleR package: scalable data management and analysis; distributed data analysis/parallel computing
• Integrated Development Environment based on Visual Studio technology (for Windows): the R Productivity Environment (RPE), including a visual debugger for R

Data Management and Statistical Analysis for the Enterprise

Page 8: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

“I don’t have big data.” Why get started with Revolution R Enterprise?

Page 9: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Why Revolution R Enterprise 5.0 with “Small” Data

• Easy to get started: a consistent interface for “start-to-finish” data analysis with just a few functions, covering data import (text, SAS, SPSS, ODBC), data transformations and manipulation, and basic data analysis
• Performance: fast analyses such as summary statistics, cross tabs, linear models, and logistic regression, even for data that fit in memory
• Scalability: the data analysis you do today can be replicated on big data down the road

Page 10: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Scalable Data Management: Import

• Import data from a variety of sources with rxImport: SPSS, SAS, delimited text (e.g., comma-separated), fixed-format text, and databases with an ODBC connection
• Read small data sets into a data frame; store larger data sets in an efficient .xdf file format
• Use arguments such as colClasses and colInfo to provide guidance on how to import data (e.g., as integer, factor, etc.)
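The idea of guiding an import with column types can be sketched in base R, whose read.csv has an argument that is also called colClasses. This is only a stand-in, not rxImport itself: the inline text "file" and the column names creditScore, houseAge and default are invented for illustration.

```r
# Base R stand-in for the import step; column names and values are made up.
csvText <- "creditScore,houseAge,default
690,12,0
720,5,0
610,30,1"

# colClasses tells the reader how to store each column instead of guessing.
mortgages <- read.csv(text = csvText,
                      colClasses = c(creditScore = "integer",
                                     houseAge    = "integer",
                                     default     = "factor"))
str(mortgages)
```

Declaring the types up front avoids a 0/1 outcome being read as a plain number when it is meant to be categorical.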

Page 11: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Example: Import Mortgage Default Data

• Import a data file (10,000 obs) into a data frame, specifying the input file location
• Create a place-holder object for an output file that you’ll use with bigger data
• Use the same code to import a file with 10 million observations

In both cases, the data object returned can be used as input data in other RevoScaleR functions.

rxImport is new in 5.0, to simplify and scale the data import process.

Page 12: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Scalable Data Management: Data Step

• Basic steps for data manipulation and cleaning: variable selection, data transformations, row selection
• One function does it all: rxDataStep
• Can use the same approach (function arguments) at various stages of your analysis: import, data step, and “on-the-fly” in data analysis

Page 13: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Example: Data Step with Mortgage Data

• Specify the input data, which can be a data frame or an object representing an .xdf file
• Specify an output file, if desired
• Select variables and rows to include in the new data set
• List out variable transformations, using the usual R expressions

rxDataStep is new in 5.0, to simplify and scale the data step.

Page 14: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Examples of R Operators and Functions You Can Use in ‘transforms’

Operator/Function | Description
+, -, *, /, ^, %%, … | Row-by-row addition, subtraction, multiplication, division, exponentiation, modulus
<, <=, >, >=, ==, != | Logical operators
abs, ceiling, floor, round, log, log10, cos, sin, sqrt, … | Basic mathematical functions
as.Date, weekdays, months, quarters, … | Convert character data to Date data, then use functions like weekdays(), months(), quarters()
rnorm, runif, gamma, exp, … | Distribution functions
cut | Create a factor from a numeric variable
substr, toupper, tolower | Basic string handling
ifelse | ifelse(test, yes, no): set the value of a variable conditional on a test

Or, write your own R function.
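A few of the functions in the table can be tried on an ordinary data frame. The sketch below uses base R's within() rather than the RevoScaleR ‘transforms’ argument, and the loans data frame with its columns is hypothetical.

```r
# Hypothetical data frame; the same kinds of expressions are what the
# slide describes passing in 'transforms'.
loans <- data.frame(balance  = c(1200, 56000, 230000),
                    rate     = c(0.045, 0.061, 0.038),
                    openDate = c("2009-03-14", "2010-11-02", "2011-06-30"),
                    stringsAsFactors = FALSE)

loans <- within(loans, {
  logBalance <- log10(balance)                        # basic math function
  highRate   <- ifelse(rate > 0.05, "high", "low")    # conditional value
  sizeClass  <- cut(balance, breaks = c(0, 10000, 100000, Inf),
                    labels = c("small", "medium", "large"))  # numeric to factor
  openYear   <- as.integer(format(as.Date(openDate), "%Y"))  # date handling
})
loans
```

Each expression works row by row on whole columns, which is why the same expressions can be applied to one chunk of a larger data set at a time.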

Page 15: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Additional Functions for Processing Data Sets

Function | Purpose
rxSort | Sort a data set by one or more key variables
rxMerge | Merge two data sets by one or more key variables
rxFactors | Create or modify factors (categorical variables) based on existing variables
rxSetVarInfo | Change variable information such as the name or description of a variable
rxSetInfo | Add or change a data set description
rxSplitXdf | Split a single .xdf file into multiple .xdf files

Page 16: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

“I don’t have big hardware.” Big data analysis on your desktop.

Page 17: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Getting Started with Big Data

When I talk with people about their “big data”, almost always the first issue they raise is “hardware”: “What kind of hardware do I need to analyze big data?”

My answer: “Get started today with the hardware you have. With Revolution R Enterprise 5.0, you can quickly begin doing scalable data analysis on your desktop while you are determining your longer-term hardware requirements.”

Page 18: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Big Data on Your Desktop

• Data sets with many variables and 100 million observations can be easily processed on a desktop using RevoScaleR functions.
• Using Revolution R Enterprise 5.0, you can avoid getting locked into memory-bound analyses. Because data are processed a chunk at a time, increasing the number of observations in your data set doesn’t increase the memory requirements for a given analysis.
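Why chunking keeps memory flat can be seen in a small base R sketch. It is an illustration of the general idea only, not RevoScaleR's internal algorithm: a simulated vector stands in for a file, and the chunk size of 10,000 is arbitrary.

```r
# Stand-in for a large file: a vector that is read in slices.
set.seed(1)
x <- rnorm(100000, mean = 3300, sd = 500)
chunkSize <- 10000

# Only these three running totals are kept between chunks.
n <- 0; sumX <- 0; sumX2 <- 0
for (start in seq(1, length(x), by = chunkSize)) {
  chunk <- x[start:min(start + chunkSize - 1, length(x))]
  n     <- n + length(chunk)
  sumX  <- sumX + sum(chunk)
  sumX2 <- sumX2 + sum(chunk^2)
}

chunkedMean <- sumX / n
chunkedVar  <- (sumX2 - n * chunkedMean^2) / (n - 1)
c(mean = chunkedMean, sd = sqrt(chunkedVar))
```

Doubling the number of observations doubles the number of loop iterations, but the state carried between chunks stays at three numbers.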

Page 19: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Example: Analysis of data on all the births in the United States from 1985 to 2008

From R in a Nutshell (in dealing with the 2006 birth data): “The natality files are gigantic; they're approximately 3.1 GB uncompressed. That's a little larger than R can easily process, so I used Perl to translate these files into a form easily readable by R.”

• Almost 100 million observations
• Originally stored in annual fixed-format text files; imported and appended into one .xdf file for fast access using the RevoScaleR import function (no need for Perl)

Page 20: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Examples: Interacting with Your Data

Quickly compute summary statistics for variables in the data set using rxSummary: birth weight in grams and a time trend variable, months since Jan. 1985.

    rxSummary(~DBIRWT + MNTHS_SINCE_1985,
              data = birthAll,
              blocksPerRead = 10)

Setting blocksPerRead to 10 reads 10 blocks of the desired variables from the .xdf file on each read, or a little under 5,000,000 observations per read.

Page 21: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Example: Summary Statistics on Two Birth Data Variables

• Looks like 9999 must be the missing value code for DBIRWT
• Time processing all chunks and final results

Page 22: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Examples: Group Averages

Use rxCube to compute the proportion of babies that were boys, for each year and each race category of the mother.

    momRaceYear <-
        rxCube(ItsaBoy ~ F(DOB_YY):MRACEREC,
               data = birthAll, blocksPerRead = 10)

• The F() function creates an “on-the-fly” categorical variable for each unit interval.
• The average of the dependent variable, ItsaBoy, will be computed for each category determined by the interaction term.
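What “the average of the dependent variable for each category” means can be shown on a small in-memory data frame. The sketch uses base R's aggregate() instead of rxCube; the variable names come from the slide, but the data are simulated and the race labels "A" and "B" are placeholders.

```r
set.seed(42)
births <- data.frame(
  DOB_YY   = sample(1985:1987, 3000, replace = TRUE),
  MRACEREC = factor(sample(c("A", "B"), 3000, replace = TRUE)),
  ItsaBoy  = rbinom(3000, size = 1, prob = 0.51)
)

# Mean of ItsaBoy within each year x race cell, i.e. the proportion of boys.
cellMeans <- aggregate(ItsaBoy ~ DOB_YY + MRACEREC, data = births, FUN = mean)
cellMeans
```

Because ItsaBoy is coded 0/1, each cell mean is the share of boys among births in that year and race category.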

Page 23: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Example: Use rxCube to Compute Proportion of Boys by Year and Mother’s Race

• Put the results into a data frame, easy for plotting
• rxLinePlot is particularly well-suited for plotting rxCube results

Page 24: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Example: Plot the Results

Page 25: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Estimating a Linear Model

Suppose we want to estimate a linear model of birth weight (in pounds) on plurality and a time trend:

    BIRWTLBS ~ DPLURAL_REC + MNTHS_SINCE_1985

where

    BIRWTLBS = DBIRWT/453.59237

and

    rowSelection = DBIRWT < 9000
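The same model specification can be tried on a small in-memory data set with base R's lm(). This is a sketch on simulated data, so the coefficients mean nothing; the plurality labels "Single" and "Twin" are assumptions, and the 9999 values are injected to mimic the missing-value code seen in the summary statistics.

```r
set.seed(7)
n <- 2000
births <- data.frame(
  DBIRWT           = round(rnorm(n, mean = 3300, sd = 550)),
  DPLURAL_REC      = factor(sample(c("Single", "Twin"), n, replace = TRUE,
                                   prob = c(0.97, 0.03))),
  MNTHS_SINCE_1985 = sample(0:287, n, replace = TRUE)
)
births$DBIRWT[sample(n, 20)] <- 9999   # inject the missing-value code

# Row selection first, then the grams-to-pounds transformation.
keep <- subset(births, DBIRWT < 9000)
keep$BIRWTLBS <- keep$DBIRWT / 453.59237

fit <- lm(BIRWTLBS ~ DPLURAL_REC + MNTHS_SINCE_1985, data = keep)
coef(fit)
```

The row selection matters: leaving the 9999 codes in would treat them as 22-pound babies and distort every coefficient.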

Page 26: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Using the .xdf data file as a data source for the biglm package from CRAN

The biglm package also processes data in chunks. We can create an .xdf data source and use it with biglm functions. A linear model on almost 100 million rows in about 6 minutes on a desktop in R seems impressive.

I’ve written a small function to use an .xdf data source with biglm.

Page 27: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Using rxLinMod: Optimized for Speed

Adding another 5,000,000 observations would add less than 1 second to the computation time.

Page 28: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Linear Model Results

For those who have gotten interested in the actual analysis, here are the results:

• At the beginning of 1985, the average singleton baby weighed 7.46 pounds.
• Twins were a little over two pounds smaller, and triplets or higher even smaller.
• There’s a downward trend in birth weight over time, but very small.

Page 29: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Estimating a Big Logistic Model

Let’s try a more challenging model: a logistic regression with over 50 parameters (categorical data for Dad and Mom’s ages, race, Hispanic ethnicity, live birth order, plurality, gestation, and year).

    ItsaBoy ~ DadAgeR8 + MomAgeR7 +
        FRACEREC + FHISP_REC +
        MRACEREC + MHISP_REC +
        LBO4 + DPLURAL_REC +
        Gestation + F(DOB_YY)

Page 30: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Big Logistic Model on the Desktop

Even a large logistic regression (over 50 parameters) with almost 100 million rows of data can be estimated on a desktop, in about the time it takes to get a cup of coffee (about 6 minutes).

But what if that’s not fast enough?

Page 31: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Audience Poll

Before we answer that question, let’s do a quick poll of the audience.

Page 32: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

“I need to be ready for tomorrow’s data.” Scaling data analysis to a cluster.

Page 33: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Scaling Data Analysis to a Cluster

• With Revolution R Enterprise 5.0, you can use the same functions that you used on your desktop to scale to a cluster of computers.
• Windows HPC Server is currently supported. (See http://technet.microsoft.com/en-us/hpc/cc453771 for information on a 180-day evaluation copy.)

Page 34: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

The Birth Data Logistic Regression on a Cluster

• In our office we have a 5-node cluster of commodity hardware (about $5,000) running Windows HPC Server.
• I just set my compute context to use the cluster (and wait for the results) and set the location of the data on the nodes.
• Then I run the same code.

Result: 42 seconds instead of 6 minutes.

Page 35: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

How Does It Work?

• When I run rxLogit from my desktop with an HPC Server compute context, a job is submitted to the cluster.
• The master node allocates tasks to worker nodes to compute intermediate results on their part of the data.
• The master node aggregates the intermediate results from the nodes and processes them. If needed, more tasks are assigned (e.g., computing the next iteration).
• When complete, the master node sends the results back to my desktop.

Best of all, I don’t need to know how it works. I just set my compute context, run my code, and get my results back.
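The master/worker pattern described above can be imitated on one machine. The sketch below uses textbook iteratively reweighted least squares for a logistic regression, where each pretend worker returns cross-products for its own rows and the master adds them up and solves. This is an assumption about the general approach, not a description of rxLogit's actual internals; workerTask and all data are hypothetical.

```r
set.seed(123)
n <- 5000
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.5, 1, 0.25))))

# Split the rows into 5 parts, one per pretend worker node.
parts <- split(seq_len(n), rep(1:5, length.out = n))

# Work done by one worker: weighted cross-products for its rows only.
workerTask <- function(rows, beta) {
  Xi  <- X[rows, , drop = FALSE]
  eta <- drop(Xi %*% beta)
  mu  <- plogis(eta)
  w   <- mu * (1 - mu)
  z   <- eta + (y[rows] - mu) / w        # working response
  list(XtWX = crossprod(Xi, w * Xi), XtWz = crossprod(Xi, w * z))
}

beta <- rep(0, ncol(X))
for (iter in 1:25) {
  pieces  <- lapply(parts, workerTask, beta = beta)    # tasks sent to workers
  XtWX    <- Reduce(`+`, lapply(pieces, `[[`, "XtWX")) # master aggregates
  XtWz    <- Reduce(`+`, lapply(pieces, `[[`, "XtWz"))
  newBeta <- drop(solve(XtWX, XtWz))
  converged <- max(abs(newBeta - beta)) < 1e-8
  beta <- newBeta
  if (converged) break                                 # else: another round
}
beta
```

Each worker sends back only a small matrix and vector per iteration, so the traffic to the master does not grow with the number of rows a node holds.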

Page 36: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

The HPC Job Scheduler

If I’m interested, I can see the activity on the cluster using the HPC job scheduler, which can be launched from a menu item in the R Productivity Environment.

I can see that my computations were processed on 4 cores on each of 5 nodes.

Page 37: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

HPA and HPC Both Supported

I think of the logistic regression we just ran as High Performance Analytics (HPA). The computations are automatically distributed for the analysis of huge data sets. A key component is simultaneous rapid access to data; a cluster where each node has a separate disk drive is usually ideal.

With traditional High Performance Computing (HPC), the focus is not on the data. For example, a user might specify a function to be run in parallel across computing resources. Typically these are “embarrassingly” or “pleasingly” parallel computing problems.

Page 38: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

HPC Example: the Birthday Problem

• In a group of a given size, what is the probability that two people will have the same birthday?
• We can perform a brute-force computation, repeatedly creating random samples and counting, and do it in parallel across nodes of the cluster.
• We’ll use a function, pbirthday, that takes 2 arguments: n (the group size) and ntests (the number of times to sample).

Page 39: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

HPC Example: the Birthday Problem

• Set the compute context to do computations on our cluster.
• Use the rxExec function to ask each node on the 5-node cluster to do up to 20 runs of the pbirthday function, each using a different value for the ‘n’ argument.

rxExec allows users to run arbitrary functions in parallel.

The results come back in a list, which we can manipulate and plot.
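The brute-force computation itself is easy to sketch in base R. The function below is a hypothetical stand-in for the slide's pbirthday(n, ntests), since the original source is not shown; it is named simBirthday so that it does not mask the stats::pbirthday function that ships with R. The calls run one after another with lapply; on a cluster, each call is the kind of independent task that could go to a different node.

```r
# Hypothetical stand-in for the slide's pbirthday(n, ntests).
simBirthday <- function(n, ntests = 5000) {
  hits <- replicate(ntests, {
    bdays <- sample(365, n, replace = TRUE)   # one random group of n people
    any(duplicated(bdays))                    # does any birthday repeat?
  })
  mean(hits)                                  # share of groups with a match
}

set.seed(2011)
groupSizes <- c(5, 10, 23, 40, 60)
# Sequential here; each element is an independent task.
probs <- unlist(lapply(groupSizes, simBirthday))
data.frame(n = groupSizes, simulated = probs)
```

Because no call depends on another, this is a typical “embarrassingly parallel” problem: splitting the group sizes across nodes changes the run time but not the answers, apart from random-number differences.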

Page 40: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

HPC Example: the Birthday Problem

Page 41: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Using RevoScaleR with foreach: doRSR

• Another alternative for doing parallel computing with RevoScaleR is using the foreach package.
• The foreach package provides a for-loop-like approach to parallel computing.
• Parallel backends have been written for a variety of parallel computing packages, now including RevoScaleR.
• Let’s look at a simple example: computing square roots in parallel.

Page 42: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Simple Example of foreach with doRSR

• To get started with doRSR, load the library and register it as the backend for foreach.
• To run jobs on the cluster, set your compute context.
• We’ll estimate the square root for the numbers from 1 to 20.
• In this case, 20 cores will be requested from the cluster for computations.

Page 43: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

The Job Scheduler Using doRSR

You can see the 20 cores requested for the job, and that all 5 nodes were used.

Page 44: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Setting Up Your Compute Context

I’ve mentioned the “compute context” a lot. To set up your compute context, you just need basic information about your cluster.

It’s easy to create a new compute context based on an existing one. Just specify the properties you’d like to change.

Page 45: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Non-Waiting Jobs on a Cluster

It is common to use non-waiting jobs when working with a cluster. Send off your job, and return to work. Check the status of a non-waiting job in the object browser, or have an email sent. Then retrieve the results on your local machine.

When my job is done, I can retrieve the results using ‘rxGetJobResults’.

See my job status here

Page 46: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

“I need to write my own scalable analyses.” Creating your own scalable R extensions.

Page 47: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Creating Your Own Scalable Extensions

• Use doRSR and rxExec to distribute user-defined computations across processes or nodes of a cluster.
• Use output from RevoScaleR functions as input into other functions (e.g., output from rxCor into princomp for principal components).
• Write your own chunking algorithms, e.g., using rxDataStep to automatically chunk through the data. (I’ll show you an example.)
• When you’re done, create a package to distribute your new functions using the RPE.
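The correlation-matrix route to principal components works because base R's princomp accepts a precomputed matrix through its covmat argument. In the sketch below, base R's cor() on a small simulated matrix stands in for a correlation matrix computed chunk-wise on a large file; the data and column names are invented.

```r
set.seed(99)
x <- matrix(rnorm(500 * 4), ncol = 4,
            dimnames = list(NULL, c("v1", "v2", "v3", "v4")))
x[, 2] <- x[, 1] + 0.3 * x[, 2]     # make two columns correlated

corMat <- cor(x)                    # stand-in for a chunk-computed matrix
pc <- princomp(covmat = corMat)     # PCA needs only the matrix, not the rows
pc$sdev^2                           # variances of the components
```

Only a 4 x 4 matrix is handed to princomp, so this step costs the same whether the matrix summarises 500 rows or 100 million.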

Page 48: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Transformation Functions

Transformation functions are user-defined functions that operate on a chunk of data. They can be used to perform arbitrary computations and update results.

You can use transformation functions in RevoScaleR analysis functions to perform specialized data transformations. This example is for use in rxDataStep.

Page 49: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Using rxDataStep for User Computations

rxDataStep will automatically “chunk” through the data and run the transformation function on each chunk. Just initialize your computed values in the transformObjects argument. Your final results can be returned in a list.

The updated tableSum will contain the accumulated results of calling the table function on the “DayofWeek” variable.
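The accumulate-per-chunk pattern can be written out by hand in base R. In this sketch a manual loop stands in for the chunking that rxDataStep does automatically, and the starting value of tableSum plays the role of an object initialized through transformObjects. The flights data frame, its DayOfWeek column and the addChunk function are hypothetical.

```r
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
set.seed(5)
flights <- data.frame(DayOfWeek = factor(sample(days, 2500, replace = TRUE),
                                         levels = days))

# Initial value of the accumulator, one zero count per day.
tableSum <- setNames(integer(length(days)), days)

# Hypothetical transformation function: sees one chunk, updates the total.
addChunk <- function(chunk, runningTotal) {
  runningTotal + as.vector(table(chunk$DayOfWeek))
}

# Manual chunk loop over blocks of 1,000 rows.
rowId <- seq_len(nrow(flights))
for (rows in split(rowId, ceiling(rowId / 1000))) {
  tableSum <- addChunk(flights[rows, , drop = FALSE], tableSum)
}
tableSum
```

Fixing the factor levels up front matters: every chunk then produces counts in the same order, even if some day happens not to appear in a chunk.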

Page 50: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Package Support in the RPE

• To create a new R package project, choose File/New/Project/R Package Project. A solution with all the required R package components is automatically created.
• Right-click on the ‘man’ folder to add a help file for a new function.
• Build a package by right-clicking on the project and choosing Build R Package.

Page 51: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Wrap Up

It’s time to get started with Revolution R Enterprise 5.0:

• Start out analyzing a small data frame
• Use the same code to analyze a large data set locally
• Get high computing performance using the same code on a cluster
• Extend your analyses using the power and flexibility of the R language

Page 52: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Revolution R Enterprise: Free to Academia

• Personal use
• Research
• Teaching
• Package development

Free Academic Download: www.revolutionanalytics.com/downloads/free-academic.php

Discounted Technical Support Subscriptions Available

Page 53: New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis

Revolution R Enterprise 5.0: Now Available!

Thank You!