Revolution R Enterprise 5.0 is Revolution Analytics’ scalable analytics platform. At its core is Revolution Analytics’ enhanced Distribution of R, the world’s most widely-used project for statistical computing. In this webinar, Dr. Ranney will discuss new features and show examples of the new functionality, which extend the platform’s usability, integration and scalability.
Revolution Confidential
Revolution R Enterprise 5.0: Scalable Data Management and Analysis for the Enterprise
Sue Ranney, VP Product Development
November 2011
November 17, 2011: Welcome!
Thanks for coming! Text questions for Q&A after the presentation
Revolution R Enterprise 5.0 Webinar
In Today’s Webinar…
About Revolution R Enterprise 5.0
“I don’t have big data.” Why use Revolution R Enterprise 5.0 to get started?
“I don’t have big hardware.” Big data on your desktop.
“I have big data, and need to be ready for tomorrow’s even bigger data.” Scaling data analysis to a cluster.
“I need to write my own scalable analyses.” Creating your own scalable R extensions.
Wrap-up, Q&A
About Revolution R Enterprise
[Diagram: Revolution R Enterprise combines open-source R with performance enhancements, greater productivity and ease of use, big data analysis, IT-friendly enterprise deployment, and on-call experts for technical support, training, and consulting.]
Revolution R Enterprise 5.0: What’s New?

Distributed/Parallel Computing: distribute analytics and R functions to a Windows HPC Server cluster
Scalable Data Management: new data import and cleaning/manipulation tools
Expanded Scalable Analytics: principal components analysis, factor analysis, and more
Enhanced R Productivity Environment: create and build R packages
Integration with Hadoop: Cloudera-certified MapReduce programming in R
Enhanced RevoDeployR Server: supports multiple compute nodes, batch scheduling, and LDAP security
Upgraded Open Source R: R 2.13.2 with the new byte compiler
Revolution R Enterprise: What Gets Installed?
Latest stable version of open-source R (2.13.2)
High-performance math libraries
RevoScaleR package: scalable data management and analysis; distributed data analysis/parallel computing
Integrated Development Environment based on Visual Studio technology (for Windows): the R Productivity Environment (RPE), including a visual debugger for R
Data Management and Statistical Analysis for the Enterprise
“I don’t have big data.” Why get started with Revolution R Enterprise?
Why Revolution R Enterprise 5.0 with “Small” Data
Easy to get started: a consistent interface for “start-to-finish” data analysis with just a few functions
  Data import (text, SAS, SPSS, ODBC)
  Data transformations and manipulation
  Basic data analysis
Performance: fast analysis such as summary statistics, cross tabs, linear models, and logistic regression, even for data that can fit in memory
Scalability: replicate the data analysis you do today on big data down the road
Scalable Data Management: Import
Import data from a variety of sources with rxImport:
  SPSS
  SAS
  Delimited text (e.g., comma-separated)
  Fixed-format text
  Databases with an ODBC connection
Read small data sets into a data frame; store larger data sets in an efficient .xdf file format.
Use arguments such as colClasses and colInfo to provide guidance on how to import data (e.g., as integer, factor, etc.).
Example: Import Mortgage Default Data
Import a data file (10,000 obs) into a data frame, specifying the input file location.
Create a placeholder object for an output file that you’ll use with bigger data.
Use the same code to import a file with 10 million observations.
In both cases, the data object returned can be used as input data in other RevoScaleR functions.
rxImport is new in 5.0, to simplify and scale the data import process.
Scalable Data Management: Data Step
Basic steps for data manipulation and cleaning:
  Variable selection
  Data transformations
  Row selection
One function does it all: rxDataStep. You can use the same approach (function arguments) at various stages of your analysis: import, the data step, and “on-the-fly” in data analysis.
Example: Data Step with Mortgage Data
Specify the input data: a data frame or an object representing an .xdf file.
Specify an output file, if desired.
Select variables and rows to include in the new data set.
List variable transformations, using the usual R expressions.
rxDataStep is new in 5.0, to simplify and scale the data step.
Examples of R Operators and Functions You Can Use in ‘transforms’
Operator/Function: Description
+, -, *, /, ^, %%, …: row-by-row addition, subtraction, multiplication, division, exponentiation, modulus
<, <=, >, >=, ==, !=: logical (comparison) operators
abs, ceiling, floor, round, log, log10, cos, sin, sqrt, …: basic mathematical functions
as.Date, weekdays, months, quarters, …: convert character data to Date data, then use functions like weekdays(), months(), quarters()
rnorm, runif, rgamma, rexp, …: distribution functions
cut: create a factor from a numeric variable
substr, toupper, tolower: basic string handling
ifelse: ifelse(test, yes, no) sets the value of a variable conditional on a test
Or, write your own R function.
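A few of the table’s entries, used inside ‘transforms’, might look like this sketch (the variable names are hypothetical):

```r
# Sketch: ordinary R expressions inside 'transforms' (names hypothetical)
rxDataStep(inData = "mortDefault.xdf", outFile = "mortDefault2.xdf",
           transforms = list(
             logBalance = log(balance),                       # math function
             scoreGroup = cut(creditScore,
                              breaks = c(0, 600, 700, 850)),  # numeric to factor
             era        = ifelse(year >= 2005, "recent", "older")),
           overwrite = TRUE)
```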
Additional Functions for Processing Data Sets
Function Purpose
rxSort Sort a data set by one or more key variables
rxMerge Merge two data sets by one or more key variables
rxFactors Create or modify factors (categorical variables) based on existing variables
rxSetVarInfo Change variable information such as the name or description of a variable
rxSetInfo Add or change a data set description
rxSplitXdf Split a single .xdf file into multiple .xdf files
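Two of these, rxSort and rxMerge, might be used along these lines (file and variable names are hypothetical):

```r
# Sketch: sorting and merging .xdf files (RevoScaleR; names hypothetical)
rxSort(inData = "mortDefault.xdf", outFile = "mortSorted.xdf",
       sortByVars = "creditScore", decreasing = TRUE)

rxMerge(inData1 = "mortSorted.xdf", inData2 = "stateInfo.xdf",
        outFile = "mortMerged.xdf", type = "inner", matchVars = "state")
```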
“I don’t have big hardware.” Big data analysis on your desktop.
Getting Started with Big Data
When I talk with people about their “big data,” almost always the first issue they raise is hardware: “What kind of hardware do I need to analyze big data?” My answer: “Get started today with the hardware you have. With Revolution R Enterprise 5.0, you can quickly begin doing scalable data analysis on your desktop while you are determining your longer-term hardware requirements.”
Big Data on Your Desktop
Data sets with many variables and 100 million observations can be easily processed on a desktop using RevoScaleR functions. Using Revolution R Enterprise 5.0, you can avoid getting locked into memory-bound analyses: because data is processed a chunk at a time, increasing the number of observations in your data set doesn’t increase the memory requirements for a given analysis.
Example: Analyzing data on all the births in the United States from 1985-2008
From R in a Nutshell (on dealing with the 2006 birth data): “The natality files are gigantic; they're approximately 3.1 GB uncompressed. That's a little larger than R can easily process, so I used Perl to translate these files into a form easily readable by R.”
Almost 100 million observations, originally stored in annual fixed-format text files; imported and appended into one .xdf file for fast access using the RevoScaleR import function (no need for Perl).
Examples: Interacting with Your Data
Quickly compute summary statistics for variables in the data set using rxSummary: birth weight in grams (DBIRWT) and a time-trend variable, months since January 1985 (MNTHS_SINCE_1985):

rxSummary(~ DBIRWT + MNTHS_SINCE_1985,
          data = birthAll,
          blocksPerRead = 10)

With blocksPerRead set to 10, each read pulls 10 blocks of the desired variables from the .xdf file, a little under 5,000,000 observations per read.
Example: Summary Statistics on Two Birth Data Variables
It looks like 9999 must be the missing-value code for DBIRWT.
The reported time covers processing all chunks and computing the final results.
Examples: Group Averages
Use rxCube to compute the proportion of babies that were boys for each year and each race category of the mother:

momRaceYear <- rxCube(ItsaBoy ~ F(DOB_YY):MRACEREC,
                      data = birthAll, blocksPerRead = 10)

The F() function creates an “on-the-fly” categorical variable for each unit interval. The average of the dependent variable, ItsaBoy, is computed for each category determined by the interaction term.
Example: Use rxCube to Compute the Proportion of Boys by Year and Mother’s Race
Put the results into a data frame for easy plotting.
rxLinePlot is particularly well suited for plotting rxCube results.
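Those two steps might be sketched as follows (this assumes the momRaceYear object from the rxCube call above; rxResultsDF and the title argument are RevoScaleR conventions I am assuming here):

```r
# Sketch: turn the rxCube output into a data frame, then plot it
cubeDF <- rxResultsDF(momRaceYear)   # cube results as a data frame
rxLinePlot(ItsaBoy ~ DOB_YY, groups = MRACEREC, data = cubeDF,
           title = "Proportion of Boys by Year and Mother's Race")
```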
Example: Plot the Results
Estimating a Linear Model
Suppose we want to estimate a linear model: birth weight (in pounds) on plurality and a time trend:

BIRWTLBS ~ DPLURAL_REC + MNTHS_SINCE_1985

where BIRWTLBS = DBIRWT/453.59237 and rowSelection = DBIRWT < 9000
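The model specification above might be run like this sketch (rxLinMod is the RevoScaleR linear-model function; 453.59237 is grams per pound, and the rowSelection drops the 9999 missing-value code):

```r
# Sketch: the linear model above with rxLinMod, creating BIRWTLBS on the fly
lmFit <- rxLinMod(BIRWTLBS ~ DPLURAL_REC + MNTHS_SINCE_1985,
                  data = birthAll,
                  transforms = list(BIRWTLBS = DBIRWT / 453.59237),
                  rowSelection = DBIRWT < 9000,   # exclude missing-value code
                  blocksPerRead = 10)
summary(lmFit)
```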
Using the .xdf data file as a data source for the biglm package from CRAN
The biglm package also processes data in chunks. We can create an .xdf data source and use it with biglm functions; I’ve written a small function to do exactly that. A linear model on almost 100 million rows in about 6 minutes on a desktop in R seems impressive.
Using rxLinMod: Optimized for Speed
Adding another 5,000,000 observations would add less than 1 second
Linear Model Results
For those who have gotten interested in the actual analysis, here are the results:
  At the beginning of 1985, the average singleton baby weighed 7.46 pounds.
  Twins were a little over two pounds smaller, and triplets or higher smaller still.
  There’s a downward trend in birth weight over time, but it is very small.
Estimating a Big Logistic Model
Let’s try a more challenging model: a logistic regression with over 50 parameters (categorical data for Dad’s and Mom’s ages, race, Hispanic ethnicity, live birth order, plurality, gestation, and year):

ItsaBoy ~ DadAgeR8 + MomAgeR7 + FRACEREC + FHISP_REC +
  MRACEREC + MHISP_REC + LBO4 + DPLURAL_REC +
  Gestation + F(DOB_YY)
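Fitting it would look much like the linear model, with rxLogit in place of rxLinMod (a sketch, assuming the birthAll data object from earlier):

```r
# Sketch: fitting the logistic model above with rxLogit (RevoScaleR)
logitFit <- rxLogit(ItsaBoy ~ DadAgeR8 + MomAgeR7 + FRACEREC +
                      FHISP_REC + MRACEREC + MHISP_REC + LBO4 +
                      DPLURAL_REC + Gestation + F(DOB_YY),
                    data = birthAll, blocksPerRead = 10)
summary(logitFit)
```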
Big Logistic Model on the Desktop
Even a large logistic regression (over 50 parameters) with almost 100 million rows of data can be estimated on a desktop in about the time it takes to get a cup of coffee (about 6 minutes).
But what if that’s not fast enough?
Audience Poll
Before we answer that question, let’s do a quick poll of the audience
“I need to be ready for tomorrow’s data.” Scaling data analysis to a cluster.
Scaling Data Analysis to a Cluster
With Revolution R Enterprise 5.0, you can use the same functions that you used on your desktop to scale to a cluster of computers. Windows HPC Server is currently supported. (See http://technet.microsoft.com/en-us/hpc/cc453771 for information on a 180-day evaluation copy.)
The Birth Data Logistic Regression on a Cluster
In our office we have a 5-node cluster of commodity hardware (about $5,000) running Windows HPC Server. I just set my compute context to use the cluster, set the location of the data on the nodes, run the same code, and wait for the results.
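Setting that up might look like the following sketch (the head-node name and paths are hypothetical; RxHpcServer and rxSetComputeContext are the RevoScaleR compute-context functions):

```r
# Sketch: switching the compute context to a Windows HPC Server cluster
myCluster <- RxHpcServer(headNode = "cluster-head",   # hypothetical names
                         shareDir = "AllShare/sue",
                         dataPath = "C:/data")
rxSetComputeContext(myCluster)

# Exactly the same call that ran on the desktop now runs on the cluster
logitFit <- rxLogit(ItsaBoy ~ DadAgeR8 + MomAgeR7 + F(DOB_YY),
                    data = "birthAll.xdf")
```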
42 seconds instead of 6 minutes
How Does It Work?

When I run rxLogit from my desktop with an HPC Server compute context, a job is submitted to the cluster. The master node allocates tasks to worker nodes, which compute intermediate results on their part of the data. The master node aggregates the intermediate results from the nodes and processes them; if needed, more tasks are assigned (e.g., computing the next iteration). When complete, the master node sends the results back to my desktop.
Best of all, I don’t need to know how it works. I just set my compute context, run my code, and get my results back.
The HPC Job Scheduler

If I’m interested, I can see the activity on the cluster using the HPC Job Scheduler, which can be launched from a menu item in the R Productivity Environment.
I can see that my computations were processed on 4 cores on each of the 5 nodes.
HPA and HPC Both Supported
I think of the logistic regression we just ran as High Performance Analytics: the computations are automatically distributed for the analysis of huge data sets. A key component is simultaneous rapid access to the data; a cluster where each node has a separate disk drive is usually ideal.
With traditional High Performance Computing, the focus is not on the data. For example, a user might specify a function to be run in parallel across computing resources. Typically these are “embarrassingly” or “pleasingly” parallel computing problems.
HPC Example: the Birthday Problem
In a group of a given size, what is the probability that two people will have the same birthday? We can perform a brute-force computation, repeatedly creating random samples and counting, and do it in parallel across the nodes of the cluster. We’ll use a function, pbirthday, that takes two arguments:
  n: group size
  ntests: the number of times to sample
HPC Example: the Birthday Problem
Set the compute context to do computations on our cluster.
Use the rxExec function to ask each node on the 5-node cluster to do up to 20 runs of the pbirthday function, each using a different value for the ‘n’ argument.
rxExec allows users to run arbitrary functions in parallel.
The results come back in a list, which we can manipulate and plot.
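The simulation described above might be sketched as follows (pbirthday here is the user-defined simulator from the slide, not stats::pbirthday from base R; rxElemArg is the RevoScaleR helper I am assuming for passing a different argument to each call):

```r
# Sketch: the brute-force birthday simulation run in parallel with rxExec
pbirthday <- function(n, ntests = 5000) {
  hits <- sum(replicate(ntests,
                any(duplicated(sample(365, n, replace = TRUE)))))
  hits / ntests            # estimated probability of a shared birthday
}

# One call per group size 2..100, spread across the cluster nodes;
# rxElemArg supplies a different 'n' to each call
probs <- rxExec(pbirthday, n = rxElemArg(2:100))
```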
Using RevoScaleR with foreach: doRSR
Another alternative for parallel computing with RevoScaleR is the foreach package, which provides a for-loop-like approach to parallel computing. Parallel backends have been written for a variety of parallel computing packages, now including RevoScaleR. Let’s look at a simple example: computing square roots in parallel.
Simple Example of foreach with doRSR
To get started with doRSR, load the library and register it as the backend for foreach. To run jobs on the cluster, set your compute context. We’ll estimate the square root for the numbers from 1 to 20; in this case, 20 cores will be requested from the cluster for the computations.
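The steps above might be sketched like this (a minimal sketch; doRSR and registerDoRSR are Revolution Analytics’ foreach backend, and myCluster is a hypothetical compute context):

```r
# Sketch: square roots in parallel with foreach and the doRSR backend
library(doRSR)
registerDoRSR()                  # make RevoScaleR the foreach backend
# rxSetComputeContext(myCluster) # uncomment to send the work to a cluster

result <- foreach(i = 1:20) %dopar% sqrt(i)  # a list of 20 square roots
```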
The Job Scheduler Using doRSR
You can see the 20 cores requested for the job, and that all 5 nodes were used.
Setting Up Your Compute Context
I’ve mentioned the “compute context” a lot. To set up your compute context, you just need basic information about your cluster. It’s easy to create a new compute context based on an existing one: just specify the properties you’d like to change.
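Deriving one context from another might look like this sketch (it assumes myCluster is an existing RxHpcServer context, and that the constructor accepts an existing context plus the properties to change, as described above):

```r
# Sketch: creating a new compute context from an existing one
# (myCluster is a hypothetical existing RxHpcServer context)
noWaitCluster <- RxHpcServer(myCluster, wait = FALSE)  # only 'wait' changes
rxSetComputeContext(noWaitCluster)
```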
Non-Waiting Jobs on a Cluster
It is common to use non-waiting jobs when working with a cluster: send off your job and return to work. Check the status of a non-waiting job in the object browser, or have an email sent. Then retrieve the results on your local machine.
When my job is done, I can retrieve the results using rxGetJobResults.
(See my job status here.)
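The non-waiting workflow might be sketched as follows (assuming a non-waiting HPC compute context is already set; rxGetJobStatus and rxGetJobResults are the RevoScaleR job functions):

```r
# Sketch: a non-waiting job and retrieving its results later
job <- rxLogit(ItsaBoy ~ F(DOB_YY), data = "birthAll.xdf")  # returns a job object
rxGetJobStatus(job)                  # poll the job's status
logitFit <- rxGetJobResults(job)     # fetch the results once finished
```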
“I need to write my own scalable analyses.” Creating your own scalable R extensions.
Creating Your Own Scalable Extensions
Use doRSR and rxExec to distribute user-defined computations across processes or nodes of a cluster.
Use output from RevoScaleR functions as input into other functions (e.g., output from rxCor into princomp for principal components).
Write your own chunking algorithms, e.g., using rxDataStep to automatically chunk through the data. (I’ll show you an example.)
When you’re done, create a package to distribute your new functions using the RPE.
Transformation Functions
Transformation functions are user-defined functions that operate on a chunk of data. They can be used to perform arbitrary computations and update results.
You can use transformation functions in RevoScaleR analysis functions to perform specialized data transformations. This example is for use in rxDataStep.
Using rxDataStep for User Computations
rxDataStep will automatically “chunk” through the data and run the transformation function on each chunk. Just initialize your computed values in the transformObjects argument. Your final results can be returned in a list.
The updated tableSum will contain the accumulated results of calling the table function on the “DayOfWeek” variable.
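The chunking pattern just described might be sketched like this (the file and variable names are hypothetical, and I am assuming the .rxGet/.rxSet helpers for reading and updating the objects passed in via transformObjects):

```r
# Sketch: accumulating a table over chunks with a transformation function
ProcessChunk <- function(dataList) {
  chunkCounts <- table(dataList$DayOfWeek)                # count this chunk
  .rxSet("tableSum", .rxGet("tableSum") + chunkCounts)    # update running total
  NULL                                # no new variables written to the output
}

rxDataStep(inData = "DayData.xdf",
           transformFunc = ProcessChunk,
           transformVars = "DayOfWeek",
           transformObjects = list(tableSum = 0),  # initialize the accumulator
           returnTransformObjects = TRUE)          # final results come back in a list
```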
Package Support in the RPE

To create a new R package project, choose File/New/Project/R Package Project.
Right-click on the ‘man’ folder to add a help file for a new function.
Build a package by right-clicking on the project and choosing Build R Package.
A solution with all the required R package components is automatically created.
Wrap-Up

It’s time to get started with Revolution R Enterprise 5.0:
  Start out analyzing a small data frame.
  Use the same code to analyze a large data set locally.
  Get high computing performance using the same code on a cluster.
  Extend your analyses using the power and flexibility of the R language.
Revolution R Enterprise: Free to Academia

Personal use
Research
Teaching
Package development

Free Academic Download: www.revolutionanalytics.com/downloads/free-academic.php
Discounted Technical Support Subscriptions Available
Revolution R Enterprise 5.0: Now Available!

Thank You!