Revolution R Enterprise 5.0 is Revolution Analytics’ scalable analytics platform. At its core is Revolution Analytics’ enhanced Distribution of R, the world’s most widely-used project for statistical computing. In this webinar, Dr. Ranney will discuss new features and show examples of the new functionality, which extend the platform’s usability, integration and scalability.
Revolution Confidential
Revolution R Enterprise 5.0: Scalable Data Management and Analysis for the Enterprise
Sue Ranney, VP Product Development
November 2011
November 17, 2011: Welcome!
Thanks for coming! Text questions for Q&A after the presentation
Revolution R Enterprise 5.0 Webinar
In Today’s Webinar…
About Revolution R Enterprise 5.0
“I don’t have big data.” Why use Revolution R Enterprise 5.0 to get started?
“I don’t have big hardware.” Big data on your desktop.
“I have big data, and need to be ready for tomorrow’s even bigger data.” Scaling data analysis to a cluster.
“I need to write my own scalable analyses.” Creating your own scalable R extensions.
Wrap-up, Q&A
About Revolution R Enterprise
[Diagram: Revolution R Enterprise combines open-source R with performance enhancements, greater productivity and ease of use, big data analysis, IT-friendly enterprise deployment, and on-call experts for technical support, training, and consulting.]
Revolution R Enterprise 5.0: What’s New?

Distributed/Parallel Computing: distribute analytics and R functions to a Windows HPC Server cluster
Scalable Data Management: new data import and cleaning/manipulation tools
Expanded Scalable Analytics: principal components analysis, factor analysis, and more
Enhanced R Productivity Environment: create and build R packages
Integration with Hadoop: Cloudera-certified MapReduce programming in R
Enhanced RevoDeployR Server: supports multiple compute nodes, batch scheduling, and LDAP security
Upgraded Open Source R: R 2.13.2 with the new byte compiler
Revolution R Enterprise: What Gets Installed?
Latest stable version of open-source R (2.13.2)
High-performance math libraries
RevoScaleR package: scalable data management and analysis; distributed data analysis/parallel computing
Integrated Development Environment based on Visual Studio technology (for Windows): the R Productivity Environment (RPE), including a visual debugger for R
Data Management and Statistical Analysis for the Enterprise
“I don’t have big data.” Why get started with Revolution R Enterprise?
Why Revolution R Enterprise 5.0 with “Small” Data
Easy to get started: a consistent interface for “start-to-finish” data analysis with just a few functions
  Data import (text, SAS, SPSS, ODBC)
  Data transformations and manipulation
  Basic data analysis
Performance: fast analysis such as summary statistics, cross tabs, linear models, and logistic regression, even for data that can fit in memory
Scalability: replicate the data analysis you do today on big data down the road
Scalable Data Management: Import
Import data from a variety of sources with rxImport:
  SPSS
  SAS
  Delimited text (e.g., comma-separated)
  Fixed-format text
  Databases with an ODBC connection
Read small data sets into a data frame; store larger data sets in an efficient .xdf file format.
Use arguments such as colClasses and colInfo to provide guidance on how to import data (e.g., as integer, factor, etc.).
Example: Import Mortgage Default Data
Import a data file (10,000 obs) into a data frame, specifying the input file location.
Create a placeholder object for an output file that you’ll use with bigger data.
Use the same code to import a file with 10 million observations.
In both cases, the data object returned can be used as input data in other RevoScaleR functions.
rxImport is new in 5.0, to simplify and scale the data import process.
Scalable Data Management: Data Step
Basic steps for data manipulation and cleaning:
  Variable selection
  Data transformations
  Row selection
One function does it all: rxDataStep. You can use the same approach (function arguments) at various stages of your analysis: import, the data step, and “on-the-fly” in data analysis.
Example: Data Step with Mortgage Data
Specify the input data: a data frame or an object representing an .xdf file.
Specify an output file, if desired.
Select variables and rows to include in the new data set.
List variable transformations, using the usual R expressions.
rxDataStep is new in 5.0, to simplify and scale the data step.
Examples of R Operators and Functions You Can Use in ‘transforms’
Operator/Function: Description
+, -, *, /, ^, %%, …: row-by-row addition, subtraction, multiplication, division, exponentiation, modulus
<, <=, >, >=, ==, !=: logical (comparison) operators
abs, ceiling, floor, round, log, log10, cos, sin, sqrt, …: basic mathematical functions
as.Date, weekdays, months, quarters, …: convert character data to Date data, then use functions like weekdays(), months(), quarters()
rnorm, runif, rgamma, rexp, …: distribution functions
cut: create a factor from a numeric variable
substr, toupper, tolower: basic string handling
ifelse: ifelse(test, yes, no) sets the value of a variable conditional on a test
Or, write your own R function.
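A few of the table’s entries, used inside ‘transforms’, might look like this sketch (the variable names are hypothetical):

```r
# Sketch: ordinary R expressions inside 'transforms' (names hypothetical)
rxDataStep(inData = "mortDefault.xdf", outFile = "mortDefault2.xdf",
           transforms = list(
             logBalance = log(balance),                       # math function
             scoreGroup = cut(creditScore,
                              breaks = c(0, 600, 700, 850)),  # numeric to factor
             era        = ifelse(year >= 2005, "recent", "older")),
           overwrite = TRUE)
```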
Additional Functions for Processing Data Sets
Function Purpose
rxSort Sort a data set by one or more key variables
rxMerge Merge two data sets by one or more key variables
rxFactors Create or modify factors (categorical variables) based on existing variables
rxSetVarInfo Change variable information such as the name or description of a variable
rxSetInfo Add or change a data set description
rxSplitXdf Split a single .xdf file into multiple .xdf files
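Two of these, rxSort and rxMerge, might be used along these lines (file and variable names are hypothetical):

```r
# Sketch: sorting and merging .xdf files (RevoScaleR; names hypothetical)
rxSort(inData = "mortDefault.xdf", outFile = "mortSorted.xdf",
       sortByVars = "creditScore", decreasing = TRUE)

rxMerge(inData1 = "mortSorted.xdf", inData2 = "stateInfo.xdf",
        outFile = "mortMerged.xdf", type = "inner", matchVars = "state")
```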
“I don’t have big hardware.” Big data analysis on your desktop.
Getting Started with Big Data
When I talk with people about their “big data,” almost always the first issue they raise is hardware: “What kind of hardware do I need to analyze big data?” My answer: “Get started today with the hardware you have. With Revolution R Enterprise 5.0, you can quickly begin doing scalable data analysis on your desktop while you are determining your longer-term hardware requirements.”
Big Data on Your Desktop
Data sets with many variables and 100 million observations can be easily processed on a desktop using RevoScaleR functions. Using Revolution R Enterprise 5.0, you can avoid getting locked into memory-bound analyses: because data is processed a chunk at a time, increasing the number of observations in your data set doesn’t increase the memory requirements for a given analysis.
Example: Analyzing data on all the births in the United States from 1985-2008
From R in a Nutshell (on dealing with the 2006 birth data): “The natality files are gigantic; they're approximately 3.1 GB uncompressed. That's a little larger than R can easily process, so I used Perl to translate these files into a form easily readable by R.”
Almost 100 million observations, originally stored in annual fixed-format text files; imported and appended into one .xdf file for fast access using the RevoScaleR import function (no need for Perl).
Examples: Interacting with Your Data
Quickly compute summary statistics for variables in the data set using rxSummary: birth weight in grams (DBIRWT) and a time-trend variable, months since January 1985 (MNTHS_SINCE_1985):

rxSummary(~ DBIRWT + MNTHS_SINCE_1985,
          data = birthAll,
          blocksPerRead = 10)

With blocksPerRead set to 10, each read pulls 10 blocks of the desired variables from the .xdf file, a little under 5,000,000 observations per read.
Example: Summary Statistics on Two Birth Data Variables
It looks like 9999 must be the missing-value code for DBIRWT.
The reported time covers processing all chunks and computing the final results.
Examples: Group Averages
Use rxCube to compute the proportion of babies that were boys for each year and each race category of the mother:

momRaceYear <- rxCube(ItsaBoy ~ F(DOB_YY):MRACEREC,
                      data = birthAll, blocksPerRead = 10)

The F() function creates an “on-the-fly” categorical variable for each unit interval. The average of the dependent variable, ItsaBoy, is computed for each category determined by the interaction term.
Example: Use rxCube to Compute the Proportion of Boys by Year and Mother’s Race
Put the results into a data frame for easy plotting.
rxLinePlot is particularly well suited for plotting rxCube results.
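Those two steps might be sketched as follows (this assumes the momRaceYear object from the rxCube call above; rxResultsDF and the title argument are RevoScaleR conventions I am assuming here):

```r
# Sketch: turn the rxCube output into a data frame, then plot it
cubeDF <- rxResultsDF(momRaceYear)   # cube results as a data frame
rxLinePlot(ItsaBoy ~ DOB_YY, groups = MRACEREC, data = cubeDF,
           title = "Proportion of Boys by Year and Mother's Race")
```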
Example: Plot the Results
Estimating a Linear Model
Suppose we want to estimate a linear model: birth weight (in pounds) on plurality and a time trend:

BIRWTLBS ~ DPLURAL_REC + MNTHS_SINCE_1985

where BIRWTLBS = DBIRWT/453.59237 and rowSelection = DBIRWT < 9000
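The model specification above might be run like this sketch (rxLinMod is the RevoScaleR linear-model function; 453.59237 is grams per pound, and the rowSelection drops the 9999 missing-value code):

```r
# Sketch: the linear model above with rxLinMod, creating BIRWTLBS on the fly
lmFit <- rxLinMod(BIRWTLBS ~ DPLURAL_REC + MNTHS_SINCE_1985,
                  data = birthAll,
                  transforms = list(BIRWTLBS = DBIRWT / 453.59237),
                  rowSelection = DBIRWT < 9000,   # exclude missing-value code
                  blocksPerRead = 10)
summary(lmFit)
```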
Using the .xdf data file as a data source for the biglm package from CRAN
The biglm package also processes data in chunks. We can create an .xdf data source and use it with biglm functions; I’ve written a small function to do exactly that. A linear model on almost 100 million rows in about 6 minutes on a desktop in R seems impressive.
Using rxLinMod: Optimized for Speed
Adding another 5,000,000 observations would add less than 1 second
Linear Model Results
For those who have gotten interested in the actual analysis, here are the results:
  At the beginning of 1985, the average singleton baby weighed 7.46 pounds.
  Twins were a little over two pounds smaller, and triplets or higher smaller still.
  There’s a downward trend in birth weight over time, but it is very small.
Estimating a Big Logistic Model
Let’s try a more challenging model: a logistic regression with over 50 parameters (categorical data for Dad’s and Mom’s ages, race, Hispanic ethnicity, live birth order, plurality, gestation, and year):

ItsaBoy ~ DadAgeR8 + MomAgeR7 + FRACEREC + FHISP_REC +
  MRACEREC + MHISP_REC + LBO4 + DPLURAL_REC +
  Gestation + F(DOB_YY)
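Fitting it would look much like the linear model, with rxLogit in place of rxLinMod (a sketch, assuming the birthAll data object from earlier):

```r
# Sketch: fitting the logistic model above with rxLogit (RevoScaleR)
logitFit <- rxLogit(ItsaBoy ~ DadAgeR8 + MomAgeR7 + FRACEREC +
                      FHISP_REC + MRACEREC + MHISP_REC + LBO4 +
                      DPLURAL_REC + Gestation + F(DOB_YY),
                    data = birthAll, blocksPerRead = 10)
summary(logitFit)
```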
Big Logistic Model on the Desktop
Even a large logistic regression (over 50 parameters) with almost 100 million rows of data can be estimated on a desktop in about the time it takes to get a cup of coffee (about 6 minutes).
But what if that’s not fast enough?
Audience Poll
Before we answer that question, let’s do a quick poll of the audience
“I need to be ready for tomorrow’s data.” Scaling data analysis to a cluster.
Scaling Data Analysis to a Cluster
With Revolution R Enterprise 5.0, you can use the same functions that you used on your desktop to scale to a cluster of computers. Windows HPC Server is currently supported. (See http://technet.microsoft.com/en-us/hpc/cc453771 for information on a 180-day evaluation copy.)
The Birth Data Logistic Regression on a Cluster
In our office we have a 5-node cluster of commodity hardware (about $5,000) running Windows HPC Server. I just set my compute context to use the cluster, set the location of the data on the nodes, run the same code, and wait for the results.
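Setting that up might look like the following sketch (the head-node name and paths are hypothetical; RxHpcServer and rxSetComputeContext are the RevoScaleR compute-context functions):

```r
# Sketch: switching the compute context to a Windows HPC Server cluster
myCluster <- RxHpcServer(headNode = "cluster-head",   # hypothetical names
                         shareDir = "AllShare/sue",
                         dataPath = "C:/data")
rxSetComputeContext(myCluster)

# Exactly the same call that ran on the desktop now runs on the cluster
logitFit <- rxLogit(ItsaBoy ~ DadAgeR8 + MomAgeR7 + F(DOB_YY),
                    data = "birthAll.xdf")
```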
42 seconds instead of 6 minutes
How Does It Work?

When I run rxLogit from my desktop with an HPC Server compute context, a job is submitted to the cluster. The master node allocates tasks to worker nodes, which compute intermediate results on their part of the data. The master node aggregates the intermediate results from the nodes and processes them; if needed, more tasks are assigned (e.g., computing the next iteration). When complete, the master node sends the results back to my desktop.
Best of all, I don’t need to know how it works. I just set my compute context, run my code, and get my results back.
The HPC Job Scheduler

If I’m interested, I can see the activity on the cluster using the HPC Job Scheduler, which can be launched from a menu item in the R Productivity Environment.
I can see that my computations were processed on 4 cores on each of the 5 nodes.
HPA and HPC Both Supported
I think of the logistic regression we just ran as High Performance Analytics: the computations are automatically distributed for the analysis of huge data sets. A key component is simultaneous rapid access to the data; a cluster where each node has a separate disk drive is usually ideal.
With traditional High Performance Computing, the focus is not on the data. For example, a user might specify a function to be run in parallel across computing resources. Typically these are “embarrassingly” or “pleasingly” parallel computing problems.
HPC Example: the Birthday Problem
In a group of a given size, what is the probability that two people will have the same birthday? We can perform a brute-force computation, repeatedly creating random samples and counting, and do it in parallel across the nodes of the cluster. We’ll use a function, pbirthday, that takes two arguments:
  n: group size
  ntests: the number of times to sample
HPC Example: the Birthday Problem
Set the compute context to do computations on our cluster.
Use the rxExec function to ask each node on the 5-node cluster to do up to 20 runs of the pbirthday function, each using a different value for the ‘n’ argument.
rxExec allows users to run arbitrary functions in parallel.
The results come back in a list, which we can manipulate and plot.
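The simulation described above might be sketched as follows (pbirthday here is the user-defined simulator from the slide, not stats::pbirthday from base R; rxElemArg is the RevoScaleR helper I am assuming for passing a different argument to each call):

```r
# Sketch: the brute-force birthday simulation run in parallel with rxExec
pbirthday <- function(n, ntests = 5000) {
  hits <- sum(replicate(ntests,
                any(duplicated(sample(365, n, replace = TRUE)))))
  hits / ntests            # estimated probability of a shared birthday
}

# One call per group size 2..100, spread across the cluster nodes;
# rxElemArg supplies a different 'n' to each call
probs <- rxExec(pbirthday, n = rxElemArg(2:100))
```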
Using RevoScaleR with foreach: doRSR
Another alternative for parallel computing with RevoScaleR is the foreach package, which provides a for-loop-like approach to parallel computing. Parallel backends have been written for a variety of parallel computing packages, now including RevoScaleR. Let’s look at a simple example: computing square roots in parallel.
Simple Example of foreach with doRSR
To get started with doRSR, load the library and register it as the backend for foreach. To run jobs on the cluster, set your compute context. We’ll estimate the square root for the numbers from 1 to 20; in this case, 20 cores will be requested from the cluster for the computations.
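The steps above might be sketched like this (a minimal sketch; doRSR and registerDoRSR are Revolution Analytics’ foreach backend, and myCluster is a hypothetical compute context):

```r
# Sketch: square roots in parallel with foreach and the doRSR backend
library(doRSR)
registerDoRSR()                  # make RevoScaleR the foreach backend
# rxSetComputeContext(myCluster) # uncomment to send the work to a cluster

result <- foreach(i = 1:20) %dopar% sqrt(i)  # a list of 20 square roots
```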
The Job Scheduler Using doRSR
You can see the 20 cores requested for the job, and that all 5 nodes were used.
Setting Up Your Compute Context
I’ve mentioned the “compute context” a lot. To set up your compute context, you just need basic information about your cluster. It’s easy to create a new compute context based on an existing one: just specify the properties you’d like to change.
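Deriving one context from another might look like this sketch (it assumes myCluster is an existing RxHpcServer context, and that the constructor accepts an existing context plus the properties to change, as described above):

```r
# Sketch: creating a new compute context from an existing one
# (myCluster is a hypothetical existing RxHpcServer context)
noWaitCluster <- RxHpcServer(myCluster, wait = FALSE)  # only 'wait' changes
rxSetComputeContext(noWaitCluster)
```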
Non-Waiting Jobs on a Cluster
It is common to use non-waiting jobs when working with a cluster: send off your job and return to work. Check the status of a non-waiting job in the object browser, or have an email sent. Then retrieve the results on your local machine.
When my job is done, I can retrieve the results using rxGetJobResults.
(See my job status here.)
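The non-waiting workflow might be sketched as follows (assuming a non-waiting HPC compute context is already set; rxGetJobStatus and rxGetJobResults are the RevoScaleR job functions):

```r
# Sketch: a non-waiting job and retrieving its results later
job <- rxLogit(ItsaBoy ~ F(DOB_YY), data = "birthAll.xdf")  # returns a job object
rxGetJobStatus(job)                  # poll the job's status
logitFit <- rxGetJobResults(job)     # fetch the results once finished
```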
“I need to write my own scalable analyses.” Creating your own scalable R extensions.
Creating Your Own Scalable Extensions
Use doRSR and rxExec to distribute user-defined computations across processes or nodes of a cluster.
Use output from RevoScaleR functions as input into other functions (e.g., output from rxCor into princomp for principal components).
Write your own chunking algorithms, e.g., using rxDataStep to automatically chunk through the data. (I’ll show you an example.)
When you’re done, create a package to distribute your new functions using the RPE.
Transformation Functions
Transformation functions are user-defined functions that operate on a chunk of data. They can be used to perform arbitrary computations and update results.
You can use transformation functions in RevoScaleR analysis functions to perform specialized data transformations. This example is for use in rxDataStep.
Using rxDataStep for User Computations
rxDataStep will automatically “chunk” through the data and run the transformation function on each chunk. Just initialize your computed values in the transformObjects argument. Your final results can be returned in a list.
The updated tableSum will contain the accumulated results of calling the table function on the “DayOfWeek” variable.
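The chunking pattern just described might be sketched like this (the file and variable names are hypothetical, and I am assuming the .rxGet/.rxSet helpers for reading and updating the objects passed in via transformObjects):

```r
# Sketch: accumulating a table over chunks with a transformation function
ProcessChunk <- function(dataList) {
  chunkCounts <- table(dataList$DayOfWeek)                # count this chunk
  .rxSet("tableSum", .rxGet("tableSum") + chunkCounts)    # update running total
  NULL                                # no new variables written to the output
}

rxDataStep(inData = "DayData.xdf",
           transformFunc = ProcessChunk,
           transformVars = "DayOfWeek",
           transformObjects = list(tableSum = 0),  # initialize the accumulator
           returnTransformObjects = TRUE)          # final results come back in a list
```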
Package Support in the RPE

To create a new R package project, choose File/New/Project/R Package Project.
Right-click on the ‘man’ folder to add a help file for a new function.
Build a package by right-clicking on the project and choosing Build R Package.
A solution with all the required R package components is automatically created.
Wrap-Up

It’s time to get started with Revolution R Enterprise 5.0:
  Start out analyzing a small data frame.
  Use the same code to analyze a large data set locally.
  Get high computing performance using the same code on a cluster.
  Extend your analyses using the power and flexibility of the R language.
Revolution R Enterprise: Free to Academia

Personal use
Research
Teaching
Package development

Free Academic Download: www.revolutionanalytics.com/downloads/free-academic.php
Discounted Technical Support Subscriptions Available
Revolution R Enterprise 5.0: Now Available!

Thank You!