
useR Vignette:

R + 15 minutes = Hadoop cluster

Greater Boston useR Group
February 2011

by

Jeffrey Breen


Agenda

● What's Hadoop?
● But I don't have Big Data
● Building the cluster
● Estimating π stochastically
● Want to know more?


MapReduce, Hadoop and Big Data

● Hadoop is an open source implementation of Google's MapReduce-based data processing infrastructure
● Designed to process huge data sets
– “huge” = “all of Facebook's web logs”
– Yahoo! sorted 1TB in 62 seconds in May 2009
– HDFS distributed file system makes replication decisions based on knowledge of network topology
● Amazon Elastic MapReduce is a full Hadoop stack on EC2


MapReduce = Map + shuffle + Reduce

Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
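The three phases can be illustrated on a single machine. The following is a minimal sketch in base R, not Hadoop itself: the input lines are made up, and the variable names (mapped, shuffled, counts) are chosen here for illustration only.

```r
# Toy word count showing the three MapReduce phases in base R.
# The input lines are invented for this example.
lines <- c("the quick brown fox", "the lazy dog", "the fox")

# Map: emit one (key, value) pair per word, with value 1.
mapped <- do.call(rbind, lapply(lines, function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}))

# Shuffle: group the values by key so each key's values sit together.
shuffled <- split(mapped$value, mapped$key)

# Reduce: collapse each key's values to a single result.
counts <- vapply(shuffled, sum, numeric(1))
print(counts)
```

On a real cluster the map and reduce steps run on different nodes, and the shuffle is the network transfer that brings each key's values to the node that reduces them.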


But I don't have Big Data

● Agricultural economist J.D. Long doesn't either, but he does have a bunch of simulations to run

● Had a key insight: the input could be a small amount of data (like 1:1000) to serve as random seeds for simulation code in the “mapper” function

● Enjoy Hadoop's infrastructure for job scheduling, fault tolerance, inter-node communication, etc.

● Use Amazon's cloud to scale up quickly as needed


Load the segue library

> library(segue)

Loading required package: rJava

Loading required package: caTools

Loading required package: bitops

Segue did not find your AWS credentials. Please run the setCredentials() function.

> setCredentials('YOUR_ACCESS_KEY_ID', 'YOUR_SECRET_ACCESS_KEY')


Start the cluster

> myCluster <- createCluster(numInstances=5)

STARTING - 2011-01-04 15:07:53

[…]

BOOTSTRAPPING - 2011-01-04 15:11:28

[…]

WAITING - 2011-01-04 15:15:35

Your Amazon EMR Hadoop Cluster is ready for action.

Remember to terminate your cluster with stopCluster().

Amazon is billing you!


Estimate π stochastically

> estimatePi <- function(seed){
    set.seed(seed)
    numDraws <- 1e6
    r <- .5 #radius
    x <- runif(numDraws, min=-r, max=r)
    y <- runif(numDraws, min=-r, max=r)
    inCircle <- ifelse( (x^2 + y^2)^.5 < r , 1, 0)
    return(sum(inCircle) / length(inCircle) * 4)
  }
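A function like this can be tried locally on a few seeds with base R's lapply() before any cluster is involved. The sketch below is a variant of the slide's function, not the original. It assumes an extra numDraws argument (added here so the dry run finishes quickly) and compares squared distances, which is equivalent to taking the square root.

```r
# Local dry run of the mapper on a handful of seeds, no cluster needed.
# numDraws is exposed as an argument here (an assumption for quick testing;
# the slide's version hard-codes 1e6 draws).
estimatePi <- function(seed, numDraws = 1e5) {
  set.seed(seed)
  r <- 0.5  # radius
  x <- runif(numDraws, min = -r, max = r)
  y <- runif(numDraws, min = -r, max = r)
  # Points land inside the circle with probability (pi r^2) / (2r)^2 = pi / 4,
  # which is why the observed fraction is multiplied by 4.
  inCircle <- (x^2 + y^2) < r^2
  mean(inCircle) * 4
}

localEstimates <- lapply(as.list(1:10), estimatePi)
localPi <- mean(unlist(localEstimates))
print(localPi)
```

Because each call sets its own seed, rerunning a seed reproduces its estimate exactly, which makes a failed or repeated task harmless.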


Run the simulation

> seedList <- as.list(1:1e3)

> myEstimates <- emrlapply( myCluster, seedList, estimatePi )

RUNNING - 2011-01-04 15:22:28

[…]

WAITING - 2011-01-04 15:32:18

> myPi <- Reduce(sum, myEstimates) / length(myEstimates)

> format(myPi, digits=10)

[1] "3.141586544"

> format(pi, digits=10)

[1] "3.141592654"
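Whether this agreement with π is about what one should expect can be checked with a little arithmetic. The sketch below assumes the draws are effectively independent across seeds and treats each draw as a Bernoulli trial with success probability π/4. The variable names are chosen here for illustration.

```r
# Back-of-the-envelope check of the accuracy of the averaged estimate.
numDraws <- 1e6      # draws per seed, from estimatePi()
numSeeds <- 1e3      # length of seedList
p <- pi / 4          # probability a point lands inside the circle
# Standard error of the averaged estimate, assuming independent draws across seeds.
seAveraged <- 4 * sqrt(p * (1 - p) / (numDraws * numSeeds))
observedError <- abs(3.141586544 - pi)  # value printed on the slide vs. pi
cat(sprintf("expected s.e.: %.2e  observed error: %.2e\n", seAveraged, observedError))
```

The expected standard error is roughly 5e-5 and the observed error roughly 6e-6, so the result is comfortably within one standard error of π.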


Won't break the bank

● Total cost: $0.15

Standard On-Demand Instances   Amazon EC2 price per hour (On-Demand Instances)   Amazon Elastic MapReduce price per hour
Small (Default)                $0.085                                            $0.015
Large                          $0.34                                             $0.06
Extra Large                    $0.68                                             $0.12


Want to know more?

● JD Long's segue package
– http://code.google.com/p/segue/
● Hadoop
– http://hadoop.apache.org/
– Book: http://oreilly.com/catalog/0636920010388
● My blog
– http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
