useR Vignette:
R + 15 minutes = Hadoop cluster
Greater Boston useR Group
February 2011
by
Jeffrey Breen
Greater Boston useR Meeting, February 2011
useR Vignette: R + 15 minutes = Hadoop Cluster
Agenda
● What's Hadoop?
● But I don't have Big Data
● Building the cluster
● Estimating π stochastically
● Want to know more?
MapReduce, Hadoop and Big Data
● Hadoop is an open source implementation of Google's MapReduce-based data processing infrastructure
● Designed to process huge data sets
– “huge” = “all of Facebook's web logs”
– Yahoo! sorted 1TB in 62 seconds in May 2009
– HDFS distributed file system makes replication decisions based on knowledge of network topology
● Amazon Elastic MapReduce is a full Hadoop stack on EC2
MapReduce = Map + shuffle + Reduce
Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
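The three phases can be sketched on a single machine in base R. This is only a toy word count for illustration: the `docs` input and the `mapper` function are made up here, and nothing in it touches Hadoop.

```r
# Toy word count in base R, mirroring the three MapReduce phases.
docs <- list("big data big cluster", "small data")

# Map: each input record emits (key, value) pairs; here (word, 1).
mapper <- function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  lapply(words, function(w) list(key = w, value = 1))
}
pairs <- unlist(lapply(docs, mapper), recursive = FALSE)

# Shuffle: group all values that share a key.
keys    <- vapply(pairs, function(p) p$key, character(1))
values  <- vapply(pairs, function(p) p$value, numeric(1))
grouped <- split(values, keys)

# Reduce: collapse each key's values to a single result.
counts <- vapply(grouped, sum, numeric(1))
print(counts)
```

On a real cluster the map and reduce calls run on different nodes, and the shuffle moves data between them over the network.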
But I don't have Big Data
● Agricultural economist J.D. Long doesn't either, but he does have a bunch of simulations to run
● Had a key insight: the input could be a small amount of data (like 1:1000) that serves as random seeds for the simulation code in the “mapper” function
● Enjoy Hadoop's infrastructure for job scheduling, fault tolerance, inter-node communication, etc.
● Use Amazon's cloud to scale up quickly as needed
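A minimal local sketch of the seeds-as-input idea, using plain `lapply` in place of a cluster. The `simulate` function is a hypothetical stand-in for real simulation code, not something from segue.

```r
# Hypothetical stand-in for a real simulation: the mean of 100 normal draws.
simulate <- function(seed) {
  set.seed(seed)   # the seed is the only "data" the mapper receives
  mean(rnorm(100))
}

seeds <- as.list(1:10)   # tiny input: one seed per independent run

# Local equivalent of the map step; each element could run on a different node.
run1 <- lapply(seeds, simulate)
run2 <- lapply(seeds, simulate)

# Re-running with the same seeds reproduces the same results.
print(identical(run1, run2))
```

Because each run depends only on its seed, a failed task can be retried on another node without changing the answer.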
Load the segue library
> library(segue)
Loading required package: rJava
Loading required package: caTools
Loading required package: bitops
Segue did not find your AWS credentials. Please run the setCredentials() function.
> setCredentials('YOUR_ACCESS_KEY_ID', 'YOUR_SECRET_ACCESS_KEY')
Start the cluster
> myCluster <- createCluster(numInstances=5)
STARTING - 2011-01-04 15:07:53
[…]
BOOTSTRAPPING - 2011-01-04 15:11:28
[…]
WAITING - 2011-01-04 15:15:35
Your Amazon EMR Hadoop Cluster is ready for action.
Remember to terminate your cluster with stopCluster().
Amazon is billing you!
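Since a forgotten cluster keeps costing money, one option is to wrap the work so that shutdown always happens. The sketch below is not part of segue: `withCluster`, `fakeStart` and `fakeStop` are hypothetical names, and the demo uses fake functions so it runs without AWS.

```r
# Hypothetical helper: run workFn on a cluster and always shut the cluster down.
withCluster <- function(startFn, stopFn, workFn) {
  cluster <- startFn()
  on.exit(stopFn(cluster), add = TRUE)  # runs even if workFn() raises an error
  workFn(cluster)
}

# Demo with fake start/stop functions that only record what happened.
events <- character(0)
fakeStart <- function() { events <<- c(events, "start"); "fake-cluster" }
fakeStop  <- function(cl) { events <<- c(events, "stop"); invisible(NULL) }

# The job fails, but the cluster is still stopped.
result <- try(withCluster(fakeStart, fakeStop,
                          function(cl) stop("job failed")), silent = TRUE)
print(events)

# With segue this might look like the following (assumes stopCluster()
# takes the cluster object, as createCluster() returns it):
# withCluster(function() createCluster(numInstances = 5), stopCluster,
#             function(cl) emrlapply(cl, seedList, estimatePi))
```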
Estimate π stochastically
> estimatePi <- function(seed){
    set.seed(seed)
    numDraws <- 1e6
    r <- .5  # radius
    x <- runif(numDraws, min=-r, max=r)
    y <- runif(numDraws, min=-r, max=r)
    inCircle <- ifelse( (x^2 + y^2)^.5 < r , 1, 0)
    return(sum(inCircle) / length(inCircle) * 4)
  }
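Why the factor of 4: the points are uniform on a square of side 2r, and the circle of radius r sits inside it, so the fraction landing in the circle estimates the ratio of the two areas. Here N stands for numDraws.

```latex
\[
P\left(\sqrt{x^2 + y^2} < r\right) = \frac{\pi r^2}{(2r)^2} = \frac{\pi}{4}
\quad\Longrightarrow\quad
\hat{\pi} = 4 \cdot \frac{\#\{\, i : \sqrt{x_i^2 + y_i^2} < r \,\}}{N}
\]
```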
Run the simulation
> seedList <- as.list(1:1e3)
> myEstimates <- emrlapply( myCluster, seedList, estimatePi )
RUNNING - 2011-01-04 15:22:28
[…]
WAITING - 2011-01-04 15:32:18
> myPi <- Reduce(sum, myEstimates) / length(myEstimates)
> format(myPi, digits=10)
[1] "3.141586544"
> format(pi, digits=10)
[1] "3.141592654"
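A rough check on how close the average should be. This assumes the 1,000 seeds give effectively independent runs and treats each draw as a Bernoulli trial with success probability π/4; the variable names are made up for this sketch.

```r
p        <- pi / 4   # probability that a draw lands inside the circle
numDraws <- 1e6      # draws per seed, as in estimatePi
numSeeds <- 1e3      # length of seedList

# Each estimate is 4 times a binomial proportion.
seOneEstimate <- 4 * sqrt(p * (1 - p) / numDraws)
seAverage     <- seOneEstimate / sqrt(numSeeds)

# myPi as printed on the slide, compared with R's built-in pi.
observedError <- abs(3.141586544 - pi)

print(signif(c(seOneEstimate = seOneEstimate,
               seAverage     = seAverage,
               observedError = observedError), 3))
```

The standard error of the average comes out near 5e-5, and the observed error of about 6e-6 sits comfortably inside it.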
Won't break the bank
● Total cost: $0.15
Price per hour (Standard On-Demand Instances)

Instance type      Amazon EC2    Amazon Elastic MapReduce
Small (Default)    $0.085        $0.015
Large              $0.34         $0.06
Extra Large        $0.68         $0.12
Want to know more?
● JD Long's segue package
– http://code.google.com/p/segue/
● Hadoop
– http://hadoop.apache.org/
– Book: http://oreilly.com/catalog/0636920010388
● My blog
– http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/