Automated Hadoop Cluster Construction on EC2

Presented at Houston Hadoop Meetup

Automated Hadoop Clusters on EC2

Mark Kerzner, SHMsoft

What is Hadoop? :) :) :)

Everybody knows that ... What is your definition?

What is a cloud?

Everybody knows that, but
1. Elastic resources
2. Internet delivery
3. SaaS
4. Virtualization
5. Device-enabled
6. Only (1) or all of the above

You are the Hadoop programmer

... and you need tools
What are your alternatives?
● IDE
● Local "cluster"
● Pseudo-distributed cluster
● EC2

You are the Hadoop programmer

... and you need tools
What are your alternatives?
● IDE - compile and run the code
● Local "cluster" - local file system
● Pseudo-distributed cluster - test outside
● EC2 - test on the cluster, test for scale

What are your resources

● Tom White, "Hadoop: The Definitive Guide"
● www.hadoopilluminated.com

For real play, you need a cluster

Hadoop+ (oh, by the way...)

HBase, Cassandra, MongoDB, NoSQL, Dynamo, BigTable, Dryad (MS), Azure (MS), MapReduce, MapR (EMC), Cloudera distribution, EMC distribution, IBM distribution...

Whirr

Setup
    export AWS_ACCESS_KEY_ID=...
    export AWS_SECRET_ACCESS_KEY=...

Install
    curl -O http://www.apache.org/dist/whirr/whirr-0.7.1/whirr-0.7.1.tar.gz
    tar zxf whirr-0.7.1.tar.gz; cd whirr-0.7.1

Generate keys
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr

Run
    bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
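The Whirr steps above fail late if the AWS credentials are missing, so a small wrapper can check them before anything is launched. The following is a minimal sketch, not part of Whirr itself: the function names (check_aws_env, whirr_launch_cmd, launch_dry_run) are hypothetical, and the only Whirr options used are the ones shown on the slide. It prints the launch command instead of running it.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical helpers, not Whirr commands.
# The whirr flags are exactly those shown on the slide.

# Succeeds only when both AWS credential variables are set and non-empty.
check_aws_env() {
    [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]
}

# Prints the launch command for a recipe and a private key file (does not run it).
whirr_launch_cmd() {
    recipe=$1
    key_file=$2
    echo "bin/whirr launch-cluster --config $recipe --private-key-file $key_file"
}

# Prints the command if the credentials are present; otherwise reports and fails.
launch_dry_run() {
    if ! check_aws_env; then
        echo "AWS credentials are not set" >&2
        return 1
    fi
    whirr_launch_cmd "$1" "$2"
}
```

Calling launch_dry_run recipes/zookeeper-ec2.properties ~/.ssh/id_rsa_whirr from inside the whirr-0.7.1 directory prints the command to review; piping that output to sh would run it.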

Whirr limitations

● No EBS
● All or nothing
● Generates configuration artifacts
● Takes over your computer, no more local development - uses proxy
● Hard to customize

Amazon EMR

EMR limitations

● No choice of image
● Fixed architecture
● Hard to debug
● Hard to customize

You do it

Repeat the manual procedure, only automate it
Prepare: AMI, Java, Hadoop
On-the-fly: Start AMI, login, configure, start services, verify, run test jobs
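The on-the-fly sequence above can be sketched as an ordered list of steps that stops at the first failure. This is an illustration under assumptions, not the speaker's script: every step_* function is a hypothetical placeholder that only prints its name, and a real version would replace each body with the actual EC2 and Hadoop commands.

```shell
#!/bin/sh
# Hypothetical placeholder steps: each only prints what a real step would do.
step_start_ami()      { echo "start AMI"; }
step_login()          { echo "login"; }
step_configure()      { echo "configure"; }
step_start_services() { echo "start services"; }
step_verify()         { echo "verify"; }
step_run_test_jobs()  { echo "run test jobs"; }

# Runs the named steps in order and stops at the first one that fails.
run_steps() {
    for step in "$@"; do
        if ! "$step"; then
            echo "FAILED: $step" >&2
            return 1
        fi
    done
    return 0
}

# The order follows the slide.
run_steps step_start_ami step_login step_configure \
          step_start_services step_verify step_run_test_jobs
```

Stopping at the first failed step keeps a half-configured cluster from running test jobs, which mirrors what a person would do when following the manual procedure.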

You do it - advanced

On startup: Under-provision, over-provision, progress
On-the-fly: Monitor, run test jobs, watch for cluster deterioration
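One possible reading of under- and over-provisioning on startup is: request a few spare instances, then compare how many nodes actually came up against the target. The sketch below assumes that reading. The function names and the spare-percentage parameter are hypothetical, and obtaining the count of running nodes is left to whatever tool reports it.

```shell
#!/bin/sh
# Sketch only: names and the spare-percentage idea are assumptions, not from the talk.

# Number of instances to request: the target plus a spare percentage, rounded up.
nodes_to_request() {
    wanted=$1
    spare_percent=$2
    echo $(( $wanted + ($wanted * $spare_percent + 99) / 100 ))
}

# Compares the number of running nodes with the target and prints the action to take.
provision_status() {
    wanted=$1
    running=$2
    if [ "$running" -lt "$wanted" ]; then
        echo "under-provisioned: start $(( $wanted - $running )) more"
    elif [ "$running" -gt "$wanted" ]; then
        echo "over-provisioned: stop $(( $running - $wanted ))"
    else
        echo "ok"
    fi
}
```

The same provision_status check can be repeated while the cluster runs: a node count that drops below the target is one simple signal of cluster deterioration.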

Cloudera Manager

MapR Manager

On the large scale

Hadoop 0.20 - up to 4,000 nodes
Hadoop 0.23 - up to 20,000
GridGain - 100's of 1,000's

Thank you

Questions?
