2013 AWS Worldwide Public Sector Summit Washington, D.C.
EMR for Fun and for Profit
Ben Butler | Sr. Manager, Big Data
[email protected] | @bensbutler
Overview
1. What is big data?
2. What is AWS Elastic MapReduce?
3. What data is available?
4. How can AWS EMR support my agency's mission?
What is Big Data?
So what is it?
When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze, and share them.
Challenges start at relatively small volumes: anywhere from 100 GB up through the petabyte range.
Unconstrained data growth
95% of the 1.2 zettabytes of data in the digital universe is unstructured, and 70% of this is user-generated content. Unstructured data growth is explosive, with an estimated compound annual growth rate (CAGR) of 62% from 2008 to 2012.
Source: IDC
Where does it come from?
• Web sites: blogs, reviews, emails, pictures
• Social graphs: Facebook, LinkedIn, contacts
• Application server logs: web sites, games
• Sensor data: weather, water, smart grids
• Images/videos: traffic and security cameras
• Twitter: 50M tweets/day, 1,400% growth per year
Why AWS and big data? Innovation.
Compute and storage services built for scale: Amazon S3, Amazon DynamoDB, Amazon Redshift, Spot Instances, HPC, and EMR.
How do you get your slice of it?
• AWS Direct Connect: dedicated, low-latency bandwidth
• Queuing: highly scalable event buffering
• AWS Storage Gateway: sync local storage to the cloud
• AWS Import/Export: physical media shipping
Where do you put your slice of it?
• Amazon Relational Database Service (RDS): fully managed database (MySQL, Oracle, SQL Server)
• Amazon DynamoDB: NoSQL, schema-less, provisioned-throughput database
• Amazon S3: object datastore with up to 5 TB per object; 99.999999999% durability
• Amazon SimpleDB: NoSQL, schema-less; suited to smaller datasets
• Amazon Glacier: long-term cold storage from $0.01 per GB/month; 99.999999999% durability
How quickly do you need to read it? Scale, price, and performance trade off across the storage services:
• Amazon DynamoDB (single-digit ms): social-scale applications; provisioned-throughput performance; flexible consistency models
• Amazon S3 (tens to hundreds of ms): any object, any app; objects up to 5 TB in size; 99.999999999% durability
• Amazon Glacier (retrievals in under 5 hours): media and asset archives; extremely low cost; S3 levels of durability
Operate at any scale, with unlimited data.
Data has gravity: applications cluster around the data (http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/).
…and inertia at volume: at scale, it is easier to move applications to the data.
Bring compute capacity to the data.
Very large dataset seeks strong & consistent compute for short-term relationship, possibly longer.
Flexible compute resources, on demand
Amazon Elastic Compute Cloud (EC2) is the basic unit of compute capacity, with vertical scaling from $0.02/hr. It offers a range of CPU, memory, and local disk options across 17 instance types, from micro through cluster compute to SSD-backed.
Feature | Details
Flexible | Run Windows or Linux distributions
Scalable | Wide range of instance types, from micro to cluster compute
Machine Images | Configurations can be saved as machine images (AMIs) from which new instances can be created
Full control | Full root or administrator rights
VM Import/Export | Import and export VM images to transfer configurations in and out of EC2
Monitoring | Publishes metrics to CloudWatch
Inexpensive | On-Demand, Reserved, and Spot instance pricing
Secure | Full firewall control via security groups
Elastic capacity as you need it
Four common demand patterns: on and off, fast growth, variable peaks, and predictable peaks. With fixed traditional IT capacity, provisioning above your needs means waste and provisioning below them means customer dissatisfaction; elastic cloud capacity tracks your actual IT needs over time.
From one instance… to thousands.
AWS EMR – Elastic MapReduce
A key tool in the toolbox to help with "Big Data" challenges:
• Makes possible analytics processes that previously were not feasible
• Cost-effective when leveraged with the EC2 Spot market
• Broad ecosystem of tools to handle specific use cases
Amazon Elastic MapReduce
What is EMR?
• Hadoop-as-a-service: a cost-effective AWS wrapper around Hadoop
• Massively parallel MapReduce engine
• Integrated with AWS services and ecosystem tools
Walk through an example: a very large click log (e.g. TBs) contains lots of actions by John Smith, buried among everyone else's.
1. Split the log into many small pieces.
2. Process the pieces in an EMR cluster.
3. Aggregate the results from all the nodes.
4. The output: what John Smith did.
Insight in a fraction of the time.
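The split/process/aggregate flow above is exactly the MapReduce pattern. As a minimal sketch, with plain Python standing in for Hadoop and a made-up log format (on EMR the "log" would be TBs of files in S3, split automatically across many mapper nodes):

```python
from collections import defaultdict

# Toy click log: (user, action) records. Purely illustrative data.
log = [
    ("jsmith", "login"), ("adoe", "search"),
    ("jsmith", "click"), ("jsmith", "purchase"),
]

# Map: emit a key/value pair per record (here the record itself).
mapped = [(user, action) for user, action in log]

# Shuffle: group all values for the same key together.
groups = defaultdict(list)
for user, action in mapped:
    groups[user].append(action)

# Reduce: aggregate each user's actions into a single result.
result = {user: actions for user, actions in groups.items()}

print(result["jsmith"])  # → ['login', 'click', 'purchase']
```

On a real cluster, the map and reduce steps each run in parallel across nodes; the shuffle is the only point where data moves between them.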
How does it work?
1. Put the data into Amazon S3 (or HDFS).
2. Launch your Amazon EMR cluster. Choose the Hadoop distribution, how many nodes, the node type (hi-CPU, hi-memory, etc.), and Hadoop apps (Hive, Pig, HBase).
3. Get the results.
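The launch step can also be scripted. A hedged sketch of the request shape used by boto3's EMR client (`run_job_flow`); the cluster name, S3 bucket, instance type, and node count below are illustrative placeholders, not values from the talk:

```python
# Sketch: building an EMR launch request. Paths and sizes are
# hypothetical; adjust to your own account and workload.
def build_cluster_request(name, num_core_nodes):
    """Return a run_job_flow-style request for a small Hadoop cluster."""
    return {
        "Name": name,
        "LogUri": "s3://my-bucket/emr-logs/",      # hypothetical bucket
        "Instances": {
            "MasterInstanceType": "m1.large",
            "SlaveInstanceType": "m1.large",
            "InstanceCount": 1 + num_core_nodes,   # master + core nodes
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when done
        },
    }

request = build_cluster_request("click-log-analysis", num_core_nodes=4)

# With AWS credentials configured, the actual call would be:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
print(request["Instances"]["InstanceCount"])  # → 5
```

Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what makes the "stop paying when the work is done" pattern automatic.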
Along the way:
• You can easily resize the cluster.
• Use Spot nodes to save time and money.
• Launch parallel clusters against the same data source in S3, each tuned for its workload.
• When the work is complete, you can terminate the cluster (and stop paying).
• You can store everything in HDFS (local disk); High Storage nodes provide 48 TB per node.
• Launch in a Virtual Private Cloud for extra security.
How does it work inside the cluster?
• Start an Amazon EMR cluster using the AWS Management Console or the AWS Command Line Interface tools.
• A master instance group is created that controls the cluster (runs MySQL).
• A core instance group is created for the life of the cluster; core instances run the DataNode and TaskTracker daemons and host HDFS.
• Optional task instances can be added or removed to perform work.
• Amazon S3 can be used as the underlying file system for input/output data.
• The master node coordinates distribution of work and manages cluster state.
• Core and task instances read and write to Amazon S3.
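The master/core/task layout above maps directly onto the instance-group parameters of an EMR launch request. A hedged sketch; the instance types, counts, and Spot bid price are invented for illustration:

```python
# Sketch of the three EMR instance groups described above.
# Types, counts, and the Spot price are illustrative placeholders.
instance_groups = [
    {   # Master: controls the cluster
        "InstanceRole": "MASTER",
        "InstanceType": "m1.large",
        "InstanceCount": 1,
        "Market": "ON_DEMAND",
    },
    {   # Core: lives for the life of the cluster, hosts HDFS
        "InstanceRole": "CORE",
        "InstanceType": "m1.large",
        "InstanceCount": 4,
        "Market": "ON_DEMAND",
    },
    {   # Task: optional workers, added/removed as needed;
        # Spot pricing keeps cost down, and losing one is safe
        "InstanceRole": "TASK",
        "InstanceType": "m1.large",
        "InstanceCount": 8,
        "Market": "SPOT",
        "BidPrice": "0.10",
    },
]

# Only the core group carries HDFS; task nodes hold no durable state.
hdfs_nodes = sum(g["InstanceCount"] for g in instance_groups
                 if g["InstanceRole"] == "CORE")
print(hdfs_nodes)  # → 4
```

Putting the task group on Spot is the usual cost lever: because task nodes hold no HDFS data, reclaimed Spot capacity slows the job rather than corrupting it.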
Within each EC2 instance, map tasks read the input files and emit intermediate data; the shuffle phase redistributes that data across instances; reduce tasks then write the output files to HDFS.
The Amazon EMR ecosystem builds up in layers:
• Data management: HDFS, Amazon S3, Amazon DynamoDB
• Analytics languages: Pig
• Connected services: Amazon RDS, Amazon Redshift, AWS Data Pipeline
Amazon’s Public Data Sets
1000 Genomes Project
• 270+ TB and growing dataset, hosted for free in the AWS cloud
• Researchers no longer need massive on-premises storage and compute
• Collaboration revolution: not just shared data but “executable papers”
Using EMR for your Mission
• Lack of constraints leads to new usage models
• Gives control back to individual development teams
• Fail-fast (and fail-cheap) opens up an exploratory style
• Many customers create 100s of Amazon EMR clusters per day
• Classic bursty workload, perfect for the cloud
• Big data / HPC clusters are themselves parallelized resources
• Can you build a faster on-premises cluster? Yes, but…
• On premises it is usually a shared, contended resource; in the cloud, each user or workgroup gets their own cluster
• Cloud is often the fastest platform based on “MTTJC” (Mean Time To Job Completion)
Cloud Democratizes Big Data/HPC
• No-obligation use allows for experimentation, prototypes and
operational/business pilots
• Faster time from inception of idea to solution
• Provides a platform that can scale to meet the massive needs of large
data sets
• Bottom line:
• Enables experimentation and innovation without large capital investments
• Improves ROI for Big Data projects
• http://aws.amazon.com/hpc-applications/
Cloud and Big Data/HPC
Try the tutorial: aws.amazon.com/articles/2855
Find out more: aws.amazon.com/big-data
Thank you!
Ben Butler, Sr. Manager, AWS Marketing