2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler | Sr. Manager, Big Data [email protected] | @bensbutler


Page 2: Overview

1. What is big data?
2. What is AWS Elastic MapReduce?
3. What data is available?
4. How to use AWS EMR to support my agency’s mission?

Page 3: What is Big Data?

Page 4: So what is it?

When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share them

Compute Storage Big Data

Page 5: Challenges start at relatively small volumes

Scale shown: 100 GB to 1,000 PB

Page 6: Unconstrained data growth

Scale shown: GB, TB, PB, EB, ZB

• 95% of the 1.2 zettabytes of data in the digital universe is unstructured
• 70% of this is user-generated content
• Unstructured data growth is explosive, with estimates of compound annual growth rate (CAGR) at 62% from 2008 to 2012

Source: IDC
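A rough worked example of what the 62% CAGR figure implies. Treating 2008 to 2012 as four compounding years is an assumption, and `growth_factor` is an illustrative name, not something taken from the IDC source.

```python
def growth_factor(cagr: float, years: int) -> float:
    """Total multiplier for a quantity compounding at `cagr` per year."""
    return (1.0 + cagr) ** years


# Assumption: 2008 -> 2012 counts as four compounding periods.
factor = growth_factor(0.62, 4)
print(f"Unstructured data grows about {factor:.1f}x over the period")
```

Under that assumption the volume of unstructured data grows nearly sevenfold over the period.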

Page 7: Where does it come from?

• Web sites: Blogs/Reviews/Emails/Pictures
• Social Graphs: Facebook, LinkedIn, Contacts
• Application server logs: Web sites, games
• Sensor data: Weather, water, smart grids
• Images/videos: Traffic, security cameras
• Twitter: 50m tweets/day, 1,400% growth/year

Page 8: Why AWS and big data?

Innovation

Diagram labels: Amazon S3, Amazon DynamoDB, Amazon Redshift, Spot, HPC, EMR, Compute, Storage

Page 9: How do you get your slice of it?

• AWS Direct Connect: Dedicated low latency bandwidth
• Queuing: Highly scalable event buffering
• Amazon Storage Gateway: Sync local storage to the cloud
• AWS Import/Export: Physical media shipping

Page 10: Where do you put your slice of it?

• AWS Relational Database Service: Fully managed database (MySQL, Oracle, MSSQL)
• AWS DynamoDB: NoSQL, schema-less, provisioned throughput database
• Amazon S3: Object datastore up to 5TB per object, 99.999999999% durability
• AWS SimpleDB: NoSQL, schema-less, smaller datasets

Page 11: Where do you put your slice of it?

• Amazon Glacier: Long term cold storage, from $0.01 per GB/month, 99.999999999% durability

Page 12: How quick do you need to read it?

Scale, Price, Performance

• AWS DynamoDB (single digit ms): Social scale applications, provisioned throughput performance, flexible consistency models
• AWS S3 (10s-100s ms): Any object, any app, 99.999999999% durability, objects up to 5TB in size
• AWS Glacier (<5 hours): Media & asset archives, extremely low cost, S3 levels of durability
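The read-latency comparison on this slide can be turned into a small selection rule. This is only a sketch: the millisecond ceilings are a reading of "single digit ms", "10s-100s ms" and "<5 hours", `pick_store` is a hypothetical helper, and preferring the slowest tier that still meets the requirement assumes slower tiers cost less, which the slide states outright only for Glacier.

```python
# (service, assumed worst-case read latency in ms), slowest tier first.
TIERS = [
    ("AWS Glacier", 5 * 60 * 60 * 1000),  # "<5 hours"
    ("AWS S3", 1000),                     # "10s-100s ms", read as under 1 s
    ("AWS DynamoDB", 10),                 # "single digit ms", read as under 10 ms
]


def pick_store(tolerable_latency_ms: float) -> str:
    """Return the slowest tier whose assumed worst case fits the tolerance."""
    for name, worst_case_ms in TIERS:
        if worst_case_ms <= tolerable_latency_ms:
            return name
    # Nothing listed is fast enough; fall back to the fastest tier on the slide.
    return TIERS[-1][0]


for need_ms in (5, 2_000, 24 * 60 * 60 * 1000):
    print(need_ms, "ms ->", pick_store(need_ms))
```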

Page 13: Operate at any scale

Scale, Price, Performance

Unlimited data

Page 14: Data has gravity

Diagram labels: Data, App, App

Page 15: …and inertia at volume…

Page 16: …easier to move applications to the data

Source: http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/

Page 17: Bring compute capacity to the data

Very large dataset seeks strong & consistent compute for short term relationship, possibly longer

Page 18: Flexible compute resources, on demand

Amazon Elastic Compute Cloud (EC2): basic unit of compute capacity

• Range of CPU, memory & local disk options
• 17 instance types available, from micro through cluster compute to SSD backed
• Vertical scaling
• From $0.02/hr

Feature            Details
Flexible           Run Windows or Linux distributions
Scalable           Wide range of instance types from micro to cluster compute
Machine Images     Configurations can be saved as machine images (AMIs) from which new instances can be created
Full control       Full root or administrator rights
VM Import/Export   Import and export VM images to transfer configurations in and out of EC2
Monitoring         Publishes metrics to CloudWatch
Inexpensive        On-demand, Reserved and Spot instance types
Secure             Full firewall control via Security Groups

Page 19: Elastic capacity as you need it

Four demand patterns: On and Off, Fast Growth, Variable peaks, Predictable peaks

Page 20: Elastic capacity as you need it

With fixed capacity, each of these patterns produces WASTE (capacity above demand) or CUSTOMER DISSATISFACTION (demand above capacity)

Page 21: Elastic capacity as you need it

Chart: Capacity over Time, comparing Traditional IT capacity, Elastic cloud capacity and Your IT needs

Page 22: Elastic capacity as you need it

With elastic capacity, the same four demand patterns are followed as they change
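The waste and dissatisfaction argument can be put into numbers. The sketch below is illustrative only: the demand series and the fixed capacity of 8 units are made up, and `fixed_capacity_report` is a hypothetical helper rather than anything from the deck.

```python
def fixed_capacity_report(demand, capacity):
    """Return (idle capacity, unmet demand) summed over all time periods."""
    waste = sum(max(capacity - d, 0) for d in demand)      # paid for, unused
    shortfall = sum(max(d - capacity, 0) for d in demand)  # requests not served
    return waste, shortfall


# Hypothetical "variable peaks" workload, in arbitrary capacity units per period.
demand = [2, 3, 10, 4, 2, 12, 3]

waste, shortfall = fixed_capacity_report(demand, capacity=8)
print("fixed capacity of 8: idle units =", waste, ", unmet units =", shortfall)

# Idealised elastic capacity tracks demand exactly, so both totals drop to zero.
elastic = [fixed_capacity_report([d], capacity=d) for d in demand]
print("elastic:", sum(w for w, _ in elastic), sum(s for _, s in elastic))
```

Real elastic capacity lags demand somewhat, so the zero totals are an idealisation.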

Page 23: From one instance…

Page 24: …to thousands


Page 27: AWS EMR – Elastic MapReduce

Page 28: Amazon Elastic MapReduce

• A key tool in the toolbox to help with ‘Big Data’ challenges
• Makes possible analytics processes previously not feasible
• Cost effective when leveraged with EC2 spot market
• Broad ecosystem of tools to handle specific use cases

Page 29: What is EMR?

• Map-Reduce engine
• Integrated with tools
• Hadoop-as-a-service
• Massively parallel
• Cost effective AWS wrapper
• Integrated with AWS services

Pages 30-35: Very large click log (e.g. TBs)

The diagram is built up one step per page:

1. Start with a very large click log (e.g. TBs)
2. The log contains lots of actions by John Smith
3. Split the log into many small pieces
4. Process in an EMR cluster
5. Aggregate the results from all the nodes
6. Result: what John Smith did

Page 36: Insight in a fraction of the time

Very large click log (e.g. TBs) in, what John Smith did out
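The click-log walkthrough (split the log, process the pieces in parallel, aggregate the results) can be sketched on one machine. Everything concrete here is an assumption for illustration: the tab-separated `user<TAB>action` line format, the sample `LOG`, and the use of a thread pool as a stand-in for the nodes of an EMR cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Hypothetical click log: one "user<TAB>action" record per line.
LOG = [
    "John Smith\tview",
    "Jane Doe\tview",
    "John Smith\tclick",
    "Jane Doe\tpurchase",
    "John Smith\tview",
]


def split_log(lines, piece_size):
    """Split the log into many small pieces."""
    return [lines[i:i + piece_size] for i in range(0, len(lines), piece_size)]


def process_piece(user, piece):
    """Count the actions of one user within a single piece of the log."""
    counts = Counter()
    for line in piece:
        who, action = line.split("\t")
        if who == user:
            counts[action] += 1
    return counts


def what_user_did(lines, user, piece_size=2):
    """Process the pieces in parallel, then aggregate the partial results."""
    pieces = split_log(lines, piece_size)
    total = Counter()
    with ThreadPoolExecutor() as pool:  # threads stand in for cluster nodes
        for partial_counts in pool.map(partial(process_piece, user), pieces):
            total.update(partial_counts)
    return total


print(dict(what_user_did(LOG, "John Smith")))
```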

Page 37: How does it work?

1. Put the data into S3 (or HDFS)
2. Launch your cluster. Choose:
   • Hadoop distribution
   • How many nodes
   • Node type (hi-CPU, hi-memory, etc.)
   • Hadoop apps (Hive, Pig, HBase)
3. Get the results

Diagram labels: Amazon S3, Amazon EMR Cluster

Page 38: How does it work?

You can easily resize the cluster

Page 39: How does it work?

Use Spot nodes to save time and money

Page 40: How does it work?

Launch parallel clusters against the same data source (tune for the workload)

Page 41: How does it work?

When the work is complete, you can terminate the cluster (and stop paying)

Page 42: How does it work?

You can store everything in HDFS (local disk)

High Storage nodes = 48 TB/node
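A back-of-the-envelope sizing sketch for the 48 TB/node figure. It assumes a replication factor of 3, which is the stock HDFS default; a given cluster may be configured differently, and `usable_hdfs_tb` is a hypothetical helper that ignores space reserved for the operating system and temporary data.

```python
def usable_hdfs_tb(nodes: int, raw_tb_per_node: float = 48.0, replication: int = 3) -> float:
    """Approximate unique data that fits when every block is stored `replication` times."""
    return nodes * raw_tb_per_node / replication


# Assumption: ten High Storage nodes with the default replication factor.
print(usable_hdfs_tb(10), "TB of unique data")
```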

Page 43: How does it work?

Launch in a Virtual Private Cloud for extra security

Pages 44-51: Amazon EMR cluster

The cluster diagram is built up one step per page:

1. Start an Amazon EMR cluster using AWS Management Console or AWS Command Line Interface tools
2. Master instance group created that controls the cluster (runs MySQL)
3. Core instance group created for life of cluster
4. Core instances run DataNode and TaskTracker daemons
5. Optional task instances can be added or subtracted to perform work
6. Amazon S3 can be used as underlying file system for input/output data
7. Master node coordinates distribution of work and manages cluster state
8. Core and Task instances read-write to Amazon S3

Diagram labels: Master instance group, Core instance group (HDFS), Task instance group, Amazon S3
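The master, core and task groups described here map onto a cluster launch request. The sketch below only builds a request dictionary and sends nothing to AWS. The field names are assumed to follow the request shape of the boto3 EMR client's `run_job_flow` call and should be checked against current documentation; the bucket, instance type, release label placeholder and role names are illustrative assumptions. Putting the optional task group on Spot is one possible choice, not something the slides prescribe.

```python
def build_cluster_request(name, log_uri, core_count, task_count=0,
                          instance_type="m5.xlarge",     # placeholder type
                          release_label="emr-x.y.z",     # placeholder, fill in
                          apps=("Hive", "Pig")):
    """Build a request dict with one master, N core and optional task instances."""
    groups = [
        # The master instance group controls the cluster.
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core instances hold HDFS and live for the life of the cluster.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task instances hold no HDFS data, so they can be added or removed.
        groups.append({"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
                       "InstanceType": instance_type, "InstanceCount": task_count})
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": app} for app in apps],
        "Instances": {
            "InstanceGroups": groups,
            # Terminate when the work is complete (and stop paying).
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Assumed role names; they must already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("click-log-analysis",
                                "s3://example-bucket/logs/",  # hypothetical bucket
                                core_count=4, task_count=8)
print([g["InstanceRole"] for g in request["Instances"]["InstanceGroups"]])
```

Passing the dictionary as keyword arguments to an EMR client would be the assumed next step; that call is deliberately left out here.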

Page 52

Diagram: on one Amazon EC2 instance, Input file -> map -> Shuffle -> reduce -> Output file

Page 53

Diagram: the same pipeline on three Amazon EC2 instances, each with its own Input file, map, reduce and Output file, and a single Shuffle stage between map and reduce
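The map, Shuffle and reduce stages in these diagrams can be sketched in a few lines. This is a single-process illustration using a word count as the job; the function names and the in-memory "input files" are assumptions, and a real Hadoop job would distribute each stage across instances.

```python
from collections import defaultdict


def map_phase(line):
    """map: emit a (key, value) pair for every word in one input line."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, regardless of which mapper emitted them."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(input_files):
    """Run map over every input file, shuffle once, then reduce per key."""
    mapped = [pair
              for lines in input_files      # one "file" per mapper
              for line in lines
              for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Three hypothetical input files, as on the three-instance slide.
print(run_job([["a b a"], ["b c"], ["c c"]]))
```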

Pages 54-59: Amazon EMR

The ecosystem diagram is built up over six pages. Labels on the final page:

• Amazon EMR
• HDFS
• Pig
• Analytics languages
• Data management
• Amazon S3
• Amazon DynamoDB
• Amazon RDS
• Amazon Redshift
• AWS Data Pipeline

Page 60: Amazon’s Public Data Sets

Page 61: Amazon Public Data Sets

Page 62: 1000 Genomes Project

• 270+ TB and growing dataset, hosted for free in AWS cloud
• Researchers no longer need massive on-premises storage and compute
• Collaboration revolution: not just shared data but “executable papers”

Page 63: Using EMR for your Mission

Page 64: Cloud Democratizes Big Data/HPC

• Lack of constraints leads to new usage models
• Gives control back to individual development teams
• Fail-fast (and fail-cheap) opens up exploratory style
• Many customers create 100s of Amazon EMR clusters per day
• Classic burst-y workload perfect for the cloud
• Big data / HPC clusters themselves are parallelized resources
• Can you build a faster on-premises cluster? Yes, but…
• Usually a shared/contended resource; in cloud, each user/workgroup gets their own cluster
• Cloud is often the fastest platform based on “MTTJC” (Mean Time To Job Completion)
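The MTTJC point can be illustrated with a small calculation. The deck only expands the acronym; splitting completion time into queue wait plus run time is an assumption made here, and all of the hour figures are invented for illustration, not measurements.

```python
from statistics import mean


def mttjc(jobs):
    """Mean Time To Job Completion over (queue_wait_hours, run_hours) pairs."""
    return mean(wait + run for wait, run in jobs)


# Hypothetical shared on-premises cluster: fast hardware, long queue.
shared_cluster = [(6.0, 1.0), (4.0, 1.0), (8.0, 1.0)]

# Hypothetical per-team cloud clusters: slower runs, short start-up wait.
cloud_clusters = [(0.25, 1.5), (0.25, 1.5), (0.25, 1.5)]

print("shared on-premises MTTJC:", mttjc(shared_cluster), "hours")
print("per-team cloud MTTJC:", mttjc(cloud_clusters), "hours")
```

With these made-up numbers the faster on-premises cluster still finishes jobs later on average, because queueing for a contended resource dominates.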

Page 65: Cloud and Big Data/HPC

• No-obligation use allows for experimentation, prototypes and operational/business pilots
• Faster time from inception of idea to solution
• Provides a platform that can scale to meet the massive needs of large data sets
• Bottom line:
  • Enables experimentation and innovation without large capital investments
  • Improves ROI for Big Data projects
• http://aws.amazon.com/hpc-applications/

Page 66

Try the tutorial: aws.amazon.com/articles/2855

Find out more: aws.amazon.com/big-data

Page 67: Thank you!

Ben Butler, Sr. Manager, AWS Marketing