Big Data on AWS
Services Overview
Bernie Nallamotu | Principal Solutions Architect
So what is it?
When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share them.
Compute Storage Big Data
Challenges start at relatively small volumes
[Figure: data volume scale running from 100 GB through TB to 1,000 PB]
Unconstrained data growth
95% of the 1.2 zettabytes of data in the digital universe is unstructured
70% of this is user-generated content
Unstructured data growth is explosive, with an estimated compound annual growth rate (CAGR) of 62% from 2008 to 2012.
Source: IDC
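To make the 62% CAGR figure concrete, the short sketch below computes the total growth multiple it implies. It assumes that 2008 to 2012 counts as four compounding periods; the function name `growth_multiple` is purely illustrative and does not come from the IDC source.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Total growth factor after `years` of compounding at rate `cagr`."""
    return (1.0 + cagr) ** years


# Assumption: 2008 -> 2012 is treated as four compounding periods.
multiple = growth_multiple(0.62, 4)
print(f"Growth over 4 years at 62% CAGR: about {multiple:.1f}x")
```

Under that assumption, the volume of unstructured data grows to roughly 6.9 times its starting size over the period.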
Where does it come from?
Web sites: blogs, reviews, emails, pictures
Social graphs: Facebook, LinkedIn, contacts
Application server logs: web sites, games
Sensor data: weather, water, smart grids
Images/videos: traffic, security cameras
Twitter: 50m tweets/day, 1,400% growth/year
Why AWS and big data?
[Diagram: innovation built on compute and storage, showing Amazon S3, Amazon DynamoDB, Amazon Redshift, Spot, HPC and EMR]
AWS Worldwide Public Sector Team
Big Data Services
Amazon EMR (Elastic MapReduce): hosted Hadoop framework
AWS Data Pipeline: move data among AWS services and on-premises data sources
Amazon Redshift: petabyte-scale data warehouse service
How do you get your slice of it?
AWS Direct Connect: dedicated low-latency bandwidth
Queuing: highly scalable event buffering
AWS Storage Gateway: sync local storage to the cloud
AWS Import/Export: physical media shipping
Where do you put your slice of it?
Amazon Relational Database Service (RDS): fully managed database (MySQL, Oracle, MS SQL Server, PostgreSQL)
Amazon DynamoDB: NoSQL, schema-less, provisioned-throughput database
Amazon S3: object datastore, up to 5 TB per object, 99.999999999% durability
Amazon SimpleDB: NoSQL, schema-less, for smaller datasets
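The 99.999999999% durability figure quoted for Amazon S3 is easier to grasp as an expected loss rate. The sketch below assumes the figure is read as the probability that a single object survives one year, and uses a hypothetical store of 10,000,000 objects. Exact fractions are used so that floating-point rounding does not distort such a small probability.

```python
from fractions import Fraction

# 99.999999999% expressed exactly (eleven nines).
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year, on average."""
    return object_count * (1 - DURABILITY)


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


objects = 10_000_000  # hypothetical number of stored objects
print(float(expected_annual_losses(objects)))  # 0.0001
print(int(years_per_lost_object(objects)))     # 10000
```

Read this way, a store of ten million objects would on average lose one object every 10,000 years.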
Amazon Glacier: long-term cold storage, from $0.01 per GB/month, 99.999999999% durability
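As a rough illustration of the "from $0.01 per GB/month" Glacier price, the sketch below estimates a monthly storage bill. It assumes binary terabytes (1 TB = 1,024 GB) and counts storage only, ignoring any retrieval or request charges; the archive size is hypothetical.

```python
from decimal import Decimal

PRICE_PER_GB_MONTH = Decimal("0.01")  # starting price quoted on the slide
GB_PER_TB = 1024                      # assumption: binary terabytes


def monthly_storage_cost(terabytes: int) -> Decimal:
    """Storage-only monthly cost in dollars for the given archive size."""
    return PRICE_PER_GB_MONTH * GB_PER_TB * terabytes


cost = monthly_storage_cost(100)  # hypothetical 100 TB archive
print(f"100 TB archived: ${cost:,.2f} per month")  # $1,024.00 per month
```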
How quick do you need to read it?
Trade-offs: scale, price, performance
Amazon DynamoDB (single-digit ms): social-scale applications, provisioned throughput performance, flexible consistency models
Amazon S3 (10s-100s ms): any object, any app, 99.999999999% durability, objects up to 5 TB in size
Amazon Glacier (<5 hours): media & asset archives, extremely low cost, S3 levels of durability
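The latency comparison above can be read as a simple selection rule: choose the slowest tier that still meets the read deadline. The sketch below encodes that idea. The upper bounds are assumptions derived from the slide's ranges (10 ms for "single digit ms", 1 second for "10s-100s ms"), and it further assumes that slower tiers cost less per GB, which the slide states only for Glacier.

```python
from datetime import timedelta

# (service, assumed slowest typical read time), ordered fastest first.
TIERS = [
    ("Amazon DynamoDB", timedelta(milliseconds=10)),  # "single digit ms"
    ("Amazon S3", timedelta(milliseconds=1000)),      # "10s-100s ms"
    ("Amazon Glacier", timedelta(hours=5)),           # "<5 hours"
]


def slowest_tier_meeting(deadline: timedelta) -> str:
    """Return the slowest tier whose assumed read time fits the deadline."""
    candidates = [name for name, worst_case in TIERS if worst_case <= deadline]
    if not candidates:
        raise ValueError("no tier listed is fast enough for this deadline")
    return candidates[-1]


print(slowest_tier_meeting(timedelta(milliseconds=50)))  # Amazon DynamoDB
print(slowest_tier_meeting(timedelta(seconds=2)))        # Amazon S3
print(slowest_tier_meeting(timedelta(hours=6)))          # Amazon Glacier
```

A real decision would also weigh access patterns, object size and request pricing, none of which this sketch models.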
Operate at any scale
Unlimited data
Data has gravity
…and inertia at volume…
…easier to move applications to the data
Source: http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/
Bring compute capacity to the data
Very large dataset seeks strong & consistent compute for short-term relationship, possibly longer
Flexible compute resources, on demand
Amazon Elastic Compute Cloud (EC2): basic unit of compute capacity
Range of CPU, memory & local disk options
27 instance types available, from micro through cluster compute to SSD-backed
Vertical scaling, from $0.02/hr
Feature: Details
Flexible: run Windows or Linux distributions
Scalable: wide range of instance types from micro to cluster compute
Machine images: configurations can be saved as machine images (AMIs) from which new instances can be created
Full control: full root or administrator rights
VM Import/Export: import and export VM images to transfer configurations in and out of EC2
Monitoring: publishes metrics to Amazon CloudWatch
Inexpensive: On-Demand, Reserved and Spot instances
Secure: full firewall control via Security Groups
Elastic capacity as you need it
Usage patterns: on and off, fast growth, variable peaks, predictable peaks
With fixed capacity, over-provisioning means WASTE and under-provisioning means CUSTOMER DISSATISFACTION
[Chart: capacity over time, comparing traditional IT capacity, elastic cloud capacity and your IT needs]
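The trade-off between wasted capacity and unmet demand can be sketched numerically. The example below uses a hypothetical hourly demand series with one variable peak and compares three provisioning strategies: fixed capacity sized for the peak, fixed capacity sized below the peak, and elastic capacity that tracks demand exactly. All numbers and names are illustrative, and perfectly elastic capacity is an idealisation.

```python
from typing import Sequence, Tuple


def capacity_outcome(demand: Sequence[int], capacity: Sequence[int]) -> Tuple[int, int]:
    """Return (wasted, unmet) capacity units summed over all hours."""
    wasted = sum(max(c - d, 0) for d, c in zip(demand, capacity))
    unmet = sum(max(d - c, 0) for d, c in zip(demand, capacity))
    return wasted, unmet


# Hypothetical demand per hour, with a single variable peak.
demand = [10, 12, 40, 90, 35, 11]

sized_for_peak = [90] * len(demand)  # fixed capacity, no unmet demand
sized_low = [30] * len(demand)       # fixed capacity, cheaper but too small
elastic = list(demand)               # idealised: capacity follows demand

print(capacity_outcome(demand, sized_for_peak))  # (342, 0)  -> waste
print(capacity_outcome(demand, sized_low))       # (57, 75)  -> waste and unmet demand
print(capacity_outcome(demand, elastic))         # (0, 0)
```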
From one instance…
…to thousands
Amazon EMR (Elastic MapReduce)
• A key tool in the toolbox to help with 'Big Data' challenges
• Makes possible analytics processes previously not feasible
• Cost effective when leveraged with the EC2 Spot market
• Broad ecosystem of tools to handle specific use cases
What is EMR?
Map-Reduce engine, integrated with tools
Hadoop-as-a-service
Massively parallel
Cost-effective AWS wrapper
Integrated with AWS services
HDFS: reliable storage
MapReduce: data analysis
[Diagram: four EC2 instances, each taking an input file through map and reduce to an output file]
Map? Reduce?
Input:
Person   Start     End
Bob      00:44:48  00:45:11
Charlie  02:16:02  02:16:18
Charlie  11:16:59  11:17:17
Charlie  11:17:24  11:17:38
Bob      11:23:10  11:23:25
Alice    16:26:46  16:26:54
David    17:20:28  17:20:45
Alice    18:16:53  18:17:00
Charlie  19:33:44  19:33:59
Bob      21:13:32  21:13:43
David    22:36:22  22:36:34
Alice    23:42:01  23:42:11
After map (End minus Start, in seconds):
Person   Duration
Bob      23
Charlie  16
Charlie  18
Charlie  14
Bob      15
Alice    8
David    17
Alice    7
Charlie  15
Bob      11
David    12
Alice    10
After reduce:
Person   Total
Alice    25
Bob      49
Charlie  63
David    29
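The map and reduce steps in this example can be reproduced in a few lines of plain Python. This is a single-process sketch of the idea only, not Hadoop or Amazon EMR code: on a cluster, the map and reduce work would be spread across many EC2 instances. The function names `map_record` and `reduce_durations` are illustrative.

```python
from collections import defaultdict
from datetime import datetime

# Input records from the slide: (person, start, end).
RECORDS = [
    ("Bob", "00:44:48", "00:45:11"), ("Charlie", "02:16:02", "02:16:18"),
    ("Charlie", "11:16:59", "11:17:17"), ("Charlie", "11:17:24", "11:17:38"),
    ("Bob", "11:23:10", "11:23:25"), ("Alice", "16:26:46", "16:26:54"),
    ("David", "17:20:28", "17:20:45"), ("Alice", "18:16:53", "18:17:00"),
    ("Charlie", "19:33:44", "19:33:59"), ("Bob", "21:13:32", "21:13:43"),
    ("David", "22:36:22", "22:36:34"), ("Alice", "23:42:01", "23:42:11"),
]
TIME_FORMAT = "%H:%M:%S"


def map_record(record):
    """Map step: turn (person, start, end) into (person, duration in seconds)."""
    person, start, end = record
    elapsed = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return person, int(elapsed.total_seconds())


def reduce_durations(pairs):
    """Reduce step: sum the durations for each person."""
    totals = defaultdict(int)
    for person, duration in pairs:
        totals[person] += duration
    return dict(totals)


durations = [map_record(record) for record in RECORDS]
totals = reduce_durations(durations)
for person in sorted(totals):
    print(person, totals[person])
# Alice 25
# Bob 49
# Charlie 63
# David 29
```

Because each record is mapped independently and each person's total depends only on that person's pairs, both steps can be split across machines, which is what makes the pattern suitable for very large inputs.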
AWS Elastic MapReduce Architecture
[Diagram, built up in stages. Components: Amazon EMR, HDFS, Pig, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Data Pipeline. Labels: data management, analytics languages]
Useful Resources & Links
• AWS Big Data: http://aws.amazon.com/big-data
• AWS HPC: http://aws.amazon.com/hpc-applications
• Architecture Center: http://aws.amazon.com/architecture
• Documentation: http://aws.amazon.com/documentation
• Security Center: http://aws.amazon.com/security
• Whitepapers: http://aws.amazon.com/whitepapers
• Resources: http://aws.amazon.com/resources
• Case Studies: http://aws.amazon.com/solutions/case-studies
• Solution Providers: http://aws.amazon.com/solutions/global-solution-providers
• Calculator: http://calculator.s3.amazonaws.com/calc5.html
• TCO Calculator: http://aws.amazon.com/tco-calculator
• AWS Blog: http://aws.typepad.com
• The Power of 60: http://www.powerof60.com