amazon-web-services
The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, and durability. Finally, we provide a reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
• What technologies should I use?
– Why?
– How?
• Reference architecture
• Design patterns
Volume
Velocity
Variety
Glacier, S3, DynamoDB, RDS, EMR, Redshift, Data Pipeline, Kinesis, Cassandra, CloudSearch, Kinesis-enabled app
What Tools Should I Use?
Stages: Ingest, Store, Process, Visualize
Tools shown: Glacier, S3, DynamoDB, RDS, Kinesis, Spark Streaming, EMR, Data Pipeline, Storm, Kafka, Redshift, Cassandra, CloudSearch, Kinesis Connector, Kinesis-enabled app
Ingest
Database
Cloud Storage
Stream Storage
Amazon Kinesis or Kafka
[Figure: a stream of color-coded records, numbered 1 to 4 per color, is split across Shard or Partition 1 and Shard or Partition 2; each shard keeps its records in order.]
Consumer 1: Count of Red = 4, Count of Violet = 4
Consumer 2: Count of Blue = 4, Count of Green = 4
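The shard-and-consumer picture above can be sketched in a few lines of Python. This is a minimal illustration and not the Kinesis or Kafka client API: the function names, the two-shard setup, and the use of the color as partition key are assumptions made for the example. Kinesis does map partition keys to shards through an MD5 hash of the key; taking that hash modulo the shard count is a simplification of its hash-key ranges.

```python
import hashlib
from collections import Counter, defaultdict

NUM_SHARDS = 2  # assumption: two shards/partitions, as in the figure
COLORS = ("red", "violet", "blue", "green")


def shard_for(partition_key, num_shards=NUM_SHARDS):
    """Map a partition key to a shard index (simplified: MD5 modulo shard count)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def produce(records, num_shards=NUM_SHARDS):
    """Append each (key, sequence) record to its shard, preserving arrival order."""
    shards = defaultdict(list)
    for key, seq in records:
        shards[shard_for(key, num_shards)].append((key, seq))
    return shards


def consume(shard_records):
    """One consumer per shard: count the records seen for each key."""
    return Counter(key for key, _ in shard_records)


# Four records per color, interleaved the way a producer might emit them.
records = [(color, seq) for seq in range(1, 5) for color in COLORS]
shards = produce(records)
counts_per_shard = {shard_id: consume(recs) for shard_id, recs in shards.items()}
```

Because all records with the same key land on the same shard, a single consumer sees every record for that key, in order, and can count it without coordinating with other consumers.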
Cloud Database & Storage
Client Tier
App/Web Tier
Data Tier (Database & Storage Tier)
Database & Storage Tier
Search: Amazon CloudSearch
Hadoop/HDFS: HDFS on Amazon EMR
Cache: Amazon ElastiCache
Blob Store: Amazon S3, Amazon Glacier
SQL: Amazon RDS
NoSQL: Amazon DynamoDB
Structured – Simple Query
NoSQL
Amazon DynamoDB
Cache
Amazon ElastiCache
Structured – Complex Query
SQL
Amazon RDS
Search
Amazon CloudSearch
Unstructured – No Query
Cloud Storage
Amazon S3
Amazon Glacier
Unstructured – Custom Query
Hadoop/HDFS
Amazon Elastic MapReduce
[Figure axes: Data Structure Complexity and Query Structure Complexity]
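The mapping from data structure and query pattern to a storage service can be written down as a small lookup table. The sketch below only encodes the pairs listed on the slide; the dictionary name, the function name, and the lowercase labels ("simple", "complex", "none", "custom") are hypothetical choices for illustration.

```python
from typing import List

# (data structure, query pattern) -> candidate services, as listed on the slide.
STORES_BY_PROFILE = {
    ("structured", "simple"): ["Amazon DynamoDB", "Amazon ElastiCache"],
    ("structured", "complex"): ["Amazon RDS", "Amazon CloudSearch"],
    ("unstructured", "none"): ["Amazon S3", "Amazon Glacier"],
    ("unstructured", "custom"): ["Amazon Elastic MapReduce"],
}


def candidate_stores(structure: str, query: str) -> List[str]:
    """Return the services the slide associates with this structure/query pair."""
    key = (structure.lower(), query.lower())
    if key not in STORES_BY_PROFILE:
        raise ValueError("no mapping for structure=%r, query=%r" % (structure, query))
    return list(STORES_BY_PROFILE[key])
```

For example, `candidate_stores("structured", "complex")` returns Amazon RDS and Amazon CloudSearch. A lookup like this is only a first filter; the comparison criteria that follow (latency, volume, item size, request rate, cost, durability) narrow the choice further.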
Attribute     Hot        Warm      Cold
Volume        MB–GB      GB–TB     PB
Item size     B–KB       KB–MB     KB–TB
Latency       ms         ms, sec   min, hrs
Durability    Low–High   High      Very High
Request rate  Very High  High      Low
Cost/GB       $$–$       $–¢¢      ¢
[Figure: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, and Amazon Glacier positioned along axes for request rate (high to low), cost/GB (high to low), latency (low to high), data volume (low to high), and structure (low to high).]
Service             Average latency       Data volume        Item size         Request rate              Storage $/GB/month  Durability
Amazon ElastiCache  ms                    GB                 B–KB              Very High                 $$                  Low–Moderate
Amazon DynamoDB     ms                    GB–TB (no limit)   KB (64 KB max)    Very High                 ¢¢                  Very High
Amazon RDS          ms, sec               GB–TB (3 TB max)   KB (~row size)    High                      ¢¢                  High
Amazon CloudSearch  ms, sec               GB–TB              KB (1 MB max)     High                      $                   High
Amazon EMR (HDFS)   sec, min, hrs         GB–PB (~nodes)     MB–GB             Low–Very High             ¢                   High
Amazon S3           ms, sec, min (~size)  GB–PB (no limit)   KB–GB (5 TB max)  Low–Very High (no limit)  ¢                   Very High
Amazon Glacier      hrs                   GB–PB (no limit)   GB (40 TB max)    Very Low (no limit)       ¢                   Very High
Services are ordered from hot data (top) through warm data to cold data (bottom).
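One way to apply a comparison like this is to filter services by a hard constraint first. The sketch below uses only the explicit maximum item sizes quoted in the comparison (64 KB, 1 MB, 5 TB, 40 TB); these are the slide's figures and may not match current service limits. The names `MAX_ITEM_BYTES` and `stores_that_fit` are hypothetical.

```python
from typing import List

KB = 1024
MB = 1024 * KB
TB = 1024 ** 4

# Maximum item sizes as quoted in the comparison (assumption: slide-era limits).
MAX_ITEM_BYTES = {
    "Amazon DynamoDB": 64 * KB,
    "Amazon CloudSearch": 1 * MB,
    "Amazon S3": 5 * TB,
    "Amazon Glacier": 40 * TB,
}


def stores_that_fit(item_bytes: int) -> List[str]:
    """List the services whose quoted item-size limit can hold one item."""
    return [name for name, limit in MAX_ITEM_BYTES.items() if item_bytes <= limit]
```

A 200 KB item, for instance, exceeds the quoted DynamoDB limit, so `stores_that_fit(200 * KB)` leaves Amazon CloudSearch, Amazon S3, and Amazon Glacier. Services without an explicit maximum in the comparison are left out of the sketch on purpose.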
Use Case: A Video Streaming Application
Use Case: A Video Streaming App – Upload
Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, Amazon S3
Use Case: A Video Streaming App – Discovery
Amazon ElastiCache, CloudFront, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, Amazon S3
Process
Batch Processing
• Take a large amount of cold data and ask questions
• Takes minutes or hours to get answers back
Example: Generating hourly, daily, weekly reports
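As a rough sketch of the hourly-report example, the function below scans a complete, already-stored set of events and counts views per hour and per video. The event shape `(timestamp, video_id)`, the function name, and the sample data are made up for illustration; in practice the same aggregation would run as a job over data in S3 or HDFS.

```python
from collections import Counter
from datetime import datetime


def hourly_report(events):
    """Count events per (hour, video_id) over a full, already-collected dataset."""
    report = Counter()
    for timestamp, video_id in events:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        report[(hour, video_id)] += 1
    return report


# Made-up sample data for illustration only.
events = [
    (datetime(2024, 1, 1, 9, 5), "video-a"),
    (datetime(2024, 1, 1, 9, 40), "video-a"),
    (datetime(2024, 1, 1, 9, 59), "video-b"),
    (datetime(2024, 1, 1, 10, 1), "video-a"),
]
report = hourly_report(events)
```

The defining trait of the batch style is that the question is asked after all the data has landed, so the job can take minutes or hours without affecting correctness.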
Use Case: Video Recommendations
Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon EMR
Use Case: Batch Analytics
Amazon EMR, Amazon S3, Amazon Glacier, Amazon Redshift
Stream Processing (AKA Real Time)
• Take a small amount of hot data and ask questions
• Takes a short amount of time to get your answer back
Example: 1-minute metrics
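The 1-minute metrics example can be sketched as a tumbling window that updates its count as each event arrives instead of rescanning stored data. This is a minimal, single-process illustration, not Spark Streaming, Storm, or KCL code; the class name is hypothetical, and it assumes events arrive in timestamp order.

```python
WINDOW_SECONDS = 60


class MinuteCounter:
    """Counts events per one-minute tumbling window as they arrive (in order)."""

    def __init__(self):
        self.current_window = None  # start of the open window, in epoch seconds
        self.count = 0
        self.emitted = []           # closed windows as (window_start, count)

    def on_event(self, epoch_seconds):
        window = epoch_seconds - (epoch_seconds % WINDOW_SECONDS)
        if self.current_window is None:
            self.current_window = window
        elif window != self.current_window:
            # The event belongs to a new minute: close and emit the old window.
            self.emitted.append((self.current_window, self.count))
            self.current_window = window
            self.count = 0
        self.count += 1

    def flush(self):
        """Emit the still-open window, for example at shutdown."""
        if self.current_window is not None:
            self.emitted.append((self.current_window, self.count))
            self.current_window = None
            self.count = 0


counter = MinuteCounter()
for ts in [0, 10, 59, 60, 61, 125]:
    counter.on_event(ts)
counter.flush()
```

Only the open window's count is held in memory, which is what keeps the answer latency short; handling late or out-of-order events would need extra logic that this sketch leaves out.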
https://amplab.cs.berkeley.edu/benchmark/
Attribute      Redshift    Impala         Presto         Spark          Hive
Query latency  Low         Low            Low            Low–Medium     Medium–High
Durability     High        High           High           High           High
Data volume    1.6 PB max  ~Nodes         ~Nodes         ~Nodes         ~Nodes
Managed        Yes         EMR bootstrap  EMR bootstrap  EMR bootstrap  Yes (EMR)
Storage        Native      HDFS           HDFS/S3        HDFS/S3        HDFS/S3
# of BI tools  High        Medium         High           Low            High
Attribute              Spark Streaming      Apache Storm + Trident  Kinesis Client Library
Scale/Throughput       ~Nodes               ~Nodes                  ~Nodes
Data volume            ~Nodes               ~Nodes                  ~Nodes
Manageability          Yes (EMR bootstrap)  Do it yourself          EC2 + Auto Scaling
Fault tolerance        Built-in             Built-in                KCL checkpointing
Programming languages  Java, Python, Scala  Java, Scala, Clojure    Java, Python
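The "KCL checkpointing" entry refers to a consumer recording how far it has read so that a replacement worker can resume from that point. The sketch below simulates the idea in plain Python and is not the Kinesis Client Library API: `CheckpointStore`, `process_shard`, and the `fail_at` crash switch are hypothetical, and the in-memory dictionary stands in for the DynamoDB table that the real KCL uses to persist checkpoints.

```python
class CheckpointStore:
    """Hypothetical in-memory stand-in for a durable checkpoint table."""

    def __init__(self):
        self._last = {}

    def save(self, shard_id, sequence_number):
        self._last[shard_id] = sequence_number

    def load(self, shard_id):
        return self._last.get(shard_id)  # None if never checkpointed


def process_shard(shard_id, records, store, handle, checkpoint_every=2, fail_at=None):
    """Handle (sequence_number, payload) records that follow the last checkpoint.

    fail_at simulates a worker crash just before that sequence number is handled.
    """
    last_checkpoint = store.load(shard_id)
    processed = 0
    last_seen = None
    for seq, payload in records:
        if last_checkpoint is not None and seq <= last_checkpoint:
            continue  # already covered by a checkpoint
        if fail_at is not None and seq == fail_at:
            raise RuntimeError("simulated worker crash")
        handle(payload)
        last_seen = seq
        processed += 1
        if processed % checkpoint_every == 0:
            store.save(shard_id, seq)
    if last_seen is not None:
        store.save(shard_id, last_seen)  # final checkpoint at end of batch


records = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
store = CheckpointStore()
seen = []
try:
    process_shard("shard-1", records, store, seen.append, fail_at=4)
except RuntimeError:
    pass  # the worker died; the last checkpoint (sequence 2) survives
process_shard("shard-1", records, store, seen.append)  # a new worker resumes
```

In this run record "c" is handled twice, once before the crash and once after the restart, because it was processed but not yet checkpointed. That at-least-once behavior is the usual trade-off of periodic checkpointing, so record handlers are typically written to tolerate duplicates.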
Process, Store
Amazon Kinesis, Amazon Kinesis Connectors, Amazon S3, Amazon DynamoDB
[Figure: architecture diagram. Components shown: Devices, Logging, Apps, Amazon Kinesis / Kafka, NoSQL / Amazon DynamoDB, Amazon S3, HDFS, Spark Streaming, Storm, Native Client, Hive, Spark, Presto, Impala, Amazon Redshift, Amazon CloudSearch, BI & Visualization tools.]
[Figure: design patterns that turn data into answers, plotted by data temperature (hot to cold) and query latency (low to high). Stores shown: Amazon Kinesis / Kafka, Amazon DynamoDB, Amazon S3, HDFS. Processing shown: Spark Streaming, Apache Storm, Native Client, Amazon Redshift, Spark, Impala, Presto, Hive.]
• Big data processing stages: ingest, store, process, and visualize
• Use the right tool for the job
– Ingest: Transactional data, file data, stream data
– Storage: Data structure, query patterns, hot vs. cold, etc.
– Processing: Query latency
• Big data reference architecture and design patterns
Please give us your feedback on this session.
Complete session evaluations and earn re:Invent swag.
http://bit.ly/awsevals