SEOUL
© 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Real-Time Big Data and Streaming Analytics
김일호 (Ilho Kim) – AWS Solutions Architect
Agenda
• Batch Processing: Amazon Elastic MapReduce (EMR)
• Real-time Processing: Amazon Kinesis
• Cost-saving Tips
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Batch processing
Amazon Elastic MapReduce (EMR)
Why Amazon EMR?
• Easy to use: launch a cluster in minutes
• Low cost: pay an hourly rate
• Elastic: easily add or remove capacity
• Reliable: spend less time monitoring
• Secure: managed firewalls
• Flexible: you control the cluster
Easy to deploy
AWS Management Console, Command Line Interface, or the Amazon EMR API with your favorite SDK.
Easy to monitor and debug
Integrated with Amazon CloudWatch
Monitor cluster, node, and I/O metrics; debug jobs from the console.
Hue
• Browse Amazon S3 and the Hadoop Distributed File System (HDFS)
• Query Editor
• Job Browser
Choose your instance types
Try different configurations to find your optimal architecture.
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge
• Memory: m2 family, r3 family
• Disk/IO: d2 family, i2 family
• General: m1 family, m3 family
Typical workloads: batch processing, machine learning, Spark and interactive analysis, large HDFS.
Resizable clusters
Easy to add and remove compute capacity on your cluster. Match compute demands with cluster sizing.
Easy to use Spot Instances
• Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
Meet SLA at predictable cost; exceed SLA at lower cost.
Use bootstrap actions to install applications…
https://github.com/awslabs/emr-bootstrap-actions
…or to configure Hadoop
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--keyword-config-file (merge values in the new config into the existing file)
--keyword-key-value (override the values provided)
Configuration file   Keyword   File name shortcut   Key-value pair shortcut
core-site.xml        core      C                    c
hdfs-site.xml        hdfs      H                    h
mapred-site.xml      mapred    M                    m
yarn-site.xml        yarn      Y                    y
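As a sketch of how the configure-hadoop shortcuts above would be wired up in practice, the helper below builds the bootstrap-action entry in the shape boto3's EMR `run_job_flow` expects (the `yarn.nodemanager.resource.memory-mb` override is only an illustration, not a value from the deck):

```python
# Sketch: a BootstrapActions entry for the configure-hadoop script shown
# above. "-y" is the yarn-site.xml key-value shortcut from the table;
# the override used in the example is hypothetical.

def configure_hadoop_action(overrides):
    """Build a bootstrap-action dict that overrides yarn-site.xml
    key-value pairs via the configure-hadoop script."""
    args = []
    for key, value in overrides.items():
        args += ["-y", f"{key}={value}"]
    return {
        "Name": "Configure Hadoop",
        "ScriptBootstrapAction": {
            "Path": "s3://elasticmapreduce/bootstrap-actions/configure-hadoop",
            "Args": args,
        },
    }

action = configure_hadoop_action({"yarn.nodemanager.resource.memory-mb": "12288"})
```

The resulting dict would go in the `BootstrapActions` list of a `run_job_flow` call; use `-c`, `-h`, or `-m` instead of `-y` for the other configuration files.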
Amazon EMR integration with Amazon Kinesis
• Read data directly into Hive, Apache Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams
• No intermediate data persistence required
• Simple way to introduce real-time sources into batch-oriented systems
• Multi-application support and automatic checkpointing
Amazon EMR: Leveraging Amazon S3
Amazon S3 as your persistent data store
• Amazon S3
– Designed for 99.999999999% durability
– Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at same data in Amazon S3
EMRFS makes it easier to leverage Amazon S3
• Better performance and error handling options
• Transparent to applications – just read/write to “s3://”
• Consistent view
– For consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata
EMRFS support for Amazon S3 client-side encryption
[Diagram: Amazon S3 encryption clients on the cluster, with EMRFS enabled for Amazon S3 client-side encryption, fetch keys from a key vendor (AWS KMS or your custom key vendor) and read/write client-side encrypted objects in Amazon S3.]
EMRFS metadata in Amazon DynamoDB
• List and read-after-write consistency
• Faster list operations

Fast listing of Amazon S3 objects using EMRFS metadata*

Number of objects   Without consistent view   With consistent view
1,000,000           147.72                    29.70
100,000             12.70                     3.69

*Tested using a single-node cluster with an m3.xlarge instance.
Optimize to leverage HDFS
• Iterative workloads – If you’re processing the same dataset more than once
• Disk I/O intensive workloads
Persist data on Amazon S3 and use S3DistCp to
copy to HDFS for processing.
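The S3DistCp copy above can be expressed as an EMR step; the sketch below builds such a step in the shape boto3's `add_job_flow_steps` expects (the bucket and paths are placeholders, and `command-runner.jar` is assumed as the launcher):

```python
# Sketch: an EMR step that copies a dataset from Amazon S3 into HDFS
# with S3DistCp before an iterative or I/O-intensive job runs.

def s3distcp_step(src, dest):
    """Build an EMR step definition that invokes s3-dist-cp."""
    return {
        "Name": "Copy input from S3 to HDFS",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }

step = s3distcp_step("s3://my-bucket/logs/", "hdfs:///input/logs/")
```

Running the reverse copy after processing (HDFS back to S3) lets the cluster be shut down without data loss, per the persistence pattern above.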
Amazon EMR: Design patterns
Amazon EMR example #1: Batch processing
GBs of logs pushed to Amazon S3 hourly; a daily Amazon EMR cluster uses Hive to process the data; input and output are stored in Amazon S3.
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
Amazon EMR example #2: Long-running cluster
Data pushed to Amazon S3; a daily Amazon EMR cluster extracts, transforms, and loads (ETL) the data into a database; a 24/7 Amazon EMR cluster running HBase holds the last 2 years' worth of data; a front-end service uses the HBase cluster to power a dashboard with high concurrency.
Amazon EMR example #3: Interactive query
TBs of logs sent daily; logs stored in Amazon S3; an Amazon EMR cluster uses Presto for ad hoc analysis of the entire log set.
Interactive query using Presto on a multi-petabyte warehouse
http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
Real-time Processing
Amazon Kinesis
Real-time analytics
Real-time ingestion
• Highly scalable
• Durable
• Elastic
• Re-playable reads
Continuous processing
• Load-balancing incoming streams
• Fault-tolerance, check-pointing and replay
• Elastic
• Enables multiple apps to process in parallel
Continuous data flow + low end-to-end latency = continuous, real-time workloads
Data ingestion
Global top 10 for example.com

Starting simple: a single worker computes the global top 10 directly.
Distributing the workload: multiple workers each compute a local top 10, and the local lists are merged into the global top 10.
Or using an elastic data broker: data records, each with a partition key and a sequence number (e.g. 14, 17, 18, 21, 23), flow through the shards of a stream; each worker consumes a shard and maintains its own top 10.
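The distributed top-10 pattern can be sketched in a few lines: each worker keeps a local count for its shard, and a merge step combines the local counts into the global top 10 (the page paths and counts below are toy data, not from the deck):

```python
# Sketch of the local/global top-10 merge used when the workload is
# distributed across shards.
from collections import Counter

def local_counts(records):
    """One worker's view: count page hits in the records from its shard."""
    return Counter(records)

def global_top_n(worker_counts, n=10):
    """Merge per-worker counters and return the overall top-n pages."""
    total = Counter()
    for c in worker_counts:
        total.update(c)
    return total.most_common(n)

shard_a = local_counts(["/home", "/home", "/blog"])
shard_b = local_counts(["/home", "/pricing"])
top = global_top_n([shard_a, shard_b], n=2)  # "/home" leads with 3 hits
```

Because counters merge associatively, workers never need to see each other's records, which is what makes the shard-per-worker layout scale.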
Amazon Kinesis – managed stream
[Diagram: data sources for example.com put records through the AWS endpoint into an Amazon Kinesis stream whose shards 1..N are replicated across Availability Zones; parallel applications consume the stream, e.g. App 1 (data archive), App 2 (metric extraction), App 3 (sliding-window analysis), and App 4 (machine learning), writing results to Amazon S3, Amazon DynamoDB, Amazon Redshift, and Amazon EMR.]
Amazon Kinesis – common data broker
Amazon Kinesis – stream and shards
• Stream: a named entity to capture and store data
• Shard: the unit of capacity
  – Put: 1 MB/sec or 1,000 TPS
  – Get: 2 MB/sec or 5 TPS
• Scale by adding or removing shards
• Replay within a 24-hr window
How to size your Amazon Kinesis stream
Consider 2 producers, each producing 2 KB records at 500 TPS:
2 KB * 500 TPS = 1,000 KB/s per producer, so 2 MB/s of total ingress.
A minimum of 2 shards covers the 2 MB/s of ingress, and those 2 shards also provide 4 MB/s of egress, enough for 2 consuming applications.
How to size your Amazon Kinesis stream
Now consider 3 consuming applications, each processing all the data: egress demand rises to 6 MB/s, but 2 shards provide only 4 MB/s.
Simple! Add another shard to the stream to spread the load: 3 shards provide 6 MB/s of egress.
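The shard math above generalizes to a one-line rule: take the larger of the ingress and egress requirements, using the per-shard limits from the stream-and-shards slide. A minimal sketch:

```python
# Sketch of Kinesis shard sizing: each shard accepts 1 MB/s (or 1,000
# records/s) of ingress and serves 2 MB/s of egress, so the stream needs
# enough shards to cover whichever demand is largest.
import math

def shards_needed(ingress_mb_s, ingress_tps, egress_mb_s):
    """Minimum shard count for the given ingress/egress demands."""
    by_ingress_mb = math.ceil(ingress_mb_s / 1.0)   # 1 MB/s in per shard
    by_ingress_tps = math.ceil(ingress_tps / 1000)  # 1,000 puts/s per shard
    by_egress = math.ceil(egress_mb_s / 2.0)        # 2 MB/s out per shard
    return max(by_ingress_mb, by_ingress_tps, by_egress)

# 2 producers * 2 KB * 500 TPS = 2 MB/s in; 2 readers -> 4 MB/s out:
print(shards_needed(2.0, 1000, 4.0))  # 2
# A third reader raises egress to 6 MB/s:
print(shards_needed(2.0, 1000, 6.0))  # 3
```

Note that every consuming application reads the full stream, so egress demand is the per-shard read limit multiplied by the number of applications.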
Amazon Kinesis – distributed streams
• From batch to continuous processing
• Scale UP or DOWN without losing sequencing
• Workers can replay records for up to 24 hours
• Scale up to GB/sec without losing durability
– Records stored across multiple Availability Zones
• Run multiple parallel Amazon Kinesis applications
Data processing
Batch → micro-batch → real time

Pattern for real-time analytics…
[Diagram: data streams feed batch analysis (data warehouse, Hadoop) alongside streaming analytics (Spark Streaming, Apache Storm, Amazon KCL), driving notifications & alerts, dashboards/visualizations, APIs, deep learning, and a data archive.]
Real-time analytics
• Streaming
  – Event-based response within seconds; for example, detecting whether a transaction is fraudulent
• Micro-batch
  – Operational insights within minutes; for example, monitoring transactions from different regions
Amazon Kinesis Client Library (Amazon KCL)
• Distributed, to handle multiple shards
• Fault tolerant
• Elastically adjusts to shard count
• Helps with distributed processing
[Diagram: one Amazon Kinesis stream consumed by KCL workers running on multiple Amazon EC2 instances]
Amazon KCL design components
• Worker: The processing unit that maps to each application instance
• Record processor: The processing unit that processes data from a shard of an Amazon Kinesis stream
• Check-pointer: Keeps track of the records that have already been processed in a given shard
Amazon KCL restarts the processing of the shard at the last-known processed record if a worker fails
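The worker/record-processor/check-pointer roles above can be illustrated with plain Python; this is a conceptual sketch of the design, not the real KCL API (the KCL itself stores checkpoints in a DynamoDB table):

```python
# Sketch of KCL-style checkpointing: a record processor consumes a
# shard's records while a check-pointer tracks the last sequence number
# processed, so a replacement worker can resume after a failure.

class Checkpointer:
    """Tracks the last processed sequence number per shard."""
    def __init__(self):
        self.checkpoints = {}
    def checkpoint(self, shard_id, sequence_number):
        self.checkpoints[shard_id] = sequence_number
    def last(self, shard_id):
        return self.checkpoints.get(shard_id)

def process_shard(shard_id, records, checkpointer):
    """Record processor: skip records at or before the checkpoint,
    handle the rest, and checkpoint after each one."""
    resume_after = checkpointer.last(shard_id)
    handled = []
    for seq, data in records:
        if resume_after is not None and seq <= resume_after:
            continue  # already processed before the worker failed
        handled.append(data)
        checkpointer.checkpoint(shard_id, seq)
    return handled

cp = Checkpointer()
process_shard("shard-0", [(14, "a"), (17, "b")], cp)
# Worker fails and restarts: record 17 is not reprocessed.
resumed = process_shard("shard-0", [(17, "b"), (18, "c")], cp)
```

Checkpointing after every record, as here, minimizes reprocessing at the cost of more checkpoint writes; real applications often checkpoint on an interval instead.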
Amazon Kinesis Connector Library
• Amazon S3 – archival of data
• Amazon Redshift – micro-batching loads
• Amazon DynamoDB – real-time counters
• Elasticsearch – search and index
EMR integration with Amazon Kinesis
• Read data directly into Hive, Pig, Streaming, and Cascading from Amazon Kinesis
• Real-time sources into batch-oriented systems
• Multi-application support & check-pointing
Spark Streaming – basic concepts
• Higher-level abstraction called Discretized Streams (DStreams)
• Represented as sequences of Resilient Distributed Datasets (RDDs)
[Diagram: a receiver turns incoming messages into a DStream: an RDD at time T1, an RDD at time T2, …]
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
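The DStream idea reduces to chopping a stream into time-ordered micro-batches and processing each batch as one unit; the sketch below uses plain lists to stand in for RDDs, purely as an illustration of the concept:

```python
# Sketch of micro-batching: a DStream is a sequence of batches, and a
# transformation is applied batch by batch (in Spark, each batch is an
# RDD and the transformation runs distributed).

def to_dstream(messages, batch_size):
    """Group an incoming message sequence into micro-batches."""
    return [messages[i:i + batch_size] for i in range(0, len(messages), batch_size)]

def count_per_batch(dstream):
    """A simple per-batch transformation, like DStream.count()."""
    return [len(batch) for batch in dstream]

stream = to_dstream(["m1", "m2", "m3", "m4", "m5"], batch_size=2)
print(count_per_batch(stream))  # [2, 2, 1]
```

In Spark Streaming the batching is by time interval rather than record count, but the programming model is the same: one transformation, applied to every batch.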
Apache Storm: Basic concepts
• Streams: unbounded sequences of tuples
• Spout: a source of streams
• Bolts: process input streams and produce new streams
• Topologies: networks of spouts and bolts
https://github.com/awslabs/kinesis-storm-spout
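The Storm concepts above can be mimicked with Python generators: a spout emits tuples, bolts transform them, and wiring spout → bolt → bolt forms a topology. This is only a conceptual stand-in for Storm's components, with toy data:

```python
# Sketch of a spout -> bolt -> bolt topology using generators.

def spout():
    """Source of the stream: emit raw word tuples."""
    for word in ["kinesis", "storm", "kinesis"]:
        yield (word,)

def split_bolt(stream):
    """Bolt: emit each character of each word as its own tuple."""
    for (word,) in stream:
        for ch in word:
            yield (ch,)

def count_bolt(stream):
    """Terminal bolt: aggregate character counts from the input stream."""
    counts = {}
    for (ch,) in stream:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

# Topology: spout -> split_bolt -> count_bolt
result = count_bolt(split_bolt(spout()))
```

In real Storm the same wiring is declared with a TopologyBuilder, and the awslabs kinesis-storm-spout linked above plays the spout role, reading tuples from a Kinesis stream.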
Batch → micro-batch → real time

Putting it together…
[Diagram: a producer puts records into Amazon Kinesis; an app client and multiple Amazon KCL applications consume the stream, landing data in Amazon S3 and EMR, DynamoDB, and Amazon Redshift for BI tools.]
Ref. re:Invent 2014 BDT310
Cost-saving tips
• Use Amazon S3 as your persistent data store (only pay for compute when you need it!).
• Use Amazon EC2 Spot Instances (especially with task nodes) to save 80 percent or more on the Amazon EC2 cost.
• Use Amazon EC2 Reserved Instances if you have steady workloads.
• Create CloudWatch alerts to notify you if a cluster is underutilized so that you can shut it down (e.g. Mappers running == 0 for more than N hours).
• Contact your sales rep about custom pricing options if you are spending more than $10K per month on Amazon EMR.
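The underutilization alert in the tips above can be sketched as a CloudWatch alarm definition; the dict below is shaped for boto3's `put_metric_alarm`, assuming the pre-YARN EMR metric `RunningMapTasks` (check the metric name for your Hadoop version), with a placeholder cluster ID and SNS topic ARN:

```python
# Sketch: a CloudWatch alarm that fires when an EMR cluster has had no
# running mappers for N consecutive hours, so it can be shut down.

def idle_cluster_alarm(cluster_id, hours, topic_arn):
    """Build put_metric_alarm parameters for RunningMapTasks == 0."""
    return {
        "AlarmName": f"emr-idle-{cluster_id}",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "RunningMapTasks",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Maximum",
        "Period": 3600,              # evaluate in one-hour periods
        "EvaluationPeriods": hours,  # N consecutive idle hours
        "Threshold": 0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],
    }

alarm = idle_cluster_alarm("j-EXAMPLE", 3, "arn:aws:sns:us-east-1:123456789012:ops")
```

In practice you would pass this dict to `boto3.client("cloudwatch").put_metric_alarm(**alarm)` and subscribe the operations team to the SNS topic.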