(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

DESCRIPTION

The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, and durability. Finally, we provide a reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.

Citation preview

Page 1: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 2: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

• What technologies should I use?

– Why?

– How?

• Reference architecture

• Design patterns

Page 3: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 4: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Volume

Velocity

Variety

Page 5: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 6: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

[Diagram of services: Glacier, S3, DynamoDB, RDS, EMR, Redshift, Data Pipeline, Kinesis, Cassandra, CloudSearch, Kinesis-enabled app]

Page 7: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

What Tools Should I Use?

Page 8: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 9: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Ingest Store Process Visualize

Page 10: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

[Diagram of services: Glacier, S3, DynamoDB, RDS, Kinesis, Spark Streaming, EMR, Data Pipeline, Storm, Kafka, Redshift, Cassandra, CloudSearch, Kinesis Connector, Kinesis-enabled app]

Page 11: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Ingest

Page 12: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Database

Cloud Storage

Stream Storage

Page 13: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Stream Storage

Database

Cloud Storage

Page 14: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Amazon Kinesis or Kafka

[Diagram: records numbered 1 to 4 from several producers are split across Shard or Partition 1 and Shard or Partition 2.]

Page 15: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Amazon Kinesis or Kafka

[Diagram: records numbered 1 to 4 from several producers are split across Shard or Partition 1 and Shard or Partition 2, which are read by Consumer 1 and Consumer 2.]

Consumer 1: Count of Red = 4, Count of Violet = 4

Consumer 2: Count of Blue = 4, Count of Green = 4
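
The per-color counts on this slide work because every record with the same key lands in the same shard or partition, so a single consumer sees all of that key's records. The following is a minimal sketch of that idea in plain Python, not the Kinesis or Kafka client API. The names `shard_for`, `route`, and `count_by_key` are hypothetical, and the hash-modulo mapping is a simplifying assumption (real services map keys to shards with their own schemes). Which colors share a shard depends on the hash, so the split may differ from the slide.

```python
import hashlib
from collections import Counter

def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index with a stable hash (assumed scheme)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards

def route(records, num_shards):
    """Append each (key, sequence) record to the shard chosen by its key."""
    shards = [[] for _ in range(num_shards)]
    for key, seq in records:
        shards[shard_for(key, num_shards)].append((key, seq))
    return shards

def count_by_key(shard):
    """What one consumer computes over the single shard it reads."""
    return Counter(key for key, _ in shard)

# Four producers (one per color) each emit records numbered 1 to 4.
colors = ("red", "violet", "blue", "green")
records = [(color, seq) for seq in range(1, 5) for color in colors]

shards = route(records, num_shards=2)
counts = [count_by_key(shard) for shard in shards]  # one Counter per consumer
```

Because routing depends only on the key, each color's four records are counted by exactly one consumer, and no cross-consumer merge is needed to get the per-color totals.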

Page 16: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 17: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Cloud Database & Storage

Page 18: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

App/Web Tier

Client Tier

Database & Storage Tier

Page 19: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

App/Web Tier

Client Tier

Data Tier (Database & Storage Tier)

Search

Hadoop/HDFS

Cache

Blob Store

SQL NoSQL

Page 20: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Database & Storage Tier

Amazon RDS

Amazon DynamoDB

Amazon ElastiCache

Amazon S3

Amazon Glacier

Amazon CloudSearch

HDFS on Amazon EMR

Page 21: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 22: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Structured – Simple Query

NoSQL

Amazon DynamoDB

Cache

Amazon ElastiCache

Structured – Complex Query

SQL

Amazon RDS

Search

Amazon CloudSearch

Unstructured – No Query

Cloud Storage

Amazon S3

Amazon Glacier

Unstructured – Custom Query

Hadoop/HDFS

Amazon Elastic MapReduce

Vertical axis: Data Structure Complexity

Horizontal axis: Query Structure Complexity
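
The quadrant chart above can be read as a lookup from (data structure, query type) to candidate stores. Here is a minimal sketch of that lookup. The dictionary contents come from the slide, while the function name `candidate_stores` and the string labels used as keys are assumptions made for illustration.

```python
# Quadrants from the slide, keyed by (data structure, query type).
STORES_BY_QUADRANT = {
    ("structured", "simple"): ["Amazon DynamoDB", "Amazon ElastiCache"],
    ("structured", "complex"): ["Amazon RDS", "Amazon CloudSearch"],
    ("unstructured", "none"): ["Amazon S3", "Amazon Glacier"],
    ("unstructured", "custom"): ["Amazon Elastic MapReduce (Hadoop/HDFS)"],
}

def candidate_stores(data_structure: str, query_type: str) -> list:
    """Return the stores the slide places in the matching quadrant."""
    key = (data_structure.lower(), query_type.lower())
    if key not in STORES_BY_QUADRANT:
        raise ValueError(f"no quadrant for {key!r}")
    return STORES_BY_QUADRANT[key]

print(candidate_stores("structured", "complex"))
```

The lookup only narrows the field; the later slides on data temperature, item size, and cost decide between the candidates in a quadrant.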

Page 23: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 24: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

              Hot        Warm      Cold
Volume        MB–GB      GB–TB     PB
Item size     B–KB       KB–MB     KB–TB
Latency       ms         ms, sec   min, hrs
Durability    Low–High   High      Very High
Request rate  Very High  High      Low
Cost/GB       $$-$       $-¢¢      ¢

Page 25: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

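[Diagram: services positioned along the following axes]

Request rate: High to Low

Cost/GB: High to Low

Latency: Low to High

Data volume: Low to High

Structure: Low to High

Services shown: Amazon RDS, Amazon Glacier, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache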

Page 26: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Service             Average latency       Data volume        Item size         Request rate              Storage cost ($/GB/month)  Durability
Amazon ElastiCache  ms                    GB                 B–KB              Very High                 $$                         Low–Moderate
Amazon DynamoDB     ms                    GB–TBs (no limit)  KB (64 KB max)    Very High                 ¢¢                         Very High
Amazon RDS          ms, sec               GB–TB (3 TB max)   KB (~row size)    High                      ¢¢                         High
Amazon CloudSearch  ms, sec               GB–TB              KB (1 MB max)     High                      $                          High
Amazon EMR (HDFS)   sec, min, hrs         GB–PB (~nodes)     MB–GB             Low–Very High             ¢                          High
Amazon S3           ms, sec, min (~size)  GB–PB (no limit)   KB–GB (5 TB max)  Low–Very High (no limit)  ¢                          Very High
Amazon Glacier      hrs                   GB–PB (no limit)   GB (40 TB max)    Very Low (no limit)       ¢                          Very High

Services are ordered from Hot Data (Amazon ElastiCache) through Warm Data to Cold Data (Amazon Glacier).
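
One way to apply the comparison above is to rule out stores whose maximum item size is too small for the data at hand. The sketch below encodes only the explicit item-size limits stated on this 2014 slide; current service limits may differ. The function name `stores_that_fit` is hypothetical, and treating KB/MB/TB as powers of 1024 is an assumption.

```python
KB, MB, TB = 1024, 1024 ** 2, 1024 ** 4

# Maximum item sizes as stated on the slide (2014 figures, may be outdated).
MAX_ITEM_BYTES = {
    "Amazon DynamoDB": 64 * KB,
    "Amazon CloudSearch": 1 * MB,
    "Amazon S3": 5 * TB,
    "Amazon Glacier": 40 * TB,
}

def stores_that_fit(item_bytes: int) -> list:
    """Return the stores whose stated maximum item size can hold the item."""
    return [name for name, limit in MAX_ITEM_BYTES.items() if item_bytes <= limit]

print(stores_that_fit(500 * KB))  # a 500 KB item exceeds the DynamoDB limit above
```

Item size is only the first filter; latency, request rate, and cost from the same comparison would then rank whatever remains.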

Page 27: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Use Case: A Video Streaming Application

Page 28: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Use Case: A Video Streaming App – Upload

Amazon DynamoDB

Amazon RDS

Amazon CloudSearch

Amazon S3

Page 29: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

A Video Streaming App – Discovery

Amazon ElastiCache

CloudFront

Amazon DynamoDB

Amazon RDS

Amazon CloudSearch

Amazon S3

Page 30: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Process

Page 31: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 32: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 33: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Batch Processing

• Take a large amount of cold data and ask questions

• Takes minutes or hours to get answers back

Example: Generating hourly, daily, and weekly reports
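
As an illustration of the hourly-report example, the sketch below runs one pass over a stored set of events and counts them per hour. It is plain in-memory Python rather than an EMR or Redshift job, and the event shape (ISO timestamp, video ID), the sample data, and the name `hourly_report` are all assumptions.

```python
from collections import Counter
from datetime import datetime

def hourly_report(events):
    """Count events per (hour, video_id) over a complete, already-stored dataset."""
    report = Counter()
    for timestamp, video_id in events:
        hour = datetime.fromisoformat(timestamp).replace(minute=0, second=0, microsecond=0)
        report[(hour.isoformat(), video_id)] += 1
    return report

# Hypothetical view events that have already landed in cold storage.
events = [
    ("2014-11-12T10:05:00", "v1"),
    ("2014-11-12T10:45:00", "v1"),
    ("2014-11-12T11:01:00", "v2"),
]
report = hourly_report(events)
```

The whole input is available before the job starts, which is what distinguishes this from the stream-processing case.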

Page 34: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Use Case: Video Recommendations

Amazon S3

Amazon Glacier

Amazon DynamoDB

Amazon EMR

Page 35: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Use Case: Batch Analytics

Amazon EMR

Amazon S3

Amazon Glacier

Amazon Redshift

Page 36: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Stream Processing (AKA Real Time)

• Take a small amount of hot data and ask questions

• Takes a short amount of time to get your answer back

Example: 1-minute metrics
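
The 1-minute metrics example corresponds to a tumbling window: each event is assigned to the minute it falls in, and a count is kept per window. The sketch below shows only the windowing arithmetic in plain Python; it is not Spark Streaming, Storm, or KCL code. The event shape (epoch seconds, key), the sample data, and the name `one_minute_counts` are assumptions.

```python
from collections import defaultdict

def one_minute_counts(events, window_seconds=60):
    """Group (epoch_seconds, key) events into fixed windows and count keys per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp) - int(timestamp) % window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in windows.items()}

# Hypothetical player events as (seconds since start, event type).
stream = [(0, "play"), (15, "play"), (59, "pause"), (60, "play"), (125, "play")]
metrics = one_minute_counts(stream)
```

A real stream processor would emit each window's result as soon as the window closes instead of returning everything at the end.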

Page 37: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 38: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

https://amplab.cs.berkeley.edu/benchmark/

Page 39: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

               Redshift    Impala         Presto         Spark          Hive
Query latency  Low         Low            Low            Low–Medium     Medium–High
Durability     High        High           High           High           High
Data volume    1.6 PB max  ~Nodes         ~Nodes         ~Nodes         ~Nodes
Managed        Yes         EMR bootstrap  EMR bootstrap  EMR bootstrap  Yes (EMR)
Storage        Native      HDFS           HDFS/S3        HDFS/S3        HDFS/S3
# of BI tools  High        Medium         High           Low            High

Page 40: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

                       Spark Streaming      Apache Storm + Trident  Kinesis Client Library
Scale/Throughput       ~Nodes               ~Nodes                  ~Nodes
Data volume            ~Nodes               ~Nodes                  ~Nodes
Manageability          Yes (EMR bootstrap)  Do it yourself          EC2 + Auto Scaling
Fault tolerance        Built-in             Built-in                KCL checkpointing
Programming languages  Java, Python, Scala  Java, Scala, Clojure    Java, Python
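
The fault-tolerance row lists checkpointing for the Kinesis Client Library. The general idea is that a worker periodically records how far it has read, so a replacement worker resumes from that point instead of from the beginning. The sketch below simulates this in plain Python and does not use the KCL API. `InMemoryCheckpointStore`, `consume`, and the `fail_after` crash switch are hypothetical names, and the running sum stands in for whatever state the application keeps.

```python
class InMemoryCheckpointStore:
    """Stand-in for a durable checkpoint store (assumption: survives worker crashes)."""
    def __init__(self):
        self.sequence_number = 0  # last sequence number covered by the checkpoint
        self.state = 0            # application state as of that sequence number

    def save(self, sequence_number, state):
        self.sequence_number = sequence_number
        self.state = state

def consume(records, store, checkpoint_every=2, fail_after=None):
    """Sum record values, resuming from the last checkpoint.

    Raises RuntimeError once `fail_after` records have been processed,
    to simulate a worker crash.
    """
    total = store.state
    processed = 0
    for seq, value in records:
        if seq <= store.sequence_number:
            continue  # already reflected in the checkpointed state
        if fail_after is not None and processed == fail_after:
            raise RuntimeError("simulated worker crash")
        total += value
        processed += 1
        if processed % checkpoint_every == 0:
            store.save(seq, total)
    if records:
        store.save(records[-1][0], total)  # final checkpoint at end of input
    return total

records = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
store = InMemoryCheckpointStore()
try:
    consume(records, store, fail_after=3)   # crashes before record 4
except RuntimeError:
    pass
resumed_total = consume(records, store)     # restarts after record 2
```

Record 3 is processed twice across the two runs, but because state is restored from the checkpoint the final sum is still correct. Work done after the last checkpoint is simply repeated.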

Page 41: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 42: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 43: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Process Store Process Store

Page 44: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Amazon Kinesis

Amazon Kinesis Connectors

Amazon S3

Amazon DynamoDB

Page 45: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Amazon Kinesis

Amazon Kinesis Connectors

Amazon S3

Amazon DynamoDB

Hive, Spark, Storm

Page 46: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

[Reference architecture diagram. Labels shown: Devices; Logging; Apps; Amazon Kinesis / Kafka; NoSQL / Amazon DynamoDB; Amazon S3; HDFS; Amazon Redshift; Amazon CloudSearch; Spark Streaming; Storm; Native Client; Presto; Hive; Impala; Spark; BI & Visualization tools]

Page 47: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 48: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

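[Chart: processing tools placed by data temperature and query latency, on the path from Data to Answers]

Horizontal axis: Data Temperature (Hot to Cold)

Vertical axis: Query Latency (Low to High)

Data stores: Amazon Kinesis/Kafka, Amazon DynamoDB, Amazon S3, HDFS

Processing tools: Spark Streaming, Apache Storm; Native Client; Amazon Redshift; Spark, Impala, Presto; Hive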

Page 49: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

[Diagram, Data to Answers. Labels shown: Amazon Kinesis / Kafka; Amazon DynamoDB; Spark Streaming; Apache Storm; Native Client; Hive]

Page 50: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

[Diagram, Data to Answers. Labels shown: Amazon Kinesis/Kafka; Amazon S3; Amazon Redshift; Hive; Spark, Presto]

Page 51: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

[Diagram, Data to Answers. Labels shown: Kinesis/Kafka; DynamoDB; S3; HDFS; Redshift; Spark, Impala, Presto; Spark, Presto]

Page 52: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014
Page 53: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

• Big data processing stages: ingest, store, process, and visualize

• Use the right tool for the job

– Ingest: Transactional data, file data, stream data

– Storage: Data structure, query patterns, hot vs. cold, etc.

– Processing: Query latency

• Big data reference architecture and design patterns

Page 54: (BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

Please give us your feedback on this session.

Complete session evaluations and earn re:Invent swag.

http://bit.ly/awsevals