AWS DB Best Practices

8/12/2019 AWS DB Best Practices


© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

DAT203 - AWS Storage and Database Architecture Best Practices

Siva Raghupathy, Amazon Web Services



The Third Platform

Built on:

Mobile devices

Cloud services

Social technologies

Big data

Billions of users

Millions of apps



Data Volume, Velocity, Variety

2.7 zettabytes (ZB) of data exist in the digital universe today

1 ZB = 1 billion terabytes

450 billion transactions per day by 2020

More unstructured data than structured data



Common Questions from Database Developers

Cloud Migration

How do I move (my data) to the cloud?

Data/Storage Technologies

What data store should I use?

SQL or NoSQL?

Hadoop or DW? What about search?

Management Concerns

Is my data (in the clou

Relational features w/nightmares?

My data volume, velocity, and variety are exploding!

How can I reduce costs?

Performance and Delivery

Need low latency (ms)

Need high throughput

Need to ship in days



Cloud Data Tier Anti-Pattern

Data Tier



Cloud Data Tier Architecture: Use the Right Tool

Client Tier

App/Web Tier

Data Tier
  Search
  Hadoop
  Cache
  Blob Store
  SQL
  NoSQL
  Data Warehouse



Compute

Storage

AWS Global Infrastructure

Database

App Services

Deployment & Administration

Networking

AWS



AWS Managed Database & Storage Services

Structured, complex query
  SQL: Amazon RDS (MySQL, Oracle, SQL Server)
  Data Warehouse: Amazon Redshift
  Search: Amazon CloudSearch

Unstructured, custom query
  Hadoop: Amazon Elastic MapReduce (EMR)

Structured, simple query
  NoSQL: Amazon DynamoDB
  Cache: Amazon ElastiCache (Memcached, Redis)

Unstructured, no query
  Cloud Storage: Amazon S3, Amazon Glacier



AWS Primitive Compute and Storage

Compute Capabilities

Many different EC2 instance types
  General purpose
  Compute optimized
  Storage optimized
  Memory optimized
Host any major data storage technology
  RDBMS
  NoSQL
  Cache

Raw Storage Options

EC2 instance store (ephemeral)
Amazon Elastic Block Store (EBS)
  Standard volume: 1 TB, ~100 IOPS per volume
  Provisioned IOPS volume: 1 TB, up to 4000 IOPS per volume
Stripe multiple volumes for higher IOPS or storage

Primitives add flexibility, but also come with operational ...



AWS Data Tier Architecture - Use the right tool for the job

Amazon RDS

Amazon CloudSearch

Amazon DynamoDB

Amazon ElastiCache

Amazon Elastic MapReduce

Amazon S3

Amazon Redshift

AWS Data Pipeline



Reference Architecture




Amazon RDS

Amazon CloudSearch

Amazon DynamoDB

Amazon ElastiCache

Amazon EMR

Amazon S3

AWS Data Pipeline

Amazon Redshift



Use Case: A Video Streaming Application



Use Case: A Video Streaming App U

Amazon DynamoDB

Amazon RDS

Amazon CloudSearch

Amazon S3



Use Case: A Video Streaming App Discovery

Amazon ElastiCache

CloudFront

Amazon DynamoDB

Amazon RDS

Amazon CloudSearch

Amazon S3



Use Case: A Video Streaming App

Amazon S3

Amazon DynamoDB

Amazon EMR



Use Case: A Video Streaming App Analytics

Amazon EMR

Amazon S3

Amazon Redshift


What is the temperature of your data?



Data Characteristics: Hot, Warm, Cold

               Hot         Warm      Cold
Volume         MB-GB       GB-TB     PB
Item size      B-KB        KB-MB     KB-TB
Latency        ms          ms, sec   min, ...
Durability     Low-High    High      Very High
Request rate   Very High   High      Low
Cost/GB        $$-$        $-...

Low



What data store should I use?

[Figure: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon EMR, and Amazon S3 placed along axes for request rate, cost/GB, latency, data volume, and structure]



What data store should I use?

                 ElastiCache    DynamoDB            RDS                CloudSearch     Redshift             EMR (Hive)
Average latency  ms             ms                  ms, sec            ms, sec         sec, min             sec, min, hrs
Data volume      GB             GB-TBs (no limit)   GB-TB (3 TB max)   GB-TB           TB-PB (1.6 PB max)   GB-PB (~nodes)
Item size        B-KB           KB (64 KB max)      KB (~row size)     KB (1 MB max)   KB (64 K max)        KB-MB
Request rate     Very High      Very High           High               High            Low                  Low
Durability       Low-Moderate   Very High           High               High            High                 High

Storage cost ($/GB/month): $$ $

Hot Data    Warm Data



AWS Data Tier Architecture - Use the right tool for the job

Amazon RDS

Amazon CloudSearch

Amazon DynamoDB

Amazon ElastiCache

Amazon S3




Cost Conscious Design


Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?

"I'm currently scoping out a project that will greatly ... my team's use of Amazon S3. Hoping you could answer some questions. The current iteration of the design ... many small files, perhaps up to a billion during peak ... total size would be on the order of 1.5 TB per month ..."

Request rate    Object size   Total size    Objects per month
(writes/sec)    (bytes)       (GB/month)
300             2,048         1,483         777,600,000
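The figures in the table above follow from the request rate and object size. A minimal sketch of that arithmetic, assuming a 30-day month and 1 GB = 1024^3 bytes (both assumptions, chosen because they reproduce the slide's numbers); the function name is illustrative only:

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def monthly_workload(writes_per_sec: int, object_bytes: int, days: int = 30):
    """Return (objects per month, total GB per month) for a steady write rate."""
    objects = writes_per_sec * SECONDS_PER_DAY * days
    total_gb = objects * object_bytes / BYTES_PER_GB
    return objects, total_gb


objects, total_gb = monthly_workload(writes_per_sec=300, object_bytes=2048)
print(f"{objects:,} objects, {total_gb:,.0f} GB per month")
# prints "777,600,000 objects, 1,483 GB per month"
```

Because the object count depends only on the request rate, changing the object size scales the storage volume while leaving the number of requests unchanged.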

              Request rate    Object size   Total size    Objects per month
              (writes/sec)    (bytes)       (GB/month)
Scenario 1    300             2,048         1,483         777,600,000
Scenario 2    300             32,768        23,730        777,600,000

Amazon S3 or Amazon DynamoDB?

http://calculator.s3.amazonaws.com/calc5.html#r=IAD&key=calc-736174F7-ECD3-4636-BB5A-0AF2DF8F4D4E
http://calculator.s3.amazonaws.com/calc5.html#r=IAD&key=calc-24CBA60C-49D4-4D42-84B6-B33E2C980C94



Best Practices

Amazon RDS



When to use

Transactions
Complex queries
Medium to high query/write rate
  Up to 30 K IOPS (15 K reads + 15 K writes)
100s of GB to low TBs
Workload can fit in a single node
High durability

When not to use

Massive read/write rates
  Example: 150 K writes/second
Data size or throughput requires sharding
  Example: 10s or 10...
Simple Get/Put and queries that NoSQL can handle
Complex analytics

Push-Button Scaling

[Diagram: Amazon RDS Multi-AZ deployment spanning AZ 1 and AZ 2 within a region]



Amazon RDS Best Practices

Use the right DB instance class
Use EBS-optimized instances
  db.m1.large, db.m1.xlarge, db.m2.2xlarge, db.m2.4xlarge, db.cr1.8xlarge
Use provisioned IOPS
Use Multi-AZ for high availability
Use read replicas for
  Scaling reads
  Schema changes
  Additional failure recovery
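Scaling reads with read replicas usually means the application sends writes to the primary and spreads reads across the replicas. A minimal sketch of that routing, assuming hypothetical endpoint names and a simple "starts with SELECT" test to classify statements (neither comes from the slides):

```python
import itertools


class ReplicaRouter:
    """Route writes to the primary and spread reads across read replicas."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        # Fall back to the primary when no replicas are configured.
        self._read_cycle = itertools.cycle(replicas or [primary])

    def endpoint_for(self, sql: str) -> str:
        """Return the endpoint that should serve this statement."""
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._read_cycle) if is_read else self.primary


# Hypothetical endpoint names, for illustration only.
router = ReplicaRouter("primary.example", ["replica-1.example", "replica-2.example"])
print(router.endpoint_for("SELECT * FROM videos"))     # replica-1.example
print(router.endpoint_for("UPDATE videos SET views = 1"))  # primary.example
```

Replicas apply changes asynchronously, so a read that must see the caller's own latest write is safer against the primary.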

Amazon DynamoDB



When to use

Fast and predictable performance

Seamless/massive scale

Autosharding

Consistent/low latency

No size or throughput limits

Very high durability

Key-value or simple queries

When not to use

Need multi-item/row otransactions

Need complex queries

Need real-time analytics on historic data

Storing cold data


Amazon DynamoDB Best Practices

Keep item size small
Store metadata in Amazon DynamoDB and large blobs in Amazon S3
Use a table with a hash key for extremely high scale
Use table per day, week, month, etc. for storing time series data
Use conditional/OCC updates
Use hash-range key to model 1:N relationships
  Multi-tenancy
Avoid hot keys and hot partitions
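The conditional/OCC (optimistic concurrency control) item above can be pictured as a read-modify-write loop that only commits when the item is unchanged since it was read. The sketch below uses an in-memory stand-in rather than the DynamoDB API; the class, the `version` attribute, and the retry count are all assumptions made for illustration:

```python
class ConditionFailed(Exception):
    """Raised when the stored version differs from the expected one."""


class VersionedStore:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        # Missing items are reported as version 0.
        return dict(self._items.get(key, {"version": 0}))

    def put_if_version(self, key, item, expected_version):
        current = self._items.get(key, {"version": 0})["version"]
        if current != expected_version:
            raise ConditionFailed(key)
        self._items[key] = {**item, "version": expected_version + 1}


def increment_views(store, key, max_retries=5):
    """Read-modify-write that retries when a concurrent update wins."""
    for _ in range(max_retries):
        item = store.get(key)
        updated = {**item, "views": item.get("views", 0) + 1}
        try:
            store.put_if_version(key, updated, item["version"])
            return updated["views"]
        except ConditionFailed:
            continue  # another writer committed first: re-read and retry
    raise RuntimeError("too many conflicting updates")
```

With a real table, the version check would be expressed as a condition on the write request, so that the comparison and the write happen atomically on the server.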

Events_table_2012
  Event_id (hash key)
  Timestamp (range key)
  A...

Events_table_2012_05_week1
  Event_id (hash key)

Events_table_2012_05_wee...
  Event_id (hash key)

Events_table_2012_05_wee...
  Event_id (hash key)
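The table-per-period pattern needs a deterministic way to pick the table for a given event. A minimal sketch that produces names in the style of `Events_table_2012_05_week1`; the week-of-month rule (days 1 to 7 are week 1, and so on) is an assumption, since the slides do not define it:

```python
from datetime import date


def weekly_table_name(day: date, prefix: str = "Events_table") -> str:
    """Name of the table that holds events for the given day."""
    week_of_month = (day.day - 1) // 7 + 1  # assumption: days 1-7 are week 1
    return f"{prefix}_{day.year}_{day.month:02d}_week{week_of_month}"


print(weekly_table_name(date(2012, 5, 3)))  # Events_table_2012_05_week1
```

Keeping each period in its own table lets older tables be handled separately from the table that receives current writes.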


Amazon ElastiCache (Memcached)



When to use

Transient key-value store

Need to speed up reads/writes
Caching frequent SQL, NoSQL, or DW query results
Saving transient and frequently updated data
  Increment/decrement game scores/counters
  Web application session storage
Best effort deduplication

When not to use

Storing infrequently used data

Need persistence


Amazon ElastiCache (Memcached) Best Practices

Use autodiscovery
Share memcached client objects in application
Use TTLs
Consider memory for connections overhead
Use Amazon CloudWatch alarms / SNS alerts
  Number of connections
  Swap memory usage
  Freeable memory
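The "Use TTLs" advice pairs naturally with the cache-aside pattern for caching query results. A minimal sketch, using a small in-process class as a stand-in for a memcached client so that it runs without a cache cluster; the class, function names, and the 60-second default are assumptions for illustration:

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a memcached client with per-key TTLs."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave like a cache miss
            return None
        return value


def cached_query(cache, key, load, ttl_seconds=60):
    """Cache-aside read: return the cached value, or load and cache it."""
    value = cache.get(key)
    if value is None:
        value = load()
        cache.set(key, value, ttl_seconds)
    return value
```

The TTL bounds how stale a cached result can become, which matters because the cache is transient and is never the system of record.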

Amazon ElastiCache (Redis)



When to use

Key-value store with advanced data structures
  Strings, lists, sets, sorted sets, hashes
Caching
Leader boards
High-speed sorting
Atomic counters
Queuing systems
Activity streams

When not to use

Need native sharding
Need hard persistence
Data won't fit in memory
Need transaction rollback under exceptions


Amazon ElastiCache (Redis) Best Practices

Use TTL
Use the right instance types
  Instances with high ECU/vCPU and network performance yield the highest throughput
  Example: m2.4xlarge, m2.2xlarge
Use read replicas
  Increase read throughput
  AOF cannot protect against all failure modes
  Promote read replicas to primary
Use RDB file snapshot for on-premises to Amazon ElastiCache
Key parameter group settings
  Avoid AOF with fsync always: huge impact on performance
  AOF (+ RDB) with fsync everysec: best durability + performance
  Pub-sub: set client-output-buffer-limit-pubsub-hard-limit and client-output-buffer-limit- based on the workloads

Amazon CloudSearch



When to use

No search expertise

Full-text search

Ranking

Relevance

Structured and unstructured data

Faceting

$0 to $10 (4 items)

$10 and above (3 items)

When not to use

Not as a replacement for ...

Not as a system of record

Transient data

Nonatomic updates


Amazon CloudSearch Best Practices

Batch documents for uploading
Use Amazon CloudSearch for searching and another store for retrieving full records for the UI (i.e., don't use return fields)
Include other data like popularity scores in documents
Use stop words to remove common terms
Use fielded queries to reduce match sets
Query latency is proportional to query specificity
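Batching documents for upload comes down to grouping them so that each request stays under a size cap. A minimal sketch, assuming documents are JSON-serializable dicts and a 5 MB cap per batch; the cap is a parameter and should be checked against the current service limits, and the function name is illustrative only:

```python
import json


def batch_documents(docs, max_batch_bytes=5 * 1024 * 1024):
    """Yield lists of documents whose JSON encoding stays under the size cap."""
    batch, batch_bytes = [], 2  # 2 bytes for the enclosing "[" and "]"
    for doc in docs:
        # +2 covers the ", " separator json.dumps puts between list items.
        doc_bytes = len(json.dumps(doc).encode("utf-8")) + 2
        if batch and batch_bytes + doc_bytes > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 2
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch
```

A single document larger than the cap is still yielded on its own, so callers that need a hard limit should validate document size separately.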

Amazon Redshift



When to use

Information analysis and reporting
Complex DW queries that summarize historical data
Batched large updates, e.g., daily sales totals
10s of concurrent queries
100s GB to PB
Compression
Column based
Very high durability

When not to use

OLTP workloads
1000s of concurrent ...

Large number of sinupdates


Amazon Redshift Best Practices

Use COPY command to load large data sets from Amazon S3, Amazon DynamoDB, Amazon EMR/EC2/Unix
  Split your data into multiple files
  Use GZIP or LZOP compression
  Use manifest file
Choose proper sort key
  Range or equality on WHERE clause
Choose proper distribution key
  Join column, foreign key or largest dimension, group by column
  Avoid distribution key for denormalized data
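The COPY advice above (multiple files, compression, manifest) can be illustrated by generating the manifest and the statement that references it. A sketch under stated assumptions: the bucket, keys, table name, and role ARN are placeholders, and authorization is shown with an `IAM_ROLE` clause, which is one of the options COPY accepts and not necessarily what the original talk used:

```python
import json


def build_manifest(bucket, keys):
    """Return the manifest document listing every file COPY should load."""
    entries = [
        {"url": f"s3://{bucket}/{key}", "mandatory": True} for key in keys
    ]
    return json.dumps({"entries": entries}, indent=2)


def copy_statement(table, manifest_url, iam_role):
    """Build a COPY statement that loads gzip-compressed files via a manifest."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role}' GZIP MANIFEST;"
    )


# Placeholder names, for illustration only.
parts = [f"sales/part-{i:04d}.gz" for i in range(4)]
print(build_manifest("example-bucket", parts))
print(copy_statement("sales", "s3://example-bucket/sales/manifest",
                     "arn:aws:iam::123456789012:role/ExampleRole"))
```

Marking entries as mandatory makes the load fail if a listed file is missing, instead of silently loading a partial data set. The statement is built by string formatting, so it should only ever receive trusted values.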

Amazon Elastic MapReduce



When to use

Batch analytics/processing
Answers in minutes or hours
Structured and unstructured data
Parallel scans of the entire dataset with uniform query performance
Supports Hive QL + other languages
GB, TB, or PB of data
Replicated data store (HDFS) for ad-hoc and real-time queries (HBase)

When not to use

Real-time analytics (D...
Need answers in seconds
1000s of concurrent users

Amazon Elastic MapReduce Best Practices

Choose between transient and persistent clusters for best TCO

Leverage Amazon S3 integration for highly durable and interim storage

Right-size cluster instances based on each job, not one size fits all

Leverage resizing and spot to add and remove capacity cost-effectively

Tuning cluster instances can be easier than tuning Hadoop code

AWS Data Pipeline



When to use

Automate movement and transformation of data (ETL in the cloud)
Dependency management
  Data
  Control
Schedule management
Transient Amazon EMR clusters
Regular data move pattern
  Every hour, day
  Every 30 minutes
Amazon DynamoDB backups
  Cross region

When not to use

Less than 15 minutes scheduling interval
Execution latency less than ...
Event-based scheduling


AWS Data Pipeline Best Practices

Use dependency-based rather than time-based scheduling

Make your activities idempotent

Add in your tools using shell activity

Use Amazon S3 for staging

Amazon S3



When to use

Store large objects

Key-value store - Get/Put/List
Unlimited storage
Versioning
Very high durability
  99.999999999%
Very high throughput (via parallel clients)
Use for storing persistent data
  Backups
  Source/target for EMR
  Blob store with metadata in SQL or NoSQL

When not to use

Complex queries
Very low latency (ms)
Search
Read-after-write consistency for overwrites
Need transactions

Amazon S3 Best Practices



Use random hash prefix for keys
  Ensure a random access pattern
Use Amazon CloudFront for high throughput GETs and PUTs
Leverage the high durability, high throughput design of Amazon S3 for backup and as a common storage sink
  Durable sink between data services
  Supports de-coupling and asynchronous delivery
Consider RRS for lower cost, lower durability storage of derivatives or copies
Consider parallel threads and multipart upload for faster writes
Consider parallel threads and range get for faster reads
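Two of the practices above are easy to show in code: deriving a hash prefix for keys, and splitting an object into byte ranges that parallel threads can fetch. A minimal sketch; the choice of SHA-256, the 4-character prefix, and the function names are assumptions, since the slides only say "random hash prefix":

```python
import hashlib


def hashed_key(natural_key: str, prefix_len: int = 4) -> str:
    """Prepend a short hex hash so that sequential keys spread out."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{natural_key}"


def byte_ranges(total_size: int, part_size: int):
    """Split an object into inclusive (start, end) byte ranges for range GETs."""
    return [
        (start, min(start + part_size, total_size) - 1)
        for start in range(0, total_size, part_size)
    ]


print(hashed_key("2013-11-13/log-0001.gz"))
print(byte_ranges(10, 4))  # [(0, 3), (4, 7), (8, 9)]
```

The prefix is derived from the key itself, so a reader can recompute the full key from the natural name without a lookup table. Each (start, end) pair maps directly onto an HTTP `Range: bytes=start-end` header.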

Amazon Glacier



When to use

Infrequently accessed data sets
Very low cost storage
Data retrieval times of several hours are acceptable
Encryption at rest
Very high durability
  99.999999999%
Unlimited amount of storage

When not to use

Frequent access
Low latency access

Amazon Glacier Best Practices



Reduce request and storage costs with aggregation
  Aggregate your files into bigger files before sending them to Amazon Glacier
  Store checksums along with your files
  Use a format that allows you to access files within your aggregate
Improve speed and reliability with multipart upload
Reduce costs with ranged retrievals
Maintain your own index in a highly durable store
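Aggregation, checksums, and a separate index can be combined in one step before upload. A minimal sketch that packs small files into a tar archive in memory and builds an index of sizes and SHA-256 checksums; tar is only one possible container format, and the function name and index layout are assumptions for illustration:

```python
import hashlib
import io
import tarfile


def build_archive(files: dict[str, bytes]):
    """Pack files into one tar archive and return (archive_bytes, index)."""
    buffer = io.BytesIO()
    index = {}
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            # The index is kept outside the archive, e.g. in a durable store.
            index[name] = {
                "size": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
    return buffer.getvalue(), index


archive, index = build_archive({"a.txt": b"hello", "b.txt": b"world!"})
print(len(archive), sorted(index))
```

Storing the index elsewhere means the contents of an archive can be looked up and verified without first retrieving it.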

Amazon EC2 + Amazon EBS/Instance Storage

When to use

Alternate data store technologies
Hand-tuned performance needs
Direct/admin access required

When not to use

When a managed service does the job
When operational experience is low

Amazon EBS Best Practices



Pick the right EC2 instance type
  Higher network performance instances for driving more Amazon EBS IOPS
  EBS-Optimized EC2 instances for dedicated throughput between EC2 & Amazon EBS
Use provisioned IOPS volumes for database workloads requiring consistent IOPS
Use standard volumes for workloads requiring low to moderate IOPS & occasional bursts
Stripe multiple Amazon EBS volumes for higher IOPS or storage
  RAID0 for higher I/O
  RAID10 for highest local durability
Amazon EBS snapshots
  Quiesce the file system and take a snapshot
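Striping is sized with simple arithmetic: divide the target IOPS by what one volume delivers. A minimal sketch, assuming aggregate IOPS scale linearly with the number of striped volumes (an idealization) and using the 4000 IOPS per provisioned IOPS volume figure quoted earlier in the deck as the default; the function name is illustrative only:

```python
import math


def volumes_for_iops(target_iops: int, iops_per_volume: int = 4000,
                     mirrored: bool = False) -> int:
    """Number of striped volumes needed to reach a target aggregate IOPS."""
    stripes = math.ceil(target_iops / iops_per_volume)
    # RAID10 mirrors every stripe member, doubling the volume count.
    return stripes * 2 if mirrored else stripes


print(volumes_for_iops(10_000))                 # 3 volumes in RAID0
print(volumes_for_iops(10_000, mirrored=True))  # 6 volumes in RAID10
```

In practice the instance's own network and EBS throughput can cap the result before the volumes do, which is why the instance type comes first in the list above.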

Amazon EC2 Best Practices


HI - Best IOPS/$

HS - Best GB/$



Summary

Cloud Data Tier Architecture Anti-Pattern

Data Tier

AWS Data Tier Architecture - Use the right tool for the job

Amazon RDS

Amazon CloudSearch

Amazon DynamoDB

Amazon ElastiCache

Amazon S3



Amazon RDS

Amazon CloudSearch

Amazon DynamoDB

Amazon ElastiCache

Amazon EMR

Amazon S3

AWS Data Pipeline

Amazon Redshift

Cost Conscious Design




