AWS DB Best Practices


  • 8/12/2019 AWS DB Best Practices

    1/61

    2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of A

    DAT203 - AWS Storage and Database Architecture Best Practices

    Siva Raghupathy, Amazon Web Services


    The Third Platform

    Built on:

    Mobile devices

    Cloud services

    Social technologies

    Big data

    Billions of users

    Millions of apps


    Data Volume, Velocity, Variety

    2.7 zettabytes (ZB) of data exist in the digital universe today

    1 ZB = 1 billion terabytes

    450 billion transactions per day by 2020

    More unstructured data than structured data


    Common Questions from Database Developers

    Cloud Migration

    How do I move (my data) to the cloud?

    Data/Storage Technologies

    What data store should I use?

    SQL or NoSQL?

    Hadoop or DW? What about search?

    Management Concerns

    Is my data (in the clou

    Relational features w/nightmares?

    My data volume, velocare exploding!

    How can I reduce cos

    Performance and Delive

    Need low latency (ms

    Need high throughput

    Need to ship in days


    Cloud Data Tier Anti-Pattern

    Data Tier


    Cloud Data Tier Architecture Use the Right Tool

    App/Web Tier

    Client Tier

    Data Tier

    Search

    Ha

    Cache

    E

    Blob Store

    SQL

    NoSQL

    Data Warehouse


    Compute Storage

    AWS Global Infrastructure

    Database

    App Services

    Deployment & Administration

    Networking

    AWS


    AWS Managed Database & Storage Services

    Structured, Complex Query
      SQL: Amazon RDS (MySQL, Oracle, SQL Server)
      Data Warehouse: Amazon Redshift
      Search: Amazon CloudSearch

    Unstructured, Custom Query
      Hadoop: Amazon Elastic MapReduce (EMR)

    Structured, Simple Query
      NoSQL: Amazon DynamoDB
      Cache: Amazon ElastiCache (Memcached, Redis)

    Unstructured, No Query
      Cloud Storage: Amazon S3, Amazon Glacier


    AWS PrimitiveCompute and Storag

    Compute Capabilities

    Many different EC2 instance types
      General purpose
      Compute optimized
      Storage optimized
      Memory optimized

    Host any major data storage technology
      RDBMS
      NoSQL
      Cache

    Raw Storage Options

    EC2 Instance store (e

    Amazon Elastic Block
      Standard volume
        1 TB, ~100 IOPS pe
      Provisioned IOPS vo
        1 TB, up to 4000 IO
      Stripe multiple volum
      IOPS or storage

    Primitives add flexibility, but also come with operatio


    AWS Data Tier Architecture - Use the right tool fo

    D

    Amazon RDS

    Amazon CloudSearch

    Amazon DynamoDB

    Amazon ElastiCache

    Amazon Elastic MapReduce

    Amazon S3

    Amazon Redshift

    AWS Data Pipeline


    Reference Architecture


    Reference Architecture

    Amazon RDS

    Amazon CloudSearch

    Amazon DynamoDB

    Amazon ElastiCache

    Amazon EMR

    Amazon S3

    AWS

    AR


    Use Case: A Video Streaming Application


    Use Case: A Video Streaming App U

    Amazon DynamoDB

    Amazon RDS

    Amazon CloudSearch

    Amazon S3


    A Video Streaming App Discove

    Amazon ElastiCache

    CloudFront

    Amazon DynamoDB

    Amazon RDS

    Amazon CloudSearch

    Amazon S3


    Use Case: A Video Streaming App

    Amazon S3

    Amazon DynamoDB

    Amazon EMR


    Use Case: A Video Streaming App Anal

    Amazon EMR

    Amazon S3

    AmRe


    What is the temperature of your data?


    Data Characteristics: Hot, Warm, Cold

                    Hot          Warm       Cold
    Volume          MB-GB        GB-TB      PB
    Item size       B-KB         KB-MB      KB-T
    Latency         ms           ms, sec    min,
    Durability      Low-High     High       Very
    Request rate    Very High    High       Low
    Cost/GB         $$-$         $-

    Low


    What data store should I use?

    [Figure: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon EMR, and Amazon S3 arranged along axes for request rate, cost/GB, latency, data volume, and structure, each running between low and high.]


    What data store should I use?

                        ElastiCache    Amazon DynamoDB     Amazon RDS         CloudSearch     Amazon Redshift      Amazon EMR (Hive)   A
    Average latency     ms             ms                  ms, sec            ms, sec         sec, min             sec, min, hrs       m(
    Data volume         GB             GB-TBs (no limit)   GB-TB (3 TB max)   GB-TB           TB-PB (1.6 PB max)   GB-PB (~nodes)      G(
    Item size           B-KB           KB (64 KB max)      KB (~row size)     KB (1 MB max)   KB (64 K max)        KB-MB               K(
    Request rate        Very High      Very High           High               High            Low                  Low                 LV(
    Durability          Low-Moderate   Very High           High               High            High                 High                V

    Storage cost $/GB/month: $$ $

    Hot Data Warm Data


    AWS Data Tier Architecture - Use the right tool f

    Da

    Amazon RDS

    Amazon CloudSearch

    Amazon DynamoDB

    Amazon ElastiCache

    Amazon Elastic MapReduce

    Amazon S3

    Amazon Redshift

    AWS Data Pipeline


    Cost Conscious Design


    Cost Conscious Design

    Example: Should I use Amazon S3 or Amazon Dy

    I'm currently scoping out a project that will greatly
    my team's use of Amazon S3. Hoping you could an
    some questions. The current iteration of the design
    many small files, perhaps up to a billion during pea
    total size would be on the order of 1.5 TB per mont

    Request rate (writes/sec)   Object size (bytes)   Total size (GB/month)   Objects per month
    300                         2,048                 1,483                   777,600,000
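
The sizing figures above can be reproduced with a little arithmetic. The following is a minimal sketch, assuming a 30-day month and binary gigabytes (2^30 bytes); both are assumptions chosen because they match the slide's numbers, and the function name is illustrative only.

```python
from typing import Tuple

SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30        # assumption: a 30-day month
BYTES_PER_GB = 2 ** 30     # assumption: binary gigabytes


def monthly_footprint(writes_per_sec: int, object_bytes: int) -> Tuple[int, float]:
    """Return (objects per month, total GB per month) for a steady write rate."""
    objects = writes_per_sec * SECONDS_PER_DAY * DAYS_PER_MONTH
    total_gb = objects * object_bytes / BYTES_PER_GB
    return objects, total_gb


objects, total_gb = monthly_footprint(300, 2048)
print(f"{objects:,} objects, {total_gb:,.0f} GB per month")
# prints "777,600,000 objects, 1,483 GB per month"
```

With the object count and total size in hand, the per-request and per-GB prices of each candidate store can be compared for the same workload.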


    Cost Conscious Design

    Example: Should I use Amazon S3 or Amazon Dyn

    Request rate   Object size   Total size

    A S3

    http://calculator.s3.amazonaws.com/calc5.html#r=IAD&key=calc-736174F7-ECD3-4636-BB5A-0AF2DF8F4D4E

    Request rate (writes/sec)   Object size (bytes)   Total size (GB/month)
    300                         2,048                 1,483

    Amazon S3 or Amazon DynamoDB?

    http://calculator.s3.amazonaws.com/calc5.html#r=IAD&key=calc-736174F7-ECD3-4636-BB5A-0AF2DF8F4D4E

                  Request rate (writes/sec)   Object size (bytes)   Total size (GB/month)   Objects per month
    Scenario 1    300                         2,048                 1,483                   777,600,000
    Scenario 2    300                         32,768                23,730                  777,600,000

    Amazon S3

    Amazon DynamoDB

    use

    use

    http://calculator.s3.amazonaws.com/calc5.html#r=IAD&key=calc-736174F7-ECD3-4636-BB5A-0AF2DF8F4D4E

    http://calculator.s3.amazonaws.com/calc5.html#r=IAD&key=calc-24CBA60C-49D4-4D42-84B6-B33E2C980C94

    Best Practices

    Amazon RDS


    When to use

    Transactions

    Complex queries

    Medium to high query/write rate
      Up to 30 K IOPS (15 K reads + 15 K writes)

    100s of GB to low TBs

    Workload can fit in a single node

    High durability

    When not to use

    Massive read/write ra
      Example: 150 K writ
      second

    Data size or throughp
      sharding
      Example: 10 s or 10

    Simple Get/Put and q
      NoSQL can handle

    Complex analytics

    Push-Button Scaling

    Region

    Multi-AZ

    AZ 1 AZ 2

    Amazon RDS


    Amazon RDS Best Practices

    Use the right DB instance class

    Use EBS-optimized instances
      db.m1.large, db.m1.xlarge, db.m2.2xlarge, db.m2.4xlarge, db.cr1.8xlarge

    Use provisioned IOPS

    Use multi-AZ for high availability

    Use read replicas for
      Scaling reads
      Schema changes
      Additional failure recovery

    Amazon DynamoDB


    When to use

    Fast and predictable performance

    Seamless/massive scale

    Autosharding

    Consistent/low latency

    No size or throughput limits

    Very high durability

    Key-value or simple queries

    When not to use

    Need multi-item/row otransactions

    Need complex queries

    Need real-time analytihistoric data

    Storing cold data

    Amazon DynamoDB


    Amazon DynamoDB Best Practices

    Keep item size small

    Store metadata in Amazon DynamoDB and large blobs in Amazon S3

    Use a table with a hash key for extremely high scale

    Use table per day, week, month etc. for storing time series data

    Use conditional/OCC updates

    Use hash-range key to model 1:N relationships

    Multi-tenancy

    Avoid hot keys and hot partitions

    Events_table_2012
      Event_id (hash key), Timestamp (range key), A

    Events_table_2012_05_week1
      Event_id (hash key), Timestamp (range key), A

    Events_table_2012_05_wee
      Event_id (hash key), Timestamp (range key)

    Events_table_2012_05_wee
      Event_id (hash key), Timestamp (range key)
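
The table-per-period pattern shown above needs a way to route each event to the table for its time window. The following is a minimal sketch of that routing. The function name and the week-of-month rule (days 1-7 are week 1) are assumptions for illustration; only the `Events_table_...` naming comes from the slide.

```python
from datetime import datetime


def events_table_name(ts: datetime, granularity: str = "week") -> str:
    """Map an event timestamp to the name of the table for its time period."""
    if granularity == "year":
        return f"Events_table_{ts:%Y}"
    if granularity == "month":
        return f"Events_table_{ts:%Y_%m}"
    if granularity == "week":
        # Assumption: week-of-month, where days 1-7 fall in week 1.
        week_of_month = (ts.day - 1) // 7 + 1
        return f"Events_table_{ts:%Y_%m}_week{week_of_month}"
    raise ValueError(f"unsupported granularity: {granularity}")


print(events_table_name(datetime(2012, 5, 3)))          # Events_table_2012_05_week1
print(events_table_name(datetime(2012, 5, 3), "year"))  # Events_table_2012
```

Because older periods live in their own tables, their provisioned throughput can be lowered, or the whole table archived and dropped, without touching the table that receives current writes.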

    Amazon ElastiCache (Memcached)


    When to use

    Transient key-value store

    Need to speed up reads/writes

    Caching frequent SQL, NoSQL or DW query results

    Saving transient and frequently updated data
      Increment/decrement game scores/counters
      Web application session storage

    Best effort deduplication

    When not to use

    Store infrequently use

    Need persistence

    Amazon ElastiCache (Memcached)


    Amazon ElastiCache (Memcached) Best Practices

    Use autodiscovery

    Share memcached client objects in application

    Use TTLs

    Consider memory for connections overhead

    Use Amazon CloudWatch alarms / SNS alerts
      Number of connections
      Swap memory usage
      Freeable memory
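
To make the "Use TTLs" advice concrete, here is a minimal cache-aside sketch. `TTLCache` is a hypothetical in-memory stand-in for a Memcached client, not the ElastiCache API, and `cached_query` is an illustrative helper. A real application would share one client object and call its get/set operations instead.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a Memcached client (hypothetical, for illustration)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired entries behave like misses
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)


def cached_query(cache: TTLCache, key: str, load: Callable[[], str],
                 ttl_seconds: float = 60.0) -> str:
    """Return the cached value, or load it from the backing store and cache it."""
    value = cache.get(key)
    if value is None:
        value = load()  # for example, the SQL, NoSQL, or DW query being cached
        cache.set(key, value, ttl_seconds)
    return value
```

The TTL bounds how stale a cached query result can become, so the cache never has to be treated as a persistent store.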

    Amazon ElastiCache (Redis)


    When to use

    Key-value store with advanced data structures
      Strings, lists, sets, sorted sets, hashes

    Caching

    Leader boards

    High-speed sorting

    Atomic counters

    Queuing systems

    Activity streams

    When not to use

    Need native sharding

    Need hard persisten

    Data won't fit in memo

    Need transaction rollb
      under exceptions

    Amazon ElastiCache (Redis)


    Amazon ElastiCache (Redis) Best Practice

    Use TTL

    Use the right instance types
      Instances with high ECU/vCPU and network performance yield the highest throughput. Example: m2.4xlarge, m2.2xlarge

    Use read replicas
      Increase read throughput
      AOF cannot protect against all failure modes
      Promote read replicas to primary

    Use RDB file snapshot for on-premises to Amazon ElastiCache

    Key parameter group settings
      Avoid AOF with fsync always: huge impact on performance
      AOF (+ RDB) with fsync everysec: best durability + performance
      Pub-sub: set client-output-buffer-limit-pubsub-hard-limit and client-output-buffer-limit-based on the workloads

    Amazon CloudSearch


    When to use

    No search expertise

    Full-text search

    Ranking

    Relevance

    Structured and unstructured data

    Faceting

    $0 to $10 (4 items)

    $10 and above (3 items)

    When not to use

    Not as replacement fo

    Not as a system of reco

    Transient data

    Nonatomic updates

    Amazon CloudSearch


    Amazon CloudSearch Best Practice

    Batch documents for uploading

    Use Amazon CloudSearch for searching and ano
    store for retrieving full records for the UI (i.e. don
    return fields)

    Include other data like popularity scores in docum

    Use stop words to remove common terms

    Use fielded queries to reduce match sets
      Query latency is proportional to query specificity
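
Batching documents for upload can be sketched as grouping serialized documents under a byte budget. This is a minimal illustration, not the CloudSearch client API. The function name, the document shape, and the 5 MB default are assumptions, so check the current service limits before relying on them.

```python
import json
from typing import Dict, Iterable, Iterator, List


def batch_documents(docs: Iterable[Dict], max_batch_bytes: int = 5 * 1024 * 1024) -> Iterator[str]:
    """Yield JSON arrays of documents, each within max_batch_bytes when encoded.

    A single document larger than the budget is still emitted, alone in its batch.
    """
    batch: List[str] = []
    size = 2  # bytes for the enclosing "[" and "]"
    for doc in docs:
        encoded = json.dumps(doc)
        doc_size = len(encoded.encode("utf-8"))
        extra = doc_size + (1 if batch else 0)  # plus a comma separator
        if batch and size + extra > max_batch_bytes:
            yield "[" + ",".join(batch) + "]"
            batch, size = [], 2
            extra = doc_size
        batch.append(encoded)
        size += extra
    if batch:
        yield "[" + ",".join(batch) + "]"
```

Each yielded string would be sent as one upload request, so fewer and larger batches mean fewer round trips for the same set of documents.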

    Amazon Redshift


    When to use

    Information analysis and reporting

    Complex DW queries that summarize historical data

    Batched large updates, e.g. daily sales totals

    10s of concurrent queries

    100s GB to PB

    Compression

    Column based

    Very high durability

    When not to use

    OLTP workloads

    1000s of concurrent

    Large number of sin
      updates

    Amazon Redshift


    Amazon Redshift Best Practices

    Use COPY command to load large data sets from S3, Amazon DynamoDB, Amazon EMR/EC2/Unix
      Split your data into multiple files
      Use GZIP or LZOP compression
      Use manifest file

    Choose proper sort key
      Range or equality on WHERE clause

    Choose proper distribution key
      Join column, foreign key or largest dimension, group by column
      Avoid distribution key for denormalized data
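
The COPY preparation steps above (split, compress, list in a manifest) can be sketched as follows. This is a minimal illustration under stated assumptions: the bucket name, key prefix, part naming, and helper names are hypothetical. The manifest layout (an `entries` list of `url` and `mandatory` pairs) follows the format Redshift documents for COPY manifests.

```python
import gzip
import json
from typing import List


def split_and_compress(rows: List[str], parts: int) -> List[bytes]:
    """Split rows round-robin into `parts` files and gzip each one."""
    chunks = [rows[i::parts] for i in range(parts)]
    return [
        gzip.compress("".join(f"{row}\n" for row in chunk).encode("utf-8"))
        for chunk in chunks
    ]


def part_key(prefix: str, index: int) -> str:
    """Hypothetical naming scheme for the uploaded parts."""
    return f"{prefix}.part{index:04d}.gz"


def build_manifest(bucket: str, prefix: str, parts: int) -> str:
    """Build a manifest that lists every part the load is expected to read."""
    entries = [
        {"url": f"s3://{bucket}/{part_key(prefix, i)}", "mandatory": True}
        for i in range(parts)
    ]
    return json.dumps({"entries": entries}, indent=2)


files = split_and_compress([f"{i}|item-{i}" for i in range(10)], parts=4)
print(build_manifest("example-bucket", "sales/2013-11-14", len(files)))
```

Splitting the input lets the cluster load parts in parallel, and the manifest makes the load fail loudly if an expected part is missing rather than silently loading a subset.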

    Amazon Elastic MapReduce


    When to use

    Batch analytics/processing
      Answers in minutes or hours

    Structured and unstructured data

    Parallel scans of the entire dataset with uniform query performance

    Supports Hive QL + other languages

    GB, TB, or PB of data

    Replicated data store (HDFS) for ad-hoc and real-time queries (HBase)

    When not to use

    Real-time analytics (D

    Need answers in sec

    1000s of concurrent u

    Amazon Elastic MapReduce

    Amazon Elastic MapReduce Best Practices

    Choose between transient and persistent clusters for best TCO

    Leverage Amazon S3 integration for highly durable and interim storage

    Right-size cluster instances based on each job, not one size fits all

    Leverage resizing and spot to add and remove capacity cost-effectively

    Tuning cluster instances can be easier than tuning Hadoop code

    AWS Data Pipeline


    AWS Data Pipeline

    When to use

    Automate movement and transformation of data (ETL in the cloud)

    Dependency management
      Data
      Control

    Schedule management

    Transient Amazon EMR clusters

    Regular data move pattern
      Every hour, day
      Every 30 minutes

    Amazon DynamoDB backups
      Cross region

    When not to use

    Less than 15 minutes sch
      interval

    Execution latency less th

    Event-based scheduling


    AWS Data Pipeline Best Practice

    Use dependency rather than time based

    Make your activities idempotent

    Add in your tools using shell activity

    Use Amazon S3 for staging
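
An idempotent activity is one that can be retried or re-run for the same schedule period without changing the result. The following is a minimal sketch of the idea. The function, the event shape, and the key layout are hypothetical, and a plain dict stands in for the Amazon S3 staging location.

```python
from typing import Dict, List


def run_daily_rollup(day: str, events: List[dict], staging: Dict[str, int]) -> None:
    """Recompute the total for `day` and overwrite its staging key.

    The output key is derived from the schedule period alone, so a retry
    replaces the earlier result instead of appending a second copy.
    """
    total = sum(event["amount"] for event in events if event["day"] == day)
    staging[f"rollups/{day}/total"] = total  # overwrite, never append


staging: Dict[str, int] = {}
events = [
    {"day": "2013-11-14", "amount": 10},
    {"day": "2013-11-14", "amount": 20},
    {"day": "2013-11-15", "amount": 5},
]
run_daily_rollup("2013-11-14", events, staging)
run_daily_rollup("2013-11-14", events, staging)  # a retry leaves the same state
print(staging)  # {'rollups/2013-11-14/total': 30}
```

Keying the output by period rather than by run time is what makes the retry safe: the second run writes to the same location with the same content.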

    Amazon S3


    When to use

    Store large objects

    Key-value store - Get/Put/List

    Unlimited storage

    Versioning

    Very high durability
      99.999999999%

    Very high throughput (via parallel clients)

    Use for storing persistent data
      Backups
      Source/target for EMR
      Blob store with metadata in SQL or NoSQL

    When not to use

    Complex queries

    Very low latency (ms)

    Search

    Read-after-write consi
      overwrites

    Need transactions

    Amazon S3 Best Practices


    Use random hash prefix for keys
      Ensure a random access pattern

    Use Amazon CloudFront for high throughput GETs and PU

    Leverage the high durability, high throughput design of Am
    for backup and as a common storage sink
      Durable sink between data services

    Supports de-coupling and asynchronous delivery

    Consider RRS for lower cost, lower durability storage of derivatives or copies

    Consider parallel threads and multipart upload for faster w

    Consider parallel threads and range get for faster reads
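
The hash-prefix advice at the top of this slide can be sketched as deriving a short prefix from the natural key itself. This is a minimal illustration: the function name, the prefix length, and the choice of SHA-256 are assumptions, not something the slide specifies.

```python
import hashlib


def hashed_key(natural_key: str, prefix_len: int = 4) -> str:
    """Prepend a short, deterministic hash prefix to an object key."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{natural_key}"


for name in ("logs/2013-11-14/0001.gz", "logs/2013-11-14/0002.gz"):
    print(hashed_key(name))
```

Because the prefix is computed from the natural key, readers can rebuild the full key without a lookup table. The trade-off is that keys no longer list in date order, so listing by date needs a separate index.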

    Amazon Glacier


    When to use

    Infrequently accessed data sets

    Very low cost storage

    Data retrieval times of several hours is acceptable

    Encryption at rest

    Very high durability
      99.999999999%

    Unlimited amount of storage

    When not to use

    Frequent access

    Low latency access

    Amazon Glacier Best Practices


    Reduce request and storage costs with aggrega

      Aggregating your files into bigger files before sending them to Am

      Store checksums along with your files

      Use a format that allows you to access files within your aggregate

    Improve speed and reliability with multipart uploa

    Reduce costs with ranged retrievals

    Maintaining your own index in a highly durable s
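
Aggregation with checksums and a separately kept index can be sketched with the standard library. This is a minimal illustration under stated assumptions: tar is one possible aggregate format that allows access to individual members, SHA-256 is an arbitrary checksum choice, and the function name is hypothetical. Uploading the archive and persisting the index are left out.

```python
import hashlib
import io
import tarfile
from typing import Dict, Tuple


def build_archive(files: Dict[str, bytes]) -> Tuple[bytes, Dict[str, str]]:
    """Pack small files into one tar archive and return (archive bytes, index).

    The index maps each member name to its SHA-256 checksum so that contents
    can be verified after a later retrieval.
    """
    index: Dict[str, str] = {}
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            index[name] = hashlib.sha256(data).hexdigest()
    return buffer.getvalue(), index


archive, index = build_archive({"a.txt": b"hello", "b.txt": b"world"})
print(len(archive), sorted(index))
```

One large archive replaces many small uploads, which is where the request savings come from, and the index answers "which archive holds this file?" without retrieving anything.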

    Amazon EC2 + Amazon EBS/Instanc


    When to use

    Alternate data store technologies

    Hand-tuned performance needs

    Direct/admin access required

    When not to use

    When a managed serv
      the job

    When operational exp
      low

    Storage

    Amazon EBS Best Practices


    Pick the right EC2 instance type
      Higher network performance instances for driving more Amazon EBS IOPS
      EBS-Optimized EC2 instances for dedicated throughput between EC2 & Amazo

    Use provisioned IOPS volumes for database workloads re
    consistent IOPS

    Use standard volumes for workloads requiring low to mod
    & occasional bursts

    Stripe multiple Amazon EBS volumes for higher IOPS or s
      RAID0 for higher I/O
      RAID10 for highest local durability

    Amazon EBS snapshots
      Quiesce the file system and take a snapshot
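
The RAID0 versus RAID10 trade-off above comes down to simple arithmetic. The following is an idealized sketch: it assumes identical volumes, counts each mirrored write twice for RAID10, and ignores instance-level throughput limits. The function name and the example inputs (four 1 TB volumes at 4,000 IOPS each) are illustrative.

```python
from typing import Tuple


def striped_array(volumes: int, size_tb: float, iops_per_volume: int,
                  level: str = "raid0") -> Tuple[float, int]:
    """Return (usable TB, approximate write IOPS ceiling) for a striped array."""
    if volumes < 1:
        raise ValueError("need at least one volume")
    if level == "raid0":
        # Every volume contributes both capacity and IOPS.
        return volumes * size_tb, volumes * iops_per_volume
    if level == "raid10":
        if volumes < 2 or volumes % 2:
            raise ValueError("RAID10 needs an even number of volumes")
        pairs = volumes // 2  # each write lands on both halves of a mirror
        return pairs * size_tb, pairs * iops_per_volume
    raise ValueError(f"unsupported level: {level}")


print(striped_array(4, 1.0, 4000))            # (4.0, 16000)
print(striped_array(4, 1.0, 4000, "raid10"))  # (2.0, 8000)
```

RAID10 gives up half the capacity and write IOPS of the same volumes in exchange for surviving the loss of one volume in a mirrored pair.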

    Amazon EC2 Best Practices


    HI - Best IOPS/$

    HS - Best GB/$


    Summary

    Cloud Data Tier Architecture Anti-Pattern


    Data Tier

    AWS Data Tier Architecture - Use the right tool f


    Da

    Amazon RDS

    Amazon CloudSearch

    Amazon DynamoDB

    Amazon ElastiCache

    Amazon Elastic MapReduce

    Amazon S3

    Amazon Redshift

    AWS Data Pipeline


    Reference Architecture

    Amazon RDS

    Amazon CloudSearch

    Amazon DynamoDB

    Amazon ElastiCache

    Amazon EMR

    Amazon S3

    AWS

    AR

    Cost Conscious Design


    Please give us your feedback on this presentation

    As a thank you, we will select prizewinners daily for completed surveys!

    DAT203


    Remember