
(DAT405) Amazon Aurora Deep Dive



Page 1: (DAT405) Amazon Aurora Deep Dive

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Anurag Gupta, VP, Aurora, Amazon Redshift, EMR

October 2015

DAT405

Amazon Aurora Deep Dive
Design Overview of Aurora
Performance and Availability

6-way replication across 3 AZs
Custom, scale-out SSD storage
Less than 30s failovers or crash recovery
Shared storage across replicas
Up to 15 read replicas that act as failover targets
Pay for the storage you use
Automatic hotspot management
Automatic IOPS provisioning
100K writes/second & 500K reads/second
Buffer caches that survive database restarts
SQL fault injection
MySQL compatible
Automatic volume growth
Up to 64TB databases
Proactive data block corruption detection
Automated continuous backups to S3
Automated repair of bad disks
Peer-to-peer gossip replication
Quorum writes tolerate drive or AZ failures
1/10th the cost of commercial databases
Less than 10ms replica lag

Commercial Performance at Open Source Prices

Page 2: (DAT405) Amazon Aurora Deep Dive

MySQL-compatible relational database

Performance and availability of commercial databases

Simplicity and cost-effectiveness of open source databases

What is Amazon Aurora?

Page 3: (DAT405) Amazon Aurora Deep Dive

MySQL SysBench on R3.8XL with 32 cores and 244 GB RAM

WRITE PERFORMANCE: 4 client machines with 1,000 connections each
READ PERFORMANCE: single client machine with 1,600 connections

SQL benchmark results

Page 4: (DAT405) Amazon Aurora Deep Dive

Reproducing these results

https://d0.awsstatic.com/product-marketing/Aurora/RDS_Aurora_Performance_Assessment_Benchmarking_v1-2.pdf

[Diagram: four R3.8XLARGE EC2 SysBench clients driving a single R3.8XLARGE Amazon Aurora instance]

1. Create an Amazon VPC (or use an existing one).
2. Create four EC2 R3.8XL client instances to run the SysBench client. All four should be in the same AZ.
3. Enable enhanced networking on your clients.
4. Tune your Linux settings (see whitepaper).
5. Install SysBench version 0.5.
6. Launch an r3.8xlarge Amazon Aurora DB instance in the same VPC and AZ as your clients.
7. Start your benchmark!
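The run itself boils down to a SysBench prepare phase and a run phase on each client. A hedged sketch follows: the endpoint, user, and flag values are placeholders (not values from the talk), and the exact SysBench 0.5 flags on your build may differ; see the linked whitepaper for the authoritative settings.

```shell
# Sketch of SysBench 0.5 invocations for the steps above.
# ENDPOINT and the mysql user are placeholders for your own cluster.
ENDPOINT="your-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com"
TABLES=250
THREADS=1000

# Load phase: create and populate the test tables (run once).
PREPARE="sysbench --test=oltp.lua --mysql-host=$ENDPOINT --mysql-user=admin --oltp-tables-count=$TABLES prepare"

# Run phase: run this on each of the four client instances.
RUN="sysbench --test=oltp.lua --mysql-host=$ENDPOINT --mysql-user=admin --oltp-tables-count=$TABLES --num-threads=$THREADS --max-time=1800 run"

echo "$PREPARE"
echo "$RUN"
```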

Page 5: (DAT405) Amazon Aurora Deep Dive

Beyond benchmarks

If only real-world applications saw benchmark performance

DISTORTIONS IN THE REAL WORLD

Requests contend with each other

Metadata rarely fits in data dictionary cache

Data rarely fits in buffer cache

Production databases need to run HA

Page 6: (DAT405) Amazon Aurora Deep Dive

Scaling user connections

SysBench OLTP workload

250 tables

Connections | Amazon Aurora | RDS MySQL 30K IOPS (single AZ)
50          | 40,000        | 10,000
500         | 71,000        | 21,000
5,000       | 110,000       | 13,000

UP TO 8x FASTER

Page 7: (DAT405) Amazon Aurora Deep Dive

Scaling table count

Tables | Amazon Aurora | MySQL I2.8XL local SSD | MySQL I2.8XL RAM disk | RDS MySQL 30K IOPS (single AZ)
10     | 60,000 | 18,000 | 22,000 | 25,000
100    | 66,000 | 19,000 | 24,000 | 23,000
1,000  | 64,000 | 7,000  | 18,000 | 8,000
10,000 | 54,000 | 4,000  | 8,000  | 5,000

SysBench write-only workload, 1,000 connections, default settings

UP TO 11x FASTER

Page 8: (DAT405) Amazon Aurora Deep Dive

Scaling data set size

SYSBENCH WRITE-ONLY

DB Size | Amazon Aurora | RDS MySQL 30K IOPS (single AZ)
1GB     | 107,000 | 8,400
10GB    | 107,000 | 2,400
100GB   | 101,000 | 1,500
1TB     | 26,000  | 1,200

UP TO 67x FASTER

CLOUDHARMONY TPC-C

DB Size | Amazon Aurora | RDS MySQL 30K IOPS (single AZ)
80GB    | 12,582 | 585
800GB   | 9,406  | 69

UP TO 136x FASTER

Page 9: (DAT405) Amazon Aurora Deep Dive

Running with read replicas

Updates per second | Amazon Aurora | RDS MySQL 30K IOPS (single AZ)
1,000  | 2.62 ms | 0 s
2,000  | 3.42 ms | 1 s
5,000  | 3.94 ms | 60 s
10,000 | 5.38 ms | 300 s

SysBench write-only workload, 250 tables

UP TO 500x LOWER LAG

Page 10: (DAT405) Amazon Aurora Deep Dive

Do fewer IOs

Minimize network packets

Cache prior results

Offload the database engine

DO LESS WORK

Process asynchronously

Reduce latency path

Use lock-free data structures

Batch operations together

BE MORE EFFICIENT

How do we achieve these results?

DATABASES ARE ALL ABOUT I/O

NETWORK-ATTACHED STORAGE IS ALL ABOUT PACKETS/SECOND

HIGH-THROUGHPUT PROCESSING DOES NOT ALLOW CONTEXT SWITCHES

Page 11: (DAT405) Amazon Aurora Deep Dive

IO traffic in RDS MySQL

TYPE OF WRITE: BINLOG, DATA, DOUBLE-WRITE, LOG, FRM FILES

MYSQL WITH STANDBY

Issue write to EBS – EBS issues to mirror, ack when both done

Stage write to standby instance using DRBD

Issue write to EBS on standby instance

IO FLOW

Steps 1, 3, 5 are sequential and synchronous

This amplifies both latency and jitter

Many types of writes for each user operation

Have to write data blocks twice to avoid torn writes

OBSERVATIONS

780K transactions

7,388K I/Os per million txns (excludes mirroring, standby)

Average 7.4 I/Os per transaction

PERFORMANCE

30 minute SysBench write-only workload, 100 GB data set, RDS Single-AZ, 30K PIOPS

[Diagram: primary instance in AZ 1 writes to Amazon Elastic Block Store (EBS) and its EBS mirror; writes are staged via DRBD to a standby instance in AZ 2 with its own EBS mirror; backups flow to Amazon S3; steps 1-5 marked on the IO path]

Page 12: (DAT405) Amazon Aurora Deep Dive

IO traffic in Aurora (database)

AMAZON AURORA

[Diagram: primary instance in AZ 1 and replica instance in AZ 2 issue ASYNC 4/6 QUORUM DISTRIBUTED WRITES to storage spread across AZ 1, AZ 2, and AZ 3; backups to Amazon S3]

TYPE OF WRITE: BINLOG, DATA, DOUBLE-WRITE, LOG, FRM FILES

30 minute SysBench write-only workload, 100 GB data set

IO FLOW

Only write redo log records; all steps asynchronous

No data block writes (checkpoint, cache replacement)

6X more log writes, but 9X less network traffic

Tolerant of network and storage outlier latency

OBSERVATIONS

27,378K transactions (35X MORE)

950K I/Os per 1M txns, 6X amplification (7.7X LESS)

PERFORMANCE

Boxcar redo log records – fully ordered by LSN

Shuffle to appropriate segments – partially ordered

Boxcar to storage nodes and issue writes

Page 13: (DAT405) Amazon Aurora Deep Dive

IO traffic in Aurora (storage node)

[Diagram: storage node internals. LOG RECORDS from the primary instance land in an INCOMING QUEUE, pass through an UPDATE QUEUE into the HOT LOG (with an ACK back to the primary), are sorted, grouped, and coalesced into DATA BLOCKS; the log and POINT IN TIME SNAPSHOTs are staged to S3 BACKUP; GC and SCRUB run in the background; peer storage nodes exchange records via PEER-TO-PEER GOSSIP; steps 1-8 marked]

All steps are asynchronous

Only steps 1 and 2 are in foreground latency path

Input queue is 46X less than MySQL (unamplified, per node)

Favor latency-sensitive operations

Use disk space to buffer against spikes in activity

OBSERVATIONS

IO FLOW

① Receive record and add to in-memory queue

② Persist record and ACK

③ Organize records and identify gaps in log

④ Gossip with peers to fill in holes

⑤ Coalesce log records into new data block versions

⑥ Periodically stage log and new block versions to S3

⑦ Periodically garbage collect old versions

⑧ Periodically validate CRC codes on blocks
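Steps 1-5 above can be sketched in a few lines. This is a toy model, not Aurora's implementation: the class, LSN numbering, and record contents are invented for illustration; it only shows how gossip fills log holes before coalescing.

```python
# Toy sketch of a storage node's log flow (steps 1-5 above).

class StorageNode:
    def __init__(self):
        self.log = {}  # LSN -> record: the persisted "hot log"

    def receive(self, lsn, record):
        # Steps 1-2: persist the record and ACK (the only latency-path work).
        self.log[lsn] = record
        return "ACK"

    def gaps(self, up_to):
        # Step 3: organize records and identify missing LSNs.
        return [l for l in range(1, up_to + 1) if l not in self.log]

    def gossip(self, peer, up_to):
        # Step 4: fill holes from a peer that has the missing records.
        for l in self.gaps(up_to):
            if l in peer.log:
                self.log[l] = peer.log[l]

    def coalesce(self, up_to):
        # Step 5: fold a contiguous log prefix into a new block version.
        assert not self.gaps(up_to), "cannot coalesce across a gap"
        return [self.log[l] for l in range(1, up_to + 1)]

a, b = StorageNode(), StorageNode()
for lsn in (1, 2, 4):
    a.receive(lsn, f"rec{lsn}")
b.receive(3, "rec3")
print(a.gaps(4))      # [3]: node a is missing LSN 3
a.gossip(b, 4)        # peer-to-peer gossip fills the hole
print(a.coalesce(4))  # ['rec1', 'rec2', 'rec3', 'rec4']
```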

Page 14: (DAT405) Amazon Aurora Deep Dive

Asynchronous group commits

[Diagram: TRANSACTIONS T1-Tn each interleave reads, writes, and a commit over TIME; commits for T1-T8 enter a COMMIT QUEUE of pending commits in LSN order (LSN 10, 12, 20, 22, 30, 34, 41, 47, 49, 50) as the durable LSN at the head node advances (LSN GROWTH), committing in groups (GROUP COMMIT)]

TRADITIONAL APPROACH
Maintain a buffer of log records to write out to disk
Issue write when buffer is full or on timeout waiting for writes
First writer has latency penalty when write rate is low

AMAZON AURORA
Request I/O with first write, fill buffer till write picked up
Individual write durable when 4 of 6 storage nodes ACK
Advance DB durable point up to earliest pending ACK
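The Aurora durability rule above can be captured in a short function. A minimal sketch, with invented names and LSN values chosen to echo the diagram; real Aurora tracks this per segment, not in one dictionary.

```python
# A write is durable once 4 of 6 storage nodes ACK; the DB durable
# point advances only up to the earliest write still missing quorum.

QUORUM = 4  # out of 6 storage nodes

def durable_lsn(acks_by_lsn):
    """acks_by_lsn: {lsn: number of node ACKs received}.
    Returns the highest LSN not blocked by an earlier pending write."""
    durable = 0
    for lsn in sorted(acks_by_lsn):
        if acks_by_lsn[lsn] >= QUORUM:
            durable = lsn
        else:
            break  # cannot advance past the earliest pending ACK
    return durable

acks = {10: 6, 12: 5, 22: 4, 30: 3, 34: 5}
print(durable_lsn(acks))  # 22: LSN 30 has only 3/6 ACKs, so stop there
```

Note that LSN 34 already has quorum, but the durable point still stops at 22 because LSN 30 is pending; commits up to the durable point can then be acknowledged as a group.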

Page 15: (DAT405) Amazon Aurora Deep Dive

AURORA THREAD MODEL
Re-entrant connections multiplexed to active threads
Kernel-space epoll() inserts into latch-free event queue
Dynamically sized thread pool
Gracefully handles 5,000+ concurrent client sessions on r3.8xl

MYSQL THREAD MODEL
Standard MySQL – one thread per connection; doesn't scale with connection count
MySQL EE – connections assigned to thread group; requires careful stall threshold tuning

[Diagram: client connections feed through epoll() into a latch-free task queue served by the thread pool]

Adaptive thread pool
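The multiplexing idea above can be sketched with a work queue and a small fixed pool. This is a simplification with invented names: it uses Python's `queue.Queue` (a locking queue) in place of epoll() and a latch-free queue, and a fixed pool rather than a dynamically sized one, purely to show many connections sharing few threads.

```python
# Toy sketch: 100 "client connections" served by a pool of 4 threads,
# instead of one thread per connection.
import queue
import threading

task_queue = queue.Queue()  # stand-in for the latch-free event queue
results = []
results_lock = threading.Lock()

def worker():
    while True:
        conn_id, request = task_queue.get()
        try:
            if conn_id is None:  # shutdown sentinel
                return
            with results_lock:
                results.append((conn_id, request.upper()))  # "execute" it
        finally:
            task_queue.task_done()

pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()

for conn in range(100):            # 100 connections enqueue work
    task_queue.put((conn, "select"))
for _ in pool:                     # one sentinel per worker
    task_queue.put((None, None))
task_queue.join()
for t in pool:
    t.join()
print(len(results))  # 100
```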

Page 16: (DAT405) Amazon Aurora Deep Dive

IO traffic in Aurora (read replica)

[Diagram: MySQL master (30% read / 70% write) replicates to a MySQL replica (30% new reads / 70% write) via SINGLE-THREADED BINLOG APPLY, each with its own data volume; Aurora master (30% read / 70% write) ships updates to an Aurora replica's page cache, the replica serves 100% new reads, and both share Multi-AZ storage]

MYSQL READ SCALING
Logical: Ship SQL statements to replica. Write workload similar on both instances. Independent storage. Can result in data drift between master and replica.

AMAZON AURORA READ SCALING
Physical: Ship redo from master to replica. Replica shares storage; no writes performed. Cached pages have redo applied. Advance read view when all commits seen.
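The physical model above can be sketched as follows. A toy illustration with invented structures (a dict page cache keyed by page id, page "versions" as LSNs): the replica performs no writes, applies shipped redo only to pages already cached, and advances its read view as commits are seen.

```python
# Toy sketch of Aurora physical read scaling on the replica side.

class AuroraReplica:
    def __init__(self):
        self.page_cache = {}  # page_id -> LSN of last redo applied
        self.read_view = 0    # highest commit LSN visible to readers

    def apply_redo(self, lsn, page_id):
        # Only cached pages get redo applied; uncached pages are simply
        # read at the right version from shared storage when needed.
        if page_id in self.page_cache:
            self.page_cache[page_id] = lsn

    def advance_read_view(self, commit_lsn):
        # Advance once all commits up to commit_lsn have been seen.
        self.read_view = max(self.read_view, commit_lsn)

r = AuroraReplica()
r.page_cache["p1"] = 0       # p1 is cached, p2 is not
r.apply_redo(41, "p1")       # cached page: redo applied in place
r.apply_redo(41, "p2")       # uncached page: nothing to do
r.advance_read_view(41)
print(r.page_cache, r.read_view)  # {'p1': 41} 41
```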

Page 17: (DAT405) Amazon Aurora Deep Dive

Improvements over the past few months

BATCH OPERATIONS
Write batch size tuning
Asynchronous send for read/write I/Os
Purge thread performance
Bulk insert performance

OTHER
Failover time reductions
Malloc reduction
System call reductions
Undo slot caching patterns
Cooperative log apply

CUSTOMER FEEDBACK
Binlog and distributed transactions
Lock compression
Read-ahead

LOCK CONTENTION
Hot row contention
Dictionary statistics
Mini-transaction commit code path
Query cache read/write conflicts
Dictionary system mutex

Page 18: (DAT405) Amazon Aurora Deep Dive

Availability

“Performance only matters if your database is up”

Page 19: (DAT405) Amazon Aurora Deep Dive

Storage node availability

Quorum system for read/write; latency tolerant

Peer-to-peer gossip replication to fill in holes

Continuous backup to S3 (designed for 11 9s durability)

Continuous scrubbing of data blocks

Continuous monitoring of nodes and disks for repair

10 GB segments as the unit of repair or hotspot rebalancing, so load can be rebalanced quickly

Quorum membership changes do not stall writes

[Diagram: storage spread across AZ 1, AZ 2, and AZ 3, with continuous backup to Amazon S3]

Page 20: (DAT405) Amazon Aurora Deep Dive

Traditional databases
Have to replay logs since the last checkpoint
Typically 5 minutes between checkpoints
Single-threaded in MySQL; requires a large number of disk accesses

Amazon Aurora
Underlying storage replays redo records on demand as part of a disk read
Parallel, distributed, asynchronous
No replay for startup

[Diagram: checkpointed data plus redo log. In a traditional database, a crash at T0 requires a reapplication of the SQL in the redo log since the last checkpoint; in Aurora, a crash at T0 results in redo logs being applied to each segment on demand, in parallel, asynchronously]

Instant crash recovery
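On-demand replay as described above can be sketched per block. A toy model with invented structures (blocks as integers, redo records as deltas): redo for a block is applied lazily on its first read after the crash, so there is no startup replay phase.

```python
# Toy sketch of on-demand redo apply: recovery happens inside read(),
# per block, only when that block is first needed.

class Segment:
    def __init__(self, blocks, redo):
        self.blocks = blocks  # block_id -> checkpointed value
        self.redo = redo      # block_id -> pending redo deltas

    def read(self, block_id):
        # Apply (and discard) any pending redo before serving the read.
        for delta in self.redo.pop(block_id, []):
            self.blocks[block_id] += delta
        return self.blocks[block_id]

seg = Segment(blocks={"b1": 100, "b2": 7}, redo={"b1": [5, -2]})
print(seg.read("b1"))  # 103: redo applied on first read after crash
print(seg.read("b2"))  # 7: no pending redo for this block
```

Because each segment recovers independently this way, the work is naturally parallel and distributed across storage nodes, which is why there is no single-threaded replay on the database instance.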

Page 21: (DAT405) Amazon Aurora Deep Dive

Survivable caches

We moved the cache out of the

database process

Cache remains warm in the event of a

database restart

Lets you resume fully loaded

operations much faster

Instant crash recovery + survivable

cache = quick and easy recovery from

DB failures

[Diagram: database stack of SQL, Transactions, and Caching layers; the caching process is outside the DB process and remains warm across a database restart]

Page 22: (DAT405) Amazon Aurora Deep Dive

Faster, more predictable failover

MYSQL: App running → DB failure → failure detection → DNS propagation → recovery → recovery

AURORA WITH MARIADB DRIVER: App running → DB failure → failure detection → DNS propagation → recovery

15-20 sec

3-20 sec

Page 23: (DAT405) Amazon Aurora Deep Dive

Simulate failures using SQL

To cause the failure of a component at the database node:
ALTER SYSTEM CRASH [{INSTANCE | DISPATCHER | NODE}]

To simulate the failure of disks:
ALTER SYSTEM SIMULATE percent_failure DISK failure_type IN [DISK index | NODE index] FOR INTERVAL interval

To simulate the failure of networking:
ALTER SYSTEM SIMULATE percent_failure NETWORK failure_type [TO {ALL | read_replica | availability_zone}] FOR INTERVAL interval

Page 24: (DAT405) Amazon Aurora Deep Dive

Amazon Aurora team

work hard. have fun. make history.

Page 25: (DAT405) Amazon Aurora Deep Dive

Thank you!