© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ian Robinson, Specialist Solutions Architect, Data & Analytics, EMEA
18 May, 2017
Deep Dive on Amazon Aurora
Open source compatible relational database
Performance and availability of commercial databases
Simplicity and cost-effectiveness of open source databases
What is Amazon Aurora?
Fastest growing service in AWS history
Business applications
Web and mobile
Content management
E-commerce, retail
Internet of Things
Search, advertising
BI, analytics
Games, media
Aurora Customer Adoption
Performance
Scaling with Instance Sizes
Charts: WRITE PERFORMANCE and READ PERFORMANCE (series: Aurora, MySQL 5.6, MySQL 5.7)
Aurora scales with instance size for both read and write.
Real-Life Data – Gaming Workload
Aurora vs. RDS MySQL – r3.4xlarge, Multi-AZ
Aurora 3X faster on r3.4xlarge
How Did We Achieve This?
I/O
Packets/second
Context switching
A Service-Oriented Architecture Applied to Databases
Control Plane
Data Plane
Amazon DynamoDB
Amazon SWF
Amazon Route 53
Logging + Storage
SQL
Transactions
Caching
Amazon S3
I/O Traffic in MySQL
TYPE OF WRITE: BINLOG, DATA, DOUBLE-WRITE, LOG, FRM FILES
Diagram labels: AZ 1, AZ 2; Primary Instance, Replica Instance; Amazon Elastic Block Store (EBS), EBS mirror (two); Amazon S3; write steps 1 to 5
PERFORMANCE
780K transactions
7,388K I/Os per million txns (excludes mirroring, standby)
Average 7.4 I/Os per transaction
30 minute SysBench write-only workload, 100GB dataset, RDS Multi-AZ, 30K PIOPS
I/O Traffic in Aurora
Diagram labels: AZ 1, AZ 2, AZ 3; Primary Instance; Replica Instance (two); Amazon S3
ASYNC 4/6 QUORUM
DISTRIBUTED WRITES
TYPE OF WRITE: BINLOG, DATA, DOUBLE-WRITE, LOG, FRM FILES
PERFORMANCE
27,378K transactions (35X MORE)
950K I/Os per 1M txns (6X amplification) (7.7X LESS)
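The "35X MORE" and "7.7X LESS" figures can be checked from the numbers on the two I/O traffic slides. A minimal sketch, assuming both runs used the same 30 minute SysBench workload; the variable names are illustrative only.

```python
# Figures copied from the MySQL and Aurora I/O traffic slides (values in thousands).
mysql_txns_k = 780
mysql_ios_per_million_txns_k = 7388
aurora_txns_k = 27378
aurora_ios_per_million_txns_k = 950

# Throughput: how many more transactions Aurora completed in the same run.
throughput_ratio = aurora_txns_k / mysql_txns_k

# I/O efficiency: how many fewer I/Os Aurora issued per million transactions.
io_ratio = mysql_ios_per_million_txns_k / aurora_ios_per_million_txns_k

# I/Os per single transaction (thousands of I/Os per thousand-thousand txns).
mysql_ios_per_txn = mysql_ios_per_million_txns_k / 1000
aurora_ios_per_txn = aurora_ios_per_million_txns_k / 1000

print(f"throughput: {throughput_ratio:.1f}x more transactions")
print(f"I/O: {io_ratio:.2f}x fewer I/Os per transaction")
print(f"MySQL {mysql_ios_per_txn:.1f} vs Aurora {aurora_ios_per_txn:.2f} I/Os per txn")
```

The throughput ratio comes out just above 35 and the I/O ratio between 7.7 and 7.8, which matches the rounded values on the slide.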
I/O Traffic in Aurora (Storage Node)
Diagram labels: Primary Instance, LOG RECORDS, STORAGE NODE, INCOMING QUEUE, ACK, UPDATE QUEUE, HOT LOG, DATA BLOCKS, SORT/GROUP, COALESCE, POINT IN TIME SNAPSHOT, GC, SCRUB, PEER TO PEER GOSSIP, Peer Storage Nodes, S3 BACKUP; steps numbered 1 to 8
OBSERVATIONS
All steps are asynchronous
Only steps 1 and 2 are in foreground latency path
Input queue is 46X less than MySQL (unamplified, per node)
Favor latency-sensitive operations
Use disk space to buffer against spikes in activity
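The split between the foreground path and the background work can be sketched roughly as follows. This is an illustration only, not Aurora's implementation: it assumes step 1 is "queue the incoming log record" and step 2 is "persist to the hot log and ACK" (inferred from the diagram labels), and the class and method names are hypothetical.

```python
from collections import deque


class StorageNodeSketch:
    """Toy model of a storage node: two foreground steps, the rest deferred."""

    def __init__(self):
        self.incoming = deque()  # INCOMING QUEUE
        self.hot_log = []        # HOT LOG
        self.data_blocks = {}    # DATA BLOCKS

    def receive(self, lsn, block_id, payload):
        """Assumed step 1 (foreground): queue the incoming log record."""
        self.incoming.append((lsn, block_id, payload))

    def persist_and_ack(self):
        """Assumed step 2 (foreground): move queued records to the hot log, ACK their LSNs."""
        acked = []
        while self.incoming:
            record = self.incoming.popleft()
            self.hot_log.append(record)
            acked.append(record[0])
        return acked

    def coalesce(self):
        """Background: sort hot-log records by LSN and fold them into data blocks."""
        for lsn, block_id, payload in sorted(self.hot_log, key=lambda r: r[0]):
            self.data_blocks[block_id] = payload
        self.hot_log.clear()


node = StorageNodeSketch()
node.receive(2, "block-a", "v2")   # records may arrive out of order
node.receive(1, "block-a", "v1")
acked_lsns = node.persist_and_ack()  # the writer only waits for this
node.coalesce()                      # happens later, off the latency path
```

The point of the sketch is that the writer's latency depends only on `receive` and `persist_and_ack`; sorting and coalescing run later.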
Asynchronous Group Commits
TRANSACTIONS: T1, T2, ... Tn, each shown as Read, Write, Commit, Read, Read
COMMIT QUEUE (pending commits in LSN order): Commit (T1), Commit (T2), Commit (T3), Commit (T4), Commit (T5), Commit (T6), Commit (T7), Commit (T8)
LSN GROWTH (durable LSN at head-node): LSN 10, LSN 12, LSN 20, LSN 22, LSN 30, LSN 34, LSN 41, LSN 47, LSN 49, LSN 50
GROUP COMMIT
TIME
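The commit queue idea can be sketched as follows: commits wait in LSN order and are acknowledged as a group whenever the durable LSN moves past them. This is a minimal sketch under assumptions, not Aurora's code; the class name, method names, and the pairing of transactions with LSNs are hypothetical.

```python
import heapq


class CommitQueueSketch:
    """Pending commits kept in LSN order; acknowledged once the durable LSN passes them."""

    def __init__(self):
        self._pending = []    # min-heap of (commit_lsn, txn_id)
        self.durable_lsn = 0

    def request_commit(self, txn_id, commit_lsn):
        """A transaction asks to commit; no thread blocks, it just joins the queue."""
        heapq.heappush(self._pending, (commit_lsn, txn_id))

    def advance_durable_lsn(self, new_lsn):
        """Storage reports a higher durable LSN; release every commit at or below it."""
        self.durable_lsn = max(self.durable_lsn, new_lsn)
        acked = []
        while self._pending and self._pending[0][0] <= self.durable_lsn:
            _, txn_id = heapq.heappop(self._pending)
            acked.append(txn_id)
        return acked


queue = CommitQueueSketch()
queue.request_commit("T1", 10)   # LSN assignments here are illustrative
queue.request_commit("T3", 22)
queue.request_commit("T2", 12)
first_group = queue.advance_durable_lsn(20)   # T1 and T2 are acknowledged together
second_group = queue.advance_durable_lsn(50)  # T3 follows once LSN 22 is durable
```

One durable-LSN update can acknowledge many commits at once, which is what makes the commit a group commit.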
MySQL Read Scaling
MySQL Master: 30% Read, 70% Write
SINGLE-THREADED BINLOG APPLY
MySQL Replica: 30% New Reads, 70% Write
Data Volume (master), Data Volume (replica)
Aurora Read Scaling
Aurora Master: 30% Read, 70% Write
PAGE CACHE UPDATE
Aurora Replica: 100% New Reads
Shared Multi-AZ Storage
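A back-of-the-envelope reading of the two read scaling slides: if a MySQL replica spends 70% of its capacity re-applying writes, only 30% is left for new reads, while an Aurora replica serves 100% new reads. The sketch below assumes read capacity adds up linearly across instances, which is a simplification; the function name is hypothetical.

```python
def total_read_capacity(replicas, replica_read_percent, master_read_percent=30):
    """Read capacity as a percentage of one instance, summed over master and replicas."""
    return master_read_percent + replicas * replica_read_percent


REPLICAS = 4  # illustrative replica count

mysql_capacity = total_read_capacity(REPLICAS, replica_read_percent=30)
aurora_capacity = total_read_capacity(REPLICAS, replica_read_percent=100)

print(f"MySQL:  {mysql_capacity}% of one instance available for reads")
print(f"Aurora: {aurora_capacity}% of one instance available for reads")
```

With four replicas this gives 150% for MySQL against 430% for Aurora under the stated assumption.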
“In MySQL, we saw replica lag spike to almost 12 minutes which is almost absurd from an application’s perspective. With Aurora, the maximum read replica lag across 4 replicas never exceeded 20 ms.”
Real-Life Data - Read Replica Latency
MYSQL THREAD MODEL: CLIENT CONNECTION
AURORA THREAD MODEL: CLIENT CONNECTION, epoll(), LATCH-FREE TASK QUEUE
Adaptive Thread Pool
Availability
“Performance only matters if your database is up”
Storage Durability
AZ 1 AZ 2 AZ 3
Amazon S3
What can fail?
Segment failures (disks)
Node failures (machines)
AZ failures (network or datacenter)
Optimizations
4 out of 6 write quorum
3 out of 6 read quorum
Peer-to-peer replication for repairs
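The quorum sizes can be checked with simple arithmetic. A minimal sketch, assuming the six copies are spread evenly as two per AZ across the three AZs (the slide shows three AZs but does not state the split); the function names are hypothetical.

```python
def quorums_overlap(copies, write_quorum, read_quorum):
    """True when every read quorum shares at least one copy with every write quorum."""
    return write_quorum + read_quorum > copies


def writes_cannot_conflict(copies, write_quorum):
    """True when any two write quorums share a copy, so two conflicting writes cannot both succeed."""
    return 2 * write_quorum > copies


def quorum_available(copies, quorum, failed_copies):
    """True when enough copies survive to still form the quorum."""
    return copies - failed_copies >= quorum


COPIES, WRITE_QUORUM, READ_QUORUM = 6, 4, 3
COPIES_PER_AZ = 2  # assumption: copies split evenly across three AZs

az_loss = COPIES_PER_AZ           # a whole AZ fails
az_plus_one = COPIES_PER_AZ + 1   # an AZ fails and one more copy is lost

print(quorums_overlap(COPIES, WRITE_QUORUM, READ_QUORUM))     # reads see the latest write
print(writes_cannot_conflict(COPIES, WRITE_QUORUM))
print(quorum_available(COPIES, WRITE_QUORUM, az_loss))        # writes survive an AZ failure
print(quorum_available(COPIES, READ_QUORUM, az_plus_one))     # reads survive AZ + 1
print(quorum_available(COPIES, WRITE_QUORUM, az_plus_one))    # writes do not
```

Under that assumption, losing an entire AZ leaves both quorums intact, and losing an AZ plus one more copy still leaves the read quorum, which is what repair needs.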
Diagram labels (two panels): SQL, Transaction, Caching; AZ 1, AZ 2, AZ 3
Amazon Aurora Storage Engine Fault-Tolerance
Aurora Replicas
Diagram labels: AZ 1, AZ 2, AZ 3; Primary Node; Secondary Node
Continuous Backup
Diagram: Segment 1, Segment 2, and Segment 3 plotted over Time, each with segment snapshots and log records; a recovery point is marked
Diagram labels: Checkpointed Data, Redo Log, T0
Crash at T0 requires a re-application of the SQL in the redo log since last checkpoint
Traditional Crash Recovery
T0
Crash at T0 will result in redo logs being applied to each segment on demand, in parallel, asynchronously
Amazon Aurora – Instant Crash Recovery
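The difference between replaying the whole redo log up front and applying it per segment can be sketched as below. This is a toy model under assumptions, not Aurora's storage code: segments are plain dictionaries, redo records are (page, value) pairs, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_redo(segment):
    """Apply a segment's outstanding redo records to its pages, then clear them."""
    for page_id, value in segment["redo"]:
        segment["pages"][page_id] = value
    segment["redo"] = []
    return segment


def recover_in_parallel(segments, workers=4):
    """Each segment replays only its own redo records; segments proceed independently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(apply_redo, segments))


def read_page(segment, page_id):
    """On demand: a read triggers redo application for that segment only if needed."""
    if segment["redo"]:
        apply_redo(segment)
    return segment["pages"][page_id]


segments = [
    {"pages": {1: "old"}, "redo": [(1, "new")]},
    {"pages": {2: "old"}, "redo": [(2, "mid"), (2, "new")]},
]

# On-demand path: reading a page brings just that segment up to date.
on_demand_value = read_page(segments[0], 1)

# Parallel path: remaining segments catch up independently of each other.
recover_in_parallel(segments)
```

Because no segment waits for another, there is no single serial replay between the crash and the first successful read.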
Survivable Caches
Diagram labels (three panels): SQL, Transactions, Caching
Caching process is outside the DB process and remains warm across a database restart
Faster Failover
MYSQL: DB Failure, Failure Detection, DNS Propagation, Recovery, Recovery, App Running
AURORA WITH MARIADB DRIVER: DB Failure, Failure Detection, DNS Propagation, Recovery, App Running
Timings shown: 15-20 sec, 3-20 sec
Database Failover Time
Charts: four histograms of failover time
0 - 5s: 30% of fail-overs
5 - 10s: 40% of fail-overs
10 - 20s: 25% of fail-overs
20 - 30s: 5% of fail-overs
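The four ranges add up to 100%, so they can be read as a cumulative distribution. A small sketch of that arithmetic, using only the percentages from the charts; the variable names are illustrative.

```python
from itertools import accumulate

# (upper bound in seconds, share of fail-overs in percent), taken from the four charts.
buckets = [(5, 30), (10, 40), (20, 25), (30, 5)]

# Running total of the shares: percentage of fail-overs finished by each upper bound.
cumulative = list(accumulate(share for _, share in buckets))

for (upper_bound, _), total in zip(buckets, cumulative):
    print(f"within {upper_bound}s: {total}% of fail-overs")
```

Read this way, 70% of fail-overs completed within 10 seconds and 95% within 20 seconds.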
Thank You