
AWS November Webinar Series - Aurora Deep Dive


Page 1: AWS November Webinar Series - Aurora Deep Dive

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Abdul Sathar Sait, Principal Product Manager, RDS

Lynn Ferrante, Business Development Manager, RDS

Amazon Aurora Deep Dive

Page 2: AWS November Webinar Series - Aurora Deep Dive

MySQL-compatible relational database

Performance and availability of commercial databases

Simplicity and cost-effectiveness of open source databases

What is Amazon Aurora?

Page 3: AWS November Webinar Series - Aurora Deep Dive

Fastest growing service in AWS history

Business Applications

Web and mobile

Content management

E-commerce, retail

Internet of Things

Search, advertising

BI and analytics

Games, media

Common customer use cases

Page 4: AWS November Webinar Series - Aurora Deep Dive

Expedia: online travel marketplace

Real-time business intelligence and analytics on a growing corpus of online travel marketplace data.

The current SQL Server-based architecture is too expensive, and performance degrades as data volume grows.

Cassandra with a Solr index requires a large memory footprint and hundreds of nodes, adding cost.

Aurora benefits:

Aurora meets scale and performance requirements at much lower cost: 25,000 inserts/sec with peaks up to 70,000; 30 ms average response time for writes and 17 ms for reads, with one month of data.

World's leading online travel company, with a portfolio that includes 150+ travel sites in 70 countries.

Page 5: AWS November Webinar Series - Aurora Deep Dive

PG&E: large public utility

Servicing high traffic surges during power events had always been a problem.

Availability is critical: when databases are down, service to gas and electric customers is adversely affected.

Aurora benefits:

Being able to create multiple database replicas with millisecond latency allows them to handle large surges in traffic and still give customers timely, up-to-date information during a power event.

Amazon Aurora, with 6-way replication, self-healing storage, and automatic instance repair, provides the availability and reliability needed for mission-critical applications.

One of the largest combined natural gas and electric utilities in the United States, with approximately 16 million customers in a 70,000-square-mile service area in northern and central California.

Page 6: AWS November Webinar Series - Aurora Deep Dive

ISCS: insurance claims processing

Has been using Oracle and SQL Server for operational and warehouse data.

The cost and maintenance of traditional commercial databases had become its biggest expenditure and maintenance headache.

Aurora benefits:

The cost of a “more capable” deployment on Aurora has proven to be about 70% less than ISCS’s SQL Server deployments.

Eliminated the backup window with Aurora’s continuous backup; exploiting linear scaling with the number of connections; continuously uploading to Amazon Redshift using Aurora read replicas.

Provides policy management, claims, and billing solutions for property and casualty insurance organizations.

Page 7: AWS November Webinar Series - Aurora Deep Dive

Alfresco: Enterprise content management

Scaling Alfresco document repositories to billions of documents

Support user applications that require sub-second response times

Aurora benefits:

Scaled to 1 billion documents with a throughput of 3 million per hour, which is 10 times faster than their current environment.

Moving from large data centers to cost-effective management with AWS and Aurora.

Leading the convergence of Enterprise Content Management and Business Process Management. More than 1,800 organizations in 195 countries rely on Alfresco, including leaders in financial services, healthcare, and the public sector.

Page 8: AWS November Webinar Series - Aurora Deep Dive

WRITE PERFORMANCE: 4 client machines with 1,000 connections each

READ PERFORMANCE: single client machine with 1,600 connections

Using MySQL SysBench with Amazon Aurora R3.8XL (32 cores and 244 GB RAM)

SQL Benchmark Results

Page 9: AWS November Webinar Series - Aurora Deep Dive

Reproducing these results

https://d0.awsstatic.com/product-marketing/Aurora/RDS_Aurora_Performance_Assessment_Benchmarking_v1-2.pdf

[diagram: four R3.8XLARGE SysBench client instances driving one R3.8XLARGE Amazon Aurora instance]

• Create an Amazon VPC (or use an existing one).

• Create four EC2 R3.8XL client instances to run the SysBench client. All four should be in the same AZ.

• Enable enhanced networking on your clients.

• Tune your Linux settings (see whitepaper).

• Install SysBench version 0.5.

• Launch an r3.8xlarge Amazon Aurora DB instance in the same VPC and AZ as your clients.

• Start your benchmark! (A minimal invocation is sketched below.)

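As a sketch of the final step, the following shows how one client machine might drive the write workload with SysBench 0.5. The endpoint, user, and password are placeholders, and the oltp.lua path varies by how SysBench was installed; the table count and run length mirror the deck's scaling tests.

import subprocess

# Hypothetical Aurora endpoint and credentials -- substitute your own.
HOST = "aurora-test.cluster-xxxx.us-east-1.rds.amazonaws.com"
cmd = [
    "sysbench",
    "--test=/usr/share/doc/sysbench/tests/db/oltp.lua",  # SysBench 0.5 OLTP script
    f"--mysql-host={HOST}",
    "--mysql-user=admin",
    "--mysql-password=secret",
    "--mysql-db=sbtest",
    "--oltp-tables-count=250",  # table count used in the scaling tests
    "--oltp-table-size=25000",
    "--num-threads=1000",       # 1,000 connections from this client machine
    "--max-time=1800",          # a 30-minute run
    "--max-requests=0",         # no request cap; run for the full duration
    "run",                      # run "prepare" once beforehand to load tables
]
subprocess.run(cmd, check=True)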

Page 10: AWS November Webinar Series - Aurora Deep Dive

Beyond benchmarks

If only real world applications saw benchmark performance

POSSIBLE DISTORTIONS

• Real-world requests contend with each other
• Real-world metadata rarely fits in the data dictionary cache
• Real-world data rarely fits in the buffer cache
• Real-world production databases need to run HA

Page 11: AWS November Webinar Series - Aurora Deep Dive

Scaling User Connections

SysBench OLTP workload, 250 tables

Connections    Amazon Aurora    RDS MySQL 30K IOPS (single AZ)
         50           40,000                            10,000
        500           71,000                            21,000
      5,000          110,000                            13,000

UP TO 8x FASTER

Page 12: AWS November Webinar Series - Aurora Deep Dive

Scaling Table Count

Number of writes per second; SysBench write-only workload, 1,000 connections, default settings

Tables    Amazon Aurora    MySQL I2.8XL local SSD    MySQL I2.8XL RAM disk    RDS MySQL 30K IOPS (single AZ)
    10           60,000                    18,000                   22,000                            25,000
   100           66,000                    19,000                   24,000                            23,000
 1,000           64,000                     7,000                   18,000                             8,000
10,000           54,000                     4,000                    8,000                             5,000

UP TO 11x FASTER

Page 13: AWS November Webinar Series - Aurora Deep Dive

Scaling Data Set Size

SYSBENCH WRITE-ONLY

DB Size    Amazon Aurora    RDS MySQL 30K IOPS (single AZ)
1GB              107,000                             8,400
10GB             107,000                             2,400
100GB            101,000                             1,500
1TB               26,000                             1,200

UP TO 67x FASTER

CLOUDHARMONY TPC-C

DB Size    Amazon Aurora    RDS MySQL 30K IOPS (single AZ)
80GB              12,582                               585
800GB              9,406                                69

UP TO 136x FASTER

Page 14: AWS November Webinar Series - Aurora Deep Dive

Running with Read Replicas

Replica lag; SysBench write-only workload, 250 tables

Updates per second    Amazon Aurora    RDS MySQL 30K IOPS (single AZ)
             1,000          2.62 ms                               0 s
             2,000          3.42 ms                               1 s
             5,000          3.94 ms                              60 s
            10,000          5.38 ms                             300 s

UP TO 500x LOWER LAG

Page 15: AWS November Webinar Series - Aurora Deep Dive

How Do We Achieve These Results?

DO LESS WORK
• Do fewer IOs
• Minimize network packets
• Cache prior results
• Offload the database engine

BE MORE EFFICIENT
• Process asynchronously
• Reduce latency path
• Use lock-free data structures
• Batch operations together

DATABASES ARE ALL ABOUT I/O
NETWORK-ATTACHED STORAGE IS ALL ABOUT PACKETS/SECOND
HIGH-THROUGHPUT PROCESSING DOES NOT ALLOW CONTEXT SWITCHES

Page 16: AWS November Webinar Series - Aurora Deep Dive

A service-oriented architecture applied to the database

Moved the logging and storage layer into a multi-tenant, scale-out database-optimized storage service

Integrated with other AWS services like Amazon EC2, Amazon VPC, Amazon DynamoDB, Amazon SWF, and Amazon Route 53 for control plane operations

Integrated with Amazon S3 for continuous backup with 99.999999999% durability

[diagram: the data plane (SQL, transactions, caching, logging + storage, Amazon S3) alongside the control plane (Amazon DynamoDB, Amazon SWF, Amazon Route 53)]

Page 17: AWS November Webinar Series - Aurora Deep Dive

Aurora Cluster

[diagram: an Aurora primary instance in AZ 1; the cluster volume spans AZ 1, AZ 2, and AZ 3, with continuous backup to Amazon S3]

Page 18: AWS November Webinar Series - Aurora Deep Dive

Aurora Cluster with Replicas

[diagram: an Aurora primary instance in AZ 1 and Aurora Replicas in AZ 2 and AZ 3; the cluster volume spans all three AZs, with continuous backup to Amazon S3]

Page 19: AWS November Webinar Series - Aurora Deep Dive

IO Traffic in RDS MySQL

TYPES OF WRITE: binlog, data, double-write, log, FRM files

[diagram: MySQL with standby; primary and standby instances in AZ 1 and AZ 2, each writing to a mirrored Amazon EBS volume, with backups to Amazon S3]

IO FLOW
• Issue write to EBS; EBS issues it to its mirror, ACK when both are done
• Stage write to the standby instance
• Issue write to EBS on the standby instance

OBSERVATIONS
• Steps 1, 3, and 5 are sequential and synchronous
• This amplifies both latency and jitter
• Many types of writes for each user operation
• Data blocks must be written twice to avoid torn writes

PERFORMANCE
780K transactions; 7,388K I/Os per million txns (excludes mirroring and standby); average 7.4 I/Os per transaction

30-minute SysBench write-only workload, 100GB dataset, RDS Single AZ, 30K PIOPS

Page 20: AWS November Webinar Series - Aurora Deep Dive

IO Traffic in Aurora (Database)

TYPES OF WRITE: redo log records only; no binlog, data, double-write, or FRM-file writes

[diagram: Aurora primary instance in AZ 1 and replica instance in AZ 2, issuing asynchronous 4/6-quorum distributed writes to storage nodes across all three AZs, with backup to Amazon S3]

IO FLOW
• Boxcar redo log records, fully ordered by LSN
• Shuffle to the appropriate segments, partially ordered
• Boxcar to storage nodes and issue writes

OBSERVATIONS
• Only write redo log records; all steps asynchronous
• No data block writes (checkpoint, cache replacement)
• 6X more log writes, but 9X less network traffic
• Tolerant of network and storage outlier latency

PERFORMANCE
27,378K transactions (35X MORE); 950K I/Os per 1M txns, a 6X amplification (7.7X LESS)

30-minute SysBench write-only workload, 100GB dataset
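The 4/6 quorum rule is simple enough to sketch. Below is a toy illustration in Python, assuming hypothetical storage-node objects with an async send() that resolves when the node ACKs; the write returns as durable after four ACKs while the stragglers finish in the background.

import asyncio

WRITE_QUORUM = 4  # of 6 copies: two per AZ across three AZs

async def quorum_write(record: bytes, nodes: list) -> None:
    # Issue the redo-log write to all six storage nodes at once.
    futures = [asyncio.ensure_future(node.send(record)) for node in nodes]
    acks = 0
    for done in asyncio.as_completed(futures):
        await done  # each completion is one storage-node ACK
        acks += 1
        if acks >= WRITE_QUORUM:
            return  # durable; remaining ACKs arrive asynchronously

This is why the write path tolerates outlier latency: the two slowest nodes never gate durability.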

Page 21: AWS November Webinar Series - Aurora Deep Dive

IO Traffic in Aurora (Storage Node)

[diagram: a storage node receiving log records from the primary instance into an incoming queue; internal components include an update queue, hot log, data blocks, point-in-time snapshots, sort/group, scrub/coalesce, GC, peer-to-peer gossip with peer storage nodes, and backup to S3; callouts 1-8 correspond to the IO flow below]

IO FLOW
1. Receive record and add to in-memory queue
2. Persist record and ACK
3. Organize records and identify gaps in the log
4. Gossip with peers to fill in holes
5. Coalesce log records into new data block versions
6. Periodically stage log and new block versions to S3
7. Periodically garbage-collect old versions
8. Periodically validate CRC codes on blocks

OBSERVATIONS
• All steps are asynchronous
• Only steps 1 and 2 are in the foreground latency path
• Input queue is 46X less than MySQL's (unamplified, per node)
• Favor latency-sensitive operations
• Use disk space to buffer against spikes in activity
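A minimal sketch of that split between the latency path and everything else, with an in-memory list standing in for the on-disk hot log (names are illustrative, not Aurora's):

import queue
import threading

incoming: queue.Queue = queue.Queue()  # step 1: the in-memory queue
hot_log: list = []                     # stand-in for the on-disk hot log

def handle_write(record: bytes, ack) -> None:
    # Steps 1 and 2: the only work on the foreground latency path.
    incoming.put(record)    # step 1: receive record and queue it
    hot_log.append(record)  # step 2: persist (a real node appends to disk)
    ack()                   # the primary counts this toward its 4/6 quorum

def background_worker() -> None:
    # Steps 3-8 run asynchronously, off the latency path.
    while True:
        record = incoming.get()
        # step 3: sort/group records and identify LSN gaps
        # step 4: gossip with peer nodes to fill holes
        # step 5: coalesce log records into new data block versions
        # steps 6-8: periodically stage to S3, garbage-collect, verify CRCs
        incoming.task_done()

threading.Thread(target=background_worker, daemon=True).start()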

Page 22: AWS November Webinar Series - Aurora Deep Dive

Asynchronous group commits

[diagram: transactions T1 through Tn each read, write, and commit over time; commits T1-T8 queue in LSN order (LSN 10, 12, 20, 22, 30, 34, 41, 47, 49, 50) while the durable LSN at the head node advances, and each group commit completes every queued commit at or below it]

TRADITIONAL APPROACH
• Maintain a buffer of log records to write out to disk
• Issue write when the buffer is full or a timeout expires waiting for writes
• First writer has a latency penalty when the write rate is low

AMAZON AURORA
• Request I/O with the first write; fill the buffer until the write is picked up
• Individual write is durable when 4 of 6 storage nodes ACK
• Advance the DB durable point up to the earliest pending ACK
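Here is a toy sketch of the commit-queue bookkeeping, under the simplifying assumption that durability is ACKed per commit LSN (Aurora actually tracks a durable point for the volume as a whole): commits wait in LSN order, and one storage ACK can complete a whole group of them at once.

import heapq

QUORUM = 4           # a write is durable once 4 of 6 storage nodes ACK

pending: list = []   # min-heap of commit LSNs awaiting durability
acks: dict = {}      # LSN -> ACK count so far
durable_lsn = 0      # everything at or below this LSN is durable

def on_commit(lsn: int) -> None:
    heapq.heappush(pending, lsn)  # queue the commit; no thread blocks on it

def on_storage_ack(lsn: int) -> list:
    # Advance the durable point past every leading commit that has
    # reached quorum, completing those commits together as one group.
    global durable_lsn
    acks[lsn] = acks.get(lsn, 0) + 1
    completed = []
    while pending and acks.get(pending[0], 0) >= QUORUM:
        durable_lsn = heapq.heappop(pending)
        completed.append(durable_lsn)  # acknowledge these clients together
    return completed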

Page 23: AWS November Webinar Series - Aurora Deep Dive

Adaptive thread pool

MYSQL THREAD MODEL
• Standard MySQL: one thread per connection
• Doesn't scale with connection count
• MySQL EE: connections assigned to thread groups
• Requires careful stall-threshold tuning

AURORA THREAD MODEL
• Re-entrant connections multiplexed to active threads
• Kernel-space epoll() inserts into a latch-free event queue
• Dynamically sized thread pool
• Gracefully handles 5,000+ concurrent client sessions on r3.8xl

[diagram: client connections flow through epoll() into a latch-free task queue serviced by the thread pool]
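The shape of the Aurora model can be sketched with Python's selectors module (epoll on Linux): one selector watches every connection and hands ready sockets to a small worker pool, so thread count tracks cores rather than connections. This toy echo server glosses over re-arming and races a real server must handle; the port is arbitrary.

import selectors
import socket
from concurrent.futures import ThreadPoolExecutor

sel = selectors.DefaultSelector()           # uses epoll() on Linux
pool = ThreadPoolExecutor(max_workers=16)   # sized to cores, not connections

def serve_request(conn: socket.socket) -> None:
    # Borrow a pooled thread for one request, then return it.
    data = conn.recv(4096)
    if data:
        conn.sendall(data)  # echoing stands in for query execution
    else:
        sel.unregister(conn)
        conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 3307))
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, data=None)

while True:
    for key, _ in sel.select():
        if key.data is None:                 # a new client connection
            conn, _ = listener.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ, data=serve_request)
        else:                                # a ready connection: queue a task
            pool.submit(key.data, key.fileobj)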

Page 24: AWS November Webinar Series - Aurora Deep Dive

IO Traffic in Aurora (Read Replica)

[diagram: a MySQL master (30% read, 70% write) ships binlog to a MySQL replica (30% new reads, 70% write) with single-threaded binlog apply and independent data volumes, versus an Aurora master (30% read, 70% write) and Aurora Replica (100% new reads) sharing multi-AZ storage, with page-cache updates shipped as redo]

MYSQL READ SCALING
Logical: ship SQL statements to the replica
• Write workload is similar on both instances
• Independent storage
• Can result in data drift between master and replica

AMAZON AURORA READ SCALING
Physical: ship redo from the master to the replica
• Replica shares storage; no writes performed
• Cached pages have redo applied
• Advance the read view when all commits have been seen
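The replica-side consequence of physical replication fits in a few lines. A toy sketch, where page_cache, the record shape, and apply_change are illustrative: redo is applied only to pages the replica happens to cache, and nothing is ever written to the shared volume.

page_cache: dict = {}  # page_id -> cached page bytes

def on_redo_record(page_id: int, apply_change) -> None:
    # Apply shipped redo only to pages already in the cache; uncached
    # pages need no work because the storage volume is shared.
    if page_id in page_cache:
        page_cache[page_id] = apply_change(page_cache[page_id])
    # The replica performs no write I/O in either case.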

Page 25: AWS November Webinar Series - Aurora Deep Dive

Continuing the Improvements

BATCH OPERATIONS
• Write batch size tuning
• Asynchronous send for read/write I/Os
• Purge thread performance
• Bulk insert performance

LOCK CONTENTION
• Hot row contention
• Dictionary statistics
• Mini-transaction commit code path
• Query cache read/write conflicts
• Dictionary system mutex

CUSTOMER FEEDBACK
• Binlog and distributed transactions
• Lock compression
• Read-ahead

OTHER
• Failover time reductions
• Malloc reduction
• System call reductions
• Undo slot caching patterns
• Cooperative log apply

Page 26: AWS November Webinar Series - Aurora Deep Dive

Availability

“Performance only matters if your database is up”

Page 27: AWS November Webinar Series - Aurora Deep Dive

Storage node availability

Quorum system for read/write; latency tolerant

Peer-to-peer gossip replication to fill in holes

Continuous backup to S3 (designed for 11 9s durability)

Continuous scrubbing of data blocks

Continuous monitoring of nodes and disks for repair

10GB segments as the unit of repair or hotspot rebalance, to quickly spread load

Quorum membership changes do not stall writes

[diagram: storage nodes across AZ 1, AZ 2, and AZ 3, with continuous backup to Amazon S3]

Page 28: AWS November Webinar Series - Aurora Deep Dive

Instant crash recovery

Traditional databases
• Have to replay logs since the last checkpoint
• Typically 5 minutes between checkpoints
• Single-threaded in MySQL; requires a large number of disk accesses

Amazon Aurora
• Underlying storage replays redo records on demand as part of a disk read
• Parallel, distributed, asynchronous
• No replay needed at startup

[diagram: checkpointed data plus redo log on a timeline; a crash at T0 requires re-applying the redo log since the last checkpoint, whereas in Aurora a crash at T0 results in redo logs being applied to each segment on demand, in parallel, asynchronously]
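A toy sketch of replay-on-read, with an illustrative Segment type standing in for a storage segment: each read materializes the latest block by applying that segment's redo tail, so recovery work is spread across ordinary reads instead of blocking startup.

from dataclasses import dataclass, field

@dataclass
class Segment:
    # Illustrative stand-in for one storage segment.
    block: bytes = b""       # last coalesced version of a block
    block_lsn: int = 0       # LSN the block is current through
    redo: list = field(default_factory=list)  # [(lsn, delta), ...]

def read_block(seg: Segment) -> bytes:
    # Replay this segment's redo tail as part of the read itself.
    data = seg.block
    for lsn, delta in seg.redo:
        if lsn > seg.block_lsn:
            data = data + delta  # toy "apply"; real redo mutates page bytes
    return data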

Page 29: AWS November Webinar Series - Aurora Deep Dive

Survivable caches

We moved the cache out of the database process

Cache remains warm in the event of a database restart

Lets you resume fully loaded operations much faster

Instant crash recovery + survivable cache = quick and easy recovery from DB failures

[diagram: SQL, transactions, and caching layers of the engine; the caching process is outside the DB process and remains warm across a database restart]

Page 30: AWS November Webinar Series - Aurora Deep Dive

Faster, more predictable failover

[diagram: two failover timelines, each running app running, DB failure, failure detection, DNS propagation, recovery, app running. MySQL requires an additional recovery phase and takes roughly 15-20 sec; Aurora with the MariaDB driver completes in roughly 3-20 sec]

Page 31: AWS November Webinar Series - Aurora Deep Dive

Simulate failures using SQL

To cause the failure of a component at the database node:

ALTER SYSTEM CRASH [{INSTANCE | DISPATCHER | NODE}]

To simulate the failure of disks:

ALTER SYSTEM SIMULATE percent_failure DISK failure_type IN
[DISK index | NODE index] FOR INTERVAL interval

To simulate the failure of networking:

ALTER SYSTEM SIMULATE percent_failure NETWORK failure_type
[TO {ALL | read_replica | availability_zone}] FOR INTERVAL interval
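These statements run through any ordinary MySQL-protocol client. As a sketch, here is a crash test driven with PyMySQL, with a placeholder endpoint and credentials; expect the connection to drop when the instance goes down.

import pymysql

# Placeholder endpoint and credentials; substitute your own cluster's.
conn = pymysql.connect(
    host="aurora-test.cluster-xxxx.us-east-1.rds.amazonaws.com",
    user="admin",
    password="secret",
    database="mydb",
)
try:
    with conn.cursor() as cur:
        # Crash the database node to exercise recovery and failover.
        cur.execute("ALTER SYSTEM CRASH INSTANCE")
except pymysql.err.OperationalError:
    # Expected: the server goes away mid-statement.
    print("instance crashed; reconnect after recovery")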

Page 32: AWS November Webinar Series - Aurora Deep Dive

http://aws.amazon.com/rds/aurora