© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
December 1, 2016
Deep Dive on Amazon Aurora
Anurag Gupta, VP, Database Services
Amazon EMR, Amazon Redshift, Amazon Aurora, Amazon Athena, AWS Glue
Agenda
What is Aurora?
Review of Aurora performance
New performance enhancements
Review of Aurora availability
New availability enhancements
Other recent and upcoming feature enhancements
What is Amazon Aurora?
- Open source compatible relational database
- Performance and availability of commercial databases
- Simplicity and cost-effectiveness of open source databases
Performance

Scaling with instance sizes
[Chart: write and read performance by instance size – Aurora vs. MySQL 5.6 and MySQL 5.7]
Aurora scales with instance size for both reads and writes.

Real-life data – gaming workload: Aurora vs. RDS MySQL (r3.4xlarge, Multi-AZ)
Aurora is 3X faster on r3.4xlarge.
How did we achieve this?

DO LESS WORK
- Do fewer I/Os
- Minimize network packets
- Cache prior results
- Offload the database engine

BE MORE EFFICIENT
- Process asynchronously
- Reduce the latency path
- Use lock-free data structures
- Batch operations together
DATABASES ARE ALL ABOUT I/O
NETWORK-ATTACHED STORAGE IS ALL ABOUT PACKETS/SECOND
HIGH-THROUGHPUT PROCESSING IS ALL ABOUT CONTEXT SWITCHES
I/O traffic in MySQL

TYPES OF WRITE: log, binlog, data, double-write, FRM files

[Diagram: MySQL with replica – primary instance in AZ 1 and standby instance in AZ 2, each writing to Amazon Elastic Block Store (EBS) and its EBS mirror, with the standby fed through DRBD and backups staged to Amazon S3; write steps numbered 1-5]

I/O FLOW
- Issue the write to EBS – EBS issues it to the mirror, acknowledging when both are done
- Stage the write to the standby instance through DRBD
- Issue the write to EBS on the standby instance

OBSERVATIONS
- Steps 1, 3, and 4 are sequential and synchronous
- This amplifies both latency and jitter
- There are many types of writes for each user operation
- Data blocks have to be written twice to avoid torn writes

PERFORMANCE
- 780K transactions
- 7,388K I/Os per million transactions (excludes mirroring and standby)
- Average of 7.4 I/Os per transaction

30-minute SysBench write-only workload, 100GB dataset, RDS Multi-AZ, 30K PIOPS
I/O traffic in Aurora

[Diagram: primary instance in AZ 1 with replica instances in AZ 2 and AZ 3; asynchronous 4/6 quorum distributed writes across storage in all three AZs; continuous backup to Amazon S3. Of MySQL's write types (log, binlog, data, double-write, FRM files), only redo log records are written.]

I/O FLOW
- Boxcar redo log records – fully ordered by LSN
- Shuffle to the appropriate segments – partially ordered
- Boxcar to storage nodes and issue writes (see the quorum-write sketch below)

OBSERVATIONS
- Only write redo log records; all steps are asynchronous
- No data block writes (checkpoint, cache replacement)
- 6X more log writes, but 9X less network traffic
- Tolerant of network and storage outlier latency

PERFORMANCE
- 27,378K transactions – 35X MORE
- 950K I/Os per 1M transactions (6X amplification) – 7.7X LESS
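To make the 4/6 quorum concrete, here is a minimal sketch of a quorum write, assuming a hypothetical `StorageNode` class and in-process threads standing in for network calls; it illustrates the idea, not Aurora's implementation:

```python
import concurrent.futures
import random
import time

class StorageNode:
    """Hypothetical stand-in for one of the six storage nodes."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def write(self, record):
        time.sleep(random.uniform(0.001, 0.05))  # variable disk/network latency
        self.log.append(record)
        return self.name                          # the ACK

def quorum_write(nodes, record, quorum=4):
    """Issue the write to all nodes; return as soon as `quorum` of them ACK.
    Slow outliers finish in the background, which is what makes the scheme
    tolerant of per-node latency spikes."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(nodes))
    futures = [pool.submit(n.write, record) for n in nodes]
    acks = []
    for fut in concurrent.futures.as_completed(futures):
        acks.append(fut.result())
        if len(acks) >= quorum:
            return acks                           # durable at 4 of 6 copies
    raise RuntimeError("quorum not reached")

nodes = [StorageNode(f"az{i % 3}-node{i}") for i in range(6)]  # two per AZ
print(quorum_write(nodes, ("LSN 42", "redo: update page 7")))
```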
I/O traffic in Aurora (storage node)

[Diagram: log records flow from the primary instance into the storage node's incoming queue, then through the update queue into the hot log; records are sorted, grouped, and coalesced into data blocks; point-in-time snapshots and the log are staged to S3 for backup; holes are filled by peer-to-peer gossip with peer storage nodes; GC and scrub run in the background]

I/O FLOW
1. Receive record and add to in-memory queue
2. Persist record and acknowledge
3. Organize records and identify gaps in the log
4. Gossip with peers to fill in holes (see the sketch below)
5. Coalesce log records into new data block versions
6. Periodically stage log and new block versions to S3
7. Periodically garbage-collect old versions
8. Periodically validate CRC codes on blocks

OBSERVATIONS
- All steps are asynchronous
- Only steps 1 and 2 are in the foreground latency path
- Input queue is 46X smaller than MySQL's (unamplified, per node)
- Favor latency-sensitive operations
- Use disk space to buffer against spikes in activity
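Steps 3 and 4 — organizing records, finding gaps, and filling them by gossip — can be illustrated with a toy model; the record layout and the pull-based peer API here are invented for the example:

```python
class SegmentPeer:
    """Toy storage-node segment: holds redo records keyed by LSN."""
    def __init__(self, records):
        self.records = dict(records)              # {lsn: payload}

    def gaps(self):
        """Step 3: find missing LSNs between the lowest and highest held."""
        if not self.records:
            return []
        lsns = sorted(self.records)
        return [l for l in range(lsns[0], lsns[-1]) if l not in self.records]

    def gossip_with(self, peer):
        """Step 4: fill our holes from a peer that has the missing records."""
        for lsn in self.gaps():
            if lsn in peer.records:
                self.records[lsn] = peer.records[lsn]

a = SegmentPeer({1: "r1", 2: "r2", 5: "r5"})          # lost records 3 and 4
b = SegmentPeer({1: "r1", 3: "r3", 4: "r4", 5: "r5"})
print(a.gaps())        # [3, 4]
a.gossip_with(b)
print(a.gaps())        # []
```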
I/O traffic in Aurora Replicas

MYSQL READ SCALING
- Logical: ship SQL statements to the replica
- Write workload is similar on both instances
- Independent storage; single-threaded binlog apply
- Can result in data drift between master and replica

AMAZON AURORA READ SCALING
- Physical: ship redo from the master to the replica
- Replica shares storage; no writes are performed
- Cached pages have redo applied (see the sketch below)
- Advance the read view when all commits have been seen

[Diagram: MySQL master (30% reads, 70% writes) feeding a MySQL replica (30% new reads, 70% writes) through single-threaded binlog apply over separate data volumes; Aurora master (30% reads, 70% writes) and Aurora replica (100% new reads) sharing multi-AZ storage, with only page-cache updates shipped]

Real-life data – read replica latency
"In MySQL, we saw replica lag spike to almost 12 minutes, which is almost absurd from an application's perspective. With Aurora, the maximum read replica lag across 4 replicas never exceeded 20 ms."
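A toy model of the physical replication path — redo applied only to pages the replica happens to have cached, with the read view advancing on commit — might look like this (the record shapes and class are invented):

```python
class ReplicaCache:
    """Toy Aurora-style replica: shares storage, applies redo to cached pages only."""
    def __init__(self):
        self.pages = {}          # page_id -> (content, lsn); the warm cache
        self.read_view_lsn = 0   # queries see data as of this LSN

    def apply_redo(self, record):
        """Redo records stream in from the master in LSN order."""
        if record["type"] == "page_update":
            pid = record["page"]
            if pid in self.pages:          # uncached pages need no work here:
                self.pages[pid] = (record["after"], record["lsn"])  # storage has the redo
        elif record["type"] == "commit":
            # Advance the read view only once all writes up to this commit are seen.
            self.read_view_lsn = record["lsn"]

replica = ReplicaCache()
replica.pages[7] = ("old row", 0)          # page 7 happens to be cached
for rec in [
    {"type": "page_update", "page": 7, "after": "new row", "lsn": 10},
    {"type": "page_update", "page": 9, "after": "other", "lsn": 11},  # not cached: skipped
    {"type": "commit", "lsn": 12},
]:
    replica.apply_redo(rec)
print(replica.pages[7], replica.read_view_lsn)   # ('new row', 10) 12
```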
Asynchronous group commits

[Diagram: transactions T1…Tn, each a sequence of reads, writes, and a commit; pending commits sit in a commit queue in LSN order while the durable LSN grows at the head node]

TRADITIONAL APPROACH
- Maintain a buffer of log records to write out to disk
- Issue the write when the buffer is full or a timeout expires while waiting for writes
- The first writer pays a latency penalty when the write rate is low

AMAZON AURORA
- Request I/O with the first write; fill the buffer until the write is picked up
- An individual write is durable when 4 of 6 storage nodes ACK
- Advance the DB durable point up to the earliest pending ACK (see the sketch below)
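Here is one way to sketch the commit queue: writes become durable individually at 4 of 6 ACKs, but commits are released strictly in LSN order as the durable point advances. The heap-based structure is illustrative, not Aurora's:

```python
import heapq

class CommitQueue:
    """Pending commits in LSN order; the DB durable point trails the smallest
    LSN that has not yet reached quorum."""
    def __init__(self):
        self.pending = []        # min-heap of LSNs awaiting quorum
        self.acks = {}           # lsn -> number of storage-node ACKs
        self.callbacks = {}      # lsn -> client callback releasing the commit
        self.durable_lsn = 0

    def submit(self, lsn, on_durable):
        heapq.heappush(self.pending, lsn)
        self.acks[lsn] = 0
        self.callbacks[lsn] = on_durable

    def ack(self, lsn, quorum=4):
        """A storage node acknowledged `lsn`; advance the durable point past
        every leading LSN that has already reached quorum."""
        self.acks[lsn] += 1
        while self.pending and self.acks[self.pending[0]] >= quorum:
            done = heapq.heappop(self.pending)
            self.durable_lsn = done
            self.callbacks.pop(done)("commit durable at LSN %d" % done)

q = CommitQueue()
for lsn in (10, 20, 30):
    q.submit(lsn, print)
for node in range(4):   # four nodes ACK LSN 20 first...
    q.ack(20)           # ...but nothing is released: LSN 10 is still pending
for node in range(4):
    q.ack(10)           # now both 10 and 20 become durable, in order
print("durable:", q.durable_lsn)   # durable: 20
```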
Adaptive thread pool

MYSQL THREAD MODEL
- Standard MySQL – one thread per connection
- Doesn't scale with connection count
- MySQL EE – connections assigned to a thread group
- Requires careful stall-threshold tuning

AURORA THREAD MODEL
- Re-entrant connections multiplexed to active threads
- Kernel-space epoll() inserts into a latch-free event queue (see the sketch below)
- Dynamically sized thread pool
- Gracefully handles 5,000+ concurrent client sessions on r3.8xlarge

[Diagram: client connections feed a latch-free task queue through epoll(); a small pool of threads serves the queue]
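The same connection-multiplexing idea can be sketched with Python's selectors module (which wraps epoll() on Linux): one dispatcher thread waits for readiness events, and a small fixed pool of workers serves whichever connections are active. This is a toy echo server, not Aurora's engine:

```python
import queue
import selectors
import socket
import threading

sel = selectors.DefaultSelector()   # uses epoll() on Linux
tasks = queue.SimpleQueue()         # stand-in for the latch-free task queue

def worker():
    """Workers in a small fixed pool serve whichever connection is ready."""
    while True:
        conn = tasks.get()
        data = conn.recv(1024)
        if data:
            conn.sendall(b"ok: " + data)               # pretend query result
            sel.register(conn, selectors.EVENT_READ)   # watch for next request
        else:
            conn.close()                               # client went away

def serve(port=12345, n_workers=4):
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1024)
    sel.register(srv, selectors.EVENT_READ, data="accept")
    for _ in range(n_workers):
        threading.Thread(target=worker, daemon=True).start()
    while True:                     # single dispatcher thread blocks in epoll
        for key, _ in sel.select():
            if key.data == "accept":
                conn, _ = key.fileobj.accept()
                sel.register(conn, selectors.EVENT_READ)
            else:
                sel.unregister(key.fileobj)   # avoid re-reporting while queued
                tasks.put(key.fileobj)        # hand off to the worker pool

# serve()  # idle connections cost a file descriptor, not a dedicated thread
```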
Aurora lock management

[Diagram: in the MySQL lock manager, scans, inserts, and deletes serialize on a lock chain; in the Aurora lock manager, they proceed concurrently]

- Same locking semantics as MySQL
- Concurrent access to lock chains
- Multiple scanners allowed in an individual lock chain
- Lock-free deadlock detection
- Needed to support many concurrent sessions and high update throughput
New performance enhancements
Cached read performance

- Catalog concurrency: improved data dictionary synchronization and cache eviction
- NUMA-aware scheduler: the Aurora scheduler is now NUMA aware, which helps it scale on multi-socket instances
- Read views: Aurora now uses a latch-free concurrent read-view algorithm to construct read views

[Chart: thousands of read requests/sec for MySQL 5.6, MySQL 5.7, Aurora 2015, and Aurora 2016 – 25% throughput gain]
* r3.8xlarge instance, <1GB dataset, using SysBench
Non-cached read performance

- Smart scheduler: the Aurora scheduler now dynamically assigns threads between I/O-heavy and CPU-heavy workloads
- Smart selector: Aurora reduces read latency by selecting the copy of the data on the storage node with the best performance
- Logical read-ahead (LRA): we avoid read I/O waits by prefetching pages based on their order in the B-tree

[Chart: thousands of requests/sec for MySQL 5.6, MySQL 5.7, Aurora 2015, and Aurora 2016 – 10% throughput gain]
* r3.8xlarge instance, 1TB dataset, using SysBench
Hot row contention

[Diagram: the same lock-chain picture as above – scans, inserts, and deletes contending on hot rows in the MySQL and Aurora lock managers]

- Highly contended workloads had high memory and CPU usage
- 1.9 (Nov) – lock compression (bitmap for hot locks)
- 1.9 – replaced spinlocks with a blocking futex – up to 12X reduction in CPU, 3X improvement in throughput
- December – use dynamic programming to release locks: from O(totalLocks × waitLocks) to O(totalLocks)
- Throughput on Percona TPC-C 100 improved 29X (from 1,452 txns/min to 42,181 txns/min)
Hot row contention

Percona TPC-C – 10GB
                   MySQL 5.6   MySQL 5.7   Aurora   Improvement
500 connections    6,093       25,289      73,955   2.92x
5000 connections   1,671       2,592       42,181   16.3x

Percona TPC-C – 100GB
                   MySQL 5.6   MySQL 5.7   Aurora   Improvement
500 connections    3,231       11,868      70,663   5.95x
5000 connections   5,575       13,005      30,221   2.32x

* Numbers are in tpmC, measured using release 1.10 on an r3.8xlarge; MySQL numbers use RDS and EBS with 30K PIOPS
Batch insert performance

- Accelerates batch inserts sorted by primary key – works by caching the cursor position in an index traversal (see the sketch below)
- Dynamically turns itself on or off based on the data pattern
- Avoids contention in acquiring latches while navigating down the tree
- Bi-directional; works across all insert statements: LOAD INFILE, INSERT INTO SELECT, INSERT INTO REPLACE, and multi-value inserts

[Diagram: a B-tree with root, index, and leaf rows R0-R8. MySQL traverses the B-tree from the root for every insert; Aurora's inserts avoid the index traversal]
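A toy version of the cached-cursor fast path, using a sorted list as a stand-in for the B-tree; the class and its fast-path test are invented for illustration:

```python
import bisect

class ToyIndex:
    """Sorted-list stand-in for a B-tree, with a cached insert cursor."""
    def __init__(self):
        self.keys = []
        self.cursor = None       # position of the last insert, if still valid

    def _full_traversal(self, key):
        return bisect.bisect_left(self.keys, key)   # models root-to-leaf descent

    def insert(self, key):
        nxt = self.cursor + 1 if self.cursor is not None else None
        # Fast path: the key lands exactly after the cached cursor, which is
        # what happens when a batch arrives sorted by primary key.
        if (nxt is not None
                and self.keys[nxt - 1] <= key
                and (nxt == len(self.keys) or key <= self.keys[nxt])):
            pos = nxt
        else:
            pos = self._full_traversal(key)          # slow path: full descent
        self.keys.insert(pos, key)
        self.cursor = pos

idx = ToyIndex()
for k in [10, 20, 30, 40]:   # sorted batch: only the first insert traverses
    idx.insert(k)
idx.insert(5)                # out-of-order key falls back to a full traversal
print(idx.keys)              # [5, 10, 20, 30, 40]
```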
Faster index build

- MySQL 5.6 leverages Linux read-ahead, but this requires consecutive block addresses in the B-tree. It inserts entries top-down into the new B-tree, causing splits and excessive logging.
- Aurora's scan prefetches blocks based on their position in the tree, not their block address.
- Aurora builds the leaf blocks and then the branches of the tree (see the sketch below):
  - No splits during the build
  - Each page touched only once
  - One log record per page
- 2-4X better than MySQL 5.6 or MySQL 5.7

[Chart: index build time in hours for RDS MySQL 5.6, RDS MySQL 5.7, and Aurora 2016 – r3.large on a 10GB dataset, r3.8xlarge on a 10GB dataset, r3.8xlarge on a 100GB dataset]
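The bottom-up build can be sketched as follows: leaves are filled left to right from the sorted scan, then each branch level is built from the level below, so no node ever splits and each page is written exactly once. The node layout here is invented for the example:

```python
def build_bottom_up(sorted_keys, fanout=4):
    """Build a toy B-tree bottom-up from already-sorted keys. Because levels
    are packed left to right, there are no splits and every page is touched
    once, so a single log record per page suffices."""
    def chunk(items):
        return [items[i:i + fanout] for i in range(0, len(items), fanout)]

    level = [{"leaf": True, "keys": ks, "min": ks[0]} for ks in chunk(sorted_keys)]
    pages_written = len(level)
    while len(level) > 1:        # build branch levels until one root remains
        level = [{"leaf": False, "children": grp, "min": grp[0]["min"]}
                 for grp in chunk(level)]
        pages_written += len(level)
    return level[0], pages_written

root, pages = build_bottom_up(list(range(100)))
print(pages)   # 25 leaves + 7 + 2 + 1 branch pages = 35, each written once
```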
Why spatial index

Need to store and reason about spatial data
- E.g., "find all people within 1 mile of a hospital"
- Spatial data is multi-dimensional
- B-tree indexes are one-dimensional

Aurora supports spatial data types (point/polygon)
- GEOMETRY data types inherited from MySQL 5.6
- This spatial data cannot be indexed

Two possible approaches:
- A specialized access method for spatial data (e.g., R-tree)
- Map spatial objects to one-dimensional space and store them in a B-tree – a space-filling curve over a grid approximation
[Diagram: grid approximations of shapes A and B, and the pairwise spatial relations between them: COVERS/COVEREDBY, CONTAINS/INSIDE, TOUCH, OVERLAPBDYINTERSECT, OVERLAPBDYDISJOINT, EQUAL, DISJOINT, COVERS/ON]
Spatial indexes in Aurora

R-tree (used in MySQL 5.7) – challenges:
- Keeping it efficient while balanced
- Rectangles should not overlap or cover empty space
- Degenerates over time
- Re-indexing is expensive

Z-index (used in Aurora) – a dimensionally ordered space-filling curve:
- Uses a regular B-tree for storing and indexing (see the sketch below)
- Removes sensitivity to the resolution parameter
- Adapts to the granularity of the actual data without user declaration
- E.g., GeoWave (National Geospatial-Intelligence Agency)
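The core of a Z-index is bit interleaving: map a point to a grid cell, interleave the x and y bits into a single Morton key, and index that key with an ordinary B-tree. A minimal sketch follows; the grid resolution and coordinate scaling are arbitrary choices for the example:

```python
def interleave(x, y, bits=16):
    """Morton (Z-order) code: interleave the bits of x and y so that points
    close in 2-D tend to be close on the resulting 1-D key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # even bit positions take x
        z |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions take y
    return z

def grid_cell(lon, lat, bits=16):
    """Map a point to a grid cell, then to its Z-order key."""
    x = int((lon + 180) / 360 * (2 ** bits - 1))
    y = int((lat + 90) / 180 * (2 ** bits - 1))
    return interleave(x, y, bits)

# Nearby points get nearby keys, so a B-tree range scan covers a small region.
print(grid_cell(-115.17, 36.12))   # Las Vegas
print(grid_cell(-115.18, 36.11))   # close by: key lands near the first
print(grid_cell(139.69, 35.69))    # Tokyo: far away in key space
```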
Spatial index benchmarks

SysBench – points and polygons
[Chart: Aurora vs. MySQL 5.7 – select-only (reads/sec) and write-only (writes/sec)]
* r3.8xlarge using SysBench on a <1GB dataset
* Write-only: 4,000 clients; select-only: 2,000 clients; ST_EQUALS
Availability
“Performance only matters if your database is up”
Storage durability

- Storage volume automatically grows, up to 64 TB
- Quorum system for reads/writes; latency tolerant
- Peer-to-peer gossip replication to fill in holes
- Continuous backup to S3 (built for 11 9s durability)
- Continuous monitoring of nodes and disks for repair
- 10 GB segments as the unit of repair or hotspot rebalance
- Quorum membership changes do not stall writes

[Diagram: six copies spread across AZ 1, AZ 2, and AZ 3, with continuous backup to Amazon S3]
Aurora Replicas

- Aurora clusters contain a primary node and up to fifteen replicas
- Failing database nodes are automatically detected and replaced
- Failing database processes are automatically detected and recycled
- Customer applications may scale out read traffic across replicas
- Replicas are automatically promoted on persistent outage

[Diagram: a primary node in AZ 1 with secondary nodes in AZ 2 and AZ 3]
Continuous backup

[Diagram: per-segment snapshots and log records along a timeline, with a recovery point]

- Take periodic snapshots of each segment in parallel; stream the redo logs to Amazon S3
- Backup happens continuously, without performance or availability impact
- At restore, retrieve the appropriate segment snapshots and log streams to the storage nodes
- Apply the log streams to the segment snapshots in parallel and asynchronously
Instant crash recovery

TRADITIONAL DATABASES
- Have to replay logs since the last checkpoint
- Typically 5 minutes between checkpoints
- Single-threaded in MySQL; requires a large number of disk accesses
[Diagram: a crash at T0 requires re-application of the SQL in the redo log since the last checkpoint]

AMAZON AURORA
- Underlying storage replays redo records on demand as part of a disk read (see the sketch below)
- Parallel, distributed, asynchronous
- No replay needed at startup
[Diagram: a crash at T0 results in redo logs being applied to each segment on demand, in parallel, asynchronously]
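On-demand replay can be sketched per segment: the segment keeps its last coalesced page image plus any redo records not yet applied, and a read materializes the current page by applying them in LSN order. The apply-function representation is invented for the example:

```python
class Segment:
    """Toy storage segment: a page image plus redo records not yet applied.
    A read materializes the current page by applying pending redo on demand,
    so the database never replays a log at startup."""
    def __init__(self, page):
        self.page = page           # last coalesced block version
        self.pending = []          # redo records accumulated since then

    def append_redo(self, lsn, apply_fn):
        self.pending.append((lsn, apply_fn))

    def read(self):
        for _, apply_fn in sorted(self.pending):   # replay in LSN order
            self.page = apply_fn(self.page)
        self.pending.clear()       # page is now current (a mini-checkpoint)
        return self.page

seg = Segment({"balance": 100})
seg.append_redo(11, lambda p: {**p, "balance": p["balance"] - 30})
seg.append_redo(12, lambda p: {**p, "balance": p["balance"] + 5})
# A "crash" here loses nothing: the next read applies the redo it needs.
print(seg.read())   # {'balance': 75}
```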
Survivable caches

- We moved the cache out of the database process
- The cache remains warm in the event of a database restart
- Lets you resume fully loaded operations much faster
- Instant crash recovery + survivable cache = quick and easy recovery from DB failures

[Diagram: SQL, transaction, and caching layers; the caching process sits outside the DB process and remains warm across a database restart]
Faster failover

[Diagram: failover timelines]
- MySQL: DB failure → failure detection → DNS propagation → recovery → recovery → app running again, 15-20 sec in total
- Aurora with MariaDB driver: DB failure → failure detection → DNS propagation → recovery → app running again, 3-20 sec in total
Database failover time

[Histograms: distribution of observed failover times]
- 0-5 sec – 30% of failovers
- 5-10 sec – 40% of failovers
- 10-20 sec – 25% of failovers
- 20-30 sec – 5% of failovers
New availability enhancements
Availability is about more than HW failures
You also incur availability disruptions when you
1. Patch your database software
2. Modify your database schema
3. Perform large scale database reorganizations
4. Restore a database after a user error
Zero downtime patching

[Diagram: before ZDP, user sessions terminate during patching as the old DB engine is replaced by the new one on the storage service. With ZDP, application and networking state are parked while the engines are swapped, so user sessions remain active through patching.]
Zero downtime patching – current constraints

We have to fall back to our current patching model when we can't park connections:
- Long-running queries
- Open transactions
- Binlog enabled
- Parameter changes pending
- Temporary tables open
- Locked tables
- SSL connections open
- Read replica instances

We are working on addressing the above.
Database cloning

Create a copy of a database without duplicate storage costs
- Creation of a clone is nearly instantaneous – we don't copy data
- Data copy happens only on write – when original and cloned volume data differ

Typical use cases:
- Clone a production DB to run tests
- Reorganize a database
- Save a point-in-time snapshot for analysis without impacting the production system

[Diagram: a production database serving production applications, with clones of it (and clones of clones) serving dev/test applications and benchmarks]
How does it work?

[Diagram: the source database and the cloned database both reference pages 1-4 on the shared distributed storage system]

Both databases reference the same pages on the shared distributed storage system.
How does it work? (contd.)

[Diagram: after divergence, the source database and the cloned database each reference their own new page versions plus the pages still common to both on the shared distributed storage system]

As the databases diverge, new pages are added appropriately to each database while both continue to reference the pages they have in common. (A copy-on-write sketch follows below.)
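A minimal copy-on-write sketch, assuming a toy `Volume` class: cloning copies page references rather than page data, and a write rebinds only the written page:

```python
class Volume:
    """Toy copy-on-write volume: a clone starts by sharing every page with
    its source; a write gives the writer its own copy of just that page."""
    def __init__(self, pages=None):
        self.pages = pages or {}         # page_id -> page object (shared freely)

    def clone(self):
        return Volume(dict(self.pages))  # copy references, not page data

    def write(self, page_id, data):
        self.pages[page_id] = data       # rebinding diverges only this page

    def shared_with(self, other):
        return [p for p in self.pages
                if other.pages.get(p) is self.pages[p]]

src = Volume({1: "p1", 2: "p2", 3: "p3", 4: "p4"})
dup = src.clone()              # near-instant: no page data copied
dup.write(2, "p2-new")         # clone diverges on page 2 only
src.write(5, "p5")             # source grows independently
print(dup.shared_with(src))    # [1, 3, 4] still shared
```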
Online DDL: Aurora vs. MySQL

MYSQL
- Full table copy; rebuilds all indexes – can take hours or days to complete
- Needs temporary space for DML operations
- DDL operation impacts DML throughput
- A table lock is applied to apply DML changes
[Diagram: a B-tree with root, index, and leaf nodes]

AMAZON AURORA
- We add an entry to a metadata table and use schema versioning to decode the block
- Added a modify-on-write primitive to upgrade the block to the latest schema when it is modified (see the sketch below)
- Currently supports adding a NULLable column at the end of the table
- The priority is to support the other add-column variants, drop/reorder, and datatype modifications

Metadata table:
table name   operation   column-name   time-stamp
Table 1      add-col     column-abc    t1
Table 2      add-col     column-qpr    t2
Table 3      add-col     column-xyz    t3
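The decode path and the modify-on-write upgrade can be sketched like this; the version table and block layout are invented stand-ins for Aurora's metadata:

```python
SCHEMA_VERSIONS = {
    1: ["id", "name"],             # original table definition
    2: ["id", "name", "email"],    # after: ALTER TABLE ... ADD COLUMN email
}

def decode(block):
    """Read path: interpret the raw row using the schema the block was
    written with, padding NULLs for columns added since (no rewrite)."""
    cols = SCHEMA_VERSIONS[block["schema"]]
    latest = SCHEMA_VERSIONS[max(SCHEMA_VERSIONS)]
    row = dict(zip(cols, block["row"]))
    return {c: row.get(c) for c in latest}

def modify(block, **changes):
    """Write path: modify-on-write upgrades the block to the latest schema
    only when it is actually modified."""
    row = decode(block)
    row.update(changes)
    latest = max(SCHEMA_VERSIONS)
    return {"schema": latest,
            "row": [row[c] for c in SCHEMA_VERSIONS[latest]]}

old_block = {"schema": 1, "row": [7, "alice"]}    # written before the ALTER
print(decode(old_block))            # {'id': 7, 'name': 'alice', 'email': None}
print(modify(old_block, email="a@example.com"))   # now a version-2 block
```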
Online DDL performance

On r3.large
              Aurora     MySQL 5.6    MySQL 5.7
10GB table    0.27 sec   3,960 sec    1,600 sec
50GB table    0.25 sec   23,400 sec   5,040 sec
100GB table   0.26 sec   53,460 sec   9,720 sec

On r3.8xlarge
              Aurora     MySQL 5.6    MySQL 5.7
10GB table    0.06 sec   900 sec      1,080 sec
50GB table    0.08 sec   4,680 sec    5,040 sec
100GB table   0.15 sec   14,400 sec   9,720 sec
Online point-in-time restore

Online point-in-time restore is a quick way to bring the database to a particular point in time without having to restore from backups
- Rewind the database to quickly recover from unintentional DML/DDL operations
- Rewind multiple times to find the desired point in the database's state – for example, quickly iterate over schema changes without having to restore multiple times

[Diagram: a timeline from t0 to t4 with rewinds to t1 and to t3; the log spans written after each rewind point become invisible]
Online vs. offline point-in-time restore (PiTR)

ONLINE PiTR
- The online PiTR operation changes the state of the current DB
- The current DB is available within seconds, even for multi-terabyte DBs
- No additional storage cost, as the current DB is restored to a prior point in time
- Multiple iterative online PiTRs are practical
- The rewind has to be within the allowed rewind period, based on purchased rewind storage
- Cross-region online PiTR is not supported

OFFLINE PiTR
- PiTR creates a new DB at the desired point in time from the backup of the current DB
- A new DB instance takes hours to restore for multi-terabyte DBs
- Each restored DB is billed for its own storage
- Multiple iterative offline PiTRs are time consuming
- Offline PiTR has to be within the configured backup window or from snapshots
- Aurora supports cross-region offline PiTR
How does it work?

[Diagram: per-segment snapshots and log records along a timeline, with a rewind point]

- Aurora takes periodic snapshots within each segment in parallel and stores them locally
- At rewind time, each segment picks its previous local snapshot and applies the log stream to it to produce the desired state of the DB

How does it work? (contd.)

- Logs within the log stream are made visible or invisible based on the branch within the LSN tree, providing a consistent view of the DB (see the sketch below)
- The first rewind, performed at t2 to rewind the DB to t1, makes the purple logs invisible
- The second rewind, performed at t4 to rewind the DB to t3, makes the red and purple logs invisible

[Diagram: a timeline from t0 to t4 with rewinds to t1 and to t3; the log spans written after each rewind point become invisible]
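One way to picture the LSN tree: each rewind forks a new branch, and a record is visible only if it lies on the current branch's ancestry at or below every fork point crossed on the way up. A toy model (all data structures here are invented):

```python
class RewindableLog:
    """Toy LSN tree: each rewind starts a new branch; visibility is decided
    by walking the branch's ancestry."""
    def __init__(self):
        self.records = []          # (lsn, branch, payload)
        self.branch = 0
        self.parents = {0: None}   # branch -> (parent_branch, fork_lsn)
        self.lsn = 0

    def append(self, payload):
        self.lsn += 1
        self.records.append((self.lsn, self.branch, payload))

    def rewind(self, to_lsn):
        new = max(self.parents) + 1
        self.parents[new] = (self.branch, to_lsn)
        self.branch = new                  # new writes land on the new branch

    def visible(self):
        """A record is visible iff its branch is an ancestor of (or equal to)
        the current branch and its LSN is at or below every fork point we
        crossed while walking up to it."""
        out = []
        for lsn, branch, payload in self.records:
            b, limit = self.branch, float("inf")
            while b is not None and b != branch:
                b, fork_lsn = self.parents[b] or (None, 0)
                limit = min(limit, fork_lsn)
            if b == branch and lsn <= limit:
                out.append((lsn, payload))
        return out

log = RewindableLog()
for p in ["t0 write", "t1 write", "t2 write"]:   # LSNs 1, 2, 3
    log.append(p)
log.rewind(to_lsn=2)        # rewind to t1: the LSN-3 record becomes invisible
log.append("post-rewind")   # LSN 4, on the new branch
print(log.visible())        # [(1, 't0 write'), (2, 't1 write'), (4, 'post-rewind')]
```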
Removing blockers

"My applications require PostgreSQL"
- Amazon Aurora PostgreSQL compatibility is now in preview
- Same underlying scale-out, 3-AZ, 6-copy, fault-tolerant, self-healing, expanding, database-optimized storage tier
- Integrated with a PostgreSQL 9.6-compatible database
- Session DAT206-R today @ 3:30 – Venetian, Level 3, San Polo 3403

[Diagram: SQL, transaction, and caching layers on top of the logging + storage tier, backed by Amazon S3]
"An r3.large is too expensive for my use case"
- T2.Small coming in Q1 2017
- T2 RI discounts: up to 34% with a 1-year RI, up to 57% with a 3-year RI

                vCPU   Mem (GiB)   Hourly price*
db.t2.medium    2      4           $0.082
db.r3.large     2      15.25       $0.29
db.r3.xlarge    4      30.5        $0.58
db.r3.2xlarge   8      61          $1.16
db.r3.4xlarge   16     122         $2.32
db.r3.8xlarge   32     244         $4.64

* Prices are for Virginia
"My databases need to meet certifications"
- Amazon Aurora gives each database instance IP firewall protection
- Aurora offers transparent encryption at rest and SSL protection for data in transit
- Amazon VPC lets you isolate and control the network configuration and connect securely to your IT infrastructure
- AWS Identity and Access Management (IAM) provides resource-level permission controls
Aurora auditing

MARIADB SERVER_AUDIT PLUGIN: each event (DDL, DML, query, DCL, connect) has its event string created and then written to a single file, serially.

AURORA NATIVE AUDIT SUPPORT: event strings are created concurrently, pushed through a latch-free queue, and written to multiple files in parallel (see the sketch below). Aurora can sustain over 500K events/sec.

SysBench select-only workload on an 8xlarge instance:
            MySQL 5.7   Aurora   Improvement
Audit off   95K         615K     6.47x
Audit on    33K         525K     15.9x
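The parallel shape can be approximated with a queue drained by several writers, each owning its own file, so no single log file serializes the stream. Python's queue is lock-based rather than latch-free, so this only sketches the structure:

```python
import queue
import threading

audit_q = queue.SimpleQueue()    # stand-in for Aurora's latch-free queue

def format_event(kind, detail):
    """Event strings are built by the session threads, off the write path."""
    return f"{kind}: {detail}\n"

def file_writer(path):
    """Several writers drain the queue concurrently, each owning one file,
    so no single log file serializes the whole audit stream."""
    with open(path, "w") as f:
        while True:
            event = audit_q.get()
            if event is None:     # shutdown sentinel
                return
            f.write(event)

writers = [threading.Thread(target=file_writer, args=(f"audit-{i}.log",))
           for i in range(4)]
for w in writers:
    w.start()
for i in range(1000):
    audit_q.put(format_event("DML", f"update #{i}"))
for _ in writers:                 # one sentinel per writer
    audit_q.put(None)
for w in writers:
    w.join()
```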
AWS ecosystem

- Lambda: generate AWS Lambda events from Aurora stored procedures (*NEW*)
- S3: load data from Amazon S3; store snapshots and backups in S3 (*NEW*)
- IAM: use AWS IAM roles to manage database access control (Q1)
- CloudWatch: upload system metrics and audit logs to Amazon CloudWatch (Q1)
MySQL compatibility

- MySQL 5.6 / InnoDB compatible
- No application compatibility issues reported since launch
- MySQL ISV applications run pretty much as-is
- Working on 5.7 compatibility
  - Running a bit slower than expected
  - Back-ported 81 fixes from different MySQL releases

[Logos: business intelligence, data integration, and query and monitoring tools]

"We ran our compatibility test suites against Amazon Aurora and everything just worked."
– Dan Jewett, VP of Product Management at Tableau
Timeline

Rolling out across release 1.9 (available now), release 1.10 (available in December), and Q1, by area:
- Performance: lock compression, replacing spinlocks with a blocking futex, spatial indexing, faster index build, fast online schema change
- Availability: zero-downtime patching, online point-in-time restore, database cloning, copy-on-write volume
- Security: Aurora auditing, IAM integration, PCI/DSS, HIPAA/BAA
- Ecosystem: managed MySQL-to-Aurora replication, cross-region snapshot copy, T2.Medium, T2.Small, CloudWatch for metrics and audit
Thank you! We are collecting feedback forms in the back.
There is also a pile of temporary tattoos there that you can put on before the relay party. Two sheets in each one, so you can share with a friend. Have fun!