
Deep Dive on Amazon Aurora - Covering New Feature Announcements


Page 1

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Steve Abraham – Solutions Architect

December 14, 2016

Deep Dive on Amazon Aurora: Including New Features

Page 2

Agenda

What is Aurora?

Review of Aurora performance

New performance enhancements

Review of Aurora availability

New availability enhancements

Other recent and upcoming feature enhancements

Page 3

What is Amazon Aurora?

Open source compatible relational database

Performance and availability of commercial databases

Simplicity and cost-effectiveness of open source databases

Page 4

Performance

Page 5

Scaling with instance sizes

[Charts: write performance and read performance by instance size; series: Aurora, MySQL 5.6, MySQL 5.7.]

Aurora scales with instance size for both reads and writes.

Page 6

Real-life data – gaming workload

Aurora vs. RDS MySQL – r3.4xlarge, Multi-AZ

Latency before the Aurora migration: 15 ms. After: 5.5 ms. Aurora is 3X faster on r3.4xlarge.

Page 7

How did we achieve this?

Do less work:
Do fewer I/Os
Minimize network packets
Cache prior results
Offload the database engine

Be more efficient:
Process asynchronously
Reduce the latency path
Use lock-free data structures
Batch operations together

Databases are all about I/O. Network-attached storage is all about packets/second. High-throughput processing is all about context switches.

Page 8

IO traffic in MySQL

[Diagram: MySQL with replica across AZ 1 and AZ 2; primary and replica instances each write to Amazon EBS with an EBS mirror; backups go to Amazon S3. Types of write: binlog, data, double-write, log, FRM files. Write path is numbered 1-5.]

IO flow:
Issue write to EBS – EBS issues to mirror, ack when both done
Stage write to standby instance through DRBD
Issue write to EBS on standby instance

Observations:
Steps 1, 3, 4 are sequential and synchronous
This amplifies both latency and jitter
Many types of writes for each user operation
Have to write data blocks twice to avoid torn writes

Performance:
780K transactions; 7,388K I/Os per million txns (excludes mirroring, standby); average 7.4 I/Os per transaction
(30-minute SysBench write-only workload, 100 GB dataset, RDS Multi-AZ, 30K PIOPS)

Page 9

IO traffic in Aurora

[Diagram: primary instance in AZ 1 with replica instances in AZ 2 and AZ 3 on shared storage; asynchronous 4/6 quorum, distributed writes; backups to Amazon S3. Of the MySQL write types (binlog, data, double-write, log, FRM files), only the redo log is shipped.]

IO flow:
Boxcar redo log records – fully ordered by LSN
Shuffle to appropriate segments – partially ordered
Boxcar to storage nodes and issue writes

Observations:
Only write redo log records; all steps asynchronous
No data block writes (checkpoint, cache replacement)
6X more log writes, but 9X less network traffic
Tolerant of network and storage outlier latency

Performance:
27,378K transactions – 35X more
950K I/Os per 1M txns (6X amplification) – 7.7X less
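To make the 4/6 quorum concrete, here is a minimal sketch (my own illustration, not Aurora's code) of issuing a write to six storage-node copies and treating it as durable once any four acknowledge; the per-node write API is a stand-in assumption.

```python
import asyncio

WRITE_QUORUM = 4   # acks required out of...
TOTAL_COPIES = 6   # ...six copies spread across three AZs

async def quorum_write(nodes, log_record):
    """Return True once 4 of the 6 nodes ACK; stragglers finish async."""
    pending = [asyncio.ensure_future(n.write(log_record))  # hypothetical API
               for n in nodes]
    acks = 0
    for fut in asyncio.as_completed(pending):
        try:
            await fut
            acks += 1
        except Exception:
            continue          # a slow or failed node is tolerated
        if acks >= WRITE_QUORUM:
            return True       # durable; outlier latency no longer matters
    return False              # quorum not reached; caller retries/repairs
```

Because the caller only waits for the fourth ACK, one slow disk or network path never sits on the commit latency path, which is the "tolerant of outlier latency" property above.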

Page 10

IO traffic in Aurora (storage node)

[Diagram: primary instance ships log records to the storage node's incoming queue; the node maintains an update queue, hot log, and data blocks; sort/group, coalesce, scrub, and GC stages; point-in-time snapshots and backups stage to S3; peer-to-peer gossip with peer storage nodes. Steps numbered 1-8.]

IO flow:
1. Receive record and add to in-memory queue
2. Persist record and ACK
3. Organize records and identify gaps in log
4. Gossip with peers to fill in holes
5. Coalesce log records into new data block versions
6. Periodically stage log and new block versions to S3
7. Periodically garbage collect old versions
8. Periodically validate CRC codes on blocks

Observations:
All steps are asynchronous
Only steps 1 and 2 are in the foreground latency path
Input queue is 46X less than MySQL (unamplified, per node)
Favor latency-sensitive operations
Use disk space to buffer against spikes in activity
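A toy illustration of step 5 (my own sketch, not storage-node internals): coalescing a block's pending redo records, in LSN order, into a new block version. The same apply step is what lets a read materialize the latest version of a block on demand.

```python
from dataclasses import dataclass

@dataclass
class RedoRecord:
    lsn: int
    offset: int
    data: bytes     # bytes to splice into the block at `offset`

def coalesce(block: bytes, pending: list[RedoRecord]) -> bytes:
    """Apply pending redo records in LSN order; return the new block version."""
    buf = bytearray(block)
    for rec in sorted(pending, key=lambda r: r.lsn):
        buf[rec.offset:rec.offset + len(rec.data)] = rec.data
    return bytes(buf)

# A read of a stale block can apply its pending redo on demand:
# latest = coalesce(snapshot_block, redo_since_snapshot)
```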

Page 11

IO traffic in Aurora replicas

MySQL read scaling – logical: ship SQL statements to the replica
[Diagram: MySQL master (30% read, 70% write) and MySQL replica (30% new reads, 70% write), each on its own data volume; single-threaded binlog apply.]
Write workload is similar on both instances
Independent storage
Can result in data drift between master and replica

Amazon Aurora read scaling – physical: ship redo from master to replica
[Diagram: Aurora master (30% read, 70% write) and Aurora replica (100% new reads) on shared Multi-AZ storage; page cache update.]
Replica shares storage; no writes performed
Cached pages have redo applied
Advance read view when all commits seen

Page 12

Real-life data – read replica latency

“In MySQL, we saw replica lag spike to almost 12 minutes which is almost absurd from an application’s perspective. With Aurora, the maximum read replica lag across 4 replicas never exceeded 20 ms.”

Page 13

Asynchronous group commits

[Diagram: transactions T1…Tn interleave reads, writes, and commits over time. Commit records (e.g., at LSNs 10, 12, 20, 22, 30, 34, 41, 47, 49, 50) wait in a commit queue in LSN order and are acknowledged in groups as the durable LSN at the head node advances.]

Traditional approach:
Maintain a buffer of log records to write out to disk
Issue write when buffer full or time out waiting for writes
First writer has latency penalty when write rate is low

Amazon Aurora:
Request I/O with first write, fill buffer till write picked up
Individual write durable when 4 of 6 storage nodes ACK
Advance DB durable point up to earliest pending ACK
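A condensed sketch of the commit-queue side (assuming a quorum-write primitive like the earlier sketch advances the durable LSN; the structures here are invented for illustration):

```python
import heapq

class GroupCommit:
    def __init__(self):
        self.durable_lsn = 0
        self.pending = []   # min-heap of (commit_lsn, seq, ack_callback)
        self.seq = 0        # tie-breaker so callbacks never get compared

    def on_commit(self, commit_lsn, ack_client):
        # No synchronous per-transaction log flush: just queue the commit.
        heapq.heappush(self.pending, (commit_lsn, self.seq, ack_client))
        self.seq += 1

    def on_quorum_ack(self, new_durable_lsn):
        # The storage quorum advanced the durable point: release every
        # pending commit at or below it, in LSN order, as one group.
        self.durable_lsn = max(self.durable_lsn, new_durable_lsn)
        while self.pending and self.pending[0][0] <= self.durable_lsn:
            _, _, ack_client = heapq.heappop(self.pending)
            ack_client()   # tell this client its transaction is durable
```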

Page 14

Adaptive thread pool

MySQL thread model:
Standard MySQL – one thread per client connection
Doesn’t scale with connection count
MySQL EE – connections assigned to thread groups
Requires careful stall threshold tuning

Aurora thread model:
[Diagram: client connections feed a latch-free task queue via epoll().]
Re-entrant connections multiplexed to active threads
Kernel-space epoll() inserts into latch-free event queue
Dynamically sized thread pool
Gracefully handles 5,000+ concurrent client sessions on r3.8xlarge
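For intuition, a minimal sketch of the pattern (not the Aurora engine): an event loop multiplexes many sockets onto a small worker pool through a shared queue. Python's selectors module stands in for kernel epoll(), a Queue stands in for the latch-free queue, and cross-thread register/unregister races are glossed over.

```python
import queue
import selectors
import socket
import threading

sel = selectors.DefaultSelector()   # epoll() on Linux
tasks = queue.Queue()               # stand-in for the latch-free task queue

def worker():
    while True:
        conn = tasks.get()               # a connection with data ready
        data = conn.recv(4096)
        if data:
            conn.sendall(data)           # placeholder for running the query
            sel.register(conn, selectors.EVENT_READ, data="client")
        else:
            conn.close()

# A small, fixed pool of workers serves thousands of connections.
for _ in range(8):
    threading.Thread(target=worker, daemon=True).start()

listener = socket.create_server(("127.0.0.1", 13306))
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, data="accept")

while True:
    for key, _ in sel.select():
        if key.data == "accept":
            conn, _ = key.fileobj.accept()
            sel.register(conn, selectors.EVENT_READ, data="client")
        else:
            sel.unregister(key.fileobj)  # hand off; worker re-registers
            tasks.put(key.fileobj)
```

The point of the design: thread count tracks available CPU, not connection count, so 5,000 idle sessions cost file descriptors rather than 5,000 stacks and context switches.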

Page 15

Aurora lock management

[Diagram: lock chains with concurrent scan, delete, and insert operations – serialized under the MySQL lock manager, concurrent under the Aurora lock manager.]

Same locking semantics as MySQL
Concurrent access to lock chains
Multiple scanners allowed in an individual lock chain
Lock-free deadlock detection

Needed to support many concurrent sessions and high update throughput

Page 16

New Performance Enhancements

Page 17

Cached read performance

Catalog concurrency: improved data dictionary synchronization and cache eviction.

NUMA-aware scheduler: the Aurora scheduler is now NUMA aware, which helps it scale on multi-socket instances.

Read views: Aurora now uses a latch-free concurrent read-view algorithm to construct read views.

[Chart: thousands of read requests/sec (scale 0-700) for MySQL 5.6, MySQL 5.7, Aurora 2015, and Aurora 2016; r3.8xlarge instance, <1 GB dataset, SysBench.]

25% throughput gain.

Page 18

Non-cached read performance

Smart scheduler: the Aurora scheduler now dynamically assigns threads between IO-heavy and CPU-heavy workloads.

Smart selector: Aurora reduces read latency by selecting the copy of the data on the storage node with the best performance.

Logical read ahead (LRA): we avoid read IO waits by prefetching pages based on their order in the B-tree.

[Chart: thousands of requests/sec (scale 0-120) for MySQL 5.6, MySQL 5.7, Aurora 2015, and Aurora 2016; r3.8xlarge instance, 1 TB dataset, SysBench.]

10% throughput gain.
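A simplified sketch of the logical read-ahead idea (interfaces invented for illustration): during a range scan, take the child page IDs from the current branch page and keep a window of prefetches in flight in key order, rather than faulting on each leaf or relying on consecutive block addresses.

```python
def scan_leaves_with_lra(branch_page, storage, window=32):
    """Prefetch leaf pages in B-tree key order, not block-address order."""
    leaf_ids = branch_page.child_page_ids()    # logical order from the tree
    for page_id in leaf_ids[:window]:
        storage.prefetch(page_id)              # prime the read-ahead window
    for i, page_id in enumerate(leaf_ids):
        if i + window < len(leaf_ids):
            storage.prefetch(leaf_ids[i + window])  # keep the window full
        yield storage.read(page_id)            # usually already cached
```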

Page 19

Hot row contention

[Diagram: concurrent scan, delete, and insert operations on lock chains under the MySQL and Aurora lock managers.]

Highly contended workloads had high memory and CPU usage:
1.9 (Nov) – lock compression (bitmap for hot locks)
1.9 – replace spinlocks with blocking futex – up to 12x reduction in CPU, 3x improvement in throughput
December – use dynamic programming to release locks: from O(totalLocks * waitLocks) to O(totalLocks)

Throughput on Percona TPC-C 100 improved 29x (from 1,452 txns/min to 42,181 txns/min).

Page 20

Hot row contention

Percona TPC-C – 10GB

                  MySQL 5.6   MySQL 5.7   Aurora    Improvement
500 connections   6,093       25,289      73,955    2.92x
5000 connections  1,671       2,592       42,181    16.3x

Percona TPC-C – 100GB

                  MySQL 5.6   MySQL 5.7   Aurora    Improvement
500 connections   3,231       11,868      70,663    5.95x
5000 connections  5,575       13,005      30,221    2.32x

* Numbers are in tpmC, measured using release 1.10 on an r3.8xlarge; MySQL numbers use RDS and EBS with 30K PIOPS

Page 21

Insert performance

MySQL: traverses the B-tree starting from the root for every insert.
Aurora: inserts avoid the index traversal.

[Diagram: B-tree with root, index, and leaf records R0-R8; MySQL walks root-to-leaf per insert, Aurora resumes at the cached cursor position.]

Accelerates batch inserts sorted by primary key – works by caching the cursor position in an index traversal.
Dynamically turns itself on or off based on the data pattern.
Avoids contention in acquiring latches while navigating down the tree.
Bi-directional; works across all insert statements: LOAD INFILE, INSERT INTO SELECT, INSERT INTO REPLACE, and multi-value inserts.
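To illustrate the cursor-caching idea (a sketch with made-up B-tree interfaces, not Aurora's implementation): for key-ordered inserts, keep the last leaf pinned and only fall back to a full root-to-leaf descent when the next key falls outside that leaf.

```python
class CachedCursorInserter:
    def __init__(self, btree):
        self.btree = btree
        self.leaf = None   # the leaf the previous insert landed in

    def insert(self, key, row):
        # Fast path: sorted batch inserts usually hit the same leaf,
        # skipping the root-to-leaf traversal and its latch acquisitions.
        if (self.leaf is None or not self.leaf.covers(key)
                or self.leaf.is_full()):
            self.leaf = self.btree.descend_to_leaf(key)  # slow path
        self.leaf.insert(key, row)
```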

Page 22

Faster index build

MySQL 5.6 leverages Linux read-ahead – but this requires consecutive block addresses in the B-tree. It inserts entries top-down into the new B-tree, causing splits and lots of logging.

Aurora's scan prefetches blocks based on position in the tree, not block address. Aurora builds the leaf blocks first and then the branches of the tree:
No splits during the build
Each page touched only once
One log record per page

[Chart: index build time in hours (scale 0-12) for RDS MySQL 5.6, RDS MySQL 5.7, and Aurora 2016 on r3.large with a 10GB dataset, r3.8xlarge with a 10GB dataset, and r3.8xlarge with a 100GB dataset.]

2-4X better than MySQL 5.6 or MySQL 5.7.

Page 23

Why spatial index

Need to store and reason about spatial data, e.g., “find all people within 1 mile of a hospital.” Spatial data is multi-dimensional; B-tree indexes are one-dimensional.

Aurora supports spatial data types (point/polygon), with GEOMETRY data types inherited from MySQL 5.6 – but this spatial data cannot be indexed.

Two possible approaches:
Specialized access method for spatial data (e.g., R-tree)
Map spatial objects to one-dimensional space and store in a B-tree – a space-filling curve using a grid approximation

[Diagram: grid approximation of two regions A and B, with the pairwise spatial relationships: A COVERS B / B COVEREDBY A; A CONTAINS B / B INSIDE A; A TOUCH B / B TOUCH A; A OVERLAPBDYINTERSECT B / B OVERLAPBDYINTERSECT A; A OVERLAPBDYDISJOINT B / B OVERLAPBDYDISJOINT A; A EQUAL B / B EQUAL A; A DISJOINT B / B DISJOINT A; A COVERS B / B ON A.]

Page 24

Spatial indexes in Aurora

R-tree, used in MySQL 5.7 – challenges with R-trees:
Keeping it efficient while balanced
Rectangles should not overlap or cover empty space
Degenerates over time
Re-indexing is expensive

Z-index, used in Aurora (dimensionally ordered space-filling curve):
Uses a regular B-tree for storing and indexing
Removes sensitivity to the resolution parameter
Adapts to the granularity of the actual data without user declaration
E.g., GeoWave (National Geospatial-Intelligence Agency)
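For intuition, a Z-order (Morton) key interleaves the bits of grid-quantized x and y coordinates, so points close in 2-D tend to be close in the 1-D key space a B-tree can index. A minimal sketch; the 16-bit grid and lon/lat ranges are my own choices, not Aurora parameters:

```python
def z_order_key(x: float, y: float, bits: int = 16,
                x_range=(-180.0, 180.0), y_range=(-90.0, 90.0)) -> int:
    """Map a 2-D point to a 1-D Morton key by interleaving quantized bits."""
    def quantize(v, lo, hi):
        return int((v - lo) / (hi - lo) * ((1 << bits) - 1))

    xq, yq = quantize(x, *x_range), quantize(y, *y_range)
    key = 0
    for i in range(bits):
        key |= ((xq >> i) & 1) << (2 * i)        # even bit positions: x
        key |= ((yq >> i) & 1) << (2 * i + 1)    # odd bit positions: y
    return key

# All keys inside a bounding box lie between the keys of its min and
# max corners, so one B-tree range scan covers the box; the scan also
# returns some keys outside the box, which must be filtered afterwards.
```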

Page 25

Spatial index benchmarks

[Charts: SysBench throughput for spatial workloads on points and polygons.]

* r3.8xlarge using SysBench on a <1 GB dataset
* Write-only: 4,000 clients; select-only: 2,000 clients; ST_EQUALS

Page 26

Availability

“Performance only matters if your database is up”

Page 27

Storage durability

Storage volume automatically grows up to 64 TB

Quorum system for read/write; latency tolerant

Peer-to-peer gossip replication to fill in holes

Continuous backup to S3 (built for 11 9s durability)

Continuous monitoring of nodes and disks for repair

10 GB segments as unit of repair or hotspot rebalance

Quorum membership changes do not stall writes

[Diagram: six copies of the data across AZ 1, AZ 2, and AZ 3, with backup to Amazon S3.]

Page 28

More replicas

An Aurora cluster contains a primary node and up to fifteen replicas

Failing database nodes are automatically detected and replaced

Failing database processes are automatically detected and recycled

Customer applications may scale out read traffic across replicas

A replica is automatically promoted on persistent outage

[Diagram: primary node and replica nodes spread across AZ 1, AZ 2, and AZ 3.]

Page 29

Continuous backup

[Diagram: per-segment snapshots plus log records over time; a recovery point is formed from the segment snapshots and the log streams.]

Take periodic snapshots of each segment in parallel; stream the redo logs to Amazon S3

Backup happens continuously without performance or availability impact

At restore, retrieve the appropriate segment snapshots and log streams to storage nodes

Apply log streams to segment snapshots in parallel and asynchronously
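A schematic sketch of that restore path (the S3-facing interfaces are invented for illustration; redo apply reuses the coalesce idea from the storage-node sketch):

```python
import concurrent.futures

def apply_redo(block, rec):
    """Splice one redo record's bytes into the block image."""
    buf = bytearray(block)
    buf[rec.offset:rec.offset + len(rec.data)] = rec.data
    return bytes(buf)

def restore_segment(segment_id, recovery_lsn, s3):
    """Rebuild one segment: nearest snapshot + redo up to the target LSN."""
    snapshot = s3.latest_snapshot_before(segment_id, recovery_lsn)
    image = snapshot.load()
    for rec in s3.log_stream(segment_id, snapshot.lsn, recovery_lsn):
        image = apply_redo(image, rec)
    return segment_id, image

def restore_volume(segment_ids, recovery_lsn, s3):
    # Every segment restores independently, so the whole volume is
    # rebuilt in parallel rather than replaying one serial log.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(restore_segment, s, recovery_lsn, s3)
                   for s in segment_ids]
        return dict(f.result() for f in futures)
```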

Page 30

Instant crash recovery

Traditional databases:
Have to replay logs since the last checkpoint
Typically 5 minutes between checkpoints
Single-threaded in MySQL; requires a large number of disk accesses
[Diagram: a crash at T0 requires re-application of all redo in the log since the last checkpoint.]

Amazon Aurora:
Underlying storage replays redo records on demand as part of a disk read
Parallel, distributed, asynchronous
No replay needed at startup
[Diagram: a crash at T0 results in redo logs being applied to each segment on demand, in parallel, asynchronously.]

Page 31

Survivable caches

We moved the cache out of the database process

Cache remains warm in the event of a database restart

Lets you resume fully loaded operations much faster

Instant crash recovery + survivable cache = quick and easy recovery from DB failures

[Diagram: SQL, transaction, and caching layers; the caching process runs outside the DB process and stays warm across a database restart.]

Page 32

Faster fail-over

[Timeline: app running → DB failure → failure detection → DNS propagation → recovery → app running.]

MySQL: 15-20 sec total
Aurora with the MariaDB driver: 3-20 sec total
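On the client side, the practical pattern is a short retry loop against the cluster endpoint until fail-over completes (the MariaDB driver automates the equivalent). A sketch using the mysql-connector-python package; the endpoint, user, and database names are placeholders:

```python
import time
import mysql.connector  # pip install mysql-connector-python

CLUSTER_ENDPOINT = "mycluster.cluster-xxxx.us-east-1.rds.amazonaws.com"

def connect_with_retry(retries=30, delay=1.0):
    """Reconnect until fail-over finishes and the endpoint resolves
    to the newly promoted primary."""
    for _ in range(retries):
        try:
            return mysql.connector.connect(
                host=CLUSTER_ENDPOINT, user="app", password="...",
                database="appdb", connection_timeout=2)
        except mysql.connector.Error:
            time.sleep(delay)   # fail-over typically completes in seconds
    raise RuntimeError("database did not come back within retry budget")
```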

Page 33

Database fail-over time

[Histograms of observed fail-over times across trials:]

0-5s – 30% of fail-overs
5-10s – 40% of fail-overs
10-20s – 25% of fail-overs
20-30s – 5% of fail-overs

Page 34

New Availability Enhancements

Page 35

Availability is about more than HW failures

You also incur availability disruptions when you:
1. Patch your database software
2. Modify your database schema
3. Perform large-scale database reorganizations
4. Recover from DBA errors requiring database restores

Page 36

Zero-downtime patching

Before ZDP: patching replaces the old DB engine with the new one, and networking state and application state are lost – user sessions terminate during patching.

With ZDP: networking state and application state are preserved while the old DB engine is swapped for the new one on top of the storage service – user sessions remain active through patching.

Page 37

Zero-downtime patching – current constraints

We have to fall back to our current patching model when we can't park connections:
Long-running queries
Open transactions
Binlog enabled
Parameter changes pending
Temporary tables open
Locked tables
SSL connections open
Read replica instances

We are working on addressing the above.

Page 38

Database cloning

Create a copy of a database without duplicate storage costs:
Creation of a clone is nearly instantaneous – we don’t copy data
Data copy happens only on write – when original and cloned volume data differ

Typical use cases:
Clone a production DB to run tests
Reorganize a database
Save a point-in-time snapshot for analysis without impacting the production system

[Diagram: a production database with clones, and clones of clones, serving production applications, benchmarks, and dev/test applications.]
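Creating a clone is an API call. A sketch using boto3's restore_db_cluster_to_point_in_time with the copy-on-write restore type; the cluster and instance identifiers are placeholders:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# The clone shares storage with the source; pages are copied only on write.
rds.restore_db_cluster_to_point_in_time(
    DBClusterIdentifier="mycluster-clone",        # placeholder name
    SourceDBClusterIdentifier="mycluster",        # placeholder name
    RestoreType="copy-on-write",
    UseLatestRestorableTime=True,
)

# The cluster is storage only; add an instance to connect to the clone.
rds.create_db_instance(
    DBInstanceIdentifier="mycluster-clone-1",
    DBClusterIdentifier="mycluster-clone",
    DBInstanceClass="db.r3.large",
    Engine="aurora",
)
```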

Page 39

How does it work?

[Diagram: source database and cloned database both reference the same physical pages (Page 1-4) on the shared distributed storage system.]

Both databases reference the same pages on the shared distributed storage system.

Page 40

How does it work? (contd.)

[Diagram: after writes, the source and cloned databases each reference their own new page versions (e.g., pages 1-5 vs. pages 1-6) while unchanged pages remain shared on the distributed storage system.]

As the databases diverge, new pages are added appropriately to each database, while both continue to reference the pages they still have in common.
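A toy model of the copy-on-write mechanism (pure illustration, not storage-node code): cloning copies the page map, not the pages, and a volume only gets a private page version when it writes.

```python
class CowVolume:
    """Maps page numbers to page versions; clones share the versions."""
    def __init__(self, pages=None):
        self.pages = dict(pages or {})   # page_no -> bytes (shared refs)

    def clone(self):
        # Nearly instantaneous: copy the map, not the data.
        return CowVolume(self.pages)

    def read(self, page_no):
        return self.pages[page_no]

    def write(self, page_no, data):
        # Divergence happens here: only the writer gets a private version.
        self.pages[page_no] = data

prod = CowVolume({1: b"a", 2: b"b"})
test = prod.clone()
test.write(2, b"B")                   # clone diverges on page 2
assert prod.read(2) == b"b"           # production is untouched
assert prod.read(1) is test.read(1)   # page 1 is still shared
```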

Page 41

Online DDL: Aurora vs. MySQL

MySQL:
Full table copy; rebuilds all indexes – can take hours or days to complete
Needs temporary space for DML operations
DDL operation impacts DML throughput
Table lock applied to apply DML changes
[Diagram: B-tree with root, index, and leaf pages being fully rebuilt.]

Amazon Aurora:
We add an entry to a metadata table (table name, operation, column name, timestamp) and use schema versioning to decode the block
Added a modify-on-write primitive to upgrade a block to the latest schema when it is modified
Currently supports adding a NULLable column at the end of the table
Priority is to support other add-column positions, drop/reorder column, and modifying data types
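A sketch of the schema-versioning idea with invented structures: each block records the schema version it was written under; readers decode with that version, and a block is upgraded to the latest schema only when it is next modified.

```python
SCHEMA_VERSIONS = {
    1: ["id", "name"],
    2: ["id", "name", "email"],  # ALTER TABLE ... ADD COLUMN email (NULLable)
}
LATEST = 2

def decode(block):
    """Decode rows using the schema the block was written under."""
    cols = SCHEMA_VERSIONS[block["schema_version"]]
    return [dict(zip(cols, row)) for row in block["rows"]]

def modify_on_write(block, row_idx, values):
    # Lazy upgrade: pad old rows with NULLs for the added column(s),
    # then apply the modification under the latest schema.
    if block["schema_version"] < LATEST:
        width = len(SCHEMA_VERSIONS[LATEST])
        block["rows"] = [tuple(row) + (None,) * (width - len(row))
                         for row in block["rows"]]
        block["schema_version"] = LATEST
    block["rows"][row_idx] = values
```

The DDL itself then only writes one metadata entry, which is why the timings on the next slide are fractions of a second regardless of table size.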

Page 42

Online DDL performance

On r3.large:

              Aurora     MySQL 5.6    MySQL 5.7
10GB table    0.27 sec   3,960 sec    1,600 sec
50GB table    0.25 sec   23,400 sec   5,040 sec
100GB table   0.26 sec   53,460 sec   9,720 sec

On r3.8xlarge:

              Aurora     MySQL 5.6    MySQL 5.7
10GB table    0.06 sec   900 sec      1,080 sec
50GB table    0.08 sec   4,680 sec    5,040 sec
100GB table   0.15 sec   14,400 sec   9,720 sec

Page 43

What is online point-in-time restore?

[Diagram: database timeline t0…t4 with a rewind to t1 at t2 and a rewind to t3 at t4; log ranges abandoned by each rewind become invisible.]

Online point-in-time restore is a quick way to bring the database to a particular point in time without having to restore from backups:
Rewind the database to quickly recover from unintentional DML/DDL operations
Rewind multiple times to determine the desired point in time in the database state – for example, quickly iterate over schema changes without having to restore multiple times

Page 44

Online vs. offline point-in-time restore (PiTR)

Online PiTR:
Changes the state of the current DB
Current DB is available within seconds, even for multi-terabyte DBs
No additional storage cost, as the current DB is restored to a prior point in time
Multiple iterative online PiTRs are practical
Rewind has to be within the allowed rewind period, based on purchased rewind storage
Cross-region online PiTR is not supported

Offline PiTR:
Creates a new DB at the desired point in time from the backup of the current DB
New DB instance takes hours to restore for multi-terabyte DBs
Each restored DB is billed for its own storage
Multiple iterative offline PiTRs are time-consuming
Has to be within the configured backup window, or from snapshots
Aurora supports cross-region PiTR

Page 45

How does it work?

[Diagram: per-segment snapshots and log records over time across storage segments; the rewind point falls between snapshots.]

Take periodic snapshots within each segment in parallel and store them locally

At rewind time, each segment picks the prior local snapshot and applies its log stream to produce the desired state of the DB

Page 46

How does it work? (contd.)

[Diagram: LSN tree branching at t1 and t3; log ranges abandoned by a rewind are marked invisible.]

Logs within the log stream are made visible or invisible based on the branch within the LSN tree, providing a consistent view of the DB. The first rewind, performed at t2 to bring the DB back to t1, makes the logs between t1 and t2 invisible. The second rewind, performed at t4 to bring the DB back to t3, makes the logs between t3 and t4 invisible as well.
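A small sketch of branch-based visibility (structures invented for illustration): each record carries a branch id, a rewind starts a new branch rooted at the rewind LSN, and a reader sees only the records along the current branch's ancestry.

```python
class LogBranches:
    def __init__(self):
        self.records = []           # (lsn, branch_id, payload)
        self.parents = {0: None}    # branch -> (parent_branch, rewind_lsn)
        self.current, self.next_id = 0, 1

    def append(self, lsn, payload):
        self.records.append((lsn, self.current, payload))

    def rewind(self, to_lsn):
        # New branch rooted at to_lsn; later records on the old branch
        # remain stored but become invisible.
        self.parents[self.next_id] = (self.current, to_lsn)
        self.current = self.next_id
        self.next_id += 1

    def visible(self):
        out, branch, cutoff = [], self.current, float("inf")
        while branch is not None:
            out += [r for r in self.records
                    if r[1] == branch and r[0] <= cutoff]
            parent = self.parents[branch]
            if parent is None:
                break
            branch, cutoff = parent   # parent records count up to the fork
        return sorted(out)

log = LogBranches()
log.append(10, "T1"); log.append(20, "T2")
log.rewind(10)                  # rewind to LSN 10: T2 becomes invisible
log.append(30, "T3")
assert [lsn for lsn, _, _ in log.visible()] == [10, 30]
```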

Page 47

Removing Blockers

Page 48

My applications require PostgreSQL

Amazon Aurora PostgreSQL compatibility is now in preview:
Same underlying scale-out, 3-AZ, 6-copy, fault-tolerant, self-healing, expanding database-optimized storage tier
Integrated with a PostgreSQL 9.6-compatible database

[Diagram: SQL, transaction, and caching layers on top of the shared logging + storage tier, backed by Amazon S3.]

Page 49

An r3.large is too expensive for my use case

Instance        vCPU   Mem (GiB)   Hourly price*
db.t2.medium    2      4           $0.082
db.r3.large     2      15.25       $0.29
db.r3.xlarge    4      30.5        $0.58
db.r3.2xlarge   8      61          $1.16
db.r3.4xlarge   16     122         $2.32
db.r3.8xlarge   32     244         $4.64

T2 RI discounts: up to 34% with a 1-year RI; up to 57% with a 3-year RI

T2.Small coming in Q1 2017

*Prices are for Virginia

Page 50

My databases need to meet certifications

Amazon Aurora gives each database instance IP firewall protection

Aurora offers transparent encryption at rest and SSL protection for data in transit

Amazon VPC lets you isolate and control network configuration and connect securely to your IT infrastructure

AWS Identity and Access Management provides resource-level permission controls (*New*)

Page 51

Aurora auditing

MariaDB server audit plugin: each event (DDL, DML, query, DCL, connect) has its event string created and written to the audit file one at a time.

Aurora native audit support: multiple threads create event strings in parallel, push them onto a latch-free queue, and multiple writers flush them to the file. We can sustain over 500K events/sec.

SysBench select-only workload on an 8xlarge instance:

            MySQL 5.7   Aurora
Audit off   95K         615K    (6.47x)
Audit on    33K         525K    (15.9x)

Page 52

AWS ecosystem

Lambda – generate Lambda events from Aurora stored procedures

S3 – load data from S3; store snapshots and backups in S3

IAM – use IAM roles to manage database access control

CloudWatch – upload system metrics and audit logs to CloudWatch
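Aurora exposes the Lambda and S3 integrations directly in SQL. A sketch (bucket, table, ARN, and endpoint names are placeholders, and both statements assume the cluster has an IAM role granting the access):

```python
import mysql.connector  # any MySQL client works; the statements are Aurora extensions

conn = mysql.connector.connect(
    host="mycluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
    user="app", password="...", database="appdb")
cur = conn.cursor()

# Bulk-load a CSV directly from S3 into a table (Aurora extension).
cur.execute("""
    LOAD DATA FROM S3 's3://my-bucket/events.csv'
    INTO TABLE events
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'
""")

# Invoke a Lambda function asynchronously from SQL (Aurora extension).
cur.execute("""
    CALL mysql.lambda_async(
        'arn:aws:lambda:us-east-1:123456789012:function:notify',
        '{"event": "rows_loaded"}')
""")
conn.commit()
```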

Page 53

MySQL compatibility

MySQL 5.6 / InnoDB compatible:
No application compatibility issues reported in the last 18 months
MySQL ISV applications run pretty much as-is
Back-ported 81 fixes from different MySQL releases

Working on 5.7 compatibility:
Running a bit slower than expected – hope to make it available soon

[Partner logos: business intelligence, data integration, and query and monitoring tools.]

“We ran our compatibility test suites against Amazon Aurora and everything just worked." - Dan Jewett, VP, Product Management at Tableau

Page 54

Timeline – features available now, in December, and in Q1

Performance: fast online schema change; managed MySQL-to-Aurora replication; spatial indexing; lock compression; replace spinlocks with blocking futex; faster index build

Availability: cross-region snapshot copy; online point-in-time restore; database cloning (copy-on-write volume); zero-downtime patching

Security: Aurora auditing; IAM integration; PCI/DSS; HIPAA/BAA

Ecosystem: T2.Medium; T2.Small; CloudWatch for metrics and audit

Page 55

Thank you!

Steve Abraham – [email protected]