
Page 1

Aurora Multi-Master

Justin Levandoski

Page 2

Agenda

> Aurora Fundamentals
> Aurora Multi-Master Architecture
> Aurora Multi-Master Use Cases

Page 3

Aurora Fundamentals

Page 4

Amazon Aurora: A relational database reimagined for the cloud

✓ Speed and availability of high-end commercial databases

✓ Simplicity and cost-effectiveness of open source databases

✓ Drop-in compatibility with MySQL and PostgreSQL

✓ Simple pay-as-you-go pricing

Delivered as a managed service

Page 5

Why Aurora

[Diagram: a monolithic database stack (SQL, transactions, caching, logging) over attached storage.]

Relational databases were not designed for the cloud: monolithic architecture, large failure blast radius.

Databases in the cloud:
• Compute & storage have different lifetimes
• Instances fail, shut down, and scale up & down
• Instances are added to a cluster

Compute & storage are best decoupled for scalability, availability, and durability.

Page 6

Aurora Architecture

• Database tier
  - Writes redo log to network
  - No checkpointing! The log is the database
  - Pushes log application to storage
  - Master replicates to read replicas for cache updates

• Storage tier
  - Highly parallel scale-out redo processing
  - Data replicated 6 ways across 3 AZs
  - Generates/materializes database pages on demand
  - Instant database redo recovery

• 4/6 write quorum with local tracking
  - AZ + 1 failure tolerance
  - Read quorum needed only during recovery

[Diagram: one master and read replicas (each a SQL / Transactions / Caching stack) spread across AZ1, AZ2, and AZ3, all sharing one storage volume on distributed storage nodes with SSDs.]
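A minimal sketch of the 4/6 write quorum, under stated assumptions: `replicas` is a hypothetical list of six client stubs (two per AZ) whose blocking `append` returns True on a durable ACK; this is not the real Aurora storage API. A write is durable once any 4 of the 6 copies acknowledge, which is what tolerates the loss of an entire AZ plus one more node.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

WRITE_QUORUM = 4  # of 6 copies (2 per AZ across 3 AZs)

def quorum_write(replicas, log_record):
    """Fan a redo log record out to all 6 copies; succeed on 4 ACKs."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.append, log_record) for r in replicas]
    acks, ok = 0, False
    for fut in as_completed(futures):
        try:
            if fut.result():
                acks += 1
        except Exception:
            pass  # a failed or slow node simply doesn't count toward quorum
        if acks >= WRITE_QUORUM:
            ok = True  # durable: AZ+1 failures still leave intact copies
            break
    pool.shutdown(wait=False)  # stragglers finish in the background
    return ok
```

Because the database locally tracks which nodes acknowledged each write, it can serve reads from a single up-to-date copy in steady state; the read quorum is only consulted during recovery, as the bullets above note.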

Page 7

Aurora Storage Node

[Diagram: storage node internals. Log records from the primary instance enter the incoming queue (1) and are persisted to the hot log and ACKed (2); background stages sort/group the records (3), gossip peer-to-peer with peer storage nodes (4), coalesce into data blocks (5), stage point-in-time snapshots to S3 backup (6), GC old versions (7), and scrub (8), with an update queue feeding the pipeline.]

IO FLOW
① Receive record and add to in-memory queue
② Persist record and ACK
③ Organize records and identify gaps in log
④ Gossip with peers to fill in holes
⑤ Coalesce log records into new data block versions
⑥ Periodically stage log and new block versions to S3
⑦ Periodically garbage collect old versions
⑧ Periodically validate CRC codes on blocks

OBSERVATIONS
• All steps are asynchronous
• Only steps 1 and 2 are in the foreground latency path
• Input queue is 46X less than MySQL (unamplified, per node)
• Favor latency-sensitive operations
• Use disk space to buffer against spikes in activity

SUMMARY
• Manages 10GB page segments; 10GB is the right size for repair/fault tolerance
• Uses fault tolerance for heat management/machine patching
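A toy model of the foreground/background split above, with illustrative names (`hot_log`, `blocks`) that are not Aurora internals: only receive and persist sit on the client's latency path, and the rest of the pipeline drains asynchronously (steps 3-4 and 6-8 are elided).

```python
import queue
import threading

class StorageNode:
    """Toy storage node: only steps 1-2 are on the latency path."""

    def __init__(self):
        self.incoming = queue.Queue()  # (1) in-memory input queue
        self.hot_log = []              # (2) persisted records (stand-in for SSD)
        self.blocks = {}               # (5) coalesced data block versions
        threading.Thread(target=self._background, daemon=True).start()

    def append(self, record):
        """Foreground latency path: receive, persist, ACK."""
        self.hot_log.append(record)    # (2) persist (fsync elided)
        self.incoming.put(record)      # hand off to the background pipeline
        return "ACK"

    def _background(self):
        while True:
            record = self.incoming.get()
            # (3) sort/group and find gaps, (4) gossip with peers to fill
            # holes, (6) stage to S3, (7) GC, (8) scrub are elided; here we
            # only (5) coalesce the record into a new block version.
            self.blocks[record["page"]] = record["lsn"]
```

For example, `StorageNode().append({"page": "p1", "lsn": 10})` returns "ACK" as soon as the record is persisted, before any block version is materialized.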

Page 8

Crash Recovery

[Diagram: the log at crash, containing log records and gaps; the Volume Complete LSN (VCL) and Consistency Point LSN (CPL) are marked, and immediately after crash recovery everything past the CPL is gone.]

• Storage establishes consistency points that increase monotonically and are continuously returned to the DB

• Transactions commit once the DB can prove all their changes have met quorum

• Volume Complete LSN (VCL) is the highest point up to which all records have met quorum

• Consistency Point LSN (CPL) is the highest commit record below the VCL

• Everything past the CPL is deleted at crash recovery

• Removes the need for 2PC across storage nodes at each commit

• No redo or undo processing is required before the database is opened for processing
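A small sketch of how the two recovery points follow from those definitions; the `(lsn, met_quorum, is_commit)` record format is an assumption for illustration, not Aurora's log format.

```python
def recovery_points(records):
    """records: list of (lsn, met_quorum, is_commit), sorted by LSN.

    VCL = highest LSN up to which every record has met quorum.
    CPL = highest commit record at or below the VCL.
    Everything past the CPL is truncated at crash recovery.
    """
    vcl = 0
    for lsn, met_quorum, _ in records:
        if not met_quorum:
            break  # the first un-quorumed record (a gap) bounds the VCL
        vcl = lsn
    cpl = max((lsn for lsn, _, is_commit in records
               if is_commit and lsn <= vcl), default=0)
    return vcl, cpl

# Example: the record at LSN 40 never met quorum, so VCL = 30 and the
# commit at LSN 50 is lost; the CPL falls back to the commit at LSN 20.
log = [(10, True, False), (20, True, True), (30, True, False),
       (40, False, False), (50, True, True)]
assert recovery_points(log) == (30, 20)
```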

Page 9

Aurora Multi-Master Architecture

Page 10

SHARED DISK CLUSTER: Distributed Lock Manager

[Diagram: an application talks to nodes M1, M2, M3, each a full SQL / Transactions / Caching / Logging stack over shared storage; a global resource manager mediates locking protocol messages between the nodes.]

Pros
• All data available to all nodes
• Easy to build applications
• Similar cache coherency as in multi-processors

Cons
• Heavyweight cache coherency traffic on a per-lock basis
• Networking can get expensive
• Negative scaling with hot blocks

Page 11

SHARED NOTHING: Partitioned with Consensus

[Diagram: an application routes requests to independent nodes, each a full SQL / Transactions / Caching / Logging stack over its own storage; data is split into ranges (#1-#5), each with a consensus leader (L).]

Pros
• Query broken up and sent to data nodes
• Less coherence traffic: only for commits
• Can scale to many nodes

Cons
• Heavyweight commit and membership change protocols
• Can result in hot partitions and expensive repartitioning
• Cross-partition operations are expensive; better at small requests


Page 12

Aurora Multi-Master

[Diagram: writer instances (each a SQL / Transactions / Caching stack) in Availability Zone 1 and Availability Zone 2 over a shared storage volume on storage nodes with SSDs; cluster services provide membership, heartbeat, replication, and metadata.]

• Each instance can execute write transactions with no coordination with the others

• Instances share a distributed storage volume

• Nodes fail and recover independently

• Optimistic page-based conflict resolution

• No pessimistic locking

• No global commit coordination

• Writer instances in two availability zones provide continuous availability

• GA August 2019

Page 13

Non-Conflicting Writes

Non-conflicting writes originating on different masters on different tables need no synchronization:

Time | Blue Master        | Green Master
  1  | Begin Trx (BT1)    | Begin Trx (OT1)
  2  | Update (table1)    | Update (table2)
  3  | Commit (BT1) → OK  | Commit (OT1) → OK

[Diagram: each master writes different pages of the shared storage volume.]

Page 14

Conflicting Write

Conflicting writes originating on different masters on the same table trigger optimistic conflict resolution:

Time | Blue Master            | Green Master
  1  | Begin Trx (BT1)        | Begin Trx (OT1)
  2  | Update (row1, table1)  | Update (row1, table1)
  3  | Commit (BT1) → OK      | Rollback (OT1) → RETRY

Storage accepts the first write to the contested page; the losing transaction is rolled back and the client must retry.

[Diagram: both masters target the same page of the shared storage volume; only one write wins.]
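Because the losing side surfaces as a rollback with a RETRY outcome, applications typically wrap multi-master writes in a retry loop. A sketch using the MySQL-compatible `pymysql` driver; treating the conflict as an `OperationalError`, and the backoff constants, are assumptions rather than documented Aurora behavior.

```python
import time
import pymysql

def with_retry(conn, statements, attempts=5):
    """Run a transaction, retrying when optimistic conflict
    resolution rolls it back (surfaced here as a MySQL error)."""
    for attempt in range(attempts):
        try:
            with conn.cursor() as cur:
                cur.execute("BEGIN")
                for stmt in statements:
                    cur.execute(stmt)
                cur.execute("COMMIT")
            return True
        except pymysql.err.OperationalError:
            conn.rollback()                     # transaction lost the race
            time.sleep(0.05 * (2 ** attempt))   # back off before retrying
    return False
```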

Page 15

Logical Conflict

Conflicting writes on the same row can also be caught before commit, by logical conflict detection:

Time | Blue Master            | Green Master
  1  | Begin Trx (BT1)        | Begin Trx (OT1)
  2  | Update (row1, table1)  |
  3  |                        | Update (row1, table1) and rollback (OT1) → RETRY
  4  | Commit (BT1) → OK      |

[Diagram: the green master's update is detected as a logical conflict against the blue master's in-flight write and is rolled back immediately.]

Page 16

Mechanics from the Head Node

• Partitioned LSN and transaction ID space

• Durability and resolution point at storage constantly increasing (creating new page versions)

• Incoming replication from other masters

• Database engine must handle a rejected write to storage:
  - Transaction rollback
  - B-tree structure modifications
  - Etc.

[Diagram: the head node sends updates to storage, which accepts or rejects each one; replication streams in from the other masters.]

Page 17

Multi-Master Commit

[Diagram: the Blue Master and Green Master each issue commits; their commit log records interleave across the shared storage nodes.]

Page 18

Mechanics from the Storage Node

• Log records written by multiple masters

• Quorum-commits log records as before, fills in the log chain, etc.

• Detects conflicting writes from other nodes

• Returns a rejection to the log write on conflict

[Diagram: the same storage node pipeline as before, but the incoming queue also performs conflict detection; Masters 1..N send log records and receive ACCEPT/REJECT.]
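A toy version of that accept/reject decision: the node tracks the latest accepted version of each page and rejects any record generated against a stale version, i.e. a page another master has written in the meantime. The version-per-page scheme is an illustrative assumption, not Aurora's actual metadata.

```python
class PageStore:
    """Toy conflict detection keyed on per-page versions."""

    def __init__(self):
        self.page_version = {}  # page_id -> LSN of latest accepted write

    def try_write(self, page_id, based_on_lsn, new_lsn):
        """ACCEPT a record generated against the current page version;
        REJECT it if another master got there first."""
        current = self.page_version.get(page_id, 0)
        if based_on_lsn != current:
            return "REJECT"  # conflicting write from another node
        self.page_version[page_id] = new_lsn
        return "ACCEPT"

store = PageStore()
assert store.try_write("p1", based_on_lsn=0, new_lsn=10) == "ACCEPT"  # blue wins
assert store.try_write("p1", based_on_lsn=0, new_lsn=11) == "REJECT"  # green loses
```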

Page 19

Recovery in Multi-Master

[Diagram: single master vs. multi-master recovery. Single master: the log at crash has gaps; recovery establishes the VCL and CPL, and everything past the CPL is truncated. Multi-master: when the green master crashes, its gaps are filled and a green-master recovery point is established, while the surviving masters keep generating new LSNs (and new gaps).]

Page 20

Consistency Model

• Instance Read-After-Write (INSTANCE_RAW): a transaction can observe all transactions previously committed on this instance, and transactions executed on other nodes subject to replication lag.

• Regional Read-After-Write (REGIONAL_RAW): a transaction can observe all transactions previously committed on all instances in the cluster.
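The mode is chosen per session (next slide). A sketch with `pymysql`; the variable name `aurora_mm_session_consistency_level` is recalled from the AWS Aurora multi-master documentation, so treat it as an assumption to verify, and the endpoint and credentials are placeholders.

```python
import pymysql

# Placeholder endpoint and credentials for a multi-master cluster.
conn = pymysql.connect(host="mm-cluster.cluster-xyz.us-east-1.rds.amazonaws.com",
                       user="admin", password="...", database="app")

with conn.cursor() as cur:
    # Upgrade this session from the default INSTANCE_RAW to REGIONAL_RAW.
    # (Variable name is an assumption -- check the Aurora MM docs.)
    cur.execute("SET SESSION aurora_mm_session_consistency_level = 'REGIONAL_RAW'")
    # Reads now observe commits made on every instance in the cluster,
    # paying the replication wait only on the read path.
    cur.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
    print(cur.fetchone())
```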

Page 21

Regional Read-After-Write

[Diagram: a client commits T2 on N2 and T3 on N3, then reads on N1 (after its own T1); N1 waits for replication to catch up through T2 AND T3 before answering, over the shared distributed storage volume.]

• Globally consistent results

• No waits on the write path

• Adds latency ONLY to consistent reads

• Configurable per session
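The read-side wait can be modeled as blocking until the node's applied replication position covers everything the cluster had committed when the read arrived. Names here (`applied_lsn`, `on_replicated`) are illustrative, not engine internals.

```python
import threading

class ReadNode:
    """Toy node that delays REGIONAL_RAW reads until replication catches up."""

    def __init__(self):
        self.applied_lsn = 0
        self._advanced = threading.Condition()

    def on_replicated(self, lsn):
        # Replication apply loop advances this node's position; never blocks.
        with self._advanced:
            self.applied_lsn = max(self.applied_lsn, lsn)
            self._advanced.notify_all()

    def regional_raw_read(self, query, cluster_commit_lsn):
        # Wait until everything committed anywhere in the cluster at read
        # time has been applied locally; only the read path pays this.
        with self._advanced:
            while self.applied_lsn < cluster_commit_lsn:
                self._advanced.wait()
        return f"<{query} evaluated at applied_lsn={self.applied_lsn}>"
```

Note where the latency lands: `on_replicated` (the write/replication path) never blocks, and only `regional_raw_read` waits, matching the bullets above.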

Page 22

Optimistic Execution: Mini-Transactions

• Resolution point constantly advancing

• Could pessimistically wait for a multi-page mtr to resolve (sacrifices concurrency/performance)

• Aurora optimistically executes multi-page mtrs (greater in-memory concurrency)

• Rolls back an mtr (and all dependent operations) retroactively on conflict

• Adaptively switches to pessimistic resolution if a high percentage of conflicts is detected
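A sketch of that adaptive policy over a hypothetical mini-transaction interface; `apply`, `conflicted`, `wait_until_resolved`, and `rollback_with_dependents` are all assumed, as are the window and threshold values.

```python
from collections import deque

class MtrExecutor:
    """Adaptive optimistic/pessimistic execution of mini-transactions."""

    def __init__(self, window=100, threshold=0.2):
        self.outcomes = deque(maxlen=window)  # 1 = conflict, 0 = clean
        self.threshold = threshold            # assumed tuning knobs

    def conflict_rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def run(self, mtr):
        if self.conflict_rate() > self.threshold:
            # High recent conflict rate: wait for the resolution point to
            # pass this mtr before applying (no rollbacks, less concurrency).
            mtr.wait_until_resolved()
            mtr.apply()
            conflicted = False
        else:
            # Optimistic: apply before resolution for in-memory concurrency,
            # retroactively undoing the mtr and its dependents on conflict.
            mtr.apply()
            conflicted = mtr.conflicted()
            if conflicted:
                mtr.rollback_with_dependents()
        self.outcomes.append(1 if conflicted else 0)
        return not conflicted

class FakeMtr:
    """Minimal stand-in so the sketch runs."""
    def apply(self): pass
    def conflicted(self): return False
    def wait_until_resolved(self): pass
    def rollback_with_dependents(self): pass

assert MtrExecutor().run(FakeMtr())
```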

Page 23

Aurora Multi-Master Use Cases

Page 24

Continuous Availability

[Diagram: an application spanning AZ1, AZ2, AZ3; one R/W instance plus R/O replicas over the shared storage volume on storage nodes with SSDs; on failover a replica flips from R/O to R/W.]

• A lot of work can be done on a single Aurora writer

• Writable replicas provide instant failover

Page 25

Multi-Writer Configuration

[Diagram: an application spanning AZ1, AZ2, AZ3; every instance is R/W over the shared storage volume on storage nodes with SSDs.]

• Structure the workload to limit conflicts between database instances

• Prefer partitioning writes per table (or table partition) to a single database instance; see the routing sketch below

• Aurora MM allows customers to "soft partition," or re-partition on the fly

• Continuous availability through failures and planned maintenance
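One way to realize the soft partitioning above is a thin routing layer that pins each table's writes to one writer endpoint and can re-pin on the fly without moving data, since every instance sees the same storage volume. The endpoints and table names here are made up.

```python
# Hypothetical endpoints for a two-writer Aurora MM cluster.
WRITERS = {
    "writer-1": "instance1.cluster-xyz.rds.amazonaws.com",
    "writer-2": "instance2.cluster-xyz.rds.amazonaws.com",
}

# Soft partition: each table's writes go through exactly one instance,
# so the masters never race on the same pages.
TABLE_OWNER = {"orders": "writer-1", "users": "writer-2"}

def endpoint_for_write(table):
    """Route a write for `table` to its owning writer's endpoint."""
    return WRITERS[TABLE_OWNER[table]]

def repartition(table, writer):
    """'Re-partition on the fly': flip ownership in metadata only;
    no data moves, because all instances share one storage volume."""
    TABLE_OWNER[table] = writer
```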

Page 26

[Chart: Scaling/Node Failure in Aurora Multi-Master. Total writes/second (y-axis, 0 to 160,000) across four phases: one writer (R4.4XL); scale-out to two writers (R4.4XL + R4.4XL); scale-up of instance 1 (R4.8XL + R4.4XL); instance 1 taken offline, with the surviving writer demonstrating continuous availability.]

Page 27

Questions