Database Replication in Tashkent


CSEP 545 Transaction Processing
Sameh Elnikety

Replication for Performance
• Expensive
• Limited scalability

DB Replication is Challenging
• Single database system
  – Large, persistent state
  – Transactions
  – Complex software
• Replication challenges
  – Maintain consistency
  – Middleware replication

Background
(Diagram: each replica runs a standalone DBMS; a load balancer sits in front of Replica 1, Replica 2, and Replica 3.)

Read Tx
(Diagram: the load balancer routes read transaction T to a single replica.)
A read tx does not change DB state.

Update Tx 1/2
An update tx changes DB state.
(Diagram: update transaction T executes at one replica and produces a writeset ws; the writeset is sent to the other replicas.)
Apply (or commit) T everywhere.
Example: T1: { set x = 1 }

Update Tx 2/2: Ordering
(Diagram: two update transactions execute at different replicas and their writesets are propagated to all replicas.)
Commit updates in order.
Example: T1: { set x = 1 }, T2: { set x = 7 }
If one replica committed T1 then T2 (leaving x = 7) while another committed T2 then T1 (leaving x = 1), the replicas would diverge, so every replica must commit the updates in the same order.
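
To make the ordering machinery concrete, here is a minimal sketch (mine, not the talk's; certify_and_order and Replica.deliver are illustrative names): the middleware stamps every writeset with a global sequence number, and each replica applies writesets strictly in that order, buffering any that arrive early.

    import itertools
    import threading

    _seq = itertools.count(1)                  # middleware-side global commit order

    def certify_and_order(writeset):
        """Middleware: assign the next position in the total commit order."""
        return next(_seq), writeset

    class Replica:
        def __init__(self):
            self.state = {}                    # this replica's copy of the data
            self.next_expected = 1             # next sequence number to apply
            self.pending = {}                  # writesets that arrived out of order
            self.lock = threading.Lock()

        def deliver(self, seq, writeset):
            """Apply writesets strictly in sequence order; buffer early arrivals."""
            with self.lock:
                self.pending[seq] = writeset
                while self.next_expected in self.pending:
                    for key, value in self.pending.pop(self.next_expected):
                        self.state[key] = value
                    self.next_expected += 1

    # T1: { set x = 1 } is ordered before T2: { set x = 7 }.
    t1 = certify_and_order([("x", 1)])
    t2 = certify_and_order([("x", 7)])
    r1, r2 = Replica(), Replica()
    r1.deliver(*t1); r1.deliver(*t2)           # delivered in order
    r2.deliver(*t2); r2.deliver(*t1)           # delivered out of order, but buffered
    assert r1.state == r2.state == {"x": 7}    # both replicas converge to x = 7

Either delivery order leaves both replicas with x = 7, which is exactly the guarantee the slide asks for.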

Ordering: Sub-linear Scalability Wall
(Diagram: adding Replica 4 brings little extra throughput; every replica must still apply and commit every writeset in order, so scalability is sub-linear.)

This Talk
• General scaling techniques
  – Address fundamental bottlenecks
  – Synergistic, implemented in middleware
  – Evaluated experimentally

Super-linear Scalability
(Bar chart of throughput, TPS, for Single, Base, United, MALB, and UF: roughly 1x, 7x, 12x, 25x, and 37x the single-machine throughput, respectively.)

Big Picture: Let’s Oversimplify
• Standalone DBMS: performs R (reading) plus U (update and logging) work.
• Replica 1/N (traditional): the cluster is offered N.R reading and N.U update work; each replica performs R reading, U update and logging, plus (N-1).ws writeset application for the updates executed at the other replicas.
• Replica 1/N (optimized): the per-replica terms shrink to R*, U*, and (N-1).ws*.
• The three techniques target one term each: uniting ordering and durability (O & D) shrinks U*, MALB shrinks R*, and update filtering shrinks (N-1).ws*.
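
As a toy illustration of this cost model (my own sketch; the numbers are made up, not measurements from the talk), the speedup over a standalone server is N.(R+U) divided by the per-replica work R + U + (N-1).ws:

    # Toy cost model for the "big picture" slide (illustrative numbers only).
    R0, U0 = 1.0, 1.0          # standalone reading and update+logging cost per unit load

    def speedup(N, R=R0, U=U0, ws=0.5):
        """Cluster speedup over a standalone server (capacity R0 + U0).
        Each replica does R reading + U update/logging + (N-1)*ws writeset application."""
        per_replica = R + U + (N - 1) * ws
        return N * (R0 + U0) / per_replica

    print(speedup(16))                          # traditional: ~3.4x, far below 16x
    print(speedup(16, R=0.5, U=0.3, ws=0.05))   # smaller R*, U*, ws*: ~20x, super-linear

Shrinking R (MALB), U (uniting ordering and durability), and ws (update filtering) is what turns the sub-linear wall into the super-linear curve shown earlier.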

Key Points
1. Commit updates in order
   – Problem: enforcing the order leads to serial synchronous disk writes
   – Solution: unite ordering and durability
2. Load balancing
   – Problem: optimizing for equal load causes memory contention
   – Solution: MALB, optimize for in-memory execution
3. Update propagation
   – Problem: updates are propagated everywhere
   – Solution: update filtering, propagate only where needed

Roadmap
1. Ordering: commit updates in order (next)
2. Load balancing
3. Update propagation

Key Idea
• Traditionally: commit ordering and durability are separated
• Key idea: unite commit ordering and durability

All Replicas Must Agree
• All replicas agree on
  – which update txs commit
  – their commit order
• Total order
  – Determined by middleware
  – Followed by each replica
(Diagram: Tx A and Tx B are made durable at Replicas 1, 2, and 3.)

Order Outside DBMS
(Diagram: the middleware determines the order "A before B" outside the DBMS and delivers Tx A and Tx B, with that order, to Replicas 1, 2, and 3; each replica must make them durable and commit them in that order.)

Enforce External Commit Order
(Diagram: at each replica, a proxy receives Tx A and Tx B with the external order "A before B" and submits them to the DBMS through its SQL interface as Task A and Task B. Left on its own, the DBMS might commit B before A.)
The proxy cannot commit A and B concurrently; it must enforce the external order.

Enforce Order = Serial Commit
(Diagram: the proxy therefore commits serially: it issues the commit of A, waits for it to complete, and only then issues the commit of B.)
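
A minimal sketch of such a proxy (my illustration; commit_in_dbms stands for whatever call actually commits a transaction at the replica, e.g. issuing COMMIT over the SQL interface): it holds each commit until every transaction earlier in the external order has been acknowledged.

    import threading

    class SerialCommitProxy:
        """Release commits one at a time, in the externally assigned order."""

        def __init__(self, commit_in_dbms):
            self.commit_in_dbms = commit_in_dbms   # synchronous commit call at the DBMS
            self.next_to_commit = 1                # next position in the external order
            self.cond = threading.Condition()

        def commit(self, order_pos, tx):
            with self.cond:
                # Wait until every transaction ordered before this one has committed.
                while order_pos != self.next_to_commit:
                    self.cond.wait()
            self.commit_in_dbms(tx)                # includes the synchronous disk force
            with self.cond:
                self.next_to_commit += 1
                self.cond.notify_all()

Because each commit_in_dbms call forces the log to disk before returning, the proxy spends most of its time waiting on the disk, which is the problem quantified next.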

Commit Serialization is Slow
(Diagram: the proxy holds the commit order A, B, C. The DBMS commits A, forces it to disk for durability, and acks; only then can B commit and be forced to disk, then C. The CPU sits idle during each synchronous disk write.)
Problem: durability and ordering are separated, which forces serial synchronous disk writes.
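
A rough back-of-the-envelope (my numbers, not the talk's): if each synchronous log write costs about 5 ms, committing strictly one update at a time caps the entire cluster at roughly 1 / 0.005 = 200 update commits per second, regardless of how many replicas are added.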

Unite D. & O. in Middleware
(Diagram: durability is turned OFF at the DBMS; the middleware makes the writesets durable itself, writing the ordered commits A, B, C to its own log in one batch while the DBMS commits them in memory.)
Solution: move durability to the middleware. With durability and ordering united in the middleware, log writes can be batched (group commit).

Implementation: Uniting D & O in MW
• Middleware logs tx effects
  – Durability of update txs is guaranteed in middleware
  – Turn durability off at the database
• Middleware performs durability & ordering
  – United → group commit → fast
• Database commits update txs serially
  – Commit becomes a quick main-memory operation
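
A minimal sketch of the united approach (my illustration, assuming a simple file-based middleware log; encode and the db handle are placeholders): writesets are appended to the log in commit order, one fsync covers the whole batch, and only then are the in-memory commits released in that same order.

    import os

    def encode(tx_id, writeset):
        """Serialize one log record (placeholder format)."""
        return repr((tx_id, writeset)).encode() + b"\n"

    class UnitedLog:
        """Middleware log: durability and ordering united, enabling group commit."""

        def __init__(self, path):
            self.f = open(path, "ab", buffering=0)

        def group_commit(self, ordered_batch, db):
            """ordered_batch: list of (tx_id, writeset), already in commit order."""
            for tx_id, writeset in ordered_batch:
                self.f.write(encode(tx_id, writeset))   # log records in commit order
            os.fsync(self.f.fileno())                   # one disk force for the batch
            for tx_id, _ in ordered_batch:
                db.commit(tx_id)                        # durability off at the DBMS, so
                                                        # this is a quick in-memory step

The single fsync amortizes the disk write over every transaction in the batch, while the log itself records exactly the total order the middleware chose.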

Uniting Improves Throughput
• Metric: throughput
• Workload: TPC-W Ordering mix (50% updates)
• System: Linux cluster, PostgreSQL, 16 replicas, serializable execution
(Bar chart, TPS: Single 1x, Base 7x, United 12x.)

Roadmap
1. Ordering: commit updates in order (done)
2. Load balancing (next)
3. Update propagation

Key Idea
(Diagram: Replicas 1 and 2, each with memory and disk, behind the load balancer.)
• Traditional load balancing: equal load on replicas
• MALB (Memory-Aware Load Balancing): optimize for in-memory execution

How Does MALB Work?
Example setup: the database has tables 1, 2, and 3. Transaction type A accesses tables 1 and 2; transaction type B accesses tables 2 and 3. A replica's memory holds only two of the three tables.

Read Data From Disk
(Diagram: a least-loaded balancer sends the mixed stream A, B, A, B to both replicas. Each replica then needs tables 1, 2, and 3, which do not fit in its memory, so both replicas keep reading data from disk: slow.)

Data Fits in Memory
(Diagram: MALB sends A, A, A, A to Replica 1 and B, B, B, B to Replica 2. Replica 1 needs only tables 1 and 2, Replica 2 only tables 2 and 3; each working set fits in memory, so both replicas run fast.)
Open questions: where does the memory information come from, and what happens with many transaction types and replicas?

Estimate Tx Memory Needs
• Exploit the tx execution plan
  – Which tables & indices are accessed
  – Their access pattern (linear scan vs. direct access)
• Metadata from the database
  – Sizes of tables and indices
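
A sketch of what such an estimate could look like (my illustration; in the real system the plan comes from the DBMS and the sizes from its catalog, and the helper names below are made up):

    # Estimate how much memory a transaction type needs to execute without disk reads.
    # table_sizes / index_sizes give object sizes in bytes (catalog metadata);
    # plan_steps is a simplified view of the transaction's execution plan.

    def estimate_memory_needs(plan_steps, table_sizes, index_sizes, page_size=8192):
        needed = 0
        for step in plan_steps:
            if step["op"] == "seq_scan":          # linear scan: the whole table is touched
                needed += table_sizes[step["table"]]
            elif step["op"] == "index_scan":      # direct access: index plus a few pages
                needed += index_sizes[step["index"]] + step.get("rows", 1) * page_size
        return needed

    # Example transaction type touching two tables:
    plan = [{"op": "seq_scan", "table": "order_line"},
            {"op": "index_scan", "index": "item_pkey", "rows": 10}]
    print(estimate_memory_needs(plan, {"order_line": 600 << 20}, {"item_pkey": 30 << 20}))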

Grouping Transactions
• Objective: construct tx groups that fit together in memory
• Bin packing
  – Item: tx memory needs
  – Bin: memory of a replica
  – Heuristic: Best Fit Decreasing
• Allocate replicas to tx groups, adjusting for group loads
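
A minimal Best Fit Decreasing sketch (my illustration): each item is a transaction type's estimated memory need, each bin is one replica's memory, and every resulting group later gets its own replica(s), as in the "MALB in Action" example that follows.

    def best_fit_decreasing(mem_needs, replica_mem):
        """mem_needs: {tx_type: bytes}. Returns groups of tx types that fit in one replica's memory."""
        bins = []                                   # each bin: [remaining_bytes, [tx types]]
        for tx, need in sorted(mem_needs.items(), key=lambda kv: kv[1], reverse=True):
            fitting = [b for b in bins if b[0] >= need]
            if fitting:                             # best fit: the tightest remaining space
                best = min(fitting, key=lambda b: b[0] - need)
                best[0] -= need
                best[1].append(tx)
            else:                                   # nothing fits: open a new bin
                bins.append([replica_mem - need, [tx]])
        return [group for _, group in bins]

    # Made-up memory needs, chosen so the groups match the example on the next slides.
    GiB = 1 << 30
    needs = {"A": 0.9 * GiB, "B": 0.5 * GiB, "C": 0.4 * GiB,
             "D": 0.4 * GiB, "E": 0.3 * GiB, "F": 0.2 * GiB}
    print(best_fit_decreasing(needs, GiB))          # [['A'], ['B', 'C'], ['D', 'E', 'F']]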

MALB in Action
(Diagram: given transaction types A, B, C, D, E, F, MALB estimates their memory needs, packs them into groups that fit in a replica's memory, here Group A, Group B C, and Group D E F, and assigns each group to its own replica with its own working set on disk.)

MALB Summary
• Objective: optimize for in-memory execution
• Method
  – Estimate tx memory needs
  – Construct tx groups
  – Allocate replicas to tx groups

Experimental Evaluation
• Implementation: no change in consistency; still middleware
• Compare
  – United: efficient baseline system
  – MALB: exploits working-set information
• Same environment: Linux cluster running PostgreSQL; workload TPC-W Ordering (50% update txs)

MALB Doubles Throughput
TPC-W Ordering, 16 replicas.
(Bar chart, TPS: Single 1x, Base 7x, United 12x, MALB 25x; MALB improves on United by 105%. A second chart of normalized read I/O shows MALB performing far less read I/O than United.)

Big Gains with MALB
(Table: MALB's throughput gain over United for combinations of memory size and database size, big and small. Reported gains: 4%, 0%, 29%, 48%, 105%, 45%, 182%, 75%, 12%. The annotations "run from memory" and "run from disk" mark the extreme configurations, where gains are smallest; the largest gains appear in between, where MALB turns disk-bound execution into in-memory execution.)

Roadmap
1. Ordering: commit updates in order (done)
2. Load balancing (done)
3. Update propagation (next)

Key Idea
• Traditional: propagate updates everywhere
• Update filtering: propagate updates only to where they are needed
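
A sketch of the routing decision (my illustration; it assumes the middleware knows which tables each replica serves and which tables a writeset touches, matching the example on the following slides):

    def replicas_needing(writeset_tables, tables_at_replica):
        """Return the replicas that must receive a writeset touching writeset_tables."""
        return [r for r, tables in tables_at_replica.items() if writeset_tables & tables]

    # Replica 1 serves group A (tables 1, 2); Replica 2 serves group B (tables 2, 3).
    placement = {"replica1": {1, 2}, "replica2": {2, 3}}
    print(replicas_needing({1}, placement))     # update to table 1 -> ['replica1'] only
    print(replicas_needing({3}, placement))     # update to table 3 -> ['replica2'] only
    print(replicas_needing({2}, placement))     # update to table 2 -> both replicas

Replicas still apply the writesets they do receive in the global commit order; the sketch covers only where each writeset is sent.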

Update Filtering Example
(Diagram: MALB+UF routes group A, which uses tables 1 and 2, to Replica 1 and group B, which uses tables 2 and 3, to Replica 2. An update to table 1 is propagated only to Replica 1; an update to table 3 only to Replica 2. Each replica applies just the writesets that touch the tables it serves.)

Update Filtering in Action
(Diagram: with more replicas and table groups, an update to the red table is propagated only to the replicas holding the red table, and an update to the green table only to the replicas holding the green table.)

MALB+UF Triples Throughput
TPC-W Ordering, 16 replicas.
(Bar chart, TPS: Single 1x, Base 7x, United 12x, MALB 25x, MALB+UF 37x; UF improves on MALB by 49%. A second chart shows the number of propagated updates roughly halving, about 15 with MALB vs. 7 with MALB+UF in the chart's units.)

Filtering Opportunities
(Chart: throughput ratio of MALB+UF to MALB. TPC-W Ordering mix, 50% updates: 1.49. TPC-W Browsing mix, 5% updates: 1.02. Filtering pays off when there are many updates to filter.)

Conclusions
1. Commit updates in order
   – Problem: enforcing the order leads to serial synchronous disk writes
   – Solution: unite ordering and durability
2. Load balancing
   – Problem: optimizing for equal load causes memory contention
   – Solution: MALB, optimize for in-memory execution
3. Update propagation
   – Problem: updates are propagated everywhere
   – Solution: update filtering, propagate only where needed
