
Retrofitting High Availability Mechanism to Tame Hybrid Transaction/Analytical Processing

Sijie Shen, Rong Chen, Haibo Chen, Binyu Zang
{ds_ssj, rongchen}@sjtu.edu.cn

Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Shanghai Artificial Intelligence Laboratory

Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China


Typical workloads of databases


OLTP (OnLine Transaction Processing): short-term read-write transactions
- Online order processing
- Stock exchange
- E-commerce

OLAP (OnLine Analytical Processing): long-term read-only queries
- Business intelligence
- Financial reporting
- Data mining

HTAP: a new type of workload


Real-time queries on data generated by transactions
- IoT
- Healthcare
- Fraud detection
- Personalized recommendation

HTAP (Hybrid Transaction/Analytical Processing) = OLTP + OLAP

Requirements

Performance
- Minimize performance degradation (e.g., < 10%)
- Millions of transactions per second [1]

Freshness
- Real-time: small delay between TP and AP (e.g., < 20 ms)

[1] Alibaba Cloud. Double 11 Real-Time Monitoring System with Time Series Database.

Scenario:    Fraud detection | System monitor | Personalized ad | Stock exchange
Time delay:  20 ms           | 20 ms          | 100 ms          | 200 ms

Data analysis: TP + ETL + AP

- OLTP systems (e.g., MySQL, RocksDB): data generation, row store (UPDATE, DELETE, INSERT)
- ETL, Extract-Transform-Load (e.g., Kafka, IDAA, F1 Lightning): ships data from the TP side to the AP side
- OLAP systems (e.g., ClickHouse, MonetDB): data analysis, column store (SUM, AVG, GROUP BY)

Alternative #1: DUAL-SYSTEM
- Good performance
- Time delay: from seconds to minutes

HTAP alternatives

Alternative #1: DUAL-SYSTEM
- Good performance
- Large time delay

Alternative #2: SINGLE-LAYOUT
- Short time delay
- Huge performance degradation

Alternative #3: DUAL-LAYOUT
- Lightweight synchronization (a tradeoff)

[Figure: the HTAP design space. X-axis: freshness (time delay between TP and AP, from 100 ms to tens of seconds); Y-axes: OLTP performance degradation (0-80%) and OLTP throughput (TPC-C NewOrder txns/s, 1K-10M) alongside OLAP performance (TPC-H). Plotted systems: Hekaton, SyPer (SQL Server), HyPer, MemSQL, BatchDB, IDAA, F1 Lightning, TiDB (community), FaRMv2, DrTM+H, and VEGITO. Scenario annotations: Fraud Detection (~20 ms), System Monitoring (~20 ms), Online Gaming (50-100 ms), Personalized Ad (~100 ms), Stock Price Monitor (~200 ms), Human Reaction Time (100-250 ms). Goal: minimal degradation together with ms-level freshness.]

VEGITO: a distributed in-memory HTAP system

Opportunity: high availability (HA) for fast in-memory OLTP
- Replication-based HA mechanisms are common
- Logs are shipped synchronously during transaction commit

Reuse HA for HTAP (see the sketch below)
- For performance: OLTP on the primary, OLAP on the backup (backup/AP)
- For freshness: synchronous logs

[Figure: one of the two backups is retrofitted from a backup/TP (fresh, fault-tolerant) into a backup/AP (fresh, fault-tolerant, columnar); cleaners drain the logs shipped from the primary.]
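To make the reuse concrete, here is a minimal sketch of synchronous log shipping at commit time: the transaction is acknowledged only after every backup, including the future backup/AP, holds the log. The names (LogEntry, Backup, commit) are illustrative, not VEGITO's actual API.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct LogEntry {
  uint64_t tx_id;
  uint64_t epoch;                   // epoch tag consumed later by cleaners
  std::vector<std::string> writes;  // serialized write set
};

struct Backup {
  std::vector<LogEntry> queue;      // per-backup log queue drained by cleaners
  void append(const LogEntry& e) { queue.push_back(e); }
};

// Synchronous log shipping: the commit succeeds only after all backups
// (both backup/TP and backup/AP) have buffered the log, so the backup
// used for OLAP is fresh "for free".
bool commit(const LogEntry& log, std::vector<Backup*>& backups) {
  for (Backup* b : backups) {
    b->append(log);                 // in a real system: an RDMA/RPC round trip
  }
  return true;                      // ack the client after all replicas hold the log
}
```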

Effects of VEGITO

[Figure: the same design-space plot as above, now highlighting VEGITO: it reaches ~20 ms freshness while keeping OLTP throughput close to OLTP-specific systems, meeting both goals.]

CHALLENGES AND DESIGNS

Goal of backup/AP: fresh, fault-tolerant, columnar

Challenge #1: log cleaning

Log shipping
- TP threads append logs to queues
- Cleaners drain the logs

For high availability
- Logs are drained in parallel
- Consistency is not needed until recovery

For AP queries
- Cleaning must be consistent
- A consistent baseline is slow (70% throughput drop): global timestamp + sequential cleaning

Backup/AP needs consistent and parallel log cleaning.

Epoch-based design

Partition time into non-overlapping epochs.

Time isolation between OLTP and OLAP; each machine maintains three epoch counters (sketched below):
- E/TX: the epoch of TX logs (increased periodically)
- E/C: the epoch of logs currently being drained
- E/Q: the epoch of stable versions on backup/AP

Freshness: ms-level epochs with the consistency of distributed transactions.

[Figure: the primary stamps logs with E/TX=5, cleaners drain logs at E/C=4, and backup/AP serves stable versions at E/Q=3; versions between E/Q and E/TX are unstable, and this gap bounds freshness.]
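As a concrete (and strictly illustrative) reading of the three counters, here is a sketch in C++; EpochState and its method names are mine, not the paper's.

```cpp
#include <atomic>
#include <cstdint>

struct EpochState {
  std::atomic<uint64_t> e_tx{1};  // E/TX: epoch stamped on new TX logs
  std::atomic<uint64_t> e_c{0};   // E/C: epoch whose logs are being drained
  std::atomic<uint64_t> e_q{0};   // E/Q: epoch of stable versions on backup/AP

  // Periodically advanced (ms-level interval) by the epoch oracle.
  void advance_tx_epoch() { e_tx.fetch_add(1, std::memory_order_release); }

  // AP queries read the snapshot at E/Q: every log up to this epoch has been
  // drained everywhere, so the snapshot is consistent across machines.
  uint64_t stable_snapshot() const { return e_q.load(std::memory_order_acquire); }

  // Freshness is bounded by the gap between what TP produces (E/TX)
  // and what AP can read (E/Q).
  uint64_t freshness_gap() const { return e_tx.load() - e_q.load(); }
};
```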

Consistent epoch assignment

Gossip-style epoch assignment (sketched below)
- Epoch oracle: updates the epoch periodically and broadcasts it
- Gossip the epoch during commit if a dependency would be violated
- Consistency: a preceding TX always has an equal or smaller epoch

[Figure: TX1 writes on machines M1 and M2 and happens-before TX2, which reads on M2; the epoch oracle advances the epoch from 4 to 5, and the commit gossips epoch 5 so that TX2 is never assigned a smaller epoch than TX1.]
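A sketch of how commit-time gossip might preserve the epoch invariant, assuming each record carries the epoch of its last writer; TxContext, assign_commit_epoch, and the CAS-based fast-forward are illustrative assumptions, not VEGITO's actual protocol code.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> local_epoch{4};   // periodically refreshed by the oracle

struct TxContext {
  uint64_t observed_epoch = 0;          // max epoch seen on records it accessed
  void on_access(uint64_t record_epoch) {
    observed_epoch = std::max(observed_epoch, record_epoch);
  }
};

// Assign the commit epoch: at least the local epoch, and at least the epoch
// of any transaction it depends on. This keeps the invariant that a
// happened-before predecessor has an equal or smaller epoch.
uint64_t assign_commit_epoch(TxContext& tx) {
  uint64_t e = std::max(local_epoch.load(), tx.observed_epoch);
  // Gossip: if a dependency carried a newer epoch, fast-forward this machine.
  uint64_t cur = local_epoch.load();
  while (cur < e && !local_epoch.compare_exchange_weak(cur, e)) {}
  return e;
}
```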

Parallel log cleaning

Clean logs while respecting TX dependencies (sketched below)
- Parallelism: logs within an epoch are drained in parallel
- Consistency: each machine updates E/C individually, once all logs of an epoch have been drained

[Figure: TX1 happens-before TX2 across machines M1 and M2; cleaners on each machine drain the logs of epochs 4 and 5 in parallel, each machine advances E/C on its own once an epoch is fully drained, and backup/AP still serves the stable epoch E/Q=3.]
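A sketch of the per-epoch barrier, under the assumption that each machine tracks, per log queue, the last fully drained epoch; LogQueue and try_advance_ec are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

struct LogQueue {
  std::atomic<uint64_t> drained_epoch{0};  // last epoch fully drained here
  void drain_epoch(uint64_t e) {
    // ... apply all logs tagged with epoch e to backup/AP; logs inside one
    // epoch need no mutual ordering, so cleaners run fully in parallel ...
    drained_epoch.store(e, std::memory_order_release);
  }
};

// A machine advances E/C to e only after *all* of its queues have drained
// epoch e; once every machine reports E/C >= e, E/Q can move to e and AP
// queries gain a new consistent snapshot.
bool try_advance_ec(std::atomic<uint64_t>& e_c,
                    const std::vector<LogQueue>& queues, uint64_t e) {
  for (const auto& q : queues)
    if (q.drained_epoch.load(std::memory_order_acquire) < e) return false;
  e_c.store(e, std::memory_order_release);
  return true;
}
```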

Challenge #2: MVCS

Multi-version column store (MVCS)
- Provides isolation and OLAP performance

Reads and writes have different locality
- Column-wise reads vs. row-wise writes

Chain-based MVCS
- Updates efficiently
- But scan performance drops by 90% when reading just 0.5 more versions per record on average

[Figure: a chain-based MVCS example; updated values tagged with epochs (E=1 to E=4) hang off per-record version chains, which destroys the column's scan locality.]

VEGITO: block-based MVCS

Exploit performance for OLAP
- Scan-efficient (locality)
- Updates: copy-on-write (CoW) at the granularity of blocks

Optimizations: row-split and col-merge (see the sketch below)
- Exploit temporal and spatial locality
- Row-split: split a column into several pages (the unit of CoW)
- Col-merge: merge highly related columns together

[Figure: a block-based multi-version column store; each epoch (E=1 to E=4) keeps full, contiguous column blocks, so scans stay sequential.]

Result: 12.5x scan performance improvement.
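A minimal sketch of the block-based MVCS idea, assuming a fixed page size and a per-page version list; Page, Column, kPageSize, and the one-copy-per-epoch policy are my illustrative simplifications, not VEGITO's actual layout.

```cpp
#include <array>
#include <cstdint>
#include <memory>
#include <vector>

constexpr size_t kPageSize = 4;  // rows per page; tiny for illustration

struct Page {
  uint64_t epoch = 0;                      // epoch that produced this version
  std::array<int64_t, kPageSize> vals{};   // contiguous values for scans
  std::shared_ptr<Page> older;             // previous version of this page
};

struct Column {
  std::vector<std::shared_ptr<Page>> pages;  // newest version of each page

  explicit Column(size_t rows)
      : pages((rows + kPageSize - 1) / kPageSize) {
    for (auto& p : pages) p = std::make_shared<Page>();
  }

  // Copy-on-write at page granularity: the first update of a page in an
  // epoch copies the whole page; later updates in that epoch are in-place.
  void update(size_t row, int64_t v, uint64_t epoch) {
    auto& head = pages[row / kPageSize];
    if (head->epoch != epoch) {
      auto p = std::make_shared<Page>(*head);
      p->epoch = epoch;
      p->older = head;
      head = p;
    }
    head->vals[row % kPageSize] = v;
  }

  // A reader at a stable snapshot walks back at most a few page versions and
  // then scans contiguously, instead of chasing a chain per record.
  int64_t read(size_t row, uint64_t snapshot) const {
    auto p = pages[row / kPageSize];
    while (p->older && p->epoch > snapshot) p = p->older;
    return p->vals[row % kPageSize];
  }
};
```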

Challenge #3: tree-based index

Interference from heavy inserts
- Write-optimized tree indexes: fast inserts at the expense of read performance
- Read-optimized tree indexes: write performance is limited

Two-phase concurrent updating

Tree insert = in-place insert + balance (costly)

Insert into a buffer at each leaf; balance in batch (see the sketch below)
- Insert: 8.7x STX+HTM (read-optimized), 1.4x Masstree (write-optimized)
- Read: 9% overhead under insertion
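A minimal sketch of the two-phase idea under my own simplified Leaf type (a real B+-tree leaf with lock-free or HTM-based synchronization would differ): phase 1 appends into a per-leaf buffer without rebalancing; phase 2 merges buffers into the sorted leaves and rebalances in batch.

```cpp
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <vector>

struct Leaf {
  std::vector<uint64_t> sorted_keys;  // the read-optimized, sorted portion
  std::vector<uint64_t> buffer;       // phase-1 insert buffer (unsorted)
  std::mutex lock;
};

// Phase 1: insert into the leaf's buffer; the costly balance is deferred,
// so concurrent inserts stay cheap.
void insert(Leaf& leaf, uint64_t key) {
  std::lock_guard<std::mutex> g(leaf.lock);
  leaf.buffer.push_back(key);
}

// Readers check the small buffer in addition to the sorted keys
// (the deck reports ~9% read overhead under insertion).
bool lookup(Leaf& leaf, uint64_t key) {
  std::lock_guard<std::mutex> g(leaf.lock);
  if (std::binary_search(leaf.sorted_keys.begin(), leaf.sorted_keys.end(), key))
    return true;
  return std::find(leaf.buffer.begin(), leaf.buffer.end(), key) != leaf.buffer.end();
}

// Phase 2: batch balance; merge each buffer into the sorted keys, then split
// overfull leaves / rebalance the tree (elided).
void batch_balance(std::vector<Leaf>& leaves) {
  for (auto& leaf : leaves) {
    std::lock_guard<std::mutex> g(leaf.lock);
    leaf.sorted_keys.insert(leaf.sorted_keys.end(),
                            leaf.buffer.begin(), leaf.buffer.end());
    leaf.buffer.clear();
    std::sort(leaf.sorted_keys.begin(), leaf.sorted_keys.end());
  }
}
```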

Fault tolerance

VEGITO preserves the same availability guarantees
- No need for extra replicas
- Prefers to recover the primary from backup/TP

Special cases (rare)
- Both the primary and backup/TP fail: rebuild the primary from backup/AP and migrate it to another machine
- Backup/AP fails: rebuild it at the next epoch

Evaluation setup

16 machines, each with
- 2x 12-core Intel Xeon E5-2650 processors, 128 GB RAM
- 2x ConnectX-4 100 Gbps InfiniBand NICs

Benchmark
- CH-benCHmark ≈ TPC-C + TPC-H

Workload settings
- OLTP-only
- OLAP-only
- HTAP (VEGITO: epoch interval = 15 ms)

Comparison with specialized systems

Compared with OLTP-specific systems
- Peak throughput: 3.7M txns/s
- Only 1% lower than DrTM+H (OLTP-specific)

Compared with OLAP-specific systems
- Geometric-mean latency: 57.2 ms
- 2.8x faster than MonetDB (OLAP-specific)

HTAP workloads

VEGITO: performance and freshness
- OLTP: 1.9M txns/s (degradation: 5%)
- OLAP: 24 queries/s (degradation: 1%)
- Freshness (max time delay): < 20 ms

Recovery

Kill one of the primary machines, twice

Recovery from backup/AP
- 42 ms to rebuild the primary

Conclusion

VEGITO: retrofitting the high availability mechanism to tame hybrid transaction/analytical processing

OLTP on the primary, OLAP on the backup
- Backup/AP: fresh, columnar, and fault-tolerant

Please check out VEGITO at
- https://github.com/SJTU-IPADS/vegito

Thanks & Q&A