TRIAD: Creating Synergies Between Memory, Disk and Log
in Log Structured Key-Value Stores
EPFL
A. Arora K. Gupta P. KonkaH. Yuan
O. Balmau D. Didona R. Guerraoui W. ZwaenepoelEPFL EPFL EPFL
Nutanix Nutanix NutanixNutanix
USENIX ATC ‘17, Santa Clara CA
1
Very simple data stores.
KV pairs.
Simple operations: update, read.
KV Stores
2
KV Stores
3
Distributed Single machine
In-memory
Persistent
KV Stores
4
Distributed Single machine
In-memory
Persistent
KV Stores
5
Distributed Single machine
In-memory
Persistent
Log-Structured Merge (LSM)
TRIAD in a Nutshell
TRIAD LSM KV: achieves 2x throughput on production wklds.
Methods: Reducing background I/O in LSMs.
6
TRIAD in a Nutshell
TRIAD LSM KV: achieves 2x throughput on production wklds.
Methods: Reducing background I/O in LSMs.
7
JJ
No need to know workload a priori.LSM KV semantics preserved.
LSM Overview
8
LSM ComponentsSorted memory component
SSTables• sorted files• many SSTables/Level Cm
L0
Ln
…
9
LSM Updatesupdate
Commit Log
10
Cm
L0
Ln
LSM Reads
read
11
Cm
L0
Ln
LSM Background Ops: Flushing
12
FlushingFrom memory to L0.
Cm
L0
Ln
LSM Background Ops: Flushing
CommitLog
L0
flushing
K1 V1’
K2 V2
K3 V3’
…
Kn Vn
K1 V1
K2 V2
K1 V1’
K3 V3
Cm
13
RAM
Disk
Mem component full
LSM Background Ops: Flushing
CommitLog
L0
flushing
Cm
14
RAM
Disk
K1 V1’
K2 V2
…
Kn Vn
flush1
LSM Background Ops: Flushing
CommitLog
L0
flushing
K1 V1’
K2 V2
K3 V3’
K1 V1
K2 V2
K1 V1’’
K3 V3
…
K1 V1’
K3 V3’
Cm
15
RAM
Disk
Commit log full
LSM Background Ops: Flushing
CommitLog
L0
flushing
Cm
16
RAM
Disk
flush1
K1 V1’
K2 V2
K3 V3’
LSM Background Ops: Compaction
17
FlushingFrom memory to L0.
Cm
L0
Ln
CompactionLevels on disk.
LSM Background Ops: Compaction
18
Disk L0
L1
Ln
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
K
Key Val
… …
…
LSM Background Ops: Compaction
19
Disk L0
L1
Ln
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
K
…
K written to disk
LSM Background Ops: Compaction
20
Disk L0
L1
Ln
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
K
K
…K rewritten to disk
LSM Background Ops: Compaction
21
Disk L0
L1
Ln
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
K
… K rewritten to disk…Key Val
… …
K
LSM Background Ops: Compaction
22
Disk L0
L1
Ln
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
…Key Val
… …
K K rewritten n+1th time to disk!Key Val
… …
K
LSM Background Ops: Compaction
23
Disk L0
L1
Ln
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
…Key Val
… …
K K rewritten n+1th time to disk!Key Val
… …
K
Write amplification (WA)≈amount of rewrites of data to disk
Insight
Severe competition for compute/storage resources between LSM background ops and user ops.
24
050
100150200250300
Uniform50r-50w
Skewed50r-50w
K O
pera
tions
/sRocksDBRocksDB No BG I/O
Background I/O Overhead§ Long & slow bg. ops slowdown of user ops.
25
050
100150200250300
Uniform50r-50w
Skewed50r-50w
K O
pera
tions
/sRocksDBRocksDB No BG I/O
Background I/O Overhead§ Long & slow bg. ops slowdown of user ops.
26
up to 3x throughput gap L
Goal
Decrease background ops overhead to increase user throughput.
27
TRIAD
28
TRIAD
TRIAD-MEM
TRIAD-DISK
TRIAD-LOG
29
Workload Improve WA in
Skewed workloads Flushing and Compaction
In-between Compaction
Uniform workloads Flushing
Three techniques work together and are complementary.
TRIAD
TRIAD-MEM
TRIAD-DISK
TRIAD-LOG
30
Workload Improve WA in
Skewed workloads Flushing and Compaction
Uniform workloads Flushing
TRIAD-MEM
TRIAD-MEM
TRIAD-DISK
TRIAD-LOG
31
Workload Improve WA in
Skewed workloads Flushing and Compaction
In-between Compaction
Uniform workloads Flushing
32
L0
flushing
Cm
CommitLog
Problem: Flushing with Skewed Workloads
RAM
Disk
33
L0
flushing
K1 V11
K1 V11
Cm
CommitLog
Problem: Flushing with Skewed Workloads
RAM
Disk
34
L0
flushing
K1 V12
K1 V11
K1 V12
Cm
CommitLog
Problem: Flushing with Skewed Workloads
RAM
Disk
35
L0
flushing
K1 V12
K2 V2
K1 V11
K1 V12
K2 V2
Cm
CommitLog
Problem: Flushing with Skewed Workloads
RAM
Disk
36
L0
flushing
K1 V13
K2 V2
K1 V11
K1 V12
K2 V2
K1 V13
Cm
CommitLog
Problem: Flushing with Skewed Workloads
RAM
Disk
37
L0
flushing
K1 V14
K2 V2
K1 V11
K1 V12
K2 V2
K1 V13
K1 V14
Cm
CommitLog
Problem: Flushing with Skewed Workloads
RAM
Disk
38
L0
flushing
K1 V1n
K2 V2
K1 V11
K1 V12
K2 V2
K1 V13
K1 V14
…
K1 V1n
Cm
CommitLog
Problem: Flushing with Skewed Workloads
RAM
Disk
39
L0
flushing
Cm
K1 V1n
K2 V2
Problem: Flushing with Skewed Workloads
flush1
CommitLog
RAM
Disk
40
L0
flushing
K1 V1n
K2 V2
Cm
K1 V1n
K2 V2
K1 V1n’
K3 V3
K1 V1n’’
K2 V2’
Kn Vn
…
Problem: Flushing with Skewed Workloads
flush1 flush2 flushn
K1 V11
K2 V2
K1 V12
K1 V13
K1 V14
…
K1 V1n
CommitLog
RAM
Disk
High data skew
• Flush because commit log is full• Flush mostly empty mem comp
L0
L1
41
Problem: Compaction with Skewed Workloads
Popular key K1Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
L0
L1
42
Problem: Compaction with Skewed Workloads
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
L0
L1
43
Problem: Compaction with Skewed Workloads
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
L0
L1
44
Problem: Compaction with Skewed Workloads
Key Val
… …
Key Val
… …Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
L0
L1
45
Problem: Compaction with Skewed Workloads
Rewritten to disk
Key Val
… …
Key Val
… …Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
L0
L1
46
Problem: Compaction with Skewed Workloads
Rewritten to disk
Key Val
… …
Key Val
… …Key Val
… …
Key Val
… …
Key Val
… …
Key Val
… …
File on L1 rewritten to disk twice because of one key L
47
TRIAD-MEM: Hot-cold key separation
L0
flushing
K1 V1n
K2 V2
K3 V3
…
Kn Vn
Cm
K1 V11
K2 V2
K1 V12
K1 V13
K1 V14
…
K1 V1n
CommitLog
RAM
Disk
48
TRIAD-MEM: Hot-cold key separation
L0
flushing
K1 V1n
K2 V2
K3 V3
…
Kn Vn
Cm
Idea: Keep hot keys in memoryFlush only cold keysKeep hot keys in CL
K1 V11
K2 V2
K1 V12
K1 V13
K1 V14
…
K1 V1n
CommitLog
RAM
Disk
49
TRIAD-MEM: Hot-cold key separation
L0
flushing
K1 V1n
Cm
K2 V2
K3 V3
…
Kn Vn
Idea: Keep hot keys in memoryFlush only cold keysKeep hot keys in CL
K1 V1n
CommitLog
RAM
Disk
ü Good for skewed workloads.
ü Reduce flushing WA: less data written from memory to disk.
ü Reduce compaction WA: avoid repeatedly compacting hot keys.
50
TRIAD-MEM Summary
TRIAD-LOG
TRIAD-LOG
51
Workload Improve WA in
Uniform workloads Flushing
Problem: Flushing with Uniform Workloads
CommitLog
L0
flushing
Key Val
K1 V1’
K2 V2
…
Kn Vn
K1 V1
K2 V2
K1 V1’
K3 V3
K3 V3’
…
Kn Vn
Cm
52
RAM
Disk
Problem: Flushing with Uniform Workloads
CommitLog
L0
flushing
Key Val
K1 V1
K2 V2
K1 V1’
K3 V3
K3 V3’
…
Kn Vn
Cm
53
RAM
Disk
Key Val
K1 V1’
K2 V2
…
Kn Vn
flush
CommitLog
L0
flushing
Key Val
K1 V1
K2 V2
K1 V1’
K3 V3
K3 V3’
…
Kn Vn
Cm
54
RAM
Disk
Key Val
K1 V1’
K2 V2
…
Kn Vn
flush
Problem: Flushing with Uniform Workloads
CommitLog
L0
flushing
Key Val
…
K1 V1
K2 V2
K1 V1’
K3 V3
K3 V3’
…
Kn Vn
Cm
55
RAM
Disk
Key Val
K1 V1’
K2 V2
…
Kn Vn
flush
Insight: Flushed data already written to commit log.
Idea: Use commit logs as SSTables. Avoid bg I/O due to flushing.
Problem: Flushing with Uniform Workloads
TRIAD-LOG
CommitLog
L0
flushing
K1 V1
K2 V2
K1 V1’
K3 V3
K3 V3’
…
Kn Vn
Cm
Key Val CLIndex
K1 V1’ 3
K2 V2 2
…
Kn Vn n
56
RAM
Disk
Point to most recent entry in CL.
TRIAD-LOG
L0
flushing
Cm
Key Val CLIndex
K1 V1’ 3
K2 V2 2
…
Kn Vn n
57
RAM
Disk CommitLog
K1 V1
K2 V2
K1 V1’
K3 V3
K3 V3’
…
Kn Vn
�
TRIAD-LOG
L0
flushing
Cm
Key Val CLIndex
K1: 3
K2:2
Kn:n
K1 V1
K2 V2
K1 V1’
…
Kn Vn
CLIndex
K1: 3
K2:2
…
Kn:n
58
RAM
Disk
CL-SSTableOnly flush CL Index from memory and couple it with the current Commit Log. CommitLog
CLIndex
K1: 3
K2:2
Kn:n
Keep index in memory for further reads.
ü Good for uniform workloads.
ü Reuse Commit Log as L0 SST.
ü No more flushing of mem component to disk.
59
TRIAD-LOG Summary
TRIAD Summary
TRIAD-MEM, TRIAD-DISK, TRIAD-LOG:
oComplementary, targeting different wklds.
oWorking simultaneously.
oTransparent to the workloads; no a priori knowledge needed.
60
Evaluation
61
Evaluation
§Compare TRIAD with RocksDB
§Workloads: Production, Synthetic
§Metrics: Throughput, Write Amplification (WA)
§Code: https://github.com/epfl-labos/TRIAD
62
Write Amplification (WA)
totaldatawrittentostoragedatawritten byappWA =
63
Production Workloads: Throughput
64
050
100150200250300350
Prod Wkld 1 Prod Wkld 2
KOPS RocksDB
TRIAD
0
2
4
6
8
10
Prod Wkld 1 Prod Wkld 2
Writ
e Am
plifi
catio
n
RocksDB
TRIAD
~uniform skewed
higheris better
Production Workloads: Throughput
65
050
100150200250300350
Prod Wkld 1 Prod Wkld 2
KOPS RocksDB
TRIAD
0
2
4
6
8
10
Prod Wkld 1 Prod Wkld 2
Writ
e Am
plifi
catio
n
RocksDB
TRIAD
~uniform skewed
TRIAD: stable throughput across wklds.
2x
higheris better
Production Workloads: Write Amplification
66
050
100150200250300350
Prod Wkld 1 Prod Wkld 2
KOPS RocksDB
TRIAD
0
2
4
6
8
10
Prod Wkld 1 Prod Wkld 2
Writ
e Am
plifi
catio
n
RocksDB
TRIAD
~uniform skewed
loweris better
Production Workloads: Write Amplification
67
050
100150200250300350
Prod Wkld 1 Prod Wkld 2
KOPS RocksDB
TRIAD
0
2
4
6
8
10
Prod Wkld 1 Prod Wkld 2
Writ
e Am
plifi
catio
n
RocksDB
TRIAD
~uniform skewed
TRIAD: low and uniform WA.
4x
loweris better
TRIAD: Throughput Breakdown Synthetic Workloads
68
0
50
100
150
200
250
300
SkewAwarenessOnly
DeferredCompac;onOnly
CommitLogIndexingOnly
RocksDB
KOPS
NoSkew
250260270280290300310320330
SkewAwarenessOnly
DeferredCompac?onOnly
CommitLogIndexingOnly
RocksDB
KOPS
Skew1-99
TRIAD-MEM TRIAD-DISK TRIAD-LOG RocksDB
TRIAD-MEM TRIAD-DISK TRIAD-LOG RocksDB
0
50
100
150
200
250
300
SkewAwarenessOnly
DeferredCompac;onOnly
CommitLogIndexingOnly
RocksDB
KOPS
NoSkew
250260270280290300310320330
SkewAwarenessOnly
DeferredCompac?onOnly
CommitLogIndexingOnly
RocksDB
KOPS
Skew1-99
TRIAD-MEM TRIAD-DISK TRIAD-LOG RocksDB
TRIAD-MEM TRIAD-DISK TRIAD-LOG RocksDB
Skew1%-99%
NoSkew
TRIAD-LOGTRIAD RocksDB
TRIAD-LOGTRIAD RocksDB
TRIAD-MEM
TRIAD-MEM
higheris better
TRIAD: Throughput Breakdown Synthetic Workloads
69
0
50
100
150
200
250
300
SkewAwarenessOnly
DeferredCompac;onOnly
CommitLogIndexingOnly
RocksDB
KOPS
NoSkew
250260270280290300310320330
SkewAwarenessOnly
DeferredCompac?onOnly
CommitLogIndexingOnly
RocksDB
KOPS
Skew1-99
TRIAD-MEM TRIAD-DISK TRIAD-LOG RocksDB
TRIAD-MEM TRIAD-DISK TRIAD-LOG RocksDB
0
50
100
150
200
250
300
SkewAwarenessOnly
DeferredCompac;onOnly
CommitLogIndexingOnly
RocksDB
KOPS
NoSkew
250260270280290300310320330
SkewAwarenessOnly
DeferredCompac?onOnly
CommitLogIndexingOnly
RocksDB
KOPS
Skew1-99
TRIAD-MEM TRIAD-DISK TRIAD-LOG RocksDB
TRIAD-MEM TRIAD-DISK TRIAD-LOG RocksDB
Skew1%-99%
NoSkew
TRIAD-LOGTRIAD RocksDB
TRIAD-LOGTRIAD RocksDB
TRIAD-MEM
TRIAD-MEM
higheris better
Complementary techniques
§ TRIAD-MEM efficient for skewed workloads.
§ TRIAD-LOG efficient for uniform workloads.
97%
96%
More in Our Paper
oMore production workloads
oMore synthetic workloads
oDetailed breakdown of TRIAD techniques
oTRIAD-DISK
70
Related work
71
o LevelDB: first LSM-based KV store. No attempts to reduce WA.
o A number of systems attempt to reduce WA.§ LSMs: RocksDB, bLSM (SIGMOD/PODS ’12), VT-tree (FAST ‘13),
HyperLevelDB, LSM-trie (USENIX ATC ‘15), Cassandra, WiscKey (FAST ‘16).
§ B-epsilon trees: Tucana (USENIX ATC ’16), BetrFS (FAST ’15, ‘16)
§ No hot/cold key separation, not use the commit log as a pseudo-SST.
o Hot/cold key separation idea used in different context.§ FLASH storage systems: dual-pool algorithm (SAC ‘07), Application-
Managed Flash (FAST ‘16)
Related work
72
o LevelDB: first LSM-based KV store. No attempts to reduce WA.
o A number of systems attempt to reduce WA.§ LSMs: RocksDB, bLSM (SIGMOD/PODS ’12), VT-tree (FAST ‘13),
HyperLevelDB, LSM-trie (USENIX ATC ‘15), Cassandra, WiscKey (FAST ‘16).
§ B-epsilon trees: Tucana (USENIX ATC ’16), BetrFS (FAST ’15, ‘16)
§ But no hot/cold key separation, not use commit log as a pseudo-SST.
o Hot/cold key separation idea used in different context.§ FLASH storage systems: dual-pool algorithm (SAC ‘07), Application-
Managed Flash (FAST ‘16)
Related work
73
o LevelDB: first LSM-based KV store. No attempts to reduce WA.
o A number of systems attempt to reduce WA.§ LSMs: RocksDB, bLSM (SIGMOD/PODS ’12), VT-tree (FAST ‘13),
HyperLevelDB, LSM-trie (USENIX ATC ‘15), Cassandra, WiscKey (FAST ‘16).
§ B-epsilon trees: Tucana (USENIX ATC ’16), BetrFS (FAST ’15, ‘16)
§ But no hot/cold key separation, not use commit log as a pseudo-SST.
o Hot/cold key separation idea used in different context.§ FLASH storage systems: dual-pool algorithm (SAC ‘07), Application-
Managed Flash (FAST ‘16)
Related work
74
o LevelDB: first LSM-based KV store. No attempts to reduce WA.
o A number of systems attempt to reduce WA.§ LSMs: RocksDB, bLSM (SIGMOD/PODS ’12), VT-tree (FAST ‘13),
HyperLevelDB, LSM-trie (USENIX ATC ‘15), Cassandra, WiscKey (FAST ‘16).
§ B-epsilon trees: Tucana (USENIX ATC ’16), BetrFS (FAST ’15, ‘16)
§ But no hot/cold key separation, not use commit log as a pseudo-SST.
o Hot/cold key separation idea used in different context.§ FLASH storage systems: dual-pool algorithm (SAC ‘07), Application-
Managed Flash (FAST ‘16)
Related work
75
o LevelDB: first LSM-based KV store. No attempts to reduce WA.
o A number of systems attempt to reduce WA.§ LSMs: RocksDB, bLSM (SIGMOD/PODS ’12), VT-tree (FAST ‘13),
HyperLevelDB, LSM-trie (USENIX ATC ‘15), Cassandra, WiscKey (FAST ‘16).
§ B-epsilon trees: Tucana (USENIX ATC ’16), BetrFS (FAST ’15, ‘16)
§ But no hot/cold key separation, not use commit log as a pseudo-SST.
o Hot/cold key separation idea used in different context.§ FLASH storage systems: dual-pool algorithm (SAC ‘07), Application-
Managed Flash (FAST ‘16)
Take-home Messages
ü TRIAD: LSM key-value store with high throughput and low WA.
ü Lightweight bg ops improve user throughput.
ü TRIAD: high throughput and low WA.
76
Take-home Messages
77
ü TRIAD: LSM key-value store with high throughput and low WA.
ü Complementary techniques transparent to workload types.
ü TRIAD: high throughput and low WA.
Take-home Messages
78
ü TRIAD: LSM key-value store with high throughput and low WA.
ü Complementary techniques transparent to workload types.
ü Impact of LSM I/O on user throughput reduced.