Monkey: Optimal Navigable Key-Value Store
Niv Dayan, Manos Athanassoulis, Stratos Idreos
[Figure: price per GB vs. time. Storage is cheaper; the workload has more inserts & updates.]
need for write-optimized database structures
[Timeline: 1996, LSM-tree invented; now, Key-Value Stores.]
LSM-tree Key-Value Stores
What are they really?
[Figure: LSM-tree. Updates enter the buffer in memory; storage holds sorted runs arranged in levels, numbered 0 to 3.]
sort & flush: a full buffer is sorted and flushed to storage as a run
sort-merge: runs are sort-merged into the next level
exponentially increasing capacities
O(log(N)) levels
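The link between exponentially increasing capacities and O(log(N)) levels can be checked with a small sketch. The function name, the buffer capacity, and the growth factor `size_ratio` are illustrative assumptions, not values taken from the slides.

```python
def count_levels(num_entries: int, buffer_entries: int, size_ratio: int) -> int:
    """Count the storage levels needed to hold num_entries.

    Assumption for this sketch: level i (starting at 1) holds up to
    buffer_entries * size_ratio**i entries, so capacities grow exponentially.
    """
    if num_entries < 0 or buffer_entries <= 0 or size_ratio < 2:
        raise ValueError("need num_entries >= 0, buffer_entries > 0, size_ratio >= 2")
    levels = 0
    capacity = buffer_entries * size_ratio  # capacity of level 1
    remaining = num_entries
    while remaining > 0:
        levels += 1
        remaining -= capacity
        capacity *= size_ratio  # each level is size_ratio times bigger
    return levels


if __name__ == "__main__":
    for n in (10**6, 10**9):
        print(n, count_levels(n, buffer_entries=100, size_ratio=10))
```

Under these assumptions, multiplying the data size by 1000 adds only three levels, which is the logarithmic growth the slide refers to.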
[Figure: the same LSM-tree with fence pointers added in memory.]
lookup key X: with fence pointers, one I/O per run
[Figure: the same LSM-tree with a Bloom filter per run, kept in memory next to the fence pointers.]
lookup key X: each run's Bloom filter answers with one of
true negative
false positive
true positive
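The three outcomes can be reproduced with a minimal Bloom filter. This is a sketch, not the filter used by any particular system: the class, the SHA-256 based hashing, and the `classify` helper are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter over string keys (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def classify(bloom: BloomFilter, run_keys: set, key: str) -> str:
    """Name the outcome of probing one run's filter for key."""
    if not bloom.might_contain(key):
        return "true negative"   # run is skipped, no I/O
    if key in run_keys:
        return "true positive"   # the I/O finds the key
    return "false positive"      # the I/O is wasted


if __name__ == "__main__":
    run_keys = {f"key{i}" for i in range(1000)}
    bloom = BloomFilter(num_bits=10 * len(run_keys), num_hashes=7)
    for k in run_keys:
        bloom.add(k)
    print(classify(bloom, run_keys, "key42"))
    print(classify(bloom, run_keys, "absent-key"))
```

A key that is in the run is always reported as a true positive; a key that is not in the run is either a true negative or, occasionally, a false positive.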
Performance & Cost Tradeoffs
memory vs. lookups: bigger filters, fewer false positives
lookups vs. updates: more merging, fewer runs
[Figure: design space with three axes: lookup cost, update cost, main memory. For fixed memory, existing systems lie on a lookup cost vs. update cost curve and move along it by merging more or merging less. Less memory or more memory shifts the curve.]
[Figure: lookup cost vs. update cost for fixed memory, showing existing systems (x) and the Pareto frontier.]
Problem 1: suboptimal filters allocation
Problem 2: hard to tune
Bloom filters size: lookups vs. memory
merge policy greed: lookups vs. updates
goal: max throughput
Monkey: Optimal Navigable Key-Value Store
filters
  observations: fixed false positive rates, suboptimal
  insights: lookup cost $= \sum_i p_i$
  steps: optimize allocation; asymptotically better; memory vs. lookups
LSM-tree
  observations: ad-hoc trade-offs
  insights: merge policy spans from log to sorted array
  steps: navigate; updates vs. lookups
performance?
  memory, lookups, updates
  answer what-if design questions
[Figure: update cost vs. lookup cost, Monkey vs. existing.]
[Figure: update cost vs. lookup cost for fixed memory. WiredTiger; Cassandra, HBase; RocksDB, LevelDB are plotted against the Pareto frontier reached by Monkey, with the max throughput point marked.]
[Figure: memory holds the buffer, the fence pointers, and the Bloom filters; storage holds the data. The Bloom filters use X bits per entry.]
Bloom filters: X bits per entry
false positive rate $p = e^{-\frac{M}{N} \cdot \ln(2)^2}$, for $M$ bits and $N$ entries
every filter has the same false positive rate $p$
worst-case I/O overhead: $O(\sum p) = O(\sum e^{-M/N}) = O(\log(N) \cdot e^{-M/N})$, since there are $O(\log(N))$ levels
Can we do better?
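The formula for the false positive rate and the resulting overhead can be evaluated numerically. A minimal sketch, assuming one filter per level and the same bits-per-entry ratio everywhere; the function names and the example numbers are illustrative.

```python
import math


def false_positive_rate(bits: float, entries: float) -> float:
    """p = e^(-(bits / entries) * ln(2)^2) for an optimally configured Bloom filter."""
    return math.exp(-(bits / entries) * math.log(2) ** 2)


def uniform_lookup_overhead(bits: float, entries: float, levels: int) -> float:
    """Expected wasted I/Os when every level uses the same false positive rate.

    Assumption for this sketch: one run (and one filter) per level.
    """
    return levels * false_positive_rate(bits, entries)


if __name__ == "__main__":
    p = false_positive_rate(bits=10, entries=1)  # 10 bits per entry
    print(f"false positive rate: {p:.4f}")
    for levels in (4, 8):
        print(levels, f"{uniform_lookup_overhead(10, 1, levels):.4f}")
```

With the bits per entry held constant, the overhead grows in proportion to the number of levels, which is where the log(N) factor comes from.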
[Figure: lookup for key X across the data runs, guided by fence pointers and Bloom filters. In the worst case every filter returns a false positive and every run costs one I/O.]
The filter that takes the most memory saves at most 1 I/O.
Bloom filters, false positive rates: reallocate some memory away from the filter that takes the most memory.
Result: same memory, fewer lookup I/Os.
steps: relax, model, optimize
relax: allow a different false positive rate per filter: $0 < p_0 < 1$, $0 < p_1 < 1$, $0 < p_2 < 1$, …
model: express lookup cost $= f(p_0, p_1, \ldots)$ and memory footprint $= f(p_0, p_1, \ldots)$ in terms of $p_0, p_1, \ldots$
optimize
model:
lookup cost $= \sum_i p_i$, where $p_0, p_1, p_2, \ldots$ are the false positive rates of the Bloom filters
false positive rate $= e^{-\frac{\text{bits}}{\text{entries}} \cdot \ln(2)^2}$
bits $= -\frac{\text{entries} \cdot \ln(\text{false positive rate})}{\ln(2)^2}$
memory footprint $= \text{bits}(p_0, N) + \text{bits}(p_1, N/T) + \text{bits}(p_2, N/T^2) + \ldots$
memory $= -c \cdot N \cdot \sum_i \frac{\ln(p_i)}{T^i}$, where $c$ is a constant, $N$ is the number of entries, and $T$ is the size ratio
optimize: choose the false positive rates $p_0, p_1, p_2, \ldots$
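One way to see how an optimum can be found from the lookup cost and memory expressions is a Lagrange-multiplier sketch. This is an illustrative derivation, not necessarily the one used by the authors: the multiplier $\lambda$ and the memory budget $M$ are symbols introduced here, and the bound $p_i < 1$ is ignored.

```latex
\begin{aligned}
\text{minimize } & \sum_i p_i
  \quad \text{subject to } -c \cdot N \cdot \sum_i \frac{\ln(p_i)}{T^i} = M \\
\mathcal{L} &= \sum_i p_i
  + \lambda \left( -c \cdot N \cdot \sum_i \frac{\ln(p_i)}{T^i} - M \right) \\
\frac{\partial \mathcal{L}}{\partial p_i}
  &= 1 - \frac{\lambda \cdot c \cdot N}{p_i \cdot T^i} = 0
  \quad \Longrightarrow \quad
  p_i = \frac{\lambda \cdot c \cdot N}{T^i} = \frac{p_0}{T^i}
\end{aligned}
```

Under these assumptions each false positive rate is a factor of $T$ smaller than the previous one, so the rates decrease exponentially with $i$.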
State-of-the-Art Bloom filters: the same false positive rate $p$ at every level
Monkey Bloom filters: false positive rates $p_0$, $p_0/T$, $p_0/T^2$, … (exponential decrease)
compared level by level: $p_0 > p$, while $p_0/T < p$ and $p_0/T^2 < p$
lookup cost: Monkey $\sum_i p_i = O(e^{-M/N})$ < state of the art $\sum p = O(\log(N) \cdot e^{-M/N})$
$N$ | number of entries
$M$ | overall memory for Bloom filters
asymptotic win: lookup cost increases at a slower rate as data grows
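The comparison can be checked numerically by giving both allocations the same total filter memory. This is a sketch under simplifying assumptions: one run per level, level $i$ holds $N/T^i$ entries, and there is enough memory that $p_0 < 1$. All function names and example numbers are illustrative.

```python
import math

LN2_SQ = math.log(2) ** 2


def level_sizes(largest_level_entries: float, size_ratio: int, levels: int):
    """Entries per level: N, N/T, N/T^2, ... (index 0 is the largest level)."""
    return [largest_level_entries / size_ratio ** i for i in range(levels)]


def filter_bits(false_positive_rate: float, entries: float) -> float:
    """bits = -entries * ln(p) / ln(2)^2."""
    return -entries * math.log(false_positive_rate) / LN2_SQ


def uniform_rates(total_bits: float, sizes):
    """Same false positive rate everywhere (same bits per entry)."""
    p = math.exp(-(total_bits / sum(sizes)) * LN2_SQ)
    return [p] * len(sizes)


def monkey_rates(total_bits: float, sizes, size_ratio: int):
    """Rates p0, p0/T, p0/T^2, ... with p0 chosen to use exactly total_bits."""
    weighted = sum(n * i for i, n in enumerate(sizes)) * math.log(size_ratio)
    p0 = math.exp((weighted - total_bits * LN2_SQ) / sum(sizes))
    if p0 >= 1:
        raise ValueError("too little memory for this simplified sketch")
    return [p0 / size_ratio ** i for i in range(len(sizes))]


if __name__ == "__main__":
    sizes = level_sizes(1e6, size_ratio=10, levels=4)
    total_bits = 10 * sum(sizes)  # 10 bits per entry on average
    uniform = uniform_rates(total_bits, sizes)
    monkey = monkey_rates(total_bits, sizes, size_ratio=10)
    print("uniform lookup cost:", sum(uniform))
    print("Monkey-style lookup cost:", sum(monkey))
```

In this setup the largest level gets a higher false positive rate than under the uniform scheme and the smaller levels get lower ones, and the sum of the rates comes out smaller for the same memory.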
Monkey Bloom filters: false positive rates $p_0$, $p_0/T$, $p_0/T^2$, …
the false positive rates form a convergent geometric series
memory $= -c \cdot \text{entries} \cdot \sum_i \frac{\ln(p_i)}{T^i}$
memory $= -c \cdot \text{entries} \cdot \ln(\text{lookup cost})$
model: lookups vs. memory trade-off
Problem 1: suboptimal filters allocation
Problem 2: hard to tune
merge policy greed: lookups vs. updates (max throughput)
Identify: size ratio, merge policy
Map: size ratio and merge policy to lookups and updates, across the continuum of log, LSM-tree, and sorted array
Navigate: given the workload and hardware, find the optimal design (maximum throughput)
Merge Policies
Tiering (write-optimized): T runs per level; merge & flush; each level is T times bigger
Leveling (read-optimized): 1 run per level; merge, then flush; each level is T times bigger

lookup cost with the same false positive rate everywhere:
  Tiering: $O(T \cdot \log_T(N) \cdot e^{-M/N})$ = runs per level × levels × false positive rate
  Leveling: $O(\log_T(N) \cdot e^{-M/N})$ = levels × false positive rate
lookup cost with Monkey's filters:
  Tiering: $O(T \cdot e^{-M/N})$ = runs per level × false positive rate
  Leveling: $O(e^{-M/N})$ = false positive rate
update cost:
  Tiering: $O(\log_T(N))$ = levels
  Leveling: $O(T \cdot \log_T(N))$ = merges per level × levels

size ratio T at its minimum: both policies have 1 run per level
  lookup cost: $O(e^{-M/N}) = O(e^{-M/N})$
  update cost: $O(\log(N)) = O(\log(N))$
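The asymptotic costs for the two merge policies can be turned into a toy cost model to see how the size ratio T moves a design between cheaper lookups and cheaper updates. This sketch drops all constants hidden by the O-notation, so only the relative shape is meaningful; the function names are assumptions made for illustration.

```python
import math


def lookup_cost(policy: str, size_ratio: int, bits_per_entry: float) -> float:
    """Lookup cost with Monkey-style filters: T * e^(-M/N) or e^(-M/N)."""
    base = math.exp(-bits_per_entry)
    if policy == "tiering":
        return size_ratio * base
    if policy == "leveling":
        return base
    raise ValueError(f"unknown merge policy: {policy}")


def update_cost(policy: str, size_ratio: int, num_entries: float) -> float:
    """Update cost: log_T(N) for tiering, T * log_T(N) for leveling."""
    levels = math.log(num_entries, size_ratio)
    if policy == "tiering":
        return levels
    if policy == "leveling":
        return size_ratio * levels
    raise ValueError(f"unknown merge policy: {policy}")


if __name__ == "__main__":
    for t in (2, 4, 10):
        for policy in ("tiering", "leveling"):
            print(t, policy,
                  f"lookup={lookup_cost(policy, t, 10):.2e}",
                  f"update={update_cost(policy, t, 1e9):.1f}")
```

In this model tiering pays a factor of T on lookups and leveling pays the same factor on updates, which is why the two are labeled write-optimized and read-optimized.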
size ratio T at its limit:
  Tiering becomes a log: $O(N)$ runs per level; update cost $O(1)$; lookup cost $O(N \cdot e^{-M/N})$
  Leveling becomes a sorted array: 1 run per level; update cost $O(N)$; lookup cost $O(e^{-M/N})$
[Figure: lookup cost vs. update cost as the size ratio T varies. The Tiering curve runs from T=2 toward the log and the Leveling curve runs from T=2 toward the sorted array; LSM-tree designs lie between the two extremes.]
T | size ratio
Given the workload and hardware, navigate to the optimal design: maximum throughput.
Problem 1: suboptimal filters allocation
Problem 2: hard to tune (merge policy greed: lookups vs. updates)
better asymptotic scalability
[Figure: lookup latency (ms) vs. number of entries (log scale), LevelDB vs. Monkey.]
workload adaptability
[Figure: lookup latency (ms) vs. % lookups in workload, for LevelDB, fixed Monkey, and navigable Monkey; points are labeled with configurations such as T4, T2, L4, L6, L8, L16.]
http://daslab.seas.harvard.edu/crimsondb/
self-designs navigates what-if?
Monkey: Optimal Navigable Key-Value Store
more in paper:
0 < memory < ∞
buffer, filters, cache
skewed & range lookups
http://daslab.seas.harvard.edu/monkey/
Thanks!