A repeat of my OSCON talk, with some more details. Check out the video at http://skillsmatter.com/podcast/nosql/castle-big-data
Tom Wilkie, Founder & VP Engineering
@tom_wilkie
Castle: Reinventing Storage for Big Data
Before the Flood (1990)
• Small databases
• BTree indexes
• BTree file systems
• RAID
• Old hardware
Two Revolutions (2010)
• Distributed, shared-nothing databases
• Write-optimised indexes
• BTree file systems
• RAID
• New hardware
Bridging the Gap (2011)
• Distributed, shared-nothing databases
• Castle
• New hardware
• SNAPSHOTS* (* And clones!)
What’s in the Castle?
[Architecture diagram with three regions: Userspace, Acunu Kernel, Linux Kernel]
Layers, as labelled on the slide:
• userspace interface
• kernelspace interface
• doubling array mapping layer
• modlist btree mapping layer
• block mapping & caching layer
• linux's block & MM layers
Components:
• Shared memory interface: keys, values, shared buffers, async shared memory ring
• Streaming interface: key insert, key get, buffered value get, buffered value insert, range queries
• In-kernel workloads
• Doubling Arrays: arrays, range queries, key insert, insert queues, Bloom filters
• Arrays: value arrays, btree, key get, arrays management, merges
• btree: range queries, key get, key insert
• Version tree
• Cache: flusher, extent block cache, page cache, prefetcher
• "Extent" layer: extent allocator & mapper, freespace manager
• Memory manager
• Block layer
Castle
• Open source (GPLv2, MIT for user libraries)
• http://bitbucket.org/acunu
• Loadable Kernel Module, targeting CentOS’s 2.6.18 kernel
• http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/
The Interface
castle_{back,objects}.c
[Figure: version tree with nodes labelled v0, v1, v3, v12, v13, v15, v16 and v24]
Doubling Array
castle_{da,bloom}.c
B-Tree
[Figure: B-tree with fan-out B and height log_B N]
• If node is full, split and insert new node into parent (recurse)
• For random inserts, nodes placed randomly on disk

            Update                     Range Query (Size Z)
B-Tree      O(log_B N) random IOs      O(Z/B) random IOs

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries
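Doubling Array: Inserts
[Figure: 2 and 9 are inserted and merged into the sorted array 2 9]
• Buffer arrays in memory until we have > B of them
[Figure: 11 and 8 are inserted and merged into 8 11, which is then merged with 2 9 into 2 8 9 11, etc...]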
Similar to log-structured merge trees (LSM), cache-oblivious lookahead array (COLA), ...
Demo: https://acunu-videos.s3.amazonaws.com/dajs.html
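The insert behaviour on the slides above can be sketched in a few lines of Python. This is a toy, in-memory illustration only: the class name `DoublingArray` and its `levels` list are hypothetical, duplicate keys are not handled, and it says nothing about how castle_da.c buffers or merges arrays on disk.

```python
from heapq import merge


class DoublingArray:
    """Toy doubling array: levels[i] is either empty or a sorted list of 2**i keys."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        carry = [key]  # a new one-element sorted array
        level = 0
        # Like binary addition: while the target level is occupied,
        # merge the two equal-sized arrays and carry the result up a level.
        while level < len(self.levels) and self.levels[level]:
            carry = list(merge(self.levels[level], carry))
            self.levels[level] = []
            level += 1
        if level == len(self.levels):
            self.levels.append([])
        self.levels[level] = carry


da = DoublingArray()
for k in (2, 9, 11, 8):  # the keys used on the slides
    da.insert(k)
print(da.levels)  # [[], [], [2, 8, 9, 11]]
```

Every merge reads two sorted arrays and writes one sorted array, which is why the on-disk version of this scheme can use sequential IO.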
                 Update                         Range Query (Size Z)
B-Tree           O(log_B N) random IOs          O(Z/B) random IOs
Doubling Array   O((log N)/B) sequential IOs

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries
Doubling Array: Queries
• Add an index to each array to do lookups
• query(k) searches each array independently
• Bloom Filters can help exclude arrays from search
• ... but don’t help with range queries
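A minimal sketch of the query path described above, assuming arrays are searched newest first and that a binary search stands in for the per-array index. The `BloomFilter` class, its sizing (1024 bits, 3 hashes) and the `make` helper are illustrative assumptions, not Castle's implementation.

```python
import hashlib
from bisect import bisect_left


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, keys, bits=1024, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0
        for key in keys:
            for pos in self._positions(key):
                self.field |= 1 << pos

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def might_contain(self, key):
        return all((self.field >> pos) & 1 for pos in self._positions(key))


def query(arrays, key):
    """arrays: list of (bloom, sorted [(key, value), ...]), newest array first."""
    searched = 0
    for bloom, entries in arrays:
        if not bloom.might_contain(key):
            continue  # key is definitely absent: skip this array entirely
        searched += 1
        i = bisect_left(entries, (key,))  # binary search in place of a real index
        if i < len(entries) and entries[i][0] == key:
            return entries[i][1], searched
    return None, searched


def make(entries):
    entries = sorted(entries)
    return BloomFilter(k for k, _ in entries), entries


arrays = [
    make([(8, "new"), (11, "x")]),                       # newer array
    make([(2, "a"), (8, "old"), (9, "b"), (11, "y")]),   # older array
]
value, searched = query(arrays, 8)  # value is "new": the newer array wins
```

A range query cannot use the filters this way, because a Bloom filter only answers membership for a single key; every array has to be consulted.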
                 Update                         Range Query (Size Z)
B-Tree           O(log_B N) random IOs          O(Z/B) random IOs
Doubling Array   O((log N)/B) sequential IOs    O(Z/B) sequential IOs

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries

B-Tree: ~ log(2^30) / log 100 = 5 IOs/update; 8KB @ 100MB/s, w/ 8ms seek = 100 IOs/s; 100 / 5 = 20 updates/s
Doubling Array: ~ log(2^30) / 100 = 0.2 IOs/update; 8KB @ 100MB/s = 13k IOs/s; 13k / 0.2 = 65k updates/s
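The back-of-the-envelope numbers above can be reproduced with a short calculation. The slide does not state the base of the logarithm; the sketch below assumes the natural log, which matches the 0.2 IOs/update figure, and assumes binary units (8 KiB blocks, 100 MiB/s). The variable names are illustrative.

```python
import math

N = 2 ** 30                      # entries in the index
B = 100                          # entries per block (the slide's round figure)
block_bytes = 8 * 1024           # 8KB block
bandwidth = 100 * 1024 * 1024    # 100MB/s sequential throughput
seek = 0.008                     # 8ms per random seek

# IOs per update, following the table's formulas.
btree_ios = math.log(N) / math.log(B)   # about 4.5, rounded up to 5 on the slide
da_ios = math.log(N) / B                # about 0.2

# IOs per second the disk can deliver.
seq_ios_per_s = bandwidth / block_bytes                   # 12800, "13k" on the slide
rand_ios_per_s = 1 / (seek + block_bytes / bandwidth)     # about 124, "100" on the slide

print(f"B-Tree:         {rand_ios_per_s / btree_ios:.0f} updates/s")
print(f"Doubling Array: {seq_ios_per_s / da_ios:.0f} updates/s")
```

With unrounded inputs the results land in the same range as the slide's 20 and 65k updates/s; the gap of more than three orders of magnitude is the point, not the exact digits.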
“Mod-list” B-Tree
castle_{btree,versions}.c
Copy-on-Write BTree
Idea:
• Apply path-copying [DSST] to the B-tree
Problems:
• Space blowup: Each update may rewrite an entire path
• Slow updates: as above
A log file system makes updates sequential, but relies on random access and garbage collection (Achilles' heel!)

Nv = #keys live (accessible) at version v

              Update                    Range Query          Space
CoW B-Tree    O(log_B Nv) random IOs    O(Z/B) random IOs    O(N B log_B Nv)
“BigTable” snapshots
• Inserts produce arrays
• Snapshots increment ref counts on arrays
• Merges produce more arrays, decrement ref count on old arrays
• Space blowup
[Figure: v1 holds arrays a and b, each with ref count 1. After a snapshot creates v2, a and b have ref count 2 and a new array c has ref count 1. After a merge, v2 holds a merged array a b c with ref count 1, while v1 still holds a and b with ref count 1.]
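The ref-counting scheme above, and the space blowup it leads to, can be sketched as follows. This is a toy model with hypothetical names (`Array`, `attach`, `snapshot`, `merge`, `stored_keys`); it mirrors the a, b, c example from the slides and is not taken from BigTable or Castle code.

```python
class Array:
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.refs = 0  # number of versions that reference this array


versions = {}  # version name -> list of arrays visible in that version


def attach(version, array):
    versions.setdefault(version, []).append(array)
    array.refs += 1


def snapshot(old, new):
    """A snapshot shares every existing array and bumps its ref count."""
    for array in versions[old]:
        attach(new, array)


def merge(version):
    """Merge all arrays of one version into a new array; old arrays lose a ref."""
    old = versions[version]
    merged = Array(k for a in old for k in a.keys)
    for a in old:
        a.refs -= 1
    versions[version] = []
    attach(version, merged)
    return merged


def stored_keys():
    """Total keys held by arrays that some version still references."""
    live = {id(a): a for arrays in versions.values() for a in arrays}
    return sum(len(a.keys) for a in live.values())


a, b = Array(["a"]), Array(["b"])
attach("v1", a)
attach("v1", b)
snapshot("v1", "v2")       # a and b now have ref count 2
c = Array(["c"])
attach("v2", c)
before = stored_keys()     # 3 keys stored
abc = merge("v2")          # v2 gets "a b c"; v1 still pins a and b
after = stored_keys()      # 5 keys stored for 3 distinct keys
```

Because v1 still references the old arrays, the merge cannot free them, so the keys a and b are now stored twice. Repeated across many versions, this is the O(VN) space figure quoted for the “BigTable” style DA.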
Nv = #keys live (accessible) at version v

                      Update                         Range Query              Space
CoW B-Tree            O(log_B Nv) random IOs         O(Z/B) random IOs        O(N B log_B Nv)
“BigTable” style DA   O((log N)/B) sequential IOs    O(Z/B) sequential IOs    O(VN)
“Mod-list” BTree
Idea:
• Apply fat-nodes [DSST] to the B-tree
• i.e. insert (key, version, value) tuples, with special operations
Problems:
• Similar performance to a BTree
If you limit the #versions, it can be constructed sequentially and embedded into a DA
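One way to read the (key, version, value) idea is that a lookup at version v returns the entry written at v or, failing that, at its nearest ancestor in the version tree. The sketch below illustrates that reading with a plain dictionary; the `parent` map, `insert` and `get` are hypothetical names, deletes are not modelled, and the slide's "special operations" are not specified, so this is an assumption rather than Castle's algorithm.

```python
# Hypothetical version tree: v1 and v2 are both children (snapshots/clones) of v0.
parent = {0: None, 1: 0, 2: 0}

entries = {}  # (key, version) -> value


def insert(key, version, value):
    """Record a write made at a specific version."""
    entries[(key, version)] = value


def get(key, version):
    """Return the value visible at `version`: its own write, else the nearest ancestor's."""
    v = version
    while v is not None:
        if (key, v) in entries:
            return entries[(key, v)]
        v = parent[v]
    return None


insert("k1", 0, "a")   # written before either child version was created
insert("k1", 2, "b")   # overwritten in v2 only

print(get("k1", 1))    # a  (inherited from v0)
print(get("k1", 2))    # b  (v2's own write)
print(get("k2", 1))    # None
```

Only one copy of each write is stored regardless of how many versions can see it, which is where the O(N) space bound for the mod-list approach comes from.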
Nv = #keys live (accessible) at version v

                      Update                         Range Query              Space
CoW B-Tree            O(log_B Nv) random IOs         O(Z/B) random IOs        O(N B log_B Nv)
“BigTable” style DA   O((log N)/B) sequential IOs    O(Z/B) sequential IOs    O(VN)
“Mod-list” in a DA    O((log N)/B) sequential IOs    O(Z/B) sequential IOs    O(N)

Slide labels: CASTLE (on the “Mod-list” in a DA row), LevelDB
Stratified BTree
Problem: Embedded “Mod-list” #versions limit
Solution: Version-split arrays during merges
[Figure: version tree with v0 as parent of v1 and v2. A newer and an older array over keys k1 to k5 are merged (duplicates removed), then v-split into one array for {v2} and one for {v1,v0}. A note on the {v2} array reads: v0 entries here are duplicates.]
Disk Layout: RDA (random duplicate allocation)
castle_{cache,extent,freespace,rebuild}.c
[Figure: chunks numbered 1 to 16 spread across several disks, with each chunk number appearing more than once]
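A minimal sketch of random duplicate allocation, assuming each chunk is written to two distinct, randomly chosen disks and that a read goes to whichever copy sits on the less loaded disk. The function names, the fixed seed and the least-loaded read policy are illustrative assumptions; the allocator in castle_extent.c may work differently.

```python
import random
from collections import Counter


def rda_place(num_chunks, num_disks, copies=2, seed=0):
    """Map each chunk number to `copies` distinct randomly chosen disks."""
    rng = random.Random(seed)
    return {
        chunk: rng.sample(range(num_disks), copies)
        for chunk in range(1, num_chunks + 1)
    }


def pick_replica(placement, chunk, load):
    """Serve a read from the replica whose disk currently has the least load."""
    disk = min(placement[chunk], key=lambda d: load[d])
    load[disk] += 1
    return disk


placement = rda_place(num_chunks=16, num_disks=4)
load = Counter()
for chunk in placement:          # read every chunk once
    pick_replica(placement, chunk, load)
print(dict(load))                # reads per disk
```

Keeping two copies on different disks gives redundancy, and having a choice of replica for each read lets the load be evened out across disks.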
Performance Comparison
Small random inserts: inserting 3 billion rows
[Chart: Acunu powered Cassandra vs ‘standard’ Cassandra]
Insert latency: while inserting 3 billion rows
[Chart: Acunu powered Cassandra vs ‘standard’ Cassandra]
Small random range queries: performed immediately after inserts
[Chart: Acunu powered Cassandra vs ‘standard’ Cassandra]

Performance summary

                                Standard    Acunu     Benefits
inserts          rate           ~32k/s      ~45k/s    >1.4x
                 95% latency    ~32s        ~0.3s     >100x
gets             rate           ~100/s      ~350/s    >3.5x
                 95% latency    ~2s         ~0.5s     >4x
range queries    rate           ~0.4/s      ~40/s     >100x
                 95% latency    ~15s        ~2s       >7.5x
Future
Memcache + Cassandra
[Figure: on each node, Cassandra and memcache both run on Castle on that node's hardware; a Cassandra client issues get/insert and a memcached client issues get/put]
100k random inserts/sec!
[Figure, four frames: version tree with nodes labelled v1, v12, v13, v15, v16 and v24]
• Castle: like BDB, but for Big Data
• DA: transforms random IO into sequential IO
• Snapshots & Clones: addressing real problems with new workloads
• 2 orders of magnitude better performance and predictability
Questions?
Tom Wilkie
@tom_wilkie
http://bitbucket.org/acunu
http://github.com/acunu
http://www.acunu.com/download
http://www.acunu.com/insights
References
[LSM] Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil, The Log-Structured Merge-Tree (LSM-Tree)
http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf
[COLA] Michael A. Bender et al, Cache-Oblivious Streaming B-trees
http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf
[DSST] J. R. Driscoll, N. Sarnak, D. D. Sleator, R. E. Tarjan, Making Data Structures Persistent, Journal of Computer and System Sciences, Vol. 38, No. 1, 1989
http://www.cs.cmu.edu/~sleator/papers/making-data-structures-persistent.pdf
Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes, Tom Wilkie, Stratified B-trees and versioned dictionaries, HotStorage’11
http://www.usenix.org/event/hotstorage11/tech/final_files/Twigg.pdf
[RDA] Joep Aerts, Jan Korst, Sebastian Egner, Random duplicate storage strategies for load balancing in multimedia servers, 2000
http://www.win.tue.nl/~joep/IPL.ps
Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the
Apache Software Foundation.