B-trees, Shadowing, andClones
Ohad Rodeh
B-trees, Shadowing, and Clones – p.1/51
Talk outlinePrefaceBasics of getting b-trees to work with shadowingPerformance resultsAlgorithms for cloning (writable-snapshots)
B-trees, Shadowing, and Clones – p.2/51
IntroductionThe talk is about a free technique useful forfile-systems like ZFS and WAFLIt is appropriate for this forum due to the talked aboutport of ZFS to LinuxThe ideas described here were used in a researchprototype of an object-disk.A b-tree was used for the OSD catalog (directory), anextent-tree was used for objects (files).
B-trees, Shadowing, and Clones – p.3/51
ShadowingSome file-systems use shadowing: WAFL, ZFS, . . .Basic idea:1. File-system is a tree of fixed-sized pages2. Never overwrite a pageFor every command:1. Write a short logical log entry2. Perform the operation on pages written off to the
side3. Perform a checkpoint once in a while
B-trees, Shadowing, and Clones – p.4/51
Shadowing IIIn case of crash: revert to previous stable checkpointand replay the logShadowing is used for: Snapshots, crash-recovery,write-batching, RAID
B-trees, Shadowing, and Clones – p.5/51
Shadowing IIIImportant optimizations1. Once a page is shadowed, it does not need to be
shadowed again until the next checkpoint2. Batch dirty-pages and write them sequentially to
disk
B-trees, Shadowing, and Clones – p.6/51
SnapshotsTaking snapshots is easy with shadowingIn order to create a snapshot:1. The file-system allows more than a single root2. A checkpoint is taken but not erased
B-trees, Shadowing, and Clones – p.7/51
B-treesB-trees are used by many file-systems to representfiles and directories: XFS, JFS, ReiserFS, SAN.FSThey guarantee logarithmic-time key-search, insert,removeThe main questions:1. Can we use B-trees to represent files and directories in
conjunction with shadowing?
2. Can we get good concurrency?
3. Can we supports lots of clones efficiently?
B-trees, Shadowing, and Clones – p.8/51
ChallengesChallenge to multi-threading: changes propagate up tothe root. The root is a contention point.In a regular b-tree the leaves can be linked to theirneighbors.
Not possible in conjunction with shadowingThrows out a lot of the existing b-tree literature
3 15
3 6 10 15 20
3 4 6 7 8 10 11 15 16 20 25
3 15
3 6 10 15 20
3 4 6 7 8 10 11 15 16 20 25
B-trees, Shadowing, and Clones – p.9/51
Write-in-place b-treeUsed by DB2 and SAN.FSUpdates b-trees in place; no shadowingUses a bottom up update procedure
B-trees, Shadowing, and Clones – p.10/51
Write-in-placeexample
Insert-element1. Walk down the tree until reaching the leaf L2. If there is room: insert in L3. If there isn’t, split and propagate upwardNote: tree nodes contain between 2 and 5 elements
3 25
3 6 9 15 20 25 30
3 4 6 7 10 11 15 16 20 23 26 27 30 35 40
3 25
3 6 9 15 20 25 30
3 4 6 7 8 10 11 15 16 20 23 26 27 30 35 40
B-trees, Shadowing, and Clones – p.11/51
Alternate shadowingapproach
Used in many databases, for example, Microsoft SQLserver.Pages have virtual addressesThere is a table that maps virtual addresses to physicaladdressesIn order to modify page P at address L1
1. Copy P to another physical address L2
2. Modify the mapping table, P → L2
3. Modify the page at the L2
B-trees, Shadowing, and Clones – p.12/51
Alternate shadowingapproach II
ProsAvoids the ripple effect of shadowingUses b-link trees, very good concurrency
ConsRequires an additional persistent data structurePerformance of accessing the map is crucial togood behavior
B-trees, Shadowing, and Clones – p.13/51
Requirements fromshadowed b-tree
The b-tree has to:1. Have good concurrency2. Work well with shadowing3. Use deadlock avoidance4. Have guarantied bounds on space and memory
usage per operationTree has to be balanced
B-trees, Shadowing, and Clones – p.14/51
The solution:insert-key
Top-downLock-coupling for concurrencyProactive splitShadowing on the way downInsert element 81. Causes a split to node [3,6,9,15,20]2. Inserts into [6,7]
3 25
3 6 9 15 20 25 30
3 4 6 7 10 11 15 16 20 23 26 27 30 35 40
3 15 25
3 6 9 15 20 25 30
3 4 6 7 8 10 11 15 16 20 23 26 27 30 35 40
B-trees, Shadowing, and Clones – p.15/51
Remove-keyTop-downLock-coupling for concurrencyProactive merge/shuffleShadowing on the way downExample: remove element 10
3 15
3 6 15 20
3 4 6 7 8 10 11 15 16 20 25 30
3 6 15 20
3 4 6 7 8 10 11 15 16 20 25 30
3 6 15 20
3 4 6 7 8 11 15 16 20 25 30
B-trees, Shadowing, and Clones – p.16/51
Analysis forInsert/Remove-key
Always hold two/three locksAt most three pages held at any timeModify a single path in the tree
B-trees, Shadowing, and Clones – p.17/51
Pros/ConsCons:
Effectively lose two keys per node due to proactivesplit/merge policyNeed loose bounds on number of entries per node(b . . . 3b)
Pros:No deadlocks, no need for deadlockdetection/avoidanceRelatively simple algorithms, adaptable forcontrollers
B-trees, Shadowing, and Clones – p.18/51
CloningTo clone a b-tree means to create a writable copy of itthat allows all operations: lookup, insert, remove, anddelete.A cloning algorithm has several desirable properties
B-trees, Shadowing, and Clones – p.19/51
Cloning propertiesAssume p is a b-tree and q is a clone of p, then:1. Space efficiency: p and q should, as much as
possible, share common pages2. Speed: creating q from p should take little time and
overhead3. Number of clones: it should be possible to clone p
many times4. Clones as first class citizens: it should be possible
to clone q
B-trees, Shadowing, and Clones – p.20/51
Cloning, a naivesolution
A trivial algorithm for cloning a tree is copying itwholesale.This does not have the above four properties.
B-trees, Shadowing, and Clones – p.21/51
WAFL free-spaceThere are deficiencies in the classic WAFL free spaceA bit is used to represent each cloneWith a map of 32-bits per data block we get 32 clonesTo support 256 clones, 32 bytes are needed perdata-block.In order to clone a volume or delete a clone we need tomake a pass on the entire free-space
B-trees, Shadowing, and Clones – p.22/51
ChallengesHow do we support a million clones without a huge free-spacemap?
How do we avoid making a pass on the entire free-space?
B-trees, Shadowing, and Clones – p.23/51
Main ideaModify the free space so it will keep a reference count(ref-count) per blockRef-count records how many times a page is pointed toA zero ref-count means that a block is freeEssentially, instead of copying a tree, the ref-counts ofall its nodes are incremented by oneThis means that all nodes belong to two trees insteadof one; they are all sharedInstead of making a pass on the entire tree andincrementing the counters during the clone operation,this is done in a lazy fashion. B-trees, Shadowing, and Clones – p.24/51
Cloning a tree1. Copy the root-node of p into a new root2. Increment the free-space counters for each of the
children of the root by one
P,1
B,1 C,1
D,1 E,1 G,1 H,1
P,1 Q,1
B,2 C,2
D,1 E,1 G,1 H,1
B-trees, Shadowing, and Clones – p.25/51
Mark-dirty, withoutclones
Before modifying page N, it is “marked-dirty”1. Informs the run-time system that N is about to be
modified2. Gives it a chance to shadow the page if necessaryIf ref-count == 1: page can be modified in placeIf ref-count > 1, and N is relocated from address a1 toaddress a2
1. the ref-count for a1 is decremented2. the ref-count for a2 is made 13. The ref-count of N’s children is incremented by 1
B-trees, Shadowing, and Clones – p.26/51
Example, insert-keyinto leaf H,tree q
P,1 Q,1
B,2 C,2
D,1 E,1 G,1 H,1
P,1 Q,1
B,2 C,2
D,1 E,1 G,1 H,1
(I) Initial trees, Tp and Tq (II) Shadow Q
P,1 Q,1
B,2 C,1 C’,1
D,1 E,1 G,2 H,2
P,1 Q,1
B,2 C,1 C’,1
D,1E,1 G,2H,1 H’,1
(III) shadow C (IV) shadow H
B-trees, Shadowing, and Clones – p.27/51
Deleting a treeref-count(N) > 1: Decrement the ref-count and stopdownward traversal. The node is shared with othertrees.ref-count(N) == 1 : It belongs only to q. Continuedownward traversal and on the way back upde-allocate N.
B-trees, Shadowing, and Clones – p.28/51
Delete exampleP,1 Q,1
B,1 C,2 X,1
D,1 E,1 G,1 Y,2 Z,1
P,1
B,1 C,1
D,1 E,1 G,1 Y,1
(I) Initial trees Tp and Tq (II) After deleting Tq
B-trees, Shadowing, and Clones – p.29/51
Comparison to WAFLfree-space
Clone:Ref-counts: increase ref-counts for children of rootWAFL: make a pass on the entire free-space andset bits
Delete clone p
Ref-counts: traverse the nodes that belong only to p
and decrement ref-countersWAFL: make a pass on the entire free-space andset snapshot bit to zero
B-trees, Shadowing, and Clones – p.30/51
Comparison to WAFLfree-space
During normal runtimeRef-counts: additional cost of incrementingref-counts while performing modificationsWAFL: none
Space taken by free-space mapRef-counts: two bytes per block allow 64K clonesWAFL: two bytes allow 16 clones
B-trees, Shadowing, and Clones – p.31/51
Resource andperformance analysis
The addition of ref-counts does not add b-tree nodeaccesses. Worst-case estimate on memory-pages anddisk-blocks used per operation is unchangedConcurrency remains unaffected by ref-countsSharing on any node that requires modification isquickly broken and each clone gets its own version
B-trees, Shadowing, and Clones – p.32/51
FS countersThe number of free-space accesses increases.Potential of significantly impacting performance.Several observations make this unlikely:1. Once sharing is broken for a page and it belong to a
single tree, there are no additional ref-count costsassociated with it.
2. The vast majority of b-tree pages are leaves.Leaves have no children and therefore do not incuradditional overhead.
B-trees, Shadowing, and Clones – p.33/51
FS counters IIThe experimental test-bed uses in-memory free-spacemaps1. Precludes serious investigation of this issue2. Remains for future work
B-trees, Shadowing, and Clones – p.34/51
SummaryThe b-trees described here:
Are recoverableHave good concurrencyAre efficientHave good bounds on resource usageHave a good cloning strategy
B-trees, Shadowing, and Clones – p.35/51
Backup slides
B-trees, Shadowing, and Clones – p.36/51
PerformanceIn theory, top-down b-trees have a bottle-neck at thetopIn practice, this does not happen because the topnodes are cachedIn the experiments1. Entries are 16bytes: key=8bytes, data=8bytes2. A 4KB node contains 70-235 entries
B-trees, Shadowing, and Clones – p.37/51
Test-bedSingle machine connected to a DS4500 throughFiber-Channel.Machine: dual-CPU Xeon 2.4Ghz with 2GB of memory.Operating System Linux-2.6.9The b-tree on a virtual LUNThe LUN is a RAID1 in a 2+2 configurationStrip width is 64K, full stripe=512KBRead and write caching is turned off
B-trees, Shadowing, and Clones – p.38/51
Basic diskperformance
IO-size=4KBDisk-area=1GB
#threads op. time per op.(ms) ops per second
10 read N/A 1217write N/A 640R+W N/A 408
1 read 3.9 256write 16.8 59R+W 16.9 59
B-trees, Shadowing, and Clones – p.39/51
Test methodologyThe ratio of cache-size to number of b-tree pages1. Is fixed at initialization time2. This ratio is called the in-memory percentageVarious trees were used, with the same results. Theexperiments reported here are for tree T235 .
B-trees, Shadowing, and Clones – p.40/51
Tree T235
Maximal fanout: 235Legal #entries: 78 .. 235Contains 9.5 million keys and 65500 nodes1. 65000 leaves2. 500 index-nodesTree depth is: 4Average node capacity 150 keys
B-trees, Shadowing, and Clones – p.41/51
Test methodology IICreate a large tree using random operationsFor each test1. Clone the tree2. Age the clone by doing 1000 random
insert-key/remove-key operations3. Perform 10
4 – 108 measurements against the clone
with random keys4. Delete the clonePerform each measurement 5 times, and average.The standard deviation was less than 1% in all tests.
B-trees, Shadowing, and Clones – p.42/51
Latencymeasurements
Four operations whose latency was measured:lookup-key, insert-key, remove-key, append-key.Latency measured in milliseconds
Lookup Insert Remove Append
3.43 16.76 16.46 0.007
B-trees, Shadowing, and Clones – p.43/51
Different in-memoryratios
Workload: 100% lookup workload
% in-memory 1 thread 10 threads ideal
100 14237 19805 ∞
50 391 1842 243425 321 1508 162210 268 1290 13525 254 1210 12812 250 1145 1241
B-trees, Shadowing, and Clones – p.44/51
ThroughputFour workloads were used:1. Search-100: 100% lookup2. Search-80: 80% lookup, 10% insert, 10% remove3. Modify: 20% lookup, 40% insert, 40% remove4. Insert: 100% insertMetric: operations per secondSince there isn’t much difference between 2% inmemory and 50%, the rest of the experiments weredone using 5%.Allows putting all index nodes in memory.
B-trees, Shadowing, and Clones – p.45/51
Throughput IITree #threads Src-100 Src-80 Modify Insert
T235 10 1227 748 455 4001 272 144 75 62
Ideal 1281 429
B-trees, Shadowing, and Clones – p.46/51
Workload with somelocality
Workload: randomly choose a keyWith 80% probably read the next 100 keys after itWith 10% probability, insert/overwrite the next 100keysWith 10% probability, remove the next 100 keys
#threads semi-local
10 166341 3848
B-trees, Shadowing, and Clones – p.47/51
Clone performanceTwo clones are made of base tree T235
Aging is performed1. 12000 operations are performed2. 6000 against each clone
Src-100 Src-80 Modify Insert
2 clones 1221 663 393 343base 1227 748 455 400
B-trees, Shadowing, and Clones – p.48/51
Clone performance at100% in-memory
Src-100 Src-80 Modify Insert
2 threads 20395 18524 16907 165051 thread 13910 12670 11452 11112
B-trees, Shadowing, and Clones – p.49/51
Performance ofcheckpointing
A checkpoint is taken during the throughput testPerformance degrades1. A dirty page has to be destaged prior to being
modified2. Caching of dirty-pages is effected
B-trees, Shadowing, and Clones – p.50/51
Performance ofcheckpointing II
Tree T235 Src-100 Src-80 Modify Insert
checkpoint 1205 689 403 353base 1227 748 455 400
B-trees, Shadowing, and Clones – p.51/51