51
B-trees, Shadowing, and Clones Ohad Rodeh B-trees, Shadowing, and Clones – p.1/51

B-trees, Shadowing, and Clones - USENIX

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: B-trees, Shadowing, and Clones - USENIX

B-trees, Shadowing, andClones

Ohad Rodeh

B-trees, Shadowing, and Clones – p.1/51

Page 2: B-trees, Shadowing, and Clones - USENIX

Talk outlinePrefaceBasics of getting b-trees to work with shadowingPerformance resultsAlgorithms for cloning (writable-snapshots)

B-trees, Shadowing, and Clones – p.2/51

Page 3: B-trees, Shadowing, and Clones - USENIX

IntroductionThe talk is about a free technique useful forfile-systems like ZFS and WAFLIt is appropriate for this forum due to the talked aboutport of ZFS to LinuxThe ideas described here were used in a researchprototype of an object-disk.A b-tree was used for the OSD catalog (directory), anextent-tree was used for objects (files).

B-trees, Shadowing, and Clones – p.3/51

Page 4: B-trees, Shadowing, and Clones - USENIX

ShadowingSome file-systems use shadowing: WAFL, ZFS, . . .Basic idea:1. File-system is a tree of fixed-sized pages2. Never overwrite a pageFor every command:1. Write a short logical log entry2. Perform the operation on pages written off to the

side3. Perform a checkpoint once in a while

B-trees, Shadowing, and Clones – p.4/51

Page 5: B-trees, Shadowing, and Clones - USENIX

Shadowing IIIn case of crash: revert to previous stable checkpointand replay the logShadowing is used for: Snapshots, crash-recovery,write-batching, RAID

B-trees, Shadowing, and Clones – p.5/51

Page 6: B-trees, Shadowing, and Clones - USENIX

Shadowing IIIImportant optimizations1. Once a page is shadowed, it does not need to be

shadowed again until the next checkpoint2. Batch dirty-pages and write them sequentially to

disk

B-trees, Shadowing, and Clones – p.6/51

Page 7: B-trees, Shadowing, and Clones - USENIX

SnapshotsTaking snapshots is easy with shadowingIn order to create a snapshot:1. The file-system allows more than a single root2. A checkpoint is taken but not erased

B-trees, Shadowing, and Clones – p.7/51

Page 8: B-trees, Shadowing, and Clones - USENIX

B-treesB-trees are used by many file-systems to representfiles and directories: XFS, JFS, ReiserFS, SAN.FSThey guarantee logarithmic-time key-search, insert,removeThe main questions:1. Can we use B-trees to represent files and directories in

conjunction with shadowing?

2. Can we get good concurrency?

3. Can we supports lots of clones efficiently?

B-trees, Shadowing, and Clones – p.8/51

Page 9: B-trees, Shadowing, and Clones - USENIX

ChallengesChallenge to multi-threading: changes propagate up tothe root. The root is a contention point.In a regular b-tree the leaves can be linked to theirneighbors.

Not possible in conjunction with shadowingThrows out a lot of the existing b-tree literature

3 15

3 6 10 15 20

3 4 6 7 8 10 11 15 16 20 25

3 15

3 6 10 15 20

3 4 6 7 8 10 11 15 16 20 25

B-trees, Shadowing, and Clones – p.9/51

Page 10: B-trees, Shadowing, and Clones - USENIX

Write-in-place b-treeUsed by DB2 and SAN.FSUpdates b-trees in place; no shadowingUses a bottom up update procedure

B-trees, Shadowing, and Clones – p.10/51

Page 11: B-trees, Shadowing, and Clones - USENIX

Write-in-placeexample

Insert-element1. Walk down the tree until reaching the leaf L2. If there is room: insert in L3. If there isn’t, split and propagate upwardNote: tree nodes contain between 2 and 5 elements

3 25

3 6 9 15 20 25 30

3 4 6 7 10 11 15 16 20 23 26 27 30 35 40

3 25

3 6 9 15 20 25 30

3 4 6 7 8 10 11 15 16 20 23 26 27 30 35 40

B-trees, Shadowing, and Clones – p.11/51

Page 12: B-trees, Shadowing, and Clones - USENIX

Alternate shadowingapproach

Used in many databases, for example, Microsoft SQLserver.Pages have virtual addressesThere is a table that maps virtual addresses to physicaladdressesIn order to modify page P at address L1

1. Copy P to another physical address L2

2. Modify the mapping table, P → L2

3. Modify the page at the L2

B-trees, Shadowing, and Clones – p.12/51

Page 13: B-trees, Shadowing, and Clones - USENIX

Alternate shadowingapproach II

ProsAvoids the ripple effect of shadowingUses b-link trees, very good concurrency

ConsRequires an additional persistent data structurePerformance of accessing the map is crucial togood behavior

B-trees, Shadowing, and Clones – p.13/51

Page 14: B-trees, Shadowing, and Clones - USENIX

Requirements fromshadowed b-tree

The b-tree has to:1. Have good concurrency2. Work well with shadowing3. Use deadlock avoidance4. Have guarantied bounds on space and memory

usage per operationTree has to be balanced

B-trees, Shadowing, and Clones – p.14/51

Page 15: B-trees, Shadowing, and Clones - USENIX

The solution:insert-key

Top-downLock-coupling for concurrencyProactive splitShadowing on the way downInsert element 81. Causes a split to node [3,6,9,15,20]2. Inserts into [6,7]

3 25

3 6 9 15 20 25 30

3 4 6 7 10 11 15 16 20 23 26 27 30 35 40

3 15 25

3 6 9 15 20 25 30

3 4 6 7 8 10 11 15 16 20 23 26 27 30 35 40

B-trees, Shadowing, and Clones – p.15/51

Page 16: B-trees, Shadowing, and Clones - USENIX

Remove-keyTop-downLock-coupling for concurrencyProactive merge/shuffleShadowing on the way downExample: remove element 10

3 15

3 6 15 20

3 4 6 7 8 10 11 15 16 20 25 30

3 6 15 20

3 4 6 7 8 10 11 15 16 20 25 30

3 6 15 20

3 4 6 7 8 11 15 16 20 25 30

B-trees, Shadowing, and Clones – p.16/51

Page 17: B-trees, Shadowing, and Clones - USENIX

Analysis forInsert/Remove-key

Always hold two/three locksAt most three pages held at any timeModify a single path in the tree

B-trees, Shadowing, and Clones – p.17/51

Page 18: B-trees, Shadowing, and Clones - USENIX

Pros/ConsCons:

Effectively lose two keys per node due to proactivesplit/merge policyNeed loose bounds on number of entries per node(b . . . 3b)

Pros:No deadlocks, no need for deadlockdetection/avoidanceRelatively simple algorithms, adaptable forcontrollers

B-trees, Shadowing, and Clones – p.18/51

Page 19: B-trees, Shadowing, and Clones - USENIX

CloningTo clone a b-tree means to create a writable copy of itthat allows all operations: lookup, insert, remove, anddelete.A cloning algorithm has several desirable properties

B-trees, Shadowing, and Clones – p.19/51

Page 20: B-trees, Shadowing, and Clones - USENIX

Cloning propertiesAssume p is a b-tree and q is a clone of p, then:1. Space efficiency: p and q should, as much as

possible, share common pages2. Speed: creating q from p should take little time and

overhead3. Number of clones: it should be possible to clone p

many times4. Clones as first class citizens: it should be possible

to clone q

B-trees, Shadowing, and Clones – p.20/51

Page 21: B-trees, Shadowing, and Clones - USENIX

Cloning, a naivesolution

A trivial algorithm for cloning a tree is copying itwholesale.This does not have the above four properties.

B-trees, Shadowing, and Clones – p.21/51

Page 22: B-trees, Shadowing, and Clones - USENIX

WAFL free-spaceThere are deficiencies in the classic WAFL free spaceA bit is used to represent each cloneWith a map of 32-bits per data block we get 32 clonesTo support 256 clones, 32 bytes are needed perdata-block.In order to clone a volume or delete a clone we need tomake a pass on the entire free-space

B-trees, Shadowing, and Clones – p.22/51

Page 23: B-trees, Shadowing, and Clones - USENIX

ChallengesHow do we support a million clones without a huge free-spacemap?

How do we avoid making a pass on the entire free-space?

B-trees, Shadowing, and Clones – p.23/51

Page 24: B-trees, Shadowing, and Clones - USENIX

Main ideaModify the free space so it will keep a reference count(ref-count) per blockRef-count records how many times a page is pointed toA zero ref-count means that a block is freeEssentially, instead of copying a tree, the ref-counts ofall its nodes are incremented by oneThis means that all nodes belong to two trees insteadof one; they are all sharedInstead of making a pass on the entire tree andincrementing the counters during the clone operation,this is done in a lazy fashion. B-trees, Shadowing, and Clones – p.24/51

Page 25: B-trees, Shadowing, and Clones - USENIX

Cloning a tree1. Copy the root-node of p into a new root2. Increment the free-space counters for each of the

children of the root by one

P,1

B,1 C,1

D,1 E,1 G,1 H,1

P,1 Q,1

B,2 C,2

D,1 E,1 G,1 H,1

B-trees, Shadowing, and Clones – p.25/51

Page 26: B-trees, Shadowing, and Clones - USENIX

Mark-dirty, withoutclones

Before modifying page N, it is “marked-dirty”1. Informs the run-time system that N is about to be

modified2. Gives it a chance to shadow the page if necessaryIf ref-count == 1: page can be modified in placeIf ref-count > 1, and N is relocated from address a1 toaddress a2

1. the ref-count for a1 is decremented2. the ref-count for a2 is made 13. The ref-count of N’s children is incremented by 1

B-trees, Shadowing, and Clones – p.26/51

Page 27: B-trees, Shadowing, and Clones - USENIX

Example, insert-keyinto leaf H,tree q

P,1 Q,1

B,2 C,2

D,1 E,1 G,1 H,1

P,1 Q,1

B,2 C,2

D,1 E,1 G,1 H,1

(I) Initial trees, Tp and Tq (II) Shadow Q

P,1 Q,1

B,2 C,1 C’,1

D,1 E,1 G,2 H,2

P,1 Q,1

B,2 C,1 C’,1

D,1E,1 G,2H,1 H’,1

(III) shadow C (IV) shadow H

B-trees, Shadowing, and Clones – p.27/51

Page 28: B-trees, Shadowing, and Clones - USENIX

Deleting a treeref-count(N) > 1: Decrement the ref-count and stopdownward traversal. The node is shared with othertrees.ref-count(N) == 1 : It belongs only to q. Continuedownward traversal and on the way back upde-allocate N.

B-trees, Shadowing, and Clones – p.28/51

Page 29: B-trees, Shadowing, and Clones - USENIX

Delete exampleP,1 Q,1

B,1 C,2 X,1

D,1 E,1 G,1 Y,2 Z,1

P,1

B,1 C,1

D,1 E,1 G,1 Y,1

(I) Initial trees Tp and Tq (II) After deleting Tq

B-trees, Shadowing, and Clones – p.29/51

Page 30: B-trees, Shadowing, and Clones - USENIX

Comparison to WAFLfree-space

Clone:Ref-counts: increase ref-counts for children of rootWAFL: make a pass on the entire free-space andset bits

Delete clone p

Ref-counts: traverse the nodes that belong only to p

and decrement ref-countersWAFL: make a pass on the entire free-space andset snapshot bit to zero

B-trees, Shadowing, and Clones – p.30/51

Page 31: B-trees, Shadowing, and Clones - USENIX

Comparison to WAFLfree-space

During normal runtimeRef-counts: additional cost of incrementingref-counts while performing modificationsWAFL: none

Space taken by free-space mapRef-counts: two bytes per block allow 64K clonesWAFL: two bytes allow 16 clones

B-trees, Shadowing, and Clones – p.31/51

Page 32: B-trees, Shadowing, and Clones - USENIX

Resource andperformance analysis

The addition of ref-counts does not add b-tree nodeaccesses. Worst-case estimate on memory-pages anddisk-blocks used per operation is unchangedConcurrency remains unaffected by ref-countsSharing on any node that requires modification isquickly broken and each clone gets its own version

B-trees, Shadowing, and Clones – p.32/51

Page 33: B-trees, Shadowing, and Clones - USENIX

FS countersThe number of free-space accesses increases.Potential of significantly impacting performance.Several observations make this unlikely:1. Once sharing is broken for a page and it belong to a

single tree, there are no additional ref-count costsassociated with it.

2. The vast majority of b-tree pages are leaves.Leaves have no children and therefore do not incuradditional overhead.

B-trees, Shadowing, and Clones – p.33/51

Page 34: B-trees, Shadowing, and Clones - USENIX

FS counters IIThe experimental test-bed uses in-memory free-spacemaps1. Precludes serious investigation of this issue2. Remains for future work

B-trees, Shadowing, and Clones – p.34/51

Page 35: B-trees, Shadowing, and Clones - USENIX

SummaryThe b-trees described here:

Are recoverableHave good concurrencyAre efficientHave good bounds on resource usageHave a good cloning strategy

B-trees, Shadowing, and Clones – p.35/51

Page 36: B-trees, Shadowing, and Clones - USENIX

Backup slides

B-trees, Shadowing, and Clones – p.36/51

Page 37: B-trees, Shadowing, and Clones - USENIX

PerformanceIn theory, top-down b-trees have a bottle-neck at thetopIn practice, this does not happen because the topnodes are cachedIn the experiments1. Entries are 16bytes: key=8bytes, data=8bytes2. A 4KB node contains 70-235 entries

B-trees, Shadowing, and Clones – p.37/51

Page 38: B-trees, Shadowing, and Clones - USENIX

Test-bedSingle machine connected to a DS4500 throughFiber-Channel.Machine: dual-CPU Xeon 2.4Ghz with 2GB of memory.Operating System Linux-2.6.9The b-tree on a virtual LUNThe LUN is a RAID1 in a 2+2 configurationStrip width is 64K, full stripe=512KBRead and write caching is turned off

B-trees, Shadowing, and Clones – p.38/51

Page 39: B-trees, Shadowing, and Clones - USENIX

Basic diskperformance

IO-size=4KBDisk-area=1GB

#threads op. time per op.(ms) ops per second

10 read N/A 1217write N/A 640R+W N/A 408

1 read 3.9 256write 16.8 59R+W 16.9 59

B-trees, Shadowing, and Clones – p.39/51

Page 40: B-trees, Shadowing, and Clones - USENIX

Test methodologyThe ratio of cache-size to number of b-tree pages1. Is fixed at initialization time2. This ratio is called the in-memory percentageVarious trees were used, with the same results. Theexperiments reported here are for tree T235 .

B-trees, Shadowing, and Clones – p.40/51

Page 41: B-trees, Shadowing, and Clones - USENIX

Tree T235

Maximal fanout: 235Legal #entries: 78 .. 235Contains 9.5 million keys and 65500 nodes1. 65000 leaves2. 500 index-nodesTree depth is: 4Average node capacity 150 keys

B-trees, Shadowing, and Clones – p.41/51

Page 42: B-trees, Shadowing, and Clones - USENIX

Test methodology IICreate a large tree using random operationsFor each test1. Clone the tree2. Age the clone by doing 1000 random

insert-key/remove-key operations3. Perform 10

4 – 108 measurements against the clone

with random keys4. Delete the clonePerform each measurement 5 times, and average.The standard deviation was less than 1% in all tests.

B-trees, Shadowing, and Clones – p.42/51

Page 43: B-trees, Shadowing, and Clones - USENIX

Latencymeasurements

Four operations whose latency was measured:lookup-key, insert-key, remove-key, append-key.Latency measured in milliseconds

Lookup Insert Remove Append

3.43 16.76 16.46 0.007

B-trees, Shadowing, and Clones – p.43/51

Page 44: B-trees, Shadowing, and Clones - USENIX

Different in-memoryratios

Workload: 100% lookup workload

% in-memory 1 thread 10 threads ideal

100 14237 19805 ∞

50 391 1842 243425 321 1508 162210 268 1290 13525 254 1210 12812 250 1145 1241

B-trees, Shadowing, and Clones – p.44/51

Page 45: B-trees, Shadowing, and Clones - USENIX

ThroughputFour workloads were used:1. Search-100: 100% lookup2. Search-80: 80% lookup, 10% insert, 10% remove3. Modify: 20% lookup, 40% insert, 40% remove4. Insert: 100% insertMetric: operations per secondSince there isn’t much difference between 2% inmemory and 50%, the rest of the experiments weredone using 5%.Allows putting all index nodes in memory.

B-trees, Shadowing, and Clones – p.45/51

Page 46: B-trees, Shadowing, and Clones - USENIX

Throughput IITree #threads Src-100 Src-80 Modify Insert

T235 10 1227 748 455 4001 272 144 75 62

Ideal 1281 429

B-trees, Shadowing, and Clones – p.46/51

Page 47: B-trees, Shadowing, and Clones - USENIX

Workload with somelocality

Workload: randomly choose a keyWith 80% probably read the next 100 keys after itWith 10% probability, insert/overwrite the next 100keysWith 10% probability, remove the next 100 keys

#threads semi-local

10 166341 3848

B-trees, Shadowing, and Clones – p.47/51

Page 48: B-trees, Shadowing, and Clones - USENIX

Clone performanceTwo clones are made of base tree T235

Aging is performed1. 12000 operations are performed2. 6000 against each clone

Src-100 Src-80 Modify Insert

2 clones 1221 663 393 343base 1227 748 455 400

B-trees, Shadowing, and Clones – p.48/51

Page 49: B-trees, Shadowing, and Clones - USENIX

Clone performance at100% in-memory

Src-100 Src-80 Modify Insert

2 threads 20395 18524 16907 165051 thread 13910 12670 11452 11112

B-trees, Shadowing, and Clones – p.49/51

Page 50: B-trees, Shadowing, and Clones - USENIX

Performance ofcheckpointing

A checkpoint is taken during the throughput testPerformance degrades1. A dirty page has to be destaged prior to being

modified2. Caching of dirty-pages is effected

B-trees, Shadowing, and Clones – p.50/51

Page 51: B-trees, Shadowing, and Clones - USENIX

Performance ofcheckpointing II

Tree T235 Src-100 Src-80 Modify Insert

checkpoint 1205 689 403 353base 1227 748 455 400

B-trees, Shadowing, and Clones – p.51/51