
Wide-Area Cooperative Storage with CFS

Presented by Hakim Weatherspoon, CS294-4: Peer-to-Peer Systems

Slides liberally borrowed from the SOSP 2001 CFS presentation, and from High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two by Charles Blake and Rodrigo Rodrigues, presented at HotOS 2003

By Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, *Ion Stoica

MIT and *Berkeley

[Diagram: CFS nodes connected across the Internet]


Design Goals

• Spread storage burden evenly (avoid hot spots)
• Tolerate unreliable participants
• Fetch speed comparable to whole-file TCP
• Avoid O(#participants) algorithms
– Centralized mechanisms [Napster], broadcasts [Gnutella]
• Simplicity
– Does simplicity imply provable correctness?
• More precisely, could you build CFS correctly?
– What about performance?
• CFS attempts to solve these challenges
– Does it?


CFS Summary

• CFS provides peer-to-peer read-only storage
• Structure: DHash and Chord
• Claims efficient, robust, and load-balanced
– Does CFS achieve any of these qualities?
• It uses block-level distribution
• The prototype is as fast as whole-file TCP
• Storage promise ⇒ redundancy promise ⇒ data must move as members leave! ⇒ lower bound on bandwidth usage


Client-server interface

• Files have unique names
• Files are read-only (single writer, many readers)
• Publishers split files into blocks and place blocks into a hash table
• Clients check files for authenticity [SFSRO]

[Diagram: the FS client inserts/looks up file f; the servers on each node insert/look up individual blocks]
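To make the publisher side concrete, here is a minimal sketch (not the real CFS client) that splits a file into 8 KByte blocks, the block size used in the experiments later in the deck, and inserts each block under its content hash. The insert_block parameter stands in for whatever insert RPC the system provides.

```python
import hashlib

BLOCK_SIZE = 8 * 1024

def publish(path: str, insert_block) -> list:
    """Split a file into blocks, insert each under its content hash,
    and return the block IDs (to be recorded in an inode block)."""
    ids = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            bid = hashlib.sha1(chunk).hexdigest()   # content-derived block ID
            insert_block(bid, chunk)                # e.g. a DHash insert
            ids.append(bid)
    return ids
```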


Server Structure

• DHash stores, balances, replicates, and caches blocks
• DHash uses Chord [SIGCOMM 2001] to locate blocks
• Why blocks instead of files?
– Easier load balance (remember the complexity of PAST)

[Diagram: Node 1 and Node 2 each run a DHash layer on top of a Chord layer]


CFS file system structure

• The root block is identified by a public key
– Signed by the corresponding private key
• Other blocks are identified by the hash of their contents
• What is wrong with this organization?
– The path of blocks from a data block to the root block is modified for every update
– This is okay because the system is read-only

[Diagram: root block (public key + signature) → H(D) → directory block → H(F) → inode block → H(B1), H(B2) → data blocks B1, B2]
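A minimal sketch of the naming scheme in this tree, assuming SHA-1 as the content hash; the root-block signature is only indicated in a comment, and none of this is CFS's actual code.

```python
import hashlib

def block_id(data: bytes) -> str:
    # Data, inode, and directory blocks are named by the hash of their contents.
    return hashlib.sha1(data).hexdigest()

def root_block_id(public_key: bytes) -> str:
    # The root block is named by the publisher's public key (here, its hash);
    # its contents must be signed with the matching private key (omitted).
    return hashlib.sha1(public_key).hexdigest()

# A parent block stores its children's hashes, so checking the root
# signature transitively authenticates every block below it.
b1, b2 = b"data block one", b"data block two"
inode = (block_id(b1) + block_id(b2)).encode()
print(block_id(inode))   # the inode block's own content-derived name
```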


DHash/Chord Interface

• lookup() returns a list of node IDs closer in ID space to the block ID
– Sorted, closest first

[Diagram: on each server, DHash calls Chord's Lookup(blockID) and receives a list of <node-ID, IP address> pairs; Chord maintains a finger table of <node-ID, IP address> entries]
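The boundary might look roughly like the sketch below; the class and function names are illustrative assumptions, not the real CFS/Chord API.

```python
from typing import List, Tuple

NodeInfo = Tuple[int, str]      # (node ID in the circular ID space, IP address)

class Chord:
    def lookup(self, block_id: int) -> List[NodeInfo]:
        """Return nodes whose IDs are closest to block_id, sorted closest first."""
        raise NotImplementedError   # finger-table routing lives here

def fetch_block(ip: str, block_id: int) -> bytes:
    # Placeholder for the block-fetch RPC a real node would send to `ip`.
    return b""

class DHash:
    def __init__(self, chord: Chord):
        self.chord = chord

    def get(self, block_id: int) -> bytes:
        # Ask Chord where the block should live, then fetch it from the
        # first node on the returned list (failover appears a few slides later).
        node_id, ip = self.chord.lookup(block_id)[0]
        return fetch_block(ip, block_id)
```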


DHash Uses Other Nodes to Locate Blocks

[Diagram: Chord ring with nodes N5, N10, N20, N40, N50, N60, N68, N80, N99, N110; Lookup(BlockID=45) proceeds in three numbered hops toward block 45's successor, N50]


Storing Blocks

• Long-term blocks are stored for a fixed time
– Publishers need to refresh periodically
• Cache uses LRU

• Cache uses LRU

[Diagram: each node's disk is split between a cache and long-term block storage]
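A toy sketch of this split between long-term storage and an LRU cache; the capacity and lifetime defaults below are made-up illustrative values, not CFS parameters.

```python
import time
from collections import OrderedDict

class BlockStore:
    def __init__(self, cache_capacity: int = 1024, lifetime_s: float = 24 * 3600):
        self.long_term = {}            # block_id -> (data, expiry time)
        self.cache = OrderedDict()     # block_id -> data, in LRU order
        self.cache_capacity = cache_capacity
        self.lifetime_s = lifetime_s

    def publish(self, block_id, data):
        # Long-term blocks last a fixed time; publishers must re-publish to refresh.
        self.long_term[block_id] = (data, time.time() + self.lifetime_s)

    def cache_put(self, block_id, data):
        self.cache[block_id] = data
        self.cache.move_to_end(block_id)
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)     # evict the least recently used block

    def get(self, block_id):
        entry = self.long_term.get(block_id)
        if entry and entry[1] > time.time():   # still within its lifetime
            return entry[0]
        if block_id in self.cache:
            self.cache.move_to_end(block_id)   # refresh LRU position
            return self.cache[block_id]
        return None
```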


Replicate blocks at r successors

[Diagram: Chord ring with nodes N5 through N110; block 17 is stored at its successor N20 and replicated at the following r successors]
• r = 2 log N
• Node IDs are SHA-1 of IP address
• Ensures independent replica failure
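The placement rule can be sketched as below; the helper name and ring values are assumptions for illustration. It simply finds the block's successor and the nodes that follow it on the ring.

```python
import bisect

def replica_set(block_id: int, node_ids: list, r: int) -> list:
    """Return the r successors of block_id on the circular ID space."""
    nodes = sorted(node_ids)
    start = bisect.bisect_left(nodes, block_id) % len(nodes)   # wrap past the top
    return [nodes[(start + i) % len(nodes)] for i in range(min(r, len(nodes)))]

# The ring on this slide: block 17 lands on N20 and is replicated on the nodes after it.
ring = [5, 10, 20, 40, 50, 60, 68, 80, 99, 110]
print(replica_set(17, ring, 3))   # [20, 40, 50]
```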


Lookups find replicas

[Diagram: Lookup(BlockID=17) routed around the ring; the fetch from the block's successor fails, so a replica is fetched from the next successor]
RPCs:
1. Lookup step
2. Get successor list
3. Failed block fetch
4. Block fetch
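A hedged sketch of that failover behaviour: walk the successor list returned by lookup and take the first fetch that succeeds. Here fetch_block is the same placeholder RPC as in the earlier interface sketch, passed in as a parameter.

```python
def fetch_with_failover(chord, block_id: int, fetch_block):
    # The successor list comes back closest-first; dead or missing replicas are skipped.
    for node_id, ip in chord.lookup(block_id):
        data = fetch_block(ip, block_id)
        if data:                      # a failed fetch returns nothing
            return data               # final RPC: the successful block fetch
    raise LookupError(f"block {block_id} unavailable on all replicas")
```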


First Live Successor Manages Replicas

[Diagram: Chord ring holding block 17 and a copy of 17 on nearby successors; the first live successor maintains the replicas]

• Node can locally determine that it is the first live successor


DHash Copies to Caches Along Lookup Path

[Diagram: Lookup(BlockID=45) routed around the ring; the fetched block is then copied to caches at nodes along the lookup path]
RPCs:
1. Chord lookup
2. Chord lookup
3. Block fetch
4. Send to cache


Virtual Nodes Allow Heterogeneity

• Hosts may differ in disk/net capacity
• Hosts may advertise multiple IDs
– Chosen as SHA-1(IP address, index)
– Each ID represents a “virtual node”
• Host load is proportional to the number of virtual nodes
• Manually controlled
• Sybil attack!

[Diagram: Node A advertises virtual nodes N10, N60, N101; Node B advertises only N5]
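A small sketch of the ID assignment described above, assuming the (IP address, index) pair is hashed with SHA-1; the exact encoding of the pair is an assumption.

```python
import hashlib

def virtual_node_ids(ip: str, n_virtual: int) -> list:
    # Each virtual node gets ID = SHA-1(IP address, index), so a capable host
    # can own several points on the ring in proportion to its capacity.
    return [int.from_bytes(hashlib.sha1(f"{ip},{i}".encode()).digest(), "big")
            for i in range(n_virtual)]

print(virtual_node_ids("18.26.4.9", 3))   # three ring positions for one host
```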


Experiment (12 nodes, pre-PlanetLab)

• One virtual node per host
• 8 KByte blocks
• RPCs use UDP

[Map: test hosts CA-T1, CCI, Aros, Utah, CMU, MIT, MA-Cable, Cisco, Cornell, NYU, OR-DSL, with links to vu.nl, lulea.se, ucl.uk, kaist.kr, and .ve]

• Caching turned off
• Proximity routing turned off


CFS Fetch Time for 1MB File

• Average over the 12 hosts
• No replication, no caching; 8 KByte blocks

[Plot: fetch time (seconds) vs. prefetch window (KBytes)]


Distribution of Fetch Times for 1MB

[Plot: fraction of fetches vs. time (seconds), for 8, 24, and 40 KByte prefetch windows]


CFS Fetch Time vs. Whole File TCP

[Plot: fraction of fetches vs. time (seconds), 40 KByte prefetch compared with whole-file TCP]


Robustness vs. Failures

[Plot: failed lookups (fraction) vs. failed nodes (fraction)]
• Six replicas per block; (1/2)^6 ≈ 0.016
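The annotated number is easy to check: with six independent replicas, each on a node that fails with probability 1/2, a block is lost only when all six fail.

```python
# (1/2)^6: probability that all six replicas of a block sit on failed nodes.
print(0.5 ** 6)   # 0.015625 ≈ 0.016, matching the slide's annotation
```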


Revisit Assumptions

• P2P purist ideals
– Cooperation, symmetry, decentralization
• How realistic are these assumptions?
• In what domains are they valid?

[Diagram: hosts ranging from fast and stable to slow and flaky, each with 10s..100s GB/node of idle cheap disk, feeding a distributed data store with all the *ilities (high availability, good scalability, high reliability, maintainability, flexibility): a fault-tolerant DHT]


BW for Redundancy Maintenance

• Assume average system size, N, is stable
– P(Leave)/Time = Leaves/Time/N = 1/Lifetime
– Join rate = leave-forever rate = 1/Lifetime
– Leaves induce redundancy replacement
• Replacement size × replacement rate
– Joins cost the same
• Maintenance BW > 2 × Space/Lifetime
• Space/node < ½ × BW/node × Lifetime

• Quality WAN storage scales with WAN BW and member quality
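In code, the bound reads as follows; the example bandwidth and lifetime below are arbitrary illustrative numbers, not figures from the slides.

```python
def max_space_per_node(bw_bps: float, lifetime_s: float) -> float:
    # Space/node < 1/2 * BW/node * Lifetime, from Maintenance BW > 2 * Space/Lifetime.
    return 0.5 * (bw_bps / 8) * lifetime_s   # bytes

# e.g. 1 Mbps of spare upstream and a one-week membership lifetime:
print(max_space_per_node(1e6, 7 * 24 * 3600) / 1e9)   # about 37.8 GB per node
```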


BW for Redundancy Maintenance II

• Maintenance BW = 200 kbps
• Lifetime = median 2001-Gnutella session = 1 hour
• Served space = 90 MB/node << donatable storage!
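A back-of-the-envelope check of where the 90 MB figure comes from, treating 200 kbps sustained over the one-hour median session as the total data a node can move before it departs; the exact constant factor depends on the redundancy scheme assumed.

```python
bw_bps = 200e3                 # 200 kbps of maintenance bandwidth
lifetime_s = 3600              # median 2001-Gnutella session: one hour
print(bw_bps / 8 * lifetime_s / 1e6)   # 90.0 MB moved per membership
```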


Peer Dynamics

• The peer-to-peer “dream”
– Reliable storage from many unreliable components
• Robust lookup perceived as critical
• Bandwidth to maintain redundancy is the hard problem [Blake and Rodrigues 03]


Need Too Much BW to Maintain Redundancy

• 10M users; 25% avail.; 1 week membership; 100G donation => 50 kbps

• Wait! It gets worse… HW trends

[Diagram: triangle with vertices High Availability, Scalable Storage, and Dynamic Membership. Must pick two.]


Proposal #1

• Server-to-server DHTs
• Reduce to a solved problem? Not really…
– Self-configuration
– Symmetry
– Scalability
– Dynamic load balance


Proposal #2

• Complete routing information
• Possible complications:
– Memory requirements
– Bandwidth [Gupta, Liskov, Rodrigues 03]
– Load balance

• Multi-hop optimization makes sense only when many very dynamic members serve a little data
– (Multi-hop not required if N < per-host routing memory / 40 bytes)
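The parenthetical bound is just a memory estimate, sketched below with an assumed ~40 bytes per routing-table entry; the input size is an illustrative value.

```python
def max_members_for_full_tables(per_host_bytes: float, bytes_per_entry: int = 40) -> int:
    # Complete routing state needs one entry per member, so N is capped by memory.
    return int(per_host_bytes // bytes_per_entry)

print(max_members_for_full_tables(64e6))   # 64 MB of routing memory -> 1.6M members
```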


Proposal #3

• Decouple the networking layer from the data layer
• Layer of indirection
– a.k.a. distributed directory, location pointers, pointers, etc.
• Combines a little of proposals #1 and #2
– The DHT no longer decides who, what, when, where, why, and how for storage maintenance
– Separate policy from mechanism
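One way to read this proposal is sketched below; the Directory class and the dht.put/get methods are assumptions for illustration, not an API from any of the papers. The DHT stores only location pointers, so placement policy can live outside the routing layer.

```python
class Directory:
    def __init__(self, dht):
        self.dht = dht                       # any DHT with put/get, e.g. a DHash-like layer

    def publish(self, block_id, locations):
        # A separate policy layer decides *where* replicas live; the DHT only records it.
        self.dht.put(block_id, {"locations": list(locations)})

    def locate(self, block_id):
        entry = self.dht.get(block_id)
        return entry["locations"] if entry else []
```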


Appendix: Chord Hashes a Block ID to its Successor

• Nodes and blocks have randomly distributed IDs
• Successor: node with the next-highest ID
[Diagram: circular ID space with nodes N10, N32, N60, N80, N100; blocks B112, B120, …, B10 map to N10; B11, B30 to N32; B33, B40, B52 to N60; B65, B70 to N80; B100 to N100]


Appendix: Basic Lookup

[Diagram: ring of nodes N5 through N110; a node asks “Where is block 70?” and the answer is “N80”, block 70's successor]
• Lookups find the ID's predecessor
• Correct if successors are correct


Appendix: Successor Lists Ensure Robust Lookup

• Each node stores r successors, r = 2 log N
• Lookup can skip over dead nodes to find blocks
[Diagram: each node's 3-entry successor list: N5 → (10, 20, 32), N10 → (20, 32, 40), N20 → (32, 40, 60), N32 → (40, 60, 80), N40 → (60, 80, 99), N60 → (80, 99, 110), N80 → (99, 110, 5), N99 → (110, 5, 10), N110 → (5, 10, 20)]


Appendix: Chord Finger Table Allows O(log N) Lookups

[Diagram: N80's fingers cover ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the ring]

• See [SIGCOMM 2001] for table maintenance
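A hedged sketch of how the fingers are chosen, using a toy 7-bit ID space so the small node IDs from these appendix slides fit; real Chord uses 160-bit SHA-1 IDs, and the helper names are assumptions.

```python
import bisect

def finger_table(n: int, node_ids: list, m: int = 7) -> list:
    # finger[i] = successor(n + 2^i) on the circular m-bit ID space,
    # which is what yields O(log N) lookup hops.
    nodes = sorted(node_ids)
    space = 2 ** m

    def successor(x: int) -> int:
        i = bisect.bisect_left(nodes, x % space)
        return nodes[i % len(nodes)]           # wrap around past the largest ID

    return [successor(n + 2 ** i) for i in range(m)]

# The ring from the lookup appendix: N80's fingers on {5, 10, 20, 32, 40, 60, 80, 99, 110}
print(finger_table(80, [5, 10, 20, 32, 40, 60, 80, 99, 110]))
# -> [99, 99, 99, 99, 99, 5, 20] for this toy ring
```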