28
Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley

Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

  • View
    215

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

Long Term Durability with Seagull

Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group)

University of California, BerkeleyROC/Sahara/OceanStore Retreat, Lake Tahoe. Monday, January 13,

2003

Page 2: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:2

Questions

• Given: wide-area durable storage is complex.• What is required to convince you to place your

data in this system (or a like system)? – How do you know that it works?– How efficient is it?

• BW, latency, throughput.

– Do you trust it?• Who do you sue.

– How much does it cost? • BW, Storage, Money.

– How reliable is it?• MTTDL/Fractions of Blocks Lost Per Year (FBLPY).

Page 3: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:3

Relevance to ROC/Sahara/OceanStore• Components of Communication

– Heart beating, Fault tolerant routing, etc.

• Correlation– Monitoring, Human input, etc.

• Detection– Distributed vs. Global.

• Repair– Triggered vs. Continuous

• Reliability– Continuous restart of communication links, etc.– FBLPY (MTTDL).

Page 4: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:4

Outline

• Overview• Experience.• Lessons learned• Required Components• Future Directions

Page 5: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:5

Deployment

• Planet Lab global network– 98 machines at 42 institutions, in North America,

Europe, Australia.– 1.26Ghz PIII (1GB RAM), 1.8Ghz PIV (2GB RAM)– North American machines (2/3) on Internet2

Page 6: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:6

Deployment

• Deployed storage system in November of 2002.– ~ 50 physical machines.– 100 virtual nodes.

• 3 clients, 93 storage serves, 1 archiver, 1 monitor.

– Support OceanStore API• NFS, IMAP, etc.

– Fault injection.– Fault detection and repair.

Page 7: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:7

Path of an OceanStore Update

Page 8: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:8

OceanStore SW Architecture

Page 9: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:9

OceanStore SW Architecture

Page 10: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:10

Path of a Storage Update• Erasure codes

– redundancy without overhead of strict replication– produce n fragments, where any m is sufficient to reconstruct data. m < n.

rate r = m/n. Storage overhead is 1/r.

Archiver

Page 11: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:11

Durability

1.E-70

1.E-60

1.E-50

1.E-40

1.E-30

1.E-20

1.E-10

1.E+00

0 6 12 18 24

Repair Time (Months)

Pro

ba

bilit

y o

f B

loc

k F

ailu

re p

er

Ye

ar

n = 4 fragmentsn = 8 fragmentsn = 16 fragmentsn = 32 fragmentsn = 64 fragments

• Fraction of Blocks Lost Per Year (FBLPY)*– r = ¼, erasure-encoded block. (e.g. m = 16, n = 64)– Increasing number of fragments, increases durability of block

• Same storage cost and repair time.

– n = 4 fragment case is equivalent to replication on four servers.* Erasure Coding vs. Replication, H. Weatherspoon and J. Kubiatowicz, In Proc. of IPTPS

2002.

Page 12: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:12

Naming and Verification Algorithm

• Use cryptographically secure hash algorithm to detect corrupted fragments.

• Verification Tree: – n is the number of

fragments.– store log(n) + 1 hashes

with each fragment.

– Total of n.(log(n) + 1) hashes.

• Top hash is a block GUID (B-GUID).

– Fragments and blocks are self-verifying

Fragment 3:

Fragment 4:

Data:

Fragment 1:

Fragment 2:

H2 H34 Hd F1 - fragment data

H14 data

H1 H34 Hd F2 - fragment data

H4 H12 Hd F3 - fragment data

H3 H12 Hd F4 - fragment data

F1 F2 F3 F4

H1 H2 H3 H4

H12 H34

H14

B-GUID

HdData

Encoded Fragments

F1

H2

H34

Hd

Fragment 1: H2 H34 Hd F1 - fragment data

Page 13: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:13

GUID

Fragments

Enabling TechnologyTapestry DOLR

Page 14: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:14

Outline

• Overview• Experience.• Lessons learned• Required Components• Future Directions

Page 15: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:15

Lessons Learned

• Need ability to route to an object if it exists.– Hindrance to a long running process.

• Robustness to node and network failures.

• Need tools to diagnosis current state of network.• Need ability to run without inner ring.• Need monitor, detection, repair mechanisms.

– Avoid correlated failures.– Quickly and efficiently detect faults.– Efficiently repair faults.

• Need to perform maintenance in distributed fashion.

Page 16: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:16

Outline

• Overview• Experience.• Lessons learned• Required Components• Future Directions

Page 17: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:17

Monitor: Low Failure Correlation Dissemination

• Model Builder.– Various sources.– Model failure

correlation.

• Set Creator.– Queries random nodes.– Dissemination Sets.

• Storage servers that fail with low correlation.

• Disseminator.– Sends fragments to

members of set.

Model Builder

Set Creator

IntrospectionHuman Input

Network

Monitoringmodel

Disseminator

Disseminator

set

set

probe

type

fragments

fragments

fragments

Storage Server

Page 18: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:18

Monitor: Low Failure Correlation Dissemination

• Sanity Check– Monitored 1909 Web Servers

• Future– Simple Network Management

Protocol (SNMP)• standard protocol for monitoring.• Query network components for

information about their configuration, activity, errors, etc.

– Define an OceanStore/Tapestry MIB.

• Management Information Base (MIB)

Page 19: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:19

Detection• Goal

– Maintain routing and object state using minimal resources.• e.g. less than 1% of bandwidth and cpu cycles.

• Server Heartbeat's– “Keep-alive” beacon along each forward link.– Increasing period (decreasing frequency) with the routing level.

• Data-Driven Server Heartbeat's– “Keep-alive” Multicast to all ancestors with an object pointer that points to us.– Multicast with increasing radius.

Page 20: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:20

Detection

• Republish/Object Heartbeats– Object Heartbeat (Republish).– Heartbeat period increasing with distance

• (i.e. Heartbeat frequency decreases with distance)• Distance is number of application-level hops

• Distributed Sweep– Request object from storage servers.– Sweep period increasing with distance

• (i.e. Sweep frequency decreases with distance)

• Global Sweep (Responsible Party/Client)– Request object from storage servers at regular intervals.– Period constant. (i.e. frequency constant)

Page 21: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:21

Object Publication and Location

Page 22: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:22

Detection and Repair Schemes

Page 23: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:23

Efficient Repair

• Distributed.– Exploit DOLR’s distributed information and locality.– Efficient detection and then reconstruction of fragments.

3274

4577

5544

AE87

3213

9098

1167

6003

0128

L2L2

L1

L1

L2

L2

L3

L3

L2

L1L1

L2L3

L2

Ring of L1Heartbeats

Page 24: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:24

Detection

Page 25: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:25

Efficient Repair

• Continuous vs. Triggered• Continuous (Responsible Party/Client)

– Request object from storage servers at regular intervals.– Period constant. (i.e. frequency constant)

• Triggered (Infrastructure)– FBLPY proportional to MTTR/MTTF.

• Disks: MTTF > 100,000 hours.• Gnutella: MTTF = 2.3 hours. (Median 40 minutes).

– Local vs. Remote Repair.• Local stub looks like durable disk.

Page 26: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:26

Efficient Repair

• Reliability vs. Cost vs. Trust.

Page 27: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:27

Outline

• Overview• Experience.• Lessons learned• Required Components• Future Directions

Page 28: Long Term Durability with Seagull Hakim Weatherspoon (Joint work with Jeremy Stribling and OceanStore group) University of California, Berkeley ROC/Sahara/OceanStore

ROC/Sahara/OceanStore

©2003 Hakim Weatherspoon/UC Berkeley Seagull:28

Future Directions

• Redundancy, Detection, Repair, Monitoring.– None alone is sufficient.– Only reliable as weakest link.

• Verify system?– System may always be in inconsistent state.– How do you know …

• Data exists?• Data will exist tomorrow?

• Applications/Usage in the long-term.– NSF, IMAP, rsynch (back-up), InternetArchive, etc.