Page 1: Effective Replica Maintenance for Distributed Storage Systems

Effective Replica Maintenance for Distributed Storage Systems

USENIX NSDI 2006
Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz, and Robert Morris

Presenter: Hakim Weatherspoon

Page 2: Effective Replica Maintenance for Distributed Storage Systems


Motivation

• Efficiently maintain wide-area distributed storage systems
• Redundancy
  – Duplicate data to protect against data loss
• Place data throughout the wide area
  – Data availability and durability
• Continuously repair lost redundancy as needed
  – Detect permanent failures and trigger data recovery

Page 3: Effective Replica Maintenance for Distributed Storage Systems

Motivation

• Distributed Storage System: a network file system whose storage nodes are dispersed over the Internet

• Durability: objects that an application has put into the system are not lost due to disk failure

• Availability: a get request will be able to return the object promptly


Page 4: Effective Replica Maintenance for Distributed Storage Systems

Motivation

• To store immutable objects durably at a low bandwidth cost in a distributed storage system


Page 5: Effective Replica Maintenance for Distributed Storage Systems

Contributions

• A set of techniques that allow wide-area systems to efficiently store and maintain large amounts of data

• An implementation: Carbonite


Page 6: Effective Replica Maintenance for Distributed Storage Systems

Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient failure cost
• Implementation Issues
• Conclusion


Page 7: Effective Replica Maintenance for Distributed Storage Systems

Providing Durability

• Durability is relatively more important than availability

• Challenges
  – Replication algorithm: create new replicas faster than they are lost
  – Reducing network bandwidth
  – Distinguishing transient failures from permanent disk failures
  – Reintegration


Page 8: Effective Replica Maintenance for Distributed Storage Systems

Challenges to Durability

• Create new replicas faster than replicas are destroyed
  – Creation rate < failure rate ⇒ the system is infeasible
• A higher number of replicas does not allow the system to survive a higher average failure rate
  – Creation rate = failure rate + ε (ε small) ⇒ a burst of failures may destroy all of the replicas


Page 9: Effective Replica Maintenance for Distributed Storage Systems

Number of Replicas as a Birth-Death Process

• Assumption: independent, exponentially distributed inter-failure and inter-repair times
• λf: average failure rate
• μi: average repair rate in state i
• rL: lower bound on the number of replicas (rL = 3 in this case)
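The chain above can be made concrete with a small simulation. This is a minimal sketch under stated assumptions (each replica fails independently at rate λf, repair proceeds only while fewer than rL replicas remain, and the numeric rates are illustrative rather than the paper's values); it estimates how often an object loses all of its replicas within a fixed horizon.

```python
import random

# Minimal sketch of the replica-count birth-death chain described above.
# Assumptions: each replica fails independently at rate lambda_f, and a
# repair (rate mu) is in progress only while fewer than r_l replicas
# remain. The numeric rates are illustrative, not the paper's values.
def object_lost(lambda_f=0.44, mu=3.0, r_l=3, horizon_years=10.0):
    replicas, t = r_l, 0.0
    while t < horizon_years:
        if replicas == 0:
            return True                               # all replicas destroyed
        fail_rate = replicas * lambda_f               # independent failures
        repair_rate = mu if replicas < r_l else 0.0   # repair only below r_l
        total = fail_rate + repair_rate
        t += random.expovariate(total)                # exponential holding time
        if random.random() < fail_rate / total:
            replicas -= 1                             # a failure event
        else:
            replicas += 1                             # a repair completes
    return False

trials = 10_000
losses = sum(object_lost() for _ in range(trials))
print(f"objects lost: {losses}/{trials}")
```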


Page 10: Effective Replica Maintenance for Distributed Storage Systems

Model Simplification

• With fixed μ and λ, the equilibrium number of replicas is Θ = μ/λ

• If Θ < 1, the system can no longer maintain full replication regardless of rL


Page 11: Effective Replica Maintenance for Distributed Storage Systems

Real-world Settings

• PlanetLab
  – 490 nodes
  – Average inter-failure time: 39.85 hours
  – 150 KB/s bandwidth
• Assumptions
  – 500 GB of data per node
  – rL = 3

• λ = 365 days / (490 × 39.85/24) = 0.439 disk failures / year
• μ = 365 days / (500 GB × 3 / 150 KB/s) = 3 disk copies / year
• Θ = μ/λ = 6.85
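The unit conversions behind these figures can be checked in a few lines. This is a sketch that uses the slide's inputs; small differences from the rounded 0.439 / 3 / 6.85 values likely come from the exact constants (for example GB vs. GiB) and rounding used by the authors.

```python
# Back-of-the-envelope check of the PlanetLab numbers above.
# Inputs are taken from the slide; only the unit conversions are added.
nodes = 490
inter_failure_hours = 39.85    # mean time between failures, whole testbed
data_per_node_kb = 500e6       # 500 GB expressed in KB
r_l = 3
bandwidth_kb_s = 150           # per-node copy bandwidth, KB/s

lam = 365 / (nodes * inter_failure_hours / 24)              # failures per node-year
copy_days = (data_per_node_kb * r_l / bandwidth_kb_s) / 86400
mu = 365 / copy_days                                        # disk copies per year
print(f"lambda = {lam:.3f}/year, mu = {mu:.2f}/year, theta = {mu/lam:.2f}")
```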


Page 12: Effective Replica Maintenance for Distributed Storage Systems

Impact of Θ

• Θ is the theoretical upper limit on the number of replicas
• Bandwidth ↑ ⇒ μ ↑ ⇒ Θ ↑
• rL ↑ ⇒ μ ↓ ⇒ Θ ↓


(Figure: the Θ = 1 line marks the feasibility threshold.)

Page 13: Effective Replica Maintenance for Distributed Storage Systems

Choosing rL

• Guidelines
  – Large enough to ensure durability
  – At least one more than the maximum burst of simultaneous failures
  – Small enough to ensure rL ≤ Θ


Page 14: Effective Replica Maintenance for Distributed Storage Systems

rL vs. Durability

• Higher rL costs more but tolerates larger bursts of failures
• Larger data size ⇒ λ ↑ ⇒ a higher rL is needed

(Analytical results from 4 years of PlanetLab traces)


Page 15: Effective Replica Maintenance for Distributed Storage Systems

Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient failure cost
• Implementation Issues
• Conclusion


Page 16: Effective Replica Maintenance for Distributed Storage Systems

Definition: Scope

• Each node, n, designates a set of other nodes that can potentially hold copies of the objects that n is responsible for. We call the size of that set the node’s scope.

• scope ∈ [rL, N]
  – N: number of nodes in the system
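As an illustration of scope, the sketch below picks, for a key on a consistent-hashing ring, the `scope` successor nodes that are eligible to hold replicas; the first rL of them would normally hold the copies. The node names and the SHA-1 hash are assumptions made for the example, not the paper's implementation.

```python
import hashlib
from bisect import bisect_right

# Illustrative sketch: on a consistent-hashing ring, the node responsible
# for a key may place replicas on any of the next `scope` nodes clockwise
# from the key. Hash choice and names are assumptions for the example.
def node_id(name: str) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

def placement_candidates(key: str, nodes: list, scope: int) -> list:
    ring = sorted(nodes, key=node_id)
    ids = [node_id(n) for n in ring]
    start = bisect_right(ids, node_id(key)) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(scope)]

nodes = [f"node{i}" for i in range(16)]
print(placement_candidates("object-42", nodes, scope=5))
```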


Page 17: Effective Replica Maintenance for Distributed Storage Systems

Effect of Scope

• Small scope
  – Easy to keep track of objects
  – More effort to create new objects
• Big scope
  – Reduces repair time, thus increasing durability
  – Requires monitoring many nodes
  – With a large number of objects and random placement, a burst of simultaneous failures is more likely to destroy all replicas of some object


Page 18: Effective Replica Maintenance for Distributed Storage Systems

Scope vs. Repair Time

• Scope ↑ ⇒ repair work is spread over more access links and completes faster

• rL ↓ ⇒ scope must be larger to achieve the same durability


Page 19: Effective Replica Maintenance for Distributed Storage Systems

Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient failure cost
• Implementation Issues
• Conclusion


Page 20: Effective Replica Maintenance for Distributed Storage Systems

The Reasons

• Avoid creating new replicas for transient failures
  – Unnecessary costs (extra replicas)
  – Wasted resources (bandwidth, disk)

• Solutions
  – Timeouts
  – Reintegration
  – Batch
  – Erasure codes


Page 21: Effective Replica Maintenance for Distributed Storage Systems

Timeouts

• Timeout > average down time (average down time: 29 hours)
  – Reduces maintenance cost
  – Durability is still maintained

• Timeout >> average down time
  – Durability begins to fall
  – Delays the point at which the system can begin repair
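A minimal sketch of the timeout rule above, assuming the timeout is set somewhat above the 29-hour average down time; the structure and names are illustrative, not Carbonite's actual code.

```python
import time

# Illustrative timeout rule: treat a node as permanently failed (and so
# trigger repair of its replicas) only after it has been unreachable for
# longer than a timeout chosen above the average down time.
TIMEOUT_HOURS = 32   # somewhat above the 29-hour average down time

def presumed_failed(last_seen: float, now: float, timeout_hours: float = TIMEOUT_HOURS) -> bool:
    return (now - last_seen) > timeout_hours * 3600

now = time.time()
print(presumed_failed(now - 20 * 3600, now))   # False: likely a transient failure
print(presumed_failed(now - 40 * 3600, now))   # True: start creating a new replica
```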

Page 22: Effective Replica Maintenance for Distributed Storage Systems

Reintegration

• Reintegrate replicas stored on nodes after transient failures

• The system must be able to track more than rL replicas

• Depends on a: the average fraction of time that a node is available
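A hedged sketch of one maintenance round with reintegration: extra copies created during past transient failures are never discarded, a new copy is made only when fewer than rL replicas are currently reachable, and nodes that return are simply counted again. The `reachable` and `create_replica` hooks are assumptions for illustration, not Carbonite's API.

```python
# Sketch of one maintenance round with reintegration. The system tracks
# every site believed to hold a copy (possibly more than r_L) and never
# deletes extras; it repairs only when too few copies are reachable.
def maintain(key, replica_sites, r_l, reachable, create_replica):
    live = [n for n in replica_sites if reachable(n)]
    if len(live) < r_l:
        # Create one new copy; unreachable sites stay on the list so their
        # copies are reintegrated (counted again) when they come back.
        replica_sites.append(create_replica(key, exclude=set(replica_sites)))
    return replica_sites

sites = maintain("obj", ["n1", "n2", "n3"], r_l=3,
                 reachable=lambda n: n != "n2",               # n2 is down
                 create_replica=lambda k, exclude: "n4")      # pick a new site
print(sites)   # ['n1', 'n2', 'n3', 'n4'] -- n2's copy is kept for reintegration
```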


Page 23: Effective Replica Maintenance for Distributed Storage Systems

Effect of Node Availability

• Pr[a new replica needs to be created] = Pr[fewer than rL replicas are available] = Σ_{i=0}^{rL−1} C(n, i) a^i (1−a)^(n−i), with n replicas each independently available with probability a

• Chernoff bound: 2rL/a replicas are needed to keep rL copies available
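The probability above can be evaluated exactly for given n, rL and a; the sketch below does so and plugs in the 2rL/a rule of thumb from this slide (the value a = 0.7 is an illustrative assumption).

```python
from math import comb

# Pr[fewer than r_L of n replicas are available], with each replica
# independently available with probability a. The 2*r_L/a rule of thumb
# from the slide keeps this probability small.
def p_repair_needed(n: int, r_l: int, a: float) -> float:
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(r_l))

r_l, a = 3, 0.7                 # a = 0.7 is illustrative
n = round(2 * r_l / a)          # rule of thumb: about 9 replicas
print(n, p_repair_needed(n, r_l, a))   # ~0.004 with these numbers
```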


Page 24: Effective Replica Maintenance for Distributed Storage Systems

Node Availability vs. Reintegration

• Reintegration can work safely with 2rL/a replicas
• 2/a is the penalty for not distinguishing transient from permanent failures
• rL = 3


Page 25: Effective Replica Maintenance for Distributed Storage Systems

Four Replication Algorithms

• Cates
  – Fixed number of replicas (rL)
  – Timeout

• Total Recall
  – Batch

• Carbonite
  – Timeout + reintegration

• Oracle
  – Hypothetical system that can differentiate transient failures from permanent failures


Page 26: Effective Replica Maintenance for Distributed Storage Systems

Effect of Reintegration


Page 27: Effective Replica Maintenance for Distributed Storage Systems

Batch

• In addition to rL replicas, make e additional copies
  – Makes repair less frequent
  – Uses more resources
• rL = 3


Page 28: Effective Replica Maintenance for Distributed Storage Systems

Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient failure cost
• Implementation Issues
• Conclusion


Page 29: Effective Replica Maintenance for Distributed Storage Systems

DHT vs. Directory-based Storage Systems

• DHT-based: objects are assigned to nodes by consistent hashing over an identifier space
• Directory-based: uses indirection to maintain data, with a DHT storing only location pointers


Page 30: Effective Replica Maintenance for Distributed Storage Systems

Node Monitoring for Failure Detection

• Carbonite requires that each node know the number of available replicas of each object for which it is responsible

• The goal of monitoring is to allow the nodes to track the number of available replicas


Page 31: Effective Replica Maintenance for Distributed Storage Systems

Monitoring consistent hashing systems

• Each node maintains, for each object, a list of the nodes in its scope that do not have a copy of the object

• When synchronizing, a node n provides key k to a node n′ that is missing the object with key k, avoiding reports of what n already knows
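A minimal sketch of that bookkeeping, with illustrative data structures: the responsible node records which scope members lack each key, and those are the only keys it needs to offer each node in the next synchronization round.

```python
from collections import defaultdict

# For each key this node is responsible for, record which nodes in its
# scope are missing a copy; those are the only keys worth offering to
# each node during synchronization. Structures are illustrative.
def keys_to_offer(my_keys: set, scope_holdings: dict) -> dict:
    missing = defaultdict(set)
    for node, held in scope_holdings.items():
        for k in my_keys - held:
            missing[node].add(k)       # `node` lacks a replica of key k
    return dict(missing)

scope_holdings = {"n1": {"a", "b"}, "n2": {"b"}, "n3": set()}
print(keys_to_offer({"a", "b", "c"}, scope_holdings))
```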


Page 32: Effective Replica Maintenance for Distributed Storage Systems

Monitoring host availability

• The DHT's routing tables form a spanning tree rooted at each node, with O(log N) out-degree

• Each node periodically multicasts a heartbeat message to its children in the tree

• When a heartbeat is missed, the monitoring node triggers repair actions
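A hedged sketch of the heartbeat check, assuming each monitored child is expected to send a heartbeat every `period` seconds and that missing a few in a row triggers repair; the threshold and the `trigger_repair` hook are illustrative assumptions.

```python
import time

# Check the children this node monitors in the heartbeat tree: if a
# child has missed several consecutive heartbeats, presume it failed and
# trigger repair of the objects it held. Threshold is an assumption.
def check_children(last_heartbeat, period, missed_allowed, trigger_repair, now=None):
    now = time.time() if now is None else now
    for child, last in last_heartbeat.items():
        if now - last > missed_allowed * period:
            trigger_repair(child)

check_children({"n1": time.time(), "n2": time.time() - 600},
               period=60, missed_allowed=3,
               trigger_repair=lambda n: print("repair objects held by", n))
```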


Page 33: Effective Replica Maintenance for Distributed Storage Systems

Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient failure cost
• Implementation Issues
• Conclusion


Page 34: Effective Replica Maintenance for Distributed Storage Systems

Conclusion

• Many design choices remain to be made
  – Number of replicas (depends on the failure distribution, bandwidth, etc.)
  – Scope size
  – Response to transient failures
    • Reintegration (number of extra copies)
    • Timeouts (timeout period)
    • Batch (number of extra copies)
