
Fabián E. Bustamante, Fall 2005

Efficient Replica Maintenance for Distributed Storage Systems

B.-G. Chun, F. Dabek, A. Haeberlen, E. Sit, H. Weatherspoon, M. F. Kaashoek, J. Kubiatowicz, and R. Morris. In Proc. of NSDI, May 2006.

Presenter: Fabián E. Bustamante
EECS 443 Advanced Operating Systems, Northwestern University

Replication in Wide-Area Storage

Applications put & get objects into/from the wide-area storage system

Objects are replicated for:
– Availability: a get on an object will return promptly
– Durability: objects put by the app are not lost due to disk failures
– Note: an object may be durably stored but not immediately available (see the sketch below)
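A minimal toy sketch of the distinction (my illustration, not the paper's): durability asks whether any copy still exists on a healthy disk, while availability asks whether any copy is reachable right now.

```python
# Toy model, illustrative only: a replica is lost for good only if its
# node's disk fails; a node that is merely offline still holds its copy.

class Node:
    def __init__(self):
        self.disk_failed = False  # permanent failure: data destroyed
        self.offline = False      # transient failure: data intact, unreachable

def durable(replicas):
    # The object survives as long as one copy sits on a healthy disk,
    # even if that node is currently offline.
    return any(not n.disk_failed for n in replicas)

def available(replicas):
    # A get() returns promptly only if some copy is reachable now.
    return any(not n.disk_failed and not n.offline for n in replicas)

a, b = Node(), Node()
a.disk_failed = True   # one replica destroyed
b.offline = True       # the other temporarily unreachable
print(durable([a, b]), available([a, b]))  # True False: durable, not available
```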

Goal: durability at low bandwidth cost

Durability is a more practical & useful goal than availability

Threat to durability: losing the last copy of an object
– So, create copies faster than they are destroyed

Challenges:
– Replication can eat your bandwidth
– Hard to distinguish between transient & permanent failures
– After recovery, some replicas may sit on nodes the lookup algorithm does not check

The paper presents Carbonite, an efficient wide-area replication technique for durability
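To make "create copies faster than they are destroyed" concrete, here is a back-of-the-envelope sketch (my own numbers, not the paper's) of the average per-node repair bandwidth needed just to keep up with disk failures:

```python
# Back-of-the-envelope: when a disk fails, everything on it must be
# re-replicated, so on average each node must move roughly
# (data per node) / (mean disk lifetime) bytes per second.
# Both figures below are assumptions for illustration.

data_per_node = 500e9                   # bytes stored per node
mean_disk_lifetime = 365 * 24 * 3600.0  # ~1 year, as in the synthetic traces

repair_bw = data_per_node / mean_disk_lifetime
print(f"avg repair bandwidth per node: {repair_bw / 1e3:.1f} kB/s")  # ~15.9 kB/s

# If a node's access link cannot sustain this creation rate, the system
# is infeasible: it falls behind and eventually must discard objects.
```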

System Environment

Use PlanetLab (PL) as representative:
– >600 nodes distributed world-wide
– History traces collected by the CoMon project (every 5 min)
– Disk failures from event logs of PlanetLab Central

Synthetic traces:
– 632 nodes, as in PL
– Failure inter-arrival times drawn from an exponential dist. (mean session time and downtime as in PL)
– Two years instead of one, with avg node lifetime of 1 year

Simulation:
– Trace-driven, event-based simulator
– Assumptions:
• Network paths are independent
• All nodes reachable from all other nodes
• Each node has the same link capacity

PlanetLab trace summary:
Dates: 3/1/05 – 2/28/06
Hosts: 632
Transient failures: 21355
Disk failures: 219
Transient host downtime (s), median/avg/90th: 1208 / 104647 / 14242
Any-failure interarrival (s), median/avg/90th: 305 / 1467 / 3306
Disk failure interarrival (s), median/avg/90th: 54411 / 143476 / 490047
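A minimal sketch of how such a synthetic trace can be generated (my reconstruction of the stated method; the mean is taken from the table above, the horizon from the two-year setup):

```python
import random

# Draw failure times with exponentially distributed inter-arrivals,
# as in the paper's synthetic traces.

MEAN_INTERARRIVAL = 1467.0     # seconds; "any failure" average from the table
HORIZON = 2 * 365 * 24 * 3600  # two years of simulated time

def synthetic_failure_times(mean, horizon, seed=0):
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(1.0 / mean)  # exponential inter-arrival gap
        if t > horizon:
            return events
        events.append(t)

print(len(synthetic_failure_times(MEAN_INTERARRIVAL, HORIZON)), "failures")
```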

Understanding durability

To handle some avg. rate of failure:
– Create new replicas faster than they are destroyed
– Feasible creation rate is a function of the per-node access link, the number of nodes, and the amount of data stored per node

Infeasible system: unable to keep pace w/ the avg. failure rate; will eventually adapt by discarding objects (which ones?)

If the creation rate is just above the failure rate, a failure burst may still be a problem

Target number of replicas to maintain: rL

Durability does not increase continuously with rL; returns diminish at higher values (see the sketch below)
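The flattening is easy to see in a small Monte-Carlo sketch (rates and horizon are illustrative choices of mine, not the paper's model parameters): replicas fail at some rate, repair restores them up to rL, and survival probability saturates as rL grows.

```python
import random

# Birth-death sketch of one object's replica count: each of n replicas
# fails at rate LAM; repair creates copies at rate MU per live replica,
# but only while n < r_L. All rates and the horizon are illustrative.

LAM, MU, HORIZON, TRIALS = 1.0, 2.0, 10.0, 2000

def survives(r_L, rng):
    n, t = r_L, 0.0
    while t < HORIZON:
        death = n * LAM
        birth = n * MU if n < r_L else 0.0
        t += rng.expovariate(death + birth)       # time to next event
        if rng.random() < death / (death + birth):
            n -= 1
            if n == 0:
                return False                      # lost the last copy
        else:
            n += 1
    return True

rng = random.Random(42)
for r_L in (1, 2, 3, 4, 6, 8):
    alive = sum(survives(r_L, rng) for _ in range(TRIALS))
    print(f"r_L={r_L}: {alive / TRIALS:.3f}")     # gains flatten as r_L grows
```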

Improving repair time

Scope – set of other nodes that can hold copies of the objects a node is responsible for

Small scope:
– Easier to keep track of copies
– Effort of creating copies falls on a small set of nodes
– Addition of nodes may result in needless copying of objects (when combined w/ consistent hashing)

Large scope:
– Spreads work among more nodes (repair-window effect sketched below)
– Network traffic sources/destinations are spread
– Temporary failures will be noticed by more nodes
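One rough way to see the large-scope benefit (my illustration, not a result from the paper): the data lost in a disk failure is re-created in parallel by the nodes in scope, so the repair window shrinks roughly as 1/scope, leaving less time for the remaining replicas to fail too.

```python
# Rough model: a failed disk held D bytes; with scope s, s source/
# destination pairs re-create that data in parallel at per-node
# bandwidth b, so repair takes about D / (s * b). D and b are assumed.

D = 500e9   # bytes per failed disk
b = 1e6     # per-node bandwidth, bytes/s

for scope in (4, 16, 64, 256):
    hours = D / (scope * b) / 3600
    print(f"scope={scope:4d}: repair window ~{hours:6.1f} h")
```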

Reducing transient costs

Impossible to distinguish transient/permanent failures

To minimize network traffic due to transient failures: reintegrate replicas

Carbonite (see the sketch below):
– Select a suitable value for rL
– Respond to each detected failure by creating a new replica
– Reintegrate replicas when their nodes return
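A minimal sketch of that maintenance rule (structure and names are mine; the paper's actual implementation differs): count the replicas currently reachable, create new copies only on a shortfall, and let copies on returning nodes count again, which is all reintegration requires.

```python
def maintain(object_id, replica_sites, r_L, reachable, create_replica):
    """One maintenance round for one object, Carbonite-style sketch.

    replica_sites: nodes believed to hold a copy; nodes that failed
    transiently stay in the list so their copies reintegrate on return.
    reachable(node) -> bool and create_replica(object_id) -> node are
    assumed to be provided by the surrounding system.
    """
    # A copy on a node that was down and came back is counted again
    # here, so extra replicas made during its absence are not a
    # permanent cost: they simply push the count back above r_L.
    live = sum(1 for n in replica_sites if reachable(n))
    while live < r_L:           # respond only to an apparent shortfall
        replica_sites.append(create_replica(object_id))
        live += 1
```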

[Figure: bytes sent by different maintenance algorithms]

Reducing transient costs (cont.)

[Figure: bytes sent w/ and w/o reintegration]

[Figure: impact of timeouts on bandwidth and durability]

Assumptions

The PlanetLab testbed can be seen as representative of something

Immutable data

Relatively stable system membership & data loss driven by disk failures

Disk failures are uncorrelated

Simulation:
– Network paths are independent
– All nodes are reachable from all other nodes
– Each node has the same link capacity