Fabián E. Bustamante, Fall 2005
Efficient Replica Maintenance for Distributed Storage Systems
B-G Chun, F. Dabek, A. Haeberlen, E. Sit, H. Weatherspoon, M. Kaashoek, J. Kubiatowicz, and R. Morris, In Proc. of NSDI, May 2006.
Presenter: Fabián E. Bustamante
EECS 443 Advanced Operating Systems, Northwestern University
Replication in Wide-Area Storage
Applications put & get objects in/from the wide-area storage system

Objects are replicated for
– Availability
  • A get on an object will return promptly
– Durability
  • Objects put by the application are not lost due to disk failures

An object may be durably stored but not immediately available, as the sketch below illustrates
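Below is a minimal sketch of this model; the class and method names are mine, not the system's API, and the placement logic is deliberately naive.

```python
# Minimal sketch of the put/get model and the availability-vs-durability
# distinction; names are illustrative, not the paper's API.
import hashlib

class Node:
    def __init__(self):
        self.online = True
        self.disk = {}                        # contents survive transient failures

class ReplicatedStore:
    def __init__(self, nodes, r_target):
        self.nodes = nodes
        self.r_target = r_target              # target replica count (rL)
        self.replicas = {}                    # key -> nodes holding a copy

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        targets = self.nodes[:self.r_target]  # naive placement for illustration
        for n in targets:
            n.disk[key] = data                # durable once written to a disk
        self.replicas[key] = list(targets)
        return key

    def get(self, key: str) -> bytes:
        for n in self.replicas.get(key, []):
            if n.online:                      # available only if a replica is up
                return n.disk[key]
        raise LookupError("object is durable but not currently available")

store = ReplicatedStore([Node() for _ in range(5)], r_target=3)
key = store.put(b"object bytes")
for n in store.replicas[key]:
    n.online = False                          # transient failures: disks intact
# store.get(key) now raises, although no data has been lost
```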
Goal: durability at low bandwidth cost
Durability is a more practical & useful goal than availability
Threat to durability
– Losing the last copy of an object
– So, create copies faster than they are destroyed
Challenges
– Replication can eat your bandwidth
– Hard to distinguish between transient & permanent failures
– After recovery, some replicas may be on nodes the lookup algorithm does not check
Paper presents Carbonite – an efficient wide-area replication technique for durability
System Environment
Use PlanetLab (PL) as representative
– >600 nodes distributed world-wide
– Historical traces collected by the CoMon project (every 5 minutes)
– Disk failures taken from the event logs of PlanetLab Central
Synthetic traces
– 632 nodes, as in PL
– Failure inter-arrival times drawn from an exponential distribution (mean session time and downtime as in PL) – see the sketch below
– Two years instead of one, and an average node lifetime of 1 year
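A sketch of how such a synthetic trace might be generated; the two means are the averages from the trace table on this slide, everything else is illustrative.

```python
# Sketch of synthetic-trace generation with exponentially distributed
# failure inter-arrival times and downtimes; the means are the averages
# from the trace table below, the rest is illustrative.
import random

TRACE_SECONDS = 2 * 365 * 24 * 3600      # two simulated years
MEAN_INTERARRIVAL_S = 1467               # avg time between any two failures
MEAN_DOWNTIME_S = 104647                 # avg transient host downtime

def synthetic_failure_trace():
    t, events = 0.0, []
    while t < TRACE_SECONDS:
        t += random.expovariate(1.0 / MEAN_INTERARRIVAL_S)
        downtime = random.expovariate(1.0 / MEAN_DOWNTIME_S)
        events.append((t, t + downtime))  # (failure time, recovery time)
    return events
```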
Simulation
– Trace-driven, event-based simulator
– Assumptions
  • Network paths are independent
  • All nodes are reachable from all other nodes
  • Each node has the same link capacity
Trace characteristics (triples are median, average, 90th percentile):

Dates                          3/1/05 – 2/28/06
Hosts                          632
Transient failures             21355
Disk failures                  219
Transient host downtime (s)    1208, 104647, 14242
Any failure interarrival (s)   305, 1467, 3306
Disk failure interarrival (s)  54411, 143476, 490047
Understanding durability
To handle some average rate of failure
– Create new replicas faster than failures destroy them
– Whether this is feasible is a function of per-node access link capacity, the number of nodes, and the amount of data stored per node (see the sketch below)
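A back-of-envelope sketch of that relationship, under my own simplifying assumptions (balanced storage, repair traffic as the only link user); this is my formulation of the slide's point, not an equation from the paper.

```python
# Back-of-envelope feasibility check: copies must be created at least as
# fast as disk failures destroy them, and creation is capped by access links.
def max_maintainable_bytes(link_bytes_per_s, mean_disk_lifetime_s, r_target):
    """Rough ceiling on unique data per node that repair can keep up with.

    With r_target replicas of everything, each node's disk holds about
    r_target * D bytes for D bytes of unique data per node; a disk failure
    destroys all of it, and re-creating those bytes once every mean disk
    lifetime must fit within each node's access link.
    """
    return link_bytes_per_s * mean_disk_lifetime_s / r_target

# E.g. a 1.5 Mbps access link and disks failing on average once a year:
ceiling = max_maintainable_bytes(1.5e6 / 8, 365 * 24 * 3600, r_target=3)
print(f"~{ceiling / 1e9:.0f} GB of unique data per node")
```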
An infeasible system – one unable to keep pace with the average failure rate – will eventually adapt by discarding objects (which ones?)
If the creation rate is only just above the average failure rate, a burst of failures can still destroy every replica of some objects
rL – the target number of replicas to maintain per object
Durability does not increase continuously with rL (see the toy model below)
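A toy Monte-Carlo model of this (mine, not the paper's analysis): replicas fail independently, and repair creates one new copy at a time whenever the count drops below the target.

```python
# Toy birth-death simulation of one object's replica count: each replica
# fails at rate LAM, and repair runs at rate MU while below r_target.
import random

LAM = 1.0        # per-replica failure rate (arbitrary units)
MU = 5.0         # replica creation rate while below target
HORIZON = 50.0   # simulated time per trial

def survival_probability(r_target, trials=2000):
    survived = 0
    for _ in range(trials):
        n, t = r_target, 0.0
        while 0 < n and t < HORIZON:
            fail_rate = n * LAM
            repair_rate = MU if n < r_target else 0.0
            total = fail_rate + repair_rate
            t += random.expovariate(total)        # time to next event
            if random.random() < fail_rate / total:
                n -= 1                            # a replica is destroyed
            else:
                n += 1                            # a repair completes
        survived += n > 0
    return survived / trials

for r in (1, 2, 3, 4, 6, 8):
    print(r, survival_probability(r))  # gains taper off as r_target grows
```

In this toy model, repair capacity stays fixed while the aggregate failure rate grows with the replica count, so each additional replica buys progressively less durability.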
Improving repair time
Scope – the set of other nodes that can hold copies of the objects a given node is responsible for
Small scope
– Easier to keep track of copies
– The effort of creating copies falls on a small set of nodes
– Adding nodes may result in needless copying of objects (when combined with consistent hashing)
Large scope
– Spreads repair work among more nodes
– Sources and destinations of network traffic are spread out
– Temporary failures will be noticed by more nodes
One way to realize a scope is sketched below.
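The paper treats scope size simply as a parameter; this concrete construction as a successor set on a consistent-hashing ring is an assumption of mine.

```python
# Sketch of scope as a successor set on a consistent-hashing ring.
import hashlib

def node_id(name: str) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

def scope(owner: str, nodes: list[str], scope_size: int) -> list[str]:
    """The scope_size nodes that follow `owner` on the ring: the nodes
    eligible to hold replicas of the objects `owner` is responsible for."""
    ring = sorted(nodes, key=node_id)
    i = ring.index(owner)
    return [ring[(i + k) % len(ring)] for k in range(1, scope_size + 1)]

nodes = [f"node{i}" for i in range(20)]
print(scope("node0", nodes, 4))    # small scope: few, well-known peers
print(scope("node0", nodes, 19))   # large scope: repair work spread widely
```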
Reducing transient costs
It is impossible to reliably distinguish transient from permanent failures
To minimize network traffic due to transient failures: reintegrate replicas
Carbonite (sketched below)
– Select a suitable value for rL
– Respond to each detected failure by creating a new replica
– Reintegrate replicas after transient failures
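A sketch of that maintenance loop as I read the slide, reusing the toy Node class from the earlier sketch; all names are mine, not the paper's.

```python
# Sketch of the Carbonite maintenance loop: keep at least r_target
# reachable replicas, only create (never delete) on failure, and let
# copies on returning nodes count again (reintegration).
import random

def maintain(key, r_target, scope_nodes, holders):
    """holders: every node ever given a replica of `key`, reachable or not."""
    reachable = [n for n in holders if n.online]
    if len(reachable) < r_target:
        # Repair: copy onto an online scope node that has no replica yet.
        candidates = [n for n in scope_nodes if n not in holders and n.online]
        if candidates and reachable:
            dst = random.choice(candidates)
            dst.disk[key] = reachable[0].disk[key]  # fetch from a live copy
            holders.append(dst)
    # No deletions: when a node returns, its copy simply counts again,
    # so transient failures do not trigger repeated re-replication work.
```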
[Figure: Bytes sent by different maintenance algorithms]
Reducing transient costs (cont.)
[Figure: Bytes sent with and without reintegration]
[Figure: Impact of timeouts on bandwidth and durability]
Assumptions
The PlanetLab testbed can be seen as representative of something
Immutable data
Relatively stable system membership & data loss driven by disk failures
Disk failures are uncorrelated
Simulation
– Network paths are independent
– All nodes are reachable from all other nodes
– Each node has the same link capacity