GLACIERS HIGHLY DURABLE, DECENTRALIZED STORAGE DESPITE MASSIVE CORRELATED FAILURE PRESENTED BY ANILA JAGANNATHAM

GLACIER: HIGHLY DURABLE, DECENTRALIZED STORAGE DESPITE MASSIVE CORRELATED FAILURES

PRESENTED BY ANILA JAGANNATHAM

INTRODUCTION

GLACIER IS A DISTRIBUTED STORAGE SYSTEM THAT RELIES ON MASSIVE REDUNDANCY TO MASK THE EFFECT OF LARGE SCALE CORRELATED FAILURES.

AIM: TO PROVIDE HIGHLY DURABLE STORAGE DESPITE CORRELATED BYZANTINE FAILURES OF A MAJORITY OF THE PARTICIPATING NODES.

EX: INTERNET WORM ATTACKS.

WHY IS IT DIFFERENT ?

OceanStore and Phoenix use introspection, which assumes an accurate failure model.

Problem: observation does not reveal low-incidence failures, and humans cannot predict all sources of correlated failures.

Glacier differs from OceanStore and Phoenix in that it makes no assumption about the nature of the failure. It uses abundant but unreliable storage space on the nodes to provide durable storage for critical data.

REQUIREMENTS

• Nodes form an overlay network.
• Directory service: maps a key to the address of a live node that is currently responsible for the key.
• Keys form a circular space.
• Each node is responsible for a uniformly sized segment of the key space.
• Node identifiers are assigned pseudo-randomly to prevent Sybil attacks.
• Glacier has to reliably identify, authenticate and communicate with the node that is currently responsible for a given key.
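
A minimal sketch of how a key could be mapped to its responsible node in a circular key space. The 160-bit id space and the "numerically closest node" rule are assumptions made for illustration; the slides do not specify how the overlay assigns segments.

```python
ID_SPACE = 2 ** 160  # assumed size of the circular key space


def ring_distance(a: int, b: int) -> int:
    """Shortest distance between two ids on the ring (wraps around)."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)


def responsible_node(key: int, live_node_ids: list[int]) -> int:
    """Return the live node id closest to `key` on the ring (assumed rule)."""
    if not live_node_ids:
        raise ValueError("no live nodes in the overlay")
    # Ties are broken by the smaller node id so the result is deterministic.
    return min(live_node_ids, key=lambda n: (ring_distance(n, key), n))
```

When the closest node goes offline, the same lookup returns the next-closest live node, which is the sense in which a node is "currently" responsible for a key.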

ARCHITECTURE

Glacier operates alongside a primary store.

• Primary store: provides read/write access and short-term availability by masking individual failures.
• Glacier: acts as archival storage.
• Aggregation layer: aggregates small objects prior to insertion into Glacier.

INTERFACE TO APPLICATION

Lease: used to control the lifetime of stored objects.

When the lease expires, the object is removed from storage.

The lease period is chosen to exceed the assumed maximal duration of a large-scale failure (several weeks or months).

Applications interact with Glacier using the following methods:

• put(i, v, o, l): store an object o under identifier i, version v and lease period l
• get(i, v): retrieve a stored object
• refresh(i, v, l): extend the lease of an existing object
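
The lease semantics of the three methods can be sketched with a small in-memory store. This is only an illustration: the class name `LeaseStore`, the `expire` helper and the explicit `now` argument (used instead of a real clock so the behavior is deterministic) are assumptions, not part of Glacier.

```python
class LeaseStore:
    """Toy single-node model of put/get/refresh with leases."""

    def __init__(self):
        self._objects = {}  # (i, v) -> (object bytes, expiry time)

    def put(self, i, v, o, l, *, now):
        """Store object o under identifier i, version v, with lease period l."""
        self._objects[(i, v)] = (o, now + l)

    def get(self, i, v, *, now):
        """Return the object, or None if it is unknown or its lease expired."""
        entry = self._objects.get((i, v))
        if entry is None or now >= entry[1]:
            return None
        return entry[0]

    def refresh(self, i, v, l, *, now):
        """Extend the lease of an existing, unexpired object."""
        entry = self._objects.get((i, v))
        if entry is None or now >= entry[1]:
            return False
        self._objects[(i, v)] = (entry[0], max(entry[1], now + l))
        return True

    def expire(self, *, now):
        """Reclaim storage of objects whose lease has run out."""
        expired = [k for k, (_, expiry) in self._objects.items() if now >= expiry]
        for k in expired:
            del self._objects[k]
        return len(expired)
```

Note that there is no delete method: an object can only disappear by letting its lease run out.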

FRAGMENTS AND MANIFESTS

Glacier uses erasure codes to reduce storage overhead.

An object O of size |O| is recoded into n fragments F1, F2, ..., Fn of size |O|/r, any r of which contain sufficient information to restore the entire object.

The object is stored under key k. Each fragment is identified as (k, i, v), where i is the fragment index and v is the version.

Authenticator A_O = (H(O), H(F1), H(F2), ..., H(Fn), v, l), where H(f) denotes a secure hash (e.g., SHA-1), v is the version and l is the lease. It is used to detect and remove corrupted fragments during recovery.

Manifest M_O = authenticator + cryptographic signature, used to authenticate the object and each of its fragments.
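
A minimal sketch of building the authenticator and using it to reject a corrupted fragment. The fragments here are arbitrary byte strings standing in for erasure-coded fragments, the helper names are hypothetical, and the cryptographic signature that turns an authenticator into a manifest is left out.

```python
import hashlib
from typing import NamedTuple


class Authenticator(NamedTuple):
    object_hash: bytes        # H(O)
    fragment_hashes: tuple    # (H(F1), ..., H(Fn))
    version: int              # v
    lease: int                # l


def H(data: bytes) -> bytes:
    """Secure hash; SHA-1 is used because the slide names it as an example."""
    return hashlib.sha1(data).digest()


def make_authenticator(obj: bytes, fragments: list, version: int, lease: int) -> Authenticator:
    return Authenticator(H(obj), tuple(H(f) for f in fragments), version, lease)


def valid_fragment(auth: Authenticator, index: int, fragment: bytes) -> bool:
    """Check fragment F_index (1-based, as in F1..Fn) against the authenticator."""
    if not 1 <= index <= len(auth.fragment_hashes):
        return False
    return H(fragment) == auth.fragment_hashes[index - 1]
```

During recovery, a node would discard any fragment for which `valid_fragment` returns False before attempting to decode the object.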

FRAGMENT PLACEMENT

Glacier uses a Placement function ‘P’ to determine the node which stores a particular fragment (k , i , v).

Requirements for the placement function:

• Fragments of the same object should be placed on different, pseudo-randomly chosen nodes.
• It should be possible to locate a fragment after a failure with only the object key.
• Fragments with similar keys should be grouped together to allow aggregation.
• The placement function should be stable, i.e., the node responsible for a fragment should change rarely.

Glacier uses: P(k, i, v) = k + i/(n+1) + H(v), which maps the primary replica at position k and its n fragments to n+1 equidistant points in the circular id space.
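
The placement function can be sketched over an integer id space. The 160-bit ring, the use of SHA-1 for H(v) and the integer step size (standing in for the fraction 1/(n+1) of the ring) are assumptions for illustration.

```python
import hashlib

ID_BITS = 160              # assumed width of the circular id space
ID_SPACE = 1 << ID_BITS


def version_hash(v: int) -> int:
    """H(v): hash of the version number, interpreted as a point on the ring."""
    return int.from_bytes(hashlib.sha1(str(v).encode()).digest(), "big")


def placement(k: int, i: int, v: int, n: int) -> int:
    """P(k, i, v) = k + i/(n+1) + H(v), computed modulo the ring size."""
    step = ID_SPACE // (n + 1)  # 1/(n+1) of the ring, rounded down
    return (k + i * step + version_hash(v)) % ID_SPACE
```

Because the offset H(v) is the same for every index i, the n+1 positions of one object version stay equidistant, while different versions of the same key land on different nodes.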

FRAGMENT PLACEMENT

To insert a new object, Glacier sends a probe message to each location P(k, i, v), where i = 1..n.

If the owner of P(k, i, v) is currently online, it responds to the message and Glacier sends the fragment directly to that node. Otherwise the fragment is discarded and restored later by the maintenance mechanism.

If fewer than r nodes are online, temporary fragment holders are used.

FRAGMENT MAINTENANCE

A maintenance mechanism is needed because nodes may miss fragment insertions due to short-term churn.

Maintenance uses the fact that fragments with similar keys are assigned to a similar set of nodes.

Each fragment holder has N-1 peers that store fragments of exactly the same objects as itself.

Protocol:

• A node compiles a list of all keys (k, v) in its local store and sends it to some of its peers.
• Each peer checks the list against its own store and replies with a list of manifests, one for each object missing from the list.
• For each such object, the node requests r fragments from its peers, validates each fragment against the manifest and computes the fragment that has to be stored locally.
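
The key-list comparison in the second protocol step can be sketched as a set difference. The function name and the dictionary layout of the peer's store are hypothetical, and manifests are represented by plain strings.

```python
def missing_manifests(requester_keys, peer_store):
    """Peer side of the exchange.

    requester_keys: iterable of (k, v) pairs the requesting node already holds.
    peer_store:     dict mapping (k, v) -> manifest held by this peer.
    Returns the manifests of all objects the requester is missing.
    """
    known = set(requester_keys)
    return {key: manifest for key, manifest in peer_store.items() if key not in known}
```

For example, if the requester lists only ("a", 1) and the peer also holds ("b", 1), the reply contains the manifest for ("b", 1); the requester would then fetch enough fragments to rebuild its own fragment of that object.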

RECOVERY

After a failure, the maintenance mechanism has to restore full redundancy.

If a compromised node fails permanently, other nodes take over its key segments.

If a compromised node recovers and rejoins the system, its fragments have to be restored.

To prevent congestive collapse during recovery, Glacier limits the number of simultaneous fragment reconstructions to Rmax.

CONFIGURATION

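Durability: if a failure affects a fraction f <= fmax of the storage nodes, each object survives with probability P >= Pmin.

An object O can be reconstructed if at least r of its n fragments survive. Treating the survival of each fragment as an independent Bernoulli trial with success probability 1 - fmax, the durability D is the probability that at least r of the n trials have a positive outcome: D = sum over k = r..n of C(n, k) (1 - fmax)^k fmax^(n - k).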

The parameters N and r have to be chosen such that P meets the desired level of durability.

The probability that a collection of n objects survives the failure unscathed is P_D(n) = D^n (here n counts objects, not fragments).

If the value of fmax is accidentally chosen too low, Glacier still offers protection: the survival probability degrades gracefully as the magnitude of the actual failure increases.

Ex: with fmax = 0.6 and Pmin = 0.999999, P = 0.9997 when f = 0.7 and P = 0.975 when f = 0.8.
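
The durability figures can be checked with a short calculation, assuming that each fragment survives independently with probability 1 - f (the binomial model behind the Bernoulli trials). The function names are illustrative; N = 48 and r = 5 are the values used later in the ePOST deployment.

```python
from math import comb


def durability(n: int, r: int, f: float) -> float:
    """Probability that at least r of n fragments survive a failure
    that independently destroys each fragment with probability f."""
    p = 1.0 - f
    return sum(comb(n, k) * p**k * f**(n - k) for k in range(r, n + 1))


def collection_survival(d: float, count: int) -> float:
    """Probability that `count` objects, each with durability d, all survive."""
    return d ** count


for f in (0.6, 0.7, 0.8):
    print(f, durability(48, 5, f))
```

With these parameters the model gives roughly 0.999999 at f = 0.6, 0.9997 at f = 0.7 and 0.975 at f = 0.8, in line with the example above.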

OBJECT AGGREGATION

A user accesses the system through one node at a time, called the user's proxy, which is the only node trusted by the user.

When the user inserts objects into Glacier, they are inserted immediately into the primary store and buffered at the user's proxy node.

After enough objects have been gathered or enough time has passed, the buffered objects are inserted as a single object into Glacier under an aggregate key.

If objects have to be stored in Glacier immediately, the flush method is used.

The proxy maintains a local aggregate directory, which maps each object key to the aggregate that contains the object.

To ensure recovery, the owner's aggregates form a linked list. The head of the list is stored in an application-specific object with a well-known key.

OBJECT AGGREGATION

An aggregate contains references to multiple other aggregates, to prevent disconnection if aggregates expire in an order other than their insertion order.

The aggregates therefore form a DAG.

The indegree of every aggregate is kept above dmin.

An aggregate consists of tuples (oi, ki, vi): an object, its key and its version.

RECOVERY

After a failure, information that is not in Glacier is lost and has to be restored: the contents of the primary store and the aggregate directories.

Aggregate directories can be recovered by walking through the DAG.

First, the key of the most recently inserted aggregate is retrieved using a well-known key in Glacier.

Then the aggregates are retrieved in sequence, and the objects they contain are added to the aggregate directory.

The primary store can be populated lazily, on demand by applications, or eagerly while walking the aggregate DAG.
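
The directory rebuild can be sketched as a breadth-first walk over the aggregate DAG. The function name, the `fetch_aggregate` callback and the shape of its return value are assumptions made for illustration.

```python
from collections import deque


def rebuild_directory(head_key, fetch_aggregate):
    """Rebuild the aggregate directory by walking the DAG from its head.

    fetch_aggregate(key) returns (objects, references), or None if the
    aggregate has expired or cannot be retrieved.
      objects:    list of (object_key, version) pairs stored in the aggregate
      references: list of keys of other aggregates
    Returns a dict mapping (object_key, version) -> aggregate key.
    """
    directory = {}
    seen = {head_key}
    queue = deque([head_key])
    while queue:
        agg_key = queue.popleft()
        aggregate = fetch_aggregate(agg_key)
        if aggregate is None:
            continue  # expired aggregate: rely on the other references
        objects, references = aggregate
        for obj_key, version in objects:
            directory.setdefault((obj_key, version), agg_key)
        for ref in references:
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return directory
```

If aggregate A3 references both A2 and A1 and A2 has already expired, the walk still reaches A1 through the second reference, which is the reason each aggregate points to more than one other aggregate.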

CONSOLIDATION

Glacier periodically checks the aggregate directory for aggregates whose leases will expire soon and decides whether to renew their leases.

If the aggregate is small, or the majority of its object leases have expired, the lease is not renewed.

Instead, the non-expired objects are consolidated with new objects, either from local buffers or from other aggregates, and a new aggregate is created.

Consolidation is used to maintain a low storage overhead, and it is particularly effective when leases are bimodal.
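
The renewal decision can be sketched as a simple predicate. The size threshold `min_size` and the function name are hypothetical; the slides only say "small" and "majority".

```python
def should_renew(aggregate_size: int, live_objects: int,
                 total_objects: int, min_size: int) -> bool:
    """Decide whether an aggregate's lease is renewed as-is.

    Returns False when the aggregate is small or when the majority of its
    object leases have expired; in that case the live objects would be
    consolidated into a new aggregate instead.
    """
    if aggregate_size < min_size:
        return False
    expired = total_objects - live_objects
    if expired * 2 > total_objects:
        return False
    return True
```

With bimodal leases, aggregates tend to end up either mostly expired or mostly live, so this rule rarely keeps a large aggregate alive for only a few remaining objects.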

SECURITY

• ATTACKS ON INTEGRITY: A malicious attacker can overwrite the fragments on nodes under his control. Fragment holders use the authenticator to validate fragments and replace corrupted ones.

• ATTACKS ON DURABILITY: If an attacker can successfully delete all replicas and more than n-r fragments of an object, the object is lost. This is unlikely because of the pseudo-random selection of nodes.

• ATTACKS ON TIME SOURCE: These are avoided because the timestamps on the storage nodes are maintained as relative values.

• SPACE-FILLING ATTACKS: An attacker can consume all the available storage space. This does not affect existing data, and the storage can be reclaimed gradually as data expires. Incentive mechanisms can be added to prevent this.

• ATTACKS ON GLACIER: Unlikely to succeed, as the code for deleting fragments, handoff and expiration is very simple.

• HAYSTACK-NEEDLE ATTACKS: An attacker can compromise the user's personal node itself and insert a large number of decoy objects, making recovery infeasible. This can be overcome by periodically inserting reference objects with well-known version numbers, such as the current timestamp.

EXPERIMENTAL EVALUATION

Tested in two ways. First: as the storage layer for ePOST (a cooperative, serverless email system) for 140 days.

• Glacier maintains N = 48 fragments using an erasure code with r = 5, fmax = 60% and Pmin = 0.999999.
• ePOST has 35 nodes, which are desktop PCs running Linux, OS X and Windows.
• Glacier was able to handle all types of failures, including kernel panics, JVM crashes and a configuration error that caused 16 nodes to be disconnected.

Fig 7 shows the cumulative size of all the objects inserted over time, as well as the size of the objects that have not yet expired. Initial lease: 1 month.

Fig 8 shows a high number of small objects between 1 and 10 KB, and less than 1% of objects larger than 600 KB. Emails were typically small objects; emails with attachments were the larger objects.

Fig 9 shows the growth in storage as new email enters the system, and the increase in trash as mails are deleted.

ePOST RECOVERY

• Randomly selected 13 nodes and copied their local fragments to 13 fresh nodes.
• Started a new overlay network with only these 13 nodes. The resulting situation corresponds to a 58% failure, which is close to fmax = 60%.
• Completely reinstalled ePOST on a 14th node and let it join the ring.
• One of the users entered his email address and the approximate date when he had last used the system.
• The retrieval process took 1 hour, after which ePOST was ready to use.

SIMULATIONS

• Used trace-driven simulations corresponding to 147 users, approx. 10,000 nodes and a wide range of failures.
• Explored the impact of diurnal short-term churn.
• Modeled a ring of 250 nodes where M% are unavailable between 5pm and 7am, and 2M% on weekends.
• Fig 14 shows the decrease in insertion messages and the increase in maintenance traffic.

The experiments show that Glacier is able to manage this large amount of data with surprisingly low maintenance overhead, and that it is scalable with respect to both load and system size.

CONCLUSION

Glacier ensures the durability of unrecoverable data in a cooperative, decentralized storage system, despite large-scale correlated Byzantine failures.

It does not rely on introspection, which is inherently limited in its ability to capture all sources of correlated failures.

Glacier uses the raw, unreliable storage available at the nodes to provide hard durability guarantees.

QUESTIONS ?