Distributed Architectures

Page 1: Distributed Architectures

Distributed Architectures

Page 2: Distributed Architectures

Introduction Computing everywhere:

Desktop, Laptop, Palmtop Cars, Cellphones Shoes? Clothing? Walls?

Connectivity everywhere: Rapid growth of bandwidth in the interior of the net Broadband to the home and office Wireless technologies such as CDMA, satellite, laser

There is a need for persistent storage. Is it possible to provide an Internet-based distributed storage system?

Page 3: Distributed Architectures

System Requirements

Key requirements: Be able to deal with intermittent

connectivity provided to some computing devices

Information must be kept secure from theft and denial of service attacks.

Information must be extremely durable. Information must be separate from location. Should have a uniform and highly available

access to information.

Page 4: Distributed Architectures

Implications of Requirements

Don’t want to worry about backup Don’t want to worry about obsolescence Need lots of resources to make data secure and

highly available, BUT don’t want to own them Outsourcing of storage already becoming popular

Pay monthly fee and your “data is out there” Simple payment interface

One bill from one company

Page 5: Distributed Architectures

Global Persistent Store

Persistent store should be characterized as follows:
Transparent: Permits behavior to be independent of the devices themselves
Consistent: Allows users to safely access the same information from many different devices simultaneously.
Reliable: Devices can be rebooted or replaced without losing vital configuration information

Page 6: Distributed Architectures

Applications Groupware and Personal Information

Management Tools Examples: Calendar, email, contact lists, distributed

design tools. Must allow for concurrent updates from many people. Users must see an ever-progressing view of shared information, even when conflicts occur.

Digital libraries and repositories for scientific data. Require massive quantities of storage. Replication for durability and availability is desirable.

Page 7: Distributed Architectures

Questions about Persistent Information

Where is persistent information stored? Want: Geographic independence for availability, durability, and

freedom to adapt to circumstances How is it protected?

Want: Encryption for privacy, signatures for authenticity, and Byzantine commitment for integrity

Can we make it indestructible? Want: Redundancy with continuous repair and redistribution for

long-term durability Is it hard to manage?

Want: Automatic optimization, diagnosis and repair Who owns the aggregate resources?

Want: Utility Infrastructure!

Page 8: Distributed Architectures

OceanStore: Utility-based Infrastructure

[Figure: a federation of service providers — Pac Bell, Sprint, IBM, AT&T, Canadian OceanStore — jointly forming the OceanStore utility]

Transparent data service provided by a federation of companies: Monthly fee paid to one service provider Companies buy and sell capacity from each other

Page 9: Distributed Architectures

OceanStore: Everyone’s Data, One Big

Utility “The data is just out there”

How many files in the OceanStore? Assume 10^10 people in the world Say 10,000 files/person (very conservative?) So 10^14 files in OceanStore!

If 1 GB files (OK, a stretch), get 1 mole of bytes!

Page 10: Distributed Architectures

OceanStore Goals Untrusted infrastructure

Only clients can be trusted Servers can crash, or leak information to third parties Most of the servers are working correctly most of the

time A class of trusted servers can carry out protocols on the clients' behalf (financially liable for integrity of data)

Nomadic Data Access Data can be cached anywhere, anytime

(promiscuous caching) Continuous introspective monitoring to locate data

close to the user

Page 11: Distributed Architectures

OceanStore:Everyone’s data, One big

Utility “The data is just out there”

Separate information from location Locality is only an optimization (an important one!) Wide-scale coding and replication for durability

All information is globally identified Unique identifiers are hashes over names & keys Single uniform lookup interface replaces: DNS, server location, data

location No centralized namespace required

Page 12: Distributed Architectures

OceanStore Assumptions Untrusted Infrastructure:

The OceanStore is comprised of untrusted components Only ciphertext within the infrastructure

Responsible Party: Some organization (i.e. service provider) guarantees that your

data is consistent and durable Not trusted with content of data, merely its integrity

Mostly Well-Connected: Data producers and consumers are connected to a high-

bandwidth network most of the time Exploit multicast for quicker consistency when possible

Promiscuous Caching: Data may be cached anywhere, anytime

Optimistic Concurrency via Conflict Resolution: Avoid locking in the wide area Applications use object-based interface for updates

Page 13: Distributed Architectures

Naming At the lowest level, OceanStore objects are

identified by a globally unique identifier (GUID).

A GUID is a pseudorandom, fixed-length bit string.

The GUID does not contain any location information and is not human readable.

Passwd vs. 12agfs237fundfhg666abcdefg999ldfnhgga

Generation of GUIDs is adapted from the concept of a self-certifying pathname, which inherently specifies all information necessary to communicate securely with remote file servers (e.g., network address, public key).

Page 14: Distributed Architectures

Naming

Create hierarchies using "directory" objects To allow arbitrary directory hierarchies to be built, OceanStore allows directories to contain pointers to other directories.

A user can choose several directories as “roots” and secure those directories through external methods, such as a public key authority.

These root directories are only roots with respect to the clients using them. The system has no root.

An object GUID is a secure hash (160 bit SHA-1) of owner’s key and some human readable name
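As a sketch, GUID derivation along these lines might look as follows (assuming SHA-1 over the owner's key bytes concatenated with the name; OceanStore's exact serialization may differ):

```python
import hashlib

def make_guid(owner_public_key: bytes, name: str) -> str:
    """Derive a 160-bit GUID as SHA-1(owner key || human-readable name).

    The result is pseudorandom, carries no location information,
    and is not human readable.
    """
    h = hashlib.sha1()
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()  # 40 hex chars = 160 bits

guid = make_guid(b"owner-key-bytes", "passwd")
print(len(guid) * 4)  # 160 bits
```

The same (key, name) pair always yields the same GUID, so any client holding the owner's key can recompute the identifier without a centralized namespace.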

Page 15: Distributed Architectures

Naming

Is 160 bits enough? Good enough for now Requires over 2^80 unique objects before

collisions become worrisome.

Page 16: Distributed Architectures

Access Control Restricting Readers

To prevent unauthorized reads, we encrypt all data in the system that is not completely public and distribute the encryption key to those readers with read permission.

To revoke read permission:
• Data should be deleted from all replicas
• Data should be re-encrypted
• New keys should be distributed
• Clients can still access old data until it is deleted from all replicas

Page 17: Distributed Architectures

Access Control Restricting Writers

All writes are signed to prevent unauthorized writes.

Validity checked by Access Control Lists (ACLs)

The owner can securely choose the ACL x for an object foo by providing a signed certificate that translates to “Owner says use ACL x for object foo”

If A trusts B, B trusts C and C trusts D, then does A trust D?

• Depends on the policy!
• Needs work

Page 18: Distributed Architectures

Data Routing and Location Entities are free to reside on any of the

OceanStore servers. This provides maximum flexibility in selecting policies for replication, availability, caching and migration.

Makes the process of locating and interacting with the objects more complex.

Page 19: Distributed Architectures

Routing Address an object by its GUID

Message label: GUID, random number; no destination IP address.

OceanStore routes the message to the closest replica of the GUID.

OceanStore combines data location and routing: No name service to attack Save one round-trip for location discovery

Page 20: Distributed Architectures

Two-Tiered Approach to Routing

Fast, probabilistic search: Built from attenuated Bloom filters Why? This works well if items accessed

frequently reside close to where they are being used.

Slow, guaranteed search based on the Plaxton mesh data structure used for the underlying routing infrastructure. Messages begin by routing from node to node along the distributed data structure until the destination is discovered.

Page 21: Distributed Architectures

Attenuated Bloom Filter

Fast, probabilistic search algorithm based on Bloom filters. A Bloom filter is a randomized data

structure Strings are stored using multiple hash

functions It can be queried to check the presence of a

string Membership queries result in rare false

positives but never false negatives.

Page 22: Distributed Architectures

Attenuated Bloom Filter

Bloom filter Assume hash functions h1, h2, …, hk

A Bloom filter is a bit vector of length w. On input x, the filter computes the k hash functions h1(x), h2(x), …, hk(x) and sets the corresponding bits to 1.

[Figure: an input X fed through h1, h2, h3, …, hk, each hash setting one bit of the vector to 1]

Page 23: Distributed Architectures

Attenuated Bloom Filter To determine whether the set represented

by a Bloom filter contains a given element, that element is hashed and the corresponding bits in the filter are examined.

If all of the bits are set, the set may contain the object.

False positive probability decreases exponentially with linear increase in the number of hash functions and memory.
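A minimal Bloom filter sketch illustrating these properties (the width, hash count, and SHA-1-based hashing here are illustrative choices, not OceanStore's parameters):

```python
import hashlib

class BloomFilter:
    """Bit vector of width w probed by k hash functions.

    Membership tests can give rare false positives, never false negatives.
    """
    def __init__(self, w: int = 1024, k: int = 4):
        self.w, self.k = w, k
        self.bits = [0] * w

    def _positions(self, item: str):
        # Derive k independent positions by salting one hash function.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.w

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str) -> bool:
        # All k bits set => item *may* be present; any bit clear => absent.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("object-guid-1")
print("object-guid-1" in bf)  # True: no false negatives
```

Growing w and k drives the false-positive rate down exponentially, which is the trade-off the slide describes.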

Page 24: Distributed Architectures

Attenuated Bloom Filter

Attenuated Bloom filter Array of d Bloom filters Associate each neighbor link with an

attenuated Bloom filter. The first filter in the array summarizes

documents available from that neighbor (one hop along the link).

The i-th Bloom filter is the union of the Bloom filters for all of the nodes that are a distance of i hops along the link.

Page 25: Distributed Architectures

Attenuated Bloom Filter

[Figure: nodes n1–n4 with their local Bloom filters (rounded boxes) and per-link attenuated Bloom filters (unrounded boxes); numbered arrows 1, 2, 3, 4a, 4b, 5 trace the query steps of the example on the next slide.]

Page 26: Distributed Architectures

Attenuated Bloom Filter

Example Assume that an object whose GUID hashes to bits 0, 1, and 3 is being searched for. The steps are as follows:
1) The local Bloom filter for n1 shows that it does not have the object.
2) The attenuated Bloom filter at n1 is used to determine which neighbor may have the object.
3) The query moves to n2, whose Bloom filter indicates that it does not have the document.
4) (a) Examining the attenuated Bloom filter shows that n4 doesn't have the document but that n3 may.
5) The query is forwarded to n3, which verifies that it has the object.

Page 27: Distributed Architectures

Attenuated Bloom Filter Routing

Performing a location query:
• The querying node examines the first level of each of its neighbors' attenuated Bloom filters.
• If one of the filters matches, it is likely that the desired data item is only one hop away, and the query is forwarded to the matching neighbor closest to the current node in network latency.
• If no filter matches, the querying node looks for a match in the 2nd level of every filter.
• If a match is found, the query is forwarded to the matching neighbor of lowest latency.
• If the data can't be found, fall back to the deterministic search.
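The probabilistic query step can be sketched as follows (filter width, depth, the hashing, and the neighbor-table layout are all illustrative assumptions, not OceanStore's actual structures):

```python
import hashlib

W, K = 256, 3  # illustrative filter width and hash count

def bits_for(guid: str) -> set:
    """Bit positions a GUID sets in a width-W filter (K salted hashes)."""
    return {int.from_bytes(hashlib.sha1(f"{i}:{guid}".encode()).digest()[:4],
                           "big") % W
            for i in range(K)}

def probe(neighbors: dict, guid: str):
    """Find the shallowest attenuated-filter match across neighbor links.

    neighbors maps name -> (latency, [level-0 bits, level-1 bits, ...]).
    Returns (neighbor, level) preferring lower levels, then lower latency,
    or None so the caller can fall back to the deterministic search.
    """
    positions = bits_for(guid)
    best = None
    for name, (latency, levels) in neighbors.items():
        for depth, level_bits in enumerate(levels):
            if positions <= level_bits:  # all hashed bits set at this level
                if best is None or (depth, latency) < (best[1], best[2]):
                    best = (name, depth, latency)
                break
    return (best[0], best[1]) if best else None

# Two neighbors; n2 advertises the object at level 0 (one hop away).
guid = "object-guid"
neighbors = {
    "n2": (10, [bits_for(guid), set()]),
    "n3": (5,  [set(), bits_for(guid)]),
}
print(probe(neighbors, guid))  # ('n2', 0): shallowest match wins
```

Note that a level match is only probabilistic evidence: a false positive sends the query one wasted hop, after which the same procedure repeats at the neighbor.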

Page 28: Distributed Architectures

Deterministic Routing and Location

Based on Plaxton trees A Plaxton tree is a distributed tree

structure where every node is the root of a tree.

Every server in the system is assigned a random node identifier (node-ID). Each object has root node f(GUID) = RootID Root node is responsible for storing object’s

location

Page 29: Distributed Architectures

Deterministic Routing and Location

Local routing maps are used by each node. These are referred to as neighbor maps.

Nodes keep “nearest” neighbor pointers differing in one digit.

Each node is connected to other nodes via neighbor links of various levels.

Level 1 edges from a given node connect to the 16 nodes closest (in network latency) with different values in the lowest digit of their addresses.

Page 30: Distributed Architectures

Deterministic Routing and Location

Level-2 edges connect to the 16 closest nodes that match in the lowest digit and have different second digits, etc;

Neighborhood links provide a route from every node to every other node in the system: Resolve the destination node address one

digit at a time using a level-1 edge for the first digit, a level-2 edge for the second etc

A Plaxton mesh refers to a set of trees where there is a tree rooted at every node.

Page 31: Distributed Architectures

Deterministic Routing and Location

Example: Route from ID 0694 to 5442: 0694 → 0692 → 0642 → 0442 → 5442

Lookup map for node 0642 (rows indexed by the value of the most significant unequal digit, 0 through 7; columns from 3 trailing digits equal at the left down to 0 digits equal at the right):

0642  x042  xx02  xxx0
1642  x142  xx12  xxx1
2642  x242  xx22  xxx2
3642  x342  xx32  xxx3
4642  x442  xx42  xxx4
5642  x542  xx52  xxx5
6642  x642  xx62  xxx6
7642  x742  xx72  xxx7

Source: Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing. B. Zhao, J. Kubiatowicz and A. Joseph, UC Berkeley Technical Report.

Page 32: Distributed Architectures

Global Routing Algorithm

Move to the closest node to you that shares at least one more digit with the target destination

E.g.: 0325 → B4F8 → 9098 → 7598 → 4598

[Figure: mesh of nodes (0325, B4F8, 9098, 7598, 4598, 1598, 0098, 3E98, 2BB8, 87CA, 2118, D598) connected by neighbor links of levels L1–L4; each hop resolves one more digit of the destination 4598.]

Page 33: Distributed Architectures

Deterministic Routing and Location

This routing method guarantees that any existing unique node in the system will be found in at most log_b N logical hops, in a system with a namespace of size N using IDs of base b.

Page 34: Distributed Architectures

Deterministic Routing and Location

A server S publishes that it has an object O by routing a message to the “root node” of O.

The publishing process consists of sending a message toward the root node.

At each hop along the way, the publish message stores location information in the form of a mapping <Object-ID(O), Server-ID(S)>. These mappings are simply pointers to the

server S where O is being stored.

Page 35: Distributed Architectures

Deterministic Routing and Location

During a location query, clients send messages to objects. A message destined for O is initially routed toward O's root.

At each step, if the message encounters a node that contains the location mapping for O, it is immediately redirected to the server containing the object.

Else the message is forwarded one step closer to the root.

Page 36: Distributed Architectures

Deterministic Routing and Location

Where multiple copies of data exist in Plaxton, each node en route to the root node stores locations of all replicas.

The replica chosen may be defined by the application.

Page 37: Distributed Architectures

Achieving Fault Tolerance If each object has a single root then it

becomes a single point of failure, the potential subject of denial of service attacks, and an availability problem.

This is addressed by having multiple roots which can be achieved by hashing each GUID with a small number of different salt values.

The result maps to several different root nodes, thus gaining redundancy and making it difficult to target for denial of service.
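A sketch of salted root derivation (the salt scheme shown here is illustrative; OceanStore's exact derivation may differ):

```python
import hashlib

def root_ids(guid: bytes, num_roots: int = 4) -> list:
    """Derive several root-node IDs by hashing the GUID with small salts.

    Each salted hash maps to a different root, so the object keeps working
    if one root fails and no single node is a denial-of-service target.
    """
    return [hashlib.sha1(guid + bytes([salt])).hexdigest()
            for salt in range(num_roots)]

roots = root_ids(b"some-object-guid")
print(len(set(roots)))  # 4 distinct roots
```

Clients can recompute the same salted set independently, so publishing and querying agree on where the redundant roots live without any coordination.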

Page 38: Distributed Architectures

Achieving Fault Tolerance

This scheme is sensitive to corruption in the links and pointers.

Bad links can be immediately detected and routing can be continued by jumping to a random neighbor node. Each tree spans every node, hence any

node should be able to reach the root.

Page 39: Distributed Architectures

Update Model and Consistency Resolution

Depending on the application, there could be a high degree of write sharing.

Do not really want to do wide-area locking.

Page 40: Distributed Architectures

Update Model and Consistency Resolution

The process of conflict resolution starts with a series of updates, chooses a total order among them, then applies them atomically in that order.

The easiest way to compute this order is to require that all updates pass through a master replica.

All other replicas can be thought of as read caches.

Problems Does not scale Vulnerable to Denial of Service (DOS) attacks Requires trusting a single replica
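The master-replica ordering can be sketched as follows (class names are illustrative; as the slide notes, this design does not scale and trusts a single replica):

```python
# Sketch: serializing concurrent updates through a single master replica.
# The master assigns each update a position in the total order; all other
# replicas behave as read caches that apply the log in that order.

class MasterReplica:
    def __init__(self):
        self.log = []  # the committed total order

    def commit(self, update) -> int:
        self.log.append(update)
        return len(self.log) - 1  # sequence number in the total order

class ReadCache:
    def __init__(self):
        self.state = []

    def sync(self, master: MasterReplica):
        self.state = list(master.log)  # apply updates in the master's order

master = MasterReplica()
for u in ["set x=1", "set x=2", "append y"]:
    master.commit(u)

cache = ReadCache()
cache.sync(master)
print(cache.state)  # ['set x=1', 'set x=2', 'append y']
```

Every cache that syncs sees the same order, which is exactly what the primary tier later reproduces without the single point of failure.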

Page 41: Distributed Architectures

Update Model and Consistency Resolution

The notion of distributed consistency solves DoS/Trust issues

Replace master replica with a primary tier of replicas.

These replicas cooperate with one another in Byzantine agreement protocol to choose the final commit order for updates.

Problem: The message traffic needed by Byzantine

agreement protocols is high.

Page 42: Distributed Architectures

Update Model and Consistency Resolution

Basically a two tier architecture is used. Primary tier is where replicas use

Byzantine Agreement A secondary tier acts as a distributed

read/write cache. The secondary tier is kept up-to-date via

“push” or “pull”

Page 43: Distributed Architectures

The Path of an Update

Client creates and signs an update against an object

Client sends update to primary tier and several random secondary tier replicas

While primary tier applies update against master copy, secondary tier propagates update epidemically.

Primary tier’s decision is passed to all replicas

Page 44: Distributed Architectures

Update Model and Consistency Resolution

A secondary tier of replicas communicate among themselves and the primary tier using a protocol that implements an epidemic algorithm.

Epidemic algorithms propagate updates to all replicas in as few messages as possible. A process P picks another server Q at

random and subsequently exchanges updates with Q.
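A minimal anti-entropy sketch of this exchange (replica names and the data model are illustrative):

```python
import random

# Epidemic (anti-entropy) propagation: each round, every replica P picks a
# random peer Q and the pair exchange updates both ways.

def anti_entropy_round(replicas):
    for p in replicas:
        q = random.choice([r for r in replicas if r is not p])
        merged = p["updates"] | q["updates"]  # union of what both have seen
        p["updates"] = merged
        q["updates"] = merged

replicas = [{"updates": set()} for _ in range(8)]
replicas[0]["updates"].add("update-1")  # one replica receives an update

rounds = 0
while any(not r["updates"] for r in replicas):
    anti_entropy_round(replicas)
    rounds += 1
print(rounds)  # small: epidemic spread reaches everyone in a few rounds
```

The number of infected replicas roughly doubles per round, which is why epidemic algorithms reach all replicas in few messages despite using only random pairwise exchange.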

Page 45: Distributed Architectures

The Path of an Update

Page 46: Distributed Architectures

Update Model and Consistency Resolution

Note that secondary replicas contain both tentative and committed data.

The epidemic-style communication pattern is used to quickly spread tentative commits among the secondary replicas and to pick a tentative serialization order, based on optimistically timestamping their updates.

Page 47: Distributed Architectures

Byzantine Agreement A set of processes need to agree on a

value, after one or more processes have proposed what that value (decision) should be

Processes may be correct, crashed, or they may exhibit arbitrary (Byzantine) failures

Messages are exchanged on a one-to-one basis, and they are not signed

Page 48: Distributed Architectures

Problems of Agreement The general goal of distributed agreement

algorithms is to have all the nonfaulty processes reach consensus on some issue and to establish that consensus within a finite number of steps.

What if processes exhibit Byzantine failures? This is often compared to armies in the

Byzantine Empire, in which many conspiracies, intrigues and untruths were alleged to be common in ruling circles.

A traitor is analogous to a fault.

Page 49: Distributed Architectures

Byzantine Agreement

We will illustrate by example where there are 4 generals, where one is a traitor.

Step 1: Every general sends a (reliable) message to every

other general announcing his troop strength. Loyal generals tell the truth. Traitors tell every other general a different lie. Example: general 1 reports 1K troops, general 2

reports 2K troops, general 3 lies to everyone (giving x, y, z respectively) and general 4 reports 4K troops.

Page 51: Distributed Architectures

Byzantine Agreement

Page 52: Distributed Architectures

Byzantine Agreement

Step 2: The results of the announcements of step 1

are collected together in the form of vectors.

Page 53: Distributed Architectures

Byzantine Agreement

Page 54: Distributed Architectures

Byzantine Agreement

Step 3 Consists of every general passing his vector

from the previous step to every other general.

Each general gets three vectors from each other general.

General 3 hasn’t stopped lying. He invents 12 new values: a through l.

Page 55: Distributed Architectures

Byzantine Agreement

Page 56: Distributed Architectures

Byzantine Agreement

Step 4 Each general examines the ith element of

each of the newly received vectors. If any value has a majority, that value is put

into the result vector. If no value has a majority, the

corresponding element of the result vector is marked UNKNOWN.
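The elementwise majority of step 4 can be sketched with the slides' four-general example (the values x and a–d stand in for the traitor's lies):

```python
# Sketch of the 4-general example (m = 1 traitor, n = 4 >= 3m + 1).
# General 1 takes the elementwise majority over the vectors it received
# in step 3; elements with no majority become UNKNOWN.

def byzantine_majority(vectors):
    result = []
    for column in zip(*vectors):  # the i-th element of every vector
        best = max(set(column), key=column.count)
        result.append(best if column.count(best) > len(column) // 2
                      else "UNKNOWN")
    return result

# Vectors general 1 receives from generals 2, 3 and 4 in step 3
# (general 3, the traitor, sends pure invention: a, b, c, d).
received = [
    ["1K", "2K", "x", "4K"],   # from general 2
    ["a",  "b",  "c", "d"],    # from general 3 (traitor)
    ["1K", "2K", "z", "4K"],   # from general 4
]
print(byzantine_majority(received))
# ['1K', '2K', 'UNKNOWN', '4K'] -- agreement on every loyal general's value
```

The loyal generals' values win the majority in every column except the traitor's own slot, so all loyal generals reach the same decision vector.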

Page 57: Distributed Architectures

Byzantine Agreement

With m faulty processes, agreement is possible only if 2m+1 processes function correctly, for a total of 3m+1 processes.

delivered within a known, finite time, no agreement is possible if even one process is faulty.

Why? Slow processes are indistinguishable from crashed ones.

Page 58: Distributed Architectures

Introspection OceanStore uses introspection, which mimics

adaptation in biological systems. The observation modules monitor the activity of a running system and keep a historical record of system behavior.

Use sophisticated analysis to extract patterns from these observations.

Why? Cluster recognition Replica management

Page 59: Distributed Architectures

Status System written in Java, running on Unix Applications used: e-mail Deployed on approximately 100 host machines at 40 sites in the US, Canada, Europe, Australia and New Zealand.

Page 60: Distributed Architectures

Prototype Applications

File Server (ran the Andrew file system) Primary replicas at UCB, Stanford, Intel,

Berkeley Reads are faster Writes are slower

Cost of write is primarily associated with using Byzantine agreement.

They modified the Byzantine agreement protocol to use signatures, which does speed up the writes but not as much as hoped. Better for large writes than small writes.

Page 61: Distributed Architectures

Why Important?

This body of work introduces the concept of a cooperative utility for global-scale persistent storage.

Discusses interesting approaches to addressing the problems of data availability and survivability in the face of disaster.

Investigates the uses of introspection for optimization of a self-organizing system.