kristina-daby
Peer-to-Peer (P2P) systems
DHT, Chord, Pastry, …
Unstructured vs Structured
Unstructured P2P networks allow resources to be placed at any node. The network topology is arbitrary, and growth is spontaneous.
Structured P2P networks simplify resource location and load balancing by defining a topology and rules for resource placement.
They guarantee efficient search even for rare objects.
What are the rules?
Distributed Hash Table (DHT)
Hash Tables
Store arbitrary keys and satellite data (values)
put(key, value)
value = get(key)
Lookup must be fast
Calculate a hash function h() on the key that returns a storage cell
Chained hash table: store the key (and optional value) there
What Is a DHT?
Single-node hash table:
key = Hash(name)
put(key, value)
get(key) -> value
How do I do this across millions of hosts on the Internet? With a Distributed Hash Table.
DHT
[Figure: layered architecture. The distributed application calls put(key, data) and get(key) -> data on the distributed hash table (DHash). The distributed hash table calls lookup(key) -> node IP address on the lookup service (Chord), which runs across the nodes.]
• Application may be distributed over many nodes
• DHT distributes data storage over many nodes
Why the put()/get() interface?
API supports a wide range of applications
DHT imposes no structure/meaning on keys
Distributed Hash Table
Hash table functionality in a P2P network: lookup of data indexed by keys
Key-hash -> node mapping
Assign a unique live node to a key
Find this node in the overlay network quickly and cheaply
Maintenance, optimization
Load balancing: maybe even change the key-hash -> node mapping on the fly
Replicate entries on more nodes to increase robustness
Distributed Hash Table
The lookup problem
[Figure: nodes N1 to N6 connected through the Internet. A publisher holds Key="title", Value=MP3 data…; a client issues Lookup("title"). Which node holds the key?]
Centralized lookup (Napster)
[Figure: the publisher at N4 holds Key="title", Value=MP3 data… and registers SetLoc("title", N4) with a central database (DB); the client sends Lookup("title") to the DB.]
Simple, but O(N) state and a single point of failure
Flooded queries (Gnutella)
[Figure: the publisher at N4 holds Key="title", Value=MP3 data…; the client's Lookup("title") is flooded from node to node.]
Robust, but large number of messages per lookup
Routed queries (Chord, Pastry, etc.)
[Figure: the publisher at N4 holds Key="title", Value=MP3 data…; the client's Lookup("title") is routed hop by hop through the overlay towards the responsible node.]
Routing challenges
Define a useful key nearness metric
Keep the hop count small
Keep the tables small
Stay robust despite rapid change
Freenet: emphasizes anonymity
Chord: emphasizes efficiency and simplicity
What is Chord? What does it do?
In short: a peer-to-peer lookup service
Solves problem of locating a data item in a collection of distributed nodes, considering frequent node arrivals and departures
Core operation in most p2p systems is efficient location of data items
Supports just one operation: given a key, it maps the key onto a node
Chord Characteristics
Simplicity, provable correctness, and provable performance
Each Chord node needs routing information about only a few other nodes
Resolves lookups via messages to other nodes (iteratively or recursively)
Maintains routing information as nodes join and leave the system
Mapping onto Nodes vs. Values
Traditional name and location services provide a direct mapping between keys and values
What are examples of values? A value can be an address, a document, or an arbitrary data item
Chord can easily implement a mapping onto values by storing each key/value pair at the node to which that key maps
Napster, Gnutella etc. vs. Chord
Compared to Napster and its centralized servers, Chord avoids single points of control or failure by a decentralized technology
Compared to Gnutella and its widespread use of broadcasts, Chord scales better because each node keeps only a small amount of routing information
Addressed Difficult Problems (1)
Load balance: distributed hash function, spreading keys evenly over nodes
Decentralization: Chord is fully distributed; no node is more important than any other, which improves robustness
Scalability: logarithmic growth of lookup costs with number of nodes in network, even very large systems are feasible
Addressed Difficult Problems (2)
Availability: Chord automatically adjusts its internal tables to ensure that the node responsible for a key can always be found
Flexible naming: no constraints on the structure of the keys – key-space is flat, flexibility in how to map names to Chord keys
Chord properties
Efficient: O(log(N)) messages per lookup, where N is the total number of servers
Scalable: O(log(N)) state per node
Robust: survives massive failures
The Base Chord Protocol (1)
Specifies how to find the locations of keys
How new nodes join the system
How to recover from the failure or planned departure of existing nodes
Consistent Hashing
Hash function assigns each node and key an m-bit identifier using a base hash function such as SHA-1
ID(node) = hash(IP, Port)
ID(key) = hash(key)
Properties of consistent hashing:
Function balances load: all nodes receive roughly the same number of keys
When an Nth node joins (or leaves) the network, only an O(1/N) fraction of the keys are moved to a different location
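The second property can be checked with a small experiment. The sketch below is illustrative only: the 32-bit identifier width, the node addresses, and the key names are assumptions made for the example, not values from the slides.

```python
import hashlib
from bisect import bisect_left

M = 32  # identifier bits (assumption; full SHA-1 output would give 160)

def ident(name):
    """Truncate the SHA-1 digest of a name to an M-bit identifier."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

def owner(key_id, node_ids):
    """First node ID >= key_id, wrapping around the identifier circle."""
    ring = sorted(node_ids)
    return ring[bisect_left(ring, key_id) % len(ring)]

# Hypothetical node addresses and key names.
nodes = [ident(f"10.0.0.{i}:4000") for i in range(10)]
keys = [ident(f"file-{i}") for i in range(1000)]

before = {k: owner(k, nodes) for k in keys}
new_node = ident("10.0.0.99:4000")
after = {k: owner(k, nodes + [new_node]) for k in keys}

# Keys whose owner changed when the 11th node joined.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(before)} keys moved, all to the new node")
```

Every key that changes owner moves to the joining node; no key moves between two existing nodes.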
Chord IDs
Key identifier = SHA-1(key)
Node identifier = SHA-1(IP, Port)
Both are uniformly distributed
Both exist in the same ID space
How to map key IDs to node IDs?
Chord IDs
consistent hashing (SHA-1) assigns each node and object an m-bit ID
IDs are ordered on an ID circle ranging from 0 to 2^m - 1.
New nodes assume slots in ID circle according to their ID
Key k is assigned to the first node whose ID is equal to or follows k on the circle; this node is called successor(k)
Consistent hashing
[Figure: circular 7-bit ID space with nodes N32, N90, N105 and keys K5, K20, K80.]
A key is stored at its successor: the node with the next higher ID
Successor Nodes
[Figure: identifier circle with m = 3 (identifiers 0 to 7), nodes 0, 1, and 3, and keys 1, 2, and 6.]
successor(1) = 1
successor(2) = 3
successor(6) = 0
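The successor values listed here can be reproduced with a few lines. This is a minimal sketch assuming the ring is available as a plain list of node IDs; the function name and signature are chosen for illustration.

```python
from bisect import bisect_left

def successor(k, node_ids, m):
    """First node whose ID is >= k on a 2**m identifier circle, wrapping to the smallest ID."""
    ring = sorted(node_ids)
    i = bisect_left(ring, k % (2 ** m))
    return ring[i % len(ring)]

nodes = [0, 1, 3]  # the 3-bit example ring
for key in (1, 2, 6):
    print(f"successor({key}) = {successor(key, nodes, 3)}")
```

Key 6 has no node with a higher ID, so the search wraps around the circle to node 0.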
Node Joins and Departures
[Figure: the same identifier circle after node 7 joins and node 1 leaves.]
successor(6) = 7
successor(1) = 3
Consistent Hashing – Join and Departure
When a node n joins the network, certain keys previously assigned to n’s successor now become assigned to n.
When node n leaves the network, all of its assigned keys are reassigned to n’s successor.
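As a rough illustration of both rules, the sketch below recomputes key placement on a 3-bit ring with nodes 0, 1, 3 and keys 1, 2, 6, first when a node 7 joins and then when node 1 leaves. The helper names are hypothetical.

```python
from bisect import bisect_left

def successor(k, node_ids):
    """First node ID >= k, wrapping around the circle."""
    ring = sorted(node_ids)
    return ring[bisect_left(ring, k) % len(ring)]

def placement(keys, node_ids):
    """Map every key to the node that stores it."""
    return {k: successor(k, node_ids) for k in keys}

keys = [1, 2, 6]
start = placement(keys, [0, 1, 3])
after_join = placement(keys, [0, 1, 3, 7])   # node 7 joins
after_leave = placement(keys, [0, 3, 7])     # node 1 leaves

print(start)        # key 6 is stored at node 0
print(after_join)   # key 6 moves from node 0 to the new node 7
print(after_leave)  # key 1 moves from node 1 to its successor, node 3
```

In each step only the keys adjacent to the affected node change hands; all other placements stay as they were.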
Consistent Hashing – Node Join
[Figure: identifier circle showing the keys stored at each node before and after a node joins.]
Consistent Hashing – Node Dep.
[Figure: identifier circle showing the keys stored at each node before and after a node departs.]
Scalable Key Location
A very small amount of routing information suffices to implement consistent hashing in a distributed environment
Each node need only be aware of its successor node on the circle
Queries for a given identifier can be passed around the circle via these successor pointers
A Simple Key Lookup
Pseudo code for finding successor:
// ask node n to find the successor of id
n.find_successor(id)
    if (id ∈ (n, successor])
        return successor;
    else
        // forward the query around the circle
        return successor.find_successor(id);
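A runnable version of this successor-only lookup might look as follows. It is a sketch under assumptions: nodes are in-memory objects rather than remote hosts, the ring IDs are illustrative, and a hop counter is added to make the cost visible.

```python
def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b  # interval wraps past zero (a == b covers the whole circle)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self

    def find_successor(self, key, hops=0):
        """Return (responsible node, number of times the query was forwarded)."""
        if in_half_open(key, self.id, self.successor.id):
            return self.successor, hops
        # Forward the query around the circle.
        return self.successor.find_successor(key, hops + 1)

def build_ring(ids):
    """Link nodes into a ring in ID order and return them sorted by ID."""
    nodes = [Node(i) for i in sorted(ids)]
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        a.successor = b
    return nodes

nodes = build_ring([1, 8, 14, 21, 32, 38, 42, 48])  # illustrative ring
start = nodes[1]                                    # node 8
owner, hops = start.find_successor(45)
print(owner.id, hops)
```

Starting at node 8, the query for key 45 is forwarded node by node until it reaches node 42, whose successor, node 48, is responsible for the key.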
Scalable Key Location
Resolution scheme correct, BUT inefficient:
it may require traversing all N nodes!
Acceleration of Lookups
Lookups are accelerated by maintaining additional routing information
Each node maintains a routing table with (at most) m entries, called the finger table, where 2^m is the size of the identifier space
The i-th entry in the table at node n contains the identity of the first node, s, that succeeds n by at least 2^(i-1) on the identifier circle
s = successor(n + 2^(i-1)) (all arithmetic mod 2^m)
s is called the i-th finger of node n, denoted by n.finger(i).node
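The definition translates directly into code. The sketch below computes finger tables for a 3-bit ring with nodes 0, 1, and 3; representing the ring as a list of IDs is an assumption made for illustration.

```python
from bisect import bisect_left

def successor(k, ring):
    """First node ID >= k in the sorted ring, wrapping around."""
    return ring[bisect_left(ring, k) % len(ring)]

def finger_table(n, node_ids, m):
    """Return [(start, successor(start))] for fingers i = 1..m of node n."""
    ring = sorted(node_ids)
    table = []
    for i in range(1, m + 1):
        start = (n + 2 ** (i - 1)) % (2 ** m)
        table.append((start, successor(start, ring)))
    return table

nodes = [0, 1, 3]
for n in nodes:
    print(n, finger_table(n, nodes, 3))
```

Each finger start doubles the distance from n, which is why a node knows more about nodes closely following it than about nodes farther away.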
Scalable Key Location – Finger Tables
[Figure: identifier circle (m = 3) with nodes 0, 1, and 3 and their finger tables.]
Node   start (n+2^0, n+2^1, n+2^2)   succ.     keys
0      1, 2, 4                       1, 3, 0   6
1      2, 3, 5                       3, 3, 0   1
3      4, 5, 7                       0, 0, 0   2
Finger Tables
[Figure: the same identifier circle, with the interval covered by each finger.]
Node 0 (keys: 6)
start   int.    succ.
1       [1,2)   1
2       [2,4)   3
4       [4,0)   0
Node 1 (keys: 1)
start   int.    succ.
2       [2,3)   3
3       [3,5)   3
5       [5,1)   0
Node 3 (keys: 2)
start   int.    succ.
4       [4,5)   0
5       [5,7)   0
7       [7,3)   0
Finger Tables - characteristics
Each node stores information about only a small number of other nodes, and knows more about nodes closely following it than about nodes farther away
A node’s finger table generally does not contain enough information to determine the successor of an arbitrary key k
Repetitive queries to nodes that immediately precede the given key will lead to the key’s successor eventually
Node Joins – with Finger Tables
[Figure: the identifier circle after node 6 joins. Key 6 moves to node 6, and finger entries whose start now falls before node 6 are updated.]
Node 0 (keys: none)
start   int.    succ.
1       [1,2)   1
2       [2,4)   3
4       [4,0)   6
Node 1 (keys: 1)
start   int.    succ.
2       [2,3)   3
3       [3,5)   3
5       [5,1)   6
Node 3 (keys: 2)
start   int.    succ.
4       [4,5)   6
5       [5,7)   6
7       [7,3)   0
Node 6 (keys: 6)
start   int.    succ.
7       [7,0)   0
0       [0,2)   0
2       [2,6)   3
Node Departures – with Finger Tables
[Figure: the identifier circle with nodes 0, 1, 3, and 6 after one node departs; the finger entries and keys affected by the departure are updated.]
Simple Key Location Scheme
[Figure: ring with nodes N1, N8, N14, N21, N32, N38, N42, N48 and key K45.]
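Scalable Lookup Scheme
[Figure: ring with nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56, showing the fingers of N8.]
Finger Table for N8
N8+1    N14
N8+2    N14
N8+4    N14
N8+8    N21
N8+16   N32
N8+32   N42
Fingers 1, 2, and 3 point to N14, finger 4 to N21, finger 5 to N32, and finger 6 to N42
finger[k] = first node that succeeds (n + 2^(k-1)) mod 2^m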
Chord key location
Look up in the finger table the furthest node that precedes the key
-> O(log n) hops
Scalable Lookup Scheme
// ask node n to find the successor of id
n.find_successor(id)
    n0 = find_predecessor(id);
    return n0.successor;

// ask node n to find the predecessor of id
n.find_predecessor(id)
    n0 = n;
    while (id not in (n0, n0.successor])
        n0 = n0.closest_preceding_finger(id);
    return n0;

// return closest finger preceding id
n.closest_preceding_finger(id)
    for i = m downto 1
        if (finger[i].node belongs to (n, id))
            return finger[i].node;
    return n;
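The three routines can be exercised on a small in-memory ring. The sketch below is one possible rendering, not a networked implementation: finger tables are precomputed from global knowledge of the ring, fingers are stored 0-indexed (finger[0] is the successor), and a path list is added to show which nodes are visited. The 6-bit ring of ten nodes is an illustrative example.

```python
from bisect import bisect_left

M = 6  # identifier bits for the example ring

def in_open(x, a, b):
    """True if x lies in the circular interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b

def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b

class Node:
    def __init__(self, ident):
        self.id = ident
        self.finger = []  # finger[0] is the immediate successor

    @property
    def successor(self):
        return self.finger[0]

    def closest_preceding_finger(self, key):
        for f in reversed(self.finger):  # i = m downto 1
            if in_open(f.id, self.id, key):
                return f
        return self

    def find_predecessor(self, key):
        node, path = self, [self.id]
        while not in_half_open(key, node.id, node.successor.id):
            node = node.closest_preceding_finger(key)
            path.append(node.id)
        return node, path

    def find_successor(self, key):
        pred, path = self.find_predecessor(key)
        return pred.successor, path

def build_ring(ids):
    """Create nodes and fill each finger table with successor(n + 2**k)."""
    ring = sorted(ids)
    nodes = {i: Node(i) for i in ring}
    for n in nodes.values():
        starts = [(n.id + 2 ** k) % 2 ** M for k in range(M)]
        n.finger = [nodes[ring[bisect_left(ring, s) % len(ring)]] for s in starts]
    return nodes

nodes = build_ring([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
owner, path = nodes[8].find_successor(54)
print(owner.id, path)
```

Here node 8 jumps to node 42, node 42 jumps to node 51, and node 51's successor, node 56, holds key 54.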
Lookup Using Finger Table
[Figure: ring with nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56, showing the path taken by lookup(54).]
Scalable Lookup Scheme
Each node forwards query at least halfway along distance remaining to the target
Theorem: With high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N)
“Finger table” allows log(N)-time lookups
[Figure: the fingers of node N80 span 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the ring.]
Dynamic Operations and Failures
Need to deal with:
Node Joins and Stabilization
Impact of Node Joins on Lookups
Failure and Replication
Voluntary Node Departures
Node Joins and Stabilization
A node's successor pointer should be up to date so that lookups execute correctly
Each node periodically runs a "Stabilization" protocol
It updates finger tables and successor pointers
Node Joins and Stabilization
Contains 6 functions:
create()
join()
stabilize()
notify()
fix_fingers()
check_predecessor()
Create()
Creates a new Chord ring
n.create()
    predecessor = nil;
    successor = n;
Join()
Asks m to find the immediate successor of n.
Doesn’t make rest of the network aware of n.
n.join(m)
    predecessor = nil;
    successor = m.find_successor(n);
Stabilize()
Called periodically to learn about new nodes
Asks n's immediate successor about the successor's predecessor p
Checks whether p should be n's successor instead
Also notifies n's successor about n's existence, so that the successor may change its predecessor to n, if necessary
n.stabilize()
    x = successor.predecessor;
    if (x ∈ (n, successor))
        successor = x;
    successor.notify(n);
Notify()
m thinks it might be n’s predecessor
n.notify(m)
    if (predecessor is nil or m ∈ (predecessor, n))
        predecessor = m;
Fix_fingers()
Periodically called to make sure that finger table entries are correct
New nodes initialize their finger tables
Existing nodes incorporate new nodes into their finger tables
n.fix_fingers()
    next = next + 1;
    if (next > m)
        next = 1;
    finger[next] = find_successor(n + 2^(next-1));
Check_predecessor()
Periodically called to check whether the predecessor has failed
If yes, it clears the predecessor pointer, which can then be modified by notify()
n.check_predecessor()
    if (predecessor has failed)
        predecessor = nil;
Theorem
If any sequence of join operations is executed interleaved with stabilizations, then at some time after the last join the successor pointers will form a cycle on all nodes in the network
Stabilization Protocol
Guarantees that nodes are added in a way that preserves reachability of the existing nodes
Impact of Node Joins on Lookups
If finger table entries are reasonably current
Lookup finds the correct successor in O(log N) steps
If successor pointers are correct but finger tables are incorrect
Correct lookup, but slower
If successor pointers are incorrect
Lookup may fail
Impact of Node Joins on Lookups
Performance
If stabilization is complete
Lookup can be done in O(log N) time
If stabilization is not complete
Existing nodes' finger tables may not reflect the new nodes
This doesn't significantly affect lookup speed
Newly joined nodes can affect the lookup speed if the new nodes' IDs are in between the target and the target's predecessor
The lookup will then have to be forwarded through the intervening nodes, one at a time
Theorem
If we take a stable network with N nodes with correct finger pointers, and another set of up to N nodes joins the network, and all successor pointers (but perhaps not all finger pointers) are correct, then lookups will still take O(log N) time with high probability
Source of Inconsistencies: Concurrent Operations and Failures
Basic “stabilization” protocol is used to keep nodes’ successor pointers up to date, which is sufficient to guarantee correctness of lookups
Those successor pointers can then be used to verify the finger table entries
Every node runs stabilize periodically to find newly joined nodes
Stabilization after Join
[Figure: before the join, succ(np) = ns and pred(ns) = np; node n joins between np and ns.]
n joins
predecessor = nil
n acquires ns as successor via some n'
n notifies ns that it is ns's new predecessor
ns acquires n as its predecessor
np runs stabilize
np asks ns for its predecessor (now n)
np acquires n as its successor
np notifies n
n will acquire np as its predecessor
all predecessor and successor pointers are now correct
fingers still need to be fixed, but old fingers will still work
[Figure: n's predecessor starts as nil; after stabilization, pred(ns) = n and succ(np) = n.]
Node joins and stabilization
• N26 joins the system
• N26 acquires N32 as its successor
• N26 notifies N32
• N32 acquires N26 as its predecessor
Node joins and stabilization
• N26 copies keys
• N21 runs stabilize() and asks its successor N32 for its predecessor which is N26.
Node joins and stabilization
• N21 acquires N26 as its successor
• N21 notifies N26 of its existence
• N26 acquires N21 as predecessor
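The N26 walkthrough can be replayed with a small simulation of join(), stabilize(), and notify(). This is a sketch under assumptions: nodes are in-memory objects, stabilization is triggered by hand rather than by a timer, and find_successor uses the simple successor walk instead of fingers.

```python
def in_open(x, a, b):
    """True if x lies in the circular interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b

def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b

class Node:
    def __init__(self, ident):
        self.id = ident
        self.predecessor = None
        self.successor = self

    def find_successor(self, key):
        node = self
        while not in_half_open(key, node.id, node.successor.id):
            node = node.successor
        return node.successor

    def join(self, known):
        self.predecessor = None
        self.successor = known.find_successor(self.id)

    def stabilize(self):
        x = self.successor.predecessor
        if x is not None and in_open(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, other):
        if self.predecessor is None or in_open(other.id, self.predecessor.id, self.id):
            self.predecessor = other

# Existing two-node ring: N21 <-> N32.
n21, n32 = Node(21), Node(32)
n21.successor, n21.predecessor = n32, n32
n32.successor, n32.predecessor = n21, n21

n26 = Node(26)
n26.join(n21)     # N26 acquires N32 as its successor
n26.stabilize()   # N26 notifies N32; N32 acquires N26 as its predecessor
n21.stabilize()   # N21 learns about N26 and notifies it

print(n21.successor.id, n26.successor.id, n26.predecessor.id, n32.predecessor.id)
```

After these two stabilization rounds all successor and predecessor pointers around N26 are consistent; finger tables would still need fix_fingers().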
Failure Recovery
Key step in failure recovery is maintaining correct successor pointers
To help achieve this, each node maintains a successor-list of its r nearest successors on the ring
If node n notices that its successor has failed, it replaces it with the first live entry in the list
stabilize will correct finger table entries and successor-list entries pointing to failed node
Performance is sensitive to the frequency of node joins and leaves versus the frequency at which the stabilization protocol is invoked
Impact of node joins on lookups
All finger table entries are correct => O(log N) lookups
Successor pointers correct, but fingers inaccurate => correct but slower lookups
Impact of node joins on lookups
Stabilization completed => no influence on performance
Only in the rare case that a large number of nodes join between the target's predecessor and the target is the lookup slightly slower
No influence on performance as long as fingers are adjusted faster than the network doubles in size
Failure of nodes
Correctness relies on correct successor pointers
What happens, if N14, N21, N32 fail simultaneously?
How can N8 acquire N38 as successor?
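One answer is the successor list: N8 skips dead entries until it finds a live one. The sketch below assumes a successor list of length r = 4 and a known set of live nodes; in a real system liveness would be probed over the network, and the helper name is hypothetical.

```python
def first_live(successor_list, alive):
    """Return the first live entry of the successor list, or None if all entries failed."""
    for node_id in successor_list:
        if node_id in alive:
            return node_id
    return None

alive = {1, 8, 38, 42, 48, 51, 56}   # N14, N21, and N32 have failed
print(first_live([14, 21, 32, 38], alive))  # r = 4: N8 falls back to N38
print(first_live([14, 21, 32], alive))      # r = 3: every known successor is dead
```

The second call shows why r must exceed the number of consecutive simultaneous failures the ring is expected to survive.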
Voluntary Node Departures
Can be treated as node failures
Two possible enhancements
Leaving node may transfer all its keys to its successor
Leaving node may notify its predecessor and successor about each other so that they can update their links
Chord – facts
Every node is responsible for about K/N keys (N nodes, K keys)
When a node joins or leaves an N-node network, only O(K/N) keys change hands (and only to and from joining or leaving node)
Lookups need O(log N) messages
To reestablish routing invariants and finger tables after a node joins or leaves, only O(log^2 N) messages are required
Pastry
Self-organizing overlay network of nodes
With high probability, nodes with adjacent nodeId are diverse in geography, ownership, jurisdiction, network attachment, etc…
Pastry takes into account network locality (e.g. IP routing hops).
Pastry
Instead of organizing the id-space as a Chord-like ring, the routing is based on numeric closeness of identifiers
The focus is not only on the no. of routing hops, but also on network locality – as factors in routing efficiency
Pastry Identifier space:
Nodes and data items are uniquely associated with m-bit ids, i.e. integers in the range 0 to 2^m - 1, where m is typically 128
Pastry views ids as strings of digits to the base 2^b, where b is typically chosen to be 4
A key is located on the node to whose node id it is numerically closest
Routing Goal
Pastry routes messages to the node whose nodeId is numerically closest to the given key in fewer than log_{2^b} N steps (rounded up):
“A heuristic ensures that among a set of nodes with the k closest nodeIds to the key, the message is likely to first reach a node near the node from which the message originates, in terms of the proximity metric”
Routing Information
Pastry's node state is divided into 3 main elements
The routing table, similar to Chord's finger table, stores links to nodes across the id-space
The leaf set contains nodes which are close in the id-space
Nodes that are close together in terms of network locality are listed in the neighbourhood set
Routing Table
A Pastry node's routing table is made up of up to m/b rows (about log_{2^b} N of them populated) with 2^b - 1 entries per row
On node n, entries in row i hold the identities of Pastry nodes whose node-ids share an i-digit prefix with n but differ from n in the next digit
For example, the first row (row 0) is populated with nodes that have no prefix in common with n
When there is no node with an appropriate prefix, the corresponding entry is left empty
The single-digit entry in each row shows the corresponding digit of the present node's id, i.e. the prefix matches the current id up to that digit, so the next row down or the leaf set should be examined to find a route.
Routing Table
Routing tables (RT) thus built achieve an effect similar to Chord finger table
The detail of the routing information increases with the proximity of other nodes in the id-space
Without a large number of nearby nodes, the last rows of the RT are only sparsely populated; intuitively, the id-space would need to be fully exhausted with node-ids for complete RTs on all nodes
In populating the RT, there is a choice from the set of nodes with the appropriate id-prefix
During the routing process, network locality can be exploited by selecting nodes which are close in terms of the network proximity metric
Leaf Set
The routing tables sort node ids by prefix. To increase lookup efficiency, the leaf set L of a node n holds the |L| nodes numerically closest to n (|L|/2 smaller and |L|/2 larger; |L| = 2^b or 2 × 2^b, normally)
The RT and the leaf set are the two sources of information relevant for routing
The leaf set also plays a role similar to Chord’s successor list in recovering from failures of adjacent nodes
Neighbourhood Set
Instead of numeric closeness, the neighbourhood set M is concerned with nodes that are close to the current node with regard to the network proximity metric
Thus, it is not involved in routing itself but in maintaining network locality in the routing information
Pastry Node State (Base 4)
L: nodes that are numerically closest to the present node (2^b or 2 × 2^b entries)
R: nodes with ids of the form "common prefix with 10233102, next digit, rest of nodeId" (log_{2^b} N rows, 2^b - 1 columns)
M: nodes that are closest according to the proximity metric (2^b or 2 × 2^b entries)
Routing
Key D arrives at the node with nodeId A
R_l^i: entry in the routing table at column i and row l
L_i: i-th closest nodeId in the leaf set
D_l: value of the l-th digit in the key D
shl(A,B): length of the prefix shared by A and B, in digits
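The notation can be made concrete with two helpers. This is a minimal sketch assuming ids are held as integers and viewed as fixed-length digit strings in base 2^b; the key value is made up for the example.

```python
def digits(node_id, b, length):
    """Digits of node_id in base 2**b, most significant first, padded to `length`."""
    base = 2 ** b
    out = []
    for _ in range(length):
        out.append(node_id % base)
        node_id //= base
    return out[::-1]

def shl(a, d):
    """Length of the prefix (in digits) shared by digit lists a and d."""
    n = 0
    for x, y in zip(a, d):
        if x != y:
            break
        n += 1
    return n

# Base-4 ids (b = 2), eight digits each; the key is an illustrative value.
A = digits(int("10233102", 4), 2, 8)
D = digits(int("10233232", 4), 2, 8)
l = shl(A, D)
print(A, D, l, D[l])
```

With l = shl(A, D), the routing-table entry to consult is the one in row l and column D_l, i.e. the first digit in which the key differs from the local id.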
Routing
Routing is divided into two main steps:
First, a node checks whether the key K is within the range of its leaf set
If it is the case, it implies that K is located in one of the nearby nodes of the leaf set. Thus, the node forwards the query to the leaf set node numerically closest to K. In case this is the node itself, the routing process is finished.
Routing
If K does not fall within the range of the leaf set, the query needs to be forwarded over a large distance using the routing table
In this case, a node n tries to pass the query on to a node which shares a longer common prefix with K than n itself
If there is no such entry in the RT, the query is forwarded to a node which shares a prefix with K of the same length as n but which is numerically closer to K than n
Routing
This scheme ensures that routing loops do not occur because the query is routed strictly to a node with a longer common identifier prefix than the current node, or to a numerically closer node with the same prefix
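The routing rule can be sketched as a single next-hop function. The code below is an illustration under simplifying assumptions: ids are equal-length base-4 strings, the routing table is a dictionary keyed by (row, digit), wrap-around of the id-space is ignored, and the node ids are made up for the example.

```python
def shl(a, b):
    """Length of the digit prefix shared by id strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(node, key, leaf_set, table, base=4):
    """Return the id to forward `key` to, or `node` itself if it is the destination."""
    value = lambda x: int(x, base)
    dist = lambda x: abs(value(x) - value(key))

    # Step 1: key within the leaf-set range -> numerically closest leaf (or self).
    members = leaf_set + [node]
    if min(map(value, members)) <= value(key) <= max(map(value, members)):
        return min(members, key=dist)

    # Step 2: routing-table entry that shares a longer prefix with the key.
    l = shl(node, key)
    entry = table.get((l, key[l]))
    if entry is not None:
        return entry

    # Step 3 (rare): any known node with an equally long prefix that is numerically closer.
    known = leaf_set + list(table.values())
    closer = [x for x in known if shl(x, key) >= l and dist(x) < dist(node)]
    return min(closer, key=dist) if closer else node

node = "1023"
leaf_set = ["1021", "1022", "1030", "1031"]
table = {
    (0, "0"): "0213", (0, "2"): "2301", (0, "3"): "3120",
    (1, "1"): "1132", (1, "2"): "1203",           # entry (1, "3") is empty
    (2, "0"): "1003", (2, "1"): "1012", (2, "3"): "1031",
    (3, "0"): "1020", (3, "1"): "1021", (3, "2"): "1022",
}

print(next_hop(node, "1022", leaf_set, table))  # covered by the leaf set
print(next_hop(node, "3201", leaf_set, table))  # routing table, row 0
print(next_hop(node, "1301", leaf_set, table))  # empty entry, falls back to step 3
```

Every hop either lengthens the shared prefix or keeps it and reduces the numeric distance, which is the property that rules out loops.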
Routing performance
The routing procedure converges because each step takes the message to a node that either:
Shares a longer prefix with the key than the local node
Shares as long a prefix with the key as the local node, but is numerically closer to the key.
Routing performance
Assumption: Routing tables are accurate and no recent node failures
There are 3 cases in the Pastry routing scheme:
Case 1: Forward the query (according to the RT) to a node with a longer prefix match than the current node.
Thus, the number of nodes with longer prefix matches is reduced by at least a factor of 2^b in each step, so the destination is reached in log_{2^b} N steps.
Routing performance
There are 3 cases:
Case 2: The query is routed via the leaf set (one step). This increases the number of hops by one
Routing performance
There are 3 cases:
Case 3: The key is not covered by the leaf set, and the RT does not contain an entry with a longer matching prefix than the current node
Consequently, the query is forwarded to a node with the same prefix length, adding an additional routing hop.
For a moderate leaf set size (|L| = 2 × 2^b), the probability of this case is less than 0.6%. So, it is very unlikely that more than one additional hop is incurred.
Routing performance
As a result, the complexity of routing remains at O(log_{2^b} N) on average
Higher values of b lead to faster routing but also increase the amount of state that needs to be managed at each node
Thus, b is typically 4, but a Pastry implementation can choose an appropriate trade-off for a specific application
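The trade-off can be quantified with a rough calculation: the number of populated routing-table rows is about log_{2^b} N, and each row has 2^b - 1 entries. The sketch below uses integer arithmetic and ignores the leaf and neighbourhood sets; the function name is hypothetical.

```python
def pastry_cost(n_nodes, b):
    """Return (rows, routing-table entries) for N nodes and digit width b."""
    base = 2 ** b
    rows, reach = 0, 1
    while reach < n_nodes:   # smallest rows with base**rows >= N
        reach *= base
        rows += 1
    return rows, rows * (base - 1)

for b in (1, 2, 4):
    rows, entries = pastry_cost(10 ** 6, b)
    print(f"b={b}: about {rows} hops, {entries} routing-table entries")
```

For a million nodes, b = 4 gives about 5 hops at the price of 75 routing-table entries, while b = 2 needs about 10 hops with 30 entries.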
Join and Failure
Join
Use routing to find the numerically closest node already in the network
Ask for state from all nodes on the route and initialize own state
Error correction
Failed leaf node: contact a leaf node on the side of the failed node and add an appropriate new neighbor
Failed table entry: contact a live entry with the same prefix as the failed entry until a new live entry is found; if none is found, keep trying with longer-prefix table entries
Self Organization: Node Arrival
The new node n is assumed to know a nearby Pastry node k, based on the network proximity metric
Now n needs to initialize its RT, leaf set and neighbourhood set.
Since k is assumed to be close to n, the nodes in k's neighbourhood set are reasonably good choices for n, too.
Thus, n copies the neighbourhood set from k.
Self Organization: Node Arrival
To build its RT and leaf set, n routes a special join message via k to a key equal to n
According to the standard routing rules, the query is forwarded to the node c with the numerically closest id; hence the leaf set of c is suitable for n, so n retrieves c's leaf set for itself.
The join request triggers all nodes which forwarded the query towards c to provide n with their routing information.
Self Organization: Node Arrival
Node n's RT is constructed from the routing information of these nodes, starting at row 0.
As this row is independent of the local node id, n can use the entries in row zero of k's routing table
In particular, it is assumed that n and k are close in terms of the network proximity metric
Since k stores nearby nodes in its RT, these entries are also close to n.
In the general case of n and k not sharing a common prefix, n cannot reuse entries from any other row in k's RT.
Self Organization: Node Arrival
The route of the join message from n to c leads via nodes v1, v2, …, vn, where the prefix that vi shares with n grows longer with increasing i
Thus, row 1 from the RT of v1 is also a good choice for the same row of the RT of n
The same is true for row 2 on node v2 and so on
Based on this information, the RT of n can be constructed.
Self Organization: Node Arrival
Finally, the new node sends its node state to all nodes in its routing data so that these nodes can update their own routing information accordingly
In contrast to lazy updates in Chord, this mechanism actively updates the state in all affected nodes when a new node joins the system
At this stage, the new node is fully present and reachable in the Pastry network
Node Failure
Node failure is detected when a communication attempt with another node fails. Routing requires contacting nodes from the RT and leaf set, resulting in lazy detection of failures
During routing, the failure of a single node in the RT does not significantly delay the routing process. The local node can choose to forward the query to a different node from the same row in the RT. (Alternatively, a node could store backup nodes with each entry in the RT)
Node Failure
To replace the failed node at entry i in row j of its RT (R_j^i), a node contacts another node referenced in row j
Entries in the same row j of the remote node are valid for the local node, and hence it can copy entry R_j^i from the remote node to its own RT
In case that entry has failed as well, it can probe another node in row j for entry R_j^i
If no live node with an appropriate nodeId prefix can be obtained in this way, the local node queries nodes from the preceding row R_{j-1}
Node Failure
Repairing a failed entry in the leaf set of a node is straightforward: it uses the leaf sets of other nodes referenced in the local leaf set.
The node contacts the leaf-set node with the largest index on the side of the failed node and asks for its leaf set
If this node is unavailable, the local node can fall back to leaf-set nodes with smaller indices
Node Departure
Neighborhood node: asks other members for their M, checks the distance of each of the newly discovered nodes, and updates its own neighborhood set accordingly.
Locality
“Route chosen for a message is likely to be good with respect to the proximity metric”
Discussion:
Locality in the routing table
Route locality
Locating the nearest among k nodes
Locality in the routing table
Node A is near X
A's R0 entries are close to A, A is close to X, and the triangle inequality holds, so these entries are also relatively near X.
Likewise, obtaining X's neighborhood set from A is appropriate.
B's R1 entries are a reasonable choice for R1 of X
Entries in each successive row are chosen from a set whose size decreases exponentially.
The expected distance from B to any of its R1 entries is much larger than the expected distance traveled from node A to B.
Second stage: X requests the state from each of the nodes in its routing table and neighborhood set to update its entries to closer nodes.
Routing locality
Each routing step moves the message closer to the destination in the nodeId space, while traveling the least possible distance in the proximity space.
Given that:
A message routed from A to B at distance d cannot subsequently be routed to a node with a distance of less than d from A
The expected distance traveled by a message during each successive routing step is exponentially increasing
Since a message tends to make larger and larger strides with no possibility of returning to a node within d_i of any node i encountered on the route, the message has nowhere to go but towards its destination
Locating the nearest among k nodes
Goal: among the k numerically closest nodes to a key, a message tends to first reach a node near the client.
Problem: Since Pastry routes primarily based on nodeId prefixes, it may miss nearby nodes with a different prefix than the key.
Solution (using a heuristic): Based on estimating the density of nodeIds, it detects when a message approaches the set of k nodes and then switches to routing based on the numerically nearest address to locate the nearest replica.
Arbitrary node failures and network partitions
Node continues to be responsive, but behaves incorrectly or even maliciously.
Repeated queries fail each time since they normally take the same route.
Solution: Routing can be randomized
The choice among multiple nodes that satisfy the routing criteria should be made randomly
CAN : Content Addressable Network
Hash value is viewed as a point in a D-dimensional Cartesian space
Hash value points <n1, n2, …, nD>.
Each node is responsible for a D-dimensional “cube” in the space
Nodes are neighbors if their cubes “touch” at more than just a point
• Example: D=2
• 1’s neighbors: 2,3,4,6
• 6’s neighbors: 1,2,4,5
• Squares “wrap around”, e.g., 7 and 8 are neighbors
• Expected # neighbors: O(D)
CAN : Routing
To get to <n1, n2, …, nD> from <m1, m2, …, mD>, choose the neighbor with the smallest Cartesian distance to the destination <n1, n2, …, nD> (e.g., measured from the neighbor’s center)
• e.g., region 1 needs to send to the node covering X
• Checks all neighbors, node 2 is closest
• Forwards message to node 2
• Cartesian distance monotonically decreases with each transmission
• Expected # overlay hops: (D/4) · N^(1/D)
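Greedy routing and the hop estimate can be tried out on an idealized CAN. The sketch below assumes equal-sized zones arranged as a D-dimensional grid with wrap-around, each zone identified by its grid coordinates; real CAN zones are uneven, so this is only an approximation for illustration.

```python
from itertools import product

def torus_dist2(p, q, side):
    """Squared Cartesian distance between grid cells on a wrap-around space."""
    total = 0
    for a, b in zip(p, q):
        d = abs(a - b)
        d = min(d, side - d)
        total += d * d
    return total

def neighbors(p, side):
    """Cells that touch p along one dimension (with wrap-around)."""
    for dim in range(len(p)):
        for step in (-1, 1):
            q = list(p)
            q[dim] = (q[dim] + step) % side
            yield tuple(q)

def route(src, dst, side):
    """Forward greedily to the neighbor closest to the destination."""
    path = [src]
    while path[-1] != dst:
        path.append(min(neighbors(path[-1], side),
                        key=lambda q: torus_dist2(q, dst, side)))
    return path

side, D = 8, 2                   # N = 64 zones
path = route((0, 0), (6, 2), side)
hops = [len(route((0, 0), dst, side)) - 1
        for dst in product(range(side), repeat=D)]
mean_hops = sum(hops) / len(hops)
print(len(path) - 1, mean_hops, (D / 4) * (side ** D) ** (1 / D))
```

On this 8 × 8 grid the measured average path length matches the (D/4) · N^(1/D) estimate.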
CAN : Join
To join the CAN:
find some node in the CAN (via bootstrap process)
choose a point in the space uniformly at random
using CAN, inform the node that currently covers the space
that node splits its space in half
1st split along 1st dimension
if last split along dimension i < D, next split along i+1st dimension
e.g., for 2-d case, split on x-axis, then y-axis
keeps half the space and gives other half to joining node
The likelihood of a rectangle being selected is proportional to its size, i.e., big rectangles are chosen more frequently
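The split rule can be sketched as follows. The code is illustrative: zones are kept in a flat list instead of being located by CAN routing, and giving the joiner the half that contains its chosen point is an assumption, since the rule only says the joiner receives one half.

```python
class Zone:
    def __init__(self, lo, hi, owner, last_split=-1):
        self.lo, self.hi = list(lo), list(hi)
        self.owner = owner
        self.last_split = last_split   # dimension of the most recent split

    def contains(self, point):
        return all(l <= x < h for l, x, h in zip(self.lo, point, self.hi))

    def volume(self):
        v = 1.0
        for l, h in zip(self.lo, self.hi):
            v *= h - l
        return v

def join(zones, point, new_owner):
    """Split the zone covering `point` in half and hand one half to the joiner."""
    zone = next(z for z in zones if z.contains(point))
    dim = (zone.last_split + 1) % len(point)     # cycle through the dimensions
    mid = (zone.lo[dim] + zone.hi[dim]) / 2
    upper = Zone(zone.lo, zone.hi, zone.owner, dim)
    upper.lo[dim] = mid
    zone.hi[dim] = mid
    zone.last_split = dim
    # Assumption: the joiner takes the half that contains its chosen point.
    target = zone if zone.contains(point) else upper
    target.owner = new_owner
    zones.append(upper)
    return target

zones = [Zone([0, 0], [1, 1], "A")]
join(zones, (0.8, 0.3), "B")   # split along x
join(zones, (0.9, 0.9), "C")   # B's zone is split along y
for z in zones:
    print(z.owner, z.lo, z.hi)
```

A zone covering more of the space is more likely to contain the random point, which is why large rectangles are split more often.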
CAN: Join
[Figures: the join sequence in a 2-d space, with a bootstrap node and a new node.]
1) Discover some node “I” already in CAN
2) Pick random point (x,y) in space
3) I routes to (x,y), discovers node J
4) Split J’s zone in half; the new node owns one half
CAN Failure recovery
View partitioning as a binary tree
Leaves represent regions covered by overlay nodes
Intermediate nodes represent “split” regions that could be “reformed”
Siblings are regions that can be merged together (forming the region that is covered by their parent)
CAN Failure Recovery
Failure recovery when leaf S is removed
Find a leaf node T that is either
S’s sibling, or
a descendant of S’s sibling where T’s sibling is also a leaf node
T takes over S’s region (moves to S’s position in the tree)
T’s sibling takes over T’s previous region
Maintenance
Use zone takeover in case of failure or leaving of a node
Send your neighbor table to neighbors to inform that you are alive at discrete time interval t
If your neighbor does not send an alive message within time t, take over its zone
Zone reassignment is needed
Zone reassignment
[Figures: two examples, each showing the zoning of the space among nodes 1 to 4 together with the corresponding partition tree, before and after a node is removed (node 2 in the first example, node 3 in the second).]