Peer-to-Peer Structured Overlay Networks
Antonino Virgillito
Background
Peer-to-peer systems
• distribution
• symmetry (communication, node roles)
• decentralized control
• self-organization
• dynamicity
Data Lookup in P2P Systems
• Data items spread over a large number of nodes
• Which node stores which data item?
• A lookup mechanism is needed
  – Centralized directory -> bottleneck/single point of failure
  – Query flooding -> scalability concerns
  – Need more structure!
More Issues
• Organize, maintain overlay network
  – node arrivals
  – node failures
• Resource allocation/load balancing
• Resource location
• Network proximity routing
What is a Distributed Hash Table?
• Exactly that
• A service, distributed over multiple machines, with hash table semantics
  – put(key, value), value = get(key)
• Designed to work in a peer-to-peer (P2P) environment
• No central control
• Nodes under different administrative control
• But of course can operate in an “infrastructure” sense
What is a DHT?
• Hash table semantics: put(key, value), value = get(key)
• Key is a single flat string
• Limited semantics compared to keyword search
• put() causes the value to be stored at one (or more) peer(s)
• get() retrieves the value from a peer
• put() and get() accomplished with unicast routed messages
  – In other words, it scales
• Other API calls to support the application, like notification when neighbors come and go
Distributed Hash Tables (DHT)
[Figure: P2P overlay network of nodes arranged in a ring, storing key/value pairs (k1,v1) … (k6,v6); operations: put(k,v), get(k)]
• p2p overlay maps keys to nodes
• completely decentralized and self-organizing
• robust, scalable
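The put/get semantics above can be sketched in a few lines: hash both node names and keys onto one circular id space, and store each pair at the first node at or after the key's position. All names here (`node-0` … `node-3`, the stored key/value) are hypothetical, and `M` is kept small purely for illustration.

```python
import hashlib

M = 16  # bits in the identifier space; a hypothetical small value for illustration

def hash_id(key: str) -> int:
    """Hash a key (or node name) onto the circular id space [0, 2^M)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big") % (2 ** M)

def successor(node_ids, k):
    """First node id equal to or greater than k, wrapping around the ring."""
    ring = sorted(node_ids)
    return next((n for n in ring if n >= k), ring[0])

# four hypothetical peers, each holding a slice of the table
nodes = [hash_id(f"node-{i}") for i in range(4)]
store = {n: {} for n in nodes}

def put(key, value):
    store[successor(nodes, hash_id(key))][key] = value

def get(key):
    return store[successor(nodes, hash_id(key))].get(key)

put("song.mp3", "peer-42")
```

Because the mapping is deterministic, any peer that evaluates `successor(nodes, hash_id(key))` reaches the same node, with no central directory.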
Popular DHTs
• Tapestry (Berkeley)
  – Based on Plaxton trees---similar to hypercube routing
  – The first* DHT
  – Complex and hard to maintain (hard to understand too!)
• CAN (ACIRI), Chord (MIT), and Pastry (Rice/MSR Cambridge)
  – Second wave of DHTs (contemporary with and independent of each other)
DHTs Basics
• Node IDs can be mapped to the hash key space
• Given a hash key as a “destination address”, you can route through the network to a given node
• Always route to the same node no matter where you start from
• Requires no centralized control (completely distributed)
• Small per-node state is independent of the number of nodes in the system (scalable)
• Nodes can route around failures (fault-tolerant)
Things to look at
• What is the structure?
• How does routing work in the structure?
• How does it deal with node joins and departures (structure maintenance)?
• How does it scale?
• How does it deal with locality?
• What are the security issues?
The Chord Approach
• Consistent Hashing
• Logical Ring
• Finger Pointers
The Chord Protocol
• Provides:
  – A mapping successor: key -> node
  – To look up key K, go to node successor(K)
• successor is defined using consistent hashing:
  – Key hash
  – Node hash
  – Both keys and nodes hash to the same (circular) identifier space
  – successor(K) = first node with hash ID equal to or greater than hash(K)
Example: The Logical Ring
Nodes 0, 1, 3
Keys 1, 2, 6
Consistent Hashing [Karger et al. ‘97]
• Some nice properties:
  – Smoothness: minimal key movement on node join/leave
  – Load balancing: keys equitably distributed over nodes
Mapping Details
• Range of hash function
  – Circular ID space modulo 2^m
• Compute the 160-bit SHA-1 hash, and truncate to m bits
  – Chance of collision is rare if m is large enough
• Deterministic, but hard for an adversary to subvert
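The truncation step can be sketched directly: keeping the low m bits of the digest is the same as reducing it modulo 2^m. The value m = 32 below is a hypothetical choice.

```python
import hashlib

m = 32  # hypothetical choice of m

def chord_id(data: bytes) -> int:
    """Truncate the 160-bit SHA-1 digest to its low m bits (i.e. mod 2^m)."""
    h = int.from_bytes(hashlib.sha1(data).digest(), "big")
    return h % (2 ** m)
```

The same input (e.g. a node's IP:port, or a key) always maps to the same point on the ring, which is what makes lookups start anywhere and converge on the same node.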
Chord State
• Successor/Predecessor in the Ring
• Finger pointers
  – n.finger[i] = successor(n + 2^(i-1))
  – Each node knows more about the portion of the circle close to it!
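The finger spacing is easy to see by computing the ring positions each finger covers. This is a minimal sketch with m = 6 (a 64-id ring, as in the small examples on these slides):

```python
m = 6  # identifier space of 2^6 = 64 ids

def finger_targets(n: int):
    """Ring positions n + 2^(i-1) (mod 2^m) for i = 1..m.
    finger[i] of node n is the successor of the i-th of these positions."""
    return [(n + 2 ** (i - 1)) % 2 ** m for i in range(1, m + 1)]
```

For node 1 the targets are 2, 3, 5, 9, 17, 33: exponentially spaced, so half of a node's fingers land in the quarter of the ring nearest to it.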
Example: Finger Tables
Chord: routing protocol
– A set of nodes towards id is contacted remotely
– Each node is queried for the known node which is closest to id
– The process stops when a node is found whose successor is responsible for id
• Notation: n.foo() stands for a remote call to node n
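The routing steps above can be sketched as an iterative lookup over an in-memory ring. This is a simulation, not real RPC: each "remote call" is just a dictionary access, and the three-node ring (ids 0, 1, 3 with m = 6) mirrors the slides' example.

```python
m = 6
RING = 2 ** m

def in_interval(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)

class Node:
    def __init__(self, nid, all_ids):
        self.id = nid
        ring = sorted(all_ids)
        succ = lambda k: next((n for n in ring if n >= k), ring[0])
        self.successor = succ((nid + 1) % RING)
        self.finger = [succ((nid + 2 ** i) % RING) for i in range(m)]

def lookup(nodes, start_id, key):
    """Hop node to node until key falls between a node and its successor."""
    n = nodes[start_id]
    while not in_interval(key, n.id, n.successor):
        # pick the closest known finger preceding the key
        nxt = n.id
        for f in reversed(n.finger):
            if in_interval(f, n.id, (key - 1) % RING):
                nxt = f
                break
        n = nodes[n.successor] if nxt == n.id else nodes[nxt]
    return n.successor  # the node responsible for key

ids = [0, 1, 3]
nodes = {i: Node(i, ids) for i in ids}
```

Starting from any node, `lookup` converges on the same successor, e.g. key 6 wraps around the ring to node 0.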
Example: Chord Routing
Finger Pointers for Node 1
Lookup Complexity
• With high probability: O(log(N))
• Proof intuition:
  – Let p be the successor of the targeted key; the distance to p reduces by at least half in each step
  – In m steps, we would reach p
  – Stronger claim: in O(log(N)) steps, distance ≤ 2^m/N; thereafter even linear advance suffices to give O(log(N)) lookup complexity
Chord invariants
• Every key in the network can be located as long as the following invariants are preserved after joins and leaves:
  – Each node’s successor is correctly maintained
  – For every key k, node successor(k) is responsible for k
Chord: Node Joins
• New node B learns of at least one existing node A via external means
• B asks A to look up its finger-table information
  – Given that B’s hash-id is b, A does a lookup for B.finger[i] = successor(b + 2^(i-1)) if the interval is not already included in finger[i-1]
  – B stores all finger information and sets up pred/succ pointers
Node Joins (contd.)
• Update the finger tables of existing nodes p such that:
  1. p precedes b by at least 2^(i-1)
  2. the i-th finger of node p succeeds b
  – Start from p = predecessor(b - 2^(i-1)) and proceed in counter-clockwise direction while 2. is true
• Transferring keys:
  – Only from successor(b) to b
  – Must send notification to the application
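The key-transfer rule can be sketched concretely: the joining node takes from its successor exactly the keys in the interval between its predecessor and itself. The ring below (nodes 0, 1, 3 with keys 1, 2, 6, node 6 joining) follows the slides' example; the stored values are hypothetical.

```python
def in_interval(x, a, b):
    """x in the circular interval (a, b]"""
    return a < x <= b if a < b else (x > a or x <= b)

def transfer_keys(store, succ_id, new_id, pred_id):
    """When new_id joins, it takes from successor(new_id) exactly the keys
    in (pred_id, new_id], the interval it now becomes responsible for."""
    moved = {k: v for k, v in store[succ_id].items()
             if in_interval(k, pred_id, new_id)}
    for k in moved:
        del store[succ_id][k]
    store[new_id] = moved
    return moved  # the application is notified of each transferred key

# nodes 0, 1, 3 hold keys 6, 1, 2 respectively (key 6 wraps to successor 0);
# node 6 joins with predecessor 3 and successor 0
store = {0: {6: "v6"}, 1: {1: "v1"}, 3: {2: "v2"}}
moved = transfer_keys(store, succ_id=0, new_id=6, pred_id=3)
```

Only the successor of the new node is touched; every other node's keys stay put, which is the smoothness property of consistent hashing.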
Example: finger table update
Node 6 joins
Example: transferring keys
Node 1 leaves
Concurrent Joins/Leaves
• Need a stabilization protocol to guard against inconsistency
• Note:
  – Incorrect finger pointers may only increase latency, but incorrect successor pointers may cause lookup failure!
• Nodes periodically run the stabilization protocol
  – Find the successor’s predecessor
  – Repair if this isn’t the node itself
• This algorithm is also run at join
Example: node 25 joins
Example: node 28 joins before 20 stabilizes (1)
Example: node 28 joins before 20 stabilizes (2)
CAN
• Virtual d-dimensional Cartesian coordinate system on a d-torus
  – Example: 2-d [0,1]x[0,1]
• Dynamically partitioned among all nodes
• Pair (K,V) is stored by mapping key K to a point P in the space using a uniform hash function and storing (K,V) at the node in the zone containing P
• Retrieve entry (K,V) by applying the same hash function to map K to P and retrieving the entry from the node in the zone containing P
  – If P is not contained in the zone of the requesting node or its neighboring zones, route the request to the neighbor node in the zone nearest P
Routing in a CAN
• Follow straight line path through the Cartesian space from source to destination coordinates
• Each node maintains a table of the IP address and virtual coordinate zone of each local neighbor
• Use greedy routing to neighbor closest to destination
• For a d-dimensional space partitioned into n equal zones, nodes maintain 2d neighbors
  – Average routing path length: (d/4) n^(1/d)
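Greedy routing on the torus can be sketched as picking the neighbor nearest the destination, with distances computed so that coordinates wrap around. The neighbor coordinates below are hypothetical zone centers, a simplification of routing on full zones.

```python
def torus_dist(p, q):
    """Euclidean distance on the unit d-torus: each coordinate wraps at 1."""
    return sum(min(abs(a - b), 1 - abs(a - b)) ** 2
               for a, b in zip(p, q)) ** 0.5

def greedy_step(neighbors, dest):
    """Forward to the neighbor (here: its zone center) closest to dest."""
    return min(neighbors, key=lambda n: torus_dist(n, dest))

# hypothetical zone centers of the current node's neighbors in 2-d
neighbors = [(0.25, 0.75), (0.75, 0.25), (0.25, 0.25)]
```

Note the wrap-around in `torus_dist`: points at 0.05 and 0.95 are 0.1 apart, not 0.9, which is why routes can cross the edges of the coordinate space.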
CAN Construction
• Joining node locates a bootstrap node using the CAN DNS entry
  – Bootstrap node provides IP addresses of random member nodes
• Joining node sends a JOIN request to a random point P in the Cartesian space
• Node in the zone containing P splits the zone and allocates “half” to the joining node
• (K,V) pairs in the allocated “half” are transferred to the joining node
• Joining node learns its neighbor set from the previous zone occupant
  – Previous zone occupant updates its neighbor set
Departure, Recovery and Maintenance
• Graceful departure: node hands over its zone and the (K,V) pairs to a neighbor
• Network failure: unreachable node(s) trigger an immediate takeover algorithm that allocates the failed node’s zone to a neighbor
  – Detect via lack of periodic refresh messages
  – Neighbor nodes start a takeover timer initialized in proportion to their zone volume
  – Send a TAKEOVER message containing zone volume to all of the failed node’s neighbors
  – If the received TAKEOVER volume is smaller, kill the timer; if not, reply with a TAKEOVER message
  – Nodes agree on the neighbor with the smallest volume that is alive
Pastry
Generic p2p location and routing substrate
• Self-organizing overlay network
• Lookup/insert object in < log_16 N routing steps (expected)
• O(log N) per-node state
• Network proximity routing
Pastry: Object distribution
[Figure: 128-bit circular id space from 0 to 2^128-1; nodeIds and objIds assigned uniformly at random by consistent hashing]
• Invariant: the node with the numerically closest nodeId maintains the object
Pastry: Object insertion/lookup
[Figure: Route(X) across the 128-bit id ring]
• A msg with key X is routed to the live node with nodeId closest to X
• Problem: a complete routing table is not feasible
Pastry: Routing table (# 65a1fc)
[Figure: routing table of node 65a1fc, with log_16 N rows of 15 entries each
 Row 0: 0x 1x 2x 3x 4x 5x 7x 8x 9x ax bx cx dx ex fx (every first digit except the node’s own digit 6)
 Row 1: 60x 61x 62x 63x 64x 66x 67x 68x 69x 6ax 6bx 6cx 6dx 6ex 6fx (except 65x)
 Row 2: 650x 651x 652x 653x 654x 655x 656x 657x 658x 659x 65bx 65cx 65dx 65ex 65fx (except 65ax)
 Row 3: 65a0x 65a2x 65a3x 65a4x 65a5x 65a6x 65a7x 65a8x 65a9x 65aax 65abx 65acx 65adx 65aex 65afx (except 65a1x)
 Row i holds one entry per value of digit i, pointing to a node that shares the first i digits with 65a1fc]
Pastry: Leaf sets
Each node maintains IP addresses of the nodes with the L/2 numerically closest larger and smaller nodeIds, respectively.
• routing efficiency/robustness
• fault detection (keep-alive)
• application-specific local coordination
Pastry: Routing procedure

if (destination D is within range of our leaf set)
    forward to numerically closest member
else
    let l = length of shared prefix
    let d = value of l-th digit in D’s address
    if (R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node that
        (a) shares at least as long a prefix
        (b) is numerically closer than this node
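The procedure above can be sketched as a single forwarding decision, assuming the routing table R is stored as `table[l][d]` and ids are fixed-length lowercase hex strings (4 digits here, instead of Pastry's 32, purely for illustration):

```python
def shared_prefix_len(a: str, b: str) -> int:
    l = 0
    while l < len(a) and a[l] == b[l]:
        l += 1
    return l

def route_step(self_id, dest, leaf_set, table):
    """One Pastry forwarding decision. table[l][d] holds a node sharing l
    digits with self_id and having digit d at position l, or None."""
    num = lambda n: int(n, 16)
    if min(leaf_set) <= dest <= max(leaf_set):      # within leaf-set range
        return min(leaf_set, key=lambda n: abs(num(n) - num(dest)))
    l = shared_prefix_len(self_id, dest)
    d = int(dest[l], 16)
    if table[l][d] is not None:
        return table[l][d]                          # one more matching digit
    # rare case: any known node with >= l shared digits, numerically closer
    known = [n for row in table for n in row if n] + list(leaf_set)
    return min((n for n in known
                if shared_prefix_len(n, dest) >= l
                and abs(num(n) - num(dest)) < abs(num(self_id) - num(dest))),
               key=lambda n: abs(num(n) - num(dest)), default=None)

# hypothetical state for node 65a1: one row-0 entry and a two-node leaf set
table = [[None] * 16 for _ in range(4)]
table[0][13] = "d13d"                               # entry for digit 'd'
leaf = ["6588", "65b2"]
```

Lexicographic comparison works for the leaf-set range check because the ids are fixed-length lowercase hex. Each table hop fixes one more digit of the destination, which is where the log_16 N bound comes from.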
Pastry: Routing
Properties
• log_16 N steps
• O(log N) state
[Figure: Route(d46a1c) starting at node 65a1fc, hopping through d13da3, d4213f, d462ba toward the nodes d467c4 and d471f1 numerically closest to the key]
Pastry: Performance
Integrity of overlay message delivery:
• guaranteed unless L/2 simultaneous failures of nodes with adjacent nodeIds

Number of routing hops:
• No failures: < log_16 N expected, 128/b + 1 max
• During failure recovery:
  – O(N) worst case, average case much better
Pastry Join
• X = new node, A = bootstrap node, Z = node numerically nearest to X
• A finds Z for X
• In the process, A, Z, and all nodes on the path send their state tables to X
• X settles on its own table
  – Possibly after contacting other nodes
• X tells everyone who needs to know about itself
Pastry Leave
• Noticed by leaf-set neighbors when the leaving node doesn’t respond
  – Neighbors ask the highest and lowest nodes in their leaf set for a new leaf set
• Noticed by routing neighbors when a message forward fails
  – Can immediately route to another neighbor
  – Fix the entry by asking another neighbor in the same “row” for its neighbor
  – If this fails, ask somebody a level up