Peer-to-Peer Structured Overlay Networks
Antonino Virgillito

Slide 1
Peer-to-Peer Structured Overlay Networks
Antonino Virgillito

Slide 2
Background
Peer-to-peer systems: distribution, symmetry (communication, node roles), decentralized control, self-organization, dynamicity

Slide 3
Data Lookup in P2P Systems
Data items are spread over a large number of nodes. Which node stores which data item? A lookup mechanism is needed.
Centralized directory -> bottleneck / single point of failure
Query flooding -> scalability concerns
Need more structure!

Slide 4
More Issues
Organize and maintain the overlay network (node arrivals, node failures)
Resource allocation / load balancing
Resource location
Network proximity routing

Slide 5
What is a Distributed Hash Table?
Exactly that: a service, distributed over multiple machines, with hash table semantics: put(key, value), value = get(key)
Designed to work in a peer-to-peer (P2P) environment: no central control, nodes under different administrative control
But of course it can also be operated in an infrastructure setting

Slide 6
What is a DHT?
Hash table semantics: put(key, value), value = get(key)
The key is a single flat string; limited semantics compared to keyword search
put() causes the value to be stored at one (or more) peer(s); get() retrieves the value from a peer
put() and get() are accomplished with unicast routed messages; in other words, it scales
Other API calls support the application, e.g. notification when neighbors come and go

Slide 7
Distributed Hash Tables (DHT)
[Figure: key/value pairs (k1,v1) ... (k6,v6) mapped onto the nodes of a P2P overlay network]
Operations: put(k,v), get(k)
The P2P overlay maps keys to nodes: completely decentralized and self-organizing, robust, scalable

Slide 8
Popular DHTs
Tapestry (Berkeley): based on Plaxton trees (similar to hypercube routing); the first* DHT; complex and hard to maintain (hard to understand too!)
CAN (ACIRI), Chord (MIT), and Pastry (Rice/MSR Cambridge): the second wave of DHTs (contemporary with and independent of each other)

Slide 9
DHT Basics
Node IDs can be mapped to the hash key space
Given a hash key as a destination address, you can route through the network to a given node
Always routes to the same node no matter where you start from
Requires no centralized control (completely distributed)
Small per-node state, independent of the number of nodes in the system (scalable)
Nodes can route around failures (fault-tolerant)

Slide 10
Things to Look At
What is the structure?
How does routing work in the structure?
How does it deal with node joins and departures (structure maintenance)?
How does it scale?
How does it deal with locality?
What are the security issues?

Slide 11
The Chord Approach
Consistent hashing
Logical ring
Finger pointers

Slide 12
The Chord Protocol
Provides a mapping successor: key -> node
To look up key K, go to node successor(K)
successor is defined using consistent hashing: both keys and nodes hash to the same (circular) identifier space
successor(K) = first node with a hash ID equal to or greater than hash(K)

Slide 13
Example: The Logical Ring
Nodes 0, 1, 3; keys 1, 2, 6
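
Below is a minimal Python sketch (not part of the original slides) of the successor() mapping from Slides 12-13, using the 3-bit ring with nodes 0, 1, 3 and keys 1, 2, 6; the function and variable names are invented for this illustration.

    M = 3                      # identifier space [0, 2^M), as on Slide 13
    NODES = sorted([0, 1, 3])  # node IDs already placed on the ring

    def successor(key_id, nodes=NODES):
        """First node whose ID is equal to or greater than key_id."""
        for n in nodes:
            if n >= key_id:
                return n
        return nodes[0]        # wrap around to the smallest node ID

    # Keys 1, 2 and 6 are stored at nodes 1, 3 and 0 respectively.
    for k in (1, 2, 6):
        print(f"key {k} -> node {successor(k)}")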
Slide 14
Consistent Hashing [Karger et al. '97]
Some nice properties:
Smoothness: minimal key movement on node join/leave
Load balancing: keys are equitably distributed over the nodes

Slide 15
Mapping Details
Range of the hash function: circular ID space modulo 2^m
Compute a 160-bit SHA-1 hash and truncate it to m bits
The chance of collision is rare if m is large enough
Deterministic, but hard for an adversary to subvert

Slide 16
Chord State
Successor/predecessor in the ring
Finger pointers: n.finger[i] = successor(n + 2^(i-1))
Each node knows more about the portion of the circle close to it!

Slide 17
Example: Finger Tables
[Figure: finger tables for the example ring]

Slide 18
Chord: Routing Protocol
A set of nodes progressively closer to id are contacted remotely
Each node is queried for the known node that is closest to id
The process stops when a node is found whose successor is greater than id
Notation: n.foo() stands for a remote call to node n (a code sketch follows Slide 20)

Slide 19
Example: Chord Routing
[Figure: finger pointers for node 1]

Slide 20
Lookup Complexity
With high probability: O(log N)
Proof intuition: let p be the successor of the targeted key; the distance to p is reduced by at least half in each step, so in m steps p would be reached
Stronger claim: in O(log N) steps the distance falls to at most 2^m / N; thereafter even linear advance suffices to give O(log N) lookup complexity
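
A compact Python sketch of the finger-table lookup described on Slides 16-20; find_successor and closest_preceding_finger follow the names used in the Chord paper, while the ring-construction helper assumes global knowledge and is purely illustrative.

    M = 6  # identifier bits; IDs live in [0, 2^M)

    class Node:
        def __init__(self, ident):
            self.id = ident
            self.successor = self          # fixed up by build_ring below
            self.finger = [self] * M       # finger[i] = successor(id + 2^i), 0-based

        @staticmethod
        def _in(x, a, b):
            """True if x lies in the circular interval (a, b]."""
            return (a < x <= b) if a < b else (x > a or x <= b)

        def closest_preceding_finger(self, key):
            """Highest finger strictly between this node and key (Slide 18)."""
            for f in reversed(self.finger):
                if self._in(f.id, self.id, key) and f.id != key:
                    return f
            return self

        def find_successor(self, key):
            """Hop greedily toward key; O(log N) hops w.h.p. (Slide 20)."""
            n = self
            while not self._in(key, n.id, n.successor.id):
                n = n.closest_preceding_finger(key)
            return n.successor

    def build_ring(ids):
        """Wire successors and fingers offline; real Chord builds these via joins."""
        ids = sorted(ids)
        ring = {i: Node(i) for i in ids}
        def succ_of(x):
            return ring[next((i for i in ids if i >= x), ids[0])]
        for i in ids:
            ring[i].successor = succ_of((i + 1) % 2**M)
            ring[i].finger = [succ_of((i + 2**k) % 2**M) for k in range(M)]
        return ring

    ring = build_ring([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
    print(ring[1].find_successor(54).id)   # key 54 is held by node 56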
Slide 21
Chord Invariants
Every key in the network can be located as long as the following invariants are preserved after joins and leaves:
Each node's successor is correctly maintained
For every key k, node successor(k) is responsible for k

Slide 22
Chord: Node Joins
A new node B learns of at least one existing node A via external means
B asks A to look up its finger-table information
Given that B's hash ID is b, A does a lookup for B.finger[i] = successor(b + 2^(i-1)), if the interval is not already covered by finger[i-1]
B stores all finger information and sets up its predecessor/successor pointers

Slide 23
Node Joins (cont'd.)
Update the finger tables of existing nodes p such that:
1. p precedes b by at least 2^(i-1)
2. the i-th finger of node p succeeds b
Start from p = predecessor(b - 2^(i-1)) and proceed counter-clockwise while condition 2 holds
Transferring keys: only from successor(b) to b; a notification must be sent to the application

Slide 24
Example: Finger Table Update
[Figure: node 6 joins]

Slide 25
Example: Transferring Keys
[Figure: node 1 leaves]

Slide 26
Concurrent Joins/Leaves
A stabilization protocol is needed to guard against inconsistency
Note: incorrect finger pointers may only increase latency, but incorrect successor pointers may cause lookup failure!
Nodes periodically run the stabilization protocol: find the successor's predecessor and repair the successor pointer if that predecessor isn't this node itself
The same algorithm is also run at join

Slide 27
Example: node 25 joins
[Figure]

Slide 28
Example: node 28 joins before 20 stabilizes (1)
[Figure]

Slide 29
Example: node 28 joins before 20 stabilizes (2)
[Figure]

Slide 30
CAN
A virtual d-dimensional Cartesian coordinate space on a d-torus; example: 2-d [0,1] x [0,1]
The space is dynamically partitioned among all nodes
A pair (K,V) is stored by mapping key K to a point P in the space using a uniform hash function and storing (K,V) at the node whose zone contains P
The entry (K,V) is retrieved by applying the same hash function to map K to P and fetching the entry from the node whose zone contains P
If P is not contained in the zone of the requesting node or its neighboring zones, the request is routed toward the neighbor node in the zone nearest to P

Slide 31
Routing in a CAN
Follow the straight-line path through the Cartesian space from the source to the destination coordinates
Each node maintains a table of the IP address and virtual coordinate zone of each local neighbor
Use greedy routing to the neighbor closest to the destination (see the code sketch after Slide 33)
For a d-dimensional space partitioned into n equal zones, nodes maintain 2d neighbors
Average routing path length: (d/4)(n^(1/d)) hops

Slide 32
CAN Construction
A joining node locates a bootstrap node using the CAN DNS entry
The bootstrap node provides the IP addresses of random member nodes
The joining node sends a JOIN request to a random point P in the Cartesian space
The node in the zone containing P splits the zone and allocates half to the joining node
(K,V) pairs in the allocated half are transferred to the joining node
The joining node learns its neighbor set from the previous zone occupant
The previous zone occupant updates its neighbor set

Slide 33
Departure, Recovery and Maintenance
Graceful departure: the node hands over its zone and its (K,V) pairs to a neighbor
Node failure: unreachable node(s) trigger an immediate takeover algorithm that allocates the failed node's zone to a neighbor
Failures are detected via the lack of periodic refresh messages
Each neighbor node starts a takeover timer initialized in proportion to its own zone volume
When the timer expires, the node sends a TAKEOVER message containing its zone volume to all of the failed node's neighbors
If the received TAKEOVER volume is smaller, a node kills its timer; if not, it replies with its own TAKEOVER message
Nodes thus agree on the live neighbor with the smallest volume
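
A small Python sketch (assumed, not from the slides) of CAN-style key-to-point mapping and greedy forwarding in a 2-d coordinate space, as described on Slides 30-31; it ignores the torus wrap-around and zone splitting, and the class and function names are invented for this illustration.

    import hashlib
    import math

    def hash_to_point(key):
        """Map a key to a point in [0,1) x [0,1) with a uniform hash (Slide 30)."""
        h = hashlib.sha1(key.encode()).digest()
        x = int.from_bytes(h[:8], "big") / 2**64
        y = int.from_bytes(h[8:16], "big") / 2**64
        return (x, y)

    class Zone:
        """Axis-aligned rectangle [x_lo, x_hi) x [y_lo, y_hi) owned by one node."""
        def __init__(self, x_lo, x_hi, y_lo, y_hi):
            self.x_lo, self.x_hi, self.y_lo, self.y_hi = x_lo, x_hi, y_lo, y_hi

        def contains(self, p):
            return self.x_lo <= p[0] < self.x_hi and self.y_lo <= p[1] < self.y_hi

        def center(self):
            return ((self.x_lo + self.x_hi) / 2, (self.y_lo + self.y_hi) / 2)

    class CanNode:
        def __init__(self, zone):
            self.zone = zone
            self.neighbors = []   # CanNode objects whose zones abut this one

        def route(self, point):
            """Greedily forward toward the node whose zone contains point (Slide 31)."""
            node = self
            while not node.zone.contains(point):
                node = min(node.neighbors,
                           key=lambda n: math.dist(n.zone.center(), point))
            return node

    # Two nodes splitting the unit square down the middle:
    left, right = CanNode(Zone(0, 0.5, 0, 1)), CanNode(Zone(0.5, 1, 0, 1))
    left.neighbors, right.neighbors = [right], [left]
    p = hash_to_point("some-key")
    print(left.route(p).zone.center())   # center of the zone responsible for p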
Slide 34
Pastry
A generic P2P location and routing substrate
Self-organizing overlay network
Lookup/insert of an object in < log_16 N routing steps (expected)
O(log N) per-node state
Network proximity routing

Slide 35
Pastry: Object Distribution
Consistent hashing over a 128-bit circular ID space (0 to 2^128 - 1)
nodeIds (uniform random), objIds (uniform random)
Invariant: the node with the numerically closest nodeId maintains the object

Slide 36
Pastry: Object Insertion/Lookup
A message with key X is routed to the live node with the nodeId closest to X
Problem: a complete routing table is not feasible

Slide 37
Pastry: Routing Table
[Figure: routing table of node 65a1fc; log_16 N rows, rows 0 through 3 shown]

Slide 38
Pastry: Leaf Sets
Each node maintains the IP addresses of the nodes with the L/2 numerically closest larger and smaller nodeIds, respectively
Used for routing efficiency/robustness, fault detection (keep-alive), and application-specific local coordination

Slide 39
Pastry: Routing Procedure
if (destination is within range of our leaf set)
  forward to the numerically closest member
else
  let l = length of the shared prefix
  let d = value of the l-th digit in D's address
  if (R[l][d] exists)
    forward to R[l][d]
  else
    forward to a known node that (a) shares at least as long a prefix and (b) is numerically closer than this node
(A code sketch of this procedure follows Slide 43.)

Slide 40
Pastry: Routing Properties
log_16 N steps, O(log N) state
[Figure: example Route(d46a1c) starting at node 65a1fc, with nodes d13da3, d4213f, d462ba, d467c4, d471f1 along or near the path]

Slide 41
Pastry: Performance
Integrity of overlay message delivery: guaranteed unless there are L/2 simultaneous failures of nodes with adjacent nodeIds
Number of routing hops:
No failures: < log_16 N expected, 128/b + 1 maximum
During failure recovery: O(N) worst case, average case much better

Slide 42
Pastry Join
X = new node, A = bootstrap node, Z = the node with the nodeId numerically closest to X
A finds Z on behalf of X; in the process, A, Z, and all nodes on the path send their state tables to X
X settles on its own table, possibly after contacting other nodes
X tells everyone who needs to know about itself

Slide 43
Pastry Leave
Noticed by leaf-set neighbors when the leaving node doesn't respond: the neighbors ask the highest and lowest nodes in their leaf set for a new leaf set
Noticed by routing neighbors when a message forward fails: they can immediately route to another neighbor, and fix the entry by asking another neighbor in the same row for its neighbor; if this fails, ask somebody a level up
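
To make Slide 39 concrete, here is a hedged Python sketch of one Pastry forwarding decision, with hex node IDs, a flat leaf-set list, and the routing table modeled as a dict keyed by (row, digit); the example IDs loosely follow the figure on Slide 40, and everything else is illustrative rather than Pastry's actual data structures.

    def shared_prefix_len(a, b):
        """Number of leading hex digits two IDs have in common."""
        n = 0
        for ca, cb in zip(a, b):
            if ca != cb:
                break
            n += 1
        return n

    def route(node_id, dest, leaf_set, routing_table):
        """One forwarding decision of the procedure on Slide 39.

        leaf_set      -- IDs of the numerically closest smaller and larger nodes
        routing_table -- dict mapping (row, digit) -> node ID; may have holes
        Returns the next hop, or node_id itself if this node is responsible."""
        val = lambda x: int(x, 16)
        members = leaf_set + [node_id]
        if min(map(val, members)) <= val(dest) <= max(map(val, members)):
            # Destination falls within the leaf set: deliver to the
            # numerically closest member (possibly this node).
            return min(members, key=lambda x: abs(val(x) - val(dest)))
        # Otherwise use the prefix routing table R: row = length l of the
        # shared prefix, column = the l-th digit of the destination.
        l = shared_prefix_len(node_id, dest)
        entry = routing_table.get((l, dest[l]))
        if entry is not None:
            return entry
        # Rare case: the table entry is empty; fall back to any known node that
        # shares at least as long a prefix and is numerically closer.
        known = list(routing_table.values()) + leaf_set
        closer = [n for n in known
                  if shared_prefix_len(n, dest) >= l
                  and abs(val(n) - val(dest)) < abs(val(node_id) - val(dest))]
        return min(closer, key=lambda n: abs(val(n) - val(dest))) if closer else node_id

    # First hop of Route(d46a1c) from node 65a1fc (compare Slide 40):
    print(route("65a1fc", "d46a1c",
                leaf_set=["659abc", "65b210"],
                routing_table={(0, "d"): "d13da3"}))   # -> d13da3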
