34
Making P2P Networks Scalable a paper presentation by Derek Tingle

Making P2P Networks Scalable a paper presentation by Derek Tingle

Embed Size (px)

Citation preview

Making P2P Networks Scalable

a paper presentation by

Derek Tingle

P2P Basics

• Files stored on clients’ machines

• Typically read only

• Search mechanism

• Download mechanism

• Wildly popular

Gnutella

• Decentralized

• Unstructured

• Flood search

• Routing Table

Gnutella Message Header

Length of message after header. Next message header located this many bytes from this header. Messages <= 4 kB.

Data Length19-22

Incremented at each node. Note that (TTL + Hops Taken)= TTL initial

Hops Taken18

Time To Live. Each node decrements this value when receiving the message. It gets dropped when the value is zero.

TTL Remaining

17

What type of message it is. Ping, pong (ping response), query, query response, push, bye

Function ID16

A randomly chosen globally unique (in theory) 16 byte identifier generated by the client for each message it sends.

Message ID0-15

FunctionNameBytes

Gnutella

• Decentralized

• Unstructured

• Flood search

• Routing table

• DANGER!– Not scalable

Design Goals

• Allow the Gnutella-like p2p to handle higher amount of queries

• Make it scalable• Utilize heterogeneity of machines

Search Protocol

• GIA search is based on random walks– Like floods, but less messages

• But because random walks are blind, there are scaling issues

• So GIA uses biased walks– Toward high degree or high capacity?

High degree AND high capacity• Using a dynamic topology adaptation algorithm

• Ensures:– high capacity nodes have a high degree– low capacity nodes are close to high

capacity nodes

• Level of satisfaction S

• Measures how close the sum of capacities of a node's neighbors normalized by degrees is to that node's own capacity

• The lower the satisfaction level, the shorter the adaptation interval

To add a node n...

• Accept if num_nabr is still <= max_nbrs• Select the node with the highest degree

out of the subset of neighbors with a capacity less than that of n

• Only drop that node if it has less neighbors than n

One-hop replication

• Each node records an index of neighbor nodes' content

• Ensures that high capacity nodes can respond to a greater number of queries

Flow control• To avoid overloading a node

• Only can direct queries to a neighbor if the neighbor is ready

• Node uses tokens to signify it can handle queries

• Node gives out tokens at the rate it can process queries

• If queries are being queued, decrease allocation rate

• Weights the allocation of tokens for capacity

• If a node isn't using tokens, they are allocated to other neighbors

• Can be piggy backed

Search Protocol (again)

• Biased random walks aren't random• Send queries to highest capacity

neighbor with tokens• Time To Live decremented at each node• Book-keeping limits same path traversal• MAX_RESPONSES decremented for

each found answer• Append address of owning node to the

forwarded query

Query Resilience

• Can't let a query die with a node• Keep-alive messages

– query responses– dummy query responses

• Originator can resend query if no keep-alive messages arrive for a while

• When the topology adapts, the previous connections are maintained for a while

Simulations

• GIA compared to:– Flood– Random Walks over Random Topologies– Supernode mechanisms

• queries only flooded between supernodes

Assumptions• All nodes produce queries at same rate

• Capacity = number of messages processed per unit time

• queues have infinite length

• specific keyword searches

• min_alloc = min(C/num_nbrs) = 4

• For Flood and Super– average diameter is 7– TTL is 10

• Look at relative performance, not absolute

Metrics

• Success rate = fraction of queries issued that reach the file

• hop-count• delay = time taken from query's start to

finish• Collapse Point (CP) node query rate at

point beyond which success rate drops below 90%

• Average hop-count before collapse

Performance Comparison

• Search terminates after finding a single answer

• 5000 and 10,000 nodes for each system• .01, .05, .1, .5, 1 replication rates

Performance results

• RWRT better than Flood at high replication, equal at low replication

• GIA has higher hop counts than Flood and Super

• GIA hop counts lower as replication goes up

• Flood and Super aren't scalable... duh.

Multiple Search Responses

• Same tests, MAX_RESPONSES 1, 10, 20

• Flood and Super unchanged• GIA and RWRT decline as M_R

increases

Factor Analysis

957.0006RWRT + FLWCTL

15.12GIA - FLWCTL

1129.001RWRT + TADAPT

133.7.2GIA – TADAPT

997.0015RWRT + BIAS

24.06GIA – BIAS

134.0005RWRT + OHR

8570.004GIA – OHR

987.0005RWRT15.07GIA

Hop-count

Collapse Point

AlgorithmHop-count

Collapse Point

Algorithm

Node Failure model

• Force nodes to fail at a uniformly random time between 0 and MAXLIFETIME

• MAXLIFTIME = 10s, 100s, 1000s, forever• Even at 10s, GIA is 2-4 orders of

magnitude better than RWRT, Super, and Flood when they aren't fialing.

Types of P2P searching

• Centralized (Napster)– Based on user provided file lists

• Decentralized– Queries are distributed to peers– Unstructured (Gnutella)– Structured (Chord)

Distributed Hash Tables

• Pros:– Scalable– Quick lookup

• O(log n) steps• O(n) steps for

Gnutella

– Can find needles

• Cons (why not DHTs):– P2P Clients are transient

• Require O(log n) repair operations after each failure

– DHTs only support exact match searches

– P2Ps look for hay• (Not really a con)

Analysis and Comparison of P2P Search Models

• Dimitrios Tsoumakos• Nick Roussopoulos

Blind

• Gnutella (flood)

• Modified-BFS

• Iterative Deepening

• Random Walks

Informed

• Gnutella2 (Super-peer)• Intelligent-BFS• APS• Local Indices• Routing Indices• Distributed Resource Location Protocol• Gnutella with Shortcuts• GIA

Gnutella2

• Uses super-peers (hubs)

• They act as local servers for their peers

• Hubs are connected

• Queries the hubs sequentially

Intelligent-BFS

• Query similarity metric to find similar queries• Forwards to neighbors most likely to answer

that query• Focuses on object discovery rather than

message reduction• Increased number of hits• Does not handle node departures well• Assumes a node specializes in one file type

APS

• Uses indices to weight random walks

• Each index value represents a query for a specific object directed toward a specific node

• Index value is raised or lowered based on outcome of query

• Optimistic and pessimistic update approaches

• Originating node sends query to all neighbors, those neighbors send query to one neighbor

Local Indices• Each node indexes objects stored on nodes

within a radius r and can answer queries for them

• A BFS like search is performed• Queries hop a distance of 2r+1 nodes• Accuracy and hits are very high• Decreases actual processing time• Floods the network with messages• Churn is very costly b/c flooding is used to

update the repository for all joins/leaves

Routing Indices• Files are assumed to fall into themes• Each node stores the number of files of

each theme reachable from each outgoing path

• Three functions used to determine best outgoing path

• Queries forwarded to best outgoing path• Flooding is used for creation and update,

so serious issues with dynamic networks• Bloom filters...

Distributed Resource Location Protocol

• Initially, random flooding is used to find objects

• When an object is discovered, the query backtracks, storing the location of the found object on those nodes

• If a node knows where a queried object is located, it can directly contact that node

• Depending on specificity of queries, only one replica of a certain object is ever found

• In a dynamic network, much flooding

Gnutella with Shortcuts

• Uses standard flooding initially• If a peer provides an answer, it is

indexed on the requesting nodes• Nodes forward queries to the ranked

shortcuts first, then flood if necessary• Shortcuts ranked by success rate• Very high success rate• Works well when users make related

queries

Results

• All algorithms that implement flooding in some fashion have high success rates

• Systems that use shortcuts aren't hurt as badly by departures as expected, because the more flood searches utilized, the more accurate the shortcuts

• GIA is middle of the pack• No collapse point test