Improving Search in P2P Networks By Shadi Lahham


Purpose of This Lecture

• General understanding of P2P systems

• Appreciating the need for efficient search

• Applying different search techniques to different scenarios


Table Of Contents

• P2P Basics
  – What Is P2P
  – Advantages of P2P
  – Types of P2P Systems
  – Shortcomings
• Search Methods
  – The Search Problem
  – Current Methods
  – Suggested Methods
• Experimental Setup
  – Metrics
  – Data Collection
  – Calculating Costs
• Analysis of Results
• Conclusions

Introduction

P2P Basics


What is P2P

• Distributed system

• Peers (nodes) are servers and clients simultaneously

• Peers have equal roles

• Resources shared across peers

• No central server needed

• Examples of P2P systems


P2P Overview

File    Key
file1   f1
file2   f2
file3   f3


Advantages of P2P

• P2P vs. Centralized Servers
  – Distributes disk space / bandwidth
  – Inexpensively scalable
  – Self-organized (autonomous)
  – Load balancing
  – Adaptive / fault tolerant
  – Less susceptible to attacks
  – Allows for redundancy


Types of P2P Systems

• Hybrid (Napster)

• Pure (Gnutella)

• Super Peers (KaZaA)


Hybrid (Napster)


Pure (Gnutella)


Super Peers (KaZaA)

• Make use of heterogeneity
  – Powerful peers serve as super peers
  – Weaker peers act as clients
• Super peers index clients’ files
  – Requires updates on join/leave/update
• Queries handled at super-peer level
  – Saves query costs


Super Peers (KaZaA)


Hybrid - Shortcomings

• High cost on centralized index

• Performance & scalability bottleneck

• Needs maintenance

• Vulnerable: a highly visible target


Pure - Shortcomings

• Inefficient search (flooding)

• Heterogeneity of peers not considered
  – Bottlenecks (limited peers)
  – Fragmentation


Super Peers - Shortcomings

• Super nodes might become bottlenecks for clients
  – Requires redundancy
• Bad selection of super nodes might cause even worse problems

Search Methods


The Search Problem

• Connected graph

• Might contain cycles

• Individual node doesn’t know structure

• Only knows its neighbors

• No idea where data can be found


The Search Problem

• Goal: find as many occurrences of the data as possible, using minimal time and resources

• Solution:
  – BFS?
  – Bounded BFS?
  – (naive approaches)


Bounded BFS Search

[Figure: bounded BFS spreading from the source, with TTL = 2, TTL = 1, TTL = 0 at successive hops]


Bounded BFS Search

• Messages get a global TTL (time to live)
• Algorithm
  – Source broadcasts a message to a subset of neighbors
  – Neighbors search locally; results, if found, are sent to the source
  – TTL = TTL - 1
  – As long as TTL > 0, nodes forward the message to their neighbors
• Downside: wastes bandwidth / processing
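The algorithm above can be sketched as a small simulation. This is a minimal sketch, not the Gnutella implementation: the graph, the node names and the `has_data` set are hypothetical, the "subset of neighbors" is assumed to be all neighbors, and duplicate detection is modeled with a simple `seen` set.

```python
from collections import deque


def bounded_bfs(graph, source, has_data, ttl):
    """Flood a query from `source`; return (nodes holding the data, messages sent)."""
    hits = set()
    messages = 0
    seen = {source}                    # nodes that already handled this query
    frontier = deque([(source, ttl)])
    while frontier:
        node, remaining = frontier.popleft()
        if remaining == 0:
            continue                   # TTL exhausted: do not forward further
        for neighbor in graph[node]:
            messages += 1              # every forwarded copy costs bandwidth
            if neighbor in seen:
                continue               # duplicate arriving over a cycle is dropped
            seen.add(neighbor)
            if neighbor in has_data:
                hits.add(neighbor)     # a result would be sent back to the source
            frontier.append((neighbor, remaining - 1))
    return hits, messages


# Hypothetical network with a cycle A-B-D-C-A; nodes D and E hold the data.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
has_data = {"D", "E"}

hits, messages = bounded_bfs(graph, "A", has_data, ttl=2)
print(sorted(hits), messages)  # ['D'] 6
```

With TTL = 2 the query finds D but never reaches E, and 3 of the 6 messages are duplicates that are dropped on arrival, which illustrates the wasted bandwidth noted in the downside.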


Current Methods

• Gnutella: BFS
  – High cost
  – Gets complete results (for depth D)
  – Relatively short time
• Freenet: DFS
  – Poor response time
  – Minimizes BW costs


Suggested Methods

• Iterative deepening

• Directed BFS

• Local Indices


Iterative Deepening

• Idea:
  – Search at a small depth and increase it if required
  – Aims to minimize the cost of BFS without detracting from its ability to satisfy queries
• Notice that, given enough iterations, this method returns 100% of the results of BFS


Iterative Deepening (cont…)

• Elements:
  – Policies P={a,b,c,..} define deepening behavior
  – BFS is run to depth a and frozen
  – If the source is satisfied, it stops the process
  – Otherwise it asks the BFS to resume to depth b
  – The process is repeated until the source is satisfied or we reach the last policy item


Iterative Deepening (cont…)

• Elements:
  – We can specify how long to wait between iterations
  – We need a system-wide message ID to identify individual messages
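The freeze-and-resume behavior can be illustrated with a short sketch. It assumes a hypothetical in-memory graph, models the frozen search as a saved frontier, and treats "satisfied" as having at least `needed` results; the waiting time between iterations and the message IDs are left out for brevity.

```python
def iterative_deepening(graph, source, has_data, policy, needed):
    """Run BFS in stages given by `policy`; return (results, depth reached)."""
    seen = {source}
    frontier = [source]      # frozen frontier: nodes at the deepest level so far
    results = set()
    depth = 0
    for limit in policy:
        # Resume the frozen BFS from the current depth up to `limit`.
        while depth < limit and frontier:
            next_frontier = []
            for node in frontier:
                for neighbor in graph[node]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        if neighbor in has_data:
                            results.add(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
            depth += 1
        if len(results) >= needed:
            break            # source is satisfied: stop deepening
    return results, depth


# Hypothetical chain 0-1-2-3-4 with data at nodes 1 and 4.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
data = {1, 4}

print(iterative_deepening(line, 0, data, [1, 3, 4], needed=1))
print(iterative_deepening(line, 0, data, [1, 3, 4], needed=2))
```

When one result is enough, the search stops at depth 1 and the deeper nodes are never contacted; when two are needed, it resumes to depths 3 and 4 without revisiting nodes already seen.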


Example P={1,3,4} W=1


Directed BFS

• Idea:
  – Choose a subset of neighbors to query
  – Neighbors will BFS as usual
  – Aims to provide a balance between good response time and results
  – Reduces costs compared with full BFS
• Notice that only a subset of the possible results is returned, so we might fail to satisfy the query


Directed BFS Example

[Figure: directed BFS example, with TTL = 2, TTL = 1, TTL = 0 at successive hops]


Directed BFS (cont…)

• But which neighbors to pick?
  – Maintain simple statistics on neighbors to derive heuristics
    • Highest past results
    • Lowest average hops (close to nodes containing useful data)
    • High message count (stable, can handle large flow)
    • Shortest message queue (a long queue implies saturation)
    • More to come …
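Choosing neighbors from such statistics amounts to ranking them by one field. The sketch below is only an illustration: the `NeighborStats` record, its field names and the sample numbers are hypothetical, not part of any real client.

```python
from dataclasses import dataclass


@dataclass
class NeighborStats:
    name: str
    past_results: int    # results returned for recent queries
    avg_hops: float      # average hops taken by those results
    message_count: int   # messages received from this neighbor
    queue_length: int    # current length of its message queue


def pick_neighbors(stats, key, k, largest=True):
    """Return the names of the k neighbors ranked best by `key`."""
    ranked = sorted(stats, key=key, reverse=largest)
    return [s.name for s in ranked[:k]]


# Made-up statistics for three neighbors.
stats = [
    NeighborStats("n1", past_results=12, avg_hops=3.0, message_count=400, queue_length=9),
    NeighborStats("n2", past_results=30, avg_hops=2.1, message_count=150, queue_length=2),
    NeighborStats("n3", past_results=5, avg_hops=4.5, message_count=900, queue_length=15),
]

# Highest past results: forward the query to the two best neighbors.
print(pick_neighbors(stats, lambda s: s.past_results, k=2))                # ['n2', 'n1']
# Shortest message queue: pick the single least loaded neighbor.
print(pick_neighbors(stats, lambda s: s.queue_length, k=1, largest=False))  # ['n2']
```

Different heuristics can disagree: here the highest message count would pick n3, which is also the neighbor with the longest queue.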


Local Indices

• Idea:
  – Nodes hold metadata of all nodes within radius r
  – Can process the query at only a few nodes, but get the same number of results
  – Aims to balance satisfaction / costs
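A local index can be pictured as a map from nearby nodes to their files. The following is a minimal sketch under stated assumptions: the graph and the `files` map are hypothetical, the index is assumed to cover the node itself plus every node within r hops, and "metadata" is reduced to a set of file names.

```python
def build_local_index(graph, files, node, r):
    """Collect the file sets of `node` and of every node within r hops of it."""
    index = {node: files.get(node, set())}
    frontier = [node]
    for _ in range(r):
        next_frontier = []
        for current in frontier:
            for neighbor in graph[current]:
                if neighbor not in index:
                    index[neighbor] = files.get(neighbor, set())
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return index


def answer_from_index(index, wanted):
    """Answer a query on behalf of all indexed nodes, without contacting them."""
    return sorted(owner for owner, held in index.items() if wanted in held)


# Hypothetical chain 0-1-2-3-4; nodes 0 and 3 hold file "a", node 2 holds "b".
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
files = {0: {"a"}, 2: {"b"}, 3: {"a"}}

index_r1 = build_local_index(chain, files, node=1, r=1)
print(answer_from_index(index_r1, "a"))  # [0]

index_r2 = build_local_index(chain, files, node=1, r=2)
print(answer_from_index(index_r2, "a"))  # [0, 3]
```

Node 1 answers for its neighbors from its own index, so those neighbors do not need to process the query themselves.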


Local Indices

• Elements:
  – Policies P={a,b,c,..} define the depths at which we search
    • Example: P={1,5,6}
    • Nodes at depth 1 process the query
    • Nodes at depths 2, 3, 4 forward without processing
    • Policy ends at depth 6
  – System-wide radius r (small, ~50K metadata)
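The role each depth plays under a policy can be tabulated in a few lines. This is a small illustrative helper (the name `depth_roles` is hypothetical), using the example policy P={1,5,6} from above.

```python
def depth_roles(policy):
    """Map each depth up to the last policy item to 'process' or 'forward'."""
    last = max(policy)
    return {
        depth: "process" if depth in policy else "forward"
        for depth in range(1, last + 1)
    }


for depth, role in depth_roles([1, 5, 6]).items():
    print(depth, role)
```

For P={1,5,6} this lists depths 1, 5 and 6 as processing the query and depths 2, 3 and 4 as forwarding only, with nothing beyond depth 6.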


Example P={1,4}

[Figure: example network; legend: nodes that process the query, nodes that don’t process it; r = ?]


Local Indices (cont…)

• Notice that now there is an overhead
  – On join
    • Send a join message with TTL = r
    • Direct exchange of metadata
  – On leave / timeout
    • Remove metadata of gone / dead nodes
  – On update
    • Send an update message with TTL = r

Experimental Setup


Metrics

• How to compare methods?
  1. Costs
  2. Results
  3. Time


Metrics

1. Costs
  – We do not base cost on a specific query, but rather calculate the average cost over Qrep, a representative set of real queries submitted
  – It makes sense to discuss costs in aggregate (i.e., over all the nodes in the network)
  – Therefore our two cost metrics are
    • Average aggregate bandwidth
    • Average aggregate processing cost


Metrics

2. Results quality
  – Number of results
  – Satisfaction
3. Time to satisfaction


Data Collection

• Data gathered from Gnutella network

• Directly measured
  – Iterative deepening
  – Directed BFS
• Performance data & analysis
  – Local indices


Data Collection

Collected data:
• Number of hops
• Response time
• Results per message
• Source IP
• Etc …


Data Collection

Extracted data:

Symbol    Description
M(Q, n)   # of response messages received for query Q, from n hops away
R(Q, n)   # of results received for query Q, from n hops away
N(Q, n)   # of nodes n hops away that process Q
C(Q, n)   # of redundant edges n hops away


Calculating Costs

• We’ve seen two types of costs– Bandwidth (BW) costs

– Processing costs

• Calculations should take into account– Costs of sending a query

– Costs of sending replies

• An example of calculating BW costs


Calculating Costs

\[
BW_{bfs}(Q) = \sum_{n=1}^{D} \Big( a(Q) \cdot \big( N(Q,n) + C(Q,n) \big) + n \cdot \big( c \cdot R(Q,n) + d \cdot M(Q,n) \big) \Big)
\]

Symbol   Description
a(Q)     Size of query Q
c        Size of result record
d        Size of response message
D        Max TTL
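The bandwidth formula above can be evaluated directly once the per-hop counts are known. The sketch below assumes the counts are supplied as dictionaries keyed by hop number n; all numbers are made up for illustration and are not measurements from the Gnutella data.

```python
def bw_bfs(a_q, c, d, N, C, R, M, D):
    """Aggregate bandwidth of a BFS query; N, C, R, M map hop n (1..D) to a count."""
    total = 0
    for n in range(1, D + 1):
        # Query messages: one per processing node plus one per redundant edge.
        query_cost = a_q * (N[n] + C[n])
        # Responses travel n hops back to the source.
        response_cost = n * (c * R[n] + d * M[n])
        total += query_cost + response_cost
    return total


# Illustrative sizes in bytes and illustrative per-hop counts for D = 2.
cost = bw_bfs(
    a_q=100, c=50, d=80,
    N={1: 4, 2: 10},   # nodes that process Q at each hop
    C={1: 0, 2: 3},    # redundant edges at each hop
    R={1: 2, 2: 5},    # results received from each hop
    M={1: 1, 2: 2},    # response messages received from each hop
    D=2,
)
print(cost)  # 2700
```

Hop 1 contributes 400 + 180 = 580 and hop 2 contributes 1300 + 820 = 2120; the factor n makes results from distant nodes more expensive because each response crosses n links.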

Analysis of Results

Iterative Deepening


Symbols Used

Symbol   Definition
D        Maximum time-to-live of a message, in terms of hops
Z        Number of results needed to satisfy a query
Qrep     Representative set of queries for the Gnutella network
W        Waiting time (in seconds) between iterations
Ng       Number of neighbors of client (source node)


Results – Iterative Deepening

• Recall that iterative deepening policies P={a,b,c,..} define deepening behavior

• In order to have the same level of satisfaction as BFS, a policy must have D as its last depth

• Also note the degenerate-case policy {D}, which is the bounded BFS we presented earlier



• Variables
  – Define:
      Pd = { d, d+1, …, D }
      P = { Pd for d = 1, 2, …, D }
        = { {1,2,…,D}, {2,3,…,D}, …, {D-1,D}, {D} }
  – W (waiting time) can take the values 1, 2, 4, 6, 150 (seconds)
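The family of policies Pd is easy to enumerate. A minimal sketch (the function names are illustrative), shown for a small D so the output stays short:

```python
def policy(d, D):
    """Pd = {d, d+1, ..., D}: start at depth d and deepen one level at a time."""
    return list(range(d, D + 1))


def all_policies(D):
    """P = {Pd for d = 1, 2, ..., D}."""
    return [policy(d, D) for d in range(1, D + 1)]


print(all_policies(3))  # [[1, 2, 3], [2, 3], [3]]
```

The last entry is always the single-depth policy {D}, the bounded BFS case.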



• Fixed values: Z = 50, Ng = 8
  – Increasing Z
    • Lower probability of satisfaction
    • Higher costs
    • More results
  – Decreasing Ng
    • Slightly lower probability of satisfaction
    • Significantly lower costs





• BW costs for P7 are the same for all values of W

• As d increases, costs increase: the larger d is, the more likely the policy will “overshoot”

• As W decreases, costs increase: with a small W, a premature determination that the query is unsatisfied again leads to overshooting





• Time to satisfaction is inversely proportional to cost

• Choose a policy that balances average waiting time and cost

• For example, P5 with W = 6

Analysis of Results

Directed BFS


Heuristics - Directed BFS

Symbol   Heuristic
RAND     (Random)
>RES     Returned the greatest number of results*
<TIME    Had the shortest average time to satisfaction*
<HOPS    Had the smallest average number of hops taken by results*
>MSG     Sent our client the greatest number of messages (all types)
<QLEN    Had the shortest message queue
<LAT     Had the shortest latency
>DEG     Had the highest degree (number of neighbors)

*in the past 10 queries


Results – Directed BFS







• Costs in directed BFS are unaffected by Z
• Users are more aware of the quality of results than of BW costs
  – We recommend >RES / <TIME
  – Still cheaper than full BFS (~65%)
• Summary so far
  – Iterative deepening: lowest costs
  – Directed BFS: fastest time to satisfaction

Analysis of Results

Local Indices


Results – Local Indices

• Recall that local indices policies P={a,b,c,..} define the depths at which we search

• We choose policies that minimize the number of nodes that process the query



• We consider the following policies



• Also recall that joins / leaves / updates have a BW overhead

• QJR (QueryJoinRatio) gives us the ratio of queries to joins/leaves in the network



[Figure: local indices results; labels recovered from the plot: P0 (r=0), 21 MB, 71 KB]



• Time to satisfaction
  – Because most Query and Response messages have r fewer hops to travel, the time to forward messages to the outermost depth and back to the source will be shorter than for BFS
  – However, because nodes have larger indices, processing the query should take more time



• Summary
  – Huge savings in costs
  – Time to satisfaction comparable to BFS
  – Determining r must take QJR into consideration
• For current QJR values (e.g. Gnutella = 10), r = 1 is a good choice


Relative performance

Technique             Time to satisfy   Satisfaction probability   Number of results   Aggregate bandwidth   Aggregate processing
Bounded BFS           100%              100%                       100%                100%                  100%
Iterative deepening   190%              100%                       19%                 28%                   47%
Directed BFS          140%              86%                        37%                 38%                   28%
Local indices         ≈100%             100%                       100%                39%                   51%


Conclusions

• All 3 methods show significant bandwidth and processing savings

• Methods are simple and easy to implement in current systems

• Methods might be used in conjunction


Bibliography

Yang, Beverly; Garcia-Molina, Hector:
• Improving Search in Peer-to-Peer Systems
  http://newdbpubs.stanford.edu:8090/pub/2002-28
• Improving Search in Peer-to-Peer Systems [extended]
• Designing a Super-peer Network
  http://newdbpubs.stanford.edu:8090/pub/2003-33

Gnutella website
http://www.gnutella.com/

Thank you
