Improving Search in P2P Networks
By Shadi Lahham
Improving P2P Search 2
Purpose of This Lecture
• General understanding of P2P systems
• Appreciating the need for efficient search
• Applying different search techniques to different scenarios
Table Of Contents
• P2P Basics
– What Is P2P
– Advantages of P2P
– Types of P2P Systems
– Shortcomings
• Search Methods
– The Search Problem
– Current Methods
– Suggested Methods
• Experimental Setup
– Metrics
– Data Collection
– Calculating Costs
• Analysis of Results
• Conclusions
Introduction
P2P Basics
What is P2P
• Distributed system
• Peers (nodes) are servers and clients simultaneously
• Peers have equal roles
• Resources shared across peers
• No central server needed
• Examples of P2P systems
P2P Overview
[Figure: P2P overview. Table with columns File and Key: file1 → f1, file2 → f2, file3 → f3]
Advantages of P2P
• P2P vs. Centralized Servers
– Distributes disk space / bandwidth
– Inexpensively scalable
– Self-organized (autonomous)
– Load balancing
– Adaptive / fault tolerant
– Less susceptible to attacks
– Allows for redundancy
Types of P2P Systems
• Hybrid (Napster)
• Pure (Gnutella)
• Super Peers (KaZaA)
Hybrid (Napster)
Pure (Gnutella)
Super Peers (KaZaA)
• Make use of heterogeneity
– Powerful peers serve as super peers
– Weaker peers act as clients
• Super-peers index clients’ files
– Requires updates on join/leave/update
• Queries handled at super-peer level
– Saves query costs
Hybrid - Shortcomings
• High cost on centralized index
• Performance & scalability bottleneck
• Needs maintenance
• Vulnerable: a highly visible target
Pure - Shortcomings
• Inefficient search (flooding)
• Heterogeneity of peers not considered
– Bottlenecks (limited peers)
– Fragmentation
Super Peers - Shortcomings
• Super nodes might become bottlenecks for clients
– Requires redundancy
• A bad selection of super nodes might cause even worse problems
Search Methods
The Search Problem
• Connected graph
• Might contain cycles
• Individual node doesn’t know structure
• Only knows its neighbors
• No idea where data can be found
The Search Problem
• Goal: find as many occurrences of the data as possible using minimal time and resources
• Solution:
– BFS?
– Bounded BFS?
– (naive approaches)
Bounded BFS Search
[Figure: bounded BFS spreading outward from the source; hop labels TTL=2, TTL=1, TTL=0]
Bounded BFS Search
• Messages get a global TTL (time to live)
• Algorithm
– Source broadcasts a message to a subset of neighbors
– Neighbors search locally. Results are sent to the source if found
– TTL = TTL – 1
– As long as TTL > 0, nodes forward the message to their neighbors
• Downside : wastes bandwidth / processing
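The algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the Gnutella implementation: the adjacency-dict graph layout, the `has_data` set, and the function name `bounded_bfs` are assumptions made for the example, and every node is assumed to forward to all neighbors except the one it received the query from.

```python
from collections import deque

def bounded_bfs(graph, source, has_data, ttl):
    """Flood a query from `source` for at most `ttl` hops.

    graph: dict mapping node -> list of neighbors (hypothetical layout)
    has_data: set of nodes holding a matching item (hypothetical)
    Returns (hits, messages): nodes that answered, and query copies sent.
    """
    seen = {source}                      # nodes that already handled the query
    hits = []
    messages = 0
    frontier = deque([(source, None, ttl)])
    while frontier:
        node, parent, remaining = frontier.popleft()
        if remaining == 0:
            continue                     # TTL exhausted: stop forwarding
        for neighbor in graph[node]:
            if neighbor == parent:
                continue                 # do not send the query straight back
            messages += 1                # every forwarded copy costs bandwidth
            if neighbor in seen:
                continue                 # duplicate copy over a redundant edge
            seen.add(neighbor)
            if neighbor in has_data:
                hits.append(neighbor)    # local search succeeded
            frontier.append((neighbor, node, remaining - 1))
    return hits, messages
```

The `messages` counter makes the downside visible: copies sent over redundant edges are counted even though the receiving node discards them.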
Current Methods
• Gnutella - BFS
– High cost
– Gets complete results (for depth D)
– Relatively short time
• Freenet - DFS
– Poor response time
– Minimizes BW costs
Suggested Methods
• Iterative deepening
• Directed BFS
• Local Indices
Iterative Deepening
• Idea:
– Search at a small depth and increase if required
– Aims to minimize the cost of BFS without detracting from its ability to satisfy queries
• Notice that given enough iterations this method returns 100% of the results of BFS
Iterative Deepening (cont…)
• Elements:
– Policies P={a,b,c,..} define deepening behavior
– BFS is run to depth a and frozen
– If the source is satisfied it stops the process
– Otherwise it asks BFS to resume to depth b
– Process is repeated until the source is satisfied or we reach the last policy item
Iterative Deepening (cont…)
• Elements:
– We can specify how long to wait between iterations
– We need a system-wide message ID to identify individual messages
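The deepening loop can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the graph is an adjacency dict, `z` stands for the number of results that satisfies the source, freezing and resuming are simulated by remembering the deepest level already processed, and the waiting time between iterations is only marked by a comment. The names `nodes_by_depth` and `iterative_deepening` are made up for the example.

```python
from collections import deque

def nodes_by_depth(graph, source, max_depth):
    """Return {depth: [nodes first reached at that depth]} using plain BFS."""
    depth_of = {source: 0}
    queue = deque([source])
    levels = {}
    while queue:
        node = queue.popleft()
        if depth_of[node] == max_depth:
            continue                      # do not expand past the last depth
        for neighbor in graph[node]:
            if neighbor not in depth_of:
                depth_of[neighbor] = depth_of[node] + 1
                levels.setdefault(depth_of[neighbor], []).append(neighbor)
                queue.append(neighbor)
    return levels

def iterative_deepening(graph, source, has_data, policy, z):
    """Search to each depth in `policy` in turn; stop once `z` results arrive.

    Returns (results, depth_reached).
    """
    levels = nodes_by_depth(graph, source, policy[-1])
    results = []
    searched = 0                          # deepest level processed so far
    for depth in policy:
        # "Resume" the frozen query: only the newly unlocked levels are processed.
        for level in range(searched + 1, depth + 1):
            results += [n for n in levels.get(level, []) if n in has_data]
        searched = depth
        if len(results) >= z:
            break                         # source is satisfied
        # A real client would wait W seconds here before resuming.
    return results, searched
```

With a policy such as `[1, 2, 3]` the search stops at the first depth that yields enough results, so deeper nodes never see the query.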
Example P={1,3,4} W=1
Directed BFS
• Idea:
– Choose a subset of neighbors to query
– Neighbors will BFS as usual
– Aims to provide a balance between good response time and results
– Minimize costs of full BFS
• Notice that only a subset of the possible results is returned, so we might fail to satisfy the query
Directed BFS Example
[Figure: directed BFS through a chosen subset of neighbors; hop labels TTL=2, TTL=1, TTL=0]
Directed BFS (cont…)
• But which neighbors to pick?
– Maintain simple statistics on neighbors to derive heuristics
• Highest past results
• Lowest average hops (close to nodes containing useful data)
• High message count (stable, can handle large flow)
• Shortest message queue (long implies saturation)
• More to come …
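Choosing neighbors from such statistics amounts to ranking them by one counter. The sketch below assumes a small per-neighbor record; the class `NeighborStats`, its field names, and the sample values are illustrative and do not come from any real client.

```python
from dataclasses import dataclass

@dataclass
class NeighborStats:
    """Per-neighbor counters a client might keep (field names are illustrative)."""
    name: str
    past_results: int      # results returned over recent queries
    avg_hops: float        # average hops taken by those results
    message_count: int     # messages received from this neighbor
    queue_length: int      # current length of its message queue

def pick_neighbors(stats, k, key, highest=True):
    """Return the names of the k neighbors ranked best by `key`."""
    ranked = sorted(stats, key=key, reverse=highest)
    return [s.name for s in ranked[:k]]

# Made-up sample statistics for three neighbors.
neighbors = [
    NeighborStats("n1", past_results=12, avg_hops=3.5, message_count=400, queue_length=9),
    NeighborStats("n2", past_results=30, avg_hops=2.0, message_count=150, queue_length=2),
    NeighborStats("n3", past_results=5, avg_hops=4.1, message_count=900, queue_length=20),
]

most_results = pick_neighbors(neighbors, 2, key=lambda s: s.past_results)
shortest_queue = pick_neighbors(neighbors, 1, key=lambda s: s.queue_length, highest=False)
```

Passing the heuristic as a `key` function keeps the selection code identical for every heuristic; only the ranked field and the sort direction change.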
Local Indices
• Idea:
– Nodes hold metadata of all nodes within radius r
– Can process the query at a few nodes, but get the same number of results
– Aims to balance satisfaction / costs
Local Indices
• Elements:
– Policies P={a,b,c,..} define the depths at which we search
• Example P={1,5,6}
• Nodes at depth 1 process the query
• Nodes at depth 2,3,4 forward without processing
• Policy ends at depth 6
– System-wide radius r (small ~ 50K metadata)
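How a policy splits the depths into processing and forwarding can be written down directly. The helper names `depth_roles` and `deepest_depth_covered` are made up for this sketch, and it assumes each node's index covers every node within r hops of it.

```python
def depth_roles(policy):
    """Map each depth up to the last policy depth to 'process' or 'forward'."""
    processing = set(policy)
    return {
        depth: "process" if depth in processing else "forward"
        for depth in range(1, max(policy) + 1)
    }

def deepest_depth_covered(policy, r):
    """Deepest depth whose data is visible to the query.

    Every node at depth max(policy) + j (with j <= r) is within j hops of its
    BFS ancestor at depth max(policy), so that ancestor's index covers it.
    """
    return max(policy) + r

roles = depth_roles([1, 5, 6])
```

For P={1,5,6} this marks depths 1, 5 and 6 as processing and depths 2, 3 and 4 as forward-only, matching the example above.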
Example P={1,4}
[Figure: example network; legend: Process / Don’t process; r = ?]
Local Indices (cont…)
– Notice that now there is an overhead
– On join
• Send join message with TTL = r
• Direct exchange of metadata
– On leave / timeout
• Remove metadata of gone / dead nodes
– On update
• Send update message with TTL = r
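The bookkeeping at a single node might look like the sketch below. It models only the local index, not the propagation of join and update messages with TTL = r; the class name `LocalIndex`, its methods, and the file-name metadata are assumptions made for illustration.

```python
class LocalIndex:
    """Metadata one node keeps about the nodes within radius r of it."""

    def __init__(self):
        self.files_by_node = {}          # node id -> set of file names

    def on_join(self, node_id, files):
        # A join message (TTL = r) reached us: store the newcomer's metadata.
        self.files_by_node[node_id] = set(files)

    def on_leave(self, node_id):
        # The node left or timed out: drop its metadata if we had any.
        self.files_by_node.pop(node_id, None)

    def on_update(self, node_id, added=(), removed=()):
        # An update message (TTL = r) reached us: patch the stored metadata.
        files = self.files_by_node.setdefault(node_id, set())
        files.update(added)
        files.difference_update(removed)

    def lookup(self, file_name):
        # Answer a query on behalf of every indexed node.
        return sorted(n for n, files in self.files_by_node.items() if file_name in files)
```

Each handler corresponds to one of the events listed above, which is where the maintenance overhead comes from: every join, leave, and update touches the index of every node within r hops.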
Experimental Setup
Metrics
• How to compare methods?
1. Costs
2. Results
3. Time
Metrics
1. Costs
– We do not base cost on a specific query but rather calculate the average cost over Qrep, a representative set of real queries submitted
– It makes sense to discuss costs in aggregate (i.e., over all the nodes in the network)
– Therefore our two cost metrics are
• Average aggregate bandwidth
• Average aggregate processing cost
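As a small illustration of "average aggregate" cost: sum the cost over all nodes for each query, then average over the queries in Qrep. The input layout (one dict of per-node costs per query) and the function name are assumptions for this sketch.

```python
def average_aggregate_cost(per_query_node_costs):
    """Average, over the queries, of the cost summed over all nodes.

    per_query_node_costs: one dict per query in Qrep, mapping node -> cost
    (bytes for bandwidth, or processing units; hypothetical layout).
    """
    aggregates = [sum(costs.values()) for costs in per_query_node_costs]
    return sum(aggregates) / len(aggregates)
```

The same function serves both metrics; only the unit of the per-node cost differs.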
Metrics
2. Results quality
– Number of results
– Satisfaction
3. Time to satisfaction
Data Collection
• Data gathered from Gnutella network
• Directly measured
– Iterative deepening
– Directed BFS
• Performance data & analysis
– Local indices
Data Collection
Collected data:
• Number of hops
• Response time
• Results per message
• Source IP
• Etc …
Data Collection
Extracted data:
Symbol    Description
M(Q,n)    # of response messages received for query Q, from n hops away
R(Q,n)    # of results received for query Q, from n hops away
N(Q,n)    # of nodes n hops away that process Q
C(Q,n)    # of redundant edges n hops away
Calculating Costs
• We’ve seen two types of costs
– Bandwidth (BW) costs
– Processing costs
• Calculations should take into account
– Costs of sending a query
– Costs of sending replies
• An example of calculating BW costs
Calculating Costs
\[
\mathrm{BW}_{\mathrm{bfs}}(Q) = \sum_{n=1}^{D} \Big( a(Q) \cdot \big( N(Q,n) + C(Q,n) \big) + n \cdot \big( c \cdot R(Q,n) + d \cdot M(Q,n) \big) \Big)
\]
a(Q)    Size of query Q
c       Size of result record
d       Size of response message
D       Max TTL
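The formula translates almost term by term into code. In this sketch the per-hop quantities N, C, R and M are passed as dicts keyed by hop count n; that layout and the function name `bw_bfs` are assumptions for the example, and the numbers in the usage note are made up.

```python
def bw_bfs(a_q, c, d, max_ttl, N, C, R, M):
    """Aggregate bandwidth of one BFS query, summed over hops 1..max_ttl.

    a_q: size of the query; c: size of a result record;
    d: size of a response message; N, C, R, M: dicts keyed by hop count n.
    """
    total = 0
    for n in range(1, max_ttl + 1):
        # The query is sent to every processing node and over every redundant edge.
        query_cost = a_q * (N[n] + C[n])
        # A response from n hops away travels n hops back to the source.
        response_cost = n * (c * R[n] + d * M[n])
        total += query_cost + response_cost
    return total
```

For example, with a_q = 10, c = 5, d = 2, D = 2, N = {1: 2, 2: 3}, C = {1: 0, 2: 1}, R = {1: 1, 2: 4} and M = {1: 1, 2: 2}, hop 1 contributes 20 + 7 and hop 2 contributes 40 + 48, for a total of 115.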
Analysis of Results
Iterative Deepening
Symbols Used
Symbol Definition
D Maximum time-to-live of a message, in terms of hops
Z Number of results needed to satisfy a query
Qrep Representative set of queries for the Gnutella network
W Waiting time (in seconds) between iterations
Ng Number of neighbors of client (source node)
Results – Iterative Deepening
• Recall that iterative deepening policies P={a,b,c,..} define deepening behavior
• In order to have the same level of satisfaction as BFS, a policy must have D as the last depth
• Also note the degenerate case policy {D}, which is the bounded BFS we presented earlier
Results – Iterative Deepening
• Variables
– Define:
Pd = { d, d+1, …, D }
P = { Pd for d = 1,2,…,D }
  = { {1,2,…,D}, {2,3,…,D}, …, {D-1,D}, {D} }
– W (waiting time) can take the values 1, 2, 4, 6, 150 (seconds)
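The family of policies defined above is easy to enumerate. The function name `policy_family` is an assumption for this sketch; it returns each Pd keyed by its starting depth d.

```python
def policy_family(max_depth):
    """Return {d: Pd} where Pd = [d, d+1, ..., D] for d = 1..D (D = max_depth)."""
    return {d: list(range(d, max_depth + 1)) for d in range(1, max_depth + 1)}
```

The last entry is always the single-depth policy {D}, which is plain bounded BFS.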
Results – Iterative Deepening
• Fixed values Z = 50, Ng = 8
– Increasing Z
• Lower probability of satisfaction
• Higher costs
• More results
– Decreasing Ng
• Slightly lower probability of satisfaction
• Significantly lower costs
Results – Iterative Deepening
• BW costs are the same for P7 for all values of W
• As d increases, costs increase: the larger d is, the more likely the policy will “overshoot”
• As W decreases, costs increase: with a small W, prematurely deciding that the query is unsatisfied again leads to overshooting
Results – Iterative Deepening
• Time to satisfaction is inversely proportional to cost
• Choose a policy that balances average waiting time and cost
• For example, P5 with W = 6
Analysis of Results
Directed BFS
Heuristics - Directed BFS
Symbol    Heuristic
RAND      (Random)
>RES      Returned the greatest number of results*
<TIME     Had the shortest average time to satisfaction*
<HOPS     Had the smallest average number of hops taken by results*
>MSG      Sent our client the greatest number of messages (all types)
<QLEN     Had the shortest message queue
<LAT      Had the shortest latency
>DEG      Had the highest degree (number of neighbors)
*in the past 10 queries
Results – Directed BFS
• Costs in directed BFS are unaffected by Z
• Users are more aware of the quality of results than of BW costs
– We recommend >RES, <TIME
– Still cheaper than full BFS (~65%)
• Summary so far
– Iterative deepening - lowest costs
– Directed BFS - fastest time to satisfaction
Analysis of Results
Local Indices
Results – Local Indices
• Recall that local indices policies P={a,b,c,..} define the depths at which we search
• We choose policies that minimize the number of nodes that process the query
Results – Local Indices
• We consider the following policies
Results – Local Indices
• Also recall that joins / leaves / updates have a BW overhead
• QJR (QueryJoinRatio) gives us the ratio of queries to joins/leaves in the network
[Figures: local indices results plots; surviving labels: P0 r=0, 21MB, 71 KB]
Results – Local Indices
• Time to satisfaction
– Because most Query and Response messages have r fewer hops to travel, the time to forward messages to the outermost depth and back to the source will be shorter than for BFS
– However, because nodes have larger indices, processing the query should take more time
Results – Local Indices
• Summary
– Huge savings in costs
– Time to satisfaction comparable to BFS
– Determining r must take QJR into consideration
• For current QJR values (e.g. Gnutella = 10), r = 1 is a good choice
Relative performance
Technique            Time to satisfy  Satisfaction probability  Number of results  Aggregate bandwidth  Aggregate processing
Bounded BFS          100%             100%                      100%               100%                 100%
Iterative deepening  190%             100%                      19%                28%                  47%
Directed BFS         140%             86%                       37%                38%                  28%
Local indices        ≈100%            100%                      100%               39%                  51%
Conclusions
• All 3 methods show significant bandwidth and processing savings
• Methods are simple and easy to implement in current systems
• Methods might be used in conjunction
Bibliography
Yang, Beverly; Garcia-Molina, Hector:
• Improving Search in Peer-to-Peer Systems
http://newdbpubs.stanford.edu:8090/pub/2002-28
• Improving Search in Peer-to-Peer Systems [extended]
http://newdbpubs.stanford.edu:8090/pub/2001-47
• Designing a Super-peer Network
http://newdbpubs.stanford.edu:8090/pub/2003-33
Gnutella website
http://www.gnutella.com/
Thank you