Improving Search in P2P Networks

By Shadi Lahham


Purpose of This Lecture

• General understanding of P2P systems

• Appreciating the need for efficient search

• Applying different search techniques to different scenarios

Table Of Contents

• P2P Basics

– What Is P2P

– Advantages of P2P

– Types of P2P Systems

– Shortcomings

• Search Methods

– The Search Problem

– Current Methods

– Suggested Methods

• Experimental Setup

– Metrics

– Data Collection

– Calculating Costs

• Analysis of Results

• Conclusions

Introduction

P2P Basics

What is P2P

• Distributed system

• Peers (nodes) are servers and clients simultaneously

• Peers have equal roles

• Resources shared across peers

• No central server needed

• Examples of P2P systems

P2P Overview

[Figure: P2P overview with a File / Key table: file1 / f1, file2 / f2, file3 / f3]

Advantages of P2P

• P2P vs. Centralized Servers

– Distributes disk space / bandwidth

– Inexpensively scalable

– Self organized (autonomous)

– Load balancing

– Adaptive / fault tolerant

– Less susceptible to attacks

– Allows for redundancy

Types of P2P Systems

• Hybrid (Napster)

• Pure (Gnutella)

• Super Peers (KaZaA)

Hybrid (Napster)

Pure (Gnutella)

Super Peers (KaZaA)

• Make use of heterogeneity

– Powerful peers serve as super peers

– Weaker peers act as clients

• Super peers index clients’ files

– Requires updates on join/leave/update

• Queries handled at super-peer level

– Saves query costs

Super Peers (KaZaA)

Hybrid - Shortcomings

• High cost on centralized index

• Performance & scalability bottleneck

• Needs maintenance

• Vulnerable: a highly visible target

Pure - Shortcomings

• Inefficient search (flooding)

• Heterogeneity of peers not considered

– Bottlenecks (limited peers)

– Fragmentation

Super Peers - Shortcomings

• Super nodes might become bottlenecks for clients

– Requires redundancy

• A bad selection of super nodes might cause even worse problems

Search Methods

The Search Problem

• Connected graph

• Might contain cycles

• Individual node doesn’t know structure

• Only knows its neighbors

• No idea where data can be found

The Search Problem

• Goal: find as many occurrences of the data as possible, using minimal time and resources

• Solution (naive approaches):

– BFS?

– Bounded BFS?

Bounded BFS Search

[Figure: bounded BFS example with message labels TTL=2, TTL=1, TTL=0]

Bounded BFS Search

• Messages get a global TTL (time to live)

• Algorithm

– Source broadcasts a message to a subset of neighbors

– Neighbors search locally; results are sent to the source if found

– TTL = TTL – 1

– As long as TTL > 0, nodes forward the message to their neighbors

• Downside: wastes bandwidth / processing
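The flooding steps above can be sketched in a few lines. This is a minimal simulation rather than Gnutella's actual protocol: the adjacency-list `graph`, the `data` map, and the function name `bounded_bfs` are assumptions made for illustration, and the source here forwards to all of its neighbors. The message counter shows where the wasted bandwidth comes from, since duplicate deliveries over cycles are still paid for.

```python
from collections import deque

def bounded_bfs(graph, data, source, query, ttl):
    """Flood `query` from `source` for at most `ttl` hops.

    Returns (nodes that hold the data, number of query messages sent).
    `graph` and `data` are hypothetical in-memory stand-ins for the network.
    """
    seen = {source}                      # nodes that already handled this query
    frontier = deque([(source, ttl)])
    hits, messages = [], 0
    while frontier:
        node, remaining = frontier.popleft()
        if node != source and query in data.get(node, ()):
            hits.append(node)            # this node would reply to the source
        if remaining == 0:
            continue                     # TTL exhausted: do not forward
        for neighbor in graph[node]:
            messages += 1                # duplicates over cycles still cost a message
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return hits, messages

# Hypothetical six-node network with cycles; D and F hold the file.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "E"],
    "D": ["B", "F"],
    "E": ["C", "F"],
    "F": ["D", "E"],
}
data = {"D": {"song"}, "F": {"song"}}

print(bounded_bfs(graph, data, "A", "song", ttl=2))
print(bounded_bfs(graph, data, "A", "song", ttl=3))
```

Raising the TTL from 2 to 3 finds the second copy, but only by sending more messages, several of which reach nodes that have already seen the query.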

Current Methods

• Gnutella - BFS

– High cost

– Gets complete results (for depth D)

– Relatively short time

• Freenet - DFS

– Poor response time

– Minimizes BW costs

Suggested Methods

• Iterative deepening

• Directed BFS

• Local Indices

Iterative Deepening

• Idea:

– Search at a small depth and increase it if required

– Aims to minimize the cost of BFS without detracting from its ability to satisfy queries

• Notice that, given enough iterations, this method returns 100% of the results of BFS

Iterative Deepening (cont…)

• Elements:

– A policy P={a,b,c,..} defines the deepening behavior

– BFS is run to depth a and frozen

– If the source is satisfied, it stops the process

– Otherwise it asks BFS to resume to depth b

– The process is repeated until the source is satisfied or we reach the last policy item

Iterative Deepening (cont…)

• Elements:

– We can specify how long to wait (W) between iterations

– We need a system-wide message ID to identify individual messages

Example P={1,3,4} W=1
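The policy mechanism can be illustrated with a small simulation. This sketch makes several simplifying assumptions: the whole graph is visible to the simulator, "satisfied" means at least `z` results, the waiting time W is not modeled, and each iteration is evaluated from precomputed hop distances instead of freezing and resuming real messages. All function and variable names are hypothetical.

```python
from collections import deque

def hop_distances(graph, source):
    """Plain BFS returning the hop count from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def iterative_deepening(graph, data, source, query, policy, z):
    """Deepen through `policy` until at least `z` results are found.

    Returns (depth at which the search stopped, nodes that answered).
    """
    dist = hop_distances(graph, source)
    holders = {n: d for n, d in dist.items()
               if n != source and query in data.get(n, ())}
    results = []
    for depth in policy:
        results = sorted(n for n, d in holders.items() if d <= depth)
        if len(results) >= z:
            return depth, results      # source is satisfied: stop deepening
    return policy[-1], results         # last policy item reached

# Hypothetical chain A - B - C - D - E; B, D and E hold the file.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
files = {"B": {"x"}, "D": {"x"}, "E": {"x"}}

print(iterative_deepening(chain, files, "A", "x", policy=[1, 3, 4], z=2))
```

With P={1,3,4} and two results required, the first iteration (depth 1) is not enough, and the search stops after the second iteration at depth 3 without ever flooding to depth 4.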

Directed BFS

• Idea:

– Choose a subset of neighbors to query

– Neighbors will BFS as usual

– Aims to provide a balance between good response time and results

– Minimizes the cost of full BFS

• Notice that only a subset of the possible results is returned, so we might fail to satisfy the query

Directed BFS Example

[Figure: directed BFS example with message labels TTL=2, TTL=1, TTL=0]

Directed BFS (cont…)

• But which neighbors to pick?

– Maintain simple statistics on neighbors to derive heuristics

  • Highest past results

  • Lowest average hops (close to nodes containing useful data)

  • High message count (stable, can handle large flow)

  • Shortest message queue (a long queue implies saturation)

  • More to come …
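Choosing neighbors from per-neighbor statistics amounts to sorting by one field. The sketch below is an illustration under stated assumptions: the `NeighborStats` record, its field names, and the sample numbers are all hypothetical, and only four of the heuristics are covered, using the labels >RES, <HOPS, >MSG and <QLEN from the heuristics table in the analysis section.

```python
from dataclasses import dataclass

@dataclass
class NeighborStats:
    """Hypothetical per-neighbor counters kept by the query source."""
    name: str
    past_results: int     # results returned in recent queries
    avg_hops: float       # average hops taken by its results
    message_count: int    # messages of all types it has sent us
    queue_length: int     # current length of its message queue

# Each heuristic maps a neighbor to a sort key; a smaller key is preferred.
HEURISTICS = {
    ">RES": lambda s: -s.past_results,
    "<HOPS": lambda s: s.avg_hops,
    ">MSG": lambda s: -s.message_count,
    "<QLEN": lambda s: s.queue_length,
}

def pick_neighbors(stats, heuristic, k=1):
    """Return the names of the `k` best neighbors under `heuristic`."""
    ranked = sorted(stats, key=HEURISTICS[heuristic])
    return [s.name for s in ranked[:k]]

# Made-up statistics for three neighbors.
neighbors = [
    NeighborStats("B", past_results=12, avg_hops=3.0, message_count=400, queue_length=7),
    NeighborStats("C", past_results=30, avg_hops=2.5, message_count=150, queue_length=2),
    NeighborStats("D", past_results=5, avg_hops=1.5, message_count=900, queue_length=20),
]

print(pick_neighbors(neighbors, ">RES"))
print(pick_neighbors(neighbors, ">MSG", k=2))
```

Different heuristics can disagree on the same data: here >RES and <QLEN favor C, while <HOPS and >MSG favor D.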

Local Indices

• Idea:

– Nodes hold metadata of all nodes within radius r

– A query can be processed at fewer nodes yet return the same number of results

– Aims to balance satisfaction / costs

Local Indices

• Elements:

– A policy P={a,b,c,..} defines the depths at which we search

  • Example P={1,5,6}

  • Nodes at depth 1 process the query

  • Nodes at depth 2,3,4 forward without processing

  • Policy ends at depth 6

– System-wide radius r (small, ~50K metadata)

Example P={1,4}

[Figure: nodes marked "Process" or "Don’t process" for policy P={1,4}]
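A short sketch can make the process/forward split concrete. It assumes that a node processing the query at depth d answers on behalf of every node within r hops of it, so from the source's point of view it covers depths d-r through d+r. That reading, and the helper names `depth_actions` and `covered_depths`, are assumptions for illustration.

```python
def depth_actions(policy):
    """Map each depth up to the last policy item to 'process' or 'forward'."""
    last = max(policy)
    return {d: ("process" if d in policy else "forward")
            for d in range(1, last + 1)}

def covered_depths(policy, r):
    """Depths whose nodes are answered for, assuming each processing node
    at depth d holds an index of everything within r hops (d-r .. d+r)."""
    covered = set()
    for d in policy:
        covered.update(range(max(d - r, 0), d + r + 1))
    return sorted(covered)

print(depth_actions([1, 4]))
print(covered_depths([1, 4], r=1))
```

For P={1,4} and r=1, only depths 1 and 4 do any processing, yet under the stated assumption the indices at those depths cover depths 0 through 5.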

Local Indices (cont…)

• Notice that there is now a maintenance overhead

– On join

  • Send a join message with TTL = r

  • Direct exchange of metadata

– On leave / timeout

  • Remove metadata of departed / dead nodes

– On update

  • Send an update message with TTL = r
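The three maintenance events map naturally onto three operations on a per-node table. The class below is a rough sketch of that bookkeeping only: `LocalIndex` and its method names are hypothetical, metadata is reduced to a set of file names, and the TTL = r message propagation that triggers these calls is not modeled.

```python
class LocalIndex:
    """Hypothetical index a node keeps about peers within radius r."""

    def __init__(self):
        self.metadata = {}                 # peer id -> set of shared file names

    def on_join(self, peer, files):
        """A peer within radius r joined and exchanged its metadata."""
        self.metadata[peer] = set(files)

    def on_leave(self, peer):
        """A peer left or timed out: drop whatever we held about it."""
        self.metadata.pop(peer, None)

    def on_update(self, peer, files):
        """A peer announced a changed collection: replace its entry."""
        self.metadata[peer] = set(files)

    def lookup(self, query):
        """Answer a query on behalf of every indexed peer."""
        return sorted(p for p, files in self.metadata.items() if query in files)

index = LocalIndex()
index.on_join("B", ["a.mp3", "b.mp3"])
index.on_join("C", ["b.mp3"])
print(index.lookup("b.mp3"))
index.on_update("C", ["c.mp3"])
index.on_leave("B")
print(index.lookup("b.mp3"))
```

Every one of these calls corresponds to network traffic in the real system, which is why the cost of joins, leaves and updates has to be weighed against the savings on queries.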

Experimental Setup

Metrics

• How to compare methods?

1. Costs

2. Results

3. Time

Metrics

1. Costs

– We do not base cost on a specific query but rather calculate the average cost over Qrep, a representative set of real queries submitted

– It makes sense to discuss costs in aggregate (i.e., over all the nodes in the network)

– Therefore our two cost metrics are

  • Average aggregate bandwidth

  • Average aggregate processing cost

Metrics

2. Results Quality

– Number of results

– Satisfaction

3. Time to satisfaction
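As a small illustration of the satisfaction metric, the sketch below treats a query as satisfied when it receives at least Z results (Z is the threshold defined in the symbol table of the analysis section) and reports the satisfied fraction over a set of queries. The function name and the sample counts are made up for the example.

```python
def satisfaction_rate(result_counts, z):
    """Fraction of queries that received at least `z` results."""
    satisfied = sum(1 for count in result_counts if count >= z)
    return satisfied / len(result_counts)

# Hypothetical result counts for four queries, with Z = 50.
print(satisfaction_rate([0, 10, 50, 75], z=50))
```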

Data Collection

• Data gathered from Gnutella network

• Directly measured

– Iterative deepening

– Directed BFS

• Performance data & analysis

– Local indices

Data Collection

Collected Data

• Number of hops

• Response time

• Results per message

• Source IP

• Etc …

Data Collection

Extracted Data

Symbol    Description
M(Q,n)    # of response messages received for query Q, from n hops away
R(Q,n)    # of results received for query Q, from n hops away
N(Q,n)    # of nodes n hops away that process Q
C(Q,n)    # of redundant edges n hops away

Calculating Costs

• We’ve seen two types of costs

– Bandwidth (BW) costs

– Processing costs

• Calculations should take into account

– Costs of sending a query

– Costs of sending replies

• An example of calculating BW costs

Calculating Costs

$$BW_{bfs}(Q) = \sum_{n=1}^{D} \Big( a(Q) \cdot \big( N(Q,n) + C(Q,n) \big) + n \cdot \big( c \cdot R(Q,n) + d \cdot M(Q,n) \big) \Big)$$

Symbol    Description
a(Q)      Size of query Q
c         Size of result record
d         Size of response message
D         Max TTL
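To show how the terms combine, here is a small worked computation. It reads the formula as a sum over depths n = 1..D of the query traffic a(Q)·(N + C) plus the reply traffic n·(c·R + d·M), with replies from depth n travelling n hops back to the source. The function name and every number below are made up for illustration and are not measurements.

```python
def bw_bfs(a_q, c, d, per_depth):
    """Aggregate bandwidth of one BFS query.

    per_depth[n-1] holds (N, C, R, M) for depth n:
    nodes processing Q, redundant edges, results, response messages.
    """
    total = 0
    for n, (N, C, R, M) in enumerate(per_depth, start=1):
        query_cost = a_q * (N + C)           # query sent over every edge, redundant ones included
        reply_cost = n * (c * R + d * M)     # replies travel n hops back to the source
        total += query_cost + reply_cost
    return total

# Hypothetical sizes (bytes) and per-depth counts for D = 2.
per_depth = [
    (4, 0, 1, 1),     # depth 1: N=4, C=0, R=1, M=1
    (12, 2, 3, 2),    # depth 2: N=12, C=2, R=3, M=2
]
print(bw_bfs(a_q=10, c=5, d=8, per_depth=per_depth))
```

In this example depth 1 contributes 10·4 + 1·(5 + 8) = 53 and depth 2 contributes 10·14 + 2·(15 + 16) = 202, for a total of 255.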

Analysis of Results

Iterative Deepening

Symbols Used

Symbol    Definition
D         Maximum time-to-live of a message, in terms of hops
Z         Number of results needed to satisfy a query
Qrep      Representative set of queries for the Gnutella network
W         Waiting time (in seconds) between iterations
Ng        Number of neighbors of client (source node)

Results – Iterative Deepening

• Recall that an iterative deepening policy P={a,b,c,..} defines the deepening behavior

• In order to have the same level of satisfaction as BFS, a policy must have D as its last depth

• Also note the degenerate case, policy {D}, which is the bounded BFS we presented earlier

• Variables

– Define:

Pd = { d , d+1 , … , D }

P = { Pd for d = 1,2,…,D }

= { {1,2,…,D}, {2,3,…,D}, …, {D-1,D}, {D} }

– W (waiting time) can take the values 1, 2, 4, 6, 150 (seconds)
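The family of policies is easy to enumerate, which can help when reading the results. A minimal sketch, with hypothetical function names:

```python
def policy(d, D):
    """Pd = {d, d+1, ..., D}: start at depth d and deepen one hop at a time."""
    return list(range(d, D + 1))

def all_policies(D):
    """The full set P = {Pd for d = 1..D}, keyed by d."""
    return {d: policy(d, D) for d in range(1, D + 1)}

for d, p in all_policies(4).items():
    print(f"P{d} = {p}")
```

The last entry, PD = {D}, is the degenerate single-iteration policy, that is, plain bounded BFS.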

• Fixed values Z = 50 , Ng = 8

– Increasing Z

  • Lower probability of satisfaction

  • Higher costs

  • More results

– Decreasing Ng

  • Slightly lower probability of satisfaction

  • Significantly lower costs

• BW costs are the same for P7 for all values of W

• As d increases, costs increase: the larger d is, the more likely the policy will “overshoot”

• As W decreases, costs increase: with a small W, a premature determination of un-satisfaction again leads to overshooting

• Time to satisfaction is inversely proportional to cost

• Choose a policy that balances average waiting time and cost

• For example, {P5, W=6}

Analysis of Results

Directed BFS

Heuristics - Directed BFS

Symbol    Heuristic
RAND      (Random)
>RES      Returned the greatest number of results*
<TIME     Had the shortest average time to satisfaction*
<HOPS     Had the smallest average number of hops taken by results*
>MSG      Sent our client the greatest number of messages (all types)
<QLEN     Had the shortest message queue
<LAT      Had the shortest latency
>DEG      Had the highest degree (number of neighbors)

*in the past 10 queries

Results – Directed BFS

• Costs in directed BFS are unaffected by Z

• Users are more aware of quality of results than of BW costs

– We recommend >RES or <TIME

– Still cheaper than full BFS (~65%)

• Summing up so far

– Iterative deepening: lowest costs

– Directed BFS: fastest time to satisfaction

Analysis of Results

Local Indices

Results – Local Indices

• Recall that local indices policies P={a,b,c,..} define the depths at which we search

• We choose policies that minimize the number of nodes that process the query

• We consider the following policies

• Also recall that joins / leaves / updates have a BW overhead

• QJR (QueryJoinRatio) gives us the ratio of queries to joins/leaves in the network

P0 r=0

• Time to Satisfaction

– Because most Query and Response messages have r fewer hops to travel, the time to forward messages to the outermost depth and back to the source will be shorter than for BFS

– However, because nodes have larger indices, processing the query should take more time

• Summary

– Huge savings in costs

– Time to satisfaction comparable to BFS

– Determining r must take QJR into consideration

• For current QJR values (e.g. Gnutella = 10), r = 1 is a good choice

Relative performance

Technique              Time to satisfy   Satisfaction probability   Number of results   Aggregate bandwidth   Aggregate processing
Bounded BFS            100%              100%                       100%                100%                  100%
Iterative deepening    190%              100%                       19%                 28%                   47%
Directed BFS           140%              86%                        37%                 38%                   28%
Local indices          ≈100%             100%                       100%                39%                   51%

Conclusions

• All 3 methods show significant bandwidth and processing savings

• Methods are simple and easy to implement in current systems

• Methods might be used in conjunction

Bibliography

Yang, Beverly; Garcia-Molina, Hector:

• Improving Search in Peer-to-Peer Systems. http://newdbpubs.stanford.edu:8090/pub/2002-28

• Improving Search in Peer-to-Peer Systems [extended]. http://newdbpubs.stanford.edu:8090/pub/2001-47

• Designing a Super-peer Network. http://newdbpubs.stanford.edu:8090/pub/2003-33

Gnutella website: http://www.gnutella.com/

Thank you
