"A Measurement Study of Peer-to-Peer File Sharing Systems"

"A Measurement Study of Peer-to-Peer File Sharing

Systems"

Stefan Saroiu, P. Krishna Gummadi Steven D. Gribble, "A Measurement Study of Peer-to-Peer File Sharing Systems", Proceedings of the Multimedia Computing and Networking (MMCN), San Jose, January, 2002.

Peer-to-Peer Membership

Ad-hoc Dynamic

Goals Design architecture that encourages cooperation

Examples Napster Gnutella BitTorrent

Server maintains index of files contained in connected peers

Peer queries server for file

Server returns peer who has file and then queries other servers

Direct link between peers to transfer file

Locating Files (Napster)

Locating Files (Gnutella) Queries are performed

by flooding the network Responses are

returned only if the peer has the file, queries are forwarded to neighbors

Files are downloaded via direct link between two peers

Collecting Metrics (Napster) Napster Crawler

Query for “popular” files using multiple simultaneous connections, maintain list of peers returned by server

For each peer collect metadata from the server Peers reported bandwidth Number of files shared Current number of uploads/downloads Names and sizes of all files shared IP address

Collecting Metrics (Gnutella) Gnutella Crawler

Connect to popular peers Send ping messages with large TTLs to known

peers Add newly discovered peers based on pong

messages Pong messages include metadata about peer

Number of files Total size of files

Metadata collected Bottleneck bandwidth Latency Number of shared files Lifetime/Uptime Distribution across DNS domains

Approx. 60% of peers have a session duration shorter than 1 hour

Napster peers are up a larger percentage of the time as compared to Gnutella peers

Gnutella –20% of hosts are available >45% of the time Napster – 20% of hosts are available >80% of the time

Approximately 30% of Napster peers misreport their bandwidth (Modems + ISDN < 64Kbps)

Authors argue that misreporting is one indication that many users in a p2p system are not willing to cooperate.

Gnutella Approximately 25% of Gnutella peers share no files 75% of peers share less than 100 files 7% of peers share more than 1000 files (which is

more than the other 93% combined) Napster

40%-60% of users share only 5-20% of files

(a) Gnutella network of 1771 peers(b) 30% of peers randomly removed

(a) 1106 of remaining 1300 nodes still connected(c) 4% of peers selectively removed (63 best connected peers)

(a) Network becomes highly fragmented

"Characterizing Unstructured Overlay Topologies in Modern

P2P File-Sharing Systems"

Daniel Stutzbach, Reza Rejaie, and Subhabrata Sen, "Characterizing Unstructured Overlay Topologies in

Modern P2P File-Sharing Systems," Networking, 2008.

Gnutella Characteristics in Depth

• Previous studies as claimed in the paper:– Lack the accuracy of the captured snapshots.– The crawlers that have been used weren’t fast enough.

• Resulting in distorted snapshots• And partial view of the topology.

– Simulations are standing on invalid assumption (e.g. power-law distributing).

– “Finally, to our knowledge, the dynamics of unstructured P2P overlay topologies have not been studied in detail in any prior work.”

Cruiser: Fast and Accurate Crawler

• “Cruiser can accurately capture a complete snapshot of the Gnutella network with more than one million peers in just a few minutes”.

• Faster than any crawler ever built, as claimed.• Enabled capturing the overlay dynamics which led to

more accurate characterization. • Data Set: 18,000 snapshot captured in 11 month

period. Weekly intervals and daily random captures.

Some of the Findings

• As mentioned, the study refutes the power-law distribution of the node degree.

• Online-like architecture. • Limewire v.s BearShare.• Reachability: 30-38% are

unreachable. Previous studies ignored this factor which forms a non-negligible fraction of the peers

WikiPedia

Two-Tier Topology

Top-level overlay

Leaf

Ultrapeer

Fast Crawler vs. Slow Crawler

• Why the Power-Law Distribution is a measurement artifact?

• Slow Crawler: Cruiser with less concurrent connection.

Form Two-piece Power-Law distribution.

Top-Level Overlay Analysis

• The actual distribution is two-piece Power-Law distribution

Peaks at 30 degreeDue to the fact of the pre-configuration of Limewire and BearShare.

New peers, approaching 30 degree

Re-configured or other implementation to have higher degree

Leaves Overlay Analysis

Limewire: 30 leaves

BearShare: 45 leaves

Majority have 3 or less parent

< 0.02% 100-3000 parent Other: 75 leaves

Reachability

• Flood-based Query: New peers are discovered exponentially up to a certain point.

• Pair-wise Distance: 60% have a length of 4 as shortest path distance

• Effect of two-tier topology on leaves:– One Parent: we get a distribution similar to the

Utlrapeers shifted by 2.– More than that:

• 50% -> length of 5 (+1).• 50% -> length of 6 (+2).

Overlay Dynamics

• As the number of peers increases, the uptime (in hours) decreases exponentially.

Overlay Dynamics-2

• What are the causes ?– Protocol Driven: As the peers select their

neighbors according to the protocol.– User Driven: Peer participation

• Definition of a stable peer: – a peer is stable if it manages to have a

connection duration of time t. t=48 hours as in the study.

Internal Connectivity of Stable Core

• Stable core: peers with t >= 48. [Excluding the connection between unstable peers]

• 88-94% remained connected.• Observations:

– Stable core are clustered together

– Peers with higher uptime are more biased to establish connections with each others.

External Connectivity of Stable Core

• Peers (not in the stable core) are following the same behavior. (i.e biased to connect with peers have >= uptime)

• This behavior led to form onion-like connections.– The core of the onion is the Stable Core

• SC(t) < P1(t)• …• Pn-1(t) < Pn(t)

Overlay Dynamics .. The Main Cause

User driven dynamics are the major factor of the overlay dynamics.

"Making Gnutella-like P2P Systems Scalable"

Y. Chawathe, S. Ratnaswamy, L. Breslau, N. Lanham, S. Shenker, "Making Gnutella-like P2P Systems

Scalable," ACM Sigcomm 2003,Dongyu Qiu and R. Srikant, "Modeling and performance analysis of

BitTorrent-like peer-to-peer networks", Proceedings of ACM Sigcomm, 2004.

Review of Gnutella

• Uses an unstructured overlay network• Distributed download AND search• Floods query across this overlay with a

limited scope• Notorious for poor scaling

Scaling Gnutella

• Previously proposed solution: hash table to wide-area file search

• The paper advocates maintaining simplicity of unstructured system but with new mechanisms– propose solution using aspects of a system similar

to KaZaA and its supernodes model• supernodes have higher bandwidth connectivity• searches are routed to supernodes which hold pointers

to peer data

Distributed Hash Tables (structured overlay)

• Pros:– Looking up using DHT requires O(log n) steps vs.

Gnutella's O(n) steps• Cons: Doesn’t deal well with…

– Transience of nodes in P2P network: DHTs require repair operations to preserve efficiency and correctness of routing.

– Situations where keyword searches are more prevalent, and important, than exact-match queries

– Situations where most queries are for relatively well-replicated files not single copies of files

Gia’s Improvements

• All messages exchanged by clients are tagged at their origin with a globally unique identifier

• Explicitly accounts for node heterogeneity and capacity constraints

• Replace flooding with “biased” random walks• It is not any single of the following

components, but in fact, the combination of them all that provides GIA's large performance advantage…

• Dynamic topology adaptation: – Puts most nodes within short reach of high

capacity nodes and makes sure that the well-connected high-degree nodes can actual handle the large number of queries by calculating “satisfaction level” (discussed later)

• Active flow control: – Avoid overloading by assigning flow-control tokens

based on "available capacity"– Dropping queries not an option in Gia where

random walks are taken instead of floods– A node that advertises high capacity to handle

incoming queries is in turn assigned more tokens for its own outgoing queries.

• One-hop replication: of pointers to content offered by immediate neighbors– This way a node can also answer queries for its

neighbors (resulting in fewer hops)– When a node goes offline, its information is

flushed from its neighbors to maintain consistency and accuracy

• Search protocol: – Based on biased random walks that directs

queries towards high-capacity nodes (instead of purely random)

– TTLs (and MAX_RESPONSES) bound duration of the biased random walks and book-keeping avoids redundant paths

“Satisfaction Level”

• A measure of how close the sum of the capacities of all of a node's neighbors (normalized by their degrees) is to the node's own capacity.

• To add a new neighbor, a node randomly selects a small number of candidate entries. From these randomly chosen entries, it selects the node with maximum capacity greater than its own capacity. If no such candidate entry exists, it selects one at random. (see Algorithm 1)

• Nodes with low satisfaction levels perform topology adaptation more frequently than satisfied nodes.

I = T x K^(1-S)S : Satisfaction level (see algorithm 2)I : Adaptation interval T : maximum interval between adaptation iterationsK : aggressiveness of the adaptation.

• After each interval I, if a node's S < 1.0, it attempts to add a new neighbor. If S = 1.0, it still continues to iterate through the adaptation process, checking its satisfaction level every T seconds.

Algorithms(regarding Satisfaction)

Gia, Flood, Random Walk over Random Topology, and Supernode models.

• When the query load increases, we notice a sharp "knee" in the curves beyond which the success rate drops sharply and delays increase rapidly. The hop-count holds steady until the knee-point and then decreases.

Gia, Flood, Random Walk over Random Topology, and Supernode models.

• Compare by observing Collapse Point (CP) and Hop-count before Collapse Point (CP-HC)

Single Search Responses• Aggregate System Capacity of Gia is 3 to 5 orders of magnitude

higher than Flood and Random Walk Random Topology.• RWRT performs better than Flood typically but can be about the

same when there are fewer nodes since RWRT may end up visiting practically all the nodes anyway.

• Flood and Supernode models retain low hop counts whereas Gia may sometimes need to traverse quite a bit. RWRT, being random, has hop counts that are inversely proportional to the replication factor.

• Flood and Supernode models have a collapse point that falls as the number of nodes increases due to the greater query load at each node.

Multiple Search Responses

• MAX_RESPONSES parameter only affect Gia and RWRT since Flood and Supernode flood throughout the network

• Naturally, higher MAX_RESPONSES causes more hop counts in Gia and RWRT models. Collapse point is also drops when the replication factor is low.

• Note: A search for k responses at r% replication is equivalent to one for a single answer at r/k% replication.

Node failure

• When a node leaves the network, its queued queries are resumed by the nodes that originally generated them.

• How do we make sure a query isn't lost?– Keep-alive messages (query responses sent back

to the originator of the query act as implicit keep-alives). If no actual matches have been found yet, we’ll still just send a dummy query response message.

Next bottleneck:Download process?

• The authors believe that the technique of directing towards higher capacity nodes helps alleviate this issue.

• To make the above benefit more significant, it would be better to have the higher capacity nodes store more files (rather than simply pointers).

• Simple solution would be to have popular files at low capacity nodes replicated to the higher capacity nodes.

Documents

"A Measurement Study of Peer-to-Peer File Sharing Systems"