"A Measurement Study of Peer-to-Peer File Sharing Systems" Stefan Saroiu, P. Krishna Gummadi Steven D. Gribble, "A Measurement Study of Peer-to-Peer File Sharing Systems", Proceedings of the Multimedia Computing and Networking (MMCN), San Jose, January, 2002.


Page 1: "A Measurement Study of Peer-to-Peer File Sharing Systems"

"A Measurement Study of Peer-to-Peer File Sharing

Systems"

Stefan Saroiu, P. Krishna Gummadi, and Steven D. Gribble, "A Measurement Study of Peer-to-Peer File Sharing Systems", Proceedings of the Multimedia Computing and Networking (MMCN), San Jose, January 2002.

Page 2: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Peer-to-Peer

• Membership: ad hoc, dynamic
• Goals: design an architecture that encourages cooperation
• Examples: Napster, Gnutella, BitTorrent

Page 3: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Locating Files (Napster)

• Server maintains an index of the files held by connected peers
• Peer queries the server for a file
• Server returns a peer that has the file, then queries other servers
• File is transferred over a direct link between the peers
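The centralized lookup described on this slide can be sketched as follows. This is a minimal illustration, not Napster's actual protocol: the `CentralIndex` class, its method names, and the peer addresses are all hypothetical, and the step where the server queries other servers is omitted.

```python
from collections import defaultdict


class CentralIndex:
    """Hypothetical Napster-style index server: maps file names to peers."""

    def __init__(self):
        self._peers_by_file = defaultdict(set)

    def register(self, peer, files):
        # A connecting peer reports the files it shares.
        for name in files:
            self._peers_by_file[name].add(peer)

    def unregister(self, peer):
        # A disconnecting peer is dropped from every entry.
        for peers in self._peers_by_file.values():
            peers.discard(peer)

    def query(self, name):
        # The server only returns addresses; the transfer itself
        # happens directly between the two peers.
        return sorted(self._peers_by_file.get(name, ()))


index = CentralIndex()
index.register("peer-a:6699", ["song.mp3", "talk.mp3"])
index.register("peer-b:6699", ["song.mp3"])
holders = index.query("song.mp3")
```

The point of the design is that search cost stays at one round trip to the server, at the price of a central point of failure.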

Page 4: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Locating Files (Gnutella)

• Queries are performed by flooding the network
• A response is returned only if the peer has the file; queries are forwarded to neighbors
• Files are downloaded via a direct link between two peers
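A TTL-limited flood over a toy overlay might look like the sketch below. The topology, the `flood_query` name, and the message-counting rule (every forwarded copy counts, including duplicates that the receiver drops) are assumptions made for illustration only.

```python
from collections import deque


def flood_query(neighbors, files, origin, wanted, ttl):
    """Return (peers holding `wanted`, number of query messages sent)."""
    seen = {origin}
    hits = []
    messages = 0
    frontier = deque([(origin, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        # A peer answers only if it actually has the file.
        if peer != origin and wanted in files.get(peer, ()):
            hits.append(peer)
        if remaining == 0:
            continue
        # The query is forwarded to every neighbor; duplicates are
        # still sent (and counted) but not processed again.
        for nxt in neighbors[peer]:
            messages += 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, remaining - 1))
    return hits, messages


neighbors = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
files = {"C": {"f"}, "E": {"f"}}
short_flood = flood_query(neighbors, files, "A", "f", ttl=2)
long_flood = flood_query(neighbors, files, "A", "f", ttl=3)
```

Raising the TTL reaches more peers but also increases the number of messages, which is the scaling problem discussed later in the deck.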

Page 5: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Collecting Metrics (Napster)

Napster Crawler

• Query for "popular" files using multiple simultaneous connections, and maintain a list of the peers returned by the server
• For each peer, collect metadata from the server:
  – Peer's reported bandwidth
  – Number of files shared
  – Current number of uploads/downloads
  – Names and sizes of all files shared
  – IP address

Page 6: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Collecting Metrics (Gnutella)

Gnutella Crawler

• Connect to popular peers
• Send ping messages with large TTLs to known peers
• Add newly discovered peers based on pong messages
• Pong messages include metadata about the peer:
  – Number of files
  – Total size of files
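The ping/pong discovery loop can be sketched as a breadth-first crawl. This is only an outline under stated assumptions: `send_ping` is a hypothetical callable standing in for the real network exchange, and the pong tuples `(address, file_count, total_kb)` are an invented shape for the metadata listed above.

```python
from collections import deque


def crawl(seeds, send_ping):
    """Breadth-first crawl starting from `seeds`.

    `send_ping(peer)` is a hypothetical callable that returns the pongs
    gathered through that peer as (address, file_count, total_kb) tuples.
    """
    discovered = {}
    queue = deque(seeds)
    queued = set(seeds)
    while queue:
        peer = queue.popleft()
        for address, file_count, total_kb in send_ping(peer):
            # Record the metadata carried by the pong.
            discovered[address] = (file_count, total_kb)
            # Newly discovered peers are pinged in turn.
            if address not in queued:
                queued.add(address)
                queue.append(address)
    return discovered


# Simulated pong replies for a four-peer network (made-up numbers).
pongs = {
    "p1": [("p2", 10, 4000), ("p3", 0, 0)],
    "p2": [("p1", 5, 2000)],
    "p3": [("p4", 1200, 900000)],
    "p4": [],
}
snapshot = crawl(["p1"], lambda peer: pongs.get(peer, []))
```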

Page 7: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Metadata collected

• Bottleneck bandwidth
• Latency
• Number of shared files
• Lifetime/uptime
• Distribution across DNS domains

Page 8: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Approx. 60% of peers have a session duration shorter than 1 hour

Napster peers are up a larger percentage of the time as compared to Gnutella peers

• Gnutella: 20% of hosts are available >45% of the time
• Napster: 20% of hosts are available >80% of the time

Page 9: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Approximately 30% of Napster peers misreport their bandwidth (Modems + ISDN < 64Kbps)

Authors argue that misreporting is one indication that many users in a p2p system are not willing to cooperate.

Page 10: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Gnutella

• Approximately 25% of Gnutella peers share no files
• 75% of peers share fewer than 100 files
• 7% of peers share more than 1000 files (which is more than the other 93% combined)

Napster

• 40%-60% of users share only 5-20% of files

Page 11: "A Measurement Study of Peer-to-Peer File Sharing Systems"

(a) Gnutella network of 1771 peers

(b) 30% of peers randomly removed: 1106 of the remaining 1300 nodes are still connected

(c) 4% of peers selectively removed (the 63 best-connected peers): the network becomes highly fragmented
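The contrast between random and targeted removal can be reproduced on a toy graph. The sketch below does not use the measured Gnutella topology: the three-hub layout, the node names, and the choice of which leaves to remove are all invented, and removing only leaves is a stand-in for a random removal that mostly hits low-degree peers.

```python
def largest_component(adjacency, removed):
    """Size of the largest connected component after deleting `removed` nodes."""
    unvisited = set(adjacency) - set(removed)
    best = 0
    while unvisited:
        start = unvisited.pop()
        stack, size = [start], 1
        while stack:
            node = stack.pop()
            for nxt in adjacency[node]:
                if nxt in unvisited:
                    unvisited.remove(nxt)
                    stack.append(nxt)
                    size += 1
        best = max(best, size)
    return best


# Toy overlay: three interconnected hubs, each serving ten leaves.
hubs = ["h0", "h1", "h2"]
adjacency = {h: [o for o in hubs if o != h] for h in hubs}
for i, hub in enumerate(hubs):
    for j in range(10):
        leaf = f"leaf{i}-{j}"
        adjacency[leaf] = [hub]
        adjacency[hub].append(leaf)

# Removing nine low-degree peers barely matters...
some_leaves = [f"leaf{i}-{j}" for i in range(3) for j in range(3)]
after_leaf_removal = largest_component(adjacency, some_leaves)
# ...while removing the three best-connected peers isolates every leaf.
after_hub_removal = largest_component(adjacency, hubs)
```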

Page 12: "A Measurement Study of Peer-to-Peer File Sharing Systems"

"Characterizing Unstructured Overlay Topologies in Modern P2P File-Sharing Systems"

Daniel Stutzbach, Reza Rejaie, and Subhabrata Sen, "Characterizing Unstructured Overlay Topologies in Modern P2P File-Sharing Systems," Networking, 2008.

Page 13: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Gnutella Characteristics in Depth

• Previous studies, as claimed in the paper:
  – Lack accuracy in the captured snapshots.
  – Used crawlers that were not fast enough, resulting in:
    • distorted snapshots
    • a partial view of the topology
  – Rely on simulations built on invalid assumptions (e.g., a power-law degree distribution).
  – “Finally, to our knowledge, the dynamics of unstructured P2P overlay topologies have not been studied in detail in any prior work.”

Page 14: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Cruiser: Fast and Accurate Crawler

• “Cruiser can accurately capture a complete snapshot of the Gnutella network with more than one million peers in just a few minutes”.

• Faster than any crawler previously built, as claimed.
• Enabled capturing the overlay dynamics, which led to a more accurate characterization.
• Data set: 18,000 snapshots captured over an 11-month period, at weekly intervals plus random daily captures.

Page 15: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Some of the Findings

• As mentioned, the study refutes the power-law distribution of the node degree.

• Onion-like architecture.
• LimeWire vs. BearShare.
• Reachability: 30-38% of peers are unreachable. Previous studies ignored this factor, although these peers form a non-negligible fraction of the network.


Page 16: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Two-Tier Topology

[Figure: two-tier topology, with ultrapeers forming the top-level overlay and leaves attached to them]

Page 17: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Fast Crawler vs. Slow Crawler

• Why is the power-law distribution a measurement artifact?

• Slow crawler: Cruiser with fewer concurrent connections. Its snapshots form a two-piece power-law distribution.

Page 18: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Top-Level Overlay Analysis

• The actual distribution is a two-piece power-law distribution

• Peaks at degree 30, due to the default configuration of LimeWire and BearShare.

• New peers, approaching degree 30

• Reconfigured clients or other implementations with a higher degree

Page 19: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Leaves Overlay Analysis

• LimeWire: 30 leaves

• BearShare: 45 leaves

• Other: 75 leaves

• The majority of leaves have 3 or fewer parents

• < 0.02% have 100-3000 parents

Page 20: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Reachability

• Flood-based Query: New peers are discovered exponentially up to a certain point.

• Pair-wise distance: 60% of pairs have a shortest-path distance of 4

• Effect of the two-tier topology on leaves:
  – One parent: we get a distribution similar to that of the ultrapeers, shifted by 2.
  – More than one parent:
    • 50% -> length of 5 (+1).
    • 50% -> length of 6 (+2).

Page 21: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Overlay Dynamics

• As the number of peers increases, the uptime (in hours) decreases exponentially.

Page 22: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Overlay Dynamics-2

• What are the causes?
  – Protocol-driven: peers select their neighbors according to the protocol.
  – User-driven: peer participation.

• Definition of a stable peer:
  – A peer is stable if its uptime is at least t, with t = 48 hours in the study.

Page 23: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Internal Connectivity of Stable Core

• Stable core: peers with t >= 48 hours. [Excluding the connections between unstable peers]

• 88-94% remained connected.

• Observations:
  – Peers in the stable core are clustered together.
  – Peers with higher uptime are more biased toward establishing connections with each other.

Page 24: "A Measurement Study of Peer-to-Peer File Sharing Systems"

External Connectivity of Stable Core

• Peers outside the stable core follow the same behavior (i.e., they are biased toward connecting to peers with equal or higher uptime).

• This behavior leads to onion-like connectivity.
  – The core of the onion is the stable core.

• SC(t) < P1(t)
• …
• Pn-1(t) < Pn(t)

Page 25: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Overlay Dynamics .. The Main Cause

User-driven dynamics are the major factor in the overlay dynamics.

Page 26: "A Measurement Study of Peer-to-Peer File Sharing Systems"

"Making Gnutella-like P2P Systems Scalable"

Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker, "Making Gnutella-like P2P Systems Scalable," ACM SIGCOMM, 2003.

Dongyu Qiu and R. Srikant, "Modeling and performance analysis of BitTorrent-like peer-to-peer networks", Proceedings of ACM SIGCOMM, 2004.

Page 27: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Review of Gnutella

• Uses an unstructured overlay network
• Distributed download AND search
• Floods queries across this overlay with a limited scope
• Notorious for poor scaling

Page 28: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Scaling Gnutella

• Previously proposed solution: apply distributed hash tables to wide-area file search

• The paper advocates keeping the simplicity of an unstructured system but adding new mechanisms
  – It proposes a solution using aspects of a system similar to KaZaA and its supernode model
    • supernodes have higher-bandwidth connectivity
    • searches are routed to supernodes, which hold pointers to peer data

Page 29: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Distributed Hash Tables (structured overlay)

• Pros:
  – A lookup using a DHT requires O(log n) steps vs. Gnutella's O(n) steps

• Cons: doesn’t deal well with…
  – Transience of nodes in a P2P network: DHTs require repair operations to preserve the efficiency and correctness of routing.
  – Situations where keyword searches are more prevalent, and more important, than exact-match queries
  – Situations where most queries are for relatively well-replicated files, not single copies of files

Page 30: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Gia’s Improvements

• All messages exchanged by clients are tagged at their origin with a globally unique identifier

• Explicitly accounts for node heterogeneity and capacity constraints

• Replaces flooding with “biased” random walks

• It is not any single one of the following components but the combination of them all that provides Gia's large performance advantage…

Page 31: "A Measurement Study of Peer-to-Peer File Sharing Systems"

• Dynamic topology adaptation:
  – Puts most nodes within short reach of high-capacity nodes and makes sure that the well-connected, high-degree nodes can actually handle the large number of queries, by calculating a “satisfaction level” (discussed later)

• Active flow control:
  – Avoids overloading by assigning flow-control tokens based on "available capacity"
  – Dropping queries is not an option in Gia, where random walks are used instead of floods
  – A node that advertises high capacity to handle incoming queries is in turn assigned more tokens for its own outgoing queries.

Page 32: "A Measurement Study of Peer-to-Peer File Sharing Systems"

• One-hop replication: each node keeps pointers to the content offered by its immediate neighbors
  – This way a node can also answer queries for its neighbors (resulting in fewer hops)
  – When a node goes offline, its information is flushed from its neighbors to maintain consistency and accuracy

• Search protocol:
  – Based on biased random walks that direct queries towards high-capacity nodes (instead of purely random ones)
  – TTLs (and MAX_RESPONSES) bound the duration of the biased random walks, and book-keeping avoids redundant paths
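How one-hop replication and a capacity-biased walk interact can be sketched as below. This is a deliberately simplified, deterministic variant, not Gia's actual search protocol: it always steps to the highest-capacity unvisited neighbor, ignores flow-control tokens, and uses an invented five-node overlay. The function name and data layout are hypothetical.

```python
def biased_walk(neighbors, capacity, files, origin, wanted, ttl, max_responses):
    """Greedy capacity-biased walk that answers from a one-hop index."""
    visited = {origin}  # book-keeping that avoids revisiting nodes
    responses = set()
    node = origin
    for _ in range(ttl):
        # One-hop replication: the current node can answer for itself
        # and for each of its immediate neighbors.
        for holder in [node, *neighbors[node]]:
            if wanted in files.get(holder, ()):
                responses.add(holder)
        if len(responses) >= max_responses:
            break
        candidates = [n for n in neighbors[node] if n not in visited]
        if not candidates:
            break
        # Bias: step towards the highest-capacity unvisited neighbor.
        node = max(candidates, key=lambda n: capacity[n])
        visited.add(node)
    return sorted(responses)


neighbors = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
}
capacity = {"a": 1, "b": 100, "c": 10, "d": 1000, "e": 1}
files = {"c": {"g"}, "e": {"f"}}
found_f = biased_walk(neighbors, capacity, files, "a", "f", ttl=5, max_responses=1)
found_g = biased_walk(neighbors, capacity, files, "a", "g", ttl=5, max_responses=1)
```

In this toy run the query for "f" is answered at node d, one hop before the walk would reach e, which is the hop saving the slide attributes to one-hop replication.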

Page 33: "A Measurement Study of Peer-to-Peer File Sharing Systems"

“Satisfaction Level”

• A measure of how close the sum of the capacities of all of a node's neighbors (normalized by their degrees) is to the node's own capacity.

• To add a new neighbor, a node randomly selects a small number of candidate entries. From these randomly chosen entries, it selects the node with maximum capacity greater than its own capacity. If no such candidate entry exists, it selects one at random. (see Algorithm 1)
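The neighbor-selection rule just described can be sketched as follows. The `host_cache` dictionary (candidate address mapped to advertised capacity), the function name, and the default sample size of 3 are assumptions for illustration; the slide only says "a small number of candidate entries".

```python
import random


def pick_neighbor_to_add(own_capacity, host_cache, sample_size=3, rng=random):
    """Pick a candidate neighbor from `host_cache` (address -> capacity)."""
    if not host_cache:
        return None
    # Randomly select a small number of candidate entries.
    sample = rng.sample(sorted(host_cache), min(sample_size, len(host_cache)))
    # Prefer the highest-capacity candidate that beats our own capacity.
    better = [c for c in sample if host_cache[c] > own_capacity]
    if better:
        return max(better, key=lambda c: host_cache[c])
    # Otherwise fall back to a random candidate from the sample.
    return rng.choice(sample)


cache = {"x": 10, "y": 1000, "z": 100}
choice_for_weak_node = pick_neighbor_to_add(50, cache)
choice_for_strong_node = pick_neighbor_to_add(5000, cache)
```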

Page 34: "A Measurement Study of Peer-to-Peer File Sharing Systems"

• Nodes with low satisfaction levels perform topology adaptation more frequently than satisfied nodes.

I = T × K^(-(1 - S))

  S: satisfaction level (see Algorithm 2)
  I: adaptation interval
  T: maximum interval between adaptation iterations
  K: aggressiveness of the adaptation

• After each interval I, if a node's S < 1.0, it attempts to add a new neighbor. If S = 1.0, it still continues to iterate through the adaptation process, checking its satisfaction level every T seconds.
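A small worked example of the adaptation interval, assuming it takes the form I = T × K^(-(1 - S)) so that T really is the maximum interval (reached at S = 1) and less satisfied nodes adapt more often. The default values T = 10 seconds and K = 256 are illustrative assumptions, not values stated on the slide.

```python
def adaptation_interval(satisfaction, max_interval=10.0, aggressiveness=256.0):
    """Seconds until the next adaptation attempt, I = T * K**(-(1 - S))."""
    if not 0.0 <= satisfaction <= 1.0:
        raise ValueError("satisfaction must lie in [0, 1]")
    return max_interval * aggressiveness ** -(1.0 - satisfaction)


fully_satisfied = adaptation_interval(1.0)   # waits the full T
half_satisfied = adaptation_interval(0.5)    # T / sqrt(K)
unsatisfied = adaptation_interval(0.0)       # T / K, the shortest wait
```

With these assumed parameters, a completely unsatisfied node retries 256 times more often than a fully satisfied one.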

Page 35: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Algorithms (regarding Satisfaction)

Page 36: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Gia, Flood, Random Walk over Random Topology, and Supernode models.

• When the query load increases, we notice a sharp "knee" in the curves beyond which the success rate drops sharply and delays increase rapidly. The hop-count holds steady until the knee-point and then decreases.

Page 37: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Gia, Flood, Random Walk over Random Topology, and Supernode models.

• Compare by observing Collapse Point (CP) and Hop-count before Collapse Point (CP-HC)

Page 38: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Single Search Responses

• The aggregate system capacity of Gia is 3 to 5 orders of magnitude higher than that of Flood and Random Walk over Random Topology (RWRT).

• RWRT typically performs better than Flood, but the two can be about the same when there are fewer nodes, since RWRT may end up visiting practically all the nodes anyway.

• Flood and Supernode models retain low hop counts whereas Gia may sometimes need to traverse quite a bit. RWRT, being random, has hop counts that are inversely proportional to the replication factor.

• Flood and Supernode models have a collapse point that falls as the number of nodes increases due to the greater query load at each node.

Page 39: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Multiple Search Responses

• The MAX_RESPONSES parameter affects only Gia and RWRT, since Flood and Supernode flood throughout the network

• Naturally, a higher MAX_RESPONSES leads to higher hop counts in the Gia and RWRT models. The collapse point also drops when the replication factor is low.

• Note: A search for k responses at r% replication is equivalent to one for a single answer at r/k% replication.
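The equivalence in the note can be checked with simple arithmetic under a strong simplifying assumption: a walk that samples nodes uniformly at random (with replacement), so each visit hits a copy with probability equal to the replication fraction. The `expected_visits` helper is hypothetical and ignores topology, bias, and one-hop replication.

```python
def expected_visits(replication, responses=1):
    """Expected uniformly random node visits needed to collect `responses`
    matches when a fraction `replication` of the nodes holds the file."""
    if not 0.0 < replication <= 1.0:
        raise ValueError("replication must lie in (0, 1]")
    return responses / replication


# k = 5 responses at r = 5% replication...
five_at_five_percent = expected_visits(0.05, responses=5)
# ...costs the same as a single response at r/k = 1% replication.
one_at_one_percent = expected_visits(0.01, responses=1)
```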

Page 40: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Node failure

• When a node leaves the network, its queued queries are resumed by the nodes that originally generated them.

• How do we make sure a query isn't lost?
  – Keep-alive messages (query responses sent back to the originator of the query act as implicit keep-alives). If no actual matches have been found yet, a dummy query response message is sent instead.

Page 41: "A Measurement Study of Peer-to-Peer File Sharing Systems"

Next bottleneck: Download process?

• The authors believe that the technique of directing queries towards higher-capacity nodes helps alleviate this issue.

• To make the above benefit more significant, it would be better to have the higher capacity nodes store more files (rather than simply pointers).

• Simple solution would be to have popular files at low capacity nodes replicated to the higher capacity nodes.