Carleton University BCS Honours Project

Peer-to-Peer File Sharing Network Optimisation

Darryl Edward Payne
Student: 266137

Supervisor: Dr. Tony White, Computer Science
Wednesday, March 31, 2004

Abstract

Peer-to-peer file sharing networks (P2PFSN) appeared after most of the other “killer apps” on the Internet had permanently affixed themselves to our lives. Nevertheless, they are now one of the most common methods of publishing content on the Internet. There are two aspects to these networks: searching for wanted files, and downloading content. This paper focuses on content searching, but touches on content downloading where appropriate. The purpose of this paper is to examine how a mature network, Gnutella, functions, and to consider several relatively simple changes to it which, it is hoped, will create noticeable improvements in its usability. Modifications made to the Gnutella network as part of this project are implemented by extending Phex, an existing open-source client written in the Java programming language. Specific changes to the client include the addition of caching of mirrors for local content, and modifications to the order in which connections to hosts on the network are attempted.

Acknowledgements

I’d like to thank the entire development team at Momentous.ca, past and present, for their support during my preparation of this report. Each one contributed their opinions and ideas, and they all deserve some credit. Specifically, they are Ryan North for his superior combination of grammar and technical knowledge, Roy Hooper for his understanding of the inner workings of the Internet, Amanda Shiga for too many things to list here, Tony Hooper for his engineering point of view, Ben Levac for offering advice that only someone with a view of the world that contrasts with my own can, Norm Ritchie for understanding that I can’t come home after a long day at work and work some more, Kelvin Osborn for his support while I was away from work, Taryn Naidu simply for his presence, Mel Tayler for her unannounced but pleasant visit during my last hours of work on this report, Sheri Adamson for her support ever since first year, Georgiana Badea for her never-ending good humor, and finally Magaly Obas and Nikki Melki, whom I certainly don’t see enough of these days. I’d also like to thank Dr. Tony White for his comments, criticisms, and ideas, without which this report would not be complete. Finally, thanks to my family, friends, fellow students, and everyone else who contributed in some way to my years at Carleton. They number far too many to name here.

Table of Contents

Part 1: P2PFSN: How They Work and Why They Exist
    1.1 Ancient Origins
    1.2 Napster
    1.3 Second Generation Clients

Part 2 – The Internet vs. Peer-to-Peer Networks
    2.1 – Real-world results
    2.2 – Specific Problems

Part 3 – Gnutella
    3.1 - Introduction
    3.2 - The Ultrapeer System
    3.3 – Existing Optimisations

Part 4 – Strategies – Network Layout
    4.1 – P2P networks are built on top of TCP/IP
    4.2 – Current Algorithm and Immediate Goals for Improvement
    4.3 – Swarm Intelligence
    4.4 – Implementation

Part 5 – Caching Strategies
    5.1 The Purpose of Caching in a Peer-to-Peer System
    5.2 – Content Mirrors
    5.3 – Implementation

Part 6: Results
    6.1 – Results of Network Host Cache Changes
    6.2 – Results of File Mirror Site Changes

Part 7: Conclusions and Suggestions for Future Work

Appendix – Contents of Included Disc

List of Figures

FIGURE 1: “HYBRID” PEER-TO-PEER NETWORK LAYOUT SUCH AS NAPSTER
FIGURE 2: EXAMPLE OF PEER-TO-PEER SEARCH TREE – DEPTH OF 3, 30 HOSTS REACHED
FIGURE 3: NETWORK LAYOUT WITH GNUTELLA'S ULTRAPEER SYSTEM
FIGURE 4: NETWORK LAYERS (OSI MODEL)

List of Tables

TABLE 1: MAXIMUM NUMBER OF HOSTS REACHED AT SEARCH DEPTH
TABLE 2: GNUTELLA PROTOCOL 0.6 MESSAGES
TABLE 3: SOME POSSIBLE HEURISTICS AVAILABLE FOR GNUTELLA HOSTS
TABLE 4: MIRRORS AND FILE POPULARITY

Part 1: P2PFSN: How They Work and Why They Exist

1.1 Ancient Origins

Before understanding how peer-to-peer networks work, it’s important to understand how and why they came to exist. The Internet was designed and built as a network of peers. However, if we look at the initial set of widely used protocols which evolved from it, all still in widespread use, we see they are based on the same networking pattern:

HTTP (Hyper-Text Transfer Protocol) for the World Wide Web;
SMTP (Simple Mail Transfer Protocol) for the sending and distribution of e-mail;
IMAP (Internet Message Access Protocol) and POP (Post Office Protocol) for the management of received e-mail;
FTP (File Transfer Protocol), which is still among the most used file sharing systems;
and finally IRC (Internet Relay Chat) for real-time text messaging (“chat”).

All these protocols (from the average Internet user’s perspective) appear to be based on the client-server pattern, where all users are considered clients and all messages are a two-way communication between clients and the server, or at least pass through the server before reaching another client. The clients only see each other if the server tells them there are other clients connected to it (directly, as part of the protocol specification, in the case of IRC; or indirectly, through scripts using the protocol, as in dynamic content through HTTP).

However, if we look deeper into the systems behind the servers utilizing some of these protocols, we can begin to see the roots of peer-to-peer file sharing. To see this, we’ll look more closely at what happens when a user sends an email. Let’s take the example of Jim (whose address is at jimsdomain.com), who is writing to his friend Sue (whose address is at suesdomain.com). Jim opens up his email client and writes the email. When it’s ready to send, his email client connects to his local SMTP server, smtp.jimsdomain.com, and sends it a copy of his message, addressed to Sue. smtp.jimsdomain.com will then connect to smtp.suesdomain.com and ask it to deliver the message to Sue’s address. Alternatively, if smtp.suesdomain.com is not the final destination for email addressed to Sue, but rather emailtosue.suesdomain.com is, smtp.suesdomain.com can either choose to forward the email to the correct server transparently, or notify smtp.jimsdomain.com of the correct peer it should connect to. The worldwide network of SMTP servers, of which smtp.jimsdomain.com and smtp.suesdomain.com are only two, forms a connected graph of peers (every peer on the network is able to reach every other peer, either directly or indirectly), and can be looked at as one of the first worldwide peer-to-peer networks. Some of the principles used within SMTP, such as message forwarding and redirecting between peers, are good starting points in our exploration of peer-to-peer networks.

1.2 Napster

[Figure 1: Users 1 to 6 and the servers server1.napster.com and server2.napster.com, each server with a searchable database; the legend distinguishes search paths from download paths.]

FIGURE 1: “Hybrid” peer-to-peer network layout such as Napster

It’s impossible to talk about peer-to-peer networks without mentioning Napster, whose rapid rise to notoriety and even more rapid demise were catalysts encouraging widespread use of peer-to-peer file sharing, and are possibly the main reasons that peer-to-peer file sharing clients are as popular on the Internet as they are today. However, it is important to make the distinction that Napster was not a peer-to-peer network as we know them today, but rather a hybrid system*: a massive number of peers were each connected to one of several detached central servers at any time, with the servers controlling the connections of clients and their searching capabilities, but the peers themselves sharing the actual content.

* See Figure 1

Napster appeared on the Internet in early 1999 as the result of an idea by a young programmer, Shawn Fanning. Napster was the first online file sharing system where no files were stored at a central server; rather, they were distributed amongst the actual users of the system. From the user’s point of view, where Napster differed from previous ways of accessing files on the Internet was the sheer quantity of uncensored content that was easily accessible. Unfortunately, the variety of its content, and particularly its uncensored content, quickly brought it problems. In December 1999, the RIAA sued Napster for copyright infringement, and by July 2001 it had been shut down completely. However, by the time this happened, Napster’s user base had reached a magnitude which required an immense amount of system and bandwidth resources to support its server-based system.

1.3 Second Generation Clients

In my discussion of Napster so far, I’ve briefly mentioned the two categories of difficulty, technical and political, which would influence and support the development of a new generation of peer-to-peer file sharing networks.

Firstly, the technical requirements of Napster’s system had required a huge cash investment to create and maintain enough servers for its growing user base, which quickly grew from hundreds of thousands to tens of millions of users; this cost could not be sustained in the long run. More immediately important was that the servers themselves were separate entities: users could only search among other users connected to the same server, which hid a wealth of results from their searches.

Secondly, political difficulties caused setbacks to the system which became increasingly difficult, and ultimately impossible, to overcome. Here the server-based design of the system was its downfall: with a single point of failure at the servers, shutting down the system became as simple as flipping a switch or, more specifically, forcing Napster through legal avenues to flip that switch.

Once these difficulties had finally shut down the entire system, it didn’t take long for Napster’s users to notice its disappearance and look for a replacement. The OpenNap server was one of the first entries into the game. OpenNap was open-source, Napster-compatible server software which allowed anyone to run their own Napster server, and enabled existing Napster clients to function with minimal modifications. However, these servers too were shut down one by one, or became overloaded with clients and searches to the point of being nearly impossible to obtain a connection to, and impossibly slow once connected. OpenNap servers and clients are still in widespread use, as servers come and go; however, they are not nearly as popular as Napster itself was in its prime.

The reason for this is not that the user base for online file sharing has declined, but that a new generation of peer-to-peer clients soon appeared: networks that used peer-to-peer networking not only for distributing files, but also as a way to search for content. Gnutella and WinMX were among the first of these to gain popularity.

[Figure 2: search tree rooted at the client performing the search.]

FIGURE 2: Example of peer-to-peer search tree – depth of 3, 30 hosts reached

TABLE 1: Maximum number of hosts reached at search depth

Search Depth    Clients Reached    Search Depth    Clients Reached
1               5                  8               488,280
2               30                 9               2,441,405
3               155                10              12,207,030
4               780                11              61,035,155
5               3,905              12              305,175,780
6               19,530             13              1,525,878,905
7               97,655             14              7,629,394,530

This second generation of clients began on a simple premise: by eliminating the central server, the significant technical and political problems with Napster’s system could be overcome. Perhaps more importantly, with every peer on the network considered an equal, every client would be able to search the content of every other. The theoretical reach of this network was staggering. The reasoning is this: if each client on the network broadcasts a message to five others, then after a depth of one, five clients are reached. After a depth of two (each of those five clients broadcasts or forwards the message to five others), this becomes thirty clients†. At a depth of fourteen, over seven billion clients have been sent the message‡; this represents a client for every person in the world. If each of these messages takes one second to send, it should require only about thirty seconds of real time (14 seconds for the message to reach all clients, 14 seconds for any responses to travel back up the network to the client who initiated the search) before the entire world has been asked the question and given its chance to answer. However, as we will see in Part 2, in reality the Internet is far from the homogeneous, limitless network where every client is considered equal that these equations assume it to be.
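The figures in Table 1 follow from a geometric series: with a fan-out of five, the number of clients reached within depth d is 5 + 5^2 + ... + 5^d. The short Python sketch below reproduces the table under the same idealised assumption that every client forwards to five clients nobody else has reached. The function names `hosts_reached` and `min_depth_for` are illustrative only and do not come from any Gnutella client.

```python
def hosts_reached(depth: int, fanout: int = 5) -> int:
    """Total clients reached within `depth` hops of an ideal search tree.

    Assumes every client forwards to `fanout` clients that no other
    branch has reached, which is the idealisation behind Table 1.
    """
    return sum(fanout ** level for level in range(1, depth + 1))


def min_depth_for(population: int, fanout: int = 5) -> int:
    """Smallest search depth whose ideal tree covers `population` clients."""
    depth = 0
    while hosts_reached(depth, fanout) < population:
        depth += 1
    return depth


if __name__ == "__main__":
    # Print the fourteen rows of Table 1.
    for depth in range(1, 15):
        print(f"{depth:>2}  {hosts_reached(depth):>13,}")
```

Calling `min_depth_for(200_000)` gives the depth an ideal tree would need to cover a network of 200,000 clients.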

† See Figure 2
‡ See Table 1 for the theoretical number of clients reachable as the search tree reaches higher depths

Part 2 – The Internet vs. Peer-to-Peer Networks

2.1 – Real-world results

The real-world networks made up of this second generation of peer-to-peer clients were successful in providing a new, decentralized method of allowing clients to search each other’s shared files. However, the theoretical abilities of the network were based on an ideal world, and in reality the network has not reached its full potential either in size or in the level of service it provides its clients.

The average number of users connected to the Gnutella network at any one time is in the range of 200,000. This means that all peers should be able to reach all other peers within 8 hops, and roughly half of them within 7§. In reality, the 14 hops that should suffice for a network more than 35,000 times the size are not enough to traverse the longest path connecting two hosts on the existing network.

§ See Table 1

The reason for this is fairly simple: without a central server to oversee the network topology, a peer on the network connects to other peers based only on availability, which results in a network whose interconnecting paths are essentially randomly placed. The solution to this problem is, however, not as simple. Before we even consider a solution, we need to examine some of the problems we will face: problems which every application existing on a system as large and diverse as the Internet will inevitably come across, and which current peer-to-peer networks only partially overcome.

2.2 – Specific Problems

Firewalls are one of the Internet’s most common and reliable security features, and thus the problems they cause are the most difficult to get around. There are two types of firewalls: those that block incoming connections and those that block outgoing connections. The most common is the type that blocks incoming connections, and often all incoming connections are prevented on all ports. If all incoming ports are blocked, a peer cannot accept incoming connections; its only method of entering the network is to actively look for ways in, and the part it can play in the network is severely limited. The main reason for this is that if two clients cannot accept incoming connections, there is no way for them to connect to each other directly, and they cannot trade files.
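The three possible outcomes of the firewall problem can be written down as a small decision function. This is only a sketch in Python: the names `Transfer` and `transfer_strategy` are hypothetical, and the reverse-connection case anticipates the Push message described in Part 3.

```python
from enum import Enum


class Transfer(Enum):
    DIRECT = "downloader connects to the serving peer"
    PUSH = "serving peer connects back to the downloader"
    IMPOSSIBLE = "neither peer can accept a connection"


def transfer_strategy(server_accepts_incoming: bool,
                      downloader_accepts_incoming: bool) -> Transfer:
    """Decide how a file transfer between two peers could be set up."""
    if server_accepts_incoming:
        return Transfer.DIRECT
    if downloader_accepts_incoming:
        # The firewalled serving peer opens the connection itself.
        return Transfer.PUSH
    # Both peers block incoming connections: no direct link is possible.
    return Transfer.IMPOSSIBLE
```

Only the last case, where both peers sit behind firewalls that block incoming connections, leaves the two clients unable to trade files.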

Bandwidth limitations are a major concern on the network as well, and the numbers here are easy to see. If there are 200,000 clients on the network, every client sends a 50 byte search message every five minutes, and every message reaches every other client, each client would be maintaining an average bandwidth usage of roughly 33KB/sec both ways (200,000 × 50 bytes ÷ 300 seconds). Clients on a dialup connection have at most 7KB/sec of download bandwidth and 4.2KB/sec of upload bandwidth, which would very quickly be consumed, leaving little bandwidth for the reason the client is on the network: to download content. This becomes an even more complicated problem when we consider that most dialup, ADSL, or cable modem Internet connections have greater download capacity than upload; this means that clients approaching their maximum download capacity will not be able to forward every message they receive. This can create many possible dead ends on the network, where messages would simply stop being forwarded.
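The bandwidth estimate above is simple arithmetic, and a few lines of Python make the assumptions explicit. The sketch assumes the worst case for a flooding search, in which every search message is received and forwarded once by every client; the function name is illustrative. The result comes out at a little over 33,000 bytes per second, in line with the figure quoted in the text.

```python
def search_traffic_bytes_per_sec(clients: int,
                                 message_bytes: int,
                                 interval_seconds: float) -> float:
    """Average search traffic each client must carry, in bytes per second.

    Assumes every search message is received (and forwarded) once by
    every client, the worst case for a flooding search.
    """
    return clients * message_bytes / interval_seconds


load = search_traffic_bytes_per_sec(200_000, 50, 5 * 60)
dialup_down = 7_000   # bytes/sec, dial-up download figure from the text
dialup_up = 4_200     # bytes/sec, dial-up upload figure from the text

print(f"search load: {load / 1000:.1f} KB/sec")
print(f"{load / dialup_down:.1f}x dial-up download capacity")
print(f"{load / dialup_up:.1f}x dial-up upload capacity")
```

Under these assumptions the search traffic alone is several times what a dial-up client can carry in either direction.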

The network layout of a peer-to-peer network is also something that can easily spiral out of control. In an ideal world, at each depth of the search tree, only clients which have not been contacted before would be reached. However, often the same packet reaches a host twice along two different paths. When this happens, another dead end is reached when the packet reaches a host the second time: the host remembers that the packet has reached it before and discards it.
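Duplicate detection of this kind needs nothing more than a bounded memory of message identifiers. The following Python sketch is one possible way to do it, not the method of any particular client: the class name `SeenMessages`, the `capacity` parameter, and the oldest-first eviction rule are all assumptions made for illustration.

```python
class SeenMessages:
    """Remembers recently seen message IDs so duplicates can be dropped."""

    def __init__(self, capacity: int = 10_000) -> None:
        self._capacity = capacity
        self._ids = {}   # message ID -> None; dicts keep insertion order

    def first_time(self, message_id: bytes) -> bool:
        """Return True if the ID is new, False if it is a duplicate."""
        if message_id in self._ids:
            return False
        self._ids[message_id] = None
        if len(self._ids) > self._capacity:
            # Forget the oldest ID so memory use stays bounded.
            oldest = next(iter(self._ids))
            del self._ids[oldest]
        return True


def handle(message_id: bytes, seen: SeenMessages) -> str:
    """Forward a message the first time it arrives, discard it afterwards."""
    return "forward" if seen.first_time(message_id) else "discard"
```

The capacity bound is a trade-off: a host that forgets an ID too early will forward a late duplicate again, while one that never forgets would need unbounded memory.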

Part 3 – Gnutella

3.1 - Introduction

Gnutella is a peer-to-peer system with a fairly simple protocol which is built on top of existing standards such as the HTTP protocol and the XML formatting standard, and has had open-source clients nearly from its introduction. It is by far the most used protocol with these qualifications; only the FastTrack network with its Kazaa client is more widely used, and it is closed and proprietary in nature**. As its downloading portion masquerades as an extension to HTTP, users can access files distributed on the Gnutella network through many firewalls, and even through some proxy servers, for greater compatibility with firewalled systems. Its XML-based protocol allows for standard extensions which can be used or ignored at the individual client’s discretion.

** Although Kazaa is known to use some of the same existing standards as Gnutella, it is a commercial project, so what is known about it comes only from reverse engineering – something that is especially difficult because its messages are encrypted - not from documentation or source code released by its creators.

TABLE 2: Gnutella Protocol 0.6 Messages

Message Name    Description and Purpose

Ping            Broadcasts your existence to the network, and requests Pong responses to locate other peers on the network.

Pong            Response to a Ping message. Message is routed back to the peer who initiated the Ping message. Peers seeing this message may (in fact, are expected to) cache the host IP address and ID which are sent within the message in their host cache.

Query           Initiates a search on the network. This message includes some search text or the unique identifier of a specific file (normally a SHA1 signature) the requesting party is looking for.

Query Hit       Response to a Query message. This message includes the host ID and connection information for the party which has the file, as well as the unique identifier (SHA1 signature) of the file. It may also include, as part of the extension block of the message, a list of peers which are known to have the same file.

Push            Used to request a file from a client which is behind a firewall. The message is routed back through the path which the Query Hit message traveled. When the message reaches its destination, the client serving the file will initiate a connection to the client that wishes to download the file and send it.

Bye             Optional message sent before a client disconnects from another. The message can include the reason why the client is disconnecting.

Gnutella is also a very developed protocol, supporting standardized extensions to its messages and already including many optimisations to its network, the most significant of which relate to the Ultrapeer system. These extensions and existing optimisations will allow us to examine and test some possible further optimisations on an already mature network. Table 2, above, outlines the basic messages broadcast or forwarded over the Gnutella network, enabling its clients to search for and request content. The complete specification for the Gnutella protocol can be found in the Gnutella 0.6 RFC (Klingberg and Manfredi, 2002).
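Table 2 notes that replies such as Pong are routed back to the peer that initiated the request. One simple way to picture this is a per-host table that remembers which neighbour each request arrived from. The Python sketch below illustrates that idea under simplifying assumptions: the class `ReverseRouter`, the string identifiers, and the three-peer line network are all hypothetical, and real clients identify messages by the IDs carried in the protocol rather than by readable names.

```python
class ReverseRouter:
    """Routes replies back along the path the request arrived on."""

    def __init__(self) -> None:
        self._origin = {}   # request ID -> neighbour the request came from

    def note_request(self, request_id: str, came_from: str) -> None:
        # Only the first arrival counts; later copies are duplicates.
        self._origin.setdefault(request_id, came_from)

    def route_reply(self, request_id: str):
        """Neighbour to hand the reply to, or None if no route is known."""
        return self._origin.get(request_id)


# Three peers in a line: A - B - C. A sends a Ping, C answers with a Pong.
routers = {"A": ReverseRouter(), "B": ReverseRouter(), "C": ReverseRouter()}
routers["B"].note_request("ping-1", came_from="A")
routers["C"].note_request("ping-1", came_from="B")

hop = "C"
path = [hop]
while (next_hop := routers[hop].route_reply("ping-1")) is not None:
    hop = next_hop
    path.append(hop)
print(" -> ".join(path))
```

The initiating peer has no entry for its own request, which is how the walk in this sketch knows the reply has arrived.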

3.2 - The Ultrapeer System

[Figure 3: four Ultrapeers, each with four attached Leaf nodes.]

FIGURE 3: Network Layout with Gnutella's Ultrapeer System

The Ultrapeer system, whose layout is shown in Figure 3, is at the heart of Gnutella. It was designed to alleviate bandwidth usage on dial-up clients, minimize the effects of firewalls on network layout, and add some stability to the network. Clients using the Ultrapeer system designate themselves either an Ultrapeer or a Leaf, and connections by either are only initiated to Ultrapeers. What this means is that leaves are only connected to Ultrapeers and not to other leaves; therefore they do not see traffic as it travels through the network, and only receive requests as an endpoint on the network.

The benefits to the leaves in this system are immediately obvious: their bandwidth usage is minimized, as they never have to forward a message on through the network. The benefits to the overall network are also easy to see if we consider that clients can only become Ultrapeers if they meet a certain set of qualifications. A client must generally be connected to the network for more than 2 hours before it can become an Ultrapeer. What this means to the network is that there is a certain amount of stability to its layout; clients that are only connected briefly, for example to do a single search, cannot instantly become an integral part of the network.

The number of hosts reached by a query is also directly increased by this system. As Ultrapeers have prescribed minimum bandwidth and resource requirements, additional connections can be made to each Ultrapeer. If each Ultrapeer is connected to 6 other Ultrapeers and to 10 leaves, and each of these leaves has a chance to respond to each query, the number of hosts reached at each node of the query tree becomes 11 instead of 1. Thus 1,074,205 hosts (11 times the 97,655 Ultrapeers that Table 1 gives for a depth of 7) can be reached with a search through 7 levels of Ultrapeers.
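The host count quoted above can be checked with a short calculation. The Python sketch below assumes, as an interpretation of the text rather than something it states outright, that an Ultrapeer with six Ultrapeer connections forwards a query to five of them (the sixth being the connection the query arrived on), which matches the fan-out of five used in Table 1. The function names are illustrative.

```python
def ultrapeers_reached(levels: int, fanout: int = 5) -> int:
    """Ultrapeers reached within `levels` hops of an ideal query tree."""
    return sum(fanout ** level for level in range(1, levels + 1))


def hosts_reached_with_leaves(levels: int,
                              leaves_per_ultrapeer: int = 10,
                              fanout: int = 5) -> int:
    """Each Ultrapeer answers for itself plus its attached leaves."""
    return ultrapeers_reached(levels, fanout) * (1 + leaves_per_ultrapeer)


print(f"{ultrapeers_reached(7):,} Ultrapeers")
print(f"{hosts_reached_with_leaves(7):,} hosts including leaves")
```

Because the leaves never forward queries, they multiply the number of hosts that can answer without adding any depth to the search tree.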

3.3 – Existing Optimisations

The stability that the Ultrapeer system added to the Gnutella network eased the introduction of a multitude of optimisations. The primary Ultrapeer-related optimisation is meant to be a near-optimal minimisation of traffic to the leaf nodes. When a leaf connects to an Ultrapeer, it transfers an index of search terms related to its files to the Ultrapeer, so the Ultrapeer can check this index before forwarding any queries to the leaf. This index of search terms is generally simply a list of words related to the files the leaf is sharing; the Ultrapeer will only forward queries containing one of these words. In this way, nearly all irrelevant traffic will never reach a leaf node. Of course, until a leaf node has completed the transfer of its index to the Ultrapeer, it will still receive all queries.
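The index check described above can be sketched in a few lines of Python. This is a simplified model that follows the description in the text (a plain list of words, with a query forwarded if it contains any one of them); the class and method names are hypothetical, and production clients may store and match the index differently.

```python
class Ultrapeer:
    """Forwards a query to a leaf only if the leaf's index matches it."""

    def __init__(self) -> None:
        # leaf name -> set of keywords, or None while the index is in transit
        self._index = {}

    def connect_leaf(self, leaf: str) -> None:
        self._index[leaf] = None

    def receive_index(self, leaf: str, keywords) -> None:
        self._index[leaf] = {word.lower() for word in keywords}

    def leaves_for(self, query: str):
        """Leaves that should receive this query."""
        words = set(query.lower().split())
        targets = []
        for leaf, keywords in self._index.items():
            # Until its index has arrived, a leaf still receives every query.
            if keywords is None or words & keywords:
                targets.append(leaf)
        return targets
```

For example, a leaf that has sent the index `["linux", "iso"]` would receive the query "linux kernel" but not the query "opera".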

Many optimisations have been proposed which may not yet be part of the Gnutella standard, but may still be implemented and used in some production clients. Some of these include using a “random walker” search algorithm and caching of QUERYHIT messages.

The random walker search algorithm contrasts with the standard “flooding” search algorithm, where every host forwards the search message to every other host it is connected to. Instead, a set of random walker messages is sent out onto the network, and each host forwards each of them to only one other host. This creates a drastic reduction in traffic, and because of this each message is less likely to reach a dead end, thereby allowing the search to possibly travel further out on the network. However, the sheer number of hosts that the flooding search pattern reaches is not achieved. More about random walkers can be read in “Search and Replication in Unstructured Peer-to-Peer Networks” [Lv et al., 2002].
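The trade-off between the two search patterns can be seen in a small simulation. The Python sketch below is a toy model rather than the algorithm from the cited paper: the test network (a ring with random extra links), the seed, and all parameter values are arbitrary assumptions chosen only to make the comparison reproducible.

```python
import random


def flood(graph, start, ttl):
    """Flooding: each host forwards the query to every neighbour except the sender."""
    reached = {start}
    frontier = [(start, None)]   # (host, neighbour it heard the query from)
    messages = 0
    for _ in range(ttl):
        next_frontier = []
        for host, sender in frontier:
            for neighbour in graph[host]:
                if neighbour == sender:
                    continue
                messages += 1
                if neighbour not in reached:   # a repeated copy is discarded
                    reached.add(neighbour)
                    next_frontier.append((neighbour, host))
        frontier = next_frontier
    return reached - {start}, messages


def random_walk(graph, start, walkers, steps, rng):
    """Random walkers: each message is passed on to a single random neighbour."""
    reached = set()
    messages = 0
    for _walker in range(walkers):
        host = start
        for _step in range(steps):
            host = rng.choice(graph[host])
            messages += 1
            reached.add(host)
    return reached - {start}, messages


def random_graph(size, degree, rng):
    """Hypothetical test network: a ring plus random extra links."""
    links = {h: {(h - 1) % size, (h + 1) % size} for h in range(size)}
    for host in range(size):
        while len(links[host]) < degree:
            other = rng.randrange(size)
            if other != host:
                links[host].add(other)
                links[other].add(host)
    return {host: sorted(neighbours) for host, neighbours in links.items()}


rng = random.Random(2004)
network = random_graph(size=500, degree=5, rng=rng)
flood_hosts, flood_msgs = flood(network, start=0, ttl=4)
walk_hosts, walk_msgs = random_walk(network, start=0, walkers=8, steps=4, rng=rng)
print(f"flooding: {len(flood_hosts)} hosts, {flood_msgs} messages")
print(f"walkers:  {len(walk_hosts)} hosts, {walk_msgs} messages")
```

With the same hop limit, the walkers send exactly walkers × steps messages and can only visit hosts that flooding would also have reached, which mirrors the reduction in traffic and the reduction in coverage described above.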

QUERYHIT message caching allows a host to return a QUERYHIT message for a file it is not serving. If a host sees a search term twice in succession, and has cached QUERYHIT messages associated with the first search, it can optionally return a hit for this file rather than forwarding the search on. The downside of this technique is a danger with any caching technique: the network on the other side of the host may have changed in between searches, and the hit it is returning may no longer exist while others may have since appeared.
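One common way to limit the staleness problem is to give cached hits a maximum age. The Python sketch below shows that idea only as an illustration; the text does not say that Gnutella clients expire cached hits this way, and the class name `QueryHitCache`, the `max_age` parameter, and the injectable `clock` are all assumptions.

```python
import time


class QueryHitCache:
    """Caches hits per search term and forgets them after `max_age` seconds."""

    def __init__(self, max_age: float, clock=time.monotonic) -> None:
        self._max_age = max_age
        self._clock = clock       # injectable so the cache can be tested
        self._entries = {}        # search term -> (time stored, list of hits)

    def store(self, term: str, hits) -> None:
        self._entries[term.lower()] = (self._clock(), list(hits))

    def lookup(self, term: str):
        """Return cached hits, or None if there are none or they are too old."""
        key = term.lower()
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, hits = entry
        if self._clock() - stored_at > self._max_age:
            del self._entries[key]   # stale: fall back to forwarding the search
            return None
        return hits
```

A shorter `max_age` keeps answers fresher at the cost of forwarding more searches, so the value is a tuning decision rather than something the sketch can settle.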

Part 4 – Strategies – Network Layout

4.1 – P2P networks are built on top of TCP/IP

[Figure 4: the seven OSI layers (7 - Application, 6 - Presentation, 5 - Session, 4 - Transport, 3 - Network, 2 - Data Link, 1 - Physical), annotated with the Gnutella P2P protocol, TCP, and IP.]

FIGURE 4: Network Layers (OSI Model)

TCP is a connection-based packet delivery protocol

operating a layer below most communication protocols on the

Internet. As P2P network searching is inherently forgiving

of dropped packets and network dead ends (our goal here is

not to reach every host on the network, only to reach as

many as possible in an acceptable amount of time) the

specifics of it are uninteresting to the operation of peer-

to-peer file searching in general. However, IP is the

routed protocol operating a layer below TCP, at layer 3 of

the OSI model††. The routing path of packets along the

network is minimized by several routing table optimisation

algorithms (routing protocols) which may be of interest to

us. At the very least, we can use some of the information

contained within the routing tables IP uses to help us in

the organization of our network. This warrants closer

examination.

The reason for the distinction between IP being a routed

protocol, and the routing protocols behind it, is the

routing table. Routing of IP packets is not strictly

determined at transmission time, but rather they travel

along a pre-determined path, directed by a routing table at

each router. These routing tables are the output of one of

several algorithms which collectively map the routes across

the Internet, and adapt to include new systems, networks,

and routers as the Internet itself changes shape.

Before we can see how these algorithms can help peer-to-

peer networks, we need to see that a routing table to a

router on the Internet is analogous to a list of connected

peers to any Ultrapeer on Gnutella. Naturally, while there

are similarities, there are also differences, the biggest

†† See Fig. 4 for the OSI model and the layers and protocols worth noting

being that IP routing’s goal is to map the shortest path

between any two nodes, while the goal of optimising a P2P

network’s layout is, for our purposes, to reach the maximum

number of relevant peers with the 7 hops the TTL‡‡ on

Gnutella’s messages gives us. Still, the goal of both

systems is similar: to build a network without a central

authority, that does the best job it can at routing

messages to all peers.

The way in which this is accomplished for IP on the

Internet is through a jumble of information exchanged

between routers, through protocols such as RIP (Routing

Information Protocol), IGRP (Interior Gateway Routing

Protocol), and BGP (Border Gateway Protocol). These

protocols and the systems behind them have the goal of

maintaining the routing tables at each router which route

IP packets along the correct path to reach their

destination; however, the methods they use, and consequently

their success in a given situation, vary. I will discuss

the basic techniques used by each of these routing

protocols briefly and how these techniques can be applied

to peer-to-peer searching.

‡‡ TTL refers to time to live; a number associated with a Gnutella packet which is decremented every time the packet reaches a host (makes a “hop” on the network). When the TTL reaches 0, the packet stops being forwarded. Most Gnutella packets have a TTL of 7 which means that the seventh host they reach will drop the packet.

The Routing Information Protocol is among the oldest of the

Internet’s routing protocols – it was designed for a past

incarnation of the Internet which was a much smaller system

than it is today. Routers supporting RIP would broadcast

their routing table to their neighbours – a routing table

entry consisting of a subnet and the number of hops from

that router to the destination. The neighbours would then

add appropriate entries to their own routing table, adding

one hop to each entry stored. They would then rebroadcast

their own routing table to their neighbours, and so on… If

a router received two entries for the same subnet, it would

favour the path which was shortest: the one with the fewest

router hops. This system was simple and elegant enough;

however, it did not scale well – routing tables were

transmitted at 30-second intervals and would eventually

become too large to deal with. As every router shared its

entire routing table, routers using RIP would eventually

have a routing table entry for every system on the

Internet. It was also particularly susceptible to routing

loops – packets would too often get lost in the maze of

routers and reach the same router more than once. This

resulted in longer than necessary transmission times, or on

occasion packets not reaching their destination at all.
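
The table exchange described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not an implementation of RIP: the class DistanceVectorSketch and its method names are hypothetical, and real RIP features such as the maximum hop count and route timeouts are left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy distance-vector table: keeps the fewest-hop route known for each subnet. */
final class DistanceVectorSketch {

    private static final class Route {
        final String nextHop;
        final int hops;

        Route(String nextHop, int hops) {
            this.nextHop = nextHop;
            this.hops = hops;
        }
    }

    private final Map<String, Route> table = new HashMap<>();

    /**
     * Merges a table advertised by a neighbour (subnet -> hops from that neighbour).
     * One hop is added to every entry; an entry is stored only if it is new or shorter.
     * Returns true when the local table changed and should be rebroadcast.
     */
    boolean merge(String neighbour, Map<String, Integer> advertised) {
        boolean changed = false;
        for (Map.Entry<String, Integer> entry : advertised.entrySet()) {
            int hopsViaNeighbour = entry.getValue() + 1;
            Route known = table.get(entry.getKey());
            if (known == null || hopsViaNeighbour < known.hops) {
                table.put(entry.getKey(), new Route(neighbour, hopsViaNeighbour));
                changed = true;
            }
        }
        return changed;
    }

    /** Hop count to a subnet, or -1 when no route is known. */
    int hopsTo(String subnet) {
        Route route = table.get(subnet);
        return route == null ? -1 : route.hops;
    }

    /** Neighbour that packets for a subnet are handed to, or null when no route is known. */
    String nextHopTo(String subnet) {
        Route route = table.get(subnet);
        return route == null ? null : route.nextHop;
    }
}
```

Because every router stores and rebroadcasts every entry it learns, the table grows with the size of the network, which is the scaling problem noted above.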

RIP’s successors such as IGRP and OSPF extended RIP’s

abilities by using other metrics than hop count to decide a

packet’s route when more than one was available, such as

bandwidth, load, delay, and reliability. OSPF also cut

down on the transmission of routing tables: rather than

periodically rebroadcasting its entire table, each router

advertises only the state of its own links, and mainly when

that state changes. The Border Gateway Protocol (BGP) eventually

became the standard for communication between Autonomous

Systems (AS) on the Internet; however, it is not of much

interest to us as it is dependent on having an authority or

central server to assign AS numbers, and to divide the

network into distinct Autonomous Systems, whereas we must

maintain the P2P network’s lack of a central authority.

The purpose of this discussion has been to examine the way

that IP builds its routing tables, with the hope that some

of its algorithms will help our Gnutella client build its

routing table, or choose its direct connections. RIP

builds its routing table so that packets are forwarded to the

neighbour that reaches their destination in the fewest

router hops. Other routing protocols extend this by

measuring the cost of forwarding a packet on through a

specific path – cost being a heuristic built from known

information about the router or path, reflecting both the

likelihood the path will be successful in delivering the

packet to its destination, and the speed at which it will

do so.

In Gnutella, the destination of our QUERY packets is

everyone, or “as many clients as possible”, so while the

value of a direct connection can be determined by some of

the same heuristics as IP: the reliability of a peer and

the cost of sending a message to a peer, we also need to

consider the number of hosts reached by sending a message

to that peer. The way in which we calculate the cost and

reliability of a connection will thus also differ, and we

need to re-consider all the information available to us to

come up with appropriate heuristics.

4.2 – Current Algorithm and Immediate Goals for Improvement

Phex, in its current incarnation, has a single goal in

choosing which host to connect to on the network: get on

the network as quickly as possible by connecting to the

most reliable hosts. It first chooses hosts it has

connected to before, and is quick to discard hosts who

denied its last connection. The only other parameter it

uses to place the hosts in the order it will try them is a

number representing the average daily uptime of the host –

information which is provided by most hosts on the network

when they respond to a “PING” query, these responses being

its sole method of harvesting hosts.

Phex keeps a cache of 1000 hosts it has seen on the

Gnutella network, and once this limit has been reached,

discards the oldest hosts to whom its last connection

attempt failed. This again is consistent with its strategy

of maintaining a list of the most reliable hosts, although

it would not prevent a host that has accepted a connection many

times, but simply rejected the last one, from being dropped from

the list before it has outlived its usefulness.

Our immediate goals to improve this strategy should result

in the following enhancements: First of all, we want to

connect to the hosts which will enable us to contribute the

most to the network as a whole – we want to position

ourselves in the network so the bandwidth and resources we

have to offer do not go unused. Secondly, we want to

connect to the hosts which will give us as a user the best

experience possible – provide the most applicable search

results quickly, and allow us to download the files we were

looking for speedily. It is worth noting that if the

entire network works towards the first goal, the second

goal will be at least partially fulfilled. Thirdly, we

don’t want to lose sight of the existing goal, which is to

join and become a productive member of the network as

rapidly as possible, by immediately connecting to hosts

which are likely to accept our connections. Finally, we

want to do all this without introducing significant – if

any – additional message traffic into the network.

4.3 – Swarm Intelligence

Swarm intelligence is a way of forming self-organizing

systems where each of the individuals in the system follows

a simple set of behaviour rules, their direct goal as an

individual differing from, and often not immediately

recognizable as being even related to, the goal of the

collective system. A simple example of such a system is a

virtual system of ants. The virtual ants exist in a world

made up of a small grid. Each spot on the grid is either

empty, or contains a piece of food. Each ant in the system

follows a simple set of rules: Move around randomly. If

you are not carrying any food, and come across a piece of

food, pick it up. If you are carrying a piece of food, and

come across a piece of food, drop what you are carrying.

So the goal of the individual ants is to pick up food and

drop it near other pieces of food. The goal of the system

of ants, however, and what does indeed eventually occur, is

to collect all the food into a single pile.
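
The ant rules above are simple enough to simulate directly. The following is a minimal sketch under stated assumptions: the class AntSketch and its members are hypothetical, the grid wraps around at its edges, and, unlike the one-piece-per-spot grid described above, a cell is allowed to hold a pile of several pieces so that an ant can drop its food exactly where it finds more.

```java
import java.util.Random;

/** Toy simulation of the food-gathering ants (illustrative names only). */
final class AntSketch {
    private final int size;
    private final int[][] food;       // pieces of food per cell (assumption: piles allowed)
    private final int[] antRow;
    private final int[] antCol;
    private final boolean[] carrying;
    private final Random rng;

    AntSketch(int size, int ants, int pieces, long seed) {
        this.size = size;
        this.food = new int[size][size];
        this.antRow = new int[ants];
        this.antCol = new int[ants];
        this.carrying = new boolean[ants];
        this.rng = new Random(seed);
        for (int p = 0; p < pieces; p++) {
            food[rng.nextInt(size)][rng.nextInt(size)]++;
        }
        for (int a = 0; a < ants; a++) {
            antRow[a] = rng.nextInt(size);
            antCol[a] = rng.nextInt(size);
        }
    }

    /** Moves every ant one random step, then applies the pick-up and drop rules. */
    void step() {
        for (int a = 0; a < antRow.length; a++) {
            antRow[a] = Math.floorMod(antRow[a] + rng.nextInt(3) - 1, size);
            antCol[a] = Math.floorMod(antCol[a] + rng.nextInt(3) - 1, size);
            if (food[antRow[a]][antCol[a]] == 0) {
                continue; // nothing here: keep wandering
            }
            if (carrying[a]) {
                food[antRow[a]][antCol[a]]++; // carrying and found food: drop it here
                carrying[a] = false;
            } else {
                food[antRow[a]][antCol[a]]--; // empty-handed and found food: pick it up
                carrying[a] = true;
            }
        }
    }

    /** Number of cells that currently hold at least one piece of food. */
    int occupiedCells() {
        int count = 0;
        for (int[] row : food) {
            for (int pieces : row) {
                if (pieces > 0) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Food on the grid plus food being carried; this never changes. */
    int totalFood() {
        int total = 0;
        for (int[] row : food) {
            for (int pieces : row) {
                total += pieces;
            }
        }
        for (boolean c : carrying) {
            if (c) {
                total++;
            }
        }
        return total;
    }
}
```

In this model food is only ever dropped onto a cell that already holds food, so the number of occupied cells can never increase. No ant knows about the pile, yet the food can only become more concentrated over time. The sketch assumes there are fewer ants than pieces of food, so the grid is never emptied completely.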

If we want to make use of this idea in attempting to lend

some organization to Gnutella’s network layout, and to our

position in the Gnutella network, we need to find some

available information on which we can base the order in

which we connect to known hosts, and thus choose the best

available hosts to which we forward messages. The numbers

or heuristics that we base this decision on should both

provide us with immediately improved performance as a user, as

well as – assuming all the hosts on the network use similar

heuristics – improve the overall structure of the network.

There are three types of information available to us:

information that is already being stored and used,

information that is already available during normal use of

the network – it just needs to be stored (passively

available), and information that we need to do some

additional processing to gather (actively available). The

numbers can generally be assigned to one of three

categories: TCP/IP related statistics, Gnutella Network

related statistics, and Gnutella Host Content related

statistics. Finally, each heuristic can help us decide a

host’s usefulness in one of four instances – helping us

decide which hosts are closest on the TCP/IP network, which

are closest on the Gnutella network, which are the most

reliable, and which hosts we can most easily discard when

our host cache becomes full.

TABLE 3 – Some possible heuristics available for Gnutella Hosts

Heuristic | Availability | Type/Source | Usefulness
Date first seen on Gnutella | Passive | Gnutella | Reliability
Date most recently seen on Gnutella | Passive | Gnutella | Reliability
Average Daily Uptime | Already Used | Gnutella | Reliability
Number of failed direct connections | Passive | TCP/IP | Reliability and Discardability
Number of successful direct connections | Passive | TCP/IP | Reliability and Discardability
Date of Last failed direct connection | Already Used | TCP/IP | Reliability and Discardability
Date of Last successful direct connection | Already Used | TCP/IP | Reliability and Discardability
Was last direct connection successful? | Already Used | TCP/IP | Reliability and Discardability
Time to establish direct connection - most recent | Passive | TCP/IP | TCP/IP Layout
Time to establish direct connection - average | Passive | TCP/IP | TCP/IP Layout
ICMP Ping time | Active | TCP/IP | TCP/IP Layout
Trace route Number of Router Hops | Active | TCP/IP | TCP/IP Layout
Last Gnutella response time | Passive | Gnutella | Gnutella Layout
Total number of Shared files | Passive or Active | Gnutella Content | Gnutella Layout
Total size of Shared files | Passive or Active | Gnutella Content | Gnutella Layout
Total number of Interesting Search Results | Passive or Active | Gnutella Content | Gnutella Layout
Total number of mirrors at this host | Passive or Active | Gnutella Content | Gnutella Layout

Table 3 shows only a small subset of all the information we

can gather about hosts in our host cache, as well as how

this information could be useful to us. As a general rule,

we want to connect to hosts that are interesting to us.

Interesting can mean they have a wide variety of content

available or specific content that we as a user are

interested in. It can also mean that they are close to us,

or we have a particularly fast connection to them on the

TCP/IP network. They become even more interesting for a

direct connection if they are noticeably distant on the

Gnutella network – establishing a direct connection to them

would enable all responses to be received in the time shown

by the TCP/IP response time, rather than the current

Gnutella response time.

This particular idea seems like a good candidate for a rule

used by an individual in a network governed by swarm

intelligence. To see this, we will ignore for a moment all

the other factors, and start as a host wishing to gain

access to the network. First, we connect to one host, ask

it which host is furthest away from it on the Gnutella

network, and then connect to the host it specifies. This

gives us access to two completely separate parts of the

network, as well as giving the first host we connected to

access to hosts one step further away from the furthest

host it knows about. If every host on the network follows

this set of rules, it should follow that each host will be

doing its part to bridge parts of the network together, and

an organized pattern will begin to emerge.

In reality, however, there are other factors to consider.

Firstly, we are connecting to more than 2 hosts – after the

first two hosts, we need to decide ourselves which host we

are having the most trouble reaching, or which we know to

be the furthest away from us through each host we’re

connected to. Secondly, it is very likely that that host

will deny our connection, either because it’s already full,

or it is behind a firewall. Thirdly, we want to consider

some of the other heuristics available to us in deciding

which host is best for us to connect to. Finally, we must

consider that the first host we connect to will not have

this factor available for us to consider.

So, in consideration of all these factors, we can propose

the following scheme for ordering our host cache:

The hosts we connect to first should be hosts which are

reliable – we want to get a leg into the network as quickly

as possible, so the user can start searching. We should

then look for two kinds of hosts:

Firstly, hosts which we are having trouble reaching. The

last hosts to return a PONG or QUERYHIT in response to a

PING or QUERY message are assumed to be the hosts furthest

away from us on the Gnutella network.

Secondly, hosts which are interesting. If a host is a

mirror for a lot of our files, he may have other files we

want. If a host returns a significant number of

interesting search results for our searches, we might as

well connect to him directly. The added bonus is if he is

using the same heuristic to connect to peers, eventually

interesting peers will clump together into interconnected

cliques.

In addition, we need to keep track of hosts which are no

longer of any use to us. These are the ones we will drop

from our list first when more interesting hosts come along.

This decision needs to take all three of the above

heuristics into account so that we can maintain a queue

which contains enough interesting hosts in all three

categories.

4.4 – Implementation

Phex maintains a sorted tree to hold its host cache. The

tree currently uses a comparator which bases its decisions

on reliability – it chooses first the host with the best

average daily uptime, or the one we most recently connected to

successfully, and to whom our last connection

did not fail. We will extend this to maintain at least two

other trees – one containing hosts in the order we will

discard them, the other containing hosts in the order in

which we are most interested in connecting to directly – as

opposed to the existing tree which contains hosts in order

of which we are most likely to connect to.

In addition, we will need to be able to view the contents

of the cache in real time. For this purpose a tab has been

added to the Phex user interface with a table detailing all

the hosts currently in the cache, along with their various

available statistics. The tree is initially sorted in the

order the hosts will be tried.

The three heuristics will be inter-dependent. If one

heuristic considers two hosts to be equal, then the ordering will

fall back to the next most appropriate heuristic. Specifically, the

order in which hosts will be discarded will be first based

on the number of unsuccessful connection attempts (higher

is worse), and the length of time since a successful

connection has been made to that host (longer is worse).

If these two heuristics are equal, the algorithm will first

discard hosts which are less interesting. Similarly, if

all hosts are equally reliable (generally this means the

client has already tried all hosts it has reliability

statistics for), it will immediately try the most

interesting ones.
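
The fallback between heuristics described above maps naturally onto chained comparators in Java. The following is a minimal sketch and not the Phex source: the class HostOrderingSketch, the HostStats fields, and the use of a count of interesting search results as the measure of interest are all assumptions made for illustration, as is the choice to compare failed attempts before time since the last success.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy ordering of cached hosts for discarding (illustrative names only). */
final class HostOrderingSketch {

    /** Hypothetical per-host statistics, loosely following Table 3. */
    static final class HostStats {
        final String address;
        final int failedConnections;
        final long millisSinceLastSuccess;
        final int interestingResults;

        HostStats(String address, int failedConnections,
                  long millisSinceLastSuccess, int interestingResults) {
            this.address = address;
            this.failedConnections = failedConnections;
            this.millisSinceLastSuccess = millisSinceLastSuccess;
            this.interestingResults = interestingResults;
        }
    }

    /**
     * Hosts that sort first are discarded first: most failed attempts, then longest
     * time since a successful connection, then (on a tie) the least interesting host.
     */
    static final Comparator<HostStats> DISCARD_ORDER =
            Comparator.comparingInt((HostStats h) -> h.failedConnections).reversed()
                    .thenComparing(
                            Comparator.comparingLong((HostStats h) -> h.millisSinceLastSuccess)
                                    .reversed())
                    .thenComparingInt((HostStats h) -> h.interestingResults);

    /** Returns host addresses in the order they would be discarded. */
    static List<String> discardOrder(List<HostStats> hosts) {
        List<HostStats> sorted = new ArrayList<>(hosts);
        sorted.sort(DISCARD_ORDER);
        List<String> addresses = new ArrayList<>();
        for (HostStats h : sorted) {
            addresses.add(h.address);
        }
        return addresses;
    }
}
```

A comparator for the order in which hosts are tried could be built the same way, with the reliability keys first and the interest key as the tie-breaker.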

Part 5 – Caching Strategies

5.1 – The Purpose of Caching in a Peer-to-Peer System

Even with a fully optimized network layout, the most

commonly used TTL on Gnutella, at 7, is only enough to

reach a network of just over 97,000 users. With the

current estimated number of hosts in the Gnutella network

being over 200,000 it is not just unlikely, but impossible

for any message to reach all the hosts on the network.

Caching, occurring at various points throughout the

network, can allow a host to see relevant contents of a

host more than 7 hops distant, without being able to see

the host itself.
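
A figure of this size is consistent with a simple geometric count. As an assumption made here for illustration (the branching factor is not stated in this section), suppose every host passes a message on to 5 hosts that have not yet seen it. The originating host plus all hosts within 7 hops then number:

```latex
\[
\sum_{h=0}^{7} 5^{h} \;=\; \frac{5^{8}-1}{5-1} \;=\; \frac{390625-1}{4} \;=\; 97\,656
\]
```

Under that assumption the reachable population is just over 97,000, which is less than half of an estimated 200,000 hosts.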

The Gnutella network supports several forms of caching –

the Ultrapeer system is designed around a caching system

with a different goal: to eliminate needless traffic on

the network. Thus, each Ultrapeer has a small cache of

recent searches – and knows whether or not connected leafs

or peers responded to those searches. If the same search

reaches the Ultrapeer a second time, it does not forward

the search on to branches which it knows do not respond to

that search.

5.2 – Content Mirrors

Towards the goal of this project, we will examine a form of

caching meant to increase the number of search results,

specifically the number of full and partial mirrors

available for a specific result.

File mirrors are an idea that began on the Internet with

FTP. Systems would periodically make a copy, or “mirror”

of a site, at a separate location so downloads of the

site’s files could be split between the two servers. In a

peer-to-peer network, this idea can grow from having two

mirrors for a file, to having hundreds or thousands. When

more than one user is sharing a file, the downloader can

take his pick from all available sources and download from

the least busy, fastest, closest, or most reliable source.

Mirrors become even more useful if a client is capable of

simultaneously downloading separate parts of a file from

different sources, and rebuilding the file once all parts

have been received, an idea which peer-to-peer systems such

as Bittorrent make use of to its fullest extent.§§

§§ See Section 7 for more information on Bittorrent and how its logic could be applied to more traditional peer to peer networks.

These mirrors or alternate download locations can be most

effectively stored, and most easily integrated into the

network, by being cached at all peers who have that file.

The reasons for this include the fact that a peer with the

file is more likely to see other peers with the file,

either directly or indirectly – as I will discuss in the

remainder of this section.

There are four ways a peer can passively (without sending

or receiving any extra messages) locate other peers with

that file on the network, as well as a couple which will

require minimal extensions to existing messages. The first

of these passive methods is the most obvious: When the

peer receives a download request for the file, and

successfully sends the file to a peer - assuming the

receiving peer will immediately share that file - he knows

that peer has that file, and can add it to his alternate

list. If the receiving client supports sharing partial

files, this can be done even sooner: As soon as the

serving client starts sending the receiving client data,

the receiving client immediately becomes a partial mirror

for the file. This, in theory, should be a very effective

method of constantly growing a list of mirrors.

The second of these methods guarantees that all files

downloaded off the network will start with at least one

good initial alternate – this is simply the other direction

of the first method. When a peer finishes receiving a

file, he should immediately store the peer he received the

file from as a mirror. In many cases, more than one peer

would have served parts of the file, in which case we can

seed our mirror list with every peer who served us a part

of the file, or even every peer in our download candidates

list whether we used them or not.

These two strategies alone, if implemented on all clients,

should quickly provide nearly every file on the network

with at least one mirror location, and are fairly

straightforward. The third method I will discuss requires

an additional action on the part of the user – which is

likely, but not guaranteed to happen in the course of the

user’s interaction with the client. If the user has a

certain type of file in his library, it is likely that he

will search for similar files, and that a file he has in

his library will appear in the results of this search.

When this happens, the client software can notice a file in

the results is already locally available, and rather than

offer it for download to the user again, store the sources

for the file as alternates for the local copy. While this

is not guaranteed to provide results, as it requires both

an action on the part of the user, and search results which

are not guaranteed to be found, this has the potential to

be the best constant source of alternate mirrors as the

results of a search habitually provide many more results

than the number of peers which a client has uploaded to or

downloaded from.

The final method of using existing information on the

network to obtain mirror sites is passive searching.

Because each peer on the network routes messages to other

peers, a client will see hundreds of messages each minute

as they pass through the network. Some of these messages

represent query results directed at other peers, and some

of these query results will be for files which are in a

client’s library. If the client, before passing the

message on through the network, checks if the file is

available locally and if so stores the host in the QUERYHIT

message that is currently serving the file, the client can

slowly accumulate mirrors simply by being an observer (even

while only an idle participant) in the network. This will

likely yield real-world results only for popular files,

however as it is only an extension of earlier ideas, it is

worth implementing.
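
Passive searching amounts to one extra lookup on each QUERYHIT that is routed through the client. The sketch below shows that check under stated assumptions: the class PassiveMirrorSketch and its methods are hypothetical, a file is identified by a signature string such as a content hash, and a serving host is represented by a plain address string rather than by Phex's own types.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy passive mirror harvesting from routed QUERYHIT messages (illustrative names only). */
final class PassiveMirrorSketch {

    /** File signature -> hosts observed to be serving that file. */
    private final Map<String, Set<String>> mirrorsBySignature = new HashMap<>();

    /** Registers a file from the local library so that mirrors can be collected for it. */
    void shareFile(String signature) {
        mirrorsBySignature.putIfAbsent(signature, new HashSet<>());
    }

    /**
     * Called for each query result seen in a routed QUERYHIT, before it is passed on.
     * Returns true when the serving host was recorded as a new mirror of a local file.
     */
    boolean observeQueryHit(String signature, String servingHost) {
        Set<String> mirrors = mirrorsBySignature.get(signature);
        if (mirrors == null) {
            return false; // not a file in our library: nothing to store
        }
        return mirrors.add(servingHost);
    }

    /** Number of mirrors currently known for a shared file (0 if the file is not shared). */
    int mirrorCount(String signature) {
        Set<String> mirrors = mirrorsBySignature.get(signature);
        return mirrors == null ? 0 : mirrors.size();
    }
}
```

The message itself is forwarded unchanged, so this adds no traffic to the network; the only cost is a map lookup per observed result.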

The methods described so far did not require any additional

messages to be sent across the Gnutella network – indeed,

even messages already being sent did not require any

modification at all: Only existing data already being

shared across the network was used. The obvious

progression of this idea is sharing of mirror lists between

peers – when a new source gets added to a client’s list, it

could contact all the sources it knows about, and tell them

about this new source. In this way, each client on the

network could always have a complete list of every peer on

the network sharing each file. However, it is easy to see

that the amount of additional data this scheme would pass

over the network could quickly be overwhelming – and we may

show that it is unnecessary if we can maintain a large

enough list of mirrors using the passive methods already

described.

Actively searching for alternate download sources could

also be an option worth considering: When the client is

idling on the network, it would at intervals go through its

library searching for new alternates and storing them in

its list. Along the same lines, the client could at intervals ask

its existing alternate sources if they know of any new

alternate sources. This may be a better

option, simply because the number of messages passed could

be throttled based on the number of sources available – if

a client already knows of a reasonable number of sources

for a file, it does not need to actively look for more.

5.3 – Implementation

Phex has an existing nearly complete implementation of the

standard Gnutella extension for alternate locations. It

has no existing methods of acquiring these alternate

locations implemented, although it does have the ability to

store them and send them as an extension to the QUERYHIT

message. Many Gnutella clients have the ability to process

these mirrors when sent as part of the QUERYHIT message, so

this extension to Phex should be immediately worthwhile.

All four passive methods of acquiring mirrors will be

implemented. When a download is partially completed, and

the file is added to Phex’s shared files list, all good

download candidates for the file will be immediately

stored. When a part of an upload is completed, the

receiving client will be immediately stored as a mirror.

Finally, all QUERYHIT messages will be processed as soon as

they are seen, and checked for a match with the signatures

of all files we are currently sharing. Mirrors will be

cached to a maximum count of 100***, with older mirrors

being taken off the list as newer ones appear.
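
A bounded list that drops its oldest entry first can be sketched with an insertion-ordered set. This is an illustration under stated assumptions rather than the Phex implementation: the class BoundedMirrorListSketch is hypothetical, mirrors are represented as address strings, and a mirror that is seen again is treated as new and moved to the newest position.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;

/** Toy mirror list with a fixed maximum size; the oldest mirror is dropped first. */
final class BoundedMirrorListSketch {
    private final int maxMirrors;
    private final LinkedHashSet<String> mirrors = new LinkedHashSet<>(); // oldest first

    BoundedMirrorListSketch(int maxMirrors) {
        this.maxMirrors = maxMirrors;
    }

    /** Adds a mirror, evicting the oldest one if the list would exceed its limit. */
    void add(String host) {
        mirrors.remove(host); // assumption: a host seen again counts as new
        mirrors.add(host);
        if (mirrors.size() > maxMirrors) {
            Iterator<String> oldest = mirrors.iterator();
            oldest.next();
            oldest.remove();
        }
    }

    /** Returns the mirrors from oldest to newest. */
    List<String> snapshot() {
        return new ArrayList<>(mirrors);
    }
}
```

With the limit of 100 chosen above, the list would be created as new BoundedMirrorListSketch(100).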

The Phex user interface will also be modified to show the

user how many mirrors it is storing for each file he is

sharing. This will also help us gauge the success of the

system, as this number should continually grow as the

client operates. It also is a good indication of the

overall popularity of files a user is sharing, as we will

see in section 6.2.

*** 100 was chosen simply as an arbitrary number used so the size of this cache does not grow unnecessarily large; for the more popular files on the network, it is not unreasonable to expect to see thousands of mirrors for one file within a matter of hours, for example current #1 hit songs in mp3 format. We are operating under the assumption that there is no reason for one host to know over 100 mirrors for a file at this stage; for a further examination of this cache size see part 7.

Part 6: Results

6.1 – Results of Network Host Cache Changes

Once the Phex client was displaying the real-time status of its host cache, it was immediately apparent that there were a few issues with the existing system. The first was a minor annoyance: as soon as a host was used, it disappeared from the host cache. It is, of course, necessary to temporarily displace it from the top of the host cache so it is not attempted twice; however, if we remove every host we connect to from the cache, we immediately eliminate some of the best candidates we may have for the next time the client connects to the network. Also, since some of the statistics we collect, such as TCP/IP response time, can only be found by attempting a direct connection, this became an important thing to fix.

It seemed prudent to not move the host’s position at all in

the cache until information about the host became

available, so the function to get the next host in the

cache simply skips over anything the client is already

connected to. Thus, the real-time display of the cache

shows hosts it is connected to highlighted in blue at the

top – although these may move down as more information

about these hosts becomes available and more hosts get

added.
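The skip-over behaviour of the "get next host" function can be sketched as follows. This is a minimal illustration rather than the Phex implementation: the class HostCacheSketch, its method names, and the use of plain address strings are assumptions made for the example, and the list is assumed to be already ordered best candidate first.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of a host cache whose entries stay in place while in use. */
final class HostCacheSketch {
    // Assumed to be kept in ranked order, best candidate first.
    private final List<String> orderedHosts = new ArrayList<>();
    private final Set<String> connected = new HashSet<>();

    void add(String host) {
        if (!orderedHosts.contains(host)) {
            orderedHosts.add(host);
        }
    }

    void markConnected(String host) {
        connected.add(host);
    }

    void markDisconnected(String host) {
        connected.remove(host);
    }

    /** Returns the best-ranked host we are not already connected to, if any. */
    Optional<String> nextHost() {
        for (String host : orderedHosts) {
            if (!connected.contains(host)) {
                return Optional.of(host);
            }
        }
        return Optional.empty();
    }

    int size() {
        return orderedHosts.size();
    }

    public static void main(String[] args) {
        HostCacheSketch cache = new HostCacheSketch();
        cache.add("192.0.2.10:6346");
        cache.add("192.0.2.11:6346");
        cache.markConnected("192.0.2.10:6346");
        // The connected host stays in the cache but is skipped.
        System.out.println(cache.nextHost().orElse("none"));
    }
}
```

Because a host is never removed when it is used, it is still present, in its ranked position, the next time the client starts.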

The Gnutella response time of a host is not persistent – it

gets reset every time the program restarts. This enables

us to first connect to the most reliable hosts,

disregarding this piece of information, and as it gets

collected, hosts with a higher Gnutella response time move

to the top of the cache. It is also necessary for a more

fundamental reason – the relative location of that host on

the network will change as your position in the network

changes, so the number becomes less valid as soon as you

connect to or disconnect from a host.
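One way to picture this ordering is a comparator that ranks hosts on the session-only Gnutella response time first and falls back to a persistent score when no measurement exists yet. The sketch below is an assumption-laden illustration: the class RankedHost, the single "reliability" number standing in for the persistent statistics, and the exact tie-breaking rule are hypothetical and are not taken from the Phex source.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical host cache entry with one persistent and one session-only statistic. */
final class RankedHost {
    static final long NOT_MEASURED = -1L;

    final String address;
    final double reliability;                    // persistent score, higher is better (assumed)
    long gnutellaResponseMillis = NOT_MEASURED;  // reset on every program start

    RankedHost(String address, double reliability) {
        this.address = address;
        this.reliability = reliability;
    }

    // Hosts with a measured response time come first, slowest responders on top;
    // unmeasured hosts (all tied at -1) are ordered by reliability.
    static final Comparator<RankedHost> CACHE_ORDER =
            Comparator.comparingLong((RankedHost h) -> h.gnutellaResponseMillis)
                    .reversed()
                    .thenComparing(
                            Comparator.comparingDouble((RankedHost h) -> h.reliability).reversed());

    /** Simulates a restart: the session-only statistic is forgotten. */
    static void clearSessionData(List<RankedHost> hosts) {
        for (RankedHost h : hosts) {
            h.gnutellaResponseMillis = NOT_MEASURED;
        }
    }

    /** Returns the host addresses in cache order without modifying the input list. */
    static List<String> order(List<RankedHost> hosts) {
        List<RankedHost> sorted = new ArrayList<>(hosts);
        sorted.sort(CACHE_ORDER);
        List<String> addresses = new ArrayList<>();
        for (RankedHost h : sorted) {
            addresses.add(h.address);
        }
        return addresses;
    }
}
```

Immediately after start-up every host is unmeasured, so the order is by reliability alone; as response times are collected, the hosts that are slowest to respond over Gnutella rise to the top.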

Most other heuristics are persistent, however. TCP/IP networks are relatively stable, so related information need not be updated as often. A host’s content is, for our purposes, assumed to be fairly constant, although it will increase over time and a user can remove files from what he is sharing. The total amount of content shared is updated each time we connect to a host; the interest rating of a host, however, is more constant than the actual content it carries: while the content itself may change, its general theme can be assumed to be stable.

Testing for this portion of the project was only done with a single client. The modifications were successful in that the program is noticeably quicker to gain an initial connection into the network than before, thanks to some stability added to the host cache. It does attempt, and make, connections to hosts from which it is slow to receive responses over the Gnutella network – often the first ten hosts in the cache will have a Gnutella response time of over 30 seconds. If all other assumptions are correct, this both enables us to see parts of the network which were previously out of reach and shows promise of lending some organization to the network layout if all clients were to use a similar scheme.

6.2 – Results of File Mirror Site Changes

Table 4 - Mirrors and File Popularity

File ID   # of times file   # of times file   # of mirrors known
          was searched      was uploaded      after 7 days
A         878               64                100
B         205               52                100
C         154               41                100
D         432               20                99
E         444               27                87
F         435               0                 23
G         192               5                 22
H         298               0                 18
I         461               1                 4
J         20                1                 3
K         20                2                 3
L         24                2                 3
M         19                1                 3
N         149               1                 2
O         25                2                 2
P         22                2                 2
Q         651               1                 1
R         195               1                 1
S         145               1                 1
T         130               1                 1
U         128               1                 1
V         30                1                 1
W         11                1                 1
X         8                 0                 1
Y         8                 0                 1
Z         8                 0                 1

The success of the system in acquiring and caching mirrors for local content varied somewhat by file; overall, however, it did gather mirrors for every file it downloaded and uploaded, and additional mirrors were collected with varying degrees of success (generally, only for the most popular files). Table 4 shows the status of the mirror cache at the end of the 7 day test period. As expected, each file that was initially acquired from the network (all files except Q, R, and S) has at least one mirror: the host that the file was initially downloaded from. Files Q, R, and S were files our host was seeding onto the network – they could not previously have been in any other host’s library. Each was downloaded by another peer once, and thus has that peer as its only current mirror.

As the table shows, the system had no trouble caching mirrors for the more popular files. File A in Table 4, which was a current number one hit single, gained 64 mirrors from uploading alone; while the cache only stored the most recent 100 mirrors for the file, thousands of different hosts with this file were seen over the 7 day period. Files B and C, also commonly uploaded, saw mirrors numbering in the hundreds. The less popular files fared less well, most gathering one to four mirrors. A longer stay on the network may have driven more sharing of these files, and therefore more mirrors would have been found. It would be interesting, for example, to examine in several months’ time how many mirrors could be found for the files we seeded.

Overall, uploads provided 40% of the mirrors, the initial download of the file provided 9%, and 51% came from search results: either active searches, where the file came up again in a user’s searches, or passive searches, where we found a mirror by watching the messages passing through the network. These results generally reflect what we expected to see, with a small initial number of mirrors coming from the download of a file and the majority coming through passive monitoring of network activity afterwards.

Part 7: Conclusions and Suggestions for Future Work

The results we achieved with this project barely scratch the surface of the optimisations that could be implemented on peer-to-peer file searching networks such as Gnutella. We showed, on an existing public network, that it is possible – without burdening the network with any more messages or data – to harvest mirrors for files in your library. We also showed that, again without sending any additional data over the network, it is possible to build and calculate heuristics to be more selective about which hosts on the network to connect to. Realizing the full potential of both of these techniques, however, would require some further changes to the client, as well as changes to other aspects of the network, which we will study briefly.

It is worth examining further here the effects of the size of the mirror cache on the validity of mirrors. As with any caching technique, care must be taken to ensure the data does not become stale. Specifically, in terms of our mirror caching techniques, it would not be productive to spread a list of invalid mirrors for a file. It may also be possible to carefully prune the stored mirror lists to realize a more reliable data cache.

With the cache size we used, there are two cases we see in

the real world: popular files where the cache is replaced

with new information regularly, and less popular files

where the cache is fairly static – in fact, in these cases

mirrors could remain in the cache indefinitely since the

limit of 100 hosts will never be reached – 100 different

peers will never want that file. Because of the constantly

changing structure of the Gnutella network, a large portion

of these mirrors will only be valid for a short time after

they are seen. Others will only be valid for a short

period of time per day, or per week: whenever the user

starts up their Gnutella peer software.

In the first case – the extreme being file A, but files B and C also falling into this category – the cache is constantly being updated with new information. This case is the more interesting one to examine: we end up with a mirror cache similar in many ways to the host cache we use to connect into the Gnutella network, so some of the same techniques can be considered. Rather than simply deleting the oldest host in the cache, the least reliable one could be deleted instead. Reliability could, as with the host cache, be calculated based on the number of times this host has been seen, whether a partial or complete file was seen, how recently this host has been seen, or whether you have actually downloaded from this host. The precise heuristic which decides when to discard a host could be some weighted combination of these factors. In the second case, to further ensure the utility of cached data, a host could be discarded before the cache size limit is reached if this heuristic reached a specific value.
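As a rough illustration of such a weighted heuristic, the sketch below scores each mirror on the four factors just listed and uses the score both to evict the least reliable mirror when the cache is full and to discard stale mirrors early. Nothing here was implemented in this project: the class MirrorScoring, the weights, the cap on the sighting count, and the linear age penalty are all hypothetical choices that would need tuning against real network data.

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical weighted reliability score for cached mirrors. */
final class MirrorScoring {

    /** What is remembered about one mirror; fields follow the factors named in the text. */
    static final class Mirror {
        final String address;
        final int timesSeen;
        final boolean completeFile;
        final boolean downloadedFrom;
        final long lastSeenMillis;

        Mirror(String address, int timesSeen, boolean completeFile,
               boolean downloadedFrom, long lastSeenMillis) {
            this.address = address;
            this.timesSeen = timesSeen;
            this.completeFile = completeFile;
            this.downloadedFrom = downloadedFrom;
            this.lastSeenMillis = lastSeenMillis;
        }
    }

    // Illustrative weights only; none of these values come from measurements.
    static final double W_SEEN = 1.0;
    static final double W_COMPLETE = 2.0;
    static final double W_DOWNLOADED = 3.0;
    static final double W_AGE_PER_HOUR = 0.5;
    static final int MAX_COUNTED_SIGHTINGS = 10;

    /** Higher scores mean a mirror is believed to be more reliable. */
    static double score(Mirror m, long nowMillis) {
        double ageHours = (nowMillis - m.lastSeenMillis) / 3_600_000.0;
        return W_SEEN * Math.min(m.timesSeen, MAX_COUNTED_SIGHTINGS)
                + (m.completeFile ? W_COMPLETE : 0.0)
                + (m.downloadedFrom ? W_DOWNLOADED : 0.0)
                - W_AGE_PER_HOUR * ageHours;
    }

    /**
     * Drops mirrors scoring below minScore (the second case in the text), then
     * trims the least reliable mirrors until the list fits the capacity (the first case).
     * The list must be mutable; it is left sorted from most to least reliable.
     */
    static void prune(List<Mirror> mirrors, int capacity, double minScore, long nowMillis) {
        mirrors.removeIf(m -> score(m, nowMillis) < minScore);
        mirrors.sort(Comparator.comparingDouble((Mirror m) -> score(m, nowMillis)).reversed());
        while (mirrors.size() > capacity) {
            mirrors.remove(mirrors.size() - 1);
        }
    }
}
```

With these example weights, a mirror seen once with a partial file eight hours ago scores 1 - 4 = -3, so a threshold of 0 would discard it even from a cache that is nowhere near its limit.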

Mirrors become especially useful to a client in a peer-to-peer network when the client supports multi-source downloads, and when files become larger than the average 4-minute mp3-compressed song. Because of the capricious nature of peer-to-peer networks, it is difficult to find a peer willing to share the entirety of a file larger than this, and generally also difficult to find peers with that file who are not already uploading it and thus unable to upload to you as well. The BitTorrent engine, whose techniques are described in detail by its creator Bram Cohen [Cohen, 2003], addresses this problem and offers a solution that works in many situations.

The techniques Cohen uses are based on partial file

sharing. When a user has downloaded only part of a file,

he starts to share it on the network. This way, if two

users are downloading the file, the entire file needs to be

provided only once to the combination of the two users. If

user one has the first half of the file, and user two has

the second, they can then send the part of the file the

other user needs. Cohen’s work is based on a much larger

system, where hundreds or thousands of users can be

involved, each having different and overlapping parts of a

single file. The BitTorrent system is designed to provide

each user with optimal download performance, as well as

attempting to ensure that a complete distributed copy of

the file is available for as long as possible, so that when

peers sharing the complete file disappear from the network,

users will not be left with a file they cannot complete.
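The two-user example can be made concrete by modelling each peer's progress as a set of piece indices. The sketch below is a simplified illustration only, not BitTorrent's wire protocol or its piece selection policy; the class PieceExchange, the fixed division of a file into numbered pieces, and both method names are assumptions made for the example.

```java
import java.util.BitSet;

/** Hypothetical model of partial file sharing using one bit per file piece. */
final class PieceExchange {

    /** Pieces the sender holds that the receiver is still missing. */
    static BitSet piecesToSend(BitSet senderHas, BitSet receiverHas) {
        BitSet result = (BitSet) senderHas.clone(); // leave the inputs untouched
        result.andNot(receiverHas);
        return result;
    }

    /** True when the given peers together hold every piece of the file. */
    static boolean swarmIsComplete(int pieceCount, BitSet... peers) {
        BitSet union = new BitSet(pieceCount);
        for (BitSet peer : peers) {
            union.or(peer);
        }
        // Complete when no piece index below pieceCount is missing.
        return union.nextClearBit(0) >= pieceCount;
    }

    public static void main(String[] args) {
        int pieceCount = 8;
        BitSet userOne = new BitSet(pieceCount);
        userOne.set(0, 4);            // first half of the file
        BitSet userTwo = new BitSet(pieceCount);
        userTwo.set(4, 8);            // second half of the file
        System.out.println("one -> two: " + piecesToSend(userOne, userTwo));
        System.out.println("two -> one: " + piecesToSend(userTwo, userOne));
        System.out.println("complete copy available: "
                + swarmIsComplete(pieceCount, userOne, userTwo));
    }
}
```

The swarmIsComplete check corresponds to the property the text describes: as long as the remaining peers jointly hold every piece, downloads can still finish after the original complete source has left the network.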

This addresses the problem of peers who are serving files leaving the network before all peers who are downloading those files have completed their downloads. If a user must upload a file to be able to download it, they will be much more reliable as a source than a user who has no incentive to continue sharing a file. By integrating a system such as this with Gnutella – the basis for which is already there in the content mirrors this project experimented with – the capability of the system would be further enhanced, allowing it to be a reliable source for larger files.

Changes to the host cache management of the Phex client also yielded some interesting results. Hosts with long response times proved plentiful on the network; however, no numbers were generated as to what effect giving priority to these hosts actually has on the network. This model is better suited to study in a simulation – real-world results may not be obvious until most clients on the network are using such an algorithm, a difficult proposition considering the wide variety of clients in use and hosts numbering in the hundreds of thousands.

Content-based heuristics to choose a host to connect to were only minimally implemented, and only used as a secondary statistic on which to order the host cache. These could prove interesting to a single client on the network, which could seek out hosts with similar interests and connect to them directly. Again, however, this concept could be better examined in a simulation, as the eventual goal of this heuristic is sub-networks or sub-graphs of clients with similar content, something that would require a large group of hosts on the network.

References

Cohen, B. (2003) Incentives Build Robustness in BitTorrent, http://www.bitconjurer.org/BitTorrent/bittorrentecon.pdf

Klingberg, T., Manfredi, R. (2002) Gnutella 0.6, http://rfc-gnutella.sourceforge.net/src/rfc-0_6-draft.html

Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S. (2002) Search and Replication in Unstructured Peer-to-Peer Networks, http://www.cs.princeton.edu/~qlv/download/searchp2p_full.pdf

Appendix – Contents of Included Disc

The included disc contains a soft copy of this document, as well as the source code to the modified Phex client. The Phex client that was modified was acquired from Phex CVS on SourceForge (www.sf.net), thus the changes made could be submitted to the source repository with ease.

Modified files are marked as changed through CVS; if the reader is interested in examining the source code modifications, using the latest version of Eclipse (www.eclipse.org) to open the project will provide the best results. Included in the Eclipse project is the CVS information so modified files will be marked using Eclipse’s built-in CVS support. Building the client can be done using Eclipse once the project is opened. Executing the project is most easily done by using the Run command from within Eclipse as well.

Source code is included in the source/ folder;
Documents are in the documents/ folder;
Eclipse is available in the eclipse/ folder;
Related documents are included in the references/ folder.