
Gareth Tyson

“Peer-to-Peer Content Distribution”

B.Sc. (Hons) Computer Science
March 2005

“I certify that the material contained in the dissertation is my own work and does not contain a significant portion of unreferenced or unacknowledged material. I also warrant that the above statement applies to the implementation of the project and all associated documentation.

Regarding the electronically submitted version of this submitted work, I consent to this being stored electronically and copied for assessment purposes, including the Department’s use of plagiarism detection systems in order to check the integrity of assessed work.

I agree to my dissertation being placed in the public domain, with my name explicitly included as the author of the work.”

Signed,

________________Gareth Tyson

ABSTRACT

Since the emergence of the Peer-to-Peer revolution the proliferation of the technology has been unstoppable. Its many benefits have helped to numb the pain of its inefficiencies, but unless current designs are severely modified to improve their scalability it will simply be impossible to service the requirements of the growing P2P community.

This project aims to improve the search algorithms of current P2P technology whilst still maintaining some of its better features. This report outlines the design, development and evaluation of such a P2P network, with the emphasis on evaluating the chosen design and its possible derivatives.

ABSTRACT ......................................................... 3
CONTENTS ......................................................... 4

1. INTRODUCTION .................................................. 7
1.1. General Overview of Peer-to-Peer ............................ 7
1.2. Project Aims ................................................ 8
1.3. Report Overview ............................................. 9

2. BACKGROUND ................................................... 11
2.1. Terminology of P2P ......................................... 11
2.2. Analysis of P2P Architectures .............................. 12
2.2.1. Gnutella ................................................. 12
2.2.2. Kazaa .................................................... 15
2.2.3. Napster .................................................. 17
2.2.4. Pastry ................................................... 18
2.3. Legal Background ........................................... 19
2.4. Digital Rights Management (DRM) ............................ 19
2.4.1. The Generations of DRM ................................... 20
2.4.2. Encrypted Shell .......................................... 20
2.4.3. Marking .................................................. 20
2.4.4. Centrally Controlled DRM ................................. 21
2.4.5. Problems with DRM ........................................ 21
2.5. Choices in Software and Technology ......................... 21
2.5.1. C ........................................................ 22
2.5.2. Java ..................................................... 22
2.5.3. Java RMI ................................................. 22
2.5.4. JXTA ..................................................... 22
2.6. Justification for Building a New Type of Network ........... 23
2.7. Summary .................................................... 26

3. DESIGN ....................................................... 27
3.1. General Overview of the System ............................. 27
3.2. Overall Structure .......................................... 27
3.3. Network .................................................... 27
3.3.1. Connecting New Nodes to the Network ...................... 28
3.3.2. Creating the Indexed Structure of the Network ............ 29
3.3.3. Creating New Supernodes .................................. 29
3.3.4. Dealing with the Loss of Supernodes ...................... 30
3.3.5. Keeping File Lists up to Date ............................ 31
3.3.6. Distributing File Lists to Supernodes .................... 32
3.4. Searching .................................................. 32
3.4.1. Routing of Search Queries ................................ 33
3.4.2. Search Aggregation ....................................... 34
3.4.3. How Supernodes Search their own Databases ................ 34
3.4.4. How Supernodes Perform Fuzzy Searches .................... 34
3.5. Downloads/Uploads .......................................... 35
3.6. System Architecture ........................................ 37


3.7. Class Overview ............................................. 39
3.7.1. Design Patterns .......................................... 41
3.8. User Interface ............................................. 42
3.9. Summary .................................................... 46

4. IMPLEMENTATION ............................................... 47
4.1. The GazNet Protocol ........................................ 47
4.1.1. 000-PRESENT – 001-PRESENT? – Pinging a Node .............. 49
4.1.2. 030-CONNECT? – The connection process .................... 49
4.1.3. 080-SEARCH=search string – The Searching Process ......... 50
4.1.4. 120-DOWNLOAD=filename=x=y – The Download Process ......... 50
4.1.5. 190-NEW_SUPERNODE – Supernode creation ................... 51
4.1.6. 400-ALPHA_LOCATIONS_UPDATE ............................... 51
4.2. The ProtocolSocket ......................................... 51
4.3. Data Structures ............................................ 52
4.3.1. Range .................................................... 52
4.3.2. Group .................................................... 53
4.3.3. The FileDatabase ......................................... 53
4.4. Important Algorithms ....................................... 55
4.4.1. The Splitting Algorithm .................................. 55
4.4.2. Multi-Source Downloads ................................... 56
4.4.3. Aggregating Search ....................................... 58
4.5. Summary .................................................... 59

5. THE SYSTEM IN OPERATION ...................................... 60
5.1. A Typical Session .......................................... 60
5.2. Summary .................................................... 64

6. PROCESS DESCRIPTION .......................................... 65
6.1. The Connection Process ..................................... 65
6.2. The Searching Process ...................................... 65
6.3. The File Transfer Process .................................. 66
6.4. The Splitting Process ...................................... 67
6.5. Summary .................................................... 67

7. TESTING ...................................................... 68
7.1. The Test Bed ............................................... 68
7.2. How the System was Tested .................................. 68
7.2.1. Testing Utilities ........................................ 69
7.2.2. The Simulators ........................................... 69
7.3. System Test Results ........................................ 70
7.4. Summary of Errors .......................................... 75
7.4.1. Networking ............................................... 75
7.4.2. File Transfers ........................................... 76
7.4.3. The User Interface ....................................... 77
7.5. Incremental Testing ........................................ 77
7.6. Summary .................................................... 79

8. EVALUATION ................................................... 80
8.1. Evaluation of Architecture ................................. 80
8.1.1. Scalability .............................................. 80


8.1.2. Search Algorithms ........................................ 87
8.1.3. Reliability and Robustness ............................... 92
8.1.4. File Transfers ........................................... 93
8.1.5. Miscellaneous ............................................ 94
8.2. Evaluation of User Interface ............................... 96
8.2.1. The Welcome Panel ........................................ 96
8.2.2. The Search Panel ......................................... 96
8.2.3. The Download Panel ....................................... 96
8.2.4. The Upload Panel ......................................... 97
8.2.5. The My Files Panel ....................................... 97
8.3. Alternative Solutions ...................................... 97
8.3.1. Scalability .............................................. 97
8.3.2. Robustness and Reliability .............................. 103
8.3.3. File Transfers .......................................... 104
8.4. Critical Summary of Evaluation ............................ 105

9. CONCLUSION .................................................. 106
9.1. Review of Aims ............................................ 106
9.2. Review of Original Project Proposal ....................... 109
9.3. Suggested Revisions to Design ............................. 110
9.4. Suggested Revisions to Implementation ..................... 111
9.5. Future Work ............................................... 112
9.6. Lessons Learnt ............................................ 113
9.7. Final Conclusion .......................................... 113

10. REFERENCES ................................................. 115
11. Appendix – Original Project Proposal ....................... 117

The project’s working documents are available at “http://www.lancs.ac.uk/ug/tysong”


1. Introduction

1.1 General Overview of Peer-to-Peer

One of the current major issues in computer science is the concept of Peer-to-Peer (P2P) computing. The purpose of P2P is to allow a community of users to carry out services such as sharing files without the need for a central server to run the network. In recent years the proliferation of P2P technology has been immense, allowing massive distribution of digital media by parties which previously would simply not have had the resources to perform such a large-scale task. The idea behind P2P is that a number of peers can form a network without needing to be managed by another party; instead they manage themselves, dynamically adapting to the addition and removal of peers. This is obviously an attractive idea and has been picked up by many companies and programmers wishing to create such networks.

The first major P2P network was Napster, which allowed peers to share and download music files. This network fulfilled many of the end requirements of this project, primarily fast search times. However, Napster’s reign of supremacy was not to last: for legal reasons it was shut down, leaving a gap to be filled by other, less efficient networks such as Gnutella and Kazaa. The inefficiency of these networks is exactly the motivation for this project, which strives to improve upon the glaring problems with performance and scalability that resonate throughout current P2P technology.

The intention of this project is to build a new P2P network, titled ‘GazNet’, that will improve on existing networks; in particular it aims to speed up the process by which a peer searches the network to find content. Currently the majority of P2P networks have extremely slow search times, because in many cases every peer in the network must be contacted to complete a search. This can be both highly frustrating for the user and a heavy burden on the underlying network. The majority of networks suffer from this, but not all: there are networks available that can perform high-speed searches, but this performance comes at a price. These networks require the user to enter the exact name of the file they are looking for; the slightest deviation from the correct name will result in no results being returned. This fallibility has meant that there has been little popular interest in them, simply because the average user would rather wait a long time and receive a plethora of results than wait a short time and receive none. This project aims to increase search performance without sacrificing ‘fuzzy’ searching, i.e. the ability to enter a search word and receive results even though the filename doesn’t exactly match that word; for example, a search for ‘Help’ would return the file ‘The Beatles – Help!.mp3’. By fulfilling this requirement a network will have been devised that no longer forces the user to choose between search time and the effectiveness of searches.
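To make the notion of a fuzzy search concrete, the sketch below (illustrative only, not GazNet’s actual matching code) treats a fuzzy match as a case-insensitive substring test over shared filenames; the class and method names are invented for this example.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: one simple way to realise a 'fuzzy' search is a
    // case-insensitive substring match over shared filenames. It shows why
    // a search for "Help" can match "The Beatles - Help!.mp3".
    public class FuzzyMatchExample {

        public static List<String> fuzzySearch(String query, List<String> filenames) {
            List<String> results = new ArrayList<String>();
            String needle = query.toLowerCase();
            for (String name : filenames) {
                if (name.toLowerCase().contains(needle)) {
                    results.add(name);
                }
            }
            return results;
        }

        public static void main(String[] args) {
            List<String> shared = new ArrayList<String>();
            shared.add("The Beatles - Help!.mp3");
            shared.add("Aerosmith - Crazy.mp3");
            System.out.println(fuzzySearch("help", shared)); // [The Beatles - Help!.mp3]
        }
    }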


1.2 Project Aims

1) To design, develop, and evaluate a P2P network to allow the sharing of files.

2) To allow fast, efficient searching of the network
   a. To decrease search time compared to current P2P technologies
   b. To allow fuzzy searches that don’t require specific details to be entered, e.g. the exact name of the file
   c. To limit the effect network size has on search time
   d. To keep search time low even when the overlay network is heavily loaded
   e. To keep searches as accurate as possible, so that out-of-date file references aren’t returned in search results, i.e. references to files that have already been removed from the network

3) To make the network as scalable as possible, both in terms of its performance and in the maximum number of nodes that can be concurrently connected to the network.

4) To make the system as reliable and robust as possible
   a. The network must be able to recover from the loss of supernodes
   b. The network should, to a reasonable degree, be able to withstand targeted attacks by malicious users
   c. Downloads must be able to be resumed without having to re-download data already received if transfers are interrupted
   d. The system should deal with errors in an elegant way

5) To make the network as independent as possible, i.e. avoiding the use of centrally managed servers, removing any single point of failure and making it near impossible to shut the network down.

6) To allow sharing of files over the network
   a. To allow file transfers to be performed as quickly as possible, hiding the underlying actions from the user

7) To minimise the time taken to connect to the network
   a. To avoid the use of a centralised server to aid connection to the network

8) To provide a highly simple, easy to use user interface
   a. To let the user quickly and easily search for files without having to understand the underlying processes
   b. To provide the user with some form of discriminatory information about specific downloads so they may make an informed decision on which files to download
   c. To allow the user to modify his/her shares without difficulty
   d. To allow the user to view what he/she is currently sharing


   e. The interface must be responsive at all times, even when actions such as a search are being carried out behind the scenes
   f. The user must be provided with progress details of downloads and be told when a file has finished downloading
   g. The user must be provided with an indicator (i.e. bit rate) of how quickly a file is downloading

Abandoned Aims

Due to time constraints some aims spoken about in the project proposal had to be dropped; these aims are listed below. These topics have been touched on in the report but have received no implementation or in-depth analysis.

1) To provide the user with an incentive to share his/her files.

2) To create a P2P network to run over a wireless environment.

3) To allow the protection of files using DRM
   a. To provide all users with the functionality to protect their own files
   b. To allow the safe transfer of already licensed shares
      i. To enforce the controlled distribution and sale of licences, i.e. stop users from selling mp3 files whilst also keeping a copy themselves

1.3 Report Overview

Chapter 2 (Background): This section gives an in-depth review of current P2P technologies and provides a foundation on which the report can be easily read.

Chapter 3 (Design): This section aims to give the reader an overview of the system architecture and of how several key processes are carried out.

Chapter 4 (Implementation): This section provides a greater insight into the workings of the system and shows the system that was actually produced.

Chapter 5 (The System in Operation): This short section shows a typical session and the interaction between the system and the user.

Chapter 6 (Process Description): This section explains the underlying processes that are carried out during a typical session; this includes both user-initiated requests and automatic system-initiated processes.

Chapter 7 (Testing): This section lays out the testing schemes used to ensure the correct working of the system and how well the system dealt with the testing.

Chapter 8 (Evaluation): This section evaluates the effectiveness of the finished system and how well it has fulfilled the aims set out in Chapter 1.


Chapter 9 (Conclusion): This section sums up how well the project aims have been fulfilled and what went wrong.


2. Background

2.1 Terminology of P2P

Architecture – The topology of the network and how information is routed through the network.

Connecting to the network – The process by which a node becomes part of the network; this is typically carried out by finding an already connected node and then tunnelling information through that node.

Distributed Hash Table (DHT) – A type of P2P network that allocates an amount of hash space to each node in the network; these nodes then look after all files that reside in that hash space.

File list – A list of files and their locations in the network (IP addresses); extra details can also be kept about the files, such as their size and type.

Hash algorithm – A mathematical process performed on a string to create a number that is as unique as possible, for the purpose of storing data. If a hash algorithm returns 5, the data could be put in an array at position 5; all that is then required to locate the data is to perform the hash algorithm again to find out where it is stored. The simplest hash algorithm is simply to use the first letter of the string being hashed (a short sketch of this idea follows this list of terms).

Hash Key – The value returned from a hash algorithm

Hash space – The hash keys that a particular area of storage holds, e.g. using a hashing algorithm that hashes on the first letter in a string, position 0 in an array could look after the hash space ‘A’.

Node – Any computer that is connected to the network

P2P Generations – P2P architectures can be split into three generations. The first is centralised P2P structures such as Napster, which uses a central file server to search the network. Second generation networks use decentralised file lists. The third generation is Distributed Hash Tables.

Servent – A node that is neither purely a server nor a client; instead it performs both duties. This is a frequent occurrence in P2P networks as every node should be the same.

Supernode – A high performance node that is used in the network to reduce the burden on both the overall network and on individual nodes. This is usually done by caching file lists so that a smaller number of nodes need to be searched to find files; however supernodes can also be used in different fashions, such as caching actual files to reduce download time.
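As a concrete illustration of the hash algorithm and hash space definitions above, the following sketch uses the ‘first letter of the string’ hash mentioned earlier; the class and variable names are invented for the example and this is not code from GazNet.

    // A minimal sketch of the simplest hash algorithm described above: hashing on
    // the first letter of a string. Position 0 of the table looks after the hash
    // space 'A', position 1 looks after 'B', and so on.
    public class FirstLetterHash {

        // Returns a hash key in the range 0-25 for strings starting with a letter.
        static int hash(String s) {
            char first = Character.toUpperCase(s.charAt(0));
            return first - 'A';
        }

        public static void main(String[] args) {
            String[] table = new String[26];             // each slot owns one letter of hash space
            String filename = "Help.mp3";
            table[hash(filename)] = filename;            // stored under hash space 'H' (index 7)
            System.out.println(table[hash("Help.mp3")]); // located again with the same hash
        }
    }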


2.2 Analysis of P2P Architectures

2.2.1 Gnutella

The Gnutella protocol is probably the most robust P2P protocol available; it is fully distributed and therefore very resilient to faults. The first process a Gnutella servent must carry out is to connect to GnutellaNet. This is done by connecting to an already connected node; currently the most prevalent method for finding connected nodes is through node caches that maintain databases of existing nodes on the network. Once an already connected node has been found, the new node must attempt to connect to it. To do this it initiates a TCP connection to the servent and then sends the following request string (encoded in ASCII):

GNUTELLA CONNECT/<protocol version string>\n\n

To this, the servent will reply with

GNUTELLA OK\n\n

if it wishes to accept the connection; any other reply will indicate that the connection has been refused.
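A minimal sketch of this handshake is given below. It assumes protocol version 0.4 in the request string (the version is not specified above) and omits error handling and the descriptor traffic that follows a successful connection.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;

    // Sketch of the connection handshake described above, assuming protocol
    // version "0.4"; anything other than "GNUTELLA OK" is treated as a refusal.
    public class GnutellaHandshake {

        public static boolean connect(String host, int port) throws Exception {
            Socket socket = new Socket(host, port);
            OutputStream out = socket.getOutputStream();
            InputStream in = socket.getInputStream();

            // Send the ASCII connect request terminated by two newlines.
            out.write("GNUTELLA CONNECT/0.4\n\n".getBytes("US-ASCII"));
            out.flush();

            // Read the reply and check whether the connection was accepted.
            byte[] buffer = new byte[64];
            int read = in.read(buffer);
            String reply = new String(buffer, 0, Math.max(read, 0), "US-ASCII");
            boolean accepted = reply.startsWith("GNUTELLA OK");
            if (!accepted) {
                socket.close();
            }
            return accepted;
        }
    }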

When a servent wishes to perform a search it will channel the query through the node it is connected to. It should be noted that a servent can have multiple connections to other servents; in such a situation queries will be forwarded on to all the connected servents. If a servent receives a Query that matches content in its shared folder it will reply with a QueryHit Descriptor, informing the searching node that it has content of interest to it. The Gnutella protocol is summarised in Table 2.1.

Ping – Used to discover other nodes on the network; a Ping request should be replied to with a Pong.

Pong – The response to a Ping; it tells the pinging node that the servent is present, along with other information such as its IP address and statistics on the data it is sharing.

Query – Used to search the network; if a servent receives a Query which matches one or more files that it owns it will reply with a QueryHit.

QueryHit – The response to a Query; it informs the searching servent of information about the host, such as its connection speed, and provides a list of the files that match the Query.

Push – A mechanism that allows a firewalled servent to contribute file-based data to the network.

Table 2.1 – Overview of the Gnutella Protocol Descriptors

Every protocol Descriptor from Table 2.1 is encapsulated in a Descriptor packet, as shown in Diagram 2.1. There are five fields in the Descriptor header: the Descriptor ID, which is a uniquely identifiable string; the Payload Descriptor, which indicates the type of Descriptor; a TTL field, which limits the packet’s ‘reach’ in the network; the Hops field, which records the number of nodes the Descriptor has traversed; and a Payload Length field, which contains the length of the payload. This header is followed by the payload, which will be one of the Descriptors from Table 2.1.


The TTL field is used to limit the reach that the message will have over the network. Every servent that receives a Descriptor packet will decrement the TTL before forwarding it on; if a servent receives a Descriptor with a TTL of 0 it will not forward it, thereby deleting the packet from the network.

Diagram 2.1 – Descriptor Header: Descriptor ID (bytes 0–15), Payload Descriptor (byte 16), Time to Live (TTL) (byte 17), Hops (byte 18), Payload Length (bytes 19–22)
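The following sketch shows how the 23-byte header of Diagram 2.1 might be parsed, with field widths taken from the byte offsets in the diagram; the little-endian ordering assumed for the Payload Length field is an assumption, as the byte order is not stated here.

    import java.util.Arrays;

    // Sketch of parsing the 23-byte descriptor header from Diagram 2.1.
    public class DescriptorHeader {
        final byte[] descriptorId;   // bytes 0-15: unique identifier
        final int payloadDescriptor; // byte 16: Ping, Pong, Query, QueryHit or Push
        int ttl;                     // byte 17: remaining hops the packet may travel
        int hops;                    // byte 18: hops already traversed
        final long payloadLength;    // bytes 19-22: length of the payload that follows

        DescriptorHeader(byte[] header) {
            descriptorId = Arrays.copyOfRange(header, 0, 16);
            payloadDescriptor = header[16] & 0xFF;
            ttl = header[17] & 0xFF;
            hops = header[18] & 0xFF;
            payloadLength = (header[19] & 0xFFL)
                    | (header[20] & 0xFFL) << 8
                    | (header[21] & 0xFFL) << 16
                    | (header[22] & 0xFFL) << 24;   // byte order assumed, not stated above
        }

        // A descriptor received with TTL 0 is deleted; otherwise the TTL is
        // decremented (and Hops incremented) before the descriptor is forwarded.
        boolean shouldForward() {
            if (ttl == 0) {
                return false;
            }
            ttl--;
            hops++;
            return true;
        }
    }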

The TTL is the most important field in the Descriptor, because it controls how large the network is in practical terms. The total number of nodes in the network has no significance to an individual user, because a single node only has a certain degree of ‘reach’ in the network; this ‘reach’ is set by two factors: the TTL and the number of nodes it is connected to. If there were no TTL field, a Descriptor would simply circle the network being forwarded from node to node; as the GnutellaNet topology is not a spanning tree (it contains cycles), Descriptors would be forwarded back to nodes that had already received them and would never leave the network. It is therefore impossible to actually search the whole network: unless the searching node knows the entire topology of the network, messages will simply start looping through the network uncontrollably if a suitably large TTL is given in an attempt to extend reach to the furthest away nodes.

Downloads are performed using HTTP; this is because HTTP is a well-established file transfer protocol that can carry out the required purpose perfectly well, and it is more suited to the application than a protocol such as FTP as it requires no logging-in process to be carried out.

This architecture has obvious benefits. Its fully distributed layout (Diagram 2.2) means that it would be at worst extremely hard and at best impossible for the network to be shut down. This means that users can always expect the network to be fully working. The only weak point in the structure is locating a node to connect to in the first place, but this can be overlooked as the problem is present in every P2P architecture. Its search mechanism means that all queries are carried out in real time, so all results will be completely up to date, eliminating the frustration of attempting to download files that have already been removed from the network.

Gnutella’s high resilience to breakdown, however, unfortunately comes at a cost, and this cost is performance; out of all the P2P architectures Gnutella performs by far the worst.


Diagram 2.2 – The topology of GnutellaNet: seven peers (Peer 1 to Peer 7) connected in a fully distributed mesh with no central server

This is because when a node wants to find a file the query must be propagated over the whole network, meaning that every single node connected to the network must receive every single query (if the whole network is to be searched). This would be acceptable if the network were running over a high-speed LAN with only 500 users, but the amount of network traffic created when a search is sent over the Internet with 3 million users running the servent is immense. Obviously it soon becomes impractical, making Gnutella’s biggest problem scalability: the simple fact is that over any sort of large network Gnutella will not scale, and this is discussed further in the next subsection. Take this problem into the realms of wireless networking and it suddenly explodes. When someone on a desktop PC searches for a file they will generally be using the Internet over a fixed-rate connection with multiple web browser windows open. This means that long search times are far more tolerable, as the user can both afford the long wait and also entertain themselves during it by carrying out other tasks such as checking their e-mail. A wireless user is generally less fortunate, as their connection is usually a metered one and there is less multi-tasking available to carry out whilst they wait, making the search time extremely expensive.

Gnutella’s Scalability (Ritter, 2001)

It has been mentioned earlier that Gnutella is highly unscalable, but to exactly what degree does Gnutella’s architecture limit scalability? In this short analysis of Gnutella’s scalability a few variables will be used:

P – The number of connected users
N – The number of connections held open to other nodes
T – The Time to Live (TTL) of packets; this is used to age packets to stop queries from circling round the network forever

Due to the structure of Gnutella, P is never relevant to a node’s potential reach; in reality the only limiting factors are N and T. Raising N or T will raise the number of nodes that can be searched, and decreasing them will reduce it.

        T=1   T=2   T=3    T=4     T=5      T=6        T=7        T=8
N=2       2     4     6      8      10       12         14         16
N=3       3     9    21     45      93      189        381        765
N=4       4    16    52    160     484    1,456      4,372     13,120
N=5       5    25   105    425   1,705    6,825     27,305    109,225
N=6       6    36   186    936   4,686   23,436    117,186    585,936
N=7       7    49   301  1,813  10,885   65,317    391,909  2,351,461
N=8       8    64   456  3,200  22,408  156,864  1,098,056  7,686,400

Table 2.2 – Reachable users based on a fully balanced network with universal N and T values

The next obvious step might be to increase N and T to harvest a greater number of results; however, the side effect of this is a corresponding increase in the bandwidth used (B).


        T=1    T=2     T=3      T=4        T=5         T=6         T=7          T=8
N=2     166    332     498      664        830         996       1,162        1,328
N=3     249    747   1,743    3,735      7,719      15,687      31,623       63,495
N=4     332  1,328   4,316   13,280     40,172     120,848     362,876    1,088,960
N=5     415  2,075   8,715   35,275    141,515     566,475   2,266,315    9,065,675
N=6     498  2,988  15,438   77,688    388,938   1,945,188   9,726,438   48,632,688
N=7     581  4,067  24,983  150,479    903,455   5,421,311  32,528,447  195,171,263
N=8     664  5,312  37,848  265,600  1,859,864  13,019,712  91,138,648  637,971,200

Table 2.3 – Amount of bandwidth generated (bytes) for an 83 byte query

        T=1     T=2     T=3      T=4        T=5         T=6          T=7            T=8
N=2     332     664     996    1,328      1,660       1,992        2,324          2,656
N=3     498   1,494   3,486    7,470     15,438      31,374       63,246        126,990
N=4     664   2,656   8,632   26,560     80,344     241,696      725,752      2,177,920
N=5     830   4,150  17,430   70,550    283,030   1,132,950    4,532,630     18,131,350
N=6     996   5,976  30,876  155,376    777,876   3,890,376   19,452,876     97,265,376
N=7   1,162   8,134  49,966  300,958  1,806,910  10,842,622   65,056,894    390,342,526
N=8   1,328  10,624  75,696  531,200  3,719,728  26,039,424  182,277,296  1,275,942,400

Table 2.4 – Amount of bandwidth incurred (bytes) for an 83 byte query

Table 2.3 shows the amount of bandwidth generated by an 83 byte query, whereas Table 2.4 shows the amount incurred; the difference is that the former only measures outgoing data whilst the latter includes both outgoing and incoming data to give a more accurate measurement. As can be seen, Gnutella has the potential to create over 1.2GB of data just to propagate one query. This not only creates a massive burden on the network but also makes search times extremely slow due to the time it takes to transport that amount of data.
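The table values above follow from a simple recurrence: in a fully balanced network with no duplicate deliveries, the number of users reachable with N connections and a TTL of T is N + N(N-1) + N(N-1)^2 + ... up to T terms. The sketch below reproduces a few of the table entries under those assumptions; generated bandwidth is 83 bytes per reachable user and incurred bandwidth counts each message twice (outgoing plus incoming).

    // Sketch reproducing the table values above under the same assumptions
    // (fully balanced network, universal N and T, no duplicate deliveries).
    public class GnutellaReach {

        static long reachable(int n, int t) {
            long total = 0;
            long level = n;            // nodes reached at the current hop
            for (int hop = 1; hop <= t; hop++) {
                total += level;
                level *= (n - 1);      // each node forwards to all its connections but one
            }
            return total;
        }

        public static void main(String[] args) {
            int querySize = 83;                                   // bytes, as in Tables 2.3/2.4
            System.out.println(reachable(8, 8));                  // 7,686,400   (Table 2.2)
            System.out.println(querySize * reachable(8, 8));      // 637,971,200 bytes generated
            System.out.println(2L * querySize * reachable(8, 8)); // 1,275,942,400 bytes incurred
        }
    }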

2.2.2 Kazaa

Kazaa works on the FastTrack protocol and offers considerable performance improvements over Gnutella. Unfortunately, unlike Gnutella, the FastTrack protocol is encrypted and therefore little is known about the actual details of the protocol; all the information in this section has been gained from studying and measuring node interactions. The improved performance of Kazaa is gained through the use of multiple classes of peers: the ordinary node and the supernode. To join the network a node must first find the address of a supernode. Kazaa uses a supernode registry, stored in the Windows Registry, to hold a list of up to 200 supernodes that connection attempts can be made to. The selection of which supernode to connect to is based on two factors. The first is the load of the supernode; in the registry there is a field that indicates the average load of each supernode, and nodes show a marked preference for lightly loaded supernodes. The second factor is the locality of the supernode, which is derived in two ways. The first is the round trip time (RTT) of the connection: 40% of node-to-supernode connections have an RTT of less than 5ms, with the other 60% having RTTs of about 100ms. Similarly, the majority (60%) of supernode-to-supernode connections have RTTs of less than 50ms. The second means of determining locality is analysing the IP address of the supernode; a node will tend to have a matching address prefix (e.g. 24.x.x.x)


to the supernode that it is connected to. When a node is trying to connect to the network, it will make multiple simultaneous connections to many supernodes in its supernode registry; this is done in an attempt to find a lightly loaded, local supernode to connect to. Once the node has decided which supernode to connect to, all other connections are closed apart from the selected supernode, with which a persistent connection is kept. On connection to the supernode a new supernode list is downloaded to the node, so as to keep the list as up to date as possible and therefore keep connection time to a minimum.
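Because FastTrack is encrypted, the exact selection rule is unknown; purely as an illustration, the sketch below scores supernode candidates so that lightly loaded, nearby supernodes (low RTT, matching address prefix) win, in line with the measurements described above. All names and weightings are hypothetical.

    import java.util.List;

    // Illustrative only: the real FastTrack selection rule is not public.
    public class SupernodeSelector {

        static class Candidate {
            String address;
            int load;           // average load taken from the supernode registry
            double rttMs;       // measured round trip time in milliseconds
            boolean samePrefix; // does the address prefix match ours, e.g. 24.x.x.x?

            double score() {
                // Lower is better: load and latency count against a candidate,
                // a matching address prefix counts in its favour.
                return load + rttMs - (samePrefix ? 25 : 0);
            }
        }

        static Candidate choose(List<Candidate> registry) {
            Candidate best = null;
            for (Candidate c : registry) {
                if (best == null || c.score() < best.score()) {
                    best = c;
                }
            }
            return best;
        }
    }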

Once a node is connected to a supernode it uploads metadata about the files that it is sharing; this includes such things as its connection speed and a content hash of each file. The content hash is used to find matching files so that multi-source downloads can be performed: when a download is started Kazaa will search the network for that content hash to find other sources, which saves it from doing a less accurate keyword search. To search the network a node sends a query to the supernode that it is connected to; the supernode then forwards the query on to the other supernodes it is connected to, which in turn do the same, in a similar fashion to a standard Gnutella node (Diagram 2.3). If a supernode finds a match in its database it will reply to the searching node with metadata about the search results.

A Kazaa supernode will only maintain connections to a small subset of all the supernodes in the network; therefore, like Gnutella, a search will only actually reach a small proportion of the entire network. However, unlike Gnutella nodes, Kazaa supernodes frequently (approximately every 10 minutes) change their connections with other supernodes so as to explore a larger proportion of the network; this improves results when dealing with very long searches that last for hours. A supernode will generally maintain between 100 and 160 connections with nodes and a further 30-50 connections with other supernodes. It is estimated that there are about 30,000 supernodes in the Kazaa network, therefore a search will generally only query around 0.1% of the overall network. Kazaa-lite, an unofficial version of Kazaa, will create connections to multiple supernodes in its supernode registry when performing a search; this is done to increase the amount of the network that is searched,


Diagram 2.3 – The Kazaa topology

although this increases the number of search results, it generates much more network traffic, sometimes unnecessarily.

As well as frequently changing supernode connections, lists of current supernode addresses are frequently swapped between both supernodes and nodes. Typically, after receiving a list, a node will merge its own list with the new one, purging a certain number of its previous entries; by doing this both supernodes and nodes can keep a more up-to-date picture of the network, therefore easing re-connection.

The Kazaa network has a number of great features. Its node location process is decentralised, so nodes should be able to connect to the network without worrying about servers becoming unreachable. But Kazaa’s best feature is clearly its search improvements: the use of supernodes dramatically improves scalability and search time is vastly reduced, meaning the architecture is much better suited to wireless networking.

However, Kazaa isn’t the perfect network. Its use of supernodes trades reliability for performance, as there is always the possibility that supernodes will go offline; when this happens all the information about files is lost and all the nodes connected find themselves no longer on the network. This means that in extremely dynamic networks Gnutella might even be more efficient, as the loss of a supernode means loss of data and a sudden increase in network traffic (as nodes search for a new supernode and file lists are passed between them). However, this would need to be an extremely dynamic network in which all nodes frequently connect and disconnect, as only a small minority of nodes are actually supernodes (normal nodes can come and go as often as they please without much damage to the network). Another problem arises when one looks at the hardware available for wireless devices: there is much less diversity in these compared to normal desktop PCs and, similarly, they are nowhere near as powerful. All this means that the selection of supernodes is going to be increasingly difficult and the quality of service that these supernodes can provide will be limited; this will probably lead to supernodes only being able to have a small number of nodes connected to them. Also, as searches aren’t performed in real time, information can become out of date and users can find themselves trying to download files which no longer exist on the network.

2.2.3 Napster

Napster, out of all the P2P networks, is generally considered to be the best; however, the ironic fact of the matter is that Napster isn’t actually a P2P network. In reality it is simply a client-server search engine (more accurately a broker server), as shown in Diagram 2.4. To connect to the network a Napster client simply sends a list of its files to the Napster server; to search the network it simply sends a query to the server, which then searches its


Diagram 2.4 – The Napster Topology: a central Napster server with nodes connected directly to it

database and returns the results. Just like all the other P2P networks, downloads are carried out over a direct link between the two nodes.

Napster has a great number of advantages over all other P2P networks. The first one is obvious from the brevity of Napster’s description: its client-server architecture makes it extremely simple and extremely fast, with search times comparable to that of Google. Reliability is also very high, as there is no reliance on random nodes – all searches are routed through a network of high performance servers.

From the previous paragraph Napster appears to be the perfect network. This would be true but for one fact, the same one that ultimately led to its demise: there is a tendency for illegal material to be shared on P2P networks, and Napster’s architecture meant that the company running it was responsible for the material shared, meaning that companies concerned about losing money had an adequate case to shut the network down – which they managed. Therefore, although it was a perfect network in respect of performance and reliability, it was actually the least workable network.

2.2.4 Pastry

The previous examples are conventional P2P networks. Pastry, however, dubbed a third generation P2P network, is based on a distributed hash table (DHT). Each node in the network is allocated a portion of the hash space and will contain references to files that hash to that space. When a user looks up a certain key the network will return the IP address of the node containing the file and the download can proceed.

The Pastry network is arranged in a ring topology (Diagram 2.5). To find a file, a node will pass the key to its neighbour, which will check whether the key is part of its own hash space or whether it knows the location of the correct hash space; otherwise it will simply pass the request on to its own neighbour. This carries on until the hash space is found. This greatly improves on networks such as Gnutella because once the hash space is found the searching node knows that the entire network has been searched and therefore does not need to create any more network traffic by querying more nodes. Whilst improving on the Gnutella network substantially, it also provides an improvement over the Kazaa and Napster networks, as there is no longer one single point of failure; instead every node looks after a small proportion of the hash space. Pastry also makes the distribution of responsibility fairer, as there are no heavily loaded supernodes that must take care of other nodes.


Diagram 2.5 – The Pastry Topology: nodes N1, N8, N14, N21, N32, N38, N42, N48 and N58 arranged in a ring, with a lookup for key K54 being routed around the ring to the responsible node

Pastry has a time complexity of O(log N), which is obviously a massive improvement over Gnutella.
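The sketch below illustrates only the neighbour-to-neighbour ring walk described above, in which each node owns the keys up to its own identifier; Pastry’s actual routing tables shortcut this walk to achieve the O(log N) bound, and the node identifiers used here are simply those from Diagram 2.5.

    // A sketch of key ownership on a ring: node n owns every key k with
    // previous-node-id < k <= n.id, and a lookup is handed around the ring
    // until the owner is found. This is not Pastry's real routing algorithm.
    public class RingLookup {

        static class RingNode {
            final int id;       // e.g. N8, N14, ... in Diagram 2.5
            RingNode successor; // next node clockwise around the ring

            RingNode(int id) { this.id = id; }

            RingNode lookup(int key) {
                if (inRange(key, id, successor.id)) {
                    return successor;              // the successor owns this key
                }
                return successor.lookup(key);      // otherwise pass the request on
            }

            private static boolean inRange(int key, int from, int to) {
                return from < to ? (key > from && key <= to)
                                 : (key > from || key <= to); // wrap-around segment
            }
        }

        public static void main(String[] args) {
            int[] ids = {1, 8, 14, 21, 32, 38, 42, 48, 58};
            RingNode[] ring = new RingNode[ids.length];
            for (int i = 0; i < ids.length; i++) ring[i] = new RingNode(ids[i]);
            for (int i = 0; i < ids.length; i++) ring[i].successor = ring[(i + 1) % ids.length];
            System.out.println(ring[1].lookup(54).id); // starting at N8, K54 resolves to N58
        }
    }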

Unfortunately, Pastry and other DHTs aren’t totally perfect: despite their search time improvements over Gnutella and Kazaa they still can in no way compete with Napster. Pastry’s biggest downfall, and the reason for its limited use, is that to generate a search that will actually yield a result one would have to type the exact file name letter for letter so that the correct hash key can be generated; searches like “Britney Spears” would not (or are unlikely to) return any results, which makes the search method impractical for everyday use.

2.3 Legal Background

There is no doubt that the courts have had a major effect on P2P technology, and it is therefore necessary to give a legal overview of P2P. The major legal problem with P2P networks is that they facilitate the distribution of material without any sort of discrimination between the content; it is therefore possible for users to illegally distribute any sort of material they wish, be it illegal media or child pornography. Since Napster’s advent, various trade organisations such as the RIAA and MPAA have taken a great interest in P2P networks with the aim of shutting them down.

In response to legal attacks, P2P networks have become more and more technologically resilient, making it impossible to shut many networks down. This has led to the users of a network being targeted rather than its owners, because the owners’ ignorance of, and lack of power over, the content of the network makes it difficult for them to be prosecuted; however, if a user is caught distributing illegal material their punishment will be just as severe as if they were distributing it via any other means.

Perhaps even more worrying (depending on your view) is the advent of anonymous networks attempting to defeat the tracking of downloads and searches. These networks use encryption and special routing that makes it extremely difficult for authorities, or even the person the file is being uploaded from, to find out who the file is going to. From one point of view this protects people’s right to carry out legal actions in private, but it unfortunately has the side effect that the service can be misused by other people to protect their identities from the authorities; an example of this is the worry over terrorists using P2P networks to communicate anonymously.

For these reasons, any newly created P2P network must be aware of the legal dangers: if the creator and administrator of a network have knowledge of the content being shared on that network, they must also be able to prevent that content from being illegal. It is for these reasons that P2P networks have shifted away from central server architectures towards more distributed topologies such as Gnutella.

2.4 Digital Rights Management (DRM)

There are currently a variety of techniques used to control the use and distribution of digital material, and similarly a variety of technologies that aim to implement these techniques. The need for DRM has been heavily pushed by companies


and associations that find it in their interest to stop the illegal distribution of their products; it is therefore no surprise that the latest versions of Windows Media Player provide support for DRM.

To illustrate the possibilities for providing DRM functionality in a P2P system, a few of the current DRM architectures are described below.

2.4.1 The Generations of DRM

Current DRM can be broadly split into two generations: the first generation, in which the focus was on controlling illegal copying, and the second generation, which broadened the scope to cover,

“the description, identification, trading, protection, monitoring and tracking of all forms of rights usages over both tangible and intangible assets including management of rights holders’ relationships” – Renato Iannella

2.4.2 Encrypted Shell

Probably the most important aspect of DRM is the use of encryption; without it any user will simply be able to copy the files without any hassle. This method works by simply ‘encasing’ the actual file inside an encrypted shell (basically encrypting the bytes of the file); to open the file a user will therefore require a key, or licence, to do so. There are two ways of using an encrypted shell to obtain protected files. The first is to use one licence: the user sends the other party a copy of his/her licence (obviously in an encrypted form), this is used to create the encrypted shell, and the shell is then sent back to the user and will only be usable on a computer that possesses that licence. The second option is to license each file individually by sending the encrypted key with the shell; this could be considered a better conceptual method, as identical files would be able to run on the same computer as long as the user possesses the correct licence, therefore if the file is accidentally deleted and then replaced by a different copy the file will still run.

One of the biggest problems with using an encrypted shell is that if for some reason the user’s licences are deleted, all of his/her legal files become useless; this can be largely avoided by properly backing up licences, but the possibility is still there. Another problem is that to run a protected file there has to be support for it in the software that the file is being run on; therefore to create a new DRM system one would also have to create new file players to run the files on. A further problem arises if someone manages to access the encryption key inside the licence: if this happens it would then be possible to simply decrypt the file and distribute it in any way the user wished.
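Purely to illustrate the ‘encrypted shell’ idea, the sketch below wraps a file’s bytes in symmetric (AES) encryption so that they are useless without the licence key; a real DRM scheme would add licence acquisition, key protection and player support, none of which is shown here.

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;

    // An illustration of the 'encrypted shell' idea only, not a complete DRM scheme.
    public class EncryptedShellExample {

        public static void main(String[] args) throws Exception {
            byte[] fileBytes = "pretend these are mp3 bytes".getBytes("UTF-8");

            SecretKey licenceKey = KeyGenerator.getInstance("AES").generateKey();

            // Create the shell: encrypt the file's bytes under the licence key.
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(Cipher.ENCRYPT_MODE, licenceKey);
            byte[] shell = cipher.doFinal(fileBytes);

            // Only a computer holding the licence key can open the shell again.
            cipher.init(Cipher.DECRYPT_MODE, licenceKey);
            byte[] recovered = cipher.doFinal(shell);
            System.out.println(new String(recovered, "UTF-8"));
        }
    }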

2.4.3 Marking

The second technique is marking the media in some way, either by using a watermark, a flag or an XrML tag. This mark indicates that the media is copyright protected and may contain extra information such as how many times the file may be used. This technique allows the person licensing the file greater control over how it will be used in the future and opens up a whole new range of possibilities, such as the ability to pay for a licence for merely one night, like one might do when renting a film.


2.4.4 Centrally Controlled DRM

The two previous examples seem to place little infringement on a user’s privacy; however, centrally controlled DRM breaks away from this theme dramatically. Instead of making a link between a licence and a file, it makes a link between a file and a person. This can be done by using a Globally Unique Identifier (GUID) which is assigned to each user (or media player); this can then be used to track a user’s file usage and therefore to prevent files from being used. Windows Media Player uses this technique; it also creates a log of the files the user runs and then contacts a central server to find out the digital rights. Similarly, Microsoft eBook Reader makes the user link the reader to a Passport account; using this, Microsoft captures a unique hardware profile of the computer, which allows Microsoft to prevent other computers from reading the eBooks.

The biggest problem with this scheme is the issue of privacy. Using centrally controlled DRM, profiles can be made of users containing everything they listen to, and this can be used for purposes such as marketing. Similarly, by linking a file to a person it is no longer possible to enjoy media anonymously; instead (if you’re using Windows Media Player) Microsoft will know who you are and what content you’ve watched. Under this scheme the attempt to protect data also expands into the profiling of a user’s tastes; this is obviously something that will not be popular.

2.4.5 Problems with DRM

Unsurprisingly, DRM is littered with problems; this is partly due to the complexity of the problem and partly due to the infancy of the research. There are frequent problems in the transmission of data: for instance, if a user wished to sell an mp3, how would they do it? Similarly, if a user wished to listen to the mp3 on his/her portable mp3 player it would not function. What would prevent a user from simply burning music to CD then ripping it again without protection on it? Or simply burning the music to CD 100 times and selling the copies on? What would happen if a user bought a new computer and wished to be able to play the music on both computers? How would a user play his protected media when running Linux, seeing that most DRM schemes are built for Windows?

All these questions lead up to one fact: DRM is not yet ready for widespread use, and unless the matter is heavily pressed by the suppliers of the files and the software manufacturers that create programs to run the files, it is unlikely to be taken up by many users, as they would simply prefer their media to be unprotected because it makes life easier for them. The hardest quandary lies in how far to take DRM: it must be powerful enough to stop the illegal users but lenient enough to allow legitimate users to use their files in a proper, legal manner.

2.5 Choices in Software and Technologies

2.5.1 C

C would be the ideal choice for a P2P client if it were to be executed only on one platform such as Windows; it would run extremely fast and have a minimal memory footprint. C could also be used in conjunction with C++ to build the user interface.


However, unfortunately, as C is platform specific it would be impractical for this project, as the software is intended to be distributed as widely as possible, with platforms varying from Linux to Windows CE.

2.5.2 Java

Java has several performance problems, mainly that it is slow to start and slow to execute, and its memory requirements are much higher than those of a program written in a language such as C; however, Java’s platform independence makes it the ideal choice for a project of this type. Java comes in a variety of different editions, the two of importance here being J2SE (Standard Edition) and J2ME (Micro Edition); the latter is designed to work on hardware such as PDAs and obviously has a cut-down API. The software will be written in J2SE for use on desktop computers but will be able to be re-written in J2ME with relative ease. The use of Java will enable the application to easily cross over standard and wireless networks, increasing the usefulness of the network dramatically.

2.5.3 Java RMI

Remote Method Invocation boasts a great number of advantages, the main one being the easing of the strain placed on the actual implementation of the network; this is because RMI would allow a simpler structuring of the protocol than using plain sockets. Instead of using a protocol message such as “SEARCH ‘Aerosmith’” a remote method could be called, e.g. results = search(“Aerosmith”). RMI would then deal with both the sending of the request and the receiving of the reply, obviously making the code far quicker to program. This at first sounds good, but there are limitations to its usefulness. Firstly, using RMI would place a whole host of restrictions on the design of the network; for example, using RMI would also mean using TCP, and it would mean that each node would have to obtain copies of the remote object, which would in turn mean that somewhere along the line an RMI server would be required to supply them. Another problem is with the distribution of the application: any node running the program would also need to support RMI; as this comes as standard with J2SE it would not really be affected, but neither version of J2ME supports RMI and plug-ins would therefore be required to allow the application to work. RMI is not really a technology created for P2P networking, and apart from making implementation easier it would not have many benefits to the end result.
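As an illustration of the remote-method style described above (RMI was not chosen for GazNet), a search interface might look like the following; the interface name and signature are purely hypothetical.

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.util.List;

    // A hypothetical remote search interface: a call such as
    // results = node.search("Aerosmith") would replace a protocol
    // message like "SEARCH 'Aerosmith'".
    public interface RemoteSearch extends Remote {

        // Returns the filenames on this node that match the given search string.
        List<String> search(String query) throws RemoteException;
    }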

2.5.4 JXTA

JXTA, devised by Sun, is a set of open source protocols for running P2P applications over devices ranging from high end servers to PDAs and mobile phones. It is important to remember that JXTA is a set of protocols and not an application; it can therefore be programmed in any language. Further to this, it makes great use of XML in protocol messages, also making it platform independent. JXTA uses high levels of abstraction to allow it to be customised for a variety of different applications, ranging from file sharing to chat systems; JXTA will deal with such things as locating other peers and propagating messages over the network. Unfortunately, JXTA is not an appropriate technology to be used in the creation of a new architecture; this is because JXTA already uses its own architecture to run the network, therefore the creation of a new architecture would somewhat defeat the purpose of exploiting JXTA’s features.


2.6 Justification for Building a New Type of Network

From the previous sections it is obvious that there is already a plethora of P2P networks available, which raises the question of whether it is actually necessary to build another one. The two main aims of this project are to speed up search times and to increase the scalability of the network; these two aims are very closely linked, as the scalability of a network is often measured by how long it takes to carry out a complete search. The following section will discuss the limitations that stop current P2P technologies from fulfilling this project's requirements, and how GazNet aims to overcome them.

This project is faced with a whole host of problems, as can be seen from the list of aims. If a different system were built specifically to fulfil each aim, a large variety of P2P architectures would be created, because in many situations fulfilling one requirement means, to a greater or lesser degree, sacrificing another.

Aim 2 – To allow fast, efficient searching of the network

Aim 2 is probably the most important requirement of the project, as this is the area in which the greatest gains are hoped for. Aim 2a (to decrease search times compared to current P2P technologies) provides the first clear differentiator between current networks and this project. The primary goal of this project is to improve on the search times provided by current systems; the main problem with nearly every current P2P network is its poor search time when dealing with large numbers of nodes. Gnutella is the best example of this, needing to traverse every node in the network to perform a complete search. Even when the number of searched nodes is lowered in Kazaa by the use of supernodes, the search times are still not acceptable. This problem is directly attributable to the lack of structure in these networks, which, although good for Aim 4 (reliability), is extremely bad for Aim 2. By not providing any way of searching the whole network while only contacting a subset of the nodes (or supernodes), the search time is directly controlled by the size of the network; this points to scalability issues and how well the network can cope with high numbers of nodes, and goes against Aim 2c, which stipulates that search time shouldn't be dictated by network size. Aim 2a has also been taken on by another type of P2P network, the DHT; networks like Pastry work with the intention of minimising search times as much as possible whilst still maintaining the fully distributed nature of networks like Gnutella. By storing a small amount of hash space on a large number of nodes, the network can be completely searched by actually contacting only a tiny subset of the nodes; whilst Gnutella would be left searching millions of nodes, a DHT would have already found every instance of the file in the network. At first glance, therefore, a DHT would seem to fulfil the project's primary objective, rendering the project pointless. However, the DHTs' limitation lies in their inability to perform fuzzy searches (Aim 2b); when a search is carried out only exact matches are returned, and even if a dash is put in the wrong place, no results will be returned. This major limitation is responsible for the lack of commercial and popular interest in DHTs, as for real-world file sharing they are simply not practical.


Aim 2d (to keep search time low even when the overlay network is heavily loaded) is very important, as even if the network can handle large numbers of nodes, the burden of searching needs to be dealt with without massively degrading search time. Probably the best-performing architecture regarding Aim 2d is Gnutella. This at first might sound strange, but it should be noted that 2d stipulates that search time should be kept low even when the overlay network is burdened, not the underlying network; Gnutella's fully distributed nature means that the load is spread evenly amongst all the nodes on the network, avoiding singular points that are heavily burdened. Gnutella's architecture creates a massive load on the physical network but in fact minimises load on the overlay network, as each node takes on a fair portion of the work. Although Gnutella seems to fulfil the requirement, it ignores the need to "keep search time low"; although the Gnutella overlay network handles heavy burden well, it still doesn't have a low search time. In fact, a lightly burdened Gnutella network would still have a longer search time than a heavily burdened Kazaa network; unfortunately, however, Kazaa suffers from problems in other areas. Kazaa has singular points of network build-up, i.e. the supernodes, so if there is a higher frequency of searching there will be a bigger performance hit, as a smaller percentage of the nodes will have to deal with the extra load. Probably the best architecture for fulfilling Aim 2d is a DHT: its highly distributed nature helps spread the network load and its searching technique helps keep search times low, so a DHT would fulfil both parts of the requirement.

All file-caching networks have to give serious consideration to fulfilling Aim 2e (to keep searches as accurate as possible), because as nodes go offline it is more than possible that old file references will be left in the database, leading to erroneous results being returned to searches. Indisputably, Gnutella fulfils this aim to the best degree, because all searches are performed in real time and a node has to be present on the network to return any results. Kazaa similarly has few problems with this, as all nodes are connected by a TCP connection to only one supernode; if the connection breaks, the supernode knows the node is offline and that it should remove all file references supplied by that node. The only problem arises when file references are stored in multiple locations which have no knowledge of the node's status; this occurs in DHTs, where references to a node's files are stored across the whole range of the network and may therefore easily become out of date as nodes leave the network.

Aim 3 – To make the network as scalable as possible

Aim 3 is a major requirement for all P2P networks, and one which unfortunately is not fulfilled by many. The best example of this is Gnutella, which is probably the most un-scalable design available; its need to contact every node in the network to perform a search makes it simply impossible for any degree of performance to be obtained from a large network. Having said this, Gnutella allows a seemingly unlimited number of nodes to connect to the network. Unfortunately, in reality these nodes are not fully connected, as the reach a node has over the network is limited by the number of connections it has with other nodes and the number set in its TTL field; therefore, although a node may in theory be fully connected to the whole network, in practice its searches will only traverse a proportion of the network. Kazaa provides a massive scalability increase over Gnutella, as its use of supernodes means that the whole network can be searched with far fewer node visitations; also, providing that the network has spare supernodes, an unlimited number of nodes will be able to connect. However, as the network increases in size, eventually the same problem that Gnutella has now will be reached and search time will become intolerably slow. DHTs still suffer scalability problems, as the more nodes there are, the more hops it takes to locate the node containing the hash space being looked for; however, they still scale far better than Gnutella or Kazaa, as the full network doesn't have to be traversed to locate all the results, and the number of nodes is theoretically limited only by the number of computers in the world.

Aim 4 – To make the system as reliable and robust as possible

Aim 4 is a requirement that needs to be fulfilled by any P2P network. Probably the most robust architecture available is that of Gnutella; its fully distributed nature means that it is extremely hard, if not impossible, to destroy GnutellaNet, as there is simply no main point of failure, and the loss of one node has absolutely no effect on the overall network. This robustness also leads to high reliability, as a node will rarely be cut off from the network unless every single node it is connected to goes offline. Gnutella's robustness is the main reason it is practical for the protocol to be open source; the Kazaa protocol is heavily encrypted partly because of commercial issues and partly because of the vulnerability of the network to attack and what malicious use of the protocol could bring. Despite this, Kazaa is still very reliable; however, its partially centralised structure means that it is open to attack, because the loss of one supernode means the removal of about 150 nodes from the network. A correctly targeted denial-of-service attack on one node could therefore in fact affect a large number of nodes. Due to Kazaa's large size it will be able to fulfil Aim 4a, because it would simply be impossible for one attacker to bring down enough supernodes to make a significant difference to the network; however, if the architecture is considered purely on its own, without regard to Kazaa's size, it becomes clear that Kazaa is open to attack.

A DHT's highly distributed architecture, like Gnutella's, has great robustness and makes it hard to destroy the network. However, unlike Gnutella, particular nodes contain more information than merely their own file list; this creates dependencies between nodes and makes the network more connection-oriented, therefore also making it more vulnerable. However, the hashing algorithm used means that particular nodes don't look after a particular type of content or a particular address space's content, so the loss of a node does not result in the loss of a specific area of content; similarly, a DHT's greater distribution of file lists means that the loss of a node has far less impact on the network than the loss of a supernode in Kazaa.

Aim 5 – To make the network as independent as possible and Aim 7 – To minimise the time taken to connect to the network

The reason for trying to make the network as independent as possible is to improve Aim 4 by removing any singular points of failure. The Napster network is a perfect example of a system that doesn't fulfil Aim 5, because the whole network is based on a client-server model. Generally, making the network as independent as possible means reducing the amount of external help required to run the network – the main type being help gaining access to the network (Aim 7). Given Gnutella's high robustness and distributed nature it is perhaps ironic that Gnutella generally requires the greatest amount of external help to run the network: to join the network a node must first find an already connected node, and to do this Gnutella generally contacts a node cache that supplies the address of an already connected node. It can clearly be seen that this model creates an important point of failure and does not fulfil Aim 7a (to avoid the use of centralised servers to aid connection to the network). Kazaa's primary node location technique aims to create greater independence than the techniques employed by Gnutella. Every node on the network is issued with a standard address cache of well-known supernodes; a node will attempt to connect to every address in the cache until it is accepted, and as the node contacts more supernodes on the network (for searching) these are also added to the cache, making it larger and more up to date and therefore making it easier to connect to the network on future occasions. Kazaa would therefore fulfil both Aims 5 and 7, as both could function without any sort of centrally managed resource. An unfortunate problem with Aim 7 is that the overall aim conflicts with the sub-aim (7a): by avoiding the use of centralised servers the network automatically limits itself to slow connection times in many circumstances, as slower techniques such as Kazaa's will have to be employed instead.

Aim 6 – To allow the sharing of files over the network

There is very little differentiation between networks in the fulfilment of Aim 6, as the speed of file transfer is largely based on the service provided by the underlying network; however, certain techniques, mainly the use of multiple download sources, can dramatically increase performance. Most P2P clients allow the use of multiple download sources, so it is hard to differentiate between them.

Summary

From the ordering and the length of each section it can be seen which aspects are important to this project. With the exclusion of Aim 2 (searching), all the features have been implemented before in previous networks and in some circumstances can be performed to a higher standard using a different architecture. However, this project intends to focus on improving the searching of a network rather than improving the other aspects.

2.7 Summary

This chapter has taken an in-depth look at a few of the P2P technologies currently available and at some areas of research in the domain. It has also identified that no existing design can carry out what is required of this project to a sufficient standard to make the project obsolete. The next chapter will delve deeper into the workings of the proposed system and give a high-level overview of how the system will function.


3. Design

This chapter describes the design of the system and how it will be implemented, starting from a high-level viewpoint and becoming progressively more detailed.

3.1 General Overview of the System

The concept behind the system is a very simple one: a network of supernodes is created, with each one looking after a certain pattern of letters. These supernodes maintain lists of all the files on the network that begin with the appropriate pattern of letters, and are arranged in a tree-like structure. When the network is running with only one supernode, that supernode is responsible for every letter combination available; however, when the load becomes too great, that supernode will halve its load by creating a new supernode (from one of the nodes on the network) and allocating the latter half of its file database (M-Z) to it. After this halving process has been repeated enough times, there will be 26 supernodes, each looking after one letter (A, B, C etc.). Once this stage is reached, a supernode whose load becomes too great has to split to a new tier: instead of splitting according to the first letter, it splits according to the first two letters, so supernodes of type Aa, Ab, Ac etc. will be created. By carrying out this process the network responds to extra load with a greater level of indexing, which means that however large the network gets, fast, efficient searches can still be performed.

3.2 Overall Structure

The overall structure can be split into four components: the user interface, the network, searching, and uploads/downloads. The network refers to the functionality behind connecting up nodes and keeping the structure of the network correct; searching refers to how the network actually searches for content; uploads/downloads refers to how transfers are managed. Each of these components will be examined in turn.

Diagram 3.1 – The Overall Structure of the Application

3.3 Network

The network could be considered the most important component of the project; it encompasses many different design aspects and is by far the largest module. Below is a list of its functions, followed by an in-depth review of its workings:

- Connecting new nodes to the network
- Creating the indexed structure of the network
- Creating new supernodes
- Dealing with the loss of supernodes
- Keeping file lists up to date
- Distributing file lists to supernodes

3.3.1 Connecting New Nodes to the Network

This is an extremely important requirement: without its correct functioning the network would simply not be able to be built or to grow beyond an extremely small size. GazNet uses two different methods to locate connected nodes; the first is the use of web servers and the second is the storage of address caches.

Web servers

Using the first technique, a client attempting to connect will try to contact a preset web server and download a file called "Supernodes.dat". This is a Java properties file which contains a list of all the first-tier supernodes (A, B, C etc.). Once the file is obtained, the client will attempt to connect to random supernodes on the list until it is accepted or given the address of a different supernode that will accept the connection. The list of supporting web servers is contained in a file called "Servers.dat", which is statically set when the software is installed.
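As a rough sketch of this bootstrap step (the server URL and the letter-keyed layout of "Supernodes.dat" are assumptions for illustration, not the actual file format):

    import java.io.InputStream;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    // Sketch only: fetch "Supernodes.dat" (a Java properties file) from a preset
    // web server and read out the first-tier supernode addresses. The URL and the
    // assumption that keys are the letters A-Z are illustrative.
    public class SupernodeListFetcher {

        public static List<String> fetchFirstTier(String serverUrl) throws Exception {
            Properties props = new Properties();
            try (InputStream in = new URL(serverUrl + "/Supernodes.dat").openStream()) {
                props.load(in);   // e.g. A=81.2.3.4, B=193.60.1.2, ...
            }
            List<String> addresses = new ArrayList<>();
            for (char c = 'A'; c <= 'Z'; c++) {
                String address = props.getProperty(String.valueOf(c));
                if (address != null) {
                    addresses.add(address);
                }
            }
            return addresses;
        }
    }

The client would then pick addresses from this list at random until one accepts the connection or redirects it elsewhere.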

This naturally leads on to the question of how “Supernodes.dat” will come to reside on the web server. Every time a new supernode is created it will connect to every web server in its “Servers.dat” file and send an updated version of “Supernodes.dat” via FTP.

Unfortunately, several security problems arise with this method of supernode location, the worst being the ease with which an attacker could obtain the FTP password and upload his/her own "Supernodes.dat" file. This makes the use of web servers an unworkable technique in the long run; however, its purpose is only to aid in the initial setting up of the network. Because of the network's indexed nature it would be better (though not necessary) for it to be initiated with all the first-tier supernodes in place before the beginning of its widespread use. Web servers provide a simple technique for doing this, but after the initial phases it is likely that their use will cease, being replaced by the following technique, supernode caches.

Supernode caches

This will be the main method used to locate supernodes; each client will keep a file called "Nodes.dat" containing a long list of supernode addresses. When the software is first installed, this file will contain a pre-set list of well-known supernodes that the node could connect to. When a client starts up it will run through each address asking for a list of the current first-tier supernodes; once it has received one, it will attempt to connect to a supernode and also update "Nodes.dat" by concatenating the new addresses onto the file. The client will begin with the most recently used supernodes to increase its chances of a speedy connection, so frequent users should be able to connect faster than people who use the network sparsely. This can be seen in Kazaa, which also uses supernode caches – if you have used it recently, the connection time will be much faster.


3.3.2 Creating the Indexed Structure of the Network

An uninitiated network should be able to be set up with relative ease, so that users could, if they wished, set up private networks on LANs without many complications. GazNet is almost completely self-configuring: the only requirement is to create the first supernode; any other clients then connect to that supernode and all maintenance is carried out automatically by the network.

When the first supernode is created it is responsible for every file on the network, so all nodes connect to it and all their file references are stored on it. However, when the load gets too much for it (measured by the size of its file list) it performs a process called splitting: the supernode selects its highest-performance connected node and instructs it to set itself up as a supernode, then halves its file list and sends one half to the new supernode, which becomes responsible for that portion of the hash space. This process continues until there are 26 supernodes looking after all files beginning with the letters A, B, C etc.; at this point, when the load becomes too high for a supernode it has to start adding depth to the network, i.e. creating more tiers. It goes through the same process of creating a new supernode but splits its file list on the basis of each file's second letter, so the new supernode will contain, for example, all the files beginning with An to Az.

Each supernode maintains knowledge of two sets of supernodes. The first is the set of first-tier supernodes, because these are the sole gateway into the network and are necessary if the user of the computer running the supernode wishes to search the network. The second set is the tier of supernodes directly below the supernode's own tier, because these addresses are required to route information through the network; if the addresses of the lower tier are corrupted then effectively all of the files on those supernodes are cut off from the network and therefore inaccessible. To keep the lower-tier information up to date, whenever a supernode splits it must contact its parent supernode to inform it that it is no longer looking after the same hash space; if this did not happen, the routing information on supernodes would rapidly become obsolete.

3.3.3 Creating New Supernodes

Creating supernodes is an integral part of the network, as a poorly chosen supernode will have a serious impact on network performance. When a new node connects to a supernode it sends it a rating out of 100 referring to the node's performance. The rating is based on a variety of different aspects:

- Processor speed
- Memory size
- Speed of connection
- Operating system

When a supernode needs to split, it searches through its database of connected nodes to find the highest-ranking one and sends it a protocol message instructing it to become a supernode. The node can reply with either an affirmative or a negative message; if the message is negative, the next best node is contacted. Once a node has accepted the instruction it receives the file list and begins its functions.
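The exact weighting used by the real Rating class is an implementation detail; the following sketch simply illustrates how a score out of 100 could be combined from the four factors listed above (the weightings and thresholds are invented for the example):

    // Illustrative only: combine the four factors into a score out of 100.
    // The weightings (40/25/25/10) and cut-off values are invented for the sketch.
    public class NodeRating {

        public static int rate(int cpuMhz, int memoryMb, int connectionKbps, boolean multitaskingOs) {
            int cpuScore = Math.min(40, cpuMhz / 75);          // up to 40 points
            int memScore = Math.min(25, memoryMb / 20);        // up to 25 points
            int netScore = Math.min(25, connectionKbps / 40);  // up to 25 points
            int osScore  = multitaskingOs ? 10 : 5;            // up to 10 points
            return cpuScore + memScore + netScore + osScore;
        }

        public static void main(String[] args) {
            // A 2.4 GHz machine with 512 MB RAM on a 1 Mbit/s line.
            System.out.println(rate(2400, 512, 1000, true));   // prints 92
        }
    }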


3.3.4 Dealing with the Loss of Supernodes

The loss of a supernode will have a serious effect on the network whatever measures are taken to deal with it. The loss of a supernode in a standard supernode network such as Kazaa is vastly different from the loss of one in this type of network. In Kazaa, losing a supernode means that 100-160 nodes lose their connection to the network and their file references are lost – in a network of 3 million users this is relatively unimportant, as those nodes will just reconnect to the network via another supernode and the network will stabilise again. However, the loss of a supernode in GazNet means the loss of a whole alphabetical range of files. Even worse is the fact that nodes whose file references are stored on a supernode they aren't actually connected to won't even know their files have been lost. This poses an extremely complex problem, one that will have a massive effect on the quality of service provided by the network.

One possibility is simply to have connected supernodes send out an overlay broadcast message saying that the supernode has gone down and that the affected files must be re-sent to a new supernode. This is a perfectly viable option for a reasonably small network running over a LAN, but when the number of nodes reaches the millions the broadcast becomes an extremely slow and costly process and the amount of network traffic created is massive. Using this technique, supernodes would take several minutes to recover and, unlike Kazaa, the network wouldn't lose a small subset of the connected nodes; it would lose every single file in that hash space, e.g. every file that begins with the letter A. Similarly, the higher the tier of the lost node, the more of the network that will be disconnected. This technique therefore has a serious scaling problem.

The second, more viable, solution is to provide the network with a greater degree of redundancy: every supernode has two or three brother supernodes that contain all the same information – the same file list, the same connected nodes and the same supernode information. When a supernode goes down, one of these brother supernodes will notice and will send messages out to all connected nodes, higher-tier supernodes and all child supernodes informing them of the problem. These will all make the appropriate changes to their state and the network will continue functioning in a matter of seconds instead of minutes. The main problem with this is the extra resources used to maintain the brother supernodes (processor time, network traffic); each supernode would have to forward all file lists and supernode changes on to two or three more nodes, which would roughly triple the amount of network traffic the supernode sends out. Another problem is that the unused brother supernodes represent an untapped resource: while these brothers are receiving all this information they could be dramatically cutting the stress on the working supernode. Similarly, brother supernodes can't be set up as child supernodes, and therefore the more heavily used real supernodes may have to be lower-performance nodes than the unused brother supernodes.

Despite these issues, the indexed structure of GazNet means that it simply cannot tolerate extended lengths of time without a supernode, so for this architecture brother supernodes appear to be the best choice.


3.3.5 Keeping File Lists up to Date

Even if all the above requirements are fulfilled, the network will not work if the search results are out of date; keeping the file lists up to date is therefore an important part of the network. In Gnutella this is simply not an issue, as all queries are carried out in real time; similarly Kazaa has a much easier job, as one supernode is responsible for every file on a connected node, so if that node disconnects all its files are removed from the database. Unfortunately life is not as simple for GazNet, as many supernodes contain file information about one node; therefore, to remove all of a particular node's files from the network, several supernodes will need to be notified of the node's removal.

The simplest technique would be to carry out no updating of file lists at all. Using this technique, if a node attempts to download a file and finds that the node/file no longer exists, it contacts the supernode that informed it of the file and tells it that the node has disconnected; the supernode can then remove all of that node's files from its database. This has the advantage of requiring a minimal amount of network traffic and seriously reduces the burden on the supernode. However, this process can be extremely frustrating for users, who may frequently find themselves attempting to download non-existent files.

A more advanced technique would be for the supernode that the node is connected to, to remove files from the network on behalf of the node. The supernode would be informed of the node's complete file list, which it would write to disk; if the node disconnects, the supernode loads the file list and contacts all the appropriate supernodes telling them to remove the files. This keeps the network much more up to date but vastly increases the burden on the supernode: not only does it have to receive and store complete lists of files on its hard drive, it also has to contact many different supernodes to inform them of the disconnected node. If several nodes disconnect at the same time, the supernode will have a large job in front of it that may be noticeable to the supernode's user. However, this technique would be a good choice for a fairly static network, as file lists would be kept up to date but supernodes wouldn't have to remove files very often. A similar technique would be to allocate nodes 'brothers' that maintain a TCP connection between each other; each brother node would know the other's file list, so if one node went offline the brother would realise and be able to contact the appropriate supernodes informing them of the node's removal from the network. The brother node would then request a new brother from its connected supernode; to increase reliability, multiple brothers could be allocated to a node to cover the chance of two nodes going offline simultaneously.

Probably the most practical option would be to give the file references time-to-live (TTL) variables, meaning that files are automatically deleted from the network after, say, 15 minutes and nodes have to refresh the network. This is a nice balance between the two previously mentioned techniques, as the network remains reasonably up to date but there is no heavy stress on supernodes when nodes disconnect; the files just remove themselves. However, if the network is reasonably static, network traffic is dramatically increased and this would be a very bad choice, as every 15 minutes file lists need refreshing when in fact they could have just been left alone. A possibly better choice would be to increase the refresh time to every 30 minutes; it is a trade-off between network load and search accuracy.
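A minimal sketch of the TTL idea, assuming each stored reference simply records when it was last refreshed (the class and method names are illustrative, not those of the real FileDatabase):

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch only: a supernode-side table of file references that expire unless
    // the owning node refreshes them within the TTL (15 minutes here).
    public class ExpiringFileList {

        private static final long TTL_MS = 15 * 60 * 1000L;

        // filename -> time the reference was last refreshed
        private final Map<String, Long> lastRefresh = new ConcurrentHashMap<>();

        public void refresh(String filename) {
            lastRefresh.put(filename, System.currentTimeMillis());
        }

        // Called periodically (e.g. by a timer thread) to drop stale references.
        public void expireOldReferences() {
            long cutoff = System.currentTimeMillis() - TTL_MS;
            Iterator<Map.Entry<String, Long>> it = lastRefresh.entrySet().iterator();
            while (it.hasNext()) {
                if (it.next().getValue() < cutoff) {
                    it.remove();
                }
            }
        }
    }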

3.3.6 Distributing File Lists to Supernodes

Due to GazNet's indexed structure, the method of distributing file lists is inherently more complex than in a standard supernode P2P network. Nodes cannot merely upload their file list to the supernode they are connected to; instead they must distribute their file references out to the appropriate supernodes spread across the whole network.

The fastest way to do this would be for every node on the network to hold the address of every supernode on the network; this would allow a node to directly contact the appropriate supernodes and send them the correct portion of its file list. This solution, although fast, has one glaring problem: its scalability. On small private networks with only a few supernodes the idea is perfectly workable, but on a public network with 30,000 supernodes it is massively impractical to hold that number of addresses.

Another method would be to contact the top-tier supernodes and request all the required addresses; this would be much more scalable than the previous solution, as each node wouldn't have to be kept up to date with thousands of addresses. However, there are still scalability problems: if a user has a large number of shared files, a great many supernodes may need to be contacted to find all the appropriate addresses. The most elegant way of doing this would be to send out a query for the supernode that looks after the appropriate word; when it is found, the supernode would return the correct address and the node could send it the file list.

The method actually used is to send the file list to the supernode that the sender believes to be looking after that hash space. Due to a standard node's limited knowledge of the network (i.e. just the first-tier supernodes), the first thing a node does is send the file list to the first-tier supernodes, as that is where it believes the files should be stored. When a first-tier supernode receives the file list it has knowledge of the tier below it, and therefore possibly of a better place to put the files, so it passes the file list on to the appropriate supernode. That supernode may also know of a better place to store the files, so it passes the file list once again down to another supernode. This process continues until the correct supernode has been found. This is an elegant way of carrying out the distribution, as it follows the semantics of a standard tree traversal; however, passing file lists through the network creates a much larger load than just passing back the address of the correct supernode, and it puts a greater burden on the supernodes that are traversed, both in processing power and in the amount of bandwidth used. Despite this, on a small network this method works well and is efficient in that each supernode is only traversed once.

3.4 Searching

Searching is the process of querying the network to find out whether any files matching certain attributes currently exist on the network. The searching part of the project can be split into two aspects:


- The routing of search queries
- How supernodes search their own databases

3.4.1 Routing of Search Queries

This is how nodes actually find out where to send their search queries and how these queries find their way through the network. What GazNet lacks in simplicity it makes up for in search times: its use of indexed supernodes can dramatically decrease search time. The best way to illustrate how nodes search is with an example. When the user wishes to search for a piece of music such as "Moonlight Sonata", GazNet will first split the words up and then look at the first letter of each, so in the example the letters 'M' and 'S' are derived. It then looks up the supernodes that look after the hash space for 'M' and 'S' and sends the query to the appropriate supernode; so, to the 'M' supernode it will send the following protocol message,

080-SEARCH=Moonlight

The 'M' supernode will then consult which tier it is currently on; if it has tiers below it, it will remove the first letter of the query and look up the 'Mo' supernode in its second-level supernode list. It will then send the following protocol message to that supernode,

080-SEARCH=oonlight

If this supernode contains the file it will then return the results. This recursive process can continue to a maximum depth equal to the word length, so for the query "Moonlight" the search can continue to a maximum depth of 9.
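A minimal sketch of this routing rule follows; the map-based lookup is an illustrative stand-in for the real routing table, and the example assumes one word per first-tier supernode (aggregation of same-letter words is covered in Section 3.4.2):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of the query routing described above: each word is sent to the
    // first-tier supernode for its first letter, and each supernode strips one
    // leading letter before forwarding the query to the tier below it.
    public class SearchRouting {

        // Which "080-SEARCH=" message goes to which first-tier supernode.
        public static Map<Character, String> firstTierMessages(String query) {
            Map<Character, String> messages = new LinkedHashMap<>();
            for (String word : query.trim().split("\\s+")) {
                char supernode = Character.toUpperCase(word.charAt(0));
                messages.put(supernode, "080-SEARCH=" + word);
            }
            return messages;
        }

        // What a supernode forwards to the next tier down: the query minus its
        // first letter ("Moonlight" at 'M' becomes "oonlight" for 'Mo', and so on).
        public static String nextTierQuery(String query) {
            return query.length() > 1 ? query.substring(1) : query;
        }

        public static void main(String[] args) {
            System.out.println(firstTierMessages("Moonlight Sonata")); // {M=080-SEARCH=Moonlight, S=080-SEARCH=Sonata}
            System.out.println(nextTierQuery("Moonlight"));            // oonlight
        }
    }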

[The diagram shows the node sending 080-SEARCH="Moonlight" to supernode 'M' and 080-SEARCH="Sonata" to supernode 'S'; 'M' forwards 080-SEARCH="oonlight" to supernode 'Mo' and 'S' forwards 080-SEARCH="onata" to supernode 'So'; results are passed back up each branch to the node.]

Diagram 3.2 – Tree Traversal for a Search

This method of searching also makes it obvious why it is so important to keep supernodes online, and furthers the argument for using brother supernodes, as a single break in the network will cut off all lower tiers from being searched.


This method of searching also means that fuzzy searches can be carried out. Using a DHT such as Pastry, a search like "Moonlight" would be likely to return no results; the query would have to be the exact file name, such as "Beethoven – Moonlight Sonata.mp3". Using GazNet, the user can simply search for "Sonata" or "Moonlight", making searching a much easier process. It should also be stated that every word in a filename will be stored on the network, so using the previous example the mp3 will be given three references on different supernodes, B, M and S (for "Beethoven", "Moonlight" and "Sonata").

3.4.2 Search Aggregation

GazNet is intended to be deployable on any network without the need for masses of configuration; it is therefore necessary for GazNet to be able to grow dynamically without requiring a certain number of supernodes to exist before the network can function. It is therefore more than likely that on some networks there will be only one supernode running; this supernode will look after every single file on the network regardless of its hash code. If the design presumed that GazNet was always running on a large public network, inefficient amounts of traffic would be created when it was in fact running over a small private network on a LAN. On a public GazNet network there would be a large number of supernodes and each separate search element (i.e. each word in a search) would be sent to a different supernode, so it would make sense simply to send each search element out separately without any processing. On a small private network there would probably be only one supernode, so sending a separate search for each word would increase the network traffic unnecessarily. Searches are instead aggregated: if a multi-word search is entered, the node first decides which supernode each search element should be sent to, then compiles one search string per supernode containing all the appropriate search words separated by spaces. When the supernode receives this it separates the words, searches its database for each one separately, recombines the results and returns them to the searching node; the node then combines all the results from each supernode and displays them as if everything had been received from one.

3.4.3 How Supernodes Search their own Databases

The layout of the network also makes database searches inherently simple. Because files are hashed across the network using their first letter, the obvious technique for storing file references is to use the same hash function. Supernodes maintain an array of 26 linked lists (one for each letter of the alphabet); when a search is made, the first letter is checked and then the correct list (a Java Vector) is searched linearly. This technique also makes the process of splitting much simpler, as the second half of the array simply needs to be sent to the new supernode and then deleted.
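The following is a rough sketch of that layout; the class and method names are illustrative, not those of the real FileDatabase, and filenames are assumed to begin with a letter:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Vector;

    // Sketch of the supernode database: an array of 26 lists, one per letter.
    // A search checks the first letter of the query, then linearly scans only
    // that bucket; splitting hands half of the array to a new supernode.
    public class LetterIndexedDatabase {

        private final Vector<String>[] buckets;

        @SuppressWarnings("unchecked")
        public LetterIndexedDatabase() {
            buckets = new Vector[26];
            for (int i = 0; i < 26; i++) {
                buckets[i] = new Vector<>();
            }
        }

        private int bucketFor(String word) {
            return Character.toUpperCase(word.charAt(0)) - 'A';  // assumes a leading letter
        }

        public void add(String filename) {
            buckets[bucketFor(filename)].add(filename);
        }

        public List<String> search(String word) {
            List<String> results = new ArrayList<>();
            for (String filename : buckets[bucketFor(word)]) {
                if (filename.toLowerCase().contains(word.toLowerCase())) {
                    results.add(filename);
                }
            }
            return results;
        }

        // Splitting: the second half of the array (N-Z) is handed to a new supernode.
        public List<Vector<String>> removeSecondHalf() {
            List<Vector<String>> handedOver = new ArrayList<>();
            for (int i = 13; i < 26; i++) {
                handedOver.add(buckets[i]);
                buckets[i] = new Vector<>();
            }
            return handedOver;
        }
    }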

3.4.4 How Supernodes Perform Fuzzy Searches

So far, frequent references have been made to the system running basically like a distributed hash table, so something extra is required to allow the network to perform the important task of fuzzy searches. The primary technique used is to load multiple copies of each file reference onto the network, so that each file reference is searchable by any one of the words in the file name. Therefore, for the filename "Aerosmith – I don't want to miss a thing.mp3", references under the letters A, I, D, W, T, M, A and T would be loaded onto the network, with each word independently searchable. This allows a user to simply type in Aerosmith and have returned the many files that contain the word Aerosmith.

The second method is only something that will work on a small network, preferably with only one supernode, or in cases of coincidence. A comparison between a filename and a search word is performed by checking whether any instance of that word occurs in the string; therefore, if the file were called "Aerosmith-I don't want to miss a thing.mp3" or "I don't want to miss a thing -Aerosmith.mp3", a positive match would still be returned even though the search word is corrupted by the presence of a dash. The reason this works better on smaller networks is that as files get spread out over a larger number of supernodes the chance of a match 'coincidentally' occurring gets smaller; however, as a single-supernode network is essentially behaving like a Napster server, the chances of finding a match like this are much greater.
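The comparison itself is just a case-insensitive substring test; a minimal sketch:

    // Sketch of the fuzzy comparison: a file matches a search word if the word
    // occurs anywhere in the filename, so dashes or extra words do not prevent a hit.
    public class FuzzyMatch {

        public static boolean matches(String filename, String searchWord) {
            return filename.toLowerCase().contains(searchWord.toLowerCase());
        }

        public static void main(String[] args) {
            System.out.println(matches("Aerosmith-I don't want to miss a thing.mp3", "Aerosmith")); // true
            System.out.println(matches("I don't want to miss a thing -Aerosmith.mp3", "aerosmith")); // true
        }
    }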

3.5 Downloads/Uploads

It is undeniable that the majority of time on P2P networks is spent either uploading or downloading files, so the process must be made as efficient as possible. It is impossible to send more data through a connection than the bandwidth will allow, which raises the question of how a dial-up connection can manage to download any faster than 56k. The simple answer is that it can't, but it is possible to utilise the maximum amount of bandwidth available.

Nowadays most P2P file-sharing applications use the concept of multi-source downloads: a node chooses several nodes to download one single file from, each supplying, for example, 25% of the file. When all the downloads are completed, the pieces are put together to create the full file. Therefore, if a node is downloading a file from 3 other nodes at the respective bit rates of 1.2 KB/s, 0.8 KB/s and 1.7 KB/s, the total bit rate will be 3.7 KB/s, whereas if only one download source were used the maximum bit rate would be 1.7 KB/s.

The method used to select the download points is relatively simple. Firstly, a maximum of 5 matches are found, using the (overly simplistic) method of matching on filename and file size. The file size is then divided by the number of download points. Therefore, if the user wished to download a 1 KB jpeg called "Dog.jpeg" and there were 4 instances of it on the network, the following protocol messages would be sent out, one to each node.

120-DOWNLOAD=Dog.jpeg=0=256
120-DOWNLOAD=Dog.jpeg=256=256
120-DOWNLOAD=Dog.jpeg=512=256
120-DOWNLOAD=Dog.jpeg=768=256

The first number is the offset in the file from which to start the download and the second number is the number of bytes to download from that offset. If the file size does not divide evenly, unlike in the example, the final download compensates for this by downloading a larger or smaller number of bytes than the other downloads.
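The calculation behind these messages can be sketched as follows (the class and method names are illustrative):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: divide a file of fileSize bytes between several download points,
    // producing one "120-DOWNLOAD=<file>=<offset>=<length>" message per source.
    // The last source takes whatever remainder is left over.
    public class MultiSourcePlanner {

        public static List<String> plan(String filename, long fileSize, int sources) {
            List<String> messages = new ArrayList<>();
            long chunk = fileSize / sources;
            for (int i = 0; i < sources; i++) {
                long offset = i * chunk;
                long length = (i == sources - 1) ? fileSize - offset : chunk;
                messages.add("120-DOWNLOAD=" + filename + "=" + offset + "=" + length);
            }
            return messages;
        }

        public static void main(String[] args) {
            // The 1 KB "Dog.jpeg" example from above, split across 4 sources.
            plan("Dog.jpeg", 1024, 4).forEach(System.out::println);
        }
    }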

When downloading a large file it is not possible to store the whole thing in memory; it is therefore necessary to stream the data to file to prevent a java.lang.OutOfMemoryError. If one source is used for the download, the data is streamed to a single file called (using the previous example) Dog0.dat; similarly, if the download has two sources, the data is streamed to two files, Dog0.dat and Dog1.dat. This is because each download is performed separately and possesses no knowledge of the other sections being downloaded; it is simply given a file to stream the data to. Once all the downloads are complete, the files are merged into one file, which is then given the appropriate filename.
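A sketch of the final merge step, assuming the part files are named <base>0.dat, <base>1.dat and so on as described above; the folder and file names are purely illustrative:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Sketch only: merge the numbered part files produced by a multi-source
    // download (Dog0.dat, Dog1.dat, ...) into the final file, then delete them.
    public class PartFileMerger {

        public static void merge(Path downloadFolder, String baseName, int parts, String finalName)
                throws IOException {
            Path target = downloadFolder.resolve(finalName);
            try (OutputStream out = Files.newOutputStream(target)) {
                for (int i = 0; i < parts; i++) {
                    Path part = downloadFolder.resolve(baseName + i + ".dat");
                    Files.copy(part, out);      // append this section to the final file
                    Files.delete(part);
                }
            }
        }

        public static void main(String[] args) throws IOException {
            merge(Paths.get("downloads"), "Dog", 4, "Dog.jpeg");
        }
    }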

One of the problems with this method is that if one of the download sources goes offline, a file may be left incomplete with one section missing; in such a case the missing bytes are downloaded from the first source by creating a new download connection. When a download is stopped by a user event, e.g. clicking pause or closing down the application, the data files from the download remain in the download folder; when the application is restarted the data files are checked and a search initiated to try to find instances of that file on the network. If one or more instances are found, each source is allocated a proportion of the file to download: if only one instance is found, a download is initiated for each data file at that one source, and if the number of instances on the network equals the number of data files (i.e. the number of original sources) each source is allocated one data file to complete. The maximum number of sources used will equal the number of original sources; therefore, if previously there was only one source but when the download is resumed there are 5 sources, 4 of these will not be utilised, because only one source would be allocated the duty of completing the file.

The actual file transfer will simply take place over the same TCP connection that is being used for the protocol messages.


3.6 System Architecture

[The diagram shows the application divided into the packages GazNet (Client, SuperNode), GazNet.database (DataFileFilter, FileDatabase, FileDescription, IO, ListOfFiles, NodeDatabase, NodeElement), GazNet.download (CompletedListener, Download, DownloadManager, Upload, UploadConnections, UploadControl, UploadListener), GazNet.network (Connection, NetworkStartup, Node, NodeConnections, Search, SuperNode, SuperNodeConnections) and GazNet.util (Alpha, BitRate, FTPClient, Group, ProtectedInteger, ProtocolSocket, Range, Rating, Reporter, Semaphore, StopWatch).]

Diagram 3.3 – A package and class diagram

Diagram 3.4: An Approximate UML Diagram

Some of the utility classes are missing from this diagram; they occur in so many classes that it would be impractical to show all the associations.

3.7 Class Overview

"GazNet" package
Client: This class calls the P2Pgui class and will create an instance of GazNet.
Supernode: This class creates a supernode.

Table 3.1 – Overview of GazNet Package

"GazNet.database" package
DataFileFilter: This object is used solely by IO to filter out files of type '.dat'; these files are incomplete downloads.
FileDatabase: This class holds FileDescription objects in a hash table; it can perform a variety of necessary functions on the database, including searching it, converting it to different formats (e.g. an array) and extracting a particular set of files.
FileDescription: This class represents a file. It contains four variables: the file's name, the file's size, the owner of the file (i.e. the IP address) and a hash key. The hash key is not actually a hash key but a string used for searching, which is modified by each supernode that the file passes through; originally the hash key is the same as the filename, but as the FileDescription passes through each tier the first letter of the hash key is removed, to facilitate the recursive nature of the search algorithm.
IO: This class deals with all input/output aspects of the system, such as getting a list of the shared files and reading/writing administrative files such as 'NodeFile.dat'.
ListOfFiles: This object implements the Enumeration interface and is used to step through FileDescription arrays.
NodeDatabase: This class holds a database of NodeElement objects.
NodeElement: This class represents a node in the network.

Table 3.2 – Overview of GazNet.database Package

"GazNet.download" package
CompletedListener: This is an interface for the Observer design pattern; it uses the method percentageChanged to follow the progress of a download/upload.
Download: This class performs a single download from one node.
DownloadManager: This class implements the Singleton design pattern, so this one class manages all downloads. It is responsible for selecting which nodes to download from and it deals with the calculations for performing multi-source downloads.
Upload: This class performs a single upload to one node.
UploadConnections: This class listens on a given port for download requests and will pass the request on to the Upload class if one is received.
UploadControl: This class ensures that large transfers are performed without the connection freezing.
UploadListener: This is an interface for the Observer design pattern which monitors new upload requests.

Table 3.3 – Overview of GazNet.download Package

"GazNet.gui" package
DownloadPanel: This class extends javax.swing.JPanel and deals with displaying all the current downloads.
MyFilesPanel: This class extends javax.swing.JPanel and deals with which files the user shares and where downloaded files are stored.
P2Pgui: This class extends Tabs and contains the main method to run GazNet.
PopupListener: This class listens for right clicks on the tables in DownloadPanel, SearchPanel and UploadPanel and will display a popup menu if appropriate.
SearchPanel: This class extends javax.swing.JPanel and allows the user to search for files and then displays the results.
SuperNodeConsole: This class is used for debugging and allows someone who is running a supernode to view the nodes connected to it and the contents of the FileDatabase.
SuperNodeGui: This class extends SuperNodeConsole and adds the functionality of an ActionListener so that the user can actually interact with the console.
Tabs: This class combines all the panels into a javax.swing.JTabbedPane.
UneditableTableModel: This class extends DefaultTableModel to restrict the user from editing tables that contain file information.
UploadPanel: This class extends javax.swing.JPanel and displays all the current uploads.
WelcomePanel: This class extends javax.swing.JPanel and is the default tab to be displayed when the GUI starts up.

Table 3.4 – Overview of GazNet.gui Package

"GazNet.network" package
Connection: This class deals with connecting a node to the network.
NetworkStartup: This class aids other classes in starting up the network; it follows the Singleton design pattern and contains many static variables that other classes will reference. It is responsible for locating and maintaining supernode addresses and, when running as a supernode, updating the appropriate servers with new addresses.
Node: This class deals with all interactions between the node and the supernode (excluding searches). It also deals with messages from other nodes, such as popup message requests.
NodeConnections: This class listens for connections from other nodes and then passes them on to the Node class.
Search: This class is used by nodes to search for a file.
SuperNode: This class deals with all interactions with the supernode, from both other supernodes and nodes.
SuperNodeConnections: This class listens for connections to the supernode and then passes them on to the SuperNode class.

Table 3.5 – Overview of GazNet.network Package

"GazNet.util" package
Alpha: This class performs operations on strings, characters and numbers that convert them into different formats depending on their purpose.
BitRate: This class extends StopWatch and calculates the bit rate of a transfer; it also provides other utilities to aid in the display of transfer information.
FTPClient: This is an FTP client used to transfer the files that contain the locations of supernodes to web servers.
Group: This is a transfer class that wraps all classes that are transmitted over the network.
ProtectedInteger: This class is used to synchronise downloads and the compiling of search results.
ProtocolSocket: This class follows the Wrapper pattern and wraps the java.net.Socket class, adding functionality such as the ability to send file lists.
Range: This class holds the range of hash space that a supernode looks after; it also holds the tier that the supernode is on.
Rating: This class deals with generating the performance rating of a node.
Reporter: This class is used to print out all debugging messages.
Semaphore: This class is a semaphore and is used by the ProtectedInteger class.
StopWatch: This class measures periods of time.

Table 3.6 – Overview of GazNet.util Package

3.7.1 Design Patterns

To improve performance and increase the ease and elegance of the system's development, three design patterns were necessary.

Singleton

In a system such as this, it is frequently necessary for certain variables to stay consistent throughout the system; a perfect example of this is the routing table. Multiple classes use the routing table, but multiple classes can also modify it, so it is imperative that some form of control is used to stop two classes using different routing tables. The NetworkStartup class maintains the routing table and a number of other variables that should remain consistent throughout the system. The solution devised was to use the Singleton design pattern: a method called getInstance() is provided, and calling this method instantiates an instance of the class and holds a reference to it in a static variable; if the class has already been instantiated, this existing instance is returned instead of creating a new one. By also making the class's constructor private, only one instance of the object can ever exist, making this the perfect choice for the NetworkStartup class.
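A minimal sketch of the pattern as described; the routing-table field is illustrative rather than the real contents of NetworkStartup:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the Singleton pattern used by NetworkStartup: the private
    // constructor and getInstance() guarantee a single shared set of variables.
    public class NetworkStartup {

        private static NetworkStartup instance;

        // Shared state, e.g. the first-tier routing table (illustrative field).
        private final List<String> routingTable = new ArrayList<>();

        private NetworkStartup() { }                 // nobody else can construct one

        public static synchronized NetworkStartup getInstance() {
            if (instance == null) {
                instance = new NetworkStartup();
            }
            return instance;
        }

        public List<String> getRoutingTable() {
            return routingTable;
        }
    }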

Observer

In certain situations in the system it is necessary for a multitude of objects to keep a single object informed of changes in their status. An example of this is the progress field of the transfer panes: every time another percent of a file is transferred, this must be displayed in the progress field. As many simultaneous transfers can take place at one time, the task of following each file's progress can quickly become unmanageable. To carry out this process the Observer design pattern is used: all classes that must listen (e.g. the transfer table in the transfers pane) implement an appropriate interface containing a method that can be called to notify them of a change (e.g. updateProgress). Any object wishing to be informed of changes is added to a vector; when a change occurs, the observed object cycles through the vector calling the update method on each listening class.
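A sketch of the arrangement, using a listener interface in the style of the CompletedListener described in Table 3.3 (the class and method bodies here are illustrative):

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Sketch of the Observer pattern: a download notifies every registered
    // listener (e.g. the transfers table in the GUI) each time its progress changes.
    public class ObservableDownload {

        // Mirrors the style of the CompletedListener interface in Table 3.3.
        public interface CompletedListener {
            void percentageChanged(int percentComplete);
        }

        private final List<CompletedListener> listeners = new CopyOnWriteArrayList<>();

        public void addListener(CompletedListener listener) {
            listeners.add(listener);
        }

        // Called by the download thread as bytes arrive.
        public void setProgress(int percentComplete) {
            for (CompletedListener listener : listeners) {
                listener.percentageChanged(percentComplete);
            }
        }

        public static void main(String[] args) {
            ObservableDownload download = new ObservableDownload();
            download.addListener(p -> System.out.println("Progress: " + p + "%"));
            download.setProgress(25);
        }
    }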

Decorator

In the system it is sometimes necessary to add functionality to a class in a dynamic way, rather than through inheritance. Such an instance is the ProtocolSocket, where extra functionality above a standard Java Socket must be supported. To do this the Decorator design pattern is employed: the ProtocolSocket creates a shell around the standard Java Socket and passes method calls on to the Socket, but before doing so it may modify the parameters; alternatively, it may modify what the Socket returns. An example of this is the read() method: the Socket returns an array of bytes but the ProtocolSocket returns a string.
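A sketch of the idea behind ProtocolSocket (the real class adds more functionality, such as sending file lists; this only shows the string-returning read and its write counterpart):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    // Sketch of the Decorator idea: calls are passed on to the wrapped
    // java.net.Socket, but the results are converted to a friendlier form
    // (here, bytes read from the stream are returned as a String).
    public class ProtocolSocketSketch {

        private final Socket socket;

        public ProtocolSocketSketch(Socket socket) {
            this.socket = socket;
        }

        // Decorated read: the underlying socket supplies bytes, we return a String.
        public String read() throws IOException {
            InputStream in = socket.getInputStream();
            byte[] buffer = new byte[1024];
            int bytesRead = in.read(buffer);
            return bytesRead == -1 ? null : new String(buffer, 0, bytesRead, StandardCharsets.UTF_8);
        }

        // Decorated write: the caller supplies a String, the socket sees bytes.
        public void write(String message) throws IOException {
            OutputStream out = socket.getOutputStream();
            out.write(message.getBytes(StandardCharsets.UTF_8));
            out.flush();
        }

        public void close() throws IOException {
            socket.close();
        }
    }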

3.8 User Interface

The user interface is an extremely important component of the overall system, as it is the only way a user has of interacting with the network. One of the aims of the system was to provide a simple, easy-to-use interface, which requires a balance between the level of control the user has over the program and the level of complexity faced by the user.

There were two original designs: the first was a simple-looking, straightforward GUI (Design A) and the second was a GUI more closely modelled on existing P2P file-sharing GUIs such as Kazaa's (Design B).

Design A's strength lies in its simplicity; it will be both easy to implement and easy to use, and any user with a basic grasp of the purpose of the software will be able to begin using the interface within minutes of seeing it. However, the limited design allows no customisation and gives the user little extra functionality over the very basics; examples of this are the inability to resize the window or to select more advanced search options.

[GUI mock-up: a fixed-size "GazNet P2P" window with Search, Download, Upload and My Files tabs, a "Message of the Day" area and a "Connected to GazNet" status indicator on the welcome screen.]

Diagram 3.5 - GUI design A

Design B's strengths are exactly what Design A's weaknesses are: it provides more options for the user, such as the ability to select different types of searches (film, video, exe etc.), and features like resizing of the window. This type of UI would be superior to Design A for many purposes, mainly everyday common use; users would probably use the application frequently, quickly become accustomed to the GUI and so be able to exploit the extra features, such as advanced searching, which gives the user more control over their search results.

Despite these differences, the two designs have many similarities as they both work on the same main principles. Both have tabs with similar titles and many features reside on both GUIs; the main difference lies in the size and therefore the extra features that can be put into the larger design (Design B).

[GUI mock-up: a resizable "GazNet P2P" window with Welcome, Search, Transfers and My Files tabs; a search field with a Search! button, media-type filters (All media, Music, Video, Executable, Document) and a results table with Filename, File Size and Speed columns.]

Diagram 3.6 - GUI Design B.

Design A was chosen because it most closely suited the requirements of the system: the project focuses on networking aspects rather than HCI (Human Computer Interaction), and for this reason the UI is one of the lower-ranking requirements (number 8). The UI therefore exists to supply an interface between the network and the user (or tester) with the greatest ease of use rather than the greatest degree of functionality. Design A best serves this purpose as its simplicity allows quicker implementation and quicker testing; despite this main function, the GUI is also perfectly adequate for use by an average user.

Full Design

The design splits down into five panes, each contained on one tab. The first, the welcome screen, can be seen in Diagram 3.5; it contains two elements, a Message of the Day screen intended to contain information about updates and other news, and a connected status symbol to tell the user whether he/she is connected to the network.

[GUI mock-up: the search pane of the "GazNet P2P" window, with a search text field, a Search! button and a results table with Filename, Size and Speed columns.]

Diagram 3.7 - The search panel

The search pane provides a simple method for searching the network: the user need only type the search words into the text field and then click Search. Any results returned will be displayed in the table.

[GUI mock-up: the transfer pane of the "GazNet P2P" window, a table with Filename, Bit Rate and Progress columns.]

Diagram 3.8 - The Download/Upload panel


The upload and download panels are identical; both contain a table listing the files being transferred along with the progress of the transfer (as a percentage) and the bit rate.

The most complex panel is the 'My Files' panel, which looks after the folders that the user wishes to share and the location in which downloaded files are to be put. The panel also displays the number of files being shared and their total size. It is dominated by a tree containing all the shared folders and the files within them; the tree can be modified using the Add and Remove buttons, which allow folders to be added to or removed from the shared list. The Update button manually forces the node to update the file references on the supernodes in the network. The bottom text field contains the location in which all downloaded files are placed; the Browse button allows the user to find the appropriate folder using a file selection window.

[GUI mock-up: the 'My Files' pane of the "GazNet P2P" window, with a Shares tree (e.g. C:\Music), Add/Remove/Update buttons, "Currently Sharing: 124 Files", "Total Size Shared: 396Mb", and a "Put Downloaded files in:" field (e.g. C:\Downloaded) with a Browse button.]

Diagram 3.9 - The 'My Files' Panel

3.9 Summary

This chapter has given a relatively high-level overview of how the system should be implemented; excluding modifications made during incremental testing, this should be an accurate representation of the end system. The following chapter covers the implementation of the design and aims to give the reader a more low-level knowledge of how the system actually works.



4. Implementation

4.1 The GazNet Protocol

Up until now GazNet has always been a reference to the designed application; a more accurate definition, however, is that GazNet is the protocol which the application merely implements. The protocol is ASCII based and works over TCP on port 4001. It is text based primarily to aid debugging; this was decided because of the complex nature of the network and in anticipation of a number of problems. Another major factor in choosing a text-based protocol was its greater extensibility: it is likely that with prolonged testing and usage a number of modifications to the protocol will need to be made, and as the network increases in size changes may be needed to improve scalability. Extending a bit-oriented protocol can become a complex affair, because many fields are already set and cannot be changed in the name of backward compatibility; similarly, packet sizes are already fixed and cannot be expanded to support extra features such as longer search fields or IPv6 addresses. It is therefore clear that, at least in the early stages of the protocol, a text-based approach is superior; it is not, however, perfect. The biggest problem with text-based protocols is their inefficiency. A good example is GazNet's pong message (the response to a ping), 000-PRESENT(Supernode); this message is 22 bytes long despite the fact that it has only two fields, the message type (000-PRESENT) and the type of node (Supernode). It could easily be replaced by a much smaller bit-oriented message using one byte for the message type (allowing 256 different types of message) and only one bit for the node type. Diagram 4.1 shows such a protocol message: the first 8 bits give the message type (the number 1 represents a pong) and the 9th bit gives the type of node the reply has come from (0 for node and 1 for supernode). This more than halves the size of the protocol message, which may not make much difference on a small network, but on a large network with millions of nodes it would represent a massive reduction in traffic.

Message Type (8 bits)   Node Type (1 bit)
0 0 0 0 0 0 0 1         1

Diagram 4.1

The Message Type field could even be doubled to allow a much greater number of possible protocol messages (65,536) and the message would still not be as long as its ASCII counterpart. These arguments put forward a strong case for changing the protocol's format, something that may indeed be done once the protocol has become stable enough to make the change; for the moment, however, the protocol stays text based. Table 4.1 gives an overview of every message in the protocol.
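As an illustration only (this compact form is not part of the implemented protocol), the pong of Diagram 4.1 might be packed into two bytes as follows:

    // Packing the pong of Diagram 4.1 into a compact form: the 8-bit message
    // type occupies the first byte and the node-type flag is the top bit of
    // the second byte, leaving the remaining bits spare.
    class CompactPong {
        static final int MSG_PONG = 1;

        static byte[] encode(boolean isSupernode) {
            byte[] msg = new byte[2];
            msg[0] = (byte) MSG_PONG;                      // message type: 00000001
            msg[1] = (byte) (isSupernode ? 0x80 : 0x00);   // node type in the 9th bit
            return msg;
        }

        static boolean isSupernode(byte[] msg) {
            return (msg[1] & 0x80) != 0;
        }
    }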


Generating Node | Protocol Message | Description
Supernode/Node | 000-PRESENT(Supernode/Node) | A response to a ping; the value in the brackets represents the type of node that is replying
Supernode/Node | 001-PRESENT? | A ping to find out if a node is online and what its node type is (Supernode or Node)
Supernode/Node | 002-OPEN? | A message sent to verify that the TCP connection is still open; there is no reply because an exception is automatically created if the connection is not open
Node | 010-ALPHA_LOCATIONS? | A request for the current addresses of all 26 top tier supernodes
Supernode | 011-NEW_ALPHA_LOCATIONS= | A message indicating that a node's routing table is incorrect; it is followed by an array containing the 26 top tier supernodes
Supernode | 022-NEW_SUPERNODE=supernode | Sent if a newer supernode has been created that must deal with a node's request
Node | 030-CONNECT? | A request sent by a node to find out if a supernode will accept new connections
Node | 031-JOIN? | A request sent by a node to find out if a supernode will accept more file lists
Supernode | 040-CONNECT_OK | A reply from a supernode to a 030 request accepting the connection from a node
Supernode | 041-JOIN_OK | A reply from a supernode to a 031 request accepting a new file list from a node
Supernode | 042-CONNECT_REFUSED | A reply refusing either a 030 or 031 request from a node
Supernode/Node | 050-LIST | A request from a supernode for the file list to be sent
Supernode/Node | 060-LIST_READY | A reply from a node to a 050 request, which will be followed by a file list
Supernode/Node | 070-LIST_OK | A reply from a supernode confirming that the file list is valid
Supernode/Node | 071-LIST_ERROR | A reply from a supernode indicating that the file list has either not been received or is invalid
Node | 072-RATING=rating | A message containing the performance rating of a node
Supernode | 073-RATING_OK | A reply from a supernode saying that the performance rating is valid
Supernode | 074-RATING_ERROR | A reply from a supernode saying that the performance rating is invalid
Node | 080-SEARCH=search string | A search request sent to a supernode
Supernode | 090-SEARCH_OK | A reply from a supernode saying that the query is valid
Supernode | 091-INVALID_SEARCH | A reply from a supernode saying that the query contains invalid characters, e.g. '@'
Supernode | 100-SEARCH_LIST_READY | A message saying that the search list has been prepared
Node | 110-SEARCH_LIST_OK | A message saying that the search list has been received and is in a valid format
Node | 120-DOWNLOAD=filename=x=y | A request to download a file starting at offset x of length y
Node | 130-DOWNLOAD_OK | A reply sent to indicate that the file resides on the node and can be downloaded
Node | 131-FILE_NOT_PRESENT | An error message indicating the requested file is not in the user's shared folders
Node | 140-DOWNLOAD_READY | A message sent to the uploading node to indicate the download is ready to commence
Supernode | 190-NEW_SUPERNODE | A message to a node requesting that it converts into a supernode
Node | 200-ACCEPT_SUPERNODE | A reply sent by a node accepting a 190 request
Supernode | 210-SENDING_LIST |
Supernode | 400-ALPHA_LOCATIONS_UPDATE | A message from a supernode to its parent indicating that a routing table update is to be sent (see Section 4.1.6)
Supernode | 410-UPDATE_OK | A reply accepting a 400 update (see Section 4.1.6)
Supernode | 420-UPDATE_REFUSED | A reply refusing a 400 update (see Section 4.1.6)
Supernode/Node | 900-MESSAGE=message | A request sent to a node for a popup message to be displayed
Supernode/Node | 999-UNRECOGNISED_COMMAND | A reply sent if the message isn't recognised as a valid protocol message

Table 4.1 – Protocol Overview

4.1.1 000-PRESENT – 001-PRESENT? – Pinging a Node

When a node (or person) wishes to verify that a node is still online, a '001-PRESENT?' request is sent; the reply will be either 000-PRESENT(Supernode) or 000-PRESENT(Node) depending on the node's type.

4.1.2 030-CONNECT? – The Connection Process

When a node wishes to connect to the network it must connect to a supernode. The first step is to send a '030-CONNECT?' request to a supernode; a reply is then sent accepting ('040-CONNECT_OK') or refusing ('042-CONNECT_REFUSED') the request. If the request is refused it will be followed by a '022-NEW_SUPERNODE=supernode' message, in which the supernode referenced is the most recently created supernode; the supernode then sends an array containing the 26 top tier supernodes, ensuring that the node has all the up-to-date addresses. The 022 message is sent because the referenced supernode has the greatest chance of being open to new connections, as it is the youngest supernode available.

If the request is accepted, the supernode sends a '050-LIST' request asking the node to send its file list; the node then sends a '060-LIST_READY' message followed by its file list, and the supernode replies with a '070-LIST_OK' message or a '071-LIST_ERROR' depending on whether the list was valid.

The next stage is for the node to inform the supernode of its performance rating, a number between 0 and 100, sent in a '072-RATING=rating' message. If the number is valid the supernode replies with '073-RATING_OK', otherwise it replies with '074-RATING_ERROR'.

The majority of supernodes that a newly connected node contacts will receive join requests rather than connect requests. When a node joins a supernode it does not inform it of its performance rating and the supernode does not record the node in its node database; it merely adds the node's file list to the file database. The same process is followed as for a connect request, but a '031-JOIN?' message is sent instead of a '030-CONNECT?' message. Diagram 4.2 shows the interactions involved in a successful 030-CONNECT? request.

[Sequence: the Node sends 030-CONNECT?; the Supernode replies 040-CONNECT_OK and requests 050-LIST; the Node sends 060-LIST_READY followed by its file list; the Supernode replies 070-LIST_OK; the Node sends 072-RATING=rating; the Supernode replies 073-RATING_OK.]

Diagram 4.2 – Supernode-Node Interaction during the Connection Process
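A simplified, node-side sketch of this exchange, written against the protocolSend/protocolFetch/sendList methods of the ProtocolSocket described in Section 4.2; the reply parsing and error handling shown are assumptions.

    // Node-side sketch of a successful 030-CONNECT? exchange, using the
    // ProtocolSocket methods of Table 4.2
    boolean connectToSupernode(ProtocolSocket socket, Group fileList, int rating) {
        socket.protocolSend("030-CONNECT?");
        if (!socket.protocolFetch().startsWith("040-CONNECT_OK")) {
            return false;                      // 042-CONNECT_REFUSED or a re-direct
        }
        socket.protocolFetch();                // expect 050-LIST from the supernode
        socket.protocolSend("060-LIST_READY");
        socket.sendList(fileList);             // the node's file list
        if (!socket.protocolFetch().startsWith("070-LIST_OK")) {
            return false;                      // 071-LIST_ERROR
        }
        socket.protocolSend("072-RATING=" + rating);
        return socket.protocolFetch().startsWith("073-RATING_OK");
    }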

4.1.3 080-SEARCH=search string – The Searching Process

To search the network a node first checks the first letter of the search string and looks up the correct supernode to contact in its routing table; it then sends a '080-SEARCH=search string' message to that supernode, where 'search string' is the query the user wants to search for. The supernode checks that the search string is valid, i.e. that it begins with the correct letter and does not begin with a non-alpha character; if it is valid the supernode replies with a '090-SEARCH_OK' message, otherwise it replies with a '091-INVALID_SEARCH' message. Once the supernode has searched the file database and compiled the results, it sends a '100-SEARCH_LIST_READY' message to the node to indicate the list is ready, and then sends the file list. If the list is valid the node replies with '110-SEARCH_LIST_OK'.

4.1.4 120-DOWNLOAD=filename=x=y – The Download Process

When a node wishes to download a file it sends a '120-DOWNLOAD=filename=x=y' request to the node on which the file resides. The filename is the full filename of the file, x is the offset at which the download should begin, and y is the number of bytes to transfer from that offset; x and y are used to facilitate multi-source downloads. If the file resides on the node and the node is willing to commence a transfer it replies with a '130-DOWNLOAD_OK' message; if the file doesn't reside on the node it sends a '131-FILE_NOT_PRESENT' message. When the downloading node is ready to start the download it sends a '140-DOWNLOAD_READY', and the uploading node then starts sending the data over the same connection.
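For illustration of the message format only, a received request might be unpacked as follows (the real parsing code may differ, and filenames containing '=' are not considered):

    class DownloadRequestParser {
        // Unpack "120-DOWNLOAD=filename=x=y" into its three fields
        static void handle(String request) {
            String[] parts = request.split("=");      // {"120-DOWNLOAD", filename, x, y}
            String filename = parts[1];
            long offset = Long.parseLong(parts[2]);   // x: offset to start the transfer at
            long length = Long.parseLong(parts[3]);   // y: number of bytes to send
            System.out.println("send " + length + " bytes of " + filename
                    + " starting at offset " + offset);
        }
    }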

4.1.5 190-NEW_SUPERNODE – Supernode Creation

When a supernode wishes to split and create a new supernode, it first selects the highest-performance node from its node database and sends it a '190-NEW_SUPERNODE' message. The node replies with a '200-ACCEPT_SUPERNODE' message, and the supernode then sends a file list containing all the FileDescription objects that the new supernode must add to its file database. If the file list is valid the new supernode replies with a '070-LIST_OK' message and the creation of the supernode is complete.

4.1.6 400-ALPHA_LOCATIONS_UPDATE

One of the most important requirements of the protocol is to keep the routing information on the network up to date; if one supernode becomes unsynchronised the whole network could start to collapse. To ensure this doesn't happen, supernodes must inform their parent supernodes of any changes in their routing table. Changes only occur when a supernode splits, so whenever this happens a supernode contacts the supernode that created it and gives it the new routing information. To do this a supernode sends its parent a '400-ALPHA_LOCATIONS_UPDATE' message; the parent replies with a '410-UPDATE_OK' message and the child supernode then sends a Group containing the modified supernode addresses. Alternatively, the parent supernode may reply with a '420-UPDATE_REFUSED' message, in which case the child will not be able to send the update.

4.2 The ProtocolSocket

The GazNet protocol makes use of a variety of different transport methods. The most frequently used method of communication is through ASCII strings; on top of this, however, the protocol uses object transmission and byte streams. Because of this, a class called ProtocolSocket was defined to perform all the tasks involved in transmitting the protocol.

Method | Description
protocolFetch(): String | Receives a protocol message, e.g. 080-SEARCH
protocolSend(String s) | Sends a protocol message
getList(): Group | Receives a Group object (see Section 4.3.2)
sendList(Group g) | Sends a Group object (see Section 4.3.2)
getObject(): Object | Receives an Object
sendObject(Object o) | Sends an Object

Table 4.2 – Functions of the ProtocolSocket class

The ProtocolSocket is a wrapper class: it contains an instance of a Socket class which is used to actually transmit data, but before data is forwarded on to the Socket it is in some way modified. For example, the protocolSend(String s) method accepts a String as a parameter but passes a byte array on to the Socket's write method.


There are two constructors in the ProtocolSocket class. The first accepts a String and an integer: the String contains the IP address to connect to and the integer the port; this constructor is used when a node wishes to initiate a connection with another node. The second constructor accepts a Socket as a parameter, which the ProtocolSocket then sets as the Socket to wrap; this constructor is used when a node has received a connection from another node.

The ProtocolSocket class also carries out the important function of maintaining a list of all the connections a node has. This is important because:

a) The application is multi-threaded and therefore connections need to be managed to prevent them from interfering with each other

b) Some connections are persistent and kept open throughout the node's life; it is therefore necessary to ensure that multiple connections to the same node aren't opened, to prevent inefficiencies and errors.

c) Sometimes a class that doesn't have a reference to the correct ProtocolSocket might need to contact an already connected node; keeping a connections list allows any class to access any Socket that it requires.

The connections list is held in a static Vector in the ProtocolSocket class. Whenever a new ProtocolSocket is constructed it first checks the connections list to ascertain whether a connection to that IP address and port already exists. If no open connection exists, the Socket variable in the new ProtocolSocket is initiated and the new Socket is added to the activeConnections Vector. If the connection already exists, no new Socket is initiated and the Socket variable inside the new ProtocolSocket is set to the existing Socket from the activeConnections Vector.
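A sketch of how that check might look; apart from the activeConnections name, which the text uses, the class and method names are illustrative.

    import java.io.IOException;
    import java.net.Socket;
    import java.util.Vector;

    // Re-use an existing Socket to the same host and port where possible
    class ConnectionRegistry {
        // Shared across all ProtocolSocket instances
        private static final Vector<Socket> activeConnections = new Vector<Socket>();

        static synchronized Socket obtain(String host, int port) throws IOException {
            for (Socket s : activeConnections) {
                if (!s.isClosed()
                        && s.getPort() == port
                        && s.getInetAddress().getHostAddress().equals(host)) {
                    return s;      // an open connection already exists
                }
            }
            Socket s = new Socket(host, port);   // otherwise create and record a new one
            activeConnections.add(s);
            return s;
        }
    }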

4.3 Data Structures

The whole premise of P2P file sharing is the distribution of data, and data structures therefore play an important part in this project. The following section deals with the three most important data structures: the Range, the Group and the FileDatabase.

4.3.1 Range

The Range class is a very important class in the implementation; it is used to represent the range of letters (A-M, N-Z etc.) that a supernode is looking after. When a supernode splits it modifies its own Range in accordance with its new duties, and it also sends a Range object to the new supernode informing it of the range of letters that it is to look after. The Range class stores ranges using two integers with values 0 to 25 representing the letters A to Z; these values are stored in the variables first and last and are referred to as the first range and the last range. Due to GazNet's tree-like structure it is also necessary for the Range class to be able to differentiate between the tiers of the network tree; to do this another integer called tier is used. The first tier is represented by 0 and each lower tier by the number of hops from the first tier, e.g. the second tier is represented by 1.
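A minimal sketch of such a Range class, showing only the fields described above plus the accessors assumed by the later splitting sketch (Section 4.4.1):

    class Range {
        private int first;   // 0-25, representing A-Z
        private int last;    // 0-25, representing A-Z
        private int tier;    // 0 for the top tier, 1 for the next, and so on

        Range(int first, int last, int tier) {
            this.first = first;
            this.last = last;
            this.tier = tier;
        }

        int getFirst() { return first; }
        int getLast()  { return last; }
        int getTier()  { return tier; }

        void setRange(int first, int last) {
            this.first = first;
            this.last = last;
        }

        // True if the given letter (e.g. 'G') falls within this range
        boolean covers(char letter) {
            int value = Character.toUpperCase(letter) - 'A';
            return value >= first && value <= last;
        }
    }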


4.3.2 Group

The Group class is the main object used for network transfer; it stores arrays of other objects while they are being transferred over the network. It has two private variables, an Object array, contents[], and a Range object, range. The object is primarily used to transfer arrays holding FileDescriptions, but it is also used to transfer arrays containing supernode addresses; because the Group class stores Objects, any subclass of Object can be used.

4.3.3 The FileDatabase

The FileDatabase is the most important data structure in the project; it is responsible for looking after files and performing a number of different operations on them.

The type of element stored in the FileDatabase is the FileDescription; this class holds information specific to each file, and the following table lists the variables stored in it.

Variable | Type | Description
fileName | String | The filename of the file it represents
fileSize | String | The size of the file it represents
owner | String | The IP address of the computer that the file resides on
hashKey | String | Used to search for files. Every time the file is passed down to a lower tier, the first letter is removed from the hashKey (which originally equals the fileName). Similarly, when a search request is passed down to a lower tier the first letter is removed from the search string; this recursive process makes the implementation far more elegant.
beenUsed | boolean | Used in the search algorithm. Because multiple copies of one FileDescription are distributed throughout the network, one supernode may end up storing several copies of the same FileDescription in its database. When searching, the beenUsed variable is set to true after a FileDescription has been added to the search results, and only FileDescriptions with beenUsed set to false are checked; this prevents multiple copies of one FileDescription being returned in search results.

Table 4.3 – Overview of the FileDescription class
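Expressed as a bare data class, the fields of Table 4.3 might look as follows (accessors and any further behaviour of the real class are omitted):

    class FileDescription {
        String fileName;    // the filename of the file this object represents
        String fileSize;    // the size of the file
        String owner;       // IP address of the computer the file resides on
        String hashKey;     // initially equals fileName; one letter is stripped
                            // each time the description is passed down a tier
        boolean beenUsed;   // set to true once added to a set of search results,
                            // preventing duplicates being returned
    }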

The FileDatabase serves two purposes. When used in a node it stores all the files that reside on that node; very little of the functionality is used, however, and its main purpose is to split the files up into the appropriate hash space.

The main functions of the FileDatabase are used when the computer is running as a supernode. Its abstract purpose stays the same, to store files, but a supernode will be storing a much larger number of files from all over the network, so performance becomes a big issue.


How the File Database stores files

This is probably the most important aspect of the FileDatabase, as it is its primary function. Files are stored using the same hash function as is used on the overall network, i.e. files are hashed by their first letter. Each instance of a FileDatabase has an array containing 26 Vectors, each Vector holding FileDescriptions whose hashKey begins with a particular letter. When a FileDatabase object is constructed it is given a list of FileDescriptions, which are linearly scanned and placed in the appropriate Vector. The FileDatabase also has a load method which allows extra FileDescriptions to be added to the database at any time. This is obviously not the most efficient way of storing file references, as faster lookup times could be obtained with a different hashing algorithm; however, this technique gives much faster supernode splitting times, since little processing is needed to find which FileDescriptions should be sent to the new supernode; all that needs to be done is to pass on the Vectors that lie in the alpha range of the new supernode.
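A sketch of this storage scheme, reusing the FileDescription fields from Table 4.3; the class and method names are illustrative rather than the real FileDatabase interface:

    import java.util.Vector;

    // 26 Vectors, one per letter, indexed by the first character of the hashKey
    class FileDatabaseSketch {
        private final Vector<FileDescription>[] buckets;

        @SuppressWarnings("unchecked")
        FileDatabaseSketch() {
            buckets = new Vector[26];
            for (int i = 0; i < 26; i++) {
                buckets[i] = new Vector<FileDescription>();
            }
        }

        // Linear scan of the supplied descriptions, placing each in the
        // Vector matching the first letter of its hashKey
        void load(FileDescription[] files) {
            for (FileDescription f : files) {
                int index = Character.toUpperCase(f.hashKey.charAt(0)) - 'A';
                if (index >= 0 && index < 26) {
                    buckets[index].add(f);
                }
            }
        }

        // Splitting is cheap: whole Vectors can be handed to the new supernode
        Vector<FileDescription> bucketFor(char letter) {
            return buckets[Character.toUpperCase(letter) - 'A'];
        }
    }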

How the File Database is searched

There are two methods in the FileDatabase class used for searching: a public method, searchKey(String searchString), and a private method, search(String searchString, int vector, Vector searchFiles, Vector changed). searchKey receives a String as a parameter, the String that the FileDatabase is being searched for; it splits the String up into separate words and then calls the search method for each separate word. The search method takes four parameters: the first is the searchString, the second is the index of the Vector array that should be searched, the third is a Vector into which the search results should be put, and the fourth is a Vector into which all the searched FileDescriptions are put so that their beenUsed variable (see Table 4.3) can be set back to false.

When the search method is called it does a linear search of the appropriate Vector looking for matches; any search results are placed in the searchFiles Vector. As one supernode stores more than one Vector of FileDescriptions, it is possible that multiple search methods will be called on different Vectors; the same searchFiles Vector is passed into every call of the search method. Once every search method has completed, the searchFiles Vector will contain all the results from that supernode.

Other functions

The FileDatabase class also carries out a number of other functions, such as the removal of files and the retrieval of Ranges of files. Probably the most notable is the cutList() method. When a supernode creates a new tier, it cannot simply pass a portion of its FileDatabase on to the new supernode as it would if it were splitting on the same tier. Instead it must halve the database on the second letter rather than the first; it therefore uses cutList(), which returns a new FileDatabase containing all the FileDescriptions in the supernode but arranged in the database based on their second letter. This FileDatabase can then be split just as if the supernode were splitting on the same tier.

Another important feature of the FileDatabase is that it stores the Range which the supernode is looking after; this is kept in a static variable that can be accessed by any other class.

4.4 Important Algorithms

The following section outlines three important algorithms in the implementation: the splitting algorithm, the multi-source download algorithm and the aggregating search algorithm.

4.4.1 The Splitting Algorithm

Undoubtedly the most important algorithm in the entire system is the splitting algorithm; modifying this algorithm can improve or degrade performance dramatically, more so than any other element in the system. The splitting algorithm deals with removing files from a supernode's file database and allocating responsibility for them to another supernode. The process of splitting is carried out over two methods, split() and createNewSuperNode(Range r): the former calculates the new range for the child supernode to look after, and the latter deals with removing the appropriate file references and updating routing tables.

The system works on the premise that supernodes will look after blocks of contiguous letters; as long as this is adhered to, the splitting algorithm can be changed in any way without affecting the rest of the system. The splitting algorithm used is the simplest available: every time the supernode splits, its hash space is halved and the latter segment is allocated to a new supernode. When a supernode reaches a state in which it is only looking after one letter (e.g. A), it must split to create a new tier; this tier will look after the next letter in the word (e.g. Ae). When a supernode splits to a new tier exactly the same splitting algorithm is used; the hash space is simply halved again, but this time based on the second letter.

public Range split() {
    if (the supernode only looks after one letter) {
        // needs to create a new tier
        set this supernode's range to 0-13
        return a Range of 14-25 with a tier of this supernode's tier + 1
    } else {
        int first = the first range of this supernode
        int last = the last range of this supernode
        int myLast = ((last - first) / 2) + first
        set this supernode's range to first - myLast
        return a Range of myLast+1 - last
    }
}


The second part of the splitting process is to actually carry out the split based on the Range supplied by the split() method.

public boolean createNewSuperNode() {
    Range childRange = split();
    boolean sentOK = notify child supernode of its duties
    if (sentOK == true) {
        remove childRange from file database
        set child range in routing table to new supernode's addr
        return true;
    } else {
        return false;
    }
}
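The same two methods rendered as a hedged Java sketch, built on the Range sketch of Section 4.3.1; the class name and the stubbed notifyChildSuperNode helper are illustrative stand-ins for the real GazNet code.

    class SuperNodeSplitter {
        private Range myRange = new Range(0, 25, 0);   // the range this supernode looks after

        public Range split() {
            if (myRange.getFirst() == myRange.getLast()) {
                // only one letter left: create a new tier based on the next letter
                int childTier = myRange.getTier() + 1;
                myRange.setRange(0, 13);
                return new Range(14, 25, childTier);
            }
            int first = myRange.getFirst();
            int last = myRange.getLast();
            int myLast = ((last - first) / 2) + first;
            myRange.setRange(first, myLast);
            return new Range(myLast + 1, last, myRange.getTier());
        }

        public boolean createNewSuperNode() {
            Range childRange = split();
            boolean sentOK = notifyChildSuperNode(childRange);
            if (sentOK) {
                // hand the childRange portion of the file database to the new
                // supernode and point that range of the routing table at it
                return true;
            }
            return false;
        }

        // Stand-in for contacting the chosen node with a 190-NEW_SUPERNODE request
        private boolean notifyChildSuperNode(Range childRange) {
            return true;
        }
    }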

4.4.2 Multi-source Downloads

An important part of the overall project is the downloading process, as this constitutes the largest proportion of time spent on the network; it is therefore necessary to make the process as fast as possible. The method used by GazNet, and many other file-sharing applications, is multi-source downloading, i.e. downloading different parts of the file from multiple different nodes, thereby utilising as much bandwidth as possible.

Two classes are involved in the download process, the DownloadManager and the Download class. The DownloadManager implements the Singleton design pattern and manages all downloads.

When the user wishes to start a new download, the application calls the addDownload(FileDescription[] files, int selected) method; the array contains all the files that the user's search returned and the integer is the index in the array of the file they wish to download. The DownloadManager then searches the array to find matches with the selected file, decided on whether the filename and the file size match. A maximum of 5 matches will be returned from this process; the next step is to divide the file into portions that can be downloaded from separate sources.

Splitting the file into download portions

Multiple processes take place to carry out a multi-source download; these are implemented in four different methods, which also support the pausing and resuming of files.


The calculateChunkSize method generates the length of each segment depending on how much has previously been downloaded, therefore allowing downloads to be resumed.

long calculateChunkSize(FileDescription file, int numDownloadPoints, int num) {
    chunkSize = size of file / number of sources
    if (exists a data file containing part of the download) {
        remove this portion of the file from the chunk size
    }
    return chunkSize
}

The calculateStartPoint method finds the position at which the download should begin, calculated by adding together the start position of the segment and the amount of the file already downloaded.

long calculateStartPoint(int num, long chunkSize) {
    if (exists a data file containing part of the download) {
        return (how far the download has already progressed
                + the position of the start of the segment)
    }
}

The calculateFinalChunk method calculates the size of the final segment of the file by subtracting the starting point of the final download from the total size of the file.

long calculateFinalChunk(long startPoint, int num) {
    return total size of file - starting point of final download
}

The final method uses the previous three methods to initiate a new download:

for (number of download sources - 1) {   // loop through all but the last download
    calculate chunk size
    calculate start point
    begin new download
}
calculate final chunk size
begin new download of remainder of file
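A simplified sketch of the arithmetic these methods perform, ignoring the resume logic (previously downloaded portions); the start/length pairs produced here correspond to the first/len parameters of the Download constructor described below.

    class SegmentPlanner {
        // Returns {start, length} pairs, one per source; all but the last segment
        // are equal in size and the last takes the remainder of the file
        static long[][] planSegments(long fileSize, int numSources) {
            long[][] segments = new long[numSources][2];
            long chunkSize = fileSize / numSources;
            for (int i = 0; i < numSources - 1; i++) {
                segments[i][0] = i * chunkSize;      // start point of this segment
                segments[i][1] = chunkSize;          // length of this segment
            }
            long finalStart = (numSources - 1) * chunkSize;
            segments[numSources - 1][0] = finalStart;
            segments[numSources - 1][1] = fileSize - finalStart;
            return segments;
        }
    }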


When a point-to-point download is initiated it is handled by the Download class; for a multi-source download it is given five parameters: public Download(FileDescription f, long first, long len, ProtectedInteger fin, File dataFile). The FileDescription is the file to be downloaded and also contains the IP address of the download point; first is the position in the file at which to start downloading; len is the number of bytes to download from the starting point; fin is used to coordinate all the downloads so the DownloadManager can tell when all the segments have been downloaded; dataFile is the file into which the bytes should be put.

4.4.3 Aggregating Searches

Supernodes will frequently, especially on a small network, look after ranges of hash space rather than just one letter (hence the need for a Range data structure rather than merely one integer), so it is sometimes necessary to send multiple search words to one supernode. Without any extra processing a node would simply send multiple separate searches to the supernode; this would waste bandwidth, as every search query carries extra information with it (i.e. the overhead of the protocol messages).

The search aggregation algorithm is split into two sections: the first breaks the search string up into separate tokens and selects the correct supernode to send them to; the second aggregates the search words that are going to the same supernode into the same query.

String[] strings = new String[26];
while (still more search words) {
    int index = value of first letter in search word   // A=0, B=1 ... Z=25
    strings[index] = strings[index] + " " + next search word
}

This algorithm creates an array of size 26 with the appropriate search words in each element. For example, for the search "Beach Boys", strings[1] will contain "Beach Boys"; if the search were "The Beach Boys", strings[1] would contain "Beach Boys" and strings[19] would contain "The". This array is then passed on to the next algorithm, which decides which addresses to send the queries to.

Range range = new Range()   // range = 0 to 25
set last range to 0
for (int i = 0; i < 26; i++) {
    if ((i < 25) && (supernode i == supernode i+1)) {   // have the same IP address
        add one to the last range
    } else {
        String temp
        for (int b = range.firstRange; b <= range.lastRange; b++) {
            // goes through all matching supernodes
            concatenate strings[b] to temp
        }
        create new search to supernode i for String temp
        set range first range to i
        set range last range to i
    }
}

The algorithm goes through an array of the top tier supernodes and builds a Range object covering the first sequence of matching supernodes. The next part of the algorithm then goes through the string array supplied by the previous algorithm, concatenating all the strings in that range into one string; a new Search object is then created using that string. The Range's first and last values are set to i and the process carries on until every supernode has been checked.
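A hedged sketch of both steps in Java; it groups the per-letter buckets by supernode address with a map rather than the Range bookkeeping of the pseudocode, and the alphaLocations array (one address per letter) stands in for the routing table.

    import java.util.LinkedHashMap;
    import java.util.Map;

    class SearchAggregator {
        // Groups the words of a query by the supernode responsible for their
        // first letter; alphaLocations holds one supernode address per letter A-Z
        static Map<String, String> aggregate(String query, String[] alphaLocations) {
            String[] buckets = new String[26];
            for (String word : query.split(" ")) {
                if (word.length() == 0) continue;
                int index = Character.toUpperCase(word.charAt(0)) - 'A';
                if (index < 0 || index > 25) continue;        // non-alpha words skipped here
                buckets[index] = (buckets[index] == null) ? word : buckets[index] + " " + word;
            }

            // One aggregated query per distinct supernode address
            Map<String, String> queries = new LinkedHashMap<String, String>();
            for (int i = 0; i < 26; i++) {
                if (buckets[i] == null) continue;
                String address = alphaLocations[i];
                String existing = queries.get(address);
                queries.put(address, existing == null ? buckets[i] : existing + " " + buckets[i]);
            }
            return queries;   // supernode address -> aggregated search string
        }
    }

For "The Beach Boys", a network where one supernode serves A-M and another serves N-Z would therefore receive two queries: "Beach Boys" and "The".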

4.5 Summary

This chapter has aimed to give a more in-depth look at the underlying functioning of the system; however, this has taken away somewhat from how the system appears to the end user. The following chapter aims to give an indication of how the system is to be used.


5. The System in Operation

This chapter summarises the general actions a user will perform in an average session using GazNet; it aims to convey what it would be like to actually use the program.

5.1 A Typical Session

Obviously the first action to be performed is to load the program. In its current state the user must load the program from the command prompt; in the future, however, it will be accessible by simply double-clicking on a .jar file.

Diagram 5.1 – Loading the application

Once the program has loaded, the user is faced with the GazNet welcome page. There are two salient features to note here. The first is the Message of the Day window, which downloads an MOTD from a web server and is used to inform the user of problems with the system, updates, new features, etc. The second is the connection status sign at the bottom left of the window: when GazNet is initially loaded it will be attempting to connect to the network and the connection status will say "Connecting…"; once it has connected the status changes to "Connected to GazNet".

Diagram 5.2 – The Welcome Screen


Presuming this is the first time the user has used GazNet, he/she may wish to add some shares. The next step is therefore to go to the "My Files" tab, which presents a window showing the list of folders that are shared. The user clicks the "Add" button to add a share and is presented with a window allowing folders to be selected for sharing; after choosing the folder to share he/she must click Share Folder.

Diagram 5.3 – Adding a new shared folder

Users can similarly remove shares by clicking the Remove button; a confirmation box is displayed to verify that this is what the user wishes to do. The folder in which downloaded files are placed can also be modified using this window: the user must click Browse and then select the folder in a similar fashion to adding shares. It is not possible to type the path directly into the text field; this ensures that invalid path names cannot be entered, which would otherwise create exceptions during downloads.


The next step is to start searching the network for files. To do this the user moves to the "Search" tab; searching in GazNet is a very straightforward process in which the user simply types the query into the search text field and then clicks Search. The table above the field then displays a list of all the results.

If the user wishes to download one of these files, he/she can either double-click on the file or right-click and select download from the popup menu. If the user then moves to the "Download" tab, the file will be listed there with the percentage of the download that has completed and the current bit rate. Similarly, if anyone is uploading files from his/her computer, the same format is displayed in the "Upload" tab.

If the user wishes to cancel a download, he/she can right-click on the appropriate file to bring up a popup menu that allows the download to be cancelled; similarly, if the user interrupted the download before it completed, he/she can resume it by selecting resume. Once the file has finished downloading, the user can also open it from the application by selecting open from the popup menu, which opens the file in the default application. The user does not, however, have any such features in the upload panel; this is to stop him/her from simply 'leeching' off the network by cancelling all uploads and not sharing his/her files.


Diagram 5.4 – Search the network

Diagram 5.5 – The download panel

Diagram 5.6 – The upload panel


5.2 Summary

This chapter has outlined the processes carried out by a user during a typical session; it has not, however, given any indication of the underlying processes that these actions actually initiate. The next chapter outlines what actually happens when the system is running.


6. Process Description

Although the interactions with the user are relatively simple, the underlying processes being carried out are much more complex. This chapter gives an overview of what the application does to service the user's requests; some processes have been simplified to make them easier to understand.

6.1 The Connection Process

When the user first starts GazNet, the first process that must be carried out is to actually connect to the GazNet network. The primary method used here is to cycle through a cache of supernode addresses, attempting each one in turn until a connection is accepted or the node is informed of another supernode that will accept the connection.

Diagram 6.1 – The connection process

When the node first starts up, the P2Pgui (the main class) obtains the locations of the supernodes from the NetworkStartup class; this will have obtained a list of supernodes from the supernode cache and cycled through these addresses until one of them sends the node a list of the current supernode addresses. It will then obtain the list of its shared files from the IO class. The P2Pgui object then creates a Connection class that connects the node to the network.

6.2 The Searching Process

Probably the next thing a user will wish to do is search for a file. To do this the user types a query into the search panel, which in turn creates a Search object; this object gets the address of the appropriate supernode from the NetworkStartup object. The next task is to send the query to the supernode; when the supernode receives the query it runs a search method which calls another search method in the FileDatabase, returning a list of all the search matches. The supernode then sends the results back over the TCP connection to the searching node.


[Sequence diagrams: Diagram 6.1 shows P2Pgui calling getAlphaLocations() on NetworkStartup and getFileList() on the IO class before the Connection class connects to the network; Diagram 6.2 shows the SearchPanel creating a Search object that obtains the supernode address from NetworkStartup and sends the query to the remote Supernode, whose search() method calls searchKey() on its FileDatabase.]

Diagram 6.2 – The Search Process

6.3 The File Transfer Process

After the user has found a file they want, they will attempt to download it. The search panel adds the download to the DownloadPanel, which in turn adds it to the DownloadManager. If the download has multiple sources, the correct segment sizes are calculated and a Download object is created to service each source. The Download contacts the Upload object on the remote node and the file transfer begins. Every 64KB the Download object sends a message, dealt with by the UploadControl object, to indicate that it is ready for the next 64KB; the UploadControl object then informs the Upload object to send the next section. Once the download is complete (for all sources), the DownloadManager is informed and writes the file to disk; if there were multiple sources they are all combined. Finally the DownloadPanel is informed and indicates that the download is complete.

Diagram 6.3 – The Download Process

The DownloadManager creates multiple instances of the Download class, one for each source; Diagram 6.3 shows a single-source download. If it were multi-source, several Download objects would be created and each would perform the same task as above.


[Diagram 6.3 sequence: the DownloadPanel calls addDownload() on the DownloadManager, which initiates a Download; the Download contacts the remote Upload/UploadControl objects, the file is transferred in 64KB sections with an acknowledgement after each, and writeFile() finally writes the completed file to disk.]

6.4 The Splitting Process

This process is one of the most important in the application, as it is responsible for creating the indexed structure of the network. When the supernode realises that it needs to split, it initiates the createNewSuperNode() method, which in turn calls the split() method; this returns the Range of the new supernode and also modifies the supernode's own Range. The createNewSuperNode() method then calls the notifyChildSuperNode() method, which retrieves the file references to be sent to the new supernode and contacts the node that is to be converted into a supernode. The node is then converted and sent the appropriate file references.

[Diagram 6.4 sequence: createNewSuperNode() calls split(), which sets the supernode's new Range and returns the child's Range; notifyChildSuperNode() then gets the files for the new supernode from the FileDatabase and calls setUpAsSuperNode() on the remote Node.]

Diagram 6.4 – The Splitting Process

6.5 Summary

The four processes listed are only a tiny subset of all the processes and method calls that occur during the lifespan of the application, but it can still be seen that, even with simplification, the system carries out some complex tasks. Because of this, detailed testing is needed to ensure the effectiveness of the system. The following chapter describes some of the testing mechanisms used and gives an overview of the results obtained.


7. Testing

In a distributed system, especially one which intends to have no central point of control, a rigorous testing regime is required. Even after heavy testing it is unlikely that the system will be fully debugged, so on top of the testing there must be a great deal of elegant error handling to ensure a reliable system.

7.1 The Test Bed

There are always problems in testing a system that is intended to run in a distributed manner over many computers, so an appropriate test bed must be devised to allow the well-structured testing that such a system needs. Unfortunately it is impossible for a single computer to execute multiple copies of the GazNet application simultaneously, for two reasons. The first is that all instances of GazNet run on the same port, making it impractical to run more than one on a single computer. The second is that not all supernodes maintain connections with other nodes, so when communications need to be made (e.g. for a search) a new TCP connection has to be created; for this to happen each node needs a separate IP address, as there is no allowance in the protocol for multiple instances of GazNet to share the same IP address. These two problems mean that to test GazNet a series of computers must run the application; this is perfectly viable for testing the network on a single tier but impractical when the network becomes large enough to create new tiers. Therefore, to test the network on a larger scale, a simulator was developed to verify the performance of the network when in excess of 100 nodes are connected.

7.2 How the System was Tested

The system was developed in very much a component-based manner, which led to the use of incremental testing, with each function of the system being perfected before the next stage was begun. The stages can be roughly compared to the packages; a good example of this is the download package, which is almost fully self-contained.

Early design stages were tested on a small network of two computers, one acting as the supernode and the other as a connecting node. This was sufficient for the early stages of developing the protocol, as processes like the connection process only involve two nodes. The next stage required the network to be extended to three computers as more advanced processes needed to be tested, such as multi-source downloads and supernode splitting. Once the majority of the system was built, however, the network needed to be expanded dramatically to accommodate the creation of several supernodes running a number of nodes.

Generally, testing the network took the form of:
- creating the next stage of functionality
- creating a network condition to test it
- if it worked, creating other network conditions to test it
- else fixing the problem and trying to test it again.


By carrying out this process, the system could develop on a solid platform of previous work; if, for example, the connection process ceased to work, this could be attributed to the latest component added to the system, as it had already been ascertained that the connection process functioned correctly.

7.2.1 Testing Utilities

To aid the testing of the project multiple testing utilities were used. One of the most important was a packet sniffer, the one most frequently used being AnalogX. This application allowed monitoring of every single communication between the different nodes; as the protocol is text based, this made the process even easier.

The second utility used was a self-written C application similar to telnet. This allowed commands to be sent manually to a supernode or node, forcing them to carry out certain actions, which was especially helpful when trying to make a supernode split. It was also helpful for testing multi-source downloads, as figures could be entered manually instead of being calculated by the DownloadManager.

A simple utility that resides in the GazNet.util package is the Reporter; this contains methods that print messages to the screen. Statements are split into writes and errors, accompanied by methods that allow either type of statement to stop being displayed on the screen. This allows not only debugging statements to be switched off, but also errors to be easily written to disk or to a remote location, making the viewing of errors much easier. The alternative is using System.out.println() statements, which suffer from the problem that after the coding has finished and been tested every println statement has to be removed.
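A sketch of what such a utility might look like; the real Reporter lives in the GazNet.util package and its exact method names may differ.

    class Reporter {
        private static boolean showWrites = true;
        private static boolean showErrors = true;

        static void setShowWrites(boolean on) { showWrites = on; }
        static void setShowErrors(boolean on) { showErrors = on; }

        // Ordinary debugging statement; can be silenced in one place rather
        // than hunting down every System.out.println() after testing
        static void write(String message) {
            if (showWrites) {
                System.out.println(message);
            }
        }

        // Error statements are kept separate so they could instead be
        // redirected to disk or a remote location
        static void error(String message) {
            if (showErrors) {
                System.err.println("ERROR: " + message);
            }
        }
    }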

7.2.2 The Simulators

Two simulators were used to test the design better. The first simply scanned a list of files, created a particular network tier and displayed statistics about the load that those files put on the network. The second was more complex and actually built the network: the simulator can be given a number of nodes that will be on the network, then it begins to add the nodes, building the network node by node. Originally the network starts off with 26 supernodes, but as it gets larger more tiers are created. As this process goes on, output is streamed to a file giving details of which files are added to supernodes and how supernodes are split. Using the simulator revealed a massive flaw in the then-current implementation of the splitting algorithm, which would not have shown itself until the network had reached a much larger size. The problem was that the latter part of the tier (i.e. supernodes like S and X) had problems splitting, because the original algorithm only gave consideration to splitting the early part of the alphabet; this led to ranges like X-R, which would obviously lead to the collapse of the network. The simulators proved invaluable for the testing and evaluation process and ultimately resulted in a more efficient implementation; it would perhaps have been better, however, to have created the simulators before the implementation of the actual program.


7.3 System Test Results

Once the system was finished, a set of tests was devised to validate the standard functionality of the system; these tests aim to ensure the functionality of both the networking and the file transfer aspects of the system. Tables 7.1 – 7.3 outline the tests carried out to verify the functioning of the system; these tables do not contain the complete test results and display no incremental testing.

Number of Supernodes = 1Operation Expected Outcome Actual Outcome CommentsConnecting to a supernode

Connection process carried out and accepted

As expected

Joining a supernode Join process carried out and accepted

As expected Although a Join will work it won’t be required in a single supernode network as every node will Connect to the supernode

Searching the network for an existing file

A Group will be returned containing appropriate results.

As expected

Searching the network for a non-existent file

An empty Group object should be returned, ‘No Results’ should be displayed on the search table

As expected

Splitting the supernode The supernode’s range and file database should be halved. A new supernode should be created on an existing node with a FileDatabase containing the old supernodes files

As expected If no nodes are connected to the supernode, nothing will happen and all future Connect/Join request will be refused.

Removing a node from the network

The supernode should realise the nodes disconnection and remove all file references belonging to that node.

As expected

Number of Supernodes = 2Connecting to a supernode

Connection process carried out and accepted

As expected but due to the supernode cache usually a node will attempt to connect to the original supernode which will then re-direct it to the new supernode. Sometimes when the supernode is re-directed, the original supernode is not Joined as it should be, this leaves files absent from the network.

This is a problem inherent to the system as there is no way of the new node finding out the new supernode without the original supernode’s help. The supernode is Joined incorrectly because the node decides which supernode to connect to before it actually Joins any other supernodes, therefore when the

____________________________________________________________________Page 70 of 124

connect is refused it doesn’t Join, instead it simply leaves it.

Joining a supernode Join process carried out and accepted

As expected but due to the supernode cache usually a node will attempt to connect to the original supernode which will then re-direct it to the new supernode.

Searching the network for an existing file residing on the new supernode

Search attempt will be sent to original node and then re-directed to the new supernode

As expected

Splitting the original supernode again

Splitting process again, this time other supernode should be informed of the new supernode

As expected Every supernode on the tier will be informed of the new addresses when a supernode splits

Splitting of the new supernode

Splitting process again, this time other supernode should be informed of the new supernode

As expected Works in the same fashion as original supernode.

Taking new supernode offline

The original supernode should notice the absence then create a new supernode to take its place. All connected nodes should start to look for a new supernode

As expected but nodes have no knowledge of the loss of their supernode.

Although a new supernode is created it has no knowledge of the old supernode’s files or connected nodes. The nodes don’t know their supernode is offline because only the server side of a TCP connection will notice the loss of the connection

Taking original supernode offline

The original supernode should notice the absence then create a new supernode to take its place. All connected nodes should start to look for a new supernode

The new supernode doesn’t realise the old supernode has gone down.

Only the original server will notice if the TCP connection fails, the newly created supernode will not know until it tries to send data over the connection.

Operation: Removing a node from the network
Expected Outcome: The supernode that the node is connected to should realise the loss of the node and then remove the appropriate file references. All other supernodes won't remove the file references from their databases.
Actual Outcome: As expected.
Comments: This happens because the supernode that the node is connected to maintains a TCP connection and can therefore tell when the node leaves the network. However, the other supernodes have no knowledge of the node's status, so they don't remove the files.

Number of Supernodes = 4

Operation: Connecting to a supernode
Expected Outcome: Connection process carried out and accepted.
Actual Outcome: As expected, but due to the supernode cache a node will usually attempt to connect to the original supernode, which will then re-direct it to the new supernode. Sometimes when the node is re-directed, the original supernode is not Joined as it should be, which leaves files absent from the network.
Comments: This is a problem inherent to the system, as there is no way for the new node to find out about the new supernode without the original supernode's help. If the system was running on a larger scale with more supernodes in the cache this problem would not occur. The supernode is not Joined correctly because the node decides which supernodes to connect to before it actually Joins any of them; therefore, when the connect is refused it doesn't Join the refusing supernode, it simply leaves it. This leads to high numbers of files not being added to the network.

Operation: Joining a supernode
Expected Outcome: Join process carried out and accepted.
Actual Outcome: As expected, but due to the supernode cache a node will usually attempt to connect to the original supernode, which will then re-direct it to the new supernode.

Operation: Searching the network for an existing file residing on the new supernode
Expected Outcome: The search will be sent to the original supernode and then re-directed to the new supernode.
Actual Outcome: As expected.

Operation: Splitting one of the supernodes
Expected Outcome: The split process should complete and all other supernodes should be informed of the new addresses.
Actual Outcome: As expected.

Operation: Taking new supernode offline
Expected Outcome: The original supernode should notice the absence and then create a new supernode to take its place. All connected nodes should start to look for a new supernode.
Actual Outcome: As expected, but nodes have no knowledge of the loss of their supernode.
Comments: Although a new supernode is created, it has no knowledge of the old supernode's files or connected nodes. The nodes don't know their supernode is offline because only the server side of a TCP connection will notice the loss of the connection.

Operation: Taking original supernode offline
Expected Outcome: The original supernode should notice the absence and then create a new supernode to take its place. All connected nodes should start to look for a new supernode.
Actual Outcome: The new supernode doesn't realise the old supernode has gone down.
Comments: Only the original server will notice if the TCP connection fails; the newly created supernode will not know until it tries to send data over the connection.

Table 7.1 – Testing of the Network

Testing File Transfers

Operation: Performing single source download
Expected Outcome: Download completes with the correct status being shown in the GUI.
Actual Outcome: As expected.

Operation: Performing multi-source download
Expected Outcome: Download completes with the correct status being shown in the GUI.
Actual Outcome: As expected.

Operation: Performing large download (50MB+)
Expected Outcome: Download completes with an appropriate bit rate.
Actual Outcome: Frequently the download would freeze after a certain percentage (it varied) and the connection would then time out.
Comments: A new class, UploadControl, was created to send a confirmation back to the uploading node every 64KB; this kept the download going.

Operation: Performing single source download, then taking the source off-line
Expected Outcome: Download should stop.
Actual Outcome: Exception thrown and the download stops.
Comments: Although the expected outcome occurs, it would be better to automatically search for a replacement source. The program doesn't allow another download of the same name to be started until the previous one has been cleared.

Operation: Performing multi-source download, then taking one of the sources off-line
Expected Outcome: That source's download should stop.
Actual Outcome: As expected; all other source downloads continue.
Comments: No action is taken to find a new source, therefore leaving a segment empty.

Operation: Resuming a single source download
Expected Outcome: The download should complete OK.
Actual Outcome: As expected.
Comments: This works, but the incomplete file must be searched for and then downloaded manually; there is no indication of an incomplete download in the GUI.

Operation: Resuming a multi-source download with fewer sources than originally
Expected Outcome: The download should complete OK; the first source should take on the load of the missing sources.
Actual Outcome: As expected.
Comments: This works, but the incomplete file must be searched for and then downloaded manually; there is no indication of an incomplete download in the GUI.

Operation: Resuming a multi-source download with more sources than originally
Expected Outcome: The download should ignore the extra sources and just use the original number of sources.
Actual Outcome: As expected.
Comments: This performs as expected, but it would be better to exploit the extra sources rather than ignore them.

Operation: Attempting a download from a node that doesn't possess the file
Expected Outcome: A protocol error message should be returned.
Actual Outcome: As expected.

Operation: Attempting a multi-source download with a source that doesn't possess the file
Expected Outcome: A protocol error should be returned.
Actual Outcome: As expected, but no new download is initiated to replace the source, so one segment is left empty.
Comments: This must be rectified to make the download process practical.

Operation: The downloading node goes offline before the download is complete
Expected Outcome: The uploading node should throw an exception and the upload should cease.
Actual Outcome: As expected.
Comments: No change is made to the GUI, so the user doesn't know the upload has stopped.

Table 7.2 – Testing of the Download Features

Testing the GUI

Operation: Trying a valid search
Expected Outcome: If there are search results they should be displayed in the table. If there are no search results, a message in the table should say 'No search results'.
Actual Outcome: As expected.

Operation: Trying an invalid search, e.g. %
Expected Outcome: A message should inform the user that the search is invalid.
Actual Outcome: A message comes up in the table saying "None Alpha Query".
Comments: It would perhaps be better to show a more helpful message.

Operation: Trying to download these error messages ("No search results" or "None Alpha Query")
Expected Outcome: Nothing should happen.
Actual Outcome: A download is attempted but no download is added to the download panel.
Comments: It would be better to remove the popup menu option when an error message is showing.

Operation: Cancelling a download
Expected Outcome: The download should be stopped and the progress field replaced with 'Cancelled'.
Actual Outcome: As expected.

Operation: Resuming a cancelled download
Expected Outcome: The download should resume and the progress field should start counting again.
Actual Outcome: A new download is created in the table with its status as 'Connecting…' and the original instance of the download begins to count again. Sometimes the progress counter will exceed 100% and the file will not be written to disk.
Comments: The extra table entry shouldn't be added, and the cases where the file isn't written to disk must be rectified.

Operation: Resuming a cancelled download without any sources to download from
Expected Outcome: No change in the bit rate field.
Actual Outcome: Another entry is added to the download table with its status as "Connecting" and the original download entry sets its progress to 0%, but nothing actually happens.
Comments: A more informative entry should be added, such as 'Can't find new source'.

Operation: Adding a share
Expected Outcome: The share should be added to the tree and to the share file.
Actual Outcome: As expected.
Comments: The modifications will only be added to the actual supernode when the program is restarted.

Operation: Removing a share
Expected Outcome: The share will be removed from the tree and the share file.
Actual Outcome: As expected.
Comments: The changes will only take effect after the program has been restarted.

Operation: Clicking the update button
Expected Outcome: Does nothing.
Actual Outcome: Does nothing.
Comments: Needs to be given the functionality to actually update the supernodes.

Operation: Changing the download location
Expected Outcome: Switches the folder in which downloads are placed.
Actual Outcome: As expected.
Comments: Although the location changes, the new location is not saved to file. When the program is restarted it reverts to the default "C:\Downloaded\".

Operation: Connected message on the Welcome screen when the node connects to the network
Expected Outcome: The message changes from "Connecting…" to "Connected to GazNet".
Actual Outcome: As expected.
Comments: Although the message changes, it doesn't change back to "Connecting…" if the connection is lost.

Operation: Scrolling down the search table
Expected Outcome: Shows more results.
Actual Outcome: As expected.
Comments: The scroll bar is always present, as the table always contains 50 elements.

Operation: Changing the results when the table is scrolled down to the bottom
Expected Outcome: The table should scroll back to the top.
Actual Outcome: The scroll bar doesn't change position.
Comments: The user must scroll the bar back up to the top to see the new results.

Table 7.3 – Testing of the GUI

7.4 Summary of Errors

The application, although working in a number of areas, still suffers from a number of different bugs and errors in all parts of the implementation.

7.4.1 Networking

The networking side of the implementation suffers from a series of limitations and errors:

1) The network is limited to one tier
2) File lists have a tendency to become un-synchronised and out of date
3) Nodes are unable to realise when their supernode has gone offline
4) There is little support to replace supernodes that have gone offline

The first problem is that the network has no support for growing past the first tier of supernodes, which limits the size the network can become. Several limitations stop the network expanding, all of which stem from a lack of implementation: only the routing for the first tier is implemented, and neither searches nor file lists pass through the lower tiers. However, the main building blocks for multi-tier splitting have been implemented; for example, the splitting algorithm has the ability to create a new tier, and the NetworkStartup class has a lower tier array that can contain the IP addresses of the tier below.

The second problem is also prevalent, for two reasons. File lists become out of date because, with the exception of files belonging to a node that is directly connected to the supernode holding their references, there is no implementation for removing files from the network. This problem is slight and could easily be addressed; the larger of the two problems occurs when the number of supernodes starts to increase. When a supernode refuses a connection, the node will connect to the newly created supernode; however, because the algorithm decides where to put each file reference before the process starts, it won't dynamically react to the changing supernode structure and will therefore not try to re-Join the supernode that has just refused it. As this process repeats itself, the network becomes gradually more out of date.

One of the problems with a TCP connection is that, although a server can realise when a client has gone offline, the client can't notice when the server has gone offline; this leads to problem 3. A node will not change its state at all when its supernode goes offline; instead it will still consider itself connected to GazNet. It will only realise it has been disconnected when some sort of contact is attempted with its parent supernode, which will result in an exception, and the application will then need to be restarted to reconnect to the network. There won't be any problems if the node attempts to contact any other supernode, as this action creates a new TCP connection rather than using an already failed one.
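One possible remedy, not implemented in GazNet, is sketched below in Java: the node periodically attempts a small write on its connection to the supernode so that the failure surfaces as an exception instead of going unnoticed. The PING message and class names are purely illustrative.

import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

public class SupernodeHeartbeat implements Runnable {
    private final Socket supernodeConnection;

    public SupernodeHeartbeat(Socket supernodeConnection) {
        this.supernodeConnection = supernodeConnection;
    }

    @Override
    public void run() {
        try {
            OutputStream out = supernodeConnection.getOutputStream();
            while (true) {
                out.write("PING\n".getBytes()); // a write on a dead connection eventually fails
                out.flush();
                Thread.sleep(30_000);           // probe every 30 seconds
            }
        } catch (IOException | InterruptedException e) {
            // The supernode is unreachable: the reconnection procedure would be triggered here.
        }
    }
}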

The fourth problem is the most serious, as every other problem could be considered, to some extent, 'trivial': a single-tier network still allows quite a lot of users, and it is not disastrous if a user has to restart the program to reconnect to the network, but reliability issues have serious repercussions for the network. In the design, if a supernode goes offline a brother supernode should immediately take over its duties; due to time constraints this wasn't implemented, so when a supernode fails there is no way of recovering the lost data. The network does, however, have a certain degree of reliability: if a supernode fails, its parent will detect this (through a break in the TCP connection) and will then allocate the failed supernode's range to a new supernode. This ensures the structure of the network is maintained, but it does not ensure the integrity of the file lists, as there is no way of recovering the lost data from the failed supernode.

7.4.2 File Transfers

This area of the implementation was possibly the most error ridden; thorough use of incremental testing helped fix many bugs, but extended black box testing revealed a number of previously unthought-of problems:

1) A lost download source will not be replaced by either a new or an existing source
2) Resuming downloads in the same session in which the download was stopped sometimes leads to the progress counter being corrupted and the file not being correctly written to disk

The first error occurs when a download source (in either a single source or multi-source download) fails; when this happens the download simply stops and no attempt is made to locate a replacement source. Instead, the user must manually restart the download, choosing a different source from the search results.

The second error occurs when a download fails for some reason (node failure or cancelling). When the download is resumed, the node will search for instances of the same file and attempt to restart the download if a copy is available; however, sometimes the download won't complete and the progress indicator will exceed 100%.

7.4.3 The User Interface

The GUI could be considered, to a great extent, a success; most features work and the errors that exist are superficial. The main problems are as follows:

1) Resumed downloads add a new instance of the download to the download table
2) The scroll bar on the search table does not automatically return to the top of the table when a new search is carried out

The first problem is not a serious matter as it doesn't affect the actual download process, but it does create confusion for the user: the status of the new entry stays at "Connecting" even though the original entry's status continues to increase as normal.

Error 2 is a more serious problem as it can cause genuine difficulties for the user. If a first search yields 50 results and the user scrolls down to the bottom, then enters a new search that only yields 5 results, the table will appear blank: the user is still viewing the bottom 19 rows, and with only 5 results those rows are empty.

7.5 Incremental Testing

As has been mentioned earlier, incremental testing constituted a major part of the overall testing of the system. The system was developed very much in stages, with different processes being completed before the next one was begun; an example of this is the different stages of the protocol. The first area implemented was the discovery process, in which a node finds a supernode to connect to; the next step was the actual connection process, and later things like supernode splitting and supernode recovery were added. As each of these processes was created it received heavy incremental testing to ensure functionality, after which it also received a degree of unit testing.

To show how the incremental testing was carried out, the testing of the download procedure will be outlined. Each stage of the downloading process was tested separately; the stages are as follows:

1) File transfer protocol messages


2) The actual file transfer
3) Multi-source file transfer

a. Calculation of segment size
b. Modification of the protocol to accommodate multi-source downloads
c. Combining each source's data together to form the complete file

Stage 1 was relatively error free; the experience gained from the creation of previous protocol interactions meant that little trouble was involved in writing the code. Stage 2, although it required only the simple process of transferring a byte stream over a TCP connection, did have a major problem. Frequently during large file transfers the connection would freeze; after strenuous testing it was found that this generally only occurred on files larger than several megabytes. To rectify this, confirmation messages were sent back to the uploading node after every 64KB of transfer, instructing it to send the next 64KB of data.
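A minimal Java sketch of this acknowledgement scheme is given below; it is an illustration only, and the class and method names are not those of the actual implementation.

import java.io.*;
import java.net.Socket;

public class ChunkedUpload {
    private static final int CHUNK = 64 * 1024; // 64KB per chunk

    public static void upload(File file, Socket socket) throws IOException {
        try (InputStream in = new FileInputStream(file);
             OutputStream out = socket.getOutputStream();
             DataInputStream acks = new DataInputStream(socket.getInputStream())) {
            byte[] buffer = new byte[CHUNK];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read); // send the next 64KB of data
                out.flush();
                acks.readByte();            // block until the downloader confirms receipt
            }
        }
    }
}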

Stage 3 is split up into three different sections. Stage 3a required the most testing, as incorrect calculations would frequently corrupt files. Two file types were used to test 3a. The first was text files: by transferring a text file containing a known string from multiple sources, it could easily be ascertained whether the transfer was successful, as the string should remain the same; if it was different, it was easy to find the problem area by looking at the particular part of the string that was corrupted. Usually the string used was a sentence of about 20 characters, although this was frequently modified to test things like an odd number of bytes. The second type of file used was an installation file (namely the J2SE installation file); this file would throw an error if it had been corrupted, which allowed downloads to be checked quickly and easily and also tested how effective the measures taken to fix problems with large downloads were (as the file was over 37MB). The modification of the protocol (3b) was simple, as it only required the ability to indicate which part of a file to download.
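The segment-size calculation tested in stage 3a could look something like the following sketch, which assumes the final segment simply absorbs any remainder left by integer division (the role played by calculateFinalChunk in Table 7.4).

public class SegmentCalculator {
    /** Returns the number of bytes each of n sources should supply. */
    public static long[] segmentSizes(long fileSize, int sources) {
        long base = fileSize / sources;           // integer division: base size per source
        long[] sizes = new long[sources];
        java.util.Arrays.fill(sizes, base);
        sizes[sources - 1] += fileSize % sources; // give the leftover bytes to the last segment
        return sizes;
    }

    public static void main(String[] args) {
        // e.g. a 1,000,001-byte file split over 4 sources -> 250000, 250000, 250000, 250001
        System.out.println(java.util.Arrays.toString(segmentSizes(1_000_001L, 4)));
    }
}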

Section 3c created a number of problems. Originally all downloads were stored in an array in memory, so completed downloads could easily be written to disk; this masked problems with 3a, because if a transfer downloaded the wrong bytes they would simply overwrite the corresponding array elements without producing an error. However, large files would not fit in memory, so they needed to be written to disk during the download process. The method used was to create a separate data file for each download source and then combine them at the end of the download. This meant that the errors previously tolerated in 3a had to be corrected, because if the sources' downloads overlapped, the overlapping bytes would now be appended together rather than overwritten, leading to corruption of the file.

Table 7.4 outlines the major problems that were rectified during the incremental testing of the download procedure.


1. Problem: java.lang.OutOfMemoryError thrown during downloads.
   Rectification: Files were being downloaded in full into an array before being written to file, so with large files there wasn't enough memory available to the virtual machine. The problem was rectified by buffering 64KB of received data and then writing it to a data file.

2. Problem: Large file transfers often froze.
   Rectification: A new class called UploadControl was added, which receives acknowledgments from the Download class after every 64KB of file transfer and then forces the Upload class to transfer the following 64KB of data.

3. Problem: During multi-source downloads the file sometimes got corrupted.
   Rectification: This error was only found after transmitting installation files, which would show a corruption in their data when the installation was attempted. The reason was that, on files with an uneven number of bytes, the segmenting of the file for multiple sources sometimes left the final bytes out. Creating a new method called calculateFinalChunk solved this problem.

4. Problem: Not all sources were being uploaded from, leaving the download incomplete.
   Rectification: There were errors in the for loop that cycled through all the download points.

5. Problem: When resuming file downloads with a smaller number of sources than originally, the download isn't completed.
   Rectification: Each segment's data file was trying to be completed using a different source. This was rectified by allocating any data files that don't have a new source to the first source.

6. Problem: The percentage transferred was often inaccurate, sometimes finishing on more or less than 100%.
   Rectification: Algorithm errors, resulting from the use of integers instead of real numbers.

7. Problem: Downloads created an IO exception.
   Rectification: The download folder didn't already exist. This was rectified by creating the folder if there was an error.

Table 7.4 – Main errors encountered during the development of the download package

7.6 Summary

The testing process was an important stage of the project. It was very much an iterative process, with much re-visiting of problems in an attempt to improve the overall functionality. Several techniques were used, including black-box, white-box, incremental and unit testing; with each technique new errors were discovered and rectified. However, the results mean little without an evaluation of them to ascertain how well the system actually works and what would need to be done to make it work better. The next chapter analyses the results in the hope of finding the major higher level flaws of the system and how these flaws could be rectified.


8. Evaluation

8.1 Evaluation of Architecture

At first glance the indexed architecture seems to be a quick and efficient topology for a P2P network; it appears to draw on strengths from both DHTs and supernode networks. However, if one takes a more in-depth look at the structure, several limitations come to light.

8.1.1 Scalability

Even distribution

One of the main problems with GazNet is the simplistic way in which it splits supernodes. As described earlier, supernodes split themselves by dividing their hash space (i.e. the letters that they look after) in two and delegating responsibility for the latter half to a new supernode; so a supernode looking after the letters A, B, C and D will delegate responsibility over C and D to another supernode when it splits. At first this seems a nice, clean way of separating hash space, but when one looks more closely at the repercussions it becomes clear that it will create massive inefficiencies in the network.
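The split can be pictured with the following minimal Java sketch; the types are illustrative and are not the actual GazNet classes.

import java.util.ArrayList;
import java.util.List;

public class RangeSplit {
    /** Splits a supernode's letter range in two, e.g. [A,B,C,D] -> keep [A,B], delegate [C,D]. */
    public static List<Character> split(List<Character> range) {
        int half = range.size() / 2;
        List<Character> delegated = new ArrayList<>(range.subList(half, range.size()));
        range.subList(half, range.size()).clear(); // the old supernode keeps the first half
        return delegated;                          // the new supernode receives the second half
    }
}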

The table below is based on an average shared music folder. It was generated by going through every file, listing the number of occurrences of each letter in the filenames and calculating the percentage of the load each supernode would receive from that user joining the network. A music collection is a good choice because it does not contain a large concentration of one particular artist or type of music, limiting the skew that would be introduced by a collection heavily featuring one artist such as Elvis Presley. In an ideal network each supernode would receive 3.85% of the file list. 'Percentage' refers to the percentage of the load placed on each supernode, 'Deviation' refers to how far this percentage deviates from the recommended load, and 'Percentage Deviation' expresses that deviation as a percentage of the recommended load.

Letter   Percentage   Deviation   Percentage Deviation
A        8.20%         4.35        213.19%
B        7.49%         3.64        194.65%
C        6.06%         2.21        157.58%
D        4.81%         0.97        125.13%
E        2.67%        -1.17        -30.64%
F        3.92%         0.08        101.96%
G        2.85%        -0.99        -25.97%
H        5.70%         1.86        148.31%
I        3.74%        -0.11         -2.85%
J        3.57%        -0.28         -7.27%
K        1.43%        -2.42        -62.85%
L        4.81%         0.97        125.13%
M        6.95%         3.11        180.75%
N        1.60%        -2.24        -58.44%
O        5.88%         2.04        152.94%
P        2.14%        -1.71        -44.41%
Q        0.53%        -3.31        -86.23%
R        4.63%         0.79        120.50%
S        5.70%         1.86        148.31%
T        8.91%         5.07        231.73%
U        0.18%        -3.67        -95.32%
V        1.07%        -2.78        -72.20%
W        5.00%         1.32        134.00%
X        0.00%        -3.85       -200.00%
Y        1.78%        -2.06        -53.76%
Z        0.18%        -3.67        -95.32%

Table 8.1 – First Tier Distribution of load over GazNet.
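A distribution such as Table 8.1 could be reproduced along the lines of the sketch below, assuming each word in a shared filename is counted against the first-tier supernode responsible for its first letter; the exact counting method used for the table is described only loosely above, so this is an approximation rather than the actual procedure.

import java.io.File;

public class LoadDistribution {
    public static void main(String[] args) {
        int[] counts = new int[26];
        int total = 0;
        File[] files = new File(args[0]).listFiles(); // the shared music folder
        if (files == null) return;
        for (File f : files) {
            for (String word : f.getName().split("[^A-Za-z]+")) {
                if (word.isEmpty()) continue;
                counts[Character.toUpperCase(word.charAt(0)) - 'A']++;
                total++;
            }
        }
        double ideal = 100.0 / 26; // ~3.85% per supernode in a perfectly balanced first tier
        for (int i = 0; i < 26; i++) {
            double pct = 100.0 * counts[i] / total;
            System.out.printf("%c  %.2f%%  deviation %.2f%n", (char) ('A' + i), pct, pct - ideal);
        }
    }
}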


Diagram 8.1 – First Tier Percentage Load on Supernodes

It can be seen that the distribution over the network is far from perfect: in the worst case a supernode is overloaded by 231.73% ('T'). This can be contrasted with the massive under-loading of supernodes such as 'X', which is under-loaded by 200%. The closest a supernode gets to having a perfect load is 'I', which is 2.85% under-loaded. There are 13 supernodes that are overloaded, 'F' being the least overloaded at 101.96% and 'T' being the most overloaded at 231.73%. This means that 50% of the supernodes handle 78.06% of the network load, placing a massive burden on supernodes such as A, B and T.

Looking at this problem from the other angle, 50% of the supernodes handle only 21.94% of the network load, which obviously means a massive waste of resources, none more so than for the 'X' supernode, which has no load at all. Using the data gathered it would be possible to derive a far more efficient load distribution algorithm that would allocate several supernodes to hash spaces such as A and merge hash spaces such as X and Z into one supernode.
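As an illustration of such an algorithm (and only an illustration; nothing like this exists in GazNet), letters could be assigned greedily to whichever supernode currently carries the least load, so heavy letters such as A, B and T end up alone while light letters such as Q, U, X and Z share a supernode.

import java.util.*;

public class BalancedAllocation {
    public static List<List<Character>> allocate(Map<Character, Double> loadPerLetter, int supernodes) {
        // Order letters from heaviest to lightest load
        List<Character> letters = new ArrayList<>(loadPerLetter.keySet());
        letters.sort((a, b) -> Double.compare(loadPerLetter.get(b), loadPerLetter.get(a)));

        List<List<Character>> groups = new ArrayList<>();
        double[] groupLoad = new double[supernodes];
        for (int i = 0; i < supernodes; i++) groups.add(new ArrayList<>());

        for (char letter : letters) {
            int lightest = 0;
            for (int i = 1; i < supernodes; i++)
                if (groupLoad[i] < groupLoad[lightest]) lightest = i;
            groups.get(lightest).add(letter);              // place the letter on the least-loaded supernode
            groupLoad[lightest] += loadPerLetter.get(letter);
        }
        return groups;
    }
}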

[Diagram 8.2 is a bar chart plotting, for each first-tier supernode A to Z, its percentage deviation from the ideal load; the vertical axis runs from -250% to 300%.]

Diagram 8.2 – Deviation from Ideal Supernode Load

The previous paragraphs have delved into some of the problems the splitting algorithm suffers from, but no comment has yet been made about its behaviour across all tiers of the network. When limited to one layer the splitting algorithm is inefficient but workable; take the network onto a second tier, maybe even a third, and the system becomes inefficient on an exponential scale. A good example can be derived from a second tier 'E' supernode: using the same data as before, the second tier load distribution was calculated.

As can be seen in Diagram 8.3, the second tier supernodes have an even more uneven load distribution, with the 'El' supernode receiving a massive 33.33% of the load. Unfortunately this is not a one-off; a similar pattern can be seen throughout the second tier: 'Go' supernodes have 66.67% of the load, 'Oo' supernodes have 44.44% of the load and 'In' supernodes have 54.55% of the load. Many other supernodes are weighed down with between 10% and 30% of the load, which may seem reasonable in comparison to 54.55%, but when compared to the 3.85% recommended load it equates to up to an 840% overload (54.55% equates to a 1,418.18% overload).


Diagram 8.3 – Second Tier Percentage Load on an ‘E’ supernode

Conversely, many supernodes have a load of 0%; in fact, out of the 676 supernodes on a fully used second tier, 517 contain no information at all. That means 76.47% of the second tier carries none of the load, compared to the 3.84% of supernodes in the first tier that aren't used. It is therefore hard to decide which is the bigger problem with the GazNet design: the massive strain placed on popular supernodes, or the massive waste of the resources provided by supernodes that deal with combinations such as 'YT'.

Supernode Creation and Tier Management

The concept behind the GazNet architecture was that it would respond to growth in the network with an increased level of indexing, so that however large the network became, files would always be easy to locate. Despite the aforementioned problems GazNet will still do this; unfortunately, the way the GazNet tree is distributed means that heavily loaded supernodes will have to create extra tiers in the network, which carries a search time performance hit and therefore seriously compromises one of the major project requirements – to reduce search time.


When supernodes become heavily loaded they split to ease their load. When there are 26 or fewer supernodes, they split on the same tier; when supernode numbers increase beyond 26 they must split onto another tier. This is how GazNet improves on other networks' search times, but unfortunately the way supernodes split means that there will be extremely long chains of supernodes; for example, the 'Al' supernode will need to split quite soon after its creation because of the high frequency of words beginning with the letters 'Al'. A major problem with the GazNet design is that it doesn't allow for the 'looping' of supernodes, i.e. when a supernode becomes too heavily loaded it cannot create an equal supernode to help look after the same hash space. The cost of looping would be increased search time, as the network would no longer be fully indexed; it would be partially indexed, with groups of supernodes looking after one hash space, each of which would need to be searched to ensure the whole network was covered. Without it, however, there is no provision for the scenario in which too many files with the same name occur on the network, which will often happen as popular files such as the latest chart singles make their way onto many computers. The current structure of the network cannot handle such a situation; it will simply cease to work. This is best explained with an example: say GazNet is highly populated with cat, dog and fish lovers who all feel it necessary to share with the world their collections of pictures, movies and audio files of their favourite animals, which they all conveniently choose to title Cat.jpg, Dog.avi and Fish.mp3. When the users log on to the network they will send these files to the C, D and F supernodes respectively; when the C supernode reaches its maximum capacity of files it will split, giving all the Cat.jpg files to the new Ca supernode. As this supernode is also heavily loaded down with cat pictures it will split, passing them all on to the new Cat supernode, and this process will go on until there is an extremely heavily loaded Cat.jpg supernode, at which point no more files titled Cat.jpg can be stored on the network.

A (820000000)  ->  Al (205000000)  ->  Ali (29294500)  ->  Alie (29294500)  ->  Alien (29294500)

Diagram 8.4 – Route a file will take through supernodes

Diagram 8.4 illustrates the problem: due to the distribution of load, supernodes such as 'Al' will be overloaded. The diagram is based on a network containing 1 billion files and shows how a file called 'Alien' would work its way through the network. The figures in brackets show the number of files that would be stored on each supernode if it could not split; the calculations are based on the data from the previous section. Due to the network load the file keeps getting pushed onto another supernode, but when it reaches its furthest point, the 'Alien' supernode, it cannot go any further and the network will fail, as there is no other supernode it can share the load with. Obviously this example is nothing like a real world situation, as it is based on one shared folder and cannot accurately be scaled up to a network of one billion files, but the concept is still valid and shows one of the major flaws in the network.

Further to this, another problem with the network structure is that its depth cannot exceed the length of the longest word stored on it. If the network only contained files with names like "Abba.mp3" and "Queen.mp3", it could never have a depth exceeding 9 layers. At first this may appear to be a good thing, as the smaller the depth the faster the search time; the problem is that if the number of supernodes is limited by the number of layers, and the number of layers is limited by word length, any network with short words will be heavily limited in the number of nodes that can connect to it, as there simply won't be enough supernodes for new nodes to connect to.

One of GazNet's better features is how supernodes are actually created. GazNet does not statically decide whether nodes should become supernodes; instead, supernodes are created dynamically whenever they are required, meaning that at any given time just enough supernodes exist for the network to function. This at first seems to react well to the highly dynamic nature of a P2P network; however, the technique only deals with the addition of nodes to the network, not their removal. This leads to redundant supernodes that no longer look after enough files to warrant their existence. It would be better if, in such situations, a supernode could merge with another lightly loaded supernode, freeing itself up to carry out functions in another part of the network.

Supernode load

The previous sections lead naturally on to the subject of supernode load. It has already been established that there is very little chance of many supernodes being perfectly loaded; instead, roughly half are overloaded and half are under-loaded.

Supernodes have two main purposes: the first is the caching of file lists to speed up search time, and the second is to allow other users to query them to find out the actual locations of files. Correspondingly, a supernode's load can be categorised into two types: memory and processor load, i.e. the resources the supernode has to allocate to managing file lists, and processor and networking load, i.e. the resources the supernode has to allocate to routing requests through the network. To truly understand the load placed on supernodes these two categories must be analysed both separately and together.

The first type of load is probably the least significant. Compared to searching, adding and removing elements from the supernode's databases (node and file) is a relatively infrequent action; similarly, the process of splitting a supernode is relatively infrequent, so the file list should stay moderately static. The memory requirements placed on the supernode should be easily dealt with by any node capable of becoming a supernode; this leaves the drain on resources created by the actual management of the databases. File lists are stored using a hash table: values are hashed on their first letter into an array of 26 vectors, which was done to ease the strain on the system when the supernode is splitting. This initially gives a relatively good load balance; looking at Table 8.1, the maximum load on a single vector is 8.2%, which, although it might have a bad effect on a distributed table, an array should be able to deal with easily. Same-tier splits are also extremely easy to carry out, as the supernode simply has to cut the array in half and send one half to the new supernode. The creation of new tiers, however, will create a greater load on the system, as every file has to be checked to find out which ones will need to be sent to the next tier.
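The structure just described can be sketched as follows; FileDatabaseSketch and its methods are illustrative stand-ins for the real classes.

import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

public class FileDatabaseSketch {
    // One vector of file keywords per letter, mirroring the 26-bucket layout described above.
    private final List<Vector<String>> buckets = new ArrayList<>(26);

    public FileDatabaseSketch() {
        for (int i = 0; i < 26; i++) buckets.add(new Vector<>());
    }

    public void add(String keyword) {
        buckets.get(index(keyword)).add(keyword);
    }

    /** During a same-tier split, hand the upper half of the buckets to the new supernode. */
    public List<Vector<String>> splitUpperHalf() {
        List<Vector<String>> upper = new ArrayList<>(buckets.subList(13, 26));
        for (int i = 13; i < 26; i++) buckets.set(i, new Vector<>());
        return upper;
    }

    private int index(String keyword) {
        return Character.toUpperCase(keyword.charAt(0)) - 'A';
    }
}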

Unfortunately, as supernodes split and new tiers are created, the load balance of the array becomes increasingly worse, loading one vector far more heavily than the others, which means that in reality the supernode is simply performing a linear search of its contents. This at first seems a serious issue, as a linear search is extremely inefficient and will place a much higher load on the system compared to something like a binary search; however, it must be remembered that the intention of the project was never to create an efficient database search mechanism. In fact, the way files are searched for on the supernode is irrelevant to the overall success of the project, so it is easy to overlook these performance issues.

The second and most important factor of supernode load is the network load, mainly the routing of information through the network. When a node wishes to perform a query it sends it to the appropriate first tier supernode; e.g. to search for Aerosmith, the query will first be sent to the 'A' supernode, then passed on to the 'Ae' supernode, then on to the 'Aer' supernode, until the furthest path has been traversed, which should yield the results. This obviously places an extremely heavy load on the higher level supernodes, as 3.85% of all searches will have to pass through each of them (presuming each of the first tier supernodes receives an equal share). There is a similar problem in the way that files are sent to the correct lower tier supernodes: all file references go through the first tier supernodes, which then pass them down the network to the appropriate supernode in the same manner that searches are passed through the network. On a fairly static network this won't create too much traffic, but during network setup or on a highly dynamic network the traffic produced will be immense.

The first aspect of supernode load will have a relatively small effect on overall performance, but on heavily used supernodes it will have a hefty effect on the performance of the computer. As supernodes are supposed to be made up of everyday computers used by regular people, the performance hit might encourage people to simply stop using the network, or to restart the program so they are no longer running as a supernode; at the far end of the spectrum the computer may simply crash under the load. These actions will have a seriously bad effect on the overall network, as the loss of a supernode is a serious issue. It is therefore clear that, although the efficiency of each individual supernode isn't the focus of this project, it is still important to the overall efficiency of the network: supernodes are the backbone of GazNet, and if they run poorly, the whole network will suffer. Network efficiency is the primary issue in the project and will also constitute the biggest load on the supernodes; this is because searches of individual supernodes are far less frequent than network traversals, so supernodes (the higher level ones at least) will have to forward a great deal of information through the network, creating a heavy burden on them.

8.1.2 Search Algorithms

It is hard to evaluate the search algorithms objectively, as they are so heavily embedded in the structuring of the network that their efficiency almost solely depends on the efficiency of the network. It has already been established that the network architecture has several flaws, but for this section it will be assumed that the network is a fully functioning, 100% reliable architecture.

To illustrate the degree of success in this area, two other P2P networks will be compared and contrasted: Gnutella and Kazaa (see Background for details). Using the data from the previous section, a network will be constructed working on the following presumptions:

- A node count of 1,000,000 in each network (n)
- A file count of 1,000,000,000 in each network (f)
- Kazaa allows a maximum of 200 nodes connected to each supernode (s)
- Kazaa requires a search of every supernode to complete the search
- GazNet allows a maximum of 500,000 files on each supernode (sf) but where necessary can exceed this to any degree (for argument's sake)
- Gnutella will not be using a TTL field
- Search time is measured in hops

Firstly, Gnutella will take the longest time to search, as every single node must be searched in real time; therefore n nodes need to be contacted to obtain complete search results. Gnutella's best and worst cases are exactly the same, because to search the whole network every node must be queried; however, because some nodes will obviously be searched before others, valid results are likely to be returned before the search is complete, so in reality the best time depends on how popular the item the user is searching for is. Kazaa, on the other hand, makes a dramatic reduction on Gnutella's search time, with only n/s nodes needing to be contacted. This is because Kazaa caches file lists on supernodes, so a smaller number of nodes actually need to be searched. Its worst and best cases work in exactly the same manner as Gnutella's. GazNet's search time, however, cannot be calculated solely from the above presumptions, because it depends on how the network has been distributed; its best and worst cases are therefore more complex.
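A worked comparison under these presumptions is shown below; the depth d used for GazNet is an example value only.

public class SearchCostComparison {
    public static void main(String[] args) {
        long n = 1_000_000L; // nodes in the network
        long s = 200L;       // nodes per Kazaa supernode
        long d = 6L;         // example GazNet depth, e.g. a search for "Disney"

        System.out.println("Gnutella (complete search): " + n + " nodes contacted");
        System.out.println("Kazaa (complete search):    " + (n / s) + " supernodes contacted");
        System.out.println("GazNet (worst case):        " + d + " hops");
    }
}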

Worst Case

Generally speaking, the worst case search time depends on two factors. The first is the length of the query: a longer word can be spread over more supernodes and will therefore take more hops. The second is the popularity of that particular hash: the more popular a hash is, the more likely it is that the supernode has split, creating an extra hop in the search. In a network containing a billion files it is likely that many supernodes will have split, forming chains such as Britney and Disney; the worst case is therefore the word length, e.g. 6 for Disney. A more generic worst case is d, the greatest depth in the network, as this would be the worst possible search anyone could make.

Best Case

The best case is much easier to devise: it is simply 1, the number of hops required to contact one of the first tier supernodes; a search for the single letter 'A' would take only one hop.

Network Traffic Created

The previous sections have discussed the search algorithm's efficiency in terms of hops, which relates to how efficiently the overlay network works. Another important issue is how heavy a burden searching puts on the underlying network; this is especially important if GazNet were running over the Internet.

The amount of network traffic created during a search is highly variable: it depends primarily on how many words are in the search and on whether different supernodes need to be contacted to service each search word. The following calculations work on both the previously mentioned presumptions and the following:

- A query rate of 625 per second (625qps) (Ritter, 2001)
- The results are stored on the lowest available tier

Type of Data       Size
IP header          20 bytes
TCP header         20 bytes
GazNet messages    63 bytes
Total:             103 bytes

Table 8.2 – Size of a search excluding the search string

To reach the correct supernode the query takes 4 hops through the network; therefore 4 copies of the query will be created, making a total of 428 bytes including a 4-byte query string.
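The single-query figures in Table 8.3 can be reproduced with the following sketch, which assumes each hop creates one further copy of the headers plus the query string.

public class QueryBandwidth {
    static final int HEADERS = 20 + 20 + 63; // IP + TCP + GazNet message overhead (Table 8.2)

    /** Bytes generated by a single search word that must travel the given number of hops. */
    public static int bytesForWord(String word, int hops) {
        return hops * (HEADERS + word.length());
    }

    public static void main(String[] args) {
        System.out.println(bytesForWord("Abba", 4));    // 428 bytes, as in Table 8.3
        System.out.println(bytesForWord("Britney", 7)); // 770 bytes, as in Table 8.3
    }
}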


#  Search                                          Number of Queries                Bandwidth
1  Abba                                            Single Query                     428b
                                                   Queries per second (625qps)      261.23KB
                                                   Queries per minute (37500qpm)    15.3MB
                                                   Queries per hour (2250000qph)    918.38MB
2  Britney                                         Single Query                     770b
                                                   Queries per second (625qps)      469.97KB
                                                   Queries per minute (37500qpm)    27.53MB
                                                   Queries per hour (2250000qph)    1652.24MB
3  "Britney Spears"                                Single Query                     1424b
                                                   Queries per second (625qps)      869.14KB
                                                   Queries per minute (37500qpm)    50.92MB
                                                   Queries per hour (2250000qph)    3055.57MB
4  "Britney Spears – Hit me baby one more time"    Single Query                     2798b
                                                   Queries per second (625qps)      1.66MB
                                                   Queries per minute (37500qpm)    100.06MB
                                                   Queries per hour (2250000qph)    6003.85MB
5  Mix of all previous queries                     Single Query                     1355b
                                                   Queries per second (625qps)      827.02KB
                                                   Queries per minute (37500qpm)    48.45MB
                                                   Queries per hour (2250000qph)    2907.51MB

Table 8.3 – Bandwidth used for queries

The results in Table 8.3 show how much bandwidth is used for a number of different queries. The first two queries are single word queries of varying length. As can be seen, query 2 is 342 bytes larger than query 1; this is not because of its greater length as such, but because the figures are based on the worst case scenario, in which the lowest possible tier supernode has to be contacted (4 hops for query 1 but 7 hops for query 2), and each extra hop creates another copy of the search, leading to much higher use of bandwidth. Query 3 increases in size by a further 654 bytes over query 2; again this is not primarily because of the increased length of the query but because there are two words in it, meaning that two separate searches have to be sent out to different supernodes. This is shown even more clearly by query 4, where 7 separate searches have to be sent out and the size increases to 2798 bytes. Query 4 requires only 7 searches despite containing 8 words because two of the words begin with the same letter, 'M'; these two can be aggregated into one search, although after the first hop their letters will no longer match, so the search will be split into two separate searches and sent on to different supernodes.

The amount of bandwidth used in a query is based on four factors:


The number of words in the query – The more words there are the more supernodes will need to be contacted.

The size of the network – The larger the network the more likely it is that more supernodes will exist, therefore making network traversals longer.

The length of the search words – The longer the words the more likely it is that more of the network tree will need to be traversed to find the correct supernode.

The length of the search string – The longer the string, the larger the protocol messages will be.

Most of the bandwidth is consumed during the traversal of the tree, because multiple copies of the search have to be passed through the network; the best way to minimise bandwidth use is therefore to lower the number of hops taken to find the correct supernodes.

GazNet will generally create less search traffic than the other P2P networks considered, because it requires the fewest hops to find the results; this is directly attributable to the network's indexed architecture. This can be clearly seen when bandwidth use is compared to a protocol like Gnutella: using the example provided by Ritter (2001), a search for "The Grateful Dead" on a 1,000,000 node network would generate 91.31MB of traffic, compared to 1,597 bytes generated by GazNet. Gnutella therefore obviously creates a massive burden on the underlying network, whereas GazNet is very 'network friendly', creating only a tiny percentage of Gnutella's traffic (0.0017% to be precise). The reduction in bandwidth is not only beneficial to the underlying network, it also lowers the load on the overlay network: 91MB of data will take an extremely long time to propagate through a network, whilst 1,597 bytes will take hardly any time at all, leading to faster search times. Similarly, smaller amounts of traffic place a smaller burden on the backbone of the overlay network (the supernodes), allowing better performance.

Search Algorithm Summary

Search time was a major issue in the design of this network, and this clearly shows in the chosen architecture: many things, such as robustness, reliability and supernode loading, have been sacrificed to make the search time as efficient as possible. From the previous example (a one million node network) it is clear that this has been accomplished.

In a nutshell, on these figures this is the fastest search mechanism of the P2P networks examined; however, all of this is based on the presumption that the search algorithm (basically a tree traversal) is running over a perfect, fully reliable network, and as is well known, P2P networks are anything but reliable. One of the major flaws with the search method is that if one link fails, every file below it is lost; e.g. if the 'Al' supernode goes down, all references beginning with the letters 'Al' will no longer be searchable. It should be noted that just because they are no longer searchable it doesn't mean they no longer reside on the network: both the file and the file reference still exist – the file on the node and the file reference on the supernode. If the searching node already knew the address of the correct supernode it would still be able to search the network by simply bypassing the higher level supernodes; this concept might be one way of improving the reliability of the network.

It is worth pointing out the difference between robustness and reliability here, as search time is affected by both. A robust system will be able to withstand attacks and tolerate failures to a high degree; a perfect example is Gnutella, whose fully distributed structure makes it probably the most robust network available – it would be practically impossible to destroy GnutellaNet. Reliability, on the other hand, is a more precise measurement referring to how well the network can recover from problems, and could be measured as the percentage of down time the network suffers over a period of time. Unfortunately GazNet doesn't score particularly well on either of these points, but it could be said that it is more reliable than robust. Its indexed nature leaves it very vulnerable to attack: hackers will know how to find where the high tier supernodes are, and if they can take one supernode offline everything below it will follow, which equates to a system with very little robustness. However, GazNet's brother supernode mechanism means that it can provide a greater degree of reliability, as supernodes should always be replaceable with one of their brothers. Robustness and reliability are discussed in more depth in later sections.

Intelligent Searching

The previous section discussed the factors involved in the speed and efficiency of a search; by the very nature of the network, search time will be fast and the bandwidth used will be low, but there are still inefficiencies in the process that could be improved upon. The main inefficiency is the use of 'dumb' searching, which leads to slower search times, greater bandwidth consumption and lower quality search results. When a file is stored on the network or a file is searched for, no discrimination is made between words, so any word can be stored on the network. This at first sight seems a good property, but it leads to great inefficiencies; Table 8.1 shows that the letter 'T' constitutes 8.91% of the load on the first tier supernodes, largely because of the word 'The'. Without it the 'T' supernode would receive a 37% drop in load, leaving it with only 5.53% of the overall network load, which is only 1.68% away from the ideal percentage. The same applies to searching: if a user searched for "The Beach Boys", not only would excess bandwidth be generated but the user would also be supplied with erroneous search results containing the word 'The', such as "The Beatles".

Two techniques could be employed to remedy these problems. The first would be to cross reference all the search results, placing all the fully matching ones at the top of the results table (something that would be beneficial anyway). The second is to vet searches and file references, banning words such as 'the', 'a' and 'it'. By doing this, large amounts of storage can be saved, and the bandwidth wasted in transporting searches and file references containing such words can be avoided.

Search Aggregation

Throughout this report the matter of efficiency has been discussed in depth; in any P2P network traffic must be kept to a minimum, so the unnecessary creation of traffic is unacceptable. Search aggregation is used to minimise the amount of traffic created for each query, reducing the burden on both the underlying network and the supernode that services the search. It involves combining search words destined for the same supernode into one search instead of sending them as separate searches.

The following information shows the size of a full search, excluding any results returned.

080-SEARCH=Beach Boys ........... 21 bytes
090-SEARCH_OK ................... 13 bytes
100-SEARCH_LIST_READY ........... 21 bytes
110-SEARCH_LIST_OK .............. 18 bytes
Total ........................... 72 bytes

This query was aggregated, as the search contains both the words 'Beach' and 'Boys'; had it not been aggregated, its size would increase from 72 bytes to 135 bytes (187.5% of the aggregated size). This is only for a two word query: if the search was changed to "The Beach Boys", the size would increase from 76 bytes using an aggregated search to 216 bytes (284.2% of the aggregated size). This is because each word would be sent in a separate search, so the overall size increases not just by the extra search words but also by the size of the extra protocol messages that need to be sent. When dealing with such a small number of bytes it is easy to overlook such inefficiencies, but as the network increases in size the problem becomes much greater. The supernodes that suffer the worst are the top tier ones, because they look after a whole single letter hash space, such as all the 'A' words; it is therefore likely that multiple search words will be sent to the same supernode, as in the case of the search for "Beach Boys". Multiple search words being sent to the same lower tier supernode is less likely, because the chance of two words sharing their first two letters is much smaller than the chance of two words sharing just their first letter.
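The aggregation step can be sketched as follows: words are grouped by the first-tier supernode (i.e. first letter) that serves them, and each group is sent as a single search. The method names are illustrative.

import java.util.*;

public class SearchAggregation {
    /** Groups the words of a query by the first-tier supernode (first letter) that serves them. */
    public static Map<Character, List<String>> aggregate(String query) {
        Map<Character, List<String>> perSupernode = new TreeMap<>();
        for (String word : query.split("\\s+")) {
            if (word.isEmpty()) continue;
            char supernode = Character.toUpperCase(word.charAt(0));
            perSupernode.computeIfAbsent(supernode, k -> new ArrayList<>()).add(word);
        }
        return perSupernode;
    }

    public static void main(String[] args) {
        // "Beach Boys" -> one aggregated search to the 'B' supernode: {B=[Beach, Boys]}
        System.out.println(aggregate("Beach Boys"));
    }
}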

Using the test data from the previous sections, if a user were to search for an artist and a particular song, 14.41% more searches to the first tier would be generated if aggregation wasn't used; this makes the benefits clear.

8.1.3 Reliability and Robustness

One of the main arguments for using a P2P network is its robustness: a Gnutella-like network would be nearly impossible to destroy, so reliability and robustness are obviously vital requirements in a network of this type. The designed method, the use of brother supernodes, was not implemented, and therefore there is no way of maintaining reliability in this implementation: if a supernode goes offline, all information on that supernode is lost.

There is one robustness technique employed: a child supernode maintains a TCP connection with its parent, and if either goes down the other will realise and create a new supernode to take on its load. However, no support is provided to back up file lists, so when a supernode goes offline its whole file list is deleted and everything stored on that supernode must be refreshed. In such a circumstance every node on the network would have to be contacted, informed of the loss of the supernode, and asked to update the new supernode with their file lists. This would create far too much network traffic to be viable and is therefore not a good enough replacement for brother supernodes. Another problem with this method is the length of time required to re-initiate a supernode: using brother supernodes the time would be under a minute, but using this technique it could take many minutes depending on the size of the network, and with a large public network the recovery time could be very long indeed.

8.1.4 File Transfers

The main issue to evaluate in the area of file transfer is the way that multiple source downloads are performed. The purpose of multi-source downloads is to reduce the length of time it takes to transfer a file, and to a degree GazNet does achieve this; however, there are certain glaring faults in the algorithm.

The way files are split up is fairly standard: the file size is divided by the number of sources and each source is allocated a section to download; a separate data file is filled with the bytes from each source, and once the download is complete the files are merged to create the finished file. File downloads can also be suspended and then resumed later, even with a different number of sources, which is a useful feature. There are, however, still some problems with the process. A big one is that the application doesn't dynamically react to changes in the environment; for example, if a source goes offline during a download, instead of finding a new source the downloads from all the other sources will finish and the file will be left unfinished until the user restarts the download so the final pieces can be completed.
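The merge step could be implemented along the following lines; the class name and the way segments are passed in are assumptions made for illustration, not the actual GazNet code.

import java.io.*;

public class SegmentMerger {
    /** Concatenates the per-source segment files, in order, into the finished download. */
    public static void merge(File[] segmentFiles, File destination) throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(destination))) {
            for (File segment : segmentFiles) {
                try (InputStream in = new BufferedInputStream(new FileInputStream(segment))) {
                    byte[] buffer = new byte[64 * 1024];
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        out.write(buffer, 0, read); // append this source's bytes to the final file
                    }
                }
            }
        }
    }
}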

Another problem is the way GazNet decides which files are identical: it uses two of the fields in the FileDescription, the name and the file size, and if these match it presumes the files are the same. Unfortunately, on a large network there will be many files that are identical but have different filenames, which means that many possible sources will be ignored, resulting in a download that is slower than it could have been. Most multi-source capable P2P programs generate a message digest of the file and use this to locate other identical files; this is a far better method as it reduces the chance of mistakes being made and also increases the chance of finding other matching copies.
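As a sketch of the digest-based approach (not part of the GazNet code), the standard java.security.MessageDigest class can compute a hash over a file's contents, which could then be carried in the file description in place of the name/size pair:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: compute a SHA-1 digest of a file so that identical files can be
// matched regardless of their filenames.
public class FileDigest {

    public static String sha1Hex(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        try (InputStream in = new FileInputStream(path)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();   // two files with the same digest are treated as identical
    }
}
```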

Probably the biggest problem with the download system is its indiscriminate choice of which nodes it downloads from. The whole point of multi-source downloads is to increase the speed at which files are downloaded, but the benefit of the system is lost if a poor choice is made over which nodes to download from. If there are five nodes containing the desired file, capable of bit rates of 2.5Kb/s, 1Kb/s, 5Kb/s, 3Kb/s and 512Kb/s, the obvious choice is to download from the last source alone, because it could transfer the file the quickest; GazNet, however, would download from all five sources. If the downloading node's bit rate is lower than that of the slowest uploading node there will be no impact on the download time, but if the downloading node has a connection speed of, say, 128Kb/s there will be a serious increase in download time as the four slow nodes attempt to squeeze their segments of the data through the pipe. It is therefore clear that multi-source downloads are not always appropriate, and the aim of obtaining as many sources as possible is not always the best plan. The best choice of sources for a download can be summarised as follows:

SUM(bit rates of all chosen sources) = bit rate of the downloading node

This is because there is no point in using a broadband connection to supply a dial-up connection with a file when another dial-up connection could supply the file at the downloading node's maximum bit rate. Instead, the broadband connection can be left to supply other broadband connections, which will actually utilise its full bit rate, as opposed to a dial-up connection which would only utilise a tiny percentage of it.
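A simple greedy selection along these lines is sketched below; it keeps adding the fastest remaining sources until their combined bit rate covers the downloader's own capacity. The structure is illustrative only and assumes that each candidate source advertises an estimated bit rate.

```java
import java.util.*;

// Sketch of a greedy source-selection rule: add the fastest remaining sources
// until their combined bit rate covers the downloading node's own connection
// speed, rather than downloading from every available source.
public class SourceSelector {

    public static final class Source {
        public final String address;
        public final double bitRateKbps;   // advertised/estimated upload rate
        public Source(String address, double bitRateKbps) {
            this.address = address;
            this.bitRateKbps = bitRateKbps;
        }
    }

    public static List<Source> choose(List<Source> candidates, double downloaderKbps) {
        List<Source> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> Double.compare(b.bitRateKbps, a.bitRateKbps)); // fastest first
        List<Source> chosen = new ArrayList<>();
        double total = 0;
        for (Source s : sorted) {
            if (total >= downloaderKbps) break;  // downloader's pipe is already saturated
            chosen.add(s);
            total += s.bitRateKbps;
        }
        return chosen; // in the 512Kb/s example above, only the fastest source is chosen
    }
}
```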

8.1.5 Miscellaneous

Transport Protocol

The transport protocol used throughout the project is TCP, which brings the project both advantages and disadvantages. ObjectStreams are used frequently, primarily in the transmission of lists (node lists, file lists etc.), and as these run over TCP its use is to that extent a necessity. TCP has many plus points: its reliable, in-order transport was one of the main reasons it was chosen, as the GazNet protocol requires messages to be delivered in the correct order, and it also improves reliability, since lost packets during a supernode split could create no end of problems. TCP was also chosen because its connection-orientated approach allows supernodes to realise when nodes have gone down without the need for explicit programming; however, the benefit of this is limited by GazNet's largely connectionless approach. This is where doubts over TCP first creep in: with the exception of the supernode a node is directly connected to, no connections are kept with the network, i.e. supernodes that a node has merely joined (sent its file details and node details to) maintain no connection at all, which provides an argument for the use of UDP. It would perhaps have been better to avoid the overhead of TCP in certain areas whilst still using it for things like object transfer; however, if a single choice had to be made between TCP and UDP, it would always be TCP.
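The list transmission mentioned above can be illustrated with a small sketch using Java's object serialization over a TCP socket; the list type here is a hypothetical stand-in for GazNet's file and node lists.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.net.Socket;
import java.util.List;

// Sketch: sending a serializable list over a TCP connection using ObjectStreams,
// as is done for file lists and node lists. The concrete list (e.g. ArrayList)
// must implement Serializable.
public class ListSender {

    public static void sendList(String host, int port, List<String> fileList) throws IOException {
        try (Socket socket = new Socket(host, port);
             ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream())) {
            out.writeObject(fileList);   // the whole list travels as one object
            out.flush();
        }
    }

    @SuppressWarnings("unchecked")
    public static List<String> receiveList(Socket socket) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(socket.getInputStream());
        return (List<String>) in.readObject();
    }
}
```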

Network Address Translation (NAT)

A problem that has plagued many P2P networks, including GazNet, is the issue of public and private addresses. Since the realisation that IPv4 addresses were running out, much effort and research has been put into alleviating the situation; some has been targeted at the creation of a replacement protocol (IPv6) and some has focused on preserving the addresses we already have. A frequently used technique is to hide a number of devices on a network behind a single public IP address: every device on the network has a non-globally unique IP address, called a private address, but to the outside world it appears as if there is only one device with a single public IP address. All external communications are sent through the NAT system and neither side should have any difficulties; problems only occur when an external device tries to initiate a connection with a device behind a NAT system. This is because the NAT system has no way of determining the destination, since the external device only knows the single shared public IP address. This problem can be avoided by using port forwarding, in which every private device is assigned a port; whenever an external device wishes to contact a private device it can do so through the assigned port.

Another issue with NAT is how a device can actually work out its own public IP address, as its knowledge is limited to its private address. A similar problem occurs when a device has multiple network interfaces, such as an Ethernet connection and a dial-up internet connection; in such a circumstance an InetAddress.getLocalHost() call will frequently return the private Ethernet address rather than the public dial-up address. This has no effect on the network connection phase, nor on the searching process; the only problem arises when another node attempts to download something from the node behind the NAT system. This is because every instance of a FileDescription has a variable that stores the IP address of the owner, set by the owner of the file using InetAddress.getLocalHost(); if an external node then attempts to download the file it will try to contact the private address, which, unless both nodes are on the same private network, will lead to an exception.

It is obviously important that a P2P network supports NAT if it is to be distributed on a wide scale, so a way around these problems must be found. The main problem arises because the FileDescription's owner variable is set by the host, allowing the possibility that it is given the wrong interface's IP address; this could easily be rectified by setting the owner variable at the supernode rather than at the host, ensuring the public address is used rather than the private one. To get around the NAT system the host can also include a variable in the FileDescription stipulating which port to connect to, so that the NAT system can forward the connection on to the correct destination (provided the NAT system supports port forwarding).
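A sketch of the suggested fix is shown below, with a hypothetical FileDescription stand-in; the supernode overwrites the owner address with the address the node's connection actually arrived from, which is reachable from outside the private network (or is at least the NAT system's public address).

```java
import java.net.Socket;
import java.util.List;

// Sketch of the suggested NAT fix: the supernode stamps each received
// FileDescription with the address the node's TCP connection arrived from,
// instead of trusting the address the node reported about itself.
public class FileListReceiver {

    // Hypothetical stand-in for GazNet's FileDescription
    public static final class FileDescription {
        public String ownerAddress;   // set by the supernode, not the host
        public int ownerPort;         // forwarded port for NAT traversal
        public String fileName;
        public long fileSize;
    }

    public static void register(Socket nodeConnection, List<FileDescription> fileList) {
        String publicAddress = nodeConnection.getInetAddress().getHostAddress();
        for (FileDescription fd : fileList) {
            fd.ownerAddress = publicAddress;  // override whatever the host supplied
        }
        // ...store the corrected file list in the supernode's database...
    }
}
```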

Rating

One of the problems with Java is that, because of its platform independence, it has little support for detecting the underlying hardware; because of this it was impossible to devise a proper performance rating. Currently a random number is submitted as a node's performance rating, which is obviously unworkable. The only relevant features Java can detect are the node's operating system and, to a limited extent, its connection speed; the only real way around this problem in Java is to obtain the performance details from the user.
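A minimal sketch of what is detectable from pure Java is shown below; anything beyond the operating system and processor count would, as noted above, have to be supplied by the user. The weighting formula is an arbitrary illustration, not a figure used by GazNet.

```java
// Sketch: the limited hardware information available from pure Java. Any
// meaningful performance rating would have to combine this with details
// supplied by the user (e.g. connection speed).
public class PerformanceRating {

    public static int estimateRating(int userReportedKbps) {
        String os = System.getProperty("os.name");              // e.g. "Windows XP"
        int processors = Runtime.getRuntime().availableProcessors();
        long maxHeapBytes = Runtime.getRuntime().maxMemory();

        // Arbitrary illustrative weighting; a real scheme would need tuning.
        int rating = processors * 10
                   + (int) (maxHeapBytes / (64L * 1024 * 1024))
                   + userReportedKbps / 100;
        System.out.println("OS: " + os + ", rating: " + rating);
        return rating;
    }
}
```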

This issue raised few problems during testing, as the experimental network was run over a LAN with nodes of similar performance; however, if the network were run over the internet it would quickly collapse, as low performance computers with dial-up connections could be converted to supernodes, leading to crippling performance issues.


8.2 Evaluation of the User Interface

The user interface is an important component of the overall system and should therefore receive thorough testing and evaluation to ensure its quality. However, the simplicity of the chosen user interface and its relatively low importance compared with the other components mean that only a limited evaluation is possible.

8.2.1 Welcome Panel

The simplicity of this panel limits what can be said about it. The Message of the Day window allows the user to find out new information about the system, and the connected status helps the user to understand what is going on. The only problem with the screen is that the connected status will only change once, from “Connecting…” to “Connected”; if the node is later disconnected, the status on the Welcome Panel will not be changed.

8.2.2 The Search Panel

The search panel fulfils its requirements perfectly well, but two aspects of it could be improved. The first is the ability to search by pressing the return key; in its current form the user must enter the search word and then use the mouse to click the search button. Secondly, the search table's scroll pane does not automatically return to the top of the table when a new search is initiated. This gives the appearance that no results have been returned even when there have been: if a search is performed that yields 50 results, the user scrolls to the bottom and then starts another search that yields only 10 results, the visible part of the table will appear empty, because the user is still viewing the bottom rows of the table while the 10 new entries sit at the top.
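Both improvements are straightforward in Swing, as the sketch below illustrates; the component names are hypothetical stand-ins for the ones used in the GazNet GUI.

```java
import javax.swing.*;

// Sketch of the two suggested Search Panel improvements:
// 1) trigger the search when the return key is pressed in the text field;
// 2) scroll the result table back to the top when a new search starts.
public class SearchPanelFixes {

    public static void apply(JTextField searchField, JButton searchButton, JTable resultTable) {
        // Pressing return in the text field fires the same action as the search button.
        searchField.addActionListener(e -> searchButton.doClick());

        searchButton.addActionListener(e -> {
            // Scroll back to the first row before the new results arrive.
            if (resultTable.getRowCount() > 0) {
                resultTable.scrollRectToVisible(resultTable.getCellRect(0, 0, true));
            }
            // ...initiate the search and repopulate the table's model...
        });
    }
}
```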

The panel could also be improved by providing the user with more information about the files he/she wants to download, such as the connection speed of the host. This would allow a more educated choice to be made, leading to fewer downloads of unwanted files.

8.2.3 The Download Panel

This panel simply contains a table listing the downloads, and it has the most errors of all the panels. When a download is resumed, a new instance of the resumed file is added to the table with a status of “Connecting…”; the original instance displays the correct statistics but the new instance is never removed. Similarly there is no option to remove downloads from the list; they are only removed when the application is restarted. Further to this, downloads left unfinished because the application was closed do not remain in the download table when the application is loaded again; the user has to remember them, or check the download folder, then initiate another search and manually resume the download, rather than this being done automatically at program start-up.

Right-clicking on a downloading item brings up a popup menu, which allows the user to open the file, cancel the download, or resume the download. These menu items are available whatever the status of the download; it would be better if only the currently possible actions were available, so that, for example, while a download is still in progress the open file option would be disabled.

8.2.4 The Upload Panel

The upload panel, like the download panel, consists only of a table containing a list of all the uploads. Also like the download panel, there is no way to remove finished uploads from the list, leading to a list that fills up with unnecessary information.

8.2.5 The My File Panel

This panel is by far the most complex screen facing the user; it is well laid out and easy to use, and all the buttons work apart from the Update button. The only other problem is that when a user changes his/her download folder the change only lasts while the application is open; after it is closed the setting reverts to the default “C:\Downloaded\”. This can be extremely infuriating, and in some circumstances the user will simply not realise why his/her downloads aren't working, as no thought would be given to the download folder having reverted to the default.

8.3 Alternative Solutions

A great number of problems have been outlined in the previous sections, but very little has so far been offered in the way of improvements that could make the system more efficient. The following sections outline ways in which the system could be changed.

8.3.1 Scalability

One of the main driving forces behind GazNet is that its architecture should make it highly scalable; however, the previous sections have illustrated flaws in the design which limit the scalability of the network.

Splitting Algorithm and Even Distribution

This algorithm is responsible for how and when supernodes split, and therefore for the structure and loading of the network. The current splitting algorithm is as follows:

halve the hash space                    // i.e. from A-Z to A-M and N-Z
create a new supernode and allocate it the second half of the hash space
pass the new supernode the appropriate file references
delete the old file references
allocate this supernode the other half of the hash space

This algorithm has the benefit of simplicity but does not perform to an adequate standard over a large network, as can be seen by examining the data from the first section. There are two possibilities for improving performance here: the first is to change the splitting algorithm so that load is distributed more evenly; the second is to change the nature of the splitting algorithm so that supernodes can dynamically change their state and redistribute their files ‘on the fly’.


There are several ways the splitting algorithm could be modified; the approach suggested here is a statistical one, splitting supernodes according to their average percentage load (indicated by the research shown previously). Using this method, heavily loaded supernodes would be paired with lightly loaded ones to bring the combined load as close as possible to 3.85% (the ideal load); for instance ‘T’ would be paired with ‘X’ because ‘T’ is the most heavily loaded letter and ‘X’ the most lightly loaded. An alternative approach would be to split supernodes into groups according to their anticipated load: ‘T’ would be put in a group of its own because it is heavily loaded, while ‘Q’, ‘U’, ‘V’, ‘X’, ‘Y’ and ‘Z’ would be put in one group because their loads roughly add up to 3.85%. The logic behind this is that the number of supernodes is reduced, as it is no longer necessary to create a ‘Q’ supernode simply to accommodate five entries; instead all these lightly loaded letters can keep their file lists on one supernode. This has the big advantage over the pairing technique that heavily loaded supernodes do not have their load increased further by being paired with other supernodes (even lightly loaded ones). A sketch of this grouping idea is given below.
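The following sketch groups letters into hash-space buckets whose anticipated loads sum to roughly the ideal 3.85%; the load figures would come from the measurements discussed earlier, and the greedy first-fit strategy is only one illustrative way of forming the groups.

```java
import java.util.*;

// Sketch: greedily group letters into hash-space buckets whose anticipated
// loads sum to roughly the ideal load (3.85%), so lightly loaded letters
// (e.g. Q, U, V, X, Y, Z) share one supernode instead of each having their own.
public class LoadGrouping {

    public static List<List<Character>> group(Map<Character, Double> loadPercent,
                                               double idealLoad) {
        // Sort letters by anticipated load, heaviest first.
        List<Character> letters = new ArrayList<>(loadPercent.keySet());
        letters.sort((a, b) -> Double.compare(loadPercent.get(b), loadPercent.get(a)));

        List<List<Character>> groups = new ArrayList<>();
        List<Double> groupLoads = new ArrayList<>();
        for (char letter : letters) {
            double load = loadPercent.get(letter);
            int target = -1;
            // Place the letter in the first existing group it still fits into.
            for (int i = 0; i < groups.size(); i++) {
                if (groupLoads.get(i) + load <= idealLoad) { target = i; break; }
            }
            if (target == -1) {                     // no room anywhere: start a new group
                groups.add(new ArrayList<>());
                groupLoads.add(0.0);
                target = groups.size() - 1;
            }
            groups.get(target).add(letter);
            groupLoads.set(target, groupLoads.get(target) + load);
        }
        return groups;   // each group corresponds to one supernode's hash space
    }
}
```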

A totally different approach would be to allow the dynamic reshaping of the network depending on how the load is distributed. An example would be a lightly loaded supernode noticing that another supernode was also lightly loaded and merging with it, freeing the first supernode for other functions (either taking on a different hash space or reverting to an ordinary node). This process means the network is constantly responding to changes and ensures maximum efficiency. Unfortunately, doing this creates a heavy overhead (depending on the size of the network) and it has to be limited in scope; it would be impractical for every supernode in a one million node network to be dynamically changing its state according to every other supernode. The best environment in which to carry this out would be a single tier. Two possible architectures are for all the supernodes on a tier to be linked in a line (a bus), with messages about load transmitted over the bus, or for every supernode to have a link to every other supernode on that tier (up to 26 links). The latter has a bigger overhead but is much more robust than the former; these connections could also be used to share and back up information, eliminating the need for brother supernodes. Despite the advantages, one of the major problems with this system is the complexity of keeping the routing tables up to date, as every time supernodes split or merge the parent must be notified.

Diagram 8.5 – Types of supernode-supernode connections (left: bus, supernodes A–D linked in a line; right: fully connected, every supernode linked to every other)


The biggest advantage of using this dynamic splitting algorithm is that the system no longer has to make decisions based solely on current information, with no ability to respond to future events. A major issue with the static creation of supernodes is that they cannot be disbanded: if there is a flood of files beginning with the letter ‘x’, then once these files have gone the network is left with a number of supernodes that no longer have any function, a major waste of resources that could be avoided using dynamic allocation.

Another major plus point of this mechanism, although it might not be immediately obvious, is its uniformity: to make any of the previously mentioned static algorithms efficient, the splitting scheme for each tier must be changed, as different tiers display different load patterns. The dynamic splitting algorithm does not suffer from this problem, as it distributes load intelligently in real time.

Network Tree Traversal

There are two main processes that require traversal of the tree: the distribution of file lists from each node, and searching of the network. These are extremely similar, as both require a file list to be passed along the tree; during file list distribution the list is sent down the tree (albeit after being split up), and during searching a file list of results has to be passed back up the tree to the top tier supernode.

One of the reasons why high level supernodes are so heavily loaded is that they essentially function as gateways into the network, and in that capacity are required to route file lists through the network to the appropriate supernode. This reduces the work required of nodes but dramatically increases the burden on the supernodes, and must therefore be improved if the network is to scale.

The majority of the work in the current scheme is spent passing file lists on through the network, rather than merely looking up the next supernode in a routing table. It would therefore be possible for the appropriate address simply to be returned to the node, leaving the job of sending the file list to the node itself. One problem with this is that one Round Trip Time (RTT) is added to the connection time for every supernode that must be traversed, as on top of the current process an extra message has to be sent back to the node informing it of the next supernode address. However, the number of TCP connections required does not increase, and both the burden on the supernodes and the amount of network traffic created are dramatically decreased (as file lists no longer have to pass through every supernode on the path to the correct one). This system improves performance when supernodes are heavily loaded and therefore take a long time to process file lists; if the network is small and the supernodes under-loaded, the overhead of returning the next supernode address could actually decrease efficiency. Having said this, such a loss would only occur on a high speed network with relatively small file lists; otherwise the overhead of passing the file list down the tree is far greater than that of simply sending an address. A salient point is that when a supernode is queried for the next address in the traversal, it is likely that several addresses must be returned, as multiple lower tier supernodes may need to be contacted.

Searching can be improved in a slightly different way. As no file lists are sent down the network there is little point in sending the next address in the traversal back to the node, because the supernode could just as easily forward the search to the next supernode itself. The improvement lies in how results are returned to the searching node: currently the results are passed back up through every supernode on the path, which, just like file list distribution, creates an unnecessarily large amount of network traffic. Instead, the searching node's address could be sent down with the search query, so that any results can be sent directly back to the searching node, avoiding passing results through the network. The only drawback of this scheme is that it changes the elegant recursive nature of the distributed searching algorithm, since search results are no longer returned back up to the original supernode; this, however, pales in comparison with the performance gained.

The above schemes are clearly an improvement, but the design still suffers from the problem that to reach a lower tier supernode every supernode above it must be contacted; this is a problem for both robustness and performance, and it would be much better if tiers could be skipped to reach the desired supernode directly. It is therefore necessary to make a trade-off between the number of lower tier supernodes each supernode must know about and the speed at which each supernode can be reached. Knowing more about the network below means more control traffic is needed to keep addresses up to date, and larger routing tables; however, if a supernode knew of five tiers below it, five hops could be skipped, so in a 20 tier network only 4 supernodes would be visited rather than 20. The greater the knowledge held by each supernode, the smaller the search time; unfortunately greater knowledge brings greater quantities of control traffic and a more complex routing system. This scheme does have other advantages, one being greater robustness (discussed later): if one supernode goes offline, the supernodes below it are not lost with it. The scheme can be extended to any degree, the limit being every single node knowing every other node, which is clearly impossible; it is therefore necessary to find the ideal degree of knowledge to be held by each supernode. Working on the basis that each tier is complete (i.e. contains 26 supernodes), routing tables quickly get out of hand: maintaining knowledge of one tier (what GazNet currently does) requires a routing table of size 26, two tiers requires 676 entries, and three tiers requires 17576. It is obvious that the tables rapidly become un-scalable; below is a table listing how large the routing tables would have to be under this scheme, on the presumption that each tier is complete.


Number of Lower Tiers    Size of Routing Table
0                        0
1                        26
2                        676
3                        17576
4                        456976
5                        11881676
6                        308915776
7                        8031810176

Table 8.4 – Size of Routing Tables for Storage of Different Numbers of Tiers

As can be seen, to maintain knowledge of 5 lower tiers the routing table would need 11881676 entries, which would obviously be a heavy burden on a supernode; add to that the massive amount of control traffic that would have to be sent around the network to keep all these routing tables up to date, and it becomes obvious that maintaining that much knowledge is impractical. At most, each supernode could maintain knowledge of two lower tiers. However, all these calculations assume complete tiers; in reality every tier, especially the lower ones, will be incomplete, and with the use of the more sophisticated splitting algorithms mentioned earlier the scheme becomes more scalable. Even so, whatever splitting algorithm is used, it is unlikely that knowledge of more than three tiers could be maintained on anything but a small network.

Architecture of the Network

Certain faults have been found in the architecture of the network. Using a modified splitting algorithm (preferably the dynamic algorithm), many of the previous problems, such as the creation of obscure supernodes like ‘Aqrjs’, could be avoided, but the network would still not be perfect, because the algorithm does not allow an unlimited number of identical files to be stored. Once a network path such as ‘Cat.jpeg’ has been created, that supernode is the only one that can store files of that name, so the ‘Cat.jpeg’ supernode's capacity equals the maximum number of files with that name allowed on the network. The best way to solve this is to introduce supernode looping: when a file is added to a supernode that is already at capacity and another tier cannot be created to pass it on to (as in the previous example), another supernode is created to look after the same hash space, allowing a limitless number of instances of ‘Cat.jpeg’ on the network. This is really the only practical way of solving the problem, but it does lengthen search time, as it reduces the indexing of the network, which to a degree defeats the object of building the network.

The GazNet design is more of a concept than a specific architecture, that concept being the indexing of supernodes, which could take many forms. Below, a couple of different architectures that hold both advantages and disadvantages over the chosen architecture are explained.


One of the major problems with the network is that the more tiers are created the less likely they are to be used, because the longer the chain of supernodes becomes the stranger the letter combinations become; for example, not many words begin with the letters ‘Ahnwwq’. Even using the modified splitting algorithms this problem will not disappear, as it is inherent to the design. To rectify it, an alternative design would be to lower the number of tiers and make up the supernode numbers by using looping. By limiting the number of tiers to, say, 2, erroneous supernodes like ‘Ahnwwq’ would never be created.

Diagram 8.6 – Looping of Supernodes (several supernodes, e.g. multiple ‘Ab’ supernodes, share the same hash space)

The limited level of indexing would still give a large improvement in search time compared with programs like Kazaa and Gnutella, as search time could be lowered by up to 626 times (the number of indexed supernodes). Unfortunately this design would have nowhere near as quick a search time as the previous design, and as the network grows to millions of nodes it would eventually face similar scalability issues to other supernode networks.

Another major issue in the architecture is where to put file references along the path of supernodes. Currently GazNet places file references at the longest match, so if there are two supernodes ‘A’ and ‘Ae’, Aerosmith will be put on ‘Ae’; however, as the network grows this leads to some supernodes not being used at all. There are basically two choices: either the splitting supernode takes on an extra letter in its own hash space (Diagram 8.7a), or the splitting supernode allocates the next letter in the hash space to a new supernode (Diagram 8.7b).

Diagram 8.7 – The Two Ways of Creating a New Tier: (a) the splitting supernode keeps part of the new hash space itself; (b) the new letters are allocated to entirely new supernodes

GazNet uses technique (b) because it creates a more structured network and alleviates the higher tier supernodes' workload. The main problem with this scheme is that file lists are simply passed down the network and do not really begin to form a useful index until the chains are quite long and spell out larger proportions of words; this scheme would therefore be more appropriate for the alternative network that uses looping. Technique (a) would technically increase the chance of the parent supernode taking on file lists, but as both letters are the same there is little chance of many files actually needing to be stored on that supernode. Technique (b) is the appropriate choice for the dynamic splitting algorithm, as the tiers are kept very much separate and so do not interfere with each other.

Another possibility is to allow file references on any supernode along the correct path. This would increase the utilisation of every supernode, but the burden on supernodes would also increase, not just because they would be better utilised but because every supernode along the path would have to be searched to answer a query. Search time would not be heavily affected, because as soon as a supernode receives a query it can pass it on and then search its own database; working on the presumption that every supernode has equal performance, response time could actually improve, because an instance of the required file might reside on a higher tier supernode and so be returned to the searching node before the final supernode in the path has even received the query.

Probably the biggest limiting factor on scalability is the way in which the first tier supernodes act as gateways into the network; because of this, all traffic must be routed through 26 supernodes, which is clearly impractical and un-scalable. The problem can be alleviated by caching supernode addresses or by increasing the routing information held on nodes, allowing the first tier to be bypassed on occasion, but these methods will not decrease the load on the first tier by enough to make a large scale network practical. The only way to reduce the degree to which the first tier acts as a bottleneck is to increase the number of first tier supernodes, and thus the number of gateways into the network. This can be done by providing dedicated gateways that only function as a means of accessing the network rather than looking after files: working on the presumption of 625 queries per second (qps) spread over a perfectly balanced first tier, each supernode would receive roughly 24 queries per second; by giving each supernode a further 9 ‘helpers’ the rate drops to 2.4qps, an easily manageable figure that would dramatically increase the scalability of the network.

8.3.2 Robustness and Reliability

One of the biggest problems with the network is its robustness, the major issue being the effect of the loss of a supernode. The loss of a supernode in a network such as Kazaa would result in the loss of between 100 and 160 nodes from the network; the loss of a supernode in GazNet could mean the loss of a whole hash space (such as every reference beginning with the letter ‘A’). The current method used to ensure reliability is to allocate each supernode brothers that maintain the same information, such as the file and node databases; if the supernode goes down one of these brother supernodes immediately takes over, informing all relevant nodes and supernodes of the new IP address. The main problem with this is that perfectly valid supernodes that could be used in the network end up functioning as brother supernodes which, for the majority of the time, do nothing to help the performance of the network; this is exacerbated if lower performance nodes then have to be used as supernodes because higher performance nodes have already been turned into brothers.

As previously mentioned, greater robustness could be obtained by increasing the amount of routing information held by supernodes, so that if one goes offline its parent can ‘hop’ over it to the supernode below. This would dramatically increase robustness but not reliability, as the supernode that goes down still loses the information held on it, and the only way of regaining it would be to send an overlay broadcast requesting nodes to update the replacement. This area would be a good candidate for inclusion in the dynamic splitting algorithm previously mentioned: by managing each tier as a distributed database, supernodes could share their file lists with other supernodes on their tier, which could save them to disk in case they are required later. This would remove the requirement for brother supernodes and free those resources for use in the rest of the network, and it would also lower network traffic, as supernodes and their brothers would no longer have to be constantly in contact.

8.3.3 File Transfers

The transfer of files is clearly important, as it represents the longest period of time spent on a P2P network. It was discovered earlier that the multi-source download is not efficient enough to maximise download rates, mainly because of a node's ignorance of the bit rates of other nodes. To rectify this, nodes could employ two different methods. The first is simply to label each FileDescription statically with the average bit rate of the node; this would be extremely simple to do and would help a node come to an easy decision on whom to download from. The second is to find out the current bit rates of nodes by querying them dynamically; this has the advantage of returning a more accurate bit rate and will result in the best possible choice, but if there are many instances of the file on the network a node may find itself having to query many nodes to find the best ones to download from. This problem could be alleviated by combining the two methods, leaving the choice of which nodes to query for their current bit rates to the node, which would base its decision on the static bit rate included in the FileDescription; this helps nodes minimise the number of wasted queries. Once a node has all the information it requires it can make an accurate decision on where to download from. This information could also be used to distribute load over a multi-source download, so that nodes with more bandwidth are given a larger proportion of the load.

A different approach would be to utilise unused bandwidth to increase download rates when nodes are not busy. This would be done by uploading files from slow connections to fast connections when there is bandwidth to spare; these files would be kept in a folder so that when other nodes wish to download the file they can download it from the faster node. This could be taken one step further: nodes could actually upload files to free web servers, which could then supply the files to nodes; it is likely that most web servers will be faster than the majority of nodes connected to the network (including broadband connections), making the download process even faster.

8.4 Critical Summary of the Evaluation

This chapter has discussed in detail not only the shortcomings but also the better features of the system; however, it is limited in the accuracy with which it can comment. This is mainly attributable to the limited real-world testing the system has received, so many figures have had to be based on limited data; the use of simulators has aided accuracy, but still cannot be compared with the real world. Similarly, the only informed evaluation the system can receive concerns what has actually been implemented; due to the limited degree of implementation, certain areas have had to be based more on speculation than fact. The distributed nature of the system also limits how much testing can practically be performed: the highest number of nodes GazNet has run over is 16, which, although adequate for many testing purposes, is not adequate to verify that a system intended to run over millions of nodes is fully functional. Using simulators, extra test data has been accumulated, but there is much greater scope for mistakes and omissions than with a fully working system, so their output cannot be taken as complete with 100% certainty.


9 Conclusion

9.1 Review of Aims

There were 8 aims set out at the beginning of the report; the following sections analyse to what extent these aims have been met.

1) To design, develop, and evaluate a P2P network to allow the sharing of files.

This aim has been completely fulfilled: the P2P network created allows the sharing of files and has been evaluated rigorously. The network has faults and could therefore be said not to have been fully developed, but a high degree of development has taken place.

2) To allow fast, efficient searching of the network
a. To decrease search time compared to current P2P technologies
b. To allow fuzzy searches that don't require specific details to be entered, e.g. the exact name of the file
c. To limit the effect network size has on search time
d. To keep search time low even when the overlay network is heavily loaded
e. To keep searches as accurate as possible, so out of date file references aren't returned in search results, i.e. references to files that have already been removed from the network

These aims were to a great extent the core intentions of the project and therefore constitute the most important area to be completed. The structure of the network certainly fulfils the requirement for fast searching (2a) and manages impressive improvements over the search methodologies of current P2P technologies. Aim 2b is fulfilled in that individual words can be searched for as opposed to complete filenames; however, the word has to stand on its own and cannot be a word inside another word or the second part of a hyphenated word, and there is no function to retrieve misspelt words, so the word must be an exact match. 2c has also been fulfilled, as network size has a limited effect on search performance compared with a network like Gnutella; however, the architecture's current build does possess scalability problems that can affect search time, primarily manifesting as the overloading of the higher tier supernodes. 2d is quite subjective, because the ability of the network to deal with heavy loading depends largely on the first tier supernodes: if they are very high performance, search time can be kept low, but if they are low performance they may not be able to service queries fast enough to avoid a noticeable impact on search time. 2e is probably the least completed aim: the designed method for keeping file references up to date has fundamental flaws that may frequently result in out of date search results being returned, and furthermore the technique was not implemented in the application, leaving file references on supernodes indefinitely. However, file references are removed from a node's supernode when the node disconnects, as the supernode realises the TCP connection has been broken and proceeds to remove all of that node's file references. It is hard to keep networks like this up to date, but it is likely that the simplicity of the chosen method would work best; alternatively a more complex approach, such as the ‘brother’ node system described in section 3.2.5, could be taken to yield more effective results.


3) To make the network as scalable as possible, both in terms of its performance and in the maximum number of nodes that can be concurrently connected to the network.

Like Aim 2, this aim was an important objective at the heart of the project and much effort was therefore put into it. It is possible to argue both for and against its completion; the stance taken here is that it was fulfilled. Using the measure of search time, it is hard to argue that the network is incapable of scaling well: its indexed nature allows extremely fast location of the correct files and (depending on the performance of the supernodes) provides a good service even when dealing with a high number of nodes. Similarly, providing a similar files-per-node ratio to Kazaa is maintained, there is little chance of a node not being able to find a supernode to connect to. Unfortunately, it is also arguable that the aim was not fulfilled, primarily because the first tier supernodes provide the only gateway into the network and can therefore become points of bottlenecking, where more data comes in than a supernode is able to deal with. This problem could be rectified by two approaches. The first is to cache supernode locations on nodes, allowing the first tier sometimes to be bypassed; this would alleviate first tier load to a certain degree, but in reality may not reduce the load by much, as a user will not perform many searches starting with the same two letters, so the first tier supernode would have to be queried again to find the location for the latest search. The second technique is to increase the number of first tier supernodes, creating multiple gateways into the network; on a million node network, using 6 supernodes per letter would mean only 4 queries per second (qps) per supernode (presuming even distribution over the first tier and 625qps overall).

4) To make the system as reliable and robust as possible
a. The network must be able to recover from the loss of supernodes
b. The network should, to a reasonable degree, be able to withstand targeted attacks by malicious users
c. Downloads must be able to be resumed without having to re-download data already received if transfers are interrupted
d. The system should deal with errors in an elegant way

From the early stages of the project, reliability was known to be a weak point of the design, and it could be said to have been sacrificed in aid of improving search time. The technique designed to fulfil 4a (brother supernodes) is an effective and reliable method, although not without its faults; however, it was not implemented in the project (due to time constraints) and is therefore fulfilled only in design, not in the actual implementation. Aim 4b is definitely a problem area in the network: by managing to remove a single first tier supernode, a malicious user could remove a massive portion of the network. This is largely defended against by using brother supernodes, but if all the brothers are taken offline the affected supernodes will be severed with no way of recombining the network. Aim 4c has been fully accomplished without any real problems; the only possible issue is the application's inability to automatically find and restart interrupted downloads, so the user must search for the files again and resume the download manually. The system generally deals with errors in an elegant way, fulfilling aim 4d; however, there are certain occasions on which the GUI crashes, for example when a search encounters an unexpected error such as the connection failing, in which case the GUI will freeze.

5) To make the network as independent as possible, i.e. avoiding the use of centrally managed servers, removing any single point of failure and making it near impossible to shut the network down.

Like Aim 3, this is not a binary situation in which the aim can simply be said to be fulfilled or not. The network does not use centrally managed servers in the traditional sense, but an element of centralisation has been introduced in the form of supernodes; it would be near impossible to fulfil the performance requirements of this project using a fully distributed architecture, so some centralisation is undoubtedly necessary. The main problem concerns making the network near impossible to shut down: unfortunately the importance of the first tier supernodes, and the network's inability to recover from their loss, means that a concerted effort by a party with suitable resources would be able to shut the network down. By using enough brothers for the first tier supernodes it may be possible to make the network robust enough to defend itself, but by persistently performing Denial of Service attacks on the first tier its performance could still be crippled.

6) To allow sharing of files over the network
a. To allow file transfers to be performed as quickly as possible, hiding the underlying actions from the user

This aim has to a large extent been completed: files can be transferred between nodes, although the speed of the process is limited by the bandwidth available. Multi-source downloading is used to improve performance; it is not advanced or intelligent enough to maximise download performance, but in the majority of cases download speed is improved.

7) To minimise the time taken to connect to the network
a. To avoid the use of a centralised server to aid connection to the network

The unfortunate problem with this aim is that fulfilling 7a limits the degree to which the overall aim can be fulfilled, because avoiding a centralised server increases the time taken to connect to the network. The method of using supernode caches is a lengthy process compared to using a central server to aid connection, but it does work. Unfortunately, a user who uses the application sparingly may find that it takes a long time to connect because his/her node cache is out of date; an even worse scenario is that the node cache no longer contains any valid supernodes, stopping the user from connecting to the network at all.

8) To provide a highly simple, easy to use user interface
a. To let the user quickly and easily search for files without having to understand any underlying processes
b. To provide the user with some form of discriminatory information about specific downloads, so they may make an informed decision on which files to download
c. To allow the user to modify his/her shares without difficulty
d. To allow the user to view what he/she is currently sharing
e. The interface must be responsive at all times, even when actions such as a search are being carried out behind the scenes
f. The user must be provided with progress details of downloads and be told when a file has finished downloading
g. The user must be provided with an indicator (i.e. bit rate) of how quickly a file is downloading

The user interface was always a low priority requirement, but it was still a component of the project that needed to be addressed. 8a is undoubtedly fulfilled, as all a user has to do to search is click on the search tab, type in the query and click the button. Aim 8b is not fulfilled in any real terms: the GUI does provide a field in the search results table that indicates the connection speed of the node the file resides on, but the functionality behind this is not implemented, so the requirement is only fulfilled on a superficial level. Aims 8c and 8d have both been fulfilled, as the UI for dealing with shares is as simple as it can be, and the shared folders and files are displayed in an easy to view tree. 8e has been fulfilled to a large extent, as the program is sufficiently multi-threaded to stop underlying processes interfering with the responsiveness of the GUI; however, on a low performance computer it is still possible for the GUI to become unresponsive when certain processes, such as very high bit rate transfers, are taking place. Further to this, as mentioned under aim 4d, it is possible for the GUI to freeze, leaving the system completely unresponsive. Aims 8f and 8g are both fulfilled, as the user is supplied with both progress details and a bit-rate counter during transfers; both are situated in the download and upload tables.

9.2 Review of Original Project Proposal

The original project proposal described a very different system from the one that was created, mainly because of time constraints and a shift in the importance of certain aims, namely search performance. The two major changes to the project were the abandonment of DRM and the move away from a wireless platform.

The first aim was modified to exclude implementation on a wireless platform; instead the network was intended to run on a conventional network.

To allow the sharing of files between separate wireless systems.

Despite this, many of the requirements of a wireless P2P network were kept in the actual aims; a good example is fast search time.

The next abandoned aim was to give the user an incentive to share his/her files:

To provide incentives for users to allow other users to upload files from them.

This aim was not considered vital to the success of a P2P network, as many users frequently share files anyway, and it was therefore something that could be dropped to the advantage of other aims.


The last and probably the biggest abandoned aim was the decision to remove all aspects of DRM from the project:

To allow the control of the distribution of files through DRM and licensing. To aid the secure and easy transmission of licenses.

This was done because of the sheer time required to implement it, something that simply was not possible in the timescale allotted.

9.3 Suggested Revisions to Design

Some design revisions were discussed in section 8.3; however, no firm decisions were made as to which revisions should be implemented in any future extension of the project. The main issues raised were:

1) Splitting algorithm used
2) Network tree traversal
3) Architecture of the network
4) Reliability issues
5) Download techniques

Revisions 1 to 3 relate to scalability; improving the scalability of P2P networks was one of the major goals of the project, and while the overall concept used is still a valid one, certain adjustments could be made to the finer details. After the completion of the project it is clear that in its current form the design would not scale; this is not to say, however, that the design with modifications is unworkable. By changing the splitting algorithm to the dynamic algorithm discussed in section 8.3.1, large increases in efficiency could be gained; however, an alteration of this kind would not provide large improvements in scalability, but rather a small increase in scalability and a large improvement in resource allocation. To increase scalability sufficiently, changes to areas 2 and 3 must be made; appropriate changes have been outlined in the evaluation, a major issue being the limited number of instances of a file that can reside on the network, which is a direct barrier to scalability.

Reliability is probably the weakest point of the GazNet design; early in the project, decisions were made to sacrifice much of the reliability strength of fully distributed P2P models with the intention of acquiring much more powerful searching techniques. Unfortunately, however efficient the search algorithm is, it will never perform to a high standard if the underlying network is not reliable enough to provide a stable foundation for it to run over. Probably the best way to improve reliability would be to increase the routing information held on each node (both ordinary nodes and supernodes). Alternatively, bringing the architecture closer to that of Gnutella might provide a greater degree of robustness; this would likely sacrifice a degree of search efficiency but could improve the network overall, as fewer problems would be encountered.

The current download method is implemented using a simple multi-source download algorithm that splits the file into a series of segments that can be downloaded from each available source. To improve the process, the algorithm needs to be made more intelligent, and for that intelligence to function properly it also needs to be supplied with more information, namely the speed of each connection. Currently no allowance is made for the amount of bandwidth available to each source; instead every source is given an equal weighting, which is good for simplicity but does not maximise the speed of the download, as a slower connection can easily be chosen over a faster one. To improve download speed, an algorithm should be devised that calculates the fastest possible way of downloading the file, instead of simply trying to download from as many sources as possible.

9.4 Suggested Revisions to Implementation

As a design orientated project, the implementation could be said to have taken a back seat compared to the actual design of the system. However, as the design and implementation are so closely related, and as the design will never reach its full potential without a proficient implementation, some revisions should be made:

1) Improve local searching algorithms
2) Modify the user interface
3) Modify the protocol

One of the main problems with the implementation is the choice of local searching algorithm, i.e. how a supernode searches its file database. The method could be described as a hash table, but values are only hashed by their first letter, which does not create a very effective system: files are stored in an array of 26 vectors which must then be searched linearly. This method was suitable for testing the design, but if the system were deployed it would place a serious load on supernodes needing to make frequent database searches. Instead, the Java Hashtable class could be used to improve look-up time; this would make the splitting process a more complex ordeal, but as searches are far more frequent, overall efficiency would be improved. A sketch of such an index is shown below.
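The following minimal sketch uses a word-keyed map in place of the array of 26 vectors (a modern HashMap rather than the Hashtable class mentioned above); the filenames stored here are a simplified stand-in for GazNet's FileDescription objects.

```java
import java.util.*;

// Sketch: a word-keyed index for a supernode's file database, replacing the
// array of 26 vectors searched linearly. Look-up becomes a single hash-table
// access instead of a scan of every entry under one letter.
public class LocalFileIndex {

    private final Map<String, List<String>> filesByWord = new HashMap<>();

    /** Index a file under every word in its filename. */
    public void add(String fileName) {
        for (String word : fileName.toLowerCase().split("[\\s._-]+")) {
            if (word.isEmpty()) continue;
            filesByWord.computeIfAbsent(word, k -> new ArrayList<>()).add(fileName);
        }
    }

    /** Amortised constant-time look-up of all files containing the word. */
    public List<String> search(String word) {
        return filesByWord.getOrDefault(word.toLowerCase(), Collections.emptyList());
    }
}
```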

The user interface, like the searching algorithm, was mainly intended to provide a platform on which testing could easily be carried out, and the GUI provides little functionality for many actions a user may wish to perform. A good example is the ability to perform multiple searches at once; similarly, the user has little control over his/her shares beyond the ability to add and remove them. Another problem for the user is the lack of detail about search results, which makes the decision on which files to download more difficult.

Currently the protocol is a simple ASCII text-based one. This made it more extensible during the prototyping phase, but its extra size makes it inefficient and could create problems in the future. The protocol could be converted into a smaller bit-oriented one, as large savings can be made over plain text. Another significant problem is that the protocol is not encrypted. An open protocol is perfectly viable for a fully distributed network like Gnutella, but securing the protocol on a network like GazNet is imperative to limit the damage that could be done by a malicious party. A good example is the ‘190-NEW_SUPERNODE’ command which, if used improperly, could cause massive damage to the network by creating a multitude of unnecessary supernodes. A further modification concerns the transport protocol: currently everything runs over TCP, but given GazNet’s largely connectionless-orientated approach, more use of UDP could improve the network’s efficiency. One such instance is performing pings; currently, to ping a node a TCP connection must first be initiated, whereas by using UDP the cost of connection set-up could be avoided.
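As an illustration only, a UDP-based ping might look something like the following; the port number and the PING/PONG message format are assumptions rather than part of the existing GazNet protocol.

import java.net.*;

// Hypothetical sketch: ping a node over UDP so no TCP connection set-up is needed.
public class UdpPing {

    public static boolean ping(InetAddress node, int port, int timeoutMs) {
        DatagramSocket socket = null;
        try {
            socket = new DatagramSocket();
            socket.setSoTimeout(timeoutMs);

            byte[] request = "PING".getBytes();
            socket.send(new DatagramPacket(request, request.length, node, port));

            byte[] buffer = new byte[16];
            DatagramPacket reply = new DatagramPacket(buffer, buffer.length);
            socket.receive(reply);     // throws SocketTimeoutException if no reply arrives
            return new String(reply.getData(), 0, reply.getLength()).startsWith("PONG");
        } catch (Exception e) {
            return false;              // node unreachable or timed out
        } finally {
            if (socket != null) socket.close();
        }
    }
}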

9.5 Future Work

Any attempt at creating a P2P application can always be followed up with future work, because the complexity of such a system can never be dealt with fully enough to produce an optimal design. GazNet is an early implementation of what is essentially a P2P concept rather than a specific design. The concept is to use an indexed tree to allow the efficient searching of a network; GazNet embodies this principle in one particular way, not the definitive way, which leaves the task of finding that definitive way. Many design modifications have been outlined in section 7 and should certainly be implemented in future projects; the following list outlines further work that could take place.

The modification of the architecture – currently the architecture is based around the simplest premise possible: splitting each supernode by its letter range. Despite its simplicity, this method is probably the least efficient. The architecture centres on the splitting algorithm, so modifying it would be quite simple; three new algorithms are outlined in section 7.1.1, the best of which is the dynamic splitting algorithm. Any future work must undoubtedly include this, or a variation on it, if increased performance and scalability are to be gained; an illustrative sketch is given below.
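Purely as an illustration of the idea (the actual algorithm is described in section 7.1.1), a load-based split point could be chosen so that each half of a supernode's letter range carries roughly half of its indexed entries. The class and method names below are hypothetical.

// Hypothetical sketch of load-based ("dynamic") splitting: rather than always
// splitting a letter range at its midpoint, choose the letter at which roughly
// half of the supernode's indexed entries fall on each side.
public class RangeSplitter {

    // entriesPerLetter[0..25] holds how many index entries start with 'a'..'z'.
    // Returns the index of the first letter that should move to the new supernode.
    public static int chooseSplitPoint(int[] entriesPerLetter) {
        int total = 0;
        for (int i = 0; i < entriesPerLetter.length; i++) {
            total += entriesPerLetter[i];
        }
        int runningSum = 0;
        for (int i = 0; i < entriesPerLetter.length; i++) {
            runningSum += entriesPerLetter[i];
            if (runningSum >= total / 2) {
                // letters [i+1 .. 25] go to the newly created supernode
                return Math.min(i + 1, entriesPerLetter.length - 1);
            }
        }
        return entriesPerLetter.length - 1;
    }
}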

Better support for internet use – as stated earlier, the implementation was more a technique for testing the design than an application that could actually be released, so the practicality of using GazNet over the internet is limited. The ultimate test is use by end users, and the success of the network will never be known until it is run in real conditions. Kazaa implements methods to improve its internet performance, such as programming nodes to prefer connecting to supernodes with smaller round-trip times. It would be worthwhile to extend these sorts of internet-oriented techniques to GazNet and allow its use in real conditions to discover its real efficiency; one simple illustration follows.
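A very simple, hypothetical way of preferring nearby supernodes is to time a connection attempt to each candidate and pick the quickest; the two-second timeout and the use of TCP connect time as an approximation of round-trip time are assumptions made only for this sketch.

import java.net.*;
import java.util.*;

// Hypothetical sketch: measure the round-trip time to each candidate supernode
// (approximated by the time taken to open a TCP connection) and prefer the closest.
public class SupernodeSelector {

    public static String closest(List<String> candidateAddresses, int port) {
        String best = null;
        long bestRtt = Long.MAX_VALUE;
        for (String address : candidateAddresses) {
            long start = System.currentTimeMillis();
            try {
                Socket probe = new Socket();
                probe.connect(new InetSocketAddress(address, port), 2000);
                probe.close();
                long rtt = System.currentTimeMillis() - start;
                if (rtt < bestRtt) {
                    bestRtt = rtt;
                    best = address;
                }
            } catch (Exception e) {
                // unreachable candidate: ignore it
            }
        }
        return best;   // null if no candidate answered
    }
}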

Expanding the applications the system can be used for – currently the only action GazNet can perform is file transfer, but P2P search engines can be used for a multitude of applications, ranging from live video streaming to sharing web content. It is frequently noted that much of the world wide web is not indexed by internet search engines; by giving people the ability to host their own web sites on their own computers and allowing others to find them through the GazNet search engine, many who would not previously have considered creating a web site could do so, while also increasing the proportion of the web's massive content that is indexed.

The inclusion of DRM – in the original project proposal DRM was to be a key part of the system, but due to time constraints it had to be dropped. DRM is a major issue in computer science and is likely to be incorporated more heavily into P2P technology as time goes on, so it is something that any future work in the area should give serious thought to.

The running of the network on a wireless platform – as well as DRM, another element of the original project proposal that was dropped was the implementation of the network on wireless technology. Like DRM, wireless P2P networks are certain to enter the research domain in a more prevalent manner soon. The concepts and requirements of GazNet are heavily based around the needs of a wireless network user, so porting it onto a wireless platform would be an interesting exercise.

9.6 Lessons Learnt

The project, although in some ways a failure, provided many benefits. Leaving aside the experience gained, the knowledge acquired through both the research and the implementation has vastly extended my understanding of P2P technology and of networking in general. Further to this, my Java programming skills have greatly expanded, not only in networking but also in Swing, security, design patterns and data structures such as hash tables, to name a few. There is no doubt that these newly acquired practical skills will prove useful in later life. On a less practical note, new skills such as scheduling and evaluation have also been gained.

One of the major lessons learnt relates to ‘jumping’ into implementation before the appropriate design has been completed. A good example of this was the separate programming of the Supernode and Node classes; it was later realised that it would be more efficient to sub-class Supernode from Node, so alterations had to be made. Similarly, implementation began on areas such as the NetworkStartup class before the design had reached an appropriate level of detail. Such actions led to inefficient programming and time wasted fixing it. Another lesson is to carry out a more thorough analysis of the design before implementation begins. The evaluation is littered with criticisms of the original design and ways of improving it, and much of it was based not on data obtained from the implementation but on simulators and mathematics showing that the implemented system was not practical. Had this analysis been carried out before implementation, as it could have been, a much more efficient system could have been created.

9.7 Final Conclusion

This project was, from the very beginning, aiming to test a concept rather than implement a specific design. GazNet could be considered an application, a protocol or a design, but a more accurate description is an idea: to decrease P2P search time by using an indexed tree of supernodes, with everything around it merely an issue of a particular design. GazNet could be said to be a conceptual success but an implementation failure; the finished program both lacked elements of the design and possessed undesirable elements that should have been left out, yet despite this it worked in controlled situations. The simulators, created primarily for evaluative purposes, showed that the concept is workable and, with modifications, could provide the basis for a very effective P2P network. The implementation was never created with the aim of real-world deployment; instead it was to be used to test the viability of the network and bring out any important issues that would need to be dealt with, something it has done admirably. However, these positive points should be accompanied by some more practical notes: yes, the implementation does serve its purpose by discrediting parts of the design and helping to locate areas that need improvement, but it would have been better if, instead of revealing faults, the design had worked to a greater extent and thereby supported the GazNet concept more strongly.

This project is undoubtedly an evaluation-orientated one and has certainly performed that task well. Much more is now known about the limitations of the network, and any further attempts will owe much to this original design. Even if no further work is carried out on GazNet, with a little more time a fully functioning P2P network with some very advantageous features could be completed, one fully suited to running over a LAN and offering many benefits over networks like Gnutella.


References

- Adar, E. and Huberman, B. (2000). Free Riding on Gnutella. http://firstmonday.org/issues/issue5_10/adar/index.html

- Balakrishnan, H. and Kaashoek, F. and Karger, D. and Morris, R. and Stoica, I. Looking up Data in P2P Systems.

- Boger, M. (2001). Java in Distributed Systems

- Buhnik, T. and Laura, L. and Leonardi, S. (2003). A phenomenological analysis of the peer-to-peer networks based on Gnutella protocol

- Clark, J. Accessibility implications of digital rights management http://www.joeclark.org/access/resources/DRM.html

- Dabek, F. and Brunskill, E. and Kaashoek, F. and Karger, D. Building Peer-to-Peer Systems With Chord, a Distributed Lookup Service.

- Digital Rights Management and Privacy http://www.epic.org/privacy/drm/

- D-Lib magazine June (2001). Digital Rights Management (DRM) Architectures http://www.dlib.org/dlib/june01/iannella/06iannella.html

- Jovanovic, M.A. Modeling Large Scale Peer-to-Peer Networks and A Case Study of Gnutella

- Kazaa Homepage, http://www.kazaa.com

- Leibowitz, N. and Ripeanu, M. and Wierzbicki, A. (2003). Deconstructing the Kazaa Network

- Liang, J. and Kumar, R. and Ross, K. W. Understanding Kazaa.

- Miller, M. (2001). Discovering P2P.

- Moore D. and Hebeler, J. (2002). Peer-to-Peer: Building Secure, Scalable, and Manageable Networks

- Peterson, L.L. and B.S Davie (2003). Computer Networks: A Systems Approach

- Rao, S. Chord-over-Chord Overlay


- Rich, L. It's Who You Know, Up to a Point - Massive Media - Company Business and Marketing. http://articles.findarticles.com/p/articles/mi_m0HWW/is_44_3/ai_66678864

- Ripeanu, M. Peer-to-Peer Architecture Case Study: Gnutella Network

- Ritter, J (2001). Why Gnutella can’t scale. No really.

- Stoica, I. et al. (2001). Chord: a scalable peer-to-peer lookup service for internet applications, ACM SIGCOMM.

- The Gnutella Protocol Specification v0.4 http://www.clip2.com

- Wikipedia. http://en.wikipedia.org/wiki/P2p#Usage_of_the_term

- Xie, M. P2P Systems Based on Distributed Hash Table

- Zhao, B. Y. et al. (2004). Tapestry: A resilient global-scale overlay for server deployment


Appendix

Proposal for Wireless Peer 2 Peer File Share application

1. Abstract

The aim of this project is to provide a facility for people using wireless hardware such as PDAs to access and share a variety of files such as MP3s and documents. There is a need for this because people may frequently find themselves in situations where it is not possible to access a regular file-sharing program such as Gnutella. The project focuses on the lower-level aspects of routing and how to make the system run as effectively as possible. Unlike someone using a desktop PC with broadband, a wireless user pays far more for their connection, and it is also a lot slower, so the system must be more efficient. The project also aims to limit the proliferation of material by using some form of Digital Rights Management. The DRM should allow every user on the network to protect their material and enable copyrighted material to be shared securely and legally. The outcome of the project will hopefully be a fast and effective way to share files over a wireless connection that rectifies some of the major flaws of other file-sharing programs such as Gnutella.

2. Introduction

The essential idea of P2P file sharing is not a new concept; however, implementations of it have often been flawed and inefficient. In certain circumstances (e.g. a fast connection or small files) this can be tolerated, but in others such problems are unacceptable, and a wireless file-sharing program is one of those cases. The reason is that, unlike a permanent broadband connection, a wireless connection is expensive, so what possible reason would someone have to allow others to upload files from them when it costs them money? A second problem is that an active connection consumes a great deal of battery life, and not many users would be happy about having to repeatedly charge their battery simply to share their files.

Therefore a wireless file-sharing program needs certain improvements over its standard desktop counterpart. To start with, network traffic has to be kept to a minimum, which means an efficient routing algorithm must be devised so that information propagates through the network with a minimum amount of data transfer; techniques must also be found to improve the speed of downloads so that users can access files more quickly and therefore save money and battery power. Finally, some sort of incentive must be found for users to allow others to upload files from them; the improved download time will form part of the incentive, but users could also be given something like improved access or a higher status the more files they share. Another incentive would be some sort of pecuniary advantage for sharing files; this would be harder to implement than the other incentives, but one way would be for users to sell their files to others instead of simply providing them for free, which leads on to the concept of DRM.

The next stage in the file-sharing process is the actual control over the spreading of files. As anyone who watches the news will know, music companies have serious problems with the ease with which MP3s can be downloaded off the internet; this leads to the need for some way to control how digital material is accessed. Digital Rights Management (DRM) is the answer. Currently DRM is in its infancy, and it needs to be advanced to allow users to confidently distribute licensed material, safe in the knowledge that only authorised users will be able to use it. This P2P application aims to incorporate DRM into the process of sharing files and therefore allow users to share their files safely and securely. One possible use of DRM links back to the need for incentives: by protecting files with licences, users can sell their files instead of just providing them freely to all, which will encourage sharing because the more files that are uploaded, the more money the user will make. Users can then share their free files alongside their licensed files so that others have access to the maximum amount of data possible.

This report outlines various aspects of this project and will hopefully inform the reader of the approach that will be taken and any problems that may arise.

3. Background

As said previously, P2P file-sharing programs are not a new phenomenon, the first being Napster. Ironically, however, Napster was not actually a true P2P system; in fact it used a central server that stored a list of the files currently on everyone's computer, and the only stage at which true P2P occurred was the final one, in which the file was actually downloaded. The use of a brokered service was one of Napster's most efficient features, but it was also its downfall. A central database provided quick and efficient searching, but because the owners therefore knew what was being distributed on their network, it made them responsible for it; if users were sharing copyrighted material, the owners of that material had a good case for shutting the system down (as they did). Another problem with a brokered service is that it is very fragile: if the central server goes down, the whole network stops working, which made the closure of Napster very simple. The moral of the Napster story is that a brokered service is impossible to run unless very stringent control is kept over the files being shared.

After Napster was shut down, a gap opened for other file-sharing applications to be developed. The client-server model was now considered unworkable, as shared copyrighted material would have to be removed by the administrator (and as this constitutes the biggest share of the files, it would make the system very unpopular); therefore a true P2P system had to be developed in which the content could not be blamed on its creators. Several systems arose, one of them being Gnutella, a true P2P system; even if a government wished to shut it down it would be nearly impossible, as the network is fully independent – every peer is equal and there is no reliance on a central server. Although true P2P has obvious advantages, the big drawback is inefficiency: compared to a client-server system, Gnutella is very slow and creates far too much network traffic. This is because, when a search is performed, instead of sending a query to a server the node must propagate the query to every node on the network in real time; this involves sending the query to all the nodes it knows, then those nodes sending the query on to every node that they know. The system quickly gets out of hand when nodes start receiving the same query multiple times because different nodes are connected to the same node. Another system, KaZaA, tries to rectify this by using ‘Super Nodes’: computers with high-speed connections which can take on more than their fair share of traffic, thereby making the whole process quicker.

The next requirement for the system is some sort of Digital Rights Management (DRM); like P2P it is not a new concept and there have been a few predecessors. DRM can occur in many different forms, ranging from encrypting files to the construction of a complex web following the transfer of files. The urgent need for an effective form of DRM was recognised when copyrighted files began to propagate illegally over the internet, frequently through the use of P2P applications. Commonly DRM is segregated into two generations; the first uses encryption techniques to prevent the basic problem of copying files to other computers. The second generation covers a more diverse range of requirements: description, identification, trading, protection, monitoring and tracking of all forms of rights. Currently DRM is implemented using some form of encrypted wrapper around the file, or using a tag that indicates what access rights users have. DRM can reside in many different areas such as the operating system or the actual media itself; the latest version of Windows Media Player currently implements DRM.

The next question is why an existing file-sharing program cannot simply be ported onto a wireless device. The answer is that the needs of a mobile user are significantly different from those of a desktop user. Mobile users will obviously spend most of their time during use on the move, which means they are unlikely to use the internet leisurely to pass time, unlike a desktop user who may often sit down and spend hours browsing simply for entertainment. Instead they will use the internet with a specific aim, such as checking their e-mail, so users may just pop online for a few minutes and then disconnect again; this sporadic use really undermines file-sharing software. The main problem with wireless P2P therefore stems from people's reluctance to use it; the cost of a wireless connection, both in battery power consumed and in monetary terms, is very high compared to a broadband connection, which leads to the attitude of “Why should I share my files when it costs me money to do so?” Two differences must therefore exist between wireless P2P and desktop P2P: the first is to improve the efficiency of the network so things can be carried out as quickly as possible, and the second is to give the user some sort of incentive to actually share their files and stay online long enough for someone to download something from them.

One of the techniques KaZaA uses to encourage the sharing of files is to allocate different users different rights; depending on how many files have been uploaded from you, a rating is assigned that entitles you to a certain level of service, i.e. the more you share, the greater the performance you get from the network. This idea will be incorporated, but it is unlikely to be a big enough incentive, especially if there is not much difference between a high rating and a low one. A better idea would be some sort of monetary incentive; this is harder to implement but would be more effective. The use of DRM makes this idea viable, as people would be able to possess licensed material which they could sell over the network; alongside the saleable material they could also provide free material which people could upload, meaning the more often you are connected to the network, the more money you could make. A related idea is the ability for people to distribute material quickly and effectively. For example, if a wireless user spots Brad Pitt kissing someone other than his wife, he could take a picture, put it on the P2P network and license it out; all a newspaper would then have to do is carry out a search on the network and purchase (using DRM) the image from the owner. As cameras become more and more common in phones this concept becomes more and more viable, and it would shift the monetary cost of sharing files from the owner to the downloader, essentially turning the P2P network into a massive digital marketplace.

4. The Proposed Project

The overall aim of the project is to create an efficient, high-speed wireless P2P file-sharing program that can be independently managed without the production of masses of network traffic.

The individual aims and requirements are:

- To allow the sharing of files between separate wireless systems.
- To allow the quick, easy and efficient searching of files over the system.
- A fast setting up of the network that creates a minimum amount of traffic.
- To make the download time as short as possible for both parties.
- To provide incentives for users to allow other users to upload files from them.
- To make the system as independent and as manageable as possible.
- To allow the control of the distribution of files through DRM and licensing.
- To aid the secure and easy transmission of licenses.

Currently the approach that appears most efficient is to use a system similar to KaZaA, which uses super nodes to improve efficiency. However, although super nodes dramatically improve performance, for wireless hardware they could probably be modified to be even more effective. There is far less diversity in performance between wireless handsets than between desktop computers, so no single handset could take on much more work than any other; this means super nodes would have to be used in a different fashion. Super nodes will be treated as a set of connected servers (effectively one large distributed server) and files will be searched in a manner similar to DNS queries. The actual downloading of files will occur between the two parties without any other intervention, but where possible files will be downloaded from as many separate users as possible (in a similar way to FreeNet) to minimise the time each user spends having files uploaded from them.

To connect to the network, a new node would send out a broadcast message asking for the location of a super node; when it receives an answer it will try to connect to that node and send it a list of its files. The super node will add the files to its database and the new node will then be connected. To search, a node would simply query the super node's database and receive the location of the file as an answer. If the file does not reside in the database, the query will be directed up to a higher super node, and so forth, until the file is found (or not found, as the case may be). A sketch of this escalation is given below.
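The following is a purely hypothetical sketch of that escalation behaviour; the class name, the in-memory database and the parent link are illustrative assumptions and do not correspond to any existing code.

import java.util.*;

// Hypothetical sketch of the proposed search behaviour: answer from the local
// database if possible, otherwise forward the query to the parent super node.
public class ProposedSupernode {

    private final Map database = new Hashtable();   // file name -> location of the file
    private final ProposedSupernode parent;         // null at the root of the hierarchy

    public ProposedSupernode(ProposedSupernode parent) {
        this.parent = parent;
    }

    public void addFile(String fileName, String location) {
        database.put(fileName, location);
    }

    // Returns the location of the file, or null if it is not found anywhere.
    public String search(String fileName) {
        String location = (String) database.get(fileName);
        if (location != null) {
            return location;                        // found locally
        }
        if (parent != null) {
            return parent.search(fileName);         // escalate to the higher super node
        }
        return null;                                // reached the root without a match
    }
}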


The role of super node (or dynamic server) will be allocated as and when it is needed: any node with sufficient performance will be able to act as a super node, and it will be given the task by an already existent super node. This creates a hierarchical structure that is therefore quite easy to maintain. If one super node receives too many connections, it will select a new node to be a super node and direct new connection requests to that node, meaning the network could be compared to a linked list of super nodes; the sketch below illustrates the idea.
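Again as an illustration only, the overflow-and-promote behaviour might look like the following; the connection limit of 50 and the omission of any capability check are assumptions made for brevity.

import java.util.*;

// Hypothetical sketch of the proposed promotion scheme: when a super node has
// too many connected nodes, it promotes a new super node and redirects further
// connection requests to it, forming a chain of super nodes.
public class ConnectionManager {

    private static final int MAX_CONNECTIONS = 50;   // assumed limit

    private final List connectedNodes = new ArrayList();
    private ConnectionManager nextSupernode;         // the super node created on overflow

    // Returns the super node that actually accepted the connection.
    public ConnectionManager accept(String nodeAddress) {
        if (connectedNodes.size() < MAX_CONNECTIONS) {
            connectedNodes.add(nodeAddress);
            return this;
        }
        if (nextSupernode == null) {
            // Promote one of the connected nodes (capability checks omitted here).
            nextSupernode = new ConnectionManager();
        }
        return nextSupernode.accept(nodeAddress);    // redirect down the chain
    }
}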

There are a variety of ways in which DRM could be implemented. One would be to maintain a central server that could track the distribution and use of files, but this would go against the aim of making the application as independent as possible. A different method must therefore be used; in this situation the best alternative would be to ‘tag’ files with a set of rights, although this would mean that the software used to open the files would need to implement a compatible form of DRM. An alternative would be some form of encrypted shell, which would be the easiest to implement and would not require a central server, but it suffers from a lack of functionality: the only problem it really addresses is the illegal copying of files. The best choice is therefore to ‘tag’ files to control their distribution; these tags can be hardware-specific and therefore only usable on one person's machine, so once someone possesses a file they cannot distribute it outside the terms of the licence. A sketch of such a tag follows.
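As a purely illustrative sketch of such a tag; none of the field names, the string hardware identifier or the expiry mechanism come from the proposal itself.

import java.io.Serializable;

// Hypothetical sketch of a licence "tag" attached to a shared file, bound to a
// single hardware identifier so the file cannot be used on another machine.
public class LicenceTag implements Serializable {

    private final String fileName;
    private final String ownerId;        // who issued the licence
    private final String hardwareId;     // machine the licence was granted to
    private final long expiryTime;       // millisecond timestamp, 0 for no expiry

    public LicenceTag(String fileName, String ownerId, String hardwareId, long expiryTime) {
        this.fileName = fileName;
        this.ownerId = ownerId;
        this.hardwareId = hardwareId;
        this.expiryTime = expiryTime;
    }

    // A player or viewer would call this before opening the file.
    public boolean permitsUseOn(String localHardwareId) {
        boolean rightMachine = hardwareId.equals(localHardwareId);
        boolean notExpired = expiryTime == 0 || System.currentTimeMillis() < expiryTime;
        return rightMachine && notExpired;
    }
}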

Testing may be difficult, as ideally a small number of wireless handsets would be used. A few of these can be used for simple black-box testing to ensure that the system works on them, but the majority of testing must be done on a normal desktop PC simulating a wireless environment. The application can be distributed to a number of people under some form of transaction monitoring, which will allow many flaws to be spotted. This monitoring could take many forms, the most obvious being to monitor the network side of the application: the time taken to connect to the network, to download files, to search for an item, and so on. Such things are very easy to monitor; beyond this, more sophisticated observations could be made, such as how long it takes the user to carry out a certain task. Similarly, the overall efficiency can be found by including an additional feature in the beta version that records properties such as connection time.

5. Programme of Work

The project will be carried out in the first 20 weeks of the academic year of 2004/2005.

Tasks

1. Research current P2P and DRM technology.
There is currently a great deal of interest in these topics and they form a popular area for research. Further to this, the economic need for DRM technology means companies like Microsoft are keen to develop this type of software. The popularity of P2P applications also means that there are quite a few around, which results in a lot of example software to research.

2. Design of routing, searching and downloading algorithms.
The networking side of the application constitutes the most important aspect of the project. The design should focus on the protocols used and the way information is propagated through the network.

3. Design of DRM algorithms.
The method of DRM used is very important as it will affect all users of the network. Given the evident complexity of DRM, a lot of work should be put into its design, as many errors are likely to arise if it is designed badly.

4. Implementation of networking aspects.
The networking layer must be implemented first as it provides the basis on which the other aspects are created. It will be built using a basic command-line interface to detach the networking from the higher-level parts of the project.

5. Implementation of DRM.
Once the networking foundation has been laid, the DRM functionality can be implemented.

6. Design of User Interface.
Once the actual functionality of the application has been finished, the user interface can be designed. This is left until after the implementation of the other parts because it is likely that new features will be developed during implementation, so it is best left until later.

7. Implementation of the User Interface.
The user interface will be implemented and mapped to the underlying functionality of the application.

8. Testing.
All aspects of the application will need to be tested (and fixed if necessary), with special emphasis on the scalability of the networking algorithms.

9. Analysis of test results.
Once the system has been tested, the results should be analysed and any improvements or desirable extra features added.

10. Writing of Report.
Once the system is finished, everything should be documented.

6. Resources

The obvious resource requirement is one or two wireless systems on which to test the system; a certain amount of testing needs to be carried out to ensure that areas such as the user interface work well on wireless hardware, although it is less important for the lower-level functionality to be tested on a wireless system as it can be simulated on a desktop PC. This leads to the next hardware requirement, a number of PCs: ideally the system would be tested on a very large number of computers to see how well it scales, but as this is not possible it could be tested on about 50–100 PCs. This would be easy to carry out in one or two computer labs late at night when nobody is using them. Alternatively the application could be distributed amongst a number of people who could test it by simply installing it on their own computers. This would only be needed once, during the evaluation period of the project; throughout development, testing can simply be carried out on one Java-enabled PC.

The software requirements are more modest, simply a programming language (Java) and a small number of utilities to help development and testing. These include something to help in the design of the user interface such as a GUI builder and some network utilities like a packet sniffer.

7. References

Books

- Discovering P2P by Michael Miller, First Edition 2001.

- Peer-to-Peer: Building Secure, Scalable, and Manageable Networks by Dana Moore and John Hebeler, First Edition 2002.

- Java in Distributed Systems by Marko Boger, 2001.

Websites

- Accessibility implications of digital rights management by Joe Clark. http://www.joeclark.org/access/resources/DRM.html

- Digital Rights Management (DRM) Architectures by D-Lib magazine June 2001 http://www.dlib.org/dlib/june01/iannella/06iannella.html

- Digital Rights Management and Privacy http://www.epic.org/privacy/drm/

- It's Who You Know, Up to a Point - Massive Media - Company Business and Marketing by Laura Rich. http://articles.findarticles.com/p/articles/mi_m0HWW/is_44_3/ai_66678864


Deviations from Original Project Proposal

The project has deviated from the original proposal in three main ways: the first is the move away from a wireless platform, the second is the exclusion of any implementation of DRM, and the third is a modification of the network architecture.

Move away from a Wireless Platform

This was done because the system was originally developed using J2SE, since it was necessary to have standard computers running the application in order to function as supernodes. It was then intended that a J2ME version would be developed, but due to time constraints this was not possible. The system still maintains the aims and requirements of a wireless system but does not implement any functionality that would allow it to run on a wireless platform.

Exclusion of DRM

This was done solely because of time constraints; there was a strong move towards the networking side of the project, and large quantities of time were spent on it in an attempt to improve things like search time. Unfortunately this was done at the expense of DRM.

Modification of Network Architecture

The original intention was to create a Kazaa-like supernode network; however, during the design it was decided that this network was not efficient enough to run over a wireless platform. The next stage was to develop a different type of network with more efficient searching algorithms, which embodied itself in the design seen in this project.

Approval by Supervisor

All these deviations were approved by the supervisor. The change in network architecture was established early on in the project, whereas the other two deviations were only decided upon at the beginning of the second term.
