
Saarland University
Faculty of Natural Sciences and Technology I
Department of Computer Science
Bachelor’s Program in Computer Science

Bachelor’s Thesis

Semantic Overlay Networks for Peer-to-peer Web Search

submitted by

Tim Benke

on 13.04.2007

Supervisor

Prof. Dr.-Ing. Gerhard Weikum

Advisors

Josiane Xavier Parreira
Sebastian Michel

Reviewers

Prof. Dr.-Ing. Gerhard Weikum
Dr.-Ing. Holger Bast


Statement

Hereby I confirm that this thesis is my own work and that I have documented all sources used.

Saarbrücken, 13.04.2007

Declaration of Consent

Herewith I agree that my thesis will be made available through the library of the Computer Science Department.

Saarbrücken, 13.04.2007


Abstract

We consider a network of peers, where each peer has its own collection obtained by individually crawling the web. When designing a distributed search system for such networks, an important task is how to efficiently perform query routing, i.e., how to find the most promising peers to answer the query. However, the efficiency of those routing techniques depends heavily on the underlying network organization. Therefore, previous works have proposed the creation of semantic overlay networks (SONs), where peers are grouped according to their contents. In this work we present a rather different notion of SONs, where peers are free to decide to which peers they want and to which peers they do not want to establish a connection, following the idea of the recently proposed p2pDating algorithm. We consider a P2P network that uses Pastry as the underlying network infrastructure, and where peers use Nutch to perform web crawls and Lucene to build local indexes. We show that SONs can greatly reduce the amount of traffic needed for answering a query, while still maintaining a high recall.


Contents

1 Introduction
2 Background
   2.1 P2P Networks
      2.1.1 P2P Networks with Central Server
      2.1.2 Unstructured P2P Networks
      2.1.3 Structured P2P Networks
   2.2 Searching with and without Semantic Overlay Networks
3 Related Work
4 p2pDating
   4.1 Measures to Define Friends
      4.1.1 History
      4.1.2 Overlap
      4.1.3 Similarity
   4.2 Strategies for Scores
5 Design
   5.1 BatchCrawl
   5.2 Messages, Protocol
      5.2.1 Dating
      5.2.2 KeepAlive
      5.2.3 Querying
6 Implementation
   6.1 Implementation of Algorithms
      6.1.1 p2pDating
      6.1.2 MIP Computation
      6.1.3 KLD Computation
   6.2 Pastry
   6.3 Lucene, Nutch
      6.3.1 Query Types
   6.4 Summary of implemented Classes
7 Experiments
   7.1 Experimental Setup
      7.1.1 Network Setup
      7.1.2 System Setup
      7.1.3 Data from: del.icio.us, dmoz.org
   7.2 Recall Computation and Analysis Tools
      7.2.1 Tests performed
   7.3 Experimental Results
      7.3.1 Experiments
   7.4 Summary of Results
8 Conclusion
   8.1 Contributions
   8.2 Experiences
   8.3 Future Research


1 Introduction

Two important types of network architectures are the client-server architecture and the P2P architecture. One example of the client-server architecture is the early web: many passive web servers serving content are contacted by clients posing requests. The servers never initiate connections to the clients, and clients are not contacted by other clients.

This clear distinction between client and server is broken by the P2P paradigm. All nodes work in similar roles and form a distributed system. As all nodes in a P2P network have the same functionality, we will use the terms node and peer synonymously throughout this document. Each peer serves as a server to other peers and can also act as a client to request information from other peers.

The concept of P2P systems is related to distributed computing but differs in some aspects. In distributed computing the goal is usually to solve one particular problem, while peers in a P2P system are often considered to be selfish, i.e., they try to maximize their personal gain, as each peer is typically represented by a real-world user. One popular example of this selfish behavior is the problem of free-riding, i.e., many peers only try to get some benefits out of the system without providing services to other peers. Moreover, P2P systems have to cope with high dynamics, i.e., peers are entering the system at high rates or leaving the system without prior notice. Compared to this, the network organization in a traditional distributed system is rather static.

The first P2P file-sharing application that gained public recognition was Napster, released in 1999. After the shutdown of the original version for legal reasons, other protocols like Gnutella gained widespread distribution. Gnutella was created in 2000 and absorbed many of the former Napster users. Because of scalability problems, the version at that time could not cope with the increased traffic [31].

With new protocols came also new forms of organization of P2P networks. Whereas Napster depends on a central server for query execution, networks like Gnutella are fully decentralized P2P networks. Later, in 2001, the concept of distributed hash tables (DHTs) was developed; it proved to be superior in terms of scalability and is now supported by many clients of BitTorrent, another now popular P2P file-sharing application. DHTs make it possible to efficiently map and route keys or “IDs” to values or peers.

With this thesis we want to build a distributed web search engine. We consider a network of peers that have collections of webpages. The task is to make it possible to perform full-fledged searches for multi-keyword queries and to make the system appear to the user like a usual centralized search engine. In our work this is complicated by the dynamics and the disorganization of a P2P network.

According to [35], P2P networks today account for an estimated 50% of the whole internet traffic.

Semantic Overlay Networks (SONs) are a form of network organization that is used to reduce the number of peers that need to be contacted during query execution. A SON is a group of peers with semantically similar content. Queries can then be routed to the appropriate group and be answered by peers with similar content. This way the quality of results is increased while simultaneously reducing the number of peers that need to be queried. Usually, SONs are defined as groups of peers that have a common interest or that were classified as having documents on a certain topic. Of course the classifier should be the same for all the peers, which means that one needs a kind of centralized algorithm, or strong agreement on the classifier that classifies the peers. Different from that, we let each peer autonomously build a group of its preferred peers. So each peer can use different strategies to define which peers it wants to have in its SONs. One part of these strategies is a semantic comparison of the collections.

In this work we show how peers autonomously build a SON and how queries are executed. We performed experiments showing that the number of queries needed is reduced when one queries the groups built using our algorithm.

The rest of this document is organized as follows: Section 2 is a wrap-up of background information on the different types of P2P networks, on searching in regular P2P networks without SONs, and on the benefits that can be gained from searching with SONs. Section 3 gives an overview of related work that also deals with distributed search. Section 4 is about the original p2pDating paper by Parreira et al. [34]. The system design and the implementation details are described in Sections 5 and 6. The experiments showing the benefit of p2pDating can be found in Section 7. Conclusions, a summary of contributions, and directions for future research can be found in Section 8.

2 Background

2.1 P2P Networks

There exist many definitions of what a P2P system is, and file-sharing networks like Napster, Gnutella, and eDonkey are very popular today. A compelling and concise definition of P2P is given in [33]:

“A self-organizing system of equal, autonomous entities (peers) [which] aims for the shared usage of distributed resources in a networked environment avoiding central services.” In short, it is a system with completely decentralized self-organization and resource usage.

For the field of web search, P2P networks have some very interesting features:


• Equal Partners: The main difference to client-server systems is that P2P networks are a collaboration between equal partners which have the same functionality. This means that a symmetric protocol is used on all peers.

• Robustness: A client-server architecture fails when the server crashes. In contrast to that, a P2P network survives even if some of the peers fail. The usual web search engines have to solve this problem with many redundant servers in a server farm.

Note that Napster is considered to be a P2P network although it uses a central server for the location of content and peers. However, most other P2P networks have no single point of failure and are therefore also more resistant to denial-of-service attacks.

In addition, if an attacker has taken over one server, he can control the server and its content as he wishes. A web search engine distributed over many peers, however, is resistant to attempts to control the content unless an attacker can control all the computers. No individual can deteriorate the whole content. Normally, each peer contributes only one small collection to the whole content of the P2P network.

• Distribution of Content, Computing Power, Bandwidth: This is probably the most important point. Each peer may contribute only a small amount of resources to the network, but the individual resources add up to something bigger. The theoretically unbounded number of peers can make a P2P system superior even to a supercomputer. So generally, P2P networks are more scalable than normal client-server architectures, especially when a lot of content or computing power is needed, as in the SETI@home project [36].

• Anonymity, no Censorship: Since the resources are spread over the peers instead of being stored in a single location, P2P web search can in principle provide better immunity to search result distortion by the bias of big providers, commercial interests, or even censorship.

• Dynamics: P2P networks can cope with the frequent joins and failures of peers. Additionally, this encourages P2P networks to use a different way of identifying peers than the frequently changing internet address.

With the growth of the internet in terms of bandwidth and users, it is clear that the classic client-server architecture cannot cope with the rising requirements. Because of its centralized nature it is prone to resource bottlenecks. Consequently, client-server applications are hard to maintain and can, for many of these problems, be replaced by much simpler P2P applications.


Based on the properties mentioned, P2P systems can be further divided into two different classes. On the one side are the totally decentralized systems, and on the other side are the solutions that use both direct P2P connections between the peers and a central server. In between are systems that have some privileged peers which handle groups of normal peers. A central server is used, for instance, to make it easier to locate content and peers and to manage peers joining and leaving.

One example of the decentralized extreme is Gnutella 0.4, while all instant messaging services, Skype, and Napster lie in the latter category.

One of the main challenges of Peer-to-Peer systems lies in the decentralized self-organization of a distributed system: achieving a high quality of service without the need for centralized services. With this goal in mind, it also becomes important how this organization is achieved. One idea can be to just connect the peers randomly and try to solve any tasks or problems in an ad hoc way. Another idea is to build a structure which is maintained when peers join and leave the network. This structure can then be used to locate content and/or peers.

In general, P2P networks can be classified into P2P networks with a central server, unstructured P2P networks, and structured P2P networks.

2.1.1 P2P Networks with Central Server

The idea of this approach is very simple: peers do not have to deal with much complexity because the server solves the issues of peer and content location. The first file-sharing application, Napster, is one example of this early concept of P2P networks.

In Napster, the peers index their content and send their index to a central server, which stores it together with the address of the peer. Queries can then be sent to this server, and it answers with the addresses of other users. Only then is a P2P connection between the querying peer and the other peers holding the content initiated.

2.1.2 Unstructured P2P Networks

An unstructured P2P network is formed when the links between peers are established arbitrarily. These networks can be easily constructed, as a new peer that wants to join the network can simply copy existing links of another node and then form its own links over time. The advantage of this approach is its simplicity, which allowed building some of the early decentralized P2P networks. The popular file-sharing protocol Gnutella is an example of this idea.

Joining a Gnutella network is done by connecting to arbitrary peers in an existing network. For querying, Gnutella makes use of a flooding technique: a peer sends a query to all its neighbors, i.e., all directly connected peers, which do the same with all their neighbors until a hopefully sufficient portion of the network is “flooded”. To make this practical, Gnutella uses a time-to-live counter for each query, e.g., 7 hops. This means that each peer which receives a message decrements the counter and sends the message to all its neighbors, until a peer receives a message with a counter of value 0. This approach suffers from the shortcoming that there will be many redundant messages, since many peers will receive the same message several times due to the fact that the P2P network is not a tree. Besides, some peers will not be able to contribute to the results because they do not have relevant information, and peers in the network which have the information might not be reached.

So-called bottlenecks can have a negative effect in Gnutella. Bottlenecks are slow peers that are densely connected to other peers. If a query has to pass such a slow peer, the whole execution of the query is slowed down.

Moreover, a pre-determined number of hops may still be too inflexible to reach a sufficient portion of the nodes with relevant content.
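To make the cost of flooding concrete, the following small simulation (not part of the thesis implementation; the toy graph is invented for illustration) counts the total and the redundant messages produced by TTL-limited flooding:

```python
from collections import deque

def flood(graph, source, ttl):
    """Simulate Gnutella-style flooding: each peer forwards the query to
    all neighbors except the sender while decrementing the TTL."""
    messages = 0                               # total messages sent
    received = {peer: 0 for peer in graph}     # times each peer saw the query
    queue = deque([(source, None, ttl)])
    while queue:
        peer, sender, hops = queue.popleft()
        received[peer] += 1
        if hops == 0:                          # TTL exhausted, stop forwarding
            continue
        for neighbor in graph[peer]:
            if neighbor != sender:             # don't echo straight back
                messages += 1
                queue.append((neighbor, peer, hops - 1))
    reached = sum(1 for n in received.values() if n > 0)
    redundant = sum(n - 1 for n in received.values() if n > 1)
    return messages, reached, redundant

# A tiny peer network containing cycles, peers 0..4
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1], 4: [2]}
print(flood(graph, source=0, ttl=3))   # → (10, 5, 6)
```

Even on this five-peer network, reaching all peers costs 10 messages, 6 of which are redundant deliveries caused by the cycles, which illustrates why flooding scales poorly.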

2.1.3 Structured P2P Networks

Distributed Hash Tables (DHTs) are the most important structured P2P networks. They organize the peers in an addressable way which makes it easy to locate content. For instance, in Chord [32] IDs/keys are assigned to peers, and a peer's ID determines its location in a ring. Chord uses a routing table called the finger table. If one builds a P2P network with l-bit identifiers for the peers, the network can hold 2^l nodes. The finger table used in each node for routing has at most l entries. It is ordered by the distance from the peer, i.e., it is used to take a shortcut to a node in the ring whose ID is closer to the target peer's ID.

Each peer only maintains a routing table for O(log(N)) peers, where N is the number of peers. Other DHTs only differ in this complexity by constant factors. The routing tables make it possible to at least halve the remaining distance to the target in each routing step. In Chord, for example, this is accomplished by storing, for each of the l entries in the finger table, a peer that is at a certain distance. This distance is given by the row: if the row is i, then the node in that row has an identifier at least 2^(i-1) positions further on the ring. So in each routing step the relevant i decreases by at least one, which allows DHTs like Pastry, Chord, and Kademlia to reach all peers within O(log(N)) hops on average. Pastry uses a technique similar to Chord's routing tables. Other DHTs like Symphony and Viceroy need O(log^2(N)) hops on average, whereas CAN has costs of O((D/2) · N^(1/D)) for a D-dimensional space [39].

What is interesting about this is that the number of peers can grow almost arbitrarily. This means in particular that if the location of the content is known, a query in Pastry can be done in O(log(N)) hops, while in an unstructured P2P system it would probably take considerably longer.
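How a Chord finger table is populated can be sketched as follows (an illustration following the conventions above; the 4-bit ring and the live node IDs are invented for the example):

```python
def finger_table(node_id, l, nodes):
    """Build the finger table of node_id on an l-bit Chord ring: entry i
    points at the successor of node_id + 2^(i-1), i.e., the first live
    node clockwise from that position."""
    ring_size = 2 ** l
    ring = sorted(nodes)
    table = []
    for i in range(1, l + 1):
        start = (node_id + 2 ** (i - 1)) % ring_size
        # successor(start): first live node with ID >= start, wrapping around
        succ = next((n for n in ring if n >= start), ring[0])
        table.append(succ)
    return table

# 4-bit ring (16 possible IDs) with five live nodes
nodes = [1, 4, 9, 11, 14]
print(finger_table(1, 4, nodes))   # fingers of node 1 → [4, 4, 9, 9]
```

The entries at exponentially growing distances are exactly what lets each routing step skip at least half of the remaining ring.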


Joins and failures of peers can be handled in O(log^2(N)) steps. In general, this means that DHTs are superior to unstructured P2P networks in terms of scalability, reliability, and fault tolerance.

2.2 Searching with and without Semantic Overlay Networks

When searching in a P2P network without a SON, there are basically four scenarios. The first is to use a central server which indexes all content and which peers contact if they want to search the P2P network. Second, there is the pure P2P approach with equal peers that are totally decentralized. The third approach is a mix of the first two ideas, a hybrid P2P approach. Fourth, there are the likewise decentralized structured P2P networks, e.g., DHTs.

Of course the first idea is very good in terms of the number of messages and the overall traffic generated when searching. However, the problem is that there is a single point of failure, and the server does not scale well if the number of users increases.

The central server approach has its disadvantages for a large number of users because the need for bandwidth, processing power, and storage increases with each new peer. Moreover, one single point of failure is very risky. For each peer, the storage cost is increased by O(size of the local collection), while querying costs only O(1) because only the server has to be contacted.

The pure P2P approach is the most secure because no single point of failure exists, and it can be implemented in a way that scales. One example is the implementation of Gnutella 0.4, which floods queries to all the randomly connected neighbors of a peer. One problem is that many redundant messages are generated; moreover, peers with interesting content for a particular query may be widely distributed in the network, and there is no way to guarantee that the relevant information will be reached. Gnutella also experienced severe problems when dealing with too many clients, because above some threshold the sheer number of messages created in the network overloaded the traffic capacities of the peers.

Considering the storage costs, there is no new storage needed per new peer because information is not replicated, but the querying and retrieving costs are O(N^2) [38].

In fact, an early version had this problem and broke down when too many users switched from Napster to the Gnutella network [31]. Additionally, the possibility remains that part of the relevant information is ignored because it is outside of the region limited by a fixed number of hops.

The hybrid approach that is used, for instance, in later versions of Gnutella and in FastTrack uses Ultrapeers/Superpeers, which are elected based on their bandwidth, average delay, and processing power. These interconnected nodes are responsible for small groups of peers. This creates a denser network, and less traffic is generated from queries because the Superpeers handle a large portion of the requests. This approach has become very popular, but it suffers from the same scalability issues and the same problem of reaching all relevant peers as the early Gnutella approach.

DHTs are structured, totally decentralized P2P networks and thus equally secure as the first pure P2P approaches. As discussed earlier, DHTs allow routing from an arbitrary peer to another very efficiently in O(log(N)) hops. So if the content is already distributed according to IDs, the approach is very attractive, also due to its good scalability and security. However, we either need a preprocessing step to accomplish this mapping between content and IDs, or we use simple terms instead of topic names, which is done in Minerva [14].
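The term-to-peer mapping that such a term-based approach relies on can be sketched like this (a hypothetical illustration; the hash function, identifier width, and peer IDs are assumptions for the example, not the mechanism actually used by Minerva or this thesis):

```python
import hashlib

def term_to_id(term, l=16):
    """Hash a query term into the DHT's l-bit identifier space."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % (2 ** l)

def responsible_peer(key_id, peer_ids):
    """The peer responsible for a key is its clockwise successor on the
    identifier ring (wrapping around past the highest ID)."""
    ring = sorted(peer_ids)
    return next((p for p in ring if p >= key_id), ring[0])

peers = [1024, 20000, 45000, 61000]
key = term_to_id("basketball")
print(key, "->", responsible_peer(key, peers))
```

Any peer can repeat this computation locally, so the index entries for a term can be found in O(log(N)) routing hops without any central directory.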

This leads to the idea of grouping nodes according to their content to locate the most relevant peers. These groups could be, e.g., people who are interested in basketball and people interested in geology. So if one knows where peers with content about a particular topic are, one can easily contact them and ask for their query results.

Instead of using only the capabilities of the underlying P2P network, one can also construct an overlay network on top of the normal P2P network. The notion of interest groups allows each node to join several groups for the topics it is interested in. Then queries can always be routed to the most promising peers. These groups form a semantic overlay network (SON) on top of the normal P2P network, based on the peers' semantic similarity to each other. When nodes are automatically organized in groups in a SON for specific topics based on their content, one can start a query in that topic and exactly the right nodes in the network are reached. So one gets the most relevant results with very little redundancy and overhead, because messages are not flooded in the network.

One way to create a SON is to use a central server to classify the peers' collections into different groups and assign the peers to groups in a SON according to these classes, but this again breaks the decentralization property and is also very inflexible when a peer's collection changes.

Our approach is to let the peers decide to which peers they want to establish a connection. The decisions are taken based on information from other peers obtained through random meetings.

Based on this information, the other peers are either added to the list of preferred peers (or “friends”) or not. During query routing, messages are forwarded only to those preferred peers, and results are then merged. This algorithm is called p2pDating and is described in the following Section.

3 Related Work

This discussion of related work was taken from [34].

DHTs play an important role as the basis of many recently developed P2P systems, e.g., Chord [32], Pastry [30], CAN [23], P2P-Net [29], and P-Grid [28]. These DHTs allow mapping from keys, e.g., hashed keywords or hashed titles, to peers, but they are merely a network infrastructure and do not include implementations of sophisticated search functionality.

However, there are some projects that deal with P2P web search. PlanetP [18] allows users to publish/subscribe to files and to use a ranked search for content. PlanetP indexes all peers' content in a global index that is disseminated among the peers by a gossiping algorithm.

In Odissea [25], a global index is distributed over all peers. Exactly one peer is responsible for each term and has the complete index for that term. A term is either a keyword or a word stem. Fagin's threshold algorithm [19] is used in a distributed version to execute queries. Odissea seems to cause high traffic for distributing its document metadata in the network, and queries seem to be limited to one or two query terms.

The system from [24] is similar in the way that each peer is responsible for a subset of all the terms and maintains a distributed inverted index. It puts emphasis on three techniques to minimize the bandwidth usage for multi-keyword queries.

A system with privileged peers called directory peers is used in [22]. It is an example of a hybrid P2P system because these directory peers have more duties than the other peers. This could lead to problems in scalability and fault tolerance. The Kullback-Leibler divergence, calculated over the term distribution of a peer, is used to determine to which peers queries should be forwarded.
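The idea of comparing term distributions with the Kullback-Leibler divergence can be illustrated as follows (a minimal sketch with made-up term counts; the epsilon smoothing for unseen terms is an assumption for the example, not the scheme used in [22]):

```python
import math

def kl_divergence(p_counts, q_counts, epsilon=1e-9):
    """KL divergence D(P || Q) between two term-frequency distributions,
    given raw term counts. Terms missing from one collection get a tiny
    epsilon probability so the logarithm stays finite."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    kld = 0.0
    for term in vocab:
        p = p_counts.get(term, 0) / p_total or epsilon
        q = q_counts.get(term, 0) / q_total or epsilon
        kld += p * math.log(p / q)
    return kld

peer_a = {"nba": 5, "playoffs": 3, "dunk": 2}
peer_b = {"nba": 4, "playoffs": 4, "referee": 2}
print(kl_divergence(peer_a, peer_b))
```

A small divergence means the other peer's term distribution is close to our own, making it a promising target for forwarding queries; identical distributions yield a divergence of zero.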

The approach in [17] is interesting for us because they also use SONs. However, they use a fixed classification of the peers into different SONs as their starting point. The peer's collection is classified using some algorithm, and then the peer is added to some SON. We, however, let the peers decide on their own which peers they want to add to their SON, and let them decide this by meeting other peers in the network.

[13] is a project also built on Pastry; it is used as a content-addressable network (CAN) where the content of the P2P system is partitioned over all the nodes in the P2P network.

Semantic overlays are also the concern of [26], where semantic similarity is used as a distance measure. This means that the less similar a document is, the longer it takes to reach it from the peer that holds the other document. Naturally, the content is also divided so that each peer is responsible for a portion of the whole content. The document semantics are determined by latent semantic indexing (LSI). A similar approach is used in [21], but they use a dimension reduction technique to build better semantic overlays called Semantic Small Worlds. Topic segments are used in [4] to cluster semantically related peers.

[27] have proposed an architecture to take advantage of all the resources and content of a P2P system. They build a system using document categories and node clustering, and try to distribute the load within and between the clusters of peers.

[3] uses a lazy learning algorithm for RDF(S) queries. Peers collect information about queries which were successfully answered by other peers and use this information practically as a cache for recurring queries.

P-Grid is extended with scalable semantic overlay networks and with traversal strategies in [10].

There are also other works which are not only focused on distributed search using semantic information but on information retrieval in general. In [16], algorithms for result merging and database content discovery are described. [20] is a work on database selection in networked IR. Although distributed information retrieval is a relevant aspect of P2P search, P2P search itself is more challenging because it adds the complex dynamics of a P2P network to the IR problems.

4 p2pDating

p2pDating [34] is an algorithm for constructing Semantic Overlay Networks. In contrast to other algorithms [17], [26], [21], [4], it does not require a preprocessing step based on the classification of the documents. For that, an algorithm would have to be executed on a central server to classify peers and assign them to SONs, but we want to construct a decentralized system.

Instead, the p2pDating approach is to have all peers autonomous and let them meet with each other. Then they can decide on their own which peers they want to form a SON with. These preferred peers are called friends. In the decision process several measures can be used, e.g., the overlap between the two collections, the similarity of the two collections, the history of the interaction with the peer, or the usage frequency of the peer. Some measures are described in more detail in Section 4.1.

If one peer has a list of friends, these friends could also be interesting to a second peer that has met the first and decided to add it as a friend. These friends of friends are called candidates.

The p2pDating algorithm begins with a procedure to choose the peers to meet next, called choosePeerToMeet(). When these peers have been contacted, it is decided whether they are good enough to be added as friends. If they themselves have some good friends, these are added as candidates; later they can be contacted and also become friends. This important process is executed in the Dating algorithm (see Algorithm 1), and the second important algorithm is choosePeerToMeet() (see Algorithm 2).


Algorithm 1 p2pDating Algorithm
 1: repeat
 2:   P ← choosePeerToMeet()
 3:   contact P
 4:   if isFriend(P) then
 5:     add(P, friend list)
 6:   end if
 7:   if hasGoodFriends(P) then
 8:     C ← friends of P
 9:     add(C, candidate list)
10:   end if

Algorithm 2 The choosePeerToMeet() procedure
1: P ← a peer from the candidate list, with probability α
2: P ← a peer from the friend list, with probability β
3: P ← a random peer in the network, with probability (1 − α − β)
4: return P

A peer for the next meeting is chosen from three sources, based on the probabilities given by α and β (see Algorithm 2):

• Friends: These nodes are the preferred nodes of the peer.

• Candidates: These are the friends of friends. It is likely that friends who were already identified also have friends which are interesting for the peer.

• Random Nodes: Especially in the startup phase, new nodes must be found so the peer can start building its SONs. Random meetings also allow finding nodes that recently joined the network.

The parameters α and β are determined by the user. In the contact phase, the peer first contacts the chosen peer and asks for a summary of its collection. A score is then computed for the contacted peer based on the summary.

If the score is higher than a certain threshold, the peer is added as a friend. If the node also happens to have good friends, these are added to the candidate list.

If a friend has been contacted, it is only checked whether its friends or its collection have changed. If it is not online anymore, it is removed from the friend list; if the collection has changed, the values in the summary of this peer are updated and a new score is computed.
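Algorithms 1 and 2 can be sketched compactly as follows (illustrative only; the peer names, fixed scores, and threshold are invented, whereas the actual implementation in Section 6 derives scores from collection summaries):

```python
import random

def choose_peer_to_meet(candidates, friends, all_peers, alpha, beta):
    """Algorithm 2: pick a candidate with probability alpha, a friend with
    probability beta, and otherwise a random peer from the network
    (falling through to the next source if a list is empty)."""
    r = random.random()
    if r < alpha and candidates:
        return random.choice(sorted(candidates))
    if r < alpha + beta and friends:
        return random.choice(sorted(friends))
    return random.choice(sorted(all_peers))

def dating_step(peer, friends, candidates, scores, friends_of, threshold=0.5):
    """One meeting of Algorithm 1: add the met peer as a friend if its
    score passes the threshold, and adopt its friends as candidates."""
    candidates.discard(peer)                    # a met peer is no candidate anymore
    if scores.get(peer, 0.0) > threshold:
        friends.add(peer)
        candidates.update(friends_of.get(peer, set()) - friends)

friends, candidates = set(), set()
scores = {"B": 0.9, "E": 0.2}                   # hypothetical peer scores
friends_of = {"B": {"C", "D"}}                  # B's own friend list
dating_step("B", friends, candidates, scores, friends_of)   # B becomes a friend
dating_step("E", friends, candidates, scores, friends_of)   # E's score is too low
print(friends, candidates)
```

After these two meetings, B is a friend and its friends C and D are candidates, while E was rejected, which is exactly the behavior of the pseudocode above.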


By choosing the peers based on a certain measure we can select the most promising peers to route queries to. It is assumed that peers with similar content will provide good results for queries. Under the assumption that peers pose queries on the topics of their interests, which are inferred from their content, we will get good results by asking similar peers. Because a peer knows its friends, it can simply forward queries to the peers in its neighborhood.

Because we know the peers in the SON from the beginning, we can answer queries very fast and accurately.

The paper [34] already contains preliminary experiments that show the suitability of p2pDating. The authors performed simulations with the JXP algorithm and with the Minerva P2P web search system to construct a SON and perform queries.

4.1 Measures to Define Friends

In [34] several measures were suggested to define the usefulness of a peer:

4.1.1 History

Because the usefulness of a P2P network depends on the peers' capabilities, it is important to work against non-cooperative or malicious peers. A way to do this is to measure how much credit a peer has. Free-riding peers, i.e., peers that do not contribute content and only profit from others, could also be identified this way. One could use a measure of how well the other peer behaved in the past: credit points are given for good behavior, such as providing resources, answering queries and so on. In regular intervals this credit counter can be reset to allow peers who have changed their content to get a better status.

4.1.2 Overlap

In many situations we also want to know how much of the other peer's collection we already know, i.e., how high the overlap with this other peer is. If a peer has a very high overlap with our own collection we would rather not query it, because we will probably get many redundant results that we already have. For comparing collections without having to transfer them entirely over the network, we use statistical synopses.

Statistical synopses of peers are a light-weight approximation technique for comparing data of different peers without explicitly transferring their contents. Synopses provide very compact representations for sets, containing some local information that can be used to estimate the correlation between two sets. There are different techniques for computing synopses, for example, Bloom filters [1], Hash sketches [2] and Minwise Independent Permutations (MIPs) [15]. In our work we use Minwise Independent Permutations (MIPs).


The overall idea is to select a random set of documents from the collection by always selecting one specific document of a random permutation of the documents. We assume that the set of documents is given by a set of docIDs representing the documents.

First of all, a set of docIDs has to be computed from the URLs. For the function used, see Section 4.2.

To compute N permutations from the set of docIDs, N hash functions are applied on each docID. From these N permutations the minimum value is taken, which results in a summary of N numbers representing the whole collection. The hash functions have the following form:

h_i(x) = (a_i · x + b_i) mod U, where 1 ≤ i ≤ N

The a_i and b_i are random integers and U is a big prime number. Different values of the coefficients a_i, b_i are used to compute the different permutations.

Moreover, a_i, b_i and U are the same for all runs. From each permutation the minimum value is taken and added to the MIP vector as a summary of all the documents.

The summary of the collection, i.e., the N numbers of the MIP vector, can be compared to another MIP vector to get, for instance, an estimation of the resemblance between the two original sets (see the example given in Figure 1).

Resemblance(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|

where S_A and S_B are the sets of URLs of peers A and B, respectively. The resemblance can then be used to estimate the overlap.
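The MIP construction and resemblance estimation can be sketched as follows. The concrete N, the prime U, the coefficient ranges and the fixed seed are illustrative choices, not the values used in the thesis; the position-wise comparison of minima is one common MIP estimator for the resemblance.

```java
import java.util.*;

// Illustrative MIP (Minwise Independent Permutations) synopsis.
public class Mips {
    static final int N = 64;              // number of permutations (example value)
    static final long U = 2147483647L;    // a large prime: 2^31 - 1 (example value)
    static final long[] A = new long[N], B = new long[N];
    static {
        Random rng = new Random(1234);    // fixed seed: same h_i for every run
        for (int i = 0; i < N; i++) {
            A[i] = 1 + rng.nextInt(1 << 30);
            B[i] = rng.nextInt(1 << 30);
        }
    }

    // h_i(x) = (a_i * x + b_i) mod U for every docID; the minimum over the
    // collection forms entry i of the MIP vector. (Overflow of the product is
    // ignored in this sketch; the hash stays deterministic.)
    static long[] mipVector(long[] docIds) {
        long[] mip = new long[N];
        Arrays.fill(mip, Long.MAX_VALUE);
        for (long id : docIds)
            for (int i = 0; i < N; i++)
                mip[i] = Math.min(mip[i], Math.floorMod(A[i] * id + B[i], U));
        return mip;
    }

    // Fraction of positions where the minima agree, estimating
    // |S_A ∩ S_B| / |S_A ∪ S_B|.
    static double resemblance(long[] mipA, long[] mipB) {
        int eq = 0;
        for (int i = 0; i < N; i++) if (mipA[i] == mipB[i]) eq++;
        return (double) eq / N;
    }

    public static void main(String[] args) {
        long[] a = mipVector(new long[]{1, 2, 3, 4, 5});
        long[] b = mipVector(new long[]{1, 2, 3, 6, 7});
        System.out.println("estimated resemblance: " + resemblance(a, b));
    }
}
```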

4.1.3 Similarity

Another important measure is the similarity of the document collections. To get an idea of the topic of a collection we want to know how often each term occurs, i.e., the term distribution.

The term distribution can be obtained by looking at the index of the peer. The probability that a term x will appear in a collection F is given by f(x), likewise for the collection G and the probability g(x). f and g are discrete probability distributions. From these the Kullback-Leibler divergence [7], or relative entropy, can be calculated to estimate the similarity of two term distributions.

KL(F, G) := Σ_x f(x) · log( f(x) / g(x) )


[Figure 1 illustrates the computation: the example permutations h1 = (7x + 3) mod 51, h2 = (5x + 6) mod 51, ..., hN = (3x + 9) mod 51 are applied to the docID set {20, 48, 24, 36, 18, 8}; the minimum of each resulting permutation becomes one entry of the MIP vector; comparing two MIP vectors position-wise yields the estimated resemblance.]

Figure 1: How MIPs are Computed

The value computed has the property that it is zero when the collections have the same distribution; the larger the value, the more the two distributions differ in the probabilities.

The similarity of two collections, F and G, is inversely proportional to KL(F, G). The term distributions of the collections can come from two different sources: (i) the documents that were referenced by the bookmarks or (ii) all the documents in the collection. The bookmark approach [12] is useful if the sites referenced by the bookmarks were then used as seeds to crawl the web with a larger depth.

4.2 Strategies for Scores

Based on the measures defined in Section 4.1, four different strategies were implemented to determine scores for a collection. Given collections C_A and C_B from peers A and B, respectively, Score(C_A, C_B) is defined as how useful the documents in C_B will be for peer A. The scores lie in the interval [0.0, 1.0], with 0.0 being the lowest score. There are many ways to compute the score; some are listed below.

• Overlap Only:

To compute the overlap we use the Minwise Independent Permutations based on hashes of the document IDs. In this case we want as little overlap as possible, to get more information from other collections.

Score(C_A, C_B) = 1 − Overlap(C_A, C_B)

• Similarity Only:


The Kullback-Leibler distance assesses the semantic similarity of two collections based on their term distributions. As discussed earlier, a 100% similar document collection would yield a KLD value of zero.

Score(C_A, C_B) = 1 / (1 + Similarity(C_A, C_B))

• Weighted Sum:

This is a weighted sum of the overlap and the similarity measure, to get benefits from both measures.

Score(C_A, C_B) = α · 1 / (1 + Similarity(C_A, C_B)) + (1 − α) · (1 − Overlap(C_A, C_B)), with 0 ≤ α ≤ 1

• Random:

This strategy is used as a comparison, to assess the benefit of our theories. As the name suggests, the random strategy assigns each peer a random score every time it is considered as a candidate. The score also lies in the range 0.0 to 1.0.

Score(C_A, C_B) = random number

The α is set by the user, and we tested several different configurations in the experiments. Note that Overlap Only and Similarity Only can be tested by setting α to 0.0 and 1.0, respectively.
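The scoring strategies can be summarized in a few lines; here Similarity stands for the KLD value and Overlap for the estimated resemblance, both computed elsewhere, and the class name is illustrative.

```java
// Sketch of the scoring strategies from Section 4.2.
public class Scores {
    // Weighted sum: alpha = 1.0 reduces to Similarity Only,
    // alpha = 0.0 to Overlap Only.
    static double weightedSum(double similarity, double overlap, double alpha) {
        return alpha * (1.0 / (1.0 + similarity)) + (1.0 - alpha) * (1.0 - overlap);
    }

    static double overlapOnly(double overlap) {
        return 1.0 - overlap;
    }

    static double similarityOnly(double similarity) {
        return 1.0 / (1.0 + similarity);
    }
}
```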

5 Design

Extending [34], this thesis contains the design and implementation of a real P2P application. We have combined the benefits of SONs and DHTs: with SONs we only have to contact few peers, with few redundant messages and low overhead, for answering a query, while a DHT guarantees that every peer can be reached with at most O(log(N)) hops. We chose Pastry [30] as our underlying network infrastructure, which is the infrastructure currently used by the Minerva P2P web search system [13].

Nutch [5] was chosen as the crawler because it is a stable open-source project and simple enough to use in other projects. The crawled pages are stored in a Lucene index. Lucene [6] is an indexer for text files related to the Nutch project. Both are described in more detail in Section 6.3.

In the construction phase several important parameters are set: the values for α and β from Algorithm 2; the size of the SON, i.e., the maximum length of the friendList; the maximum number of candidates; and two selection strategies. The first strategy, stored in the attribute isFriendStrategy, determines the scores of a peer (see Section 4.2). The second strategy, the SelectionStrategy, selects one particular peer ID out of the friend and candidate list, respectively, for contacting in the p2pDating algorithm.

All the parameters are set at construction time, but they can later be changed at runtime.

Peers have the option to create a collection by providing a list of URLs to the Nutch crawler. The crawler can be set to use the bookmarks of that peer as seeds, since bookmarks are a good indicator of a peer's interests. The crawled collection is created by Lucene and analyzed to get the information for the summaries.

Using the class IndexAnalyzer, the term frequency vector for the KLD and the MIP array are computed. This information is saved in the attributes of the peer. This is handled in the Peer's method start() and could later be called again to apply updates to the collection.

After all this initialization, the main algorithm is executed by the method p2pDating() in fixed time intervals.

First, one group of peers is chosen by choosePeerToMeet() and then a message is sent to one particular friend, candidate or random peer. Candidates and random peers are contacted using a DatingQueryMessage and friends are checked using a KeepAliveQueryMessage.

The different types of messages arrive in the method deliver() and, depending on the type, different actions are taken. If friends answer with a KeepAliveAnswerMessage, their changed synopsis is updated. From DatingAnswerMessages sent by candidates or random peers, the collection summary is extracted, the isFriend() method computes a score, and based on it the addFriend() method decides whether to add the peer to the friends. The function hasGoodFriends() from Algorithm 1 always returns true in our implementation. A more refined version would probably return true, e.g., if the peer is a friend or if the peers used the same strategy to compute scores. Query messages for dating, for updating friends and for searching are of course answered with the corresponding answer message.

When a user wants to search, startQuery() is executed. It accepts either a Lucene Query object or a query formulated as a string. The method sends out a QueryMessage to all peers in the friend list and also searches the peer's own index with queryMyself(). When all results from the other peers have arrived in QueryAnswerMessages, the results are merged, duplicates are removed, and the top-k results are selected and displayed. The parameter k is set to 20 by default.

The simulation scenario is the following: First of all, we start all peers of the P2P network that we want to simulate on one computer. We also store the union of all peers' collections on the same computer; this is done for performance evaluation only. Then, a new ring is created using Pastry, and nodes are added to the ring. Each node bootstraps from the previous peer that joined the ring.

For the experiments we first used simulated nodes instead of real SocketPastryNodes because this allows faster testing. Later on, we switched back to the version using sockets because there was an error in the simulator which only showed after many executions.

5.1 BatchCrawl

The crawls are executed by the function crawl() in the class BatchCrawl, which allows crawls that take as seeds the set of URLs contained in one particular file in a directory. For the performance evaluation we use a global index of all documents in the network. The global index is used to compute the relative recall, which is our quality measure. This index is also created using the BatchCrawl class, by merging the smaller collections and removing duplicate URLs resulting from overlap.

Moreover, BatchCrawl allows us to automate the crawl process for all the peers, as we did for our experiments. All it needs is one file containing the URLs for each peer in its own directory; it will then create a crawl directory and perform an unsupervised crawl, which is especially useful if we have to crawl a high number of pages per peer. The program can be controlled by command-line parameters to launch the different functions.

5.2 Messages, Protocol

In this section we describe when messages are sent and how the corresponding processes, like contacting candidates and friends and querying, are handled.

5.2.1 Dating

As outlined in the original p2pDating paper and in the Overview section, the dating process begins with the choice of a list of peers to contact.

A peer is selected from the friend and candidate lists using the SelectionStrategy.selectFriendNode() and selectCandidateNode() methods, respectively. Our SimpleSelectionStrategy always chooses the friend that was contacted least recently: in the implementation we take the head of the list and append it at the end, thereby rotating the list. The candidates, meanwhile, are chosen based on the score of the friend they came from; we implemented this by adding each candidate with its friend's score and sorting based on it. The random peer messages are sent to a random ID, which is not the contacting peer's ID.
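The round-robin rotation of the friend list can be sketched as follows; the class name is illustrative and the element type is left generic, whereas the real implementation rotates Friend objects.

```java
import java.util.*;

// Sketch of the SimpleSelectionStrategy friend rotation: the head of the
// list (the least recently contacted friend) is returned and re-appended
// at the end, so repeated calls cycle through all friends.
public class Selection {
    static <T> T selectFriendNode(LinkedList<T> friends) {
        T head = friends.removeFirst();
        friends.addLast(head);   // rotation: contacted friend moves to the back
        return head;
    }

    public static void main(String[] args) {
        LinkedList<String> friends = new LinkedList<>(Arrays.asList("a", "b", "c"));
        System.out.println(selectFriendNode(friends)); // oldest friend first
        System.out.println(friends);                   // rotated list
    }
}
```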

There are two classes that extend DatingMessage: DatingQueryMessage and DatingAnswerMessage. The DatingQueryMessage only carries the IDs of the sender and receiver which initiate the contact, while the DatingAnswerMessage includes the MIP summary, a Vector<Pair<String, Integer>> containing the term distribution, and an array with the friends' IDs.

When a DatingAnswerMessage is received, the score for the peer is computed in the method isFriend(). The attribute isFriendStrategy determines which one of the strategies outlined in Section 4.2 is used. A peer is added to the friend list if one of the two conditions is true:

• the friend list has not yet reached the predetermined maximum size(numFriends)

• there is a peer in the friend list that has a lower score. In this case the peer with the lowest score is removed.

The method hasGoodFriends() returns true and the IDs from the new friend are added to the candidates.

After a fixed interval the next dating message is sent. In real-world applications this could become a problem if all peers send a message at the same time; the fixed interval should perhaps be replaced by a random pick from a Gaussian distribution.
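The suggested randomization could look like the following sketch, where the base interval, standard deviation and minimum delay are hypothetical values, not parameters of the thesis implementation.

```java
import java.util.Random;

// Sketch: draw the dating delay from a Gaussian around the base interval
// instead of using a fixed interval, so peers desynchronize over time.
public class Jitter {
    static final Random RNG = new Random();

    // Returns a delay in milliseconds, clamped so it never drops below minMs.
    static long nextDelayMs(long baseMs, long stddevMs, long minMs) {
        long d = baseMs + Math.round(RNG.nextGaussian() * stddevMs);
        return Math.max(d, minMs);
    }

    public static void main(String[] args) {
        // e.g., roughly one-minute intervals with five seconds of jitter
        System.out.println(nextDelayMs(60_000, 5_000, 1_000));
    }
}
```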

5.2.2 KeepAlive

If a friend is chosen in choosePeerToMeet(), a KeepAliveQueryMessage is sent. The contacted peer should return a KeepAliveAnswerMessage with the same information as in the DatingAnswerMessage.

If the data has changed since the addition of the friend, the values in the friend list are updated. The friend's friends are added as candidates. The time when the KeepAliveQueryMessage was sent is recorded. At regular intervals, friends that have not been contacted for a long time are dropped from the list.

5.2.3 Querying

The types of queries that are available can be found in Section 6.3.1. Suppose we want to query for “Dirk Nowitzki”, which is a phrase query. In our scenario we only consider the top-k results of the query to limit the traffic produced.

The query execution procedure is as follows:

1 A Query object is created based on the type of the query: a PhraseQuery, a TermQuery, or one of the other types (see Section 6.3.1); this is done either by the user or with the help of the query parser.


2 queryMyself() is executed on the querying peer and the top-k results are saved in a result vector.

3 In the method startQuery() a QueryMessage with the query object is sent to each peer in the friendList. This message includes the IDs of the querying and queried peers.

4 The friends also each execute queryMyself() locally and send their top-k results in a QueryAnswerMessage as a Vector<Pair<Float, String>> containing URLs and their scores. The results from the friends are saved on receipt in the results vector until all friends have answered.

5 In the method answerQuery() the results are merged and duplicate URLs are removed.

6 The results are sorted by score and the top-k results are displayed.

The exact sequence of methods used is illustrated in Figure 2.
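Steps 4 to 6, merging the friends' top-k lists, removing duplicate URLs and selecting the global top-k, can be sketched as follows; maps of URL to score stand in for the Vector<Pair<Float, String>> used in the thesis.

```java
import java.util.*;

// Sketch of the result-merging step of query processing.
public class Merge {
    static List<Map.Entry<String, Float>> mergeTopK(List<Map<String, Float>> resultLists, int k) {
        Map<String, Float> best = new HashMap<>();
        for (Map<String, Float> results : resultLists)
            for (Map.Entry<String, Float> e : results.entrySet())
                best.merge(e.getKey(), e.getValue(), Float::max); // de-duplicate by URL
        List<Map.Entry<String, Float>> merged = new ArrayList<>(best.entrySet());
        merged.sort((x, y) -> Float.compare(y.getValue(), x.getValue())); // score, descending
        return merged.subList(0, Math.min(k, merged.size()));
    }
}
```

When the same URL is returned by several peers, only its highest score survives; whether to keep the highest score or, say, the local one is a design choice this sketch makes arbitrarily.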


Figure 2: Process Diagram for Query


6 Implementation

6.1 Implementation of Algorithms

The implementation details of the algorithms mentioned in [34] are explained here.

6.1.1 p2pDating

The implementation of the algorithm was actually straightforward; we just had to translate the pseudocode given in [34] into Java methods. There were some implementation details to be solved regarding data structures and inheritance. We chose List as the type for the friend and candidate lists; Friend and Candidate are children of the class Acquaintance because they have many properties in common (see Figure 5). The decision whether something is added to the friend list is made in the function addFriend(), which checks whether the size limit was reached and whether the candidate is good enough to replace another peer in the list. For additions to the candidates it is checked whether the maximum size of the candidate list was reached and whether the candidate is already a friend or candidate. In our experiments, the maximum size is the number of peers, so the list cannot become full. The order in the friend and candidate lists is maintained by sorting the elements using Java Comparators.
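The bounded, score-sorted friend list behind addFriend() can be sketched as follows; plain score entries stand in for the Friend objects, and the class name is illustrative.

```java
import java.util.*;

// Sketch of addFriend(): keep at most maxSize friends, evicting the
// lowest-scored friend when a better peer arrives.
public class FriendList {
    // Returns true if the peer was added to the list.
    static boolean addFriend(List<Map.Entry<String, Double>> friends,
                             String peerId, double score, int maxSize) {
        if (friends.size() < maxSize) {
            friends.add(new AbstractMap.SimpleEntry<>(peerId, score));
        } else {
            // find the current minimum-score friend
            Map.Entry<String, Double> worst = Collections.min(
                friends, (a, b) -> Double.compare(a.getValue(), b.getValue()));
            if (worst.getValue() >= score) return false; // not good enough
            friends.remove(worst);
            friends.add(new AbstractMap.SimpleEntry<>(peerId, score));
        }
        // keep the list sorted by score, best first (as with Java Comparators)
        friends.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return true;
    }
}
```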

6.1.2 MIP Computation

We use the algorithm sketched in Section 4.1. To get random numbers which are the same for each run of the computation, we use two fixed numbers as seeds for a random number generator. The biggest Mersenne prime in the long integer range was chosen as U.

We needed a hash function from the URLs to some number. The hashCode() method, which is also available in Java's String class, computes int hashes, but these are not suitable: they are saved in only 32 bits, which is too few given all possible URLs.

The first idea we had to obtain hashes was the Java class java.security.MessageDigest. It provides hashes from byte arrays to byte arrays with two different algorithms, “MD5” and “SHA”; the digest() method produces 128- and 160-bit byte arrays, respectively, which would surely have been enough, and getting a byte array from a String was not a problem. However, we wanted to use primitive datatypes for the hashes, and the largest integer datatype is long, with 64 bits. One can either choose a bigger hash datatype or take only 64 bits of the generated hash. The first possibility was not really an option, and for the second we were fairly sure that the hash function would still work correctly if we took only the last 64 bits, but by that time we had already found another hash function that could be used.

The implementation of M. O. Rabin's hash function in the WebCAT project allows arbitrary input and produces long integers. It is an implementation of Rabin's fingerprinting algorithm and was specifically created for hashing URLs. It is faster to compute than MD5 and SHA, and the probability of collisions is well understood [9].

It takes a String as argument and yields a number in the long integer range, i.e., between −2^63 and 2^63 − 1, which should be big enough to prevent too many different URLs from being hashed to the same number.
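We cannot reproduce the WebCAT code here, but the idea of mapping a URL into the 64-bit long range can be illustrated with a simple polynomial hash. Note that this is not Rabin's fingerprint: its collision probabilities are not as well understood, and it is shown only to make the String-to-long mapping concrete.

```java
// Illustrative 64-bit polynomial string hash (NOT the Rabin fingerprint
// used in the thesis): each character folds into a running product that
// wraps around mod 2^64 through long overflow.
public class UrlHash {
    static long hash64(String url) {
        long h = 1125899906842597L; // arbitrary large prime seed
        for (int i = 0; i < url.length(); i++)
            h = 31 * h + url.charAt(i);
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash64("http://example.org/"));
    }
}
```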

The length of the MIP array is flexible, but we chose 64 because it proved to be a good trade-off in other experiments.

6.1.3 KLD Computation

For the KLD computation we need the term distribution, which is obtained by an analysis of the crawled data. As mentioned in Section 4.1, one can either analyze the term distribution of the pages referenced by the bookmarks or that of the whole crawled collection. In this work we only considered the whole collection, because the link collection was small enough and the value computed should represent the distribution more accurately. We should not get many irrelevant pages in our collection, as [8] shows that most outgoing links from a page on a certain topic link to other pages on the same topic, e.g., sports pages will most of the time link to other sports pages.

In order to obtain the term distribution, we first get a list of all the terms that occur in the index and then use a function that gives us the frequency of the term in a particular document. By iterating over all documents which contain the term, we can compute the term frequency correctly. We count the overall number of words at the same time. After this step we have a Vector<Pair<String, Integer>> of the terms and their frequencies, and the overall number of words numWords.

This gives us all the information needed for the KLD computation. The KLD is computed by the formula from Section 4.1, which requires that f(x) > 0 and g(x) > 0 for all x. This does not hold in our practical setting, since F ∩ G ≠ F ∪ G, which means that there are of course terms that only occur in one of the collections. First we tried to fix this by only considering the terms which occur in both collections (F ∩ G), but this means that the KLD can be negative and no longer meaningful (see the example below).


Example:

terms:  ”basket”  ”ball”  ”database”  ”computation”
f        0.0       0.1     0.9         0.0
g        0.4       0.5     0.0         0.1

f(ball) · log( f(ball) / g(ball) ) = 0.1 · log(2/10) = −0.069

In this simple example we can see that although the value is very close to zero, the collections are very different in their content. In order to fix this we do the following: if the frequency of a term is non-zero in one collection and zero in the other, we set the frequency in the latter collection to 1. This way we do not ignore any terms which could be very important for the collection. To summarize, the final computation is:

|F| = number of words in F,  |G| = number of words in G

Add_F = number of 1's added for words in F
Add_G = number of 1's added for words in G

f(x) = max{(term frequency of x in F), 1} / (|F| + Add_F)

g(x) = max{(term frequency of x in G), 1} / (|G| + Add_G)

If one takes these modified versions of the term distributions, the final value can easily be computed using the usual formula (see Section 4.1). Because we had to introduce this fix, the Vector<Pair<String, Integer>> in the DatingAnswerMessage now contains the number of occurrences of the terms instead of their probability.
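Putting the smoothing and the KLD formula together, the computation can be sketched as follows, with raw term counts as input (as sent in the DatingAnswerMessage); the class name and map-based representation are illustrative.

```java
import java.util.*;

// Sketch of the smoothed KLD from this section: zero counts are replaced
// by 1, the denominators are adjusted by Add_F / Add_G, and KL(F, G) is
// summed over the union of terms.
public class Kld {
    static double kld(Map<String, Integer> countsF, int wordsF,
                      Map<String, Integer> countsG, int wordsG) {
        Set<String> terms = new HashSet<>(countsF.keySet());
        terms.addAll(countsG.keySet());
        // Add_F / Add_G: number of 1's added for terms missing in F resp. G
        int addF = 0, addG = 0;
        for (String t : terms) {
            if (countsF.getOrDefault(t, 0) == 0) addF++;
            if (countsG.getOrDefault(t, 0) == 0) addG++;
        }
        double sum = 0.0;
        for (String t : terms) {
            double f = Math.max(countsF.getOrDefault(t, 0), 1) / (double) (wordsF + addF);
            double g = Math.max(countsG.getOrDefault(t, 0), 1) / (double) (wordsG + addG);
            sum += f * Math.log(f / g);
        }
        return sum;
    }
}
```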

6.2 Pastry

The underlying network infrastructure for managing peers joining and leaving the network and for delivering messages is Pastry. Pastry implements a DHT; therefore it is a structured P2P network. In this project, an implementation called FreePastry was used, which was developed at Rice University, Houston, USA. The version being used is FreePastry 2.0. FreePastry is published under the BSD license and available at http://freepastry.org/FreePastry/.

In Pastry, peers and data items can be uniquely identified by IDs. These lie in a range from 0 to 2^l − 1 (l is typically 128) and are written in digits to the base 2^b. In the context of DHTs, one calls them node IDs for peers and keys for data items. Each data item has a peer associated with it: the peer with the node ID closest to the key is the one responsible for it. For our project it is important to note that messages to random nodes always reach a peer and are not discarded.

When a peer wants to join an existing network, it is given the IP address of a peer in the network to start a bootstrap process from, so Pastry can handle the assignment of IDs.

The routing process in Pastry works as follows. Each peer has a routing table, a leaf set and a neighborhood set. The routing table stores links into the network. In the leaf set, the closest peers relative to the peer's ID are saved. The neighborhood set saves the closest peers based on a certain proximity metric. Each peer thus only has a limited number of approximately O(log_{2^b}(N)) entries in its routing table. From these, the nearest peer to the target ID is chosen. Pastry also tries to exploit locality by preferring very close nodes, which are in the neighborhood set.

This allows Pastry to use only O(log_{2^b}(N)) hops in the routing process until the destination is reached. By also exploiting network locality, Pastry achieves faster transfer of messages.

Pastry uses a metric which can easily be switched to shortest hop count, lowest latency, highest bandwidth, or even a general combination of metrics. This makes Pastry superior to P2P systems which do not regard the locality of the peers. Gnutella, for example, uses random peers from the Gnutella network to build a list of neighbors, and many of the messages for neighboring nodes in the network have to traverse long distances, causing a high delay.

Although in some projects like Minerva, Chord and CAN the notion of keys is really used to locate content, we use IDs randomly and do not try to build such a relation. This does not mean that such a relation cannot exist, but our program is unaware of it, at least at this point.

From a programmer's perspective, Pastry offers an easy-to-use API, good documentation and tutorials. Given the complexity of DHTs, it is not hard to get it to work as an underlying network infrastructure.

6.3 Lucene, Nutch

Lucene and Nutch are both projects that are supported by the Apache Software Foundation and actively developed. They are published under an open-source license: the Apache License, version 2.0. Nutch is used as our internet crawler in version 0.9-rc1. Nutch constructs a Lucene index when crawling. Unfortunately it does this without a term frequency vector, which would make it very easy to compute the term frequencies we need for the KLD computation.

What we do instead is get a list of all terms of the index from the class IndexReader. Then we iterate over this list, and for each document where a term occurs we count how often it occurs; we also increase a counter for all words, which we need in the entropy computation for the KLD. This is done in the class IndexAnalyzer, which is run at each peer in the method start() before the meeting phase.
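The counting step can be sketched with plain Java collections standing in for the Lucene index; a real implementation would instead walk the term list obtained from the IndexReader and the per-document frequencies, as described above. The class and variable names here are illustrative.

```java
import java.util.*;

// Sketch of the IndexAnalyzer counting step: sum each term's frequency
// over all documents containing it, and count all words for the KLD.
// A list of per-document term-count maps stands in for the Lucene index.
public class IndexAnalyzer {
    static Map<String, Integer> termFrequencies(List<Map<String, Integer>> docs,
                                                int[] numWordsOut) {
        Map<String, Integer> freq = new HashMap<>();
        int numWords = 0;
        for (Map<String, Integer> doc : docs)
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                freq.merge(e.getKey(), e.getValue(), Integer::sum);
                numWords += e.getValue();   // overall word counter (numWords)
            }
        numWordsOut[0] = numWords;
        return freq;
    }
}
```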

Lucene is a program that can be used to index files, i.e., to build an inverted index, and to search this index. Lucene is entirely written in Java and is allegedly tuned for performance. It can be used to index text files and advertises building indexes that are 20-30% the size of the original text using only 1 MB of RAM. It supports most kinds of queries, like term, phrase, boolean, wildcard, proximity and range queries. Version 2.1 is used in this project.

There are many books, tutorials and much documentation for this project, and the source code is also well commented. Lucene is used in many open source projects.

Nutch, on the other hand, builds on Lucene and is a stand-alone crawler and in fact a complete search engine. One can provide it with URLs, which are then crawled and saved in a Lucene index. As an application which builds on the Tomcat server/framework, it also allows searching the index using a web interface and is specifically tuned for this purpose. Nutch also provides a multitude of parsers to be able to index documents other than plain text files. Parsers for HTML, XML, PDF and Word documents are included.

Performing a crawl using the command-line Java tools is very easy, and searching with the command-line or web interface also works, but using the source code is not as easy. There are only very few tutorials and few explanations of how to get it compiled in Eclipse instead of using it with a Tomcat server. There is also little documentation and commented code. What makes it even more difficult is the way Nutch executes crawls: it uses several databases for fetched segments, links and indexed pages, and they are all filled and coordinated using a job management system, which makes it hard to see what really happens. Nevertheless, with some help from people on the Nutch mailing list we were able to get it to work. If one uses the main crawl and search classes, it is not too complicated once the system is working.

Finally, we decided to use the Lucene equivalents for searching the indexes because they are better documented. IndexSearcher is the Lucene class we use in the query, and it is used in a straightforward way with the method query(). Lucene offers much choice in terms of filters, changes of weights and analyzers to evaluate/parse the query, but we did not modify the ranking or the weights of the documents.


6.3.1 Query Types

Lucene offers a wide variety of advanced query types. The Lucene index created by Nutch has several fields: “anchor”, “title”, “url” and “content”. By default we search in the field “content”. The types offered with automatic parsing support from a string are:

• TermQuery: This is the usual one-keyword query.

• BooleanQuery: A multi-keyword query can be done with BooleanQuery; AND, OR and brackets can be used in expressions. If two keywords are given, an AND conjunction between them is assumed.

• PhraseQuery: This is a phrase query, parsed by giving an expression in quotes: “phrase”.

• RangeQuery: This query is very useful when dates are saved for documents; e.g., one could search for all web pages modified in a certain time interval.

• FuzzyQuery: This is an interesting query type which allows searching for terms similar to the queried term.

• PrefixQuery: With this query you can search for terms starting with a given prefix.

• WildcardQuery: This is a more advanced form of the PrefixQuery. The wildcards ? and * are allowed anywhere in the search expression and stand for exactly one missing character and zero or more missing characters, respectively. Obviously these queries can become very inefficient in execution time.

Despite all these powerful query types, Lucene can be used like any other search engine and with the same syntax. The StandardAnalyzer used in the project filters stop-words.


6.4 Summary of implemented Classes

The most important class is the class Peer (see Figure 3). It holds all the information to perform the p2pDating algorithm and has methods to start queries and evaluate query results.

Figure 3: The Main Class; Peer

The protocol used relies on several messages that are children of the Pastry messages (see Figure 4).

The classes for the friends hold the ID and all details of the evaluation of how useful the peer is. The candidates hold the score and information of the peer which was added to the first peer's friend list (see Figure 5).


[Figure 4 shows the UML diagram of the message types: DatingMessage, DatingQueryMessage, DatingAnswerMessage, TestMessage, QueryMessage, QueryAnswerMessage, KeepAliveMessage, KeepAliveQueryMessage, and KeepAliveAnswerMessage, all implementing the interface rice.p2p.commonapi.Message (with its priority constants and getPriority()). All messages carry from and to fields of type rice.p2p.commonapi.Id. DatingAnswerMessage and KeepAliveAnswerMessage additionally carry an idArray, a mipArray of longs, a term vector (Vector<Pair<String, Integer>>) and the number of words; QueryMessage carries an org.apache.lucene.search.Query, and QueryAnswerMessage carries the query together with the URL hits (Vector<Pair<Float, String>>).]

Figure 4: The Message Types Used
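The mipArray fields carried by the answer messages are min-wise independent permutation (MIP) sketches in the spirit of Broder et al. [15]. A minimal, self-contained sketch of the idea (the hash family and sizes below are our own illustrative choices, not the thesis code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Minimal min-hash/MIP sketch: for each of m hash functions keep the minimum
// hash value over the URL collection; the fraction of positions on which two
// sketches agree estimates the resemblance (overlap) of the two collections.
public class MipSketch {

    public static long[] sketch(Set<String> urls, int m) {
        long[] mins = new long[m];
        Arrays.fill(mins, Long.MAX_VALUE);
        for (String url : urls) {
            for (int i = 0; i < m; i++) {
                // cheap way to derive m different hash functions, illustrative only
                long h = (url.hashCode() * 0x9E3779B97F4A7C15L) ^ (i * 0xC2B2AE3D27D4EB4FL);
                h = Long.rotateLeft(h, i % 63);
                if (h < mins[i]) mins[i] = h;
            }
        }
        return mins;
    }

    public static double estimatedResemblance(long[] a, long[] b) {
        int matches = 0;
        for (int i = 0; i < a.length; i++) if (a[i] == b[i]) matches++;
        return (double) matches / a.length;
    }

    public static void main(String[] args) {
        Set<String> c1 = new HashSet<>(Arrays.asList("http://a", "http://b", "http://c"));
        Set<String> c2 = new HashSet<>(c1);
        // identical collections produce identical sketches
        System.out.println(estimatedResemblance(sketch(c1, 16), sketch(c2, 16))); // 1.0
    }
}
```

The advantage over exchanging full URL lists is that the sketch size is fixed by m, independent of collection size.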


[Figure 5 shows the classes Acquaintance (id: rice.p2p.commonapi.Id, score: double, simpleSynopsis: int), Friend (credits, lastUsed, usageFreq, addedAsCandidate) and Candidate.]

Figure 5: Classes for Friends and Candidates

7 Experiments

We have performed two experiments with different queries and strategies to assess our approach.

7.1 Experimental Setup

7.1.1 Network Setup

In these simple experiments we created different collections for 12 peers. The experiments took quite long, not because of the algorithm itself, but because all the preprocessing is time-consuming; the runs reported here took 17 hours in total. When the experiments are started on the computer, one peer is created first, and the remaining 11 peers are bootstrapped from its address and can this way join the P2P network.

7.1.2 System Setup

The experiments were conducted on a Windows XP PC and a Linux server. The Windows XP computer uses a 3 GHz Intel processor and 1 GB main memory. We started the experiments using Eclipse 3.2.1 with the Java VM parameter -Xmx512m to increase the available memory, which is needed to handle the indexes and the term vectors for the KLD calculation. The Linux server runs kernel 2.6.16 on a 2.4 GHz AMD Opteron and has 8 GB main memory. Here the experiments were started using a shell script.

7.1.3 Data from: del.icio.us, dmoz.org

The benchmark collections consist of HTML pages and were created by crawling the web starting from “del.icio.us” and “dmoz.org” pages.


del.icio.us (http://del.icio.us/) is a social network for bookmarks that allows everyone to get a login and assign tags to web pages, in this way classifying interesting web picks. The pages from del.icio.us were obtained by issuing queries with typical keywords/tags for the topics and by selecting popular websites for these keywords.

The open and manually maintained directory of websites dmoz.org (http://www.dmoz.org/) contains ∼5 million pages that have been categorized by ∼75 thousand maintainers, who are responsible for ∼600 thousand categories. Based on these categories, we selected links from those categories that fit our topics.

Both data sources are publicly available and provide high-quality pages, since most pages have been created or categorized by real users.

We created the following initial collections:

• Basketball: 44 seed pages distributed on 4 peers: mainly news and NBA sites

• Computer Science: 54 seed pages distributed on 3 peers: mainly sites from database chairs and conferences

• Flowers: 43 seed pages distributed on 3 peers: mainly sites for flower delivery and selling, and some descriptions

• Geology: 26 seed pages distributed on 2 peers: some official geology sites of various regions and some on geological phenomena

While creating these collections, we induced some overlap between peers by assigning the same URLs to multiple peers.

7.2 Recall Computation and Analysis Tools

In order to evaluate our approach, we need a measure of how good the proposed techniques are. Two popular measures for assessing the quality of search engine results are precision and recall.

precision(results) = |relevant documents in results| / |all results|

recall(results) = |relevant documents in results| / |all relevant documents|
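Both measures reduce to simple set operations; a minimal sketch:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Set-based computation of the two measures defined above.
public class PrecisionRecall {

    public static double precision(Set<String> results, Set<String> relevant) {
        return results.isEmpty() ? 0.0 : (double) hits(results, relevant) / results.size();
    }

    public static double recall(Set<String> results, Set<String> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) hits(results, relevant) / relevant.size();
    }

    private static int hits(Set<String> results, Set<String> relevant) {
        Set<String> hit = new HashSet<>(results);
        hit.retainAll(relevant); // relevant documents in the result
        return hit.size();
    }

    public static void main(String[] args) {
        Set<String> results = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d2", "d5"));
        System.out.println(precision(results, relevant)); // 0.5
    }
}
```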

However, these measures require that we know whether a document is a relevant match to a query or not. Since we do not have these relevance assessments, we have to use a different notion of recall, named relative recall. Relative recall expresses how close the result obtained when asking a subset of peers comes to the result we would get when asking all peers.

For computing the relative recall we need to build a global index that contains all the information from the smaller indexes, which is not easy in our scenario.

The first idea we followed was to look at all the small collections, list all the URLs that were actually crawled, and then execute a crawl with exactly these URLs at reduced depth, so that only the pages behind these URLs were fetched. The problem is that this takes a really long time, and not all URLs could be crawled because of network errors and timeouts. About 200 pages were lost, roughly 5% of all URLs, so the results of the smaller collections often contained pages that were not in the global collection.

The second idea was to simply merge all the smaller collections; with that in mind we even found a method in Lucene that takes the indexes and merges them into a larger collection.

First the global index is created with the IndexWriter constructor. The indexes of the smaller collections are then merged into the global index using the Lucene function IndexWriter.addIndexes(). This worked fine and no URLs were missing in the global collection, but we found that some URLs were displayed more than once in the results. This happened because there was a significant amount of overlap in the crawled URLs of the smaller collections. The then-current stable version of Nutch, 0.8.1, did not offer a tool to remove duplicate URLs, even though the previous version had this functionality and the version in the SVN has it too. We therefore had to switch to this newer version and include the class DeleteDuplicates. After some time spent adjusting everything to the new version, we were finally able to build a global index of all the URLs without duplicates.

The relative recall was computed as follows:

queryResults_k = documents in the local top-k results of peer_x for query_i

globalResults_k = documents in the global top-k results for query_i

recall(query_i, peer_x) = |queryResults_k ∩ globalResults_k| / k

where all peers query for their local top-k results, these results are merged, and duplicate URLs are removed. If the merged, duplicate-free query results contain documents with the same score at positions k and k+1, both documents are taken into consideration for the computation of recall.

In detail, this is done by first sorting the hits by their URL to easily check for duplicates. The duplicates are then set to the minimum score, and the list is sorted by score. The top-k results are simply the first k elements (see Figure 2).
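The merging step just described (drop duplicate URLs, sort by score, cut at k, and keep documents tied with the score at the cut-off) might be sketched as follows; class and method names are ours, not the thesis code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the merge/top-k step described above (our own names, not the thesis classes).
public class TopKMerge {

    // hits: URL -> score, already merged from several peers and duplicate-free
    public static List<String> topK(Map<String, Double> hits, int k) {
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(hits.entrySet());
        sorted.sort((a, b) -> Double.compare(b.getValue(), a.getValue())); // by score, descending
        List<String> result = new ArrayList<>();
        if (k <= 0) return result;
        for (int i = 0; i < sorted.size(); i++) {
            if (i < k || sorted.get(i).getValue().equals(sorted.get(k - 1).getValue())) {
                result.add(sorted.get(i).getKey()); // ties at the cut-off are kept
            } else {
                break;
            }
        }
        return result;
    }

    // relative recall of a merged local result set against the global top-k
    public static double relativeRecall(Set<String> localTopK, Set<String> globalTopK, int k) {
        Set<String> common = new HashSet<>(localTopK);
        common.retainAll(globalTopK);
        return (double) common.size() / k;
    }

    public static void main(String[] args) {
        Map<String, Double> hits = new HashMap<>();
        hits.put("http://u1", 0.9);
        hits.put("http://u2", 0.5);
        hits.put("http://u3", 0.5);
        hits.put("http://u4", 0.1);
        System.out.println(topK(hits, 2)); // u1 plus both documents tied at the cut-off
    }
}
```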

To get a representative comparison between the merged local query results and the global query results, we change the scores of the local results: each document receives the score it has in the global index. This circumvents the problem that documents would otherwise be evaluated differently for a query because of the difference between the local and the global index.

Standard scoring functions like tf*idf (term frequency * inverse document frequency) cause problems: peers might compute different scores for the same document, since the idf component depends on the document collection and thus differs from peer to peer.
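The effect is easy to see on the idf component alone. Using the classic formulation idf = log(N/df) (Lucene's actual formula differs slightly), the same document frequency yields different weights in a local and in the global index; all numbers below are illustrative:

```java
// Classic idf = log(N / df); Lucene's actual formula differs slightly,
// and the numbers in main() are illustrative, not measured values.
public class IdfDemo {

    public static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // A term occurring in 10 documents gets different weights depending on
        // whether it is scored against one peer's index or the global index.
        System.out.println(idf(350, 10));  // local peer index, ~350 pages
        System.out.println(idf(4200, 10)); // global index, all peers combined
    }
}
```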

Because we use many randomized algorithms, we have to repeat our experiments several times to get meaningful results. To evaluate the recall of several runs, we want the individual values of the queries; the results are therefore stored in a Vector<Double> attribute in addition to being displayed, and can later be retrieved with the getRecallResults() method of the Peer class. Several runs can then be executed one after another, the average recall is calculated automatically, and the individual results can also be displayed.

To issue queries after a specific number of iterations of the p2pDating algorithm, we created an extra mechanism to schedule queries. The different actions are stored in classes derived from the class Action: a QueryAction, a HaltAction and a TestAction, which respectively execute a query, shut down the peer, and print debugging information such as the friends and candidates. All these classes take an integer parameter specifying when the action should be executed. The QueryAction is constructed with a Query object submitted by the user. A peer can be scheduled by assigning an array Action[] plan; all the actions in the plan will then be executed. If one just wants the standard behavior, meaning that the peer executes p2pDating at regular intervals, one can simply call run() without adding a plan.

Example:

Action[] plan = new Action[2];

plan[0] = new QueryAction(50,myQuery);

plan[1] = new HaltAction(51);


In the example above, 50 iterations of the p2pDating algorithm are executed and then the pre-defined query is executed; all results are collected and the method halt() is called. We also implemented a special PeerTest, which starts the node but performs all possible meetings one after another and tests which scores are computed from the term frequency vectors and the MIP arrays. This makes it very easy to check whether peers that are supposed to share a topic also get the same KLD value, and to compare the overlap computed by MIP with a one-to-one comparison.
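For reference, the Kullback-Leibler divergence [7] between two term-frequency vectors can be computed as below; the add-one smoothing over the union vocabulary is our own choice and not necessarily what the thesis implementation uses:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// KL divergence KL(P || Q) between two term-frequency vectors, with add-one
// smoothing over the union vocabulary to avoid zero probabilities.
public class KldDemo {

    public static double kld(Map<String, Integer> p, Map<String, Integer> q) {
        Set<String> vocab = new HashSet<>(p.keySet());
        vocab.addAll(q.keySet());
        double pTotal = total(p) + vocab.size();
        double qTotal = total(q) + vocab.size();
        double kl = 0.0;
        for (String term : vocab) {
            double pProb = (p.getOrDefault(term, 0) + 1) / pTotal;
            double qProb = (q.getOrDefault(term, 0) + 1) / qTotal;
            kl += pProb * Math.log(pProb / qProb);
        }
        return kl;
    }

    private static long total(Map<String, Integer> m) {
        long t = 0;
        for (int v : m.values()) t += v;
        return t;
    }

    public static void main(String[] args) {
        Map<String, Integer> basketball = new HashMap<>();
        basketball.put("nba", 5);
        basketball.put("playoffs", 3);
        Map<String, Integer> flowers = new HashMap<>();
        flowers.put("orchid", 4);
        flowers.put("delivery", 2);
        System.out.println(kld(basketball, basketball)); // 0.0: identical distributions
        System.out.println(kld(basketball, flowers));    // > 0: different topics
    }
}
```

Peers crawling the same topic thus yield small KLD values against each other, which is exactly what PeerTest checks.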

The resulting average number of pages was about 350 per peer. The file size was about 1.7 MB for the small collections and 16.1 MB for the global index. The exact numbers of pages were: peer 1: 480 pages, peer 2: 346 pages, peer 3: 677 pages, peer 4: 397 pages, peer 5: 406 pages, peer 6: 316 pages, peer 7: 211 pages, peer 8: 390 pages, peer 9: 529 pages, peer 10: 351 pages, peer 11: 663 pages and peer 12: 599 pages. As one can see, the numbers vary depending on the topic, because some web pages simply have more links on their sites than others.

7.2.1 Tests performed

We consider all the different strategies specified in Section 4.2 and the data described in the previous section. The queries performed were:

• “Dirk Nowitzki”, “playoffs” for the topic basketball on peers 1 and 2

• “database theory” for the topic computer science on peer 5

• “orchid” for the topic flowers on peer 9

• “earth science” for the topic geology on peer 12

The queries consisting of only one keyword were TermQuery objects, and the two-word queries were performed using PhraseQuery objects. The tests were conducted by constructing an Action[] object for each query.

The experiments in detail were:

1. All queries were executed after 80 iterations for the different strategies. A varying number of peers was asked, to see how many peers have to be considered to get good results. We chose 80 iterations because the second experiment shows that the algorithm reaches a good level after 80 iterations.

2. For three peers, the queries were performed at iterations 0, 10, 40, 80, 100 and 200 to measure how many iterations are needed to achieve a sufficient recall. We chose three peers to be queried because this is the maximum number of peers in a topic.

Both experiments were done for all the different strategies:

• We use the weighted sum strategy with varying α to see which value of α is the most successful when building SONs.

Recall that the weighted sum strategy with α = 0.0 corresponds to the similarity-only strategy and α = 1.0 to the overlap-only strategy, so we can emulate these strategies with the weighted sum. The values of α tested in the experiments were 0.0, 0.4, 0.8 and 1.0.

• The random strategy was introduced to test whether the algorithm provides any significant benefit at all.

• Additionally, a more refined alternative version of the Peer was implemented; the most successful variant of that implementation is also shown. The changes made in that version are:

– If a DatingAnswerMessage is received, only friends of friends are added, instead of friends of candidates or random nodes not accepted as friends.

– When a KeepAliveMessage is received from a friend, its new IDs are compared to the old ones and only new IDs are added to the candidates.
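With the convention that α = 1.0 selects overlap only and α = 0.0 similarity only, the weighted sum score can be sketched as follows (component scores assumed normalized to [0,1]; how they are derived from the KLD and MIP estimates is deliberately left abstract here):

```java
// Weighted combination of the two peer-quality components; both component
// scores are assumed normalized to [0,1]. Names are ours, not the thesis code.
public class WeightedSum {

    public static double score(double alpha, double overlapScore, double similarityScore) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must lie in [0,1]");
        }
        return alpha * overlapScore + (1.0 - alpha) * similarityScore;
    }

    public static void main(String[] args) {
        System.out.println(score(0.0, 0.2, 0.9)); // similarity only
        System.out.println(score(1.0, 0.2, 0.9)); // overlap only
        System.out.println(score(0.4, 0.2, 0.9)); // mixture, as tested in the experiments
    }
}
```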

The time needed for 80 iterations was approximately 12-14 seconds, both on the Windows XP machine and on the Linux server. The whole execution of the experiments took about 17 hours.

7.3 Experimental Results

7.3.1 Experiments


[Figure 6 plots recall (0-1) against the number of friends asked (0-12) for the strategies Overlap Only, Weighted Sum 0.4, Weighted Sum 0.8, Similarity Only, Alternative Similarity Only and Random.]

Figure 6: Recall after 80 Iterations

[Figure 7 plots recall (0-1) against the number of iterations (0-200) for the same strategies.]

Figure 7: Recall with three Friends


7.4 Summary of Results

What is surprising in the experiments is that the more refined alternative strategy did not turn out to be the most successful. This can be explained by the small size of the experiment: in this setting with only 12 peers, the most successful strategy is to first get as much information as possible by adding all peers to the candidates list. In a more realistic setting, when the sizes of the lists are considerably smaller than the number of peers, the more refined strategy might lead to better results, because dating messages are only sent to promising peers.

In Figure 6 we see that the p2pDating strategies, except the Overlap Only strategy, are superior to choosing random nodes. In fact, the most successful strategy is the Similarity Only strategy, which can be explained by the fact that most peers have some overlap with other peers, but no peer is totally redundant with respect to the querying peer.

Figure 7 shows that the Similarity Only strategy reaches almost a 10 percentage point advantage over the Random strategy already at 10 iterations; after 80 iterations the difference is about 20 percentage points.

The reasons why the graphs do not reach a recall of 100%, even when asking all the other 11 peers, are:

• It takes many iterations to reach enough different peers, because the random messages often reach the same peers. For instance, it often happens that after 80 iterations fewer than 11 other distinct peers have been reached.

• Our algorithm includes a large portion of randomness, so it can happen that not all the relevant peers are reached, even after 200 runs.

• Results can be below the top-k in a smaller index but within the global top-k. Since the peers only select their top-k results based on their local scores, it can happen that relevant documents are in a collection but just not among its top-k results. The maximum possible recall is then reached, but it is below 100%.

• It may happen that some peers that are not in the SON have relevant results, which are then not considered; but at least in our collection this happens only seldom.

In summary, we have observed that query routing based on semantic overlay networks achieves excellent result quality with few messages needed to query peers.

8 Conclusion

8.1 Contributions

In this work we have shown an implementation of the p2pDating algorithm that creates a semantic overlay network which can be used during query routing to deliver high-quality results. The experiments show that the methods used have a significant advantage over the random strategy. This advantage should be even higher for bigger collections, because in a real network the number of irrelevant peers is higher and finding the most promising ones becomes crucial.

The number of bytes sent was not analyzed directly, but it can be estimated. A typical dating message, sent in each iteration, contains an array of longs with the MIP array, an array of long IDs for friends, the number of words in the collection, and the term vector giving the term distribution. The size of this vector is approximately 2 MB, which is still low compared to a typical MP3 file of 3.5 MB; moreover, this is completely uncompressed data.

8.2 Experiences

We used Pastry as the underlying network infrastructure to build a structured P2P system that uses SONs. This was done by implementing our own class extending the Application class, which contains the Pastry node. Sending and receiving messages is handled by Pastry, but the messages extending the Pastry message types were designed and implemented by us, so we also had to decide when to send messages and what to do when messages are received. This task was effectively the design of a protocol that the peers have to follow.

The Lucene index was used as the data basis for the P2P web search network. The whole collection can be created by crawling the user's own bookmarks or, as we did for the experiments, pages from real users of social networks. Another option is to index pages on-the-fly using a plugin.

Crawling can be done relatively easily using Nutch, but getting it to work in Eclipse was not so easy. In fact, the preparation of the experiments took a considerable amount of the time spent on this thesis, after the algorithm was already programmed and everything was working.


Because of the missing duplicate-removal feature, we had to switch from the latest stable Nutch version, 0.8.1, to version 0.9. Fortunately, the newer version was easier to get working.

A big portion of time went into programming essential analysis tools to see how the algorithm works. Many extra methods and classes had to be implemented to test whether the algorithm really constructed the right SONs. Testing became much easier when we switched from FreePastry version 1.4.4 to version 2.0, because the new simulator node is more independent of the hardware and the underlying network layers, although we had to invest some time in adapting to the changed interface.

Unfortunately, after experimenting for some time with the simulator mode of FreePastry, we discovered a problem that appeared only randomly: sometimes, after hours of execution, the experiments stopped entirely without producing error messages or exceptions. The error persisted on both Linux and Windows machines and could not be resolved by debugging; it is probably related either to Java's or to the operating system's process management. When we changed to the real SocketPastry version the problem disappeared. The effect was that the experiments took much longer than expected, even when executed on a fast Linux server.

In order to run all peers on one computer, each peer had to be started as a thread, as a Runnable object that executes the p2pDating algorithm in its run() method. Unfortunately, with the additional threads for receiving messages and for querying, it could happen that one thread removed an object while another thread was iterating over the same structure, which resulted in exceptions. This problem was solved by using synchronized methods and blocks.
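The fix can be illustrated with a minimal sketch (our own names, not the thesis classes): every access to the shared list, including the iteration, synchronizes on the same monitor.

```java
import java.util.LinkedList;
import java.util.List;

// One thread may remove a friend while another iterates over the list; guarding
// both operations with the same lock avoids ConcurrentModificationException.
public class FriendList {

    private final List<String> friends = new LinkedList<>();

    public synchronized void add(String id) {
        friends.add(id);
    }

    public synchronized void remove(String id) {
        friends.remove(id);
    }

    // The whole iteration runs while holding the monitor, so a concurrent
    // remove() must wait instead of invalidating the iterator mid-loop.
    public synchronized int countMatching(String prefix) {
        int n = 0;
        for (String f : friends) {
            if (f.startsWith(prefix)) n++;
        }
        return n;
    }
}
```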

8.3 Future Research

The measures used in this work could be extended with measures based on the history of the peers, which should bring many advantages in real P2P applications. Moreover, more refined techniques to select and remember candidates and friends could be implemented.

Experiments with real data from high-quality sources, e.g., very active users from del.icio.us or user data from browser plugins indexing the visited links, should also yield interesting results. In the future we want to conduct these experiments, and we also want to combine this approach with Minerva's metadata directory.

The integration with Minerva [13] should be quite easy, because both systems use FreePastry to establish the P2P system. Minerva also uses a PastryNode as its interface to the network, and it should be possible to simply use a Minerva node instead of the standard PastryNode used at the moment.

References

[1] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426, (1970).

[2] P. Flajolet & G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 32(2), 182-209, (1985).

[3] C. Tempich, S. Staab, & A. Wranik. REMINDIN': Semantic query routing in peer-to-peer networks based on social metaphors. In WWW 2004. New York, USA: ACM.

[4] M. Bawa, G. S. Manku, & P. Raghavan. SETS: Search enhanced by topic segmentation. In SIGIR 2003, pp. 306-313.

[5] http://lucene.apache.org/nutch/ (as seen on 6.4.2007).

[6] http://lucene.apache.org/ (as seen on 6.4.2007).

[7] S. Kullback. Information theory and statistics. New York: Wiley, (1959).

[8] G. W. Flake, S. Lawrence, C. L. Giles & F. Coetzee. Self-organization and identification of Web communities. IEEE Computer, 35(3), 66-71, (2002).

[9] http://webcat.sourceforge.net/javadocs/pt/tumba/parser/RabinHashFunction.html (as seen on 5.4.2007).

[10] K. Aberer, P. Cudre-Mauroux, M. Hauswirth, and T. Van Pelt. GridVine: Building internet-scale semantic overlay networks. Technical report, EPFL, (2004).

[11] M. Bender, S. Michel, P. Triantafillou, and G. Weikum. Global Document Frequency Estimation in Peer-to-Peer Web Search. In WebDB, 2006.

[12] M. Bender, S. Michel, G. Weikum, and C. Zimmer. Bookmark-driven query routing in peer-to-peer Web search. In SIGIR Workshop on P2P IR, 2004.


[13] M. Bender, S. Michel, G. Weikum, and C. Zimmer. The MINERVA project: Database selection in the context of P2P search. In Datenbanksysteme in Business, Technologie und Web (BTW), 2005.

[14] M. Bender, S. Michel, P. Triantafillou, G. Weikum, and C. Zimmer. MINERVA: Collaborative P2P Search. Demo, VLDB 2005, Trondheim, Norway.

[15] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 2000.

[16] J. Callan. Distributed information retrieval. In Advances in Information Retrieval, Kluwer Academic Publishers, 2000.

[17] A. Crespo and H. Garcia-Molina. Semantic overlay networks for p2p systems. Technical report, Computer Science Department, Stanford University, October 2002.

[18] F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen. PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities. Technical Report DCS-TR-487, Rutgers University, Sept. 2002.

[19] R. Fagin. Combining fuzzy information from multiple systems. J. Comput. Syst. Sci., 58(1), 1999.

[20] N. Fuhr. A decision-theoretic approach to database selection in networked IR. ACM Transactions on Information Systems, 1999.

[21] M. Li, W.-C. Lee, and A. Sivasubramaniam. Semantic small world: An overlay network for peer-to-peer search. In ICNP, 2004.

[22] J. Lu and J. Callan. Content-based retrieval in hybrid peer-to-peer networks. In CIKM, 2003.

[23] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In SIGCOMM, 2001.

[24] P. Reynolds and A. Vahdat. Efficient peer-to-peer keyword searching. In Middleware, 2003.

[25] T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, and K. Shanmugasundaram. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. Technical report, Polytechnic Univ., 2003.


[26] C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In SIGCOMM, 2003.

[27] P. Triantafillou, C. Xiruhaki, M. Koubarakis, and N. Ntarmos. Towards high performance peer-to-peer content and resource sharing systems. In CIDR, 2003.

[28] K. Aberer, M. Punceva, M. Hauswirth, and R. Schmidt. Improving data access in p2p systems. IEEE Internet Computing, 2002.

[29] E. Buchmann and K. Böhm. How to Run Experiments with Large Peer-to-Peer Data Structures. In IPDPS, Apr. 2004.

[30] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Middleware, 2001.

[31] K. Wehrle, S. Götz, S. Rieche. Distributed Hash Tables. In Peer-to-Peer Systems and Applications. Berlin: Springer, p. 83, 2005.

[32] S. Götz, S. Rieche, K. Wehrle. Selected DHT Algorithms. In Peer-to-Peer Systems and Applications. Berlin: Springer, chapter 8, p. 95ff, 2005.

[33] R. Steinmetz and K. Wehrle. "Peer-to-Peer-Networking & -Computing". Informatik-Spektrum, 27(1):51-54, Springer, Heidelberg, 2004, (in German).

[34] J. X. Parreira et al. p2pDating: Real life inspired semantic overlay networks. Information Processing and Management (2006), doi:10.1016/j.ipm.2006.09.007.

[35] G. Hasslinger. ISP Platforms Under a Heavy Peer-to-Peer Workload. In Peer-to-Peer Systems and Applications. Berlin: Springer, 2005, p. 383ff.

[36] D. Anderson. Random Graphs, Small Worlds and Scale-Free Networks. In Peer-to-Peer: Harnessing the Power of Disruptive Technologies. Beijing - Cambridge - Farnham: O'Reilly, 2001, p. 67ff.

[37] K. A. Lehmann, M. Kaufmann. Random Graphs, Small Worlds and Scale-Free Networks. In Peer-to-Peer Systems and Applications. Berlin: Springer, 2005, p. 57ff.

[38] K. Wehrle, S. Götz, S. Rieche. Distributed Hash Tables. In Peer-to-Peer Systems and Applications. Berlin: Springer, 2005, p. 79ff.


[39] S. Götz, S. Rieche, K. Wehrle. Selected DHT Algorithms. In Peer-to-Peer Systems and Applications. Berlin: Springer, 2005, chapter 8, p. 97ff.

[40] R. Steinmetz, K. Wehrle (eds.). Peer-to-Peer Systems and Applications. Berlin: Springer, 2005.

[41] A. Oram (ed.). Peer-to-Peer: Harnessing the Power of Disruptive Technologies. Beijing - Cambridge - Farnham: O'Reilly, 2001.
