COMPUTER SCIENCE REVIEW 6 (2012) 161–183

Available online at www.sciencedirect.com

Journal homepage: www.elsevier.com/locate/cosrev

Survey

Searching in peer-to-peer networks

Iraklis A. Klampanos a,∗,1, Joemon M. Jose b

a School of Informatics, University of Edinburgh, United Kingdom

b School of Computing Science, University of Glasgow, United Kingdom

∗ Corresponding author. E-mail address: [email protected] (I.A. Klampanos).

1 The work in this article was undertaken while the author was at the School of Computing Science of the University of Glasgow.

ARTICLE INFO

Article history: Received 24 May 2011; accepted 2 July 2012

Keywords: Information retrieval; Content-based retrieval; Ontologies; Semantic overlay networks; P2P networking; Applications; Distributed hash tables

ABSTRACT

As peer-to-peer networks are proving capable of handling huge volumes of data, the need for effective search tools is lasting and imperative. During the last years, a number of research studies have been published, which attempt to address the problem of search in large, decentralized networks. In this article, we mainly focus on content and concept-based retrieval. After providing a useful discussion on terminology, we introduce a representative sample of such studies and categorize them according to basic functional and non-functional characteristics. Following our analysis and discussion we conclude that future work should focus on information filtering, re-ranking and merging of results, relevance feedback and content replication as well as on related user-centric aspects of the problem.

© 2012 Elsevier Inc. All rights reserved. 1574-0137/$ - see front matter. doi:10.1016/j.cosrev.2012.07.001

Contents

1. Introduction
2. Background and terminology
2.1. Defining peer-to-peer networking
2.2. Small-world networks and peer-to-peer
2.3. The process of information retrieval
2.4. Distributed information retrieval
2.5. Data, information and knowledge
3. An architectural viewpoint
4. Retrieval over peer-to-peer networks
4.1. Distributed hash tables
4.2. Semantic overlay networks
4.2.1. Network-data independence and SONs
4.3. Content-based peer-to-peer networking
5. Frequently occurring components of peer-to-peer retrieval networks
5.1. Non-functional characteristics
5.1.1. Primary target application scenarios
5.1.2. Network organization and topology

5.2. IR-related functional components
5.2.1. Retrieval mechanisms and models
5.2.2. Evaluation
6. Other issues and pointers for future work
7. Conclusions
References

1. Introduction

The term peer-to-peer is known to have been coined circa 2000 when it was used to describe the network of millions of users all over the world sharing music through Napster [1]. According to the peer-to-peer (P2P) networking paradigm, all the participating computers behave equally in terms of the services they can offer and receive inside the network. This can be seen as the opposite, or even as a generalization, of the established Client–Server (C–S) model, where services are being offered by central servers to a much larger end-user population. In a peer-to-peer network all participating nodes may equally offer and request services.

The popularity and widespread use of peer-to-peer networks started with the infamous Napster, which allowed its users to share and download multimedia content, especially MP3-encoded music files. After the success and downfall of Napster, as a free and largely illegal file-sharing service, and the media attention it received, it was inevitable that researchers turned to peer-to-peer networking for a number of different reasons. The promise of doing away with a central serving authority, instead exploiting the collective knowledge and resources of all participants, was enough of a reason for many. Furthermore, the peer-to-peer paradigm was conceptually connected to the mathematical notion of small worlds [2]. In practical terms this meant that peer-to-peer networks could be made highly available and connected, scalable, dynamic and robust. These were strong incentives for people to take a close look at the promising, then new, paradigm. Today, such non-functional characteristics still form the main reason for people to further research peer-to-peer networks and related solutions.

Nowadays, there are many peer-to-peer systems as well as research and development proposals covering a broader spectrum of applications beyond music and multimedia sharing. File-sharing systems, such as Limewire and BitTorrent, have evolved in order to be able to handle virtually any kind of file. Peer-to-peer, distributed file-systems are more reliable and file-system extensions for collaboration over peer-to-peer networks are also available for general use. Other fields where the peer-to-peer paradigm seems to be gaining momentum include digital libraries, patent retrieval, multimedia retrieval and Internet television, to name but a few. However, independently of the application at hand, in order for these applications to be useful to end-users they need to have effective and efficient search capabilities, depending on the requirements of the application.

As an example of peer-to-peer searching let us consider file-sharing applications. All file-sharing systems have searching facilities which enable users to search for files based on their file-name. Since most users of these systems do not look for completeness or accuracy, searching based on file-names is, in most cases, adequate. On the contrary, for a digital library application, such an approach would be insufficient. Indeed, a digital-library application would require a search facility able to perform full-text search with efficiency and accuracy on a par with that of a centralized search engine. This type of searching of textual or multimedia information over peer-to-peer networks is an active research field, which we intend to address in this survey.

During the last ten years or so, a number of studies have been published addressing issues related to retrieval in peer-to-peer networks. These publications make suggestions for improving individual components or propose novel architectures addressing retrieval in peer-to-peer networks. As a fast-paced and relatively new field, peer-to-peer retrieval lacks coherence across publications, due to inconsistent terminology. The terms “information retrieval”, “semantic overlay networks”, “content-based searching” and many others are used interchangeably, which adds confusion for both new and more seasoned researchers. Additionally, in some cases, there seems to be confusion regarding the exact problems various proposals aim to address. This makes it difficult for new researchers entering the field to place their work on the constantly changing map of the ongoing research effort. Moreover, it is difficult to justifiably determine which aspects of peer-to-peer retrieval need further attention and which could form the basis of future work. For these reasons survey papers constitute an invaluable tool for answering these questions in a way which is both well-structured and focused on the current state-of-the-art as a whole. The contributions we make in this article focus on information retrieval in peer-to-peer networks and, more specifically, are the following:

1. Provide justified and consistent terminology for the area of peer-to-peer information retrieval, able to capture the essence of recent and current research effort.

2. Provide a categorized analysis of past and current research studies based on both functional and non-functional characteristics, wherever these apply.

3. Discuss recent and current research trends as well as pointers for future work.

The structure of this article is as follows: In the next section we introduce all necessary background concepts related to retrieval over peer-to-peer networks. This section includes a discussion on definitions of peer-to-peer networking, small-world networks, information retrieval and the vector-space model, distributed information retrieval and a discussion on the differences of data, information and knowledge as these are used by various related solutions. In Section 3 we present an architectural viewpoint through which different retrieval approaches, even though addressing different primary problems, can coexist. In Section 4 we introduce a number of proposals for various kinds of retrieval over peer-to-peer networks. These are categorized into distributed hash tables, content-based solutions and semantic overlays, emphasizing the similarities and differences in these approaches. In Section 5 we analyze the components present in content-based solutions and semantic overlays. From a retrieval viewpoint, we discuss basic functional and non-functional characteristics of various proposals. In Section 6 we provide pointers for future work which, we feel, has not been sufficiently addressed in the literature yet. Finally, we conclude this article in Section 7.

2. Background and terminology

The various approaches of retrieval over peer-to-peer networks combine paradigms, techniques and algorithms from traditional information retrieval as well as from other computing-related scientific areas. In this section we introduce important background information for the purposes of this survey.

2.1. Defining peer-to-peer networking

At present there is not a single, accurate and generally accepted definition of what constitutes a peer or a peer-to-peer network. By stating that a peer-to-peer network is a network where all the participating nodes are made equal we leave out most real peer-to-peer systems, including completely decentralized networks like Gnutella. Even in these networks peers may choose the level of their participation. For instance, in a Gnutella network, there may be computers which are much better connected than others. In turn, this leads to having computers with better access to certain resources than others; hence not all peers are, in fact, equal.

Coulouris et al. [3] define a peer-to-peer network as one in which “[. . . ] all of the processes play similar roles, interacting cooperatively as peers to perform a distributed activity or computation without any distinction between clients and servers”. Even though this definition is generic so as to be able to capture the essence of peer-to-peer networking, it leaves out the notion of hybrid peer-to-peer networks, in which dynamically allocated super-nodes exist, which serve their semantic neighborhoods. It also imposes similar functionality on all participating nodes, regardless of their willingness or ability to participate or cooperate. Androutsellis-Theotokis and Spinellis [4] indicate that the discrepancies among the various definitions of peer-to-peer computing are due to the fact that these networks are being labeled as “peer-to-peer” because of their external, application-specific characteristics. Following a systems-oriented, non-functional approach, they propose that peer-to-peer networks “[. . . ] are distributed systems consisting of interconnected nodes able to self-organize into network topologies with the purpose of sharing resources such as content, CPU cycles, storage and bandwidth, capable of adapting to failures and accommodating transient populations of nodes while maintaining acceptable connectivity and performance without requiring the intermediation or support of a global centralized server or authority”. This definition precisely states what a pure peer-to-peer network is from an engineering viewpoint. However, when looking at peer-to-peer systems for information retrieval, or in any other specific domain, it fails to take into account characteristic features that may be present, thus leaving out highly relevant systems and approaches.

For the purposes of information retrieval over a peer-to-peer network we propose the following definitions, since they encompass the spirit of peer-to-peer networking as well as being inclusive of the studies discussed in this survey.

Peers are processes running on participating machines in the network, which are potentially capable of providing and using remote services in a similar manner. However, peers may not exhibit equal levels of participation. These should be proportional to peers’ willingness and should be dictated by hardware or other given, non-functional circumstances (e.g. limited bandwidth etc.) or by user intervention. Peers may provide both server and client functionality.

It then follows that

A peer-to-peer network is a directed graph whose nodes are represented by peers and its edges are represented by abstract communication channels. In such a network the equality of peers is defined by their potential capabilities, while their participation levels are proportional to their willingness to participate.

2.2. Small-world networks and peer-to-peer

Small-world networks [5] are types of networks which are neither random nor regular, but are classified as lying in between the two extremes. Such networks are well connected and exhibit small shortest path lengths between any two of their vertices while at the same time forming highly connected clusters. As Milgram [6] famously showed (and so did many others after him), small-world patterns emerge naturally in social networks, as people typically belong to various, partially disjoint social circles. In a small-world network, the ties between closely associated parties are referred to as strong ties or short-range links and to a degree are formed due to similarity, or common interests, between the two parties. On the other hand, weak ties or long-range links are those formed between two parties of different groups or clusters in a seemingly random way. While the strong ties indicate closeness (on some basis), the weak ties are the ones making the network connected, providing for small shortest-path lengths between any two nodes. It was for these reasons that Granovetter [7] spoke of the strength of weak ties.

Information-sharing peer-to-peer networks resemble social networks, since the attachment of new nodes to the network is typically being done initially at random and subsequently according to factors such as closeness, interest, download speed, content similarity, etc. The small-world characteristic of small shortest-path lengths means that, given the right routing choices are made, message passing in a peer-to-peer network can occur efficiently, regardless of the size of the network. At the same time, the clustering of nodes according to some criterion would mean that messages would not have to be randomly traversing the network for suitable recipients but could rather be sent directly to a potentially relevant group. As an example, the original Gnutella network [8] has been shown to be exhibiting small-world properties [1], even though it is flat-structured and similarity-unaware. Since this very basic file-sharing network is a small-world without explicitly defining the basis of peer clustering, it would be of significant benefit to have clusters based on content instead. This would mean that the network organization would occur around the shared information, neighborliness would indicate common information interests and query routing would be directed to the peers most likely to have relevant content. At the same time the long-range links would be used in order to route a query further away in the network and into peer-groups of dissimilar interests to the local neighborhood. This is the motivation behind various peer-to-peer studies claiming to be “enforcing” the small-world property onto their networks in order to gain in both network efficiency and in retrieval effectiveness [9–11].
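The effect of weak ties described above can be illustrated with a small, self-contained sketch. It is not taken from any of the surveyed systems: the ring-lattice construction, the undirected links, the function names and the parameters (60 nodes, 4 neighbours, one long-range link per tenth node) are all assumptions chosen for illustration. It builds a regular, highly clustered network, adds a handful of long-range links, and measures average shortest-path length and clustering before and after.

```python
from collections import deque

def ring_lattice(n, k):
    """Regular network: each node links to its k nearest ring neighbours (k even)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for d in range(1, k // 2 + 1):
            adj[v].add((v + d) % n)
            adj[(v + d) % n].add(v)
    return adj

def add_long_range_links(adj, step):
    """Copy the network and add a 'weak tie' from every step-th node to the far side."""
    n = len(adj)
    out = {v: set(nbrs) for v, nbrs in adj.items()}
    for v in range(0, n, step):
        w = (v + n // 2) % n
        out[v].add(w)
        out[w].add(v)
    return out

def average_path_length(adj):
    """Mean shortest-path length over all reachable pairs, via breadth-first search."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def clustering_coefficient(adj):
    """Mean fraction of a node's neighbour pairs that are themselves linked."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

lattice = ring_lattice(60, 4)                          # strong ties only
small_world = add_long_range_links(lattice, step=10)   # plus a few weak ties

for name, net in (("lattice", lattice), ("small world", small_world)):
    print(name, average_path_length(net), clustering_coefficient(net))
```

In this toy setting the few added links shorten the average path noticeably while clustering stays close to that of the regular lattice, which is the combination the small-world literature refers to.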

2.3. The process of information retrieval

Information retrieval is a process through which a user aims to satisfy an information need by typically retrieving relevant, non-structured information items from a collection. Such items may either be textual, such as web pages, books, articles, etc., or multimedia, such as sound, music, images and video. The differences between information retrieval and other forms of retrieval are described below, in Section 2.5.

Information retrieval typically takes place through the use of search engines. A search engine provides users the means to retrieve documents from potentially huge corpora. The search engine has to process the documents, before they can be made searchable. First, the document collection is fed into the search engine, which, after some initial lexical processing, creates the index, the most central data structure of the search engine. The lexical processing phase usually involves the removal of stop-words, which are small and very frequently occurring word tokens that do not contribute to the meaning of a document (e.g. “the”, “when”, “is” etc.), as well as stemming, the removal of the suffixes of words [12]. As an example, after stemming, the words “connecting” and “connections” will be internally represented by the token “connect”. The application of stemming and stop-word removal results in smaller, more manageable indexes and also in more descriptive document representations.
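The lexical processing phase can be sketched as follows. This is a deliberately naive illustration: the stop-word set and the suffix list are made up for the example, and `naive_stem` is a toy suffix stripper rather than the stemming algorithm cited in [12].

```python
import re

# Illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"the", "when", "is", "a", "of", "and", "to", "in"}

# Toy suffix list, checked in this order (longer suffixes first).
SUFFIXES = ("ions", "ing", "ion", "ed", "s")

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lower-case, tokenize, drop stop-words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(naive_stem("connecting"), naive_stem("connections"))
print(preprocess("When is the peer connecting?"))
```

Both “connecting” and “connections” collapse to the same token here, so a query using either form would match documents using the other.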

The index is the most central data structure of the search engine. It holds lists of indexing features (terms) as well as pointers from features to actual documents. Depending on the IR model adopted these indexing features may vary. It is also possible, and often necessary, for an index to hold both forward pointers (from document identifiers to features) as well as inverted pointers (from features to document identifiers). This latter component of the index is also called an inverted index and it is important for the assembly of the final results list that the end-user receives at the end of a search process.
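A minimal sketch of the two pointer directions, assuming documents have already been reduced to lists of terms. The document identifiers, the sample terms and the helper names `build_index` and `lookup` are hypothetical and serve only to illustrate the structure.

```python
from collections import defaultdict

def build_index(docs):
    """docs maps a document identifier to its list of indexing terms."""
    # Forward pointers: document identifier -> features.
    forward = {doc_id: sorted(set(terms)) for doc_id, terms in docs.items()}
    # Inverted pointers: feature -> document identifiers.
    inverted = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            inverted[term].add(doc_id)
    return forward, dict(inverted)

def lookup(inverted, *terms):
    """Identifiers of the documents that contain every given term."""
    postings = [inverted.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    "d1": ["peer", "network", "search"],
    "d2": ["search", "engine"],
    "d3": ["peer", "index"],
}
forward, inverted = build_index(docs)
print(inverted["search"])
print(lookup(inverted, "peer", "search"))
```

Answering a query only touches the posting sets of the query terms, which is why the inverted component is the one used to assemble the result list.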

Retrieval takes place against such index data-structures, given an applied IR model, describing documents and, more generally, retrieval units, appropriate matching and ranking functions, etc. [13]. An example of a popular and highly influential IR model is the Vector-Space Model [14], which describes documents as vectors of term-weights. Given this description, the similarity of two documents can then be intuitively described by any quantification of their angular separation, such as the cosine of the two document vectors. More information on introductory IR concepts can be found in [15,16], etc.
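The cosine measure mentioned above can be sketched in a few lines. Using raw term frequencies as weights is an assumption made to keep the example short; the model itself does not prescribe a particular weighting scheme.

```python
import math
from collections import Counter

def term_vector(terms):
    """Sparse document vector; raw term frequency stands in for the weight."""
    return Counter(terms)

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc_a = term_vector(["peer", "network", "search", "search"])
doc_b = term_vector(["search", "engine", "index"])
doc_c = term_vector(["cooking", "recipes"])

print(cosine(doc_a, doc_b))  # some overlap: strictly between 0 and 1
print(cosine(doc_a, doc_c))  # no shared terms
```

A query can be treated as a short document and ranked against the collection with the same function.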

One of the goals of peer-to-peer IR systems, as well as of other centrally controlled distributed IR systems, is the effective distribution of index data-structures as well as the distribution of appropriate IR models.

2.4. Distributed information retrieval

Distributed information retrieval (DIR) is a research area similar to that of peer-to-peer information retrieval. DIR addresses the problem of locating and retrieving information from a set of databases or IR engines, as opposed to classical IR, which assumes the existence of a single, central information store. The interface between the user and the information sources is typically provided by a single broker that manages all the transactions between the client (the user) and the servers (the IR engines). This broker is responsible for forwarding queries to the most relevant of the providers as well as fusing and returning result-lists to the user. This requires that the broker has some knowledge over the retrievable content of the participating providers before any query routing can take place.

Callan [17] described the problem of DIR as the set of thefollowing problems:

Resource description is the process during which individual providers or IR engines inform the broker of their content.

Resource selection is the process during which, given a query, the broker makes an informed decision as to which providers may have content that is relevant to the query. These will be the providers that will end up receiving and responding to the query.

Results fusion is the merging of result-lists that the broker has to perform after having received results from the various, previously selected, providers. The broker has to merge the various result lists into a single one before routing them back to the user.
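The three sub-problems can be lined up in a toy broker. Everything below is a hypothetical illustration rather than any of the published frameworks: the `Provider` and `Broker` classes, the term-frequency descriptions, the summed-frequency selection score and the fusion by raw local score are all simplifying assumptions.

```python
from collections import Counter

class Provider:
    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # document identifier -> list of terms

    def describe(self):
        """Resource description: term frequencies over the whole collection."""
        return Counter(t for terms in self.docs.values() for t in terms)

    def search(self, query):
        """Local ranking: number of query-term occurrences per document."""
        hits = []
        for doc_id, terms in self.docs.items():
            score = sum(terms.count(q) for q in query)
            if score > 0:
                hits.append((score, doc_id))
        return hits

class Broker:
    def __init__(self, providers):
        self.providers = {p.name: p for p in providers}
        self.descriptions = {p.name: p.describe() for p in providers}

    def select(self, query, k):
        """Resource selection: the k providers whose descriptions match best."""
        scored = [(sum(desc[q] for q in query), name)
                  for name, desc in self.descriptions.items()]
        ranked = sorted((s for s in scored if s[0] > 0), reverse=True)
        return [name for _, name in ranked[:k]]

    def search(self, query, k=2):
        """Forward the query to the selected providers, then fuse their lists."""
        merged = []
        for name in self.select(query, k):
            for score, doc_id in self.providers[name].search(query):
                merged.append((score, name, doc_id))
        return sorted(merged, reverse=True)  # results fusion by raw score

broker = Broker([
    Provider("A", {"a1": ["peer", "network"], "a2": ["peer", "peer", "search"]}),
    Provider("B", {"b1": ["cooking", "recipes"]}),
    Provider("C", {"c1": ["search", "engine", "peer"]}),
])
query = ["peer", "search"]
print(broker.select(query, 2))
print(broker.search(query))
```

Sorting by raw local scores only works here because every provider uses the same scoring function; in general, scores from independent engines are not directly comparable, which is what makes results fusion a problem in its own right.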

These defining issues of distributed information retrieval can be approached in various ways, building on different theoretical backgrounds. Three highly influential trends can be traced in the literature, which have also influenced some of the systems examined in this survey. Callan et al. [18] proposed a framework based on probabilistic inference networks. The probabilities used by this approach are based on statistics, such as term frequencies, drawn from the remote collections. Fuhr [19] proposed a decision-theoretic approach encompassing both issues related to content as well as to other costs which might occur in a DIR setting. Si et al. [20], building on previous work undertaken by Ponte and Croft [21], proposed an approach to DIR based on language models.

The ultimate goal in both P2P and distributed IR is to be able to retrieve information from multiple independent sources. These sources might have overlapping material or inaccurate resource descriptions or even be antagonistic towards one another. These issues are all factors which the broker has to work around and they are common in both peer-to-peer and distributed IR. On the other hand, in a peer-to-peer setting, providers may also be clients as well as brokers for other nodes’ requests, i.e. multiple peers may have to play the role of the broker for others. The network topology is also different, with arbitrary connections between participants being the norm in the peer-to-peer case. Finally, no assumption can be drawn on the availability or the quality of any particular resource; in fact it is expected for participating peers to join and leave the network unexpectedly. Therefore, resource descriptions have to be disseminated not only to one but to a number of peers. Because of the complex topologies of P2P networks, resource selection has to take place at various stages of the query routing, thus becoming a complex routing task. Lastly, the merging of results also has to be undertaken by multiple participants, at multiple points in the network. It follows that peer-to-peer information retrieval, arguably, deals with generalized versions of the DIR problems and so the two fields are inherently related.

Fig. 1 – Knowledge retrieval in a distributed environment through a tree of concepts or ontology.
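To make the contrast with a single broker concrete, the following sketch lets every peer act as a small broker for the queries it receives. It is a hypothetical illustration, not a protocol from the literature: the `Peer` class, the assumption that each peer knows its neighbours' term sets, the fan-out of two and the hop limit (`ttl`) are all invented for the example.

```python
class Peer:
    def __init__(self, name, terms):
        self.name = name
        self.terms = set(terms)   # local content, reduced to a set of terms
        self.neighbours = []      # arbitrary overlay links to other peers

    def handle(self, query, ttl, visited=None):
        """Answer locally, then forward to the most promising neighbours."""
        visited = visited if visited is not None else set()
        visited.add(self.name)
        results = [self.name] if self.terms & query else []
        if ttl == 0:
            return results
        # Per-hop resource selection: rank unvisited neighbours by query overlap.
        candidates = [n for n in self.neighbours if n.name not in visited]
        candidates.sort(key=lambda n: len(n.terms & query), reverse=True)
        for neighbour in candidates[:2]:
            if neighbour.name in visited:
                continue  # reached through another branch in the meantime
            # Per-hop merging: append whatever the neighbour's subtree returns.
            results.extend(neighbour.handle(query, ttl - 1, visited))
        return results

a = Peer("a", ["music"])
b = Peer("b", ["film"])
c = Peer("c", ["retrieval", "index"])
a.neighbours = [b]
b.neighbours = [a, c]
c.neighbours = [b]

query = {"retrieval"}
print(a.handle(query, ttl=2))  # the query reaches c through b
print(a.handle(query, ttl=1))  # the query stops at b
```

Selection and merging happen at every hop rather than once at a central point, and a peer that leaves the network simply disappears from its neighbours' candidate lists.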

2.5. Data, information and knowledge

Historically there seems to have been confusion amongst researchers and practitioners regarding the boundaries between data and information retrieval. In his early information retrieval textbook, van Rijsbergen [22] discusses their difference in terms of algorithms and approaches. It is generally accepted that the boundaries of the two fields can be, in some cases, hazy. However, since today we have plenty of examples of both database and information retrieval systems, their differences are as apparent as ever. These differences stem from the fact that the terms “data” and “information” refer to different computational entities. However, due to their linguistic closeness in meaning, their difference was not so obvious from the beginning of information and data retrieval research. With the advent of fast communications and the wide deployment of distributed systems a similar confusion emerged. Especially in the domain of peer-to-peer networking, where the participating peers enjoy certain freedoms with respect to the way they manage their content, it is common for researchers, especially new ones, to find a rather mixed use of terminology in related published material.

In the meantime, “knowledge” has emerged as a new content unit in computer science. Knowledge representation, management and retrieval became very active fields for researchers and companies, who are also looking for ways of distributing these processes in a peer-to-peer fashion. It is generally accepted that, as a concept, knowledge is neither information nor data and so its representation in computers should also differ. In their book, Davenport and Prusak [23] define knowledge as a “[...] fluid mix of framed experience, values, contextual information and expert insight that provides a framework for evaluating and incorporating new experiences and information. [...]”. What becomes apparent from this definition is the connection of knowledge to both data and information. In computer science, ontologies are thought to be capable of encapsulating the notion of knowledge [24]. In particular, after the World Wide Web Consortium’s work on the Semantic Web [25] started and the Resource Description Framework (RDF) [26] was completed, many used RDF or similar XML-based ontology frameworks for knowledge representation, management and retrieval, for example [27,28], and others.

From this discussion, the need to differentiate between the aforementioned research approaches, with respect to retrieval, should become apparent. The retrieval units and methods as well as their target application domains differ. This also affects a system’s presentation, implementation complexity and evaluation methodology. In this survey we will name the three distinct retrieval tasks with respect to the type of the retrievable unit they assume for operation. Therefore, we will refer to content-based systems when presenting systems which address the problem of peer-to-peer information retrieval, since their input is the shared content in its unstructured form. Distributed hash-table systems are a form of distributed databases, since searching takes place against properly defined domains and the matching is exact. Semantic overlay networks (SONs) (see Fig. 1) deal with the retrieval of semantic semi-structured data describing resources or used to derive knowledge from other data and information. In the case of SONs, the retrieval of meta-information is usually abstracted away from the lower-level retrieval of data or information, which makes for an additional differentiating factor.

At this point it is worth noting that a number of researchers consider the term SON to cover every P2P network that delivers some retrieval functionality via the classification or clustering of information and networking resources [29].


Table 1 – A comparison of retrieval approaches for peer-to-peer networks.

Aspect                | DHTs               | Content-based                 | SONs
Item representation   | Feature sets       | Frequency vectors             | Meta-data
Matching              | Partial or exact   | Best or partial               | Partial or exact
Model                 | Deterministic      | Probabilistic                 | Deterministic
Inference             | Deduction          | Induction                     | Deduction
Classification        | Monothetic         | Polythetic                    | Monothetic
Query specification   | Complete           | Incomplete                    | Complete
Items wanted          | Matching           | Relevant                      | Matching
Network topology      | Flat               | Hybrid or flat                | Flat
Evaluation challenges | Efficiency, recall | Efficiency, precision, recall | Efficiency, recall

However, for the purposes of this survey, we feel that this is too broad a definition, since purely content-based systems and SONs differ both in their approach to searching and in the assumptions they make with respect to the shared information.

In Table 1 we outline a number of aspects relevant to retrieval with regard to search solutions proposed for peer-to-peer networks. These aspects are not meant to be complete and the treatment suggested for each of these is not meant to be universally true. They are provided as a means for discussion as well as for highlighting the differences in the aforementioned approaches.

Item representation refers to the way retrievable items are represented in the different frameworks. In DHTs they tend to be represented as feature sets of hash values. Directly or indirectly these sets represent the presence or absence of a feature in an item and may take different forms depending on the implementation. In the case of content-based systems, items are usually represented by term-frequency vectors stemming directly from the content. In the case of SONs, the retrievable items are meta-data about the items. (Even though the final output of a SON is a list of documents or other retrievable items, the immediate contribution of an ontology to the retrieval process goes as far as the resource selection phase.) Item representation affects the volume of input data a system will have to cope with and therefore, to an extent, its scalability. While DHTs and SONs usually deal with, or are evaluated against, smaller sets of data, their network topologies are usually flat, with peers being exactly equal in terms of their capabilities and responsibilities. In the case of content-based systems, flat-structure topologies have not been seen to scale. Instead, most scalable content-based architectures rely on some content classification or clustering mechanism as well as on a sub-layer of super-peers with regional administrative responsibilities.

In DHTs, the matching of a query to resources can be either partial or exact. Depending on the implementation, partially matched items may also be returned. However, the model remains deterministic even for the case of partial matching. In content-based systems, matching may be partial or best-match, with the items that match best receiving a better rank. The underlying models for content-based systems are probabilistic, similar to traditional information retrieval systems. In the case of SONs, matching is usually either partial or exact, since certain concepts from a query must have the same keys as a resource description in order to match. Usually, but not always, the model in the case of SONs is deterministic.

In terms of useful classification, DHTs follow the trend of databases and, even though they usually do not directly apply classification, whenever implicit classification occurs, it is monothetic. For example, in order for a peer to become a member of a network neighborhood, it must possess terms or documents or indices which are sufficient and necessary for its inclusion. Due to their probabilistic nature, content-based systems exhibit polythetic classes in that membership of a class depends on an overall similarity or dissimilarity between the vectors of the two entities. In some cases, where dimensionality reduction is applied, membership in a class does not even depend on common terms in the original content vectors. In SONs, monothetic classes would probably be more useful, since comparisons between two trees usually depend on common, or deterministically related, concepts.

The query specification in the cases of DHTs and SONs is complete. This is not to say that all terms in a query must be found in a document in order for the document to be retrieved, but it is implicitly assumed that a query expresses the information need as fully as possible, usually looking for a specific piece of data. It is no surprise, then, that in both these approaches the items a user expects to find match the query. In the case of content-based systems, however, this is not the ultimate goal of querying. Instead, using a content-based retrieval system one would expect to find documents about the topic described by the query. Relevant documents may or may not directly resemble the query but should still be retrieved. This also helps to explain the rationale behind the evaluation approaches various studies take. Apart from the issue of network efficiency, which is common to any distributed system, DHTs and SONs are usually evaluated for recall, that is, their ability to find as many relevant items as possible, given a query. This is based on the assumption that retrieved items match the query and are therefore relevant by definition. In the case of content-based systems, researchers also look for precision, since non-relevant documents typically appear in the results list. Since information retrieval is not about exact matching but relevance (or aboutness), the quality of a system is also affected by its capability to differentiate between relevant and non-relevant documents.
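To make the two evaluation measures concrete, the following minimal sketch computes precision and recall for a single query using their standard set-based definitions. The function name and the document identifiers are illustrative assumptions and do not come from any of the surveyed systems.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of item ids returned by the system (hypothetical ids).
    relevant:  iterable of item ids judged relevant for the query.
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: fraction of the retrieved items that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of the relevant items that were retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Four items returned, two of them among the three relevant ones.
p, r = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d7"})
```

Under the exact-matching assumption made for DHTs and SONs, every retrieved item is relevant by definition, so precision is trivially 1 and only recall is informative; in a content-based system both values can vary.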

3. An architectural viewpoint

The classification of systems based on their retrieval properties, as discussed in the previous section and summed up in Table 1, is important for discussion as well as for being the basis on which current research communities seem to describe themselves. This classification is taken further with a number of reference research systems and discussion in Section 4. However, before this is discussed, it is beneficial to explore where the various retrieval approaches fall from an architectural viewpoint, as this will aid our understanding with respect to their merits, weaknesses and the approach taken for their evaluation.

Fig. 2 – Sample interfacing between the IR Model and the Resource Resolving layers in the form of a sequence diagram. In this example, the actions performed by the resolver could be performed by a DHT. In reality a peer software stack will include additional layers, as will the layers depicted here contain additional components; however, these have been omitted as they do not fall within the focus of this survey.

Let us assume a hypothetical P2P IR system, able to query arbitrary remote resources in a P2P network. This system will need to have a presentation layer to accommodate interaction with its users as well as needing to provide some retrieval, indexing and networking functionality. Irrespective of the actual retrieval unit (document, person, image, etc.), matching strategy (probabilistic, best-match, Boolean, etc.) or the space over which matching occurs (term-frequency vectors, meta-data, keywords), our hypothetical system can be decomposed into the following broad functional layers:

1. Presentation: Interactive I/O and presentation.

2. IR modeling: Abstract application layer, which describes the retrieval unit, strategy and space over which matching may occur. This layer will interpret queries handed by the Presentation layer and will issue requests to the underlying Resource Resolution layer in order to compute matching, similarity, etc.

3. Resource resolution: Provides resolution of resources both locally and remotely. This layer provides access to the lower-level index as well as locating appropriate remote resources needed by the IR Modeling layer. This layer also interfaces with the Network layer as it is responsible for maintaining the overlay topology.

4. Network: The hard-wired or wireless networking layer.

In this article we focus on the IR Modeling (2) and Resource Resolution (3) layers.

In real life, research or otherwise, this separation is not as well-defined, with the layer boundaries being rather fuzzy and with cross-layer communication not being strictly pair-wise. In research publications focus is given to one of the above layers, depending on the research field being addressed. However, this is typically done implicitly and the proposed architectures, for presentation reasons, might seem to span two or more of the aforementioned layers. As an example, a content-based P2P IR solution might describe an algorithm for locating relevant peer-groups in the overlay network. This, however, does not preclude the use of a DHT for this purpose. Similarly, a DHT-based retrieval solution might evaluate against a Boolean data matching model. This does not necessarily mean that the same DHT would be unsuitable for supporting a more sophisticated IR model. However, content-based architectures, SONs and DHTs are each suitable on their own to perform some retrieval function, which, combined with the fuzziness of the layer boundaries defined above, might cause some confusion, especially to newcomers.

To demonstrate this point further, consider Fig. 2. Here, we depict the actions a hypothetical cluster-based P2P IR system may need to take in order to evaluate a given query. Actions such as the transformation of a query to a format compatible with the model, as well as the classification algorithms used, are defined within the IR Model. Content-based P2P IR approaches generally focus on such aspects of information and retrieval modeling. Functionality such as the location of resources within the network based on information units, e.g. keywords, falls within the Resource Resolving layer. DHTs are better suited and indeed tend to focus more on this aspect of the problem.


Fig. 3 – A broad classification of approaches for retrieval over peer-to-peer networks based on their original scientific influences.

While this architectural classification is important for discussing issues relating to software modeling, design and implementation, we believe that it is less useful in a research survey. For the remainder of this article we will concentrate on another classification of P2P retrieval systems based on their retrieval properties, with a focus on unstructured content-based networks and SONs.

4. Retrieval over peer-to-peer networks

Almost every peer-to-peer application that has either been built or proposed as a research study is in need of some form of retrieval functionality. Even though this survey primarily focuses on information retrieval, in this section we use the term “retrieval” in its broadest sense, i.e. for all forms of information or data retrieval (full-text, multimedia, keyword, hash-based keys, etc.). As mentioned above, there are three distinct approaches to retrieval over peer-to-peer networks present in the literature, all of which stem from different modeling and application needs. These approaches, depicted in Fig. 3, can be broadly classified as the following: content-based, semantic overlay networks and DHTs. This classification is related to the systems’ retrieval properties, approach and intended use, as conveyed in the respective published material. As discussed in Section 3, a hypothetical complete system may have components from all three. In the following sections we attempt to define these different approaches, provide examples from the literature and identify their target applications.

4.1. Distributed hash tables

One of the most popular ways to organize and retrieve data on a peer-to-peer network is through the use of distributed hash-tables (DHTs). The key idea behind this approach is that each peer is responsible for keeping indexes for a number of terms, describing objects of interest. Network addressing can be based on the hash values that these terms yield. As a consequence, whenever a query is issued, its constituent terms are hashed using a globally known hash-function. The hash values that are produced address the peers responsible for knowing where, in the network, to look for content relevant to the query terms. These term-based indexes may associate terms to (document, peerId) tuples (document-level indexing) or just to peerIds (peer-level indexing) [30]. A query is then routed to them, before reaching the final, relevant information providers. By design, stand-alone DHT-based systems are suitable for data retrieval as opposed to information retrieval. However, before comparing them in a peer-to-peer setting, let us first have a brief look at a few influential systems.
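The term-based indexing idea described above can be illustrated with a toy, single-process sketch. Everything in it is an assumption made for illustration: the class and function names, the use of SHA-1, and the placement rule (hash modulo the number of index-holding peers), which real DHTs replace with the schemes discussed in the remainder of this section.

```python
import hashlib
from collections import defaultdict


def term_hash(term, num_peers):
    """Map a term to the index-holding peer responsible for it (toy rule)."""
    digest = hashlib.sha1(term.lower().encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_peers


class ToyTermIndex:
    """Simulates the per-peer term indexes of a DHT inside one process."""

    def __init__(self, num_peers):
        self.num_peers = num_peers
        # One table per index-holding peer: term -> {(document, peerId), ...}
        self.tables = [defaultdict(set) for _ in range(num_peers)]

    def publish(self, term, document, peer_id):
        """Register that `peer_id` shares `document`, which contains `term`."""
        table = self.tables[term_hash(term, self.num_peers)]
        table[term.lower()].add((document, peer_id))

    def document_level(self, query):
        """Document-level indexing: return (document, peerId) tuples."""
        results = set()
        for term in query.split():
            table = self.tables[term_hash(term, self.num_peers)]
            results |= table.get(term.lower(), set())
        return results

    def peer_level(self, query):
        """Peer-level indexing: return only the peerIds worth contacting."""
        return {peer_id for _, peer_id in self.document_level(query)}


index = ToyTermIndex(num_peers=4)
index.publish("chord", "doc1", "peerA")
index.publish("chord", "doc3", "peerB")
index.publish("pastry", "doc2", "peerB")
providers = index.peer_level("chord pastry")  # peers to route the query to
```

Note that matching here is exact on each hashed term, which is why such stand-alone systems are described as performing data retrieval rather than information retrieval.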

A number of different DHT-based approaches have been reported in the literature, sharing the aforementioned basis of operation. Stoica et al. [31] and Balakrishnan et al. [32] propose Chord, a DHT-based protocol for peer-to-peer data location and retrieval. Chord, as a protocol, is not application-specific and it supports a single operation, map, which is used to map keys onto peers. For the purposes of retrieval, these keys could be single indexing terms or other indexing features of documents and so on. Chord emphasizes fair workload, by allowing for a uniform assignment of routing indexes amongst peers. This is achieved through the use of consistent hashing [33], which guarantees that about the same number of routing elements will be assigned to each node in the network. This process requires each node to maintain information about $O(\log N)$ other nodes, for a topologically steady $N$-node system. Chord can resolve look-ups in $O(\log N)$ messages. The routing information that the nodes share gets updated when nodes join or leave the network. In Chord, nodes and keys are assigned an $m$-bit identifier, through hashing, and are arranged in an identifier circle modulo $2^m$. Any key is assigned to the first node whose identifier is equal to or greater than the identifier of the key, i.e. to the successor node of the key.
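The successor rule on the identifier circle can be sketched in a few lines. This is a centralized illustration that assumes the full list of node identifiers is known in one place, which no real Chord node has; the function names are hypothetical. The `finger_targets` helper follows the finger-table construction of the Chord design (the i-th entry of node n points to the successor of n + 2^i modulo 2^m), which is not spelled out in the text above but is what gives each node its $O(\log N)$ routing state.

```python
import bisect
import hashlib


def chord_id(name, m):
    """Hash a node address or key to an m-bit identifier (SHA-1 assumed)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def successor(node_ids, key_id):
    """First node whose identifier is >= key_id, wrapping around the circle."""
    ring = sorted(node_ids)
    position = bisect.bisect_left(ring, key_id)
    return ring[position % len(ring)]


def finger_targets(node_id, node_ids, m):
    """Nodes that `node_id` would keep routing information about."""
    return [successor(node_ids, (node_id + 2 ** i) % (2 ** m)) for i in range(m)]


# A 6-bit identifier circle (identifiers 0..63) with ten nodes.
ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
owner_of_54 = successor(ring, 54)   # node 56 stores key 54
owner_of_60 = successor(ring, 60)   # wraps around to node 1
fingers_of_8 = finger_targets(8, ring, m=6)
```

Because each finger roughly halves the remaining distance to the key, a look-up that always forwards to the closest preceding finger needs a logarithmic number of messages.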

An alternative suggestion was made by Ratnasamy et al. [34], who developed and experimentally evaluated an Internet-sized distributed hash-table, dubbed CAN (Content Addressable Network). This architecture supports all operations found in traditional hash-tables, like insert, lookup and delete. In contrast to Chord, CAN is conceptually arranged in a $d$-dimensional Cartesian coordinate space on a $d$-torus. This logical space is able to store (key, value) pairs and it is partitioned among the nodes present in the network. So, each peer is responsible for a partition of the torus, and therefore for a partition of the total set of keys handled by the network. In CAN, each node is aware of the boundaries of the region it maintains as well as those of the regions maintained by its immediate neighbors. A query, after being run through a hash-function, can be routed greedily to the neighbor whose coordinates are closest to the target coordinates. For a $d$-dimensional space partitioned into $n$ equal zones, the average routing path length is $(d/4)\,n^{1/d}$ hops, while individual nodes need to maintain information about $2d$ neighbors, irrespective of the size of the network.
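The path-length figure can be checked on a small idealized instance. The sketch below assumes, purely for illustration, that the $n$ equal zones form a regular grid with $k$ zones per dimension (so $n = k^d$), that each zone is owned by one node, and that a node forwards a query one zone at a time along a dimension that still differs from the target. Function names are hypothetical and real CAN zones are generally not this regular.

```python
from itertools import product


def torus_step(current, target, k):
    """Forward to a neighboring zone that is one step closer to the target."""
    nxt = list(current)
    for dim, (c, t) in enumerate(zip(current, target)):
        if c != t:
            forward = (t - c) % k
            # Take the shorter way around the ring in this dimension.
            nxt[dim] = (c + 1) % k if forward <= k - forward else (c - 1) % k
            return tuple(nxt)
    return current


def route_hops(source, target, k):
    """Number of hops taken by greedy routing between two zones."""
    hops, current = 0, source
    while current != target:
        current = torus_step(current, target, k)
        hops += 1
    return hops


def average_hops(d, k):
    """Mean hop count over all ordered (source, target) zone pairs."""
    zones = list(product(range(k), repeat=d))
    total = sum(route_hops(s, t, k) for s in zones for t in zones)
    return total / (len(zones) ** 2)


d, k = 2, 4                          # 16 zones on a 2-torus
n = k ** d
measured = average_hops(d, k)        # mean over all pairs of zones
predicted = (d / 4) * n ** (1 / d)   # (d/4) n^(1/d)
```

In this example each zone has exactly $2d = 4$ neighbors, and the measured average coincides with the predicted value of 2 hops.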

Rowstron and Druschel [35] proposed Pastry, another DHT-based peer-to-peer network which employs a circular addressing space. In Pastry, peers are assigned a random 128-bit identifier upon joining the network. These identifiers are taken to be uniformly distributed within the 128-bit addressing space, which ranges within $[0, 2^{128} - 1]$. For the purposes of routing, identifiers and object keys are taken to be digit sequences with a base of $2^b$, where $b$ is a configuration parameter with a typical value of 4. Pastry routes queries to the node whose identifier is numerically closest to a given key. This is achieved incrementally, with the local node routing a query to another node whose id shares a prefix with the key that is at least one digit (or $b$ bits) longer than the prefix shared by the id of the local node. If such a node cannot be found, the query is forwarded to a node whose identifier shares a prefix of the same length with the key but which is nonetheless numerically closer to the key. For an $N$-node, topologically steady network, Pastry can route a message to the numerically closest node in less than $\log_{2^b} N$ steps.
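The two routing rules just described (prefer a longer shared prefix, otherwise fall back to a numerically closer node with the same prefix length) can be sketched as a next-hop selection function. This is a simplified illustration under stated assumptions: identifiers are shortened to 16 bits for readability, the candidate set is a flat list rather than Pastry's routing table and leaf set, and numerical distance ignores wrap-around of the circular space. All names are hypothetical.

```python
def to_digits(identifier, b, length):
    """Write an identifier as `length` digits in base 2**b, most significant first."""
    base = 2 ** b
    digits = []
    for _ in range(length):
        digits.append(identifier % base)
        identifier //= base
    return tuple(reversed(digits))


def shared_prefix_len(digits_a, digits_b):
    """Number of leading digits two identifiers have in common."""
    count = 0
    for x, y in zip(digits_a, digits_b):
        if x != y:
            break
        count += 1
    return count


def next_hop(local_id, key, known_ids, b=4, length=4):
    """Pick the node to forward to, or return local_id if none is better."""
    key_digits = to_digits(key, b, length)

    def prefix(node_id):
        return shared_prefix_len(to_digits(node_id, b, length), key_digits)

    def distance(node_id):
        return abs(node_id - key)  # simplification: no wrap-around

    local_prefix = prefix(local_id)
    # Rule 1: a node sharing at least one more digit with the key.
    longer = [n for n in known_ids if prefix(n) > local_prefix]
    if longer:
        return min(longer, key=lambda n: (-prefix(n), distance(n)))
    # Rule 2: same prefix length, but numerically closer to the key.
    closer = [n for n in known_ids
              if prefix(n) == local_prefix and distance(n) < distance(local_id)]
    if closer:
        return min(closer, key=distance)
    return local_id


hop = next_hop(0x1234, 0x12AB, [0x12A0, 0x1299, 0x5000])  # 0x12A0 shares "12A"
```

Since every application of the first rule fixes at least one more base-$2^b$ digit of the key, the number of hops grows with the number of digits needed to tell $N$ nodes apart, which is the intuition behind the $\log_{2^b} N$ bound.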

The aforementioned distributed hash-tables have been highly influential for both research and development proposals. Projects built on Pastry include PAST [36,37] and Scribe [38]. Tapestry [39] and Kademlia [40] also use similar algorithms and have been influenced by Pastry. On the other hand, Oceanstore [41] is similar to Chord. CAN influenced the creation of PeerSearch [42], a hybrid approach using a CAN-like underlying DHT for routing and the vector-space model for information retrieval at the end-nodes. Other related material includes the works of Loo et al. [43], Bender et al. [44] and Papapetrou et al. [45], among others.

4.2. Semantic overlay networks

Semantic overlay networks (SONs) follow an alternative approach to organizing peer-nodes in a network in order to support retrieval. Instead of relying on one or more globally known hash-functions, content and peer organization in SONs is usually based on a globally known tree structure of concepts. The central source of inspiration for SONs is the Semantic Web [25] as well as related research in ontology-driven content management. Since the tree structures employed by SONs are used to define information schemata, they are also referred to as schema-based networks.

The main idea behind this approach is to use a globally known structure in order to organize content and peers semantically. Compared to DHTs, this approach is more oriented towards keyword-based information search, since it adds semi-structured semantics to the various entities of the network, be they documents, peers or semantically close peer-groups. Once the document collections of the peers have been categorized, the network is ready to accept queries. Typically, the queries also get categorized using the same global semantic structure, and they get propagated to those network entities that share similar categories. The final information retrieval process that takes place at the leaf nodes is independent of the prior categorization process: the categorization is relevant only to query routing and to network organization.

One of the first uses of the term “Semantic Overlay Network” came from Crespo and Garcia-Molina [46], who proposed a multi-layered architecture aimed at music file-sharing and retrieval. The global meta-data structure used is a song classification schema based on music genre, taken from allmusic.com (http://www.allmusic.com/). In this network, each music genre is covered by its own semantic overlay network. Therefore, depending on its shared music collection, each node may belong to more than one SON and consequently to different peer neighborhoods. Membership of peers in SONs is determined by the genre of their shared content: a peer joins a SON if it has a significant number of songs of the genre conceptualized by the SON. Once the classification of the peers into parallel SONs has taken place, the network is ready to accept queries. In order for queries to become comparable to the concepts expressed by the SONs, they first need to be classified against the same meta-data structure. A query, once classified, would visit the SONs representing the leaf categories and then move upwards in the hierarchy until enough results have been returned. In this work, Boolean matching was used, even though the architecture does not restrict the use of alternative matching models.
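The query-processing order described above (start at the SON of the leaf category, then climb the hierarchy until enough results are found) can be sketched as follows. The genre hierarchy, peer names, song titles and function names are all invented for illustration, and the query is assumed to have been classified already, since the classifier itself is outside the scope of this sketch.

```python
# Hypothetical genre hierarchy: child category -> parent category.
PARENT = {"bebop": "jazz", "swing": "jazz", "jazz": "music",
          "rock": "music", "music": None}


def boolean_match(query_terms, title):
    """Boolean AND matching: every query term must occur in the title."""
    words = set(title.lower().split())
    return all(term in words for term in query_terms)


def search(query, category, sons, parent, wanted):
    """Visit the SON of `category`, then its ancestors, until `wanted` hits.

    sons: category -> {peer_id: [song titles]} (peers that joined that SON).
    Returns the collected (peer_id, title) hits and the SONs visited.
    """
    terms = query.lower().split()
    results, visited = [], []
    while category is not None and len(results) < wanted:
        visited.append(category)
        for peer_id, titles in sons.get(category, {}).items():
            for title in titles:
                hit = (peer_id, title)
                if boolean_match(terms, title) and hit not in results:
                    results.append(hit)
        category = parent.get(category)  # move one level up the hierarchy
    return results, visited


sons = {
    "bebop": {"p1": ["Night in Tunisia"]},
    "jazz": {"p2": ["Tunisia Blues", "Take Five"]},
    "music": {"p3": ["Tunisia Anthem"]},
}
hits, visited = search("tunisia", "bebop", sons, PARENT, wanted=2)
```

In this example the search stops after the "jazz" SON because two results have been collected, so the peers in the broader "music" SON are never contacted.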

Moving away from particular applications and aiming to provide a model for creating and using semantics in peer-to-peer networks, Ehrig et al. [47] proposed SWAP (Semantic Web and Peer-to-Peer). The primary aim of this work was to bring semantic-web technologies, and in particular ontologies, into the peer-to-peer world. Having a semantic model to describe information in peer-to-peer networks would allow heterogeneous sources to be abstracted onto a level where comparison, management and retrieval is possible. Abstractly, the SWAP model allows for the subjective treatment of discoverable metadata items, and therefore makes their combination and comparison possible. Practically, SWAP adopts an architectural point of view, defining various components of a SWAP-enabled node as well as their responsibilities and means of communication. Furthermore, it goes on to propose fully-fledged ontology classes, expressed in RDF, capable of expressing metadata relevant to the nodes. SWAP supports the extraction, further annotation or enrichment as well as the combination or merging of metadata and has been the basis of further work and analysis.

Building on SWAP, Schmitz et al. [48] proposed Swapster, a schema-based knowledge management platform aiming to promote ideas borrowed from social networking into peer-to-peer networks. In particular, they suggest that, due to their ad-hoc nature, peer-to-peer networks are suitable for storing and searching for knowledge created in an ad-hoc manner. Furthermore, following from social paradigms, peer-to-peer networks can be used to aid socialization by forwarding requests to the most appropriate nodes to answer them. This comes as a direct parallel to real life, where we ask people considered to be experts about relevant topics of interest, and has already been applied to other applications, the most notable one being PageRank [49]. In order to fulfill its purpose, Swapster extracts information from various local resources, such as emails and documents, and describes them in an RDF schema. Such RDF structures are the basic content descriptions being communicated in the network. Queries are expressed in a relevant SQL-like language for RDF content and forwarded to the nodes which are judged to be the most suitable to respond. Additionally, the authors propose two complementary network organization methods, so that the clustering of nodes sharing similar resources can be achieved. The first method is an offline topology formation, suitable for bootstrapping the network, while the second is more suitable for ongoing node updates and rewiring during the lifetime of the network.

In another study, Schmitz [50] shows how to organize an ontology-based peer-to-peer system in a small world. The author provides a number of effective rewiring strategies that take into account the semantic signatures of nodes and shows how these lead to the formation of small-world networks (please refer back to Section 2.2 for an introduction to small-world networks). The assumption that peers will only know their own and their immediate neighborhood’s semantic signatures is also made here and it again fuels the rewiring strategies presented in the study. According to these strategies, each peer periodically assesses its similarity to its neighbors and, if it finds it to be below some predetermined threshold, it advertises its expertise with the prospect of getting reassigned to a different neighborhood. This forwarding of expertise can be done either randomly, in the case where the network has not yet been clustered, or based on expertise similarity, in the case of a clustered network. The rewiring strategy itself is essentially defined by the way expertise forwarding is carried out in the network. This implicit and continuous clustering of resources is shown to be beneficial in terms of the recall achieved by the network up to a threshold beyond which over-clustering starts to hinder retrieval effectiveness.
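A rough sketch of the periodic check and of the two forwarding modes is given below. It makes several assumptions that the text above does not fix: semantic signatures are modeled as plain sets of concept labels, similarity is measured with the Jaccard coefficient, and a peer compares the threshold against its mean similarity to its neighbors. The function names are hypothetical and the actual measures used in the study may differ.

```python
import random


def jaccard(a, b):
    """Set overlap used here as a stand-in for semantic similarity."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def should_advertise(own_signature, neighbor_signatures, threshold):
    """Periodic check: advertise expertise if the neighborhood is a poor fit."""
    if not neighbor_signatures:
        return True
    mean = sum(jaccard(own_signature, s) for s in neighbor_signatures)
    mean /= len(neighbor_signatures)
    return mean < threshold


def forward_target(advert, neighbors, clustered, rng=random):
    """Choose the neighbor that receives an expertise advertisement.

    neighbors: peer_id -> semantic signature.
    Unclustered network: forward at random.
    Clustered network: forward to the most similar neighbor.
    """
    if not clustered:
        return rng.choice(sorted(neighbors))
    return max(sorted(neighbors), key=lambda p: jaccard(advert, neighbors[p]))


own = {"jazz", "blues"}
neighbors = {"a": {"rock"}, "b": {"jazz", "rock"}}
if should_advertise(own, list(neighbors.values()), threshold=0.3):
    target = forward_target(own, neighbors, clustered=True)
```

The threshold plays the role described in the text: set too low, peers never move; set too high, the network keeps rewiring and may over-cluster.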

Furthering the work on rewiring by Schmitz et al. [48] and building on the SWAP platform, Löser and Tempich [51] proposed alternative strategies for incremental link creation and peer ranking. Based on social patterns, the authors distinguish three kinds of overlays from the viewpoint of each peer in the network: the content-provider overlay, the recommender overlay and the bootstrapping overlay. Each of these overlays will be providing responses to queries depending on the query, the locality of the peer and the previous responses the peer has received from its neighbors. For instance, when a peer first joins the network, it will send a query to a number of peers chosen from the non-semantic bootstrapping layer. Subsequent queries, also depending on the results of previous searches, might be sent to peers found in more than one layer, etc. The authors provide different ways to rank peers found in the different overlays. In a separate publication, Löser et al. [52] present more refined versions of these concepts and algorithms along with an experimental evaluation showing, among other things, that they outperform other approaches in terms of routing efficiency and network adaptation to interest shifts.

Specifically targeting the problem of information retrieval, Lv and Cheng [53] proposed to derive a concept tree through clustering the terms of the shared document corpus. The backbone of this solution is a semantic tree whose leaves are all the possible terms in the global dictionary, while its non-leaf nodes represent previously extracted concepts. Given that a peer knows which parts of the tree are covered by its local content, it can determine the extent of its similarity to other peers and therefore choose its neighbors and guide its query routing strategy accordingly. The concept tree for this particular study was extracted from the Reuters corpus through hierarchical, divisive k-means clustering. The items clustered were the dictionary terms, represented by binary vectors whose elements denoted the presence or absence of a term in a document in the corpus. By calculating the overlap and the separation in concepts between itself and others, a peer can choose its near and distant links respectively. This process takes place periodically and it is expected to eventually form a small-world network of semantically similar clusters. As opposed to the other studies described above, this natively supports full-text search. On the other hand, it requires that a global concept tree has been built before the network can be effectively created.

The work on SONs outlined above, even though not complete, provides a solid basis for discussion and analysis of the trends which have emerged for searching in P2P networks. Additional studies we would like to draw the interested reader’s attention to include the following: Löser et al. [54] present the construction of hybrid SONs based on clustering elaborate and heterogeneous DB schemas. Also relevant to P2P databases and distributed schema organization and data management is the PIAZZA peer data management system [55]. Penzo et al. [56] investigate range-based and nearest-neighbor selection for associating related peers during the initialization and lifetime of the overlay. Doulkeridis et al. [57], building on their previous work on a distributed algorithmic approach to network cluster building [58], describe a self-organizing hybrid SON for efficient searching.

4.2.1. Network-data independence and SONs

The use of hierarchical semantic structures for peer organization is not limited to the mapping of content to concepts and the subsequent retrieval based on these concepts. An alternative, engineering-oriented, use of semantic trees is targeted to system interoperability and data independence. In his seminal work, Codd [59] argued for the need of independence of applications from their underlying data and related structures. The tool he proposed in order to achieve data independence was the now well-known and established relational model. Following on from this work, and having in mind the vast and increasingly volatile computer networks of today, Hellerstein [60] addressed the need for network-data independence. The rationale behind this need is that the size and rate of change in today’s networks is much greater than the rate of change of applications.

Towards tackling network data independence, Parkhomenko et al. [61] proposed the use of ontology-driven peer profiles in order to allow seamless service interoperability and transparency, even across existing peer-to-peer networks. Focusing on the discovery and integration of web services, the authors claim that such peer profiles would be beneficial for many open challenges of peer-to-peer networking, such as security, resource aggregation, peer-group management etc. Working towards the same goal, Aberer et al. [62] propose Gridvine, a logical layer to be used on top of a DHT. The authors address the issue of network data independence directly by separating a physical data layer (a DHT) from a logical, semantic layer (Gridvine). At the logical level Gridvine supports a number of services, including attribute-based search of data, schema management, mapping and inheritance, which promote data independence in the network, as well as interoperability between potentially separate networks.

4.3. Content-based peer-to-peer networking

Content-based systems represent the third distinct approach addressing retrieval over peer-to-peer networks. The algorithms and structures employed by content-based systems are based on traditional information retrieval models and are usually probabilistic in nature. As opposed to both the DHT approach and the SON approach, the content-based paradigm relies directly on the shared content of the nodes in order to perform network addressing, indexing and query routing. As in the distributed information retrieval paradigm, introduced in Section 2.4, content-based approaches usually implement a bootstrapping, resource description phase before retrieval can take place. Content descriptions, used for index building and query routing purposes, are usually expressed as vectors. A common approach is for peers to disseminate average term-frequency vectors. Document clustering [22,63,64] is another commonly employed approach. Various document clustering algorithms have been used either as modules of suggested architectures (e.g. by Ng and Sia [65], Klampanos and Jose [66], etc.) or as a means for experimental evaluation (e.g. by Lu and Callan [67]). In this section we will introduce and discuss studies that represent the content-based paradigm for peer-to-peer information retrieval.

PlanetP [68–71] is a fully defined peer-to-peer network for information retrieval. In this system, each peer has complete knowledge of the information shared by all the participating peers in the network. This wide dissemination of content descriptions is achieved by means of a gossiping, or epidemic, algorithm [72], which allows the peers to share summaries of their content with the rest of the peer community. More specifically, PlanetP content advertisements are term vectors encoded in Bloom filters [73] and act as compact and efficient summaries of the local corpora. Each peer is therefore expected to store locally all content advertisements sent out by other peers. A PlanetP peer has to inform the network of its content when it joins the network and when its content changes. The dissemination of content advertisements is achieved through the combination of rumormongering and anti-entropy, as originally proposed by Demers et al. [74], together with a partial entropy measure found to be useful in the context of peer-to-peer networks. PlanetP’s gossiping algorithm works as follows: periodically, each peer randomly chooses a target peer, believed to be currently on-line, and attempts to inform it of its content changes. If the target peer had not previously heard of these changes, it updates its records and forwards the rumor in the same way. The algorithm terminates when one of the gossiping peers contacts n peers in a row that already know about this change. Such anti-entropy algorithms help to avoid the situation where a peer or a group of peers never get to hear about a content change.
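The rumor-spreading step described above can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not PlanetP's implementation: the function name `rumor_monger`, the round-based scheduling and the per-peer stopping rule (each peer stops once it has contacted `stop_after` already-informed peers in a row) are hypothetical choices made for illustration, and the anti-entropy component is left out.

```python
import random


def rumor_monger(num_peers, origin, stop_after, rng):
    """Spread one rumor among `num_peers` peers (num_peers >= 2).

    Returns the set of peers that have heard the rumor once every
    gossiping peer has given up. A peer gives up after `stop_after`
    consecutive contacts that already knew the rumor (assumed rule).
    """
    informed = {origin}
    # Peers still gossiping, mapped to their current streak of
    # consecutive "already knew it" contacts.
    active = {origin: 0}
    while active:
        for peer in list(active):
            # Pick a random target other than the peer itself.
            target = rng.choice([p for p in range(num_peers) if p != peer])
            if target in informed:
                active[peer] += 1
                if active[peer] >= stop_after:
                    del active[peer]      # this peer stops gossiping
            else:
                informed.add(target)      # target updates its records
                active[target] = 0        # and starts forwarding the rumor
                active[peer] = 0          # the sender's streak is broken
    return informed


if __name__ == "__main__":
    reached = rumor_monger(num_peers=100, origin=0, stop_after=3,
                           rng=random.Random(42))
    print(f"{len(reached)} of 100 peers heard the rumor")
```

Because rumor spreading alone may leave some peers uninformed, a run of this sketch does not necessarily reach every peer, which is the gap that the anti-entropy exchanges mentioned in the text are meant to close.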

When a query is issued by a peer, the system first searches its local set of content advertisements and ranks the peers according to their closeness to the query. The peers are ranked according to a metric called inverse peer frequency (IPF), inspired by the inverse document frequency (introduced in Section 2.3). The rationale behind using IPF is that if a term is present in the collections of many peers then it is not useful in discriminating between them. The IPF for a term t is evaluated as follows: $\mathrm{IPF}_t = \log(1 + N/N_t)$, where N is the number of peers in the network and $N_t$ is the number of peers that appear to contain the term t. Given this definition, the relevance measure used to rank peers is the following:

$$R_i(Q) = \sum_{t \in Q \,\wedge\, t \in B_i} \mathrm{IPF}_t \tag{1}$$

where i denotes a given peer and $B_i$ is the ith peer’s Bloom filter. Once this ranking has been done, the top-ranked peers receive the query and the results are accumulated at the query-initiating node. The number of peers to be selected depends on the number of results the user requires as well as on a heuristic that depends on the total number of peers in the network.
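The IPF-based peer ranking can be sketched as follows. This is an illustrative sketch only: the `BloomFilter` class (its size, number of hash functions and use of SHA-256) and the helper names `ipf` and `rank_peers` are assumptions made here, not details taken from PlanetP. The natural logarithm is also an assumption, since the text does not state the base.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter over strings (sizes are illustrative assumptions)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one Python integer

    def _positions(self, term):
        # Derive `num_hashes` bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def ipf(term, filters):
    """IPF_t = log(1 + N / N_t); 0 if no peer appears to hold the term."""
    n = len(filters)
    n_t = sum(1 for bf in filters.values() if term in bf)
    return math.log(1 + n / n_t) if n_t else 0.0


def rank_peers(query_terms, filters):
    """Rank peers by R_i(Q): the sum of IPF_t over query terms found in B_i."""
    scores = {
        peer: sum(ipf(t, filters) for t in set(query_terms) if t in bf)
        for peer, bf in filters.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    corpora = {"A": ["p2p", "search"], "B": ["p2p", "image"], "C": ["music"]}
    filters = {}
    for peer, terms in corpora.items():
        filters[peer] = BloomFilter()
        for term in terms:
            filters[peer].add(term)
    for peer, score in rank_peers(["p2p", "search"], filters):
        print(peer, round(score, 3))
```

In this toy network the term "search" is held by a single peer and therefore contributes more to the score than "p2p", which two of the three peers share.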

Even though PlanetP forms a complete proposal for peer-to-peer information retrieval, it has a notable limitation, which makes it more appropriate for limited environments than for Internet-wide deployment: each node requires global knowledge of the shared content in the network. This is not a realistic assumption for any large network, since there are no guarantees that every peer will be able to directly discover every other, or that it will be able to manage millions of changing content descriptions or handle the overwhelming network traffic created through gossiping. This is the reason PlanetP can only scale up to a few thousand nodes, according to its creators.

Ng and Sia [65,75] proposed DISCOVIR, which stands for distributed content-based visual information retrieval. DISCOVIR addresses the issues of organizing and retrieving information in a large-scale peer-to-peer network. Contrary to PlanetP, DISCOVIR organizes the network by creating clusters of peers sharing similar content, without each peer having to store content descriptions of every other peer. These explicitly defined peer-groups help query routing by making it more targeted towards clusters of relevant content. The query routing algorithm adopted by the authors is dubbed the Firework Query Model and it is designed to avoid query flooding.

DISCOVIR, like many other architectures in the broader field of peer-to-peer networking, has been influenced by social networks and by their inherent small-world properties. Hence, each DISCOVIR peer maintains two sets of links: attractive links and random links. As the name suggests, the attractive links connect peers sharing similar content (short-range links), while the random ones maintain the connectivity of the whole network graph (long-range links). In DISCOVIR, attractive links are established between peers sharing content of the highest similarity within a predetermined network radius. As a consequence, clusters of peers sharing similar content are formed, while the connectivity of the network is preserved through the long-range, random links which are always present. DISCOVIR’s routing strategy exploits the two sets of links by forwarding queries to neighbors with relevant content through the attractive links or, if no such peers exist for the given query, by forwarding the query using the random links.
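The forwarding rule built on these two link sets can be sketched in a few lines. The sketch below is a hedged illustration rather than the Firework Query Model itself: the `Peer` class, the `next_hops` function and the fixed distance `threshold` used to decide whether a neighbor is "relevant" are all hypothetical, and the Euclidean distance is used only as an example similarity function.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Peer:
    def __init__(self, name, feature):
        self.name = name
        self.feature = feature   # content descriptor of this peer
        self.attractive = []     # short-range links to similar peers
        self.random = []         # long-range links keeping the graph connected


def next_hops(peer, query, threshold):
    """Choose the neighbors that `peer` forwards `query` to.

    Neighbors on attractive links whose descriptor lies within
    `threshold` of the query are preferred; if there are none,
    the query leaves the cluster through the random links.
    """
    relevant = [n for n in peer.attractive
                if euclidean(n.feature, query) <= threshold]
    return relevant if relevant else list(peer.random)


if __name__ == "__main__":
    a, b, c = Peer("a", [0.0, 0.0]), Peer("b", [0.1, 0.0]), Peer("c", [5.0, 5.0])
    a.attractive, a.random = [b], [c]
    print([n.name for n in next_hops(a, [0.0, 0.2], threshold=0.5)])
    print([n.name for n in next_hops(a, [5.0, 5.0], threshold=0.5)])
```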

DISCOVIR is primarily targeted at image and multimedia sharing and retrieval. However, it could also be useful in other domains, provided there exist proper document representations as well as a suitable similarity function.

Bawa et al. [76] proposed the SETS architecture, which stands for Search Enhanced by Topic Segmentation. SETS focuses on efficient and effective query routing by grouping the peers according to the content they share. Instead of using an agglomerative clustering approach (for example the approach followed by DISCOVIR and other architectures) this architecture employs a divisive strategy. According to this strategy, each peer’s collection is partitioned into segments, each of which has a descriptive component, its topic centroid. These segments drive the grouping of peers using short- and long-range links. In SETS, short-range links are formed between peers within the same topic segment, whereas long-range links are formed between peers which belong to different segments. When a query is issued in SETS it gets routed through the network’s topic-aware routing algorithm. First, a small number of topic segments are selected by evaluating the similarity of the query to the corresponding topic centroids. Subsequently the query gets routed sequentially to the selected segments by first following the long-range links and then following the short-range links. In SETS, individual peers do not need to know all available topic segments on a per-peer basis; however, they do need to know all topic-segment centroids in the network in order to be able to route queries effectively. SETS is focused on full-text document retrieval and, as such, it features term-frequency vectors as content descriptions and uses the cosine coefficient as its similarity measure.
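The first routing step, selecting a few topic segments by comparing the query with the topic centroids, can be sketched as below. The function names `cosine` and `select_segments`, the sparse dictionary representation of term-frequency vectors and the parameter `k` are assumptions introduced for illustration; they are not taken from the SETS paper.

```python
import math


def cosine(u, v):
    """Cosine coefficient between two sparse term-frequency vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_segments(query, centroids, k):
    """Return the names of the `k` topic segments most similar to the query."""
    ranked = sorted(centroids,
                    key=lambda name: cosine(query, centroids[name]),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    centroids = {
        "sports": {"football": 3.0, "goal": 2.0},
        "music": {"guitar": 4.0, "song": 1.0},
        "networks": {"peer": 3.0, "network": 2.0},
    }
    print(select_segments({"peer": 1.0, "network": 1.0}, centroids, k=1))
```

Once the segments are chosen, the query would travel along long-range links to reach each segment and along short-range links inside it, as described in the text.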

Lu and Callan [67] proposed a hybrid peer-to-peer architecture tuned for information retrieval in highly distributed digital libraries. Their proposed architecture is described as a hybrid one since it allows for two kinds of nodes: leaf nodes, which are digital libraries, and directory nodes, which manage the searching process. In this architecture, leaf nodes of similar characteristics are connected via directory nodes. Even though these characteristics are not explicitly defined, the experimental evaluation environment used was derived by clustering the entire test document collection and assigning the clusters to individual leaf nodes. For this purpose, a soft-clustering algorithm was used, as proposed by Lin and Kondadadi [77], which leads to non-mutually-exclusive clusters with regard to the documents they contain.

This architecture is based on language modeling techniques and selects appropriate resources for query routing according to a metric based on K–L divergence. This is used as a means to quantify the likelihood that a leaf node will satisfy a user’s information need, and is given by

$$S(Q, C) = \sum_{q \in Q} \log\{\lambda P(q \mid C) + (1 - \lambda) P(q \mid G)\} \tag{2}$$

where $P(q \mid C)$ is the collection’s language model, i.e. the probability that the query term q comes from the collection C, $P(q \mid G)$ is the global language model and $\lambda$ is a smoothing parameter. In this architecture, resource selection happens at both the leaf and the directory nodes. Another important aspect of this work is that it employs query-based sampling in order to reduce the need for directory nodes to pass full descriptions of themselves to neighboring directory nodes.
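Eq. (2) translates almost directly into code. The following is a minimal sketch: the function name `collection_score`, the dictionary representation of the language models and the default value `lam=0.5` are assumptions made for illustration, and the sketch assumes that the global model assigns a non-zero probability to every query term so that the logarithm is defined.

```python
import math


def collection_score(query_terms, collection_lm, global_lm, lam=0.5):
    """S(Q, C) = sum over q in Q of log(lam * P(q|C) + (1 - lam) * P(q|G)).

    `collection_lm` and `global_lm` map terms to probabilities. Terms
    missing from the collection model get probability 0 there and are
    scored through the global (smoothing) model alone.
    """
    return sum(
        math.log(lam * collection_lm.get(q, 0.0) + (1 - lam) * global_lm[q])
        for q in query_terms
    )


if __name__ == "__main__":
    global_lm = {"peer": 0.1, "network": 0.1, "music": 0.1}
    library_a = {"peer": 0.4, "network": 0.3}
    library_b = {"music": 0.5}
    query = ["peer", "network"]
    for name, lm in (("library_a", library_a), ("library_b", library_b)):
        print(name, round(collection_score(query, lm, global_lm), 3))
```

Scores are negative log-likelihoods in disguise, so a score closer to zero indicates a collection that is more likely to have generated the query.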

Linari and Weikum [78] proposed another architecture, which organizes its nodes according to content in order to aid query routing and retrieval. Similar to the system introduced above, it is also based on language models. However, the authors use the Jensen–Shannon measure, which is a metricized version of the Kullback–Leibler divergence. The authors claim that by using a metric they can reduce the number of irrelevant peers when querying, and therefore reduce the number of hops, by triangulation. An additional difference is the use of Bloom filters in order to reduce the network overload due to the necessary description of resources of the participating nodes.
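To make the triangulation argument concrete, the sketch below uses one common formulation: the square root of the Jensen–Shannon divergence (base-2 logarithm), which is known to satisfy the triangle inequality. Whether this matches the authors' exact definition is an assumption, and the names `kl`, `js_distance` and `lower_bound` are hypothetical.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    `q` must give non-zero probability to every term that `p` does.
    """
    return sum(pi * math.log2(pi / q[t]) for t, pi in p.items() if pi > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two term
    distributions given as dicts; symmetric and bounded by 1."""
    terms = set(p) | set(q)
    mix = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}
    divergence = 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negatives


def lower_bound(d_query_a, d_a_b):
    """Triangle-inequality bound: d(query, b) >= |d(query, a) - d(a, b)|."""
    return abs(d_query_a - d_a_b)


if __name__ == "__main__":
    query = {"peer": 0.5, "network": 0.5}
    peer_a = {"peer": 0.6, "network": 0.4}
    peer_b = {"music": 0.9, "peer": 0.1}
    bound = lower_bound(js_distance(query, peer_a), js_distance(peer_a, peer_b))
    # If `bound` already exceeds the acceptable distance, peer_b can be
    # skipped without ever being contacted.
    print(round(bound, 3), round(js_distance(query, peer_b), 3))
```

The pruning works because the bound is computed only from distances that a node already knows, so a distant peer can be ruled out before any message is sent to it.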

Adopting a more general view of the problem of information retrieval over peer-to-peer networks, Klampanos and Jose [66] proposed a hybrid network based on document clustering. As opposed to using clustering in order to satisfy a prior assumption on the content shared, this work integrates clustering within the proposed architecture. In essence, this work represents an attempt to transfer cluster-based retrieval into the peer-to-peer realm. According to this architecture, the participating peers can take up various roles in the network. From an information retrieval perspective, the most important roles are the information provider, taken by leaf nodes sharing documents, and the hub, taken by nodes responsible for network organization, clustering and query routing.

This architecture operates over two stages of information clustering in order to aid network organization and support retrieval. Before information providers can join the network they must cluster their content and pass their content descriptions to any hub node they have previously connected to. On receiving these descriptions, the hub searches for related peer groups in the network. If such a peer group is found, the hub recommends the new peer to any of the organizing hubs of the peer group. If no suitable group is found, the hub creates a new one and starts managing it. In its first instance this architecture assumes that each hub knows of all peer groups in the network, even though this could be relaxed by applying some content-based organization to the hub sub-overlay. Because information providers get organized in peer groups based on their local cluster descriptors, each information provider is allowed to belong to multiple peer groups. Also, each hub may manage one or more peer groups. When a query is issued by a requesting peer, it first visits the peer’s nearest hub. The hub then forwards it to the nearest peer groups it is aware of. Once the query has reached a peer group, it gets forwarded further to the most relevant information providers to answer it. Result lists get routed back, getting merged along the way, before reaching the query initiator. For in-peer document clustering as well as for peer clustering this study used a form of unbounded single-pass clustering [22]. The fusion of results was done according to a re-ranking algorithm based on the Dempster–Shafer theory of evidence combination, adapted from a version introduced by Jose [79].
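The join procedure above (attach to a sufficiently similar group, otherwise start a new one) follows the general pattern of single-pass clustering, which can be sketched as follows. This is a generic textbook-style sketch and not the authors' implementation: the `threshold` parameter, the running-mean centroid update and the names `cosine` and `single_pass` are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass(vectors, threshold):
    """Assign each vector, in arrival order, to the most similar existing
    cluster if the similarity reaches `threshold`; otherwise open a new
    cluster. The number of clusters is not bounded in advance."""
    clusters = []  # each cluster: {"centroid": dict, "members": [indices]}
    for index, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": dict(vec), "members": [index]})
            continue
        # Update the centroid as the running mean of its members.
        size = len(best["members"])
        terms = set(best["centroid"]) | set(vec)
        best["centroid"] = {
            t: (best["centroid"].get(t, 0.0) * size + vec.get(t, 0.0)) / (size + 1)
            for t in terms
        }
        best["members"].append(index)
    return clusters


if __name__ == "__main__":
    descriptors = [{"peer": 1.0}, {"peer": 2.0, "overlay": 0.1}, {"jazz": 1.0}]
    for cluster in single_pass(descriptors, threshold=0.5):
        print(cluster["members"])
```

Single-pass clustering needs only one look at each incoming descriptor, which suits a network where providers join one at a time, although the resulting groups depend on the order of arrival.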

Other related studies that take the content-based approach to network organization and searching include works by Balke et al. [80], Seshadri and Cooper [81] and Skobeltsyn et al. [82,83], among others.

5. Frequently occurring components of peer-to-peer retrieval networks

Having introduced various research proposals covering all three major retrieval approaches in peer-to-peer networks, in this section we will address a number of individual functional and non-functional characteristics. Such separation should allow for a better understanding of the current research trends, further support the separation we observe in peer-to-peer approaches for retrieval and highlight needs for future work. The main focus here will be on content-based systems and SONs because of their suggested similarities in the literature. Another reason we will not be focusing on DHTs is that they have already been covered extensively in a number of articles and books the avid reader can refer to instead [4,84]. In this analysis we will take into account a few basic categories of non-functional as well as of functional characteristics of various research systems and discuss them separately.

Table 2 – The systems included in this section, their reference names and their classes.

| Name | Bibliographic reference | Approach |
|---|---|---|
| CMU | Lu and Callan [67] | Content-based |
| Glasgow | Klampanos and Jose [66] | Content-based |
| Bologna | Linari and Weikum [78] | Content-based |
| iCluster | Raftopoulou and Petrakis [10] | Content-based |
| DISCOVIR | Ng and Sia [65,75] | Content-based |
| CTO | Lv and Cheng [53] | Semantic overlay |
| Kassel | Schmitz [50] | Semantic overlay |
| INGA | Löser and Tempich [51]; Löser et al. [52] | Semantic overlay |
| Bibster | Haase et al. [85] | Semantic overlay |
| Pennsylvania | Li et al. [11] | Semantic overlay |
| SWAPSTER | Schmitz et al. [48] | Semantic overlay |

In Table 2 we outline the bibliographic references and the corresponding names that we use for this analysis. For the names we adopted the following convention: if a system is given a name by its authors then this name is used; otherwise we use the city name of the first author’s institution as given in the publication, or the name of the institution itself. It is important to note that, even though this is not meant to be an exhaustive list of systems, it should still be adequate for representing the major research efforts taking place in the broader field of peer-to-peer information retrieval.

5.1. Non-functional characteristics

DHTs for peer-to-peer networking have been widely represented in the literature as well as on the market. DHTs have received considerable attention from the database research community due to their capability to address the problem of data location in an elegant and predictable way. At the same time DHTs found immediate application in popular file-sharing networks such as eMule and BitTorrent. Their widespread application made it possible for researchers, such as Androutsellis-Theotokis and Spinellis [4], to discuss their non-functional characteristics with the ultimate purpose of categorizing them. A similar approach cannot be taken for the case of content-based systems as they have not been implemented for general use. However, we believe that basic non-functional characteristics should be considered even before a research case can be made, as they are useful in narrowing down the scope of research and they promote communication within the research community. In this section we will attempt to discuss such basic non-functional characteristics as a means for further analysis of the proposals outlined in Table 2.

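Table 3 – Primary target application scenarios. Please note the empty column under File-Sharing (F-S). Most DHTs could be primarily appropriate for this type of application.

| System | Open | DLs | F-S | MM | KM |
|---|---|---|---|---|---|
| CMU | | ✓ | | | |
| Glasgow | ✓ | ✓ | | | |
| Bologna | ✓ | ✓ | | | |
| iCluster | ✓ | ✓ | | | |
| DISCOVIR | | | | ✓ | |
| CTO | | ✓ | | | |
| Kassel | | | | | ✓ |
| INGA | | ✓ | | | ✓ |
| Bibster | | | | | ✓ |
| Pennsylvania | | | | | ✓ |
| SWAPSTER | | | | | ✓ |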

5.1.1. Primary target application scenarios

An important non-functional characteristic (or requirement) that affects the research direction of a proposal is its intended application scenario. In the fairly complex and modular, content- or concept-based peer-to-peer networks, changes in the target application requirement are likely to cause significant changes to the research and development effort, even though this may not be obvious from the outset.

Table 3 depicts the various proposals along with the application scenarios they appear to be addressing, according to their reported functionality. The scenarios we have included are shown as the columns of the table. Of these scenarios, Open refers to open sharing and retrieval of text-based documents, where the content in each peer is accumulated either through direct creation at the node or by downloading content from elsewhere. In this scenario, the document collections would be expected to exhibit largely skewed, power-law distributions in terms of their sizes. At the same time, it would be fair to assume that the topics expressed by each of the collections would be relatively limited in number. Another application scenario considered is the digital libraries scenario (titled DLs in the table). Assuming that there is a limited number of organizations able to provide such services, in this application scenario we can assume that the skewness in the numbers of documents will not be as extreme as in the Open case, at least not for the smaller end of the distribution. Another characteristic that differentiates digital libraries from open and personal information sharing is that in the former case there may be very large libraries that cover a large number of topics. F-S refers to the file-sharing application scenario and, in general, to any application scenario involving the retrieval of objects described by a limited number of keywords. MM refers to multimedia and in particular image and video retrieval scenarios. This application scenario differs from Open and DLs mainly in that the item and resource descriptions are densely populated vectors, affecting any clustering and retrieval algorithms employed for the task. Another point of difference is found in the number of features used to describe such resources, which is inherently smaller than the overall size of the vocabulary extracted from a text-based corpus, for instance. The final application scenario we identify is knowledge management and retrieval (KM in the table). By this, in this article, we mean the location and retrieval of meta-information instead of the direct retrieval of information. As discussed in Section 2.5, knowledge is most commonly described in the form of tree-structured ontologies. Systems that address retrieval in such environments are therefore regarded as relevant to this application scenario regardless of whether they also provide means for the final retrieval of the required pieces of information.

The differences between content-based and semantic-overlay approaches become evident by looking at Table 3. There is a clear separation between SONs and content-based systems with respect to the target application scenarios they address. The assignment of the various systems to the application scenarios was done based on their qualitative characteristics as well as on the problems the authors of the respective systems claim to address.

CMU mainly addresses information retrieval in digital libraries as it relies on previously categorized (or clustered) content in order to obtain resource descriptions. Also, the use of language modeling implies that the global language model is known or that it can be estimated. The estimation of this statistic through, for instance, query sampling, can become very challenging in networks with high rates of peers joining and leaving the network. Similarly, Bologna is also primarily targeted at digital libraries, since it also uses the Kullback–Leibler (KL) divergence through its Jensen–Shannon (JS) implementation; therefore the problem of using or estimating global statistics remains. However, this approach could also be applied to Open environments since it does not rely on pre-clustered information but rather adopts an incremental approach to network organization. The Glasgow system is mainly targeted at the Open application scenario, from which it adopts its main assumptions, discussed above. Glasgow incorporates clustering and it does not rely on any global or nearly global statistics. Additionally, this system could also be used for retrieval in digital libraries, since it could be altered to use existing content structure with minimum effort. iCluster is another system that appears to be primarily able to address the Open scenario, since it supports real-time network organization based on content clustering. DISCOVIR is the only system of the above that addresses the problem of image retrieval in peer-to-peer networks. Its operation is based on clustering along with a content-sensitive query forwarding algorithm that attempts to locate the maximum number of relevant resources with a small number of hops. Architecturally, DISCOVIR is similar to the CMU and Glasgow systems and with a few algorithmic changes it could be made suitable for text-based retrieval as well. However, since its authors explicitly state that its purpose is image and multimedia retrieval we will classify it under the MM category.

Even though CTO’s base of operation is a concept tree, we loosely classify it as a content-based system. This is due to the fact that the tree used contains the whole vocabulary at its leaf nodes and is therefore directly related to the shared content itself, as opposed to meta-information about the content. This is also the reason why it is classified as suitable to be used in a digital library setting. At the same time, the rapidly changing global content of an Open scenario would probably be prohibitive for its smooth operation. The Pennsylvania, Kassel and SWAPSTER systems operate around meta-information and so they would be directly applicable only to such application scenarios. Bibster is also classified as such, even though its operation is narrower than that of the aforementioned systems, in that it addresses solely bibliographic information sharing and retrieval. Such information is described by meta-keys, such as author names, year of publication etc.; however, Bibster goes one step beyond mere keyword matching since it derives and uses semantic relationships between authors, publications etc. INGA is built around the extraction of semantics stored in an underlying tree and is therefore mostly applicable to knowledge-management tasks. However, due to its multi-layered architecture it could also cope with information retrieval from digital libraries, if these libraries provided the topics they cover in a suitable format.

5.1.2. Network organization and topology

In Section 2.1 we discussed the issue of terminology for peer-to-peer networking, especially the definition of equality in such networks. Even though this might be primarily seen as a philosophical discussion it does have real implications in the design of a system. Whether a system can be classified as peer-to-peer or not affects the organization of the overlay network and therefore its applicability. For the purposes of this survey we claim that any networked system may be classified as peer-to-peer if it does not employ fixed and overpowered nodes. In that way, any system that operates in a non-centralized and flexible manner can be seen as a peer-to-peer one. This definition excludes pseudo-P2P networks, such as the original instantiation of Napster.

The main approaches for network organization, which are widely used in the literature, are unstructured and hybrid. Unstructured systems are the ones exhibiting a flat structure within which every peer has exactly the same responsibilities as every other. In practical terms, this means that every participant can relay messages, connect arbitrarily to other nodes, issue queries and share content at all times. On the other hand, hybrid systems divide the peers into mainly-client nodes and mainly-server nodes of various roles. The usual breakdown of roles one may come across in the literature divides peers into hub-nodes, responsible for message routing, and information-provider nodes, for storing and sharing content. Additional roles may be those of the client-node, which issues queries, and of the merger-node, which is responsible for merging results lists from previously undertaken retrieval sessions. In most cases, the client-role is thought to be of no consequence to the retrieval session, since a query will have to be routed to relevant information providers regardless of its original point of issue. Similarly, the role of the merger-node is usually either included in the functionality of hub-nodes or is neglected altogether. The differentiating factors between client–server systems and hybrid peer-to-peer networks are that in the latter case there is a large number of super-nodes instead of a single server and also that super-nodes may arbitrarily come into or go out of service, or interchange roles with the client nodes.

Table 4 – Common network topologies for peer-to-peer networking. $V_A$ and $V_B$ are two arbitrary nodes in the network, while d is the number of dimensions for the case of the hypercube topology.

| Topology | $V_A$ to $V_B$ | Longest path | Connections overhead |
|---|---|---|---|
| Arbitrary | Multiple routes | Varies but can be small-world | Arbitrarily many |
| Tree | Single route | Varies unless balanced | Arbitrarily many |
| Hypercube | Single route | d | d |

Network topology is independent of the network organization and, for the purposes of this survey, only includes peers capable of routing messages. The most common topology one can find in peer-to-peer IR proposals, and in fact in all the proposals presented in this study, is the arbitrary one. By this we mean that, while connections between peers may or may not get created based on content similarity or other semantic features, from a graph perspective they appear to be random. We consider the small-world topology to be a special case of arbitrary topologies and in fact most of the presented systems claim to be generating small-world topologies.

For the sake of greater coverage we will mention two more topologies that could be used for retrieval over peer-to-peer networks. The tree topology is where the message-routing peers are arranged in a hierarchical way, such that there are no communication cycles. Such an organization enforces a single route between any two peers and its performance depends largely on how well balanced the structure is. Another characteristic of this topology is that each peer will have to maintain arbitrarily many neighboring nodes. A well-known distributed system which has a peer-to-peer nature and a tree topology is the DNS [1]. Finally, another noteworthy topology is the hypercube (as used, for instance, by Nottelmann and Fuhr [86]). In a d-cube, a hypercube of dimensionality d, each node is connected to exactly d other nodes, one per dimension. Once a message has arrived through a dimension k it can only be forwarded through dimensions strictly greater than k. This has some desirable implications. First, there is a unique route between any two nodes, with the worst case being a message requiring d hops in order to reach its destination. Additionally, such a hypercube will be able to support up to 2^d peers, making it a very scalable topology. These three topologies are summarized in Table 4.
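The forwarding rule for the hypercube can be checked with a short simulation. The sketch below is illustrative only: it assumes that nodes are numbered 0 to 2^d - 1 and that two nodes are neighbors along dimension k when their binary identifiers differ in bit k; the function name `broadcast` is hypothetical.

```python
def broadcast(source, dims):
    """Broadcast from `source` in a hypercube with `dims` dimensions.

    A node that received the message through dimension k forwards it
    only through dimensions strictly greater than k (the source forwards
    through all of them). Returns a pair (hops, messages), where `hops`
    maps every reached node to its distance from the source and
    `messages` counts the transmissions made.
    """
    hops = {source: 0}
    messages = 0
    frontier = [(source, -1)]  # (node, dimension the message arrived through)
    while frontier:
        node, arrived = frontier.pop()
        for dim in range(arrived + 1, dims):
            neighbour = node ^ (1 << dim)  # flip bit `dim`
            messages += 1
            hops[neighbour] = hops[node] + 1
            frontier.append((neighbour, dim))
    return hops, messages


if __name__ == "__main__":
    hops, messages = broadcast(source=5, dims=4)
    print(len(hops), "nodes reached with", messages, "messages;",
          "longest path:", max(hops.values()))
```

With d = 4 the sketch reaches all 2^4 = 16 nodes using 15 messages, so no node receives the message twice, and the farthest node is d = 4 hops away.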

In Table 5 we list the topologies of the aforementioned content-based and SON systems. It is notable that all of these systems adopt an arbitrary topology for their retrieval purposes. The problem in adopting the other two topologies has to do with the difficulty of modeling content in such a strict way, given that the structure of these networks is directly or indirectly based on content. However, such topologies would be more useful on the system level, as they are guaranteed to be efficient and predictable. Please note that there are additional topologies in the literature, for instance circular ones, etc. Such structures seem more appealing for application in DHTs, hence they are not discussed in this survey.

Table 5 – Network organization and topology.

| System name | Organization | Topology |
|---|---|---|
| CMU | Hybrid | Arbitrary |
| Glasgow | Hybrid | Arbitrary |
| Bologna | Unstructured | Arbitrary |
| iCluster | Unstructured | Arbitrary |
| DISCOVIR | Unstructured | Arbitrary |
| CTO | Unstructured | Arbitrary |
| Kassel | Unstructured | Arbitrary |
| INGA | Unstructured | Arbitrary |
| Bibster | Unstructured | Arbitrary |
| Pennsylvania | Unstructured | Arbitrary |
| SWAPSTER | Unstructured | Arbitrary |

5.2. IR-related functional components

Having touched upon basic non-functional characteristics of content-based and semantic networks we will now look into their functional characteristics with respect to information retrieval. In the following sections we will discuss the differences the aforementioned systems have in terms of the IR models they employ, their evaluation assumptions and setups as well as their content distribution policies. Additional issues we will discuss include personalization and resource description adjustments, collaboration and information filtering in peer-to-peer systems.

5.2.1. Retrieval mechanisms and models

As discussed previously, distributed information retrieval is a predominant influence for peer-to-peer information retrieval systems. As a consequence, content-based systems typically employ information retrieval models that have also been used previously in a distributed IR context. Also, wherever clustering is used, content-based systems naturally follow well-documented IR approaches. SONs, due to their differences regarding retrievable content and target application scenarios (Section 2.5), approach the problem of retrieval and clustering from an architectural viewpoint. In this section we will discuss the methods and models used by a number of systems, wherever these apply.

CMU adopts a language modeling approach for resource selection and retrieval. According to this architecture, the similarity of a query to a digital library is expressed by a slightly altered version of the Kullback–Leibler divergence expressed as:

$$S(Q, C) = \sum_{q \in Q} \log\{\lambda P(q \mid C) + (1 - \lambda) P(q \mid G)\} \tag{3}$$

where Q is the set of query terms, C is the language model of a digital library, G is the global language model and $\lambda$ is the mixture smoothing factor.

This approach assumes that term-frequency statistics are made available by the digital libraries and that some estimate of the global language model can be obtained. Since the explicit advertisement of such information would be prohibitive in terms of network overhead, Lu and Callan [67] propose an incremental approach according to which a flooding approach gradually gives way to a selective one. During the query flooding phase directory (or hub) nodes get to learn the contents of their neighboring directory and leaf nodes by studying the replies they send to issued queries. Once sufficiently many queries have been issued, content-based routing based on the K–L divergence can be adopted. In order for this gradual acquisition of statistics to be efficient it is important that the digital libraries share brief and comprehensive content. This requirement is addressed by the authors by assuming that each digital library shares content from a single, even though broad, topic. Experimentally, this assumption is realized through the use of clustering, which we discuss further in Section 5.2.2.

The Glasgow architecture relies on clustering in order to organize shared content in a way which allows for more efficient query routing and retrieval. Contrary to CMU, it expects information providers to first internally cluster their content before advertising it to the network. Subsequently, it applies a distributed form of single-pass clustering in order to create peer groups of providers sharing similar content. This architecture does not enforce the use of any specific clustering algorithm for document clustering at the information providers; however, for its evaluation it uses Ward’s algorithm [87] on term-frequency vectors. During the retrieval phase, a query is first matched against peer groups and then against individual peers before any retrieval can take place. For query routing, the Glasgow system employs the cosine similarity function over term-frequency vectors. Once the query has reached an information provider, this architecture does not make any assumptions regarding which retrieval model the provider may use. The only assumption made on that end is that the results list is ranked according to relevance.
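The two-stage matching described above (peer groups first, then individual peers) can be sketched with cosine similarity over sparse term-frequency vectors. The data layout and the function names below are hypothetical and serve only to illustrate the idea.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def route(query, groups):
    """Pick the best-matching peer group, then rank its peers.

    groups maps group id -> {"centroid": vector, "peers": {peer id: vector}}
    (an assumed layout for this sketch).
    """
    best_group = max(groups, key=lambda g: cosine(query, groups[g]["centroid"]))
    peers = groups[best_group]["peers"]
    ranked = sorted(peers, key=lambda p: cosine(query, peers[p]), reverse=True)
    return best_group, ranked
```

What each selected provider does with the query afterwards is left open, in line with the architecture making no assumption about the provider's retrieval model.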

In a similar manner, DISCOVIR expects peers to be able to describe their content as a set of vectors. Then, after applying a neighbor-flooding ad hoc approach, the peers maintain short-range connections to the most similar peers they have previously discovered. DISCOVIR also foresees the use of long-range links, which are randomly created (please refer back to Section 2.2 for more information on small-world networks and linkage). This system measures the Euclidean distance in order to decide whether a short-range link should be established between two peers as well as for query forwarding. Long-range links are created randomly when a peer joins the network or according to user preference.
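A minimal sketch of this link-selection idea follows, assuming each peer is summarized by a single feature vector and keeps a fixed number k of short-range links. Both simplifications, as well as the function names, are assumptions of this example rather than details of DISCOVIR.

```python
import math
import random


def short_range_links(own_vector, discovered, k=2):
    """Ids of the k discovered peers closest in Euclidean distance.

    discovered maps peer id -> feature vector (a tuple of floats).
    """
    ranked = sorted(discovered, key=lambda pid: math.dist(own_vector, discovered[pid]))
    return ranked[:k]


def long_range_links(candidates, n, rng):
    """Pick up to n long-range links uniformly at random."""
    pool = sorted(candidates)
    return rng.sample(pool, min(n, len(pool)))
```

Passing an explicit `random.Random` instance keeps the random long-range choice reproducible in simulations.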

For the Bologna system, Linari and Weikum [78] adopt the use of language models for their topology creation as well as for query routing. However, contrary to CMU, this system employs the Jensen–Shannon distance, which is a symmetrical version of the Kullback–Leibler divergence and is defined as follows:

$$D_{JS}(A, B) = \frac{1}{2}\left[D_{KL}(A, C) + D_{KL}(B, C)\right] \qquad (4)$$

where $C = \frac{1}{2}(A + B)$ and $D_{KL}$ denotes the Kullback–Leibler divergence between two given points. This distance is measured on term-frequency vectors. The rationale behind the preference of this distance measure over the K–L divergence lies in its symmetry, which allows for the early elimination of uninteresting peers through the triangle inequality.
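Eq. (4) can be computed directly on normalized term-frequency vectors. The sketch below assumes that the vectors are first turned into probability distributions and uses the natural logarithm; both choices, and the function names, are assumptions of this example.

```python
import math


def normalize(tf):
    """Turn raw term frequencies into a probability distribution."""
    total = sum(tf.values())
    return {term: n / total for term, n in tf.items()}


def kl_divergence(p, q):
    """D_KL(p || q); q must be non-zero wherever p is non-zero."""
    return sum(pv * math.log(pv / q[term]) for term, pv in p.items() if pv > 0)


def js_divergence(a, b):
    """D_JS(A, B) of Eq. (4), with C the midpoint distribution of A and B."""
    terms = set(a) | set(b)
    c = {t: 0.5 * (a.get(t, 0.0) + b.get(t, 0.0)) for t in terms}
    return 0.5 * (kl_divergence(a, c) + kl_divergence(b, c))
```

Because the midpoint distribution is non-zero wherever either input is, the measure stays finite even for vectors with disjoint vocabularies, where it reaches its maximum of log 2. It is generally the square root of this quantity that is known to behave as a metric, which may be relevant when the triangle inequality is used for pruning.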

CTO, like most systems that rely solely on concept trees, does not employ an IR model in the traditional sense. In this case, the similarity between any two nodes or between a node and a query is calculated based on the overlap of their respective concepts. Additionally, for the sake of effective creation of a small-world topology, CTO also calculates how well a peer covers the concepts another peer does not. In this system, the weighted coverage between two concept collections is defined as follows:

$$Cov(A, B) = \sum_{t_k \in A \,\wedge\, t_k \in B} P(t_k) \qquad (5)$$

where $P(t_k) = d^{L-k}$ indicates the priority of a concept of level $k$, since concepts of higher levels should receive higher priority over concepts of lower levels. $L$ denotes the height of the concept tree and $d$ is the average number of sub-nodes in the concept tree.
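The weighted coverage of Eq. (5) reduces to a sum of level-dependent priorities over the shared concepts. In the sketch below each concept collection is represented as a mapping from concept to its level k in the tree; this representation and the names used are assumptions made for illustration.

```python
def priority(level, height, branching):
    """P(t_k) = d^(L - k): concepts nearer the root get higher priority."""
    return branching ** (height - level)


def coverage(a, b, height, branching):
    """Cov(A, B) of Eq. (5): summed priority of the concepts in both A and B.

    a and b map concept -> level k in the concept tree (assumed layout).
    """
    return sum(priority(level, height, branching)
               for concept, level in a.items() if concept in b)
```

With a tree of height 3 and an average of 2 sub-nodes, a shared level-1 concept contributes 4 while a shared level-3 concept contributes only 1.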

Another semantic overlay, Kassel, focuses on network organization rather than retrieval itself. The goal is to organize the peers in a small-world topology based on the qualitative characteristics of the shared items. This organization is achieved through an adaptive and incremental network clustering algorithm, which takes into account the clustering coefficients of the formed clusters. With respect to item similarity, this system assumes that a similarity function $sim$ exists, such that $1 - sim$ is a metric. Such a similarity function is employed by the clustering coefficient calculations. In particular, the clustering coefficient of a node $v$ is weighted by its similarity to the nodes it knows about and is defined by:

$$\gamma^{w}_{v} = \frac{1}{k_v (k_v - 1)} \sum_{w \in \Gamma(v)} sim(v, w)\, \left|\{u \in \Gamma(v) : (w, u) \in E\}\right| \qquad (6)$$

where $\Gamma(v)$ are the nodes pointed to by, but not including, $v$ and $k_v = |\Gamma(v)|$ denotes the size of the neighborhood. The introduction of $sim$ in the calculation of the clustering coefficient for a node $v$ has the effect of each link from a neighboring node $w$ to $v$ counting as much as the similarity of the two nodes. From the above definition it follows that $0 \le \gamma^{w}_{v} \le 1$, with higher values of $\gamma^{w}_{v}$ indicating a denser neighborhood.
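The weighted clustering coefficient of Eq. (6) can be sketched as follows. The graph is assumed to be given as a mapping from each node to the set of nodes it points to, without self-loops, and the similarity function is passed in by the caller; these are conventions of this example only.

```python
def weighted_clustering(v, out_links, sim):
    """Weighted clustering coefficient of node v, following Eq. (6).

    out_links maps node -> set of nodes it points to (assumed layout);
    sim(v, w) is a similarity function with values in [0, 1].
    """
    neighbours = out_links[v] - {v}
    k = len(neighbours)
    if k < 2:
        return 0.0  # undefined for fewer than two neighbours; 0 by convention here
    total = 0.0
    for w in neighbours:
        # Number of v's neighbours that w itself links to.
        linked = sum(1 for u in neighbours
                     if u != w and u in out_links.get(w, set()))
        total += sim(v, w) * linked
    return total / (k * (k - 1))
```

Each neighbor can link to at most k − 1 other neighbors and the similarity is at most 1, so the result stays within the interval from 0 to 1.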

Table 6 – Retrieval approaches followed for peer-to-peer retrieval. IR model refers to the underlying mathematical model used for retrieval, when this applies. Measure refers to the similarity measure used by the system for matching queries to content, such as peer-collections, documents, etc. Clustering method refers to the information clustering approach adopted by the systems. Assumed clustered refers to whether the system assumes that the shared content is already clustered before it can become searchable.

| System | IR model | Measure | Clustering method | Assumed clustered |
|---|---|---|---|---|
| CMU | Language | K–L | Soft-clustering | Yes |
| Glasgow | Vector-space | Cosine | Ward’s, Single pass | No |
| Bologna | Language | J–S | Incremental | No |
| iCluster | Generic | N/A | Incremental | No |
| DISCOVIR | Vector-space | Euclidean | Incremental | No |
| CTO | N/A | Coverage | Data-centric | No |
| Kassel | N/A | Generic | Data-centric | No |
| INGA | N/A | Proximity-based | Data-centric | No |
| Bibster | N/A | Proximity-based | Data-centric | No |
| Pennsylvania | N/A | Boolean | Data-centric | No |
| SWAPSTER | N/A | Generic | Data-centric | No |

In INGA [52], the similarity between a query and a peer is calculated differently depending on the topics covered by the query. For the case of single-predicate queries, i.e. when a query represents a single concept in the ontology tree, it is calculated as follows:

$$sim_{Topic}(q, p) = \begin{cases} e^{-\alpha l} \cdot \dfrac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}} & \text{if } q \neq p \\ 1 & \text{otherwise} \end{cases} \qquad (7)$$

where $q$ is the query concept, $p$ is the peer concept, $l$ is the shortest-path length between $q$ and $p$ in the concept tree and $h$ is the minimum level in the tree of either $q$ or $p$.
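The single-concept similarity of Eq. (7) can be sketched as below, reading the equation so that identical concepts score 1 and all other pairs are scored by the exponential expression. The ratio of exponentials is the hyperbolic tangent of βh. The text does not give values for α and β, so the defaults used here are purely illustrative assumptions.

```python
import math


def sim_topic(q, p, path_length, level, alpha=0.2, beta=0.6):
    """Similarity between query concept q and peer concept p, after Eq. (7).

    path_length is l, the shortest-path length between q and p in the tree;
    level is h, the minimum tree level of q and p. alpha and beta are
    illustrative values, not taken from the source.
    """
    if q == p:
        return 1.0
    # (e^{bh} - e^{-bh}) / (e^{bh} + e^{-bh}) is tanh(bh).
    return math.exp(-alpha * path_length) * math.tanh(beta * level)
```

The score decays as the two concepts move further apart in the tree and grows as the concepts sit deeper, that is, as they become more specific.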

In the case of conjunctive queries, INGA follows an alternative strategy, which builds on prior querying of neighboring peers. The rationale behind this approach with respect to conjunctive queries is the fact that queries described by multiple concepts are likely to result in few or no results. In fact, this directly corresponds to the well-known and documented issue of the Boolean model for information retrieval [15], which appears to also be present in the field of concept or knowledge retrieval. For the case of conjunctive queries, INGA proposes the following similarity measure:

$$R_p(q) = \sum_{i=1}^{t} q_{pi} \qquad (8)$$

where $t$ is the number of topics expressed in the query and $q_{pi}$ is the number of query hits per topic $i$ of each peer able to match at least one of the query’s topics. In terms of network organization and clustering as well as in terms of query matching, Bibster [85] is identical to INGA.
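The measure of Eq. (8) amounts to summing, per peer, the hits previously observed for each query topic and ranking the peers that matched at least one topic. The hit-count bookkeeping and the names in this sketch are assumptions.

```python
def rank_peers(query_topics, hits_per_peer):
    """Rank peers by R_p(q) of Eq. (8).

    hits_per_peer maps peer id -> {topic: number of earlier query hits}
    (an assumed record of prior querying). Peers matching none of the
    query topics are left out.
    """
    scores = {}
    for peer, hits in hits_per_peer.items():
        matched = [hits.get(topic, 0) for topic in query_topics]
        if any(matched):
            scores[peer] = sum(matched)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because a peer only needs to match one topic to be scored, a conjunctive query can still be routed somewhere useful even when no single peer covers all of its topics.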

Pennsylvania [11] enforces a small-world topology amongst the participating peers on top of an underlying distributed hash table. Before joining the network, each peer must have clustered its shared data, therefore exposing summarizing descriptions of its content. For this purpose, the data clustering approach of Zhang et al. [88] is adopted. The centroid of the largest cluster of each peer has a prominent role since it determines the position of the peer inside the network. This approach clearly makes the assumption that most peers will be mostly about a single or very few topics. Peer clusters are formed based on the peers’ descriptions after the maximum cluster size has been predetermined. When a peer joins the network, it gradually moves towards the space expressed by its pre-calculated description, getting linked to neighboring peers of semantically similar content. Additionally, if the joining of a peer causes a cluster to exceed this maximum size, then the cluster gets divided into two new clusters. This strategy helps to maintain load balancing in the network. Since this system adopts data clustering, the matching between peers is absolute. This fact also holds for the clusters, which are labeled in such a way that they match the underlying semantic space they represent. Search queries are forwarded inside the network in the same manner, after they have been transformed into appropriate feature vectors. Upon receiving a query, a peer may either decide to forward it inside its current cluster, if the query matches its description, or push it to an external cluster if the query is not covered by the local description. The cross-cluster forwarding is carried out based on the distance between the signature vector of the query and that of the distant clusters the local peer knows about, much in the same way as in a DHT.

SWAPSTER [48] is intended to provide a generic infrastructure for knowledge management and, as such, it does not provide details regarding its retrieval functionality. However, as it is a SON, we can expect various strategies to be applicable, such as the ones employed by Kassel, INGA, etc. Emphasis is given on network organization and maintenance or re-wiring. A similar emphasis is also given in iCluster [10], with re-wiring strategies taking a prominent role in the evolution and maintenance of the network. However, iCluster takes an approach which is more directed at information retrieval, as it does not explicitly assume the existence of a concept tree and it does not rule out the usage of any particular information clustering algorithm.

Table 6 summarizes the set of systems under study, leading us to a number of conclusions. First, once more, the differences between semantic overlay networks and content-based systems are highlighted. Traditionally, information retrieval is achieved through the statistical modeling of documents and collections. Similarly, transformations of well-established information retrieval models, such as the vector-space model or of various instantiations of language models, are typically employed by content-based systems. On the other hand, concept-based systems primarily deal with intermediate metadata structures. Regardless of whether these systems also address the issue of mapping the real content onto these metadata structures or not, their focus remains on the organization of the network and on subsequent retrieval mainly based on exact match, or coverage measures between subsets of metadata.

An additional observation to be made is that, apart from CMU, no system assumes that the content is already clustered, or otherwise organized. This results in most systems having to generate the necessary information structures in the network themselves. Glasgow, adopting a hybrid network organization, implements single-pass clustering on the network level, with cluster centers maintained by hub peers. The other systems, content-based as well as semantic overlays, adopt less strict rules on peer-cluster creation, usually by explicitly maintaining two types of network links. In all cases, network clustering occurs incrementally, which is the only viable method in highly dynamic environments.
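Single-pass clustering, as used at the network level, processes each content description once and either attaches it to the most similar existing cluster or opens a new one. The sketch below assumes cosine similarity, a fixed similarity threshold of 0.5 and mean-vector centroids; none of these specifics is taken from the systems surveyed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass(descriptions, threshold=0.5):
    """Cluster (peer id, vector) pairs in one pass over the input."""
    clusters = []  # each cluster: {"centroid": vector, "members": [peer ids]}
    for peer_id, vec in descriptions:
        best, best_sim = None, threshold
        for cluster in clusters:
            s = cosine(vec, cluster["centroid"])
            if s >= best_sim:
                best, best_sim = cluster, s
        if best is None:
            clusters.append({"centroid": dict(vec), "members": [peer_id]})
            continue
        # Update the centroid as the running mean of its members.
        n = len(best["members"])
        terms = set(best["centroid"]) | set(vec)
        best["centroid"] = {t: (best["centroid"].get(t, 0.0) * n + vec.get(t, 0.0)) / (n + 1)
                            for t in terms}
        best["members"].append(peer_id)
    return clusters
```

The outcome depends on the order in which descriptions arrive, which is the price paid for being able to cluster incrementally as peers join.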

5.2.2. Evaluation

The evaluation of research technology through experimentation is particularly important for a proposal’s publication, adoption and, in some cases, commercialization. In order for an experimental evaluation to be beneficial it primarily has to adhere to the assumptions made by the target application scenario. Performance figures obtained through an evaluation procedure not addressing the target application bear little or no significance. Apart from providing quantitative evidence for the performance of a system, experimental evaluation also provides the opportunity for interested parties to contribute and share methodologies and ideas. When particular methodologies have been widely accepted we can compare systems effectively under similar settings. An important and highly influential example of such practice for information retrieval is TREC [89].

As in other fields of computer science research, the field of retrieval over peer-to-peer networks relies heavily on experimental evaluation. Due to the nature of peer-to-peer retrieval systems, a thorough evaluation requires that both retrieval effectiveness as well as network efficiency are evaluated. An additional factor for the successful evaluation of such a system is the size of the experiment. Since typical peer-to-peer networks are very large, evaluating such systems requires as large an experimental testbed as possible. The methodology followed is also very important since the researcher must be careful as to not evaluate aspects not covered by the solution and also not to misinterpret related factors. In this section we will discuss the evaluation methodology followed by each of the systems as well as the tools used for that purpose. Due to the differences we observe between content-based systems and semantic overlays, their evaluation methodologies also differ. Typically, content-based systems are evaluated primarily for retrieval performance while semantic overlays are evaluated primarily for network performance. Table 7 outlines examples of systems which have been evaluated for retrieval effectiveness by means of experimental document collections. In this section we will see examples of both approaches.

CMU was evaluated for both retrieval effectiveness and network efficiency [67]. For the evaluation of the retrieval aspect of the system the authors based their experimental document collection on TREC’s WT10g collection. Out of the domains contained in WT10g, 2500 were randomly selected, comprising 1,421,088 documents. Each of these 2500 collections was assigned to a leaf node in the simulated network. The leaf nodes were clustered into 25 peer clusters, which represented the network’s directory nodes. These clusters were not mutually exclusive, so that if a leaf node covered more than one topic it could be attached to multiple directory nodes. TREC’s WT10g comes with 100 topics accompanied by their relevance assessments. The authors, claiming that these were not enough for the evaluation of such a system, devised 15,000 queries by combining the interpolated unigram document language model with the bigram document language model and some heuristics. As a baseline for assessing relevance, the authors used the results obtained from a centralized retrieval system. Both the testbed and the query set were made available for other researchers to use.3 The sizes of the testbed and of the query set can be seen as adequate for the intended evaluation. However, the use of a web-based collection for retrieval in digital libraries is not ideal due to the properties of typical web pages (being usually small, containing typographical errors, etc.). Because of the lack, at the time, of an adequately large and more suitable collection, this as well as other studies were forced to use WT10g in order to generate a more realistic evaluation environment. The network performance was measured in terms of the message complexity of queries, i.e. in terms of the number of messages needed for a query to be routed. The cost of efficiency given adverse network conditions, like peers leaving unexpectedly, was not explored. CMU was shown to outperform a name-based, Gnutella-like system both in terms of retrieval and network performance.

Glasgow was initially evaluated using smaller collections [66] but it was subsequently also evaluated using WT10g [90], which is the evaluation we will focus on. For this evaluation, the authors had previously created a number of different testbeds reflecting three different application scenarios [91]: open information sharing, digital libraries as well as a uniform distribution of documents to peers. In addition to their testbeds, the authors complemented their experiments by also using the one devised at CMU. These testbeds were also made available online for other researchers to download and use.4 Due to the fact that the clustering of resources was part of the Glasgow system, individual peer collections were first clustered and their cluster centroids were then used as content descriptions. Following this initial phase, the content descriptions were incrementally clustered, using a single-pass clustering algorithm, in order for peer groups exposing similar content to be formed. This second clustering procedure was undertaken as part of the simulation of the system. For this evaluation, the authors only used the queries and the relevance assessments provided by TREC. The results yielded by Glasgow in terms of precision were not encouraging. This led the authors to pursue the causes using a smaller testbed and they looked into a number of individual factors that could have contributed to this, ranging from the properties of the collections in terms of clustering to the usefulness of replication and relevance feedback. Similar to the CMU approach, network efficiency was also evaluated in terms of message complexity and not in terms of other dynamic factors that would be expected to take place during the lifetime of such a system.

3 http://boston.lti.cs.cmu.edu/callan/Data/.
4 http://www.dcs.gla.ac.uk/~iraklis/resources.html.

DISCOVIR [75], targeting image and multimedia retrieval, used data and categories extracted from CorelDraw’s image collection. It was then evaluated for routing efficiency, once the network had reached a stable state, for different numbers of peers. Among the findings of this evaluation is that the network clustering is more effective as the size of the network grows. Another finding is that the percentage of peers visited becomes smaller as the network grows.

Table 7 – Basic characteristics of peer-to-peer networks, which have been evaluated for retrieval performance. We only include systems employing document collections for their evaluation.

| System | Collection(s) | Number of peers | #Avg. doc’s |
|---|---|---|---|
| CMU | TREC WT10g | 2500 | >568 |
| Glasgow | TREC WT10g | 11,680/1500 | 145/568/other |
| iCluster | OHSUMED, TREC-6 | 30,000/566,078 | 15/283 |
| DISCOVIR | CorelDraw Image Collection | Varies | N/A |
| CTO | Reuters corpus vol.1 | 109,500 | 46 |

iCluster was evaluated for the effectiveness of clustering, its retrieval effectiveness in terms of recall as well as for its network efficiency. The document collections used were a subset of OHSUMED TREC having 30,000 medical articles and TREC-6 having 566,078 news articles. Both collections were previously clustered into a number of mutually exclusive document sets. A network of 2000 peers was set up, each of the peers sharing documents from individual document clusters so as to enforce topic coherence within each peer. The incremental clustering algorithm was then simulated taking into account the communication costs involved. The authors showed that the network is effectively organized without overly excessive communication costs. Retrieval performance was also measured in terms of recall and at different stages of network evolution, appearing to be increasing as the network reached a stable point. In contrast to the systems CMU and Glasgow, the authors of iCluster seemed to prefer better quality documents over a larger document base for experiments.

CTO was evaluated against the Reuters corpus (Vol.1). This collection includes an author field, which was used to group the documents. Each peer was assigned the documents written by an individual author. This document collection comprises 109,500 documents, shared by 2368 peers. 100 queries were generated by randomly choosing three terms from 100 random documents. Relevant documents were considered to be those that contained all query terms. Furthermore, the authors group the queries into four levels of “popularity”, in terms of the number of documents in the corpus containing them. This was done in order to evaluate the behavior of the system when it needs to locate scarce information. The concept tree required by this solution was built using the same document collection and through term clustering on the documents. CTO was shown to increase its recall in all four groups of queries as the number of peers reached increased. Its evaluation also shows that the recall rate is higher for the case of concept-guided routing than in the case of flooding.
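The query-generation and relevance procedure described for this evaluation can be sketched as follows. The reading that each query takes its terms from one sampled document, the data layout and the function names are assumptions of this example, not details confirmed by the CTO study.

```python
import random


def make_queries(docs, n_queries, terms_per_query, rng):
    """Draw one query per sampled document by picking distinct terms from it.

    docs maps document id -> list of terms (assumed layout).
    """
    sampled = rng.sample(sorted(docs), n_queries)
    return [rng.sample(sorted(set(docs[d])), terms_per_query) for d in sampled]


def relevant_docs(docs, query):
    """Documents containing all query terms count as relevant."""
    return {d for d, terms in docs.items() if set(query) <= set(terms)}


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 0.0
```

Under this reading every query has at least one relevant document, namely the one it was drawn from, so recall is always well defined.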

Kassel was evaluated in terms of its clustering properties. Since this is a semantic overlay relying on an ontology for information description and query routing, the SWRC5 ontology was used for its evaluation. Each peer was assigned a random item from this ontology, which was used to describe both the shared content and the peer’s expertise. The properties evaluated in this study were how the network clustering coefficient and the characteristic path length changed as the network evolved after the application of various on-line clustering (or re-wiring) strategies. A measure of recall was also included in the evaluation, in most cases shown to stabilize after a point in time.

5 http://ontobroker.semanticweb.org/ontos/swrc.html.

Using a similar approach to Kassel, INGA was evaluated using a data set based on the open directory project for the Web.6 INGA also used bibliographic data acquired from Bibster. A number of artificial queries were also generated and employed. Similar to other semantic overlay evaluations, INGA was evaluated for recall as well as for its networking properties, message complexity, etc. The open directory was also chosen as an evaluation medium by the creators of SWAPSTER, which was evaluated for cluster coefficient and characteristic path length over time as well as for recall. A similar preference towards the evaluation of network and clustering properties is also shown by the Pennsylvania system.

6. Other issues and pointers for future work

The set of systems described and analyzed in the previous sections, although not exhaustive, showcases both the usefulness of decentralized approaches to information seeking and the progress being made in this field. However, there are a number of issues which have not been looked at yet, at least not alongside other information-related services. In this section we will briefly discuss these candidates for future research.

Information filtering is the process during which a user is notified of new relevant content about a matter of their interest without having to issue a query explicitly. Information filtering is often seen as a reverse process to that of information retrieval, and so the two are highly related [92]. Especially in dynamically changing environments managing large volumes of information, such as peer-to-peer networks, information filtering should have had a prominent position. Even though a few studies have appeared targeting information filtering in peer-to-peer networks, for instance by Ouksel [93] or by Wang et al. [94], they are not targeted at existing proposed information seeking networks. Instead they appear rather isolated from such trends and do not directly contribute to a holistic approach regarding information seeking services in networked environments. At the same time, the fact that most studies attempt to cluster the participating peers according to their content, instantly makes filtering beneficial to apply. If we accept the argument according to which individual peers (at least user-operated ones) share content about a small number of topics, then most of the changes occurring in any one network cluster could potentially interest every peer also present in that cluster. Additionally, if a peer declared its interest in a new topic, this declaration could also help the network reassess its position in the network. The integration of information filtering to current approaches, be it content-based or concept-based, would improve the search process itself as well as the service provided to the users of these systems.

6 http://www.dmoz.org.

Another research field which could improve the usefulness of these systems is the study of content replication and of relevance feedback. It has already been shown that both techniques could improve retrieval effectiveness in content-based peer-to-peer networks [90]. Furthermore, it appears that while replication could improve the recall achieved by a system, relevance feedback could improve its precision. This indicates that these two techniques are complementary in a cluster-based peer-to-peer setting and, therefore, that they could be applied dynamically depending on the application at hand. If an application is recall-critical, for instance patent retrieval, content replication would be more beneficial. On the other hand, if an application is precision-critical, such as web retrieval, relevance feedback should carry more weight, and so on. In addition, to the best of our knowledge, the effect of replication in pure information retrieval environments has not been thoroughly studied. Strategies for the effective replication of content without the user needing to explicitly download it have not been considered for the information retrieval instantiation of peer-to-peer networks. It is clear that such replication strategies could be complemented or, in some cases, even be made obsolete by the concurrent existence of information filtering; however, no such possibilities seem to have been studied. Furthermore, relevance feedback has not been thoroughly researched. Indeed, relevance feedback could be aided by the existence of peer clusters in these networks. For instance, a piece of feedback given by a user could end up affecting other users sharing similar interests.

Re-ranking and results’ fusion7 is another area which needs to be explored further. As discussed above, in Section 2.4, fusion of results is central to DIR and so it is to retrieval in P2P networks. The intelligent score normalization and re-ranking of results coming from different sources can potentially improve the effectiveness of retrieval. Very few of the proposals mentioned in this survey explicitly propose models and algorithms to aid merging. Fusion in P2P retrieval is an interesting problem for a number of reasons. First, while in DIR the broker is responsible for fusion, there is no commonly accepted policy as to which should be the responsible party for fusion in P2P. Indeed, in P2P, fusion could take place at the nodes forwarding back the results incrementally, at the query initiating node or at other arbitrary specialized nodes. For instance, in the first case, intermediate nodes would be burdened further for merging results on behalf of another node but, depending on the system, they could be in a better position to merge if they know more of the expertise of their neighbors. In the third instance the system would require more bandwidth per request, while it would move towards a less decentralized path. Second, the issue of score normalization is also very important. If a P2P system does not make assumptions regarding the IR models employed in the various nodes then a re-ranking of results will need to take into account cross-model statistical properties, such as the distribution of scores [95,96], etc. or some account of the expertise of the nodes providing results [66]. However, significant research still needs to be done in re-ranking and fusion for P2P networks before such algorithms are on a par with proposed algorithms on resource selection, query routing and retrieval.

7 For the purposes of this section the terms “re-ranking”, “merging” and “fusion” are used interchangeably.
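To make the normalization issue concrete, the sketch below merges ranked lists whose scores come from different retrieval models and are therefore on different scales. It uses simple min-max normalization and keeps the best normalized score per document; this is a generic baseline chosen for illustration and is not one of the methods proposed in the works cited above.

```python
def min_max(results):
    """Rescale one source's (doc id, score) list to the range [0, 1]."""
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (score - lo) / (hi - lo)) for doc, score in results]


def fuse(result_lists):
    """Merge several sources' result lists into one ranked list."""
    merged = {}
    for results in result_lists:
        if not results:
            continue
        for doc, score in min_max(results):
            merged[doc] = max(merged.get(doc, 0.0), score)
    # Sort by descending score, breaking ties by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))
```

Such a function could in principle run at the query-initiating node, at intermediate nodes or at specialized nodes. Which placement is appropriate is precisely the open policy question raised in the text.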

Finally, the influence and significance of the user have never been taken into account in peer-to-peer information retrieval research. The inherent characteristics of peer-to-peer networks and their applications have not yet been considered in user studies. Future work pointers on this front include collaborative searches from remote locations, collaborative filtering, visualization of the information space in aid of the user, virtual library organization of remote resources, etc.

7. Conclusions

In this article we surveyed a number of approaches and techniques proposed for various retrieval tasks over peer-to-peer networks. After giving sufficient background information on various related aspects, we presented and discussed a number of published proposals addressing different instantiations of the issue. In this survey we stated and demonstrated differences between various systems as well as between the three major retrieval approaches: distributed hash-tables, content-based approaches and semantic overlays. With respect to these differences we provided a discussion about their target retrieval units (data, information and knowledge in Section 2.5), before revisiting them when presenting the various architectures. Focusing on content-based approaches and semantic overlays, as they are often thought to be interrelated, we gave a categorized treatment on their respective characteristics in Section 5. In the same section we distinguished between the core functional and non-functional characteristics of these proposals. It is clearly important that the functional components employed by the various proposed systems are suited to the non-functional requirements set by the systems’ respective authors. Finally, in Section 6 we presented directions for future work, which could prove important for both semantic overlays and content-based systems, namely filtering, relevance feedback and replication, merging and re-ranking of results and user-interfaces and collaboration.

Overall, we believe that peer-to-peer networking still has great potential which could lead to significant technological breakthroughs. We also believe that it is still, commercially, rather underexploited, with the majority of the applications dealing with file-sharing. Since this research field has been active for a number of years we could expect to start seeing mature and useful applications able to exploit the inherent characteristics of these networks. Research on peer-to-peer information retrieval will still be important and it could be made more influential and persuasive if proposals and their experimental evaluations are closer to their initial application assumptions. For this purpose, the sharing and re-use of resources, such as evaluation testbeds, queries and the like, is of great importance. As more and more information becomes available online, high-speed Internet connections reach the majority of people worldwide and mobile devices become ever more powerful, the decentralization of services will become commonplace and peer-to-peer or similar technologies could provide a viable, global platform for their deployment.

References

[1] A. Oram (Ed.), Peer-to-Peer: Harnessing the Power of Disruptive Technologies, O’Reilly & Associates Inc., CA 95472, USA, 2001.

[2] M. Buchanan, Small-World: Uncovering Nature’s Hidden Networks, Orion, 2002.

[3] G. Coulouris, J. Dollimore, T. Kindberg, Distributed Systems: Concepts and Design, Pearson Education Limited, 2001.

[4] S. Androutsellis-Theotokis, D. Spinellis, A survey of peer-to-peer content distribution technologies, ACM Computing Surveys 36 (4) (2004) 335–371. URL: http://www.spinellis.gr/pubs/jrnl/2004-ACMCS-p2p/html/AS04.html.

[5] D.J. Watts, Small Worlds: The Dynamics of Networks between Order and Randomness, Princeton University Press, 1999.

[6] S. Milgram, The small-world problem, Psychology Today (1) (1967) 62–67.

[7] M.S. Granovetter, The strength of weak ties, American Journal of Sociology (AJS) 78 (6) (1973) 1360–1380.

[8] Clip2, The gnutella protocol specification v0.4, June 2001. http://rfc-gnutella.sourceforge.net/developer/stable/index.html.

[9] A. Iamnitchi, M. Ripeanu, I.T. Foster, Locating data in (small-world?) peer-to-peer scientific collaborations, in: IPTPS’01: Revised Papers from the First International Workshop on Peer-to-Peer Systems, Springer-Verlag, London, UK, 2002, pp. 232–241.

[10] P. Raftopoulou, E. Petrakis, iCluster: a self-organizing overlay network for P2P information retrieval, in: Proceedings of 30th European ECIR Conference, Glasgow, Scotland, 30 March–3 April 2008.

[11] M. Li, W. Lee, A. Sivasubramaniam, Semantic small world: an overlay network for peer-to-peer search, 2004.

[12] M.F. Porter, An algorithm for suffix stripping, in: K.S. Jones, P. Willet (Eds.), Readings in Information Retrieval, Morgan Kaufmann Publishers Inc., 1997, pp. 313–316.

[13] K.S. Jones, P. Willet, Models, in: K.S. Jones, P. Willet (Eds.), Readings in Information Retrieval, 1997, pp. 257–263 (Chapter 5).

[14] G. Salton, A. Wong, C.S. Yang, A vector space model for automatic indexing, Communications of the ACM 18 (11) (1975) 613–620.

[15] R.K. Belew, Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW, Cambridge University Press, Cambridge, United Kingdom, 2000.

[16] C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, first ed., Cambridge University Press, 2008.

[17] J. Callan, Distributed information retrieval, in: Advances inInformation Retrieval, Kluwer Academic Publishers, 2000,pp. 127–150 (Chapter 5).

[18] J.P. Callan, Z. Lu, W.B. Croft, Searching distributed collectionswith inference networks, in: SIGIR’95: Proceedings of the 18thAnnual International ACM SIGIR Conference on Research andDevelopment in Information Retrieval, ACM, New York, NY,USA, 1995, pp. 21–28.

[19] N. Fuhr, A decision–theoretic approach to database selectionin networked IR, ACM Transactions on Information Systems17 (1999) 229–249.

[20] L. Si, R. Jin, J. Callan, P. Ogilvie, A language modelingframework for resource selection and results merging,in: CIKM 2002, ACM Press, 2002, pp. 391–397.

[21] J.M. Ponte, W.B. Croft, A language modeling approach toinformation retrieval, 1998, pp. 275–281.

[22] C.J. van Rijsbergen, Information Retrieval, second ed.,Butterworths, London, 1979.

[23] T.H. Davenport, L. Prusak, Working Knowledge: HowOrganizations Manage What They Know, Harvard BusinessSchool Press, 1998.

[24] G. Stumme, in: J. Becker, R. Knackstedt (Eds.), Usingontologies and formal concept analysis for organizingbusiness knowledge, Physica (2002), 163–174.

[25] W3C, W3C semantic web activity, October 2008. http://www.w3.org/2001/sw/.

[26] W3C, Resource description framework, RDF, 2008. http://www.w3.org/RDF/.

[27] Yannmei Wang, Zhonghua Yang, Pe Hin Hinny Kong, RobertKheng Leng Gay, Ontology-based Web knowledge manage-ment, in: Proceedings of the 2003 Joint Conference on In-formation, Communications and Signal Processing, 2003 andFourth Pacific Rim Conference onMultimedia, vol. 3, 2003, pp.1859–1863, http://dx.doi.org/10.1109/ICICS.2003.1292789.

[28] A. Aldea, R. Baares-alcntara, J. Bocio, J. Gramajo, D. Isern,An ontology-based knowledge management platform, in:Proceedings of the Workshop on Information Integration onthe Web, IIWeb-03 at the 18th International Joint Conferenceon Artificial Intelligence, 2003, pp. 177–182.

[29] C. Doulkeridis, A. Vlachou, K. Nørvåg, M. Vazirgiannis,Distributed semantic overlay networks, in: Handbook of Peer-to-Peer Networking, Springer, 2010, pp. 463–494 (Chapter).

[30] L.T. Nguyen, W.G. Yee, O. Frieder, Adaptive distributedindexing for structured peer-to-peer networks, in: CIKM’08:Proceeding of the 17th ACM Conference on Information andKnowledge Management, ACM, New York, NY, USA, 2008,pp. 1241–1250.

[31] I. Stoica, R. Morris, D. Karger, F. Kaashoek, H. Balakrishnan,Chord: a scalable peer-to-peer lookup service for Internetapplications, in: Proceedings of the 2001 ACM SIGCOMMConference, 2001, pp. 149–160.

[32] H. Balakrishnan, M.F. Kaashoek, D. Karger, R. Morris, I. Stoica,Looking up data in P2P systems, Communications of the ACM46 (2) (2003) 43–48.

[33] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine,D. Lewin, Consistent hashing and random trees: distributedcaching protocols for relieving hot spots on the world wideweb, in: Proceedings of the Twenty-Ninth Annual ACMSymposium on Theory of Computing, ACM Press, 1997,pp. 654–663.

[34] S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker,A scalable content addressable network, in: Proceedings ofACM SIGCOMM 2001, 2001.

[35] A. Rowstron, P. Druschel, Pastry: scalable, decentralizedobject location, and routing for large-scale peer-to-peersystems, Lecture Notes in Computer Science 2218 (2001).

[36] P. Druschel, A. Rowstron, Past: a large-scale, persistentpeer-to-peer storage utility, in: HotOS VIII, Schloss Elmau,Germany, 2001, pp. 75–80.

182 C O M P U T E R S C I E N C E R E V I E W 6 ( 2 0 1 2 ) 1 6 1 – 1 8 3

[37] A.I.T. Rowstron, P. Druschel, Storage management andcaching in past, a large-scale, persistent peer-to-peer storageutility, in: Symposium on Operating Systems Principles, 2001,pp. 188–201.

[38] A.I.T. Rowstron, A.-M. Kermarrec, M. Castro, P. Druschel,SCRIBE: the design of a large-scale event notificationinfrastructure, in: Networked Group Communication, 2001,pp. 30–43.

[39] K. Hildrum, J.D. Kubiatowicz, S. Rao, B.Y. Zhao, Distributedobject location in a dynamic network, in: Proceedings ofthe Fourteenth ACM Symposium on Parallel Algorithms andArchitectures, August 2002, pp. 41–52.

[40] P. Maymounkov, D. Mazieres, Kademlia: a peer-to-peerinformation system based on the XORmetric, in: Proceedingsof IPTPS02, Cambridge, USA, March 2002.

[41] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels,R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C.Wells, B. Zhao, Oceanstore: an architecture for global-scalepersistent storage, in: Proceedings of ACM ASPLOS, ACM,2000.

[42] C. Tang, Z. Xu, M. Mahalingam, Peersearch: efficientinformation retrieval in peer-peer networks, Tech. Rep. HPL-2002-198, Hewlett-Packard Labs, 2002.

[43] B.T. Loo, R. Huebsch, I. Stoica, J.M. Hellerstein, The case for ahybrid P2P search infrastructure, in: IPTPS, 2004, pp. 141–150.

[44] M. Bender, S. Michel, G. Weikum, C. Zimmer, The Minervaproject: database selection in the context of P2P search,in: G. Vossen, F. Leymann, P.C. Lockemann, W. Stucky (Eds.),BTW, in: LNI, vol. 65, GI, 2005, pp. 125–144.

[45] O. Papapetrou, W. Siberski, W.-T. Balke, W. Nejdl, Dhtsover peer clusters for distributed information retrieval,in: 21st International Advanced Information Networking andApplications, AINA-07, IEEE, 2007.

[46] A. Crespo, H. Garcia-Molina, Semantic overlay networks forP2P systems, 2002.

[47] M. Ehrig, P. Haase, R. Siebes, S. Staab, R. Studer, C. Tempich,The swap data and metadata model for semantics-basedpeer-to-peer systems, in: Proceedings of MATES-2003. FirstGerman Conference on Multiagent Technologies, in: LNAI,Springer, 2003, pp. 22–25.

[48] C. Schmitz, S. Staab, C. Tempich, Socialisation in peer-to-peerknowledge management, in: Proc. International Conferenceon Knowledge Management, I-Know, 2004.

[49] L. Page, S. Brin, R. Motwani, T. Winograd, The pagerankcitation ranking: bringing order to the web, 1999.

[50] C. Schmitz, Self-organization of a small world by topic, in:Proc. 1st International Workshop on Peer-to-Peer KnowledgeManagement, 2004.

[51] E. Löser, C. Tempich, On ranking peers in semantic overlaynetworks, in: 3rd Conference on Professional KnowledgeManagement, 2005.

[52] A. Löser, S. Staab, C. Tempich, Semantic social overlaynetworks, IEEE Journal on Selected Areas in Communications25 (1) (2007) 5–14.

[53] J. Lv, X. Cheng, CTO: concept tree based semantic overlayfor pure peer-to-peer information retrieval, in: CIKM’07:Proceedings of the Sixteenth ACM Conference on Conferenceon Information and Knowledge Management, ACM, NewYork, NY, USA, 2007, pp. 931–934.

[54] A. Löser, F. Naumann, W. Siberski, W. Nejdl, U. Thaden,Semantic overlay clusters within super-peer networks,in: K. Aberer, M. Koubarakis, V. Kalogeraki (Eds.), Databases,Information Systems, and Peer-to-Peer Computing, vol. 2944,Springer, Berlin, Heidelberg, 2004, pp. 33–47.

[55] A.Y. Halevy, Z.G. Ives, J. Madhavan, P. Mork, D. Suciu,I. Tatarinov, The piazza peer data management system, IEEETransactions on Knowledge and Data Engineering 16 (7)(2004) 787–798.

[56] W. Penzo, S. Lodi, F. Mandreoli, R. Martoglia, S. Sassatelli,Semantic peer, here are the neighbors you want!, in: Pro-ceedings of the 11th International Conference on Extend-ing Database Technology: Advances in Database Technology,EDBT’08, ACM, New York, NY, USA, 2008, pp. 26–37.

[57] C. Doulkeridis, A. Vlachou, K. Nørvåg, Y. Kotidis, M.Vazirgiannis, Efficient search based on content similarityover self-organizing P2P networks, Peer-to-Peer Networkingand Applications (2010) 67–79.

[58] C. Doulkeridis, K. Nørvåg, M. Vazirgiannis, Desent: decentral-ized and distributed semantic overlay generation in P2P net-works, IEEE Journal on Selected Areas in Communications 25(2007) 25–34.

[59] E.F. Codd, A relational model of data for large shared databanks, Communications of the ACM 13 (6) (1970) 377–387.

[60] J.M. Hellerstein, Toward network data independence, SIG-MOD Record 32 (3) (2003) 34–40.

[61] O. Parkhomenko, Y. Lee, E.K. Park, Ontology-driven peerprofiling in peer-to-peer enabled semantic web, in: CIKM’03:Proceedings of the Twelfth International Conference onInformation and Knowledge Management, ACM, New York,NY, USA, 2003, pp. 564–567.

[62] K. Aberer, P. Cudre-Mauroux, M. Hauswirth, T. van Pelt,GridVine: building Internet-scale semantic overlay networks,in: International Semantic Web Conference, ISWC, in: LNCS,vol. 3298, 2004, pp. 107–121.

[63] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review,ACM Computing Surveys 31 (3) (1999) 264–323.

[64] F. Crestani, S. Wu, Testing the cluster hypothesis indistributed information retrieval, Information Processing andManagement 42 (5) (2006) 1137–1150.

[65] C.H. Ng, K.C. Sia, Peer clustering and firework query model,in: Proceedings of 11th World Wide Web Conference, May2002.

[66] I.A. Klampanos, J.M. Jose, An architecture for informationretrieval over semi-collaborating peer-to-peer networks, in:Proceedings of the 2004 ACM Symposium on AppliedComputing, Nicosia, Cyprus, vol. 2, March 14–17 2004,pp. 1078–1083.

[67] J. Lu, J. Callan, Content-based retrieval in hybrid peer-to-peer networks, in: Proceedings of the Twelfth InternationalConference on Information and Knowledge Management,ACM Press, 2003, pp. 199–206.

[68] F.M. Cuenca-Acuna, C. Peery, R.P. Martin, T.D. Nguyen,PlanetP: infrastructure support for P2P information sharing,Tech. Rep. DCS-TR-465, Department of Computer Science,Rutgers University, November 2001.

[69] F.M. Cuenca-Acuna, R.P. Martin, T.D. Nguyen, PlanetP: usinggossiping and random replication to support reliable peer-to-peer content search and retrieval, Tech. Rep. DCS-TR-494,Department of Computer Science, Rutgers University, July2002.

[70] F.M. Cuenca-Acuna, T.D. Nguyen, Text-based content searchand retrieval in ad hoc P2P communities, in: InternationalWorkshop on Peer-to-Peer Computing (Co-Located withNetworking 2002), Springer-Verlag, 2002.

[71] F.M. Cuenca-Acuna, C. Peery, R.P. Martin, T.D. Nguyen,PlanetP: using gossiping to build content addressable peer-to-peer information sharing communities, in: Twelfth IEEEInternational Symposium on High Performance DistributedComputing, HPDC-12, IEEE Press, 2003.

[72] M. Jelasity, S. Voulgaris, R. Guerraoui, A.-M. Kermarrec,M. van Steen, Gossip-based peer sampling, ACM Transactionson Computer Systems 25 (2007).

[73] B.H. Bloom, Space/time trade-offs in hash coding withallowable errors, Communications of the ACM 13 (7) (1970)422–426.

C O M P U T E R S C I E N C E R E V I E W 6 ( 2 0 1 2 ) 1 6 1 – 1 8 3 183

[74] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson,Epidemic algorithms for replicated database maintenance,in: Proceedings of the Sixth Annual ACM Symposiumon Principles of Distributed Computing, ACM Press, 1987,pp. 1–12.

[75] C.H. Ng, K.C. Sia, Advanced peer clustering and fireworkquery model, in: Proceedings of International World WideWeb Conference, WWW, 2003.

[76] M. Bawa, G.S. Manku, P. Raghavan, Sets: search enhancedby topic segmentation, in: Proceedings of the 26th AnnualInternational ACM SIGIR Conference on Research andDevelopment in Information Retrieval, ACM Press, 2003,pp. 306–313.

[77] K. Lin, R. Kondadadi, A similarity-based soft clusteringalgorithm for documents, in: Proceedings of the 7thInternational Conference on Database Systems for AdvancedApplications, IEEE Computer Society, 2001, pp. 40–47.

[78] A. Linari, G. Weikum, Efficient peer-to-peer semantic overlaynetworks based on statistical language models, in: P2PIR’06:Proceedings of the International Workshop on InformationRetrieval in Peer-to-Peer Networks, ACM, New York, NY, USA,2006, pp. 9–16.

[79] J.M. Jose, An integrated approach for multimedia informationretrieval, Ph.D. Thesis, The Robert Gordon University, April1998.

[80] W.-T. Balke, W. Nejdl, W. Siberski, U. Thaden, Dl meets P2P—distributed document retrieval based on classification andcontent, in: A. Rauber, S. Christodoulakis, A.M. Tjoa (Eds.),ECDL, in: Lecture Notes in Computer Science, vol. 3652,Springer, 2005, pp. 379–390.

[81] S. Seshadri, B.F. Cooper, Routing queries through a peer-to-peer infobeacons network using information retrievaltechniques, IEEE Transactions on Parallel and DistributedSystems 18 (12) (2007) 1754–1765.

[82] G. Skobeltsyn, T. Luu, I.P. Zarko, M. Rajman, K. Aberer, Webtext retrieval with a P2P query-driven index, in: W. Kraaij,A.P. de Vries, C.L.A. Clarke, N. Fuhr, N. Kando (Eds.), SIGIR,ACM, 2007, pp. 679–686.

[83] G. Skobeltsyn, T. Luu, I.P. Zarko, M. Rajman, K. Aberer,Query-driven indexing for scalable peer-to-peer text re-trieval, Future Generation Computer Systems 25 (1) (2009)89–99.

[84] R. Steinmetz, K. Wehrle (Eds.), Peer-to-Peer Systems andApplications, Springer-Verlag, Berlin, Heidelberg, 2005.

[85] P. Haase, J. Broekstra, M. Ehrig, M. Menken, M. Plechawski,P. Pyszlak, B. Schnizler, R. Siebes, S. Staab, C. Tempich,Bibster—a semantics-based bibliographic peer-to-peer sys-tem, in: Proceedings of the Third International Semantic WebConference, 2004, pp. 122–136.

[86] H. Nottelmann, N. Fuhr, A decision–theoretic model fordecentralised query routing in hierarchical peer-to-peernetworks, in: 29th European Conference on InformationRetrieval Research, ECIR 2007, 2007, pp. 148–159.

[87] J.H. Ward, Hierarchical grouping to optimize an objectivefunction, Journal of American Statistical Association 58 (301)(1963) 236–244.

[88] T. Zhang, R. Ramakrishnan, M. Livny, Birch: a new dataclustering algorithm and its applications, Data Mining andKnowledge Discovery 1 (2) (1997) 141–182.

[89] NIST, Text retrieval conference, trec, 2008. http://trec.nist.gov/.

[90] I.A. Klampanos, J.M. Jose, An evaluation of a cluster-basedarchitecture for peer-to-peer information retrieval, in: 18thInternational Conference on Database and Expert SystemsApplications—DEXA’07, 2007, pp. 380–391.

[91] I.A. Klampanos, V. Poznanski, J.M. Jose, P. Dickman, Asuite of testbeds for the realistic evaluation of peer-to-peer information retrieval systems, in: LNCS 3408. ECIR’05,Santiago de Compostela, Spain, March 2005, pp. 38–51.

[92] N.J. Belkin, W.B. Croft, Information filtering and informationretrieval: two sides of the same coin? Communications of theACM 35 (12) (1992) 29–38.

[93] A.M. Ouksel, In-context peer-to-peer information filtering onthe web, SIGMOD Record 32 (3) (2003) 65–70.

[94] J. Wang, J. Pouwelse, R.L. Lagendijk, M.J.T. Reinders, Dis-tributed collaborative filtering for peer-to-peer file sharingsystems, in: SAC’06: Proceedings of the 2006 ACM Sympo-sium on Applied Computing, ACM, New York, NY, USA, 2006,pp. 1026–1030.

[95] R. Manmatha, T. Rath, F. Feng, Modeling score distributionsfor combining the outputs of search engines, in: SIGIR’01:Proceedings of the 24th Annual International ACM SIGIRConference on Research and Development in InformationRetrieval, ACM Press, 2001, pp. 267–275.

[96] A. Arampatzis, J. Kamps, A signal-to-noise approach to scorenormalization, in: CIKM’09: Proceeding of the 18th ACMConference on Information and Knowledge Management,ACM, 2009, pp. 797–806.