LSDS-IR’08, October 30, 2008

Peer-to-Peer Similarity Search over Widely Distributed Document Collections

Christos Doulkeridis (1), Kjetil Nørvåg (2), Michalis Vazirgiannis (1)

(1) Department of Informatics, Athens University of Economics and Business, Greece

(2) Department of Computer Science, Norwegian University of Science and Technology, Norway

Motivation

• Application
– Digital libraries
• Given a document (= query), retrieve similar documents
• e.g. find papers similar to my research paper
• Efficiently locate the subset of peers that store content similar to the query
• Challenge
– Similarity search over widely distributed high-dimensional data


Distributed Information Retrieval

Outline

• Local peer pre-processing
– Feature extraction
– Local clustering
• Semantic overlay network (SON) construction
– Topological zone creation
– Zone clustering
• Super-peer organization of SONs
– Searching
• Experimental evaluation
• Conclusions & future work

Feature Extraction and Local Document Clustering

• Peers store documents
• Tokenization / stemming / stop-word removal
• Each document is represented by a feature vector (top-k features)
– Vector Space Model (VSM)
– F_i = {(f_ij, w_ij)}
• Cluster the feature vectors
• Result:
– A set of initial clusters per peer
• Each cluster is represented by a feature vector

[Figure: Peer’s initial clusters]
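The pre-processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not give the weighting scheme, so normalized term frequency is used here, stemming is omitted, and the stop-word list and the names `feature_vector` and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "over", "the", "to"}

def feature_vector(text, k=5):
    """Return the top-k features of a document as {feature: weight}."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOP_WORDS]
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    top = sorted(counts.items(), key=lambda fw: (-fw[1], fw[0]))[:k]
    norm = math.sqrt(sum(w * w for _, w in top))
    # Assumed weighting: term frequency scaled to unit length.
    return {f: w / norm for f, w in top} if norm else {}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Keeping only the top-k features bounds the size of every vector a peer has to ship, which matters once vectors travel over the network.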

Overlay Construction

• Multi-phase distributed process
• Starting point: unstructured P2P network
• Recursive application of 3 steps, until global clusters (SONs) are created

Zone Creation

• A certain percentage of peers become initiators
– Randomly distributed over the network
• PROBE-based technique
• Partial synchronization
• In case of excessive zone sizes
– Zone partitioning

Finally:
• Each initiator
– knows the peer ids in its zone
– knows neighboring initiators
• Each peer knows its initiator

[Figure labels: Initiators; Initiator]
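One way to picture zone creation is a multi-source breadth-first search in which every peer joins the first initiator whose probe reaches it. The sketch below is a centralized simulation under assumptions: the actual PROBE message format, the synchronization details, and zone partitioning are not given in the slides and are not modelled, and `create_zones` and its parameters are hypothetical names.

```python
import random
from collections import deque

def create_zones(adjacency, fraction=0.2, seed=0):
    """Simulate zone creation on an unstructured network.

    adjacency: dict mapping peer id -> iterable of neighbouring peer ids.
    Returns (zone_of, zones, neighbours_of_initiator).
    """
    rng = random.Random(seed)
    peers = sorted(adjacency)
    n_init = max(1, round(fraction * len(peers)))
    initiators = rng.sample(peers, n_init)  # randomly chosen initiators

    # Each initiator starts its own zone; probes spread hop by hop.
    zone_of = {p: p for p in initiators}
    queue = deque(initiators)
    while queue:
        peer = queue.popleft()
        for nb in adjacency[peer]:
            if nb not in zone_of:  # first probe to arrive wins
                zone_of[nb] = zone_of[peer]
                queue.append(nb)

    # Each initiator learns the peer ids in its zone.
    zones = {i: [] for i in initiators}
    for peer, initiator in zone_of.items():
        zones[initiator].append(peer)

    # Two initiators are neighbours if an edge crosses their zones.
    neighbours = {i: set() for i in initiators}
    for peer in zone_of:
        for nb in adjacency[peer]:
            if zone_of[nb] != zone_of[peer]:
                neighbours[zone_of[peer]].add(zone_of[nb])
    return zone_of, zones, neighbours
```

For example, on a ring of ten peers with `fraction=0.2`, two initiators are picked and the ring splits into two contiguous zones whose initiators know each other.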

Zone Clustering

• Initiators
– collect feature vectors from the peers in their zone
– perform intra-zone hierarchical clustering
– pick cluster representatives
• Cluster description
– CD_i = (C_i, F_i, {P}, R)
• Remaining challenge
– How to bring together similar (remote) clusters?
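The intra-zone step can be illustrated with a simple agglomerative clustering over the collected vectors. This is a sketch under assumptions, not the authors' algorithm: the slides do not name the linkage criterion or the stopping rule, so centroid similarity with a hypothetical `threshold` is used, and the representative is assumed to be the peer whose vector lies closest to the centroid.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for f, w in vec.items():
            total[f] = total.get(f, 0.0) + w
    return {f: w / len(vectors) for f, w in total.items()}

def agglomerate(items, threshold=0.5):
    """items: list of (peer_id, vector). Merge the two most similar
    clusters until no pair reaches the (assumed) similarity threshold."""
    clusters = [{"members": [item], "centroid": dict(item[1])}
                for item in items]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i]["centroid"], clusters[j]["centroid"])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        merged = clusters[i]["members"] + clusters[j]["members"]
        clusters[i] = {"members": merged,
                       "centroid": centroid([v for _, v in merged])}
        del clusters[j]
    return clusters

def representative(cluster):
    """Assumed rule: the peer whose vector is closest to the centroid."""
    return max(cluster["members"],
               key=lambda m: cosine(m[1], cluster["centroid"]))[0]
```

The pairwise scan is quadratic per merge, which is acceptable here only because an initiator clusters the vectors of a single zone rather than of the whole network.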

Inter-zone Clustering

[Figure labels: Level 1, Level 2, Level 3, Level 4]

Advantages:
1) Very large networks
2) Efficient
3) Small individual load

SON Merging

• Create d links among the least-connected peers in merged SONs

[Figure: SON 1 and SON 2, for d = 3; label: Super-Peer]
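The merging rule can be sketched as follows. How the least-connected peers of the two SONs are paired is not stated in the slides, so this illustration simply pairs the d lowest-degree peers of one SON with the d lowest-degree peers of the other; `merge_sons` and its arguments are hypothetical names.

```python
def merge_sons(son_a, son_b, links, d=3):
    """Merge two SONs by adding up to d links between their
    least-connected peers.

    son_a, son_b: lists of peer ids.
    links: dict peer id -> set of neighbouring peer ids (updated in place).
    Returns (merged peer list, list of new links).
    """
    def least_connected(son):
        # Lowest degree first; peer id breaks ties deterministically.
        return sorted(son, key=lambda p: (len(links[p]), p))[:d]

    # Assumed pairing: i-th least-connected peer of A with i-th of B.
    new_links = list(zip(least_connected(son_a), least_connected(son_b)))
    for a, b in new_links:
        links[a].add(b)
        links[b].add(a)
    return sorted(set(son_a) | set(son_b)), new_links
```

Linking low-degree peers spreads the extra connections over peers that carry the least load; if a SON has fewer than d peers, fewer links are created.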

Searching

• Inter-SON routing

• Intra-SON routing

• Naïve solution: flooding
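A two-stage search along these lines might look like the sketch below. It is an assumption-laden illustration: inter-SON routing is modelled as ranking cluster descriptions by cosine similarity to the query and keeping the top-N SONs, and intra-SON routing is the naive flooding variant, in which every peer of a selected SON is contacted. The data layout and the names `search`, `clusters` and `peer_docs` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def search(query, clusters, peer_docs, top_n=2, ts=0.2, k=10):
    """Return (top-k matching doc ids, number of contacted peers).

    clusters: dict son_id -> {"centroid": vector, "peers": [peer ids]}
    peer_docs: dict peer_id -> {doc_id: vector}
    """
    # Inter-SON routing: pick the top-N most similar cluster descriptions.
    ranked = sorted(clusters,
                    key=lambda s: cosine(query, clusters[s]["centroid"]),
                    reverse=True)
    contacted = set()
    hits = []
    for son in ranked[:top_n]:
        # Intra-SON routing, naive variant: flood every peer in the SON.
        for peer in clusters[son]["peers"]:
            if peer in contacted:
                continue
            contacted.add(peer)
            for doc_id, vec in peer_docs[peer].items():
                score = cosine(query, vec)
                if score >= ts:  # similarity threshold Ts
                    hits.append((score, doc_id))
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [doc_id for _, doc_id in hits[:k]], len(contacted)
```

Raising `top_n` trades more contacted peers for a chance at higher recall, which is the trade-off the evaluation slides explore.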

Adaptive Clustering

• After global SON creation
– Broadcast final cluster descriptions to all peers
– Use zone hierarchy for efficient broadcasting
• Each peer can then
– Reassign its documents to clusters
– Join the appropriate SONs
• Similar to a feedback mechanism
• Advantages: see experimental results
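The reassignment step can be sketched as a nearest-cluster assignment. This is an illustration under assumptions: each global cluster description is reduced to a single feature vector, a document goes to the one cluster with the highest cosine similarity, and the names `reassign` and `global_clusters` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def reassign(local_docs, global_clusters):
    """Assign each local document to its most similar global cluster.

    local_docs: dict doc_id -> vector
    global_clusters: dict cluster_id -> cluster feature vector
    Returns (assignment, list of SONs the peer should join).
    """
    assignment = {}
    for doc_id, vec in local_docs.items():
        # Iterating over sorted ids makes ties resolve to the smallest id.
        assignment[doc_id] = max(
            sorted(global_clusters),
            key=lambda c: cosine(vec, global_clusters[c]))
    return assignment, sorted(set(assignment.values()))
```

A peer whose documents span several topics ends up joining several SONs, one per cluster that received at least one of its documents.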

[Figure: Final organization. Labels: Super-peer Level; Peer Level; A’s Cluster; D’s Cluster; G’s Cluster; J’s Cluster]

Experimental Setup

• GT-ITM topology generator (1K, 5K peers)
• TREC.GOV2 (1M docs), Reuters (810K docs)
• Random querying peer
• Query: “Given doc X, find the top-k similar docs to X”
• Cosine similarity
• Similarity threshold Ts, to determine which docs match the query
• Metrics
– Recall
– Recall@k
– Precision@k
– #Contacted peers
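The slides do not spell out the metric definitions, so the sketch below assumes the standard information-retrieval ones: the relevant set is every document in the collection whose similarity to the query is at least Ts, Precision@k is the fraction of the top-k results that are relevant, and Recall@k is the fraction of the relevant set found in the top-k results. The function names are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved at all."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)
```

With `ranked = ["d1", "d7", "d3", "d9"]` and `relevant = {"d1", "d3", "d5"}`, Precision@2 is 1/2 and Recall@4 is 2/3.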

Clustering Statistics

• Adaptive clustering
– Decreases the average pair-wise similarity of clusters
– Increases the average pair-wise similarity of documents within a cluster (not shown here)

Search Evaluation

• Recall
– Ts = 0.2
– Also tried Ts = 0.1
• #Contacted Peers

Search Evaluation - GOV2/P5000

[Figure: two plots against Top-N Clusters (1 to 5). Series: Pre@10, Pre@20, Pre@40, Pre@60, Pre@80 and Rec@10, Rec@20, Rec@40, Rec@60, Rec@80]

SON-based versus Plain Super-peer

Conclusions

• We presented a novel approach for P2P similarity search
• Peers self-organize into SONs, forming a super-peer network
• We showed how a high-quality searching mechanism can be deployed
• We presented experiments on 2 large document collections (GOV2 and Reuters) to evaluate our approach
• Future work:
– More efficient inter-SON routing
– Semantic similarity search using query expansion
– Use of other clustering algorithms to improve performance

Thank you for your attention!

More info:
http://www.db-net.aueb.gr/
http://www.idi.ntnu.no/grupper/db/
