

LSDS-IR’08, October 30, 2008

Peer-to-Peer Similarity Search over Widely Distributed Document Collections

Christos Doulkeridis (1), Kjetil Nørvåg (2), Michalis Vazirgiannis (1)

(1) Department of Informatics, Athens University of Economics and Business, Greece
(2) Department of Computer Science, Norwegian University of Science and Technology, Norway


Motivation

• Application
  – Digital libraries
• Given a document (= query), retrieve similar documents
• e.g. find papers similar to my research paper
• Efficiently locate the subset of peers that store content similar to the query
• Challenge
  – Similarity search over widely distributed high-dimensional data

[Figure: Distributed Information Retrieval, a network of computers]


Outline

• Local peer pre-processing
  – Feature extraction
  – Local clustering
• Semantic overlay network (SON) construction
  – Topological zone creation
  – Zone clustering
• Super-peer organization of SONs
  – Searching
• Experimental evaluation
• Conclusions & future work


Feature Extraction and Local Document Clustering

• Peers store documents
• Tokenization / stemming / stop-word removal
• Each document represented by a feature vector (top-k features)
  – Vector Space Model (VSM)
  – F_i = {(f_ij, w_ij)}
• Cluster feature vectors
• Result:
  – set of initial clusters per peer
• Each cluster represented by a feature vector

[Figure: Peer’s initial clusters]
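
The pre-processing steps on this slide can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the crude `naive_stem` helper and the use of raw term frequency as the weight are all assumptions, since the slides do not say which stemmer or weighting scheme is used.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def naive_stem(token: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def feature_vector(text: str, k: int) -> dict[str, float]:
    """Return the top-k (feature, weight) pairs of a document, L2-normalised."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    counts = Counter(terms)
    # Sort by descending frequency, then alphabetically so the result is deterministic.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    norm = math.sqrt(sum(c * c for _, c in top)) or 1.0
    # Assumption: the weight is the normalised term frequency.
    return {term: c / norm for term, c in top}
```

Calling `feature_vector(text, k)` on each local document yields the sparse vectors that a peer would then cluster.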


Overlay Construction

• Multi-phase distributed process
• Starting point: unstructured P2P network
• Recursive application of 3 steps, until global clusters (SONs) are created


Zone Creation

• A certain percentage of peers becomes initiators
  – randomly distributed over the network
• PROBE-based technique
• Partial synchronization
• In case of excessive zone sizes
  – zone partitioning

Finally:
• Each initiator
  – knows the peer ids in its zone
  – knows neighboring initiators
• Each peer knows its initiator

[Figure: network zones, each with its initiator]
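
One way to picture the outcome of zone creation is a multi-source breadth-first search in which each peer joins the zone of the initiator whose probe reaches it first. The sketch below assumes exactly that; the slides do not spell out the PROBE protocol, and zone partitioning for oversized zones is left out. The function name `create_zones` and its parameters are hypothetical.

```python
import random
from collections import deque


def create_zones(adjacency, initiator_fraction, seed=0):
    """Assign every peer to the initiator whose probe reaches it first.

    adjacency: dict mapping peer id -> list of neighbour ids (unstructured overlay).
    Returns (zone_of, neighbours): peer -> initiator, and
    initiator -> set of initiators owning an adjacent zone.
    """
    rng = random.Random(seed)
    peers = sorted(adjacency)
    n_init = max(1, round(initiator_fraction * len(peers)))
    initiators = rng.sample(peers, n_init)

    zone_of = {i: i for i in initiators}
    neighbours = {i: set() for i in initiators}
    queue = deque(initiators)
    while queue:
        peer = queue.popleft()
        for nxt in adjacency[peer]:
            if nxt not in zone_of:
                # Unclaimed peer: it joins the zone the probe came from.
                zone_of[nxt] = zone_of[peer]
                queue.append(nxt)
            elif zone_of[nxt] != zone_of[peer]:
                # Probe hit another zone: the two initiators are neighbours.
                neighbours[zone_of[peer]].add(zone_of[nxt])
                neighbours[zone_of[nxt]].add(zone_of[peer])
    return zone_of, neighbours
```

After the call, every peer knows its initiator (`zone_of`) and every initiator knows its neighboring initiators (`neighbours`), which matches the end state listed on the slide.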


Zone Clustering

• Initiators
  – collect feature vectors from peers
  – perform intra-zone hierarchical clustering
  – pick cluster representatives
• Cluster description
  – CD_i = (C_i, F_i, {P}, R)
• Remaining challenge
  – How to bring together similar (remote) clusters?

[Figure: similar remote clusters]
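
A minimal sketch of what an initiator might do with the collected vectors is given below. It uses centroid-linkage agglomerative clustering with cosine similarity, which is one common form of hierarchical clustering and not necessarily the variant used in this work. The field meanings in `ClusterDescription` are an assumed reading of the tuple (C, F, {P}, R), and picking the smallest peer id as representative is an arbitrary placeholder.

```python
import math
from dataclasses import dataclass


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


@dataclass
class ClusterDescription:
    cluster_id: int      # assumed meaning of C_i
    features: dict       # assumed meaning of F_i: the cluster's feature vector
    peers: set           # assumed meaning of {P}: peers contributing to the cluster
    representative: str  # assumed meaning of R: peer representing the cluster


def centroid(vectors):
    """Average a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for f, w in vec.items():
            total[f] = total.get(f, 0.0) + w / len(vectors)
    return total


def zone_clustering(peer_vectors, min_similarity):
    """peer_vectors: list of (peer_id, feature_vector) collected by the initiator."""
    clusters = [([pid], [vec]) for pid, vec in peer_vectors]
    while len(clusters) > 1:
        # Find the most similar pair of cluster centroids.
        sim, i, j = max(
            (cosine(centroid(a[1]), centroid(b[1])), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        if sim < min_similarity:
            break  # no remaining pair is similar enough to merge
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return [
        ClusterDescription(cid, centroid(vecs), set(pids), min(pids))
        for cid, (pids, vecs) in enumerate(clusters)
    ]
```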


Inter-zone Clustering

[Figure: zone hierarchy, Level 1 to Level 4]

Advantages:
1) Very large networks
2) Efficient
3) Small individual load


SON Merging

• Create d links among the least-connected peers in merged SONs

[Figure: SON 1 and SON 2 merged under a super-peer, for d=3]
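
The merging rule can be sketched as follows. The pairing scheme (the i-th least-connected peer of one SON is linked to the i-th least-connected peer of the other) is an assumption, as is the requirement that peer ids in the two SONs are disjoint; the slide only states that d links are created among the least-connected peers. The name `merge_sons` is hypothetical.

```python
def merge_sons(son_a, son_b, d):
    """Link two SONs with up to d new edges between their least-connected peers.

    son_a, son_b: dicts mapping peer id -> set of neighbour ids inside that SON.
    Returns the merged adjacency and the list of links that were added.
    """
    merged = {p: set(n) for son in (son_a, son_b) for p, n in son.items()}

    def least_connected(son):
        # Fewest links first; ties broken by peer id so the result is deterministic.
        return sorted(son, key=lambda p: (len(son[p]), p))[:d]

    # Assumption: pair the candidates of both SONs position by position.
    new_links = list(zip(least_connected(son_a), least_connected(son_b)))
    for a, b in new_links:
        merged[a].add(b)
        merged[b].add(a)
    return merged, new_links
```

If either SON has fewer than d peers, fewer than d links are created.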


Searching

• Inter-SON routing

• Intra-SON routing

• Naïve solution: flooding

[Figure: query Q]


Adaptive Clustering

• After global SON creation
  – Broadcast final cluster descriptions to all peers
  – Use zone hierarchy for efficient broadcasting
• Each peer can then
  – Reassign its documents to clusters
  – Join the appropriate SONs
• Similar to a feedback mechanism
• Advantages: see experimental results

[Figure: Final organization, showing the super-peer level and the peer level (peers A to J), with A’s, D’s, G’s and J’s clusters]
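
The reassignment step on the peer side could look roughly like this. The sketch assumes that each broadcast cluster description boils down to a feature vector keyed by a cluster id, and that a document simply goes to the cluster with the highest cosine similarity; `reassign_documents` and its parameters are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def reassign_documents(doc_vectors, cluster_descriptions):
    """Map each local document to its most similar global cluster.

    doc_vectors: {doc_id: feature_vector} held by this peer.
    cluster_descriptions: {cluster_id: feature_vector} received by broadcast.
    Returns (assignment, sons_to_join).
    """
    assignment = {}
    for doc_id, vec in doc_vectors.items():
        assignment[doc_id] = max(
            cluster_descriptions,
            key=lambda cid: cosine(vec, cluster_descriptions[cid]),
        )
    # The peer joins the SON of every cluster that received one of its documents.
    return assignment, set(assignment.values())
```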


Experimental Setup

• GT-ITM topology generator (1K, 5K peers)
• TREC.GOV2 (1M docs), Reuters (810K docs)
• Random querying peer
• Query: “Given doc X, find the top-k similar docs to X”
• Cosine similarity
• Similarity threshold Ts, to determine matching docs to query
• Metrics
  – Recall
  – Recall@k
  – Precision@k
  – #Contacted peers
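
For reference, the ranking metrics can be computed as sketched below. These are the standard information-retrieval definitions; whether the evaluation uses exactly these (for example `>=` rather than `>` against Ts, or division by k for precision) is an assumption, and the function names are hypothetical.

```python
def relevant_docs(similarities, ts):
    """Documents whose cosine similarity to the query reaches the threshold Ts."""
    return {doc for doc, sim in similarities.items() if sim >= ts}


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found among the top-k returned."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)
```

Here `similarities` holds the query-to-document cosine similarity over the whole collection (the ground truth), while `ranked` is the list returned by the distributed search, best match first.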


Clustering Statistics

• Adaptive clustering
  – decreases the average pair-wise similarity of clusters
  – increases the average pair-wise similarity of documents within a cluster (not shown here)


Search Evaluation

• Recall
  – Ts=0.2
  – Also tried Ts=0.1

• #Contacted Peers


Search Evaluation - GOV2/P5000

[Figure: two plots against Top-N Clusters (1 to 5). Precision plot: Pre@10, Pre@20, Pre@40, Pre@60, Pre@80, y-axis from 0 to 0.8. Recall plot: Rec@10, Rec@20, Rec@40, Rec@60, Rec@80, y-axis from 0 to 0.35]


SON-based versus Plain Super-peer


Conclusions

• We presented a novel approach for P2P similarity search
• Peers self-organize into SONs, forming a super-peer network
• We showed how a high-quality searching mechanism can be deployed
• We presented experiments on 2 large document collections (GOV2 and Reuters) to evaluate our approach
• Future work:
  – More efficient inter-SON routing
  – Semantic similarity search using query expansion
  – Use of other clustering algorithms to improve performance


Thank you for your attention!

More info:
http://www.db-net.aueb.gr/
http://www.idi.ntnu.no/grupper/db/