LSDS-IR’08, October 30, 2008

Peer-to-Peer Similarity Search over Widely Distributed Document Collections

Christos Doulkeridis (1), Kjetil Nørvåg (2), Michalis Vazirgiannis (1)

(1) Department of Informatics, Athens University of Economics and Business, Greece

(2) Department of Computer Science, Norwegian University of Science and Technology, Norway

Motivation

• Application
– Digital libraries
• Given a document (= query), retrieve similar documents
• e.g. find papers similar to my research paper
• Efficiently locate the subset of peers that store content similar to the query
• Challenge
– Similarity search over widely distributed high-dimensional data

[Figure: a network of computers (peers), labeled "Distributed Information Retrieval"]

Outline

• Local peer pre-processing
– Feature extraction
– Local clustering
• Semantic overlay network (SON) construction
– Topological zone creation
– Zone clustering
• Super-peer organization of SONs
– Searching
• Experimental evaluation
• Conclusions & future work

Feature Extraction and Local Document Clustering

• Peers store documents
• Tokenization / stemming / stop-word removal
• Each document represented by a feature vector (top-k features)
– Vector Space Model (VSM)
– F_i = {(f_ij, w_ij)}
• Cluster feature vectors
• Result:
– set of initial clusters per peer
• Each cluster represented by a feature vector

[Figure: peer's initial clusters]
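
The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the term-frequency weighting, and the names `feature_vector` and `cosine` are assumptions made for the example, and stemming is omitted for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real system would use a full list).
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is", "for", "over"}

def feature_vector(text, k=5):
    """Return the k highest-weighted (feature, weight) pairs of a document as a dict."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    # Weight = relative term frequency (assumed weighting scheme).
    return {term: n / total for term, n in top}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Each dict plays the role of F_i, with keys as features f_ij and values as weights w_ij; local clustering would then group these vectors by similarity.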

Overlay Construction

• Multi-phase distributed process
• Starting point: unstructured P2P network
• Recursive application of 3 steps, until global clusters (SONs) are created

Zone Creation

• A certain percentage of peers becomes initiators
– randomly distributed over the network
• PROBE-based technique
• Partial synchronization
• In case of excessive zone sizes
– zone partitioning

Finally:
• Each initiator
– knows the peer ids in its zone
– knows neighboring initiators
• Each peer knows its initiator

[Figure: network divided into zones, each with its initiator]
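
The end state described above (each peer knows its initiator, each initiator knows its zone and its neighboring initiators) can be illustrated with a small simulation. The slides do not spell out the PROBE protocol, so this sketch substitutes a breadth-first flood from every initiator as a stand-in; `create_zones`, `initiator_fraction`, and the adjacency-dict input are hypothetical, and partial synchronization and zone partitioning are not modeled.

```python
import random
from collections import deque

def create_zones(adjacency, initiator_fraction=0.1, seed=0):
    """adjacency: dict mapping each peer id to the set of its neighbour ids."""
    rng = random.Random(seed)
    peers = sorted(adjacency)
    n_initiators = max(1, round(len(peers) * initiator_fraction))
    initiators = rng.sample(peers, n_initiators)  # randomly chosen initiators

    # Multi-source BFS (stand-in for PROBE messages): a peer joins the zone
    # of the first initiator whose flood reaches it.
    zone_of = {p: p for p in initiators}
    queue = deque(initiators)
    while queue:
        peer = queue.popleft()
        for nb in adjacency[peer]:
            if nb not in zone_of:
                zone_of[nb] = zone_of[peer]
                queue.append(nb)

    # Each initiator learns the peer ids in its zone ...
    zones = {i: [] for i in initiators}
    for peer, initiator in zone_of.items():
        zones[initiator].append(peer)

    # ... and the initiators of adjacent zones (edges crossing a zone border).
    neighbors = {i: set() for i in initiators}
    for peer, nbs in adjacency.items():
        for nb in nbs:
            if peer in zone_of and nb in zone_of and zone_of[peer] != zone_of[nb]:
                neighbors[zone_of[peer]].add(zone_of[nb])
    return zone_of, zones, neighbors
```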

Zone Clustering

• Initiators
– collect feature vectors from peers
– perform intra-zone hierarchical clustering
– pick cluster representatives
• Cluster description
– CD_i = (C_i, F_i, {P}, R)
• Remaining challenge
– How to bring together similar (remote) clusters?

[Figure: similar remote clusters]
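
As a rough sketch of what an initiator might do, the code below runs a simple centroid-based agglomerative clustering over the collected vectors and emits one description per cluster. The slide does not define the components of CD_i, so the field meanings are assumptions: C_i is taken as a cluster id, F_i as the cluster's feature vector, {P} as the set of contributing peers, and R as a representative peer (here simply the smallest peer id). The stopping threshold `min_similarity` is also hypothetical.

```python
import math
from dataclasses import dataclass

def cosine(u, v):
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Average of sparse vectors (dicts)."""
    acc = {}
    for vec in vectors:
        for f, w in vec.items():
            acc[f] = acc.get(f, 0.0) + w
    return {f: w / len(vectors) for f, w in acc.items()}

@dataclass
class ClusterDescription:
    cluster_id: int       # C_i (assumed meaning)
    features: dict        # F_i (assumed: centroid of member vectors)
    peers: set            # {P} (assumed: peers contributing to the cluster)
    representative: str   # R   (assumed: one peer chosen to represent the cluster)

def cluster_zone(peer_vectors, min_similarity=0.5):
    """peer_vectors: list of (peer_id, feature_vector) collected by the initiator."""
    clusters = [([vec], {peer}) for peer, vec in peer_vectors]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(centroid(clusters[i][0]), centroid(clusters[j][0]))
                if s > best:
                    best, pair = s, (i, j)
        if best < min_similarity:
            break  # remaining clusters are too dissimilar to merge
        i, j = pair
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return [ClusterDescription(n, centroid(vecs), peers, min(peers))
            for n, (vecs, peers) in enumerate(clusters)]
```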

Inter-zone Clustering

[Figure: zone hierarchy, Levels 1 to 4]

Advantages:
1) Very large networks
2) Efficient
3) Small individual load

SON Merging

• Create d links among the least-connected peers in merged SONs

[Figure: SON 1 and SON 2 merged under a super-peer, for d = 3]
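
One possible reading of the merging rule is sketched below: take the d least-connected peers of each SON and link them pairwise. The pairing strategy, the function name `merge_sons`, and the adjacency-dict representation are assumptions for illustration; the slide only states that d links are created among the least-connected peers.

```python
def merge_sons(links, son_a, son_b, d=3):
    """Create up to d links between the least-connected peers of two SONs.

    links: dict mapping peer id -> set of neighbour ids (updated in place).
    son_a, son_b: sets of peer ids belonging to each SON.
    Returns the list of (peer_in_a, peer_in_b) links that were added.
    """
    def least_connected(son):
        # Lowest degree first; peer id breaks ties for determinism.
        return sorted(son, key=lambda p: (len(links[p]), p))[:d]

    # Both candidate lists are computed before any link is added,
    # so the new links do not influence the selection.
    new_links = list(zip(least_connected(son_a), least_connected(son_b)))
    for a, b in new_links:
        links[a].add(b)
        links[b].add(a)
    return new_links
```

Linking low-degree peers spreads the extra connections over peers that currently carry the least routing load.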

Searching

• Inter-SON routing

• Intra-SON routing

• Naïve solution: flooding

[Figure: routing of a query Q]
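
The two routing stages can be sketched as follows, under assumptions: inter-SON routing is modeled as ranking SON feature vectors by cosine similarity to the query and keeping the top N, and intra-SON routing uses the naïve flooding mentioned above. The names `select_sons` and `flood_son` and the data layout are hypothetical.

```python
import math
from collections import deque

def cosine(u, v):
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_sons(query, son_features, top_n=2):
    """Inter-SON routing: ids of the top_n SONs most similar to the query vector."""
    ranked = sorted(son_features, key=lambda s: (-cosine(query, son_features[s]), s))
    return ranked[:top_n]

def flood_son(links, members, entry_peer):
    """Intra-SON routing by flooding: every SON member reachable from entry_peer."""
    visited = {entry_peer}
    queue = deque([entry_peer])
    while queue:
        peer = queue.popleft()
        for nb in links.get(peer, ()):
            if nb in members and nb not in visited:  # stay inside the SON
                visited.add(nb)
                queue.append(nb)
    return visited
```

The size of the returned set corresponds to the number of contacted peers inside one SON.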

Adaptive Clustering

• After global SON creation
– Broadcast final cluster descriptions to all peers
– Use zone hierarchy for efficient broadcasting
• Each peer can then
– Reassign its documents to clusters
– Join the appropriate SONs
• Similar to a feedback mechanism
• Advantages
– see experimental results

[Figure: final organization, showing a super-peer level above a peer level, with peers A to J grouped into A's, D's, G's and J's clusters]
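
The reassignment step of adaptive clustering might look like the sketch below, in which each peer assigns every local document to the most similar broadcast cluster and then joins the SONs of all clusters it uses. Cosine similarity as the assignment criterion and the names `reassign_documents` and `sons_to_join` are assumptions for the example.

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def reassign_documents(doc_vectors, cluster_features):
    """Map each local document id to the id of its most similar global cluster."""
    cluster_ids = sorted(cluster_features)  # fixed order makes ties deterministic
    return {
        doc_id: max(cluster_ids, key=lambda c: cosine(vec, cluster_features[c]))
        for doc_id, vec in doc_vectors.items()
    }

def sons_to_join(assignment):
    """A peer joins the SON of every cluster that holds one of its documents."""
    return set(assignment.values())
```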

Experimental Setup

• GT-ITM topology generator (1K, 5K peers)
• TREC.GOV2 (1M docs), Reuters (810K docs)
• Random querying peer
• Query:
– “Given doc X, find the top-k similar docs to X”
• Cosine similarity
• Similarity threshold Ts, to determine matching docs to query
• Metrics
– Recall
– Recall@k
– Precision@k
– #Contacted peers
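
For reference, the ranking metrics can be computed as in the sketch below. These are the standard textbook definitions; the slides do not state the exact definitions used in the experiments, so treat this as an assumption. The helper `matching_docs`, which applies the threshold Ts to per-document similarities, is likewise hypothetical.

```python
def matching_docs(similarities, ts):
    """Documents whose similarity to the query reaches the threshold Ts."""
    return {doc for doc, sim in similarities.items() if sim >= ts}

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found among the top-k retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)
```

With Ts = 0.2, a document with similarity 0.1 to the query does not count as a match, so retrieving it lowers precision without raising recall.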

Clustering Statistics

• Adaptive clustering
– Decreases the average pair-wise similarity of clusters
– Increases the average pair-wise similarity of documents within a cluster (not shown here)

Search Evaluation

• Recall
– Ts = 0.2
– Also tried Ts = 0.1
• #Contacted peers

Search Evaluation - GOV2/P5000

[Figure: two plots against Top-N Clusters (1 to 5). Left: precision, series Pre@10, Pre@20, Pre@40, Pre@60, Pre@80, y-axis 0 to 0.8. Right: recall, series Rec@10, Rec@20, Rec@40, Rec@60, Rec@80, y-axis 0 to 0.35.]

SON-based versus Plain Super-peer

Conclusions

• We presented a novel approach for P2P similarity search
• Peers self-organize into SONs, forming a super-peer network
• We showed how a high-quality searching mechanism can be deployed
• We presented experiments on 2 large document collections (GOV2 and Reuters) to evaluate our approach
• Future work:
– More efficient inter-SON routing
– Semantic similarity search using query expansion
– Use of other clustering algorithms to improve performance

Thank you for your attention!

More info:
http://www.db-net.aueb.gr/
http://www.idi.ntnu.no/grupper/db/
