G.Skobeltsyn | Query-Driven Indexing for Scalable P2P Text Retrieval | Infoscale’07, June 6-8, 2007


Query-Driven Indexing for Scalable P2P Text Retrieval

Infoscale’07, June 6-8, 2007, Suzhou, China

Gleb Skobeltsyn, EPFL, Switzerland

June 6, 2007

Joint work with:
• Toan Luu
• Ivana Podnar Žarko
• Martin Rajman
• Karl Aberer


Goal

• Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs)

Each peer:
• Provides resources (bandwidth, storage)
• Searches the whole network
• Publishes its own documents


Naïve (single-term) approach

... is to distribute the global inverted index in a DHT:

[Figure: peers of the DHT, each labelled "K I" and holding a part of the global inverted index]

Query: “epfl & gleb”

• h(“epfl”) → {d1,d2}
• h(“gleb”) → {d2,d3}
• h(t’) → {d4,d5}

Intermediate result: {d1,d2}. Final result after intersection with {d2,d3}: {d2}.

This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
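The naïve scheme above can be sketched in a few lines. This is only a toy model, not the system from the talk: `ToyDHT`, `peer_for` and `NUM_PEERS` are hypothetical names, and a hash modulo the number of peers stands in for real DHT routing.

```python
import hashlib
from collections import defaultdict

NUM_PEERS = 8  # assumed size of the toy network


def peer_for(term: str) -> int:
    """Map a term to the peer responsible for it (stand-in for DHT routing)."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PEERS


class ToyDHT:
    def __init__(self) -> None:
        # One term -> set-of-documents table per peer.
        self.peers = [defaultdict(set) for _ in range(NUM_PEERS)]

    def publish(self, doc_id: str, terms: list) -> None:
        """Add doc_id to the posting list of every term, at the responsible peer."""
        for term in terms:
            self.peers[peer_for(term)][term].add(doc_id)

    def lookup(self, term: str) -> set:
        """Fetch the complete posting list of a single term."""
        return set(self.peers[peer_for(term)].get(term, set()))

    def query_and(self, terms: list) -> set:
        """Fetch every posting list and intersect them."""
        lists = [self.lookup(t) for t in terms]
        return set.intersection(*lists) if lists else set()


dht = ToyDHT()
dht.publish("d1", ["epfl"])
dht.publish("d2", ["epfl", "gleb"])
dht.publish("d3", ["gleb"])
result = dht.query_and(["epfl", "gleb"])
print(sorted(result))  # ['d2']
```

Note that `query_and` moves whole posting lists between peers before intersecting them, which is the cost that grows with the collection size.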

Indexing with Highly Discriminative Keys

• Single-term indexing: small vocabulary, long posting lists (term 1 ... term M); very inefficient for distributed indexing
• Highly discriminative key (HDK) indexing: large vocabulary, short posting lists spread over PEER 1 ... PEER N; efficient for distributed indexing
• Both give the same retrieval quality

[1] I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer: Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys, in ICDE’07, Istanbul, Turkey

Indexing with HDKs: main properties

• Distributed index contains {key, PL} pairs:
– Each key corresponds to a term or a set of terms
– Each key is assigned a posting list (PL)
• Each posting list stores at most DFmax top-ranked document references.
• Data-driven key generation:
– Each time a new document is indexed, the posting list of some key k can reach the maximum size DFmax
– This triggers the generation of new keys (k + other frequent keys)
• Proximity filter: a document qualifies for a key t1&t2 if t1 is close to t2 (as specified by a window size w).
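The properties above can be illustrated with a minimal sketch. All function names are hypothetical, the window is assumed to span w consecutive tokens, and "k + other frequent keys" is simplified to adding one frequent term to a full key; the ICDE’07 paper [1] defines the actual procedure.

```python
def proximity_pairs(tokens, w):
    """Two-term keys whose terms occur within a window of w consecutive tokens."""
    keys = set()
    for i, t1 in enumerate(tokens):
        for t2 in tokens[i + 1 : i + w]:
            if t1 != t2:
                keys.add(tuple(sorted((t1, t2))))
    return keys


def add_posting(posting_list, doc_id, score, df_max):
    """Insert a posting, keep only the df_max best, and report whether the list is full."""
    posting_list.append((score, doc_id))
    posting_list.sort(reverse=True)
    del posting_list[df_max:]
    return len(posting_list) >= df_max


def expand_key(key, frequent_terms):
    """Candidate keys obtained by adding one other frequent term to a full key."""
    return {tuple(sorted(set(key) | {t})) for t in frequent_terms if t not in key}


tokens = "scalable p2p text retrieval with p2p networks".split()
pairs = proximity_pairs(tokens, w=3)

pl = []
add_posting(pl, "d1", 0.9, df_max=2)
add_posting(pl, "d2", 0.5, df_max=2)
full = add_posting(pl, "d3", 0.7, df_max=2)  # list is full, "d2" is dropped
new_keys = expand_key(("p2p",), ["p2p", "text", "retrieval"]) if full else set()
```

Because every list is capped at `df_max`, a key whose list fills up is no longer discriminative on its own, which is what motivates generating longer keys from it.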

HDK – exhaustive data-driven indexing

• Pros:
– The ICDE’07 paper [1] proves that the number of keys grows linearly
– Elegant key generation mechanism
– Low bandwidth during query processing (PLs of limited size)
• Cons:
– In practice the number of keys is LARGE: 68M for 0.6M docs
– High bandwidth consumption at indexing
• Problem:
– Too many keys are superfluous (almost never used)


Query-Driven Indexing

Let's index only what is queried!


Contents

• Introduction
• HDK approach for indexing
• Query-driven approach for indexing/retrieval
– Indexing structure
– Example
– ONM
– Scalability
– Evaluation
• Conclusion


Query-Driven Index (QDI)

• The query-driven indexing strategy solves the “Too-Many-Keys” problem:
– Avoids maintenance of superfluous keys
– Generates only keys that are requested by users
– Uses the query log to discover such keys
• Problems:
– Indexing a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key. Solution: Opportunistic Notification Mechanism (smart broadcast)
– An incomplete index degrades the quality of query results. Answer: show that the degradation is low


Which keys to index?

• Every single term found in the document collection has to be indexed.
– We call the set of all single-term keys the basic single-term index.
– The posting lists are truncated at DFmax.
• A key k is non-superfluous and can be activated iff:
– k is popular: QF(k) ≥ QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFmin is a parameter of our model (popularity filter).
– k contains from 2 to smax terms: 2 ≤ |k| ≤ smax, where smax is a parameter of our model (size filter).
– all immediate sub-keys of k (of size |k|−1) are indexed and their associated posting lists are truncated (redundancy filter).
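The three filters can be written down as one predicate. This is a sketch under stated assumptions: keys are modelled as frozensets of terms, `qf` and `index` are plain dictionaries, and a posting list counts as truncated once it holds `df_max` entries. None of these names come from the talk.

```python
from itertools import combinations


def can_activate(key, qf, index, qf_min, s_max, df_max):
    """Return True if key passes the size, popularity and redundancy filters.

    key:   frozenset of terms
    qf:    dict mapping key -> query frequency from the log
    index: dict mapping key -> posting list currently stored
    """
    # Size filter: 2 <= |k| <= smax.
    if not 2 <= len(key) <= s_max:
        return False
    # Popularity filter: QF(k) >= QFmin.
    if qf.get(key, 0) < qf_min:
        return False
    # Redundancy filter: every sub-key of size |k|-1 is indexed and truncated.
    for sub in combinations(sorted(key), len(key) - 1):
        postings = index.get(frozenset(sub))
        if postings is None or len(postings) < df_max:
            return False
    return True


index = {
    frozenset({"a"}): ["d1", "d2", "d3"],  # truncated at DFmax = 3
    frozenset({"b"}): ["d4"],              # short list, not truncated
    frozenset({"c"}): ["d5", "d6", "d7"],  # truncated at DFmax = 3
}
qf = {frozenset({"a", "c"}): 5, frozenset({"a", "b"}): 5}

ac_ok = can_activate(frozenset({"a", "c"}), qf, index, qf_min=3, s_max=3, df_max=3)
ab_ok = can_activate(frozenset({"a", "b"}), qf, index, qf_min=3, s_max=3, df_max=3)
print(ac_ok, ab_ok)  # True False
```

The key ab is rejected even though it is popular: the complete posting list of b is already available, so indexing ab would add nothing.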


QDI: Retrieval

[Figure: key lattice for the query abc, with abc at the top, ab, bc and ac in the middle, and a, b and c at the bottom. The single-term posting lists are truncated at DFmax. The probes "?abc" for multi-term keys return "nothing"; ab, bc and ac each get +1; ac is marked popular.]

• Single-term index is generated
• Process abc:
1) Probe Pabc
2) Probe Pab, Pbc and Pac
3) Probe Pa, Pb and Pc
4) Obtain top-DFmax results for a, b and c (ranked w.r.t. a, b and c respectively)
5) Contact peers in the list, re-rank the obtained results w.r.t. abc
6) Output top-10
• Increment the QF for ab, bc and ac
• Activate (index) ac

QDI: Retrieval 2

[Figure: the lattice abc / ab, bc, ac / a, b, c with the DFmax threshold marked; the nodes abc, ab and bc are grayed out.]

• Assume the frequency of b is below DFmax
• Note how the redundancy filter simplifies the lattice in such a case: the posting list of b is not truncated, so no key containing b can be activated (grayed nodes)

QDI: Retrieval 3

[Figure: the same lattice with ac indexed. Probes "?abc" that miss return "nothing"; ab, bc and ac each get +1.]

• Single-term index is generated and ac is indexed
• Process abc:
1) Probe Pabc
2) Probe Pab, Pbc and Pac; obtain the result for ac
3) Probe Pb and obtain the result for b
4) Contact all peers in the list to re-rank the obtained results w.r.t. abc
5) Output top-10
• Increment the QF for ab, bc and ac
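The probing order in the retrieval examples can be sketched as a top-down walk of the key lattice. This is an illustration rather than the authors' code: `probe_lattice`, `answer` and the `score` callback are hypothetical, and the re-ranking step, which in the slides requires contacting peers, is reduced to a local function call.

```python
from itertools import combinations


def probe_lattice(query_terms, index):
    """Walk the lattice from the full key down to single terms.

    Returns the indexed keys whose posting lists are used. A key is skipped
    when a larger indexed key already contains all of its terms.
    """
    terms = sorted(set(query_terms))
    used = []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key < found for found in used):
                continue  # covered by a larger key found earlier
            if key in index:
                used.append(key)
    return used


def answer(query_terms, index, score, top_k=10):
    """Merge the posting lists of the used keys and re-rank w.r.t. the full query."""
    candidates = set()
    for key in probe_lattice(query_terms, index):
        candidates.update(index[key])
    ranked = sorted(candidates, key=lambda d: (-score(d, query_terms), d))
    return ranked[:top_k]


# Toy index: all single terms plus the activated key ac.
index = {
    frozenset("a"): ["d1", "d2"],
    frozenset("b"): ["d2", "d3"],
    frozenset("c"): ["d4"],
    frozenset("ac"): ["d5", "d1"],
}
toy_scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d5": 0.7}
used = probe_lattice("abc", index)
top = answer("abc", index, lambda d, q: toy_scores.get(d, 0.0), top_k=3)
print(top)  # ['d2', 'd5', 'd3']
```

With ac indexed, only ac and b contribute posting lists, matching the example where Pa and Pc are no longer probed. The result is a union followed by re-ranking, so no posting-list intersection is needed.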

Opportunistic Notification Mechanism

• ONM is used to activate a new multi-term key
• ONM is a “smart” broadcast with the following features:
– It is based on the shower multicast [2]: each peer within a specified range is contacted only once
– Notifications are small and low-priority => piggybacking
– The broadcast is split into several multicast sessions, each time pruning low-score documents
– It uses the high-performance DHT layer [3]

[2] A. Datta, M. Hauswirth, R. Schmidt, R. John, K. Aberer: Range Queries in Tree-Structured Overlays, in P2P’05

[3] F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer: Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS'07
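The idea of splitting the broadcast into sessions that prune low-score documents can be illustrated with a toy model. This is not the ONM protocol itself: the function below only assumes that each session carries the current k-th best score as a threshold, so peers contacted later report fewer postings. The name `collect_top_k` and the data layout are hypothetical.

```python
import heapq


def collect_top_k(sessions, k):
    """Gather the top-k postings for a new key over several multicast sessions.

    sessions: list of sessions; each session is a list of peers; each peer is
              a dict mapping doc_id -> local score for the new key.
    Returns (top-k postings sorted by score, number of postings reported).
    """
    top = []      # min-heap of (score, doc_id), at most k entries
    reported = 0  # postings actually sent back by peers
    for session in sessions:
        # The threshold is fixed when the session starts and sent with the notification.
        threshold = top[0][0] if len(top) == k else float("-inf")
        for peer in session:
            for doc_id, score in peer.items():
                if score <= threshold:
                    continue  # pruned at the peer: cannot enter the top-k
                reported += 1
                if len(top) < k:
                    heapq.heappush(top, (score, doc_id))
                elif score > top[0][0]:
                    heapq.heapreplace(top, (score, doc_id))
    return sorted(top, reverse=True), reported


sessions = [
    [{"d1": 0.9, "d2": 0.1}, {"d3": 0.8}],
    [{"d4": 0.5, "d5": 0.85}, {"d6": 0.2}],
]
top, reported = collect_top_k(sessions, k=2)
print(top, reported)  # [(0.9, 'd1'), (0.85, 'd5')] 4
```

In this toy run only 4 of the 6 postings are sent, because the second session already knows that a score of 0.8 or less cannot make the top-2.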


Scalability

• The retrieval traffic is bounded by a constant due to truncated posting lists (the bound depends on DFmax and the query size)
• The indexing traffic depends on the number of keys to be activated.
– The number of keys in the HDK approach (an UPPER BOUND) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents
– The number of keys does not depend on the document collection size but only on the size of the query log
– We can use the QFmin parameter to adjust the tradeoff: indexing traffic <-> retrieval quality
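A small sketch shows why the number of keys is driven by the query log and QFmin alone. It assumes that QF(k) counts the logged queries containing all terms of k and it ignores the redundancy filter, so it yields an upper estimate; `candidate_keys` and the sample log are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def candidate_keys(query_log, qf_min, s_max):
    """Multi-term keys (2..s_max terms) that occur in at least qf_min logged queries."""
    qf = Counter()
    for query in query_log:
        terms = sorted(set(query.split()))
        for size in range(2, min(s_max, len(terms)) + 1):
            for combo in combinations(terms, size):
                qf[combo] += 1
    return {key for key, count in qf.items() if count >= qf_min}


log = ["babe ruth", "babe ruth 1920", "babe ruth stats", "1920 ruth"]
sizes = {qf_min: len(candidate_keys(log, qf_min, s_max=3)) for qf_min in (1, 2, 3)}
print(sizes)  # {1: 7, 2: 2, 3: 1}
```

The document collection never appears in the computation, and raising `qf_min` shrinks the key set, which lowers indexing traffic at the cost of answering fewer queries from multi-term keys.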



Overlap experiment

• Use the Wikipedia query log (9M queries, 9-10.2004) to build the index
• Randomly choose 3K test queries
• Answer each test query with Google and compare the result to the union of the top-DFmax Google results for each of its combinations that are indexed according to the logs.
• This mimics our P2P IR system if Google’s ranking is used.
• Example:

[Figure: the result list of the original query next to the results of its non-superfluous (indexed) combinations; two results are marked X.]

overlap@5 = 3/5 = 60%
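The overlap measure can be computed as below. This is a minimal sketch: the function names and the placeholder result identifiers are assumptions, and the search-engine calls are replaced by ready-made lists.

```python
def combined_results(per_key_results, df_max):
    """Union of the top-df_max results of every indexed term combination."""
    union = set()
    for results in per_key_results.values():
        union.update(results[:df_max])
    return union


def overlap_at_k(reference, combined, k):
    """Fraction of the reference top-k that also appears in the combined set."""
    top = reference[:k]
    if not top:
        return 0.0
    hits = sum(1 for doc in top if doc in combined)
    return hits / len(top)


# Placeholder data: the reference ranking for the original query and the
# result lists of two indexed combinations.
reference = ["u1", "u2", "u3", "u4", "u5"]
per_key = {"key one": ["u1", "u9", "u3"], "key two": ["u5", "u8"]}
combined = combined_results(per_key, df_max=3)
ov = overlap_at_k(reference, combined, k=5)
print(f"overlap@5 = {ov:.0%}")  # overlap@5 = 60%
```

Three of the five reference results are reachable through the indexed combinations, which reproduces the 3/5 = 60% of the example.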


Overlap example

• Cut-n-paste from the simulation log:

>id=481, q=“what did babe ruth do in the 1920”
   “1920 babe ruth”, qf=0 ----> Ov@100= 100%
   “1920 babe”, qf=0 ---------> Ov@100= 9%
+++“1920 ruth”, qf=1 ---------> Ov@100= 33%
+++“babe ruth”, qf=495 -------> Ov@100= 69%
---“1920”, qf=716 ------------> Ov@100= 1%
---“babe”, qf=3196 -----------> Ov@100= 2%
---“ruth”, qf=1653 -----------> Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%

Overlap with Google

Overlap with Yahoo

Overlap with Google (no/partial/full overlap)


P2P Index Simulations

• The number of keys depends only on the query log size and QFmin!
• It does not depend on the collection size!
• The number of keys is much smaller than for the HDK approach: 68M keys for 650K docs

Real query logs?

• Wikipedia queries are unrealistic (too skewed) as users know what they want.
• Real web queries might perform worse?
• Large-scale experiments with real web queries and the TREC collection in [4]

[4] G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer: Web Text Retrieval with a P2P Query-Driven Index, to appear in SIGIR’07

Conclusions

• We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:
– Stores posting lists in a DHT for terms and term combinations
– Stores at most DFmax top document references in a posting list
– Efficiently collects the query statistics in a distributed fashion
– Based on these statistics, activates (indexes) only popular keys
– Computes the result of a multi-term query based only on the index entries available at the moment: no costly intersections
• We also showed that:
– With real query logs our approach achieves good retrieval quality
– The QFmin parameter adjusts the traffic/quality tradeoff



Thank you for your attention!
Questions?
