G.Skobeltsyn | Query-Driven Indexing for Scalable P2P Text Retrieval
Query-Driven Indexing for Scalable P2P Text Retrieval
Infoscale’07, June 6-8, 2007, Suzhou, China
Gleb Skobeltsyn, EPFL, Switzerland
June 6, 2007
Joint work with:
• Toan Luu
• Ivana Podnar Žarko
• Martin Rajman
• Karl Aberer
Alvis
DHT
Goal
• Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs)
Each peer:
• Provides resources (bandwidth, storage)
• Searches the whole network
• Publishes its own documents
Naïve (single-term) approach
... is to distribute the global inverted index in a DHT:
[Figure: peers in a DHT, each storing (key K, index entry I) pairs]
Query: “epfl & gleb”
h(“epfl”) → {d1,d2}
h(“gleb”) → {d2,d3}
h(t’) → {d4,d5}
This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
Result: {d1,d2} ∩ {d2,d3} = {d2}
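The naïve scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `NUM_PEERS`, `peer_for` and `NaiveDhtIndex` are hypothetical names, and a modulo over a SHA-1 digest stands in for a real DHT lookup. The point it shows is that every posting list is fetched in full before the intersection is computed.

```python
import hashlib
from typing import Dict, List, Set

NUM_PEERS = 8  # assumed network size, for illustration only


def peer_for(term: str) -> int:
    """Map a term to the peer responsible for it (stand-in for a DHT lookup)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PEERS


class NaiveDhtIndex:
    """Global inverted index spread over peers by hashing single terms."""

    def __init__(self) -> None:
        # One dictionary per peer: term -> full (untruncated) posting list.
        self.peers: List[Dict[str, Set[str]]] = [dict() for _ in range(NUM_PEERS)]

    def publish(self, doc_id: str, terms: List[str]) -> None:
        for term in terms:
            self.peers[peer_for(term)].setdefault(term, set()).add(doc_id)

    def query(self, terms: List[str]) -> Set[str]:
        # Every posting list is fetched in full, then intersected.
        lists = [self.peers[peer_for(t)].get(t, set()) for t in terms]
        return set.intersection(*lists) if lists else set()


index = NaiveDhtIndex()
index.publish("d1", ["epfl"])
index.publish("d2", ["epfl", "gleb"])
index.publish("d3", ["gleb"])
result = index.query(["epfl", "gleb"])
print(result)
```

With long posting lists, the transfer inside `query` is what makes this approach expensive in a distributed setting.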
Indexing with Highly Discriminative Keys
Single-term indexing:
• term 1 → posting list 1, term 2 → posting list 2, ..., term M → posting list M
• Small vocabulary, long posting lists
• Very inefficient for distributed indexing
Highly discriminative key (HDK) indexing:
• PEER 1: key 11 → posting list 11, key 12 → posting list 12, ..., key 1i → posting list 1i
• ...
• PEER N: key N1 → posting list N1, key N2 → posting list N2, ..., key Nj → posting list Nj
• Large vocabulary, short posting lists
• Efficient for distributed indexing
Both approaches achieve the same retrieval quality.
[1] Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer. In ICDE’07, Istanbul, Turkey
Indexing with HDKs: main properties
• Distributed index contains {key, PL} pairs:
– Each key corresponds to a term or a set of terms
– Each key is assigned to a posting list
• Each posting list stores at most DFmax top-ranked document references.
• Data-driven key generation:
– Each time a new document is indexed, the posting list of some key k can reach the maximum size of DFmax
– This triggers the generation of new keys (k + other frequent keys)
• Proximity filter: a document qualifies for a key t1&t2 if t1 is close to t2 (specified by a window size w).
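The two mechanisms on this slide, truncation at DFmax and the proximity filter, can be illustrated with a small Python sketch. It is not the implementation from [1]: the class and function names, the score values, and the tiny `DF_MAX` and `WINDOW` settings are all assumptions chosen for readability. `add` returns a flag that a caller could use as the trigger for generating new keys.

```python
import heapq
from typing import Iterable, List, Tuple

DF_MAX = 3  # assumed, deliberately tiny value of DFmax
WINDOW = 5  # assumed window size w, in token positions


class TruncatedPostingList:
    """Keeps only the df_max best-scored document references for one key."""

    def __init__(self, df_max: int = DF_MAX) -> None:
        self.df_max = df_max
        self._heap: List[Tuple[float, str]] = []  # min-heap ordered by score

    def add(self, doc_id: str, score: float) -> bool:
        """Insert a reference; return True if the list is full afterwards."""
        if len(self._heap) < self.df_max:
            heapq.heappush(self._heap, (score, doc_id))
        else:
            # List is full: the weakest of (old entries + new entry) is dropped.
            heapq.heappushpop(self._heap, (score, doc_id))
        return len(self._heap) >= self.df_max

    def top(self) -> List[str]:
        """Document ids ordered from best to worst score."""
        return [doc for _, doc in sorted(self._heap, reverse=True)]


def passes_proximity_filter(positions_t1: Iterable[int],
                            positions_t2: Iterable[int],
                            window: int = WINDOW) -> bool:
    """True if some occurrence of t1 lies within `window` positions of t2."""
    second = list(positions_t2)
    return any(abs(p1 - p2) <= window for p1 in positions_t1 for p2 in second)


pl = TruncatedPostingList()
full_flags = [pl.add("d1", 0.2), pl.add("d2", 0.9),
              pl.add("d3", 0.5), pl.add("d4", 0.7)]
print(full_flags, pl.top())
```

Once `add` starts returning True the key is no longer discriminative on its own, which is the situation in which the slide says new, longer keys are generated.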
HDK – exhaustive data-driven indexing
• Pros:
– The ICDE’07 paper proves that the number of keys grows linearly
– Elegant key generation mechanism
– Low bandwidth during query processing (PLs of limited size)
• Cons:
– In practice the number of keys is LARGE: 68M for 0.6M docs
– High bandwidth consumption at indexing
• Problem:
– Too many keys are superfluous (almost never used)
Query-Driven Indexing
Let's index only what is queried!
Contents
• Introduction
• HDK approach for indexing
• Query-driven approach for indexing/retrieval
– Indexing structure
– Example
– ONM
– Scalability
– Evaluation
• Conclusion
Query-Driven Index (QDI)
• Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:
– Avoids maintenance of superfluous keys
– Generates only those keys that are requested by users
– Utilizes the query log to discover such keys
• Problems:
– Indexing of a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key => Opportunistic Notification Mechanism (smart broadcast)
– Incomplete index causes degradation of query result quality => show that the degradation is low
Which keys to index?
• Each single term found in the document collection has to be indexed.
– We call the set of all single-term keys the basic single-term index.
– The posting lists are truncated at DFmax.
• A key k is non-superfluous and can be activated iff:
– k is popular: QF(k) ≥ QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFmin is a parameter of our model (popularity filter).
– k contains from 2 to smax terms: 2 ≤ |k| ≤ smax, where smax is a parameter of our model (size filter).
– all immediate sub-keys of k (of size |k|-1) are indexed and their associated posting lists are truncated (redundancy filter).
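The three filters can be read as a single predicate. The Python sketch below is one possible reading, not code from the talk: `can_activate`, the set-based bookkeeping (`indexed`, `truncated`) and the parameter values are assumptions made for illustration. Keys are modelled as frozensets of terms.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set

Key = FrozenSet[str]


def can_activate(key: Key,
                 qf: Dict[Key, int],
                 indexed: Set[Key],
                 truncated: Set[Key],
                 qf_min: int,
                 s_max: int) -> bool:
    """Return True if `key` passes the size, popularity and redundancy filters."""
    # Size filter: 2 <= |k| <= smax.
    if not 2 <= len(key) <= s_max:
        return False
    # Popularity filter: QF(k) >= QFmin.
    if qf.get(key, 0) < qf_min:
        return False
    # Redundancy filter: every sub-key of size |k|-1 is indexed and truncated.
    for sub in combinations(sorted(key), len(key) - 1):
        sub_key = frozenset(sub)
        if sub_key not in indexed or sub_key not in truncated:
            return False
    return True


a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
ab, ac, abc = frozenset("ab"), frozenset("ac"), frozenset("abc")

indexed = {a, b, c}      # basic single-term index
truncated = {a, c}       # assumed: the posting list of b is shorter than DFmax
qf = {ab: 5, ac: 5}      # assumed query frequencies taken from a query log

print(can_activate(ac, qf, indexed, truncated, qf_min=3, s_max=3))
print(can_activate(ab, qf, indexed, truncated, qf_min=3, s_max=3))
```

In this example `ac` can be activated, while `ab` is rejected by the redundancy filter because the full posting list of `b` is already available and a combined key would add nothing.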
QDI: Retrieval
• Single-term index is generated
• Process abc:
1) Probe Pabc
2) Probe Pab, Pbc and Pac
3) Probe Pa, Pb and Pc
4) Obtain top-DFmax results for a, b and c (ranked w.r.t. a, b and c respectively)
5) Contact peers in the list, re-rank the obtained results w.r.t. abc
6) Output top-10
• Increment the QF for ab, bc and ac
• Activate (index) ac
[Figure: query lattice abc / ab, bc, ac / a, b, c. Labels: probes “?abc” answered with “nothing”, “+1” on ab, bc and ac, “DFmax”, “popular”]
QDI: Retrieval 2
• Assume the frequency of b is below DFmax
• Note how the redundancy filter would simplify the lattice in such a case (grayed nodes cannot be activated)
[Figure: query lattice abc / ab, bc, ac / a, b, c with the DFmax threshold marked; the nodes abc, ab and bc are grayed]
QDI: Retrieval 3
• Single-term index is generated and ac is indexed
• Process abc:
1) Probe Pabc
2) Probe Pab, Pbc and Pac – obtain the result for ac
3) Probe Pb and obtain the result for b
4) Contact all peers in the list to re-rank the obtained results w.r.t. abc
5) Output top-10
• Increment the QF for ab, bc and ac
[Figure: query lattice abc / ab, bc, ac / a, b, c. Labels: probes “?abc” answered with “nothing” at abc, ab and bc; “+1” on ab, bc and ac]
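The probing order in the retrieval examples can be sketched as a top-down walk over the query lattice that skips every key already covered by a larger key that was found. The Python below is a simplified model under assumptions: `probe_lattice` is a hypothetical name, the index is a plain dictionary instead of a DHT, and the re-ranking step (contacting the peers that hold the documents) is left out.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple

Key = FrozenSet[str]


def probe_lattice(query_terms: Iterable[str],
                  index: Dict[Key, List[str]]) -> Tuple[List[Key], List[Key]]:
    """Walk the lattice from the full query down to single terms.

    Returns (keys that were probed, keys whose posting lists were obtained).
    """
    probed: List[Key] = []
    found: List[Key] = []
    terms = sorted(set(query_terms))
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            # A key that is a proper subset of an already found key is skipped.
            if any(key < bigger for bigger in found):
                continue
            probed.append(key)
            if key in index:
                found.append(key)
    return probed, found


def label(keys: List[Key]) -> List[str]:
    """Render keys such as frozenset({'a', 'c'}) as 'ac' for printing."""
    return ["".join(sorted(k)) for k in keys]


# Assumed toy index: single terms plus the activated key ac.
index = {frozenset("a"): ["d1"], frozenset("b"): ["d2"],
         frozenset("c"): ["d3"], frozenset("ac"): ["d4"]}
probed, found = probe_lattice("abc", index)
print(label(probed), label(found))
```

With only the single-term index present, the same function probes all seven lattice nodes, which corresponds to the first retrieval example.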
Opportunistic Notification Mechanism
• ONM is used to activate a new multi-term key
• ONM is a “smart” broadcast with the following features:
– It is based on the shower multicast [2]: each peer within a specified range is contacted only once
– Notifications are small and low-priority => piggybacking
– Broadcast is split into several multicast sessions, each time pruning low-score documents
– It uses the high-performance DHT layer [3]
[2] A. Datta, M. Hauswirth, R. Schmidt, R. John, K. Aberer: Range Queries in Tree-Structured Overlays, in P2P’05
[3] F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer: Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS'07
Scalability
• The retrieval traffic is bounded by a constant due to truncated posting lists (depends on DFmax and the query size)
• The indexing traffic depends on the number of keys to be activated:
– The number of keys in the HDK approach (UPPER BOUND) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents
– The number of keys does not depend on the document collection size but only on the size of the query log
– We can use the QFmin parameter to adjust the tradeoff: indexing traffic <-> retrieval quality
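The constant bound on retrieval traffic can be made concrete with a rough worst-case estimate. It is a sketch under assumptions, not a result from the talk: it counts only transferred postings, assumes that every non-empty subset of the query terms is probed, and uses DFmax = 100 purely as an example value.

```latex
\[
  T_{\mathrm{retrieval}}(q) \;\le\; \bigl(2^{|q|} - 1\bigr)\cdot DF_{\max},
  \qquad\text{e.g. } |q| = 3,\ DF_{\max} = 100:\quad (2^{3}-1)\cdot 100 = 700 .
\]
```

Neither factor depends on the size of the document collection, which is why the bound stays constant as the collection grows.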
Overlap experiment
• Use the Wikipedia query log (9M queries/9-10.2004) to build the index
• Randomly choose 3K test queries
• Answer each test query with Google and compare to the union of top-DFmax Google results for each of its combinations that are indexed according to the logs
• This mimics our P2P IR system if Google’s ranking is used
• Example:
[Figure: results of the original query next to the results of its non-superfluous (indexed) combinations; two of the top-5 results are marked X as missing, so overlap@5 = 3/5 = 60%]
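The overlap measure used in this experiment can be written down as a short function. This is a minimal Python sketch of the metric as described on the slide; the function name, argument layout and document ids are made up for illustration and are not taken from the authors' simulator.

```python
from typing import Iterable, Sequence


def overlap_at_k(reference: Sequence[str],
                 partial_results: Iterable[Sequence[str]],
                 k: int,
                 df_max: int) -> float:
    """Fraction of the reference top-k found in the union of truncated lists.

    reference       -- ranked results of the original query
    partial_results -- ranked results of each indexed term combination
    """
    top_k = list(reference[:k])
    if not top_k:
        return 0.0
    pool = set()
    for ranking in partial_results:
        pool.update(ranking[:df_max])  # each list contributes at most df_max docs
    hits = sum(1 for doc in top_k if doc in pool)
    return hits / len(top_k)


reference = ["r1", "r2", "r3", "r4", "r5"]
combos = [["r1", "x1", "r3"], ["r5", "x2"]]
score = overlap_at_k(reference, combos, k=5, df_max=3)
print(f"overlap@5 = {score:.0%}")
```

Here three of the five reference results appear in the union, which reproduces the 3/5 = 60% of the slide's example.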
Overlap example
• Cut-n-paste from the simulation log:
>id=481, q=“what did babe ruth do in the 1920”
“1920 babe ruth”, qf=0 ----> Ov@100= 100%
“1920 babe”, qf=0 ---------> Ov@100= 9%
+++“1920 ruth”, qf=1 ---------> Ov@100= 33%
+++“babe ruth”, qf=495 -------> Ov@100= 69%
---“1920”, qf=716 ------------> Ov@100= 1%
---“babe”, qf=3196 -----------> Ov@100= 2%
---“ruth”, qf=1653 -----------> Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%
[Plot: Overlap with Google]
[Plot: Overlap with Yahoo]
[Plot: Overlap with Google (no/partial/full overlap)]
P2P Index Simulations
• Number of keys depends only on the query log size and QFmin!
• Does not depend on the collection size!
• Number of keys is much smaller than for the HDK approach (68M keys for 650K docs)
Real query logs?
• Wikipedia queries are unrealistic (too skewed) as users know what they want.
• Real web queries might perform worse?
• Large-scale experiments with real web queries and the TREC collection in [4]
[4] Web Text Retrieval with a P2P Query-Driven Index. G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer. To appear in SIGIR’07
Conclusions
• We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:
– Stores posting lists in a DHT for terms and term combinations
– Stores at most DFmax top document references in a posting list
– Efficiently collects the query statistics in a distributed fashion
– Based on these statistics, activates (indexes) only popular keys
– Computes the result of a multi-term query based only on the index entries available at the moment – no costly intersections
• We also showed that:
– With real query logs our approach achieves good retrieval quality
– The QFmin parameter adjusts the traffic/quality tradeoff
Last slide
Thank you for your attention!
Questions?