MPI Informatik 1/17 Oberseminar AG5

Result merging in a Peer-to-Peer Web Search Engine

Speaker: Sergey Chernov
Supervisors: Prof. Gerhard Weikum, Christian Zimmer, Matthias Bender
MPI Informatik 2/17 Oberseminar AG5
Overview
1 Result merging problem
2 Selected result merging methods
3 Efficient index processing
4 Summary
5 References
MPI Informatik 3/17 Oberseminar AG5
Query processing in distributed IR
q – query, P – set of peers,
P’ – subset of peers most “relevant” for q
Ri – ranked result list of Pi, RM – merged result list
[Figure: query processing pipeline. Selection maps <P, q> to <P', q>, choosing peers P1', P2', P3' out of P1...P6; each selected peer Pi' retrieves its ranked list Ri; Merging combines <<R1, R2, R3>, q> into the merged list RM.]
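Read as pseudocode, the pipeline above amounts to roughly the following; this is a minimal Python sketch, where Peer, relevance, and search are illustrative stand-ins, not the engine's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    docs: dict  # doc id -> local similarity score (comparable only within the peer)

    def relevance(self, q):
        # stand-in for a real database selection measure
        return len(self.docs)

    def search(self, q, k):
        # local retrieval: the peer's ranked result list R_i
        return sorted(self.docs.items(), key=lambda x: x[1], reverse=True)[:k]

def process_query(q, peers, k=3, fanout=2):
    # Selection: choose the subset P' of peers most "relevant" for q
    selected = sorted(peers, key=lambda p: p.relevance(q), reverse=True)[:fanout]
    # Retrieval: collect the local ranked lists R_1 ... R_m
    local_lists = [p.search(q, k) for p in selected]
    # Merging: here naively by raw local score; the next slides
    # explain why this step is the hard part
    merged = sorted((d for lst in local_lists for d in lst),
                    key=lambda x: x[1], reverse=True)
    return merged[:k]  # R_M
```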
MPI Informatik 4/17 Oberseminar AG5
Naive merging approaches
How can we combine results from n peers when we need the top-k documents?
1. Retrieve the k documents with the highest local similarity scores. Problem: the scores are incomparable across peers.
2. Fetch the k best documents from each of the n peers, re-rank them, and select the top k out of k*n. Problem: the communication cost is too high.
3. Take the same number, k/n, of documents from each peer in round-robin fashion (sketched below). Problem: some databases contain many highly relevant documents while others contain none.
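A minimal sketch of the round-robin strategy (approach 3); the function is illustrative, not part of the original slides:

```python
from itertools import zip_longest

def round_robin_merge(result_lists, k):
    # Approach 3: take one document from each peer's ranked list in turn
    merged = []
    for rank_slice in zip_longest(*result_lists):
        merged.extend(doc for doc in rank_slice if doc is not None)
    return merged[:k]

# round_robin_merge([["a1", "a2"], ["b1", "b2"], ["c1"]], 4)
# -> ["a1", "b1", "c1", "a2"]
```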
MPI Informatik 5/17 Oberseminar AG5
Merging properties in a Peer-to-Peer Web Search Engine
Heterogeneous databases (local scores incomparable)
High scalability (document fetching is very expensive)
Cooperative environment (statistics available, common search model)
Highly dynamic (no room for learning methods)
Database selection (extra info for score normalization)
MPI Informatik 6/17 Oberseminar AG5
Collection vs. Data fusion

Collection fusion – result merging from disjoint document sets.
Ideal retrieval effectiveness: 100% of the single-collection baseline.

Data fusion – result merging from a single shared document set.
Ideal retrieval effectiveness: > 100% of the single-collection baseline if the rankings are independent (e.g., TF*IDF and PageRank).

Our scenario lies in between, probably closer to collection fusion.

[Figure: three setups – a disjoint document set (P1, P2, P3 with separate DB1, DB2, DB3), a shared document set (P1, P2, P3 over a single DB1), and an overlapping document set (DB1, DB2, DB3 partially shared among P1, P2, P3).]
MPI Informatik 7/17 Oberseminar AG5
Results Merging Problem
Objective: Merge returned documents from multiple sources into a single ranked list.
Difficulty: Local document similarity scores may be incomparable due to differing collection statistics.
Solution: Transform local scores into global ones.
Ideal merging:
Retrieval effectiveness: the same as if all documents were in a single collection.
Efficiency: optimize the retrieval process.
MPI Informatik 8/17 Oberseminar AG5
Global IDF merging (Viles and French [6])
Global IDF: compute global IDFs as follows:

sim = TF * GIDF,   GIDF = log( Σ_{i=1..N} |D_i| / Σ_{i=1..N} DF_i )

where DF_i is the number of documents on peer i that contain the term, and |D_i| is the total number of documents on peer i.

Assumption: document overlap among collections affects all terms proportionally to their IDFs.

Precision: in the disjoint case – 100% of the single-collection baseline; in the overlapping setup – ?
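A minimal sketch of this normalization, assuming the cooperative setting where every peer shares its per-term DF and collection size:

```python
import math

def gidf(df_per_peer, docs_per_peer):
    # GIDF = log( sum_i |D_i| / sum_i DF_i ), computable because
    # per-peer statistics are available in the cooperative setting
    return math.log(sum(docs_per_peer) / sum(df_per_peer))

def merge_score(tf, g):
    return tf * g  # sim = TF * GIDF

# e.g. a term with DF = 10 and 25 on two peers of sizes 1000 and 4000:
# gidf([10, 25], [1000, 4000]) = log(5000 / 35) ~ 4.96
```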
MPI Informatik 9/17 Oberseminar AG5
ICF merging (Callan [1])
Inverted Collection Frequency: replace the IDF value with ICF:

sim = TF * ICF,   ICF = log( |C| / CF )

where CF is the number of peers holding the term and |C| is the number of collections (peers) in the system.

Assumption: ICF is the analogue of IDF in the peer-to-peer setting; an important term occurs on a small number of peers.

Precision: in the disjoint case – ? in the overlapping setup – ?
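The corresponding one-line sketch, mirroring the GIDF example above:

```python
import math

def icf(cf, num_peers):
    # ICF = log( |C| / CF ): a term held by few peers gets a high weight
    return math.log(num_peers / cf)

# sim = TF * ICF; e.g. a term on 4 of 1000 peers: icf(4, 1000) = log(250) ~ 5.52
```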
MPI Informatik 10/17 Oberseminar AG5
CORI-merging (1) (Callan [1])
The constants in the database selection step are heuristics tuned for the INQUERY search engine.

cw – the number of indexing terms in the collection
avg_cw – the average number of indexing terms across collections

DF-based part:
T = DF / (DF + 50 + 150 * cw / avg_cw)

CF-based part:
I = log( (|C| + 0.5) / CF ) / log( |C| + 1.0 )

Database score r_i for the current query term q_i:
r_i = 0.4 + 0.6 * T_i * I_i
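A direct transcription of these formulas as a sketch (argument names are illustrative):

```python
import math

def cori_term_score(df, cf, cw, avg_cw, num_collections):
    # DF-based part
    t = df / (df + 50 + 150 * cw / avg_cw)
    # CF-based part
    i = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    # per-term database score r_i = 0.4 + 0.6 * T * I
    return 0.4 + 0.6 * t * i
```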
MPI Informatik 11/17 Oberseminar AG5
CORI-merging (2) (Callan [1])
Database score R_i of collection i for query q with n terms:

R_i = (1/n) * Σ_{k=1..n} (0.4 + 0.6 * T_k * I_k)

Minimum database score R_min, with T = 0 for each term:

R_min = (1/n) * Σ_{k=1..n} (0.4 + 0.6 * 0 * I_k) = 0.4

Maximum database score R_max, with T = 1 for each term:

R_max = (1/n) * Σ_{k=1..n} (0.4 + 0.6 * 1 * I_k) = 0.4 + 0.6 * (1/n) * Σ_{k=1..n} I_k

Normalized database score R'_i:

R'_i = (R_i - R_min) / (R_max - R_min)

A low-score database can still contribute documents to the top-k.
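The normalization step as a sketch, taking the per-term scores r and the per-term I values computed on the previous slide:

```python
def normalized_db_score(term_scores, i_values):
    # R_i: mean of the per-term scores r = 0.4 + 0.6 * T * I
    n = len(term_scores)
    r = sum(term_scores) / n
    r_min = 0.4                              # every term with T = 0
    r_max = 0.4 + 0.6 * sum(i_values) / n    # every term with T = 1
    return (r - r_min) / (r_max - r_min)     # R'_i
```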
MPI Informatik 12/17 Oberseminar AG5
CORI-merging (3) (Callan [1])
Normalizing document scores by the analogous procedure gives

D' = (D - D_min) / (D_max - D_min)

where D is the local TF*IDF document score; this reduces the effect of different local IDFs.

Score used for merging (both the DF and ICF effects are included in R_i):

sim = (D' + 0.4 * D' * R'_i) / 1.4

Assumption: this is the most successful representative of the methods that combine database scores with local document scores.

Precision: in the disjoint case – 70-100% of the single-collection baseline; in the overlapping setup – 70-100%.
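The final merging step as a sketch, given the normalized document and database scores:

```python
def normalized_doc_score(d, d_min, d_max):
    # D' = (D - D_min) / (D_max - D_min), damping different local IDFs
    return (d - d_min) / (d_max - d_min)

def cori_merged_score(d_norm, r_norm):
    # sim = (D' + 0.4 * D' * R'_i) / 1.4; even documents from a
    # low-score database can still reach the top-k
    return (d_norm + 0.4 * d_norm * r_norm) / 1.4
```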
MPI Informatik 13/17 Oberseminar AG5
Language modeling merging (Si et al. [3])

Obtain “fair” scores by manipulating language models.

Notation: G – all-peers collection, C_i – peer collection, D_ij – document j on peer i, q – query term, Q – query, λ, β – smoothing parameters.

P(Q | C_i, D_ij) = Π_{q ∈ Q} [ λ * P(q | D_ij) + (1 - λ) * P(q | C_i) ]

log P(Q | D_ij) = log P(Q | C_i, D_ij) + log P(C_i | Q)

P(C_i | Q) = P(Q | C_i) * P(C_i) / Σ_i P(Q | C_i) * P(C_i)

Assumptions: linear separation of evidence, correct segmentation of documents across collections.

Precision: equal to or better than CORI.
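A minimal sketch of the two ingredients, assuming every query term occurs somewhere in the peer collection (so no extra smoothing against G is shown) and that term counts are available as plain dicts:

```python
import math

def log_p_query(query, doc_tf, doc_len, coll_tf, coll_len, lam):
    # log P(Q | C_i, D_ij) with linear interpolation smoothing
    logp = 0.0
    for q in query:
        p_doc = doc_tf.get(q, 0) / doc_len
        p_coll = coll_tf.get(q, 0) / coll_len
        logp += math.log(lam * p_doc + (1.0 - lam) * p_coll)
    return logp

def collection_posterior(likelihoods, priors):
    # P(C_i | Q) = P(Q | C_i) P(C_i) / sum_i P(Q | C_i) P(C_i)
    joint = [l * p for l, p in zip(likelihoods, priors)]
    z = sum(joint)
    return [j / z for j in joint]
```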
MPI Informatik 14/17 Oberseminar AG5
Methods summary
Selected methods: Global IDF normalization, ICF normalization, CORI merging, language modeling merging.

Which method is the best? Future experiments will show.

What can we gain from merging in terms of computational efficiency? Reduced index processing cost; let's look at an example.
MPI Informatik 15/17 Oberseminar AG5
Index processing optimization
Query = {A B C}, 2 selected peers, top-10 results needed. Index lists are processed with TA-sorted, which stops when WorstScore_10 > BestScore_cand.
[Figure: the querying peer Pq and two selected peers P1 and P2, each scanning index lists for the terms A, B, and C.]

1. The query {A B C} and the GIDF values are posed to the selected peers.
2. Each peer runs TA-sorted: P1 reaches WorstScore = 0.6, BestScore = 0.8; P2 reaches WorstScore = 0.8, BestScore = 0.9.
3. The WorstScores (0.6 and 0.8) are propagated to the querying peer.
4. The largest WorstScore = 0.8 is returned to the peers as a global threshold: P1 (WorstScore = 0.7, BestScore = 0.79) and P2 (WorstScore = 0.8, BestScore = 0.79) both stop TA-sorted early; this early termination is our prize.
5. Without the shared threshold, the scan would stop only later, when WorstScore = 0.71 exceeds BestScore = 0.7.
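The stop test with threshold sharing, as a sketch of the idea (the function name and signature are illustrative):

```python
def can_stop(local_worst_k, best_candidate, global_threshold=0.0):
    # A peer's TA-sorted scan stops when its threshold exceeds the best
    # possible score of any unseen candidate; sharing the largest
    # WorstScore among peers raises the threshold and stops scans earlier.
    return max(local_worst_k, global_threshold) > best_candidate

# From the figure: alone, P1 stops only once can_stop(0.71, 0.70) is True,
# but with the propagated threshold it already stops at can_stop(0.70, 0.79, 0.8).
```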
MPI Informatik 16/17 Oberseminar AG5
Future work

To-do list:
- Several merging methods must be implemented and evaluated on an experimental testbed.
- The effect of index processing optimization should be investigated on a family of algorithms.

Other issues:
- How can we compute PageRank in a distributed environment?
- How can we incorporate bookmarks in result merging?
- How can we obtain the best combination of similarity + PageRank + bookmarks?
MPI Informatik 17/17 Oberseminar AG5
References
1. Callan, J. P. Distributed Information Retrieval. 2000.
2. Callan, J. P., Lu, Z., Croft, W. B. Searching Distributed Collections with Inference Networks. 1995.
3. Si, L., Jin, R., Callan, J., Ogilvie, P. A Language Modeling Framework for Resource Selection and Results Merging. 2002.
4. Craswell, N. Methods for Distributed Information Retrieval. PhD thesis, 2000.
5. Kirsch, S. T. Distributed search patent. U.S. Patent 5,659,732, 1997.
6. Viles, C. L., French, J. C. Dissemination of Collection Wide Information in a Distributed Information Retrieval System. 1995.