MPI Informatik 1/17 Oberseminar AG5

Result merging in a Peer-to-Peer Web Search Engine

Speaker: Sergey Chernov
Supervisors: Prof. Gerhard Weikum, Christian Zimmer, Matthias Bender
MPI Informatik 2/17 Oberseminar AG5
Overview
1 Result merging problem
2 Selected result merging methods
3 Efficient index processing
4 Summary
5 References
MPI Informatik 3/17 Oberseminar AG5
Query processing in distributed IR
q – query, P – set of peers,
P’ – subset of peers most “relevant” for q
Ri – ranked result list of Pi, RM – merged result list
[Figure: query processing pipeline. Selection maps <P, q> to <P', q>, choosing peers P1', P2', P3' out of P1...P6; each selected peer Pi' retrieves its ranked list Ri; Merging combines <<R1, R2, R3>, q> into the merged list RM.]
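Read as pseudocode, the pipeline above amounts to roughly the following; this is a minimal Python sketch, where Peer, relevance, and search are illustrative stand-ins, not the engine's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    docs: dict  # doc id -> local similarity score (comparable only within the peer)

    def relevance(self, q):
        # stand-in for a real database selection measure
        return len(self.docs)

    def search(self, q, k):
        # local retrieval: the peer's ranked result list R_i
        return sorted(self.docs.items(), key=lambda x: x[1], reverse=True)[:k]

def process_query(q, peers, k=3, fanout=2):
    # Selection: choose the subset P' of peers most "relevant" for q
    selected = sorted(peers, key=lambda p: p.relevance(q), reverse=True)[:fanout]
    # Retrieval: collect the local ranked lists R_1 ... R_m
    local_lists = [p.search(q, k) for p in selected]
    # Merging: here naively by raw local score; the next slides
    # explain why this step is the hard part
    merged = sorted((d for lst in local_lists for d in lst),
                    key=lambda x: x[1], reverse=True)
    return merged[:k]  # R_M
```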
MPI Informatik 4/17 Oberseminar AG5
Naive merging approaches
How can we combine results from n peers when we need the top-k documents?
1. Retrieve the k documents with the highest local similarity scores. Problem: the scores are incomparable across peers.
2. Fetch the k best documents from each of the n peers, re-rank them, and select the top k out of k*n. Problem: the communication cost is too high.
3. Take the same number, k/n, of documents from each peer in round-robin fashion (sketched below). Problem: some databases contain many highly relevant documents while others contain none.
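A minimal sketch of the round-robin strategy (approach 3); the function is illustrative, not part of the original slides:

```python
from itertools import zip_longest

def round_robin_merge(result_lists, k):
    # Approach 3: take one document from each peer's ranked list in turn
    merged = []
    for rank_slice in zip_longest(*result_lists):
        merged.extend(doc for doc in rank_slice if doc is not None)
    return merged[:k]

# round_robin_merge([["a1", "a2"], ["b1", "b2"], ["c1"]], 4)
# -> ["a1", "b1", "c1", "a2"]
```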
MPI Informatik 5/17 Oberseminar AG5
Merging properties in a Peer-to-Peer Web Search Engine
Heterogeneous databases (local scores incomparable)
High scalability (document fetching is very expensive)
Cooperative environment (statistics available, common search model)
Highly dynamic (no room for learning methods)
Database selection (extra info for score normalization)
MPI Informatik 6/17 Oberseminar AG5
Collection vs. Data fusion

Collection fusion – result merging from disjoint document sets.
Ideal retrieval effectiveness: 100% of the single-collection baseline.

Data fusion – result merging from a single shared document set.
Ideal retrieval effectiveness: > 100% of the single-collection baseline if the rankings are independent (e.g., TF*IDF and PageRank).

Our scenario lies in between, probably closer to collection fusion.

[Figure: three setups – a disjoint document set (P1, P2, P3 with separate DB1, DB2, DB3), a shared document set (P1, P2, P3 over a single DB1), and an overlapping document set (DB1, DB2, DB3 partially shared among P1, P2, P3).]
MPI Informatik 7/17 Oberseminar AG5
Results Merging Problem
Objective: Merge returned documents from multiple sources into a single ranked list.
Difficulty: Local document similarity scores may be incomparable due to differing collection statistics.
Solution: Transform local scores into global ones.
Ideal merging:
Retrieval effectiveness: the same as if all documents were in a single collection.
Efficiency: optimize the retrieval process.
MPI Informatik 8/17 Oberseminar AG5
Global IDF merging (Viles and French [6])
Global IDF: compute global IDFs as follows:

sim = TF * GIDF,   GIDF = log( Σ_{i=1..N} |D_i| / Σ_{i=1..N} DF_i )

where DF_i is the number of documents on peer i that contain the term, and |D_i| is the total number of documents on peer i.

Assumption: document overlap among collections affects all terms proportionally to their IDFs.

Precision: in the disjoint case – 100% of the single-collection baseline; in the overlapping setup – ?
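A minimal sketch of this normalization, assuming the cooperative setting where every peer shares its per-term DF and collection size:

```python
import math

def gidf(df_per_peer, docs_per_peer):
    # GIDF = log( sum_i |D_i| / sum_i DF_i ), computable because
    # per-peer statistics are available in the cooperative setting
    return math.log(sum(docs_per_peer) / sum(df_per_peer))

def merge_score(tf, g):
    return tf * g  # sim = TF * GIDF

# e.g. a term with DF = 10 and 25 on two peers of sizes 1000 and 4000:
# gidf([10, 25], [1000, 4000]) = log(5000 / 35) ~ 4.96
```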
MPI Informatik 9/17 Oberseminar AG5
ICF merging (Callan [1])
Inverted Collection Frequency: replace the IDF value with ICF:

sim = TF * ICF,   ICF = log( |C| / CF )

where CF is the number of peers holding the term and |C| is the number of collections (peers) in the system.

Assumption: ICF is the analogue of IDF in the peer-to-peer setting; an important term occurs on a small number of peers.

Precision: in the disjoint case – ? in the overlapping setup – ?
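The corresponding one-line sketch, mirroring the GIDF example above:

```python
import math

def icf(cf, num_peers):
    # ICF = log( |C| / CF ): a term held by few peers gets a high weight
    return math.log(num_peers / cf)

# sim = TF * ICF; e.g. a term on 4 of 1000 peers: icf(4, 1000) = log(250) ~ 5.52
```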
MPI Informatik 10/17 Oberseminar AG5
CORI-merging (1) (Callan [1])
The constants in the database selection step are heuristics tuned for the INQUERY search engine.

cw – the number of indexing terms in the collection
avg_cw – the average number of indexing terms across collections

DF-based part:
T = DF / (DF + 50 + 150 * cw / avg_cw)

CF-based part:
I = log( (|C| + 0.5) / CF ) / log( |C| + 1.0 )

Database score r_i for the current query term q_i:
r_i = 0.4 + 0.6 * T_i * I_i
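A direct transcription of these formulas as a sketch (argument names are illustrative):

```python
import math

def cori_term_score(df, cf, cw, avg_cw, num_collections):
    # DF-based part
    t = df / (df + 50 + 150 * cw / avg_cw)
    # CF-based part
    i = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    # per-term database score r_i = 0.4 + 0.6 * T * I
    return 0.4 + 0.6 * t * i
```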
MPI Informatik 11/17 Oberseminar AG5
CORI-merging (2) (Callan [1])
Database score R_i of collection i for query q with n terms:

R_i = (1/n) * Σ_{k=1..n} (0.4 + 0.6 * T_k * I_k)

Minimum database score R_min, with T = 0 for each term:

R_min = (1/n) * Σ_{k=1..n} (0.4 + 0.6 * 0 * I_k) = 0.4

Maximum database score R_max, with T = 1 for each term:

R_max = (1/n) * Σ_{k=1..n} (0.4 + 0.6 * 1 * I_k) = 0.4 + 0.6 * (1/n) * Σ_{k=1..n} I_k

Normalized database score R'_i:

R'_i = (R_i - R_min) / (R_max - R_min)

A low-score database can still contribute documents to the top-k.
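The normalization step as a sketch, taking the per-term scores r and the per-term I values computed on the previous slide:

```python
def normalized_db_score(term_scores, i_values):
    # R_i: mean of the per-term scores r = 0.4 + 0.6 * T * I
    n = len(term_scores)
    r = sum(term_scores) / n
    r_min = 0.4                              # every term with T = 0
    r_max = 0.4 + 0.6 * sum(i_values) / n    # every term with T = 1
    return (r - r_min) / (r_max - r_min)     # R'_i
```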
MPI Informatik 12/17 Oberseminar AG5
CORI-merging (3) (Callan [1])
Normalizing document scores by the analogous procedure gives

D' = (D - D_min) / (D_max - D_min)

where D is the local TF*IDF document score; this reduces the effect of different local IDFs.

Score used for merging (both the DF and ICF effects are included in R_i):

sim = (D' + 0.4 * D' * R'_i) / 1.4

Assumption: this is the most successful representative of the methods that combine database scores with local document scores.

Precision: in the disjoint case – 70-100% of the single-collection baseline; in the overlapping setup – 70-100%.
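The final merging step as a sketch, given the normalized document and database scores:

```python
def normalized_doc_score(d, d_min, d_max):
    # D' = (D - D_min) / (D_max - D_min), damping different local IDFs
    return (d - d_min) / (d_max - d_min)

def cori_merged_score(d_norm, r_norm):
    # sim = (D' + 0.4 * D' * R'_i) / 1.4; even documents from a
    # low-score database can still reach the top-k
    return (d_norm + 0.4 * d_norm * r_norm) / 1.4
```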
MPI Informatik 13/17 Oberseminar AG5
Language modeling merging (Si et al. [3])

Obtain “fair” scores by manipulating language models.

Notation: G – all-peers collection, C_i – peer collection, D_ij – document j on peer i, q – query term, Q – query, λ, β – smoothing parameters.

P(Q | C_i, D_ij) = Π_{q ∈ Q} [ λ * P(q | D_ij) + (1 - λ) * P(q | C_i) ]

log P(Q | D_ij) = log P(Q | C_i, D_ij) + log P(C_i | Q)

P(C_i | Q) = P(Q | C_i) * P(C_i) / Σ_i P(Q | C_i) * P(C_i)

Assumptions: linear separation of evidence, correct segmentation of documents across collections.

Precision: equal to or better than CORI.
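A minimal sketch of the two ingredients, assuming every query term occurs somewhere in the peer collection (so no extra smoothing against G is shown) and that term counts are available as plain dicts:

```python
import math

def log_p_query(query, doc_tf, doc_len, coll_tf, coll_len, lam):
    # log P(Q | C_i, D_ij) with linear interpolation smoothing
    logp = 0.0
    for q in query:
        p_doc = doc_tf.get(q, 0) / doc_len
        p_coll = coll_tf.get(q, 0) / coll_len
        logp += math.log(lam * p_doc + (1.0 - lam) * p_coll)
    return logp

def collection_posterior(likelihoods, priors):
    # P(C_i | Q) = P(Q | C_i) P(C_i) / sum_i P(Q | C_i) P(C_i)
    joint = [l * p for l, p in zip(likelihoods, priors)]
    z = sum(joint)
    return [j / z for j in joint]
```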
MPI Informatik 14/17 Oberseminar AG5
Methods summary
Selected methods: Global IDF normalization, ICF normalization, CORI merging, language modeling merging.

Which method is the best? Future experiments will show.

What can we gain from merging in terms of computational efficiency? Reduced index processing cost; let's look at an example.
MPI Informatik 15/17 Oberseminar AG5
Index processing optimization
Query = {A B C}, 2 selected peers, top-10 results needed. Index lists are processed with TA-sorted, which stops when WorstScore_10 > BestScore_cand.
[Figure: the querying peer Pq and two selected peers P1 and P2, each scanning index lists for the terms A, B, and C.]

1. The query {A B C} and the GIDF values are posed to the selected peers.
2. Each peer runs TA-sorted: P1 reaches WorstScore = 0.6, BestScore = 0.8; P2 reaches WorstScore = 0.8, BestScore = 0.9.
3. The WorstScores (0.6 and 0.8) are propagated to the querying peer.
4. The largest WorstScore = 0.8 is returned to the peers as a global threshold: P1 (WorstScore = 0.7, BestScore = 0.79) and P2 (WorstScore = 0.8, BestScore = 0.79) both stop TA-sorted early; this early termination is our prize.
5. Without the shared threshold, the scan would stop only later, when WorstScore = 0.71 exceeds BestScore = 0.7.
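The stop test with threshold sharing, as a sketch of the idea (the function name and signature are illustrative):

```python
def can_stop(local_worst_k, best_candidate, global_threshold=0.0):
    # A peer's TA-sorted scan stops when its threshold exceeds the best
    # possible score of any unseen candidate; sharing the largest
    # WorstScore among peers raises the threshold and stops scans earlier.
    return max(local_worst_k, global_threshold) > best_candidate

# From the figure: alone, P1 stops only once can_stop(0.71, 0.70) is True,
# but with the propagated threshold it already stops at can_stop(0.70, 0.79, 0.8).
```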
MPI Informatik 16/17 Oberseminar AG5
Future work

To-do list:
- Several merging methods must be implemented and evaluated on an experimental testbed.
- The effect of index processing optimization should be investigated on a family of algorithms.

Other issues:
- How can we compute PageRank in a distributed environment?
- How can we incorporate bookmarks in result merging?
- How can we obtain the best combination of similarity + PageRank + bookmarks?
MPI Informatik 17/17 Oberseminar AG5
References
1. Callan, J. P. Distributed Information Retrieval. 2000.
2. Callan, J. P., Lu, Z., Croft, W. B. Searching Distributed Collections with Inference Networks. 1995.
3. Si, L., Jin, R., Callan, J., Ogilvie, P. A Language Modeling Framework for Resource Selection and Results Merging. 2002.
4. Craswell, N. Methods for Distributed Information Retrieval. PhD thesis, 2000.
5. Kirsch, S. T. Distributed search patent. U.S. Patent 5,659,732, 1997.
6. Viles, C. L., French, J. C. Dissemination of Collection Wide Information in a Distributed Information Retrieval System. 1995.