A Search Engine Architecture Based on Collection Selection
Diego Puppin
University of Pisa, Italy
Supervisors: D. Laforenza, M. Vanneschi
Introduction
September 2007 University of Pisa
Diego Puppin, A Search Engine Architecture Based on Collection Selection
Motivations
The Web is getting bigger and bigger, and users are more and more picky!
Precise results are needed very fast
The index is growing, due to added pages and advanced indexing
Big IR problems for Web, book, and multimedia search engines
Motivations (2)
There is a need for new solutions, able to give high-quality results with reduced computing load
Parallel computing looks like the most natural choice to help algorithms face this growth rate [Baeza-Yates et al., 2007a]
Billions of pages and data available (several TB): the index is still very big (about 5X the collection size)
New approaches to partitioning are key to the next phase
Parallel (Distributed) IRSs
Term vs Doc partitioning
Term vs Doc partitioning
Term partitioning: reduced computing load
  Only the servers holding the relevant terms are queried
  Problems of load balancing
  Heavier communication patterns
Document partitioning: better balancing, but all documents are scanned
How to reduce the load with document partitioning?
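The contrast between the two schemes can be illustrated with a toy sketch. The four-document collection, the round-robin assignment, and all function names below are hypothetical and chosen only for illustration; they are not the setup used in the experiments.

```python
from collections import defaultdict

# Toy collection (illustrative data, not from the experiments).
docs = {
    0: "web search engine",
    1: "parallel search index",
    2: "web index cache",
    3: "query cache engine",
}
NUM_SERVERS = 2

def doc_partition(docs, n):
    """Each server indexes a disjoint subset of the documents (round-robin here)."""
    servers = [defaultdict(set) for _ in range(n)]
    for doc_id, text in docs.items():
        for term in text.split():
            servers[doc_id % n][term].add(doc_id)
    return servers

def term_partition(docs, n):
    """Each server holds the complete posting lists of a subset of the terms."""
    servers = [defaultdict(set) for _ in range(n)]
    vocabulary = sorted({t for text in docs.values() for t in text.split()})
    owner = {term: i % n for i, term in enumerate(vocabulary)}
    for doc_id, text in docs.items():
        for term in text.split():
            servers[owner[term]][term].add(doc_id)
    return servers

def servers_holding(servers, query):
    """Indices of the servers that hold at least one posting for a query term."""
    return [i for i, s in enumerate(servers) if any(t in s for t in query.split())]

doc_servers = doc_partition(docs, NUM_SERVERS)
term_servers = term_partition(docs, NUM_SERVERS)
print(servers_holding(term_servers, "cache"))  # only the owner of the term
print(servers_holding(doc_servers, "cache"))   # matches are spread over servers
```

With term partitioning the query touches only the server that owns the term. With document partitioning the matching documents may sit on any server, so a broker without further knowledge has to poll all of them; collection selection aims to avoid exactly that.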
Main contributions
1. Query-vector document model
   More efficient for partitioning and selection (co-clustering and PCAP)
2. Load-driven routing
   Makes better use of the available load
   Based on the actual load of the system
3. Incremental caching
   Improves throughput AND quality
Acknowledgments
Fabrizio Silvestri
Raffaele Perego
Ricardo Baeza-Yates
Abdur Chowdhury, Ophir Frieder, Gerhard Weikum, and the various reviewers…
Other contributions
More compact collection representation
  1/5 the size of CORI's, while outperforming it
A way to select documents (50%) to move out of the index
  The documents in the supplemental index contribute to only 3% of the top results
A simple way to update the index in a document-partitioned system
Extended simulation
  6 M documents, 800k test queries, real computing costs, several configurations tested
Reviewers’ Requests: Frieder
More detailed discussion of the co-clustering algorithm
Improved cost scheme
Experiments to be extended in the future
Reviewers’ Requests: Weikum
Improved description of the pipelined term-partitioned IR system
Improved description of co-clustering
Better definition of shingles
New realistic cost model
Deeper discussion of cache and silent documents
How to Improve Partitions
Partitioning Strategy
[Figure: a document collection is split into partitions (labelled p1, p2, pp) by a partitioning strategy]
Possible strategies:
  Random
  Content-based (e.g. k-means, link-based clustering)
  Usage-based
The QV Model
Co-clustering
[Figure: query-document matrix (rows: queries, columns: documents), shown before and after co-clustering reorders its rows and columns]
Cell legend: document j is returned in answer to query i / document j is not relevant to query i.
Block legend: query cluster, document cluster.
Each document cluster corresponds to a different partition. In this case three partitions are generated.
For each query cluster, a vocabulary is built out of all the distinct query terms of the queries in the cluster.
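As a rough illustration of the matrix that co-clustering works on, the following sketch builds a query-document matrix from the results returned for a set of training queries. The input format, the function name, and the sample data are assumptions made for this example only.

```python
def build_qv_matrix(results_by_query):
    """Build a query-document matrix from training results.

    results_by_query maps a query string to a list of (doc_id, score) pairs.
    Rows are queries, columns are documents, and an entry holds the score of
    the document for that query (0.0 when the document was not returned).
    """
    queries = sorted(results_by_query)
    docs = sorted({d for hits in results_by_query.values() for d, _ in hits})
    column = {d: j for j, d in enumerate(docs)}
    matrix = [[0.0] * len(docs) for _ in queries]
    for i, q in enumerate(queries):
        for doc_id, score in results_by_query[q]:
            matrix[i][column[doc_id]] = score
    return queries, docs, matrix

# Hypothetical training data: two queries, two documents.
training = {
    "a b": [("d1", 0.9), ("d2", 0.4)],
    "c": [("d2", 0.7)],
}
queries, docs, matrix = build_qv_matrix(training)
```

A document that no training query ever returns gets no column in this matrix, so it cannot be placed by co-clustering; such documents have to be handled separately.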
Theoretical Model of Co-clustering
The algorithm we use [Dhillon et al., 2003] finds the clustering that minimizes the loss of information between the original matrix and the clustered matrix (given the number of row and column clusters)
Efficient implementation, very robust solution
  Stable with respect to test period, number of clusters, training set used, matrix model (scores, boolean, repeated)
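For reference, the objective of information-theoretic co-clustering can be written as follows. The notation is ours: X ranges over the rows (queries), Y over the columns (documents), and C_X, C_Y are the row and column clusterings with a fixed number of clusters.

```latex
\min_{C_X,\, C_Y} \; I(X;Y) - I(\hat{X};\hat{Y}),
\qquad \hat{X} = C_X(X), \quad \hat{Y} = C_Y(Y)

% Equivalent form: the loss in mutual information is a KL divergence
I(X;Y) - I(\hat{X};\hat{Y}) = D_{\mathrm{KL}}\bigl(p(X,Y) \,\|\, q(X,Y)\bigr),
\qquad q(x,y) = p(\hat{x},\hat{y})\, p(x \mid \hat{x})\, p(y \mid \hat{y})
```

Here p(X,Y) is the normalized query-document matrix, and q is its approximation induced by the co-clustering.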
QV for Collection Selection
[Figure labels: Query, Query clusters, Document clusters]
Partitions are ranked according to their relevance to the query
We called this strategy PCAP
PCAP collection selection
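The slide does not spell out the scoring formula, so the following is only one plausible sketch of a PCAP-style ranking: the query is first scored against the vocabulary of each query cluster, and those scores are then propagated to the document clusters through the weights of the co-clustered matrix. The term-weight lookup, the block-weight matrix, and all names are assumptions for illustration.

```python
def pcap_rank(query, cluster_vocab, block_weight):
    """Rank document clusters for a query (illustrative PCAP-style scoring).

    cluster_vocab[i] maps a term to its weight in query cluster i.
    block_weight[i][j] is the weight of the co-cluster block that links
    query cluster i to document cluster j.
    """
    terms = query.split()
    # Step 1: score the query against each query cluster's vocabulary.
    qc_scores = [sum(vocab.get(t, 0.0) for t in terms) for vocab in cluster_vocab]
    # Step 2: propagate query-cluster scores to the document clusters.
    n_doc_clusters = len(block_weight[0])
    dc_scores = [
        sum(qc * row[j] for qc, row in zip(qc_scores, block_weight))
        for j in range(n_doc_clusters)
    ]
    ranking = sorted(range(n_doc_clusters), key=lambda j: dc_scores[j], reverse=True)
    return ranking, dc_scores

# Hypothetical model: 2 query clusters, 3 document clusters.
vocab = [{"car": 1.0, "rent": 0.5}, {"pizza": 1.0}]
weights = [[0.8, 0.2, 0.0],
           [0.1, 0.1, 0.8]]
ranking, scores = pcap_rank("car rent", vocab, weights)
```

The broker would then poll the document clusters in the order given by `ranking`, stopping after the first k.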
Experimental Settings
Experiments were carried out using:
  WBR99: 5,939,061 documents; 22 GB of uncompressed text. Snapshot of the Brazilian Web (domain .br) back in 1999.
  A query log from todobr.com covering the period Jan-Oct 2003.
  Zettair as the IR core
Training: 190,000 queries. Test: 800,000 queries.
We created 16 + 1 document clusters and 128 query clusters.
Model tested on the following week (the fourth week).
Metrics used:
  Intersection: percentage of relevant results returned using only k servers out of 16+1 (from [Puppin et al., 2006]).
  Competitive similarity: percentage of relevance score obtained using only k servers out of 16+1 (adapted from [Chierichetti et al., 2007]).
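A minimal sketch of how the two metrics could be computed for one query, assuming the reference ranking comes from the full index and the retrieved set comes from the k selected servers. The function names and sample values are hypothetical; note that the result tables report intersection as a count of results rather than a percentage, which is what this sketch returns.

```python
def intersection_at_n(reference_ranking, retrieved_docs, n):
    """Number of the top-n reference results found among the retrieved documents."""
    return len(set(reference_ranking[:n]) & set(retrieved_docs))

def competitive_similarity_at_n(reference_scores, retrieved_scores, n):
    """Total score of the best n retrieved results, relative to the best n reference results."""
    best_reference = sorted(reference_scores, reverse=True)[:n]
    best_retrieved = sorted(retrieved_scores, reverse=True)[:n]
    return sum(best_retrieved) / sum(best_reference)

# Hypothetical example: the full index ranks d1..d5; the polled servers return d1, d3, d9.
reference = ["d1", "d2", "d3", "d4", "d5"]
retrieved = ["d1", "d3", "d9"]
print(intersection_at_n(reference, retrieved, 5))
print(competitive_similarity_at_n([5, 4, 3, 2, 1], [5, 3, 0.5], 3))
```

Competitive similarity is more forgiving than intersection: a retrieved document outside the reference top n still contributes its score.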
Quality Metrics
Very Effective Partitioning and Selection
CORI on Random Partitioning
Servers polled          1      2      4      8     16     17
Intersection at 5    0.30   0.57   1.27   2.62   4.60   5.00
Intersection at 10   0.59   1.16   2.55   5.00   9.30  10.00
Intersection at 20   1.20   2.49   5.04   9.77  18.71  20.00

CORI on QV Partitioning
Servers polled          1      2      4      8     16     17
Intersection at 5    1.55   2.29   3.01   3.83   4.89   5.00
Intersection at 10   3.05   4.48   5.92   7.62   9.77  10.00
Intersection at 20   5.97   8.77  11.61  15.10  19.54  20.00

PCAP on QV Partitioning
Servers polled          1      2      4      8     16     17
Intersection at 5    1.73   2.26   2.89   3.76   4.84   5.00
Intersection at 10   3.47   4.51   5.75   7.50   9.66  10.00
Intersection at 20   6.92   9.02  11.47  14.98  19.29  20.00
With random partitioning, CORI performs really badly: almost equal to relevant/N_clusters, e.g. 5/17 ≈ 0.29 ~ 0.30.
CORI on QV performs about 5.2 times better than CORI on random (intersection at 5, one server polled).
PCAP on QV performs about 5.8 times better than CORI on random.
PCAP on QV performs about 1.1 times better than CORI on QV.
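The ratios quoted above can be checked against the intersection-at-5 values for a single polled server (0.30, 1.55, and 1.73 in the three tables). The variable names are ours.

```python
# Intersection at 5 with one server polled, as reported in the tables.
cori_random = 0.30
cori_qv = 1.55
pcap_qv = 1.73

# Expected intersection at 5 when one of 17 partitions is picked blindly.
random_baseline = 5 / 17

print(f"random baseline:        {random_baseline:.2f}")
print(f"CORI/QV vs CORI/random: {cori_qv / cori_random:.1f}x")
print(f"PCAP/QV vs CORI/random: {pcap_qv / cori_random:.1f}x")
print(f"PCAP/QV vs CORI/QV:     {pcap_qv / cori_qv:.1f}x")
```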
Strengths
Popular queries drive the distribution
Low-dimensional space to represent documents
More efficient collection representation
QV may be built while answering queries
Weaknesses
Dependent on the training set
  Actually… NOT!
Cannot manage new query terms
  Very small fraction, CORI does not help
  Incremental caching can help
Collection selection dependent on assignment
  But adding documents does not break performance
Issues with Load Distribution
Load Balancing
[Figure: Peak Load on Each IR Core; x-axis: Core ID (1-16); y-axis: Peak Load (0-300)]
Still, the maximum load is ~25% of the maximum capacity available at each IR core
Load is measured as the maximum number of queries answered by each IR core within a sliding query window of 1000 queries.
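The load measure described above can be sketched as follows. The input format (one list of polled core IDs per query, in arrival order) and the function name are assumptions for illustration.

```python
from collections import deque

def peak_load(assignments, num_cores, window=1000):
    """Peak number of queries answered by each core within any sliding window.

    assignments is a sequence with one entry per query, in arrival order;
    each entry lists the cores that answered that query.
    """
    current = [0] * num_cores   # queries answered by each core in the current window
    peak = [0] * num_cores
    recent = deque()            # the last `window` entries of assignments
    for cores in assignments:
        recent.append(cores)
        for c in cores:
            current[c] += 1
        if len(recent) > window:
            # The oldest query leaves the window.
            for c in recent.popleft():
                current[c] -= 1
        for c in cores:
            peak[c] = max(peak[c], current[c])
    return peak

# Tiny example: 2 cores, 4 queries, window of 2 queries.
print(peak_load([[0, 1], [0], [1], [0]], num_cores=2, window=2))
```

With a window of 1000 queries, a core that answered every query would reach a peak of 1000, so a peak of about 250 corresponds to the ~25% figure quoted above.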
Load Balancing Strategies
Load-driven basic <L>
  Servers are ranked according to their relevance, using a collection selection function.
  The first gets priority 1, then linearly down to 1/17.
  Every server i has to answer if: L(i) < p(i) * L
Load-driven boost <L,T>
  Priority is 1 for the first T servers, then linearly down to 1/(17-T)
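A sketch of both strategies under one reading of "linearly down": the server at rank position r (0 = most relevant) out of N gets priority (N - r)/N in the basic case, and priority 1 for r < T, then (N - r)/(N - T), in the boost case. The exact interpolation is our assumption, as are the function name and the sample loads.

```python
def select_servers(ranking, load, threshold, boost=0):
    """Return the servers that must answer the query.

    ranking: server ids, most relevant first (from collection selection).
    load[s]: current load L(s) of server s.
    threshold: the load threshold L.
    boost: T, the number of top-ranked servers that get priority 1
           (boost=0 gives the basic strategy).
    """
    n = len(ranking)
    selected = []
    for position, server in enumerate(ranking):
        # Assumed interpolation: 1 for the boosted servers, then a linear decay
        # that reaches 1/(n - boost) for the last-ranked server.
        priority = 1.0 if position < boost else (n - position) / (n - boost)
        if load[server] < priority * threshold:
            selected.append(server)
    return selected

# Hypothetical state: 4 servers, threshold L = 24.
ranking = [3, 1, 0, 2]
load = [10, 20, 5, 23]
print(select_servers(ranking, load, threshold=24))           # basic <24>
print(select_servers(ranking, load, threshold=24, boost=2))  # boost <24, 2>
```

A highly ranked server answers even when it is fairly loaded, while a low-ranked one answers only when it is nearly idle; this is how spare capacity gets spent on extra result quality.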
Experimental Settings (2)
The broker models the load of each core as the number of queries it served out of the last W queries
Assumption: cost = 1 for each query and collection
  We will change this
We count the number of relevant results we can get by polling the servers, up to the chosen load threshold
Load Balancing Results
[Figure: Peak Load on Each IR Core; x-axis: Core ID (1-16); y-axis: Peak Load (0-300); series: FIXED 4, BASIC<24.7>, BOOST<4, 24.7>]

Intersection (# of relevant results retrieved)
N     FIXED 4   BASIC<24.7>   BOOST<4, 24.7>
5      3.10        3.40           3.55
10     6.00        6.80           7.00
20    12.20       13.60          14.00

Competitive Similarity (% of rank score retrieved)
N     FIXED 4   BASIC<24.7>   BOOST<4, 24.7>
5      0.88        0.91           0.92
10     0.87        0.90           0.90
20     0.85        0.89           0.90
Caching and Collection Selection
Interaction with a Cache
Result caching is commonly used in WSEs [Baeza-Yates et al., 2007a; Baeza-Yates et al., 2007b].
Caching has the effect of reshaping the power law underlying the query distribution [Baeza-Yates et al., 2007a].
We designed a novel caching strategy (Incremental Caching) integrated with collection selection
Incremental Caching
[Figure: an Incremental Cache in front of IRCore1-IRCore4; a query Q recurs in the query stream, and its cache entry holds the Results and the Servers Polled]
An incremental cache is effective both at load reduction, and at improving result quality.
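A minimal sketch of the idea, assuming each cache entry stores the results seen so far together with the set of servers already polled, so that a repeated query only polls servers it has not reached yet. The class, the `poll` callback, and the absence of any eviction policy are simplifications made for this example.

```python
class IncrementalCache:
    """Result cache whose entries improve each time a query is repeated."""

    def __init__(self):
        # query -> (results: doc id -> score, servers already polled)
        self.entries = {}

    def answer(self, query, selected_servers, poll, top_n=10):
        results, polled = self.entries.setdefault(query, ({}, set()))
        for server in selected_servers:
            if server not in polled:
                # Only servers not polled before cost any work.
                results.update(poll(server, query))
                polled.add(server)
        return sorted(results, key=results.get, reverse=True)[:top_n]

# Hypothetical back end: three servers, each holding one matching document.
index = {0: {"d1": 0.9}, 1: {"d2": 0.8}, 2: {"d3": 0.95}}
calls = []

def poll(server, query):
    calls.append(server)
    return index[server]

cache = IncrementalCache()
first = cache.answer("q", [0, 1], poll)   # polls servers 0 and 1
second = cache.answer("q", [1, 2], poll)  # polls only server 2, merges with the cached results
```

The second answer is both cheaper (one server polled instead of two) and better (it now includes the result from server 2), which matches the two effects claimed above.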
Incremental Caching Results
Intersection (like P@N - # of relevant results retrieved)
N     BASIC<24.7>   BOOST<4, 24.7>   INCREMENTAL
5        3.40           3.55            4.00
10       6.80           7.00            7.80
20      13.60          14.00           15.60

Competitive Similarity (% of rank score retrieved)
N     BASIC<24.7>   BOOST<4, 24.7>   INCREMENTAL
5        0.91           0.92            0.94
10       0.90           0.90            0.93
20       0.89           0.90            0.93

[Figure: Peak Load on Each IR Core; x-axis: Core ID (1-16); y-axis: Peak Load (0-300); series: FIXED 4, BASIC<24.7>, BOOST<4, 24.7>, BOOST<4, 24.7> + INC]
Refined Cost Model and Prioritization
Collection Prioritization
We move the load control from the broker to the cores
The broker broadcasts the query, and sends info about the relative rank of each core (the priority)
Each core serves the query if L(i) < p(i) * L
  L(i) = sum of the computing cost (timing) of served queries
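The core-side decision can be sketched as below. We assume here that each core sums the measured cost of the queries it served over a window of the most recent queries it received; the window, the class, and the way the cost is passed in are assumptions for illustration.

```python
from collections import deque

class Core:
    """An IR core that decides locally whether to serve a broadcast query."""

    def __init__(self, load_cap, window):
        self.load_cap = load_cap            # the threshold L
        self.costs = deque(maxlen=window)   # cost per recent query (0.0 if skipped)

    def load(self):
        """L(i): total computing cost of the queries served in the window."""
        return sum(self.costs)

    def handle(self, priority, cost):
        """Serve the query if L(i) < p(i) * L; `cost` is its measured timing."""
        serve = self.load() < priority * self.load_cap
        self.costs.append(cost if serve else 0.0)
        return serve

core = Core(load_cap=1.0, window=3)
decisions = [
    core.handle(priority=1.0, cost=0.6),  # idle core: serves
    core.handle(priority=0.5, cost=0.3),  # load 0.6 is not below 0.5: skips
    core.handle(priority=1.0, cost=0.3),  # load 0.6 is below 1.0: serves
]
```

Because the decision uses only local state and the priority sent with the query, the broker no longer needs to track the load of every core.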
Extended Tests
We actually partitioned the documents onto different servers
We indexed locally, and we measured the timing of each query
The actual timing is used to compute the load and drive the system
The load cap is the AVERAGE load
  The peak can vary heavily!
…the bill, please!
Conclusions
We presented an architecture for a distributed search engine, based on collection selection
The load-driven strategy and incremental caching can retrieve very high quality results, with reduced load
Verified with an extensive simulation
Impact and Benefits
If a given precision is expected, we can use FEWER servers
With a given number of servers, we get HIGHER precision
  Confirmed with different metrics
Smaller load for the IR system, with more focus on top results
Nice trade-off of cost vs. quality
Impact and Benefits (2)
Load-driven routing can be used
  to absorb query peaks
  to offer higher/lower quality results to selected users
Consistent ranking due to local indexing
Incremental caching can be used to reduce the negative effects of selection
Furthermore
Caching posting lists is very effective on local indices
Simple way to add new documents
Incremental caching could help with impact-ordered posting lists
Caching could be based on line value (query frequency, number of polled servers)
Future Work
Comparison with other results in clustering (k-means, link-based, P2P, LSI, SVD)
Test on a large-scale, real-world search engine
  Real-world implementation at Google
TOIS paper to wrap up
References
[Puppin et al., 2006] Diego Puppin, Fabrizio Silvestri, Domenico Laforenza. “Query-Driven Document Partitioning and Collection Selection”. Invited Paper. Proceedings of INFOSCALE ’06.
[Puppin & Silvestri, 2006] Diego Puppin, Fabrizio Silvestri. “The Query-Vector Document Model”. Proceedings of CIKM ’06.
[Puppin et al., 2007] Diego Puppin, Ricardo Baeza-Yates, Raffaele Perego, Fabrizio Silvestri. “Incremental Caching for Collection Selection Architectures”. Proceedings of INFOSCALE ’07.
References
[Baeza-Yates et al., 2007a] Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, Fabrizio Silvestri. “Challenges in Distributed Information Retrieval”. Invited Paper. Proceedings of ICDE 2007.
[Chierichetti et al., 2007] F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, E. Upfal. “Finding Near Neighbors Through Cluster Pruning”. Proceedings of PODS 2007.
[Baeza-Yates et al., 2007b] Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, Fabrizio Silvestri. “The Impact of Caching on Search Engines”. Proceedings of SIGIR 2007.
References
[Dhillon et al., 2003] I. S. Dhillon, S. Mallela, D. S. Modha. “Information-Theoretic Co-Clustering”. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003).
Backup Slides
Adding Documents
Adding Documents
It is important to assign new documents to the fittest clusters
  New versions, new pages, etc.
The new documents will be found along with the previously assigned documents
  Hopefully the collection selection will find them with similar documents
A Modest Proposal
The body of the new document is used as a query for the PCAP selection
  The body is compared to the query clusters
  We will find a similarity between the document body and the query clusters
We use PCAP to rank document collections
Implementation
The first 1000 bytes of the (stripped) document body are used
The new document is assigned to the document cluster with the top PCAP score
New documents are locally indexed
No need to re-train / re-assign
New documents have consistent score and ranking
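The assignment step can be sketched as follows, reusing an illustrative PCAP-style score (term weights per query cluster, propagated to document clusters through block weights). The scoring details, the tokenization, and all names are assumptions; only the use of the first 1000 bytes and the choice of the top-scoring cluster come from the slide.

```python
def assign_new_document(body, cluster_vocab, block_weight, prefix_bytes=1000):
    """Pick the document cluster for a new document from its body prefix.

    cluster_vocab[i] maps a term to its weight in query cluster i.
    block_weight[i][j] links query cluster i to document cluster j.
    """
    # Use only the first bytes of the (already stripped) body as the query.
    prefix = body.encode("utf-8")[:prefix_bytes].decode("utf-8", errors="ignore")
    terms = prefix.lower().split()
    # Similarity between the body prefix and each query cluster.
    qc_scores = [sum(vocab.get(t, 0.0) for t in terms) for vocab in cluster_vocab]
    # Propagate to document clusters and take the best one.
    n_doc_clusters = len(block_weight[0])
    dc_scores = [
        sum(qc * row[j] for qc, row in zip(qc_scores, block_weight))
        for j in range(n_doc_clusters)
    ]
    return max(range(n_doc_clusters), key=lambda j: dc_scores[j])

# Hypothetical model: 2 query clusters, 3 document clusters.
vocab = [{"car": 1.0, "rent": 0.5}, {"pizza": 1.0}]
weights = [[0.8, 0.2, 0.0],
           [0.1, 0.1, 0.8]]
print(assign_new_document("Pizza delivery in town", vocab, weights))
```

Since only the prefix is inspected, text that appears later in a long document has no influence on the assignment.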
Test Configurations