Topical Query Decomposition


Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis

Yahoo! Research, Barcelona, Spain

KDD 08


Abstract

Given a query and a document retrieval system, the goal is to produce a small set of queries whose union of resulting documents corresponds approximately to that of the original query.

Set cover problem: greedy algorithm

Clustering problem: two-phase algorithm based on hierarchical agglomerative clustering and dynamic programming


Introduction

A query log L is a list of pairs <q, D(q)>

q: a query; D(q): its result, the set of documents that answer query q

Q(q): the maximal set of queries pi such that, for each pi, the set D(pi) has at least one document in common with the documents returned by q
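The definition of Q(q) can be sketched in a few lines. This is a minimal illustration, assuming the query log is held in memory as a dictionary from query string to result set; the function name `candidate_queries`, the toy log, and the choice to exclude q itself from Q(q) are assumptions made for the example, not part of the original slides.

```python
from typing import Dict, Set


def candidate_queries(log: Dict[str, Set[str]], q: str) -> Set[str]:
    """Return Q(q): every other query whose results intersect D(q)."""
    target = log[q]
    # A query p is a candidate when D(p) and D(q) share at least one document.
    return {p for p, docs in log.items() if p != q and docs & target}


# Toy query log (made-up data): query -> D(query)
toy_log = {
    "jaguar": {"d1", "d2", "d3", "d4"},
    "jaguar car": {"d1", "d2", "d9"},
    "jaguar animal": {"d3", "d4"},
    "python": {"d7", "d8"},
}

print(sorted(candidate_queries(toy_log, "jaguar")))
```

On a real log, an inverted index from document to queries would avoid scanning every query.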


The goal is to compute a cover: select a subcollection C ⊆ Q(q7) such that it covers almost all of D(q7).


Problem Statement – 1/3

Red-blue set cover problem (for a query q)

U = {b1, …, bn, r1, …, rm}

B = {b1, …, bn}: blue points, the documents in D(q)

R = {r1, …, rm}: red points, the documents returned by the candidate queries that are not in D(q)

S = {S1, …, Sk}: the candidate sets, provided by the query log L, with each Si ⊆ U

Si^B: blue points in Si (Si^B = Si ∩ B)

Si^R: red points in Si (Si^R = Si ∩ R)

Goal: to find a subcollection C ⊆ S that covers many blue points of U without covering too many red points.
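A minimal sketch of how a red-blue instance could be assembled from a query log, assuming blue points are the documents in D(q) and red points are the remaining documents returned by the candidate queries. The function name `red_blue_instance` and the toy data are hypothetical.

```python
from typing import Dict, Set, Tuple


def red_blue_instance(
    log: Dict[str, Set[str]], q: str
) -> Tuple[Set[str], Set[str], Dict[str, Tuple[Set[str], Set[str]]]]:
    """Build (B, R, split), where split maps each candidate to (S^B, S^R)."""
    blue = set(log[q])
    # Candidate sets: results of the other queries that overlap D(q).
    sets = {p: set(d) for p, d in log.items() if p != q and d & blue}
    # Red points: documents brought in by candidates but absent from D(q).
    red = set().union(*sets.values()) - blue
    split = {p: (s & blue, s & red) for p, s in sets.items()}
    return blue, red, split


# Toy query log (made-up data)
toy_log = {
    "jaguar": {"d1", "d2", "d3", "d4"},
    "jaguar car": {"d1", "d2", "d9"},
    "jaguar animal": {"d3", "d4"},
}

B, R, split = red_blue_instance(toy_log, "jaguar")
print(sorted(B), sorted(R))
```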



For each query q, the candidate queries Q(q)

For each set Si with blue and red points, its weight is the scatter sc(Si) (coherence is the opposite of scatter):

$$\mathrm{sc}(S_i) = \min_{u \in S_i} \sum_{v \in S_i} d(u, v)^2$$

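Each blue point b is weighted by its clicks for query q, and the blue points of a set are counted by weight:

$$w(b) = 1 + \log_2\bigl(1 + \mathrm{clicks}(q, b)\bigr)$$

$$|S_i^B|_w = \sum_{b \in S_i^B} w(b)$$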
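The scatter of a set can be sketched directly from its definition: choose the point of the set that minimizes the sum of squared distances to all points of the set. The slides do not say which document distance d is used, so the absolute difference on numbers below is only a stand-in, and the function name `scatter` is hypothetical.

```python
from typing import Callable, Iterable, TypeVar

P = TypeVar("P")


def scatter(points: Iterable[P], dist: Callable[[P, P], float]) -> float:
    """Minimum over centers u in the set of the sum of squared distances d(u, v)^2."""
    pts = list(points)
    if not pts:
        return 0.0
    # Brute force over every candidate center: O(n^2) distance evaluations.
    return min(sum(dist(u, v) ** 2 for v in pts) for u in pts)


# Stand-in distance on numbers (an assumption; documents would need a real metric).
print(scatter([0, 1, 5], lambda a, b: abs(a - b)))
```

A tight set has low scatter and therefore high coherence; a set mixing unrelated documents is penalized.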



Our goal is to find a subcollection C ⊆ S that covers almost all the blue points of U and has large coherence.

More precisely, we want C to satisfy the following properties: cover-blue, not-cover-red, small-overlap, and coherence.


Greedy Algorithm – 1/2

At the i-th iteration, the algorithm selects the set S that minimizes s(S, VB, VR).

C, R, and O are parameters that weight the relative importance of the three terms.

VB: the blue points already covered by the sets selected in previous iterations

VR: the red points already covered by the sets selected in previous iterations

D. Peleg. Approximation algorithms for the Label-CoverMAX and Red-Blue Set Cover problems. Journal of Discrete Algorithms, 2007.
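The greedy iteration can be sketched as follows. The exact formula for s(S, VB, VR) does not appear in the slide text, so the score used here is an assumption: the scatter, the newly covered red points, and the overlap with already covered blue points, weighted by C, R, and O and divided by the number of newly covered blue points. The function name `greedy_cover` and the toy data are hypothetical.

```python
import math
from typing import Dict, List, Set


def greedy_cover(
    blue: Set[int],
    sets: Dict[str, Set[int]],
    scatter: Dict[str, float],
    c: float = 1.0,
    r: float = 1.0,
    o: float = 1.0,
    target: float = 1.0,
) -> List[str]:
    """Pick sets one at a time, each time minimizing the assumed score."""
    covered_blue: Set[int] = set()  # VB
    covered_red: Set[int] = set()   # VR
    remaining = dict(sets)
    chosen: List[str] = []
    while remaining and len(covered_blue) < target * len(blue):
        best, best_score = None, math.inf
        for name, s in remaining.items():
            new_blue = (s & blue) - covered_blue
            if not new_blue:
                continue  # a set that adds no blue point is useless
            new_red = (s - blue) - covered_red
            overlap = s & covered_blue
            # Assumed form of s(S, VB, VR); not taken from the slides.
            score = (c * scatter[name] + r * len(new_red) + o * len(overlap)) / len(new_blue)
            if score < best_score:
                best, best_score = name, score
        if best is None:
            break
        s = remaining.pop(best)
        chosen.append(best)
        covered_blue |= s & blue
        covered_red |= s - blue
    return chosen


# Toy instance (made-up data): documents 1-4 are blue, the rest are red.
toy_blue = {1, 2, 3, 4}
toy_sets = {"a": {1, 2}, "b": {3, 4, 9}, "c": {1, 2, 3, 4, 7, 8}}
toy_scatter = {"a": 0.2, "b": 0.4, "c": 3.0}
print(greedy_cover(toy_blue, toy_sets, toy_scatter))
```

The `target` parameter (also an assumption) lets the loop stop once a chosen fraction of the blue points is covered, matching the goal of covering almost all of D(q).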


Greedy Algorithm – 2/2


Integer Programming

S1 + S2 + … + Sl ≤ 10

Si ≤ 1


Clustering-Based Method

Two-phase approach

First phase: all points in set B are clustered using a hierarchical agglomerative clustering algorithm (CLUTO toolkit).

Second phase: the clusters of the hierarchy produced by the agglomerative algorithm are matched with the sets of S.

The main idea is to match sets of S to clusters of the hierarchy. Every node T of the hierarchy corresponds to a cluster; let T(B) be the set of points of B in that cluster.
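The first phase can be illustrated with a naive agglomerative clustering that returns the hierarchy as a binary tree. This sketch does not use CLUTO; it is a plain average-linkage procedure in pure Python, cubic in the number of points, and the names `Node` and `agglomerate` as well as the numeric toy points are assumptions for the example. It expects at least one point.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Optional


@dataclass
class Node:
    points: FrozenSet            # T(B): the points of B under this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def agglomerate(points: Iterable, dist: Callable) -> Node:
    """Repeatedly merge the two closest clusters (average linkage)."""
    nodes = [Node(frozenset([p])) for p in points]

    def linkage(a: Node, b: Node) -> float:
        total = sum(dist(u, v) for u in a.points for v in b.points)
        return total / (len(a.points) * len(b.points))

    while len(nodes) > 1:
        a, b = min(combinations(nodes, 2), key=lambda pair: linkage(*pair))
        nodes = [n for n in nodes if n is not a and n is not b]
        nodes.append(Node(a.points | b.points, a, b))
    return nodes[0]


# Toy points with a stand-in distance (an assumption).
root = agglomerate([0, 1, 10, 11], lambda x, y: abs(x - y))
print(sorted(root.left.points), sorted(root.right.points))
```

Each internal node of the returned tree is one cluster of the hierarchy, which is what the second phase matches against the sets of S.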


Clustering-Based Method

Dendrogram


Clustering-Based Method – Dynamic Programming – 1/2

Complete coverage: for each set Si ∈ S and each node T of the hierarchy, compute a matching score m(T, Si).

m*(T): the score of the best-matching set in S for node T.

The optimal cost of covering the points of T(B) with sets in S is computed by dynamic programming over the hierarchy.


Clustering-Based Method – Dynamic Programming – 2/2

Partial coverage:

The parameter U weights the relative importance of the two terms: the scatter cost of the selected sets and the number of uncovered points.
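A minimal sketch of the dynamic program over the hierarchy, covering both variants. Several parts are assumptions, since the slide text does not give the formulas: the matching score is taken to be the scatter of the set plus a penalty per mismatched point, and the recursion takes the cheapest of (a) the best-matching set for the node, (b) the sum of the children's optimal costs, and, for partial coverage, (c) leaving the node uncovered at a cost of U per point. All names (`Node`, `match_score`, `best_cover`) and the toy data are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass
class Node:
    points: FrozenSet[str]                 # T(B)
    children: Tuple["Node", ...] = ()


def match_score(node_points: FrozenSet[str], s: Set[str], sc: float) -> float:
    """Assumed m(T, S): scatter plus one unit per point in the symmetric difference."""
    return sc + len(node_points ^ s)


def best_cover(
    node: Node, sets: Dict[str, Set[str]], scatter: Dict[str, float], u: float = math.inf
) -> Tuple[float, List[str]]:
    """Return (optimal cost, chosen set names) for the subtree rooted at node."""
    def score(name: str) -> float:
        return match_score(node.points, sets[name], scatter[name])

    best = min(sets, key=score)            # the set achieving m*(T)
    options = [(score(best), [best])]
    if node.children:
        parts = [best_cover(child, sets, scatter, u) for child in node.children]
        options.append((sum(c for c, _ in parts), [n for _, names in parts for n in names]))
    if math.isfinite(u):
        # Partial coverage: pay U for every point of T(B) left uncovered.
        options.append((u * len(node.points), []))
    return min(options, key=lambda option: option[0])


# Toy hierarchy and candidate sets (made-up data).
tree = Node(
    frozenset({"d1", "d2", "d3", "d4"}),
    (Node(frozenset({"d1", "d2"})), Node(frozenset({"d3", "d4"}))),
)
toy_sets = {"a": {"d1", "d2"}, "b": {"d3", "d4", "d9"}, "c": {"d1", "d2", "d3", "d4", "d7", "d8"}}
toy_scatter = {"a": 0.25, "b": 0.5, "c": 3.0}

print(best_cover(tree, toy_sets, toy_scatter))          # complete coverage
print(best_cover(tree, toy_sets, toy_scatter, u=0.5))   # partial coverage
```

With a small U, leaving a poorly matched cluster uncovered becomes cheaper than forcing a set onto it, which is the trade-off the parameter controls.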


Application

Query log L: 2.9 million distinct queries

A majority of users look only at the first page of results, while few users request more result pages.

D(q): the result documents for q on the result pages that users asking for q in the query log navigated to

24 million distinct documents seen by the users


Application – Candidate queries for the cover

For each query q, the candidate queries Qk(q)


Application – Results

A set of 100 queries was randomly picked from the top 10,000 queries submitted by users.

Cost of k queries: the number of documents included outside the set D(q)

Average number of queries covering each element

Coverage after the top k candidates have been picked
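The three measures above can be sketched for a ranked list of candidate result sets. This is an illustration under assumptions: the average is taken over the documents of D(q) that are actually covered, and the function name `cover_metrics` and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def cover_metrics(blue: Set[int], ranked_sets: List[Set[int]], k: int) -> Dict[str, float]:
    """Cost, coverage, and average overlap after picking the top k candidates."""
    top = ranked_sets[:k]
    union = set().union(*top)
    covered = union & blue
    # Cost: documents pulled in by the cover that lie outside D(q).
    cost = len(union - blue)
    coverage = len(covered) / len(blue) if blue else 0.0
    # Average number of selected queries that return each covered document of D(q).
    hits = sum(len(s & blue) for s in top)
    avg_queries = hits / len(covered) if covered else 0.0
    return {"cost": cost, "coverage": coverage, "avg_queries_per_element": avg_queries}


# Toy data (made up): documents 1-4 form D(q).
metrics = cover_metrics({1, 2, 3, 4}, [{1, 2}, {2, 3, 9}], k=2)
print(metrics)
```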


Conclusions

A novel problem: topical query decomposition

Elegant solutions: red-blue metric set cover, and clustering with predefined clusters (hierarchical agglomerative clustering)

The set-cover formulation provides solutions of better quality.

Code and data for reproducing the results shown in Table 3 are available at http://www.yr-bcn.es/querydecomp/.
