Topical Query Decomposition Francesco Bonchi Carlos Castillo Debora Donato Aristides Gionis Yahoo! Research Barcelona, Spain KDD 08


Page 1: Topical Query Decomposition

Topical Query Decomposition

Francesco Bonchi Carlos Castillo Debora Donato Aristides Gionis

Yahoo! Research, Barcelona, Spain

KDD 08

Page 2: Topical Query Decomposition

2

Abstract

Given a query and a document retrieval system, the task is to produce a small set of queries whose union of resulting documents corresponds approximately to that of the original query.

Set cover problem: greedy algorithm

Clustering problem: two-phase algorithm based on hierarchical agglomerative clustering (dynamic programming)

Page 3: Topical Query Decomposition

3

Introduction

A query log L: a list of pairs <q, D(q)>

q: a query; D(q): its result, the set of documents that answer query q

Q(q): the maximal set of queries pi such that each D(pi) has at least one document in common with the documents returned by q
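The definition of Q(q) can be sketched in a few lines of Python. This is only an illustration: the log is modeled as a dictionary from query string to result set, the toy queries and document ids are invented, and excluding q itself from Q(q) is an assumption the slide does not state.

```python
from typing import Dict, Set


def candidate_queries(log: Dict[str, Set[str]], q: str) -> Set[str]:
    """Return Q(q): every other logged query whose results overlap D(q)."""
    d_q = log[q]
    # Assumption: q itself is left out of its own candidate set.
    return {p for p, d_p in log.items() if p != q and d_p & d_q}


# Hypothetical toy log: query -> D(query)
toy_log = {
    "barcelona": {"d1", "d2", "d3", "d4"},
    "barcelona fc": {"d1", "d2", "d9"},
    "barcelona hotels": {"d3", "d8"},
    "madrid": {"d7"},
}
candidates = candidate_queries(toy_log, "barcelona")
```

Here "madrid" is not a candidate because D("madrid") shares no document with D("barcelona").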

Page 4: Topical Query Decomposition

4

Page 5: Topical Query Decomposition

5

The goal is to compute a cover: select a subcollection C ⊆ Q(q7) that covers almost all of D(q7).

Page 6: Topical Query Decomposition

6

Problem Statement – 1/3

Red-blue set cover problem

U = {b1, …, bn, r1, …, rm} (for a query q)

B = {b1, …, bn} (i.e., document set)

R = {r1, …, rm} (i.e., query set)

S = {S1, …, Sk} is provided by the query log L, with each Si ⊆ U

Si^B: blue points in Si (Si^B = Si ∩ B)

Si^R: red points in Si (Si^R = Si ∩ R)

Goal: find a subcollection C ⊆ S that covers many blue points of U without covering too many red points.
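As a small illustration of the blue/red split, the sketch below separates one candidate set into its two parts. It assumes U is the disjoint union of B and R, so the red part of a set is simply whatever falls outside B; the function name and the toy values are hypothetical.

```python
from typing import Set, Tuple


def split_blue_red(s_i: Set[str], blue: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split a candidate set into its blue part and its red part."""
    blue_part = s_i & blue   # points of the set that belong to B
    red_part = s_i - blue    # remaining points, red under the disjointness assumption
    return blue_part, red_part


B = {"b1", "b2", "b3"}
S1 = {"b1", "b2", "r1"}
S1_blue, S1_red = split_blue_red(S1, B)
```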

Page 7: Topical Query Decomposition

7

Problem Statement – 2/3

For each query q, the candidate queries Q(q)

For each set Si with blue and red points, its weight is the scatter sc(Si) (coherence is the opposite of scatter):

$$ sc(S_i) = \min_{u \in S_i} \sum_{v \in S_i} d(u, v)^2 $$

$$ w(b) = \log_2\bigl(1 + \mathrm{clicks}(q, b)\bigr) + 1 $$

$$ |S_i|_w = \sum_{b \in S_i^B} w(b) $$
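The scatter of a set can be computed directly from its definition: try every member as the center and keep the smallest sum of squared distances to all members. The sketch below assumes documents are represented as numeric vectors and that d is the Euclidean distance, neither of which the slide specifies. The click-based weight is read here as log2(1 + clicks) + 1, which is also an assumption.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def scatter(points: Sequence[Point]) -> float:
    """Smallest sum of squared distances from one member to all members."""
    return min(sum(math.dist(u, v) ** 2 for v in points) for u in points)


def click_weight(clicks: int) -> float:
    """Weight of a blue point from its click count (assumed form)."""
    return math.log2(1 + clicks) + 1


pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
sc = scatter(pts)  # the middle point is the best center
```

A tight set of near-duplicate documents has scatter close to zero, while a set mixing unrelated topics has a large scatter, which is why low scatter is used as a proxy for coherence.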

Page 8: Topical Query Decomposition

8

Problem Statement – 3/3

Our goal is to find a subcollection C ⊆ S that covers almost all the blue points of U and has large coherence.

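More precisely, we want C to satisfy the following properties:

Cover-blue: C covers almost all the blue points

Not-cover-red: C covers few red points

Small-overlap: the sets in C overlap as little as possible

Coherence: the sets in C have low scatter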

Page 9: Topical Query Decomposition

9

Greedy Algorithm – 1/2

At the i-th iteration, the algorithm selects the set S that minimizes s(S, VB, VR).

C, R, O are parameters that weight the relative importance of the three terms.

VB: blue points already covered in previous iterations

VR: red points already covered in previous iterations

D. Peleg. Approximation algorithms for the Label-Cover_MAX and Red-Blue Set Cover problems. Journal of Discrete Algorithms, 2007.
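The greedy loop can be sketched as follows. The slide does not give the score s(S, VB, VR), so the one used here is a stand-in: a weighted sum of scatter, newly covered red points, and overlap with already covered blue points, divided by the number of newly covered blue points. The names `lam_c`, `lam_r`, `lam_o` and `max_sets` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_cover(
    sets: Dict[str, Set[int]],
    scatter_cost: Dict[str, float],
    blue: Set[int],
    max_sets: int,
    lam_c: float = 1.0,
    lam_r: float = 1.0,
    lam_o: float = 1.0,
) -> Tuple[List[str], Set[int], Set[int]]:
    """Pick sets one at a time, each time taking the lowest-scoring one."""
    v_blue: Set[int] = set()   # blue points covered so far (VB)
    v_red: Set[int] = set()    # red points covered so far (VR)
    chosen: List[str] = []
    remaining = dict(sets)
    while remaining and len(chosen) < max_sets and v_blue != blue:
        best_name, best_score = None, float("inf")
        for name, s in remaining.items():
            new_blue = (s & blue) - v_blue
            if not new_blue:
                continue  # a set that adds no blue point is never useful
            new_red = (s - blue) - v_red
            overlap = s & v_blue
            cost = (lam_c * scatter_cost[name]
                    + lam_r * len(new_red)
                    + lam_o * len(overlap))
            score = cost / len(new_blue)  # assumed score, not the slide's
            if score < best_score:
                best_name, best_score = name, score
        if best_name is None:
            break
        s = remaining.pop(best_name)
        chosen.append(best_name)
        v_blue |= s & blue
        v_red |= s - blue
    return chosen, v_blue, v_red


demo_blue = {1, 2, 3, 4}
demo_sets = {"A": {1, 2}, "B": {3, 4, 9}, "C": {1, 2, 3, 4, 7, 8, 10}}
demo_scatter = {"A": 1.0, "B": 1.0, "C": 10.0}
picked, got_blue, got_red = greedy_cover(demo_sets, demo_scatter, demo_blue, max_sets=3)
```

In this toy run the large, incoherent set C is skipped in favor of the two small sets A and B, at the price of one red point.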

Page 10: Topical Query Decomposition

10

Greedy Algorithm – 2/2

Page 11: Topical Query Decomposition

11

Integer Programming

S1 + S2 + … + Sl <= 10

Si <= 1

Page 12: Topical Query Decomposition

12

Clustering-Based Method

Two-phase approach

First phase: all points in set B are clustered using a hierarchical agglomerative clustering algorithm (CLUTO toolkit).

Second phase: the clusters of the hierarchy produced by the agglomerative algorithm are matched with the sets of S.

The main idea is to match sets of S to clusters of the hierarchy. Every node T of the hierarchy corresponds to a cluster; let T(B) be the set of points of B in that cluster.

Page 13: Topical Query Decomposition

13

Clustering-Based Method

Dendrogram

Page 14: Topical Query Decomposition

14

Clustering-Based Method - Dynamic Programming - 1/2

Complete coverage: for each set Si ∈ S and each node T of the hierarchy, compute a matching score m(T, Si).

m*(T): the score of the best matching set in S for node T.

The dynamic program computes the optimal cost of covering the points of T(B) with sets in S.
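One way to read the complete-coverage step is as a bottom-up recursion over the dendrogram: a node is either covered by its own best matching set, or by the optimal covers of its two children, whichever is cheaper. The slide does not spell out the recurrence, so the form used below, cost(T) = min(m*(T), cost(left) + cost(right)), is an assumption, as are the `Node` class and the toy scores, which are treated as costs to minimize.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    best_match: float               # m*(T), treated here as a cost
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def optimal_cover(node: Node) -> Tuple[float, List[str]]:
    """Return the cheapest cost and the nodes whose best sets are used."""
    if node.left is None or node.right is None:
        return node.best_match, [node.name]   # a leaf must use its own match
    left_cost, left_nodes = optimal_cover(node.left)
    right_cost, right_nodes = optimal_cover(node.right)
    if node.best_match <= left_cost + right_cost:
        return node.best_match, [node.name]
    return left_cost + right_cost, left_nodes + right_nodes


a, b, c = Node("a", 1.0), Node("b", 1.5), Node("c", 4.0)
ab = Node("ab", 5.0, a, b)
root = Node("root", 8.0, ab, c)
total_cost, used = optimal_cover(root)
```

In this toy tree, covering the root directly costs 8.0, while splitting all the way down to the leaves costs 6.5, so the recursion returns the three leaves.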

Page 15: Topical Query Decomposition

15

Clustering-Based Method - Dynamic Programming - 2/2

Partial coverage:

The parameter U weights the relative importance of the two terms: the scatter cost of the sets in S and the number of uncovered points.

Page 16: Topical Query Decomposition

16

Application

Query log L: 2.9 million distinct queries

A majority of users look only at the first page of results, while few users request more result pages.

D(q): the set of result documents seen by users who asked for q in the query log, that is, the results on the pages those users navigated to

24 million distinct documents seen by the users

Page 17: Topical Query Decomposition

17

Application - Candidate queries for the cover

For each query q, the candidate queries Qk(q)

Page 18: Topical Query Decomposition

18

Application - Results

A set of 100 queries was randomly picked from the top 10,000 queries submitted by users.

Cost of k queries

The number of documents included outside the set D(q)

Average number of queries covering each element

Coverage after the top k candidates have been picked
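Three of these measures can be computed directly from a chosen cover. The sketch below is an illustration under assumptions: coverage is taken as the fraction of D(q) that is covered, and the average overlap is computed over covered elements of D(q) only. The function name and toy data are hypothetical.

```python
from typing import Dict, List, Set


def cover_metrics(chosen_sets: List[Set[int]], d_q: Set[int]) -> Dict[str, float]:
    """Summarize a cover of D(q) by the selected result sets."""
    covered = set().union(*chosen_sets)
    covered_in = covered & d_q
    outside = covered - d_q                      # documents outside D(q)
    hits = sum(len(s & d_q) for s in chosen_sets)
    return {
        "coverage": len(covered_in) / len(d_q),
        "outside_docs": float(len(outside)),
        # Average number of chosen queries covering each covered element.
        "avg_overlap": hits / len(covered_in) if covered_in else 0.0,
    }


metrics = cover_metrics([{1, 2}, {2, 3, 9}], d_q={1, 2, 3, 4})
```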

Page 19: Topical Query Decomposition

19

Page 20: Topical Query Decomposition

20

Page 21: Topical Query Decomposition

21

Page 22: Topical Query Decomposition

22

Conclusions

A novel problem: topical query decomposition

Elegant solutions: red-blue metric set cover, and clustering with predefined clusters (hierarchical agglomerative clustering)

The set-cover formulation provides solutions of better quality

Code and data for reproducing the results shown in Table 3 are available at http://www.yr-bcn.es/querydecomp/ .