Efficient Computation of Diverse Query Results

Erik Vee

joint work with

Utkarsh Srivastava, Jayavel Shanmugasundaram, Prashant Bhat, Sihem Amer Yahia

Talk modified for CS 632 by S. Sudarshan

Motivation

• Imagine looking for shoes on Yahoo! Shopping, and seeing only Reeboks

• … or looking for cars on Yahoo! Autos, and seeing only Hondas

• … or looking for jobs on Yahoo! Hotjobs, and seeing only jobs from Yahoo!

• It is not enough to simply give the best response

– Need diversity of answers

Diversity Search

• If we display 30 results in 5 categories, then should show 6 items from each category

– NB: Our goal is to show range of choices, not representative sample

– Recurse on each subgroup of items

• Diversity crucial for users looking for range of results

– e.g. Shopping, information gathering/research

• Useful for aiding navigation

– Users tend to favor search-and-click over hierarchies

• Likely to give at least one good answer on first page
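The even split described on this slide (30 results over 5 categories gives 6 each) can be sketched as a round-by-round quota assignment. This is only an illustration: the function name `split_quota` and the `available` counts are assumptions, not taken from the talk. Categories with fewer matching items than their fair share simply stop receiving slots, and the leftover slots go to the remaining categories.

```python
def split_quota(k, available):
    """Spread k result slots over categories as evenly as possible.

    available[i] is the number of matching items in category i
    (a hypothetical input, not something the talk specifies).
    """
    quota = [0] * len(available)
    remaining = k
    while remaining > 0:
        # Categories that can still accept another slot.
        open_cats = [i for i, a in enumerate(available) if quota[i] < a]
        if not open_cats:
            break  # fewer than k items exist in total
        for i in open_cats:
            if remaining == 0:
                break
            quota[i] += 1
            remaining -= 1
    return quota


print(split_quota(30, [100, 100, 100, 100, 100]))  # [6, 6, 6, 6, 6]
print(split_quota(30, [2, 100, 100, 100, 100]))    # [2, 7, 7, 7, 7]
```

Applying the same function again inside each category, with that category's quota as `k`, would correspond to the "recurse on each subgroup" step.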

Contributions

• Formally define diversity search

– Other diversity-like approaches use extensive post-processing or are not query-dependent

• Proved that traditional IR engines cannot produce guaranteed diverse results

• Gave novel algorithms to produce diverse results

– Both one-pass (data streaming) and probing algorithms

• Experimentally verified that these algorithms are nearly as fast as normal top-k processing

– Much faster than post-processing techniques

What about other approaches?

• If not diverse enough, query again

– E.g. If all results are from one company, issue another query

– Bad for latency

• Issue multiple queries (one for Honda, one for Toyota...)

– Can be prohibitively expensive (kills throughput), though latency is fine

– Some applications may have dozens of top-level categories

• Fetch extra results, then find most diverse set from this

– Not guaranteed to get good results

– Requires fetching additional results unnecessarily

• Fetch all results, then find diverse set

– Many times slower

• Random sample of results

– Miss important results this way

What about clever scoring?

• Can we give each item a global “diversity” score, then find top-k using this?

– Proved in paper: There is no global score that gives guaranteed diversity

• Can we give each item a local “diversity” score, so that it has a different score in each list of the inverted index?

– Proved in paper: There is no list-based scoring of the item that gives guaranteed diversity

Outline

• Definition of diversity

• Overview of our algorithms

• Our experimental results

Diversity search

• Over all possible sets of top-k results that match query, return set with most diversity

• Paper defines diversity more precisely

– Focus on hierarchy view of diversity (in next slides)

• For scored diversity (in which each item has a score)

– Over all possible sets of top-k results with maximum score, return set with highest diversity

– Note: Diversity only useful when score not too fine-grained

Diversity definition (by picture)

Determine a category ordering: Make, Model, Color, Year, Text

Implicitly defines hierarchy

Hierarchy after a query

Diversity search always returns valid results

E.g. Query text contains `Low`

All siblings return the same number of results (or as close as possible)

Returning top-k diverse results

Diversity search always returns valid results

E.g. Query text contains `Low`

Suppose return k=4 results

Must return 2 Hondas and 2 Toyotas

Will not return 2 green Civics
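As a reference for the definition only, the rule "all siblings return the same number of results (or as close as possible)" can be sketched with a recursive round-robin over branches. This sketch assumes items are encoded as tuples of branch numbers, one component per level (Make, Model, Color, Year, Text), which is the Dewey ID encoding the talk introduces in the algorithms part; the item list is the one shown there for `Low`. The function name `diverse_order` is an assumption. Because it materializes every matching item first, it behaves like the fetch-everything approach and is not one of the talk's efficient algorithms.

```python
from itertools import groupby, islice

# Items matching 'Low', as (Make, Model, Color, Year, Text) branch numbers.
LOW = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
       (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
       (1, 2, 0, 0, 0), (1, 3, 0, 0, 0)]


def diverse_order(items, depth=0):
    """Yield sorted items so that every prefix of the output is balanced.

    At each level the children are visited round-robin, and each child
    yields its own items in the same recursive round-robin order.
    """
    if not items:
        return
    if depth == len(items[0]):
        yield from items
        return
    branches = [diverse_order(list(group), depth + 1)
                for _, group in groupby(items, key=lambda it: it[depth])]
    while branches:
        alive = []
        for branch in branches:
            item = next(branch, None)
            if item is not None:
                yield item
                alive.append(branch)
        branches = alive  # exhausted branches drop out of the rotation


top4 = list(islice(diverse_order(LOW), 4))
print(top4)
# [(0, 0, 0, 0, 0), (1, 0, 0, 0, 0), (0, 0, 1, 0, 0), (1, 1, 0, 0, 0)]
```

For k = 4 this picks two Hondas of different colors and two Toyotas of different models, so it never returns two green Civics.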

Outline

• Definition of diversity

• Overview of our algorithms

• Our experimental results

Algorithms

• One Pass

– Never goes backward (just one pass over dataset)

– Maintains a top-k diverse set based on what has been seen

– Jumps ahead if more results will not help diversity

– Optimal one-pass algorithm

• Probe

– May jump forward or backward (i.e. probes)

– Proved: at most 2k probes for top-k diverse result set

• Both also work for scored diversity

Dewey IDs

Every branch gets a number

Every item then labeled, e.g. 0.2.0.1.0 is Honda Odyssey Green ’06 ‘Good miles’

Create inverted index

low 00000, 00010, 00100, 00200, 00300, 00310, 10000, 11000, 12000, 13000

Next and Prev

Supports two basic operations: Next and Prev

E.g. Query text contains `Low`

Next(0.0.3.2.2) = 1.0.0.0.0

Prev(2.0.0.0.0) = 1.3.0.0.0

Inverted index for ‘Low’ lists all items in Dewey ID order

In general, must find intersection of lists (still easy)

low 00000, 00010, 00100, 00200, 00300, 00310, 10000, 11000, 12000, 13000
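A minimal sketch of the two operations, assuming Dewey IDs are stored as sorted tuples and that Next returns the smallest ID greater than or equal to the probe while Prev returns the largest ID less than or equal to it. The names `next_id`, `prev_id` and `next_in_all`, and the second posting list `OTHER`, are hypothetical and only illustrate the intersection case; the talk does not show how its index is implemented.

```python
from bisect import bisect_left, bisect_right

# Inverted list for 'Low' from the slide, one tuple component per level.
LOW = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
       (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
       (1, 2, 0, 0, 0), (1, 3, 0, 0, 0)]


def next_id(postings, probe):
    """Smallest ID >= probe in a sorted posting list, or None."""
    i = bisect_left(postings, probe)
    return postings[i] if i < len(postings) else None


def prev_id(postings, probe):
    """Largest ID <= probe in a sorted posting list, or None."""
    i = bisect_right(postings, probe)
    return postings[i - 1] if i > 0 else None


def next_in_all(lists, probe):
    """Smallest ID >= probe that appears in every list (leapfrog intersection)."""
    while True:
        candidates = [next_id(postings, probe) for postings in lists]
        if None in candidates:
            return None
        highest = max(candidates)
        if all(c == highest for c in candidates):
            return highest
        probe = highest  # jump every list forward to the largest candidate


print(next_id(LOW, (0, 0, 3, 2, 2)))  # (1, 0, 0, 0, 0)
print(prev_id(LOW, (2, 0, 0, 0, 0)))  # (1, 3, 0, 0, 0)

# Hypothetical posting list for a second keyword, for illustration only.
OTHER = [(0, 0, 0, 1, 0), (0, 0, 3, 0, 0), (1, 0, 0, 0, 0), (1, 2, 0, 0, 0)]
print(next_in_all([LOW, OTHER], (0, 0, 1, 0, 0)))  # (0, 0, 3, 0, 0)
```

Binary search keeps each call logarithmic in the list length, which is what makes jumping ahead cheaper than scanning.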

One pass (for k = 2)

First finds 00000, 00010

Now knows Civic Green no longer helps!

Jumps by calling next(0.0.1.0.0)

Finds 00100, removes 00010

Now knows Civic no longer helps!

Jumps by calling next(0.1.0.0.0)

Finds 10000, removes 00100

Knows to stop
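The jump-ahead idea can be sketched as below. This is a simplified illustration and not the algorithm from the paper: the skip rule (a subtree is skipped once every child of its root holds at most one kept item and the path to it is already at least as crowded as its siblings) and the tie-break (drop from the later branch) are assumptions chosen so that the sketch follows the k = 2 walkthrough on the `Low` list, where the probe next(0.1.0.0.0) lands on 10000. All function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Inverted list for 'Low' from the slides, one tuple component per level.
LOW = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
       (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
       (1, 2, 0, 0, 0), (1, 3, 0, 0, 0)]


def next_id(index, probe):
    """Smallest ID >= probe in the sorted list, or None."""
    i = bisect_left(index, probe)
    return index[i] if i < len(index) else None


def child_counts(kept, prefix):
    """How many kept items fall under each child of the node named by prefix."""
    d = len(prefix)
    return Counter(item[d] for item in kept if item[:d] == prefix)


def drop_from_most_crowded(kept):
    """Remove one item by walking down the most crowded branch (ties: later branch)."""
    prefix = ()
    while len(prefix) < len(kept[0]):
        counts = child_counts(kept, prefix)
        prefix += (max(counts, key=lambda c: (counts[c], c)),)
    kept.remove(prefix)


def next_probe(kept, last, k):
    """ID for the next call to next_id, or None when no later item can help."""
    depth = len(last)
    if len(kept) == k:
        for d in range(depth):
            prefix = last[:d]
            counts = child_counts(kept, prefix)
            crowded = max(counts.values())
            if crowded <= 1:
                # Every child holds at most one kept item: skip this subtree.
                if d == 0:
                    return None  # true at the root, so stop entirely
                return prefix[:-1] + (prefix[-1] + 1,) + (0,) * (depth - d)
            if counts[last[d]] + 1 < crowded:
                break  # current branch is under-represented; keep reading it
    return last[:-1] + (last[-1] + 1,)


def one_pass(index, k):
    """Return (kept items, number of next calls) after a single forward pass."""
    kept, calls = [], 0
    probe = (0,) * len(index[0])
    while probe is not None:
        item = next_id(index, probe)
        calls += 1
        if item is None:
            break
        kept.append(item)
        if len(kept) > k:
            drop_from_most_crowded(kept)
        probe = next_probe(kept, item, k)
    return kept, calls


print(one_pass(LOW, 2))  # ([(0, 0, 0, 0, 0), (1, 0, 0, 0, 0)], 4)
```

On this list the sketch reads 00000 and 00010, jumps to 00100, jumps again to 10000, and stops after four calls to `next_id` instead of scanning all ten items.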

Unscored One-Pass Algorithm

Key step: deciding where to skip to

Remove 1st element in queue

One-Pass Algorithm (Cont.)

• Complexity: k ln^d(3k)

• Scored One Pass Algo: same algo as for unscored case, except:

– replace line 11 of the unscored one-pass algorithm with the line id = mergedList.next(id+1, skipId, root, minScore)

– The semantics of the above line is to return the smallest id greater than or equal to id+1 such that either score(id) > root.minScore, or score(id) >= root.minScore and the returned id is greater than skipId.
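The stated semantics of the scored `next` call can be sketched as a filtered forward scan. Everything here is an assumption made for illustration: the function name `scored_next`, the plain `min_score` argument standing in for `root.minScore`, the dictionary of scores, and the score values themselves. A real index would avoid the linear scan.

```python
from bisect import bisect_left


def scored_next(postings, scores, start, skip_id, min_score):
    """Smallest id >= start with score > min_score, or with
    score >= min_score and id > skip_id. Returns None if there is none."""
    i = bisect_left(postings, start)
    for item in postings[i:]:
        s = scores[item]
        if s > min_score or (s >= min_score and item > skip_id):
            return item
    return None


# Hypothetical posting list and scores, not taken from the talk.
POSTINGS = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0),
            (0, 0, 2, 0, 0), (1, 0, 0, 0, 0)]
SCORES = {POSTINGS[0]: 2, POSTINGS[1]: 2, POSTINGS[2]: 3,
          POSTINGS[3]: 2, POSTINGS[4]: 2}

# A higher-scoring item is returned even though it lies before skip_id.
print(scored_next(POSTINGS, SCORES, (0, 0, 0, 0, 1), (0, 0, 2, 0, 0), 2))
# (0, 0, 1, 0, 0)
```

The point of the two-part condition is that skipping is only safe for items that tie with the current minimum score; anything scoring strictly higher must still be seen.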

Probe (for k = 4)

Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to find first and last items

Discovers there are only 2 top-level categories

Wants another Honda

Calls prev(0.∞.∞.∞.∞)

Why not next(0.1.0.0.0)? If Honda has only one child, then will return a Toyota!

Finds 00310

Wants another Toyota

Calls next(1.0.0.0.0)

Finds 10000
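The four probes of this walkthrough can be replayed on the `Low` list as follows. This only replays the example and is not the general probing algorithm. It assumes the open-ended components in the `prev` calls stand for the largest possible value, modeled here with `float("inf")`; the helper names are hypothetical.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")  # stands in for "largest possible branch number"

# Inverted list for 'Low' from the slides, one tuple component per level.
LOW = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
       (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
       (1, 2, 0, 0, 0), (1, 3, 0, 0, 0)]


def next_id(index, probe):
    """Smallest ID >= probe, or None."""
    i = bisect_left(index, probe)
    return index[i] if i < len(index) else None


def prev_id(index, probe):
    """Largest ID <= probe, or None."""
    i = bisect_right(index, probe)
    return index[i - 1] if i > 0 else None


first = next_id(LOW, (0, 0, 0, 0, 0))                # first item, a Honda
last = prev_id(LOW, (INF,) * 5)                      # last item, a Toyota
more_honda = prev_id(LOW, (0, INF, INF, INF, INF))   # last Honda
more_toyota = next_id(LOW, (1, 0, 0, 0, 0))          # first Toyota

print([first, more_honda, more_toyota, last])
# [(0, 0, 0, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 3, 0, 0, 0)]

# Probing forward for a second Honda model overshoots into Toyota,
# because Honda has only one model among the matches.
print(next_id(LOW, (0, 1, 0, 0, 0)))  # (1, 0, 0, 0, 0)
```

Four probes produce four results here, within the bound of 2k probes stated for the probing algorithm.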

Unscored Probing Algorithm

• Invariant: Whenever id ∈ node, either id belongs to some child of node in our data structure, or node.edge[LEFT] <= id <= node.edge[RIGHT]

• Invariant: Let node be some node in our data structure, and suppose during the execution of the algorithm, we call node.getProbeId(), returning (probeId, dir). Then we have mergedList.next(probeId, dir) ∈ node.

• Theorem 2: The unscored probing algorithm given in Algorithms 2, 3 makes at most 2k calls to next.

Scored Probing (Cont.)

• Let minScore be the score of the lowest-scoring item in the top-k list returned. Diversity is only guaranteed among items whose score is minScore.

– The difficulty comes from not knowing the exact value of minScore.

Outline

• Definition of diversity

• Overview of our algorithms

• Our experimental results

Results

• Dataset consisted of listings from Yahoo! Autos

• Queries were synthetic to test various parameters

– Selectivity, # predicates, # results

• Preprocessing time for 100K listings < 5min

– Times shown are for 5K queries

• 4 algorithms

– Basic: No diversity

– Naïve: Fetch everything, post-process

– OnePass: Our algorithm. Takes just one pass over data

– Probe: Our algorithm. May make multiple probes into data

Comparable time for diversity search

[Figure: two panels, unscored and scored]

Basic: No diversity

Naïve: Many times slower

OnePass: Close to probe

Probe: Within factor 2 of no diversity

MultiQuery (not shown): Latency close to Basic, but throughput many times worse

Results summary

• Getting diverse results not too much slower than getting non-diverse results

– Many times faster than naïve approaches

• Multi-query approach has even worse throughput than naïve

– But keeps latency low

• How does this compare to getting extra results, then finding a diverse subset?

– Getting 2k results instead of k is about twice as slow

– Plus, does not guarantee diverse results

Conclusions

• Can get guaranteed diversity, taking time close to normal top-k query

– Almost as fast or faster than non-guaranteed results

– Diversity at every level

• Works even when items have scores

• Needs a different algorithm than traditional IR engines

– Proved this in paper (under standard notions)

• Are there approximate notions that can use existing IR machinery?
