Supporting Efficient Top-k Queries in Type-Ahead Search

Guoliang Li¹, Jiannan Wang¹, Chen Li², Jianhua Feng¹

¹ Tsinghua University
² UC Irvine, Bimaple Technology Inc.

SIGIR 2012, Portland, Oregon

Tsinghua/UC Irvine/Bimaple 2

Query suggestions

Li, Wang, Li, and Feng

Type-ahead search (instant search)

Finding answers instantly!

ipubmed.ics.uci.edu

Fuzzy search

Advantages of instant fuzzy search

Save time

Fat fingers!

Mobile friendly

Correct errors

Challenges

Speed
“100ms rule”
Prefix matching
Fuzzy matching
Quality

Contributions

Techniques for computing top-k answers in instant fuzzy search without generating all candidates

Ranking framework
Index structures
Algorithms
Experimental evaluation

Outline

Problem formulation
Instant exact search
Instant fuzzy search
Experiments

Problem Formulation

Data: records
Query: w1, w2, …, wm, where wm is a partial keyword
Answers: the k best records

Example query: graph icde li (the last keyword, li, is matched as a prefix)

ID   Record
r0   graph icdm
r1   graph group lui
r2   gray icde liu
r3   graph icde lin lui
r4   graph group icdm lin liu
r5   graph gray gross icdm lin liu
r6   gray group icdm lin liu
r7   gray gross group icde lin
r8   gross icde liu
r9   icdm liu

Ranking Framework

Record: graph, gray, gross, icde, lin, liu
Query: graph icde li

graph contributes Score(graph)
icde contributes Score(icde)
li matches lin and liu and contributes Max(Score(lin), Score(liu))

Aggregate combines the per-keyword scores into the record's score
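The scoring rule on this slide can be sketched as follows. This is a minimal illustration, assuming the aggregate is a sum (this slide only says "Aggregate") and using made-up weights; the function name `record_score` is hypothetical.

```python
def record_score(weights, query):
    """Score one record for a type-ahead query.

    weights: dict mapping each keyword in the record to its weight.
    query:   list of keywords; the last one is treated as a prefix.
    Returns None if some query keyword has no match in the record.
    """
    *complete, partial = query
    total = 0
    for kw in complete:
        if kw not in weights:
            return None
        total += weights[kw]  # Score(kw); summing is an assumed aggregate
    # The partial keyword contributes its best-scoring completion (Max).
    matches = [w for kw, w in weights.items() if kw.startswith(partial)]
    if not matches:
        return None
    return total + max(matches)


# Hypothetical weights for the record shown on the slide.
record = {"graph": 4, "gray": 1, "gross": 2, "icde": 3, "lin": 5, "liu": 6}
print(record_score(record, ["graph", "icde", "li"]))  # 4 + 3 + max(5, 6) = 13
```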

Index structures

Trie
[Figure: trie over the keywords graph, gray, gross, group, icde, icdm, lin, liu, lui]

Inverted Index
[Figure: records r0 to r9, the same table as in Problem Formulation]
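A rough sketch of the two structures over the sample records, assuming a nested-dict trie and inverted lists ordered by record ID (the slides do not show the list order; a score-sorted order would be needed for the top-k methods). The helper names are hypothetical.

```python
from collections import defaultdict

# Records r0 to r9 from the Problem Formulation slide.
records = {
    "r0": ["graph", "icdm"],
    "r1": ["graph", "group", "lui"],
    "r2": ["gray", "icde", "liu"],
    "r3": ["graph", "icde", "lin", "lui"],
    "r4": ["graph", "group", "icdm", "lin", "liu"],
    "r5": ["graph", "gray", "gross", "icdm", "lin", "liu"],
    "r6": ["gray", "group", "icdm", "lin", "liu"],
    "r7": ["gray", "gross", "group", "icde", "lin"],
    "r8": ["gross", "icde", "liu"],
    "r9": ["icdm", "liu"],
}


def build_inverted_index(records):
    """Map each keyword to the list of record IDs that contain it."""
    inverted = defaultdict(list)
    for rid, keywords in records.items():
        for kw in keywords:
            inverted[kw].append(rid)
    return dict(inverted)


def build_trie(keywords):
    """Nested-dict trie; the key "$" marks the end of a complete keyword."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node["$"] = kw
    return root


def completions(trie, prefix):
    """Return all complete keywords below the trie node for `prefix`."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "$":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


inverted = build_inverted_index(records)
trie = build_trie(inverted)
print(completions(trie, "li"))  # ['lin', 'liu']
print(inverted["icde"])         # ['r2', 'r3', 'r7', 'r8']
```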

Basic Solution

Query: {graph, icde, li}, k=1

[Figure: trie over the keywords; the query reaches the keywords graph, icde, lin, and liu]

Too many candidates

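Optimization 1: Heap-based Method

[Figure: the lists of graph and icde next to a Max Heap built over the lists of lin and liu; heap entries shown: (r, 9), (r5, 8); GetMax() reads the next entry from the heap; Aggregate combines the per-keyword scores]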
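One way to read the heap-based method: the inverted lists of the keywords that complete the prefix are merged lazily, so the prefix behaves like a single score-sorted list. The sketch below assumes each list is sorted by descending score and that a record found in several lists keeps its maximum score; the list contents and the name `merged_prefix_list` are hypothetical.

```python
import heapq


def merged_prefix_list(lists):
    """Lazily yield (record_id, score) in descending score order.

    lists: inverted lists, each sorted by descending score, one per
    keyword that completes the prefix. A record that occurs in several
    lists is reported once, with its maximum score.
    """
    heap = []
    for i, lst in enumerate(lists):
        if lst:
            rid, score = lst[0]
            # heapq is a min-heap, so scores are negated to get a max-heap.
            heapq.heappush(heap, (-score, rid, i, 0))
    seen = set()
    while heap:
        neg, rid, i, pos = heapq.heappop(heap)  # plays the role of GetMax()
        if pos + 1 < len(lists[i]):
            next_rid, next_score = lists[i][pos + 1]
            heapq.heappush(heap, (-next_score, next_rid, i, pos + 1))
        if rid not in seen:
            seen.add(rid)
            yield rid, -neg


# Hypothetical score-sorted inverted lists for "lin" and "liu".
lin = [("r3", 9), ("r6", 5), ("r4", 2)]
liu = [("r5", 8), ("r4", 7), ("r2", 3)]
print(list(merged_prefix_list([lin, liu])))
# [('r3', 9), ('r5', 8), ('r4', 7), ('r6', 5), ('r2', 3)]
```

Because the generator is lazy, a caller that needs only the best few records never touches the tails of the lists.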

Optimization 2: Top-k List-Merging Algorithm

Example: Threshold algorithm

Sorted Access
Random Access
Early termination

[Figure: threshold T = 15; scores shown: 17, 14, 12, 12]
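A compact sketch of the threshold algorithm (TA) named on this slide, assuming a sum aggregate, a score of 0 for a keyword missing from a record, and dictionary lookups standing in for random access. The input lists and the function name are hypothetical.

```python
import heapq


def threshold_algorithm(lists, k):
    """Top-k records by summed score using the threshold algorithm.

    lists: one non-empty collection of lists, one list per query keyword,
           each holding (record_id, score) pairs sorted by descending score.
    """
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    top = []                               # min-heap of (total, record_id)
    seen = set()
    for depth in range(max(len(lst) for lst in lists)):
        last = []
        for lst in lists:                  # sorted access, one step per list
            if depth < len(lst):
                rid, score = lst[depth]
                last.append(score)
                if rid not in seen:
                    seen.add(rid)
                    total = sum(d.get(rid, 0) for d in lookup)
                    heapq.heappush(top, (total, rid))
                    if len(top) > k:
                        heapq.heappop(top)
            else:
                last.append(0)             # exhausted list contributes 0
        threshold = sum(last)              # best score any unseen record can have
        if len(top) == k and top[0][0] >= threshold:
            break                          # early termination
    return sorted(top, reverse=True)


# Hypothetical score-sorted lists for graph, icde, and the prefix li.
graph = [("r4", 7), ("r3", 4), ("r1", 3), ("r0", 2)]
icde = [("r0", 3), ("r2", 2), ("r3", 2)]
li = [("r3", 9), ("r4", 7), ("r2", 3)]
print(threshold_algorithm([graph, icde, li], k=1))  # [(15, 'r3')]
```

In this run the algorithm stops after two rounds of sorted access, because the best total seen (15) already meets the threshold (4 + 2 + 7 = 13).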

Efficient Random Access: How?

ID    “graph”    “icde”    “li”
      7          0         ?

[Figure: trie over the keywords graph, gray, gross, group, icde, icdm, lin, liu, lui, and the records table (r0 to r4 shown, rest elided)]

Forward index [Ji et al. WWW’09]

ID    “graph”    “icde”    “li”
      7          0         ?

Each forward-list entry is <Keyword ID, Weight>.

ID   Forward list
r0   <1, 2> <5, 3>
r1   <1, 3> <1, 9> <9, 6>
r2   <2, 9> <5, 2> <8, 3>
r3   <1, 4> <5, 2> <7, 9> <9, 4>
r4   <1, 7> <4, 3> <6, 9> <7, 2> <8, 7>
…    …

[Figure: trie whose leaf nodes are numbered with keyword IDs 1 to 9 and whose nodes are labeled with keyword-ID ranges: [1, 1], [1, 2], [3, 3], [4, 4], [3, 4], [1, 4], [1, 4], [2, 2], [5, 6], [5, 6], [5, 6], [7, 8], [7, 9], [9, 9]]

Random Access Using Forward Index

ID    “graph”    “icde”    “li”
      7          0         ? → 7

[Figure: the same trie with keyword IDs and keyword-ID ranges, and the same forward lists of r0 to r4, as on the slide “Forward index [Ji et al. WWW’09]”]
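The lookup on this slide can be sketched as a binary search in a record's forward list for the keyword-ID range of a trie node. The sketch assumes forward lists sorted by keyword ID, a score of 0 when nothing matches, and the ranges [1, 1] for graph, [5, 5] for icde, and [7, 8] for li as read off the trie figure; `best_in_range` is a hypothetical name.

```python
from bisect import bisect_left


def best_in_range(forward_list, lo, hi):
    """Max weight among entries whose keyword ID lies in [lo, hi], else 0.

    forward_list: (keyword_id, weight) pairs sorted by keyword ID.
    """
    # (lo,) sorts before every (lo, weight), so this finds the first ID >= lo.
    start = bisect_left(forward_list, (lo,))
    best = 0
    for kid, weight in forward_list[start:]:
        if kid > hi:
            break
        best = max(best, weight)
    return best


# Forward list of r4 from the slide; ranges are assumed as described above.
r4 = [(1, 7), (4, 3), (6, 9), (7, 2), (8, 7)]
ranges = {"graph": (1, 1), "icde": (5, 5), "li": (7, 8)}
print({kw: best_in_range(r4, lo, hi) for kw, (lo, hi) in ranges.items()})
# {'graph': 7, 'icde': 0, 'li': 7}
```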

Outline

Problem formulation
Instant exact search
Instant fuzzy search
Experiments

Ranking Framework (Fuzzy matching)

Record: graph, gray, icdm, gross, lin, liu
Query: graph icde li

graph: Score(graph)
icde: Sim(icde,icdm)*Score(icdm)
li: Max over Score(lin), Score(liu); similarity-weighted term shown: Sim(li,i)*Score(lin)

Aggregate combines the per-keyword scores into the record's score
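A small sketch of fuzzy scoring for one record, assuming a sum aggregate and taking the similar data keywords and their similarities as given inputs (the slides do not define Sim). All weights, similarity values, and names below are hypothetical.

```python
def fuzzy_record_score(weights, query_matches):
    """Sum, over query keywords, of the best Sim * Score match in the record.

    weights:       keyword -> weight for one record.
    query_matches: query keyword -> {data keyword: similarity}, i.e. the
                   data keywords judged similar to it (computed elsewhere).
    """
    total = 0.0
    for similar in query_matches.values():
        total += max(
            (sim * weights[kw] for kw, sim in similar.items() if kw in weights),
            default=0.0,
        )
    return total


# Hypothetical weights and similarities.
record = {"graph": 4, "gray": 1, "icdm": 6, "gross": 2, "lin": 5, "liu": 3}
matches = {
    "graph": {"graph": 1.0},
    "icde": {"icde": 1.0, "icdm": 0.5},
    "li": {"lin": 1.0, "liu": 1.0, "lui": 0.5},
}
print(fuzzy_record_score(record, matches))  # 4 + 0.5 * 6 + max(5, 3) = 12.0
```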

Computing Similar Prefixes [Ji et al. WWW’09]

Query: {graph, icde, li}, similarity threshold τ=0.45

[Figure: trie over the keywords graph, gray, gross, group, icde, icdm, lin, liu, lui]

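Top-k Algorithm

Query: graph icde li

[Figure: max heaps over the inverted lists, each read with GetMax(); the per-keyword results are combined by sum. One heap covers the lists of icde, icdm, lin, liu, and lui, another covers the lists of icde and icdm, and graph has its own list. List scores are scaled by similarity factors of ×1 and ×0.5. Heap entries shown: (r3, 9), (r5, 8), (r4, 4.5), (r5, 9).]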

Efficient Random Access (method 1): Probing on Forward Lists

ID    “graph”    “icde”    “li”
      7          9         ?

Binary Search: [5,6], [7,9], [7,8], [9,9], 7, 8, 9

[Figure: the same trie with keyword IDs and keyword-ID ranges, and the same forward lists of r0 to r4, as on the slide “Forward index [Ji et al. WWW’09]”]
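Method 1 might look like the following: each similar prefix gives a keyword-ID range, and each range is located in the record's forward list by binary search. The ranges are taken from the slide's binary-search list, but the similarity attached to each range is an assumption, as is the name `probe`.

```python
from bisect import bisect_left


def probe(forward_list, similar_ranges):
    """Best sim * weight over several keyword-ID ranges, via binary search.

    forward_list:   (keyword_id, weight) pairs sorted by keyword ID.
    similar_ranges: (lo, hi, sim) triples, one per similar prefix node.
    """
    best = 0.0
    for lo, hi, sim in similar_ranges:
        i = bisect_left(forward_list, (lo,))  # first entry with ID >= lo
        while i < len(forward_list) and forward_list[i][0] <= hi:
            best = max(best, sim * forward_list[i][1])
            i += 1
    return best


# Forward list of r4 from the slide; similarities per range are hypothetical.
r4 = [(1, 7), (4, 3), (6, 9), (7, 2), (8, 7)]
li_ranges = [(5, 6, 0.5), (7, 8, 1.0), (9, 9, 0.5)]
print(probe(r4, li_ranges))  # max(0.5 * 9, 1.0 * 2, 1.0 * 7) = 7.0
```

The cost is one binary search per similar prefix, so it grows with the number of similar prefixes rather than with the length of the inverted lists.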

Efficient Random Access (method 2): Probing on Trie Leaf Nodes

ID    “graph”    “icde”    “li”
      7          9         ?

[Figure: the same trie with keyword IDs and keyword-ID ranges, and the same forward lists of r0 to r4, as on the slide “Forward index [Ji et al. WWW’09]”; leaf nodes are annotated with the matching query keyword and its similarity: (li, 1) on two leaves and (li, 0.5) on three leaves]

Traverse the forward list of the record
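Method 2 can be sketched as a single pass over the record's forward list, reading the similarity that was stored on each trie leaf for the current query. The slide shows the annotations (li, 1) twice and (li, 0.5) three times; which leaf carries which value is an assumption here, and `probe_leaves` is a hypothetical name.

```python
def probe_leaves(forward_list, leaf_sims, query_keyword):
    """Scan a forward list, reading similarities stored on the trie leaves.

    leaf_sims: keyword_id -> {query keyword: similarity}, filled in once
               per query for the leaves under each similar prefix.
    """
    best = 0.0
    for kid, weight in forward_list:
        sim = leaf_sims.get(kid, {}).get(query_keyword)
        if sim is not None:
            best = max(best, sim * weight)
    return best


# Forward list of r4 from the slide; the leaf assignment is assumed.
r4 = [(1, 7), (4, 3), (6, 9), (7, 2), (8, 7)]
leaf_sims = {
    5: {"li": 0.5},
    6: {"li": 0.5},
    7: {"li": 1.0},
    8: {"li": 1.0},
    9: {"li": 0.5},
}
print(probe_leaves(r4, leaf_sims, "li"))  # max(0.5 * 9, 1.0 * 2, 1.0 * 7) = 7.0
```

Compared with method 1, this trades the per-prefix binary searches for one linear scan of the forward list.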

Optimization by materializing union lists

[Figure: trie over the keywords graph, gray, gross, group, icde, icdm, lin, liu, lui]

Time/space tradeoff
Cost-based analysis for a space budget
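A union list for a trie node can be pictured as the precomputed merge of the inverted lists below that node. The sketch assumes a record keeps its maximum score across the merged lists; it does not model the cost-based choice of which nodes to materialize, and the list contents and function name are hypothetical.

```python
def materialize_union(inverted, keywords):
    """Precompute one score-sorted list for all keywords under a trie node.

    inverted: keyword -> list of (record_id, score).
    keywords: the complete keywords below the node (e.g. under "li").
    """
    best = {}
    for kw in keywords:
        for rid, score in inverted.get(kw, []):
            if score > best.get(rid, float("-inf")):
                best[rid] = score
    # Descending score, ties broken by record ID for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical inverted lists for the keywords under the prefix "li".
inverted = {
    "lin": [("r3", 9), ("r6", 5), ("r4", 2)],
    "liu": [("r5", 8), ("r4", 7), ("r2", 3)],
}
union_li = materialize_union(inverted, ["lin", "liu"])
print(union_li)
# [('r3', 9), ('r5', 8), ('r4', 7), ('r6', 5), ('r2', 3)]
```

The materialized list removes the heap work at query time, at the price of storing up to one extra entry per record below the node, which is the time/space tradeoff the slide refers to.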

Outline

Problem formulation
Instant exact search
Instant fuzzy search
Experiments

Data sets and index costs

Exact Search (DBLP)

k=10, similarity threshold τ=0.6

[Figures: result plots on two consecutive slides with the same title and settings]

Fuzzy Search

DBLP, k=10, similarity threshold τ=0.6

[Figure: result plot; labels shown: TA, NRA]

Other results (not included in the paper)

More general ranking (e.g., positional information)
Other languages
Location-based search

Conclusions (ipubmed.ics.uci.edu)

Efficient techniques for instant fuzzy search

Acknowledgements

The authors have financial interest in Bimaple Technology Inc., a company currently commercializing some of the techniques described in this publication.

Chen Li was partially supported by NIH grant 1R21LM010143-01A1.

Guoliang Li, Jiannan Wang, and Jianhua Feng were partly supported by the National Natural Science Foundation of China under Grant No. 61003004, the National Grand Fundamental Research 973 Program of China under Grant No. 2011CB302206, a project of Tsinghua University under Grant No. 20111081073, and the “NExT Research Center” funded by MDA, Singapore, under Grant No. WBS:R-252-300-001-490.
