Efficient Approximate Search on String Collections Part I

Efficient Approximate Search on String CollectionsPart I

Marios Hadjieleftheriou Chen Li

1

http://www.uci.edu/

DBLP Author Search

http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 2

Try their names (good luck!)

http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 3

Yannis Papakonstantinou Meral Ozsoyoglu Marios Hadjieleftheriou

UCSD Case Western AT&T--Research

4

Better system?

5

http://dblp.ics.uci.edu/authors/

People Search at UC Irvine

6

http://psearch.ics.uci.edu/

Web Search

http://www.google.com/jobs/britney.html

Errors in queries

Errors in data

Bring query and meaningful

results closer togetherActual queries gathered by Google

7

Data Cleaning

R

informix

microsoft

…

…

S

infromix

…mcrosoft

…

8

Problem Formulation

Find strings similar to a given string: dist(Q,D) <= δ

Example: find strings similar to “hadjeleftheriou”

Performance is important!-10 ms: 100 queries per second (QPS)- 5 ms: 200 QPS

9

Outline

Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion

10

Part I

Part II

Preliminaries

11

Next…

Similarity Functions Similar to:

a domain-specific function returns a similarity value between two strings

Examples: Edit distance Hamming distance Jaccard similarity Soundex TF/IDF, BM25, DICE See [KSS06] for an excellent survey

12

13

A widely used metric to define string similarityEd(s1,s2) = minimum # of operations (insertion,

deletion, substitution) to change s1 to s2Example:

s1: Tom Hanks

s2: Ton Hank

ed(s1,s2) = 2

Edit Distance

13

Gram-based algorithmsList-merging algorithms [LLL08]Variable-length grams (VGRAM)

[LWY07,YWL08]

14

Next…

“q-grams” of strings

u n i v e r s a l

2-grams

15

Edit operation’s effect on grams

k operations could affect k * q grams

u n i v e r s a lFixed length: q

16

If ed(s1,s2) <= k, then their # of common grams >=

(|s1|- q + 1) – k * q

q-gram inverted lists

id strings01234

richstickstichstuckstatic

4

2 30

1 4

2-grams

atchckicristtatituuc

201 30 1 2 4

41 2 433 17

# of common grams >= 3

Searching using inverted lists Query: “shtick”, ED(shtick, ?)≤1

id strings01234


2-grams


4

2 30

1 4

201 30 1 2 4

41 2 433

ti ic cksh ht ti ic ck

18

T-occurrence Problem

Find elements whose occurrences ≥ T

Ascending

order

Ascending

order

MergeMerge

19

Example T = 4

Result: 13

1

3

5

10

13

10

13

15

5

7

13

13 15

20

List-Merging Algorithms

HeapMerger MergeOpt

[SK04]

[LLL08, BK02]

ScanCount MergeSkip DivideSkip

21

Heap-based Algorithm

Min-heap

Count # of occurrences of each element using a heap

Push to heap ……

22

MergeOpt Algorithm [SK04]

Long Lists: T-1 Short Lists

Binary

search

23

Example of MergeOpt

1

3

5

10

13

10

13

15

5

7

13

13 15

Count threshold T≥ 4

Long Lists: 3Short Lists: 2

24

2525

ScanCount

1 2 3

…

1

3

5

10

13

10

13

15

5

7

13

13 15


# of occurrences# of occurrences

00

00

00

44

11

Increment by 1Increment by 111

String idsString ids

1313

1414

1515

00

22

00

00

Result!Result!

List-Merging Algorithms

HeapMerger MergeOpt

ScanCount MergeSkip DivideSkip

[SK04]

[LLL08, BK02]

26

MergeSkip algorithm [BK02, LLL08]

Min-heap ……Pop T-1

T-1

Jump Greater or

equals

Greater or

equals

27

Example of MergeSkip

1

3

5

10

10

15

5

7

13 15


minHeap10

13 15

15

JumpJump

15151515

13131313

17171717

28

DivideSkip Algorithm [LLL08]

Long Lists Short Lists

Binary

searchMergeSkip

29

How many lists are treated as long lists?

30

Length Filtering

Ed(s,t) ≤ 2

s: s:

t: t:

Length: 19Length: 19

Length: 10Length: 10

By length only!By length only!

31

Positional Filtering

a b

a b

Ed(s,t) ≤ 2

s

t

(ab,1)

(ab,12)

32

Combine filters with list-merging algorithms [LLL08]

33

A filter tree

Variable-length grams (VGRAM) [LWY07,YWL08]

34

Next…

# of common grams >= 1

2-grams -> 3-grams? Query: “shtick”, ED(shtick, ?)≤1

id strings01234


3-grams

atiichickricstastistutattictucuck

tic icksht hti tic ick

4

24

1

2010

3413

42

3

id strings01234


id strings01234


35

Observation 1: dilemma of choosing “q”

Increasing “q” causing: Longer grams Shorter lists Smaller # of common grams of similar strings

id strings01234


4

2 30

1 4

2-grams


201 30 1 2 4

41 2 433 36

Observation 2: skew distributions of gram frequencies DBLP: 276,699 article titles Popular 5-grams: ation (>114K times), tions, ystem, catio

37

VGRAM: Main idea

Grams with variable lengths (between qmin and qmax) zebra

- ze(123) corrasion

- co(5213), cor(859), corr(171) Advantages

Reduce index size Reducing running time Adoptable by many algorithms

38

Challenges

Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their

gram-set similarity? Adopting VGRAM in existing algorithms?

39

Challenge 1: String Variable-length grams?

Fixed-length 2-grams

Variable-length grams

u n i v e r s a l

niivrsalunivers

[2,4]-gram dictionaryu n i v e r s a l

40

Representing gram dictionary as a trie

niivrsalunivers

41

42

Step 2: Constructing a gram dictionary

qmin=2

qmax=4

Frequency-based [LYW07] Cost-based [YLW08]

Challenge 3: Edit operation’s effect on grams

k operations could affect k * q grams

u n i v e r s a lFixed length: q

43

Deletion affects variable-length grams

i-qmax+1 i+qmax- 1Deletion

Not affected Not affectedAffected

i

44

Main idea

For a string, for each position, compute the number of grams that could be destroyed by an operation at this position

Compute number of grams possibly destroyed by k operations Store these numbers (for all data strings) as part of the index

45

Vector of s = <2,4,6,8,9>

With 2 edit operations, at most 4 grams can be affected

Use this number to do count filtering

Summary of VGRAM index

46

Challenge 4: adopting VGRAM

Easily adoptable by many algorithms

Basic interfaces: String s grams String s1, s2 such that ed(s1,s2) <= k min

# of their common grams

47

Lower bound on # of common grams

If ed(s1,s2) <= k, then their # of common grams >=:

(|s1|- q + 1) – k * q

u n i v e r s a l

Fixed length (q)

Variable lengths: # of grams of s1 – NAG(s1,k)

48

Example: algorithm using inverted lists

Query: “shtick”, ED(shtick, ?)≤1

1 2 4

1 210 4

3

…ckic…ti…

Lower bound = 3

Lower bound = 1

sh ht tick

2 4

1

4

0

11

2

3

…ckicich…tictick…

2-4 grams2-gramstick

id strings01234


id strings01234


id strings01234


49

End of part I

Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion

50

Part I

Part II

References [AGK06] Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav

Kaushik .VLDB 2006 [ACGK08] Incorporating string transformations in record matching. Arvind Arasu,

Surajit Chaudhuri, Kris Ganjam, Raghav Kaushik. SIGMOD 2008 [BK02] Adaptive intersection and t-threshold problems. Jérémy Barbay, Claire

Kenyon. SODA 2002 [BJL+09] Space-Constrained Gram-Based Indexing for Efficient Approximate String

Search. Alexander Behm, Shengyue Ji, Chen Li, and Jiaheng Lu. ICDE 2009 [BCFM98] Min-Wise Independent Permutations. Andrei Z. Broder, Moses Charikar,

Alan M. Frieze, Michael Mitzenmacher. STOC 1998 [CGG+05]Data cleaning in microsoft SQL server 2005. Surajit Chaudhuri, Kris

Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis. SIGMOD 2005

[CGK06] A Primitive Operator for Similarity Joins in Data Cleaning. Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. ICDE06

[CCGX08] An Efficient Filter for Approximate Membership Checking. Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Dong Xin. SIGMOD08

[HCK+08] Fast Indexes and Algorithms for Set Similarity Selection Queries. Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava. ICDE 2008

51

References [HYK+08] Hashed samples: selectivity estimators for set similarity selection queries.

Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava. PVLDB 2008. [JL05] Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. Liang

Jin, Chen Li. VLDB 2005. [JLL+09] Efficient Interactive Fuzzy Keyword Search. Shengyue Ji, Guoliang Li, Chen

Li, and Jianhua Feng. WWW 2009 [JLV08] SEPIA: Estimating Selectivities of Approximate String Predicates in Large

Databases. Liang Jin, Chen Li, Rares Vernica. VLDBJ08 [KSS06] Record linkage: Similarity measures and algorithms. Nick Koudas, Sunita

Sarawagi, Divesh Srivastava. SIGMOD 2006. [LLL08] Efficient Merging and Filtering Algorithms for Approximate String Searches.

Chen Li, Jiaheng Lu, and Yiming Lu. ICDE 2008. [LNS07] Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit

Distance. Hongrae Lee, Raymond T. Ng, Kyuseok Shim. VLDB 2007 [LWY07] VGRAM: Improving Performance of Approximate Queries on String

Collections Using Variable-Length Grams, Chen Li, Bin Wang, and Xiaochun Yang. VLDB 2007

[MBK+07] Estimating the selectivity of approximate string queries. Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava. ACM TODS 2007

52

References [SK04] Efficient set joins on similarity predicates. Sunita Sarawagi, Alok Kirpal.

SIGMOD 2004 [XWL08] Ed-Join: an efficient algorithm for similarity joins with edit distance

constraints. Chuan Xiao, Wei Wang, Xuemin Lin. PVLDB 2008 [XWL+08] Efficient similarity joins for near duplicate detection. Chuan Xiao, Wei

Wang, Xuemin Lin, Jeffrey Xu Yu. WWW 2008 [YWL08] Cost-Based Variable-Length-Gram Selection for String Collections to

Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, and Chen Li. SIGMOD 2008

53

Documents

Efficient Approximate Search on String Collections Part I