Upload
clover
View
36
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Efficient Approximate Search on String Collections Part I. Marios Hadjieleftheriou. Chen Li. DBLP Author Search. http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html. Try their names (good luck!). http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html. UCSD. - PowerPoint PPT Presentation
Citation preview
Efficient Approximate Search on String CollectionsPart I
Marios Hadjieleftheriou Chen Li
1
DBLP Author Search
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 2
Try their names (good luck!)
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 3
Yannis Papakonstantinou Meral Ozsoyoglu Marios Hadjieleftheriou
UCSD Case Western AT&T--Research
4
Better system?
5
http://dblp.ics.uci.edu/authors/
People Search at UC Irvine
6
http://psearch.ics.uci.edu/
Web Search
http://www.google.com/jobs/britney.html
Errors in queries
Errors in data
Bring query and meaningful
results closer togetherActual queries gathered by Google
7
Data Cleaning
R
informix
microsoft
…
…
S
infromix
…mcrosoft
…
8
Problem Formulation
Find strings similar to a given string: dist(Q,D) <= δ
Example: find strings similar to “hadjeleftheriou”
Performance is important!-10 ms: 100 queries per second (QPS)- 5 ms: 200 QPS
9
Outline
Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion
10
Part I
Part II
Preliminaries
11
Next…
Similarity Functions Similar to:
a domain-specific function returns a similarity value between two strings
Examples: Edit distance Hamming distance Jaccard similarity Soundex TF/IDF, BM25, DICE See [KSS06] for an excellent survey
12
13
A widely used metric to define string similarityEd(s1,s2) = minimum # of operations (insertion,
deletion, substitution) to change s1 to s2Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1,s2) = 2
Edit Distance
13
Gram-based algorithmsList-merging algorithms [LLL08]Variable-length grams (VGRAM)
[LWY07,YWL08]
14
Next…
“q-grams” of strings
u n i v e r s a l
2-grams
15
Edit operation’s effect on grams
k operations could affect k * q grams
u n i v e r s a lFixed length: q
16
If ed(s1,s2) <= k, then their # of common grams >=
(|s1|- q + 1) – k * q
q-gram inverted lists
id strings01234
richstickstichstuckstatic
4
2 30
1 4
2-grams
atchckicristtatituuc
201 30 1 2 4
41 2 433 17
# of common grams >= 3
Searching using inverted lists Query: “shtick”, ED(shtick, ?)≤1
id strings01234
richstickstichstuckstatic
2-grams
atchckicristtatituuc
4
2 30
1 4
201 30 1 2 4
41 2 433
ti ic cksh ht ti ic ck
18
T-occurrence Problem
Find elements whose occurrences ≥ T
Ascending
order
Ascending
order
MergeMerge
19
Example T = 4
Result: 13
1
3
5
10
13
10
13
15
5
7
13
13 15
20
List-Merging Algorithms
HeapMerger MergeOpt
[SK04]
[LLL08, BK02]
ScanCount MergeSkip DivideSkip
21
Heap-based Algorithm
Min-heap
Count # of occurrences of each element using a heap
Push to heap ……
22
MergeOpt Algorithm [SK04]
Long Lists: T-1 Short Lists
Binary
search
23
Example of MergeOpt
1
3
5
10
13
10
13
15
5
7
13
13 15
Count threshold T≥ 4
Long Lists: 3Short Lists: 2
24
2525
ScanCount
1 2 3
…
1
3
5
10
13
10
13
15
5
7
13
13 15
Count threshold T≥ 4
# of occurrences# of occurrences
00
00
00
44
11
Increment by 1Increment by 111
String idsString ids
1313
1414
1515
00
22
00
00
Result!Result!
List-Merging Algorithms
HeapMerger MergeOpt
ScanCount MergeSkip DivideSkip
[SK04]
[LLL08, BK02]
26
MergeSkip algorithm [BK02, LLL08]
Min-heap ……Pop T-1
T-1
Jump Greater or
equals
Greater or
equals
27
Example of MergeSkip
1
3
5
10
10
15
5
7
13 15
Count threshold T≥ 4
minHeap10
13 15
15
JumpJump
15151515
13131313
17171717
28
DivideSkip Algorithm [LLL08]
Long Lists Short Lists
Binary
searchMergeSkip
29
How many lists are treated as long lists?
30
Length Filtering
Ed(s,t) ≤ 2
s: s:
t: t:
Length: 19Length: 19
Length: 10Length: 10
By length only!By length only!
31
Positional Filtering
a b
a b
Ed(s,t) ≤ 2
s
t
(ab,1)
(ab,12)
32
Combine filters with list-merging algorithms [LLL08]
33
A filter tree
Variable-length grams (VGRAM) [LWY07,YWL08]
34
Next…
# of common grams >= 1
2-grams -> 3-grams? Query: “shtick”, ED(shtick, ?)≤1
id strings01234
richstickstichstuckstatic
3-grams
atiichickricstastistutattictucuck
tic icksht hti tic ick
4
24
1
2010
3413
42
3
id strings01234
richstickstichstuckstatic
id strings01234
richstickstichstuckstatic
35
Observation 1: dilemma of choosing “q”
Increasing “q” causing: Longer grams Shorter lists Smaller # of common grams of similar strings
id strings01234
richstickstichstuckstatic
4
2 30
1 4
2-grams
atchckicristtatituuc
201 30 1 2 4
41 2 433 36
Observation 2: skew distributions of gram frequencies DBLP: 276,699 article titles Popular 5-grams: ation (>114K times), tions, ystem, catio
37
VGRAM: Main idea
Grams with variable lengths (between qmin and qmax) zebra
- ze(123) corrasion
- co(5213), cor(859), corr(171) Advantages
Reduce index size Reducing running time Adoptable by many algorithms
38
Challenges
Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their
gram-set similarity? Adopting VGRAM in existing algorithms?
39
Challenge 1: String Variable-length grams?
Fixed-length 2-grams
Variable-length grams
u n i v e r s a l
niivrsalunivers
[2,4]-gram dictionaryu n i v e r s a l
40
Representing gram dictionary as a trie
niivrsalunivers
41
42
Step 2: Constructing a gram dictionary
qmin=2
qmax=4
Frequency-based [LYW07] Cost-based [YLW08]
Challenge 3: Edit operation’s effect on grams
k operations could affect k * q grams
u n i v e r s a lFixed length: q
43
Deletion affects variable-length grams
i-qmax+1 i+qmax- 1Deletion
Not affected Not affectedAffected
i
44
Main idea
For a string, for each position, compute the number of grams that could be destroyed by an operation at this position
Compute number of grams possibly destroyed by k operations Store these numbers (for all data strings) as part of the index
45
Vector of s = <2,4,6,8,9>
With 2 edit operations, at most 4 grams can be affected
Use this number to do count filtering
Summary of VGRAM index
46
Challenge 4: adopting VGRAM
Easily adoptable by many algorithms
Basic interfaces: String s grams String s1, s2 such that ed(s1,s2) <= k min
# of their common grams
47
Lower bound on # of common grams
If ed(s1,s2) <= k, then their # of common grams >=:
(|s1|- q + 1) – k * q
u n i v e r s a l
Fixed length (q)
Variable lengths: # of grams of s1 – NAG(s1,k)
48
Example: algorithm using inverted lists
Query: “shtick”, ED(shtick, ?)≤1
1 2 4
1 210 4
3
…ckic…ti…
Lower bound = 3
Lower bound = 1
sh ht tick
2 4
1
4
0
11
2
3
…ckicich…tictick…
2-4 grams2-gramstick
id strings01234
richstickstichstuckstatic
id strings01234
richstickstichstuckstatic
id strings01234
richstickstichstuckstatic
49
End of part I
Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion
50
Part I
Part II
References [AGK06] Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav
Kaushik .VLDB 2006 [ACGK08] Incorporating string transformations in record matching. Arvind Arasu,
Surajit Chaudhuri, Kris Ganjam, Raghav Kaushik. SIGMOD 2008 [BK02] Adaptive intersection and t-threshold problems. Jérémy Barbay, Claire
Kenyon. SODA 2002 [BJL+09] Space-Constrained Gram-Based Indexing for Efficient Approximate String
Search. Alexander Behm, Shengyue Ji, Chen Li, and Jiaheng Lu. ICDE 2009 [BCFM98] Min-Wise Independent Permutations. Andrei Z. Broder, Moses Charikar,
Alan M. Frieze, Michael Mitzenmacher. STOC 1998 [CGG+05]Data cleaning in microsoft SQL server 2005. Surajit Chaudhuri, Kris
Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis. SIGMOD 2005
[CGK06] A Primitive Operator for Similarity Joins in Data Cleaning. Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. ICDE06
[CCGX08] An Efficient Filter for Approximate Membership Checking. Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Dong Xin. SIGMOD08
[HCK+08] Fast Indexes and Algorithms for Set Similarity Selection Queries. Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava. ICDE 2008
51
References [HYK+08] Hashed samples: selectivity estimators for set similarity selection queries.
Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava. PVLDB 2008. [JL05] Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. Liang
Jin, Chen Li. VLDB 2005. [JLL+09] Efficient Interactive Fuzzy Keyword Search. Shengyue Ji, Guoliang Li, Chen
Li, and Jianhua Feng. WWW 2009 [JLV08] SEPIA: Estimating Selectivities of Approximate String Predicates in Large
Databases. Liang Jin, Chen Li, Rares Vernica. VLDBJ08 [KSS06] Record linkage: Similarity measures and algorithms. Nick Koudas, Sunita
Sarawagi, Divesh Srivastava. SIGMOD 2006. [LLL08] Efficient Merging and Filtering Algorithms for Approximate String Searches.
Chen Li, Jiaheng Lu, and Yiming Lu. ICDE 2008. [LNS07] Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit
Distance. Hongrae Lee, Raymond T. Ng, Kyuseok Shim. VLDB 2007 [LWY07] VGRAM: Improving Performance of Approximate Queries on String
Collections Using Variable-Length Grams, Chen Li, Bin Wang, and Xiaochun Yang. VLDB 2007
[MBK+07] Estimating the selectivity of approximate string queries. Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava. ACM TODS 2007
52
References [SK04] Efficient set joins on similarity predicates. Sunita Sarawagi, Alok Kirpal.
SIGMOD 2004 [XWL08] Ed-Join: an efficient algorithm for similarity joins with edit distance
constraints. Chuan Xiao, Wei Wang, Xuemin Lin. PVLDB 2008 [XWL+08] Efficient similarity joins for near duplicate detection. Chuan Xiao, Wei
Wang, Xuemin Lin, Jeffrey Xu Yu. WWW 2008 [YWL08] Cost-Based Variable-Length-Gram Selection for String Collections to
Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, and Chen Li. SIGMOD 2008
53