Speaker: Alexander Behm
Space-Constrained Gram-Based Indexing for Efficient
Approximate String Search
Alexander Behm (1), Shengyue Ji (1), Chen Li (1), Jiaheng Lu (2)
(1) University of California, Irvine; (2) Renmin University of China
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai
Motivation: Data Cleaning
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
Should clearly be “Niels Bohr”
Motivation: Record Linkage
Table 1 (columns: Name, Hobbies, Address), names: Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzenegger
Table 2 (columns: Phone, Age, Name), names: Brad Pitt, Arnold Schwarzeneger, George Bush, Angelina Jolie, Forrest Whittaker
No exact match!
Motivation: Query Relaxation
http://www.google.com/jobs/britney.html
Actual queries gathered by Google
What is Approximate String Search?
Query against collection: Find entries similar to “Arnold Schwarseneger”
What do we mean by similar to?
- Edit Distance
- Jaccard Similarity
- Cosine Similarity
- Dice
- Etc.
How can we support these types of queries efficiently?
String Collection: Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzenegger, …
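To make two of the listed measures concrete, here is a minimal sketch of edit distance and Jaccard similarity. The function names are illustrative and not taken from the talk, and applying Jaccard to sets (for example, sets of grams) is an assumption about how the measure would be used here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


def jaccard(x: set, y: set) -> float:
    """Jaccard similarity |x & y| / |x | y|; two empty sets count as identical."""
    union = x | y
    return 1.0 if not union else len(x & y) / len(union)


# The misspelled query is two edits away from the collection entry.
distance = edit_distance("Arnold Schwarseneger", "Arnold Schwarzenegger")
```

With these definitions the query string is within edit distance 2 of “Arnold Schwarzenegger” (one substitution, one insertion).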
Approximate Query Answering
“irvine” → 2-grams {ir, rv, vi, in, ne} (sliding window)
Intuition: Similar strings share a certain number of grams
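The sliding window can be sketched in a few lines. The name `qgrams` is hypothetical; the sketch does no padding of the string ends, which matches the five 2-grams shown for “irvine”.

```python
from typing import List


def qgrams(s: str, q: int = 2) -> List[str]:
    """Slide a window of length q over s and collect every substring.
    A string shorter than q yields no grams."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


grams = qgrams("irvine", 2)
```

A string of length n yields n - q + 1 grams, so “irvine” with q = 2 yields 5.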
Approximate Query Example
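Query: “irvine”, Edit Distance 1
2-grams {ir, rv, vi, in, ne}
Lookup Grams
2-grams in the index: tf, vi, ir, ef, rv, ne, un, in, …
Inverted Lists (stringIDs): {1, 3, 4, 5, 7, 9}, {5, 9}, {1, 5}, {1, 2, 3, 9}, {3, 9}, {7, 9}, {5, 6, 9}, {1, 2, 4, 5, 6}
Lists of the five query grams: {1, 3, 4, 5, 7, 9}, {1, 5}, {1, 2, 3, 9}, {7, 9}, {5, 6, 9}
Count >= 3: Candidates = {1, 5, 9}
May have false positives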
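The count filter of this example can be reproduced with a simple counting pass. This is only a sketch: `count_filter` is a hypothetical name, and the threshold uses the bound T = #grams - ed * q with 5 query grams, ed = 1, q = 2.

```python
from collections import Counter
from typing import List


def count_filter(lists: List[List[int]], threshold: int) -> List[int]:
    """Count on how many lists each stringID occurs and keep the IDs
    that reach the threshold. Survivors are only candidates: the real
    similarity still has to be checked to remove false positives."""
    counts = Counter()
    for ids in lists:
        counts.update(ids)
    return sorted(sid for sid, c in counts.items() if c >= threshold)


# The five inverted lists looked up for the 2-grams of "irvine".
query_lists = [[1, 3, 4, 5, 7, 9], [1, 5], [1, 2, 3, 9], [7, 9], [5, 6, 9]]
T = len(query_lists) - 1 * 2      # 5 grams - ed * q = 3
candidates = count_filter(query_lists, T)
```

StringIDs 1 and 5 occur on three lists and 9 on four, so exactly these three pass the filter.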
T-Occurrence Problem
Find elements whose number of occurrences on the lists is ≥ T
Merge the inverted lists, each sorted in ascending order
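One way to solve the T-occurrence problem on sorted lists is a heap-based merge, sketched below with the standard-library `heapq.merge`. This is a generic illustration, not one of the specific list-merging algorithms the talk re-uses; it assumes each list is ascending and contains no duplicate IDs.

```python
import heapq
from typing import List


def t_occurrence(lists: List[List[int]], t: int) -> List[int]:
    """Return the IDs that appear on at least t of the sorted lists."""
    result = []
    current, count = None, 0
    for sid in heapq.merge(*lists):      # yields all IDs in ascending order
        if sid == current:
            count += 1
        else:
            if current is not None and count >= t:
                result.append(current)
            current, count = sid, 1
    if current is not None and count >= t:   # flush the last run
        result.append(current)
    return result


merged = t_occurrence(
    [[1, 3, 4, 5, 7, 9], [1, 5], [1, 2, 3, 9], [7, 9], [5, 6, 9]], 3)
```

Because the merged stream is sorted, equal IDs arrive consecutively and a single running counter is enough.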
Motivation: Compression
Inverted Index >> Source Data
Fit in memory? Space Budget?
Motivation: Related Work
IR: lossless compression of inverted lists (disk-based)
Delta representation + compact encoding
Inverted lists in memory: decompression overhead
Tune compression ratio?
Overcome these limitations in our setting?
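For readers unfamiliar with the IR approach, here is a minimal sketch of “delta representation + compact encoding”. Variable-byte coding is used as one common compact encoding; that choice and the function names are assumptions, and the talk does not say which encoding it has in mind at this point.

```python
from typing import List


def encode(ids: List[int]) -> bytes:
    """Store gaps between consecutive ascending IDs, 7 bits per byte;
    the high bit of a byte marks that more bytes follow."""
    out = bytearray()
    prev = 0
    for sid in ids:
        gap = sid - prev
        prev = sid
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode(data: bytes) -> List[int]:
    """Inverse of encode: rebuild the gaps, then prefix-sum them."""
    ids, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            ids.append(prev)
            gap, shift = 0, 0
    return ids


packed = encode([1, 3, 4, 5, 7, 9])
```

The decode loop illustrates the overhead noted above: a list has to be decompressed before it can be merged, and the compression ratio is fixed by the data rather than tunable.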
Main Contributions
Two lossy compression techniques
Answer queries exactly
Index fits into a space budget
Queries faster on the compressed indexes
Flexibility to choose space / time tradeoff
Existing list-merging algorithms: re-use + compression-specific optimizations
Overview
Motivation & Preliminaries
Approach 1: Discarding Lists
Approach 2: Combining Lists
Experiments & Conclusion
Approach 1: Discarding Lists
2-grams: tf, vi, ir, ef, rv, ne, un, in, …
Inverted Lists (stringIDs): {1, 3, 4, 5, 7, 9}, {5, 9}, {1, 5}, {1, 2, 3, 9}, {3, 9}, {7, 9}, {5, 6, 9}, {1, 2, 4, 5, 6}
Lists discarded, “Holes”
Effects on Queries
Decrease lower bound T on common grams
Smaller T → more false positives
T <= 0 → “panic”, scan entire string collection
Surprise: Fewer lists → Faster Queries (depends)
Query “shanghai”, Edit Distance 1
3-grams {sha, han, ang, ngh, gha, hai}
3-grams in the index: sha, han, ang, ngh, gha, hai, ter, uni, ing, … (some are hole grams, the rest regular grams)
Basis: Edit operations “destroy” q = 3 grams
No holes: T = #grams - ed * q = 6 - 1 * 3 = 3
With holes: T’ = T - #holes = 0 (three of the query grams are holes) → Panic!
Really destroy q = 3 grams per edit operation?
Dynamic Programming for tighter T
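The arithmetic on this slide can be checked with a short sketch. The function name is hypothetical, and so is the particular set of hole grams: the slide only implies that three of the six query grams are holes, not which ones. The sketch uses the simple bound, not the tighter dynamic-programming bound.

```python
from typing import Set, Tuple


def lower_bound(query: str, q: int, ed: int,
                holes: Set[str]) -> Tuple[int, int, bool]:
    """Return (T without holes, T' with holes, panic flag).
    Each edit operation is assumed to destroy at most q grams."""
    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    t = len(grams) - ed * q
    n_holes = sum(1 for g in grams if g in holes)   # query grams without a list
    t_holes = t - n_holes
    return t, t_holes, t_holes <= 0


# Hypothetical choice of discarded lists, for illustration only.
t, t_holes, panic = lower_bound("shanghai", q=3, ed=1,
                                holes={"sha", "ang", "hai"})
```

With no holes the bound stays at 3 and the index can still filter; with three holes it drops to 0 and the query must scan the whole collection.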
Choosing Lists to Discard
Good choice depends on query workload
Space budget: Many combinations of grams
Make a “reasonable” choice efficiently?
Effect on Query: Unaffected, Panic, Slower or Faster
Choosing Lists to Discard
INPUT: Space Budget, Inverted lists, Workload
OUTPUT: Lists to discard
ALGORITHM: Greedy & Cost-Based
- Choose one list at a time
- For each candidate list (tf, vi, ir, ef, rv, ne, un, in, …), estimate the impact ∆t of discarding it on the total estimated running time t of the workload (Query1, Query2, Query3, …)
- Incremental Update
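The greedy, cost-based loop might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' algorithm: the space budget is measured in stored stringIDs, the list with the smallest estimated ∆t is picked at each step, and `estimate_time` is a caller-supplied stand-in for the workload cost model. The toy cost at the bottom is invented purely to make the example runnable.

```python
from typing import Callable, Dict, FrozenSet, List


def choose_lists_to_discard(
    lists: Dict[str, List[int]],
    budget: int,
    estimate_time: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Discard one list at a time until the index fits the budget."""
    discarded: FrozenSet[str] = frozenset()
    order: List[str] = []
    size = sum(len(ids) for ids in lists.values())
    current = estimate_time(discarded)          # total estimated time t
    while size > budget:
        best_gram, best_delta = None, None
        for gram in sorted(lists):
            if gram in discarded:
                continue
            delta = estimate_time(discarded | {gram}) - current   # impact ∆t
            if best_delta is None or delta < best_delta:
                best_gram, best_delta = gram, delta
        if best_gram is None:                   # nothing left to discard
            break
        discarded = discarded | {best_gram}
        order.append(best_gram)
        size -= len(lists[best_gram])
        current += best_delta                   # incremental update of t
    return order


# Toy setup (assumed): the penalty of a hole grows with how often the
# workload uses that gram.
toy_lists = {"ab": [1, 2, 3], "bc": [1, 2], "cd": [3]}
toy_usage = {"ab": 5, "bc": 1, "cd": 2}
chosen = choose_lists_to_discard(
    toy_lists, budget=3,
    estimate_time=lambda gone: float(sum(toy_usage[g] for g in gone)))
```

Re-evaluating the full cost for every candidate in every round is the expensive part, which is why an incremental way of updating the estimates matters.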
Estimating Query Times
List-Merging: cost function, fitted offline with linear regression
Panic: #strings * avg similarity time
Post-Processing: #candidates * avg similarity time
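The three components can be combined into a single estimate, sketched below. The panic and post-processing terms follow the slide directly; modelling the merge time as linear in the total length of the lists to merge is an assumption, as are the parameter names `slope` and `intercept` for the regression result.

```python
def estimate_query_time(t: int, total_list_length: int, n_candidates: int,
                        n_strings: int, avg_sim_time: float,
                        slope: float, intercept: float) -> float:
    """Estimated running time of one query against the index."""
    if t <= 0:
        # Panic: every string in the collection is compared to the query.
        return n_strings * avg_sim_time
    # Assumed linear cost model for list merging, fitted offline.
    merge_time = slope * total_list_length + intercept
    # Post-processing: verify each surviving candidate.
    return merge_time + n_candidates * avg_sim_time


panic_time = estimate_query_time(0, 500, 20, 1000, 0.002, 0.001, 0.1)
normal_time = estimate_query_time(3, 500, 20, 1000, 0.002, 0.001, 0.1)
```

In this made-up setting the panic case costs about 2.0 time units and the regular case about 0.64, which shows why a list whose removal pushes many queries into panic is a poor choice.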
Estimating #candidates
Incremental-ScanCount Algorithm
List to Discard: un → {1, 3, 4}
BEFORE: T = 3, #candidates = 2
StringIDs: 0 1 2 3 4
Counts:    2 3 0 1 4
Decrement the counts of the stringIDs on the discarded list
AFTER: T’ = T - 1 = 2, #candidates = 3
StringIDs: 0 1 2 3 4
Counts:    2 2 0 0 3
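The before/after numbers of this example can be reproduced with a small sketch. The function names are hypothetical; the sketch assumes the query contains the gram “un”, so discarding its list lowers the threshold by one.

```python
from typing import List


def count_candidates(counts: List[int], threshold: int) -> int:
    """Number of stringIDs whose occurrence count reaches the threshold."""
    return sum(1 for c in counts if c >= threshold)


def discard_list(counts: List[int], discarded_ids: List[int]) -> List[int]:
    """Update the count array incrementally: only the stringIDs on the
    discarded list lose one occurrence."""
    updated = list(counts)
    for sid in discarded_ids:
        updated[sid] -= 1
    return updated


counts_before = [2, 3, 0, 1, 4]                  # indexed by stringID 0..4
n_before = count_candidates(counts_before, 3)    # T = 3
counts_after = discard_list(counts_before, [1, 3, 4])
n_after = count_candidates(counts_after, 2)      # T' = T - 1 = 2
```

Only the entries for stringIDs 1, 3 and 4 are touched, so the update costs time proportional to the discarded list rather than to all query lists.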
Approach 2: Combining Lists
2-grams: tf, vi, ir, ef, rv, ne, un, in, …
Inverted Lists (stringIDs): {1, 3, 4, 5, 7, 9}, {5, 9}, {5, 6, 9}, {1, 2, 3, 9}, {1, 3, 9}, {7, 9}, {6, 9}, {1, 2, 4, 5, 6}
Lists combined
Effects on Queries
Lower bound T is unchanged (no new panics)
Lists become longer:
More time to traverse lists
More false positives
Speeding Up Queries
Query 3-grams {sha, han, ang, ngh, gha, hai}
Some query grams share a combined list: one combined list with refcount = 2, another with refcount = 3
Traverse physical lists once. Count for stringIDs increases by refcount.
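A sketch of the refcount idea follows. The data layout (a gram-to-list-ID map plus a table of physical lists) and all names and list contents are assumptions made for illustration; only the rule “traverse once, add refcount” comes from the slide.

```python
from collections import Counter
from typing import Dict, List


def count_with_refcounts(gram_to_list: Dict[str, str],
                         lists: Dict[str, List[int]],
                         query_grams: List[str]) -> Dict[int, int]:
    """Count occurrences per stringID while scanning each physical list once."""
    # refcount = number of query grams that point to the same physical list
    refcount = Counter(gram_to_list[g] for g in query_grams if g in gram_to_list)
    counts: Counter = Counter()
    for list_id, rc in refcount.items():
        for sid in lists[list_id]:
            counts[sid] += rc            # one pass, weighted by refcount
    return dict(counts)


# Hypothetical index: "sha" and "han" were combined into physical list A.
gram_to_list = {"sha": "A", "han": "A", "ang": "B", "ngh": "C"}
physical = {"A": [1, 2, 5], "B": [2, 3], "C": [2]}
counts = count_with_refcounts(gram_to_list, physical,
                              ["sha", "han", "ang", "ngh"])
```

The result equals what separate traversals per gram would give, but list A is read once instead of twice.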
Choosing Lists to Combine
Discovering candidate gram pairs
- Frequent (q+1)-grams → correlated adjacent q-grams
- Locality-Sensitive Hashing (LSH)
Selecting candidate pairs to combine
- Basis: estimated cost on query workload
- Similar to DiscardLists
- Different Incremental-ScanCount algorithm
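The first discovery heuristic can be sketched as follows: a frequent (q+1)-gram means its two overlapping q-grams often occur together, so their inverted lists are likely to be similar. The function name and the `min_freq` threshold are assumptions; the LSH alternative is not shown.

```python
from collections import Counter
from typing import List, Tuple


def candidate_pairs(strings: List[str], q: int,
                    min_freq: int) -> List[Tuple[str, str]]:
    """Return pairs of adjacent q-grams taken from (q+1)-grams that
    occur at least min_freq times in the collection."""
    freq = Counter()
    for s in strings:
        for i in range(len(s) - q):          # every (q+1)-gram of s
            freq[s[i:i + q + 1]] += 1
    pairs = []
    for gram, n in sorted(freq.items()):
        left, right = gram[:q], gram[1:]     # the two overlapping q-grams
        if n >= min_freq and left != right:
            pairs.append((left, right))
    return pairs


pairs = candidate_pairs(["string", "ring", "king"], q=2, min_freq=3)
```

Here “ing” occurs three times, so the 2-grams “in” and “ng” become a candidate pair; whether the pair is actually combined would still depend on the estimated cost on the workload.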
Experiments
Datasets:
- Google WebCorpus Word Grams
- IMDB Actors
- DBLP Titles
Overview:
- Performance & Scalability of DiscardLists & CombineLists
- Comparison with IR compression & VGRAM
- Changing workloads
10k Queries: Zipf distributed, from dataset
q = 3, Edit Distance = 2 (also Jaccard & Cosine)
Experiments
DiscardLists: Runtime decreases!
CombineLists: Runtime decreases!
Comparison with IR compression, Carryover-12
Plotted: Uncompressed vs. Compressed
Comparison with variable-length grams, VGRAM
Plotted: Uncompressed vs. Compressed
Future Work
Combine: DiscardLists, CombineLists and IR compression
Filters for partitioning, global vs. local decisions
Dealing with updates to index
Conclusions
Two lossy compression techniques
Answer queries exactly
Index fits into a space budget
Queries faster on the compressed indexes
Flexibility to choose space / time tradeoff
Existing list-merging algorithms: re-use + compression-specific optimizations
Thank You!
This work is part of The Flamingo Project
http://flamingo.ics.uci.edu
More Experiments
What if the workload changes from the training workload?