Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai
Space-Constrained Gram-Based Indexing for
Efficient Approximate String Search
Alexander Behm (1), Shengyue Ji (1), Chen Li (1), Jiaheng Lu (2)
(1) University of California, Irvine; (2) Renmin University of China
Overview
Motivation & Preliminaries
Approach 1: Discarding Lists
Approach 2: Combining Lists
Experiments & Conclusion
Motivation: Data Cleaning
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
Real-world data is dirty
Typos
Inconsistent representations (PO Box vs. P.O. Box)
Approximately check against a clean dictionary
Should clearly be “Niels Bohr”
Motivation: Record Linkage
Table 1:
Name                  | Hobbies | Address
Brad Pitt             | …       | …
Forest Whittacker     | …       | …
George Bush           | …       | …
Angelina Jolie        | …       | …
Arnold Schwarzenegger | …       | …

Table 2:
Phone | Age | Name
…     | …   | Brad Pitt
…     | …   | Arnold Schwarzeneger
…     | …   | George Bush
…     | …   | Angelina Jolie
…     | …   | Forrest Whittaker
We want to link records belonging to the same entity
No exact match!
The same entity may have similar representations
Arnold Schwarzeneger versus Arnold Schwarzenegger
Forrest Whittaker versus Forest Whittacker
Motivation: Query Relaxation
http://www.google.com/jobs/britney.html
Errors in queries
Errors in data
Bring query and meaningful results closer together
Actual queries gathered by Google
What is Approximate String Search?
String Collection: (People)
  Brad Pitt
  Forest Whittacker
  George Bush
  Angelina Jolie
  Arnold Schwarzeneger
  …
Queries against collection:
  Find all entries similar to “Forrest Whitaker”
  Find all entries similar to “Arnold Schwarzenegger”
  Find all entries similar to “Brittany Spears”
What do we mean by similar to?
  - Edit Distance
  - Jaccard Similarity
  - Cosine Similarity
  - Dice
  - Etc.
The similar to predicate can help our described applications!
How can we support these types of queries efficiently?
Approximate Query Answering
Main Idea: Use q-grams as signatures for a string
irvine
2-grams {ir, rv, vi, in, ne}
Intuition: Similar strings share a certain number of grams
Inverted index on grams supports finding all data strings sharing enough grams with a query
Sliding Window
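The sliding-window idea can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names `qgrams` and `build_inverted_index` are made up here, and each gram is recorded at most once per string (duplicate grams within a string are ignored for simplicity).

```python
from collections import defaultdict

def qgrams(s, q):
    """Slide a window of width q over s and collect every substring."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverted_index(strings, q):
    """Map each gram to the sorted list of IDs of strings containing it.

    String IDs are simply positions in the input list (an assumption
    made for this sketch).
    """
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in sorted(set(qgrams(s, q))):
            index[gram].append(sid)
    return index

print(qgrams("irvine", 2))  # ['ir', 'rv', 'vi', 'in', 'ne']
```

A string of length |s| yields |s| - q + 1 grams, so "irvine" with q = 2 yields 5 grams.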
Approximate Query Example
Query: “irvine”, Edit Distance 1
2-grams {ir, rv, vi, in, ne}
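[Figure: inverted index on 2-grams. Grams shown: tf, vi, ir, ef, rv, ne, un, in, … Each gram points to an inverted list of stringIDs; lists shown: 1 3 4 5 7 9 | 5 9 | 1 5 | 1 2 3 9 | 3 9 | 7 9 | 5 6 9 | 1 2 4 5 6. The query grams are looked up in the index (“Lookup Grams”).]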
Each edit operation can “destroy” at most q grams
Answers must share at least T = 5 – 1 * 2 = 3 grams
T-Occurrence problem: Find elements occurring at least T = 3 times among the inverted lists. This is called list-merging. T is called the merging-threshold.
Candidates = {1, 5, 9}
May have false positives
Need to compute real similarity
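The threshold computation and a simple counting approach to the T-occurrence problem can be sketched as follows. This is an illustrative sketch only: the names `merging_threshold` and `scan_count` are chosen here, and `toy_index` is made-up data (it is not the index from the slide figure), picked so that the candidate set comes out as {1, 5, 9}.

```python
from collections import Counter

def merging_threshold(query, q, k):
    """T = #grams - k * q, since each edit destroys at most q grams."""
    return (len(query) - q + 1) - k * q

def scan_count(query_grams, index, threshold):
    """Count how often each string ID appears on the query's lists and
    keep the IDs that reach the threshold (assumes threshold > 0)."""
    counts = Counter()
    for gram in query_grams:
        for sid in index.get(gram, []):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold)

# Hypothetical inverted lists, for illustration only.
toy_index = {
    "ir": [1, 5], "rv": [1, 9], "vi": [5, 9],
    "in": [1, 5, 9], "ne": [7, 9],
}
T = merging_threshold("irvine", q=2, k=1)
print(T, scan_count(["ir", "rv", "vi", "in", "ne"], toy_index, T))
# prints: 3 [1, 5, 9]
```

The returned IDs are only candidates; each one still needs its true edit distance checked against the query.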
Motivation: Compression
Inverted index can be very large compared to source data
May need to fit in memory for fast query processing
Can we compress the index to fit into a space budget?
Index-Size Estimation
Each string produces |s| - q + 1 grams
For each gram we add one element to its inverted list (a 4-byte uint)
With ASCII encoding the index is ~4x as large as the original data!
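A back-of-the-envelope version of this estimate, as a sketch: the function name `estimate_index_bytes` and the two sample strings are assumptions for illustration, and the estimate counts only the inverted-list elements (not the gram dictionary or other overhead).

```python
def estimate_index_bytes(strings, q, bytes_per_id=4):
    """Each string contributes |s| - q + 1 list elements of bytes_per_id bytes."""
    num_grams = sum(max(len(s) - q + 1, 0) for s in strings)
    return num_grams * bytes_per_id

strings = ["irvine", "shanghai"]
data_bytes = sum(len(s) for s in strings)          # 1 byte per ASCII character
index_bytes = estimate_index_bytes(strings, q=2)   # (5 + 7) grams * 4 bytes
print(data_bytes, index_bytes)  # prints: 14 48
```

For long strings the ratio approaches 4, because nearly every character starts one gram and each gram costs 4 bytes.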
Motivation: Related Work
IR community developed many lossless compression algorithms for inverted lists (mostly in a disk-based setting)
Mainly use delta representation + packing
If inverted lists are in memory, these techniques always impose decompression overhead
Difficult to tune compression ratio
How to overcome these limitations in our setting?
This Paper
We developed two lossy compression techniques
We answer queries exactly
Index can fit into a space budget (space constraint)
Queries can become faster on the compressed indexes
Flexibility to choose space / time tradeoff
Existing list-merging algorithms can be re-used (even with compression-specific optimizations)
Overview
Motivation & Preliminaries
Approach 1: Discarding Lists
Approach 2: Combining Lists
Experiments & Conclusion
Approach 1: Discarding Lists
[Figure: BEFORE: 2-grams tf, vi, ir, ef, rv, ne, un, in, … with inverted lists (stringIDs) 1 3 4 5 7 9 | 5 9 | 1 5 | 1 2 3 9 | 3 9 | 7 9 | 5 6 9 | 1 2 4 5 6. AFTER: the same 2-grams with the remaining inverted lists 5 9 | 1 5 | 7 9 | 5 6 9 | 1 2 4 5 6.]
Lists discarded, “Holes”
Effects on Queries
Need to decrease merging-threshold T
Lower T means more false positives to post-process
If T <= 0 we “panic”: need to scan the entire collection and compute true similarities
Surprisingly, query processing time can decrease because there are fewer lists to consider
Query “shanghai”, Edit Distance 1
3-grams {sha, han, ang, ngh, gha, hai}
[Figure: 3-gram index (sha, han, ang, ngh, gha, hai, ter, uni, ing, …) with each gram marked as a hole gram or a regular gram]
Merging-threshold without holes: T = #grams – ed * q = 6 – 1 * 3 = 3
Basis: Each edit operation can “destroy” at most q = 3 grams
Naïve new merging-threshold: T’ = T – #holes = 0, Panic!
Can each edit operation really destroy q = 3 non-hole grams?
[Figure: grams sha han ang ngh gha hai, with the edit operations Delete “a” and Delete “g” marked]
Can destroy at most 2 non-hole grams with 1 edit operation!
New merging-threshold T’ = 1
We use dynamic programming to compute a tighter T’
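One way such a dynamic program could look is sketched below. It is a simplified model and not necessarily the DP from the paper: each edit is assumed to destroy a window of up to q consecutive grams, and we maximize the number of regular (non-hole) grams that k such windows can cover. The hole set {han, ngh, hai} is an assumption consistent with the numbers above (3 holes, at most 2 regular grams destroyed per edit); the slide text does not say which grams are holes. Function names are hypothetical.

```python
def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def max_regular_destroyed(is_regular, q, k):
    """Max number of regular grams covered by k windows of <= q consecutive grams."""
    n = len(is_regular)
    prefix = [0]
    for flag in is_regular:
        prefix.append(prefix[-1] + int(flag))
    # best[j][e]: max regular grams destroyed among grams[0:e] using j edits.
    best = [[0] * (n + 1) for _ in range(k + 1)]
    for j in range(1, k + 1):
        for e in range(1, n + 1):
            start = max(0, e - q)              # window covers grams[start:e]
            in_window = prefix[e] - prefix[start]
            best[j][e] = max(best[j][e - 1],   # no window ends at gram e - 1
                             best[j - 1][start] + in_window)
    return best[k][n]

def tighter_threshold(query, q, k, holes):
    """T' = #regular grams - max #regular grams that k edits can destroy."""
    is_regular = [g not in holes for g in qgrams(query, q)]
    return sum(is_regular) - max_regular_destroyed(is_regular, q, k)

assumed_holes = {"han", "ngh", "hai"}  # assumption, see lead-in
print(tighter_threshold("shanghai", 3, 1, assumed_holes))  # prints: 1
print(tighter_threshold("shanghai", 3, 1, set()))          # prints: 3
```

Restricting the search to non-overlapping windows is safe in this model, because two overlapping windows can always be shifted apart without shrinking the set of grams they cover.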
Choosing Lists to Discard
One extreme: query is entirely unaffected
Other extreme: query becomes panic
Good choice of lists depends on query workload
Many combinations of lists to discard satisfy the memory constraint; checking all of them is infeasible
How can we make a “reasonable” choice efficiently?
Choosing Lists to Discard
Input:
  Memory Constraint
  Inverted Lists L
  Query Workload W
Output:
  Lists to Discard D
DiscardLists {
  While (Memory Constraint Not Satisfied) {
    For each list in L {
      ∆t = estimateImpact(list, W)
      benefit = list.size()
    }
    discard = use ∆t’s and benefits to choose list
    add discard to D
    remove discard from L
  }
}
How can we do this efficiently? Perhaps incrementally?
Times needed: List-Merging Time, Post-Processing Time, Panic Time
What exactly should we minimize? benefit / cost? cost only?
We could ignore benefit…
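The greedy loop above could be sketched in Python as follows. This is one possible reading, not the authors' code: the memory constraint is modeled as a budget on the total number of list elements, the selection rule is "smallest estimated cost per element saved" (one of the options the slide raises), and `estimate_impact` is a caller-supplied, hypothetical cost function. The sample lists and workload are made up.

```python
def discard_lists(lists, budget, estimate_impact):
    """Greedily discard lists until the total number of elements fits the budget.

    lists: dict mapping gram -> list of string IDs
    budget: maximum total number of list elements allowed (assumed model)
    estimate_impact: function(gram, remaining_lists) -> estimated query-time cost
    """
    remaining = {g: ids for g, ids in lists.items() if ids}
    discarded = []
    size = sum(len(ids) for ids in remaining.values())
    while size > budget and remaining:
        def cost_per_element(gram):
            return estimate_impact(gram, remaining) / len(remaining[gram])
        victim = min(sorted(remaining), key=cost_per_element)
        size -= len(remaining.pop(victim))
        discarded.append(victim)
    return discarded

# Toy data: impact = number of workload queries that use the gram.
toy_lists = {"ir": [1, 5], "rv": [3, 9], "in": [1, 2, 4, 5, 6],
             "tf": [1, 3, 4, 5, 7, 9]}
workload = [{"ir", "rv", "vi", "in", "ne"}]
impact = lambda gram, remaining: sum(gram in query for query in workload)
print(discard_lists(toy_lists, 4, impact))  # prints: ['tf', 'in']
```

Re-estimating the impact of every remaining list in each round is the expensive part, which is what motivates computing the estimates incrementally.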
Choosing Lists to Discard
Estimating Query Times With Holes
List-Merging Time: cost function, parameters decided offline with linear regression
Post-Processing Time: #candidates * average compute similarity time
Panic Time: #strings * average compute similarity time
#candidates depends on T, data distribution, number of holes
Incremental-ScanCount Algorithm
StringIDs:                 0 1 2 3 4 5 6 7 8 9
Counts before discarding:  2 0 3 3 2 4 0 0 1 0    (T = 3, #candidates = 3)
List to discard:           2, 3, 4, 8             (decrement counts)
Counts after discarding:   2 0 2 2 1 4 0 0 0 0    (T’ = T – 1 = 2, #candidates = 4)
Many more ways to improve speed of DiscardLists, this is just one example…
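The incremental update in the example can be replayed with a short sketch. The function names are chosen here for illustration; the count array, the discarded list, and the thresholds are the ones from the slide.

```python
def count_candidates(counts, threshold):
    """Number of string IDs whose count reaches the threshold."""
    return sum(1 for c in counts if c >= threshold)

def discard_and_recount(counts, threshold, discarded_list):
    """Decrement the counts of IDs on the discarded list and lower T by one,
    instead of re-merging all remaining lists from scratch."""
    new_counts = list(counts)
    for sid in discarded_list:
        new_counts[sid] -= 1
    new_threshold = threshold - 1
    return new_counts, new_threshold, count_candidates(new_counts, new_threshold)

before = [2, 0, 3, 3, 2, 4, 0, 0, 1, 0]   # counts for stringIDs 0..9
print(count_candidates(before, 3))         # prints: 3
after, t_new, n_cand = discard_and_recount(before, 3, [2, 3, 4, 8])
print(after, t_new, n_cand)
# prints: [2, 0, 2, 2, 1, 4, 0, 0, 0, 0] 2 4
```

Only the IDs on the discarded list are touched, so the cost of evaluating one candidate list is proportional to its length.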
Overview
Motivation & Preliminaries
Approach 1: Discarding Lists
Approach 2: Combining Lists
Experiments & Conclusion
Approach 2: Combining Lists
[Figure: BEFORE: 2-grams tf, vi, ir, ef, rv, ne, un, in, … with inverted lists (stringIDs) 1 3 4 5 7 9 | 5 9 | 5 6 9 | 1 2 3 9 | 1 3 9 | 7 9 | 6 9 | 1 2 4 5 6. AFTER: the same 2-grams, some of them sharing one physical list; lists shown: 1 3 4 5 7 9 | 5 6 9 | 1 2 3 9 | 7 9 | 6 9 | 1 2 4 5 6.]
Lists combined
Intuition: Combine correlated lists.
Effects on Queries
Merging-threshold T is unchanged (no new panics)
Lists become longer:
More time to traverse lists
More false positives
List-Merging Optimization
3-grams {sha, han, ang, ngh, gha, hai}
[Figure: query grams pointing to combined lists, one with refcount = 2 and one with refcount = 3]
Traverse physical lists once.
Count for stringIDs on physical lists increased by refcount instead of 1
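The refcount optimization can be sketched as a small variation of count-based merging. Everything concrete here is assumed for illustration: the gram-to-list assignment, the list contents, and the function name `scan_count_combined`. The assignment was chosen so that the physical lists receive refcounts 2, 3, and 1, in the spirit of the figure.

```python
from collections import Counter

def scan_count_combined(query_grams, gram_to_list, physical_lists, threshold):
    """Traverse each physical list once, adding its refcount (the number of
    query grams mapped to it) to every string ID on the list."""
    refcount = Counter(gram_to_list[g] for g in query_grams if g in gram_to_list)
    counts = Counter()
    for list_id, refs in refcount.items():
        for sid in physical_lists[list_id]:
            counts[sid] += refs
    return sorted(sid for sid, c in counts.items() if c >= threshold)

# Hypothetical combined index for the grams of "shanghai".
gram_to_list = {"sha": 0, "han": 0, "ang": 1, "ngh": 1, "gha": 1, "hai": 2}
physical_lists = {0: [1, 5, 9], 1: [2, 5], 2: [5, 7]}
grams = ["sha", "han", "ang", "ngh", "gha", "hai"]
print(scan_count_combined(grams, gram_to_list, physical_lists, 3))
# prints: [2, 5]
```

Because a combined list is the union of the original lists, counts can only be overestimated, so no true answer is lost; the extra candidates are removed in post-processing.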
Choosing Lists to Combine
Discovering candidate gram pairs
  Frequent (q+1)-grams indicate correlated adjacent q-grams
  Using Locality-Sensitive Hashing (LSH)
Selecting candidate pairs to combine
  Based on estimated cost on query workload
  Similar to DiscardLists
  Different Incremental-ScanCount algorithm
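The first discovery method can be illustrated with a short sketch: every frequent (q+1)-gram contains two adjacent q-grams (its first q and its last q characters) that tend to occur in the same strings. The function name `candidate_pairs`, the `min_count` cutoff, and the sample strings are assumptions; the LSH-based method is not shown.

```python
from collections import Counter

def candidate_pairs(strings, q, min_count):
    """Return pairs of adjacent q-grams taken from (q+1)-grams that occur
    at least min_count times in the collection."""
    freq = Counter(s[i:i + q + 1] for s in strings for i in range(len(s) - q))
    return sorted({(g[:q], g[1:]) for g, c in freq.items() if c >= min_count})

print(candidate_pairs(["irvine", "marvin", "irving"], q=2, min_count=2))
# prints: [('ir', 'rv'), ('rv', 'vi'), ('vi', 'in')]
```

These pairs are only candidates; whether a pair is actually combined would still depend on the estimated cost on the query workload.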
Overview
Motivation & Preliminaries
Approach 1: Discarding Lists
Approach 2: Combining Lists
Experiments & Conclusion
Experiments
Datasets:
  Google WebCorpus (word grams)
  IMDB Actors
Queries: picked from dataset, Zipf distributed
q=3, Edit Distance=2
Overview:
  Performance of flavors of DiscardLists & CombineLists
  Scalability with increasing index size
  Comparison with IR compression technique
  Comparison with VGRAM
  What if workload changes from training workload
Experiments
[Figure: two plots, one for DiscardLists and one for CombineLists. Annotation on both: Runtime decreases!]
Experiments
[Figure: Comparison with IR compression technique; two plots with curves labeled Uncompressed and Compressed]
Experiments
[Figure: Comparison with variable-length gram technique, VGRAM; two plots with curves labeled Uncompressed and Compressed]
Future Work
DiscardLists, CombineLists and IR compression could be combined
When considering filter tree, global vs. local decisions
How to minimize the impact on performance if the workload changes
Conclusion
We developed two lossy compression techniques
We answer queries exactly
Index can fit into a space budget (space constraint)
Queries can become faster on the compressed indexes
Flexibility to choose space / time tradeoff
Existing list-merging algorithms can be re-used (even with compression-specific optimizations)
More Experiments
What if the workload changes from the training workload?