Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai


Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

Alexander Behm (1), Shengyue Ji (1), Chen Li (1), Jiaheng Lu (2)

(1) University of California, Irvine
(2) Renmin University of China


Overview

Motivation & Preliminaries

Approach 1: Discarding Lists

Approach 2: Combining Lists

Experiments & Conclusion


Motivation: Data Cleaning

Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008

Real-world data is dirty

Typos

Inconsistent representations (PO Box vs. P.O. Box)

Approximately check against clean dictionary

Should clearly be “Niels Bohr”


Motivation: Record Linkage

Name                   Hobbies  Address
Brad Pitt              …        …
Forest Whittacker      …        …
George Bush            …        …
Angelina Jolie         …        …
Arnold Schwarzenegger  …        …

Phone  Age  Name
…      …    Brad Pitt
…      …    Arnold Schwarzeneger
…      …    George Bush
…      …    Angelina Jolie
…      …    Forrest Whittaker

We want to link records belonging to the same entity

No exact match!

The same entity may have similar representations

Arnold Schwarzeneger versus Arnold Schwarzenegger

Forrest Whittaker versus Forest Whittacker


Motivation: Query Relaxation

http://www.google.com/jobs/britney.html

Errors in queries

Errors in data

Bring query and meaningful results closer together

Actual queries gathered by Google


What is Approximate String Search?

String Collection (People):
Brad Pitt
Forest Whittacker
George Bush
Angelina Jolie
Arnold Schwarzeneger
…

Queries against collection:
Find all entries similar to “Forrest Whitaker”
Find all entries similar to “Arnold Schwarzenegger”
Find all entries similar to “Brittany Spears”

What do we mean by similar to?
- Edit Distance
- Jaccard Similarity
- Cosine Similarity
- Dice
- Etc.

The similar to predicate can help our described applications!

How can we support these types of queries efficiently?


Approximate Query Answering

Main Idea: Use q-grams as signatures for a string

Sliding window over “irvine”: 2-grams {ir, rv, vi, in, ne}

Intuition: Similar strings share a certain number of grams

Inverted index on grams supports finding all data strings sharing enough grams with a query
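The sliding-window decomposition and the inverted index on grams can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names `qgrams` and `build_inverted_index` are hypothetical, and stringIDs are assumed to be positions in the input list.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Slide a window of length q over s and return the grams in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_index(strings, q=2):
    """Map each gram to the list of stringIDs whose strings produce it."""
    index = defaultdict(list)
    for string_id, s in enumerate(strings):
        for gram in qgrams(s, q):
            # One element is appended per gram occurrence.
            index[gram].append(string_id)
    return index


index = build_inverted_index(["irvine", "marine"], q=2)
print(qgrams("irvine"))  # ['ir', 'rv', 'vi', 'in', 'ne']
print(index["in"])       # [0, 1]
```

Both example strings contain the gram "in", so its inverted list holds both stringIDs.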


Approximate Query Example

Query: “irvine”, Edit Distance 1

2-grams {ir, rv, vi, in, ne}

Look up the query grams in the inverted index on 2-grams (tf, vi, ir, ef, rv, ne, un, in, …)

[Figure: each gram points to its inverted list of stringIDs; the lists shown are {1,3,4,5,7,9}, {5,9}, {1,5}, {1,2,3,9}, {3,9}, {7,9}, {5,6,9}, {1,2,4,5,6}.]

Each edit operation can “destroy” at most q grams
Answers must share at least T = 5 – 1 * 2 = 3 grams

T-Occurrence problem: Find elements occurring at least T = 3 times among the inverted lists. This is called list-merging. T is called the merging-threshold.

Candidates = {1, 5, 9}
May have false positives
Need to compute real similarity
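The query pipeline above (look up grams, solve the T-occurrence problem, verify candidates) can be sketched with a simple counting merge. This is an illustrative sketch under assumptions: the collection is made up, all function names are hypothetical, and a plain count over all lists stands in for whichever list-merging algorithm is actually used.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_index(strings, q):
    index = defaultdict(list)
    for string_id, s in enumerate(strings):
        for gram in qgrams(s, q):
            index[gram].append(string_id)
    return index


def edit_distance(a, b):
    """Standard dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(query, strings, index, q, k):
    """Return all strings within edit distance k of query."""
    grams = qgrams(query, q)
    threshold = len(grams) - k * q  # merging-threshold T
    if threshold <= 0:
        candidates = range(len(strings))  # panic: scan the whole collection
    else:
        counts = Counter()
        for gram in grams:
            for string_id in index.get(gram, ()):
                counts[string_id] += 1
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    # Candidates may contain false positives, so compute the real similarity.
    return [strings[sid] for sid in candidates
            if edit_distance(query, strings[sid]) <= k]


strings = ["irvine", "irvin", "marine", "ervine", "online"]
index = build_inverted_index(strings, q=2)
print(search("irvine", strings, index, q=2, k=1))
```

With q = 2 and k = 1 the threshold is T = 5 – 1 * 2 = 3, as in the slide; "marine" and "online" share only two grams with the query and are never verified.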


Motivation: Compression

Inverted index can be very large compared to source data

May need to fit in memory for fast query processing

Can we compress the index to fit into a space budget?

Index-Size Estimation

Each string s produces |s| - q + 1 grams
For each gram we add one element to its inverted list (a 4-byte uint)
With ASCII encoding the index is ~4x as large as the original data!
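The size estimate can be checked with a few lines of arithmetic. This is a rough sketch that counts only the inverted-list elements (no gram dictionary or list headers); the function names are hypothetical.

```python
def estimate_index_bytes(strings, q, id_bytes=4):
    """Each string s contributes |s| - q + 1 list elements of id_bytes each."""
    return sum(max(len(s) - q + 1, 0) * id_bytes for s in strings)


def data_bytes(strings):
    """ASCII encoding: one byte per character."""
    return sum(len(s) for s in strings)


strings = ["x" * 20] * 1000
ratio = estimate_index_bytes(strings, q=3) / data_bytes(strings)
print(ratio)  # 3.6
```

For strings of length 20 and q = 3 the ratio is 4 * 18 / 20 = 3.6; it approaches 4 as strings get longer relative to q.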


Motivation: Related Work

IR community developed many lossless compression algorithms for inverted lists (mostly in a disk-based setting)

Mainly use delta representation + packing

If inverted lists are in memory these techniques always impose decompression overhead

Difficult to tune compression ratio

How to overcome these limitations in our setting?


This Paper

We developed two lossy compression techniques

We answer queries exactly

Index can fit into a space budget (space constraint)

Queries can become faster on the compressed indexes

Flexibility to choose space / time tradeoff

Existing list-merging algorithms can be re-used (even with compression-specific optimizations)


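Approach 1: Discarding Lists

[Figure: inverted index on 2-grams (tf, vi, ir, ef, rv, ne, un, in, …) before and after discarding lists.]

BEFORE: inverted lists (stringIDs) {1,3,4,5,7,9}, {5,9}, {1,5}, {1,2,3,9}, {3,9}, {7,9}, {5,6,9}, {1,2,4,5,6}

AFTER: inverted lists (stringIDs) {5,9}, {1,5}, {7,9}, {5,6,9}, {1,2,4,5,6}

Lists discarded: {1,3,4,5,7,9}, {1,2,3,9}, {3,9}. The grams whose lists are discarded become “holes”.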


Effects on Queries

Need to decrease merging-threshold T

Lower T means more false positives to post-process

If T <= 0 we “panic”: need to scan the entire collection and compute true similarities

Surprisingly, query processing time can decrease because there are fewer lists to consider


Query “shanghai”, Edit Distance 1
3-grams {sha, han, ang, ngh, gha, hai}

[Figure: inverted index on 3-grams (sha, han, ang, ngh, gha, hai, ter, uni, ing, …), with hole grams and regular grams marked.]

Merging-threshold without holes: T = #grams – ed * q = 6 – 1 * 3 = 3
Basis: each edit operation can “destroy” at most q = 3 grams
Naïve new merging-threshold: T’ = T – #holes = 3 – 3 = 0. Panic!

Can we really destroy at most q = 3 non-hole grams with each edit operation?

[Figure: the query grams sha han ang ngh gha hai, with the edit operations Delete “a” and Delete “g” marked.]

With 1 edit operation we can destroy at most 2 of the 3 non-hole grams!
New merging-threshold: T’ = 3 – 2 = 1

We use dynamic programming to compute a tighter T’
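One way to compute such a tighter T’ is sketched below. It is a simplified dynamic program written for illustration, not necessarily the formulation used in the paper. Assumptions: each edit operation is modeled as hitting one character position and destroying the (at most q) grams that overlap it, and the hole set {han, ngh, hai} is hypothetical, chosen only because it reproduces the numbers on the slide.

```python
def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def tighter_threshold(query, q, k, holes):
    """T' = #non-hole grams - max #non-hole grams that k edits can destroy."""
    grams = qgrams(query, q)
    non_hole = [0 if g in holes else 1 for g in grams]
    m = len(grams)

    # prefix[j] = number of non-hole grams among the first j grams
    prefix = [0]
    for flag in non_hole:
        prefix.append(prefix[-1] + flag)

    # best[e][j] = max non-hole grams destroyed among the first j grams
    # using at most e edits; one edit covers up to q consecutive grams.
    best = [[0] * (m + 1) for _ in range(k + 1)]
    for e in range(1, k + 1):
        for j in range(1, m + 1):
            start = max(0, j - q)
            hit = best[e - 1][start] + prefix[j] - prefix[start]
            best[e][j] = max(best[e][j - 1], hit)

    return sum(non_hole) - best[k][m]


holes = {"han", "ngh", "hai"}  # hypothetical choice of three holes
print(tighter_threshold("shanghai", q=3, k=1, holes=holes))  # 1
print(tighter_threshold("shanghai", q=3, k=1, holes=set()))  # 3
```

With no holes the function falls back to the usual T = 6 – 1 * 3 = 3; with the three assumed holes it yields T’ = 1 instead of the naïve 0, so the query no longer panics.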


Choosing Lists to Discard

One extreme: query is entirely unaffected

Other extreme: query panics

Good choice of lists depends on query workload

Many combinations of lists to discard satisfy the memory constraint; checking all of them is infeasible

How can we make a “reasonable” choice efficiently?


Choosing Lists to Discard

Input:
  Memory Constraint
  Inverted Lists L
  Query Workload W
Output:
  Lists to Discard D

DiscardLists {
  While (Memory Constraint Not Satisfied) {
    For each list in L {
      ∆t = estimateImpact(list, W)
      benefit = list.size()
    }
    discard = use ∆t’s and benefits to choose list
    add discard to D
    remove discard from L
  }
}

How can we do this efficiently? Perhaps incrementally?

Times needed:
- List-Merging Time
- Post-Processing Time
- Panic Time

What exactly should we minimize? benefit / cost? cost only?

We could ignore benefit…
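The greedy loop above can be sketched in runnable form. This is only one possible reading of the pseudocode: it picks the list with the smallest estimated time increase per element freed, which is an assumption, since the slide leaves the selection criterion open. `estimate_impact` is a hypothetical caller-supplied function, and the memory constraint is expressed as a budget on the total number of list elements.

```python
def discard_lists(lists, budget, estimate_impact):
    """Greedily choose grams whose inverted lists are discarded.

    lists: dict mapping gram -> list of stringIDs
    budget: maximum total number of list elements allowed
    estimate_impact(gram, discarded): estimated increase in workload time
        if gram's list is discarded in addition to those already discarded
    """
    remaining = dict(lists)
    discarded = []
    size = sum(len(ids) for ids in remaining.values())
    while size > budget and remaining:
        def cost_per_element(gram):
            benefit = max(len(remaining[gram]), 1)  # elements freed
            return estimate_impact(gram, discarded) / benefit

        choice = min(remaining, key=cost_per_element)
        size -= len(remaining.pop(choice))
        discarded.append(choice)
    return discarded


# Toy impact model (assumption): how many workload queries use the gram.
lists = {"ab": [1, 2, 3, 4], "bc": [1, 2], "cd": [3], "de": [1, 2, 3]}
workload_freq = {"ab": 1, "bc": 4, "cd": 1, "de": 0}
print(discard_lists(lists, budget=5, estimate_impact=lambda g, d: workload_freq[g]))
```

In this toy run the list for "de" is discarded first (no query uses it), then the list for "ab" (low impact, large benefit), after which the budget is met.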


Choosing Lists to Discard: Estimating Query Times With Holes

List-Merging Time: cost function, parameters decided offline with linear regression
Post-Processing Time: #candidates * average compute similarity time
Panic Time: #strings * average compute similarity time

#candidates depends on T, data distribution, number of holes

Incremental-ScanCount Algorithm

Before discarding list: T = 3, #candidates = 3
StringIDs: 0 1 2 3 4 5 6 7 8 9
Counts:    2 0 3 3 2 4 0 0 1 0

List to discard: {2, 3, 4, 8} (decrement counts)

After discarding list: T’ = T – 1 = 2, #candidates = 4
StringIDs: 0 1 2 3 4 5 6 7 8 9
Counts:    2 0 2 2 1 4 0 0 0 0

Many more ways to improve speed of DiscardLists, this is just one example…
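The incremental update can be reproduced in a few lines, using the counts from the example above. The function names are hypothetical, and the threshold is simply lowered by one as in the example.

```python
def count_candidates(counts, threshold):
    """Number of stringIDs whose count reaches the merging-threshold."""
    return sum(1 for c in counts if c >= threshold)


def discard_list_incremental(counts, discarded_list, threshold):
    """Update ScanCount counts for one discarded list without re-merging."""
    new_counts = list(counts)
    for string_id in discarded_list:
        new_counts[string_id] -= 1  # this list no longer contributes
    return new_counts, threshold - 1


counts = [2, 0, 3, 3, 2, 4, 0, 0, 1, 0]  # indexed by stringID 0..9
print(count_candidates(counts, 3))        # 3

new_counts, new_t = discard_list_incremental(counts, [2, 3, 4, 8], 3)
print(new_counts)                         # [2, 0, 2, 2, 1, 4, 0, 0, 0, 0]
print(count_candidates(new_counts, new_t))  # 4
```

Only the stringIDs on the discarded list are touched, so the cost of re-estimating #candidates is proportional to the length of that list plus one pass over the counts.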


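Approach 2: Combining Lists

[Figure: inverted index on 2-grams (tf, vi, ir, ef, rv, ne, un, in, …) before and after combining lists.]

BEFORE: inverted lists (stringIDs) {1,3,4,5,7,9}, {5,9}, {5,6,9}, {1,2,3,9}, {1,3,9}, {7,9}, {6,9}, {1,2,4,5,6}

AFTER: inverted lists (stringIDs) {1,3,4,5,7,9}, {5,6,9}, {1,2,3,9}, {7,9}, {6,9}, {1,2,4,5,6}

Lists combined: {5,9} with {5,6,9} into {5,6,9}, and {1,3,9} with {1,2,3,9} into {1,2,3,9}

Intuition: Combine correlated lists.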


Effects on Queries

Merging-threshold T is unchanged (no new panics)

Lists become longer:

More time to traverse lists

More false positives

List-Merging Optimization

3-grams {sha, han, ang, ngh, gha, hai}

[Figure: the query grams point to shared physical lists; one combined list has refcount = 2, another combined list has refcount = 3.]

Traverse physical lists once.

Count for stringIDs on physical lists increased by refcount instead of 1
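The refcount optimization can be sketched as follows. The assignment of grams to physical lists and the list contents are made up for illustration (two grams share one combined list, three share another), and the names `gram_to_list` and `physical_lists` are hypothetical.

```python
from collections import Counter


def merge_with_refcounts(query_grams, gram_to_list, physical_lists, threshold):
    """Count-based list merging where several grams may share one physical list."""
    # refcount[list_id] = number of query grams that point to that physical list
    refcount = Counter(gram_to_list[g] for g in query_grams if g in gram_to_list)
    counts = Counter()
    for list_id, rc in refcount.items():
        # Traverse each physical list once; add refcount instead of 1.
        for string_id in physical_lists[list_id]:
            counts[string_id] += rc
    return sorted(sid for sid, c in counts.items() if c >= threshold)


query_grams = ["sha", "han", "ang", "ngh", "gha", "hai"]
gram_to_list = {"sha": 0, "han": 0,             # combined, refcount = 2
                "ang": 1, "ngh": 1, "gha": 1,   # combined, refcount = 3
                "hai": 2}
physical_lists = {0: [1, 4], 1: [1, 2], 2: [2, 4, 7]}
print(merge_with_refcounts(query_grams, gram_to_list, physical_lists, threshold=3))
# [1, 2, 4]
```

StringID 7 appears only on the uncombined list and gets a count of 1, so it is not a candidate at T = 3.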


Choosing Lists to Combine

Discovering candidate gram pairs
- Frequent (q+1)-grams: correlated adjacent q-grams
- Using Locality-Sensitive Hashing (LSH)

Selecting candidate pairs to combine
- Based on estimated cost on query workload
- Similar to DiscardLists
- Different Incremental-ScanCount algorithm
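The first discovery route, frequent (q+1)-grams, can be sketched as below: a (q+1)-gram that occurs in many strings implies that its two adjacent q-grams often occur together, so their lists are likely correlated. The support threshold, the example strings, and the function names are assumptions; the LSH route and the cost-based selection step are not shown.

```python
from collections import Counter


def candidate_pairs(strings, q, min_support):
    """Pairs of adjacent q-grams taken from frequent (q+1)-grams."""
    freq = Counter()
    for s in strings:
        # Count each (q+1)-gram once per string that contains it.
        freq.update({s[i:i + q + 1] for i in range(len(s) - q)})
    return {(g[:q], g[1:]) for g, n in freq.items() if n >= min_support}


def jaccard(list_a, list_b):
    """Jaccard similarity of two inverted lists, a simple correlation measure."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b) if a | b else 0.0


strings = ["shanghai", "shanty", "shandong", "hanoi"]
print(candidate_pairs(strings, q=3, min_support=3))  # {('sha', 'han')}
print(jaccard([1, 2, 3], [2, 3, 4]))                 # 0.5
```

Here "shan" occurs in three strings, so the lists of "sha" and "han" become a candidate pair; "hanoi" contains "han" without "sha", which is why a similarity check on the actual lists is still useful.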


Experiments

Datasets:
- Google WebCorpus (word grams)
- IMDB Actors

Queries: picked from dataset, Zipf distributed
q = 3, Edit Distance = 2

Overview:
- Performance of flavors of DiscardLists & CombineLists
- Scalability with increasing index size
- Comparison with IR compression technique
- Comparison with VGRAM
- What if the workload changes from the training workload?


Experiments

DiscardLists: Runtime decreases!
CombineLists: Runtime decreases!


Experiments: Comparison with IR compression technique

[Figure: two panels, each comparing Compressed and Uncompressed.]


Experiments: Comparison with variable-length gram technique, VGRAM

[Figure: two panels, each comparing Compressed and Uncompressed.]


Future Work

DiscardLists, CombineLists and IR compression could be combined

When considering filter tree, global vs. local decisions

How to minimize the impact on performance if the workload changes


Conclusion

We developed two lossy compression techniques

We answer queries exactly

Index can fit into a space budget (space constraint)

Queries can become faster on the compressed indexes

Flexibility to choose space / time tradeoff

Existing list-merging algorithms can be re-used (even with compression-specific optimizations)


More Experiments: What if the workload changes from the training workload?

