Efficient Top-k Algorithms for Fuzzy Search in StringCollections
Rares Vernica Chen Li
Department of Computer ScienceUniversity of California, Irvine
First International Workshop on Keyword Search on StructuredData, 2009
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17
Outline
1 Motivation
2 Efficient Top-k AlgorithmsProblem FormulationAlgorithms OverviewTop-k Single-Pass Search Algorithm
3 Experimental Evaluation
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 2 / 17
Need for Approximate String Queries
ID FirstName LastName # Movies10 Al Swartzberg 111 Hanna Wartenegg 112 Rik Swartzwelder 3013 Joey Swartzentruber 114 Rene Swartenbroekx 415 Arnold Schwarzenegger 28316 Luc Swartenbroeckx 1
......
...
Figure: Actor names and popularities
SELECT * FROM ActorsWHERE LastName = ’Shwartzenetrugger’ORDER BY ’# Movies’ DESC LIMIT 1;
0 Results
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17
Need for Approximate String Queries
ID FirstName LastName # Movies10 Al Swartzberg 111 Hanna Wartenegg 112 Rik Swartzwelder 3013 Joey Swartzentruber 114 Rene Swartenbroekx 415 Arnold Schwarzenegger 28316 Luc Swartenbroeckx 1
......
...
Figure: Actor names and popularities
SELECT * FROM ActorsWHERE LastName = ’Shwartzenetrugger’ORDER BY ’# Movies’ DESC LIMIT 1;
0 Results
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17
Need for Approximate String Queries
ID FirstName LastName # Movies10 Al Swartzberg 111 Hanna Wartenegg 112 Rik Swartzwelder 3013 Joey Swartzentruber 114 Rene Swartenbroekx 415 Arnold Schwarzenegger 28316 Luc Swartenbroeckx 1
......
...
Figure: Actor names and popularities
SELECT * FROM ActorsWHERE LastName = ’Shwartzenetrugger’ORDER BY ’# Movies’ DESC LIMIT 1;
0 ResultsRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17
Need for Ranking
ID FirstName LastName # Movies ED10 Al Swartzberg 1 811 Hanna Wartenegg 1 812 Rik Swartzwelder 30 813 Joey Swartzentruber 1 414 Rene Swartenbroekx 4 915 Arnold Schwarzenegger 283 516 Luc Swartenbroeckx 1 9
......
......
Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”.
Which one result should the system return?Which value is more important, # Movies or similarity?
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
Need for Ranking
ID FirstName LastName # Movies ED10 Al Swartzberg 1 811 Hanna Wartenegg 1 812 Rik Swartzwelder 30 813 Joey Swartzentruber 1 414 Rene Swartenbroekx 4 915 Arnold Schwarzenegger 283 516 Luc Swartenbroeckx 1 9
......
......
Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”.
Which one result should the system return?Which value is more important, # Movies or similarity?
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
Need for Ranking
ID FirstName LastName # Movies ED10 Al Swartzberg 1 811 Hanna Wartenegg 1 812 Rik Swartzwelder 30 813 Joey Swartzentruber 1 414 Rene Swartenbroekx 4 915 Arnold Schwarzenegger 283 516 Luc Swartenbroeckx 1 9
......
......
Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”.
Which one result should the system return?Which value is more important, # Movies or similarity?
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
Need for Ranking
ID FirstName LastName # Movies ED10 Al Swartzberg 1 811 Hanna Wartenegg 1 812 Rik Swartzwelder 30 813 Joey Swartzentruber 1 414 Rene Swartenbroekx 4 915 Arnold Schwarzenegger 283 516 Luc Swartenbroeckx 1 9
......
......
Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”.
Which one result should the system return?
Which value is more important, # Movies or similarity?
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
Need for Ranking
ID FirstName LastName # Movies ED10 Al Swartzberg 1 811 Hanna Wartenegg 1 812 Rik Swartzwelder 30 813 Joey Swartzentruber 1 414 Rene Swartenbroekx 4 915 Arnold Schwarzenegger 283 516 Luc Swartenbroeckx 1 9
......
......
Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”.
Which one result should the system return?Which value is more important, # Movies or similarity?
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
Top-k Similar Strings
Given:Weighted string collection
e.g., actors’ LastName and # Movies
Query stringe.g., “Shwartzenetrugger”
Similarity functione.g, Edit Distance
Scoring function (score of a string in terms of similarity and weight)e.g., linear combination of similarity and popularity
Integer k
Return: k best strings in terms of overall score to the query string.
Advantages over Range Search:specify k instead of a similarity thresholdguaranteed k results; a range search might have 0 results
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 5 / 17
Top-k Similar Strings
Given:Weighted string collection
e.g., actors’ LastName and # Movies
Query stringe.g., “Shwartzenetrugger”
Similarity functione.g, Edit Distance
Scoring function (score of a string in terms of similarity and weight)e.g., linear combination of similarity and popularity
Integer k
Return: k best strings in terms of overall score to the query string.Advantages over Range Search:
specify k instead of a similarity thresholdguaranteed k results; a range search might have 0 results
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 5 / 17
Algorithms Overview
Iterative RangeSearch
Single-PassSearch Two-Phase Search
RangeSearch
RangeSearch
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 6 / 17
q-grams: Overlapping substrings of fixed length
Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2
Vernica
→ {Ve,er,rn,ni,ic,ca}
Veronica→ {Ve,er,ro,on,ni,ic,ca}
Similar strings share a certain number of grams
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
q-grams: Overlapping substrings of fixed length
Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2
Vernica
→ {Ve,er,rn,ni,ic,ca}
Veronica→ {Ve,er,ro,on,ni,ic,ca}
Similar strings share a certain number of grams
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
q-grams: Overlapping substrings of fixed length
Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2
Vernica → {Ve
,er,rn,ni,ic,ca}
Veronica→ {Ve,er,ro,on,ni,ic,ca}
Similar strings share a certain number of grams
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
q-grams: Overlapping substrings of fixed length
Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2
Vernica → {Ve,er
,rn,ni,ic,ca}
Veronica→ {Ve,er,ro,on,ni,ic,ca}
Similar strings share a certain number of grams
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
q-grams: Overlapping substrings of fixed length
Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2
Vernica → {Ve,er,rn
,ni,ic,ca}
Veronica→ {Ve,er,ro,on,ni,ic,ca}
Similar strings share a certain number of grams
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
q-grams: Overlapping substrings of fixed length
Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2
Vernica → {Ve,er,rn,ni
,ic,ca}
Veronica→ {Ve,er,ro,on,ni,ic,ca}
Similar strings share a certain number of grams
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
q-grams: Overlapping substrings of fixed length
Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2
Vernica → {Ve,er,rn,ni,ic
,ca}
Veronica→ {Ve,er,ro,on,ni,ic,ca}
Similar strings share a certain number of grams
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
q-grams: Overlapping substrings of fixed length
Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2
Vernica → {Ve,er,rn,ni,ic,ca}
Veronica→ {Ve,er,ro,on,ni,ic,ca}
Similar strings share a certain number of grams
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
q-grams: Overlapping substrings of fixed length
Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2
Vernica → {Ve,er,rn,ni,ic,ca}
Veronica→ {Ve,er,ro,on,ni,ic,ca}
Similar strings share a certain number of grams
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
q-grams: Overlapping substrings of fixed length
Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2
Vernica → {Ve,er,rn,ni,ic,ca}
Veronica→ {Ve,er,ro,on,ni,ic,ca}
Similar strings share a certain number of grams
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
q-gram Inverted List Index
q = 2
ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40
Figure: Dataset
ab cc cd bc1 2 2 44 5 3 5
4
Figure: Gram inverted-list index
Query string: “bcd”Identified strings are verified by computing the real similarity.Verification is usually an expensive process.
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
q-gram Inverted List Index
q = 2
ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40
Figure: Dataset
ab cc cd bc1 2 2 44 5 3 5
4
Figure: Gram inverted-list index
Query string: “bcd”Identified strings are verified by computing the real similarity.Verification is usually an expensive process.
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
⇒
q-gram Inverted List Index
q = 2
ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40
Figure: Dataset
ab cc cd bc1 2 2 44 5 3 5
4
Figure: Gram inverted-list index
Query string: “bcd”
Identified strings are verified by computing the real similarity.Verification is usually an expensive process.
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
q-gram Inverted List Index
q = 2
ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40
Figure: Dataset
ab cc cd bc1 2 2 44 5 3 5
4
Figure: Gram inverted-list index
Query string: “bcd”
Identified strings are verified by computing the real similarity.Verification is usually an expensive process.
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
q-gram Inverted List Index
q = 2
ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40
Figure: Dataset
ab cc cd bc1 2 2 44 5 3 5
4
Figure: Gram inverted-list index
Query string: “bcd”Identified strings are verified by computing the real similarity.Verification is usually an expensive process.
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
⇐
Top-k Single-pass Search Algorithm
ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40
Figure: Dataset
ab cc cd bc1 2 2 44 5 3 5
4
Figure: Gram inverted-listindex
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
Top-k Single-pass Search Algorithm
SetupAssign IDs s.t.ascending order of IDs ≡descending order of weights
Sort the IDs on each list in ascendingorderScan the lists corresponding to thegrams in the query.e.g., “bcd”
ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40
Figure: Dataset
ab cc cd bc1 2 2 44 5 3 5
4
Figure: Gram inverted-listindex
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
Top-k Single-pass Search Algorithm
SetupAssign IDs s.t.ascending order of IDs ≡descending order of weightsSort the IDs on each list in ascendingorder
Scan the lists corresponding to thegrams in the query.e.g., “bcd”
ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40
Figure: Dataset
ab cc cd bc1 2 2 44 5 3 5
4
Figure: Gram inverted-listindex
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
Top-k Single-pass Search Algorithm
SetupAssign IDs s.t.ascending order of IDs ≡descending order of weightsSort the IDs on each list in ascendingorderScan the lists corresponding to thegrams in the query.e.g., “bcd”
ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40
Figure: Dataset
ab cc cd bc1 2 2 44 5 3 5
4
Figure: Gram inverted-listindex
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
Top-k Single-pass Search Algorithm
Naïve approach: Round-RobinScan all the lists in the same timeMaintain a list of “open” IDs (mightstill appear on some of the lists)Store the best k “closed” IDs in atop-k bufferStop when the top-k buffer cannotimprove
ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40
Figure: Dataset
ab cc cd bc1 2 2 44 5 3 5
4
Figure: Gram inverted-listindex
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1
→
2
→
2
→
20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
99
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd→1
→
2
→
2
→
20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
9
1
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd→1 →2
→
2
→
20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
9
12
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd→1 →2 →2
→
20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
9
122
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd→1 →2 →2
→
20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
9
122
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain thelist of “open” IDs
2 Skip elements
ab bc cd→1 →2 →2
→
20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
9
22
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1 →2 →2→20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
22
20
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1 →2 →2→20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
1
Figure:Top-kbuffer,k = 1
22
20
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1 →2 →2→20
→
3
→
421 4 5
......
...19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
2
Figure:Top-kbuffer,k = 1
20
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1
→
2
→
2→20 →3 →4
21 4 5...
......
19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
2
Figure:Top-kbuffer,k = 1
34
20
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1
→
2
→
2→20 →3 →4
21 4 5...
......
19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
2
Figure:Top-kbuffer,k = 1
34
20
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1
→
2
→
2→20 →3 →4
21 4 5...
......
19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
2
Figure:Top-kbuffer,k = 1
420
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1
→
2
→
2→20 →3 →4
21 4 5...
......
19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
2
Figure:Top-kbuffer,k = 1
420
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1
→
2
→
2→20 →3 →4
21 4 5...
......
19 19
→
20
→
20...
...
Figure: Graminverted-lists for query“abcd”
2
Figure:Top-kbuffer,k = 1
20
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Top-k Single-pass Search Algorithm
Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:
1 No need to maintain the listof “open” IDs
2 Skip elements
ab bc cd
→
1
→
2
→
2→20
→
3
→
421 4 5
......
...19 19→20 →20
......
Figure: Graminverted-lists for query“abcd”
2
Figure:Top-kbuffer,k = 1
20
Figure:Min-heap
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
Algorithms Overview
Iterative RangeSearch
Single-PassSearch Two-Phase Search
RangeSearch
RangeSearch
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 11 / 17
Experimental Setting
Datasets:IMDB Actor Names1
actor names and the number of movies they played in1.2 million actors, average name length 15weight is the number of movies (log normalized)
WEB Corpus Word Grams2
sequences of English words and their frequency on the Web2.4 million sequences, average sequence length 20weight is the frequency (log normalized)
Jaccard similarity and normalized edit similarity, q = 3Index and data are stored in main memory at all timesImplemented in C++ (GNU compiler) on Ubuntu Linux OSIntel 2.40GHz PC, 2GB RAM
1http://www.imdb.com/interfaces2http://www.ldc.upenn.edu/Catalog LDC2006T13Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 12 / 17
Benefits of Skipping Elements
0.2 0.4 0.6 0.8 1.0 1.2
Dataset Size (millions)
0
10
20
30
40
50
Tim
e (m
s)
SPSSPS*
Average running time fortop-10 queries. IMDBdataset with Jaccardsimilarity. Single-PassSearch (SPS) algorithmand SPS withoutskipping (SPS*).
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 13 / 17
Potential of the Two-Phase Algorithm
Q1 Q2 Q3
Queries
0
10
20
30
40
Tim
e (m
s)
Running time for 3top-10 queries. WEBCorpus dataset withnormalized editsimilarity. Two-Phasealgorithm with differentinitial thresholds.
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 14 / 17
Optimum Initial Threshold for the Two-Phase Algorithm
0.4 0.8 1.2 1.6 2.0 2.4
Dataset Size (millions)
0
20
40
60
80
100
Tim
e (m
s)
SPS2PH2PH Opt
Average running time fortop-10 queries. WebCorpus dataset withnormalized editsimilarity. Single-PassSearch (SPS) algorithm,Two-Phase (2PH)algorithm, 2PH with theoptimum initial threshold(2PH Opt).The Iterative Range Searchalgorithm was to expensive to beplotted. The average running timewas around 5s.
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 15 / 17
Summary
Approximate ranking queries in string collectionsUseful when mismatch between query and dataPropose three approaches to solve the problem:
1 Use existing approximate range search algorithms as a “black box”Proves to be the most expensive
2 Use particularities of the top-k problemProves to be very efficient
3 Combine (1) and (2) sequentiallyProves to be slightly more efficient
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 16 / 17
The Flamingo Project
This work is part ofThe Flamingo Project at UC Irvinehttp://flamingo.ics.uci.edu
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 17 / 17
A Quick Note on Related Work
Fagin et. al [1]similarity on multiplenumerical attributestraverse list of IDsone list per attributelists sorted on similarity tothat attributelists have different orders ofIDsall IDs appear on all the lists
Our Settingsimilarity on one stringattributetraverse list of IDsone list per q-gramlists sorted on global weight
lists have the same order ofIDsa subset of IDs appear oneach list
[1] R.Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware.In PODS, 2001
Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 17 / 17