56
Efficient Top-k Algorithms for Fuzzy Search in String Collections Rares Vernica Chen Li Department of Computer Science University of California, Irvine First International Workshop on Keyword Search on Structured Data, 2009 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17

Efficient Top-k Algorithms for Fuzzy Search in String ...chenli/pub/keys09-p9-vernica-slides.pdf · Efficient Top-k Algorithms for Fuzzy Search in String Collections ... {Ve,er,rn,ni,ic,ca}

Embed Size (px)

Citation preview

Efficient Top-k Algorithms for Fuzzy Search in StringCollections

Rares Vernica Chen Li

Department of Computer ScienceUniversity of California, Irvine

First International Workshop on Keyword Search on StructuredData, 2009

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17

Outline

1 Motivation

2 Efficient Top-k AlgorithmsProblem FormulationAlgorithms OverviewTop-k Single-Pass Search Algorithm

3 Experimental Evaluation

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 2 / 17

Need for Approximate String Queries

ID FirstName LastName # Movies10 Al Swartzberg 111 Hanna Wartenegg 112 Rik Swartzwelder 3013 Joey Swartzentruber 114 Rene Swartenbroekx 415 Arnold Schwarzenegger 28316 Luc Swartenbroeckx 1

......

...

Figure: Actor names and popularities

SELECT * FROM ActorsWHERE LastName = ’Shwartzenetrugger’ORDER BY ’# Movies’ DESC LIMIT 1;

0 Results

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17

Need for Approximate String Queries

ID FirstName LastName # Movies10 Al Swartzberg 111 Hanna Wartenegg 112 Rik Swartzwelder 3013 Joey Swartzentruber 114 Rene Swartenbroekx 415 Arnold Schwarzenegger 28316 Luc Swartenbroeckx 1

......

...

Figure: Actor names and popularities

SELECT * FROM ActorsWHERE LastName = ’Shwartzenetrugger’ORDER BY ’# Movies’ DESC LIMIT 1;

0 Results

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17

Need for Approximate String Queries

ID FirstName LastName # Movies10 Al Swartzberg 111 Hanna Wartenegg 112 Rik Swartzwelder 3013 Joey Swartzentruber 114 Rene Swartenbroekx 415 Arnold Schwarzenegger 28316 Luc Swartenbroeckx 1

......

...

Figure: Actor names and popularities

SELECT * FROM ActorsWHERE LastName = ’Shwartzenetrugger’ORDER BY ’# Movies’ DESC LIMIT 1;

0 ResultsRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17

Need for Ranking

ID FirstName LastName # Movies ED10 Al Swartzberg 1 811 Hanna Wartenegg 1 812 Rik Swartzwelder 30 813 Joey Swartzentruber 1 414 Rene Swartenbroekx 4 915 Arnold Schwarzenegger 283 516 Luc Swartenbroeckx 1 9

......

......

Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”.

Which one result should the system return?Which value is more important, # Movies or similarity?

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17

Need for Ranking

ID FirstName LastName # Movies ED10 Al Swartzberg 1 811 Hanna Wartenegg 1 812 Rik Swartzwelder 30 813 Joey Swartzentruber 1 414 Rene Swartenbroekx 4 915 Arnold Schwarzenegger 283 516 Luc Swartenbroeckx 1 9

......

......

Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”.

Which one result should the system return?Which value is more important, # Movies or similarity?

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17

Need for Ranking

ID FirstName LastName # Movies ED10 Al Swartzberg 1 811 Hanna Wartenegg 1 812 Rik Swartzwelder 30 813 Joey Swartzentruber 1 414 Rene Swartenbroekx 4 915 Arnold Schwarzenegger 283 516 Luc Swartenbroeckx 1 9

......

......

Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”.

Which one result should the system return?Which value is more important, # Movies or similarity?

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17

Need for Ranking

ID FirstName LastName # Movies ED10 Al Swartzberg 1 811 Hanna Wartenegg 1 812 Rik Swartzwelder 30 813 Joey Swartzentruber 1 414 Rene Swartenbroekx 4 915 Arnold Schwarzenegger 283 516 Luc Swartenbroeckx 1 9

......

......

Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”.

Which one result should the system return?

Which value is more important, # Movies or similarity?

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17

Need for Ranking

ID FirstName LastName # Movies ED10 Al Swartzberg 1 811 Hanna Wartenegg 1 812 Rik Swartzwelder 30 813 Joey Swartzentruber 1 414 Rene Swartenbroekx 4 915 Arnold Schwarzenegger 283 516 Luc Swartenbroeckx 1 9

......

......

Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”.

Which one result should the system return?Which value is more important, # Movies or similarity?

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17

Top-k Similar Strings

Given:Weighted string collection

e.g., actors’ LastName and # Movies

Query stringe.g., “Shwartzenetrugger”

Similarity functione.g, Edit Distance

Scoring function (score of a string in terms of similarity and weight)e.g., linear combination of similarity and popularity

Integer k

Return: k best strings in terms of overall score to the query string.

Advantages over Range Search:specify k instead of a similarity thresholdguaranteed k results; a range search might have 0 results

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 5 / 17

Top-k Similar Strings

Given:Weighted string collection

e.g., actors’ LastName and # Movies

Query stringe.g., “Shwartzenetrugger”

Similarity functione.g, Edit Distance

Scoring function (score of a string in terms of similarity and weight)e.g., linear combination of similarity and popularity

Integer k

Return: k best strings in terms of overall score to the query string.Advantages over Range Search:

specify k instead of a similarity thresholdguaranteed k results; a range search might have 0 results

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 5 / 17

Algorithms Overview

Iterative RangeSearch

Single-PassSearch Two-Phase Search

RangeSearch

RangeSearch

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 6 / 17

q-grams: Overlapping substrings of fixed length

Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2

Vernica

→ {Ve,er,rn,ni,ic,ca}

Veronica→ {Ve,er,ro,on,ni,ic,ca}

Similar strings share a certain number of grams

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17

q-grams: Overlapping substrings of fixed length

Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2

Vernica

→ {Ve,er,rn,ni,ic,ca}

Veronica→ {Ve,er,ro,on,ni,ic,ca}

Similar strings share a certain number of grams

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17

q-grams: Overlapping substrings of fixed length

Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve

,er,rn,ni,ic,ca}

Veronica→ {Ve,er,ro,on,ni,ic,ca}

Similar strings share a certain number of grams

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17

q-grams: Overlapping substrings of fixed length

Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er

,rn,ni,ic,ca}

Veronica→ {Ve,er,ro,on,ni,ic,ca}

Similar strings share a certain number of grams

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17

q-grams: Overlapping substrings of fixed length

Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn

,ni,ic,ca}

Veronica→ {Ve,er,ro,on,ni,ic,ca}

Similar strings share a certain number of grams

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17

q-grams: Overlapping substrings of fixed length

Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn,ni

,ic,ca}

Veronica→ {Ve,er,ro,on,ni,ic,ca}

Similar strings share a certain number of grams

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17

q-grams: Overlapping substrings of fixed length

Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn,ni,ic

,ca}

Veronica→ {Ve,er,ro,on,ni,ic,ca}

Similar strings share a certain number of grams

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17

q-grams: Overlapping substrings of fixed length

Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn,ni,ic,ca}

Veronica→ {Ve,er,ro,on,ni,ic,ca}

Similar strings share a certain number of grams

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17

q-grams: Overlapping substrings of fixed length

Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn,ni,ic,ca}

Veronica→ {Ve,er,ro,on,ni,ic,ca}

Similar strings share a certain number of grams

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17

q-grams: Overlapping substrings of fixed length

Find similar strings: e.g., “Vernica” and “Veronica”q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn,ni,ic,ca}

Veronica→ {Ve,er,ro,on,ni,ic,ca}

Similar strings share a certain number of grams

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17

q-gram Inverted List Index

q = 2

ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40

Figure: Dataset

ab cc cd bc1 2 2 44 5 3 5

4

Figure: Gram inverted-list index

Query string: “bcd”Identified strings are verified by computing the real similarity.Verification is usually an expensive process.

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17

q-gram Inverted List Index

q = 2

ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40

Figure: Dataset

ab cc cd bc1 2 2 44 5 3 5

4

Figure: Gram inverted-list index

Query string: “bcd”Identified strings are verified by computing the real similarity.Verification is usually an expensive process.

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17

q-gram Inverted List Index

q = 2

ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40

Figure: Dataset

ab cc cd bc1 2 2 44 5 3 5

4

Figure: Gram inverted-list index

Query string: “bcd”

Identified strings are verified by computing the real similarity.Verification is usually an expensive process.

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17

q-gram Inverted List Index

q = 2

ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40

Figure: Dataset

ab cc cd bc1 2 2 44 5 3 5

4

Figure: Gram inverted-list index

Query string: “bcd”

Identified strings are verified by computing the real similarity.Verification is usually an expensive process.

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17

q-gram Inverted List Index

q = 2

ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40

Figure: Dataset

ab cc cd bc1 2 2 44 5 3 5

4

Figure: Gram inverted-list index

Query string: “bcd”Identified strings are verified by computing the real similarity.Verification is usually an expensive process.

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17

Top-k Single-pass Search Algorithm

ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40

Figure: Dataset

ab cc cd bc1 2 2 44 5 3 5

4

Figure: Gram inverted-listindex

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17

Top-k Single-pass Search Algorithm

SetupAssign IDs s.t.ascending order of IDs ≡descending order of weights

Sort the IDs on each list in ascendingorderScan the lists corresponding to thegrams in the query.e.g., “bcd”

ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40

Figure: Dataset

ab cc cd bc1 2 2 44 5 3 5

4

Figure: Gram inverted-listindex

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17

Top-k Single-pass Search Algorithm

SetupAssign IDs s.t.ascending order of IDs ≡descending order of weightsSort the IDs on each list in ascendingorder

Scan the lists corresponding to thegrams in the query.e.g., “bcd”

ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40

Figure: Dataset

ab cc cd bc1 2 2 44 5 3 5

4

Figure: Gram inverted-listindex

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17

Top-k Single-pass Search Algorithm

SetupAssign IDs s.t.ascending order of IDs ≡descending order of weightsSort the IDs on each list in ascendingorderScan the lists corresponding to thegrams in the query.e.g., “bcd”

ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40

Figure: Dataset

ab cc cd bc1 2 2 44 5 3 5

4

Figure: Gram inverted-listindex

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17

Top-k Single-pass Search Algorithm

Naïve approach: Round-RobinScan all the lists in the same timeMaintain a list of “open” IDs (mightstill appear on some of the lists)Store the best k “closed” IDs in atop-k bufferStop when the top-k buffer cannotimprove

ID String Weight1 ab 0.802 ccd 0.703 cd 0.604 abcd 0.505 bcc 0.40

Figure: Dataset

ab cc cd bc1 2 2 44 5 3 5

4

Figure: Gram inverted-listindex

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd

1

2

2

20

3

421 4 5

......

...19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

1

Figure:Top-kbuffer,k = 1

99

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd→1

2

2

20

3

421 4 5

......

...19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

1

Figure:Top-kbuffer,k = 1

9

1

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd→1 →2

2

20

3

421 4 5

......

...19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

1

Figure:Top-kbuffer,k = 1

9

12

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd→1 →2 →2

20

3

421 4 5

......

...19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

1

Figure:Top-kbuffer,k = 1

9

122

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd→1 →2 →2

20

3

421 4 5

......

...19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

1

Figure:Top-kbuffer,k = 1

9

122

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain thelist of “open” IDs

2 Skip elements

ab bc cd→1 →2 →2

20

3

421 4 5

......

...19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

1

Figure:Top-kbuffer,k = 1

9

22

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd

1 →2 →2→20

3

421 4 5

......

...19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

1

Figure:Top-kbuffer,k = 1

22

20

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd

1 →2 →2→20

3

421 4 5

......

...19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

1

Figure:Top-kbuffer,k = 1

22

20

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd

1 →2 →2→20

3

421 4 5

......

...19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

2

Figure:Top-kbuffer,k = 1

20

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd

1

2

2→20 →3 →4

21 4 5...

......

19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

2

Figure:Top-kbuffer,k = 1

34

20

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd

1

2

2→20 →3 →4

21 4 5...

......

19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

2

Figure:Top-kbuffer,k = 1

34

20

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd

1

2

2→20 →3 →4

21 4 5...

......

19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

2

Figure:Top-kbuffer,k = 1

420

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd

1

2

2→20 →3 →4

21 4 5...

......

19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

2

Figure:Top-kbuffer,k = 1

420

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd

1

2

2→20 →3 →4

21 4 5...

......

19 19

20

20...

...

Figure: Graminverted-lists for query“abcd”

2

Figure:Top-kbuffer,k = 1

20

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Top-k Single-pass Search Algorithm

Heap-basedTraverse the lists in a sortedorder using a heap on the topIDs of the listsAdvantages:

1 No need to maintain the listof “open” IDs

2 Skip elements

ab bc cd

1

2

2→20

3

421 4 5

......

...19 19→20 →20

......

Figure: Graminverted-lists for query“abcd”

2

Figure:Top-kbuffer,k = 1

20

Figure:Min-heap

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17

Algorithms Overview

Iterative RangeSearch

Single-PassSearch Two-Phase Search

RangeSearch

RangeSearch

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 11 / 17

Experimental Setting

Datasets:IMDB Actor Names1

actor names and the number of movies they played in1.2 million actors, average name length 15weight is the number of movies (log normalized)

WEB Corpus Word Grams2

sequences of English words and their frequency on the Web2.4 million sequences, average sequence length 20weight is the frequency (log normalized)

Jaccard similarity and normalized edit similarity, q = 3Index and data are stored in main memory at all timesImplemented in C++ (GNU compiler) on Ubuntu Linux OSIntel 2.40GHz PC, 2GB RAM

1http://www.imdb.com/interfaces2http://www.ldc.upenn.edu/Catalog LDC2006T13Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 12 / 17

Benefits of Skipping Elements

0.2 0.4 0.6 0.8 1.0 1.2

Dataset Size (millions)

0

10

20

30

40

50

Tim

e (m

s)

SPSSPS*

Average running time fortop-10 queries. IMDBdataset with Jaccardsimilarity. Single-PassSearch (SPS) algorithmand SPS withoutskipping (SPS*).

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 13 / 17

Potential of the Two-Phase Algorithm

Q1 Q2 Q3

Queries

0

10

20

30

40

Tim

e (m

s)

Running time for 3top-10 queries. WEBCorpus dataset withnormalized editsimilarity. Two-Phasealgorithm with differentinitial thresholds.

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 14 / 17

Optimum Initial Threshold for the Two-Phase Algorithm

0.4 0.8 1.2 1.6 2.0 2.4

Dataset Size (millions)

0

20

40

60

80

100

Tim

e (m

s)

SPS2PH2PH Opt

Average running time fortop-10 queries. WebCorpus dataset withnormalized editsimilarity. Single-PassSearch (SPS) algorithm,Two-Phase (2PH)algorithm, 2PH with theoptimum initial threshold(2PH Opt).The Iterative Range Searchalgorithm was to expensive to beplotted. The average running timewas around 5s.

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 15 / 17

Summary

Approximate ranking queries in string collectionsUseful when mismatch between query and dataPropose three approaches to solve the problem:

1 Use existing approximate range search algorithms as a “black box”Proves to be the most expensive

2 Use particularities of the top-k problemProves to be very efficient

3 Combine (1) and (2) sequentiallyProves to be slightly more efficient

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 16 / 17

The Flamingo Project

This work is part ofThe Flamingo Project at UC Irvinehttp://flamingo.ics.uci.edu

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 17 / 17

A Quick Note on Related Work

Fagin et. al [1]similarity on multiplenumerical attributestraverse list of IDsone list per attributelists sorted on similarity tothat attributelists have different orders ofIDsall IDs appear on all the lists

Our Settingsimilarity on one stringattributetraverse list of IDsone list per q-gramlists sorted on global weight

lists have the same order ofIDsa subset of IDs appear oneach list

[1] R.Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware.In PODS, 2001

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 17 / 17