
Compressed Index for Dictionary Matching


Page 1: Compressed Index for Dictionary Matching

Compressed Index for Dictionary Matching

WK Hon (NTHU), TW Lam (HKU), R Shah (LSU), SL Tam (HKU), JS Vitter (Purdue)

Page 2: Compressed Index for Dictionary Matching

Outline

• Dictionary Matching Problem
• Summary of Results
• Description of Our Solution (Brief), based on:
  (I) Suffix Tree
  (II) A Simple Sampling Idea
  (III) Handling Irregularities
• Open Problems

Page 3: Compressed Index for Dictionary Matching

Dictionary Matching

• Input: A set of d short patterns, { P1, P2, …, Pd }, of total length n
• Problem: Preprocess the patterns and create an index so that, on receiving any text T, we can report, for each Pj, all positions in T where it occurs
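The problem statement can be read as a brute-force specification. The sketch below is only a reference for what a correct index must report, not the index itself; the function name and the 0-based positions are assumptions made for illustration.

```python
def naive_dictionary_match(patterns, text):
    """Return {pattern: [0-based start positions in text]} by brute force.

    Assumes every pattern is a non-empty string.
    """
    result = {}
    for p in patterns:
        positions = []
        start = text.find(p)
        while start != -1:
            positions.append(start)
            # Resume one character later so overlapping occurrences are found.
            start = text.find(p, start + 1)
        result[p] = positions
    return result


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
print(naive_dictionary_match(patterns, "chathave"))
```

This scans T once per pattern, so it uses no index at all; the solutions on the later slides aim to report the same positions faster.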

Page 4: Compressed Index for Dictionary Matching

Dictionary Matching

• Relevant parameters to measure the index's performance:
  d = number of patterns
  n = total length of patterns
  |T| = length of T
  σ = size of alphabet of T and patterns
  occ = total number of occurrences in the search result

Page 5: Compressed Index for Dictionary Matching

Summary of Results

Space (bits)                Search Time                          Ref
O( n log n )                O( |T| + occ )                       [AC 75]
O( n ) when σ = constant    O( (|T| + occ) log² n )              [CHLS 07]
O( n log σ )                O( |T| log log n + occ )             ** this **
(1 + o(1)) n log σ          O( |T| (log^ε n + log d) + occ )     ** this **

Callouts: optimal; |patterns| + o(n log σ); ε = constant in (0,1)

Page 6: Compressed Index for Dictionary Matching

Existing Solution I: Patricia Trie

• Compact trie storing all d patterns

[Figure: Patricia trie for { ate, chair, chat, hat, have, vet }]

Page 7: Compressed Index for Dictionary Matching

Existing Solution I: Patricia Trie

• Advantage: Space: |patterns| + O( d log n ) bits
  Very small overhead in addition to the input patterns

Page 8: Compressed Index for Dictionary Matching

Existing Solution I: Patricia Trie

Searching Strategy: For each position k in T
• Match T from the root starting at k
• Report occurrence of any Pj found

Disadvantage: Searching: worst-case O(|T|n + occ) time
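The searching strategy above can be sketched as follows. For brevity the sketch uses a plain (uncompacted) trie of nested dicts rather than a Patricia trie, and the function names are assumptions; the point is only that every start position k triggers a fresh walk from the root.

```python
def build_trie(patterns):
    """Plain trie as nested dicts; the key None marks the end of a pattern."""
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node[None] = p
    return root


def search_from_every_position(root, text):
    """For each start k, walk down from the root and report (k, pattern) hits."""
    hits = []
    for k in range(len(text)):
        node = root
        i = k
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if None in node:
                hits.append((k, node[None]))
    return hits


trie = build_trie(["ate", "chair", "chat", "hat", "have", "vet"])
print(search_from_every_position(trie, "chathave"))
```

Each walk restarts at the root and can run as deep as the longest pattern, which is where the |T|n factor in the worst case comes from.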

Page 9: Compressed Index for Dictionary Matching

Existing Solution II: Suffix Tree

• Compact trie storing all suffixes of all d patterns

[Figure: suffix tree for { ate, chair, chat, hat, have, vet }]
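To make "all suffixes of all d patterns" concrete, the sketch below lists the strings such a tree would store for the example dictionary. It only enumerates the suffixes and does not build the tree; the function name and the use of "$" as an end marker (as in the figure) are assumptions.

```python
def all_suffixes(patterns):
    """Return the sorted distinct suffixes of all patterns, each ending in '$'."""
    return sorted({p[i:] + "$" for p in patterns for i in range(len(p))})


suffixes = all_suffixes(["ate", "chair", "chat", "hat", "have", "vet"])
for s in suffixes:
    print(s)
```

Every one of the n character positions contributes a suffix, so the number of stored strings grows with n rather than with d.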

Page 10: Compressed Index for Dictionary Matching

Existing Solution II: Suffix Tree

Same Searching Strategy: For each position k in T
• Match T from the root starting at k
• Report occurrence of any Pj found

Matching Time = O(|T|)

Searching: worst-case O(|T| + occ) time

Page 11: Compressed Index for Dictionary Matching

Existing Solution II: Suffix Tree

Disadvantage: Space: O( n log n ) bits, which could be much larger than O( n log σ ), the space for |patterns|
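A quick calculation shows how far apart the two space bounds can be. The sketch compares only the leading terms with constants dropped; the values n = 2^20 and an alphabet of size 4 are hypothetical, chosen to keep the logarithms exact.

```python
import math


def space_bits(n, sigma):
    """Leading terms only, constants dropped: (n log n, n log sigma) in bits."""
    return n * math.log2(n), n * math.log2(sigma)


tree_bits, pattern_bits = space_bits(n=2**20, sigma=4)
print(tree_bits / pattern_bits)  # ratio (log n) / (log sigma) -> 10.0
```

The ratio is (log n) / log(alphabet size), so the gap widens as the dictionary grows or the alphabet shrinks.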

Page 12: Compressed Index for Dictionary Matching

Our Solution

• no suffixes: poor searching
• all suffixes: poor space
• some suffixes: good space + searching

Page 13: Compressed Index for Dictionary Matching

Our Solution: Sampling

• Store only one suffix out of every few suffixes; the number of suffixes per stored one is the sampling factor

[Figure: sampled suffix tree with sampling factor = 2 for { ate, chair, chat, hat, have, vet }]
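The sampling idea can be illustrated by listing which suffixes survive for the example dictionary. The sketch assumes sampling is aligned to the start of each pattern (offsets 0, 2, 4, … for a factor of 2); the slide does not spell out the alignment, so treat that choice, and the function name, as assumptions.

```python
def sampled_suffixes(patterns, factor):
    """Keep the suffixes starting at offsets 0, factor, 2*factor, ... of each pattern."""
    return sorted(
        {p[i:] + "$" for p in patterns for i in range(0, len(p), factor)}
    )


kept = sampled_suffixes(["ate", "chair", "chat", "hat", "have", "vet"], factor=2)
for s in kept:
    print(s)
```

With a factor of 1 this degenerates to storing every suffix; larger factors store fewer strings at the price of the irregularities handled on the following slides.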

Page 14: Compressed Index for Dictionary Matching

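Our Solution: Sampling

• Store only one suffix out of every few suffixes; the number of suffixes per stored one is the sampling factor

[Figure: the sampled suffix tree with sampling factor = 2 for { ate, chair, chat, hat, have, vet }, with its irregularities marked]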

Page 15: Compressed Index for Dictionary Matching

Our Solution: Sampling

Same Searching Strategy: For each position k in T
• Match T from the root starting at k
• Report occurrence of any Pj found

Need to handle irregularities

Matching time = O(|T|) despite irregularities

Page 16: Compressed Index for Dictionary Matching

Handling irregularities

When sampling factor = log n

• Predecessor search in a set of (log n)-bit integers
• Data structure: Y-fast trie

Search: O(|T| log log n + occ) time
Space: O( n log σ ) bits
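For readers unfamiliar with the operation, a predecessor query returns the largest stored key not exceeding the query. The sketch below answers it with binary search over a sorted list, which is a simple stand-in and not a Y-fast trie: it takes time logarithmic in the number of keys, whereas a Y-fast trie is what gives the log log n factor in the bound above. The function name and sample keys are assumptions.

```python
import bisect


def predecessor(sorted_keys, x):
    """Largest key <= x in a sorted list, or None if every key exceeds x."""
    i = bisect.bisect_right(sorted_keys, x)
    return sorted_keys[i - 1] if i else None


keys = [3, 8, 15, 42]
print(predecessor(keys, 10))  # 8
print(predecessor(keys, 2))   # None
```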

Page 17: Compressed Index for Dictionary Matching

Handling irregularities

When sampling factor = (log n) / log σ

• Predecessor search in a set of (log n)-bit strings
• Data structure: String B-tree

Search: O(|T| (log^ε n + log d) + occ) time
Space: |patterns| + o(n log σ) bits

Page 18: Compressed Index for Dictionary Matching

Handling irregularities

When sampling factor = (log n) / log σ

• Predecessor search in a set of (log n)-bit strings
• Data structure: String B-tree

Search: O(|T| (log^ε n + log d) + occ) time
Space: n Hk + o(n log σ) bits

Reference: [FerVen 07]

Page 19: Compressed Index for Dictionary Matching

Open Problems

• Compressed + Dynamic Version: Can an index support updates to the set of patterns? Target: Achieve nHk-type space bound
• External Memory Version: Can an index operate in external memory and still support fast searching?