1
Compressed Index for Dictionary Matching
WK Hon (NTHU), TW Lam (HKU), R Shah (LSU), SL Tam (HKU), JS Vitter (Purdue)
2
Outline
• Dictionary Matching Problem
• Summary of Results
• Description of Our Solution (Brief): based on (I) Suffix Tree, (II) A Simple Sampling Idea, (III) Handling Irregularities
• Open Problems
3
Dictionary Matching
• Input: A set of d short patterns, { P1, P2, …, Pd }, of total length n
• Problem: Preprocess the patterns and create an index so that, on receiving any text T, we can report for each Pj all positions in T where it occurs
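To make the required output concrete, here is a minimal brute-force sketch in Python. It builds no index at all and only illustrates what "report for each Pj all positions in T" means. The function name `naive_dictionary_match` and the sample text are assumptions made for illustration, not part of the slides.

```python
def naive_dictionary_match(patterns, text):
    """Return {pattern index: list of 0-based start positions in text}."""
    result = {j: [] for j in range(len(patterns))}
    for j, p in enumerate(patterns):
        start = text.find(p)
        while start != -1:
            result[j].append(start)
            # Restart one character later so overlapping occurrences are found.
            start = text.find(p, start + 1)
    return result


# Example dictionary taken from the slides; the text is a made-up example.
patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
occurrences = naive_dictionary_match(patterns, "chathave")
```

Each pattern is scanned against the text separately, which is exactly the cost an index over the patterns is meant to avoid.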
4
Dictionary Matching
• Relevant parameters to measure the index's performance:
d = # of patterns
n = total length of patterns
|T| = length of T
σ = size of alphabet of T and patterns
occ = total occurrences in search result
5
Summary of Results
Space (bits)              Search Time                        Ref
O(n log n)                O(|T| + occ)                       [AC 75]
O(n) when σ = constant    O((|T| + occ) log² n)              [CHLS 07]
O(n log σ)                O(|T| log log n + occ)             ** this **
(1 + o(1)) n log σ        O(|T| (log^ε n + log d) + occ)     ** this **
Notes: (1 + o(1)) n log σ = |patterns| + o(n log σ) bits is optimal; ε = constant in (0,1)
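The O(|T| + occ) search time credited to [AC 75] is that of the Aho-Corasick automaton. The following is a minimal Python sketch of that classic construction, not of the compressed indexes proposed in this talk. The names `build_ac` and `ac_search` are illustrative, and output lists are copied for simplicity, so the preprocessing here is not tuned to be linear.

```python
from collections import deque


def build_ac(patterns):
    # goto[s]: dict char -> state; fail[s]: failure link;
    # out[s]: indices of all patterns that end at state s.
    goto, fail, out = [{}], [0], [[]]
    for j, p in enumerate(patterns):
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(j)
    # Breadth-first pass: depth-1 states keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure state.
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, out


def ac_search(patterns, text):
    """Return (start position, pattern index) pairs in a single pass over text."""
    goto, fail, out = build_ac(patterns)
    hits, s = [], 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for j in out[s]:
            hits.append((i - len(patterns[j]) + 1, j))
    return hits


ac_hits = ac_search(["ate", "chair", "chat", "hat", "have", "vet"], "chathave")
```

The text is read once, left to right, and failure links replace the restart from the root at every position; storing those links and transitions as pointers is what leads to the O(n log n)-bit space.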
6
Existing Solution I: Patricia Trie
• Compact trie storing all d patterns
[Figure: Patricia trie for { ate, chair, chat, hat, have, vet }]
7
Existing Solution I: Patricia Trie
• Advantage: Space: |patterns| + O(d log n) bits
• Very small overhead in addition to the input patterns
8
Existing Solution I: Patricia Trie
Searching Strategy: For each position k in T
• Match T from the root starting at k
• Report occurrence of any Pj found
Disadvantage: Searching: worst-case O(|T|n + occ) time
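The searching strategy above can be sketched as follows. For brevity this Python sketch uses a plain (uncompressed) trie of nested dictionaries as a stand-in for the Patricia trie, so it reproduces the search behaviour but not the space bound. The names `build_trie` and `trie_search` and the `"$ids"` key are assumptions made for illustration.

```python
def build_trie(patterns):
    # Nested dicts keyed by single characters; the multi-character key
    # "$ids" (cannot clash with a character) lists patterns ending at a node.
    root = {}
    for j, p in enumerate(patterns):
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node.setdefault("$ids", []).append(j)
    return root


def trie_search(root, text):
    """Return (start position k, pattern index j) for every occurrence."""
    hits = []
    for k in range(len(text)):          # for each position k in T
        node, i = root, k
        while True:                     # match T from the root starting at k
            for j in node.get("$ids", ()):
                hits.append((k, j))     # report occurrence of any Pj found
            if i == len(text) or text[i] not in node:
                break
            node = node[text[i]]
            i += 1
    return hits


trie_hits = trie_search(
    build_trie(["ate", "chair", "chat", "hat", "have", "vet"]), "chathave"
)
```

Every start position walks down from the root again, which is where the extra factor in the worst-case search time comes from.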
9
Existing Solution II: Suffix Tree
• Compact trie storing all suffixes of all d patterns
[Figure: suffix tree for { ate, chair, chat, hat, have, vet }]
10
Existing Solution II: Suffix Tree
Same Searching Strategy: For each position k in T
• Match T from the root starting at k
• Report occurrence of any Pj found
Matching Time = O(|T|)
Searching: worst-case O(|T| + occ) time
11
Existing Solution II: Suffix Tree
Disadvantage: Space: O(n log n) bits
• Could be much larger than O(n log σ) bits, the space for |patterns|
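A rough back-of-the-envelope calculation shows the size of this gap. The Python sketch below ignores all constant factors and simply compares n⌈log₂ n⌉ bits with n⌈log₂ σ⌉ bits; the values of `n` and `sigma` are illustrative assumptions, not figures from the talk.

```python
import math


def pointer_based_bits(n):
    # O(n log n): about one ceil(log2 n)-bit pointer per pattern character.
    return n * math.ceil(math.log2(n))


def raw_pattern_bits(n, sigma):
    # n log sigma: each character stored in ceil(log2 sigma) bits.
    return n * math.ceil(math.log2(sigma))


# Assumed example: about a million characters over a 4-letter alphabet.
n, sigma = 2**20, 4
ratio = pointer_based_bits(n) / raw_pattern_bits(n, sigma)
```

With these assumed values the pointer-based bound is ten times the raw pattern size, before any hidden constants are counted.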
12
Our Solution
• No suffixes: poor searching
• All suffixes: poor space
• Some suffixes: good space + searching
13
Our Solution: Sampling
• Store only one suffix out of every group of suffixes; the group size is the sampling factor
[Figure: sampled suffix tree for { ate, chair, chat, hat, have, vet }, sampling factor = 2]
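As a small illustration of the sampling idea, the Python sketch below lists which suffixes survive for a given sampling factor. The slides do not say which suffixes are kept, so the rule used here (keep the suffixes starting at offsets 0, factor, 2·factor, … of each pattern) is an assumption, as is the function name `sampled_suffixes`.

```python
def sampled_suffixes(patterns, factor):
    """Return (pattern index, offset, suffix) for each sampled suffix."""
    samples = []
    for j, p in enumerate(patterns):
        # Assumed rule: keep every factor-th suffix, counted from the start.
        for offset in range(0, len(p), factor):
            samples.append((j, offset, p[offset:]))
    return samples


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
all_suffixes = sampled_suffixes(patterns, 1)   # factor 1 keeps everything
half_suffixes = sampled_suffixes(patterns, 2)  # factor 2, as in the figure
```

Under this assumed rule the example dictionary has 22 suffixes in total, of which 13 remain at sampling factor 2.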
14
Our Solution: Sampling
• Store only one suffix out of every group of suffixes; the group size is the sampling factor
[Figure: sampled suffix tree for { ate, chair, chat, hat, have, vet }, sampling factor = 2, with the irregularities marked]
15
Our Solution: Sampling
Need to handle irregularities
Same Searching Strategy: For each position k in T
• Match T from the root starting at k
• Report occurrence of any Pj found
Matching time = O(|T|) despite irregularities
16
When sampling factor = log n
• Handling irregularities: predecessor search in a set of (log n)-bit integers
• Data structure: Y-fast trie
• Search: O(|T| log log n + occ) time
• Space: O(n log σ) bits
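For readers unfamiliar with the operation, a predecessor query returns the largest stored key that does not exceed the query value. The Python sketch below answers it by binary search over a sorted list, which takes O(log m) time for m keys. It is only a stand-in: a y-fast trie answers the same query in O(log log U) time for keys drawn from a universe of size U. The function name and the sample keys are assumptions.

```python
from bisect import bisect_right


def predecessor(sorted_keys, x):
    """Return the largest key <= x, or None if every key exceeds x."""
    i = bisect_right(sorted_keys, x)
    return sorted_keys[i - 1] if i else None


# Made-up integer keys, kept in sorted order.
keys = [3, 8, 21, 34]
pred_of_20 = predecessor(keys, 20)
pred_of_21 = predecessor(keys, 21)
pred_of_2 = predecessor(keys, 2)
```

Swapping the sorted list for a y-fast trie changes only the query cost, not the answers returned.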
17
When sampling factor = (log n) / log σ
• Handling irregularities: predecessor search in a set of (log n)-bit strings
• Data structure: String B-tree
• Search: O(|T| (log^ε n + log d) + occ) time
• Space: |patterns| + o(n log σ) bits
18
When sampling factor = (log n) / log σ
• Handling irregularities: predecessor search in a set of (log n)-bit strings
• Data structure: String B-tree
• Search: O(|T| (log^ε n + log d) + occ) time
• Space: n Hk + o(n log σ) bits [FerVen 07]
19
Open Problems
• Compressed + Dynamic Version: Can an index support updates to the set of patterns? Target: achieve an nHk-type space bound
• External Memory Version: Can an index operate in external memory and still support fast searching?