44
Sparse Compact Directed Acyclic Word Graphs Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda (Kyushu University & Japan Science and Technology Agency)

Sparse Compact Directed Acyclic Word Graphs

  • Upload
    corby

  • View
    35

  • Download
    0

Embed Size (px)

DESCRIPTION

Sparse Compact Directed Acyclic Word Graphs. Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda (Kyushu University & Japan Science and Technology Agency). Traditional Pattern Matching Problem. Given : text T in S * and pattern P in S * - PowerPoint PPT Presentation

Citation preview

Page 1: Sparse Compact Directed Acyclic Word Graphs

Sparse Compact DirectedAcyclic Word Graphs

Shunsuke Inenaga(Japan Society for the Promotion of Science

& Kyushu University)

Masayuki Takeda(Kyushu University

& Japan Science and Technology Agency)

Page 2: Sparse Compact Directed Acyclic Word Graphs

Traditional Pattern Matching Problem

Given : text T in and pattern P in Return : whether or not P appears in T

: alphabet (set of characters) : set of strings

A text indexing structure for T enables you to solve the above problem in O(m) time (for fixed ).

m: the length of P

Page 3: Sparse Compact Directed Acyclic Word Graphs

Suffix Trie

A trie representing all suffixes of T

a

a

c

b

$

c

b

$

c

b

$

b

$

$T = aacb$

aacb$acb$cb$b$$

Page 4: Sparse Compact Directed Acyclic Word Graphs

Unwanted Matching

pace runner

pattern

s

text

Page 5: Sparse Compact Directed Acyclic Word Graphs

Unwanted Matching

pace runner

pattern

s

text

Page 6: Sparse Compact Directed Acyclic Word Graphs

Unwanted Matching

pace runner

pattern

s

text

Page 7: Sparse Compact Directed Acyclic Word Graphs

Introducing Word Separator #

# : word separator - special symbol not in D = # : dictionary of words

Text T : an element of D+

(T is a sequence T1T2…Tk of k words in D)

e.g., T = This#is#a#pen#

= {A,…,z} D = {...,This#,...,a#,...is#,...pen#,...}

Page 8: Sparse Compact Directed Acyclic Word Graphs

Word-level Pattern Matching Problem

Given: text T in D+ and pattern P in D+

Return: whether or not P appears at the beginning of any word in T

e.g.

T = The#space#runner#is#not#your#good#pace#runner#

P = pace#runner#

Page 9: Sparse Compact Directed Acyclic Word Graphs

Word-level Pattern Matching Problem

Given: text T in D+ and pattern P in D+

Return: whether or not P appears at the beginning of any word in T

e.g.

T = The#space#runner#is#not#your#good#pace#runner#

P = pace#runner#

Page 10: Sparse Compact Directed Acyclic Word Graphs

Word Suffix Trie

A trie representing the suffixes of T which begin at a word.

T = aa#b#

aa#b#a#b##b#b##

a

a

#

b

#

b

#

Page 11: Sparse Compact Directed Acyclic Word Graphs

Normal and Word Suffix Tries

a

a

#

b

#

b

#

a

a

#

b

#

#

b

#

#

b

#

b

#

T = aa#b#

Normal Suffix Trie Word Suffix Trie

Page 12: Sparse Compact Directed Acyclic Word Graphs

Normal and Word Suffix Trees

aa

#b

#

b#

a

a#

b#

#b#

#b#

b#

T = aa#b#

Normal Suffix Tree Word Suffix Tree

Page 13: Sparse Compact Directed Acyclic Word Graphs

Sizes of Word Suffix Tries and Trees

For text T = T1T2…Tk of length n,

the word suffix trie of T requires O(nk) space, but

the word suffix tree of T requires O(k) space!!

because the word suffix tree has only k leaves and has only branching internal nodes.

Page 14: Sparse Compact Directed Acyclic Word Graphs

Construction of Word Suffix Trees

Algorithm by Andersson et al. ( 1996 )

for text T = T1T2…Tk of length n, constructs word suffix trees in O(n) expected time with O(k) space.

Our algorithm (CPM’06)

builds word suffix trees in O(n) time in the worst case, with O(k) space.

Page 15: Sparse Compact Directed Acyclic Word Graphs

Our Construction Algorithm

We modify Ukkonen’s on-line normal suffix tree construction algorithm by using minimum DFA accepting dictionary D

We replace the root node of the suffix tree with the final state of the DFA.

Page 16: Sparse Compact Directed Acyclic Word Graphs

Minimum DFA

The minimum DFA accepting D = # clearly requires constant space (for fixed ).

#

Page 17: Sparse Compact Directed Acyclic Word Graphs

On-line Construction of Word Suffix Trees

T = aa#b# a,b

a

#

Page 18: Sparse Compact Directed Acyclic Word Graphs

T = aa#b# a,b

a

#

On-line Construction of Word Suffix Trees

Page 19: Sparse Compact Directed Acyclic Word Graphs

T = aa#b# a,b

#

aa

On-line Construction of Word Suffix Trees

Page 20: Sparse Compact Directed Acyclic Word Graphs

T = aa#b# a,b

#

aa

On-line Construction of Word Suffix Trees

Page 21: Sparse Compact Directed Acyclic Word Graphs

T = aa#b# a,b

#

aa

#

On-line Construction of Word Suffix Trees

Page 22: Sparse Compact Directed Acyclic Word Graphs

T = aa#b# a,b

#

aa

#

On-line Construction of Word Suffix Trees

Page 23: Sparse Compact Directed Acyclic Word Graphs

T = aa#b# a,b

#

aa

#b

b

On-line Construction of Word Suffix Trees

Page 24: Sparse Compact Directed Acyclic Word Graphs

T = aa#b# a,b

#

aa

#b

b

On-line Construction of Word Suffix Trees

Page 25: Sparse Compact Directed Acyclic Word Graphs

T = aa#b# a,b

#

aa

#b

b#

#

On-line Construction of Word Suffix Trees

Page 26: Sparse Compact Directed Acyclic Word Graphs

Pseudo-Code

Just change here

Page 27: Sparse Compact Directed Acyclic Word Graphs

Like Dress-up Doll...

Page 28: Sparse Compact Directed Acyclic Word Graphs

Like Dress-up Doll... [cont.]

Just change hair!!

Body is the same!!

aa

#b

#

b#

a

a#

b#

#b#

#b#

b#

a,b#

#

Different looking!!!!

Page 29: Sparse Compact Directed Acyclic Word Graphs

Compact Directed Acyclic Word Graphs

a

a#

b#

#b#

#b#

b#

T = aa#b#

Suffix Tree

a

a#b#

#b#

#b#

b#

Compact Directed Acyclic Word Graph (CDAWG)

minimization

Page 30: Sparse Compact Directed Acyclic Word Graphs

Sparse CDAWGs

aa

#b

#

b#

T = aa#b#

Word Suffix Tree Sparse Compact Directed Acyclic Word Graph (SCDAWG)

minimization

aa#b#

b#

Page 31: Sparse Compact Directed Acyclic Word Graphs

Sparse CDAWGs [cont.]

a

a

#b

#

#

T = a#b#a#bab#

Word Suffix Tree SCDAWG

minimization

#b

ba

a

b

#

#

#b

ba

#ba

#ba

a#

#

#b

ba #

ba

a#b

b

Page 32: Sparse Compact Directed Acyclic Word Graphs

SCDAWG Construction

SCDAWGs can be constructed by minimizing word suffix trees in O(k) time.

using Revuz’s DAG minimization algorithm (1992)

Page 33: Sparse Compact Directed Acyclic Word Graphs

SCDAWG Construction [cont.]

Question : Direct construction for SCDAWGs?

Answer : YES! Using minimal DFA accepting dictionary D, we can directly build SCDAWGs in O(n) time and O(k) space.

We modify the CDAWG on-line construction algorithm (Inenaga et al. 05) by using the above DFA.

Page 34: Sparse Compact Directed Acyclic Word Graphs

Pseudo-Code

Just change here

Body is the same!!

a

a#b#

#b#

#b#

b#

aa#b#

b#

Different looking!!!!a,b

#

#

Page 35: Sparse Compact Directed Acyclic Word Graphs

Some Events

Basically on-line construction of SCDAWGs is similar to that of word suffix trees.

Except for the two following unique events: Edge merging Node splitting

Page 36: Sparse Compact Directed Acyclic Word Graphs

Edge Merging

T = a#b#a#bab#bc...

a#

#b

a#b b

a,b

#

a#

#

b

Page 37: Sparse Compact Directed Acyclic Word Graphs

Edge Merging

T = a#b#a#bab#bc...

a#

#b

a#b b

a,b

#

a#

#

b

a a

Page 38: Sparse Compact Directed Acyclic Word Graphs

Edge Merging

T = a#b#a#bab#bc...

a#

#b

a#b b

a,b

#

a#

#

b

a a

a

Page 39: Sparse Compact Directed Acyclic Word Graphs

Edge Merging

T = a#b#a#bab#bc...

a#

#b

a#b

b

a,b

#

a

a

Page 40: Sparse Compact Directed Acyclic Word Graphs

Node Splitting

T = a#b#a#bab#bc...

a#

#b

a#b

b

a,b

#

a

a

b#

b#

Page 41: Sparse Compact Directed Acyclic Word Graphs

Node Splitting

T = a#b#a#bab#bc...

a#

#b

a#b

b

a,b

#

a

a

b#

b#

b

b

Page 42: Sparse Compact Directed Acyclic Word Graphs

Node Splitting

T = a#b#a#bab#bc...

a#

#b

a#b

b

a,b

#

a

a

b#

b#

b

b

a#

#ba

a

b#

b#

b b

Page 43: Sparse Compact Directed Acyclic Word Graphs

Conclusion

We introduced new text indexing structure sparse compact directed acyclic word graphs (SCDAWGs) for word-level pattern matching.

We presented an on-line algorithm to construct SCDAWGs directly, in O(n) time with O(k) space.

The key is the use of minimum DFA accepting dictionary D.

Page 44: Sparse Compact Directed Acyclic Word Graphs

Related Work

“Sparse Directed Acyclic Word Graphs”by Shunsuke Inenaga and Masayuki TakedaAccepted to SPIRE’06