Transcript
Page 1: Ternary Directed Acyclic Word Graphs  (TDAWG)

1

Ternary Directed Acyclic Word Graphs (TDAWG)

Satoru Miyamoto, Shunsuke Inenaga,

Masayuki Takeda and Ayumi Shinohara

Presented by

Peera Liewlom

(The Last Algorithm Group)

Page 2: Ternary Directed Acyclic Word Graphs  (TDAWG)

2

CIAA 2003

• Eighth International Conference on Implementation and Application of Automata

• July 16-18, 2003, Santa Barbara, CA, USA

• Topic / Committee / Community

Page 3: Ternary Directed Acyclic Word Graphs  (TDAWG)

3

Why did I select this paper?

• DAWG started in 1985 … not so long ago

• Continuing development

• cDAWG, ASDAWG, morphic DAWG, WDAWG, SDAWG, two-tree DAWG, DASG, CSDAWG etc.

• TST: 1997–98, TDAWG: 2003

• DAWG: widely applied in Bioinformatics, NLP, Graph Theory, String Matching, Automata etc.

• Speed & space trends in huge data management

• Topic for Algorithm Group

• Matches the interesting topics in this seminar group

Page 4: Ternary Directed Acyclic Word Graphs  (TDAWG)

4

Content

• DFA (used in string-matching problems)

• DAWG

• Ternary Search Tree

• Paper : TDAWG, Experiment & Result

• Paper : Conclusion

• Paper : Discussion

Page 5: Ternary Directed Acyclic Word Graphs  (TDAWG)

5

DFA

Deterministic Finite Automata

Page 6: Ternary Directed Acyclic Word Graphs  (TDAWG)

6

Formalities

• Deterministic Finite Accepter (DFA)

$M = (Q, \Sigma, \delta, q_0, F)$

$Q$ : set of states

$\Sigma$ : input alphabet

$\delta$ : transition function

$q_0$ : initial state

$F$ : set of final states

Page 7: Ternary Directed Acyclic Word Graphs  (TDAWG)

7

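Set of States $Q$

[Figure: state diagram of the example DFA. Path q0 → q1 → q2 → q3 → q4 labeled a, b, b, a; all other transitions lead to q5, which loops on a, b.]

$Q = \{q_0, q_1, q_2, q_3, q_4, q_5\}$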

Page 8: Ternary Directed Acyclic Word Graphs  (TDAWG)

8

Input Alphabet $\Sigma$

[Figure: state diagram of the example DFA with states q0 to q5.]

$\Sigma = \{a, b\}$

Page 9: Ternary Directed Acyclic Word Graphs  (TDAWG)

9

Initial State $q_0$

[Figure: state diagram of the example DFA with states q0 to q5; the initial state q0 is highlighted.]

Page 10: Ternary Directed Acyclic Word Graphs  (TDAWG)

10

Set of Final States $F$

[Figure: state diagram of the example DFA with states q0 to q5; the final state q4 is highlighted.]

$F = \{q_4\}$

Page 11: Ternary Directed Acyclic Word Graphs  (TDAWG)

11

Transition Function $\delta$

[Figure: state diagram of the example DFA with states q0 to q5.]

$\delta : Q \times \Sigma \to Q$

Page 12: Ternary Directed Acyclic Word Graphs  (TDAWG)

12

$\delta(q_0, a) = q_1$

[Figure: state diagram of the example DFA with states q0 to q5; the transition from q0 to q1 on a is highlighted.]

Page 13: Ternary Directed Acyclic Word Graphs  (TDAWG)

13

$\delta(q_0, b) = q_5$

[Figure: state diagram of the example DFA with states q0 to q5; the transition from q0 to q5 on b is highlighted.]

Page 14: Ternary Directed Acyclic Word Graphs  (TDAWG)

14

$\delta(q_2, b) = q_3$

[Figure: state diagram of the example DFA with states q0 to q5; the transition from q2 to q3 on b is highlighted.]

Page 15: Ternary Directed Acyclic Word Graphs  (TDAWG)

15

Transition Function $\delta$

[Figure: state diagram of the example DFA with states q0 to q5.]

δ     a     b
q0    q1    q5
q1    q5    q2
q2    q5    q3
q3    q4    q5
q4    q5    q5
q5    q5    q5
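The transition table can be run directly as a small simulator. The sketch below is only an illustration: the names `DELTA`, `START`, `FINALS` and `accepts` are made up here, the final-state set {q4} comes from the "Set of Final States" slide, and the entry δ(q2, a) = q5 is an assumption read from the state diagram.

```python
# Transition table of the example DFA over the alphabet {a, b}.
# Assumption: delta(q2, a) = q5 (the trap state), as read from the diagram.
DELTA = {
    "q0": {"a": "q1", "b": "q5"},
    "q1": {"a": "q5", "b": "q2"},
    "q2": {"a": "q5", "b": "q3"},
    "q3": {"a": "q4", "b": "q5"},
    "q4": {"a": "q5", "b": "q5"},
    "q5": {"a": "q5", "b": "q5"},
}
START = "q0"
FINALS = {"q4"}


def accepts(word, delta=DELTA, start=START, finals=FINALS):
    """Follow one transition per input symbol, then test for a final state."""
    state = start
    for symbol in word:
        state = delta[state][symbol]
    return state in finals


if __name__ == "__main__":
    for w in ["abba", "abb", "abbab", ""]:
        print(repr(w), accepts(w))
```

Each input symbol costs one table lookup, so running a word of length n takes n steps regardless of how many states the DFA has.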

Page 16: Ternary Directed Acyclic Word Graphs  (TDAWG)

16

Another Example

[Figure: state diagram of the example DFA with states q0 to q5; three states are marked "accept".]

$L(M) = \{\lambda, ab, abba\}$

Page 17: Ternary Directed Acyclic Word Graphs  (TDAWG)

17

• $L(M)$ = { all strings with prefix $ab$ }

[Figure: state diagram with states q0 to q3; q0 goes to q1 on a, q1 goes to q2 on b, and q2 is marked "accept".]

Page 18: Ternary Directed Acyclic Word Graphs  (TDAWG)

18

$L(M)$ = { all strings without substring $001$ }

[Figure: state diagram over the alphabet {0, 1} with states labeled λ, 0, 00 and 001.]
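Since the diagram itself did not survive in the transcript, the sketch below rebuilds a DFA for "no substring 001" from the state labels alone and checks it against a brute-force test. The transitions, the accepting set and all names (`NO_001`, `ACCEPTING`, `avoids_001`, `matches_bruteforce`) are assumptions made for this illustration: each state records how much of 001 has just been read, and the state 001 is a trap.

```python
from itertools import product

# Assumed transitions: the state name is the longest prefix of "001"
# that the input read so far ends with; "001" is a non-accepting trap.
NO_001 = {
    "λ":   {"0": "0",   "1": "λ"},
    "0":   {"0": "00",  "1": "λ"},
    "00":  {"0": "00",  "1": "001"},
    "001": {"0": "001", "1": "001"},
}
ACCEPTING = {"λ", "0", "00"}


def avoids_001(word):
    """Return True if the DFA accepts, i.e. the word never contains 001."""
    state = "λ"
    for bit in word:
        state = NO_001[state][bit]
    return state in ACCEPTING


def matches_bruteforce(max_len=8):
    """Compare the DFA with a direct substring test on all short bit strings."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            word = "".join(bits)
            if avoids_001(word) != ("001" not in word):
                return False
    return True
```

Calling `matches_bruteforce()` compares the automaton with Python's own substring test on every bit string of length up to 8.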

Page 19: Ternary Directed Acyclic Word Graphs  (TDAWG)

19

DAWG

Directed Acyclic Word Graph

Page 20: Ternary Directed Acyclic Word Graphs  (TDAWG)

20

DAWG

Page 21: Ternary Directed Acyclic Word Graphs  (TDAWG)

21

DAWG

Page 22: Ternary Directed Acyclic Word Graphs  (TDAWG)

22

DAWG

Page 23: Ternary Directed Acyclic Word Graphs  (TDAWG)

23

cDAWG

Page 24: Ternary Directed Acyclic Word Graphs  (TDAWG)

24

Main Development Ideas

Methodology, node/edge, and highlights of each development:

1. DAWG

The prototype of this line of development. It redirects the edges of a branching tree so that the graph can point back into itself, which removes a large number of nodes and makes it faster than a DAG.

2. cDAWG

Focuses on reducing the number of nodes, which in turn reduces the number of edges and makes processing faster than DAWG.

3. ASDAWG

Stores all subsequences together in a single graph. Suited to subsequence analysis, and saves a large amount of memory.

4. morphic DAWG

Applies functions to data held in DAWG form.

5. WDAWG

Uses a length window over the sequence to restrict attention to what we are interested in (VLDC); whatever is not of interest is set as a wildcard, which makes it easier to target the analysis.

6. SDAWG

Restructures the DAWG to have the symmetric-tree property, giving the highest average speed in use.

7. two-tree DAWG

A technique that splits the DAWG into 2 parts, so that data updates are faster and the whole tree need not be restructured.

8. DASG

Extends cDAWG by allowing each edge between nodes to run in both forward and backward directions.

9. CSDAWG

Adjusts the DAWG structure so that the start point and end point can be the same point, so this storage scheme can be used for graphical or geometric data such as circles or polygons.

Page 25: Ternary Directed Acyclic Word Graphs  (TDAWG)

25

TST

Ternary Search Tree

Page 26: Ternary Directed Acyclic Word Graphs  (TDAWG)

26

TST History

• Jon L. Bentley and Robert Sedgewick

• Fast Algorithms for Sorting and Searching Strings, Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), January 1997.

• Ternary Search Trees, Dr. Dobb's Journal, April 1998.

• Dictionary of Algorithms and Data Structures, National Institute of Standards and Technology, http://www.nist.gov/

Page 27: Ternary Directed Acyclic Word Graphs  (TDAWG)

27

BST DST

TST

Page 28: Ternary Directed Acyclic Word Graphs  (TDAWG)

28

Page 29: Ternary Directed Acyclic Word Graphs  (TDAWG)

29

TDAWG

Ternary Directed Acyclic Word Graph

Page 30: Ternary Directed Acyclic Word Graphs  (TDAWG)

30

Introduction

• DFA: how to implement the transitions of each state? (Time & space efficiency)

• TST: "implants" a BST for the transitions – Good time

• DAWG: smallest DFA for all suffixes – Good space

• TDAWG

• Proof: TDAWG vs. DAWG
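The "BST for the transitions" idea behind the TST can be sketched as follows. This is a minimal illustration in the usual Bentley and Sedgewick style, not code from the paper: each node holds one character, its `lo` and `hi` links form a binary search tree over the alternatives at the same position, and `eq` moves on to the next character. The names `TSTNode`, `tst_insert` and `tst_contains` are hypothetical, and words are assumed to be non-empty.

```python
class TSTNode:
    """One TST node: a character plus lower / equal / higher child links."""
    __slots__ = ("ch", "lo", "eq", "hi", "is_end")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_end = False


def tst_insert(node, word, i=0):
    """Insert a non-empty word below `node`; returns the (possibly new) node."""
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_end = True
    return node


def tst_contains(node, word):
    """Search for a non-empty word; lo/hi steps do not consume a character."""
    i = 0
    while node is not None:
        ch = word[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        elif i + 1 < len(word):
            node = node.eq
            i += 1
        else:
            return node.is_end
    return False
```

For example, after inserting "cat", "cap" and "dog" starting from `root = None`, `tst_contains(root, "cap")` is true while `tst_contains(root, "ca")` is false, because the node for the second character is not marked as a word end.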

Page 31: Ternary Directed Acyclic Word Graphs  (TDAWG)

31

Hypothesis / Theorem (1/2)

• Time = Construct + Search (usable online)

• DFA: $M = (Q, \Sigma, \delta, q_0, F)$ with transition function $\delta : Q \times \Sigma \to Q$

• $\Sigma$ = Alphabet (Chinese & Japanese ~ 1000 chars)

• $Q$ = State

• Table: $O(|p|)$ search time, where $|p|$ = length of pattern $p$

• Table uses very large memory

• Linked list: $O(|\Sigma| \times |p|)$ search time

• If $\Sigma$ is large … problem for search time
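The table versus linked-list trade-off can be made concrete by counting probes for a single transition. The sketch below is a toy model under stated assumptions: the 1000-symbol alphabet is simulated with consecutive code points starting at a `BASE` chosen for illustration, target states are plain integers, and `table_lookup` and `list_lookup` are hypothetical helper names.

```python
ALPHABET_SIZE = 1000   # stand-in for a large alphabet of about 1000 characters
BASE = 0x4E00          # assumed first code point of the simulated alphabet


def symbol(i):
    """Return the i-th symbol of the simulated alphabet."""
    return chr(BASE + i)


def table_lookup(row, sym):
    """Table row: one slot per alphabet symbol, reached in a single probe."""
    return row[ord(sym) - BASE], 1


def list_lookup(edges, sym):
    """List row: scan (symbol, target) pairs until the symbol is found."""
    probes = 0
    for s, target in edges:
        probes += 1
        if s == sym:
            return target, probes
    return None, probes


# One state with an outgoing edge on every symbol (worst case for the list).
table_row = list(range(ALPHABET_SIZE))
list_row = [(symbol(i), i) for i in range(ALPHABET_SIZE)]
```

Looking up the last symbol takes 1 probe in the table and 1000 probes in the list, but the table reserves all 1000 slots for every state, even a state with only a few outgoing edges.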

Page 32: Ternary Directed Acyclic Word Graphs  (TDAWG)

32

Hypothesis / Theorem (2/2)

• For TDAWG

– Uses $O(|S|)$ space

– Uses $O(\log|\Sigma| \times |p|)$ search time

– Uses $O(|\Sigma| \times |S|^2)$ construction time (Bentley & Sedgewick)

– Uses $O(|\Sigma| \times |S|)$ construction time (this paper … adapted from Blumer's online DAWG construction)

• Comparison: TDAWG vs. DAWG (table & linked list)

– Space, Search Time, Construction Time

Page 33: Ternary Directed Acyclic Word Graphs  (TDAWG)

33

TST TDAWG

Page 34: Ternary Directed Acyclic Word Graphs  (TDAWG)

34

Online DAWG Construction
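As a rough illustration of what online construction means, the sketch below builds a DAWG one character at a time using the widely known suffix-automaton formulation (suffix links plus state cloning). It is not taken from the slides or the paper: the names `State`, `build_dawg`, `walk`, `is_substring` and `is_suffix` are hypothetical, and a Python dict stands in for the per-state transition structure (table, linked list, or ternary tree).

```python
class State:
    """One DAWG state: outgoing edges, longest-string length, suffix link."""

    def __init__(self, length=0, link=-1):
        self.next = {}          # dict stands in for table / list / ternary edges
        self.length = length
        self.link = link


def build_dawg(text):
    """Build the DAWG of `text` online; returns (states, final_state_ids)."""
    states = [State()]
    last = 0
    for ch in text:
        cur = len(states)
        states.append(State(states[last].length + 1))
        p = last
        # Add the new edge along the suffix-link chain where it is missing.
        while p != -1 and ch not in states[p].next:
            states[p].next[ch] = cur
            p = states[p].link
        if p == -1:
            states[cur].link = 0
        else:
            q = states[p].next[ch]
            if states[p].length + 1 == states[q].length:
                states[cur].link = q
            else:
                # Split q: the clone keeps the shorter strings.
                clone = len(states)
                copy = State(states[p].length + 1, states[q].link)
                copy.next = dict(states[q].next)
                states.append(copy)
                while p != -1 and states[p].next.get(ch) == q:
                    states[p].next[ch] = clone
                    p = states[p].link
                states[q].link = clone
                states[cur].link = clone
        last = cur
    # States on the suffix-link chain from `last` accept the suffixes.
    finals = set()
    p = last
    while p != -1:
        finals.add(p)
        p = states[p].link
    return states, finals


def walk(states, pattern):
    """Follow `pattern` from the root; returns the state reached or None."""
    state = 0
    for ch in pattern:
        state = states[state].next.get(ch)
        if state is None:
            return None
    return state


def is_substring(states, pattern):
    return walk(states, pattern) is not None


def is_suffix(states, finals, pattern):
    return walk(states, pattern) in finals
```

With `states, finals = build_dawg("abcbc")`, the pattern "cb" is a substring but not a suffix, while "bc" is both. Swapping the dict for a ternary (lo/eq/hi) structure at each state is the kind of change the TDAWG makes.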

Page 35: Ternary Directed Acyclic Word Graphs  (TDAWG)

35

Online TDAWG Construction

Page 36: Ternary Directed Acyclic Word Graphs  (TDAWG)

36

Experiment Result

Page 37: Ternary Directed Acyclic Word Graphs  (TDAWG)

37

Conclusion

• New data structure … TDAWG

• Construction time (English text, 256 characters)

– TDAWG < linklistDAWG < tableDAWG

• Space Requirement

– linklistDAWG < TDAWG ~ 20 %

– tableDAWG cannot be compared on the same scale

• Search Time

– Short pattern: tableDAWG best, TDAWG < linklistDAWG

– Log curve vs. linear curve (long pattern?)

Page 38: Ternary Directed Acyclic Word Graphs  (TDAWG)

38

Discussion & Future Work

• Asian languages (~1000s of characters) should benefit more in search time than English (256 characters) because search time is $O(\log|\Sigma| \times |p|)$

• Apply to other DAWGs … cDAWG, minimum DAWG … etc.

• More efficiency with an AVL tree (AVL balancing)

• Bioinformatics has 4 characters. But a sliding window of 12 characters gives $4^{12}$
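The alphabet sizes mentioned above can be put side by side with a little arithmetic. The sketch assumes that a balanced binary search over the outgoing edges costs about ⌈log2 |Σ|⌉ comparisons per pattern character, which is only an idealization of the log factor in the search-time bound; `comparisons_per_symbol` and `DNA_WINDOW_ALPHABET` are names made up for this example.

```python
import math


def comparisons_per_symbol(alphabet_size):
    """Idealized comparisons to pick one edge with a balanced binary search."""
    return math.ceil(math.log2(alphabet_size))


# A sliding window of 12 characters over a 4-letter alphabet behaves like
# a single symbol drawn from an alphabet of 4**12 possible windows.
DNA_WINDOW_ALPHABET = 4 ** 12

if __name__ == "__main__":
    for size in (4, 256, 1000, DNA_WINDOW_ALPHABET):
        print(size, comparisons_per_symbol(size))
```

Here 4^12 = 16,777,216 = 2^24, so even this window alphabet needs only about 24 comparisons per symbol under the balanced-search assumption, whereas a linear scan of a linked list grows with the alphabet size itself.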

