Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
SUFFIX ARRAYSmusings on the data structure, applications and implementation in Java
Michał Nowak, Dawid Weiss
IDSS Seminar, 11/2009
Notation
• Let Σ be an alphabet;• Σ is finite and non-empty;
• symbols xa ∈ Σ, where a = 1 . . . |Σ| are ordered;
• O(xb − xc) = 1;
• $ is a special symbol smaller than any x ∈ Σ.
• Let X be a sequence of n symbols, where X [i] ∈ Σ andi = 0 . . . n − 1.
• Let S(i) or i be a sub-sequence of X starting at positioni and ending at n − 1.
Notation
• Let Σ be an alphabet;• Σ is finite and non-empty;
• symbols xa ∈ Σ, where a = 1 . . . |Σ| are ordered;
• O(xb − xc) = 1;
• $ is a special symbol smaller than any x ∈ Σ.
• Let X be a sequence of n symbols, where X [i] ∈ Σ andi = 0 . . . n − 1.
• Let S(i) or i be a sub-sequence of X starting at positioni and ending at n − 1.
Examples
• Zero-terminated US-ASCII strings (Latin letters).
• DNA sequences.
• Integer-coded sequences of words.
• . . .
(recall)
Suffix Trees
suffix S(i) ic a c a o $ 0
a c a o $ 1c a o $ 2
a o $ 3o $ 4
$ 5
suffix S(i) ic a c a o $ 0
a c a o $ 1c a o $ 2
a o $ 3o $ 4
$ 5
Suffixes of “cacao”.
Building the trie.
Suffix trie for “cacao”.
Compacting (pocket trees)
1 Move labels to edges,
2 collapse nodes with a single descendant.
Suffix tree for “cacao”.
Properties of suffix trees
• Maximum 2n nodes (!).
• Elegant to program with.
• Construction in O(n) time (Esko Ukkonen, 1995).
Applications
P1: Search for a substring m in Θ(|m|) steps.
P2: The longest repeated sequence.
?P3: The longest common substring of m strings.
alibaba.taliban.
P3: The longest common substring of m strings.
P3: The longest common substring of m strings.
What do you think,do geese see God?
P4: The longest palindrome.
whatdoyouthinkdogeeseseegod.dogeeseseegodkniht...
P4: The longest palindrome.
There are many more ST applications.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Scienceand Computational Biology. Cambridge University Press.
Problems with suffix trees
• Alphabet size.
• Processing locality.
Suffix Arrays
Ordered depth-first ST traversal.
$a · c a o $a · o $c a · c a o $c a · o $o $
suffix S(i) i
c a c a o $ 0a c a o $ 1
c a o $ 2a o $ 3
o $ 4$ 5
→ sort →
suffix S(i) SA[j]
$ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4
Conclusions:
• A suffix array is an array of indices of sorted suffixes.
• A suffix array is always a permutation of suffix indices.
• Memory consumption: sizeof (index)× n.
• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).
• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.
suffix S(i) i
c a c a o $ 0a c a o $ 1
c a o $ 2a o $ 3
o $ 4$ 5
→ sort →
suffix S(i) SA[j]
$ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4
Conclusions:
• A suffix array is an array of indices of sorted suffixes.
• A suffix array is always a permutation of suffix indices.
• Memory consumption: sizeof (index)× n.
• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).
• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.
suffix S(i) i
c a c a o $ 0a c a o $ 1
c a o $ 2a o $ 3
o $ 4$ 5
→ sort →
suffix S(i) SA[j]
$ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4
Conclusions:
• A suffix array is an array of indices of sorted suffixes.
• A suffix array is always a permutation of suffix indices.
• Memory consumption: sizeof (index)× n.
• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).
• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.
suffix S(i) i
c a c a o $ 0a c a o $ 1
c a o $ 2a o $ 3
o $ 4$ 5
→ sort →
suffix S(i) SA[j]
$ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4
Conclusions:
• A suffix array is an array of indices of sorted suffixes.
• A suffix array is always a permutation of suffix indices.
• Memory consumption: sizeof (index)× n.
• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).
• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.
suffix S(i) i
c a c a o $ 0a c a o $ 1
c a o $ 2a o $ 3
o $ 4$ 5
→ sort →
suffix S(i) SA[j]
$ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4
Conclusions:
• A suffix array is an array of indices of sorted suffixes.
• A suffix array is always a permutation of suffix indices.
• Memory consumption: sizeof (index)× n.
• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).
• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.
suffix S(i) i
c a c a o $ 0a c a o $ 1
c a o $ 2a o $ 3
o $ 4$ 5
→ sort →
suffix S(i) SA[j]
$ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4
Conclusions:
• A suffix array is an array of indices of sorted suffixes.
• A suffix array is always a permutation of suffix indices.
• Memory consumption: sizeof (index)× n.
• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).
• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.
suffix S(i) i
c a c a o $ 0a c a o $ 1
c a o $ 2a o $ 3
o $ 4$ 5
→ sort →
suffix S(i) SA[j]
$ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4
Conclusions:
• A suffix array is an array of indices of sorted suffixes.
• A suffix array is always a permutation of suffix indices.
• Memory consumption: sizeof (index)× n.
• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).
• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.
suffix S(i) i
c a c a o $ 0a c a o $ 1
c a o $ 2a o $ 3
o $ 4$ 5
→ sort →
suffix S(i) SA[j]
$ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4
Conclusions:
• A suffix array is an array of indices of sorted suffixes.
• A suffix array is always a permutation of suffix indices.
• Memory consumption: sizeof (index)× n.
• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).
• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.
(selected)
Algorithms
Naïve suffix sortingAlgorithms:
• Three-way quicksort (Bentley, McIlroy).
• Radix sort and combinations.
Problems due to:
• suffix comparisons not of O(1) (average LCP),“aaaaaaa...a$”.
• potentially large and sparse alphabets.
Naïve suffix sortingAlgorithms:
• Three-way quicksort (Bentley, McIlroy).
• Radix sort and combinations.
Problems due to:
• suffix comparisons not of O(1) (average LCP),“aaaaaaa...a$”.
• potentially large and sparse alphabets.
SACA goals
• Possibly Θ(n).• Fast in practice (real data).
• Lightweight (computation memory).
• Cache and IO-aware (linear scans).
Source: Puglisi et al., ACM Comput. Surveys.
O(n log n): qsufsort
• Jesper Larsson, Kunihiko Sadakane; 1999.
• Prefix doubling.
• Quick sort within buckets.
0 1 2 3 4 5 6 7 8 9 10 11
x = [ a b e a c a d a b e a $ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]
ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]
↑ ↑ ↑ ↑ ↑1 4 6 8 11
i S(i)
0 abeacadabea$3 acadabea$5 adabea$7 abea$
10 a$
→ ordered by
i + h S(i)
0 + 1 b·eacadabea$3 + 1 c·adabea$5 + 1 d·abea$7 + 1 b·ea$
10 + 1 $
0 1 2 3 4 5 6 7 8 9 10 11
x = [ a b e a c a d a b e a $ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]
ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]
↑ ↑ ↑ ↑ ↑1 4 6 8 11
i S(i)
0 abeacadabea$3 acadabea$5 adabea$7 abea$
10 a$
→ ordered by
i + h S(i)
0 + 1 b·eacadabea$3 + 1 c·adabea$5 + 1 d·abea$7 + 1 b·ea$
10 + 1 $
0 1 2 3 4 5 6 7 8 9 10 11
x = [ a b e a c a d a b e a $ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]
ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]
↑ ↑ ↑ ↑ ↑1 4 6 8 11
i S(i)
0 abeacadabea$3 acadabea$5 adabea$7 abea$
10 a$
→ ordered by
i + h S(i)
0 + 1 b·eacadabea$3 + 1 c·adabea$5 + 1 d·abea$7 + 1 b·ea$
10 + 1 $
0 1 2 3 4 5 6 7 8 9 10 11
x = [ a b e a c a d a b e a $ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]
ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]
↑ ↑ ↑ ↑ ↑1 4 6 8 11
i S(i)
0 abeacadabea$3 acadabea$5 adabea$7 abea$
10 a$
→ ordered by
i + h S(i)
0 + 1 b·eacadabea$3 + 1 c·adabea$5 + 1 d·abea$7 + 1 b·ea$
10 + 1 $
0 1 2 3 4 5 6 7 8 9 10 11
x = [ a b e a c a d a b e a $ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]
ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]L = [ -1 5 2 -2 2 ]
SA2 = [ 11 10 (0 7) 3 5 (1 8) 4 6 (2 9) ]ISA2 = [ 3 7 11 4 8 5 9 3 7 11 2 0 ]
L = [ -2 2 -2 2 -2 2 ]
SA4 = [ 11 10 (0 7) 3 5 8 1 4 6 9 2 ]ISA4 = [ 3 7 11 4 8 5 9 3 6 10 1 0 ]
L = [ -2 2 -8 ]
SA8 = [ 11 10 7 0 3 5 8 1 4 6 9 2 ]ISA8 = [ 3 7 11 4 8 5 9 2 6 10 1 0 ]
L = [ -12 ]
O(n) solution: skewDivide-and-conquer:
• Bucket-sorting.
• Problem splitting and recursion.
• Cheap merge from partial SAs.
12-page paper, C source code included. . .
Source: MN thesis.
Source: Karkkainen/Sanders.
Induced copying
• Determine different types of suffixes.
• Sort a single type of suffixes.
• Induce the order of remaining suffixes.
jsuffixarrays.org
About the project
• Michał Nowak.
• Suffix arrays in Java.
• Different algorithms.
• JVM benchmark.
Algorithms:
Algorithm Authors Complexity Memory
skew P. Sanders, J. Kärkkäinen O(n) 10-13nqsufsort J. Larrson, K. Sadakane O(n log n) 8ndeep shallow G. Manzini O(n2 log n) 5ntwo-stage H. Itoh, H. Tanaka O(n2 log n) 5nimpr. two-stage S. Puglisi, M. Maniscalco O(n2 log n) 5nbpr K. B. Schürmann O(n2(log n)−1) 9-10n
quicksort (naïve algorithm)
JVMs (in their newest versions):
• SUN
• IBM
• JRockit (Oracle)
• Apache Harmony
• gcj, initially only.
Code conversion problems
• Pointers→ indexed arrays.
• Memory reused for different types→ φ.
• Stack-allocated structures and arrays→ φ.
• Boundary checks penalty.
• Language constraints (byte arrays).
Code conversion problems
• Pointers→ indexed arrays.
• Memory reused for different types→ φ.
• Stack-allocated structures and arrays→ φ.
• Boundary checks penalty.
• Language constraints (byte arrays).
Code conversion problems
• Pointers→ indexed arrays.
• Memory reused for different types→ φ.
• Stack-allocated structures and arrays→ φ.
• Boundary checks penalty.
• Language constraints (byte arrays).
Code conversion problems
• Pointers→ indexed arrays.
• Memory reused for different types→ φ.
• Stack-allocated structures and arrays→ φ.
• Boundary checks penalty.
• Language constraints (byte arrays).
Code conversion problems
• Pointers→ indexed arrays.
• Memory reused for different types→ φ.
• Stack-allocated structures and arrays→ φ.
• Boundary checks penalty.
• Language constraints (byte arrays).
Code conversion problems
• Pointers→ indexed arrays.
• Memory reused for different types→ φ.
• Stack-allocated structures and arrays→ φ.
• Boundary checks penalty.
• Language constraints (byte arrays).
Engineering fun
JIT code dumping1 private static byte bump(byte v) {2 return (byte) (v + 1);3 }4
5 public static void doLoop(byte [] array) {6 for (int i = 0; i < array.length; i++) {7 array[i] = bump(array[i]);8 }9 }
Equivalent C code (gcc -S -O3):1 ...2 .L4:3 leaq 1048576(%rsp), %rdx // The loop’s end address.4 addb $1, (%rax) // bump byte5 addq $1, %rax // increase loop counter.6 cmpq %rdx, %rax7 jne .L4 // repeat until cond. true.
Java code compiled with gcj -O3 -S Test.java:1 ...2 .L20:3 .p2align 4,,54 jbe .L31 // AOOB if max <= i5 .L23:6 movslq %edi,%rax // store current i7 addl $1, %edi // increase loop counter.8 addb $1, 12(%rbx,%rax) // bump byte inside the array9 cmpl %edi, %edx // loop condition check.
10 .p2align 4,,211 jg .L20
Java code; JIT-compiled, SUN’s HotSpot, -server:1 mov 0x10(%rsi),%r9d // get array length2 test %r9d,%r9d // check if empty array.3 jle L_OUT4 xor %r11d,%r11d // i = 05 L1: cmp %r9d,%r11d6 jae L_OUT // L_OUT if max >= i7 movslq %r11d,%r10 // store current i8 movsbl 0x18(%rsi,%r10,1),%r8d9 inc %r11d // bump i
10 inc %r8d // bump value at array11 mov %r8b,0x18(%rsi,%r10,1)12 cmp $0x1,%r11d // jump always (?)13 jl L1
But also:1 movslq %ecx,%rcx // (Unfolded loop)2 movsbl 0x19(%rbx,%rcx,1),%r9d3 inc %r9d4 mov %r9b,0x19(%rbx,%rcx,1)5 movsbl 0x1a(%rbx,%rcx,1),%r9d6 inc %r9d7 mov %r9b,0x1a(%rbx,%rcx,1)8 movsbl 0x1b(%rbx,%rcx,1),%r9d9 inc %r9d
10 ...
Quiz1 /** Do I look evil to you? */2 public class Example10 {3 private static boolean ready;4
5 public static void startThread() {6 new Thread() {7 public void run() {8 try { sleep(2000); } catch (Exception e) { /* ignore */ }9 ready = true;
10 System.out.println("Setting ready to true.");11 }12 }.start();13 }14
15 public static void main(String [] args) {16 startThread();17 while (!ready) {18 // Do nothing.19 }20 System.out.println("I’m ready.");21 }22 }
1 > gcj -O3 -S Example10.java
1 ...2 cmpb $0, _ZN3com10dawidweiss9debugging6simple9Example105readyE(%rip)3 jne .L134 .L16: // buhu! :)5 jmp .L166 .p2align 4,,77 .L13:8 ...
1 > gcj -O3 -S Example10.java
1 ...2 cmpb $0, _ZN3com10dawidweiss9debugging6simple9Example105readyE(%rip)3 jne .L134 .L16: // buhu! :)5 jmp .L166 .p2align 4,,77 .L13:8 ...
Other things of interest• Minimizing GC activity.
• Memory tracing aspects (AspectJ).
Evaluation
Evaluation
• Random input (alphabets 4, 100, 255, variable length).
• Gauntlet and Manzini’s corpora.
• Multiple runs, warmup rounds removed.
• Wall time measured (no parallelism other than the GC).
• Memory peaks measured.
1
2
3
4
5
6
7
8
0 50 100 150 200 250 300
time
[s]
alphabet size [symbols]
time on constant input, varying alphabet
BPRDEEP-SHALLOW
DIVSUFSORTNS
QSUFSORTSKEW
0
100
200
300
400
500
600
0 5 10 15 20 25
mem
ory
[MB
]
input size [millions elements]
memory on random input, alphabet size = 255
BPRDEEP-SHALLOWDIVSUFSORTNSQSUFSORTSKEW
0
5
10
15
20
25
30
35
40
45
0 5 10 15 20 25
time
[s]
input size [millions elements]
time on random input, alphabet size = 255
BPRDEEP-SHALLOWDIVSUFSORTNSQSUFSORTSKEW
0
5
10
15
20
25
30
35
40
45
50
0 5 10 15 20 25
time
[s]
input size [millions elements]
time on random input, alphabet size = 4
BPRDEEP-SHALLOWDIVSUFSORTNSQSUFSORTSKEW
In reality, data is hardly ever random.
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
SKEW DIVSUFSORT BPR QSUFSORT
tim
e [
s]
ababab(ab...)ac, 196KB
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
SKEW DIVSUFSORT BPR QSUFSORT
tim
e [
s]
Where is naïve sort? ababab(ab...)ac, 196KB
0
20
40
60
80
100
120
140
sun ibm jrockit harmony
time
[s]
BPRDIVSUFSORT
QSUFSORTSKEW
Summary
• Suffix arrays are worth remembering.
• Fundamental and short algorithms developed now.
• http://jsuffixarrays.org.