Joint Advanced Student School 2004
Compressed Suffix Arrays
Compression of Suffix Arrays to linear size
Fabian Pache

Motivation
• Problem: Suffix arrays are data structures that support fast pattern searching in long texts, but they require large amounts of memory.
• Goal: Find a compression scheme that reduces the size of the suffix array while still allowing fast access.

Outline
1. Trivial Compression
2. Grossi & Vitter
   – Outline
   – Algorithms and Analysis
3. Sadakane
   – Outline
   – Algorithms and Analysis (Sketch)

Conventions used
• Text T[1…n]
  – Binary text in {a,b}^n, terminated with #
• Pattern P[1…m]
  – Binary text in {a,b}^m
• Suffix Array SA[1…n]
  – Each entry SA[i] points to the position in T where a suffix starts
  – Uses n log n bits

Trivial Compression
• Construct and store SA
T = baab#, with character order a < # < b
Suffixes in text order:
1. baab#
2. aab#
3. ab#
4. b#
5. #
Suffixes in lexicographic order:
2. aab#
3. ab#
5. #
1. baab#
4. b#
SA = [ 2,3,5,1,4 ]
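
The construction above can be reproduced with a naive sort. This is only a sketch: the function name `suffix_array` and the `ORDER` table are illustrative, and a comparison sort of all suffixes is far slower than the construction algorithms used in practice.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # slide convention: a < # < b

def suffix_array(text):
    """Return the 1-based suffix array of text under ORDER (naive sort)."""
    def key(start):
        # Translate the suffix starting at 1-based position `start` into sortable ranks.
        return [ORDER[c] for c in text[start - 1:]]
    return sorted(range(1, len(text) + 1), key=key)

print(suffix_array("baab#"))  # [2, 3, 5, 1, 4]
```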

Trivial Compression
• Recover T from SA = [ 2,3,5,1,4 ]
  1. SA[3] = 5 = n is the suffix "#":                       T = _ _ _ _ #
  2. Entries after it start with b, because # < b:          T = b _ _ b #
  3. Entries before it start with a, because a < #:         T = b a a b #
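
The recovery steps can be sketched as follows, assuming a binary text over {a, b} terminated by # and the order a < # < b. The function name `recover_text` is illustrative.

```python
def recover_text(sa):
    """Rebuild a binary text (terminated by #) from its 1-based suffix array."""
    n = len(sa)
    text = [""] * n
    pivot = sa.index(n)  # rank of the suffix "#", which starts at position n
    for r, start in enumerate(sa):
        if r < pivot:
            text[start - 1] = "a"   # suffix sorts before "#"
        elif r > pivot:
            text[start - 1] = "b"   # suffix sorts after "#"
        else:
            text[start - 1] = "#"
    return "".join(text)

print(recover_text([2, 3, 5, 1, 4]))  # baab#
```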

Trivial Compression
• Therefore each Suffix Array can be compressed to O(n) bits
• Drawback: decompression takes Ω(n) time

Grossi & Vitter
• Outline:
  – Recursive "divide and conquer"-type algorithm
  – Stores SA implicitly (for all but the last level)
• Supported operations
  – lookup( i )
  – compress

G&V – compress
• Structural outline
  SA0:  $|SA_0| = O(n \log n)$
  SA1:  $|SA_1| = \frac{1}{2}|SA_0|$
  SA2:  $|SA_2| = \frac{1}{2}|SA_1|$
  …
  SAl:  $|SA_l| = O(n)$, with $l = \log\log n$

G&V – compress
Given a text T, create SA
i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T      A  B  B  A  B  B  A  B  B  A  B  B  A  B  A  A
SA    15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30

i     17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T      A  B  A  B  A  B  B  A  B  B  B  A  B  B  A  #
SA    12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25

G&V – compress
Create array B[1…n] with n = |SA|
– B[i] = 1 if SA[i] is even
– B[i] = 0 if SA[i] is odd
i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA    15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B      0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1

i     17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
SA    12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B      1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0

G&V – compress
• Create an array rank[1…n] where rank[i] contains the number of 1s in B[1…i]
i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
B      0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
rank   0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8

i     17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
B      1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank   9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16

G&V – compress
Define a mapping Ψ[1…n] so that
– If B[i] = 0: Ψ[i] = j such that SA[j] = SA[i] + 1
– If B[i] = 1: Ψ[i] = i
i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA    15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B      0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
Ψ      2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16

i     17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
SA    12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B      1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
Ψ     17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27
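
The three helper arrays can be derived directly from SA. The sketch below hard-codes the SA of the example text and uses plain Python lists; the function names are illustrative, and the returned Ψ values are 1-based to match the slides.

```python
SA = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
      12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]

def bit_vector(sa):
    """B[i] = 1 if SA[i] is even, else 0."""
    return [1 if v % 2 == 0 else 0 for v in sa]

def rank_table(bits):
    """rank[i] = number of 1s in B[1..i] (running sum)."""
    out, total = [], 0
    for b in bits:
        total += b
        out.append(total)
    return out

def psi_table(sa):
    """Psi[i] = i for even SA[i], else the 1-based index j with SA[j] = SA[i] + 1."""
    pos = {v: i for i, v in enumerate(sa, start=1)}
    return [i if v % 2 == 0 else pos[v + 1] for i, v in enumerate(sa, start=1)]

print(psi_table(SA)[:6])  # [2, 2, 14, 15, 18, 23]
```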

G&V – compress
Compressing from SAk to SAk+1
• Store only the even values of SAk in SAk+1
• Divide each entry in SAk+1 by 2
SAk+1 for the example text:
i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SAk+1  8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
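
One compression step is short enough to state in full. This sketch assumes the SA of the example text as a plain list; `compress_step` is an illustrative name.

```python
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]

def compress_step(sa):
    """Keep the even values of SA_k in their original order and halve them."""
    return [v // 2 for v in sa if v % 2 == 0]

SA1 = compress_step(SA0)
print(SA1)  # [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
```

Applying the step twice more gives 8 and then 4 entries, the last being [2, 3, 4, 1].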

G&V – lookup
Reconstruction of SAk[i] using Bk, rankk, Ψk and SAk+1:
SAk[i] = 2 · SAk+1[ rankk( Ψk(i) ) ] + (Bk[i] − 1)

G&V – lookup
• Proof / Example part 1: Bk[i] = 1
SAk[i] = 2 · SAk+1[ rankk( Ψk(i) ) ] + (Bk[i] − 1)
SAk[i] = 2 · SAk+1[ rankk( Ψk(i) ) ]        (Bk[i] − 1 = 0)
SAk[i] = 2 · SAk+1[ rankk(i) ]              (Ψk(i) = i)
Example i = 8:  rank(8) = 3,   SAk+1[3] = 5,  so SAk[8] = 2 · 5 = 10
Example i = 18: rank(18) = 10, SAk+1[10] = 9, so SAk[18] = 2 · 9 = 18

G&V – lookup
• Proof / Example part 2: Bk[i] = 0
SAk[i] = 2 · SAk+1[ rankk( Ψk(i) ) ] + (Bk[i] − 1)
SAk[i] = 2 · SAk+1[ rankk( Ψk(i) ) ] − 1
Example i = 5:  Ψ(5) = 18,  rank(18) = 10, SAk+1[10] = 9, so SAk[5] = 2 · 9 − 1 = 17
Example i = 22: Ψ(22) = 10, rank(10) = 4,  SAk+1[4] = 2,  so SAk[22] = 2 · 2 − 1 = 3

G&V – lookup
Stored information
• For each level k = 0…l−1
  – Bk is stored explicitly
  – rankk and Ψk are stored implicitly
  – SAk is reconstructible by recursion
• For level l, SAl is stored explicitly
  – No further information needed

G&V – lookup
• Pseudocode for the lookup function
lookup( i ) = rlookup( i, 0 )
rlookup( i, k )
  if (k == l)
    return SAl[i];
  else
    return 2 * rlookup( rankk[ psik[i] ], k+1 ) + (Bk[i] - 1);
  end
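
The pseudocode can be turned into a runnable sketch. The version below assumes plain, uncompressed Python lists for Bk, rankk and Ψk (so it shows the recursion, not the space bound), uses 0-based indices internally, and takes the number of levels as a parameter; `build` and `rlookup` with these signatures are illustrative.

```python
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]

def build(sa, depth):
    """Compress sa `depth` times; return per-level (B, rank, psi) and the last array."""
    levels = []
    for _ in range(depth):
        pos = {v: i for i, v in enumerate(sa)}   # 0-based index of each value
        bits = [1 - v % 2 for v in sa]           # B: 1 for even values
        rank, total = [], 0
        for b in bits:
            total += b
            rank.append(total)                   # number of 1s in bits[0..i]
        psi = [i if b else pos[v + 1] for i, (v, b) in enumerate(zip(sa, bits))]
        levels.append((bits, rank, psi))
        sa = [v // 2 for v in sa if v % 2 == 0]  # next level
    return levels, sa

def rlookup(i, k, levels, last):
    """Return SA_k[i] (0-based i) using only the level data and the last array."""
    if k == len(levels):
        return last[i]
    bits, rank, psi = levels[k]
    # rank[...] is a 1-based position in the next level, hence the -1.
    return 2 * rlookup(rank[psi[i]] - 1, k + 1, levels, last) + (bits[i] - 1)

levels, last = build(SA0, 3)
print([rlookup(i, 0, levels, last) for i in range(4)])  # [15, 16, 31, 13]
```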

G&V – details
Space versus Time
                   Quick and Large     Small and Slow
Space (in bits)    O(n log log n)      O(n)
Time               O(log log n)        O(log^ε n), ε > 0

G&V – Quick and Large
Storing rank space-efficiently and quickly accessible:
• Explicit storage of rank takes n log n bits
• Jacobson's method uses o(n) bits
• Both allow for constant-time access
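
The idea of trading stored counts against scanning can be illustrated with a single level of sampling. This is not Jacobson's structure, which uses a two-level directory plus a lookup table to reach constant time with o(n) extra bits. It is a simplified sketch, and the class name `SampledRank` and the `block` parameter are assumptions made for the example.

```python
class SampledRank:
    """rank(i) = number of 1s in bits[0..i], using one stored count per block."""

    def __init__(self, bits, block=4):
        self.bits = bits
        self.block = block
        self.samples = [0]  # samples[b] = number of 1s before block b
        for start in range(0, len(bits), block):
            self.samples.append(self.samples[-1] + sum(bits[start:start + block]))

    def rank(self, i):
        b = i // self.block
        # Stored count up to the block start, plus a short scan inside the block.
        return self.samples[b] + sum(self.bits[b * self.block:i + 1])

B = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
r = SampledRank(B)
print([r.rank(i) for i in range(16)])
# [0, 1, 1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 5, 6, 7, 8]
```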

G&V – Quick and Large
Storing Ψk efficiently (outline):
• Create one array for each possible substring in {a,b}^(2^k), using the substring as label (2^(2^k) arrays in total)
Example using k = 1:
aa
ab
ba
bb
i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA1    8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B1     1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank1  1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8

G&V – Quick and Large
• For each j with Bk[j] = 1, find the 2^k literals preceding the suffix referenced by SAk[j] in T
• Store j in the array labelled with these literals
Example using k = 1: j = 1, SA1[1] = 8, T[14…15] = ba
aa
ab
ba 1
bb

G&V – Quick and Large
In other words:
For each i with Bk[i] = 0, let t be the first 2^k literals of the suffix referenced by SAk[i]; insert Ψk[i] in array t
Example using k = 1: i = 3, SA1[3] = 5, t = T[10…11] = ab, Ψ1[3] = 9
aa
ab 9
ba 1
bb

G&V – Quick and Large
For each j with Bk[j] = 1
  t = T[ 2^k · SAk[j] − 2^k, …, 2^k · SAk[j] − 1 ]
  add j to the array with label t
Example using k = 1:
aa
ab 9
ba 1, 6, 12, 14
bb 2, 4, 5

G&V – Quick and Large
To calculate Ψk(i) for Bk[i] = 0, use i − rankk(i) as index into the concatenated arrays
aa
ab 9
ba 1, 6, 12, 14
bb 2, 4, 5
Concatenated: 9, 1, 6, 12, 14, 2, 4, 5
i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA1    8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B1     1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank1  1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8
Example using k = 1: i = 8, rank1(8) = 5, 8 − 5 = 3, third entry: Ψ1(8) = 6
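
The list construction and the index trick can be sketched as follows. The sketch assumes the example text and the 16-entry array as plain Python data and treats that array as level k = 1, so labels are 2 literals long; the names `build_lists`, `concatenate` and `psi` are illustrative, and rank is computed naively.

```python
T = "ABBABBABBABBABAAABABABBABBBABBA#"   # example text (position 1 is T[0])
SA1 = [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
K = 1                                     # assumed level of SA1
ORDER = {"a": 0, "#": 1, "b": 2}          # a < # < b

def build_lists(text, sa, k):
    """Map each label t to the list of j with B[j] = 1 whose preceding literals are t."""
    width = 2 ** k
    lists = {}
    for j, v in enumerate(sa, start=1):
        if v % 2 == 0:
            start = width * v - width     # 1-based position in T
            label = text[start - 1:start - 1 + width].lower()
            lists.setdefault(label, []).append(j)
    return lists

def concatenate(lists):
    """Join the lists in label order."""
    labels = sorted(lists, key=lambda t: [ORDER[c] for c in t])
    return [j for t in labels for j in lists[t]]

def psi(i, sa, concat):
    """Psi(i) for 1-based i: i itself if B[i] = 1, else entry i - rank(i) of concat."""
    if sa[i - 1] % 2 == 0:
        return i
    rank = sum(1 for v in sa[:i] if v % 2 == 0)   # naive rank(i)
    return concat[i - rank - 1]

concat = concatenate(build_lists(T, SA1, K))
print(concat)               # [9, 1, 6, 12, 14, 2, 4, 5]
print(psi(8, SA1, concat))  # 6
```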

G&V – Quick and Large
l = log log n levels of compression
• Space occupation: l levels, each occupying O(n) bits
  ⇒ O(n log log n) bits
• Time requirement for lookup( i ): l levels, each requiring O(1) steps
  ⇒ O(log log n) steps

G&V – Small and Slow
Reduction of size by allowing for higher time usage
                   Quick and Large     Small and Slow
Space (in bits)    O(n log log n)      O(n)
Time               O(log log n)        O(log^ε n), ε > 0

G&V – Small and Slow
Instead of storing all l = log log n levels, store only levels
  0,  $l' = \frac{1}{2}\log\log n$,  $l = \log\log n$
Example n = 32: store levels 0, 2, 3

G&V – Small and Slow
Example using |T| = 32
[Figure: the four levels SA0, SA1, SA2, SA3]

G&V – Small and Slow
Keep only 3 levels
[Figure: levels SA0, SA1, SA3]

G&V – Small and Slow
On levels 0 and l´, mark entries that are still present in the next stored level
[Figure: levels SA0, SA1, SA3 with marked entries]

G&V – Small and Slow
Before the modification:
• Bk[i] = 1 ⇔ SAk[i] is stored in SAk+1
• Ψk is used for each Bk[i] = 0 to find SAk[ Ψk[i] ] = SAk[i] + 1
Modifications added:
• B0´[i] = 1 ⇔ SA0[i] is stored in SAl´
• Bl´´[i] = 1 ⇔ SAl´[i] is stored in SAl
• Ψk´ is used for each Bk[i] = 1 to find SAk[ Ψk´[i] ] = SAk[i] + 1

G&V – Small and Slow
Construction of Ψ´ and B´ markings of indices
i      1  2  3  4
SA3    2  3  4  1

i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA1    8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B1´    1  0  0  0  1  1  0  0  0  0  0  0  0  1  0  0
Ψ1     .  .  9  .  .  .  1  6  . 12 14  .  2  .  4  5
Ψ1´   10  8  . 11 13 15  .  .  7  .  . 16  .  3  .  .

G&V – Small and Slow
Ψ´ and Ψ in combination can be used to traverse the entire SA (in ascending order of values)
Example on SA1 = [ 8,14,5,2,12,16,7,15,6,9,3,10,13,4,1,11 ], starting at i = 15 (SA1[15] = 1):
  Ψ1(15) = 4 (SA1[4] = 2), Ψ1´(4) = 11 (SA1[11] = 3), Ψ1(11) = 14 (SA1[14] = 4, marked in B1´)
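
A lookup on a level whose next stored level is two halvings away can be sketched by walking the successor links until a marked entry is reached and then subtracting the number of steps. The sketch assumes plain lists, merges Ψ and Ψ´ into one successor array `succ`, and computes rank naively; all names are illustrative.

```python
SA1 = [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
STEP = 4  # SA3 keeps the values of SA1 that are divisible by 4

def build(sa, step):
    n = len(sa)
    pos = {v: i for i, v in enumerate(sa)}            # 0-based index of each value
    succ = [pos[v % n + 1] for v in sa]               # Psi and Psi' merged; n wraps to 1
    marked = [1 if v % step == 0 else 0 for v in sa]  # B'
    kept = [v // step for v in sa if v % step == 0]   # next stored level
    return succ, marked, kept

def lookup(i, succ, marked, kept, step):
    """Return SA1[i] (0-based i) without reading SA1 itself."""
    steps = 0
    while not marked[i]:
        i = succ[i]          # move to the entry holding the next larger value
        steps += 1
    rank = sum(marked[:i + 1])
    return step * kept[rank - 1] - steps

succ, marked, kept = build(SA1, STEP)
print(kept)                                      # [2, 3, 4, 1]
print(lookup(14, succ, marked, kept, STEP))      # 1
```

The walk never takes more than STEP − 1 steps, which is the "longest sequence of 0s" that bounds the lookup time.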

G&V – Small and Slow
Length of the traversal determines required time for lookup
• Level 0 contains n entries
• Level l´ was halved ½ log log n times
  $|SA_{l'}| = \frac{n}{2^{l'}} = \frac{n}{2^{\frac{1}{2}\log\log n}} = \frac{n}{\sqrt{\log n}}$

G&V – Small and Slow
Length of the traversal determines required time for lookup
• 0s in B´ are evenly spaced
• Longest sequence of 0s:
  $\frac{|SA_0|}{|SA_{l'}|} = \frac{n}{n/\sqrt{\log n}} = \sqrt{\log n}$

G&V – Small and Slow
Generalized for more than 2 additional levels (the number must be constant!):
Let L be the number of levels, $1/\varepsilon = L - 1$
The longest sequence of 0s has length $\log^{\varepsilon} n$

G&V – Small and Slow
Reconstruction of levels < l requires
• a vector describing which entries of level k´ can be found in level k´+1
  ⇒ O(n) bits
• a function Ψ´ that, combined with Ψ, allows for complete traversal of SA
  ⇒ O(n) bits

Sadakane
Improvements on the data structure and algorithms proposed by Grossi & Vitter
• More operations
  – inverse( j ): return i so that SA[i] = j
  – search( P ): return l, r so that P matches T at the suffixes SA[l…r]
  – decompress( s, e ): return T[s…e]
• Allow for alphabets with |Σ| > 2

Sadakane – inverse( j )
Goal: For a suffix starting at position j, find its index i in the lexicographic order of all suffixes
Assuming: j = SA[i]
Create SA⁻¹ so that: i = SA⁻¹[j]
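
Stored explicitly, SA⁻¹ is just the inverse permutation of SA. The sketch below builds it as a plain list for the small example; the compressed variant described on the slides stores it only at the last level. The function name is illustrative.

```python
def inverse_suffix_array(sa):
    """Return SA^-1 as a 1-based list (index 0 unused): inv[sa[i]] = i."""
    inv = [0] * (len(sa) + 1)
    for i, j in enumerate(sa, start=1):
        inv[j] = i
    return inv

print(inverse_suffix_array([2, 3, 5, 1, 4]))  # [0, 4, 1, 2, 5, 3]
```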

Sadakane – inverse( j )
Proposition:
inverse( j ) can be computed in O(log n) time with explicit storage of SA⁻¹ at the last level and a recursion for all levels above.

Sadakane – search( P )
Goal: Find the interval [l…r] in SA so that P matches each of the suffixes pointed to by SA[l…r]. Do so without using T.

Sadakane – search( P )
Proposition:
By augmenting the data structure with a function C⁻¹ (the "inverse of the array of cumulative frequencies"), it is possible to obtain the substring in O(|P|) time.

Sadakane – decompress( I )
Goal: Using only SA and its functions, return the substring of T pointed to by I = [s, e].

Sadakane – decompress( I )
Proposition:
A substring of length l = e − s + 1 can be decompressed using only SA, SA⁻¹ and C⁻¹ in O(l + log n) time, where n is the length of the original text.
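
The decompression idea can be sketched with uncompressed tables: SA⁻¹ gives the rank of the suffix starting at s, C⁻¹ gives the first character of the suffix with a given rank, and a successor function (called `psi` here, defined for every rank) moves to the suffix starting one position later. T is used only while building the tables. The names and the dictionary-based storage are assumptions made for this example, not the compressed representation.

```python
from bisect import bisect_left

ORDER = "a#b"  # character order a < # < b

def build(text):
    """Build SA^-1, the successor function and cumulative counts (1-based ranks)."""
    n = len(text)
    key = lambda p: [ORDER.index(c) for c in text[p - 1:]]
    sa = sorted(range(1, n + 1), key=key)
    isa = {v: r for r, v in enumerate(sa, start=1)}                # SA^-1
    psi = {r: isa.get(v + 1) for r, v in enumerate(sa, start=1)}  # rank of the next suffix
    cum, total = [], 0
    for c in ORDER:
        total += text.count(c)
        cum.append(total)       # number of suffixes whose first character is <= c
    return isa, psi, cum

def first_char(rank, cum):
    """C^-1: first character of the suffix with the given rank."""
    return ORDER[bisect_left(cum, rank)]

def decompress(s, e, isa, psi, cum):
    """Return T[s..e] (1-based, inclusive) without reading T."""
    rank, out = isa[s], []
    for _ in range(e - s + 1):
        out.append(first_char(rank, cum))
        rank = psi[rank]
    return "".join(out)

isa, psi, cum = build("baab#")
print(decompress(2, 3, isa, psi, cum))  # aa
```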

Sadakane – Complexity
Using inverse, search and decompress, it is possible to store T implicitly. Therefore O(n) words are no longer required.
The space complexity of the Sadakane-improved Suffix Array is only 37% of that of a Grossi & Vitter Suffix Array including the text.