x86/x64 Optimization Study Group #4 (x86/x64最適化勉強会#4)
An x86-optimized rank/select dictionary for bit sequences
2012/6/16
Takeshi Yamamuro
What Is a Succinct Data Structure?
SDS: Succinct Data Structure
• Recently Gaining Popularity in Some Areas
– Research & Engineering
• Not a Data Structure, But a Data Representation
– A compression method for other data structures
– e.g., alphabets, trees, and graphs
• Transparent Operations Without Explicit Unpacking
– e.g., succinct LZ77 compression*1
*1 Kreft, S. and Navarro, G.: LZ77-Like Compression with Fast Random Access, In Proceedings of DCC, 2010
More Details
• SDS = Succinct Data + Succinct Index
• Succinct Data
– Compact representation for target data
– Close to the information-theoretic lower bound
e.g., with N possible patterns, the lower bound is log N bits
• Succinct Index
– O(1) operations on the target data
– o(N) extra space: asymptotically negligible
More Details
cited from: http://goo.gl/rkQ5z
If you need more information, ...
A rank/select dictionary for SDS
Rank/Select Operations
• An SDS Is Composed of Rank/Select Operations
– Many rank/select calls inside
• Rank/Select for Succinct Bit Sequences: B[i]
– rank_x(n, B): the number of x bits in B[0...n]
– select_x(n, B): the position of the n-th x in B[]
i    0 1 2 3 4 5 6 7 8
B[i] 1 0 1 1 0 0 1 1 0
rank_1(5, B) = 3    select_1(4, B) = 6
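The two definitions above can be pinned down with a naive reference version. This is only a sketch: the function names are illustrative, and it assumes rank counts the half-open prefix B[0:n] and select counts occurrences from 1 (the slide's example is consistent with this reading, but the slide does not state it).

```python
def rank(bits, n, x=1):
    """Count occurrences of bit x in bits[0:n] (half-open prefix; an assumption)."""
    return sum(1 for b in bits[:n] if b == x)


def select(bits, n, x=1):
    """Return the 0-based position of the n-th (1-based) occurrence of x, or -1."""
    seen = 0
    for i, b in enumerate(bits):
        if b == x:
            seen += 1
            if seen == n:
                return i
    return -1


B = [1, 0, 1, 1, 0, 0, 1, 1, 0]   # the example sequence from the slide
print(rank(B, 5))    # 3
print(select(B, 4))  # 6
```

Both functions are O(N); the rest of the talk is about getting the same answers in O(1) with o(N) extra space.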
Rank/Select Operations
• Available Rank/Select Implementations
– ux-trie: http://code.google.com/p/ux-trie/
– rx: http://code.google.com/p/mozc/
– marisa-trie: http://code.google.com/p/marisa-trie/
• Today's Contribution
– x86-optimized rank/select
– https://github.com/maropu/dbitv
Performance Results
• Benchmark Setup*1
– Generate a random sequence of bits: 50% density
– Random rank/select queries over the bits
– CPU: Intel Core-i5 [email protected]
• Latency Observed
– Median latency over 11 trials
*1 Reference: http://d.hatena.ne.jp/s-yata/20111216/1324032373
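The measurement procedure above could be sketched as follows. This is not the benchmark actually used for the results; the function name, the query count, and the prefix-sum stand-in for rank are all assumptions made for illustration.

```python
import random
import statistics
import time


def median_latency_ns(query, n_bits, n_queries=10000, trials=11, seed=0):
    """Median over `trials` runs of the average per-query latency in ns."""
    rng = random.Random(seed)
    positions = [rng.randrange(n_bits) for _ in range(n_queries)]  # random queries
    per_query = []
    for _ in range(trials):
        start = time.perf_counter_ns()
        for p in positions:
            query(p)
        per_query.append((time.perf_counter_ns() - start) / n_queries)
    return statistics.median(per_query)


rng = random.Random(1)
bits = [rng.randint(0, 1) for _ in range(1 << 12)]   # random bits, about 50% density
prefix = [0]
for b in bits:
    prefix.append(prefix[-1] + b)                    # stand-in rank: prefix sums

latency = median_latency_ns(lambda p: prefix[p], len(bits))
print(latency)
```

Taking the median rather than the mean keeps a single slow trial (for example, one disturbed by another process) from skewing the result.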
Performance Results: Rank
[Figure: averaged rank latency (ns, log scale, 1.E+00 to 1.E+03) vs. bit length for ux, rx, marisa, and opt]
Performance Results: Select
[Figure: averaged select latency (ns, log scale, 1.E+00 to 1.E+04) vs. bit length for ux, rx, marisa, and opt]
Implementation Details
Implementation: Method of Four Russians
• Rule: O(1) operation costs with o(N) space
B[] = a sequence of N bits
Implementation: Method of Four Russians
• Split B[] into fixed-length blocks of log^2 N bits
• Total counts are pre-computed in L[]
B[] = a sequence of bits
L[] = l1, l2, ... (one entry per block of log^2 N bits)
\[
\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i]
  = \underbrace{\sum_{i=1}^{\lfloor x/\log^2 N \rfloor \log^2 N} B[i]}_{L[\lfloor x/\log^2 N \rfloor]:\ O(1)}
  + \underbrace{\sum_{i=\lfloor x/\log^2 N \rfloor \log^2 N + 1}^{x} B[i]}_{O(\log^2 N)}
\]
• Cost: the L[] lookup is O(1); the remaining sum is O(log^2 N)
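The single-level scheme on this slide can be sketched as below. It is a minimal illustration rather than the dbitv code: it uses 0-based, half-open prefixes instead of the slide's 1-based sums, rounds log N down to an integer, and the names `build_L` and `rank1` are made up for the example.

```python
import math


def build_L(bits):
    """L[k] = number of 1s in bits[0 : k*block], with block about (log2 N)^2."""
    n = len(bits)
    block = max(1, int(math.log2(max(n, 2))) ** 2)
    L = [0]
    for start in range(0, n, block):
        L.append(L[-1] + sum(bits[start:start + block]))
    return L, block


def rank1(bits, L, block, x):
    """Number of 1s in bits[0:x]: one table lookup plus a scan of < block bits."""
    k = x // block
    return L[k] + sum(bits[k * block:x])


bits = [(i * i + i // 3) % 2 for i in range(1000)]   # arbitrary test pattern
L, block = build_L(bits)
print(block, rank1(bits, L, block, 500) == sum(bits[:500]))
```

The scan in `rank1` is the O(log^2 N) term; the next slides shrink it with a second table.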
Implementation: Method of Four Russians
• L[]: o(N) space costs
– N / log^2 N entries of log N bits each
\[
\frac{N}{\log^2 N} \cdot \log N = O\!\left(\frac{N}{\log N}\right) = o(N)
\]
Implementation: Method of Four Russians
• Split each block again into fixed-length blocks of 1/2 log N bits
• Total counts are pre-computed in S[]
B[] = a sequence of bits
L[] = l1, l2, ... (one entry per block of log^2 N bits)
S[] = s1, s2, ... (one entry per block of 1/2 log N bits)
\[
\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i]
  = \underbrace{\sum_{i=1}^{\lfloor x/\log^2 N \rfloor \log^2 N} B[i]}_{L[\lfloor x/\log^2 N \rfloor]:\ O(1)}
  + \underbrace{\sum_{i=\lfloor x/\log^2 N \rfloor \log^2 N + 1}^{\lfloor x/(\frac{1}{2}\log N) \rfloor \frac{1}{2}\log N} B[i]}_{S[\lfloor x/(\frac{1}{2}\log N) \rfloor]:\ O(1)}
  + \underbrace{\sum_{i=\lfloor x/(\frac{1}{2}\log N) \rfloor \frac{1}{2}\log N + 1}^{x} B[i]}_{O(\log N)}
\]
• Cost: the L[] and S[] lookups are O(1) each; the remaining sum is O(log N)
Implementation: Method of Four Russians
• S[]: o(N) space costs
– N / (1/2 log N) entries of log(log^2 N) bits each
\[
\frac{N}{\frac{1}{2}\log N} \cdot \log(\log^2 N) = O\!\left(\frac{N \log\log N}{\log N}\right) = o(N)
\]
Implementation: Method of Four Russians
• O(1) Popcount/Table Lookup for the Last Term
– L[] lookup: O(1); S[] lookup: O(1)
– Last term (a sum over fewer than 1/2 log N bits): O(log N) -> O(1)
Implementation: Method of Four Russians
• As a Result, o(N) Space Costs
\[
\underbrace{\frac{N}{\log N}}_{L[]\ \text{size}} + \underbrace{\frac{4N\log\log N}{\log N}}_{S[]\ \text{size}}
  = O\!\left(\frac{N \log\log N}{\log N}\right) = o(N)
\]
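To get a feel for the two size terms, the sketch below plugs concrete values of N into them. It assumes base-2 logarithms and ignores rounding to whole entries and whole bits; `index_bits` is a name introduced only for this example.

```python
import math


def index_bits(n):
    """Approximate bits used by L[] and S[] for an n-bit sequence."""
    log_n = math.log2(n)
    l_bits = n / log_n ** 2 * log_n                      # N/log^2 N entries, log N bits each
    s_bits = n / (0.5 * log_n) * math.log2(log_n ** 2)   # N/(1/2 log N) entries, log(log^2 N) bits each
    return l_bits, s_bits


for exp in (20, 30, 40):
    n = 2 ** exp
    l_bits, s_bits = index_bits(n)
    # Index bits per data bit: both ratios shrink as N grows.
    print(exp, round(l_bits / n, 3), round(s_bits / n, 3))
```

Both ratios tend to zero, which is what o(N) means here, but the S[] ratio shrinks only slowly as N grows.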
Implementation: Practice
• Low Computation Costs & High Cache Penalties
– 3 cache/TLB misses per rank
[Figure: L[] holds one count per 256-bit block, S[] one count per 32-bit block, and B[] the raw bits; the accesses to L[], S[], and B[] are each marked "Miss!"]
ex. rank1(402 = 256*1 + 32*4 + 18, B): look up L[] and S[], then popcount the 18 leftover bits
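The lookup in this example can be sketched with the slide's 256-bit and 32-bit block sizes. The three separate lists below mirror the three arrays that cause the three misses. The word packing order and the function names are assumptions, and `bin(...).count("1")` stands in for a hardware popcount.

```python
LARGE = 256  # bits per L[] block, as on the slide
SMALL = 32   # bits per S[] block, as on the slide


def build(bits):
    """Return (words, L, S): packed 32-bit words, absolute and relative counts."""
    words, L, S = [], [], []
    total = within = 0
    for start in range(0, len(bits), SMALL):
        if start % LARGE == 0:
            L.append(total)      # 1s before this 256-bit block
            within = 0
        S.append(within)         # 1s before this word, inside its 256-bit block
        chunk = bits[start:start + SMALL]
        word = sum(b << j for j, b in enumerate(chunk))  # bit j = bits[start + j]
        words.append(word)
        ones = bin(word).count("1")
        within += ones
        total += ones
    return words, L, S


def rank1(words, L, S, n):
    """Number of 1s among the first n bits, for 0 <= n < 32 * len(words)."""
    mask = (1 << (n % SMALL)) - 1                        # keep the leftover low bits
    return L[n // LARGE] + S[n // SMALL] + bin(words[n // SMALL] & mask).count("1")


print(402 // LARGE, (402 % LARGE) // SMALL, 402 % SMALL)  # 1 4 18
```

For n = 402 the lookup reads L[1], the S[] entry for the fifth word of that block, and the low 18 bits of that word, matching the decomposition on the slide.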
Implementation: Practice
• Packing the required data into a single cache line
[Figure: a 56B chunk placed in one 64B cache line; fields are labelled 4B, 1B, and 32B, with 12B of padding]
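One way to picture the packing idea is to interleave the counts and the bits of each 256-bit block in a single 64-byte record, so that one rank touches only one record. The record layout below (4B absolute count, eight 1B relative counts, 32B of bits, then padding) is a hypothetical layout chosen for illustration; it is not claimed to match the chunk shown on the slide or the layout used in dbitv.

```python
import struct

# Hypothetical record: 4B absolute count, 8 x 1B relative counts,
# 32B (256 bits) of data, then padding up to one 64B cache line.
CHUNK = struct.Struct("<I8B32s20x")


def build(bits):
    """Pack bits (a list of 0/1) into one contiguous buffer of 64B records."""
    out = bytearray()
    total = 0
    for start in range(0, len(bits), 256):
        block = bits[start:start + 256]
        block = block + [0] * (256 - len(block))   # zero-pad the last block
        raw = bytearray(32)
        subs, within = [], 0
        for j in range(8):                          # eight 32-bit sub-blocks
            subs.append(within)
            for k in range(32):
                bit = block[32 * j + k]
                raw[4 * j + k // 8] |= bit << (k % 8)
                within += bit
        out += CHUNK.pack(total, *subs, bytes(raw))
        total += within
    return bytes(out)


def rank1(buf, n):
    """Number of 1s among the first n bits, reading a single 64B record."""
    fields = CHUNK.unpack_from(buf, (n // 256) * CHUNK.size)
    total, subs, raw = fields[0], fields[1:9], fields[9]
    j, r = (n % 256) // 32, n % 32
    word = int.from_bytes(raw[4 * j:4 * j + 4], "little")
    return total + subs[j] + bin(word & ((1 << r) - 1)).count("1")


print(CHUNK.size)  # 64
```

A relative count never exceeds 224 here, so one byte per sub-block is enough. `rank1` is valid for 0 <= n < 256 * (number of records).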
Implementation: Practice
• BTW, where is select?
– Omitted due to time constraints
– Please see the code ...
• Two Implementation Approaches
– O(log N) complexity
• ux-trie, rx, and marisa-trie
• Binary search using rank
• Suffers many cache/TLB misses
– O(1) complexity
• My implementation, designed to minimize these penalties
• 1 rank, 1 SIMD comparison, and an O(1) bsf
• Only 2 cache/TLB misses
Not implemented yet ...
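The O(log N) approach listed above, binary search over rank, can be sketched generically. This is not the code of ux-trie, rx, or marisa-trie; `select1` and the `rank1` callback are names invented for the example, and rank is assumed to count the half-open prefix B[0:n].

```python
def select1(rank1, n_bits, k):
    """Position of the k-th 1 (k >= 1), or -1, using only a rank1(n) oracle.

    rank1(n) must return the number of 1s in B[0:n] for 0 <= n <= n_bits.
    """
    if k < 1 or rank1(n_bits) < k:
        return -1
    lo, hi = 0, n_bits           # invariant: rank1(lo) < k <= rank1(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if rank1(mid) >= k:
            hi = mid
        else:
            lo = mid
    return hi - 1                # bit hi-1 is the k-th 1


B = [1, 0, 1, 1, 0, 0, 1, 1, 0]


def naive_rank(n):
    """Stand-in for an O(1) rank structure."""
    return sum(B[:n])


print(select1(naive_rank, len(B), 4))  # 6
```

Each probe of the search is a rank call at an unrelated position, which is where the many cache/TLB misses of this approach come from.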