An x86-optimized rank&select dictionary for bit sequences

x86/x64 Optimization Study Group #4


2012/6/16

Takeshi Yamamuro


What Is a Succinct Data Structure?

SDS: Succinct Data Structure

• Recently Gaining Popularity in Some Areas

– Research & Engineering

• Not a Data Structure, but a Data Representation

– A compression method for other data structures

– e.g., alphabets, trees, and graphs

• Transparent Operations Without Explicit Unpacking

– e.g., succinct LZ77 compression*1

*1 Kreft, S. and Navarro, G.: LZ77-Like Compression with Fast Random Access, In Proceedings of DCC, 2010

More Details

• SDS = Succinct Data + Succinct Index

• Succinct Data

– Compact representation for target data

– Close to the information-theoretic lower bound

e.g., with N possible patterns, the lower bound is logN bits

• Succinct Index

– O(1) operations on the target data

– o(N) space costs: asymptotically negligible

More Details


cited from: http://goo.gl/rkQ5z

If you need more information, ...

A rank/select dictionary for SDS


Rank/Select Operations

• SDS Are Composed of Rank/Select Operations

– Many internal calls to rank/select

• Rank/Select for Succinct Bit Sequences: B[i]

– rankx(n, B): the number of bits equal to x (0 or 1) in the first n bits, B[0...n)

– selectx(n, B): the position of the n-th x in B[], counting n from 1

i    0 1 2 3 4 5 6 7 8

B[i] 1 0 1 1 0 0 1 1 0

rank1(5, B) = 3, select1(4, B) = 6
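The two definitions can be checked against a naive reference implementation. This is a minimal sketch, assuming rank counts over the first n bits and select takes a 1-based count and returns a 0-based position, which is the convention that reproduces the example values on the slide. The function names are illustrative only.

```python
def rank(bits, n, x=1):
    """Number of bits equal to x among the first n bits, bits[0:n]."""
    return sum(1 for b in bits[:n] if b == x)


def select(bits, n, x=1):
    """0-based position of the n-th (1-based) bit equal to x, or -1 if absent."""
    seen = 0
    for pos, b in enumerate(bits):
        if b == x:
            seen += 1
            if seen == n:
                return pos
    return -1


B = [1, 0, 1, 1, 0, 0, 1, 1, 0]
print(rank(B, 5), select(B, 4))  # the slide example: 3 and 6
```

Both functions scan the sequence, so they cost O(N) per query; they are useful mainly as a correctness oracle for faster structures.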

Rank/Select Operations

• Available Rank/Select Implementations

– ux-trie: http://code.google.com/p/ux-trie/

– rx: http://code.google.com/p/mozc/

– marisa-trie: http://code.google.com/p/marisa-trie/

• Today's Contribution

– x86-optimized rank/select

– https://github.com/maropu/dbitv

Performance Results

• Benchmark Setup*1

– A random bit sequence with 50% density

– Random rank/select queries over the bits

– CPU: Intel Core-i5 [email protected]

• Measured Latency

– Median latency over 11 trials

*1 Reference: http://d.hatena.ne.jp/s-yata/20111216/1324032373
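The measurement procedure can be sketched as follows. This is an assumed harness, not the author's benchmark code: the helper names `make_bits` and `median_latency_ns` are hypothetical, and a naive Python rank stands in for the structure under test.

```python
import random
import statistics
import time


def make_bits(n, density=0.5, seed=0):
    """Random bit sequence in which each bit is 1 with probability `density`."""
    rng = random.Random(seed)
    return [1 if rng.random() < density else 0 for _ in range(n)]


def median_latency_ns(query, args, trials=11):
    """Median over `trials` runs of the average time per call of `query`."""
    per_call = []
    for _ in range(trials):
        start = time.perf_counter_ns()
        for a in args:
            query(a)
        per_call.append((time.perf_counter_ns() - start) / len(args))
    return statistics.median(per_call)


bits = make_bits(1 << 12)
rng = random.Random(1)
positions = [rng.randrange(len(bits) + 1) for _ in range(1000)]
naive_rank = lambda n: sum(bits[:n])  # stand-in for the structure under test
latency = median_latency_ns(naive_rank, positions)
```

Taking the median of several trials reduces the influence of outlier runs, for example one disturbed by other processes.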

Performance Results: Rank

[Figure: averaged rank latency (ns), log scale from 1.E+00 to 1.E+03, versus bit length, for ux, rx, marisa, and opt]

Performance Results: Select

[Figure: averaged select latency (ns), log scale from 1.E+00 to 1.E+04, versus bit length, for ux, rx, marisa, and opt]

Implementation Details


Implementation: Method of Four Russians

• Requirement: O(1) operation costs with o(N) extra space

B[] = a sequence of N bits


• Split B[] into fixed-length blocks of log^2 N bits

• Total counts pre-computed in L[]

L[] = l1, l2, ... (one entry per block of log^2 N bits)

\[
\mathrm{rank}_1(x, B) = \sum_{i=0}^{x-1} B[i]
= \underbrace{\sum_{i=0}^{x_L - 1} B[i]}_{L[\lfloor x/\log^2 N \rfloor]:\ O(1)}
+ \underbrace{\sum_{i=x_L}^{x-1} B[i]}_{O(\log^2 N)},
\qquad x_L = \lfloor x/\log^2 N \rfloor \log^2 N
\]
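The one-level scheme can be sketched in a few lines. This is an illustrative sketch, assuming 0-based bit positions and a rank that counts the first x bits; `build_L` and `rank1` are hypothetical names, and the block length is rounded to an integer close to log^2 N.

```python
import math


def build_L(bits):
    """Return (L, block): L[j] is the number of 1s before block j."""
    n = len(bits)
    block = max(1, int(math.log2(max(n, 2))) ** 2)  # about log^2 N bits
    L, total = [], 0
    for start in range(0, n + 1, block):
        L.append(total)
        total += sum(bits[start:start + block])
    return L, block


def rank1(bits, L, block, x):
    """Number of 1s among the first x bits: one lookup plus a short scan."""
    j = x // block
    return L[j] + sum(bits[j * block:x])


bits = [(i * i + i // 3) % 2 for i in range(200)]
L, block = build_L(bits)
```

The scan in `rank1` touches fewer than `block` bits, which corresponds to the O(log^2 N) term that the second level of the index is meant to shrink.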




• L[]: o(N) space costs

– N / log^2 N entries of logN bits each

\[
\frac{N}{\log^2 N} \cdot \log N = O\!\left(\frac{N}{\log N}\right) = o(N)
\]


• Split each block again into fixed-length sub-blocks of 1/2 logN bits

• Counts within each block pre-computed in S[]

L[] = l1, l2, ... (one entry per block of log^2 N bits)

S[] = s1, s2, ... (one entry per sub-block of 1/2 logN bits)

\[
\mathrm{rank}_1(x, B) = \sum_{i=0}^{x-1} B[i]
= \underbrace{\sum_{i=0}^{x_L - 1} B[i]}_{L[\lfloor x/\log^2 N \rfloor]:\ O(1)}
+ \underbrace{\sum_{i=x_L}^{x_S - 1} B[i]}_{S[\lfloor x/\frac{1}{2}\log N \rfloor]:\ O(1)}
+ \underbrace{\sum_{i=x_S}^{x-1} B[i]}_{O(\log N)}
\]

\[
x_L = \lfloor x/\log^2 N \rfloor \log^2 N, \qquad
x_S = \lfloor x/\tfrac{1}{2}\log N \rfloor \cdot \tfrac{1}{2}\log N
\]
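A two-level version can be sketched as follows. The names are hypothetical, positions are assumed to be 0-based, and the block lengths are rounded so that a block is an exact multiple of a sub-block. S[] stores counts relative to the start of the enclosing block, which is what keeps its entries small.

```python
import math


def build_index(bits):
    """Two-level index: L[] per block (absolute), S[] per sub-block (relative)."""
    n = len(bits)
    log_n = max(1, int(math.log2(max(n, 2))))
    sub = max(1, log_n // 2)          # sub-block length, about 1/2 log N
    block = sub * 2 * log_n           # block length, about log^2 N
    L, S, total = [], [], 0
    for start in range(0, n + 1, sub):
        if start % block == 0:
            L.append(total)           # 1s before this block
        S.append(total - L[-1])       # 1s since the block started
        total += sum(bits[start:start + sub])
    return L, S, block, sub


def rank1(bits, index, x):
    """Number of 1s among the first x bits."""
    L, S, block, sub = index
    tail = sum(bits[(x // sub) * sub:x])  # fewer than `sub` bits are scanned
    return L[x // block] + S[x // sub] + tail


bits = [(i * i + i // 7) % 2 for i in range(1000)]
index = build_index(bits)
```

Only the final scan depends on the sub-block length, so it is the remaining O(log N) term.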




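• S[]: o(N) space costs

– N / (1/2 logN) entries of log(log^2 N) bits each, since each count is relative to its block

\[
\frac{N}{\frac{1}{2}\log N} \cdot \log(\log^2 N) = O\!\left(\frac{N \log\log N}{\log N}\right) = o(N)
\]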


• O(1) Popcount/Table Lookup for the Last Term

– The last term of rank1(x, B), the sum over the fewer than 1/2 logN bits inside one sub-block, drops from O(logN) to O(1)

– The L[] and S[] lookups are already O(1), so rank as a whole is O(1)
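The table-lookup variant can be sketched as below. A sub-block of 1/2 logN bits has only sqrt(N) possible patterns, so a table with one popcount per pattern stays sublinear in N. The fixed width `SUB = 8` and the LSB-first bit order are assumptions made for illustration; on x86 a hardware popcount instruction can replace the table.

```python
SUB = 8  # sub-block length in bits (stands in for 1/2 log N)

# One popcount per possible SUB-bit pattern: 2**SUB entries.
POPCOUNT_TABLE = [bin(v).count("1") for v in range(1 << SUB)]


def tail_count(word, k):
    """Number of 1s among the first k bits of a SUB-bit sub-block.

    Bit i of the sub-block is stored in bit i of `word` (LSB first), so a
    mask keeps the first k bits and one lookup replaces the bit-by-bit scan.
    """
    return POPCOUNT_TABLE[word & ((1 << k) - 1)]


print(tail_count(0b10110101, 5))  # low five bits are 10101, so this prints 3
```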




• As a Result, o(N) Space Costs

\[
\underbrace{\frac{N}{\log N}}_{L[]\ \text{size}}
+ \underbrace{\frac{4N\log\log N}{\log N}}_{S[]\ \text{size}}
= O\!\left(\frac{N\log\log N}{\log N}\right) = o(N)
\]
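To get a feel for these sizes, the two counts can be evaluated numerically. This is only a back-of-the-envelope sketch with base-2 logarithms and no rounding of entry widths; `index_bits` is a hypothetical helper.

```python
import math


def index_bits(n):
    """Sizes in bits of L[] and S[], following the counts on the slides."""
    log_n = math.log2(n)
    l_bits = (n / log_n ** 2) * log_n                     # N / log N
    s_bits = (n / (0.5 * log_n)) * math.log2(log_n ** 2)  # 4 N loglog N / log N
    return l_bits, s_bits


def overhead(n):
    """Index bits per bit of B[]."""
    l_bits, s_bits = index_bits(n)
    return (l_bits + s_bits) / n


for n in (2 ** 20, 2 ** 30, 2 ** 40):
    print(n, round(overhead(n), 3))
```

The overhead per bit shrinks as N grows, but slowly, which is one reason practical implementations use fixed block sizes instead of the asymptotic ones.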




Implementation: Practice


• Low Computation Costs but High Cache Penalties

– 3 cache/TLB misses per rank: one each for L[], S[], and B[]

[Figure: B[] split into 256-bit blocks counted in L[] and 32-bit blocks counted in S[]; the accesses to L[], S[], and B[] are each marked "Miss!"]

ex. rank1(402 = 256*1 + 32*4 + 18, B): after the L[] and S[] lookups, popcount the 18 remaining bits on the left
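The decomposition in the example can be traced with a small sketch that uses the 256-bit and 32-bit block sizes from the slide. It assumes that L[] and S[] hold the counts before each block, with S[] relative to its 256-bit block; the exact contents of the arrays in the figure are not recoverable, so this is an illustration and not the author's code.

```python
LARGE, SMALL = 256, 32  # block sizes in bits, as on the slide


def build(bits):
    """L[]: 1s before each 256-bit block. S[]: 1s before each 32-bit block,
    counted from the start of the enclosing 256-bit block."""
    L, S, total = [], [], 0
    for start in range(0, len(bits) + 1, SMALL):
        if start % LARGE == 0:
            L.append(total)
        S.append(total - L[-1])
        total += sum(bits[start:start + SMALL])
    return L, S


def rank1(bits, L, S, x):
    large, rest = divmod(x, LARGE)        # 402 -> large block 1, 146 bits left
    small, offset = divmod(rest, SMALL)   # 146 -> small block 4, 18 bits left
    s_index = large * (LARGE // SMALL) + small
    # Three separate memory regions are touched: L[], S[], and B[].
    return L[large] + S[s_index] + sum(bits[x - offset:x])


bits = [(i * i + i // 5) % 2 for i in range(1024)]
L, S = build(bits)
```

Because the three arrays live in different places, a query on a large sequence can miss the cache once per array even though the arithmetic itself is trivial.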



• Packing the Required Data into a Single Cache Line

[Figure: one 64B cache line holding a 56B chunk; the field labels read 4B, 1B, and 32B, with 12B marked next to the padding]
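The packing idea can be illustrated with a hypothetical 64-byte record. The layout below is an assumption, not the layout used in dbitv: a 4B count for the 256-bit block, eight 1B counts for its 32-bit sub-blocks, the 32B of bit data, and padding up to 64B. The padding here simply fills the line and may differ from the 56B chunk in the figure. Bits are assumed to be stored LSB first within each byte.

```python
import struct

# Hypothetical record: 4B block count, 8 x 1B sub-block counts,
# 32B of bit data (256 bits), and 20B of padding up to 64 bytes.
CHUNK = struct.Struct("<I8B32s20x")
assert CHUNK.size == 64


def pack_chunks(data):
    """Pack a bytes object (32 bytes per 256-bit block) into 64-byte chunks."""
    out, total = bytearray(), 0
    for off in range(0, len(data) + 1, 32):
        block = data[off:off + 32].ljust(32, b"\0")
        small, running = [], 0
        for i in range(0, 32, 4):                 # eight 32-bit sub-blocks
            small.append(running)
            running += sum(bin(b).count("1") for b in block[i:i + 4])
        out += CHUNK.pack(total, *small, block)
        total += running
    return bytes(out)


def rank1(chunks, x):
    """Number of 1s among the first x bits, reading a single 64-byte chunk."""
    large, rest = divmod(x, 256)
    fields = CHUNK.unpack_from(chunks, large * 64)
    count, small, block = fields[0], fields[1:9], fields[9]
    sb, offset = divmod(rest, 32)
    word = int.from_bytes(block[sb * 4:sb * 4 + 4], "little")
    return count + small[sb] + bin(word & ((1 << offset) - 1)).count("1")


data = bytes((i * 37 + 11) % 256 for i in range(100))
chunks = pack_chunks(data)
```

Everything a rank query needs then sits in one record, so in a language with control over alignment a query would touch one cache line instead of three.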



• BTW, what about select?

– Omitted due to time constraints

– Please see the code ...

• Two Implementation Approaches

– O(logN) complexity

• ux-trie, rx, and marisa-trie

• Binary search using rank

• Suffers many cache/TLB misses

– O(1) complexity

• My implementation, designed to minimize these penalties

• 1 rank, 1 SIMD comparison, and an O(1) bsf

• Only 2 cache/TLB misses
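The O(logN) approach, binary search using rank, can be sketched generically. This is not taken from ux-trie, rx, or marisa-trie; it only illustrates the idea, with `rank1` passed in as a function that counts the 1s among the first x bits. The O(1) variant is not sketched because the slides omit its details.

```python
def select1(rank1, n_bits, k):
    """0-based position of the k-th 1 (k counted from 1), or -1 if absent.

    Each probe is one rank call, so the search costs O(log N) rank calls.
    """
    if k < 1 or rank1(n_bits) < k:
        return -1
    lo, hi = 0, n_bits            # invariant: rank1(lo) < k <= rank1(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if rank1(mid) >= k:
            hi = mid
        else:
            lo = mid
    return hi - 1                 # hi is the smallest x with rank1(x) >= k


B = [1, 0, 1, 1, 0, 0, 1, 1, 0]
print(select1(lambda x: sum(B[:x]), len(B), 4))  # the slide example: 6
```

Each probe lands at an unrelated position in the index, which is where the many cache/TLB misses of this approach come from.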



Not implemented yet ...
