x86/x64最適化勉強会#4 A x86-optimized rank/select dictionary for bit sequences 2012/6/16 Takeshi Yamamuro 1

A x86-optimized rank&select dictionary for bit sequences

Page 1: A x86-optimized rank&select dictionary for bit sequences

x86/x64最適化勉強会#4

A x86-optimized rank/select dictionary for bit sequences

2012/6/16

Takeshi Yamamuro

1

Page 2: A x86-optimized rank&select dictionary for bit sequences

What Is a Succinct Data Structure?

2

Page 3: A x86-optimized rank&select dictionary for bit sequences

SDS: Succinct Data Structure

• Recently gaining popularity in several areas

– Both research and engineering

• Not a data structure, but a data representation

– A compressed representation of other data structures

– e.g., alphabets, trees, and graphs

• Transparent operations without explicit unpacking

– e.g., succinct LZ77 compression*1

3

*1 Kreft, S. and Navarro, G.: LZ77-Like Compression with Fast Random Access, In Proceedings of DCC, 2010

Page 4: A x86-optimized rank&select dictionary for bit sequences

More Details

• SDS = Succinct Data + Succinct Index

• Succinct Data

– Compact representation of the target data

– Close to the information-theoretic lower bound

e.g., with N possible patterns, the lower bound is log N bits (base 2)

• Succinct Index

– O(1) operations on the target data

– o(N) space costs: asymptotically negligible
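
As a worked instance of the lower bound (the symbols n and m are introduced here only for illustration and do not appear on the slide): among bit sequences of length n with exactly m ones there are $\binom{n}{m}$ patterns, so any representation needs at least $\lceil \log_2 \binom{n}{m} \rceil$ bits. For n = 8 and m = 4 this gives

$$
\left\lceil \log_2 \binom{8}{4} \right\rceil = \lceil \log_2 70 \rceil = 7 \text{ bits},
$$

compared with 8 bits for the raw sequence.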

4

Page 5: A x86-optimized rank&select dictionary for bit sequences

More Details

5

cited from: http://goo.gl/rkQ5z

If you need more information, ...

Page 6: A x86-optimized rank&select dictionary for bit sequences

A rank/select dictionary for SDS

6

Page 7: A x86-optimized rank&select dictionary for bit sequences

Rank/Select Operations

• SDS are composed of rank/select operations

– Many internal calls to rank/select

• Rank/Select for succinct bit sequences: B[i]

– rankx(n, B): the number of bits equal to x (0 or 1) in B[0...n]

– selectx(n, B): the position of the n-th x in B[]

7

i    0 1 2 3 4 5 6 7 8
B[i] 1 0 1 1 0 0 1 1 0

rank1(5, B) = 3, select1(4, B) = 6
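
The definitions above can be checked against a naive reference implementation. This is only a minimal sketch: the names `rank_naive`, `select_naive`, and `slide_example` are hypothetical, rank is taken over B[0..n] inclusive, and select takes a 1-based count, which is the convention that reproduces the example values on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of bits equal to x in B[0..n] (inclusive), by linear scan.
std::size_t rank_naive(const std::vector<bool>& B, std::size_t n, bool x) {
    assert(n < B.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i <= n; ++i) {
        if (B[i] == x) ++count;
    }
    return count;
}

// Position of the n-th (1-based) occurrence of x in B, or B.size() if absent.
std::size_t select_naive(const std::vector<bool>& B, std::size_t n, bool x) {
    assert(n >= 1);
    std::size_t seen = 0;
    for (std::size_t i = 0; i < B.size(); ++i) {
        if (B[i] == x && ++seen == n) return i;
    }
    return B.size();
}

// The bit sequence from the slide, i = 0..8: 1 0 1 1 0 0 1 1 0.
std::vector<bool> slide_example() {
    return {true, false, true, true, false, false, true, true, false};
}
```

Both functions take O(N) time; the rest of the talk is about answering the same queries in O(1) with o(N) extra space.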

Page 8: A x86-optimized rank&select dictionary for bit sequences

Rank/Select Operations

• Available Rank/Select Implementations

– ux-trie: http://code.google.com/p/ux-trie/

– rx: http://code.google.com/p/mozc/

– marisa-trie: http://code.google.com/p/marisa-trie/

• Today's Contribution

– x86-optimized rank/select

– https://github.com/maropu/dbitv

8

Page 9: A x86-optimized rank&select dictionary for bit sequences

Performance Results

9

• Performance Benchmark Setups*1

– Generate a random sequence of bits: 50% density

– Random rank/select queries over the bits

– CPU: Intel Core-i5 [email protected]

• Latency Measured

– Median latency over 11 trials

*1 Reference: http://d.hatena.ne.jp/s-yata/20111216/1324032373
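
The setup described above could be sketched as follows. This is a hypothetical harness, not the benchmark code used for the talk: the helper names (`random_bits`, `random_queries`, `median`, `median_latency_ns`) and the seeds are assumptions. It generates bits that are 1 with probability 1/2, draws uniformly random query positions, and reports the median per-query latency over 11 trials.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Random bit sequence packed into 64-bit words; each bit is 1 with probability 1/2.
std::vector<std::uint64_t> random_bits(std::size_t nwords, std::uint64_t seed) {
    std::mt19937_64 gen(seed);
    std::vector<std::uint64_t> words(nwords);
    for (auto& w : words) w = gen();
    return words;
}

// Uniformly random query positions in [0, nbits).
std::vector<std::size_t> random_queries(std::size_t count, std::size_t nbits,
                                        std::uint64_t seed) {
    assert(nbits > 0);
    std::mt19937_64 gen(seed);
    std::uniform_int_distribution<std::size_t> dist(0, nbits - 1);
    std::vector<std::size_t> q(count);
    for (auto& p : q) p = dist(gen);
    return q;
}

// Median of a non-empty sample (upper median for even sizes).
double median(std::vector<double> v) {
    assert(!v.empty());
    auto mid = v.begin() + v.size() / 2;
    std::nth_element(v.begin(), mid, v.end());
    return *mid;
}

// Runs `query` over all positions `trials` times and returns the median
// per-query latency in nanoseconds.
template <class Query>
double median_latency_ns(Query query, const std::vector<std::size_t>& positions,
                         int trials = 11) {
    assert(!positions.empty() && trials > 0);
    std::vector<double> per_query;
    std::uint64_t sink = 0;  // accumulate results so the calls are not optimized away
    for (int t = 0; t < trials; ++t) {
        auto start = std::chrono::steady_clock::now();
        for (std::size_t p : positions) sink += query(p);
        auto stop = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(stop - start).count();
        per_query.push_back(ns / positions.size());
    }
    volatile std::uint64_t keep = sink;  // force the compiler to keep `sink`
    (void)keep;
    return median(per_query);
}
```

Taking the median rather than the mean keeps a single slow trial (for example, one disturbed by another process) from skewing the reported latency.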

Page 10: A x86-optimized rank&select dictionary for bit sequences

Performance Results: Rank

10

[Figure: averaged rank latency (ns), log scale from 1.E+00 to 1.E+03, plotted against bit length for ux, rx, marisa, and opt]

Page 11: A x86-optimized rank&select dictionary for bit sequences

Performance Results: Select

11

[Figure: averaged select latency (ns), log scale from 1.E+00 to 1.E+04, plotted against bit length for ux, rx, marisa, and opt]

Page 12: A x86-optimized rank&select dictionary for bit sequences

Implementation Details

12

Page 13: A x86-optimized rank&select dictionary for bit sequences

Implementation: Method of Four Russians

13

• Rule: O(1) operation costs with o(N) space

B[] = a sequence of N bits

Page 14: A x86-optimized rank&select dictionary for bit sequences

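Implementation: Method of Four Russians

14

• Rule: O(1) operation costs with o(N) space

• Split B[] into fixed-length blocks of $\log^2 N$ bits

• Total counts pre-computed in L[]

[Figure: B[], a sequence of bits, divided into blocks of $\log^2 N$ bits; L[] = l1, l2, ... stores one pre-computed count per block]

$$
\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i]
= \underbrace{\sum_{i=1}^{p} B[i]}_{L[\lfloor x/\log^2 N \rfloor]}
+ \sum_{i=p+1}^{x} B[i],
\qquad p = \lfloor x/\log^2 N \rfloor \cdot \log^2 N
$$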

Page 15: A x86-optimized rank&select dictionary for bit sequences

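Implementation: Method of Four Russians

15

• Rule: O(1) operation costs with o(N) space

• Split B[] into fixed-length blocks of $\log^2 N$ bits

• Total counts pre-computed in L[]

[Figure: B[], a sequence of bits, divided into blocks of $\log^2 N$ bits; L[] = l1, l2, ... stores one pre-computed count per block]

$$
\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i]
= \underbrace{\sum_{i=1}^{p} B[i]}_{L[\lfloor x/\log^2 N \rfloor]}
+ \sum_{i=p+1}^{x} B[i],
\qquad p = \lfloor x/\log^2 N \rfloor \cdot \log^2 N
$$

The first term is an O(1) lookup in L[]; the second term is an O($\log^2 N$) scan.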
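
The one-level decomposition above might look like this in code. It is a minimal sketch under stated assumptions: the struct name `OneLevelRank` is hypothetical, the block length is rounded down from (log2 N)^2, and `rank1(x)` counts the 1s among the first x bits, which corresponds to the sum from i = 1 to x on the slide (so it is offset by one from the inclusive, 0-based definition given earlier).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct OneLevelRank {
    std::vector<bool> bits;      // B[]
    std::size_t block;           // block length, roughly (log2 N)^2 bits
    std::vector<std::size_t> L;  // L[k] = number of 1s in the first k blocks

    explicit OneLevelRank(std::vector<bool> b) : bits(std::move(b)) {
        const double n = static_cast<double>(bits.size() < 2 ? 2 : bits.size());
        const double lg = std::log2(n);
        block = static_cast<std::size_t>(lg * lg);  // at least 1, since lg >= 1
        L.assign(bits.size() / block + 1, 0);
        std::size_t running = 0;
        for (std::size_t i = 0; i < bits.size(); ++i) {
            running += bits[i];
            if ((i + 1) % block == 0) L[(i + 1) / block] = running;
        }
    }

    // Number of 1s among the first x bits: one table lookup plus a scan
    // of fewer than `block` bits inside the current block.
    std::size_t rank1(std::size_t x) const {
        assert(x <= bits.size());
        const std::size_t k = x / block;
        std::size_t count = L[k];                      // O(1) lookup
        for (std::size_t i = k * block; i < x; ++i) {  // O(log^2 N) scan
            count += bits[i];
        }
        return count;
    }
};
```

The remaining scan is what the second level of blocks, S[], is introduced to shorten.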

Page 16: A x86-optimized rank&select dictionary for bit sequences

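Implementation: Method of Four Russians

16

• Rule: O(1) operation costs with o(N) space

• L[]: o(N) space costs

[Figure: B[], a sequence of bits, divided into blocks of $\log^2 N$ bits; L[] = l1, l2, ... stores one pre-computed count per block]

$$
\frac{N}{\log^2 N} \cdot \log N = O\!\left(\frac{N}{\log N}\right) = o(N)
$$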

Page 17: A x86-optimized rank&select dictionary for bit sequences

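Implementation: Method of Four Russians

17

• Rule: O(1) operation costs with o(N) space

• Split each block again into fixed-length blocks of $\frac{1}{2}\log N$ bits

• Total counts pre-computed in S[]

[Figure: B[] divided into large blocks of $\log^2 N$ bits with counts L[] = l1, l2, ..., and each large block divided into small blocks of $\frac{1}{2}\log N$ bits with counts S[] = s1, s2, ...]

$$
\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i]
= \underbrace{\sum_{i=1}^{p} B[i]}_{L[\lfloor x/\log^2 N \rfloor]}
+ \underbrace{\sum_{i=p+1}^{q} B[i]}_{S[\lfloor x/(\frac{1}{2}\log N) \rfloor]}
+ \sum_{i=q+1}^{x} B[i]
$$

where $p = \lfloor x/\log^2 N \rfloor \cdot \log^2 N$ and $q = \lfloor x/(\frac{1}{2}\log N) \rfloor \cdot \frac{1}{2}\log N$.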

Page 18: A x86-optimized rank&select dictionary for bit sequences

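Implementation: Method of Four Russians

18

• Rule: O(1) operation costs with o(N) space

• Split each block again into fixed-length blocks of $\frac{1}{2}\log N$ bits

• Total counts pre-computed in S[]

[Figure: B[] divided into large blocks of $\log^2 N$ bits with counts L[] = l1, l2, ..., and each large block divided into small blocks of $\frac{1}{2}\log N$ bits with counts S[] = s1, s2, ...]

$$
\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i]
= \underbrace{\sum_{i=1}^{p} B[i]}_{L[\lfloor x/\log^2 N \rfloor]}
+ \underbrace{\sum_{i=p+1}^{q} B[i]}_{S[\lfloor x/(\frac{1}{2}\log N) \rfloor]}
+ \sum_{i=q+1}^{x} B[i]
$$

where $p = \lfloor x/\log^2 N \rfloor \cdot \log^2 N$ and $q = \lfloor x/(\frac{1}{2}\log N) \rfloor \cdot \frac{1}{2}\log N$.

The first two terms are O(1) lookups in L[] and S[]; the last term is an O(logN) scan.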

Page 19: A x86-optimized rank&select dictionary for bit sequences

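Implementation: Method of Four Russians

19

• Rule: O(1) operation costs with o(N) space

• S[]: o(N) space costs

[Figure: B[] divided into large blocks of $\log^2 N$ bits with counts L[] = l1, l2, ..., and each large block divided into small blocks of $\frac{1}{2}\log N$ bits with counts S[] = s1, s2, ...]

$$
\frac{N}{\frac{1}{2}\log N} \cdot \log(\log^2 N) = O\!\left(\frac{N \log\log N}{\log N}\right) = o(N)
$$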

Page 20: A x86-optimized rank&select dictionary for bit sequences

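Implementation: Method of Four Russians

20

• Rule: O(1) operation costs with o(N) space

• O(1) popcount/table lookup for the last term

[Figure: B[] divided into large blocks of $\log^2 N$ bits with counts L[] = l1, l2, ..., and each large block divided into small blocks of $\frac{1}{2}\log N$ bits with counts S[] = s1, s2, ...]

$$
\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i]
= \underbrace{\sum_{i=1}^{p} B[i]}_{L[\lfloor x/\log^2 N \rfloor]}
+ \underbrace{\sum_{i=p+1}^{q} B[i]}_{S[\lfloor x/(\frac{1}{2}\log N) \rfloor]}
+ \sum_{i=q+1}^{x} B[i]
$$

where $p = \lfloor x/\log^2 N \rfloor \cdot \log^2 N$ and $q = \lfloor x/(\frac{1}{2}\log N) \rfloor \cdot \frac{1}{2}\log N$.

The first two terms are O(1) lookups; with a popcount or table lookup, the last term drops from O(logN) to O(1).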
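
The table-lookup variant of the last term can be sketched as below. The idea is to precompute the number of 1s for every possible small-block pattern, so that counting inside a small block becomes a single array access. The fixed 8-bit block width and the names `make_popcount_table` and `rank_in_block` are assumptions for illustration; in the theoretical construction the table is indexed by patterns of $\frac{1}{2}\log N$ bits, and on x86 a hardware popcount instruction can replace the table altogether.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Table of the number of 1s in every 8-bit pattern (256 entries).
std::array<std::uint8_t, 256> make_popcount_table() {
    std::array<std::uint8_t, 256> t{};  // t[0] = 0
    for (unsigned v = 1; v < 256; ++v) {
        // popcount(v) = popcount(v without its lowest bit) + lowest bit
        t[v] = static_cast<std::uint8_t>(t[v >> 1] + (v & 1u));
    }
    return t;
}

// Number of 1s among the lowest `len` bits of an 8-bit block (0 <= len <= 8).
unsigned rank_in_block(const std::array<std::uint8_t, 256>& table,
                       std::uint8_t block, unsigned len) {
    assert(len <= 8);
    const unsigned mask = (1u << len) - 1u;  // keeps only the first `len` bits
    return table[block & mask];
}
```

Masking before the lookup is what lets one table serve every prefix length, instead of needing a separate table per length.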

Page 21: A x86-optimized rank&select dictionary for bit sequences

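Implementation: Method of Four Russians

21

• Rule: O(1) operation costs with o(N) space

• As a result, o(N) space costs in total

[Figure: B[] divided into large blocks of $\log^2 N$ bits with counts L[] = l1, l2, ..., and each large block divided into small blocks of $\frac{1}{2}\log N$ bits with counts S[] = s1, s2, ...]

$$
\underbrace{\frac{N}{\log N}}_{L[]\ \text{size}}
+ \underbrace{\frac{4N\log\log N}{\log N}}_{S[]\ \text{size}}
= O\!\left(\frac{N\log\log N}{\log N}\right) = o(N)
$$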

Page 22: A x86-optimized rank&select dictionary for bit sequences

Implementation: Method of Four Russians

22

• Rule: O(1) operation costs with o(N) space

Page 23: A x86-optimized rank&select dictionary for bit sequences

Implementation: Practice

23

• Low Computation Costs & High Cache Penalties

– 3 cache/TLB misses per rank

[Figure: L[] stores one count per 256-bit block of B[], and S[] stores one count per 32-bit block. Example: rank1(402 = 256*1 + 32*4 + 18, B) reads one entry of L[], one entry of S[], and popcounts the remaining left bits of the 32-bit word in B[].]

Page 24: A x86-optimized rank&select dictionary for bit sequences

Implementation: Practice

24

• Low Computation Costs & High Cache Penalties

– 3 cache/TLB misses per rank

[Figure: L[] stores one count per 256-bit block of B[], and S[] stores one count per 32-bit block. Example: rank1(402 = 256*1 + 32*4 + 18, B) reads one entry of L[], one entry of S[], and popcounts the remaining left bits of the 32-bit word in B[]. Each of the three accesses (L[], S[], and B[]) is marked "Miss!".]
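
The three-array layout above can be sketched as follows, with 256-bit large blocks and 32-bit small blocks as in the figure. Several details are assumptions rather than anything taken from dbitv: the struct name `TwoLevelRank`, the convention that bit 0 of each word comes first, the convention that `rank1(pos)` counts the 1s among the first `pos` bits, and the use of the GCC/Clang builtin `__builtin_popcount`. The three memory accesses in `rank1` touch three unrelated arrays, which is where the three misses come from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct TwoLevelRank {
    std::vector<std::uint32_t> B;  // bit sequence, 32 bits per word, bit 0 first
    std::vector<std::uint32_t> L;  // number of 1s before each 256-bit block
    std::vector<std::uint8_t>  S;  // number of 1s before each word, within its block

    explicit TwoLevelRank(std::vector<std::uint32_t> words) : B(std::move(words)) {
        B.push_back(0);  // sentinel word so that rank1(total bits) stays in range
        L.resize(B.size() / 8 + 1);
        S.resize(B.size());
        std::uint32_t total = 0, in_block = 0;
        for (std::size_t w = 0; w < B.size(); ++w) {
            if (w % 8 == 0) { L[w / 8] = total; in_block = 0; }
            S[w] = static_cast<std::uint8_t>(in_block);  // at most 7 * 32 = 224
            const std::uint32_t c = __builtin_popcount(B[w]);
            total += c;
            in_block += c;
        }
    }

    // Number of 1s among the first `pos` bits.
    std::size_t rank1(std::size_t pos) const {
        const std::size_t w = pos / 32;
        assert(w < B.size());
        const std::uint32_t rem = pos % 32;  // bits to count in the last word
        const std::uint32_t mask = rem ? (~0u >> (32 - rem)) : 0u;
        return L[pos / 256]                      // access 1: large-block count
             + S[w]                              // access 2: small-block count
             + __builtin_popcount(B[w] & mask);  // access 3: the bits themselves
    }
};
```

For pos = 402 this reads L[1], S[12] (word 4 of large block 1), and the lowest 18 bits of B[12], matching the decomposition 256*1 + 32*4 + 18.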

Page 25: A x86-optimized rank&select dictionary for bit sequences

Implementation: Practice

25

• Packing the required data into a single cacheline

[Figure: one chunk packed into a 64B cache line; labels in the figure: 4B, 1B, 32B (the bits 0110....001..........0), 56B chunk, 12B, padding]
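
One way to pack the data for a rank query into a single cache line is sketched below. The field layout is hypothetical and is not claimed to match the figure or dbitv: it assumes a 4B large-block count, eight 1B small-block counts, and 32B of bits, with the compiler padding the struct to 64B. The names `Chunk`, `build_chunks`, and `rank1` are likewise illustrative, `__builtin_popcount` is a GCC/Clang builtin, and C++17 is assumed so that `std::vector` honors the 64-byte alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One chunk per 64-byte cache line: everything a rank query on one
// 256-bit block needs. The field layout is an assumption for illustration.
struct alignas(64) Chunk {
    std::uint32_t large;     // 4B: number of 1s before this 256-bit block
    std::uint8_t  small[8];  // 8 x 1B: number of 1s before each word in the block
    std::uint32_t bits[8];   // 32B: the 256 bits themselves
    // the compiler pads the remaining bytes up to 64B
};
static_assert(sizeof(Chunk) == 64, "one chunk must fill exactly one cache line");

std::vector<Chunk> build_chunks(const std::vector<std::uint32_t>& words) {
    // Value-initialized to zero; sized so there is always one spare word.
    std::vector<Chunk> chunks(words.size() / 8 + 1);
    std::uint32_t total = 0;
    for (std::size_t c = 0; c < chunks.size(); ++c) {
        chunks[c].large = total;
        std::uint32_t in_block = 0;
        for (std::size_t j = 0; j < 8; ++j) {
            const std::size_t w = c * 8 + j;
            chunks[c].small[j] = static_cast<std::uint8_t>(in_block);
            if (w < words.size()) {
                chunks[c].bits[j] = words[w];
                in_block += __builtin_popcount(words[w]);
            }
        }
        total += in_block;
    }
    return chunks;
}

// Number of 1s among the first `pos` bits, touching a single chunk.
std::size_t rank1(const std::vector<Chunk>& chunks, std::size_t pos) {
    assert(pos / 256 < chunks.size());
    const Chunk& c = chunks[pos / 256];  // the only cache line touched
    const std::size_t j = (pos % 256) / 32;
    const std::uint32_t rem = pos % 32;
    const std::uint32_t mask = rem ? (~0u >> (32 - rem)) : 0u;
    return c.large + c.small[j] + __builtin_popcount(c.bits[j] & mask);
}
```

The trade-off in this sketch is space: 20 of every 64 bytes hold no bits, in exchange for at most one cache miss per rank query.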

Page 26: A x86-optimized rank&select dictionary for bit sequences

Implementation: Practice

26

• Packing the required data into a single cacheline

Page 27: A x86-optimized rank&select dictionary for bit sequences

Implementation: Practice

27

• By the way, what about select?

– Omitted due to time constraints

– Please see the code ...

• Two Implementation Approaches

– O(logN) complexity

• ux-trie, rx, and marisa-trie

• Binary searches with rank

• Suffers many cache/TLB misses

– O(1) complexity

• My implementation, designed to minimize these penalties

• 1 rank, 1 SIMD comparison, and an O(1) bsf (bit scan forward)

• Only 2 cache/TLB misses
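
The O(logN) approach, a binary search over rank, can be sketched as below. This illustrates the general technique only and is not code from ux-trie, rx, or marisa-trie; the function name `select1_by_rank` is hypothetical, and `rank1(x)` is assumed to return the number of 1s among the first x bits. The O(1) approach from the slide is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position of the n-th (1-based) 1 bit, found by binary search over rank.
// `rank1(x)` must return the number of 1s among the first x bits.
// Returns nbits if the sequence contains fewer than n ones.
template <class Rank>
std::size_t select1_by_rank(Rank rank1, std::size_t nbits, std::size_t n) {
    assert(n >= 1);
    if (rank1(nbits) < n) return nbits;
    std::size_t lo = 0, hi = nbits;  // invariant: rank1(lo) < n <= rank1(hi)
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (rank1(mid) < n) lo = mid; else hi = mid;
    }
    return hi - 1;  // the n-th 1 is the hi-th bit, i.e. index hi - 1
}
```

Each probe of the search is a separate rank query at a distant position, which is why this approach suffers many cache/TLB misses on long sequences.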

Page 28: A x86-optimized rank&select dictionary for bit sequences

Implementation: Practice

28

• By the way, what about select?

– Omitted due to time constraints

– Please see the code ...

• Two Implementation Approaches

– O(logN) complexity

• ux-trie, rx, and marisa-trie

• Binary searches with rank

• Suffers many cache/TLB misses

– O(1) complexity

• My implementation, designed to minimize these penalties

• 1 rank, 1 SIMD comparison, and an O(1) bsf (bit scan forward)

• Only 2 cache/TLB misses

Not implemented yet ...