PARALLEL CONSTRUCTION OF SIMULTANEOUS DETERMINISTIC FINITE AUTOMATA ON SHARED-MEMORY MULTICORES

Minyoung Jung1, Jinwoo Park1, Johann Blieberger2 and Bernd Burgstaller1

1Yonsei University, Korea
2Vienna University of Technology, Austria

46th International Conference on Parallel Processing
Bristol, United Kingdom, August 14-17, 2017

Motivation

String pattern matching with finite automata (FAs) is a well-established method across many areas:
  Text editors
  Compiler front-ends
  Internet search engines
  Security and DNA sequence analysis

The sequential FA matching algorithm has linear complexity in the size of the input.

Significant research effort has been spent on parallelizing FA matching to improve on the sequential performance.

FA matching is hard to parallelize due to the dependency between state transitions.

Motivation (cont.)

Limitation of parallel FA matching

[Figure: example DFA]

What is the start state for a thread that begins matching in the middle of the input?

Motivation (cont.)

SFA construction

Simultaneous Finite Automata (SFAs)
  Accumulated state transition information
  Simulates the parallel execution of |Q| DFAs on a single DFA

[Figure: a DFA and the corresponding SFA]

Motivation (cont.)

[Figure panels: Parallel FA matching; Parallel SFA matching]
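The contrast between the two matching schemes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the 3-state transition table, the start state 0 and all function names are assumptions made for the example. Each chunk is simulated from every possible DFA start state at once, which is the information an SFA-state accumulates, so the chunks no longer depend on each other.

```python
# Assumed example DFA: rows are DFA-states, columns are the symbols a, b, c.
DELTA = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 0, "c": 2},
    2: {"a": 2, "b": 2, "c": 2},
}


def run_dfa(text, start=0):
    """Sequential matching: one dependent transition per input symbol."""
    state = start
    for ch in text:
        state = DELTA[state][ch]
    return state


def chunk_mapping(chunk):
    """Simulate the DFA from every start state at once.

    Returns a tuple f with f[q] = state reached from q after reading the chunk.
    """
    mapping = tuple(sorted(DELTA))  # identity mapping (0, 1, 2)
    for ch in chunk:
        mapping = tuple(DELTA[q][ch] for q in mapping)
    return mapping


def match_in_chunks(text, n_chunks, start=0):
    """Chunks are independent (one per thread in a real system); here a plain loop."""
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    mappings = [chunk_mapping(c) for c in chunks]  # independent work
    state = start
    for f in mappings:  # short sequential reduction over the chunk results
        state = f[state]
    return state


text = "abacbabbac"
result_seq = run_dfa(text)
result_par = match_in_chunks(text, 3)
```

Both calls end in the same DFA-state. An SFA precomputes these mappings as its states, so each chunk needs only one table lookup per symbol instead of |Q| of them.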

Motivation (cont.)

[Figure labels: 3 states (DFA); 6 states (SFA)]

Our contributions

1. Introduce fingerprint-based hashing of SFA-states to speed up state comparisons.
2. Provide x86 SIMD-based transposition kernels for SFA-state construction to leverage data-parallelism and cache-locality.
3. Perform in-memory compression of SFA-states to mitigate the space constraints of large problems.
4. Parallelize SFA construction for shared-memory multicores with lock-free synchronization on all data-structures, including thread-local queues supporting work-stealing.

Sequential SFA construction

[Figure: an example DFA and the SFA constructed from it, built up step by step]

Start with the initial state.
Until no more states to process:
  Choose an unprocessed state.
  Insert it into the processed set.
  Iterate with every symbol:
    Generate the next state with the symbol (find new states).
    Update the SFA transition function.
    Check existence and add the new state to the set (set membership test).
Set the initial and the final state.
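The construction steps can be sketched as a worklist algorithm. This is an illustration under stated assumptions rather than the authors' code: the transition table is an assumed 3-state example, an SFA-state is represented as a tuple f with f[q] = DFA-state reached from q, and the initial SFA-state is assumed to be the identity mapping. Final states are left out because the slides do not define them.

```python
from collections import deque

SYMBOLS = ("a", "b", "c")
# Assumed example DFA: rows are DFA-states, columns are the symbols a, b, c.
DELTA = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 0, "c": 2},
    2: {"a": 2, "b": 2, "c": 2},
}


def construct_sfa(delta, symbols):
    initial = tuple(sorted(delta))   # assumed initial state: identity mapping q -> q
    known = {initial}                # every SFA-state discovered so far
    worklist = deque([initial])      # discovered but not yet processed
    transitions = {}                 # (SFA-state, symbol) -> SFA-state
    while worklist:                  # until no more states to process
        state = worklist.popleft()   # choose an unprocessed state
        for sym in symbols:          # iterate with every symbol
            # Generate the next state: apply the symbol to every component.
            nxt = tuple(delta[q][sym] for q in state)
            transitions[(state, sym)] = nxt   # update the SFA transition function
            if nxt not in known:     # set membership test
                known.add(nxt)
                worklist.append(nxt)
    return initial, known, transitions


initial, states, transitions = construct_sfa(DELTA, SYMBOLS)
```

For this 3-state table the loop discovers 6 SFA-states. The set membership test and the generation of next states dominate the work, which is where the hashing and transposition optimizations apply.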

Optimizing SFA construction

Parameterized transposition
Fingerprint-based hashing

Fingerprint-based hashing

Fingerprints: short bit-strings for larger objects (SFA-states)
  CityHash, FarmHash, Rabin’s method, etc. create fingerprints.
  Speed up comparisons of SFA-states: fingerprint comparisons instead of exhaustive SFA-state comparisons.

Fingerprint-collisions
  It follows from the properties of the hash function that if fingerprints are different, the SFA-states are different. No exhaustive comparison is necessary.
  With small probability, different SFA-states generate the same fingerprint (fingerprint-collision).
  If fingerprints are the same, the SFA-states may be the same. An exhaustive comparison is required.

Fingerprint-based hashing (cont.)

Hashing of SFA-states: speeds up lookups and reduces the number of SFA-state comparisons.
  Hash key: fingerprint % size of the hash-table
  Value: (fingerprint, SFA-state)

Hash-collisions: different SFA-states may map to the same hash-key due to the modulo-operation.
  Resolved by closed addressing with chaining

[Figure: hash-table of size 3 with buckets 0, 1 and 2; a hash-collision is resolved by chaining]
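A small sketch of the lookup logic described on these slides. It is hedged in several ways: `hashlib.blake2b` only stands in for a fingerprint function such as CityHash, FarmHash or Rabin's method, SFA-states are assumed to be tuples of DFA-state numbers below 256, and the class and method names are hypothetical.

```python
import hashlib


def fingerprint(state):
    """64-bit fingerprint of an SFA-state (blake2b as a stand-in hash function)."""
    digest = hashlib.blake2b(bytes(state), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class StateTable:
    """Hash-table with chaining; each entry stores (fingerprint, SFA-state)."""

    def __init__(self, size=3):
        self.buckets = [[] for _ in range(size)]
        self.full_comparisons = 0

    def insert_if_absent(self, state):
        """Return True if the state was new, False if it was already stored."""
        fp = fingerprint(state)
        bucket = self.buckets[fp % len(self.buckets)]  # hash key
        for other_fp, other in bucket:
            if other_fp != fp:
                continue  # different fingerprints: the states differ
            self.full_comparisons += 1
            if other == state:  # same fingerprint: exhaustive comparison
                return False
        bucket.append((fp, state))
        return True


table = StateTable(size=3)
candidates = [(0, 1, 2), (1, 1, 2), (0, 1, 2), (0, 0, 2)]
added = [table.insert_if_absent(s) for s in candidates]
```

Entries that share a bucket but differ in their fingerprints are skipped with a single integer comparison. The exhaustive comparison runs only when the fingerprints match.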

Parameterized transposition

Speed up creating the next SFA-states of each SFA-state.

DFA transition table:

     a  b  c
  0  1  0  0
  1  1  0  2
  2  2  2  2

Non-optimized: compute next states one by one.

Optimized: transpose the table according to the DFA-states of the source SFA-state.

Transposed table (one next SFA-state per symbol; the rows correspond to the source SFA-state (0, 2, 0)):

  a  1  2  1
  b  0  2  0
  c  0  2  0
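The two variants can be compared in a short sketch. The transition table is the 3x3 example from the slide. The source SFA-state (0, 2, 0) is inferred from the transposed rows shown there, and the function names are illustrative only.

```python
SYMBOLS = ("a", "b", "c")
# DFA transition table from the slide: rows are DFA-states 0..2, columns a, b, c.
TABLE = [
    [1, 0, 0],
    [1, 0, 2],
    [2, 2, 2],
]


def next_states_one_by_one(table, source):
    """Non-optimized: build each next SFA-state separately, element by element."""
    result = []
    for col in range(len(table[0])):
        result.append(tuple(table[q][col] for q in source))
    return result


def next_states_transposed(table, source):
    """Optimized idea: gather the rows selected by the source SFA-state, then transpose."""
    gathered = [table[q] for q in source]  # row-wise, cache-friendly reads
    return [tuple(column) for column in zip(*gathered)]


source = (0, 2, 0)
by_symbol = dict(zip(SYMBOLS, next_states_transposed(TABLE, source)))
# by_symbol == {"a": (1, 2, 1), "b": (0, 2, 0), "c": (0, 2, 0)}
```

The transposition is "parameterized" by the source SFA-state: its components select which table rows are gathered, and every row of the transposed result is one complete next SFA-state.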

Parameterized transposition (cont.)

x86 SIMD-intrinsics-based transposition kernels

Example transposed transition table
# DFA-states: 17, # symbols: 20

DFA transition table (17x20) -> 20 next SFA-states (20x17)

[Figure: the transposed table is tiled into blocks; labels shown: 8x8, 4x8 and 1x1]
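The tiling can be illustrated with a scalar sketch. The real kernels use x86 SIMD intrinsics. The code below only shows how a 17x20 table decomposes into tiles, assuming tile extents of 8, 4 and 1 chosen greedily per dimension, and the helper names are hypothetical. The inner loops mark the place where a SIMD kernel would transpose one tile in registers.

```python
def split(n, sizes=(8, 4, 1)):
    """Greedily split a dimension of length n into tile extents."""
    parts = []
    for s in sizes:
        while n >= s:
            parts.append(s)
            n -= s
    return parts


def transpose_tiled(table):
    """Transpose tile by tile; returns the result and the tile shapes used."""
    rows, cols = len(table), len(table[0])
    out = [[None] * rows for _ in range(cols)]
    tiles = []
    r0 = 0
    for rh in split(rows):
        c0 = 0
        for cw in split(cols):
            # A SIMD kernel would handle this whole tile at once.
            for r in range(r0, r0 + rh):
                for c in range(c0, c0 + cw):
                    out[c][r] = table[r][c]
            tiles.append((cw, rh))  # tile shape as seen in the transposed table
            c0 += cw
        r0 += rh
    return out, tiles


# 17 DFA-states x 20 symbols, filled with arbitrary DFA-state numbers.
table = [[(r * 20 + c) % 17 for c in range(20)] for r in range(17)]
transposed, tiles = transpose_tiled(table)
```

With these extents the 20x17 result consists of four 8x8 tiles and two 4x8 tiles, plus a remainder column of width 1, which is consistent with the block labels on the slide.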

Work (SFA-state) distribution

Observations:
1) The amount of work changes dynamically. Few states are available at the beginning, but soon all cores are saturated.
2) Switching the work distribution scheme dynamically adapts to the changing load condition and reduces the cache-coherence overhead.

Scheme 1: static distribution via a global queue
  New SFA-states are pushed to the global queue.
  Advantage: avoids coherence-overhead at the front of the queue from work-stealing attempts of idle threads.
  The back of the queue is not contended because initially little work is available.

[Figure: global queue with front and back, accessed by Thread 1 and Thread 2; label: highly contended]

Work (SFA-state) distribution (cont.)

Scheme 2: dynamic distribution via thread-local queues
  Work-stealing: steal work from another thread's queue once the local queue is empty.
  Each work item is popped exactly once by a thread because of lock-free synchronization using the compare-and-swap (CAS) operation.
  Advantage: avoids coherence-overhead from the highly contended back of the global queue.
  Dequeuing SFA-states from other thread-local queues (work-stealing) makes the front of the queue highly contended (cache-coherence overhead) when little work is available.

[Figure: thread-local queues; Thread 1 (owner) and Threads 0 and 2 (thieves) access the same queue; one CAS succeeds, the other CAS fails]
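The "popped exactly once" property can be sketched as follows. Python has no hardware compare-and-swap, so the `AtomicInt` class below emulates CAS with a lock. That lock is only a stand-in, and the sketch is therefore not lock-free itself. The queues are pre-filled and never grow, which is a simplification of the real construction, and all class and function names are hypothetical.

```python
import threading


class AtomicInt:
    """Emulated atomic integer; the lock stands in for a hardware CAS instruction."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


class WorkQueue:
    """Pre-filled queue; owner and thieves claim items at the front via CAS."""

    def __init__(self, items):
        self.items = list(items)
        self.head = AtomicInt(0)

    def try_pop(self):
        while True:
            h = self.head.load()
            if h >= len(self.items):
                return None  # queue is empty
            if self.head.compare_and_swap(h, h + 1):
                return self.items[h]  # CAS succeeded: this thread owns the item
            # CAS failed: another thread claimed items[h] first, so retry.


def worker(my_id, queues, claimed):
    # Drain the local queue first, then steal from the other queues.
    order = [my_id] + [i for i in range(len(queues)) if i != my_id]
    for qid in order:
        while True:
            item = queues[qid].try_pop()
            if item is None:
                break
            claimed[my_id].append(item)


queues = [WorkQueue(range(0, 1000)), WorkQueue(range(1000, 1200)), WorkQueue([])]
claimed = [[] for _ in queues]
threads = [threading.Thread(target=worker, args=(i, queues, claimed))
           for i in range(len(queues))]
for t in threads:
    t.start()
for t in threads:
    t.join()
all_items = sorted(x for c in claimed for x in c)
```

Thread 2 starts with an empty queue and can obtain work only by stealing. Because a failed CAS never returns an item, every item ends up in exactly one thread's `claimed` list.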

In-memory compression

SFA-state compression mitigates the state explosion problem.
Dictionary-based compression shows high compression ratios due to structural properties of FAs: FA-states tend to repeat in SFA-states.
Compression requires additional costly computation, so it is initiated only once a critical memory threshold is reached.

[Figure: an SFA-state of 27 KB is compressed]

In-memory compression (cont.)

Mitigate intractable problem sizes.
Conduct SFA construction in three phases:
  First phase: construct an SFA with uncompressed SFA-states.
  Second phase: compress all generated SFA-states once a critical memory threshold is reached (dictionary-based lossless compression).
  Third phase: resume SFA construction with compressed SFA-states.

[Figure labels: Decompress; Compress; Set membership test]
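The third phase can be sketched with a set that stores only compressed SFA-states. This sketch rests on several assumptions: `zlib` (DEFLATE) stands in for the dictionary-based compressor, which the slides do not name, the 16-bit state layout and the repetitive example state are made up, and the class name is hypothetical. A fingerprint selects the candidates, so decompression is needed only for the exhaustive comparison.

```python
import hashlib
import struct
import zlib


def pack_state(state):
    """Serialize an SFA-state as 16-bit DFA-state numbers (assumed layout)."""
    return struct.pack(f"<{len(state)}H", *state)


def fingerprint(raw):
    return hashlib.blake2b(raw, digest_size=8).digest()


class CompressedStateSet:
    """Set of SFA-states that keeps only compressed representations."""

    def __init__(self):
        self.by_fp = {}  # fingerprint -> list of compressed states

    def add_if_absent(self, state):
        raw = pack_state(state)
        chain = self.by_fp.setdefault(fingerprint(raw), [])
        for blob in chain:
            if zlib.decompress(blob) == raw:  # decompress for the exhaustive comparison
                return False
        chain.append(zlib.compress(raw))      # store the new state compressed
        return True


# Made-up SFA-state with repeating DFA-states: 13824 entries * 2 bytes = 27 KB.
state = tuple(i % 7 for i in range(13824))
raw = pack_state(state)
blob = zlib.compress(raw)
ratio = len(raw) / len(blob)

states = CompressedStateSet()
first = states.add_if_absent(state)
second = states.add_if_absent(state)
```

Because DFA-states repeat within the SFA-state, `ratio` is well above 1 for this input. The extra compression and decompression work in the membership test is the cost noted on the slides, which is why compression starts only after the memory threshold is reached.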

Experimental evaluation

Benchmarks: 1250 patterns from the PROSITE protein database
  Their minimal DFAs are generated by Grail+.
  Patterns that take several days to convert to minimal DFAs are excluded.

The proposed algorithm is implemented in C11 using POSIX threads.
Performance results are obtained with PAPI, which allows accessing hardware performance counters.

Evaluation platforms:
  4-CPU (64 cores) AMD Opteron system
  2-CPU (44 cores, 2 hyperthreads per core) Intel Xeon Broadwell E5-2699 v4 system
  Linux CentOS version 7

Experimental evaluation (cont.)

Speedups of the optimized sequential algorithm over the previous algorithm
  Hashing: max 4.1x on AMD, 3.1x on Intel
  Combination of hashing and transposition: max 6.8x on AMD, 5.2x on Intel

[Figure panels: On the AMD system; On the Intel system]

Experimental evaluation (cont.)

Speedups of parallelization
  Based on our fastest sequential algorithm using hashing and parameterized transposition

[Figure panels: On the AMD system (max. 108.9x); On the Intel system (max. 46.1x)]

Experimental evaluation (cont.)

Performance and size comparison with and w/o compression
  Six benchmarks on the Intel system (four benchmarks are intractable w/o compression; two tractable benchmarks are added for comparison)
  Set our memory manager’s threshold to 200 GB to force compression of the two tractable benchmarks

[Figure: results for the six benchmarks; label: intractable w/o compression]

Conclusion

Introduced fingerprints and hashing to reduce state comparisons and set membership tests.
Parameterized transposition of the transition table ensures cache locality of memory accesses.
A dynamic switch from the global work queue to thread-local queues with work-stealing avoids contention of cache-lines at the front and back of the queue.
Dynamically switch to in-memory compression of SFA-states once they cannot fit into main memory.
Overall speedups including fingerprint-based hashing, parameterized transposition and parallelization without compression are up to 312x on AMD and 193x on Intel.
Compression ratios are up to 30 on the Intel system.

Acknowledgments

This research was supported by:
  the Austrian Science Fund (FWF) project I 1035N23
  the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning under grant NRF2015M3C4A7065522

Thank you!

Q&A