
High-Throughput Subset Matching on

Commodity GPU-Based Systems

Daniele Rogora∗ Michele Papalini$ Koorosh Khazaei∗

Alessandro Margara% Antonio Carzaniga∗ Gianpaolo Cugola%

presented by

Daniele Rogora

%Politecnico di Milano ∗Università della Svizzera italiana $Cisco Systems

Milano Lugano Paris

Italy Switzerland France

EuroSys 2017

1 / 30

Subset Match

Useful in many scenarios

Social networks, Twitter

Data Center management

Service brokering

Cloud 3.0

2 / 30

Example

Subscribers   Tag Set
...
Daniele       {#football, #acmilan}
              {#politics, #Italy}
Antonio       {#politics, #USA}
              {#chomsky}
...

Incoming message: #politics, #USA, #Italy

Incoming message: #acmilan, #closing, #news, #football

3 / 30
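A minimal Python sketch of the example above, assuming plain tag sets and the usual subset test (a subscription matches when all of its tags appear in the message). The function name `matching_subscribers` is illustrative and not part of TagMatch.

```python
# Subscriptions from the example slide: subscriber -> list of tag sets.
subscriptions = {
    "Daniele": [{"#football", "#acmilan"}, {"#politics", "#Italy"}],
    "Antonio": [{"#politics", "#USA"}, {"#chomsky"}],
}

def matching_subscribers(message_tags, subs):
    """Return (subscriber, tag set) pairs whose tag set is a subset of the message."""
    message = set(message_tags)
    return [
        (name, sorted(tags))
        for name, tagsets in subs.items()
        for tags in tagsets
        if tags <= message  # subset test
    ]

msg1 = matching_subscribers({"#politics", "#USA", "#Italy"}, subscriptions)
msg2 = matching_subscribers({"#acmilan", "#closing", "#news", "#football"}, subscriptions)
print(msg1)
print(msg2)
```

The first message reaches both Daniele and Antonio; the second reaches only Daniele, through {#football, #acmilan}.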

Tagsets Representation

Representation of tagsets with Bloom filters

a bitvector of size m

k independent hash functions h1, . . . , hk

hi : Tags → {1, . . . , m}

Example: (k = 2, m = 10)

1 2 3 4 5 6 7 8 9 10

D = {politics, Italy, USA}: each tag sets the bits at positions h1(tag) and h2(tag); in this example five of the ten bits end up set

4 / 30
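A small Python sketch of the encoding, under stated assumptions: the hash family (SHA-256 salted with the function index) is a stand-in chosen for illustration, not the one TagMatch uses, and bit positions are numbered 0..m-1 rather than 1..m.

```python
import hashlib

def bloom_encode(tags, m=10, k=2):
    """Encode a tag set as an m-bit Bloom filter, returned as a Python int."""
    bits = 0
    for tag in tags:
        for i in range(k):
            # Hypothetical hash family: SHA-256 salted with the function index i.
            digest = hashlib.sha256(f"{i}:{tag}".encode()).digest()
            position = int.from_bytes(digest[:8], "big") % m
            bits |= 1 << position
    return bits

def to_bitstring(bits, m=10):
    """Render the filter as a fixed-width string of 0s and 1s."""
    return format(bits, f"0{m}b")

d = bloom_encode({"politics", "Italy", "USA"})
sub = bloom_encode({"politics", "Italy"})
print(to_bitstring(d), to_bitstring(sub))
```

Whatever hash functions are used, a subset of tags always yields a subset of bits, so `sub & ~d == 0` holds here. The converse can fail, which is where false positives come from.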

Example

Subscribers   Bit String
...
k1            1001101000
              0010010011
k2            1001000011
              0000101000
...

Incoming message: 1011010011

5 / 30

Model

Tagset table

Bit String    Keys
1000100000    k2
1010000100    k4, k2
0110100000    k3
0011100010    k6, k2
0010101000    k5, k2
0000100100    k2

Query stream

0110101100

Output

match:         k2, k3, k5, k2
match-unique:  k2, k3, k5

The stream of filters is intense: 6k queries/s

The database is huge: 212M tag sets

6 / 30
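The two operations of the model can be sketched in Python as follows. This is a plain linear scan for illustration only; the names `match`, `match_unique` and `is_subset` are chosen here and do not refer to the TagMatch API.

```python
# Tagset table from the slide: (bit string, keys).
TABLE = [
    ("1000100000", ["k2"]),
    ("1010000100", ["k4", "k2"]),
    ("0110100000", ["k3"]),
    ("0011100010", ["k6", "k2"]),
    ("0010101000", ["k5", "k2"]),
    ("0000100100", ["k2"]),
]

def is_subset(s, q):
    """True if every bit set in s is also set in q (both are ints)."""
    return s & ~q == 0

def match(query, table=TABLE):
    """All keys of all matching tag sets, duplicates included."""
    q = int(query, 2)
    keys = []
    for bits, ks in table:
        if is_subset(int(bits, 2), q):
            keys.extend(ks)
    return keys

def match_unique(query, table=TABLE):
    """Same as match, but each key is reported once."""
    return sorted(set(match(query, table)))

print(match("0110101100"))
print(match_unique("0110101100"))
```

For the query 0110101100 the rows 0110100000, 0010101000 and 0000100100 match, which gives the keys k3, k5, k2, k2 for match and k2, k3, k5 for match-unique, as on the slide.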

A Complex Problem

                                       database size
system                                 20M       40M       212M
MongoDB                                —         —         —
GPU-only, plain                        0.40      0.20      0.04
GPU-only, plain with batching          11.50     6.30      1.20
CPU-only, fast prefix tree             21.10     14.00     4.30
CPU-only, state-of-the-art ICN         27.60     17.40     —
CPU-only, TagMatch                     3.90      3.40      0.68
TagMatch                               268.80    144.40    35.30

(throughput: thousand queries per second)

Rivest, 1976

7 / 30

TagMatch

8 / 30

First Approach: using GPUs

Kernel

Block 0   Block 1   Block 2   . . .   Block n

tagset table: s0, s1, s2, . . . , sn−2, sn−1, sn

q

thread i
    if (si ⊆ q)
        results.add(q)

q0 q1 q2 q3 q4 . . . q255

thread i
    for (q ∈ q0 . . . q255)
        if (si ⊆ q)
            results.add(q)

CPU: launch kernel

CPU: merge matches with keys (results, key table)

This is not fast enough

9 / 30
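The batched kernel can be simulated on the CPU with the Python sketch below. It is only an illustration of the logic: the outer loop stands in for the GPU threads, and recording (query index, tag set index) pairs is an assumption made here so that the CPU can later look up keys (the slide pseudocode only writes results.add(q)).

```python
def subset_kernel(tagsets, batch):
    """Simulate the plain kernel: one logical thread per tag set s_i."""
    results = []
    for i, s in enumerate(tagsets):        # on the GPU: thread i
        for j, q in enumerate(batch):      # loop over the batch q0..q255
            if s & ~q == 0:                # s_i is a subset of q
                results.append((j, i))     # (query index, tag set index)
    return results

tagsets = [0b1000100000, 0b0110100000, 0b0000100100]
batch = [0b0110101100, 0b1000100001]
pairs = subset_kernel(tagsets, batch)
print(pairs)
```

Every thread scans the whole batch against its own tag set, so the work per batch grows linearly with the size of the tagset table.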

Partitioning

lots of filters share many bits...

we could filter out many filters efficiently and quickly...

Bit String    Keys
1000100000    k2
1010100100    k4, k2
0110100000    k3
0011000010    k6, k2
0011101000    k5, k2
0001100100    k2

Query: 0001011100

and we can do that efficiently on the CPU, while preserving batches

10 / 30

Model

input queries (stream), Bloom-filter encoding (⋆ “unique” query)

{@POTUS, energy, policy}     q1 = 010101 · · · 11
{@Chomsky, education}        q2 = 011111 · · · 01
{@ggreenwald, NSA}⋆          q3⋆ = 001110 · · · 11
. . .

pre-process (CPU): partition table

0      none
1      010001 · · · 01 → P1
2      001100 · · · 00 → P2
       001010 · · · 11 → P3
       001011 · · · 01 → P4
3      000101 · · · 10 → P5
. . .
191    . . .

batch1 (P1): . . . , q2
batch2 (P2): . . . , q2, q3
batch3 (P3): . . . , q1, q3
. . .

subset match (GPU): tagset table

P1     011011 · · · 01 ↔ 1
       010101 · · · 11 ↔ 2
       010101 · · · 01 ↔ 3
       . . .
P2     001101 · · · 10 ↔ 62
       001101 · · · 01 ↔ 63
       001100 · · · 11 ↔ 64
       . . .

results1: q2,1, q2,3, . . .
results2: q2,63, q3,71, . . .
results3: q1,324, q3,99, . . .
. . .

key lookup/reduce (CPU): key table

1  → k1, k2
3  → k2, k6, k8
. . .
63 → k5, k8, k13
. . .

merge (CPU): results (stream)

q1  → k3, k13, . . .
q2  → k1, k2, k2, k6, k8, k5, k8, k13, . . .
q3⋆ → k9, k3, k37, k3, k7, . . .
. . .

11 / 30
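The key lookup/reduce stage can be sketched in Python with the entries visible on the slide. The function `reduce_results` and the way "unique" queries are de-duplicated are assumptions for illustration, not the TagMatch implementation.

```python
from collections import defaultdict

# Key table entries visible on the slide: tag set id -> keys.
KEY_TABLE = {1: ["k1", "k2"], 3: ["k2", "k6", "k8"], 63: ["k5", "k8", "k13"]}

def reduce_results(pairs, key_table, unique_queries=frozenset()):
    """Turn (query, tag set id) pairs from the GPU into query -> keys."""
    out = defaultdict(list)
    for query, set_id in pairs:
        out[query].extend(key_table.get(set_id, []))
    for query in unique_queries:
        if query in out:
            # Drop repeated keys while keeping first-seen order.
            out[query] = list(dict.fromkeys(out[query]))
    return dict(out)

pairs = [("q2", 1), ("q2", 3), ("q2", 63)]
merged = reduce_results(pairs, KEY_TABLE)
merged_unique = reduce_results(pairs, KEY_TABLE, unique_queries={"q2"})
print(merged)
print(merged_unique)
```

For q2 this reproduces the key sequence k1, k2, k2, k6, k8, k5, k8, k13 shown in the results stream; marking q2 as "unique" would collapse the repeated k2 and k8.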


Partitioning

12 / 30

Partitioning

Max size: 3

P    Bit String
0    1000100000
     1010000100
     0110100000
     0011100010
     0010101000
     0001101101
     0000110100
     0000110001
     0000010110
     0000001110

P    Bit String
0    1010000100
     0001101101
     0000110100
     0000010110
     0000001110
1    1000100000
     0110100000
     0011100010
     0010101000
     0000110001

P    Mask          Bit String
0    0000100100    0001101101
                   0000110100
1    0000000100    1010000100
                   0000010110
                   0000001110
2    0010100000    0110100000
                   0011100010
                   0010101000
3    0000100000    1000100000
                   0000110001

13 / 30
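One way to reproduce the partitions and masks of this example is sketched below in Python. The splitting rule (pick the bit that divides the set most evenly, lowest position on ties) and the mask rule (bitwise AND of the members) are assumptions that happen to match the slide; the actual TagMatch partitioning may differ.

```python
def partition(strings, max_size, width=10):
    """Recursively split on the most balanced bit until each part fits."""
    if len(strings) <= max_size:
        return [strings]
    best_bit, best_gap = None, None
    for bit in range(width):
        ones = sum(s[bit] == "1" for s in strings)
        if ones == 0 or ones == len(strings):
            continue  # this bit does not split the set
        gap = abs(2 * ones - len(strings))
        if best_gap is None or gap < best_gap:
            best_bit, best_gap = bit, gap
    if best_bit is None:
        return [strings]  # identical strings: cannot split further
    with_bit = [s for s in strings if s[best_bit] == "1"]
    without = [s for s in strings if s[best_bit] == "0"]
    return partition(with_bit, max_size, width) + partition(without, max_size, width)

def mask(part, width=10):
    """Bits shared by every member of the partition (bitwise AND)."""
    common = (1 << width) - 1
    for s in part:
        common &= int(s, 2)
    return format(common, f"0{width}b")

DB = ["1000100000", "1010000100", "0110100000", "0011100010", "0010101000",
      "0001101101", "0000110100", "0000110001", "0000010110", "0000001110"]
parts = partition(DB, max_size=3)
masks = [mask(p) for p in parts]
for m, p in zip(masks, parts):
    print(m, p)
```

A query q only needs to be sent to a partition whose mask is a subset of q, since every tag set in that partition contains all the mask bits.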


Pre-process

14 / 30

Pre Process

front end

thread pool

1st bit    Mask
. . .
2          0010100000 → P2
4          0000100100 → P0
           0000100000 → P3
7          0000000100 → P1
. . .

partition queues: P0, P1, P2, P3, . . . , Pn

GPU scheduler

GPU handlers

q0, q1, q2 are appended to the queues of their partitions

queue holding q0 q1 q2: flush

queues still holding q1, q2: Timeout expired! flush

15 / 30
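The queueing behaviour of the pre-process stage can be sketched in Python as below. The class `PartitionQueues`, the batch size of 3, the timeout value and the three query bit strings are hypothetical choices for illustration; only the four masks come from the slides. Time is passed in explicitly so the example stays deterministic.

```python
class PartitionQueues:
    """Route queries to partition queues; flush on a full batch or a timeout."""

    def __init__(self, masks, batch_size, timeout):
        self.masks = masks                  # partition name -> mask (int)
        self.batch_size = batch_size
        self.timeout = timeout
        self.queues = {p: [] for p in masks}
        self.oldest = {}                    # partition -> arrival time of oldest query
        self.flushed = []                   # (partition, batch) handed to the GPU

    def _flush(self, p):
        self.flushed.append((p, self.queues[p]))
        self.queues[p] = []
        self.oldest.pop(p, None)

    def submit(self, name, q, now):
        for p, mask in self.masks.items():
            if mask & ~q == 0:              # mask is a subset of q
                self.queues[p].append(name)
                self.oldest.setdefault(p, now)
                if len(self.queues[p]) == self.batch_size:
                    self._flush(p)

    def tick(self, now):
        for p in list(self.oldest):
            if now - self.oldest[p] >= self.timeout:
                self._flush(p)

masks = {"P0": 0b0000100100, "P1": 0b0000000100,
         "P2": 0b0010100000, "P3": 0b0000100000}
pq = PartitionQueues(masks, batch_size=3, timeout=5)
pq.submit("q0", 0b0110101100, now=0)
pq.submit("q1", 0b0000100000, now=1)
pq.submit("q2", 0b0000100100, now=2)   # fills the queue of P3
pq.tick(now=10)                        # timeout flushes the partial batches
print(pq.flushed)
```

The full batch goes out as soon as it is complete, while the timeout bounds how long a query can wait in a queue that fills slowly.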

Optimization

16 / 30

GPU Optimization

Kernel q0 q1 q2 q3 q4 . . . q255

tagset table: s0, s1, s2, . . . , sn−2, sn−1, sn

Block 0
t255  | 1110010100
. . . | . . .
t2    | 1110100000
t1    | 1110110000
t0    | 1110110110

Block 1
t255  | 0011101101
. . . | . . .
t2    | 0101101011
t1    | 0110001110
t0    | 0110010110

Phase 1

Thread 0 (threads 1, 2, 3, . . . , n idle):
first = 1110110110
last = 1110010100
first ⊕ last = 0000100010
common prefix = 1110000000

Thread i: prefix ⊆ qi?

Q = q1 q3 q21 q0 q200 q177

Thread i:
    for (qi ∈ Q)
        if (f ⊆ qi)
            results.add(qi)

17 / 30
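The common-prefix computation can be checked with a short Python sketch. It assumes, as the block listings suggest, that the tag sets of a block are sorted, so every member shares the leading bits on which the first and last members agree. The function names and the sample queries are illustrative.

```python
WIDTH = 10

def common_prefix(first, last, width=WIDTH):
    """Leading bits shared by first and last, zero-padded to width bits."""
    diff = first ^ last
    if diff == 0:
        return first
    keep = width - diff.bit_length()        # number of leading bits that agree
    prefix_mask = ((1 << keep) - 1) << (width - keep)
    return first & prefix_mask

def prefilter(block, queries):
    """Keep only the queries that contain the block's common prefix."""
    prefix = common_prefix(block[0], block[-1])
    return [q for q in queries if prefix & ~q == 0]

block0 = [0b1110110110, 0b1110110000, 0b1110100000, 0b1110010100]
prefix = common_prefix(block0[0], block0[-1])
kept = prefilter(block0, [0b1111111111, 0b0111111111, 0b1110010100])
print(format(prefix, "010b"), [format(q, "010b") for q in kept])
```

A query that lacks any bit of the prefix cannot contain any tag set of the block, so dropping it before the per-thread loop loses no matches.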

Workflow Optimization

Result buffer on GPU and on CPU: Size | Data (e.g. Size 3, Data q7,q21,q1)

Two copies: run kernel, copy res size, sync, copy res data, sync, process res

One copy: run kernel, copy all res, sync, process res

Two result buffers (e.g. q7,q21,q1 and q207,q17), used alternately: run kernel, copy res, sync, process res; run kernel, copy res, sync, process res

18 / 30

Evaluation

1 single machine

24 (48) physical (virtual) CPU cores

2 Nvidia Titan X

19 / 30

Scalability

Does it scale with bigger databases?

[Figure: throughput (thousand queries/s, log scale, 1 to 100) vs. database size (20 to 100% of the full Twitter database). Series: TagMatch, match; TagMatch, match-unique; prefix tree, match; prefix tree, match-unique.]

20 / 30

Threads

Does it scale with bigger machines?

[Figure: throughput (thousand queries/s, 0 to 50) vs. number of threads (8 to 48). Legend as extracted: prefix tree, match; prefix tree, match-unique. Annotation: GPU limit!]

21 / 30

Latency

Does batching kill latency?

[Figure: latency (s, 0 to 4) vs. timeout (ms: 200, 400, 600, 800, no limit). Shown: 1%, 25%, median, 75%, 99%, maximum.]

22 / 30

Memory usage

How much memory does it need?

[Figure: memory usage (GB, 5 to 30); x-axis ticks 0 to 100. Series: GPU, I/O buffers; GPU, tagset table; Host.]

23 / 30

Conclusion

subset matching
◮ computationally complex
◮ highly parallelizable

TagMatch
◮ implements an efficient CPU/GPU pipeline

https://github.com/carzaniga/TagMatch

24 / 30


Partition size

[Figure: throughput (thousand queries/s, 0 to 40) vs. MAXP: maximum size of partitions (thousands, 0 to 900). Series: match; match-unique.]

26 / 30

MongoDB

[Figure: throughput (queries/s, log scale, 10^-1 to 10^6) vs. number of tags per query (4 to 10). Series: TagMatch 1M, TagMatch 3M, TagMatch 5M, MongoDB 1M, MongoDB 3M, MongoDB 5M.]

27 / 30

Partitioning time

[Figure: time (s, 0 to 50); x-axis ticks 10 to 100. Series: balanced partitioning.]

28 / 30

More tags

[Figure, left panel: throughput (thousand queries/s, log scale, 0.1 to 1000) vs. number of additional tags per query (0 to 9). Series: TagMatch; prefix tree.]

[Figure, right panel: output throughput (thousand keys/s, log scale, 100 to 100000) vs. number of additional tags per query (0 to 9). Series: TagMatch; prefix tree.]

29 / 30

Descriptors Representation

hi : Tags → {1, . . . , m}

Example: (k = 2, m = 10)

Concretely, in our implementation: m = 192, k = 7

False positives: testing S1 ⊆ S2 with Bloom filters gives a false positive with probability $\left(1 - e^{-k|S_2|/m}\right)^{k\,|S_1 \setminus S_2|}$

For example, when |S2| = 10 and |S1 \ S2| = 3, we have a false positive with probability $10^{-11}$

30 / 30
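The numeric example can be checked with a few lines of Python, reading the formula as (1 − e^(−k|S2|/m)) raised to the power k|S1 \ S2|. The function and parameter names are chosen here for illustration.

```python
import math

def false_positive_probability(m, k, size_s2, extra_in_s1):
    """Probability that all k * |S1 minus S2| extra bits are already set in S2's filter."""
    # Approximate fraction of bits set after inserting |S2| tags with k hashes each.
    p_bit_set = 1.0 - math.exp(-k * size_s2 / m)
    return p_bit_set ** (k * extra_in_s1)

p = false_positive_probability(m=192, k=7, size_s2=10, extra_in_s1=3)
print(f"{p:.2e}")
```

With m = 192, k = 7, |S2| = 10 and |S1 \ S2| = 3 the result is on the order of 10^-11, in line with the figure quoted on the slide.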




