High-Throughput Subset Matching on
Commodity GPU-Based Systems
Daniele Rogora∗, Michele Papalini$, Koorosh Khazaei∗,
Alessandro Margara%, Antonio Carzaniga∗, Gianpaolo Cugola%
presented by
Daniele Rogora
% Politecnico di Milano, Milano, Italy
∗ Università della Svizzera italiana, Lugano, Switzerland
$ Cisco Systems, Paris, France
EuroSys 2017
1 / 30
Subset Match
Useful in many scenarios
Social networks, Twitter
Data Center management
Service brokering
Cloud 3.0
2 / 30
Example
Subscribers   Tag Set
...
Daniele       {#football, #acmilan}
              {#politics, #Italy}
Antonio       {#politics, #USA}
              {#chomsky}
...
Queries:
{#politics, #USA, #Italy}
{#acmilan, #closing, #news, #football}
3 / 30
Tagsets Representation
Representation of tagsets with Bloom filters
a bitvector of size m
k independent hash functions h_1, ..., h_k
h_i : Tags → {1, ..., m}
Example: (k = 2, m = 10)
D = {politics, Italy, USA}: h_1 and h_2 set five of the ten bits to 1; the remaining five stay 0
4 / 30
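The encoding on this slide can be sketched in a few lines of Python. This is only an illustration: the deck does not say which hash functions are used, so the salted SHA-256 below is an assumption, as are the helper names `bit_positions`, `encode`, and `is_subset`. The values m = 192 and k = 7 are the ones quoted in the deck's backup slide, and positions are numbered from 0 here rather than from 1.

```python
import hashlib

M = 192  # filter width in bits (value quoted in the deck's backup slide)
K = 7    # number of hash functions (same source)


def bit_positions(tag, m=M, k=K):
    """Return k bit positions in [0, m) for a tag.

    Salted SHA-256 is an assumption made for this sketch only.
    """
    positions = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{tag}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions


def encode(tags, m=M, k=K):
    """Bloom-filter encoding of a tag set, returned as an integer bit vector."""
    bits = 0
    for tag in tags:
        for pos in bit_positions(tag, m, k):
            bits |= 1 << pos
    return bits


def is_subset(s, q):
    """True when every bit set in s is also set in q."""
    return s & ~q == 0


subscription = encode({"politics", "Italy"})
message = encode({"politics", "Italy", "USA"})
print(is_subset(subscription, message))  # a true subset always passes the test
```

Because the filter of a subset can only set bits that the superset also sets, the bitwise test never misses a real match; it can only report a false positive.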
Example
Subscribers   Bit String
...
k1            1001101000
              0010010011
k2            1001000011
              0000101000
...
Query: 1011010011
5 / 30
Model
Tagset table
Bit String    Keys
1000100000    k2
1010000100    k4, k2
0110100000    k3
0011100010    k6, k2
0010101000    k5, k2
0000100100    k2
Query stream: 0110101100
Output (match): k2, k3, k5, k2
Output (match-unique): k2, k3, k5
The stream of filters is intense: 6k queries/s
The database is huge: 212M tag sets
6 / 30
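The two output modes can be reproduced with a direct scan of the table above. The following Python sketch is a plain reference implementation, not TagMatch itself; the function names `match` and `match_unique` simply mirror the labels on the slide.

```python
# Tagset table from the Model slide: (bit string, keys).
TABLE = [
    ("1000100000", ["k2"]),
    ("1010000100", ["k4", "k2"]),
    ("0110100000", ["k3"]),
    ("0011100010", ["k6", "k2"]),
    ("0010101000", ["k5", "k2"]),
    ("0000100100", ["k2"]),
]


def is_subset(s, q):
    """True when every bit set in s is also set in q."""
    return s & ~q == 0


def match(query, table=TABLE):
    """Keys of every matching tagset, duplicates included."""
    q = int(query, 2)
    keys = []
    for bits, row_keys in table:
        if is_subset(int(bits, 2), q):
            keys.extend(row_keys)
    return keys


def match_unique(query, table=TABLE):
    """Keys of every matching tagset, each key reported once."""
    return sorted(set(match(query, table)))


print(match("0110101100"))         # ['k3', 'k5', 'k2', 'k2']
print(match_unique("0110101100"))  # ['k2', 'k3', 'k5']
```

The scan reports keys in table order, so `match` returns the same multiset as the slide (k2 twice, k3, k5) in a different order.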
A Complex Problem
                                    database size
system                              20M      40M      212M
MongoDB                             —        —        —
GPU-only, plain                     0.40     0.20     0.04
GPU-only, plain with batching       11.50    6.30     1.20
CPU-only, fast prefix tree          21.10    14.00    4.30
CPU-only, state-of-the-art ICN      27.60    17.40    —
CPU-only, TagMatch                  3.90     3.40     0.68
TagMatch                            268.80   144.40   35.30
(throughput: thousand queries per second)
Rivest, 1976
7 / 30
TagMatch
8 / 30
First Approach: using GPUs
The tagset table s0, s1, ..., sn is spread over the thread blocks (Block 0 ... Block n) of one kernel; thread i handles si.
Single query q:
thread i
  if (si ⊆ q)
    results.add(q)
Batch of queries q0 ... q255:
thread i
  for (q ∈ q0 ... q255)
    if (si ⊆ q)
      results.add(q)
CPU: launch kernel
CPU: merge matches (results) with keys (key table)
This is not fast enough
                                    database size
system                              20M      40M      212M
GPU-only, plain                     0.40     0.20     0.04
GPU-only, plain with batching       11.50    6.30     1.20
(throughput: thousand queries per second)
9 / 30
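The batched variant of this first approach can be emulated on the CPU to make the data flow concrete. The Python sketch below is an assumption-laden illustration: the names `gpu_kernel` and `merge_with_keys` are hypothetical, and recording each match as a (query index, tagset index) pair is a choice made here so that the CPU merge step has something to look up in the key table.

```python
def gpu_kernel(tagsets, batch):
    """Emulate the kernel: 'thread' i owns tagset s_i and scans the whole batch."""
    results = []
    for i, s in enumerate(tagsets):       # one iteration per GPU thread
        for j, q in enumerate(batch):     # the deck uses batches of 256 queries
            if s & ~q == 0:               # s_i is a subset of q_j
                results.append((j, i))    # (query index, tagset index): assumed layout
    return results


def merge_with_keys(results, key_table, batch_size):
    """CPU step: translate tagset indices into keys, grouped by query."""
    merged = [[] for _ in range(batch_size)]
    for j, i in results:
        merged[j].extend(key_table[i])
    return merged


tagsets = [0b1000100000, 0b0110100000, 0b0000100100]
keys = [["k2"], ["k3"], ["k2"]]
batch = [0b0110101100, 0b1000100000]

pairs = gpu_kernel(tagsets, batch)
print(pairs)                                     # [(1, 0), (0, 1), (0, 2)]
print(merge_with_keys(pairs, keys, len(batch)))  # [['k3', 'k2'], ['k2']]
```

Every thread still compares its tagset against every query in the batch, which is the cost the partitioning stage later tries to avoid.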
Partitioning
lots of filters share many bits...
we could filter out many filters efficiently and quickly...
Bit String    Keys
1000100000    k2
1010100100    k4, k2
0110100000    k3
0011000010    k6, k2
0011101000    k5, k2
0001100100    k2
Query: 0001011100
and we can do that efficiently on the CPU, while preserving batches
10 / 30
Model
Input queries (stream): {@POTUS, energy, policy}, {@Chomsky, education}, {@ggreenwald, NSA}⋆, ...
(⋆ "unique" query)
Bloom-filter encoding:
  q1  = 010101···11
  q2  = 011111···01
  q3⋆ = 001110···11
Pre-process (CPU), partition table:
  0    none
  1    010001···01 → P1
  2    001100···00 → P2, 001010···11 → P3, 001011···01 → P4
  3    000101···10 → P5
  ...
  191  ...
  batch1 for P1: ..., q2
  batch2 for P2: ..., q2, q3
  batch3 for P3: ..., q1, q3
Subset match (GPU), tagset table:
  P1: 011011···01 ↔ 1, 010101···11 ↔ 2, 010101···01 ↔ 3, ...
  P2: 001101···10 ↔ 62, 001101···01 ↔ 63, 001100···11 ↔ 64, ...
  results1: (q2, 1), (q2, 3), ...
  results2: (q2, 63), (q3, 71), ...
  results3: (q1, 324), (q3, 99), ...
Key lookup/reduce (CPU), key table:
  1  → k1, k2
  3  → k2, k6, k8
  63 → k5, k8, k13
Merge (CPU), results (stream):
  q1  → k3, k13, ...
  q2  → k1, k2, k2, k6, k8, k5, k8, k13, ...
  q3⋆ → k9, k3, k37, k3, k7, ...
11 / 30
Partitioning
12 / 30
Partitioning
Max size: 3
Start (a single partition):
P   Bit String
0   1000100000
    1010000100
    0110100000
    0011100010
    0010101000
    0001101101
    0000110100
    0000110001
    0000010110
    0000001110
After the first split:
P   Bit String
0   1010000100
    0001101101
    0000110100
    0000010110
    0000001110
1   1000100000
    0110100000
    0011100010
    0010101000
    0000110001
After splitting both halves again, with the mask of each partition:
P   Mask         Bit String
0   0000100100   0001101101
                 0000110100
1   0000000100   1010000100
                 0000010110
                 0000001110
2   0010100000   0110100000
                 0011100010
                 0010101000
3   0000100000   1000100000
                 0000110001
13 / 30
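The slides show the result of partitioning but not the rule used to pick the splitting bit. The Python sketch below assumes a simple greedy rule (split on the bit that divides the set most evenly, lowest position on ties) and takes the mask of a partition to be the bitwise AND of its members. On the ten example filters this assumed rule happens to reproduce the four partitions and masks of the final slide; the real TagMatch algorithm may differ.

```python
M = 10  # filter width used in the slide example


def bit(f, pos):
    """Bit at 1-based position pos, counting from the left."""
    return (f >> (M - pos)) & 1


def partition(filters, max_size):
    """Recursively split on the most balanced bit (assumed rule, see lead-in)."""
    if len(filters) <= max_size:
        return [filters]
    half = len(filters) / 2
    pivot = min(range(1, M + 1),
                key=lambda p: abs(sum(bit(f, p) for f in filters) - half))
    ones = [f for f in filters if bit(f, pivot)]
    zeros = [f for f in filters if not bit(f, pivot)]
    if not ones or not zeros:  # no bit separates the set; stop here
        return [filters]
    return partition(ones, max_size) + partition(zeros, max_size)


def mask(part):
    """Bits shared by every filter in the partition (bitwise AND)."""
    m = (1 << M) - 1
    for f in part:
        m &= f
    return m


FILTERS = [int(b, 2) for b in (
    "1000100000", "1010000100", "0110100000", "0011100010", "0010101000",
    "0001101101", "0000110100", "0000110001", "0000010110", "0000001110",
)]

parts = partition(FILTERS, max_size=3)
for number, part in enumerate(parts):
    print(number, format(mask(part), "010b"), [format(f, "010b") for f in part])
```

A query can only match a filter in a partition if the partition's mask is a subset of the query, which is what makes the masks useful as a cheap CPU-side pre-filter.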
Pre-process
14 / 30
Pre Process
Components: front end, thread pool, partition queues (P0, P1, P2, P3, ..., Pn), GPU handlers, GPU scheduler
Front-end table:
1st bit   Mask
...
2         0010100000 → P2
4         0000100100 → P0
          0000100000 → P3
7         0000000100 → P1
...
Queries q0, q1, q2 are appended to the queues of the partitions they are routed to.
A queue is flushed to the GPU as a batch, either when it fills up or when its timeout expires ("Timeout expired!").
15 / 30
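The routing and flushing behaviour described on this slide can be sketched as follows. This Python version is a simplification under stated assumptions: the class `PreProcessor` and its methods are hypothetical names, masks are scanned linearly instead of being indexed by their first set bit, the thread pool is omitted, and time is passed in explicitly so the example is deterministic. The four masks are the ones from the partitioning example.

```python
class PreProcessor:
    """Route queries to partition queues and flush full or expired batches."""

    def __init__(self, masks, batch_size=256, timeout=0.2):
        self.masks = masks              # partition id -> mask
        self.batch_size = batch_size    # 256 in the deck's GPU slides
        self.timeout = timeout          # seconds; an assumed default
        self.queues = {p: [] for p in masks}
        self.oldest = {}                # partition id -> arrival time of first queued query
        self.flushed = []               # (partition id, batch) handed to the GPU

    def submit(self, q, now):
        for p, m in self.masks.items():
            if m & ~q == 0:             # mask is a subset of q: partition may match
                if not self.queues[p]:
                    self.oldest[p] = now
                self.queues[p].append(q)
                if len(self.queues[p]) >= self.batch_size:
                    self.flush(p)

    def flush(self, p):
        if self.queues[p]:
            self.flushed.append((p, self.queues[p]))
            self.queues[p] = []
            self.oldest.pop(p, None)

    def tick(self, now):
        """Flush every queue whose oldest query has waited longer than the timeout."""
        for p in list(self.oldest):
            if now - self.oldest[p] >= self.timeout:
                self.flush(p)


MASKS = {0: 0b0000100100, 1: 0b0000000100, 2: 0b0010100000, 3: 0b0000100000}

pre = PreProcessor(MASKS, batch_size=2)
pre.submit(0b0001011100, now=0.0)   # only P1's mask is contained in this query
pre.submit(0b0110101100, now=0.1)   # contained by all four masks; P1 fills up and flushes
pre.tick(now=0.5)                   # the remaining queues time out
print([p for p, _ in pre.flushed])  # [1, 0, 2, 3]
```

The timeout bounds how long a query can wait in a sparsely used partition queue, at the cost of sending smaller batches to the GPU.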
Optimization
16 / 30
GPU Optimization
Kernel over a batch q0 q1 q2 q3 q4 ... q255; the tagset table s0, s1, ..., sn is divided among Block 0 ... Block n.
Block 0               Block 1
t0   | 1110110110     t0   | 0110010110
t1   | 1110110000     t1   | 0110001110
t2   | 1110100000     t2   | 0101101011
...  | ...            ...  | ...
t255 | 1110010100     t255 | 0011101101
Phase 1 (thread 0 only, the other threads are idle):
  first = 1110110110
  last = 1110010100
  first ⊕ last = 0000100010
  common prefix = 1110000000
Phase 2 (thread i tests query qi):
  prefix ⊆ qi?
  Q = q1 q3 q21 q0 q200 q177 (the queries that pass)
Phase 3 (every thread, with its own filter f):
  for (qi ∈ Q)
    if (f ⊆ qi)
      results.add(qi)
17 / 30
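The three phases can be emulated sequentially to check the arithmetic on the slide. The Python sketch below assumes that the filters inside a block are sorted, so that the first and last filter bound the common prefix of the whole block; `common_prefix` and `block_match` are hypothetical names, and the (query index, filter) result layout is chosen here for illustration.

```python
M = 10  # filter width used in the slide example


def common_prefix(first, last):
    """Bits shared by first and last before their first differing position."""
    diff = first ^ last
    if diff == 0:
        return first
    keep = ((1 << M) - 1) & ~((1 << diff.bit_length()) - 1)
    return first & keep


def block_match(block, queries):
    """Emulate one thread block working on a sorted run of filters."""
    prefix = common_prefix(block[0], block[-1])          # phase 1: thread 0
    survivors = [j for j, q in enumerate(queries)        # phase 2: one query per thread
                 if prefix & ~q == 0]
    return [(j, f) for f in block for j in survivors     # phase 3: one filter per thread
            if f & ~queries[j] == 0]


block = [0b1110110110, 0b1110110000, 0b1110100000, 0b1110010100]
queries = [0b1110110111, 0b0110111111, 0b1111110000]

print(format(common_prefix(block[0], block[-1]), "010b"))  # 1110000000
print(block_match(block, queries))
```

Queries that do not contain the block's common prefix (the second query above) are discarded once per block instead of once per filter.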
Workflow Optimization
Baseline: run kernel, copy result size, sync, copy result data, sync, process results
Alternative: run kernel, copy all results, sync, process results
Optimized: run kernel, copy results, sync, process results; each following kernel repeats the same three steps with a single sync
Buffers (Size | Data) in the optimized scheme:
  GPU: 3 | q207, q17 and 2 | q7, q21, q1
  CPU after one copy: 3 | q207, q17
  Each buffer stores its own data next to the size of the other buffer's result, so a single copy delivers one result set together with the size of the next one.
18 / 30
Evaluation
a single machine
24 physical (48 virtual) CPU cores
2 NVIDIA Titan X GPUs
19 / 30
Scalability
Does it scale with bigger databases?
[Figure: throughput (thousand queries/s, log scale from 1 to 100) vs. database size (20% to 100% of the full Twitter database); series: TagMatch match, TagMatch match-unique, prefix tree match, prefix tree match-unique]
20 / 30
Threads
Does it scale with bigger machines?
[Figure: throughput (thousand queries/s, 0 to 50) vs. number of threads (8 to 48); series: TagMatch match, TagMatch match-unique, prefix tree match, prefix tree match-unique; annotation: "GPU limit!"]
21 / 30
Latency
Does batching kill latency?
[Figure: latency (s, 0 to 4) vs. timeout (200, 400, 600, 800 ms, no limit); shown: 1%, 25%, median, 75%, 99% percentiles and maximum]
22 / 30
Memory usage
How much memory does it need?
[Figure: memory usage (GB, 5 to 30) vs. database size (0% to 100% of the full Twitter database); series: GPU I/O buffers, GPU tagset table, Host]
23 / 30
Conclusion
subset matching
  ◮ computationally complex
  ◮ highly parallelizable
TagMatch
  ◮ implements an efficient CPU/GPU pipeline
https://github.com/carzaniga/TagMatch
24 / 30
Partition size
[Figure: throughput (thousand queries/s, 0 to 40) vs. MAXP, the maximum size of partitions (0 to 900 thousand); series: match, match-unique]
26 / 30
MongoDB
[Figure: throughput (queries/s, log scale from 10^-1 to 10^6) vs. number of tags per query (4 to 10); series: TagMatch 1M, TagMatch 3M, TagMatch 5M, MongoDB 1M, MongoDB 3M, MongoDB 5M]
27 / 30
Partitioning time
[Figure: time (s, 0 to 50) vs. database size (10% to 100% of the full Twitter database); series: balanced partitioning]
28 / 30
More tags
[Figure, panel 1: throughput (thousand queries/s, log scale from 0.1 to 1000) vs. number of additional tags per query (0 to 9); series: TagMatch, prefix tree]
[Figure, panel 2: output throughput (thousand keys/s, log scale from 100 to 100000) vs. number of additional tags per query (0 to 9); series: TagMatch, prefix tree]
29 / 30
Descriptors Representation
Representation of tagsets with Bloom filters
a bitvector of size m
k independent hash functions h_1, ..., h_k
h_i : Tags → {1, ..., m}
Example: (k = 2, m = 10)
D = {politics, Italy, USA}: h_1 and h_2 set five of the ten bits to 1
Concretely, in our implementation: m = 192, k = 7
False positives: testing S_1 ⊆ S_2 with Bloom filters gives a false positive with probability
$\left(1 - e^{-k|S_2|/m}\right)^{k\,|S_1 \setminus S_2|}$
For example, when |S_2| = 10 and |S_1 \ S_2| = 3, we have a false positive with probability $10^{-11}$
30 / 30
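The worked example on this slide can be checked numerically. The short Python sketch below evaluates the false-positive expression with the parameters given on the slide; the function name and argument names are chosen here for illustration.

```python
import math


def false_positive_probability(m, k, size_s2, size_diff):
    """(1 - exp(-k*|S2|/m)) ** (k*|S1 \\ S2|), the expression on the slide."""
    return (1.0 - math.exp(-k * size_s2 / m)) ** (k * size_diff)


p = false_positive_probability(m=192, k=7, size_s2=10, size_diff=3)
print(f"{p:.1e}")  # on the order of 1e-11, as stated on the slide
```

The base is the probability that one given bit is set in the filter of S_2, and the exponent counts the k bits contributed by each of the tags that S_1 has and S_2 lacks.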