RStream: Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine
Kai Wang¹, Zhiqiang Zuo², John Thorpe¹, Tien Quang Nguyen³, Guoqing Harry Xu¹
¹UCLA ²Nanjing University ³Facebook
Big Graph
• Graph Datasets
• Graph Systems: GraphChi, GridGraph
Graph Analytical Problems
• Graph Computation
- PageRank
- Connected Component
- Iterative value computation
- GraphChi: Think Like a Vertex
• Graph Mining
- Frequent Subgraph Mining
- Clique Finding
- Discover structural patterns
- ?
Existing Mining Systems
• Enumerate all possible subgraphs
• For each subgraph, check if it matches the pattern
• Pattern is application-specific (Clique finding, motif counting, frequent subgraph mining)
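The enumerate-then-check loop above can be sketched as follows. This is a deliberately naive illustration, not code from any of the systems discussed: the function names `mine` and `is_clique` are hypothetical, and it enumerates every k-vertex subset rather than only connected subgraphs.

```python
from itertools import combinations

def is_clique(vertices, adj):
    """Application-specific pattern check: every pair of vertices is adjacent."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))

def mine(edges, k, matches):
    """Enumerate every k-vertex subset and keep the ones matching the pattern."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return [s for s in combinations(sorted(adj), k) if matches(s, adj)]

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
triangles = mine(edges, 3, is_clique)
print(triangles)
```

Swapping `is_clique` for another predicate changes the application without touching the enumeration, which is the split the bullets describe.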
Existing Datalog Systems
• Relational predicates
- TC(a, b, c) :- R(a, b), a < b, R(b, c), b < c, R(c, a)
- count TC(a, b, c)
• Relational algebra enables composition of small structures into big structures
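As a rough illustration of how the TC rule composes edges into triangles, the sketch below evaluates it with plain Python sets. It assumes R holds each undirected edge in both directions; `triangle_closure` is a hypothetical name, not part of any Datalog engine.

```python
def triangle_closure(edges):
    """Evaluate TC(a, b, c) :- R(a, b), a < b, R(b, c), b < c, R(c, a)."""
    # Assumption: R stores every undirected edge in both directions.
    R = set(edges) | {(v, u) for u, v in edges}
    by_first = {}
    for a, b in R:
        by_first.setdefault(a, []).append(b)
    tc = set()
    for a, b in R:
        if not a < b:
            continue
        # Join R(a, b) with R(b, c) on b, then check the closing edge R(c, a).
        for c in by_first.get(b, []):
            if b < c and (c, a) in R:
                tc.add((a, b, c))
    return tc

tc = triangle_closure([(1, 2), (2, 3), (1, 3), (3, 4)])
print(len(tc))  # count TC(a, b, c)
```

The ordering constraints a < b < c make each triangle appear exactly once, so the count needs no extra deduplication.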
Challenges in Graph Mining
• # of subgraphs grows exponentially with the size of subgraphs
[Figure: # of subgraphs vs. size of subgraphs. Size 1: 4k, 2: 22k, 3: 335k, 4: 7.8M, 5: 117M, 6: 1.7B]
Arabesque [CHC Teixeira et al., SOSP’15]
Problems with Distributed Mining Systems
• Suffer from large startup and communication overhead
- Arabesque on a 10-node cluster: 35s startup, 3s execution
• Need enterprise clusters with large amounts of memory
- DistGraph on a 128-node cluster with 32,768GB of memory
• Poor load balancing due to dynamic working sets
- Some nodes run out of memory while other nodes have memory usage < 10%
Problems with Datalog Systems
• Programming model is not expressive enough for complex graph mining algorithms
Thoughts and Insight
• Not all users have access to an enterprise cluster
• Many users are domain experts with limited background in hosting a cluster
• Drawbacks of distributed mining systems: large startup overhead, underutilized CPUs, poor load balancing
Increasingly large SSDs
Our Proposal: RStream
A single-machine, out-of-core graph mining system
• A simple and expressive API
- Gather-Apply-Scatter + Relational Algebra => GRAS
• An efficient runtime engine
- Implements relational algebra with streaming
GAS
• Gather information from neighbor vertices
• Apply and update the vertex property
• Scatter information to neighbor vertices
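A minimal sketch of the three GAS steps, using connected components by minimum-label propagation as the iterative computation. The function name and the synchronous round structure are assumptions for illustration, not the API of GraphChi or RStream.

```python
def gas_connected_components(vertices, edges):
    """Label every vertex with the smallest vertex ID in its component."""
    label = {v: v for v in vertices}
    neighbors = {v: [] for v in vertices}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    active = set(vertices)
    while active:
        next_active = set()
        new_label = dict(label)
        for v in active:
            # Gather: read the labels of neighbor vertices.
            gathered = [label[n] for n in neighbors[v]]
            # Apply: update the vertex property if a smaller label was seen.
            best = min(gathered + [label[v]])
            if best < label[v]:
                new_label[v] = best
                # Scatter: notify neighbors so they run in the next round.
                next_active.update(neighbors[v])
        label = new_label
        active = next_active
    return label

labels = gas_connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
print(labels)
```

The loop stops once a round changes no label, which is the usual convergence test for this style of iterative value computation.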
GRAS
• GAS: supports iterative graph processing
• Relational Algebra: enables composition of structures
• GRAS: iterative composition of structures
Edge Streaming
• Use streaming to reduce I/O costs
• Sequentially access (larger) datasets from disk, randomly access (smaller) datasets held in memory
X-Stream [A Roy et al., SOSP’13]
A graph is partitioned into streaming partitions. Each streaming partition contains a vertex table, an edge table, and an update table:
Vertex Table
VID  Value
1    1
2    2
Edge Table
Src  Dest
1    4
2    5
Update Table
Value  Dest
1      4
2      5
Streaming for Scatter/Gather
Scatter (Streaming Partition 1)
- Load: Vertex Table (ID, value): (1, a), (2, b)
- Streaming: Edge Table (src, dest): (1, 2), (2, 5)
- Output: Update Table (value, dest): (a, 2), (b, 5)
- Shuffle: (a, 2) goes to Streaming Partition 1, (b, 5) goes to Streaming Partition 2
Gather/Apply (Streaming Partition 1)
- Load: Vertex Table (ID, value): (1, a), (2, b)
- Streaming: Update Table (value, dest): (a, 2)
- Result: vertex 2 gets value a+b
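The scatter and gather/apply passes above can be mimicked in a few lines. This is a sketch under stated assumptions: in-memory lists stand in for on-disk tables, `PARTITION_SIZE` and `owner` are a hypothetical range partitioning, and the numbers 10 and 20 play the roles of a and b.

```python
from collections import defaultdict

PARTITION_SIZE = 3  # hypothetical: partition p owns vertex IDs 3p+1 .. 3p+3

def owner(vid):
    """Streaming partition that owns a vertex ID."""
    return (vid - 1) // PARTITION_SIZE

def scatter(partitions):
    """Stream each edge table once and shuffle updates to the owner of dest."""
    updates = defaultdict(list)
    for part in partitions.values():
        vertex_table = part["vertices"]      # loaded, random access
        for src, dest in part["edges"]:      # streamed, sequential access
            updates[owner(dest)].append((vertex_table[src], dest))
    return updates

def gather_apply(partitions, updates):
    """Stream each update table once and fold values into the vertex table."""
    for pid, part in partitions.items():
        vertex_table = part["vertices"]
        for value, dest in updates.get(pid, []):
            vertex_table[dest] = vertex_table.get(dest, 0) + value

partitions = {
    0: {"vertices": {1: 10, 2: 20}, "edges": [(1, 2), (2, 5)]},
    1: {"vertices": {5: 0}, "edges": []},
}
updates = scatter(partitions)
gather_apply(partitions, updates)
print(partitions[0]["vertices"][2])  # a + b for vertex 2
```

Only the vertex table needs random access, so it is the only structure that has to fit in memory for a partition.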
RStream API
[Figure: a program as a sequence of phases, e.g. Scatter, Relational, Relational, GatherApply, ..., Scatter, GatherApply, Relational]
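Example: Triangle Counting
Phases: Scatter, R1, R2
• Scatter
- vertex table (VID, value): (1, 4), (2, 5), …
- edge table (src, dest): (1, 4), (2, 5), …
- produces update table 1 (c1, c2): (1, 4), (2, 5), …
• R1: (a, b) ⋈ (b, c) → (a, b, c)
- update table 1 ⋈ edge table (src, dest): (4, 9), (5, 8), …
- produces update table 2 (c1, c2, c3): (1, 4, 9), (2, 5, 8), …
• R2: (a, b, c) ⋈ (c, a) → (a, b, c, a)
- update table 2 ⋈ edge table (src, dest): (9, 1), (8, 2), …
- produces (1, 4, 9), (2, 5, 8), …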
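The three phases can be traced on the sample edges from the slide. This sketch is not the RStream API: `join_last_with_src` is a hypothetical helper, and R2 is written as a join followed by a filter on the closing vertex.

```python
def join_last_with_src(update_table, edge_table):
    """(c1, ..., ck) joined with (src, dest) on ck == src, appending dest."""
    by_src = {}
    for src, dest in edge_table:
        by_src.setdefault(src, []).append(dest)
    return [t + (dest,) for t in update_table for dest in by_src.get(t[-1], [])]

edge_table = [(1, 4), (2, 5), (4, 9), (5, 8), (9, 1), (8, 2)]

# Scatter: every edge becomes an update tuple (a, b).
update1 = list(edge_table)
# R1: (a, b) joined with (b, c) gives (a, b, c).
update2 = join_last_with_src(update1, edge_table)
# R2: (a, b, c) joined with (c, a) gives (a, b, c, a); keep only closed tuples.
closed = [t for t in join_last_with_src(update2, edge_table) if t[3] == t[0]]
# Each directed triangle shows up once per rotation, so collapse by vertex set.
triangles = {frozenset(t[:3]) for t in closed}
print(len(triangles))
```

On this input every triangle is found three times, once per starting vertex, which is why the last step deduplicates.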
Outline
• How to provide a general programming interface for graph mining algorithms?
• How to implement relational operators efficiently for graphs?
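Streaming for Join Operator
• Locality-Aware Join
- Edge Table (Src, Dest): (1, 2), (2, 5)
- Update Table (C1, C2): (3, 1), (6, 2)
- Update Table ⋈ Edge Table → Update Table (C1, C2, C3): (3, 1, 2), (6, 2, 5)
- Steps: Load, Streaming, Shuffle
- Output tuples (3, 1, 2) and (6, 2, 5) are shuffled across Streaming Partition 1 and Streaming Partition 2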
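A possible shape for the load, stream, shuffle steps of the join, using the tables from the slide. Which table is loaded and which is streamed is an assumption of this sketch, as are `PARTITION_SIZE`, `owner`, and `streaming_join`.

```python
from collections import defaultdict

PARTITION_SIZE = 3  # hypothetical: partition p owns vertex IDs 3p+1 .. 3p+3

def owner(vid):
    """Streaming partition that owns a vertex ID."""
    return (vid - 1) // PARTITION_SIZE

def streaming_join(edge_table, update_stream):
    """Join update tuples with edges on last column == src, then shuffle."""
    # Load (assumed): hash this partition's edge table by src.
    by_src = defaultdict(list)
    for src, dest in edge_table:
        by_src[src].append(dest)
    out = defaultdict(list)
    # Streaming: a single sequential pass over the update tuples.
    for tup in update_stream:
        for dest in by_src.get(tup[-1], ()):
            # Shuffle: route the new tuple to the partition owning its last
            # column, so the next join finds the matching edges locally.
            out[owner(dest)].append(tup + (dest,))
    return dict(out)

result = streaming_join([(1, 2), (2, 5)], iter([(3, 1), (6, 2)]))
print(result)
```

Routing by the new last column is what keeps the following join local to one partition.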
Structural Information
• (1, 2, 3) ⋈ (3, 4): vertex 4 attaches to vertex 3, update tuple (1, 2, 3, 4)
• (1, 2, 3) ⋈ (2, 4): vertex 4 attaches to vertex 2, update tuple (1, 2, 3, 4)
• Same update tuples, different subgraphs
• Structural info is missing!
Missing Structural Information
• Identical tuples may represent different structures
• Different tuples may represent identical structures
Adding Structural Info
• Encodes the history of joins in update tuples
subgraph                       update tuple    index
edge 6-8                       6 8             0 1
⋈ on 8 adds vertex 7           6 8 7(1)        0 1 2
⋈ on 8 adds vertex 5           6 8 5(1)7(1)    0 1 2 3
• 7(1) marks that vertex 7 was joined at index 1 (vertex 8)
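One way to picture the encoding is to store, next to each vertex, the index it was joined at. The representation below is an assumption made for illustration (including giving vertex 8 the index 0 and appending new vertices at the end rather than reordering them); `extend` and `edges_of` are hypothetical helpers.

```python
def extend(tup, join_index, new_vertex):
    """Join on column join_index and record that index with the new vertex."""
    return tup + ((new_vertex, join_index),)

def edges_of(tup):
    """Rebuild the subgraph's edges from the recorded join history."""
    return {frozenset((v, tup[i][0])) for v, i in tup if i is not None}

# Initial edge 6-8: vertex 8 is recorded as joined at index 0 (assumed).
base = ((6, None), (8, 0))
# Both joins on index 1 (vertex 8): 7 and 5 both hang off 8.
star = extend(extend(base, 1, 7), 1, 5)
# Second join on index 2 (vertex 7): 5 hangs off 7 instead.
path = extend(extend(base, 1, 7), 2, 5)

same_vertices = [v for v, _ in star] == [v for v, _ in path]
print(same_vertices, edges_of(star) == edges_of(path))
```

The two tuples list the same vertices in the same order, yet the recorded indices let them be told apart.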
Is Join Enough?
• Join grows a subgraph from one of its vertices
• For Frequent Subgraph Mining, we need to explore all possibilities of existing subgraphs
• A different way of joining to grow a subgraph from all of its vertices
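Join on All Columns
• Joins update table with edge table on every column
[Figure: a subgraph grown from all of its vertices. (1, 2) expands to (1, 2, 3) and (1, 2, 4); (1, 2, 3) expands with vertices 5, 6, 7; (1, 2, 4) expands with vertices 8, 9, 10]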
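A sketch of growing a subgraph from every column instead of only the last one. `join_all_columns` is a hypothetical name, and skipping vertices already in the tuple is a simplifying assumption of this example.

```python
def join_all_columns(update_table, edge_table):
    """Join each update tuple with the edge table on every one of its columns."""
    by_src = {}
    for src, dest in edge_table:
        by_src.setdefault(src, []).append(dest)
    out = []
    for tup in update_table:
        for vertex in tup:                     # every column is a join key
            for dest in by_src.get(vertex, []):
                if dest not in tup:            # assumed: only add new vertices
                    out.append(tup + (dest,))
    return out

edge_table = [(1, 2), (1, 3), (2, 4)]
grown = join_all_columns([(1, 2)], edge_table)
print(grown)
```

Joining only on the last column would have produced (1, 2, 4) and missed (1, 2, 3), which grows from vertex 1.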
Automorphism and Isomorphism
• Different threads can generate identical (automorphic) update tuples
- e.g. thread 1 and thread 2 both generate (1, 2, 3)
• Select and keep one, remove all the other duplicates
• Different tuples may belong to the same isomorphism class
• Aggregate to count the number of each distinct shape
[Figure: subgraphs on vertices 1, 2, 3 and 5, 4, 6 with the same shape are aggregated to (shape, 2)]
Arabesque [CHC Teixeira et al., SOSP’15]
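The two steps above, dropping duplicate copies and counting per shape, can be illustrated with a brute-force canonical form. This is only workable for tiny patterns and is not how the systems discussed here do it; `canonical_shape` is a hypothetical helper.

```python
from collections import Counter
from itertools import permutations

def canonical_shape(edges):
    """Smallest relabeled edge list over all vertex permutations (brute force)."""
    vertices = sorted({v for e in edges for v in e})
    best = None
    for perm in permutations(range(len(vertices))):
        relabel = dict(zip(vertices, perm))
        form = tuple(sorted(tuple(sorted((relabel[u], relabel[v]))) for u, v in edges))
        if best is None or form < best:
            best = form
    return best

subgraphs = [
    [(1, 2), (2, 3)],           # path found by thread 1
    [(3, 2), (2, 1)],           # the same path found by thread 2
    [(5, 4), (4, 6)],           # different vertices, same shape
    [(7, 8), (8, 9), (7, 9)],   # a triangle
]
# Automorphism: normalize each subgraph and keep one copy of identical ones.
unique = {tuple(sorted(tuple(sorted(e)) for e in g)) for g in subgraphs}
# Isomorphism: aggregate the survivors by shape.
counts = Counter(canonical_shape(g) for g in unique)
print(counts)
```

Here the two-edge path shape ends up with a count of 2, matching the (shape, 2) aggregation in the figure.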
Evaluation
• Platform
- 10-node cluster, 5TB SSD
- Each node: 2 Xeon(R) CPU E5-2640 v3 processors, 32GB memory
• Application
- Triangle Counting
- Transitive Closure
- N-Clique Finding
- N-Motif Counting
- Frequent Subgraph Mining
• Input graphs
Graphs       #Edges  #Vertices
Citeseer     4,732   3,312
Mico         1.1M    100K
Patents      14M     2.7M
LiveJournal  69M     4.8M
Orkut        117M    3M
UK-2005      936M    39.5M
Comparisons with Mining Systems
Application        System        Citeseer  Mico    Patent
Triangle Counting  RStream       0.04      15.8    6.7
Triangle Counting  Arabesque-10  38.1      43.1    114.9
5-Clique           RStream       0.01      115.1   35.3
5-Clique           Arabesque-10  42.8      132     174.5
3-FSM 1K           RStream       0.06      351.7   383.7
3-FSM 1K           Arabesque-10  35.6      5790.1  -
3-FSM 1K           ScaleMine-10  1.2       802.6   -
3-FSM 1K           DistGraph-10  0.4       -       -
RStream outperforms Arabesque by 60.9x, ScaleMine by 12.1x, and DistGraph by 7.2x
Comparisons with Mining Systems
[Figure: FSM on patent graph. Running time (seconds, axis 0 to 2000) of RStream, ScaleMine, and Arabesque for subgraph size - support settings 3-10K, 3-15K, 3-20K, 4-15K, 4-20K, 4-25K, 5-15K, 5-20K, 5-25K]
Comparisons with Datalog Systems
Triangle Counting
System         LiveJournal  Orkut
RStream        87           827.4
BigDatalog-10  94.8         1205.3
BigDatalog-5   109.6        1850.3
BigDatalog-1   567.3        -
SociaLite      896.1        -
[Figure: Transitive Closure. Time (seconds, axis 0 to 1,000) for BD-1, BD-5, BD-10, SL, and RS; one bar is labeled 8,021]
Size of Intermediate Data
4-Motif Counting on Mico
Phase  #MB
0      16.5
1      2086
2      886378
3      672194
Total  1.49TB
13MB initial graph: 68182X
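The totals on this slide can be checked with a little arithmetic. Two readings are assumed here rather than stated in the slide: 1TB is taken as 1024 × 1024 MB, and the 68182X figure is read as the largest phase divided by the 13MB initial graph.

```python
# Per-phase intermediate data in MB, copied from the table above.
phase_mb = {0: 16.5, 1: 2086, 2: 886378, 3: 672194}

total_mb = sum(phase_mb.values())
total_tb = total_mb / (1024 * 1024)        # assumed binary units
peak_ratio = max(phase_mb.values()) / 13   # assumed: peak phase vs. 13MB input

print(round(total_tb, 2), int(peak_ratio))
```

Under those assumptions both numbers on the slide are reproduced, which suggests the blow-up factor refers to the peak phase rather than the total.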
Conclusions
RStream: A single-machine, out-of-core graph mining system
• A simple and expressive API
- GAS + Relational Algebra => GRAS
• An efficient runtime engine
- Implements relational algebra with tuple streaming
https://github.com/rstream-system