GTS (GStream 2.0): A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs
SIGMOD'16
Min-Soo Kim, Kyuhyeon An, Himchan Park, Hyunseok Seo, Jinwook Kim
Department of Information and Communication Engineering, DGIST (InfoLab)
Outline
Introduction
Preliminaries
Streaming graph topology
Exploiting multiple GPUs
Experimental results
Conclusions
Big graph data
Graphs are everywhere
- web, social networks, telecommunication, biology, neuroscience
Sizes of graphs are growing
Graph analysis is becoming increasingly important
- e.g., PageRank, connected components, triangle counting
Sizes of graphs
Real graphs
- livejournal: 5M vertices, 69M edges
- twitter: 42M vertices, 1468M edges
- yahooweb: 1414M vertices, 6637M edges
Synthetic graphs
- RMAT20: 1M vertices, 16M edges
- RMAT30: 1B vertices, 16B edges
- RMAT40: 1T vertices, 16T edges
Human connectome: 100B vertices, 100T edges
Existing graph processing methods (1)
Single-machine / CPU-based methods
- limited computing power (tens of CPU cores)
- limited graph size (main memory)
- e.g., MTGL [IPDPS'07], Galois [SOSP'13], Ligra [PPoPP'13], Ligra+ [DCC'15]
Single-machine / GPU-based methods
- good computing power (thousands of GPU cores)
- limited graph size (main memory / GPU memory)
- e.g., MapGraph [GRADES'14], TOTEM [PACT'12], CuSha [HPDC'14]
Existing graph processing methods (2)
[Figure: processors (P) and memories (M) connected by an interconnection network, with topology and attribute data partitioned across the nodes of existing distributed methods; GTS instead uses a single node with processors, memory, and direct-attached storage (D)]
Distributed methods
- scalable computing power (thousands of CPU cores)
- edge cut: communication traffic
- vertex cut: storage overhead (not scalable in graph size)
- e.g., GraphLab [OSDI'12], Giraph [Apache], GraphX [Apache], Naiad [SOSP'13]
Key ideas of GTS
Scale-up approach
- no communication traffic (as incurred by edge cuts)
- no storage overhead (as incurred by vertex cuts)
- thousands of computing cores (up to 8 GPUs)
- terabytes of storage (up to 8 PCI-E SSDs)
Storing only updatable attribute data in GPU memory
- read-only attribute (RA), writable attribute (WA)
- WA data is much smaller than topology data
Moving topology data from SSDs to GPUs
- streaming only necessary pages (page-level random access)
- strategies for mapping between SSDs and GPUs
Slotted page format for graph [KDD'13]
Storing topology data
- small page (SP): low-degree vertices
- large page (LP): high-degree vertices
- VID: logical vertex ID
- RID: physical record ID, ⟨PageID, SlotNo⟩
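To make the page layout concrete, here is a minimal CUDA C sketch of one 1 MB slotted page, loosely following the description above; the field names, the degree-prefixed record layout, and the page header are illustrative assumptions, not the exact format from the paper.

```cuda
#include <stdint.h>

#define PAGE_SIZE (1 << 20)              /* 1 MB per page */

typedef struct {                         /* RID: physical record ID */
    uint16_t page_id;                    /* 2-byte PageID (original format) */
    uint16_t slot_no;                    /* 2-byte SlotNo (original format) */
} rid_t;

typedef struct {                         /* one slot, stored at the back of the page */
    uint32_t vid;                        /* VID: logical vertex ID */
    uint32_t record_offset;              /* byte offset of this vertex's record in data[] */
} slot_t;

typedef struct {
    uint32_t num_slots;                  /* number of vertices stored in this page */
    uint8_t  data[PAGE_SIZE - sizeof(uint32_t)];
    /* assumed layout: records (a uint32_t degree followed by `degree` rid_t
       neighbor entries) grow from the front of data[], while the slot_t
       array grows backwards from the end of the page */
} slotted_page_t;
```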
Extended slotted page format
Original slotted page format: 1 MB per page
- 2-byte PageID: max. 64K pages
- 2-byte SlotNo: max. 64K vertices / page
- max. 4 billion vertices (practically, max. 1 billion vertices)
Extended slotted page format
- p-byte PageID
- q-byte SlotNo
- max. 281 trillion vertices (p=3, q=3)
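As a quick sanity check on the capacities above (a standalone sketch, not GTS code), the maximum number of addressable vertices is 2^(8p) pages times 2^(8q) slots per page:

```cuda
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* original format: 2-byte PageID, 2-byte SlotNo -> 2^16 * 2^16 vertices */
    uint64_t original = (1ULL << 16) * (1ULL << 16);   /* ~4.29 billion  */
    /* extended format: 3-byte PageID, 3-byte SlotNo -> 2^24 * 2^24 vertices */
    uint64_t extended = (1ULL << 24) * (1ULL << 24);   /* ~281 trillion  */
    printf("original: %llu vertices\n", (unsigned long long)original);
    printf("extended: %llu vertices\n", (unsigned long long)extended);
    return 0;
}
```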
Superstep: streaming graph topology once
Streaming topology data with read-only attribute (RA)
Two GPU kernels per graph algorithm: KSP (for SP) and KLP (for LP)
WA: writable attribute
RA: read-only attribute
SP: small page
LP: large page
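A minimal host-side sketch of one superstep follows, under several assumptions: the PAGE_SIZE and page layout from the earlier sketch, a hypothetical read_page() helper that loads the next slotted page from SSD into a host buffer, an illustrative launch configuration, and algorithm-specific kernels K_SP / K_LP (a per-vertex kernel sketch appears later). For simplicity, RA is copied once and all copies are synchronous; the stream-based overlap is shown on the next slide.

```cuda
#include <cuda_runtime.h>

#define PAGE_SIZE   (1 << 20)      /* as in the slotted-page sketch above */
#define numBlocks   256            /* illustrative launch configuration */
#define numThreads  256

extern char *read_page(bool small, int idx);                          /* hypothetical SSD reader */
__global__ void K_SP(float *WA, const float *RA, const char *page);   /* small-page kernel */
__global__ void K_LP(float *WA, const float *RA, const char *page);   /* large-page kernel */

void superstep(float *d_WA, const float *h_RA, size_t ra_bytes,
               int num_small_pages, int num_large_pages)
{
    float *d_RA;  char *d_page;
    cudaMalloc(&d_RA, ra_bytes);
    cudaMalloc(&d_page, PAGE_SIZE);
    cudaMemcpy(d_RA, h_RA, ra_bytes, cudaMemcpyHostToDevice);   /* RA copied to the GPU */

    for (int p = 0; p < num_small_pages; ++p) {                 /* stream small pages */
        cudaMemcpy(d_page, read_page(true, p), PAGE_SIZE, cudaMemcpyHostToDevice);
        K_SP<<<numBlocks, numThreads>>>(d_WA, d_RA, d_page);
    }
    for (int p = 0; p < num_large_pages; ++p) {                 /* stream large pages */
        cudaMemcpy(d_page, read_page(false, p), PAGE_SIZE, cudaMemcpyHostToDevice);
        K_LP<<<numBlocks, numThreads>>>(d_WA, d_RA, d_page);
    }
    cudaDeviceSynchronize();
    cudaFree(d_RA);  cudaFree(d_page);
}
```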
Asynchronous multiple streams
Asynchronous copies of {WA, SP, RA} to the GPU cannot overlap with one another
Executing GPU kernels can overlap
Up to 32 streams run simultaneously
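A minimal sketch of the same loop with multiple CUDA streams, assuming the hypothetical read_page() from the previous sketch returns pinned host buffers (needed for truly asynchronous cudaMemcpyAsync) and reusing its K_SP kernel; each stream gets its own device page buffer so copies and kernel launches issued on different streams can be pipelined.

```cuda
#include <cuda_runtime.h>

#define PAGE_SIZE    (1 << 20)
#define NUM_STREAMS  32            /* up to 32 streams, as on this slide */
#define numBlocks    256
#define numThreads   256

extern char *read_page(bool small, int idx);               /* hypothetical, pinned host memory */
__global__ void K_SP(float *WA, const float *RA, const char *page);

void superstep_async(float *d_WA, const float *d_RA, int num_small_pages)
{
    cudaStream_t stream[NUM_STREAMS];
    char *d_page[NUM_STREAMS];                              /* one page buffer per stream */
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_page[s], PAGE_SIZE);
    }
    for (int p = 0; p < num_small_pages; ++p) {
        int s = p % NUM_STREAMS;                            /* round-robin over streams */
        cudaMemcpyAsync(d_page[s], read_page(true, p), PAGE_SIZE,
                        cudaMemcpyHostToDevice, stream[s]);
        K_SP<<<numBlocks, numThreads, 0, stream[s]>>>(d_WA, d_RA, d_page[s]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_page[s]);
    }
}
```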
Exploiting multiple GPUs and SSDs
Strategy-P: the same (full) WA replicated in all GPUs
Strategy-S: a different WA partition in each GPU (a placement sketch follows)
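A minimal sketch of how the two placements could look on the host side; num_gpus, wa_bytes, and the h_WA buffer are illustrative assumptions, and the even byte-wise split under Strategy-S is a simplification of the paper's mapping between SSDs and GPUs.

```cuda
#include <cuda_runtime.h>

/* Strategy-P: replicate the full WA on every GPU.
   Strategy-S: give each GPU only its own partition of WA. */
void place_WA(const float *h_WA, size_t wa_bytes, int num_gpus,
              float *d_WA[], bool strategy_P)
{
    for (int g = 0; g < num_gpus; ++g) {
        cudaSetDevice(g);
        if (strategy_P) {
            cudaMalloc(&d_WA[g], wa_bytes);
            cudaMemcpy(d_WA[g], h_WA, wa_bytes, cudaMemcpyHostToDevice);
        } else {
            size_t part = wa_bytes / num_gpus;              /* simplification: even split */
            cudaMalloc(&d_WA[g], part);
            cudaMemcpy(d_WA[g], (const char *)h_WA + (size_t)g * part, part,
                       cudaMemcpyHostToDevice);
        }
    }
}
```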
Strategy-P vs. Strategy-S
Micro-level graph processing
[Figure: one slotted page, with slots holding vertex IDs 0-9 and their record offsets, records holding the adjacency lists, and threads 0-9 each assigned to one vertex]
Each thread processes one vertex (VWC technique); see the kernel sketch below
Each GPU processes (numBlocks × numThreads) vertices at a time
- processes multiple Small Pages simultaneously
- processes the Large Pages of a high-degree vertex simultaneously
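A minimal sketch of what a per-vertex small-page kernel body could look like, assuming the slotted_page_t layout sketched earlier (with a degree-prefixed record), WA/RA arrays indexed by global vertex ID, a hypothetical global_id() mapping from RID to vertex ID, and PageRank-style accumulation; it is illustrative, not the paper's kernel.

```cuda
#include <stdint.h>

#define PAGE_SIZE (1 << 20)

typedef struct { uint16_t page_id, slot_no; } rid_t;          /* as sketched earlier */
typedef struct { uint32_t vid, record_offset; } slot_t;
typedef struct { uint32_t num_slots; uint8_t data[PAGE_SIZE - sizeof(uint32_t)]; } slotted_page_t;

__device__ uint32_t global_id(rid_t r);                        /* hypothetical RID -> vertex ID map */

__global__ void K_SP(float *WA, const float *RA, const char *page)
{
    const slotted_page_t *p = (const slotted_page_t *)page;
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;      /* one thread per vertex */
    if (tid >= p->num_slots) return;

    /* slots grow backwards from the end of the page */
    const slot_t *slot = (const slot_t *)(page + PAGE_SIZE) - 1 - tid;
    const uint8_t *rec = p->data + slot->record_offset;
    uint32_t degree = *(const uint32_t *)rec;                  /* degree-prefixed record (assumed) */
    const rid_t *nbr = (const rid_t *)(rec + sizeof(uint32_t));
    if (degree == 0) return;

    /* scatter this vertex's contribution to its neighbors' writable attribute */
    float contrib = RA[slot->vid] / degree;
    for (uint32_t i = 0; i < degree; ++i)
        atomicAdd(&WA[global_id(nbr[i])], contrib);
}
```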
Experimental setup
CPU-based methods: MTGL [IPDPS'07], Galois [SOSP'13], Ligra [PPoPP'13], Ligra+ [DCC'15]
GPU-based methods: MapGraph [GRADES'14], TOTEM [PACT'12], CuSha [HPDC'14], GTS [SIGMOD'16]
- H/W setting: a single machine with 16 CPU cores, two GPUs, 128 GB memory, and two PCI-E SSDs
Distributed methods: GraphLab [OSDI'12], Giraph [Apache], GraphX [Apache], Naiad [SOSP'13]
- H/W setting: DGIST supercomputer iREMB (Rank #454, 30 nodes) with 480 CPU cores, 1,920 GB memory, and Infiniband QDR (40 Gbps)
Data sets
Comparison with CPU-based methods
Comparison with GPU-based methods
Comparison with distributed methods
Other graph algorithms
Strategy-P
Faster than Strategy-S for BFS (two SSDs / main memory)
Similar to Strategy-S for PageRank (one SSD / two HDDs)
Conclusions
New scale-up approach
- storing only updatable attribute data in GPU memory
- moving topology data from SSDs to GPUs
- two strategies: Strategy-P and Strategy-S
- extended slotted page, caching, micro-level parallel processing
Results
- faster than distributed methods on a supercomputer
- processing larger-scale graphs (RMAT34, two Intel 750 SSDs)
- cost-efficient
- energy-efficient
Thank you!
Questions?