GTS (GStream 2.0): A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs
SIGMOD'16
Min-Soo Kim, Kyuhyeon An, Himchan Park, Hyunseok Seo, Jinwook Kim
Department of Information and Communication Engineering, DGIST (InfoLab)
Outline
Introduction
Preliminaries
Streaming graph topology
Exploiting multiple GPUs
Experimental results
Conclusions
Big graph data
Graphs are everywhere
- web, social networks, telecommunication, biology, neuroscience
Sizes of graphs are growing
Graph analysis is becoming increasingly important
- e.g., PageRank, connected components, triangle counting
Sizes of graphs
Real graphs
- livejournal: 5M vertices, 69M edges
- twitter: 42M vertices, 1468M edges
- yahooweb: 1414M vertices, 6637M edges
Synthetic graphs
- RMAT20: 1M vertices, 16M edges
- RMAT30: 1B vertices, 16B edges
- RMAT40: 1T vertices, 16T edges
Human connectome: 100B vertices, 100T edges
Existing graph processing methods (1)
Single-machine / CPU-based methods
- limited computing power (tens of CPU cores)
- limited graph size (main memory)
- e.g., MTGL [IPDPS'07], Galois [SOSP'13], Ligra [PPoPP'13], Ligra+ [DCC'15]
Single-machine / GPU-based methods
- good computing power (thousands of GPU cores)
- limited graph size (main memory / GPU memory)
- e.g., MapGraph [GRADES'14], TOTEM [PACT'12], CuSha [HPDC'14]
Existing graph processing methods (2)
[Figure: processors (P) and memories (M) connected by an interconnection network, with topology and attribute data partitioned across the nodes of existing distributed methods; GTS instead uses a single node with processors, memory, and direct-attached storage (D)]
Distributed methods
- scalable computing power (thousands of CPU cores)
- edge cut: communication traffic
- vertex cut: storage overhead (not scalable in graph size)
- e.g., GraphLab [OSDI'12], Giraph [Apache], GraphX [Apache], Naiad [SOSP'13]
Key ideas of GTS
Scale-up approach
- no communication traffic (as incurred by edge cuts)
- no storage overhead (as incurred by vertex cuts)
- thousands of computing cores (up to 8 GPUs)
- terabytes of storage (up to 8 PCI-E SSDs)
Storing only updatable attribute data in GPU memory
- read-only attribute (RA), writable attribute (WA)
- WA data is much smaller than topology data
Moving topology data from SSDs to GPUs
- streaming only necessary pages (page-level random access)
- strategies for mapping between SSDs and GPUs
Slotted page format for graph [KDD'13]
Storing topology data
- small page (SP): low-degree vertices
- large page (LP): high-degree vertices
- VID: logical vertex ID
- RID: physical record ID, ⟨PageID, SlotNo⟩
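To make the page layout concrete, here is a minimal CUDA C sketch of one 1 MB slotted page, loosely following the description above; the field names, the degree-prefixed record layout, and the page header are illustrative assumptions, not the exact format from the paper.

```cuda
#include <stdint.h>

#define PAGE_SIZE (1 << 20)              /* 1 MB per page */

typedef struct {                         /* RID: physical record ID */
    uint16_t page_id;                    /* 2-byte PageID (original format) */
    uint16_t slot_no;                    /* 2-byte SlotNo (original format) */
} rid_t;

typedef struct {                         /* one slot, stored at the back of the page */
    uint32_t vid;                        /* VID: logical vertex ID */
    uint32_t record_offset;              /* byte offset of this vertex's record in data[] */
} slot_t;

typedef struct {
    uint32_t num_slots;                  /* number of vertices stored in this page */
    uint8_t  data[PAGE_SIZE - sizeof(uint32_t)];
    /* assumed layout: records (a uint32_t degree followed by `degree` rid_t
       neighbor entries) grow from the front of data[], while the slot_t
       array grows backwards from the end of the page */
} slotted_page_t;
```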
Extended slotted page format
Original slotted page format: 1 MB per page
- 2-byte PageID: max. 64K pages
- 2-byte SlotNo: max. 64K vertices / page
- max. 4 billion vertices (practically, max. 1 billion vertices)
Extended slotted page format
- p-byte PageID
- q-byte SlotNo
- max. 281 trillion vertices (p=3, q=3)
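As a quick sanity check on the capacities above (a standalone sketch, not GTS code), the maximum number of addressable vertices is 2^(8p) pages times 2^(8q) slots per page:

```cuda
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* original format: 2-byte PageID, 2-byte SlotNo -> 2^16 * 2^16 vertices */
    uint64_t original = (1ULL << 16) * (1ULL << 16);   /* ~4.29 billion  */
    /* extended format: 3-byte PageID, 3-byte SlotNo -> 2^24 * 2^24 vertices */
    uint64_t extended = (1ULL << 24) * (1ULL << 24);   /* ~281 trillion  */
    printf("original: %llu vertices\n", (unsigned long long)original);
    printf("extended: %llu vertices\n", (unsigned long long)extended);
    return 0;
}
```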
Superstep: streaming graph topology once
Streaming topology data with read-only attribute (RA)
Two GPU kernels per graph algorithm: KSP (for SP) and KLP (for LP)
WA: writable attribute
RA: read-only attribute
SP: small page
LP: large page
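A minimal host-side sketch of one superstep follows, under several assumptions: the PAGE_SIZE and page layout from the earlier sketch, a hypothetical read_page() helper that loads the next slotted page from SSD into a host buffer, an illustrative launch configuration, and algorithm-specific kernels K_SP / K_LP (a per-vertex kernel sketch appears later). For simplicity, RA is copied once and all copies are synchronous; the stream-based overlap is shown on the next slide.

```cuda
#include <cuda_runtime.h>

#define PAGE_SIZE   (1 << 20)      /* as in the slotted-page sketch above */
#define numBlocks   256            /* illustrative launch configuration */
#define numThreads  256

extern char *read_page(bool small, int idx);                          /* hypothetical SSD reader */
__global__ void K_SP(float *WA, const float *RA, const char *page);   /* small-page kernel */
__global__ void K_LP(float *WA, const float *RA, const char *page);   /* large-page kernel */

void superstep(float *d_WA, const float *h_RA, size_t ra_bytes,
               int num_small_pages, int num_large_pages)
{
    float *d_RA;  char *d_page;
    cudaMalloc(&d_RA, ra_bytes);
    cudaMalloc(&d_page, PAGE_SIZE);
    cudaMemcpy(d_RA, h_RA, ra_bytes, cudaMemcpyHostToDevice);   /* RA copied to the GPU */

    for (int p = 0; p < num_small_pages; ++p) {                 /* stream small pages */
        cudaMemcpy(d_page, read_page(true, p), PAGE_SIZE, cudaMemcpyHostToDevice);
        K_SP<<<numBlocks, numThreads>>>(d_WA, d_RA, d_page);
    }
    for (int p = 0; p < num_large_pages; ++p) {                 /* stream large pages */
        cudaMemcpy(d_page, read_page(false, p), PAGE_SIZE, cudaMemcpyHostToDevice);
        K_LP<<<numBlocks, numThreads>>>(d_WA, d_RA, d_page);
    }
    cudaDeviceSynchronize();
    cudaFree(d_RA);  cudaFree(d_page);
}
```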
Asynchronous multiple streams
Asynchronous copies of {WA, SP, RA} to the GPU cannot overlap with one another
Executing GPU kernels can overlap
Up to 32 streams run simultaneously
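A minimal sketch of the same loop with multiple CUDA streams, assuming the hypothetical read_page() from the previous sketch returns pinned host buffers (needed for truly asynchronous cudaMemcpyAsync) and reusing its K_SP kernel; each stream gets its own device page buffer so copies and kernel launches issued on different streams can be pipelined.

```cuda
#include <cuda_runtime.h>

#define PAGE_SIZE    (1 << 20)
#define NUM_STREAMS  32            /* up to 32 streams, as on this slide */
#define numBlocks    256
#define numThreads   256

extern char *read_page(bool small, int idx);               /* hypothetical, pinned host memory */
__global__ void K_SP(float *WA, const float *RA, const char *page);

void superstep_async(float *d_WA, const float *d_RA, int num_small_pages)
{
    cudaStream_t stream[NUM_STREAMS];
    char *d_page[NUM_STREAMS];                              /* one page buffer per stream */
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_page[s], PAGE_SIZE);
    }
    for (int p = 0; p < num_small_pages; ++p) {
        int s = p % NUM_STREAMS;                            /* round-robin over streams */
        cudaMemcpyAsync(d_page[s], read_page(true, p), PAGE_SIZE,
                        cudaMemcpyHostToDevice, stream[s]);
        K_SP<<<numBlocks, numThreads, 0, stream[s]>>>(d_WA, d_RA, d_page[s]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_page[s]);
    }
}
```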
Exploiting multiple GPUs and SSDs
Strategy-P: the same (full) WA replicated in all GPUs
Strategy-S: a different WA partition in each GPU (a placement sketch follows)
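A minimal sketch of how the two placements could look on the host side; num_gpus, wa_bytes, and the h_WA buffer are illustrative assumptions, and the even byte-wise split under Strategy-S is a simplification of the paper's mapping between SSDs and GPUs.

```cuda
#include <cuda_runtime.h>

/* Strategy-P: replicate the full WA on every GPU.
   Strategy-S: give each GPU only its own partition of WA. */
void place_WA(const float *h_WA, size_t wa_bytes, int num_gpus,
              float *d_WA[], bool strategy_P)
{
    for (int g = 0; g < num_gpus; ++g) {
        cudaSetDevice(g);
        if (strategy_P) {
            cudaMalloc(&d_WA[g], wa_bytes);
            cudaMemcpy(d_WA[g], h_WA, wa_bytes, cudaMemcpyHostToDevice);
        } else {
            size_t part = wa_bytes / num_gpus;              /* simplification: even split */
            cudaMalloc(&d_WA[g], part);
            cudaMemcpy(d_WA[g], (const char *)h_WA + (size_t)g * part, part,
                       cudaMemcpyHostToDevice);
        }
    }
}
```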
Strategy-P vs. Strategy-S
Micro-level graph processing
[Figure: one slotted page, with slots holding vertex IDs 0-9 and their record offsets, records holding the adjacency lists, and threads 0-9 each assigned to one vertex]
Each thread processes one vertex (VWC technique); see the kernel sketch below
Each GPU processes (numBlocks × numThreads) vertices at a time
- processes multiple Small Pages simultaneously
- processes the Large Pages of a high-degree vertex simultaneously
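A minimal sketch of what a per-vertex small-page kernel body could look like, assuming the slotted_page_t layout sketched earlier (with a degree-prefixed record), WA/RA arrays indexed by global vertex ID, a hypothetical global_id() mapping from RID to vertex ID, and PageRank-style accumulation; it is illustrative, not the paper's kernel.

```cuda
#include <stdint.h>

#define PAGE_SIZE (1 << 20)

typedef struct { uint16_t page_id, slot_no; } rid_t;          /* as sketched earlier */
typedef struct { uint32_t vid, record_offset; } slot_t;
typedef struct { uint32_t num_slots; uint8_t data[PAGE_SIZE - sizeof(uint32_t)]; } slotted_page_t;

__device__ uint32_t global_id(rid_t r);                        /* hypothetical RID -> vertex ID map */

__global__ void K_SP(float *WA, const float *RA, const char *page)
{
    const slotted_page_t *p = (const slotted_page_t *)page;
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;      /* one thread per vertex */
    if (tid >= p->num_slots) return;

    /* slots grow backwards from the end of the page */
    const slot_t *slot = (const slot_t *)(page + PAGE_SIZE) - 1 - tid;
    const uint8_t *rec = p->data + slot->record_offset;
    uint32_t degree = *(const uint32_t *)rec;                  /* degree-prefixed record (assumed) */
    const rid_t *nbr = (const rid_t *)(rec + sizeof(uint32_t));
    if (degree == 0) return;

    /* scatter this vertex's contribution to its neighbors' writable attribute */
    float contrib = RA[slot->vid] / degree;
    for (uint32_t i = 0; i < degree; ++i)
        atomicAdd(&WA[global_id(nbr[i])], contrib);
}
```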
Experimental setup
CPU-based methods: MTGL [IPDPS'07], Galois [SOSP'13], Ligra [PPoPP'13], Ligra+ [DCC'15]
GPU-based methods: MapGraph [GRADES'14], TOTEM [PACT'12], CuSha [HPDC'14], GTS [SIGMOD'16]
- H/W setting: a single machine with 16 CPU cores, two GPUs, 128 GB memory, and two PCI-E SSDs
Distributed methods: GraphLab [OSDI'12], Giraph [Apache], GraphX [Apache], Naiad [SOSP'13]
- H/W setting: DGIST supercomputer iREMB (Rank #454, 30 nodes) with 480 CPU cores, 1,920 GB memory, and Infiniband QDR (40 Gbps)
Data sets
Comparison with CPU-based methods
Comparison with GPU-based methods
Comparison with distributed methods
Other graph algorithms
Strategy-P
Faster than Strategy-S for BFS (two SSDs / main memory)
Similar to Strategy-S for PageRank (one SSD / two HDDs)
Conclusions
New scale-up approach
- storing only updatable attribute data in GPU memory
- moving topology data from SSDs to GPUs
- two strategies: Strategy-P and Strategy-S
- extended slotted page, caching, micro-level parallel processing
Results
- faster than distributed methods on a supercomputer
- processing larger-scale graphs (RMAT34, two Intel 750 SSDs)
- cost-efficient
- energy-efficient
Thank you!
Questions?