A Scalable Implementation of a MapReduce-based Graph Processing Algorithm for Large-scale Heterogeneous Supercomputers

A Scalable Implementa.on of a MapReduce-‐based Graph Processing Algorithm for Large-‐

scale Heterogeneous Supercomputers

Koichi Shirahata*1，Hitoshi Sato*1,*2， Toyotaro Suzumura*1,*2,*3，Satoshi Matsuoka*1

*1 Tokyo Ins;tute of Technology *2 CREST, Japan Science and Technology Agency

*3 IBM Research -‐ Tokyo

1

Emergence of Large Scale Graphs Need fast and scalable analysis using HPC

2

900 Million Ver;ces 100 Billion Edges

GPU-‐based Heterogeneous supercomputers

3

Fast Large Graph Processing with GPGPU

High peak performance High memory bandwidth

GPGPU

Mo.va.on

TSUBAME 2.0 1408 compute nodes (3 GPUs / node)

Problems of Large Scale Graph Processing with GPGPU

•  How much do GPUs accelerate large scale graph processing ? – Applicability to graph applica;ons •  Computa;on paXerns of graph algorithm affects performance

•  Tradeoff between computa;on and CPU-‐GPU data transfer overhead

– How to distribute graph data to each GPU in order to exploit mul;ple GPUs

4

GPU memory

CPU memory

Scalability Load balancing Communica;on

Motivating Example: CPU-based Graph Processing

•  How much is the graph applica.on accelerated using GPU ? –  Simple computa;on paXerns，High memory bandwidth –  Complex computa;on paXerns, PCI-‐E overhead

5

0

2000

4000

6000

8000

10000

12000

14000

1 2 4 8 16 32 64 128

Elap

sed Time [m

s]

# Compute Nodes

Reduce Sort

Copy Map

Contribu;ons •  Implemented a scalable mul.-‐GPU-‐based PageRank applica.on –  Extend Mars (an exis;ng GPU MapReduce framework)

•  Using the MPI library –  Implement GIM-‐V on mul;-‐GPU MapReduce

•  GIM-‐V: a graph processing algorithm –  Load balance op;miza;on between GPU devices for large-‐scale graphs •  Task scheduling-‐based graph par;;oning

6

•  Scale well up to 256 nodes (768 GPUs) •  1.52x speedup compared with on CPUs

Performance on TSUBAME2.0 supercomputer

Proposal: Mul;-‐GPU GIM-‐V with Load Balance Op;miza;on

7

Graph Applica.on PageRank

Graph Algorithm Mul.-‐GPU GIM-‐V

MapReduce Framework Mul.-‐GPU Mars

PlaZorm CUDA, MPI

Implement GIM-‐V on mul.-‐GPUs MapReduce -‐  Op;miza;on for GIM-‐V -‐  Load balance op;miza;on

Extend an exis.ng GPU MapReduce framework (Mars) for mul.-‐GPU


8




PlaZorm CUDA, MPI



Structure of Mars •  Mars*1 : an exis;ng GPU-‐based MapReduce framework –  CPU-‐GPU data transfer (Map) – GPU-‐based Bitonic Sort (Shuffle) – Allocates one CUDA thread / key (Map, Reduce)

9

*1 : Bingsheng He et al. Mars: A MapReduce Framework on Graphics Processors. PACT 2008

Preprocess

GPU Processing

Map Sort Reduce

Scheduler

Structure of Mars •  Mars*1 : an exis;ng GPU-‐based MapReduce framework –  CPU-‐GPU data transfer (Map) – GPU-‐based Bitonic Sort (Shuffle) – Allocates one CUDA thread / key (Map, Reduce)

10

*1 : Bingsheng He et al. Mars: A MapReduce Framework on Graphics Processors. PACT 2008

→ We extend Mars for mul.-‐GPU support

Preprocess

GPU Processing

Map Sort Reduce

Scheduler

Proposal: Mars Extension for Mul.-‐GPU using MPI

Map Sort

Map Sort

Reduce

Reduce

GPU Processing Scheduler

Upload CPU → GPU

Download GPU → CPU

Download GPU → CPU

Upload CPU → GPU

•  Inter-‐GPU communica;ons in Shuffle – G2C → MPI_Alltoallv → C2G → local Sort

•  Parallel I/O feature using MPI-‐IO –  Improve I/O throughput between memory and storage

11

Map Copy Sort Reduce


12




PlaZorm CUDA, MPI



Large graph processing algorithm GIM-‐V

13 *1 : Kang, U. et al, “PEGASUS: A Peta-‐Scale Graph Mining System-‐ Implementa;on and Observa;ons”, IEEE INTERNATIONAL CONFERENCE ON DATA MINING 2009

•  Generalized Itera.ve Matrix-‐Vector mul.plica.on*1 –  Graph applica;ons are implemented by defining 3 func;ons –  v’ = M ×G v where

v’i = Assign(vj , CombineAllj ({xj | j = 1..n, xj = Combine2(mi,j, vj)})) (i = 1..n)

×G

Vj

V

M ×G

V



×G

Vj

V

M ×G

Combine2

V





×G

×G

Vj

V

M CombineAll V

Combine2





×G

×G

Vj

V

M CombineAll V

Assign

Combine2







×G

×G

Vj

V

M CombineAll V

Assign

Combine2

GIM-‐V can be implemented by 2-‐stage MapReduce → Implement on mul.-‐GPU environment

Proposal: GIM-‐V implementa.on on mul.-‐GPU

•  Con.nuous execu.on feature for itera.ons –  2 MapReduce stages / itera;on – Graph par;;on at Pre-‐processing

•  Divide the input graph ver;ces/edges among GPUs –  Parallel Convergence test at Post-‐processing

•  Locally on each process -‐> globally using MPI_Allreduce

18

Graph Par;;on

Stage 1

Stage 2

GPU Processing Scheduler

Pre-‐process

Convergence Test

Post-‐process Mul.-‐GPU GIM-‐V

Combine2 CombineAll

Eliminate metadata and use fixed size payload

Op;miza;ons for mul;-‐GPU GIM-‐V

•  Data structure – Mars handles metadata and payload

•  Thread alloca.on – Mars handles one key per thread

•  Load balance op.miza.on –  Scale-‐free property

•  Small number of ver;ces have many edges

19

In Reduce stage, allocate mul. CUDA threads to a single key according to value size

Minimize load imbalance among GPUS

Mars Our Implementa.on

Eliminate metadata and use fixed size payload

Op;miza;ons for mul;-‐GPU GIM-‐V

•  Data structure – Mars handles metadata and payload

•  Thread alloca.on – Mars handles one key per thread

•  Load balance op.miza.on –  Scale-‐free property

•  Small number of ver;ces have many edges

20

In Reduce stage, allocate mul. CUDA threads to a single key according to value size

Mars Our Implementa.on

Minimize load imbalance among GPUS

Apply Load Balancing Op;miza;on •  Par..on the graph in order to minimize load imbalance among GPUs –  Applying a task scheduling algorithm

•  Regard Vertex/Edges as Task •  TaskSize i = 1 + Σ Outgoing Edges

–  LPT (Least Processing Time) schedule *1

•  Assign tasks in decreasing order of task size

*1 : R. L. Graham, “Bounds on mul;processing anomalies and related packing algorithms,” in Proceedings of the May 16-‐18, 1972, spring joint computer conference, ser. AFIPS ’72 (Spring)

P3 P2 P1

4 5 6 8 7

Tasks = {8, 5, 4, 3, 1} Minimize the maximum amount

Vertex i

i

TaskSize i = 1 + 3 V Eout

21

Experiments

•  Methods – A single round of itera;ons (w/o Preprocessing) –  PageRank applica;on

•  Measures rela;ve importance of web pages

–  Input data •  Ar;ficial Kronecker graphs

–  Generated by generator in Graph 500 •  Parameters

–  SCALE: log 2 of #ver;ces (#ver;ces = 2SCALE) –  Edge_factor: 16 (#edges = Edge_factor × #ver;ces)

22

16 4 3

2 1

8 12 4

8 6 4 2

12 3 2 1 4 3 2 1

G2 =G1⊗G1G1

Study the performance of our mul.-‐GPU GIM-‐V •  Scalability •  Comparison w/ a CPU-‐based implementa.on •  Validity of the load balance op.miza.on

Experimental environments •  TSUBAME 2.0 supercomputer – We use 256 nodes (768 GPUs)

•  CPU-‐GPU: PCI-‐E 2.0 x16 •  Internode: QDR IB (40 Gbps) dual rail

•  Mars – MarsGPU-‐n

•  n GPUs / node (n: 1, 2, 3)

– MarsCPU •  12 threads / node •  MPI and pthread •  Parallel quick sort

23

CPU GPU

Model Intel® Xeon® X5670

Tesla M2050

# Cores 6 448

Frequency 2.93 GHz 1.15 GHz

Memory 54 GB 2.7 GB

Compiler gcc 4.3.4 nvcc 4.0

24

0

10

20

30

40

50

60

70

80

90

100

0 50 100 150 200 250 300

MEgde

s / se

c

# Compute Nodes

MarsGPU-‐1 MarsGPU-‐2 MarsGPU-‐3 MarsCPU

SCALE 30

SCALE 29

SCALE 28

SCALE 27

87.04 ME/s (256 nodes)

1.52x speedup (3 GPU v CPU)

Weak Scaling Performance: MarsGPU vs. MarsCPU

Becer •  W/O load balance op;miza;on

Weak Scaling Performance: MarsGPU vs. MarsCPU

25

0

10

20

30

40

50

60

70

80

90

100

0 50 100 150 200 250 300

MEgde

s / se

c

# Compute Nodes

MarsGPU-‐1 MarsGPU-‐2 MarsGPU-‐3 MarsCPU

SCALE 30

SCALE 29

SCALE 28

SCALE 27

Becer •  W/O load balance op;miza;on

87.04 ME/s (256 nodes)

1.52x speedup (3 GPU v CPU)

Performance Breakdown

26

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

MarsCPU MarsGPU-‐1 MarsGPU-‐2 MarsGPU-‐3

Elap

sed Time [m

s]

Map MPI-‐Comm PCI-‐Comm Hash Sort Reduce

Performance Breakdown: MarsGPU and MarsCPU

Becer

SCALE 28

27

0

1000

2000

3000

4000

5000

6000

7000

8000

9000


Elap

sed Time [m

s]



8.93x (Map)

2.53x (Sort)

Becer

SCALE 28

28

0

1000

2000

3000

4000

5000

6000

7000

8000

9000


Elap

sed Time [m

s]



Becer

SCALE 28

PCI-‐E overhead

Efficiency of GIM-‐V Op;miza;ons •  Data structure (Map, Sort, Reduce) •  Thread alloca.on (Reduce)

29

1

10

100

1000

10000

Map Sort Reduce

Elap

sed Time [m

s]

Naive Op;mized

1.92x

1.64x 66.8x

SCALE 26, 128 nodes on MarsGPU-‐3

Becer

30

0

10

20

30

40

50

60

70

80

90

0 20 40 60 80 100 120 140

MEd

ges /

Sec

# Compute Nodes

MarsGPU-‐3 MarsGPU-‐3 LPT

1.16x

1.16x Speedup

Round Robin vs. LPT Schedule •  Similar except for on 128 nodes –  Input graphs are rela;vely well-‐balanced (Graph500)

Weak Scaling Performance Becer

31

0

10

20

30

40

50

60

70

80

90

0 20 40 60 80 100 120 140

MEd

ges /

Sec

# Compute Nodes


1.16x

1.16x Speedup

Performance Breakdown

•  Similar except for on 128 nodes –  Input graphs are rela;vely well-‐balanced (Graph500)

Weak Scaling Performance

Round Robin vs. LPT Schedule

Becer

Performance Breakdown Round robin vs. LPT Schedule

•  Bitonic sort calculates power-‐of-‐two key-‐value pairs –  Load balancing reduced the number of sor;ng elements

32 0

500

1000

1500

2000

2500

3000


Elap

sed Time [m

s] Map

MPI-‐Comm

PCI-‐Comm

Hash

Sort

Reduce

Speedup in Sort

Becer

33

1

10

100

1000

10000

100000

PEGASUS MarsCPU MarsGPU-‐3

KEdges / Sec

Outperform Hadoop-‐based Implementa;on

•  PEGASUS: a Hadoop-‐based GIM-‐V implementa;on –  Hadoop 0.21.0 –  Lustre for underlying Hadoop’s file system

186.8x Speedup

SCALE 27, 128 nodes Becer

Related Work •  Graph processing using GPU –  Shortest path algorithms for GPU (BFS，SSSP, and APSP)*1

→ Not achieve compe;;ve performance •  MapReduce implementa;ons on GPUs – GPMR*2 : MapReduce implementa;on on mul; GPUs → Not show scalability for large-‐scale processing

•  Graph processing with load balancing –  Load balancing while keeping communica;on low on R-‐MAT graphs*3

→ We show the task scheduling-‐based load-‐balancing

34

*1 : Harish, P. et al, “Accelera;ng Large Graph Algorithms on the GPU using CUDA”, HiPC 2007. *2 : Stuart, J.A. et al, “Mul;-‐GPU MapReduce on GPU Clusters”, IPDPS 2011. *3 : J. Chhugani, N. Sa;sh, C. Kim, J. Sewall, and P. Dubey, “Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing single-‐node efficiency,” in Parallel Distributed Processing Symposium (IPDPS), 2012

Conclusions •  A scalable MapReduce-‐based GIM-‐V implementa.on using mul.-‐GPU – Methodology •  Extend Mars to support mul;-‐GPU •  GIM-‐V using mul;-‐GPU MapReduce •  Load balance op;miza;on

–  Performance •  87.04 ME/s on SCALE 30 (256 nodes, 768 GPUs) •  1.52x speedup than the CPU-‐based implementa;on

•  Future work – Op;miza;on of our implementa;on

•  Improve communica;on, locality – Data handling larger than GPU memory capacity

•  Memory hierarchy management (GPU, DRAM, NVM, SSD) 35

Comparison with Load Balance Algorithm (Simula;on, Weak Scaling)

•  Compare between naive (Round robin) and load balancing op;miza;on (LPT schedule)

•  Similar except for 128 nodes (3.98% on SCALE 25, 64 nodes) –  Performance improvement: 13.8% (SCALE 26, 128 nodes)

36

0

5

10

15

20

25

30

35

40

2 4 8 16 32 64 128

Load

Imba

lance [%

]

# Compute Nodes

Round Robin LPT

1.67x BeXer

Large-‐scale Graphs in Real World •  Graphs in real world

–  Health care, SNS, Biology, Electric power grid etc. –  Millions to trillions of ver;ces and 100 millions to 100 trillions of edges

–  Similar proper;es •  Scale-‐free (power-‐low degree distribu;on) •  Small diameter

•  Kronecker Graph –  Similar proper;es as real world graphs –  Widely used (e.g. the Graph500 benchmark*1) since obtained easily by simply applying itera;ve products on a base matrix

37 *1 : D. A. Bader et al. The graph500 list. Graph500.org. hXp://www.graph500.org/

16 4 3

2 1

8 12 4

8 6 4 2

12 3 2 1 4 3 2 1