
Page 1: GPU Sparse Graph Traversal

UNIVERSITY of VIRGINIA

GPU Sparse Graph Traversal

Duane Merrill (NVIDIA)
Michael Garland (NVIDIA)
Andrew Grimshaw (Univ. of Virginia)

Page 2: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Breadth-first search (BFS)

1. Pick a source node
2. Rank every vertex by the length of its shortest path from the source (or by its predecessor vertex)

[Figure: example graph with each vertex labeled by its BFS distance (0–3) from the source]
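For reference, a minimal serial sketch of this ranking (not from the slides): a queue-based BFS over a CSR graph that labels every vertex with its distance from the source. The CSR array names and the tiny graph in main are illustrative assumptions.

```cuda
#include <cstdio>
#include <queue>
#include <vector>

// Serial reference BFS over a CSR graph: label[v] = length of the shortest
// path from the source to v (-1 if unreachable).
std::vector<int> bfs(const std::vector<int>& row_offsets,
                     const std::vector<int>& column_indices,
                     int source) {
    int n = (int)row_offsets.size() - 1;
    std::vector<int> label(n, -1);
    std::queue<int> frontier;
    label[source] = 0;
    frontier.push(source);
    while (!frontier.empty()) {
        int v = frontier.front();
        frontier.pop();
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
            int neighbor = column_indices[e];
            if (label[neighbor] == -1) {          // not yet ranked
                label[neighbor] = label[v] + 1;   // one level farther from the source
                frontier.push(neighbor);
            }
        }
    }
    return label;
}

int main() {
    // Tiny illustrative graph: undirected edges 0-1, 0-2, 1-3, 2-3.
    std::vector<int> row_offsets    = {0, 2, 4, 6, 8};
    std::vector<int> column_indices = {1, 2, 0, 3, 0, 3, 1, 2};
    std::vector<int> label = bfs(row_offsets, column_indices, 0);
    for (int v = 0; v < (int)label.size(); ++v)
        std::printf("vertex %d: distance %d\n", v, label[v]);
    return 0;
}
```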

Page 3: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

BFS applications

A common algorithmic building block:
Tracking reachable heap during garbage collection
Belief propagation (statistical inference)
Finding community structure
Path-finding

Core benchmark kernel:
Graph 500
Rodinia
Parboil

Simple performance analog for many applications:
Pointer-chasing
Work queues

[Figures: spy plots of the Wikipedia '07 graph and a 3D lattice]

Page 4: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

“Conventional” wisdom

GPUs would be poorly suited for BFS

Not pleasingly data-parallel (if you want to solve it efficiently)

Insufficient atomic throughput for cooperation


Page 5: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

“Conventional” wisdom

GPUs would be poorly suited for BFS

Not pleasingly data-parallel (if you want to solve it efficiently)

Insufficient atomic throughput for cooperation

Issues with scaling to 10,000s of processing elements

Workload imbalance, SIMD underutilization

Not enough parallelism in all graphs


Page 6: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

The punchline

We show GPUs provide exceptional performance…
Billions of traversed edges per second per socket
For diverse synthetic and real-world datasets
Linear-work algorithm

…using efficient parallel prefix sum for cooperative data movement

[Chart: billions of traversed edges/sec/socket vs. average traversal depth (log scale) for Tesla C2050, multi-core (per socket), and sequential CPU; CPU results from Agarwal et al. (SC'10) and Leiserson et al. (SPAA'10)]

Page 7: GPU Sparse Graph Traversal

UNIVERSITY of VIRGINIA

Level-synchronous parallelization strategies

Page 8: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

(a) Quadratic-work approach

"Typical" GPU BFS (e.g., Harish et al., Deng et al., Hong et al.)

1. Inspect every vertex at every time step
2. If updated, update its neighbors
3. Repeat

Quadratic O(n² + m) work
Trivial data-parallel stencils: one thread per vertex

8

t5

t9

t16

t11

t6

t0

t3

t1 t2

t4

t10

t7

t12

t17

t13

t15

t14

t8
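A minimal sketch of this one-thread-per-vertex stencil, assuming a CSR graph; the kernel, array names, and host "repeat" loop are illustrative rather than the code evaluated in the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per vertex: if v was labeled in the previous iteration, label
// each of its unvisited neighbors with the current depth.
__global__ void quadratic_bfs_step(const int* row_offsets, const int* column_indices,
                                   int* labels, int num_vertices,
                                   int depth, int* changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices || labels[v] != depth - 1) return;
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int n = column_indices[e];
        if (labels[n] == -1) {        // benign race: all writers store the same depth
            labels[n] = depth;
            *changed = 1;
        }
    }
}

int main() {
    // Tiny illustrative CSR graph: undirected edges 0-1, 0-2, 1-3, 2-3; source = 0.
    const int n = 4, m = 8;
    int h_row[n + 1] = {0, 2, 4, 6, 8};
    int h_col[m]     = {1, 2, 0, 3, 0, 3, 1, 2};
    int h_labels[n]  = {0, -1, -1, -1};

    int *d_row, *d_col, *d_labels, *d_changed;
    cudaMalloc(&d_row, sizeof(h_row));
    cudaMalloc(&d_col, sizeof(h_col));
    cudaMalloc(&d_labels, sizeof(h_labels));
    cudaMalloc(&d_changed, sizeof(int));
    cudaMemcpy(d_row, h_row, sizeof(h_row), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col, h_col, sizeof(h_col), cudaMemcpyHostToDevice);
    cudaMemcpy(d_labels, h_labels, sizeof(h_labels), cudaMemcpyHostToDevice);

    // "Repeat": one launch per BFS level until no vertex is updated.
    for (int depth = 1; ; ++depth) {
        int changed = 0;
        cudaMemcpy(d_changed, &changed, sizeof(int), cudaMemcpyHostToDevice);
        quadratic_bfs_step<<<(n + 255) / 256, 256>>>(d_row, d_col, d_labels, n, depth, d_changed);
        cudaMemcpy(&changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
        if (!changed) break;
    }

    cudaMemcpy(h_labels, d_labels, sizeof(h_labels), cudaMemcpyDeviceToHost);
    for (int v = 0; v < n; ++v)
        std::printf("vertex %d: depth %d\n", v, h_labels[v]);
    return 0;
}
```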

Page 9: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

(b) Linear-work approach

1. Expand neighbors (vertex frontier → edge frontier)
2. Filter out previously-visited vertices & duplicates (edge frontier → vertex frontier)
3. Repeat

Linear O(n + m) work
Need cooperative buffers for tracking frontiers

[Chart: logical edge-frontier and vertex-frontier sizes (vertex identifiers) vs. distance from source]
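The following serial sketch shows the same expand / filter / repeat data flow with the edge frontier and vertex frontier materialized explicitly (illustrative only; on the GPU these buffers live in global memory and are filled cooperatively). It can replace bfs() in the earlier example program.

```cuda
#include <vector>

// Serial sketch of the linear-work data flow: expand the vertex frontier into
// an edge frontier, then contract (filter visited vertices and duplicates)
// back into the next vertex frontier.
std::vector<int> frontier_bfs(const std::vector<int>& row_offsets,
                              const std::vector<int>& column_indices,
                              int source) {
    int n = (int)row_offsets.size() - 1;
    std::vector<int> labels(n, -1);
    std::vector<int> vertex_frontier = {source};
    labels[source] = 0;

    for (int depth = 1; !vertex_frontier.empty(); ++depth) {
        // 1. Expansion: vertex frontier -> edge frontier
        std::vector<int> edge_frontier;
        for (int v : vertex_frontier)
            for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
                edge_frontier.push_back(column_indices[e]);

        // 2. Contraction: filter previously-visited vertices and duplicates
        std::vector<int> next_frontier;
        for (int u : edge_frontier) {
            if (labels[u] == -1) {      // duplicates after the first copy fail this test
                labels[u] = depth;
                next_frontier.push_back(u);
            }
        }
        vertex_frontier.swap(next_frontier);   // 3. Repeat
    }
    return labels;
}
```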

Page 10: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Quadratic vs. linear

Quadratic is too much work!
3D Poisson lattice (300³ = 27M vertices): 130x the work
2.5x–2,300x speedup vs. quadratic GPU methods

[Chart: referenced vertex identifiers (millions) vs. BFS iteration, quadratic vs. linear]

Page 11: GPU Sparse Graph Traversal

Goal: GPU competence on diverse datasets (not a one-trick pony…)

[Charts: vertices (millions) vs. search depth for each dataset]

Wikipedia (social)
3D Poisson grid (cubic lattice)
R-MAT (random, power-law, small-world)
Europe road atlas (Euclidean space)
PDE-constrained optimization (non-linear KKT)
Auto transmission manifold (tetrahedral mesh)

Page 12: GPU Sparse Graph Traversal

UNIVERSITY of VIRGINIA

Challenges

Page 13: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Linear-work challenges for parallelization

Generic challenges:
1. Load imbalance between cores
   Coarse workstealing queues
2. Bandwidth inefficiency
   Use a "status bitmask" when filtering already-visited nodes

GPU-specific challenges:
1. Cooperative allocation (global queues)
2. Load imbalance within SIMD lanes
3. Poor locality within SIMD lanes
4. Simultaneous discovery (i.e., the benign race condition)
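As a sketch of the "status bitmask" idea (assumed names, not the paper's code), contraction can probe a one-bit-per-vertex visited mask, which is 32x smaller and far more cache-friendly than the full label array:

```cuda
#include <cuda_runtime.h>

// d_visited_mask is an assumed array holding one unsigned int per 32 vertices.
__device__ bool test_and_mark_visited(unsigned int* d_visited_mask, int vertex) {
    unsigned int bit = 1u << (vertex & 31);
    if (d_visited_mask[vertex >> 5] & bit)
        return true;                                  // cheap read-only probe
    // Mark it. An atomic is not strictly required: a lost update only means a
    // vertex might be inspected once more (a benign race).
    atomicOr(&d_visited_mask[vertex >> 5], bit);
    return false;
}
```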


Page 15: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

(1) Cooperative data movement (queuing)

Problem:
Need shared work "queues"
GPU rate limits (C2050):
16 billion vertex identifiers/sec (124 GB/s)
Only 67M global atomics/sec (238x slowdown)
Only 600M smem atomics/sec (27x slowdown)

Solution:
Compute enqueue offsets using prefix sum

[Chart: effective transfer rate (GB/sec) vs. vertex stride between atomic ops (log scale), ranging from 0.54 GB/s to 124.3 GB/s]
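A sketch of the idea at warp scope, under assumed names (the paper scans across entire CTAs and the grid, not just a warp): counts are combined with a shuffle-based prefix sum so the warp issues a single atomicAdd instead of one per thread.

```cuda
#include <cuda_runtime.h>

// Each thread wants to enqueue `count` items. The warp scans the counts and
// issues ONE atomicAdd to reserve space for everybody. Assumes all 32 lanes
// of the warp are active.
__device__ int warp_enqueue_offset(int count, int* d_queue_counter) {
    const unsigned FULL = 0xffffffffu;
    int lane = threadIdx.x & 31;

    // Inclusive warp scan of counts (Kogge-Stone over shuffles).
    int inclusive = count;
    for (int delta = 1; delta < 32; delta <<= 1) {
        int up = __shfl_up_sync(FULL, inclusive, delta);
        if (lane >= delta) inclusive += up;
    }
    int exclusive = inclusive - count;                 // this thread's offset within the warp

    int total = __shfl_sync(FULL, inclusive, 31);      // warp-wide allocation size
    int base = 0;
    if (lane == 31) base = atomicAdd(d_queue_counter, total);   // one atomic per warp
    base = __shfl_sync(FULL, base, 31);                // broadcast the reservation

    return base + exclusive;    // this thread may write `count` items starting here
}
```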

Page 16: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Prefix sum for allocation

Each output is computed to be the sum of the previous inputs
O(n) work
Use the results as scatter offsets
Fits the GPU machine model well:
Proceeds at copy bandwidth
Only ~8 thread-instructions per input

Example: threads t0–t3 have allocation requirements 2, 1, 0, 3; the prefix sum yields 0, 2, 3, 3, which each thread uses as its scatter offset into the output.
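The slide's example worked in code (illustrative host-side sketch): an exclusive scan of the per-thread allocation requirements yields each thread's scatter offset and the total output size.

```cuda
#include <cstdio>
#include <vector>

int main() {
    // The slide's example: threads t0..t3 need 2, 1, 0, 3 output slots.
    std::vector<int> request = {2, 1, 0, 3};

    // Exclusive prefix sum: each output is the sum of the previous inputs.
    std::vector<int> offset(request.size());
    int running = 0;
    for (int i = 0; i < (int)request.size(); ++i) {
        offset[i] = running;      // scatter offset for "thread" i
        running += request[i];    // `running` ends up as the total allocation
    }

    // Thread i now owns output slots [offset[i], offset[i] + request[i]).
    for (int i = 0; i < (int)request.size(); ++i)
        std::printf("t%d: offset %d, count %d\n", i, offset[i], request[i]);
    std::printf("total output size: %d\n", running);   // offsets 0 2 3 3, total 6
    return 0;
}
```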

Page 17: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Prefix sum for edge expansion

The same pattern allocates the edge frontier: each thread's allocation requirement is its vertex's adjacency-list size, and the prefix-sum result is the offset at which it writes that adjacency list.

Example: adjacency-list sizes for t0–t3 are 2, 1, 0, 3; the prefix sum yields offsets 0, 2, 3, 3 into the expanded edge frontier.
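A serial sketch of that expansion under assumed names: an exclusive scan of adjacency-list sizes gives each frontier vertex its offset, and its adjacency list is then copied there to form the edge frontier.

```cuda
#include <vector>

// On the GPU, one or more threads per vertex perform the copy in parallel
// using the same offsets.
std::vector<int> expand_edge_frontier(const std::vector<int>& row_offsets,
                                      const std::vector<int>& column_indices,
                                      const std::vector<int>& vertex_frontier) {
    // Exclusive scan of adjacency-list sizes -> scatter offsets.
    std::vector<int> offset(vertex_frontier.size());
    int total = 0;
    for (int i = 0; i < (int)vertex_frontier.size(); ++i) {
        int v = vertex_frontier[i];
        offset[i] = total;
        total += row_offsets[v + 1] - row_offsets[v];
    }

    // Scatter each adjacency list at its offset to form the edge frontier.
    std::vector<int> edge_frontier(total);
    for (int i = 0; i < (int)vertex_frontier.size(); ++i) {
        int v = vertex_frontier[i];
        int out = offset[i];
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
            edge_frontier[out++] = column_indices[e];
    }
    return edge_frontier;
}
```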

Page 18: GPU Sparse Graph Traversal

Efficient local scan (granularity coarsening)

[Diagram: hierarchical local scan — x16, x4, and x1 phases separated by barriers]

Page 19: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

(2) Poor load balancing within SIMD lanes

Problem:
Large variance in adjacency-list sizes
Exacerbated by wide SIMD widths

Solution:
Cooperative neighbor expansion
Enlist nearby threads to help process each adjacency list in parallel

[Figure:
a) Bad: serial expansion & processing
b) Slightly better: coarse warp-centric parallel expansion
c) Best: fine-grained parallel expansion (packed by prefix sum)]
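A sketch of the coarse warp-centric variant (b), with illustrative names: lanes take turns donating their adjacency list to the whole warp, which strides through it together. The paper's fine-grained variant (c) additionally packs gather offsets with prefix sums.

```cuda
#include <cuda_runtime.h>

// Coarse warp-centric neighbor expansion: one long adjacency list no longer
// stalls a single lane. Assumes a full, converged warp.
__device__ void warp_expand(const int* row_offsets, const int* column_indices,
                            int my_vertex, bool has_vertex,
                            int* d_edge_frontier, int* d_edge_frontier_size) {
    const unsigned FULL = 0xffffffffu;
    int lane = threadIdx.x & 31;

    unsigned pending = __ballot_sync(FULL, has_vertex);
    while (pending) {
        int src_lane = __ffs((int)pending) - 1;          // lowest lane with work left
        int v = __shfl_sync(FULL, my_vertex, src_lane);  // broadcast its vertex
        if (lane == src_lane) has_vertex = false;        // its list is handled below

        int begin = row_offsets[v], end = row_offsets[v + 1];
        int base = 0;                                    // (a real implementation would
        if (lane == 0)                                   //  reserve space via prefix sum)
            base = atomicAdd(d_edge_frontier_size, end - begin);
        base = __shfl_sync(FULL, base, 0);

        // The whole warp copies this one adjacency list into the edge frontier.
        for (int e = begin + lane; e < end; e += 32)
            d_edge_frontier[base + (e - begin)] = column_indices[e];

        pending = __ballot_sync(FULL, has_vertex);
    }
}
```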


Page 21: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

(3) Poor locality within SIMD lanes

Problem:
The referenced adjacency lists are arbitrarily located
Exacerbated by wide SIMD widths
Can't afford to have SIMD threads streaming through unrelated data

Solution:
Cooperative neighbor expansion

[Figure: the same expansion variants as above — (a) serial, (b) coarse warp-centric, (c) fine-grained packed by prefix sum]

Page 22: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

(4) SIMD simultaneous discovery

"Duplicates" in the edge frontier → redundant work
Exacerbated by wide SIMD
Compounded every iteration

[Figure: BFS iterations 1–4 on a small 9-vertex example graph, showing the same vertices being discovered by multiple threads at once]

Page 23: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Simultaneous discovery

Normally not a problem for CPU implementations (serial adjacency-list inspection)

A big problem for cooperative SIMD expansion:
Spatially-descriptive datasets
Power-law datasets

Solution:
Localized duplicate removal using hashes in local scratch

[Chart: vertices (millions) vs. search depth — logical edge frontier, logical vertex frontier, actually expanded, actually contracted]
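A block-level sketch of the scratch-hash culling idea, with an assumed table size and names (the paper combines a per-warp hash with a per-CTA hash): threads holding the same vertex race for a slot, one wins, and the rest drop their copy; collisions between different vertices are simply left unculled.

```cuda
#include <cuda_runtime.h>

#define CULL_HASH_SIZE 128   // illustrative per-CTA scratch-table size

// Best-effort duplicate culling in shared-memory scratch. A collision between
// *different* vertices goes unculled, which is safe because contraction
// re-checks visited status anyway. The caller declares
// __shared__ int s_hash[CULL_HASH_SIZE]; and every thread of the block must
// call this (pass a dummy id, e.g. -1, if it has nothing to cull) so the
// barriers stay uniform.
__device__ bool is_duplicate(int vertex, int* s_hash) {
    int slot = ((unsigned)vertex) % CULL_HASH_SIZE;

    s_hash[slot] = vertex;                            // all holders of `vertex` write it
    __syncthreads();
    bool candidate = (s_hash[slot] == vertex);        // false => slot claimed by another id
    __syncthreads();                                  // everyone done reading before reuse
    if (candidate) s_hash[slot] = (int)threadIdx.x;   // same-vertex holders race by thread id
    __syncthreads();
    return candidate && (s_hash[slot] != (int)threadIdx.x);   // losers are duplicates
}
```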

Page 24: GPU Sparse Graph Traversal

UNIVERSITY of VIRGINIA

Absolute and relative performance

Page 25: GPU Sparse Graph Traversal

Single-socket comparison

Graph             | Avg. search depth | Nehalem: billion TE/s (parallel speedup) | Tesla C2050: billion TE/s (parallel speedup†††)
europe.osm        | 19314             | —                                        | 0.3 (11x)
grid5pt.5000      | 7500              | —                                        | 0.6 (7.3x)
hugebubbles       | 6151              | —                                        | 0.4 (15x)
grid7pt.300       | 679               | 0.12, 4-core† (2.4x)                     | 1.1 (28x)
nlpkkt160         | 142               | 0.47, 4-core† (3.0x)                     | 2.5 (10x)
audikw1           | 62                | —                                        | 3.0 (4.6x)
cage15            | 37                | 0.23, 4-core† (2.6x)                     | 2.2 (18x)
kkt_power         | 37                | 0.11, 4-core† (3.0x)                     | 1.1 (23x)
coPapersCiteseer  | 26                | —                                        | 3.0 (5.9x)
wikipedia-2007    | 20                | 0.19, 4-core† (3.2x)                     | 1.6 (25x)
kron_g500-logn20  | 6                 | —                                        | 3.1 (13x)
random.2Mv.128Me  | 6                 | 0.50, 8-core†† (7.0x)                    | 3.0 (29x)
rmat.2Mv.128Me    | 6                 | 0.70, 8-core†† (6.0x)                    | 3.3 (22x)

(Spy-plot thumbnails omitted.)

† 2.5 GHz Core i7 4-core (Leiserson et al.)   †† 2.7 GHz EX Xeon X5570 8-core (Agarwal et al.)   ††† vs. 3.4 GHz Core i7 2600K (Sandy Bridge)


Page 27: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Summary

Quadratic-work approaches are uncompetitive
Dynamic workload management: prefix sum (vs. atomic-add)
Utilization requires fine-grained expansion and contraction
GPUs can be very amenable to dynamic, cooperative problems

Page 28: GPU Sparse Graph Traversal

UNIVERSITY of VIRGINIA

Questions?

Page 29: GPU Sparse Graph Traversal

Experimental corpus

Graph              | Description                      | Avg. search depth | Vertices (millions) | Edges (millions)
europe.osm         | Road network                     | 19314             | 50.9                | 108.1
grid5pt.5000       | 2D Poisson stencil               | 7500              | 25.0                | 125.0
hugebubbles-00020  | 2D mesh                          | 6151              | 21.2                | 63.6
grid7pt.300        | 3D Poisson stencil               | 679               | 27.0                | 188.5
nlpkkt160          | Constrained optimization problem | 142               | 8.3                 | 221.2
audikw1            | Finite element matrix            | 62                | 0.9                 | 76.7
cage15             | Transition prob. matrix          | 37                | 5.2                 | 94.0
kkt_power          | Optimization (KKT)               | 37                | 2.1                 | 13.0
coPapersCiteseer   | Citation network                 | 26                | 0.4                 | 32.1
wikipedia-20070206 | Wikipedia page links             | 20                | 3.6                 | 45.0
kron_g500-logn20   | Graph500 random graph            | 6                 | 1.0                 | 100.7
random.2Mv.128Me   | Uniform random graph             | 6                 | 2.0                 | 128.0
rmat.2Mv.128Me     | RMAT random graph                | 6                 | 2.0                 | 128.0

(Spy-plot thumbnails omitted.)

Page 30: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Comparison of expansion techniques


Page 31: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Coupling of expansion & contraction phases

Alternatives:

Expand-contract: wholly realize the vertex frontier in DRAM
Suitable for all types of BFS iterations

Contract-expand: wholly realize the edge frontier in DRAM
Even better for small, fleeting BFS iterations

Two-phase: wholly realize both frontiers in DRAM
Even better for large, saturating BFS iterations (surprising!)

[Figure: logical edge-frontier and vertex-frontier sizes vs. BFS iteration]

Page 32: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Comparison of coupling approaches

[Chart: normalized 10⁹ edges/sec for the Expand-Contract, Contract-Expand, 2-Phase, and Hybrid strategies]

Page 33: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Local duplicate culling

Contraction uses local collision hashes in smem scratch space:
Hash per warp (instantaneous coverage)
Hash per CTA (recent-history coverage)

Redundant work < 10% in all datasets
No atomics needed

[Chart: redundant expansion factor (log scale), simple vs. local duplicate culling; per-dataset improvement factors range from 1.1x to 4.2x]

Page 34: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Multi-GPU traversal

Expand neighbors → sort by GPU (with filter) → read from peer GPUs (with filter) → repeat

[Chart: normalized 10⁹ edges/sec for 1, 2, 3, and 4 Tesla C2050s]