
Page 1: GPU Sparse Graph Traversal

UNIVERSITY of VIRGINIA

GPU Sparse Graph Traversal

Duane Merrill (NVIDIA)
Michael Garland (NVIDIA)
Andrew Grimshaw (Univ. of Virginia)

Page 2: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Breadth-first search (BFS)

1. Pick a source node
2. Rank every vertex by the length of its shortest path from the source (or by its predecessor vertex)

[Figure: example graph with each vertex labeled by its BFS distance (0–3) from the source]
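For reference, a minimal serial sketch of this ranking (not from the slides): a queue-based BFS over a CSR graph that labels every vertex with its distance from the source. The CSR array names and the tiny graph in main are illustrative assumptions.

```cuda
#include <cstdio>
#include <queue>
#include <vector>

// Serial reference BFS over a CSR graph: label[v] = length of the shortest
// path from the source to v (-1 if unreachable).
std::vector<int> bfs(const std::vector<int>& row_offsets,
                     const std::vector<int>& column_indices,
                     int source) {
    int n = (int)row_offsets.size() - 1;
    std::vector<int> label(n, -1);
    std::queue<int> frontier;
    label[source] = 0;
    frontier.push(source);
    while (!frontier.empty()) {
        int v = frontier.front();
        frontier.pop();
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
            int neighbor = column_indices[e];
            if (label[neighbor] == -1) {          // not yet ranked
                label[neighbor] = label[v] + 1;   // one level farther from the source
                frontier.push(neighbor);
            }
        }
    }
    return label;
}

int main() {
    // Tiny illustrative graph: undirected edges 0-1, 0-2, 1-3, 2-3.
    std::vector<int> row_offsets    = {0, 2, 4, 6, 8};
    std::vector<int> column_indices = {1, 2, 0, 3, 0, 3, 1, 2};
    std::vector<int> label = bfs(row_offsets, column_indices, 0);
    for (int v = 0; v < (int)label.size(); ++v)
        std::printf("vertex %d: distance %d\n", v, label[v]);
    return 0;
}
```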

Page 3: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

BFS applications

A common algorithmic building block:
Tracking reachable heap during garbage collection
Belief propagation (statistical inference)
Finding community structure
Path-finding

Core benchmark kernel:
Graph 500
Rodinia
Parboil

Simple performance analog for many applications:
Pointer-chasing
Work queues

[Figures: spy plots of the Wikipedia '07 graph and a 3D lattice]

Page 4: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

“Conventional” wisdom

GPUs would be poorly suited for BFS

Not pleasingly data-parallel (if you want to solve it efficiently)

Insufficient atomic throughput for cooperation


Page 5: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

“Conventional” wisdom

GPUs would be poorly suited for BFS

Not pleasingly data-parallel (if you want to solve it efficiently)

Insufficient atomic throughput for cooperation

Issues with scaling to 10,000s of processing elements

Workload imbalance, SIMD underutilization

Not enough parallelism in all graphs


Page 6: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

The punchline

We show GPUs provide exceptional performance…
Billions of traversed edges per second per socket
For diverse synthetic and real-world datasets
Linear-work algorithm

…using efficient parallel prefix sum for cooperative data movement

[Chart: billions of traversed edges/sec/socket vs. average traversal depth (log scale) for Tesla C2050, multi-core (per socket), and sequential CPU; CPU results from Agarwal et al. (SC'10) and Leiserson et al. (SPAA'10)]

Page 7: GPU Sparse Graph Traversal

UNIVERSITY of VIRGINIA

Level-synchronous parallelization strategies

Page 8: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

(a) Quadratic-work approach

"Typical" GPU BFS (e.g., Harish et al., Deng et al., Hong et al.)

1. Inspect every vertex at every time step
2. If updated, update its neighbors
3. Repeat

Quadratic O(n² + m) work
Trivial data-parallel stencils: one thread per vertex

8

t5

t9

t16

t11

t6

t0

t3

t1 t2

t4

t10

t7

t12

t17

t13

t15

t14

t8
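A minimal sketch of this one-thread-per-vertex stencil, assuming a CSR graph; the kernel, array names, and host "repeat" loop are illustrative rather than the code evaluated in the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per vertex: if v was labeled in the previous iteration, label
// each of its unvisited neighbors with the current depth.
__global__ void quadratic_bfs_step(const int* row_offsets, const int* column_indices,
                                   int* labels, int num_vertices,
                                   int depth, int* changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices || labels[v] != depth - 1) return;
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int n = column_indices[e];
        if (labels[n] == -1) {        // benign race: all writers store the same depth
            labels[n] = depth;
            *changed = 1;
        }
    }
}

int main() {
    // Tiny illustrative CSR graph: undirected edges 0-1, 0-2, 1-3, 2-3; source = 0.
    const int n = 4, m = 8;
    int h_row[n + 1] = {0, 2, 4, 6, 8};
    int h_col[m]     = {1, 2, 0, 3, 0, 3, 1, 2};
    int h_labels[n]  = {0, -1, -1, -1};

    int *d_row, *d_col, *d_labels, *d_changed;
    cudaMalloc(&d_row, sizeof(h_row));
    cudaMalloc(&d_col, sizeof(h_col));
    cudaMalloc(&d_labels, sizeof(h_labels));
    cudaMalloc(&d_changed, sizeof(int));
    cudaMemcpy(d_row, h_row, sizeof(h_row), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col, h_col, sizeof(h_col), cudaMemcpyHostToDevice);
    cudaMemcpy(d_labels, h_labels, sizeof(h_labels), cudaMemcpyHostToDevice);

    // "Repeat": one launch per BFS level until no vertex is updated.
    for (int depth = 1; ; ++depth) {
        int changed = 0;
        cudaMemcpy(d_changed, &changed, sizeof(int), cudaMemcpyHostToDevice);
        quadratic_bfs_step<<<(n + 255) / 256, 256>>>(d_row, d_col, d_labels, n, depth, d_changed);
        cudaMemcpy(&changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
        if (!changed) break;
    }

    cudaMemcpy(h_labels, d_labels, sizeof(h_labels), cudaMemcpyDeviceToHost);
    for (int v = 0; v < n; ++v)
        std::printf("vertex %d: depth %d\n", v, h_labels[v]);
    return 0;
}
```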

Page 9: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

(b) Linear-work approach

1. Expand neighbors (vertex frontier → edge frontier)
2. Filter out previously-visited vertices & duplicates (edge frontier → vertex frontier)
3. Repeat

Linear O(n + m) work
Need cooperative buffers for tracking frontiers

[Chart: logical edge-frontier and vertex-frontier sizes (vertex identifiers) vs. distance from source]
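The following serial sketch shows the same expand / filter / repeat data flow with the edge frontier and vertex frontier materialized explicitly (illustrative only; on the GPU these buffers live in global memory and are filled cooperatively). It can replace bfs() in the earlier example program.

```cuda
#include <vector>

// Serial sketch of the linear-work data flow: expand the vertex frontier into
// an edge frontier, then contract (filter visited vertices and duplicates)
// back into the next vertex frontier.
std::vector<int> frontier_bfs(const std::vector<int>& row_offsets,
                              const std::vector<int>& column_indices,
                              int source) {
    int n = (int)row_offsets.size() - 1;
    std::vector<int> labels(n, -1);
    std::vector<int> vertex_frontier = {source};
    labels[source] = 0;

    for (int depth = 1; !vertex_frontier.empty(); ++depth) {
        // 1. Expansion: vertex frontier -> edge frontier
        std::vector<int> edge_frontier;
        for (int v : vertex_frontier)
            for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
                edge_frontier.push_back(column_indices[e]);

        // 2. Contraction: filter previously-visited vertices and duplicates
        std::vector<int> next_frontier;
        for (int u : edge_frontier) {
            if (labels[u] == -1) {      // duplicates after the first copy fail this test
                labels[u] = depth;
                next_frontier.push_back(u);
            }
        }
        vertex_frontier.swap(next_frontier);   // 3. Repeat
    }
    return labels;
}
```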

Page 10: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Quadratic vs. linear

Quadratic is too much work!
3D Poisson lattice (300³ = 27M vertices): 130x the work
2.5x–2,300x speedup vs. quadratic GPU methods

[Chart: referenced vertex identifiers (millions) vs. BFS iteration, quadratic vs. linear]

Page 11: GPU Sparse Graph Traversal

Goal: GPU competence on diverse datasets (not a one-trick pony…)

[Charts: vertices (millions) vs. search depth for each dataset]

Wikipedia (social)
3D Poisson grid (cubic lattice)
R-MAT (random, power-law, small-world)
Europe road atlas (Euclidean space)
PDE-constrained optimization (non-linear KKT)
Auto transmission manifold (tetrahedral mesh)

Page 12: GPU Sparse Graph Traversal

UNIVERSITY of VIRGINIA

Challenges

Page 13: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Linear-work challenges for parallelization

Generic challenges:
1. Load imbalance between cores
   Coarse workstealing queues
2. Bandwidth inefficiency
   Use a "status bitmask" when filtering already-visited nodes

GPU-specific challenges:
1. Cooperative allocation (global queues)
2. Load imbalance within SIMD lanes
3. Poor locality within SIMD lanes
4. Simultaneous discovery (i.e., the benign race condition)
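As a sketch of the "status bitmask" idea (assumed names, not the paper's code), contraction can probe a one-bit-per-vertex visited mask, which is 32x smaller and far more cache-friendly than the full label array:

```cuda
#include <cuda_runtime.h>

// d_visited_mask is an assumed array holding one unsigned int per 32 vertices.
__device__ bool test_and_mark_visited(unsigned int* d_visited_mask, int vertex) {
    unsigned int bit = 1u << (vertex & 31);
    if (d_visited_mask[vertex >> 5] & bit)
        return true;                                  // cheap read-only probe
    // Mark it. An atomic is not strictly required: a lost update only means a
    // vertex might be inspected once more (a benign race).
    atomicOr(&d_visited_mask[vertex >> 5], bit);
    return false;
}
```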


Page 15: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

(1) Cooperative data movement (queuing)

Problem:
Need shared work "queues"
GPU rate limits (C2050):
16 billion vertex identifiers/sec (124 GB/s)
Only 67M global atomics/sec (238x slowdown)
Only 600M smem atomics/sec (27x slowdown)

Solution:
Compute enqueue offsets using prefix sum

[Chart: effective transfer rate (GB/sec) vs. vertex stride between atomic ops (log scale), ranging from 0.54 GB/s to 124.3 GB/s]
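A sketch of the idea at warp scope, under assumed names (the paper scans across entire CTAs and the grid, not just a warp): counts are combined with a shuffle-based prefix sum so the warp issues a single atomicAdd instead of one per thread.

```cuda
#include <cuda_runtime.h>

// Each thread wants to enqueue `count` items. The warp scans the counts and
// issues ONE atomicAdd to reserve space for everybody. Assumes all 32 lanes
// of the warp are active.
__device__ int warp_enqueue_offset(int count, int* d_queue_counter) {
    const unsigned FULL = 0xffffffffu;
    int lane = threadIdx.x & 31;

    // Inclusive warp scan of counts (Kogge-Stone over shuffles).
    int inclusive = count;
    for (int delta = 1; delta < 32; delta <<= 1) {
        int up = __shfl_up_sync(FULL, inclusive, delta);
        if (lane >= delta) inclusive += up;
    }
    int exclusive = inclusive - count;                 // this thread's offset within the warp

    int total = __shfl_sync(FULL, inclusive, 31);      // warp-wide allocation size
    int base = 0;
    if (lane == 31) base = atomicAdd(d_queue_counter, total);   // one atomic per warp
    base = __shfl_sync(FULL, base, 31);                // broadcast the reservation

    return base + exclusive;    // this thread may write `count` items starting here
}
```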

Page 16: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Prefix sum for allocation

Each output is computed to be the sum of the previous inputs
O(n) work
Use the results as scatter offsets
Fits the GPU machine model well:
Proceeds at copy bandwidth
Only ~8 thread-instructions per input

Example: threads t0–t3 have allocation requirements 2, 1, 0, 3; the prefix sum yields 0, 2, 3, 3, which each thread uses as its scatter offset into the output.
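The slide's example worked in code (illustrative host-side sketch): an exclusive scan of the per-thread allocation requirements yields each thread's scatter offset and the total output size.

```cuda
#include <cstdio>
#include <vector>

int main() {
    // The slide's example: threads t0..t3 need 2, 1, 0, 3 output slots.
    std::vector<int> request = {2, 1, 0, 3};

    // Exclusive prefix sum: each output is the sum of the previous inputs.
    std::vector<int> offset(request.size());
    int running = 0;
    for (int i = 0; i < (int)request.size(); ++i) {
        offset[i] = running;      // scatter offset for "thread" i
        running += request[i];    // `running` ends up as the total allocation
    }

    // Thread i now owns output slots [offset[i], offset[i] + request[i]).
    for (int i = 0; i < (int)request.size(); ++i)
        std::printf("t%d: offset %d, count %d\n", i, offset[i], request[i]);
    std::printf("total output size: %d\n", running);   // offsets 0 2 3 3, total 6
    return 0;
}
```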

Page 17: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Prefix sum for edge expansion

The same pattern allocates the edge frontier: each thread's allocation requirement is its vertex's adjacency-list size, and the prefix-sum result is the offset at which it writes that adjacency list.

Example: adjacency-list sizes for t0–t3 are 2, 1, 0, 3; the prefix sum yields offsets 0, 2, 3, 3 into the expanded edge frontier.
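A serial sketch of that expansion under assumed names: an exclusive scan of adjacency-list sizes gives each frontier vertex its offset, and its adjacency list is then copied there to form the edge frontier.

```cuda
#include <vector>

// On the GPU, one or more threads per vertex perform the copy in parallel
// using the same offsets.
std::vector<int> expand_edge_frontier(const std::vector<int>& row_offsets,
                                      const std::vector<int>& column_indices,
                                      const std::vector<int>& vertex_frontier) {
    // Exclusive scan of adjacency-list sizes -> scatter offsets.
    std::vector<int> offset(vertex_frontier.size());
    int total = 0;
    for (int i = 0; i < (int)vertex_frontier.size(); ++i) {
        int v = vertex_frontier[i];
        offset[i] = total;
        total += row_offsets[v + 1] - row_offsets[v];
    }

    // Scatter each adjacency list at its offset to form the edge frontier.
    std::vector<int> edge_frontier(total);
    for (int i = 0; i < (int)vertex_frontier.size(); ++i) {
        int v = vertex_frontier[i];
        int out = offset[i];
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
            edge_frontier[out++] = column_indices[e];
    }
    return edge_frontier;
}
```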

Page 18: GPU Sparse Graph Traversal

Efficient local scan (granularity coarsening)

[Diagram: hierarchical local scan — x16, x4, and x1 phases separated by barriers]

Page 19: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

(2) Poor load balancing within SIMD lanes

Problem:
Large variance in adjacency-list sizes
Exacerbated by wide SIMD widths

Solution:
Cooperative neighbor expansion
Enlist nearby threads to help process each adjacency list in parallel

[Figure:
a) Bad: serial expansion & processing
b) Slightly better: coarse warp-centric parallel expansion
c) Best: fine-grained parallel expansion (packed by prefix sum)]
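A sketch of the coarse warp-centric variant (b), with illustrative names: lanes take turns donating their adjacency list to the whole warp, which strides through it together. The paper's fine-grained variant (c) additionally packs gather offsets with prefix sums.

```cuda
#include <cuda_runtime.h>

// Coarse warp-centric neighbor expansion: one long adjacency list no longer
// stalls a single lane. Assumes a full, converged warp.
__device__ void warp_expand(const int* row_offsets, const int* column_indices,
                            int my_vertex, bool has_vertex,
                            int* d_edge_frontier, int* d_edge_frontier_size) {
    const unsigned FULL = 0xffffffffu;
    int lane = threadIdx.x & 31;

    unsigned pending = __ballot_sync(FULL, has_vertex);
    while (pending) {
        int src_lane = __ffs((int)pending) - 1;          // lowest lane with work left
        int v = __shfl_sync(FULL, my_vertex, src_lane);  // broadcast its vertex
        if (lane == src_lane) has_vertex = false;        // its list is handled below

        int begin = row_offsets[v], end = row_offsets[v + 1];
        int base = 0;                                    // (a real implementation would
        if (lane == 0)                                   //  reserve space via prefix sum)
            base = atomicAdd(d_edge_frontier_size, end - begin);
        base = __shfl_sync(FULL, base, 0);

        // The whole warp copies this one adjacency list into the edge frontier.
        for (int e = begin + lane; e < end; e += 32)
            d_edge_frontier[base + (e - begin)] = column_indices[e];

        pending = __ballot_sync(FULL, has_vertex);
    }
}
```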


Page 21: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

(3) Poor locality within SIMD lanes

Problem:
The referenced adjacency lists are arbitrarily located
Exacerbated by wide SIMD widths
Can't afford to have SIMD threads streaming through unrelated data

Solution:
Cooperative neighbor expansion

[Figure: the same expansion variants as above — (a) serial, (b) coarse warp-centric, (c) fine-grained packed by prefix sum]

Page 22: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

(4) SIMD simultaneous discovery

"Duplicates" in the edge frontier → redundant work
Exacerbated by wide SIMD
Compounded every iteration

[Figure: BFS iterations 1–4 on a small 9-vertex example graph, showing the same vertices being discovered by multiple threads at once]

Page 23: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Simultaneous discovery

Normally not a problem for CPU implementations (serial adjacency-list inspection)

A big problem for cooperative SIMD expansion:
Spatially-descriptive datasets
Power-law datasets

Solution:
Localized duplicate removal using hashes in local scratch

[Chart: vertices (millions) vs. search depth — logical edge frontier, logical vertex frontier, actually expanded, actually contracted]
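A block-level sketch of the scratch-hash culling idea, with an assumed table size and names (the paper combines a per-warp hash with a per-CTA hash): threads holding the same vertex race for a slot, one wins, and the rest drop their copy; collisions between different vertices are simply left unculled.

```cuda
#include <cuda_runtime.h>

#define CULL_HASH_SIZE 128   // illustrative per-CTA scratch-table size

// Best-effort duplicate culling in shared-memory scratch. A collision between
// *different* vertices goes unculled, which is safe because contraction
// re-checks visited status anyway. The caller declares
// __shared__ int s_hash[CULL_HASH_SIZE]; and every thread of the block must
// call this (pass a dummy id, e.g. -1, if it has nothing to cull) so the
// barriers stay uniform.
__device__ bool is_duplicate(int vertex, int* s_hash) {
    int slot = ((unsigned)vertex) % CULL_HASH_SIZE;

    s_hash[slot] = vertex;                            // all holders of `vertex` write it
    __syncthreads();
    bool candidate = (s_hash[slot] == vertex);        // false => slot claimed by another id
    __syncthreads();                                  // everyone done reading before reuse
    if (candidate) s_hash[slot] = (int)threadIdx.x;   // same-vertex holders race by thread id
    __syncthreads();
    return candidate && (s_hash[slot] != (int)threadIdx.x);   // losers are duplicates
}
```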

Page 24: GPU Sparse Graph Traversal

UNIVERSITY of VIRGINIA

Absolute and relative performance

Page 25: GPU Sparse Graph Traversal

Single-socket comparison

Graph             | Avg. search depth | Nehalem: billion TE/s (parallel speedup) | Tesla C2050: billion TE/s (parallel speedup†††)
europe.osm        | 19314             | —                                        | 0.3 (11x)
grid5pt.5000      | 7500              | —                                        | 0.6 (7.3x)
hugebubbles       | 6151              | —                                        | 0.4 (15x)
grid7pt.300       | 679               | 0.12, 4-core† (2.4x)                     | 1.1 (28x)
nlpkkt160         | 142               | 0.47, 4-core† (3.0x)                     | 2.5 (10x)
audikw1           | 62                | —                                        | 3.0 (4.6x)
cage15            | 37                | 0.23, 4-core† (2.6x)                     | 2.2 (18x)
kkt_power         | 37                | 0.11, 4-core† (3.0x)                     | 1.1 (23x)
coPapersCiteseer  | 26                | —                                        | 3.0 (5.9x)
wikipedia-2007    | 20                | 0.19, 4-core† (3.2x)                     | 1.6 (25x)
kron_g500-logn20  | 6                 | —                                        | 3.1 (13x)
random.2Mv.128Me  | 6                 | 0.50, 8-core†† (7.0x)                    | 3.0 (29x)
rmat.2Mv.128Me    | 6                 | 0.70, 8-core†† (6.0x)                    | 3.3 (22x)

(Spy-plot thumbnails omitted.)

† 2.5 GHz Core i7 4-core (Leiserson et al.)   †† 2.7 GHz EX Xeon X5570 8-core (Agarwal et al.)   ††† vs. 3.4 GHz Core i7 2600K (Sandy Bridge)


Page 27: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Summary

Quadratic-work approaches are uncompetitive
Dynamic workload management: prefix sum (vs. atomic-add)
Utilization requires fine-grained expansion and contraction
GPUs can be very amenable to dynamic, cooperative problems

Page 28: GPU Sparse Graph Traversal

UNIVERSITY of VIRGINIA

Questions?

Page 29: GPU Sparse Graph Traversal

Experimental corpus

Graph              | Description                      | Avg. search depth | Vertices (millions) | Edges (millions)
europe.osm         | Road network                     | 19314             | 50.9                | 108.1
grid5pt.5000       | 2D Poisson stencil               | 7500              | 25.0                | 125.0
hugebubbles-00020  | 2D mesh                          | 6151              | 21.2                | 63.6
grid7pt.300        | 3D Poisson stencil               | 679               | 27.0                | 188.5
nlpkkt160          | Constrained optimization problem | 142               | 8.3                 | 221.2
audikw1            | Finite element matrix            | 62                | 0.9                 | 76.7
cage15             | Transition prob. matrix          | 37                | 5.2                 | 94.0
kkt_power          | Optimization (KKT)               | 37                | 2.1                 | 13.0
coPapersCiteseer   | Citation network                 | 26                | 0.4                 | 32.1
wikipedia-20070206 | Wikipedia page links             | 20                | 3.6                 | 45.0
kron_g500-logn20   | Graph500 random graph            | 6                 | 1.0                 | 100.7
random.2Mv.128Me   | Uniform random graph             | 6                 | 2.0                 | 128.0
rmat.2Mv.128Me     | RMAT random graph                | 6                 | 2.0                 | 128.0

(Spy-plot thumbnails omitted.)

Page 30: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Comparison of expansion techniques


Page 31: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Coupling of expansion & contraction phases

Alternatives:

Expand-contract: wholly realize the vertex frontier in DRAM
Suitable for all types of BFS iterations

Contract-expand: wholly realize the edge frontier in DRAM
Even better for small, fleeting BFS iterations

Two-phase: wholly realize both frontiers in DRAM
Even better for large, saturating BFS iterations (surprising!)

[Figure: logical edge-frontier and vertex-frontier sizes vs. BFS iteration]

Page 32: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Comparison of coupling approaches

[Chart: normalized 10⁹ edges/sec for the Expand-Contract, Contract-Expand, 2-Phase, and Hybrid strategies]

Page 33: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Local duplicate culling

Contraction uses local collision hashes in smem scratch space:
Hash per warp (instantaneous coverage)
Hash per CTA (recent-history coverage)

Redundant work < 10% in all datasets
No atomics needed

[Chart: redundant expansion factor (log scale), simple vs. local duplicate culling; per-dataset improvement factors range from 1.1x to 4.2x]

Page 34: GPU Sparse Graph Traversal

D. Merrill, M. Garland, A. Grimshaw. Scalable GPU Graph Traversal. PPoPP’12.

Multi-GPU traversal

Expand neighbors → sort by GPU (with filter) → read from peer GPUs (with filter) → repeat

[Chart: normalized 10⁹ edges/sec for 1, 2, 3, and 4 Tesla C2050s]