Parallel Graph Algorithms
Rupesh Nasre., IIT Madras
NITK, February 2016




Parallel Graph Algorithms

Rupesh Nasre., IIT Madras

NITK, February 2016


Graphs

● Where do we encounter graphs?
  ● Social networks, road connections, molecular interactions, planetary forces, ...
  ● Datasets: SNAP, Florida, DIMACS, KONECT, ...
● Why treat them separately?
  ● They provide structural information.
  ● They can be processed more efficiently.
● What challenges do they pose?
  ● Load imbalance, poor locality, ...
  ● Irregularity


What is IrReguLariTy?

● Data-access or control patterns are unpredictable at compile time.

Irregular data-access:

int a[N], b[N], c[N];
readinput(a);
c[5] = b[a[4]];

Irregular control-flow:

int a[N];
readinput(a);
if (a[4] > 30) {
    ...
}

Pointer-based data structures often contribute to irregularity. Such patterns need dynamic techniques.


Examples of Irregular Algorithms

[Figures: Delaunay mesh refinement (a triangulated mesh); points-to analysis (the code a = &x; b = &y; p = &a; *p = b; c = a and its points-to graph); minimum spanning tree computation (a weighted graph); survey propagation (a factor graph over clauses C1..C5 and variables X1..X5)]

Delaunay Mesh Refinement, Points-to Analysis, Minimum Spanning Tree Computation, Survey Propagation


Regular vs. Irregular Algorithms

Matrix Multiplication vs. Shortest Paths Computation

[Figure: a matrix product on the left; a small weighted graph on the right]

By knowing the matrix size and its starting address, and without knowing its values, we can determine the dynamic behavior of matrix multiplication. For shortest paths, the dynamic behavior depends on the input graph. This results in pessimistic synchronization.


Irregularity is a Matter of Degree

[Scatter plot: control-flow irregularity (0% to 4.0%) plotted against memory-access irregularity (0% to 80%)]



Dynamic Techniques

[Figure: a graph with nodes a through j; one node is the active node, and the shaded region around it is its neighborhood]

Non-overlapping neighborhoods can be processed in parallel.

Overlapping neighborhoods require synchronization.

Leads to optimistic and cautious parallelizations.

Question: Given an algorithm, can you schedule its parallel processing such that neighborhoods never overlap?


Parallelism Approaches

● Manual
● Libraries: Galois, Ligra, LonestarGPU, Totem, ...
● Domain-Specific Languages: Green-Marl, Elixir, BlackBuck, ...

Moving from manual approaches toward DSLs increases productivity; moving the other way favors performance.


Specifying Parallelism

● Do not specify: sequential input, completely automated; currently very challenging in general
● Implicit parallelism: aggregates, aggregate functions, value-based processing, ...
● Explicit parallelism: pthreads, MPI, ...


Identifying Dependence

for (ii = 0; ii < 10; ++ii) {
    a[2 * ii] = ... a[2 * ii + 1] ...
}

Is there a flow dependence between different iterations?

Dependence equations:

    0 <= ii_w < ii_r < 10
    2 * ii_w = 2 * ii_r + 1

which can be written as

    0 <= ii_w
    ii_w <= ii_r - 1
    ii_r <= 9
    2 * ii_w <= 2 * ii_r + 1
    2 * ii_r + 1 <= 2 * ii_w

or, in matrix form,

    [ -1   0 ]               [  0 ]
    [  1  -1 ]   [ ii_w ]    [ -1 ]
    [  0   1 ] · [ ii_r ] <= [  9 ]
    [  2  -2 ]               [  1 ]
    [ -2   2 ]               [ -1 ]

A dependence exists if the system has an integer solution. Here 2 * ii_w is even while 2 * ii_r + 1 is odd, so the system has no solution: the iterations are independent.


Outline

● Introduction➔ Graph algorithms on GPUs● Morph algorithms on GPUs● Synthesizing graph algorithms for GPUs


What is a GPU?

● Graphics Processing Unit
● Separate piece of hardware, connected using a bus
● Separate address space from that of the CPU
● Massive multithreading
● Warp-based execution


What is a Warp?

Source: Wikipedia


GPU Computation Hierarchy

The hierarchy, with the number of threads at each level:

Thread           1
Warp             32
Block            1024
Multi-processor  tens of thousands
GPU              hundreds of thousands


Challenges with GPUs

● Warp-based execution
● Locking is expensive
● Dynamic memory allocation is costly
● Limited data-cache
● Programmability issues
  ● separate address space
  ● no recursion support
  ● complex computation hierarchy
  ● exposed memory hierarchy
  ● ...


Challenges in Graph Algorithms

● Synchronization
  ● locks are prohibitively expensive on GPUs
  ● atomic instructions quickly become expensive
● Memory latency
  ● locality is difficult to exploit
  ● low caching support
● Thread-divergence
  ● work done per node varies with graph structure
● Uncoalesced memory accesses
  ● warp-threads access arbitrary graph elements


Irregular Algorithms on GPUs

[Figure: an example weighted graph]

Single-source shortest paths:

sssp(src):
    dsrc = distance[src]
    forall edges (src, dst, wt)
        relaxEdge(src, dst, wt)


Irregular Algorithms on GPUs

[Figure: the same example weighted graph]

Single-source shortest paths:

sssp(src):
    dsrc = distance[src]
    forall edges (src, dst, wt)
        ddst = distance[dst]
        altdist = dsrc + wt
        if altdist < ddst
            distance[dst] = altdist

This is a memory-bound kernel.

Kernel unrolling:

Instead of operating on a node, a thread operates on a subgraph.


Improving Synchronization

[Figure: push-based vs. pull-based updates; three panels with threads tfive and tseven updating a shared distance]

Atomic-free update; the lost-update problem (two threads write concurrently and one update is lost); correction by topology-driven processing, exploiting monotonicity.


Irregular Algorithms on GPUs

Breadth-first search, single-source shortest paths, Barnes-Hut n-body simulation

● Better memory layout
● Kernel unrolling
● Local worklists
● Improved synchronization

Application  Speedup
BFS          48
BH           90
SSSP         45


Outline

● Introduction
● Graph algorithms on GPUs (e.g., breadth-first search, shortest paths)
➔ Morph algorithms on GPUs (e.g., mesh refinement, pointer analysis)
● Synthesizing irregular algorithms for GPUs


Identify the Celebrity

Source: Wikipedia


What is a morph?

Source: Wikipedia


Examples of Morph Algorithms

[Figures: Delaunay mesh refinement, points-to analysis, minimum spanning tree computation, survey propagation]

Delaunay Mesh Refinement, Points-to Analysis, Minimum Spanning Tree Computation, Survey Propagation


TAO Classification

Classification of algorithms on multi-core and GPU:

● Regular (matrix multiplication, stencils)
● Irregular, static-graph (breadth-first search, shortest paths)
● Morph (mesh refinement, survey propagation)

The TAO of Parallelism in Algorithms, Pingali et al. [PLDI 2011]


First Attempt

● Ported multi-core Galois benchmarks to CUDA
● The GPU version's performance was comparable to that of the single-threaded version

[Chart: execution time (seconds) vs. number of CPU threads (1 to 48) for Serial (Shewchuck, Berkeley), GPU Ported, and Multicore Galois]

DMR on a mesh with 10 million triangles with around 50% bad triangles


Challenges in Morph Algorithms

● Synchronization
  ● locks are prohibitively expensive on GPUs
  ● atomic instructions quickly become expensive
● Memory allocation
  ● changing graph structure requires new strategies
  ● memory requirement cannot be predicted
● Load imbalance
  ● different modifications to different parts of the graph
  ● work done per node changes dynamically
  ● leads to thread-divergence and uncoalesced memory accesses


Performance Considerations

1. Graph representation
2. Subgraph addition
3. Subgraph deletion
4. Conflict resolution
5. Load balancing


Graph Representation (1)

● Delaunay Mesh Refinement: points, triangles, and neighbors
● Survey Propagation: clauses-to-literals and literals-to-clauses maps
● Points-to Analysis: pointers and their points-to sets
● Minimum Spanning Tree Computation: the graph's edges and its components


Subgraph Addition (2)

● Pre-allocation: when the final graph has bounded size; overallocation (e.g., MST, DMR)
● Host-only: when the next size is predictable on the host; global allocation; may be expensive, but can be optimized (e.g., PTA)
● Kernel-host: when the next size is predictable; global allocation; size computation can piggyback (e.g., DMR)
● Kernel-only: general purpose; local allocation; expensive, but can be optimized (e.g., PTA)


Performance

[Chart: execution time (seconds) vs. number of CPU threads (1 to 48) for Serial (Shewchuck, Berkeley), Multi-core Galois, GPU Ported, and GPU Optimized]

Application  Speedup
DMR          80
MST          20
PTA          23
SP           25

Optimizations applied: (1) graph representation, (2) subgraph addition, (3) subgraph deletion, (4) conflict resolution, (5) load balancing.


GPU Optimization Principles

● Memory: kernel transformations; data grouping; exploiting the memory hierarchy
● Computation: algorithm selection; work sorting; work chunking; mapping communication onto computation; following the parallelism profile; pipelined computation
● Synchronization: avoiding synchronization; coarsening synchronization; race-and-resolve mechanism; combining synchronization


GPU Optimization Principles

These optimization principles are critical for high-performing irregular GPU computations.


Barrier-based Synchronization

➔ Disjoint accesses
● Overlapping accesses
● Benign overlaps

Consider threads pushing elements into a worklist:
● atomic per element: O(e) atomics
● atomic per thread: O(t) atomics
● prefix-sum: O(log t) barriers


Barrier-based Synchronization

● Disjoint accesses
➔ Overlapping accesses
● Benign overlaps

Consider threads trying to own a set of elements:
● atomic per element
● race and resolve: non-atomic mark, then check (e.g., for owning cavities in Delaunay mesh refinement)
● race and resolve: prioritized mark, then check (e.g., for inserting unique elements into a worklist)


Barrier-based Synchronization

● Disjoint accesses
● Overlapping accesses
➔ Benign overlaps

Consider threads updating shared variables to the same value: the update can be performed with atomics or without atomics (e.g., level-by-level breadth-first search).

[Figure: a BFS tree over nodes a through g]


Exploiting Algebraic Properties

➔ Monotonicity
● Idempotency
● Associativity

Consider threads updating distances in shortest paths computation.

[Figure, three panels: an atomic-free update; the lost-update problem; correction by topology-driven processing, exploiting monotonicity]


Exploiting Algebraic Properties

● Monotonicity
➔ Idempotency
● Associativity

Consider threads updating distances in shortest paths computation: a node may be updated by multiple threads, appear as multiple instances in the worklist, and be processed by multiple threads; idempotency keeps the repeated processing harmless.


Exploiting Algebraic Properties

● Monotonicity
● Idempotency
➔ Associativity

Consider threads pushing points-to information to a pointer node: threads t1..t4 push {x}, {y}, {z,v}, and {m,n}, which combine into {x,y,z,v,m,n}. Associativity helps push information using prefix-sum.


Other Synchronization Methods

➔ Partitioning
● Scatter-gather

Delaunay Mesh Refinement:
● Layout-based graph partitioning is expensive
● Partitioning time may be comparable to the overall running time
● Partitioning enables a pull-based approach (instead of push-based)


Other Synchronization Methods

● Partitioning
➔ Scatter-gather

Consider threads pushing elements into a worklist:
● atomic per element: O(e) atomics
● atomic per thread: O(t) atomics
● prefix-sum: O(log t) barriers
● scatter, then gather


Outline

● Introduction
● Graph algorithms on GPUs
● Morph algorithms on GPUs
➔ Synthesizing irregular algorithms for GPUs


Hand-Written BFS

...
hedgessrcdst = (unsigned int *)malloc(nedges * sizeof(unsigned int));
hpsrc = (unsigned int *)calloc(nnodes, sizeof(unsigned int));
hdist = (float *)malloc(nnodes * sizeof(float));
...
// Read graph.
...

Memory allocation on host.


Hand-Written BFS

...
cudaMalloc((void **)&edgessrcdst, nedges * sizeof(unsigned int));
cudaMalloc((void **)&psrc, nnodes * sizeof(unsigned int));

cudaMemcpy(edgessrcdst, hedgessrcdst, nedges * sizeof(unsigned int), cudaMemcpyHostToDevice);
cudaMemcpy(edgessrcwt, hedgessrcwt, nedges * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(psrc, hpsrc, nnodes * sizeof(unsigned int), cudaMemcpyHostToDevice);
...

Memory allocation on device; explicit memory transfer.


Hand-Written BFS

...
do {
    hchanged = false;
    cudaMemcpy(changed, &hchanged, sizeof(bool), cudaMemcpyHostToDevice);
    relaxEdge<<<nblocks, blocksize>>>(dist, edgessrcdst, psrc, nedges, nnodes, changed);
    cudaThreadSynchronize();
    cudaMemcpy(&hchanged, changed, sizeof(bool), cudaMemcpyDeviceToHost);
} while (hchanged);
...

Transfer to GPU before each kernel launch; explicit barrier after it; transfer from GPU to read the termination flag.


Hand-Written BFS

...
do {
    hchanged = false;
    cudaMemcpy(changed, &hchanged, sizeof(bool), cudaMemcpyHostToDevice);
    relaxEdge<<<nblocks, blocksize>>>(dist, edgessrcdst, psrc, nedges, nnodes, changed);
    cudaThreadSynchronize();
    cudaMemcpy(&hchanged, changed, sizeof(bool), cudaMemcpyDeviceToHost);
} while (hchanged);

cudaMemcpy(hdist, dist, nnodes * sizeof(float), cudaMemcpyDeviceToHost);
...

Transfer from GPU.


SSSP Specification in Elixir

Graph [nodes(node: Node, dist: int),
       edges(src: Node, dst: Node, wt: int)]
source: Node

initDist = [nodes(node n, dist d)] →
           [d = if (n == source) 0 else ∞]

relaxEdge = [nodes(node p, dist pd),
             nodes(node q, dist qd),
             edges(src p, dst q, wt w),
             pd + w < qd] →
            [qd = pd + w]

init = foreach initDist
bfs = iterate relaxEdge
main = init; bfs

Graph rewrite rules; kernels are generated. The graph layout and the processing order are left unspecified.


SSSP Specification in Green-Marl

...
G.dist = (G == root) ? 0 : +INF;
G.updated = (G == root) ? True : False;
G.dist_nxt = G.dist;
G.updated_nxt = G.updated;

While (!fin) {
    fin = True;
    Foreach(n: G.Nodes)(n.updated) {
        Foreach(s: n.Nbrs) {
            Edge e = s.ToEdge(); // the edge to s
            // updated_nxt becomes true only if dist_nxt is actually updated
            <s.dist_nxt; s.updated_nxt> min= <n.dist + e.len; True>;
        }
    }
    G.dist = G.dist_nxt;
    G.updated = G.updated_nxt;
    G.updated_nxt = False;
    fin = !Exist(n: G.Nodes){n.updated};
}
...


Using Operators

[Figure: vertex e with in-neighbors a, b, c, d]

A = b @ w1
B = a @ w2
C = a @ w3
D = c @ w4
E = (d @ w5) # (b @ w6)

Assigning # to a thread is vertex-based. Assigning @ to a thread is edge-based.

BFS:  @ = +, # = min, wi = 1
SSSP: @ = +, # = min, wi = wt
PR:   @ = /, # = +, wi = outdegree


Language Features

● Automatic memory management: memory allocation on host and device; memory transfer between devices
● Automatic kernel configuration: easy overriding
● Intelligent barrier insertion: automatic synchronization


Performance of Synthesized BFS

Graph [nodes(node: Node, dist: int),
       edges(src: Node, dst: Node)]
source: Node
initDist = [nodes(node n, dist d)] →
           [d = if (n == source) 0 else ∞]
relaxEdge = [nodes(node p, dist pd),
             nodes(node q, dist qd),
             edges(src p, dst q),
             pd + 1 < qd] →
            [qd = pd + 1]
init = foreach initDist
bfs = iterate relaxEdge
main = init; bfs

The graph rewrite rules are compiled into kernels; the graph layout and the processing order are left unspecified.

Hand-written multicore: 100%; hand-written GPU: 29%; synthesized GPU: 41% (relative to the hand-written multicore version).


The Road Ahead

● Static graph algorithms
● Dynamic graph algorithms
● Language extensions for morph algorithms
● Optimizations for special classes of graphs
● Applications


Parallel Graph Algorithms

Rupesh Nasre.
[email protected]

NITK, February 2016


Todo

● Literature survey
  ● higher-level picture
  ● irregularity definition; earlier work by Yelick, Runger
  ● multi-core status; TAO
● Morph algorithms
  ● benchmark suite: show properties of each benchmark
  ● optimizations
  ● synthesis
  ● current work: IR for multicore and GPU
  ● add-ons: data-topology, workload
  ● cite papers
  ● avoid sentences at the bottom of the slides


Subgraph Deletion (3)

● Marking: when deletion is infrequent; deleted memory remains unavailable (e.g., SP)
● Explicit deletion: when deletion is frequent and deletions are local; global deletion may require expensive compaction (e.g., DMR)
● Recycle: when the new subgraph fits in the memory of the old subgraph; moderately expensive


Conflict Resolution (4)

1. Mark triangles
   (barrier)
2. Priority-check markings
   (barrier)
3. Check markings

[Figure: a mesh in which competing threads have marked triangles with 0s and 1s]


Effect of Optimizations

Delaunay Mesh Refinement. Input: mesh with 10 million triangles with around 50% bad triangles.

[Bar chart: speedup (0 to 80) as each optimization is applied]
● Base with partitioning
● Race and resolve
● With atomic-free global barrier
● Optimized memory layout
● Adaptive configuration
● Load balancing
● Single-precision arithmetic
● On-demand memory reallocation