Parallel Graph Algorithms
Rupesh Nasre., IIT Madras
NITK, February 2016




Parallel Graph Algorithms

Rupesh Nasre., IIT Madras

NITK, February 2016


Graphs

● Where do we encounter graphs?
  ● Social networks, road connections, molecular interactions, planetary forces, ...
  ● Datasets: SNAP, Florida, DIMACS, KONECT, ...
● Why treat them separately?
  ● They provide structural information.
  ● They can be processed more efficiently.
● What challenges do they pose?
  ● Load imbalance, poor locality, ...
  ● Irregularity


What is IrReguLariTy?

● Data-access or control patterns are unpredictable at compile time.

Irregular data-access:

int a[N], b[N], c[N];
readinput(a);
c[5] = b[a[4]];

Irregular control-flow:

int a[N];
readinput(a);
if (a[4] > 30) {
    ...
}

Pointer-based data structures often contribute to irregularity. Such patterns need dynamic techniques.


Examples of Irregular Algorithms

[Figures: Delaunay mesh refinement (a triangulated mesh); points-to analysis (the code a = &x; b = &y; p = &a; *p = b; c = a and its points-to graph); minimum spanning tree computation (a weighted graph); survey propagation (a factor graph over clauses C1..C5 and variables X1..X5)]

Delaunay Mesh Refinement, Points-to Analysis, Minimum Spanning Tree Computation, Survey Propagation


Regular vs. Irregular Algorithms

Matrix Multiplication vs. Shortest Paths Computation

[Figure: a matrix product on the left; a small weighted graph on the right]

By knowing the matrix size and its starting address, and without knowing its values, we can determine the dynamic behavior of matrix multiplication. For shortest paths, the dynamic behavior depends on the input graph. This results in pessimistic synchronization.


Irregularity is a Matter of Degree

[Scatter plot: control-flow irregularity (0% to 4.0%) plotted against memory-access irregularity (0% to 80%)]



Dynamic Techniques

[Figure: a graph with nodes a through j; one node is the active node, and the shaded region around it is its neighborhood]

Non-overlapping neighborhoods can be processed in parallel.

Overlapping neighborhoods require synchronization.

Leads to optimistic and cautious parallelizations.

Question: Given an algorithm, can you schedule its parallel processing such that neighborhoods never overlap?


Parallelism Approaches

● Manual
● Libraries: Galois, Ligra, LonestarGPU, Totem, ...
● Domain-Specific Languages: Green-Marl, Elixir, BlackBuck, ...

Moving from manual approaches toward DSLs increases productivity; moving the other way favors performance.


Specifying Parallelism

● Do not specify: sequential input, completely automated; currently very challenging in general
● Implicit parallelism: aggregates, aggregate functions, value-based processing, ...
● Explicit parallelism: pthreads, MPI, ...


Identifying Dependence

for (ii = 0; ii < 10; ++ii) {
    a[2 * ii] = ... a[2 * ii + 1] ...
}

Is there a flow dependence between different iterations?

Dependence equations:

    0 <= ii_w < ii_r < 10
    2 * ii_w = 2 * ii_r + 1

which can be written as

    0 <= ii_w
    ii_w <= ii_r - 1
    ii_r <= 9
    2 * ii_w <= 2 * ii_r + 1
    2 * ii_r + 1 <= 2 * ii_w

or, in matrix form,

    [ -1   0 ]               [  0 ]
    [  1  -1 ]   [ ii_w ]    [ -1 ]
    [  0   1 ] · [ ii_r ] <= [  9 ]
    [  2  -2 ]               [  1 ]
    [ -2   2 ]               [ -1 ]

A dependence exists if the system has an integer solution. Here 2 * ii_w is even while 2 * ii_r + 1 is odd, so the system has no solution: the iterations are independent.


Outline

● Introduction➔ Graph algorithms on GPUs● Morph algorithms on GPUs● Synthesizing graph algorithms for GPUs


What is a GPU?

● Graphics Processing Unit
● Separate piece of hardware, connected using a bus
● Separate address space from that of the CPU
● Massive multithreading
● Warp-based execution


What is a Warp?

Source: Wikipedia


GPU Computation Hierarchy

The hierarchy, with the number of threads at each level:

Thread           1
Warp             32
Block            1024
Multi-processor  tens of thousands
GPU              hundreds of thousands


Challenges with GPUs

● Warp-based execution
● Locking is expensive
● Dynamic memory allocation is costly
● Limited data-cache
● Programmability issues
  ● separate address space
  ● no recursion support
  ● complex computation hierarchy
  ● exposed memory hierarchy
  ● ...


Challenges in Graph Algorithms

● Synchronization
  ● locks are prohibitively expensive on GPUs
  ● atomic instructions quickly become expensive
● Memory latency
  ● locality is difficult to exploit
  ● low caching support
● Thread-divergence
  ● work done per node varies with graph structure
● Uncoalesced memory accesses
  ● warp-threads access arbitrary graph elements


Irregular Algorithms on GPUs

[Figure: an example weighted graph]

Single-source shortest paths:

sssp(src):
    dsrc = distance[src]
    forall edges (src, dst, wt)
        relaxEdge(src, dst, wt)


Irregular Algorithms on GPUs

[Figure: the same example weighted graph]

Single-source shortest paths:

sssp(src):
    dsrc = distance[src]
    forall edges (src, dst, wt)
        ddst = distance[dst]
        altdist = dsrc + wt
        if altdist < ddst
            distance[dst] = altdist

This is a memory-bound kernel.

Kernel unrolling:

Instead of operating on a node, a thread operates on a subgraph.


Improving Synchronization

[Figure: push-based vs. pull-based updates; three panels with threads tfive and tseven updating a shared distance]

Atomic-free update; the lost-update problem (two threads write concurrently and one update is lost); correction by topology-driven processing, exploiting monotonicity.


Irregular Algorithms on GPUs

Breadth-first search, single-source shortest paths, Barnes-Hut n-body simulation

● Better memory layout
● Kernel unrolling
● Local worklists
● Improved synchronization

Application  Speedup
BFS          48
BH           90
SSSP         45


Outline

● Introduction
● Graph algorithms on GPUs (e.g., breadth-first search, shortest paths)
➔ Morph algorithms on GPUs (e.g., mesh refinement, pointer analysis)
● Synthesizing irregular algorithms for GPUs


Identify the Celebrity

Source: Wikipedia


What is a morph?

Source: Wikipedia


Examples of Morph Algorithms

[Figures: Delaunay mesh refinement, points-to analysis, minimum spanning tree computation, survey propagation]

Delaunay Mesh Refinement, Points-to Analysis, Minimum Spanning Tree Computation, Survey Propagation


TAO Classification

Classification of algorithms on multi-core and GPU:

● Regular (matrix multiplication, stencils)
● Irregular, static-graph (breadth-first search, shortest paths)
● Morph (mesh refinement, survey propagation)

The TAO of Parallelism in Algorithms, Pingali et al. [PLDI 2011]


First Attempt

● Ported multi-core Galois benchmarks to CUDA
● The GPU version's performance was comparable to that of the single-threaded version

[Chart: execution time (seconds) vs. number of CPU threads (1 to 48) for Serial (Shewchuck, Berkeley), GPU Ported, and Multicore Galois]

DMR on a mesh with 10 million triangles with around 50% bad triangles


Challenges in Morph Algorithms

● Synchronization
  ● locks are prohibitively expensive on GPUs
  ● atomic instructions quickly become expensive
● Memory allocation
  ● changing graph structure requires new strategies
  ● memory requirement cannot be predicted
● Load imbalance
  ● different modifications to different parts of the graph
  ● work done per node changes dynamically
  ● leads to thread-divergence and uncoalesced memory accesses


Performance Considerations

1. Graph representation
2. Subgraph addition
3. Subgraph deletion
4. Conflict resolution
5. Load balancing


Graph Representation (1)

● Delaunay Mesh Refinement: points, triangles, and neighbors
● Survey Propagation: clauses-to-literals and literals-to-clauses maps
● Points-to Analysis: pointers and their points-to sets
● Minimum Spanning Tree Computation: the graph's edges and its components


Subgraph Addition (2)

● Pre-allocation: when the final graph has bounded size; overallocation (e.g., MST, DMR)
● Host-only: when the next size is predictable on the host; global allocation; may be expensive, but can be optimized (e.g., PTA)
● Kernel-host: when the next size is predictable; global allocation; size computation can piggyback (e.g., DMR)
● Kernel-only: general purpose; local allocation; expensive, but can be optimized (e.g., PTA)


Performance

[Chart: execution time (seconds) vs. number of CPU threads (1 to 48) for Serial (Shewchuck, Berkeley), Multi-core Galois, GPU Ported, and GPU Optimized]

Application  Speedup
DMR          80
MST          20
PTA          23
SP           25

Optimizations applied: (1) graph representation, (2) subgraph addition, (3) subgraph deletion, (4) conflict resolution, (5) load balancing.


GPU Optimization Principles

● Memory: kernel transformations; data grouping; exploiting the memory hierarchy
● Computation: algorithm selection; work sorting; work chunking; mapping communication onto computation; following the parallelism profile; pipelined computation
● Synchronization: avoiding synchronization; coarsening synchronization; race-and-resolve mechanism; combining synchronization


GPU Optimization Principles

These optimization principles are critical for high-performing irregular GPU computations.


Barrier-based Synchronization

➔ Disjoint accesses
● Overlapping accesses
● Benign overlaps

Consider threads pushing elements into a worklist:
● atomic per element: O(e) atomics
● atomic per thread: O(t) atomics
● prefix-sum: O(log t) barriers


Barrier-based Synchronization

● Disjoint accesses
➔ Overlapping accesses
● Benign overlaps

Consider threads trying to own a set of elements:
● atomic per element
● race and resolve: non-atomic mark, then check (e.g., for owning cavities in Delaunay mesh refinement)
● race and resolve: prioritized mark, then check (e.g., for inserting unique elements into a worklist)


Barrier-based Synchronization

● Disjoint accesses
● Overlapping accesses
➔ Benign overlaps

Consider threads updating shared variables to the same value: the update can be performed with atomics or without atomics (e.g., level-by-level breadth-first search).

[Figure: a BFS tree over nodes a through g]


Exploiting Algebraic Properties

➔ Monotonicity
● Idempotency
● Associativity

Consider threads updating distances in shortest paths computation.

[Figure, three panels: an atomic-free update; the lost-update problem; correction by topology-driven processing, exploiting monotonicity]


Exploiting Algebraic Properties

● Monotonicity
➔ Idempotency
● Associativity

Consider threads updating distances in shortest paths computation: a node may be updated by multiple threads, appear as multiple instances in the worklist, and be processed by multiple threads; idempotency keeps the repeated processing harmless.


Exploiting Algebraic Properties

● Monotonicity
● Idempotency
➔ Associativity

Consider threads pushing points-to information to a pointer node: threads t1..t4 push {x}, {y}, {z,v}, and {m,n}, which combine into {x,y,z,v,m,n}. Associativity helps push information using prefix-sum.


Other Synchronization Methods

➔ Partitioning
● Scatter-gather

Delaunay Mesh Refinement:
● Layout-based graph partitioning is expensive
● Partitioning time may be comparable to the overall running time
● Partitioning enables a pull-based approach (instead of push-based)


Other Synchronization Methods

● Partitioning
➔ Scatter-gather

Consider threads pushing elements into a worklist:
● atomic per element: O(e) atomics
● atomic per thread: O(t) atomics
● prefix-sum: O(log t) barriers
● scatter, then gather


Outline

● Introduction
● Graph algorithms on GPUs
● Morph algorithms on GPUs
➔ Synthesizing irregular algorithms for GPUs


Hand-Written BFS

...
hedgessrcdst = (unsigned int *)malloc(nedges * sizeof(unsigned int));
hpsrc = (unsigned int *)calloc(nnodes, sizeof(unsigned int));
hdist = (float *)malloc(nnodes * sizeof(float));
...
// Read graph.
...

Memory allocation on host.


Hand-Written BFS

...
cudaMalloc((void **)&edgessrcdst, nedges * sizeof(unsigned int));
cudaMalloc((void **)&psrc, nnodes * sizeof(unsigned int));

cudaMemcpy(edgessrcdst, hedgessrcdst, nedges * sizeof(unsigned int), cudaMemcpyHostToDevice);
cudaMemcpy(edgessrcwt, hedgessrcwt, nedges * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(psrc, hpsrc, nnodes * sizeof(unsigned int), cudaMemcpyHostToDevice);
...

Memory allocation on device; explicit memory transfer.


Hand-Written BFS

...
do {
    hchanged = false;
    cudaMemcpy(changed, &hchanged, sizeof(bool), cudaMemcpyHostToDevice);
    relaxEdge<<<nblocks, blocksize>>>(dist, edgessrcdst, psrc, nedges, nnodes, changed);
    cudaThreadSynchronize();
    cudaMemcpy(&hchanged, changed, sizeof(bool), cudaMemcpyDeviceToHost);
} while (hchanged);
...

Transfer to GPU before each kernel launch; explicit barrier after it; transfer from GPU to read the termination flag.


Hand-Written BFS

...
do {
    hchanged = false;
    cudaMemcpy(changed, &hchanged, sizeof(bool), cudaMemcpyHostToDevice);
    relaxEdge<<<nblocks, blocksize>>>(dist, edgessrcdst, psrc, nedges, nnodes, changed);
    cudaThreadSynchronize();
    cudaMemcpy(&hchanged, changed, sizeof(bool), cudaMemcpyDeviceToHost);
} while (hchanged);

cudaMemcpy(hdist, dist, nnodes * sizeof(float), cudaMemcpyDeviceToHost);
...

Transfer from GPU.


SSSP Specification in Elixir

Graph [nodes(node: Node, dist: int),
       edges(src: Node, dst: Node, wt: int)]
source: Node

initDist = [nodes(node n, dist d)] →
           [d = if (n == source) 0 else ∞]

relaxEdge = [nodes(node p, dist pd),
             nodes(node q, dist qd),
             edges(src p, dst q, wt w),
             pd + w < qd] →
            [qd = pd + w]

init = foreach initDist
bfs = iterate relaxEdge
main = init; bfs

Graph rewrite rules; kernels are generated. The graph layout and the processing order are left unspecified.


SSSP Specification in Green-Marl

...
G.dist = (G == root) ? 0 : +INF;
G.updated = (G == root) ? True : False;
G.dist_nxt = G.dist;
G.updated_nxt = G.updated;

While (!fin) {
    fin = True;
    Foreach(n: G.Nodes)(n.updated) {
        Foreach(s: n.Nbrs) {
            Edge e = s.ToEdge(); // the edge to s
            // updated_nxt becomes true only if dist_nxt is actually updated
            <s.dist_nxt; s.updated_nxt> min= <n.dist + e.len; True>;
        }
    }
    G.dist = G.dist_nxt;
    G.updated = G.updated_nxt;
    G.updated_nxt = False;
    fin = !Exist(n: G.Nodes){n.updated};
}
...


Using Operators

[Figure: vertex e with in-neighbors a, b, c, d]

A = b @ w1
B = a @ w2
C = a @ w3
D = c @ w4
E = (d @ w5) # (b @ w6)

Assigning # to a thread is vertex-based. Assigning @ to a thread is edge-based.

BFS:  @ = +, # = min, wi = 1
SSSP: @ = +, # = min, wi = wt
PR:   @ = /, # = +, wi = outdegree


Language Features

● Automatic memory management: memory allocation on host and device; memory transfer between devices
● Automatic kernel configuration: easy overriding
● Intelligent barrier insertion: automatic synchronization


Performance of Synthesized BFS

Graph [nodes(node: Node, dist: int),
       edges(src: Node, dst: Node)]
source: Node
initDist = [nodes(node n, dist d)] →
           [d = if (n == source) 0 else ∞]
relaxEdge = [nodes(node p, dist pd),
             nodes(node q, dist qd),
             edges(src p, dst q),
             pd + 1 < qd] →
            [qd = pd + 1]
init = foreach initDist
bfs = iterate relaxEdge
main = init; bfs

The graph rewrite rules are compiled into kernels; the graph layout and the processing order are left unspecified.

Hand-written multicore: 100%; hand-written GPU: 29%; synthesized GPU: 41% (relative to the hand-written multicore version).


The Road Ahead

● Static graph algorithms
● Dynamic graph algorithms
● Language extensions for morph algorithms
● Optimizations for special classes of graphs
● Applications


Parallel Graph Algorithms

Rupesh Nasre.
[email protected]

NITK, February 2016


Todo

● Literature survey
  ● higher-level picture
  ● irregularity definition; earlier work by Yelick, Runger
  ● multi-core status; TAO
● Morph algorithms
  ● benchmark suite: show properties of each benchmark
  ● optimizations
  ● synthesis
  ● current work: IR for multicore and GPU
  ● add-ons: data-topology, workload
  ● cite papers
  ● avoid sentences at the bottom of the slides


Subgraph Deletion (3)

● Marking: when deletion is infrequent; deleted memory remains unavailable (e.g., SP)
● Explicit deletion: when deletion is frequent and deletions are local; global deletion may require expensive compaction (e.g., DMR)
● Recycle: when the new subgraph fits in the memory of the old subgraph; moderately expensive


Conflict Resolution (4)

1. Mark triangles
   (barrier)
2. Priority-check markings
   (barrier)
3. Check markings

[Figure: a mesh in which competing threads have marked triangles with 0s and 1s]


Effect of Optimizations

Delaunay Mesh Refinement. Input: mesh with 10 million triangles with around 50% bad triangles.

[Bar chart: speedup (0 to 80) as each optimization is applied]
● Base with partitioning
● Race and resolve
● With atomic-free global barrier
● Optimized memory layout
● Adaptive configuration
● Load balancing
● Single-precision arithmetic
● On-demand memory reallocation