
Design Principles for Sparse Matrix Multiplication on the GPU

European Conference on Parallel and Distributed Computing

Carl Yang 1,3, Aydın Buluç 2,3 and John D. Owens 1

1 University of California, Davis
2 University of California, Berkeley

3 Lawrence Berkeley National Laboratory

August 30, 2018


Overview

1 Problem

2 Related work

3 Row split

4 Merge-based

5 Evaluation

6 Conclusion and Future Work

Problem

What’s in common?

(a) Social network analysis (b) Bioinformatics

(c) Computer forensics (d) Recommender systems


Problem: Why high-performance?

Van Den Heuvel, Martijn P., and Olaf Sporns. “Rich-club organization of the human connectome.” Journal of Neuroscience 31.44 (2011): 15775-15786.

Solution: Graph frameworks

CPU: Pregel, GraphLab, Cyclops, GraphX, Powergraph, GoFFish, Blogel, Gremlin, Haloop, Apache Giraph, Apache Hama, GPS, Mizan, Giraphx, Seraph, GiraphUC, Pregel+, Pregelix, Apache Tinkerpop, LFGraph, Gelly, Trinity, Ligra, ...

GPU: Gunrock, Medusa, Totem, Frog, VertexAPI2, MapGraph, CuSha, ...


Problem: What are the right primitives?

1 Concise

2 Portable

3 High-performance

4 Expressible


Thesis statement

Linear algebra is the right way to think about graph algorithms.

Graph primitives based on linear algebra are superior to ones based on existing vertex-centric graph frameworks in conciseness, portability, performance and expressibility.


Graph traversal is sparse matrix multiplication

1 Dénes Kőnig, 1931
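To make the duality concrete (an illustration added here, not taken from the slides): one breadth-first-search step is a sparse matrix-vector product over the Boolean semiring, where the frontier is a sparse vector and multiplying by the adjacency matrix returns the neighboring vertices. A minimal host-side sketch, assuming a CSR adjacency matrix and hypothetical names:

```cuda
#include <vector>

// Illustrative only: one BFS step as y = A^T x over the Boolean semiring (OR, AND).
// A is stored in CSR; frontier marks the vertices reached in the previous step.
void bfs_step(int num_vertices,
              const std::vector<int>& row_ptr,    // CSR row offsets of A
              const std::vector<int>& col_idx,    // CSR column indices of A
              const std::vector<char>& frontier,  // x: current frontier (0/1)
              std::vector<char>& next)            // y: neighbors discovered this step
{
  next.assign(num_vertices, 0);
  for (int i = 0; i < num_vertices; ++i) {
    if (!frontier[i]) continue;                   // "multiply": skip zero entries of x
    for (int jj = row_ptr[i]; jj < row_ptr[i + 1]; ++jj)
      next[col_idx[jj]] = 1;                      // "add": Boolean OR
  }
}
```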


Ingredient 1: Operations

(a) SpMV (Yang et al., ICPP ’18)

(b) SpMM (Yang et al., Euro-Par ’18)

(c) SpMSpV (Yang et al., IPDPSW ’15, ICPP ’18)

(d) SpGEMM


Ingredient 2: Operators

Semiring notation: (Add, Multiply, Domain)

Add: How to combine edges

Multiply: How to combine vertex with incident edges

Domain: Vertex/edge attributes

Name         Semiring             Application
Real field   {+, ×, R}            Classical numerical linear algebra
Boolean      {|, &, {0, 1}}       Graph connectivity
Tropical     {min, +, R ∪ {∞}}    Shortest path
Max-plus     {max, +, R}          Graph matching
Min-times    {min, ×, R}          Maximal independent set
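One way to see how the semiring plugs in (an illustrative sketch, not the GraphBLAS API): template the multiply routine on the (Add, Multiply, Domain) triple so the same CSR SpMV loop serves the real field, the tropical semiring for shortest paths, and so on. All names below are assumptions for illustration:

```cuda
#include <limits>

// Hypothetical semiring functors; names are illustrative only.
struct PlusTimes {   // real field: classical linear algebra
  static float identity() { return 0.0f; }
  static float add(float a, float b) { return a + b; }
  static float mul(float a, float b) { return a * b; }
};
struct MinPlus {     // tropical semiring: shortest paths
  static float identity() { return std::numeric_limits<float>::infinity(); }
  static float add(float a, float b) { return a < b ? a : b; }
  static float mul(float a, float b) { return a + b; }
};

// CSR SpMV parameterized by the semiring.
template <typename Semiring>
void spmv(int m, const int* row_ptr, const int* col_idx, const float* val,
          const float* x, float* y) {
  for (int i = 0; i < m; ++i) {
    float acc = Semiring::identity();
    for (int jj = row_ptr[i]; jj < row_ptr[i + 1]; ++jj)
      acc = Semiring::add(acc, Semiring::mul(val[jj], x[col_idx[jj]]));
    y[i] = acc;
  }
}
```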


Operations + Operators = GraphBLAS

Related work

Thread-level parallelism (TLP)

It is common knowledge that launching 256 threads instead of 1 thread on the GPU will give more parallelism:
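A minimal CUDA sketch of what that means (illustrative, not from the talk): the same vector addition written with a single thread and with 256 threads per block.

```cuda
// Illustrative only: exposing thread-level parallelism (TLP).
__global__ void add_one_thread(const float* a, const float* b, float* c, int n) {
  for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];   // one thread does all n elements
}

__global__ void add_many_threads(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;    // each thread does one element
  if (i < n) c[i] = a[i] + b[i];
}

// add_one_thread<<<1, 1>>>(a, b, c, n);                      // no TLP
// add_many_threads<<<(n + 255) / 256, 256>>>(a, b, c, n);    // 256 threads per block
```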


Instruction-level parallelism (ILP)

It is less well-known that you can also use parallelism among instructions in a single thread:
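A sketch of the same idea (illustrative): each thread handles several independent elements, so its loads have no dependences between them and their latencies can overlap.

```cuda
// Illustrative only: instruction-level parallelism (ILP) within one thread.
// Each thread processes 4 independent elements; the 4 loads do not depend on
// one another, so the hardware can keep them in flight at the same time.
__global__ void add_ilp4(const float* a, const float* b, float* c, int n) {
  int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
  if (i + 3 < n) {                                   // tail handling omitted for brevity
    float a0 = a[i], a1 = a[i + 1], a2 = a[i + 2], a3 = a[i + 3];
    float b0 = b[i], b1 = b[i + 1], b2 = b[i + 2], b3 = b[i + 3];
    c[i]     = a0 + b0;
    c[i + 1] = a1 + b1;
    c[i + 2] = a2 + b2;
    c[i + 3] = a3 + b3;
  }
}
```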


Typical SpMM memory access pattern


Row split: Loading A


Row split: Loading B is dependent on A

(a) Sparse matrix A (b) Dense matrix B

Memory accesses are not coalesced. ILP: 1 (non-existent).
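A sketch of the conventional pattern being described (illustrative, one thread per row of A): neighboring threads in a warp walk unrelated rows of B, so the accesses to B do not coalesce, and each thread's loads form a serial dependence chain.

```cuda
// Illustrative only: naive CSR SpMM, one thread per row of A.
// C (m x n) = A (m x k, CSR) * B (k x n, row-major dense).
__global__ void spmm_naive(int m, int n,
                           const int* A_row_ptr, const int* A_col_idx,
                           const float* A_val, const float* B, float* C) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= m) return;
  for (int j = 0; j < n; ++j) {                      // one output column at a time
    float acc = 0.0f;
    for (int jj = A_row_ptr[row]; jj < A_row_ptr[row + 1]; ++jj)
      acc += A_val[jj] * B[A_col_idx[jj] * n + j];   // neighboring threads touch
                                                     // unrelated rows of B: uncoalesced
    C[row * n + j] = acc;
  }
}
```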


Anything wrong with this?


Load-balancing can be an issue


Can we do better?

Row split

Use ILP that naturally exists in SpMM

Merge-based

Address load-balancing issue

Row split

A new SpMM memory access pattern

[Figure: memory access pattern for C = A × B, where A is m × k (sparse), B is k × n (dense), and C is m × n. The grid is tiled so that Block.x 0, 1, 2, ... cover 32-wide column tiles of B and C, and Block.y 0, 1, ... cover groups of 4 rows, giving m/4 blocks in y.]

Row split: Our contribution

(a) Sparse matrix A (b) Dense matrix B

1. T0 broadcasts 0 to T1, T2, ..., T7

2. T0, T1, T2, ..., T7 do coalesced memory access for row 0

Next, T1 broadcasts 1 and T0, T1, ..., T7 do a coalesced memory access for row 1 of B; then T2 broadcasts 3 and T0, T1, ..., T7 do a coalesced memory access for row 3, and so on.

Access into B is fully coalesced. ILP: up to 32 (depending on the number of nonzeroes in the row).
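A simplified sketch of this access pattern (illustrative, not the paper's exact kernel): one warp is assigned a row of A and 32 consecutive columns of B; each nonzero's column index and value are broadcast across the warp with a shuffle, after which the 32 threads read 32 consecutive elements of that row of B, which coalesces. The sketch assumes blockDim.x and n are multiples of 32.

```cuda
// Illustrative sketch: one warp per row of A, 32 consecutive columns of B per block in y.
// C (m x n) = A (m x k, CSR) * B (k x n, row-major).
__global__ void spmm_row_split(int m, int n,
                               const int* A_row_ptr, const int* A_col_idx,
                               const float* A_val, const float* B, float* C) {
  int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per row of A
  int lane = threadIdx.x % 32;
  int col  = blockIdx.y * 32 + lane;                        // this thread's column of B/C
  if (warp >= m || col >= n) return;

  float acc = 0.0f;
  int row_start = A_row_ptr[warp], row_end = A_row_ptr[warp + 1];
  for (int base = row_start; base < row_end; base += 32) {
    // Each lane loads at most one nonzero of the row: coalesced read of A.
    int   a_col = (base + lane < row_end) ? A_col_idx[base + lane] : 0;
    float a_val = (base + lane < row_end) ? A_val[base + lane] : 0.0f;
    for (int t = 0; t < 32 && base + t < row_end; ++t) {
      // Broadcast lane t's column index and value to the whole warp ...
      int   c = __shfl_sync(0xffffffff, a_col, t);
      float v = __shfl_sync(0xffffffff, a_val, t);
      // ... then the 32 lanes read 32 consecutive entries of row c of B: coalesced.
      acc += v * B[c * n + col];
    }
  }
  C[warp * n + col] = acc;                                  // coalesced write
}

// Possible launch, assuming 256 threads per block (8 warps = 8 rows per block):
// spmm_row_split<<<dim3((m + 7) / 8, n / 32), 256>>>(m, n, row_ptr, col_idx, val, B, C);
```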


Row split: Main takeaways

Thinking about TLP and ILP is the right way to think about sparse matrix multiplication

Isolate memory reads from compute and from memory writes, because this allows ILP to materialize
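Schematically (an illustrative sketch of the second point, not code from the paper): issue all the loads for a small tile into registers first, then compute, then store, instead of chaining load-compute-store per element.

```cuda
// Illustrative only: separating memory reads, compute, and memory writes so
// that the loads are independent and can be in flight simultaneously.
__global__ void scale_tile(const float* __restrict__ in, float* out,
                           float alpha, int n) {
  int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
  if (base + 3 >= n) return;                       // tail handling omitted for brevity

  float r0, r1, r2, r3;
  // Phase 1: memory reads (4 independent loads per thread).
  r0 = in[base]; r1 = in[base + 1]; r2 = in[base + 2]; r3 = in[base + 3];
  // Phase 2: compute (no memory traffic).
  r0 *= alpha; r1 *= alpha; r2 *= alpha; r3 *= alpha;
  // Phase 3: memory writes.
  out[base] = r0; out[base + 1] = r1; out[base + 2] = r2; out[base + 3] = r3;
}
```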

Merge-based

Static work assignment

(a) Row split

(b) Merge-based

Merge-based SpMV, which assigns work dynamically, has been shown to be superior to row-split SpMV’s static work assignment. Will merge-based SpMM be superior too?

1 Duane Merrill and Michael Garland. “Merge-based parallel sparse matrix-vector multiplication.” SC ’16, IEEE: 678-689.

Merge-based

Two-level decomposition:

1 Partition nonzeroes evenly amongst blocks 0, 1, 2, 3

2 Global synchronization

3 Perform computation

4 Global synchronization

5 Carry out the partial-sum fix-up
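The even split in step 1 can be found with a merge-path (diagonal) binary search over the CSR row offsets and the running nonzero count, in the style of Merrill and Garland's merge-based SpMV. A host-side sketch under those assumptions (not the paper's exact code):

```cuda
#include <algorithm>

// Illustrative merge-path search: find where diagonal `d` crosses the merge of
// the row-end offsets row_ptr[1..m] with the nonzero ids 0..nnz-1. Assigning
// consecutive diagonals to consecutive blocks splits rows + nonzeroes evenly.
void merge_path_search(int d, int m, int nnz, const int* row_ptr,
                       int* row_out, int* nnz_out) {
  int lo = std::max(0, d - nnz);
  int hi = std::min(d, m);
  while (lo < hi) {                       // binary search along the diagonal
    int mid = (lo + hi) / 2;
    if (row_ptr[mid + 1] <= d - mid - 1)  // compare row-offset list vs. nonzero-id list
      lo = mid + 1;
    else
      hi = mid;
  }
  *row_out = lo;        // number of row boundaries before this diagonal
  *nnz_out = d - lo;    // number of nonzeroes before this diagonal
}
```

Block b then handles the merge items between diagonals b * items_per_block and (b + 1) * items_per_block, so every block receives the same amount of work regardless of row lengths.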

Evaluation

Experimental setup

CPU: 4-core Intel Xeon E5-2637 v2 @ 3.50 GHz, 556 GB RAM

GPU: NVIDIA K40c, 12GB RAM


Row split performance (62.5 nonzeroes per row)

[Figure: SpMM throughput (GFlops, 0-160) on matrices averaging 62.5 nonzeroes per row (StocF-1465, ldoor, bmw3_2, boneS10, F1, inline_1, bone010, audikw_1, bmwcra_1, crankseg_2), comparing cuSPARSE csrmm, cuSPARSE csrmm2, MAGMA SELL-P, and the proposed row-split kernel.]

Merge path performance (7.92 nonzeroes per row)

[Figure: SpMM throughput (GFlops, 0-50) on matrices averaging 7.92 nonzeroes per row (citationCiteseer, web-Stanford, wheel_601, c-big, FullChip, ASIC_320k, Stanford, ins2, neos3, circuit5M), comparing cuSPARSE csrmm, row-split, cuSPARSE csrmm2, and the proposed merge-based kernel.]

Heuristic to switch between row split and merge path

[Figure: speed-up versus mean row length (10^1 to 10^3) for row-split and merge-based SpMM, (a) without the heuristic (speed-ups ranging from 0.5x to 4.0x) and (b) with the heuristic (1.0x to 4.0x).]
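A sketch of what such a dispatch could look like (illustrative; the function names are placeholders, and the cutoff simply follows the later observation that merge-based helps most when nnz/row < 9):

```cuda
// Placeholders for the two kernel launchers (assumed to exist elsewhere).
void spmm_row_split(int m, int n, int nnz,
                    const int* row_ptr, const int* col_idx, const float* val,
                    const float* B, float* C);
void spmm_merge_based(int m, int n, int nnz,
                      const int* row_ptr, const int* col_idx, const float* val,
                      const float* B, float* C);

// Illustrative dispatch between the two SpMM strategies based on mean row length.
void spmm_dispatch(int m, int n, int nnz,
                   const int* row_ptr, const int* col_idx, const float* val,
                   const float* B, float* C) {
  float mean_row_length = static_cast<float>(nnz) / m;
  if (mean_row_length < 9.0f)
    spmm_merge_based(m, n, nnz, row_ptr, col_idx, val, B, C);  // load-balanced path
  else
    spmm_row_split(m, n, nnz, row_ptr, col_idx, val, B, C);    // coalesced, ILP-friendly path
}
```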

Conclusion and Future Work

Comparison with GEMM

[Figure: runtime (ms, 0-3500) versus percentage of nonzeroes in the matrix (1-15%), comparing cuSPARSE csrmm, cuSPARSE csrmm2, the proposed SpMM, and cuBLAS sgemm.]


Typical SpMM memory access pattern

The result matrix C is computed in one scan of A: for each nonzero entry a_ij, the product a_ij * b_j (row j of B) is added to c_i (row i of C).
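In code form (an illustrative sequential sketch of that statement, with C assumed zero-initialized):

```cuda
// Illustrative only: sequential SpMM by one scan of A in CSR form.
// For every nonzero a_ij, add a_ij * B(j, :) into C(i, :).
void spmm_scan(int m, int n,
               const int* row_ptr, const int* col_idx, const float* val,
               const float* B, float* C) {
  for (int i = 0; i < m; ++i)
    for (int jj = row_ptr[i]; jj < row_ptr[i + 1]; ++jj) {
      int j = col_idx[jj];
      for (int t = 0; t < n; ++t)
        C[i * n + t] += val[jj] * B[j * n + t];
    }
}
```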


Future direction: Adjustment for deep learning workloads

Keep B in memory and load A

m = number of slow memory refs = √(knM) · n²/M

Optimal for many nonzeroes per row: k ≥ nM

1 Greiner, Gero and Riko Jacob. “The I/O Complexity of Sparse Matrix Dense Matrix Multiplication.” LATIN ’10.

Future direction: Implicit load-balancing

Very challenging to explicitly load-balance a sparse matrix computation and generate ILP.

What if a load-balancing library could do the hard work for you and all you had to do was specify the computation?

What would the interface to such a library look like?


Conclusion

Thinking about TLP and ILP is the right way to think about sparse matrix multiplication

Isolate memory reads from compute and from memory writes, because this allows ILP to materialize

Two-phase decomposition is especially helpful for very sparse matrices (nnz/row < 9). It is a useful building block for a load-balancing library.


Acknowledgments

The Gunrock team

NVIDIA, for their generous hardware support

Funding support

NSF Award # CCF-1629657
DARPA XDATA program under US Army award W911QX-12-C-0059
DARPA HIVE program under agreement number FA8650-18-2-7836
LBNL’s DOE contract DE-AC02-05CH11231
DOE’s OASCR contract DE-AC02-05CH11231
NNSA and DOE’s Office of Science under the Exascale Computing Project 17-SC-20-SC


Questions?

Code is available at: https://github.com/owensgroup/merge-spmm


What’s wrong with existing implementations?

[Figure: SpMV performance (GFlops, 0-40, left axis) and SpMM performance (GFlops, 0-120, right axis) versus mean row length, ranging from 2 to 8,388,608.]

Poor performance on tall and skinny matrices (8M × 2) and on wide and fat matrices (2 × 8M).

Needs specialized sparse matrix formats (ELLPACK, SELL-P, etc.). Traditional TLP ways of hiding latency are insufficient, so use ILP too.


Benchmark revisited

[Figure: performance (GFlops, 0-175) versus mean row length (2 to 8,388,608) for cuSPARSE csrmm2, the proposed row split, Merrill’s merge-based SpMV, and the proposed merge-based SpMM.]

1 Duane Merrill and Michael Garland. “Merge-based parallel sparse matrix-vector multiplication.” SC ’16, IEEE: 678-689.

Why merge-based underperformed

Table: Number of independent instructions per thread (ILP), register usage andoverhead. RS stands for row split. MG stands for merge-based.

Operation                       SpMV RS   SpMV MG     SpMM RS       SpMM MG
ILP: Read A.col_ind and A.val   1         7           1             1
ILP: Read x / Read B            1         7           0 < L ≤ 32    32
ILP: Write y / Write C          1         7           1             32
Register usage                  2         14          64            64
Memory access overhead          0         A.nnz/896   0             2·A.nnz
