
When Sparse Matrices Met Heterogeneous Processors: Opportunities and Challenges

Weifeng Liu, Norwegian University of Science and Technology, Norway

March 10th, 2018, Tokyo, Japan

2

Background

Sparse Matrix &

Heterogeneous Processor (Tightly-coupled CPU-GPU Chip)

3

Sparse matrix
•  If the majority of entries in a matrix are zeros, we can store the matrix using a sparse representation and record only the nonzero entries.
•  The Compressed Sparse Row (CSR) format is the most widely used.

A (4x4), sparse, 6 nonzeros — store only the nonzero entries:
  . . 1 .
  2 3 . .
  . . . .
  4 . 5 6

A (4x4), dense — store every entry, including the zeros:
  0 0 1 0
  2 3 0 0
  0 0 0 0
  4 0 5 6

A (in CSR) (4x4), 6 nonzeros:
  value        = [1 2 3 4 5 6]
  column index = [2 0 1 0 2 3]
  row pointer  = [0 1 3 3 6]
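As a concrete restatement of the CSR arrays above (a minimal sketch in C, not code from the talk; the array names are illustrative):

    /* The 4x4 example matrix above stored in CSR, 0-based indices. */
    double value[]        = {1, 2, 3, 4, 5, 6};
    int    column_index[] = {2, 0, 1, 0, 2, 3};
    int    row_pointer[]  = {0, 1, 3, 3, 6};  /* row i owns entries row_pointer[i] .. row_pointer[i+1]-1 */

Row 2 of the example is empty, which is why row_pointer[2] == row_pointer[3] == 3.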

4

CPU chip

5

GPU chip

6

Loosely-coupled CPU-GPU platform

7

Tightly-coupled CPU-GPU chip

8

Experimental Setup

One APU (AMD R5-2400G) &

Four Sparse Matrix Kernels (OMP/OCL) &

956 Sparse Matrices (from SuiteSparse)

9

Testbed

AMD Ryzen 5 2400G (Zen + Vega): quad-core CPU with 4MB L3 cache, 704-core GCN GPU, dual-channel 16GB DDR4-2933, 46.9 GB/s bandwidth, MS Windows 10.

NVIDIA GeForce Titan X (Pascal GP102): 3584 CUDA cores, 3MB L2 cache, 12GB GDDR5X, 480 GB/s bandwidth (only used for some performance comparisons).

10

Four sparse kernels

•  Sparse matrix-vector multiplication (SpMV): y = A·x
•  Sparse transposition (SpTRANS): B = Aᵀ
•  Sparse matrix-matrix multiplication (SpGEMM): C = A·B
•  Sparse triangular solve (SpTRSV): solve Lx = b for x

[Figure: the 4x4 running example applied to each of the four kernels; each example reappears on the corresponding kernel slides below]

11

956 matrices from SuiteSparse Collection

square matrices, 100K <= number of nonzeros <= 200M

956 matrices selected from 2757 matrices in total

12

Kernel 1. Sparse Matrix-Vector Multiplication

(SpMV)

13

SpMV
•  Multiply a sparse matrix A by a dense vector x, and obtain a resulting dense vector y (a sequential reference implementation is sketched below).

A (4x4), sparse, 6 nonzeros      x (4x1), dense      y (4x1), dense
  . . 1 .                          a                   1c
  2 3 . .             x            b          =        2a+3b
  . . . .                          c                   0
  4 . 5 6                          d                   4a+5c+6d
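For reference, a sequential CSR SpMV in C (a minimal sketch, not the OMP/OCL kernels benchmarked in this talk):

    /* y = A*x with A in CSR; m is the number of rows. */
    void spmv_csr(int m, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++)
                sum += val[p] * x[col_idx[p]];   /* gather x through the column index */
            y[i] = sum;
        }
    }

Run on the example arrays above with x = {a, b, c, d}, it produces y = {1c, 2a+3b, 0, 4a+5c+6d}.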

14

Question, Experiment, Opportunity and Challenge

•  Sparse matrix-vector multiplication (SpMV): y = A·x (the example above)

Question 1: Which side (CPU or GPU) gives the best performance?
Experiment 1: Test four SpMV methods (CSR-omp, CSR5-omp, CSR-adaptive-opencl, CSR5-opencl) on the CPU side and on the GPU side.

Opportunity 1 and Challenge 1

15

Parallel SpMV (using the CSR format)
•  Assign a row block to one core of a many-core processor (see the sketch below).

[Figure: the example matrix split into four row blocks, one per core (core 0 – core 3)]

Probably imbalanced computation and degraded performance on many-core processors.

Nathan Bell, Michael Garland. Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. SC09.
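A row-parallel OpenMP version (a sketch of the scheme on this slide, not the benchmarked CSR-omp code) makes the row-block assignment explicit; with static scheduling each core gets a contiguous chunk of rows, so a chunk containing a few very long rows keeps its core busy long after the others have finished:

    #include <omp.h>

    void spmv_csr_omp(int m, const int *row_ptr, const int *col_idx,
                      const double *val, const double *x, double *y)
    {
        /* schedule(static) splits the rows into equal-sized contiguous blocks,
           regardless of how many nonzeros each block contains. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++)
                sum += val[p] * x[col_idx[p]];
            y[i] = sum;
        }
    }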

16

Parallel SpMV (using CSR-Adaptive)
•  Adaptively assign row blocks to cores for better load balancing (a simplified sketch of the row blocking follows below).

[Figure: the example matrix split into two adaptively sized row blocks (core 0, core 1)]

But when several rows are much longer (power-law distribution), possible imbalance still exists.

Joseph L. Greathouse, Mayank Daga. Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. SC14.
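A simplified flavor of the row-blocking step (a hypothetical helper, not the actual CSR-Adaptive algorithm, which additionally picks a different SpMV strategy per block): group consecutive rows until a block holds roughly a target number of nonzeros.

    /* Fill block_start[] (caller-allocated, at least m+1 entries) so that each
       block of rows [block_start[b], block_start[b+1]) holds roughly
       nnz_per_block nonzeros. Returns the number of blocks. */
    int build_row_blocks(int m, const int *row_ptr, int nnz_per_block, int *block_start)
    {
        int nblocks = 0, block_nnz = 0;
        block_start[nblocks++] = 0;
        for (int i = 0; i < m; i++) {
            block_nnz += row_ptr[i + 1] - row_ptr[i];
            if (block_nnz >= nnz_per_block) {   /* close the block after this row */
                block_start[nblocks++] = i + 1;
                block_nnz = 0;
            }
        }
        if (block_start[nblocks - 1] != m)      /* close a trailing partial block */
            block_start[nblocks++] = m;
        return nblocks - 1;
    }

A single row longer than the target still ends up alone in an oversized block, which is exactly the residual imbalance this slide points out for power-law matrices.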

17

Parallel SpMV (using CSR5) (our work)
•  Organize nonzeros into tiles of identical size. The design objectives include load balancing, SIMD friendliness, low preprocessing cost, and reduced storage space (a simplified sketch of the underlying nonzero-splitting idea follows below).

Weifeng Liu, Brian Vinter. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. ICS15.

[Figure: the example matrix's nonzeros partitioned into equal-sized tiles, assigned to core 0 and core 1]
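The nonzero-splitting idea can be sketched on the CPU as follows (only the load-balancing idea is shown; CSR5 itself uses fixed-size 2D tiles, a SIMD-friendly segmented sum, and tile descriptors instead of atomics):

    #include <omp.h>
    #include <string.h>

    /* Largest row index i such that row_ptr[i] <= p (binary search). */
    static int row_of(const int *row_ptr, int m, int p)
    {
        int lo = 0, hi = m - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (row_ptr[mid] <= p) lo = mid; else hi = mid - 1;
        }
        return lo;
    }

    /* Each thread handles an equal share of the nonzeros, not of the rows.
       Rows that straddle a chunk boundary are finished with an atomic add. */
    void spmv_nnz_split(int m, int nnz, const int *row_ptr, const int *col_idx,
                        const double *val, const double *x, double *y)
    {
        memset(y, 0, m * sizeof(double));
        #pragma omp parallel
        {
            int t = omp_get_thread_num(), nt = omp_get_num_threads();
            int p0 = (int)((long long)nnz * t / nt);
            int p1 = (int)((long long)nnz * (t + 1) / nt);
            if (p0 < p1) {
                int i = row_of(row_ptr, m, p0);   /* row owning this chunk's first nonzero */
                double sum = 0.0;
                for (int p = p0; p < p1; p++) {
                    while (p >= row_ptr[i + 1]) { /* the current row ended inside the chunk */
                        #pragma omp atomic
                        y[i] += sum;
                        sum = 0.0;
                        i++;
                    }
                    sum += val[p] * x[col_idx[p]];
                }
                #pragma omp atomic
                y[i] += sum;                      /* flush the last (possibly partial) row */
            }
        }
    }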

18

SpMV on CPU side

[Figure: SpMV throughput on the CPU side over the 956 matrices, grouped into uniform and power-law nonzero distributions]

best `top' performance: CSR-omp wins
best `load-balanced' performance: CSR5-omp wins

19

SpMV on GPU side

[Figure: SpMV throughput on the GPU side over the 956 matrices, grouped into uniform and power-law nonzero distributions]

best `top' performance: CSR-adaptive-ocl wins a bit
best `load-balanced' performance: CSR5-ocl wins

20

CSR5-SpMV (GPU vs. CPU)

[Figure: CSR5 SpMV throughput, GPU side vs. CPU side, over uniform and power-law distributions]

best `overall' performance: CSR5-ocl wins

21

Best SpMV (GPU vs. CPU)

[Figure: best SpMV throughput, GPU side vs. CPU side, over uniform and power-law distributions]

best `top' performance: CPU-best wins
best `overall' performance: GPU-best wins

matrix `dc1': CPU-best 35 GB/s, GPU-best 42 GB/s, Titan X CSR 1.1 GB/s

22

SpMV (Opportunity and Challenge)

Opportunities:
•  Load balancing is still a problem on such `small' processors
•  Load-balanced methods on the APU can largely outperform naive code on a GPU
•  Load-balanced methods can be slower in many cases
•  The CPU achieves higher `top' performance
•  The GPU achieves higher `overall' performance

Challenges:
•  Understanding correlations between performance on real hardware and the sparsity structures of sparse matrices
•  Utilizing both the CPU side and the GPU side more efficiently

23

Kernel 2. Sparse Matrix-Matrix Multiplication

(SpGEMM)

24

Question, Experiment, Opportunity and Challenge

•  Sparse matrix-matrix multiplication (SpGEMM): C = A·B (the example on the next slide)

Question 2: Is the unified memory (a.k.a. shared virtual memory) helpful?
Experiment 2: SpGEMM running on the GPU side but using the unified memory.

Opportunity 2 and Challenge 2

25

SpGEMM
•  Multiply a sparse matrix A by a sparse matrix B, and obtain a resulting sparse matrix C.

[Figure: A (4x4, sparse, 6 nonzeros) x B (4x4, sparse, 6 nonzeros, values a–f) = C (4x4, sparse, 8 nonzeros); row 0 of C = {1d, 1e}, row 1 = {2a, 3b, 3c}, row 2 is empty, row 3 = {4a+5e, 5d, 6f}]

26

SpGEMM - parallel insertion
•  Because of the compressed sparse format, the result nonzeros are "inserted" into C rather than "added" at predictable locations of C (a sketch of one way around this follows the figure below).

[Figure: building row 3 of C = 4·(row 0 of B) + 5·(row 2 of B) + 6·(row 3 of B); the partial products 4a, 5d, 5e and 6f arrive one by one: 4a, 5d and 6f are inserted into new columns, while 5e lands in the same column as 4a and is fused into 4a+5e, so row 3 of C becomes {4a+5e, 5d, 6f}]
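One common way to sidestep the insertion problem (a sequential sketch with a dense accumulator per output row, in the spirit of Gustavson's algorithm; not the SpGEMM framework presented on the next slides) is to scatter partial products into a dense buffer, where fusion happens automatically, and gather the compressed row afterwards:

    #include <stdlib.h>
    #include <stdbool.h>

    /* C = A*B, all in CSR. A is m x k, B is k x n. Assumes cci/cv were allocated
       large enough (a real implementation first computes or estimates row sizes). */
    void spgemm_rowwise(int m, int n,
                        const int *arp, const int *aci, const double *av,
                        const int *brp, const int *bci, const double *bv,
                        int *crp, int *cci, double *cv)
    {
        double *acc  = calloc(n, sizeof(double));  /* dense accumulator for one row */
        bool   *used = calloc(n, sizeof(bool));
        int nnzc = 0;
        crp[0] = 0;
        for (int i = 0; i < m; i++) {
            for (int pa = arp[i]; pa < arp[i + 1]; pa++) {
                int kk = aci[pa];
                double a = av[pa];
                for (int pb = brp[kk]; pb < brp[kk + 1]; pb++) {
                    acc[bci[pb]] += a * bv[pb];    /* "fusion" is just an add here */
                    used[bci[pb]] = true;
                }
            }
            for (int j = 0; j < n; j++) {          /* gather the row, reset the buffers */
                if (used[j]) {
                    cci[nnzc] = j;
                    cv[nnzc++] = acc[j];
                    acc[j] = 0.0;
                    used[j] = false;
                }
            }
            crp[i + 1] = nnzc;
        }
        free(acc);
        free(used);
    }

The O(n) gather per row is what real implementations replace with hash tables, sorted lists, or hybrid accumulators; parallelizing over rows also raises the question of how much memory each output row needs.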

27

An SpGEMM framework (our work)

Weifeng Liu, Brian Vinter. A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors. JPDC, 2015. (extended from IPDPS14)
Weifeng Liu, Brian Vinter. An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data. IPDPS14.

28

An SpGEMM framework (our work)

29

SpGEMM on GPU (Unified vs. Non-unified mem)

[Figure: SpGEMM performance, unified vs. non-unified memory, across matrices ranging from more “insertion” to more “fusion”]

Obvious performance gain.

30

SpGEMM (Opportunity and Challenge)

Opportunity:
•  For GPU-friendly irregular problems with output of unknown size, using unified memory can bring higher performance.

Challenge:
•  Various memory configurations and a deeper memory hierarchy need more careful tuning.
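As an illustration of what "using the unified memory" means on the host side, here is a minimal OpenCL 2.0 shared-virtual-memory sketch (assuming an already created context ctx and kernel krn; the function name, argument index, and max_nnz bound are illustrative, not taken from the talk's code):

    #define CL_TARGET_OPENCL_VERSION 200
    #include <CL/cl.h>
    #include <stddef.h>

    /* Allocate an output buffer that both the CPU and the GPU side of the APU can
       dereference directly, so no clEnqueueRead/WriteBuffer copies are needed. */
    double *alloc_svm_output(cl_context ctx, cl_kernel krn, size_t max_nnz)
    {
        double *c_val = clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                   max_nnz * sizeof(double), 0 /* default alignment */);
        if (c_val)
            clSetKernelArgSVMPointer(krn, 0, c_val);  /* pass the raw pointer to the kernel */
        return c_val;  /* later released with clSVMFree(ctx, c_val) */
    }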

31

Kernel 3. Sparse Transposition

(SpTRANS)

32

SpTRANS
•  Transpose a sparse matrix A (in CSR) to a sparse matrix B (i.e., Aᵀ, in CSR). This is equivalent to a conversion from CSR to CSC, or vice versa.

A -> Aᵀ

A (dense view):                 Aᵀ (dense view):
  . . 1 .                         . 2 . 4
  2 3 . .                         . 3 . .
  . . . .                         1 . . 5
  4 . 5 6                         . . . 6

A (in CSR):
  value        = [1 2 3 4 5 6]
  column index = [2 0 1 0 2 3]
  row pointer  = [0 1 3 3 6]

Aᵀ (in CSR):
  value        = [2 4 3 1 5 6]
  column index = [1 3 1 0 3 3]
  row pointer  = [0 2 3 5 6]

33

Question, Experiment, Opportunity and Challenge

•  Sparse transposition (SpTRANS): B = Aᵀ
•  Sparse triangular solve (SpTRSV): solve Lx = b for x

Question 3: Can atomic operations work well?
Experiment 3: SpTRANS and SpTRSV on the GPU side.

Opportunity 3 and Challenge 3

34

SpTRANS
•  Transpose a sparse matrix A (in CSR) to a sparse matrix B (i.e., Aᵀ, in CSR). This is equivalent to a conversion from CSR to CSC, or vice versa (a generic atomic-based sketch follows below).

[Figure: the same A -> Aᵀ example as above, now annotated with where atomic operations are used]

Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng. Parallel Transposition of Sparse Data Structures. ICS16.
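A generic atomic-based CSR-to-CSC transposition on the CPU looks roughly like this (a sketch of the idea only, not the algorithms of the paper above; note that the row indices within each output column come out unsorted, a known side effect of the atomic approach):

    #include <stdlib.h>

    /* Transpose A (m x n, CSR) into CSC arrays, i.e. the CSR of A^T.
       cscColPtr has n+1 entries; cscRowIdx/cscVal have nnz entries. */
    void sptrans_atomic(int m, int n, int nnz,
                        const int *csrRowPtr, const int *csrColIdx, const double *csrVal,
                        int *cscColPtr, int *cscRowIdx, double *cscVal)
    {
        int *curr = calloc(n, sizeof(int));      /* next free slot inside each column */
        for (int j = 0; j <= n; j++) cscColPtr[j] = 0;

        /* 1. Histogram of column indices (parallel, atomic increments). */
        #pragma omp parallel for
        for (int p = 0; p < nnz; p++) {
            #pragma omp atomic
            cscColPtr[csrColIdx[p] + 1]++;
        }
        /* 2. Exclusive prefix sum over the histogram (serial here for brevity). */
        for (int j = 0; j < n; j++) cscColPtr[j + 1] += cscColPtr[j];

        /* 3. Scatter: every entry atomically reserves a slot in its column. */
        #pragma omp parallel for
        for (int i = 0; i < m; i++) {
            for (int p = csrRowPtr[i]; p < csrRowPtr[i + 1]; p++) {
                int j = csrColIdx[p], dst;
                #pragma omp atomic capture
                dst = curr[j]++;
                cscRowIdx[cscColPtr[j] + dst] = i;
                cscVal[cscColPtr[j] + dst] = csrVal[p];
            }
        }
        free(curr);
    }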

35

Performance on GPU (dGPU vs. iGPU)

[Figure: SpTRANS performance on dGPU vs. iGPU, across matrices ranging from more “dense” to more “sparse”]

The dGPU has much higher performance.

36

B/W utilization on GPU (iGPU vs. dGPU)

[Figure: SpTRANS bandwidth utilization on iGPU vs. dGPU, across matrices ranging from more “dense” to more “sparse”]

The iGPU has much higher bandwidth utilization.

37

Kernel 4. Sparse Triangular Solve

(SpTRSV)

38

SpTRSV
•  Compute a dense solution vector x from a sparse linear system Lx = b, where L is a square lower triangular sparse matrix, and b is a dense vector (a sequential reference is sketched below).

L (4x4, nnzL = 6, known)      x (4x1, dense, unknown)      b (4x1, dense, known)
  1 0 0 0                       x0                            a
  0 1 0 0           x           x1              =             b
  3 2 1 0                       x2                            c
  0 0 0 1                       x3                            d

system:   1*x0 = a;  1*x1 = b;  3*x0 + 2*x1 + 1*x2 = c;  1*x3 = d
solution: x0 = a,  x1 = b,  x2 = c - 3a - 2b,  x3 = d
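For reference, the sequential forward substitution that the parallel methods have to beat (a minimal sketch; it assumes the column indices within each row are sorted, so the diagonal is the last entry of the row):

    /* Solve L*x = b with L lower triangular in CSR. */
    void sptrsv_csr_serial(int n, const int *row_ptr, const int *col_idx,
                           const double *val, const double *b, double *x)
    {
        for (int i = 0; i < n; i++) {
            double sum = b[i];
            int p;
            for (p = row_ptr[i]; p < row_ptr[i + 1] - 1; p++)  /* off-diagonal entries */
                sum -= val[p] * x[col_idx[p]];                 /* needs already-solved x[j] */
            x[i] = sum / val[p];                               /* divide by L[i][i] */
        }
    }

The dependence on already-solved entries of x is what makes SpTRSV hard to parallelize, and is exactly what the sync-free method below targets.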

39

Question, Experiment, Opportunity and Challenge

•  Sparse transposition (SpTRANS): B = Aᵀ
•  Sparse triangular solve (SpTRSV): solve Lx = b for x

Question 3: Can atomic operations work well?
Experiment 3: SpTRANS and SpTRSV on the GPU side.

Opportunity 3 and Challenge 3

40

Sync-free method for SpTRSV (our work)
•  All cores (one per column, treated as a node) are activated: some of them compute the solution, while the others busy-wait until their spin-locks are unlocked and they can start (a CPU-flavored sketch follows below).

[Figure: dependency graph of the example L (one node per column), annotated with where atomic operations are used]
Node:            0  1  2  3
In-degree:       1  1  3  1
Out-degree:      2  2  1  1
CSR row_ptr:     0  1  2  5  6
CSC column_ptr:  0  2  4  5  6

Weifeng Liu, Ang Li, Jonathan Hogg, Iain S. Duff, Brian Vinter. Fast Synchronization-Free Algorithms for Parallel Sparse Triangular Solves with Multiple Right-Hand Sides. CCPE, 2017. (extended from Euro-Par 16)
Weifeng Liu, Ang Li, Jonathan Hogg, Iain S. Duff, Brian Vinter. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves. Euro-Par 16.
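A CPU-flavored sketch of the synchronization-free idea (busy-waiting on per-row counters of not-yet-applied nonzeros instead of level-set barriers). The papers' GPU implementations differ in many details, so this only illustrates the dependency handling; L is assumed lower triangular in CSC with sorted row indices, so the diagonal is the first entry of each column:

    #include <stdatomic.h>
    #include <stdlib.h>

    /* Solve L*x = b with L lower triangular in CSC. One "node" per column;
       remaining[i] counts the nonzeros of row i whose contribution has not been
       applied yet, so column i may start once its counter reaches 1 (diagonal only). */
    void sptrsv_syncfree(int n, const int *col_ptr, const int *row_idx,
                         const double *val, const double *b, double *x)
    {
        atomic_int *remaining = malloc(n * sizeof(atomic_int));
        double *left_sum = calloc(n, sizeof(double));   /* accumulated L(i,j)*x(j), j < i */
        for (int i = 0; i < n; i++) atomic_init(&remaining[i], 0);
        for (int p = 0; p < col_ptr[n]; p++)
            atomic_fetch_add_explicit(&remaining[row_idx[p]], 1, memory_order_relaxed);

        #pragma omp parallel for schedule(static, 1)
        for (int j = 0; j < n; j++) {
            /* Busy-wait until only the diagonal of row j is outstanding. */
            while (atomic_load_explicit(&remaining[j], memory_order_acquire) != 1)
                ;  /* spin */
            double xj = (b[j] - left_sum[j]) / val[col_ptr[j]];   /* divide by L(j,j) */
            x[j] = xj;
            for (int p = col_ptr[j] + 1; p < col_ptr[j + 1]; p++) {
                int i = row_idx[p];
                #pragma omp atomic
                left_sum[i] += val[p] * xj;                       /* contribute ... */
                atomic_fetch_sub_explicit(&remaining[i], 1,
                                          memory_order_release);  /* ... then publish */
            }
        }
        free(remaining);
        free(left_sum);
    }

With the example above, nodes 0, 1 and 3 can proceed immediately (in-degree 1), while node 2 spins until the contributions 3*x0 and 2*x1 have been accumulated and its counter drops to 1.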

41

Performance on GPU (dGPU vs. iGPU)

[Figure: SpTRSV performance on dGPU vs. iGPU, across matrices ranging from more “levels” (fewer nodes per level) to fewer “levels” (more nodes per level)]

The dGPU has much higher performance.

42

B/W utilization on GPU (iGPU vs. dGPU)

[Figure: SpTRSV bandwidth utilization on iGPU vs. dGPU, across matrices ranging from more “levels” (fewer nodes per level) to fewer “levels” (more nodes per level)]

The iGPU has much higher bandwidth utilization.

43

SpTRANS and SpTRSV (Opportunity and Challenge)

Opportunity:
•  In terms of utilization, using more atomic operations looks more promising on APUs, though overall performance is not yet competitive with a dGPU.

Challenge:
•  Possibly converting/eliminating more (global) synchronizations into atomic operations for overall higher performance.

44

Conclusion

•  The emergence of heterogeneous processors brings a larger algorithm design and tuning space.

•  Based on a large set of experimental results, some opportunities are observed.

•  New exciting challenges are coming.

45

Thank you! Any questions?
