Utilizing CUDA for Preconditioned GMRES Solvers
DCABES’09
Shiming Xu 1, Hai Xiang Lin 1, Wei Xue 2, and Ke Wang 3
1 Delft Institute of Applied Mathematics, TU Delft
2 Department of Computer Science & Technology, Tsinghua University
3 Lab. of Parallel Software & Computational Science, Institute of Software, Chinese Academy of Sciences
Outline
• Introduction to Krylov-subspace linear system solvers & preconditioning techniques
• Introduction to GPGPU & NVIDIA CUDA
• GMRES solver on CUDA
• Approximate inverse preconditioner based on A-biconjugation (AINV) on CUDA
• Experiments & Results
• Conclusion
Introduction – Krylov Subspace Solvers
• Iterative linear system solvers [2]
• Krylov subspace-based solvers (the defining subspace is given below)
• Popular solvers: GMRES, CG, Bi-CG
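The formula on this slide was lost in extraction; the standard definition from [2], which it presumably showed, is:

```latex
\mathcal{K}_m(A, r_0) = \operatorname{span}\{\, r_0,\; A r_0,\; A^2 r_0,\; \dots,\; A^{m-1} r_0 \,\},
\qquad r_0 = b - A x_0 ,
```

with the m-th iterate sought in the affine space x_0 + K_m(A, r_0).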
Introduction – Preconditioners (1)
• Iteration count ~ condition of matrix A
• Preconditioners [2,9]:
  • Improve the condition of the 'actual' matrix used in the iteration
  • Left & right preconditioning
  • Effective matrix & system (sketched below)
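The equations on this slide were lost in extraction; in the notation of [2], with left preconditioner M_L and right preconditioner M_R, the effective matrix and system are:

```latex
\left( M_L^{-1} A\, M_R^{-1} \right) u = M_L^{-1} b, \qquad x = M_R^{-1} u .
```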
Introduction – Preconditioned GMRES
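The algorithm on this slide was a figure that did not survive extraction. For reference, right-preconditioned GMRES(m) in the notation of [2] builds an Arnoldi basis V_m for the Krylov subspace of A M^{-1} and solves a small least-squares problem:

```latex
A M^{-1} V_m = V_{m+1} \bar{H}_m, \qquad
y_m = \arg\min_{y} \bigl\| \beta e_1 - \bar{H}_m y \bigr\|_2, \qquad
x_m = x_0 + M^{-1} V_m y_m, \quad \beta = \| r_0 \|_2 .
```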
Introduction – Preconditioners (2)
• Incomplete factorization-based:
  • Incomplete LU/Cholesky factorization [1,2]
  • ILU(0), ILUt, ILUk, ILUtp, etc.
  • Preconditioning: forward/backward elimination
• Approximate inverse-based:
  • A-biconjugation (AINV) [8]
  • Frobenius-norm minimization
  • Preconditioning: matrix-vector product (see below)
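To make the last point concrete (a standard result from [8], not spelled out on the slide): biconjugation produces unit triangular factors W, Z and a diagonal D such that, up to dropping,

```latex
W^{T} A\, Z \approx D \quad\Longrightarrow\quad A^{-1} \approx Z\, D^{-1} W^{T} ,
```

so applying the preconditioner amounts to two sparse matrix-vector products plus a diagonal scaling.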
A-biconjugate based Preconditioner
[Figure: the A-biconjugation (AINV) algorithm of [8]; its line numbers are referenced on the 'AINV in CUDA' slide]
Introduction – GPGPU & NVIDIA CUDA
• General-Purpose computing on Graphics Processing Units [12]
• NVIDIA CUDA [6]:
  • First (de facto) widely adopted platform for GPGPU
• Characteristics of GPUs:
  • Throughput-oriented architecture, SIMD style
  • High peak FLOPS/bandwidth
  • Massively parallel (thousands of concurrent threads)
  • Weak caching/memory model/programmability
  • Weak support for branches, no ILP mechanisms
Introduction – GPU/CPU Comparison

                     CPU                          GPU
Sample               Intel i7-920 (Nehalem)       NVIDIA Tesla C1060
Freq.                2.66 GHz                     1.3 GHz
Peak FLOPS (SP/DP)   85 G / 42.5 G                624 G / 78 G
Peak bandwidth       ~25 GB/s                     ~100 GB/s
Core configuration   4 physical cores,            30 multiprocessors,
                     8 virtual cores              240 stream processors
Cache system         3-level coherent cache       2-level cache
                     (32KB x4, 256KB x4, 8MB)     (24KB x10, 256KB)
SW-managed cache     None                         16KB x30
Introduction – NVIDIA CUDA
[Figure: CUDA thread hierarchy]
[Figure: CUDA device abstraction]
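As a minimal illustration of these two figures (not from the talk): a kernel is launched over a grid of thread blocks, and each thread recovers its global index from the built-in blockIdx, blockDim, and threadIdx variables.

```cuda
// Scale a length-n vector, one element per thread (illustrative example).
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) x[i] *= a;                           // guard the last block
}

// Launch with 256-thread blocks, enough blocks to cover all n elements:
//   scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
```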
Data Formats for Sparse Matrices
• ELLPACK & ELLPACK-based (HYB) [4]
  • Good bandwidth utilization
• CSR/CSC (Compressed Sparse Row/Column)
[Figure: ELLPACK data layout and thread mapping — non-zeros stored column-wise so that Threads 1-16, one per matrix row, read consecutive memory locations]
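A minimal CUDA sketch of the per-row mapping the figure illustrates, assuming column-major ELLPACK arrays zero-padded to max_nnz_per_row; all names here are illustrative, and the paper's actual kernels follow [4]:

```cuda
// ELLPACK SpMV, one thread per row. cols/vals hold num_rows x max_nnz_per_row
// entries, stored column-major and zero-padded, so at each step of the k-loop
// the threads of a warp read consecutive addresses (coalesced access).
__global__ void spmv_ell(int num_rows, int max_nnz_per_row,
                         const int *cols, const float *vals,
                         const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;
    float sum = 0.0f;
    for (int k = 0; k < max_nnz_per_row; ++k) {
        float v = vals[k * num_rows + row];   // column-major: coalesced read
        if (v != 0.0f)
            sum += v * x[cols[k * num_rows + row]];
    }
    y[row] = sum;
}
```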
GMRES in CUDA – Algorithms
• Orthogonalization [11,13]:
  • Gram-Schmidt
  • Modified Gram-Schmidt
  • Gram-Schmidt with re-orthogonalization
[Figures: the classical G-S and modified G-S schemes; a sketch of one modified G-S step follows]
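A minimal sketch of one modified Gram-Schmidt step using cuBLAS, assuming a column-major basis held on the device; the names (mgs_step, d_V, d_w) are illustrative, not from the talk:

```cuda
#include <cublas_v2.h>

// Orthogonalize candidate vector w against k existing basis vectors
// v_0..v_{k-1} (columns of the n x k device array d_V), modified G-S style:
// each projection uses the *updated* w, unlike classical G-S.
void mgs_step(cublasHandle_t handle, int n, int k,
              const float *d_V,  // device: n x k basis, column-major
              float *d_w,        // device: candidate vector, length n
              float *h)          // host: receives k Hessenberg entries
{
    for (int j = 0; j < k; ++j) {
        const float *vj = d_V + (size_t)j * n;
        cublasSdot(handle, n, vj, 1, d_w, 1, &h[j]);   // h_j = v_j^T w
        float neg_h = -h[j];
        cublasSaxpy(handle, n, &neg_h, vj, 1, d_w, 1); // w -= h_j * v_j
    }
    // Caller then normalizes w (cublasSnrm2 + cublasSscal) to obtain v_k.
}
```

Each iteration of the loop depends on the previous one, so modified G-S issues many small dependent kernels; this is consistent with the later observation that the CPU wins for short vectors.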
GMRES in CUDA – Implementation
• Sparse matrix – dense vector products (SpMV)
• Orthogonalization
  • Inner products
  • AXPY operations
• Preconditioner – AINV
  • Close relationship to ILU-based preconditioners
  • High performance / easy parallelization
AINV w/ Predefined Sparsity Pattern
• AINV-0:
  • W^T + Z has the same sparsity pattern as A
  • Similar to ILU-0
• Preconditioner generation:
  • CSC format for both W and Z
• Preconditioning in GMRES:
  • HYB format
AINV in CUDA
• Parallelization:
  • Inner iterations on Lines 4~7 and Lines 8~12 of the AINV algorithm
• Kernels:
  • Sparse-vector x sparse-vector inner products (Lines 5~6)
  • Sparse-vector x sparse-vector updates (Lines 9~11)
  (a sketch of the inner-product primitive follows)
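A minimal sketch of the inner-product primitive, assuming each sparse vector is stored as sorted (index, value) pairs as in CSC columns; the sequential two-pointer merge is illustrative, not the paper's tuned kernel:

```cuda
// Inner product of two sparse vectors given as sorted index/value arrays
// (e.g. columns of W and Z in CSC format). Illustrative device routine:
// the actual kernels would run many such merges in parallel.
__device__ float sparse_dot(const int *ia, const float *va, int nnz_a,
                            const int *ib, const float *vb, int nnz_b)
{
    float sum = 0.0f;
    int i = 0, j = 0;
    while (i < nnz_a && j < nnz_b) {          // merge over sorted indices
        if (ia[i] == ib[j])       sum += va[i++] * vb[j++];
        else if (ia[i] < ib[j])   ++i;        // index only in first vector
        else                      ++j;        // index only in second vector
    }
    return sum;
}
```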
Experiments – Tests
• GMRES kernels:
  • Krylov subspace generation: SpMV
  • Orthogonalization
• AINV-0 preconditioner generation
• AINV-0 preconditioned GMRES iteration
Experiments – Configurations
CPU           Intel i7-920 (4-core, 2.66 GHz)
Memory        12 GB (DDR3, 1066 MHz)
GPU           NVIDIA Tesla C1060
GPU memory    4 GB
CUDA version  2.0

Table 1. System configuration
Matrix  Protein  Cant  WindTunnel  Epidem  Circuit  Petro  OPF   TDS   Cubes  Parabolic
Size    36K      62K   218K        526K    171K     132K   2.1M  25K   101K   526K
NNZ     4.3M     4.0M  11.6M       2.1M    959K     4.2M   8.1M  160K  874K   2.1M

Table 2. Test matrices
Experiments – SpMV
• 3.7x speedup in SpMV
• Performance determined by:
  • Bandwidth utilization
  • Distribution of non-zero element count per row
Experiments – Orthogonalization
• Modified G-S scheme
• Orthogonalization: 1 vector against up to 64 bases
• Short vectors: CPU faster
• Long vectors: GPU faster
Experiments – AINV-0 Construction
• 2x speed-up on average
• Performance improves with:
  • Lower matrix bandwidth
  • Fewer non-zeros per row
  • Adjacent rows with higher sparsity-pattern similarity
  • Larger matrix size
Experiments – GMRES iterations
• Restarted GMRES(64)
• Components:
  • Orthogonalization (1~64 bases)
  • A-based SpMV
  • Preconditioning
    • Left, right, & scaling
• ~3x speed-up per iteration
Conclusion
• >3x speed-ups for the Krylov-subspace method kernels
  • >3.5x speed-up for Krylov-subspace generation
  • ~7x speed-up for the orthogonalization process at long matrix/vector sizes
• 2x speed-up for AINV-0 preconditioner generation
• ~3x speed-up per GMRES iteration
• Future work:
  • Optimization of both CPU & GPU implementations
  • AINV with dynamic fill-ins
References
1. Timothy A. Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
2. Yousef Saad, Iterative Methods for Sparse Linear Systems, 2nd Ed., SIAM, 2003.
3. BLAS – Basic Linear Algebra Subprograms, http://www.netlib.org/blas/.
4. Nathan Bell and Michael Garland, Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors, SC'09.
5. Muthu Manikandan Baskaran and Rajesh Bordawekar, Optimizing Sparse Matrix-Vector Multiplication on GPUs, Technical Report, 2009.
6. CUDA Zone, http://www.nvidia.com/cuda.
7. Vasily Volkov and James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra, SC'08.
8. M. Benzi and M. Tuma, A Sparse Approximate Inverse Preconditioner for Nonsymmetric Linear Systems, SIAM J. Sci. Comput. 19 (1998).
9. Michele Benzi, Preconditioning Techniques for Large Linear Systems: A Survey, Journal of Computational Physics 182 (2002), pp. 418-477.
10. Matthias Christen, Olaf Schenk and Helmar Burkhart, General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform, Workshop on GPGPU, 2007.
11. L. Giraud, J. Langou and M. Rozloznik, The Loss of Orthogonality in the Gram-Schmidt Orthogonalization Process, Computers & Mathematics with Applications 50 (2005), pp. 1069-1075.
12. GPGPU, http://www.gpgpu.org.
13. W. Hoffmann, Iterative Algorithms for Gram-Schmidt Orthogonalization, Computing 41 (1989), pp. 335-348.
14. Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schroder, Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid, SIGGRAPH '03.
Any Questions?
Thank You!