Utilizing CUDA for Preconditioned GMRES Solvers
DCABES’09
Shiming Xu 1, Hai Xiang Lin 1, Wei Xue 2, and Ke Wang 3
1 Delft Institute of Applied Mathematics, TU Delft
2 Department of Computer Science & Technology, Tsinghua University
3 Lab. of Parallel Software & Computational Science, Institute of Software, Chinese Academy of Sciences
Outline
• Introduction to Krylov-subspace linear system solvers & preconditioning techniques
• Introduction to GPGPU & NVIDIA CUDA
• GMRES solver on CUDA
• Approximate inverse preconditioner based on A-biconjugation (AINV) on CUDA
• Experiments & Results
• Conclusion
Introduction – Krylov Subspace Solvers
• Iterative linear system solvers [2]
• Krylov subspace-based solvers (the defining subspace is given below)
• Popular solvers: GMRES, CG, Bi-CG
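The formula on this slide was lost in extraction; the standard definition from [2], which it presumably showed, is:

```latex
\mathcal{K}_m(A, r_0) = \operatorname{span}\{\, r_0,\; A r_0,\; A^2 r_0,\; \dots,\; A^{m-1} r_0 \,\},
\qquad r_0 = b - A x_0 ,
```

with the m-th iterate sought in the affine space x_0 + K_m(A, r_0).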
Introduction – Preconditioners (1)
• Iteration count ~ condition of matrix A
• Preconditioners [2,9]:
  • Improve the condition of the 'actual' matrix used in the iteration
  • Left & right preconditioning
  • Effective matrix & system (sketched below)
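The equations on this slide were lost in extraction; in the notation of [2], with left preconditioner M_L and right preconditioner M_R, the effective matrix and system are:

```latex
\left( M_L^{-1} A\, M_R^{-1} \right) u = M_L^{-1} b, \qquad x = M_R^{-1} u .
```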
Introduction – Preconditioned GMRES
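The algorithm on this slide was a figure that did not survive extraction. For reference, right-preconditioned GMRES(m) in the notation of [2] builds an Arnoldi basis V_m for the Krylov subspace of A M^{-1} and solves a small least-squares problem:

```latex
A M^{-1} V_m = V_{m+1} \bar{H}_m, \qquad
y_m = \arg\min_{y} \bigl\| \beta e_1 - \bar{H}_m y \bigr\|_2, \qquad
x_m = x_0 + M^{-1} V_m y_m, \quad \beta = \| r_0 \|_2 .
```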
Introduction – Preconditioners (2)
• Incomplete factorization-based:
  • Incomplete LU/Cholesky factorization [1,2]
  • ILU(0), ILUt, ILUk, ILUtp, etc.
  • Preconditioning: forward/backward elimination
• Approximate inverse-based:
  • A-biconjugation (AINV) [8]
  • Frobenius-norm minimization
  • Preconditioning: matrix-vector product (see below)
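To make the last point concrete (a standard result from [8], not spelled out on the slide): biconjugation produces unit triangular factors W, Z and a diagonal D such that, up to dropping,

```latex
W^{T} A\, Z \approx D \quad\Longrightarrow\quad A^{-1} \approx Z\, D^{-1} W^{T} ,
```

so applying the preconditioner amounts to two sparse matrix-vector products plus a diagonal scaling.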
A-biconjugate based Preconditioner
[Figure: the A-biconjugation (AINV) algorithm of [8]; its line numbers are referenced on the 'AINV in CUDA' slide]
Introduction – GPGPU & NVIDIA CUDA
• General-Purpose computing on Graphics Processing Units [12]
• NVIDIA CUDA [6]:
  • First (de facto) widely adopted platform for GPGPU
• Characteristics of GPUs:
  • Throughput-oriented architecture, SIMD style
  • High peak FLOPS/bandwidth
  • Massively parallel (thousands of concurrent threads)
  • Weak caching/memory model/programmability
  • Weak support for branches, no ILP mechanisms
Introduction – GPU/CPU Comparison

                     CPU                          GPU
Sample               Intel i7-920 (Nehalem)       NVIDIA Tesla C1060
Freq.                2.66 GHz                     1.3 GHz
Peak FLOPS (SP/DP)   85 G / 42.5 G                624 G / 78 G
Peak bandwidth       ~25 GB/s                     ~100 GB/s
Core configuration   4 physical cores,            30 multiprocessors,
                     8 virtual cores              240 stream processors
Cache system         3-level coherent cache       2-level cache
                     (32KB x4, 256KB x4, 8MB)     (24KB x10, 256KB)
SW-managed cache     None                         16KB x30
Introduction – NVIDIA CUDA
[Figure: CUDA thread hierarchy]
[Figure: CUDA device abstraction]
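As a minimal illustration of these two figures (not from the talk): a kernel is launched over a grid of thread blocks, and each thread recovers its global index from the built-in blockIdx, blockDim, and threadIdx variables.

```cuda
// Scale a length-n vector, one element per thread (illustrative example).
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) x[i] *= a;                           // guard the last block
}

// Launch with 256-thread blocks, enough blocks to cover all n elements:
//   scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
```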
Data Formats for Sparse Matrices
• ELLPACK & ELLPACK-based (HYB) [4]
  • Good bandwidth utilization
• CSR/CSC (Compressed Sparse Row/Column)
[Figure: ELLPACK data layout and thread mapping — non-zeros stored column-wise so that Threads 1-16, one per matrix row, read consecutive memory locations]
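A minimal CUDA sketch of the per-row mapping the figure illustrates, assuming column-major ELLPACK arrays zero-padded to max_nnz_per_row; all names here are illustrative, and the paper's actual kernels follow [4]:

```cuda
// ELLPACK SpMV, one thread per row. cols/vals hold num_rows x max_nnz_per_row
// entries, stored column-major and zero-padded, so at each step of the k-loop
// the threads of a warp read consecutive addresses (coalesced access).
__global__ void spmv_ell(int num_rows, int max_nnz_per_row,
                         const int *cols, const float *vals,
                         const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;
    float sum = 0.0f;
    for (int k = 0; k < max_nnz_per_row; ++k) {
        float v = vals[k * num_rows + row];   // column-major: coalesced read
        if (v != 0.0f)
            sum += v * x[cols[k * num_rows + row]];
    }
    y[row] = sum;
}
```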
GMRES in CUDA – Algorithms
• Orthogonalization [11,13]:
  • Gram-Schmidt
  • Modified Gram-Schmidt
  • Gram-Schmidt with re-orthogonalization
[Figures: the classical G-S and modified G-S schemes; a sketch of one modified G-S step follows]
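A minimal sketch of one modified Gram-Schmidt step using cuBLAS, assuming a column-major basis held on the device; the names (mgs_step, d_V, d_w) are illustrative, not from the talk:

```cuda
#include <cublas_v2.h>

// Orthogonalize candidate vector w against k existing basis vectors
// v_0..v_{k-1} (columns of the n x k device array d_V), modified G-S style:
// each projection uses the *updated* w, unlike classical G-S.
void mgs_step(cublasHandle_t handle, int n, int k,
              const float *d_V,  // device: n x k basis, column-major
              float *d_w,        // device: candidate vector, length n
              float *h)          // host: receives k Hessenberg entries
{
    for (int j = 0; j < k; ++j) {
        const float *vj = d_V + (size_t)j * n;
        cublasSdot(handle, n, vj, 1, d_w, 1, &h[j]);   // h_j = v_j^T w
        float neg_h = -h[j];
        cublasSaxpy(handle, n, &neg_h, vj, 1, d_w, 1); // w -= h_j * v_j
    }
    // Caller then normalizes w (cublasSnrm2 + cublasSscal) to obtain v_k.
}
```

Each iteration of the loop depends on the previous one, so modified G-S issues many small dependent kernels; this is consistent with the later observation that the CPU wins for short vectors.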
GMRES in CUDA – Implementation
• Sparse matrix – dense vector products (SpMV)
• Orthogonalization
  • Inner products
  • AXPY operations
• Preconditioner – AINV
  • Close relationship to ILU-based preconditioners
  • High performance / easy parallelization
AINV w/ Predefined Sparsity Pattern
• AINV-0:
  • W^T + Z has the same sparsity pattern as A
  • Similar to ILU-0
• Preconditioner generation:
  • CSC format for both W and Z
• Preconditioning in GMRES:
  • HYB format
AINV in CUDA
• Parallelization:
  • Inner iterations on Lines 4~7 and Lines 8~12 of the AINV algorithm
• Kernels:
  • Sparse-vector x sparse-vector inner products (Lines 5~6)
  • Sparse-vector x sparse-vector updates (Lines 9~11)
  (a sketch of the inner-product primitive follows)
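A minimal sketch of the inner-product primitive, assuming each sparse vector is stored as sorted (index, value) pairs as in CSC columns; the sequential two-pointer merge is illustrative, not the paper's tuned kernel:

```cuda
// Inner product of two sparse vectors given as sorted index/value arrays
// (e.g. columns of W and Z in CSC format). Illustrative device routine:
// the actual kernels would run many such merges in parallel.
__device__ float sparse_dot(const int *ia, const float *va, int nnz_a,
                            const int *ib, const float *vb, int nnz_b)
{
    float sum = 0.0f;
    int i = 0, j = 0;
    while (i < nnz_a && j < nnz_b) {          // merge over sorted indices
        if (ia[i] == ib[j])       sum += va[i++] * vb[j++];
        else if (ia[i] < ib[j])   ++i;        // index only in first vector
        else                      ++j;        // index only in second vector
    }
    return sum;
}
```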
Experiments – Tests
• GMRES kernels:
  • Krylov subspace generation: SpMV
  • Orthogonalization
• AINV-0 preconditioner generation
• AINV-0 preconditioned GMRES iteration
Experiments – Configurations
CPU           Intel i7-920 (4-core, 2.66 GHz)
Memory        12 GB (DDR3, 1066 MHz)
GPU           NVIDIA Tesla C1060
GPU memory    4 GB
CUDA version  2.0

Table 1. System configuration
Matrix  Protein  Cant  WindTunnel  Epidem  Circuit  Petro  OPF   TDS   Cubes  Parabolic
Size    36K      62K   218K        526K    171K     132K   2.1M  25K   101K   526K
NNZ     4.3M     4.0M  11.6M       2.1M    959K     4.2M   8.1M  160K  874K   2.1M

Table 2. Test matrices
Experiments – SpMV
• 3.7x speedup in SpMV
• Performance determined by:
  • Bandwidth utilization
  • Distribution of non-zero element count per row
Experiments – Orthogonalization
• Modified G-S scheme
• Orthogonalization: 1 vector against up to 64 bases
• Short vectors: CPU faster
• Long vectors: GPU faster
Experiments – AINV-0 Construction
• 2x speed-up on average
• Performance improves with:
  • Lower matrix bandwidth
  • Fewer non-zeros per row
  • Adjacent rows with higher sparsity-pattern similarity
  • Larger matrix size
Experiments – GMRES iterations
• Restarted GMRES(64)
• Components:
  • Orthogonalization (1~64 bases)
  • A-based SpMV
  • Preconditioning
    • Left, right, & scaling
• ~3x speed-up per iteration
Conclusion
• >3x speed-ups for the Krylov-subspace method kernels
  • >3.5x speed-up for Krylov-subspace generation
  • ~7x speed-up for the orthogonalization process at long matrix/vector sizes
• 2x speed-up for AINV-0 preconditioner generation
• ~3x speed-up per GMRES iteration
• Future work:
  • Optimization of both CPU & GPU implementations
  • AINV with dynamic fill-ins
References
1. Timothy A. Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
2. Yousef Saad, Iterative Methods for Sparse Linear Systems, 2nd Ed., SIAM, 2003.
3. BLAS – Basic Linear Algebra Subprograms, http://www.netlib.org/blas/.
4. Nathan Bell and Michael Garland, Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors, SC'09.
5. Muthu Manikandan Baskaran and Rajesh Bordawekar, Optimizing Sparse Matrix-Vector Multiplication on GPUs, Technical Report, 2009.
6. CUDA Zone, http://www.nvidia.com/cuda.
7. Vasily Volkov and James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra, SC'08.
8. M. Benzi and M. Tuma, A Sparse Approximate Inverse Preconditioner for Nonsymmetric Linear Systems, SIAM J. Sci. Comput. 19 (1998).
9. Michele Benzi, Preconditioning Techniques for Large Linear Systems: A Survey, Journal of Computational Physics 182 (2002), pp. 418-477.
10. Matthias Christen, Olaf Schenk and Helmar Burkhart, General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform, Workshop on GPGPU, 2007.
11. L. Giraud, J. Langou and M. Rozloznik, The Loss of Orthogonality in the Gram-Schmidt Orthogonalization Process, Computers & Mathematics with Applications 50 (2005), pp. 1069-1075.
12. GPGPU, http://www.gpgpu.org.
13. W. Hoffmann, Iterative Algorithms for Gram-Schmidt Orthogonalization, Computing 41 (1989), pp. 335-348.
14. Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schroder, Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid, SIGGRAPH '03.
Any Questions?
Thank You!