First Feedback on Using GPUs for Structural Mechanics Applications

Copyright © ESI Group, 2010. All rights reserved.
June 2011
Antoine Petitet and Stefanos Vlachoutsis

Contents

This work was carried out within the OpenGPU project, with the support of the DGCIS.
Implicit method: solving sparse linear systems
Explicit method: Smoothed Particle Hydrodynamics (SPH)
Multi-frontal Solver and CUBLAS
• One of the major workhorses of VPS implicit is the (multi-frontal) linear system direct solver (MUMPS).
• The multi-frontal method operates by design on dense sub-matrices for performance: GEMM and TRSM BLAS Level 3 kernels, sometimes with a large number of RHS.
• In VPS, the main focus is on double precision real and complex operands.
• What about using the CUBLAS provided by NVIDIA …
• … and seeing what happens on some industrial test cases?
CUBLAS (3.2) Level 3 Performance
Single Precision Level 3 CUBLAS

• Performance on C2070 (ECC on), including data transfers.
• GEMM and [SY,HE]RK optimized.
• Little has been done for the performance of the other Level 3 BLAS routines.

[Chart: Gflops/s vs. problem size (640 to 8320) for SGEMM, SSYMM, SSYRK, SSYR2K, STRMM, STRSM.]
• True for all other precisions D, C and Z.
• TRSM is important for multiple RHS solve.
Recursive GEMM based Level 3 BLAS
Block partition of the triangular system A · X = B:

    A = [ A11   0  ]      B = [ B1 ]
        [ A21  A22 ]          [ B2 ]

    B1 := A11⁻¹ · B1        (TRSM)
    B2 := B2 − A21 · B1     (GEMM)
    B2 := A22⁻¹ · B2        (TRSM)

• Recursive formulation of the TRSM operation.
• Use of the native (slower) TRSM on the leaves of the recursion tree and the (fast) GEMM elsewhere.
• The method can be applied to all Level 3 (and 2) operations.
(Recursive) CUBLAS Level 3 Performance
[Chart: Gflops/s vs. problem size (640 to 8320) for DGEMM (original), DTRSM (original), DTRSM (recursive).]

• Asymptotically achieves GEMM performance.

[Chart: Gflops/s vs. problem size (640 to 8320) for ZGEMM (original) and recursive ZSYMM, ZHEMM, ZSYR2K, ZHER2K, ZTRMM, ZTRSM.]
• [SY,HE] rank-2k updates should be implemented by a GEMM call followed by a triangular in-place copy-add.
• The recursive algorithm should be used until there is enough memory to apply the above algorithm.
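A minimal sketch of this idea for a lower-triangular SYR2K (all names are illustrative, not the CUBLAS interface): since [A | B] · [B | A]ᵀ = A·Bᵀ + B·Aᵀ, the whole rank-2k product fits in a single GEMM on concatenated operands, followed by the triangular in-place copy-add:

```python
import numpy as np

def syr2k_lower(A, B, C, alpha=1.0, beta=1.0):
    """Symmetric rank-2k update of the lower triangle of C,
    via one GEMM plus a triangular in-place copy-add."""
    # One GEMM on concatenated operands: W = alpha * (A B^T + B A^T).
    W = alpha * (np.hstack([A, B]) @ np.hstack([B, A]).T)
    # Triangular copy-add: only the lower triangle of C is touched.
    il = np.tril_indices_from(C)
    C[il] = beta * C[il] + W[il]
    return C
```

The trade-off the second bullet describes: the concatenated operand `W` costs an extra n×n buffer, so the recursive algorithm remains the fallback when that buffer does not fit in device memory.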
VPS Implicit: Non-Linear Static Test Case
• Double precision real, 1 rhs.
• 12 numerical factorizations and 12 solves.
• Problem size = 4207059, non-zero terms = 130732938.
• Speed-up: 20% over 1 Nehalem core.
[Chart: Time in minutes — Total / Matrix / Solver — for CPU vs. CPU-GPU.]
VPS-Implicit: NVH Frequency Response
• Double precision complex, 1258 rhs.
• 25 numerical factorizations and 175 solves.
• Problem size = 409813, non-zero terms = 37229935.
• Speed-up: 2x over 1 Nehalem core.
[Chart: Time in minutes — Total / Matrix / Solver — for CPU vs. CPU-GPU.]
Internal Acoustics
Conclusions
• A naïve (no data transfer / computation overlap) recursive GEMM-based implementation was necessary to handle a large number of RHS efficiently.
• The library approach makes the GPU particularly easy to use within complex applications …
• … the performance gain, however, remains limited. More work is necessary to get better speedups for sparse direct solvers on GPUs.
SPH
• The granularity of the computations performed in SPH makes it a method of choice for GPU computing.
• Single precision real computations.
• Most of the computation is uniformly distributed over (only) 3 hot-spots out of 5 routines in total.
• The reported execution times include data transfers to the card (no overlap).
• Comparison of computation times between 1 Nehalem W5590 core and an NVIDIA Fermi card (C2070, 6 GB of RAM).
• Industrial test case: vehicle driving through water (2730202 nodes, 1927277 particles, 782575 shell elements).
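To illustrate why SPH granularity maps well to GPUs (a generic textbook sketch, not ESI's production kernels): each particle's density is an independent neighbor sum, so one GPU thread per particle is natural. A brute-force single-precision version with the standard 3-D cubic-spline kernel:

```python
import numpy as np

def cubic_spline_w(r, h):
    """Standard 3-D cubic-spline SPH kernel, normalization 1/(pi h^3)."""
    q = r / h
    sigma = 1.0 / (np.pi * h**3)
    return sigma * np.where(q < 1.0, 1.0 - 1.5 * q**2 + 0.75 * q**3,
                   np.where(q < 2.0, 0.25 * (2.0 - q)**3, 0.0))

def density_summation(pos, mass, h):
    """Each particle's density is an independent sum over its neighbors,
    so the loop body maps directly onto one GPU thread per particle.
    Brute-force O(N^2) neighbor search; single precision, as on the slide."""
    pos = pos.astype(np.float32)
    n = pos.shape[0]
    rho = np.zeros(n, dtype=np.float32)
    for i in range(n):                       # one GPU thread each
        r = np.linalg.norm(pos - pos[i], axis=1)
        rho[i] = np.sum(mass * cubic_spline_w(r, h))
    return rho
```

The per-particle sums share no state, which is the "granularity" the slide refers to; a real kernel would add a cell-list neighbor search instead of the O(N²) loop.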
CUDA kernels for one hot-spot

Simulation (ms)   GPU (s)   CPU (s)   Gain (%)
20                 6189      7572      18
40                13100     16010      18
80                27980     34770      19
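Reading note (an assumption, since the slides do not define it): the Gain (%) column is consistent with the relative reduction in elapsed time, (CPU − GPU) / CPU:

```python
def gain_percent(cpu_s, gpu_s):
    """Relative reduction in elapsed time, in percent
    (assumed definition of the tables' Gain (%) column)."""
    return 100.0 * (cpu_s - gpu_s) / cpu_s
```

For the 20 ms row, `gain_percent(7572, 6189)` gives about 18.3, matching the table up to rounding; note this is a time reduction, not a speedup factor (18 % gain is about a 1.22x speedup).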
Elapsed time CPU – GPU (1)

[Chart: Elapsed time (s) vs. simulation time (ms), 0 to 100, for GPU and CPU.]

The speedup seems to increase slightly with the simulation time.
Estimation for 3 hot-spots
Simulation (ms)   GPU (s)   CPU (s)   Gain (%)
20                 2500      7572      67
40                 5300     16010      67
80                11500     34770      67
Elapsed time CPU – GPU (estimation)

[Chart: Elapsed time (s) vs. simulation time (ms), 0 to 100, for GPU and CPU.]

Data re-use (fewer data transfers) as the number of kernels increases should lead to an even better speedup.
CUDA kernels for 3 hot-spots

Simulation (ms)   GPU (s)   CPU (s)   Gain (%)
20                 4846      7572      36

The number of registers is constant: the size of the thread blocks must be reduced to run successfully, causing a performance loss.
The size of the kernel argument list is limited in bytes:
• 256 bytes on compute capability 1.x (C1060)
• 4 KB on compute capability 2.0 (C2070)
Conclusions – future work
• SPH: very promising for GPU computing … work on the kernels is still needed to achieve the potential.
• Hybrid GPU(s)–CPU computing: to investigate.
• Other explicit method topics to investigate: Finite Pointset Method (FPM), internal forces computation, contact mechanics, …
• Experiments on clusters of GPUs (MPI + OpenMP + GPUs).
• Tools evaluation for kernel generation: HMPP, PGI.