
First Feedback on Using GPUs for Structural Mechanics Applications

Copyright © ESI Group, 2010. All rights reserved.

June 2011


Antoine Petitet et Stefanos Vlachoutsis

Outline

Work carried out within the OpenGPU project, with the support of the DGCIS.


Implicit method: solution of sparse linear systems

Explicit method: Smoothed Particle Hydrodynamics (SPH)

Multi-frontal Solver and CUBLAS

• One of the major workhorses of VPS implicit is the (multi-frontal) linear system direct solver (MUMPS).

• The multi-frontal method operates by design on dense sub-matrices for performance: GEMM and TRSM BLAS Level 3 kernels, sometimes with a large number of RHS.

• In VPS, the main focus is on double precision real and complex operands.

• What about using the CUBLAS library provided by NVIDIA …

• … and seeing what happens on some industrial test cases?
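As an illustration of this library approach (a minimal sketch, not the actual VPS/MUMPS coupling; the routine name, operands and sizes are placeholders), a dense frontal update and a multi-RHS triangular solve can be offloaded with the legacy cuBLAS API of the CUDA 3.x era used on the C2070:

```c
/* Sketch only: offload a dense frontal update C := C - A*B and a
 * multi-RHS triangular solve to the GPU with the legacy cuBLAS API.
 * The routine name and the way operands reach the GPU are placeholders,
 * not the actual VPS/MUMPS integration. */
#include <cublas.h>

void frontal_update_gpu(int m, int n, int k,
                        const double *A, const double *B, double *C,
                        const double *L, double *RHS, int nrhs)
{
    double *dA, *dB, *dC, *dL, *dRHS;

    cublasInit();

    /* Upload the dense operands (these transfers are part of the
     * reported timings). */
    cublasAlloc(m * k,    sizeof(double), (void **)&dA);
    cublasAlloc(k * n,    sizeof(double), (void **)&dB);
    cublasAlloc(m * n,    sizeof(double), (void **)&dC);
    cublasAlloc(m * m,    sizeof(double), (void **)&dL);
    cublasAlloc(m * nrhs, sizeof(double), (void **)&dRHS);
    cublasSetMatrix(m, k,    sizeof(double), A,   m, dA,   m);
    cublasSetMatrix(k, n,    sizeof(double), B,   k, dB,   k);
    cublasSetMatrix(m, n,    sizeof(double), C,   m, dC,   m);
    cublasSetMatrix(m, m,    sizeof(double), L,   m, dL,   m);
    cublasSetMatrix(m, nrhs, sizeof(double), RHS, m, dRHS, m);

    /* Schur-complement style update: C := C - A * B (DGEMM). */
    cublasDgemm('N', 'N', m, n, k, -1.0, dA, m, dB, k, 1.0, dC, m);

    /* Forward solve with many right-hand sides: RHS := L^-1 * RHS (DTRSM). */
    cublasDtrsm('L', 'L', 'N', 'N', m, nrhs, 1.0, dL, m, dRHS, m);

    /* Download the results. */
    cublasGetMatrix(m, n,    sizeof(double), dC,   m, C,   m);
    cublasGetMatrix(m, nrhs, sizeof(double), dRHS, m, RHS, m);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasFree(dL); cublasFree(dRHS);
    cublasShutdown();
}
```

The explicit host-device transfers in this sketch are exactly what the "including data transfers" timings reported on the following slides account for.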

CUBLAS (3.2) Level 3 Performance

Single Precision Level 3 CUBLAS

• Performance on C2070 (ECC on) including data transfers.

• GEMM and [SY,HE]RK optimized.

• Little has been done for the performance of the other Level 3 BLAS routines.


[Chart: single precision Level 3 CUBLAS performance in Gflops/s versus problem size (640 to 8320), for SGEMM, SSYMM, SSYRK, SSYR2K, STRMM and STRSM.]

• True for all other precisions D, C and Z.

• TRSM is important for solves with multiple RHS.

Recursive GEMM based Level 3 BLAS

Partition the lower-triangular matrix A into blocks [ A11 0 ; A21 A22 ] and the right-hand sides B into [ B1 ; B2 ]; then

B1 := A11^-1 B1 (TRSM)

B2 := B2 - A21 B1 (GEMM)

B2 := A22^-1 B2 (TRSM)


• Recursive formulation of the TRSM operation.

• Use of native (slower) TRSM on leaves of the tree and (fast) GEMM elsewhere.

• The method can be applied to all Level 3 (and Level 2) operations; a sketch of the recursive TRSM is shown below.
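A minimal sketch of the recursion above, for the left-sided, lower-triangular, non-transposed, non-unit case (column-major device arrays, legacy cuBLAS API; the leaf cutoff of 512 is an illustrative choice, not the value used in VPS):

```c
/* Sketch of the recursive, GEMM-based TRSM described above:
 * B := A^-1 * B in place, with A lower triangular. */
#include <cublas.h>

#define TRSM_LEAF 512  /* below this order, call the native cublasDtrsm */

static void rec_dtrsm_llnn(int m, int nrhs,
                           const double *dA, int lda, double *dB, int ldb)
{
    if (m <= TRSM_LEAF) {
        /* Leaf of the recursion tree: native (slower) TRSM. */
        cublasDtrsm('L', 'L', 'N', 'N', m, nrhs, 1.0, dA, lda, dB, ldb);
        return;
    }

    int m1 = m / 2, m2 = m - m1;
    const double *dA11 = dA;                           /* m1 x m1 lower tri */
    const double *dA21 = dA + m1;                      /* m2 x m1 dense     */
    const double *dA22 = dA + m1 + (size_t)m1 * lda;   /* m2 x m2 lower tri */
    double *dB1 = dB;                                  /* m1 x nrhs         */
    double *dB2 = dB + m1;                             /* m2 x nrhs         */

    /* B1 := A11^-1 B1  (recursive TRSM) */
    rec_dtrsm_llnn(m1, nrhs, dA11, lda, dB1, ldb);
    /* B2 := B2 - A21 B1  (fast GEMM) */
    cublasDgemm('N', 'N', m2, nrhs, m1, -1.0, dA21, lda, dB1, ldb, 1.0, dB2, ldb);
    /* B2 := A22^-1 B2  (recursive TRSM) */
    rec_dtrsm_llnn(m2, nrhs, dA22, lda, dB2, ldb);
}
```

Only the small triangles at the leaves see the slow native TRSM; everywhere else the work is done by GEMM, which is why the recursive variant approaches GEMM speed asymptotically.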

(Recursive) CUBLAS Level 3 Performance

[Chart: performance in Gflops/s versus problem size (640 to 8320) for DGEMM (original), DTRSM (original) and DTRSM (recursive).]

• Asymptotically achieves GEMM performance.

[Chart: performance in Gflops/s versus problem size (640 to 8320) for ZGEMM (original) and the recursive ZSYMM, ZHEMM, ZSYR2K, ZHER2K, ZTRMM and ZTRSM.]


• [SY,HE] rank-2k updates should be implemented as a GEMM call followed by a triangular in-place copy-add, as sketched below.

• The recursive algorithm should be used until there is enough memory for the above algorithm.
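To make the first bullet concrete: since B*A^T = (A*B^T)^T, one GEMM into an n x n workspace W := alpha*A*B^T followed by a triangular copy-add C := beta*C + W + W^T reproduces the SYR2K update. The double precision real sketch below uses our own helper names (they are not CUBLAS routines); the n x n workspace is also why the recursive algorithm is preferred when memory is tight:

```cuda
/* GEMM-plus-copy-add rank-2k update (lower triangle):
 * C := alpha*A*B^T + alpha*B*A^T + beta*C. */
#include <cuda_runtime.h>
#include <cublas.h>

__global__ void tri_copy_add_lower(int n, double beta,
                                   const double *W, int ldw,
                                   double *C, int ldc)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* column */
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* row    */
    if (i < n && j <= i)                             /* lower triangle only */
        C[i + (size_t)j * ldc] = beta * C[i + (size_t)j * ldc]
                               + W[i + (size_t)j * ldw]
                               + W[j + (size_t)i * ldw];
}

void dsyr2k_lower_gemm(int n, int k, double alpha,
                       const double *dA, int lda, const double *dB, int ldb,
                       double beta, double *dC, int ldc,
                       double *dW /* n x n workspace */)
{
    /* W := alpha * A * B^T  (one fast GEMM) */
    cublasDgemm('N', 'T', n, n, k, alpha, dA, lda, dB, ldb, 0.0, dW, n);

    /* C := beta * C + W + W^T on the lower triangle (copy-add kernel) */
    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);
    tri_copy_add_lower<<<grid, block>>>(n, beta, dW, n, dC, ldc);
}
```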

VPS Implicit: Non-Linear Static Test Case

• Double precision real, 1 rhs.

• 12 numerical factorizations and 12 solves.

• Problem size = 4207059, non-zero terms = 130732938.

• Speed-up: 20% over 1 Nehalem core.


[Chart: time in minutes for the Total and Matrix Solver phases, CPU versus CPU-GPU.]

VPS-Implicit: NVH Frequency Response

• Double precision complex, 1258 rhs.

• 25 numerical factorizations and 175 solves.

• Problem size = 409813, non-zero terms = 37229935.

• Speed-up: 2x over 1 Nehalem core.


[Chart: time in minutes for the Total and Matrix Solver phases, CPU versus CPU-GPU.]

Internal Acoustics

Conclusions

• A naïve (no data transfer / computation overlap) recursive GEMM-based implementation was necessary to handle a large number of RHS efficiently.

• The library approach makes the GPU particularly easy to use within complex applications …

• … the performance gain, however, remains limited. More work is necessary to get better speedups for sparse direct solvers on GPUs.

SPH

The granularity of the computations performed in SPH makes it a method of choice for GPU computing.

Single precision real computations.

Most of the computation is evenly spread over (only) 3 hot spots out of 5 routines in total.

The reported execution times include the data transfers to the card (no overlap).

Comparison of compute times between 1 Nehalem W5590 core and an Nvidia Fermi card (C2070, 6 GB of RAM).

Industrial case: a vehicle driving over water (2730202 points, 1927277 particles, 782575 shell elements).
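For illustration only (this is not the actual VPS hot-spot code), the sketch below shows the typical shape of a single precision SPH kernel on the GPU: one thread per particle, accumulating neighbour contributions from a precomputed neighbour list. The data layout and the cubic-spline weight are assumptions made for this example:

```cuda
/* Illustrative SPH density summation: one thread per particle, reading
 * a precomputed CSR-style neighbour list.  Not the VPS routine. */
#include <cuda_runtime.h>

__global__ void sph_density(int np,
                            const float4 *pos_h,     /* x, y, z, smoothing h */
                            const float *mass,
                            const int *nbr_start,    /* np+1 offsets         */
                            const int *nbr_list,     /* neighbour indices    */
                            float *rho)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= np) return;

    float4 pi = pos_h[i];
    float  sum = 0.0f;

    /* Accumulate the contributions of the neighbours of particle i. */
    for (int p = nbr_start[i]; p < nbr_start[i + 1]; ++p) {
        int    j  = nbr_list[p];
        float4 pj = pos_h[j];
        float  dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float  q  = sqrtf(dx * dx + dy * dy + dz * dz) / pi.w;

        /* Cubic spline weight W(q) with 3D normalisation 1/(pi*h^3). */
        float w = 0.0f;
        if (q < 1.0f)
            w = 1.0f - 1.5f * q * q + 0.75f * q * q * q;
        else if (q < 2.0f)
            w = 0.25f * (2.0f - q) * (2.0f - q) * (2.0f - q);
        sum += mass[j] * w / (3.14159265f * pi.w * pi.w * pi.w);
    }
    rho[i] = sum;
}
```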

CUDA kernels for one hot-spot

Simulation (ms)   GPU (s)   CPU (s)   Gain (%)
             20      6189      7572         18
             40     13100     16010         18
             80     27980     34770         19


[Chart: "Elapsed time CPU - GPU (1)": elapsed time (s) versus simulation time (ms), GPU and CPU curves.]

The speedup seems to increase slightly with the simulation time.

Estimation for 3 hot-spots

Simulation (ms)   GPU (s)   CPU (s)   Gain (%)
             20      2500      7572         67
             40      5300     16010         67
             80     11500     34770         67


[Chart: "Elapsed time CPU - GPU - estimation": elapsed time (s) versus simulation time (ms), GPU and CPU curves.]

Data re-use (fewer data transfers) as the number of GPU kernels increases should lead to an even better speedup.

CUDA kernels for 3 hot-spots

Simulation (ms)   GPU (s)   CPU (s)   Gain (%)
             20      4846      7572         36

The number of registers per multiprocessor is fixed: the thread-block size has to be reduced for the kernels to launch successfully, at a cost in performance.

The size of the kernel argument list is limited in bytes:

256 bytes on compute capability 1.x (C1060)

4 KB on compute capability 2.0 (C2070)
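A minimal sketch of two usual workarounds for these limits (the structure, field and kernel names are hypothetical, not the VPS code): packing the per-kernel scalars and device pointers into a single structure held in constant memory removes the long argument list altogether, and __launch_bounds__ together with a smaller launch configuration keeps the register usage per block within what the hardware provides:

```cuda
/* Hypothetical parameter block and kernel, for illustration only. */
#include <cuda_runtime.h>

struct SphParams {                    /* all fields are placeholders   */
    float *pos, *vel, *rho, *force;   /* device pointers               */
    float  h, dt, rho0, c0;
    int    np;
};

__constant__ SphParams d_params;      /* one copy in GPU constant memory */

__global__ void
__launch_bounds__(128)                /* compile for at most 128 threads/block */
sph_hotspot(void)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= d_params.np) return;
    /* The body reads whatever it needs from d_params; placeholder write: */
    d_params.force[3 * i] = 0.0f;
}

void launch_hotspot(const SphParams &host_params)
{
    /* Upload the parameter block once, then launch with an empty
     * argument list (no 256-byte / 4 KB limit to worry about). */
    cudaMemcpyToSymbol(d_params, &host_params, sizeof(SphParams));

    int threads = 128;                /* reduced block size */
    int blocks  = (host_params.np + threads - 1) / threads;
    sph_hotspot<<<blocks, threads>>>();
}
```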

Conclusions – future work

• SPH: very promising for GPU computing … work on the kernels is still needed to reach the full potential.

• Hybrid GPU(s) – CPU computing: to be investigated.

• Other explicit-method topics to investigate: Finite Pointset Method (FPM), internal forces computation, contact mechanics, …

• Experiments on clusters of GPUs (MPI + OpenMP + GPUs)

• Evaluation of tools for kernel generation: HMPP, PGI
