
Co-processor acceleration of an unmodified parallel Structural Mechanics code with FEAST-GPU

Dominik Göddeke, Hilmar Wobker, Stefan Turek

Applied Mathematics, University of Dortmund

[email protected]

Robert Strzodka

Max Planck Center, MPII

[email protected]

Jamaludin Mohd-Yusof, Patrick McCormick

Los Alamos National Laboratory

jamal|[email protected]

FEM package: FEAST

Existing MPI-based Finite Element package with more than 100,000 lines of code.

Global partition into an unstructured collection of macros, each refined independently via generalized tensor-product meshes. Generalized DD/MG solvers: parallel MG smoothed by patchwise MG, implicit overlap.

Exploit structured parts: high performance, co-processor acceleration.

Hide anisotropies locally: numerical robustness, (weak) scalability.

Why Graphics Processor Units (GPUs)?

Scientific computing paradigm shift: Fine-grained parallelism addresses thermal constraints, frequency and power scaling, and the memory wall.

multicore CPUs (commodity-based clusters)

Cell (heterogeneity on a chip)

GPUs (> 128 parallel cores, 1000s of threads)

GPUs are very attractive for commodity-based clusters: fast, readily available, excellent price/performance ratio, superior bandwidth.

Use GPUs as co-processors!

Common concerns:

Bus between host and co-processor can limit performance: Perform as much independent local work as possible between global synchronizations [1].

GPUs only support single precision: Apply a mixed-precision iterative refinement scheme [2].
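The mixed-precision scheme of [2] is, in essence, iterative refinement (defect correction). Below is a minimal NumPy sketch of the idea, not FEAST's implementation: the outer loop accumulates the solution and forms defects in double precision, while the inner correction solve, the part the GPU would execute, runs in single precision. The function names and the direct inner solve are illustrative assumptions.

```python
import numpy as np

def mixed_precision_refinement(A, b, inner_solve, tol=1e-10, max_iter=50):
    """Iterative refinement: double-precision outer loop, single-precision
    inner correction solves (the role played by the GPU in FEAST-GPU)."""
    x = np.zeros_like(b, dtype=np.float64)
    A32 = A.astype(np.float32)
    for _ in range(max_iter):
        d = b - A @ x                                 # defect in float64
        if np.linalg.norm(d) <= tol * np.linalg.norm(b):
            break
        c = inner_solve(A32, d.astype(np.float32))    # correction in float32
        x += c.astype(np.float64)                     # update in float64
    return x

# Usage example: a direct solve stands in for the single-precision GPU solver.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = mixed_precision_refinement(A, b, inner_solve=np.linalg.solve)
```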

Importance of bandwidth

FEAST is 95% memory bound on a single node.

[Figure: runtime in seconds (smaller is better) vs. #degrees of freedom (DOFs) — 16 Mi (C=4, 16 L10 macros), 32 Mi (C=8, 32 L10 macros), 64 Mi (C=16, 64 L10 macros) — for the configurations 0g_1c4m_Cn_DQ, 1g6m_1c2m_(C/2)n_DQ, 1g7m_1c1m_(C/2)n_DQ and 1g4m_0c_Cn_DQ.]

Example: DQ cluster, LANL. 2x Xeon EM64T (3.4 GHz, 6.4 GB/s shared bandwidth), plus 1x Quadro FX4500 (strong GPUs, 512 MB, 33.6 GB/s) or 1x Quadro FX1400 (weak GPUs, 128 MB, 19.2 GB/s).

GPUs increase overall system bandwidth by 4–6x, of which 2–3x translates into application speedup.
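As a rough sanity check against the quoted figures (back-of-the-envelope only, assuming host and GPU memory bandwidth per node simply add up):

```latex
\frac{(6.4 + 33.6)\,\text{GB/s}}{6.4\,\text{GB/s}} \approx 6.3 \quad \text{(FX4500)},
\qquad
\frac{(6.4 + 19.2)\,\text{GB/s}}{6.4\,\text{GB/s}} = 4.0 \quad \text{(FX1400)} .
```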

Test configurations

Coarse grids, partitions, boundary conditions

Computed displacements and von Mises stresses:

Results

Identical L2 errors despite single precision smoothing:

[Figure: L2 error (smaller is better) vs. number of macros (16, 64, 256) for refinement levels L7–L10, each computed on CPU and GPU.]

Vertical: L2 error reduction by a factor of 4 (h²) per refinement level.

Horizontal: Independence of macro distribution and refinement level: 16 macros on L9 are equivalent to 64 macros on L8.

Diagonal: Quadrupling the number of macros on the same level gives the same error reduction as refining.
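The factor of four follows from the standard second-order L2 error estimate (stated here for completeness, assuming the usual O(h²) convergence of the discretization):

```latex
\|u - u_h\|_{L^2} \le C\,h^2
\quad\Longrightarrow\quad
\frac{\|u - u_h\|_{L^2}}{\|u - u_{h/2}\|_{L^2}}
\approx \frac{C\,h^2}{C\,(h/2)^2} = 4 ,
```

i.e. one refinement level (halving h) reduces the L2 error by a factor of four.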

Speedup of 1.6x–2.4x for 2x Xeon vs. 1x Quadro 4500:

[Figure: time per iteration in seconds (smaller is better), CPU vs. GPU, for the BLOCK (speedup 1.9), PIPE (speedup 1.8) and CRACK (speedup 2.4) configurations.]

16 nodes, Infiniband, strong GPUs, 128 Mi DOFs per configuration.

Good weak scalability for 2x Xeon vs. 1x Quadro 1400:

[Figure: normalized time per iteration in seconds (smaller is better) vs. number of nodes (4–128), CPU vs. GPU.]

4–128 nodes with Infiniband, weak GPUs, 8 Mi DOFs per node (1 Bi max).

Additional benefits: improvements in metrics such as power/performance, price/performance, and performance/dollar.

Co-processor integration: FEAST-GPU

Hardware acceleration is integrated on the node level: GPUs accelerate local smoothing for the global, parallel solver. Each type of hardware performs the tasks it is best suited for: the GPU executes the MG smoother, the CPU compensates for the low precision and is responsible for communication and synchronization of the parallel solvers.

All local smoothers implement the same interface, enabling hardware acceleration through simple changes in configuration files.

This decoupled approach results in a sufficient amount of local work between communication, both between CPU and GPU within a node and between nodes using MPI.

Less than 1000 lines of new kernel code. No changes to applications on top of FEAST.
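The unified smoother interface described above might look roughly like the following (an illustrative Python sketch, not FEAST's actual Fortran interface; the class names, the damped-Jacobi stand-in for the patchwise multigrid, and the `config["smoother"]` key are all assumptions):

```python
# Illustrative sketch of a pluggable local smoother interface (not FEAST's
# actual API). Both backends expose the same call, so enabling hardware
# acceleration amounts to a one-line change in a configuration file.
# A damped Jacobi sweep stands in for the real patchwise multigrid, and
# float32 arithmetic stands in for the GPU backend.
import numpy as np

class CpuSmoother:
    def smooth(self, A, x, b, sweeps=2, omega=0.8):
        D = np.diag(A)
        for _ in range(sweeps):
            x = x + omega * (b - A @ x) / D               # double precision
        return x

class GpuSmoother:
    def smooth(self, A, x, b, sweeps=2, omega=0.8):
        A32, D32 = A.astype(np.float32), np.diag(A).astype(np.float32)
        x32, b32 = x.astype(np.float32), b.astype(np.float32)
        for _ in range(sweeps):
            x32 = x32 + omega * (b32 - A32 @ x32) / D32   # single precision
        return x32.astype(np.float64)   # hand back to the double-precision solver

def make_smoother(config):
    """Pick the backend named in the configuration file."""
    return {"cpu": CpuSmoother, "gpu": GpuSmoother}[config["smoother"]]()
```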

More details: Using GPUs to improve multigrid solver performance on a cluster [1].

CSM application: FEAST-SOLID

Fundamental model problem: elastic, compressible material, exposed to small deformations in a static loading process.

Use the linearized strain tensor and kinematic relation and apply Hooke's law for isotropic elastic material. This yields the Lamé equation for the unknown displacements u:

−2μ div ε(u) − λ grad div u = f,   x ∈ Ω
u = g,   x ∈ Γ_D
σ(u) · n = t,   x ∈ Γ_N
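For completeness, the Lamé equation results from inserting the linearized strain tensor and Hooke's law into the equilibrium equation (standard linear elasticity background):

```latex
\varepsilon(u) = \tfrac{1}{2}\bigl(\nabla u + (\nabla u)^{\mathsf{T}}\bigr), \qquad
\sigma(u) = 2\mu\,\varepsilon(u) + \lambda\,(\operatorname{div} u)\,I,
\qquad
-\operatorname{div}\sigma(u)
  = -2\mu\,\operatorname{div}\varepsilon(u) - \lambda\,\operatorname{grad}\operatorname{div} u = f .
```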

FEM discretization and separate displacement ordering:

    [ K11  K12 ] [ u1 ]   [ f1 ]
    [ K21  K22 ] [ u2 ] = [ f2 ]

Solver: Global Krylov-subspace method, block-preconditioned (Gauss-Seidel) by treating the scalar blocks K11 and K22 with FEAST solvers.
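One application of this block-Gauss-Seidel preconditioner to a defect (d1, d2) can be sketched as follows; this is a minimal NumPy illustration in which a dense direct solve stands in for the scalar FEAST solvers, and all names are illustrative:

```python
import numpy as np

def block_gauss_seidel_precondition(K11, K21, K22, d1, d2, scalar_solve):
    """Apply one block-Gauss-Seidel step to the defect (d1, d2):
    solve the first scalar block, update the second defect with the
    coupling term, then solve the second scalar block."""
    c1 = scalar_solve(K11, d1)             # in FEAST: a scalar DD/MG solver
    c2 = scalar_solve(K22, d2 - K21 @ c1)  # use the fresh c1 in the coupling
    return c1, c2

# Usage example with small dense blocks and a direct scalar solver:
n = 4
rng = np.random.default_rng(0)
K11 = 4.0 * np.eye(n) + 0.1 * rng.random((n, n))
K22 = 4.0 * np.eye(n) + 0.1 * rng.random((n, n))
K21 = 0.1 * rng.random((n, n))
d1, d2 = rng.random(n), rng.random(n)
c1, c2 = block_gauss_seidel_precondition(K11, K21, K22, d1, d2, np.linalg.solve)
```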

References

[1] Dominik Göddeke, Robert Strzodka, Jamal Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker, and Stefan Turek. Using GPUs to improve multigrid solver performance on a cluster. Accepted for publication in the International Journal of Computational Science and Engineering, 2008.

[2] Dominik Göddeke, Robert Strzodka, and Stefan Turek. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations. International Journal of Parallel, Emergent and Distributed Systems, 22(4):221–256, August 2007.