
Automatic code generation for highly parallel multigrid solvers
Sebastian Kuckuk1, Christian Schmitt2, Harald Köstler1

1 Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Department of Computer Science 10 (System Simulation)
2 Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Department of Computer Science 12 (Hardware-Software-Co-Design)

References:
[1] Christian Schmitt, Sebastian Kuckuk, Harald Köstler, Frank Hannig, and Jürgen Teich. An Evaluation of Domain-Specific Language Technologies for Code Generation. To appear in Proceedings of the 14th International Conference on Computational Science and Its Applications (ICCSA 2014), June 2014.
[2] Stefan Kronawitter and Christian Lengauer. Optimization of two Jacobi Smoother Kernels by Domain-Specific Program Transformation. In Proceedings of the 1st International Workshop on High-Performance Stencil Computations (HiStencils), pages 75–80, January 2014.
[3] Alexander Grebhahn, Norbert Siegmund, Sven Apel, Sebastian Kuckuk, Christian Schmitt, and Harald Köstler. Optimizing Performance of Stencil Code with SPL Conqueror. In Proceedings of the 1st International Workshop on High-Performance Stencil Computations (HiStencils), pages 7–14, January 2014.
[4] Sebastian Kuckuk, Björn Gmeiner, Harald Köstler, and Ulrich Rüde. A Generic Prototype to Benchmark Algorithms and Data Structures for Hierarchical Hybrid Grids. In Proceedings of the International Conference on Parallel Computing (ParCo), pages 813–822, September 2013.

Code Generation with Scala
Necessary due to the high variance of the multigrid domain:
Hardware - CPU, GPU, or both? Number of nodes, sockets, and cores? Cache characteristics? Network characteristics?
Software - MPI, OpenMP, or both? CUDA or OpenCL? Which version?
MG components - Cycle type? Which smoother(s)? Which coarse-grid solver? Which inter-grid operators?
MG parameters - Relaxation? Number of smoothing steps?
Optimizations - Vectorization? Temporal blocking? Loop transformations?
Problem description - Which PDE? Which boundary conditions?
Discretization - Finite Differences, Finite Elements, or Finite Volumes?
Domain - Uniform or block-structured? How to partition?
…
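As a rough illustration of how such variation points might be captured before code generation, the following is a minimal Scala sketch of a configuration model. All type and field names (MultigridConfig, TargetConfig, etc.) are invented for illustration and do not reflect the actual ExaStencils code base.

```scala
// Hypothetical sketch: modeling a few multigrid and hardware variation points
// as plain Scala types. Names are illustrative, not the real framework.

sealed trait Parallelization
case object PureMPI extends Parallelization
case object PureOpenMP extends Parallelization
case object HybridMPIOpenMP extends Parallelization

sealed trait Smoother
case object Jacobi extends Smoother
case object GaussSeidel extends Smoother

case class MultigridConfig(
  cycleType: String,          // e.g. "V", "W"
  preSmoothingSteps: Int,     // e.g. 3
  postSmoothingSteps: Int,    // e.g. 3
  smoother: Smoother,
  relaxation: Double          // e.g. 0.8 for weighted Jacobi
)

case class TargetConfig(
  parallelization: Parallelization,
  numNodes: Int,
  coresPerNode: Int,
  useVectorization: Boolean
)

object ConfigExample {
  def main(args: Array[String]): Unit = {
    // a generator would map such a configuration to concrete solver code
    val mg     = MultigridConfig("V", 3, 3, GaussSeidel, relaxation = 1.0)
    val target = TargetConfig(PureMPI, numNodes = 128, coresPerNode = 16, useVectorization = true)
    println(s"Generating a ${mg.cycleType}(${mg.preSmoothingSteps},${mg.postSmoothingSteps}) cycle " +
            s"with ${mg.smoother} for ${target.parallelization}")
  }
}
```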

Project ExaStencils
Generation of efficient, robust, and exa-scalable geometric multigrid solvers
Modular and feature-rich code generation and transformation framework written in Scala [1]
Automatic low-level optimization via polyhedral transformations [2]
Interface to SPL and LFA prediction and optimization [3]

Project team: Sebastian Kuckuk, Harald Köstler, Ulrich Rüde; Alexander Grebhahn, Sven Apel; Stefan Kronawitter, Armin Größlinger, Christian Lengauer; Christian Schmitt, Frank Hannig, Jürgen Teich; Hannah Rittich, Matthias Bolten

Geometric Multigrid
Smoothing of high-frequency errors
Coarsened representation of low-frequency errors
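To make the interplay of smoothing and coarse-grid correction concrete, here is a schematic V-cycle in Scala for a 1D Poisson problem with Dirichlet boundaries. It is a minimal sketch for illustration only; the grid representation and the operator names (smooth, restrict, prolongate) are placeholders, not the generated solver code.

```scala
// Schematic V-cycle sketch (illustrative only; not the generated solver).
// A grid level is just a vector of unknowns; all operators are 1D Poisson
// stencils on a uniform grid with homogeneous Dirichlet boundaries.
object VCycleSketch {
  type Vec = Array[Double]

  // residual r = f - A u for the 1D Laplacian with mesh size h
  def residual(u: Vec, f: Vec, h: Double): Vec = {
    val r = new Array[Double](u.length)
    for (i <- 1 until u.length - 1)
      r(i) = f(i) - (2 * u(i) - u(i - 1) - u(i + 1)) / (h * h)
    r
  }

  // a few weighted Jacobi sweeps to damp high-frequency error components
  def smooth(u: Vec, f: Vec, h: Double, sweeps: Int, omega: Double = 0.8): Vec = {
    var cur = u.clone()
    for (_ <- 0 until sweeps) {
      val next = cur.clone()
      for (i <- 1 until cur.length - 1)
        next(i) = (1 - omega) * cur(i) + omega * 0.5 * (cur(i - 1) + cur(i + 1) + h * h * f(i))
      cur = next
    }
    cur
  }

  // full-weighting restriction to the next coarser grid
  def restrict(r: Vec): Vec = {
    val coarse = new Array[Double]((r.length - 1) / 2 + 1)
    for (i <- 1 until coarse.length - 1)
      coarse(i) = 0.25 * r(2 * i - 1) + 0.5 * r(2 * i) + 0.25 * r(2 * i + 1)
    coarse
  }

  // linear interpolation back to the finer grid
  def prolongate(e: Vec, fineLen: Int): Vec = {
    val fine = new Array[Double](fineLen)
    for (i <- 1 until e.length - 1) {
      fine(2 * i)     += e(i)
      fine(2 * i - 1) += 0.5 * e(i)
      fine(2 * i + 1) += 0.5 * e(i)
    }
    fine
  }

  // recursive V(pre, post)-cycle
  def vCycle(u: Vec, f: Vec, h: Double, pre: Int, post: Int): Vec =
    if (u.length <= 3) {
      smooth(u, f, h, sweeps = 50) // "solve" the coarsest level by heavy smoothing
    } else {
      val uPre = smooth(u, f, h, pre)                       // pre-smoothing
      val rC   = restrict(residual(uPre, f, h))             // restrict the residual
      val eC   = vCycle(new Array[Double](rC.length), rC, 2 * h, pre, post) // coarse-grid correction
      val uCor = uPre.zip(prolongate(eC, u.length)).map { case (a, b) => a + b }
      smooth(uCor, f, h, post)                              // post-smoothing
    }
}
```

For instance, vCycle(new Array[Double](129), f, 1.0 / 128, 3, 3) would correspond to one V(3,3) cycle on a grid with 129 points.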

Preliminary Results
First scaling results with generated solvers match the behavior of earlier reference experiments [4]
3D FD discretization of Poisson's equation on uniform grids
4 threads per core, pure MPI
1M unknowns per core

[Figure: Weak scaling for two configurations, V(3,3) with Gauss-Seidel and V(4,2) with Jacobi: parallel efficiency over the number of cores (512 to 256k)]
[Figure: Weak scaling for two configurations, V(3,3) with Gauss-Seidel and V(4,2) with Jacobi: mean time per V-cycle in ms over the number of cores (512 to 256k)]

The domain partition is directly mapped to the parallelization:
Each domain consists of one or more blocks
Each block consists of one or more fragments
Each fragment consists of several data points / cells

Each block corresponds to one MPI rank
Each fragment corresponds to one OpenMP thread
Pure MPI corresponds to one fragment per block
Pure OpenMP corresponds to one block
Hybrid MPI/OpenMP corresponds to multiple blocks with multiple fragments per block
Possible optimization: aggregate all fragments within one block and OpenMP-parallelize field operations directly (see the sketch below)
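A minimal sketch of this hierarchy and its mapping to the parallelization, assuming hypothetical type names (Domain, Block, Fragment); it illustrates the concept and is not the framework's internal representation.

```scala
// Illustrative sketch of the domain partition hierarchy. Names are hypothetical.
case class Fragment(id: Int, numCells: Int)              // owns a patch of data points / cells
case class Block(mpiRank: Int, fragments: Seq[Fragment]) // one block per MPI rank
case class Domain(blocks: Seq[Block])                    // whole computational domain

object PartitionExample {
  // pure MPI: one fragment per block, one block per rank
  def pureMpi(numRanks: Int, cellsPerFragment: Int): Domain =
    Domain((0 until numRanks).map(r => Block(r, Seq(Fragment(r, cellsPerFragment)))))

  // hybrid MPI/OpenMP: several fragments per block, each handled by one OpenMP thread
  def hybrid(numRanks: Int, threadsPerRank: Int, cellsPerFragment: Int): Domain =
    Domain((0 until numRanks).map { r =>
      val frags = (0 until threadsPerRank).map(t => Fragment(r * threadsPerRank + t, cellsPerFragment))
      Block(r, frags)
    })

  def main(args: Array[String]): Unit = {
    val d = hybrid(numRanks = 4, threadsPerRank = 8, cellsPerFragment = 32768)
    println(s"${d.blocks.size} MPI ranks, ${d.blocks.head.fragments.size} OpenMP threads per rank")
  }
}
```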

Support of various communication patterns:
Different regions (overlap, ghost layers)
Arbitrary lists of neighbors (represented by directions)
Easy-to-use subsets, e.g. to send to all processes with larger or equal coordinates

A generated domain initialization function sets relevant information, e.g. connections to local/remote primitives, ids, ranks, etc., on each process at run-time.
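The directional neighbor lists and coordinate-based subsets could be modeled as in the following sketch. The names (Direction, GhostLayer, neighborsTowardsUpper) are invented for illustration and are not the framework's API.

```scala
// Illustrative sketch of directional neighbors and communication regions.
case class Direction(dx: Int, dy: Int, dz: Int) // e.g. (1, 0, 0) = neighbor in +x

sealed trait Region
case object GhostLayer extends Region   // copies of remote data needed for the stencil
case object OverlapLayer extends Region // duplicated points shared between fragments

object CommPatternExample {
  // all 26 neighbor directions of a 3D fragment
  val allDirections: Seq[Direction] =
    for {
      dx <- -1 to 1; dy <- -1 to 1; dz <- -1 to 1
      if !(dx == 0 && dy == 0 && dz == 0)
    } yield Direction(dx, dy, dz)

  // example subset: only neighbors with larger or equal coordinates in every dimension
  val neighborsTowardsUpper: Seq[Direction] =
    allDirections.filter(d => d.dx >= 0 && d.dy >= 0 && d.dz >= 0)

  def main(args: Array[String]): Unit =
    println(s"${allDirections.size} neighbors total, ${neighborsTowardsUpper.size} in the upper subset")
}
```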

Multi-Layered DSL Approach
From abstract problem specification on layer 1 to concrete solver implementation on layer 4:
L1: mathematical formulation of the problem
L2: discretization of the problem
L3: specification of algorithmic components
L4: complete program specification
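As a rough illustration of this layered refinement, the following hypothetical Scala sketch models each layer as a small data type and code generation as a chain of lowering steps. The types, fields, and lowering functions are invented for illustration and do not mirror the actual DSL layers.

```scala
// Hypothetical sketch of the multi-layered refinement idea. Names are illustrative.
case class Layer1(equation: String, domain: String, boundary: String)               // continuous problem
case class Layer2(stencil: String, gridSpacing: Double, levels: Int)                 // discretization
case class Layer3(cycle: String, smoother: String, preSmooth: Int, postSmooth: Int)  // algorithmic components
case class Layer4(sourceFiles: Map[String, String])                                  // complete program

object LayeredPipeline {
  def discretize(l1: Layer1): Layer2 =
    Layer2(stencil = "7-point FD Laplacian", gridSpacing = 1.0 / 1024, levels = 10)

  def chooseAlgorithm(l2: Layer2): Layer3 =
    Layer3(cycle = "V", smoother = "Gauss-Seidel", preSmooth = 3, postSmooth = 3)

  def generate(l3: Layer3): Layer4 =
    Layer4(Map("Solver.cpp" -> s"// generated ${l3.cycle}(${l3.preSmooth},${l3.postSmooth}) cycle ..."))

  def main(args: Array[String]): Unit = {
    val l1      = Layer1("-Laplace(u) = f", "unit cube", "Dirichlet")
    val program = generate(chooseAlgorithm(discretize(l1)))
    program.sourceFiles.keys.foreach(println)
  }
}
```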