
Massively Parallel Phase Field Simulations using HPC Framework waLBerla

SIAM CSE 2015, March 15th 2015

Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich Rüde

Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

Outline

• Motivation
• waLBerla Framework
• Phase Field Method
  • Overview
  • Optimizations
  • Performance Modelling
• Managing I/O
• Summary and Outlook


Motivation

• a large domain is required to reduce the influence of the boundaries
• some physical patterns (e.g. spirals) only occur in highly resolved simulations
• → simulate large domains in 3D
• an unoptimized, general-purpose phase-field code from KIT is available
• goal: write an optimized, parallel version for this specific model


The waLBerla Framework


• waLBerla: widely applicable Lattice Boltzmann from Erlangen
• HPC software framework, originally developed for CFD simulations with the Lattice Boltzmann Method (LBM)
• evolved into a general framework for algorithms on structured grids
• coupling with the in-house rigid body physics engine pe


Application examples: Vocal Fold Study (Florian Schornbaum), Fluid Structure Interaction (Simon Bogner), Free Surface Flow

Block Structured Grids

• structured grid
• the domain is decomposed into blocks
• blocks are the container data structure for the simulation data (lattice)
• blocks are the basic unit of load balancing
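The block concept can be pictured with a small data-structure sketch. The types and names below are illustrative only, not waLBerla's actual classes: a block owns the lattice data of one subdomain including its ghost layers, and whole blocks are what the load balancer assigns to processes.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of the block concept (not waLBerla's actual classes).
struct Block
{
    int nCells[3];               // interior cells per dimension
    int ghostLayers;             // ghost layers surrounding the interior
    std::vector<double> data;    // simulation data (lattice), one value per cell here

    Block(int nx, int ny, int nz, int gl)
        : nCells{nx, ny, nz}, ghostLayers(gl),
          data(static_cast<std::size_t>(nx + 2 * gl) *
               static_cast<std::size_t>(ny + 2 * gl) *
               static_cast<std::size_t>(nz + 2 * gl))
    {}
};

// Each process stores only the blocks that were assigned to it.
std::vector<Block> localBlocks;
```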


Hybrid Parallelization

• distributed memory parallelization: MPI
• data exchange at the borders between blocks via ghost layers
• support for overlapping communication and computation (see the sketch below)
• some advanced models (e.g. free surface flows) require more complex communication patterns


(Figure: ghost-layer exchange between a sender process and a receiver process. Slightly more complicated for non-uniform domain decompositions, but the same general ideas still apply.)
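A simplified sketch of how such a ghost-layer exchange can be overlapped with computation, using plain nonblocking MPI for a single neighbor. The actual waLBerla communication module is more general; the helper steps in the comments are placeholders.

```cpp
#include <mpi.h>
#include <vector>

// Simplified sketch of overlapping communication and computation for one
// ghost-layer exchange with one neighbor.
void timestep(std::vector<double>& sendBuf, std::vector<double>& recvBuf,
              int neighborRank, MPI_Comm comm)
{
    MPI_Request reqs[2];

    // 1. pack the outermost interior layer into sendBuf (omitted) and start
    //    the nonblocking exchange of ghost-layer data with the neighbor
    MPI_Irecv(recvBuf.data(), static_cast<int>(recvBuf.size()), MPI_DOUBLE,
              neighborRank, 0, comm, &reqs[0]);
    MPI_Isend(sendBuf.data(), static_cast<int>(sendBuf.size()), MPI_DOUBLE,
              neighborRank, 0, comm, &reqs[1]);

    // 2. update all inner cells that do not depend on ghost values
    //    (this is where most of the work is done, hiding the communication)
    // computeInnerCells();

    // 3. wait for the exchange to finish, unpack recvBuf into the ghost layer,
    //    then update the outer cells that needed the ghost values
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    // unpackGhostLayer();  computeOuterCells();
}
```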

Phase field in waLBerla

Phase field algorithm

• two lattices (fields):
  • phase field 𝜙 with 4 entries per cell
  • chemical potential 𝜇 with 2 entries per cell
• two time steps are stored in “src” and “dst” fields
• spatial discretization: finite differences
• temporal discretization: explicit Euler method
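A minimal 1D sketch of the src/dst sweep structure described above. It is illustrative only: the real kernels are 3D and evaluate the full phase-field right-hand side for the 4-component 𝜙 and 2-component 𝜇 fields.

```cpp
#include <cstddef>
#include <vector>

// Minimal 1D sketch of one explicit Euler sweep with "src"/"dst" fields.
void sweep(const std::vector<double>& src, std::vector<double>& dst,
           double dt, double dx)
{
    for (std::size_t i = 1; i + 1 < src.size(); ++i)
    {
        // finite difference stencil evaluated on the old time step (src)...
        const double laplace = (src[i - 1] - 2.0 * src[i] + src[i + 1]) / (dx * dx);
        // ...and explicit Euler update written to the new time step (dst)
        dst[i] = src[i] + dt * laplace;
    }
    // after the sweep (and the ghost layer exchange) src and dst are swapped
}
```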



Per-cell operation counts for the two sweeps:
• 940 FLOPs, 34 loads / stores
• 2214 FLOPs, 168 loads / stores

Roofline Performance Model

Performance data per cell:
• FLOPs: 3154
• loads / stores: 202
• loads from RAM: 101 (the remaining accesses are served from cache)
• code balance: 31.2 FLOP per double loaded from RAM

Sandy Bridge architecture (per core):
• RAM bandwidth: 6.4 GB/s
• peak FLOP rate @ 2.7 GHz: 21.6 GFLOP/s
• machine balance: 25 FLOP per double

→ the code balance exceeds the machine balance, so the kernel is compute bound
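In compact form, using only the numbers above (\(B_c\) denotes the code balance, \(B_m\) the machine balance of the core):

\[
B_c = \frac{3154\ \text{FLOP}}{101\ \text{doubles from RAM}} \approx 31.2\ \frac{\text{FLOP}}{\text{double}},
\qquad
B_m \approx 25\ \frac{\text{FLOP}}{\text{double}},
\qquad
B_c > B_m \;\Rightarrow\; \text{compute bound.}
\]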

Optimizations of the Phase Field Algorithm

Optimization Roadmap

• single core optimizations
  • based on the results of the performance model
  • save floating point operations; pre-compute and store values where possible
  • presented here using the 𝝁-sweep as an example
• scaling
  • performance behavior of the parallelization
  • challenges related to input/output
  • performance data presented for SuperMUC

Implementation in waLBerla

• starting point: general prototyping code
• new, model-specific implementation in waLBerla
• performance-guided design
  • no indirect or virtual calls
  • optimized traversal of the grid


Step 1: Replace / remove expensive operations

• pre-compute common subexpressions
• fast inverse square root approximation: replace the division and sqrt operation with bit-level operations and adds/muls (see the sketch below)
• reduce the number of divisions using table lookups where possible
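A sketch of the classic fast inverse square root trick the second bullet refers to. The magic constant and the number of Newton iterations are the commonly used double-precision variants, not necessarily the exact values used in the waLBerla kernels.

```cpp
#include <cstdint>
#include <cstring>

// Fast approximation of 1/sqrt(x) for x > 0: one bit-level "magic constant"
// step followed by Newton-Raphson refinements.
inline double fastInvSqrt(double x)
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof(bits));        // reinterpret the double as an integer
    bits = 0x5FE6EB50C7B537A9ULL - (bits >> 1);  // initial guess via exponent manipulation
    double y;
    std::memcpy(&y, &bits, sizeof(y));
    // Newton-Raphson steps: y <- y * (1.5 - 0.5 * x * y * y)
    y = y * (1.5 - 0.5 * x * y * y);
    y = y * (1.5 - 0.5 * x * y * y);
    return y;
}
```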

Gibbs Energy subterm pre-computation

• many quantities depend only on the local temperature
• in this scenario the temperature is a function of a single coordinate: 𝑇 = 𝑇(𝑧)
• these quantities can therefore be computed once per 𝑥,𝑦-slice (see the sketch below)

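A sketch of the per-slice pre-computation. All names and expressions are placeholders, not the actual Gibbs energy terms: the point is that the expensive, temperature-dependent part is evaluated once per z-slice and reused for every cell of that x,y-plane.

```cpp
#include <cmath>
#include <vector>

struct SliceTerms { double a, b; };              // placeholder subterms

SliceTerms precomputeSubterms(double T)          // expensive part: done once per slice
{
    return { std::exp(-1.0 / T), std::log(T) };  // placeholder expressions
}

void muSweep(std::vector<double>& mu, int nx, int ny, int nz)
{
    for (int z = 0; z < nz; ++z)
    {
        const double T = 300.0 + 0.5 * z;        // temperature depends on z only
        const SliceTerms pre = precomputeSubterms(T);

        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x)         // cheap per-cell work reuses 'pre'
                mu[(static_cast<std::size_t>(z) * ny + y) * nx + x] += pre.a * pre.b;
    }
}
```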


SIMD


• single instruction, multiple data (SIMD)
• architecture-specific instruction sets:
  • Intel: SSE, AVX, AVX2
  • Blue Gene: QPX
• modern compilers do auto-vectorization
• still beneficial to write SIMD instructions explicitly via intrinsics
• problem: separate code for each architecture
• lightweight SIMD abstraction layer in waLBerla to write portable code (see the sketch below)

(Figure: AVX example — vaddpd adds two registers of four packed doubles, 𝑐ᵢ = 𝑎ᵢ + 𝑏ᵢ, with operands in ymm0 and ymm1.)
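A minimal sketch of such an abstraction layer (the real waLBerla API differs). AVX is shown; the idea is that a QPX or SSE backend would provide the same interface, so kernels are written once against the generic vector type.

```cpp
#include <immintrin.h>

// Minimal sketch of a SIMD abstraction layer: architecture-specific
// intrinsics are hidden behind a small vector type.
struct Vec4d
{
    __m256d v;                                         // AVX: 4 packed doubles
    static Vec4d load (const double* p) { return { _mm256_loadu_pd(p) }; }
    void         store(double* p) const { _mm256_storeu_pd(p, v); }
};

inline Vec4d operator+(Vec4d a, Vec4d b) { return { _mm256_add_pd(a.v, b.v) }; }
inline Vec4d operator*(Vec4d a, Vec4d b) { return { _mm256_mul_pd(a.v, b.v) }; }

// A kernel written once against Vec4d; on Blue Gene/Q the same struct would
// wrap QPX intrinsics instead, without touching the kernel code.
void scaledAdd(double* dst, const double* a, const double* b, double s, int n)
{
    const Vec4d scale = { _mm256_set1_pd(s) };
    for (int i = 0; i + 4 <= n; i += 4)                // remainder loop omitted
        (Vec4d::load(a + i) + scale * Vec4d::load(b + i)).store(dst + i);
}
```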


Buffering of staggered values

• to calculate the divergence, values at staggered grid positions are required
• each staggered value is shared by two neighbouring cells, so it can be computed once and buffered (see the sketch below)
• trade-off: more loads and stores, fewer floating point operations
• the same technique can also be applied in the 𝜙-sweep

(Figure: pre-computed values at the staggered grid positions.)
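An illustrative 1D sketch of the buffering idea (not the actual sweep code): the flux at a staggered position is computed once, kept in a buffer, and reused by both adjacent cells instead of being recomputed.

```cpp
#include <vector>

// 1D sketch: buffer staggered fluxes, then reuse them for the divergence.
void divergenceSweep(const std::vector<double>& phi, std::vector<double>& div, double dx)
{
    const int n = static_cast<int>(phi.size());
    std::vector<double> flux(n + 1, 0.0);             // flux[i] lives at position i-1/2

    for (int i = 1; i < n; ++i)                       // compute each staggered value once
        flux[i] = (phi[i] - phi[i - 1]) / dx;

    for (int i = 1; i < n - 1; ++i)                   // each cell reuses two buffered fluxes
        div[i] = (flux[i + 1] - flux[i]) / dx;
}
```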


→ 80× faster than the original version

Intranode Scaling

(Figure: intranode weak scaling on SuperMUC.)

Single Node Optimization Summary

Single node optimizations:
• replace / remove expensive operations like square roots and divisions
• pre-compute and buffer values where possible
• SIMD intrinsics

Percent of peak performance on SuperMUC:
• 𝜙-sweep: 21 %
• μ-sweep: 27 %
• complete program: 25 %

Why not 100 % of peak?
• unbalanced number of multiplications and additions
• divisions are counted as 1 FLOP, but they cost about 43 times as much as a multiplication or an addition

Scaling

• scaling on SuperMUC up to 32,768 cores

• ghost layer based communication

• communication hiding

Managing I/O

• I/O is necessary to store results (frequently) and for checkpointing (seldom)
• for highly parallel simulations, the output of results quickly becomes a bottleneck
• example: storing one time step of a 940 × 940 × 2080 domain takes 87 GB (see the estimate below)
• solution: generate a surface mesh from the voxel data during the simulation, locally on each process, using a marching cubes algorithm
• one mesh for each phase boundary
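A rough consistency check for the 87 GB figure, assuming the 4 + 2 double-precision values per cell listed earlier (the actual on-disk format may differ slightly):

\[
940 \times 940 \times 2080 \approx 1.84 \times 10^{9}\ \text{cells},
\qquad
1.84 \times 10^{9} \cdot (4+2) \cdot 8\ \text{B} \approx 88\ \text{GB}.
\]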


• the surface meshes are still unnecessarily finely resolved: one triangle per interface cell


(Figure: local fine meshes generated by marching cubes on each process are combined into one coarse mesh on the root process.)

• quadric edge reduction algorithm (cglib)
• crucial: the mesh reduction step preserves boundary vertices
• hierarchical mesh coarsening and reduction during the simulation (see the sketch below)
• result: one coarse mesh with a size on the order of several MB
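A sketch of the hierarchical gather-and-reduce pattern. All names, the flat mesh representation, and the placeholder reduction step are illustrative; the actual implementation uses proper mesh data structures and the cglib reduction, keeping boundary vertices fixed.

```cpp
#include <mpi.h>
#include <vector>

// At every level of a binary tree, half of the processes send their mesh to a
// partner; the partner merges the two meshes and reduces the result before
// the next level, so no single process ever holds the full fine mesh.
using Mesh = std::vector<double>;   // placeholder: flat vertex list (x0,y0,z0, x1,...)

Mesh reduceMesh(const Mesh& m)
{
    // Placeholder for the quadric edge reduction step; the real implementation
    // coarsens the mesh while keeping boundary vertices fixed.
    return m;
}

Mesh gatherCoarseMesh(Mesh local, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int step = 1; step < size; step *= 2)
    {
        if (rank % (2 * step) != 0)                      // sender at this level
        {
            int count = static_cast<int>(local.size());
            MPI_Send(&count, 1, MPI_INT, rank - step, 0, comm);
            MPI_Send(local.data(), count, MPI_DOUBLE, rank - step, 1, comm);
            break;                                        // done after handing off
        }
        if (rank + step < size)                           // receiver: merge + reduce
        {
            int count = 0;
            MPI_Recv(&count, 1, MPI_INT, rank + step, 0, comm, MPI_STATUS_IGNORE);
            Mesh other(count);
            MPI_Recv(other.data(), count, MPI_DOUBLE, rank + step, 1, comm, MPI_STATUS_IGNORE);
            local.insert(local.end(), other.begin(), other.end());   // merge
            local = reduceMesh(local);                               // coarsen
        }
    }
    return local;   // rank 0 ends up with the single coarse mesh that is written to disk
}
```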

Summary


• an efficient phase-field implementation is necessary to simulate certain physical effects (e.g. spirals)
• systematic performance engineering on several levels
• speedup by a factor of 80 compared to the original version
• around 25 % of peak performance reached on SuperMUC
• parallel processing of output data during the simulation to reduce the result file size

Outlook

• GPU implementation
• coupling to the Lattice Boltzmann Method
• improve the discretization scheme (implicit method)


Thank you!

Questions?
