A Framework for Hybrid Parallel Flow Simulations with a Trillion Cells in Complex Geometries
SC13, November 21st 2013
Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler, Ulrich Rüde
Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Outline
• waLBerla Framework
• Lattice Boltzmann Method
• Benchmarked Test Cases
• Benchmark Results
• Conclusion & Future Work
SC13, Denver Christian Godenschwager November 21st 2013
The waLBerla Framework
waLBerla – an HPC Framework
• Focus on the lattice Boltzmann method
• Written in C++
• Contains hand-crafted, machine-specific high-performance compute kernels
• Also generic, easily adaptable compute kernels for prototyping
• Modules for handling complex geometries
• Particulate flow simulations by coupling with our physics engine pe
• Models for multiphase and free surface flows
waLBerla – an HPC Framework
• Hybridly parallelized (MPI + OpenMP)
• No data structures growing with the number of processes involved
• Scales from laptop to recent petascale machines
• Parallel output
• Portable (compiler/OS)
• Automated tests / CI servers
• Open source release early 2014
Examples
• Study of the hemodynamic impact of stenoses in coronary arteries
• Turbulent flow (Re = 11,000) around a sphere (Ehsan Fattahi, Daniel Weingaertner)
• Liquid-gas-solid flow simulation: stable floating positions of box-shaped particles (Simon Bogner)
• Constructing a hollow cylinder by electron beam melting (Matthias Markl, Regina Ammer)
• Rigid bodies simulated with the pe physics engine
Lattice Boltzmann Method
• Explicit, mesoscopic method for solving fluid flow problems (or heat transport, arbitrary advection-diffusion equations, …)
• Discretization of the Boltzmann equation
• Provides a solution of the Navier-Stokes equations at low Mach numbers
• Based on a uniformly structured, Cartesian grid of cells
Lattice Boltzmann equation (single-relaxation-time, SRT):

  $f_i(\mathbf{x} + \mathbf{e}_i \delta t,\, t + \delta t) = f_i(\mathbf{x}, t) - \dfrac{f_i(\mathbf{x}, t) - f_i^{eq}\big(\mathbf{u}(\mathbf{x}, t), \rho(\mathbf{x}, t)\big)}{\tau}$

Equilibrium distribution function:

  $f_i^{eq}(\mathbf{u}, \rho) = \omega_i \rho \left( 1 + \dfrac{\mathbf{e}_i \cdot \mathbf{u}}{c_s^2} + \dfrac{(\mathbf{e}_i \cdot \mathbf{u})^2}{2 c_s^4} - \dfrac{\mathbf{u}^2}{2 c_s^2} \right)$

Macroscopic quantities (density, momentum density):

  $\rho = \sum_i f_i \qquad \rho \mathbf{u} = \sum_i \mathbf{e}_i f_i$

Lattice Boltzmann equation (two-relaxation-time, TRT):

  $f_i(\mathbf{x} + \mathbf{e}_i \delta t,\, t + \delta t) = f_i(\mathbf{x}, t) - \dfrac{f_i^{+}(\mathbf{x}, t) - f_i^{eq,+}\big(\mathbf{u}(\mathbf{x}, t), \rho(\mathbf{x}, t)\big)}{\lambda_0} - \dfrac{f_i^{-}(\mathbf{x}, t) - f_i^{eq,-}\big(\mathbf{u}(\mathbf{x}, t), \rho(\mathbf{x}, t)\big)}{\lambda_1}$

The TRT model can improve the accuracy and stability of the LBM.
LBM computationally
[Figure: streaming step and collision step, illustrated on the D2Q9 stencil]
Per cell update with the D3Q19 stencil:
• 19 loads
• 198 flops (TRT)
• 19 stores (+19 loads)
• 305 bytes
LBM Data Structures
[Figure: uniform block decomposition]
• Domain partitioning into blocks containing a uniform grid of cells
• Ghost-layer (halo) exchange of the outer layer(s)
Benchmarked Test Cases
Lid Driven Cavity (LDC) Flow
● Dense
● One block per process
● No load balancing
Flow through Coronary Arteries
● Sparse, but coherent
● Volume fraction 0.3%
● Multiple blocks per process
● Load balancing required
Complex Geometry Initialization
• Complex geometry given by a surface mesh
• Add regular block partitioning
• Discard empty blocks
• Allocate block data
• Load balancing

File size for 500,000 blocks: ~40 MB
⇒ Separate the domain partitioning from the simulation phase
Domain Partitioning
Coronary artery test case initialization (dx = 0.2 mm, target: ≤ 200 blocks):

block size → #blocks
14³ → 649
18³ → 413
23³ → 277
29³ → 201
37³ → 149
33³ → 154
31³ → 184
30³ → 190

Domain partitioning of the coronary tree dataset, one block per process:
• 512 processes, 485 blocks
• 458,752 processes, 458,184 blocks
Hardware

                  JUQUEEN                           SuperMUC
Site              Forschungszentrum Jülich, Germany LRZ, Garching (Munich), Germany
System            IBM Blue Gene/Q                   IBM (Intel Sandy Bridge-EP)
Nodes             28,672                            9,216
Cores             458,752                           147,456
Peak performance  5.9 PFlop/s                       3.2 PFlop/s
Main memory       448 TB                            288 TB
Network           5D torus                          non-blocking tree / 4:1 pruned tree
Benchmark Results
Lid Driven Cavity
• SuperMUC – single socket (LDC, weak scaling)
[Figure: MLUP/s vs. cores (1–8) for the SRT1, SRT2, SRT, and TRT kernels; SRT1, the naïve, straightforward implementation, is already quite optimized; the bandwidth limit is marked]
⇒ limited by memory bandwidth
• JUQUEEN – single node (LDC, weak scaling)
[Figure: MLUP/s vs. cores (1–16) for the SRT and TRT kernels; hybrid version with 4 threads per core; the bandwidth limit is marked]
⇒ limited by memory bandwidth
• SuperMUC – TRT kernel (LDC, weak scaling)
[Figure: MLUP/s per core and communication share (%) vs. cores (32–131,072) for the node configurations 16P 1T, 4P 4T, and 2P 8T (#processes per node × #threads per process); the largest runs span 2 islands]
• JUQUEEN – TRT kernel (LDC, weak scaling)
[Figure: MLUP/s per core vs. cores (32–524,288) for the node configurations 64P 1T, 16P 4T, and 8P 8T (#processes per node × #threads per process)]
1.93 × 10¹² cell updates per second (19 values per cell)
⇒ 383 TFlop/s (6.5% of peak) ⇒ 800 TB/s (67% of peak)
Benchmark Results
Coronary Artery Tree
• JUQUEEN – TRT kernel (COR, weak scaling)
[Figure: parallel efficiency (MFLUP/s per core) and fluid volume fraction vs. cores (512–524,288)]
1.03 trillion load-balanced lattice cells, dx = 1.3 μm
• JUQUEEN – TRT kernel (COR, strong scaling, dx = 0.05 mm)
[Figure: parallel efficiency (MFLUP/s per core) and time steps per second vs. cores (512–524,288)]
• SuperMUC – TRT kernel (COR, strong scaling, dx = 0.1 mm)
[Figure: parallel efficiency (MFLUP/s per core) and time steps per second vs. cores (32–32,768)]
Conclusion & Future Work
• waLBerla runs efficiently on current petascale supercomputers
• Excellent scaling properties
• Execution rates of up to 6,638 LBM time steps per second in strong scaling settings
• Discretization of the coronary artery tree into 1,033,660,569,847 load-balanced lattice cells
• Future: grid refinement and dynamic load balancing
• Useful for particulate flows with fully resolved particles

Thank you!