A Framework for Hybrid Parallel Flow Simulations with a Trillion Cells in Complex Geometries
SC13, November 21st 2013
Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler, Ulrich Rüde
Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Outline
• waLBerla Framework
• Lattice Boltzmann Method
• Benchmarked Test Cases
• Benchmark Results
• Conclusion & Future Work
SC13, Denver Christian Godenschwager November 21st 2013
The waLBerla Framework
waLBerla – an HPC Framework
• Focus on the lattice Boltzmann method
• Written in C++
• Contains hand-crafted, machine-specific high-performance compute kernels
• Also generic, easily adaptable compute kernels for prototyping
• Modules for handling complex geometries
• Particulate flow simulations by coupling with our physics engine pe
• Models for multiphase and free surface flows
waLBerla – an HPC Framework
• Hybridly parallelized (MPI + OpenMP)
• No data structures growing with the number of processes involved
• Scales from laptop to recent petascale machines
• Parallel output
• Portable (compiler/OS)
• Automated tests / CI servers
• Open source release early 2014
Examples
• Study of the hemodynamic impact of stenoses in coronary arteries
• Turbulent flow (Re = 11,000) around a sphere (Ehsan Fattahi, Daniel Weingaertner)
• Liquid-gas-solid flow simulation: stable floating positions of box-shaped particles (Simon Bogner)
• Constructing a hollow cylinder by electron beam melting (Matthias Markl, Regina Ammer)
• Rigid bodies simulated with the pe physics engine
Lattice Boltzmann Method
• Explicit, mesoscopic method for solving fluid flow problems (or heat transport, arbitrary advection-diffusion equations, …)
• Discretization of the Boltzmann equation
• Provides a solution of the Navier-Stokes equations at low Mach numbers
• Based on a uniformly structured, Cartesian grid of cells
Lattice Boltzmann equation (single-relaxation-time, SRT):

  $f_i(\mathbf{x} + \mathbf{e}_i \delta t,\, t + \delta t) = f_i(\mathbf{x}, t) - \dfrac{f_i(\mathbf{x}, t) - f_i^{eq}\big(\mathbf{u}(\mathbf{x}, t), \rho(\mathbf{x}, t)\big)}{\tau}$

Equilibrium distribution function:

  $f_i^{eq}(\mathbf{u}, \rho) = \omega_i \rho \left( 1 + \dfrac{\mathbf{e}_i \cdot \mathbf{u}}{c_s^2} + \dfrac{(\mathbf{e}_i \cdot \mathbf{u})^2}{2 c_s^4} - \dfrac{\mathbf{u}^2}{2 c_s^2} \right)$

Macroscopic quantities (density, momentum density):

  $\rho = \sum_i f_i \qquad \rho \mathbf{u} = \sum_i \mathbf{e}_i f_i$

Lattice Boltzmann equation (two-relaxation-time, TRT):

  $f_i(\mathbf{x} + \mathbf{e}_i \delta t,\, t + \delta t) = f_i(\mathbf{x}, t) - \dfrac{f_i^{+}(\mathbf{x}, t) - f_i^{eq,+}\big(\mathbf{u}(\mathbf{x}, t), \rho(\mathbf{x}, t)\big)}{\lambda_0} - \dfrac{f_i^{-}(\mathbf{x}, t) - f_i^{eq,-}\big(\mathbf{u}(\mathbf{x}, t), \rho(\mathbf{x}, t)\big)}{\lambda_1}$

The TRT model can improve the accuracy and stability of the LBM.
LBM computationally
[Figure: streaming step and collision step, illustrated on the D2Q9 stencil]
Per cell update with the D3Q19 stencil:
• 19 loads
• 198 flops (TRT)
• 19 stores (+19 loads)
• 305 bytes
LBM Data Structures
[Figure: uniform block decomposition]
• Domain partitioning into blocks containing a uniform grid of cells
• Ghost-layer (halo) exchange of the outer layer(s)
Benchmarked Test Cases
Lid Driven Cavity (LDC) Flow
● Dense
● One block per process
● No load balancing
Flow through Coronary Arteries
● Sparse, but coherent
● Volume fraction 0.3%
● Multiple blocks per process
● Load balancing required
Complex Geometry Initialization
• Complex geometry given by a surface mesh
• Add regular block partitioning
• Discard empty blocks
• Allocate block data
• Load balancing

File size for 500,000 blocks: ~40 MB
⇒ Separate the domain partitioning from the simulation phase
Domain Partitioning
Coronary artery test case initialization (dx = 0.2 mm, target: ≤ 200 blocks):

block size → #blocks
14³ → 649
18³ → 413
23³ → 277
29³ → 201
37³ → 149
33³ → 154
31³ → 184
30³ → 190

Domain partitioning of the coronary tree dataset, one block per process:
• 512 processes, 485 blocks
• 458,752 processes, 458,184 blocks
Hardware

                  JUQUEEN                           SuperMUC
Site              Forschungszentrum Jülich, Germany LRZ, Garching (Munich), Germany
System            IBM Blue Gene/Q                   IBM (Intel Sandy Bridge-EP)
Nodes             28,672                            9,216
Cores             458,752                           147,456
Peak performance  5.9 PFlop/s                       3.2 PFlop/s
Main memory       448 TB                            288 TB
Network           5D torus                          non-blocking tree / 4:1 pruned tree
Benchmark Results
Lid Driven Cavity
• SuperMUC – single socket (LDC, weak scaling)
[Figure: MLUP/s vs. cores (1–8) for the SRT1, SRT2, SRT, and TRT kernels; SRT1, the naïve, straightforward implementation, is already quite optimized; the bandwidth limit is marked]
⇒ limited by memory bandwidth
• JUQUEEN – single node (LDC, weak scaling)
[Figure: MLUP/s vs. cores (1–16) for the SRT and TRT kernels; hybrid version with 4 threads per core; the bandwidth limit is marked]
⇒ limited by memory bandwidth
• SuperMUC – TRT kernel (LDC, weak scaling)
[Figure: MLUP/s per core and communication share (%) vs. cores (32–131,072) for the node configurations 16P 1T, 4P 4T, and 2P 8T (#processes per node × #threads per process); the largest runs span 2 islands]
• JUQUEEN – TRT kernel (LDC, weak scaling)
[Figure: MLUP/s per core vs. cores (32–524,288) for the node configurations 64P 1T, 16P 4T, and 8P 8T (#processes per node × #threads per process)]
1.93 × 10¹² cell updates per second (19 values per cell)
⇒ 383 TFlop/s (6.5% of peak) ⇒ 800 TB/s (67% of peak)
Benchmark Results
Coronary Artery Tree
• JUQUEEN – TRT kernel (COR, weak scaling)
[Figure: parallel efficiency (MFLUP/s per core) and fluid volume fraction vs. cores (512–524,288)]
1.03 trillion load-balanced lattice cells, dx = 1.3 μm
• JUQUEEN – TRT kernel (COR, strong scaling, dx = 0.05 mm)
[Figure: parallel efficiency (MFLUP/s per core) and time steps per second vs. cores (512–524,288)]
• SuperMUC – TRT kernel (COR, strong scaling, dx = 0.1 mm)
[Figure: parallel efficiency (MFLUP/s per core) and time steps per second vs. cores (32–32,768)]
Conclusion & Future Work
• waLBerla runs efficiently on current petascale supercomputers
• Excellent scaling properties
• Execution rates of up to 6,638 LBM time steps per second in strong scaling settings
• Discretization of the coronary artery tree into 1,033,660,569,847 load-balanced lattice cells
• Future: grid refinement and dynamic load balancing
• Useful for particulate flows with fully resolved particles

Thank you!