HyMEP - HYPERSONIC METEOROID ENTRY PHYSICS 61st Course of the International School of Quantum Electronics – ERICE-SICILY: 3-8 October 2017
Giuseppe Pascazio [email protected]
Department of Mechanics, Mathematics and Management
Via Re David 200, 70125, Bari, Italy
CFD and state-to-state modeling of hypersonic flows using GPUs
Collaborations
GIANPIERO COLONNA
LUIGI CUTRONE
MICHELE TUTTAFESTA, Liceo Scientifico Statale “Leonardo da Vinci”, Bisceglie, Italy
FRANCESCO BONELLI
DARIO DE MARINIS
Objective and Outline
Objective: Development of a High Performance Computing (HPC) CFD code for investigating high enthalpy flows
Outline:
• Motivation: parallel computing and GPUs
• Governing equations
• Finite volume method: Flux Difference Splitting, Flux Vector Splitting, operator splitting
• Thermochemical non-equilibrium models
• Results
• Conclusions
Does CFD need faster computers?
[Figure: single-core CPU computational time to complete a simulation. Hypersonic flow over a sphere (2D, 512x256 mesh, Euler and Navier-Stokes) with 5-species air: hours for frozen chemistry, about a day for the Park model, tens of days for the StS model. Compressible non-reacting turbulent jet flow at Re = 10^4: 3D LES (100*10^6 points) roughly 2500 days, 3D DNS (3.6*10^12 points) roughly 3*10^9 days.]
How to build faster computers?
Increase frequency: 1980 PCs ran at 1 MHz; today up to 4 GHz. However, frequency is bounded by power and heat restrictions, as well as by the physical limit of transistor size.
Parallel computing:
• clusters of multi-core Central Processing Units (CPUs)
• clusters of Graphics Processing Units (GPUs)
GPU
• Born for display rendering: specialized libraries were developed for rendering programming (Direct-X, OpenGL, etc.).
• Uses many cores to perform the math operations needed to alter pixel colors in real time.
• Not born for general-purpose computing: the first attempts at scientific computing required knowing Direct-X and OpenGL, translating the problem in terms of colors, performing the computations on the GPU, and translating the colors back into scientific data (very convoluted).
• The new era of GPGPU (General-Purpose computing on GPUs) began in November 2006, when the NVIDIA CUDA architecture was released. CUDA is not only a new hardware architecture: above all, it provides a programming language (a C/C++ extension) that makes the GPU easy to use for general-purpose computing.
J. Sanders and E. Kandrot, CUDA by Example, 2010
GPU Computing
GPUs are attractive for High Performance Computing:
• Massive multithreaded many-core chips
• Huge amount of Flops (both single and double precision)
• High memory bandwidth
• Energy efficiency and low cost
• Available tools: debuggers, profilers, libraries (BLAS, FFT, LAPACK)
• Programming languages: CUDA C, CUDA Fortran, OpenCL
GPU hardware is organized into Streaming Multiprocessors (SMs), and the software implementation tries to mirror the hardware structure: a block of hundreds of threads executes on a single SM, while the whole grid of thousands of threads is distributed over the GPU.
J. Cheng ,M. Grossman ,T. McKercher, Professional CUDA® C Programming, John Wiley & Sons, Inc.
GPU vs CPU: formidable amount of Flops
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
GPU vs CPU: high memory bandwidth
CUDA C PROGRAMMING GUIDE PG-02829-001_v9.0 | September 2017
GPU vs CPU: high energy efficiency
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
GPU vs CPU: Green500 list
13 of the first 15 clusters in the Green500 list are powered by NVIDIA GPUs
https://www.top500.org/green500/list/2017/06/
CUDA C parallel programming example: vectors sum
Splitting the work into parallel blocks is needed to exploit the full GPU capacity
add<<<B, T>>>(dev_a, dev_c, dev_d)
BxT total number of threads; B blocks; T threads per block
Example of 2 blocks with 4 threads per block: blockDim.x = 4, gridDim.x = 2, for a total of 8 threads
Linear mapping: tid = threadIdx.x + blockIdx.x * blockDim.x
A stride over the whole grid is needed if N > BxT, as in the sketch below
J. Sanders, E. Kandrot, CUDA by example, Addison-Wesley, New-York, 2011.
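A minimal self-contained sketch of the vector sum along these lines (the array size N, the B/T values, and the kernel body are illustrative, not the actual course code):

#include <cuda_runtime.h>

#define N (1 << 20)                 // illustrative problem size

// Each thread starts from its global linear index and strides by the total
// number of launched threads (grid-stride loop), so the kernel also covers
// the case N > B*T.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int tid    = threadIdx.x + blockIdx.x * blockDim.x;  // linear mapping
    int stride = blockDim.x * gridDim.x;                 // B*T threads in total
    for (int i = tid; i < n; i += stride)
        c[i] = a[i] + b[i];
}

int main()
{
    const int B = 128, T = 256;     // B blocks, T threads per block
    float *dev_a, *dev_b, *dev_c;
    cudaMalloc(&dev_a, N * sizeof(float));
    cudaMalloc(&dev_b, N * sizeof(float));
    cudaMalloc(&dev_c, N * sizeof(float));
    // ... fill dev_a and dev_b, e.g. with cudaMemcpy from host arrays ...
    add<<<B, T>>>(dev_a, dev_b, dev_c, N);
    cudaDeviceSynchronize();
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}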
Governing equations
\frac{\partial}{\partial t} \int_V \mathbf{U}\, dV + \oint_S \mathbf{F} \cdot \mathbf{n}\, dS = \int_V \mathbf{W}\, dV

\mathbf{U} = [\rho_{1,1}, \ldots, \rho_{1,V_1}, \ldots, \rho_{S,1}, \ldots, \rho_{S,V_S},\; \rho u,\; \rho v,\; \rho E,\; \rho_1 e_{vib,1}, \ldots, \rho_M e_{vib,M}]^T

\mathbf{F}_E = [\rho_{1,1} u, \ldots, \rho_{S,V_S} u,\; \rho u^2 + p,\; \rho u v,\; (\rho E + p) u,\; \rho_1 e_{vib,1} u, \ldots, \rho_M e_{vib,M} u]^T

\mathbf{G}_E = [\rho_{1,1} v, \ldots, \rho_{S,V_S} v,\; \rho u v,\; \rho v^2 + p,\; (\rho E + p) v,\; \rho_1 e_{vib,1} v, \ldots, \rho_M e_{vib,M} v]^T

\mathbf{W} = [\dot{\omega}_{1,1}, \ldots, \dot{\omega}_{1,V_1}, \ldots, \dot{\omega}_{S,1}, \ldots, \dot{\omega}_{S,V_S},\; 0,\; 0,\; 0,\; \dot{\omega}_{vib,1}, \ldots, \dot{\omega}_{vib,M}]^T

\mathbf{F} = \mathbf{F}_E - \mathbf{F}_V, \qquad \mathbf{G} = \mathbf{G}_E - \mathbf{G}_V

U is the vector of the conservative variables, F_E/F_V and G_E/G_V are the inviscid/viscous flux vectors, and W is the source-term vector. S is the number of chemical components, the s-th one having V_s internal levels; the state-to-state approach thus considers N = \sum_{s=1}^{S} V_s independent species.
Compressible Navier-Stokes (N-S) equations for a multicomponent mixture of reacting gases in thermochemical non-equilibrium for both Park and StS models
Governing equations
\mathbf{F}_V, \mathbf{G}_V = [\rho_1 \mathbf{u}^d_1, \ldots, \rho_N \mathbf{u}^d_N,\; \boldsymbol{\tau},\; \boldsymbol{\tau} \cdot \mathbf{u} - \mathbf{q},\; \mathbf{q}_{vib,1}, \ldots, \mathbf{q}_{vib,M}]^T

F_V and G_V are the viscous flux vectors. In the present implementation the transport properties of the single species are evaluated by using Gupta’s curve fits1; classical mixing rules are used for the mixture properties.

\rho_i \mathbf{u}^d_i = -\rho D_i \nabla Y_i

\boldsymbol{\tau} = \mu \left( \nabla \mathbf{u} + \nabla \mathbf{u}^T \right) - \frac{2}{3} \mu \left( \nabla \cdot \mathbf{u} \right) \mathbf{I}

\mathbf{q} = -\lambda_t \nabla T - \sum_m \lambda_{vib,m} \nabla T_{vib,m} + \sum_i \rho_i \mathbf{u}^d_i h_i

\mathbf{q}_{vib,m} = -\lambda_{vib,m} \nabla T_{vib,m} + \rho_m e_{vib,m} \mathbf{u}^d_m
1Gupta et al., A review of reaction rates and thermodynamic and transport properties for an 11-species air model for chemical and thermal nonequilibrium calculations to 30000 K, NASA Reference Publication 1232, 1990
Numerical method
Cell-centered Finite Volume space discretization on a multi-block structured mesh:

\frac{d \mathbf{U}_{i,j}}{dt} V_{i,j} + \sum_{faces} \left( \mathbf{F}_{num} \cdot \mathbf{n} \right) S = \mathbf{W}_{i,j} V_{i,j}

\mathbf{F}_{num} = \mathbf{F}_{E,num} - \mathbf{F}_{V,num}
Reactive Navier-Stokes equations: advection and pressure terms (hyperbolic); shear-stress and heat-flux terms (diffusive); chemical source terms (stiff)
John C. Tannehill, Dale Anderson, Richard H. Pletcher, Computational Fluid Mechanics and Heat Transfer, Taylor & Francis 1997
Numerical method
\frac{d \mathbf{U}_{i,j}}{dt} V_{i,j} + \sum_{faces} \left( \mathbf{F}_{num} \cdot \mathbf{n} \right) S = \mathbf{W}_{i,j} V_{i,j}, \qquad \mathbf{F}_{num} = \mathbf{F}_{E,num} - \mathbf{F}_{V,num}
Operator splitting approach: frozen step + chemical step.
Frozen step: Method of Lines:
• Space discretization + time integration
  - Space discretization: inviscid & viscous term schemes
  - Time integration: Runge-Kutta scheme
Chemical step: implicit scheme for the stiff source terms.
Solution strategy:
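As a rough sketch of this strategy, a host-side time loop might look as follows (the kernel names, arguments, and empty kernel bodies are illustrative placeholders, not the actual code):

#include <cuda_runtime.h>

// Illustrative stand-ins for the solver kernels.
__global__ void compute_residual(double *U, double *R) { /* inviscid + viscous fluxes */ }
__global__ void rk_stage(double *U, double *R, int stage, double dt) { /* RK update */ }
__global__ void chemical_step(double *U, double dt) { /* point-implicit source terms */ }

void advance(double *U, double *R, int n_steps, double dt, dim3 blocks, dim3 threads)
{
    for (int n = 0; n < n_steps; ++n) {
        // Frozen step: Method of Lines, explicit Runge-Kutta, chemistry held fixed.
        for (int stage = 0; stage < 3; ++stage) {          // e.g. third-order RK
            compute_residual<<<blocks, threads>>>(U, R);
            rk_stage<<<blocks, threads>>>(U, R, stage, dt);
        }
        // Chemical step: implicit integration of the stiff source terms.
        chemical_step<<<blocks, threads>>>(U, dt);
    }
    cudaDeviceSynchronize();
}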
Frozen step: inviscid flux space discretization
\frac{d \mathbf{U}_{i,j}}{dt} V_{i,j} + \sum_{faces} \left( \mathbf{F}_{num} \cdot \mathbf{n} \right) S = 0

Semi-Discrete Schemes or Method of Lines. Frozen equation:

\frac{d \mathbf{U}_{i,j}}{dt} = -\frac{1}{V_{i,j}} \sum_{faces} \left( \mathbf{F}_{E,num} - \mathbf{F}_{V,num} \right) \cdot \mathbf{n}\, S

The resulting ODE system is solved with an explicit Runge-Kutta scheme.

\mathbf{F}_{E,num}: methods for solving non-linear hyperbolic conservation laws. Discretized inviscid Burgers equation: solution of the Riemann problem at each interface.
John C. Tannehill, Dale Anderson, Richard H. Pletcher, Computational Fluid Mechanics and Heat Transfer, Taylor & Francis 1997
Frozen step: inviscid flux space discretization
Solution of the Riemann problem for the inviscid Burgers equation: (left) structure of the general solution (a single wave, either shock or rarefaction), (middle) solution consisting of a shock wave, and (right) solution consisting of a rarefaction wave.* An exact Riemann solver is inexpensive for the inviscid Burgers equation, but its cost increases dramatically for a general m x m non-linear hyperbolic system.
Structure of the Riemann problem for the 1D Euler equations
*Eleuterio F. Toro, Riemann Solvers and Numerical Methods for Fluid Dynamics, Springer 2009
Frozen step: inviscid flux space discretization
Approximate solvers: • Flux Difference Splitting (FDS) or Approximate Riemann solver: Roe’s scheme • Flux Vector Splitting (FVS): Steger and Warming (SW), AUSM
Roe’s scheme: why solve the exact Riemann problem when a lot of information is lost due to the discrete representation?
\mathbf{U}_t + \mathbf{F}(\mathbf{U})_x = 0 \;\Rightarrow\; \mathbf{U}_t + \mathbf{A}(\mathbf{U})\, \mathbf{U}_x = 0, \qquad \mathbf{A} = \frac{\partial \mathbf{F}}{\partial \mathbf{U}}

\mathbf{U}_t + \tilde{\mathbf{A}}\, \mathbf{U}_x = 0: the approximate Riemann problem, solved exactly

\tilde{\mathbf{A}} = \tilde{\mathbf{A}}(\mathbf{U}_L, \mathbf{U}_R): linearized (constant) Jacobian matrix
P.L. Roe, Approximate Riemann solvers, parameter vectors, and difference schemes, Journal of Computational Physics 43(2) 357-372, 1981.
Frozen step: inviscid flux space discretization
How to choose A(UL,UR) is not trivial
Rigorous construction of \tilde{\mathbf{A}} proposed by Roe: the only allowable matrices are those satisfying Property U (uniform validity across discontinuities):
1. it constitutes a linear mapping from the vector space U to the vector space F;
2. as U_L, U_R → U, \tilde{A}(U_L, U_R) → A(U) (ensures consistency);
3. for any U_L, U_R: \tilde{A}(U_L, U_R)(U_L − U_R) = F_L − F_R (ensures conservation);
4. the eigenvectors of \tilde{A} are linearly independent.
By introducing the parameter vector \mathbf{z} = \sqrt{\rho}\,[1, u, H]^T, Roe constructed \tilde{\mathbf{A}} for the Euler equations: \tilde{\mathbf{A}} = \mathbf{A}(\tilde{\mathbf{U}}), with the Roe-averaged state

\tilde{u} = \frac{\sqrt{\rho_L}\, u_L + \sqrt{\rho_R}\, u_R}{\sqrt{\rho_L} + \sqrt{\rho_R}}, \qquad
\tilde{H} = \frac{\sqrt{\rho_L}\, H_L + \sqrt{\rho_R}\, H_R}{\sqrt{\rho_L} + \sqrt{\rho_R}}, \qquad
\tilde{a}^2 = (\gamma - 1)\left( \tilde{H} - \tfrac{1}{2}\tilde{u}^2 \right)
P.L. Roe, Approximate Riemann solvers, parameter vectors, and difference schemes, Journal of Computational Physics 43(2) 357-372, 1981.
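A small sketch of the Roe-averaged state for the 1D Euler equations (the struct layout and function names are illustrative):

#include <math.h>

struct State { double rho, u, H; };   // density, velocity, total enthalpy

// Roe average between left and right states: density-square-root weighting.
__host__ __device__ inline State roe_average(State L, State R)
{
    double wL = sqrt(L.rho), wR = sqrt(R.rho);
    State t;
    t.rho = wL * wR;                             // Roe-averaged density
    t.u   = (wL * L.u + wR * R.u) / (wL + wR);   // u~
    t.H   = (wL * L.H + wR * R.H) / (wL + wR);   // H~
    return t;
}

// Roe-averaged sound speed: a~^2 = (gamma - 1)(H~ - u~^2 / 2)
__host__ __device__ inline double roe_sound_speed(State t, double gamma)
{
    return sqrt((gamma - 1.0) * (t.H - 0.5 * t.u * t.u));
}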
Frozen step: inviscid flux space discretization
Flux Difference Splitting: Upwind discretization
The discretization of the equations is performed according to the discrete propagation directions
\mathbf{F}_{i+1/2} = \frac{1}{2}\left( \mathbf{F}_L + \mathbf{F}_R \right) - \frac{1}{2} \sum_i \tilde{\alpha}_i \left| \tilde{\lambda}_i \right| \tilde{\mathbf{K}}^{(i)}

\tilde{\mathbf{A}} = \mathbf{K} \boldsymbol{\Lambda} \mathbf{K}^{-1}, \qquad
\lambda_i^{+} = \max(\lambda_i, 0) = \tfrac{1}{2}\left( \lambda_i + |\lambda_i| \right), \qquad
\lambda_i^{-} = \min(\lambda_i, 0) = \tfrac{1}{2}\left( \lambda_i - |\lambda_i| \right)

\boldsymbol{\Lambda}: diagonal eigenvalue matrix; \mathbf{K}: matrix of the right eigenvectors

Compact form:

\mathbf{F}_{i+1/2} = \frac{1}{2}\left( \mathbf{F}_L + \mathbf{F}_R \right) - \frac{1}{2} \left| \tilde{\mathbf{A}} \right| \left( \mathbf{U}_R - \mathbf{U}_L \right), \qquad \left| \tilde{\mathbf{A}} \right| = \mathbf{K} \left| \boldsymbol{\Lambda} \right| \mathbf{K}^{-1}
*Eleuterio F. Toro, Riemann Solvers and Numerical Methods for Fluid Dynamics, Springer 2009
Frozen step: higher order (MUSCL approach)
Schemes described so far are first-order accurate: the variables are constant within each control volume.

Van Leer's* idea: reconstruct the variables within each cell, allowing discontinuities across cell interfaces.

Cell-averaged values are used to reconstruct U_R and U_L, thus leading to higher-order fluxes: MUSCL (Monotone Upstream-centered Schemes for Conservation Laws).

Since higher-order schemes are not stable when dealing with discontinuities, limiter functions are applied to switch to a first-order approximation across discontinuities.
*van Leer, B. (1979). Towards the Ultimate Conservative Difference Scheme, V: A Second-Order Sequel to Godunov's Method, J. Comput. Phys., vol. 32, pp. 101-136.
u^r_{i+1/2} = u_{i+1} - \frac{1}{4}\left[ (1+\eta)\,\phi\,\Delta u_{i+1/2} + (1-\eta)\,\phi\,\Delta u_{i+3/2} \right]

u^l_{i+1/2} = u_i + \frac{1}{4}\left[ (1-\eta)\,\phi\,\Delta u_{i-1/2} + (1+\eta)\,\phi\,\Delta u_{i+1/2} \right]

\Delta u_{i+1/2} = u_{i+1} - u_i, \qquad \phi: limiter function
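A compact sketch of this reconstruction for a scalar variable, with the minmod limiter as an illustrative choice (eta selects the member of the family; the parameter values and function names are not taken from the actual code):

#include <math.h>

// Minmod limiter: argument of smallest magnitude when the slopes agree in
// sign, zero otherwise (falls back to first order at extrema/discontinuities).
__host__ __device__ inline double minmod(double a, double b)
{
    if (a * b <= 0.0) return 0.0;
    return (fabs(a) < fabs(b)) ? a : b;
}

// Limited MUSCL left/right states at face i+1/2 from the cell averages
// u[i-1], u[i], u[i+1], u[i+2]; om = (3 - eta)/(1 - eta) is the usual
// compression parameter (requires eta < 1).
__host__ __device__ inline void muscl_states(const double *u, int i, double eta,
                                             double *uL, double *uR)
{
    double om = (3.0 - eta) / (1.0 - eta);
    double dm = u[i]   - u[i-1];   // Delta u_{i-1/2}
    double d0 = u[i+1] - u[i];     // Delta u_{i+1/2}
    double dp = u[i+2] - u[i+1];   // Delta u_{i+3/2}
    *uL = u[i]   + 0.25 * ((1.0 - eta) * minmod(dm, om * d0)
                         + (1.0 + eta) * minmod(d0, om * dm));
    *uR = u[i+1] - 0.25 * ((1.0 + eta) * minmod(d0, om * dp)
                         + (1.0 - eta) * minmod(dp, om * d0));
}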
Results
Scheme of the CPU and GPU code modules, with the subroutine names specified
M. Tuttafesta, G. Colonna, G. Pascazio, Comput. Phys. Comm. 184 (6) (2013) 1497–1510.
Results
Profiling and thread sensitivity analysis: supersonic nozzle test case
Kernel GPU execution times for Test A: analysis of the number of threads per block.
Multiprocessor warp occupancy, varying the threads per block and with different register usage. Occupancy = (active warps)/(maximum warps). A warp is a group of 32 threads that execute the same instruction.
M. Tuttafesta, G. Colonna, G. Pascazio, Comput. Phys. Comm. 184 (6) (2013) 1497–1510.
Results: NVIDIA Tesla C2050 vs Pentium(R) Dual-Core E5645
Nozzle test case: 200x200 cells
Hardware specifications. Total speed-up as a function of the number of time steps n.
M. Tuttafesta, G. Colonna, G. Pascazio, Comput. Phys. Comm. 184 (6) (2013) 1497–1510.
Frozen step: inviscid flux space discretization
Steger and Warming Flux Vector Splitting
\mathbf{F} = \mathbf{A}\mathbf{U}, since F is a homogeneous function of degree one of U. Hence:

\mathbf{F} = \mathbf{F}^{+} + \mathbf{F}^{-} = \mathbf{K} \boldsymbol{\Lambda}^{+} \mathbf{K}^{-1} \mathbf{U} + \mathbf{K} \boldsymbol{\Lambda}^{-} \mathbf{K}^{-1} \mathbf{U}, \qquad
\lambda_i^{\pm} = \tfrac{1}{2}\left( \lambda_i \pm |\lambda_i| \right)

For the 1D Euler equations (\lambda_1 = u, \lambda_2 = u + a, \lambda_3 = u - a):

\mathbf{F}^{\pm} = \frac{\rho}{2\gamma}
\begin{bmatrix}
2(\gamma - 1)\lambda_1^{\pm} + \lambda_2^{\pm} + \lambda_3^{\pm} \\
2(\gamma - 1)\lambda_1^{\pm} u + \lambda_2^{\pm}(u + a) + \lambda_3^{\pm}(u - a) \\
(\gamma - 1)\lambda_1^{\pm} u^2 + \tfrac{1}{2}\lambda_2^{\pm}(u + a)^2 + \tfrac{1}{2}\lambda_3^{\pm}(u - a)^2 + \dfrac{(3 - \gamma)\left( \lambda_2^{\pm} + \lambda_3^{\pm} \right) a^2}{2(\gamma - 1)}
\end{bmatrix}

\mathbf{F}_{i+1/2} = \mathbf{F}^{+}(\mathbf{U}_L) + \mathbf{F}^{-}(\mathbf{U}_R)
Compared to Roe’s FDS, SW FVS is easier to implement and less computationally expensive, but it is also less accurate. Upwinding is performed by splitting the flux into positive and negative components. FDS schemes can be interpreted as being driven by a series of waves, whereas FVS schemes transport particles according to the characteristic information.*
J.L. Steger, R.F. Warming , Flux vector splitting of the inviscid gasdynamic equations with application to finite-difference methods, Journal of Computational Physics 40 (2) 263-293, 1981 *John C. Tannehill, Dale Anderson, Richard H. Pletcher, Computational Fluid Mechanics and Heat Transfer, Taylor & Francis 1997
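A sketch of the split flux above for the 1D Euler equations (sgn = +1 gives F+, sgn = -1 gives F-; names are illustrative):

#include <math.h>

// Steger-Warming split flux F±(U) for the 1D Euler equations, ideal gas.
__host__ __device__ inline void sw_split_flux(double rho, double u, double a,
                                              double gamma, int sgn, double F[3])
{
    // Split eigenvalues: lambda± = (lambda ± |lambda|)/2
    double l1 = 0.5 * (u       + sgn * fabs(u));
    double l2 = 0.5 * ((u + a) + sgn * fabs(u + a));
    double l3 = 0.5 * ((u - a) + sgn * fabs(u - a));
    double c  = rho / (2.0 * gamma);
    F[0] = c * (2.0 * (gamma - 1.0) * l1 + l2 + l3);
    F[1] = c * (2.0 * (gamma - 1.0) * l1 * u + l2 * (u + a) + l3 * (u - a));
    F[2] = c * ((gamma - 1.0) * l1 * u * u
              + 0.5 * l2 * (u + a) * (u + a)
              + 0.5 * l3 * (u - a) * (u - a)
              + (3.0 - gamma) * (l2 + l3) * a * a / (2.0 * (gamma - 1.0)));
}

// Interface flux: F_{i+1/2} = F+(U_i) + F-(U_{i+1}).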
Frozen step: inviscid flux space discretization
AUSM: Advection Upstream Splitting Method
Two physically distinct parts that are separately discretized: convective and pressure
\mathbf{F} = \mathbf{F}^{(c)} + \mathbf{F}^{(p)} = u \begin{bmatrix} \rho \\ \rho u \\ \rho H \end{bmatrix} + \begin{bmatrix} 0 \\ p \\ 0 \end{bmatrix}
A cell-face Mach number is used to advect the convective term, whereas the pressure flux is governed by acoustic wave speed
\mathbf{F}_{1/2} = M_{1/2}\, \frac{1}{2} \left[ \begin{pmatrix} \rho a \\ \rho a u \\ \rho a H \end{pmatrix}_L + \begin{pmatrix} \rho a \\ \rho a u \\ \rho a H \end{pmatrix}_R \right] - \frac{1}{2} \left| M_{1/2} \right| \left[ \begin{pmatrix} \rho a \\ \rho a u \\ \rho a H \end{pmatrix}_R - \begin{pmatrix} \rho a \\ \rho a u \\ \rho a H \end{pmatrix}_L \right] + \begin{pmatrix} 0 \\ p_{1/2} \\ 0 \end{pmatrix}

with M_{1/2} = M_L^{+} + M_R^{-} and p_{1/2} = p_L^{+} + p_R^{-} given by the AUSM splitting functions.
It retains the accuracy of Roe's FDS along with the simplicity and robustness of the SW FVS.
M.-S.Liou, C.J. Steffen, A New Flux Splitting Scheme, Journal of Computational Physics 107 (1) 23-39, 1993
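A sketch of a first-order AUSM interface flux for the 1D Euler equations, using the Liou-Steffen subsonic polynomials for M± and p± (function names are illustrative):

#include <math.h>

// Split Mach number: M± = ±(M ± 1)^2 / 4 for |M| <= 1, (M ± |M|)/2 otherwise.
__host__ __device__ inline double ausm_M(double M, int sgn)
{
    if (fabs(M) <= 1.0) return sgn * 0.25 * (M + sgn) * (M + sgn);
    return 0.5 * (M + sgn * fabs(M));
}

// Split pressure: p± = p (M ± 1)^2 (2 ∓ M) / 4 for |M| <= 1, p (M ± |M|)/(2M) otherwise.
__host__ __device__ inline double ausm_p(double p, double M, int sgn)
{
    if (fabs(M) <= 1.0) return 0.25 * p * (M + sgn) * (M + sgn) * (2.0 - sgn * M);
    return 0.5 * p * (M + sgn * fabs(M)) / M;
}

// AUSM flux at a face from left/right states (a: sound speed, H: total enthalpy).
__host__ __device__ inline void ausm_flux(double rhoL, double uL, double pL, double aL, double HL,
                                          double rhoR, double uR, double pR, double aR, double HR,
                                          double F[3])
{
    double Mh = ausm_M(uL / aL, +1) + ausm_M(uR / aR, -1);          // M_{1/2}
    double ph = ausm_p(pL, uL / aL, +1) + ausm_p(pR, uR / aR, -1);  // p_{1/2}
    double phiL[3] = { rhoL * aL, rhoL * aL * uL, rhoL * aL * HL };
    double phiR[3] = { rhoR * aR, rhoR * aR * uR, rhoR * aR * HR };
    for (int k = 0; k < 3; ++k)   // convective part, upwinded by M_{1/2}
        F[k] = 0.5 * Mh * (phiL[k] + phiR[k]) - 0.5 * fabs(Mh) * (phiR[k] - phiL[k]);
    F[1] += ph;                   // pressure flux acts on the momentum equation only
}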
Frozen step: viscous schemes
Viscous terms involve gradients that have to be evaluated on the cell faces. Due to their dissipative nature, central differences are used. A convenient procedure for generalized curvilinear coordinates is to apply the Gauss divergence theorem.
\int_V \nabla u\, dV = \oint_S u\, \mathbf{n}\, dS \quad \Rightarrow \quad \overline{\nabla u} \simeq \frac{1}{V} \sum_{i \in faces} u_i\, \mathbf{n}_i\, S_i

\overline{\left( \frac{\partial u}{\partial x} \right)} \simeq \frac{1}{V} \sum_{i \in faces} u_i\, n_{x,i}\, S_i, \qquad
\overline{\left( \frac{\partial u}{\partial y} \right)} \simeq \frac{1}{V} \sum_{i \in faces} u_i\, n_{y,i}\, S_i
John C. Tannehill, Dale Anderson, Richard H. Pletcher, Computational Fluid Mechanics and Heat Transfer, Taylor & Francis 1997
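A sketch of this face-sum gradient for a single 2D cell (face values, unit normals, and face areas are assumed already available; names illustrative):

// Cell-averaged gradient of u from the Gauss divergence theorem:
// grad(u) ≈ (1/V) Σ_faces u_f n_f S_f, with outward unit normals n_f.
__host__ __device__ inline void gauss_gradient(const double *u_face,
                                               const double *nx, const double *ny,
                                               const double *S, int n_faces, double V,
                                               double *dudx, double *dudy)
{
    double gx = 0.0, gy = 0.0;
    for (int f = 0; f < n_faces; ++f) {
        gx += u_face[f] * nx[f] * S[f];   // u_f n_{x,f} S_f
        gy += u_face[f] * ny[f] * S[f];   // u_f n_{y,f} S_f
    }
    *dudx = gx / V;
    *dudy = gy / V;
}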
Numerical method
Inviscid flux: AUSM or Steger-Warming Flux Vector Splitting, with the MUSCL approach for higher-order accuracy; Runge-Kutta scheme up to third order for time integration. Viscous flux: gradients of the primitive variables are evaluated by applying the Gauss theorem.
Frozen step
Operator splitting approach
\frac{\partial}{\partial t} \int_V \mathbf{U}\, dV + \oint_S \mathbf{F} \cdot \mathbf{n}\, dS = 0
Numerical method
Chemical step
Sub-time step: \Delta t_c = \Delta t_f / n

\frac{d y_i}{dt} = P_i(\mathbf{y}) - L_i(\mathbf{y})\, y_i, \qquad i = 1, \ldots, N

P is a vector and L a diagonal matrix; P_i and L_i y_i are non-negative and represent, respectively, the production and loss terms for component y_i.

Gauss-Seidel iterative scheme:

y_i^{(k+1)}(t + \Delta t_c) = \frac{y_i(t) + P_i^{(k)}\, \Delta t_c}{1 + L_i^{(k)}\, \Delta t_c}
Operator splitting approach
\frac{\partial}{\partial t} \int_V \mathbf{U}\, dV = \int_V \mathbf{W}\, dV
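A sketch of the point-implicit Gauss-Seidel update above as a CUDA kernel, one thread per cell (the number of components, the inner-iteration count, and the kinetics routine are illustrative placeholders):

#define N_SP 5      // components per cell (illustrative)
#define N_GS 8      // Gauss-Seidel inner iterations (as in the profiling slide later on)

// Placeholder kinetics: the real code builds P_i and L_i from the reaction
// rates of the chosen model (Park or StS).
__device__ void eval_production_loss(const double *y, double *P, double *L)
{
    for (int i = 0; i < N_SP; ++i) { P[i] = 0.0; L[i] = 0.0; }
}

// y_i^{(k+1)}(t + dt_c) = ( y_i(t) + P_i^{(k)} dt_c ) / ( 1 + L_i^{(k)} dt_c )
__global__ void chemical_step(double *y_cells, int n_cells, double dt_c)
{
    int c = threadIdx.x + blockIdx.x * blockDim.x;
    if (c >= n_cells) return;
    double *y0 = &y_cells[c * N_SP];                  // y(t) for this cell
    double y[N_SP], P[N_SP], L[N_SP];
    for (int i = 0; i < N_SP; ++i) y[i] = y0[i];
    for (int k = 0; k < N_GS; ++k)
        for (int i = 0; i < N_SP; ++i) {
            eval_production_loss(y, P, L);            // uses the latest y (Gauss-Seidel)
            y[i] = (y0[i] + P[i] * dt_c) / (1.0 + L[i] * dt_c);
        }
    for (int i = 0; i < N_SP; ++i) y0[i] = y[i];      // store y(t + dt_c)
}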
Thermochemical non-equilibrium models for air
MULTI-TEMPERATURE 5-SPECIES PARK MODEL1
• 17 reactions + 3 transport equations for the vibrational energies
• Arrhenius-type rate coefficients, functions of an effective temperature calculated as a geometric mean of the translational and vibrational temperatures
• Vibrational levels follow a Boltzmann distribution at temperature Tv
• Tuned on experimental measurements
• Not computationally demanding
• May fail when the conditions are far from those for which it was tuned

5-SPECIES STATE-TO-STATE (StS) MODEL2
• Detailed vibrational kinetics of the molecules
• 68 and 47 vibrational levels for N2 and O2, respectively
• Thousands of elementary processes: high accuracy, but huge computational cost
1 C. Park, Nonequilibrium Hypersonic Aerothermodynamics, Wiley, New York, 1990 2 M. Capitelli et al., Fundamentals Aspects of Plasma Chemical Physics: Kinetics, Springer Science & Business Media, 2015
Multi-temperature 5 species Park model
REACTIONS:
Dissociation:
N2 + X ⇌ 2N + X
O2 + X ⇌ 2O + X
NO + X ⇌ N + O + X, with X = N2, O2, NO, N, O
Zeldovich exchange reactions:
N2 + O ⇌ N + NO
O2 + N ⇌ NO + O
Generic i-th reaction:
\sum_{k=1}^{K} \nu'_{ki} M_k \rightleftharpoons \sum_{k=1}^{K} \nu''_{ki} M_k

Chemical production rate of the k-th species:
\dot{\omega}_k = M_k \sum_{i=1}^{I} \nu_{ki}\, q_i

Net stoichiometric coefficient:
\nu_{ki} = \nu''_{ki} - \nu'_{ki}

Rate of progress of the i-th reaction:
q_i = k_{f,i} \prod_{k=1}^{K} [X_k]^{\nu'_{ki}} - k_{r,i} \prod_{k=1}^{K} [X_k]^{\nu''_{ki}}
Multi-temperature 5 species Park model
The two-temperature Park model assumes that the Arrhenius rate constants are functions of an effective temperature, the geometric average of the translational-rotational temperature (T) and the vibrational temperature (Tv), in the form:
Arrhenius forward rate constant:
k_{f,i} = A_i\, T_a^{C_i} \exp\!\left( -\frac{E_i}{R\, T_a} \right)

Reverse rate constant:
k_{r,i} = \frac{k_{f,i}(T_a)}{K_i(T_a)}

Effective temperature:
T_a = T^{q}\, T_v^{1-q}, with q between 0.3 and 0.7

Landau-Teller evolution of the vibrational energy:
\frac{d e_{vib,m}}{dt} = \frac{e_{vib,m}(T) - e_{vib,m}(T_{V,m})}{\tau_m}

\tau_m: vibrational energy relaxation time (Millikan-White expression)
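A sketch of these rate evaluations (the Millikan-White form is the standard one; all coefficient values are inputs, and the function names are illustrative):

#include <math.h>

// Effective Park temperature: T_a = T^q * Tv^(1-q), q typically 0.3-0.7.
__host__ __device__ inline double park_Ta(double T, double Tv, double q)
{
    return pow(T, q) * pow(Tv, 1.0 - q);
}

// Arrhenius forward rate constant: k_f = A * Ta^C * exp(-E/(R*Ta)).
__host__ __device__ inline double kf_arrhenius(double A, double C,
                                               double E_over_R, double Ta)
{
    return A * pow(Ta, C) * exp(-E_over_R / Ta);
}

// Millikan-White vibrational relaxation time (p in atm, tau in s):
// p * tau = exp[ A_mw (T^(-1/3) - 0.015 mu^(1/4)) - 18.42 ],
// with mu the reduced mass of the colliding pair in amu.
__host__ __device__ inline double tau_millikan_white(double A_mw, double mu,
                                                     double T, double p_atm)
{
    return exp(A_mw * (pow(T, -1.0 / 3.0) - 0.015 * pow(mu, 0.25)) - 18.42) / p_atm;
}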
5 species State-to-State (StS) model
The State-to-State approach writes a relaxation equation for each vibrational level, making it possible to calculate the distribution of the internal states when it departs from the Boltzmann one.

vTm/vTa: vibrational-translational energy exchange with molecules/atoms; vv: vibrational-vibrational energy exchange; drm/dra: dissociation-recombination with molecules/atoms; LC: ladder climbing.
Pure N2:
N2(v) + N2 ⇌ N2(v−1) + N2  (vTm)
N2(v) + N ⇌ N2(v') + N  (vTa)
N2(v) + N2(w) ⇌ N2(v−1) + N2(w+1)  (vv)
N2(v) + N2 ⇌ 2N + N2  (drm)
N2(v) + N ⇌ 2N + N  (dra)

Pure O2:
O2(v) + O2 ⇌ O2(v−1) + O2  (vTm)
O2(v) + O ⇌ O2(v') + O  (vTa)
O2(v) + O2(w) ⇌ O2(v−1) + O2(w+1)  (vv)
O2(v) + O2 ⇌ 2O + O2  (drm)
O2(v) + O ⇌ 2O + O  (dra)

Mixed N2:
N2(v) + O2 ⇌ N2(v−1) + O2  (vTm)
N2(v) + O ⇌ N2(v−1) + O  (vTa)
N2(v_max) + O2 ⇌ 2N + O2  (LC)
N2(v_max) + O ⇌ 2N + O  (LC)

Mixed O2:
O2(v) + N2 ⇌ O2(v−1) + N2  (vTm)
O2(v) + N ⇌ O2(v−1) + N  (vTa)
O2(v_max) + N2 ⇌ 2O + N2  (LC)
O2(v_max) + N ⇌ 2O + N  (LC)

Mixed vv:
N2(v) + O2(w) ⇌ N2(v−1) + O2(w+1)  (vv)

Zeldovich exchange reactions:
N2(v) + O ⇌ NO + N
O2(v) + N ⇌ NO + O
CODE PROFILING (Euler eqs.)
Code profiling (Frozen, Park, StS) on an NVIDIA Tesla K40: Mach 6 air flow over a sphere; 256x128 computational cells; 4 chemical sub-steps; 8 Gauss-Seidel inner iterations. Time per iteration: Frozen = 4.72*10^-3 s; Park = 2.78*10^-2 s; StS = 51.1 s. Iterations required for a full simulation: 10000-20000 → 6-12 days for StS (48-96 days for 512x256 cells).
F. Bonelli, M. Tuttafesta, G. Colonna, L. Cutrone, G. Pascazio, An MPI-CUDA approach for hypersonic flows with detailed state-to-state air kinetics using a GPU cluster, Computer Physics Communications, 219, pp. 178-195, 2017
Multi-GPU: peer-to-peer communications or MPI-CUDA
GPU-CPU communications; MPI InfiniBand communications among the nodes.
Classical domain-decomposition approach.
Peer-to-peer communications: possible only within the same computational node.
MPI-CUDA: GPU vs CPU computational performance
StS:
Fluid cells | 12 GPUs: time/iter. (s) (energy (J)) | 12 CPUs: time/iter. (s) (energy (J)) | Speed-up, 12 GPUs vs 12 CPUs (1 GPU vs 1 core)
64x32    | 6.33 (1.8*10^4)   | 8.17 (7.8*10^3)     | 1.29 (7.7)
128x64   | 6.36 (1.8*10^4)   | 26.71 (2.56*10^4)   | 4.2 (25.2)
256x128  | 6.90 (1.9*10^4)   | 105.9 (10.2*10^4)   | 15.3 (91.8)
512x256  | 15.91 (4.5*10^4)  | 419.5 (40.3*10^4)   | 26.4 (158.4)
1024x512 | 68.72 (19.4*10^4) | 1702.1 (163.4*10^4) | 24.8 (148.8)

Park:
64x32    | 7.50*10^-3 (21) | 1.59*10^-3 (1.5) | 0.21 (1.3)
128x64   | 7.77*10^-3 (22) | 4.55*10^-3 (4.3) | 0.59 (3.5)
256x128  | 7.24*10^-3 (20) | 1.68*10^-2 (16)  | 2.32 (13.9)
512x256  | 1.36*10^-2 (38) | 6.53*10^-2 (63)  | 4.8 (28.8)
1024x512 | 3.48*10^-2 (98) | 2.46*10^-1 (236) | 7.1 (42.6)
NVIDIA Tesla K40 (235 W) vs Intel Xeon CPU E5-2630 v2 (6 cores, 2.60 GHz, 80 W)
Flow past a sphere: Nonaka4 test case (Euler eqs.)
Computational domain, with an example of 4x4 MPI partitioning, along with the boundary conditions (left). 228x392 computational grid, shown every 10 grid points (right).
4S. Nonaka et al. ,JTHT 14 (2), 2000
4R = 7 mm; u = 3490 m/s; T = 293 K; P = 4825 Pa; YN2 = 0.767; YO2 = 0.233
Nonaka4 test case (Euler eqs.)
[Figure: computed flow fields for the Frozen, Park, and StS models]
F. Bonelli, M. Tuttafesta, G. Colonna, L. Cutrone, G. Pascazio, An MPI-CUDA approach for hypersonic flows with detailed state-to-state air kinetics using a GPU cluster, Computer Physics Communications, 219, pp. 178-195, 2017
Nonaka4 test case (Euler eqs.): stagnation line profiles
Normalized density; Mach number
F. Bonelli, M. Tuttafesta, G. Colonna, L. Cutrone, G. Pascazio, An MPI-CUDA approach for hypersonic flows with detailed state-to-state air kinetics using a GPU cluster, Computer Physics Communications, 219, pp. 178-195, 2017
Nonaka4 test case (Euler eqs.): stagnation line profiles
Mass fractions; temperatures
F. Bonelli, M. Tuttafesta, G. Colonna, L. Cutrone, G. Pascazio, An MPI-CUDA approach for hypersonic flows with detailed state-to-state air kinetics using a GPU cluster, Computer Physics Communications, 219, pp. 178-195, 2017
Nonaka4 test case (Euler eqs.): vibrational distributions along the stagnation line
[Figure: N2(v)/N2 and O2(v)/O2 vibrational distributions along the stagnation line; dashed: actual distribution, solid: Boltzmann distribution]
Nonaka4 test case (Euler eqs.): temperature profiles along the wall
[Figure: temperature profiles along the wall; translational temperature]
Nonaka4 test case (Euler eqs.): vibrational distributions along the wall
[Figure: N2(v)/N2 and O2(v)/O2 vibrational distributions along the wall; dashed: actual distribution, solid: Boltzmann distribution]
Roy test case5 (Navier-Stokes)
Mach 11.3 nitrogen flow over the nose region of a 25°-55° blunted biconic
5A. Prakash et al., Journal of Computational Physics 230 (2011) 8474–8507
Nonaka4 test case (Navier-Stokes): stagnation line profiles
Normalized density; Mach number
Furudate6 test case: O2 Flow (Navier-Stokes)
6M. Furudate et al. ,JTHT 13 (4), 1999
6R = 7 mm; u = 3850 m/s; T = 293 K; P = 1087 Pa; YO2 = 1.0
Conclusions
We developed an efficient multi-GPU code for two-dimensional fluid dynamics
A second-order accurate finite-volume space discretization scheme has been used, in conjunction with an explicit Runge-Kutta time integration scheme and an operator-splitting approach with implicit chemical source term treatment
We demonstrated the accuracy and the feasibility of fluid dynamic computations of thermochemical non-equilibrium flows by means of detailed state-to-state (StS) vibrationally resolved air kinetics
The MPI-CUDA approach allowed us to run the code efficiently on a multi-node GPU cluster, with good scalability: comparing a single GPU against a single CPU core, speed-up values up to 150 were found.
Current and future work
• Extensive validation of the Navier-Stokes solver with the StS model
• Extension to 3D with the Immersed Boundary method
• Introduction of ionized species
• Flow-wall boundary treatment: models for catalysis and ablation
Immersed-Boundary Method
Copes with the time-consuming grid generation and remeshing processes
Immersed-Boundary solver for Cartesian grid
- The centers of the fluid-interface and solid-interface cells are projected onto the body surface along its normal direction, so as to obtain fluid-cell projection points (FCPP) and solid-cell projection points (SCPP);
- the flow variables at the centers of the fluid cells and the temperature at the solid cells are the unknowns, computed using a finite-volume dual-time-stepping solver;
- the solid cells do not influence the flow field at all, and vice versa;
- the boundary conditions are imposed at the fluid-interface and solid-interface cells by a local IB reconstruction scheme.
Sharma nozzle test case
[Figure: initial conditions; grid scheme with 65x47 fluid cells; boundary conditions: subsonic inlet, supersonic outlet, symmetry; normalized temperature profiles]
Immersed-Boundary Method: Sharma nozzle test case
Cartesian mesh
Mach number contours
Immersed-Boundary Method: Sharma nozzle test case
N2 Tv contours
IB vs body-fitted