HyMEP - HYPERSONIC METEOROID ENTRY PHYSICS 61st Course of the International School of Quantum Electronics – ERICE-SICILY: 3-8 October 2017
Giuseppe Pascazio [email protected]
Department of Mechanics, Mathematics and Management
Via Re David 200, 70125, Bari, Italy
CFD and state-to-state modeling of hypersonic flows using GPUs
Collaborations
GIANPIERO COLONNA
LUIGI CUTRONE
MICHELE TUTTAFESTA, Liceo Scientifico Statale “Leonardo da Vinci”, Bisceglie, Italy
FRANCESCO BONELLI
DARIO DE MARINIS
Objective and Outline
Objective: Development of a High Performance Computing (HPC) CFD code for investigating high enthalpy flows
Outline:
• Motivation: parallel computing and GPUs
• Governing equations
• Finite volume method: Flux Difference Splitting, Flux Vector Splitting, operator splitting
• Thermochemical non-equilibrium models
• Results
• Conclusions
Does CFD need faster computers?
[Figure: single-core CPU computational time to complete a simulation. Hypersonic flow over a sphere (2D, 512x256 mesh, Euler and Navier-Stokes) with 5-species air: hours for frozen chemistry, about a day for the Park model, tens of days for the StS model. Compressible non-reacting turbulent jet flow at Re = 10^4: 3D LES (100*10^6 points) roughly 2500 days, 3D DNS (3.6*10^12 points) roughly 3*10^9 days.]
How to build faster computers?
Increase frequency: 1980 PCs ran at 1 MHz; today up to 4 GHz. However, frequency is bounded by power and heat restrictions, as well as by the physical limit of transistor size.
Parallel computing:
• clusters of multi-core Central Processing Units (CPUs)
• clusters of Graphics Processing Units (GPUs)
GPU
• Born for display rendering: specialized libraries were developed for rendering programming (Direct-X, OpenGL, etc.).
• Uses many cores to perform the math operations needed to alter pixel colors in real time.
• Not born for general-purpose computing: the first attempts at scientific computing required knowing Direct-X and OpenGL, translating the problem in terms of colors, performing the computations on the GPU, and translating the colors back into scientific data (very convoluted).
• The new era of GPGPU (General-Purpose computing on GPUs) began in November 2006, when the NVIDIA CUDA architecture was released. CUDA is not only a new hardware architecture: above all, it provides a programming language (a C/C++ extension) that makes the GPU easy to use for general-purpose computing.
J. Sanders and E. Kandrot, CUDA by Example, 2010
GPU Computing
GPUs are attractive for High Performance Computing:
• Massive multithreaded many-core chips
• Huge amount of Flops (both single and double precision)
• High memory bandwidth
• Energy efficiency and low cost
• Available tools: debuggers, profilers, libraries (BLAS, FFT, LAPACK)
• Programming languages: CUDA C, CUDA Fortran, OpenCL
GPU hardware is organized into Streaming Multiprocessors (SMs), and the software implementation tries to mirror the hardware structure: a block of hundreds of threads executes on a single SM, while the whole grid of thousands of threads is distributed over the GPU.
J. Cheng ,M. Grossman ,T. McKercher, Professional CUDA® C Programming, John Wiley & Sons, Inc.
GPU vs CPU: formidable amount of Flops
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
GPU vs CPU: high memory bandwidth
CUDA C PROGRAMMING GUIDE PG-02829-001_v9.0 | September 2017
GPU vs CPU: high energy efficiency
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
GPU vs CPU: Green500 list
13 of the first 15 clusters in the Green500 list are powered by NVIDIA GPUs
https://www.top500.org/green500/list/2017/06/
CUDA C parallel programming example: vectors sum
Splitting the work into parallel blocks is needed to exploit the full GPU capacity
add<<<B, T>>>(dev_a, dev_c, dev_d)
BxT total number of threads; B blocks; T threads per block
Example of 2 blocks with 4 threads per block: blockDim.x = 4, gridDim.x = 2, for a total of 8 threads
Linear mapping: tid = threadIdx.x + blockIdx.x * blockDim.x
A stride over the whole grid is needed if N > BxT, as in the sketch below
J. Sanders, E. Kandrot, CUDA by example, Addison-Wesley, New-York, 2011.
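A minimal self-contained sketch of the vector sum along these lines (the array size N, the B/T values, and the kernel body are illustrative, not the actual course code):

#include <cuda_runtime.h>

#define N (1 << 20)                 // illustrative problem size

// Each thread starts from its global linear index and strides by the total
// number of launched threads (grid-stride loop), so the kernel also covers
// the case N > B*T.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int tid    = threadIdx.x + blockIdx.x * blockDim.x;  // linear mapping
    int stride = blockDim.x * gridDim.x;                 // B*T threads in total
    for (int i = tid; i < n; i += stride)
        c[i] = a[i] + b[i];
}

int main()
{
    const int B = 128, T = 256;     // B blocks, T threads per block
    float *dev_a, *dev_b, *dev_c;
    cudaMalloc(&dev_a, N * sizeof(float));
    cudaMalloc(&dev_b, N * sizeof(float));
    cudaMalloc(&dev_c, N * sizeof(float));
    // ... fill dev_a and dev_b, e.g. with cudaMemcpy from host arrays ...
    add<<<B, T>>>(dev_a, dev_b, dev_c, N);
    cudaDeviceSynchronize();
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}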
Governing equations
\frac{\partial}{\partial t} \int_V \mathbf{U}\, dV + \oint_S \mathbf{F} \cdot \mathbf{n}\, dS = \int_V \mathbf{W}\, dV

\mathbf{U} = [\rho_{1,1}, \ldots, \rho_{1,V_1}, \ldots, \rho_{S,1}, \ldots, \rho_{S,V_S},\; \rho u,\; \rho v,\; \rho E,\; \rho_1 e_{vib,1}, \ldots, \rho_M e_{vib,M}]^T

\mathbf{F}_E = [\rho_{1,1} u, \ldots, \rho_{S,V_S} u,\; \rho u^2 + p,\; \rho u v,\; (\rho E + p) u,\; \rho_1 e_{vib,1} u, \ldots, \rho_M e_{vib,M} u]^T

\mathbf{G}_E = [\rho_{1,1} v, \ldots, \rho_{S,V_S} v,\; \rho u v,\; \rho v^2 + p,\; (\rho E + p) v,\; \rho_1 e_{vib,1} v, \ldots, \rho_M e_{vib,M} v]^T

\mathbf{W} = [\dot{\omega}_{1,1}, \ldots, \dot{\omega}_{1,V_1}, \ldots, \dot{\omega}_{S,1}, \ldots, \dot{\omega}_{S,V_S},\; 0,\; 0,\; 0,\; \dot{\omega}_{vib,1}, \ldots, \dot{\omega}_{vib,M}]^T

\mathbf{F} = \mathbf{F}_E - \mathbf{F}_V, \qquad \mathbf{G} = \mathbf{G}_E - \mathbf{G}_V

U is the vector of the conservative variables, F_E/F_V and G_E/G_V are the inviscid/viscous flux vectors, and W is the source-term vector. S is the number of chemical components, the s-th one having V_s internal levels; the state-to-state approach thus considers N = \sum_{s=1}^{S} V_s independent species.
Compressible Navier-Stokes (N-S) equations for a multicomponent mixture of reacting gases in thermochemical non-equilibrium for both Park and StS models
Governing equations
\mathbf{F}_V, \mathbf{G}_V = [\rho_1 \mathbf{u}^d_1, \ldots, \rho_N \mathbf{u}^d_N,\; \boldsymbol{\tau},\; \boldsymbol{\tau} \cdot \mathbf{u} - \mathbf{q},\; \mathbf{q}_{vib,1}, \ldots, \mathbf{q}_{vib,M}]^T

F_V and G_V are the viscous flux vectors. In the present implementation the transport properties of the single species are evaluated by using Gupta’s curve fits1; classical mixing rules are used for the mixture properties.

\rho_i \mathbf{u}^d_i = -\rho D_i \nabla Y_i

\boldsymbol{\tau} = \mu \left( \nabla \mathbf{u} + \nabla \mathbf{u}^T \right) - \frac{2}{3} \mu \left( \nabla \cdot \mathbf{u} \right) \mathbf{I}

\mathbf{q} = -\lambda_t \nabla T - \sum_m \lambda_{vib,m} \nabla T_{vib,m} + \sum_i \rho_i \mathbf{u}^d_i h_i

\mathbf{q}_{vib,m} = -\lambda_{vib,m} \nabla T_{vib,m} + \rho_m e_{vib,m} \mathbf{u}^d_m
1Gupta et al., A review of reaction rates and thermodynamic and transport properties for an 11-species air model for chemical and thermal nonequilibrium calculations to 30000 K, NASA Reference Publication 1232, 1990
Numerical method
Cell-centered Finite Volume space discretization on a multi-block structured mesh:

\frac{d \mathbf{U}_{i,j}}{dt} V_{i,j} + \sum_{faces} \left( \mathbf{F}_{num} \cdot \mathbf{n} \right) S = \mathbf{W}_{i,j} V_{i,j}

\mathbf{F}_{num} = \mathbf{F}_{E,num} - \mathbf{F}_{V,num}
Reactive Navier-Stokes equations: advection and pressure terms (hyperbolic); shear-stress and heat-flux terms (diffusive); chemical source terms (stiff)
John C. Tannehill, Dale Anderson, Richard H. Pletcher, Computational Fluid Mechanics and Heat Transfer, Taylor & Francis 1997
Numerical method
\frac{d \mathbf{U}_{i,j}}{dt} V_{i,j} + \sum_{faces} \left( \mathbf{F}_{num} \cdot \mathbf{n} \right) S = \mathbf{W}_{i,j} V_{i,j}, \qquad \mathbf{F}_{num} = \mathbf{F}_{E,num} - \mathbf{F}_{V,num}
Operator splitting approach: frozen step + chemical step.
Frozen step: Method of Lines:
• Space discretization + time integration
  - Space discretization: inviscid & viscous term schemes
  - Time integration: Runge-Kutta scheme
Chemical step: implicit scheme for the stiff source terms.
Solution strategy:
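As a rough sketch of this strategy, a host-side time loop might look as follows (the kernel names, arguments, and empty kernel bodies are illustrative placeholders, not the actual code):

#include <cuda_runtime.h>

// Illustrative stand-ins for the solver kernels.
__global__ void compute_residual(double *U, double *R) { /* inviscid + viscous fluxes */ }
__global__ void rk_stage(double *U, double *R, int stage, double dt) { /* RK update */ }
__global__ void chemical_step(double *U, double dt) { /* point-implicit source terms */ }

void advance(double *U, double *R, int n_steps, double dt, dim3 blocks, dim3 threads)
{
    for (int n = 0; n < n_steps; ++n) {
        // Frozen step: Method of Lines, explicit Runge-Kutta, chemistry held fixed.
        for (int stage = 0; stage < 3; ++stage) {          // e.g. third-order RK
            compute_residual<<<blocks, threads>>>(U, R);
            rk_stage<<<blocks, threads>>>(U, R, stage, dt);
        }
        // Chemical step: implicit integration of the stiff source terms.
        chemical_step<<<blocks, threads>>>(U, dt);
    }
    cudaDeviceSynchronize();
}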
Frozen step: inviscid flux space discretization
\frac{d \mathbf{U}_{i,j}}{dt} V_{i,j} + \sum_{faces} \left( \mathbf{F}_{num} \cdot \mathbf{n} \right) S = 0

Semi-Discrete Schemes or Method of Lines. Frozen equation:

\frac{d \mathbf{U}_{i,j}}{dt} = -\frac{1}{V_{i,j}} \sum_{faces} \left( \mathbf{F}_{E,num} - \mathbf{F}_{V,num} \right) \cdot \mathbf{n}\, S

The resulting ODE system is solved with an explicit Runge-Kutta scheme.

\mathbf{F}_{E,num}: methods for solving non-linear hyperbolic conservation laws. Discretized inviscid Burgers equation: solution of the Riemann problem at each interface.
John C. Tannehill, Dale Anderson, Richard H. Pletcher, Computational Fluid Mechanics and Heat Transfer, Taylor & Francis 1997
Frozen step: inviscid flux space discretization
Solution of the Riemann problem for the inviscid Burgers equation: (left) structure of the general solution (a single wave, either shock or rarefaction), (middle) solution consisting of a shock wave, and (right) solution consisting of a rarefaction wave.* An exact Riemann solver is inexpensive for the inviscid Burgers equation, but its cost increases dramatically for a general m x m non-linear hyperbolic system.
Structure of the Riemann problem for the 1D Euler equations
*Eleuterio F. Toro, Riemann Solvers and Numerical Methods for Fluid Dynamics, Springer 2009
Frozen step: inviscid flux space discretization
Approximate solvers: • Flux Difference Splitting (FDS) or Approximate Riemann solver: Roe’s scheme • Flux Vector Splitting (FVS): Steger and Warming (SW), AUSM
Roe’s scheme: why solve the exact Riemann problem when a lot of information is lost due to the discrete representation?
\mathbf{U}_t + \mathbf{F}(\mathbf{U})_x = 0 \;\Rightarrow\; \mathbf{U}_t + \mathbf{A}(\mathbf{U})\, \mathbf{U}_x = 0, \qquad \mathbf{A} = \frac{\partial \mathbf{F}}{\partial \mathbf{U}}

\mathbf{U}_t + \tilde{\mathbf{A}}\, \mathbf{U}_x = 0: the approximate Riemann problem, solved exactly

\tilde{\mathbf{A}} = \tilde{\mathbf{A}}(\mathbf{U}_L, \mathbf{U}_R): linearized (constant) Jacobian matrix
P.L. Roe, Approximate Riemann solvers, parameter vectors, and difference schemes, Journal of Computational Physics 43(2) 357-372, 1981.
Frozen step: inviscid flux space discretization
How to choose A(UL,UR) is not trivial
Rigorous construction of \tilde{\mathbf{A}} proposed by Roe: the only allowable matrices are those satisfying Property U (uniform validity across discontinuities):
1. it constitutes a linear mapping from the vector space U to the vector space F;
2. as U_L, U_R → U, \tilde{A}(U_L, U_R) → A(U) (ensures consistency);
3. for any U_L, U_R: \tilde{A}(U_L, U_R)(U_L − U_R) = F_L − F_R (ensures conservation);
4. the eigenvectors of \tilde{A} are linearly independent.
By introducing the parameter vector \mathbf{z} = \sqrt{\rho}\,[1, u, H]^T, Roe constructed \tilde{\mathbf{A}} for the Euler equations: \tilde{\mathbf{A}} = \mathbf{A}(\tilde{\mathbf{U}}), with the Roe-averaged state

\tilde{u} = \frac{\sqrt{\rho_L}\, u_L + \sqrt{\rho_R}\, u_R}{\sqrt{\rho_L} + \sqrt{\rho_R}}, \qquad
\tilde{H} = \frac{\sqrt{\rho_L}\, H_L + \sqrt{\rho_R}\, H_R}{\sqrt{\rho_L} + \sqrt{\rho_R}}, \qquad
\tilde{a}^2 = (\gamma - 1)\left( \tilde{H} - \tfrac{1}{2}\tilde{u}^2 \right)
P.L. Roe, Approximate Riemann solvers, parameter vectors, and difference schemes, Journal of Computational Physics 43(2) 357-372, 1981.
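A small sketch of the Roe-averaged state for the 1D Euler equations (the struct layout and function names are illustrative):

#include <math.h>

struct State { double rho, u, H; };   // density, velocity, total enthalpy

// Roe average between left and right states: density-square-root weighting.
__host__ __device__ inline State roe_average(State L, State R)
{
    double wL = sqrt(L.rho), wR = sqrt(R.rho);
    State t;
    t.rho = wL * wR;                             // Roe-averaged density
    t.u   = (wL * L.u + wR * R.u) / (wL + wR);   // u~
    t.H   = (wL * L.H + wR * R.H) / (wL + wR);   // H~
    return t;
}

// Roe-averaged sound speed: a~^2 = (gamma - 1)(H~ - u~^2 / 2)
__host__ __device__ inline double roe_sound_speed(State t, double gamma)
{
    return sqrt((gamma - 1.0) * (t.H - 0.5 * t.u * t.u));
}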
Frozen step: inviscid flux space discretization
Flux Difference Splitting: Upwind discretization
The discretization of the equations is performed according to the discrete propagation directions
\mathbf{F}_{i+1/2} = \frac{1}{2}\left( \mathbf{F}_L + \mathbf{F}_R \right) - \frac{1}{2} \sum_i \tilde{\alpha}_i \left| \tilde{\lambda}_i \right| \tilde{\mathbf{K}}^{(i)}

\tilde{\mathbf{A}} = \mathbf{K} \boldsymbol{\Lambda} \mathbf{K}^{-1}, \qquad
\lambda_i^{+} = \max(\lambda_i, 0) = \tfrac{1}{2}\left( \lambda_i + |\lambda_i| \right), \qquad
\lambda_i^{-} = \min(\lambda_i, 0) = \tfrac{1}{2}\left( \lambda_i - |\lambda_i| \right)

\boldsymbol{\Lambda}: diagonal eigenvalue matrix; \mathbf{K}: matrix of the right eigenvectors

Compact form:

\mathbf{F}_{i+1/2} = \frac{1}{2}\left( \mathbf{F}_L + \mathbf{F}_R \right) - \frac{1}{2} \left| \tilde{\mathbf{A}} \right| \left( \mathbf{U}_R - \mathbf{U}_L \right), \qquad \left| \tilde{\mathbf{A}} \right| = \mathbf{K} \left| \boldsymbol{\Lambda} \right| \mathbf{K}^{-1}
*Eleuterio F. Toro, Riemann Solvers and Numerical Methods for Fluid Dynamics, Springer 2009
Frozen step: higher order (MUSCL approach)
Schemes described so far are first-order accurate: the variables are constant within each control volume.

Van Leer's* idea: reconstruct the variables within each cell, allowing discontinuities across cell interfaces.

Cell-averaged values are used to reconstruct U_R and U_L, thus leading to higher-order fluxes: MUSCL (Monotone Upstream-centered Schemes for Conservation Laws).

Since higher-order schemes are not stable when dealing with discontinuities, limiter functions are applied to switch to a first-order approximation across discontinuities.
*van Leer, B. (1979). Towards the Ultimate Conservative Difference Scheme, V: A Second-Order Sequel to Godunov's Method, J. Comput. Phys., vol. 32, pp. 101-136.
u^r_{i+1/2} = u_{i+1} - \frac{1}{4}\left[ (1+\eta)\,\phi\,\Delta u_{i+1/2} + (1-\eta)\,\phi\,\Delta u_{i+3/2} \right]

u^l_{i+1/2} = u_i + \frac{1}{4}\left[ (1-\eta)\,\phi\,\Delta u_{i-1/2} + (1+\eta)\,\phi\,\Delta u_{i+1/2} \right]

\Delta u_{i+1/2} = u_{i+1} - u_i, \qquad \phi: limiter function
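A compact sketch of this reconstruction for a scalar variable, with the minmod limiter as an illustrative choice (eta selects the member of the family; the parameter values and function names are not taken from the actual code):

#include <math.h>

// Minmod limiter: argument of smallest magnitude when the slopes agree in
// sign, zero otherwise (falls back to first order at extrema/discontinuities).
__host__ __device__ inline double minmod(double a, double b)
{
    if (a * b <= 0.0) return 0.0;
    return (fabs(a) < fabs(b)) ? a : b;
}

// Limited MUSCL left/right states at face i+1/2 from the cell averages
// u[i-1], u[i], u[i+1], u[i+2]; om = (3 - eta)/(1 - eta) is the usual
// compression parameter (requires eta < 1).
__host__ __device__ inline void muscl_states(const double *u, int i, double eta,
                                             double *uL, double *uR)
{
    double om = (3.0 - eta) / (1.0 - eta);
    double dm = u[i]   - u[i-1];   // Delta u_{i-1/2}
    double d0 = u[i+1] - u[i];     // Delta u_{i+1/2}
    double dp = u[i+2] - u[i+1];   // Delta u_{i+3/2}
    *uL = u[i]   + 0.25 * ((1.0 - eta) * minmod(dm, om * d0)
                         + (1.0 + eta) * minmod(d0, om * dm));
    *uR = u[i+1] - 0.25 * ((1.0 + eta) * minmod(d0, om * dp)
                         + (1.0 - eta) * minmod(dp, om * d0));
}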
Results
Scheme of the CPU and GPU code modules, with the subroutine names specified
M. Tuttafesta, G. Colonna, G. Pascazio, Comput. Phys. Comm. 184 (6) (2013) 1497–1510.
Results
Profiling and thread sensitivity analysis: supersonic nozzle test case
Kernel GPU execution times for Test A: analysis of the number of threads per block.
Multiprocessor warp occupancy, varying the threads per block and with different register usage. Occupancy = (active warps)/(maximum warps). A warp is a group of 32 threads that execute the same instruction.
M. Tuttafesta, G. Colonna, G. Pascazio, Comput. Phys. Comm. 184 (6) (2013) 1497–1510.
Results: NVIDIA Tesla C2050 vs Pentium(R) Dual-Core E5645
Nozzle test case: 200x200 cells
Hardware specifications. Total speed-up as a function of the number of time steps n.
M. Tuttafesta, G. Colonna, G. Pascazio, Comput. Phys. Comm. 184 (6) (2013) 1497–1510.
Frozen step: inviscid flux space discretization
Steger and Warming Flux Vector Splitting
\mathbf{F} = \mathbf{A}\mathbf{U}, since F is a homogeneous function of degree one of U. Hence:

\mathbf{F} = \mathbf{F}^{+} + \mathbf{F}^{-} = \mathbf{K} \boldsymbol{\Lambda}^{+} \mathbf{K}^{-1} \mathbf{U} + \mathbf{K} \boldsymbol{\Lambda}^{-} \mathbf{K}^{-1} \mathbf{U}, \qquad
\lambda_i^{\pm} = \tfrac{1}{2}\left( \lambda_i \pm |\lambda_i| \right)

For the 1D Euler equations (\lambda_1 = u, \lambda_2 = u + a, \lambda_3 = u - a):

\mathbf{F}^{\pm} = \frac{\rho}{2\gamma}
\begin{bmatrix}
2(\gamma - 1)\lambda_1^{\pm} + \lambda_2^{\pm} + \lambda_3^{\pm} \\
2(\gamma - 1)\lambda_1^{\pm} u + \lambda_2^{\pm}(u + a) + \lambda_3^{\pm}(u - a) \\
(\gamma - 1)\lambda_1^{\pm} u^2 + \tfrac{1}{2}\lambda_2^{\pm}(u + a)^2 + \tfrac{1}{2}\lambda_3^{\pm}(u - a)^2 + \dfrac{(3 - \gamma)\left( \lambda_2^{\pm} + \lambda_3^{\pm} \right) a^2}{2(\gamma - 1)}
\end{bmatrix}

\mathbf{F}_{i+1/2} = \mathbf{F}^{+}(\mathbf{U}_L) + \mathbf{F}^{-}(\mathbf{U}_R)
Compared to Roe’s FDS, SW FVS is easier to implement and less computationally expensive, but it is also less accurate. Upwinding is performed by splitting the flux into positive and negative components. FDS schemes can be interpreted as being driven by a series of waves, whereas FVS schemes transport particles according to the characteristic information.*
J.L. Steger, R.F. Warming , Flux vector splitting of the inviscid gasdynamic equations with application to finite-difference methods, Journal of Computational Physics 40 (2) 263-293, 1981 *John C. Tannehill, Dale Anderson, Richard H. Pletcher, Computational Fluid Mechanics and Heat Transfer, Taylor & Francis 1997
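A sketch of the split flux above for the 1D Euler equations (sgn = +1 gives F+, sgn = -1 gives F-; names are illustrative):

#include <math.h>

// Steger-Warming split flux F±(U) for the 1D Euler equations, ideal gas.
__host__ __device__ inline void sw_split_flux(double rho, double u, double a,
                                              double gamma, int sgn, double F[3])
{
    // Split eigenvalues: lambda± = (lambda ± |lambda|)/2
    double l1 = 0.5 * (u       + sgn * fabs(u));
    double l2 = 0.5 * ((u + a) + sgn * fabs(u + a));
    double l3 = 0.5 * ((u - a) + sgn * fabs(u - a));
    double c  = rho / (2.0 * gamma);
    F[0] = c * (2.0 * (gamma - 1.0) * l1 + l2 + l3);
    F[1] = c * (2.0 * (gamma - 1.0) * l1 * u + l2 * (u + a) + l3 * (u - a));
    F[2] = c * ((gamma - 1.0) * l1 * u * u
              + 0.5 * l2 * (u + a) * (u + a)
              + 0.5 * l3 * (u - a) * (u - a)
              + (3.0 - gamma) * (l2 + l3) * a * a / (2.0 * (gamma - 1.0)));
}

// Interface flux: F_{i+1/2} = F+(U_i) + F-(U_{i+1}).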
Frozen step: inviscid flux space discretization
AUSM: Advection Upstream Splitting Method
Two physically distinct parts that are separately discretized: convective and pressure
\mathbf{F} = \mathbf{F}^{(c)} + \mathbf{F}^{(p)} = u \begin{bmatrix} \rho \\ \rho u \\ \rho H \end{bmatrix} + \begin{bmatrix} 0 \\ p \\ 0 \end{bmatrix}
A cell-face Mach number is used to advect the convective term, whereas the pressure flux is governed by acoustic wave speed
\mathbf{F}_{1/2} = M_{1/2}\, \frac{1}{2} \left[ \begin{pmatrix} \rho a \\ \rho a u \\ \rho a H \end{pmatrix}_L + \begin{pmatrix} \rho a \\ \rho a u \\ \rho a H \end{pmatrix}_R \right] - \frac{1}{2} \left| M_{1/2} \right| \left[ \begin{pmatrix} \rho a \\ \rho a u \\ \rho a H \end{pmatrix}_R - \begin{pmatrix} \rho a \\ \rho a u \\ \rho a H \end{pmatrix}_L \right] + \begin{pmatrix} 0 \\ p_{1/2} \\ 0 \end{pmatrix}

with M_{1/2} = M_L^{+} + M_R^{-} and p_{1/2} = p_L^{+} + p_R^{-} given by the AUSM splitting functions.
It retains the accuracy of Roe's FDS along with the simplicity and robustness of the SW FVS.
M.-S.Liou, C.J. Steffen, A New Flux Splitting Scheme, Journal of Computational Physics 107 (1) 23-39, 1993
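A sketch of a first-order AUSM interface flux for the 1D Euler equations, using the Liou-Steffen subsonic polynomials for M± and p± (function names are illustrative):

#include <math.h>

// Split Mach number: M± = ±(M ± 1)^2 / 4 for |M| <= 1, (M ± |M|)/2 otherwise.
__host__ __device__ inline double ausm_M(double M, int sgn)
{
    if (fabs(M) <= 1.0) return sgn * 0.25 * (M + sgn) * (M + sgn);
    return 0.5 * (M + sgn * fabs(M));
}

// Split pressure: p± = p (M ± 1)^2 (2 ∓ M) / 4 for |M| <= 1, p (M ± |M|)/(2M) otherwise.
__host__ __device__ inline double ausm_p(double p, double M, int sgn)
{
    if (fabs(M) <= 1.0) return 0.25 * p * (M + sgn) * (M + sgn) * (2.0 - sgn * M);
    return 0.5 * p * (M + sgn * fabs(M)) / M;
}

// AUSM flux at a face from left/right states (a: sound speed, H: total enthalpy).
__host__ __device__ inline void ausm_flux(double rhoL, double uL, double pL, double aL, double HL,
                                          double rhoR, double uR, double pR, double aR, double HR,
                                          double F[3])
{
    double Mh = ausm_M(uL / aL, +1) + ausm_M(uR / aR, -1);          // M_{1/2}
    double ph = ausm_p(pL, uL / aL, +1) + ausm_p(pR, uR / aR, -1);  // p_{1/2}
    double phiL[3] = { rhoL * aL, rhoL * aL * uL, rhoL * aL * HL };
    double phiR[3] = { rhoR * aR, rhoR * aR * uR, rhoR * aR * HR };
    for (int k = 0; k < 3; ++k)   // convective part, upwinded by M_{1/2}
        F[k] = 0.5 * Mh * (phiL[k] + phiR[k]) - 0.5 * fabs(Mh) * (phiR[k] - phiL[k]);
    F[1] += ph;                   // pressure flux acts on the momentum equation only
}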
Frozen step: viscous schemes
Viscous terms involve gradients that have to be evaluated on the cell faces. Due to their dissipative nature, central differences are used. A convenient procedure for generalized curvilinear coordinates is to apply the Gauss divergence theorem.
\int_V \nabla u\, dV = \oint_S u\, \mathbf{n}\, dS \quad \Rightarrow \quad \overline{\nabla u} \simeq \frac{1}{V} \sum_{i \in faces} u_i\, \mathbf{n}_i\, S_i

\overline{\left( \frac{\partial u}{\partial x} \right)} \simeq \frac{1}{V} \sum_{i \in faces} u_i\, n_{x,i}\, S_i, \qquad
\overline{\left( \frac{\partial u}{\partial y} \right)} \simeq \frac{1}{V} \sum_{i \in faces} u_i\, n_{y,i}\, S_i
John C. Tannehill, Dale Anderson, Richard H. Pletcher, Computational Fluid Mechanics and Heat Transfer, Taylor & Francis 1997
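A sketch of this face-sum gradient for a single 2D cell (face values, unit normals, and face areas are assumed already available; names illustrative):

// Cell-averaged gradient of u from the Gauss divergence theorem:
// grad(u) ≈ (1/V) Σ_faces u_f n_f S_f, with outward unit normals n_f.
__host__ __device__ inline void gauss_gradient(const double *u_face,
                                               const double *nx, const double *ny,
                                               const double *S, int n_faces, double V,
                                               double *dudx, double *dudy)
{
    double gx = 0.0, gy = 0.0;
    for (int f = 0; f < n_faces; ++f) {
        gx += u_face[f] * nx[f] * S[f];   // u_f n_{x,f} S_f
        gy += u_face[f] * ny[f] * S[f];   // u_f n_{y,f} S_f
    }
    *dudx = gx / V;
    *dudy = gy / V;
}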
Numerical method
Inviscid flux: AUSM or Steger-Warming Flux Vector Splitting, with the MUSCL approach for higher-order accuracy; Runge-Kutta scheme up to third order for time integration. Viscous flux: gradients of the primitive variables are evaluated by applying the Gauss theorem.
Frozen step
Operator splitting approach
\frac{\partial}{\partial t} \int_V \mathbf{U}\, dV + \oint_S \mathbf{F} \cdot \mathbf{n}\, dS = 0
Numerical method
Chemical step
Sub-time step: \Delta t_c = \Delta t_f / n

\frac{d y_i}{dt} = P_i(\mathbf{y}) - L_i(\mathbf{y})\, y_i, \qquad i = 1, \ldots, N

P is a vector and L a diagonal matrix; P_i and L_i y_i are non-negative and represent, respectively, the production and loss terms for component y_i.

Gauss-Seidel iterative scheme:

y_i^{(k+1)}(t + \Delta t_c) = \frac{y_i(t) + P_i^{(k)}\, \Delta t_c}{1 + L_i^{(k)}\, \Delta t_c}
Operator splitting approach
\frac{\partial}{\partial t} \int_V \mathbf{U}\, dV = \int_V \mathbf{W}\, dV
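A sketch of the point-implicit Gauss-Seidel update above as a CUDA kernel, one thread per cell (the number of components, the inner-iteration count, and the kinetics routine are illustrative placeholders):

#define N_SP 5      // components per cell (illustrative)
#define N_GS 8      // Gauss-Seidel inner iterations (as in the profiling slide later on)

// Placeholder kinetics: the real code builds P_i and L_i from the reaction
// rates of the chosen model (Park or StS).
__device__ void eval_production_loss(const double *y, double *P, double *L)
{
    for (int i = 0; i < N_SP; ++i) { P[i] = 0.0; L[i] = 0.0; }
}

// y_i^{(k+1)}(t + dt_c) = ( y_i(t) + P_i^{(k)} dt_c ) / ( 1 + L_i^{(k)} dt_c )
__global__ void chemical_step(double *y_cells, int n_cells, double dt_c)
{
    int c = threadIdx.x + blockIdx.x * blockDim.x;
    if (c >= n_cells) return;
    double *y0 = &y_cells[c * N_SP];                  // y(t) for this cell
    double y[N_SP], P[N_SP], L[N_SP];
    for (int i = 0; i < N_SP; ++i) y[i] = y0[i];
    for (int k = 0; k < N_GS; ++k)
        for (int i = 0; i < N_SP; ++i) {
            eval_production_loss(y, P, L);            // uses the latest y (Gauss-Seidel)
            y[i] = (y0[i] + P[i] * dt_c) / (1.0 + L[i] * dt_c);
        }
    for (int i = 0; i < N_SP; ++i) y0[i] = y[i];      // store y(t + dt_c)
}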
Thermochemical non-equilibrium models for air
MULTI-TEMPERATURE 5-SPECIES PARK MODEL1
• 17 reactions + 3 transport equations for the vibrational energies
• Arrhenius-type rate coefficients, functions of an effective temperature calculated as a geometric mean of the translational and vibrational temperatures
• Vibrational levels follow a Boltzmann distribution at temperature Tv
• Tuned on experimental measurements
• Not computationally demanding
• May fail when the conditions are far from those for which it was tuned

5-SPECIES STATE-TO-STATE (StS) MODEL2
• Detailed vibrational kinetics of the molecules
• 68 and 47 vibrational levels for N2 and O2, respectively
• Thousands of elementary processes: high accuracy, but huge computational cost
1 C. Park, Nonequilibrium Hypersonic Aerothermodynamics, Wiley, New York, 1990 2 M. Capitelli et al., Fundamentals Aspects of Plasma Chemical Physics: Kinetics, Springer Science & Business Media, 2015
Multi-temperature 5 species Park model
REACTIONS:
Dissociation:
N2 + X ⇌ 2N + X
O2 + X ⇌ 2O + X
NO + X ⇌ N + O + X, with X = N2, O2, NO, N, O
Zeldovich exchange reactions:
N2 + O ⇌ N + NO
O2 + N ⇌ NO + O
Generic i-th reaction:
\sum_{k=1}^{K} \nu'_{ki} M_k \rightleftharpoons \sum_{k=1}^{K} \nu''_{ki} M_k

Chemical production rate of the k-th species:
\dot{\omega}_k = M_k \sum_{i=1}^{I} \nu_{ki}\, q_i

Net stoichiometric coefficient:
\nu_{ki} = \nu''_{ki} - \nu'_{ki}

Rate of progress of the i-th reaction:
q_i = k_{f,i} \prod_{k=1}^{K} [X_k]^{\nu'_{ki}} - k_{r,i} \prod_{k=1}^{K} [X_k]^{\nu''_{ki}}
Multi-temperature 5 species Park model
The two-temperature Park model assumes that the Arrhenius rate constants are functions of an effective temperature, the geometric average of the translational-rotational temperature (T) and the vibrational temperature (Tv), in the form:
Arrhenius forward rate constant:
k_{f,i} = A_i\, T_a^{C_i} \exp\!\left( -\frac{E_i}{R\, T_a} \right)

Reverse rate constant:
k_{r,i} = \frac{k_{f,i}(T_a)}{K_i(T_a)}

Effective temperature:
T_a = T^{q}\, T_v^{1-q}, with q between 0.3 and 0.7

Landau-Teller evolution of the vibrational energy:
\frac{d e_{vib,m}}{dt} = \frac{e_{vib,m}(T) - e_{vib,m}(T_{V,m})}{\tau_m}

\tau_m: vibrational energy relaxation time (Millikan-White expression)
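A sketch of these rate evaluations (the Millikan-White form is the standard one; all coefficient values are inputs, and the function names are illustrative):

#include <math.h>

// Effective Park temperature: T_a = T^q * Tv^(1-q), q typically 0.3-0.7.
__host__ __device__ inline double park_Ta(double T, double Tv, double q)
{
    return pow(T, q) * pow(Tv, 1.0 - q);
}

// Arrhenius forward rate constant: k_f = A * Ta^C * exp(-E/(R*Ta)).
__host__ __device__ inline double kf_arrhenius(double A, double C,
                                               double E_over_R, double Ta)
{
    return A * pow(Ta, C) * exp(-E_over_R / Ta);
}

// Millikan-White vibrational relaxation time (p in atm, tau in s):
// p * tau = exp[ A_mw (T^(-1/3) - 0.015 mu^(1/4)) - 18.42 ],
// with mu the reduced mass of the colliding pair in amu.
__host__ __device__ inline double tau_millikan_white(double A_mw, double mu,
                                                     double T, double p_atm)
{
    return exp(A_mw * (pow(T, -1.0 / 3.0) - 0.015 * pow(mu, 0.25)) - 18.42) / p_atm;
}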
5 species State-to-State (StS) model
The State-to-State approach writes a relaxation equation for each vibrational level, making it possible to calculate the distribution of the internal states when it departs from the Boltzmann one.

vTm/vTa: vibrational-translational energy exchange with molecules/atoms; vv: vibrational-vibrational energy exchange; drm/dra: dissociation-recombination with molecules/atoms; LC: ladder climbing.
Pure N2:
N2(v) + N2 ⇌ N2(v−1) + N2  (vTm)
N2(v) + N ⇌ N2(v') + N  (vTa)
N2(v) + N2(w) ⇌ N2(v−1) + N2(w+1)  (vv)
N2(v) + N2 ⇌ 2N + N2  (drm)
N2(v) + N ⇌ 2N + N  (dra)

Pure O2:
O2(v) + O2 ⇌ O2(v−1) + O2  (vTm)
O2(v) + O ⇌ O2(v') + O  (vTa)
O2(v) + O2(w) ⇌ O2(v−1) + O2(w+1)  (vv)
O2(v) + O2 ⇌ 2O + O2  (drm)
O2(v) + O ⇌ 2O + O  (dra)

Mixed N2:
N2(v) + O2 ⇌ N2(v−1) + O2  (vTm)
N2(v) + O ⇌ N2(v−1) + O  (vTa)
N2(v_max) + O2 ⇌ 2N + O2  (LC)
N2(v_max) + O ⇌ 2N + O  (LC)

Mixed O2:
O2(v) + N2 ⇌ O2(v−1) + N2  (vTm)
O2(v) + N ⇌ O2(v−1) + N  (vTa)
O2(v_max) + N2 ⇌ 2O + N2  (LC)
O2(v_max) + N ⇌ 2O + N  (LC)

Mixed vv:
N2(v) + O2(w) ⇌ N2(v−1) + O2(w+1)  (vv)

Zeldovich exchange reactions:
N2(v) + O ⇌ NO + N
O2(v) + N ⇌ NO + O
CODE PROFILING (Euler eqs.)
Code profiling (Frozen, Park, StS) on an NVIDIA Tesla K40: Mach 6 air flow over a sphere; 256x128 computational cells; 4 chemical sub-steps; 8 Gauss-Seidel inner iterations. Time per iteration: Frozen = 4.72*10^-3 s; Park = 2.78*10^-2 s; StS = 51.1 s. Iterations required for a full simulation: 10000-20000 → 6-12 days for StS (48-96 days for 512x256 cells).
F. Bonelli, M. Tuttafesta, G. Colonna, L. Cutrone, G. Pascazio, An MPI-CUDA approach for hypersonic flows with detailed state-to-state air kinetics using a GPU cluster, Computer Physics Communications, 219, pp. 178-195, 2017
Multi-GPU: peer-to-peer communications or MPI-CUDA
GPU-CPU communications; MPI InfiniBand communications among the nodes.
Classical domain-decomposition approach.
Peer-to-peer communications: possible only within the same computational node.
MPI-CUDA: GPU vs CPU computational performance
StS:
Fluid cells | 12 GPUs: time/iter. (s) (energy (J)) | 12 CPUs: time/iter. (s) (energy (J)) | Speed-up, 12 GPUs vs 12 CPUs (1 GPU vs 1 core)
64x32    | 6.33 (1.8*10^4)   | 8.17 (7.8*10^3)     | 1.29 (7.7)
128x64   | 6.36 (1.8*10^4)   | 26.71 (2.56*10^4)   | 4.2 (25.2)
256x128  | 6.90 (1.9*10^4)   | 105.9 (10.2*10^4)   | 15.3 (91.8)
512x256  | 15.91 (4.5*10^4)  | 419.5 (40.3*10^4)   | 26.4 (158.4)
1024x512 | 68.72 (19.4*10^4) | 1702.1 (163.4*10^4) | 24.8 (148.8)

Park:
64x32    | 7.50*10^-3 (21) | 1.59*10^-3 (1.5) | 0.21 (1.3)
128x64   | 7.77*10^-3 (22) | 4.55*10^-3 (4.3) | 0.59 (3.5)
256x128  | 7.24*10^-3 (20) | 1.68*10^-2 (16)  | 2.32 (13.9)
512x256  | 1.36*10^-2 (38) | 6.53*10^-2 (63)  | 4.8 (28.8)
1024x512 | 3.48*10^-2 (98) | 2.46*10^-1 (236) | 7.1 (42.6)
NVIDIA Tesla K40 (235 W) vs Intel Xeon CPU E5-2630 v2 (6 cores, 2.60 GHz, 80 W)
Flow past a sphere: Nonaka4 test case (Euler eqs.)
Computational domain, with an example of 4x4 MPI partitioning, along with the boundary conditions (left). 228x392 computational grid, shown every 10 grid points (right).
4S. Nonaka et al. ,JTHT 14 (2), 2000
4R = 7 mm; u = 3490 m/s; T = 293 K; P = 4825 Pa; YN2 = 0.767; YO2 = 0.233
Nonaka4 test case (Euler eqs.)
[Figure: computed flow fields for the Frozen, Park, and StS models]
F. Bonelli, M. Tuttafesta, G. Colonna, L. Cutrone, G. Pascazio, An MPI-CUDA approach for hypersonic flows with detailed state-to-state air kinetics using a GPU cluster, Computer Physics Communications, 219, pp. 178-195, 2017
Nonaka4 test case (Euler eqs.): stagnation line profiles
Normalized density; Mach number
F. Bonelli, M. Tuttafesta, G. Colonna, L. Cutrone, G. Pascazio, An MPI-CUDA approach for hypersonic flows with detailed state-to-state air kinetics using a GPU cluster, Computer Physics Communications, 219, pp. 178-195, 2017
Nonaka4 test case (Euler eqs.): stagnation line profiles
Mass fractions; temperatures
F. Bonelli, M. Tuttafesta, G. Colonna, L. Cutrone, G. Pascazio, An MPI-CUDA approach for hypersonic flows with detailed state-to-state air kinetics using a GPU cluster, Computer Physics Communications, 219, pp. 178-195, 2017
Nonaka4 test case (Euler eqs.): vibrational distributions along the stagnation line
[Figure: N2(v)/N2 and O2(v)/O2 vibrational distributions along the stagnation line; dashed: actual distribution, solid: Boltzmann distribution]
Nonaka4 test case (Euler eqs.): temperature profiles along the wall
[Figure: temperature profiles along the wall; translational temperature]
Nonaka4 test case (Euler eqs.): vibrational distributions along the wall
[Figure: N2(v)/N2 and O2(v)/O2 vibrational distributions along the wall; dashed: actual distribution, solid: Boltzmann distribution]
Roy test case5 (Navier-Stokes)
Mach 11.3 nitrogen flow over the nose region of a 25°-55° blunted biconic
5A. Prakash et al., Journal of Computational Physics 230 (2011) 8474–8507
Nonaka4 test case (Navier-Stokes): stagnation line profiles
Normalized density; Mach number
Furudate6 test case: O2 Flow (Navier-Stokes)
6M. Furudate et al. ,JTHT 13 (4), 1999
6R = 7 mm; u = 3850 m/s; T = 293 K; P = 1087 Pa; YO2 = 1.0
Conclusions
We developed an efficient multi-GPU code for two-dimensional fluid dynamics
A second-order accurate finite-volume space discretization scheme has been used, in conjunction with an explicit Runge-Kutta time integration scheme and an operator-splitting approach with implicit chemical source term treatment
We demonstrated the accuracy and the feasibility of fluid dynamic computations of thermochemical non-equilibrium flows by means of detailed state-to-state (StS) vibrationally resolved air kinetics
The MPI-CUDA approach allowed us to run the code efficiently on a multi-node GPU cluster, with good scalability: comparing a single GPU against a single CPU core, speed-up values up to 150 were found.
Current and future work
• Extensive validation of the Navier-Stokes solver with the StS model
• Extension to 3D with the Immersed Boundary method
• Introduction of ionized species
• Flow-wall boundary treatment: models for catalysis and ablation
Immersed-Boundary Method
Copes with the time-consuming grid generation and remeshing processes
Immersed-Boundary solver for Cartesian grid
- The centers of the fluid-interface and solid-interface cells are projected onto the body surface along its normal direction, so as to obtain fluid-cell projection points (FCPP) and solid-cell projection points (SCPP);
- the flow variables at the centers of the fluid cells and the temperature at the solid cells are the unknowns, computed using a finite-volume dual-time-stepping solver;
- the solid cells do not influence the flow field at all, and vice versa;
- the boundary conditions are imposed at the fluid-interface and solid-interface cells by a local IB reconstruction scheme.
Sharma nozzle test case
[Figure: initial conditions; grid scheme with 65x47 fluid cells; boundary conditions: subsonic inlet, supersonic outlet, symmetry; normalized temperature profiles]
Immersed-Boundary Method: Sharma nozzle test case
Cartesian mesh
Mach number contours
Immersed-Boundary Method: Sharma nozzle test case
N2 Tv contours
IB vs body-fitted