
Highly Efficient Lattice-Boltzmann Multiphase Simulations of Immiscible Fluids at High-Density Ratios on CPUs and GPUs through Code Generation

The International Journal of High Performance Computing Applications XX(X):1-17, ©The Author(s) 2020. Reprints and permission: sagepub.co.uk/journalsPermissions.nav, DOI: 10.1177/ToBeAssigned, www.sagepub.com/


Markus Holzer1, Martin Bauer1, Harald Köstler1 and Ulrich Rüde1,2

Abstract
A high-performance implementation of a multiphase lattice Boltzmann method based on the conservative Allen-Cahn model supporting high-density ratios and high Reynolds numbers is presented. Metaprogramming techniques are used to generate optimized code for CPUs and GPUs automatically. The coupled model is specified in a high-level symbolic description and optimized through automatic transformations. The memory footprint of the resulting algorithm is reduced through the fusion of compute kernels. A roofline analysis demonstrates the excellent efficiency of the generated code on a single GPU. The resulting single-GPU code has been integrated into the multiphysics framework WALBERLA to run massively parallel simulations on large domains. Communication hiding and GPUDirect-enabled MPI yield near-perfect scaling behaviour. Scaling experiments are conducted on the Piz Daint supercomputer with up to 2048 GPUs, simulating several hundred fully resolved bubbles. Further, validation of the implementation is shown in a physically relevant scenario: a three-dimensional rising air bubble in water.

Keywords
GPGPU, Code generation, Performance engineering, Multiphase flow, Lattice Boltzmann

1 Introduction
The numerical simulation of multiphase flow is a challenging field of computational fluid dynamics (see Prosperetti and Tryggvason, 2007). Although a wide variety of different approaches have been developed, simulating the dynamics of immiscible fluids with high density ratios and at high Reynolds numbers is still considered complicated (see Huang et al., 2015). Such multiphase flows require models for the interfacial dynamics (see Yan et al., 2011). A full resolution of these phenomena is usually impractical for macroscopic CFD techniques since the interface is only a few nanometers thick, as pointed out by Fakhari et al. (2017b).

Therefore, sharp interface techniques model the interface as two-sided boundary conditions on a free surface and can thus achieve a discontinuous transition (see Körner et al., 2005; Thürey et al., 2009; Bogner et al., 2015). The modelling and implementation of these boundary conditions can be complicated, as stated by Bogner et al. (2016), especially in a parallel setting. Diffuse-interface models, in contrast, represent the interface in a transition region of a finite thickness that is typically much wider than the true physical interface (see Anderson et al., 1998). Thus, the sharp interface between the fluids is replaced by a smooth transition with thickness ξ of a few grid cells. This removes

1 Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Cauerstraße 11, 91058 Erlangen, Germany
2 CERFACS, 31057 Toulouse Cedex 1, France

Corresponding author:
Markus Holzer, Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Cauerstraße 11, 91058 Erlangen, Germany
Email: [email protected]


abrupt jumps and certain singularities coming along with a sharp interface. In this work, we will focus on phase-field modeling. Here, an advection-diffusion-type equation is solved based on either Cahn-Hilliard (see Cahn and Hilliard, 1958) or Allen-Cahn models (ACM) (see Allen et al., 1976) to track the interface. The governing equations can be solved using the lattice Boltzmann method (LBM), which is based on the kinetic theory expressed by the original Boltzmann equation (see Li et al., 2016).

Typically, the simulation domain is discretized with a Cartesian grid for 3D LBM simulations. Due to its explicit time-stepping scheme and its high data locality, the LBM is well suited for extensive parallelization (see Bauer et al., 2020a). For the simulation of multiphase flows, additional force terms are needed that employ non-local data points, so that this ideal situation may worsen somewhat. Fakhari et al. have shown how to improve the locality for the conservative ACM (see Fakhari et al., 2017b). In order to resolve physically relevant phenomena, a sufficiently high resolution at the interface between the fluid phases is necessary. Thus we may need simulation domains containing billions of grid points and beyond. This creates the need for using high-performance computing (HPC) systems and for highly parallelized and run-time-efficient algorithms. However, highly optimized implementations in low-level languages like Fortran or C often suffer from poor flexibility and poor extensibility, so that algorithmic changes and the development of new models may become tedious.

To overcome this problem, we here use a workflow where we generate the compute-intensive kernels at compile time with a code generator realized in Python. Thus we obtain the highest possible performance while maintaining maximal flexibility (see Bauer et al., 2019). Furthermore, the symbolic description of the complete ACM allows us to use IPython (see Perez and Granger, 2007) as an interactive prototyping environment. Thus, changes in the model, like additional terms, different discretization schemes, different versions of the LBM or different LB stencils, can be incorporated directly on the level of the defining mathematical equations. These equations can be represented in LaTeX form. The generated code can then run in parallel with OpenMP on a single CPU or on a single GPU. Once a working prototype has been created in this working mode, the automatically generated code kernels can be integrated into existing HPC software as external C++ files. In this way, we can execute massively parallel simulations. Note that this workflow permits describing physical models in symbolic form while still running with maximal efficiency on parallel supercomputers. Using code generation, we realize a higher level of abstraction and thus an improved separation of concerns.

The remainder of the article is structured as follows. In section 2, we summarize related work on the conservative ACM. In section 3, we introduce the governing equations of the conservative ACM. Section 4 presents details of the implementation by first introducing the code generation toolkit lbmpy by Bauer et al. (2020b), which constitutes the basis of our implementation. Then, we present the phase-field algorithm itself in a straightforward and in an improved form, where the improvements essentially lie in the minimization of the memory footprint. This is primarily achieved by changing the structure of the algorithm in order to be able to fuse several compute kernels. The performance of our implementation is discussed in section 5. We first show a comparison of the straightforward and the improved algorithm. Then the performance of the improved version is analyzed on a single GPU with a roofline approach. After that, the scaling behavior on up to 2048 GPU nodes is presented in section 5.2. For the scaling on many GPU nodes, communication hiding strategies are explained and analyzed. Finally, we validate the physical correctness of our implementation with test cases for a rising bubble in sections 6.1 and 6.2.

2 Related Work
The interface tracking in this work is carried out with the Allen-Cahn equation (ACE) (see Allen et al., 1976). A modification of the ACE to a phase-field model was proposed by Sun and Beckermann (2007). Nevertheless, it is Cahn-Hilliard theory (see Cahn and Hilliard, 1958) which is most often used in phase-field models to perform the interface tracking. A reason for that is the implicit conservation of the phase-field and thus the conservation of mass. As a drawback, it includes fourth-order spatial derivatives, which worsens the locality of the LB framework, as pointed out by Geier et al. (2015). In order to make the ACE accessible for phase-field models, Pao-Hsiung and Yan-Ting (2011) have presented it in conservative form. Furthermore, the conservative ACE contains only second-order derivatives, which allows a more efficient implementation.

In the work of Geier et al. (2015), the conservative ACE was first solved using a single relaxation time (SRT) algorithm. Additionally, they proposed


an improvement of the algorithm by solving the collision step in central moment space and adapting the equilibrium formulation, which makes it possible to directly calculate the gradient of the phase-field locally via the moments. This promising approach, however, leads to a loss of accuracy (see Fakhari et al., 2019). On the other hand, Fakhari et al. (2017b) used an SRT formulation to solve the conservative ACE with isotropic finite differences (see Kumar, 2004) to compute the curvature of the phase-field. This approach was later extended by Mitchell et al. (2018a) to the three-dimensional case. A disadvantage of this approach is that it becomes complicated to apply single-array streaming patterns (see Wittmann et al., 2016) like the AA-pattern or the Esoteric Twist to the LBM, as stated by Geier et al. (2015). This is due to a newly introduced non-locality in the update process, originating in the finite difference calculation (see Geier et al., 2015). In this publication, the phase-field LBE is presented with a multiple relaxation time (MRT) formulation, which was first published by Ren et al. (2016). This formulation was also used in recent studies by Dinesh Kumar et al. (2019).

3 Model Description

3.1 LB Model for Interface TrackingAs described by Fakhari et al. (2017b), the phase-field φ in the conservative ACM assumes twoextreme values, φL and φH , in the bulk of thelighter and the heavier fluid. There are differentpossibilities to choose these values (see Fakhariet al., 2017b). Throughout the simulations in thiswork we set φL = 0 and φH = 1, respectively. Thephase-field equation for two immiscible fluids reads

∂φ

∂t+∇ · φu = ∇ ·M

[∇φ− ∇φ

|∇φ|θ

], (1)

where θ = 1−4(φ−φ0)2/ξ, t is the time, u is themacroscopic velocity, ξ is the interface thicknessand φ0 = (φL+φH)/2 indicates the location of thephase-field. Further, the mobility is

M = τφc2s∆t. (2)

is related to the phase-field relaxation time τφ.The speed of sound cs = c/

√3, where c = ∆x/∆t,

and ∆x = ∆t = 1, which is common practice foruniform grids (see Fakhari et al., 2017b).

In the equilibrium state the profile of the phase-field φ of an interface located at x0 is

φ(x) = φ0 ±φH − φL

2tanh

(x− x0

ξ/2

). (3)
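As a quick numerical illustration of equation (3) (a standalone sketch with our own variable names, not part of the solver), the equilibrium profile can be evaluated across an interface of thickness ξ = 5, the value used later in the benchmarks:

```python
import numpy as np

phi_L, phi_H = 0.0, 1.0
phi_0 = 0.5 * (phi_L + phi_H)
xi = 5.0                      # interface thickness in grid cells
x0 = 0.0                      # interface location

x = np.linspace(-15, 15, 61)  # positions relative to the interface
phi = phi_0 + 0.5 * (phi_H - phi_L) * np.tanh((x - x0) / (xi / 2))

# phi runs smoothly from phi_L to phi_H over a few cells around x0
print(phi.min(), phi[len(x) // 2], phi.max())   # ~0, 0.5, ~1
```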

The LB model for equation (1) to update the phase-field distribution function h_i can be written as (see Geier et al., 2015)

\[
h_i(x + e_i\Delta t,\, t + \Delta t) = h_i(x, t) + \Omega^h_{ij}\left(h^{eq}_j - h_j - \tfrac{1}{2}F^\phi_j\right)\Big|_{(x,t)} + F^\phi_i(x, t), \tag{4}
\]

in which the forcing term is given by (see Fakhari et al., 2017b)

\[
F^\phi_i(x, t) = \Delta t\, \theta\, w_i\, e_i \cdot \frac{\nabla\phi}{|\nabla\phi|}. \tag{5}
\]

In equation (4), Ω^h_ij represents the elements of the collision matrix, which takes the form Ω = M⁻¹SM, where M is the moment matrix (see Fakhari and Lee, 2013) and S is the diagonal relaxation matrix. As described by Ren et al. (2016), we relax the first-order moments by 1/τ_φ and all other moments by one. The phase-field relaxation time τ_φ is calculated with equation (2). The parameters w_i and e_i correspond to the lattice weights and the mesoscopic velocities. The equilibrium phase-field distribution function is h^eq_i = φΓ_i, where

\[
\Gamma_i = w_i\left[1 + \frac{e_i \cdot u}{c_s^2} + \frac{(e_i \cdot u)^2}{2 c_s^4} - \frac{u \cdot u}{2 c_s^2}\right] \tag{6}
\]

is the dimensionless distribution function. By taking the zeroth moment of the phase-field distribution functions, the phase-field φ can be evaluated as

\[
\phi = \sum_i h_i. \tag{7}
\]

The density ρ for the whole domain is calculated by a linear interpolation

\[
\rho = \rho_L + (\phi - \phi_L)(\rho_H - \rho_L), \tag{8}
\]

where ρ_H and ν_H are the density and the kinematic viscosity of the heavier fluid, while ρ_L and ν_L are the density and the kinematic viscosity of the lighter fluid.
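A minimal NumPy sketch of the macroscopic update in equations (7) and (8) could look as follows (an illustration only, not the generated kernel; the array layout and helper names are our own, and φ_L = 0, φ_H = 1 as above):

```python
import numpy as np

def macroscopic_update(h, rho_L=0.001, rho_H=1.0, phi_L=0.0):
    """Phase-field as the zeroth moment (eq. 7) and density by linear interpolation (eq. 8)."""
    phi = h.sum(axis=0)                            # eq. (7): phi = sum_i h_i
    rho = rho_L + (phi - phi_L) * (rho_H - rho_L)  # eq. (8)
    return phi, rho

# toy example: D3Q15 PDFs initialised to the equilibrium of the heavy fluid (phi = 1)
w_D3Q15 = np.array([2/9] + [1/9] * 6 + [1/72] * 8)       # standard D3Q15 lattice weights
h = w_D3Q15[:, None, None, None] * np.ones((15, 8, 8, 8))
phi, rho = macroscopic_update(h)
print(phi.mean(), rho.mean())                             # -> 1.0 and rho_H
```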

3.2 LB Model for Hydrodynamics
In macroscopic form, the continuity and the incompressible Navier-Stokes equations describe the evolution of a flow field and can be written as

\[
\nabla \cdot u = 0, \qquad
\rho\left(\frac{\partial u}{\partial t} + u \cdot \nabla u\right) = -\nabla p + \nabla\cdot\Pi + F, \tag{9}
\]


where ρ is the density, p the pressure, Π = μ(∇u + ∇uᵀ) the viscous stress tensor, μ the dynamic viscosity, and F = F_s + F_b are the surface tension force and external forces, respectively. To solve equation (9) the following LB model is used (see Dinesh Kumar et al., 2019)

\[
g_i(x + e_i\Delta t,\, t + \Delta t) = g_i(x, t) + \Omega^g_{ij}\left(g^{eq}_j - g_j - \tfrac{1}{2}F_j\right)\Big|_{(x,t)} + F_i(x, t), \tag{10}
\]

where the hydrodynamic forcing is given by

\[
F_i(x, t) = \Delta t\, w_i\, \frac{e_i \cdot F}{\rho c_s^2}, \tag{11}
\]

and g_i is the velocity-based distribution function for incompressible fluids. The equilibrium distribution function is

\[
g^{eq}_i = p^* w_i + (\Gamma_i - w_i), \tag{12}
\]

where p* = p/(ρ c_s²) denotes the normalized pressure. The hydrodynamic force F consists of four terms

\[
F = F_p + F_s + F_\mu + F_b. \tag{13}
\]

The pressure force can be obtained as

\[
F_p = -p^* c_s^2 \nabla\rho, \tag{14}
\]

where the normalized pressure is calculated as the zeroth moment of the hydrodynamic distribution function

\[
p^* = \sum_i g_i. \tag{15}
\]

The surface tension force

\[
F_s = \mu_\phi \nabla\phi \tag{16}
\]

is the product of the chemical potential

\[
\mu_\phi = 4\beta(\phi - \phi_L)(\phi - \phi_H)(\phi - \phi_0) - \kappa\nabla^2\phi \tag{17}
\]

and the gradient of the phase-field. The coefficients β = 12σ/ξ and κ = 3σξ/2 link the interface thickness and the surface tension σ. For an MRT scheme the viscous force is computed as

\[
F^{MRT}_{\mu,i} = -\frac{\nu}{c_s^2 \Delta t}\left[\sum_\alpha e_{\alpha i} e_{\alpha j} \sum_\beta \Omega_{\alpha\beta}\left(g_\beta - g^{eq}_\beta\right)\right]\frac{\partial \rho}{\partial x_j}, \tag{18}
\]

where the viscosity ν is related to the hydrodynamic relaxation time τ by

\[
\nu = \tau c_s^2 \Delta t. \tag{19}
\]

There are a few different ways to interpolate the hydrodynamic relaxation time, as shown in the work of Fakhari et al. (2017b). Overall, they obtained the most stable results with a linear interpolation. Therefore, we use it in this work:

\[
\tau = \tau_L + (\phi - \phi_L)(\tau_H - \tau_L). \tag{20}
\]

We relax the second-order moments with the hydrodynamic relaxation rate

\[
s_\nu = \frac{1}{\tau + 1/2} \tag{21}
\]

when solving equation (10) to ensure the correct viscosity of the fluid. All other moments are relaxed by one. The velocity u is obtained via the first moments of the hydrodynamic distribution function and is shifted by the external forces

\[
u = \sum_i g_i e_i + \frac{F}{2\rho}\Delta t. \tag{22}
\]

In order to approximate the gradients in equations (5), (14), (16) and (18), a second-order isotropic stencil can be applied (see Kumar, 2004; Ramadugu et al., 2013)

\[
\nabla\phi = \frac{c}{c_s^2 (\Delta x)^2} \sum_i e_i\, w_i\, \phi(x + e_i\Delta t,\, t). \tag{23}
\]

The Laplacian in equation (17) can be approximated with

\[
\nabla^2\phi = \frac{2 c^2}{c_s^2 (\Delta x)^2} \sum_i w_i\left[\phi(x + e_i\Delta t,\, t) - \phi(x, t)\right]. \tag{24}
\]
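To make the discretization concrete, the sketch below evaluates equations (23) and (24) on a D3Q19 neighborhood with periodic boundaries and Δx = Δt = 1, c_s² = 1/3 (a standalone NumPy illustration with our own helper names, not the generated production kernel):

```python
import numpy as np

# D3Q19 directions: rest, 6 axis neighbours, 12 edge diagonals, with the usual weights
E = np.array([(0, 0, 0)] +
             [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)] +
             [(1, 1, 0), (-1, -1, 0), (1, -1, 0), (-1, 1, 0),
              (1, 0, 1), (-1, 0, -1), (1, 0, -1), (-1, 0, 1),
              (0, 1, 1), (0, -1, -1), (0, 1, -1), (0, -1, 1)])
W = np.array([1/3] + [1/18] * 6 + [1/36] * 12)
CS2 = 1.0 / 3.0   # lattice speed of sound squared

def isotropic_gradient(phi):
    """Second-order isotropic gradient, equation (23), periodic boundaries."""
    grad = np.zeros(phi.shape + (3,))
    for e, w in zip(E, W):
        grad += w * np.roll(phi, shift=tuple(-e), axis=(0, 1, 2))[..., None] * e
    return grad / CS2

def isotropic_laplacian(phi):
    """Second-order isotropic Laplacian, equation (24), periodic boundaries."""
    lap = np.zeros_like(phi)
    for e, w in zip(E, W):
        lap += w * (np.roll(phi, shift=tuple(-e), axis=(0, 1, 2)) - phi)
    return 2.0 * lap / CS2

# sanity check on a sine wave along x: the Laplacian should be -(2*pi/N)**2 * phi
N = 32
phi = np.sin(2 * np.pi * np.arange(N) / N)[:, None, None] * np.ones((N, N, N))
print(isotropic_laplacian(phi)[5, 0, 0], -(2 * np.pi / N) ** 2 * phi[5, 0, 0])
```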

4 Software Design for a Flexible Implementation

4.1 Code Generation
Our implementation of the conservative ACM is based on the open source LBM code generation framework lbmpy* (see Bauer et al., 2020b). Using this meta-programming approach, we address the often encountered trade-off between code flexibility,

∗https://i10git.cs.fau.de/pycodegen/lbmpy


readability, and maintainability on the one hand, and platform-specific performance engineering on the other hand. Especially when targeting modern heterogeneous HPC architectures, a highly optimized compute kernel may require that loops are unrolled, common subexpressions are extracted, and possibly hardware-specific intrinsics are used. In state-of-the-art optimized software (see Hager and Wellein, 2010), these transformations are essential and must be performed manually for each target architecture. Clearly, the resulting codes are time-consuming to develop, error prone, hard to read, difficult to maintain, and often very hard to adapt and extend. Flexibility and maintainability have been sacrificed, since such complex programming techniques are essential to get the full performance available on the system.

Here, in contrast, we employ the LBM code generation framework lbmpy. Thanks to the automated code transformations, the LB scheme can be specified in a high-level symbolic representation. The hardware- and problem-specific transformations are applied automatically so that, starting from an abstract representation, highly efficient C code for CPUs or CUDA/OpenCL code for GPUs can be generated with little effort.

Our new tool lbmpy is realized as a Python package that in turn is built using the stencil code generation and transformation framework pystencils† (see Bauer et al., 2019). The flexibility of lbmpy results from the fully symbolic representation of collision operators and compute kernels, utilizing the computer algebra system SymPy (see Meurer et al., 2017). The package offers an interactive environment for method prototyping and development on a single workstation, similar to what FEniCS (see Alnæs et al., 2015) is in the context of finite element methods. Generated kernels can then be easily integrated into the HPC framework WALBERLA, which is designed to run massively parallel simulations for a wide range of scientific applications (see Bauer et al., 2020a). In this workflow, lbmpy is employed for generating optimized compute and communication kernels, whereas WALBERLA provides the software structure to use these kernels in large-scale scenarios on supercomputers. lbmpy can generate kernels for moment-based LB schemes, namely single-relaxation-time (SRT), two-relaxation-time (TRT), and multiple-relaxation-time (MRT) methods. Additionally, modern cumulant and entropically stabilized collision operators are supported.

When implementing the coupled multiphase scheme as described in section 3 with lbmpy, we can reuse several major building blocks that are already part of lbmpy (see figure 1). First, we can choose between different single-phase collision operators for the Allen-Cahn and the hydrodynamic LBM. We can easily switch between different lattices, allowing us to quickly explore the accuracy-performance trade-off between stencils with more or fewer neighbors. A native 2D implementation is also quickly available by selecting the D2Q9 lattice model.

Then, the selected collision operators of lbmpy can be adapted to the specific requirements of the scheme. In our case, we have to add the forcing terms of equations (5) and (13). This is done on the symbolic level, such that no additional arrays for storing these terms have to be introduced, as would typically be the case when extending an existing LB method implemented in C/C++. Also, no additional iteration passes are needed to compute the force terms. The additional force terms are computed directly within the loops that update the LB distributions, thus significantly saving memory traffic and operational overhead. Furthermore, since optimization passes like common subexpression elimination, SIMD vectorization via intrinsics, or CUDA index mapping are applied automatically by transformations further down the pipeline, the new force terms are fully included in the optimization. Note how this leads to a clean separation of concerns between model development and optimization, with obvious benefits for code maintainability and flexibility without sacrificing the possibility to achieve the best possible performance.

On the modeling level, this code generation approach and our tools allow the application developer to express the methods using a concise mathematical notation. LB collision operators are formulated in the so-called collision space spanned by moments or cumulants (see Coreixas et al., 2019). For each moment/cumulant, a relaxation rate and its respective equilibrium value are chosen. For a detailed description of this formalism and its realization in Python see Bauer et al. (2020b).

Similarly, our system supports the mathematical formulation of differential operators that can be discretized automatically with various numerical approximations of derivatives. This functionality is employed to express the forcing terms. The Python formulation directly mimics the mathematical definition as shown in equations (23) and (24), i.e. it provides a gradient and a Laplacian operator. The user can then select between different finite difference

†https://github.com/mabau/pystencils/


discretizations by specifying the stencil neighborhood, approximation order, and isotropy requirements.

Figure 1. Flexibility of the conservative ACM with the lbmpy code generation framework. The boxes on the right show the two LB steps; on the left, options are shown which can be applied to the two LB steps by lbmpy. The connecting lines show a possible configuration which will be used for the benchmark in this section. (Available options: stencils D2Q9, D3Q15, D3Q19, D3Q27; collision methods SRT, TRT, MRT, cumulant, entropic KBC; streaming patterns collide only, stream pull collide, collide stream push, Esoteric Twist, AA-pattern; custom force models. The phase-field LB step is configured with a stencil, collision method, streaming pattern, raw moments, a compressible discrete equilibrium, the relaxation time ω_φ and a force model; the hydrodynamic LB step with a stencil, collision method, streaming pattern, weighted moments, an incompressible discrete equilibrium, the relaxation matrix S and a force model.)

Starting from the symbolic representations, we create the compute kernels for our application. Knowing the details of the model, in particular the stencil types, at compile time allows the system to simplify expressions and run common subexpression elimination to reduce the number of floating point operations (FLOPs) drastically.
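As a toy illustration of what the common subexpression elimination pass achieves (using plain SymPy, independent of lbmpy's actual transformation pipeline), consider two directions of a discrete equilibrium like equation (6):

```python
import sympy as sp

ux, uy, uz = sp.symbols('u_x u_y u_z')
cs2 = sp.Rational(1, 3)                 # lattice speed of sound squared
u_sq = ux**2 + uy**2 + uz**2

# second-order equilibrium terms for the +x and -x directions, cf. equation (6)
gamma_plus = 1 + ux / cs2 + ux**2 / (2 * cs2**2) - u_sq / (2 * cs2)
gamma_minus = 1 - ux / cs2 + ux**2 / (2 * cs2**2) - u_sq / (2 * cs2)

subexpressions, reduced = sp.cse([gamma_plus, gamma_minus])
for sym, expr in subexpressions:
    print(sym, '=', expr)    # shared terms, evaluated only once per cell
print(reduced)               # both update expressions rewritten in terms of them
```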

An overview of the complete workflow, including the combination of lbmpy and WALBERLA for MPI-parallel execution, is illustrated in figure 2. As described above, the creation of the phase-field model is accomplished directly with lbmpy, which forms a convenient prototyping environment since all equations can be stated as symbolic representations. lbmpy does not only produce the compute kernels, but can also generate the pack and unpack information needed for the MPI communication routines. This is again completely automatic, since the symbolic representations expose all field accesses and thus the data that must be kept in the ghost layers. A ghost layer is a single layer of cells around each subdomain used for the communication between neighboring subdomains. Furthermore, the routines that implement the boundary conditions are also generated.

The complete generation process can be configured to produce C code or code for GPUs with CUDA or, alternatively, OpenCL. These kernels can be directly called as Python functions to be run in an interactive environment, or combined with the HPC framework WALBERLA.

Figure 2. Complete workflow of combining lbmpy and WALBERLA for MPI-parallel execution. Furthermore, lbmpy can be used as a stand-alone package for prototyping. (The diagram shows model creation in lbmpy, followed by code generation and optimisation of compute kernels, boundary conditions and communication routines, backends for CPU (LLVM and GCC) and GPU (CUDA and OpenCL), and execution either interactively with IPython or MPI-distributed with WALBERLA.)

To demonstrate the usage of lbmpy, we show how the update rule for the hydrodynamic distribution functions g_i is realized. Following Mitchell et al. (2018b), equation (10) should be formulated as

\[
g(x + e_i\Delta t,\, t + \Delta t) = M^{-1}\left[m - \left(m - m^{eq} + \tfrac{1}{2}F_m\right)S + F_m\right], \tag{25}
\]

where m = Mg and the forcing is given by F_m = ρ⁻¹(0, F_x, F_y, F_z, 0, ...)ᵀ. This formulation drastically reduces the number of FLOPs needed in each cell compared to (10), where the force is applied in the particle distribution function (PDF) space. Note here that this kind of modification can be implemented in lbmpy in a very simple way. After creating the LB method, it contains the moment matrix M, the relaxation matrix S and the moment equilibrium values m^eq_i. These variables are stated in SymPy and can be used to directly write equation (25).


method = lbmpy.create_lb_method()

M = method.moment_matrix                   # moment transformation matrix M
S = method.relaxation_rates                # diagonal relaxation matrix S
m_eq = method.moment_equilibrium_values    # equilibrium moments m^eq
F = hydrodynamic_force()                   # forcing in moment space, F_m
m = M * g                                  # transform PDFs into moment space

g = m - (m - m_eq + 0.5 * F) * S + F       # MRT collision, cf. equation (25)
g = M.inv() * g                            # transform back into PDF space

4.2 Algorithm
To discuss how the model of section 3 can be realized, we will first present a straightforward implementation. The corresponding algorithm is displayed in algorithm 1 and will be discussed briefly in the following. We start the time loop with time step size Δt after initializing all fields. For MPI-parallel simulations, WALBERLA uses a domain partitioning into subdomains that are assigned to CPUs/GPUs (see Bauer et al., 2020a). Thus, we perform the collision for both LB steps on each subdomain. Following that, we communicate the relevant PDF values of the ghost layers on each process for both PDF fields. Next, the streaming step for the phase-field LB step is executed. In order to update the phase-field, we then calculate the sum of the phase-field PDFs for each cell, according to equation (7). Before we finalize the streaming step for the hydrodynamic LB step, we communicate the phase-field φ. As the last step, we update the velocity field with the first-order moments of the hydrodynamic PDFs according to equation (22). To update the macroscopic variables φ and u, each PDF field has to be accessed one more time.

Algorithm 1: Straightforward algorithm for the conservative ACM.
1  Initialisation of all fields
2  for each time step t do
3    Perform collision of phase-field PDFs
4    Perform collision of velocity PDFs
5    Communicate phase-field PDFs
6    Communicate velocity PDFs
7    Perform streaming of phase-field PDFs
8    Update phase-field
9    Communicate phase-field
10   Perform streaming of velocity PDFs
11   Update the velocity
12 end

Based on this straightforward algorithm, we will now outline substantial improvements that can be made. To lower the memory footprint of the phase-field model, we combine the collision and the streaming of the phase-field distribution functions and the update of the phase-field into one phase-field LB step. Accordingly, the collision and the streaming of the velocity PDFs and the update of the velocity field are combined into one hydrodynamic LB step. In this manner, the phase-field and velocity PDFs, as well as the phase-field and velocity field, get updated in only two instead of six compute kernels. A detailed overview of the proposed algorithm for the conservative ACM is presented in algorithm 2. In our proposed algorithm, we subdivide each LB step into iterations over an outer and an inner domain, similarly to Feichtinger et al. (2015) but with variable cell width. This is illustrated in figure 3. For simplicity, the figure shows only the two-dimensional case; the three-dimensional case is completely analogous. As shown, the frame width controls the iteration space of the outer and inner domain. It can be chosen for all directions independently. In the case illustrated, we have a frame width of four cells in x-direction and two grid cells in y-direction.

After the initialization of all fields, we start the time loop with time step size Δt. Communication and computation can now be overlapped. While updating the block interior of each subdomain with a stream-pull-collide scheme, we start the communication of the phase-field PDFs in the ghost layers. Since the update of the phase-field is the zeroth moment of the phase-field PDFs, as described by equation (7), we also resolve the summation in the same kernel to minimize the memory footprint. Once the computation is completed, we wait for the communication to finish and update the frame of each subdomain.

When the LB step for interface tracking is completed, we start the communication of the velocity-based distribution function together with the phase-field. Note that these communication requirements can now be combined so that only one MPI message must be sent. Simultaneously, the inner part of the domain gets updated with the hydrodynamic LB step in a collide-stream-push manner. According to equation (18), we need to form the non-equilibrium moments for the viscous force, which makes a collide-stream-push scheme more convenient to use. To lower the memory pressure, we update the velocity field with equation (22) in the same kernel accordingly. To finish a single time step, we wait for the communication and update the outer part of the


domain. Consequently, in algorithm 2, a one-step two-grid algorithm is applied for both LB steps.

Algorithm 2: Improved algorithm for the conservative ACM.
1  Initialisation of all fields
2  for each time step t do
3    Start communication of phase-field PDFs
4    Perform phase-field LB step on the inner domain
5    Wait for the communication to finish
6    Perform phase-field LB step on the outer domain
7    Start communication of the velocity PDFs and the phase-field
8    Perform hydrodynamic LB step on the inner domain
9    Wait for the communication to finish
10   Perform hydrodynamic LB step on the outer domain
11 end

Figure 3. Subdivision of the domain for communication hiding. (Each block is split into a block interior and a surrounding frame; the frame width can be chosen independently in the x- and y-direction.)
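The subdivision itself is simple index bookkeeping; the following sketch (our own illustration in plain Python, not WALBERLA code) splits a block into the interior and six frame slabs of configurable width, which is the prerequisite for overlapping the frame communication with the interior update:

```python
def split_iteration_space(shape, frame_width):
    """Return the interior slice and six frame slabs of a block.

    shape: (Nx, Ny, Nz) of the block, frame_width: (fx, fy, fz), e.g. (32, 8, 8).
    """
    fx, fy, fz = frame_width
    nx, ny, nz = shape
    interior = (slice(fx, nx - fx), slice(fy, ny - fy), slice(fz, nz - fz))
    full = (slice(0, nx), slice(0, ny), slice(0, nz))
    frame = [
        (slice(0, fx), full[1], full[2]), (slice(nx - fx, nx), full[1], full[2]),
        (full[0], slice(0, fy), full[2]), (full[0], slice(ny - fy, ny), full[2]),
        (full[0], full[1], slice(0, fz)), (full[0], full[1], slice(nz - fz, nz)),
        # the slabs overlap at the block edges; in a one-step two-grid update a cell
        # that is written twice simply receives the same value again
    ]
    return interior, frame

interior, frame = split_iteration_space((260, 260, 260), (32, 8, 8))
# pseudo time step: start_communication(); update(interior); wait(); update each frame slab
```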

5 Benchmark Results
In the following, we compare algorithm 1 and algorithm 2. For the straightforward algorithm, we need to load each PDF field three times during a single time step. In the improved algorithm, it is only necessary to load each field once. As we will discuss in section 5.1 in more detail, the performance-limiting factor of the model is the memory bandwidth. Therefore, we expect an increase in the performance of our improved algorithm by approximately a factor of three. To measure the performance of both algorithms, we initialize a cubic domain of 260³ cells on an NVIDIA Tesla P100 GPU. The domain consists of liquid with ρ_H = 1 and a gas bubble in the middle with ρ_L = 0.001 and a radius of R = 65 grid cells. The mobility is set to M = 0.02, the interface thickness is chosen to be ξ = 5, the surface tension is σ = 10⁻⁴ and the relaxation time is τ = 0.53. In all directions, periodic boundary conditions are applied, and there is no external force acting on the bubble. The same parameter setup is used for all other benchmarks in the following sections if not stated differently. We measure the performance in Mega Lattice Updates per Second (MLUPs) after 200 time steps. For the unoptimized algorithm, we measure 211 MLUPs, and for the improved algorithm, we measure 550 MLUPs. Contrary to expectation, we do not quite observe an improvement by a factor of three between the two algorithms. One reason might be that caching works better in the straightforward algorithm due to its simpler compute kernels.

In the following sections, we first investigate the performance of algorithm 2 on a single GPU. Afterwards, we analyze the scaling behavior on an increasing number of GPUs in a weak scaling benchmark. For all investigations, we use a D3Q15 SRT LB scheme for the interface tracking and a D3Q27 MRT LB scheme for the velocity distribution function, similarly to Mitchell et al. (2018a).

5.1 Single GPU
For the performance analysis on a single GPU, we focus on the NVIDIA Tesla V100 due to its wide distribution and its usage in the top supercomputers Summit‡ and Sierra§. Further, we discuss the performance on an NVIDIA Tesla P100 because it is used in the Piz Daint¶ supercomputer, where we ran the weak scaling benchmark shown in section 5.2. In this section, the two LB steps are analysed independently. In order to determine whether the LB steps are memory- or compute-bound, the balance model is used, which is based on the code balance B_c

\[
B_c = \frac{n_b}{n_f}, \tag{26}
\]

‡ https://www.olcf.ornl.gov/summit/
§ https://computing.llnl.gov/computers/sierra
¶ https://www.cscs.ch/computers/piz-daint/


Table 1. Estimated performance results of the phase-field and hydrodynamic LB step. The memory bandwidth b_s is determined with a STREAM copy benchmark. The peak performance p_peak is given by the vendor (see NVIDIA Corporation, 2017).

Hardware | LB kernel | b_s (GB/s) | p_peak (TFLOPS) | n_b (bytes) | n_f (FLOPS) | B_c  | B_m  | l    | Kernel estimate (GLUPs)
V100     | phase     | 808        | 7.80            | 280         | 320         | 0.88 | 0.10 | 0.11 | 2.89
V100     | hydro     | 808        | 7.80            | 488         | 809         | 0.60 | 0.10 | 0.16 | 1.66
P100     | phase     | 542        | 5.30            | 280         | 320         | 0.88 | 0.10 | 0.11 | 1.94
P100     | hydro     | 542        | 5.30            | 488         | 809         | 0.60 | 0.10 | 0.16 | 1.11

Table 2. Calculated effective bandwidth b_eff for loads and stores in comparison with the bandwidth b_m measured by nvprof‖. Additionally, the measured performance of the LB steps is given and compared with the estimated results of table 1.

Hardware | LB kernel | b_m,reads (GB/s) | b_m,writes (GB/s) | b_eff,reads (GB/s) | b_eff,writes (GB/s) | Kernel measured (GLUPs) | Ratio (%)
V100     | phase     | 451              | 337               | 404                | 340                 | 2.66                    | 92
V100     | hydro     | 372              | 341               | 337                | 326                 | 1.36                    | 82
P100     | phase     | 292              | 219               | 258                | 217                 | 1.70                    | 87
P100     | hydro     | 255              | 228               | 237                | 230                 | 0.96                    | 86

and the machine balance B_m (see Hager and Wellein, 2010)

\[
B_m = \frac{b_s}{p_{peak}}. \tag{27}
\]

The machine balance describes the ratio of the machine bandwidth b_s in bytes per second to the peak performance p_peak in FLOPs per second. The code balance, on the other hand, describes the ratio of the bytes n_b loaded and stored during the execution of the algorithm to the executed FLOPs n_f. The limiting factor of the algorithm is given by

\[
l = \min\left(1, \frac{B_m}{B_c}\right). \tag{28}
\]

If the "light speed" balance l is less than one, a code is memory limited. To be able to calculate l, values for b_s, p_peak, n_b and n_f need to be stated.

As specified by the vendor, the V100 has a nominal bandwidth of 900 GB/s (see NVIDIA Corporation, 2017). By running a STREAM copy benchmark, we obtain 808 GB/s as the stream copy bandwidth. This synthetic benchmark implements a vector copy a_i = b_i and describes the behavior of the LB steps more realistically than the nominal bandwidth of the GPUs (see Feichtinger et al., 2015; Bauer et al., 2020b). Therefore, we will only refer to the stream copy bandwidth in further discussions. For the P100, a stream copy bandwidth of 542 GB/s can be measured in the same way.

The peak performances of the accelerator hardware are taken from the white paper by NVIDIA (see NVIDIA Corporation, 2017). For the V100, a double-precision peak performance of 7.8 TFLOPs is given, while it is 5.3 TFLOPs for the P100.

To determine n_b for the phase-field LB step, we first need to consider the data that needs to be read and written in a single cell in each iteration. In each cell, we update the phase-field PDFs in a stream-pull-collide manner. This means we need to read and write 15 double-precision values per time step. Furthermore, the velocity field is required in this calculation. Hence, another three double-precision values have to be loaded. Additionally, we need to take the forcing term into account, as described by equation (5). In this term, we approximate the gradient of the phase-field with a second-order isotropic stencil as introduced in equations (23) and (24). This results in a 15-point stencil for the phase-field LB step. To estimate a lower limit for the memory traffic, we assume an ideal situation, which is reached when every grid point of the phase-field needs to be loaded only once. This would be the case if per cell only one value is loaded, and all additional values can be reused from cache since other threads already loaded them. Finally, we evaluate the zeroth moment of the phase-field distributions to update the phase-field. Therefore, one more double-precision value is stored. Thus, we have in total 19 double-precision values to load and 16 to store. Altogether this makes 280 bytes per cell per iteration for the phase-field LB step.


For the hydrodynamic LB step, we have a D3Q27 stencil, resulting in 27 reads and 27 writes. Further, we evaluate the velocity field and also update it. Thus, another three loads and stores need to be performed. Once again, we assume the ideal scenario of only one load per cell when calculating the gradient and the Laplacian needed for equation (13). This assumption leads to 31 loads and 30 stores. Hence, 488 bytes are needed per cell per iteration for the hydrodynamic LB step.

In order to obtain the number of operations n_f executed per cell in one iteration, we use the count_ops function provided by lbmpy. For the phase-field LB step, we get a total of 320 FLOPs. Due to the larger stencil, a more complicated collision operator and a force model consisting of several terms, we get more operations for the hydrodynamic LB step, namely 809 FLOPs. These values are obtained after applying common subexpression elimination.

Combining the obtained values in table 1, we can see that both LB steps are highly memory bound. Therefore, the maximal performance is given by

\[
P_{max} = \frac{b_s}{n_b}. \tag{29}
\]
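The Table 1 entries follow directly from these formulas; for example, for the phase-field LB step on the V100 (a few lines of arithmetic using the numbers quoted above):

```python
b_s = 808e9       # STREAM copy bandwidth in B/s
p_peak = 7.8e12   # double-precision peak performance in FLOP/s
n_b = 280         # bytes loaded and stored per cell and iteration
n_f = 320         # FLOPs per cell and iteration

B_c = n_b / n_f             # code balance    ~0.88 B/FLOP
B_m = b_s / p_peak          # machine balance ~0.10 B/FLOP
l = min(1.0, B_m / B_c)     # "light speed" well below one -> memory bound
P_max = b_s / n_b           # eq. (29): ~2.89e9 cells/s, i.e. 2.89 GLUPs
print(B_c, B_m, l, P_max / 1e9)
```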

Additionally, we profiled the two LB steps with the NVIDIA profiler nvprof‖ on both GPU architectures. The results for the limiting factors are illustrated in figure 4. These measurements confirm the theoretical performance model very well and show that the memory bandwidth is almost fully utilized on both GPUs.

Consequently, the next step is to determine whether the memory bandwidth is also used reasonably. This means no unnecessary values should be transferred that are not needed for the calculation. By running the phase-field LB step independently, we measure a performance of about 2.65 GLUPs. With this information, we can calculate the effective bandwidth b_eff for reads and writes by multiplying the measured performance with the transferred data, respectively. For reads, this results in 404 GB/s, while for writes it results in 340 GB/s. This matches well with the results measured by the NVIDIA profiler, which are 451 GB/s and 337 GB/s, respectively. It shows that there is not much unnecessary bandwidth utilization. Still, the measured bandwidth for loading values from the device memory is slightly higher than the calculated effective bandwidth. This indicates that the ideal assumption of every grid point of the phase-field being read only once does not hold completely. One reason might be the high memory traffic caused by the underlying LB step.

Figure 4. Compute unit utilisation and memory transfer measured with nvprof‖ for the phase-field and the hydrodynamic LB step separately on (a) a Tesla V100 and (b) a Tesla P100. The memory transfer is given relative to the STREAM copy bandwidth of the hardware. (Tesla V100: function unit (double) utilisation 28 % / 35 % and device memory utilisation 97 % / 88 % for the phase-field / hydrodynamic LB step. Tesla P100: 30 % / 33 % and 93 % / 89 %, respectively.)

However, comparing the measured performance to the estimate in table 1, we reach about 92 % of the theoretical peak performance, which means that there is not much potential for further improvements. For the hydrodynamic LB step and for the P100 GPU, the results for the measured bandwidth and the effective bandwidth are gathered in table 2. As one can see, the behavior of both LB steps is similar on both architectures. Moreover, both LB steps reach around 85 % of the theoretical peak performance on both GPUs.
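The effective bandwidths in table 2 are obtained exactly this way; for the phase-field step on the V100 (a sketch using the per-cell data volume derived above):

```python
glups = 2.65e9            # measured phase-field throughput on the V100 in cells/s
bytes_read = 19 * 8       # 19 double-precision loads per cell
bytes_written = 16 * 8    # 16 double-precision stores per cell

b_eff_reads = glups * bytes_read      # ~404 GB/s, cf. 451 GB/s measured by nvprof
b_eff_writes = glups * bytes_written  # ~340 GB/s, cf. 337 GB/s measured by nvprof
print(b_eff_reads / 1e9, b_eff_writes / 1e9)
```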

With the usage of code generation, we can easily change the discrete velocities used for the

‖https://docs.nvidia.com/cuda/profiler-users-guide/index.html


LB steps of the implementation. This makes it possible to evaluate the performance for different two- and three-dimensional stencils. As shown in figure 5, we employed a D3Q27, a D3Q19, a D3Q15, and a D2Q9 stencil for both the phase-field and the hydrodynamic LB step. We reach about 86 % of the theoretical peak performance for the different three-dimensional stencils. The ideal assumption that each value of the phase-field φ is loaded only once is better fulfilled in the two-dimensional case; thus, we reach a higher relative performance of about 94 % both on a Tesla V100 and on a Tesla P100.

Figure 5. Performance measurement for different LB stencils (D3Q27, D3Q19, D3Q15 and D2Q9) for the phase-field and the hydrodynamic LB step, compared to the theoretical peak performance (black lines). The white number in each bar shows the ratio between theoretical peak performance and measured performance. ((a) Tesla V100: 86 %, 88 %, 87 % and 95 % for the (27, 27), (19, 19), (15, 15) and (9, 9) stencil combinations; (b) Tesla P100: 86 %, 87 %, 86 % and 93 %.)

5.2 Weak Scaling Benchmark

The idea of communication hiding for LBM simulations was already studied in Feichtinger et al. (2015). As in Feichtinger et al. (2015), we partition each subdomain (block) into an outer part, which forms a frame around each block, and an inner part, the block interior. In contrast to that work, the width of the block frame can be chosen freely in our implementation. We can then execute the LBM kernels first on the frame and send its ghost values asynchronously to the neighbouring blocks. While the communication takes place, the LBM kernels are executed on the inner domain. Due to our flexible implementation, we ran benchmarks for different frame widths on an increasing number of GPUs in order to find an optimal choice of the frame width. This weak scaling performance benchmark was carried out on the Piz Daint supercomputer on up to 2048 GPU nodes. We set the physical parameters and the number of grid cells on each GPU as discussed at the beginning of this section; thus we always have 260³ grid cells on each GPU. The benchmark is performed with different frame widths to determine their performance. As we increase the number of nodes, we get more communication overhead. It follows that a frame width of (32, 8, 8) shows a significantly higher performance than an algorithm without communication hiding, as pictured in figure 6. On a single GPU, on the other hand, the performance of the algorithm without communication hiding is higher, at 550 MLUPs. Configurations with a thicker frame width like (64, 64, 64) would perform better on a single GPU than a frame width of (32, 8, 8), but show no good scaling behavior, similar to the case without communication hiding. Thus, for the sake of simplicity, we only show the scaling behavior of three scenarios in figure 6: a frame width of (32, 8, 8), as it delivers a good scaling behavior and performance; the native case, which describes a frame width of (1, 1, 1); and "no communication hiding", which refers to a test case where we do not utilize communication hiding strategies at all. Furthermore, the theoretical peak performance, which is calculated without taking the communication overhead into account, is shown as a grey line in the figure.

Both a frame width of (32, 8, 8) and a frame width of (1, 1, 1) show excellent scaling behavior. However, since the kernel that updates the outer frame in the native case is not as efficient as in the other case, we see lower performance throughout the benchmark. If we do not utilize


Figure 6. Weak scaling performance benchmark on the Piz Daint supercomputer (MLUPs per GPU for 1 to 2048 GPUs). The grey line shows the theoretical peak performance. With a thicker frame width of (32, 8, 8) (dark blue) we reach a parallel efficiency of almost 98 % and 70 % of the theoretical peak performance. Furthermore, it can be seen that a thin frame of (1, 1, 1) (red) shows worse performance and no separation of the domain (light blue) shows worse parallel efficiency.

communication hiding at all, it is visible that we lose performance when scaling up our simulation. On 2048 GPUs, we run a simulation setup with about 36 billion lattice cells and a total throughput of about one TLUPs. Furthermore, we reach a parallel efficiency of about 98 %. We are still able to exploit 70 % of the theoretical peak performance on 2048 nodes. In comparison, we would only be able to achieve 62 % of the maximum performance when using a frame width of (1, 1, 1). Not using communication hiding reduces the measured performance to 49 % of the theoretical peak performance. This lets us conclude that we can save a significant amount of compute time and resources by using our flexible implementation, which is possible not only for one specific LB configuration but for a wide variety of different setups to solve the LBEs, including state-of-the-art collision operators. On the other side, the user has a very convenient interactive development environment to work on new problems and can almost entirely work directly on the equation level.
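The headline figures of this run are easy to check with a back-of-the-envelope calculation (the per-GPU rate below is inferred from the quoted totals and the single-GPU estimates of Table 1, not an extra measurement):

```python
gpus = 2048
cells_per_gpu = 260 ** 3                  # block size per GPU in the weak scaling run
total_cells = gpus * cells_per_gpu        # ~3.6e10, i.e. about 36 billion lattice cells

mlups_per_gpu = 490                       # roughly 70 % of the ~0.7 GLUPs per-GPU estimate
total_lups = gpus * mlups_per_gpu * 1e6   # ~1.0e12, i.e. about one TLUPs
print(total_cells / 1e9, total_lups / 1e12)
```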

6 Numerical Validation

6.1 Single Rising Bubble
The motion of a single gas bubble rising in a liquid has been studied for many centuries by various authors and is still a problem of great interest today (see Clift et al., 1978; Bhaga and Weber, 1981; Grace, 1973; Tomiyama et al., 1998; Lote et al., 2018; Fakhari et al., 2017a; Mitchell et al., 2019). This is due to its vast importance in many industrial applications and natural phenomena, like the aerosol transfer from the sea, oxygen dissolution in lakes due to rain, bubble column reactors and the flow of foams and suspensions, to name just a few. Because of the three-dimensional nature and the nonlinear effects of the problem, its numerical simulation still remains a challenging task (see Tripathi et al., 2015). The evolution of a gas bubble in a stationary fluid depends on a large variety of different parameters: the surface tension, the density difference between the fluids, the viscosity of the bulk media and the external pressure gradient or gravitational field through which buoyancy effects are observed in the gas phase. These parameters are combined into dimensionless groups in order to acquire comprehensive theories describing the problem (see Mitchell et al., 2019).

In this study we set up a computational domain of 256 × 1024 × 256 cells and initially place a spherical bubble with a radius of 16 grid cells at (128, 256, 128). We use no-slip boundary conditions at the top and bottom of our domain to mimic solid walls. In all other directions, periodic boundary conditions are applied. This setup is consistent with Mitchell et al. (2019). For the force acting on the bubble we use the volumetric buoyancy force

\[
F_b = \rho g_y \hat{y}, \tag{30}
\]

where g_y is the magnitude of the gravitational acceleration, which is applied in the vertical


direction y (see Fakhari et al., 2017a). The density ρ is calculated by a linear interpolation with equation (8). For all simulations, a D3Q19 stencil is used for both LB steps with an MRT method and a weighted orthogonal moment set (see Fakhari et al., 2017a). To characterise the shape of the bubble, we need five dimensionless parameters. We use the Reynolds number based on the gravitational force

\[
Re_{Gr} = \frac{\rho_H \sqrt{g_y D^3}}{\mu_H}, \tag{31}
\]

where D is the initial diameter of the bubble. The Eötvös number

\[
Eo = \frac{g_y \rho_H D^2}{\sigma}, \tag{32}
\]

which is also called the Bond number, describes the influence of gravitational forces compared to surface tension forces. Further, we use the density ratio ρ* = 1000, the viscosity ratio μ* = 100, and the reference time

\[
t_{ref} = \sqrt{\frac{D}{g_y}}. \tag{33}
\]

Thus, the dimensionless time can be calculated as t* = t/t_ref. For different Eötvös and gravitational Reynolds numbers, the terminal shape of a rising bubble can be seen in figure 7. All simulations are carried out with a reference time of t_ref = 18 000 until t* = 10. We set the interface thickness to ξ = 5 cells and the mobility to M = 0.04. Comparing our results with Mitchell et al. (2019), we see good agreement regarding the terminal shape of the bubble. Further, we achieve the same behavior as described by the experiments of Bhaga and Weber (1981). The shape of the bubbles reflects the increase of the effective force acting on them. As a result, we can observe a deformation from nearly spherical to increasingly flat bubbles when increasing the Eötvös or the gravitational Reynolds number.

Additionally, we calculate the drag coefficient from the terminal velocity u_t of the bubbles

\[
C_D^{LBM} = \frac{4}{3}\,\frac{g_y(\rho_H - \rho_L)D}{\rho_H u_t^2}, \tag{34}
\]

and compare it to the experiments carried out by Bhaga and Weber (1981). Based on their observations, they set up the following empirical equation for the drag coefficient of a rising bubble described by the gravity Reynolds number:

\[
C_D^{exp} = \left[2.67^{9/10} + \left(\frac{16}{Re_{Gr}}\right)^{9/10}\right]^{10/9}. \tag{35}
\]
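Equation (35) is straightforward to evaluate; the short sketch below (our own few lines) computes the correlation for the gravity Reynolds numbers that appear in figure 7:

```python
def drag_coefficient_bhaga_weber(re_gr):
    """Empirical drag coefficient of Bhaga and Weber (1981), equation (35)."""
    return (2.67 ** 0.9 + (16.0 / re_gr) ** 0.9) ** (1.0 / 0.9)

for re_gr in (10, 30, 40, 120):
    print(re_gr, round(drag_coefficient_bhaga_weber(re_gr), 2))
# C_D decreases from about 4.6 at Re_Gr = 10 to about 2.9 at Re_Gr = 120
```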

Figure 7. Terminal shape of a single rising bubble with ρ* = 1000 and μ* = 100 under different Eötvös and gravity Reynolds numbers at 10 t*: (a) Eo = 1, Re_Gr = 40; (b) Eo = 30, Re_Gr = 10; (c) Eo = 5, Re_Gr = 40; (d) Eo = 30, Re_Gr = 30; (e) Eo = 100, Re_Gr = 40; (f) Eo = 30, Re_Gr = 120.

As can be seen in figure 8, our results are in good agreement with the experimental investigations.

Figure 8. Drag coefficient plotted against the Reynolds number of a single rising bubble. The dots show the results of the LB simulations with a constant Eötvös number of Eo = 30, while the blue line represents the empirical correlation of equation (35) by Bhaga and Weber (1981).

6.2 Bubble Field
To demonstrate the robustness of the scheme as well as the possibilities opened up by the efficient and scalable implementation, we show a large-scale bubble rise scenario with several hundred bubbles. The


simulation is carried out on a 720 × 560 × 720 domain, which gives about 290 million lattice cells. The simulation ran for 10 hours, resulting in 500 000 time steps. For this simulation, we use an Eötvös number of Eo = 50. We further specify the gravitational Reynolds number as Re_Gr = 50 and the mobility as M = 0.08. We set the reference time to t_ref = 18 000. The density ratio and the viscosity ratio between the fluid and the bubbles are set to those of water and air (ρ* = 1000, μ* = 100). We initialize two layers of air bubbles at the bottom of our domain with a radius of R = 16. To give the bubbles slightly different radii, we add a random value sampled from [−R/5, R/5] to each radius. Since the air bubbles have different radii, we can see that bubbles with a larger radius accelerate faster, which shows good physical agreement. A snapshot of the simulation every 125 000 time steps can be seen in figure 9. We can clearly see that complex physical phenomena are captured in a stable simulation: bubble coalescence, the coalescence of air bubbles with the liquid surface, and bubble breakage.
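A minimal sketch of how such a perturbed bubble layer could be initialised (our own illustration; the exact placement used in the simulation is not specified in the text) is:

```python
import numpy as np

rng = np.random.default_rng(42)

def bubble_layers(domain=(720, 560, 720), radius=16, layers=2, spacing=2.5):
    """Place bubbles on a regular grid in the lowest layers, with radii perturbed by +-R/5."""
    nx, _, nz = domain
    step = int(spacing * radius)                    # centre-to-centre distance
    bubbles = []
    for layer in range(layers):
        y = radius + layer * step                   # height of this bubble layer
        for x in range(step // 2, nx, step):
            for z in range(step // 2, nz, step):
                r = radius + rng.uniform(-radius / 5, radius / 5)
                bubbles.append(((x, y, z), r))
    return bubbles

print(len(bubble_layers()))   # several hundred bubbles in two layers
```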

7 Conclusion
In this work we have presented an implementation of the conservative ACM based on metaprogramming. With this technique we can generate highly efficient C, OpenCL and CUDA kernels, which can be integrated into other frameworks for simulating large-scale scenarios. For this work we have used the WALBERLA framework to integrate our code. We have measured the efficiency of our implementation on single GPUs. Excellent performance results compared to the roofline model could be shown for a Tesla V100 and a Tesla P100, where we achieve about 85 % of the theoretical peak performance on both architectures. Additionally, we have shown that our code not only performs very well for one configuration, but keeps its excellent efficiency even for different stencils and different methods to solve the LBEs. It is even possible to directly generate 2D cases for testing, which also show very good performance results. By separating our iteration region into an inner and an outer part, we could enable communication hiding, which is relevant for multi-GPU simulations with MPI. With this technique we are able to run large-scale simulations with almost perfect scalability. To show this, we have run a weak scaling benchmark on the Piz Daint supercomputer on up to 2048 GPU nodes. We could show that our implementation has a parallel efficiency of almost 98 %. To validate our code from a physical point of view, we have measured the terminal shape of a single rising air bubble in water under various Eötvös and Reynolds numbers. We could show good agreement with literature data regarding the terminal shape and the drag coefficient of rising bubbles. Finally, we have set up a larger scenario where we simulate several hundred air bubbles in water. With our implementation, we were not only able to maintain a stable simulation for this complicated test case, but we could also observe complex phenomena like bubble coalescence, the coalescence of air bubbles with the liquid surface and bubble breakage.

8 Supporting information

The multiphysics framework waLBerla is released as an open-source project and can be used under the terms of the GNU General Public License. The source code is available at https://i10git.cs.fau.de/walberla/walberla.

9 Acknowledgments

We appreciate the support of Travis Mitchell for this project. Furthermore, we thank Christoph Schwarzmeier and Christoph Rettinger for fruitful discussions on the topic.

10 Funding

We are grateful to the Swiss National Supercomputing Centre (CSCS) for providing computational resources and access to the Piz Daint supercomputer. Further, the authors would like to thank the Bavarian Competence Network for Technical and Scientific High Performance Computing (KONWIHR), the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) for supporting project 408059952, and the Bundesministerium für Bildung und Forschung (BMBF, Federal Ministry of Education and Research) for supporting project 01IH15003A (SKAMPY).

References

Allen S, Tsui D and Vinter B (1976) On the absorption of infrared radiation by electrons in semiconductor inversion layers. Solid State Communications DOI:10.1016/0038-1098(76)90541-X.

Alnæs MS, Blechta J, Hake J, Johansson A, Kehlet B, Logg A, Richardson C, Ring J, Rognes ME and Wells GN (2015) The FEniCS project version 1.5. Archive of Numerical Software DOI:10.11588/ans.2015.100.20553.


Figure 9. Large-scale bubble rise scenario simulated on the Piz Daint supercomputer with several hundred air bubbles: (a) initialisation, (b) time step 125 000, (c) time step 250 000, (d) time step 375 000, (e) time step 500 000.

Anderson DM, McFadden GB and Wheeler AA (1998) Diffuse-interface methods in fluid mechanics. Annual Review of Fluid Mechanics DOI:10.1146/annurev.fluid.30.1.139.

Bauer M, Eibl S, Godenschwager C, Kohl N, Kuron M, Rettinger C, Schornbaum F, Schwarzmeier C, Thönnes D, Köstler H and Rüde U (2020a) waLBerla: A block-structured high-performance framework for multiphysics simulations. Computers & Mathematics with Applications DOI:10.1016/j.camwa.2020.01.007.

Bauer M, Hötzer J, Ernst D, Hammer J, Seiz M, Hierl H, Hönig J, Köstler H, Wellein G, Nestler B et al. (2019) Code generation for massively parallel phase-field simulations. Association for Computing Machinery. DOI:10.1145/3295500.3356186.

Bauer M, Köstler H and Rüde U (2020b) lbmpy: Automatic code generation for efficient parallel lattice Boltzmann methods. arXiv: Mathematical Software.

Bhaga D and Weber ME (1981) Bubbles in viscous liquids: shapes, wakes and velocities. Journal of Fluid Mechanics DOI:10.1017/S002211208100311X.

Bogner S, Ammer R and Rüde U (2015) Boundary conditions for free interfaces with the lattice Boltzmann method. Journal of Computational Physics DOI:10.1016/j.jcp.2015.04.055.

Bogner S, Rüde U and Harting J (2016) Curvature estimation from a volume-of-fluid indicator function for the simulation of surface tension and wetting with a free-surface lattice Boltzmann method. Phys. Rev. E DOI:10.1103/PhysRevE.93.043302.

Cahn JW and Hilliard JE (1958) Free energy of a nonuniform system. I. Interfacial free energy. The Journal of Chemical Physics DOI:10.1063/1.1744102.

Clift R, Grace JR and Weber ME (1978) Bubbles, Drops, and Particles. New York; London: Academic Press.

Coreixas C, Chopard B and Latt J (2019) Comprehensive comparison of collision models in the lattice Boltzmann framework: Theoretical investigations. Phys. Rev. E DOI:10.1103/PhysRevE.100.033305.

Dinesh Kumar E, Sannasiraj SA and Sundar V (2019) Phase field lattice Boltzmann model for air-water two phase flows. Physics of Fluids DOI:10.1063/1.5100215.

Fakhari A, Bolster D and Luo LS (2017a) A weighted multiple-relaxation-time lattice Boltzmann method for multiphase flows and its application to partial coalescence cascades. Journal of Computational Physics DOI:10.1016/j.jcp.2017.03.062.

Fakhari A, Geier M and Bolster D (2019) A simple phase-field model for interface tracking in three dimensions. Computers & Mathematics with Applications DOI:10.1016/j.camwa.2016.08.021.

Fakhari A and Lee T (2013) Multiple-relaxation-time lattice Boltzmann method for immiscible fluids at high Reynolds numbers. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics DOI:10.1103/PhysRevE.87.023304.

Fakhari A, Mitchell T, Leonardi C and Bolster D (2017b) Improved locality of the phase-field lattice-Boltzmann model for immiscible fluids at high density ratios. Phys. Rev. E DOI:10.1103/PhysRevE.96.053301.

Feichtinger C, Habich J, Köstler H, Rüde U and Aoki T (2015) Performance modeling and analysis of heterogeneous lattice Boltzmann simulations on CPU–GPU clusters. Parallel Computing DOI:10.1016/j.parco.2014.12.003.

Geier M, Fakhari A and Lee T (2015) Conservative phase-field lattice Boltzmann model for interface tracking equation. Phys. Rev. E DOI:10.1103/PhysRevE.91.063309.

Grace J (1973) Shapes and velocities of bubbles rising in infinite liquid. Transactions of the Institution of Chemical Engineers.

Hager G and Wellein G (2010) Introduction to High Performance Computing for Scientists and Engineers. 1st edition. USA: CRC Press, Inc. ISBN 143981192X.

Huang H, Sukop MC and Lu X (2015) Multiphase Lattice Boltzmann Methods: Theory and Application. John Wiley & Sons, Ltd. DOI:10.1002/9781118971451.

Kumar A (2004) Isotropic finite-differences. Journal of Computational Physics DOI:10.1016/j.jcp.2004.05.005.

Körner C, Thies M, Hofmann T, Thürey N and Rüde U (2005) Lattice Boltzmann model for free surface flow for modeling foaming. Journal of Statistical Physics DOI:10.1007/s10955-005-8879-8.

Li Q, Luo K, Kang Q, He Y, Chen Q and Liu Q (2016) Lattice Boltzmann methods for multiphase flow and phase-change heat transfer. Progress in Energy and Combustion Science DOI:10.1016/j.pecs.2015.10.001.

Lote D, Vinod V and Patwardhan AW (2018) Comparison of models for drag and non-drag forces for gas-liquid two-phase bubbly flow. Multiphase Science and Technology DOI:10.1615/MultScienTechn.2018025983.

Meurer A, Smith CP, Paprocki M, Čertík O, Kirpichev SB, Rocklin M, Kumar A, Ivanov S, Moore JK, Singh S, Rathnayake T, Vig S, Granger BE, Muller RP, Bonazzi F, Gupta H, Vats S, Johansson F, Pedregosa F, Curry MJ, Terrel AR, Roučka Š, Saboo A, Fernando I, Kulal S, Cimrman R and Scopatz A (2017) SymPy: symbolic computing in Python. PeerJ Computer Science 3: e103. DOI:10.7717/peerj-cs.103. URL https://doi.org/10.7717/peerj-cs.103.

Mitchell T, Hill B, Firouzi M and Leonardi C (2019) URTeC-198239-MS: Development and evaluation of multiphase closure models used in the simulation of unconventional wellbore dynamics. DOI:10.15530/AP-URTEC-2019-198239.

Mitchell T, Leonardi C and Fakhari A (2018a) Development of a three-dimensional phase-field lattice Boltzmann method for the study of immiscible fluids at high density ratios. International Journal of Multiphase Flow DOI:10.1016/j.ijmultiphaseflow.2018.05.004.


Mitchell T, Leonardi C, Firouzi M and Towler B (2018b) Towards closure relations for the rise velocity of Taylor bubbles in annular piping using phase-field lattice Boltzmann techniques.

NVIDIA Corporation (2017) NVIDIA Tesla V100 GPU architecture: The world's most advanced data center GPU. Technical Report WP-08608-001 v1.1, NVIDIA. URL https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.

Pao-Hsiung C and Yan-Ting L (2011) A conservative phase field method for solving incompressible two-phase flows. Journal of Computational Physics DOI:10.1016/j.jcp.2010.09.021.

Perez F and Granger BE (2007) IPython: a system for interactive scientific computing. Computing in Science and Engineering DOI:10.1109/MCSE.2007.53.

Prosperetti A and Tryggvason G (2007) Computational Methods for Multiphase Flow. Cambridge University Press. DOI:10.1017/CBO9780511607486.

Ramadugu R, Thampi SP, Adhikari R, Succi S and Ansumali S (2013) Lattice differential operators for computational physics. EPL (Europhysics Letters) DOI:10.1209/0295-5075/101/50006.

Ren F, Song B, Sukop MC and Hu H (2016) Improved lattice Boltzmann modeling of binary flow based on the conservative Allen-Cahn equation. Phys. Rev. E DOI:10.1103/PhysRevE.94.023311.

Sun Y and Beckermann C (2007) Sharp interface tracking using the phase-field equation. Journal of Computational Physics DOI:10.1016/j.jcp.2006.05.025.

Thürey N, Pohl T and Rüde U (2009) Hybrid parallelization techniques for lattice Boltzmann free surface flows. DOI:10.1007/978-3-540-92744-0_22.

Tomiyama A, Kataoka I, Zun I and Sakaguchi T (1998) Drag Coefficients of Single Bubbles under Normal and Micro Gravity Conditions. JSME International Journal Series B DOI:10.1299/jsmeb.41.472.

Tripathi M, Sahu K and Govindarajan R (2015) Dynamics of an initially spherical bubble rising in quiescent liquid. Nature Communications DOI:10.1038/ncomms7268.

Wittmann M, Zeiser T, Hager G and Wellein G (2016) Comparison of different propagation steps for lattice Boltzmann methods. Computers & Mathematics with Applications DOI:10.1016/j.camwa.2012.05.002.

Yan Y, Zu Y and Dong B (2011) LBM, a useful tool for mesoscale modelling of single-phase and multiphase flow. Applied Thermal Engineering DOI:10.1016/j.applthermaleng.2010.10.010.
