
Preconditioning for modal discontinuous Galerkin methods for unsteady 3D Navier-Stokes equations

Philipp Birken a, Gregor Gassner b, Mark Haas b, Claus-Dieter Munz b

a University of Kassel, Department of Mathematics, Heinrich-Plett-Str. 40, 34132 Kassel, Germany

b University of Stuttgart, Institute of Aerodynamics and Gas Dynamics, Pfaffenwaldring 21, 70569 Stuttgart, Germany

Abstract

We compare different block preconditioners in the context of parallel time adaptive higher order implicit time integration using Jacobian-free Newton-Krylov (JFNK) solvers for discontinuous Galerkin (DG) discretizations of the three dimensional time dependent Navier-Stokes equations. A special emphasis of this work is the performance for a relatively high number of processors, i.e. with a low number of elements per processor. For high order DG discretizations, a particular problem that needs to be addressed is the size of the blocks in the Jacobian. Thus, we propose a new class of preconditioners that exploits the hierarchy of modal basis functions and introduces a flexible order of the off-diagonal Jacobian blocks. While the standard preconditioners 'block Jacobi' (no off-blocks) and full symmetric Gauss-Seidel (full off-blocks) are included as special cases, the reduction of the off-block order results in the new scheme ROBO-SGS. This allows us to investigate the impact of the preconditioner's sparsity pattern on the computational performance. Since the number of iterations is not well suited to judge the efficiency of a preconditioner, we additionally consider CPU time for the comparisons. We found that both block Jacobi and ROBO-SGS have good overall performance and good strong parallel scaling behavior.

Key words: Discontinuous Galerkin, Unsteady flows, Navier-Stokes, Implicit methods, Preconditioning, Three dimensional problems


Preprint submitted to Elsevier 10 December 2012

1 Introduction

The solution of unsteady compressible viscous flows may lead to stiff problems, in particular for wall bounded flows and flows at low Mach numbers. This means that the time step size in explicit methods is driven by stability and in implicit methods by accuracy alone. Therefore, implicit methods, which can be constructed to have unbounded stability regions, are attractive for a number of problems and are a standard part of solution procedures for finite volume methods. However, finding an efficient solver for the resulting linear and nonlinear equation systems has turned out to be a difficult problem in the DG case [42], in particular for three-dimensional problems. The structure of the nonlinear systems to be solved is of the form

u−ψ + α∆tf(u) = 0,

where ψ is a known vector, α is a method dependent scalar parameter, u the vector of unknowns and f a function representing the overall spatial discretization. Hence, the structure of the system does not depend on the specific time integration method. A corresponding statement holds for the linear equation systems which result from an iterative solution process of the nonlinear system. Thus, we will use in this work the 4th order accurate diagonally implicit Runge-Kutta method ESDIRK4 [22]. We choose a 4th order accurate method in time, as we are interested in the simulation of unsteady problems.

Regarding the solvers for the algebraic systems, there are a number of requirements that have to be satisfied. Firstly, three-dimensional computations have strong memory demands that will actually increase in the future, with newer computer generations having less and less memory per core available. This problem is particularly pronounced for DG methods. These use a large number of degrees of freedom per cell, leading to large Jacobian blocks with more intercell connectivity. Already on today's supercomputers one may find oneself running out of memory faster than one would expect from experience with finite volume methods. Secondly, the solver has to scale in parallel to be feasible for supercomputing. Thirdly, it has to be reasonably fast, which is still a challenge in the DG context. Finally, the implementation cost should be as low as possible, where here, we are concerned with the additional coding needed to make an explicit DG method implicit. If all of these requirements are met (low storage requirements, parallel scaling, fast convergence and ease of implementation), the use of high order methods in the industrial context would become more feasible.

At this point it has to be noted that there is not yet a standard DG method and that in our experience, the question of an efficient solver depends on the specific discretization in a nontrivial way. However, it seems that an efficient DG method makes use of what is called a nodal basis in some way [17]. Here, we will consider the mixed modal-nodal variant suggested by Gassner et al. [15], which is based on a modal basis but uses a nodal basis for integration. For the diffusive terms, the dGRP flux is used [13]. Furthermore, we did some preliminary tests of the methodology on a DG Spectral Element method (DGSEM), e.g. [25], which showed that the preconditioning procedure used here indeed has to be modified in that context.

Basic candidates for solvers are FAS multigrid and preconditioned Jacobian-free Newton-Krylov (JFNK) methods, whereby multigrid can be used as a preconditioner. FAS multigrid is the method of choice for steady Euler flows when using finite volume schemes. In the context of unsteady flows, it is often used within a dual time stepping procedure, which appears to be slow, as reported by several authors, e.g. [20,6]. When looking at DG methods, the design of a fast multigrid solver is an open problem both for steady and unsteady flows [31,23,3,2,34,28].

We will not attack this problem and consider JFNK schemes in this work. There, the linear systems are solved using Krylov subspace methods, which do not need the system matrix explicitly, but only matrix vector products. Since the system matrix is a Jacobian, it can be approximated using finite differences, circumventing in theory the construction and storage of the Jacobian. In practice, a preconditioner is needed, making the schemes not completely matrix-free. Regarding the specific Krylov subspace method, it turns out that GMRES is the best choice in this context [24]. The Newton method is an inexact Newton method, where a good strategy to control the termination criteria for the linear solver is necessary. Namely, the strategy by Eisenstat and Walker is used [11].

As mentioned, a preconditioner is necessary to speed up GMRES. Here, multigrid can be used as a preconditioner, which was considered by several authors for the steady Euler equations in two dimensions [31,9,28]. Generally, multigrid preconditioners would satisfy the requirements mentioned, but they face the same problem as the FAS multigrid: the lack of theory for DG discretizations leads to nonoptimal methods. Therefore, we will not consider these schemes here. Regarding other preconditioners, a number of authors have considered Newton-Krylov methods for two-dimensional flows, in particular Rasetarinera and Hussaini (steady NS) [36], Dolejsi and Feistauer (steady Euler) [10], Darmofal et al. (steady NS) [12,9], Kanevsky et al. (unsteady Euler, NS) [21], as well as Persson and Peraire (unsteady NS) [33].

Whereas all the work mentioned above considers two-dimensional flow problems, we will focus on the 3D unsteady case. As opposed to the case of finite volume methods, where going from 2D to 3D increases the number of unknowns per cell from four to five and thus by just one, we have to multiply the number of unknowns per cell by a factor dependent on the polynomial degree, resulting in hundreds of unknowns per cell with a corresponding block structure. Thus, the two-dimensional case is in our opinion not representative of the three-dimensional one and furthermore, the discontinuous Galerkin case is very different from the finite volume case. Therefore, a successful implicit DG scheme must take these huge blocks in the Jacobian into account.

In this work, we examine the performance of different preconditioners: block Jacobi, block symmetric Gauß-Seidel (SGS), block-ILU and a multilevel block-ILU suggested by Persson and Peraire [33]. Furthermore, we propose a new class of SGS-type preconditioners, which we call ROBO-SGS (Reduced Off-diagonal Block Order). This new class exploits the hierarchical basis of the mixed modal-nodal DG method to reduce the order of the off-diagonal Jacobian blocks and includes the block Jacobi and the full SGS preconditioner as special cases. An approach that is similar in spirit has been suggested by Renac et al. [37]. The variable sparsity pattern of the ROBO-SGS preconditioner class gives us the possibility to investigate the impact of the preconditioner's matrix structure on the overall performance.

The point about the comparison is that first of all, it is done on three-dimensional test cases, second, it is done in a realistic setting of a time adaptive scheme with a smart choice of tolerances in Newton and a parallel solver and third, we compare not only iteration numbers, but also CPU time. This is important, because iteration counts indicate the quality of a preconditioner but not its efficiency, since the cost of applying the preconditioner is neglected. As we are interested in high fidelity simulations (direct numerical simulation or large eddy simulation) of compressible turbulent flow problems, we are solely interested in unsteady computations on large parallel architectures. Thus, an important aspect of the investigations is the parallel scaling of the methods and the impact of the preconditioner on the parallel performance.

The outline of the paper is as follows: First we will describe the governing equations and the DG methodology used. Then we will briefly discuss the ESDIRK4 method, after which we will describe the JFNK method and the different preconditioners. Finally, numerical results are presented where we compare the different preconditioners.

2 Governing equations

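The Navier-Stokes equations are a second order system of conservation laws (mass, momentum, energy) modeling viscous compressible flow. Written in the conservative variables density ρ, momentum m and energy per unit volume ρE, they read:

\[ \partial_t \rho + \nabla \cdot m = 0, \]

\[ \partial_t m_i + \sum_{j=1}^{d} \partial_{x_j} (m_i v_j + p\,\delta_{ij}) = \frac{1}{Re} \sum_{j=1}^{d} \partial_{x_j} S_{ij} + q_i, \quad i = 1, \dots, d, \]

\[ \partial_t (\rho E) + \nabla \cdot (H m) = \frac{1}{Re} \sum_{j=1}^{d} \partial_{x_j} \Big( \sum_{i=1}^{d} S_{ij} v_i - \frac{1}{Pr} W_j \Big) + q_e. \]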

Here, d stands for the number of dimensions, H for the enthalpy per unit mass, S represents the viscous shear stress tensor and W the heat flux. As the equations are dimensionless, the Reynolds number Re and the Prandtl number Pr appear. The equations are closed by the equation of state for the pressure p = (γ − 1)ρe, where we assume a perfect gas. Finally, $q_e$ denotes a possible source term in the energy equation, whereas $q = (q_1, \dots, q_d)^T$ is a source term in the momentum equation, for example due to external forces.

3 Spatial Discretization

We employ the mixed modal-nodal Discontinuous Galerkin scheme which has been suggested by Gassner et al. [15]. One of the main advantages of this method is that it allows the use of elements of arbitrary shape (i.e. tetrahedra, prisms, pyramids, hexahedra, ...) with high order of accuracy. In our experience, discretizations using hexahedra very often require fewer elements and thus fewer total degrees of freedom than ones that only use tetrahedra for approximately the same discretization error. This property becomes extremely important in the context of implicit methods, since the total number of degrees of freedom has a strong influence on the performance of the solver and an even greater impact on the memory consumption. This is crucial for DG methods, especially when doing real world 3D simulations. We will demonstrate these aspects in the following sections.

3.1 The Discontinuous Galerkin Method

We write the Navier-Stokes equations in the form

\[ u_t + \nabla \cdot f(u) = q(t, u), \tag{1} \]

with suitable initial and boundary conditions in a domain $\Omega \times [0, T] \subset \mathbb{R}^d \times \mathbb{R}_0^+$.

Here, $u = u(\vec{x}, t) \in \mathbb{R}^{d+2}$ is the state vector, $f(u) = f^C(u) - f^D(u, \nabla u)$ is the physical flux, where $f^C(u)$ is the convective (i.e. hyperbolic) and $f^D(u, \nabla u)$ the diffusive (i.e. parabolic) flux component. The possibly time and space dependent source term is given by $q(t, \vec{x}, u)$.

We derive the DG method by first subdividing the domain Ω into non-overlapping grid cells Q. In each grid cell we approximate the state vector using a local polynomial approximation of the form

\[ u(\vec{x}, t)\big|_Q \approx u^Q(\vec{x}, t) = \sum_{j=1}^{N} u_j(t)\, \varphi_j^Q(\vec{x}), \tag{2} \]

where in our case, $\{\varphi_j^Q(\vec{x})\}_{j=1,\dots,N}$ are modal hierarchical orthonormal basis functions and $u_j$ are the corresponding coefficients in the cell Q. The basis functions are constructed from a monomial basis with a simple Gram-Schmidt orthogonalization algorithm for arbitrary (reference) grid cell types. The dimension of the local approximation space depends on the spatial dimension d and the polynomial degree p:

\[ N = N(p, d) = \frac{(p + d)!}{p!\, d!}. \tag{3} \]
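To get a feeling for how quickly the local space grows, Eq. (3) can be evaluated directly. The following is a minimal sketch; the function name `num_basis_functions` is a hypothetical helper introduced only for this illustration.

```python
from math import comb


def num_basis_functions(p: int, d: int) -> int:
    """Dimension N(p, d) = (p + d)! / (p! d!) of the local polynomial space."""
    # The binomial coefficient C(p + d, d) equals (p + d)! / (p! d!).
    return comb(p + d, d)


if __name__ == "__main__":
    # Tabulate N for a few polynomial degrees in 2D and 3D.
    for p in range(1, 6):
        print(p, num_basis_functions(p, 2), num_basis_functions(p, 3))
```

For example, p = 4 in three dimensions gives N = 35 basis functions per conserved variable, i.e. 175 unknowns per cell for a system with five variables.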

The next step of our approximation is to define how the unknown degrees of freedom $u_j(t)$ are determined. The basis of the considered discontinuous Galerkin method is a weak formulation. Neglecting the source term for now, we insert the approximate solution (2) into the conservation law (1), multiply with a smooth test function $\phi = \phi(\vec{x})$ and integrate over Q to obtain

\[ \langle u_t^Q + \nabla \cdot f(u^Q), \phi \rangle_Q = 0, \tag{4} \]

where $\langle \cdot, \cdot \rangle_Q$ denotes the $L^2(Q)$ scalar product over Q. We proceed with an integration by parts to obtain

\[ \langle u_t^Q, \phi \rangle_Q + (f(u) \cdot \vec{n}, \phi)_{\partial Q} - \langle f(u^Q), \nabla \phi \rangle_Q = 0, \tag{5} \]

where $(\cdot, \cdot)_{\partial Q}$ denotes the surface integral over the boundary of the element Q. As the approximate solution is in general discontinuous across grid cell interfaces, the trace of the flux normal component $f(u) \cdot \vec{n}$ is not uniquely defined. To get a stable and accurate discretization, several choices for the numerical approximation are known. Here, we use the HLLC flux [40]. For a purely convective problem, inserting the trace approximation $f(u^Q) \cdot \vec{n} \approx g^C(u^-, u^+, \vec{n})$ into equation (5) would yield

\[ \langle u_t^Q, \phi \rangle_Q + (g^C(u^-, u^+, \vec{n}), \phi)_{\partial Q} - \langle f^C(u^Q), \nabla \phi \rangle_Q = 0. \tag{6} \]

We denote by $(\cdot)^-$ values at the inner side of a cell interface, i.e. values that depend on $u^Q$, and by $(\cdot)^+$ values that depend on the neighbor cells sharing the interface with the cell Q.


The handling of the diffusive part of the flux is a little more delicate for DG methods, because the jump in the gradients needs special handling. Several authors have suggested solutions for this problem [32,7,1,4,5] and all of these have been used in conjunction with implicit temporal discretizations. In this work we apply the dGRP flux of Gassner, Lörcher and Munz [13,14,27]. The dGRP flux is an extension of the symmetric interior penalty (SIP) method for compressible Navier-Stokes equations to guarantee optimal order of convergence. We choose this variant of the diffusion flux as it has been derived in a way that optimizes stability, i.e. minimizes the eigenvalues of the DG operator [13,14]. From a technical point of view this flux introduces an approximation of the trace of the flux normal component $f^D(u, \nabla u) \cdot \vec{n} \approx g^{dGRP}(u^-, \nabla u^-, u^+, \nabla u^+, \eta, \vec{n})$, where η is a parameter that depends on the geometry of the cell Q and its neighbor and the local order of the polynomial approximation (2). To ensure adjoint consistency, an additional surface flux term $h(u^-, u^+, \vec{n})$ is introduced via two integrations by parts [14], yielding the final DG formulation

\[ \langle u_t^Q, \phi \rangle_Q + (g^C - g^{dGRP}, \phi)_{\partial Q} - (h, \nabla \phi)_{\partial Q} - \langle \vec{f}^C - \vec{f}^D, \nabla \phi \rangle_Q = 0. \tag{7} \]

The coupling between elements and thus the resulting fill-in of the Jacobian matrix is comparable to the standard SIP method and the commonly used second method of Bassi and Rebay (BR2). We thus expect that the results shown below are directly applicable to those diffusive flux variants as well, whereas flux functions with different element coupling, such as the local DG (LDG) and its modification, the compact DG (CDG), may perform differently, although we expect the impact of the choice of the diffusive flux function not to be significant.

3.2 Nodal Integration

The computation of the volume and surface quadrature operators can be a very expensive task if standard methods such as Gaussian quadrature are used, which is caused by the high number of polynomial evaluations required for computing the fluxes. Based on the nodal DG scheme developed by Hesthaven and Warburton [17], Gassner et al. developed a way of constructing efficient quadrature operators that work on arbitrarily shaped elements, see [15] for further details. This has the advantage that the number of degrees of freedom does not depend on the element shape, as it would for a purely nodal scheme when using elements other than tetrahedra. As points to define the nodal basis, Legendre-Gauss-Lobatto (LGL) points are used on edges and then a method called LGL-type nesting is used to determine the interior points, which leads to a small Lebesgue constant.

The coexistence of modal and nodal elements is quite natural for a DG scheme since the transformation from modal ($\hat{u}$) to nodal ($\tilde{u}$) degrees of freedom is nothing else but a polynomial evaluation of the modal polynomials at the nodal interpolation points, which can be expressed in the form of a matrix-vector multiplication:

\[ \tilde{u} = V \hat{u}. \tag{8} \]

Here V is a Vandermonde matrix containing the evaluations of the modal polynomials at the interpolation points. The back transformation can be implemented using the inverse of the Vandermonde matrix:

\[ \hat{u} = V^{-1} \tilde{u}. \tag{9} \]

If the number of nodal interpolation points is different from the number of modal degrees of freedom, as is the case for elements other than tetrahedra, the approximate inverse $V^{-1}$ is defined using a least squares procedure based on singular value decomposition [15].
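The modal-nodal transformation and its least squares back transformation can be illustrated in one dimension. The sketch below is only an analogy under stated assumptions: it uses plain (not orthonormalized) Legendre polynomials as the modal basis and equispaced points as nodes, whereas the scheme described here uses an orthonormal basis on general elements and LGL-type points. The helper name `vandermonde` is hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def vandermonde(nodes: np.ndarray, n_modes: int) -> np.ndarray:
    """Matrix V with V[i, j] = (j-th Legendre polynomial)(nodes[i])."""
    return legendre.legvander(nodes, n_modes - 1)


n_modes = 4
nodes = np.linspace(-1.0, 1.0, 6)      # assumption: more nodes than modes
V = vandermonde(nodes, n_modes)        # shape (6, 4), not square

# SVD-based pseudo-inverse: the least squares replacement for V^{-1}.
V_pinv = np.linalg.pinv(V)

u_modal = np.array([1.0, -0.5, 0.25, 0.1])
u_nodal = V @ u_modal                  # modal -> nodal, cf. Eq. (8)
u_back = V_pinv @ u_nodal              # nodal -> modal, cf. Eq. (9)
```

Because the nodal values in this example come from a polynomial that lies in the modal space, the least squares back transformation recovers the modal coefficients up to round-off.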

The nodal DG method can be conveniently formulated in terms of matrices representing the discrete integrals in (7), acting on the nodal values $\tilde{u}$:

\[ M \tilde{u}_t + \underbrace{\sum_{i=1}^{nFaces} \left( M_i^S g_i - N_i h_i \right)}_{\text{surface integral}} - \underbrace{\sum_{k=1}^{d} S_k f_k}_{\text{volume integral}} = 0. \tag{10} \]

In the case of nonlinear equations, such as the compressible Navier-Stokes equations, the nonlinearity is present in the evaluation of the fluxes. In Eq. (10), $f_k$, k = 1, ..., d, are the vectors of flux evaluations at all nodal points, while $g_i$ and $h_i$ stand for the evaluations of the surface flux approximations at the nodal points of the element face i. The operators in Eq. (10) are designed to act on nodal input vectors and to produce a nodal output. Using Eq. (9), all the operators in Eq. (10) can be modified in order to produce a modal output, yielding the mixed modal-nodal DG method for the modal coefficients $\hat{u}$:

\[ \hat{u}_t = -V^{-1} M^{-1} \Big( \sum_{i=1}^{nFaces} \left( M_i^S g_i - N_i h_i \right) - \sum_{k=1}^{d} S_k f_k \Big). \tag{11} \]

4 Time integration scheme

Equation (11) represents a system of ordinary differential equations (ODEs) in the cell Q. If we combine the modal coefficient vectors in one vector $u \in \mathbb{R}^m$, we obtain a large system of ODEs

\[ u_t(t) = f(t, u(t)), \tag{12} \]

where f is a vector valued function corresponding to the right hand side in (11) for the whole grid. Generally, from now on a vector with an underbar will denote a vector from $\mathbb{R}^m$. We denote the time step size by Δt and $u^n$ is the numerical approximation to $u(t_n)$. Note that the explicit dependence of the right hand side in (12) on t is relevant only for time dependent boundary conditions or time dependent source terms.

4.1 ESDIRK4

We will restrict ourselves here to the explicit step singly diagonally implicit Runge-Kutta method of fourth order (ESDIRK4), designed in [22]. Given coefficients $a_{ij}$, $b_i$ and $c_i$, such a method with s stages can be written as

\[ U_i = u^n + \Delta t \sum_{j=1}^{i} a_{ij} f(t_n + c_j \Delta t_n, U_j), \quad i = 1, \dots, s, \tag{13} \]

\[ u^{n+1} = u^n + \Delta t \sum_{j=1}^{s} b_j f(t_n + c_j \Delta t_n, U_j). \tag{14} \]

Thus, all entries of the Butcher array in the strictly upper triangular part are zero. The coefficients can be obtained from Table 1.

c_i                a_i1                a_i2                 a_i3                 a_i4                 a_i5                 a_i6
0                  0                   0                    0                    0                    0                    0
2γ                 γ                   γ                    0                    0                    0                    0
83/250             0.137776            -0.055776            γ                    0                    0                    0
31/50              0.144636866026982   -0.223931907613345   0.449295041586363    γ                    0                    0
17/20              0.098258783283565   -0.591544242819670   0.810121053828300    0.283164405707806    γ                    0
1                  0.157916295161671   0                    0.186758940524001    0.680565295309335    -0.275240530995007   γ
b_i                0.157916295161671   0                    0.186758940524001    0.680565295309335    -0.275240530995007   γ
\hat{b}_i          0.154711800763212   0                    0.189205191660680    0.702045371228922    -0.319187399063579   0.273225035410765
b_i - \hat{b}_i    0.003204494398459   0                    -0.002446251136679   -0.021480075919587   0.043946868068572    -0.023225035410765

Table 1. Butcher diagram for ESDIRK4 with γ = 0.25.

This scheme is A-stable, also L-stable and stiffly accurate. The point about DIRK schemes is that the computation of the stage vectors corresponds to the sequential application of several implicit Euler steps. With the starting vectors

\[ s_i = u^n + \Delta t \sum_{j=1}^{i-1} a_{ij} f(t_n + c_j \Delta t_n, U_j), \tag{15} \]

we can solve for the stage values

\[ U_i = s_i + \Delta t\, a_{ii} f(t_n + c_i \Delta t_n, U_i). \tag{16} \]

Equation (16) corresponds to a step of the implicit Euler method with starting vector $s_i$ and time step $a_{ii} \Delta t$. Note that because the method is stiffly accurate, $u^{n+1} = U_s$ and thus we do not need to evaluate (14).
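The stage-by-stage structure of (15) and (16) can be sketched for the scalar linear test equation u' = λu, where each implicit stage can be solved in closed form. This is only an illustration under that assumption: in the flow solver each stage is a nonlinear system solved by a Newton-type method. The coefficients are those of Table 1; the function name `esdirk4_step_linear` is hypothetical.

```python
import numpy as np

GAMMA = 0.25
# Lower triangular Butcher matrix from Table 1 (row i holds a_i1 ... a_i6).
A = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [GAMMA, GAMMA, 0.0, 0.0, 0.0, 0.0],
    [0.137776, -0.055776, GAMMA, 0.0, 0.0, 0.0],
    [0.144636866026982, -0.223931907613345, 0.449295041586363,
     GAMMA, 0.0, 0.0],
    [0.098258783283565, -0.591544242819670, 0.810121053828300,
     0.283164405707806, GAMMA, 0.0],
    [0.157916295161671, 0.0, 0.186758940524001, 0.680565295309335,
     -0.275240530995007, GAMMA],
])
C = np.array([0.0, 2 * GAMMA, 83 / 250, 31 / 50, 17 / 20, 1.0])


def esdirk4_step_linear(lam: float, u_n: float, dt: float) -> float:
    """One ESDIRK4 step for the scalar test equation u' = lam * u."""
    stage_derivs = []
    U = u_n
    for i in range(6):
        # Starting vector s_i, Eq. (15): contributions of earlier stages.
        s_i = u_n + dt * sum(A[i, j] * stage_derivs[j] for j in range(i))
        # Stage equation (16), U_i = s_i + dt * a_ii * lam * U_i, solved
        # exactly; the first stage has a_11 = 0 and is therefore explicit.
        U = s_i / (1.0 - dt * A[i, i] * lam)
        stage_derivs.append(lam * U)
    # Stiffly accurate: the last stage value is the new solution.
    return U
```

A quick sanity check is that each row of `A` sums to the corresponding entry of `C`, and that a single step with λ = −1 and Δt = 0.1 stays close to exp(−0.1).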

The explicit first stage of these Runge-Kutta schemes allows a stage order of two, but also means that the methods cannot be algebraically stable. Furthermore, the explicit stage involving $f(t_n, u^n)$ can reuse the last stage derivative of the previous time step, since $f(t_{n+1}, u^{n+1})$ computed there is the same quantity. This avoids one evaluation of the right hand side.

4.2 Adaptive time step size selection

For unsteady flows, we need to make sure that the time integration error can be controlled. To do this, we estimate the time integration error and select the time step size accordingly. This is done using embedded schemes of a lower order p. For ESDIRK4, p = 3. Comparing the local truncation error of both schemes, we obtain the following estimate for the local error of the lower order scheme, where $\hat{b}_j$ are the weights of the embedded scheme from Table 1:

\[ l \approx \Delta t_n \sum_{j=1}^{s} (b_j - \hat{b}_j) f(t_n + c_j \Delta t_n, U_j). \tag{17} \]

To determine the new step size, we decide beforehand on a target error tolerance and use a common fixed resolution test [39]. This means that we define the error tolerance per component via

\[ d_i = RTOL\, |u_i^n| + ATOL, \tag{18} \]

where RTOL and ATOL are the relative and absolute tolerances. We always choose ATOL = RTOL =: TOL. Then we compare this to the local error estimate by requiring

\[ \| l ./ d \| \le 1, \]

where ./ denotes pointwise division and we use the 2-norm throughout the text. The next question is how the time step has to be chosen such that the error can be controlled. The classical method is the following, also called EPS (error per step) control [16]:

\[ \Delta t_{new} = \Delta t_n \cdot \| l ./ d \|^{-1/(p+1)}. \tag{19} \]

The negative exponent ensures that the step size shrinks when the error estimate exceeds the tolerance and grows otherwise. This is combined with safety factors to avoid volatile increases or decreases in time step size:

\[ \Delta t_{n+1} = \begin{cases} \Delta t_n \max\left(f_{min},\, f_{safety} \| l ./ d \|^{-1/(p+1)}\right), & \text{if } \| l ./ d \| \ge 1, \\ \Delta t_n \min\left(f_{max},\, f_{safety} \| l ./ d \|^{-1/(p+1)}\right), & \text{else.} \end{cases} \]

Here, we chose $f_{min} = 0.3$, $f_{max} = 2.0$ and $f_{safety} = 0.9$.
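The step size selection can be summarized in a few lines. This is a minimal sketch, not the solver's actual routine: the function name `new_time_step` and its signature are hypothetical, the exponent is taken as −1/(p+1) so that a large error estimate shrinks the step, and the guard for a vanishing error estimate is an added assumption.

```python
import numpy as np


def new_time_step(dt, local_err, u, tol, p_hat=3,
                  f_min=0.3, f_max=2.0, f_safety=0.9):
    """Return (dt_new, err) from the local error estimate l of Eq. (17).

    p_hat is the order of the embedded scheme (3 for ESDIRK4).
    """
    d = tol * np.abs(u) + tol              # Eq. (18) with RTOL = ATOL = TOL
    err = np.linalg.norm(local_err / d)    # ||l ./ d|| in the 2-norm
    if err == 0.0:
        # Assumption: with a vanishing estimate, grow by the maximal factor.
        return f_max * dt, err
    factor = f_safety * err ** (-1.0 / (p_hat + 1))
    if err >= 1.0:
        return dt * max(f_min, factor), err   # tolerance violated: shrink
    return dt * min(f_max, factor), err       # tolerance met: may grow
```

With the factors given above, the step can at most double and at most shrink to 30 percent of its previous size from one step to the next.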

5 Solver for the nonlinear equation systems

The application of the implicit time discretizations used herein leads to a globally coupled nonlinear system of equations of the form

\[ u^{\nu+1} = a_{\nu+1} \Delta t\, f(u^{\nu+1}) + R_{fix}(u^1, \dots, u^\nu), \tag{20} \]

where we assume that $u^1, \dots, u^\nu$ are given, $R_{fix}$ depends on the space and time discretization and $a_{\nu+1}$ is a constant depending on the time discretization method.

5.1 Inexact Newton method

To solve the appearing nonlinear systems, we use an inexact Newton method, which is locally convergent and can exhibit quadratic convergence. This method solves the root equation

\[ u - a_{\nu+1} \Delta t\, f(u) - R_{fix}(u^1, \dots, u^\nu) =: F(u) = 0. \]

As a termination criterion we use a residual based one similar to (18) with a relative tolerance, resulting in

\[ \| F(u^{(k)}) \| \le \epsilon = \epsilon_r \| F(u^{(0)}) \|. \]

If the iteration does not converge after a maximal number of iterations, the time step is repeated with half the time step size. In particular, we will use the inexact Newton method from [8], where the linear system in the k-th Newton step is solved only up to a relative tolerance, given by a forcing term $\eta_k$. This can be written as:

\[ \left\| \left. \frac{\partial F(u)}{\partial u} \right|_{u^{(k)}} \Delta u + F(u^{(k)}) \right\| \le \eta_k \| F(u^{(k)}) \|, \tag{21} \]

\[ u^{(k+1)} = u^{(k)} + \Delta u, \quad k = 0, 1, \dots \]

In [11], the choice of the sequence of forcing terms is discussed and it is proved that the inexact Newton iteration (21) converges linearly. Moreover,

• if $\eta_k \to 0$, the convergence is superlinear and

• if $\eta_k \le K_\eta \| F(u^{(k)}) \|^p$ for some $K_\eta > 0$, $p \in [0, 1]$, the convergence is superlinear with order 1 + p.

In particular, this means that for a properly chosen sequence of forcing terms, the convergence can be quadratic. A way of achieving this (as proved in [11]) is the following:

\[ \eta_k^A = \gamma \frac{\| F(u^{(k)}) \|^2}{\| F(u^{(k-1)}) \|^2} \]

with a parameter γ ∈ (0, 1]. The theorem says that convergence is quadratic if this sequence is bounded away from one uniformly. Therefore, we set $\eta_0 = \eta_{max}$ for some $\eta_{max} < 1$ and for k > 0:

\[ \eta_k = \min(\eta_{max}, \eta_k^A). \]

Eisenstat and Walker furthermore suggest safeguards to avoid volatile decreases in $\eta_k$. To this end, $\gamma \eta_{k-1}^2 > 0.1$ is used as a condition to determine if $\eta_{k-1}$ is rather large and thus the definition of $\eta_k$ is refined to

\[ \eta_k^B = \min(\eta_{max}, \max(\eta_k^A, \gamma \eta_{k-1}^2)). \]

Finally, to avoid oversolving in the final stages, they use

ηk = min(ηmax,max(ηBk , 0.5ε/‖F(uk)‖)),

where ε is the tolerance at which the nonlinear iteration would terminate.
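The complete forcing term selection for k > 0 can be collected in one function. The sketch below follows the formulas above; the function name `forcing_term` and the default values γ = 0.9 and η_max = 0.9 are assumptions made for illustration only, since the text only requires γ ∈ (0, 1] and η_max < 1.

```python
def forcing_term(res_norm, res_norm_prev, eta_prev, eps_tol,
                 gamma=0.9, eta_max=0.9):
    """Forcing term eta_k for k > 0 (for k = 0 use eta_0 = eta_max).

    res_norm      -- ||F(u^(k))||
    res_norm_prev -- ||F(u^(k-1))||
    eta_prev      -- eta_{k-1}
    eps_tol       -- tolerance at which the Newton iteration terminates
    """
    # Basic choice eta_k^A from the ratio of successive residual norms.
    eta_a = gamma * res_norm**2 / res_norm_prev**2
    # Safeguard: only if eta_{k-1} was rather large, limit the decrease.
    if gamma * eta_prev**2 > 0.1:
        eta_b = min(eta_max, max(eta_a, gamma * eta_prev**2))
    else:
        eta_b = min(eta_max, eta_a)
    # Avoid oversolving when the nonlinear residual is close to eps_tol.
    return min(eta_max, max(eta_b, 0.5 * eps_tol / res_norm))
```

Far from convergence the safeguard keeps η_k from dropping too quickly, while close to the Newton tolerance the last line relaxes the linear solve so that no effort is wasted on accuracy that the nonlinear iteration no longer needs.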

5.2 Linear Solver

In each iteration of the inexact Newton method, a linear equation system of the form Ax = b, $A \in \mathbb{R}^{m \times m}$, has to be solved with

\[ A = I - a_{\nu+1} \Delta t \left. \frac{\partial f}{\partial u} \right|_{u}. \]

In general, A consists of $n_E$ block rows, where $n_E$ is the number of elements in our computational domain. Each row i consists of a dense main diagonal block of size $(N_i \cdot n_{Var}) \times (N_i \cdot n_{Var})$, where $N_i$ is the number of degrees of freedom of the cell i given by Eq. (3) and $n_{Var}$ is the number of unknowns in the equations that are being solved. For the compressible Navier-Stokes equations, $n_{Var} = 4$ in the two-dimensional case and $n_{Var} = 5$ in the three-dimensional case. In addition to the main block, each Neumann neighbor j of i, i.e. a neighbor j sharing a common face with i, contributes a block of size $(N_i \cdot n_{Var}) \times (N_j \cdot n_{Var})$, which leads to a quickly rising number of matrix entries when using higher order DG methods. Figure 1a shows the memory requirements per block row for different element types and orders of approximation in 2D and 3D, where we assumed a uniform order in the computational domain, i.e. $N_j = N_i$. It becomes clear that while memory usually is not an issue for finite volume methods (equivalent to the first order case), it is a critical aspect for DG schemes, especially in the 3D case. In order to illustrate how restrictive this may become, Fig. 1b shows the maximum number of cells a computational domain may contain if only one gigabyte of memory is available for the storage of the Jacobian. It should also become clear now that while a 2D scheme is not a problem for today's computers, this is absolutely not the case for a 3D scheme, as the memory requirements are an order of magnitude more restrictive than in the 2D case.

Fig. 1. Memory considerations for the Jacobian. (a) Memory required for each element: memory per element [MB] versus order of approximation, for Tri, Quad, Tet and Hex elements. (b) Number of degrees of freedom that can be stored in the Jacobian per GB of memory: DOF per GB versus order of approximation.
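The block structure described above translates directly into a storage estimate per block row. The following sketch rests on several assumptions that are not stated in the text: double precision entries (8 bytes), one megabyte counted as 10^6 bytes, a uniform polynomial degree p, and an interior element that has a neighbor on every face. The function name `block_row_memory_mb` is hypothetical.

```python
from math import comb


def block_row_memory_mb(p: int, d: int, n_faces: int,
                        bytes_per_entry: int = 8) -> float:
    """Estimated storage in MB of one block row of the Jacobian.

    p is the polynomial degree, d the space dimension and n_faces the
    number of faces (and hence face neighbors) of an interior element.
    """
    n_var = d + 2                         # 4 unknowns in 2D, 5 in 3D
    n_dof = comb(p + d, d)                # N(p, d) from Eq. (3)
    block_entries = (n_dof * n_var) ** 2  # one dense block
    n_blocks = 1 + n_faces                # diagonal block plus neighbors
    return n_blocks * block_entries * bytes_per_entry / 1e6


if __name__ == "__main__":
    for p in range(1, 6):
        print(p, block_row_memory_mb(p, 3, 4), block_row_memory_mb(p, 3, 6))
```

Under these assumptions, a hexahedron with p = 3 needs seven blocks of size 100 × 100, i.e. about 0.56 MB per element, compared to 0.08 MB if only the diagonal block is kept.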

For real world problems, the number of unknowns is on the order of tens of millions, so direct solvers are infeasible, which leads us to iterative methods. In particular, Krylov subspace methods such as GMRES or BiCGSTAB have been shown to perform well in this context. In their basic version, these schemes need the Jacobian and in addition a preconditioning matrix to improve convergence speed, which is a huge problem in terms of storage. Furthermore, the use of approximate Jacobians to save storage and CPU time leads to a decrease of the convergence speed of the Newton method. Therefore, we will use here Jacobian-free Newton-Krylov methods [24], which do not need the Jacobian (but still a preconditioner). The idea is that in Krylov subspace methods, the Jacobian appears only in the form of matrix vector products $A v_i$, which can be approximated by a difference quotient

\[ A v_i \approx \frac{F(u + \epsilon v_i) - F(u)}{\epsilon} = v_i - a_{\nu+1} \Delta t\, \frac{f(u + \epsilon v_i) - f(u)}{\epsilon}. \tag{22} \]

The parameter ε is a scalar, where smaller values lead to a better approximation but may lead to cancellation errors in floating point arithmetic. A simple choice for the parameter that avoids cancellation but is still moderately small is given by Qin, Ludlow and Shaw [35] as

\[ \epsilon = \frac{\sqrt{eps}}{\| \Delta u \|_2}, \]

where eps is the machine accuracy. Second order convergence is obtained up to ε-accuracy if proper forcing terms are employed, since it is possible to view the errors coming from the finite difference approximation as arising from inexact solves.
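The Jacobian-free matrix vector product can be sketched as follows. This is an illustration, not the solver's implementation: the function name `jfnk_matvec` is hypothetical, and the vector in the denominator of the step size formula is interpreted here as the direction v to which the Jacobian is applied, which is an assumption about the notation.

```python
import numpy as np


def jfnk_matvec(F, u, v, F_u=None):
    """Approximate (dF/du)(u) @ v by the difference quotient of Eq. (22).

    F   -- callable returning the nonlinear residual F(u)
    F_u -- optional precomputed F(u), reused for all Krylov vectors
    """
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:
        return np.zeros_like(v)
    # Step size: square root of machine accuracy scaled by the direction.
    eps_fd = np.sqrt(np.finfo(float).eps) / norm_v
    if F_u is None:
        F_u = F(u)
    return (F(u + eps_fd * v) - F_u) / eps_fd
```

Since F(u) does not change within one Newton step, it can be computed once and passed as `F_u`, so that each matrix vector product costs a single additional evaluation of the right hand side.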

Of the Krylov subspace methods suitable for the solution of unsymmetric linear systems, the GMRES method of Saad and Schultz [38] was found by McHugh and Knoll [29] to perform better than others in the matrix-free context. The reason for this is that the vectors in matrix vector multiplications in GMRES are normalized, as opposed to those in other methods.

6 Preconditioning

It is well known that the speed of convergence of Krylov subspace methods depends strongly on the matrix. Therefore, right preconditioning is used to transform the linear equation system appropriately:

\[ A P^{-1} x^P = b, \quad x = P^{-1} x^P. \]

Here, P is an invertible matrix, called a right preconditioner, that approximates the system matrix in a cheap way. Every time a matrix vector product $A v_j$ appears in a Krylov subspace method, the right preconditioned method is obtained by applying the preconditioner to the vector in advance and then computing the matrix vector product with A. Right preconditioning does not change the initial residual, because

\[ r_0 = b_0 - A x_0 = b_0 - A P^{-1} x_0^P. \]

This also means that, in contrast to left preconditioning, right preconditioning does not interfere with the Eisenstat-Walker strategy, which is the main reason we use right preconditioning only. Once the termination criterion is fulfilled, the right preconditioner has to be applied one last time to change back from the preconditioned approximation to the unpreconditioned one.

Often, the preconditioner is not given directly, but implicitly via its inverse. Then its application corresponds to the solution of a linear equation system. If chosen well, the speedup of the Krylov subspace method is significant and therefore, the choice of the preconditioner is more important than the specific Krylov subspace method used.

For non-normal matrices as we have here, the existing theory is not sufficient to determine optimal preconditioners in any sense. Therefore, we have to resort to numerical experiments and heuristics. An overview of preconditioners with special emphasis on application in flow problems can be found in [30]. For the DG case, Persson and Peraire have conducted a survey of several preconditioning methods in [33]. Several methods are interesting in this context.

6.1 Jacobi, SGS and ROBO-SGS

An important class of preconditioners are splitting methods that are based ondecomposing A into a (block) diagonal part D, an upper diagonal part U anda lower diagonal part L in such a way that

A = L + D + U.

These blocks are then used to obtain simple approximations to A−1.

The most simple method here is block-Jacobi, where the off diagonal blocksare neglected, which leads to

P = D. (23)

A much more sophisticated method that is a very good preconditioner for com-pressible flow problems is the symmetric block Gauss-Seidel-method (SGS),which corresponds to solving the equation system

(D + L)D−1(D + U)x = xP . (24)

As mentioned before, the major issue here is that the Jacobian consists ofblocks with hundreds of unknowns for DG methods in 3D. In FV schemes,where the blocks have size 5 in 3D, the off diagonal blocks are sometimescomputed on the fly, whereas only the diagonal is stored. In the DG case,the high construction cost for the element Jacobians makes this infeasible.Therefore, using the full Jacobian A leads to significantly higher memoryrequirements compared to storing the diagonal D only, e.g. a factor of five fortetrahedra and a factor of seven for hexahedra.

Thus, SGS needs a huge amount of storage and leads to rather high application costs. On the other hand, the storage requirements of Jacobi are reasonable, as well as the application cost, but the resulting decrease in iteration numbers is much smaller than for SGS.


Based on this observation we propose a new class of SGS-like preconditioners that lies between SGS and Jacobi, where a varying amount of entries of the off-diagonal blocks L and U is neglected. In this way, we get a trade-off between memory and application cost on the one hand and efficiency of the preconditioner on the other hand. In particular, we make use of the fact that the modal DG scheme we employ has a hierarchical basis, meaning that the basis functions can be grouped by their degree:

$$u^Q = \sum_{j=0}^{p} \sum_{|\alpha| = j} u_\alpha \varphi^Q_\alpha.$$

Here, $\alpha$ is a multi-index, $\varphi^Q_\alpha$ is the unique hierarchical modal basis function corresponding to that multi-index and $u_\alpha$ the vector of coefficients of the solution in this decomposition.

A block of the Jacobian consists of subblocks, where each subblock consists of the derivatives of the values corresponding to one multi-index with respect to the coefficient vector $u_\alpha$ corresponding to one basis function and thus a possibly different multi-index. The idea of ROBO-SGS is now to reduce the inter-element coupling in the Jacobian by neglecting all derivatives with respect to the higher-order degrees of freedom of the neighboring cells. Thus, in the off-diagonal Jacobian blocks we neglect all derivatives with respect to degrees of freedom of polynomial degree greater than k, where k is user defined. We call this preconditioner ROBO-SGS-k for Reduced Off-diagonal Block Order, where k is the degree of the polynomial basis functions taken into account. Note that this idea requires the hierarchical (modal) basis and does not work with a purely nodal implementation.

A similar idea is often used for finite volume discretizations, where an approximate Jacobian is computed based on the first order discretization, neglecting the impact of the reconstruction [41]. However, this does not change the amount of storage needed, but only the computational complexity of the Jacobian construction.

For example, in the case k = 0 we take only the DOFs of the neighbors into account which correspond to the integral mean values of the conserved quantities. However, we keep not only the derivatives of the remaining degrees of freedom with respect to themselves, but to all degrees of freedom, resulting in a rectangular structure in the off-diagonal blocks of the preconditioner. This is illustrated in Fig. 2. For k = p, we keep everything, thus recovering the original block SGS preconditioner. If we formally set k = −1, we neglect all off-diagonal block entries, thus recovering block Jacobi.

The number of entries of the off-diagonal blocks in the preconditioner is then $(N_i \cdot \mathit{nVar}) \times (N_j \cdot \mathit{nVar})$, where $N_j$ depends on the user-defined parameter k.
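The rectangular structure of a reduced off-diagonal block can be sketched as a simple column truncation. The sketch below rests on two assumptions that are not stated explicitly in the text: the basis spans the full polynomial space of total degree p in three dimensions, so that the number of modes is (p+1)(p+2)(p+3)/6, and the degrees of freedom are ordered mode by mode with the modes sorted by ascending degree. The names `n_modes` and `reduce_off_block` are hypothetical.

```python
import numpy as np


def n_modes(degree):
    """Number of 3D basis functions of total degree <= degree
    (assumed full-order space; returns 0 for degree = -1)."""
    return (degree + 1) * (degree + 2) * (degree + 3) // 6


def reduce_off_block(block, k, n_var=5):
    """Keep all rows (DOFs of the element itself) but only the columns
    belonging to neighbor modes of degree <= k.
    Assumes hierarchical ordering: n_var unknowns per mode, low degrees first."""
    n_keep = n_modes(k) * n_var
    return block[:, :n_keep].copy()


# Example: p = 5 with 5 conserved variables gives full blocks of size 280 x 280.
full = np.ones((n_modes(5) * 5, n_modes(5) * 5))
reduced = reduce_off_block(full, k=1)   # shape (280, 20)
```

With k = −1 the reduced block has no columns at all, which corresponds to block Jacobi, while k = p returns the full block as in SGS.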

Fig. 2. Reduced versions of the off-diagonal blocks of the Jacobian: (a) k = 0 variant, (b) k = 1 variant.

While this results in a decreased accuracy of the preconditioner, the memory requirements and the computational cost of the application become smaller the fewer degrees of freedom of the neighboring cells we take into account. Note that even in the case k = 0, the mean values of the neighbors are taken into account, thus leading to reduced off-diagonal blocks that still have a physical meaning. Furthermore, the effect of this strategy becomes more pronounced the larger the order is, since the number of basis functions corresponding to a certain degree increases with the degree. To estimate the memory savings, we neglect boundary elements. For the Jacobi preconditioner, we get the overall amount of memory

$$\mathrm{Memory}(\mathrm{Jacobi}) = \mathit{nElems} \times \mathrm{Memory}(\mathrm{Fullblock}), \tag{25}$$

where the memory of the full block is given by

$$\mathrm{Memory}(\mathrm{Fullblock}) = (N_i \cdot \mathit{nVar}) \times (N_i \cdot \mathit{nVar}), \tag{26}$$

which scales like $\sim p^6/36$ with respect to the polynomial degree p. This means that even the simple block Jacobi preconditioner can become prohibitive with respect to memory requirements for a three-dimensional computation with large polynomial degree p. The full SGS preconditioner amplifies this even further, as we need to store the blocks for each neighbor. Thus, depending on the considered element type, we get the overall memory storage for the SGS preconditioner (assuming a large total number of elements compared to the boundary elements) as

$$\mathrm{Memory}(\mathrm{SGS}) = \mathit{nElems} \times \mathrm{Memory}(\mathrm{Fullblock}) \times (1 + \mathit{nSides}), \tag{27}$$

where nSides is the number of sides for the element type (e.g. nSides = 6 for hexahedra). The memory reduction of ROBO-SGS can be expressed when we introduce the amount of memory needed by the reduced blocks

$$\mathrm{Memory}(\mathrm{Reducedblock}) = (N_i \cdot \mathit{nVar}) \times (N_j \cdot \mathit{nVar}), \tag{28}$$

which scales like $\sim p^3 k^3/36$ for a given off block order k. The total amount of memory needed by the ROBO-SGS preconditioner is given by

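$$\begin{aligned}
\mathrm{Memory}(\mathrm{ROBOSGS}) ={}& \mathit{nElems} \times \mathrm{Memory}(\mathrm{Fullblock}) \\
&+ \mathit{nElems} \times \mathrm{Memory}(\mathrm{Reducedblock}) \times \mathit{nSides}.
\end{aligned} \tag{29}$$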

Thus, the ratio of the ROBO-SGS memory consumption in comparison to the memory needed by the Jacobi preconditioner is given by

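$$\frac{\mathrm{Memory}(\mathrm{ROBOSGS})}{\mathrm{Memory}(\mathrm{Jacobi})} = 1 + \frac{\mathit{nElems} \times \mathrm{Memory}(\mathrm{Reducedblock}) \times \mathit{nSides}}{\mathit{nElems} \times \mathrm{Memory}(\mathrm{Fullblock})} = 1 + \frac{\mathit{nSides}\, N_j}{N_i}. \tag{30}$$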

Consider the example of hexahedral elements (nSides = 6) and the polynomial degree p = 5, as used in the results section. For the off block orders k = −1, 0, 1, 2, 3, 4, 5, the memory ratio takes the values 1, 1.1, 1.4, 2.1, 3.1, 4.8 and 7, respectively, where k = −1 yields the Jacobi preconditioner and k = 5 the full SGS preconditioner.
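These ratios can be reproduced with a few lines of Python. The sketch assumes that the numbers of basis functions for degrees p and k are those of a full polynomial space of total degree p and k in three dimensions, i.e. (p+1)(p+2)(p+3)/6 and (k+1)(k+2)(k+3)/6. This assumption is consistent with the 56 degrees of freedom per element and conservative variable for p = 5 reported in the results section, and for large p it gives roughly p^3/6 modes per element, which matches the stated p^6/36 scaling of a full block. The function names are illustrative.

```python
def n_modes(degree):
    """Number of 3D basis functions of total degree <= degree
    (assumed full-order space; 0 for degree = -1)."""
    return (degree + 1) * (degree + 2) * (degree + 3) // 6


def memory_ratio(p, k, n_sides):
    """Memory(ROBO-SGS-k) / Memory(Jacobi) = 1 + nSides * N_j / N_i."""
    return 1.0 + n_sides * n_modes(k) / n_modes(p)


# Hexahedra (6 sides), polynomial degree 5, off block orders -1..5
ratios = [memory_ratio(5, k, 6) for k in range(-1, 6)]
for k, ratio in zip(range(-1, 6), ratios):
    print(f"k = {k:2d}: ratio = {ratio:.2f}")
```

For tetrahedra one would pass `n_sides=4`, which yields the factor of five for full SGS mentioned earlier.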

6.2 ILU preconditioning

Another important class of preconditioners are block incomplete LU (ILU) decompositions, where the blocks correspond to the units the Jacobian consists of. The computation of a complete LU decomposition is quite expensive and in general requires full storage even for sparse matrices. By prescribing a sparsity pattern, incomplete LU decompositions can be defined. The application of such a decomposition as a preconditioner then corresponds to solving the appropriate linear equation system by forward-backward substitution. The sparsity pattern can for example be influenced by the level of fill. This is, in short, a measure of how much fill-in beyond the original sparsity pattern is allowed in the ILU decomposition. Decompositions with higher levels of fill are very good black box preconditioners for flow problems [30]. However, they are not in line with the philosophy of matrix-free methods.

This leaves ILU(0), which has no additional level of fill beyond the sparsity pattern of the original matrix A. We use the ILU(0) preconditioner in the form proposed by Persson and Peraire [33] with the in-place factorization suggested by Diosady and Darmofal [9]. While this preconditioner usually performs better than the ones based on splitting, it has the drawback that it has to act on the full Jacobian matrix, which makes it less attractive in computational environments with limited memory.
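To illustrate what "no additional level of fill" means, the following sketch performs a textbook ILU(0) factorization on a small matrix with scalar entries. It is not the block variant of [33] or the in-place scheme of [9]; the function name `ilu0` and the list-of-lists storage are assumptions chosen to keep the example self-contained.

```python
def ilu0(a):
    """ILU(0) of a square matrix given as a list of lists (scalar entries).
    The nonzero pattern of `a` is the only place where entries of the
    factors may be nonzero. Returns one matrix holding the strict lower
    part of L (unit diagonal implied) and the upper part U."""
    n = len(a)
    lu = [[float(v) for v in row] for row in a]
    pattern = [[a[i][j] != 0 for j in range(n)] for i in range(n)]
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue                      # entry outside pattern: skip
            lu[i][k] /= lu[k][k]              # multiplier l_ik
            for j in range(k + 1, n):
                if pattern[i][j]:             # update only inside the pattern
                    lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


# A tridiagonal matrix produces no fill-in, so ILU(0) equals the exact LU here.
A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]
LU = ilu0(A)
```

For matrices whose exact factors would contain fill-in, the product of the two factors agrees with A only on the prescribed pattern, which is what makes the decomposition incomplete.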


6.3 Multilevel preconditioners

Another possibility is to use multilevel schemes as preconditioners. A number of approaches have been tried in the context of DG methods, e.g. multigrid methods and multi-p methods with different smoothers, as well as a variant by Persson and Peraire [33], where a two-level multi-p method is used with ILU(0) as a presmoother and Jacobi as a postsmoother. We will employ a variant where no postsmoothing is employed, as we have not experienced any convergence acceleration by the postsmoother, and name it ILU-CSC for ILU with coarse scale correction.

Since the computation of the residual requires a matrix-vector multiplication, the cost for applying a multilevel variant of one of the above preconditioners is approximately double the cost of a standard single level variant.

6.4 Parallelization

Regarding parallelization, we use the MPI paradigm. The physical domain is decomposed into several domains, each of which is assigned to a processor. The matrix-free approach allows us to use the parallelization scheme described in the PhD thesis of Lörcher [26] for the function evaluation in the matrix-vector multiplication, which is shown to scale very well. The use of GMRES, however, is a drawback here, since this requires k scalar products in the k-th iteration, which do not scale perfectly in parallel. However, the alternatives have other drawbacks, as discussed earlier. As for the preconditioner, with the exception of Jacobi, most schemes would actually require excessive communication at domain boundaries, because the off-block entries of the system matrix belonging to the Neumann neighbors would be located on different CPUs. In order to avoid adding this overhead to our scheme, we neglect these parts of the system matrix. As a consequence, as the number of cells per CPU decreases, all preconditioners used herein ultimately converge to the Jacobi scheme. This reduces the overall parallel efficiency not through communication overhead, as is the case for the scalar products, but through a degradation of the numerical scheme itself, which results in a higher iteration number for the solution process. However, as we will demonstrate in the results section, the schemes scale very well.
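The effect of neglecting the couplings across domain boundaries can be sketched with a toy partitioning. The data structures below (a neighbor dictionary and a list mapping elements to MPI ranks) and the function name `local_off_blocks` are hypothetical and serve only to illustrate why the preconditioners approach block Jacobi as the number of elements per rank shrinks.

```python
def local_off_blocks(neighbors, rank_of):
    """Return the off-diagonal couplings (i, j) kept in the preconditioner:
    only those where element i and its neighbor j live on the same rank."""
    return {(i, j)
            for i, nbrs in neighbors.items()
            for j in nbrs
            if rank_of[i] == rank_of[j]}


# Toy mesh: four elements in a row, 0 - 1 - 2 - 3
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

one_rank = local_off_blocks(neighbors, [0, 0, 0, 0])    # all 6 couplings kept
two_ranks = local_off_blocks(neighbors, [0, 0, 1, 1])   # couplings 1-2 dropped
four_ranks = local_off_blocks(neighbors, [0, 1, 2, 3])  # nothing left: block Jacobi
```

With one element per rank no off-diagonal block survives, so every splitting-based preconditioner in this setting reduces to block Jacobi.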

7 Numerical results

In this section, we examine the performance of the different preconditioners. To focus on the effect of the preconditioner, we 'freeze' for all tests the numerical discretization in space as well as in time: we choose the ESDIRK4 time integrator with adaptive time stepping using the tolerance $TOL = 10^{-2}$. We then compare the total number of GMRES iterations needed for a complete run of the solver as a measure of preconditioner accuracy and the total CPU time needed. While the latter is implementation dependent, it is necessary additional information, since a more powerful preconditioner can be much more costly and less efficient overall than a less powerful preconditioner.

All computations were carried out using the Fortran code HALO, developed at the Institute of Aerodynamics and Gas Dynamics, and all test runs were carried out in parallel using MPI with double precision arithmetic. The partitioning of the grid is achieved with a space-filling curve approach, which guarantees that the number of grid cells on a processor core is balanced and thus the memory requirements for each processor core are roughly the same. As we are interested in three-dimensional unsteady simulations, the focus of the numerical model is on the high performance computing aspect. This means that we are especially interested in the efficiency of the preconditioner for a large number of cores, i.e. for parallel simulations with low computational load on each processor core. Explicit in time DG discretizations are known for their excellent strong parallel scaling, e.g. [18], and it is a priori not clear if implicit time integrators can sustain this property.

As a first test case, we choose the flow past a circular cylinder with free stream Mach number M = 0.3 and Reynolds number Re = 1,000. The computational grid consists of 10,400 hexahedral cells, as illustrated in figure 3, with curved grid cells at the cylinder boundary. For the discontinuous Galerkin discretization we choose polynomials with degree five, resulting in 582,400 degrees of freedom per conservative variable or 2,912,000 unknowns in total.

To obtain an initial condition, we decided to use an explicit time integration scheme, where we suppose that the time integration errors are negligible due to the small stability driven time step. The test interval is 1 s and we choose the initial time step as $\Delta t_0 = 0.0118$. This time step size was chosen such that the resulting error estimate is between 0.9 and 1, leading to an accepted time step. The distribution of the velocity magnitude at initial time and at the end time of the test run are shown in figure 4.

The following computations are all performed on the CRAY XE6 cluster (Hermit) of the computing center HLRS. All preconditioners are tested for computations with 64, 128, 256 and 512 processor cores (threads), resulting in an average of 9100, 4550, 2275 and 1137 DOF per conservation variable on a processor core, which would be low even for an explicit in time discontinuous Galerkin discretization.
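The problem sizes quoted for the cylinder test case follow from simple counting, which the short sketch below retraces. It assumes, as the reported numbers suggest, a full polynomial space of total degree p with (p+1)(p+2)(p+3)/6 basis functions per element and five conservative variables; the helper name `dofs_per_variable` is illustrative.

```python
def dofs_per_variable(n_cells, p):
    """Degrees of freedom per conservative variable for a 3D modal basis
    of total degree p (assumed full-order polynomial space)."""
    return n_cells * (p + 1) * (p + 2) * (p + 3) // 6


n_dof = dofs_per_variable(10_400, 5)        # cylinder grid, degree five
n_unknowns = 5 * n_dof                      # five conservative variables
per_core = {cores: n_dof / cores for cores in (64, 128, 256, 512)}

for cores, load in per_core.items():
    print(f"{cores:4d} cores: {load:.1f} DOF per variable and core")
```

On 512 cores this corresponds to roughly 20 elements per core, which is why the couplings neglected at partition boundaries become noticeable in the iteration counts.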

Fig. 3. Grid for cylinder test case. The grid is extended in three dimensions with 8 regular grid cells.

Fig. 4. Initial (top) and final solution of cylinder problem (bottom). Distribution of the velocity magnitude.

Table 2 shows the result for the simulations with 64 cores. We list the number of GMRES iterations, the overall wallclock time and a comparison of the CPU times with respect to the standard block Jacobi preconditioner. We clearly see that the number of iterations decreases the better the preconditioner, with ROBO-SGS-5 and ILU(0)-CSC being the most powerful. However, the wallclock time gives a very different picture. Here, the most powerful preconditioners are among the slowest in the end, because the preconditioner matrix has a larger sparsity pattern and is thus more expensive to apply. The most efficient preconditioner is ROBO-SGS-1 in this case, with the CPU time about 11% faster compared to the Jacobi preconditioner.

| Preconditioner | Iter. | CPU [s] | Comparison to Jacobi [%] |
|---|---|---|---|
| No preconditioner | 8,797 | 2,194 | 36.0 |
| Jacobi | 3,712 | 1,613 | 0.0 |
| ROBO-SGS-0 | 3,338 | 1,538 | -4.6 |
| ROBO-SGS-1 | 2,824 | 1,429 | -11.4 |
| ROBO-SGS-2 | 2,656 | 1,485 | -7.9 |
| ROBO-SGS-3 | 2,641 | 1,679 | 4.1 |
| ROBO-SGS-4 | 2,645 | 1,989 | 23.3 |
| ROBO-SGS-5 | 2,640 | 2,427 | 50.5 |
| ILU(0) | 2,641 | 2,467 | 52.9 |
| ILU(0)-CSC | 2,640 | 2,994 | 85.6 |

Table 2. Number of iterations and wallclock time of the test computations on 64 cores of the CRAY XE6 cluster Hermit for all preconditioners with comparison of CPU time to the Jacobi preconditioner computation.

The following tables 3-5 show the results for 128, 256 and 512 processor cores (threads). The column labeled 'Scaling' shows the parallel scaling of the method with respect to the 64 processor cores computation. It is important to note that the 'no preconditioner' case gives us basically the scaling of an explicit time discretization, as we only use the spatial DG operator to approximate the matrix-vector product. The only difference to an explicit method is that the GMRES algorithm needs an all-to-all communication because of the vector norm in each iteration. The strong scaling of over 85% in this case demonstrates again how well suited discontinuous Galerkin discretizations are for parallel computations.
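The text does not spell out how the scaling and comparison columns are computed. The formulas in the sketch below are therefore an assumption, namely the usual strong scaling efficiency relative to the 64 core run and the relative difference in CPU time to block Jacobi; they do reproduce the tabulated entries, e.g. 91.7% scaling for the unpreconditioned run on 128 cores and -11.4% for ROBO-SGS-1 on 64 cores.

```python
def strong_scaling_efficiency(t_ref, cores_ref, t, cores):
    """Strong scaling efficiency in percent relative to a reference run:
    100 * (t_ref * cores_ref) / (t * cores)."""
    return 100.0 * (t_ref * cores_ref) / (t * cores)


def relative_cpu_time(t, t_jacobi):
    """Difference in CPU time relative to block Jacobi, in percent
    (negative values mean faster than Jacobi)."""
    return 100.0 * (t / t_jacobi - 1.0)


# No preconditioner: 2,194 s on 64 cores versus 1,196 s on 128 cores
eff_128 = strong_scaling_efficiency(2194, 64, 1196, 128)

# ROBO-SGS-1 versus Jacobi on 64 cores: 1,429 s versus 1,613 s
diff_robo1 = relative_cpu_time(1429, 1613)
```

Efficiencies above 100%, as listed for some of the more expensive preconditioners, are possible with this definition because the iteration counts and the work per iteration change with the partitioning.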

The results show that the scaling of the Jacobian-free implicit method is as good as the explicit method for this example. We even get better scaling results with preconditioner compared to the no preconditioner case. The reason for this is that the computational load on a processor increases due to the additional work of applying the preconditioner. Thus, the ratio of computation to communication increases, yielding a better parallel scaling for the preconditioned schemes. An additional sign for this is that the most expensive preconditioners (with the largest sparsity pattern) scale the best.

| Preconditioner | Iter. | CPU [s] | Scaling [%] | Comparison to Jacobi [%] |
|---|---|---|---|---|
| No preconditioner | 8,797 | 1,196 | 91.7 | 43.6 |
| Jacobi | 3,712 | 833 | 96.8 | 0.0 |
| ROBO-SGS-0 | 3,462 | 818 | 94.0 | -1.8 |
| ROBO-SGS-1 | 2,926 | 750 | 95.3 | -10.0 |
| ROBO-SGS-2 | 2,676 | 757 | 98.1 | -9.1 |
| ROBO-SGS-3 | 2,651 | 843 | 99.6 | 1.2 |
| ROBO-SGS-4 | 2,677 | 986 | 100.9 | 18.4 |
| ROBO-SGS-5 | 2,641 | 1,175 | 103.3 | 41.1 |
| ILU(0) | 2,641 | 1,196 | 103.1 | 43.6 |
| ILU(0)-CSC | 2,640 | 1,477 | 101.4 | 77.3 |

Table 3. Number of iterations and wallclock time of the test computations on 128 cores of the CRAY XE6 cluster Hermit for all preconditioners. Strong scaling results compared to the 64 cores computations and comparison of CPU time to Jacobi preconditioner computation.

| Preconditioner | Iter. | CPU [s] | Scaling [%] | Comparison to Jacobi [%] |
|---|---|---|---|---|
| No preconditioner | 8,797 | 600 | 91.4 | 38.9 |
| Jacobi | 3,712 | 432 | 93.3 | 0.0 |
| ROBO-SGS-0 | 3,516 | 428 | 89.8 | -0.9 |
| ROBO-SGS-1 | 3,031 | 395 | 90.4 | -8.6 |
| ROBO-SGS-2 | 2,785 | 393 | 94.5 | -9.0 |
| ROBO-SGS-3 | 2,685 | 422 | 99.5 | -2.3 |
| ROBO-SGS-4 | 2,722 | 486 | 102.3 | 12.5 |
| ROBO-SGS-5 | 2,649 | 568 | 106.8 | 31.5 |
| ILU(0) | 2,647 | 576 | 107.1 | 33.3 |
| ILU(0)-CSC | 2,640 | 728 | 102.8 | 68.5 |

Table 4. Number of iterations and wallclock time of the test computations on 256 cores of the CRAY XE6 cluster Hermit for all preconditioners. Strong scaling results compared to the 64 cores computations and comparison of CPU time to Jacobi preconditioner computation.

| Preconditioner | Iter. | CPU [s] | Scaling [%] | Comparison to Jacobi [%] |
|---|---|---|---|---|
| No preconditioner | 8,797 | 322 | 85.2 | 39.4 |
| Jacobi | 3,712 | 231 | 87.3 | 0.0 |
| ROBO-SGS-0 | 3,508 | 235 | 81.8 | 1.7 |
| ROBO-SGS-1 | 3,479 | 230 | 77.7 | -0.4 |
| ROBO-SGS-2 | 3,014 | 215 | 86.3 | -6.9 |
| ROBO-SGS-3 | 2,819 | 229 | 91.6 | -0.9 |
| ROBO-SGS-4 | 2,773 | 246 | 101.1 | 6.5 |
| ROBO-SGS-5 | 2,720 | 284 | 106.8 | 22.9 |
| ILU(0) | 2,713 | 287 | 107.4 | 24.2 |
| ILU(0)-CSC | 2,679 | 383 | 97.7 | 65.8 |

Table 5. Number of iterations and wallclock time of the test computations on 512 cores of the CRAY XE6 cluster Hermit for all preconditioners. Strong scaling results compared to the 64 cores computations and comparison of CPU time to Jacobi preconditioner computation.

The parallel implementation is such that we only consider the preconditioner for the local MPI domain, without communication of the preconditioner. By increasing the number of processors we decrease the load (number of grid cells) on the processor and thus the preconditioners all get more similar to block Jacobi. In the extreme case of only one element on a processor core, all preconditioners would be block Jacobi due to the parallelisation we use. Thus, we can observe that the number of iterations slightly increases when increasing the processor numbers for the more powerful preconditioning techniques. This has furthermore the effect that the difference to block Jacobi with respect to CPU time decreases, since Jacobi is not affected by the parallelisation due to its element local nature. We see that the most efficient preconditioners are the ROBO-SGS variants with low off block order. But using 512 processor cores (threads), the difference is only about 7% in favor of the best preconditioner, ROBO-SGS-2. It is clear that the more processors we use for the computation, the more efficient block Jacobi gets in comparison to the more sophisticated preconditioners.

This suggests that more powerful preconditioners are more effective compared to Jacobi when we have a large number of elements per process, as in this case their iteration number decreasing effect is more pronounced. To demonstrate this we consider a second test case, which we compute with only eight cores on a machine with four quad-core AMD Opteron 8378 processors. We consider for this the flow past a sphere at a Mach number of M = 0.3 and a Reynolds number of Re = 1,000. The unstructured grid is larger compared to the cylinder grid and consists of 21,128 hexahedral grid cells, and the polynomial degree is chosen equal to four, resulting in 739,480 DOF per conservative variable and a total of 3,697,400 unknowns. As before, we again use an explicit time integrator to generate a time error free initial flow field for our tests. We perform the computations with all preconditioner variants for a time interval of 30 seconds. The initial solution and the result at the end when using ESDIRK4 time integration with a tolerance of $TOL = 10^{-3}$ and an initial time step of $\Delta t = 0.0065$ are shown in figure 5, where we see isosurfaces of $\lambda_2 = -10^{-4}$, a common vortex identifier [19]. Note that there is no visual difference between the results for ESDIRK4 and those obtained using an explicit Runge-Kutta scheme.

Fig. 5. Isosurfaces of $\lambda_2 = -10^{-4}$ for initial (left) and final solution of sphere problem (right).

Table 6 shows the results of this simulation. Since only 8 processes are usedto compute this test case, we have an average of 2641 grid cells on one core.

The results show that the most efficient preconditioner, ROBO-SGS-1, needs about 27% less CPU time than the Jacobi preconditioner. Using this result only, obtained with a low number of processors, one could argue that the new intermediate class of preconditioners, here ROBO-SGS-1, is the most effective among all preconditioning techniques. Again, the more powerful preconditioners like ILU(0) and ILU(0)-CSC are not more computationally efficient.

If we compare this to the results with high processor numbers, where the maximum difference of the fastest preconditioner was only about 7%, we get the outcome that the standard Jacobi preconditioner is a viable candidate with good efficiency and scaling. This shows that for a meaningful comparison of preconditioner techniques for the simulation of three-dimensional unsteady compressible flows, test runs with a high number of processors, i.e. a low number of grid cells on a processor core (thread), are necessary to get the right picture for practical applications.


| Preconditioner | Iter. | CPU [s] | Comparison to Jacobi [%] |
|---|---|---|---|
| None | 218,754 | 590,968 | 557 |
| Jacobi | 23,369 | 90,004 | 0 |
| ROBO-SGS-0 | 18,170 | 77,316 | -14 |
| ROBO-SGS-1 | 14,639 | 66,051 | -27 |
| ROBO-SGS-2 | 13,446 | 76,450 | -14 |
| ROBO-SGS-3 | 13,077 | 87,115 | -3 |
| ROBO-SGS-4 | 13,061 | 100,112 | 11 |
| ILU(0) | 12,767 | 108,031 | 20 |
| ILU(0)-CSC | 11,609 | 127,529 | 42 |

Table 6. Number of iterations and wallclock time of the computation using 8 cores.

8 Conclusions

In this work, the efficiency of preconditioners for use in implicit time adaptive integration schemes in the context of modal DG methods for the three-dimensional unsteady compressible Navier-Stokes equations is compared when using a JFNK solver with a smart choice of tolerances for GMRES. A special emphasis of this work is on the performance for massively parallel computations. For this, we use a large number of processors to examine the scaling and the efficiency for a low local processor load (low number of elements). Besides the standard preconditioning techniques such as block Jacobi and ILU(0), we introduce a new class of preconditioners where we reduce the order of the off-diagonal blocks. This new class, called ROBO-SGS, includes block Jacobi and block SGS as special cases and thus allows us to investigate the behavior of the preconditioner depending on its sparsity pattern.

For all preconditioners tested, excellent strong parallel scaling is achieved even compared to the scaling of an explicit time discretization (comparable to the no preconditioner test computations). Thus, the first conclusion is that the Jacobian-free implicit time integration approach is very well suited for large scale parallel simulations of unsteady three-dimensional compressible Navier-Stokes equations.

An important factor for the overall efficiency of a preconditioner is the computational cost when applying the preconditioner matrix in the GMRES solver. The most powerful preconditioner, i.e. the preconditioner which needs the lowest number of GMRES iterations, is not necessarily the most efficient preconditioner. It seems that the best balance is offered by the new class of preconditioners, ROBO-SGS. For all test cases, there is a ROBO-SGS variant (typically with off block order 1 or 2) which gives the most efficient solver with only a marginal increase in memory usage compared to block Jacobi. However, due to the specific communication avoiding implementation, ROBO-SGS approaches block Jacobi with an increasing number of cores and thus, for a small number of elements per core, the difference to block Jacobi is small.

In summary, for large scale computations with a large number of processors for the simulation of the unsteady three-dimensional Navier-Stokes equations, block Jacobi is a very viable choice with an overall good performance and an advantage with respect to implementation effort and memory usage. For a small number of processors, ROBO-SGS with a low order for the off-diagonal blocks is a very good alternative, able to provide a noticeable speedup for almost no additional memory.

Acknowledgements

The research presented in this paper was supported in parts by the Deutsche Forschungsgemeinschaft (DFG) in the context of the Schwerpunktprogramm MetStroem, the cluster of excellence Simulation Technology (SimTech) and the SFB/TR TRR 30.

References

[1] D. Arnold. An Interior Penalty Finite Element Method with Discontinuous Elements. PhD thesis, The University of Chicago, 1979.

[2] F. Bassi, A. Ghidoni, and S. Rebay. Optimal Runge-Kutta smoothers for the p-multigrid discontinuous Galerkin solution of the 1D Euler equations. J. Comp. Phys., 11:4153–4175, 2011.

[3] F. Bassi, A. Ghidoni, S. Rebay, and P. Tesini. High-order accurate p-multigrid discontinuous Galerkin solution of the Euler equations. Int. J. Num. Meth. Fluids, 60:847–865, 2009.

[4] F. Bassi and S. Rebay. A high-order discontinuous Galerkin finite element method solution of the 2D Euler equations. J. Comput. Phys., 138:251–285, 1997.

[5] F. Bassi, S. Rebay, G. Mariotti, S. Pedinotti, and M. Savini. A high-order accurate discontinuous finite element method for inviscid and viscous turbomachinery flows. In R. Decuypere and G. Dibelius, editors, Proceedings of 2nd European Conference on Turbomachinery, Fluid and Thermodynamics, pages 99–108, Technologisch Instituut, Antwerpen, Belgium, 1997.


[6] P. Birken. Solving nonlinear systems inside implicit time integration schemes for unsteady viscous flows. In R. Ansorge, H. Bijl, A. Meister, and T. Sonar, editors, Recent Developments in the Numerics of Nonlinear Hyperbolic Conservation Laws, pages 57–71. Springer, 2013.

[7] B. Cockburn and C.-W. Shu. The local discontinuous Galerkin method for time-dependent convection diffusion systems. SIAM J. Numer. Anal., 35:2440–2463, 1998.

[8] R. Dembo, R. Eisenstat, and T. Steihaug. Inexact Newton methods. SIAM J. Numer. Anal., 19:400–408, 1982.

[9] L. T. Diosady and D. L. Darmofal. Preconditioning methods for discontinuous Galerkin solutions of the Navier-Stokes equations. J. Comp. Phys., 228:3917–3935, 2009.

[10] V. Dolejší and M. Feistauer. A semi-implicit discontinuous Galerkin finite element method for the numerical solution of inviscid compressible flow. J. Comp. Phys., 198:727–746, 2004.

[11] S. C. Eisenstat and H. F. Walker. Choosing the forcing terms in an inexact Newton method. SIAM J. Sci. Comp., 17:16–32, 1996.

[12] K. J. Fidkowski, T. A. Oliver, J. Lu, and D. L. Darmofal. p-Multigrid solution of high-order discontinuous Galerkin discretizations of the compressible Navier-Stokes equations. J. Comp. Phys., 207:92–113, 2005.

[13] G. Gassner, F. Lörcher, and C.-D. Munz. A contribution to the construction of diffusion fluxes for finite volume and discontinuous Galerkin schemes. J. Comput. Phys., 224:1049–1063, 2007.

[14] G. Gassner, F. Lörcher, and C.-D. Munz. A Discontinuous Galerkin Scheme based on a Space-Time Expansion II. Viscous Flow Equations in Multi Dimensions. J. Sci. Comput., 34:260–286, 2008.

[15] G. J. Gassner, F. Lörcher, C.-D. Munz, and J. S. Hesthaven. Polymorphic nodal elements and their application in discontinuous Galerkin methods. Journal of Computational Physics, 228(5):1573–1590, 2009.

[16] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II. Springer, Berlin, Series in Computational Mathematics 14, 3. edition, 2004.

[17] J. S. Hesthaven and T. Warburton. Nodal Discontinuous Galerkin Methods: Algorithms, Analysis, and Applications. Springer, 2008.

[18] Florian Hindenlang, Gregor J. Gassner, Christoph Altmann, Andrea Beck, Marc Staudenmaier, and Claus-Dieter Munz. Explicit discontinuous Galerkin methods for unsteady problems. Computers and Fluids, 61(0):86–93, 2012.

[19] J. Jeong and F. Hussain. On the identification of a vortex. J. Fluid Mech., 285:69–94, 1995.


[20] G. Jothiprasad, D. J. Mavriplis, and D. A. Caughey. Higher-order time integration schemes for the unsteady Navier-Stokes equations on unstructured meshes. J. Comp. Phys., 191:542–566, 2003.

[21] A. Kanevsky, M. H. Carpenter, D. Gottlieb, and J. S. Hesthaven. Application of implicit-explicit high order Runge-Kutta methods to discontinuous-Galerkin schemes. J. Comp. Phys., 225:1753–1781, 2007.

[22] C. A. Kennedy and M. H. Carpenter. Additive Runge-Kutta schemes for convection-diffusion-reaction equations. Appl. Num. Math., 44:139–181, 2003.

[23] C. M. Klaij, M. H. van Raalte, J. J. W. van der Vegt, and H. van der Ven. h-Multigrid for space-time discontinuous Galerkin discretizations of the compressible Navier-Stokes equations. J. Comp. Phys., 227:1024–1045, 2007.

[24] D. A. Knoll and D. E. Keyes. Jacobian-free Newton-Krylov methods: a survey of approaches and applications. J. Comp. Phys., 193:357–397, 2004.

[25] D. A. Kopriva, S. L. Woodruff, and M. Y. Hussaini. Computation of electromagnetic scattering with a non-conforming discontinuous spectral element method. Int. J. Num. Meth. in Eng., 53:105–122, 2002.

[26] F. Lörcher. Predictor-Corrector DG schemes for the numerical solution of the compressible Navier-Stokes equations in complex domains. Dissertation, Universität Stuttgart, 2009.

[27] F. Lörcher, G. Gassner, and C.-D. Munz. An explicit discontinuous Galerkin scheme with local time-stepping for general unsteady diffusion equations. J. Comput. Phys., 227(11):5649–5670, 2008.

[28] G. May, F. Iacono, and A. Jameson. A hybrid multilevel method for high-order discretization of the Euler equations on unstructured meshes. J. Comp. Phys., 229:3938–3956, 2010.

[29] P. R. McHugh and D. A. Knoll. Comparison of standard and matrix-free implementations of several Newton-Krylov solvers. AIAA J., 32(12):2394–2400, 1994.

[30] A. Meister and C. Vömel. Efficient Preconditioning of Linear Systems arising from the Discretization of Hyperbolic Conservation Laws. Advances in Computational Mathematics, 14(1):49–73, 2001.

[31] C. R. Nastase and D. J. Mavriplis. High-order discontinuous Galerkin methods using an hp-multigrid approach. J. Comp. Phys., 213:330–357, 2006.

[32] J. Peraire and P.-O. Persson. The Compact Discontinuous Galerkin (CDG) Method for Elliptic Problems. SIAM J. Sci. Comput., 30(4):1806–1824, 2008.

[33] P.-O. Persson and J. Peraire. Newton-GMRES Preconditioning for Discontinuous Galerkin discretizations of the Navier-Stokes equations. SIAM J. Sci. Comp., 30:2709–2733, 2008.


[34] S. Premasuthan, C. Liang, A. Jameson, and Z. J. Wang. A p-Multigrid Spectral Difference method for viscous compressible flow using 2D quadrilateral meshes. AIAA 2009-950, 2009.

[35] N. Qin, D. K. Ludlow, and S. T. Shaw. A matrix-free preconditioned Newton/GMRES method for unsteady Navier-Stokes solutions. Int. J. Num. Meth. Fluids, 33:223–248, 2000.

[36] P. Rasetarinera and M. Y. Hussaini. An Efficient Implicit Discontinuous Spectral Galerkin Method. J. Comp. Phys., 172:718–738, 2001.

[37] F. Renac, C. Marmignon, and F. Coquel. Time implicit high-order discontinuous Galerkin method with reduced evaluation cost. SIAM J. Sci. Comput., 34(1):A370–A394, 2012.

[38] Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7:856–869, 1986.

[39] G. Söderlind and L. Wang. Evaluating numerical ODE/DAE methods, algorithms and software. J. Comp. Appl. Math., 185:244–260, 2006.

[40] E. F. Toro. Riemann Solvers and Numerical Methods for Fluid Dynamics. Springer, 1999.

[41] V. Venkatakrishnan and D. J. Mavriplis. Implicit Solvers for Unstructured Meshes. J. Comput. Phys., 105(1):83–91, 1993.

[42] P. E. Vincent and A. Jameson. Facilitating the Adoption of Unstructured High-Order Methods Amongst a Wider Community of Fluid Dynamicists. Math. Model. Nat. Phenom., 6(3):97–140, 2011.
