
On Parallel Performance of an Implicit Discontinuous Galerkin Compressible Flow Solver Based on Different Linear Solvers

Amjad Ali, Khalid S. Syed, Ahmad Hassan, Idrees Ahmad
Centre for Advanced Studies in Pure and Applied Mathematics, Bahauddin Zakariya University (BZU), Multan 60800, Pakistan. Email: [email protected]

Abstract- Parallelization of an implicit discontinuous Galerkin method based on a Taylor series basis for compressible flows on unstructured meshes is developed for distributed memory architectures, specifically for cost-effective compute clusters. The system of linear equations arising from the implicit time integration is solved using three choices of linear solvers: SGS(k) (Symmetric Gauss-Seidel with k iterations), LU-SGS (Lower-Upper Symmetric Gauss-Seidel), and a well-known Krylov subspace iterative solver, GMRES (Generalized Minimum Residual), preconditioned with LU-SGS. The comparative study of the parallel performance of the flow solver based on the different linear solvers is carried out on a number of parallel platforms, ranging from compute clusters to multicore machines. The parallelization is based on computational domain partitioning using the well-known mesh partitioning software METIS and the SPMD (Single Program Multiple Data) message-passing programming paradigm using the MPI (Message Passing Interface) library, which is a de facto industry standard for portable programming.

Keywords- Discontinuous Galerkin Method; Parallelization; Cluster; Multicore; Linear Solvers

I. INTRODUCTION

The present work can be considered as a continuation of our parallelization studies [1,2]. In [1], the parallelization of the Discontinuous Galerkin (DG) method with Taylor basis [3] is presented with an explicit time integration scheme for the Euler equations of compressible fluid flows, while in [2], the parallelization of the same method with an implicit time integration scheme is presented, where the LU-SGS method is used as the linear solver. The present work extends these studies [1,2] by considering the parallelization of the same method with the implicit time integration scheme, where a few different linear solvers are considered as alternate choices for solving the system of algebraic equations obtained at each time step of the implicit numerical scheme. In another study, we have also experimented with the parallel performance of the said DG method based on different numerical fluxes [4]. The problem equations, the DG method, the implicit scheme, and the discussion and methodology of parallelization are the same as presented in [2].


Muhammad Ali Ismail
Department of Computer and Information System Engineering, NED University, Karachi, Pakistan. Email: [email protected]

II. LINEAR SOLVERS

In this study, the linear system is solved using three choices of linear solvers: SGS(k) (Symmetric Gauss-Seidel with k iterations), LU-SGS (Lower-Upper Symmetric Gauss-Seidel), and a well known Krylov subspace iterative solver GMRES (Generalized Minimum Residual) preconditioned with LUSGS.

A. SGS(k) method

The matrix A is considered to be composed of three matrices L, D and U, which are the strict lower, diagonal and strict upper parts, respectively. The SGS(k) method can be written as [5]:

(1) Initialization: $\Delta U^{0} = 0$

(2) Forward Gauss-Seidel iteration: $(D+L)\,\Delta U^{k+1/2} + U\,\Delta U^{k} = R$

(3) Backward Gauss-Seidel iteration: $(D+U)\,\Delta U^{k+1} + L\,\Delta U^{k+1/2} = R$

where k is the number of iterations of the SGS method, $\Delta U$ is the solution vector, and R is the right-hand-side residual [2]. The main advantage of this method is that it does not require any additional storage beyond that of the matrix itself. In this work, this method is used with k = 10 and k = 20.
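To make the sweep structure concrete, the following is a minimal dense-matrix sketch of SGS(k) in Python/NumPy. It assumes the splitting A = L + D + U defined above and is for illustration only; the actual flow solver operates on the block-sparse Jacobian of the DG discretization rather than a dense matrix.

```python
import numpy as np

def sgs_k(A, R, k):
    """Approximately solve A @ dU = R with k symmetric Gauss-Seidel sweeps.

    Illustrative dense version of the SGS(k) scheme described in the text:
    A is split as A = L + D + U (strict lower, diagonal, strict upper).
    """
    L = np.tril(A, -1)           # strict lower part
    U = np.triu(A, 1)            # strict upper part
    D = np.diag(np.diag(A))      # diagonal part

    dU = np.zeros_like(R)        # initialization: dU^0 = 0
    for _ in range(k):
        # forward sweep:  (D + L) dU^{k+1/2} = R - U dU^k
        dU_half = np.linalg.solve(D + L, R - U @ dU)
        # backward sweep: (D + U) dU^{k+1}  = R - L dU^{k+1/2}
        dU = np.linalg.solve(D + U, R - L @ dU_half)
    return dU
```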

B. LU-SGS method

If only one iteration is used in the SGS method and the initial guess is set to zero, the resulting method becomes the so-called LU-SGS method [5]. The LU-SGS method can be written as:

(1) Initialization: $\Delta U^{0} = 0$

(2) Lower (forward) sweep: $(D+L)\,\Delta U^{*} = R$

(3) Upper (backward) sweep: $(D+U)\,\Delta U = D\,\Delta U^{*}$, where $\Delta U^{*} = D^{-1}(D+U)\,\Delta U$.
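Under the same splitting, a minimal sketch of the single-pass LU-SGS solve is given below; again this is a dense NumPy illustration of the two sweeps, not the solver's block-sparse implementation.

```python
import numpy as np

def lu_sgs(A, R):
    """One LU-SGS pass for A @ dU = R: SGS with a single sweep and a zero
    initial guess, as described in the text (dense illustration only)."""
    L = np.tril(A, -1)
    U = np.triu(A, 1)
    D = np.diag(np.diag(A))

    # lower (forward) sweep:  (D + L) dU* = R
    dU_star = np.linalg.solve(D + L, R)
    # upper (backward) sweep: (D + U) dU = D dU*
    return np.linalg.solve(D + U, D @ dU_star)
```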

C. GMRES method

One of the most popular Krylov subspace methods is the generalized minimum residual (GMRES) method, which was presented in [6]. GMRES is suitable for solving a linear system whose coefficient matrix is not symmetric and/or positive definite. GMRES minimizes the norm of the computed residual vector over the subspace spanned by a certain number of orthogonal search directions. The main deficiency of this method is that it requires a large amount of memory to store the Jacobian matrix.
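As a rough illustration of how LU-SGS can act as a preconditioner for GMRES, the sketch below wraps the lu_sgs() helper from the previous sketch as a preconditioning operator for SciPy's GMRES. This is only an assumed, dense stand-in for the solver's own Krylov implementation on the block-sparse Jacobian; the restart length shown is arbitrary.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def gmres_lusgs(A, R):
    """GMRES on A @ dU = R, preconditioned by one LU-SGS pass (sketch only)."""
    n = R.shape[0]
    # Applying the preconditioner M means performing one LU-SGS pass,
    # which approximates the action of A^{-1}.
    M = LinearOperator((n, n), matvec=lambda v: lu_sgs(A, v))
    dU, info = gmres(A, R, M=M, restart=30)
    if info != 0:
        print("GMRES did not fully converge, info =", info)
    return dU
```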

III. NUMERICAL AND PERFORMANCE COMPUTATIONS

All the computations in this work are carried out on Linux machines with 64-bit Intel compilers and the MPICH2 library [7], which is an open-source implementation of the MPI standard. All floating-point operations are performed with double-precision accuracy. The following systems (each with an ample amount of DDR2 RAM at 800 MHz or higher) have been used for the parallel executions:

SYSTEM-1: It is a 12-node Gigabit (1000BASE-T) Ethernet based cluster, with each node having two Xeon-5140 (4 MB L2 cache, 2.33 GHz clock speed, 1333 MHz FSB) dual-core processors. Thus it has 48 CPU cores and is able to run up to 48 processes in parallel. For the testing purposes, the cluster is used with not more than one process mapped on each node. This helps in reducing the memory contention that commonly arises when more than one process within a node simultaneously accesses the memory. Moreover, this also helps in reducing the network interface contention that arises when the network bandwidth available to one node is shared among many processes on that node.

SYSTEM-2: It is an 8-core PC having two Xeon-5335 (2x4 MB L2 cache, 2.00 GHz clock speed, 1333 MHz FSB) quad-core processors.


SYSTEM-3: It is an 8-core PC having two Xeon-5405 (2x6 MB L2 cache, 2.00 GHz clock speed, 1333 MHz FSB) quad-core processors.

SYSTEM-4: It is an 8-core PC having two Xeon-5520 (8 MB L2 cache, 2.26 GHz clock speed, 5.86 GT/s QPI) quad-core processors. The QuickPath Interconnect (QPI) technology, introduced by Intel to replace the legacy Front Side Bus (FSB) technology, helps in removing certain performance bottlenecks, including the starvation of memory bandwidth quite commonly experienced on multicore processors. Because of its efficient memory subsystem based on QPI and an integrated memory controller, this processor is observed to be faster than comparable FSB-based processors from Intel for codes, like most CFD codes, whose performance is often bounded by the memory bandwidth.

The discontinuous Galerkin method using a Taylor series basis is capable of accurately and efficiently computing numerical solutions of the compressible inviscid flow equations on arbitrary meshes. The problem considered in this study is inviscid transonic flow past a NACA0012 airfoil at a Mach number of 0.8 and an angle of attack of 1.25°. Two triangular meshes of different granularity, one containing around 2000 elements and the other containing 22000 elements, are considered. The coarser mesh, decomposed into 12 parts using METIS, is shown in Figure 1. The convergence histories for the two meshes, using the choices LU-SGS, SGS(10), SGS(20), and GMRES preconditioned with LU-SGS, on 1 and 12 processes on SYSTEM-1 (the 12-node cluster) are shown in Figure 2. The pressure and Mach number contour plots obtained using GMRES with 12 processors are shown in Figure 3.


Figure 1. The triangular mesh used for obtaining the solution, partitioned for 12 processes (the central region is enlarged on the right).


[Figure 2 plots: residual norm versus time steps for the different linear solvers with 1 and 12 processes, on the coarser (left) and finer (right) meshes.]

Figure 2. With the coarser mesh (left) and the finer mesh (right), the convergence histories using the DGM with the different linear solvers are very similar in each case. Their similarity for 1 and 12 processes shows that convergence is not compromised in the parallel solution.


Figure 3. Pressure contours (left) and Mach number contours (right) obtained using the parallel DGM+GMRES on 12 processors with P0 approximation (first-order accurate).

A. Performance comparison among different linear solver cases

The scalability and respective time taken by each of the four cases of our parallel linear solvers on the parallel computer SYSTEM-1 (12-node cluster having Xeon-5140 processors, each with 4 MB L2 cache, on each node) are shown in Figure 4. Twenty thousand time steps are computed in each case, and a similar convergence behavior is observed with any number of processes by any linear solver case on a fixed mesh, as already shown in Figure 2. The following order of time efficiency is observed (better to worse), see Figure 4 (RIGHT):

1) LUSGS 2) GMRES 3) SGS(10) 4) SGS(20).


On the other hand, the following order of scalability behavior is observed (better to worse), see Figure 4 (LEFT):

1) SGS(20) 2) SGS(10)

3) GMRES 4) LUSGS.

Although the scalability for the cases of LU-SGS and GMRES seems worse than that of SGS(10) and SGS(20), the overall time taken with any number of processes is far better for the LU-SGS and GMRES cases. Very similar observations hold on SYSTEM-2 to SYSTEM-4, shown in Figures 5-7, respectively. The super-linear speedup reveals cache inefficiencies or memory bottlenecks in the serial (one-process) case that might be addressed through code profiling and tuning.
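For clarity, the speedup and efficiency values behind these scalability comparisons follow the standard definitions S = T1/Tp and E = S/p. The short sketch below shows the computation, using the measured one-process LU-SGS time from Figure 4 together with a purely hypothetical 12-process time for illustration.

```python
def speedup_and_efficiency(t_serial, t_parallel, n_procs):
    """Standard parallel metrics: speedup S = T1/Tp, efficiency E = S/p."""
    s = t_serial / t_parallel
    return s, s / n_procs

# Example: LU-SGS on SYSTEM-1 took 901 s on 1 process (Figure 4). If a
# 12-process run took, say, 70 s (hypothetical value), then:
s, e = speedup_and_efficiency(901.0, 70.0, 12)
print(f"speedup = {s:.1f}, efficiency = {e:.2f}")  # efficiency > 1 indicates super-linear speedup
```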

B. Performance comparison among different parallel systems for each solver case

The scalability and respective time taken by DGM+LUSGS on the four available parallel computers, SYSTEM-1 to 4, are shown in Figure 8. The following order of time efficiency is observed (better to worse), see Figure 8 (RIGHT):

1) SYSTEM-4 (8 cores of two 5520 processors)

2) SYSTEM-3 (8 cores of two 5405 processors)

3) SYSTEM-1 (8 nodes with 5140 processors)

4) SYSTEM-2 (8 cores of two 5335 processors).

On the other hand, the following order of scalability behavior is observed (better to worse), see Figure 8 (LEFT):

1) SYSTEM-1 (8 nodes with 5140 processors) 2) SYSTEM-2 (8 cores of two 5335 processors)


3) SYSTEM-3 (8 cores of two 5405 processors) 4) SYSTEM-4 (8 cores of two 5520 processors).

The scalability for the case of 8 nodes of the cluster (SYSTEM-1) is better than those of the multicore systems, as was anticipated. Although the scalability for the case of SYSTEM-4 (8 cores of two 5520 processors) seems worse than those of SYSTEM-2 and SYSTEM-3, the overall time taken with any number of processes is far better for SYSTEM-4. The extra scalability on SYSTEM-2 and SYSTEM-3 is due to the fact that a certain bottleneck, which existed in the single-process case on these systems in contrast to SYSTEM-1, is removed when more cores (each with a separate L1 cache and pair-wise grouped in a separate L2 cache) are used. Therefore extra time reductions are seen with a larger number of cores. Very similar observations hold for the cases with SGS(10), SGS(20) and GMRES in Figures 9-11, respectively.


Figure 4. On SYSTEM-1 (12-node cluster), scaling of speedup (LEFT) and respective time taken (RIGHT) with respect to the number of processes, for each of the linear solvers. (On 1 processor, LU-SGS took 901, SGS(10) took 3210, SGS(20) took 5800 and GMRES took 1430 seconds.)


Figure 5. On SYSTEM-2 (an 8-core PC with 2 Xeon-5335 processors), scaling of speedup (LEFT) and respective time taken (RIGHT) with respect to the number of processes, for each of the linear solvers. (On 1 processor, LU-SGS took 1004, SGS(10) took 3439, SGS(20) took 6251 and GMRES took 1597 seconds.)


Figure 6. On SYSTEM-3 (an 8-core PC with 2 Xeon-5405 processors), scaling of speedup (LEFT) and respective time taken (RIGHT) with respect to the number of processes, for each of the linear solvers. (On 1 processor, LU-SGS took 676, SGS(10) took 2770, SGS(20) took 4662 and GMRES took 1090 seconds.)

Figure 7. On SYSTEM-4 (an 8-core PC with 2 Xeon-5520 processors), scaling of speedup (LEFT) and respective time taken (RIGHT) with respect to the number of processes, for each of the linear solvers. (On 1 processor, LU-SGS took 522, SGS(10) took 1782, SGS(20) took 3087 and GMRES took 816 seconds.)


Figure 8. Scaling of speedup (LEFT) and respective time taken (RIGHT) with respect to the number of processes, for the DGM+LUSGS method on the 4 systems. (With one process, the method took 901, 1004, 676, and 522 seconds on SYSTEM-1 to 4, respectively.)


Figure 9. Scaling of speedup (LEFT) and respective time taken (RIGHT) with respect to the number of processes, for the DGM+SGS(10) method on the 4 systems. (With one process, the method took 3210, 3439, 2770, and 1782 seconds on SYSTEM-1 to 4, respectively.)



Figure 10. Scaling of speedup (LEFT) and respective time taken (RIGHT) with respect to the number of processes, for the DGM+SGS(20) method on the 4 systems. (With one process, the method took 5800, 6251, 4662, and 3087 seconds on SYSTEM-1 to 4, respectively.)


Figure 11. Scaling of speedup (LEFT) and respective time taken (RIGHT) with respect to the number of processes, for the DGM+GMRES method on the 4 systems. (With one process, the method took 1430, 1597, 1090, and 816 seconds on SYSTEM-1 to 4, respectively.)

IV. CONCLUSION

An implicit parallel discontinuous Galerkin method is tested with the LU-SGS, SGS(10), SGS(20) and GMRES linear solvers. Similar convergence behavior is observed with these solvers for any number of processes on the considered test problem of transonic inviscid flow past a NACA0012 airfoil. Regarding time efficiency, the observed order is (better to worse): LU-SGS, GMRES, SGS(10) and SGS(20). For scalability this order is reversed. These observations hold for most of the cases on the available parallel systems. The super-linear speedup observed in most cases reveals memory bottlenecks in the serial executions of the respective codes that should be addressed through profiling and tuning.

The cluster system is verified to be more scalable than single multicore machines. Multicore machines with larger cache sizes and/or faster memory subsystems are verified to be significantly more time efficient for the CFD code, whose performance is bounded by the memory bandwidth.

ACKNOWLEDGMENT

The first author would like to thank the Higher Education Commission, Pakistan, for sponsoring the short-term visit to NC State University, USA. The first author also expresses gratitude to Prof. Dr. Hong Luo (NCSU) for the collaboration and for providing the opportunity of using some of the computer systems available at NCSU to complete the present study.


REFERENCES

[1] Amjad Ali, Hong Luo, Khalid S. Syed, and Muhammad Ishaq. "A parallel discontinuous Galerkin code for compressible fluid flows on unstructured grids". Journal of Engineering and Applied Sciences, 29(1), 2010.

[2] Amjad Ali, Hong Luo, Khalid S. Syed, Muhammad Ishaq, and Ahmad Hassan. "A communication-efficient, distributed memory parallel code using discontinuous Galerkin method for compressible flows". IEEE International Conference on Emerging Technologies 2010 (ICET 2010), Islamabad, Pakistan, October 18-19, 2010.

[3] Hong Luo, Joseph D. Baum, and Rainald Lohner. "A discontinuous Galerkin method based on a Taylor basis for compressible flows on arbitrary grids". Journal of Computational Physics, 227(20): 8875-8893, 2008.

[4] Amjad Ali, Hong Luo, Khalid S. Syed, Muhammad Ishaq, and Ahmad Hassan. "On parallel performance of a discontinuous Galerkin compressible flow solver based on different numerical fluxes". 49th AIAA Aerospace Sciences Meeting, Florida, USA, 4-7 Jan. 2011.

[5] Hong Luo, Joseph D. Baum, and Rainald Lohner. "Fast p-multigrid discontinuous Galerkin method for compressible flows at all speeds". AIAA Journal, 46(3), 2008.

[6] Yousef Saad, and Martin H. Schultz. "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems". SIAM Journal on Scientific and Statistical Computing, 7(3): 856-869, 1986.

[7] MPICH project. http://www.mcs.anl.gov/research/projects/mpil (Jan. 10, 2010).