

Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing system with millions of cores

WANG Yong-Xian a,*, ZHANG Li-Lun a, LIU Wei a, CHENG Xing-Hua a, Zhuang Yu b, Anthony T. Chronopoulos c,d

a National University of Defense Technology, Changsha, Hu’nan 410073, China
b Texas Tech University, Lubbock, TX 79409, USA
c University of Texas at San Antonio, San Antonio, TX 78249, USA
d Visiting Faculty, Dept. Computer Engineering & Informatics, University of Patras, 26500 Rio, Greece

Abstract

For computational fluid dynamics (CFD) applications with a large number of grid points/cells, parallel computing is a common and efficient strategy to reduce the computational time. How to achieve the best performance on modern supercomputer systems, especially those with heterogeneous computing resources such as hybrid CPU+GPU or CPU + Intel Xeon Phi (MIC) co-processors, is still a great challenge. An in-house parallel CFD code capable of simulating three-dimensional structured grid applications is developed and tested in this study. Several methods of parallelization, performance optimization and code tuning, both on the CPU-only homogeneous system and on the heterogeneous system, are proposed. They are based on identifying the potential parallelism of applications, balancing the workload among all kinds of computing devices, tuning the multi-threaded code toward better performance within a machine node with hundreds of CPU/MIC cores, and optimizing the communication among nodes, among cores, and between CPUs and MICs. Several benchmark cases from model and/or industrial CFD applications are tested on the Tianhe-1A and Tianhe-2 supercomputers to evaluate the performance. Among these CFD cases, the maximum number of grid cells reached 780 billion. The tuned solver successfully scales up to half of the entire Tianhe-2 supercomputer system with over 1.376 million heterogeneous cores. The test results and performance analysis are discussed in detail.

Keywords: computational fluid dynamics, parallel computing, Tianhe-2 supercomputer, CPU+MIC heterogeneous computing

1. Introduction

In recent years, with the advent of powerful computers and advanced numerical algorithms, computational fluid dynamics (CFD) has been increasingly applied in the aerospace and aeronautical industries; it is reducing the dependence on experimental testing and has emerged as one of the important design tools in these fields. Parallel computing on modern high performance computing (HPC) platforms is also required to take full advantage of the computational capability and huge memories of these platforms, as an increasing number of large-scale CFD applications are used nowadays to meet the needs of engineering design and scientific research. It is well known that HPC systems are undergoing a rapid evolution driven by power consumption constraints; as a result, multi-/many-core CPUs have turned into energy-efficient system-on-chip architectures, and HPC nodes integrating main processors and co-processors/accelerators have become popular in current supercomputer systems. The typical and widely used co-processors and/or accelerators include GPGPUs (general-purpose graphics processing units), the first-generation Intel (Knights Corner) MICs (Many Integrated Core), FPGAs (field-programmable gate arrays), and so on. The fact that there are more than sixty systems with CPU+GPU and more than twenty systems with CPU+MIC in the latest Top500 supercomputer list, released in November 2016, indicates such a trend in HPC platforms.

* Corresponding author. Email addresses: [email protected] (WANG Yong-Xian), [email protected] (ZHANG Li-Lun), [email protected] (LIU Wei), [email protected] (CHENG Xing-Hua), [email protected] (Zhuang Yu), [email protected] (Anthony T. Chronopoulos)

Consequently, the CFD community faces an important challenge: how to keep up with the pace of rapid change in HPC systems. It can be hard, if not impossible, to port efficient algorithms that perform well on traditional homogeneous platforms to the new HPC systems with heterogeneous architectures seamlessly. Existing codes need to be re-designed and tuned to exploit the different levels of parallelism and the complex memory hierarchies of the new heterogeneous systems.

During the past years, researchers in the CFD field have made great efforts to implement CFD codes efficiently on heterogeneous systems. Among many recent studies, researchers have paid particular attention to CPU+GPU hybrid computing.


Ref. [1] studied an explicit Runge-Kutta CFD solver for the three-dimensional compressible Euler equations using a single NVIDIA Tesla GPU and obtained roughly 9.5x performance over a quad-core CPU. In [2], the authors implemented and optimized a two-phase solver for the Navier-Stokes equations using Runge-Kutta time integration on a multi-GPU platform and achieved an impressive speedup of 69.6x on eight GPUs/CPUs. Xu et al. proposed an MPI-OpenMP-CUDA parallelization scheme in [3] to utilize both CPUs and GPUs for a complex, real-world CFD application using an explicit Runge-Kutta solver on the Tianhe-1A supercomputer, and achieved a speedup factor of about 1.3x when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs, and a parallel efficiency above 60% on 1024 Tianhe-1A nodes.

On the other hand, many studies aim to exploit the capability of CPU + MIC heterogeneous systems for massively parallel computing of CFD applications. Ref. [4] gave a small-scale preliminary performance test of a benchmark code and two CFD applications on a 128-node CPU + MIC heterogeneous platform and evaluated the early performance results, where application-level testing was primarily limited to the case of a single machine node. In the same year, the authors of [5] made an effort to port their CFD code from a traditional HPC platform to the then newly installed Tianhe-2 supercomputer with its CPU+MIC heterogeneous architecture, and the performance evaluations of large CFD cases used up to 3072 machine nodes of the fastest supercomputer of that year. Ref. [6] reported large-scale simulations of turbulent flows on massively parallel accelerators, including GPUs and Intel Xeon Phi coprocessors, and found that the different GPUs considered substantially outperform the Intel Xeon Phi accelerator for some basic OpenCL kernels of the algorithm. Ref. [7] implemented an unstructured-mesh CFD benchmark code on Intel Xeon Phis with both explicit and implicit schemes, and their results showed good scalability when using the MPI programming technique; however, the hybrid OpenMP multi-threading plus MPI case remained untested in their paper. As a case study, Ref. [8] compared the performance of a high-order weighted essentially non-oscillatory scheme CFD application on both a K20c GPU and a Xeon Phi 31SP MIC, and the results showed that when the vector processing units are fully utilized, the MIC can achieve performance equivalent to that of GPUs. Ref. [9] reported the performance and scalability of an unstructured-mesh-based CFD workflow on the TACC Stampede supercomputer and the NERSC Babbage MIC-based system using up to 3840 cores in different configurations.

In this paper, we aim to study the porting and performance tuning techniques of a parallel CFD code on a heterogeneous HPC platform. A set of parallel optimization methods considering the characteristics of both the hardware architecture and typical CFD applications is developed. The portability and the device-oriented optimizations are discussed in detail. The test results of large-scale CFD simulations on the Tianhe-2 supercomputer, with its hybrid CPU + Xeon Phi co-processor architecture, show that good performance and scalability can be achieved.

The rest of this paper is organized as follows: in Sect. 2 the numerical methods, the CFD code and the heterogeneous HPC system are briefly introduced. The parallelization and performance optimization strategies of large-scale CFD applications running on the heterogeneous system are discussed in detail in Sect. 3. Numerical experiments to evaluate the proposed methods, together with results and analysis, are reported in Sect. 4, and concluding remarks are given in Sect. 5.

2. Numerical Methods and High Performance Computing Platform

2.1. Governing Equations

In this paper, the classical Navier-Stokes governing equations are used to model three-dimensional viscous compressible unsteady flow. The governing equations in differential form in the curvilinear coordinate system can be written as:

\frac{\partial Q}{\partial t} + \frac{\partial (F - F_v)}{\partial \xi} + \frac{\partial (G - G_v)}{\partial \eta} + \frac{\partial (H - H_v)}{\partial \zeta} = 0,    (1)

where Q = (ρ, ρu, ρv, ρw, ρE)^T denotes the conservative state (vector) variable, F, G and H are the inviscid (convective) flux variables, and F_v, G_v and H_v are the viscous flux variables in the ξ, η and ζ coordinate directions, respectively. Here, ρ is the density, u, v and w are the Cartesian velocity components, and E is the total energy. All these physical variables are non-dimensional in the equations, and for the three-dimensional flow field they are vector variables with five components. The details of the definition and expression of each flux variable can be found in [10].

2.2. Numerical Methods

For the numerical method, a high-order weighted compact nonlinear finite difference method (FDM) is used for the spatial discretization. Specifically, let us first consider the inviscid flux derivative along the ξ direction. Using the fifth-order explicit weighted compact nonlinear scheme (WCNS-E-5) [11], its cell-centered FDM discretization can be expressed as

\frac{\partial F_i}{\partial \xi} = \frac{75}{64h}\,(F_{i+1/2} - F_{i-1/2}) - \frac{25}{384h}\,(F_{i+3/2} - F_{i-3/2}) + \frac{3}{640h}\,(F_{i+5/2} - F_{i-5/2}),    (2)

where h is the grid size along the ξ direction, and the flux (vector) variable F is computed by a flux-splitting method combining the corresponding left-hand and right-hand cell-edge flow variables. The discretization of the other inviscid fluxes can be computed by a similar procedure.


The fourth-order central differencing scheme is carefully chosen for the discretization of the viscous flux derivatives to ensure that all the derivatives have matching discretization errors. The reason for designing such a complex weighted stencil is to prevent numerical oscillations around discontinuities of the flow field.

When the flow field is composed of multiple grid blocks, in order to guarantee the required high accuracy of the numerical approach, the grid points shared by adjacent blocks need to be carefully dealt with. As a result, the geometric conservation law proposed by Deng et al. [11] and some strict conditions on the boundaries of the neighboring blocks must be satisfied for complex configurations; see [10, 11] for more details.

Once the spatial discretization is completed, we obtain the semi-discretized form of Eq. (1) as follows:

\frac{\partial Q}{\partial t} = -R(Q).    (3)

In Eq. (3), −R(Q) denotes the discretized spatial terms of Eq. (1). The discretization in time of the left-hand side of Eq. (3) results in a multiple-step advancing method. In this paper, the third-order explicit Runge-Kutta method is used. Let the superscript (n) denote the n-th time step; the time stepping can then be expressed as:

Q^{(n,1)} = Q^{(n)} + \lambda\, R\big(Q^{(n)}\big),    (4)

Q^{(n,2)} = \frac{3}{4} Q^{(n)} + \frac{1}{4}\left[ Q^{(n,1)} + \lambda\, R\big(Q^{(n,1)}\big) \right],    (5)

Q^{(n+1)} = \frac{1}{3} Q^{(n)} + \frac{2}{3}\left[ Q^{(n,2)} + \lambda\, R\big(Q^{(n,2)}\big) \right].    (6)
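A minimal Fortran sketch of the three-stage update of Eqs. (4)–(6) is given below. The rk3_step and residual names, and the flattened one-dimensional storage of a zone, are assumptions made only for this illustration and do not reflect the data structures of the in-house solver.

    ! Sketch of the third-order Runge-Kutta stepping of Eqs. (4)-(6) for one
    ! zone whose conservative variables are flattened into a 1-D array.
    subroutine rk3_step(q, n, lambda, residual)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(in)    :: lambda
      real(8), intent(inout) :: q(n)           ! Q^(n) on entry, Q^(n+1) on exit
      interface
        subroutine residual(qin, r, n)         ! stands in for the operator R(Q)
          integer, intent(in)  :: n
          real(8), intent(in)  :: qin(n)
          real(8), intent(out) :: r(n)
        end subroutine residual
      end interface
      real(8) :: q1(n), q2(n), r(n)

      call residual(q,  r, n)
      q1 = q + lambda*r                                      ! Eq. (4)
      call residual(q1, r, n)
      q2 = 0.75d0*q + 0.25d0*(q1 + lambda*r)                 ! Eq. (5)
      call residual(q2, r, n)
      q  = (1.0d0/3.0d0)*q + (2.0d0/3.0d0)*(q2 + lambda*r)   ! Eq. (6)
    end subroutine rk3_step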

2.3. Code implementation

An in-house CFD code capable of simulating three-dimensional multi-block structured grid problems using the numerical method described above is developed for the case study in this paper. The source code contains more than 15,000 lines of Fortran 90. In our previous work [12], the code was parallelized and tuned for a homogeneous HPC system composed of multi-/many-core CPUs. Shortly after that, we made a simple port to a CPU + MIC heterogeneous system and conducted an early performance evaluation on the Tianhe-2 supercomputer [5]. In this study, we re-design and re-factor the code for the heterogeneous many-core HPC platform to achieve better overall performance. The flowchart of the CFD code is shown in Fig. 1.

Similar to other CFD applications, our code consists of pre-processing, main and post-processing parts. In the pre-processing phase, the CFD code reads the mesh data and initializes the flow field for the whole domain. Then the time-consuming main part, indicated by the time-marching loop in the flowchart, follows. In each time-step iteration of this phase, the linear systems resulting from discretizing the governing equations in the domain are solved by the numerical methods. When the convergence criterion is met during the loop, the post-processing phase follows and finalizes the simulation.

[Figure 1 omitted: flowchart — Initialization → time-marching loop {invFlux: inviscid flux derivatives; visFlux: viscous flux derivatives; eqnSolver: Runge-Kutta equation solver; CopyComm: intra-process data exchange; MPIComm: inter-process data exchange; BC: boundary condition; convergence check (Yes/No)} → post-processing.]
Figure 1: Flowchart of the CFD code

The traditional parallelization method for a CFD program is mainly based on the technique of domain decomposition, and it can be implemented on shared-memory and/or distributed-memory systems employing the respective programming models, such as the OpenMP multi-threaded model and the MPI multi-process model. In these implementations, the same solver runs on each domain, and some boundary data of neighboring domains need to be exchanged mutually during the iterations.

3. Parallelization and Optimization Methods

3.1. Multi-level parallelization strategy considering both application features and computer architecture

In order to explore the potential capability of modern computer hardware and improve the scalability of applications, a typical CFD parallel simulation running on these systems is based on a multi-level and/or multi-granularity hybrid parallelization strategy. Heterogeneous HPC systems using more than one kind of processor or core are becoming a trend. These systems gain performance or energy efficiency not just by adding more processors of the same type, but by adding different accelerators and/or coprocessors, usually incorporating specialized processing capabilities to handle particular tasks. For heterogeneous computer systems equipped with accelerators and/or coprocessors, a common paradigm is the "CPU + accelerator and/or coprocessor" heterogeneous programming model. For example, an "MPI + OpenMP + CUDA" hybrid heterogeneous model is used when many GPU accelerators are employed, and


an "MPI + OpenMP + Offload" hybrid heterogeneous model is used instead when many MIC coprocessors are employed. In both circumstances, a proper decision on how to divide the simulation into many sub-tasks and map them to the hardware units of the heterogeneous computer system should be carefully made in order to achieve good load balancing, high efficiency and high parallel speedup.

3.1.1. Principles of multi-level parallelization in typical CFD applications

Multi-level parallelization according to different granularities is the main means of improving the performance of applications running on high-performance computers. In typical CFD applications, we try to exploit the parallelism at the following levels, from the coarsest granularity to the finest granularity.

(1) Parallelism of multiple zones of the flow field. At this level, the whole flow field domain to be simulated is decomposed into many sub-domains, and the same solver runs on each sub-domain. Towards this goal, it is common practice to partition the points/cells of the discretized grid/mesh of the domain into many zones, so that parallelism of the data processing among these zones is obtained. It is worth noting that, for the purpose of workload balancing, it is far from trivial to partition the grid/mesh and map the zones onto the hardware units of a modern computer system consisting of a variety of heterogeneous devices with different computing capabilities. There are also some obstacles in the domain decomposition. Among others, the convergence of an implicit linear system solver can degrade dramatically, which means more computational work is needed to compensate, thus increasing the time to solution.

(2) Parallelism of multiple tasks within each zone. Considering the simulation process of each zone resulting from the domain decomposition mentioned above, there is still a series of processing phases, including updating the convective flux variables and the viscous flux variables in the three coordinate directions, respectively. The fact that no data dependency exists among these six phases (i.e. three convective flux calculation phases plus three viscous flux calculation phases) implies concurrency and task-level parallelism.

(3) Single-Instruction-Multiple-Data (SIMD) parallelism. Even within each zone and each processing phase, the computing characteristics are highly similar among all grid points/cells, and a large amount of data distributed over different grid points/cells can be processed by the same instruction. Besides this parallelization, some architecture-oriented performance tuning techniques, such as data blocking and loop-based optimizations, are also applied at this level.

3.1.2. Parallelization of CFD applications in homogeneous systems

On typical homogeneous computing platforms, the multi-level parallel computing of traditional CFD applications is mainly implemented by combining the MPI multi-process programming model, the OpenMP multi-threaded programming model and single instruction multiple data (SIMD) vectorization techniques.

For the coarse-grain parallelism at the domain decomposition level, as shown in Fig. 2(a), the classical parallel computation uses a static mesh-partitioning strategy to obtain more mesh zones. In the commonly used structured-grid case, each mesh zone of the flow field domain is actually a block of grid points/cells. In the subsequent mapping phase, each mesh zone is assigned to an execution unit, either a running process (in the MPI case) or a running thread (in the OpenMP multi-threading case), and these execution units are further attached to specific hardware units, namely CPUs or cores. In practice, an "MPI + OpenMP" hybrid method is applied in order to better balance the computation workload and avoid generating too small grid blocks (which usually leads to too little workload). This method does not need to over-divide the existing mesh blocks, and the workload can be fine-adjusted at the thread level by specifying a number of threads proportional to the amount of points/cells in the block.

For the parallelism of intermediate granularity, as shown in Fig. 2(b), multi-threaded parallelization based on the task-distribution principle using the OpenMP programming model is applied. The most time-consuming computation in a single iteration of a typical CFD solver involves the calculation of the convective flux variables (the invFlux procedure shown in Fig. 1), the viscous flux variables (the visFlux procedure), the large sparse linear system solver (the eqnSolver procedure) and other procedures. The parallelization at this granularity can be further divided into two sub-levels: (1) at the upper sub-level, the computations in the invFlux and visFlux procedures along the three coordinate directions naturally form six concurrent task workloads and thus can be parallelized on the computing platform; (2) at the lower sub-level, considering the processing within each procedure, when discriminating between the grid points/cells on the boundary of the zone and the inner ones, it can easily be seen that, by rearranging the code structure and adjusting the order of calculation, there is more potential parallelism to exploit based on some loop transformation techniques [12].

For the fine-grain parallelism within a single zone, as shown in Fig. 2(c), because the computation of each zone is usually assigned to a specific execution unit (process or thread), and thus also mapped to a CPU core, traditional SIMD (single instruction multiple data) parallelization can be applied at this level. For example, on the Tianhe-2 supercomputer system, one of our test platforms, the CPU processor supports 256-bit wide vector instructions and the MIC coprocessor supports 512-bit wide vector instructions, which provides a great opportunity to exploit instruction-level parallelism based on the SIMD method. In our previous experience, SIMD vectorization can usually bring a 4x–8x


performance improvement, measured in double-precision floating-point operations, in a typical CFD application.

[Figure 2 omitted: (a) zone-level mesh partitioning; (b) task-level concurrency of the invFlux, visFlux, eqnSolver and BC & comm procedures; (c) data blocking within a zone; (d) the mapping across the application, executing and hardware layers.]
Figure 2: Multi-granularity parallelization of typical CFD problems (a–c), and its specific multi-level mapping method (d)

3.1.3. Parallelization of CFD applications in heterogeneous systems

A heterogeneous system consists of more than one kind of processor or core, which can result in huge differences in computing capability, memory bandwidth and storage capacity among the sub-systems. For the parallelization of CFD applications on such heterogeneous systems, in addition to the implementation methods for homogeneous platforms introduced in the previous section, more attention should be paid to the cooperation between the main processors and the accelerators/coprocessors. The strategies for load balancing, task scheduling, task distribution, and the programming model in such circumstances should be carefully re-checked, or even re-designed. Let us take the Tianhe-2 supercomputer system as an example. It consists of 16,000 machine nodes, and each node is equipped with CPUs as the main processors and MICs as the coprocessors. The task distribution and collaboration between CPUs and MICs are implemented with the so-called "offload programming model" and the supporting libraries provided by Intel. As illustrated in Fig. 2(d), task scheduling and load balancing can be divided into three layers. The computing tasks of a multi-zone grid CFD application, shown at the top layer (the "application layer"), are first organized and assigned to multiple processes and/or threads (the "executing layer"), and thus mapped to specific hardware devices, say, a core of either a CPU or a MIC, at the "hardware layer". For simplicity and ease of implementation, the task assignment, scheduling and mapping between adjacent layers can be built in a static mode. Considering that each machine node of the Tianhe-2 system contains two general-purpose CPUs and three MIC devices, in order to balance the load between these two kinds of computing devices we designed the following parallelization configuration: multiple processes are distributed to each machine node via the MPI programming model, and many OpenMP threads run within each process to take advantage of the many-core features of both the CPUs and the coprocessors. Offloading computational tasks to the three MIC devices is carried out by one of the OpenMP threads, namely the main thread. With this arrangement, each process can distribute some share of its work to the corresponding MIC devices, and it is easier to utilize the capabilities of all MIC devices equipped on the machine node. The communication between CPUs and MIC devices is implemented through the pragmas or directive statements and the support libraries provided by the vendor. In typical programming practice, the main thread of a CPU process picks up some data to offload to the MIC devices, and the MIC devices return the resulting data back to the main thread running on the CPU. The load balancing between CPU and MIC is a big challenge in these applications, and a simple static strategy is applied in our implementation. We introduce a further intermediate layer into the "application layer" shown in Fig. 2(d), which organizes the mesh zones into groups. As an example, five individual zones, say, G0, G1, G2, G3 and G4, within a group of mesh zones can be further divided into two sub-groups {{G0, G1}; {G2, G3, G4}} according to their volumes and shapes, where G0 and G1 have almost equal volume and are assigned to the two CPUs of the process, respectively, while G2, G3 and G4 have another, nearly equal, size and the computing tasks on them are assigned to the three MIC devices, respectively. As an extra benefit of introducing the intermediate layer, regardless of whether the parallelization and optimization takes place on the CPU or on the MIC device, the strategies applied at the finer-grain levels, i.e. thread level, SIMD level, etc., can be handled in a similar way.
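The arrangement just described, in which the main OpenMP thread of an MPI process drives a coprocessor while the remaining threads compute the CPU-resident zone group, can be sketched with the Intel offload directives roughly as follows. The kernel, the array names and sizes, and the hand-made thread split are assumptions made for this illustration only, not the structure of the in-house code.

    ! Sketch: thread 0 offloads a zone group to MIC device 0 while the other
    ! threads split a CPU-resident zone group among themselves (assumes at
    ! least two OpenMP threads per process).
    subroutine compute_zone_group(q, res, n)
      implicit none
      !dir$ attributes offload:mic :: compute_zone_group
      integer, intent(in)  :: n
      real(8), intent(in)  :: q(n)
      real(8), intent(out) :: res(n)
      integer :: i
      do i = 1, n
        res(i) = 2.0d0*q(i)              ! stands in for the real flux kernel
      end do
    end subroutine compute_zone_group

    program cpu_mic_cooperation
      use omp_lib
      implicit none
      !dir$ attributes offload:mic :: compute_zone_group
      integer, parameter :: n = 1000000
      real(8) :: q_mic(n), res_mic(n), q_cpu(n), res_cpu(n)
      integer :: tid, nth, i, lo, hi
      q_mic = 1.0d0;  q_cpu = 1.0d0

      !$omp parallel private(tid, nth, i, lo, hi)
      tid = omp_get_thread_num()
      nth = omp_get_num_threads()
      if (tid == 0) then
        ! main thread: send data, run the kernel on the coprocessor, get results
        !dir$ offload target(mic:0) in(q_mic) out(res_mic)
        call compute_zone_group(q_mic, res_mic, n)
      else
        ! worker threads: hand-split the CPU-resident zone among threads 1..nth-1
        lo = (tid-1)*n/(nth-1) + 1
        hi =  tid   *n/(nth-1)
        do i = lo, hi
          res_cpu(i) = 2.0d0*q_cpu(i)    ! stands in for the real flux kernel
        end do
      end if
      !$omp end parallel
    end program cpu_mic_cooperation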

3.2. Implementations of parallelization and optimization for both homogeneous and heterogeneous platforms

3.2.1. Load balancing method of parallel computing

The load balancing of parallel CFD simulations mainly results from re-partitioning the existing mesh of the whole flow field domain. Large-scale parallel CFD simulation imposes great requirements on the routine grid partitioning procedure. For example, when submitting such a simulation job to a specific HPC platform, the available computing resources, such as the number of available CPU nodes and/or cores, the capacity of the memory system, etc., vary and depend on the status in situ. Among others, a flexible solution is to partition the original mesh zones into many small-sized blocks and then re-group these blocks into the final groups of blocks. We developed a pre-processing software tool to fulfill this task, and the details of the algorithm and


implementation can be found in [13]. In fact, some more complex re-partitioning techniques are introduced in this second processing step for fine-adjusting the workload of computation and communication in the solution phase.

For the load balancing of the OpenMP multi-threaded parallelization, two task scheduling strategies are adopted in the CFD application. The static task scheduling strategy, based on distributing the workload equally among the threads, is applied in the loop-parallel regions. However, for the concurrent execution of the invFlux_X, invFlux_Y, invFlux_Z, visFlux_X, visFlux_Y and visFlux_Z procedures described in Sect. 3.1.2, the dynamic task scheduling strategy is more appropriate due to the differences in the number of grid points/cells and the amount of computation among these procedures.
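One possible way to express this dynamic strategy is to issue the six procedures as OpenMP tasks, so that idle threads pick up the longer-running directions; the sketch below assumes the procedure interfaces and a zone data structure that are not shown in the paper, and is only an illustration of the idea, not the authors' implementation.

    ! Sketch: the six flux procedures of one zone issued as OpenMP tasks so
    ! that the runtime schedules them dynamically across the threads.
    !$omp parallel
    !$omp single
    !$omp task
      call invFlux_X(zone)
    !$omp end task
    !$omp task
      call invFlux_Y(zone)
    !$omp end task
    !$omp task
      call invFlux_Z(zone)
    !$omp end task
    !$omp task
      call visFlux_X(zone)
    !$omp end task
    !$omp task
      call visFlux_Y(zone)
    !$omp end task
    !$omp task
      call visFlux_Z(zone)
    !$omp end task
    !$omp end single
    !$omp end parallel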

The load balancing between different computing devices, such as CPU and MIC, is another issue on the heterogeneous computing platform. A static task assignment method is applied to address this issue, and an adjustable ratio parameter, which denotes the ratio of the amount of workload assigned to one MIC device to that assigned to one CPU processor, is introduced to match the different computing platforms, guided by performance data measured in real applications. We give examples and results for a series of test cases discussing this issue in Sect. 4.3.
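For a Tianhe-2 node with two CPUs and three MICs, the static split implied by this ratio parameter can be computed as in the small sketch below; the subroutine and variable names are illustrative assumptions.

    ! Sketch: split the cells of one node into per-CPU and per-MIC shares from
    ! the measured ratio r = (workload per MIC) / (workload per CPU).
    subroutine split_node_workload(ncells_node, r, ncells_per_cpu, ncells_per_mic)
      implicit none
      integer(8), intent(in)  :: ncells_node       ! cells assigned to this node
      real(8),    intent(in)  :: r                 ! MIC-to-CPU workload ratio
      integer(8), intent(out) :: ncells_per_cpu, ncells_per_mic
      ! 2 CPUs + 3 MICs per node: total work corresponds to (2 + 3r) CPU shares
      ncells_per_cpu = nint(dble(ncells_node) / (2.0d0 + 3.0d0*r), kind=8)
      ncells_per_mic = nint(r * dble(ncells_per_cpu), kind=8)
    end subroutine split_node_workload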

3.2.2. Multi-threaded parallelization and optimization

It is well known that an increasing number of processor cores are integrated into a single chip to raise the computing capability and balance power against performance in modern computing platforms. On these many-core computer systems, the multi-threaded programming model is widely used to effectively exploit the performance of all processor cores. Thus, the multi-threaded parallelization and optimization of CFD simulations on many-core platforms must be implemented. We emphasize the following methods of parallelization and optimization. (1) We must achieve a balance between the granularity and the degree of parallelism. In order to achieve this goal in typical CFD applications, the OpenMP multi-threading parallelization is applied to the computations of mesh points/cells within each single zone by splitting the iteration space of the zone along the three coordinate directions, as illustrated in Fig. 3. (2) We must reduce the overhead of thread creation and destruction. If possible, the OpenMP pragma/directive that creates the multi-threaded region should be placed at the outer loop of the nested loop to reduce the additional overhead caused by repeated dynamic creation and destruction of threads. (3) We must reduce the amount of memory occupied by a single thread. As is often done in general scientific and engineering computing, many variables in an OpenMP multi-threaded region are declared as private variables to ensure the correctness of the computations. As a result, the memory footprint of the application can become so large that the memory access performance deteriorates. The issue is especially severe on the MIC coprocessor device because of its smaller storage capacity. To address this issue, we try to minimize the amount of memory allocated for private variables and let each thread only allocate, access and deallocate the necessary data to obtain the maximum performance gain. (4) We must bind the threads to the processor cores on NUMA architectures. Ensuring a static mapping and binding between each software thread and a hardware processor core on a computing platform with NUMA architecture can improve the performance of CFD applications dramatically. This can be achieved by using a system call provided by the operating system or the affinity/binding support of the OpenMP runtime on the CPU platform, and by setting some environment variables supported by the vendor on the MIC platform.
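As a minimal sketch of the binding described in item (4), the proc_bind clause of OpenMP (available in recent compilers) can request pinning from within the code, while environment variables such as KMP_AFFINITY (Intel runtime) or OMP_PROC_BIND achieve the same effect without recompiling; the loop body below is only a placeholder.

    ! Sketch: ask the OpenMP runtime to bind threads close to the master's place;
    ! equivalently, set e.g. KMP_AFFINITY=compact or OMP_PROC_BIND at run time.
    program bind_threads_demo
      use omp_lib
      implicit none
      integer, parameter :: n = 1000000
      real(8) :: u(n)
      integer :: i
      !$omp parallel do proc_bind(close) schedule(static)
      do i = 1, n
        u(i) = dble(i)                   ! placeholder work
      end do
      !$omp end parallel do
      print *, 'threads:', omp_get_max_threads(), '  u(n) =', u(n)
    end program bind_threads_demo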

    !$OMP PARALLEL DO COLLAPSE(LEVEL)   ! LEVEL = 1/2/3
    do kb = 1, nk, kblksize
      do jb = 1, nj, jblksize
        do ib = 1, ni, iblksize
          do k = kb, kb+kblksize-1
            do j = jb, jb+jblksize-1
              !DIR$ SIMD
              do i = ib, ib+iblksize-1
                ... update u(i, j, k) ...
              enddo
            enddo
          enddo
        enddo
      enddo
    enddo

Figure 3: Data blocking used in the parallelization and optimization

3.2.3. Multi-level optimization of communication

In modern computer architectures, gains in processor performance, via increased clock speeds, have greatly outpaced improvements in memory performance, causing a growing discrepancy known as the "memory gap" or "memory wall". Therefore, over time high-performance applications have become more and more limited by memory bandwidth, leaving fast processors frequently idle as they wait for memory. As a result, minimizing the data communication, including data movement and data transfer, between different hardware units is an effective optimization for applications running on modern HPC platforms. On the other hand, it is well known that traditional parallel CFD applications involve a lot of data communication, including data movement and data transfer, between different machine nodes, between different processes running on the same node, between CPU cores and MIC devices, and so on. One of the keys to effectively decreasing the cost of data communication is therefore to shorten the communication time as much as possible, or to "hide" the communication cost by overlapping the communication with other CPU-consuming work.

In order to reduce the number of communications and the time cost per communication, a reasonable "task-to-hardware" mapping is established. When assigning the computational tasks of the mesh to the MPI processes and


scheduling the MPI processes onto the machine nodes or processors at run time, we prefer to distribute spatially adjacent blocks to the same process or the same machine node whenever possible, thus minimizing the number of communications that cross different nodes. If multiple communications between a pair of processes can be performed concurrently, the messages are packed so that only one communication is needed. Furthermore, the non-blocking communication mode is used instead of blocking communication, in conjunction with overlapping communication and computation, to hide the cost of communication fully or partially.
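The packing and overlap just described follows the standard non-blocking MPI pattern sketched below; the buffer names and the single-neighbor topology are assumptions made for this illustration.

    ! Sketch: exchange one packed halo buffer with a neighbor using
    ! non-blocking MPI, doing interior work while the messages are in flight.
    subroutine halo_exchange(sendbuf, recvbuf, n, neighbor, comm)
      use mpi
      implicit none
      integer, intent(in)  :: n, neighbor, comm
      real(8), intent(in)  :: sendbuf(n)      ! packed boundary data to send
      real(8), intent(out) :: recvbuf(n)      ! packed boundary data received
      integer :: req(2), stats(MPI_STATUS_SIZE, 2), ierr

      call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, neighbor, 0, comm, req(1), ierr)
      call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, neighbor, 0, comm, req(2), ierr)

      ! ... interior-cell computation goes here, overlapped with the transfer ...

      call MPI_Waitall(2, req, stats, ierr)
      ! ... boundary-cell computation that needs the received halo data ...
    end subroutine halo_exchange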

In large-scale CFD applications with multi-block meshes and a high-order accuracy solver, there is another issue concerning some neighboring mesh points. These so-called singular mesh points, located on common boundaries, are shared by three or more adjacent mesh blocks. When wider stencil schemes are used, as is usual in high-order accuracy CFD applications, more very small messages are needed for communication, which brings additional overhead. An optimization method addressing this issue by introducing non-blocking point-to-point communication is discussed in [14].

Another main obstacle to performance improvement for CFD applications run on a CPU + MIC heterogeneous platform with the offload programming model is the large amount of data that must be transferred back and forth between the CPU and the MIC coprocessor over the PCIe interface, which typically has lower bandwidth than memory access. This can be optimized in two ways. First, we use the asynchronous instead of the synchronous mode to transfer data between the CPU and the MIC coprocessor, and start the data transfer as early as possible, so that the data transfer overhead is partially or completely hidden by overlapping the CPU work with the transfer. Second, the communication between the CPU and the MIC coprocessor is optimized so as to reuse data that were either transferred to the MIC device earlier or produced by previous computations, as long as they can be reused in the following processing phase, such as the next loop of an iterative solver. If only a part of an array is updated on the MIC devices, we can use the compiler directive !Dir$ offload supported by the Intel Fortran compiler to offload only a slice of that array to the coprocessor. In other cases, for derived variables that can be calculated from primary variables, only the primary data are transferred to the coprocessor, together with offloading the computing procedures that calculate the derived data. This means that we recompute the derived data on the coprocessor instead of transferring them between the CPU and the MIC device directly.
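The first method, asynchronous transfer, can be expressed with the signal/wait clauses of the Intel offload directives, roughly as in the fragment below; the kernel name, the tag variable and the data layout are assumptions for illustration only.

    ! Sketch: start an offload asynchronously so the CPU keeps computing while
    ! the MIC works, then wait for the coprocessor before using its results.
    ! (tag is an integer variable used as the signal handle.)
    !dir$ offload target(mic:0) signal(tag) in(q_mic) out(res_mic)
    call flux_kernel(q_mic, res_mic, n_mic)     ! launched on the coprocessor
    call flux_kernel(q_cpu, res_cpu, n_cpu)     ! CPU share runs concurrently
    !dir$ offload_wait target(mic:0) wait(tag)
    ! res_mic is now valid on the host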

3.2.4. The optimization of wide vector instruction usage

Both the CPUs and the MIC coprocessors of a modern computer system support wide vector instruction sets that exploit SIMD (single instruction multiple data), or vectorization, at the instruction level. For example, on the Tianhe-2 supercomputer, the CPU supports 256-bit wide vector instructions and the MIC coprocessor supports 512-bit wide vector instructions. Vectorization and its optimization is one of the key methods to improve the overall floating-point performance. In our previous experience, it can theoretically bring a 4x–16x speedup, depending on the precision of the floating-point operations and the specific type of processor used for the CFD simulation. In development practice, we use compiler options, such as the -vec option of the Intel Fortran compiler, in addition to user-specified compiler directives, to accomplish this goal. In fact, most modern compilers can apply this kind of vectorization automatically, and what the user should do is check the optimization report generated by the compiler, pick out the candidate code fragments that were not auto-vectorized, and add proper directives or declarations to vectorize them manually, provided correctness is ensured.
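For a loop that the optimization report flags as not auto-vectorized, a directive of the kind already used in Fig. 3 can assert that the iterations are independent; the sketch below uses illustrative array names.

    ! Sketch: manual vectorization hint for a loop the compiler did not
    ! auto-vectorize; !DIR$ SIMD asserts independence of the iterations.
    subroutine velocity_from_conservative(rho, rhou, u, n)
      implicit none
      integer, intent(in)  :: n
      real(8), intent(in)  :: rho(n), rhou(n)
      real(8), intent(out) :: u(n)
      integer :: i
      !DIR$ SIMD
      do i = 1, n
        u(i) = rhou(i) / rho(i)
      end do
    end subroutine velocity_from_conservative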

3.2.5. CPU + MIC collaborative parallelization and optimization

There are significant differences in computing capability, storage capacity, network bandwidth, latency, etc. between the different types of devices of a heterogeneous platform, so we must carefully analyze, design and organize the appropriate order of computation, memory access and communication to achieve collaborative parallel simulation of CFD applications on such a platform. The collaborative parallelization is mainly applied within a single machine node of the high-performance computer system. The load balancing between CPUs and MIC coprocessors is designed by assigning amounts of mesh and tasks as described in Sect. 3.1.3. All the grid blocks assigned to each process are divided into two types of groups according to their volume and size, i.e. the total amount of mesh points/cells: the first type, consisting of two groups, is assigned to the CPUs of the machine node, and the second type, consisting of the remaining three groups, is assigned to the MIC coprocessors of the same machine node, as shown in Fig. 4. By adjusting the ratio of the sizes of the two types of grid blocks, the best value for load balancing can be obtained. On the CPU side, multiple threads are created through the OpenMP programming model to utilize the capability of the many-core processors. The main thread running on the CPU is responsible for interacting with the MIC devices, and the rest of the threads are responsible for the computing work assigned to the CPUs. A series of parallel optimizations of collaborative computing are applied based on this framework.

(1) Reducing the overhead of offloading tasks to the MIC devices. Due to the limited bandwidth of the PCIe interface connecting CPUs and MIC devices, one effective method to improve performance is to reduce the number of data transfers between the two types of devices while increasing the volume of each transfer.


[Figure 4 omitted: the domains of the flow field d0 ... dN-1 are mapped to nodes connected by the network; within each node (Node 0 ... Node N-1), blocks b0–b4 are assigned to CPU0/CPU1 and MIC0/MIC1/MIC2 (heterogeneous cores).]
Figure 4: Task assignments and mapping on CPU + MIC heterogeneous platforms

We can refactor the code and re-organize the small kernels executed on the MIC devices into as big a kernel as possible. In this way, all the tasks can be completed through a single offload instead of many offload operations, which would cause much more data movement overhead. Furthermore, when using the compiler directive !Dir$ offload in the Fortran code, adding alloc_if, free_if and similar clauses wherever possible avoids repeatedly allocating and deallocating space for the big arrays offloaded to the MIC device. Moreover, the initialization and allocation of variables should be arranged as early as possible, at least in a "warm-up" stage prior to the main computation iterations.
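The alloc_if/free_if usage mentioned above can be sketched as follows: the device buffers are allocated once in the warm-up stage, reused in every iteration without re-allocation, and freed only after the main loop. The array and kernel names are assumptions made for this example.

    ! Sketch: persistent coprocessor buffers managed with alloc_if/free_if.
    ! Warm-up: allocate q and res on mic:0 and keep them after the transfer.
    !dir$ offload_transfer target(mic:0) in(q : alloc_if(.true.) free_if(.false.)) nocopy(res : alloc_if(.true.) free_if(.false.))
    do step = 1, nsteps
      ! Reuse the device-resident buffers every iteration: no re-allocation.
      !dir$ offload target(mic:0) in(q : alloc_if(.false.) free_if(.false.)) out(res : alloc_if(.false.) free_if(.false.))
      call flux_kernel(q, res, n)
    end do
    ! Tear-down: release the device buffers after the main iterations.
    !dir$ offload_transfer target(mic:0) nocopy(q, res : alloc_if(.false.) free_if(.true.))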

(2) Overlapping different levels of communication. In the flowchart shown in Fig. 1, once the eqnSolver procedure is completed, data exchange and synchronization are needed among all the blocks distributed over different processes and/or threads, even within the same machine node. These communications can be further divided into two categories: inter-process MPI communication and data transfer between CPUs and MIC coprocessors. If the mesh partitioning is carefully designed according to the configuration of CPUs and MIC coprocessors, we can avoid any dependence between the intra-process MPI communications and the data transfers between the two types of devices. This makes it possible to overlap them via asynchronous communication and further hide the communication overhead. For example, suppose the one-dimensional mesh partitioning strategy shown in Fig. 4 is adopted and five neighboring blocks (or block groups), namely {bi : i = 0, . . . , 4}, are assigned to the same computing node. Suppose also that only the left-most block b0 and the right-most block b4 are assigned to the CPU processors, while the remaining three blocks b1, b2 and b3 are assigned to the MIC coprocessors. It is apparent that only the CPUs holding the data of the boundary blocks {b0, b4} need to exchange data with neighboring nodes via MPI communication, and the data exchange among b1, b2 and b3 can be kept within the node via offloading between the CPUs and the MICs. As shown in Fig. 5, the total communication overhead can be effectively reduced by this asynchronous overlapping technique.

(3) Parallelizing the processing of the heterogeneous processors. There is inherent concurrency between the two types of computing devices, i.e. the CPU and the MIC coprocessor, and among different devices of the same type. This means that if the computational tasks are launched asynchronously across these heterogeneous devices, both CPUs and MICs, the parallel execution of the computing tasks and an overall performance improvement can be obtained.

The communication overlapping and the minimization of data movement at the multiple levels and phases discussed in this subsection are illustrated in Fig. 5.

[Figure 5 omitted: timelines of the CPU and MIC0–MIC2 for blocks b0–b4, showing CPU computation, MIC computation, CPU↔MIC transfers, boundary-condition (BC) work and MPI communication, with the saved cost in the overlapped case (b).]
Figure 5: Cooperative parallel optimization of CPU + MIC on a single node. (a) No optimization. (b) Overlapping the computations and communication among CPUs and MICs.

In addition to the aforementioned parallelization and optimization, the traditional performance tuning methods for a single processor, such as cache-oriented memory optimization, data alignment, asynchronous I/O operations, buffering, etc., are also applicable to large-scale parallel CFD applications. Another issue worth noting is that in extremely large-scale CFD simulations, which take a long time to complete, the mean time between failures (MTBF) of the high-performance computer system decreases dramatically. As a result, it is necessary to build a fault-tolerant mechanism both at the system level and at the application level. However, this is beyond the scope of this paper.

4. Numerical Experiments and Discussions

4.1. Platform and configurations of numerical experiments

In order to evaluate the performance of the various parallelization and optimization methods of the previous section, we designed and ran a series of numerical experiments, including early results obtained on the Tianhe-1A supercomputer, a typical many-core CPU homogeneous platform, and the latest results on the Tianhe-2 supercomputer with its CPU + MIC heterogeneous platform.


The basic information of these two platforms is briefly summarized in Table 1. As shown in the table, the Tianhe-2 system is composed of 16,000 computing nodes, and each node has a heterogeneous architecture, including two Intel Xeon E5-2692 processors and three Intel Xeon Phi 31S1P coprocessors with the Many Integrated Core (MIC) architecture. The CFD application used for the large-scale parallel tests is an in-house code aimed at simulating problems with multi-block structured grids, developed in Fortran 90 and built with version 13 of the Intel Fortran compiler. The -O3 compiler option is applied on the CPU in all test cases unless otherwise noted, and the -O3 -xAVX compiler options are applied for the tests on the CPU + MIC heterogeneous platform in order to generate correct vectorization instructions for the coprocessor. Four test case configurations are used in the following discussion. (1) The "DeltaWing" case simulates the flow field around a delta wing and has 44 grid blocks with a total of 2.4 million grid cells. (2) The "NACA0012" case simulates the flow field around the NACA0012 wing and has a single grid block with a total of 10 million grid cells. (3) In the "DLR-F6" case, the number of grid cells is 17 million. (4) In the "CompCorner" case (Fig. 11), the flow field of the compressible corner problem is computed [15] for a variety of problem sizes measured by the number of grid cells. For this purpose, a special three-dimensional grid generation tool, which can vary the total number of grid cells, the number and the connectivity of the grid blocks, and so on, was developed to generate the different configurations for the tests.

All performance results reported in this section are the best of five independent runs, and the timing results cover only the main iteration phase of the flowchart shown in Fig. 1. We also limit the number of main iterations to no more than 50 for the purpose of performance comparison.

4.2. Performance results on homogeneous platforms

The Tianhe-1A supercomputer system, a typical homogeneous HPC platform, is used in this subsection for the performance evaluation of the following tests.

We first designed a group of numerical experiments to test the load balancing and communication optimization methods proposed in Sect. 3.1; the "NACA0012" case with 10 million mesh cells is employed on the Tianhe-1A supercomputer system. The test uses 64 symmetric MPI processes, one per machine node, and we evaluate the average wall time of both computation and communication per iteration for two versions of the application, namely without and with optimization. The performance results are shown in Fig. 6. In the original (unoptimized) CFD simulation, the communication time among MPI processes is about 63% of the total time per iteration. In the tuned version, as described in Sect. 3.1, we eliminate redundant global communication operations and maximize the use of non-blocking communication by refactoring the code and overlapping communication and computation as much as possible. These optimizations significantly reduce the total communication overhead, resulting in a nearly 10-fold increase of the ratio of computation to communication.

[Figure 6 omitted: average time per iteration (sec) for the original and tuned versions; computation is 6.08 s (original) and 6.09 s (tuned), while communication drops from 10.35 s to 1.07 s.]
Figure 6: Comparison of the ratio of computation to communication between the original and tuned versions

For the multi-threaded parallelization and its optimization discussed in Sect. 3.2, among the various methods, the impact of the affinity (or binding) of OpenMP threads to CPU cores on parallel CFD application performance is the focus of our tests. The "DeltaWing" case is employed on the Tianhe-1A platform for this purpose. The results (not shown here) indicate that binding threads to CPU cores can significantly improve the performance of the parallel simulation: the more threads used per process, the greater the performance gain. Additional numerical experiments evaluating the impact of the other thread-level optimization methods proposed in Sect. 3.2, such as data blocking and reducing the memory footprint, were also conducted (results not shown here). Although those results indicate some minor performance improvements from these optimizations, the thread affinity strategy yields the larger gains.

To assess the overall effect of the optimization methods proposed in Sect. 3.2 for the homogeneous system, the performance of two test cases with different runtime configurations on the Tianhe-1A system is reported in Fig. 7, where the horizontal axis is the number of CPU cores used in the test and the vertical axis is the speedup relative to the baseline. In Fig. 7(a), the DLR-F6 case with 17 million grid cells is tested, and the baseline is the performance when using two CPU cores. As a larger-scale case, the CompCorner case with 800 million grid cells is used in Fig. 7(b), and the performance with 480 CPU cores is taken as the baseline. In both tests, two MPI ranks with six threads each are created on each machine node to fully use the CPU cores of each node. As can be seen from Fig. 7(a), for a medium-sized CFD application like the DLR-F6 case, the parallel speedup grows essentially linearly as long as the number of CPU cores is no more than 256. However, further increasing the


Table 1: The configurations of the Tianhe-1A and Tianhe-2 platforms

                              Tianhe-1A                               Tianhe-2
    CPU                       Intel Xeon X5670 (6 cores per CPU)      Intel Xeon E5-2692 (12 cores per CPU)
    Frequency of CPU          2.93 GHz                                2.2 GHz
    Configuration per node    2 CPUs                                  2 CPUs + 3 MICs
    Memory capacity per node  48 GB                                   64 GB for CPUs + 24 GB for MICs
    Coprocessors used         not used in this paper                  Intel Xeon Phi (MIC) 31S1P (57 cores per MIC)

number of CPU cores decreases the parallel speedup significantly, due to the lower ratio of computation to communication per CPU core. More specifically, there are two reasons. First, increasing the number of processes significantly increases the number of singular points shared by neighboring intra-node MPI processes, which leads to much more MPI communication overhead caused by the large number of tiny messages transferred among machine nodes [14]. Second, for a fixed-size CFD problem, using more CPU cores means decreasing the size of the sub-problem running on each CPU core (in typical situations, the number of grid cells assigned to each CPU core drops below 10,000), as well as the size of the sub-problem of each thread, which further degrades the parallel performance. In Fig. 7(b), the test case has a larger size of 800 million mesh cells, running on configurations of 480, 600, 960, 1200 and 2400 CPU cores, respectively. Since the problem size in each grid block is large enough (more than 10 million grid cells), the ratio of computation to communication is relatively high, and the relative parallel speedup grows nearly linearly with the number of CPU cores used in the simulations.

[Figure 7 omitted: (a) speedup vs. number of CPU cores (2 to 2048) for the DLR-F6 case with 17M grid cells; (b) speedup vs. number of CPU cores (480, 600, 960, 1200, 2400) for the CompCorner case with 800M grid cells.]
Figure 7: Parallel speedup curves for two examples: (a) DLR-F6 case with 17 million mesh cells; (b) CompCorner case with 800 million mesh cells.

In order to compare the parallel performances of CFDsimulation under different configurations with a variety ofprocesses and threads combinations, we conducted a com-prehensive tests for the medium-sized DLR-F6 case (17million of grid cells), as shown in Fig. 8. Fig. 8(a) showsthe relationship between running time per iteration andthe number of CPU cores used in different configurationsof MPI ranks and OpenMP threads, where we omit the re-sults when the number of threads per process exceed 4. Ineach specific combination, the total number of CPU coresis ensured to be the same as the number of total number ofthreads. Some facts can be seen from the results. Firstly,

when the number of CPU cores is less than 256, where a linear parallel speedup is observed, there is no significant performance variation among different combinations of processes and threads as long as the total number of cores is fixed. Secondly, when we limit the maximum number of threads per process to no more than 3, the parallel efficiency remains high as the number of cores increases and then declines after the number of cores reaches 256. In contrast, when limiting the maximum number of threads per process to no more than 4, the critical point of parallel efficiency is extended to 1024 CPU cores. In both cases the number of MPI processes is limited to 256, beyond which the parallel efficiency begins to decline. This also shows that the cost of communication among MPI processes becomes the main obstacle to scaling the parallel simulation further.

In Fig. 8(b) we rearrange the performance results as the running time versus the number of threads per MPI process. It shows that improving the parallel performance solely by increasing the number of threads has an upper limit. For example, in the case of 256 MPI processes, the best performance is achieved with 3 threads per process. However, in the case of 512 MPI processes, the performance shows no further improvement when more than 2 threads per process are used. The main reason is again the low ratio of computation to communication in each MPI process when more and more threads are used. In fact, for the case of 512 processes, only about 30,000 grid cells are assigned to each process, which is indeed a light load for a powerful CPU. If more threads are used in this case, the additional overhead they introduce exceeds the performance gain from accelerating the computation, so the overall performance declines.
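To make the "ranks x threads = cores" decomposition above concrete, a minimal C sketch of the hybrid MPI + OpenMP pattern is given below. The block layout, the kernel stencil_sweep(), and the sizes are illustrative assumptions, not the actual solver code; the thread count per rank is chosen at run time (e.g., via OMP_NUM_THREADS), exactly as in the configuration sweeps of Fig. 8.

/* Minimal hybrid MPI + OpenMP sketch: each MPI rank owns a set of grid
 * blocks and sweeps them with an OpenMP-parallel loop.
 * Compile with, e.g., mpicc -fopenmp. Names and kernel are illustrative. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NB 4            /* grid blocks per rank (assumption) */
#define NCELL 30000     /* cells per block (assumption)      */

static void stencil_sweep(double *u, int n) {
    /* placeholder for one relaxation/flux sweep over a block */
    for (int i = 1; i < n - 1; ++i)
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);
}

int main(int argc, char **argv) {
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double *blocks[NB];
    for (int b = 0; b < NB; ++b)
        blocks[b] = calloc(NCELL, sizeof(double));

    for (int iter = 0; iter < 10; ++iter) {
        /* intra-rank parallelism: OpenMP threads share the blocks */
        #pragma omp parallel for schedule(dynamic)
        for (int b = 0; b < NB; ++b)
            stencil_sweep(blocks[b], NCELL);

        /* halo exchange with neighboring ranks would go here
         * (MPI_Isend/MPI_Irecv), performed by the main thread only */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("done on %d ranks, %d threads each\n",
               nranks, omp_get_max_threads());
    for (int b = 0; b < NB; ++b) free(blocks[b]);
    MPI_Finalize();
    return 0;
}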

4.3. Performance results on heterogeneous platforms

We use the collaborative parallelization methods proposed in Sect. 3.2 for the tests on the CPU + MIC heterogeneous platform. That is, only one MPI process runs on each machine node, with OpenMP multi-threading inside each MPI process for finer-grained parallelization; among those threads, one thread, called the main thread, is responsible for offloading sub-tasks to, and collecting results from, the three MIC devices within the machine node. During the task-assignment phase, each process reads in five grid blocks: two of equal size are assigned to the two CPUs, and the other three blocks of another size are assigned to the three MICs within the same node.
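A minimal host-side sketch of this pattern is shown below, using OpenMP 4.x target offload as a generic stand-in for the vendor offload pragmas typically used on Knights Corner MICs; block sizes, names, and the kernel are illustrative assumptions rather than the actual solver code. For brevity the sketch uses one driver thread per device, whereas the scheme described above uses a single main thread to manage all three devices.

/* Host-side offload sketch: within the one MPI process per node, a few
 * OpenMP threads drive the MIC devices while the remaining threads
 * sweep the CPU-resident blocks. Without an offload device, the
 * "target" regions simply fall back to the host.                     */
#include <omp.h>
#include <stdlib.h>
#include <stdio.h>

#define NMIC 3              /* MIC coprocessors per Tianhe-2 node */
#define CPU_CELLS 800000L   /* cells per CPU block (assumption)   */
#define MIC_CELLS 600000L   /* cells per MIC block (assumption)   */

#pragma omp declare target
static void sweep(double *u, long n) {     /* placeholder flux sweep */
    for (long i = 1; i < n - 1; ++i)
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);
}
#pragma omp end declare target

static void one_iteration(double *cpu_blk[2], double *mic_blk[NMIC]) {
    /* assumes at least NMIC + 1 OpenMP threads are available */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        if (tid < NMIC) {
            /* driver threads: each offloads one block to one device */
            double *u = mic_blk[tid];
            #pragma omp target device(tid) map(tofrom: u[0:MIC_CELLS])
            sweep(u, MIC_CELLS);
        } else {
            /* remaining threads split the two CPU blocks by hand */
            int nwork = omp_get_num_threads() - NMIC;
            int wid = tid - NMIC;
            for (int b = wid; b < 2; b += nwork)
                sweep(cpu_blk[b], CPU_CELLS);
        }
    }   /* implicit barrier: CPU and MIC sub-results are both ready */
}

int main(void) {
    double *cpu_blk[2], *mic_blk[NMIC];
    for (int b = 0; b < 2; ++b)    cpu_blk[b] = calloc(CPU_CELLS, sizeof(double));
    for (int m = 0; m < NMIC; ++m) mic_blk[m] = calloc(MIC_CELLS, sizeof(double));
    one_iteration(cpu_blk, mic_blk);
    printf("one heterogeneous iteration done\n");
    return 0;
}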




Figure 8: Performance results for different combinations of processes and threads. (a) Running time vs. configurations of MPI ranks and OpenMP threads. (b) Running time vs. number of threads per MPI process.

To find the best load balance between the two types of devices, that is, CPUs and MICs, we first fixed the grid size at 8 million (8M) cells for each CPU, i.e., 16M for the two CPUs in a machine node, and varied the grid size for each MIC coprocessor from 4M, 6M, 8M to 10M, respectively. For comparison, each of the aforementioned heterogeneous-computing tests is accompanied by another test with the same total grid size but using only the CPU devices in the same machine node. The acceleration results of these four pairs of tests using the “CompCorner” case on 16 Tianhe-2 nodes are reported in Fig. 9, and the flow field near the corner area is shown in Fig. 11. It shows that when the grid size for each MIC device is about 4M–6M, an optimal acceleration of about 2.62X over the CPU-only run can be achieved for the heterogeneous computation.


Figure 9: Speedup relative to CPU-only test for four configurations.

To further study the optimal ratio of MIC workload to CPU workload in heterogeneous computing, in the following series of tests we also vary the grid size for each CPU from 8M, 16M, 24M to 32M. Fig. 10(a) shows the performance, measured in million cell updates per second (MCUPS), as the ratio of the grid size per MIC to the grid size per CPU varies. When the ratio is between 0.6 and 0.8, the highest performance is observed in each group of tests, which is also nearly consistent with the results shown in Fig. 9. In Fig. 10(b) we fix 16M grid cells for each CPU and 9.6M grid cells for each MIC (thus the load ratio is 0.6) and scale the problem size up to 2048 Tianhe-2 machine nodes. The observed good weak scalability confirms the effectiveness of the performance optimization methods proposed in Sect. 3.2.
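The partitioning rule implied by this load ratio can be written down directly. The short C helper below is our own illustrative sketch, not the authors' code: given a per-node cell budget and a MIC-to-CPU ratio r, it sizes the two CPU blocks and three MIC blocks of one node, and shows how an MCUPS figure would be computed from a measured time per iteration.

/* Sketch (our own helper): size the two CPU blocks and three MIC
 * blocks of one Tianhe-2 node from a per-node cell budget and a
 * MIC-to-CPU load ratio r, and compute MCUPS from a measured time.  */
#include <stdio.h>

typedef struct { long cpu_cells; long mic_cells; } node_partition;

/* total = 2*cpu + 3*mic with mic = r*cpu  =>  cpu = total / (2 + 3r) */
static node_partition partition_node(long total_cells, double r) {
    node_partition p;
    p.cpu_cells = (long)(total_cells / (2.0 + 3.0 * r));
    p.mic_cells = (long)(r * p.cpu_cells);
    return p;
}

/* MCUPS = cells updated per iteration / (seconds per iteration * 1e6) */
static double mcups(long cells, double sec_per_iter) {
    return cells / (sec_per_iter * 1.0e6);
}

int main(void) {
    /* 60.8M cells per node at ratio 0.6 -> 16M per CPU, 9.6M per MIC,
     * matching the configuration used in Fig. 10(b).                 */
    node_partition p = partition_node(60800000L, 0.6);
    printf("per CPU: %ld cells, per MIC: %ld cells\n",
           p.cpu_cells, p.mic_cells);
    /* the 5.0 s per-iteration time below is a made-up illustration   */
    printf("MCUPS per node: %.1f\n", mcups(60800000L, 5.0));
    return 0;
}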


Figure 10: Balancing the load between CPUs and MIC coprocessors. (a) Performance results under different values of the load ratio parameter. (b) Scaling up to 2048 machine nodes with load ratio 0.6.

As a larger-scale test of our parallelization and optimization for both the CPU-only homogeneous platform and the CPU + MIC heterogeneous platform, the “CompCorner” cases with two types of configuration are used again on the Tianhe-2 supercomputer system. In the first type of configuration, labeled “coarse” in Fig. 12, each machine node processes 40 million (40M) grid cells, while in the “fine” configuration each node processes 95.2M grid cells. In the CPU-only homogeneous tests, the number of machine nodes goes up to 8192 with a total of 196,608 CPU cores, and the largest problem size in the “fine” configuration reaches 780 billion grid cells. In the CPU + MIC heterogeneous tests, the maximum number of machine nodes is 7168 with 1.376 million CPU + MIC processor/coprocessor cores, and the largest case has 680 billion grid cells. The weak-scalability results in Fig. 12 show how the running time changes with the problem size or, equivalently, with the number of machine nodes. From the results we find that, regardless of the platform type, either the CPU-only homogeneous system or the CPU + MIC heterogeneous system, and regardless of the load per node, either the “coarse” or the “fine” configuration, the running time remains essentially unchanged as the problem size increases in proportion to the computing resources, thus showing very good weak scalability.
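These headline numbers are consistent with the per-node configuration in Table 1; as a quick check (our own arithmetic, under the common assumption that one of the 57 cores on each MIC is reserved for the offload runtime):

\[
8192 \times 95.2\times10^{6} \approx 7.8\times10^{11}
\quad\text{and}\quad
7168 \times 95.2\times10^{6} \approx 6.8\times10^{11}\ \text{grid cells},
\]
\[
7168 \times (2\times12 + 3\times56) = 1{,}376{,}256 \approx 1.376\ \text{million cores}.
\]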

5. Conclusions

In this paper, efficient parallelization and optimization methods for large-scale CFD flow-field simulation on modern homogeneous and/or heterogeneous computer




Figure 11: Flow field result near the corner area in the simulation of test case “CompCorner”.


Figure 12: Weak scalability for two platforms and two testing cases

platforms are studied, focusing on the techniques of identifying the potential parallelism of applications, balancing the workload among all kinds of computing devices, tuning the multi-threaded code toward better performance within a machine node with hundreds of CPU/MIC cores, and optimizing the communication among nodes, among cores, and between CPUs and MICs. A series of numerical experiments is carried out on the Tianhe-1A and Tianhe-2 supercomputers to evaluate the performance. Among these CFD cases, the maximum number of grid cells reaches 780 billion. The tuned solver successfully scales to half of the Tianhe-2 supercomputer system with over 1.376 million heterogeneous cores, and the results show the effectiveness of our proposed methods.

Acknowledgments

We would like to thank NSCC-Guangzhou for providing access to the Tianhe-2 supercomputer as well as for their technical guidance. This work was funded by the National Natural Science Foundation of China (NSFC) under grant no. 61379056.

References

[1] A. Corrigan, F. F. Camelli, R. Lohner, J. Wallin, Running unstructured grid-based CFD solvers on modern graphics hardware, International Journal for Numerical Methods in Fluids 66 (2) (2011) 221–229.

[2] M. Griebel, P. Zaspel, A multi-GPU accelerated solver for the three-dimensional two-phase incompressible Navier-Stokes equations, Computer Science-Research and Development 25 (1) (2010) 65–73.

[3] C. Xu, L. Zhang, X. Deng, J. Fang, G. Wang, W. Cao, Y. Che, Y. Wang, W. Liu, Balancing CPU-GPU collaborative high-order CFD simulations on the Tianhe-1A supercomputer, in: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, IEEE, 2014, pp. 725–734.

[4] S. Saini, H. Jin, D. Jespersen, H. Feng, J. Djomehri, W. Arasin, R. Hood, P. Mehrotra, R. Biswas, An early performance evaluation of many integrated core architecture based SGI rackable computing system, in: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), IEEE, 2013, pp. 1–12.

[5] Y.-X. Wang, L.-L. Zhang, Y.-G. Che, et al., Heterogeneous computing and optimization on Tianhe-2 supercomputer system for high-order accurate CFD applications, in: Proceedings of HPC China 2013, Guilin, China, 2013.

[6] A. Gorobets, F. Trias, R. Borrell, G. Oyarzun, A. Oliva, Direct numerical simulation of turbulent flows with parallel algorithms for various computing architectures, in: 6th European Conference on Computational Fluid Dynamics, Barcelona, Spain, 2014.

[7] M. Vazquez, F. Rubio, G. Houzeaux, J. Gonzalez, J. Gimenez, V. Beltran, R. de la Cruz, A. Folch, Xeon Phi performance for HPC-based computational mechanics codes, Tech. rep., PRACE-RI (2014).

[8] L. Deng, H. Bai, D. Zhao, F. Wang, Kepler GPU vs. Xeon Phi: Performance case study with a high-order CFD application, in: 2015 IEEE International Conference on Computer and Communications (ICCC), 2015, pp. 87–94. doi:10.1109/CompComm.2015.7387546.

[9] C. W. Smith, B. Matthews, M. Rasquin, K. E. Jansen, Performance and scalability of unstructured mesh CFD workflow on emerging architectures, Tech. rep., SCOREC Reports (2015).

[10] X. Deng, H. Maekawa, Compact high-order accurate nonlinear schemes, Journal of Computational Physics 130 (1) (1997) 77–91.

[11] X. Deng, M. Mao, G. Tu, H. Liu, H. Zhang, Geometric conservation law and applications to high-order finite difference schemes with stationary grids, Journal of Computational Physics 230 (4) (2011) 1100–1115.

[12] Y.-X. Wang, L.-L. Zhang, W. Liu, Y.-G. Che, C.-F. Xu, Z.-H. Wang, Y. Zhuang, Efficient parallel implementation of large scale 3D structured grid CFD applications on the Tianhe-1A supercomputer, Computers & Fluids 80 (2013) 244–250.

[13] Y.-X. Wang, L.-L. Zhang, W. Liu, Y.-G. Che, C.-F. Xu, Z.-H. Wang, Grid repartitioning method of multi-block structured grid for parallel CFD simulation, Journal of Computer Research and Development 50 (8) (2013) 1762–1768.

[14] Y.-X. Wang, L.-L. Zhang, Y.-G. Che, et al., Improved algorithm for reconstructing singular connection in multi-block CFD application, Transaction of Nanjing University of Aeronautics and Astronautics 30 (S) (2013) 51–57.

[15] B. Li, L. Bao, B. Tong, Theoretical modeling for the prediction of the location of peak heat flux for hypersonic compression ramp flow, Chinese Journal of Theoretical and Applied Mechanics 44 (5) (2012) 869–875.
