SPE-163591-PA (Multiparadigm Parallel Acceleration for Reservoir Simulation)


Multiparadigm Parallel Acceleration for Reservoir Simulation

Larry S.K. Fung, SPE, Mohammad O. Sindi, SPE, and Ali H. Dogru, SPE, Saudi Aramco

    Summary

With the advent of the multicore central-processing unit (CPU), today's commodity PC clusters are effectively a collection of interconnected parallel computers, each with multiple multicore CPUs and large shared random access memory (RAM), connected together by means of high-speed networks. Each computer, referred to as a compute node, is a powerful parallel computer on its own. Each compute node can be equipped further with acceleration devices such as the general-purpose graphical processing unit (GPGPU) to further speed up computationally intensive portions of the simulator. Reservoir-simulation methods that can exploit this heterogeneous hardware system can be used to solve very-large-scale reservoir-simulation models and run significantly faster than conventional simulators. Because typical PC clusters are essentially distributed shared-memory computers, this suggests that the use of mixed-paradigm parallelism (distributed-shared memory), such as message-passing interface and open multiprocessing (MPI-OMP), should work well for computational efficiency and memory use. In this work, we compare and contrast the single-paradigm programming models, MPI or OMP, with the mixed-paradigm, MPI-OMP, programming model for a class of solver method that is suited for the different modes of parallelism. The results showed that the distributed-memory (MPI-only) model has superior multicompute-node scalability, whereas the shared-memory (OMP-only) model has superior parallel performance on a single compute node. The mixed MPI-OMP model and OMP-only model are more memory-efficient for the multicore architecture than the MPI-only model because they require less or no halo-cell storage for the subdomains.

To exploit the fine-grain shared-memory parallelism available on the GPGPU architecture, algorithms should be suited to the single-instruction multiple-data (SIMD) parallelism, and any recursive operations are serialized. In addition, solver methods and data store need to be reworked to coalesce memory access and to avoid shared memory-bank conflicts. Wherever possible, the cost of data transfer through the peripheral component interconnect express (PCIe) bus between the CPU and GPGPU needs to be hidden by means of asynchronous communication. We applied multiparadigm parallelism to accelerate compositional reservoir simulation on a GPGPU-equipped PC cluster. On a dual-CPU-dual-GPGPU compute node, the parallelized solver running on the dual-GPGPU Fermi M2090Q achieved up to 19 times speedup over the serial CPU (1-core) results and up to 3.7 times speedup over the parallel dual-CPU X5675 results in a mixed MPI-OMP paradigm for a 1.728-million-cell compositional model. Parallel performance shows a strong dependency on the subdomain sizes. Parallel CPU solve has a higher performance for smaller domain partitions, whereas GPGPU solve requires large partitions for each chip for good parallel performance. This is related to improved cache efficiency on the CPU for small subdomains and the loading requirement for massive parallelism on the GPGPU. Therefore, for a given model, the multinode parallel performance decreases for the GPGPU relative to the CPU as the model is further subdivided into smaller subdomains to be solved on more compute nodes. To illustrate this, a modified SPE5 (Killough and Kossack 1987) model with various grid dimensions was run to generate comparative results. Parallel performances for three field compositional models of various sizes and dimensions are included to further elucidate and contrast CPU-GPGPU single-node and multiple-node performances. A PC cluster with the Tesla M2070Q GPGPU and the 6-core Xeon X5675 Westmere was used to produce the majority of the reported results. Another PC cluster with the Tesla M2090Q GPGPU was available for some cases, and the results are reported for the modified SPE5 (Killough and Kossack 1987) problems for comparison.

    Introduction

Modern reservoir simulators in use by the oil and gas industry are computationally intensive software packages that are complex to build and maintain. A general-purpose simulator includes a diverse collection of algorithms and methods. Coupled multiphase-flow and transport problems are a tightly coupled system of nonlinear equations with significant spatial/temporal dependencies that need to be resolved for a stable transient solution. Core components include the nonlinear and linear solvers, various formulations, discretization methods both in time and space, treatment of faults and fractures, dual or multiple porosities and permeabilities, coupled geomechanics modeling, wellbore-modeling methods, near-well modeling methods, implicit coupled well-solution method, coupled surface-network modeling, field/reservoir/well/group management and optimization module, fluid- and rock-property calculation package with drainage and imbibition hysteresis modeling, phase-behavior package and equation-of-state calculation, Jacobian assembly, nonlinear update algorithm, timestepping algorithm, general fluid-in-place initialization algorithm that handles multiple fluid types, multiple rock types, a spatial distribution of composition and saturation, and complex input/output (I/O) processing required to manage history-match and field-performance evaluation workflows.

High-performance computing (HPC) technologies have evolved rapidly during the last 15 years from the very expensive centralized supercomputer to commodity-based PC clusters. A typical modern PC cluster may have hundreds of compute nodes connected together by means of high-speed networks. Each compute node contains tens of CPU cores and several gigabytes of memory. Primarily, two modes of parallelism have been used to speed up reservoir-simulation code: distributed-memory parallelism by use of the MPI standard and shared-memory parallelism by use of the OpenMP standard, because both are widely supported by hardware vendors. Shared-memory thread-based parallelization can be applied incrementally and locally to speed up certain computationally intensive loops. It is well-suited to parallelize code segments that fit the SIMD programming model. It is less flexible compared with the distributed-memory method, in which each process is independent. Distributed-memory programming is well-suited to the general multiple-instruction multiple-data (MIMD) programming model and can also be used for task parallelism; however, the entire simulator (algorithm and data structure) must be engineered for the distributed-memory model. In massively parallel applications, each of the simulator components needs to be efficiently parallel and data-distributed for the overall simulator to be scalable. The diverse algorithms and methods within a production-level reservoir simulator pose significant challenges to fully exploiting the available performance on HPC hardware that may contain several thousand computing cores to speed up the simulation.

Copyright © 2013 Society of Petroleum Engineers

This paper (SPE 163591) was accepted for presentation at the SPE Reservoir Simulation Symposium, The Woodlands, Texas, USA, 18-20 February 2013, and revised for publication. Original manuscript received for review 15 March 2013. Revised manuscript received for review 30 July 2013. Paper peer approved 16 October 2013.


The GPGPU is an emerging accelerator hardware for scientific computation. These units can be added by means of PCIe expansion slots to enhance computational horsepower. The GPGPU Fermi M2090Q has 512 cores and is capable of a peak of 256 fused multiply/add operations per clock at double precision. This is a significant performance multiple over the hex-core Westmere CPU X5675; however, algorithms suited for GPGPU computing need to exhibit massive parallelism with thousands of independent threads. This is not typical of many reservoir-simulation solution methods. Unstructured problems requiring indirect addressing, which can lead to shared memory-bank conflicts and inefficient global-memory access, will lead to poor performance and should be avoided. Methods and associated data that cannot be restructured for efficiency on the GPGPU may not realize any performance gains. Furthermore, the code segment should contain a sufficiently large amount of work so that the data movement through the PCIe bus does not overwhelm the performance gains in computation. In the best-case scenario, the data-copying cost can be hidden by means of asynchronous communication that is overlapped with computation.

Earlier studies involved programming the GPGPU devices with the low-level CUDA programming language, which was seen by many as undesirable because it is not very programmer-friendly. It requires significant time and effort to develop and debug code on the device. Studies such as Vuduc et al. (2010) caution that the productivity loss during the tedious practice of porting scientific code to the GPGPU may outweigh the performance gain in some cases. Consequently, the availability of compilers with CUDA application-programming-interface (API) extensions for high-level languages such as C (NVIDIA 2012a, b) and Fortran (The Portland Group 2011) has improved this situation. More recently, a directive-based approach with the proposal of the new OpenACC Application Programming Interface Standard (2013) for multiple accelerator devices may further reduce the development time. OpenACC is a directive-based programming standard for parallel programming of heterogeneous CPU/GPU systems.

Previous work to accelerate linear solvers for reservoir simulation has been reported by Klie et al. (2011). They ported Krylov-based solvers [generalized conjugate residual (GCR) or generalized minimum residual (GMRES) methods] with variants of block incomplete LU (BILU) factorization or symmetric successive over-relaxation (SSOR) preconditioners and compared performance between the GPGPU and the CPU. By using the state-of-the-art hardware at the time (Fermi GPGPU C2050 and Nehalem CPU X5570), they found that the GPGPU performance may be comparable to the performance obtained on eight cores of a single multicore device for their solver implementation.

Because the traditional preconditioning methods, such as the incomplete LU (ILU) factorization variants or the nested factorization (NF) method, are not suitable, Appleyard et al. (2011) developed the multicolor NF (MCNF) preconditioner as the accelerator for the Krylov solver GMRES to solve the sparse linear equations for reservoir simulation on the GPGPU. Their study was limited to parallelization on a single GPGPU, for which they reported good speedup for large models (>100,000 gridblocks), surpassing their robust serial solver running on the CPU that uses NF as the preconditioning method.

Zhou and Tchelepi (2013) also reported on accelerating the MCNF algorithm on the GPGPU as the pressure preconditioner of the constrained pressure residual (CPR) method in the fully implicit system solution. They reported results for a single GPGPU as well as multiple GPGPUs on a single compute node (with multiple PCIe expansion slots). In their work, only the pressure solve of the CPR algorithm was ported to run on the GPGPU, whereas the remainder of the solver runs on the CPU. For the SPE10 (Christie and Blunt 2001) problem, they reported a 19-times speedup over the one-core serial CPU speed for their implementation, and a scaling factor of 3.2 out of 4 for four GPGPU cards in a one-compute-node implementation. The solution of the SPE10 (Christie and Blunt 2001) problem is overwhelmingly dominated by the pressure-solution time. Thus, leaving the full solve on the CPU did not pose apparent performance issues for them. However, for the more general multiphase/multicomponent simulation problems, the full solve may be a more significant component of the overall solution time, and the achievable speedup factor will be lower.

Recently published data comparing the parallel performance of 11 parallel scientific applications indicate that the performance multiples obtained on GPUs over CPUs are typically only a small fraction of their quoted peak-performance ratios (Table 1). Oftentimes, the achieved performance on a routine or on a code segment can be significantly better than the overall application speedup, which might have generated initial optimism. Of course, what can be achieved will strongly depend on the algorithms and methods of the respective applications.

In the following sections, the solver method that is well-suited for multiparadigm parallel acceleration is described. The heterogeneous computing environment used for the project is a PC cluster with compute nodes containing the hex-core Westmere CPUs and the Fermi GPGPUs. This is followed by an explanation of mixed-paradigm (MPI-OMP) parallelization with unstructured domain decomposition on the multicore CPUs. The comparison among the MPI-only, the OMP-only, and the mixed MPI-OMP parallelization for a single compute node and multiple compute nodes is illustrated with example problems of various model sizes. Some of the issues with GPGPU parallel acceleration are then explained. They are related to the hardware architecture and memory model, which are significantly different from the cache-based multicore CPU architecture. The solver algorithms discussed in this work are well-suited to both the CPU mixed-paradigm parallelism and the GPGPU massive parallelism; however, the data organization and code are necessarily different to address the different architectures. Some of the pertinent issues are explained.

The GPGPU parallelization conducted here differs in several aspects from those previously cited. The use of GPGPU directives significantly reduces the development overhead involved with low-level CUDA programming. Our implementation uses the hybrid approach of MPI + OpenMP + GPGPU directives. This approach enables us to run on multiple CPUs and GPGPUs on multiple compute nodes in a distributed parallel fashion. We report the performance not only on synthetic SPE-type models but also on three full-field compositional models. All our computations are performed in 64-bit double-precision arithmetic, which is required for reservoir simulation.

TABLE 1—RESEARCHERS SQUEEZE GPU PERFORMANCE FROM 11 BIG SCIENCE APPS (FELDMAN 2012)

Application                                              Performance (XK6* vs. XE6**)   Software Framework
S3D (turbulent combustion)                               1.4                            OpenACC
NAMD (molecular dynamics)                                1.4                            CUDA
CP2K (chemical physics)                                  1.5                            CUDA
CAM-SE (community-atmosphere model)                      1.5                            PGI CUDA Fortran
WL-LSMS (statistical mechanics of magnetic materials)    1.6                            CUDA
GTC/GTC-GPU (plasma physics for fusion energy)           1.6                            CUDA
SPECFEM-3D (seismology)                                  2.5                            CUDA
QMCPACK (electronic structure of materials)              3.0                            CUDA
LAMMPS (molecular dynamics)                              3.2                            CUDA
Denovo (3D neutron transport for nuclear reactors)       3.3                            CUDA
Chroma (lattice quantum chromodynamics)                  6.1                            CUDA

* XK6: one Opteron 6200 plus one Fermi per node.
** XE6: two AMD Opteron 6200 per node (also known as Interlagos, with 16 cores as 8 dual-core modules on a chip).


Parallel-Solver Method

The iterative-solution method in this work uses the approximate-inverse preconditioner known as the Z-line power-series method to accelerate a Krylov-subspace solver such as Orthomin (Vinsome 1976) or GMRES (Saad and Schultz 1986). The Z-line power-series method is a special instance of the general line-solve power-series (LSPS) method and was discussed previously (Fung and Dogru 2008a). The mechanics of the method is to subdivide the Jacobian matrix into two parts:

A = P + E.  ........................................ (1)

In this approach, the matrices A, P, and E are fully unstructured. The important aspect of the approach is to choose P such that it includes the dominant terms of A and at the same time remains inexpensive to invert. Thus, matrix A can be written as

A = (I + E P^{-1}) P.  ........................................ (2)

The approximate-inverse preconditioner for A by use of the N-term power series is

A^{-1} \approx M_N^{-1} = \left[ I + \sum_{k=1}^{N} (-1)^k \, (P^{-1} E)^k \right] P^{-1}.  ........................................ (3)

In the Z-line power-series method, P is block-tridiagonal with grid cells ordered in the Z-direction first. The two-level CPR method (Wallis et al. 1985; Fung and Dogru 2008a) can be constructed that uses the LSPS method as the base-preconditioning method. An alternate choice of pressure solver, such as the algebraic multigrid method, can have a faster convergence rate but will be harder to parallelize and will have poorer scalability (Fung and Dogru 2008a). The present paper documents the performance results for the one-level solver only. If the LSPS preconditioner is used as the preconditioner for both the pressure solve and the full-system solve of the CPR algorithm, the expected performance gains by use of the multiparadigm parallelization approaches discussed herein will be comparable, with the attendant improvements in the robustness and the efficiency of the CPR method, as documented previously. The additional operations in the two-level CPR solver do not pose further issues for either the OMP parallelization or the GPGPU parallelization.
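Applying the N-term power-series preconditioner of Eq. 3 to a vector reduces to an alternation of Z-line solves with P and multiplications with E. The following is a minimal sketch in C of that recurrence, not the authors' implementation; the helpers apply_Pinv (the block-tridiagonal Z-line solves) and apply_E (multiplication by the interline coupling terms), as well as the workspace handling, are assumed placeholders.

    /* Hedged sketch: z = M_N^{-1} v per Eq. 3, built from Z-line solves (P^{-1})
       and interline couplings (E). apply_Pinv() and apply_E() are assumed helpers. */
    void apply_Pinv(int n, const double *in, double *out); /* block-tridiagonal Z-line solves */
    void apply_E(int n, const double *in, double *out);    /* multiply by E */

    void apply_lsps_preconditioner(int n, int N, const double *v, double *z,
                                   double *t, double *w)   /* t, w: length-n workspaces */
    {
        apply_Pinv(n, v, z);                      /* z = P^{-1} v */
        for (int i = 0; i < n; ++i) t[i] = z[i];  /* t holds the current series term */
        double sign = 1.0;
        for (int k = 1; k <= N; ++k) {            /* accumulate (-1)^k (P^{-1}E)^k P^{-1} v */
            apply_E(n, t, w);                     /* w = E t */
            apply_Pinv(n, w, t);                  /* t = P^{-1} w */
            sign = -sign;                         /* (-1)^k */
            for (int i = 0; i < n; ++i) z[i] += sign * t[i];
        }
    }

Each pass of the loop costs one set of independent Z-line solves and one MV-type multiplication with E, which is why the method maps naturally onto both thread-level and SIMD-style parallelism.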

The Z-line power-series method involves Z-line solves coupled with matrix-vector (MV) multiplication operations in a sequence. For a structured grid, a red-black line-reduction step may be applied to reduce the work counts. The Orthomin method was chosen to generate the reported results in this paper. It consists of a series of vector dot products and an MV multiplication operation. It is relatively easy to parallelize on both the CPU and the GPU in multiple paradigms and is a small fraction of the overall solution cost. However, on the GPU, the implementation of the reduction operation must be free of shared memory-bank conflicts to have good performance. Memory-bank conflicts will lead to the serialization of operations. Branching will also lead to the serialization of operations on the GPU because each half-warp (16 threads) must execute identical instructions on the GPGPU cores. Algorithms that are suited to the SIMD hardware will naturally have an advantage running on the GPGPU. The NVIDIA GPGPU provides the single-instruction multiple-threads (SIMT) programming model that allows more general MIMD code to run and handles the necessary serialization automatically for the application developers. However, such applications without proper re-engineering will suffer a significant performance disadvantage.
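To illustrate the bank-conflict-free reduction requirement, the following CUDA sketch (an illustration, not the authors' kernel) computes per-block partial sums of a dot product with sequential addressing in shared memory, so that the active threads of a half-warp touch distinct banks; the per-block results are then combined on the host or in a second kernel, or the whole operation can be replaced by a library call such as cublasDdot.

    __global__ void dot_partial(const double *x, const double *y,
                                double *block_sums, int n)
    {
        extern __shared__ double cache[];            /* one entry per thread */
        int tid = threadIdx.x;
        double s = 0.0;
        for (int i = blockIdx.x * blockDim.x + tid; i < n;
             i += gridDim.x * blockDim.x)            /* grid-stride accumulation */
            s += x[i] * y[i];
        cache[tid] = s;
        __syncthreads();
        /* tree reduction with sequential addressing: no shared-memory bank conflicts */
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                cache[tid] += cache[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            block_sums[blockIdx.x] = cache[0];       /* one partial sum per block */
    }

A launch such as dot_partial<<<blocks, threads, threads*sizeof(double)>>>(x, y, sums, n), with threads a power of two, is assumed in this sketch.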

Heterogeneous (CPU-GPGPU) Parallel-Computing Environment

The computer used for this study is a Dell cluster of PowerEdge C6100 nodes. The cluster consists of 32 compute nodes with dual multicore processors. Each processor is an Intel X5675 hex-core (Westmere) at 3.07 GHz; therefore, each node is equipped with 12 processing cores in total. The operating system running on the nodes is RedHat Enterprise Linux Server 5.4 with the 2.6.18-164.15.1.el5 kernel. Each node is equipped with an Infiniband host-channel adapter supporting quad-data-rate (QDR) connections between the nodes through a Qlogic 12800-40 Infiniband switch. Each node is also equipped with 48 GB of memory. For MPI communication over the Infiniband network, MVAPICH2 was used for all the simulation runs. In mixed-paradigm computing, it is important to correctly set up the environment variables for pinning processes to CPU cores. This is documented in the MVAPICH2 user manual (Network-Based Computing Laboratory 2008).

In addition to the two CPU processors, each node of the cluster is also equipped with two NVIDIA M2070Q GPGPUs, which are based on NVIDIA's Fermi technology. Each GPGPU has 448 processing cores [14 streaming multiprocessors (SMs) with 32 cores each] at 1.15 GHz, along with 6 GB of memory. With this setup, a one-to-one binding between the CPU processors and the GPGPUs is maintained; thereby, each node has two CPU processors and two GPGPUs. Each group of 16 GPGPUs is hosted in a Dell PowerEdge C410x PCIe expansion chassis, with a total of 64 GPGPUs being hosted in four PCIe expansion chassis. NVIDIA's 275.09.07 driver with CUDA 4.0 and compute capability 2.0 was used with the GPGPUs, along with PGI's 11.8 compiler supporting accelerator directives. Fig. 1 illustrates the cluster-hardware layout used to conduct this study.

Fig. 1—Cluster layout consisting of 32 nodes in total with 64 GPGPUs. Nodes are interconnected by means of Infiniband QDR (four nodes per PowerEdge C6100 chassis, 16 GPGPUs per PowerEdge C410x chassis, and a Qlogic 12800-40 Infiniband switch with optical QDR links).

Mixed-Paradigm (MPI-OMP) Parallelization With Unstructured-Domain Partitions

Mixed-paradigm parallelization (distributed-shared-memory model) partitions the grid into an equal division of Z-lines by use of a graph-partitioning algorithm that produces a nearly equal number of active cells per partition with minimized interdomain connections. The current implementation uses the MPI (1995) standard for distributed memory and the OpenMP (2011) standard for shared memory. Each grid partition is owned by an MPI process. The data, both matrix and vector, within each partition are organized to facilitate communication hiding and memory access for the solver methods. The dual-level graph-based matrix-data organization was introduced previously in terms of a distributed parallel method (Fung and Dogru 2008b) with global cell lists and ordering; however, in the massively parallel implementation, local cell lists and ordering methods were implemented without a global hash table (Fung and Mezghani 2013). This methodology is more memory-efficient, and it has no upper limits on model sizes. In mixed-paradigm parallelism, each MPI process spawns a user-specified number of OMP threads. The OMP-loop limits for each thread are set on the basis of an equal division of Z-lines in each subdomain. When the line lengths are variable, a second-level graph partition is required in which the weight for each Z-line is the number of active cells. For good parallel performance, the number of fork-join and synchronization points for OMP threads needs to be minimized. This can be accomplished by expanding the OMP parallel regions to encompass the entire solver code. Communication hiding is easily implemented with proper matrix and vector data layouts and by use of an asynchronous peer-to-peer communication protocol. Computational work for the subdomain interior can now overlap with communication. The work on the subdomain boundary can start when all the data in the halo of the subdomain have been received. The current implementation uses the master thread of each MPI partition to handle the interprocess communication. When the code is instrumented with mixed-paradigm parallelism, the choice of MPI-only, OMP-only, or mixed-paradigm MPI-OMP is a runtime decision made by simply setting the number of compute nodes, the number of processes per node, and the number of OMP threads per process in a job script. Shared-memory parallelism is limited to one compute node in the typical PC-cluster computing environment.
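A minimal sketch of this communication-hiding pattern is shown below (in C with MPI and OpenMP, not the production code): the master thread of each MPI process posts the asynchronous halo exchange, all threads sweep the interior Z-lines while the messages are in flight, and the boundary lines are processed only after the halo has arrived. The helpers post_halo_exchange() and solve_line() are assumed placeholders, and MPI is assumed to be initialized with at least MPI_THREAD_FUNNELED support.

    #include <mpi.h>

    void post_halo_exchange(MPI_Request *reqs, int nreqs);  /* MPI_Isend/MPI_Irecv per neighbor */
    void solve_line(int line);                              /* one Z-line of preconditioner work */

    void preconditioner_sweep(int n_interior, int n_boundary,
                              MPI_Request *reqs, int nreqs)
    {
    #pragma omp parallel
        {
    #pragma omp master
            post_halo_exchange(reqs, nreqs);     /* nonblocking sends/receives of halo cells */

            /* interior work overlaps the communication; nowait lets threads proceed */
    #pragma omp for schedule(static) nowait
            for (int line = 0; line < n_interior; ++line)
                solve_line(line);

    #pragma omp master
            MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
    #pragma omp barrier                          /* halo data now complete for every thread */

    #pragma omp for schedule(static)
            for (int line = n_interior; line < n_interior + n_boundary; ++line)
                solve_line(line);
        }
    }

Keeping the whole sweep inside one parallel region, as here, is what minimizes the fork-join and synchronization overhead mentioned above.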

Results for Mixed-Paradigm (MPI-OMP) Parallelization

A series of problems with various domain sizes was set up to compare parallel performances among MPI-only, OMP-only, and mixed MPI-OMP models on one node (2 X5675, 12 cores). The model sizes ranged from 50,000 cells to 1 million cells (NX × NY × NZ = 100 × 100 × 5, 100 × 100 × 10, 100 × 100 × 20, 100 × 100 × 50, and 100 × 100 × 100). For each model size, the MPI-OMP configurations of 1-12, 12-1, 2-6, 6-2, and 4-3 were tested. The 1-12 configuration is the shared-memory (1-process, 12-thread) configuration, and the 12-1 configuration is the distributed-memory (12-process, 1-thread) configuration. The other three configurations are mixed-paradigm configurations. All runs are converged to 1.0 × 10^-8, and each run of the same model has exactly the same number of iterations, residual norms, and change vectors. The tight convergence tolerance used in our testing is to purposely check for the effects of the differences in the treatment of double-precision arithmetic on different hardware architectures. This also ensures the correctness of the parallel implementation and the independence of solution work counts for each parallelization option. Solver tolerances in production runs will normally be looser, more typically 1 × 10^-4. The parallel speedup factors over the serial (1-core) solution time on one compute node are summarized in Table 2 and plotted in Fig. 2. The dependence of parallel performance on model size is clearly shown in Fig. 2, in which the smaller model has better parallel performance because of improved cache efficiency. On one node, shared-memory parallelization performs better than distributed memory or mixed paradigm for this solver method. Mixed-paradigm and MPI-only parallelization produce comparable performances. To test the multicompute-node performance, the 1-million-cell model is solved on up to 12 compute nodes (144 cores) in mixed-paradigm (MPI-OMP n-12, 2n-6, 3n-4, 4n-3, or 6n-2) configurations or a 12n-1 MPI-only configuration, in which n is the number of compute nodes. The speedup factors, normalized to the parallel solution time on one compute node, are plotted in Fig. 3. The normalization is performed on the basis of the respective one-node result for each MPI-OMP configuration, as summarized in the last column of Table 2. For example, the four-compute-node run (n = 4) of the 2n-6 case uses the timing of the MPI-OMP 2-6 case for normalization. This gives the valid one-node to multinode speedup factors for the respective MPI-OMP configurations. The speedup factors from one core to one node (12 cores) are stated in Table 2. Thus, the speedup factor for up to twelve nodes (144 cores) over the serial one-core run can be calculated by use of Table 2 and Fig. 3. That factor is 13.8 × 5.32 = 73.4 for the 12n-1 MPI-only configuration. In the multicompute-node application, MPI-only has superlinear scalability throughout the range of compute nodes tested. The improved parallel performance with decreasing subdomain sizes is evident; however, mixed-paradigm parallelization has poorer performance in the multinode situations. This is primarily because of cache-line conflicts of the OMP parallelization in the halo computation, which can be improved with thread-private workspace or fewer threads for the halo computation.

TABLE 2—COMPARISON OF MIXED-PARADIGM (MPI-OMP) PARALLEL PERFORMANCE FOR VARIOUS MODEL SIZES ON 1 NODE (2 X5675 CPU, 12 CORES)
(Entries are solver speedup factors over the serial 1-core solution time; configurations are given as MPI processes - OMP threads.)

Model Size (cells):   50,000   100,000   200,000   500,000   1,000,000
1-12                    9.96      9.64      8.12      5.82        5.56
12-1                    7.97      8.15      7.52      5.58        5.32
2-6                     8.15      8.06      7.18      5.53        5.09
6-2                     7.44      7.48      7.01      5.29        5.21
4-3                     8.12      7.80      7.27      5.59        5.32

Fig. 2—Mixed-paradigm parallel performance on 1 node (2 X5675 CPU, 12 cores) for various model sizes (speedup factors vs. model size for the MPI-OMP configurations 1-12, 12-1, 2-6, 6-2, and 4-3).

Fig. 3—Parallel performance on multiple compute nodes for the 1-million-cell model (solver speedup factors vs. number of compute nodes for the N-12, 2N-6, 3N-4, 4N-3, 6N-2, and 12N-1 configurations, with the ideal line).



    GPU Parallelization

The abstracted view of the x86 CPU plus GPGPU Fermi architecture is shown in Fig. 4. The GPGPU Fermi architecture has 32 CUDA cores per SM. Each SM schedules threads in groups of 32 threads called warps. Each SM has a dual-warp scheduler that simultaneously schedules and dispatches instructions for two independent warps. A kernel program executing on the GPGPU organizes threads on a grid of thread blocks. Threads in a thread block cooperate through barrier synchronization and shared memory. It is important that kernel-code execution does not involve shared memory-bank conflicts, which may lead to the serialization of execution. An example of a memory-bank conflict is illustrated in Fig. 5. The memory model for CUDA coding is per-thread private, per-block shared, and per-grid global after kernel-wide synchronization. Coalesced memory access allows the efficient movement of data from global memory to the shared memory of each SM. Code and data that are structured for efficiency on a cache-based architecture, such as the x86, must be reorganized to realize performance on the GPGPU.
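For reference, a minimal CUDA fragment (illustrative, not taken from the simulator) shows how work over ncell cells is organized as a grid of thread blocks, with the block size chosen as a multiple of the 32-thread warp:

    #include <cuda_runtime.h>

    __global__ void axpy_kernel(int n, double a, const double *x, double *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per cell */
        if (i < n)
            y[i] += a * x[i];
    }

    /* host-side launch: block size is a multiple of the warp size (32) */
    void launch_axpy(int ncell, double a, const double *x_dev, double *y_dev)
    {
        int threads = 256;
        int blocks  = (ncell + threads - 1) / threads;
        axpy_kernel<<<blocks, threads>>>(ncell, a, x_dev, y_dev);
    }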

Coalesced global-memory access happens when each half-warp (16 threads) of an SM accesses a contiguous region of device memory. Each SM has configurable shared memory of between 16 KB and 48 KB. Shared memory of the SM is organized into 16 banks of 4-byte words, with 256 rows of banks for the 16-KB configuration, similar to benches in a football stadium. A memory bank represents a column of benches that feeds data to the GPGPU cores of a warp. When several threads of a half-warp access data that fall in the same bank, memory-bank conflicts occur and the SIMT operations serialize. For CPU computation, the data store is organized to optimize cache reuse, thus minimizing memory traffic. This means that multiple data elements accessed by the same CPU core in a computation may be stored consecutively. This organization is obviously not good for SIMT GPGPU computing.
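The layout point can be made concrete with a small sketch (illustrative assumptions, not the simulator's data structures): with NC unknowns per cell, a cell-contiguous (array-of-structures) layout lets one CPU core reuse a cache line across the unknowns of a cell, whereas a field-contiguous (structure-of-arrays) layout makes consecutive threads of a half-warp read consecutive addresses, which is what coalescing requires.

    #define NC 8   /* unknowns per cell (assumed for illustration) */

    /* CPU/cache-friendly: unknown j of cell i stored at aos[i*NC + j] */
    static inline double get_aos(const double *aos, long i, int j)
    {
        return aos[i * NC + j];
    }

    /* GPGPU/coalesced: unknown j of cell i stored at soa[j*ncell + i];
       threads handling consecutive cells i then touch consecutive words */
    static inline double get_soa(const double *soa, long ncell, long i, int j)
    {
        return soa[j * ncell + i];
    }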

To offload computationally intensive components of the simulator to the GPGPU, several programming choices are available. These include direct CUDA programming, OpenCL, DirectCompute, CUDA C/C++, CUDA Fortran, or a directive-based approach through the PGI accelerator (ACC) directives, which are one of the foundations of the new OpenACC standard used to offload kernels from the CPU to accelerator devices such as the GPGPU. To evaluate suitability for production-level application development, we chose the directive-based method and, where necessary, supplemented it with CUDA Fortran. In this approach, the GPGPU code can be compiled and run or debugged on the CPU before generating the acceleration kernels for the GPGPU. This was useful to speed up the development process.
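As a flavor of the directive-based approach (shown here as an OpenACC-style C fragment under assumed names, rather than the authors' PGI-accelerated Fortran), the same loop nest compiles for the host when the directives are ignored and is offloaded as a GPGPU kernel when they are honored:

    /* scale each cell value by a per-line factor: one gang per Z-line,
       vector parallelism across the cells of the line */
    void scale_lines(int nlines, int ncell_per_line, const double *factor, double *x)
    {
        int n = nlines * ncell_per_line;
    #pragma acc parallel loop gang copyin(factor[0:nlines]) copy(x[0:n])
        for (int l = 0; l < nlines; ++l) {
    #pragma acc loop vector
            for (int i = 0; i < ncell_per_line; ++i)
                x[l * ncell_per_line + i] *= factor[l];
        }
    }

Because the directives are ignored by a plain host compiler, such code can be debugged on the CPU first, which is the development benefit noted above.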

Fig. 4—An abstracted view of the x86 CPU + GPGPU Fermi accelerator architecture (x86 host with host memory and an execution queue; device memory with RDMA, a Level 2 cache, and streaming multiprocessors with dual-warp issue, special-function units, and user-selectable hardware/software cache).

Fig. 5—Examples showing GPGPU threads accessing shared memory with no bank conflict (a) and with 2-way bank conflicts (b, c).


Multiparadigm Parallelization With Structured-Domain Partitions

The first-generation in-house parallel reservoir simulator was chosen for multiparadigm parallelization on the heterogeneous computing environment. The simulator uses a structured domain-partitioning scheme along the X- and Y-axes of the grid. In mixed paradigm, the X-dimension is subdivided into n slices, in which n is equal to the number of MPI processes. Each MPI process spawns m threads within the OMP-parallel regions. The Y-dimension is subdivided into m equal slices, and each slice is assigned to a thread to compute. This domain-partitioning scheme is simple but less flexible, and the load balancing may be poorer than that achieved by the unstructured domain-partitioning method discussed earlier; however, the code is well-suited to multiparadigm parallelization on the CPU-GPGPU architecture. In this experimentation, the solver component of the simulator can optionally be offloaded to the GPGPUs, whereas the rest of the simulator runs parallel on the CPUs. Alternatively, the entire simulator can run in mixed paradigm (MPI-OMP) on CPUs only.
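A minimal sketch of this structured partitioning (an illustration, not the simulator's code) splits the X range across the n MPI processes and, within each process, splits the Y range across the m OpenMP threads into nearly equal slices:

    /* half-open slice [beg, end) of a range of length len for part idx of parts */
    static void slice(int len, int parts, int idx, int *beg, int *end)
    {
        int base = len / parts;
        int rem  = len % parts;                      /* first rem parts get one extra element */
        *beg = idx * base + (idx < rem ? idx : rem);
        *end = *beg + base + (idx < rem ? 1 : 0);
    }

    /* usage: slice(NX, num_mpi_procs, rank, &ibeg, &iend);
              slice(NY, num_omp_threads, thread_id, &jbeg, &jend); */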

The Fermi CUDA architecture supports the concurrent execution of multiple kernels. This is useful for codes with multiple independent tasks that can be executed together, but it is not useful for porting the linear solver. The most-efficient approach is to offload one subdomain per GPGPU in a kernel from one MPI process. In our compute environment, each compute node consists of two CPUs and two GPGPUs. Therefore, a configuration of two MPI processes per node is used; each MPI process drives one Fermi M2070Q. A piece of supplementary code was added for this purpose. The code was written with PGI's CUDA Fortran runtime-library routines for GPGPU accelerators so that odd-numbered MPI processes bind to GPGPU 1 and even-numbered processes bind to GPGPU 0. The code is called at initialization before entering the accelerated region. Data transfer between CPU and GPGPU memory through the PCIe bus occurs simultaneously, and asynchronous communication has been implemented by means of cudaMemcpyAsync function calls to overlap remote direct-memory access (RDMA) with computational work. In this work, the asynchronous transfer of the [A] matrix and the right-hand-side vector is overlaid with computation and is better than 95% effective, but the halo communication for MPI partitions is not overlapped. In the solver implementation, block-line solves are serial. Lines are organized into blocks of threads for massive SIMD-type parallelism. The GPGPU solver is recoded to coalesce memory access and to avoid shared-memory bank conflicts in MV operations as well as in reduction operations. In particular, a Krylov-subspace method, such as Orthomin or GMRES, involves many vector dot products. Hand coding of this aspect must be performed to eliminate memory-bank conflicts, or a suitable cuBLAS library function can simply be used.
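The binding and staging described above can be sketched as follows with the CUDA runtime API in C (the authors used PGI CUDA Fortran; names and sizes here are illustrative assumptions). Note that the host buffers must be page-locked (for example, with cudaHostRegister) for cudaMemcpyAsync to overlap with CPU work.

    #include <cuda_runtime.h>

    /* called once at initialization: even MPI ranks drive GPGPU 0, odd ranks GPGPU 1 */
    void bind_rank_to_gpu(int mpi_rank)
    {
        cudaSetDevice(mpi_rank % 2);
    }

    /* stage the Jacobian and right-hand side asynchronously so the copies overlap
       with CPU work; host buffers are assumed page-locked (e.g., cudaHostRegister) */
    void stage_system(const double *A_host, double *A_dev, size_t n_A,
                      const double *rhs_host, double *rhs_dev, size_t n_rhs,
                      cudaStream_t stream)
    {
        cudaMemcpyAsync(A_dev, A_host, n_A * sizeof(double),
                        cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(rhs_dev, rhs_host, n_rhs * sizeof(double),
                        cudaMemcpyHostToDevice, stream);
        /* ... CPU-side work proceeds here while the copies are in flight ... */
        cudaStreamSynchronize(stream);   /* data must be resident before the solve kernels */
    }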

On a single node, MPI communication involves only a shared-memory copy and does not generate Infiniband traffic. Data on the GPGPU must be moved back to the CPU through the PCIe bus for this purpose. For multinode parallel applications and without the use of NVIDIA's GPUDirect technology (NVIDIA 2013), three data copies are required for MPI, as illustrated in Fig. 6. This data traffic can reduce parallel scalability. The use of the GPUDirect technology was not possible in our testing because the Infiniband interconnect hardware did not support it. The more important issue that reduces the attractiveness of spreading work over multiple compute nodes comes from the reduction of the subdomain sizes and, therefore, the workload on each GPGPU. This is contrary to what we want to do to speed up the application by distributing the work to more hardware.

Results for Multiparadigm Parallelization With CPU+GPGPU

To test the parallel performance in multiparadigm, the SPE5 compositional model (Killough and Kossack 1987) was modified into several model sizes. The SPE5 model has six hydrocarbon components with the water-alternating-gas process. We ran the model for 1.5 years with water/CO2/water injection at 6-month cycles. The original model was modified to several model sizes, as shown in Table 3. We expect both the sizes and shapes of the model to have an impact on GPGPU performance. Fig. 7 shows GPGPU-solver speedup factors against the serial (1-core) CPU solver time. Both the Fermi M2070Q and M2090Q results are plotted; the M2090Q has, on average, a 20% better performance than the M2070Q. In the figure, the dependence of parallel performance on model size is very evident. At a subdomain size of 864,000 cells (1.728 million cells in two subdomains), a factor of 19 over the serial CPU speed is realized for the solver on one compute node (two-GPGPU/two-CPU configuration). For smaller model sizes, the speedup multiples are more modest. Fig. 8 shows the GPGPU solver speedup against the mixed-paradigm (MPI-OMP 2-6) parallel CPU-solver time. Our solver-timing comparisons are the actual overall time cost to solution, inclusive of all communication costs when they are not, or cannot be, overlapped. This comparison represents the real gain obtained by offloading the solver code onto the GPGPU. For these models, a factor of approximately 3.7 was achieved for the solver on the M2090Q GPGPU at model sizes larger than approximately 1 million grid cells. The grid dimensions also have an impact on GPGPU parallel performance because kernel computation is organized into parallel blocks of threads assigned to the SM processors. Actual field-model dimensions are not multiples of 16, which is the optimal dimension to use for GPGPU computing. The effect of model-grid dimensions on parallel performance is not investigated in the present study, but it may account for some of the scatter in the performance trend against model sizes.

Fig. 6—Extra buffer copying needed to move halo data of subdomains for MPI communication in multicompute-node applications (the panels contrast the data path from GPU memory through system memory to the Infiniband adapter without and with GPUDirect).

TABLE 3—MODEL SIZES AND SHAPES OF COMPOSITIONAL MODELS USED FOR TESTING MULTIPARADIGM PARALLELIZATION

Model Size (cells)    Model Shape (NX × NY × NZ)
25,000                50 × 50 × 10
49,000                70 × 70 × 10
100,000               100 × 100 × 10
225,000               150 × 150 × 10
400,000               200 × 200 × 10
972,000               180 × 180 × 30
1,200,000             200 × 200 × 30
1,728,000             240 × 240 × 30


With the solver running on the GPGPU and the remainder of the simulator running on the CPU in mixed paradigm (MPI-OMP 2-6 configuration), Fig. 9 shows the multiparadigm (MPI-OMP-ACC) speedup factors for the complete runs relative to the serial (one-core) simulation time. They are comparisons of the actual wall times for the simulation runs on the respective hardware tested. Depending on the model sizes and the version of Fermi GPGPU, speedup factors from four to nine were achieved. Fig. 10 shows the multiparadigm speedup factors relative to mixed-paradigm parallel simulation running on the CPU only. This represents the performance gain achieved by offloading the solver component of the simulator onto the GPGPU for a single compute node. It is noted that, in Fig. 9, the multiparadigm-simulator speedup over serial one-core CPU speed shows a monotonically increasing trend with model size over the entire range of model sizes tested. In Fig. 10, the parallel GPGPU-CPU multiparadigm parallelization vs. parallel CPU mixed-paradigm parallelization shows a parabolic trend with a somewhat lower speedup factor for the largest models. This is because the relative parallel scalability of the simulator portion and of the solver portion of the simulator is different. The solver portion is more scalable than the simulator portion. As a result, the percentage of time spent in the solver becomes smaller for the largest models, which is reflected in the overall speedup factors.

To compare multicompute-node scalability, the 1.728-million-cell model is also solved by use of two, four, and six nodes. The scalability results for the GPGPU and the CPU parallel solver are illustrated in Fig. 11, in which the multinode speedup factors are normalized to the results of their respective single-node performance. Internode MPI communication for the GPGPU (without GPUDirect) will involve three buffer copies, with an additional buffer copy on the CPU. In addition, the subdomain size decreases as the model is subdivided into smaller and smaller subdomains, which reduces the parallel performance of the GPGPUs. These combinations of factors result in lower scalability for parallel GPGPU solve compared with parallel CPU solve as the subdomain size decreases.

Fig. 7—Solver speedup factors on the dual-GPGPU Fermi against serial (1-core) X5675 CPU time for the modified SPE5 problem of various model sizes on a single compute node (M2070 and M2090 curves).

Fig. 8—Solver speedup factors on the dual-GPGPU Fermi against parallel hex-core dual-X5675 CPU time for the modified SPE5 problem of various model sizes on a single compute node (M2070 and M2090 curves).

Fig. 9—Overall simulator speedup factors by use of multiparadigm parallel acceleration (GPGPU+CPU) over serial 1-core CPU speed for the modified SPE5 problem of various model sizes on a single compute node (CPU+M2070 and CPU+M2090 curves).

Fig. 10—Overall simulator speedup factors by use of multiparadigm parallel acceleration (GPGPU+CPU) over parallel CPU speed for the modified SPE5 problem of various model sizes on a single compute node (CPU+M2070 and CPU+M2090 curves).


Three compositional models (Models A, B, and C) with increasing model sizes were also tested. Model A has a 90 × 138 × 10 grid (124,200 cells) and is an eight-component model with 21 vertical wells and a 30-year simulation period. Model B has an 81 × 106 × 30 grid (257,580 cells) and is an eight-component model with eight complex wells and 7 years of simulation. Model C has a 199 × 1188 × 26 grid (6,146,712 cells) and is a nine-component model with 231 complex wells and 7 years of simulation. A detailed comparison of the solver components is given in Table 4 for Model A and in Table 5 for Model B. Models A and B are solved on a single compute node. The solver speedup factors of the M2070Q Fermi over the parallel CPU solve were 1.99 and 2.22, respectively, for these two models. Note that the M2070Q is approximately 20% slower than the M2090Q for this code. For Model C, simulation runs on 7, 8, 10, and 16 compute nodes were conducted for both the GPGPU parallel-solve and the CPU parallel-solve options. The speedup factors comparing the GPGPU-solve option with the CPU-solve option are shown in Fig. 12. The GPGPU-solve speedup multiple over the CPU parallel solve decreases from 3.0 to 1.45 as the number of compute nodes increases from seven to 16. The improvement in terms of overall simulator runtime is also indicated in the plot; it decreases from a factor of 1.5 to a factor of 1.15. These multinode results are consistent with the results from the modified SPE5 (Killough and Kossack 1987) problems that were previously discussed in detail. All models tested are actual field models with model dimensions that are not divisible by 16 (the size of a half-warp). To improve coalesced-memory access, it will be necessary to use pitched memory for the multidimensional arrays. This is not implemented in our current work and may be considered as a future code enhancement for GPGPU computing. In general, the GPGPU solver requires large subdomain sizes for good performance. As large models are solved on more compute nodes, the smaller subdomain partitions yield lower parallel performance on the GPGPU compared with CPU-only simulation.

Fig. 12—Performance comparison between the use of GPGPU and CPU parallel solve on multiple compute nodes for Model C, which is a 6.15-million-cell nine-component compositional model (solver and overall speedup factors vs. number of compute nodes).

    Summary and Conclusions

On a single compute node with dual hex-core Westmere X5675 CPUs, the parallel LSPS-preconditioned Krylov iterative solver in mixed paradigm (MPI-OMP) has better parallel performance running in OMP-only mode, whereas the parallel efficiency is comparable between the mixed MPI-OMP and MPI-only modes. In all cases, parallel performance increases as model size decreases because of better cache efficiency.

Fig. 11—Multiple-compute-node parallel scalability for the 1.728-million-cell modified SPE5 compositional model (GPGPU M2070Q solve, CPU solve, and ideal curves vs. number of compute nodes). Each node consists of dual-GPGPU Fermi M2070Q and dual-CPU hex-core Westmere X5675.

TABLE 4—SPEEDUP-FACTOR COMPARISON FOR INDIVIDUAL SOLVER COMPONENTS OF MODEL-A SIMULATION

Solver Component      GPU Parallel (M2070Q)    CPU Parallel (Hex-Core Westmere)    CPU Serial
Preconditioning       208 seconds              440 seconds                         2,097 seconds
  Speedup factor      10.08                    4.76                                1
MV multiply           109 seconds              197 seconds                         922 seconds
  Speedup factor      8.46                     4.68                                1
Orthomin              12 seconds               42 seconds                          108 seconds
  Speedup factor      9.0                      2.57                                1
Overall solve         361 seconds              720 seconds                         3,276 seconds
  Speedup factor      9.07                     4.55                                1

Overall parallel GPU/CPU solver time ratio = 9.07/4.55 = 1.99.

TABLE 5—SPEEDUP-FACTOR COMPARISON FOR INDIVIDUAL SOLVER COMPONENTS OF MODEL-B SIMULATION

Solver Component      GPU Parallel (M2070Q)    CPU Parallel (Hex-Core Westmere)    CPU Serial
Preconditioning       306 seconds              655 seconds                         2,671 seconds
  Speedup factor      8.73                     4.07                                1
MV multiply           95 seconds               238 seconds                         913 seconds
  Speedup factor      9.61                     3.84                                1
Orthomin              29 seconds               66 seconds                          241 seconds
  Speedup factor      8.3                      3.65                                1
Overall solve         450 seconds              999 seconds                         3,921 seconds
  Speedup factor      8.71                     3.94                                1

Overall parallel GPU/CPU solver time ratio = 8.71/3.94 = 2.21.


On multiple compute nodes, MPI-only has better parallel performance than mixed paradigm (MPI-OMP). This is primarily a result of cache-line conflicts of the OMP parallelization in the halo computation, which can be improved with thread-private workspace or fewer threads for the halo computation. Mixed paradigm in multinode computing has lower memory consumption as a result of the reduction in halo-cell storage for the fewer distributed subdomains. The choice of MPI-only, OMP-only, or mixed MPI-OMP is a runtime decision.

A multiparadigm-parallelization approach has been successfully implemented in the first-generation in-house parallel reservoir simulator. The simulator is massively parallel, and it uses a structured domain-partitioning scheme. The simulator portion runs in mixed paradigm (MPI-OMP) on the CPUs. The linear solver can optionally run on either the GPGPU (MPI-ACC) or the CPU (MPI-OMP). Numerical experimentation shows the following characteristics:

• On a single compute node, the solver speedup factor on the GPGPU over the CPU strongly depends on the model size. For the range of model sizes tested, with the Fermi M2090Q, a solver speedup factor of 6 to 19 over the serial CPU speed and a factor of 1.4 to 3.7 over the parallel CPU speed were achieved. The overall simulator speedup factor is 5 to 9 over the serial CPU speed and 1.37 to 1.57 over the parallel CPU speed, in which only the solver was ported to run on the GPGPU.

• Multicompute-node parallel scalability is better for the CPU-only simulation runs than for the CPU+GPGPU simulation runs. The primary reason is that the GPGPU requires a large amount of parallel work for good performance, whereas the CPU is more cache-efficient at smaller subdomain sizes. The division of a model into more subdomains gives less work to each GPGPU, which yields lower performance. The secondary reason is that MPI data movement requires extra memory copies.

• The various components of the solver (preconditioner, matrix-vector multiplication, Orthomin) show similar performance multiples when comparing the GPGPU with the CPU parallel speedup. Solver time is dominated by the preconditioner, and Orthomin represents less than 10% of parallel-solver time. This is true for both the GPGPU and the CPU parallelization.

Reservoir simulators in production practice are complex software with a diverse collection of algorithms and methods. They form a tightly coupled system with strong spatial and temporal dependencies. The parallel-data management and associated methods needed to achieve good scalability on the multicore CPU require significant know-how and effort. Few simulators have achieved massive parallel scalability. The many-core GPGPU is an accelerator device and has an architecture that is very different from that of the multicore CPU. Some aspects have been discussed in this paper. Significant re-engineering of code and data layout will be required to accelerate code on the GPGPU for reservoir simulators. In some cases, different methods and algorithms may need to be built altogether. Some simulator components may not be suitable for porting to the GPGPU. Our current research results indicate that although research on heterogeneous HPC hardware platforms for reservoir simulation should continue, it is not yet mature enough for production-level code development.

    Acknowledgments

The authors would like to thank Saudi Aramco management for permission to publish this paper. We also thank NVIDIA Corporation for providing access to their test cluster with the GPGPU Fermi M2090Q, on which the results for some of the test cases were generated.

    References

Appleyard, J.R., Appleyard, J.D., Wakefield, M.A. et al. 2011. Accelerating Reservoir Simulators Using GPU Technology. Paper SPE 141402 presented at the SPE Reservoir Simulation Symposium, The Woodlands, Texas, 21-23 February. http://dx.doi.org/10.2118/141402-MS.

Christie, M.A. and Blunt, M.J. 2001. Tenth SPE Comparative Solution Project: A Comparison of Upscaling Techniques. Presented at the SPE Reservoir Simulation Symposium, Houston, 11-14 February. SPE-66599-MS. http://dx.doi.org/10.2118/66599-MS.

Feldman, M. 2012. Researchers Squeeze GPU Performance from 11 Big Science Apps. HPCwire (18 July 2012). http://archive.hpcwire.com/hpcwire/2012-07-18/researchers_squeeze_gpu_performance_from_11_big_science_apps.html.

Fung, L.S.K. and Dogru, A.H. 2008a. Parallel Unstructured-Solver Methods for Simulation of Complex Giant Reservoirs. SPE J. 13 (4): 440-446. http://dx.doi.org/10.2118/106237-PA.

Fung, L.S.K. and Dogru, A.H. 2008b. Distributed Unstructured Grid Infrastructure for Complex Reservoir Simulation. Paper SPE 113906 presented at the SPE Europec/EAGE Annual Conference and Exhibition, Rome, Italy, 9-12 June. http://dx.doi.org/10.2118/113906-MS.

Fung, L.S.K. and Mezghani, M.M. 2013. Machine, Computer Program Product and Method to Carry Out Parallel Reservoir Simulation. US Patent 8,433,551.

Killough, J.E. and Kossack, C.A. 1987. Fifth Comparative Solution Project: Evaluation of Miscible Flood Simulators. Presented at the SPE Symposium on Reservoir Simulation, San Antonio, Texas, 1-4 February. SPE-16000-MS. http://dx.doi.org/10.2118/16000-MS.

Klie, H., Sudan, H., Li, R. et al. 2011. Exploiting Capabilities of Many Core Platforms in Reservoir Simulation. Paper SPE 141265 presented at the SPE Reservoir Simulation Symposium, The Woodlands, Texas, 21-23 February. http://dx.doi.org/10.2118/141265-MS.

MPI: A Message-Passing Interface Standard. 1995. Message Passing Interface Forum, 12 June, http://www.mpi-forum.org.

Network-Based Computing Laboratory. 2008. MVAPICH2 1.2 User Guide. Columbus, Ohio: Ohio State University. http://www.compsci.wm.edu/SciClone/documentation/software/communication/MVAPICH2-1.2/mvapich2-1.2rc2_user_guide.pdf.

NVIDIA. 2012a. CUDA C Best Practices Guide, Version 5.0, October.

NVIDIA. 2012b. CUDA C Programming Guide, Version 5.0, October.

NVIDIA. 2013. GPUDirect Technology, CUDA Toolkit, Version 5.5 (19 July 2013). https://developer.nvidia.com/gpudirect.

OpenACC. 2013. The OpenACC Application Programming Interface, Version 2.0, June. OpenACC Standard Organization. http://www.openacc-standard.org.

OpenMP Application Program Interface. 2011. Version 3.1, July, http://www.openmp.org.

Saad, Y. and Schultz, M.H. 1986. GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems. SIAM J. Sci. Stat. Comput. 7 (3): 856-869.

The Portland Group. 2011. CUDA Fortran Programming Guide and Reference, Release 2011, Version 11.8, August.

Vinsome, P.K.W. 1976. Orthomin, an Iterative Method for Solving Sparse Sets of Simultaneous Linear Equations. Paper SPE 5729 presented at the Fourth SPE Symposium on Numerical Simulation of Reservoir Performance, Los Angeles, California, 19-20 February. http://dx.doi.org/10.2118/5729-MS.

Vuduc, R., Chandramowlishwaran, A., Choi, J. et al. 2010. On the Limits of GPU Acceleration. In Proceedings of the 2010 USENIX Workshop on Hot Topics in Parallelism (HotPar), Berkeley, California, June.

Wallis, J.R., Kendall, R.P., and Little, T.E. 1985. Constrained Residual Acceleration of Conjugate Residual Methods. Paper SPE 13563 presented at the SPE Reservoir Simulation Symposium, Dallas, 10-13 February. http://dx.doi.org/10.2118/13563-MS.

Zhou, Y. and Tchelepi, H.A. 2013. Multi-GPU Parallelization of Nested Factorization for Solving Large Linear Systems. Paper SPE 163588 presented at the SPE Reservoir Simulation Symposium, The Woodlands, Texas, 18-20 February. http://dx.doi.org/10.2118/163588-MS.

Larry S.K. Fung is Principal Professional of Computational Modeling Technology in the EXPEC Advanced Research Center of Saudi Aramco, which he joined in 1997. He is a chief developer of Saudi Aramco's in-house massively parallel reservoir simulators GigaPOWERS and POWERS. During this time, Fung has built several core simulator components, such as the linear and nonlinear solvers, distributed parallel-data infrastructure, multiscale-fracture multimodal-porosity simulation system, unstructured gridding, multilevel local-grid-refinement method, and fully coupled implicit-well solver. Before that, he was a staff engineer at Computer Modelling Group for 11 years and built several features for the simulators IMEX and STARS, including the systems for coupled geomechanics thermal simulation and naturally fractured reservoir simulation. Fung has published more than 30 papers on reservoir-simulation methods and holds seven US patents on the subject. He served on the steering committees of the 2007 SPE Forum on 70% Recovery and the 2009, 2011, and 2013 Reservoir Simulation Symposia, and he was Cochair of the 2010 SPE Forum on Reservoir Simulation. Fung holds BSc and MSc degrees in civil and environmental engineering from the University of Alberta and is a registered professional engineer with the Association of Professional Engineers and Geoscientists of Alberta in Canada.

Mohammad O. Sindi is a petroleum-engineering system analyst of Computational Modeling Technology in the EXPEC Advanced Research Center of Saudi Aramco, which he joined in 2003. He specializes in high-performance computing and has had numerous publications with the Institute of Electrical and Electronics Engineers, the Association for Computing Machinery, Intel, NVIDIA, and SPE. Sindi holds a BS degree from the University of Kansas and an MS degree from George Washington University, both in computer science.

Ali H. Dogru is Chief Technologist of Computational Modeling Technology in the EXPEC Advanced Research Center of Saudi Aramco. He previously worked for Mobil R&D Company and Core Labs Inc., both in Dallas. Dogru's academic experience covers various research and teaching positions at the University of Texas at Austin, the Technical University of Istanbul, the California Institute of Technology, and the Norwegian Institute of Technology, and he is currently a visiting scientist at the Massachusetts Institute of Technology. He holds several US patents, an MS degree from the Technical University of Istanbul, and a PhD degree from the University of Texas at Austin. Dogru chaired the 2004-2008 SPE JPT Special Series Committee; served on various SPE committees, including the 2008-2011 R&D Technical Section, Editorial Review, and SPE Fluid Mechanics; and he was the chairman of the 2012 Joint SPE/SIAM Symposium on Mathematical Methods in Large Scale Simulation. He was a recipient of the SPE Reservoir Dynamics & Description Award in 2008 and the SPE John Franklin Carll Distinguished Professional Award in 2012.
