

Computer Physics Communications 180 (2009) 2673–2679, doi:10.1016/j.cpc.2009.05.002


Parallel programming interface for distributed data ✩

Manhui Wang, Andrew J. May, Peter J. Knowles ∗

School of Chemistry, Cardiff University, Cardiff CF10 3AT, United Kingdom

✩ This paper and its associated computer program are available via the Computer Physics Communications homepage on ScienceDirect (http://www.sciencedirect.com/science/journal/00104655).
* Corresponding author. E-mail address: [email protected] (P.J. Knowles).


Article history: Received 9 April 2009; Received in revised form 8 May 2009; Accepted 11 May 2009; Available online 13 May 2009

PACS: 07.05.Bx; 07.05.Tp

Keywords: MPI; Parallel

Abstract

The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit, or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform.

Program summary

Program title: PPIDD
Catalogue identifier: AEEF_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEF_1_0.html
Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 17 698
No. of bytes in distributed program, including test data, etc.: 166 173
Distribution format: tar.gz
Programming language: Fortran, C
Computer: Many parallel systems
Operating system: Various
Has the code been vectorised or parallelized?: Yes. 2–256 processors used
RAM: 50 Mbytes
Classification: 6.5
External routines: Global Arrays or MPI-2
Nature of problem: Many scientific applications require management and communication of data that is global, and the standard MPI-2 protocol provides only low-level methods for the required one-sided remote memory access.
Solution method: The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit, or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform.
Running time: Problem dependent. The test provided with the distribution takes only a few seconds to run.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, the Message Passing Interface (MPI) standard [1] has emerged as the de facto standard for incorporating parallelism into scientific and other application codes. MPI offers the programmer a means of managing distributed data through explicit point-to-point message passing, and through collective operations such as broadcast and reduction. Its ubiquity allows programmers to write portable parallel codes, and encourages machine manufacturers to invest optimization effort.

Although the MPI standard is continually developing, the lack of some elements of functionality that support constructions that are richer than simple message passing has meant that some application programmers have sought alternatives. In particular, the Global Arrays (GA) toolkit [2–4] is an example of library software that allows the programmer to express parallel algorithms directly in terms of higher-level global data-storage objects. Although the implementation of these objects is usually distributed, the details of this implementation are mostly masked from the programmer, who instead operates on the Global Array objects through the application programming interface (API). Historically, GA also offered remote memory-access (RMA) operations to the programmer before these became available in the MPI-2 standard. GA has been successfully deployed in a number of scientific application codes, principally but not exclusively in the quantum chemistry area [5–8]. We note also the development of the Distributed Data Interface (DDI) library [9], which achieves similar functionality through the explicit use of separate processes to serve data.

Despite the arguable usability and technical superiority of the GA library, there are circumstances where a standards-based approach to the expression of parallel algorithms might be preferable. In particular, the manufacturers of computing systems can invest timely and significant effort in the writing and optimization of MPI-compliant libraries, since measured performance of MPI applications is a commonly required acceptability criterion for large computer installations. This effect is felt most keenly in emerging new architectures, where the initial user base for a library such as GA can be too small to justify porting and optimization. Unfortunately, most scientific application codes are not able to adapt to these circumstances easily; they are typically designed and engineered against either GA or MPI (or indeed another library), not both simultaneously.

It is the purpose of this work to enable the application programmer to use in a portable fashion the parallel layer that is most effectively implemented on any particular computing platform. The Parallel Programming Interface for Distributed Data (PPIDD) library provides a common API that is the union of the most frequently used functionality of both MPI-2 and GA. We have taken the decision to create a new API as the easiest pragmatic way to avoid the necessity of a complete general implementation of the GA using an underlying MPI library, or vice versa.

The PPIDD library consists mostly of a layer of wrapper routines that, with conditional compilation, call either GA or MPI-2 library routines. In some cases (for example, the creation and manipulation of global data structures), the desired functionality is not directly provided by MPI-2, and we have written the appropriate algorithms using MPI calls. In these cases, we do not claim or expect that the performance of PPIDD when built on top of MPI will be as good as when built on GA. The degree to which optimization has been attempted has been driven by the performance of our quantum chemistry code Molpro [10]; PPIDD is not at present appropriate for deployment in application codes that make heavy use of global array data structures.
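For illustration, a wrapper of this kind can be as simple as the following C sketch. The preprocessor symbol PPIDD_GA and the wrapper's argument list are assumptions made here for illustration only; the actual PPIDD bindings are defined in the library's Doxygen documentation.

/* Minimal sketch of a conditionally-compiled wrapper (illustrative only).
 * PPIDD_GA is a hypothetical build-time macro selecting the GA path. */
#ifdef PPIDD_GA
#include "ga.h"
#else
#include <mpi.h>
#endif

/* Broadcast nbytes bytes from process 'root' to all compute processes. */
void example_broadcast(void *buffer, int nbytes, int root)
{
#ifdef PPIDD_GA
  GA_Brdcst(buffer, nbytes, root);                            /* GA path    */
#else
  MPI_Bcast(buffer, nbytes, MPI_BYTE, root, MPI_COMM_WORLD);  /* MPI-2 path */
#endif
}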

2. Description of the library

Tables 1–5 list the routines that are in the PPIDD library, in each case indicating function, together with the MPI-2 and GA equivalents where these exist. The library consists of four components, together with supporting ancillary routines.

Message-passing: Conventional two-sided point-to-point transfer of data. These functions are present in MPI, and in the TCGMSG extensions of GA.

Collectives: Collective operations that perform tasks such as globally summing floating-point or integer data, and broadcasting data from one process to all others. These operations are supported natively by both GA and MPI.

Distributed data: Array data structures that are accessible by any process independently of the others, and irrespective of where the data are physically located. GA provides this functionality directly; there is no direct equivalent in MPI, and when built against an MPI-2 library, PPIDD manages the arrays, effecting communication either with MPI-2 one-sided remote-memory access (RMA), or with the help of an additional server process. An important special case of global data, heavily used by codes such as Molpro in dynamic load-balancing, is a global shared counter, implemented through PPIDD_Nxtval.

Mutual exclusion: Manage code segments that are to be executed by only one process at a time. These functions are implemented with the method proposed by Latham et al. [11].

Table 1
PPIDD library: utility operations. Each entry gives the PPIDD routine, its GA and MPI-2 equivalents in parentheses (N/A where no equivalent exists), and a description.

PPIDD_Initialize (GA_Initialize, MPI_Init): Initialize the parallel environment.
PPIDD_Finalize (GA_Terminate, MPI_Finalize): Terminate the parallel environment.
PPIDD_Wtime (GA_Wtime, MPI_Wtime): Return an elapsed time on the calling process.
PPIDD_Error (GA_Error, N/A): Print an error message and abort the program execution.
PPIDD_Nxtval (NXTVAL_, N/A): Get the next shared counter number.
PPIDD_Size_all (N/A, MPI_Comm_size): Determine the number of all processes including helper process if there is one.
PPIDD_Size (GA_Nnodes, MPI_Comm_size): Determine the number of compute processes.
PPIDD_Rank (GA_Nodeid, MPI_Comm_rank): Determine the rank of the calling process.
PPIDD_Init_fence (GA_Init_fence, N/A): Initialize tracing of completion status of data movement operations.
PPIDD_Fence (GA_Fence, N/A): Block the calling process until all data transfers complete.

Table 2
PPIDD library: message-passing operations.

PPIDD_Send (SND_, MPI_Send/MPI_Isend): Blocking/nonblocking send.
PPIDD_Recv (RCV_, MPI_Recv/MPI_Irecv): Blocking/nonblocking receive.
PPIDD_Wait (WAITCOM_, MPI_Wait): Wait for completion of all asynchronous send/receive.
PPIDD_Iprobe (PROBE_, MPI_Iprobe): Detect nonblocking messages.

Table 3
PPIDD library: collective operations.

PPIDD_BCast (GA_Brdcst, MPI_Bcast): Broadcast a message from the root process to all other processes.
PPIDD_Barrier (GA_Sync, MPI_Barrier): Synchronize processes and ensure all have reached this routine.
PPIDD_Gsum (GA_Gop, MPI_Allreduce): Combine values from all processes and distribute the result back to all processes.
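To make the distributed-data component concrete, the following C sketch uses the GA calls that the PPIDD distributed-data routines wrap when the library is built on GA (see Table 4). The array size, name and patch bounds are arbitrary illustrative choices, and the GA/MA environment (GA_Initialize, MA_init) is assumed to have been set up already.

/* Illustrative GA usage corresponding to PPIDD_Create/Put/Acc/Get/Destroy. */
#include "ga.h"
#include "macdecls.h"

void global_array_example(void)
{
  int dims[1] = {1000};
  /* NULL chunk lets GA choose the (regular) distribution */
  int g_a = NGA_Create(C_DBL, 1, dims, "work", NULL);
  GA_Zero(g_a);

  double buf[10];
  int lo[1] = {0}, hi[1] = {9}, ld[1] = {10};
  double one = 1.0;
  for (int i = 0; i < 10; ++i) buf[i] = (double) i;

  NGA_Put(g_a, lo, hi, buf, ld);        /* write a patch of the global array  */
  NGA_Acc(g_a, lo, hi, buf, ld, &one);  /* atomically add into the same patch */
  NGA_Get(g_a, lo, hi, buf, ld);        /* read the patch back into the buffer */

  GA_Sync();
  GA_Destroy(g_a);
}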

The library is fully documented at the function-call level using the Doxygen [12] system, and contains both Fortran and C bindings. The library has been tested on a number of different computing platforms of varying architectures.


Table 4
PPIDD library: distributed data operations.

PPIDD_Create (NGA_Create, N/A): Create a global array using the regular distribution model and return integer handle representing the array.
PPIDD_Create_irreg (NGA_Create_irreg, N/A): Create an array by following the user-specified distribution and return integer handle representing the array.
PPIDD_Destroy (GA_Destroy, N/A): Deallocate the array and free any associated resources.
PPIDD_Distrib (NGA_Distribution, N/A): Return the range of a global array held by a specified process.
PPIDD_Location (NGA_Locate_region, N/A): Return a list of the processes that hold the specified section of a global array.
PPIDD_Get (NGA_Get, N/A): Get data from a global array section to the local buffer.
PPIDD_Put (NGA_Put, N/A): Put local data into a section of a global array.
PPIDD_Acc (NGA_Acc, N/A): Accumulate local data into a section of a global array. Atomic operation.
PPIDD_Read_inc (NGA_Read_inc, N/A): Atomically read and increment an element in an integer global array.
PPIDD_Zero_patch (NGA_Zero_patch, N/A): Set all the elements in a global array patch to zero.
PPIDD_Zero (NGA_Zero, N/A): Set all the elements of a global array to zero.
PPIDD_Duplicate (GA_Duplicate, N/A): Create a new global array by applying all the properties of another existing array.
PPIDD_Inquire_mem (GA_Inquire_memory, N/A): Get the amount of memory (in bytes) used in the allocated global arrays on the calling process.

Table 5
PPIDD library: mutual exclusion operations.

PPIDD_Create_mutexes (GA_Create_mutexes, N/A): Create a set of mutex variables that can be used for mutual exclusion.
PPIDD_Lock_mutex (GA_Lock, N/A): Lock a mutex object identified by a given mutex number.
PPIDD_Unlock_mutex (GA_Unlock, N/A): Unlock a mutex object identified by a given mutex number.
PPIDD_Destroy_mutexes (GA_Destroy_mutexes, N/A): Free a set of mutexes created previously.

3. Performance

3.1. Initial performance tests

All performance tests have been carried out on the Cardiff University Merlin facility [13], which is a Bull Novascale R422 cluster consisting of 8-core 3.0 GHz Intel E5472 nodes connected with 4× DDR Infiniband. The library and test programs were compiled with Intel Fortran/C 10.1.015, with three different underlying communications libraries:

1. Global Arrays version 4.1.1 hosted by Intel MPI version 3.1
2. Intel MPI version 3.1
3. Bull MPI version 2-1.7-2

Table 6 shows timing data obtained on different numbers of 8-core nodes for the basic primitive point-to-point, one-sided (put, get, accumulate, shared-counter) and collective operations (broadcast, floating-point global sum). There are 8 effective processes running on each node. In the case of simple message passing, very good performance is seen with both GA and MPI, with latencies of around 0.5 μs and bandwidths that approach the hardware limit (16 Gbit s⁻¹), although with Bull MPI, unexpectedly high latencies are observed at random. For collective operations, GA exhibits significantly better performance for small numbers of processes, especially for a single node. GA, through its underlying aggregate remote memory copy (ARMCI) component [3], is explicitly aware of, and is able to exploit, shared memory access to facilitate communications between processes running on the same SMP node; this is not yet the case for either of the MPI libraries available to us. However, the performance data at high process count reveal apparent inadequacies in the implementation of broadcast and global summation present in the GA library.

Table 6
PPIDD primitive operation latency and bandwidth (latency in μs / bandwidth in MB/s) with GA, Intel MPI-2, and Bull MPI-2 on multiple nodes of Merlin (8 processes per node). No Bull MPI-2 numbers are given for PPIDD_Get on more than one node because MPI_Get did not progress correctly between nodes with that library (see text).

Operation                Nnode  GA           Intel MPI-2    Bull MPI-2
PPIDD_Put                1      0.26/1462    10.72/573      1.63/1326
                         2      1.83/1423    105.94/839     18.16/1515
                         4      2.20/1642    164.02/1086    22.60/1636
                         8      2.70/1701    254.37/1139    26.74/1645
                         16     3.21/1708    470.51/922     31.96/1614
                         32     4.10/1668    1408.55/358    37.04/1507
PPIDD_Get                1      0.27/1556    10.92/1774     1.57/1319
                         2      4.89/1472    16.48/1793
                         4      6.84/1359    20.82/1817
                         8      8.28/1307    26.02/1711
                         16     9.47/1236    36.15/1531
                         32     10.69/1164   86.32/974
PPIDD_Acc                1      0.29/1570    133.29/423     9.87/734
                         2      1.80/1373    106.09/532     18.67/774
                         4      2.28/1249    158.61/624     22.71/807
                         8      2.66/1178    224.71/664     26.81/852
                         16     3.14/1316    333.04/734     32.26/1032
                         32     3.72/1654    856.80/456     36.54/1011
PPIDD_Send (PPIDD_Recv)  1      0.41/1822    0.42/1826      0.56/1344
                         2      0.49/1809    0.48/1858      4.61/1860
                         4      0.49/1858    0.47/1859      5.82/1860
                         8      0.53/1822    0.49/1801      4.44/1784
                         16     0.61/1819    0.49/1826      13.49/1814
                         32     0.73/1805    0.56/1813      1.76/1798
PPIDD_Brdcst             1      0.82/560     1.41/342       1.90/213
                         2      3.16/346     2.01/362       2.87/242
                         4      4.29/295     2.58/374       3.41/251
                         8      6.68/243     3.26/362       3.94/240
                         16     9.10/189     4.05/353       4.50/230
                         32     12.66/174    5.17/301       5.06/209
PPIDD_Gsum               1      2.01/210     3.18/68        6.56/75
                         2      6.15/151     7.32/63        13.06/72
                         4      9.18/116     10.63/60       21.33/70
                         8      15.07/90     15.75/59       33.93/65
                         16     20.10/77     21.28/57       57.12/64
                         32     26.76/73     33.78/56       68.54/63
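The benchmark code used to produce these numbers is not reproduced in the paper; the following generic, self-contained C program shows how point-to-point latency and bandwidth figures of this kind are commonly measured with MPI_Wtime. Message size and repetition count are arbitrary choices.

/* Generic ping-pong timing sketch (not the benchmark used in the paper). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int rank, nrep = 1000, nbytes = 1 << 20;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  char *buf = malloc(nbytes);

  MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  for (int i = 0; i < nrep; ++i) {
    if (rank == 0) {
      MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
      MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
      MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    }
  }
  double t = (MPI_Wtime() - t0) / (2.0 * nrep);   /* one-way time per message */
  if (rank == 0)
    printf("time %g s, bandwidth %g MB/s\n", t, nbytes / t / 1.0e6);

  free(buf);
  MPI_Finalize();
  return 0;
}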

The performance of one-sided operations implemented through MPI-2 windows is seen to be rather poor, as we have previously observed [14]. The implementation of RMA in MPICH2 [15], and, possibly, other MPI libraries derived from it, is not truly one-sided, since a request from a client process is not honoured until the process owning the memory enters an MPI function [16]. For the computation patterns common in typical scientific codes, especially those where explicit dynamic load-balancing tuning has been implemented to minimize the frequency of communication calls, this situation is far from ideal; execution will block whilst the server process is busy with its own work. GA does not suffer from these problems, and is able to progress the one-sided calls with a latency in the 2–10 μs range. Finally, we note that with the version of Bull MPI available to us, the MPI_Get routine does not correctly progress requests when working between nodes, and so we do not report performance data in this case.
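For reference, a passive-target MPI-2 accumulate of the kind discussed here has the following shape; whether it makes progress while the target process is busy outside the MPI library is precisely the implementation issue described above. Window size and data values are arbitrary illustrative choices.

/* Passive-target one-sided accumulate with MPI-2 windows (illustrative). */
#include <mpi.h>

int main(int argc, char **argv)
{
  const int n = 1024;
  int rank;
  double *base, local[1024];
  MPI_Win win;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Alloc_mem(n * sizeof(double), MPI_INFO_NULL, &base);
  for (int i = 0; i < n; ++i) { base[i] = 0.0; local[i] = 1.0; }

  /* every process exposes n doubles in a window */
  MPI_Win_create(base, n * sizeof(double), sizeof(double),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  /* sum the local buffer into the window owned by rank 0; no explicit
     action is required of rank 0, but progress may still depend on it
     entering the MPI library, as observed for MPICH2-derived libraries */
  MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
  MPI_Accumulate(local, n, MPI_DOUBLE, 0, 0, n, MPI_DOUBLE, MPI_SUM, win);
  MPI_Win_unlock(0, win);

  MPI_Win_free(&win);
  MPI_Free_mem(base);
  MPI_Finalize();
  return 0;
}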


Table 7
PPIDD primitive operation latency and bandwidth on a single node with varying numbers of processes. Units as in Table 6 (latency in μs / bandwidth in MB/s). For MPI-2, the numbers in square brackets are obtained from the MPI-2 one-sided operations, whilst the others are obtained with the helper process.

Operation                Processes^a  GA           Intel MPI-2               Bull MPI-2
PPIDD_Put                7            0.26/1495    1.79/1703 [11.53/524]     2.60/1234 [1.76/1327]
                         8            0.26/1462    2.82/1627 [14.39/463]     403.95/1001 [25001.05/116]
PPIDD_Get                7            0.27/1530    1.80/1709 [11.87/1768]    2.64/1241 [1.71/1327]
                         8            0.27/1556    2.69/1656 [14.58/1760]    403.90/993 [24950.64/116]
PPIDD_Acc                7            0.30/1570    1.85/820 [15.25/397]      2.63/676 [5.75/752]
                         8            0.29/1570    2.95/786 [17.99/361]      403.90/550 [24950.69/106]
PPIDD_Send (PPIDD_Recv)  7            0.41/1802    0.42/1820                 0.62/1390
                         8            0.41/1822    0.40/1838                 20.29/786
PPIDD_Brdcst             7            0.75/584     1.36/379                  2.07/267
                         8            0.82/560     1.45/331                  25031.44/16
PPIDD_Gsum               7            1.78/229     3.50/97                   5.89/69
                         8            2.01/210     3.84/68                   100045.75/53
PPIDD_Nxtval             7            0.47/(N/A)   2.41/(N/A)                2.10/(N/A)
                         8            0.47/(N/A)   608.53/(N/A)              2.38/(N/A)

^a Number of compute processes. For the GA implementation, this is the total number of processes; for MPI, it is the total number of processes minus one.

3.2. Low-latency one-sided operations

The above data serve to demonstrate that the MPI-2 RMA functions (MPI_Get, MPI_Put, and MPI_Accumulate), at least in the implementations available to us, cannot presently be used in a straightforward way for one-sided data access, and in particular for the management of global distributed data structures. However, many (but, of course, not all) parallel algorithms arising in scientific application codes do not unavoidably demand full distribution of data, although typically one-sided access to the data is still important. An extreme and common example is the shared global count variables that are typically used to manage dynamic load balancing; here, the critical operation is the atomic incrementing (PPIDD_Acc) of a single integer, which by definition is local to one process and not distributed. In Molpro, there are further data structures whose storage requirements are modest, and for which the key performance criterion is access time rather than aggregate transfer bandwidth.
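A minimal sketch of this dynamic load-balancing pattern is shown below, written against the GA C interface that PPIDD wraps (NGA_Read_inc, the counterpart of PPIDD_Read_inc). The routine process_task is a hypothetical placeholder for the application work, and the GA/MA environment is assumed to be initialized.

/* Dynamic load balancing with a shared counter (illustrative GA sketch). */
#include "ga.h"
#include "macdecls.h"

void process_task(long itask);   /* hypothetical application work */

void dynamic_loop(long ntasks)
{
  int dims[1] = {1};
  /* one-element integer global array used as a shared counter */
  int g_counter = NGA_Create(C_LONG, 1, dims, "counter", NULL);
  GA_Zero(g_counter);
  GA_Sync();

  int zero[1] = {0};
  long itask;
  /* each process atomically fetches the next undone task number */
  while ((itask = NGA_Read_inc(g_counter, zero, 1)) < ntasks)
    process_task(itask);

  GA_Sync();
  GA_Destroy(g_counter);
}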

Under these circumstances, the latency problems of typical MPI-2 implementations of RMA can be circumvented by creating one or more explicit processes whose sole function is to listen for, and service, requests for data transfers. Provided these processes are kept active by appropriate system scheduling, they can deliver remote memory requests with a latency comparable to that of the primitive message passing upon which their implementation is constructed. The DDI library [9] is built around the concept of having one such process for every compute process. This approach supports full flexibility for managing global data, but has the disadvantage that the number of processes is double the number doing arithmetic tasks. One is faced with the choice of either running half the number of compute processes as there are hardware cores (which restricts the arithmetical performance to be at most half of the theoretical peak), or of being at the mercy of the operating system to schedule the listeners to be active sufficiently often to respond to service requests with appropriate timeliness. Neither of these situations is ideal, although in the context of ever-increasing hierarchy in memory bandwidths, for example through multiple-core CPU chips, the dedication of some cores to service tasks is increasingly less wasteful than on older computing platforms.
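The following C sketch illustrates the data-server idea for the shared-counter case. It is not the PPIDD implementation: the message tags and the protocol are invented for illustration, with the helper rank servicing fetch-and-add requests through ordinary two-sided messages.

/* Sketch of a dedicated counter-server process (illustrative protocol). */
#include <mpi.h>

enum { TAG_NXTVAL = 1, TAG_EXIT = 2 };   /* hypothetical message tags */

/* run on the dedicated helper rank */
void counter_server(MPI_Comm comm, int ncompute)
{
  long counter = 0;
  int finished = 0;
  while (finished < ncompute) {
    long dummy;
    MPI_Status status;
    MPI_Recv(&dummy, 1, MPI_LONG, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);
    if (status.MPI_TAG == TAG_EXIT) {
      ++finished;                        /* one compute process has finished */
    } else {
      /* reply with the current value, then increment; this is an atomic
         fetch-and-add because only this process ever touches 'counter' */
      MPI_Send(&counter, 1, MPI_LONG, status.MPI_SOURCE, TAG_NXTVAL, comm);
      ++counter;
    }
  }
}

/* called by a compute rank to obtain the next shared counter value */
long next_value(MPI_Comm comm, int server_rank)
{
  long request = 0, value;
  MPI_Send(&request, 1, MPI_LONG, server_rank, TAG_NXTVAL, comm);
  MPI_Recv(&value, 1, MPI_LONG, server_rank, TAG_NXTVAL, comm, MPI_STATUS_IGNORE);
  return value;
}

/* called by each compute rank when it needs no further values */
void notify_done(MPI_Comm comm, int server_rank)
{
  long dummy = 0;
  MPI_Send(&dummy, 1, MPI_LONG, server_rank, TAG_EXIT, comm);
}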

In our initial MPI-2 implementation of PPIDD, we have chosen to take the opposite extreme, by creating just a single data service process. In this approach, global objects are stored under the control of just one process on one physical node. This has the advantage of simplicity, but we recognize that ultimately it is not scalable either in terms of memory capacity, or transfer bandwidth on an appropriately scaling switched network. At present, for large data structures, we continue to use the approach based on the MPI-2 RMA functions. These ‘low-latency’ versions of the global data functions are implemented in the library through the same function call as the fully-distributed versions. This is an important design feature; the flavour of implementation can be specified at run time on the basis of the actual data dimensions and anticipated access patterns, through a flag passed to the routine that creates the array. The same interface is also used when the underlying communications are carried out by GA, and in that case the low-latency flag is simply ignored. This separation of functionality from decisions on implementation algorithms, with those decisions being taken at run time, is the overall philosophy behind creating PPIDD; the intention is to free the application programmer from having to make fundamental and irrevocable choices on lower software layers at a point in development when the optimal implementation is not yet known, and will doubtless change with future hardware developments.

Table 7 shows the performance of the primitive one-sided, point-to-point, and collective operations on a single node, and serves to illustrate the operation of the data server, and the effect of overcommitting resources. The GA implementation shows excellent latency and bandwidth for all operations with either 7 or 8 compute processes. With the use of a service process, the Intel MPI implementation is competitive, with differences being partially attributable to the better exploitation of shared memory by GA. The one-sided latency is reduced from roughly 2.8 to 1.8 μs when there are only 7 compute processes and the server process can be scheduled to run continuously. As noted earlier, the performance using MPI-2 RMA is significantly inferior. In the case of Bull MPI, very poor performance is obtained with 8 compute processes, i.e., a total of 9 MPI processes. This indicates that the Bull MPI library is designed on the assumption of no overcommitment of the CPU resource, apparently enforcing long time intervals between context switches. With 7 compute processes, the performance is better, although somewhat inferior to the other two libraries for both one-sided and collective operations.

Based on these observations, the remaining performance tests use 8Nnode and 8Nnode − 1 compute processes for the GA and MPI-2 implementations respectively, when running on Nnode 8-core nodes. Table 8 shows the performance of the same basic operations up to Nnode = 32. We note firstly that the low-latency variants of PPIDD_Put and PPIDD_Get show good performance to high process counts. The MPI-2 implementations of accumulate show only approximately half the bandwidth of GA, which probably arises from excessive local copying of data. The MPI-2-based shared counter is at least as good as the GA except on a single node, where GA takes full advantage of shared memory.


Table 8
PPIDD primitive operation latency and bandwidth on multiple nodes of Merlin. Units as in Table 6 (latency in μs / bandwidth in MB/s). In each case there are 8 processes on each node, i.e. the total number of processes is 8 × Nnode for both GA and MPI-2; in the latter case there are a total of 8 × Nnode − 1 compute processes. For MPI-2, two kinds of one-sided operation numbers are listed: the numbers in square brackets are obtained from the MPI-2 one-sided operations, whilst the others are obtained with the helper process.

Operation                Nnode  GA           Intel MPI-2                Bull MPI-2
PPIDD_Put                1      0.26/1462    1.79/1703 [11.53/524]      2.60/1234 [1.76/1327]
                         2      1.83/1423    2.57/1755 [42.97/760]      3.85/1505 [9.74/1515]
                         4      2.20/1642    3.33/1783 [64.80/1004]     4.90/1620 [12.89/1616]
                         8      2.70/1701    3.43/1835 [101.23/1177]    5.02/1734 [13.59/1629]
                         16     3.21/1708    4.81/1834 [238.75/1124]    5.32/1796 [15.97/1669]
                         32     4.10/1668    7.73/1802 [794.98/635]     5.55/1758 [19.87/1558]
PPIDD_Get                1      0.27/1556    1.80/1709 [11.87/1768]     2.64/1241 [1.71/1327]
                         2      4.89/1472    2.59/1767 [18.42/1791]     3.88/1508
                         4      6.84/1359    3.38/1796 [22.95/1760]     4.90/1633
                         8      8.28/1307    3.56/1840 [26.76/1730]     5.00/1741
                         16     9.47/1236    4.87/1836 [36.94/1515]     5.35/1799
                         32     10.69/1164   8.11/1816 [89.54/950]      5.54/1786
PPIDD_Acc                1      0.29/1570    1.85/820 [15.25/397]       2.63/676 [5.75/752]
                         2      1.80/1373    2.63/773 [42.31/490]       3.88/750 [9.97/793]
                         4      2.28/1249    3.38/758 [60.21/546]       4.94/775 [13.06/811]
                         8      2.66/1178    3.51/796 [73.56/646]       5.05/801 [13.97/872]
                         16     3.14/1316    4.93/783 [98.96/516]       5.36/812 [15.97/1026]
                         32     3.72/1654    7.88/778 [244.00/446]      5.62/809 [19.83/1012]
PPIDD_Send (PPIDD_Recv)  1      0.41/1822    0.42/1820                  0.62/1390
                         2      0.49/1809    0.48/1858                  1.36/1859
                         4      0.49/1858    0.49/1808                  1.99/1825
                         8      0.53/1822    0.48/1859                  6.24/1819
                         16     0.61/1819    0.49/1832                  14.03/1842
                         32     0.73/1805    0.54/1798                  0.73/1831
PPIDD_Brdcst             1      0.82/560     1.36/379                   2.07/267
                         2      3.16/346     2.01/351                   2.95/238
                         4      4.29/295     2.62/348                   3.54/253
                         8      6.68/243     3.23/347                   3.97/235
                         16     9.10/189     4.04/335                   4.48/231
                         32     12.66/174    5.17/288                   7.16/205
PPIDD_Gsum               1      2.01/210     3.50/97                    5.89/69
                         2      6.15/151     7.03/66                    10.79/66
                         4      9.18/116     11.01/61                   16.47/62
                         8      15.07/90     14.17/59                   23.29/57
                         16     20.10/77     21.89/58                   33.87/49
                         32     26.76/73     34.23/56                   36.72/50
PPIDD_Nxtval             1      0.47/(N/A)   2.41/(N/A)                 2.10/(N/A)
                         2      6.43/(N/A)   3.24/(N/A)                 4.13/(N/A)
                         4      6.70/(N/A)   3.81/(N/A)                 4.57/(N/A)
                         8      8.20/(N/A)   4.32/(N/A)                 5.45/(N/A)
                         16     9.13/(N/A)   7.11/(N/A)                 6.65/(N/A)
                         32     15.79/(N/A)  13.43/(N/A)                11.38/(N/A)

The performance on collectives is also similar in all three cases, with Intel MPI consistently showing superior broadcast speed.

3.3. Application performance

Table 9 gives examples of the performance of the library on quantum chemistry calculations. We show elapsed times for some typical job steps:

TRIPLES: Perturbative treatment of triple excitation contribution to electron correlation [17] from standard Molpro benchmark job mpp_big_normal_ccsd.

MRCI: Internally-contracted multireference configuration interaction [6], standard Molpro benchmark job mpp_big_normal_mrci.

SCF: Integral-direct Hartree–Fock from standard Molpro benchmark job mpp_big_direct_lmp2.

LMP2: Local second-order Møller–Plesset perturbation theory [18,19] from standard Molpro benchmark job mpp_big_direct_lmp2.

Table 9
Wall clock time (in seconds) for some Molpro benchmark tests with GA and Intel MPI libraries on multiple nodes of Merlin.

Nnode   TRIPLES              MRCI                 SCF                    LMP2
        GA        MPI-2      GA        MPI-2      GA         MPI-2       GA        MPI-2
1       642.09    674.24     636.33    637.08     1063.65    1194.43     317.58    648.09
2       320.60    334.52     407.73    420.70     565.80     594.30      192.80    350.83
4       161.49    163.15     268.70    268.61     308.84     315.37      126.72    232.47
8       81.33     81.68      206.27    210.23     186.37     186.59      101.31    169.46
16      41.71     41.87      247.01    219.64     139.25     150.08      95.69     159.48

In the case of TRIPLES, communication demands consist solely of reading two-electron integrals from the Lustre global file system, but the time for this is dominated by the floating-point work. The only significant parallelization issue is load-balancing, and for this example, scaling is almost perfect up to 256 processes with both GA and MPI-2 versions of PPIDD.

A similar situation occurs with SCF, where most of the work is in computing two-electron integrals, for which the dynamic load-balancing is sufficient for these process counts. However, some degradation in scaling occurs around 64 processes; there is some scalar work (diagonalization of the Fock matrix), global summation of the Fock matrix, and caching of two-electron integrals on disk. The latter partially accounts for the difference between the GA and MPI-2 performance, where the underlying I/O system is different. For lower process counts, the performance is essentially independent of the communications library.

For MRCI, the communications are more extensive and complex, and this is reflected in the poorer scaling of performance. For modest process counts, the performance of GA and MPI-2 is essentially the same. In the case of LMP2, there are significant global data structures, which we have left as globally distributed. Thus in this case, the performance of the MPI-built library is inferior.

4. Installing and testing PPIDD

4.1. Prerequisites

1. GNU make version 3.81 or higher
2. A Fortran compiler
3. A C compiler
4. An MPI-2 library, OR, Global Arrays library

We strongly recommend that the same compilers are used for compiling the Global Arrays and PPIDD libraries.

4.2. Building the library

By default, the supplied GNUmakefile attempts to build the library using the commands ‘mpif90’ and ‘mpicc’, which are typically available when an MPI-2 library has been installed. If you wish to override this, the following are some common variables which may need to be set on the command line using the form

make VAR1=value1 VAR2=value2...

The full list of command-line options recognized by make is:

CC       C compiler
FC       Fortran compiler
MPICC    an MPI-2-aware C compiler
MPIFC    an MPI-2-aware Fortran compiler
INCLUDE  directory containing MPI-2 or GA include files
MPILIB   MPI linking options, e.g. ‘-L/usr/lib -lmpi’
BUILD    either MPI-2 (default), GA_TCGMSG, GA_TCGMSG_MPI or GA_MPI
CFLAGS   C compiler flags
FFLAGS   Fortran compiler flags
INT64    setting INT64=y lets make know that the compilers use 8-byte integers. Note: make does not force the compilers to do this; it is up to the user to make any necessary changes to CFLAGS/FFLAGS
NXTVAL   setting NXTVAL=n disables the helper process (PPIDD_Nxtval) for MPI-2 builds; by default it is enabled
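For example, a build against a pre-installed Global Arrays library might be invoked as follows; the path is a placeholder to be replaced with the local installation directory.

make BUILD=GA_MPI INCLUDE=/path/to/ga/include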

Note: When the library is built on top of the Global Arrays library, the -Vaxlib option is needed for Intel Fortran compilers (including an MPI-2-aware Fortran compiler that invokes the Intel Fortran compiler) to link with a Fortran main program (make . . . FFLAGS=-Vaxlib). This option allows getarg() and iargc(), which are not part of the Fortran standard, to be resolved properly.

Building examples are available at ./doc/doxygen/html/index.html after the documentation is built.

4.3. Testing the library

Once the library has been built successfully, some tests will appear in the ‘test’ directory, identifiable by a .exe suffix. These tests should be run in order to check that the library is working correctly, e.g.

mpiexec -np 4 ./test/ppidd_test.exe

A sample output for this test can be found at ./test/ppidd_test.out (please be aware the results are machine-dependent).

4.4. Building the documentation

This should not be necessary, since PPIDD is distributed with precompiled documentation. If you choose to recompile it, the doxygen program is required. The DOXYGEN variable defines the executable used.

make doc [DOXYGEN=/path/to/doxygen]

5. Conclusion

We have demonstrated the operation and performance of the PPIDD library, which offers the application programmer an interface to message-passing, one-sided RMA, and collectives that is independent of the underlying communication vehicle. Because of the poor performance of some existing implementations of the MPI-2 standard for one-sided operations, the library contains as an option the possibility of managing global data structures through a dedicated server process. Some MPI-2 implementations do not suffer from this problem [14], and then the library can be configured without this option. We have demonstrated that the data-server approach is effective for small data structures with up to 255 MPI processes. Ultimately this single-server model will compromise scalability of performance, but it could easily be generalised to distribute data across an arbitrary number of helper processes, for example (but not limited to) one per SMP node.

Acknowledgements

We are grateful to G.D. Fletcher, M.F. Guest, R. Thakur and D.W. Walker for helpful discussions. This work has been supported by EPSRC (EP/C007832/1).

References

[1] http://www.mpi-forum.org/.
[2] J. Nieplocha, B. Palmer, V. Tipparaju, M. Krishnan, H. Trease, E. Aprà, International Journal of High Performance Computing Applications 20 (2006) 203.
[3] J. Nieplocha, V. Tipparaju, M. Krishnan, D. Panda, International Journal of High Performance Computing Applications 20 (2006) 233.
[4] http://www.emsl.pnl.gov/docs/global/.
[5] R.A. Kendall, E. Aprà, D.E. Bernholdt, E.J. Bylaska, M. Dupuis, G.I. Fann, R.J. Harrison, J. Ju, J.A. Nichols, J. Nieplocha, T.P. Straatsma, T.L. Windus, A.T. Wong, Comp. Phys. Comm. 128 (2000) 260.
[6] A.J. Dobbyn, P.J. Knowles, R.J. Harrison, J. Comput. Chem. 19 (1998) 1215.
[7] Y. Shao, L.F. Molnar, Y. Jung, J. Kussmann, C. Ochsenfeld, S.T. Brown, A.T. Gilbert, L.V. Slipchenko, S.V. Levchenko, D.P. O'Neill, R.A. DiStasio Jr., R.C. Lochan, T. Wang, G.J. Beran, N.A. Besley, J.M. Herbert, C.Y. Lin, T. Van Voorhis, S.H. Chien, A. Sodt, R.P. Steele, V.A. Rassolov, P.E. Maslen, P.P. Korambath, R.D. Adamson, B. Austin, J. Baker, E.F.C. Byrd, H. Dachsel, R.J. Doerksen, A. Dreuw, B.D. Dunietz, A.D. Dutoi, T.R. Furlani, S.R. Gwaltney, A. Heyden, S. Hirata, C.-P. Hsu, G. Kedziora, R.Z. Khalliulin, P. Klunzinger, A.M. Lee, M.S. Lee, W. Liang, I. Lotan, N. Nair, B. Peters, E.I. Proynov, P.A. Pieniazek, Y.M. Rhee, J. Ritchie, E. Rosta, C.D. Sherrill, A.C. Simmonett, J.E. Subotnik, H.L. Woodcock III, W. Zhang, A.T. Bell, A.K. Chakraborty, D.M. Chipman, F.J. Keil, A. Warshel, W.J. Hehre, H.F. Schaefer III, J. Kong, A.I. Krylov, P.M.W. Gill, M. Head-Gordon, Phys. Chem. Chem. Phys. 8 (2006) 3172.
[8] G. Karlström, R. Lindh, P. Malmqvist, B.O. Roos, U. Ryde, V. Veryazov, P. Widmark, M. Cossi, B. Schimmelpfennig, P. Neogrady, L. Seijo, Computational Materials Science 28 (2003) 222.
[9] G.D. Fletcher, M.W. Schmidt, B.M. Bode, M.S. Gordon, Comp. Phys. Comm. 128 (2000) 190.
[10] H.-J. Werner, P.J. Knowles, R. Lindh, F.R. Manby, M. Schütz, P. Celani, T. Korona, A. Mitrushenkov, G. Rauhut, T.B. Adler, R.D. Amos, A. Bernhardsson, A. Berning, D.L. Cooper, M.J.O. Deegan, A.J. Dobbyn, F. Eckert, E. Goll, C. Hampel, G. Hetzer, T. Hrenar, G. Knizia, C. Köppl, Y. Liu, A.W. Lloyd, R.A. Mata, A.J. May, S.J. McNicholas, W. Meyer, M.E. Mura, A. Nicklass, P. Palmieri, K. Pflüger, R. Pitzer, M. Reiher, U. Schumann, H. Stoll, A.J. Stone, R. Tarroni, T. Thorsteinsson, M. Wang, A. Wolf, Molpro, version 2008.2, a package of ab initio programs, 2008, see http://www.molpro.net.
[11] R. Latham, R. Ross, R. Thakur, International Journal of High Performance Computing Applications 21 (2007) 132.
[12] http://www.doxygen.org.
[13] http://www.top500.org/system/9294.
[14] H.J.J. van Dam, M. Wang, A.G. Sunderland, I.J. Bush, P.J. Knowles, M.F. Guest, Is MPI-2 suitable for quantum chemistry? Performance of passive target one-sided communications, Technical report, STFC Daresbury Laboratory, 2008.
[15] http://www.mcs.anl.gov/research/projects/mpich2.
[16] R. Thakur, Private communication.
[17] K. Raghavachari, G.W. Trucks, J.A. Pople, M. Head-Gordon, Chem. Phys. Lett. 157 (1989) 479.
[18] M. Schütz, G. Hetzer, H.-J. Werner, J. Chem. Phys. 111 (1999) 5691.
[19] H.-J. Werner, F.R. Manby, P.J. Knowles, J. Chem. Phys. 118 (2003) 8149.