
CADISHI: Fast parallel calculation of particle-pair distance histograms on CPUs and GPUs

Klaus Reuter a,∗, Jürgen Köfinger b

a Max Planck Computing and Data Facility, Gießenbachstraße 2, 85748 Garching, Germany
b Max Planck Institute of Biophysics, Max-von-Laue-Straße 3, 60438 Frankfurt, Germany

Abstract

We report on the design, implementation, optimization, and performance of the CADISHI software package, which calculates histograms of pair-distances of ensembles of particles on CPUs and GPUs. These histograms represent 2-point spatial correlation functions and are routinely calculated from simulations of soft and condensed matter, where they are referred to as radial distribution functions, and in the analysis of the spatial distributions of galaxies and galaxy clusters. Although conceptually simple, the calculation of radial distribution functions via distance binning requires the evaluation of O(N²) particle-pair distances, where N is the number of particles under consideration. CADISHI provides fast parallel implementations of the distance histogram algorithm for the CPU and the GPU, written in templated C++ and CUDA. Orthorhombic and general triclinic periodic boxes are supported, in addition to the non-periodic case. The CPU kernels feature cache-blocking, vectorization and thread-parallelization to obtain high performance. The GPU kernels are tuned to exploit the memory and processor features of current GPUs, demonstrating histogramming rates up to a factor of 40 higher than on a high-end multi-core CPU. To enable high-throughput analyses of molecular dynamics trajectories, the compute kernels are driven by the Python-based CADISHI engine. It implements a producer-consumer data processing pattern and thereby enables the complete utilization of all the CPU and GPU resources available on a specific computer, independent of special libraries such as MPI, covering commodity systems up to high-end HPC nodes. Data input and output are performed efficiently via HDF5. In addition, our CPU and GPU kernels can be compiled into a standard C library and used with any application, independently of the CADISHI engine or Python. The CADISHI software is freely available under the MIT license.

Keywords: radial distribution function, pair-distance distribution function, two-point correlation function, distance histogram, GPU, CUDA

PROGRAM SUMMARY
Program Title: CADISHI
Licensing provisions: MIT
Programming language: C++, CUDA, Python

∗ Corresponding author
Email addresses: [email protected] (Klaus Reuter), [email protected] (Jürgen Köfinger)

Nature of problem:
Radial distribution functions are of fundamental importance in soft and condensed matter physics and astrophysics. However, the calculation of distance histograms scales quadratically with the number of particles. To be able to analyze large data sets, fast and efficient implementations of distance histogramming are crucial.
Solution method:
CADISHI provides parallel, highly optimized implementations of distance histogramming.


On the CPU, high performance is achieved via an advanced cache blocking scheme in combination with vectorization and threading. On the GPU, the problem is decomposed via a tiling scheme to exploit the GPU's massively parallel architecture and hierarchy of global, constant and shared memory efficiently, resulting in significant speedups compared to the CPU. Moreover, CADISHI exploits all the resources (GPUs, CPUs) available on a compute node in parallel.
Additional comments including Restrictions and Unusual features (approx. 50-250 words):
CADISHI implements the minimum image convention for orthorhombic and general triclinic periodic boxes. We provide Python interfaces and the option to compile the kernels into a plain C library.

1. Introduction

Radial distribution functions link the structural and thermodynamic properties of soft and condensed matter [1, 2]. Structurally, these particle pair correlation functions provide the probability of finding a particle at a certain distance from another particle of a system. Thermodynamically, these functions determine the equation of state for systems with pair-wise interactions. In astronomy and astrophysics, these spatial two-point correlation functions are used to describe the distribution of galaxies or galaxy clusters in the universe [3, 4].

In experiments on condensed matter, the Fourier transform of the radial distribution function is measured by elastic scattering of x-rays or neutrons, or light scattering in the case of microscopically sized particles like colloids. Scattering intensities can be calculated accurately from radial distribution functions using Debye's equation [5]. This approach is used, for example, to calculate small- and wide-angle x-ray scattering intensities from molecular dynamics simulations of biological macromolecules in solution [6, 7].

Radial distribution functions are central to liquid state theory and facilitate the interpretation of molecular simulations. For simple liquids, we can predict phase transitions of liquid mixtures using radial distribution functions obtained from the integral theory of Ornstein and Zernike [8, 1]. In simulations, radial distribution functions help us to interpret ordering effects and the resulting effective interactions between particles [9]. For complex systems, we usually lack feasible theoretical approaches to calculate radial distribution functions. We thus estimate radial distribution functions from molecular dynamics simulation (MD) trajectories and Monte Carlo simulation ensembles by calculating histograms of particle pair-distances [10].

This task of calculating a radial distribution function scales with the number of particles squared and is thus computationally challenging for large systems. Large-scale parallel molecular dynamics simulations of soft and condensed matter in explicit solvent can generate large amounts of trajectory data with hundreds of thousands of frames and potentially millions of particles per frame [11]. Levine, Stone, and Kohlmeyer were the first to tackle this challenge by taking advantage of the processing power of CUDA-enabled GPUs [12]. Their software can be easily used via VMD [13], a widely used program to set up, visualize, and analyze molecular dynamics simulations. Extending their pioneering efforts, we provide here a novel software for the "CAlculation of DIStance HIstograms" (CADISHI), which uses both CPUs and GPUs on a single node to calculate radial distribution functions at very high performance. In addition to non-periodic systems, orthorhombic and general triclinic boxes are supported.

This paper is structured as follows. Section 2 briefly introduces the mathematical background and discusses sequential and parallel distance histogramming. Section 3 details how the distance histogram algorithms are efficiently implemented on the CPU and on the GPU. We report and discuss extensive benchmark results for both kinds of processors in section 4. Finally, section 5 closes the paper with a summary.

2. Methods

To calculate a radial distribution function from an ensemble of particles, we calculate all pair distances and collect them in a histogram. If the ensemble stems from molecular simulations then we have to properly take into account the boundary conditions.


Commonly, we use periodic boundary conditions in simulations and apply the minimum image convention. We distinguish two scenarios: In the first scenario, we are interested in bulk properties and we calculate radial distribution functions up to half the minimum image distance, e.g., half the box-length for a cubic box. In the second scenario, we simulate a single macromolecule in a box as a model for a dilute system. To calculate scattering intensities (SAXS/WAXS), we have to cut out the macromolecule and a sufficiently thick layer of solvent and effectively embed this system in infinite solvent [6]. In this case, we calculate the radial distribution function for the complete sub-system we have cut out, without applying any periodic boundary conditions.

In the following, we first show how radial distribution functions are calculated from histograms of pair-distances following the notation of Levine et al. [12]. We then sketch the basic algorithm, recapitulate how to take periodic boundary conditions for orthorhombic and triclinic boxes into account, and discuss different methods for the implementation and parallelization.

2.1. Mathematical background

The radial distribution function [1, 2] is defined as

g(r) = \lim_{dr \to 0} \frac{p(r)}{4 \pi (N_\mathrm{pairs}/V)\, r^2\, dr}.   (1)

Here, r is the distance between a pair of particles, p(r) is the average number of atom pairs found at a distance between r and r + dr, N_pairs is the total number of unique atom pairs in the system, and V is the total volume of the system. For MD simulations, p(r) is calculated from a finite number of trajectory frames N_frames for all unique atom pairs indexed by i, j as

p(r) = \frac{1}{N_\mathrm{frames}} \sum_k^{N_\mathrm{frames}} \sum_{i, j (\neq i)} \delta(r - r_{ijk}),   (2)

where r_{ijk} is the distance between particles i and j at frame k. The δ function is replaced by a uniform histogram on a grid by introducing

p(r) = \frac{1}{N_\mathrm{frames}} \sum_k^{N_\mathrm{frames}} \sum_{i, j (\neq i)} \sum_\kappa d_\kappa(r, r_{ijk}).   (3)

Here, κ is the histogram bin index. The value of a histogram bin is defined as

d_\kappa(r, r_{ijk}) = \begin{cases} \Delta r^{-1} & \text{if } r_\kappa \le r, r_{ijk} < r_\kappa + \Delta r \\ 0 & \text{else,} \end{cases}   (4)

where Δr is the width of a histogram bin, and r_κ = κ Δr is the lower bound of a histogram bin.

Algorithm 1 Two-species particle-pair distance histogramming algorithm.
  initialize histogram to zero
  for i = 0 to N1 {loop over species 1} do
    for j = 0 to N2 {loop over species 2} do
      compute distance vector dx_ij between particle i and particle j
      apply minimum image convention to dx_ij {for periodic systems only}
      compute distance r_ij = (x_ij^2 + y_ij^2 + z_ij^2)^{1/2}
      obtain bin index κ_ij = (int) n_bins r_ij / r_max
      increment histogram bin at index κ_ij
    end for
  end for

2.2. Basic sequential particle pair distance histogram computation

Algorithm 1 sketches the basic sequential two-species distance histogram computation, which is a common use case. From a set of N1 particles of species 1 and a set of N2 particles of species 2, the distance for each combination of two particles is evaluated and rescaled to an integer index which is finally used to increment the bin counter. In periodic systems, the minimum image convention is applied to the distance first. In total, distances between N1 × N2 particle pairs need to be binned.

Considering the single species distance histogram computation of a set with N particles, the difference to algorithm 1 is given by the constraint that we perform only evaluations of unique pairs. To this end,


the inner loop in algorithm 1 is modified to start from j = i + 1, with N = N1 = N2. In this single species case, in total N(N − 1)/2 particle pairs need to be considered.
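For reference, a minimal, unoptimized C++ sketch of algorithm 1 for the non-periodic case is shown below; the function name and the simple std::vector-based interface are illustrative and do not reproduce the actual CADISHI kernel API.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Minimal sketch of algorithm 1 (non-periodic case): bin all pair distances
    // between two coordinate sets into a histogram of n_bins bins up to r_max.
    // For the single-species case, pass the same set twice and let the inner
    // loop start at j = i + 1 instead (see text).
    struct Tuple3 { double x, y, z; };

    std::vector<uint64_t> distance_histogram(const std::vector<Tuple3>& c1,
                                             const std::vector<Tuple3>& c2,
                                             int n_bins, double r_max)
    {
        std::vector<uint64_t> histo(n_bins, 0);
        const double scale = n_bins / r_max;
        for (std::size_t i = 0; i < c1.size(); ++i) {      // loop over species 1
            for (std::size_t j = 0; j < c2.size(); ++j) {  // loop over species 2
                const double dx = c1[i].x - c2[j].x;       // distance vector
                const double dy = c1[i].y - c2[j].y;       // (a minimum image
                const double dz = c1[i].z - c2[j].z;       //  step would go here)
                const double r  = std::sqrt(dx * dx + dy * dy + dz * dz);
                const int kappa = static_cast<int>(r * scale);
                if (kappa < n_bins)                        // guard against r >= r_max
                    ++histo[kappa];
            }
        }
        return histo;
    }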

It is obvious that the floating-point intensive part of the algorithm is the distance computation. Further numerical costs are added by periodic boundary conditions, which introduce the need for rounding operations and, in the case of the general triclinic box, the need for multiple distance evaluations between the different images.

On a side note, it seems tempting to avoid the numerically costly square root operation in the distance calculation and to use a quadratically scaled histogram instead. However, in practice it turns out that the resulting larger histogram array spoils the cache efficiency and, in combination with the necessary postprocessing of the histogram, leads to inferior results compared to a high-performance implementation of the direct Euclidean distance computation.

2.3. Periodic boundary conditions

Commonly, molecular dynamics simulations and Monte Carlo simulations of soft and condensed matter apply periodic boundary conditions (PBCs) to minimize surface effects, which would otherwise cause artifacts. Moreover, using triclinic box geometries corresponding to the truncated octahedron or the rhombododecahedron, for example, we can calculate radial distribution functions for larger distances than if we used an orthorhombic box with the same number of particles. The use of triclinic boxes can also increase the performance of simulations of single macromolecules by minimizing the number of solvent particles we have to add. In a periodic box the distance between two particles is given by the distance between one particle in the central box and the nearest image of the other particle. CADISHI implements this so-called minimum image convention for both the orthorhombic and the general triclinic periodic box on both the CPU and the GPU, as detailed in the following.

Under the minimum image convention, the distance vector dx′ between two points in a general periodic box is given by

\mathbf{dx}' = \mathbf{dx} - \mathbf{b}\, \mathrm{nint}(\mathbf{b}^{-1} \mathbf{dx}),   (5)

where dx is the difference vector between the images of two particles in the same box, b is the 3 × 3 matrix of box vectors, and nint denotes rounding to the nearest integer. For further details, we refer to Appendix B in the book by Tuckerman [14].

An orthorhombic box has a rectangular basis and the basis vectors can have different lengths. In this case, b and b^{-1} are diagonal, which simplifies the evaluation of equation (5) in practice.

A general triclinic box has three basis vectors of different lengths which intersect at arbitrary angles. Hence, an orthorhombic box is a special case of the triclinic box. Equation (5) gives the minimum distance vector between two images for distances up to half of the minimum width of the periodic box. Note that b is now non-diagonal. To support larger distances, the distance vector from equation (5) needs to be shifted to the neighboring cells in order to find the true minimum distance. We refer the reader to the PhD thesis of Tsjerk Wassenaar for details [15].
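To illustrate equation (5) for the simple orthorhombic case, where b is diagonal and reduces to the three box edge lengths, a minimal sketch of the minimum image correction could look as follows; the function is hypothetical and not part of the CADISHI API. For the general triclinic case, the full matrix-vector products of equation (5) plus the additional search over neighboring cells described above would be required.

    #include <cmath>

    // Minimal sketch of the minimum image convention, Eq. (5), for an
    // orthorhombic box: b is diagonal, so b^{-1} dx reduces to a component-wise
    // division by the box edge lengths (bx, by, bz).
    inline void minimum_image_ortho(double& dx, double& dy, double& dz,
                                    double bx, double by, double bz)
    {
        dx -= bx * std::nearbyint(dx / bx);
        dy -= by * std::nearbyint(dy / by);
        dz -= bz * std::nearbyint(dz / bz);
    }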

2.4. Parallel particle-pair distance histogram computation

In the case of distance histogramming, the distance calculations between particle pairs are embarrassingly data parallel. The computational load is easily balanced between processing units by distributing equally sized subsets of the data, i.e. the combinatorial set of all the relevant particle pairs. Handling the bin updates and the computation of the final histogram correctly in parallel turns out to be more intricate. Basically, one of the following approaches is possible, as pointed out in Ref. [12].

First, all the processing units may update the bins of a single shared histogram concurrently, requiring atomic hardware instructions or other mechanisms to synchronize the individual memory updates. Second, each processing unit may fill its own private histogram, followed by a global reduction to sum up all the private partial histograms in order to obtain the final histogram. Third, a combination of both the previous approaches may be favorable, where groups


of processing units share a private histogram. As will be detailed in the following sections, the second option is suitable for the CPU whereas the third option is well suited for the GPU.

3. Implementation

The CPU and GPU kernels discussed in the following sections are implemented in templated C++. Templates avoid code duplication and allow, for example, to compile executables for single and double precision coordinate input from the same source code. Moreover, templates facilitate the generation of efficient code for all supported cases, i.e., with or without periodic boxes, and with or without the check if a distance falls within the maximum allowed value in the non-periodic case. This is critical for performance because branches in inner loops are avoided completely at runtime. The distance computation including the minimum image convention is implemented only once and used by both the CPU and the GPU via header file inclusion. We provide Python interfaces to the kernels.

Finally, to enable processing of large-scale MD simulation data, the CPU and GPU kernels need to be driven efficiently to exploit all the compute resources available on a computer. To this end we have implemented a Python layer labeled the CADISHI engine which is presented in section 3.3.

3.1. Histogram computation on CPUs

An efficient implementation of the particle-pair distance histogram algorithm needs to take advantage of all levels of parallelism and caches of modern x86_64 CPUs. First, there are several physical cores per chip which typically support more than one hardware thread each (simultaneous multithreading, called hyper-threading for Intel CPUs). Second, each core supports SIMD parallelism on vectors with a width of 128 (SSE2), 256 (AVX), or even 512 bits (AVX512), being able to operate on 4, 8, and 16 single precision numbers with a single instruction, respectively. Moreover, to hide memory latencies, a cache hierarchy exists with individual caches per physical core (L1 and L2), and caches shared between multiple cores (L3). Finally, in multi-socket machines,


Figure 1: Loop tiling for the distance histogram calculation in the case of a single particle species (top) and in the case of two particle species (bottom). For a single particle species, the tiles are triangular directly below the diagonal and square or rectangular elsewhere, with edges of maximum length bs. For two particle species, the i axis indicates the tile index of the first species and the j axis indicates the tile index of the second species.

different chips may access the same physical memory, however, at different latencies and bandwidths (NUMA).

To optimize for cache utilization, we implemented cache-blocked versions of the algorithms in addition to a direct implementation of the double-loop structure of algorithm 1. The cache-blocked versions target the L2 cache, which has a size of 256 kB to 1 MB on modern CPUs and is therefore large enough to hold for each thread a block of coordinates, an index buffer, and the partial histogram.

In our loop tiling, shown schematically in Fig. 1, the block size bs is defined as the length of an edge of a tile. The block size depends on the histogram


width n_bins and is given by the solution of

L2_cache_size − reserve + extension
    = sizeof(uint32_t) · bs²
    + 2 · sizeof(coordinate_tuple) · bs
    + sizeof(uint32_t) · n_bins,   (6)

which is a simple quadratic equation. On the left-hand side of Eq. (6) we define the amount of cache in bytes we make available for cache blocking. This amount is determined by the size of the cache, L2_cache_size, determined at runtime once during initialization, by the amount of cache we reserve for other data, reserve, and by extension. Here, we set reserve to 16 kB. The value of extension is set to zero by default. For histograms wider than 45k bins, the value of extension is increased proportionally to the histogram width in order to avoid that the block size gets too small and to enable the blocking scheme also for wide histograms. The transition at 45k bins was determined by benchmarking.

The right-hand side of Eq. (6) depends on the size sizeof(uint32_t) of the cache array for the indices, the size of the two sets of particle coordinates, 2 · sizeof(coordinate_tuple), and the width of the histogram, sizeof(uint32_t) · n_bins. As shown in the benchmark section below, the blocking scheme allows us to scale to large problem sizes without any performance degradation. At small problem sizes, the non-blocked versions are faster and a heuristic in the code decides which kernel is to be called for a specific input.
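The block size selection can be sketched as follows: solve the quadratic equation (6) for bs and round down. The function and its default values are illustrative; the actual CADISHI code determines the L2 cache size at runtime and applies the extension heuristic described above.

    #include <cmath>
    #include <cstdint>

    // Sketch of the cache-block size selection: positive root of
    //   sizeof(uint32_t)*bs^2 + 2*sizeof(coordinate_tuple)*bs
    //     + sizeof(uint32_t)*n_bins = L2_cache_size - reserve + extension
    int block_size(std::size_t l2_cache_size, std::size_t n_bins,
                   std::size_t sizeof_tuple,
                   std::size_t reserve = 16 * 1024, std::size_t extension = 0)
    {
        const double a = sizeof(uint32_t);                        // bs^2 term
        const double b = 2.0 * sizeof_tuple;                      // bs term
        const double c = double(sizeof(uint32_t) * n_bins)
                       - double(l2_cache_size - reserve + extension);
        return static_cast<int>((-b + std::sqrt(b * b - 4.0 * a * c))
                                / (2.0 * a));
    }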

The CPU kernels are threaded by means of OpenMP directives. Each thread works on a subset of the combinatorial set of all relevant particle pairs and updates its own private instance of the histogram. The cache-blocked implementation parallelizes over tiles, whereas the non-blocked version parallelizes the double-loop directly using OpenMP directives. The final step consists of the reduction of all the thread-private partial histograms into the complete histogram.

In addition to threading, SIMD parallelism (vectorization) is of key importance to achieve good performance on modern CPU cores. Two implementation details turn out to be essential for vectorization. First, to adjust the memory alignment, the coordinate triples must be padded, either implicitly via compiler-specific attributes or explicitly by extending the triple by a fourth dummy element. Second, incrementing a histogram bin immediately after each distance calculation would access memory in a non-predictable fashion and potentially lead to cache thrashing. Thus, the bin indices are stored temporarily in a contiguous buffer, which is of size bs² in case of the kernels with cache blocking. The bin updates are done from that buffer array, an operation that is inherently non-vectorizable due to its non-contiguous memory access pattern on the histogram array. Note that we do not use intrinsics for portability reasons. Hence, it is up to the compiler to actually vectorize the code, which may fail in some cases as shown below. The software Intel Amplifier XE was used to guide the optimization work.
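The following sketch combines the two points above with the thread-private histograms of the previous paragraph for a single bs × bs tile: the vectorizable distance loop writes bin indices into a contiguous buffer, and the scattered histogram increments are done in a second, separate pass. Padded four-component coordinate tuples are assumed; all names are illustrative, and the range check against n_bins is omitted for brevity.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct alignas(16) Tuple4 { float x, y, z, pad; };  // padded coordinate tuple

    // Process one bs x bs tile into a thread-private histogram.
    void bin_tile(const Tuple4* c1, int n1,             // tile of species 1
                  const Tuple4* c2, int n2,             // tile of species 2
                  float scale,                          // n_bins / r_max
                  std::vector<int>& idx_buf,            // size >= n1 * n2
                  std::vector<uint32_t>& histo)         // thread-private histogram
    {
        int m = 0;
        for (int i = 0; i < n1; ++i) {
            for (int j = 0; j < n2; ++j) {              // vectorizable inner loop,
                const float dx = c1[i].x - c2[j].x;     // no histogram access
                const float dy = c1[i].y - c2[j].y;
                const float dz = c1[i].z - c2[j].z;
                const float r  = std::sqrt(dx * dx + dy * dy + dz * dz);
                idx_buf[m++]   = static_cast<int>(r * scale);
            }
        }
        for (int k = 0; k < m; ++k)                     // second pass: scattered,
            ++histo[idx_buf[k]];                        // non-vectorizable updates
    }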

For performance reasons, we use integers for bin counts and thus have to take care to avoid integer overflows. The thread-private histograms use unsigned 32 bit integers (uint32_t). Whenever a thread's number of processed particle pairs approaches the upper limit of uint32_t, the thread-private histogram is added atomically to the global histogram ("flushed") and reset to zero. The global histogram uses unsigned 64 bit integers (uint64_t). Compared to a pure 32 bit implementation, the performance penalty turns out to be marginal. This strategy is also applied in the GPU implementation.
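A sketch of the flush step under the same illustrative naming might be: whenever a thread's pair counter approaches 2^32, its private uint32_t histogram is atomically accumulated into the global uint64_t histogram and reset (the OpenMP atomic update shown here is one possible way to do this, not necessarily the one used by CADISHI).

    #include <cstdint>

    // Flush a thread-private 32-bit histogram into the shared 64-bit histogram.
    void flush_histogram(uint32_t* histo_priv, uint64_t* histo_global, int n_bins)
    {
        for (int k = 0; k < n_bins; ++k) {
            #pragma omp atomic
            histo_global[k] += histo_priv[k];
            histo_priv[k] = 0u;
        }
    }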

3.2. Histogram computation on GPUs

GPUs are not only successfully used to speed up simulations of molecular dynamics [16, 17], they are similarly well suited to accelerate analysis tasks such as the pair-distance histogram calculation. State-of-the-art graphics processing units are comprised of several streaming multiprocessors that have on the order of 10 to 100 cores each. All the multiprocessors have access to a single cached global memory on the GPU, distinct from the host's main memory. We partly follow the lines of Levine et al. [12] for the implementation of a tiling scheme suitable to obtain high performance. Moreover, we extend their work, e.g. with kernels enabled to scale into the large bin


number regime and with the support of triclinic periodic boxes.

Our implementations are based on the NVIDIA CUDA programming model [18] but the key points made are applicable to other platforms as well. In the spirit of a heterogeneous programming model, kernels are launched from the host code to perform computation on the GPU using numerous lightweight threads. Logically, CUDA organizes threads in thread blocks, and arranges the thread blocks on a grid. All threads from a thread block run on the same streaming multiprocessor, grouped into so-called warps of 32 threads that run simultaneously. Threads of the same thread block are able to communicate via a shared memory that can be regarded as a user-managed cache. Different thread blocks are independent. When multiple threads read from the same memory address at the same time, GPU constant memory with its associated constant memory cache is highly beneficial. For further details, we refer the reader to the NVIDIA CUDA documentation [18].

With typically on the order of a thousand to a million particles per MD trajectory frame, the combinatorial set of all particle pairs contains on the order of 10^6 to 10^12 elements. Clearly, the large number of independent distance calculations can be mapped to independent threads very well in order to keep the numerous GPU cores busy. However, the efficient handling of the histogram bin updates is a major challenge. This operation involves some kind of synchronization between threads and is, moreover, characterized by scattered memory accesses.

In general, the data transfer between host and GPU memory is a performance critical aspect. For the distance histogram computation, the cost of the data transfer is insignificant for sufficiently large problem sizes due to the computational complexity O(N²); nevertheless, we take several optimization steps. To avoid the overhead from multiple individual transfers, the complete multi-species coordinate set of an MD trajectory frame is prepared in a contiguous memory area in pinned memory on the host and is then copied to a contiguous area of GPU global memory in a single operation. The GPU kernels then calculate the histograms for all combinations of particle species. Finally, the resulting set of histograms

is transferred from GPU global memory back to CPU pinned memory in a single operation. GPU memory is allocated at the first kernel call only and reused at subsequent calls.

To obtain the optimum performance for all relevant input parameters and to allow for cross validation, three kernel implementations of increasing complexity were developed that are explained in detail below. In the production code, the most suitable kernel is selected together with its optimum launch parameters for a particular GPU model based on the particle numbers and on the histogram width. To this end, the implementation internally provides heuristics covering recent GPU architectures. The NVIDIA Visual Profiler was used to guide the optimization work.

In the following sections, we present the GPU histogram kernel implementations in order of increasing complexity.

3.2.1. A basic GPU kernel

A straightforward approach to implement the particle-pair distance histogram calculation using the CUDA framework is to map the double-loop structure of algorithm 1 to a two-dimensional grid of thread blocks such that each thread works on an individual particle pair. Bin increments are naively implemented by performing atomic updates of a single shared histogram in global memory. The overhead of this approach can be reduced by choosing the actual CUDA grid smaller than the total grid spanned by the number of particle coordinates. Doing so we let each thread work on several particle pairs by looping over the full coordinate set using the CUDA total grid size as offset. Consequently, global memory accesses are coalesced automatically.
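A minimal sketch of such a simple kernel is given below, written with a one-dimensional grid-stride loop over the linearized pair index rather than the two-dimensional grid described above; kernel and variable names are illustrative and not taken from the CADISHI sources.

    #include <cstdint>

    // Simple kernel sketch: grid-stride loop over all N1 x N2 particle pairs
    // with atomic bin updates directly in global memory.
    __global__ void simple_histo_kernel(const float4* c1, int n1,
                                        const float4* c2, int n2,
                                        uint32_t* histo_global, int n_bins,
                                        float scale /* n_bins / r_max */)
    {
        const long long n_pairs = static_cast<long long>(n1) * n2;
        const long long stride  = static_cast<long long>(blockDim.x) * gridDim.x;
        long long p = static_cast<long long>(blockIdx.x) * blockDim.x + threadIdx.x;
        for (; p < n_pairs; p += stride) {
            const int i = static_cast<int>(p / n2);     // index into species 1
            const int j = static_cast<int>(p % n2);     // index into species 2
            const float dx = c1[i].x - c2[j].x;
            const float dy = c1[i].y - c2[j].y;
            const float dz = c1[i].z - c2[j].z;
            const float r  = sqrtf(dx * dx + dy * dy + dz * dz);
            const int  k   = static_cast<int>(r * scale);
            if (k < n_bins)                             // guard against r >= r_max
                atomicAdd(&histo_global[k], 1u);
        }
    }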

On older GPUs (pre-Kepler), the binning rate could be increased by cloning the histogram bins in global memory and reducing them in a final step, which lowers the access frequency of individual bins and therefore the collision rate. On the Kepler and more recent GPUs we used during the final development of this work, we find that the new fast atomic operations in global memory render such cloning unnecessary [18, Kepler tuning guide].


We refer to this implementation as the simple kernel. It is characterized by a rather flat performance profile independent of the number of bins. In the scope of this work it is only used to perform correctness checks and as the starting point for more complex kernels. It is outperformed substantially by the two implementations presented in the following, which are actually intended for production use.

3.2.2. Improving performance by coordinate tiling in constant memory

To speed up the simple GPU kernel, a tiling scheme is introduced in the spirit of a cache blocking technique aimed at a reduction of the number of accesses to global memory. We first make use of the GPU's constant memory segment in order to accelerate the memory access to the particle coordinate data. Limited in size to 64 kB, constant memory has an associated fast on-chip cache that delivers a value as quickly as if it was read directly from a register, provided that all the threads in the warp access the same address. However, data can be copied to constant memory only from the host code.

In our implementation, the inner loop of algorithm 1 is mapped to a one-dimensional CUDA grid, covering the second coordinate set stored in global memory. The outer loop is written explicitly inside the kernels. It iterates over a tile of the first coordinate set that is stored in constant memory. Hence, each thread reads a coordinate tuple from global memory in a coalesced fashion and performs the distance binning for all the particles stored in the constant memory tile. The latter is the same for all the threads at each loop iteration such that we exploit the fast constant memory cache. We added to the host code an additional loop which handles the copying of coordinate data to constant memory before launching the kernel. For single precision data, a tile in the 64 kB of constant memory comprises about 5300 coordinate tuples. For each such tile a kernel launch has to be performed.
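A minimal sketch of this host-side tiling loop is shown below, assuming unpadded float3 tuples in constant memory, an illustrative TILE_SIZE, and a global_mem_kernel that reads the first species from the constant memory tile (otherwise analogous to the simple kernel above); none of these names reproduce the actual CADISHI interface.

    #include <cstdint>
    #include <cuda_runtime.h>

    #define TILE_SIZE 5000                        // float3 tuples fitting into 64 kB

    __constant__ float3 c1_tile[TILE_SIZE];       // tile of species 1 in constant memory

    // Kernel defined elsewhere: like the simple kernel, but the outer loop over
    // the constant memory tile is written explicitly inside the kernel.
    __global__ void global_mem_kernel(int tile_len, const float4* c2, int n2,
                                      uint32_t* histo_global, int n_bins,
                                      float scale);

    void run_tiles(const float3* c1_host, int n1, const float4* c2_dev, int n2,
                   uint32_t* histo_dev, int n_bins, float scale,
                   int blocks, int threads)
    {
        for (int off = 0; off < n1; off += TILE_SIZE) {
            const int len = (n1 - off < TILE_SIZE) ? (n1 - off) : TILE_SIZE;
            // copy the current tile of species 1 into constant memory ...
            cudaMemcpyToSymbol(c1_tile, c1_host + off, len * sizeof(float3));
            // ... and launch one kernel per tile
            global_mem_kernel<<<blocks, threads>>>(len, c2_dev, n2,
                                                   histo_dev, n_bins, scale);
        }
        cudaDeviceSynchronize();
    }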

To optimize the performance for frames with several different particle species and different particle numbers, we minimize the number of kernel launches for two-species histograms. We first sort the particle coordinate sets by increasing particle number. As a consequence, the set for the second species located in global memory has typically more members than the first one located in constant memory. This arrangement reduces the number of necessary kernel launches.

We label this implementation the global memory kernel because it keeps the histogram in global memory. It turns out to be the fastest kernel when going to larger histogram bin numbers, as we will demonstrate below. In addition, it is an important intermediate step towards optimizing the kernel further, as will be done in the following by introducing a tiling scheme for the histogram in shared memory.

3.2.3. Improving performance by histogram tiling in shared memory

The global memory kernel presented in the previous section optimizes the coordinate tuple accesses but performs the atomic bin updates in comparably slow global memory. The key step to increase the binning performance is to introduce private partial histograms in shared memory. Compared to pre-Maxwell GPUs, the performance benefits greatly from the fast shared memory atomic operations that became available with Maxwell chips [18, Maxwell tuning guide].

In general, one shared memory buffer can be allocated per CUDA thread block. It can be accessed by all the threads in the block at near register speed. However, shared memory is a scarce resource and limited to a maximum of 48 kB per thread block on most of the GPUs relevant to the present work. Up to now, only the Volta V100 GPU can be configured to provide 96 kB of shared memory to a thread block, which significantly improves the performance of this implementation [18, Volta tuning guide]. Hence, 32 bit partial histograms in shared memory can be at most 12288 (×2 on the V100) bins wide. Consequently, to allow the kernels to process wider histograms, the binning range must be tiled, requiring multiple sweeps through all the particle pairs, increasing the run time proportionally. Before the lifetime of a thread block ends, the private partial 32-bit histogram in shared memory is added atomically to the 64-bit histogram in global memory.
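Under the same illustrative assumptions, the core of such a shared memory kernel might be sketched as follows: each thread block accumulates one bin-range tile [bin_lo, bin_hi) of a private 32-bit partial histogram in shared memory and flushes it atomically into the 64-bit global histogram before it finishes. The per-block shared memory size, (bin_hi - bin_lo) * sizeof(uint32_t), would be passed as the third kernel launch parameter, and the host would issue one launch per constant memory tile and bin-range sweep.

    #include <cstdint>

    __constant__ float3 c1_tile[5000];            // tile of species 1 (as above)

    __global__ void shared_mem_kernel(int tile_len, const float4* c2, int n2,
                                      unsigned long long* histo_global,
                                      int bin_lo, int bin_hi, float scale)
    {
        extern __shared__ uint32_t histo_s[];     // bin_hi - bin_lo counters
        const int n_tile_bins = bin_hi - bin_lo;
        for (int b = threadIdx.x; b < n_tile_bins; b += blockDim.x)
            histo_s[b] = 0u;                      // zero the partial histogram
        __syncthreads();

        const int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j < n2) {
            const float4 q = c2[j];               // coalesced global memory load
            for (int i = 0; i < tile_len; ++i) {  // loop over constant memory tile
                const float dx = c1_tile[i].x - q.x;
                const float dy = c1_tile[i].y - q.y;
                const float dz = c1_tile[i].z - q.z;
                const float r  = sqrtf(dx * dx + dy * dy + dz * dz);
                const int  k   = static_cast<int>(r * scale);
                if (k >= bin_lo && k < bin_hi)    // only bins of this sweep
                    atomicAdd(&histo_s[k - bin_lo], 1u);
            }
        }
        __syncthreads();
        for (int b = threadIdx.x; b < n_tile_bins; b += blockDim.x)  // flush
            atomicAdd(&histo_global[bin_lo + b],
                      static_cast<unsigned long long>(histo_s[b]));
    }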



Figure 2: Tiling scheme employed for the parallel decomposition on the GPU, including the use of the memory hierarchy. From CPU memory (left, blue), a set of coordinate data from the first species is copied to GPU constant memory (constant, grey) before each kernel launch. The coordinate data of the second species is copied initially once to global GPU memory (global, red). The loop over that second set of coordinates is implemented via a 1d grid of thread blocks, such that each thread block operates on a different subset of coord_2. Each thread block internally loops over the first species, accessing the identical element at a time from each thread and thereby exploiting the fast constant memory cache. Each thread block updates its private partial histogram in shared memory (shared, violet). Before a thread block terminates, the private partial histogram is added atomically to the histogram in global GPU memory, which is finally copied back to CPU memory.

Figure 2 shows a schematic of our shared-memory-based GPU implementation; a detailed explanation of the memory accesses is given in the caption.

We label this implementation the shared memory kernel. It is by far the fastest kernel at comparably small bin numbers, when only one or a few sweeps through the particle pairs are necessary. For any requested number of bins, the implementation determines the number of sweeps from the available shared memory. The shared memory histograms are then tiled to have (about) the same size for all the sweeps.

3.2.4. GPU kernel optimization

We determined the optimum block sizes of both the global memory and the shared memory kernels for the 1d CUDA grid from benchmark runs of sufficiently large problem sets, which saturate the GPUs. These optimal sizes are chosen automatically by a simple heuristic. For Maxwell and newer GPUs, a block size of 512 threads is chosen, whereas on older (Kepler) GPUs it is beneficial to use larger blocks of 1024

threads. A user may override these defaults when calling the kernels.

In the case of the shared memory kernel, we determine the size of the tiles for the partial histograms by the number of sweeps necessary to compute a requested histogram width. The number of sweeps is determined by the maximum amount of shared memory available per thread block. In general, as little shared memory as possible should be used in order to keep the occupancy of the streaming multiprocessors sufficiently high, primarily to hide global memory latencies. The best performance is not necessarily achieved at an occupancy of 100%, which refers to the maximum number of threads a GPU is able to keep active. For some compute-bound kernels the instruction-level parallelism is able to use the GPU very well at occupancies smaller than one [19]. As shown in section 4.2 the distance histogram kernels fall into this class and are able to saturate the GPUs at occupancies down to 25%.

So far, we have only discussed the two-species computation.


As pointed out before, for the single-species case the only difference is given by the start index of the inner loop, such that duplicate and same-particle evaluations are avoided. In the GPU kernels, an if branch is used to determine from the thread index within the CUDA grid whether the thread shall evaluate a particle pair. Note that for sufficiently large problem sizes only a small fraction of the CUDA thread blocks are actually affected by this if branch, very similar to the diagonal blocks required for the CPU cache blocking (see fig. 1). If the condition is true for all the threads of a warp, no overhead is introduced. If it is true for only some of the threads, the branches are serialized, i.e., some threads of the warp stay idle while the other threads perform their computations in parallel.

Intentionally, the present implementation does not split large sets of particle pairs onto several GPUs. Rather, motivated by realistic application scenarios that require the processing of numerous frames, individual frames are processed completely on a single GPU, allowing us to exploit trivial frame parallelism by using several GPUs.

3.3. CADISHI parallel engine

Next, we discuss the CADISHI engine, which enables users to exploit all the resources (CPUs, GPUs) available on a compute node simultaneously. Such an efficient use of resources is especially useful for the parallel analysis of long MD trajectories with many frames.

The concept of a data processing pipeline serves as the design principle, as shown schematically in figure 3. Frames are provided and buffered in a queue by a reader process. To ensure high performance, the input trajectory is read from HDF5 files [20]. The frames are picked up and processed in parallel by multiple worker processes, each computing all the histograms for a particular frame using the CPU or GPU kernels. The results are put into a second queue from which a writer process fetches the histograms, averages them optionally, and saves them to HDF5. To enable users to import MD simulation data easily, we provide a conversion tool, which uses a generic reader [21, 22].


Figure 3: Schematic of the CADISHI trajectory data processing framework. Multiple processes (reader, writer, workers) that communicate and synchronize via queues are used to implement node-level parallelism. The workers either use GPU or CPU resources.

This design offers a high degree of modularity and flexibility for current and future methodical extensions, e.g., the integration into more complex analysis workflows using the CAPRIQORN package [7, 6]. We use the Python programming language to implement CADISHI. In particular, we use Python's multiprocessing module, which is part of the standard library [23] and enables node-level parallelism out of the box on virtually any platform. The implementation neither uses nor requires a third-party dependency such as the message passing interface (MPI) for distributed-memory parallelism. The cost of the inter-process communication of the atom coordinates and the histograms is negligible compared to the N² complexity of the computational problem. The workers release the global interpreter lock of Python explicitly when calling the compiled CPU and GPU kernels such that the inter-process communication continues to run smoothly during the computation.

A useful configuration on a typical two-socket, two-GPU compute node could be as follows. Two CPU workers are used, each running the previously


discussed thread-parallel CPU histogram code on an individual CPU socket. In addition, two GPU workers are used, each of them running the GPU histogram code on an individual GPU. It is important to reserve a physical CPU core for each of the reader, the writer, and the GPU workers, in order to guarantee quick data transfer and to avoid IO becoming the bottleneck. CADISHI picks appropriate core numbers and also handles the process pinning automatically.

4. Performance benchmarks

We performed extensive performance benchmarks and investigated the binning rates of the CPU and GPU kernels individually for input data of various sizes. The input data was generated by placing particles at pseudo-random coordinates into a unit box. To profile the CPU and GPU codes, we used a driver program to supply the coordinate data, to launch the kernels, to determine the time-to-solution, and to calculate the binning rate in billion atom pairs per second (bapps). All computations discussed below were performed in single precision, which is the relevant use case when MD simulation data is processed. In general, double precision runs turn out to be between a factor of 1.3 and 2 slower on CPUs and non-consumer GPUs, depending on the problem size and the presence of a periodic box. In all cases, we measured the time to solution, which includes memory transfers to and from the GPU and overhead from GPU kernel launches.

In the second part, we show results for the node-level performance obtained on a compute node with two GPUs, running the CADISHI engine on a practically relevant data set from an MD simulation.

4.1. CPU performance

The CPU kernel was profiled on a shared memory machine with two Intel Xeon Platinum 8164 processors [24], providing 26 physical cores and 52 hardware threads each. The base clock of a core is 2 GHz, and the cores support the AVX512 instruction set.

Benchmark results in this section are exclusively based on the binary generated by the Intel icc compiler, version 2017, using the optimization flags "-fast -xHost -qopt-zmm-usage=high -qopenmp". Indeed, the Intel compiler manages to generate AVX512 code for the inner loop with the distance computation, covering all the possible box cases. The instruction set in use was checked at runtime by reading out hardware counters for floating point instructions via the Linux perf tool.

For comparison, we compiled kernels using the GNU g++ compiler in version 7.2 with the optimization flags "-O3 -march=native -ffast-math -funroll-loops -fopenmp" applied. In doing so, vectorized AVX512 code is generated by g++ for the inner loop of algorithm 1 in the case without periodic boundary conditions. In the cases of orthorhombic and triclinic boxes, rounding operations and operations to determine the minimum prevent the vectorization by the GNU compiler. Compared to the binary generated by icc, we find that the binary from g++ runs virtually at the same speed for the vectorized case without any box, whereas the cases with a periodic box run significantly slower due to the lacking vectorization.

As will be seen in the next section, the GPUs perform much better in general, which mitigates the drawback of having the boxed computation not vectorized well on the CPU with the widely used GNU compiler.

Figure 4 shows a scan in the number of OpenMP threads for a fixed problem size of 1M × 1M particle pairs and 10k bins, covering the three possible box cases. Initially, the number of threads is increased from 1 up to all the 26 physical cores on a single socket. While the scaling is ideal at the beginning, the curve starts to deviate from the ideal scaling when the socket gets increasingly filled. Since the kernel is of complexity O(N²) and therefore compute and not memory limited, this effect is likely to be caused by the dynamic clocking of the vector units, which reduces the frequencies as more and more cores are used in order to limit the heat dissipation. Memory accesses are of minor importance, in particular due to the cache blocking optimization. Going from 1 to 2 sockets, the scaling is ideal. Enabling hyperthreading in addition shows no benefit in the case without PBCs and only marginal benefit in the other cases, indicating that the CPU pipelines are already used quite well. The scaling is rather similar for the three



Figure 4: Wall clock time of the histogram calculation as a function of the number of OpenMP threads. Calculations were performed on the Skylake CPU for a fixed problem size of 1M × 1M particle pairs and 10k bins, with cache-blocking enabled. The vertical dotted lines indicate the transition from 1 to 2 sockets, and, moreover, the transition into the simultaneous multithreading regime.

cases. The run times clearly indicate the cost associated with the orthorhombic and, in particular, the triclinic box. For all the following investigations on the CPU, we use 26 threads on a single socket, in order to provide a 1:1 baseline for the comparison with current GPU models done in the following section.

Figure 5 shows a scan in the number of histogram bins on the Skylake processor, comparing the implementations with and without the cache blocking scheme for a fixed problem size of 500k × 500k particle pairs, with and without periodic boxes. At histogram widths of up to about 10k bins, the implementation with cache blocking achieves the highest binning rate of about 13 bapps in the case without PBCs. Beyond that, the binning rate decreases to about 10 bapps at 100k bins, around and beyond which a linear decrease is observed. The reason for the initial steep decrease is that, on each core, the thread-private histogram occupies an increasingly larger fraction of the L2 cache as the number of bins is increased. Therefore, the coordinate blocks are chosen accordingly smaller, decreasing the efficiency of the blocking scheme, cf. Eq. (6). Note that


Figure 5: Histogramming rate as a function of the number of bins. Calculations were performed on the Skylake CPU for a fixed problem size of 500k × 500k atom pairs. We compare the implementations with and without cache blocking for the cases without PBCs and with PBCs using orthorhombic and triclinic unit cells.

the cache blocking scheme was not designed to optimize for scans in the histogram width but rather for large problem sizes, as will be discussed next. Moreover, in practice many applications require only moderate histogram widths which lie within the regime of highest performance. In comparison, the implementation without cache blocking is significantly slower, for small bin numbers by about one third in the case without box. The relative advantage from the cache blocking decreases when going to the orthorhombic case and virtually vanishes for the triclinic case, which is caused by the increasing arithmetic intensity making the memory accesses less important.

Figure 6 shows scans in the problem size while keeping a fixed histogram width of 10k bins. Results for the implementations with and without cache blocking are shown. In addition to cases without PBCs, performance data for the kernels handling periodic boxes is included. We first discuss the scans without periodic box. Up to a problem size of about 100k × 100k atom pairs, the code without cache blocking turns out to be faster than the variant with blocking, clearly indicating some overhead of the blocking scheme. Going beyond that problem size, the



Figure 6: Histogramming rate as a function of the problem size. Calculations were performed on the Skylake CPU for a fixed histogram width of 10k bins. We compare the cache-blocked with the non-blocked implementation. In addition to the default case without PBCs, we show results for PBCs using orthorhombic and triclinic boxes.

performance of the code with cache blocking is nearly constant, whereas the performance without blocking drops severely, by more than two thirds in the range under consideration. Turning towards the cases with periodic boxes, it is observed that the binning rate is clearly lower compared to the case without periodic boxes. Comparing the plateaus, the performance goes down to about two thirds for the orthorhombic case and down to about one third for the triclinic case. The overhead is partially caused by the nearbyint() function which is used to perform the rounding during the application of the minimum image convention. In microbenchmarks, the nearbyint() function turned out to be faster than the round() function by about 10%, which is likely due to the fact that it does not raise the Inexact exception on the CPU. For the triclinic box, a three-fold loop over all neighbouring boxes is executed in addition to find the minimum image for the general case, causing the additional overhead.

4.2. GPU performance

The CUDA code was compiled with the general optimization flags "-O3 -use_fast_math" and was in addition adapted for the GPU architecture under consideration, e.g., by applying the flags "--generate-code arch=compute_70,code=compute_70" for the Volta GPU. For the host code, GCC and the same flags as previously were used.

We profiled the GPU kernel on several state-of-the-art hardware platforms as shown in table 1. We present results for NVIDIA V100 [27], P100 [26], and GTX1080 [25] GPUs. The V100 GPU is part of an NVIDIA DGX-1 system [29], whereas the P100 GPU is part of an IBM POWER8+ system [28]. In the DGX-1 system, the V100 GPUs are connected to the host CPUs via PCIe 3.0. In the POWER8+ system, the P100 GPUs are connected via NVLink, which is about a factor 3 faster in host-device bandwidth than PCIe 3.0. Note that the GTX1080 GPU was designed for entertainment applications, lacks ECC memory, and has only very few units enabled for double precision operations.

Below we compare the performance of both implementations of interest. First, we investigate the shared memory kernel which keeps partial histograms in shared memory, potentially requiring several sweeps through all the particle pairs when going to larger bin numbers (see section 3.2.3). In addition we profile the global memory kernel, which updates a single histogram in global memory (see section 3.2.2) and turns out to be of advantage for larger bin numbers.

We present scans in the histogram width in figure 7 with the problem size kept fixed at 4M × 4M atom pairs, which is large enough to saturate the GPUs (see below). Initially, the performance curves for the shared memory kernels are constant at rather high levels. Remarkably, for the V100 GPU we find a plateau of highest performance between 9k and 12k bins, where the curve peaks around 9k bins at a binning rate of 495 bapps. Following that initial range of highest performance, the binning rate decreases in steps, with a width determined by the size of the shared memory on the GPU. Due to the limited memory, multiple sweeps are required for larger bin numbers. Here, the V100 GPU has twice the step width due to its twice as large shared memory of 96 kB per thread block. The performance profile


Table 1: Overview of the GPUs and host systems used for the performance benchmarks. For more detailed specifications, we refer to the references [25, 26, 27, 28, 29]. Note that the system with GTX1080 GPUs is also used for benchmark runs based on real MD simulation data, cf. Sec. 4.4.

GPU                   (2×) GTX1080     (4×) P100          (8×) V100
fp32 peak [TFLOPS]    8.87             10.6               15.7
mem-bandw. [GB/s]     320              732                900
Bus                   PCIe 3.0         NVLink             PCIe 3.0
CPU                   Intel Haswell    IBM POWER8+        Intel Broadwell
                      2 × E5-2680v3    2 packages         2 × E5-2698v4
Cores (Threads)       2 × 12 (× 2)     2 × 10 (× 8)       2 × 20 (× 2)


Figure 7: Scan in the number of histogram bins for a problem size kept fixed at 4M × 4M atom pairs without any periodic box, comparing the implementations using shared and global memory for the histogram updates on the three GPUs under consideration.

of the consumer-grade GTX1080 GPU is very similar to that of the enterprise-grade P100 GPU, both featuring the Pascal microarchitecture. The P100 GPU is somewhat faster, likely due to its higher internal memory bandwidth. In contrast, the performance curves of the global memory kernel rise initially and reach plateaus at 12k bins. For larger bin numbers, the possibility of collisions during bin updates in global memory is low. Again the V100 GPU is the fastest, followed by the P100 and GTX1080 devices. The break-even point for the global memory kernels is around 60k bins for the Pascal GPUs and at 128k bins for the V100 GPU. This information is

used by the implementation for a simple heuristic to decide which kernel is to be called for a given bin number.

The achieved occupancy on the GPUs is nearly 100% for histogram widths up to 4k bins (8k on the V100), when the CUDA grid is chosen to launch 512 threads per block, for which we observe the best performance. As confirmed using the nvprof tool, the occupancy decreases down to 25% as the histogram width in shared memory is increased to the maximum possible value of about 12k (24k for the V100) bins. The scans shown in figure 7 indicate that the performance of the shared memory kernel is not at all affected by the varying occupancy, which is due to its compute-bound characteristics and instruction-level parallelism [19]. Rather, the number of sweeps is decisive, as seen clearly from the staircase-type curves.

Figure 8 shows scans in the problem size for cases with and without periodic boxes, while the histogram width is kept fixed at 10k bins. Following the initial steep rise of the performance curves, it can be seen that at least 10^10 atom pairs are required for the Pascal-based GPUs and at least 10^12 atom pairs for the V100 GPU to reach performance saturation, which is indicated by a flat top in all cases. For the P100 GPU, the effect from the fast NVLink interconnect is clearly observed for small problem sizes when the GPUs are not saturated and the costs of memory transfer and kernel launches are significant. In particular, the curves of the P100 device are shifted towards the left compared to the GTX1080 and the V100 GPUs, which are connected via the relatively slower PCIe 3 bus to the host CPUs. The curves for the periodic boxes already saturate the GPUs



Figure 8: Scan in the problem size for a fixed histogram widthof 10k bins, comparing the implementations with and withoutthe periodic boxes on the three GPUs under consideration.

The curves for the periodic boxes saturate the GPUs at somewhat smaller problem sizes and level off at lower binning rates, both effects being due to the higher arithmetic intensity. Interestingly, on the enterprise-grade P100 and V100 GPUs, the performance of the periodic-box cases is much closer to the cases without PBCs than on the GTX1080 GPU. For example, in the case of the orthorhombic box, the binning rate with box is 0.69 (0.48) of the binning rate without box on the P100 (V100) GPU, whereas it is only 0.07 on the GTX1080 device. The reason is likely linked to the fact that a significant part of the Pascal processor's arithmetic capabilities is disabled on the consumer-grade GTX1080, in particular a large fraction of the double-precision units. We speculate that this might also apply to the round operation required by the periodic boxes.
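To make the role of the round operation explicit, the minimum-image distance for an orthorhombic box can be computed as in the following generic sketch (a textbook formulation under our own naming, not necessarily the exact code used in CADISHI):

// Fold each component of the difference vector back into [-L/2, L/2] via
// nearest-integer rounding before computing the distance; box = (Lx, Ly, Lz).
__device__ inline float distance_ortho(float3 a, float3 b, float3 box)
{
    float dx = a.x - b.x;
    float dy = a.y - b.y;
    float dz = a.z - b.z;
    dx -= box.x * rintf(dx / box.x);
    dy -= box.y * rintf(dy / box.y);
    dz -= box.z * rintf(dz / box.z);
    return sqrtf(dx * dx + dy * dy + dz * dz);
}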

4.3. Performance comparison between CPU and GPU

Table 2 gives a direct comparison between the performance on the CPU and on the GPUs under consideration, based on data from the previously discussed runs. A large problem size of 4M × 4M atom pairs was selected, at which the performance on both types of processors is saturated, together with a histogram width of 10k bins, where the highest binning rates are seen on both types of processors. Absolute and relative performance numbers are compared for the cases without and with periodic boxes. In all cases the GPUs are significantly faster than the CPU. While the consumer-grade GTX1080 is competitive with the professional Pascal model P100 in the case without periodic box, the GTX1080 is significantly outperformed in the orthorhombic and triclinic cases, presumably linked to its disabled floating-point rounding capabilities. The Volta V100 GPU is by far the fastest, beating the 26-core Skylake by a factor of up to ∼40 in the case without any box.

Note that our implementations are optimized to be most efficient for small and moderately large bin numbers of up to 24k. In many application scenarios it is possible and sufficient to choose the number of bins, i.e., the desired numerical resolution of the one-dimensional radial distribution function, to lie within that range of highest performance. Finally, we point out that the CPU code is faster than the GPU code for small problem sizes, below selections of about 100k atoms, which is due to the overhead induced by the heterogeneous programming model and hardware of the GPU.

4.4. CADISHI single-node application performance

This section reports on the CADISHI application performance based on actual MD simulation data, measured on a standard 2-socket server equipped with 2 GPUs.

The data set under consideration comprises 2000 frames with about 280000 particles each from an MD simulation of F1-ATPase. The simulations were performed using NAMD [30]. With 9 chemical species, 36 partial histograms have to be evaluated for all the possible intra- and inter-species combinations for each frame, where the individual numbers of particle pairs differ widely, ranging from ∼10^2 to ∼10^10, which is particularly challenging for the GPU implementation. No periodic box was considered. The compressed HDF5 trajectory file has a size of 9 GB. A resolution of 8000 histogram bins was chosen, going up to a maximum radius of 300 Å. We summed partial histograms at intervals of 100 frames and wrote the summed-up histograms to disk, leading to a compressed HDF5 output file of 18 MB in size.



Table 2: Comparison between the performance on the Skylake chip with 26 cores and the performance on the three GPUs under consideration, for a problem size of 4M × 4M atom pairs and 10k bins, without and with periodic boxes. For the GPUs, the performance relative to the CPU is given.

processor      no box               orthorhombic box       triclinic box
               bapps [s⁻¹]   rel.   bapps [s⁻¹]    rel.    bapps [s⁻¹]    rel.
Skylake          12.38       1.00      8.63        1.00       4.22        1.00
GTX1080         135.66      10.96      9.44        1.09       6.28        1.49
P100            175.04      14.14    120.89       14.01      30.11        7.14
V100            494.45      39.95    239.34       27.74      55.25       13.10

The performance benchmarks were run on an HPC cluster with two Haswell CPUs [E5-2680 v3, with 12 (24) physical cores (hardware threads) each] and two consumer-grade GTX1080 GPUs per node, and with a shared GPFS file system on which the IO was performed.

Table 3 gives performance numbers for a selection of setups found to perform well. Note that one physical core each was reserved for the reader, the writer, and any (potentially present) GPU workers, and that for the setups involving CPU workers all the remaining physical cores were used. The waiting time for new work packages is about 10 ms for all the setups shown. The plain CPU runs C1 and C2 processed the complete dataset in less than 3 hours. Here, using two workers on separate NUMA domains turns out to be slightly faster than using only one worker spanning both domains. Simultaneous multithreading was enabled for the sake of a small speedup (cf. Fig. 4). On the other hand, the plain GPU run G2 processed the 2000 frames in slightly less than 6 minutes. Relative to the run C2, the speedup is 27.0, demonstrating that the GPU has a significant advantage not only with synthetic benchmark data but also with real MD simulation data.

It seems tempting to perform hybrid runs to further speed up the GPU runs. However, our experiments indicate that keeping all the CPU cores busy in addition to the GPUs pays off only marginally, if at all. For such hybrid runs we find that the number of threads per CPU worker must not exceed the number of available physical cores. For larger numbers of threads, hyperthreading clogs the threaded multiprocessing queues between the reader, the writer, and the worker processes. Only the case G2C1 with a single CPU worker is slightly faster than the plain GPU case G2. For hybrid runs, the imbalance in processing speed between the CPU and the GPU leads to the situation that the CPU worker is still processing a final work package while the GPU workers have already finished. This effect is significant in the present example with 2000 frames but will become less important for very large frame numbers. Moreover, a run-time system for task-based parallelism may help to mitigate such situations, see e.g. [31].

5. Summary and conclusions

The CADISHI software achieves very high performance on both CPUs and GPUs. The kernels for both types of processors can be driven by the Python-based CADISHI engine to enable high-throughput analysis of MD trajectories. CADISHI implements a producer-consumer model and thereby allows for the complete utilization of all the CPU and GPU resources available on a specific computer, independent of special libraries such as MPI, covering commodity systems up to high-end HPC nodes. CADISHI enables the analysis of trajectories with many thousands of frames in a minimum amount of time. Processing 2000 frames of trajectory data from an F1-ATPase simulation, with about 280000 particles each, was demonstrated to run in less than 6 minutes on a standard two-socket compute node with two consumer-grade GPUs.

To achieve high performance on the CPU, we proposed a cache-tiling scheme tailored to fit the L2 cache size of a CPU core, OpenMP SIMD directives in combination with a linear index buffer to help the compiler generate vectorized code, and thread parallelism over tiles using classical OpenMP directives.



Table 3: Overview of the single-node CADISHI application performance achieved for the F1-ATPase dataset with 2000 frames, measured on the 2-socket Haswell node with 2 GTX1080 GPUs. The time was taken until all the partial histograms were written to disk, i.e., the total time to solution is given, whereas the performance in bapps is reported per compute worker and does not include buffering and IO time. The use of simultaneous multithreading is indicated by an asterisk (∗). Three runs per setup were performed and averaged.

setup    workers                  bapps [s⁻¹]           time [s]
         CPU (threads)    GPU     CPU       GPU
C1       1 (44∗)          0         8.2       0         9594.6
C2       2 (22∗)          0         4.2       0         9422.9
G2       0                2         0       117.1        348.8
G2C1     1 (20)           2         6.6     118.1        341.7
G2C2     2 (10)           2         3.3     116.0        351.3

In our test, the implementation performs and scales well up to a full shared-memory node consisting of two 26-core Intel Skylake processors.
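The following simplified sketch illustrates how these three ingredients combine; the tile size, all names, and the per-thread histogram merge are our own choices rather than CADISHI's actual code, and overflow checks as well as the periodic-box variants are omitted:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Cache-blocked, thread-parallel distance histogramming with a linear index
// buffer so that the distance/bin computation vectorizes (illustrative sketch).
void histogram_cpu(const float *x1, const float *y1, const float *z1, int n1,
                   const float *x2, const float *y2, const float *z2, int n2,
                   std::vector<std::uint64_t> &hist, float scale /* = n_bins / r_max */)
{
    constexpr int tile = 512;                          // tiles sized to fit the L2 cache
    const int n_bins = static_cast<int>(hist.size());

    #pragma omp parallel
    {
        std::vector<std::uint64_t> local(n_bins, 0);   // private per-thread histogram
        #pragma omp for collapse(2) schedule(dynamic)
        for (int ii = 0; ii < n1; ii += tile) {
            for (int jj = 0; jj < n2; jj += tile) {
                const int i_end = std::min(ii + tile, n1);
                const int j_end = std::min(jj + tile, n2);
                for (int i = ii; i < i_end; ++i) {
                    int idx[tile];                     // linear index buffer
                    const int len = j_end - jj;
                    #pragma omp simd
                    for (int k = 0; k < len; ++k) {    // vectorizable: no scatter here
                        const float dx = x1[i] - x2[jj + k];
                        const float dy = y1[i] - y2[jj + k];
                        const float dz = z1[i] - z2[jj + k];
                        idx[k] = static_cast<int>(std::sqrt(dx * dx + dy * dy + dz * dz) * scale);
                    }
                    for (int k = 0; k < len; ++k)      // scalar scatter into the histogram
                        ++local[idx[k]];               // assumes distances fit the binning range
                }
            }
        }
        #pragma omp critical                           // merge the per-thread histograms
        for (int b = 0; b < n_bins; ++b)
            hist[b] += local[b];
    }
}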

Compared to running the optimized CPU code on a 26-core Intel Skylake processor, we find that our optimized GPU code achieves a speedup of up to 40 on an NVIDIA V100 GPU for a case without any periodic box. For orthorhombic and triclinic periodic boxes the speedup is 28 and 14, respectively. The consumer-grade GTX1080 GPU is competitive with the professional models, in particular for cases without a periodic box.

Here, we confirmed the observation of Levine et al. [12] that the use of constant memory to cache coordinate data is key to achieving high performance, even though the maximum available constant memory per GPU did not increase over the GPU generations. In contrast to Levine et al., we use the scarce shared memory exclusively for storing histogram data and not for actively caching coordinate data; we rather rely on the GPU's native hardware caches. Levine et al. use shared memory to implement overflow protection, which we handle differently via constant-memory tiles of known size in combination with a CUDA grid directly mapping the loop over the second particle species. The histogram updates in both global and shared memory have seen significant improvement due to the introduction of fast atomic instructions with recent GPU generations [18]. Moreover, the 96 kB of shared memory per streaming multiprocessor of the V100 GPU accelerates the computation for medium and large bin numbers; this hardware feature was previously unavailable.
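A minimal host-side sketch of this constant-memory tiling is given below; the tile size and all names are illustrative, and the kernel sketch shown earlier passes the tile via a pointer argument for simplicity, whereas here the tile would reside in constant memory:

#include <cuda_runtime.h>

// Fixed-size coordinate tile of the first species in constant memory;
// 4096 float3 elements occupy 48 kB of the 64 kB of constant memory.
constexpr int TILE = 4096;
__constant__ float3 c_tile[TILE];

// Loop over tiles of species 1 on the host; within each kernel launch the
// CUDA grid maps the loop over the second species (illustrative names only).
void run_tiled(const float3 *h_p1, int n1, int n2, unsigned int *d_hist, int n_bins)
{
    for (int off = 0; off < n1; off += TILE) {
        const int n_tile = (n1 - off < TILE) ? (n1 - off) : TILE;
        cudaMemcpyToSymbol(c_tile, h_p1 + off, n_tile * sizeof(float3));
        // launch a histogram kernel for this tile, e.g. with 512 threads per
        // block and n_bins counters in shared memory (kernel not shown here):
        // histogram_tile<<<(n2 + 511) / 512, 512, n_bins * sizeof(unsigned int)>>>(...);
    }
}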

Importantly, we complement the work of Levine et al. [12] by providing a high-performance CPU implementation featuring vectorization and parallelization, support for the triclinic box, template-based support for single and double precision and for runtime checks that the distances fit within the maximum binning range, and the CADISHI parallel engine for node-level parallelism, opening up unprecedented possibilities for large-scale MD data analysis. We provide Python interfaces and the option to compile the kernels into a plain C library.

The CADISHI software package presented in this paper is available free of charge in source code under the permissive MIT license at [32]. It can be used together with the CAPRIQORN software package [7, 6] to calculate SAXS/WAXS scattering intensities from molecular dynamics trajectories.

Acknowledgements

We thank Prof. Gerhard Hummer, Max Linke, and Dr. Markus Rampp for fruitful discussions. We thank Prof. Kei-ichi Okazaki for providing an initial NAMD setup for F1-ATPase. We acknowledge financial support by the Max Planck Society.



References

[1] J.-P. Hansen, I. R. McDonald, Theory of Simple Liquids: with Applications to Soft Matter, 4th Edition, Academic Press, 2013.

[2] D. A. McQuarrie, Statistical Mechanics, Harper and Row, New York, 1975.

[3] V. Springel, S. D. M. White, A. Jenkins, C. S. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly, J. A. Peacock, S. Cole, P. Thomas, H. Couchman, A. Evrard, J. Colberg, F. Pearce, Simulations of the formation, evolution and clustering of galaxies and quasars, Nature 435 (2005) 629. doi:10.1038/nature03597.

[4] M. Kerscher, I. Szapudi, A. S. Szalay, A comparison of estimators for the two-point correlation function, The Astrophysical Journal Letters 535 (1) (2000) L13.

[5] P. Debye, Molecular-weight determination by light scattering, The Journal of Physical and Colloid Chemistry 51 (1) (1947) 18–32. doi:10.1021/j150451a002.

[6] J. Köfinger, G. Hummer, Atomic-resolution structural information from scattering experiments on macromolecules in solution, Phys. Rev. E 87 (2013) 052712. doi:10.1103/PhysRevE.87.052712.

[7] J. Köfinger, K. Reuter, Capriqorn software package (2018). URL https://github.com/bio-phys/capriqorn

[8] L. S. Ornstein, F. Zernike, Accidental deviations of density and opalescence at the critical point of a single substance, Royal Netherlands Academy of Arts and Sciences (KNAW), Proceedings 17 (1914) 793.

[9] C. N. Likos, Effective interactions in soft condensed matter physics, Physics Reports 348 (4) (2001) 267–439. doi:10.1016/S0370-1573(00)00141-1.

[10] D. Frenkel, B. Smit, Chapter 4 - Molecular Dynamics Simulations, in: D. Frenkel, B. Smit (Eds.), Understanding Molecular Simulation, 2nd Edition, Academic Press, San Diego, 2002, pp. 63–107.

[11] Protein dynamics: Moore's law in molecular biology, Current Biology 21 (2) (2011) R68–R70. doi:10.1016/j.cub.2010.11.062.

[12] B. G. Levine, J. E. Stone, A. Kohlmeyer, Fast analysis of molecular dynamics trajectories with graphics processing units — radial distribution function histogramming, Journal of Computational Physics 230 (9) (2011) 3556–3569. doi:10.1016/j.jcp.2011.01.048.

[13] W. Humphrey, A. Dalke, K. Schulten, VMD: Visual molecular dynamics, Journal of Molecular Graphics 14 (1) (1996) 33–38. doi:10.1016/0263-7855(96)00018-5.

[14] M. Tuckerman, Statistical Mechanics: Theory and Molecular Simulation, Oxford Graduate Texts, Oxford University Press, 2011.

[15] T. A. Wassenaar, Molecular dynamics of sense and sensibility in processing and analysis of data, Ph.D. thesis, University of Groningen (2006). URL http://hdl.handle.net/11370/b0c3a19b-9f60-4911-ab23-d9725a2d45a2

[16] B. G. Levine, D. N. LeBard, R. DeVane, W. Shinoda, A. Kohlmeyer, M. L. Klein, Micellization studied by GPU-accelerated coarse-grained molecular dynamics, Journal of Chemical Theory and Computation 7 (12) (2011) 4135–4145. doi:10.1021/ct2005193.

[17] C. Kutzner, S. Páll, M. Fechner, A. Esztermann, B. L. de Groot, H. Grubmüller, Best bang for your buck: GPU nodes for GROMACS biomolecular simulations, Journal of Computational Chemistry 36 (26) (2015) 1990–2008. doi:10.1002/jcc.24030.

[18] NVIDIA Corporation, CUDA C Programming Guide (2018). URL https://docs.nvidia.com/cuda/index.html

[19] V. Volkov, Better performance at lower occupancy, 2010. URL http://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pdf

[20] The HDF Group, Hierarchical Data Format, version 5 (1997-2018). URL http://www.hdfgroup.org/HDF5/

[21] N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, O. Beckstein, MDAnalysis: A toolkit for the analysis of molecular dynamics simulations, Journal of Computational Chemistry 32 (10) (2011) 2319–2327. doi:10.1002/jcc.21787.

[22] R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler, J. Domański, D. L. Dotson, S. Buchoux, I. M. Kenney, O. Beckstein, MDAnalysis: A Python Package for the Rapid Analysis of Molecular Dynamics Simulations, in: S. Benthall, S. Rostrup (Eds.), Proceedings of the 15th Python in Science Conference, 2016, pp. 98–105.

[23] Python Software Foundation, The Python Standard Library (2018). URL https://docs.python.org/2/library/

[24] Intel Corporation, Intel Xeon Platinum 8164 Processor (2017). URL https://ark.intel.com/products/120503/Intel-Xeon-Platinum-8164-Processor-35_75M-Cache-2_00-GHz

[25] NVIDIA Corporation, NVIDIA GeForce GTX 1080 (2016). URL https://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf

[26] NVIDIA Corporation, NVIDIA Tesla P100 (2016). URL https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

[27] NVIDIA Corporation, NVIDIA Tesla V100 GPU Architecture (2017). URL http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

[28] A. B. Caldeira, V. Haug, S. Vetter, IBM Power System S822LC for High Performance Computing (2016). URL https://www.redbooks.ibm.com/redpapers/pdfs/redp5405.pdf

[29] NVIDIA Corporation, NVIDIA DGX-1 With Tesla V100 System Architecture (2017). URL http://images.nvidia.com/content/pdf/dgx1-v100-system-architecture-whitepaper.pdf

[30] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kalé, K. Schulten, Scalable molecular dynamics with NAMD, Journal of Computational Chemistry 26 (16) (2005) 1781–1802. doi:10.1002/jcc.20289.

[31] C. Augonnet, S. Thibault, R. Namyst, P.-A. Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009, 23 (2011) 187–198. doi:10.1002/cpe.1631.

[32] K. Reuter, J. Köfinger, Cadishi software package (2018). URL https://github.com/bio-phys/cadishi
