Importance of explicit vectorization for CPU and GPU software performance

Journal of Computational Physics 230 (2011) 5383–5398

Contents lists available at ScienceDirect

Journal of Computational Physics

journal homepage: www.elsevier .com/locate / jcp

Importance of explicit vectorization for CPU and GPU software performance

Neil G. Dickson ⇑, Kamran Karimi, Firas HamzeD-Wave Systems Inc., 100-4401 Still Creek Drive, Burnaby, British Columbia, Canada V5C 6G9

a r t i c l e i n f o

Article history:Received 30 September 2010Received in revised form 21 March 2011Accepted 22 March 2011Available online 29 March 2011

Keywords:PerformanceOptimizationVectorizationMonte CarloIsing modelGPU

0021-9991/$ - see front matter � 2011 Elsevier Incdoi:10.1016/j.jcp.2011.03.041

⇑ Corresponding author.E-mail addresses: [email protected] (N.G

a b s t r a c t

Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization andnon-parallel optimization techniques, which can often be employed additionally, are lessfrequently discussed. In this paper, we present an analysis of several optimizations doneon both central processing unit (CPU) and GPU implementations of a particular computa-tionally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU andthe equivalent, explicit memory coalescing, on the GPU are found to be critical to achievinggood performance of this algorithm in both environments. The fully-optimized CPU versionachieves a 9� to 12� speedup over the original CPU version, in addition to speedup frommulti-threading. This is 2� faster than the fully-optimized GPU version, indicating theimportance of optimizing CPU implementations.

� 2011 Elsevier Inc. All rights reserved.

1. Introduction

It is common to examine performance increases from parallelism in the form of CPU and GPU multi-threading [1,2], butvectorization and non-parallel optimization techniques remain less common in the literature. Based on theoretical peak per-formance values or performance test results, it is sometimes concluded or even assumed that GPU computation is signifi-cantly faster than CPU computation [3–6]. However, explicit optimization of the CPU code, a process by which aprogrammer manually intervenes to maximize performance, is often not considered. Due to several well-documentedGPU performance factors [7], it is common for GPU developers to become involved in many aspects of coding that can pos-itively affect performance. CPU optimization, however, is usually left to the compiler, and it has been argued that compilersachieve high utilization of the CPU’s computing power [8,9]. If this assumption is not valid, it may have lead to heavilyskewed performance comparisons between CPUs and GPUs. Similarly, due to the prevalence of multicore CPUs, comparingperformance of one core of a CPU against that of a full GPU, as done in [5], is not a completely fair comparison.

The goal of this paper is to present a performance analysis of CPU and GPU implementations of a specific MetropolisMonte Carlo algorithm at different levels of optimization, as outlined in Table 1. Note that the GPU code was too slow to testreliably before making the basic optimizations described in Section 2, so this implementation was not included here.

Vector instruction sets, present on modern commodity CPUs since 2001 (referred to here as SSE), provide 128-bit registersand execution units [10], allowing a single operation to be performed on each of multiple (2 � 64-bit, 4 � 32-bit, 8 � 16-bit,or 16 � 8-bit) adjacent data elements at once. For example, the element-wise addition of two 4-element vectors of 32-bitfloating point numbers can be vectorized by replacing the 4 separate addition operations with a single SSE instruction. How-ever, these SSE instructions are not frequently used explicitly. High-quality vectorization of our algorithm requires knowl-edge about the semantics of the algorithm, which a compiler cannot (and arguably, should not) assume. Software developers,

. All rights reserved.

. Dickson), [email protected] (K. Karimi), [email protected] (F. Hamze).

http://dx.doi.org/10.1016/j.jcp.2011.03.041

mailto:[email protected]



http://dx.doi.org/10.1016/j.jcp.2011.03.041

http://www.sciencedirect.com/science/journal/00219991

http://www.elsevier.com/locate/jcp

Table 1Implementations at different levels of optimization.

Impl. CPU/GPU

Multi-threaded

Compiler optimizationenabled

Basic optimizations(Section 2)

Vectorized MT19937 & flipping(Section 3)

Vectorized data updating(Sections 3.1& 3.2)

A.1a CPU U

A.1b CPU U U

A.2a CPU U U

A.2b CPU U U U

A.3 CPU U U U U

A.4 CPU U U U U U

B.1 GPU U U U

B.2 GPU U U U U U

5384 N.G. Dickson et al. / Journal of Computational Physics 230 (2011) 5383–5398

on the other hand, must already have knowledge of these semantics in order to design and implement the algorithm. Withthe recent introduction of the Advanced Vector Extensions (AVX) instruction set [10], which extends SSE to 256 or more bits,this knowledge is becoming more important.

For example, the MT19937 (pseudo) random number generation algorithm [11] keeps an array of 624 numbers. An alter-native implementation, used in this paper, keeps 4 � 624 = 2,496 numbers and uses SSE to generate 4 random numbers inroughly the same time as each random number before. The resulting random number sequence would be different, so in or-der for a compiler to safely replace the algorithm with something not equivalent to the source code, it must first determinethat the sole purpose of the code is to generate many random numbers. An analogous scenario would be a compiler recog-nizing code performing bubble sort, a stable sort, determining that the sorting stability is not necessary, and then replacingthe code with heap sort, an unstable sort [12].

The application we consider is that of performing a Metropolis Monte Carlo algorithm (e.g. [13]) on a sparse Ising model,which is a system of spin variables si 2 {+1,�1}, with a Hamiltonian (cost function) of the form:

H ¼ �X

i

hisi �X

i;j

Jijsisj;

where hi and Jij are the values that define the particular Ising model. The general Metropolis sweep algorithm used in thispaper, to sample from a Boltzmann distribution over the possible values of spins, can be summarized as in Fig. 1.

This sequential sweeping algorithm was chosen as a starting point for performance comparison because it has alreadybeen identified in literature as a fast alternative to the conventional Metropolis algorithm, which visits spins in random order[14]. This is because visiting spins sequentially requires one fewer random number to be generated per iteration. Also, thedata used to compute probabilities of adjacent spins flipping can be updated faster than computing, from scratch, the prob-ability of flipping a single spin. This is because the updating is a proper subsequence of the operations needed to compute theprobability from scratch.

The optimized implementations were developed in a Quantum Monte Carlo (QMC) [15] simulation context and use Par-allel Tempering, as explained in [16,17]. The topology of the Ising models that appear in these simulations is described in[18], and comes from applying a Suzuki–Trotter decomposition [19] to a given processor architecture, but the optimizationspresented here are directly applicable to many types of sparse Ising models, e.g. any lattice Ising models. For the purposes ofthis paper, the most relevant information is that each spin is adjacent to 6, 7, or 8 other spins in a layered structure, and thatmillions of the Metropolis sweeps shown in Fig. 1 must be performed on millions of systems, each with up to 32,768 spins.With these high computational requirements, even the best-optimized implementations require months of computationtime on thousands of multi-core computers. For densely-connected Ising models, a different vectorized implementation thanthe ones described in Section 3 would be advised, since the sparseness is exploited by our implementations.

Explicit CPU vectorization has shown impressive results in other contexts [20,21]. However, those performance compar-isons are relative to implementations where the compiler does not implicitly vectorize the code, so improvement over acompiler’s implicit vectorization may not be known for those cases. It has also been found in other contexts that a GPUimplementation is not guaranteed to surpass a CPU implementation in performance [22]. Again, many factors could explainsuch a discrepancy, so we attempted to address these concerns by exploring a number of potential optimizations applicableto Metropolis sweeping on sparse Ising models.

The rest of the paper is organized as follows. Section 2 explains a number of non-parallel optimization techniques thatwere applied to both the CPU and the GPU implementations. Section 3 shows how different parts of the code were vector-

for each spin i, if uniform(0,1) random number < probability of flipping the sign of i, flip the sign of i for each spin j adjacent to i, update data used to find probability of flipping j

Fig. 1. A Metropolis Monte Carlo sweep of an Ising model.

N.G. Dickson et al. / Journal of Computational Physics 230 (2011) 5383–5398 5385

ized. This section also explains how memory coalescing for GPU was performed. Section 4 presents the results of a number ofperformance tests we performed using CPU and GPU code at different levels of optimization. Section 5 concludes the paper.

2. Basic optimizations

The optimizations presented here focus only on the Metropolis sweep algorithm of Fig. 1, as everything else remains lar-gely unchanged, including multi-threading, which is explained in [16]. The basic optimization techniques we used on thisalgorithm include: branch elimination, simplification of data structures, result caching, and approximation. We have alsofound these optimizations to be effective for other computationally-intensive, non-Monte-Carlo, CPU applications. OurGPU implementations (B.1 and B.2) use all of these optimizations, and the CPU and GPU code remained nearly identical un-der these optimizations, so although only the CPU code is shown here, these are as applicable to the GPU.

In addition to the optimizations described in this section, many other GPU optimizations were tried, but only one resultedin a non-negligible improvement, and is described in Section 3.2. An example of an additional GPU optimization resulting inroughly a 6% speedup, negligible compared to the others presented here, was to temporarily copy the current state of a ran-dom number generator (see Section 3) into the GPU’s fast shared memory before regenerating its array of random numbers.However, since the full set of random number generator states was too large to keep in the limited shared memory, thiswasn’t able to be improved upon significantly. Similar optimizations were attempted with other data structures, but withno significant success due to the same problem of limited shared memory, as discussed in Section 4. More than 20 differentschemes for assigning work to GPU cores were tested, with the best being described in Section 3.2, but the particulars of theother schemes tested, beyond what appears in Section 3.2, are not notable.

2.1. Branch elimination

Commodity CPUs have had the ability to run parts of multiple instructions at the same time (known as superscalar archi-tecture) since 1993 [10]. The drawback of this feature is that if the current instruction is a conditional branch, for exampledue to an if-statement, either the CPU cannot start running any instructions past that branch, or it must make a guess as towhat the next instruction will be. This guessing is known as branch prediction. When the guess is wrong (a misprediction),several partially-completed instructions may need to be removed from the CPU’s pipeline, which can cause stalls [23]. Elim-inating unnecessary branches can alleviate this problem.

This optimization had a large impact on the performance, and combined with the data structure simplification, made thecode simpler. On its own, the resulting code was smaller, but admittedly less readable. In the original code, shown in Fig. 2,the inner loop of the Metropolis sweep contained two frequently-mispredicted branches. Note that the data structure detailsare examined in Section 2.2.

However, both branches can be eliminated, as in Fig. 3.Note that the value of (neighbours[0] == currentSpin) is 1 when true and 0 when false, so the new code still selects

the neighbour of the edge that is not the current spin, to update it after flipping the current spin. Also, for x86-compatible

for (edgeIndex=0; edgeIndex<numLocalEdges; edgeIndex++) { edge = incidentEdges[edgeIndex];

if (graphEdges[edge][0]==currentSpin) neighbour = graphEdges[edge][1];

else neighbour = graphEdges[edge][0];

if (isInterEdge[edge]) hEffectiveInter[neighbour] -= 2*SMul*J[edge];

else hEffectiveIntra[neighbour] -= 2*SMul*J[edge]; }

Fig. 2. Original loop updating data for spins adjacent to one that has just flipped.

for (edgeIndex=0; edgeIndex<numLocalEdges; edgeIndex++) { edge = incidentEdges[edgeIndex];

neighbours = graphEdges[edge]; neighbour = neighbours[neighbours[0]==currentSpin];

hEffective = isInterEdge[edge] ? hEffectiveInter : hEffectiveIntra; hEffective[neighbour] -= 2*SMul*J[edge]; }

Fig. 3. Loop updating data for spins after a flip, without using branches.


processors since the Pentium Pro in 1995, the ternary operator (i.e. condition? valueIfTrue: valueIfFalse) is imple-mented with a conditional move instruction instead of a branch when both possible values do not involve computation [23].

To further speed up this computation and avoid both of these confusing ways of accessing the data, structures require amore suitable design, examined in the following section.

2.2. Simplification of data structures

The data structures mentioned above are used elsewhere in the code and suffered from similar issues, as they have similaraccess patterns. The original graph data structure had a complex layout in memory, as shown in Fig. 4.

This data structure represents the edges, with weights J, between spins. Because of details specific to the context ofthe application we used, some edges (‘‘inter-layer’’) are handled separately from others (‘‘intra-layer’’), as determined byisInterEdge. This separate handling would not be necessary in other contexts, so can be simply omitted in general. In thiscontext, it can also be eliminated, since here it happens that by design, there are always exactly two inter-layer edges ofeach spin. Since the edges incident to a spin can be handled in any order, reordering them ahead of time such thatthe two inter-layer edges always appear after the intra-layer edges allows them to be handled outside the main loop.isInterEdge can thus be eliminated.

As for the bulk of the graph data structure, which is completely general, we ‘‘eliminated the middle man’’, placing each J

value directly with the target spin index it applies to. In doing so, we duplicated each edge, in order to have in a single place,all edges adjacent to each spin. The data structure then simplifies to Fig. 5.

The corresponding code is shown in Fig. 6. The code is clearer than before and the inside of the loop is now just one line(down from 9). In addition to simply computing less, there is now less memory use, more sequential access, and fewer arraysbeing read in quick succession, improving memory cache use [23]. As such, these simplifications had a large performanceimpact on top of the branch elimination, both on the CPU and the GPU.

In general contexts, where all edges are handled in the same manner, the above code further simplifies to the form shownin Fig. 7. Because this code is equivalent to that of Fig. 6, except for removing operations and data, its performance on Isingmodels with the same level of connectivity should be at least as good as that observed in our specific context.

While these data structure changes provide a performance improvement for Metropolis sweeping on sparse Ising models,the original graph data structure may be faster for use in other graph problems. For example, for graph algorithms whereiteration through the entire edge list is the bottleneck, our simplified graph data structure’s duplication of edges may slowdown the iteration. The important point is that small changes in key data structures can have a large impact on performance.

Fig. 4. Original memory layout of the graph data structure representing connections between spins.

Fig. 5. Simplified memory layout of the graph data structure.

// Handle intra-layer edges for (edge=0; edge<numLocalEdges-2; edge++) { hEffectiveIntra[localEdges[edge].targetSpin]-= 2*SMul*localEdges[edge].J; } // Handle the two inter-layer edges hEffectiveInter[localEdges[edge ].targetSpin]-= 2*SMul*localEdges[edge ].J; hEffectiveInter[localEdges[edge+1].targetSpin]-= 2*SMul*localEdges[edge+1].J;

Fig. 6. Loop updating data for spins after a flip, using a simpler data structure.

for (edge=0; edge<numLocalEdges; edge++) { hEffective[localEdges[edge].targetSpin] -= 2*SMul*localEdges[edge].J; }

Fig. 7. Loop updating data for spins after a flip, for Ising models without special-case edges.


Note that, as mentioned in Section 1, although the example depicted in Figs. 4 and 5 happens to only contain spins with 6edges, the performance tests in Section 4 were conducted on a model having some spins with 7 edges and all other spinswith 8 edges.

2.3. Result caching

This is a common optimization technique. As in dynamic programming, simply avoid computing the same value multipletimes, by computing it once and saving the result. Here, it does not appear prominently, but caching certain data in the algo-rithm did improve performance. For example, (2⁄SMul) appears multiple times in Fig. 6, including inside the for-loop. SMuldoes not change within the loop and is never read without being doubled, so it can simply be doubled once, before the loop.

This improved performance slightly, but noticeably. That there was any improvement was surprising to us, as this is anoptimization that a compiler can easily recognize and perform automatically, and yet the observed improvement suggeststhat the compilers did not do so. Choosing not to cache computations that are clearly duplicated may be advantageous incertain situations, such as when registers are fully occupied and the computation is sufficiently brief, but this was evidentlynot such a case.

As a side note, a way of indirectly caching this multiplication over larger timescales would be to instead multiply all of theJ’s by 2 ahead of time. More subtle forms of result caching are also possible. For example, we generate many random num-bers at a time, avoiding the overhead of cache misses and extra condition checks from frequently switching between gen-erating random numbers and flipping spins.

2.4. Approximating an exponential

In Metropolis Monte Carlo, computing the probability of flipping a spin’s sign requires an exponential to be computed,which is an expensive operation, requiring roughly 83 clock cycles on our test CPU (see Section 4 for CPU information).We created a rough approximation to compute ex in 4 clock cycles, and a more accurate approximation takes 11 clock cycles.The more accurate approximation, with maximum relative error of roughly 1% and average relative error near zero, is gen-erated as in Fig. 8. A derivation and some verification of the approximation is provided in Appendix: The ExponentialApproximation.

Note that the 11 clock cycles includes special masking to produce 0.0 for all x < (�31.5 ln 2) and at least 1.0 for x > 0, andthe approximate 4th root can be computed using the approximate reciprocal-square-root instructions provided on modernIntel and AMD CPUs [10]. It was important that this approximation does not use lookup tables, so that it can also be vector-ized, i.e. to compute 4 approximate exponentials at once.

The faster, less accurate approximation skips the bounds checking (the valid range is (�126 ln 2) 6 x < (128 ln 2)), reducesthe factor in step 2 to 223log2e, and skips step 6. This faster approximation was used in the performance tests for all imple-mentations with these basic optimizations. It is equivalent to a linear interpolation between exact values at the points where

1. Start with x as a 32-bit floating-point value, which must be in the range (-31.5 ln 2) ≤ x < (32 ln 2)

2. Multiply x by 225log2e3. Convert x to a 32-bit integer value 4. Add (127)223 (i.e. 0x3F800000) to x5. Pretending that x is now a 32-bit floating-point value, multiply x by 2ln22 6. Compute the approximate 4th root of x

Fig. 8. An algorithm for approximating ex.


ex is a power of two, scaled by 2ln22. If the 4th root was exact, the more accurate approximation would be equivalent to alinear interpolation between exact values at the points where e4x is a power of two, scaled by 2ln22.

3. Vectorization

All integers and floating-point numbers in the Metropolis sweep algorithm described above, including the random num-ber generator, are 32-bit. On the CPU, SSE’s 128-bit registers thus allow us to concurrently perform up to 4 identical oper-ations on these numbers. The objective in vectorizing the CPU code was to perform the necessary operations 4 at a time, asmuch as possible. For a CPU with the AVX instruction set, this could be increased to 8 or more operations at a time. Note thatsimilar, albeit not identical, techniques apply to the GPU and are described in Section 3.2.

Because SSE instructions operate on 128 consecutive bits, data to be operated upon concurrently must be consecutive. Forexample, to perform an operation on four 32-bit numbers, they must appear consecutively in memory. Additionally, for mostSSE instructions, 128-bit operands in memory must appear at an address that is a multiple of 16 bytes (128 bits), i.e. alignedto 16 bytes. If data do not satisfy these two constraints, they can be reorganized such that they are packed into aligned, con-secutive positions. However, it is preferable to avoid any extra operations to pack the data, by ensuring that data are keptproperly arranged throughout the program’s execution. This is possible in programming languages that allow direct controlover memory layout, such as C++ and assembly, but may not be possible in other languages. For algorithms where the aboveconstraints can be easily met, this low-level parallelism naturally has the potential for significant performance gains.

After implementing the optimizations of Section 2, we observed that a majority of CPU time was being spent generatingthe large volume of random numbers needed, using the Mersenne Twister 19937 [11] algorithm to ensure high quality ofrandomness. Nearly a 4� speedup of the random number generation can be achieved by always generating 4 random num-bers together using SSE. This is accomplished by interlacing 4 MT19937 random number generators with different seeds,working in parallel. A pair of lines from MT19937 comes in Fig. 9.

The ANDs, XORs, OR, logical shift right, and ternary operator in Fig. 9 can all be performed on 4 adjacent 32-bit integersconcurrently, so one can conceptually just change the type of data and y from single 32-bit integers to quadruplets of adja-cent 32-bit integers. Then, the operations to generate 4 random numbers would be almost identical to those needed to gen-erate 1. Fig. 10 depicts the first line of Fig. 9 in scalar and vector forms. Fig. 11 depicts the vector version of the ternaryoperator, described more thoroughly in [20], since its particular operations differ from its scalar version.

We emphasize that this method of vectorizing a random number generator is not specific to Mersenne Twister. For exam-ple, it can also be easily applied to the much simpler Xorshift [24] algorithms.

With random vectors generated, the question is whether they can be used in vector form as well. If 4 independent spinsystems are interlaced, or if spins are reordered such that each quadruplet of spins has no edge between them, the proba-bilities of flipping each of 4 spins can be computed in vector form and the flipping can be done in vector form (masked bywhether each spin actually flipped, similar to Fig. 11). This reordering is described in Section 3.1 below. A larger-scale reor-dering is also used for memory coalescing on the GPU in Section 3.2.

y = (data[i]&UPPER_MASK)|(data[i+1]&LOWER_MASK); data[i] = data[i+397] ^ (y>>1) ^ ((y&1) ? MASK_A : 0);

Fig. 9. Two example lines of MT19937 random number generation algorithm.

Fig. 10. Graphical depiction of operations in scalar and vector versions of a line of Fig. 9 (colour online).

Fig. 11. Graphical depiction of the vectorized version of the ternary operator in Fig. 9(colour online).


C++ compilers for CPUs do not yet natively provide operators on 128-bit data types, instead only providing compilerintrinsics to effectively write the equivalent assembly language code as if the instructions were C++ functions. As such,the vectorized versions of our Metropolis sweep algorithm were implemented directly in assembly language, which tooka week to implement and test, and we found them to be prohibitively difficult to maintain. This is a stumbling block to wide-spread use of vectorization. Having more support in C or C++ for vector operations could enable more programmers to usevectorization. Note that for the GPU implementations, we did not use the GPU’s assembly language, because we are unawareof how we could non-negligibly improve upon the GPU compiler’s output.

After running the performance tests, because of the above concerns, we implemented a custom wrapper of the C++ com-piler intrinsics, allowing us to use SSE instructions in C++ more readably. We re-implemented the sweep algorithm using thiswrapper, and found that it achieves near-identical performance to that of the assembly language implementations, since themachine code generated by Visual C++ 2008 is near-identical when the vectorization is made explicit in this fashion. A sim-ilarly-motivated extension to simplify vectorization in C++ is presented in [20].

In order to give the compiler a better opportunity to implicitly vectorize the random number generation, while stillobserving its regular behaviour for implementations A.1a and A.1b, implementations A.2a and A.2b use 4 random numbergenerators interlaced, as described in this section. In that code, the operations are each performed 4 separate times in closesuccession, once on each generator, to allow this behaviour to be indentified more easily by a compiler. This provided somespeedup, as reported in Section 4.

3.1. Full vectorization on the CPU

If all 4 spins within a quadruplet are topologically identical (as described below, e.g. 4 identical independent systems), theloop that updates data after spin flips can also be vectorized. However, simulating 4 independent systems in this mannerwould make other components of our simulation much less efficient (e.g. the Parallel Tempering must be able to swapout the states of these systems independently of each other), so this is not done in our QMC simulations.

That said, the structure of our QMC simulations can still be exploited in order to vectorize the inner loop of the Metropolissweep in Fig. 1. All of our simulated Ising models consist of many (P64) identical copies of a smaller Ising model, with edgesconnecting corresponding spins in adjacent layers, with a wrap-around from the last to the first layer, as shown in Figs. 12and 13.

Although the first quadruplet marked in Fig. 12 happens to contain only spins that are not adjacent to each other (and thisis not true for all of our simulations), if spin 0 and spin 1 were both flipped together, they could both attempt to update datafor spin 4 at the same time, producing unpredictable results. Reordering the spins by splitting the layers into 4 sections andinterlacing those sections produces the reordering in Fig. 13b. With this ordering, not only is there no threat of two spins inthe same quadruplet trying to update the same spin at the same time, but they also always update spins that form anotherquadruplet, except when an update wraps around between the first and last layers. For example, flipping spins (0,1,2,3)would require updating data for spins (16,17,18,19) identically (the spins corresponding with spin 4 before), since the layersare identical.

Because this update involves identical operations on adjacent data (that can be masked out for spins that do not flip, sim-ilar to in Fig. 11), these operations can be done together in vector form. The first and last layers can be treated as a special

Fig. 12. Section of a small example QMC Ising model, showing 4 layers each with 8 spins (quadruplet 0 of original spin order marked in red; quadruplet 1 oforiginal spin order marked in green). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of thisarticle.)

0,1,2,3

4k+0,4k+1,4k+2,4k+3

3

0

1

2

4k+0

4k+1 4k+2

4k+3

0 1 2 3

4 5 6 7

(a)

(b) (c)

Fig. 13. Reordering of spins for small example QMC Ising model with L = 64 identical layers (colour online).


case to handle wrapping. Note that if the number of layers is not a multiple of 4, either extra spins must be added in such away that they can be ignored, or the remainder of layers must be left non-vectorized.

Because the only requirement for applying the above splitting to the Ising model is that the model have several topolog-ically identical (though not necessarily with identical hand J values) layers, the exact same splitting can be applied to anysufficiently large dimension of a lattice Ising model. For multi-dimensional lattice Ising models, this can optionally be ap-plied to multiple dimensions, at the cost of having more special handling for the boundaries of those dimensions. An inter-esting observation is that this multi-dimensional splitting is very similar in concept to what is done in [5] to optimize a GPUimplementation of Metropolis Monte Carlo on an Ising model, and indeed the following section describes how it was appliedto our GPU implementation. However, this same optimization is not applied to the CPU implementation in [5], biasing their


results in favour of the GPU. Of course for the CPU, the Ising model need only be split into 4 sections, so splitting a singledimension should suffice, whereas it is preferable on the GPU to have many sections.

For non-periodic boundary conditions, to avoid the cost of handling boundary spins separately, one can add ‘‘dummy’’spins connected with J = 0 to the ‘‘real’’ boundary spins. These added spins serve only to make all of the real spins in a qua-druplet topologically equivalent, including the relative indices of their neighbours. This idea is shown in Fig. 14 for a small 2Dlattice Ising model.

The dummy spins would not need a state, and do not need to be swept. They would, however, each need an entry in hEf-fective, so that the entries can be updated with the rest of the quadruplet, which contains real spins. The addition of thedummy spins necessitates a slight change in the reordering to avoid having dummy spins with the same indices as real spins.For example, in Fig. 14, spin 82 has a spin below it with an index 79 less than its own (i.e. 3), so in order for spin 83 to updatethe spin below it at the same time as spin 82, it must also have a spin below it with an index 79 less than its own (i.e. 4). Ifreal quadruplet 1 started at index 4, there would be a conflict. As such, in order to avoid the conflict and maintain 16-bytealignment, quadruplet 1 must start at index 8 instead of 4, leaving indices 5, 6, and 7 unused. Effectively, every second qua-druplet is omitted. Note that the negative indices shown in the figure are arbitrary; the starting index of quadruplet 0 couldbe chosen to be 8 to avoid having to shift the start of hEffective.

3.2. Memory coalescing on GPU

Vector instruction sets like SSE are not available on current GPUs, so the CPU vectorization described above could not beimplemented for the GPU. However, current NVIDIA GPUs access memory much faster than they otherwise would whengroups of 32 GPU threads, called ‘‘warps’’, access adjacent data concurrently, an effect known as ‘‘memory coalescing’’ [3].As such, reordering the spins in a manner very similar to that used to fully vectorize the CPU implementation (except using32-way interlacing instead of 4-way interlacing), would lead to this improved memory access pattern. This is similar to hav-ing a vector processor with 32 elements in each vector, except that on the CPU, data accessed during vector operations mustbe in adjacent memory locations, whereas the GPU supports accessing one or more of the elements from elsewhere in mem-ory, though with degraded performance.

To maximize the number of concurrent threads accessing adjacent data, we split the model into groups of 2 layers andinterlaced all of these groups, as in Fig. 13c. In that example, if not properly coordinated, a thread flipping spin 0 and a threadflipping spin 1 may attempt to update data for spin 256, the corresponding spin in the layer between, at the same time. Assuch, we have each thread attempt flipping spins in its even layer while only updating data to its left, then after all threadshave completed their flipping, each thread updates data to its right. Then, flips are attempted for spins in the odd layers inthe same manner. For a system with 256 layers, we would perform 128-way interlacing, allowing 128 GPU threads to oper-ate on a single Ising model at once, with as many as possible operating on adjacent data.

Our GPU implementation having only basic optimizations (B.1) also has 128 GPU threads operating on each Ising modelwith 256 layers in the exact same manner, simply without reordering the spins first. The GPU version of the code has a ran-dom number generator for each GPU thread [16]. As with the CPU vectorized version, the GPU version with memory coalesc-

Fig. 14. Addition of dummy spins for small example 2D lattice Ising model. (quadruplet 0 marked in red; quadruplet 1 marked in green). (For interpretationof the references to colour in this figure legend, the reader is referred to the web version of this article.)


ing interlaces the 128 random number generators. Interlacing the random number generators was implemented simply byswapping the order of two array indices.

4. Results and comparison

As described above, the relative performance of the CPU and GPU implementations at different levels of optimization wasexamined.

The system used for the CPU performance testing:

� Intel Core i7-965 Extreme (4 physical cores 2� Hyper-Threaded, so 8 logical cores at 3.2 GHz)� 12 GB of 1066 MHz RAM� Windows Vista Ultimate 64-bit� Visual C++ 2008 64-bit compiler, Macro Assembler (x64) Version 8.00

The system used for the GPU performance testing:

� NVIDIA GTX-285 (240 cores at 648 MHz with 1 GB of 1242 MHz RAM)� Running CUDA driver version 196.21� Windows 7 Ultimate 64-bit

For the GPU tests, we observed a low CPU utilization (below 1%), making the results largely independent of the test com-puter’s CPU performance.

For the performance results below, 30,000 Metropolis sweeps were performed on 115 Ising models, each with 24,576spins (256 layers of 96 spins, each layer as depicted in Fig. 15), for a total of 2,826,240 spins. Data was transferred to theGPU only once, and as a result, the observed data transfer time was negligible for the GPU runs. CPU runs were performedon 1, 2, 4, 6, and 8 cores solely for comparison between CPU and GPU. As previously mentioned, multi-threading of this appli-cation is covered in [16]. All runs were performed 10 times and averaged to ensure consistency. The error bars are too smallto see due to low variance in the times. The reference point for Fig. 16, the average time for our original CPU code to executeon 1 core, was 5705.27 s (1 h 35 min 5.27 s).

The basic optimizations improved performance on the CPU by 2.91� to 3.75�, depending on the number of cores. Fullvectorization contributed another factor of 3.08 to 3.16, giving a total speedup of 8.95� to 11.86� from manual optimizationof the CPU code. The speedup is significant enough that using 1 core of the fully vectorized version outperforms 8 cores of theoriginal version by 1.8�.

Reordering the spins and the random number generators to allow for memory coalescing on the GPU gave a dramatic6.78� speedup over the GPU code with just basic optimizations, stressing the importance of using the GPU in a manner sim-ilar to a vector processor. This reorganization of memory was the only difference between the two GPU versions, so the codeof both B.1 and B.2 are almost identical. While results are not presented here for the GPU code before making the basic opti-mizations described in Section 2, it was too slow to test reliably in this same manner, so these optimizations did improve the

Fig. 15. Topology of the 96 spins and 260 edges within each of the 256 ferromagnetically-coupled (as in Fig. 12) layers of each Ising model simulated in theperformance tests. In these Ising models, vertical and horizontal edges are all ferromagnetic (FM), and the remaining edges are randomly selected to beeither FM or anti-FM. The magnitude of both the intra-layer edges (shown) and the inter-layer edges (not shown) vary between the simulated Ising models(colour online).

0

8

16

24

32

40

48

56

64

0 1 2 3 4 5 6 7 8# of CPU cores (N/A for GPU)

Spee

dup

fact

or re

lativ

e to

orig

inal

CPU

cod

e w

ith 1

cor

e CPU Original (A.1b)

CPU Basic Optimization (A.2b)

CPU Partial Vectorization (A.3)

CPU Full Vectorization (A.4)

GPU Basic Optimization (B.1)

GPU Coalescing (B.2)

Fig. 16. Relative performance at different levels of optimization of CPU code and CUDA code on GPU.


GPU performance non-negligibly. Our GPU implementation with memory coalescing performed roughly as well as 4 CPUcores of our fully vectorized CPU version. Running these best-optimized implementations, the full 8 cores of the Core i7 out-perform the GTX-285 by a factor of 2.04.

One source of slowdown for the GPU is the main if-statement of the Monte Carlo sweep algorithm (Fig. 1). If even onethread in a warp of 32 threads flips a spin, all threads in that warp must wait for that spin to be flipped [16]. The probabilityof waiting for a flip in each of the 115 Ising models simulated is shown in Fig. 17.

The Ising models with lower indices have a significantly lower probability of flipping a spin than those with higher indi-ces, since they represent lower effective temperatures. The percentage of time that the A.1 CPU application must wait for themain if-statement to finish is exactly the average percentage chance of flipping, or 28.6%. However, the GPU must wait onaverage 82.8% of the time, i.e. whenever at least 1 spin out of 32 spins flips, which is 2.9� more often than the CPU. The A.4CPU application must also wait more often at 56.8%, i.e. whenever at least 1 spin out of 4 spins flips, or 2.0�more often thanbefore. On the other hand, the reason that this increase in probability of waiting for the main if-statement occurs in the firstplace is that the if-statement can run for multiple spins concurrently. Therefore, this does not explain the relatively low per-formance of the GPU, only why memory coalescing does not provide a larger speedup than the 7� observed.

GPU:P(≥1 of 32 spins

flips)

CPU withFull Vectorization:

P(≥1 of 4 spins flips)

CPU withoutFull Vectorization:

P(flip)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 10 20 30 40 50 60 70 80 90 100 110Ising model index

P(w

aits

for a

flip

)

Fig. 17. Probabilities of having to wait for a spin flip in different application versions (colour online).


Comparing the GTX-285 GPU and the Core i7 CPU based on the number of cores is not a meaningful comparison in thiscontext, as the GPU’s 240 cores are outperformed by the CPU’s 8 logical cores when both are running optimized code. Thisdiscrepancy in per-core performance can be partly explained by the 4.94� difference in core frequency, the fact that eachCPU core operates on 4-element vectors, whereas each GPU core operates on scalars, and the lower probability of waitingfor a flip on the CPU.

However, tests on pieces of the GPU implementation also suggest that it is highly memory-bound. Despite attempts toutilize the low-latency shared memory of the GPU and the decreased size of the data structures, each structure was stilltoo large to remain in shared memory between iterations. For example, the hEffectiveIntra array, as used in Fig. 6, is96 KB for a single Ising model of 24,576 spins, much larger than the available 16 KB of shared memory per GPU multipro-cessor. Instead, the data structures must reside in the larger global device memory on the GPU, which has about 100� higherlatency than the shared memory [25]. The CPU, on the other hand, can completely fit these structures in its 8 MB of L3 cacheand can speculatively prefetch data into L2 and L1 cache due to the optimized code’s sequential access pattern [23]. Becausethe CPU’s cache has much lower latency than the GPU’s global device memory, this could heavily impact the relative perfor-mance of the CPU and GPU.

It is not possible in this case to reliably compare against peak theoretical FLOPS on the GPU, since the random numbergeneration consists primarily of integer operations and the remaining code comprises both integer and floating-point oper-ations. However, given that this code is memory-bound, the performance in unlikely to be near the peak.

To examine the CPU implementations more thoroughly, we ran the performance test, described above, on two implemen-tations with compiler optimization disabled (A.1a and A.2a).

The compiler’s implicit optimizations provided roughly a 1.5� speedup for the original CPU code (A.1a to A.1b). The codewith basic explicit optimizations has a larger 2.9� speedup from compiler optimizations (A.2a to A.2b), partly due to ourefforts to help the compiler implicitly vectorize some of the MT19937 random number generation in A.2b, as described inSection 3. While beneficial in the C++ code, this speedup from compiler optimizations is not applicable to the faster vector-ized versions implemented in assembly language (A.3 and A.4). The A.1b row of Table 2 is visualized in Fig. 18.

It is worth mentioning that we have observed similar performance for Ising models of different topologies and similarlevels of connectivity, supporting the assertion that the optimizations presented here apply to more general Ising models.Fig. 19

Table 2Speedup factors between all pairs of CPU implementations using 1 core.

A.1a A.1b A.2a A.2b A.3 A.4

A.1a 1.000 1.508 1.921 5.652 10.636 17.886A.1b 0.663 1.000 1.274 3.748 7.053 11.860A.2a 0.521 0.785 1.000 2.942 5.537 9.311A.2b 0.177 0.267 0.340 1.000 1.882 3.165A.3 0.094 0.142 0.181 0.531 1.000 1.682A.4 0.056 0.084 0.107 0.316 0.595 1.000

0.6631.000 1.274

3.748

7.053

11.860

0

2

4

6

8

10

12

A.1a A.1b A.2a A.2b A.3 A.4

Spee

dup

fact

or re

lativ

e to

orig

inal

CPU

cod

e

Fig. 18. Speedup factors for CPU implementations using 1 core (colours online as in Fig. 16)

Fig. 19. IEEE 32-bit floating-point format (s = sign,x = exponent,m = mantissa) (colour online).


5. Conclusion

We presented a simple Metropolis Monte Carlo sweep algorithm and walked through several optimizations on both CPUand GPU, independent of multi-threading. We found that the presented optimizations can gain 9� to 12� speedup for theCPU. Explicitly vectorizing the CPU implementations contributed a 3� speedup and the equivalent changes to allow formemory coalescing on the GPU resulted in a 7� speedup. Moreover, an Intel Core i7 CPU outperforms an NVIDIA GTX-285 GPU by a factor of 2 when comparing our final optimized CPU implementation (A.4) against our final optimized GPUimplementation (B.2).

Based on these results, relying solely on compiler optimization for a significant speedup may not be sufficient where highperformance is critical, so it is recommended that explicit software optimization, especially vectorization, be more thor-oughly examined, tested, and applied. It is also recommended that comparisons of CPU performance against GPU perfor-mance utilize the CPU’s full vectorization capabilities, if applicable, since neglecting these capabilities could lead tobiased comparisons.

Acknowledgments

The authors thank Geordie Rose and Mohammad Amin for valuable assistance on this project. We would also like to thankThierry Drumel for running our GPU performance tests, and the volunteers contributing computer time to AQUA@Home forencouraging us to push the limits of performance. We express our gratitude to the anonymous reviewers for suggestions thatimproved the paper.

Appendix A. The Exponential Approximation

The approximations presented in Section 2.4 is dependent on the IEEE 32-bit floating-point format [10]:Values in this format can be denormal (x = 0), normal (0 < x < 255), infinite (x = 255 and m = 0), or NaN (x = 255 and

m – 0). The value of a normal floating-point number represented by sign s, exponent x, and mantissa bits m, in this formatis:

f ¼ ð�1Þs 1þ m

223

� �2x�127:

Supposing that f is positive, i.e. s = 0, the value of the 32-bit integer that has the same bit representation as this floating-pointnumber is given by:

i ¼ 223xþm:

Assuming x < 254, adding 223 to i will increase the value of x by one, doubling f. This can be repeated until x = 255, at whichpoint f will become infinite or NaN. Because adding a constant to i has the effect of multiplying f by 2, it is clear that in someform, the value of f is exponential in the value of i, and because f and i have the same bit representation, no operation is re-quired for this exponential to be evaluated. Letting:

y ¼ i

223 � 127;

then isolating the exponent and mantissa of f for a given value of i:

x ¼ i223

j k¼ byc þ 127; m ¼ i mod 223 ¼ 223ðy mod1Þ; s ¼ 0;

f ðiÞ ¼ ð�1Þs 1þ m223

� �2x�127 ¼ 1þ 223ðymod1Þ

223

� �2bycþ127�127 ¼ ð1þ ymod1Þ2byc:

This is exactly 2y when y is an integer, and otherwise, it is a linear interpolation between those values. Thus, given a value ofy for which to compute an approximate value of 2y, one can instead compute the 32-bit integer:

i ¼ roundð223ðyþ 127ÞÞ;

and use the resulting 32 bits as a 32-bit floating-point number. Integrating the relative error of this approximation from 0 to1 yields that the average relative error is (2ln22)�1 � 1 � 0.0407. Therefore, multiplying the result by 2ln22 gives a relativeerror averaging to zero. The relative error is plotted in Fig. 20. Note that this can produce unpredictable results wheny < �126 or y P 128.


Further improvement of this approximation is based on that while the approximation of 2y is exact when y is a multiple of1, the approximation of 24y is exact when y is a multiple of 0.25, so the latter approximation is exact with 4 times the fre-quency. Given a method of quickly approximating 4th roots with lower error than this approximation, such as with specialSSE instructions for approximate reciprocal-square-roots, one can compute:

i4 ¼ roundð223ð4yþ 127ÞÞ;

2y �ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi2ln22f ði4Þ

4q

:

This has a relative error roughly bounded by (�0.01,0.005), more accurate but slower than the previous method. This relativeerror is also shown in Fig. 20. Note that to avoid overflows, one must have �31.5 6 y < 32 for this more accurateapproximation.

-0.04

-0.03

-0.02

-0.01

0

0.01

0.02

-4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0

Rel

ativ

e er

ror o

f app

roxi

mat

e ex

pone

ntia

l

y

Fast

Accurate

Fig. 20. Relative error of fast and accurate exponential approximations by input value (colour online).

-0.08

-0.07

-0.06

-0.05

-0.04

-0.03

-0.02

-0.01

0

0.01

0 10 20 30 40 50 60 70 80 90 100 110

Rel

ativ

e er

ror i

n av

erag

e en

ergy

Ising Model Index

FastAccurate

Fig. 21. Relative error in average energy due to each exponential approximation (colour online).

-0.02

-0.01

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0 10 20 30 40 50 60 70 80 90 100 110

Rel

ativ

e er

ror i

n st

anda

rd d

evia

tion

of e

nerg

y

Ising Model Index

FastAccurate

Fig. 22. Relative error in standard deviation of energy due to each exponential approximation (colour online).


It is also important to consider the impact of the approximation on properties of the Ising models being examined. Thisimpact may or may not be problematic, depending on the context. To measure this impact in the context of our application,with a more broadly applicable metric, we examined the average energy and standard deviation of energy in the distributiongenerated by each of the approximations and compared them to those of the exact exponential.

Standard operation of our application involves Parallel Tempering combined with the Metropolis sweeping [16] withmany more sweeps than in the performance tests. On each of the 115 Ising models tested above, we performed 1,000,000sweeps with Parallel Tempering swaps attempted between adjacent Ising models (analogous to swapping temperatures)every 10 sweeps. We executed this 5 times for each exponential function to ensure consistent data, and combined the resultsof the 5 runs appropriately. The relative error in average energy due to the approximations is shown in Fig. 21. The relativeerror in the standard deviation of energy is shown in Fig. 22.

For our application, the Ising models requiring the most accurate equilibration have indices from 40 to 50. For these Isingmodels, the more accurate approximation produces results much closer to that of the exact exponential than the fast approx-imation. Since the overall performance from using the more accurate approximation is almost identical to that of the fasterapproximation, this would not significantly impact the performance tests above. Note that both approximations have verypoor relative error in the standard deviation of energy for Ising models with indices 0 to 3 (not visible in Fig. 22), because, asshown in Fig. 17, the spins in these models almost never flip, meaning that the exponential error rarely cancels itself out.

For scenarios requiring lower error than the methods above, a finer approximation, such as a Taylor series based approx-imation [20], may be used.

References

[1] J.D. Owens, D. Luebke, et al, Computer Graphics Forum 26 (1) (2007) 80–113. ISSN: 0167-7055.[2] L.R. Scott, T. Clark, B. Bagheri, Scientific Parallel Computing, Princeton University Press, 2005.[3] D. Kirk, W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.[4] S.S. Samant, J. Xia, P. Muyan-Özçelik, J.D. Owens, High performance computing for deformable image registration: towards a new paradigm in adaptive

radiotherapy, Medical Physics 35 (8) (2008) 3546–3553.[5] T. Preis, P. Virnau, et al, GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model, Journal of Computational Physics 228 (2009) 4468–

4477.[6] S. Tomov, M. McGuigan, et al, Benchmarking and implementation of probability-based simulations on programmable graphics cards, Computers &

Graphics 29 (2005) 71–80.[7] S. Ryoo, C. Rodrigues, et al., Optimization principles and application performance of a multithreaded GPU using CUDA, in: Proceedings of the 13th ACM

SIGPLAN Symposium on Principles and practice of parallel programming, 2008, pp. 73–82.[8] R. Allen, K. Kennedy, Optimizing Compilers for Modern Architectures, Morgan Kaufmann Publishers, Academic Press, 2002.[9] A.E. Eichenberger, J.K. O’Brien, et al, Using advanced compiler technology to exploit the performance of the cell broadband engine architecture, IBM

Systems Journal 45 (1) (2006) 59–84.[10] Intel 64 and IA-32 Architectures Software Developer’s Manual, Intel Corporation Order Numbers 253665 through 253669, 2011.[11] M. Matsumoto, T. Nishimura, Mersenne twister: A 623-Dimensionally equidistributed uniform pseudo-random number generator, ACM Transactions

on Modeling and Computer Simulation 8 (1) (1998) 3–30.[12] D.E. Knuth, The art of computer programming, Sorting and Searching, vol. 3, Addison Wesley Longman Publishing, 1998.[13] N. Metropolis, A.W. Rosenbluth, et al, Equation of state calculations by fast computing machines, Journal of Chemical Physics 21 (6) (1953).[14] B.A. Berg, Markov chain Monte Carlo simulations and their statistical analysis, World Scientific Publishing, 2004.[15] J.B. Anderson, Quantum Monte Carlo: Origins, Development, Applications, Oxford University Press, US, 2007.


[16] K. Karimi, N. Dickson, F. Hamze, High-performance physics simulations using multi-core CPUs and GPGPUs in a volunteer computing context,International Journal of High-Performance Applications 25 (1) (2011).

[17] F. Hamze, N. Dickson, K. Karimi, Robust parameter selection for parallel tempering, International Journal of Modern Physics C 21 (5) (2010).[18] K. Karimi, N. Dickson, F. Hamze, et al, Investigating the performance of an adiabatic quantum optimization processor, Quantum Information Processing

(2011).[19] M. Suzuki, Generalized Trotter’s formula and systematic approximants of exponential operators and inner derivations with applications to many-body

problems, Communications in Mathematical Physics 51 (2) (1976) 183–190.[20] P. Djeu, M. Quinlan, P. Stone, Improving Particle Filter Performance Using SSE Instructions, in: International Conference on Intelligent Robotics and

Systems, USA, 2009.[21] H. Ahmadi, M. Moslemi-Naeini, H. Sarbazi-Azad, Efficient SIMD numerical interpolation, in: International Conference on High Performance Computing

and Communications (HPCC), Italy, 2005.[22] S.P. Mohanty, GPU-CPU multi-core for real-time signal processing, in: 2009 Digest of Technical Papers International Conference on Consumer

Electronics, 2009, pp. 1–2.[23] Intel 64 and IA-32 Architectures Optimization Reference Manual, Intel Corporation Order Number 248966, November 2007.[24] G. Marsaglia, Xorshift RNGs, Journal of Statistical Software 8 (14) (2003) 1–6.[25] NVIDIA OpenCL Best Practices Guide, Version 1.0, NVIDIA Corporation, 2009.

Documents

Importance of explicit vectorization for CPU and GPU software performance