Importance of explicit vectorization for CPU and GPU software performance

Neil G. Dickson, Kamran Karimi, Firas Hamze
D-Wave Systems Inc., 100-4401 Still Creek Drive, Burnaby, British Columbia, Canada V5C 6G9

Journal of Computational Physics 230 (2011) 5383-5398
0021-9991/$ - see front matter. © 2011 Elsevier Inc. All rights reserved.
Corresponding author. E-mail addresses: (N.G. Dickson), (K. Karimi), (F. Hamze).

Article info: Received 30 September 2010; Received in revised form 21 March 2011; Accepted 22 March 2011; Available online 29 March 2011.

Keywords: Performance; Optimization; Vectorization; Monte Carlo; Ising model; GPU

Abstract

Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations done on both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and the equivalent, explicit memory coalescing, on the GPU are found to be critical to achieving good performance of this algorithm in both environments. The fully-optimized CPU version achieves a 9 to 12 times speedup over the original CPU version, in addition to the speedup from multi-threading. This is 2 times faster than the fully-optimized GPU version, indicating the importance of optimizing CPU implementations.

1. Introduction

It is common to examine performance increases from parallelism in the form of CPU and GPU multi-threading [1,2], but vectorization and non-parallel optimization techniques remain less common in the literature. Based on theoretical peak performance values or performance test results, it is sometimes concluded or even assumed that GPU computation is significantly faster than CPU computation [3-6]. However, explicit optimization of the CPU code, a process by which a programmer manually intervenes to maximize performance, is often not considered. Due to several well-documented GPU performance factors [7], it is common for GPU developers to become involved in many aspects of coding that can positively affect performance. CPU optimization, however, is usually left to the compiler, and it has been argued that compilers achieve high utilization of the CPU's computing power [8,9]. If this assumption is not valid, it may have led to heavily skewed performance comparisons between CPUs and GPUs. Similarly, due to the prevalence of multicore CPUs, comparing the performance of one core of a CPU against that of a full GPU, as done in [5], is not a completely fair comparison.

The goal of this paper is to present a performance analysis of CPU and GPU implementations of a specific Metropolis Monte Carlo algorithm at different levels of optimization, as outlined in Table 1. Note that the GPU code was too slow to test reliably before making the basic optimizations described in Section 2, so this implementation was not included here.

Vector instruction sets, present on modern commodity CPUs since 2001 (referred to here as SSE), provide 128-bit registers and execution units [10], allowing a single operation to be performed on each of multiple (2 64-bit, 4 32-bit, 8 16-bit, or 16 8-bit) adjacent data elements at once. For example, the element-wise addition of two 4-element vectors of 32-bit floating point numbers can be vectorized by replacing the 4 separate addition operations with a single SSE instruction. However, these SSE instructions are not frequently used explicitly. High-quality vectorization of an algorithm requires knowledge about the semantics of the algorithm, which a compiler cannot (and arguably, should not) assume. Software developers, on the other hand, must already have knowledge of these semantics in order to design and implement the algorithm. With the recent introduction of the Advanced Vector Extensions (AVX) instruction set [10], which extends SSE to 256 or more bits, this knowledge is becoming more important.
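As a concrete illustration of the element-wise addition example above (a minimal sketch, not code from the paper; the function name add4 is invented here):

```c
#include <xmmintrin.h>  /* SSE intrinsics (x86) */

/* Adds two 4-element vectors of 32-bit floats with a single ADDPS
   instruction, replacing four separate scalar additions. */
static void add4(const float *a, const float *b, float *out) {
    /* unaligned loads/stores so callers need no special alignment */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

A compiler may or may not auto-vectorize the equivalent 4-iteration scalar loop; writing the intrinsic makes the vectorization explicit and guaranteed.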

Table 1
Implementations at different levels of optimization.

Columns: (1) Compiler optimization enabled; (2) Basic optimizations (Section 2); (3) Vectorized MT19937 & flipping (Section 3); (4) Vectorized data updating (Sections 3.1 & 3.2).

Impl.  CPU/GPU  (1)  (2)  (3)  (4)
A.1a   CPU      -    -    -    -
A.1b   CPU      ✓    -    -    -
A.2a   CPU      -    ✓    -    -
A.2b   CPU      ✓    ✓    -    -
A.3    CPU      ✓    ✓    ✓    -
A.4    CPU      ✓    ✓    ✓    ✓
B.1    GPU      ✓    ✓    -    -
B.2    GPU      ✓    ✓    ✓    ✓

The optimized implementations were developed in a Quantum Monte Carlo (QMC) [15] simulation context and use Parallel Tempering, as explained in [16,17]. The topology of the Ising models that appear in these simulations is described in [18], and comes from applying a Suzuki-Trotter decomposition [19] to a given processor architecture, but the optimizations presented here are directly applicable to many types of sparse Ising models, e.g. any lattice Ising models. For the purposes of this paper, the most relevant information is that each spin is adjacent to 6, 7, or 8 other spins in a layered structure, and that millions of the Metropolis sweeps shown in Fig. 1 must be performed on millions of systems, each with up to 32,768 spins. With these high computational requirements, even the best-optimized implementations require months of computation time on thousands of multi-core computers. For densely-connected Ising models, a different vectorized implementation than the ones described in Section 3 would be advised, since our implementations exploit the sparseness.

Explicit CPU vectorization has shown impressive results in other contexts [20,21]. However, those performance comparisons are relative to implementations where the compiler does not implicitly vectorize the code, so the improvement over a compiler's implicit vectorization may not be known for those cases. It has also been found in other contexts that a GPU implementation is not guaranteed to surpass a CPU implementation in performance [22]. Again, many factors could explain such a discrepancy, so we attempted to address these concerns by exploring a number of potential optimizations applicable to Metropolis sweeping on sparse Ising models.

For example, the MT19937 (pseudo) random number generation algorithm [11] keeps an array of 624 numbers. An alternative implementation, used in this paper, keeps 4 × 624 = 2,496 numbers and uses SSE to generate 4 random numbers in roughly the same time as each single random number took before. The resulting random number sequence would be different, so in order for a compiler to safely replace the algorithm with something not equivalent to the source code, it must first determine that the sole purpose of the code is to generate many random numbers. An analogous scenario would be a compiler recognizing code performing bubble sort, a stable sort, determining that the sorting stability is not necessary, and then replacing the code with heap sort, an unstable sort [12].
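The batching idea can be sketched as follows. This is a deliberately simplified stand-in: it interleaves four xorshift32 generators rather than four copies of MT19937's 624-word state, but it shows the structure that makes SSE applicable, namely identical operations applied to four adjacent state words:

```c
#include <stdint.h>

/* Four independent generator states stored adjacently. Each step below
   applies the same xor/shift to every lane, so the loop body maps
   directly onto 128-bit SSE integer instructions acting on all four
   states at once (shown here in portable scalar form).
   NOTE: xorshift32 is an illustrative stand-in, not MT19937. */
typedef struct { uint32_t s[4]; } rng4_t;

static void rng4_next(rng4_t *r, uint32_t out[4]) {
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = r->s[lane];
        x ^= x << 13;  /* same operation on each lane... */
        x ^= x >> 17;  /* ...so one vector instruction covers all 4 */
        x ^= x << 5;
        r->s[lane] = x;
        out[lane] = x;
    }
}
```

Seeded with four distinct nonzero values, this yields four independent streams per call, mirroring the paper's 4-at-a-time generation.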

The application we consider is that of performing a Metropolis Monte Carlo algorithm (e.g. [13]) on a sparse Ising model, which is a system of spin variables s_i ∈ {+1, -1}, with a Hamiltonian (cost function) of the form:

H = \sum_i h_i s_i + \sum_{i,j} J_{ij} s_i s_j

where h_i and J_{ij} are the values that define the particular Ising model. The general Metropolis sweep algorithm used in this paper, to sample from a Boltzmann distribution over the possible values of spins, can be summarized as in Fig. 1.
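For concreteness, the per-spin quantities involved can be written out; this is standard Metropolis bookkeeping implied by the text, not an equation reproduced from the paper. Flipping spin i negates s_i, so with the Hamiltonian above the energy change and acceptance probability (at temperature T) are:

```latex
\Delta H_i \;=\; -2\,s_i\Bigl(h_i \;+\; \sum_{j\ \mathrm{adjacent\ to}\ i} J_{ij}\,s_j\Bigr),
\qquad
P(\text{flip } i) \;=\; \min\bigl(1,\; e^{-\Delta H_i / T}\bigr).
```

When spin i does flip, the parenthesized sum of each adjacent spin j changes by only a single term proportional to J_{ij}; this is the "data used to find probability of flipping j" that the sweep of Fig. 1 updates incrementally instead of recomputing.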

This sequential sweeping algorithm was chosen as a starting point for performance comparison because it has already been identified in the literature as a fast alternative to the conventional Metropolis algorithm, which visits spins in random order [14]. This is because visiting spins sequentially requires one fewer random number to be generated per iteration. Also, the data used to compute the probabilities of adjacent spins flipping can be updated faster than computing, from scratch, the probability of flipping a single spin. This is because the updating is a proper subsequence of the operations needed to compute the probability from scratch.

for each spin i:
    if uniform(0,1) random number < probability of flipping the sign of i:
        flip the sign of i
        for each spin j adjacent to i:
            update data used to find probability of flipping j

Fig. 1. A Metropolis Monte Carlo sweep of an Ising model.

The rest of the paper is organized as follows. Section 2 explains a number of non-parallel optimization techniques that were applied to both the CPU and the GPU implementations. Section 3 shows how different parts of the code were vectorized; this section also explains how memory coalescing for the GPU was performed. Section 4 presents the results of a number of performance tests we performed using CPU and GPU code at different levels of optimization. Section 5 concludes the paper.

    2. Basic optimizations

The optimizations presented here focus only on the Metropolis sweep algorithm of Fig. 1, as everything else remains largely unchanged, including multi-threading, which is explained in [16]. The basic optimization techniques we used on this algorithm include: branch elimination, simplification of data structures, result caching, and approximation. We have also found these optimizations to be effective for other computationally-intensive, non-Monte-Carlo CPU applications. Our GPU implementations (B.1 and B.2) use all of these optimizations, and the CPU and GPU code remained nearly identical under them, so although only the CPU code is shown here, these optimizations are equally applicable to the GPU.

In addition to the optimizations described in this section, many other GPU optimizations were tried, but only one resulted in a non-negligible improvement; it is described in Section 3.2. An example of an additional GPU optimization yielding only a roughly 6% speedup, negligible compared to the others presented here, was to temporarily copy the current state of a random number generator (see Section 3) into the GPU's fast shared memory before regenerating its array of random numbers. However, since the full set of random number generator states was too large to keep in the limited shared memory, this could not be improved upon significantly. Similar optimizations were attempted with other data structures, but with no significant success, due to the same problem of limited shared memory, as discussed in Section 4. More than 20 different schemes for assigning work to GPU cores were tested, with the best described in Section 3.2; the particulars of the other schemes are not notable beyond what appears there.

    2.1. Branch elimination

Commodity CPUs have had the ability to run parts of multiple instructions at the same time (known as superscalar architecture) since 1993 [10]. The drawback of this feature is that if the current instruction is a conditional branch, for example due to an if-statement, either the CPU cannot start running any instructions past that branch, or it must make a guess as to what the next instruction will be. This guessing is known as branch prediction. When the guess is wrong (a misprediction), several partially-completed instructions may need to be removed from the CPU's pipeline, which can cause stalls [23]. Eliminating unnecessary branches can alleviate this problem.

This optimization had a large impact on performance and, combined with the data structure simplification, made the code simpler. On its own, the resulting code was smaller, but admittedly less readable. In the original code, shown in Fig. 2, the inner loop of the Metropolis sweep contained two frequently-mispredicted branches. The data structure details are examined in Section 2.2.

However, both branches can be eliminated, as in Fig. 3. Note that the value of (neighbours[0] == currentSpin) is 1 when true and 0 when false, so the new code still selects the neighbour of the edge that is not the current spin, to update it after flipping the current spin. Also, for x86-compatible processors since the Pentium Pro in 1995, the ternary operator (i.e. condition ? valueIfTrue : valueIfFalse) is implemented with a conditional move instruction instead of a branch when both possible values do not involve computation [23].
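Since the listings of Figs. 2 and 3 are truncated in this extraction, the following hedged reconstruction illustrates the transformation the text describes, using invented names (edge_t, other_endpoint_branchy, other_endpoint_branchless):

```c
/* Illustrative only, not the paper's code: select the endpoint of an
   edge that is NOT the current spin, with and without a branch. */
typedef struct { int endpoints[2]; } edge_t;

static int other_endpoint_branchy(const edge_t *e, int currentSpin) {
    if (e->endpoints[0] == currentSpin)    /* frequently mispredicted */
        return e->endpoints[1];
    return e->endpoints[0];
}

static int other_endpoint_branchless(const edge_t *e, int currentSpin) {
    /* The comparison evaluates to 1 when true and 0 when false, so its
       result is used directly as the array index: no branch to predict. */
    return e->endpoints[e->endpoints[0] == currentSpin];
}
```

Both functions return the same values; the branchless form trades a possible misprediction stall for a cheap comparison and indexed load.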

To further speed up this computation and avoid both of these confusing ways of accessing the data, the structures require a more suitable design, examined in the following section.

2.2. Simplification of data structures

    The data structures mentioned above are used elsewhere in the code and suffered from similar issues, as they have similaraccess patterns. The original graph data structure had a complex layout in memory, as shown in Fig. 4.

This data structure represents the edges, with weights J, between spins. Because of details specific to the context of the application we used, some edges (inter-layer) are handled separately from others (intra-layer), as determined by isInterEdge. This separate handling would not be necessary in other contexts, and so could simply be omitted in general. In this context, it can also be eliminated, since here it happens that, by design, there are always exactly two inter-layer edges per spin. Since the edges incident to a spin can be handled in any order, reordering them ahead of time such that the two inter-layer edges always appear after the intra-layer edges allows them to be handled outside the main loop. isInterEdge can thus be eliminated.

As for the bulk of the graph data structure, which is completely general, we eliminated the middleman, placing each J value directly with the target spin index it applies to. In doing so, we duplicated each edge, in order to have all edges adjacent to each spin in a single place. The data structure then simplifies to Fig. 5.

The corresponding code is shown in Fig. 6. The code is clearer than before, and the inside of the loop is now just one line (down from 9). In addition to simply computing less, there is now less memory use, more sequential access, and fewer arrays being read in quick succession, improving memory cache use [23]. As such, these simplifications had a large performance impact on top of the branch elimination, both on the CPU and the GPU.

In general contexts, where all edges are handled in the same manner, the above code further simplifies to the form shown in Fig. 7. Because this code is equivalent to that of Fig. 6, except for removing operations and data, its performance on Ising models with the same level of connectivity should be at least as good as that observed in our specific context.
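As Figs. 5-7 themselves are not reproduced in this text, a hypothetical sketch of such a layout (invented names adj_t, adjacency, offset, local_field) might look like this:

```c
#include <math.h>

/* Hypothetical sketch of the simplified layout described above: each
   J value sits directly beside the spin index it points to, and every
   edge is duplicated so all edges adjacent to a spin are contiguous. */
typedef struct { float J; int neighbour; } adj_t;

/* 3-spin example graph: edges (0,1) J=0.5, (0,2) J=-0.3, (1,2) J=0.2,
   each stored once per endpoint. */
static const adj_t adjacency[] = {
    /* spin 0 */ { 0.5f, 1}, {-0.3f, 2},
    /* spin 1 */ { 0.5f, 0}, { 0.2f, 2},
    /* spin 2 */ {-0.3f, 0}, { 0.2f, 1},
};
static const int offset[] = {0, 2, 4, 6};  /* where each spin's edges start */

/* The inner loop is one line, with purely sequential memory access
   over the current spin's adjacency entries. */
static float local_field(const int *s, const float *h, int i) {
    float field = h[i];
    for (int e = offset[i]; e < offset[i + 1]; e++)
        field += adjacency[e].J * (float)s[adjacency[e].neighbour];
    return field;
}
```

Storing (J, neighbour) pairs together turns two correlated array lookups into one sequential stream, which is the cache-friendliness argument made above.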

    While these data structure changes provide...

