Importance of explicit vectorization for CPU and GPU software performance

Journal of Computational Physics 230 (2011) 5383–5398

Contents lists available at ScienceDirect

journal homepage: www.elsevier.com/locate/jcp

Neil G. Dickson*, Kamran Karimi, Firas Hamze

D-Wave Systems Inc., 100-4401 Still Creek Drive, Burnaby, British Columbia, Canada V5C 6G9

* Corresponding author. E-mail addresses: ndickson@dwavesys.com (N.G. Dickson), kkarimi@dwavesys.com (K. Karimi), fhamze@dwavesys.com (F. Hamze).

Article info

Article history: Received 30 September 2010; received in revised form 21 March 2011; accepted 22 March 2011; available online 29 March 2011.

Keywords: Performance; Optimization; Vectorization; Monte Carlo; Ising model; GPU

Abstract

Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations done on both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and the equivalent, explicit memory coalescing, on the GPU are found to be critical to achieving good performance of this algorithm in both environments. The fully-optimized CPU version achieves a 9× to 12× speedup over the original CPU version, in addition to speedup from multi-threading. This is 2× faster than the fully-optimized GPU version, indicating the importance of optimizing CPU implementations.

0021-9991/$ - see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.jcp.2011.03.041

1. Introduction

It is common to examine performance increases from parallelism in the form of CPU and GPU multi-threading [1,2], but vectorization and non-parallel optimization techniques remain less common in the literature. Based on theoretical peak performance values or performance test results, it is sometimes concluded or even assumed that GPU computation is significantly faster than CPU computation [3–6]. However, explicit optimization of the CPU code, a process by which a programmer manually intervenes to maximize performance, is often not considered. Due to several well-documented GPU performance factors [7], it is common for GPU developers to become involved in many aspects of coding that can positively affect performance. CPU optimization, however, is usually left to the compiler, and it has been argued that compilers achieve high utilization of the CPU's computing power [8,9]. If this assumption is not valid, it may have led to heavily skewed performance comparisons between CPUs and GPUs. Similarly, due to the prevalence of multicore CPUs, comparing performance of one core of a CPU against that of a full GPU, as done in [5], is not a completely fair comparison.

The goal of this paper is to present a performance analysis of CPU and GPU implementations of a specific Metropolis Monte Carlo algorithm at different levels of optimization, as outlined in Table 1. Note that the GPU code was too slow to test reliably before making the basic optimizations described in Section 2, so this implementation was not included here.

Vector instruction sets, present on modern commodity CPUs since 2001 (referred to here as SSE), provide 128-bit registers and execution units [10], allowing a single operation to be performed on each of multiple (2 64-bit, 4 32-bit, 8 16-bit, or 16 8-bit) adjacent data elements at once. For example, the element-wise addition of two 4-element vectors of 32-bit floating point numbers can be vectorized by replacing the 4 separate addition operations with a single SSE instruction. However, these SSE instructions are not frequently used explicitly. High-quality vectorization of our algorithm requires knowledge about the semantics of the algorithm, which a compiler cannot (and arguably, should not) assume.
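As a concrete illustration of the 4-element addition mentioned in the Introduction, the following C sketch adds two vectors of four 32-bit floats. The function name `add4` is hypothetical. The sketch assumes a compiler that defines `__SSE__` and provides `<xmmintrin.h>` on SSE-capable targets; on other targets it falls back to the four scalar additions that the single SSE instruction replaces.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Hypothetical helper: out[k] = a[k] + b[k] for k = 0..3. */
static void add4(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__SSE__)
    /* One 128-bit load per input, one addition covering all 4 lanes,
       and one 128-bit store. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
#else
    /* Scalar fallback for targets without SSE: 4 separate additions. */
    for (size_t k = 0; k < 4; ++k)
        out[k] = a[k] + b[k];
#endif
}
```

The unaligned load and store intrinsics are used so that the sketch makes no assumption about how the caller aligned its arrays.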

Table 1
Implementations at different levels of optimization.

Impl.  CPU/GPU  Multi-threaded  Compiler optimization  Basic optimizations  Vectorized MT19937 &  Vectorized data updating
                                enabled                (Section 2)          flipping (Section 3)  (Sections 3.1 & 3.2)
A.1a   CPU      ✓               -                      -                    -                     -
A.1b   CPU      ✓               ✓                      -                    -                     -
A.2a   CPU      ✓               -                      ✓                    -                     -
A.2b   CPU      ✓               ✓                      ✓                    -                     -
A.3    CPU      ✓               ✓                      ✓                    ✓                     -
A.4    CPU      ✓               ✓                      ✓                    ✓                     ✓
B.1    GPU      ✓               ✓                      ✓                    -                     -
B.2    GPU      ✓               ✓                      ✓                    ✓                     ✓

Software developers, on the other hand, must already have knowledge of these semantics in order to design and implement the algorithm. With the recent introduction of the Advanced Vector Extensions (AVX) instruction set [10], which extends SSE to 256 or more bits, this knowledge is becoming more important.

For example, the MT19937 (pseudo) random number generation algorithm [11] keeps an array of 624 numbers. An alternative implementation, used in this paper, keeps 4 × 624 = 2,496 numbers and uses SSE to generate 4 random numbers in roughly the same time as each random number before. The resulting random number sequence would be different, so in order for a compiler to safely replace the algorithm with something not equivalent to the source code, it must first determine that the sole purpose of the code is to generate many random numbers. An analogous scenario would be a compiler recognizing code performing bubble sort, a stable sort, determining that the sorting stability is not necessary, and then replacing the code with heap sort, an unstable sort [12].
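A minimal sketch of such an interleaved layout is given below, assuming four independently seeded MT19937 generators advanced in lockstep. Word i of all four generators sits in one row, so each step of the recurrence touches 4 adjacent 32-bit values. The names `mt4_t`, `mt4_seed`, `mt4_twist`, and `mt4_next` are hypothetical, and the per-lane seeding is an assumption, since the text does not say how the lanes are initialized. The inner lane loops are written in plain C for portability; an SSE implementation would replace each of them with 128-bit instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MT_N 624   /* state words per generator (MT19937) */
#define MT_M 397   /* recurrence offset (MT19937) */
#define LANES 4    /* four 32-bit lanes fill one 128-bit register */

/* Interleaved state (hypothetical type): state[i][lane] is word i of
   generator `lane`, giving 4 x 624 = 2,496 words in total. */
typedef struct {
    uint32_t state[MT_N][LANES];
    int index;     /* next row to hand out; MT_N means "regenerate first" */
} mt4_t;

/* Seed each lane separately with the standard MT19937 initializer. */
static void mt4_seed(mt4_t *g, const uint32_t seeds[LANES])
{
    assert(g != NULL && seeds != NULL);
    for (int l = 0; l < LANES; ++l)
        g->state[0][l] = seeds[l];
    for (int i = 1; i < MT_N; ++i)
        for (int l = 0; l < LANES; ++l) {
            uint32_t prev = g->state[i - 1][l];
            g->state[i][l] = 1812433253u * (prev ^ (prev >> 30)) + (uint32_t)i;
        }
    g->index = MT_N;
}

/* Regenerate all 624 rows. Every lane undergoes the same operations,
   which is what lets one vector instruction serve 4 generators. */
static void mt4_twist(mt4_t *g)
{
    for (int i = 0; i < MT_N; ++i) {
        int i1 = (i + 1) % MT_N;
        int im = (i + MT_M) % MT_N;
        for (int l = 0; l < LANES; ++l) {
            uint32_t y = (g->state[i][l] & 0x80000000u)
                       | (g->state[i1][l] & 0x7fffffffu);
            uint32_t mag = (y & 1u) ? 0x9908b0dfu : 0u;
            g->state[i][l] = g->state[im][l] ^ (y >> 1) ^ mag;
        }
    }
    g->index = 0;
}

/* Produce 4 tempered random numbers, one per lane. */
static void mt4_next(mt4_t *g, uint32_t out[LANES])
{
    if (g->index >= MT_N)
        mt4_twist(g);
    for (int l = 0; l < LANES; ++l) {
        uint32_t y = g->state[g->index][l];
        y ^= y >> 11;
        y ^= (y << 7) & 0x9d2c5680u;
        y ^= (y << 15) & 0xefc60000u;
        y ^= y >> 18;
        out[l] = y;
    }
    g->index++;
}
```

Each lane on its own reproduces the ordinary MT19937 sequence for its seed, but the order in which numbers reach the caller differs from that of a single generator, which is the kind of change a compiler could not make on its own.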

The application we consider is that of performing a Metropolis Monte Carlo algorithm (e.g. [13]) on a sparse Ising model, which is a system of spin variables $s_i \in \{+1, -1\}$, with a Hamiltonian (cost function) of the form:

$$H = -\sum_i h_i s_i - \sum_{i,j} J_{ij} s_i s_j,$$

where $h_i$ and $J_{ij}$ are the values that define the particular Ising model. The general Metropolis sweep algorithm used in this paper, to sample from a Boltzmann distribution over the possible values of spins, can be summarized as in Fig. 1.
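Under stated assumptions, one such sweep can be sketched in C as follows. The sketch assumes the sign convention $H = -\sum_i h_i s_i - \sum_{i,j} J_{ij} s_i s_j$ with each coupled pair counted once, and the standard Metropolis acceptance probability $\min(1, e^{-\beta \Delta E})$. The compressed-row layout `ising_t`, the cached array `field`, and the function name `metropolis_sweep` are hypothetical and do not come from the paper. To keep the block independent of any random number generator and of math-library calls, the caller passes `e[i]` $= -\ln u_i$ for uniform(0,1) numbers $u_i$, so the test $u_i < e^{-\beta \Delta E}$ becomes `e[i] > beta * dE`.

```c
#include <assert.h>
#include <stddef.h>

/* Sparse Ising model in compressed-row form (hypothetical layout).
   Neighbours of spin i are adj[start[i]] .. adj[start[i+1]-1], and J[k]
   is the coupling to adj[k]; each coupling appears in both rows. */
typedef struct {
    size_t n;
    const size_t *start;
    const size_t *adj;
    const double *J;
} ising_t;

/* One sequential sweep over all spins.
   field[i] caches h_i + sum_j J_ij s_j, so flipping spin i changes the
   energy by dE = 2 * s_i * field[i] under the assumed sign convention.
   e[i] = -ln(u_i) with u_i uniform in (0,1). Returns the number of flips. */
static size_t metropolis_sweep(const ising_t *m, int *s, double *field,
                               double beta, const double *e)
{
    size_t flips = 0;
    assert(m != NULL && s != NULL && field != NULL && e != NULL);
    for (size_t i = 0; i < m->n; ++i) {
        double dE = 2.0 * s[i] * field[i];
        /* Equivalent to u_i < min(1, exp(-beta * dE)). */
        if (e[i] > beta * dE) {
            s[i] = -s[i];
            ++flips;
            /* Update the cached data of each adjacent spin instead of
               recomputing its flip probability from scratch. */
            for (size_t k = m->start[i]; k < m->start[i + 1]; ++k)
                field[m->adj[k]] += 2.0 * m->J[k] * s[i];
        }
    }
    return flips;
}
```

Only the neighbours of a flipped spin are touched, so the cost of a sweep scales with the number of couplings rather than with the square of the number of spins.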

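for each spin i,
    if uniform(0,1) random number < probability of flipping the sign of i,
        flip the sign of i
        for each spin j adjacent to i,
            update data used to find probability of flipping j

Fig. 1. A Metropolis Monte Carlo sweep of an Ising model.

This sequential sweeping algorithm was chosen as a starting point for performance comparison because it has already been identified in literature as a fast alternative to the conventional Metropolis algorithm, which visits spins in random order [14]. This is because visiting spins sequentially requires one fewer random number to be generated per iteration. Also, the data used to compute probabilities of adjacent spins flipping can be updated faster than computing, from scratch, the probability of flipping a single spin. This is because the updating is a proper subsequence of the operations needed to compute the probability from scratch.

The optimized implementations were developed in a Quantum Monte Carlo (QMC) [15] simulation context and use Parallel Tempering, as explained in [16,17]. The topology of the Ising models that appear in these simulations is described in [18], and comes from applying a Suzuki–Trotter decomposition [19] to a given processor architecture, but the optimizations presented here are directly applicable to many types of sparse Ising models, e.g. any lattice Ising models. For the purposes of this paper, the most relevant information is that each spin is adjacent to 6, 7, or 8 other spins in a layered structure, and that millions of the Metropolis sweeps shown in Fig. 1 must be performed on millions of systems, each with up to 32,768 spins. With these high computational requirements, even the best-optimized implementations require months of computation time on thousands of multi-core computers. For densely-connected Ising models, a different vectorized implementation than the ones described in Section 3 would be advised, since the sparseness is exploited by our implementations.

Explicit CPU vectorization has shown impressive results in other contexts [20,21]. However, those performance comparisons are relative to implementations where the compiler does not implicitly vectorize the code, so improvement over a compiler's implicit vectorization may not be known for those cases. It has also been found in other contexts that a GPU implementation is not guaranteed to surpass a CPU implementation in performance [22]. Again, many factors could explain such a discrepancy, so we attempted to address these concerns by exploring a number of potential optimizations applicable to Metropolis sweeping on sparse Ising models.

The rest of the paper is organized as follows. Section 2 explains a number of non-parallel optimization techniques that were applied to both the CPU and the GPU implementations. Section 3 shows how different parts of the code were vectorized. This section also explains how memory coalescing for GPU was performed. Section 4 presents the results of a number of performance tests we performed using CPU and GPU code at different levels of optimization. Section 5 concludes the paper.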

2. Basic optimizations

The optimizations presented here focus only on the Metropolis sweep algorithm of Fig. 1, as everything else remains largely unchanged, including multi-threading, which is explained in [16]. The basic optimization techniques we used on this algorithm include: branch elimination, simplification of data structures, result caching, and approximation. We have also found these optimizations to be effective for other computationally-intensive, non-Monte-Carlo, CPU applications. Our GPU implementatio