
Automatic SIMD Vectorization for Haskell

Leaf Petersen
Intel Labs
[email protected]

Dominic Orchard
Computer Laboratory
University of Cambridge
[email protected]

Neal Glew
Intel Labs

[email protected]

Abstract

Expressing algorithms using immutable arrays greatly simplifies the challenges of automatic SIMD vectorization, since several important classes of dependency violations cannot occur. The Haskell programming language provides libraries for programming with immutable arrays, and compiler support for optimizing them to eliminate the overhead of intermediate temporary arrays. We describe an implementation of automatic SIMD vectorization in a Haskell compiler which gives significant vector speedups for a range of programs written in a natural programming style. We compare performance with that of programs compiled by the Glasgow Haskell Compiler.

1. Introduction

In the past decade, power has increasingly been recognized as the key limiting resource in micro-processors, in part due to the limited availability of battery power on mobile devices, but more fundamentally due to the non-linear scaling of power usage with frequency. This has forced micro-processor designs to explore (or re-explore) different avenues at both the micro-architectural and the architectural levels. One successful architectural trend has been the increasing emphasis on data parallelism in the form of single-instruction multiple-data (SIMD) instruction sets.

The key idea of data parallelism is that significant portions of many sequential computations consist of uniform (or almost uniform) operations over large collections of data. SIMD instruction sets exploit this at the machine instruction level by providing data paths for and operations on fixed-width vectors of data. Programmers (or preferably compilers) are then responsible for finding data parallel computations and expressing them in terms of fixed-width vectors. In the context of languages with loops (or compiled to loops), this corresponds to choosing loops for which multiple iterations can be computed simultaneously.

The task of automatically finding SIMD vectorizable code in the context of a compiler has been the subject of extensive study over the course of decades. For common imperative programming languages, such as C and Fortran, in which programs are structured as loops over mutable arrays of data, the task of rewriting loops in SIMD vector form is straightforward. However, side-effects and dependencies within a loop body may result in a rewriting changing the program semantics. Furthermore, the task of discovering which loops permit a valid rewriting (via dependency analysis) is in general undecidable, and has been the subject of extensive research (see e.g. [12] and overview in [14]). Even quite sophisticated compilers are often forced to rely on programmer annotations to assert that a given loop is a valid target of SIMD vectorization (for example, the INDEPENDENT keyword in HPF [7]).

A great deal of the difficulty in finding vectorizable loops simply goes away in the context of a functional language. In functional languages, the absence of mutation eliminates large classes of the more difficult kinds of dependence violations that block vectorization in compilers for imperative languages. In this sense, SIMD vectorization would seem to be quite low-hanging fruit for functional language compiler writers.

In this paper, we describe the implementation of an automatic SIMD vectorization optimization pass in the Intel Functional Language Compiler (our compiler). We argue that in a functional language context, this optimization is straightforward to implement and frequently gives excellent results.

Our compiler was originally developed to compile programs written in an experimental strict functional language, and the vectorization optimization that we present here was developed in that context. However, the main bulk of our compiler infrastructure was designed to be fairly language agnostic, and we have more recently used it to build an experimental Haskell compiler using the Glasgow Haskell Compiler (GHC) as a front end. While Haskell is quite a complex language on the surface, the GHC front end reduces it to a relatively simple intermediate representation based on System F, which is fairly straightforward to compile using our compiler. By intercepting the GHC intermediate representation (IR) after the main high-level optimization passes have run, we gain the benefit of the substantial and sophisticated optimization infrastructure built up in GHC, and in particular we benefit from its ability to optimize away temporary intermediate arrays.

We make the following three contributions:

• We design a core language with arrays and SIMD vector primitives (Section 3).
• We outline a novel vectorization process over this language (Section 4).
• We evaluate our approach compared to GHC with the native backend and LLVM backend, showing speedups of between 2x and 10x using our compiler, for a selection of array processing benchmarks (Section 5).

We begin with a brief overview of our compiler and its relationship to GHC.

2. The Intel Functional Language Compiler

2.1 Structure of our compiler

The Intel Functional Language Compiler provides an optimization platform aimed at producing high-performance executables from functional-style code. The core intermediate language of the compiler is a static single-assignment style, control-flow-graph-based intermediate representation (IR). It is strict and has explicit thunks to represent laziness. While the intermediate representation is largely language agnostic, it is not paradigm agnostic. The core technologies of the compiler are designed specifically around compiling functional-language programs. Most importantly for the purposes of this paper, the compiler relies on maintaining a distinction between mutable and immutable data, and focuses almost all of its optimization effort on the latter. The compiler is structured as a whole-program compiler—however, none of the optimizations in the compiler rely on this property in any essential way. In addition to the usual inlining-style optimizations, contification [5] is performed to turn many uses of (mutual) recursion into loops. This optimization is crucial for our purposes since we structure our SIMD-vectorization optimization as an optimization on loops. Many standard compiler optimizations have been implemented in the compiler, including loop-invariant code motion and a very general simplifier in the style of Appel and Jim [2]. The compiler also implements a number of inter-procedural representation optimizations using a field-sensitive flow analysis [10].

The compiler generates output in a modified extension of the C language called Pillar [1]. Among other things, the Pillar language supports tail calls and second-class continuations. Most importantly, it also provides the necessary mechanisms to allow for accurate garbage collection of managed memory. While an early version of Pillar was implemented as a modification of a C compiler, Pillar is currently implemented as a source-to-source translation targeting standard C, which is then compiled using either the Intel C Compiler or GCC. Our garbage collector and a small runtime are linked in to produce the final executable.

2.2 Haskell and GHC

The Haskell language and the GHC extensions to it provide a large base of useful high-level libraries written in a very powerful programming language. These libraries are often written using a high-level functional-language style of programming, relying on excellent compiler technology to achieve good performance. While we wished to be able to compile Haskell programs and take advantage of the many libraries, we had no interest in attempting the monumental task of duplicating all of the high-level compiler technology implemented in GHC. The approach we took therefore was to essentially use GHC as a Haskell frontend, thereby taking advantage of the existing type-checker and high-level optimizer.

Fortunately for our purposes, GHC provides some support for writing its main System-F-style internal representation (called Core) out to files. While this code is not entirely maintained by the GHC implementers, we were able to easily modify it to support all the current Core features we require. We use this facility to build a Haskell compiler pipeline by first running GHC to perform type checking, desugaring, and all of its Core-level optimizations, and then writing out the Core IR immediately before it would otherwise be translated to the next-lower representation. Our compiler then reads in the serialized Core, and after several initial transformations and optimizations (including making thunks explicit), performs a closure-conversion pass and transforms it down to our main CFG-based intermediate representation for subsequent optimization.

2.3 GHC Modifications

Using this approach, we are able to provide an implementation of the Haskell language, as implemented by GHC, that is fairly close to functionally complete. However, some of the choices of primitives used in GHC do not match up well with the needs of our compiler, and so some modification of GHC itself was required. A notable example of this for the purposes of this paper is the treatment of arrays in the Data.Vector libraries.

The GHC libraries Data.Vector and REPA [6] provide clean high-level interfaces for generating immutable arrays. Programmers can use typical functional-language operations like maps, folds, and zips to compute arrays from other arrays. These operations are naturally data parallel and amenable to SIMD vectorization by our compiler. For example, consider a map operation over an immutable array represented with the Data.Vector library. Such arrays are represented internally to the libraries using streams, and so map first turns the source array into a stream, applies the map operation to the stream, and then unstreams that back into the final array. Unstreaming is implemented by running a monadic computation in the state monad that first creates a mutable array of the appropriate size, then successively initializes the elements of the array using array update, and finally does an unsafe freeze of the mutable array to an immutable array type. GHC has aggressive inlining and many simplification operations, which together eliminate all the intermediate streaming, mapping, and unstreaming operations. The result of all of this optimization is a piece of code that sequences the array creation, the calling of a tail-recursive function that initializes the array elements, and the unsafe freeze. After translation into our internal representation, our compiler can then contify the tail-recursive function into a loop, resulting in a natural piece of code that can be very effectively optimized using relatively standard loop based optimizations adapted to leverage the benefits of immutability.
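As an illustration of the style of code involved, the following is a minimal sketch using the standard Data.Vector.Unboxed interface (the function name is ours); after the fusion and contification described above, it reduces to exactly the kind of array-initialization loop our compiler targets.

    import qualified Data.Vector.Unboxed as V

    -- Pointwise sum of two arrays, written with an ordinary library combinator.
    -- Stream fusion eliminates the intermediate stream structures, leaving a
    -- single loop that allocates the result array and initializes each element.
    vadd :: V.Vector Double -> V.Vector Double -> V.Vector Double
    vadd = V.zipWith (+)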

Unfortunately, a careful reader will observe that, as described in the previous paragraph, we do not actually get the benefits of immutability with an unmodified GHC. Instead, we end up with code that creates a mutable array and then initializes it with general array update operations—a style of code that we argue is drastically harder to optimize well in general, and to perform SIMD vectorization on in particular. As we discuss in Section 4.7, immutable arrays make determining when a loop can be SIMD vectorized much easier. For our purposes then, what we would like GHC to generate instead of the code described above is code that creates a new uninitialized immutable array and then uses initializing writes to initialize the array. The two invariants of initializing writes are that reading an array element must always follow the initializing write of that element, and that any element can only be initialized once. This style of write preserves the benefits of immutability from the compiler standpoint, since any initializing write to an element is the unique definition of that element. Initializing writes are key to being able to represent the initialization of an immutable array using the usual loop code, while preserving the ability to leverage immutability.

In order to get immutable code written in this style from GHC and the libraries discussed above, we modified GHC with new primitive immutable array types¹ and primitive operations on them. The additional operations include creation of a new uninitialized array, initializing writes, and a subscript operation. We also modified the Data.Vector library and a small part of the REPA library to use these immutable arrays instead of mutable arrays and unsafe freeze (of course, as with the standard GHC primitives, correctness requires that the library writer use the primitives correctly). With these modifications, the code produced by GHC is suitable for optimization by our compiler, including in many cases SIMD vectorization.
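The concrete primitives are not listed here; as a rough sketch only, an interface with the three operations just named might look like the following. All names and types in this sketch are hypothetical illustrations, not the actual GHC primops.

    -- Illustrative only: a hypothetical interface for immutable arrays with
    -- initializing writes. The primitives actually added to GHC are unboxed
    -- primops and may differ in names, types, and arity.
    data IArray a

    newIArray  :: Int -> IO (IArray a)            -- allocate, uninitialized
    newIArray  = undefined

    initIArray :: IArray a -> Int -> a -> IO ()   -- initializing write: each index
    initIArray = undefined                        -- written once, before any read

    readIArray :: IArray a -> Int -> a            -- pure subscript of an
    readIArray = undefined                        -- initialized element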

2.4 Vectorization in our compiler

SIMD vectorization in our compiler targets loops for which the compiler can show that executing several iterations in parallel is semantics preserving, rewriting them to use SIMD vector instructions.

¹ GHC does already have some such types—however, we added new ones to avoid compatibility issues.


Register kind        k   ::= s | v
Variables            xk, yk, zk, . . .
Constants            c   ::= −2^31, . . . , (2^31 − 1)
Operations           op  ::= +, −, ∗, /, . . .
Instruction          I   ::= zs = c
                          | zv = 〈xs0, . . . , xs7〉
                          | zs = xv!i
                          | zs = op(xs0, . . . , xsn)
                          | zv = 〈op〉(xv0, . . . , xvn)
                          | zs = new[xs]
                          | zs = xs[ys]
                          | zv = xs[〈yv〉]
                          | zv = 〈xv〉[ys]
                          | zv = 〈xv〉[〈yv〉]
                          | xs[ys] ← zs
                          | xs[〈yv〉] ← zv
                          | 〈xv〉[ys] ← zv
                          | 〈xv〉[〈yv〉] ← zv
Comparisons          cmp ::= xs < ys | xs ≤ ys | xs = ys
Labels               L   ::= L0, L1, . . .
Transfers            t   ::= goto L(xk0, . . . , xkn)
                          | if (cmp) goto L0(xk0, . . . , xkn)
                            else goto L1(yk0, . . . , ykm)
                          | halt
Blocks               B   ::= L(xk0, . . . , xkn):
                               I0 . . . Im
                               t
Control Flow Graph       ::= Entry L in {B0, . . . , Bn}

Figure 1. Syntax

Usually these loops serve to initialize immutable arrays, or to perform reductions over immutable arrays. The compiler generates the SIMD loops and all of the related control logic simply as code in our intermediate language. Rather than requiring the compiler to generate only the SIMD instructions provided by the target machine, we provide a more uniform SIMD instruction set that may be a superset of the machine-supported instructions. Our small runtime provides a library of impedance-matching macros that implements the uniform SIMD instructions in terms of SIMD intrinsics understood by the underlying C compiler. The C compiler is then responsible for low-level optimizations including register allocation and instruction selection.

In the next section we define a vector core language with which to discuss the vectorization process. This core language is clean, uniform, and is in fact almost exactly faithful to (a small subset of) our actual intermediate representation.

3. Vector Core language

We define here a small core language in which to present the vectorization optimization. We restrict ourselves to an intra-procedural fragment only, since for our purposes here we are not interested in inter-procedural control flow. Programs in the language are given as control-flow graphs written in a variant of static single-assignment (SSA) form. To keep things concrete, our language is based on a 32-bit machine with vector registers that are 256 bits wide; handling other register and vector widths is a straightforward extension.

3.1 Objects

The primitive objects of the vector core language consist of heap objects and register values. Heap objects are variable-length immutable arrays subscripted by 32-bit integers. The elements contained in the fields of heap objects are constrained to be 32-bit register values. Register values are small values that may be bound to variables and are implicitly represented at the machine level via registers or spill slots. The register values consist of 32-bit integers, 32-bit pointers to heap values, and 256-bit SIMD vectors each consisting of eight 32-bit register values. Adding additional primitive register values such as floating-point numbers, booleans, etc. is straightforward and adds nothing essential to the presentation.

3.2 Variables

Variables are divided into two classes according to size: scalar variables xs that may only be bound to 32-bit register values and vector variables xv that may only be bound to 256-bit vector values. In practice of course, a real intermediate language will also have additional variable sizes: 64-bit variables for doubles and 64-bit integers, and a range of vector widths possibly including 128- and 512-bit vectors. Note that the variables xs and xv are considered to be distinct variables. Informally, we will sometimes use xv to denote the variable containing a vectorized version of an original program variable xs—however, this is intended to be suggestive only and has no semantic significance.

3.3 Instructions

The instructions of the vector language serve to introduce and operate on the primitive objects of the language. Most of the instructions bind a new variable of the appropriate register kind, with the exception of the array-initialization instructions, which write values into uninitialized array elements. The move instruction zs = c binds a variable to the 32-bit integer constant c. The vector introduction instruction zv = 〈xs0, . . . , xs7〉 binds the vector variable zv to a vector composed of the contents of the eight listed 32-bit variables. We use the derived syntax zv = 〈cs0, . . . , cs7〉 to introduce a vector of constants—this may be derived in the obvious way by first binding each of the constants to a fresh 32-bit variable. A component of a vector may be selected out using the select instruction zs = xv!i, which binds zs to the ith component of the vector in xv (where i must be in the range 0 . . . 7).

The instruction zs = op(xs0, . . . , xsn) corresponds to a family of instructions operating on primitive 32-bit integers, indexed by the operation op (e.g. addition, subtraction, negation, etc.). Primitive operations must be fully saturated—that is, the number of operands provided must equal the arity of the given operation. We lift the primitive operations on integers to pointwise primitive operations on vectors with the zv = 〈op〉(xv0, . . . , xvn) instruction, which applies op pointwise across the vectors. That is, semantically zv = 〈op〉(xv0, . . . , xvn) behaves identically to a sequence of instructions applying the scalar version of op to each of the corresponding elements of the argument vectors, and then packaging up the result as the final result vector.
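As a sketch of this semantics (our own executable model, with a SIMD vector represented as a list of exactly eight elements rather than the compiler's representation):

    -- Model of 8-wide SIMD vectors as lists; not the compiler's representation.
    type Vec a = [a]

    -- zv = <op>(xv0, ..., xvn): apply the scalar operation lane by lane.
    liftOp :: ([a] -> a) -> [Vec a] -> Vec a
    liftOp op args = [ op (map (!! lane) args) | lane <- [0 .. 7] ]

    -- zs = xv!i: select the i-th component of a vector (0 <= i <= 7).
    select :: Vec a -> Int -> a
    select = (!!)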

Array objects in the vector language are immutable sequences of 32-bit data allocated in the heap. New uninitialized array objects are allocated via the zs = new[xs] instruction, which allocates a new array of length xs (where xs must contain a 32-bit integer) and binds zs to a pointer to the array. The initial contents of the array are undefined, and every element must be initialized before being read. It is an unchecked invariant of the language that every element of the array is initialized before it is read, and that every element of the array is initialized at most once. It is in this sense that the arrays are immutable—the write operation on arrays serves only as an initializing write but does not provide for mutation. Ordinary scalar element initialization is performed via the xs[ys] ← zs operation, which initializes the element of array xs at offset ys to zs. Scalar reads from the array are performed using the zs = xs[ys] instruction, which reads a single element from the array xs at offset ys and binds zs to it.

In order to support SIMD vectorization, the vector language also provides a general set of vector subscript and initialization operations permitting vectors of elements to be read from and written to arrays (or vectors of arrays) in a single instruction. The set of instructions we provide is more general than what is supported by most existing machine architectures. This is done intentionally since it is often beneficial to vectorize a loop even if some instructions must be emulated via reduction to scalar operations. Moreover, we believe it is useful to present the language in its most general form—it is always possible to restrict the set of generated instructions as desired. We defer discussion of the situations in which these various styles of loads and stores arise to subsequent sections on the actual vectorization operation.

The simplest vector array loads and stores are the operations that read multiple elements directly from a single array into a vector variable, or initialize multiple elements of a single array with elements contained in a vector variable. The instruction zv = xs[〈yv〉] takes a single array xs and a vector of offsets yv and binds zv to a vector containing the elements of xs from the given offsets. That is, semantically we may treat the instruction zv = xs[〈yv〉] as equivalent to the following list of instructions:

    ys0 = yv!0
    zs0 = xs[ys0]
    . . .
    ys7 = yv!7
    zs7 = xs[ys7]
    zv = 〈zs0, . . . , zs7〉

This instruction is commonly known as a “gather” instruction. The initializing store xs[〈yv〉] ← zv, commonly known as the “scatter” instruction, writes the elements of the vector zv to the fields of xs at offsets given by the vector yv.

There are several interesting special cases of scatters and gathers that are worth further discussion. A common idiom that arises in SIMD vectorization consists of array loads and stores such that the index vector is constructed by adding multiples of a constant stride to a scalar index. We choose to introduce this idiom as derived syntax as follows:

    zv = xs[〈ys:i〉]    def=    zv = xs[〈yv〉]
    xs[〈ys:i〉] ← zv   def=    xs[〈yv〉] ← zv
        where
            yvl = 〈ys, . . . , ys〉
            yvb = 〈0 ∗ i, . . . , 7 ∗ i〉
            yv  = 〈+〉(yvl, yvb)

We also provide derived syntax for the further special case in which the stride is known to be one:

    zv = xs[〈ys:〉]     def=    zv = xs[〈ys:1〉]
    xs[〈ys:〉] ← zv    def=    xs[〈ys:1〉] ← zv

For the purposes of simplicity of the vector language, we have left these as derived forms. However, it is worth noting that many architectures provide support for these idioms directly (and in fact may provide support only for these idioms), and hence from a pragmatic standpoint it can be important to provide primitive support for these addressing modes.

A second mode of addressing for array loads and stores covers the case where a vector of arrays is indexed pointwise by a single fixed scalar offset. The zv = 〈xv〉[ys] instruction produces a vector zv such that the ith element of zv is produced by indexing the ith array from the vector of arrays xv, using offset ys. Similarly, the 〈xv〉[ys] ← zv instruction sets the element of the ith array in the vector xv at offset ys to the value in the ith position of zv.

The final mode of addressing for array loads and stores covers the case in which a vector of arrays is indexed pointwise using a vector of offsets. The zv = 〈xv〉[〈yv〉] instruction produces a vector zv such that the ith element is produced by indexing the ith array from the vector of arrays xv using the ith offset from the vector of offsets yv. Similarly, the 〈xv〉[〈yv〉] ← zv instruction sets the element of the ith array in the vector xv at the offset given by the ith element of the vector of offsets yv to the value in the ith position of zv.
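The following sketch summarizes the four load forms, again modelling vectors as 8-element lists and arrays as plain Haskell lists (representations assumed only for illustration, with unchecked offsets); the corresponding initializing stores are analogous.

    -- A model of the vector load forms; not the compiler's representations.
    type Vec a = [a]
    type Arr   = [Int]

    gather :: Arr -> Vec Int -> Vec Int          -- zv = xs[<yv>]
    gather xs ys = [ xs !! y | y <- ys ]

    strided :: Arr -> Int -> Int -> Vec Int      -- zv = xs[<ys:i>]  (derived form)
    strided xs y i = [ xs !! (y + k * i) | k <- [0 .. 7] ]

    loadFixed :: Vec Arr -> Int -> Vec Int       -- zv = <xv>[ys]
    loadFixed xss y = [ xs !! y | xs <- xss ]

    loadPointwise :: Vec Arr -> Vec Int -> Vec Int   -- zv = <xv>[<yv>]
    loadPointwise = zipWith (!!)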

3.4 Basic Blocks

Instructions are grouped into basic blocks consisting of a labeled entry, a sequence of instructions, and a transfer which terminates the block and either transfers control to another block or terminates the program. We assume an infinite supply of distinct program labels, ranged over by the metavariable L, used to label basic blocks; we distinguish distinct labels by numbering them with distinct integers (as L0, L1, etc.).

Basic block headers consist of the label of the block and a list of entry variables which are defined on entry to the block. The entry variables are given their defining values by the transfer which transfers control to the block. Block entry variables may have scalar or vector kind.

Transfers terminate basic blocks, and serve to transfer control within a program. The goto L(xk0, . . . , xkn) transfer shifts control to the block labeled with L. Well-formed programs must have matching arities on transfers and the block headers which they target; that is, the block labeled with L must have entry variables zk0, . . . , zkn. The transfer of control effected by goto L(xk0, . . . , xkn) also defines the entry variables zk0, . . . , zkn to be the values of xk0, . . . , xkn.

The conditional control transfer

    if (cmp) goto L0(xk0, . . . , xkn) else goto L1(yk0, . . . , ykm)

behaves as goto L0(xk0, . . . , xkn) if the comparison holds, and as goto L1(yk0, . . . , ykm) if the comparison does not hold. Note that the targets of the two different arms of the conditional are not required to be distinct.

The halt transfer terminates the program. We do not consider inter-procedural control flow in this core language since it is not relevant to the vectorization optimization—however, it is straightforward to add additional transfers to account for this if desired.

3.5 Programs

Programs in the core language consist of a single control flow graph. A control flow graph consists of a designated entry label L and a set of associated basic blocks, each with a distinct label, one of which (the entry block) is labeled with the entry label L. The entry block must expect no entry variables. Program execution proceeds by starting at the designated entry block and executing block instructions and transfers until a halt instruction is reached (if ever). Well-formed programs must not contain transfers whose target label is not in the control-flow graph, and every block in the control-flow graph must be labeled with a distinct label. As usual with static single-assignment form, variables may only be defined once, must be defined before their first use, and have scope given by the dominator tree of the control-flow graph rooted at the entry label. The style of static single-assignment form used in this core language is similar to that used in compilers such as the MLton compiler [15].


4. Vectorization

The SIMD-vectorization optimization attempts to rewrite a loop to a new loop that executes multiple iterations of the original loop on each iteration using SIMD instructions and registers. While the core of the idea is quite simple, there are a number of issues that must be accounted for in order to preserve the semantics of the original program. In the first part of this section we introduce the key issues using a series of examples. Before doing so, however, we first quickly review some preliminary concepts.

4.1 Loops

We have informally described our vectorization efforts as focusing on loops. Loops tend to account for a high proportion of executed instructions in programs, and as such are natural targets for introducing SIMD-vector code. We define loops using the standard notion of a natural loop, merging loops with shared headers in the usual way to ensure that the nesting structure of loops forms a tree. We do not assume a reducible control-flow graph (since in general the translation of mutually-recursive functions into loops may introduce irreducible control flow). We focus our vectorization efforts more specifically on the innermost loops that make up the leaves of the loop forest of the control-flow graph. For simplicity, we only target bottom-test loops. A separate optimization pass inverts top-test loops to form bottom-test loops, both to enable vectorization and loop-invariant code motion.

4.2 Induction variables

Our notion of induction variable is standard, but we review it in some detail here since induction variables play a critical role in our algorithm. We define the base induction variables of a loop to be the entry variables of the entry block of the loop such that the definition of that variable provided on the loop back edge is produced by adding a constant to the variable itself. The step of a base induction variable is the constant added each time around the loop. The full set of induction variables is the least set of variables satisfying:

• A base induction variable is an induction variable.
• A variable defined by xs = +(ys, zs), where ys is an induction variable and zs is loop invariant, is an induction variable.
• A variable defined by xs = ∗(ys, zs), where ys is an induction variable and zs is a constant (defined by zs = c), is an induction variable.

A full implementation must also deal with the symmetric cases of the last two cases, or canonicalize programs into the above forms, and may also consider operations such as subtraction and negation.

Induction variables can be characterized as affine functions of the iteration count of the loop. We define the characteristic function of an induction variable is as is = s ∗ # + d, where s is a constant (the step of the induction variable), and d is also a constant (the initial value of the induction variable). The symbol # stands for the iteration number: the value of the induction variable is for iteration j can be computed directly from its characteristic function simply by replacing # with j. If the characteristic function of an induction variable is s ∗ # + d then the step of the induction variable is s.

The characteristic function of an induction variable is defined as follows:

• The characteristic function of a base induction variable is s ∗ # + d, where s is the step of the base induction variable as defined above, and d is the constant initial value of the induction variable.
• The characteristic function of an induction variable defined by xs = +(ys, zs) is s ∗ # + d + c, where zs is a constant (defined by zs = c) and s ∗ # + d is the characteristic function of ys.
• The characteristic function of an induction variable defined by xs = ∗(ys, zs) is c ∗ s ∗ # + c ∗ d, where zs is a constant (defined by zs = c) and s ∗ # + d is the characteristic function of ys.

A subtle but important point is that in these computations the compiler must ensure that the overflow semantics of the underlying numeric type are respected, to avoid introducing or eliminating numeric overflow. It is possible to extend the definition of characteristic functions to allow loop invariants to take the place of the constants in the above definition, at the expense of representing the characteristic function symbolically.
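A small Haskell sketch of these rules for the constant-only case (ignoring the overflow caveat just mentioned):

    -- Characteristic function  s * # + d  of an induction variable, kept as
    -- its step s and initial value d.
    data Chi = Chi { chiStep :: Int, chiInit :: Int }

    -- Value of the induction variable on iteration j.
    evalChi :: Chi -> Int -> Int
    evalChi (Chi s d) j = s * j + d

    -- xs = +(ys, zs) with zs the constant c: the step is unchanged.
    chiAdd :: Chi -> Int -> Chi
    chiAdd (Chi s d) c = Chi s (d + c)

    -- xs = *(ys, zs) with zs the constant c: both step and offset are scaled.
    chiMul :: Chi -> Int -> Chi
    chiMul (Chi s d) c = Chi (c * s) (c * d)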

4.3 Vectorization by example

Consider the following simple program, which computes the pointwise sum of two arrays bs and cs, each of length ls:

    L0():
      as = new[ls]
      is0 = 0
      goto L1(is0)

    L1(is):
      xs = bs[is]
      ys = cs[is]
      zs = +(xs, ys)
      as[is] ← zs
      is1 = +(is, 1)
      if (is1 < ls) goto L1(is1)
      else goto Lexit(is1)
                                        (1)

Ignoring the crucial issue of whether or not it is in fact valid to vectorize this loop, there are a number of issues that need to be addressed in order to produce a vector version of this program.

4.3.1 The vector loop

The core of the vectorization optimization is to produce an inner loop, each iteration of which executes multiple iterations of the original loop. For Example 1, this corresponds to generating code to vector-load eight elements of the arrays b and c at appropriate offsets, perform a pointwise vector addition of the loaded elements, and perform a vector write of the eight result elements into the new array a. The comparison which controls the exit test at the bottom of the loop must also be adjusted appropriately to account for the fact that multiple iterations are being performed. Finally, on the exit edge of the loop, the use of the last value of the induction variable is1 must also be accounted for.

To motivate the general treatment of the vectorization optimization, we first consider the individual instructions of Example 1, beginning with the load instruction xs = bs[is]. Executing this instruction during iteration j of the loop loads a single element from the array bs at an offset given by the value of is. In the vector loop, we wish to perform iterations j, j + 1, . . . , j + 7 simultaneously. The vector version of this instruction then must load eight elements from bs at offsets given by the values of is at iterations j, j + 1, . . . , j + 7. A natural first attempt at vectorization might be to simply assume that vector versions of all variables are available, and to then generate the corresponding vector instruction using the vector variables. In this case, if we assume the existence of variables bv and iv containing the values of bs and is for iterations j, . . . , j + 7, then the vector instruction xv = 〈bv〉[〈iv〉] puts into xv the appropriate values of xs for iterations j, . . . , j + 7.

While this approach is correct (and in the most general case is sometimes necessary), a key insight in SIMD vectorization is that it is often possible to do much better by using knowledge of loop invariants and induction variables. In this case, the variable bs is loop invariant, and consequently bv will consist simply of eight copies of the same value. Semantically then, we can replace our use of the general vector load instruction with the simpler gather instruction xv = bs[〈iv〉], which does not require us to produce the vector version of bs and which may also have a more efficient implementation. More importantly, the variable is here is a base induction variable with step 1. It is consequently possible to predict the value of is at each of the next eight iterations: that is, the value of iv will always be 〈is, is + 1, . . . , is + 7〉. The vector load can then be seen to be a load of eight contiguous elements from bs, for which we have provided the derived instruction xv = bs[〈is:〉], which performs a contiguous load of eight elements, as desired. Since this form of load is generally well supported in vector hardware, generating this specialized form in preference to the general form is highly desirable.

The second load instruction can then be vectorized in an exactly analogous fashion, resulting in two vector variables xv and yv. The addition instruction zs = +(xs, ys) can then be computed using vector addition on the vector variables, as zv = 〈+〉(xv, yv). To produce a vector version of the instruction as[is] ← zs, which writes the computed value to the new array, we follow a similar argument to that above for the loads to produce the as[〈is:〉] ← zv instruction, which performs a contiguous write of the values in zv into the array as starting at offset is.

The last instruction of the loop, is1 = +(is, 1), is the induction variable computation, and as such requires special treatment. The result of this instruction is used for three purposes: in the computation of the test to decide whether to proceed with another iteration, to provide a new value for the induction variable in the next iteration, and to provide the value passed out on the exit edge.

For the latter two uses, the required value is easy to see. Each iteration of the vector loop corresponds to eight iterations of the scalar loop, and we require on entry to the loop that the induction variable is contain the value appropriate for the first of the eight iterations. Given the value of is for iteration j then, the back edge of the loop requires the value of is1 for iteration j + 7, which is one plus the value of is on iteration j + 7. Similarly, the exit edge of the loop requires the value of is1 at iteration j + 7. Since the computation of the value of is for each iteration depends on the value of itself in the previous iteration, we might imagine ourselves to be stuck. However, the nature of induction variables as affine transformations means that we can compute the value of is1 at iteration j + 7 directly: each iteration adds 1 to the original value, hence the value of is at iteration j + 7 is given by is + 7, and the value of is1 at iteration j + 7 is is + 7 + 1. Hence we can compute the appropriate value of is1 for both the back and the exit edge as is + 8.

The remaining use of the induction variable is1 is to compute the loop exit test. The key insight here is to observe that it is no longer sufficient to check that there is at least one iteration remaining: in order to execute the vector loop again, we must have at least eight remaining iterations. Upon completion of a vector iteration computing scalar iterations j, . . . , j + 7, the question that must be answered then is whether the scalar loop would always execute iterations j + 8, . . . , j + 15. Because of the monotonic nature of the induction variable computation, this in turn can be reduced to the question of whether or not the value of is1 at iteration j + 15 is less than ls. Since is1 is an induction variable which is incremented by 1 on each iteration, this corresponds to changing the exit test to ask whether is + 15 < ls.
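Expressed as a one-line sketch for Example 1:

    -- For Example 1 (step 1): after a vector iteration whose first scalar
    -- iteration had index value i, another full vector iteration may run only
    -- if the original exit test would still succeed on the last of the next
    -- eight iterations, i.e. when i + 15 < l.
    anotherVectorIteration :: Int -> Int -> Bool
    anotherVectorIteration i l = i + 15 < l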

4.3.2 Entry and exit code

Using the ideas from the previous section, it is straightforward to write a vector version of the core loop. In addition to the issues touched on above, vectorizing this example also requires accounting for the possibility that there may be insufficient iterations to allow use of the vector loop, and also for the possibility that there may be extra iterations left over if the exit test succeeds with fewer than eight but more than zero iterations left. To account for these we keep around the original scalar loop, and use it to perform any iterations that cannot be computed with the vector loop. A preliminary test is inserted before the loop, choosing either to target the scalar loop (if fewer than eight total iterations will be performed) or the vector loop (otherwise). Similarly, after the vector loop exits, a check is done for the presence of extra iterations, which if required are again performed using the scalar loop.

Using these transformations, we arrive at the final desired vectorized program:

    L0():
      as = new[ls]
      is0 = 0
      if (7 < ls) goto L2(is0)
      else goto L1(is0)

    Lcheck():
      if (is3 < ls) goto L1(is3)
      else goto Lexit(is3)

    L2(is2):
      xv = bs[〈is2:〉]
      yv = cs[〈is2:〉]
      zv = 〈+〉(xv, yv)
      as[〈is2:〉] ← zv
      is3 = +(is2, 8)
      is4 = +(is2, 15)
      if (is4 < ls) goto L2(is3)
      else goto Lcheck()

    L1(is):
      xs = bs[is]
      ys = cs[is]
      zs = +(xs, ys)
      as[is] ← zs
      is1 = +(is, 1)
      if (is1 < ls) goto L1(is1)
      else goto Lexit(is1)
                                        (2)

4.4 Automatic vectorization

The reasoning of the previous section leads us to the desired result, but seems completely ad hoc and unsuitable for implementation in a compiler. Fortunately, it is possible to derive a simple compositional transformation which accomplishes the same thing in a general manner. In this section we describe the key ideas behind this transformation, and show that it generates good vector code.

The essential principles that guide the design of the optimization are orthogonality and compositionality. As we will see, while the local transformation of instructions produces specialized code depending on the particulars of the instruction, each instruction is nonetheless translated in a compositional fashion in the sense that the translation does not depend on how the result is used. A consequence of this choice is that the compositional translation may produce numerous extraneous, redundant, or loop-invariant instructions, which could be avoided in a less compositional approach. The principle that we follow is that the elimination of extraneous, redundant, and loop-invariant instructions is an orthogonal issue, which is already addressed by dead-code elimination, common sub-expression elimination, and loop-invariant code motion respectively. So long as the vectorization transformation is defined appropriately, we can safely rely on these optimizations to sweep away the chaff, leaving behind only the essential bits.

The core of the vector transformation works by translating each scalar instruction into a sequence of instructions which compute three separate values: the scalar value, the vector value, and the last value. For a vector loop computing iterations j, . . . , j + 7 of a scalar loop, the scalar value of a variable x is the value computed at iteration j, the last value is the value computed at iteration j + 7, and the vector value is the vector containing the computed values for all of the iterations. While it is clear that there is always a degenerate implementation that simply computes the vector variable and then computes the scalar and last values by projecting out the first and last iteration value, we often compute the scalar and last values separately. This is crucial for a number of reasons: in the first place, we may be able to compute the vector variable more efficiently as a direct function of the scalar value; but more importantly, it will frequently be the case that the vector variable (or similarly the last value) will be unused. It is therefore important not to introduce a data-dependence of the scalar (or last) value on the vector value, lest we keep an expensive-to-compute vector value live unnecessarily.
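Conceptually, the translation associates a triple of values with each original variable; a sketch of that association (with vectors modelled as lists):

    -- The three values maintained for an original variable x during a vector
    -- iteration covering scalar iterations j .. j+7.
    data Translated a = Translated
      { scalarVal :: a     -- value at iteration j       (written fvx in the text)
      , vectorVal :: [a]   -- values at j .. j+7         (written vvx)
      , lastVal   :: a     -- value at iteration j+7     (written lvx)
      }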


For clarity in the translation, we define meta-operators for producing fresh variables from existing scalar variables. For a variable xs, the variable fvxs is taken to be a fresh scalar variable, which by convention will contain the scalar value of xs in the vector loop. Similarly, we take lvxs to be a fresh scalar variable, which by convention will contain the last value of xs; and we take vvxv to be a fresh vector variable, which by convention will contain the vector value of xs.

Induction variables require a small amount of additional mechanism. For every base induction variable is with step s, we say that the affine basis of is is the vector 〈0 ∗ s, 1 ∗ s, . . . , 7 ∗ s〉 (the multiplications here are at the meta-level). Note that for a base induction variable is, adding is to each of the elements of its affine basis gives the values of is for each of the next eight iterations. Other (non-base) induction variables also have affine bases, which are computed by code emitted as part of the transformation described below, starting from the bases of the base induction variables. We use the notation bviv to denote a fresh variable which by convention holds the affine basis for the scalar induction variable is.

It is occasionally necessary to introduce a promoted (or lifted) version of a variable: that is, for a variable xs, to produce a vector variable containing the vector 〈xs, . . . , xs〉. By convention, we use pvxv to denote a fresh variable containing the promoted version of xs.
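A sketch of the affine basis, promotion, and the resulting vector value of a base induction variable (8-wide, with vectors as lists and constant steps):

    -- Affine basis of a base induction variable with step s.
    affineBasis :: Int -> [Int]
    affineBasis s = [ k * s | k <- [0 .. 7] ]

    -- Promotion (lifting): a vector of eight copies of a scalar.
    promote :: a -> [a]
    promote = replicate 8

    -- Vector value of a base induction variable whose current scalar value is
    -- i: add i to each element of its affine basis.
    vectorValue :: Int -> Int -> [Int]
    vectorValue i s = zipWith (+) (promote i) (affineBasis s)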

In order to simplify the algorithm, it is convenient to apply it only to loops for which all variables defined in the loop and used outside of the loop (that is, live out from the loop) are explicitly passed as parameters on the exit edge. This is in no way necessary, but avoids the need to rename variables throughout the rest of the program to preserve the SSA property. This does not restrict the generality of the algorithm since it is always possible to transform loops into this form.

The vectorization algorithm applies to single-block loops of the form

    L(is0, . . . , isn):
      I0
      . . .
      Im
      if (is < ls) goto L(is01, . . . , isn1)
      else goto Lexit(xs0, . . . , xsp)

where the variables is0, . . . , isn are base induction variables, is is an induction variable with a positive step (the cases where the step is zero or negative make no sense), and the variables xs0, . . . , xsp are the live-out variables of the loop. All the instructions I0, . . . , Im must be scalar instructions (and thus define scalar variables), and cannot create new arrays. It is straightforward to support exit tests of the form is ≤ ls as well. In all cases, the variable ls must be defined outside of the loop (and hence be loop invariant).

We assume² the loop has a preheader, that is, a block that ends with goto L(iis0, . . . , iisn) and such that only this block and the loop transfer to L.

As suggested by the ad hoc development from the previous section, the vectorization transformation is required to produce three new pieces of code: an entry test that decides whether there are sufficient iterations to enter the vector loop; the core vector loop itself that performs the SIMD iterations; and a cleanup test that checks to see if there are any iterations remaining after the SIMD loop has finished, which must be processed by the original scalar loop, which remains in the program unchanged. By convention we will take Lvec and Lcheck to be fresh labels for the new vector loop and cleanup test respectively.

² Assuming preheaders does not restrict our algorithm: transforming a program into an equivalent one where all loops have preheaders is a standard technique.

4.4.1 Entry tests

The job of the entry test is to determine if there are sufficient iterations to execute the vector loop, and if not, to go straight to the original scalar loop. It also needs to set up some initial values for use in the vector loop. In particular, any variables used in the loop but not defined in the loop must have scalar, last value, and vector versions for use by the instructions in the vector loop.

First, for each variable ys used in the loop but not defined in it (and not an entry variable), we add the following instructions to the end of the instructions of the preheader:

    fvys = ys
    lvys = ys
    vvyv = 〈ys, . . . , ys〉

Next, we must determine whether the vector loop should be entered at all. The desired property of the entry test is that the vector loop should only be entered if there are at least 8 iterations to be performed. The monotonic nature of induction variables means that this question is equivalent to asking whether at least one more iteration remains after performing the first 7 iterations: that is, we wish to know the result of the exit test of the original loop on the 7th iteration. If the characteristic function for is is s ∗ # + d, then the value of is on the 7th iteration can be obtained by replacing # with 6 and calculating the result statically. The appropriate entry test is then of the form iis < ls, where iis is defined as iis = s ∗ 6 + d.

The value s ∗ 6 + d is a compiler-computed constant, and the compiler can statically determine that this does not overflow. If we generalize the notion of characteristic function to include loop invariants in addition to constants, this test may require generating multiplication and addition instructions, and code must also be emitted to ensure that overflow has not been introduced.
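A sketch of the constant-only case: for Example 1 the exit test uses is1, whose characteristic function has s = 1 and d = 1, so the threshold is 7 and the vector loop is entered only when 7 < ls, matching Example 2.

    -- Entry-test threshold: value of the exit-test induction variable on the
    -- 7th scalar iteration (# replaced by 6), for characteristic function
    -- s * # + d.
    entryThreshold :: Int -> Int -> Int
    entryThreshold s d = s * 6 + d

    -- The vector loop is entered only when the threshold is below the bound.
    enterVectorLoop :: Int -> Int -> Int -> Bool
    enterVectorLoop s d l = entryThreshold s d < l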

4.4.2 Vector loop

To generate code for the main vector loop, we perform three operations. We first generate the last value and vector versions of the base induction variables, using their steps to produce these directly from the scalar values. For each of the original instructions of the loop we then generate instructions to compute the scalar, the vector, and the last value versions of the instruction. Finally, we adjust the exit test.

For each entry variable isj for 0 ≤ j ≤ n, which is a base induction variable of step sj, we produce the vector, basis, lift, and last value versions of the induction variable as follows. The basis and lifted variables bvivj and pvivj are defined as

    bvivj = 〈0 ∗ sj, . . . , 7 ∗ sj〉
    pvivj = 〈fvisj, . . . , fvisj〉

The vector version of isj can then be defined directly as

    vvivj = 〈+〉(bvivj, pvivj)

and the last value lvisj can be defined directly using the step:

    lvisj = +(fvisj, 7 ∗ sj)

This initial step defines vector and last value variables for each base induction variable of the loop, as well as basis and lift variables. The translation of each instruction then proceeds compositionally, with the translated version of each instruction making use of the vector and last value variables produced by the translation of all of the previous instructions. In addition, each induction variable makes use of the previously defined lift and basis variables to produce its own vector, last value, basis, and lift variables directly. The complete definition of this translation is given in Figure 2.

Adjusting the exit test of the loop is again straightforward using the step information. The desired new exit test must check whether is < ls would succeed for at least eight more iterations.


zs = c:
    fvzs = c
    vvzv = 〈zs, . . . , zs〉
    lvzs = fvzs

zs = +(xs, ys)   (if zs and xs are induction variables):
    fvzs = +(fvxs, fvys)
    bvzv = bvxv
    pvzv = 〈fvzs, . . . , fvzs〉
    vvzv = 〈+〉(bvzv, pvzv)
    lvzs = +(lvxs, fvys)

zs = ∗(xs, ys)   (if zs and xs are induction variables):
    fvzs = ∗(fvxs, fvys)
    bvzv = 〈∗〉(bvxv, vvyv)
    pvzv = 〈fvzs, . . . , fvzs〉
    vvzv = 〈+〉(bvzv, pvzv)
    lvzs = ∗(lvxs, fvys)

zs = op(xs0, . . . , xsn)   (if zs is not an induction variable):
    fvzs = op(fvxs0, . . . , fvxsn)
    vvzv = 〈op〉(vvxv0, . . . , vvxvn)
    lvzs = vvzv!7

zs = xs[ys]   (if xs is loop invariant, and ys is an induction variable with step 1):
    fvzs = fvxs[fvys]
    vvzv = fvxs[〈fvys:〉]
    lvzs = fvxs[lvys]

zs = xs[ys]   (if xs is loop invariant, and ys is an induction variable with step i):
    fvzs = fvxs[fvys]
    vvzv = fvxs[〈fvys:i〉]
    lvzs = fvxs[lvys]

zs = xs[ys]   (if xs is loop invariant, and ys is not an induction variable):
    fvzs = fvxs[fvys]
    vvzv = fvxs[〈vvyv〉]
    lvzs = fvxs[lvys]

zs = xs[ys]   (if xs is not loop invariant, and ys is loop invariant):
    fvzs = fvxs[fvys]
    vvzv = 〈vvxv〉[fvys]
    lvzs = lvxs[fvys]

zs = xs[ys]   (if both xs and ys are not loop invariant):
    fvzs = fvxs[fvys]
    vvzv = 〈vvxv〉[〈vvyv〉]
    lvzs = lvxs[lvys]

xs[ys] ← zs   (if xs is loop invariant, and ys is an induction variable with step 1):
    fvxs[〈fvys:〉] ← vvzv

xs[ys] ← zs   (if xs is loop invariant, and ys is an induction variable with step i):
    fvxs[〈fvys:i〉] ← vvzv

xs[ys] ← zs   (if xs is loop invariant, and ys is not an induction variable):
    fvxs[〈vvyv〉] ← vvzv

xs[ys] ← zs   (if xs is not loop invariant, and ys is loop invariant):
    〈vvxv〉[fvys] ← vvzv

xs[ys] ← zs   (if both xs and ys are not loop invariant):
    〈vvxv〉[〈vvyv〉] ← vvzv

Figure 2. Vector Block Instruction Translation

Since i^s is an induction variable, letting s be its step, it increases by s on each iteration. Thus the values of i^s on the next eight iterations will be lv i^s + 0 ∗ s, ..., lv i^s + 7 ∗ s. These will all be less than l^s if the last of them is (recall that s is positive). Thus the desired new exit condition is lv i^s + 7 ∗ s < l^s.

Putting this all together, the vector loop will be as follows, where i′^s is a fresh variable:

Lvec(fv i_0^s, ..., fv i_n^s):
  instructions for base induction variables
  transformed I_0, ..., I_m
  i′^s = +(lv i^s, 7 ∗ s)
  if (i′^s < l^s) goto Lvec(lv i_{01}^s, ..., lv i_{n1}^s)
  else goto Lcheck()

Note that the back edge targets the vector loop, the exit edge targets the cleanup check, and the last values of the base induction variables are passed on the back edge.

4.4.3 On the subject of overflow

For the most part, the vectorization transformation described here is careful to compute only the same values computed in the original program. Generating the entry and exit tests violates this property. As mentioned in Section 4.4.1, the compiler can check statically that overflow does not occur in the computation of the constant for the initial entry test. For the exit test, it is sufficient to check before entry to the vector loop that l^s < MI − 7 ∗ s, where MI is the largest representable integer and s is the step of i^s. For signed types, this check must be adjusted in the obvious way when loop bounds or steps are negative. In all cases, if the checks fail then the original scalar loop is used to perform the calculation using the original code.
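As a concrete reading of this guard, the following Haskell sketch checks the condition for a positive step before committing to the vector loop. The function name and the use of Int's maxBound as MI are illustrative assumptions rather than our compiler's actual primitives, and the signed cases with negative bounds or steps mentioned above are omitted.

  -- Sketch only: pre-entry guard for the adjusted exit test lv i^s + 7*s < l^s.
  -- If the guard fails, the transformed program falls back to the scalar loop.
  vectorLoopSafe :: Int   -- ^ loop bound l
                 -> Int   -- ^ step s of the induction variable (assumed positive)
                 -> Bool
  vectorLoopSafe l s = s > 0 && l < maxIndex - 7 * s
    where
      maxIndex = maxBound :: Int   -- stands in for MI, the largest representable integer

  main :: IO ()
  main = mapM_ print
    [ vectorLoopSafe 1000 1       -- True: safe to enter the vector loop
    , vectorLoopSafe maxBound 1   -- False: the adjusted exit test could overflow
    ]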

4.4.4 Cleanup check

The cleanup check is responsible for determining whether scalar iterations remain to be performed after the vector loop has exited. This can be done simply by performing the original loop exit test: scalar iterations remain to be performed exactly when lv i^s < l^s.


Entry test:

L0():
  a^s = new[l^s]
  i_0^s = 0
  fv a^s = a^s
  vv a^v = 〈a^s, ..., a^s〉
  lv a^s = a^s
  fv b^s = b^s
  vv b^v = 〈b^s, ..., b^s〉
  lv b^s = b^s
  fv c^s = c^s
  vv c^v = 〈c^s, ..., c^s〉
  lv c^s = c^s
  fv l^s = l^s
  vv l^v = 〈l^s, ..., l^s〉
  lv l^s = l^s
  i′_0^s = +(i_0^s, 6 ∗ 1)
  ii^s = +(i′_0^s, 1)
  if (ii^s < l^s) goto L2(i_0^s)
  else goto Lcheck(i_0^s)

Vector loop:

L2(fv i^s):
  bv i^v = 〈0 ∗ 1, ..., 7 ∗ 1〉
  pv i^v = 〈fv i^s, ..., fv i^s〉
  vv i^v = 〈+〉(bv i^v, pv i^v)
  lv i^s = +(fv i^s, 7 ∗ 1)
  fv x^s = fv b^s[fv i^s]
  vv x^v = fv b^s[〈fv i^s:〉]
  lv x^s = fv b^s[lv i^s]
  fv y^s = fv c^s[fv i^s]
  vv y^v = fv c^s[〈fv i^s:〉]
  lv y^s = fv c^s[lv i^s]
  fv z^s = +(fv x^s, fv y^s)
  vv z^v = 〈+〉(vv x^v, vv y^v)
  lv z^s = vv z^v!7
  fv a^s[〈fv i^s:〉] ← vv z^v
  fv i_1^s = +(fv i^s, 1)
  bv i_1^v = bv i^v
  pv i_1^v = 〈fv i_1^s, ..., fv i_1^s〉
  vv i_1^v = 〈+〉(bv i_1^v, pv i_1^v)
  lv i_1^s = +(lv i^s, 1)
  i′_1^s = +(lv i_1^s, 7 ∗ 1)
  if (i′_1^s < l^s) goto L2(lv i_1^s)
  else goto Lcheck()

Cleanup check:

Lcheck():
  i_2^s = +(fv i^s, 7 ∗ 1)
  i_3^s = +(i_2^s, 1)
  if (i_3^s < l^s) goto L1(i_3^s)
  else goto Lexit(i_3^s)

Optimized result:

L0():
  a^s = new[l^s]
  i_0^s = 0
  i′_0^s = +(i_0^s, 6)
  ii^s = +(i′_0^s, 1)
  if (ii^s < l^s) goto L2(i_0^s)
  else goto Lcheck(i_0^s)

L2(fv i^s):
  lv i^s = +(fv i^s, 7)
  vv x^v = b^s[〈fv i^s:〉]
  vv y^v = c^s[〈fv i^s:〉]
  vv z^v = 〈+〉(vv x^v, vv y^v)
  a^s[〈fv i^s:〉] ← vv z^v
  lv i_1^s = +(lv i^s, 1)
  i′_1^s = +(lv i_1^s, 7)
  if (i′_1^s < l^s) goto L2(lv i_1^s)
  else goto Lcheck()

Lcheck():
  i_2^s = +(fv i^s, 7)
  i_3^s = +(i_2^s, 1)
  if (i_3^s < l^s) goto L1(i_3^s)
  else goto Lexit(i_3^s)

Figure 3. Vectorized version of Example 1

Here lv i^s is the last value for i^s computed in the vector loop (since this represents the value of i^s from the last iteration which has been completed). Since the vector loop dominates the cleanup check, all its entry variables and the variables it defines are in scope, so the cleanup check could be formed as follows:

Lcheck():
  if (lv i^s < l^s) goto L(lv i_{01}^s, ..., lv i_{n1}^s)
  else goto Lexit(lv x_0^s, ..., lv x_p^s)

Note that the last values of the base induction variables are passed to the scalar loop and the last values of the live-out variables are passed to the exit label.

However, instead of using the last values computed in the vector loop, it is better to recompute them in order to avoid introducing data dependencies that may keep instructions live in the loop which could otherwise be eliminated. This turns out to be straightforward: the portion of the translation in Figure 2 that computes last values can simply be repeated. This results in a complete calculation of all of the last values, including lv i^s. While most of these instructions will turn out to be unnecessary, a single pass of dead code elimination is sufficient to eliminate them.

4.5 Transformed example

The actual process of producing SIMD vector versions of loops seems at first quite ad hoc and difficult to do cleanly, but it turns out to be amenable to a simple and straightforward translation which systematically produces vector versions of scalar instructions, relying on orthogonal optimizations to clean up unnecessary generated code. To illustrate the results of this process, Figure 3 shows the result of applying the algorithm to Example 1. First note that the only base induction variable is i^s, of step 1, and the only other induction variable is i_1^s, also of step 1. The variables a^s, b^s, c^s, and l^s are used in the loop but not defined by it. Block L0 is the preheader for the loop. First we transform the preheader to define the variables used by but not defined by the loop and to do the entry test. This results in the code shown in Figure 3 under the header “Entry test”. Next we generate the instructions for the base induction variable, the transformed instructions of the loop, and the adjusted exit condition to form the vector loop. The result of this appears under the header “Vector loop”. Finally we form the cleanup check by regenerating the last-value computations and using the original loop-exit condition. For simplicity we show only enough instructions to compute the last value of i_1^s, the only needed last value. This block appears under the header “Cleanup check”.

This code, as expected, is terribly verbose. However, by applying copy propagation followed by dead-code elimination, the extraneous parts disappear, yielding the code shown under the header “Optimized result”. By applying common sub-expression elimination this code can be further improved, eliminating the calculation of i_2^s in favor of lv i^s, and then i_3^s in favor of lv i_1^s. Finally, simplifying the chained arithmetic expressions yields the same vectorized loop as derived by the initial ad hoc vectorization in the previous example.

4.6 Reductions

The SIMD vectorization process defined so far primarily applies to loops which serve to create new arrays. An additional idiom that is important to handle in vectorization is that of loops which serve (in total or in part) to compute a reduction using a binary operator. That is, instead of, or in addition to, initializing a new array, each iteration of the loop accumulates a value into an accumulator parameter as a function of the value from the last iteration. For example, the innermost dot product of a matrix multiply routine generally takes the form of a reduction.

Adding support for reductions to the vectorization transformation is not difficult. In the same manner that we identify certain variables as induction variables and treat them specially, we identify variables that fit the pattern of reductions.


We add additional cases to the vectorization transformation to handle these reduction variables differently. Specifically, we say that a variable x^s is a reduction variable if it is a parameter to the loop that is not an induction variable, and which is defined on the back edge of the loop by the application of an associative binary operator to x^s and some other variable. For example, the loop:

L0():
  x_0^s = 1
  i_0^s = 0
  goto L1(i_0^s, x_0^s)

L1(i^s, x^s):
  y^s = c^s[i^s]
  x_1^s = +(x^s, y^s)
  i_1^s = +(i^s, 1)
  if (i_1^s < l^s) goto L1(i_1^s, x_1^s)
  else goto Lexit(x_1^s)                                        (3)

computes one more than the sum of the elements of the array c^s via a reduction (assuming that l^s is the length of c^s) and passes out the result on the exit edge as x_1^s. We say that x^s and x_1^s are reduction variables in this loop, with initial value 1.

The general strategy for SIMD vectorization of reductions is to pass a vector variable around the loop, each element of which contains a partial reduction. Implicitly, we re-associate the reduction so that element i of the vector variable contains the partial reduction of all of the values of the iterations which are congruent to i modulo the vector width (8 here). After exiting the vector loop, performing a horizontal reduction on the vector variable itself results in the full reduction of all the iterations (albeit performed in a different order, hence the requirement of associativity).

More concretely, to vectorize a reduction variable x^s, we simply promote x^s to a vector variable x^v. In the preheader of the loop, we lift the definition of the initial value for the reduction variable to contain the original initial value in element 0, and the unit for the reduction operation in all other elements. In Example 3 this corresponds to defining the initial value as x_0^v = 〈1, 0, ..., 0〉 (since 0 is the unit for addition). The reduction operation in the loop is lifted to perform the same operation pointwise on vectors, hence x_1^v = 〈+〉(x^v, y^v) for the example. Finally, in the cleanup check block, the lifted reduction variable x_1^v is itself reduced to produce the final scalar result of the portion of the reduction performed by the vector loop (which is then passed in as the initial value to the scalar loop if subsequent iterations remain to be computed). This final reduction may be performed using scalar operations.
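The following Haskell sketch mirrors this strategy for the summation in Example 3, modelling SIMD registers as lists of eight lanes. The helper names and the explicit chunking are illustrative assumptions; the point is only to show the lifted initial value, the lane-wise accumulation in the vector loop, the horizontal reduction, and the scalar cleanup loop for any leftover iterations.

  import Data.List (foldl')

  vectorWidth :: Int
  vectorWidth = 8

  -- Lifted initial value: the original initial value in lane 0 and the unit of
  -- the operation in all other lanes, e.g. 〈1, 0, ..., 0〉 for Example 3.
  liftInit :: a -> a -> [a]
  liftInit x0 unit = x0 : replicate (vectorWidth - 1) unit

  -- One more than the sum of the elements, computed as the vectorized loop would:
  -- lane-wise accumulation over full chunks, a horizontal reduction, and a scalar
  -- cleanup loop over the remaining elements. This re-associates the reduction,
  -- hence the associativity requirement discussed above.
  vectorisedSum :: [Float] -> Float
  vectorisedSum xs = foldl' (+) horizontal rest               -- scalar cleanup loop
    where
      (full, rest) = splitChunks xs
      accVec       = foldl' (zipWith (+)) (liftInit 1 0) full -- vector loop
      horizontal   = foldl' (+) 0 accVec                      -- after the vector loop

      -- Split into complete vector-width chunks plus a (possibly empty) remainder.
      splitChunks ys
        | length chunk == vectorWidth = let (cs, r) = splitChunks rest' in (chunk : cs, r)
        | otherwise                   = ([], chunk)
        where (chunk, rest') = splitAt vectorWidth ys

  main :: IO ()
  main = print (vectorisedSum [1 .. 16])   -- 137.0, i.e. 1 + sum [1..16]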

It is possible to maintain last value and scalar values for reduction variables within the vector loop as well as the vector value, in the same fashion as was done for ordinary operations. However, doing so requires performing a full reduction on the vector value on each iteration, which is potentially expensive enough to make SIMD vectorization a poor idea. Consequently, we simply choose to reject SIMD vectorization of loops which would require the calculation of scalar or last values for reduction variables within the body of the vector loop.

Unfortunately, the associativity requirement for the binary operation used in reductions can be problematic, since most floating point operations (e.g. addition and multiplication) are not associative. In practice, reductions over floating point operators are important idioms. By default, our compiler does not re-associate floating point operations, and hence will refuse to vectorize loops which perform reductions over floating point data. We provide a flag which permits the compiler to re-associate floating point operations only for SIMD vectorization (but not for other optimizations), which allows such loops to be vectorized at the expense of occasionally producing slightly different results from the sequential version.

4.7 Dependence Analysis

Attentive readers may at this point be concerned that we have left aside a critical point in our development. While we have shown how to vectorize a loop, we have said nothing about when (or why) it is valid to do so. The crux of any kind of automatic vectorization lies in determining when it is semantics preserving to execute multiple iterations of a scalar loop in parallel. In the most general case this can be a tremendously hard problem to solve, and there is a vast literature on techniques for showing that particular loops can validly be vectorized (see for example Bik [3] for a good introduction and bibliography). For the particular case of immutable arrays, however, the problem becomes drastically easier. In this section we give a brief overview of why this is the case.

We say that an instruction I2 is dependent on an instruction I1 if I2 must causally follow I1. There are a number of different reasons that this ordering might be required, each inducing a different kind of dependence. Flow dependence (or data dependence) is a standard dataflow dependence induced by a read of the value of a variable or location which was written at an earlier point in the program execution. This is sometimes known as a read-after-write dependence. An anti-dependence, on the other hand, is induced by a write after a read: that is, a location or variable is updated with a new value after an old value has been read, and therefore the write must happen after the read to avoid changing the value seen by the read. Finally, an output dependence is induced by a write after a write: that is, a location or variable is updated with a new value which overwrites an old value written by a previous statement (and hence the writes cannot be re-ordered without affecting the final value of the written-to location or variable).

The key insight for automatic vectorization of immutable arrays is that neither anti-dependences nor output dependences can occur. This is obvious by construction, since the nature of immutability is such that the only writes are initializing writes. Therefore, there cannot be a write after a read (since this would imply a read of uninitialized data), nor can there be a write after a write (since this would imply mutation). The only remaining dependence is flow dependence. Flow dependences which are carried by variable definitions and uses are straightforward to account for, and do not generally cause unusual trouble. In general, however, it is possible for loop-carried flow dependences to block vectorization. For example, consider the following simple program:

L0():
  a^s = new[l^s]
  a^s[0] ← 0
  goto L1(1)

L1(i^s):
  a^s[i^s] ← i^s
  x^s = a^s[i^s − 1]
  if (i^s < l^s) goto L1(i^s + 1)
  else goto Lexit(x^s)

While this program uses only initializing writes, it nonetheless incurs a loop-carried dependence in which a read in one iteration depends on a write in a previous iteration. Clearly then, the problem of dependence analysis is not completely trivial for functional programs, even after eliminating the problems of anti-dependences and output dependences. Note that while this example program reads and writes a^s through the same variable name, there is nothing preventing the binding of a^s to a different variable name, and hence allowing accesses (reads or initializing writes) to the array via an alias.

In practice, however, all programs that we have considered (and indeed, based on the way that immutable arrays are produced, all programs that we are at all likely to consider) have the very useful property that arrays are allocated and initialized in the same lexical scope. The implication of this is that it is usually trivial to conservatively determine that an initializing write in a loop does not alias with any of the reads in the loop, simply by checking that no aliases for the newly allocated array are introduced before the loop.


As a result, a very naive analysis can be quite successful at breaking read-after-write dependences in this style of loop.
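As a sketch of what such a check might look like, the following Haskell fragment works over a toy instruction representation (not our compiler's actual IR) and conservatively reports whether any alias for a freshly allocated array could have been introduced in the preheader before the loop is entered.

  -- Sketch only: a toy preheader instruction form. 'Other' lists the variables
  -- an instruction uses, so any mention of the array counts as a potential alias.
  data Instr
    = New   String          -- x = new[...]
    | Copy  String String   -- x = y
    | Other [String]        -- any other instruction, with the variables it uses

  -- The freshly allocated array is unaliased at loop entry if nothing in the
  -- preheader after its allocation mentions it; in that case the loop's
  -- initializing writes to it cannot alias the loop's reads.
  unaliasedAtLoop :: String -> [Instr] -> Bool
  unaliasedAtLoop arr preheader = all ok afterAlloc
    where
      afterAlloc = drop 1 (dropWhile (not . allocates) preheader)
      allocates (New x) = x == arr
      allocates _       = False
      ok (Copy _ src)   = src /= arr        -- copying the array would create an alias
      ok (Other uses)   = arr `notElem` uses
      ok (New _)        = True

  main :: IO ()
  main = print (unaliasedAtLoop "as" [New "as", Other ["bs", "cs"]])
  -- True: nothing after the allocation of "as" mentions it.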

5. Benchmarks

We show performance results for six benchmarks, measured on an Intel Xeon E5-4650 (Sandy Bridge) processor implementing the AVX SIMD instruction set. Two of the benchmarks are small synthetic benchmarks. The first computes the pointwise sum of two arrays of 32-bit floating point numbers using the following code:

addV v1 v2 = zipWith (+) v1 v2

The second computes the horizontal sum of an array of 32-bit floating point numbers using the following code:

sumV v = foldl (+) 0 v

Both of the kernels are implemented exactly as given, using the operations from our modified Data.Vector.Unboxed library. These code snippets are each run in a benchmark harness which generates test data and repeatedly runs the same kernel a number of times in order to increase the runtime to a reasonable length for benchmarking.
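For readers who want to reproduce this style of measurement, the sketch below uses the standard vector and criterion packages as stand-ins for our modified library and harness; the kernels are as above, while the data sizes and forcing strategy are illustrative assumptions.

  -- Stand-in harness: the measurements in this paper use a modified
  -- Data.Vector.Unboxed and our own harness rather than criterion.
  import Criterion.Main (bench, defaultMain, whnf)
  import qualified Data.Vector.Unboxed as U

  addV :: U.Vector Float -> U.Vector Float -> U.Vector Float
  addV v1 v2 = U.zipWith (+) v1 v2

  sumV :: U.Vector Float -> Float
  sumV v = U.foldl (+) 0 v

  main :: IO ()
  main = do
    let n  = 1000000 :: Int
        v1 = U.generate n fromIntegral :: U.Vector Float
        v2 = U.replicate n 1.0
    defaultMain
      [ bench "addV" (whnf (addV v1) v2)  -- forcing the unboxed result computes every element
      , bench "sumV" (whnf sumV v1)
      ]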

The third benchmark is an n-body simulation kernel which iteratively calculates the forces between n gravitational bodies, represented as 32-bit floating point numbers. Each iteration uses the results from the previous iteration as its initial configuration. The code is written using the REPA array library.

The fourth benchmark is a matrix-matrix multiply routine written using the matrix multiply algorithm from the REPA array library. The matrix elements are double precision (64-bit) floating point numbers. The program generates random arrays, performs a single multiplication, computes a checksum, and writes the result out to a file. The portion of the program that is timed and reported here covers only the matrix multiplication and the checksum computation. The measurements presented here are for 700 by 700 element arrays.

The fifth benchmark is a two-dimensional five-by-five stencil convolution written using the REPA stencil libraries [8]. The convolution is performed 4000 consecutive times in a loop.

The final benchmark is the “blur” image processing example included with the REPA distribution. The benchmark was modified slightly to match our benchmark harness requirements, with timing code inserted around the kernel of the computation to avoid including file input/output times in the measurements. No changes were made to the code implementing the algorithm itself. The timings presented measure the result of performing twenty successive blurrings of a 600x512 pixel image.

The performance of these six benchmarks is given in Figure 4. We show runtime normalized to our standard scalar compiler without vectorization, since the primary goal is to demonstrate the speedups obtained from the automatic SIMD vectorization optimization. We also show the performance of unmodified GHC 7.6.1 using both the standard back end and the new LLVM backend [13]. This is intended primarily to demonstrate that the scalar code from which we begin is suitably optimized. Both the unmodified and our own modified version of GHC were run with the optimization flags “-O2” and “-msse2”. Various combinations of the flags “-fno-liberate-case”, “-funfolding-use-threshold”, and “-funfolding-keeness-factor” were also used, following the documentation of the REPA libraries and based on some limited experimentation as to which arguments gave better performance. In all cases both our scalar and SIMD compilers preserve floating point value semantics and require the backend C compiler to do the same. The only exception is that the flag discussed in Section 4 to permit the vectorization of floating point reductions was used. For some benchmarks, this caused small perturbations in the results.

For the vector-vector addition benchmark, our baseline compiler gives large speedups over both the standard GHC backend and the LLVM backend. Unfortunately, this benchmark is poorly structured for measuring floating point performance. Performance analysis shows that the run time is almost entirely dominated by memory costs associated with initializing the newly allocated arrays. Each iteration of the vector-vector addition allocates a new result array which is not in cache, and the inner loop does not do enough work to hide the latency of the memory accesses. The vector sum benchmark, on the other hand, performs a reduction over the input vector, producing only a scalar result. Consequently, memory latency issues are not critical since the input vectors are in cache, and the SIMD vectorized version of the code runs in half the time of our baseline compiler.

The n-body simulation benchmark gives the best results for SIMD vectorization, since the core of the inner loop does substantially more floating point work on each iteration. While our baseline compiler gives only slightly better performance than the LLVM backend, the SIMD vectorized version takes only 23% of the runtime of the scalar version.

The matrix multiplication routine also shows good benefits from SIMD vectorization, running in 64% of the time of the scalar version (recall that this benchmark uses double precision floating point).

The convolution and blur benchmarks also get decent speedups from vectorization. GHC (or the REPA libraries) seems to be arranging for the inner loop to be executed in an unrolled fashion. Unfortunately, this means that the SIMD vector version cannot generate the unit-stride loads supported by the native architecture and must emulate them. The resulting scalar loads and vector shuffles limit the benefit of vectorization somewhat. Nonetheless, we are still able to get good speedups over the scalar code.

Of particular importance to note, with respect to all of the benchmarks but most especially for the last two, is that the timings presented here are not for the vectorized loop alone, but for the entire kernel of the program (essentially everything except the input (or input generation) and output).

6. Related and Future Work, Conclusions

There is a vast and rich literature on automatic SIMD vectorization in compilers. Bik [3] provides an excellent overview of the essential ideas and techniques in a modern setting. Many standard compiler textbooks also provide at least a rudimentary introduction to SIMD vectorization. The NESL programming language [4] pioneered the idea of writing nested data parallel programs in a functional language, an idea which has more recently been picked up by the Data Parallel Haskell project [11]. A related project focusing on regular arrays [6] (REPA) has had good success in getting array computations to scale on multi-processor machines. Programs written using the REPA libraries are promising targets for SIMD vectorization, either automatic as in this work, or via explicit language constructs in the library implementation. Mainland et al. [9] have pursued this latter approach. They implement primitives in GHC that correspond to SIMD instructions and modify the stream fusion used in the Data.Vector library to generate these SIMD primitives. In this way, they achieve SIMD vectorization in the library, where we achieve SIMD vectorization in the compiler. These are two different, possibly complementary, ways to achieve the same ends.

Our initial implementation of SIMD vectorization targets only simple loops with no conditional tests. An obvious extension to this work would be to incorporate support for predicated instructions to allow more loops to be SIMD vectorized. However, hardware support for this is somewhat limited in current processors, and the performance effects of generating predicated code are harder to predict. A more limited effort to target the special case of conditionals performing array bounds checks seems like a promising area for investigation.


Run time normalized to Intel scalar (Intel scalar = 1):

Benchmark       | GHC         | GHC LLVM    | Intel SIMD
Matrix Multiply | 3.103658537 | 1.024390244 | 0.643292683
NBody           | 2.785547909 | 1.041362273 | 0.227633669
Vector Add      | 9.374823581 | 9.236022147 | 0.847464988
Vector Sum      | 1.460823682 | 1.460325889 | 0.471443268
Convolution     | 4.139505686 | 1.915044305 | 0.707604645
Blur            | 3.332035054 | 1.871470302 | 0.630963973

Figure 4. Benchmark Performance

As functional languages tend to allocate a lot, supporting allocation of heap objects inside vectorized loops could be important. A vector allocation instruction allocates 8 objects and places pointers to them in a vector variable. Implementing this operation for fixed-size objects is straightforward and should be quick: a single limit check, vector instructions to compute the pointers, and a scatter to initialize the object meta-data used by the garbage collector. Implementing vectorized allocation of variable-length arrays should also be straightforward. A special-case instruction for when the length is loop invariant would likely have a faster implementation.

To summarize, we have utilized the naturally data-parallel operations on immutable arrays that are typical of functional languages (maps, folds, zips, and so on), utilized the large body of work on eliminating intermediate arrays, and utilized our own internal operations on immutable arrays to produce loops amenable to SIMD vectorization. Implementing that SIMD vectorization automatically is a surprisingly straightforward process. A key part of this simplicity is the use of immutable objects and operations on them, and in particular our use of initializing writes. As performance from using SIMD instruction sets is important on modern architectures, our techniques are important and low-hanging fruit for functional-language implementors.

References

[1] T. Anderson, N. Glew, P. Guo, B. T. Lewis, W. Liu, Z. Liu, L. Petersen, M. Rajagopalan, J. M. Stichnoth, G. Wu, and D. Zhang. Pillar: A parallel implementation language. In V. Adve, M. J. Garzaran, and P. Petersen, editors, Languages and Compilers for Parallel Computing, pages 141–155. Springer-Verlag, Berlin, Heidelberg, 2008. ISBN 978-3-540-85260-5.

[2] A. Appel and T. Jim. Shrinking lambda expressions in linear time. Journal of Functional Programming, 7(5), Sept. 1997.

[3] A. J. C. Bik. The Software Vectorization Handbook. Intel Press, 2004. ISBN 0974364924.

[4] G. E. Blelloch. Programming parallel algorithms. Communications of the ACM, 39(3):85–97, Mar. 1996. ISSN 0001-0782.

[5] M. Fluet and S. Weeks. Contification using dominators. In Proceedings of the Sixth ACM SIGPLAN International Conference on Functional Programming, ICFP '01, pages 2–13, Florence, Italy, 2001. ACM. ISBN 1-58113-415-0.

[6] G. Keller, M. M. Chakravarty, R. Leshchinskiy, S. Peyton Jones, and B. Lippmeier. Regular, shape-polymorphic, parallel arrays in Haskell. In Proceedings of the 15th ACM SIGPLAN International Conference on Functional Programming, ICFP '10, pages 261–272, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-794-3.

[7] K. Kennedy, C. Koelbel, and H. Zima. The rise and fall of High Performance Fortran: an historical object lesson. In Proceedings of the Third ACM SIGPLAN Conference on History of Programming Languages, pages 1–22. ACM, 2007.

[8] B. Lippmeier and G. Keller. Efficient parallel stencil convolution in Haskell. In Proceedings of the 4th ACM Symposium on Haskell, Haskell '11, pages 59–70, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0860-1.

[9] G. Mainland, R. Leshchinskiy, S. P. Jones, and S. Marlow. Haskell beats C using generalized stream fusion. Available at http://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-beats-C.pdf; unpublished, 2013.

[10] L. Petersen and N. Glew. GC-safe interprocedural unboxing. In Proceedings of the 21st International Conference on Compiler Construction, CC '12, pages 165–184, Tallinn, Estonia, 2012. Springer-Verlag. ISBN 978-3-642-28651-3.

[11] S. Peyton Jones. Harnessing the multicores: Nested data parallelism in Haskell. In Proceedings of the 6th Asian Symposium on Programming Languages and Systems, APLAS '08, pages 138–138, Bangalore, India, 2008. Springer-Verlag. ISBN 978-3-540-89329-5.

[12] W. Pugh. A practical algorithm for exact array dependence analysis. Communications of the ACM, 35(8):102–114, 1992.

[13] D. A. Terei and M. M. Chakravarty. An LLVM backend for GHC. In ACM SIGPLAN Notices, volume 45, pages 109–120. ACM, 2010.

[14] N. Vasilache, C. Bastoul, A. Cohen, and S. Girbal. Violated dependence analysis. In Proceedings of the 20th Annual International Conference on Supercomputing, pages 335–344. ACM, 2006.

[15] S. Weeks. Whole-program compilation in MLton. In Proceedings of the 2006 Workshop on ML, ML '06, pages 1–1, Portland, Oregon, USA, 2006. ACM. ISBN 1-59593-483-9.
