
MultiLoop: Efficient Software Pipelining for Modern Hardware

Christopher Kumar Anand Wolfram Kahl

McMaster University, 1280 Main St. West, Hamilton, Ontario Canada L8S 4K1

May 8, 2007 – Draft for Comment –

Abstract

This paper is motivated by trends in processor models of which the Cell BE is an exemplar, and by the need to reliably apply multi-level code optimizations in safety-critical code to achieve high performance and small code size.

A MultiLoop is a loop specification construct designed to expose in a structured way details of instruction scheduling needed for performance-enhancing transformations. We show by example how it may be used to make better use of underlying hardware features, including software branch prediction and SIMD instructions. In each case, the use of MultiLoop transformations allows us to take full advantage of software branch prediction to completely eliminate branch misses in our scheduled code, and to reduce the cost of loop overhead by using SIMD vector instructions.

Given the novelty of our representation, it is important to demonstrate feasibility (of zero branch misses) and evaluate performance (of transformations) on a wide set of representative examples from numerical computation. We include simple loops, nested loops, sequentially-composed loops, and loops containing control flow. In some cases we obtain significant benefits: halving execution time, and halving code size. As many details as possible are provided for other compiler writers wishing to adopt our innovative transformations, including instruction selection for SIMD-aware control flow.

Copyright © 2007 Christopher Kumar Anand and Wolfram Kahl.

1 Introduction

In the context of the move to multiple light-weight cores with increasingly powerful SIMD support, we have re-examined the way control flow, and particularly loops, are specified, and present here a more general construction called a MultiLoop. The idea is to provide a vocabulary for high-level programmers to provide hints to instruction scheduling.

We analyse four representative examples of numerical calculations relative to the Cell Broadband Engine's Synergistic Processing Units (SPUs) [8], and find that the types of changes in instruction selection and scheduling enabled by the more general specification make a large difference in terms of execution efficiency and code size. Since the MultiLoop contains normal loops as a special case, adopting it only adds implementation possibilities. The MultiLoop is aimed at medium-level code transformations which imply constraints on instruction schedules, but not at low-level scheduling itself.

Yet we have found significant potential for increased performance by rethinking standard patterns of code generation. Specifically, we address the following features which are not standardized across architectures:

1. software branch prediction,

2. SIMD parallelism,

3. two-way dispatch executing “arithmetic” and “data movement” in parallel.

In our examples, all branches can be correctly hinted so that branch misses do not occur, and loop overhead can be largely shifted from the bottlenecked arithmetic pipeline to the non-arithmetic pipeline.


SPU Pipelines   To simplify the discussion for readers who are not familiar with the SPU ISA and programming model, we refer to the two execution pipelines as “arithmetic” and “non-arithmetic”, because this is a good approximation of the division of instructions handled by the two pipelines, and this division is very natural across processor families. Specific measurable claims are always made with respect to the division of execution as defined in the programming model; see [2] for details of that model. In the literature, the arithmetic and non-arithmetic SPU pipelines are called the even and odd pipelines, as a result of alignment requirements for maximum instruction dispatch.

Pipelined Loops   Since the point of defining new control flow structures is to achieve higher levels of performance, we recall the best conventional approach to efficient loop implementation: software-pipelined loops. Figure 1 shows a simple loop structure with three blocks such that the middle block depends on outputs of the top block, and the bottom block depends only on outputs of the first block.

Figure 1: Left: a simple loop as it would appear in a high-level language, with three stages executed in sequence. Middle: a software-pipelined version of the loop, with stages executed in parallel, plus prologue and epilogue. Right: a version without prologue and epilogue, using additional stages.

The left is the usual view in a high-level language, and the other two are software-pipelined versions. Horizontal juxtaposition of blocks means that the blocks are scheduled in parallel (interleaved on a processor with finite superscalar resources); the labels are the logical iteration numbers, which we define to be the iteration number of the loop in the standard (left) presentation.

In the version including loop prologue and epilogue (middle), the loop is equivalent to the original in the sense that the same instructions are executed (although in different orders) on the same data.

Without loop prologue and epilogue (right), we could get the same performance with one third the code, but we do not get the same execution. Using MultiLoop generators, we can meet looser specifications without significant additional cost.
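To make the prologue/kernel/epilogue structure concrete, the following plain-C sketch (our own illustration, not code from the paper) runs a three-stage loop first in its natural form and then in a software-pipelined form in which stages from three consecutive logical iterations are interleaved; the stage bodies are placeholders, and on a real machine the interleaved stages would be scheduled in parallel rather than called in sequence.

/* Illustrative sketch of 3-stage software pipelining in plain C.
 * stage1/stage2/stage3 are placeholder stages. */
#include <stdio.h>

#define N 8

static int stage1(int i)        { return i * 2; }      /* produces a   */
static int stage2(int a)        { return a + 1; }      /* consumes a   */
static int stage3(int a, int b) { return a + b; }      /* consumes a,b */

int main(void) {
    int out_plain[N], out_piped[N];

    /* natural form: all three stages of iteration i run back to back  */
    for (int i = 0; i < N; ++i) {
        int a = stage1(i);
        int b = stage2(a);
        out_plain[i] = stage3(a, b);
    }

    /* software-pipelined form: the prologue fills the pipeline, the
     * kernel overlaps stage3(i), stage2(i+1), stage1(i+2), and the
     * epilogue drains it; a[] and b[] carry values between stages.    */
    int a[N], b[N];
    a[0] = stage1(0);                        /* prologue               */
    b[0] = stage2(a[0]);
    a[1] = stage1(1);
    for (int i = 0; i + 2 < N; ++i) {        /* kernel: 3 logical iters */
        out_piped[i] = stage3(a[i], b[i]);
        b[i + 1]     = stage2(a[i + 1]);
        a[i + 2]     = stage1(i + 2);
    }
    out_piped[N - 2] = stage3(a[N - 2], b[N - 2]);   /* epilogue       */
    b[N - 1]         = stage2(a[N - 1]);
    out_piped[N - 1] = stage3(a[N - 1], b[N - 1]);

    for (int i = 0; i < N; ++i)
        printf("%d %d\n", out_plain[i], out_piped[i]);
    return 0;
}

The right-hand variant of Fig. 1 corresponds to dropping the prologue and epilogue code above and letting extra kernel iterations absorb their work.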

Contribution:   In this paper we present a class of MultiLoop control flow patterns, show how several typical numerical tasks can be formulated as MultiLoops, and introduce two MultiLoop transformation schemes that produce efficiently schedulable code. Both transformations generalize software pipelining as presented above.

Four examples demonstrate the applicability of these transformations, and give some hint as to the applications which will benefit the most. In particular, the Map example shows significant improvement, and we present the loop overhead in enough detail for other compiler writers to implement as part of instruction selection.

Overview:   In the next section, we review the aspects of the SPU instruction set relevant for this paper. Then we define and discuss the MultiLoop control structure and its transformations in Sect. 3, present examples in Sect. 4, and discuss related work in Sect. 5.

2 SIMD and the SPU ISA

Single Instruction Multiple Data (SIMD) instructions operate on more than one data element in parallel. They were introduced into commodity processors to accelerate multi-media synthesis and processing. These instruction set architectures (ISAs) are diverse in structure and implementation, with VMX/Altivec [5] on Power and SSE [12] on x86 being the best-known.

Data Flow   The SPU ISA [6] uses 128-bit operands, in common with VMX and SSE. It contains a rich set of operations formed by dividing the 128-bit operands into 8-, 16-, 32- or 64-bit quantities and performing the usual scalar operations independently on each. See Fig. 2 for an example 32-bit add instruction operating on four elements.


Figure 2: a 32-bit add operating on two 128-bit wide operands.

This results in a useful level of parallelism, but introduces alignment issues in data. As a result, all SIMD instruction sets have some instructions to rearrange data. Two approaches are possible: a large set of instructions with specific functions (e.g. unpacking pixel data into vectors by component), or a small set of software-controllable instructions. All ISAs follow a middle path, with genericity increasing from SSE to VMX to the SPU ISA. The instructions of most use in synthesizing loop overhead are the byte permute instruction shufb (analogous to VMX's vperm), shown in Fig. 3, and quadword rotate instructions, including rotqbii, which rotates the whole 128-bit vector up to 7 bits (immediate argument) left; this is the preferred register move instruction.

As shown in Fig. 3, SPU byte permutation can be used to move 32-bit components from one slot to another (useful for transposing single-precision floating point matrices, for example), or to duplicate bytes. It can rotate bytes through cycles, which can be used to count through loop induction variables when the loop sizes are known at compile time.


Figure 3: shufb byte permutation taking two source operands (coloured) and an operand of byte indices.

Byte and bit rotate instructions take both immediate counts and counts from operand registers.
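The behaviour of these two instruction families can be sketched in portable C; the following simulation (our own, not SPU intrinsic code) models a four-way 32-bit add as in Fig. 2 and a shufb-style byte permutation, ignoring the special selector values of the real instruction.

/* Behavioural sketch of SIMD word add and shufb-style byte permutation
 * on simulated 128-bit quadwords (illustrative only). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint8_t b[16]; } qw;       /* simulated quadword */

/* four independent 32-bit adds, as in Fig. 2 */
static qw add_w(qw x, qw y) {
    qw r;
    for (int i = 0; i < 4; ++i) {
        uint32_t a, b, c;
        memcpy(&a, &x.b[4 * i], 4);
        memcpy(&b, &y.b[4 * i], 4);
        c = a + b;
        memcpy(&r.b[4 * i], &c, 4);
    }
    return r;
}

/* shufb-style permute: each pattern byte selects a byte from the
 * 32-byte concatenation of the two sources (indices 0x00-0x1f);
 * the real instruction's special control values are omitted here. */
static qw shufb_sim(qw x, qw y, qw pat) {
    qw r;
    for (int i = 0; i < 16; ++i) {
        uint8_t k = pat.b[i] & 0x1f;
        r.b[i] = (k < 16) ? x.b[k] : y.b[k - 16];
    }
    return r;
}

int main(void) {
    qw x, y, pat;
    for (int i = 0; i < 16; ++i) { x.b[i] = (uint8_t)i; y.b[i] = (uint8_t)(16 + i); }
    for (int i = 0; i < 16; ++i) pat.b[i] = (uint8_t)(15 - i);  /* reverse bytes of x */

    qw s = add_w(x, y);
    qw p = shufb_sim(x, y, pat);
    printf("first byte of sum: %u, first byte of permute: %u\n",
           (unsigned)s.b[0], (unsigned)p.b[0]);
    return 0;
}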

Storage Model   As we move to higher levels of parallelism on a chip, which Cell is pioneering, the overhead associated with a shared memory space (page tables, cache coherency, etc.) will increase superlinearly. To avoid this, the Cell processor's SPU compute engines have their own local stores (LSs), and use DMA to transfer data to and from system memory. This has the positive effect that access to local memory can be as predictable as accesses which hit in L1 cache on a normal processor. Unfortunately, it means that all code required for a particular task must be copied to the LS.

This puts a premium on achieving small code sizes, and changes the assumptions we can make about loops. Since input and output data on such light-weight engines will have to be transferred to and from local storage in digestible chunks, many more loop lengths will be knowable at compile time, especially if code is compiled just-in-time before distribution to multiple light-weight cores. The SPU's Local Stores are direct-mapped with wrap-around addressing, and have no memory protection. As a result loads cannot raise segmentation faults, which makes it easier to replace clean-up code, prologues and epilogues with extra loop iterations, thus reducing code size.

Control Flow   To produce efficient loop implementations at the single SPU level, we exploit branch hinting, which does not affect the results of execution, only timing.


The hbr instruction takes as an immediate argument the location of a branch, and (as a register argument) the branch target.

Branch miss penalties have grown with increasing hardware pipelining within microprocessors. Hardware branch prediction requires additional circuits not required by the architected instructions. It is a natural feature to omit from compact processing elements designed for heterogeneous parallel systems on a chip, like Cell.

3 MultiLoop Code Graphs

For hand-coded assembly language kernels, experts use certain patterns of efficient control flow. It is a meta-goal of the Coconut project to capture such patterns and abstract them into either code graph transformations at the level of our intermediate language, or into translators from higher-level structures into assembly language.

MultiLoops make it easy to express such patterns. They are the target of transformations from operations normally represented by nested or sequentially composed loops, and allow the folding of multiple branches into single branches. On the SPU, only a single branch hint is active at a time, so a successful strategy for software branch prediction can only work if branches are sparse.

Intermediate Language   All intermediate structures take the form of labelled hypergraphs and nested hypergraphs. We do not use anything similar to a register transfer language.

Straight-line computation is expressed in code graphs [7]; these are data-flow hypergraphs with nodes corresponding to register values or generalised state component values, and with hyper-edges labelled by assembly instructions; these hyper-edges can have multiple inputs and outputs according to the arguments they use and the results they produce (see Fig. 13 in Sect. 4.2 for an example).

Our code graphs are strictly more expressive than the data dependency graphs used in compilation algorithms based on Static Single Assignment (SSA), because state information is explicitly represented as nodes separate from data.

These data-flow-level code graphs are used as edge labels for non-branching edges in nested hypergraphs where the outer level represents control flow:

Definition 3.1   A control-flow arrangement (CFA) is a directed hypergraph with two kinds of hyper-edges, distinguished by their labels:

• A code graph edge, or stage edge, is labelled with a code graph, and has exactly one input node and one output node, labelled according to the input and output types of the code graph.

• A control flow edge is labelled with a control-flow assembly instruction that cannot occur at the data-flow level.

Besides control-flow edges, another aspect of control flow is the existence of control-flow join nodes, i.e., nodes that are output nodes of more than one edge.

Certain properties of CFAs are important:

Definition 3.2   A control-flow arrangement is called

• deterministic iff each node has only one outgoing edge;

• concurrent iff some control flow edge has more than one input node (e.g., synchronisation primitives);

• sequential iff all control edges have exactly one input node;

• tight iff each node immediately adjacent to at least two code graph edges is a control-flow join node.

Control flow edges with more than one output node are also called branching edges.

For this paper, we are only interested in a restricted kind of CFA; we chose the name since the original motivation was that of a generalised deterministic non-concurrent loop specification that essentially can be seen as implementing a “loop with multiple bodies”:

Definition 3.3   A MultiLoop is a deterministic sequential CFA.
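As an informal illustration of the “loop with multiple bodies” reading of this definition, the following plain-C sketch (ours, with placeholder bodies and state) executes one body version per pass and lets each version compute which version runs next, which is roughly the role the computed branch targets of the bi edges play in the figures below.

/* Illustrative "loop with multiple bodies": each pass runs one body
 * version, and the body itself selects the next version (or exit),
 * mirroring a deterministic sequential CFA.  Versions and state are
 * placeholders, not taken from the paper. */
#include <stdio.h>

enum version { FIRST, MIDDLE, LAST, EXIT };

int main(void) {
    int acc = 0, k = 0, trip = 5;           /* placeholder loop state     */
    enum version v = FIRST;

    while (v != EXIT) {
        switch (v) {                         /* one body version per pass  */
        case FIRST:  acc = 0;  v = MIDDLE;                          break;
        case MIDDLE: acc += k; v = (++k < trip) ? MIDDLE : LAST;    break;
        case LAST:   acc *= 2; v = EXIT;                            break;
        default:               v = EXIT;                            break;
        }
    }
    printf("acc = %d\n", acc);
    return 0;
}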

For example, Fig. 4 (Left) shows a MultiLoop for the loop pipelining shown in the introduction, in Fig. 1.



Figure 4: Left: MultiLoop for three-staged loop pipelining. Right: Tight MultiLoop of a simple loop.

This CFA is not tight since it contains the sequence S1, S2, S3 of code graph edges (the stages) without intervening control flow joins; this sequence is semantically equivalent to a single code graph node containing the sequential composition Body = S1; S2; S3 of the three stage graphs. Such a non-tight CFA can be tightened by replacing such sequences with the results of the corresponding sequential composition; in this case, the corresponding tight MultiLoop has a single Body edge inside the loop, shown in Fig. 4 (Right). In both cases, the node below the initCG code graph edge is a control-flow join node, and the bi hyperedge is a branching edge.

From a scheduling point of view, the goal of intermediate code generation is to obtain a tight MultiLoop where typically there is a branch after every code graph edge, each code graph can be densely scheduled, and each code graph is sufficiently deep to hint its branch in time.

However, just tightening a non-tight MultiLoop does not normally achieve this goal.

Some function bodies are designed or generated directly as non-tight MultiLoops, for example when a common pattern underlies the generated code. Other function bodies are generated first as tight MultiLoops, and then “arranged” into MultiLoops with longer stage sequences, for example using the heuristic for loop pipelining stage separation described in [17]. There are also cases where a function body is first generated as a non-tight MultiLoop, and then re-arranged into a different non-tight MultiLoop, with additional constraints imposed by the original staging. An example of this will be explained in Sect. 4.2; we show the original non-tight version in Fig. 5 on the left, and on the right the re-arranged version where midCG has been split as the sequential composition midCG = m1; m2, and the two halves have been integrated into the adjacent stages, yielding Top1CG = top1CG; m1 and Bot12CG = m2; bot12CG, etc.


Figure 5: Left: Source MultiLoop for matrix multiplication. Right: Re-arranged version.

The scheduling objectives are then achieved via MultiLoop transformations, and in Sect. 4 we show example applications for the two high-level transformations described in the following.


MultiLoop Pipelining:   A loop body software-pipelined into n stages processes data from n consecutive logical iterations in parallel. This approach can also be used for more complex MultiLoops. While in the case of simple loops, as in Fig. 4 (Left), those n data sets are always processed by the same set of the n stages of the loop body, in a more general MultiLoop, different combinations of n stage graphs will be needed to process data from n consecutive logical iterations.

Although for arbitrary code graphs the code size could multiply by a factor exponential in n, if this is done for a highly regular MultiLoop, with a small set of equal stage graphs occurring with the same periodicity in different sequences, then performing this pipelining transformation followed by a minimization step can lead to tight MultiLoops with no or small code size increase.

The examples presented in Sections 4.1 and 4.4 rely on the MultiLoop pipelining transformation.

MultiLoop Re-Sequencing:   The initial and final parts of many loop bodies are hard to schedule densely, since there are frequently too many dependencies in these regions. Since not all of these dependencies carry across the loop boundary, however, moving the boundaries improves efficiency.

As a MultiLoop transformation, Re-Sequencing combines sequences of stage edges with intervening branch nodes, producing a separate sequence for each path such intervening branches can take. Conceptually, this moves all branches to the beginning of the sequence, choosing “prophetically” the right sequence. Therefore, an appropriate transformation of the branch logic is necessary; this is particularly easy in counted loops with counts known at compile time. For the re-arranged MultiLoop from Fig. 5 (Right) we show the immediate (machine-generated) result of this re-sequencing transformation in Fig. 6. The one-element sequenced code graph edges only serve to connect start and end nodes of the MultiLoop; each of the five two-element sequenced code graph edges results from a different way of re-sequencing the branches with one stage before the branch and one after.


Figure 6: Raw re-sequenced MultiLoop for matrix multiplication.

Because of the regularity of the input MultiLoop in Fig. 5 (Right), only two different two-sequence code graph edges remain after standard control flow minimisation together with a slight rearrangement of the branch logic (see Fig. 7).


Figure 7: Minimised re-sequenced MultiLoop for matrix multiplication.

In Sect. 4.2 we explain how the “prologue” and “epilogue” edges can be eliminated in this case; similar approaches can be made to work in other cases, so that for highly regular MultiLoops, the re-sequencing transformation also results in little or no code size increase.

The example presented in Sect. 4.3 also relies on the MultiLoop re-sequencing transformation.

Programmer Interface   The MultiLoop allows very general control flow. It is a tool for formal specification and verification of control-flow transformations, and we have also found it very useful in reasoning informally about proposed performance-enhancing transformations. It is not intended as an everyday programming language, and we envision a short list of interfaces for generating MultiLoops with specific structures. All of the examples in this paper can be understood in terms of loops with conditional execution, which can be presented as loops with versioned loop bodies. The programmer separately specifies declarative components, specifies their composition into multiple loop bodies, and puts constraints on staging as necessary (see Fig. 8).


Figure 8: Four-way MultiLoop bodies.

Well-structured conditional execution can be refactored into this form. The final component in each loop version must contain a distinguished control output which contains the branch address for the next loop version.

In this way, application specialists can optimize the loop overhead for common high-level patterns, including transformations which depend on the staging. Four such examples are given in Sect. 4. In a conventional tool chain, this would not be possible because control-flow transformations are performed on the intermediate language(s) without exposing important pieces to the high-level language programmer.

4 Examples

In many application domains, execution efficiency is largely determined by loop scheduling. Loops can be classified by

• whether the number and pattern of iterations is known at compile time or run time;

• whether the number and pattern of iterations is data dependent (i.e., whether it can be calculated from values in registers at the start of the loop, or depends on values loaded through the course of the loop, or calculated iteratively);

• whether the loop is nested or not nested;

• whether the loop is isolated, or sequentially composed with similar loops.

The examples cover these cases, but not in all combinations. Data dependent loops are more complicated in general, and only the resampling example covers this case.

4.1 Mapping over Arrays

Mapping a function over a structure (list, set, tree) is a common pattern in all programs. With common memory organizations, maps over arrays (contiguously stored lists) are the most efficient. Some hardware vendors provide optimized libraries of mathematical or graphics functions mapped over arrays.

We have found that in this case, a MultiLoop specification can shift most of the loop overhead from the arithmetic pipeline to the non-arithmetic pipeline. For mathematical functions, computation is almost always limited by arithmetic in the arithmetic pipeline, so this results in a measurable speedup. For a standard math library, the resource lower bound using this formulation will drop by two to ten percent, based on the implementations reported in [17, Tables 7.1 and 7.2].


Since this is our smallest example, we will go through all of the loop overhead, showing how a single arithmetic instruction is sufficient to move two pointers, update a counter and calculate the branch address for the loop. We will show it for the case of mapping a function, f :: RegVal → RegVal, with one input and one output. There is one unused word in our control quadword, so this would work just as well for three-pointer maps (one input and two outputs, or two inputs and one output).¹

In Fig. 9, we show the four instructions of loop overhead for such loops. The control quadword (1) contains the input and output pointers and the counter (and one unused 32-bit word); a single add instruction (2) updates all three active elements by the increments in the register constant (3). Single-word add is an arithmetic instruction and slants to the right. All other instructions are non-arithmetic and slant to the left.


Figure 9: Overhead for a map loop.

All non-immediate load/store instructions take the address from the first word in a register quadword, so the stores (4) use the control word as an argument. The loads (5) need their argument (6) rotated (7) by two words, i.e., 8 bytes. The branch address (8) used by the branch and the hint instructions is calculated via a byte rotation count (9), which is calculated by shifting (10) the two high-order bits from the count word into the low-order bits of the first word. Since loads and stores are 16-byte aligned, the four low-order bits of pOut will always be zero, so the only significant bits of the rotation count (9) are the two bits shifted in from the count word. If the count is initialized to the number of elements to be mapped, then for the first count iterations of the loop those two bits are zero, but on iteration (count + 1) the count becomes negative and the high bits are set. So the rotation count is either zero (initially) or three (on the ultimate iteration). The count is used to rotate (11) a quadword (12) composed of the addresses of the top of the loop and the loop exit. If the function ends with the loop doing the mapping, and no non-volatile registers are used in the body, then the exit address for the loop can be the address in the link register. In this case the function return is hinted and the total number of branches is equal to the number of values mapped divided by the unroll factor.

¹ Less obvious is that the same idea works for up to seven-pointer cases, if the pointer arithmetic is performed using 16-bit integers. If the ranges permit, the 16-bit integers can be used as-is, but in general it is necessary to put addresses modulo 16 in the control quadword, and shift the result by four bits before extracting pointers.
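The following plain-C model (our own sketch; field layout, initial values and the unroll factor are illustrative rather than the exact constants of Fig. 9) shows the shape of this scheme: one simulated vector add advances both pointers and the counter, and the sign of the counter selects the loop-top or exit target, standing in for the rotate of the hint/branch address quadword.

/* Hedged plain-C model of the map-loop overhead. */
#include <stdint.h>
#include <stdio.h>

enum { UNROLL = 1, N = 8 };                  /* illustrative sizes        */

int main(void) {
    float in[N], out[N];
    for (int i = 0; i < N; ++i) in[i] = (float)i;

    /* control "quadword": { pOut, count, pIn, unused } as 32-bit words   */
    int32_t ctl[4] = { 0, N - 4 * UNROLL, 0, 0 };   /* offsets model pointers */
    /* increment "quadword", applied by a single vector add               */
    const int32_t inc[4] = { 4 * UNROLL, -4 * UNROLL, 4 * UNROLL, 0 };
    /* two-entry target table: { loop top, exit }                         */
    const int32_t target[2] = { 0, 1 };

    int32_t next;
    do {
        for (int k = 0; k < 4 * UNROLL && ctl[0] + k < N; ++k)
            out[ctl[0] + k] = 2.0f * in[ctl[2] + k];   /* the mapped f    */

        for (int w = 0; w < 4; ++w) ctl[w] += inc[w];  /* one "vector" add */

        /* sign bit of the decremented count picks loop-top or exit,
         * modelling the rotate of the hint/branch address quadword       */
        next = target[(uint32_t)ctl[1] >> 31];
    } while (next == 0);

    printf("out[%d] = %f\n", N - 1, out[N - 1]);
    return 0;
}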

This example is interesting for two reasons: one, it is a very efficient implementation of a common loop pattern; and two, in the natural software-pipelining of this loop the loads will be in the first stage and the stores in the last stage, so the add instruction (2) in some sense belongs to multiple stages, and the control word (1) contains values from different logical iterations. We view the transformation of the loop overhead required to software-pipeline this loop as being on the same level as loop unrolling, and implement the two transformations together in the corresponding code generator. The output of the generator includes iteration distances between outputs of one stage and inputs of the next stage, and stage constraints on load and store operations.

4.2 Dense Matrix Multiplication

Matrix-matrix multiplication and multiply-addition are important basic operations in linear algebra. Imperatively, they are encoded using nested loops. To achieve efficiency, they are always unrolled by blocking calculations into rectangular submatrices. When the blocks have the same structure, there is only one calculational kernel, and differences in loop levels and iteration manifest in varying pointer calculations.


We have transformed such calculations into branch-free form, including the calculation of branch addresses in time to hint all branches. We have analyzed the cost-benefit for the dense matrix case (no a priori information about zero matrix entries), and found that the expected improvement on a Cell SPU with perfect scheduling is small. This is because such basic linear algebra has a low computation to data size ratio. Significant benefits would accrue as this ratio rises, e.g., if matrix elements are computed on the fly, or unpacked from complicated sparse representations.

We analyze the case in which single precision floating point dense matrices are stored in row-major order (corresponding to a statically-declared two-dimensional C array). SIMD parallelism enforces the grouping of row elements into quadwords containing four floats. In practice, matrices larger than the SPU Local Store are multiplied. We will assume that such matrices are already blocked contiguously into big blocks with dimensions divisible by four. We then further block them (non-contiguously) into small blocks. In Fig. 10, the small blocks are 8 × 2 quadwords, which is 8 × 8 floats, and are shown as contiguous squares, with quadwords separated by solid lines. The small blocks must divide evenly into the big blocks.

Figure 10: Block structure for square matrices, showing division into SIMD vectors.

We choose a block size large enough to ensure:

1. SIMD vector boundaries are on block boundaries (including results of transposes);

2. there are sufficient independent parallel fma operations to saturate pipelined execution units during the multiply;

3. the cost of non-arithmetic load, store, and transpose operations is smaller than that of the arithmetic operations (the arithmetic cost grows cubically and the load/store/transpose cost quadratically in the width of square small blocks).

The minimum square block size with these properties is 8 × 8, in which case 128 fmas are needed. We will use these numbers in the following analysis. It follows that a lower bound on the scheduled cost per iteration will be the number of fma instructions plus any arithmetic instructions used in the loop overhead.

Iterating over blocks, we form sums of products of blocks, $C_{ij} = \sum_k A_{ik} B_{kj}$. In an imperative language this would be

for i
  for j
    C[i,j] ← 0
    for k
      C[i,j] ← C[i,j] + A[i,k] * B[k,j]
      branch
    store C[i,j]
    branch
  branch

On a processor with static hinting, the inner branch would miss once per iteration of the middle loop, the middle branch would miss once per outer iteration, and the outer branch would miss once. The relative cost of the misses decreases with the number of blocks in the multiplication. For an 8 × 8 array of 8 × 8 dense blocks it is about two percent of the theoretical minimum execution time on an SPU.
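One back-of-envelope reading of that figure (ours, assuming a branch miss penalty on the order of 18 cycles, a number not taken from the paper): an 8 × 8 array of blocks gives 8³ = 512 inner iterations of 128 fma cycles each, and 64 + 8 + 1 = 73 missed branches, so

\[
  \frac{(64 + 8 + 1) \times 18}{8^{3} \times 128}
  \;=\; \frac{1314}{65536}
  \;\approx\; 2.0\% ,
\]

which is consistent with the estimate above.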

Figure 11: Left: Pointer movement through A blocks, showing movement in green associated with the inner-most loop, in black associated with the middle loop, and in blue associated with the outermost loop. Middle: Pointer movement through B blocks, with the same colours. Right: Pointer movement through C blocks, with the same colours.

By applying the re-sequencing transformation to a MultiLoop formulation, we can eliminate branch miss penalties and reduce the overhead to a single integer add, a, that increments three pointers.


For short and fixed-length loops, we can use permute and rotate instructions to generate branch addresses on the fly in the non-arithmetic pipeline.

We reformulate the triple-nested loop as a MultiLoop with three versions, with three components each. The three versions correspond to the first iteration of the inner loop, the middle iterations and the last iterations. The middle component (midCG in Fig. 5) is unchanging and performs the multiply, as shown in Fig. 12.²


Figure 12: The middle code graph multiplies.

Figure 13 shows two cases (top1CG and top23CG in Fig. 5) of the load and increment portion. The first case is for the one version of the loop which loads or zeros C, depending on whether a multiply or multiply-add is being performed. The second case only loads the next block of A and B.

The last components, bot12CG and bot3CG in Fig. 5, shown in detail in Fig. 14, contain the branch hint and, in the second case, which is used in the third version of the loop body, the store of the finished C block.

Separate code graph components for the loop overhead allow for even smaller loop overhead code bodies. The address calculation, and hence the control flow, handled by iterInner and iterOuter and the nodes they connect forms a disjoint subgraph. Effectively, the inner loop of the imperative description is encoded by iterInner and the two outer loops by iterOuter. Similarly, the pointer movement is handled by different components for the inner loop, wayInner, and the two outer loops, wayOuter.

² In the following code graph drawings, inputs and outputs are represented by triangles. Nodes for the state of the local store (green) and branch hint (blue) put state explicitly into the code graphs.


Figure 13: Top code graphs, with C load (top) for the first loop version and C accumulation (bottom) for the last two.


Figure 14: Multiply bottom code graphs, with C accumulation (top) for the first two loop variations and C store (bottom) for the last.

The number and type of inputs and outputs are not constant, but the outputs of a loop version match the inputs of a loop version which can legally follow it.

If we apply the re-sequencing transformation to the MultiLoop formulation with two stages per loop body as shown in Fig. 5 (Right), then only two scheduled versions of the re-sequenced loop body are required, as shown in Fig. 7. This makes the code size for a scheduled MultiLoop roughly the same as a software-pipelined version of the imperative code, in which the code size for the inner loop would be double the size of the loop body.


Loop Induction via Permutation Groups.   To reduce the triple loop nest to two MultiLoop code bodies and synthesize the required pointer movement in Fig. 10 using shuffles and rotations, we use two key observations: (1) that top1CG occurs in logical iterations congruent to 0 mod 8 and bot3CG in logical iterations congruent to 7 mod 8, which are paired together; (2) that the pointer movement is periodic.

The action of any finite group can be simulated in this way, although we have only encountered Abelian groups up to the present. Groups with power-of-two sizes up to 32 can be simulated using word, half-word, byte or nibble rotation. Groups with other orders must be simulated using byte permutation via shufb. Both approaches can be used to directly cycle through offsets and addresses, or to indirectly cycle through indices into such, and then use rotation or permutation with register arguments (rotqby, rotqbi or shufb + key synthesis) to look up the actual offsets and addresses. Indirect permutation is especially effective when the list being permuted has a great deal of overlap. By increasing the order of indirection, e.g., permuting lists of permute constants, the action of direct products of groups can be so constructed.
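A minimal plain-C sketch of this kind of loop induction (ours; the offsets and period are illustrative, not the constants of the matrix-multiplication kernel) cycles through a periodic list of pointer increments by rotating a simulated quadword one word per iteration:

/* Cycling through a period-4 list of address offsets by rotating a
 * 16-byte "quadword" one 32-bit slot per iteration, simulating the
 * rotqbyi-style loop induction described above. */
#include <stdint.h>
#include <stdio.h>

static void rotate_words_left(int32_t q[4]) {
    /* stands in for a quadword rotate by 4 bytes (one word slot) */
    int32_t head = q[0];
    q[0] = q[1]; q[1] = q[2]; q[2] = q[3]; q[3] = head;
}

int main(void) {
    /* period-4 cycle of pointer increments; slot 0 is the active one */
    int32_t offsets[4] = { 16, 16, 16, -32 };  /* 3 steps forward, 1 jump back */
    int32_t p = 0;                             /* simulated pointer   */

    for (int iter = 0; iter < 8; ++iter) {
        p += offsets[0];                       /* use the active offset */
        rotate_words_left(offsets);            /* advance the cycle     */
        printf("iteration %d: p = %d\n", iter, (int)p);
    }
    return 0;
}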

We can do without a prologue and epilogue, and only need one MultiLoop to do both matrix multiplication and multiplication-addition, because we exposed load and store addresses as inputs and outputs, and put constraints on their staging. Loads on the SPU are safe (they cannot cause segmentation faults), so extra loads “past the end of the loop” can be ignored. Stores are never safe, but we know in which logical iterations extra stores will occur. In Fig. 14, the address for the stores is exposed as an input. Normally, this address is produced by a rotqbyi; in the first iteration the address is generated in the init block, so we can set the initial store address to the location of a scratch memory area. Only a non-contiguous scratch area the size of a small block is required.

To produce both multiplication and multiplication-addition, we always include a loadC which takes its address from a rotqby controlled by an exposed register input, which we set to rotate either to the address which will be used in the next storeC or to load from a constant scratch memory area stocked with zeros.

For this example, the performance advantage provided by the MultiLoop will be relatively small, because the loop body itself needs to be so large to avoid being performance-limited by load/store activity. The encapsulation of the loop overhead in the MultiLoop will work for any sparse structure with identical blocks. For varying blocks (which is more common) a more general set of loop overheads will be required. This is the advantage of the MultiLoop for blocked matrix multiplication: efficient loop overheads for new block sparse structures can be supplied by the developer and do not have to be encoded in the compiler.

4.3 Partial Fourier Transform

MultiLoop formulations of multi-dimensional separable transforms lead to significant code reduction. In the simplest cases, conventional specifications lead to sd times more instructions, where s is the number of pipeline stages used in a schedule and d is the dimensionality. In the complicated case we study it is a factor of two.

Partial Fourier Transforms are Fourier transforms composed with coordinate projections or injections, e.g. they take a vector of 256 complex values, perform a Fourier transform, then copy out the central 192 elements. In the other direction we pad the input with zeros. Similar operations occur in wavelet and other fast transforms.

The MultiLoop and a re-sequencing transformation can be profitably applied when these operations are applied in multiple dimensions. Most fast transforms are, by definition, separable, which means that they can be applied independently in each dimension. In imperative code, this is expressed as a sequential composition of nested loops. We consider the three-dimensional case.

Such transforms are most efficiently computed in one direction. On scalar machines that is usually the direction in which addresses change the slowest. To use SIMD parallelism, another direction should be used, otherwise a lot of shuffling among registers is necessary.


To avoid having to transform in the expensive direction, we transpose before stores:

for dimension in dimensions
  for quad-column in columns(dimension)
    load quad-column
    transform quad-column
    transpose quad-column
    store quad-column as row

(loading enough columns to completely fill all loaded quadwords, e.g., 4 floats).

For three dimensions, this looks like

xyz → Yzx → ZxY → XYZ     (1)

Index label order indicates array ordering, and case indicates whether a transform has been performed in the direction labelled by that index. Each is a loop of column transforms with code graph

L → F → S,     (2)

where L is a complete set of loads with necessary data reforming for one row, F performs the transform, and S is a complete set of stores, with necessary data reforming for one row. We permute the indices cyclically (via a transpose) so that every dimension is transformed in the “middle” dimension, because SIMD parallelism is free in the “column” directions.

For Partial FFTs, the sizes of the dimensions before and after the one-dimensional transforms are not necessarily the same. This affects the load and store subgraphs by changing the striding pattern both within the column and in the enumeration of columns, which changes register constants or immediate constants, depending on the type of indexing. Immediate-indexed forms reduce register pressure, and are more complicated for the MultiLoop, so we consider that case.

There are two different versions of L and of S, depending on whether the first index direction has been transformed or not. We now specialize to the three-dimensional case: xyz has one load pattern, and Yzx and ZxY have another; if more than one column has to be transformed in parallel, then Yzx and ZxY have one store pattern and XYZ has a second. Load and store patterns can also differ if other data reformatting is performed, e.g. converting interleaved complex data

$\ldots, r_j, i_j, r_{j+1}, i_{j+1}, r_{j+2}, i_{j+2}, r_{j+3}, i_{j+3}, \ldots$

into planar or quadword-interleaved format

$\ldots, r_j, r_{j+1}, r_{j+2}, r_{j+3}, i_j, i_{j+1}, i_{j+2}, i_{j+3}, \ldots$

The three sequentially-composed loops become a three-way MultiLoop with three components,

(L1, F, S1|2) | (L2|3, F, S1|2) | (L2|3, F, S3).

For all but very small transforms, two stages are sufficient to hide instruction latency in a software-pipelined loop, so the computation graph F is only split once, and the scheduled loop bodies will execute in this order:

  (load, store) direction    (load, store) version    number of executions
  (1, ∗)                     (T1, S1|2)               n²
  (2, 1)                     (T2|3, S1|2)             2
  (2, 2)                     (T2|3, S1|2)             nm − 2
  (3, 2)                     (T2|3, S3)               2
  (3, 3)                     (T2|3, S3)               m²
  (∗, 3)                     (T2|3, S3)               m²

for an n³ to m³ transformation.

In this case, the re-sequenced MultiLoop allows us to schedule three versions of the loop body, whereas three sequential software-pipelined loops would require three prologues and three epilogues, effectively doubling the code size. Even if we used a prologue and epilogue in the MultiLoop case, the code size is still reduced by (d + 1)/(2d), where d is the number of dimensions.

4.4 Resampling in NUFT

This is the most specialized code we have encoded in a MultiLoop. It processes a one-dimensional array of velocities and data, and paints a three-dimensional array with the data. The MultiLoop has 54 different cases with branching dependent on the cumulative values of the velocity data, so simple unrolling would be impossible. Using a pipelined MultiLoop representation cuts the resource bound on execution time in half, over alternative representations, at the cost of increased code size. For the target applications (Nonuniform Fourier Transform)³, this is a desirable trade.

³ This computation, together with the previous Partial Fourier Transform, can be used to perform a nonuniform Fourier Transform, e.g., used to reconstruct Magnetic Resonance Images from irregularly-sampled data, but the computational issues are more general.


A time series of positions and data are read in from one-dimensional arrays and rendered to a three-dimensional array of complex values. One can think of it as a variation of rasterization of anti-aliased lines in an array of pixels. In this case the pixels have complex floating-point values. Such tasks do not fit the simple SIMD pattern of parallelism. Single input data elements produce results in multiple neighbouring array elements. It is not possible to align the output with quadword boundaries. The only universal method of handling unaligned data access is to load or store all of the quadwords which overlap the required data and use permute or rotate instructions to align the data within quadwords.
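The following plain-C sketch (ours; sizes and values are illustrative) shows the shape of that pattern: whole quadwords overlapping an unaligned range are loaded, modified in simulated registers, and stored back whole; on the SPU the in-register alignment would be done with shufb or rotate instructions.

/* Read-modify-write of a value range that straddles 16-byte boundaries,
 * using whole simulated quadwords for loads and stores. */
#include <stdio.h>
#include <string.h>

typedef struct { float w[4]; } qword;        /* simulated 128-bit register */
#define NQ 8                                  /* quadwords in backing array */

int main(void) {
    float mem[NQ * 4] = { 0 };
    const float contrib[4] = { 1, 2, 3, 4 };  /* values to accumulate       */
    int start = 6;                            /* unaligned element index    */

    int q0 = start / 4;                       /* first overlapping quadword */
    int q1 = (start + 3) / 4;                 /* last overlapping quadword  */

    for (int q = q0; q <= q1; ++q) {
        qword r;                              /* "load" the whole quadword  */
        memcpy(r.w, &mem[q * 4], sizeof r);
        for (int k = 0; k < 4; ++k) {         /* accumulate the part of the */
            int elem = q * 4 + k - start;     /* contribution landing here  */
            if (elem >= 0 && elem < 4)
                r.w[k] += contrib[elem];
        }
        memcpy(&mem[q * 4], r.w, sizeof r);   /* store the whole quadword   */
    }

    printf("mem[6..9] = %g %g %g %g\n", mem[6], mem[7], mem[8], mem[9]);
    return 0;
}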

In the simplest imperative implementation, the array data are read, modified and written as required. The load and store instructions would then dominate the computation and double the execution time (assuming perfect scheduling). A more complicated implementation, which kept a cache of array values in registers and stored and loaded values as required, would require multiple if statements nested six deep or a switch; with so many missed branches, it would be much less efficient than the MultiLoop version. Since the conditions depend on the contents of the data cumulatively, unrolling cannot be used to hide the latency of the condition calculations. For computations following this pattern (rasterizing curves in higher dimensions), load/store activity can be significantly reduced by caching the active hypercube in registers, if the register file is large enough.

In our example, input data influences a 4³ neighbourhood of array values, and the array values are complex, single precision floating point numbers, stored in interleaved real/imaginary row-major order. Each quadword contains two such values, so 32 quadwords are required for an aligned cache. Since complex floats are eight bytes wide, we are either aligned or unaligned (with the natural cache boundary in the middle of an aligned quadword). In the unaligned case, rows cover three quadwords instead of two, so the number of registers for an unaligned cache is 48 if we store the entire quadword. On a 128-register machine, like the Cell SPU, there is room for such a buffer.

We assume that the active section of the array changes by at most one element along each of the axes. There are, therefore, 27 different possible movements. Since the cache can be aligned or not aligned on quadword boundaries, each direction of movement requires two different sets of load/store activity.⁴ The amount of load/store activity depends on the movement, e.g., with eight or twelve loads and stores, respectively, in the aligned and unaligned cases.

⁴ The stationary case does not require any load/store activity in the backing store, but our implementation uses two identical implementations to simplify indexing into the jump table.

Compare this with the number of loads and stores required for a single loop body in which all conditionals are synthesized using selection and permutation. In this case, 48 quadwords must be loaded before the computation and 48 quadwords stored after the computation. In addition, a shuffle map must be constructed on the fly to rotate values in the unaligned case, and this must be applied to each of the 48 quadwords between the computation and the loads and stores, for a total of 192 non-arithmetic instructions. Since the scheduled II for the most common (stationary) cases is 86 cycles [17], and averaged over expected execution traces would not be more than 100 cycles, not using the MultiLoop would cut performance in half.

5 Related Work

Although our goal in this paper is to introduce a vocabulary for specialized control flow, and not to introduce a new scheduling algorithm, there is a lot of overlap between our work and the literature on optimizing compilers.

Specifically, we are motivated by the difficulty of incorporating conditional execution into software-pipelined loops. This is a well-researched problem.

The current leader in scheduling algorithms for software-pipelined loops is Swing Modulo Scheduling [11]. The most commonly followed method of incorporating conditional execution into such algorithms is hierarchical reduction [9]. Another alternative is Enhanced Modulo Scheduling [18].



Methods which take loops or loop nests containing conditional execution and effectively break them into multiple paths include Multiple-II Modulo Scheduling (MII-MS) [19], All Paths Pipelining (APP) [14], and Split-Path Enhanced Pipeline Scheduling (SP-EPS) [13], all of which allow multiple IIs for a single loop. APP specializes the loop according to branches being taken or not taken, and modulo schedules each version. Additional edges are added to facilitate jumping between different schedules. SP-EPS, on the other hand, takes a control-flow graph for a loop (or loop nest) and expands it at branch points by splitting tails. Another approach is global software pipelining (the GURPR family) [15, 16]. These methods involve significant data dependency graph manipulation, but none of them incorporates instruction selection. Incorporating instruction selection capable of finding the code sequences used in our loop overhead examples would be prohibitively expensive.

Enhanced Co-Scheduling [4] extends the state-machine model for pipelined processors, and derives a modulo-scheduling algorithm. Unlike our approach, they start from the machine level and build up to find a useful level of formalization.

Most of these algorithms are designed with Very Long Instruction Word (VLIW) architectures (a form of Multiple Instruction Multiple Data, MIMD) in mind, rather than light-weight, heterogeneous multi-core systems exploiting SIMD parallelism. In a way, heterogeneous chip multiprocessors like the Cell BE have a much simpler model of execution than the VLIW machines and systolic arrays which have been common in specialized research and industrial applications over the last two decades. None of the algorithms cited address the main challenges/opportunities for these newer systems: very heavy branch miss penalties but with software branch prediction, weaker out-of-order execution and multiple dispatch, but rich SIMD parallelism.

Our approach is to define a vocabulary to enable communication between the programmer and the instruction scheduler. Another approach [1] is to move software pipelining into the domain of the programmer as source-to-source transformations. This approach can be useful, but restricts the discussion to features present in the high-level language, which makes it impossible to access novel hardware features.

Other researchers have addressed the efficient use of SIMD parallelism in loops [3]; they solved the problem of inter-quadword alignment which is common to all SIMD implementations, especially when parallelizing scalar code. In our approach, code generators would handle these aspects of parallelization. MultiLoop transformations solve the orthogonal issue of optimizing control flow.

6 Conclusion

Through detailed analysis of four very different examples of numerical computation, we have shown that generators for MultiLoops can eliminate branch misprediction and leverage SIMD parallelism to reduce the cost of loop overhead. In addition to enabling significant improvements in code size and/or performance, our descriptions are well-structured, encapsulating complicated SIMD-synthesized control flow, and facilitating discussions of high-level transformations with low-level implications.

Our next steps are: a formalization of the specification and transformations, incorporation of symbolic computation to prove required properties of scheduled code, and extension of this framework to inter-SPU communication, which will involve concurrent control-flow arrangements.

Acknowledgements

The authors would like to thank IBM Canada, NSERC, CFI and OIT for research grants which supported this work.

About the Authors

Christopher Anand is an Assistant Professor in the Department of Computing and Software. He is motivated to develop efficient code generation for safety-critical high-performance software systems by his experience writing Magnetic Resonance Image reconstruction software at Picker Medical Systems. His Internet address is [email protected].

Wolfram Kahl is an Associate Professor in the Department of Computing and Software at McMaster University. He is interested in applications and foundations of, and tool support for, formal methods in software development, in particular high-level specification formalisms and programming languages, graph transformation, and relation-algebraic calculations. His Internet address is [email protected].

References

[1] Yosi Ben-Asher and Danny Meisler. Towards a source level compiler: Source level modulo scheduling. In 2006 Intl. Conf. on Parallel Processing Workshops (ICPPW'06), pages 298–308. IEEE Computer Society, 2006.

[2] International Business Machines Corporation, Sony Computer Entertainment Incorporated, and Toshiba Corporation. Cell Broadband Engine Programming Handbook. IBM Systems and Technology Group, Hopewell Junction, NY, 1.0 edition.

[3] Alexandre E. Eichenberger, Kathryn O'Brien, Kevin O'Brien, Peng Wu, et al. Optimizing compiler for the Cell processor. In PACT '05: Proc. 14th Intl. Conf. Parallel Architectures and Compilation Techniques, pages 161–172. IEEE Computer Society, 2005.

[4] R. Govindarajan, N. S. S. Narasimha Rao, E. R. Altman, and Guang R. Gao. Enhanced co-scheduling: A software pipelining method using modulo-scheduled pipeline theory. Int. J. Parallel Program., 28(1):1–46, 2000.

[5] IBM. PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual. IBM Systems and Technology Group, Hopewell Junction, NY, 2005.

[6] IBM. Synergistic Processor Unit Instruction Set Architecture. IBM Systems and Technology Group, Hopewell Junction, NY, October 2006.

[7] Wolfram Kahl, Christopher Kumar Anand, and Jacques Carette. Control-flow semantics for assembly-level data-flow graphs. In Wendy McCaull et al., editors, 8th Intl. Seminar on Relational Methods in Computer Science, RelMiCS 2005, volume 3929 of LNCS, pages 147–160. Springer, 2006.

[8] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4/5):589–604, July/September 2005.

[9] Monica S. Lam. Software pipelining: an effective scheduling technique for VLIW machines. SIGPLAN Not., 39(4):244–256, 2004.

[10] Tanya M. Lattner. An implementation of swing modulo scheduling with extensions for superblocks. Master's thesis, University of Illinois at Urbana-Champaign, 2005.

[11] Josep Llosa. Swing modulo scheduling: A lifetime-sensitive approach. In PACT '96: Proc. 1996 Conf. Parallel Architectures and Compilation Techniques, page 80, Washington, DC, USA, 1996. IEEE Computer Society.

[12] R. M. Ramanathan. Extending the World's Most Popular Processor Architecture: New innovations that improve the performance and energy efficiency of Intel architecture. Intel Corporation, 2006.

[13] SangMin Shim and Soo-Mook Moon. Split-path enhanced pipeline scheduling. IEEE Trans. Parallel Distrib. Syst., 14(5):447–462, 2003.

[14] Mark G. Stoodley and Corinna G. Lee. Software pipelining loops with conditional branches. In MICRO 29: Proc. 29th Annual ACM/IEEE Intl. Symp. on Microarchitecture, pages 262–273, Washington, DC, USA, 1996. IEEE Computer Society.

[15] Bogong Su, Shiyuan Ding, Jian Wang, and Jinshi Xia. GURPR—a method for global software pipelining. In MICRO 20: Proc. 20th Annual Workshop on Microprogramming, pages 88–96. ACM Press, 1987.

[16] Bogong Su and Jian Wang. GURPR*: a new global software pipelining algorithm. In MICRO 24: Proc. 24th Annual Intl. Symp. on Microarchitecture, pages 212–216. ACM Press, 1991.

[17] Wolfgang Thaller. Explicitly staged software pipelining. Master's thesis, McMaster University, Department of Computing and Software, 2006. http://sqrl.mcmaster.ca/~anand/papers/ThallerMScExSSP.pdf.

[18] Nancy J. Warter, Grant E. Haab, Krishna Subramanian, and John W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In MICRO 25: Proc. 25th Annual Intl. Symp. on Microarchitecture, pages 170–179, Los Alamitos, CA, USA, 1992. IEEE Computer Society Press.

[19] Nancy J. Warter-Perez and Noubar Partamian. Modulo scheduling with multiple initiation intervals. In MICRO 28: Proc. 28th Annual Intl. Symp. on Microarchitecture, pages 111–119, Los Alamitos, CA, USA, 1995. IEEE Computer Society Press.