
Compiler Transformations for High-Performance Computing

DAVID F. BACON, SUSAN L. GRAHAM, AND OLIVER J. SHARP

Computer Science Division, University of California, Berkeley, California 94720

In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimizations for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. In contrast, optimizations for high-performance superscalar, vector, and parallel processors maximize parallelism and memory locality with transformations that rely on tracking the properties of arrays using loop dependence analysis.

This survey is a comprehensive overview of the important high-level program

restructuring techniques for imperative languages such as C and Fortran.

Transformations for both sequential and various types of parallel architectures are

covered in depth. We describe the purpose of each traneformation, explain how to

determine if it is legal, and give an example of its application.

Programmers wishing to enhance the performance of their code can use this survey to improve their understanding of the optimizations that compilers can perform, or as a reference for techniques to be applied manually. Students can obtain an overview of optimizing compiler technology. Compiler writers can use this survey as a reference for most of the important optimizations developed to date, and as a bibliographic reference for the details of each optimization. Readers are expected to be familiar with modern computer architecture and basic program compilation techniques.

Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming; D.3.4 [Programming Languages]: Processors—compilers; optimization; I.2.2 [Artificial Intelligence]: Automatic Programming—program transformation

General Terms: Languages, Performance

Additional Key Words and Phrases: Compilation, dependence analysis, locality, multiprocessors, optimization, parallelism, superscalar processors, vectorization

INTRODUCTION

Optimizing compilers have become an essential component of modern high-performance computer systems. In addition to translating the input program into machine language, they analyze it and apply various transformations to reduce its running time or its size.

As optimizing compilers become more effective, programmers can become less concerned about the details of the underlying machine architecture and can employ higher-level, more succinct, and more intuitive programming constructs and program organizations. Simultaneously, hardware designers are able to

This research has been sponsored in part by the Defense Advanced Research Projects Agency (DARPA) under contract DABT63-92-C-0026, by NSF grant CDA-8722788, and by an IBM Resident Study Program Fellowship to David Bacon.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1994 ACM 0360-0300/94/1200-0345 $03.50


CONTENTS

INTRODUCTION
1. SOURCE LANGUAGE
2. TRANSFORMATION ISSUES
2.1 Correctness
2.2 Scope
3. TARGET ARCHITECTURES
3.1 Characterizing Performance
3.2 Model Architectures
4. COMPILER ORGANIZATION
5. DEPENDENCE ANALYSIS
5.1 Types of Dependences
5.2 Representing Dependences
5.3 Loop Dependence Analysis
5.4 Subscript Analysis
6. TRANSFORMATIONS
6.1 Data-Flow-Based Loop Transformations
6.2 Loop Reordering
6.3 Loop Restructuring
6.4 Loop Replacement Transformations
6.5 Memory Access Transformations
6.6 Partial Evaluation
6.7 Redundancy Elimination
6.8 Procedure Call Transformations
7. TRANSFORMATIONS FOR PARALLEL MACHINES
7.1 Data Layout
7.2 Exposing Coarse-Grained Parallelism
7.3 Computation Partitioning
7.4 Communication Optimization
7.5 SIMD Transformations
7.6 VLIW Transformations
8. TRANSFORMATION FRAMEWORKS
8.1 Unified Transformation
8.2 Searching the Transformation Space
9. COMPILER EVALUATION
9.1 Benchmarks
9.2 Code Characteristics
9.3 Compiler Effectiveness
9.4 Instruction-Level Parallelism
CONCLUSION
APPENDIX: MACHINE MODELS
A.1 Superscalar DLX
A.2 Vector DLX
A.3 Multiprocessors
ACKNOWLEDGMENTS
REFERENCES

employ designs that yield greatly improved performance because they need only concern themselves with the suitability of the design as a compiler target, not with its suitability as a direct programmer interface.

In this survey we describe transformations that optimize programs written in imperative languages such as Fortran and C for high-performance architectures, including superscalar, vector, and various classes of multiprocessor machines.

Most of the transformations we describe can be applied to any of the Algol-family languages; many are applicable to functional, logic, distributed, and object-oriented languages as well. These other languages raise additional optimization issues that space does not permit us to cover in this survey. The references include some starting points for investigation of optimizations for LISP and functional languages [Appel 1992; Clark and Peyton-Jones 1985; Kranz et al. 1986], object-oriented languages [Chambers and Ungar 1989], and the set-based language SETL [Freudenberger et al. 1983].

We have also restricted the discussion to higher-level transformations that require some program analysis. Thus we exclude peephole optimizations and most machine-level optimizations. We use the term optimization as shorthand for optimizing transformation.

Finally, because of the richness of the topic, we have not given a detailed description of intermediate program representations and analysis techniques.

We make use of a number of different machine models, all based on a hypothetical superscalar processor called S-DLX. The Appendix details the machine models, presenting a simplified architecture and instruction set that we use when we need to discuss machine code. While all assembly language examples are commented, the reader will need to refer to the Appendix to understand some details of the examples (such as cycle counts).

We assume a basic familiarity with program compilation issues. Readers unfamiliar with program flow analysis or other basic compilation techniques may consult Aho et al. [1986].

1. SOURCE LANGUAGE

All of the high-level examples in this survey are written in a language similar


to Fortran 90, with minor variations and extensions; most examples use only features found in Fortran 77. We have chosen to use Fortran because it is the de facto standard of the high-performance engineering and scientific computing community. Fortran has also been the input language for a number of research projects studying parallelization [Allen et al. 1988a; Balasundaram et al. 1989; Polychronopoulos et al. 1989]. It was chosen by these projects not only because of its ubiquity among the user community, but also because its lack of pointers and its static memory model make it more amenable to analysis. It is not yet clear what effect the recent introduction of pointers into Fortran 90 will have on optimization.

The optimizations we have presented are not specific to Fortran—in fact, many commercial compilers use the same intermediate language and optimizer for both C and Fortran. The presence of unrestricted pointers in C can reduce opportunities for optimization because it is impossible to determine which variables may be referenced by a pointer. The process of determining which references may point to the same storage locations is called alias analysis [Banning 1979; Choi et al. 1993; Cooper and Kennedy 1989; Landi et al. 1993].

The only changes to Fortran in our examples are that array subscripting is denoted by square brackets to distinguish it from function calls, and we use do all loops to indicate textually that all the iterations of a loop may be executed concurrently. To make the structure of the computation more explicit, we will generally express loops as iterations rather than in Fortran 90 array notation.

We follow the Fortran convention that arrays are stored contiguously in memory in column-major form. If the array is traversed so as to visit consecutive locations in the linear memory, the first (that is, leftmost) subscript varies the fastest. In traversing the two-dimensional array declared as a[n, m], the following array locations are contiguous: a[n-1, 3], a[n, 3], a[1, 4], a[2, 4].
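For readers who think in C terms, the following small sketch (our addition; the names and bounds are arbitrary) computes the linear offset of an element under the column-major rule just described. Running it for the four locations listed above prints four consecutive offsets.

    #include <stdio.h>

    /* Offset of a[i, j] (1-based subscripts) in a column-major array
     * declared as a[n, m]: consecutive values of the leftmost
     * subscript i are adjacent in memory. */
    static int colmajor_offset(int i, int j, int n)
    {
        return (i - 1) + (j - 1) * n;
    }

    int main(void)
    {
        int n = 10;   /* hypothetical leading dimension of a[n, m] */
        /* a[n-1,3], a[n,3], a[1,4], a[2,4] occupy consecutive offsets */
        printf("%d %d %d %d\n",
               colmajor_offset(n - 1, 3, n),
               colmajor_offset(n, 3, n),
               colmajor_offset(1, 4, n),
               colmajor_offset(2, 4, n));
        return 0;
    }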


Programmers unfamiliar with Fortran should also note that all arguments are passed by reference.

When describing compilation for vector machines, we sometimes use array notation when the mapping to hardware registers is clear. For instance, the loop

      do all i = 1, 64
         a[i] = a[i] + c
      end do all

could be implemented with a scalar-vector add instruction, assuming a machine with vector registers of length 64. This would be written in Fortran 90 array notation as

      a[1:64] = a[1:64] + c

or in vector machine assembly language as

      LF    F2, c(R30)    ;load c into register F2
      ADDI  R8, R30, #a   ;load addr. of a into R8
      LV    V1, R8        ;load vector a[1:64] to V1
      ADDSV V1, F2, V1    ;add scalar to vector
      SV    V1, R8        ;store vector in a[1:64]

Array notation will not be used when loop bounds are unknown, because there is no longer an obvious correspondence between the source code and the fixed-length vector operations that perform the computation. To use vector operations, the compiler must perform the transformation called strip-mining, which is discussed in Section 6.2.4.

2. TRANSFORMATION ISSUES

For a compiler to apply an optimization to a program, it must do three things:

(1) Decide upon a part of the program to optimize and a particular transformation to apply to it.

(2) Verify that the transformation either does not change the meaning of the program or changes it in a restricted way that is acceptable to the user.

(3) Transform the program.

In this survey we concentrate on the last step: transformation of the program.


However, in Section 5 we introduce dependence analysis techniques that are used for deciding upon and verifying many loop transformations.

Step (1) is the most difficult and poorly understood; consequently compiler design is in many ways a black art. Because analysis is expensive, engineering issues often constrain the optimization strategies that the compiler can perform. Even for relatively straightforward uniprocessor target architectures, it is possible for a sequence of optimizations to slow a program down. For example, an attempt to reduce the number of instructions executed may actually degrade performance by making less efficient use of the cache.

As processor architectures become more complex, (1) the number of dimensions in which optimization is possible increases and (2) the decision process is greatly complicated. Some progress has been made in systematizing the application of certain families of transformations, as discussed in Section 8. However, all optimizing compilers embody a set of heuristic decisions as to the transformation orderings likely to work best for the target machine(s).

2.1 Correctness

When a program is transformed by the compiler, the meaning of the program should remain unchanged. The easiest way to achieve this is to require that the transformed program perform exactly the same operations as the original, in exactly the order imposed by the semantics of the language. However, such a strict interpretation leaves little room for improvement. The following is a more practical definition.

Definition 2.1.1 A transformation is legal if the original and the transformed programs produce exactly the same output for all identical executions.

Two executions of a program are identical executions if they are supplied with the same input data and if every corresponding pair of nondeterministic operations in the two executions produces the same result.

Nondeterminism can be introduced by language constructs (such as Ada's select statement), or by calls to system or library routines that return information about the state external to the program (such as Unix time() or read()).

In some cases it is straightforward to cause executions to be identical, for instance by entering the same inputs. In other cases there may be support for determinism in the programming environment, for instance a compilation option that forces select statements to evaluate their guards in a deterministic order. As a last resort, the programmer may have to replace nondeterministic system calls temporarily with deterministic operations in order to compare two versions.

To illustrate many of the common problems encountered when trying to optimize a program and yet maintain correctness, Figure 1(a) shows a subroutine, and Figure 1(b) is a transformed version of it. The new version may seem to have the same semantics as the original, but it violates Definition 2.1.1 in the following ways:

● Overflow. If b[k] is a very large number, and a[1] is negative, then changing the order of the additions so that C = b[k] + 100000.0 is executed before the loop body could cause an overflow to occur in the transformed program that did not occur in the original. Even if the original program would have overflowed, the transformation causes the exception to happen at a different point. This situation complicates debugging, since the transformation is not visible to the programmer. Finally, if there had been a print statement between the assignment and use of C, the transformation would actually change the output of the program.

● Different results. Even if no overflow occurs, the values that are left in the array a may be slightly different because the order of the additions has been changed.


      subroutine tricky(a, b, n, m, k)
      integer n, m, k
      real a[m], b[m]
      do i = 1, n
         a[i] = b[k] + a[i] + 100000.0
      end do
      return

      (a) original program

      subroutine tricky(a, b, n, m, k)
      integer n, m, k
      real a[m], b[m], C
      C = b[k] + 100000.0
      do i = n, 1, -1
         a[i] = a[i] + C
      end do
      return

      (b) transformed program

      k = m+1
      n = 0
      call tricky(a,b,n,m,k)

      (c) possible call to tricky

      equivalence (a[1], b[n])
      k = n
      call tricky(a,b,n,m,k)

      (d) possible call to tricky

Figure 1. Incorrect program transformations.

The reason is that floating-point numbers are approximations of real numbers, and the order in which the approximations are applied (rounding) can affect the result. However, for a sequence of commutative and associative integer operations, if no order of evaluation can cause an exception, then all evaluation orders are equivalent. We call operations that are algebraically but not computationally commutative (or associative) semicommutative (or semiassociative) operations. These issues do not arise with Boolean operations, since they do not cause exceptions or compute approximate values.

● Memory fault. If k > m but n < 1, the reference to b[k] is illegal. The reference would not be evaluated in the original program because the loop body is never executed, but it would occur in the call shown in Figure 1(c) to the transformed code in Figure 1(b).

● Different results. a and b may be completely or partially aliased to one another, changing the values assigned to a in the transformed program. Figure 1(d) shows how this might occur: if the call is to the original subroutine, b[k] is changed when i = 1, since k = n and b[n] is aliased to a[1]. In the transformed version, the old value of b[k] is used for all i, since it is read before the loop. Even if the reference to b[k] were moved back inside the loop, the transformed version would still give different results because the loop traversal order has been reversed.

As a result of these problems, slightly different definitions are used in practice. When bitwise identical results are desired, the following definition is used:

Definition 2.1.2 A transformation is legal if, for all semantically correct program executions, the original and the transformed programs produce exactly the same output for identical executions.

Languages typically have many rules that are stated but not enforced; for instance, in Fortran, array subscripts must remain within the declared bounds, but this rule is generally not enforced at compile- or run-time. A program execution is correct if it does not violate the rules of the language. Note that correctness is a property of a program execution, not of the program itself, since the same program may execute correctly under some inputs and incorrectly under others.

Fortran solves the problem of parameter aliasing by declaring calls that alias scalars or parts of arrays illegal, as in Figure 1(d). In practice, many programs make use of aliasing anyway. While this may improve performance, it sometimes leads to very obscure bugs.


For languages and standards that define a specific semantics for exceptions, meeting Definition 2.1.2 tightly constrains the permissible transformations. If the compiler may assume that exceptions are only generated when the program is semantically incorrect, a transformed program can produce different results when an exception occurs and still be legal. When exceptions have a well-defined semantics, as in the IEEE floating-point standard [American National Standards Institute 1987], many transformations will not be legal under this definition.

However, demanding bitwise identical results may not be necessary. An alternative correctness criterion is:

Definition 2.1.3 A transformation is legal if, for all semantically correct executions of the original program, the original and the transformed programs perform equivalent operations for identical executions. All permutations of semicommutative operations are considered equivalent.

Since we cannot predict the degree to which transformations of semicommutative operations change the output, we must use an operational rather than an observational definition of equivalence. Generally, in practice, programmers observe whether the numeric results differ by more than a certain tolerance, and if they do, force the compiler to employ Definition 2.1.2.
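The semicommutativity problem is easy to reproduce. The small C program below (our illustration; the constants are chosen only to exercise single-precision rounding) evaluates the same three-term sum in two different orders and obtains two different results, which is precisely why Definition 2.1.3 treats permutations of semicommutative operations as equivalent rather than identical.

    #include <stdio.h>

    int main(void)
    {
        float big = 1.0e8f, small = 3.0f;

        /* Algebraically both sums equal 1.0e8 + 6.0, but single-precision
         * rounding differs: the small terms are absorbed one at a time on
         * the left, and only after being combined on the right. */
        float left  = (big + small) + small;   /* 100000000.0 */
        float right = big + (small + small);   /* 100000008.0 */

        printf("left  = %.1f\n", left);
        printf("right = %.1f\n", right);
        printf("equal? %s\n", left == right ? "yes" : "no");
        return 0;
    }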

2.2 Scope

Optimizations can be applied to a program at different levels of granularity. As the scope of the transformation is enlarged, the cost of analysis generally increases. Some useful gradations of complexity are:

● Statement. Arithmetic expressions are the main source of potential optimization within a statement.

● Basic block (straight-line code). This is the focus of early optimization techniques. The advantage for analysis is that there is only one entry point, so control transfer need not be considered in tracking the behavior of the code.

● Innermost loop. To target high-performance architectures effectively, compilers need to focus on loops. Most of the transformations discussed in this survey are based on loop manipulation. Many of them have been studied or widely applied only in the context of the innermost loop.

● Perfect loop nest. A loop nest is a set of loops one inside the next. The nest is called a perfect nest if the body of every loop other than the innermost consists of only the next loop in the nest. Because a perfect nest is more easily summarized and reorganized, several transformations apply only to perfect nests.

● General loop nest. Any loop nesting, perfect or not.

● Procedure. Some optimizations, memory access transformations in particular, yield better improvements if they are applied to an entire procedure at once. The compiler must be able to manage the interactions of all the basic blocks and control transfers within the procedure. The standard and rather confusing term for procedure-level optimization in the literature is global optimization.

● Interprocedural. Considering several procedures together often exposes more opportunities for optimization; in particular, procedure call overhead is often significant, and can sometimes be reduced or eliminated with interprocedural analysis.

We will generally confine our attention to optimizations beyond the basic block level.

3. TARGET ARCHITECTURES

In this survey we discuss compilation techniques for high-performance


architectures, which for our purposes are superscalar, vector, SIMD, shared-memory multiprocessor, and distributed-memory multiprocessor machines. These architectures have in common that they all use parallelism in some form to improve performance.

The structure of an architecture dictates how the compiler must optimize along a number of different (and sometimes competing) axes. The compiler must attempt to:

● maximize use of computational resources (processors, functional units, vector units),

● minimize the number of operations performed,

● minimize use of memory bandwidth (register, cache, network), and

● minimize the size of total memory required.

While optimization for scalar CPUs has concentrated on minimizing the dynamic instruction count (or more precisely, the number of machine cycles required), CPU-intensive applications are often at least as dependent upon the performance of the memory system as they are on the performance of the functional units.

In particular, the distance in memory between consecutively accessed elements of an array can have a major performance impact. This distance is called the stride. If a loop is accessing every fourth element of an array, it is a stride-4 loop. If every element is accessed in order, it is a stride-1 loop. Stride-1 access is desirable because it maximizes memory locality and therefore the efficiency of the cache, translation lookaside buffer (TLB), and paging systems; it also eliminates bank conflicts on vector machines.
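The following C fragment (our addition; the array length is arbitrary) contrasts the two access patterns just described. Both loops read the same array, but the stride-1 version uses every word of each cache line it fetches, while the stride-4 version uses only a fraction of each line.

    #include <stddef.h>

    #define N 4096

    /* Stride-1: consecutive memory locations; each cache line fetched
     * is fully consumed before the next one is needed. */
    double sum_stride1(const double a[N])
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            s += a[i];
        return s;
    }

    /* Stride-4: only every fourth element is touched, so most of each
     * cache line fetched from memory is wasted. */
    double sum_stride4(const double a[N])
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i += 4)
            s += a[i];
        return s;
    }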

Another key to achieving peak performance is the paired use of multiply and add operations in which the result from the multiplier can be fed into the adder. For instance, the IBM RS/6000 has a multiply-add instruction that uses the multiplier and adder in a pipelined fashion; one multiply-add can be issued each cycle. The Cray Y-MP C90 does not have a single instruction; instead the


hardware detects that the result of a vector multiply is used by a vector add, and uses a strategy called chaining in which the results from the multiplier's pipeline are fed directly into the adder. Other compound operations are sometimes implemented as well.

Use of such compound operations may allow an application to run up to twice as fast. It is therefore important to organize the code so as to use multiply-adds wherever possible.
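As a sketch of what organizing the code around multiply-adds can look like at the source level (our example, not the survey's), C99 exposes a fused multiply-add as fma() in <math.h>. A dot-product loop written this way performs one multiply-add per iteration; many compilers will also form the compound operation from a*b + c on their own when the target supports it.

    #include <math.h>
    #include <stddef.h>

    /* Dot product written so each iteration is one multiply-add. */
    double dot(const double *x, const double *y, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s = fma(x[i], y[i], s);   /* s = x[i]*y[i] + s, single rounding */
        return s;
    }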

3.1 Characterizing Performance

There are a number of variables that we have used to help describe the performance of the compiled code quantitatively:

● S is the hardware speed of a single processor in operations per second. Typically, speed is measured either in millions of instructions per second (MIPS) or in millions of floating-point operations per second (megaflops).

● P is the number of processors.

● F is the number of operations executed by a program.

● T is the time in seconds to run a program.

● U = F/ST is the utilization of the machine by a program; a utilization of 1 is ideal, but real programs typically have significantly lower utilizations. A superscalar machine can issue several instructions per cycle, but if the program does not contain a mixture of operations that matches the mixture of functional units, utilization will be lower. Similarly, a vector machine cannot be utilized fully unless the sizes of all of the arrays in the program are an exact multiple of the vector length.

● Q measures reuse of operands that are stored in memory. It is the ratio of the number of times the operand is referenced during computation to the number of times it is loaded into a register. As the value of Q rises, more reuse is being achieved, and less memory bandwidth is consumed. Values of Q below 1 indicate that redundant loads are occurring.


● Qc is an analogous quantity that measures reuse of a cache line in memory. It is the ratio of the number of words that are read out of the cache from that particular line to the number of times the cache fetches that line from memory.
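As a small worked example (the numbers are ours and purely illustrative), a processor with a peak of 200 million operations per second that takes 4 seconds to execute 2 x 10^8 operations achieves U = F/ST = 0.25, and an operand referenced six times but loaded into a register twice has Q = 3:

    #include <stdio.h>

    int main(void)
    {
        double S = 200e6;   /* hypothetical peak: 200 million ops/second */
        double F = 2e8;     /* operations executed by the program        */
        double T = 4.0;     /* measured run time in seconds              */
        double U = F / (S * T);          /* utilization, ideally 1.0     */

        double refs = 6.0, loads = 2.0;  /* hypothetical operand reuse   */
        double Q = refs / loads;         /* register reuse ratio         */

        printf("U = %.2f, Q = %.1f\n", U, Q);   /* prints U = 0.25, Q = 3.0 */
        return 0;
    }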

3.2 Model Architectures

In the Appendix we present a series of model architectures that we will use throughout this survey to demonstrate the effect of various compiler transformations. The architectures include a superscalar CPU (S-DLX), a vector CPU (V-DLX), a shared-memory multiprocessor (sMX), and a distributed-memory multiprocessor (dMX). We assume that the reader is familiar with basic principles of modern computer architecture, including RISC design, pipelining, caching, and instruction-level parallelism. Our generic architectures are based on DLX, an idealized RISC architecture introduced by Hennessy and Patterson [1990].

4. COMPILER ORGANIZATION

Figure 2 shows the design of a hypothetical compiler for a superscalar or vector machine. It includes most of the general-purpose transformations covered in this survey. Because compilers for parallel machines are still a very active area of research, we have excluded these machines from the design.

The purpose of this compiler design is to give the reader (1) an idea of how the various types of transformations fit together and (2) some of the ordering issues that arise. This organization is by no means definitive: different architectures dictate different designs, and opinions differ on how best to order the transformations.

Optimization takes place in three distinct phases, corresponding to three different representations of the program: high-level intermediate language, low-level intermediate language, and object code. First, optimizations are applied to a high-level intermediate language (HIL) that is semantically very close to the original source program. Because all the semantic information of the source program is available, higher-level transformations are easier to apply. For instance, array references are clearly distinguishable, instead of being a sequence of low-level address calculations.

Next, the program is translated to low-level intermediate form (LIL), essentially an abstract machine language, and optimized at the LIL level. Address computations provide a major source of potential optimization at this level. For instance, two references to a[5, 3] and a[7, 3] both require the computation of the address of column 3 of array a.

Finally, the program is translated to object code, and machine-specific optimizations are performed. Some optimizations, like code co-location, rely on profile information obtained by running the program. These optimizations are often implemented as binary-to-binary translations.

The high-level optimization phase begins by doing all of the large-scale restructuring that will significantly change the organization of the program. This includes various procedure-restructuring optimizations, as well as scalarization, which converts data-parallel constructs into loops. Doing this restructuring at the beginning allows the rest of the compiler to work on individual procedures in relative isolation.

Next, high-level data-flow optimization, partial evaluation, and redundancy elimination are performed. These optimizations simplify the program as much as possible, and remove extraneous code that could reduce the effectiveness of subsequent analysis.

The main focus in compilers for high-performance architectures is loop optimization. A sequence of transformations is performed to convert the loops into a form that is more amenable to optimization: where possible, perfect loop nests are created; subscript expressions are


[Figure 2 is a flowchart and is not reproduced here. It traces the program from Source Program through: tail-recursion elimination and procedure restructuring (using the procedure call graph and procedure annotations); high-level dataflow optimization (constant propagation and folding, copy propagation, common-subexpression elimination, loop-invariant code motion, loop unswitching, strength reduction); partial evaluation (algebraic simplification, short-circuiting); redundancy elimination (unreachable-code elimination, useless-code elimination, dead-variable elimination); loop preparation (loop distribution, loop normalization, loop peeling, scalar expansion, forward substitution, array padding, array alignment, idiom and reduction recognition, loop collapsing), annotated with dependence vectors, perfect loop nests, and induction variables; loop reordering (loop interchange, loop skewing, loop reversal, loop tiling, loop coalescing, strip mining, loop unrolling, cycle shrinking); low-level IL generation into the low-level intermediate language (LIL); low-level dataflow optimization (constant propagation and folding, copy propagation, common-subexpression elimination, loop-invariant code motion, strength reduction of induction-variable expressions, reassociation, induction-variable elimination); low-level redundancy elimination; code generation with cross-call register allocation into assembly language; micro-optimization (peephole optimization, superoptimization) into the executable program; and instruction cache optimization (code co-location, displacement minimization, cache conflict avoidance), producing the final optimized code.]

Figure 2. Organization of a hypothetical optimizing compiler.


rewritten in terms of the induction variables; and so on. Then the loop iterations are reordered to maximize parallelism and locality. A postprocessing phase organizes the resulting loops to minimize loop overhead.

After the loop optimization, the program is converted to low-level intermediate language (LIL). Many of the data-flow and redundancy elimination optimizations that were applied to the HIL are reapplied to the LIL in order to eliminate inefficiencies in the component expressions generated from the high-level constructs. Additionally, induction variable optimizations are performed, which are often important for high performance on uniprocessor machines. A final LIL optimization pass applies procedure call optimizations.

Code generation converts LIL into assembly language. This phase of the compiler is responsible for instruction selection, instruction scheduling, and register allocation. The compiler applies low-level optimizations during this phase to improve further the performance of the code.

Finally, object code is generated by the assembler, and the object files are linked into an executable. After profiling, profile-based cache optimization can be applied to the executable program itself.

5. DEPENDENCE ANALYSIS

Among the various forms of analysis used by optimizing compilers, the one we rely on most heavily in this survey is dependence analysis [Banerjee 1988b; Wolfe 1989b]. This section introduces dependence analysis, its terminology, and the underlying theory.

A dependence is a relationship between two computations that places constraints on their execution order. Dependence analysis identifies these constraints, which are then used to determine whether a particular transformation can be applied without changing the semantics of the computation.

5.1 Types of Dependences

There are two kinds of dependence: control dependence and data dependence.

There is a control dependence between statement 1 and statement 2, written S1 →c S2, when statement S1 determines whether S2 will be executed. For example:

      1   if (a = 3) then
      2      b = 10
          end if

Two statements have a data dependence if they cannot be executed simultaneously due to conflicting uses of the same variable. There are three types of data dependence: flow dependence (also called true dependence), antidependence, and output dependence. S4 has a flow dependence on S3 (denoted by S3 → S4) when S3 must be executed first because it writes a value that is read by S4. For example:

      3   a = c*10
      4   d = 2*a + c

S6 has an antidependence on S5 (denoted by S5 ⇸ S6) when S6 writes a variable that is read by S5:

      5   e = f*4 + g
      6   g = 2*h

An antidependence does not constrain execution as tightly as a flow dependence. As before, the code will execute correctly if S6 is delayed until after S5 completes. An alternative solution is to use two memory locations g5 and g6 to hold the values read in S5 and written in S6, respectively. If the write by S6 completes first, the old value will still be available in g5.

An output dependence holds when both statements write the same variable:

      7   a = b*c
      8   a = d + e

We denote this condition by writing S7 ⇴ S8. Again, as with an antidependence, storage replication can allow the statements to execute concurrently. In this short example there is no intervening use of a and no control transfer between the two assignments, so the computation in S7 is redundant and can actually be eliminated.
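The same read/write patterns can be collected into a single C fragment (our rendering of the examples above; the numeric values are immaterial), with comments naming the dependence that each pair of statements induces.

    /* Flow, anti-, and output dependences in one straight-line fragment. */
    void dependence_kinds(void)
    {
        double a, d, e;
        double b = 5.0, c = 1.0, f = 2.0, g = 3.0, h = 4.0;

        a = c * 10.0;        /* S3: writes a                               */
        d = 2.0 * a + c;     /* S4: reads a   -> flow dependence S3 -> S4  */

        e = f * 4.0 + g;     /* S5: reads g                                */
        g = 2.0 * h;         /* S6: writes g  -> antidependence S5 -> S6   */

        a = b * c;           /* S7: writes a                               */
        a = d + e;           /* S8: writes a  -> output dependence S7 -> S8 */

        (void)a; (void)g;    /* silence unused-result warnings             */
    }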


The fourth possible relationship, called an input dependence, holds when two accesses to the same memory location are both reads. Although an input dependence imposes no ordering constraints, the compiler can make note of it for the purposes of optimizing data placement on multiprocessors.

We denote an unspecified type of dependence by S1 ⇒ S2. Another common notation for dependences uses S3 δ S4 for the flow dependence S3 → S4, S5 δ̄ S6 for the antidependence S5 ⇸ S6, and S7 δ° S8 for the output dependence S7 ⇴ S8.

In the case of data dependence, when we write X ⇒ Y we are being somewhat imprecise: the individual variable references within a statement generate the dependence, not the statement as a whole. In the output dependence example above, b, c, d, and e can all be read from memory in any order, and the results of b*c and d+e can be executed as soon as their operands have been read from memory. S7 ⇴ S8 actually means that the store of the value b*c into a must precede the store of the value d+e into a. When there is a potential ambiguity, we will distinguish between different variable references within statements.

5.2 Representing Dependences

To capture the dependence information for a piece of code, the compiler creates a dependence graph; typically each node in the graph represents one statement. An arc between two nodes indicates that there is a dependence between the computations they represent.

Because it can be cumbersome to account for both control and data dependence during analysis, sometimes compilers convert control dependence into data dependence using a technique called if-conversion [Allen et al. 1983]. If-conversion introduces additional boolean variables that encode the conditional predicates; every statement whose execution depends on the conditional is then modified to test the boolean variable. In the transformed code, data dependence subsumes control dependence.
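A minimal source-level sketch of if-conversion in C (our illustration; production compilers apply it to an intermediate form rather than to source text) replaces the branch with a predicate variable, so that the statement that formerly depended on control flow now depends only on the data value p:

    #include <stdbool.h>

    void before(double *a, const double *b, int n, double t)
    {
        for (int i = 0; i < n; i++) {
            if (b[i] > t)              /* control dependence on the test */
                a[i] = b[i] - t;
            else
                a[i] = 0.0;
        }
    }

    void after_if_conversion(double *a, const double *b, int n, double t)
    {
        for (int i = 0; i < n; i++) {
            bool p = (b[i] > t);           /* predicate encodes the test  */
            a[i] = p ? (b[i] - t) : 0.0;   /* assignment is guarded by p; the
                                              control dependence is now a
                                              data dependence on p        */
        }
    }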

      do i = 2, n
1        a[i] = a[i] + c
2        b[i] = a[i-1] * b[i]
      end do

Figure 3. Loop-carried dependence.

5.3 Loop Dependence Analysis

So far we have examined dependence in the context of straight-line code with conditionals—analyzing loops is a more complicated problem. In straight-line code, each statement is executed at most once, so the dependence arcs described so far capture all the possible dependence relationships. In loops, each statement may be executed many times, and for many transformations it is necessary to describe dependences that exist between iterations, called loop-carried dependences.

A simple example of a loop-carried dependence is shown in Figure 3. There is no dependence between S1 and S2 within any single iteration of the loop, but there is one between two successive iterations. When i = k, S2 reads the value of a[k-1] written by S1 in iteration k-1.

To compute dependence information for loops, the key problem is to understand the use of arrays; scalar variables are relatively easy to manage. To track array behavior, the compiler must analyze the subscript expressions in each array reference.

To discover whether there is a dependence in the loop nest, it is sufficient to determine whether any of the iterations can write a value that is read or written by any of the other iterations.

Depending on the language, loop increments may be arbitrary expressions. However, the dependence analysis algorithms may require that the loops have only unit increments. When they do not, the compiler may be able to normalize them to fit the requirements of the analysis, as described in Section 6.3.6. For the remainder of this section we will assume that all loops are incremented by 1 for each iteration.
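As a preview of that normalization (our sketch in C; the full transformation is described in Section 6.3.6), a loop with increment 4 can be rewritten so that a new counter runs with unit stride and the original index is recomputed as an affine function of it:

    /* Original: runs i over lo, lo+4, lo+8, ... while i <= hi. */
    void original(double *a, int lo, int hi)
    {
        for (int i = lo; i <= hi; i += 4)
            a[i] = 0.0;
    }

    /* Normalized: the new counter k runs 0, 1, 2, ... with unit stride,
     * and the old index is recovered as lo + 4*k. */
    void normalized(double *a, int lo, int hi)
    {
        int trips = (hi >= lo) ? (hi - lo) / 4 + 1 : 0;
        for (int k = 0; k < trips; k++) {
            int i = lo + 4 * k;
            a[i] = 0.0;
        }
    }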


      do i1 = l1, u1
         ...
            do id = ld, ud
1              a[f1(i1, ..., id), ..., fm(i1, ..., id)] = ...
2              ... = a[g1(i1, ..., id), ..., gm(i1, ..., id)]
            end do
         ...
      end do

Figure 4. General loop nest.

Figure 4 shows a generalized perfect nest of d loops. The body of the loop nest reads and writes elements of the m-dimensional array a. The functions fi and gi map the current values of the loop iteration variables to integers that index the ith dimension of a. The generalized loop can give rise to any type of data dependence: for instance, two different iterations may write the same element of a, creating an output dependence.

An iteration can be uniquely named by a vector of d elements I = (i1, ..., id), where each index falls within the iteration range of its corresponding loop in the nesting (that is, lp ≤ ip ≤ up). The outermost loop corresponds to the leftmost index.

We wish to discover what loop-carried dependences exist between the two references to a, and to describe somehow those dependences that exist. Clearly, a reference in iteration J can depend only on another reference in iteration I that was executed before it, not after it. We formalize the notion of "before" with the < relation:

      I < J iff ∃p : (ip < jp ∧ ∀q < p : iq = jq).

Note that this definition must be extended slightly when the loop increment may be negative.

A reference in some iteration J depends on a reference in iteration I if and only if at least one reference is a write and

      I < J ∧ ∀p : fp(I) = gp(J).

In other words, there is a dependence when the values of the subscripts are the same in different iterations. If no such I and J exist, the two references are independent across all iterations of the loop. In the case of an output dependence caused by the same write in different iterations, the condition is simply ∀p : fp(I) = fp(J).

For example, suppose that we are attempting to describe the behavior of the loop in Figure 5(a). Each iteration of the inner loop writes the element a[i, j]. There is a dependence if any other iteration reads or writes that same element. In this case, there are many pairs of iterations that depend on each other. Consider iterations I = (2, 3) and J = (3, 2). Iteration I occurs first, and writes the value a[2, 3]. This value is read in iteration J, so there is a flow dependence from iteration I to iteration J. Extending the notation for dependences, we write I → J.

When X → Y, we define the dependence distance as Y - X = (y1 - x1, ..., yd - xd). In Figure 5(a), the dependence distance is J - I = (1, -1). When all the dependence distances for a specific pair of references are the same, the potentially unbounded set of dependences can be represented by the dependence distance. When a dependence distance is used to describe the dependences for all iterations, it is called a distance vector (introduced by Kuck [1978] and Muraoka [1971]).
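The distance vector for Figure 5(a) can be confirmed mechanically. The brute-force C sketch below (our addition, for a small bound n) enumerates every pair of iterations in which the write of a[i, j] and the read of a[i-1, j+1] touch the same element, and prints the difference of the two iteration vectors; every pair it finds has distance (1, -1).

    #include <stdio.h>

    int main(void)
    {
        int n = 6;   /* small bound, enough to exercise every pair */

        /* Writer iteration (i1, j1) writes a[i1, j1];
         * reader iteration (i2, j2) reads  a[i2-1, j2+1]. */
        for (int i1 = 2; i1 <= n; i1++)
            for (int j1 = 1; j1 <= n - 1; j1++)
                for (int i2 = 2; i2 <= n; i2++)
                    for (int j2 = 1; j2 <= n - 1; j2++)
                        if (i1 == i2 - 1 && j1 == j2 + 1)
                            printf("distance = (%d, %d)\n",
                                   i2 - i1, j2 - j1);
        return 0;
    }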

A legal distance vector V must be lexicographically positive, meaning that 0 < V (the first nonzero element of the distance vector must be positive).


      do i = 2, n
         do j = 1, n-1
            a[i, j] = a[i, j] + a[i-1, j+1]
         end do
      end do

      (a) {(1, -1)}

      do i = 1, n
         do j = 2, n-1
            a[j] = (a[j] + a[j-1] + a[j+1]) / 3
         end do
      end do

      (b) {(0, 1), (1, 0), (1, -1)}

Figure 5. Distance vectors.

A negative element in the distance vector means that the dependence in the corresponding loop is on a higher-numbered iteration. If the first nonzero element were negative, this would indicate a dependence on a future iteration, which is impossible.

The reader should note that distance vectors describe dependences among iterations, not among array elements. The operations on array elements create the dependences, but the distance vectors describe dependences among iterations. For instance, the loop nest that updates the one-dimensional array a in Figure 5(b) has dependences described by the set of two-element distance vectors {(0, 1), (1, 0), (1, -1)}.

In some cases it is not possible to determine the exact dependence distance at compile-time, or the dependence distance may vary between iterations; but there is enough information to partially characterize the dependence. A direction vector (introduced by Wolfe [1989b]) is commonly used to describe such dependences.

For a dependence I → J, we define the direction vector W = (w1, ..., wd), where

      wp =  <   if ip < jp
            =   if ip = jp
            >   if ip > jp.

We will use the general term dependence vector to encompass both distance and direction vectors. In Figure 5 the direction vector for loop (a) is (<, >), and the direction vectors for loop (b) are {(=, <), (<, =), (<, >)}. Note that a direction vector entry of < corresponds to a distance vector entry that is greater than zero.

The dependence behavior of a loop is described by the set of dependence vectors for each pair of possibly conflicting references. These can be summarized into a single loop direction vector, at the expense of some loss of information (and potential for optimization). The dependences of the loop in Figure 5(b) can be summarized as (≤, *). The symbol ≠ denotes both a < and a > direction, and * denotes <, >, and =.

A particular dependence between statements S1 and S2 is denoted by writing the dependence vector above the arrow, for example S1 →(<) S2.

Burke and Cytron [1986] present a framework for dependence analysis that defines a hierarchy of dependence vectors and allows flow and antidependences to be treated symmetrically: an antidependence is simply a flow dependence with an impossible dependence vector (V < 0).

5.4 Subscript Analysis

In discussing the analysis of loop-carried dependences, we omitted an important detail: how the compiler decides whether two array references might refer to the same element in different iterations. In examining a loop nest, first the compiler tries to prove that different iterations are independent by applying various tests to the subscript expressions. These tests rely on the fact that the expressions are almost always linear. When dependences are found, the compiler tries to describe them with a direction or distance vector. If the subscript expressions are too complex to analyze, the compiler assumes the statements are fully dependent on one another such that no change in execution order is permitted.


There are a large variety of tests, all of which can prove independence in some cases. It is infeasible to solve the problem directly, even for linear subscript expressions, because finding dependences is equivalent to the NP-complete problem of finding integer solutions to systems of linear Diophantine equations [Banerjee et al. 1979]. Two general and approximate tests are the GCD test [Towle 1976] and Banerjee's inequalities [Banerjee 1988a].

Additionally, there are a large number of exact tests that exploit some subscript characteristics to determine whether a particular type of dependence exists. One of the less expensive exact tests is the Single-Index Test [Banerjee 1979; Wolfe 1989b]. The Delta Test [Goff et al. 1991] is a more general strategy that examines some combinations of dimensions. Other tests that are more expensive to evaluate consider the multiple subscript dimensions simultaneously, such as the λ-test [Li et al. 1990], multidimensional GCD [Banerjee 1988a], and the power test [Wolfe and Tseng 1992]. The omega test [Pugh 1992] uses a linear programming algorithm to solve the dependence equations. The SUIF compiler project at Stanford has had success applying a series of exact tests, starting with the cheaper ones [Maydan et al. 1991]. They use a general algorithm (Fourier-Motzkin variable elimination [Dantzig and Eaves 1974]) as a backup.
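As a sketch of the simplest of these tests, the C function below (our illustration) applies the GCD test to one dimension with affine subscripts a*i + b at the write and c*i' + d at the read: integer solutions of the dependence equation can exist only if gcd(a, c) divides d - b, so when it does not, the references are proved independent. A real implementation would also take the loop bounds into account and handle several index variables.

    #include <stdlib.h>

    static int gcd(int x, int y)
    {
        x = abs(x);
        y = abs(y);
        while (y != 0) {
            int t = x % y;
            x = y;
            y = t;
        }
        return x;
    }

    /* GCD test for subscripts a*i + b (written) and c*i' + d (read).
     * Returns 1 if a dependence is possible, 0 if proved independent. */
    int gcd_test(int a, int b, int c, int d)
    {
        int g = gcd(a, c);
        if (g == 0)                 /* both coefficients are zero */
            return b == d;
        return (d - b) % g == 0;
    }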

One advantage of the multiple-dimension exact tests is their ability to handle coupled subscript expressions [Goff et al. 1991]. Two expressions are coupled if the same variable appears in both of them. Figure 6 shows a loop that has no dependences between iterations. The statement reads the values to the right of the diagonal and updates the diagonal. A test that only examines one dimension of the subscript expressions at a time, however, will not detect the independence because there are many pairs of iterations where the expression i in one is equal to the expression i+1 in the other. The more general exact tests would discover that the iterations are independent despite

      do i = 1, n-1
         a[i, i] = a[i, i+1]
      end do

Figure 6. Coupled subscripts.

the presence of coupled subscript expressions.

Section 9.2 discusses a number of studies that examine the subscript expressions in scientific applications and evaluate the effectiveness of different dependence tests.

6. TRANSFORMATIONS

This section catalogs general-purpose program transformations; those transformations that are only applicable on parallel machines are discussed in Section 7. The primary emphasis is on loops, since that is generally where most of the execution time is spent. For each transformation, we provide an example, discuss the benefits and shortcomings, identify any variants, and provide citations.

A major goal of optimizing compilers for high-performance architectures is to discover and exploit parallelism in loops. We will indicate when a loop can be executed in parallel by using a do all loop instead of a do loop. The iterations of a do all loop can be executed in any order, or all at once.

A standard reference on compilers in general is the "Red Dragon" book, to which we refer for some of the most common examples [Aho et al. 1986]. We draw also on previous summaries [Allen and Cocke 1971; Kuck 1977; Padua and Wolfe 1986; Rau and Fisher 1993; Wolfe 1989b].

Because some transformations were already familiar to programmers who applied them manually, often we cite only the work of researchers who have systematized and automated the implementation of these transformations. Additionally, we omit citations to works that are restricted to basic blocks when global optimization techniques exist. Even so, the origin of some optimizations is murky. For instance, Ershov's ALPHA compiler


[Ershov 1966] performed interprocedural constant propagation, albeit in a limited form, in 1964!

6.1 Data-Flow-Based Loop Transformations

A number of classical loop optimizations are based on data-flow analysis, which tracks the flow of data through a program's variables [Muchnick and Jones 1981]. These optimizations are summarized by Aho et al. [1986].

      do i = 1, n
         a[i] = a[i] + c*i
      end do

      (a) original loop

      T = c
      do i = 1, n
         a[i] = a[i] + T
         T = T + c
      end do

      (b) after strength reduction

Figure 7. Strength reduction example.

6.1.1 Loop-Based Strength Reduction

Reduction in strength replaces an expression in a loop with one that is equivalent but uses a less expensive operator [Allen 1969; Allen et al. 1981]. Figure 7(a) shows a loop that contains a multiplication. Figure 7(b) is a transformed version of the loop where the multiplication has been replaced by an addition.
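The same transformation expressed in C (our rendering of Figure 7; the array is assumed to have at least n+1 elements so that the 1-based subscripts of the Fortran example carry over):

    /* Before: one multiplication per iteration. */
    void before(double *a, int n, double c)
    {
        for (int i = 1; i <= n; i++)
            a[i] = a[i] + c * i;
    }

    /* After strength reduction: the product c*i is kept in t and
     * updated with an addition instead of being recomputed. */
    void after(double *a, int n, double c)
    {
        double t = c;                 /* value of c*i for i = 1 */
        for (int i = 1; i <= n; i++) {
            a[i] = a[i] + t;
            t = t + c;
        }
    }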

Table 1 shows how strength reduction can be applied to a number of operations. The operation in the first column is assumed to appear within a loop that iterates over i from 1 to n. When the loop is transformed, the compiler initializes a temporary variable T with the expression in the second column. The operation within the loop is replaced by the expression in the third column, and the value of T is updated each iteration with the value in the fourth.

A variable whose value is derived from the number of iterations that have been executed by an enclosing loop is called an induction variable. The loop control variable of a do statement is the most common kind of induction variable, but other variables may also be induction variables.

The most common use of strength reduction, often implemented as a special case, is strength reduction of induction variable expressions [Aho et al. 1986; Allen 1969; Allen et al. 1981; Cocke and Schwartz 1970].

Table 1. Strength Reductions (the variable c is loop invariant; x may vary between iterations)

      Expression    Initialization    Use      Update
      c × i         T = c             T        T = T + c
      c^i           T = c             T        T = T × c
      (-1)^i        T = -1            T        T = -T
      x/c           T = 1/c           x × T

Strength reduction can be applied to products involving induction variables by converting them to references to an equivalent running sum, as shown in Figure 8(a-c). This special case is most important on architectures in which integer multiply operations take more cycles than integer additions. (Current examples include the SPARC [Sun Microsystems 1991] and the Alpha [Sites 1992].) Strength reduction may also make other optimizations possible, in particular the elimination of induction variables, as is shown in the next section.

The term strength reduction is also applied to operator substitution in a non-loop context, like replacing x × 2 with x + x. These optimizations are covered in Section 6.6.7.

6.1.2 Induction Variable Elimination

Once strength reduction has been performed on induction variable expressions, the compiler can often eliminate the original variable entirely. The loop exit test must then be expressed in terms of one of the strength-reduced induction variables; the optimization is called linear function test replacement [Allen 1969; Aho et al. 1986]. The replacement not only reduces the number of operations in


do i = 1, n
  a[i] = a[i] + c
end do

(a) original code

      LF    F4, c(R30)      ;load c into F4
      LW    R8, n(R30)      ;load n into R8
      LI    R9, #1          ;set i (R9) to 1
      ADDI  R12, R30, #a    ;R12=address(a[1])
Loop: MULTI R10, R9, #4     ;R10=i*4
      ADDI  R10, R12, R10   ;R10=address(a[i+1])
      LF    F5, -4(R10)     ;load a[i] into F5
      ADDF  F5, F5, F4      ;a[i]:=a[i]+c
      SF    -4(R10), F5     ;store new a[i]
      SLT   R11, R9, R8     ;R11 = i<n?
      ADDI  R9, R9, #1      ;i=i+1
      BNEZ  R11, Loop       ;if i<n, goto Loop

(b) initial compiled loop

      LF    F4, c(R30)      ;load c into F4
      LW    R8, n(R30)      ;load n into R8
      LI    R9, #1          ;set i (R9) to 1
      ADDI  R10, R30, #a    ;R10=address(a[1])
Loop: LF    F5, (R10)       ;load a[i] into F5
      ADDF  F5, F5, F4      ;a[i]:=a[i]+c
      SF    (R10), F5       ;store new a[i]
      ADDI  R9, R9, #1      ;i=i+1
      ADDI  R10, R10, #4    ;R10=address(a[i+1])
      SLT   R11, R9, R8     ;R11 = i<n?
      BNEZ  R11, Loop       ;if i<n, goto Loop

(c) after strength reduction: R10 is a running sum instead of being recomputed from R9 and R12

      LF    F4, c(R30)      ;load c into F4
      LW    R8, n(R30)      ;load n into R8
      ADDI  R10, R30, #a    ;R10=address(a[1])
      MULTI R8, R8, #4      ;R8=n*4
      ADDI  R8, R10, R8     ;R8=address(a[n+1])
Loop: LF    F5, (R10)       ;load a[i] into F5
      ADDF  F5, F5, F4      ;a[i]:=a[i]+c
      SF    (R10), F5       ;store new a[i]
      ADDI  R10, R10, #4    ;R10=address(a[i+1])
      SLT   R11, R10, R8    ;R11 = R10<R8?
      BNEZ  R11, Loop       ;if R11, goto Loop

(d) after elimination of induction variable (R9)

Figure 8. Induction variable optimizations.

do i = 1, n
  a[i] = a[i] + sqrt(x)
end do

(a) original loop

if (n > 0) C = sqrt(x)
do i = 1, n
  a[i] = a[i] + C
end do

(b) after code motion

Figure 9. Loop-invariant code motion.

Figure 8(d) shows the result of applying induction variable elimination to the strength-reduced code from the previous section.
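At the source level, the combined effect of strength reduction and linear function test replacement can be pictured as follows. This is an illustrative sketch only; as Figure 8 shows, the compiler actually performs the transformation on the intermediate or assembly code:

do i = 1, n
  a[4*i] = a[4*i] + c
end do

(original loop; i is the induction variable)

T = 4
do while (T <= 4*n)
  a[T] = a[T] + c
  T = T + 4
end do

(after strength reduction of 4*i and linear function test replacement; the exit test is expressed in terms of T, and i has been eliminated)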

6.1.3 Loop-Invariant Code Motion

When a computation appears inside a loop but its result does not change between iterations, the compiler can move that computation outside the loop [Aho et al. 1986; Cocke and Schwartz 1970].

Code motion can be applied at a high level to expressions in the source code, or at a low level to address computations. The latter is particularly relevant when indexing multidimensional arrays or dereferencing pointers, as when the inner loop of a C program contains an expression like a.b + c.d[i].

Figure 9(a) shows an example in which an expensive transcendental function call is moved outside of the inner loop. The test in the transformed code in Figure 9(b) ensures that if the loop is never executed, the moved code is not executed either, lest it raise an exception.

The precomputed value is generally assigned to a register. If registers are scarce and the expression moved is inexpensive to compute, code motion may actually deoptimize the code, since register spills will be introduced in the loop.

Although code motion is sometimes referred to as code hoisting, hoisting is a more general term referring to any transformation that moves a computation to an earlier point in the program [Aho et al. 1986]. While loop-invariant code motion is one particularly common form of hoisting, there are others: an expression can be evaluated earlier to reduce register pressure or avoid arithmetic unit latency, or a load instruction might be moved upward to reduce the effect of memory access latency.


do i = 2, n
  a[i] = a[i] + c
  if (x < 7) then
    b[i] = a[i] * c[i]
  else
    b[i] = a[i-1] * b[i-1]
  end if
end do

(a) original loop

if (n > 1) then
  if (x < 7) then
    do all i = 2, n
      a[i] = a[i] + c
      b[i] = a[i] * c[i]
    end do all
  else
    do i = 2, n
      a[i] = a[i] + c
      b[i] = a[i-1] * b[i-1]
    end do
  end if
end if

(b) after unswitching

Figure 10. Loop unswitching.


6.1.4 Loop Unswitching

Loop unswitching is applied when a loop contains a conditional with a loop-invariant test condition. The loop is then replicated inside each branch of the conditional, saving the overhead of conditional branching inside the loop, reducing the code size of the loop body, and possibly enabling the parallelization of a branch of the conditional [Allen and Cocke 1971].

Conditionals that are candidates for unswitching can be detected during the analysis for code motion, which identifies loop-invariant values.

In Figure 10(a) the variable x is loop invariant, allowing the loop to be unswitched and the true branch to be executed in parallel, as shown in Figure 10(b). Note that, as with loop-invariant code motion, if there is any chance that the condition evaluation will cause an exception, it must be guarded by a test that the loop will be executed.

In a loop nest where the inner loop has unknown bounds, if code is generated straightforwardly there will be a test before the body of the inner loop to determine whether it should be executed at all. The test for the inner loop will be repeated every time the outer loop is executed. When the compiler uses an intermediate representation of the program that exposes the test explicitly, unswitching can be applied to move the test outside of the outer loop, as sketched below. The RS/6000 XL C/Fortran compiler uses unswitching for this purpose [O'Brien et al. 1990].
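A source-level picture of this use of unswitching (illustrative only; in practice the zero-trip test is explicit in the compiler's intermediate representation rather than in the source) might look like this:

do i = 1, n
  do j = 1, m
    a[i, j] = a[i, j] + c
  end do
end do

(straightforwardly generated code tests whether m >= 1 on every iteration of the outer loop)

if (m >= 1) then
  do i = 1, n
    do j = 1, m
      a[i, j] = a[i, j] + c
    end do
  end do
end if

(after unswitching on the loop-invariant condition m >= 1, the per-iteration zero-trip test of the inner loop can be omitted)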

6.2 Loop Reordering

In this section we describe transformations that change the relative order of execution of the iterations of a loop nest or nests. These transformations are primarily used to expose parallelism and improve memory locality.

Some compilers only apply reordering transformations to perfect loop nests (see Section 6.2.7). To increase the opportunities for optimization, such a compiler can sometimes apply loop distribution to extract perfect loop nests from an imperfect nesting.

The compiler determines whether a loop can be executed in parallel by examining the loop-carried dependences. The obvious case is when all the dependence distances for the loop are 0 (direction =), meaning that no dependences are carried across iterations by the loop. Figure 11(a) is an example; the distance vector for the loop is (0, 1), so the outer loop is parallelizable.


do i = 1, n
  do j = 2, n
    a[i, j] = a[i, j-1] + c
  end do
end do

(a) outer loop is parallelizable

do i = 1, n
  do j = 1, n
    a[i, j] = a[i-1, j] + a[i-1, j+1]
  end do
end do

(b) inner loop is parallelizable

Figure 11. Dependence conditions for parallelizing loops.

More generally, the pth loop in a loop nest is parallelizable if every distance vector V = (v1, ..., vp, ..., vd) satisfies

    vp = 0  or  there exists q < p such that vq > 0.

In Figure 11(b), the distance vectors are {(1, 0), (1, -1)}, so the inner loop is parallelizable. Both references on the right-hand side of the expression read elements of a from row i-1, which was written during the previous iteration of the outer loop. Therefore the elements of row i may be calculated and written in any order.

6.2.1 Loop Interchange

Loop interchange exchanges the position of two loops in a perfect loop nest, generally moving one of the outer loops to the innermost position [Allen and Kennedy 1984; 1987; Wolfe 1989b]. Interchange is one of the most powerful transformations and can improve performance in many ways.

Loop interchange may be performed to:

* enable vectorization by interchanging an inner, dependent loop with an outer, independent loop;
* improve vectorization by moving the independent loop with the largest range into the innermost position;
* improve parallel performance by moving an independent loop outward in a loop nest to increase the granularity of each iteration and reduce the number of barrier synchronizations;
* reduce stride, ideally to stride 1; and
* increase the number of loop-invariant expressions in the inner loop.


double precision a[*]
do i = 1, 1024*stride, stride
  a[i] = a[i] + c
end do

(a) loop with varying stride

Stride   Cache Misses   TLB Misses   Relative Speed (%)
   1           64             2            100
   2          128             4             83
  16         1024            32             23
  64         1024           128             19
 256         1024           512             12
 512         1024          1024              8

(b) effect of stride

Figure 12. Predicted effect of stride on the performance of an IBM RS/6000 for the above loop. Array elements are double precision (8 bytes); miss rates are per 1024 iterations. Beyond stride 16, TLB misses dominate [IBM 1992].

Care must be taken that these benefits do not cancel each other out. For instance, an interchange that improves register reuse may change a stride-1 access pattern to a stride-n access pattern with much lower overall performance due to increased cache misses. Figure 12 demonstrates the dramatic effect of different strides on an IBM RS/6000.

In Figure 13(a), the inner loop accesses array a with stride n (recall that we are assuming the Fortran convention of column-major storage order). By interchanging the loops, we convert the inner loop to stride-1 access, as shown in Figure 13(b).


do i = 1, n
  do j = 1, n
    total[i] = total[i] + a[i, j]
  end do
end do

(a) original loop nest

do j = 1, n
  do i = 1, n
    total[i] = total[i] + a[i, j]
  end do
end do

(b) interchanged loop nest

Figure 13. Loop interchange.

For a large array in which less than one column fits in the cache, this optimization reduces the number of cache misses on a from n² to n² × elementsize/linesize, or n²/16 with 4-byte elements and the 64-byte lines of S-DLX. However, the original loop allows total[i] to be placed in a register, eliminating the load/store operations in the inner loop (see scalar replacement in Section 6.5.4). So the optimized version increases the number of load/store operations for total from 2n to 2n². If a fits in the cache, the original loop is better.

On a vector architecture, the transformed loop enables vectorization by eliminating the dependence on total[i] in the inner loop.

Interchanging loops is legal when the altered dependences are legal and when the loop bounds can be switched. If two loops p and q in a perfect loop nest of d loops are interchanged, each dependence vector V = (v1, ..., vp, ..., vq, ..., vd) in the original loop nest becomes V' = (v1, ..., vq, ..., vp, ..., vd) in the transformed loop nest. If V' is lexicographically positive, then the dependence relationships of the original loop are satisfied.

A loop nest of just two loops can be interchanged unless it has a dependence vector of the form (<, >). Figure 14(a) shows a loop nest with the dependence (1, -1), giving rise to the loop-carried dependences shown in Figure 14(b); the order in which the iterations are executed is shown by the dotted line. The traversal order after interchange is shown in Figure 14(c): some iterations would be executed before iterations that they depend on, so the interchange is illegal.

do i = 2, n
  do j = 1, n-1
    a[i, j] = a[i-1, j+1]
  end do
end do

(a)

[Parts (b) and (c) are iteration-space diagrams and are not reproduced here.]

Figure 14. Original loop (a); original traversal order (b); traversal order after interchange (c).

Switching the loop bounds is straightforward when the iteration space is rectangular, as in the loop nest in Figure 13. In this case the bounds of the inner loop are independent of the indices of the containing loop, and the two can simply be exchanged. When the iteration space is not rectangular, computing the bounds is more complex. Triangular spaces are often used by programmers, and trapezoidal spaces are introduced by loop skewing (discussed in the next section). Further techniques are necessary to manage imperfectly nested loops. Some of the variations are discussed in detail by Wolfe [1989b] and Wolf and Lam [1991].

6.2.2 Loop Skewing

Loop skewing is an enabling transformation that is primarily useful in combination with loop interchange [Lamport 1974; Muraoka 1971; Wolfe 1989b]. Skewing was invented to handle wavefront computations, so called because the updates to the array propagate like a wave across the iteration space.


do i = 2, n-1
  do j = 2, m-1
    a[i, j] = (a[i-1, j] + a[i, j-1] +
               a[i+1, j] + a[i, j+1])/4
  end do
end do

(a) original code; dependences {(1, 0), (0, 1)}

[Parts (b) and (c) are iteration-space diagrams of the original and skewed spaces and are not reproduced here.]

do i = 2, n-1
  do j = i+2, i+m-1
    a[i, j-i] = (a[i-1, j-i] + a[i, j-1-i] +
                 a[i+1, j-i] + a[i, j+1-i])/4
  end do
end do

(d) skewed code; dependences {(1, 1), (0, 1)}

do j = 4, n+m-2
  do i = max(2, j-m+1), min(n-1, j-2)
    a[i, j-i] = (a[i-1, j-i] + a[i, j-1-i] +
                 a[i+1, j-i] + a[i, j+1-i])/4
  end do
end do

(e) skewed and interchanged code; dependences {(1, 0), (1, 1)}

Figure 15. Loop skewing.

In Figure 15(a) we show a typical wavefront computation. Each element is computed by averaging its four nearest neighbors. While neither of the loops is parallelizable in its original form, each array diagonal ("wavefront") could be computed in parallel. The iteration space and dependences are shown in Figure 15(b), with the dotted lines indicating the wavefronts.

Skewing is performed by adding the outer loop index multiplied by a skew factor, f, to the bounds of the inner iteration variable, and then subtracting the same quantity from every use of the inner iteration variable inside the loop. Because it alters the loop bounds but then alters the uses of the corresponding index variables to compensate, skewing does not change the meaning of the program and is always legal.

The original loop nest in Figure 15(a) can be interchanged, but neither loop can be parallelized, because there is a dependence on both the inner loop, (0, 1), and on the outer loop, (1, 0).

The result when f = 1 is shown in Figure 15(c-d). The transformed code is equivalent to the original, but the effect on the iteration space is to align the diagonal wavefronts of the original loop nest so that for a given value of j, all iterations in i can be executed in parallel.

To expose this parallelism, the skewed loop nest must also be interchanged. After skewing and interchange, the loop nest has distance vectors {(1, 0), (1, 1)}. The first dependence allows the inner loop to be parallelized because the corresponding dependence distance is 0. The second dependence allows the inner loop to be parallelized because it is a dependence on previous iterations of the outer loop.

Skewing can expose parallelism for a nest of two loops with a set of distance vectors {Vk} only if every dependence whose inner component vk2 is negative has an outer component vk1 >= 1, so that a suitable skew factor exists. When skewing by f, the original distance vector (v1, v2) becomes (v1, f*v1 + v2). For any dependence where v2 < 0, the goal is to find f such that f*v1 + v2 >= 1. The correct skew factor f is computed by taking the maximum of fk = ceiling((1 - vk2)/vk1) over all the dependences [Kennedy et al. 1993].

Interchanging of skewed loops is complicated by the fact that their bounds depend on the iteration variables of the enclosing loops. For two loops with bounds i1 = l1, u1 and i2 = l2, u2, where l2 and u2 are expressions independent of i1, the skewed inner loop has bounds i2 = f*i1 + l2, f*i1 + u2. After interchange, the bounds are

do i2 = f*l1 + l2, f*u1 + u2
  do i1 = max(l1, ceiling((i2 - u2)/f)), min(u1, floor((i2 - l2)/f))


An alternative method for handling wavefront computations is supernode partitioning [Irigoin and Triolet 1988].

6.2.3 Loop Reversal

Reversal changes the direction in which the loop traverses its iteration range [Wedel 1975]. It is often used in conjunction with other iteration-space reordering transformations because it changes the dependence vectors [Wolfe 1989b].

As an optimization in its own right, reversal can reduce loop overhead by eliminating the need for a compare instruction on architectures without a compound compare-and-branch (such as the Alpha [Sites 1992]). The loop is reversed so that the iteration variable runs down to zero, allowing the loop to end with a branch-if-not-equal-to-zero instruction (BNEZ, on S-DLX).
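For example, a loop with no loop-carried dependences can simply be run in the opposite direction so that the generated code counts down and exits when the counter reaches zero (a minimal sketch, not one of the survey's numbered figures):

do i = 1, n
  a[i] = a[i] + c
end do

(a) original loop

do i = n, 1, -1
  a[i] = a[i] + c
end do

(b) reversed loop; the counter can be kept in a register and the loop closed with a single branch-on-nonzero instruction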

Reversal can also eliminate the need for temporary arrays in implementing Fortran 90 array statements (see Section 6.4.3).

If loop p in a nest of d loops is reversed, then for each dependence vector V the entry vp is negated. The reversal is legal if each resulting vector V' is lexicographically positive, that is, when

    vp = 0  or  there exists q < p such that vq > 0.

For instance, the inner loop of a loop nest with direction vectors {(<, =), (<, >)} can be reversed, because the resulting dependences are all still lexicographically positive.

Figure 16 shows how reversal can enable loop interchange: the original loop nest (a) has the distance vector (1, -1), which prevents interchange because the resulting distance vector (-1, 1) is not lexicographically positive; the reversed loop nest (b) can legally be interchanged.

do i = 1, n
  do j = 1, n
    a[i, j] = a[i-1, j+1] + 1
  end do
end do

(a) original loop nest: distance vector (1, -1). Interchange is not possible.

do i = 1, n
  do j = n, 1, -1
    a[i, j] = a[i-1, j+1] + 1
  end do
end do

(b) inner loop reversed: distance vector (1, 1). Loops may be interchanged.

Figure 16. Loop reversal.

do i = 1, n
  a[i] = a[i] + c
end do

(a) original loop

TN = (n/64)*64
do TI = 1, TN, 64
  a[TI:TI+63] = a[TI:TI+63] + c
end do
do i = TN+1, n
  a[i] = a[i] + c
end do

(b) after strip mining

R9 = address of a[TI]
LV    V1, R9       ; V1 <- a[TI:TI+63]
ADDSV V1, F8, V1   ; V1 <- V1 + c (F8=c)
SV    V1, R9       ; a[TI:TI+63] <- V1

(c) vector assembly code for the update of a[TI:TI+63]

Figure 17. Strip mining.

6.2.4 Strip Mining

Strip mining is a method of adjusting the granularity of an operation, especially a parallelizable operation [Abu-Sufah et al. 1981; Allen 1983; Loveman 1977].

An example is shown in Figure 17. The strip-mined computation is expressed in array notation, and is equivalent to a do all loop. Cleanup code is needed if the iteration length is not evenly divisible by the strip length.

One of the most common uses of strip mining is to choose the number of independent computations in the innermost loop of a nest.


On a vector machine, for example, the serial loop can then be converted into a series of vector operations, each vector comprising a single "strip" [Cray Research 1988]. Strip mining is also used for SIMD compilation [Weiss 1991], for combining send operations in a loop on distributed-memory multiprocessors [Hiranandani et al. 1992], and for limiting the size of compiler-generated temporary arrays [Abu-Sufah 1979; Wolfe 1989b].

Strip mining often requires other transformations to be performed first. Loop distribution (see Section 6.2.7) can expose simple loops from within an original loop nest that is too complex to strip mine. Loop interchange can be used to move a parallelizable loop into the innermost position of a loop nest or to maximize the length of the strip.

The examples given above demonstrate how strip mining can create a larger unit of work out of smaller ones. The transformation can also be used in the reverse direction, as shown in Section 7.4.

6.2.5 Cycle Shrinking

Cycle shrinking is essentially a specialization of strip mining. When a loop has dependences that prevent it from being executed in parallel (that is, converted to a do all), the compiler may still be able to expose some parallelism if the dependence distance is greater than one. In this case cycle shrinking will convert a serial loop into an outer serial loop and an inner parallel loop [Polychronopoulos 1987a]. Cycle shrinking is primarily useful for exposing fine-grained parallelism.

For instance, in Figure 18(a), a[i+k] is written in iteration i and read in iteration i+k; the dependence distance is k. Consequently the first k iterations can be performed in parallel provided that none of the subsequent iterations is allowed to begin until the first k are complete. The same is then done for the next k iterations, as shown in Figure 18(b). The iteration-space dependences are shown in Figure 18(c): each group of k iterations depends only on the previous group.

do i = 1, n
1 a[i+k] = b[i]
2 b[i+k] = a[i] + c[i]
end do

(a) because of the write to a, S1 δ S2; because of the write to b, S2 δ S1.

do TI = 1, n, k
  do all i = TI, TI+k-1
1   a[i+k] = b[i]
2   b[i+k] = a[i] + c[i]
  end do all
end do

(b) k iterations can be performed in parallel because that is the minimum dependence distance.

[Part (c) is an iteration-space diagram for n = 6 and k = 2 and is not reproduced here.]

Figure 18. Cycle shrinking.

The result is potentially a speedup by a factor of k, but k is likely to be small (usually 2 or 3), so this optimization is normally limited to exposing parallelism that can be exploited at the instruction level (for instance by loop unrolling). Note that k must be constant within the loop and must at least be known to be positive at compile time.

6.2.6 Loop Tiling

Tiling is the multidimensional generalization of strip mining. Tiling (also called blocking) is primarily used to improve cache reuse (Qc) by dividing an iteration space into tiles and transforming the loop nest to iterate over them [Abu-Sufah et al. 1981; Gannon et al. 1988; Lam et al. 1991; Wolfe 1989a]. However, it can also be used to improve processor, register, TLB, or page locality.

The need for tiling is illustrated by the loop in Figure 19(a), which assigns to a the transpose of b. With the j loop innermost, access to b is stride-1, while access to a is stride-n.


do i = 1, n
  do j = 1, n
    a[i, j] = b[j, i]
  end do
end do

(a) original loop

do TI = 1, n, 64
  do TJ = 1, n, 64
    do i = TI, min(TI+63, n)
      do j = TJ, min(TJ+63, n)
        a[i, j] = b[j, i]
      end do
    end do
  end do
end do

(b) tiled loop

Figure 19. Loop tiling.

Interchanging the loops does not help, since it merely makes access to b stride-n instead. By iterating over subrectangles of the iteration space, as shown in Figure 19(b), the loop uses every cache line fully.

The inner two loops of a matrix multiply have this structure; tiling is critical for achieving high performance in dense matrix multiplication.
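For instance, the textbook i-j-k matrix multiply can be tiled by strip mining the j and k loops and moving the tile loops outward. The following is a sketch using a tile size of 64, as in Figure 19; it is not one of the survey's numbered figures, and, as with any reordering of a floating-point reduction, the order of the additions into c[i, j] changes:

do TK = 1, n, 64
  do TJ = 1, n, 64
    do i = 1, n
      do k = TK, min(TK+63, n)
        do j = TJ, min(TJ+63, n)
          c[i, j] = c[i, j] + a[i, k] * b[k, j]
        end do
      end do
    end do
  end do
end do

Each 64 by 64 tile of b is now reused across all n iterations of i, so it can stay in the cache instead of being reloaded for every row of c.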

A pair of adjacent loops can be tiled if they can legally be interchanged. After tiling, the outer pair of loops can be interchanged to improve locality across tiles, and the inner loops can be exchanged to exploit inner-loop parallelism or register locality.

6.2.7 Loop Distribution

Distribution (also called loop fission or loop splitting) breaks a single loop into many. Each of the new loops has the same iteration space as the original, but contains a subset of the statements of the original loop [Kuck 1977; Kuck et al. 1981; Muraoka 1971].

Distribution is used to:

* create perfect loop nests;
* create subloops with fewer dependences;
* improve instruction cache and instruction TLB locality, due to shorter loop bodies;
* reduce memory requirements by iterating over fewer arrays; and
* increase register reuse by decreasing register pressure.

do i = 1, n
  a[i] = a[i] + c
  x[i+1] = x[i]*7 + x[i+1] + a[i]
end do

(a) original loop

do all i = 1, n
  a[i] = a[i] + c
end do all
do i = 1, n
  x[i+1] = x[i]*7 + x[i+1] + a[i]
end do

(b) after loop distribution

Figure 20. Loop distribution.

Figure 20 is an example in which distribution removes dependences and allows part of a loop to be run in parallel.

Distribution may be applied to any loop, but all statements belonging to a dependence cycle (called a π-block [Kuck 1977]) must be placed in the same loop, and if S1 δ S2 in the original loop, then the loop containing S1 must precede the loop that contains S2. If the loop contains control flow, applying if-conversion (see Section 5.2) can expose greater opportunities for distribution. An alternative is to use a control dependence graph [Kennedy and McKinley 1990].

A specialized version of this transformation is distribution by name, originally called horizontal distribution of name partition [Abu-Sufah et al. 1981]. Rather than performing full dependence analysis on the loop, the statements are partitioned into sets that reference mutually exclusive variables. These statements are guaranteed to be independent.

When the arrays in question are large, distribution by name can increase cache locality. Note that the loop in Figure 20 cannot be distributed using fission by name, since both statements reference a.


6.2.8 Loop Fusion

The inverse transformation of distribution is fusion (also called jamming) [Ershov 1966]. It can improve performance by:

* reducing loop overhead;
* increasing instruction parallelism;
* improving register, vector [Wolfe 1989b], data cache, TLB, or page [Abu-Sufah 1979] locality; and
* improving the load balance of parallel loops.

In Figure 20, distribution enables the parallelization of part of the loop. However, fusing the two loops improves register and cache locality, since a[i] need only be loaded once. Fusion also increases instruction parallelism by increasing the ratio of floating-point operations to integer operations in the loop, and it reduces loop overhead by a factor of two. With large n, the distributed loop should run faster on a vector machine, while the fused loop should be better on a superscalar machine.

For two loops to be fused, they must have the same loop bounds; when the bounds are not identical, it is sometimes possible to make them identical by peeling (described in Section 6.3.5) or by introducing conditional expressions into the body of the loop. Two loops with the same bounds may be fused if there do not exist statements S1 in the first loop and S2 in the second such that they have a dependence S2 δ S1 in the fused loop. The reason this would be incorrect is that before fusing, all instances of S1 execute before any S2; after fusing, corresponding instances are executed together. If any instance of S1 has a dependence on (i.e., must be executed after) any subsequent instance of S2, the fusion alters execution order illegally, as shown in Figure 21.

[Figure 21 is a diagram of iteration instances of S1 and S2 with a dependence from an instance of S2 to a later instance of S1; it is not reproduced here.]

Figure 21. Two loops containing S1 and S2 cannot be fused when S2 δ S1 in the fused loop.

6.3 Loop Restructuring

This section describes loop transformations that change the structure of the loop, but leave the computations performed by an iteration of the loop body and their relative order unchanged.

6.3.1 Loop Unrolling

Unrolling replicates the body of a loop some number of times, called the unrolling factor u, and iterates by step u instead of step 1. The benefits of unrolling have been studied on several different architectures [Dongarra and Hinds 1979]; it is a fundamental technique for generating the long instruction sequences required by VLIW machines [Ellis 1986].

Unrolling can improve the performance by:

* reducing loop overhead;
* increasing instruction parallelism; and
* improving register, data cache, or TLB locality.

In Figure 22 we show all three of these improvements in an example. Loop overhead is cut in half because two iterations are performed before the test and branch at the end of the loop. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.

If array elements are assigned to registers (either directly or by using scalar replacement, as described in Section 6.5.4), register locality will improve because a[i] and a[i+1] are used twice in the loop body, reducing the number of loads per iteration from 3 to 2.


do i = 2, n-1
  a[i] = a[i] + a[i-1] * a[i+1]
end do

(a) original loop

do i = 2, n-2, 2
  a[i] = a[i] + a[i-1] * a[i+1]
  a[i+1] = a[i+1] + a[i] * a[i+2]
end do
if (mod(n-2, 2) = 1) then
  a[n-1] = a[n-1] + a[n-2] * a[n]
end if

(b) loop unrolled twice

Figure 22. Loop unrolling.

If the target machine has double- or multiword loads, unrolling often allows several loads to be combined into one.

The if statement at the end of Figure 22(b) is the loop epilogue that must be generated when it is not known at compile time whether the number of iterations of the loop will be an exact multiple of the unrolling factor u. If u > 2, the loop epilogue is itself a loop.

Unrolling has the advantage that it can be applied to any loop, and it can be done profitably at both the high and the low levels. Some compilers also perform loop rerolling prior to unrolling, because programs often contain loops that were unrolled by hand for a different target architecture.

Figure 23(a-c) shows the effects of unrolling in more detail. The source code in Figure 23(a) is translated to the assembly code in Figure 23(b), which takes 6 cycles per result on S-DLX. After unrolling 3 times, the code requires 8 cycles per iteration, or 2 2/3 cycles per result, as shown in Figure 23(c). The original loop stalls for one cycle waiting for the load, and for two cycles waiting for the ADDF to complete. In the unrolled loop some of these cycles are filled.

Most compilers for high-performance machines will unroll at least the innermost loop of a nesting. Outer loop unrolling is not as universal because it yields replicated instances of the inner loops. To avoid the additional control overhead, the compiler can often fuse the copies back together, yielding the same loop structure that appeared in the original code. This combination of transformations is sometimes referred to as unroll-and-jam [Callahan et al. 1988].
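A small example of unroll-and-jam (a sketch, not one of the survey's numbered figures; the cleanup code for odd n is omitted):

do i = 1, n
  do j = 1, n
    a[i] = a[i] + b[i, j]
  end do
end do

(a) original loop nest

do i = 1, n, 2
  do j = 1, n
    a[i] = a[i] + b[i, j]
    a[i+1] = a[i+1] + b[i+1, j]
  end do
end do

(b) outer loop unrolled by two and the resulting inner loops jammed back together; a[i] and a[i+1] can now be kept in registers across the inner loop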

Loop quantization [Nicolau 1988] is another approach to unrolling that avoids replicated inner loops. Rather than creating multiple copies and then subsequently eliminating them, quantization adds additional statements to the innermost loop directly. The iteration ranges are changed, but the structure of the loop nest remains the same. However, unlike straightforward unrolling, quantization changes the order of execution of the loop and is not always a legal transformation.

6.3.2 Software Pipelining

Another technique to improve instruction parallelism is software pipelining [Lam 1988]. In hardware pipelining, instruction execution is broken into stages, such as Fetch, Execute, and Write-back. The first instruction is fetched in the first clock cycle. In the next cycle, the second instruction is fetched while the first is executed, and so on. Once the pipeline has been filled, the machine will complete one instruction per cycle.

In software pipelining, the operations of a single loop iteration are broken into s stages, and a single iteration performs stage 1 from iteration i, stage 2 from iteration i-1, and so on. Startup code must be generated before the loop to initialize the pipeline for the first s-1 iterations, and cleanup code must be generated after the loop to drain the pipeline for the last s-1 iterations.

Figure 23(d) shows how software pipelining improves the performance of a simple loop. The depth of the pipeline is s = 3. The software-pipelined loop produces one new result every iteration, or 4 cycles per result.


do i = 1, n
  a[i] = a[i] + c
end do

(a) the initial loop

1    LW    R8, n(R30)      ADDI  R10, R30, #a   ;load n into R8        ;R10=address(a[1])
2    LF    F4, c(R30)                           ;load c into F4
3    MULTI R8, R8, #4                           ;R8=n*4
5    ADDI  R8, R10, R8                          ;R8=address(a[n+1])

1 L: LF    F5, (R10)       ADDI  R10, R10, #4   ;load a[i] into F5     ;R10=address(a[i+1])
3    ADDF  F5, F5, F4      SLT   R11, R10, R8   ;a[i]=a[i]+c           ;R11 = R10<R8?
6    SF    -4(R10), F5     BNEZ  R11, L         ;store new a[i]        ;if R11, goto L

(b) the compiled loop body for S-DLX. This is the loop from Figure 8(d) after instruction scheduling.

1 L: LF    F5, (R10)                            ;load a[i] into F5
2    LF    F6, 4(R10)                           ;load a[i+1] into F6
3    LF    F7, 8(R10)      ADDF  F5, F5, F4     ;load a[i+2] into F7   ;a[i]=a[i]+c
4    ADDI  R10, R10, #12   ADDF  F6, F6, F4     ;R10=address(a[i+3])   ;a[i+1]=a[i+1]+c
5    SLT   R11, R10, R8    ADDF  F7, F7, F4     ;R11 = R10<R8?         ;a[i+2]=a[i+2]+c
6    SF    -12(R10), F5                         ;store new a[i]
7    SF    -8(R10), F6                          ;store new a[i+1]
8    SF    -4(R10), F7     BNEZ  R11, L         ;store new a[i+2]      ;if R11, goto L

(c) after unrolling 3 times and rescheduling (loop prologue and epilogue omitted)

1    LW    R8, n(R30)      ADDI  R10, R30, #a   ;load n into R8        ;R10=address(a[1])
2    LF    F4, c(R30)      SUBI  R8, R8, #2     ;load c into F4        ;R8=n-2
3    MULTI R8, R8, #4                           ;R8=(n-2)*4
5    ADDI  R8, R10, R8     LF    F5, (R10)      ;R8=address(a[n-1])    ;F5=a[1]
7    ADDF  F6, F5, F4      LF    F5, 4(R10)     ;F6=a[1]+c             ;F5=a[2]

1 L: SF    (R10), F6       ADDI  R10, R10, #4   ;store new a[i]        ;R10=address(a[i+1])
2    ADDF  F6, F5, F4      SLT   R11, R10, R8   ;a[i+1]=a[i+1]+c       ;R11 = R10<R8?
3    LF    F5, 4(R10)      BNEZ  R11, L         ;load a[i+2] into F5   ;if R11, goto L
4    [stall]

(d) after software pipelining and rescheduling, but without unrolling (loop epilogue omitted)

1 L: SF    (R10), F6       ADDF  F6, F5, F4     ;store new a[i]        ;a[i+2]=a[i+2]+c
2    SF    4(R10), F8      ADDF  F8, F7, F4     ;store new a[i+1]      ;a[i+3]=a[i+3]+c
3    LF    F5, 16(R10)     ADDI  R10, R10, #8   ;load a[i+4] into F5   ;R10=address(a[i+2])
4    LF    F7, 12(R10)     SLT   R11, R10, R8   ;load a[i+5] into F7   ;R11 = R10<R8?
5    BNEZ  R11, L                               ;if R11, goto L

(e) after unrolling 2 times and then software pipelining (loop prologue and epilogue omitted)

Figure 23. Increasing instruction parallelism in loops.


[Figure 24 contrasts the execution timelines of (a) loop unrolling and (b) software pipelining; the diagram is not reproduced here.]

Figure 24. Loop unrolling vs. software pipelining.

Finally, Figure 23(e) shows the result of combining software pipelining (s = 3) with unrolling (u = 2). The loop takes 5 cycles per iteration, or 2 1/2 cycles per result. Unrolling alone achieves 2 2/3 cycles per result. If software pipelining is combined with unrolling by u = 3, the resulting loop would take 6 cycles per iteration, or two cycles per result, which is optimal because only one memory operation can be initiated per cycle.

Figure 24 illustrates the difference between unrolling and software pipelining: unrolling reduces overhead, while pipelining reduces the startup cost of each iteration.

Perfect pipelining combines unrolling by loop quantization with software pipelining [Aiken and Nicolau 1988a; 1988b]. A special case of software pipelining is predictive commoning, which is applied to memory operations. If an array element written in iteration i-1 is read in iteration i, then the first element is loaded outside of the loop, and each iteration contains one load and one store. The RS/6000 XL C/Fortran compiler [O'Brien et al. 1990] performs this optimization.
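A source-level sketch of the effect of predictive commoning (illustrative only; the compiler applies the transformation to the intermediate code):

do i = 2, n
  a[i] = a[i-1] + b[i]
end do

(naively compiled, each iteration reloads a[i-1], which was stored in the previous iteration)

T = a[1]
do i = 2, n
  T = T + b[i]
  a[i] = T
end do

(the carried value is kept in T, so each iteration performs one load, of b[i], and one store, of a[i])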

If there are no loop-carried dependences, the length of the pipeline is the length of the dependence chain. If there are multiple, independent dependence chains, they can be scheduled together subject to resource availability, or loop distribution can be applied to put each chain into its own loop. The scheduling constraints in the presence of loop-carried dependences and conditionals are more complex; the details are discussed in Aiken and Nicolau [1988a] and Lam [1988].

do all i = 1, n
  do all j = 1, m
    a[i, j] = a[i, j] + c
  end do all
end do all

(a) original loop

do all T = 1, n*m
  i = ((T-1) / m) + 1
  j = MOD(T-1, m) + 1
  a[i, j] = a[i, j] + c
end do all

(b) coalesced loop

real TA[n*m]
equivalence (TA, a)
do all T = 1, n*m
  TA[T] = TA[T] + c
end do all

(c) collapsed loop

Figure 25. Loop coalescing vs. collapsing.


6.3.3 Loop Coalescing

Coalescing combines a loop nest into a single loop, with the original indices computed from the resulting single induction variable [Polychronopoulos 1987b; 1988]. Coalescing can improve the scheduling of the loop on a parallel machine and may also reduce loop overhead.

In Figure 25(a), for example, if n and m are slightly larger than the number of processors P, then neither of the loops schedules as well as the coalesced loop, since executing the last n-P iterations of the outer loop will take the same time as the first P. Coalescing the two loops ensures that P iterations can be executed every time except during the last (nm mod P) iterations, as shown in Figure 25(b).

Coalescing itself is always legal, since it does not change the iteration order of the loop. The iterations of the coalesced loop may be parallelized if all the original loops were parallelizable. A criterion for parallelizability in terms of dependence vectors is discussed in Section 6.2.



The complex subscript calculations introduced by coalescing can often be simplified to reduce the overhead of the coalesced loop [Polychronopoulos 1987b].

6.3.4 Loop Collapsing

Collapsing is a simpler, more efficient, but less general version of coalescing in which the number of dimensions of the array is actually reduced. Collapsing eliminates the overhead of multiple nested loops and multidimensional array indexing.

Collapsing is used not only to increase the number of parallelizable loop iterations, but also to increase vector lengths and to eliminate the overhead of a nested loop (also called the Carry optimization [Allen and Cocke 1971; IBM 1991]).

The collapsed version of the loop discussed in the previous section is shown in Figure 25(c).

Collapsing is best suited to loop nests that iterate over memory with a constant stride. When more complex indexing is involved, coalescing may be a better approach.

6.3.5 Loop Peeling

When a loop is peeled, a small number of iterations are removed from the beginning or end of the loop and executed separately. If only one iteration is peeled, a common case, the code for that iteration can be enclosed within a conditional. For a larger number of iterations, a separate loop can be introduced. Peeling has two uses: removing dependences created by the first or last few loop iterations, thereby enabling parallelization; and matching the iteration control of adjacent loops to enable fusion.

The loop in Figure 26(a) is not parallelizable because of a flow dependence between iteration i = 2 and iterations i = 3, ..., n. Peeling off the first iteration allows the rest of the loop to be parallelized and fused with the following loop, as shown in Figure 26(b).

do i = 2, n
  b[i] = b[i] + b[2]
end do
do all i = 3, n
  a[i] = a[i] + c
end do all

(a) original loops

if (2 <= n) then
  b[2] = b[2] + b[2]
end if
do all i = 3, n
  b[i] = b[i] + b[2]
  a[i] = a[i] + c
end do all

(b) after peeling one iteration from the first loop and fusing the resulting loops

Figure 26. Loop peeling.

Since peeling simply breaks a loop into sections without changing the iteration order, it can be applied to any loop.

6.3.6 Loop Normalization

Normalization converts all loops so that the induction variable is initially 1 (or 0) and is incremented by 1 on each iteration [Allen and Kennedy 1987]. This transformation can expose opportunities for fusion and simplify inter-loop dependence analysis, as shown in Figure 27. It can also help to reveal which loops are candidates for peeling followed by fusion.

The most important use of normalization is to permit the compiler to apply subscript analysis tests, many of which require normalized iteration ranges.

6.3.7 Loop Spreading

Spreading takes two serial loops and moves some of the computation from the second to the first so that the bodies of both loops can be executed in parallel [Girkar and Polychronopoulos 1988].

An example is shown in Figure 28: the two loops in the original program (a) cannot be fused, because they have different bounds and because there would be a dependence S2 δ S1 in the fused loop due to the write to a in S1 and the read of a in S2. By executing the statement S2 three iterations later within the new loop, it is possible to execute the two statements in parallel, which we have indicated textually with the COBEGIN/COEND compound statement in Figure 28(b).


do i = 1, n
  a[i] = a[i] + c
end do
do i = 2, n+1
  b[i] = a[i-1] * b[i]
end do

(a) original loops

do i = 1, n
  a[i] = a[i] + c
end do
do i = 1, n
  b[i+1] = a[i] * b[i+1]
end do

(b) after normalization, the two loops can be fused

Figure 27. Loop normalization.

The number of iterations by which the body of the second loop must be delayed is the maximum dependence distance between any statement in the second loop and any statement in the first loop, plus 1. Adding one ensures that there are no dependences within an iteration. For this reason, there must not be any scalar dependences between the two loop bodies.

Unless the loop bodies are large, spreading is primarily beneficial for exposing instruction-level parallelism. Depending on the amount of instruction parallelism achieved, the introduction of a conditional may negate the gain due to spreading. The conditional can be removed by peeling the first few (in this case 3) iterations from the first loop, at the cost of additional loop overhead.

do i = 1, n/2
1 a[i+1] = a[i+1] + a[i]
end do

do i = 1, n-3
2 b[i+1] = b[i+1] + b[i] + a[i+3]
end do

(a) original loops

do i = 1, n/2
  COBEGIN
    a[i+1] = a[i+1] + a[i]
    if (i > 3) then
      b[i-2] = b[i-2] + b[i-3] + a[i]
    end if
  COEND
end do
do i = n/2-3, n-3
  b[i+1] = b[i+1] + b[i] + a[i+3]
end do

(b) after spreading

Figure 28. Loop spreading.

6.4 Loop Replacement Transformations

This section describes loop transformations that operate on whole loops and completely alter their structure.

6.4.1 Reduction Recognition

A reduction is an operation that computes a scalar value from an array. Common reductions include computing either the sum or the maximum value of the elements in an array. In Figure 29(a), the sum of the elements of a is accumulated in the scalar s. The dependence vector for the loop is (1), or (<). While a loop with direction vector (<) must normally be executed serially, reductions can be parallelized if the operation performed is associative. Commutativity provides additional opportunities for reordering.

In Figure 29(b), the reduction has been vectorized by using vector adds (the inner do all loop) to compute TS; the final result is computed from TS using a scalar loop. For semicommutative and semiassociative operators like floating-point multiplication, the validity of the transformation depends on the language semantics and the programmer's intent, as described in Section 2.1. Provided that bitwise identical results are not required, the partial sums can be computed in parallel.


do i = 1, n
  s = s + a[i]
end do

(a) a sum reduction loop

real TS[64]
TS[1:64] = 0.0
do TI = 1, n, 64
  TS[1:64] = TS[1:64] + a[TI:TI+63]
end do
do TI = 1, 64
  s = s + TS[TI]
end do

(b) loop transformed for vectorization

Figure 29. Reduction recognition.

Maximum parallelism is achieved by computing the reduction with a tree: pairs of elements are summed, then pairs of these results are summed, and so on. The number of serial steps is reduced from O(n) to O(log n).
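A sketch of this pairwise scheme using a temporary array (not one of the survey's figures; the do all expresses the parallel step, and, as above, the reordering assumes the operation may be reassociated):

real T[n]
integer m, h
T[1:n] = a[1:n]
m = n
do while (m > 1)
  h = (m + 1) / 2
  do all i = 1, m/2
    T[i] = T[i] + T[i+h]
  end do all
  m = h
end do
s = T[1]

Each pass halves the number of partial sums, so the serial do while loop executes O(log n) times.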

Operations such as and, or, min, and max are truly associative, and their reductions can be parallelized under all circumstances.

6.4.2 Loop Idiom Recognition

Parallel architectures often provide specialized hardware that the compiler can take advantage of. Frequently, for example, SIMD machines support reduction directly in the processor interconnection network. Some parallel machines, such as the Connection Machine CM-2 [Thinking Machines 1989], include hardware not only for reduction but for parallel prefix operations, allowing a loop with a body of the form a[i] = a[i-1] + a[i] to be parallelized. The parallelization of a more general class of linear recurrences is described by Chen and Kuck [1975], by Kuck [1977; 1978], and by Wolfe [1989b]. Blelloch [1989] describes compilation strategies for recognizing and exploiting parallel prefix operations.

The compiler can recognize and convert other idioms that are specially supported by hardware or software. For instance, the CMAX Fortran preprocessor converts loops implementing vector and matrix operations into calls to assembly-coded BLAS (Basic Linear Algebra Subroutines) [Sabot and Wholey 1993]. The VAX Fortran compiler converts string-copy loops into block transfer instructions [Harris and Hobbs 1994].

6.4.3 Array Statement Scalarization

When a loop is expressed in array notation, the compiler can either convert it into vector operations or scalarize it into one or more serial loops [Wolfe 1989b]. However, the conversion is not completely straightforward, because array notation requires that the operation be performed as if every value on the right-hand side and every subexpression on the left-hand side were computed before any assignments are performed.

The example in Figure 30 shows a computation in array notation (a), its "obvious" (but incorrect) conversion to serial form (b), and a correct conversion (c). The reason that Figure 30(b) is not correct is that in the original code, every element of a is to be incremented by the value of the previous element. The increments are to happen as if they were all performed simultaneously; in the incorrect version, each element is incremented by the updated value of the previous element.

The general solution is to introduce a temporary array T and to have a separate loop that writes the values back into a, as shown in Figure 30(c). The temporary array can then be eliminated if the two loops can be legally fused, namely, when there is no dependence S2 δ S1 in the fused loop, where S1 is the assignment to the temporary and S2 is the assignment to the original array.


a[2:n-1] = a[2:n-1] + a[1:n-2]

(a) initial array language expression

do i = 2, n-1
  a[i] = a[i] + a[i-1]
end do

(b) incorrect scalarization

do i = 2, n-1
1 T[i] = a[i] + a[i-1]
end do
do i = 2, n-1
2 a[i] = T[i]
end do

(c) correct scalarization

do i = n-1, 2, -1
  a[i] = a[i] + a[i-1]
end do

(d) reversing both loops allows fusion and eliminates the need for the temporary array T

a[2:n-1] = a[2:n-1] + a[1:n-2] + a[3:n]

(e) array expression requiring a temporary

Figure 30. Array statement scalarization.

In this case there is an antidependence, but it can be removed by reversing the loops, enabling fusion and eliminating the temporary, as shown in Figure 30(d). However, the array language statement in Figure 30(e) requires a temporary, since an antidependence exists regardless of the direction of the loop.

6.5 Memory Access Transformations

High-performance applications are as frequently memory limited as they are compute limited. In fact, for the last fifteen years CPU speeds have doubled every three to five years, while DRAM speeds have doubled about once every decade. (DRAMs, or dynamic random access memory chips, are the low-cost, low-power memory chips used for main memory in most computers.)


As a result, optimization of the use of the memory system has become steadily more important. Factors affecting memory performance include:

* Reuse, denoted by Q and Qc: the ratio of uses of an item to the number of times it is loaded (described in Section 3);

* Parallelism: vector machines often divide memory into banks, allowing vector registers to be loaded in a parallel or pipelined fashion, and superscalar machines often support double- or quadword load and store instructions; and

* Working-set size: if all the memory elements accessed inside of a loop do not fit in the data cache, then items that will be accessed in later iterations may be flushed, decreasing Qc. If more variables are simultaneously live than there are available registers (that is, the register pressure is high), then loads and stores will have to spill values into memory, decreasing Q. If more pages are accessed in a loop than there are entries in the TLB, then the TLB may thrash.

Since the registers are the top of the memory hierarchy, efficient register usage is absolutely crucial to high performance. Until the late seventies, register allocation was considered the single most important problem in compiler optimization for which there was no adequate solution. The introduction of techniques based on graph coloring [Chaitin 1982; Chaitin et al. 1981; Chow and Hennessy 1990] yielded very efficient global (within a procedure) register allocation. Most compilers now make use of some variant of graph coloring in their register allocator.

In general, memory optimizations are inhibited by low-level coding that relies on particular storage layouts. Fortran EQUIVALENCE statements are the most obvious example. C and Fortran 77 both specify the storage layout for each type of declaration. Programmers relying on a particular storage layout may defeat a wide range of important optimizations.


Optimizations covered in other sections that can also improve memory-system performance are loop interchange (6.2.1), loop tiling (6.2.6), loop unrolling (6.3.1), loop fusion (6.2.8), and various optimizations that eliminate register saves at procedure calls (6.8).

6.5.1 Array Padding

Padding is a transformation whereby unused data locations are inserted between the columns of an array or between arrays. Padding is used to ameliorate a number of memory-system conflicts, in particular:

* bank conflicts on vector machines with banked memory [Burnett and Coffman, Jr. 1970];
* cache-set or TLB-set conflicts;
* cache miss jamming [Bacon et al. 1994]; and
* false sharing of cache lines on shared-memory multiprocessors.

Bank conflicts on vector machines like the Cray and our hypothetical V-DLX can be caused when the program indexes through an array dimension that is not laid out contiguously in memory, leading to a nonunit stride whose value we will define to be s. If s is an exact multiple of the number of banks B, the bandwidth of the memory system will be reduced by a factor of B (8, on V-DLX), because all memory accesses are to the same bank. The number of memory banks is usually a power of two.

In general, if an array will be accessed with stride s, the array should be padded by the smallest p such that gcd(s + p, B) = 1. This ensures that B successive accesses with stride s + p will all address different banks. An example is shown in Figure 31(a): the loop accesses memory with stride 8, so all memory references go to the same bank. After padding, successive iterations access memory with stride 9, so they go to successive banks, as shown in Figure 31(b).

Cached memory systems, especially those that are set-associative, are less sensitive to low power-of-two strides. However, large power-of-two strides will cause extremely poor performance due to cache-set and TLB-set conflicts. The problem is that addresses are mapped to sets simply by using a range of bits from the middle of the address. For instance, on S-DLX, the low 6 bits of an address are the byte offset within the line, and the next 8 bits select the cache set. Figure 12 shows the effect of stride on the performance of a cache-based superscalar machine.

real a[8, 512]
do i = 1, 512
  a[1, i] = a[1, i] + c
end do

(a) original code

real a[9, 512]
do i = 1, 512
  a[1, i] = a[1, i] + c
end do

(b) padding a eliminates memory bank conflicts on V-DLX

Figure 31. Array padding.
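The pad amount itself is cheap to compute at compile time by direct search; a sketch of a hypothetical helper routine (not part of the survey):

integer function pad(s, B)
  integer s, B, p, x, y, r
  do p = 0, B-1
    x = s + p
    y = B
    do while (y .ne. 0)
      r = mod(x, y)
      x = y
      y = r
    end do
    if (x .eq. 1) then
      pad = p
      return
    end if
  end do
  pad = 0
end

The inner loop is Euclid's algorithm for gcd(s+p, B); since any B consecutive integers include one that is congruent to 1 modulo B, the search always succeeds for some p < B.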

A number of researchers have noted that other strides also yield reduced cache performance. Bailey [1992] has quantified this by noting that because many words fit into a cache line, if the number of sets times the number of words per set is almost exactly divisible by the stride (say the quotient is n, plus or minus a small amount), then for a limited number of references r, every nth memory reference will map to the same set. If r/n (rounded up) is greater than the associativity of the cache, then the later references will evict the lines loaded by the earlier references, precluding reuse.

Set and bank conflicts can be caused by a bad stride over a single array, or by a loop that accesses multiple arrays that all align to the same set or bank. Thus padding can be inserted between the columns of an array (intraarray padding) or between arrays (interarray padding).


A further performance artifact, calledcache miss J“amming, can occur on ma-chines that allow processing to continueduring a cache miss: if cache misses arespread nonuniformly across the loop iter-ations, the asynchrony of the processorwill not be exploited, and performancewill be reduced. Jamming typically oc-curs when several arrays are accessedwith the same stride and when all havethe same alignment relative to cache lineboundaries (that is, the same low ad-dress bits). Bacon et al. [1994] describecache miss jamming in detail and pre-sent a unified framework for inter- andintraarray padding to handle set con-flicts and jamming.

The disadvantages of padding are thatit increases memory consumption andmakes the subscript calculations foroperations over the whole array morecomplex, since the array has “holes.” Inparticular, padding reduces the benefitsof loop collapsing (see Section 6.3.4).

6.5.2 Scalar Expansion

Loops often contain variables that areused as temporaries within the loop body.Such variables will create an antidepen-

dence S’z ‘~) S1 from one iteration to thenext, and will have no other loop-carrieddependence. Allocating one temporaryfor each iteration removes the depen-dence and makes the loop a candidate forparallelization [Padua et al. 1980; Wolfe1989b], as shown in Figure 32. If thefinal value of c is used after the loop, cmust be assigned the value of T[n].

Scalar expansion is a fundamentaltechnique for vectorizing compilers, andwas performed by the Burroughs Scien-tific Processor [Kuck and Stokes 1982]and Cray-1 [Russell 1978] compilers. Analternative for parallel machines is touse private variables, where each proces-sor has its own instance of the variable;these may be introduced by the compiler(see Section 7.1.3) or, if the language suP-ports private variables, by theprogrammer.

If the compiler vectorizes or parallelizes a loop, scalar expansion must be

do i = 1, n
    c = b[i]
    a[i] = a[i] + c
end do

(a) original loop

real T[n]

do all i = 1, n
    T[i] = b[i]
    a[i] = a[i] + T[i]
end do all

(b) after scalar expansion

Figure 32. Scalar expansion.

performed for any compiler-generated temporaries in a loop. To avoid creating unnecessarily large temporary arrays, a vectorizing compiler can perform scalar expansion after strip mining, expanding the temporary to the size of the vector strip.

Scalar expansion can also increase instruction-level parallelism by removing dependences.

6.5.3 Array Contraction

After transformation of a loop nest, it may be possible to contract scalars or arrays that have previously been expanded. It may also be possible to contract other arrays due to interchange or the use of redundant storage allocation by the programmer [Wolfe 1989b].

If the iteration variable of the pth loop in a loop nest is being used to index the kth dimension of an array x, then dimension k may be removed from x if (1) loop p is not parallel, (2) all distance vectors V involving x have V_p = 0, and (3) x is not used subsequently (that is, x is dead after the loop). The latter two conditions are true for compiler-expanded variables unless the loop structure of the program was changed after expansion. In particular, loop distribution can inhibit array contraction by causing the second condition to be violated.


real T[n, n]

do i = 1, n
    do all j = 1, n
        T[i, j] = a[i, j]*3
        b[i, j] = T[i, j] + b[i, j]/T[i, j]
    end do all
end do

(a) original code

real T[n]

do i = 1, n
    do all j = 1, n
        T[j] = a[i, j]*3
        b[i, j] = T[j] + b[i, j]/T[j]
    end do all
end do

(b) after array contraction

Figure 33. Array contraction.

Contraction reduces the amount of storage consumed by compiler-generated temporaries, as well as reducing the number of cache lines referenced. Other methods for reducing storage consumption by temporaries are strip mining (see Section 6.2.4) and dynamic allocation of temporaries, either from the heap or from a static block of memory reserved for temporaries.

6.5.4 Scalar Replacement

Even when it is not possible to contract an array into a scalar, a similar optimization can be performed when a frequently referenced array element is invariant within the innermost loop or loops. In this case, the array element can be loaded into a scalar (and presumably therefore a register) before the inner loop and, if it is modified, stored after the inner loop [Callahan et al. 1990].

Replacement multiplies the reuse of the array element by the number of iterations in the inner loop(s). It can also eliminate unnecessary subscript calculations, although that optimization is often done by loop-invariant code motion (see Section

do i = 1, n
    do j = 1, n
        total[i] = total[i] + a[i, j]
    end do
end do

(a) original loop nest

do i = 1, n
    T = total[i]
    do j = 1, n
        T = T + a[i, j]
    end do
    total[i] = T
end do

(b) after scalar replacement

Figure 34. Scalar replacement.

6.1.3). Loop interchange can be used to enable or improve scalar replacement; Carr [1993] examines the combination of scalar replacement, interchange, and unroll-and-jam in the context of cache optimization.

An example of scalar replacement is shown in Figure 34; for a discussion of the interaction between replacement and loop interchange, see Section 6.2.1.

6.5.5 Code Collocation

Code collocation improves memory access behavior by placing related code in close proximity. The earliest work rearranged code (often at the granularity of a procedure) to improve paging behavior [Ferrari 1976; Hatfield and Gerald 1971].

More recent strategies focus on improving cache behavior by placing the most frequent successor to a basic block (or the most frequent callee of a procedure) immediately adjacent to it in instruction memory [Hwu and Chang 1989; Pettis and Hansen 1990].

An estimate is made of the frequency with which each arc in the control flow graph will be traversed during program execution (using either profiling information or static estimates). Procedures are grouped together using a greedy algorithm that always takes the pair of procedures (or procedure groups) with the largest number of calls between them, as sketched below.
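A minimal sketch (ours, not the paper's algorithm verbatim) of the greedy grouping in C; the call graph, edge weights, and procedure count are made-up values, and only the pair selection and merging are shown.

#include <stdio.h>

#define NPROC 5
#define NEDGE 4

/* call-graph edges: weight[e] calls between caller[e] and callee[e] (hypothetical profile data) */
static int caller[NEDGE] = {0, 1, 2, 3};
static int callee[NEDGE] = {1, 2, 3, 4};
static int weight[NEDGE] = {90, 10, 50, 70};

static int group[NPROC];          /* group id of each procedure */

int main(void) {
    for (int p = 0; p < NPROC; p++)
        group[p] = p;             /* every procedure starts in its own group */

    for (;;) {
        /* pick the heaviest edge whose endpoints are still in different groups */
        int best = -1;
        for (int e = 0; e < NEDGE; e++)
            if (group[caller[e]] != group[callee[e]] &&
                (best < 0 || weight[e] > weight[best]))
                best = e;
        if (best < 0)
            break;                /* nothing left to merge */

        /* merge the two groups: relabel the callee's group as the caller's group */
        int g1 = group[caller[best]], g2 = group[callee[best]];
        for (int p = 0; p < NPROC; p++)
            if (group[p] == g2)
                group[p] = g1;
    }

    for (int p = 0; p < NPROC; p++)
        printf("procedure %d -> group %d\n", p, group[p]);
    return 0;
}

A production implementation would also record the order in which merges occur, so that the hottest caller/callee pairs end up adjacent in the final memory layout rather than merely in the same group.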

Within a procedure, basic blocks can be grouped in the same way (although the direction of the control flow must be taken into account), or a top-down algorithm can be used that starts from the procedure entry node. Basic blocks with a frequency estimate of zero can be moved to a separate page to increase locality further. However, accessing that page may require long-displacement jumps to be introduced (see the next subsection), creating the potential for performance loss if the basic blocks in question are actually executed.

Procedure inlining (see Section 6.8.5) can also affect code locality, and has been studied both in conjunction with [Hwu and Chang 1989] and independent of [McFarling 1991] code positioning. Inlining often improves performance by reducing overhead and increasing locality, but if a procedure is called more than once in a loop, inlining will often increase the number of cache misses because the procedure body will be loaded more than once.

6.5.6 Displacement Minimization

The target of a branch or a jump is usually specified relative to the current value of the program counter (PC). The largest offset that can be specified varies among architectures; it can be as few as 4 bits. If control is transferred to a location outside of the range of the offset, a multi-instruction sequence or long-format instruction is required to perform the jump. For instance, the S-DLX instruction BEQZ R4, error is only legal if error is within 2^15 bytes. Otherwise, the instruction must be replaced with:

BNEZ R4, cont       ;reversed test
LI   R8, error      ;get low bits
LUI  R8, error>>16  ;get high bits
JR   R8             ;jump to target
cont:

This sequence requires three extra instructions. Given the cost of long-displacement jumps, the code should be organized to keep related sections close together in memory, in particular those sections executed most frequently [Szymanski 1978].

Displacement minimization can also be applied to data. For instance, a base register may be allocated for a Fortran common block or group of blocks:

common /big/ q, r, x[20000], y, z

If the array x contains word-sized elements, the common block is larger than the amount of memory indexable by the offset field in the load instruction (2^16 bytes on S-DLX). To address y and z, multiple-instruction sequences must be used in a manner analogous to the long-jump sequences above. The problem is avoided if the layout of big is:

common /big/ q, r, y, z, x[20000]

6.6 Partial Evaluation

Partial evaluation refers to the general technique of performing part of a computation at compile time. Most of the classical optimizations based on data-flow analysis are either a form of partial evaluation or of redundancy elimination (described in Section 6.7). Loop-based data-flow optimizations are described in Section 6.1.

6.6.1 Constant Propagation

Constant propagation [Callahan et al. 1986; Kildall 1973; Wegman and Zadeck 1991] is one of the most important optimizations that a compiler can perform, and a good optimizing compiler will apply it aggressively. Typically programs contain many constants; by propagating them through the program, the compiler can do a significant amount of precomputation. More important, the propagation reveals many opportunities for other optimizations. In addition to obvious possibilities such as dead-code elimination, loop optimizations are affected because constants often appear in their induction ranges. Knowing the range of the loop, the compiler can be much more accurate


n = 64
c = 3
do i = 1, n
    a[i] = a[i] + c
end do

(a) original code

do i = 1, 64
    a[i] = a[i] + 3
end do

(b) after constant propagation

Figure 35. Constant propagation.

in applying the loop optimizations that, more than anything else, determine performance on high-speed machines.

Figure 35 shows a simple example of constant propagation. On V-DLX, the resulting loop can be converted into a single vector operation because the loop is the same length as the hardware vector registers. The original loop would have to be strip mined before vectorization (see Section 6.2.4), increasing the overhead of the loop.

6.6.2 Constant Folding

Constant folding is a companion to constant propagation: when an expression contains an operation with constant values as operands, the compiler can replace the expression with the result. For example, x = 3**2 becomes x = 9. Typically constants are propagated and folded simultaneously [Aho et al. 1986].

Note that constant folding may not be legal (under Definition 2.1.2 in Section 2.1) if the operations performed at compile time are not identical to those that would have been performed at run time; a common source of such problems is the rounding of floating-point numbers during input or output between phases of the compilation process [Clinger 1990; Steele and White 1990].

6.6.3 Copy Propagation

Optimizations such as induction variable elimination (6.1.2) and common-subexpression

t = i*4
s = t
print *, a[s]
r = t
a[r] = a[r] + c

(a) original code

t = i*4
print *, a[t]
a[t] = a[t] + c

(b) after copy propagation

Figure 36. Copy propagation.

elimination (6.7.4) may cause the same value to be copied several times. The compiler can propagate the original name of the value and eliminate redundant copies [Aho et al. 1986].

Copy propagation reduces register pressure and eliminates redundant register-to-register move instructions. An example is shown in Figure 36.

6.6.4 Forward Substitution

Forward substitution is a generalization of copy propagation. The use of a variable is replaced by its defining expression, which must be live at that point. Substitution can change the dependence relation between variables [Wolfe 1989b] or improve the analysis of subscript expressions in loops [Allen and Kennedy 1987; Kuck et al. 1981].

For instance, in Figure 37(a) the loop cannot be parallelized because an unknown element of a is being written. After forward substitution, as shown in Figure 37(b), the subscript expression is in terms of the loop-bound variable, and it is straightforward to determine that the loop can be implemented as a parallel reduction (described in Section 6.4.1).

The use of variables like np1 is a common Fortran idiom that was developed when compilers did not perform aggressive optimization. The idiom is recommended as "good programming style" in a number of Fortran programming texts!


np1 = n+1

do i = 1, n
    a[np1] = a[np1] + a[i]
end do

(a) original code

do all i = 1, n
    a[n+1] = a[n+1] + a[i]
end do all

(b) after forward substitution

Figure 37. Forward substitution.

Forward substitution is generally performed on array subscript expressions at the same time as loop normalization (Section 6.3.6). For efficient subscript analysis techniques to work, the array subscripts must be linear functions of the induction variables.

6.6.5 Reassociation

Reassociation is a technique for increasing the number of common subexpressions in a program [Cocke and Markstein 1980; Markstein et al. 1994]. It is generally applied to address calculations within loops when performing strength reduction on induction variable expressions (see Section 6.1.1). Address calculations generated by array references consist of several multiplications and additions. Reassociation applies the associative, commutative, and distributive laws to rewrite these expressions in a canonical sum-of-products form.

Forward substitution is usually performed where possible in the address calculations to increase the number of potential common subexpressions.
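A small illustration (ours, not from the paper) in C-like form: the address expressions for two neighboring elements of a row-major array are reassociated into sum-of-products form so that a common subexpression is exposed. The names base, i, j, and n are hypothetical stand-ins for compiler-generated address arithmetic.

#include <stdio.h>

int main(void) {
    long base = 0x1000, i = 3, j = 5, n = 100;

    /* naive address calculations, as a front end might emit them */
    long addr1 = base + 4 * (i * n + j);        /* address of a[i][j]   */
    long addr2 = base + 4 * (i * n + (j + 1));  /* address of a[i][j+1] */

    /* after reassociation into canonical sum-of-products form, the common
       subexpression (base + 4*i*n + 4*j) is exposed and computed once */
    long common = base + 4 * i * n + 4 * j;
    long addr1r = common;
    long addr2r = common + 4;

    printf("%ld %ld %ld %ld\n", addr1, addr2, addr1r, addr2r);
    return 0;
}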

6.6.6 Algebraic Simplification

The compiler can simplify arithmetic expressions by applying algebraic rules to them. A particularly useful example is the set of algebraic identities. For instance, the statement x = (y*1 + 0)/1 can be transformed into x = y if x and y are integers. Figure 38 illustrates some of the commonly applied rules.

x × 0 = 0
0 / x = 0
x × 1 = x
x + 0 = x
x / 1 = x

Figure 38. Algebraic identities used in expression simplification.

Table 2. Identities Used in Strength Reduction (& is the string concatenation operator)

Expression          Reduced Expr.         Datatypes
x × 2               x + x                 integer, floating point
i × 2^c             i << c                integer
(a, 0) + (b, 0)     (a + b, 0)            complex
len(s1 & s2)        len(s1) + len(s2)     string

Floating-point arithmetic can be problematic to simplify; for example, if x is an IEEE floating-point number with value NaN (not a number), then x × 0 = x, instead of 0.

6.6.7 Strength Reduction

The identities in Table 2 are called strength reductions because they replace an expensive operator with an equivalent less expensive operator. Section 6.1.1 discusses the application of strength reduction to operations that appear inside a loop.

The two entries in the table that refer to multiplication, x × 2 = x + x and i × 2^c = i << c, can be generalized. Multiplication by any integer constant can be performed using only shift and add instructions [Bernstein 1986]. Other transformations are conditional on the values of the affected variables; for example, i/2^c = i >> c if and only if i is nonnegative [Steele 1977]. It is also possible to convert exponentiation to multiplication in the evaluation


write(6,100) c[i]
read(7,100) (d(j), j = 1, 100)
100 format (A1)

(a) original code

call putchar(c[i], 6)
call fgets(d, 100, 7)

(b) after format compilation

Figure 39. Format compilation.

of polynomials, using the identity

    a_n x^n + a_{n-1} x^{n-1} + ... + a_1 x + a_0
        = (a_n x^{n-1} + a_{n-1} x^{n-2} + ... + a_1) x + a_0.

6.6.8 I/O Format Compilation

Most languages provide fairly elaborate facilities for formatting input and output. The formatting specifications are in effect a formatting sublanguage that is generally "interpreted" at run time, with a correspondingly high cost for character input and output.

Formatted writes can be converted almost directly into calls to the run-time routines that implement the various format styles. These calls are then likely candidates for inline substitution. Figure 39 shows two I/O statements and their compiled equivalents. Idiom recognition has been performed to convert the implied do loop into an aggregate operation.

Note that in Fortran, a format statement is analogous to a procedure definition, which may be invoked by any number of read or write statements. The same trade-off as with procedure inlining applies: the formatted I/O can be expanded inline for higher efficiency, or encapsulated as a procedure for code compactness (inlining is described in Section 6.8.5).

Format compilation is done by the VAX Fortran compiler [Harris and Hobbs 1994] and by the Gnu C compiler [Free Software Foundation 1992].

Format compilation is further complicated in C by the fact that printf and scanf are library functions and may be redefined by the programmer.

6.6.9 Superoptimizing

A superoptimizer [Massalin 1987] represents the extreme of optimization, seeking to replace a sequence of instructions with the optimal alternative. It does an exhaustive search, beginning with a single instruction. If all single-instruction sequences fail, two-instruction sequences are searched, and so on.

A randomly generated instruction sequence is checked by executing it with a small number of test inputs that were run through the original sequence. If it passes these tests, a thorough verification procedure is applied.

Superoptimization at compile time is practical only for short sequences (on the order of a dozen instructions). It can also be used by the compiler designer to find optimal sequences for commonly occurring idioms. It is particularly useful for eliminating conditional branches in short instruction sequences, where the pipeline stall penalty may be larger than the cost of executing the operations themselves [Granlund and Kenner 1992].
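As an illustration of the kind of branch-free sequence such a search can discover (this example is ours, not one reported by Massalin), here is selection of the smaller of two integers written without a conditional branch; the bitwise trick assumes two's-complement arithmetic and that the comparison x < y evaluates to 0 or 1, as it does in C.

#include <stdio.h>

/* branchless min: the mask is all ones when x < y, all zeros otherwise */
static int min_branchless(int x, int y) {
    int mask = -(x < y);                 /* 0 or -1 */
    return (x & mask) | (y & ~mask);
}

int main(void) {
    printf("%d %d\n", min_branchless(3, 7), min_branchless(9, 2));  /* prints 3 2 */
    return 0;
}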

6.7 Redundancy Elimination

There are many optimizations that improve performance by identifying redundant computations and removing them [Morel and Renvoise 1979]. We have already covered one such transformation, loop-invariant code motion, in Section 6.1.3. There a computation was being performed repeatedly when it could be done once.

Redundancy-eliminating transformations remove two other kinds of computations: those that are unreachable and those that are useless. A computation is unreachable if it is never executed; removing it from the program will have no semantic effect on the instructions executed. Unreachable code is created by programmers (most frequently with conditional debugging code), or by transformations that have left "orphan" code behind.

A computation is useless if none of the outputs of the program are dependent on it.

6.7.1 Unreachable-Code Elimination

Most compilers perform unreachable-code elimination [Allen and Cocke 1971; Aho et al. 1986]. In structured programs, there are two primary ways for code to become unreachable. If a conditional predicate is known to be true or false, one branch of the conditional is never taken, and its code can be eliminated. The other common source of unreachable code is a loop that does not perform any iterations.

In an unstructured program that relies on goto statements to transfer control, unreachable code is not obvious from the program structure but can be found by traversing the control flow graph of the program.

Both unreachable and useless code are often created by constant propagation, described in Section 6.6.1. In Figure 40(a), the variable debug is a constant. When its value is propagated, the conditional expression becomes if (0 > 1). This expression is always false, so the body of the conditional is never executed and can be eliminated, as shown in Figure 40(b). Similarly, the body of the do loop is never executed and is therefore removed.

Unreachable-code elimination can in turn allow another iteration of constant propagation to discover more constants; for this reason some compilers perform constant propagation more than once.

Unreachable code is also known as dead code, but that name is also applied to useless code, so we have chosen to use the more specific term.

Sometimes considered as a separate step, redundant-control elimination removes control constructs such as loops and conditionals when they become redundant (usually as a result of constant propagation). In Figure 40(b), the loop and conditional control expressions are not used, and we can remove them from the program, as shown in (c).

integer c, n, debug
debug = 0
n = 0
a = b+7
if (debug > 1) then
    c = a+b+d
    print *, 'Warning -- total is ', c
end if
call foo(a)
do i = 1, n
    a[i] = a[i] + c
end do

(a) original code

integer c, n, debug
debug = 0
n = 0
a = b+7
if (0 > 1) then
end if
call foo(a)
do i = 1, 0
end do

(b) after constant propagation and unreachable-code elimination

integer c, n, debug
debug = 0
n = 0
a = b+7
call foo(a)

(c) after redundant control elimination

integer c, n, debug
a = b+7
call foo(a)

(d) after useless code elimination

a = b+7
call foo(a)

(e) after dead variable elimination

Figure 40. Redundancy elimination.

6.7.2 Useless-Code Elimination

Useless code is often created by other optimizations, like unreachable-code elimination. When the compiler discovers that the value being computed by a statement is not necessary, it can remove the code. This can be done for any nonglobal variable that is not live immediately after the defining statement. Live-variable analysis is a well-known data-flow problem [Aho et al. 1986]. In Figure 40(c), the values computed by the assignment statements are no longer used; they have been eliminated in (d).

6.7.3 Dead-Variable Elimination

After a series of transformations, particularly loop optimizations, there are often variables whose value is never used. The unnecessary variables are called dead variables; eliminating them is a common optimization [Aho et al. 1986].

In Figure 40(d), the variables c, n, and debug are no longer used and can be removed; (e) shows the code after the variables are pruned.

6.7.4 Common-Subexpression Elimination

In many cases, a set of computations will contain identical subexpressions. The redundancy can arise both in user code and in address computations generated by the compiler. The compiler can compute the value of the subexpression once, store it, and reuse the stored result [Aho et al. 1977; 1986; Cocke 1970]. Common-subexpression elimination is an important transformation and is almost universally performed. While it is generally a good idea to perform common-subexpression elimination wherever possible, the compiler must consider the current register pressure and the cost of recomputing. If storing the temporary value(s) forces additional spills to memory, the transformation can actually deoptimize the program.
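A small before/after sketch in C (ours, not the paper's); the temporary t stands for the compiler-generated name that would hold the common subexpression.

#include <stdio.h>

int main(void) {
    int a = 3, b = 4, c = 5, x, y;

    /* before: (a + b) is computed twice */
    x = (a + b) * c;
    y = (a + b) - c;

    /* after common-subexpression elimination: compute it once and reuse it */
    int t = a + b;
    x = t * c;
    y = t - c;

    printf("%d %d\n", x, y);    /* prints 35 2 */
    return 0;
}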

6.7.5 Short Circuiting

Short circuiting is an optimization that can be performed on boolean expressions. It is based on the observation that the value of many binary boolean operations can be determined from the value of the first operand [Arden et al. 1962]. For example, the control expression in

if ((a = 1) and (b = 2)) then
    c = 5
end if

is known to be false if a does not equal 1, regardless of the value of b. Short circuiting would convert this expression to:

if (not (a = 1)) goto 10
if (not (b = 2)) goto 10
c = 5
10 continue

Note that if any of the operands in the boolean expression have side effects, short circuiting can change the results of the evaluation. The alteration may or may not be legal, depending on the language semantics. Some language definitions, including C and many dialects of Lisp, address this problem by requiring short circuiting of boolean expressions.

6.8 Procedure Call Transformations

The optimizations described in the next several sections attempt to reduce the overhead of procedure calls in one of four ways:

- eliminating the call entirely;
- eliminating execution of the called procedure's body;
- eliminating some of the entry/exit overhead; and
- avoiding some steps in making a procedure call when the behavior of the called procedure is known or can be altered.

6.8.1 A Calling Convention for S-DLX

To demonstrate the procedure call optimizations, we first define a calling convention for S-DLX. Table 3 shows how the registers are used.

In general, each called procedure is responsible for ensuring that the values in registers R16–R25 are preserved across the call. The stack begins at the top of memory and grows downward. There is no explicit frame pointer; instead, the stack pointer is decremented by the size s of the procedure's frame at entry and left unchanged during the call.


Table 3. S-DLX Registers and Their Usage

Number      Usage
R0          Always zero; writes are ignored
R1          Return value when returning from a procedure call
R2..R7      The first six words of the arguments to the procedure call
R8..R15     8 caller-save registers, used as temporary registers by callee
R16..R25    10 callee-save registers; these registers must be preserved across a call
R26..R29    Reserved for use by the operating system
R30         Stack pointer
R31         Return address during a procedure call
F0..F3      The first four floating-point arguments to the procedure call
F4..F17     14 caller-save floating-point registers
F18..F31    14 callee-save floating-point registers

The value R30 + s serves as a virtual frame pointer that points to the base of the stack frame, avoiding the use of a second dedicated register. For languages that cannot predict the amount of stack space used during execution of a procedure, an additional general-purpose register can be used as a frame pointer, or the size of the frame can be stored at the top of the stack (R30 + 0).

On entering a procedure, the return address is in R31. The first six words of the procedure arguments appear in registers R2–R7, and the rest of the argument data is on the stack. Figure 41 shows the layout of the stack frame for a procedure invocation.

A similar convention is followed for floating-point registers, except that only four are reserved for arguments.

Execution of a procedure consists of six steps:

(1) Space is allocated on the stack for the procedure invocation.

(2) The values of registers that will be modified during procedure execution (and that must be preserved across the call) are saved on the stack. If the procedure makes any procedure calls itself, the saved registers should include the return address, R31.

(3) The procedure body is executed.

(4) The return value (if any) is stored in R1, and the registers that were saved in step 2 are restored.

(5) The frame is removed from the stack.

(6) Control is transferred to the return address.

Calling a procedure is a four-step process:

(1) The values of any of the registers R1–R15 that contain live values are saved. If the values of any global variables that might be used by the callee are in a register and have been modified, the copy of those variables in memory is updated.

(2) The arguments are stored in the designated registers and, if necessary, on the stack.

(3) A linked jump is made to the target procedure; the CPU leaves the address of the next instruction in R31.

(4) Upon return, the saved registers are restored, and the registers holding global variables are reloaded.

To demonstrate the structure of a procedure and the calling convention, Figure 42 shows a simple function and its compiled code. The function (foo) and the function that it calls (max) each take two integer arguments, so they do not need to pass arguments on the stack. The stack frame for foo is three words, which are used to save the return address (R31) and register R16 during the call to max, and to hold the local variable d.


Figure 41. Stack frame layout. From high addresses to low: the caller's stack frame, the incoming arguments (arg n ... arg 1), then the current frame consisting of locals and temporaries, saved registers, and the argument area for called procedures; SP points to the low end of the frame, and the stack grows downward.

R31 must be preserved because it is overwritten by the jump-and-link (JAL) instruction; R16 must be preserved because it is used to hold c across the call.

The procedure first allocates the 12 bytes for the stack frame and saves R31 and R16; then the parameters are loaded into registers. The value of d is calculated in the temporary register R9. Then the addresses of the arguments are stored in the argument registers, and a jump to max is made. On return from max, the return value has been computed into the return value register (R1). After removing the stack frame and restoring the saved registers, the procedure jumps back to its caller through R31.

6.8.2 Leaf Procedure Optimization

A leaf procedure is one that does not call any other procedures; the name comes from the fact that these procedures are leaves in the static call graph. The simplest optimization for leaf procedures is that they do not need to save and restore the return address (R31). Additionally, if the procedure does not have any local variables allocated to memory, the compiler does not need to create a stack frame.

Figure 43(a) shows the function max (called by the previous example function foo), its original compiled code (b), and the code after leaf procedure optimization (c). After eliminating the save/restore of R31, there is no need to allocate a stack frame. Eliminating the code that deallocates the frame also allows the function to return directly if x < y.

6.8.3 Cross-Call Register Allocation

Separate compilation reduces the amount of information available to the compiler about called procedures. However, when both callee and caller are available, the compiler can take advantage of the register usage of the callee to optimize the call.


integer function foo(c, b)
integer c, b
integer d, e
d = c+b
e = max(b, d)
foo = e+c
return
end

(a) source code for function foo

foo: SUBI R30, #12      ;adjust SP
     SW   8(R30), R31   ;save retaddr
     SW   4(R30), R16   ;save R16
     LW   R16, (R2)     ;R16=c
     LW   R8, (R3)      ;R8=b
     ADD  R9, R16, R8   ;R9=d=c+b
     SW   (R30), R9     ;save d
     MOV  R2, R3        ;arg1=addr(b)
     ADDI R3, R30, #0   ;arg2=addr(d)
     JAL  max           ;call max; R1=e
     ADD  R1, R1, R16   ;R1=e+c
     LW   R16, 4(R30)   ;restore R16
     LW   R31, 8(R30)   ;restore retaddr
     ADDI R30, #12      ;restore SP
     JR   R31           ;return

(b) compiled code for foo

Figure 42. Function foo and its compiled code.

If the callee does not need (or can be restricted not to use) all the temporary registers (R8–R15), the caller can leave values in the unused registers throughout execution of the callee. Additionally, move instructions for parameters can be eliminated.

To perform this optimization on a procedure f, registers must first have been allocated for each of the procedures that f calls. This can be done by performing register allocation on procedures ordered by a depth-first postorder traversal of the call graph [Chow 1988].

For example, in Figure 43(c) max uses only R8–R10. Figure 44 shows foo after cross-call register allocation. R31 is saved in R11 instead of on the stack, and the return is a jump to the saved address in R11. Additionally, R12 is used instead of R16, allowing the save and restore of R16 to be eliminated. Only d remains in the stack frame.

integer function max(x, y)
integer x, y
if (x > y) then
    max = x
else
    max = y
end if
return
end

(a) source code for function max

max: SUBI R30, #4       ;adjust SP
     SW   (R30), R31    ;save retaddr
     LW   R8, (R2)      ;R8=x
     LW   R9, (R3)      ;R9=y
     SGT  R10, R8, R9   ;R10=(x > y)
     BEQZ R10, Else     ;x <= y
     MOV  R1, R8        ;max=x
     J    Ret
Else: MOV R1, R9        ;max=y
Ret: LW   R31, (R30)    ;restore R31
     ADDI R30, #4       ;restore SP
     JR   R31           ;return

(b) original compiled code of max

max: LW   R8, (R2)      ;R8=x
     LW   R9, (R3)      ;R9=y
     SGT  R10, R8, R9   ;R10=(x > y)
     BEQZ R10, Else     ;x <= y
     MOV  R1, R8        ;max=x
     JR   R31           ;return
Else: MOV R1, R9        ;max=y
     JR   R31           ;return

(c) max after leaf optimization

Figure 43. Leaf procedure optimization.

foo: SUBI R30, #4       ;adjust SP
     MOV  R11, R31      ;save retaddr
     LW   R12, (R2)     ;R12=c
     LW   R8, (R3)      ;R8=b
     ADD  R9, R12, R8   ;R9=d
     SW   (R30), R9     ;save d
     MOV  R2, R3        ;arg1=addr(b)
     ADDI R3, R30, #0   ;arg2=addr(d)
     JAL  max           ;call max
     ADD  R1, R1, R12   ;R1=e+c
     ADDI R30, #4       ;restore SP
     JR   R11           ;return

Figure 44. Cross-call register allocation for foo.


6.8.4 Parameter Promotion

When a parameter is passed by reference, the address calculation is done by the caller, but the load of the parameter value is done by the callee. This wastes an instruction, since most address calculations can be handled with the offset(Rn) format of load instructions.

max: SGT  R10, R2, R3   ;R10=(x > y)
     BEQZ R10, Else     ;x <= y
     MOV  R1, R2        ;max=x
     JR   R31           ;return
Else: MOV R1, R3        ;max=y
     JR   R31           ;return

(a) max after parameter promotion: x and y are passed by value in R2 and R3

More importantly, if the operand is already in a register in the caller, it must be spilled to memory and reloaded by the callee. If the callee modifies the value, it must then be stored. Upon return to the caller, if the compiler cannot prove that the callee did not modify the operand, it must be loaded again. Thus, as many as two unnecessary loads and two unnecessary stores can be introduced.

An unmodified reference parameter can be passed by value, and a modified reference parameter can be passed by value-result. Figure 45(a) shows max after this transformation has been applied. Figure 45(b) shows the corresponding modified form of foo.

Since d can now be held in a register, there is no longer a need for a stack frame, so frame collapsing can be applied, and the stack frame is eliminated. In general, when the compiler can statically identify all the callers of a leaf procedure, it can expand their stack frames to include enough space for both procedures. The leaf procedure simply uses the caller's stack frame without doing any new allocation of its own.

Parameter promotion is particularly important for languages such as Fortran, in which all argument passing is by reference. However, the argument-passing semantics of value-result are not the same as for arguments passed by reference, in particular in the event of aliasing. Interprocedural analysis may be necessary to determine the legality of the transformation.

6.8.5 Procedure Inlining

Procedure inlining (also known as procedure integration) replaces a procedure call with a copy of the body of the called procedure [Allen and Cocke 1971; Ball 1979; Scheifler 1977].

foo: MOV  R11, R31      ;save retaddr
     LW   R12, (R2)     ;R12=c
     LW   R2, (R3)      ;R2=b
     ADD  R3, R12, R2   ;R3=d
     JAL  max           ;call max
     ADD  R1, R1, R12   ;R1=e+c
     JR   R11           ;return

(b) foo after parameter promotion on max

Figure 45. Parameter promotion.

Each occurrence of a formal parameter is replaced with a version of the corresponding actual parameter, modified to reflect the calling convention of the language. Renaming may also be required for the local variables of the inlined procedure if they conflict with the calling procedure's variable names or if the procedure is inlined more than once in the same caller.

Inlining can almost always be performed, except when the procedure in question is recursive. Even recursive routines may benefit from a finite number of inlining steps, particularly when a constant argument allows the compiler to determine the number of recursive calls that will be performed. For Fortran programs, incompatible common-block usages between caller and callee can make inlining more complex and in practice often prevent it.

When a call is inlined, all the overhead for the invocation is eliminated. The stack frames for the caller and callee are allocated together, and the transfer of control is eliminated. This is particularly important for the return (JR R31), since a jump through a register may incur a higher pipeline penalty than a jump to a fixed address.


do i = 1, n
    call f(a, i)
end do

subroutine f(x, j)
dimension x[*]
x[j] = x[j] + c
return

(a) original code

do all i = 1, n
    a[i] = a[i] + c
end do all

(b) after inlining

Figure 46. Procedure inlining.

Another reason for inlining is to improve compiler analysis and optimization. In many compilers, a loop containing a procedure call cannot be parallelized because its read-write behavior is unknown. After the call is inlined, the compiler may be able to prove loop independence, thereby allowing vectorization or parallelization. Additionally, register usage may be improved, constants propagated more accurately, and more redundant operations eliminated.

An alternative to inlining is to perform interprocedural analysis. The advantage of interprocedural analysis is that it can be applied uniformly, since it does not cause code expansion the way inlining does. However, interprocedural analysis can be costly and increases the complexity of the compiler.

Inlining also affects the instruction cache behavior of the program [McFarling 1991]. The change can be favorable, because locality is improved by eliminating the transfer of control. On the other hand, if a loop body is made much larger, it may no longer fit in the cache and therefore cause additional memory accesses. Furthermore, if the loop contains multiple calls to the same procedure, multiple copies of the procedure will be loaded into the cache.

The primary disadvantage of inlining is that it increases code size, in the worst case exponentially.

foo: LW   R12, (R2)     ;R12=c
     LW   R2, (R3)      ;R2=b
     ADD  R3, R12, R2   ;R3=d
max: SGT  R10, R2, R3   ;R10=(x > y)
     BEQZ R10, Else     ;x <= y
     MOV  R1, R2        ;max=x
     J    Ret           ;"return" to foo
Else: MOV R1, R3        ;max=y
Ret: ADD  R1, R1, R12   ;R1=e+c
     JR   R31           ;return

Figure 47. max inlined into foo.

However, in practice it is simple to control the size increase by selective application of inlining (for example, to small leaf procedures, or to procedures that are only called from a few places). The result can be a dramatic improvement in execution speed. Figure 46 shows a source-level example of inlining; Figure 47 shows the assembler output after the procedure max is inlined into foo.

Ignoring cache effects, assume the following: t_p is the time to execute the entire procedure (including call overhead), t_b is the time to execute just the body of the procedure, n is the number of times it is called, and T is the total execution time of the program. Then

    t_s = n(t_p - t_b)

is the time saved by inlining, and

    I = t_s / T

is the fraction of the total run time saved by inlining.
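To put numbers on this (a hypothetical example, not a measurement from the paper): if t_p = 60 cycles per call, t_b = 40 cycles, n = 1,000,000 calls, and T = 10^8 cycles, then inlining saves t_s = 1,000,000 × (60 - 40) = 2 × 10^7 cycles, so I = 2 × 10^7 / 10^8 = 0.2, or 20 percent of the run time.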

Some recent studies [Cooper et al. 1991; 1992] have demonstrated that realizing the theoretical benefits of inlining may require some modification of the compiler, because inlined code has different characteristics than human-written code.

Peculiarities of a language or a compiler may reduce the effectiveness of inlining or produce unexpected results. Cooper et al. examined several commercial Fortran compilers and uncovered the following potential problems: (1) increased demand for registers; (2) larger numbers of local variables, sometimes exceeding a compiler's limit for the number of analyzable variables; (3) loss of information due to the Fortran convention that procedure parameters may be assumed to be unaliased.

6.8.6 Procedure Cloning

Procedure cloning [Cooper et al. 1993] is a technique for improving optimization across procedure call boundaries. The call sites of the procedure being cloned are divided into groups, and a specialized version of the procedure is created for each group. The specialization often provides opportunities for better optimization, particularly due to improved constant propagation.

In Figure 48, cloning procedure f with p replaced by the constant 2 allows reduction in strength. The real-valued exponentiation is replaced by a multiplication, which is usually at least 10 times faster.

6.8.7 Loop Pushing

Loop pushing (also called loop embedding [Hall et al. 1991]) moves a loop nest from the caller to a cloned version of the called procedure. If a compiler does not perform vectorization or parallelization across procedure calls directly, pushing is a less general way of achieving a similar effect.

For example, pushing is done by the CMAX Fortran preprocessor for the Thinking Machines CM-5. CMAX converts Fortran 77 programs to Fortran 90, attempting to discover data-parallel operations in the process [Sabot and Wholey 1993].

Pushing not only allows the parallelization of the loop in Figure 49, it also eliminates the overhead of all but one of the procedure calls.

If there are other statements in the loop, distribution is a prerequisite to pushing.

call f(a, n, 2)

subroutine f(x, n, p)
real x[*]
integer n, p
do i = 1, n
    x[i] = x[i]**p
end do

(a) original code

call F_2(a, n)

subroutine F_2(x, n)
real x[*]
integer n
do i = 1, n
    x[i] = x[i]*x[i]
end do

(b) after cloning

Figure 48. Procedure cloning.

do i = 1, n
    call f(x, i)
end do

subroutine f(a, j)
real a[*]
a[j] = a[j] + c
return

(a) original loop and procedure

call F_2(x)

subroutine F_2(a)
real a[*]
do all i = 1, n
    a[i] = a[i] + c
end do all
return

(b) after loop pushing

Figure 49. Loop pushing.

In this case, however, the dependence analysis for distribution must be interprocedural. If there are no other statements in the loop, the transformation is always legal.

Procedure inlining (see Section 6.8.5) is a different way of achieving a very similar effect, and does not require interprocedural analysis.

6.8.8 Tail Recursion Elimination

Tail recursion is a particularly common form of recursion. A function is recursive if it invokes itself, directly or indirectly. It is tail recursive if its last act is to call itself and return the value of the recursive call without performing any further processing.

When a function is tail recursive, it is unnecessary to invoke a separate instance with its own stack frame. The recursion can be eliminated; the current invocation will not be using its frame any longer, so the call can be replaced by a jump to the top of the procedure. Figure 50 shows an example of a tail-recursive function (a) and the result after the recursion is eliminated (b). A function that is not tail recursive is shown in Figure 50(c): it uses the result of the recursive call as an operand to the addition, so there is computation that must be performed after the recursive call returns.

Recursive programs can be transformed automatically into tail-recursive versions that can be executed iteratively [Burstall and Darlington 1977], but this is not commonly performed by existing compilers for imperative languages. Some languages prevent tail recursion elimination by requiring clean-up code to be executed after a procedure is finished. The semantics of C++, for example, demand that before a procedure returns it must call a destructor on each stack-allocated local object variable.
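A small sketch (ours, not from the paper) of that transformation in C, applied to a sum over an array like the one in Figure 50(c): the accumulator parameter carries the pending addition, so the recursive call becomes a tail call that a compiler can replace with a jump.

#include <stdio.h>

/* not tail recursive: the addition happens after the recursive call returns */
static double sum_rec(const double a[], int i, int n) {
    if (i == n) return a[i];
    return a[i] + sum_rec(a, i + 1, n);
}

/* tail-recursive version: acc accumulates the partial sum */
static double sum_acc(const double a[], int i, int n, double acc) {
    if (i == n) return acc + a[i];
    return sum_acc(a, i + 1, n, acc + a[i]);   /* tail call: eligible for elimination */
}

int main(void) {
    double a[] = {1.0, 2.0, 3.0, 4.0};
    printf("%f %f\n", sum_rec(a, 0, 3), sum_acc(a, 0, 3, 0.0));   /* both print 10.0 */
    return 0;
}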

Other languages, like Scheme [Rees et al. 1986], require that all tail-recursive procedures be identified and that they be executed without consuming stack space proportional to the recursion depth.

6.8.9 Function Memoization

recursive logical function inarray(a, x, i, n)
real x, a[n]
integer i, n
if (i > n) then
    inarray = .FALSE.
else if (a[i] = x) then
    inarray = .TRUE.
else
    inarray = inarray(a, x, i+1, n)
end if
return

(a) A tail-recursive procedure

logical function inarray(a, x, i, n)
real x, a[n]
integer i, n
1 if (i > n) then
    inarray = .FALSE.
else if (a[i] = x) then
    inarray = .TRUE.
else
    i = i+1
    goto 1
end if
return

(b) After tail recursion elimination

recursive integer function sumarray(a, x, i, n)
real x, a[n]
integer i, n
if (i = n) then
    sumarray = a[i]
else
    sumarray = a[i] + sumarray(a, x, i+1, n)
end if
return

(c) A procedure that is not tail-recursive

Figure 50. Tail recursion elimination.

Memoization is an optimization that is applied to side-effect-free procedures (that is, procedures that do not change the state of the program, also called referentially transparent). In such cases it is possible to cache the results of recent invocations. When the procedure is called again with the same arguments, the cached result is used instead of recomputing it [Abelson and Sussman 1985; Michie 1968].

Figure 51 shows a simple example of memoization. If f is often called with the same arguments and f also takes a nontrivial amount of time to run, then memoization will substantially increase performance.


y = f(i)

(a) original function call

logical f_UNCACHED[n]
real f_CACHE[n]

do i = 1, n
    f_UNCACHED[i] = .true.
end do
...
if (f_UNCACHED[i]) then
    f_CACHE[i] = f(i)
    f_UNCACHED[i] = .false.
end if
y = f_CACHE[i]

(b) code augmented for memoization

Figure 51. Function memoization.

If not, it will degrade performance and consume memory.

This example assumes that f's parameter is confined to the range 1...n, and that n is not extremely large. A more sophisticated memoization scheme would hash the arguments and use a cache size that makes a sensible trade-off between reuse and memory consumption. For functions that return dynamically allocated objects, storage management must be considered as well.
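A minimal sketch (ours, not the paper's) of such a hashed cache in C, assuming a single integer argument and a fixed-size direct-mapped table; a production version would need a better hash, collision handling, and a policy for dynamically allocated results. The function f shown here is a made-up stand-in for the expensive procedure being memoized.

#include <stdio.h>

#define TABLE 256

static int    cached_arg[TABLE];
static double cached_val[TABLE];
static int    valid[TABLE];

/* the (hypothetical) expensive function being memoized */
static double f(int x) { return x * 0.5; }

static double f_memo(int x) {
    unsigned h = (unsigned)x % TABLE;       /* trivial hash of the argument */
    if (!valid[h] || cached_arg[h] != x) {  /* miss: compute the value and fill the slot */
        cached_arg[h] = x;
        cached_val[h] = f(x);
        valid[h] = 1;
    }
    return cached_val[h];                   /* hit, or freshly filled slot */
}

int main(void) {
    printf("%f %f\n", f_memo(7), f_memo(7));   /* the second call is a cache hit */
    return 0;
}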

For c calls and a hit rate of r, the time to execute the c calls is

    T = c(r t_h + (1 - r) t_m)

where t_h is the time to retrieve the memoized result from the cache (a hit), and t_m is the time to call f plus the overhead of discovering that it is not in the cache (a miss). With a high hit rate and a large difference between t_h and t_m, memoization can significantly improve performance.

7. TRANSFORMATIONS FOR PARALLEL MACHINES

All the transformations discussed so far are applicable to a wide variety of computer organizations. In this section we describe transformations that are specific to parallel architectures.

Highly efficient parallel applications have been developed by manually applying the transformations discussed here to sequential code. Duplicating these successes by automatically transforming the code within a compiler is an active area of research. While a great deal of work has been done, automatic parallelization of full-scale applications with existing compilers does not consistently yield significant speedups. Experiments that measure the effectiveness of parallelizing compilers are presented in Section 9.3.

Because the parallelization of sequential code is proving to be difficult, languages like HPF Fortran [High Performance Fortran Forum 1993] provide constructs that allow the programmer to declare data decomposition strategies and to expose parallelism. The compiler is then responsible for generating communication and synchronization code, using various optimizations (discussed below) to reduce the resulting overhead. As of yet, however, there are few real applications that have been written in these languages, and there is little experimental data.

In this section we make extensive use of our model shared-memory and distributed-memory multiprocessors sMX and dMX, which are described fully in the Appendix. Both machines are constructed using S-DLX processors. The programming environment initializes the global variable Pnum to be the number of processors in the machine and Pid to be the number of the local processor (starting with 0).

7.1 Data Layout

One of the key decisions that a compiler must make in targeting a multiprocessor is the decomposition of data across the processors. The need to distribute data and manage it is obvious in a distributed-memory programming model, where communication is explicit. The performance of code under a shared-memory model is also dependent on avoiding unnecessary communication: the fact that communication is performed implicitly does not eliminate its cost.

7.1.1 Regular Array Decomposition

The most common strategy for decomposing an array on a parallel machine is to map each dimension to a set of processors in a regular pattern. The four commonly used patterns are discussed in the following sections.

Serial Decomposition

Serial decomposition is the degenerate case, where an entire dimension is allocated to a single processor. Figure 52(a) shows a simple loop nest that adds c to each element of a two-dimensional array, the decomposition of the array onto processors (b), and the loop transformed into code for sMX (c). Each column is mapped serially, meaning that all the data in that column will be placed on a single processor. As a result, the loop that iterates over the columns (the inner loop) is unchanged.

Block Decomposition

A block decomposition divides the elements into one group of adjacent elements per processor. Using the previous example, the rows of array a are divided between the four processors, as shown in Figure 52(b). Because the rows have been divided across the processors, the outer loop in Figure 52(c) has been modified so that each processor iterates only over the local portion of each row.

Figure 52(d) shows the decomposition if block scheduling is applied across both the horizontal and vertical dimensions.

A major difference between compilation for shared- and distributed-memory machines is that shared-memory machines provide a global name space, so that any particular array element can be named by any processor using the same subscripts as in the sequential code. On the other hand, on a distributed-memory machine, arrays must usually be divided

do all j = 1, n
    do all i = 1, n
        a[i, j] = a[i, j] + c
    end do all
end do all

(a) original loop

[diagram: the array divided into four contiguous groups of whole columns, one group per processor]

(b) (block, serial) decomposition of a on four processors

call FORK(Pnum)
do j = (n/Pnum)*Pid+1, min((n/Pnum)*(Pid+1), n)
    do i = 1, n
        a[i, j] = a[i, j] + c
    end do
end do
call JOIN()

(c) corresponding (block, serial) scheduling of the loop, using block size n/Pnum

[diagram: the array divided into 2 x 2 blocks of contiguous elements, one block per processor]

(d) (block, block) decomposition of a on four processors

Figure 52. Serial and block decompositions of an array on the shared-memory multiprocessor sMX.

across the processors, each of which has its own private address space.

As a result, naming a nonlocal array element requires a mapping function from the original, global name space to a processor number and local array indices within that processor. The examples shown here are all for shared memory, and therefore sidestep these complexities, which are discussed in Section 7.3. However, the same decomposition principles are used for distributed memory.
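A small sketch (ours, not the paper's) of such a mapping function in C for a one-dimensional block decomposition of N elements over P processors; the sizes and the ceiling-based block size are assumptions of the example.

#include <stdio.h>

#define N 100   /* global array length (assumed) */
#define P 4     /* number of processors (assumed) */

/* map a global index g (0-based) to its owning processor and local index */
static void owner_of(int g, int *proc, int *local) {
    int block = (N + P - 1) / P;   /* ceiling(N/P) elements per processor */
    *proc  = g / block;
    *local = g % block;
}

int main(void) {
    int proc, local;
    owner_of(57, &proc, &local);
    printf("global 57 -> processor %d, local %d\n", proc, local);  /* processor 2, local 7 */
    return 0;
}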


[diagram: successive columns of the array assigned to processors 0, 1, 2, 3 in round-robin order]

(a) (cyclic, serial) decomposition of a on four processors

call FORK(Pnum)
do j = Pid+1, n, Pnum
    do i = 1, n
        a[i, j] = a[i, j] + c
    end do
end do
call JOIN()

(b) corresponding (cyclic, serial) scheduling of the loop

Figure 53. Cyclic decomposition on sMX.

The advantage of block decomposition is that adjacent elements are usually on the same processor. Because a loop that computes a value for a given element often uses the values of neighboring elements, the blocks have good locality and reduce the amount of communication.

Cyclic Decomposition

A cyclic decomposition assigns successive elements to successive processors. A cyclic allocation of columns to processors for the sequential code in Figure 52(a) is shown in Figure 53(a). Figure 53(b) shows the transformed code for sMX.

Cyclic decomposition has the opposite effect of blocking: it has poor locality for neighbor-based communication, but spreads load more evenly. Cyclic decomposition works well for loops like the one in Figure 54, as long as the array is large enough so that each processor handles many columns. A common algorithm that results in this type of load balance is LU factorization of matrices. A block decomposition would work poorly because the iterations most costly to execute would

do j = 1, n
    do i = 1, j
        total = total + a[i, j]
    end do
end do

(a) loop

(b) elements of a read by the loop

Figure 54. Triangular iteration space.

be clustered together and computed by a single processor. With a cyclic decomposition, the expensive iterations are spread across the processors.

Block-Cyclic Decomposition

A block-cyclic decomposition combines the two strategies; the elements are divided into many adjacent groups, usually an integer multiple of the number of processors available. Each group of elements is assigned to a processor cyclically. Figure 55(a) shows an array whose columns have been allocated to two processors in a block-cyclic fashion, and Figure 55(b) gives the code to implement the mapping on sMX. Block-cyclic is a compromise between the locality of block decomposition and the load balancing of cyclic decomposition.

Irregular Computations

Some loops, like the one shown in Figure 56, have highly irregular execution behavior. When the variance of execution times is high, none of the static regular decomposition strategies may be sufficient to yield good performance. A variety of dynamic algorithms make


[diagram: columns of the array assigned to processors 0 and 1 in alternating groups of two]

(a) (block-cyclic, serial) decomposition of a on two processors

call FORK(Pnum)
do k = 1, n, 2*Pnum
    do j = k + 2*Pid, k + 2*Pid + 1
        do i = 1, n
            a[i, j] = a[i, j] + c
        end do
    end do
end do
call JOIN()

(b) corresponding (block-cyclic, serial) scheduling of the loop, using block size 2

Figure 55. Block-cyclic decomposition on sMX.

do i = 1, n
    if (mask[i] = 1) then
        a[i] = expensive_function()
    end if
end do

(a) example loop

(b) iteration versus cost for the above loop

Figure 56. Irregular execution behavior.

scheduling decisions at run time based on the historical behavior of the computation [Lucco 1992; Polychronopoulos and Kuck 1987; Tang and Yew 1990].

7.1.2 Automatic Decomposition and Alignment

The decomposition of an array determines how the elements are distributed across a set of processors; in a block decomposition, for example, the array might be divided into contiguous 10 x 10 subarrays. The alignment of the array establishes the specific elements that are in each of those subarrays.

Languages like HPF Fortran rely on the programmer to declare how arrays are to be aligned and what style of decomposition to use; the language includes annotations like BLOCK and CYCLIC. A variety of research projects have investigated whether the strategy can be determined automatically without programmer assistance. Programmers can find it difficult to choose a good set of declarations, particularly since the optimal decomposition can change during the course of program execution.

The general strategy for automatingdecomposition and alignment is to ex-press the behavior of the program in arepresentation that either captures com-munication explicitly or allows it to becomputed. The compiler applies opera-tions that reflect different decompositionand alignment choices; the goal is tomaximize parallelism and minimizecommunication. The approaches includethe Alignment-Distribution Graph

[Chatterjee et al. 1993a; 1993b], theComponent Affinity Graph [Li and Chen1990; 1991], constraints [Gupta 1992],and affine transformations [Andersonand Lam 1993].

In addition to presenting an algorithmto find the optimal mapping of arrays toprocessors for one- and two-dimensionalarrays, Mace [1985; 1987] shows thatfinding optimal decompositions is anNP-complete problem.

7.1.3 Scalar Privatization

The goal behind privatization is to increase the amount of parallelism and to avoid unnecessary communication. When a scalar is used within a loop solely as a scratch variable, each processor can be given a private copy so the use of the scalar need not involve any communication.


do i = 1, n
  c = b[i]
  a[i] = a[i] + c
end do

Figure 57. Scalar value suitable for privatization.

The transformation is safe if there are no loop-carried dependences involving the scalar; before the scalar is used in the loop body, its value is always updated within the same iteration.

Figure 57 shows a loop that could benefit from privatization. The scalar value c is simply a scratch value; if it is privatized, the loop is revealed to have independent iterations. If the arrays are allocated to separate processors before the loop is executed, no further communication is necessary. In this very simple example the variable could also have been eliminated entirely, but in more complex loops temporary variables are repeatedly updated and referenced.

Scalar privatization is analogous to scalar expansion (see Section 6.5.2); in both cases, spurious dependences are eliminated by replicating the variable. The transformation is widely incorporated into parallel compilers [Allen et al. 1988b; Cytron and Ferrante 1987].
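A sketch of the privatized loop of Figure 57 on sMX (illustrative; not a figure from the original text): assuming c is a stack-allocated local, the FORK semantics described in the Appendix give each processor its own private copy, so the iterations can be divided among processors with the block bounds of Figure 58(d) and no communication.

call FORK(Pnum)
LB = (n/Pnum)*Pid + 1
UB = (n/Pnum)*(Pid + 1)
do i = LB, UB
  c = b[i]
  a[i] = a[i] + c
end do
call JOIN()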

7.1.4 Array Privatization

Array privatization is similar to scalar privatization; each processor is given its own copy of the entire array. The effect of privatizing an array is to reduce communication requirements and to expose additional parallelism by removing unnecessary dependences caused by storage reuse.

Parallelization studies of real application programs have found that privatization is necessary for high performance [Eigenmann et al. 1991; Singh and Hennessy 1991]. Despite its importance, verifying that privatization of an array is legal is much more difficult than the equivalent test on a scalar value.

Traditionally, data-flow analysis [Muchnick and Jones 1981] treats an entire array as a single value. This lack of precision prevents most of the loop optimizations discussed in this survey and led to the development of dependence analysis. Feautrier [1988; 1991] extends dependence analysis to support array expansion, essentially the same optimization as privatization. The analysis is expensive because it requires the use of parametric integer programming. Other privatization research has extended data-flow analysis to consider the behavior of individual array elements [Li 1992; Maydan et al. 1993; Tu and Padua 1993].

In addition to determining whether privatization is legal, the compiler must also check whether the values that are left in an array are used by subsequent computations. If so, after the array is privatized the compiler must add code to preserve those final values by performing a copy-out operation from the private copies to a global array.
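A sketch of array privatization with copy-out on sMX (the loop and the names t, tp, a, b, m, and n are hypothetical examples, not taken from the original text): every iteration of j completely rewrites the scratch array before reading it, so each processor can work on its own private copy tp while t remains the shared original; because t is read after the loop nest, the processor that executes the final iteration copies its values back.

call FORK(Pnum)
LB = (n/Pnum)*Pid + 1
UB = (n/Pnum)*(Pid + 1)
do j = LB, UB
  do i = 1, m
    tp[i] = a[i, j] * 2.0
  end do
  do i = 1, m
    b[i, j] = tp[i] + tp[m+1-i]
  end do
end do
if (UB = n) then
  do i = 1, m
    t[i] = tp[i]
  end do
end if
call JOIN()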

7.1.5 Cache Alignment

On shared-memory processors like sMX, when one processor modifies storage, the cache line in which it resides is flushed by any other processors holding a copy. The flush is necessary to ensure proper synchronization when multiple processors are updating a single shared variable. If several processors are continually updating different variables or array elements that reside in the same cache line, however, the constant flushing of cache lines may severely degrade performance. This problem is called false sharing.

When false sharing occurs between parallel loop iterations accessing different columns of an array or different structures, it can be addressed by aligning each of the columns or structures to a cache line boundary. False sharing of scalars can be addressed by inserting padding between them so that each shared scalar resides in its own cache line (padding is discussed in Section 6.5.1). A further optimization is to place shared variables and their corresponding lock variables in the same cache line, which will cause the lock acquisition to act as a prefetch of the data [Torrellas et al. 1994].
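A sketch of padding out shared scalars (an illustrative declaration, not from the original text): sMX has 64-byte cache lines and 4-byte words, so 15 words of padding after each counter give every counter its own line, assuming the common block itself starts on a cache-line boundary.

integer count0, pad0(15)
integer count1, pad1(15)
common /counters/ count0, pad0, count1, pad1

Updates to count0 by one processor then no longer invalidate the line that holds count1, at the cost of 60 wasted bytes per counter.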


7.2 Exposing Coarse-Grained Parallelism

Transferring data can be an expensive operation on a machine with asynchronously executing processors. One way to avoid the inefficiency of frequent communication is to divide the program into subcomputations that will execute for an extended period of time without generating message traffic. The following optimizations are designed to identify such computations or to help the compiler create them.

7.2.1 Procedure Call Parallelization

One potential source of coarse-grained parallelism in imperative languages is the procedure call. In order to determine whether a call can be performed as an independent task, the compiler must use interprocedural analysis to examine its behavior.

Because of its importance to all forms of optimization, interprocedural analysis has received a great deal of attention from compiler researchers. One approach is to propagate data-flow and dependence information through call sites [Banning 1979; Burke and Cytron 1986; Callahan et al. 1986; Myers 1981]. Another is to summarize the array access behavior of a procedure and use that summary to evaluate whether a call to it can be executed in parallel [Balasundaram 1990; Callahan and Kennedy 1988a; Li and Yew 1988; Triolet et al. 1986].

7.2.2 Split

A more comprehensive approach to program decomposition summarizes the data usage behavior of an arbitrary block of code in a symbolic descriptor [Graham et al. 1993]. As in the previous section, the summary identifies independence between computations, whether between procedure calls or between different loop nests. Additionally, the descriptor is used in applying the split transformation.

Split is unusual in that it is not applied in isolation to a computation like a loop nest. Instead, split considers the relationship of pairs of computations. Two loop nests, for example, may have dependences between them that prevent the compiler from executing both simultaneously on different processors. However, it is often the case that only some of the iterations conflict. Split uses memory summarization to identify the iterations in one loop nest that are independent of the other one, placing those iterations in a separate loop. Applying split to an application exposes additional sources of concurrency and pipelining.

7.2.3 Graph Partitioning

Data-flow languages [Ackerman 1982; McGraw 1985; Nikhil 1988] expose parallelism explicitly. The program is converted into a graph that represents basic computations as nodes and the movement of data as arcs between nodes. Data-flow graphs can also be used as an internal representation to express the parallel structure of programs written in imperative languages.

One of the major difficulties with data-flow languages is that they expose parallelism at the level of individual arithmetic operations. As it is impractical to use software to schedule such a small amount of work, early projects focused on developing architectures that embed data-flow execution policies into hardware [Arvind and Culler 1986; Arvind et al. 1980; Dennis 1980]. Such machines have not proven to be successful commercially, so researchers began to develop techniques for compiling data-flow languages on conventional architectures. The most common approach is to interpret the data-flow graph dynamically, executing a node representing a computation when all of its operands are available. To reduce scheduling overhead, the data-flow graph is generally transformed by gathering many simple computations into larger blocks that are executed atomically [Anderson and Hudak 1990; Hudak and Goldberg 1985; Mirchandaney et al. 1988; Sarkar 1989].

7.3 Computation Partitioning

In addition to decomposing data as discussed in Section 7.1, parallel compilers


must allocate the computations of a program to different processors. Each processor will generally execute a restricted set of iterations from a particular loop. The transformations in this section seek to enforce correct execution of the program without introducing unnecessary overhead.

7.3.1 Guard Introduction

After a loop is parallelized, not all the processors will compute the same iterations or send the same slices of data. The compiler can ensure that the correct computations are performed by introducing conditional statements (known as guards) into the program [Callahan and Kennedy 1988b].

Figure 58(a) shows an example of a sequential loop that we will transform for parallel execution. Figure 58(b) shows the parallel version of the loop. The code computes upper and lower bounds that the guards use to prevent computation on array elements that are not stored locally within the processor. The bounds depend on the size of the array, the number of processors, and the Pid of the processor.

We have made a number of simplifying assumptions: the value of n is an even multiple of Pnum, and the arrays are already distributed across the processors with a block decomposition. We have also chosen an example that does not require interprocessor communication. To parallelize the loops encountered in practice, compilers must introduce communication statements and much more complex guard and induction variable expressions.

dMX does not have a global namespace, so each processor's instance of an array is separate. In Figure 58(b), the compiler could take advantage of the isolation of each processor to remap the array elements. Instead of the original array of size n, each processor could declare a smaller array of n/Pnum elements and modify the loop to iterate from 1 to n/Pnum. Although the remapping would simplify the loop bounds,

do i = 1, n
  a[i] = a[i] + c
  b[i] = b[i] + c
end do

(a) original loop

LBA = (n/Pnum)*Pid + 1
UBA = (n/Pnum)*(Pid + 1)
LBB = (n/Pnum)*Pid + 1
UBB = (n/Pnum)*(Pid + 1)
do i = 1, n
  if (LBA <= i .and. i <= UBA)
    a[i] = a[i] + c
  if (LBB <= i .and. i <= UBB)
    b[i] = b[i] + c
end do

(b) parallelized loop with guards introduced

LB = (n/Pnum)*Pid + 1
UB = (n/Pnum)*(Pid + 1)
do i = 1, n
  if (LB <= i .and. i <= UB) then
    a[i] = a[i] + c
    b[i] = b[i] + c
  end if
end do

(c) after guard combination

LB = (n/Pnum)*Pid + 1
UB = (n/Pnum)*(Pid + 1)
do i = LB, UB
  a[i] = a[i] + c
  b[i] = b[i] + c
end do

(d) after bounds reduction

Figure 58. Computation partitioning.

we preserve the original element indices to maintain a closer relationship between the sequential code and its transformed version.

7.3.2 Redundant Guard Elimination

In the same way that the compiler can use code hoisting to optimize computation, it can reduce the number of guards


by hoisting them to the earliest point where they can be correctly computed. Hoisting often reveals that identical guards have been introduced and that all but one can be removed. In Figure 58(b), the two guard expressions are identical because the arrays have the same decomposition. When the guards are hoisted to the beginning of the loop, the compiler can eliminate the redundant guard. The second pair of bound computations becomes useless and is also removed. The final result is shown in Figure 58(c).

7.3.3 Bounds Reduction

In Figure 58(c), the guards control which iterations of the loop perform computation. Since the desired set of iterations is a contiguous range, the compiler can achieve the same effect by changing the induction expressions to reduce the loop bounds [Koelbel 1990]. Figure 58(d) shows the result of the transformation.

7.4 Communication Optimization

An important part of compilation for distributed-memory machines is the analysis of an application's communication needs and the introduction of explicit message-passing operations into it. Before a statement is executed on a given processor, any data it relies on that is not available locally must be sent. A simple-minded approach is to generate a message for each nonlocal data item referenced by the statement, but the resulting overhead will usually be unacceptably high. Compiler designers have developed a set of optimizations that reduce the time required for communication.

The same fundamental issues arise in communication optimization as in optimization for memory access (described in Section 6.5): maximizing reuse, minimizing the working set, and making use of available parallelism in the communication system. However, the problems are magnified because while the ratio of one memory access to one computation on S-DLX is 16:1, the ratio of the number of cycles required to send one word to the number of cycles required for a single operation on dMX is about 500:1.

Like vectors on vector processors, communication operations on a distributed-memory machine are characterized by a startup time, t_s, and a per-element cost, t_b, the time required to send one byte once the message has been initiated. On dMX, t_s = 10 μs and t_b = 100 ns, so sending one 10-byte message costs 11 μs while sending two 5-byte messages costs 21 μs. To avoid paying the startup cost unnecessarily, the optimizations in this section combine data from multiple messages and send them in a single operation.

7.4.1 Message Vectorization

Analysis can often determine the set of data items transferred in a loop. Rather than sending each element of an array in an individual message, the compiler can group many of them together and send them in a single block transfer. Because this is analogous to the way a vector processor interacts with memory, the optimization is called message vectorization [Balasundaram et al. 1990; Gerndt 1990; Zima et al. 1988].

Figure 59(a) shows a sample loop that adds to each element of array a the mirror element of array b. Figure 59(b) is an inefficient parallel version for processors 0 and 1 on dMX. To simplify the code, we assume that the arrays have already been allocated to the processors using a block decomposition: the lower half of each array is on processor 0 and the upper half on processor 1.

Each processor begins by computing the lower and upper bounds of the range that it is responsible for. During each iteration, it sends the element of b that the other processor will need and waits to receive the corresponding message. Note that Fortran's call-by-reference semantics convert the array reference implicitly into the address of the corresponding element, which is then used by the low-level communication routine to extract the bytes to be sent or received.


do i = 1, n
  a[i] = a[i] + b[n+1-i]
end do

(a) original loop

LB = Pid * (n/2) + 1
UB = LB + (n/2) - 1
otherPid = 1 - Pid
do i = LB, UB
  call SEND(b[i], 4, otherPid)
  call RECEIVE(b[n+1-i], 4)
  a[i] = a[i] + b[n+1-i]
end do

(b) parallel loop

LB = Pid * (n/2) + 1
UB = LB + (n/2) - 1
otherPid = 1 - Pid
otherLB = otherPid * (n/2) + 1
otherUB = otherLB + (n/2) - 1
call SEND(b[LB], (n/2)*4, otherPid)
call RECEIVE(b[otherLB], (n/2)*4, otherPid)
do i = LB, UB
  a[i] = a[i] + b[n+1-i]
end do

(c) parallel loop with vectorized messages

do j = LB, UB, 256
  call SEND(b[j], 256*4, otherPid)
  call RECEIVE(b[otherLB + (j-LB)], 256*4, otherPid)
  do i = j, j+255
    a[i] = a[i] + b[n+1-i]
  end do
end do

(d) after strip mining messages (assuming array size is a multiple of 256)

Figure 59. Message vectorization.

When the message arrives, the iteration proceeds.

Figure 59(c) is a much more efficient version that handles all communication in a single message before the loop begins executing. Each processor computes the upper and lower bound for both itself and for the other processor so as to place the incoming elements of b properly.

Message-passing libraries and network hardware often perform poorly when their internal buffer sizes are exceeded. The compiler may be able to perform additional communication optimization by using strip mining to reduce the message length as shown in Figure 59(d).

7.4.2 Message Coalescing

Once message vectorization has been performed, the compiler can further reduce the frequency of communication by grouping together messages that send overlapping or adjacent data. The Fortran D compiler [Tseng 1993] uses Regular Sections [Callahan and Kennedy 1988a], an array summarization strategy, to describe the array slice in each message. When two slices being sent to the same processor overlap or cover contiguous ranges of the array, the associated messages are combined.

7.4.3 Message Aggregation

Sending a message is generally much more expensive than performing a block copy locally on a processor. Therefore it is worthwhile to aggregate messages being sent to the same processor even if the data they contain is unrelated [Tseng 1993]. A simple aggregation strategy is to identify all the messages directed at the same target processor and copy the various strips of data into a single buffer. The target processor performs the reverse operation.
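A sketch of the aggregation strategy on dMX (the buffer buf and the strip lengths are illustrative, not from the original text): two unrelated strips bound for the same processor are copied into one buffer and sent with a single SEND, paying the message startup cost t_s once instead of twice.

do i = 1, 100
  buf[i] = a[i]
end do
do i = 1, 50
  buf[100 + i] = b[i]
end do
call SEND(buf[1], 150*4, otherPid)

The receiving processor issues one RECEIVE into a buffer of the same size and copies the two strips back out into a and b.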

7.4.4 Collective Communication

Many parallel architectures and mes-sage-pawing libraries offer e.pecial-pur-pose communication primitives such asbroadcast, hardware reduction, and scat-ter-gather. Compilers can improve per-formance by recognizing opportunities toexploit these operations, which are oftenhighly efficient. The idea is analogous toidiom and reduction recognition on se-quential machines. Li and Chen [1991]present compiler techniques that rely onpattern matching to identify opportuni-ties for using collective communication.
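For example, a loop that sends the same block of x to every other processor on dMX (the names x, m, and p are illustrative assumptions)

do p = 0, Pnum - 1
  if (p .ne. Pid) then
    call SEND(x[1], m*4, p)
  end if
end do

can be recognized and replaced with the library's collective primitive:

call BROADCAST(x[1], m*4)

According to the Appendix, the broadcast completes in roughly twice the latency of a single message to the far edge of the mesh, rather than Pnum - 1 sequential message startups.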


7.4.5 Message Pipelining

Another important optimization is to pipeline parallel computations by overlapping communication and computation. Studies have demonstrated that many applications perform very poorly without pipelining [Rogers 1991]. Many message-passing systems allow the processor to continue executing instructions while a message is being sent. Some support fully nonblocking send and receive operations. In either case, the compiler has the opportunity to arrange for useful computation to be performed while the network is delivering messages.

A variety of algorithms have been developed to discover opportunities for pipelining and to move message transfer operations so as to maximize the amount of resulting overlap [Rogers 1991; Koelbel and Mehrotra 1991; Tseng 1993].
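A sketch of the idea on dMX (the bounds LB and UB, the neighbor leftPid, and the use of column UB+1 as a ghost column are illustrative assumptions, not from the original text): because SEND returns as soon as the message has been copied to the network processor, the boundary column needed by the neighbor is sent first, the purely local columns are updated while the message is in flight, and the RECEIVE is issued only when the incoming column is finally needed.

call SEND(a[1, LB], n*4, leftPid)
do j = LB, UB - 1
  do i = 1, n
    b[i, j] = a[i, j] + a[i, j+1]
  end do
end do
call RECEIVE(a[1, UB+1], n*4)
do i = 1, n
  b[i, UB] = a[i, UB] + a[i, UB+1]
end do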

7.4.6 Redundant Communication Elimination

Wherever possible, the compiler avoids sending messages by performing a variety of transformations that eliminate redundant communication. Many of the optimizations covered earlier can also be used on messages. If a message is sent within the body of a loop but the data does not change from one iteration to the next, the SEND can be hoisted out of the loop. When two messages contain the same data, only one need be sent.
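A sketch of hoisting a loop-invariant SEND on dMX (the loop and the names x, y, m, steps, and otherPid are illustrative): x is not modified inside the k loop, so the message it carries is identical in every iteration and a single SEND before the loop suffices.

do k = 1, steps
  call SEND(x[1], m*4, otherPid)
  do i = 1, m
    y[i] = y[i] + x[i]
  end do
end do

After hoisting:

call SEND(x[1], m*4, otherPid)
do k = 1, steps
  do i = 1, m
    y[i] = y[i] + x[i]
  end do
end do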

Messages offer further opportunities for optimization. If the contents of a message are subsumed by a previous communication, the message need not be sent; this situation is often created when SENDs are hoisted in order to maximize pipelining opportunities. If a message contains data, some of which has already been sent, the overlap can be removed to reduce the amount of data transferred. Another possibility is that a message is being sent to a collection of processors, some of which previously received the data. The list of recipients can be pruned, reducing the amount of communication traffic. These optimizations are used by the PTRAN II compiler [Gupta et al. 1993] to reduce overall message traffic.

7.5 SIMD Transformations

SIMD architectures exhibit much more regular behavior than MIMD machines, eliminating many problematic synchronization issues. In addition to the alignment and decomposition strategies for distributed-memory systems (see Section 7.1.2), the regularity of SIMD interconnection networks offers additional opportunities for optimization. The compiler can use very accurate cost models to estimate the performance of a particular layout.

Early SIMD compilation work targeted IVTRAN [Millstein and Muntz 1975], a Fortran dialect for the Illiac IV that provided layout and alignment declarations. The compiler provided a parallelizing module called the Paralyzer [Presberg and Johnson 1975] that used an early form of dependence analysis to identify independent loops and applied linear transformations to optimize communication.

The Connection Machine Convolution Compiler [Bromley et al. 1991] targets the topology of the SIMD architecture explicitly with a pattern-matching strategy. The compiler focuses on computations that update array elements based on their neighbors. It finds the pattern of neighbors that are needed to compute a given element, called the stencil. The cross, a common stencil, represents a computation that updates each array element using the value of its neighbors to the north, east, west, and south. Stencils are aggregated into larger patterns, or multistencils, using optimizations analogous to loop unrolling and strip mining. The multistencils are mapped to the hardware so as to minimize communication.

SIMD compilers developed at Compass [Knobe and Natarajan 1993; Knobe et al. 1988; 1990] construct a preference graph based on the computations being performed. The graph represents alignment requests that would yield the least communication overhead. The compiler satisfies every request when possible, using a greedy algorithm to find a solution when preferences are in conflict.


Wholey [1992] begins by computing a detailed cost function for the computation. The function is the input to a hill-climbing algorithm that searches the space of possible allocations of data to processors. It begins with two processors, then four, and so forth; each time the number of processors is doubled, the algorithm uses the most efficient allocation from the previous stage as a starting point.

In addition to general-purpose mapping algorithms, the compiler can use specific transformations like loop flattening [von Hanxleden and Kennedy 1992] to avoid idle time due to nonuniform computations.

7.6 VLIW Transformations

The Very Long Instruction Word (VLIW) architecture is another strategy for executing instructions in parallel. A VLIW processor [Colwell et al. 1988; Fisher et al. 1984; Floating Point Systems 1979; Rau et al. 1989] exposes its multiple functional units explicitly; each machine instruction is very large (on the order of 512 bits) and controls many independent functional units. Typically the instruction is divided into 32-bit pieces, each of which is routed to one of the functional units.

With traditional processors, there is a clear distinction between low-level scheduling done via instruction selection and high-level scheduling between processors. This distinction is blurred in VLIW architectures. They rely on the compiler to expose the parallel operations explicitly and schedule them at compile time. Thus the task of generating object code incorporates high-level resource allocation and scheduling decisions that occur only between processors on a more conventional parallel architecture.

Trace scheduling [Ellis 1986; Fisher 1981; Fisher et al. 1984] is an analysis strategy that was developed for VLIW architectures. Because VLIW machines require more parallelism than is typically available within a basic block, compilers for these machines must look through branches to find additional instructions to execute. The multiple paths of control require the introduction of speculative execution: computing some results that may turn out to be unused (ideally at no additional cost, by functional units that would otherwise have gone unused).

The speculation can be handled dynamically by the hardware [Sohi and Vajapeyam 1990], but that complicates the design significantly. The trace-scheduling alternative is to have the compiler do it by identifying paths (or traces) through the control-flow graph that might be taken. The compiler guesses which way the branches are likely to go and hoists code above them accordingly. Superblock scheduling [Hwu et al. 1993] is an extension of trace scheduling that relies on program profile information to choose the traces.

8. TRANSFORMATION FRAMEWORKS

Given the many transformations that compiler writers have available, they face a daunting task in determining which ones to apply and in what order. There is no single best order of application; one transformation can permit or prevent a second from being applied, or it can change the effectiveness of subsequent changes. Current compilers generally use a combination of heuristics and a partially fixed order of applying transformations.

There are two basic approaches to this problem: unifying the transformations in a single mechanism, and applying search techniques to the transformation space. In fact there is often a degree of overlap between these two approaches.

8.1 Unified Transformation

A promising strategy is to encode both the characteristics of the code being transformed and the effect of each transformation; then the compiler can quickly search the space of possible sets of transformations to find an efficient solution.


One framework that is being actively investigated is based on unimodular matrix theory [Banerjee 1991; Wolf and Lam 1991]. It is applicable to any loop nest whose dependences can be described with a distance vector; a subset of the loops that require a direction vector can also be handled. The transformations that a unimodular matrix can describe are interchange, reversal, and skew.

The basic principle is to encode each transformation of the loop in a matrix and apply it to the dependence vectors of the loop. The effect on the dependence pattern of applying the transformation can be determined by multiplying the matrix and the vector. The form of the product vector reveals whether the transformation is valid. To model the effect of applying a sequence of transformations, the corresponding matrices are simply multiplied.

Figure 60 shows a loop nest that can be transformed with unimodular matrices. The distance vector that describes the loop is D = (1, 0), representing the dependence of iteration i on i - 1 in the outer loop.

Because of the dependence, it is not legal to reverse the outer loop of the nest. The reversal transformation is

R = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}.

The product is

R D = P_1 = \begin{pmatrix} -1 \\ 0 \end{pmatrix}.

This demonstrates that the transformation is not legal, because P_1 is not lexicographically positive.

We can also test whether the two loops can be interchanged; the interchange transformation is

I = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.

Applying that to D yields

P_2 = I D = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.

In this case, the resulting vector is lexicographically positive, showing that the transformation is legal.

do i = 2, 10
  do j = 1, 10
    a[i, j] = a[i-1, j] + a[i, j]
  end do
end do

Figure 60. Unimodular transformation example.

Any loop nest whose dependences are all representable by a distance vector can be transformed via skewing into a canonical form called a fully permutable loop nest. In this form, any two loops in the nest can be interchanged without changing the loop semantics. Once in this canonical form, the compiler can decompose the loop into the granularity that matches the target architecture [Wolf and Lam 1991].
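As a worked example of composing transformations (not taken from the original text), consider interchanging the loops of Figure 60 and then reversing the new outer loop. The combined transformation is the product of the two matrices, and applying it to D shows that the sequence is legal even though reversal alone was not:

T = R I = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}, \qquad
T D = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.

After the interchange, the dependence distance carried by the new outer loop is 0, so reversing that loop preserves the dependence and T D remains lexicographically positive.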

Sarkar and Thekkath [1992] describe a framework for transforming perfect loop nests that includes unimodular transformations, tiling, coalescing, and parallel loop execution. The transformations are encoded in an ordered sequence. Rules are provided for mapping the dependence vectors and loop bounds, and the transformation to the loop body is described by a template.

Pugh [1991] describes a more ambitious (and time-consuming) technique that can transform imperfectly nested loops and can do most of the transformations possible through a combination of statement reordering, interchange, fusion, skewing, reversal, distribution, and parallelization. It views the transformation problem as that of finding the best schedule for a set of operations in a loop nest. A method is given for generating and testing candidate schedules.

8.2 Searching the Transformation Space

Wang [1991] and Wang and Gannon [1989] describe a parallelization system that uses heuristic search techniques from artificial intelligence to find a program transformation sequence. The target machine is represented by a set of features that describe the type, size, and speed of the processors, memory, and interconnect.


The heuristics are organized hierarchically. The main functions are: description of parallelism in the program and in the machine; matching of program parallelism to machine parallelism; and control of restructuring.

9. COMPILER EVALUATION

Researchers are still trying to find a good way to evaluate the effectiveness of compilers. There is no generally agreed-upon way to determine the best possible performance of a particular program on a particular machine, so it is difficult to determine how well a compiler is doing. Since some applications are better structured than others for a given architecture or a given compiler, measurements for a particular application or group of applications will not necessarily predict how well another application will fare.

Nevertheless, a wide variety of measurement studies do exist that seek to evaluate how applications behave, how well they are being compiled, and how well they could be compiled. We have divided these studies into several groups.

9.1 Benchmarks

Benchmarks have received by far the most attention since they measure delivered performance, yielding results that are used to market machines. They were originally developed to measure machine speed, not compiler effectiveness. The Livermore Loops [McMahon 1986] is one of the early benchmark suites; it sought to compare the performance of supercomputers. The suite consists of a set of small loops based on the most time-consuming inner loops of scientific codes.

The SPEC benchmark suite [Dixit 1992] includes both scientific and general-purpose applications intended to be representative of an engineering/scientific workload. The SPEC benchmarks are widely used as indicators of machine performance, but are essentially uniprocessor benchmarks.

As architectures have become more complex, it has become obvious that the benchmarks measure the combined effectiveness of the compiler and the target machine. Thus two compilers that target the same machine can be compared by using the SPEC ratings of the generated code.

For parallel machines, the contribution of the compiler is even more important, since the difference between naive and optimized code can be many orders of magnitude. Parallel benchmarks include SPLASH (Stanford Parallel Applications for Shared Memory) [Singh et al. 1992] and the NASA Numerical Aerodynamic Simulation (NAS) benchmarks [Bailey et al. 1991].

The Perfect Club [Berry et al. 1989] is a benchmark suite of computationally intensive programs that is intended to help evaluate serial and parallel machines.

9.2 Code Characteristics

A number of studies have focused on the applications themselves, examining their source code for programmer idioms or profiling the behavior of the compiled executable.

Knuth [1971] carried out an early and influential study of Fortran programs. He studied 440 programs comprising 250,000 lines (punched cards). The most important effect of this study was to dramatize the fact that the majority of the execution time of a program is usually spent in a very small proportion of the code. Other interesting statistics are that 95% of all the do loops incremented their index variable by 1, and 40% of all do loops contained only one statement.

Shen et al. [1990] examined the subscript expressions that appeared in a set of Fortran mathematical libraries and scientific applications. They applied various dependence tests to these expressions. The results demonstrate how the tests compare to one another, showing that one of the biggest problems was unknown variables. These variables were caused by procedure calls and by the use of an array element as an index value into another array. Coupled subscripts also caused problems for tests that


examine a single array dimension at a time (coupled subscripts are discussed in Section 5.4).

A study by Petersen and Padua [1993] evaluates approximate dependence tests against the omega test [Pugh 1992], finding that the latter does not expose substantially more parallelism in the Perfect Club benchmarks.

9.3 Compiler Effectiveness

As we mentioned above, researchers find it difficult to evaluate how well a compiler is doing. They have come up with four approaches to the problem:

● examine the compiler output by hand to evaluate its ability to transform code;

● measure the performance of one compiler against another;

● compare the performance of executables compiled with full optimization against little or no optimization; and

● compare an application compiled for parallel execution against the sequential version running on one processor.

Nobayashi and Eoyang [1989] compare the performance of supercomputer compilers from Cray, Fujitsu, Alliant, Ardent, and NEC. The compilers were applied to various loops from the Livermore and Argonne test suites that required restructuring before they could be computed with vector operations.

Carr [1993] applies a set of loop transformations (unroll-and-jam, loop interchange, tiling, and scalar replacement) to various scientific applications to find a strategy that optimizes cache utilization.

Wolf [1992] uses inner loops from the Perfect Club and from one of the SPEC benchmarks to compare the performance of hand-optimized to compiler-optimized code. The focus is on improving locality to reduce traffic between memory and the cache.

Relatively few studies have been performed to test the effectiveness of real compilers in parallelizing real programs. However, the results of those that have are not encouraging.

One study of four Perfect benchmark programs compiled on the Alliant FX/8 produced speedups between 0.9 (that is, a slowdown) and 2.36 out of a potential 32; when the applications were tuned by hand, the speedups ranged from 5.1 to 13.2 [Eigenmann et al. 1991]. In another study of 12 Perfect benchmarks compiled with the KAP compiler for a simulated machine with unlimited parallelism, 7 applications had speedups of 1.4 or less; two applications had speedups of 2 to 4; and the rest were sped up 10.3, 66, and 77. All but three of these applications could have been sped up by a factor of 10 or more [Padua and Petersen 1992].

Lee et al. [1985] study the ability of the Parafrase compiler to parallelize a mixture of small programs written in Fortran. Before compilation, while loops were converted to do loops, and code for handling error conditions was removed. With 32 processors available, 4 out of 15 applications achieved 30% efficiency, and 2 achieved 10% efficiency; the other 9 out of 15 achieved less than 10% efficiency. Out of 89 loops, 59 were parallelized, most of them loops that initialized array data structures. Some coding idioms that are amenable to improved analysis were identified.

Some preliminary results have also been reported for the Fortran D compiler [Hiranandani et al. 1993] on the Intel iPSC/860 and Thinking Machines CM-5 architectures.

9.4 Instruction-Level Parallelism

In order to evaluate the potential gain from instruction-level parallelism, researchers have engaged in a number of studies to measure an upper bound on how much parallelism is available when an unlimited number of functional units is assumed. Some of these studies are discussed and evaluated in detail by Rau and Fisher [1993].

Early studies [Tjaden and Flynn 1970] were pessimistic in their findings, measuring a maximum level of parallelism on the order of two or three, a result that was called the Flynn bottleneck.


The main reason for the low numbers was that these studies did not look beyond basic blocks.

Parallelism can be exploited across basic block boundaries, however, on machines that use speculative execution. Instead of waiting until the outcome of a conditional branch is known, these architectures begin executing the instructions at either or both potential branch targets; when the conditional is evaluated, any computations that are rendered invalid or useless must be discarded.

When the basic block restriction is relaxed, there is much more parallelism available. Riseman and Foster [1972] assumed replicated hardware to support speculative execution. They found that there was significant additional parallelism available, but that exploiting it would require a great deal of hardware. For the programs studied, on the order of 2^16 arithmetic units would be necessary to expose 10-way parallelism.

A subsequent study by Nicolau and Fisher [1984] sought to measure the parallelism that could be exploited by a VLIW architecture using aggressive compiler analysis. The strategy was to compile a set of scientific programs into an intermediate form, simulate execution, and apply a scheduling algorithm retroactively that used branch and address information revealed by the simulation. The study revealed significant amounts of parallelism, a factor of tens or hundreds.

Wall [1991] took trace data from a number of applications and measured the amount of parallelism under a variety of models. His study considered the effect of architectural features such as varying numbers of registers, branch prediction, and jump prediction. It also considered the effects of alias analysis in the compiler. The results varied widely, yielding as much as 60-way parallelism under the most optimistic assumptions. The average parallelism on an ambitious combination of architecture and compiler support was 7.

Butler et al. [1991] used an optimizing compiler on a group of programs from the SPEC89 suite. The resulting code was executed on a variety of simulated machines. The range includes unrealistically omniscient architectures that combine data-flow strategies for scheduling with perfect branch prediction. They also evaluated restricted models with limited hardware. Under the most optimistic assumptions, they were able to execute 17 instructions per cycle; more practical models yielded between 2.0 and 5.8 instructions per cycle. Without looking past branches, few applications could execute more than 2 instructions per cycle.

Lam and Wilson [1992] argue that one major reason why studies of instruction-level parallelism differ markedly in their results is branch prediction. The studies that perform speculative execution based on prediction yield much less parallelism than the ones that have an oracle with perfect knowledge about branches. The authors speculate that executing branches locally will not yield large degrees of parallelism; they conduct experiments on a number of applications and find results similar to those of other studies (between 4.2 and 9.2). They argue that compiler optimizations that consider the global flow of control are essential.

Finally, Larus [1993] takes a different approach than the others, focusing strictly on the parallelism available in loops. He examines six codes (three integer programs from SPEC89 and three scientific applications), collects a trace annotated with semantic information, and runs it through a simulator that assumes a uniform shared-memory architecture with infinite processors. Two of the scientific codes, the ones that were primarily array manipulators, showed a potential speedup in the range of 65 to 250. The potential speedup of the other codes was 1.7 to 3, suggesting that the techniques developed for parallelizing scientific codes will not work well in parallelizing other types of programs.

CONCLUSION

We have described a large number of transformations that can be used to


improve program performance on sequential and parallel machines. Numerous studies have demonstrated that these transformations, when properly applied, are sufficient to yield high performance.

However, current optimizing compilers lack an organizing principle that allows them to choose how and when the transformations should be applied. Finding a framework that unifies many transformations is an active area of research. Such a framework could simplify the problem of searching the space of possible transformations, making it easier and therefore cheaper to build high-quality optimizing compilers. A key part of this problem is the development of a cost function that allows the compiler to evaluate the effectiveness of a particular set of transformations.

Despite the absence of a strategy for unifying transformations, compilers have proven to be quite successful in targeting sequential and superscalar architectures. On parallel architectures, however, most high-performance applications currently rely on the programmer rather than the compiler to manage parallelism. Compilers face a large space of possible transformations to apply, and parallel machines exact a very high cost of failure when optimization algorithms do not discover a good reorganization strategy for the application.

Because efforts to parallelize traditional languages automatically have met with little success, the focus of research has shifted to compiling languages in which the programmer shares the responsibility for exposing parallelism and choosing a data layout.

Another developing area is the optimization of nonscientific applications. Like most researchers working on high-performance architectures, compiler writers have focused on code that relies on loop-based manipulation of arrays. Other kinds of programs, particularly those written in an object-oriented programming style, rely heavily on pointers. Optimization of pointer manipulation and of object-oriented languages is emerging as another focus of compiler research.

APPENDIX MACHINE MODELS

A.1 Superscalar DLX

A superscalar processor has multiple functional units and can issue more than one instruction per clock cycle. Current examples of superscalar machines are the DEC Alpha [Sites 1992], HP PA-RISC [Hewlett-Packard 1992], IBM RS/6000 [Oehler and Blasgen 1991], and Intel Pentium [Alpert and Avnon 1993].

S-DLX is a simplified superscalar RISC architecture. It has four independent functional units for integer, load/store, branch, and floating-point operations. In every cycle, the next two instructions are considered for execution. If they are for different functional units, and there are no dependences between the instructions, they are both initiated. Otherwise, the second instruction is deferred until the next cycle. S-DLX does not reorder the instructions.

Most operations complete in a single cycle. When an operation takes more than one cycle, subsequent instructions that use results from multicycle instructions are stalled until the result is available. Because there is no instruction reordering, when an instruction is stalled no instructions are issued to any of the functional units.

S-DLX has a 32-bit word, 32 general-purpose registers (GPRs, denoted by Rn), and 32 floating-point registers (FPRs, denoted by Fn). The value of R0 is always 0. The FPRs can be used as double-precision (64-bit) register pairs. For the sake of simplicity we have not included the double-precision floating-point instructions.

Memory is byte addressable with a 32-bit virtual address. All memory references are made by load and store instructions between memory and the registers. Data is cached in a 64-kilobyte 4-way set-associative write-back cache composed of 1024 64-byte cache lines. Figure 61 is a block diagram of the architecture and its primary datapaths.

Table A.1 describes the instruction set. All instructions are 32 bits and must be word-aligned. The immediate operand


Figure 61. S-DLX functional diagram (integer, branch, load/store, and floating-point units; integer and floating-point register files; 64 KB 4-way set-associative cache with 64-byte lines; address translation; datapaths to RAM).

Table A.1. The S-DLX Instruction Set

Example Instr. | Name | Meaning | Similar instructions
LW R1, 30(R2) | Load word | R1 ← Memory[30+R2] | Load float (LF)
SW 500(R4), R3 | Store word | Memory[500+R4] ← R3 | Store float (SF)
LI R1, #666 | Load immediate | R1 ← 666 |
LUI R1, #666 | Load upper immediate | R1[16..31] ← 666 |
MOV R1, R2 | Move register | R1 ← R2 |
ADD R1, R2, R3 | Add | R1 ← R2 + R3 | Subtract (SUB)
MULT R1, R2, R3 | Multiply | R1 ← R2 x R3 | Divide (DIV)
ADDI R1, R2, #3 | Add immediate | R1 ← R2 + 3 | SUBI, MULTI, DIVI
SLL R1, R2, R3 | Shift left logical | R1 ← R2 << R3 | Shift right logical (SRL)
SLLI R1, R2, #3 | Shift left immediate | R1 ← R2 << 3 | SRLI
SLT R1, R2, R3 | Set less than | if (R2 < R3) R1 ← 1 else R1 ← 0 | SEQ, SNE, SLE, SGE, SGT and immediate forms
J label | Jump | PC ← label |
JR R3 | Jump register | PC ← R3 |
JAL label | Jump and link | R31 ← PC+4; PC ← label |
JALR R2 | Jump and link register | R31 ← PC+4; PC ← R2 |
BEQZ R4, label | Branch if equal zero | if (R4 = 0) PC ← label | Branch if not equal zero (BNEZ)
BFPT label | Branch if floating point true | if (FPCR) PC ← label | Branch if floating point false (BFPF)
ADDF F1, F2, F3 | Add float | F1 ← F2 + F3 | Subtract float (SUBF)
MULTF F1, F2, F3 | Multiply float | F1 ← F2 x F3 | Divide float (DIVF)
MAF F1, F2, F3 | Multiply-add float | F1 ← F1 + (F2 x F3) |
EQF F1, F2 | Test equal float | if (F1 = F2) FPCR ← 1 else FPCR ← 0 | LTF, GTF, LEF, GEF, NEF

field is 16 bits. The address for load and store instructions is computed by adding the 16-bit immediate operand to the register. To create a full 32-bit constant, the low 16 bits must first be set with a load immediate (LI) instruction that clears the high 16 bits; then the high 16 bits must be set with a load upper immediate (LUI) instruction.

The program counter is the special register PC. Jumps and branches are relative to PC + 4; jumps have a 26-bit signed


Table A.2. V-DLX Vector Instructions

Example Instr. | Name | Meaning | Similar instructions
LV V1, R1 | Load vector register | V1 ← VLR words at M[R1] |
LVWS V1, (R1, R2) | Load vector with stride | V1 ← every R2th word, VLR words starting at M[R1] |
SV V1, R1 | Store vector register | M[R1] ← VLR words from V1 |
SUBV V1, V2, V3 | Vector-vector subtraction | V1[1..VLR] ← V2[1..VLR] - V3[1..VLR] |
SUBVS V1, V2, F1 | Vector-scalar subtraction | V1[1..VLR] ← V2[1..VLR] - F1 |
SUBSV V1, F1, V2 | Scalar-vector subtraction | V1[1..VLR] ← F1 - V2[1..VLR] | DIVSV
SVLR R1 | Set vector length register | VLR ← R1 |

offset, and branches have a 16-bit signed offset. Integer branches test the GPRs for zero; floating-point branches test a special floating-point condition register (FPCR).

There is no branch delay slot. Because the number of instructions executed per cycle varies on a superscalar machine, it does not make sense to have a fixed number of delay slots. The number of delay slots is also heavily dependent on the pipeline depth, which may vary from one chip generation to the next.

Instead, static branch prediction is used: forward branches are always predicted as not taken, and backward branches are predicted as taken. If the prediction is wrong, there is a one-cycle delay.

Floating-point divides take 14 cycles and all other floating-point operations take four cycles, except when the result is used by a store operation, in which case they take three cycles. A load that hits in the cache takes two cycles; if it misses in the cache it takes 16 cycles. Integer multiplication takes two cycles. All other instructions take one cycle.

When a load or store is followed by an integer instruction that modifies the address register, the instructions may be issued in the same cycle. If a floating-point store is followed by an operation that modifies the register being stored, the instructions may also be issued in the same cycle.

A.2 Vector DLX

For vectorizing transformations, we will use a version of DLX extended to include vector support. This new architecture, V-DLX, has eight vector registers, each of which holds a vector consisting of up to 64 floating-point numbers. The vector functional units perform all their operations on data in the vector and scalar registers.

We discuss only the functional units that perform floating-point addition, multiplication, and division, though vector machines typically have units to perform integer and logical operations as well. V-DLX issues only one scalar or one vector instruction per cycle, but nonconflicting scalar and vector instructions can overlap each other.

A special register, the vector length register (VLR), controls the number of quantities that are loaded, operated upon, or stored in any vector instruction. By software convention, the VLR is normally set to 64, except when handling the last few iterations of a loop.
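A sketch of how a compiler strip-mines a loop of arbitrary length for the 64-element vector registers (the source-level form shown is illustrative, not from the original text): each strip of the outer loop becomes one sequence of vector instructions executed with the VLR set to VL, and the final strip sets the VLR to the remainder.

do j = 1, n, 64
  VL = min(64, n - j + 1)
  do i = j, j + VL - 1
    a[i] = a[i] + b[i]
  end do
end do

The inner loop maps onto a vector load of a and b, a vector addition, and a vector store, all performed under the current VLR.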

The vector operations are described in Table A.2. They include arithmetic operations on vector registers and load/store operations between vector registers and memory. Vector operations take place either between two vector registers or between a vector register and a scalar register. In the latter case, the scalar value is extended across the entire vector. All vector computations have a vector register as the destination.

The speed of a vector operation depends on the depth of the pipeline in its implementation. The first result appears after some number of cycles (called the startup time). After the pipeline is full, one result is computed per clock cycle. In


Table A.3. Startup Times in Cycles on V-DLX

Operation | Start-Up Time
Vector load | 12

the meantime, the processor can continue to execute other instructions. Table A.3 gives the startup times for the vector operations in V-DLX. These times should not be compared directly to the cycle times for operations on S-DLX because vector machines typically have higher clock speeds than microprocessors, although this gap is closing.

A large factor in the speed of vector architectures is their memory system. V-DLX has eight memory banks. After the load latency, data is supplied at the rate of one word per clock cycle, provided that the stride with which the data is accessed does not cause bank conflicts (see Section 6.5.1).

Current examples of vector machines are the Cray C-90 [Oed 1992] and the IBM ES 9000 Model 900 VF [Gibson and Rao 1992].

A.3 Multiprocessors

When multiple processors are employed to execute a program, many additional issues arise. The most obvious issue is how much of the underlying machine architecture to expose to the programmer. At one extreme, the programmer can make explicit use of hardware-supported operations by programming with locks, fork/join primitives, barrier synchronizations, and message send and receive. These operations are typically provided as system calls or library routines. System call semantics are usually not defined by the language, making it difficult for the compiler to optimize them automatically.

High-level languages for large-scale parallel processing provide primitives for expressing parallelism in one of two ways: control parallel or data parallel. Fortran 90 array section expressions are examples of explicitly data-parallel operations. APL [Iverson 1962] also contains a wide variety of data-parallel array operators. Examples of control-parallel operations are cobegin/coend blocks and doacross loops [Cytron 1986].

For both shared- and distributed-memory multiprocessors, we assume that the programming environment initializes the variables Pnum to be the total number of processors and Pid to be the number of the local processor. Processors are numbered starting with 0.

A.3.1 Shared-Memory DLX Multiprocessor

sMX is our prototypical shared-memory parallel architecture. It consists of 16 S-DLX processors and 512 MB of shared main memory. The processors and memory system are connected together by a bus. Each processor has an intelligent cache controller that monitors the bus (a snoopy cache). The caches are the same as on S-DLX, except that they contain 256 KB of data. The bandwidth of the bus is 128 megabytes/second.

The processors share the memory units; a program running on a processor can access any memory element, and the system ensures that the values are maintained consistently across the machine. Without caching, consistency is easy to maintain; every memory reference is handled by the main memory controller. However, performance would be very poor because memory latencies are already too high on sequential machines to run well without a cache; having many processors share a common bus would make the problem much worse by increasing memory access latency and introducing bandwidth restrictions. The solution is to give each processor a cache that is smart enough to resolve reference conflicts.

A snoopy cache implements a sharing protocol that maintains consistency while still minimizing bus traffic. A processor that modifies a cache line invalidates all other copies and is said to own that cache line. There are a variety of cache coherency protocols [Stenstrom 1990;


Table A.4. Memory Reference Latency in sMX

Type of Memory Reference                 Cycles
Read value available in local cache           2
Read value owned by other processor          18
Read value nobody owns                       20
Write value owned locally
Write value owned by other processor         18
Write value nobody owns                      22


Typically, shared-memory multiprocessors implement a number of synchronization operations. FORK(n) starts identical copies of the current program running on n processors; because it must copy the stack of the forking processor to a private memory for each of the n - 1 other processors, it is a very expensive operation. JOIN() desynchronizes with the previous FORK and makes the processor available for other FORK operations (except for the original forking processor, which proceeds serially).

BARRIER() performs a barrier synchronization, which is a synchronization point in the program where each processor waits until all processors have arrived at that point.
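As a sketch only (the array, its size, and the blockwise partition are invented for illustration; FORK, BARRIER, and JOIN are the sMX operations just described, and Pid and Pnum are set up by the programming environment), an SPMD parallel loop on sMX might look like the following:

c     Hypothetical SPMD fragment for sMX. Each of the Pnum processors
c     started by FORK works on a contiguous block of iterations.
      real a(1600)
      integer Pid, Pnum, lo, hi, i
      call FORK(Pnum)
c     Block-partition the 1600 iterations among the Pnum processors
c     (assumed here to divide evenly).
      lo = Pid * (1600 / Pnum) + 1
      hi = lo + (1600 / Pnum) - 1
      do 10 i = lo, hi
         a(i) = 2.0 * a(i)
 10   continue
c     Wait until every processor has finished its block, then release
c     all processors except the original forking one.
      call BARRIER()
      call JOIN()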

A.3.2 Distributed-Memory DLX Multiprocessor

dMX is our hypothetical distributed-memory multiprocessor. The machine consists of 64 S-DLX processors (numbered 0 through 63) connected in an 8 x 8 mesh. The network bandwidth of each link in the mesh is 10 MB per second. Each processor has an associated network processor that manages communication; the network processor has its own pool of memory and can communicate without involving the CPU. Having a separate processor manage the network allows applications to send a message asynchronously and continue executing while the message is sent. Messages that pass through a processor en route to some other destination in the mesh are handled by the network processor without interrupting the CPU.

The latency of a message transfer is ten microseconds plus 1 microsecond per 10 bytes of message (we have simplified the machine by declaring that all remote processors have the same latency). Communication is supported by a message library that provides the following calls:

● SEND(buffer, nbytes, target) Send nbytes from buffer to the processor target. If the message fits in the network processor's memory, the call returns after the copy (1 microsecond/10 bytes). If not, the call blocks until the message is sent. Note that this raises the potential for deadlock, but we will not concern ourselves with that in this simplified model.

● BROADCAST(buffer, nbytes) Send the message to every other processor in the machine. The communication library uses a fast broadcasting algorithm, so the maximum latency is roughly twice that of sending a message to the furthest edge of the mesh.

● RECEIVE(buffer, nbytes) Wait for a message to arrive; when it does, up to nbytes of it will be put in the buffer. The call returns the total number of bytes in the incoming message; if that value is greater than nbytes, RECEIVE guarantees that subsequent calls will return the rest of that message first.
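As an illustration only (the buffer size and the choice of processors 0 and 1 are invented, and RECEIVE is written as an integer function because it returns a byte count), a minimal dMX fragment in which processor 0 sends a block of data and processor 1 waits for it might be sketched as:

c     Hypothetical sketch of message passing on dMX using the library
c     calls described above; Pid is the local processor number set up
c     by the programming environment.
      real buffer(1000)
      integer Pid, nbytes, got, RECEIVE
      nbytes = 4000
      if (Pid .eq. 0) then
c        Send 4000 bytes to processor 1; the call returns as soon as
c        the message has been copied to the network processor.
         call SEND(buffer, nbytes, 1)
      else if (Pid .eq. 1) then
c        Block until a message arrives; got holds the total size of
c        the incoming message, which may exceed nbytes.
         got = RECEIVE(buffer, nbytes)
      end if

Because SEND returns once the message has been handed to the network processor, the sending processor can overlap subsequent computation with the transfer.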

ACKNOWLEDGMENTS

We thank Michael Burke, Manish Gupta, Monica Lam, and Vivek Sarkar for their assistance with various parts of this survey. We thank John Boyland, James Demmel, Alex Dupuy, John Hauser, James Larus, Dick Muntz, Ken Stanley, the members of the IBM Advanced Development Technology Institute, and the anonymous referees for their valuable comments on earlier drafts.

REFERENCES

ABELSON, H. AND SUSSMAN, G. J. 1985. Structure and Interpretation of Computer Programs. MIT Press, Cambridge, Mass.


ABU-SUFAH, W. 1979. Improving the performance of virtual memory computers. Ph.D. thesis, Tech. Rep. 78-945, Univ. of Illinois at Urbana-Champaign.
ABU-SUFAH, W., KUCK, D. J., AND LAWRIE, D. 1981. On the performance enhancement of paging systems through program analysis and transformations. IEEE Trans. Comput. C-30, 5 (May), 341-356.
ACKERMAN, W. B. 1982. Data flow languages. Computer 15, 2 (Feb.), 15-25.
AHO, A. V., JOHNSON, S. C., AND ULLMAN, J. D. 1977. Code generation for expressions with common subexpressions. J. ACM 24, 1 (Jan.), 146-160.
AHO, A. V., SETHI, R., AND ULLMAN, J. D. 1986. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, Mass.
AIKEN, A. AND NICOLAU, A. 1988a. Optimal loop parallelization. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Atlanta, Ga., June). SIGPLAN Not. 23, 7 (July), 308-317.
AIKEN, A. AND NICOLAU, A. 1988b. Perfect pipelining: A new loop parallelization technique. In Proceedings of the 2nd European Symposium on Programming. Lecture Notes in Computer Science, vol. 300. Springer-Verlag, Berlin, 221-235.
ALLEN, J. R. 1983. Dependence analysis for subscripted variables and its application to program transformations. Ph.D. thesis, Computer Science Dept., Rice University, Houston, Tex.
ALLEN, F. E. 1969. Program optimization. In Annual Review in Automatic Programming 5. International Tracts in Computer Science and Technology and their Application, vol. 13. Pergamon Press, Oxford, England, 239-307.
ALLEN, F. E. AND COCKE, J. 1971. A catalogue of optimizing transformations. In Design and Optimization of Compilers, R. Rustin, Ed. Prentice-Hall, Englewood Cliffs, N.J., 1-30.
ALLEN, J. R. AND KENNEDY, K. 1987. Automatic translation of Fortran programs to vector form. ACM Trans. Program. Lang. Syst. 9, 4 (Oct.), 491-542.
ALLEN, J. R. AND KENNEDY, K. 1984. Automatic loop interchange. In Proceedings of the SIGPLAN Symposium on Compiler Construction (Montreal, Quebec, June). SIGPLAN Not. 19, 6, 233-246.
ALLEN, F. E., BURKE, M., CHARLES, P., CYTRON, R., AND FERRANTE, J. 1988a. An overview of the PTRAN analysis system for multiprocessing. J. Parall. Distrib. Comput. 5, 5 (Oct.), 617-640.
ALLEN, F. E., BURKE, M., CYTRON, R., FERRANTE, J., HSIEH, W., AND SARKAR, V. 1988b. A framework for determining useful parallelism. In Proceedings of the ACM International Conference on Supercomputing (St. Malo, France, July). ACM Press, New York, 207-215.
ALLEN, F. E., COCKE, J., AND KENNEDY, K. 1981. Reduction of operator strength. In Program Flow Analysis: Theory and Applications, S. S. Muchnick and N. D. Jones, Eds. Prentice-Hall, Englewood Cliffs, N.J., 79-101.
ALLEN, J. R., KENNEDY, K., PORTERFIELD, C., AND WARREN, J. 1983. Conversion of control dependence to data dependence. In Conference Record of the 10th ACM Symposium on Principles of Programming Languages (Austin, Tex., Jan.). ACM Press, New York, 177-189.
ALPERT, D. AND AVNON, D. 1993. Architecture of the Pentium microprocessor. IEEE Micro 13, 3 (June), 11-21.
AMERICAN NATIONAL STANDARDS INSTITUTE. 1987. An American National standard, IEEE standard for binary floating-point arithmetic. SIGPLAN Not. 22, 2 (Feb.), 9-25.
ANDERSON, J. M. AND LAM, M. S. 1993. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Albuquerque, New Mexico, June). SIGPLAN Not. 28, 6, 112-125.
ANDERSON, S. AND HUDAK, P. 1990. Compilation of Haskell array comprehensions for scientific computing. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (White Plains, N.Y., June). SIGPLAN Not. 25, 6, 137-149.
APPEL, A. W. 1992. Compiling with Continuations. Cambridge University Press, Cambridge, England.
ARDEN, B. W., GALLER, B. A., AND GRAHAM, R. M. 1962. An algorithm for translating Boolean expressions. J. ACM 9, 2 (Apr.), 222-239.
ARVIND AND CULLER, D. E. 1986. Dataflow architectures. In Annual Review of Computer Science. Vol. 1. J. F. Traub, B. J. Grosz, B. W. Lampson, and N. J. Nilsson, Eds. Annual Reviews, Palo Alto, Calif., 225-253.
ARVIND, KATHAIL, V., AND PINGALI, K. 1980. A dataflow architecture with tagged tokens. Tech. Rep. TM-174, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Mass.
BACON, D. F., CHOW, J.-H., JU, D. R., MUTHUKUMAR, K., AND SARKAR, V. 1994. A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness. In Proceedings of CASCON '94 (Toronto, Ontario, Oct.).
BAILEY, D. H. 1992. On the behavior of cache memories with strided data access. Tech. Rep. RNR-92-015, NASA Ames Research Center, Moffett Field, Calif. (May).
BAILEY, D. H., BARSZCZ, E., BARTON, J. T., BROWNING, D. S., CARTER, R. L., DAGUM, L., FATOOHI, R. A., FREDERICKSON, P. O., LASINSKI, T. A., SCHREIBER, R. S., SIMON, H. D., VENKATAKRISHNAN, V., AND WEERATUNGA, S. K. 1991. The NAS parallel benchmarks. Int. J. Supercomput. Appl. 5, 3 (Fall), 63-73.


BALASUNDARAM, V. 1990. A mechanism for keeping useful internal information in parallel programming tools: The Data Access Descriptor. J. Parall. Distrib. Comput. 9, 2 (June), 154-170.
BALASUNDARAM, V., FOX, G., KENNEDY, K., AND KREMER, U. 1990. An interactive environment for data partitioning and distribution. In Proceedings of the 5th Distributed Memory Computer Conference (Charleston, South Carolina, Apr.). IEEE Computer Society Press, Los Alamitos, Calif.
BALASUNDARAM, V., KENNEDY, K., KREMER, U., MCKINLEY, K., AND SUBHLOK, J. 1989. The ParaScope editor: An interactive parallel programming tool. In Proceedings of Supercomputing '89 (Reno, Nev., Nov.). ACM Press, New York, 540-550.
BALL, J. E. 1979. Predicting the effects of optimization on a procedure body. In Proceedings of the SIGPLAN Symposium on Compiler Construction (Denver, Colo., Aug.). SIGPLAN Not. 14, 8, 214-220.
BANERJEE, U. 1991. Unimodular transformations of double loops. In Advances in Languages and Compilers for Parallel Processing, A. Nicolau, Ed. Research Monographs in Parallel and Distributed Computing. MIT Press, Cambridge, Mass., Chap. 10.
BANERJEE, U. 1988a. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Boston, Mass.
BANERJEE, U. 1988b. An introduction to a formal theory of dependence analysis. J. Supercomput. 2, 2 (Oct.), 133-149.
BANERJEE, U. 1979. Speedup of ordinary programs. Ph.D. thesis, Tech. Rep. 79-989, Computer Science Dept., Univ. of Illinois at Urbana-Champaign.
BANERJEE, U., CHEN, S. C., KUCK, D. J., AND TOWLE, R. A. 1979. Time and parallel processor bounds for Fortran-like loops. IEEE Trans. Comput. C-28, 9 (Sept.), 660-670.
BANNING, J. P. 1979. An efficient way to find the side-effects of procedure calls and the aliases of variables. In Conference Record of the 6th ACM Symposium on Principles of Programming Languages (San Antonio, Tex., Jan.). ACM Press, New York, 29-41.
BERNSTEIN, R. 1986. Multiplication by integer constants. Softw. Pract. Exper. 16, 7 (July), 641-652.
BERRY, M., CHEN, D., KOSS, P., KUCK, D., LO, S., PANG, Y., POINTER, L., ROLOFF, R., SAMEH, A., CLEMENTI, E., CHIN, S., SCHNEIDER, D., FOX, G., MESSINA, P., WALKER, D., HSIUNG, C., SCHWARZMEIER, J., LUE, K., ORSZAG, S., SEIDL, F., JOHNSON, O., GOODRUM, R., AND MARTIN, J. 1989. The Perfect Club benchmarks: Effective performance evaluation of supercomputers. Int. J. Supercomput. Appl. 3, 3 (Fall), 5-40.
BLELLOCH, G. 1989. Scans as primitive parallel operations. IEEE Trans. Comput. C-38, 11 (Nov.), 1526-1538.
BROMLEY, M., HELLER, S., MCNERNEY, T., AND STEELE, G. L., JR. 1991. Fortran at ten gigaflops: The Connection Machine convolution compiler. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Toronto, Ontario, June). SIGPLAN Not. 26, 6, 145-156.
BURKE, M. AND CYTRON, R. 1986. Interprocedural dependence analysis and parallelization. In Proceedings of the SIGPLAN Symposium on Compiler Construction (Palo Alto, Calif., June). SIGPLAN Not. 21, 7 (July), 162-175. Extended version available as IBM Thomas J. Watson Research Center Tech. Rep. RC 11794.
BURNETT, G. J. AND COFFMAN, E. G., JR. 1970. A study of interleaved memory systems. In Proceedings of the Spring Joint AFIPS Computer Conference. Vol. 36. AFIPS, Montvale, N.J., 467-474.
BURSTALL, R. M. AND DARLINGTON, J. 1977. A transformation system for developing recursive programs. J. ACM 24, 1 (Jan.), 44-67.
BUTLER, M., YEH, T., PATT, Y., ALSUP, M., SCALES, H., AND SHEBANOW, M. 1991. Single instruction stream parallelism is greater than two. In Proceedings of the 18th Annual International Symposium on Computer Architecture (Toronto, Ontario, May). SIGARCH Comput. Arch. News 19, 3, 276-286.
CALLAHAN, D. AND KENNEDY, K. 1988a. Analysis of interprocedural side-effects in a parallel programming environment. J. Parall. Distrib. Comput. 5, 5 (Oct.), 517-550.
CALLAHAN, D. AND KENNEDY, K. 1988b. Compiling programs for distributed-memory multiprocessors. J. Supercomput. 2, 2 (Oct.), 151-169.
CALLAHAN, D., CARR, S., AND KENNEDY, K. 1990. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (White Plains, N.Y., June). SIGPLAN Not. 25, 6, 53-65.
CALLAHAN, D., COCKE, J., AND KENNEDY, K. 1988. Estimating interlock and improving balance for pipelined architectures. J. Parall. Distrib. Comput. 5, 4 (Aug.), 334-358.
CALLAHAN, D., COOPER, K., KENNEDY, K., AND TORCZON, L. 1986. Interprocedural constant propagation. In Proceedings of the SIGPLAN Symposium on Compiler Construction (Palo Alto, Calif., June). SIGPLAN Not. 21, 7 (July), 152-161.
CARR, S. 1993. Memory-hierarchy management. Ph.D. thesis, Rice University, Houston, Tex.
CHAITIN, G. J. 1982. Register allocation and spilling via graph coloring. In Proceedings of the SIGPLAN Symposium on Compiler Construction (Boston, Mass., June). SIGPLAN Not. 17, 6, 98-105.


CHAITIN, G. J., AUSLANDER, M. A., CHANDRA, A. K., COCKE, J., HOPKINS, M. E., AND MARKSTEIN, P. W. 1981. Register allocation via coloring. Comput. Lang. 6, 1 (Jan.), 47-57.
CHAMBERS, C. AND UNGAR, D. 1989. Customization: Optimizing compiler technology for SELF, a dynamically-typed object-oriented programming language. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Portland, Ore., June). SIGPLAN Not. 24, 7 (July), 146-160.
CHATTERJEE, S., GILBERT, J. R., AND SCHREIBER, R. 1993a. The alignment-distribution graph. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, vol. 768. Springer-Verlag, Berlin, 234-252.
CHATTERJEE, S., GILBERT, J. R., SCHREIBER, R., AND TENG, S.-H. 1993b. Automatic array alignment in data-parallel programs. In Conference Record of the 20th ACM Symposium on Principles of Programming Languages (Charleston, S. Carolina, Jan.). ACM Press, New York, 16-28.
CHEN, S. C. AND KUCK, D. J. 1975. Time and parallel processor bounds for linear recurrence systems. IEEE Trans. Comput. C-24, 7 (July), 701-717.
CHOI, J.-D., BURKE, M., AND CARINI, P. 1993. Efficient flow-sensitive interprocedural computation of pointer-induced aliases and side effects. In Conference Record of the 20th ACM Symposium on Principles of Programming Languages (Charleston, S. Carolina, Jan.). ACM Press, New York, 232-245.
CHOW, F. C. 1988. Minimizing register usage penalty at procedure calls. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Atlanta, Ga., June). SIGPLAN Not. 23, 7 (July), 85-94.
CHOW, F. C. AND HENNESSY, J. L. 1990. The priority-based coloring approach to register allocation. ACM Trans. Program. Lang. Syst. 12, 4 (Oct.), 501-536.
CLACK, C. D. AND PEYTON-JONES, S. L. 1985. Strictness analysis - a practical approach. In Functional Programming Languages and Computer Architecture. Lecture Notes in Computer Science, vol. 201. Springer-Verlag, Berlin, 35-49.
CLINGER, W. D. 1990. How to read floating point numbers accurately. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (White Plains, N.Y., June). SIGPLAN Not. 25, 6, 92-101.
COCKE, J. 1970. Global common subexpression elimination. In Proceedings of the ACM Symposium on Compiler Optimization (July). SIGPLAN Not. 5, 7, 20-24.
COCKE, J. AND MARKSTEIN, P. 1980. Measurement of program improvement algorithms. In Proceedings of the IFIP Congress (Tokyo, Japan, Oct.). North-Holland, Amsterdam, Netherlands, 221-228. Also available as IBM Tech. Rep. RC 8111, Feb. 1980.
COCKE, J. AND SCHWARTZ, J. T. 1970. Programming languages and their compilers (preliminary notes). 2nd ed. Courant Inst. of Mathematical Sciences, New York University, New York.
COLWELL, R. P., NIX, R. P., O'DONNELL, J. J., PAPWORTH, D. B., AND RODMAN, P. K. 1988. A VLIW architecture for a trace scheduling compiler. IEEE Trans. Comput. C-37, 8 (Aug.), 967-979.
COOPER, K. D. AND KENNEDY, K. 1989. Fast interprocedural alias analysis. In Conference Record of the 16th ACM Symposium on Principles of Programming Languages (Austin, Tex., Jan.). ACM Press, New York, 49-59.
COOPER, K. D., HALL, M. W., AND KENNEDY, K. 1993. A methodology for procedure cloning. Comput. Lang. 19, 2 (Apr.), 105-117.
COOPER, K. D., HALL, M. W., AND TORCZON, L. 1992. Unexpected side effects of inline substitution: A case study. ACM Lett. Program. Lang. Syst. 1, 1 (Mar.), 22-32.
COOPER, K. D., HALL, M. W., AND TORCZON, L. 1991. An experiment with inline substitution. Softw. Pract. Exper. 21, 6 (June), 581-601.
CRAY RESEARCH. 1988. CFT77 Reference Manual. Publication SR-0018-C. Cray Research, Inc., Eagan, Minn.
CYTRON, R. 1986. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the International Conference on Parallel Processing (St. Charles, Ill., Aug.). IEEE Computer Society, Washington, D.C., 836-844.
CYTRON, R. AND FERRANTE, J. 1987. What's in a name? -or- the value of renaming for parallelism detection and storage allocation. In Proceedings of the International Conference on Parallel Processing (University Park, Penn., Aug.). Pennsylvania State University Press, University Park, Pa., 19-27.
DANTZIG, G. B. AND EAVES, B. C. 1974. Fourier-Motzkin elimination and its dual with application to integer programming. In Combinatorial Programming: Methods and Applications (Versailles, France, Sept.), B. D. Roy, Ed. D. Reidel, Boston, Mass., 93-102.
DENNIS, J. B. 1980. Data flow supercomputers. Computer 13, 11 (Nov.), 48-56.
DIXIT, K. M. 1992. New CPU benchmarks from SPEC. In Digest of Papers, Spring COMPCON 1992, 37th IEEE Computer Society International Conference (San Francisco, Calif., Feb.). IEEE Computer Society Press, Los Alamitos, Calif., 305-310.
DONGARRA, J. AND HIND, A. R. 1979. Unrolling loops in Fortran. Softw. Pract. Exper. 9, 3 (Mar.), 219-226.
EGGERS, S. J. AND KATZ, R. H. 1989. Evaluating the performance of four snooping cache coherency protocols. In Proceedings of the 16th Annual International Symposium on Computer Architecture (Jerusalem, Israel, May). SIGARCH Comput. Arch. News 17, 3 (June), 2-15.


EIGENMANN, R., HOEFLINGER, J., LI, Z., AND PADUA, D. A. 1991. Experience in the automatic parallelization of four Perfect-benchmark programs. In Proceedings of the 4th International Workshop on Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, vol. 589. Springer-Verlag, Berlin, 65-83. Also available as Tech. Rep. 1193, Center for Supercomputing Research and Development.
ELLIS, J. R. 1986. Bulldog: A Compiler for VLIW Architectures. ACM Doctoral Dissertation Award. MIT Press, Cambridge, Mass.
ERSHOV, A. P. 1966. ALPHA - an automatic programming system of high efficiency. J. ACM 13, 1 (Jan.), 17-24.
FEAUTRIER, P. 1991. Dataflow analysis of array and scalar references. Int. J. Parall. Program. 20, 1 (Feb.), 23-52.
FEAUTRIER, P. 1988. Array expansion. In Proceedings of the ACM International Conference on Supercomputing (St. Malo, France, July). ACM Press, New York, 429-441.
FERRARI, D. 1976. The improvement of program behavior. Computer 9, 11 (Nov.), 39-47.
FISHER, J. A. 1981. Trace scheduling: A technique for global microcode compaction. IEEE Trans. Comput. C-30, 7 (July), 478-490.
FISHER, J. A., ELLIS, J. R., RUTTENBERG, J. C., AND NICOLAU, A. 1984. Parallel processing: A smart compiler and a dumb machine. In Proceedings of the SIGPLAN Symposium on Compiler Construction (Montreal, Quebec, June). SIGPLAN Not. 19, 6, 37-47.
FLOATING POINT SYSTEMS. 1979. FPS AP-120B Processor Handbook. Floating Point Systems, Inc., Beaverton, Ore.
FREE SOFTWARE FOUNDATION. 1992. gcc 2.x Reference Manual. Free Software Foundation, Cambridge, Mass.
FREUDENBERGER, S. M., SCHWARTZ, J. T., AND SHARIR, M. 1983. Experience with the SETL optimizer. ACM Trans. Program. Lang. Syst. 5, 1 (Jan.), 26-45.
GANNON, D., JALBY, W., AND GALLIVAN, K. 1988. Strategies for cache and local memory management by global program transformation. J. Parall. Distrib. Comput. 5, 5 (Oct.), 587-616.
GERNDT, M. 1990. Updating distributed variables in local computations. Concurrency Pract. Exper. 2, 3 (Sept.), 171-193.
GIBSON, D. H. AND RAO, G. S. 1992. Design of the IBM System/390 computer family for numerically intensive applications: An overview for engineers and scientists. IBM J. Res. Dev. 36, 4 (July), 695-711.
GIRKAR, M. AND POLYCHRONOPOULOS, C. D. 1988. Compiling issues for supercomputers. In Proceedings of Supercomputing '88 (Orlando, Fla., Nov.). IEEE Computer Society, Washington, D.C., 164-173.
GOFF, G., KENNEDY, K., AND TSENG, C.-W. 1991. Practical dependence testing. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Toronto, Ontario, June). SIGPLAN Not. 26, 6, 15-29.
GRAHAM, S. L., LUCCO, S., AND SHARP, O. J. 1993. Orchestrating interactions among parallel computations. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Albuquerque, New Mexico, June). SIGPLAN Not. 28, 6, 100-111.
GRANLUND, T. AND KENNER, R. 1992. Eliminating branches using a superoptimizer and the GNU C compiler. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (San Francisco, Calif., June). SIGPLAN Not. 27, 7 (July), 341-352.
GUPTA, M. 1992. Automatic data partitioning on distributed memory multicomputers. Ph.D. thesis, Tech. Rep. UILU-ENG-92-2237, Univ. of Illinois at Urbana-Champaign.
GUPTA, M., MIDKIFF, S., SCHONBERG, E., SWEENEY, P., WANG, K. Y., AND BURKE, M. 1993. PTRAN II: A compiler for High Performance Fortran. In Proceedings of the 4th International Workshop on Compilers for Parallel Computers (Delft, Netherlands, Dec.), 479-493.
HALL, M. W., KENNEDY, K., AND MCKINLEY, K. S. 1991. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91 (Albuquerque, New Mexico, Nov.). IEEE Computer Society Press, Los Alamitos, Calif., 424-434.
HARRIS, K. AND HOBBS, S. 1994. VAX Fortran. In Optimization in Compilers, F. E. Allen, R. Cytron, B. K. Rosen, and K. Zadeck, Eds. ACM Press, New York, Chap. 16. To be published.
HATFIELD, D. J. AND GERALD, J. 1971. Program restructuring for virtual memory. IBM Syst. J. 10, 3, 168-192.
HENNESSY, J. L. AND PATTERSON, D. A. 1990. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, Calif.
HEWLETT-PACKARD. 1992. PA-RISC 1.1 Architecture and Instruction Manual. 2nd ed. Part Number 09740-90039. Hewlett-Packard, Palo Alto, Calif.
HIGH PERFORMANCE FORTRAN FORUM. 1993. High Performance Fortran language specification, version 1.0. Tech. Rep. CRPC-TR92225, Rice University, Houston, Tex.
HIRANANDANI, S., KENNEDY, K., AND TSENG, C.-W. 1993. Preliminary experiences with the Fortran D compiler. In Proceedings of Supercomputing '93 (Portland, Ore., Nov.). IEEE Computer Society Press, Los Alamitos, Calif., 338-350.
HIRANANDANI, S., KENNEDY, K., AND TSENG, C.-W. 1992. Compiling Fortran D for MIMD distributed-memory machines. Commun. ACM 35, 8 (Aug.), 66-80.


HUDAK, P. AND GOLDBERG, B. 1985. Distributed execution of functional programs using serial combinators. IEEE Trans. Comput. C-34, 10 (Oct.), 881-890.
HWU, W. W. AND CHANG, P. P. 1989. Achieving high instruction cache performance with an optimizing compiler. In Proceedings of the 16th Annual International Symposium on Computer Architecture (Jerusalem, Israel, May). SIGARCH Comput. Arch. News 17, 3 (June), 242-251.
HWU, W. W., MAHLKE, S. A., CHEN, W. Y., CHANG, P. P., WARTER, N. J., BRINGMANN, R. A., OUELLETTE, R. G., HANK, R. E., KIYOHARA, T., HAAB, G. E., HOLM, J. G., AND LAVERY, D. M. 1993. The superblock: An effective technique for VLIW and superscalar compilation. J. Supercomput. 7, 1/2 (May), 229-248.
IBM. 1992. Optimization and Tuning Guide for the XL Fortran and XL C Compilers. 1st ed. Publication SC09-1545-00, IBM, Armonk, N.Y.
IBM. 1991. IBM RISC System/6000 NIC Tuning Guide for Fortran and C. Publication GG24-3611-01, IBM, Armonk, N.Y.
IRIGOIN, F. AND TRIOLET, R. 1988. Supernode partitioning. In Conference Record of the 15th ACM Symposium on Principles of Programming Languages (San Diego, Calif., Jan.). ACM Press, New York, 319-329.
IVERSON, K. E. 1962. A Programming Language. John Wiley and Sons, New York.
KENNEDY, K. AND MCKINLEY, K. S. 1990. Loop distribution with arbitrary control flow. In Proceedings of Supercomputing '90 (New York, N.Y., Nov.). IEEE Computer Society Press, Los Alamitos, Calif., 407-416.
KENNEDY, K., MCKINLEY, K. S., AND TSENG, C.-W. 1993. Analysis and transformation in an interactive parallel programming tool. Concurrency Pract. Exper. 5, 7 (Oct.), 575-602.
KILDALL, G. 1973. A unified approach to global program optimization. In Conference Record of the ACM Symposium on Principles of Programming Languages (Boston, Mass., Oct.). ACM Press, New York, 194-206.
KNOBE, K. AND NATARAJAN, V. 1993. Automatic data allocation to minimize communication on SIMD machines. J. Supercomput. 7, 4 (Dec.), 387-415.
KNOBE, K., LUKAS, J. D., AND STEELE, G. L., JR. 1990. Data optimization: Allocation of arrays to reduce communication on SIMD machines. J. Parall. Distrib. Comput. 8, 2 (Feb.), 102-118.
KNOBE, K., LUKAS, J. D., AND STEELE, G. L., JR. 1988. Massively parallel data optimization. In The 2nd Symposium on the Frontiers of Massively Parallel Computation (Fairfax, Va., Oct.). IEEE, Piscataway, N.J., 551-558.
KNUTH, D. E. 1971. An empirical study of Fortran programs. Softw. Pract. Exper. 1, 2 (Apr.-June), 105-133.
KOELBEL, C. 1990. Compiling programs for nonshared memory machines. Ph.D. thesis, Purdue University, West Lafayette, Ind.
KOELBEL, C. AND MEHROTRA, P. 1991. Compiling global name-space parallel loops for distributed execution. IEEE Trans. Parall. Distrib. Syst. 2, 4 (Oct.), 440-451.
KRANZ, D., KELSEY, R., REES, J., HUDAK, P., PHILBIN, J., AND ADAMS, N. 1986. ORBIT: An optimizing compiler for Scheme. In Proceedings of the SIGPLAN Symposium on Compiler Construction (Palo Alto, Calif., June). SIGPLAN Not. 21, 7 (July), 219-233.
KUCK, D. J. 1978. The Structure of Computers and Computations. Vol. 1. John Wiley and Sons, New York.
KUCK, D. J. 1977. A survey of parallel machine organization and programming. ACM Comput. Surv. 9, 1 (Mar.), 29-59.
KUCK, D. J. AND STOKES, R. 1982. The Burroughs Scientific Processor (BSP). IEEE Trans. Comput. C-31, 5 (May), 363-376.
KUCK, D. J., KUHN, R. H., PADUA, D., LEASURE, B., AND WOLFE, M. 1981. Dependence graphs and compiler optimizations. In Conference Record of the 8th ACM Symposium on Principles of Programming Languages (Williamsburg, Va., Jan.). ACM Press, New York, 207-218.
LAM, M. S. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Atlanta, Ga., June). SIGPLAN Not. 23, 7 (July), 318-328.
LAM, M. S. AND WILSON, R. P. 1992. Limits of control flow on parallelism. In Proceedings of the 19th Annual International Symposium on Computer Architecture (Gold Coast, Australia, May). SIGARCH Comput. Arch. News 20, 2, 46-57.
LAM, M. S., ROTHBERG, E. E., AND WOLF, M. E. 1991. The cache performance and optimization of blocked algorithms. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.). SIGPLAN Not. 26, 4, 63-74.
LAMPORT, L. 1974. The parallel execution of DO loops. Commun. ACM 17, 2 (Feb.), 83-93.
LANDI, W., RYDER, B. G., AND ZHANG, S. 1993. Interprocedural modification side effect analysis with pointer aliasing. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Albuquerque, New Mexico, June). SIGPLAN Not. 28, 6, 56-67.
LARUS, J. R. 1993. Loop-level parallelism in numeric and symbolic programs. IEEE Trans. Parall. Distrib. Syst. 4, 7 (July), 812-826.


LEE, G., KRUSKAL, C. P., AND KUCK, D. J. 1985. An empirical study of automatic restructuring of nonnumerical programs for parallel processors. IEEE Trans. Comput. C-34, 10 (Oct.), 927-933.
LI, Z. 1992. Array privatization for parallel execution of loops. In Proceedings of the ACM International Conference on Supercomputing (Washington, D.C., July). ACM Press, New York, 313-322.
LI, J. AND CHEN, M. 1991. Compiling communication-efficient programs for massively parallel machines. IEEE Trans. Parall. Distrib. Syst. 2, 3 (July), 361-376.
LI, J. AND CHEN, M. 1990. Index domain alignment: Minimizing cost of cross-referencing between distributed arrays. In The 3rd Symposium on the Frontiers of Massively Parallel Computation, J. Jaja, Ed. IEEE Computer Society Press, Los Alamitos, Calif., 424-433.
LI, Z. AND YEW, P. 1988. Interprocedural analysis for parallel computing. In Proceedings of the International Conference on Parallel Processing, F. A. Briggs, Ed. Vol. 2. Pennsylvania State University Press, University Park, Pa., 221-228.
LI, Z., YEW, P., AND ZHU, C. 1990. Data dependence analysis on multi-dimensional array references. IEEE Trans. Parall. Distrib. Syst. 1, 1 (Jan.), 26-34.
LOVEMAN, D. B. 1977. Program improvement by source-to-source transformation. J. ACM 24, 1 (Jan.), 121-145.
LUCCO, S. 1992. A dynamic scheduling method for irregular parallel programs. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (San Francisco, Calif., June). SIGPLAN Not. 27, 7 (July), 200-211.
MACE, M. E. 1987. Memory Storage Patterns in Parallel Processing. Kluwer Academic Publishers, Norwell, Mass.
MACE, M. E. AND WAGNER, R. A. 1985. Globally optimum selection of memory storage patterns. In Proceedings of the International Conference on Parallel Processing, D. DeGroot, Ed. IEEE Computer Society, Washington, D.C., 264-271.
MARKSTEIN, P., MARKSTEIN, V., AND ZADECK, K. 1994. Strength reduction. In Optimization in Compilers. ACM Press, New York, Chap. 9. To be published.
MASSALIN, H. 1987. Superoptimizer: A look at the smallest program. In Proceedings of the 2nd International Conference on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Oct.). SIGPLAN Not. 22, 10, 122-126.
MAYDAN, D. E., AMARASINGHE, S. P., AND LAM, M. S. 1993. Array data flow analysis and its use in array privatization. In Conference Record of the 20th ACM Symposium on Principles of Programming Languages (Charleston, S. Carolina, Jan.). ACM Press, New York, 1-15.
MAYDAN, D. E., HENNESSY, J. L., AND LAM, M. S. 1991. Efficient and exact data dependence analysis. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Toronto, Ontario, June). SIGPLAN Not. 26, 6, 1-14.
MCFARLING, S. 1991. Procedure merging with instruction caches. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Toronto, Ontario, June). SIGPLAN Not. 26, 6, 71-79.
MCGRAW, J. R. 1985. SISAL: Streams and iteration in a single assignment language. Tech. Rep. M-146, Lawrence Livermore National Laboratory, Livermore, Calif.
MCMAHON, F. M. 1986. The Livermore Fortran kernels: A computer test of numerical performance range. Tech. Rep. UCRL-55745, Lawrence Livermore National Laboratory, Livermore, Calif.
MICHIE, D. 1968. "Memo" functions and machine learning. Nature 218, 19-22.
MILLSTEIN, R. E. AND MUNTZ, C. A. 1975. The ILLIAC-IV Fortran compiler. In Programming Languages and Compilers for Parallel and Vector Machines (New York, N.Y., Mar.). SIGPLAN Not. 10, 3, 1-8.
MIRCHANDANEY, R., SALTZ, J. H., SMITH, R. M., NICOL, D. M., AND CROWLEY, K. 1988. Principles of runtime support for parallel processors. In Proceedings of the ACM International Conference on Supercomputing (St. Malo, France, July). ACM Press, New York, 140-152.
MOREL, E. AND RENVOISE, C. 1979. Global optimization by suppression of partial redundancies. Commun. ACM 22, 2 (Feb.), 96-103.
MUCHNICK, S. S. AND JONES, N., (EDS.) 1981. Program Flow Analysis. Prentice-Hall, Englewood Cliffs, N.J.
MURAOKA, Y. 1971. Parallelism exposure and exploitation in programs. Ph.D. thesis, Tech. Rep. 71-424, Univ. of Illinois at Urbana-Champaign.
MYERS, E. W. 1981. A precise interprocedural data flow algorithm. In Conference Record of the 8th ACM Symposium on Principles of Programming Languages (Williamsburg, Va., Jan.). ACM Press, New York, 219-230.
NICOLAU, A. 1988. Loop quantization: A generalized loop unwinding technique. J. Parall. Distrib. Comput. 5, 5 (Oct.), 568-586.
NICOLAU, A. AND FISHER, J. A. 1984. Measuring the parallelism available for very long instruction word architectures. IEEE Trans. Comput. C-33, 11 (Nov.), 968-976.
NIKHIL, R. S. 1988. ID reference manual, version 88.0. Tech. Rep. 284, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Mass.
NOBAYASHI, H. AND EOYANG, C. 1989. A comparison study of automatically vectorizing Fortran compilers. In Proceedings of Supercomputing '89 (Reno, Nev., Nov.). ACM Press, New York, 820-825.


O'BRIEN, K., HAY, B., MINISH, J., SCHAFFER, H., SCHLOSS, B., SHEPHERD, A., AND ZALESKI, M. 1990. Advanced compiler technology for the RISC System/6000 architecture. In IBM RISC System/6000 Technology. Publication SA23-2619. IBM Corporation, Mechanicsburg, Penn.
OED, W. 1992. Cray Y-MP C90: System features and early benchmark results. Parall. Comput. 18, 8 (Aug.), 947-954.
OEHLER, R. R. AND BLASGEN, M. W. 1991. IBM RISC System/6000: Architecture and performance. IEEE Micro 11, 3 (June), 14-24.
PADUA, D. A. AND PETERSEN, P. M. 1992. Evaluation of parallelizing compilers. In Parallel Computing and Transputer Applications (Barcelona, Spain, Sept.). CIMNE, 1505-1514. Also available as Center for Supercomputing Research and Development Tech. Rep. 1173.
PADUA, D. A. AND WOLFE, M. J. 1986. Advanced compiler optimizations for supercomputers. Commun. ACM 29, 12 (Dec.), 1184-1201.
PADUA, D. A., KUCK, D. J., AND LAWRIE, D. 1980. High-speed multiprocessors and compilation techniques. IEEE Trans. Comput. C-29, 9 (Sept.), 763-776.
PETERSEN, P. M. AND PADUA, D. A. 1993. Static and dynamic evaluation of data dependence analysis. In Proceedings of the ACM International Conference on Supercomputing (Tokyo, Japan, July). ACM Press, New York, 107-116.
PETTIS, K. AND HANSEN, R. C. 1990. Profile guided code positioning. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (White Plains, N.Y., June). SIGPLAN Not. 25, 6, 16-27.
POLYCHRONOPOULOS, C. D. 1988. Parallel Programming and Compilers. Kluwer Academic Publishers, Boston, Mass.
POLYCHRONOPOULOS, C. D. 1987a. Advanced loop optimizations for parallel computers. In Proceedings of the 1st International Conference on Supercomputing. Lecture Notes in Computer Science, vol. 297. Springer-Verlag, Berlin, 255-277.
POLYCHRONOPOULOS, C. D. 1987b. Loop coalescing: A compiler transformation for parallel machines. In Proceedings of the International Conference on Parallel Processing (University Park, Penn., Aug.). Pennsylvania State University Press, University Park, Pa., 235-242.
POLYCHRONOPOULOS, C. D. AND KUCK, D. J. 1987. Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. IEEE Trans. Comput. C-36, 12 (Dec.), 1425-1439.
POLYCHRONOPOULOS, C. D., GIRKAR, M., HAGHIGHAT, M. R., LEE, C. L., LEUNG, B., AND SCHOUTEN, D. 1989. Parafrase-2: An environment for parallelizing, partitioning, synchronizing, and scheduling programs on multiprocessors. In Proceedings of the International Conference on Parallel Processing. Vol. 2. Pennsylvania State University Press, University Park, Pa., 39-48.
PRESBERG, D. L. AND JOHNSON, N. W. 1975. The Paralyzer: IVTRAN's parallelism analyzer and synthesizer. In Programming Languages and Compilers for Parallel and Vector Machines (New York, N.Y., Mar.). SIGPLAN Not. 10, 3, 9-16.
PUGH, W. 1992. A practical algorithm for exact array dependence analysis. Commun. ACM 35, 8 (Aug.), 102-115.
PUGH, W. 1991. Uniform techniques for loop optimization. In Proceedings of the ACM International Conference on Supercomputing (Cologne, Germany, June). ACM Press, New York.
RAU, B. AND FISHER, J. A. 1993. Instruction-level parallel processing: History, overview, and perspective. J. Supercomput. 7, 1/2 (May), 9-50.
RAU, B., YEN, D. W. L., YEN, W., AND TOWLE, R. A. 1989. The Cydra 5 departmental supercomputer: Design philosophies, decisions, and trade-offs. Computer 22, 1 (Jan.), 12-34.
REES, J., CLINGER, W., ET AL. 1986. Revised3 report on the algorithmic language Scheme. SIGPLAN Not. 21, 12 (Dec.), 37-79.
RISEMAN, E. M. AND FOSTER, C. C. 1972. The inhibition of potential parallelism by conditional jumps. IEEE Trans. Comput. C-21, 12 (Dec.), 1405-1411.
ROGERS, A. M. 1991. Compiling for locality of reference. Ph.D. thesis, Tech. Rep. TR 91-1195, Dept. of Computer Science, Cornell University.
RUSSELL, R. M. 1978. The CRAY-1 computer system. Commun. ACM 21, 1 (Jan.), 63-72.
SABOT, G. AND WHOLEY, S. 1993. CMAX: A Fortran translator for the Connection Machine system. In Proceedings of the ACM International Conference on Supercomputing (Tokyo, Japan, July). ACM Press, New York.
SARKAR, V. 1989. Partitioning and Scheduling Parallel Programs for Multiprocessors. Research Monographs in Parallel and Distributed Computing. MIT Press, Cambridge, Mass.
SARKAR, V. AND THEKKATH, R. 1992. A general framework for iteration-reordering transformations. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (San Francisco, Calif., June). SIGPLAN Not. 27, 7 (July), 175-187.
SCHEIFLER, R. W. 1977. An analysis of inline substitution for a structured programming language. Commun. ACM 20, 9 (Sept.), 647-654.
SHEN, Z., LI, Z., AND YEW, P.-C. 1990. An empirical study of Fortran programs for parallelizing compilers. IEEE Trans. Parall. Distrib. Syst. 1, 3 (July), 356-364.


SINGH, J. P. AND HENNESSY, J. L. 1991. An empirical investigation of the effectiveness and limitations of automatic parallelization. In Proceedings of the International Symposium on Shared Memory Multiprocessing (Apr.), 213-240.
SINGH, J. P., WEBER, W.-D., AND GUPTA, A. 1992. SPLASH: Stanford parallel applications for shared memory. SIGARCH Comput. Arch. News 20, 1 (Mar.), 5-44. Also available as Stanford Univ. Tech. Rep. CSL-TR-92-526.
SITES, R. L., (ED.) 1992. Alpha Architecture Reference Manual. Digital Press, Bedford, Mass.
SOHI, G. S. AND VAJAPEYAM, S. 1990. Instruction issue logic for high-performance, interruptable, multiple functional unit, pipelined computers. IEEE Trans. Comput. C-39, 3 (Mar.), 349-359.
STEELE, G. L., JR. 1977. Arithmetic shifting considered harmful. SIGPLAN Not. 12, 11 (Nov.), 61-69.
STEELE, G. L., JR. AND WHITE, J. L. 1990. How to print floating-point numbers accurately. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (White Plains, N.Y., June). SIGPLAN Not. 25, 6, 112-126.
STENSTROM, P. 1990. A survey of cache coherence schemes for multiprocessors. Computer 23, 6 (June), 12-24.
SUN MICROSYSTEMS. 1991. SPARC Architecture Manual, Version 8. Part No. 800-1399-08. Sun Microsystems, Mountain View, Calif.
SZYMANSKI, T. G. 1978. Assembling code for machines with span-dependent instructions. Commun. ACM 21, 4 (Apr.), 300-308.
TANG, P. AND YEW, P. 1990. Dynamic processor self-scheduling for general parallel nested loops. IEEE Trans. Comput. C-39, 7 (July), 919-929.
THINKING MACHINES. 1989. Connection Machine, Model CM-2 Technical Summary. Thinking Machines Corp., Cambridge, Mass.
TJADEN, G. S. AND FLYNN, M. J. 1970. Detection and parallel execution of parallel instructions. IEEE Trans. Comput. C-19, 10 (Oct.), 889-895.
TORRELLAS, J., LAM, M. S., AND HENNESSY, J. L. 1994. False sharing and spatial locality in multiprocessor caches. IEEE Trans. Comput. 43, 6 (June), 651-663.
TOWLE, R. A. 1976. Control and data dependence for program transformations. Ph.D. thesis, Tech. Rep. 76-788, Computer Science Dept., Univ. of Illinois at Urbana-Champaign.
TRIOLET, R., IRIGOIN, F., AND FEAUTRIER, P. 1986. Direct parallelization of call statements. In Proceedings of the SIGPLAN Symposium on Compiler Construction (Palo Alto, Calif., June). SIGPLAN Not. 21, 7 (July), 176-185.
TSENG, C.-W. 1993. An optimizing Fortran D compiler for MIMD distributed-memory machines. Ph.D. thesis, Tech. Rep. Rice COMP TR93-199, Computer Science Dept., Rice University, Houston, Tex.
TU, P. AND PADUA, D. A. 1993. Automatic array privatization. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, vol. 768. Springer-Verlag, Berlin, 500-521.
VON HANXLEDEN, R. AND KENNEDY, K. 1992. Relaxing SIMD control flow constraints using loop transformations. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (San Francisco, Calif., June). SIGPLAN Not. 27, 7 (July), 188-199.
WALL, D. W. 1991. Limits of instruction-level parallelism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.). SIGPLAN Not. 26, 4, 176-188.
WANG, K. 1991. Intelligent program optimization and parallelization for parallel computers. Ph.D. thesis, Tech. Rep. CSD-TR-91-030, Purdue University, West Lafayette, Ind.
WANG, K. AND GANNON, D. 1989. Applying AI techniques to program optimization for parallel computers. In Parallel Processing for Supercomputers and Artificial Intelligence, K. Hwang and D. DeGroot, Eds. McGraw-Hill, New York, Chap. 12.
WEDEL, D. 1975. Fortran for the Texas Instruments ASC system. In Programming Languages and Compilers for Parallel and Vector Machines (New York, N.Y., Mar.). SIGPLAN Not. 10, 3, 119-132.
WEGMAN, M. N. AND ZADECK, F. K. 1991. Constant propagation with conditional branches. ACM Trans. Program. Lang. Syst. 13, 2 (Apr.), 181-210.
WEISS, M. 1991. Strip mining on SIMD architectures. In Proceedings of the ACM International Conference on Supercomputing (Cologne, Germany, June). ACM Press, New York, 234-243.
WHOLEY, S. 1992. Automatic data mapping for distributed-memory parallel computers. In Proceedings of the ACM International Conference on Supercomputing (Washington, D.C., July). ACM Press, New York, 25-34.
WOLF, M. E. 1992. Improving locality and parallelism in nested loops. Ph.D. thesis, Computer Science Dept., Stanford University, Stanford, Calif.
WOLFE, M. J. 1989a. More iteration space tiling. In Proceedings of Supercomputing '89 (Reno, Nev., Nov.). ACM Press, New York, 655-664.
WOLFE, M. J. 1989b. Optimizing Supercompilers for Supercomputers. Research Monographs in Parallel and Distributed Computing. MIT Press, Cambridge, Mass.


WOLF, M. E. AND LAM, M. S. 1991. A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans. Parall. Distrib. Syst. 2, 4 (Oct.), 452-471.
WOLFE, M. J. AND TSENG, C. 1992. The Power Test for data dependence. IEEE Trans. Parall. Distrib. Syst. 3, 5 (Sept.), 591-601.
ZIMA, H. P., BAST, H.-J., AND GERNDT, M. 1988. SUPERB: A tool for semi-automatic SIMD/MIMD parallelization. Parall. Comput. 6, 1 (Jan.), 1-18.

Received November 1993; final revision accepted October 1994.
