
To cite this article: Balaram Sinharoy & Boleslaw K. Szymanski (1997) Parallelising Compilers and Systems, Parallel Algorithms and Applications, 12:1-3, 5-20. DOI: 10.1080/01495739708941414


Parallel Algorithms and Applications, Vol. 12, pp. 5-20
Reprints available directly from the publisher
Photocopying permitted by license only

© 1997 OPA (Overseas Publishers Association) Amsterdam B.V.
Published in The Netherlands under license by Gordon and Breach Science Publishers
Printed in India

PARALLELISING COMPILERS AND SYSTEMS

BALARAM SINHAROY a,* and BOLESLAW K. SZYMANSKI b,†

"Systems Technology and Architecture. IBM Corporation. Poughkeepsie.NY /2601-5400: bDepartment of Computer Science, Rensselaer

Polytechnic Institute. Troy. NY 12180-3590

(Received 27 August 1996; in final form October 1996)

In recent years, high performance computing underwent a deep transformation. In this paper, we review the state of parallel computation with a detailed discussion of current and future research issues in the areas of parallel architectures and compilation methods, instruction level parallelism, and optimization methods to improve the performance of the memory hierarchy.

Keywords: compile-time scheduling; instruction level parallelism; parallelising compiler; parallel programming; multiprocessors

Classification Categories: C.1.1, C.1.2, D.1.3, D.3.4, D.4.1

1 INTRODUCTION

Parallel computation has been around for a few decades; however, in recent years the area has been going through fundamental changes at many levels of system design: software support and hardware architectures, as well as the emerging application areas for parallel computation. Much of the change has been fueled by advances in hardware technology and by a better understanding and dissemination of knowledge on how to build efficient parallel systems. The goal of this paper is to review and discuss the current state of the art in the area, with deeper coverage of the topics not discussed by the subsequent papers in this special issue.

Section 2 discusses the current state of parallel computation, both its software and hardware aspects, and how the different paradigms of parallel computation are merging together. Section 3 discusses the synergistic development of computer architecture and compiler design in areas such as instruction level parallelism (Section 3.1); parallel architecture and compilation techniques, program and data partitioning, mapping and scheduling (Section 3.2); and efficient management of the memory hierarchy to reduce the effect of the memory performance bottleneck (Section 3.3).

2 STATE OF PARALLEL COMPUTATION

In recent years, high performance computing underwent a deep transformation. The declining share of the parallel processing market held by traditional supercomputers and the relegation of SIMD machines to special purpose applications have been accompanied by the increasing role of clusters of workstations, as well as of distributed and shared memory machines with powerful processors. These changes created the conditions for a rapid spread of parallel computing in government and industry. The emphasis shifted from record-breaking performance at any price to price/performance parameters.

In addition, due to advances in CMOS technology, single chip commodity microprocessors can now incorporate advanced techniques (such as out-of-order superscalar execution, superpipelining, speculative execution, sophisticated dynamic branch prediction, dynamic instruction scheduling, large instruction window buffers, and large caches) which were previously available only in the high-end processors used for mainframes and supercomputers. Due to the large volume of commodity microprocessors, there is fierce competition and rapid performance improvement. Consequently, companies that relied on processors designed specifically for their architectures, such as Kendall Square Research or Thinking Machines Corporation, were not successful in staying in the computer design market. On the other hand, the successful companies, such as SGI or IBM, use the hardware performance gains driven by the general computer market to improve the performance of their parallel machines. Fierce competition for the best performance in the general computer market makes parallel processing ubiquitous at all levels of computing technology.

It is important to realize that despite these trends, parallel computing at the end-user and programming level captures a small fraction of the overall information technology industry. The parallel computer industry constitutes less than one percent of the U.S. information technology market. The result of this situation is a narrow user base that can be easily saturated with new products. In addition, parallel computing has been highly dependent on government policies. Government-run laboratories and government-supported universities traditionally constituted more than half of all users of parallel machines.

There are two barriers to more widespread use of parallel processing:

1. Narrow application base: parallel architectures are best suited to solving large, highly-tuned, coarse-grained, and/or data parallel problems. (To seek out more growth areas, most parallel processing vendors are entering the commercial data processing and other new emerging application areas, such as Data Mining, On-Line Application Programming, etc.)

2. Rapid change of hardware: every new generation of parallel architectures differs from the previous one, forcing the users to redevelop their applications. Often, porting and tuning an application to a new architecture can take as long as the time needed to introduce a new architecture, making the ported code obsolete by the time it is ready for use.

Part of the difficulty in extending the parallel computing application base has been the lack of standards in parallel programming interfaces. As discussed below, such standards are emerging and gaining widespread acceptance.

Several different architectural approaches to parallel processing are slowly converging to a similar solution, even though this may not be apparent at first. Multiprocessors can be differentiated from one another on the basis of their global memory access mechanism, which has a profound impact on the multiprocessing efficiency of a system. The spectrum of memory access mechanisms spans from classical symmetric multiprocessors (SMP, for example, IBM S/390, Cray Y-MP) at one end to networks of workstations (NOW) at the other. Between these two extremes lies a plethora of choices, such as cache-only memory architectures (COMA, for example, KSR systems), scalable shared memory or non-uniform memory access machines (NUMA, for example, DASH, Sequent CC-NUMA), and massively parallel or distributed memory machines (such as the IBM SP2 and Cray T3E). With the advances in processor and memory interconnection technology, the differences in the memory access latencies of different multiprocessor systems are disappearing.

Symmetric multiprocessors rely on small, low latency, high bandwidth interconnection networks between the global memory and local processor caches to perform fast data transfers and maintain hardware-based cache coherency. The interconnection networks for large shared memory machines are larger and relatively slower, and therefore behave similarly to those of distributed memory multiprocessors, which are usually able to efficiently support even larger (and slower) interconnection networks. The big difference, however, is that shared memory machines provide hardware or software based cache coherency, whereas distributed memory machines usually do not provide any cache coherency mechanism. However, distributed memory machines, through extensive use of message caching, faster interconnection networks and smaller communication latencies, approach in their behavior the shared memory machines with local caches. The overall trend is to use powerful computing nodes interconnected through a high speed network of large capacity. The associated trend is to rely on standard, off-the-shelf components to improve the price/performance ratio of such architectures.

Networks of workstations provide the least expensive multiprocessor systems. Several research projects have developed such systems [20] based on Myrinet interconnection networks [7], which rival the interprocessor communication latency and bandwidth of traditional supercomputers. ATM technology is also improving rapidly; future implementations, such as OC-192, promise to deliver close to ten gigabits per second [14]. (Even though emerging ATM networks are able to transfer gigabits of data per second, the maximum effective throughput to end-user applications is much lower. This can be attributed to various overheads associated with the data transfer, such as memory bandwidth limitations, data copying, software checksumming for data integrity checking, servicing of interrupts, and the cost of context switching and its effect on the memory hierarchy due to loss of data locality.)

Irrespective of the underlying hardware, the operating system and the middleware software can be designed to provide the illusion of either a distributed memory multiprocessor (separate instances of the operating system running on each processor, with message passing as the means of interprocessor communication) or a shared memory multiprocessor (a single instance of the operating system running on all the processors, so that all the processors share the various operating system data structures, kernel code, interrupt handling, etc., with appropriate protocols to maintain mutual exclusion).

A multitude of different programming paradigms exist for programming a multiprocessor system, and no standard has evolved for them. Consequently, parallel programmers face a daunting challenge, especially with increasingly large and complex applications. They must identify parallelism in an application, extract and translate that parallelism into their code, and design and implement communication and synchronization that preserve the program semantics and foster the efficiency of parallel execution. All these steps must be guided by the currently available architectures, which may change tomorrow, making some of the designs suboptimal or inefficient. Not surprisingly, in such an environment parallel programming has experienced a long and difficult maturation process.

Recently, two basic programming paradigms have emerged: data parallelism and message passing. The first one is popular because of its simplicity. In this paradigm there is a single program (and therefore a single thread of execution) which is replicated on many processors, and each copy operates on a separate part of the data. The Single Instruction Multiple Data (SIMD) version of this approach requires hardware support and is considered useful only for a limited range of applications. Its loosely synchronized counterpart, often referred to as the Single Program Multiple Data (SPMD) paradigm, is more universal. The execution of an SPMD parallel computation consists of two stages:

1. The computational stage, when copies of the same program are executed in parallel on each processor locally. The execution can differ in the conditional branches taken, the number of loop iterations executed, etc.

2. The data exchange stage, when all processors concurrently engage in exchanging non-local data.

It should be noted that the data exchange stage is very simple for shared memory machines (where it can be enforced by using locks or barriers). By reordering the computation and properly selecting the frequency of synchronization, partial interleaving of the computation and communication stages can be achieved. The SPMD model matches well the needs of scientific computing, which often requires applying essentially the same algorithm at many points of a computational domain. SPMD parallel programs are conceptually simple because a single program executes on all processors, but they are more complex than SIMD programs.
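To make the two stages concrete, here is a minimal sketch (constructed for this discussion, not taken from the paper) of the SPMD pattern in C using the standard MPI interface discussed later in this section; the local domain size, the number of steps, and the one-dimensional block decomposition are all assumptions of the example.

#include <mpi.h>
#include <stdio.h>

#define N_LOCAL 1024   /* assumed local portion of the domain per process */
#define STEPS   100

int main(int argc, char **argv)
{
    int rank, size;
    double u[N_LOCAL + 2];          /* local data plus two ghost cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int i = 0; i < N_LOCAL + 2; i++)
        u[i] = rank;                /* arbitrary initial data */

    for (int step = 0; step < STEPS; step++) {
        /* Stage 1: computation -- every process runs the same code on its
           local part of the data; control flow may differ per process.   */
        for (int i = 1; i <= N_LOCAL; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);

        /* Stage 2: data exchange -- swap ghost cells with both neighbors.
           The matched sends/receives also act as a loose synchronization. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[N_LOCAL], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    if (rank == 0)
        printf("done after %d steps on %d processes\n", STEPS, size);
    MPI_Finalize();
    return 0;
}

Every process runs the identical program; only its rank, and therefore its piece of the data and its neighbors, differ.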

For complex applications, running a single program across all parallel processors may be unnecessarily restrictive. In particular, dynamically changing programs with unpredictable execution times of their components result in poorly balanced parallel computations when implemented in the SPMD paradigm. This is because SPMD processes synchronize at the data exchange stage, and none of the processes can proceed to the next computational stage until all others reach the data exchange stage.

Distributed memory machines use message passing for exchanging data between different processors. The SPMD model may shield the user from specifying the detailed data movements, thanks to data distribution directives from which a compiler generates the message passing statements. However, users who decide to write the message passing statements themselves have full control over the program's communication behavior and hence its execution time. In particular, the user may define which of the processors synchronize at a given instance of parallel execution. This approach gives users higher flexibility at the cost of requiring them to specify an intricate and detailed description of the program. The programs tend to be longer and more complex than their SPMD counterparts, and therefore more error-prone. However, once debugged and tuned, the programs are more efficient. The flexibility of the message passing model makes it applicable to a wide variety of problems. As discussed below, the standard library of functions for message passing, MPI, is becoming a universal tool for parallel software development [48].
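As a small illustration of this flexibility (a sketch constructed here against the standard MPI C API, not code reproduced from the paper), the exchange below involves only ranks 0 and 1; all other processes proceed without synchronizing, which is exactly the kind of fine control that explicit message passing offers.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double payload = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        payload = 3.14;
        /* Only rank 0 and rank 1 take part in this exchange. */
        MPI_Send(&payload, 1, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", payload);
    }
    /* All other ranks fall through immediately: no global barrier is implied. */

    MPI_Finalize();
    return 0;
}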

There is a plethora of research on parallel programming languages, with different flavors to choose from, such as functional, dataflow, object oriented, logical, etc. However, the majority of scientific parallel programs were written in Fortran. Since the 1950s this language has been a favorite choice of writers of scientific programs, and particularly of generations of graduate students in the applied sciences. Over the years, Fortran underwent a remarkable transformation: from one of the first high level languages, to the first language with a well defined standard (Fortran66), to the structured programming of Fortran77, to the data parallel and object-oriented Fortran90, and finally to the newest standard of High Performance Fortran (HPF) [21]. Each generation brought with it new features and set a new standard for the manufacturers of hardware and compilers. Critics of HPF argue that the HPF language is not general enough. In particular, HPF does not allow the dynamically defined alignments and distributions that are permitted in some other languages (HPF+, Vienna Fortran). However, even Fortran90 allows for object-oriented design of parallel software [30]. Moreover, standardization of the language features is extremely important for users and for compiler and tool writers, because it protects their software investments against changes in the architecture. In that respect, the introduction of Fortran90 and then HPF is an important step forward towards more stable parallel software. HPF can be seen as the flagship of the data parallelism camp. On the other hand, the supporters of message passing based parallel programming achieved standardization of their approach in the Message Passing Interface (MPI) [48].

Parallel processing is at a critical point of its evolution. After a long period of intense support by government and academia, it is slowly moving to derive the bulk of its support from the commercial world. Such a move brings with it a change of emphasis from record breaking performance to price/performance and sustained speed of program execution. The winning architectures are not only fast but also economically sound. As a result, there is a clear trend towards widening the base of parallel processing both in hardware and software. On the hardware side, this means the use of off-the-shelf, commercially available components (processors, interconnection switches) which benefit from the rapid pace of technological advancement fueled by the large customer base. The other effect is the convergence of different architectures, thanks to the spreading of successful solutions among all of them. Workstations interconnected by fast networks and using fast communication protocols (such as Berkeley Active Messages [17] and Illinois Fast Messages [23]) are approaching the performance of commercially available distributed parallel machines. Shared memory machines with multilevel caches and sophisticated prefetching strategies still execute programs with higher multiprocessing efficiency than distributed memory machines; however, the difference in their performance is decreasing for systems with a large number of processors.

On the software side, the widening base of users relies on standardization of parallel programming tools. By protecting the programmer's investment in software, standardization promotes the development of libraries, tools and application kits that in turn attract more end-users to parallel processing. It appears that parallel programming is ending a long period of craft design and is entering a stage of industrial development of parallel software.

3 SYNERGY BETWEEN ARCHITECTURE AND COMPILER

Program execution time can be greatly improved by the synergistic development of compiler optimization algorithms along with innovative approaches to microarchitecture design, memory subsystem design, and improvements in data communication networks and protocols. This section first discusses the issues and the associated research areas in improving the ability of a processor to increase the number of instructions executed per processor cycle (instruction level parallelism). This is followed by a discussion of the issues involved in the parallelization of numerically intensive applications by multiprocessor systems and of the need for innovation in solving the memory system performance bottleneck.

3.1 Parallelization for Uniprocessor Execution: Instruction Level Parallelism

Microprocessor performance has improved dramatically (by about 60% per year) over the last decade. In addition to the great advances in semiconductor technology (which account for about 30% performance improvement per year), the outstanding performance improvements of commodity microprocessors can be attributed to advances in microarchitecture design and compiler technology. However, many of the ideas used in such microarchitecture designs are based on Tomasulo's original ideas [42] and on advances in sequential compiler technology, most of which have their roots in the mainframe and supercomputer technology that has evolved over the last thirty years. Today there is a growing consensus that not much is left to borrow from mainframe/supercomputer technology, and so future microarchitectures and the associated compiler technology will have to be radically different to maintain the same performance improvement curve for high-end microprocessors in the future. Based on this observation, several new microarchitecture and compiler optimization methods have been described in the literature in recent times, such as Multiscalar architectures [40], Very Long Instruction Word (VLIW) machines [16,35], Single Program Speculative Multithreading (SPSM) architectures [15], multiway branching machines [27], symmetric multiprocessing on a chip [4], Simultaneous Multithreading [43], software pipelining [3], and Disjoint Eager Execution (DEE) [44], to name a few.

Modern microprocessors have the ability to issue and execute significantly more than one instruction per cycle (IPC). However, the average IPC is often limited to a small number for non-numerical applications. Comprehensive experimental studies on non-numerical applications [45,46] show that even if the processor has a large number of execution units and is able to fetch and speculatively execute a large number of instructions along the predicted path, with perfect memory disambiguation and perfect register renaming, the speedup will be limited to about seven to thirteen for most non-numerical applications.

The low speedup is the result of the limited parallelism available among the instructions between two successive mispredicted conditional branches. A subsequent study by Lam and Wilson [22] relaxed some of the restrictions on instruction issue and execution imposed by today's microprocessors due to insufficient control flow information. The study showed that the amount of instruction level parallelism (ILP) potentially available in an arbitrary program is often order(s) of magnitude larger than what is extracted by today's microprocessors. It is the insufficient information available ahead of time to the instruction fetch and issue unit of a processor about the flow of control in a program that severely limits the extractable ILP and makes today's compiler/microprocessor technology unable to achieve high ILP.

It is evident that for non-numerical applications, achieving instruction level parallelism close to the theoretical limits will require synergy between future microprocessor designs and compilation techniques. Such synergy should allow program execution to proceed simultaneously along multiple flows of control in a speculative fashion, with fast coordination among the threads for communicating data and control information as they become available. The multiple flows of control (or fine-grain threads) are to be generated from different regions of a program, and they should be determined through control dependency analysis by the compiler.

A significant amount of sequential compiler research has been performed over the years for superscalar and VLIW processors to increase the available ILP through various program transformation methods. These transformations are often simple and restricted to a small number of basic blocks to maintain reasonable compilation times. Among the techniques developed are software pipelining [3], loop unrolling and loop peeling [33], efficient instruction scheduling and register allocation with resource constraints (such as the number of functional units) [29], percolation scheduling or boosting methods to move instructions across conditional branches [39,26], trace scheduling [24], predicated execution [10], the formation of superblocks to decrease the number of breaks in the sequential flow of instructions [19], VLIW tree instructions or multiway branching to execute instructions belonging to different execution paths with appropriate masking to prevent instructions from the non-taken path from being committed to architected registers [27], region based compilation [18], and other program transformation methods.
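As a concrete illustration of one such transformation (an example constructed here, not drawn from the paper), unrolling a reduction loop by a factor of four with separate partial sums exposes independent operations that a superscalar or VLIW scheduler can overlap, at the cost of larger code size; note that reassociating a floating point sum in this way is something a compiler would do only under relaxed floating point rules.

/* Original loop: one add per iteration, serialized through one accumulator. */
double sum_simple(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four with separate partial sums: the four adds in the loop
   body are independent of each other, so more instruction level parallelism
   is available to the scheduler.  The epilogue handles leftover iterations. */
double sum_unrolled(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* epilogue for n not divisible by 4 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}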

These transformations allow instructions to be executed speculatively and out of program order; however, there is always a single flow of control in all of these methods, as characterized by the requirement of a single program counter in the associated microarchitecture. Most of these ideas have been empirically tested on small benchmarks without any well developed theory. Since these approaches keep a single flow of control (implemented by a single program counter), coarse-grain parallelism is not addressed by any of them.

Today's high-end, out-of-order superscalar microprocessors are facing difficulties in increasing the number of instructions executed per cycle, among them the following two. First, to scale up the instruction level parallelism, each new generation of processors increases the dynamic fetch/dispatch window sizes, adds more execution units, and attempts to increase the level of runtime speculation. These features, especially the centralized dispatch unit, increase the complexity of the design, causing delays in the development and verification process. Second, local regions of code have limited ILP, and so superscalar processors are fundamentally prevented from providing an order of magnitude higher ILP by their inability to execute independent regions of code concurrently.

A few new approaches have been described in the literature to alleviate this problem. The Multiscalar approach [40] executes instructions along multiple flows of control (called tasks) at the same time. All but one of these tasks are usually speculative. These tasks, however, form a linear chain of dependency among them. So if control speculation (that is, branch prediction through intervening branch instructions) on one task in the chain turns out to be wrong, all its successor tasks have to be squashed. This limitation may restrict a Multiscalar processor from achieving high ILP. Approaches similar to Multiscalar, such as the Single Program Speculative Multithreading (SPSM) architecture [15], allow the dependency graph of the speculative tasks to be more general than a chain: the dependency could be any directed graph. In SPSM, if one of the speculated tasks is squashed, it does not necessarily mean that all its successor tasks have to be squashed as well. Disjoint Eager Execution (DEE) proposes to execute several future basic blocks along different possible paths of execution, based on the probability (obtained by extrapolating information from a sophisticated branch prediction mechanism) of each basic block being on the taken path [44].

Most articles describing these approaches typically include descriptions of the microarchitecture and the speedup obtained by trace-driven simulation for SPEC-like benchmarks [25], where the traces are usually obtained by running programs compiled with widely available superscalar compilers. Very little compiler optimization work has been reported to support such radically different microarchitectures, even though compiler optimization can greatly improve the efficiency and the instruction level parallelism deliverable by these new microarchitectures.

Speculation over multiple threads of computation (each consisting of a small number of basic blocks) to support fine-grain parallelism is a very promising area where compiler optimization can be greatly beneficial. Due to the lack of such compiler optimization, approaches such as Multiscalar processors do not produce high speedup. Compiler optimization should include the identification of multiple execution paths using control dependence analysis, followed by data dependence analysis and address comparison across these paths (inter-thread dependence analysis) to generate efficient parallel threads of execution. Special instructions are needed for efficient creation and destruction of these threads and for fast coordination of data and control information as they are evaluated in program order.


Performing complex data dependence analysis in hardware is very expensive in terms of design effort as well as hardware complexity, and yet there is not much need for the processor to do it. VLIW microarchitectures and similar techniques aim to solve this problem, but the VLIW compilation approach does not yet have a good solution for binary compatibility or incremental compilation.

The current generation of processors lacks global control dependence information. Since the compiler has global control flow information, compiler-generated hint instructions that guide the processor towards high bandwidth instruction fetching and execution will be another fruitful area of research. Such hints have been used in the past in the form of prefetch instructions that bring data or instructions from the main memory into the cache.


3.2 Parallelization for Multiprocessor Execution: Iteration Level Parallelism


This area has been studied extensively over the past decade. Most work in this area can be seen to fall into one of the following two categories:

• User specified parallelism: numerous parallel language extensions for various existing languages (most notably, HPF [21]), as well as a number of proposals for new parallel languages for various programming paradigms, have been pursued.

• Automatic extraction of parallelism: work has concentrated on developing tests to determine the independence of loop iterations [5,31,32], on loop restructuring to increase the available parallelism [51,49], and on partitioning and mapping of the iterations and data structures over distributed memory machines to reduce communication and synchronization costs [34,8,38].

Many algorithms have been developed over the years for memory disambiguation to determine iteration independence. However, the problem is NP-hard in general, and all the relevant information (such as loop bounds) is not always available at compile-time. Even though a lot of improvement has been made in this area over the years, most of the algorithms are heuristics and their performance is not very well understood.
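For illustration (an example constructed here, not taken from the paper), the first loop below carries no cross-iteration dependence, so its iterations can run in parallel; the second has a loop-carried flow dependence; and for affine subscripts such as a[2*i] and a[2*i+1], a GCD-style test can prove that the write and the read never touch the same element.

void independent(double *a, const double *b, const double *c, int n)
{
    /* No iteration reads a value written by another iteration:
       a parallelising compiler can turn this into a DOALL loop. */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

void carried(double *a, int n)
{
    /* a[i] depends on a[i-1] written by the previous iteration
       (a loop-carried flow dependence), so the iterations cannot
       simply be distributed over processors.                     */
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] * 0.5;
}

void disjoint(double *a, int n)
{
    /* Writes go to even elements, reads come from odd elements.
       The dependence equation 2*i = 2*j + 1 has no integer solution
       (gcd(2,2) = 2 does not divide 1), which is exactly what a GCD
       dependence test checks, so the iterations are independent.    */
    for (int i = 0; i < n; i++)
        a[2 * i] = a[2 * i + 1] + 1.0;
}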

Progress in algorithm design to determine iteration independence and to generate efficient parallel code is of great importance, because explicitly specifying all the parallelism available in a program and generating the relevant code for efficient communication makes the process of developing parallel programs very expensive.

Synergistic solutions between the compiler and the parallel computer architecture (for example, the use of a mechanism similar to the address resolution buffer in Multiscalar [40]) using run-time patch-up code could be useful in extracting parallelism when iteration independence cannot be determined at compile-time.

Rapid advances in low cost local area networks and interfaces suggest that in the future many large parallel systems will be built using such networks to connect powerful yet inexpensive symmetric multiprocessors of moderate size. Parallelism extraction, partitioning of data structures so that most data references stay on the local node, and optimization of data communication and routing for such systems will be interesting research areas.
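For instance (an illustrative sketch, not a scheme from the paper), under the standard block and cyclic data distributions a compiler or runtime decides whether a reference is local from a simple ownership function such as the ones below; references whose owner is not the executing node are the ones that require communication.

/* Owner of element i under a block distribution of n elements over p nodes.
   Elements 0 .. n/p-1 live on node 0, the next block on node 1, and so on.  */
int block_owner(int i, int n, int p)
{
    int block = (n + p - 1) / p;      /* ceiling(n / p) elements per node */
    return i / block;
}

/* Owner under a cyclic distribution: element i lives on node i mod p.
   Cyclic layouts balance load better for triangular iteration spaces,
   at the price of poorer locality for stencil-like references.        */
int cyclic_owner(int i, int p)
{
    return i % p;
}

/* A reference a[i] made by node `me` is local exactly when me owns i. */
int is_local_block(int i, int n, int p, int me)
{
    return block_owner(i, n, p) == me;
}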

3.3 Optimization for the Memory Hierarchy

Cache misses are becoming relatively more expensive in modern processors. This is largely due to the fact that the number of processor cycles per second, as well as the number of instructions executed per processor cycle, is increasing faster than the rate at which main memory latency is improving. To make matters worse, many of the superscalar optimization methods that increase the available instruction level parallelism also have an adverse effect (as explained below) on the memory access patterns of a program. These effects cause the processor to stall on cache misses for a significant portion of its execution time [50]. As Amdahl's law dictates, the performance of a computer system cannot be improved much by improving only a portion of it. For good performance the entire processing system needs to be designed in a balanced fashion.
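For reference (standard material recalled here, not reproduced from the paper): if a fraction f of the execution time is affected by an enhancement that makes that fraction k times faster, the overall speedup is

S = \frac{1}{(1 - f) + f/k} \le \frac{1}{1 - f}

so if, say, 40% of the cycles are memory stalls and only the remaining 60% is accelerated, the overall speedup can never exceed 1/0.4 = 2.5, no matter how fast the processor core becomes.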

Numerous approaches have been proposed recently in the literature to address the performance bottleneck due to slower memory [36,2,47,37,11]. More work needs to be done in the area of compiler program transformation techniques that reduce cache misses for multi-level caches and reduce false sharing among processors. Known algorithms for generating prefetch hints [28] to the memory subsystem to reduce cache misses do not work very well for non-numerical applications [6].
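A compiler that generates such hints effectively inserts non-binding prefetch instructions a fixed distance ahead of the actual accesses. The sketch below is illustrative only (it is not the scheme of [28]); it uses the GCC/Clang __builtin_prefetch intrinsic, and the prefetch distance of 16 elements is an assumed tuning parameter.

/* Sum an array while prefetching data a fixed distance ahead, so that the
   memory latency of future iterations overlaps with current computation. */
double sum_with_prefetch(const double *a, int n)
{
    const int dist = 16;            /* assumed prefetch distance, in elements */
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0, 1);  /* read access, low temporal locality */
        s += a[i];
    }
    return s;
}

For a dense, regular loop like this one a hardware stream prefetcher would likely do as well; compiler-generated hints matter most for less regular access patterns, which is precisely where, as noted above, current hint-generation algorithms struggle.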

Many superscalar compiler optimization methods that improve instruction level parallelism, such as instruction boosting, procedure inlining, software pipelining, loop unrolling, loop peeling, branch target expansion, predicated execution, etc., significantly increase the static size of programs (often by a factor of two or more, depending on how aggressively these optimization methods are applied) and hence increase the working set size of a program in the main memory and its footprint in the instruction cache. Most of these compiler optimization methods change the memory access characteristics of programs by (1) increasing the sequentiality of instruction access and (2) increasing the working set size of the programs. For a small instruction cache (i.e., a cache that can hold only a small fraction of the working set), these optimizations reduce I-cache misses because of the increased sequentiality of the program. However, for a moderately sized I-cache, the increased working set of the program may no longer fit in the I-cache, and so the I-cache miss rate may go up (unless the increase in sequentiality compensates for it). For a larger I-cache, I-cache miss rates decrease, because the I-cache is large enough to hold the increased working set and is able to benefit from the increased sequentiality of the program. Increased sequentiality also improves the effectiveness of compiler generated prefetch instructions that prefetch I-cache lines.

Although superscalar optimization methods can reduce I-cache miss rates, especially for small and very large I-caches, these methods increase data cache miss rates. This increase is primarily due to (1) the increase in the number of instructions executed (including load/store operations) with better utilization of the functional units, and (2) the execution of a larger number of speculative instructions (including load instructions) due to compiler optimizations such as boosting and predicated execution.

So far, compiler optimization researchers have focused either on improving instruction level parallelism or on program transformations that improve the performance of the memory subsystem. However, for real performance benefit, future research needs to look into approaches that improve both the instruction level parallelism and the memory access patterns. Very little research has been devoted to combining these two separate efforts [9].

Compared to the number of innovative approaches proposed in the areas of microprocessor design and compiler optimization methods, there are few proposals that address the memory bottleneck problem. Most approaches rely on different cache designs, prefetching techniques [11] and iteration space tiling methods [47] for improved cache reuse. Multithreaded processors have been proposed in the literature to reduce the performance loss due to cache misses. Processor multithreading can reduce some of the processor stall time by overlapping the memory accesses of one thread with the instruction execution of a different thread. However, processor multithreading increases cache miss rates, and thus memory access latency is tolerated at the expense of higher memory bandwidth. Compiler optimization methods that create lightweight threads with a high level of inter-thread and intra-thread data locality will be very helpful in effectively reducing data cache miss rates and the total execution time of numerically intensive computations [37].
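As an illustration of iteration space tiling for cache reuse (a textbook-style sketch constructed here, not the specific algorithm of [47]; the tile size is an assumed tuning parameter that would normally be derived from the cache capacity), the blocked matrix multiplication below reuses each TILE x TILE block of the operands many times while it is resident in the cache.

#define TILE 64   /* assumed tile size; would normally be tuned to the cache */

/* C += A * B for n x n row-major matrices, blocked so that each TILE x TILE
   block of A and B is reused from the cache across many iterations.         */
void matmul_tiled(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                /* mini matrix multiply on one tile */
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int k = kk; k < kk + TILE && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}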

4 CONCLUSION

In recent years, high performance computing underwent a deep transformation. Many competing ideas are converging due to a better understanding of the applications and the availability of low-cost commodity hardware components. With the advances in interconnection networks and in the various software layers that support different paradigms of parallel computation, the traditional distinction between shared and distributed memory computation has been blurred. As discussed in this paper, despite the vast progress in parallel computing achieved over the last decade, there is a need for future research in the areas of parallel architectures and compilation methods, instruction level parallelism, and compiler optimization methods to improve parallelism and the performance of the memory hierarchy. The paper also identifies specific research topics with the potential for the highest impact on the future of parallel computing.


References


[1] Adve, S. V. and Hill, M. D. (1993). A Unified Formalization of Four Shared Memory Models, IEEE Transactions on Parallel and Distributed Systems, June, 4(6), pp. 613-624.

[2] Agarwal, A. (1992). Performance Tradeoffs in Multithreaded Processors, IEEE Transactions on Parallel and Distributed Systems, 3(5), pp. 525-539.

[3] Allan, V. H., Jones, R. B., Lee, R. M. and Allan, S. J. (1995). Software Pipelining, ACM Computing Surveys, September, 27(3), pp. 367-432.

[4] Amarasinghe, S. P., Anderson, J. M., Wilson, C. S., Liao, S.-W., French, R. S., Hall, M. W., Murphy, B. R. and Lam, M. S. (1996). The Multiprocessor as a General-Purpose Processor: A Software Perspective, IEEE Micro, June, 16(3).

[5] Banerjee, U. (1988). Dependence Analysis for Supercomputing, Kluwer Academic Publishers, Norwell, Massachusetts.

[6] Bernstein, D., Cohen, D., Freund, A. and Maydan, D. E. (1995). Compiler Techniques for Data Prefetching on the PowerPC, Proceedings of the Parallel Architectures and Compilation Techniques, pp. 19-26.

[7] Boden, N. J., Cohen, D., Felderman, R. E., Kulawik, A. E., Seitz, C. L., Seizovic, J. N. and Su, W. K. (1995). Myrinet: A Gigabit-per-Second Local Area Network, IEEE Micro, February, 15(1), pp. 29-36.

[8] Boulet, P., Darte, A., Risset, T. and Robert, Y. (1994). (Pen)-ultimate Tiling, Proc. of the Scalable High Performance Computing Conference, Knoxville, TN, May, pp. 568-576.

[9] Carr, S. (1996). Combining Optimization for Cache and Instruction-Level Parallelism, Proceedings of the Parallel Architectures and Compilation Techniques, October, Boston, Massachusetts.

[10] Chang, P.-Y., Hao, E., Patt, Y. N. and Chang, P. P. (1996). Using Predicated Execution to Improve the Performance of a Dynamically Scheduled Machine with Speculative Execution, International Journal of Parallel Programming, June, 24(3), pp. 209-234.

[11] Chen, T. F. and Baer, J. L. (1994). A Performance Study of Software and Hardware Data Prefetching Schemes, Proceedings of the 21st International Symposium on Computer Architecture, April, pp. 223-232.

[12] Deelman, E., Kaplow, W., Szymanski, B., Tannenbaum, P. and Ziantz, L. (1996). Integrating Data and Task Parallelism in Scientific Programs, Proc. 3rd Workshop on Languages, Compilers and Run-Time Systems for Scalable Computers, May 1995, Kluwer Academic Publishers, pp. 169-184.

[13] Diefendorff, K. and Silha, E. (1994). The PowerPC User Instruction Set Architecture, IEEE Micro, October, 14(5), pp. 30-41.

[14] Dittia, Z. D., Cox, J. R. and Parulkar, G. M. (1995). Design of the APIC: A High Performance ATM Host-Network Interface Chip, Proceedings of IEEE INFOCOM, Boston, pp. 179-187.

[15] Dubey, P., O'Brien, K., O'Brien, K. M. and Barton, C. (1995). Single-Program Speculative Multithreading (SPSM) Architecture: Compiler-assisted Fine-grained Multithreading, Proceedings of the Parallel Architectures and Compilation Techniques, pp. 109-121.

[16] Ebcioglu, K., Groves, R., Kim, K.-C., Silberman, G. and Ziv, I. (1994). VLIW Compilation Techniques in a Superscalar Environment, Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, June, pp. 36-48.

[17] Eicken, T. V., Culler, D. E., Goldstein, S. C. and Schauser, K. E. (1992). Active Messages: A Mechanism for Integrated Communication and Computation, International Symposium on Computer Architecture.

[18] Hank, R. E., Hwu, W. W. and Rau, B. R. (1995). Region-Based Compilation: An Introduction and Motivation, Proceedings of the 28th Annual International Symposium on Microarchitecture, December, pp. 158-168.

[19] Hwu, W. M. W., Mahlke, S. A., Chen, W. Y., Chang, P. P., Warter, N. J., Bringmann, R. A., Ouellette, R. G., Hank, R. E., Kiyohara, T., Haab, G. E., Holm, J. G. and Lavery, D. M. (1993). The Superblock: An Effective Technique for VLIW and Superscalar Compilation, Journal of Supercomputing, May, 7(1-2), pp. 229-248.

[20] Kleinrock, L., Gerla, M., Bambos, N., Gong, J., Gafni, E., Bergman, L., Bannister, J., Monacos, S., Bujewski, T., Hu, P.-C., Karman, B., Kwan, B., Leonardi, E., Peck, J. and Palnati, P. (1996). The Supercomputer Supernet Testbed: A WDM Based Supercomputer Interconnect, Journal of Lightwave Technology, June, 14(6), pp. 1388-1399.


[21] Koelbel, C. H., et al. (1994). The High Performance Fortran Handbook, MIT Press, Cambridge, Massachusetts.

[22] Lam, M. and Wilson, R. (1992). Limits of Control Flow on Parallelism, Proceedings of the 19th Annual International Symposium on Computer Architecture, May, pp. 46-57.

[23] Lauria, M. and Chien, A. (1997). MPI-FM: High Performance MPI on Workstation Clusters, to appear in Journal of Parallel and Distributed Computing, February.

[24] Lowney, P. G., Freudenberger, S. M., Karzes, T. J., Lichtenstein, W. D., Nix, R. P., O'Donnell, J. S. and Ruttenberg, J. C. (1993). The Multiflow Trace Scheduling Compiler, Journal of Supercomputing, May, 7(1-2), pp. 51-142.

[25] Mirghafori, N., Jacoby, M. and Patterson, D. (1995). Truth in SPEC Benchmarks, Computer Architecture News, December, 23(5), pp. 34-42.

[26] Moon, S.-M. and Ebcioglu, K. (1992). An Efficient Resource-Constrained Global Scheduling Technique for Superscalar and VLIW Processors, Proceedings of the 25th Annual International Symposium on Microarchitecture, December, pp. 55-71.

[27] Moon, S.-M. (1993). Increasing Instruction-level Parallelism through Multi-way Branching, Proceedings of the 1993 International Conference on Parallel Processing, Volume II - Software, August, pp. 241-245.

[28] Mowry, T. C. (1994). Tolerating Latency Through Software-Controlled Data Prefetching, Ph.D. Thesis, Stanford University, March.

[29] Natarajan, B. and Schlansker, M. (1995). Spill-Free Parallel Scheduling of Basic Blocks, Proceedings of the 28th Annual International Symposium on Microarchitecture, November, pp. 119-124.

[30] Norton, C., Decyk, V. and Szymanski, B. K. (1996). On Parallel Object Oriented Programming in Fortran 90, ACM Applied Computing Review, 4(1), pp. 27-31.

[31] Psarris, K., Klappholz, D. and Kong, X. (1991). On the Accuracy of the Banerjee Test, Journal of Parallel and Distributed Computing, 12(2), pp. 152-157.

[32] Pugh, W. and Wonnacott, D. (1993). An Exact Method for Analysis of Value-Based Array Data Dependences, 1993 Workshop on Languages and Compilers for Parallel Computing, Portland, Oregon, Lecture Notes in Computer Science, Springer Verlag, Berlin, August, pp. 546-566.

[33] Radigan, J., Chang, P. and Banerjee, U. (1996). Integer Loop Code Generation for VLIW, Eighth International Workshop on Languages and Compilers for Parallel Computing, August, pp. 318-330.

[34] Ramanujam, J. and Sadayappan, P. (1992). Tiling Multidimensional Iteration Spaces for Multicomputers, Journal of Parallel and Distributed Computing, 16, pp. 108-120.

[35] Rau, B. and Fisher, J. (1993). Instruction Level Parallel Processing: History, Overview, and Perspectives, Journal of Supercomputing, May, 7(1-2), pp. 9-50.

[36] Saulsbury, A., Pong, F. and Nowatzyk, A. (1996). Missing the Memory Wall: The Case for Processor/Memory Integration, Proceedings of the 23rd Annual International Symposium on Computer Architecture, June.

[37] Sinharoy, B. (1996). Compiling for Multithreaded Multicomputer, in Languages, Compilers and Runtime Systems for Scalable Computers, B. Szymanski and B. Sinharoy (eds.), Kluwer Academic Publishers, pp. 137-152.

[38] Sinharoy, B. and Szymanski, B. K. (1994). Data and Task Alignment in Distributed Memory Architectures, Journal of Parallel and Distributed Computing, April, 21(1), pp. 61-74.

[39] Smith, M. D., Horowitz, M. A. and Lam, M. S. (1992). Efficient Superscalar Performance Through Boosting, Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, October.

[40] Sohi, G., Breach, S. and Vijaykumar, T. (1995). Multiscalar Processors, Proceedings of the 22nd International Symposium on Computer Architecture, pp. 414-425.

[41] Szymanski, B. K. (1991). EPL - Parallel Programming with Recurrent Equations, in Parallel Functional Languages and Compilers, ACM Press, New York, pp. 51-104.

[42] Tomasulo, R. M. (1967). An Efficient Algorithm for Exploiting Multiple Arithmetic Units, IBM Journal of Research and Development, 11, pp. 25-33.

[43] Tullsen, D. M., Eggers, S. J. and Levy, H. M. (1995). Simultaneous Multithreading: Maximizing On-Chip Parallelism, International Symposium on Computer Architecture, June.


[44] Uht, A. K. and Sindagi, V. (1995). Disjoint Eager Execution: An Optimal Form of Speculative Execution, Proceedings of the 28th Annual International Symposium on Microarchitecture, Ann Arbor, Michigan, November, pp. 313-325.

[45] Wall, D. W. (1991). Limits of Instruction-Level Parallelism, Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April, pp. 176-188.

[46] Wall, D. W. (1993). Limits of Instruction-Level Parallelism, DEC Western Research Lab, Research Report 93/6, November, pp. 1-64.

[47] Wolf, M. E. and Lam, M. S. (1991). A Data Locality Optimizing Algorithm, Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, June, pp. 30-44.

[48] Walker, D. W. and Dongarra, J. J. (1996). MPI: A Standard Message Passing Interface, Supercomputer, January, 12(1), pp. 56-68.

[49] Wolf, M. E. and Lam, M. S. (1991). A Loop Transformation Theory and an Algorithm to Maximize Parallelism, IEEE Transactions on Parallel and Distributed Systems, October, 2(4), pp. 452-471.

[50] Wulf, W. A. and McKee, S. A. (1995). Hitting the Memory Wall: Implications of the Obvious, Computer Architecture News, March, 23(1), pp. 20-24.

[51] Xue, J. (1994). Automating Non-Unimodular Loop Transformations for Massive Parallelism, Parallel Computing, April, 20(5), pp. 711-728.
