
PARALLEL AND DISTRIBUTED SYSTEMS, IEEE TRANSACTIONS ON 1

Compiler Techniques for Efficient Communications in Circuit Switched Networks for Multiprocessor Systems

Shuyi Shao, Student Member, IEEE, Alex K. Jones, Member, IEEE, Rami Melhem, Fellow, IEEE

Abstract—In this paper we explore compiler techniques for achieving efficient communications on circuit switching interconnection networks. We propose a compilation framework for identifying communication patterns and compiling these patterns as network configuration directives. This has the potential of providing significant performance benefits when connections can be established in the network prior to the actual communications. The framework includes a flexible and powerful communication pattern representation scheme that captures the properties of communication patterns and allows manipulation of these patterns. In this way, communication phases can be identified within the application. Additionally, we extend the classification of static and dynamic communications to include persistent communications. Persistent communications are a subclass of dynamic communications that remain unchanged for large segments of the application execution. An experimental compiler has been developed to implement the framework. This compiler is capable of detecting both static and persistent communications within an application. We show that for the NAS Parallel Benchmarks, 100% of the point-to-point communications can be classified as either static or persistent, and 100% of the collectives are either static or persistent with the exception of IS. Simulation-based performance analysis demonstrates the benefit of using our compiler techniques for achieving efficient communications in multiprocessor systems.

Index Terms—Compiled Communication, High Performance Computing, Multiprocessor Systems, MPI, Circuit-switching networks, Communication Patterns.

I. INTRODUCTION

MANY high performance computing systems use packet-switched networks to interconnect system processors. As systems get larger, a scalable interconnect can assume a disproportionately high portion of the system cost when striving to meet the demands of low latency and high bandwidth. Although the quest for cheap, low-latency, high-bandwidth packet-switched interconnections for large scale systems is worthwhile, circuit switching networks [1], [2] have the potential of achieving higher efficiency than packet and wormhole networks with relatively lower cost. However, the overhead of circuit establishment can be relatively large, and

Manuscript was initially received December, 2006, first revised November, 2007, and revised again March 2008. This work is partially supported by NSF award number 0702452, the PERCS project at IBM, and the Defense Advanced Research Projects Agency (DARPA) under Contract No. NBCH3039004.

the benefits of circuit switching can only outweigh its drawbacks when communication exhibits locality and when this locality is appropriately explored. Thus, using compiler techniques to explore the communication patterns of parallel applications–referred to as compiled communication in [3]–[5]–becomes a promising approach for achieving efficient communications in the high performance computing domain. There are other techniques to analyze communication locality, such as trace analysis, and to leverage communication locality, such as runtime scheduling. However, trace analysis can be misled by specific data sets, and runtime analysis is subject to cold start misses and potentially allows thrashing, which can lead to poor performance.

In contrast, our approach discovers the communication patterns between logical nodes based on the structure of the application code and, through system calls during execution, provides them to a run-time system that manages circuit switched interconnections. We propose a framework that integrates traditional and new compiler techniques to realize this approach.

The motivation of this work partially stems from the

proposal to include an optical circuit switching (OCS) network in the design of next generation high performance computing systems [6]. In that proposal, long-lived bulk data transfers are routed through all-optical switches, which are characterized by high data rates with high overhead for circuit establishment. An OCS is less expensive than its electronic counterpart as it uses fewer optical transceivers. This interconnection technique is more effective if connections are pre-established and the relatively long switch times are amortized over the lifetime of the connections. For example, Shalf et al. propose another interconnection network that uses both passive (circuit switching) and active (packet switching) switches to deliver performance equivalent to that of a fully-interconnected network [7]. In their approach, the switches must be reconfigured to emulate a suitable interconnection topology to achieve the best performance for an application. The effectiveness of this topology optimization process heavily depends on the ability to identify the communication pattern of the application. The research in this direction in the high performance computing domain mandates the exploration of new compiler techniques and the development of a systematic compiler-based approach to infer the communication topology at compile-time.

Digital Object Identifier 10.1109/TPDS.2008.82 1045-9219/$25.00 © 2008 IEEE

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Based on the temporal and spatial localities of communications and the capability of the compiler to identify the temporal properties and the topology of the communications, we classify communications during a phase of an application's execution into three categories: static, persistent and dynamic. In this context, the topology of communication is the specification of the source and destination of the messages exchanged.

Static – Communication is static if it can be completely determined through compile-time analysis. That is, the compiler can identify both the temporal locality and the exact topology of the communication.

Persistent – Communication is persistent if, though the compiler cannot determine the exact topology of communication, it can determine that the topology does not change during the phase. That is, its temporal locality can be identified by the compiler, but its spatial properties remain unknown until run-time.

Dynamic – Communication is dynamic if it is neither static nor persistent.

Given that communications of applications may

change during different phases of execution, an important part of the analysis of communication patterns is to identify and segregate different communication phases. We observe that a main source of communication temporal locality originates from loop structures of parallel programs. Hence, it is natural to consider loops that contain communications as the building blocks of phases.

In Figure 1, we illustrate the definition of static, persistent and dynamic communications when a phase is defined as a loop. Specifically, the communication is static if the topology can be completely resolved at compile-time, as in Figure 1(a). In Figure 1(b), the topology cannot be determined until run-time. However, once defined, the topology is repeatedly used within the loop. In this case we call the communications persistent. In Figure 1(c), the communication is dynamic because during each iteration of the loop, the topology is re-calculated.

The above classification implies different possibilities

for reducing communication overhead. For example, considering circuit switching networks with preloading capability, such as an OCS network, the earliest opportunity for determining the network configuration for a static communication is at compile-time. Thus, network configuration instructions may be statically inserted into the code by the compiler at phase boundaries to establish connections between logical nodes in the system. These logical nodes are translated into physical nodes by the runtime system. For persistent communication, the topology of the communication is not known at compile-time. However, it is possible to insert at compile-time network configuration instructions containing symbolic expressions specifying the topology that will be resolved at runtime. By placing these network configuration instructions at the earliest point where the expression can

be resolved, the network reconfiguration may still be able to take place prior to the actual communication within a phase, amortizing as much of the network reconfiguration time as possible. Additionally, if process migration between physical processors occurs (e.g., for fault-tolerance), even within a phase, our compilation approach is unaffected as it deals with logical nodes. For a circuit switching target, the runtime system can easily migrate the connections along with the process to the new physical processor.

In this paper, we present a compilation framework

for identifying and compiling static and persistent communication patterns. Many previous efforts to analyze parallel applications' communication characteristics are based solely on trace analysis [7]–[9]. However, the traces can provide the communication information for only a single execution instance of an application on a particular platform. Our approach is to reveal the underlying communication patterns and make them available at compile-time. We implement the compilation framework by developing an experimental compiler which identifies communication patterns from the source code. It compiles the communication pattern of an application and enhances it with network configuration directives. Another capability of the compiler is to augment the code with trace generation instructions which can produce information about the communication pattern correlated to the structure of the source (e.g., loops, conditionals, etc.).

We use a powerful scheme to represent communication patterns that includes both collective and point-to-point communications in terms of communication vectors and matrices. The vectors and matrices contain exact values if the communication pattern contains only static communications. Otherwise, they may contain symbolic expressions for later resolution. This scheme allows the manipulation of communication patterns through a set of convenient operations. It is also flexible and can be easily tailored to other types of communication analysis.

The remainder of this paper is organized as follows.

Section II reviews major related work. We describe our framework in Section III and we describe an experimental compiler which implements the framework in Section IV. A set of results from applying the experimental compiler are presented in Section V. Conclusions are given in Section VI.

II. BACKGROUND AND CONTEXT

High performance computing systems are being built out of ever-increasing numbers of processors [10], [11]. These large systems, however, typically use packet or wormhole routing for interprocessor communication, with only a few systems built around statically configured circuit switching networks. The most well known example of a circuit switching based network is the NEC Earth Simulator [12], which uses a huge electronic crossbar with 640x640 ports. The ICN (Interconnection Cache


[Figure 1 depicts three loop schematics, each containing "Communications using connections in topology T": (a) Static: T is resolvable at compile-time; (b) Persistent: T is calculated before a phase; (c) Dynamic: T changes inside a phase.]

Fig. 1. Static, persistent and dynamic communications when a phase is a loop.

Network) [13], [14] is similar to the Earth Simulator in its one-to-one relationship between each processing node and one channel or circuit that it handles. Circuit switching can also be achieved in parallel systems using passive optical components through time-division multiplexing (TDM) [15], [16], wavelength division multiplexing (WDM) [17], or a combination of the two [18]–[20], which does demonstrate an interest in optical circuit switching in the high performance computing domain.

Circuit switching hardware continues to improve due

to improvements in technology. Newer technologies such as optical networking continue to provide an alternative to electronic circuit switching with several advantages, such as the capability to handle long wire lengths and achieve high bandwidths. The biggest argument for the use of optical circuit switching (OCS) is a savings in cost over fast electronic networks, as OCS removes the need for expensive transceivers to convert signals between optics and electronics [6]. However, the reconfiguration time of optical switching may be relatively long (ms vs. μs) [6], [21], [22]. Hence, implementing circuit switching with relatively long switching latency mandates a technique to amortize connection establishment overhead and to explore the locality of communications in parallel applications. This inspires us to combine compiled communication and circuit switching techniques.

Our previous work [23] shows that much of the

circuit switching overhead can be amortized by pre-establishing connections and re-using connections as much as possible. In fact, the network configuration pre-loading scheme has been shown to perform better than traditional wormhole and non-predictive circuit switching techniques in many instances. However, this solution requires that the communication pattern be known early enough, ideally at compile-time. In cases where the communication operations require more bandwidth than the network can provide, it is necessary to detect the communication pattern and pre-schedule the communication operations in the different phases for circuit switching to be effective [24].

There have been several attempts [8], [9], [25], [26] to

understand the communication characteristics of parallel applications. Shires et al. presented an algorithm for building a program flow graph representation of an MPI program [25]. They provided an interesting basis for important program analyses useful in software testing, debugging and code optimization. In [9], Vetter and Mueller examined the explicit communication characteristics of several sophisticated scientific applications, focusing on the Message Passing Interface (MPI) [27] and using hardware counters on microprocessors. Faraj and Yuan [8] investigated the communication characteristics of MPI implementations of the NAS parallel benchmarks [28]. Ho and Lin studied the static analysis of communication structures in programs written in a channel-based message passing language called communication compiling component [26]. Although these attempts analyzed the communication characteristics of parallel applications, such as the ratio of different kinds of communications, and implicitly discussed the persistence of communications and phases, they do not provide any systematic way to identify accurate communication patterns with respect to connections or specify a quantitative measure of persistence.

Many interesting research projects in the high performance computing domain can take advantage of or rely on information about communication patterns. For example, Cappello and Germain proposed an approach to associate compiled communications and a circuit switched interconnection network [3]. Yuan et al. explored using compiled communication for HPF-like parallel applications as an alternative to dynamic network control [4]. The compiled communication technique requires that a large portion of static communications be identified at compile-time. Dietz and Mattox studied the Flat Neighborhood Network (FNN), which uses the communication patterns to determine the design of the network [29]. Liang et al. described a compiler which supports compile-time scheduled communication for their adaptive System-On-a-Chip (aSOC) communication architecture [30]. As previously described, our previous work introduces a switch design which can use our compilation technique to pre-program a TDM network switch [23]. Each of these techniques can take advantage of compile-time knowledge of the communication pattern to reduce the overhead in the interconnection network.

In contrast to these efforts, we provide a systematic technique to address compiled communication in MPI applications and to leverage this information for a circuit switching network.


III. COMPILATION FRAMEWORK

In this section, we present our compilation framework in the context of MPI programs. We also present it in the context of the abstract communication model described next.

A. Model of the Target Systems

We assume a high performance computing system in which computing nodes are interconnected through a circuit switched network. The circuit switching network can support k simultaneous connections to each processing node, the loading of pre-computed network configurations, and reconfiguration at the boundaries of communication phases. We also assume the system is able to efficiently perform dynamic and collective communications. Based on these assumptions, an abstract model of the target high performance computing system incorporates two separate communication networks: a circuit switching network used for static and persistent communications, and a packet switching network to accommodate collectives and short-lived dynamic communications. This interconnect model provides a significant cost advantage over a very fast, highly-buffered packet switch [6]. An overview of the model is shown in Figure 2. Separating the communication classes in this manner enables us to target each class with the most appropriate network technology and operating mechanism. Static and persistent communications will be compiled and dispatched to the circuit switching interconnect network.
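As a concrete illustration of this dispatch (our sketch, not code from the paper; the class names and the two queues are illustrative assumptions), each communication operation is routed to whichever of the two networks serves its class best:

```python
# Hypothetical sketch: routing each communication class in the dual-network
# model. Static/persistent traffic uses pre-established circuits; short-lived
# dynamic traffic falls back to packet switching.

STATIC, PERSISTENT, DYNAMIC = "static", "persistent", "dynamic"

def dispatch(operations):
    """Split a list of (class, src, dst) operations between the two networks."""
    circuit, packet = [], []
    for cls, src, dst in operations:
        if cls in (STATIC, PERSISTENT):
            # Compiled communications go over the circuit switching network.
            circuit.append((src, dst))
        else:
            # Dynamic communications use the packet switching network.
            packet.append((src, dst))
    return circuit, packet

circuit, packet = dispatch([(STATIC, 0, 1), (DYNAMIC, 2, 3), (PERSISTENT, 1, 2)])
```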

B. Compilation Paradigm

Current compiler techniques, such as control and data flow graph (CDFG) analysis, inter-procedural analysis, and array analysis, can be used to infer information from the code that is essential for analyzing the communication behavior of an entire program with explicit message passing. Our compiler identifies the MPI functions in the code, and uses these analysis tools to construct a communication pattern for the application. As MPI allows messages to be received from any source, our approach is to profile the message send operations to determine the communication pattern. The representation for a communication pattern is detailed in Section III-C.

The communication behavior of the application is partitioned into communication phases with the knowledge of the target network specification. By mapping the communication patterns into a sequence of phases it is possible to create more efficient communication working sets, or groups of communications that occur in relatively close proximity. The granularity of the communication phases depends on the capacities of the communication network. Thus, the communication pattern identified during analysis according to a particular interconnection network specification can be leveraged through network configuration instructions inserted into the application.

For static communication patterns, the compiler generates a network configuration file that can be used by the loader and inserts respective network configuration setup instructions in the program. For persistent communication patterns, the compiler inserts symbolic network configuration instructions that aim at pre-establishing the needed connections at run-time prior to the actual communications.

Fig. 2. High performance computing with circuit switching.
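To make the insertion concrete, the following minimal sketch (our illustration, not the paper's implementation; `net_configure` is a hypothetical runtime call) shows the shape of a transformed program for a persistent pattern: the symbolic topology is resolved once at run-time, the configuration instruction sits at the phase boundary, and the loop body then reuses the pre-established connections.

```python
# Illustrative only: `net_configure` stands in for a hypothetical runtime
# call that pre-establishes circuits; it is not an API from the paper.
established = []

def net_configure(topology):
    """Pretend to pre-establish the circuits listed in `topology`."""
    established.append(sorted(topology))

def run_phase(rank_count, x, iterations):
    # Symbolic expression resolved at run-time, once x is known ...
    topology = [(r, r + x) for r in range(rank_count) if r + x < rank_count]
    # ... so the configuration instruction is hoisted to the phase boundary.
    net_configure(topology)
    sends = []
    for _ in range(iterations):
        # Inside the phase the same connections are reused (persistent).
        sends.extend(topology)
    return sends

sends = run_phase(rank_count=4, x=1, iterations=2)
```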

C. Phases and Pattern Representation

Several research groups have observed that the communication operations in many applications exhibit regular patterns [6], [26], [30], [31]. Additionally, it has been shown that these regular communication patterns can often be discovered through analysis of the source code [3], [8], [32]. Our compilation framework is motivated by these two discoveries.

Often, a parallel program is written to solve a particular scientific problem. These applications are often organized in phases and, although different processors may take different paths, they tend to work in the same computational phase at approximately the same time, but primarily on their local data. Given that these parallel applications have computational phases, we can expect their communication operations to behave similarly. For example, the communication topologies of adaptive applications evolve during their execution time. Even for parallel applications that have static communication patterns, their active communication working set may change as the phases change. The result is one or more communication phases. Communication phases are not identical to computational phases, but are strongly associated with them. For example, some computational phases contain no communication and thus can be ignored when identifying communication patterns. Several computational phases may yield just a single communication phase. The number of phases is an artifact of the analysis used to partition the communications into phases.

To effectively perform compile-time communication

analysis it is necessary to represent the communicationpatterns identi ed from the code in a form that is bothexible and accurate. In the following sections we de-scribe a communication matrix/vector pair that are used

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 5: PARALLEL AND DISTRIBUTED SYSTEMS, IEEE TRANSACTIONS …akjones/Alex-K-Jones/Enabling... · that manages circuit switched interconnections. We pro-pose a framework that integrates

PARALLEL AND DISTRIBUTED SYSTEMS, IEEE TRANSACTIONS ON 5

to represent the communication pattern in an applica-tion. Additionally, we describe how the communicationpattern representations can be manipulated to closelycorrespond to the underlying communication network.The fact that the communication pattern of an applica-

tion contains phases is important and must be describedwhile representing the pattern. However, the traditionaltechnique to represent the communication pattern of anapplication is to describe its rough logical topology, (e.g.2-D mesh, hypercube, etc.), or to provide a communica-tion matrix. Such representations are too coarse and cannot describe the communication topologies accurately.They also fail to disclose temporal information. Our com-munication pattern representation scheme is designedspeci cally to avoid this limitation and to effectivelyrepresent the temporal and spatial properties of thepattern.We de ne all the communication operations of an

application as a communication pattern. When the execution is partitioned into phases, the communications within each phase need to be specified. There are two types of communications in MPI applications: collective communications and point-to-point communications. In order to represent the collective communications in a pattern, we define a c-enumeration to describe the set of collective communication functions invoked in a parallel application.

Definition: A c-enumeration is a list of all the collective communications that appear in a parallel application. Each collective communication is represented by a pair: the function name and, optionally, the corresponding communicator. For example, for the function names, we use AA, AV, AR, and RD to represent MPI_Alltoall, MPI_Alltoallv, MPI_Allreduce, and MPI_Reduce, respectively. The communicator is omitted if it is the default MPI communicator. For each related MPI communicator, the same collective MPI functions have exactly one instance in the c-enumeration. Each communication pattern retains a unique c-enumeration.
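As an illustration (our sketch, not the paper's code), a c-enumeration can be held as an ordered list of (function, communicator) pairs, with the default communicator left implicit and each pair recorded at most once:

```python
# Sketch: a c-enumeration as an ordered list of pairs. `None` marks the
# default MPI communicator; names follow the paper's abbreviations
# (AA = MPI_Alltoall, AR = MPI_Allreduce, RD = MPI_Reduce).

def add_collective(c_enum, name, comm=None):
    """Record a collective once per (function, communicator) pair."""
    if (name, comm) not in c_enum:
        c_enum.append((name, comm))
    return c_enum

c_enum = []
for call in [("AA", None), ("AR", None), ("AR", "commu1"), ("RD", None), ("AA", None)]:
    add_collective(c_enum, *call)
# Duplicate invocations collapse into a single instance per pair.
```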

Example 1: CEE = {AA, AR, (AR, commu1), RD} indicates that there are three different types of collective communications in the application. The first two and the last operations are performed in the default MPI communicator, while the third operation, MPI_Allreduce, is performed in a user-defined communicator commu1. The communication detection component of the framework is responsible for building c-enumerations.

Formally, we use the grammar described in Figure 3

to represent the communication pattern that describes the communications in different phases of a program. In the expression for c-enumeration, col_i represents any collective MPI function, and comm_i represents the corresponding MPI communicator. In this figure, we assume that m is the number of elements in the c-enumeration.

A communication pattern consists of a sequence of phases. A basic communication phase is described by a c−vector and a p−matrix that represent all the collective

communication pattern → c−enumeration, phases

c−enumeration → {(col_0, comm_0), ..., (col_{m−1}, comm_{m−1})}

phases → ε | phase phases | [phases] phases

phase → <c−vector, p−matrix>

c−vector → ε | <w_0, w_1, ..., w_{m−1}>

p−matrix → ε | deterministic p−matrix | symbolic p−matrix

Fig. 3. The grammar for communication pattern representations.

and point-to-point communications, respectively, in that phase. A phase may also be a loop of a sequence of phases that repeat in any execution instance of an application, represented by square brackets in the grammar.

Definition: A c−vector corresponds to a c-enumeration. Each element of the vector represents the weight of the corresponding collective communication in the c-enumeration.

Definition: A p−matrix is an N×N matrix that describes a set of point-to-point communications. The entry in position (i, j) describes the weight of the communication from processor i to processor j.

Definition: δ represents any unknown value, variable, vector, or matrix.

In the above definitions of p−matrices and c−vectors, we do not enforce a specific meaning for the communication weight. However, some options include (1) a single-bit value indicating whether point-to-point communications from the source processor to the destination processor exist, (2) the message volume, or (3) the message count. In the case that the compiler cannot construct even a symbolic expression for a point-to-point communication, the δ symbol is used in the symbolic expression for that matrix entry.

A p−matrix is deterministic if the total number of

processors N is known and each entry of the p−matrix is a constant. A deterministic p−matrix is used to represent static communications. When the size N of a p−matrix is a symbolic constant and/or any entry can only be described by a symbolic expression instead of a constant, it is a symbolic p−matrix. A symbolic p−matrix can always be described by a formula list. Persistent communications can usually be described by symbolic p−matrices.

Example 2: As shown in Figure 4, PM_A and PM_B are a formula list and a deterministic p−matrix, respectively. PM_A describes a communication pattern in which each processor rank sends to rank + x, and to rank − x if rank − x > 0, where x is determined at run-time and N is the total number of processors. In the case x = 1 and N = 4, the deterministic p−matrix PM_B is inferred from PM_A.

Example 3: The communication pattern of the IS (integer sorting) program from the NAS parallel benchmark suite [28] is shown in Table I, and the deterministic p−matrix of phase 2 is shown in Figure 5.

The communication pattern representation scheme can

be used to represent both compile-time identified com-

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.


(a) PM_A (formula list); each processor rank sends to:
    (rank + x) mod N
    rank > x : (rank − x) mod N

(b) PM_B (x = 1, N = 4):
    0 1 0 0
    0 0 1 0
    0 1 0 1
    1 0 1 0

Fig. 4. A p−matrix PM_A described by a formula list and the corresponding deterministic p−matrix PM_B.

TABLE I
THE COMMUNICATION PATTERN OF IS.

c−enumeration: {AR, AA, AV, RD}

Phases     c−vector        p−matrix
phase 0    <1, 1, 1, 0>    NULL
phase 1    <1, 1, 1, 0>    NULL
phase 2    <0, 0, 0, 1>    PM_IS

munication patterns and the communication patterns identified from execution traces. The main advantage of our pattern representation scheme is that it captures the time-evolving, or phased, property of communication patterns. The concept of phases suggests the need to manipulate the communication patterns from different phases and to schedule the communications in the pattern at different granularities according to the parameters of the target system.

D. Manipulating Communication Patterns

If the interconnection network capacity is insufficient to establish all the circuits required by an application, then it is necessary to divide the program execution into phases. A communication phase can more precisely represent an active connection working set. The interconnection network will be reconfigured at the beginning of each phase to establish the circuits most frequently used in that phase. Clearly, two conflicting criteria guide phase identification. Namely, phases should be small enough to allow the interconnection network to accommodate the communication requirements within the phase, but should be large enough to avoid the overhead of reconfiguring the network. In general, the goal of phase formation should be to determine the largest communication working set that can fit within the capacity of the circuit switching network and that requires the least reconfiguration during execution.

One strategy to determine the communication phases

in the program is to assume that each loop in the application is a phase and to manipulate these initial phase decisions to group the communications into new phases that are best suited to the capacity of the network. For instance, we may want to combine two adjacent phases if the network capacity is large enough to realize both of them; we may want to remove some infrequently used connections if the newly combined phase is slightly beyond the capacity of the network. Communications over these removed connections will be delivered through the packet switching network.

Given a specific communication pattern, the determination of whether or not that pattern fits in a given interconnection network depends on the network itself.

[8×8 single-bit p−matrix with seven unit entries, one per communicating source processor]

Fig. 5. p−matrix PM_IS (with 8 processors).

For example, a non-blocking network, such as a crossbar, can realize any pattern that is a permutation, while network-specific algorithms have to be applied to determine whether a pattern fits into a blocking network [33]. In our system, we assume that the interconnection network is composed of a number, k, of parallel crossbars, and thus can realize k permutations simultaneously. In general, if the communication pattern is represented by a deterministic p−matrix, then determining whether the pattern fits in the network is straightforward. However, for patterns determined by symbolic expressions, that determination depends on the complexity of the expressions, and in some cases may not be possible unless the expressions are evaluated.

We define three core operations required to deal with

different communication phases. Merge combines the p−matrices and c−vectors of two adjacent phases of a communication pattern into a new phase. If binary communication weights are used, this is equivalent to an OR operation. Filter removes connections below a threshold from a phase of a communication pattern. Unwrap is equivalent to completely unrolling a loop and merging all of the resulting phases into a single phase. This is particularly useful for dealing with nested loops or adjacent loops that logically should be in the same phase.

Definition: Merge(phasei, phasej):
Parameters: phasei = <c−vectori, p−matrixi>, phasej = <c−vectorj, p−matrixj>; phasei and phasej are adjacent phases.
Semantics: phasei and phasej are merged into a new phase phasenew = <c−vectornew, p−matrixnew>, where c−vectornew = c−vectori + c−vectorj and p−matrixnew = p−matrixi + p−matrixj. The + operation depends on the definition of the weights in the c−vector and p−matrix. For example, if the weights are binary bits in a deterministic p−matrix, the + operation is equivalent to an OR operation.

Definition: Filter(phase, T):
Parameters: phase = <c−vector, p−matrix>; T is a threshold.
Semantics: This operation replaces phase in the communication pattern by a new phase in which any entry with value less than T in the c−vector or p−matrix is set to 0.

Definition: Unwrap(phase):
Parameters: phase = [<c−vector, p−matrix>]. Assume the loop body of phase has been merged into a single phase <c−vector, p−matrix> as above.


Semantics: This operation unwraps the loop indicators of phase. If the communication weights are defined as single-bit values, this operation uses one <c−vector, p−matrix> to replace a loop of [<c−vector, p−matrix>]. If the weights are defined as message volumes or message counts, the weights in the resulting phase are multiplied by the loop iteration count relative to the weights in <c−vector, p−matrix>.

E. Phase Partitioning Algorithm

The communication phases within an application typically become apparent at run-time. However, the structure of the source code can often provide enough direction to determine reasonable phase lengths and boundaries. Loops play a key role in the determination of phases in the application [34] and work similarly for communication phases. Thus, we use loops and the code blocks between loops as the starting point to build phases. Adjacent phases whose joined communication pattern does not exceed the capacity of the network can be merged together to form larger phases.

The control structures arising from branch structures

create difficulties in merging adjacent phases. We break down branches into two categories: rank-dependent and data-dependent. Rank-dependent branches are the most difficult structures to handle because some processors execute each branch, concurrently. Thus, our solution in this case is to merge all contained phases into a single phase and to filter out the lowest-bandwidth connections until the resulting pattern can fit into the network. However, in some cases it may be possible to determine optimized patterns for each branch depending on static knowledge of the condition. This is discussed in Section IV.

In data-dependent branches, all processors will take

either one direction or the other. This condition holds because only branches containing communication operations are considered. If data-dependent conditions allowed different processors to follow different branches containing communication operations, it would be problematic to have matching sends and receives. Thus, for these branches, communication patterns can be merged within individual branches, but need not necessarily be merged across branches as is necessary with rank-dependent conditionals.

For the algorithm, we describe several basic

operations used in addition to the core operations from Section III-D:

Adjacent(Pj, Pk) returns true if Pj and Pk are two directly adjacent phases not separated by loop or conditional boundaries, and false otherwise.

FilterUntilFit(P, NET) removes the lowest-weight connections from the communication pattern of phase P until P fits in network NET.

CanBeMerged(Pj, Pk, T, NET) returns true if the two phases Pj and Pk can be merged and fit into the network NET without violating the user-specified parameter T. T

is a threshold on the maximum weight of a connection that can be filtered.

The algorithm is presented in Figure 6. The code

starting with line 1 constructs the initial phases of the application by merging the adjacent basic phases that fit into the network. Starting in line 7, all phases in the rank-dependent branch structures are merged and the communication pattern is filtered until it fits into the network, regardless of the values of T or TM. Starting in line 13 and continuing through the end of the pseudocode, the algorithm revisits merging phases within loops. First, loops that contain a single phase are unwrapped at line 14. Starting in line 15, when possible, data-dependent branch structures are flattened into single phases. Finally, at line 18, adjacent phases are re-examined in case newly unrolled loops or merged branch structures have created phases that can be merged. This continues until the phases reach a steady state, creating the final phase partition of the application. Communication instructions can now be placed into the code prior to the execution of each phase.

1  Create a basic phase for each MPI communication function call
2  do
3    foreach phase P do
4      foreach phase Q such that Adjacent(P, Q)==1 do
5        if CanBeMerged(P, Q, T, NET) then Merge(P, Q)
6  while (phases can be merged)
7  foreach rank-dependent branch structure BS do
8    foreach branch Bi of branch structure BS do
9      foreach phase P in Bi do
10       foreach phase Q in Bi such that Adjacent(P, Q)==1 then Merge(P, Q) into Pi
11   for any two branches Bi, Bj in BS containing phases Pi, Pj, respectively, Merge(Pi, Pj) into PBS
12   FilterUntilFit(PBS, NET)
13 do
14   foreach loop L if L contains a single phase PL then Unwrap(PL)
15   foreach data-dependent branch structure BS do
16     for any two branches Bi, Bj in BS such that each contains a single phase, Pi, Pj, respectively do
17       if CanBeMerged(Pi, Pj, T, NET) then Merge(Pi, Pj)
18   foreach phase P do
19     foreach phase Q such that Adjacent(P, Q)==1 do
20       if CanBeMerged(P, Q, T, NET) then Merge(P, Q)
21 while (phases can be merged)

Fig. 6. Pseudocode for the phase partitioning algorithm.

IV. IMPLEMENTATION OF AN EXPERIMENTAL COMPILER

An experimental compiler has been developed to implement the compilation framework described in Section III for MPI applications written in C or Fortran 77.

A. Compiler Techniques

Certain traditional and new compiler techniques are needed for building a compiler that supports compiled communication. These techniques must accomplish the following main tasks.


Determination of static communication patterns: The compiler techniques needed to determine the communication pattern differ from traditional compiler analysis and optimization techniques in many respects. For instance, optimizing compilers employ constant propagation and folding in an effort to reduce and remove unnecessary instructions. In the context of compiled communication, constant propagation and folding are important for detecting the actual values of the source, destination, and message volume variables in communication operations. Thus, expressions that contain variables related to the processor id and system size (e.g., the number of processors), while not constant at compile-time, are considered constant because they are resolved prior to program execution. For example, the send operations in the code from Figure 7(a) contain the expressions (myrank+k)%nprocs, (myrank+2*k)%nprocs, (myrank-k)%nprocs, and (myrank-2*k)%nprocs. Because myrank and nprocs are the processor id and the number of processors in the system, respectively, we consider these operations to be static if k is a constant. We call this symbolic expression propagation. These communication patterns can be resolved into actual numeric values at load-time, when the number of processors is known.

Addressing SPMD style: MPI programs are written

in SPMD style: each processor independently executes the same program on its private data. Nevertheless, different processors often take different execution paths, as occurs with rank-dependent branch structures. Hence, the compiler may use control and data flow analysis to segregate communications in the same communication matrix with decision points (DPs) that correlate to these branch structures. For static communication, DPs can be used to reduce the size of communication matrices. For persistent communication, DPs are used as predicate conditions for communications represented by symbolic expressions. For example, a naive approach to building a communication matrix based on the code from Figure 7(a), where k=1, is shown in Figure 7(b). This is built by including the send operations of both the then and else parts of the code for each possible processor id. However, the variable guarding the branching statement may be static. For example, in Figure 7(a) the condition myrank < nprocs/2 is based entirely on constants or variables related to the processor id and system size. Thus, this expression is static. As a result, only two send operations need to be added to the matrix for each processor id, resulting in the matrix shown in Figure 7(c). Hence, careful analysis of the branching statement is required in order to accurately identify the communication patterns. As all phases in rank-dependent branches are merged in line 7 of Figure 6, this can help keep useful connections from being filtered out of the network.

Inter-procedural Analysis: An additional considera-

tion of our compiler is the inclusion of inter-procedural analysis. Such cross-boundary analysis requires sophisticated control flow techniques and again is different

void p(int k) {
  int i;
  reconfigNetwork(config1);
  for (i = 0; i < 1000; i++) {
    if (myrank < nprocs/2) {
      MPI_Send(buf, 1, MPI_INT_TYPE, (myrank+k)%nprocs, 1000, COMM);
      MPI_Send(buf2, 1, MPI_INT_TYPE, (myrank+2*k)%nprocs, 1000, COMM);
    } else {
      MPI_Send(buf, 1, MPI_INT_TYPE, (myrank-k)%nprocs, 1000, COMM);
      MPI_Send(buf2, 1, MPI_INT_TYPE, (myrank-2*k)%nprocs, 1000, COMM);
    }
  }
}

(a) Decision point (DP) code.

0 1 1 0 0 0 1 1
1 0 1 1 0 0 0 1
1 1 0 1 1 0 0 0
0 1 1 0 1 1 0 0
0 0 1 1 0 1 1 0
0 0 0 1 1 0 1 1
1 0 0 0 1 1 0 1
1 1 0 0 0 1 1 0

(b) Conservative matrix.

0 1 1 0 0 0 0 0
0 0 1 1 0 0 0 0
0 0 0 1 1 0 0 0
0 0 0 0 1 1 0 0
0 0 1 1 0 0 0 0
0 0 0 1 1 0 0 0
0 0 0 0 1 1 0 0
0 0 0 0 0 1 1 0

(c) Matrix with DP analysis.

Fig. 7. Example of a communication matrix.

from the traditional inter-procedural passes, as its goal is to examine variables at the point of communication. For example, consider the procedure p(k) from Figure 7(a), which contains instructions to send messages to myrank+k and myrank-k, and suppose p is called from two different locations, L1 and L2, with parameters A and B, respectively. If A is known at L1 while B is not known at L2, then the communication within p is static when p is invoked from L1, but not when p is invoked from L2.

Network configuration instructions: The example

shown in Figure 8 demonstrates a sample of network configurations for the code in Figure 7(a). Each configuration file is a list of connections, each of which is scheduled to a particular circuit switching network prior to execution. Each connection description in the file contains 3 fields, net_id source destination, referring to the switch number, source processor, and destination processor. Figure 8(a) shows the network configuration for a single circuit switch and Figure 8(b) extends the configuration for a second circuit switch. The switch configuration is passed to the network using the reconfigNetwork(config1) function highlighted in Figure 7(a).

In this example, when only a single switch is available

the connections specified in Figure 8(b) would actually be filtered out and satisfied in the packet switch as described in Figure 6. The configurations assume the matrix from Figure 7(c). If the conservative matrix were constructed by the compiler (Figure 7(b)), the system would require four switch planes, or the compiler would filter out additional connections, possibly retaining unused connections.

B. Compiler Implementation

We have developed an experimental compiler whichimplements the framework shown in Figure 9. It is


config1:
0 0 1
0 1 2
0 2 3
0 3 4
0 4 3
0 5 4
0 6 5
0 7 6

(a) One switch.

Same as (a), appending:
1 0 2
1 1 3
1 2 4
1 3 5
1 4 2
1 5 3
1 6 4
1 7 5

(b) Two switches.

Fig. 8. Example of network configurations for Figure 7(a).

based on the SUIF compiler infrastructure [35] for detecting communication patterns from the source code of applications and enhancing them with compiled communications. The SUIF compiler infrastructure is an open-source, source-to-source compiler research toolkit supporting both the C and Fortran 77 languages. The front end of the SUIF compiler compiles parallel applications to the SUIF intermediate format. A tool, porky, from the SUIF compiler system is used to perform basic transformations, such as copy propagation and constant propagation, in order to discover more details about the characteristics of different communications (e.g., static, dynamic, or persistent). The compiler contains four key compilation passes, described in the next several sections.

1) Communication Detection Pass: In this pass, all the

MPI communication calls and parameters are located in the SUIF intermediate code. Based on the MPI standard, we can explicitly enumerate all possible MPI operations. With this list, we can easily identify, within the compiler, all the MPI operations that occur in the code. This pass collects information about these communication operations' parameters. For instance, this pass can identify that me is the parameter variable holding the rank value and nprocs is the parameter variable holding the total number of nodes in the following source code:

call mpi_comm_rank(mpi_comm_world, me, ierr)
call mpi_comm_size(mpi_comm_world, nprocs, ierr)

[Figure: block diagram of the experimental compiler. C+MPI and F77+MPI application code enters the SUIF front end (scc); the experimental compiler performs control and data flow analysis followed by the communication detection, communication analysis, communication compiling, and trace generation passes; outputs are the communication pattern, network specifications, communication-instruction enhanced code, and trace-generation-instruction enhanced code, emitted through the SUIF code generators (s2c, s2f).]

Fig. 9. The experimental compiler.

2) Communication Analysis Pass: During communication analysis, the MPI operations and the parameters identified during communication detection are used to identify and represent communication patterns. Control and data flow analysis and inter-procedural analysis are used to determine the location at which communication operations are resolved in the code. If the source, destination, and message volume of a communication operation can be resolved as constants, the communication is static. If communication operations are discovered to be persistent, as described in Figure 1(b), they are further examined as to whether they are static. We extend inter- and intra-procedural constant propagation into symbolic expression propagation to determine whether communications are static. For example, it is necessary to recognize and consider the processor ID, number of processors, problem size, and various other parameters as static to ensure that the topology can be determined statically. As previously described, we use loops as basic phase delimiters, and then merge contiguous phases to form larger phases where possible. To obtain proper communication phases within the code, we use the information discovered in the CDFG analysis and communication detection passes. The proper determination of phases is based on factors such as phase length, the amount of static and/or persistent communication present, and the communication-to-computation ratio. We use the algorithm from Figure 6 to determine the communication phases.

3) Communication Compiling Pass: The communication

compiling pass further optimizes the communication pattern identified during analysis, compiles the pattern, and inserts network configuration directives into the application to assist in configuring the interconnections. The communication compiling pass exposes static and persistent communications to circuit switching interconnection networks with the goal of reducing the communication overhead. For example, this compiler can insert instructions to preload network configurations for the switches proposed in [23]. Currently, two types of network configuration instructions are considered in our experimental compiler.

Network configuration setup instructions are used

to pre-establish network connections. These instructions can reduce the setup overhead of connections and overlap network control with computation.

Network configuration flush instructions are used to flush the current network configuration or a subset of circuits from the current configuration. Such instructions can be used to remove expired circuits. They also provide the potential to speed up parallel applications, as they reduce contention for the circuits.

For static communication patterns, the content of the

communication pattern derived during analysis is inserted into the code using the reconfigNetwork() system call, as shown in Figure 7(a), designed to pre-configure the network at the beginning of each communication phase. For persistent communication pat-


terns, the communication analysis component provides communication pattern information as a function of variables that will not be known until run-time. Values for calculating these symbolic expressions are passed to the reconfigNetwork() function as parameters. The scope and value availability of variables in the symbolic expressions, in addition to the location of the communication operations, constrain the locations where these network configuration instructions may be inserted.

4) Trace Generation Pass: The experimental compiler

also has the capability to automatically add trace generation instructions (e.g., print statements) to MPI applications. This pass relies on the information discovered in the communication detection pass and the communication analysis pass because it generates traces for both MPI functions and related code structures such as ifs, loops, and procedure boundaries. The resulting annotated code can be compiled and executed on the parallel architecture normally, just like the original code. The only difference is that the newly included print instructions record traces. These traces are primarily used to study the communication patterns of an execution instance and/or to verify that the communication patterns detected by the compiler are accurate. This pass provides a useful tool for automating trace generation for MPI applications and relieves researchers of the trouble of generating traces manually.

V. RESULTS

Our compiled communication techniques can provide

certain information about MPI parallel applications beyond the current approaches in the literature. In Section V-A we examine the impact of considering persistent communications in addition to static and dynamic communications. This knowledge can be leveraged in the compiler to determine the communication patterns of these benchmark applications, as shown in Section V-B. Simulation-based performance analysis is presented in Section V-C to demonstrate the benefits of using our compiler techniques for achieving efficient communications in multiprocessor systems.

A. Classification of Communications of the NAS Parallel Benchmarks

We used our experimental compiler to profile the communication statistics of the NAS Parallel Benchmarks v2.4.1. The percentages of static, persistent, and dynamic communications are shown for point-to-point operations in Table II and for collective operations in Table III. In all cases the compiler accurately detected the classification compared to an ideal static analysis.

For the point-to-point operations, Integer Sort (IS),

Conjugate Gradient (CG), and Lower-Upper Symmetric Gauss-Seidel (LU) contain only static communications. Multigrid (MG), Block-Tridiagonal (BT), and Scalar-Pentadiagonal (SP) contain only persistent communications. For BT and SP, the destination set for each node

TABLE II
THE PERCENTAGES OF DIFFERENT POINT-TO-POINT COMMUNICATIONS IN NAS BENCHMARKS.

Benchmark   Static   Persistent   Dynamic
IS          100%     0%           0%
CG          100%     0%           0%
MG          0%       100%         0%
BT          0%       100%         0%
SP          0%       100%         0%
LU          100%     0%           0%

TABLE III
THE PERCENTAGES OF DIFFERENT COLLECTIVE COMMUNICATIONS IN NAS BENCHMARKS.

Benchmark   Static   Persistent   Dynamic
IS          0.4%     0%           99.6%
CG          100%     0%           0%
MG          100%     0%           0%
EP          100%     0%           0%
FT          0.1%     99.9%        0%
BT          100%     0%           0%
SP          100%     0%           0%
LU          100%     0%           0%

is calculated prior to all point-to-point communications and is used through application completion. For MG, there are two communication stages. In each stage, the destination set for each node is calculated prior to the communications and is retained until the stage completes. These two stages contain multiple outermost loops.

The data in Table III were acquired through compile-

time analysis, with the exception of IS and FT, for which the percentages of dynamic communications were obtained from a class B execution trace on 128 processors. The reason is that IS and FT consist of more than one class of collective operations, and this leads to different results with different execution configurations, that is, different problem sizes and numbers of processors. For FT, as the total number of nodes increases, persistent all-to-all communications dominate the message volume and the percentage of static communications is extremely low.

B. Identifying Communication Patterns

To show the capability of our compiler to identify communication patterns, we consider the IS, LU, MG, and CG benchmark programs from the NAS parallel benchmark suite and one application, LBMHD (a Lattice Boltzmann model of magneto-hydrodynamics [36], [37]). For all of these benchmarks, as well as the COMOPS and synthetic applications described in Section V-C.2, the compiler was able to discover all of the static and persistent communication operations. This was verified through a comparison with manual analysis of the applications. While compiling the NAS parallel benchmarks, the total number of processors, referred to as N, has to be set as a build parameter so that it is known at compile-time.


IS: The compiler-identified communication pattern for IS is shown in Table I and Figure 5. The weights in the c−vectors and p−matrices are binary values. In phases 0 and 1, the collective operations AR, AA, and AV are executed with no point-to-point communication. Phase 2 combines the RD collective operation with the point-to-point matrix shown in Figure 5. By using the Merge and Unwrap operations, phases 0 and 1 can be combined.

LU: The LU benchmark is demonstrated as another ex-

ample. When identifying the communication patterns for LU, we set the threshold such that the Filter operation removed all the collective communications from all the phases. Initially, the compiler identified 11 phases, ten of which contain only a small number of connections. Using the Merge operation, the p−matrix shown in Figure 10 is obtained. From this matrix we can see that the number of destinations varies from 2 to 4, yielding a reasonably sized working set.

CG: The compiler initially identified two communica-

tion phases from the source code. It turns out that these phases are identical and can be merged. The p−matrix for the compiler-predicted communication pattern is shown in Figure 11.

MG: When analyzing MG, the compiler discovered

that the communication destinations depend on run-time input data. However, after the topologies are determined, they are used for an extended period of time. Hence the communication pattern is persistent. Therefore, it is only possible to construct symbolic p−matrices or formula lists from the source code. These p−matrices are resolved at runtime to populate the network.

We show the 12 resolved p−matrices for the de-

fault input file supplied with the benchmark, shown in Figure 12. p−matrices 0, 1, and 6-10 correspond to Figure 12(a), p−matrices 2 and 11 correspond to Figure 12(b), p−matrix 3 corresponds to Figure 12(c), and p−matrix 4 corresponds to Figure 12(d). p−matrix 5 is empty because, for the parameters specified in this case, the branches containing zero point-to-point communications were taken at the decision points.

Lattice Boltzmann method to model magneto-

hydrodynamics (LBMHD): The communication patternof application LBMHD is shown in Figure 13 and Fig-ure 14. Our compiler identi ed a single communicationphase. Because the number of processors N and the

Fig. 10. Single bit p−matrix PM_LU (N = 16).

Fig. 11. The p−matrix of CG (N = 128).

(a) p−matrices 0,1,6-10. (b) p−matrices 2,11.

(c) p−matrix 3. (d) p−matrix 4.

Fig. 12. The p−matrices of MG (N = 128).

processor rank, rank, are determined at run-time and not known at compile-time, we generate the formula list in Figure 13 to describe the p−matrix. Each processor has a set of four other processors with which it communicates. Since N and rank are the only symbols in the expressions, this matrix is considered statically known, as it may be entirely resolved at load-time. Thus it is possible to entirely configure the network for LBMHD based on compiler analysis prior to execution. Figure 14 shows the pattern for a run on 64 processors.

\begin{pmatrix}
\left( \left( \lfloor rank/N_y \rfloor + 1 \right) \bmod N_x \right) \cdot N_y + \left( rank - \lfloor rank/N_y \rfloor \cdot N_y \right) \\
\left( \left( \lfloor rank/N_y \rfloor - 1 \right) \bmod N_x \right) \cdot N_y + \left( rank - \lfloor rank/N_y \rfloor \cdot N_y \right) \\
\lfloor rank/N_y \rfloor \cdot N_y + \left( \left( rank - \lfloor rank/N_y \rfloor \cdot N_y \right) + 1 \right) \bmod N_y \\
\lfloor rank/N_y \rfloor \cdot N_y + \left( \left( rank - \lfloor rank/N_y \rfloor \cdot N_y \right) - 1 \right) \bmod N_y
\end{pmatrix}

Fig. 13. LBMHD p−matrix described by a formula list, where N = Nx · Ny.

Fig. 14. LBMHD p−matrix (N = 64).
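Under the assumption that rank r maps to grid position (⌊r/Ny⌋, r mod Ny) on an Nx × Ny processor grid, the formula list of Figure 13 resolves to the four wrap-around torus neighbors. The sketch below is our illustration (the function name is ours, not the paper's):

```python
# The four destinations of the Fig. 13 formula list, written out in
# Python. Rank r sits at row x = floor(r / Ny), column y = r mod Ny of
# an Nx x Ny processor grid (N = Nx * Ny); its four partners are the
# wrap-around (torus) neighbors.

def lbmhd_destinations(rank, Nx, Ny):
    x = rank // Ny
    y = rank - x * Ny              # equivalently rank % Ny
    return [
        ((x + 1) % Nx) * Ny + y,   # next row, same column (wraps)
        ((x - 1) % Nx) * Ny + y,   # previous row, same column (wraps)
        x * Ny + (y + 1) % Ny,     # same row, next column (wraps)
        x * Ny + (y - 1) % Ny,     # same row, previous column (wraps)
    ]
```

Since only N and rank appear, every processor can evaluate these expressions at load-time, which is why the LBMHD pattern can be configured before execution.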

The above compiler-predicted communication patterns have been verified by comparison to their counterparts extracted from traces. We ran the applications, into which trace generation instructions had been inserted by our compiler, to collect traces. The communication patterns identified from the corresponding traces are identical to those the compiler identified from the source code.
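The pattern manipulations used throughout this section (Merge to combine phases, Filter to drop low-volume entries, and the working-set degree of a pattern) can be sketched as element-wise operations on weighted p−matrices. The representation and helper names below are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the Merge and Filter operations on weighted p-matrices.
# A p-matrix is an N x N matrix whose entry [src][dst] gives the
# communication weight from processor src to processor dst.
# Representation and names are illustrative, not the paper's code.

def merge(pm_a, pm_b):
    """Combine two phases; the merged phase carries the summed weights."""
    n = len(pm_a)
    return [[pm_a[i][j] + pm_b[i][j] for j in range(n)] for i in range(n)]

def filter_pm(pm, threshold):
    """Drop low-volume connections whose weight falls below threshold."""
    return [[w if w >= threshold else 0 for w in row] for row in pm]

def degree(pm):
    """Maximum number of distinct destinations of any source (working set)."""
    return max(sum(1 for w in row if w > 0) for row in pm)
```

For example, merging two 4-node phases and then filtering with a threshold of 2 removes the light connections while preserving the dominant ring.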

C. Performance Analysis

To analyze the efficiency of applying compiled communication techniques to static and persistent communications in MPI applications, it is necessary to examine performance data. Thus, we describe the simulation testbed used to run our performance experiments.
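The simulator described below models a simple dispatch rule: a message is sent over a circuit switch only when a connection to its destination is already established in some circuit plane; otherwise it falls back to the packet switching network. A minimal sketch of that rule, where the data layout and names are our assumptions:

```python
# Sketch of the simulator's dispatch rule: a message goes to the circuit
# switching NIC only when one of the circuit planes already holds a
# connection from src to dst; otherwise it falls back to the packet
# network, which must always exist. Layout and names are assumptions.

def dispatch(src, dst, circuit_planes):
    """circuit_planes: list of dicts mapping src -> currently connected dst."""
    for plane, conns in enumerate(circuit_planes):
        if conns.get(src) == dst:
            return ("circuit", plane)  # connection is up: no setup delay paid
    return ("packet", None)            # fall back; avoids the circuit setup cost
```

The point of the rule is that the multi-millisecond circuit establishment delay is never paid on the critical path of a message; it is only paid when the runtime system preloads connections.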

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.


1) Multiprocessor System Simulator: We developed an event-driven multiprocessor system simulator in C++ with CSIM. The simulated system contains N processors and implements the communication model from Figure 2; a diagram of the simulated system is shown in Figure 15. Each processor in the system reads communication instructions from an input trace file, which is either a real trace from an application execution or a synthetic stream of simulation instructions, and emulates the corresponding communication operations. In the trace files, communication instructions can be separated by computation instructions "COMP t", each of which emulates a series of computations over time t.

Each processor has a circuit switching NIC connected to a number of circuit switching networks and a packet switching NIC connected to a number of packet switching networks. Hence, the simulated system may have multiple circuit switches and multiple packet switches. Each packet switch is a 2-stage fat tree network built using fast buffered crossbar switches [38], [39]. Each circuit switch is an N×N crossbar and can be configured to realize any permutation at any time. A centralized scheduler is responsible for setting up and tearing down connections in all the circuit switches. To avoid the long connection establishment delay, a processor sends a message to the circuit switching NIC only when there is a connection available in any of the circuit switches; otherwise, the message is dispatched to the packet switches. Hence there must be at least one packet switching network in the system.

At phase boundaries of patterns consisting of point-to-point communication operations, several processors may have proceeded to the next communication phase while others remain in the original phase. In our simulator, when a processor leaves a communication phase, it releases the circuits from that phase and requests the circuits for the new phase. In some cases these circuits might not be immediately available, due to resources held by processors still in the previous phase; in that case, messages proceed through the packet switch until the circuits are released. A similar process would be possible for collective communications

Fig. 15. Simulation system overview.

that use a relaxed blocking concept, as described in [40].

When simulating many packet switches, we assume that a NIC has independent buffers for each packet switching network. The messages in each buffer are handled sequentially; however, the NIC can handle different buffers simultaneously. Each incoming message from the processor to the NIC is assigned a buffer randomly, and we assume that all packets of that message may only be delivered through the corresponding packet switching network.

The simulator allows the specification of many system

parameters. In our experiments we used 256 processors at 10 GHz with 5000 cycles of overhead for MPI send and receive. We also used a link bandwidth of 4 Gb/s for the FP network and the circuit switch, while 400 Mb/s is used for the supplemental, slower packet switching network. We use a circuit establishment delay of 3 ms for the OCS and 20 μs for the fast circuit switch. We assume that the packet switch is a two-level fat tree with link propagation delay varying between 130 and 400 ns.

The simulator was extensively tested to verify its

correctness. For example, we explicitly tested single-source, single-destination non-blocking message sending, with and without saturation of the network, to ensure that back-pressure occurred when the throughput was exceeded. We ran similar tests with multiple-source and multiple-destination artificial traffic for both an under-loaded and a saturated network, achieving the results expected from theory, and we ran similar experiments for blocking sends. We also examined contention at the destination from a single source and from multiple sources with the same amount of traffic, to see the improvement in latency while the completion time remains constant.

2) Simulations: Packet switching networks with buffering at cross-points [38], [39] can achieve very high performance. Unfortunately, the cost of this type of switch is very high, making it impractical; however, it can be considered a best-case scenario for what a fast packet switch can achieve. Our simulation results show that, when the communication patterns are known, we can achieve the same or even a better level of performance than buffered-crosspoint switches using much less expensive circuit switches. We simulated two different types of systems to demonstrate this: 1) systems with only fast packet switching networks, referred to as "Fast Packet switching" or FP, and 2) systems with multiple circuit switches to which as many connections as possible are preloaded and pinned during execution, referred to as "Preload Pinned" or PP.

For each system, we run the simulation with different numbers of networks. All links in both the fast packet switching networks and the circuit switches have the same bandwidth of 4 Gbps. We model the circuit switches as optical micro-electro-mechanical system (MEMS)-based switches whose connection setup overhead is set to 3 ms [6], [21], [22]. Each PP system also includes a relatively slow packet switching network with a bandwidth of 400 Mbps. In what follows, we


report the average message delay as the performance metric. Simulation results for 256-node traces of 4 NAS benchmarks are shown in Figure 16.

The communication degrees of MG256, CG256, SP256 and BT256 are 13, 5, 6 and 6, respectively. The communication is typically evenly distributed across each node pairing, with the exception of MG256, whose dominant communication is evenly distributed among only 6 of the 13 pairs. Hence, as we increase the number of circuit switching networks, we see a near-linear reduction in the average message delays for the 4 NAS benchmarks. For instance, with 6 circuit switches we can reduce the message delay of SP256 to as low as 37.2% of what FP can achieve. This trend is maintained for all the applications we have investigated: the simulation results in Figure 16 show that, compared to FP, 6 circuit switches reduce message delay by 37.2% to 62.8% for all the investigated applications.

Our technique typically works well for highly static

and persistent communication. Table II shows that all 4 NAS benchmarks demonstrated here contain either completely static or persistent communication patterns, leading to this linear delay reduction per switch plane. Interestingly, adding additional fast packet switches did not improve the communication delay, for two main reasons: (1) the NAS benchmarks produce messages infrequently enough that the majority of messages have left the buffer in the outgoing NIC prior to the arrival of the next message, and (2) we assume that individual messages cannot be broken up and sent across separate network planes, due to problems such as out-of-order arrival of packets. Supporting out-of-order arrival of messages could potentially improve the performance of multiple packet switch configurations [41], but it leads to significant additional complexity in the system and potentially degrades the performance of in-order messages [42]. Given that many parallel applications are dominated by static and persistent communication, that it is difficult to achieve a significant benefit with multiple packet switch networks, and that these fast packet switching networks are of such high complexity and cost, our approach using circuit switching is very promising.

All the applications discussed above either have a

single phase or have a dominant phase whose communication volume is much larger than that of the other phases. In order to demonstrate the benefit of reconfiguring the network at phase boundaries, we simulate systems with multiple circuit switches to which connections are loaded at the beginning of each phase and pinned for the duration of the phase, referred to as Phased Preload Pinned (PPP). We also consider a circuit switch with a faster circuit establishment delay for the PPP case, called PPP-Fast.

PPP can achieve higher performance than PP in certain

circumstances. For example, given a communication pattern in which the ith phase has a communication degree ni and the global communication pattern has degree m, it is reasonable to expect that m > max{ni}. It is obvious that PPP can obtain the same performance using only

Fig. 16. Message delay (μs) of (a) MG256, (b) CG256, (c) SP256, and (d) BT256.


max{ni} circuit switches as PP obtains with m circuit switches, provided the network reconfiguration delay is small or can be overlapped with computation. We use a synthetic program, SYN256 (Figure 17(a)), and the ASCI COMOPS benchmark [43] (Figure 17(b)) to demonstrate the impact of changing the network configuration between phases.

SYN256 has 3 phases; each phase performs 2-D mesh

communications along 2 dimensions of a 3-D mesh and has a communication degree of 4, while the communication degree of the entire program is 6. For the COMOPS256 benchmark, the point-to-point communication consists of three phases: ping-pong, 2-D ghost cell update, and 3-D ghost cell update. Here, the topology of the 2-D communication is not embedded in the topology of the 3-D communication. Simulations show that for both programs the message delay of PPP is lower than that of PP when using the same number of circuit switching networks, and it decreases much faster than that of PP as we increase the number of circuit switching networks. PPP-Fast does provide an advantage over PPP in some cases, as expected; however, in most cases the connection establishment can be sufficiently amortized to make PPP and PPP-Fast equivalent.

The communication degrees of parallel applications are typically small [6], [44]. Although the maximum communication degree of some parallel applications can be very large, the dominant communication degree may still be small, and the low-bandwidth connections can be filtered out with minimal performance loss using a circuit switch approach [6]. Therefore, we can expect that 10 or fewer circuit switches are sufficient to serve a parallel application in most cases.

As previously mentioned, we have primarily reported message delay, as this demonstrates the advantage of our approach even when the benchmark is not communication bound, as is the case with many of the NAS benchmarks. Figure 18 shows the application completion time with various network configurations for COMOPS on 256 processors (Figure 18(a)), which is communication bound, and an example from NAS, SP on 256 processors (Figure 18(b)), which is computation bound. For the communication bound application, COMOPS, the improvement in execution time mirrors the improvement in message delay. For the computation bound application, SP, the completion time improves slightly to reflect the improvement in communication; however, the overall improvement is relatively small. We do not report PPP for SP as it contains only a single dominant communication phase.

VI. CONCLUSIONS

In this paper we explore compiler techniques to obtain efficient communications for MPI applications on circuit switching networks. A compilation framework integrating fairly sophisticated analysis is described. In the framework, communication patterns are identified

Fig. 17. Message delay versus number of networks (FP, PP, PPP, PPP-Fast) for (a) SYN256 and (b) COMOPS256.

and compiled into network configurations. This allows significant performance benefits when connections can be established in the network, with relatively few circuit switches, prior to the actual communication operation. The framework includes a flexible and powerful communication pattern representation scheme that can capture the properties of communication patterns and allow manipulation of these patterns. In this way, communication phases can be detected and logically formulated within the application. This scheme also provides the power to easily manipulate the granularity of communication phases using a set of proposed operations on the patterns. Further, we present a specific phase partitioning algorithm used to maximize phase duration and minimize the connections that cannot be satisfied in the circuit switch network.

Additionally, we extend the classification of static

and dynamic communication patterns and operations to include persistent communications. Persistent communications cannot be determined statically; however, they remain unchanged for large segments of the application execution.

Fig. 18. Completion time examples with various network configurations: (a) COMOPS256, (b) SP256.

An experimental compiler has been developed to implement this framework using the SUIF compiler infrastructure. Applying our experimental compiler to the NAS parallel benchmark suite, we found that a large portion of the communications previously classified as dynamic [8] are actually persistent. This provides opportunities for pre-configuring the network to reduce communication overhead. Simulation results show that competitive performance can be achieved through the combination of compiled communication and circuit switching interconnection networks for multiprocessor systems. In particular, compiled communication with circuit switching can reduce message delay by 37% to 63% using no more than 6 circuit switches when compared to an idealistic buffered-crossbar-based packet switch network.

Finally, we note that the compiled communication concept can be expanded to include non-message-passing parallel programming models, such as distributed shared memory models that give the appearance of shared memory but require actual messages to traverse the network.

REFERENCES

[1] G. Broomell and J. R. Heath, "Classification categories and historical development of circuit switching topologies," ACM Computing Surveys (CSUR), vol. 15, no. 2, pp. 95–133, 1983.

[2] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach. Morgan Kaufmann, 2003.

[3] F. Cappello and C. Germain, "Toward high communication performance through compiled communications on a circuit switched interconnection network," in Proc. of the Int. Symp. on High Performance Computer Architecture (HPCA), 1995, pp. 44–53.

[4] X. Yuan, R. Melhem, and R. Gupta, "Compiled communication for all-optical TDM networks," in Proc. of SC, 1996.

[5] T. Gross, "Communication in iWarp systems," in Proc. of SC89. ACM/IEEE, 1989, pp. 436–445.

[6] K. J. Barker, A. Benner, R. Hoare, A. Hoisie, A. K. Jones, D. J. Kerbyson, D. Li, R. Melhem, R. Rajamony, E. Schenfeld, S. Shao, C. Stunkel, and P. A. Walker, "On the feasibility of optical circuit switching for high performance computing systems," in Proc. of SC, 2005.

[7] J. Shalf, S. Kamil, L. Oliker, and D. Skinner, "Analyzing ultra-scale application communication requirements for a reconfigurable hybrid interconnect," in Proc. of SC, 2005.

[8] A. Faraj and X. Yuan, "Communication characteristics in the NAS parallel benchmarks," in Proc. of the Parallel and Distributed Computing and Systems Conf. (PDCS), 2002.

[9] J. Vetter and F. Mueller, "Communication characteristics of large-scale scientific applications for contemporary cluster architectures," Journal of Parallel and Distributed Computing, vol. 63, no. 9, pp. 853–865, September 2003.

[10] N. R. Adiga et al., "An overview of the BlueGene/L supercomputer," in Proc. of Supercomputing (SC), 2002.

[11] S. L. Scott, "Synchronization and communication in the T3E multiprocessor," in Proc. of ASPLOS-VII, 1996.

[12] S. Habata, K. Umezawa, M. Yokokawa, and S. Kitawaki, "Hardware system of the Earth Simulator," Parallel Computing, vol. 30, no. 12, pp. 1287–1313, 2004.

[13] V. Gupta and E. Schenfeld, "Combining message switching with circuit switching in the interconnection cached multiprocessor network," in Proc. of the IEEE Int. Symposium on Parallel Architectures, Algorithms and Networks, 1994.

[14] ——, "Task graph partitioning and mapping in a reconfigurable parallel architecture," Parallel Processing Letters, vol. 5, no. 4, pp. 563–574, 1995.

[15] D. Chiarulli, S. Levitan, R. Melhem, J. Taza, and G. Gravenstreter, "Partitioned optical passive star (POPS) multiprocessor interconnection networks with distributed control," IEEE Journal of Lightwave Technology, vol. 14, no. 7, pp. 1601–1612, 1996.

[16] G. Gravenstreter and R. Melhem, "Realizing common communication patterns in partitioned optical passive stars (POPS) networks," IEEE Transactions on Computers, vol. 47, no. 9, pp. 998–1013, 1998.

[17] P. Dowd et al., "Lightning network and system architecture," Journal of Lightwave Technology, vol. 14, pp. 1371–1387, 1996.

[18] A. K. Kodi and A. Louri, "RAPID: Reconfigurable and scalable all-photonic interconnect for distributed shared memory multiprocessors," IEEE/OSA Journal of Lightwave Technology, vol. 22, no. 9, pp. 2101–2110, 2004.

[19] ——, "Design of a high-speed optical interconnect for scalable shared memory multiprocessors," IEEE Micro, vol. 25, no. 1, pp. 41–49, 2005.

[20] ——, "A new technique for dynamic bandwidth re-allocation in optically interconnected high-performance computing systems," in IEEE Symposium on High-Performance Interconnects, 2006.

[21] P. C. et al., "Design and nonlinear servo control of MEMS mirrors and their performance in a large port-count optical switch," Journal of Microelectromechanical Systems, vol. 14, no. 2, pp. 261–273, April 2005.

[22] T. Yamamoto, J. Yamaguchi, R. Sawada, and Y. Uenishi, "Development of a large-scale 3D MEMS optical switch module," NTT Technical Review, vol. 1, no. 7, pp. 37–42, October 2003.

[23] Z. Ding, R. Hoare, A. Jones, D. Li, S. Shao, S. Tung, J. Zheng, and R. Melhem, "Switch design to enable predictive multiplexed switching in multiprocessor networks," in Proc. of the Int. Parallel and Distributed Processing Symp. (IPDPS), 2005.


[24] S. Shao, A. K. Jones, and R. Melhem, "A compiler-based communication analysis approach for multiprocessor systems," in Proc. of the Int. Parallel and Distributed Processing Symp. (IPDPS), 2006.

[25] D. Shires, L. Pollock, and S. Sprenkle, "Program flow graph construction for static analysis of MPI programs," in Proc. of the Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA), June 1999.

[26] S.-Y. Ho and N.-W. Lin, "Static analysis of communication structures in parallel programs," in Proc. of the Int. Computer Symp. (ICS), 2002, pp. 215–221.

[27] MPI: A Message-Passing Interface Standard, Message Passing Interface Forum, June 1995.

[28] D. Bailey, T. Harris, W. Saphir, and R. van der Wijngaart, "The NAS parallel benchmarks 2.0," Numerical Aerodynamic Simulation Facility, NASA Ames Research Center, Tech. Rep. NAS-95-020, December 1995.

[29] H. G. Dietz and T. Mattox, "Compiler techniques for flat neighborhood networks," in Proc. of the 13th Int. Workshop on Languages and Compilers for Parallel Computing, 2000.

[30] J. Liang, A. Laffely, S. Srinivasan, and R. Tessier, "An architecture and compiler for scalable on-chip communication," IEEE Trans. on Very Large Scale Integration Systems (TVLSI), vol. 12, no. 4, pp. 711–726, July 2004.

[31] D. Lahaut and C. Germain, "Static communications in parallel scientific programs," in Proc. of PARLE, 1994.

[32] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina, "Architectural requirements of parallel scientific applications with explicit communication," ACM Computer Architecture News, vol. 21, no. 2, pp. 2–13, May 1993.

[33] R. R. Hoare, Z. Ding, and A. K. Jones, "Level-wise scheduling algorithm for fat tree interconnection networks," in Proc. of Supercomputing, 2006.

[34] V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. J. Irwin, "DRAM energy management using software and hardware directed power mode control," in Proc. of the IEEE International Symposium on High-Performance Computer Architecture, 2001, pp. 159–169.

[35] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S. W. Liao, C. W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy, "SUIF: An infrastructure for research on parallelizing and optimizing compilers," ACM SIGPLAN Notices, vol. 29, no. 12, pp. 31–37, December 1994.

[36] P. Pavlo, G. Vahala, and L. Vahala, "Higher order isotropic velocity grids in lattice methods," Physical Review Letters, vol. 80, p. 3960, 1998.

[37] A. MacNab, G. Vahala, P. Pavlo, L. Vahala, and M. Soe, "Lattice Boltzmann model for dissipative incompressible MHD," in 28th EPS Conference on Contr. Fusion and Plasma Phys., vol. 25A, June 2001, pp. 853–856.

[38] J. Kim, W. J. Dally, B. Towles, and A. K. Gupta, "Microarchitecture of a high radix router," in Proc. of ISCA, 2005, pp. 420–431.

[39] C. B. Stunkel, J. Herring, B. Abali, and R. Sivaram, "A new switch chip for IBM RS/6000 SP systems," in Proc. of SC, 1999.

[40] T. Hoefler, P. Kambadur, R. L. Graham, G. Shipman, and A. Lumsdaine, A Case for Standard Non-blocking Collective Operations, ser. Lecture Notes in Computer Science. Springer, 2007, vol. 4757, pp. 125–134.

[41] L. Gharai, C. Perkins, and T. Lehman, "Packet reordering, high speed networks and transport protocol performance," in Proc. of the IEEE International Conference on Computer Communications and Networks (ICCCN), 2004, pp. 73–78.

[42] P. Balaji, W. Feng, S. Bhagvat, D. K. Panda, R. Thakur, and W. Gropp, "Analyzing the impact of supporting out-of-order communication on in-order performance with iWARP," in Proc. of Supercomputing (SC), 2007.

[43] LLNL, "The ASCI COMOPS benchmark code," Lawrence Livermore National Laboratory website, http://www.llnl.gov/.

[44] J. Kim and D. J. Lilja, "Characterization of communication patterns in message-passing parallel scientific application programs," in Proc. of the Second Int. Workshop on Network-Based Parallel Computing: Communication, Architecture, and Applications, G. Goos, J. Hartmanis, and J. van Leeuwen, Eds., 1998, pp. 202–216.


Shuyi Shao received B.S. and M.S. degrees in computer science from Xi'an Jiaotong University, China, in 1996 and 1999, respectively. He is currently pursuing the Ph.D. degree at the University of Pittsburgh, PA. His research interests include high performance computing, compilers, and computer architecture. He is a student member of the IEEE.


Alex K. Jones received his B.S. in Physics in 1998 from the College of William and Mary in Williamsburg, Virginia. He received his M.S. and Ph.D. degrees in Electrical and Computer Engineering from Northwestern University in 2000 and 2002, respectively. He is currently an Assistant Professor of Electrical and Computer Engineering and Computer Science at the University of Pittsburgh, Pennsylvania. He was formerly a Research Associate in the Center for Parallel and Distributed Computing and Instructor of Electrical and Computer Engineering at Northwestern University. He is a Walter P. Murphy Fellow of Northwestern University, a distinction he was awarded twice. Dr. Jones's research interests include compilation techniques for behavioral and low-power synthesis, embedded systems, radio frequency identification (RFID), and high performance computing. He is the author of over 50 publications in these areas. Dr. Jones has served on several conference program committees, including the International Conference on Parallel Processing, the Parallel and Distributed Computing and Systems Conference, and the Workshop for Large Scale Parallel Processing at IPDPS. He has served as associate editor for several journals, including ACM Transactions on Design Automation of Electronic Systems and Parallel Processing Letters. He is currently a member of the IEEE and the ACM and is serving on the executive committee of the ACM Special Interest Group on Design Automation.


Rami Melhem received a B.E. in Electrical Engineering from Cairo University in 1976, an M.A. degree in Mathematics and an M.S. degree in Computer Science from the University of Pittsburgh in 1981, and a Ph.D. degree in Computer Science from the University of Pittsburgh in 1983. He was an Assistant Professor at Purdue University prior to joining the faculty of the University of Pittsburgh in 1986, where he is currently a Professor of Computer Science and Electrical Engineering and the Chair of the Computer Science Department. His research interests include optical networks, high performance computing, real-time and fault-tolerant systems, and parallel computer architectures. Dr. Melhem has served on the program committees of numerous conferences and workshops. He was on the editorial boards of the IEEE Transactions on Computers and the IEEE Transactions on Parallel and Distributed Systems. He is serving on the advisory board of the IEEE technical committee on Computer Architecture. He is the editor for the Springer Book Series in Computer Science and is on the editorial boards of Computer Architecture Letters, the International Journal of Embedded Systems, and the Journal of Parallel and Distributed Computing. Dr. Melhem is a fellow of the IEEE and a member of the ACM.
