
High-Level Data Parallel Programming in Promoter *

Matthias Besch, Hua Bi, Peter Enskonatus, Gerd Heber, Matthias Wilhelmi
RWCP Massively Parallel Systems GMD Laboratory Berlin, Germany

GMD FIRST, Rudower Chaussee 5, D-12489 Berlin
{mb,bi,ens,heber,wilhelmi}@first.gmd.de

Abstract

Implementing realistic scientific applications on parallel platforms requires a high-level, problem-adequate, and flexible programming environment. The hybrid system PROMOTER pursues a two-level approach allowing easy and flexible programming at both the language and the library level. The core concept of PROMOTER's language model is its highly abstract and unified concept of data and communication structures. The paper briefly addresses the programming model, but focuses on implementation aspects of the compiler and runtime system. Finally, performance results are given, evaluating the efficiency of the PROMOTER system.

1. Introduction

Despite much research work in the area of parallel programming environments and compiler development over the past decade, the burden of parallel programming still lies in the automatic, transparent, and efficient treatment of locality. The problem arises whenever one has to distinguish between (cheap) local and (expensive) remote data access.

Today, nearly everyone seems to agree that the acceptance of parallel computing, and in particular of computing on distributed memory environments, mainly depends on the quality of a high-level programming model, which should provide powerful abstractions in order to free the programmer from the burden of dealing with low-level issues such as data layout or communication.

A high-level programming model drastically facilitates the building of large-scale applications; however, it leaves the complex task of bridging the large semantic gap between application and parallel machine to the compiler and runtime system.

*This research is supported by the Real World Computing Partnership (RWCP), Japan.

Data parallel programming languages such as HPF [12], Fortran D [13], Vienna Fortran [6], HPC++ [21], and the like already relax this problem considerably. Intended for use in scientific computing, these languages focus on dense, regular array computations. By expressing parallelism through constructs that guarantee the independence of statements in blocks and loops, these languages provide a true abstraction from the underlying machine.

Optimizing data alignment, data partitioning, and the assignment to a processor graph is often still too complex. In many systems, the compiler can (and should) therefore be supported by additional user-defined tuning mechanisms, such as the alignment and distribution directives in HPF. In the case of irregular data and dependence structures, however, data parallel languages often lose their expressive power: these structures have to be modeled via indirect indexing and thus are not supported by the compiler, but have to be handled dynamically at runtime.

In this paper, we present the hybrid, two-level approach used in PROMOTER [11]. The library level provides an abstraction of communication and establishes a flexible and efficient runtime system on top of standard platforms. The language level defines a high-level programming model and introduces an abstraction of data partitioning and distribution. At this level, PROMOTER is a coordination language embedded in an imperative, object-oriented host language (C++).

The underlying design principles of Promoter are best described by the terms polymorphic high-level data parallelism and uniformity of data and dependence structures. Promoter programs are written by concurrently executing methods on all points of an aggregate object. By exploiting the polymorphism of object-oriented languages, the strict homogeneity of data and computational domains in data parallel languages can be relaxed to some degree of locally autonomous execution.

Uniformity here means that both (irregular) data structures and (irregular) dependences between them are represented by the same high-level construct. This considerably facilitates the exploitation of application-dependent knowledge by the compiler and runtime system.


In contrast to most other data parallel languages, PROMOTER supports not only dense rectangular domains (arrays) but also sparse or irregular structures, which may be regularly constructed or irregularly enumerated. So, typically, instead of using nested (forall) loops, concurrent operations are expressed as a sequence of statements, each of which describes a dependence relation from a source to a target object or some subset thereof.

The paper is organized as follows. Section 2 introduces the data parallel programming model of PROMOTER and presents its core language concepts. Section 3 gives an overview of the compilation system, reflecting the previously mentioned two levels. Section 4 then addresses application programming and discusses some problems and their solutions in PROMOTER. Section 5 presents performance results evaluating the PROMOTER runtime system and the quality of the compiler optimizations. Finally, Section 6 takes a look at comparable systems and gives a brief outlook on future work.

2. The programming model - basic concepts

Within the framework of this paper, we can only give a cursory introduction to PROMOTER. The following subsections briefly introduce some basic concepts; for a more exhaustive introduction we refer the reader to [17].

2.1. Topologies

A topology, or index space, is some arbitrary, possibly irregular and dynamic subset of Z^n, where Z is the set of integers. It allows the programmer to model spatial data structures and communication (or dependence) relations in a problem-oriented way.

Using topologies to model spatial data structures, the expressive power of PROMOTER goes far beyond computing with dense regular arrays. In the example below, we declare a model triangle of a Finite Element Method (FEM), which is k times regularly refined. The dimension and range of a topology are defined by an expression like 0:M, 0:N or by an already defined topology, like Grid in InnerGrid. The expressions in curly braces are called the constraints of the topology; they define all valid indices within the declared range.

topology Grid : 0:M, 0:N { };    // M x N grid

topology InnerGrid : Grid {
    $a:1:(M-1), $b:1:(N-1);
};                               // interior of the grid

topology Element[k] : 0:pow(2,k), 0:pow(2,k), 0:pow(2,k) {
    $a, $b, $c : a + b + c = pow(2,k);
};

2.2. Distributed types and objects

Distributed types are indexable structures which are built from given types (classes) and indexed by data topologies. Distributed objects are instances of distributed types:

class T;
Grid<T> g;   // g is a distributed object of class T
             // over the topology Grid

2.3. Data parallel operations

By data parallel operations we mean that the same operation is performed at all points of a data structure. This can be expressed by calling a method (operation) on a distributed object. Such a call is then performed by operations replicated over the entire topology. As usual in data parallel languages, replicated operations are defined by lifting the function result and parameter types to the distributed type with respect to the employed topology:

class T;
Grid<T>   g;
Grid<int> h;
int f( T& );
g.T::method();   // replicated method call on every element of g
h = f( g );      // lifted (element-wise) application of f

2.4. Communication

Communication relations in PROMOTER are also expressed by means of topologies. A communication topology defines a relation between data points, i.e. it specifies a subset of the Cartesian product of the target and source data topologies.

In the example below, we declare a communication topology that "supports" the five-point discretization of the Laplace operator on a grid.

topology Laplace_5 : Grid, Grid {
    $a, $b, a + 1, b;   // right neighbour
    $a, $b, a - 1, b;   // left neighbour
    $a, $b, a, b + 1;   // upper neighbour
    $a, $b, a, b - 1;   // lower neighbour
};

The actual communication can then be described by evaluating the so-called communication product, which employs the concept of matrix multiplication. The result (represented by the distributed object y) of the communication product of the distributed objects x and c is defined as

y_{ij} = \sum_k x_{ik} \cdot c_{kj}


where the transfer operation "*" and the reduce operation "+" (from \sum) can be overloaded, and a few validity rules regarding undefined items have to be considered.

In a typical application of this general transduce operation (transfer + reduce), an operator ! is defined to denote the important special case of a vector-matrix or matrix-vector operation, where "*" and "+" have their usual default semantics. In this case, one of the outer index spaces (here the one with respect to i) vanishes:

y_j = \sum_k x_k \cdot c_{kj}

The matrix then represents a communication topology as illustrated in the following example.

Grid<double> g, h;
h[[ InnerGrid ]] = (g ! Laplace_5) - 4 * g;

g ! Laplace_5 transfers each data element of g to its four corresponding neighbours in h, where they are reduced by the default operator +. The example also shows that, by the so-called selection, we restrict the application of the Laplace operator to the inner grid points. In general, a selection is a very elegant tool for accessing a subset of the data elements of a distributed object.
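To make these semantics concrete, the following plain sequential C++ fragment (our own reference sketch, not PROMOTER or PRL code; the function name and the array representation are illustrative assumptions) spells out what the statement above computes on an M x N grid:

#include <vector>

// Reference semantics of  h[[ InnerGrid ]] = (g ! Laplace_5) - 4 * g :
// every inner point receives the values of its four neighbours along the
// Laplace_5 relation, reduced by the default operator "+".
void laplace5_reference(const std::vector<std::vector<double> >& g,
                        std::vector<std::vector<double> >& h,
                        int M, int N)
{
    for (int a = 1; a < M - 1; ++a)        // the InnerGrid selection
        for (int b = 1; b < N - 1; ++b) {
            double sum = g[a + 1][b] + g[a - 1][b]   // transferred values,
                       + g[a][b + 1] + g[a][b - 1];  // reduced by "+"
            h[a][b] = sum - 4.0 * g[a][b];
        }
}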

The second special case of the communication product is called the cross product. Here, the inner topology (the reduction space) is missing, such that the result defines the outer product:

y_{ij} = x_i \cdot c_j

In PROMOTER syntax the cross product looks like:

y = times(x, *, c);

The general form of the transduce operation is expressed by

y = [R] transduce(x, c, +, *);

where the topology R denotes the reduction space. If the parameter R is left out, no reduction takes place.

In this case, the reduction operator can be omitted, and the matrix multiplication of transduce degenerates to the outer product of times:

y = transduce(x, c, *);
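The following sequential C++ sketch (our own notation; the edge-list representation of the communication topology and all names are illustrative assumptions, not the PRL interface) captures this general transfer-and-reduce semantics for overloadable "+" and "*":

#include <cstddef>
#include <functional>
#include <vector>

// One entry per point of the communication topology: a (source, target)
// index pair plus a matrix entry c(k,j); a weight of 1 models a pure
// dependence relation.
struct Edge { std::size_t src, dst; double weight; };

// y[j] = reduce over all edges (k,j) of transfer(x[k], c(k,j)).
// The initial value 0.0 assumes a "+"-like reduce operator with identity 0.
std::vector<double> transduce_reference(
    const std::vector<double>& x,
    const std::vector<Edge>& c,
    std::size_t targetSize,
    const std::function<double(double, double)>& reduce,    // "+"
    const std::function<double(double, double)>& transfer)  // "*"
{
    std::vector<double> y(targetSize, 0.0);
    for (std::size_t e = 0; e < c.size(); ++e)
        y[c[e].dst] = reduce(y[c[e].dst], transfer(x[c[e].src], c[e].weight));
    return y;
}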

3. The Promoter compilation system

In the following two subsections, we give a brief overview of the compiler and the runtime system of PROMOTER.

3.1. Compiler

The compiler works as a source-to-source translator. It accepts PROMOTER programs as input and generates C++ SPMD programs.

The general strategy is to recognize all PROMOTER constructs, to transform them into an internal, efficiently usable form, and to leave the rest of the source program as close as possible to the original, which later helps the user in debugging and monitoring. The compiler consists of the following components.

The front-end implements the lexical, syntactical, and semantic analysis and builds an abstract syntax tree (AST) and the symbol table. Syntactically incorrect programs are rejected.

In the analysis phase, the compiler tries to gain as much knowledge about the source program, in particular about the data and communication topologies and the use of the PROMOTER constructs, as is necessary for the succeeding phases. It generates descriptors in the AST which give a condensed view of the PROMOTER constructs. The objective of the analysis phase is two-fold: first, to analyze those parts of the C++ code that are needed for the symbolic evaluation of constraints and selections; second, to recognize the PROMOTER-specific constructs and generate the corresponding descriptors.

In the succeeding mapping pass, the data and communication topologies are used to find a sufficiently good distribution of data elements and threads over the set of computing nodes. A program is subdivided into a sequence of phases; within each phase, the distributed variables related to each other by communication operators are subject to a common mapping process. Guided by high-level pragmas, the compiler can choose between different mapping algorithms and combinations thereof. Aside from simple tiling strategies, there are general graph-based multi-partitioning algorithms, such as an improved and generalized version of the Kernighan-Lin heuristic [15] or spectral bisection, as well as a fast topographic mapping approach called BHT [3]. The latter is particularly appropriate for finite-neighborhood communication within a single index space.

The optimization phase works mainly on the descriptors, which reflect the transformations for the generation of a message passing SPMD program. It tries to reduce the initially generated number of messages and synchronization points.

The final pass of the compiler generates the SPMD program code in C++. It evaluates and modifies the descriptors while traversing the AST and generates the calls to the executing primitives provided by the PROMOTER runtime system.


3.2. Runtime system

The PROMOTER runtime system consists of the PROMOTER RUNTIME LIBRARY (PRL) and an underlying PROMOTER ABSTRACT MACHINE (PAM). The PRL is architecture-independent, while the PAM must be ported to each platform.

The PROMOTER runtime system is implemented in the SPMD model with a lock-step synchronization scheme. That is, the processes created by the SPMD program alternate between communication and computation phases. In the communication phase, remote elements are sent and received; in the computation phase, operations are performed locally.
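Schematically, each SPMD process therefore has the following shape (a minimal sketch with hypothetical helper names, not actual generated code):

// Hypothetical helpers standing in for generated communication and
// computation code.
void exchange_remote_elements();  // send/receive remote elements
void compute_local_elements();    // purely local operations
void barrier();                   // lock-step synchronization point

void spmd_main_loop(int steps)
{
    for (int t = 0; t < steps; ++t) {
        exchange_remote_elements();  // communication phase
        compute_local_elements();    // computation phase
        barrier();                   // keep all processes in lock-step
    }
}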

PAM is responsible for communication and synchronization between domains of the underlying architecture. The send and receive operations are nonblocking, in order to overlap local and remote operations. To exploit the capabilities of the hardware, most of the resource management is located in the PAM. The dual-processor system MANNA, for instance, can perform all buffering and communication activities on the communication processor, while the application processor executes the local operations of the user program.

PRL provides the executing primitives for running PROMOTER programs. More precisely, it provides template classes for distributed objects and template functions on distributed objects. These functions provide the basic data parallel operations, for example data parallel assignments, data parallel function calls, point-to-point communication, and collective communication (reduction, expansion, cross product, and communication product).

Distributed objects are implemented in an object-oriented way. They are based on the class Distribution, which is defined by a spatial structure Topology and a mapping strategy Mapping. A Topology specifies the valid data points of the problem's spatial structure; their location on one of the physical computing domains (nodes) is determined by the Mapping. Concrete Topology, Mapping, and Distribution classes can be predefined by PRL, generated by the compiler, or defined by the programmer. The only condition for applying them is that they follow the same class interface.

PRL provides a set of data topologies, such as Array, BandArray, MaskedArray, Point_Set, and Topology_Union, and a set of mapping strategies, such as BlockMapping, Cycle_BlockMapping, BHTMapping, and GeneralGraphMapping. Different distributions are defined for the different combinations of topologies and mapping strategies. For distributed objects, static element types and dynamic element types (e.g. containing pointers) are distinguished, because they require different implementations of data packing and unpacking for communication.
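The following C++ sketch illustrates what such a common class interface might look like (the member names are our assumptions, not the actual PRL signatures):

#include <cstddef>

// A Topology enumerates the valid data points of a spatial structure.
class Topology {
public:
    virtual ~Topology() {}
    virtual std::size_t size() const = 0;             // number of valid points
    virtual bool contains(const int* index) const = 0;
};

// A Mapping places every point on one of the computing domains (nodes).
class Mapping {
public:
    virtual ~Mapping() {}
    virtual int nodeOf(std::size_t point) const = 0;
};

// Example strategy: contiguous blocks of points, one block per node.
class BlockMapping : public Mapping {
    std::size_t points_;
    int nodes_;
public:
    BlockMapping(std::size_t points, int nodes)
        : points_(points), nodes_(nodes) {}
    int nodeOf(std::size_t p) const
    { return static_cast<int>(p / ((points_ + nodes_ - 1) / nodes_)); }
};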

PRL also provides a set of communication topologies, such as One_To_One and One_To_Many. Based on a communication topology and the two distributions representing the distributed objects involved in a communication, a communication pattern can be generated. A communication pattern defines local indices that designate which elements will be sent and at which indices received elements will be operated with local elements.
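Such a pattern can be pictured as the following data structure (a sketch; the field names are our assumptions):

#include <vector>

// For each peer domain: which local elements to pack and send, and at
// which local indices arriving elements are combined with local ones.
struct CommPattern {
    struct PeerPlan {
        int peer;                      // remote domain (node) id
        std::vector<int> sendIndices;  // local elements to be sent
        std::vector<int> recvIndices;  // where received elements are applied
    };
    std::vector<PeerPlan> plans;       // one plan per communicating peer
};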

PRL provides runtime support to generate communication patterns through template functions (which correspond to the basic communication forms in PROMOTER). Because these template functions take a communication topology and distributions (not distributed objects) as arguments, three optimization possibilities arise.

First, a generated communication pattern can be reused by different communications if the same communication topology and the same distribution relation between distributed objects apply. Second, a communication pattern can be lifted out of a loop if the results of communication and computation in the loop do not change their spatial structure and data partition, as sketched below. Third, the computation of the communication patterns for subsequent communications can be overlapped with the current communication if those patterns do not depend on the result of the current communication. Based on our experiments, these optimizations are applicable in a large class of applications.
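The second possibility, for instance, might take the following form in generated code (an illustrative sketch; the type and function names are ours, not the real PRL API):

// Opaque types standing in for the corresponding PRL concepts.
struct CommTopology;
struct Distribution;
struct Pattern;

Pattern* makePattern(const CommTopology&, const Distribution&,
                     const Distribution&);   // pattern generation
void communicate(Pattern*);                  // pattern-driven communication
void computeLocally();                       // local part of the iteration

void iterate(const CommTopology& rel, const Distribution& dx,
             const Distribution& dy, int steps)
{
    Pattern* p = makePattern(rel, dx, dy);   // hoisted: loop-invariant
    for (int t = 0; t < steps; ++t) {
        communicate(p);      // the same pattern is reused in every iteration
        computeLocally();    // spatial structure and partition are unchanged
    }
}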

Based on communication patterns, PRL provides a set of template functions for collective communication. They are implemented by the so-called four-phase scheme.

In the first phase, all local data elements which must be sent to remote domains are collected, and an asynchronous send in PAM is started to transmit them. Several send modes (e.g. non-buffering send, message vectorization) allow an efficient use of buffers and minimize the number of generated messages. At the same time, asynchronous receives are initiated according to the information in the communication pattern, i.e. according to the domains from which messages will be received.

In the second phase, the local operations are performed. They are executed before remote elements are received, in order to tolerate communication delay.

In the third phase, the process waits for messages from all other domains. As soon as a message from one domain has been received completely, the relevant local data elements are combined with the remote data elements according to the communication pattern. Because the operations performed here are associative, as required by the PROMOTER language, they can be applied immediately after the message from one domain has arrived. In this way, operation and communication are overlapped.

In the fourth phase, a synchronization is issued to ensure that the asynchronous send and receive operations in PAM have finished.
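Put together, the four phases have roughly the following shape (a sketch; the PAM primitive names are our assumptions, not the actual PAM interface):

#include <cstddef>
#include <vector>

struct Message { int peer; std::vector<char> data; };
void pam_isend(int peer, const std::vector<char>& buf);  // asynchronous send
void pam_irecv(int peer);                                // asynchronous receive
Message pam_wait_any();     // wait for the next completed receive
void pam_sync();            // wait until all asynchronous operations finish
std::vector<char> packFor(int peer);       // collect local elements to send
void combineWithLocal(const Message& m);   // apply reduce op at recv indices
void computeLocalPart();                   // local operations

void four_phase(const std::vector<int>& peers)
{
    for (std::size_t i = 0; i < peers.size(); ++i) {  // phase 1: start sends
        pam_isend(peers[i], packFor(peers[i]));       // and post receives
        pam_irecv(peers[i]);
    }
    computeLocalPart();              // phase 2: local work first, to
                                     // tolerate communication delay
    for (std::size_t i = 0; i < peers.size(); ++i) {
        Message m = pam_wait_any();  // phase 3: combine each message as it
        combineWithLocal(m);         // arrives (the reduce op is associative)
    }
    pam_sync();                      // phase 4: ensure all asynchronous
}                                    // operations in PAM have finished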


Since the operations on distributed objects are implemented by template functions, the runtime library provides a generic interface for data parallel applications. By generic, we mean that operations on distributed objects are implemented within the same framework for different topologies, different mapping strategies, and different element types. Therefore, it is easy for a compiler to generate code for a PROMOTER program.

The runtime library also allows an efficient implementation of the PROMOTER language, since the compiler can

• specialize the container classes (e.g. topology and mapping) of distributed objects,

• make container classes (e.g. topology and mapping) sharable by different distributed objects,

• let a communication pattern be reused by different communications, e.g. in an iteration,

• overlap communication and computation.

In summary, the runtime library defines a generic interface and allows application-dependent specialization and optimization applicable by a compiler. Besides, the runtime library is also designed as a user-level library to support abstract message passing programming that assumes no special support from a compiler [4].

4. Programming in Promoter

"Applications programming for high-performance computing is notoriously difficult. Although parallel programming is intrinsically complex, the principal reason why high-performance computing is difficult is the lack of software tools ... which leads to wasted computer resources and inhibits the use of high-performance parallel computers by scientists" [2].

4.1. General remarks

• PROMOTER code is architecture independent.

• The programmer does not have to worry about programming message passing and data distribution. (At the level of the runtime system he can pass additional information, for instance about the mapping, but the architecture is hidden at this level as well.)

• PROMOTER allows rapid prototyping on parallel architectures.

• There is a high degree of flexibility offered to the programmer, because he can also write, modify, and optimize his code at the level of the PROMOTER runtime system. This is interesting also for porting optimized parts of existing software.

• PROMOTER code achieves almost the performance of hand-written MPI or PVM code.

• Writing PROMOTER code means, among other things, writing less code.

4.2. The flavour of a sample Promoter application

A software tool for high-level data parallel programming has to cover a wide range of applications. Starting from regular applications in Quantum Chromodynamics (QCD) and Finite Element Methods with uniform mesh refinement (UMR), the programmer is faced with extremely complex (irregular and dynamic) structures, which play a key role in Finite Element Methods with adaptive mesh refinement (AMR) and in particle simulations. At first glance, it might appear hopeless to design and implement such a tool so that it produces executables with competitive runtimes. We do not assert that the PROMOTER environment is that tool, but we are convinced (and hope to convince the reader) that PROMOTER already has a lot of its features.

Let us consider a sample application, namely a little particle simulation. Our simulation space consists of a three-dimensional lattice of boxes. The boxes contain lists of particles (atoms, ions, electrons), and their size depends on the range of the interaction potentials. For instance, in a silicon simulation one has to take into account the Stillinger-Weber potential with a range of approximately 3.8 Ångström and the Coulomb potential with a range of approximately 20 Ångström.

Let us assume for simplicity that our boxes contain at most one particle. The motion of the particles is simulated by a Gear predictor/corrector algorithm [1]. Here, the programmer is faced with the problem that particles migrate between boxes, i.e. between processors. The question arises how to express this in PROMOTER, where the architecture and the process model are hidden from the programmer.

class AtomRef {                 // our simplified boxes
public:
    Atom* link_;

    void null() { link_ = 0; }

    index newIndex() { return link_->newIndex(); }
    // where the particle has to move to

    index oldIndex() { return link_->oldIndex(); }
    // the particle's current box
};

bool moved( const index& a, const index& b )


{ return ( a != b ); }

topology Space;        // the topology of the simulation space

Space<AtomRef> x;      // a configuration of particles

Space<index> index_new = x.newIndex();
// where do the particles have to go?

Space<index> index_old = x.oldIndex();
// where are the particles now?

Space<bool> particle_moved
    = (index_new != index_old);
// select the particles which move

Space<AtomRef> y = 0;

y[[ particle_moved ]] = x;      // copy the moved particles
x[[ particle_moved ]].null();   // delete the moved particles from their old boxes



topology Move_Space : Space, Space { };

Move_Space<bool> M =
    times( index_new, &moved, index_old );
// ... a Boolean matrix which "supports" the motion

topology Move_Rel : Move_Space {
    $i, $j, $k, $l : M[$i, $j, $k, $l] == true;
};  // dynamically create a communication topology

x = y ! Move_Rel;   // move the particles

Due to the lack of space, we are not able to discuss a real application in full detail. For a discussion of some aspects of FEM applications in PROMOTER, we refer the reader to [9]. Let us mention that it is possible to use so-called dynamic topologies in PROMOTER. These topologies allow the user, among other things, to add and/or remove points of a topology at runtime.

5. Performance results

We have implemented the PROMOTER compiler and the PROMOTER runtime system. The PAM has been ported to our in-house testbed MANNA on top of the parallel OS PEACE, to the IBM SP/2, and to Sun workstation clusters on top of MPI.

MANNA [10] is a parallel supercomputer developed at GMD FIRST. Each node has two Intel i860XP processors and 32 MB of memory. The nodes are interconnected by a multi-level crossbar. The total interconnection bandwidth in a 20-node system is 2 GB/s.

Several compiler optimizations have been tested with a model heat equation. The solver is implemented by an Euler forward-backward algorithm with Jacobi relaxation. The problem with Dirichlet boundary conditions is solved on a 100 x 100 grid with 2000 Jacobi iterations per time step.
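For reference, the numerical core of this benchmark can be sketched in plain sequential C++ as follows (our own reconstruction from the description above; only the grid size and the iteration count are taken from the text):

#include <vector>

typedef std::vector<std::vector<double> > Grid2D;

// One Jacobi sweep for the implicit (backward) part of a time step:
// (1 + 4c) * uNew[i][j] - c * (four neighbours) = rhs[i][j].
void jacobi_sweep(const Grid2D& u, Grid2D& uNew, const Grid2D& rhs,
                  double c, int n)   // n = 100 in the benchmark
{
    for (int i = 1; i < n - 1; ++i)      // Dirichlet boundary values stay fixed
        for (int j = 1; j < n - 1; ++j)
            uNew[i][j] = (rhs[i][j] + c * (u[i+1][j] + u[i-1][j]
                                         + u[i][j+1] + u[i][j-1]))
                         / (1.0 + 4.0 * c);
}
// The benchmark performs 2000 such sweeps per time step.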

Figure 1. Runtime of the application on MANNA (runtime vs. number of nodes; PCC -O2 and PCC -O4)

Figure 2. Runtime of the application on SP/2 (runtime vs. number of nodes; pcc, pcc -O2, and pcc -O4)

Without optimizations, the compiler inserts code for communication and local operations for every iteration step. With the generic interface, the communication patterns are established as they are needed, and the operations on local data points are performed using a generic iterator.

The first optimization, called communication scheduling, detects loop-invariant communication patterns and moves their generation to the prefix of the loop. Thus, the communication patterns are reused inside the iteration for the communication operations.

The second optimization specializes the generic iteration over the local data points, taking advantage of the regular structure of the problem, as sketched below.
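The contrast can be sketched as follows (illustrative C++ only; the names are our assumptions):

// Generic iteration: works for any topology via an iterator abstraction,
// at the price of one indirection per local point.
template <class Iter, class Op>
void apply_generic(Iter first, Iter last, Op op)
{
    for (; first != last; ++first)
        op(*first);
}

// Specialized iteration: exploits the dense rectangular structure of the
// benchmark's grid with plain nested loops and direct addressing.
void apply_specialized(double* u, int rows, int cols)
{
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            u[i * cols + j] *= 0.25;   // placeholder local operation
}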

Figures 1 and 2 show the results obtained on MANNA and on the SP/2.


Benchmark |  4 Nodes       |  8 Nodes       | 16 Nodes
          |  PRO    PVM    |  PRO    PVM    |  PRO    PVM
MM        |  0.281  0.246  |  0.155  0.137  |  0.088  0.083
RL        |  0.280  0.261  |  0.137  0.132  |  0.075  0.072
CG        |  8.23   8.21   |  4.12   4.11   |  2.36   2.34

Table 1. PROMOTER vs. PVM

The above benchmark programs in PRL are optimized by communication scheduling. The results show that PROMOTER can achieve almost the performance of hand-written MPI or PVM code, thanks to proper runtime support and intelligent optimization by the compiler.

A PROMOTER program is also more concise than the corresponding MPI or PVM program. For the above benchmarks, the number of lines of code is listed in Table 2.

Table 2. Code length of PROMOTER vs. PVM

6. Comparisons and conclusion

In this paper, we have presented high-level data parallel programming in PROMOTER. It leads to an abstract programming style on distributed memory parallel machines, while an efficient implementation of data parallel applications is achieved by object-oriented runtime support and advanced compilation and optimization techniques.

In recent years, there have been major efforts in developing language, runtime library, and compiler support for programming distributed memory machines. Roughly speaking, there are two major directions in these efforts: in the Fortran world, HPF [12], Fortran D [13], Vienna Fortran [6], and others have been developed; in the C++ world, HPC++ [21], ICC++ [7], MPC++ [14], pC++ [16], EC++ [20], and others are in progress.

Most of these approaches support two-level data parallel programming. For example, the Fortran D compiler inserts calls to the Multiblock Parti [19] and CHAOS [8] library routines to manage communication. The Multiblock Parti library implements regular data distributions and regular accesses to distributed arrays, while the CHAOS library supports irregular access patterns on distributed arrays. In the HPC++ framework, a C++ library along with compiler directives supports data parallel C++ programming.

PROMOTER is also a two-level data parallel programming approach. In contrast, however, it supports not only distributed arrays but also distributed objects with the following features:

• Distributed objects can be configured not only on rectangular spatial structures like arrays, but also on arbitrary (non-rectangular, sparse, or irregular) spatial structures.

• The data partition can be chosen from a set of user-defined and predefined mapping strategies.

In PROMOTER, the embedding of an application's spatial domains in index spaces allows more static optimization to be done at compile time. By providing valuable application-specific information, it generally eases the task of mapping for the compiler and runtime system. Particularly in numerical applications, spatial structures are often based on geometrical information, which can be exploited directly by the mapping subsystem.

Recently, CHAOS++ [5] has been released. It subsumes CHAOS and Multiblock Parti, and provides additional support for distributed pointer-based data structures. In PROMOTER, support for distributed pointer-based data structures will be handled by dynamic topologies [18], i.e. a distributed object can change its shape at runtime. This provides a conceptual equivalent to the dynamic creation or expansion of pointer-based data structures.

Implementing dynamic topologies is part of our future work. Dynamic topologies are most often needed in adaptive applications, in which the problem domain or the spatial structure has to be changed at runtime according to intermediate results. In considering dynamic topologies, we must also take into account dynamic mapping, since distributions also have to be changed dynamically according to


modified topologies. Our preliminary work on these topics shows that an efficient implementation of dynamic topologies seems possible if some regularity in the adaptation algorithm can be exploited.

References

[1] M. P. Allen and D. J. Tildesley. Computer Simulation of Liquids. Clarendon Press, Oxford, 1994.

[2] B. Appelbe and D. Bergmark. Software Tools for High-Performance Computing: Survey and Recommendations. Scientific Programming, 5(3), 1996.

[3] M. Besch and H. W. Pohl. Topographic data mapping by balanced hypersphere tessellation. In Proc. Euro-Par '96, Lyon, France, August 1996, Lecture Notes in Computer Science 1124, pages 455-458. Springer, 1996.

[4] H. Bi. Towards abstraction of message passing programming. In Proc. of the International Conference on Advances in Parallel and Distributed Computing, pages 100-107, Shanghai, China, March 1997. IEEE CS Press.

[5] C. Chang, J. Saltz, and A. Sussman. CHAOS++: A Runtime Library for Supporting Distributed Dynamic Data Structures. Technical report, Center for Research on Parallel Computation, Rice University, Nov 1995.

[6] B. Chapman, P. Mehrotra, and H. Zima. Programming in Vienna Fortran. Scientific Programming, 1(1):31-50, 1992.

[7] A. A. Chien and J. Dolby. The Illinois Concert System: A Problem-Solving Environment for Irregular Applications. In Proc. of DAGS'94, The Symposium on Parallel Computation and Problem Solving Environments, 1994.

[8] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang. Communication Optimization for Irregular Scientific Computation on Distributed Memory Architectures. Journal of Parallel and Distributed Computing, 22(3):462-479, Sep 1994.

[9] J. Gerlach, G. Heber, and A. Schramm. Finite element methods in the Promoter programming model. In Proc. Internat. EUROSIM Conference on HPCN Challenges in Telecomp and Telecom, Delft, Netherlands, June 1996.

[10] W. K. Giloi, U. Brüning, and W. Schröder-Preikschat. MANNA: Prototype of a Distributed Memory Architecture with Maximized Sustained Performance. In Proc. Euromicro PDP96 Workshop, 1996.

[11] W. K. Giloi, M. Kessler, and A. Schramm. Promoter: A high-level object-parallel programming language. In Proc. of the Internat. Conf. on High Performance Computing, New Delhi, India, Dec. 1995.

[12] High Performance Fortran Forum. High Performance Fortran Language Specification V1.1. Technical report, http://www.erc.msstate.edu/hpff/hpf-report-ps/hpf-v11.ps, 1994.

[13] S. Hiranandani, K. Kennedy, and C.-W. Tseng. Compiling Fortran D for MIMD Distributed-Memory Machines. Communications of the ACM, 35(8):66-80, Aug 1992.

[14] Y. Ishikawa. MPC++ Programming Language V1.0 Specification with Commentary, Document Version 0.1. Technical Report TR-94014, Real World Computing Partnership, Jun 1994.

[15] B. W. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. The Bell System Technical Journal, pages 291-307, Feb 1970.

[16] A. Malony, B. Mohr, D. Beckman, D. Gannon, S. Yang, F. Bodin, and S. Kesavan. A Parallel C++ Runtime System for Scalable Parallel Systems. In Proc. of Supercomputing '93, pages 140-152. IEEE CS Press, Nov 1993.

[17] A. Schramm. Concepts and formal description of the Promoter language, version 1.0. Technical Report RWC-TR-94-018, http://www.first.gmd.de/promoter/papers/, 1994.

[18] A. Schramm. Irregular applications in Promoter. In W. K. Giloi, S. Jaenichen, and B. Shriver, editors, Proc. of the Internat. MPPM Conference, Berlin, Germany, Oct. 1995. IEEE CS Press.

[19] A. Sussman, G. Agrawal, and J. Saltz. A Manual for the Multiblock Parti Runtime Primitives, Revision 4.1. Technical Report CS-TR-3070 and UMIACS-TR-93-36.1, University of Maryland, Department of Computer Science and Institute for Advanced Computer Studies, Dec 1993.

[20] The EUROPA Working Group on Parallel C++ Architecture SIG. EC++ - EUROPA Parallel C++ Draft Definition. Technical report, 1995.

[21] The HPC++ Working Group. HPC++ White Paper. Technical Report TR 95633, Center for Research on Parallel Computation, Rice University, 1995.
