
High-Level Data Parallel Programming in Promoter *

Matthias Besch, Hua Bi, Peter Enskonatus, Gerd Heber, Matthias Wilhelmi
RWCP Massively Parallel Systems GMD Laboratory Berlin, Germany

GMD FIRST, Rudower Chaussee 5, D-12489 Berlin
{mb,bi,ens,heber,wilhelmi}@first.gmd.de

Abstract

Implementing realistic scientific applications on parallel platforms requires a high-level, problem-adequate, and flexible programming environment. The hybrid system PROMOTER pursues a two-level approach allowing easy and flexible programming at both the language and the library level. The core concept of PROMOTER's language model is its highly abstract and unified concept of data and communication structures. The paper briefly addresses the programming model, but focuses on implementation aspects of the compiler and runtime system. Finally, performance results are given, evaluating the efficiency of the PROMOTER system.

1. Introduction

Despite much research work in the area of parallel programming environments and compiler development over the past decade, the burden of parallel programming still lies in the automatic, transparent, and efficient treatment of locality. The problem arises whenever one has to distinguish between (cheap) local and (expensive) remote data access.

Today, nearly everyone seems to agree that the acceptance of parallel computing, and in particular of computing on distributed memory environments, mainly depends on the quality of a high-level programming model, which should provide powerful abstractions in order to free the programmer from the burden of dealing with low-level issues such as data layout or communication.

A high-level programming model drastically facilitates the building of large-scale applications; however, it leaves the complex task of bridging the large semantic gap between application and parallel machine to the compiler and runtime system.

*This research is supported by the Real World Computing Partnership (RWCP), Japan.

Data parallel programming languages such as HPF [12], Fortran D [13], Vienna Fortran [6], HPC++ [21], and the like already relax this problem considerably. Intended for use in scientific computing, these languages focus on dense, regular array computations. By expressing parallelism through constructs that guarantee the independence of statements in blocks and loops, these languages provide a true abstraction from the underlying machine.

Optimizing data alignment, data partitioning, and the assignment to a processor graph is often still too complex. In many systems, the compiler can (and should) therefore be supported by additional user-defined tuning mechanisms, such as the alignment and distribution directives in HPF. In the case of irregular data and dependence structures, however, data parallel languages often lose their expressive power: these structures have to be modeled via indirect indexing and thus are not supported by the compiler, but have to be handled dynamically at runtime.

In this paper, we present the hybrid, two-level approach used in PROMOTER [11]. The library level provides an abstraction of communication and establishes a flexible and efficient runtime system on top of standard platforms. The language level defines a high-level programming model and introduces an abstraction of data partitioning and distribution. At this level, PROMOTER is a coordination language embedded in an imperative, object-oriented host language (C++).

The underlying design principles of Promoter are best described by the terms polymorphic high-level data parallelism and uniformity of data and dependence structures. Promoter programs are written by concurrently executing methods on all points of an aggregate object. By exploiting the polymorphism of object-oriented languages, the strict homogeneity of data and computational domains in data parallel languages can be relaxed to some degree of locally autonomous execution.

Uniformity here means that both (irregular) data structures and (irregular) dependences between them are represented by the same high-level construct. This considerably facilitates the exploitation of application-dependent knowledge by the compiler and runtime system.


In contrast to most other data parallel languages, PROMOTER supports not only dense rectangular domains (arrays) but also sparse or irregular structures, which may be regularly constructed or irregularly enumerated. So, typically, instead of using nested (forall) loops, concurrent operations are expressed as a sequence of statements, each of which describes a dependence relation from a source to a target object or some subset thereof.

The paper is organized as follows. Section 2 introduces the data parallel programming model of PROMOTER and presents its core language concepts. Section 3 gives an overview of the compilation system, reflecting the previously mentioned two levels. Section 4 then addresses application programming and discusses some problems and their solutions in PROMOTER. Section 5 presents performance results evaluating the PROMOTER runtime system and the quality of the compiler optimizations. Finally, Section 6 takes a look at comparable systems and gives a brief outlook on future work.

2. The programming model - basic concepts

Within the framework of this paper, we can only give a cursory introduction to PROMOTER. The following subsections briefly introduce some basic concepts; for a more exhaustive introduction we refer the reader to [17].

2.1. Topologies

A topology, or index space, is some arbitrary, possibly irregular and dynamic subset of Z^n, where Z is the set of integers. It allows the programmer to model spatial data structures and communication (or dependence) relations in a problem-oriented way.

Using topologies to model spatial data structures, the expressive power of PROMOTER goes far beyond computing with dense regular arrays. In the example below, we declare a model triangle of a Finite Element Method (FEM), which is k times regularly refined. The dimension and range of a topology are defined by an expression like 0:M, 0:N or by an already defined topology, like Grid in InnerGrid. The expressions in curly braces are called the constraints of the topology; they define all valid indices within the declared range.

topology Grid : 0:M, 0:N { };    // M x N grid

topology InnerGrid : Grid {
    $a:1:(M-1), $b:1:(N-1);
};                               // interior of the grid

topology Element[k] : 0:pow(2,k), 0:pow(2,k), 0:pow(2,k) {
    $a, $b, $c : a + b + c = pow(2,k);
};

2.2. Distributed types and objects

Distributed types are indexable structures which are built from given types (classes) and indexed by data topologies. Distributed objects are instances of distributed types:

class T;
Grid<T> g;   // g is a distributed object of class T
             // over the topology Grid

2.3. Data parallel operations

By data parallel operations we mean that the same operation is performed at all points of a data structure. This can be expressed by calling a method (operation) on a distributed object. Such a call is then performed by operations replicated over the entire topology. As usual in data parallel languages, replicated operations are defined by lifting the function result and parameter types to the distributed type with respect to the employed topology:

class T;
Grid<T>   g;
Grid<int> h;
int f( T& );
g.T::method();   // replicated method call on every element of g
h = f( g );      // lifted (element-wise) application of f

2.4. Communication

Communication relations in PROMOTER are also expressed by means of topologies. A communication topology defines a relation between data points, i.e. it specifies a subset of the Cartesian product of the target and source data topologies.

In the example below, we declare a communication topology that "supports" the five-point discretization of the Laplace operator on a grid.

topology Laplace_5 : Grid, Grid {
    $a, $b, a + 1, b;   // right neighbour
    $a, $b, a - 1, b;   // left neighbour
    $a, $b, a, b + 1;   // upper neighbour
    $a, $b, a, b - 1;   // lower neighbour
};

The actual communication can then be described by evaluating the so-called communication product, which employs the concept of matrix multiplication. The result (represented by the distributed object y) of the communication product of the distributed objects x and c is defined as

y_{ij} = \sum_k x_{ik} \cdot c_{kj}


where the transfer operation "*" and the reduce operation "+" (from \sum) can be overloaded, and a few validity rules regarding undefined items have to be considered.

In a typical application of this general transduce operation (transfer + reduce), an operator ! is defined to denote the important special case of a vector-matrix or matrix-vector operation, where "*" and "+" have their usual default semantics. In this case, one of the outer index spaces (here the one with respect to i) vanishes:

y_j = \sum_k x_k \cdot c_{kj}

The matrix then represents a communication topology as illustrated in the following example.

Grid<double> g, h;
h[[ InnerGrid ]] = (g ! Laplace_5) - 4 * g;

g ! Laplace_5 transfers each data element of g to its four corresponding neighbours in h, where they are reduced by the default operator +. The example also shows that, by the so-called selection, we restrict the application of the Laplace operator to the inner grid points. In general, a selection is a very elegant tool for accessing a subset of the data elements of a distributed object.
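To make these semantics concrete, the following plain sequential C++ fragment (our own reference sketch, not PROMOTER or PRL code; the function name and the array representation are illustrative assumptions) spells out what the statement above computes on an M x N grid:

#include <vector>

// Reference semantics of  h[[ InnerGrid ]] = (g ! Laplace_5) - 4 * g :
// every inner point receives the values of its four neighbours along the
// Laplace_5 relation, reduced by the default operator "+".
void laplace5_reference(const std::vector<std::vector<double> >& g,
                        std::vector<std::vector<double> >& h,
                        int M, int N)
{
    for (int a = 1; a < M - 1; ++a)        // the InnerGrid selection
        for (int b = 1; b < N - 1; ++b) {
            double sum = g[a + 1][b] + g[a - 1][b]   // transferred values,
                       + g[a][b + 1] + g[a][b - 1];  // reduced by "+"
            h[a][b] = sum - 4.0 * g[a][b];
        }
}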

The second special case of the communication product is called the cross product. Here, the inner topology (the reduction space) is missing, such that the result defines the outer product:

y_{ij} = x_i \cdot c_j

In PROMOTER syntax the cross product looks like:

y = times(x, *, c);

The general form of the transduce operation is expressed by

y = [R] transduce(x, c, +, *);

where the topology R denotes the reduction space. If the parameter R is left out, no reduction takes place.

In this case, the reduction operator can be omitted, and the matrix multiplication of transduce degenerates to the outer product of times:

y = transduce(x, c, *);
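The following sequential C++ sketch (our own notation; the edge-list representation of the communication topology and all names are illustrative assumptions, not the PRL interface) captures this general transfer-and-reduce semantics for overloadable "+" and "*":

#include <cstddef>
#include <functional>
#include <vector>

// One entry per point of the communication topology: a (source, target)
// index pair plus a matrix entry c(k,j); a weight of 1 models a pure
// dependence relation.
struct Edge { std::size_t src, dst; double weight; };

// y[j] = reduce over all edges (k,j) of transfer(x[k], c(k,j)).
// The initial value 0.0 assumes a "+"-like reduce operator with identity 0.
std::vector<double> transduce_reference(
    const std::vector<double>& x,
    const std::vector<Edge>& c,
    std::size_t targetSize,
    const std::function<double(double, double)>& reduce,    // "+"
    const std::function<double(double, double)>& transfer)  // "*"
{
    std::vector<double> y(targetSize, 0.0);
    for (std::size_t e = 0; e < c.size(); ++e)
        y[c[e].dst] = reduce(y[c[e].dst], transfer(x[c[e].src], c[e].weight));
    return y;
}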

3. The Promoter compilation system

In the following two subsections, we give a brief overview of the compiler and the runtime system of PROMOTER.

3.1. Compiler

The compiler works as a source-to-source translator. It accepts PROMOTER programs as input and generates C++ SPMD programs.

The general strategy is to recognize all PROMOTER constructs, to transform them into an internal, efficiently usable form, and to leave the rest of the source program as close as possible to the original, which later helps the user in debugging and monitoring. The compiler consists of the following components.

The front-end implements the lexical, syntactical, and semantic analysis and builds an abstract syntax tree (AST) and the symbol table. Syntactically incorrect programs are rejected.

In the analysis phase, the compiler tries to gain as much knowledge about the source program, in particular about the data and communication topologies and the use of the PROMOTER constructs, as is necessary for the succeeding phases. It generates descriptors in the AST which give a condensed view of the PROMOTER constructs. The objective of the analysis phase is two-fold: first, to analyze those parts of the C++ code that are needed for the symbolic evaluation of constraints and selections; second, to recognize the PROMOTER-specific constructs and generate the corresponding descriptors.

In the succeeding mapping pass, the data and communication topologies are used to find a sufficiently good distribution of data elements and threads over the set of computing nodes. A program is subdivided into a sequence of phases; within each phase, the distributed variables related to each other by communication operators are subject to a common mapping process. Guided by high-level pragmas, the compiler can choose between different mapping algorithms and combinations thereof. Aside from simple tiling strategies, there are general graph-based multi-partitioning algorithms, such as an improved and generalized version of the Kernighan-Lin heuristic [15] or spectral bisection, as well as a fast topographic mapping approach called BHT [3]. The latter is particularly appropriate for finite-neighborhood communication within a single index space.

The optimization phase works mainly on the descriptors, which reflect the transformations for the generation of a message passing SPMD program. It tries to reduce the initially generated number of messages and synchronization points.

The final pass of the compiler generates the SPMD program code in C++. It evaluates and modifies the descriptors while traversing the AST and generates the calls to the executing primitives provided by the PROMOTER runtime system.


3.2. Runtime system

The PROMOTER runtime system consists of the PROMOTER RUNTIME LIBRARY (PRL) and an underlying PROMOTER ABSTRACT MACHINE (PAM). The PRL is architecture-independent, while the PAM must be ported to each platform.

The PROMOTER runtime system is implemented in the SPMD model with a lock-step synchronization scheme. That is, the processes created by the SPMD program alternate between communication and computation phases. In the communication phase, remote elements are sent and received; in the computation phase, operations are performed locally.
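Schematically, each SPMD process therefore has the following shape (a minimal sketch with hypothetical helper names, not actual generated code):

// Hypothetical helpers standing in for generated communication and
// computation code.
void exchange_remote_elements();  // send/receive remote elements
void compute_local_elements();    // purely local operations
void barrier();                   // lock-step synchronization point

void spmd_main_loop(int steps)
{
    for (int t = 0; t < steps; ++t) {
        exchange_remote_elements();  // communication phase
        compute_local_elements();    // computation phase
        barrier();                   // keep all processes in lock-step
    }
}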

PAM is responsible for communication and synchronization between domains of the underlying architecture. The send and receive operations are nonblocking, in order to overlap local and remote operations. To exploit the capabilities of the hardware, most of the resource management is located in the PAM. The dual-processor system MANNA, for instance, can perform all buffering and communication activities on the communication processor, while the application processor executes the local operations of the user program.

PRL provides the executing primitives for running PROMOTER programs. More precisely, it provides template classes for distributed objects and template functions on distributed objects. These functions provide the basic data parallel operations, for example data parallel assignments, data parallel function calls, point-to-point communication, and collective communication (reduction, expansion, cross product, and communication product).

Distributed objects are implemented in an object-oriented way. They are based on the class Distribution, which is defined by a spatial structure Topology and a mapping strategy Mapping. A Topology specifies the valid data points of the problem's spatial structure; their location on one of the physical computing domains (nodes) is determined by the Mapping. Concrete Topology, Mapping, and Distribution classes can be predefined by PRL, generated by the compiler, or defined by the programmer. The only condition for applying them is that they follow the same class interface.

PRL provides a set of data topologies, such as Array, BandArray, MaskedArray, Point_Set, and Topology_Union, and a set of mapping strategies, such as BlockMapping, Cycle_BlockMapping, BHTMapping, and GeneralGraphMapping. Different distributions are defined for the different combinations of topologies and mapping strategies. For distributed objects, static element types and dynamic element types (e.g. containing pointers) are distinguished, because they require different implementations of data packing and unpacking for communication.
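The following C++ sketch illustrates what such a common class interface might look like (the member names are our assumptions, not the actual PRL signatures):

#include <cstddef>

// A Topology enumerates the valid data points of a spatial structure.
class Topology {
public:
    virtual ~Topology() {}
    virtual std::size_t size() const = 0;             // number of valid points
    virtual bool contains(const int* index) const = 0;
};

// A Mapping places every point on one of the computing domains (nodes).
class Mapping {
public:
    virtual ~Mapping() {}
    virtual int nodeOf(std::size_t point) const = 0;
};

// Example strategy: contiguous blocks of points, one block per node.
class BlockMapping : public Mapping {
    std::size_t points_;
    int nodes_;
public:
    BlockMapping(std::size_t points, int nodes)
        : points_(points), nodes_(nodes) {}
    int nodeOf(std::size_t p) const
    { return static_cast<int>(p / ((points_ + nodes_ - 1) / nodes_)); }
};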

PRL also provides a set of communication topologies, such as One_To_One and One_To_Many. Based on a communication topology and the two distributions representing the distributed objects involved in a communication, a communication pattern can be generated. A communication pattern defines local indices that designate which elements will be sent and at which indices received elements will be operated with local elements.
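Such a pattern can be pictured as the following data structure (a sketch; the field names are our assumptions):

#include <vector>

// For each peer domain: which local elements to pack and send, and at
// which local indices arriving elements are combined with local ones.
struct CommPattern {
    struct PeerPlan {
        int peer;                      // remote domain (node) id
        std::vector<int> sendIndices;  // local elements to be sent
        std::vector<int> recvIndices;  // where received elements are applied
    };
    std::vector<PeerPlan> plans;       // one plan per communicating peer
};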

PRL provides runtime support to generate communication patterns through template functions (which correspond to the basic communication forms in PROMOTER). Because these template functions take a communication topology and distributions (not distributed objects) as arguments, three optimization possibilities arise.

First, a generated communication pattern can be reused by different communications if the same communication topology and the same distribution relation between distributed objects apply. Second, a communication pattern can be lifted out of a loop if the results of communication and computation in the loop do not change their spatial structure and data partition, as sketched below. Third, the computation of the communication patterns for subsequent communications can be overlapped with the current communication if those patterns do not depend on the result of the current communication. Based on our experiments, these optimizations are applicable in a large class of applications.
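The second possibility, for instance, might take the following form in generated code (an illustrative sketch; the type and function names are ours, not the real PRL API):

// Opaque types standing in for the corresponding PRL concepts.
struct CommTopology;
struct Distribution;
struct Pattern;

Pattern* makePattern(const CommTopology&, const Distribution&,
                     const Distribution&);   // pattern generation
void communicate(Pattern*);                  // pattern-driven communication
void computeLocally();                       // local part of the iteration

void iterate(const CommTopology& rel, const Distribution& dx,
             const Distribution& dy, int steps)
{
    Pattern* p = makePattern(rel, dx, dy);   // hoisted: loop-invariant
    for (int t = 0; t < steps; ++t) {
        communicate(p);      // the same pattern is reused in every iteration
        computeLocally();    // spatial structure and partition are unchanged
    }
}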

Based on communication patterns, PRL provides a set of template functions for collective communication. They are implemented by the so-called four-phase scheme.

In the first phase, all local data elements which must be sent to remote domains are collected, and an asynchronous send in PAM is started to transmit them. Several send modes (e.g. non-buffering send, message vectorization) allow an efficient use of buffers and minimize the number of generated messages. At the same time, asynchronous receives are initiated according to the information in the communication pattern, i.e. according to the domains from which messages will be received.

In the second phase, the local operations are performed. They are executed before remote elements are received, in order to tolerate communication delay.

In the third phase, the process waits for messages from all other domains. As soon as a message from one domain has been received completely, the relevant local data elements are combined with the remote data elements according to the communication pattern. Because the operations performed here are associative, as required by the PROMOTER language, they can be applied immediately after the message from one domain has arrived. In this way, operation and communication are overlapped.

In the fourth phase, a synchronization is issued to ensure that the asynchronous send and receive operations in PAM have finished.
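Put together, the four phases have roughly the following shape (a sketch; the PAM primitive names are our assumptions, not the actual PAM interface):

#include <cstddef>
#include <vector>

struct Message { int peer; std::vector<char> data; };
void pam_isend(int peer, const std::vector<char>& buf);  // asynchronous send
void pam_irecv(int peer);                                // asynchronous receive
Message pam_wait_any();     // wait for the next completed receive
void pam_sync();            // wait until all asynchronous operations finish
std::vector<char> packFor(int peer);       // collect local elements to send
void combineWithLocal(const Message& m);   // apply reduce op at recv indices
void computeLocalPart();                   // local operations

void four_phase(const std::vector<int>& peers)
{
    for (std::size_t i = 0; i < peers.size(); ++i) {  // phase 1: start sends
        pam_isend(peers[i], packFor(peers[i]));       // and post receives
        pam_irecv(peers[i]);
    }
    computeLocalPart();              // phase 2: local work first, to
                                     // tolerate communication delay
    for (std::size_t i = 0; i < peers.size(); ++i) {
        Message m = pam_wait_any();  // phase 3: combine each message as it
        combineWithLocal(m);         // arrives (the reduce op is associative)
    }
    pam_sync();                      // phase 4: ensure all asynchronous
}                                    // operations in PAM have finished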


Since the operations on distributed objects are implemented by template functions, the runtime library provides a generic interface for data parallel applications. By generic, we mean that operations on distributed objects are implemented within the same framework for different topologies, different mapping strategies, and different element types. Therefore, it is easy for a compiler to generate code for a PROMOTER program.

The runtime library also allows an efficient implementation of the PROMOTER language, since the compiler can

• specialize the container classes (e.g. topology and mapping) of distributed objects,

• make container classes (e.g. topology and mapping) sharable by different distributed objects,

• let a communication pattern be reused by different communications, e.g. in an iteration,

• overlap communication and computation.

In summary, the runtime library defines a generic interface and allows application-dependent specialization and optimization applicable by a compiler. Besides, the runtime library is also designed as a user-level library to support abstract message passing programming that assumes no special support from a compiler [4].

4. Programming in Promoter

"Applications programming for high-performance computing is notoriously difficult. Although parallel programming is intrinsically complex, the principal reason why high-performance computing is difficult is the lack of software tools ... which leads to wasted computer resources and inhibits the use of high-performance parallel computers by scientists" [2].

4.1. General remarks

• PROMOTER code is architecture independent.

• The programmer does not have to worry about programming message passing and data distribution. (At the level of the runtime system he can pass additional information, for instance about the mapping, but the architecture is hidden at this level as well.)

• PROMOTER allows rapid prototyping on parallel architectures.

• There is a high degree of flexibility offered to the programmer, because he can also write, modify, and optimize his code at the level of the PROMOTER runtime system. This is interesting also for porting optimized parts of existing software.

• PROMOTER code achieves almost the performance of hand-written MPI or PVM code.

• Writing PROMOTER code means, among other things, writing less code.

4.2. The flavour of a sample Promoter application

A software tool for high-level data parallel programming has to cover a wide range of applications. Starting from regular applications in Quantum Chromodynamics (QCD) and Finite Element Methods with uniform mesh refinement (UMR), the programmer is faced with extremely complex (irregular and dynamic) structures, which play a key role in Finite Element Methods with adaptive mesh refinement (AMR) and in particle simulations. At first glance, it might appear hopeless to design and implement such a tool so that it produces executables with competitive runtimes. We do not assert that the PROMOTER environment is that tool, but we are convinced (and hope to convince the reader) that PROMOTER already has a lot of its features.

Let us consider a sample application, namely a little particle simulation. Our simulation space consists of a three-dimensional lattice of boxes. The boxes contain lists of particles (atoms, ions, electrons), and their size depends on the range of the interaction potentials. For instance, in a silicon simulation one has to take into account the Stillinger-Weber potential with a range of approximately 3.8 Ångström and the Coulomb potential with a range of approximately 20 Ångström.

Let us assume for simplicity that our boxes contain at most one particle. The motion of the particles is simulated by a Gear predictor/corrector algorithm [1]. Here, the programmer is faced with the problem that particles migrate between boxes, i.e. between processors. The question arises how to express this in PROMOTER, where the architecture and the process model are hidden from the programmer.

class AtomRef {                 // our simplified boxes
public:
    Atom* link_;

    void null() { link_ = 0; }

    index newIndex() { return link_->newIndex(); }
    // where the particle has to move to

    index oldIndex() { return link_->oldIndex(); }
    // the particle's current box
};

bool moved( const index& a, const index& b )


{ return ( a != b ); }

topology Space;        // the topology of the simulation space

Space<AtomRef> x;      // a configuration of particles

Space<index> index_new = x.newIndex();
// where do the particles have to go?

Space<index> index_old = x.oldIndex();
// where are the particles now?

Space<bool> particle_moved
    = (index_new != index_old);
// select the particles which move

Space<AtomRef> y = 0;

y[[ particle_moved ]] = x;      // copy the moved particles
x[[ particle_moved ]].null();   // delete the moved particles from their old boxes



topology Move_Space : Space, Space { };

Move_Space<bool> M =
    times( index_new, &moved, index_old );
// ... a Boolean matrix which "supports" the motion

topology Move_Rel : Move_Space {
    $i, $j, $k, $l : M[$i, $j, $k, $l] == true;
};  // dynamically create a communication topology

x = y ! Move_Rel;   // move the particles

Due to the lack of space, we are not able to discuss a real application in full detail. For a discussion of some aspects of FEM applications in PROMOTER, we refer the reader to [9]. Let us mention that it is possible to use so-called dynamic topologies in PROMOTER. These topologies allow the user, among other things, to add and/or remove points of a topology at runtime.

5. Performance results

We have implemented the PROMOTER compiler and the PROMOTER runtime system. The PAM has been ported to our in-house testbed MANNA on top of the parallel OS PEACE, to the IBM SP/2, and to Sun workstation clusters on top of MPI.

MANNA [10] is a parallel supercomputer developed at GMD FIRST. Each node has two Intel i860XP processors and 32 MB of memory. The nodes are interconnected by a multi-level crossbar. The total interconnection bandwidth in a 20-node system is 2 GB/s.

Several compiler optimizations have been tested with a model heat equation. The solver is implemented by an Euler forward-backward algorithm with Jacobi relaxation. The problem with Dirichlet boundary conditions is solved on a 100 x 100 grid with 2000 Jacobi iterations per time step.
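For reference, the numerical core of this benchmark can be sketched in plain sequential C++ as follows (our own reconstruction from the description above; only the grid size and the iteration count are taken from the text):

#include <vector>

typedef std::vector<std::vector<double> > Grid2D;

// One Jacobi sweep for the implicit (backward) part of a time step:
// (1 + 4c) * uNew[i][j] - c * (four neighbours) = rhs[i][j].
void jacobi_sweep(const Grid2D& u, Grid2D& uNew, const Grid2D& rhs,
                  double c, int n)   // n = 100 in the benchmark
{
    for (int i = 1; i < n - 1; ++i)      // Dirichlet boundary values stay fixed
        for (int j = 1; j < n - 1; ++j)
            uNew[i][j] = (rhs[i][j] + c * (u[i+1][j] + u[i-1][j]
                                         + u[i][j+1] + u[i][j-1]))
                         / (1.0 + 4.0 * c);
}
// The benchmark performs 2000 such sweeps per time step.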

Figure 1. Runtime of the application on MANNA (runtime vs. number of nodes; PCC -O2 and PCC -O4)

Figure 2. Runtime of the application on SP/2 (runtime vs. number of nodes; pcc, pcc -O2, and pcc -O4)

Without optimizations, the compiler inserts code for communication and local operations for every iteration step. With the generic interface, the communication patterns are established as they are needed, and the operations on local data points are performed using a generic iterator.

The first optimization, called communication scheduling, detects loop-invariant communication patterns and moves their generation to the prefix of the loop. Thus, the communication patterns are reused inside the iteration for the communication operations.

The second optimization specializes the generic iteration over the local data points, taking advantage of the regular structure of the problem, as sketched below.
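The contrast can be sketched as follows (illustrative C++ only; the names are our assumptions):

// Generic iteration: works for any topology via an iterator abstraction,
// at the price of one indirection per local point.
template <class Iter, class Op>
void apply_generic(Iter first, Iter last, Op op)
{
    for (; first != last; ++first)
        op(*first);
}

// Specialized iteration: exploits the dense rectangular structure of the
// benchmark's grid with plain nested loops and direct addressing.
void apply_specialized(double* u, int rows, int cols)
{
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            u[i * cols + j] *= 0.25;   // placeholder local operation
}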

Figures 1 and 2 show the results obtained on MANNA and on the SP/2.


Benchmark |  4 Nodes       |  8 Nodes       | 16 Nodes
          |  PRO    PVM    |  PRO    PVM    |  PRO    PVM
MM        |  0.281  0.246  |  0.155  0.137  |  0.088  0.083
RL        |  0.280  0.261  |  0.137  0.132  |  0.075  0.072
CG        |  8.23   8.21   |  4.12   4.11   |  2.36   2.34

Table 1. PROMOTER vs. PVM

The above benchmark programs in PRL are optimized by communication scheduling. The results show that PROMOTER can achieve almost the performance of hand-written MPI or PVM code, thanks to proper runtime support and intelligent optimization by the compiler.

A PROMOTER program is also more concise than the corresponding MPI or PVM program. For the above benchmarks, the number of lines of code is listed in Table 2.

Table 2. Code length of PROMOTER vs. PVM

6. Comparisons and conclusion

In this paper, we have presented high-level data parallel programming in PROMOTER. It leads to an abstract programming style on distributed memory parallel machines, while an efficient implementation of data parallel applications is achieved by object-oriented runtime support and advanced compilation and optimization techniques.

In recent years, there have been major efforts in developing language, runtime library, and compiler support for programming distributed memory machines. Roughly speaking, there are two major directions in these efforts: in the Fortran world, HPF [12], Fortran D [13], Vienna Fortran [6], and others have been developed; in the C++ world, HPC++ [21], ICC++ [7], MPC++ [14], pC++ [16], EC++ [20], and others are in progress.

Most of these approaches support two-level data parallel programming. For example, the Fortran D compiler inserts calls to the Multiblock Parti [19] and CHAOS [8] library routines to manage communication. The Multiblock Parti library implements regular data distributions and regular accesses to distributed arrays, while the CHAOS library supports irregular access patterns on distributed arrays. In the HPC++ framework, a C++ library along with compiler directives supports data parallel C++ programming.

PROMOTER is also a two-level data parallel programming approach. In contrast, however, it supports not only distributed arrays but also distributed objects with the following features:

• Distributed objects can be configured not only on rectangular spatial structures like arrays, but also on arbitrary (non-rectangular, sparse, or irregular) spatial structures.

• The data partition can be chosen from a set of user-defined and predefined mapping strategies.

In PROMOTER, the embedding of an application's spatial domains in index spaces allows more static optimization to be done at compile time. By providing valuable application-specific information, it generally eases the task of mapping for the compiler and runtime system. Particularly in numerical applications, spatial structures are often based on geometrical information, which can be exploited directly by the mapping subsystem.

Recently, CHAOS++ [5] has been released. It subsumes CHAOS and Multiblock Parti, and provides additional support for distributed pointer-based data structures. In PROMOTER, support for distributed pointer-based data structures will be handled by dynamic topologies [18], i.e. a distributed object can change its shape at runtime. This provides a conceptual equivalent to the dynamic creation or expansion of pointer-based data structures.

Implementing dynamic topologies is part of our future work. Dynamic topologies are most often needed in adaptive applications, in which the problem domain or the spatial structure has to be changed at runtime according to intermediate results. In considering dynamic topologies, we must also take into account dynamic mapping, since distributions also have to be changed dynamically according to


modified topologies. Our preliminary work on these topics shows that an efficient implementation of dynamic topologies seems possible if some regularity in the adaptation algorithm can be exploited.

References

[1] M. P. Allen and D. J. Tildesley. Computer Simulation of Liquids. Clarendon Press, Oxford, 1994.

[2] B. Appelbe and D. Bergmark. Software Tools for High-Performance Computing: Survey and Recommendations. Scientific Programming, 5(3), 1996.

[3] M. Besch and H. W. Pohl. Topographic data mapping by balanced hypersphere tessellation. In Proc. Euro-Par '96, Lyon, France, August 1996, Lecture Notes in Computer Science 1124, pages 455-458. Springer, 1996.

[4] H. Bi. Towards abstraction of message passing programming. In Proc. of the International Conference on Advances in Parallel and Distributed Computing, pages 100-107, Shanghai, China, March 1997. IEEE CS Press.

[5] C. Chang, J. Saltz, and A. Sussman. CHAOS++: A Runtime Library for Supporting Distributed Dynamic Data Structures. Technical report, Center for Research on Parallel Computation, Rice University, Nov 1995.

[6] B. Chapman, P. Mehrotra, and H. Zima. Programming in Vienna Fortran. Scientific Programming, 1(1):31-50, 1992.

[7] A. A. Chien and J. Dolby. The Illinois Concert System: A Problem-Solving Environment for Irregular Applications. In Proc. of DAGS'94, The Symposium on Parallel Computation and Problem Solving Environments, 1994.

[8] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang. Communication Optimization for Irregular Scientific Computation on Distributed Memory Architectures. Journal of Parallel and Distributed Computing, 22(3):462-479, Sep 1994.

[9] J. Gerlach, G. Heber, and A. Schramm. Finite element methods in the Promoter programming model. In Proc. Internat. EUROSIM Conference on HPCN Challenges in Telecomp and Telecom, Delft, Netherlands, June 1996.

[10] W. K. Giloi, U. Brüning, and W. Schröder-Preikschat. MANNA: Prototype of a Distributed Memory Architecture with Maximized Sustained Performance. In Proc. Euromicro PDP96 Workshop, 1996.

[11] W. K. Giloi, M. Kessler, and A. Schramm. Promoter: A high-level object-parallel programming language. In Proc. of the Internat. Conf. on High Performance Computing, New Delhi, India, Dec. 1995.

[12] High Performance Fortran Forum. High Performance Fortran Language Specification V1.1. Technical report, http://www.erc.msstate.edu/hpff/hpf-report-ps/hpf-v11.ps, 1994.

[13] S. Hiranandani, K. Kennedy, and C.-W. Tseng. Compiling Fortran D for MIMD Distributed-Memory Machines. Communications of the ACM, 35(8):66-80, Aug 1992.

[14] Y. Ishikawa. MPC++ Programming Language V1.0 Specification with Commentary, Document Version 0.1. Technical Report TR-94014, Real World Computing Partnership, Jun 1994.

[15] B. W. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. The Bell System Technical Journal, pages 291-307, Feb 1970.

[16] A. Malony, B. Mohr, D. Beckman, D. Gannon, S. Yang, F. Bodin, and S. Kesavan. A Parallel C++ Runtime System for Scalable Parallel Systems. In Proc. of Supercomputing '93, pages 140-152. IEEE CS Press, Nov 1993.

[17] A. Schramm. Concepts and formal description of the Promoter language, version 1.0. Technical Report RWC-TR-94-018, http://www.first.gmd.de/promoter/papers/, 1994.

[18] A. Schramm. Irregular applications in Promoter. In W. K. Giloi, S. Jaenichen, and B. Shriver, editors, Proc. of the Internat. MPPM Conference, Berlin, Germany, Oct. 1995. IEEE CS Press.

[19] A. Sussman, G. Agrawal, and J. Saltz. A Manual for the Multiblock Parti Runtime Primitives, Revision 4.1. Technical Report CS-TR-3070 and UMIACS-TR-93-36.1, University of Maryland, Department of Computer Science and Institute for Advanced Computer Studies, Dec 1993.

[20] The EUROPA Working Group on Parallel C++ Architecture SIG. EC++ - EUROPA Parallel C++ Draft Definition. Technical report, 1995.

[21] The HPC++ Working Group. HPC++ White Paper. Technical Report TR 95633, Center for Research on Parallel Computation, Rice University, 1995.
