[IEEE Computer. Soc. Press Second International Workshop on High-Level Parallel Programming Models and Supportive Environments - Geneva, Switzerland (1 April 1997)] Proceedings Second International Workshop on High-Level Parallel Programming Models and Supportive Environments - High-level data parallel programming in Promoter



  • High-Level Data Parallel Programming in Promoter *

    Matthias Besch, Hua Bi, Peter Enskonatus, Gerd Heber, Matthias Wilhelmi RWCP Massively Parallel Systems GMD Laboratory Berlin, Germany

    GMD FIRST, Rudower Chaussee 5, D-12489 Berlin {mb,bi,ens,heber,wilhelmi}@first.gmd.de


    Implementing realistic scientific applications on parallel platforms requires a high-level, problem-adequate and flexible programming environment. The hybrid system PROMOTER pursues a two-level approach allowing easy and flexible programming at both language and library levels. The core concept of PROMOTER's language model is its highly abstract and unified concept of data and communication structures. The paper briefly addresses the programming model, but focuses on implementation aspects of the compiler and runtime system. Finally, performance results are given, evaluating the efficiency of the PROMOTER system.

    1. Introduction

    Despite much research in the area of parallel programming environments and compiler development over the past decade, the burden of parallel programming still lies in the automatic, transparent, and efficient treatment of locality. The problem arises whenever one has to distinguish between (cheap) local and (expensive) remote data access.

    Today everyone seems to agree that the acceptance of parallel computing, in particular on distributed memory environments, mainly depends on the quality of a high-level programming model, which should provide powerful abstractions in order to free the programmer from the burden of dealing with low-level issues such as data layout or communication.

    A high-level programming model drastically facilitates the building of large-scale applications. However, it leaves the complex task of bridging the large semantic gap between application and parallel machine to the compiler and runtime system.

    *This research is supported by the Real World Computing Partnership (RWCP), Japan.

    Data parallel programming languages such as HPF [12], Fortran D [13], Vienna Fortran [6], HPC++ [21], and the like already considerably relax this problem. Intended for use in scientific computing, these languages focus on dense regular array computations. Expressing parallelism by way of constructs guaranteeing the independence of statements in blocks and loops, these languages provide a true abstraction from the underlying machine.

    Optimizing data alignment, data partitioning, and assignment to a processor graph are often still too complex. In many systems, for example, the compiler can (and should) be supported by additional user-defined tuning mechanisms such as alignment and distribute directives in HPF. In case of irregular data and dependence structures, however, data parallel languages often lose their expressive power. These structures have to be modeled via indirect indexing and thus are not supported by the compiler but have to be handled dynamically during runtime.

    In this paper, we present a hybrid, two-level approach used in PROMOTER [11]. The library level provides an abstraction of communication and establishes a flexible and efficient runtime system on top of standard platforms. The language level defines a high-level programming model and introduces an abstraction of data partitioning and distribution. At this level, PROMOTER is a coordination language embedded in an imperative, object-oriented host language (C++).

    The underlying design principles of PROMOTER are best described by the terms polymorphic high-level data parallelism and uniformity of data and dependence structures. PROMOTER programs are written by concurrently executing methods on all points of an aggregate object. Exploiting the polymorphism of object-oriented languages, the strict homogeneity of data and computational domains in data parallel languages can be relaxed to some degree of locally autonomous execution.

    Uniformity here means that both (irregular) data structures and (irregular) dependences between them are represented by the same high-level construct. This considerably facilitates the exploitation of application-dependent knowledge by the compiler and runtime system.

    0-8186-7882-8/97 $10.00 © 1997 IEEE

    In contrast to most other data parallel languages, PROMOTER not only supports dense rectangular domains (arrays), but also sparse or irregular structures, which may be regularly constructed or irregularly enumerated. So, typically, instead of using nested (forall) loops, concurrent operations are expressed as a sequence of statements, each of which describes a dependence relation from a source to a target object or some subset thereof.

    The paper is organized as follows. Section 2 introduces the data parallel programming model of PROMOTER and presents its core language concepts. Section 3 gives an overview of the compilation system, reflecting the previously mentioned two levels. Then, Section 4 addresses application programming and discusses some problems and their solution in PROMOTER. Section 5 presents some performance results evaluating the PROMOTER runtime system and the quality of compiler optimization. Finally, Section 6 takes a look at comparable systems and gives a brief outlook on future work.

    2. The programming model - basic concepts

    In the framework of this paper, we can only give a cursory introduction to PROMOTER. The following subsections briefly introduce some basic concepts. For a more exhaustive introduction we refer the reader to [17].

    2.1. Topologies

    A topology or an index space is some arbitrary, possibly irregular and dynamic subset of $\mathbb{Z}^n$, where $\mathbb{Z}$ is the set of integers. It allows the programmer to model spatial data structures and communication (or dependence) relations in a problem-oriented way.

    Using topologies to model spatial data structures, the expressive power of PROMOTER goes far beyond computing with dense regular arrays. In the following example, we declare, among others, a model triangle of a Finite Element Method (FEM), which is k times regularly refined. The dimension and range of a topology are defined by an expression like 0:M, 0:N or by a previously defined topology like Grid in InnerGrid. The expressions in curly braces are called constraints of the topology; they define all valid indices within the defined range.

        topology Grid: 0:M, 0:N { };    // M x N grid

        topology InnerGrid: Grid {
            $a:1:(M-1), $b:1:(N-1);
        };                              // interior of the grid

        topology Element[k]: 0:pow(2,k), 0:pow(2,k), 0:pow(2,k) {
            $a, $b, $c |: a + b + c = pow(2,k);
        };

    2.2. Distributed types and objects

    Distributed types are indexable structures which are built from given types (classes) and indexed by data topologies. Distributed objects are instances of distributed types:

        class T;
        Grid g;    // g is a distributed object of class T
                   // over the topology Grid

    2.3. Data parallel operations

    By data parallel operations, we mean that the same operation is performed at all points of a data structure. This can be expressed by calling a method (operation) on distributed objects. Such a call is then performed by operations replicated over the entire topology. As usual in data parallel languages, replicated operations are defined by lifting the function result and parameter types to the distributed type with respect to the employed topology.

        class T;
        Grid g;
        Grid h;
        int f(T&);
        g.T::method();
        h = f(g);

    2.4. Communication

    Communication relations in PROMOTER are also expressed by means of topologies. A communication topology defines a relation between data points, i.e. it specifies a subset of the Cartesian product of target and source data topologies.

    In the example below, we declare a communication topology that supports the five-point discretization of the Laplace operator on a grid.

        topology Laplace_5: Grid, Grid {
            $a, $b, a + 1, b;    // right neighbour
            $a, $b, a - 1, b;    // left neighbour
            $a, $b, a, b + 1;    // upper neighbour
            $a, $b, a, b - 1;    // lower neighbour
        };
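To make the set view of topologies concrete, the following is a minimal C++ sketch, not PROMOTER code. It enumerates a grid, its interior, and the five-point neighbour relation as plain index sets. All names (`Index`, `Relation`, `grid`, `inner_grid`, `laplace5`) are hypothetical, and two assumptions are made: range bounds such as `0:M` are treated as inclusive, and pairs whose neighbour would fall outside the grid are dropped.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these names come from PROMOTER.
using Index = std::pair<int, int>;          // a point (a, b) of a 2-D index space
using Relation = std::pair<Index, Index>;   // one pair of related points

// All points of a grid with ranges 0:M and 0:N (bounds assumed inclusive).
std::vector<Index> grid(int M, int N) {
    assert(M >= 0 && N >= 0);
    std::vector<Index> pts;
    for (int a = 0; a <= M; ++a)
        for (int b = 0; b <= N; ++b) pts.push_back(Index(a, b));
    return pts;
}

// Interior of the grid: the constraint keeps 1 <= a <= M-1 and 1 <= b <= N-1.
std::vector<Index> inner_grid(int M, int N) {
    std::vector<Index> pts;
    for (const Index& p : grid(M, N))
        if (p.first >= 1 && p.first <= M - 1 && p.second >= 1 && p.second <= N - 1)
            pts.push_back(p);
    return pts;
}

// Five-point neighbour relation, restricted to neighbours that are valid grid points.
std::vector<Relation> laplace5(int M, int N) {
    const int da[4] = {1, -1, 0, 0};   // right, left, upper, lower
    const int db[4] = {0, 0, 1, -1};
    std::vector<Relation> rel;
    for (const Index& p : grid(M, N))
        for (int d = 0; d < 4; ++d) {
            const int a = p.first + da[d];
            const int b = p.second + db[d];
            if (a >= 0 && a <= M && b >= 0 && b <= N)
                rel.push_back(Relation(p, Index(a, b)));
        }
    return rel;
}
```

In this reading a data topology is a set of points and a communication topology is a set of point pairs, so both can be handled by the same machinery, which is the uniformity the text refers to.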

    The actual communication can then be described by evaluating the so-called communication product. A communication product employs the concept of matrix multiplication. The result (represented by the distributed object y) of the communication product on distributed objects x and c is defined as

    $$y_{ij} = \sum_k x_{ik} \cdot c_{kj},$$

    where the transfer operation "*" and the reduce operation "+" (from $\sum$) can be overloaded and a few validity rules regarding undefined items are to be considered.

    In a typical application of this general transduce operation (transfer + reduce), an operator ! is defined to denote the important special case of vector-matrix or matrix-vector operation, where "*" and "+" have the usual default semantics. In this case one of the outer index spaces (here w.r.t. i) vanishes:

    $$y_j = \sum_k x_k \cdot c_{kj}$$

    The matrix then represents a communication topology as illustrated in the following example.

        Grid g, h;
        h[|InnerGrid|] = (g ! Laplace_5) - 4 * g;

    g ! Laplace_5 transfers each data element of g to its corresponding four neighbours of h, where the received values are reduced by the default operator +. The example also shows that we restrict (by the so-called selection) the application of the Laplace operator to inner grid points. In general, a selection is a very elegant tool to access a subset of data elements from a distributed object.
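For readers who think in loops, the statement above can be approximated by the following sequential C++ sketch. It is only an illustration under stated assumptions: the field is dense, the selection covers the interior points, and points outside the selection keep their old value. The names `Field` and `laplace_step` are hypothetical, and nothing here reflects the code that the PROMOTER compiler actually generates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Field = std::vector<std::vector<double>>;   // field[a][b] over a dense grid

// Sequential stand-in for: restrict to the interior, gather the four
// neighbours of g, reduce them with +, then subtract 4 * g.
void laplace_step(const Field& g, Field& h) {
    assert(!g.empty() && g.size() == h.size() && g[0].size() == h[0].size());
    const std::size_t rows = g.size();
    const std::size_t cols = g[0].size();
    for (std::size_t a = 1; a + 1 < rows; ++a) {
        for (std::size_t b = 1; b + 1 < cols; ++b) {
            // "transfer" from the four neighbours, "reduce" with +
            const double gathered =
                g[a + 1][b] + g[a - 1][b] + g[a][b + 1] + g[a][b - 1];
            h[a][b] = gathered - 4.0 * g[a][b];
        }
    }
}
```

The loop bounds play the role of the selection, and the four-term sum plays the role of the communication product with the default operators.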

    The second special case of the communication product is called cross product. Here, the inner topology (reduction space) is missing such that the result defines the outer product:

    $$y_{ij} = x_i \cdot c_j$$

    In PROMOTER syntax the cross product looks like:

    y=times(x, *, c);

    The general form of the transduce operation is expressed by

    y=[R]transduce(x, c, +, * ) ;

    where the topology R denotes the reduction space. If the parameter R is left out, no reduction takes place.

    In this case, the reduction operator can be omitted, and the matrix multiplication of transduce degenerates to the outer product of times:

    y=transduce(x, c, * ) ;
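The relationship between the general form and its degenerate case can be sketched in plain C++ on dense matrices. This is a minimal illustration, not the PRL interface: the function names reuse `transduce` and `times` only for readability, the argument order and the explicit `init` seed for the reduction are assumptions, and undefined items are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <class T>
using Matrix = std::vector<std::vector<T>>;

// y[i][j] = reduce over k of transfer(x[i][k], c[k][j]); init seeds the reduction.
template <class T, class Reduce, class Transfer>
Matrix<T> transduce(const Matrix<T>& x, const Matrix<T>& c, T init,
                    Reduce reduce, Transfer transfer) {
    assert(!x.empty() && !c.empty() && x[0].size() == c.size());
    Matrix<T> y(x.size(), std::vector<T>(c[0].size(), init));
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < c[0].size(); ++j)
            for (std::size_t k = 0; k < c.size(); ++k)
                y[i][j] = reduce(y[i][j], transfer(x[i][k], c[k][j]));
    return y;
}

// Without a reduction space only the transfer remains: y[i][j] = transfer(x[i], c[j]).
template <class T, class Transfer>
Matrix<T> times(const std::vector<T>& x, const std::vector<T>& c, Transfer transfer) {
    Matrix<T> y(x.size(), std::vector<T>(c.size()));
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < c.size(); ++j)
            y[i][j] = transfer(x[i], c[j]);
    return y;
}
```

Passing addition and multiplication yields the ordinary matrix product; passing other callables (for example a maximum as the reduce step) corresponds to overloading the two operators.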

    3. The Promoter compilation system

    In the following two subsections we give a brief overview of the compiler and runtime system of PROMOTER.

    3.1. Compiler

    The compiler works as a source-to-source translator. It accepts PROMOTER programs as input and generates C++ SPMD programs.

    The general strategy is to recognize all PROMOTER constructs, to transform them into an internal and efficiently usable form, and to leave the rest of the source program as close as possible to the original, which will later help the user in debugging and monitoring. The overall structure of the compiler consists of the following components.

    The front-end implements the lexical, syntactical, and semantic analysis and builds an abstract syntax tree (AST) and the symbol table. Syntactically incorrect programs are rejected.

    In the analysis phase, the compiler tries to gain as much knowledge about the source program as is necessary for the succeeding phases, in particular about the data and communication topologies and the program's use of the PROMOTER constructs. It generates descriptors in the AST which give a condensed view of the PROMOTER constructs. The objective of the analysis phase is two-fold. First, it analyzes those parts of C++ that are needed for the symbolic evaluation of constraints and selections. Second, it recognizes the PROMOTER-specific constructs and generates corresponding descriptors.

    In the succeeding mapping pass, data and communication topologies are used to find a sufficiently good distribution of data elements and threads over the set of computing nodes. A program is subdivided into a sequence of phases; within each phase, the distributed variables related to each other by communication operators are subject to a mapping process. Supported by high-level pragmas, the compiler can choose between different mapping algorithms and combinations thereof. Aside from simple tiling strategies, there are general graph-based multi-partitioning algorithms such as an improved and generalized version of the Kernighan & Lin heuristic [15] or spectral bisection, as well as a fast topographic mapping approach called BHT [3]. The latter is particularly appropriate for finite-neighborhood communication within a single index space.
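As an illustration of the simplest of these strategies, the sketch below implements a plain block tiling of a dense grid over a rectangular arrangement of nodes, together with a count of neighbour pairs that end up on different nodes. It is a hypothetical example, not PROMOTER's mapper: the names `Tiling`, `owner` and `cut_edges` are assumptions, and the cut count only stands in for the kind of cost that the graph-based partitioners mentioned above try to reduce.

```cpp
#include <cassert>

// Hypothetical description of a block tiling: grid extent and node arrangement.
struct Tiling {
    int rows, cols;            // extent of the dense grid
    int node_rows, node_cols;  // nodes arranged as node_rows x node_cols
};

// Node that owns grid point (a, b) under a block distribution.
int owner(const Tiling& t, int a, int b) {
    assert(0 <= a && a < t.rows && 0 <= b && b < t.cols);
    const int tile_h = (t.rows + t.node_rows - 1) / t.node_rows;  // ceiling division
    const int tile_w = (t.cols + t.node_cols - 1) / t.node_cols;
    return (a / tile_h) * t.node_cols + (b / tile_w);
}

// Number of horizontally or vertically adjacent point pairs split across nodes.
int cut_edges(const Tiling& t) {
    int cut = 0;
    for (int a = 0; a < t.rows; ++a)
        for (int b = 0; b < t.cols; ++b) {
            if (a + 1 < t.rows && owner(t, a, b) != owner(t, a + 1, b)) ++cut;
            if (b + 1 < t.cols && owner(t, a, b) != owner(t, a, b + 1)) ++cut;
        }
    return cut;
}
```

For an 8 x 8 grid on four nodes, square tiles cut fewer neighbour pairs than row strips, which is why the choice of mapping matters for finite-neighbourhood communication.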

    The optimization phase works mainly on the descriptors, which reflect the transformations for the generation of a message passing SPMD program. It tries to reduce the initially generated number of messages and synchronization points.

    The final pass of the compiler generates the SPMD program code in C++. It evaluates and modifies the descriptors while traversing the AST and generates the calls to the executing primitives provided by the PROMOTER runtime system.


  • 3.2. Runtime system

    The PROMOTER runtime system consists of a PROMOTER RUNTIME LIBRARY (PRL) and an underlying PROMOTER ABSTRACT MACHINE (PAM). The PRL is architecture independent, while the PAM must be ported to each platform.

    The PROMOTER runtime system is implemented according to the SPMD model with a lock-step synchronization scheme. That is, the processes created by the SPMD program alternate between communication and computation phases. In the communication phase remote elements are sent and received; in the computation phase operations are performed locally.
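The alternation of phases can be pictured with a small single-process simulation. This is a sketch only: real SPMD processes run concurrently and exchange messages, whereas here the "nodes" are array entries and the communication phase is a copy. The names `Node`, `communicate`, `compute` and `step`, the one-dimensional data layout, and the three-point average used as the local operation are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One simulated SPMD process with its owned elements and received boundary copies.
struct Node {
    std::vector<double> local;   // elements owned by this node
    double left_halo = 0.0;      // copy of the left neighbour's last element
    double right_halo = 0.0;     // copy of the right neighbour's first element
};

// Communication phase: every node obtains the remote elements it will need.
// At the outer ends the node's own boundary value is reused.
void communicate(std::vector<Node>& nodes) {
    for (std::size_t p = 0; p < nodes.size(); ++p) {
        assert(!nodes[p].local.empty());
        nodes[p].left_halo =
            p > 0 ? nodes[p - 1].local.back() : nodes[p].local.front();
        nodes[p].right_halo =
            p + 1 < nodes.size() ? nodes[p + 1].local.front() : nodes[p].local.back();
    }
}

// Computation phase: purely local three-point average using owned data and halos.
void compute(std::vector<Node>& nodes) {
    for (Node& n : nodes) {
        std::vector<double> next(n.local.size());
        for (std::size_t i = 0; i < n.local.size(); ++i) {
            const double l = i > 0 ? n.local[i - 1] : n.left_halo;
            const double r = i + 1 < n.local.size() ? n.local[i + 1] : n.right_halo;
            next[i] = (l + n.local[i] + r) / 3.0;
        }
        n.local = next;
    }
}

// One lock-step iteration: all communication completes before any computation.
void step(std::vector<Node>& nodes) {
    communicate(nodes);
    compute(nodes);
}
```

Because every node finishes receiving before any node computes, the computation phase never touches remote data, which is the property the lock-step scheme provides.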

    PAM is responsible for communication and synchronization between domains of the underlying architecture. The send and receive operations are nonblocking in order to overlap local and remote operations. To make use of the capabilities of the hardware, most of the resource management is located in the PAM. The dual-processor system MANNA, for instance, can perform all buffering and communication activities on the communication processor, while the application processor executes the local operations of the user program.

    PRL provides the executing primitives for running PROMOTER programs. More exactly, it provides template classes for distributed objects and template functions on distributed objects. These functions provide basic data parallel operations, for example, data parallel assignments, data parallel function calls, point-to-point communication and collective communication (reduction, expansion, cross product, and communication product).

    Distributed objects are implemented in an object-oriented way. They are based on the class Distribution, which is defined by a spatial structure Topology and a mapping strategy Mapping. A Topology specifies the valid data points of the problem's spatial structure; their location on one of the physical computing domains (nodes) is determined by Mapping. The concrete Topology, Mapping and Distribution can be predefined by PRL, generated by the compiler, or defined by programmers. The only condition for using them is that they follow the same class interface.
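The idea of combining interchangeable Topology and Mapping classes behind a common interface can be sketched with C++ templates. The code below is an assumption-laden illustration, not the PRL class hierarchy: the member names `extent`, `valid` and `node_of`, the one-dimensional index space, and the concrete classes `LineTopology`, `EvenTopology`, `BlockMapping` and `CyclicMapping` are all invented for this example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical topologies over a 1-D range 0..n-1.
struct LineTopology {            // dense: every index in range is valid
    int n;
    int extent() const { return n; }
    bool valid(int i) const { return 0 <= i && i < n; }
};
struct EvenTopology {            // sparse: only even indices are valid
    int n;
    int extent() const { return n; }
    bool valid(int i) const { return 0 <= i && i < n && i % 2 == 0; }
};

// Hypothetical mapping strategies sharing one interface.
struct BlockMapping {
    int nodes;
    int node_of(int i, int extent) const {
        const int block = (extent + nodes - 1) / nodes;  // ceiling division
        return i / block;
    }
};
struct CyclicMapping {
    int nodes;
    int node_of(int i, int /*extent*/) const { return i % nodes; }
};

// A distribution pairs any topology with any mapping that follows the interface.
template <class Topology, class Mapping>
class Distribution {
public:
    Distribution(Topology t, Mapping m) : topo_(t), map_(m) {}
    bool valid(int i) const { return topo_.valid(i); }
    int owner(int i) const {
        assert(valid(i));
        return map_.node_of(i, topo_.extent());
    }
    // Valid points located on the given node.
    std::vector<int> local_points(int node) const {
        std::vector<int> pts;
        for (int i = 0; i < topo_.extent(); ++i)
            if (valid(i) && owner(i) == node) pts.push_back(i);
        return pts;
    }
private:
    Topology topo_;
    Mapping map_;
};
```

Any class offering the same members can be substituted, whether it is predefined, compiler-generated or user-written. Note that combining `EvenTopology` with `CyclicMapping` on two nodes places every valid point on node 0, a small reminder that topology and mapping need to be chosen together.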

    PRL provides a set o...

