
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 13, 30-42 (1991)

The DINO Parallel Programming Language*

MATTHEW ROSING, ROBERT B. SCHNABEL, AND ROBERT P. WEAVER

Department of Computer Science, University of Colorado, Boulder, Colorado 80309-0430

DINO (Distributed Numerically Oriented language) is a language for writing parallel programs for distributed memory (MIMD) multiprocessors. It is oriented toward expressing data parallel algorithms, which predominate in parallel numerical computation. Its goal is to make programming such algorithms natural and easy, without hindering their run-time efficiency. DINO consists of standard C augmented by several high-level parallel constructs that are intended to allow the parallel program to conform to the way an algorithm designer naturally thinks about parallel algorithms. The key constructs are the ability to declare a virtual parallel computer that is best suited to the parallel computation, the ability to map distributed data structures onto this virtual machine, and the ability to define procedures that will run on each processor of the virtual machine concurrently. Most of the remaining details of distributed parallel computation, including process management and interprocessor communication, result implicitly from these high-level constructs and are handled automatically by the compiler. This paper describes the syntax and semantics of the DINO language, gives examples of DINO programs, presents a critique of the DINO language features, and discusses the performance of code generated by the DINO compiler. © 1991 Academic Press, Inc.

1. INTRODUCTION

DINO (Distributed Numerically Oriented language) is a language for writing parallel programs for distributed memory multiprocessors that was developed primarily for numerical computation. By a distributed memory multiprocessor we mean a computer with multiple, independent processors, each with its own memory but with no shared memory or shared address space, so that communication between processors is accomplished by passing messages. Examples include MIMD hypercubes and networks of computers used as multiprocessors.

It is generally more difficult to design an algorithm for a multiprocessor than for a serial machine, because the algorithm designer has more tasks to accomplish. First, the algorithm designer must divide the desired computation among the available processors. Second, the algorithm designer must decide how these processes will synchronize. In addition, on a distributed memory multiprocessor, the algorithm designer must consider how data should be distributed among the processors and how the processors should communicate.

* This research was supported by AFOSR Grant AFOSR-85-0251, NSF Cooperative Agreement DCR-8420944, and NSF Cooperative Agreement CDA-8420944.

The goal of DINO is to make programming distributed memory parallel numerical algorithms as easy as possible without hindering efficiency. We believe that the key to this goal is raising the task of specifying algorithm decomposition, data decomposition, interprocess communication, and process management to a higher level of abstraction than that provided in many current message passing systems. DINO accomplishes this by providing high-level parallel constructs that conform to the way the algorithm designer naturally thinks about the parallel algorithm. This, in turn, transfers many of the low-level details associated with distributed parallel computation to the compiler. In particular, details regarding message passing, process management, and synchronization are no longer necessary in the code that the programmer writes, and the associated efficiency considerations are addressed by the compiler.

This high-level approach to distributed numerical computation is feasible because so many numerical algorithms are highly structured. The major data structures in these algorithms are usually arrays and sometimes trees. In addition, the algorithms usually exhibit data parallelism, where at each stage of the computation, parallelism is achieved by dividing the data structures into pieces and performing similar or identical computations on each piece concurrently. DINO is mainly intended to support such data parallel computation, although it also provides some support for functional parallelism.

The basic approach taken in DINO is to have the programmer provide a top-down description of the distributed parallel algorithm. The programmer first defines a virtual parallel machine that best fits the major data structures and communication patterns of the algorithm. Next, the programmer specifies the way that these major data structures will be distributed and possibly replicated among the virtual processors. Finally, the programmer provides procedures that will run on each virtual processor concurrently. Thus the basic model of parallelism is single program, multiple data (SPMD), although far more complexity is possible. Most of the remaining details of parallel computation and communication are handled implicitly by the compiler. A key component in making this approach viable has been the development of a rich mechanism for specifying the mappings of data structures to virtual machines, along with the efficient implementation of these mapping patterns in the compiler.

Sequential code in a DINO program is written in standard C; DINO is a superset of C. Only the parallel constructs in DINO and a few related additions are new. We have chosen to build DINO on top of C because C is widely used and is available on our target parallel machines, because its structured form fits well with our parallel approach, because its dynamic data structures are useful in many numerical applications, and because we have many tools available for implementing a C compiler. However, the DINO language as described in this paper could have been built on top of almost any imperative, sequential language, including FORTRAN.

Implicit in our approach to parallel programming is the belief that, given the current state of knowledge in this area, a language cannot completely hide the distributed memory parallel machine from the programmer if the result is to be programs that run efficiently. Also, we believe that most parallel algorithm designers do have an idea of how to break up the data structures and how to act upon them concurrently in order to achieve efficient parallel performance. In DINO, we are trying to provide just enough high-level parallel structure to allow the programmer to specify this information, while making the compiler do as much of the tedious work as possible.

To our knowledge, there is relatively little other research that has led to implemented languages for distributed parallel numerical computation. C* [15] and Kali [13, 10] are probably the closest languages we have found to DINO, although DINO was developed entirely independently of both of them. C* is a C superset used to program the (SIMD) Connection Machine. Its domain and selection statement features are similar to DINO's environment structures and composite procedures, respectively; however, its SIMD orientation gives it a considerably different philosophy and flavor from DINO. One prominent difference is C*'s local view of data structures as opposed to DINO's global view. Kali is a high-level language for data parallel, SPMD computation. It contains constructs for virtual parallel machines and distributed data structures that are similar to DINO's. Its programming model, however, is considerably more restrictive than DINO's, allowing communication only upon entering and leaving "forall" loops.

Another parallel language that has been implemented on distributed memory multiprocessors, Linda [4], is designed around a shared memory paradigm, the "tuple space". Due to this approach, its constructs and philosophy are very different from DINO's. For example, the mapping of data to processors is not specified by the Linda programmer and must be optimized at run time. Pisces [14] and the Cosmic Environment [22] are examples of systems that also support distributed parallel numerical applications, but using a level of support for parallelism lower than that of DINO. Several languages, such as Spot [24] and Apply [6], support specific application areas where a C*-like local view of the computation is appropriate; our research aims to support a broader class of computations, for which this paradigm is not sufficient. Our work has been influenced considerably by the Force [8], which supports data parallel computation on shared memory multiprocessors. All of these approaches are imperative; other approaches to parallel computation include functional languages (such as CRYSTAL [2] and SISAL [12]) and concurrent logic languages (see the languages discussed in [23]). For a general overview of many parallel languages see [26].

In recent years, the concept of distributed data, which is key to DINO, has been considered by several research projects, including [21, 1, 13]. An important difference in the DINO version is that one-to-many mappings of data to processors, as well as one-to-one mappings, are supported. This turns out to be important for implicit interprocess communication. The concept of virtual machines is also found in Structured Process [11].

The remainder of this paper is organized as follows. Section 2 gives a brief informal overview of the DINO language and a simple example, while Section 3 gives a more formal description. Section 4 presents two additional DINO examples and uses them as a basis for a critique of the DINO language features. Section 5 briefly summarizes the current status of the language and areas for related future research. Preliminary discussions of the DINO project can be found in [16, 17], and a moderate-sized example is presented in [18].

2. DINO OVERVIEW

In this section, we set out the basic structure of the language in order to provide some context for the more detailed description and discussion in the next section. We begin by presenting a very simple DINO example (Example 2.1), a matrix-vector multiplication where the matrix is partitioned by rows and each processor calculates one element of the result vector.

The three major DINO concepts that are illustrated in this example are the structure of environments, the composite procedure, and the distributed data. We explain each one briefly.

2.1. Environments

To create a parallel DINO program, the programmer first declares a structure of environments that best fits the number of processes and the communication pattern between processes in the parallel algorithm. This structure can be viewed as a virtual parallel machine constructed for the particular algorithm. Each environment in a single structure is a virtual processor that contains identical procedures and data structures, although the contents of the data structures and the sequence of statements executed at run time can differ. Each DINO program also contains a "host" environment, a scalar environment that acts as the master process.

EXAMPLE 2.1. Matrix-vector multiplication.

#define N 512
#include "dino.h"

environment node [N:id]
{
    /* Row-wise Parallel Algorithm to Calculate a <-- M*v : */
    composite MatVec (in M, in v, out a)
        float distributed M[N][N] map BlockRow;
        float distributed v[N] map all;
        float distributed a[N] map Block;
    {
        int j;

        /* a[id] <-- dot product of (row id of M) and v */
        a[id] = 0;
        for (j=0; j<N; j++)
            a[id] += M[id][j] * v[j];
    }
}

environment host
{
    main ()
    {
        long int i, j;
        float Min[N][N];   /* Matrix Multiplicand */
        float vin[N];      /* Vector Multiplicand */
        float aout[N];     /* Vector Answer */

        /* Input the Data -- omitted for brevity */

        /* Call the Composite Procedure */
        MatVec (Min[][], vin[], aout[])# ;

        /* Print the Results -- omitted for brevity */
    }
}

In the example, "environment node [N:id]" declares a vector of N identical environments whose names are "node[0]" through "node[N-1]", and "environment host" declares a scalar environment named "host". ("id" is used by the environment to determine its identity; see Section 3.2 for more discussion of this construct.)

2.2. Composite Procedures

A composite procedure is a set of identical procedures, one residing within each environment of a structure of environments. The parameters of a composite procedure typically are distributed variables (discussed below). A composite procedure call causes all instances of the procedure to execute concurrently; each utilizes the portion of the distributed parameters and other distributed data structures that are mapped to its environment. In the example, the composite procedure is the "MatVec" procedure in the node environment structure, and it is called from the host environment.

2.3. Distributed Data

Distributed data are used to map global data structures onto the structures of environments in a DINO program. A mapping of data to environments may be one to one, one to many, or one to all. These mappings determine how the environments access and share the data and are the key to making interprocess communication natural and implicit. Distributed variables are referenced using their global names and subscripts, as described below.

There are three distributed data declarations in the example. The first one, "float distributed M[N][N] map BlockRow;", maps one row of the matrix M to each of the N node environments. The second, "float distributed v[N] map all;", maps the entire vector v to each node environment. The third, "float distributed a[N] map Block;", maps one element of the vector a to each node environment. The mapping names "BlockRow," "all," and "Block" come from a library of predefined mapping functions. DINO provides a library containing many commonly used mapping functions, including these, and a general facility that allows the user to define a very large set of data mappings.

Distributed data can be used to generate interprocess communication in two ways. One is by using a distributed variable as a parameter in a composite procedure call, as in the example. When the host environment calls the composite procedure MatVec, the values in the matrix Min and the vector vin are distributed to the node environments according to the mappings of the corresponding formal parameters M and v. Similarly, upon the return from the composite procedure, the values of a are collected from the node environments and assigned to aout.

Communication also results from “remote” references to distributed variables (not illustrated in the example). In general, distributed data can be referenced either locally or remotely. A local reference, which uses standard C syntax, affects just the local copy and is the same as any standard reference to a variable. A remote reference, which uses the variable name followed by a “#” sign, can be used with a variable that is mapped to multiple environments. A remote write to a distributed variable causes a message containing its current value to be sent to all other environments to which that variable has been mapped. A remote read of a distributed variable receives such a message and updates the variable’s value. Remote references are described in Section 3.4.2.
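
As a concrete illustration of these rules, consider the following sketch, written in the DINO syntax described above (the variable x, its size, and its mapping are hypothetical and do not appear in Example 2.1):

float distributed x[N] map BlockOverlap;

/* in the home environment of x[k]: update the local copy and send the new value to all copy environments */
x[k]# = newval;

/* in a copy environment of x[k]: block until a value arrives from the home environment, update the local copy, then use it */
y = x[k]# + 1.0;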

3. THE DINO LANGUAGE

In this section we present a more complete description of the syntax and semantics of the language,[1] together with some of the interesting design and semantic issues. For a complete specification of the syntax and semantics of DINO, see [3].

[1] We use a combination of discussion, examples, and EBNF notation [25] to describe the additions to C that make up DINO. Note that our EBNF specification includes several EBNF constructs for standard C that are unchanged, namely CONSTANT-EXPRESSION, EXPRESSION-LIST, FUNCTION-BODY, and IDENTIFIER; several EBNF constructs for standard C that are expanded, namely DECLARATOR, EXPRESSION, FUNCTION-DEFINITION, PRIMARY, and STATEMENT; and one EBNF construct for standard C, DATA-DEFINITION, whose definition is unchanged but which contains a part, DECLARATOR, that is expanded. In all cases, only the new definitions are included here. For a BNF specification for C, see [9].

3.1. Program Structure

PROGRAM ::= ( ENVIRONMENT | MAPPING-FUNCTION | DATA-DEFINITION )+.

A program consists of one or more environment declarations, zero or more mapping declarations, and zero or more (usually distributed) data declarations. One of the environment declarations is defined as a scalar environment with the name host. This environment contains a procedure named "main," where execution starts. The remaining environment declarations generally define structures of environments that contain composite procedures which are invoked from the host.

Distributed data structures that are declared at the program level are mapped to one or more environment structures in the program, as specified by their mappings. Mappings that are declared at the program level are accessible to all parts of the program. Declaring distributed data at this level allows the data to be shared between subsequent structures of environments. This is useful, for example, when a computation occurs in phases.

One syntactic feature that affects the whole program is that DINO uses the # symbol whenever an operation involves a remote environment. While each of the individual operations is discussed below, the purpose of this general approach is to call the programmer's attention to each operation that involves other environments.

3.2. Environments

ENVIRONMENT ::= 'environment' IDENTIFIER DIMENSION* '{' EXTERNAL-DEFINITION+ '}'.

DIMENSION ::= '[' EXPRESSION [ ':' IDENTIFIER ] ']'.

EXTERNAL-DEFINITION ::= FUNCTION-DEFINITION | DATA-DEFINITION | MAPPING-FUNCTION.

A structure of environments provides a virtual parallel machine and a mechanism for building parallel algorithms in a top-down fashion. An environment may contain composite procedures, standard C functions, distributed data declarations, standard C data declarations, and mappings. Each environment within a given structure contains the same procedures, functions, and C data declarations and shares the same distributed data declarations.

A structure of environments may be any single- or multiple-dimensional array. This is appropriate for parallel numerical computation because the parallelism in numerical algorithms is often derived from partitioning physical space into neighboring subregions, or from partitioning matrices or vectors. In order to give each environment within a structure an identity that can be used in calculations, the programmer can declare constants that will contain the subscripts identifying the current environment. In the matrix-vector example, this is id. Most often these are used to refer to the environment's portion of a distributed data structure; for instance, the first element in an environment's row of the matrix M is M[id][0] in the same example.
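
For example, a two-dimensional structure of environments with an index identifier for each axis could be declared as follows (a sketch; the names grid, row, and col are hypothetical):

environment grid [P:row] [Q:col]
{
    /* each of the P*Q environments sees its own values of row and col,
       which it can use to address its portion of a distributed matrix */
}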

A DINO program may contain any number of environment declarations. Typically only one, the host, is scalar and one or more are arrays. Multiple structures of environments serve two important purposes. They are useful when successive phases of a computation map naturally onto different parallel machines; in this case, generally only the environments in one structure become active at once. They also provide support for functionally parallel programs; in this case, the environments in multiple structures become active at the same time.

An environment can contain an arbitrary number of composite procedures and standard C functions. However, only one task, or thread of control, may be active in an environment at a time. Procedures in two environments can exchange information only if there is cooperation between the procedures, namely a remote read and write of a distributed variable or the use of a reduction operator. These semantics imply that there is no hidden shared memory emulation in DINO, and they also result in DINO programs being deterministic unless asynchronous distributed variables or explicit environment sets are used. (See Section 3.4.2 for a definition of these two features.) As we discuss in Section 3.4.2, we believe that determinism is an important feature of DINO, because it makes it easier to write correct programs. We note that an option that relaxes the requirement that the programmer explicitly insert remote reads and writes into a DINO program is likely to be available in the future, but that there are significant trade-offs between relaxing this rule in general and providing a language that is applicable and efficient for a wide variety of applications. See Section 5 for more discussion of this point.

Structures of environments are treated as blocks with respect to scope. Ordinary data that are declared within an environment structure are copied to every environment in that structure. Copies of ordinary data on separate environments have no relation to each other and can be accessed only within their own environments. The one exception to this scoping rule is composite procedures, which are always visible in the topmost scope. Thus composite procedures may be called from any environment except their own; i.e., recursive calls are not allowed.

The current version of the DINO compiler creates one process for every environment in each structure and maps these processes statically onto the actual parallel machine. This is done in a manner that attempts to minimize the distance, on the parallel computer, between contiguously indexed environments in each structure, using standard techniques [7]. This mapping makes the assumption that most communication is between contiguously indexed environments. The compiler also attempts to optimize load balance, by evenly distributing these processes over the entire actual parallel machine. This is done at compile time, because doing it at run time would require substantial overhead; for example, each reference to a distributed data structure would require significant additional computation.

When a programmer declares more environments in a program than there are actual processors, multiprogramming occurs. This style of programming does not lead to optimal efficiency, for two reasons. First, the time involved in the context switch may be large compared to the grain of the program. Second, the code necessary to support DINO is substantial and is duplicated on most distributed memory machines. We took this approach in the initial version of DINO because mapping virtual environments onto real processors without multiprogramming requires a complex analysis of the program in order to restructure it to contract the virtual processes efficiently in the general case. We plan to attack this problem in the future; see Section 5.

3.3. Composite Procedures

declaration:

FUNCTION-DEFINITION ::= 'composite' IDENTIFIER '(' [ COMP-PARAMETER-LIST ] ')' FUNCTION-BODY.

COMP-PARAMETER-LIST ::= ( [ 'in' | 'out' ] IDENTIFIER ) || ','.

call:

STATEMENT ::= IDENTIFIER '(' [ EXPRESSION-LIST ] ')' '#' [ '{' ENV-EXP '}' ] [ '::' STATEMENT ].

ENV-EXP ::= EXPRESSION.

Composite procedures are used to implement concurrency. A composite procedure consists of multiple copies of the same procedure, one residing within each environment of a structure of environments. Calling the composite procedure invokes all of these procedures at the same time. Typically, each procedure works on a different part of some distributed data structure(s), which may be defined globally, in the structure of environments, or as a parameter to the composite procedure. This results in a single-program, multiple-data form of parallelism.

The formal parameters of a composite procedure may be distributed variables or standard C variables. The DINO semantics for passing variables to composite procedures differ somewhat from the C semantics for passing variables to functions. In DINO, a composite procedure must be able to send a copy of an entire data structure as a parameter (pass-by-value) or return a copy of an entire data structure as a parameter (pass-by-result). This is because the ordinary C mechanism for arrays of simply passing a pointer to the data structure (effectively pass-by-reference) would be meaningless across processors. Thus, formal parameters that are distributed variables may be preceded by the keywords "in" for pass-by-value, "out" for pass-by-result, or no keyword for both (pass-by-value/result). Formal parameters that are standard variables must be preceded by "in" and are replicated in all the environments on which the composite procedure is called. Each element of a result parameter is returned from exactly one environment, its "home" environment (see Section 3.4.1).
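
The following header sketches the three parameter modes (the procedure and variable names are hypothetical; the declaration style follows Example 2.1):

composite Step (in A, out b, c, in n)
    float distributed A[N][N] map BlockRow;  /* "in": copied in, pass-by-value */
    float distributed b[N] map Block;        /* "out": copied out, pass-by-result */
    float distributed c[N] map Block;        /* no keyword: pass-by-value/result */
    int n;                                   /* standard C variable, replicated in every environment */
{
    /* body omitted */
}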

A composite procedure call has the same syntax as calls to ordinary C functions except that arrays or subsections of arrays can be used (see Section 3.6), actual parameters can be remote references to distributed parameters, and the parameter list is followed by a # sign. All actual parameters corresponding to result and value/result parameters must refer to variables. DINO checks the number and types of actual parameters against the formal parameters in a composite procedure call. Composite procedures do not return values.

Functional parallelism may be achieved in DINO by utilizing the optional ":: STATEMENT" construct following a composite procedure call (this definition is recursive; see the EBNF specification above). This STATEMENT is executed concurrently with the composite procedures specified by the call. It can be either another composite procedure call utilizing a different structure of environments (or a disjoint subset of the same structure) or a standard C statement that is executed on the host. The program blocks until both the composite procedure and the concurrent statement complete execution.
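
For example, the following call (a sketch; Phase1 and Phase2 are hypothetical composite procedures defined on different structures of environments) runs both phases concurrently and blocks until both complete:

Phase1 (A[][])# :: Phase2 (B[][])# ;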

As with many of the DINO constructs, the programmer can exercise more detailed control over a composite procedure than that allowed by the default syntax. For example, the invocation of a composite procedure can be limited to a subset of the environments in the environment structure on which it is defined. See [3] for a complete explanation of this.

The use of the composite procedure model means that the programmer specifies the computation for a typical environment. (The programmer still uses global names for distributed data.) We prefer this approach to several others we have seen. One such alternative would be to have the programmer specify the computation for the entire program as if it were not distributed among environments. However, this conflicts with our approach of having the programmer be aware of and manage the high-level parallelism. In particular, it would be difficult to come up with a syntax that cleanly allows the programmer to write code as if the data were not distributed but still designate where communications are needed, since these are conflicting objectives. (The need for programmer-designated communications is discussed at the end of Section 3.4.2.) Another alternative would be to have the programmer specify the computation for just one element in a global data structure. The code would then be executed for each element. This approach substantially limits the kinds of problems that can be constructed and sometimes would not fit the parallelism in the algorithm. For example, some algorithms may require that the same computation be done for each row rather than for each element.

Our approach has its drawbacks too. A principal one is that using a global name space for the data but a single-processor view for the code sometimes makes the code more complex. For example, the programmer may be required to write expressions to identify the first and last elements of the part of a distributed data structure that is mapped to a typical environment. This problem could be addressed by language facilities that automatically return "myfirst" and "mylast" indices in each dimension.
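
For instance, under a Block mapping of N rows onto P environments, with N an exact multiple of P, the programmer currently writes such expressions by hand; a minimal sketch in C (myfirst and mylast are hypothetical names, not DINO facilities):

int myfirst = id * (N/P);         /* index of the first row mapped to this environment */
int mylast = myfirst + N/P - 1;   /* index of the last row mapped to this environment */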

3.4. Distributed Data

DINO encourages the programmer to take a global view of data structures that are distributed among multiple processors and operated upon concurrently. The programmer does this by declaring a data structure as distributed data and then specifying the manner in which that data structure is mapped to the underlying virtual machine given by the programmer-defined structure of environments. The distributed data provide the mechanism for communication between environments, which is accomplished by using distributed variables as parameters to composite procedures and by remote references to distributed variables. They also provide a global view of data structures, and they join with composite procedures to produce an SPMD model of computation. We chose a global data model because we feel that a programmer should not have to deal with two different views of the data: a global one before the data are distributed and a local one on each processor.

Currently, the types of data structures that can be distributed are arrays of any dimension. Much of the usefulness of distributed data is due to the rich mechanism for specifying the mappings of arrays onto structures of environments. The declaration of distributed data and the definition and meaning of these mappings are discussed in Section 3.4.1. The use of distributed data, primarily remote references for communication, is discussed in Section 3.4.2, along with a critique of this approach to communication.

3.4.1. Declaring Distributed Data -- Mappings

distributed data declaration:

DECLARATOR ::= [ 'asynch' ] 'distributed' DECLARATOR ( '[' CONSTANT-EXPRESSION ']' )+ MAPPING.

MAPPING ::= 'map' ( 'all' | IDENTIFIER | ( ( IDENTIFIER IDENTIFIER ) || 'map' ) ).

mapping definition:

MAPPING-FUNCTION ::= 'map' IDENTIFIER '=' ( '[' MAP-TYPE [ ALIGN ] ']' )+.

MAP-TYPE ::= 'all' | 'compress' | BLOCK-MAP | WRAP-MAP.

BLOCK-MAP ::= 'block' [ 'overlap' [ EXPRESSION ] ] [ 'cross' 'axis' EXPRESSION ].

WRAP-MAP ::= 'wrap' [ EXPRESSION ].

ALIGN ::= 'align' 'axis' EXPRESSION.

Distributed data declarations follow standard C syntax except that the keyword "distributed" precedes the name of the distributed variable and the keyword "map" followed by the name of a mapping follows it. An example is

float distributed M[N][N] map BlockRow;

The mapping is either taken from DINO's library of predefined mappings, as is the case with BlockRow above, or is defined by a mapping definition.

DINO provides predefined mappings for mapping one- and two-dimensional (hereafter 1D or 2D) data structures onto 1D structures of environments and 2D data structures onto 2D structures of environments. The predefined 1D-to-1D mappings are "Block", "Wrap", and "BlockOverlap". They can be used to map any array of N variables onto any array of P environments and work in a fairly obvious way. If N is an exact multiple of P, then Block maps the first N/P data elements onto the first environment, etc., while if N > P but N is not a multiple of P, then the first N mod P environments contain an extra element. Wrap maps each element i (i = 0, ..., N-1) to environment [i mod P], a mapping commonly used in numerical computation for load balancing. BlockOverlap does a Block mapping and in addition maps copies of the first and last elements of each block (except elements 0 and N-1) to the next lower and higher environment, respectively.
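
The home environment of an element under these two mappings can be computed directly; a sketch in C, assuming N is an exact multiple of P (the helper names are hypothetical):

int block_home (int i) { return i / (N/P); }  /* Block: element i maps to environment i/(N/P) */
int wrap_home (int i) { return i % P; }       /* Wrap: element i maps to environment i mod P */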

The predefined mappings for mapping 2D (M X N) data structures to 1D (P) environment structures are "BlockRow", "BlockCol", "WrapRow", "WrapCol", "BlockRowOverlap", and "BlockColOverlap". They work identically to the above 1D-to-1D mappings except that rows or columns are used in the place of individual elements. For mapping 2D (M X N) data structures onto 2D (P X Q) environment structures, the predefined mapping functions are "BlockBlock", "FivePoint", and "NinePoint". BlockBlock is the cross product of a block mapping of the first axis of the data structure onto the first axis of the environment structure with a block mapping of the second axis of the data structure onto the second axis of the environment structure.

FivePoint is the cross product of a BlockOverlap mapping in each direction; in the case when M = P and N = Q it reduces to the standard five-point star. NinePoint is a FivePoint mapping where in addition the corner elements are included in the overlap. Finally, the mapping keyword "all" maps all the elements of a data structure of any dimension onto each environment of a structure of environments of any dimension.

All these predefined mapping functions are special cases of DINO's general facility for defining mapping functions. Because this facility is rather complex, we give only a general idea of it here and then give an example; for more detail, see [20]. The first part of the mapping function specifies how the axes of the data structure are matched to the axes of the environment structure, using the keywords "compress" (when the data structure has higher dimension) and "align axis" (to override the default mapping of axis i to axis i). The second part specifies how a specific data axis is mapped onto a specific environment axis, using the keywords "block", "wrap", and "all". One may specify the width of wraps if greater than 1 ("EXPRESSION" in the WRAP-MAP rule) or the size and directions of overlaps in a block mapping ("overlap" and "cross axis"). Finally, "asynch" is discussed in Section 3.4.2. For example, BlockRow is defined by "map BlockRow = [block] [compress];", which says to block map the first axis of the data structure onto the first axis of the environment structure and not to distribute the second data axis.
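
Following the same grammar, a user could, for example, define a variant of BlockRowOverlap whose blocks share two border rows with each neighboring environment (a sketch; the mapping name and overlap width are hypothetical):

map BlockRowOverlap2 = [block overlap 2] [compress];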

Implicitly associated with each mapping function is the specification of one home environment and zero or more "copy" environments for each element of the distributed data structure. If an element of a distributed data structure is mapped to only one environment, this is its home. If it is mapped to multiple environments using overlap, the home environment is the environment that the data element would be mapped to if the overlap term were omitted, while the additional environments that the overlap term causes it to be mapped to are its copy environments. This concept is consistent with many uses of such mappings, for example, in differential equation solvers based upon grids of points, where only one process produces new values of the shared grid point and the bordering processes consume these values. If an element is mapped using all, then we arbitrarily consider the lowest subscripted environment to be its home and the remainder to be copy environments, but this distinction is generally unimportant since such variables are usually either input-only parameters or uniformly shared variables internal to the composite procedure. The concept of a home environment is important to ensure deterministic computations (see Section 3.4.2).

We chose our method of defining the mappings as a trade-off between generality and efficiency. We needed mappings that could be fully understood at compile time because references to distributed arrays depend on the mappings. Having, for example, a user-defined function for the mapping would impose an unacceptable run-time overhead on almost any program. A drawback to our scheme is that mapping definitions can become complex and difficult to understand. We compensate for this, to some extent, by providing a library of predefined mappings.

We chose to go beyond mappings that simply partition data for two reasons. First, it is not uncommon to have some data that are needed in all environments. A simple replication handles this case. Second, by capturing common communication patterns as partially replicated data, the compiler is given enough information to optimize the interprocessor communications and the storage associated with this communication, and the programmer's job in specifying such communications is simplified. In particular, the programmer only has to tell the compiler when to send or receive specific data; it is not necessary to also specify the source and destination of the data. Other languages that do not require the programmer to specify the source or destination (and do not include some programmer specification of overlaps of data between processors) typically have to fetch the data (one message to request them and one to return them). The DINO compiler has enough information that it never has to do this.

Finally, we included load balancing constructs (the various wrap mappings) in the mappings because we felt that they were an important consideration in many numerical algorithms. Although, in theory, load balancing is orthogonal to data distribution, implementing it that way proved sufficiently complex that we integrated it into the mapping scheme for the first version of the compiler. Most other languages have taken the same approach.

3.4.2. Using Distributed Data

EXPRESSION ::= PRIMARY '#' [ '{' ENV-EXP [ 'from' EXPRESSION ] '}' ].

ENV-EXP ::= 'caller' | EXPRESSION.

There are three different ways to make use of a distributed variable. The first is as a formal parameter in a composite procedure call, as discussed in Section 3.3. The other two are by local and remote references, either reads or writes, in any standard C statement.

Local reference to a distributed variable uses standard C syntax and is the same as referencing any regular variable. The value of the variable is retrieved from, or stored into, the local copy. This implies that the variable being referenced must be mapped to the environment in which the reference is made.[2]

[2] If it is not, it is a programmer error, currently with the same consequences that the programmer incurs in C for an illegal array reference or illegal pointer use: sometimes the program crashes and sometimes it completes but with incorrect results. It would require run-time checking to catch all these errors. We are considering such checks in a future version of DINO, mostly because it is easy to make these mistakes and hard to find them.

Remote reference to a distributed variable, either a read or a write, involves communication between environments. It is indicated by the distributed variable name, followed by a #. We first describe the default semantics of remote references, then briefly discuss several variations. All remote references follow the convention that the home environment produces new values and the copy environments consume them.

If the remote reference is a read and the current environment is a copy environment, then DINO looks for the first (least recent) value of that variable that has been received from the home environment but not yet utilized. That is, a message buffer is maintained, and the oldest message from the home environment with a value of that variable is used and then removed from the buffer. If no new value of that variable has been received from the home environment, the reading process blocks until one is received. When the new value is available, it is used to update the local copy of the variable, and then the remainder of the expression in which the remote read is located is evaluated using the local copy. If the remote reference is a read and the current environment is a home environment, then the remote read is the same as a local read. This is useful to the programmer; however, it may result in some performance degradation. If the current environment is neither a home nor a copy environment, then a remote read is an error.

A remote write to a distributed variable works as follows. If the current environment is the home environment (only one home environment is allowed for any piece of data) of that variable and there are associated copy environments, then the value being assigned to that variable is used to update the local copy (since the write occurs in an assignment context) and also is sent to all the copy environments. If the current environment is the home environment and there are no associated copy environments, then the remote write is treated as a local write but, again, there may be some performance degradation. If the current environment is not the home environment, then a remote write is an error.

As an example of remote reads and a remote write, consider a two-dimensional smoothing algorithm where an N X M distributed array of data A is mapped onto an N X M array of virtual processors "grid", with each A[i][j] mapped onto grid[i][j] (its home environment) and the four adjoining environments. (This is the mapping function FivePoint.) Then the statement

A[i][j]# = (A[i-1][j]# + A[i][j-1]# + A[i][j] + A[i][j+1]# + A[i+1][j]#)/5;

will receive values of the four adjoining elements of A from their home environments, calculate the average of these four values plus the local value, assign this new value to the local copy of A[i][j], and send it to the four adjoining environments.

All of the above discussion pertains to "synchronous" distributed variables, which are the default in DINO. Distributed variables can also be declared to be "asynchronous" by placing "asynch" before "distributed" in their declarations. The difference between synchronous and asynchronous variables is that a remote read of an asynchronous variable uses the most recent value and does not block. If several values of the variable have been received since the last remote read, then the most recently received value is used to update the local copy and the remaining values are discarded. If no new value has arrived since the last remote read, then the local copy is used. The remaining semantics of remote references to asynchronous variables, including all the semantics for a remote write, are the same as those for synchronous distributed variables.

The asynchronous option has been included because it is of interest in many numerical algorithms where "chaotic" variants are being considered. By making the distinction between synchronous and asynchronous communication part of the distributed variable declaration, as opposed to a property of the statement that invokes communication, it is possible to transform a DINO program from a synchronous to an asynchronous variant simply by changing one or a few declarations. Clearly, the execution of such algorithms may be nondeterministic.
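
For instance, the smoothing example above becomes a chaotic variant by changing only its declaration (a sketch):

double asynch distributed A[N][M] map FivePoint;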

Finally, if the programmer wishes to send or receive distributed variables in a different manner than by these default rules, there is a syntax (explicit environment sets) whereby the programmer can explicitly specify (1) the source or destination environment(s), (2) the variable from which the value is taken or to which it is assigned, and (3) the name under which the values are sent. It is then up to the programmer to ensure that sends and receives balance properly. This communication paradigm is discussed briefly in conjunction with Example 4.2. More detail can be found in [3].

The default communication paradigm in DINO guarantees that the program is deterministic. We did this for the obvious reason: we are trying to make it as easy as possible to write correct parallel programs. Avoiding nondeterminism removes one common source of errors in parallel programs. A programmer who needs to make use of nondeterminism must explicitly use asynchronous communication or programmer-designated environment sets.

We selected the home/copy paradigm because it naturally applies to a large number of common numerical algorithms and because it guarantees deterministic behavior of the program in the default case. There are, however, certain algorithms whose communications do not fit this paradigm well. There appear to be two main cases. We have seen several problems for which the current paradigm works except that the mapping must change during the execution of the algorithm. These problems could be handled by allowing periodic remappings. We have also seen a few algorithms whose communications simply do not fit the paradigm. These tend to be algorithms in which some intermediate result needs to be communicated between environments and in which the data structure the programmer would naturally use for communication does not fit the concept of a global data structure that is distributed among the environments. We are currently exploring possibilities for remedying this problem.

Finally, we chose a model that requires explicit programmer designation of sends and receives because it provides a powerful computational model for which we can provide an efficient implementation. The disadvantage of this approach is that it forces the programmer to explicitly manage the communication. We believe that currently there is a trade-off between this, efficiency, and expressiveness.

It might be argued that the programmer should not have to make any distinction between local and remote references, and that instead these should be distinguished by the compiler or the run-time system. This is done, for example, in C* [15] and in Kali [13]. C* is targeted for SIMD machines and also assumes that communication is relatively fast, so that fetching (using two messages to obtain data with no presending) is acceptable. This approach is neither efficient nor general enough for general SPMD computation. Kali and similar languages use a "copy in/copy out" model which allows communication only at the beginning and end of loop iterations. This may force the programmer to break an iteration in a place unnatural to the algorithm in order to permit communication, and it also makes it possible for a programmer to unknowingly construct loops where local values change in the course of one iteration but remote ones do not.

In contrast, never distinguishing explicitly between local and remote references would be extremely difficult if not impossible in a language like DINO, which supports entirely general SPMD (and functionally parallel) computation and strives for efficient communication in an environment where communication is presumed to be relatively slow. This would require extremely general data dependency analysis to determine when data should be sent and received, as well as optimizations for communication efficiency such as bundling subarrays into single messages and prefetching (sending messages as soon as new values are available rather than requesting them when they are required).

Thus it seems to us that a programming style for parallel, distributed computation that allows no distinction between local and remote references, and still results in efficient code, is possible only if the model of parallel computation is restricted considerably, as in these other languages. In current DINO, we have chosen to support the general SPMD model. However, we acknowledge that many data parallel programs do not need the full power of this model, and we are currently exploring ways to remove the need to distinguish between local and remote references in many cases. See Section 5 for additional discussion of this point.

3.5. Reduction Functions

PRIMARY ::= REDUCTION '(' EXPRESSION [ ',' EXPRESSION ] ')' '#' [ '{' ENV-EXP '}' ].

REDUCTION ::= 'gsum' | 'gprod' | 'gmin' | 'gmindex' | 'gmax' | 'gmaxdex'.

One type of calculation that involves communication is so fundamental to parallel computation, and departs sufficiently from simple patterns of communication, that we believe it should be a parallel language construct. This is a reduction operation, which involves performing a commutative operation (e.g., +, *, min, max) over one value from each environment and, often, returning the result to all the environments. DINO provides the reduction operators "gsum", "gprod", "gmax", and "gmin", which return the sum, product, maximum, or minimum of the arguments specified on each environment. DINO also provides two reduction functions, "gmaxdex" and "gmindex", that take two parameters; the first is the value to be reduced and the second is (a pointer to) an integer index. These functions return the index of the maximum (or minimum) value.

It is implicit in a reduction call that barrier synchronization is involved and that all participating environments must reach the call; otherwise execution is blocked. All reductions are implemented using efficient, tree-based computation and communication.
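
As an illustration (a sketch; the variable names are hypothetical), every environment contributes its local error value, all environments receive the global maximum, and gmaxdex identifies the environment holding that maximum:

double localerr, maxerr;
int where = id;

maxerr = gmax (localerr)# ;            /* global maximum over all environments */
where = gmaxdex (localerr, &where)# ;  /* index associated with the maximum value */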

3.6. Ranges

PRIMARY ::= PRIMARY ( '[]' | '[(' EXPRESSION ',' EXPRESSION ')]' ).

To use most distributed memory multiprocessors efficiently, it is important to send messages consisting of blocks of data, rather than multiple messages each consisting of a single data element, whenever possible. In order to move blocks of data between environments efficiently, DINO provides the ability to specify subarrays and to use arrays and subarrays in simple assignment statements, which may include remote reads and writes. (C provides no subarray facilities.) In DINO, these subarrays are referred to as "ranges".

A rectangular subsection of an array is specified by giving the first and the last element for each axis, e.g., "A[(i,j)][]". The default value, specified by "[]", is all elements of the array along the indicated axis.

Ranges, including remote references to them, may be used in assignment statements. DINO also allows the use of ranges as parameters for composite procedures and reduction functions and allows ranges to be used in specifying environment sets. At present we have not made additions to C to allow arithmetic operations on ranges or the use of ranges as parameters or return values to standard C functions; we plan to do so in the future. Due to these limitations of C, remote reads or writes of arrays or ranges cannot be natural parts of standard C statements (e.g., arithmetic operations or calls to C functions), as we would desire. Instead, they are often specified by explicit communication statements such as "A[i][]# = A[i][]" after new values of the distributed variable have been generated (see Example 4.2). In acknowledgment of these limitations, we have added C macros defining "SEND(subarray)" and "RECV(subarray)" to do a simple send or receive of a range of data (see Example 4.2).

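For example, after an environment computes new values for its row i of A, it can propagate the whole row as a single message (a sketch using the statements and macros just described):

A[i][]# = A[i][];  /* remote write of all of row i to its copy environments */
SEND (A[i][]);     /* macro form of the same send */
RECV (A[i][]);     /* macro form of the matching remote read in a copy environment */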

4. EXAMPLE DINO PROGRAMS AND DISCUSSION

Examples 4.1 and 4.2 contain complete DINO programs for two typical numerical computation kernels, a rowwise red-black algorithm for solving Poisson's equation on an N X N grid and the LU factorization of an N X N matrix. Both are shown in the form in which they would be coded in the current version of DINO, i.e., for a structure of P (presumably <N) environments. Both examples have a simple structure, in particular one structure of environments and one composite procedure; a more complex example is given in [18].

In the red-black algorithm, at each iteration each processor calculates a new value of each of its odd-numbered rows on the basis of the current value of this row and the two adjacent even-numbered rows, and then each processor calculates a new value of each of its even-numbered rows on the basis of the current value of this row and the two adjacent odd-numbered rows. Thus the parallel algorithm maps the rows to the environments in blocks and maps a copy of each border row onto the adjacent environment. We have assumed that N is a multiple of P, since the programmer generally can control this in such algorithms. The LU factorization algorithm performs N iterations, each of which calculates a pivot and multipliers on the basis of column k (the iteration number), and then uses this information to perform eliminations in all columns whose index is greater than k. Thus it is implemented efficiently on a distributed memory multiprocessor by partitioning the matrix by columns and using a wrap mapping. It is immaterial whether N is a multiple of P.

At an overall level, these two examples as well as Example 2.1 appear to us to satisfy our initial goals well. They appear fairly natural and understandable. They certainly appear preferable to the same algorithms programmed using vendor-supplied constructs. For example, they are each about two to three times shorter than the code for the same algorithms using Intel iPSC hypercube constructs and far easier to write and understand. At the same time, they still execute nearly as efficiently as the programs using iPSC constructs (see Section 5).

One significant shortcoming of Examples 4.1 and 4.2, in our opinion, is that for efficiency reasons the number of environments currently should not exceed the number of processors. This limitation hardly complicates the coding of these two examples, but it would be more natural to eliminate it. As discussed in Section 5, the pseudo-SIMD approach we are currently investigating does this. It also eliminates the need for #'s in these examples.

EXAMPLE 4.1. Rowwise red-black algorithm for Poisson's equation.

#define N 512
#define P 32
#include <stdio.h>
#include <math.h>
#include <dino.h>

environment node [P:id]
{
    double d[N], s[N];
    int needfactor = 0;

    void solveRow (U, i)    /* solves tridiag (-1,4,-1) * U[row_i] = (U[row_i-1] + U[row_i+1]) */
        double distributed U[N][N] map BlockRowOverlap;
        int i;
    {
        int j;
        double r = 1.0;     /* relaxation constant, can be a different constant */
        double new[N];

        if (needfactor == 0)
        {                   /* factor tridiag (-1,4,-1) */
            needfactor = 1;
            d[1] = 2;
            for (j=2; j<N-1; j++)
            {
                s[j] = 1/d[j-1];
                d[j] = sqrt (4 - s[j]*s[j]);
            }
        }

        /* forward solve (solve for elements 1..N-2 of row i, add U[i][0], U[i][N-1] to right hand side) */
        new[1] = (U[i-1][1] + U[i+1][1] + U[i][0])/d[1];
        for (j=2; j<N-2; j++)
            new[j] = (U[i-1][j] + U[i+1][j] - s[j]*new[j-1])/d[j];
        new[N-2] = (U[i-1][N-2] + U[i+1][N-2] + U[i][N-1] - s[N-2]*new[N-3])/d[N-2];

        /* backward solve */
        new[N-2] = new[N-2]/d[N-2];
        for (j=N-3; j > 0; j=j-1)
            new[j] = (new[j] - s[j+1]*new[j+1])/d[j];

        /* relaxation */
        for (j=1; j<N-1; j++)
            U[i][j] = r*new[j] + (1-r)*U[i][j];
    }

    /* Parallel Algorithm for Row-Wise Red Black, N/P Rows per Processor: */
    composite solver (U)
        double distributed U[N][N] map BlockRowOverlap;    /* partitions U by blocks of rows among
                                  processors, with each border row mapped to 2 processors */
    {
        int firstrow = id ? (id * N/P) : 1;                        /* indices of first and last rows */
        int lastrow = (id == (P-1)) ? N-2 : ((id+1) * (N/P) - 1);  /* in each block */
        int firsteven = firstrow%2 ? firstrow + 1 : firstrow;
        int firstodd = firstrow%2 ? firstrow : firstrow + 1;
        int k, i;

        for (k=0; k<100; k++)                       /* number of red-black sweeps */
        {
            for (i=firstodd; i <= lastrow; i=i+2)   /* update odd rows in each block */
            {
                if ((i==firstrow) && (i != 1) && (k != 0))
                    RECV(U[i-1][]);    /* DINO Macro, see end of Section 3.6 */
                if ((i==lastrow) && (i != N-2) && (k != 0))
                    RECV(U[i+1][]);
                solveRow (U, i);
                if (((i==firstrow) && (i != 1)) || ((i==lastrow) && (i != N-2)))
                    SEND(U[i][]);      /* DINO Macro, see end of Section 3.6 */
            }

            for (i=firsteven; i <= lastrow; i=i+2)  /* update even rows in each block */
            {
                if ((i==firstrow) && (i != 1) && (k != 0))
                    RECV(U[i-1][]);
                if ((i==lastrow) && (i != N-2) && (k != 0))
                    RECV(U[i+1][]);
                solveRow (U, i);
                if (((i==firstrow) && (i != 1)) || ((i==lastrow) && (i != N-2)))
                    SEND(U[i][]);
            }
        }
    }
}


environment host
{
    main ()
    {
        float Uh[N][N];    /* input and output to "solver" */
        long int i, j;

        /* Input the Data (Uh) */
        for (i=0; i<N; i++)
            for (j=0; j<N; j++)
                Uh[i][j] = i*j;
        for (i=1; i<N-1; i++)
            for (j=1; j<N-1; j++)
                Uh[i][j] = Uh[i][j] * .10;

        solver (Uh[][])#;    /* Composite Procedure Call */

        /* Printing the Results (Uh) */
        for (i=0; i<N; i++)
        {
            for (j=0; j<N; j++)
                (void) printf("%.2f ", Uh[i][j]);
            (void) printf("\n");
        }
    }
}

EXAMPLE 4.2. LU factorization.

#define N 512
#define P 32
#include <stdio.h>
#include <math.h>
#include <dino.h>

environment node [P:id]
{
    /* Parallel LU Factorization Algorithm: */
    composite lufactor (A)
        float distributed A[N+1][N] map WrapCol;    /* does wrap (cyclic) mapping of columns of A
                                                       to the environments */
    {
        float distributed mult[N+1] map all;    /* places copy of this vector in each environment */
        int i, j, k, pivrow;
        float piv, temp;

        for (k=0; k<N-1; k++)
        {
            if ((k%P) == id)    /* my environment contains pivot column for this iteration */
            {
                /* select pivot and swap in pivot column */
                piv = fabs(A[k][k]); pivrow = k;
                for (i=k+1; i<N; i++)
                    if (fabs(A[i][k]) > piv) { piv = fabs(A[i][k]); pivrow = i; }
                if (pivrow != k)
                    { temp = A[k][k]; A[k][k] = A[pivrow][k]; A[pivrow][k] = temp; }

                /* calculate multipliers and broadcast them and pivot row number */
                A[N][k] = pivrow;    /* N+1st row of A will contain pivots */
                for (i=k+1; i<N; i++)
                    A[i][k] = A[i][k]/A[k][k];
                mult[<k+1, N>]# {node[]} = A[<k+1, N>][k];    /* broadcast */
                /* Note: The syntax "{node[]}" above and below explicitly designates the target and
                   source environments, respectively, for the communication. This is required because
                   only one default "home" environment is allowed. See Section 4 in the text. */
            }

            /* receive pivot information */
            mult[<k+1, N>] = mult[<k+1, N>]# {node[(k%P)]};

            /* eliminate in your columns greater than k */
            pivrow = mult[N];
            for (j=id; j<N; j=j+P)
                if (j>k)
                {
                    if (pivrow != k)
                        { temp = A[k][j]; A[k][j] = A[pivrow][j]; A[pivrow][j] = temp; }
                    for (i=k+1; i<N; i++)
                        A[i][j] -= mult[i]*A[k][j];
                }
        }
    }
}

environment host
{
    main ()
    {
        float A[N+1][N];    /* input and output to "lufactor"; last row will contain pivots */
        long int i, j;

        for (i=0; i<N; i++)        /* Input A */
            for (j=0; j<N; j++)
                A[i][j] = (i+1)*(j+1);
        for (i=0; i<N; i++)
            A[i][i] += 1;

        lufactor (A[][])#;    /* Composite Procedure Call */

        for (i=0; i<N; i++)        /* Print the Results */
        {
            for (j=0; j<N; j++)
                (void) printf("%.2f ", A[i][j]);
            (void) printf("\n");
        }
    }
}


Examples 4.1 and 4.2 also were chosen, in part, to illustrate two problems with expressing remote references. While the change to the pseudo-SIMD approach would eliminate remote references from many DINO programs, in some programs they would remain, and so we are interested in making them as simple and natural as possible. First, the SEND(subarray) and RECV(subarray) type references that we discussed at the end of Section 3.6 are used in Example 4.1. These are required, in part, because C does not have facilities for using subarrays in arithmetic operations or as parameters. Our compiler already has most of the hooks needed to implement such facilities, and we plan to incorporate them in the future. Then one could encapsulate many communications entirely in statements like

solveRow (U[i-1][]#, U[i][], U[i+1][]#)

Second, in the broadcast in Example 4.2, it would be nice not to require the explicit environment set "{node[]}", but this is currently necessary due to the default semantics of distributed variables, which permit implicit remote writes only from a single, home environment. We are considering modifications to the language that would alleviate this type of problem while preserving determinism. One possibility is to allow the home environment to be dynamically redefined within the program.

5. CURRENT STATUS AND FUTURE RESEARCH DIRECTIONS

We have written a full compiler for DINO. It was implemented using the compiler generation tools of [5]. The compiler produces C code augmented by the low-level parallel instructions of the target parallel machine. The initial version runs on the Intel iPSC1 and iPSC2 hypercubes and on the simulators for these machines. The production of machine-dependent statements in the compiler has been isolated, however, so that porting it to other distributed memory multiprocessors is not expected to be difficult. The compiler is available at no charge to universities and research laboratories.

We have compared the lengths and execution times of DINO programs, including the examples in Section 4, to those of identical parallel algorithms coded as similarly as possible in C using the Intel iPSC2 primitives. In general, the DINO programs are two to three times shorter than the programs using iPSC primitives. Their run times on the iPSC2 range from within 1% of the iPSC2 primitive versions to roughly double the iPSC2 versions. Whenever the degradations are greater than 10-20%, our measurements have shown that they are due to those communication features that, for expediency, were implemented via a run-time library rather than in the compiler. This was done for all parameter distribution via distributed data and for many communications generated via remote (#) references to distributed data.

To demonstrate that communication via remote references could be made efficient, we implemented the most common remote reference mechanisms, namely those using the BlockRow and BlockCol mappings, and all remote references that use explicit environment sets (e.g., Example 4.2), in the compiler. In almost all cases, these require less than 5% more time than the same communications generated using iPSC primitives. The remaining remote references are currently generated via the run-time library and generally require about twice as much time as the same communications generated using iPSC primitives. High efficiency in parameter distribution was considered less important for a first version of the compiler, since parameter distribution constitutes a relatively small part of most programs, and thus it is handled in the run-time library. Parameter distributions using distributed data generally are at most twice as expensive as the same communication using iPSC primitives; however, certain cases are up to five to seven times slower. (Among the default mappings, the one slow case is the wrap column mapping, where each data reference is noncontiguous; the offset calculation is much slower when handled by a general facility than by one tailored to the mapping.)
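The extra cost in the wrap column case is essentially that of gathering strided data. In plain C terms (a generic sketch, not the code our compiler actually emits), sending one column of a row-major array entails a pack loop such as the following before the single block send:

    /* Gather column k of a row-major n x n array into a contiguous
       buffer so that it can be sent as one message. */
    void pack_column(const float *a, int n, int k, float *buf)
    {
        int i;
        for (i = 0; i < n; i++)
            buf[i] = a[i*n + k];   /* stride-n access: one offset calculation per element */
    }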

In summary, our measurements show that a communication-intensive DINO program that uses the remote references that have been implemented in the compiler is at most 10% slower than the equivalent program using low-level primitives, unless the computation is dominated by parameter distribution. We estimate that at most 2-3 additional months of compiler development work would be required to implement in the compiler all the remaining communication features, which are currently handled by the run-time library, and that all DINO programs would then require at most 10-30% more time than equivalent programs coded using low-level primitives. We feel this indicates that our high-level programming approach can be used without significantly hindering efficiency.

The development of the DINO language and compiler has given us a platform from which we wish to explore a number of interesting research issues. Some, involving language features and compiler optimizations, were mentioned briefly in Section 4. These include the ability to contract processes automatically and efficiently; that is, to write a DINO program for N (or some other function of the problem size) virtual processors and have the compiler transform this program into an efficient program whose number of processes equals the number of available processors.

To facilitate this, we are investigating including in DINO a more restricted, SIMD-like composite procedure as an option, which we call pseudo-SIMD. In conjunction, we would remove the requirement that the programmer distinguish between local and remote references in such procedures. Instead, the pseudo-SIMD semantics would allow the compiler to determine where communications are required. Our preliminary ideas on this topic are discussed in [19]. This new option would be applicable to many numerical algorithms, such as Examples 4.1 and 4.2, and would remove # references from them entirely. It also would allow the programmer to express these algorithms at their natural N-fold level of parallelism and then enable the compiler to contract them to efficient P-fold algorithms.
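In outline, the contraction we envision turns the per-virtual-processor body into a loop over the virtual processors hosted by each physical processor. A schematic C sketch under a block mapping (the names are ours, and this is not compiler output):

    #define N 512                      /* virtual processors, e.g., one per row */
    #define P 32                       /* physical processors */

    void update_row(int row);          /* body written for a single virtual processor */

    void contracted_body(int id)       /* executed once on physical processor id */
    {
        int v, row;
        for (v = 0; v < N/P; v++) {    /* the N/P virtual processors mapped here */
            row = id*(N/P) + v;        /* block mapping: virtual -> physical */
            update_row(row);
        }
    }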

This research problem is actually just a subproblem of a main research thrust, which is the exploration of issues that arise in expressing parallel algorithms for large, complex applications. These algorithms may involve multiple phases, and different data mappings and virtual parallel machines may be appropriate at different phases. A key ingredient in providing efficient support for such applications will be the ability to embed parallel algorithms within other parallel algorithms; then, depending on the level of embedding, a parallel algorithm may be contracted to anywhere from P to 1 actual process(es). Thus the contraction problem is a key portion of the large-scale application problem. The provision of pseudo-SIMD composite procedures, which greatly facilitate automatic contraction, is a key component in our approach to this general problem.

In conjunction with this research into large-scale parallel programming, we are collaborating with several scientists in using DINO and assessing its strengths and weaknesses. We are also beginning to use DINO in parallel programming instruction at the graduate and undergraduate levels. Other related, longer range research issues we are considering include the provision of dynamic distributed data structures and virtual parallel machines in a high-level distributed programming language, and visual interfaces to high-level distributed programming systems.


REFERENCES

1. Callahan, D., and Kennedy, K. Compiling programs for distributed-memory multiprocessors. J. Supercomput. 2 (1988), 151-169.
2. Chen, M., Choo, Y., and Li, J. Crystal: From functional description to efficient parallel code. Proc. Third Conference on Hypercube Concurrent Computers and Applications, 1988, pp. 417-433.
3. Derby, T., Eskow, E., Neves, R., Rosing, M., Schnabel, R., and Weaver, R. The DINO User's Manual. Department of Computer Science Tech. Rep. CU-CS-501-90, University of Colorado at Boulder, 1990.
4. Gelernter, D., Carriero, N., Chandran, S., and Chang, S. Parallel programming in Linda. Proc. 1985 International Conference on Parallel Processing. IEEE Computer Society, Silver Spring, MD, 1985, pp. 255-263.
5. Gray, R. W., Heuring, V. P., Krane, S. P., Sloane, A. M., and Waite, W. Eli: A complete, flexible compiler construction system. Department of Electrical and Computer Engineering Tech. Rep. SEC 89-1-1, University of Colorado at Boulder, 1989.
6. Hamey, L. G. C., Webb, J. A., and Wu, I-C. An architecture independent programming language for low-level vision. Comput. Vision Graphics Image Process. 48 (1989), 246-264.
7. Intel iPSC Programmer's Reference Guide, No. 310612-002, Mar. 1987.
8. Jordan, H. The Force. In Jamieson, L., Gannon, D., and Douglass, R. (Eds.), The Characteristics of Parallel Algorithms. MIT Press, Cambridge, MA, 1987, pp. 395-436.
9. Kernighan, B., and Ritchie, D. The C Programming Language. Prentice-Hall, Englewood Cliffs, NJ, 1978.
10. Koelbel, C., Mehrotra, P., and Van Rosendale, J. Supporting shared data structures on distributed memory architectures. Proc. 2nd SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, New York, 1990, pp. 177-186.
11. Li, H., Wang, C., and Lavin, M. Structured process. Proc. 1985 International Conference on Parallel Processing. IEEE Computer Society, Silver Spring, MD, 1985, pp. 247-254.
12. McGraw, J., Allan, S., Glauert, J., and Dobes, I. SISAL: Streams and iteration in a single assignment language. Language Reference Manual, Tech. Rep. M-146, Lawrence Livermore National Laboratory, 1983.
13. Mehrotra, P., and Van Rosendale, J. Compiling high level constructs to distributed memory architectures. Proc. Fourth Conference on Hypercube Concurrent Computers and Applications, 1989, Vol. 2, pp. 845-852.
14. Pratt, T. The Pisces 2 parallel programming environment. Proc. 1987 International Conference on Parallel Processing. IEEE Computer Society, Silver Spring, MD, 1987, pp. 439-445.
15. Rose, J., and Steele, G. C*: An extended C language for data parallel programming. Tech. Rep. PL85-7, Thinking Machines Corp., 1987.
16. Rosing, M., and Schnabel, R. An overview of DINO-A new language for numerical computation on distributed memory multiprocessors. In Rodrigue, G. (Ed.), Parallel Processing for Scientific Computing. SIAM, Philadelphia, 1989, pp. 312-316.
17. Rosing, M., Schnabel, R., and Weaver, R. DINO: Summary and examples. In Fox, G. (Ed.), The Third Conference on Hypercube Concurrent Computers and Applications. ACM, New York, 1988, pp. 472-481.
18. Rosing, M., Schnabel, R., and Weaver, R. Expressing complex parallel algorithms in DINO. Proc. Fourth Conference on Hypercubes, Concurrent Computers, and Applications, 1989, Vol. 1, pp. 553-560.
19. Rosing, M., Schnabel, R., and Weaver, R. Massive parallelism and process contraction in DINO. In Dongarra, J., Messina, P., Sorensen, D. C., and Voigt, R. G. (Eds.), Parallel Processing for Scientific Computing. SIAM, Philadelphia, 1990, pp. 364-369.
20. Rosing, M., and Weaver, R. Mapping data to processors in distributed memory computers. Proc. Fifth Distributed Memory Computing Conference. IEEE Computer Society, Silver Spring, MD, 1990, pp. 884-893.
21. Scott, L., Boyle, J., and Bagheri, B. Distributed data structures for scientific computation. Proc. Second Conference on Hypercube Multiprocessors, 1987, pp. 55-66.
22. Seitz, C. L., Seizovic, J., and Su, W.-K. The C programmer's abbreviated guide to multicomputer programming. Caltech Computer Science Tech. Rep. Caltech-CS-TR-88-11, 1988.
23. Shapiro, E. The family of concurrent logic programming languages. ACM Comput. Surveys 21, 3 (1989), 413-510.
24. Socha, D. G. Spot: A data parallel language for iterative algorithms. Department of Computer Science Tech. Rep. TR 90-03-01, University of Washington, 1990.
25. Waite, W., and Goos, G. Compiler Construction. Springer-Verlag, New York/Berlin, 1984.
26. Wegner, P. (Ed.). Special Issue on Programming Language Paradigms. ACM Comput. Surveys 21, 3 (1989).

MATTHEW ROSING received a B.S. in electrical and computer engineering in 1981 and an M.S. in computer science in 1988, both from the University of Colorado at Boulder. From 1982 to 1985 he worked as an electrical engineer for Hewlett-Packard. He is currently working toward his Ph.D. in computer science at the University of Colorado at Boulder, specializing in parallel languages and compilers.

ROBERT B. SCHNABEL received a B.A. in mathematics in 1971 from Dartmouth College and an M.S. and Ph.D. in computer science in 1975 and 1977, respectively, from Cornell University. He is currently a professor in the Department of Computer Science at the University of Colorado at Boulder, where he has been on the faculty since 1977. His research interests are in numerical and parallel computation, including numerical optimization methods, parallel optimization methods, and parallel languages.

ROBERT P. WEAVER received a B.A. in chemistry in 1973 and a J.D. in 1976, both from Case Western Reserve University. From 1976 to 1986 he was a staff attorney in the Cleveland Regional Office of the Federal Trade Commission. In 1984 he received an M.A. in economics from Cleveland State University. He is currently working toward his Ph.D. in computer science at the University of Colorado at Boulder, specializing in parallel languages and user interfaces.

Received April 23, 1990; revised October 19, 1990; accepted December 12, 1990