
A C++ language interface for parallel programming

Theo Ungerer

Department of Mathematics, University of Augsburg, Universitätsstr. 2, W-8900 Augsburg, Germany. Email: ungerer@uni-augsburg.de

This paper presents an object-oriented interface for parallel programming, and an algorithm for automatic translation into parallel programs. The programming interface consists of a restricted subset of the object-oriented language C++. Parallelism is defined explicitly at the abstract level of object definitions and method invocations within a single C++ program. The translator algorithm first generates a machine-independent communication graph and proceeds with the creation of the parallel programs, which are demonstrated for transputer systems with the HELIOS operating system. The necessary communication statements are generated automatically.

parallel programming, object-oriented programming, automatic parallelization

Object-oriented programming is already known for its superior structuring capabilities in developing large programs. Object-oriented programs provide an abstract communication model: objects are defined to communicate with each other by message passing. If an object receives a message, a method of its defining class is invoked. This may change the internal state of the object, or further messages may be generated. Such an abstract model of computation fits the message-passing capabilities of distributed-memory multiprocessors: an object-oriented program consists of a set of objects communicating with each other by messages, while a program for a distributed-memory multiprocessor is a set of node programs that communicate by message passing.

We chose C++ [1] as the object-oriented language to test our approach due to its popularity and general availability. The potential of parallel processing with C++ has been promoted by augmenting C++ with synchronization and parallel language features [2-6]. Our own approach promotes parallel programming by translating sequential programs written in a suitable subset of C++ into parallel programs. Each object is transformed into a node program and mapped to a processing node of the distributed-memory multiprocessor. The conceptual messages of the object-oriented programs are transformed into actual messages exchanged between processing nodes. No new language features have to be introduced. The translator algorithm automatically generates separate programs for each processor and the necessary communication statements.

The approach will be demonstrated by translating C++ programs for transputer systems [7] with the HELIOS operating system [8]. Unfortunately, C++ is not yet available for transputer systems with the HELIOS operating system. However, our approach can easily be adapted by using the appropriate header files in the C++ program and precompiling the C++ program to C code prior to the use of the HELIOS C compiler. In principle, the approach is more general: it can be applied to other object-oriented languages [9] and directed towards other distributed-memory multiprocessors. A version for translating towards an Intel iPSC/2 multiprocessor is presented in References 10 and 11.

The paper is organized as follows. The language restrictions for the use of the proposed translator algorithm are discussed in the next section, followed by an example program. The concept of a communication graph, generated automatically from an object-oriented program and used within the translator algorithm for the construction of the node programs, is then introduced. The two subsequent sections give the basic translator algorithm for creating the communication graph and the algorithm for generating parallel programs for transputer systems with the HELIOS operating system. Some optimizations and suggestions that help the programmer obtain efficient parallel programs are then discussed. The final section gives the conclusions.

RESTRICTIONS ON C++

Since C++ is a superset of C, it abounds with non object-oriented language features. Due to the message-passing structure of the target machine, we have to restrict C++ to a subset of purely object-oriented language features that facilitate the construction of parallel programs. The programmer should obey two principal restrictions:

• No global variables or functions should be accessed from methods within class definitions; also static variables and public data members in class definitions should be avoided. However, global variables and functions that are only accessed from the main-function are allowed.

• All user-defined objects and method invocations must be known at compile time. This prohibits object definitions and method invocations local to dynamically bounded loops (a relaxation of this restriction, enabling the distribution of objects at run time, is defined in Reference 12), object definitions within conditional statements, and recursion that leads to dynamic creation of objects or method invocations.

By enforcing these restrictions, a static communication graph (as will be explained in a later section) can be derived. The first restriction affects the programming style only. The second restricts the possible application programs.
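For illustration, the following fragments contrast the two cases of the second restriction; the class and all names are ours, not taken from the paper:

    class worker
    {
    public:
        double run(int i) { return i; }   // stands for a real computation
    };

    /* Not translatable: the loop bound n is only known at run time, so
       the number of method invocations in this dynamically bounded loop
       cannot be determined at compile time (second restriction):

           int n; cin >> n;
           worker w;
           for (int i = 0; i < n; i++) w.run(i);
    */

    /* Translatable: the number of objects and invocations is fixed by a
       compile-time constant, so the loop can be unravelled. */
    #define N 4
    void main(void)
    {
        worker w[N];
        for (int i = 0; i < N; i++)
            w[i].run(i);
    }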

EXAMPLE PROGRAM

A simple object-oriented C++ program that satisfies the above restrictions is shown in Listing 1. It computes an approximation to π by using the rectangle rule to compute an approximation to the definite integral of f(x) = 4/(1 + x²) between 0 and 1, which equals π. The first part of the program gives the definition of the class part_of_pi. INTRVLS defines the number of intervals used for the approximation. WIDTH defines the number of objects of class part_of_pi, each of which computes INTRVLS/WIDTH portions of the integral and returns the partial sum to the calling main-function. The computations pertaining to each object are independent, and thus can be done in parallel.

COMMUNICATION GRAPH

A communication graph describes the abstract communication model, i.e. the structure of the objects and the method invocations. It represents an intermediate step in the translator algorithm. A communication graph derived from a specific C++ program is independent of the target machine and its configuration.

A communication graph is a directed graph where nodes represent objects and named links represent messages sent between objects. One node, which we will refer to as the 'Driver', represents the main-function of the C++ program. There are two kinds of links in the communication graph. Dashed links represent method invocation messages, which trigger the execution of methods. Solid links represent parameter messages, which carry the appropriate parameters to the invoked method, and the return values (if there are any) back to the caller.

The communication graph is generated at compile time, and therefore all objects and method invocations have to be known at compile time, as is guaranteed by the restrictions given above.
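As an illustration of this intermediate representation, the following sketch shows data structures that could hold such a graph; the type and field names are our own choice, not those of the translator:

    #define MAX_NODES 64
    #define MAX_LINKS 256

    enum LinkKind { INVOCATION, PARAMETER };   /* dashed vs. solid links */

    struct GraphLink {
        LinkKind kind;
        int      from, to;      /* node indices of sender and receiver   */
        char     name[32];      /* method name or parameter description  */
        int      precedence;    /* precedence number, assigned in step 5 */
    };

    struct GraphNode {
        char object_name[32];   /* "Driver", "partsum[0]", ...           */
    };

    struct CommGraph {
        GraphNode nodes[MAX_NODES];
        GraphLink links[MAX_LINKS];
        int       nnodes, nlinks;
    };

A dashed arc of Figure 1 then corresponds to a GraphLink with kind INVOCATION, a solid arc to one with kind PARAMETER.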

The communication graph for the sample program of Listing 1 is shown in Figure 1. Four objects of class part_of_pi (partsum[0], ..., partsum[3]) are included in this communication graph (only two of the four object nodes and the corresponding message links are shown in Figure 1). The main-function is represented in the communication graph as the 'Driver' node. To illustrate the links in this graph consider, for example, the statement sum = partsum[0].compute_pi(1, 10000, 4), which means that the method compute_pi in the object partsum[0] is invoked by the main-function. The values 1, 10000 and 4, in place of i+1, INTRVLS and WIDTH (after step 2 of the translator algorithm, see next section), are passed to the member function as parameters, and the return parameter is assigned to the variable 'sum'. Three separate messages are necessary: one for the method invocation, one for the three parameters that are combined into a single parameter message, and one for the return parameter.

    #define WIDTH 4
    #define INTRVLS 10000

    class part_of_pi
    {
    public:
        double compute_pi(int, int, int);
    };

    double part_of_pi::compute_pi(int start, int cuts, int nprocs)
    {
        int i;
        double point, h, z, pi_part = 0.0;

        h = 1.0/cuts;
        for (i = start; i <= cuts; i += nprocs) {
            point = (i - 0.5)*h;
            z = 4.0/(1.0 + point*point);
            pi_part += h*z;
        }
        return pi_part;
    }

    void main(void)
    {
        int i;
        double sumall = 0.0, sum = 0.0;
        part_of_pi partsum[WIDTH];

        for (i = 0; i < WIDTH; i++) {
            sum = partsum[i].compute_pi(i+1, INTRVLS, WIDTH);
            sumall += sum;
        }
        cout << "pi = " << sumall << "\n";
    }

Listing 1. A sequential object-oriented C++ program


[Figure 1. Communication graph: the 'Driver' node is connected to the object nodes partsum[0], ..., partsum[3]. For partsum[0], the dashed method invocation link compute_pi and the solid parameter link (1,10000,4) carry precedence number 40, and the return link sum carries precedence number 41; the corresponding links for partsum[3] carry the precedence numbers 100, 100 and 101.]

Note, however, that additional synchronization is needed. For instance, the return message has to be sent after receiving the parameter message from the 'Driver' and evaluating the compute_pi method. To solve the synchronization problem, we assign precedence numbers to the messages in the communication graph. The precedence numbers, shown in Figure 1 as simple integers, are derived from the statement numbers in the sequential program (see step 5 in the next section).

GENERATING THE COMMUNICATION GRAPH

Parallel programs are derived from the C++ program by first generating the communication graph and then generating the actual programs. There are five steps in generating the communication graph.

The first step verifies that the restrictions stated earlier are satisfied. The second step applies the '#define' directives, and unravels loops with object creations or method invocations in the loop body. Method invocations nested in statements are also unravelled. This makes the program easier to analyse and translate into parallel code. The third step creates all the object nodes by examining the object declaration statements in the program. For instance, part_of_pi partsum[4] results in the creation of the four object nodes (partsum[0], ..., partsum[3]) shown in Figure 1.
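To illustrate the second step, the main-function of Listing 1 would be unravelled roughly as follows (our sketch; the statement numbers in the comments anticipate step 5):

    /* main-function after step 2: '#define's applied and the loop
       unravelled into WIDTH = 4 separate invocations */
    void main(void)
    {
        int i;
        double sumall = 0.0, sum = 0.0;
        part_of_pi partsum[4];

        sum = partsum[0].compute_pi(1, 10000, 4);   /* statement 40 */
        sumall += sum;                              /* statement 50 */
        sum = partsum[1].compute_pi(2, 10000, 4);   /* statement 60 */
        sumall += sum;
        sum = partsum[2].compute_pi(3, 10000, 4);
        sumall += sum;
        sum = partsum[3].compute_pi(4, 10000, 4);
        sumall += sum;
        cout << "pi = " << sumall << "\n";
    }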

The fourth step creates message links by scanning through the main-function derived in step 2, and generating one or more links in the communication graph for each method invocation. The first is the method invocation link which carries the method name sent from the caller node to the called object node. This is drawn as a dashed arc in the communication graph. If there are parameters, then there are one or more parameter links drawn as solid arcs in the graph.

A parameter for a method invocation could be a simple non-object parameter or an object parameter, and could be passed either by value or by reference. All non-object parameters are combined into one parameter message drawn from the caller to the called object node. This is the case for both call-by-value and call-by-reference. In the case of simple (non-object) parameters passed by reference, the parameters are returned to the caller in a separate message after the execution of the method. Likewise, a return value specified in the method call is sent back to the caller. The return value can be combined with the call-by-reference parameters to reduce message traffic. Object parameters are treated differently from non-object parameters. If an object is passed by value, then a copy of this object is sent as a parameter message from the parameter object node to the called object node. If an object parameter is passed by reference, the object is sent to the called object node, as in the case of a by-value object parameter, and returned to the parameter object node after execution of the method.
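As a sketch of the resulting message pattern on the caller side, consider a hypothetical invocation obj.scale(x) with a single non-object parameter x passed by reference; the Posix numbers and names below are illustrative only:

    void main(void)
    {
        double x = 2.0;
        /* all non-object parameters travel in one parameter message */
        write(5, (char *)&x, sizeof(double));
        /* after the callee has executed scale(), the call-by-reference
           parameter is returned in a separate message and overwrites x */
        read(4, (char *)&x, sizeof(double));
    }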

The fifth step is the assignment of precedence numbers. The algorithm first sequentially numbers all statements in the main-function and in each method of every class definition. The numbering is done in intervals of tens. It then scans the main-function and, each time a method invocation is encountered, its statement number is assigned as the precedence number to the corresponding method invocation link and, if there is only one, also to its parameter link. If more than one parameter link exists, the following synchronization problems may arise:

• In the case of two or more parameter messages from different starting nodes but with the same callee node, the messages must be distinguishable by the callee node.

• In the case of a call-by-reference (non-object) parameter or a return value, the additional parameter message from the callee node back to the caller node must be sent after receipt of the parameters and the execution of the method by the callee node.

These problems are solved by the intervals of tens in the statement numbers. All parameter links that belong to a single method invocation are assigned intermediate numbers x + 0 to x + 9, where x is the statement number of the method invocation. The highest number is assigned to the return parameter link. This restricts the number of parameter links per method invocation to 10. To overcome this restriction when necessary, the statement numbering can be changed appropriately in specific cases where more parameter links are needed. The assignment of precedence numbers is more complicated if methods contain non-local method invocations [10, 13].

The resulting precedence numbers capture the semantics of the sequential program. They are used when creating the individual node programs to ensure that messages are sent and received in the order prescribed by the sequential program. To illustrate this procedure, consider the original program in Listing 1. The program is first unravelled and each statement is assigned a unique number. Hence, the statement sum = partsum[0].compute_pi(1, 10000, 4); obtains statement number 40, the statement sum = partsum[1].compute_pi(2, 10000, 4); gets statement number 60, and so on. As a result, the method invocation message compute_pi, corresponding to the first statement mentioned above, has the precedence number 40; the parameters 1, 10000 and 4 are combined into a single message with precedence number 40; and the return message 'sum' has precedence number 41, as shown in Figure 1.
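In terms of the CommGraph sketch given earlier, the numbering rule can be written as follows (our illustration, not the translator's code; params lists the parameter links of one invocation in order, with the return parameter link, if any, last):

    /* Assign precedence numbers for one method invocation whose
       statement number is x. */
    void assign_precedences(CommGraph &g, int invocation_link,
                            const int params[], int nparams, int x)
    {
        g.links[invocation_link].precedence = x;    /* invocation message  */
        for (int i = 0; i < nparams; i++)
            g.links[params[i]].precedence = x + i;  /* x+0 ... x+9 at most */
        /* the last entry, the return parameter link, thereby receives
           the highest precedence number of the invocation */
    }

For the first invocation of the example this yields precedence 40 for the invocation message, 40 for the combined parameter message, and 41 for the return message, as in Figure 1.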

CREATING PARALLEL CODE

The target machine is a transputer system with, in principle, an unlimited number of processor nodes, each equipped with local memory. There is no common memory. All data exchanges between the processor nodes are organized by the HELIOS operating system and must be specified by the programmer by read/write-statements, and by a so-called CDL script [14]. The CDL script defines the resource and communication needs of the parallel programs and is written in a language called CDL (Component Distribution Language). The processors can be physically connected in any topology. Load distribution is done automatically by HELIOS at load-time. Likewise, routing of messages via an intermediate transputer node is organized automatically by HELIOS. Each processor node executes its own program, loaded initially from a so-called host node. Four additional steps are necessary to generate code for each node of the transputer system.

The first step creates a configuration graph, which is a directed graph where nodes represent tasks (node programs) and links represent communication paths between tasks. Each link bears a unique name and two numbers, one for the sender and one for the receiver. The numbers are the so-called Posix numbers, which correspond with HELIOS communication channels. The numbers 0-2 are reserved, by convention, for stdin, stdout and stderr; number 3 is not defined; otherwise, even numbers define receiver ports and odd numbers define sender ports. Each node in the communication graph is assigned to a separate node in the configuration graph. All parameter links that share the same sender and the same receiver node in the communication graph are represented by a single link in the configuration graph. Method invocation links are omitted. Posix numbers are assigned to the links in ascending order, starting with number 4 for receiver ports and with number 5 for sender ports relative to each node. The link names are used as stream names when generating the CDL script (see below). The configuration graph for the example program is shown in Figure 2.
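The numbering rule can be summarized by the following small sketch (ours, not part of the translator):

    /* Posix numbers per node: 0-2 are reserved for stdin, stdout and
       stderr, 3 is not defined; receiver ports get the even numbers
       4, 6, 8, ... and sender ports the odd numbers 5, 7, 9, ... in
       ascending order. */
    struct PortCounters { int next_recv, next_send; };

    void init_ports(PortCounters &c) { c.next_recv = 4; c.next_send = 5; }

    int alloc_receiver(PortCounters &c)
    { int p = c.next_recv; c.next_recv += 2; return p; }

    int alloc_sender(PortCounters &c)
    { int p = c.next_send; c.next_send += 2; return p; }

For the Driver of the example program this yields the receiver ports 4, 6, 8, 10 and the sender ports 5, 7, 9, 11, matching the read/write statements of Listing 3.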

The second step creates a CDL script, which specifies the communication paths of the parallel programs. A CDL script consists of a component description for each parallel program and a configuration string at the end. The component description consists of an object file name, various system configuration fields, and a streams field. The translator algorithm creates the configuration string in a uniform manner by connecting all component names with the CDL general parallel construct '^^', starting with the Driver. The actual communication paths are specified in the streams field of each component description. The streams field consists of a number of entries, each separated by a comma. Each entry corresponds with a Posix number, and is either blank, if not defined in the CDL script, or consists of a stream mode ('>|' means output, '<|' means input) and a stream name. The translator algorithm creates a component description for each node in the configuration graph, and assigns the specific stream name together with the stream mode to each component. The CDL script for the configuration graph in Figure 2 is shown in Listing 2.

[Figure 2. Configuration graph: the Driver node is connected to the nodes partsum[0], ..., partsum[3] by the streams st0, ..., st7; at each node the receiver ends carry the even Posix numbers 4, 6, ... and the sender ends the odd Posix numbers 5, 7, ...]

    component Driver
    { code Driver; ...
      streams ,,,,<|st1,>|st0,<|st3,>|st2,
              <|st5,>|st4,<|st7,>|st6; }

    component partsum0
    { code partsum; ...
      streams ,,,,<|st0,>|st1; }

    ...

    component partsum3
    { code partsum; ...
      streams ,,,,<|st6,>|st7; }

    Driver ^^ partsum0 ^^ ... ^^ partsum3

Listing 2. A CDL script

The third step generates the Driver program, which plays the role of a central manager for the system of parallel node programs. It is created by scanning the modified main-function and replacing all method invocations by read/write-statements. The structure and the Posix numbers of the read/write-statements are derived from the configuration graph. Object declarations are omitted. Also all method invocations without parameters, and method invocations with only object parameters, are omitted, as these do not need read/write-statements in the Driver program. All other code that is not related to method invocation remains unchanged. Also all global variables and function definitions are included (both are only accessed by the main-function, as guaranteed by the first restriction).

To cope with method invocations within conditional statements, the following mechanism is employed. A conditional statement of the form

    if (condition) {method1}
    else {method2}

is transformed into

    if (condition) {send_method1_message; send_method2_dummy_message}
    else {send_method1_dummy_message; send_method2_message}

The dummy messages are necessary to avoid pending read statements in the node programs. If a dummy message is received, no method is invoked and the program continues without being further blocked. The same mechanism can also be applied to switch statements.
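On the receiving side, a node program might distinguish the two kinds of messages as in the following sketch; the tag values are our assumption, since the paper does not prescribe a message format:

    #define DUMMY  0
    #define INVOKE 1

    void main(void)
    {
        int tag;
        read(4, (char *)&tag, sizeof(int));  /* invocation or dummy message */
        if (tag == INVOKE) {
            /* read the parameter message and invoke the method */
        }
        /* on DUMMY no method is invoked; the node program simply
           continues without being further blocked */
    }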

We illustrate this process again by the example program. A simple but necessary optimization to increase parallelism is to put the write statements before the read statements, as far as this is not prohibited by data dependencies. The resulting Driver program is shown in Listing 3.

The fourth step creates the node programs. Each object node in the communication graph results in a separate node program. To deal with inheritance and 'friend' declarations, all class definitions of the sequential C++ program are included in all node programs.

For generating a node program, the communication graph is used.


    void main(void)
    {
        int i;
        double sumall = 0.0, sum = 0.0;
        int writebuf[3];

        writebuf[1] = 10000;
        writebuf[2] = 4;
        writebuf[0] = 1; write(5,  (char *)writebuf, 3*sizeof(int));
        writebuf[0] = 2; write(7,  (char *)writebuf, 3*sizeof(int));
        writebuf[0] = 3; write(9,  (char *)writebuf, 3*sizeof(int));
        writebuf[0] = 4; write(11, (char *)writebuf, 3*sizeof(int));
        read(4,  (char *)&sum, sizeof(double)); sumall += sum;
        read(6,  (char *)&sum, sizeof(double)); sumall += sum;
        read(8,  (char *)&sum, sizeof(double)); sumall += sum;
        read(10, (char *)&sum, sizeof(double)); sumall += sum;
        cout << "pi = " << sumall << "\n";
    }

Listing 3. Parallel code for the Driver

    /* include the class definition for class part_of_pi */

    void main(void)
    {
        int receivebuf[3], start, cuts, nprocs;
        double sum;
        part_of_pi partsum;

        read(4, (char *)receivebuf, 3*sizeof(int));
        start  = receivebuf[0];
        cuts   = receivebuf[1];
        nprocs = receivebuf[2];
        sum = partsum.compute_pi(start, cuts, nprocs);
        write(5, (char *)&sum, sizeof(double));
    }

Listing 4. Parallel code for the partsum objects

The general algorithm performs the following tasks for each object node in the communication graph. Select the arc(s) with the lowest precedence number. For each incoming parameter link, generate a read statement. Next, if there is a method invocation link, generate the corresponding method invocation call. If there is an outgoing parameter link, generate a write statement. Repeat these steps for the arc with the next higher precedence number until all links have been processed.
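In outline, and again in terms of the CommGraph sketch from above, the generation loop for one object node could look like this (emit_read, emit_call and emit_write stand for code-emission routines that we assume elsewhere; this is our sketch, not the translator's actual implementation):

    void emit_read(const GraphLink &);    /* emit a read statement     */
    void emit_call(const GraphLink &);    /* emit a method invocation  */
    void emit_write(const GraphLink &);   /* emit a write statement    */

    void generate_node_program(const CommGraph &g, int self)
    {
        int done[MAX_LINKS];
        for (int l = 0; l < g.nlinks; l++) done[l] = 0;

        for (;;) {
            /* select the lowest precedence number among the
               unprocessed links that touch node 'self' */
            int p = -1;
            for (int l = 0; l < g.nlinks; l++)
                if (!done[l]
                    && (g.links[l].from == self || g.links[l].to == self)
                    && (p < 0 || g.links[l].precedence < p))
                    p = g.links[l].precedence;
            if (p < 0) break;                    /* all links processed */

            /* arcs of this precedence: reads first, then the method
               invocation, then writes, as prescribed by the algorithm */
            for (int l = 0; l < g.nlinks; l++)
                if (!done[l] && g.links[l].precedence == p
                    && g.links[l].kind == PARAMETER && g.links[l].to == self)
                    { emit_read(g.links[l]); done[l] = 1; }
            for (int l = 0; l < g.nlinks; l++)
                if (!done[l] && g.links[l].precedence == p
                    && g.links[l].kind == INVOCATION && g.links[l].to == self)
                    { emit_call(g.links[l]); done[l] = 1; }
            for (int l = 0; l < g.nlinks; l++)
                if (!done[l] && g.links[l].precedence == p
                    && g.links[l].kind == PARAMETER && g.links[l].from == self)
                    { emit_write(g.links[l]); done[l] = 1; }
            /* any remaining link of this precedence (e.g. an invocation
               link leaving 'self') needs no code in this node program */
            for (int l = 0; l < g.nlinks; l++)
                if (!done[l] && g.links[l].precedence == p
                    && (g.links[l].from == self || g.links[l].to == self))
                    done[l] = 1;
        }
    }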

Because of the regularity of the communication structure of the example program, the code is the same for all partsum node programs (as shown in Listing 4).

OPTIMIZATIONS AND SUGGESTIONS FOR THE PROGRAMMER

If there are more object nodes in the communication graph than processor nodes, message traffic and context switching overhead can be reduced by a strategy that combines distinct object nodes of the communication graph into a single node. The messages within the combined object nodes are collapsed into ordinary method invocations. The same algorithm for constructing node programs is applied to the combined object nodes. Another optimization is an explicit placement of object nodes onto processor nodes by observing the locality of messages, avoiding the automatic load distribution of the HELIOS operating system.

State-of-the-art distributed-memory multiprocessors suggest the utilization of coarse-grain rather than fine-grain parallelism. The programmer should therefore use computation-intensive methods. Since the parameters of method invocations result in the creation of messages, object parameters should be avoided. Also, simple value parameters are preferable to reference or pointer parameters, since the latter cause message traffic from the callee back to the caller. Due to the increase in message traffic caused by dummy messages (see step 3 of the previous section), method invocations within conditional statements should also be avoided.

The programmer can strongly influence the efficiency of the generated parallel programs by applying his knowledge of the translation method and the target machine. The size and number of objects determine the granularity of parallelism and the partitioning into parallel programs.

CONCLUSIONS

In this paper, we presented our approach to parallel programming of distributed-memory multiprocessors. The tool described in this paper expects programs written in a subset of the object-oriented language C++. Parallel programs can be derived automatically and mapped onto a given multiprocessor machine configuration. This mapping has been demonstrated for transputer systems with the HELIOS operating system.

Experimental results have been obtained using the unoptimized translator algorithm adapted to an Intel iPSC/2 hypercube [10, 11]. The experiments showed a speed-up of 3.15 over sequential execution using seven nodes of the hypercube (including a slow-down caused by the additional messages necessary for measurement). The same program, adapted to HELIOS code, has also been run on a transputer system. However, our transputer system, with only two transputers, is too small to achieve comparable performance results.


Our current research focuses on relaxing the restrictions on C++ programs as formulated earlier in the paper, on adaptation of the machine-dependent part of the translation algorithm to other distributed-memory multiprocessors, and on the study of the transferability of the method to other object-oriented languages, in particular to the object-oriented language Eiffel.

REFERENCES

1 Ellis, M A and Stroustrup, B The Annotated C++ Reference Manual Addison-Wesley, Reading, MA (1990)

2 Gehani, N H and Roome, W D 'Concurrent C++: concurrent programming with class(es)' Softw. Pract. Exper. Vol 18 No 12 (December 1988) pp 1157-1177

3 Wu, Y and Lewis, T G 'Parallelism encapsulation in C++' International Conference on Parallel Processing, Vol II (1990) pp 35-42

4 Faust, J E and Levy, H M 'The performance of an object-oriented thread package' ECOOP/OOPSLA '90 (October 1990) pp 278-287

5 Labrèche, P and Lamarche, L 'Interactors: a real-time executive with multiparty interactions in C++' SIGPLAN Not. Vol 25 No 4 pp 20-32

6 Habert, S, Mosseri, L and Abrossimov, V 'COOL: kernel support for object-oriented environments' ECOOP/OOPSLA '90 (October 1990) pp 269-277

7 Transputer Reference Manual Inmos, Prentice-Hall, Englewood Cliffs, NJ (1988)

8 Noble, B, Ganz of Vardas, R and Veer, B The HELIOS Parallel Programming Tutorial Distributed Software, Bristol (1990)

9 Ungerer, T and Bic, L 'An object-oriented interface for parallel programming of loosely-coupled multiprocessor systems' Second European Distributed Memory Computing Conference (EDMCC2), Munich (22-24 April 1991) pp 163-172

10 Yin, M-L, Bic, L and Ungerer, T 'Parallel C++ programming on the Intel iPSC/2 hypercube' Fourth Annual Symposium on Parallel Processing, California State University, Fullerton, CA (4-6 April 1990)

11 Yin, M-L, Bic, L and Ungerer, T 'Parallelizing static C++ programs' TOOLS PACIFIC '90 (November 1990)

12 Bic, L 'Distributing object arrays in C++' Technical Report, Information and Computer Science Department, University of California, Irvine (January 1990)

13 Ungerer, T 'Parallelising C++ programs for transputer systems' EUROMICRO 91, Proceedings of the 17th Euromicro Conference on Hardware and Software Design Automation, Vienna, Austria (September 1991) pp 463-470

14 The CDL Guide Distributed Software Limited, Bristol (1990)

Theo Ungerer studied mathematics and computer science at the Universities of Heidelberg, Zürich and Berlin (Technical University), where he was awarded a Diploma in 1982. From 1982 to 1992 he was a scientific assistant at the University of Augsburg, where he obtained a Doctoral degree in 1986. He was a visiting assistant professor at the University of California at Irvine from 1989 to 1990. In 1992 he was awarded a Habilitation degree. From 1992 to 1993 he was a professor of computer architecture at the Friedrich Schiller University, Jena. In April 1993 he became Professor of Computer Architecture at the University of Karlsruhe, Germany.
