[IEEE Computer. Soc. Press Second International Workshop on High-Level Parallel Programming Models and Supportive Environments - Geneva, Switzerland (1 April 1997)] Proceedings Second

Parallel Programming through Configurable Interconnectable Objects

Enrique Vinicio Carrera E. Orlando Loques Julius Leite Coppe - UFRJ CAA - UFF CAA - UFF

Caixa Postal 685 1 1 Rio de Janeiro - RJ, 2 1945-970

Rua Passo da Patria, 156 Niteroi - RJ, 242 10-240

Brazil Brazil Brazil [email protected] .br [email protected] [email protected]

Rua Passo da Patria, 156 Niteroi - RJ, 242 10-240

Abstract

This paper presents P-RIO, a parallel programming environment that supports an object based software conjiguration methodology. It promotes a clear separation of the individual sequential computation components from the interconnection structure used for the interaction between these components. This makes the data and control interactions explicit, simpliJjiing program visualization and understanding. P-RIO includes a graphical tool that helps to conjigure, monitor and debug parallel programs

1. Introduction

In most message passing based parallel programming environments, the interactions are specified through explicit language constructs embedded in the text of the program modules. As a consequence, when the interaction patterns are not trivial, the overall program structure is hidden, making difficult its understanding and performance optimization. In addition, there is little regard for properties such as reuse, modularity and software maintsnmce that are of great concern in the software engineering area.

In this context, P-RIO (Parallel - Reconfigurable Interconnectable Objects) tries to offer high-level, but straightforward, concepts for parallel and distributed programming [I ] . A simple software configuration methodology makes most of the useful object based programming technology properties available, facilitating modularity and code reuse. This methodology promotes a clear separation of the individual sequential computation components from the interconnection structure used for the interaction between these components. Besides the language used for the programming of the sequential components, a configuration language is used to describe

the program composition and its interconnection structure. This makes the data and control interactions explicit, simplifying program visualization and understanding. In principle, the methodology is independent of the programming language, operating system and of the communication architecture adopted.

P-RIO includes a graphical tool that provides system visualization features, that help to configure, monitor and debug a parallel program. The support environment has a modular construction, is highly portable, and provides development tools and run-time support mechanisms for parallel programs, in architectures composed of heterogeneous computing nodes. Our current implementation, based on PVM [ 3 ] , is message passing oriented what had strong influence in the presentation of the paper. However, we also provide some insights when considering a distributed shared memory architecture.

Figure 1. A typical module icon

2. P-RIO concepts

The P-RIO methodology hinges on the configuration paradigm [ 2 ] whereby a system can be assembled by extemally interconnected modules obtained from a library. The concept of module (figure 1) has two facets: (i) a type or class if it is used as a mould to create execution units of a parallel program; (ii) an execution unit, called a module or class instance. Primitive classes define only one thread of control, and instances created from them can execute concurrently. Composition caters for module reuse and encapsulation; that is to say, existing (primitive and composite) classes can be used to compose new classes.

110 0-8186-7882-8/97 $10.00 0 1997 IEEE

Configuration comprises the selection and naming of class instances and the definition of their interconnections.

The points of interaction and interconnection between instances are defined by named ports, which are associated with their parent classes. Ports themselves are classes that can be reused in different module classes. A port instance is named in the context of its owner class. This provides configuration level naming independence helping the reuse of the port classes.

We use the term transaction style to name a specific control and data interaction rule between ports. Besides the interaction rule, the style also defines the implementation mechanisms required for the support of the particular transaction.

Our basic transaction set includes a bi-directional synchronous (remote-procedure-call like) and a Ilnldii.cct;onal asynchronous (datagram-like) transaction style. The synchronous transaction can be used in a relaxed fashion (deferred synchronous), without blocking, allowing to overlap remote calls with local computations. This basic transaction set can be extended or modified in order to fulfill different requirements of parallel programs. It was implemented using the standard PVM library. Currently, we are considering a MPI [4] based implementation. In this case, the different semantic modes (defined by the underlying protocols) associated to MPI point-to-point transactions could be selected at the configuration level. Other transaction styles, as for example, the highly optimized RPC mechanism described in [ 5 ] , based on active messages, could be included in our environment and selected for specific port connections.

2.1. Configuration programming

Figure 2, shows a master-slave architecture typically found in parallel programs and its description using the configuration language. Data is a port class that defines an RPC-like transaction (sync); int [21 and double are the data types used for the in and out call arguments, respectively. Calc-Pi is a composite class that uses the primitive classes Master and Slave; it defines enclosed instances of these classes (master and slave) as well as their ports and connections. Note that the ports named input and output [ i l are of the Data class. The last line specifies the creation of an instance of Calc - Pi, named calc-pi with a variable number of slave instances. The declarative system description is submitted to a command interpreter that executes the low level calls required to create the specified parallel program. The interpreter queries the user to provide the number of replicated instances to be actually created in the system. The instances are automatically distributed to the available set of processors. However, the user can override this

allocation defining specific mappings at the configuration level. The configuration language interpreter uses a parser written in Tcl [6] and supports flow control instructions that facilitate the description of complex configurations [l] . Alternatively, as described in section 2.4, a graphical tool can be used to configure parallel programs.

/'

\

port-class Data sync, int [21, double class Calc-Pi N

class Master N code C "master $N" : port out Data output [$NI ;

1 class Slave

1

code C l'slave" ; port in Data input;

Master master $N; Slave slave [$NI : forall i $N link master.output[$il

slave [Si1 .input; I

I Calc-Pi calc-pi variable;

Figure 2. Example of the configuration language

2.2. Group transactions

A group can be seen as an abstract module that receives inputs and distributes them according to a particular transaction style. In our model, groups are explicitly named for connection purposes and several collective message passing transaction styles can be selected at the configuration domain. Configuration level primitives are available to create and name groups, as well as to specify port to group connections. Our current implementation supports two multicast group transaction styles: unidirectional and bi-directional, for use with the asynchronous and synchronous transactions, respectively. It is noteworthy that the popular all-gather, scatter and gather, and all-to-all collective data moves can be implemented using the standard features of P-RIO.

Barrier synchronization mechanisms can also be implemented using a group transaction through ports. However, we chose to offer a specific programming level

111

primitive for this function, in order not to burden the programmer with port concerns. For the same reason, barrier specifications appear in a simplified form at the configuration level.

2.3. Shared memory transactions

In this section, we present one scheme that we adopt to introduce the distributed shared memory (DSM) paradigm in the context of P-RIO. This provides a hybrid environment, helping the programmer to mix the two paradigms in order to implement the best solution for a specific problem.

\\, \

i

(a) Configuration Structure

Coherence Protocol

(b) Implementation Structure

Figure 3. Distributed shared memory transaction

In figure 3-a, lets consider that module S can perform heavy processing on a data structure on behalf of any module C[i]. The latter concurrently compete to access these processing services through procedure call transactions associated to ports visible at S interface (for simplicity, only one port is presented in figure 3). Thus, as an implementation option, S and its encapsulated data structures can be replicated at each hardware node where there is a potential user of its services. Thus, the procedure code can be executed locally, providing a simple computation and data migration policy. A procedure that performs read-only data operations does not need to execute any synchronization code. However, if the procedure performs updates to S data, they must be implemented using a shared memory coherence protocol. Configuration level analysis, similar to compile level

analysis used in Orca [7], could determine that module S is to be replicated. This could also be done by the programmer, using knowledge regarding program behavior and structure. Thus, each procedure call transaction can be associated to a mutual exclusion region and, in consequence, is equivalent to acquire primitives used in DSM libraries, such as Munin or TreadMarks [SI. A special code associated to each port can be used to trigger the required underlying consistency protocol (figure 3-b). It is interesting to note that the data sharing protocol and the modules to processors mapping can be specified at the configuration level, allowing great flexibility for system tuning. The shared memory extensions are not yet integrated in the environment. We are planning to implement them using a DSM coherency protocol that minimizes latency by transferring updates in advance, for the processor most likely to need it.

/

\

Figure 4. Snapshot of P-RIO graphical interface

2.4. Graphical tool

P-RIO includes a graphical tool that helps to configure, visualize, monitor and debug parallel applications. The graphical representation of each class (icon), including their ports, and of a complete system is automatically generated from its textual configuration description (figure 4). Alternatively, using the graphical tool, a system can be created by selecting classes from a library, specifying instances from them and

112

interconnecting their communication ports. The encapsulation concept is supported, permitting to compose graphically new module classes from already existing classes. Class compression and decompression features allow the designer to inspect the internals of composite classes and simplify the visual exposition of large programs. With this tool, it is also possible to control the physical allocation, instantiation, execution and removal of modules, with simple mouse commands. For example, using these features, one can reconfigure the application on the fly by stopping modules, reallocating them, or changing several other parameters. The graphical interface borrows many features of the XPVM tool [3], contained in the PVM system. Thus, it supports debugging primitives for displaying messages sent and print-like outputs. This interface was implemented using the TcliTk toolkit [6], being very portable as well.

3. An example

As an example, P-RIO is used to parallelize an experimental image recognition system. A neural network classifier and preprocessing techniques that make the result independent of the relative position and scale of the target image are used. The preprocessing stage uses common digital processing techniques (filtering, transforms and convolutions) that imply on high processing demands as well as high rate of repetition of operations. To reduce spurious noise a median filter is applied on the image (a matrix of 256x256 pixels), followed by a border detection algorithm to achieve the object position within the vision field. In parallel, homomorphic processing is applied to the original matrix in order to avoid distortions that can be caused by the light source. The outputs of the two stages are used to generate a polar coordinate representation of the image; and in the sequence, circular harmonic coefficients are obtained. Finally, the Mellin transform is applied to obtain a 128x512 matrix of complex coefficients, that are partially presented to the neural classifier. Considering the sequential implementation of this system we identified two points for performance improvement: (i) In the image preprocessing phase the edge detection task takes 70% of the total machine time. This task could be subdivided in smaller tasks, that can be executed in parallel, taking advantage of the properties of the adopted bi-dimensional signal processing algorithm; (ii) The neural net has 16384 inputs and two layers with 100 and 5 neurons, respectively. Both the learning and recognition tasks take a long CPU time. The neural net was subdivided in a number N of sub-blocks each one containing a subset of the neurons.

Figure 5 presents the module configuration for the system. The median and edge detection functions are

encapsulated in one composite class called center: in this class, edges is the basic unit of parallelization. It is interesting to note that the number of instances of edges is specified by a parameter in the configuration description text. The other functions of the preprocessing stage are encapsulated in a single module, represented as g-mellin, in figure 5.

Figure 5. Image recognition system

3.1. Performance measures

By parallelizing the edge detection function its initial sequential execution time of 168 secs was reduced to 36 secs using 5 Sun4 workstations. In this case the communication costs are very low allowing an almost linear speed-up. In the same application, the classification phase takes 4 minutes in one processor. A parallel version, using 5 processors, reduced this time to almost two minutes. In this implementation the neural network weights were stored in standard file system (Sun-NFS) files. This introduces a bottleneck when the replicated modules try to access concurrently these files. This could be improved using an I/O system with parallel operation support.

P-RIO performance was also compared to pure PVM. We measured the round-trip time for similar message passing operations and the execution times of a set of programs implemented in both environments. For all the experiments, the measures obtained were almost identical, showing that P-RIO does not introduce significant overheads for parallel programming based on workstations connected by a local area network [ 11.

113

4. Programming and configuration support

The programming and configuration abstractions of P- RIO are supported by a simple run time platform. TO start transactions through ports, basic send and receive primitives, similar to those commonly found in parallel programming libraries, are available. We also offer an ADA-like select-with-guard communication constructor based on logic expressions. Another option is a data-flow like constructor that can be used to receive atomically messages coming from a set of ports. Our current implementation of the support primitives is based on C. However, the P-RIO primitives for configuration and interaction are compatible with a wide range of procedural and object oriented languages.

5. Related work & conclusions

Our environment adopts the configuration paradigm proposed in Conic [2], for constructing distributed systems. We extend the configuration framework including features for parallel programming. The use of a system description language has others attractives: the functional modules can be designed using an object oriented methodology, what incentive reuse and can be valuable for the development of complex applications; it makes the system amenable to formal verification techniques and also to program flow analysis techniques; the text is a portable and compact representation of the software that can be mapped to graphic representations (and vice-versa); the use of indexes and flow control constructs allows the specification of large program structures.

Visual Parallel Environment (VPE) is an environment that also allows to create, compile and run PVM programs [9]. VPE exploits the concepts of modular and reusable components to assist the process of creation of a parallel application. On the other hand, P-RIO attempts to exploit modularity and reuse both in the construction and in the execution stages of parallel programs. In addition, useful transaction styles typically used in parallel programming are readily available in our environment. In [ I ] , we compare P-RIO to other parallel programming environments, described in [ 10,1 I] , that adopt a two-step program composition methodology and include a graphical programming tool.

Our proposal incentives a software architecture view for application development. It is centered in the configuration paradigm that provides high-level and flexible abstraction mechanisms for program construction. This strategy has allowed us to pickup and put together several useful concepts (and associated mechanisms) for parallel computing, providing a “middleware” based

programming and execution environment. The available features can be quickly selected and used to configure parallel and distributed programs.

The current version of P-RIO exploits the portability provided by PVM. The implementation is operational since 11/94 and has been used to build several experimental applications. Its code and related documents can be obtained at http://www.caa.uf€.br/projects/p-rio.

Acknowledgments

Grants from the CNPq, Finep and Faperj have partially supported the research project described here.

References

[l] E. Carrera, 0. Loques, J.B. Leite, “P-RIO: An Environment for Modular Parallel Programming,’, Tech. Report UFF- CAA RT-07/96, Niterbi, R.J., Brazil, October, 1996.

[2] J . Magee, J. Kramer and M. Sloman, “Constructing Distributed Systems in Conic”, IEEE Trans. on Software Engineering, Vol. 15, No. 6, June 1989, pp. 663-675. G. A. Geist et al., “PVM: A Users’ Guide and Tutorial for Networked Parallel Computing”, The MIT Press, Cambridge - Mass., 1994.

[4] J. Dongarra, S. W. Otto, M. Snir and D. Walker, “A Message Passing Standard for MPP and Workstations”, Communications of ACM, Vol. 39, No. 7, July 1996, pp.

D. A. Wallach et al., “Optimistic Active Messages: A Mechanism for Scheduling Communication with Computation”. Fifth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, Santa Barbara, Calif., July 1995. pp. 21 7-226. J. Ousterhout, “Tcl and the Tk Toolkit”, Addison-Wesley, Reading, Mass., 1994. H. Bal. M. F. Kaashoek. and A. Tanenbaum, “Orca: A Language for Parallel Programming of Distributed Systems”, IEEE Trans. on Software Engineering, Vol. 18. No. 3, March 1992, pp. 190-205. J. Protic, M. Tomasevic, and V. Milutinovic. “Distributed Shared Memory: Concepts and Systems“, IEEE Parallel and Distributed Technology, Vol. 4. No. 2. Summer 1996,

P.Newton, “VPE User Manual”, University of Tennessee, Computer Science Department, Knoxwille. Tenn., U.S.A., June, 1995.

[lo] J. C. Browne, S. I. Hyder, J. Dongarra, K. Moore and P. Newton. “Visual Programming and Debugging for Parallel Computing”, IEEE Parallel & Distributed Technology, Spring 1995, pp. 75-83.

[ I I ] A. Bartoli, P. Corsini, G. Dini, and C. A. Prete. “Graphical Design of Distributed Applications Through Reusable Components”, IEEE Parallel & Distributed Technology. Spring 1995, pp. 37-50.

[ 3 ]

84-90. [5]

[6]

[7]

[SI

pp. 63-79. [9]

114

http://www.caa.uf.br/projects/p-rio

Documents

[IEEE Computer. Soc. Press Second International Workshop on High-Level Parallel Programming Models and Supportive Environments - Geneva, Switzerland (1 April 1997)] Proceedings Second