[IEEE Computer Soc. Press, Second International Workshop on High-Level Parallel Programming Models and Supportive Environments, Geneva, Switzerland (1 April 1997)] Proceedings Second International Workshop on High-Level Parallel Programming Models and Supportive Environments - Parallel programming through configurable interconnectable objects



    Parallel Programming through Configurable Interconnectable Objects

    Enrique Vinicio Carrera E. (Coppe - UFRJ), Orlando Loques (CAA - UFF), Julius Leite (CAA - UFF)

    Coppe - UFRJ: Caixa Postal 68511, Rio de Janeiro - RJ, 21945-970, Brazil

    CAA - UFF: Rua Passo da Patria, 156, Niteroi - RJ, 24210-240, Brazil

    vinicio@cos.ufrj.br, loques@caa.uff.br, julius@caa.uff.br


    This paper presents P-RIO, a parallel programming environment that supports an object based software configuration methodology. It promotes a clear separation of the individual sequential computation components from the interconnection structure used for the interaction between these components. This makes the data and control interactions explicit, simplifying program visualization and understanding. P-RIO includes a graphical tool that helps to configure, monitor and debug parallel programs.

    1. Introduction

    In most message passing based parallel programming environments, the interactions are specified through explicit language constructs embedded in the text of the program modules. As a consequence, when the interaction patterns are not trivial, the overall program structure is hidden, making it difficult to understand and to optimize performance. In addition, there is little regard for properties such as reuse, modularity and software maintenance that are of great concern in the software engineering area.

    In this context, P-RIO (Parallel - Reconfigurable Interconnectable Objects) tries to offer high-level, but straightforward, concepts for parallel and distributed programming [1]. A simple software configuration methodology makes most of the useful object based programming technology properties available, facilitating modularity and code reuse. This methodology promotes a clear separation of the individual sequential computation components from the interconnection structure used for the interaction between these components. Besides the language used for programming the sequential components, a configuration language is used to describe the program composition and its interconnection structure. This makes the data and control interactions explicit, simplifying program visualization and understanding. In principle, the methodology is independent of the programming language, operating system and communication architecture adopted.

    P-RIO includes a graphical tool that provides system visualization features that help to configure, monitor and debug a parallel program. The support environment has a modular construction, is highly portable, and provides development tools and run-time support mechanisms for parallel programs on architectures composed of heterogeneous computing nodes. Our current implementation, based on PVM [3], is message-passing oriented, which had a strong influence on the presentation of this paper. However, we also provide some insights concerning a distributed shared memory architecture.

    Figure 1. A typical module icon

    2. P-RIO concepts

    The P-RIO methodology hinges on the configuration paradigm [2] whereby a system can be assembled from externally interconnected modules obtained from a library. The concept of module (figure 1) has two facets: (i) a type or class, if it is used as a mould to create execution units of a parallel program; (ii) an execution unit, called a module or class instance. Primitive classes define only one thread of control, and instances created from them can execute concurrently. Composition caters for module reuse and encapsulation; that is to say, existing (primitive and composite) classes can be used to compose new classes.
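The class/instance duality and composition described above can be sketched in Python. This is an illustrative model only, not P-RIO's actual API; all names are assumptions made for the example.

```python
# Illustrative sketch (not P-RIO's actual API) of the two facets of a module:
# a class is a mould; instances are the concurrently executing units; a
# composite class reuses existing classes and wires their ports externally.
class ModuleClass:
    def __init__(self, name, ports):
        self.name = name
        self.ports = ports                    # named points of interaction

    def instantiate(self, instance_name):
        return ModuleInstance(self, instance_name)

class ModuleInstance:
    def __init__(self, cls, name):
        self.cls = cls
        self.name = name

class CompositeClass(ModuleClass):
    def __init__(self, name, parts, links):
        super().__init__(name, ports=[])
        self.parts = parts                    # enclosed class instances
        self.links = links                    # external port interconnections

# Usage mirroring the paper's example: Master/Slave composed into Calc-Pi.
master_cls = ModuleClass("Master", ports=["output"])
slave_cls = ModuleClass("Slave", ports=["input"])
calc_pi = CompositeClass(
    "Calc-Pi",
    parts=[master_cls.instantiate("master"), slave_cls.instantiate("slave")],
    links=[("master.output", "slave.input")],
)
```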

    0-8186-7882-8/97 $10.00 © 1997 IEEE

    Configuration comprises the selection and naming of class instances and the definition of their interconnections.

    The points of interaction and interconnection between instances are defined by named ports, which are associated with their parent classes. Ports themselves are classes that can be reused in different module classes. A port instance is named in the context of its owner class. This provides configuration level naming independence helping the reuse of the port classes.

    We use the term transaction style to name a specific control and data interaction rule between ports. Besides the interaction rule, the style also defines the implementation mechanisms required for the support of the particular transaction.

    Our basic transaction set includes a bi-directional synchronous (remote-procedure-call like) and a unidirectional asynchronous (datagram-like) transaction style. The synchronous transaction can be used in a relaxed fashion (deferred synchronous), without blocking, allowing remote calls to be overlapped with local computations. This basic transaction set can be extended or modified in order to fulfill different requirements of parallel programs. It was implemented using the standard PVM library. Currently, we are considering an MPI [4] based implementation. In this case, the different semantic modes (defined by the underlying protocols) associated with MPI point-to-point transactions could be selected at the configuration level. Other transaction styles, as for example the highly optimized RPC mechanism described in [5], based on active messages, could be included in our environment and selected for specific port connections.
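A minimal sketch of the deferred synchronous idea, using a Python thread pool to stand in for the PVM-based transport (all names here are illustrative): the remote call is issued without blocking, local computation proceeds, and the caller blocks only when the reply is actually needed.

```python
# Deferred synchronous sketch: issue the "remote" call, overlap local work,
# and block only at the synchronization point where the reply is needed.
# ThreadPoolExecutor is a stand-in for the actual message-passing transport.
import time
from concurrent.futures import ThreadPoolExecutor

def remote_procedure(x):
    time.sleep(0.05)          # stands in for network latency + remote work
    return x * x

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(remote_procedure, 7)   # non-blocking remote call
    local = sum(range(1000))                    # overlapped local computation
    result = future.result()                    # block only here (sync point)

print(result, local)   # 49 499500
```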

    2.1. Configuration programming

    Figure 2 shows a master-slave architecture typically found in parallel programs and its description using the configuration language. Data is a port class that defines an RPC-like transaction (sync); int[2] and double are the data types used for the in and out call arguments, respectively. Calc-Pi is a composite class that uses the primitive classes Master and Slave; it defines enclosed instances of these classes (master and slave) as well as their ports and connections. Note that the ports named input and output[i] are of the Data class. The last line specifies the creation of an instance of Calc-Pi, named calc-pi, with a variable number of slave instances. The declarative system description is submitted to a command interpreter that executes the low level calls required to create the specified parallel program. The interpreter queries the user to provide the number of replicated instances to be actually created in the system. The instances are automatically distributed to the available set of processors. However, the user can override this allocation by defining specific mappings at the configuration level. The configuration language interpreter uses a parser written in Tcl [6] and supports flow control instructions that facilitate the description of complex configurations [1]. Alternatively, as described in section 2.4, a graphical tool can be used to configure parallel programs.



    port-class Data { sync, int[2], double }

    class Master { N } {
        code C "master $N";
        port out Data output[$N];
    }

    class Slave { } {
        code C "slave";
        port in Data input;
    }

    class Calc-Pi { N } {
        Master master $N;
        Slave slave[$N];
        forall i $N {
            link master.output[$i] slave[$i].input;
        }
    }

    Calc-Pi calc-pi variable;

    Figure 2. Example of the configuration language

    2.2. Group transactions

    A group can be seen as an abstract module that receives inputs and distributes them according to a particular transaction style. In our model, groups are explicitly named for connection purposes and several collective message passing transaction styles can be selected in the configuration domain. Configuration level primitives are available to create and name groups, as well as to specify port to group connections. Our current implementation supports two multicast group transaction styles: unidirectional and bi-directional, for use with the asynchronous and synchronous transactions, respectively. It is noteworthy that the popular all-gather, scatter and gather, and all-to-all collective data moves can be implemented using the standard features of P-RIO.
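The group abstraction can be sketched as follows. This is a hypothetical model, not P-RIO's API: the group forwards every input to all connected member ports, and a scatter-style data move is built on the same connections by indexing the members.

```python
# Hypothetical model of a named multicast group: every message sent to the
# group is forwarded to all connected member ports. A scatter-style move
# reuses the same connections, giving each member its own chunk.
class Group:
    def __init__(self, name):
        self.name = name
        self.members = []            # connected input ports (callables here)

    def connect(self, port):
        self.members.append(port)

    def send(self, message):         # unidirectional multicast transaction
        for port in self.members:
            port(message)

    def scatter(self, chunks):       # one chunk per connected member
        for port, chunk in zip(self.members, chunks):
            port(chunk)

received = {0: [], 1: [], 2: []}
group = Group("workers")
for i in range(3):
    group.connect(lambda msg, i=i: received[i].append(msg))

group.send("broadcast")              # all members get the same message
group.scatter(["a", "b", "c"])       # each member gets its own chunk
```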

    Barrier synchronization mechanisms can also be implemented using a group transaction through ports. However, we chose to offer a specific programming level primitive for this function, in order not to burden the programmer with port concerns. For the same reason, barrier specifications appear in a simplified form at the configuration level.
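The effect of such a barrier primitive can be illustrated with Python's standard threading.Barrier, used here as a stand-in for P-RIO's own mechanism:

```python
# The effect of a barrier: no thread proceeds past barrier.wait() until all
# N threads have arrived. threading.Barrier stands in for P-RIO's primitive.
import threading

N = 4
barrier = threading.Barrier(N)
order = []   # list.append is atomic in CPython, so this shared log is safe

def worker(i):
    order.append(("before", i))
    barrier.wait()                # block until all N workers arrive
    order.append(("after", i))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All "before" events necessarily precede all "after" events.
```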

    2.3. Shared memory transactions

    In this section, we present one scheme that we adopt to introduce the distributed shared memory (DSM) paradigm in the context of P-RIO. This provides a hybrid environment, helping the programmer to mix the two paradigms in order to implement the best solution for a specific problem.

    Figure 3. Distributed shared memory transaction: (a) configuration structure; (b) implementation structure (coherence protocol)

    In figure 3-a, let us consider that module S can perform heavy processing on a data structure on behalf of any module C[i]. The latter concurrently compete to access these processing services through procedure call transactions associated with ports visible at the S interface (for simplicity, only one port is presented in figure 3). Thus, as an implementation option, S and its encapsulated data structures can be replicated at each hardware node where there is a potential user of its services. The procedure code can then be executed locally, providing a simple computation and data migration policy. A procedure that performs read-only data operations does not need to execute any synchronization code. However, if the procedure performs updates to S data, they must be implemented using a shared memory coherence protocol. Configuration level analysis, similar to the compile level analysis used in Orca [7], could determine that module S is to be replicated. This could also be done by the programmer, using knowledge regarding program behavior and structure. Thus, each procedure call transaction can be associated with a mutual exclusion region and, in consequence, is equivalent to the acquire primitives used in DSM libraries, such as Munin or TreadMarks [8]. A special code associated with each port can be used to trigger the required underlying consistency protocol (figure 3-b). It is interesting to note that the data sharing protocol and the modules-to-processors mapping can be specified at the configuration level, allowing great flexibility for system tuning. The shared memory extensions are not yet integrated in the environment. We are planning to implement them using a DSM coherency protocol that minimizes latency by transferring updates in advance to the processor most likely to need them.
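The acquire/update/release pattern described above can be sketched as follows. Names are illustrative; a real coherence protocol would propagate the pending updates to the other replicas on release.

```python
# Sketch of a replicated module S with acquire/release around updates.
# Illustrative only: a real DSM protocol would ship pending updates to the
# other replicas on release; reads need no synchronization code.
import threading

class ReplicatedModule:
    def __init__(self, data):
        self.data = dict(data)        # this node's replica of S's data
        self.lock = threading.Lock()  # mutual exclusion region per transaction
        self.pending_updates = []     # updates the protocol would propagate

    def read(self, key):
        return self.data[key]         # read-only: no synchronization needed

    def update(self, key, value):
        with self.lock:               # acquire
            self.data[key] = value
            self.pending_updates.append((key, value))
        # release: the coherence protocol would now propagate pending_updates

s = ReplicatedModule({"n": 0})
s.update("n", 42)
```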



    Figure 4. Snapshot of P-RIO graphical interface

    2.4. Graphical tool

    P-RIO includes a graphical tool that helps to configure, visualize, monitor and debug parallel applications. The graphical representation of each class (icon), including its ports, and of a complete system is automatically generated from its textual configuration description (figure 4). Alternatively, using the graphical tool, a system can be created by selecting classes from a library, specifying instances from them and interconnecting their communication ports. The encapsulation concept is supported, permitting new module classes to be composed graphically from already existing classes. Class compression and decompression features allow the designer to inspect the internals of composite classes and simplify the visual exposition of large programs. With this tool, it is also possible to control the physical allocation, instantiation, execution and removal of modules with simple mouse commands. For example, using these features, one can reconfigure the application on the fly by stopping modules, reallocating them, or changing several other parameters. The graphical interface borrows many features from the XPVM tool [3], contained in the PVM system. Thus, it supports debugging primitives for displaying sent messages and print-like outputs. This interface was implemented using the Tcl/Tk toolkit [6], making it very portable as well.

    3. An example

    As an example, P-RIO is used to parallelize an experimental image recognition system. A neural network classifier and preprocessing techniques that make the result independent of the relative position and scale of the target image are used. The preprocessing stage uses common digital processing techniques (filtering, transforms and convolutions) that imply high processing demands as well as a high rate of repetition of operations. To reduce spurious noise, a median filter is applied to the image (a matrix of 256x256 pixels), followed by a border detection algorithm to obtain the object's position within the vision field. In parallel, homomorphic processing is applied to the original matrix in order to avoid distortions that can be caused by the light source. The outputs of the two stages are used to generate a polar coordinate representation of the image; subsequently, circular harmonic coefficients are obtained. Finally, the Mellin transform is applied to obtain a 128x512 matrix of complex coefficients, which are partially presented to the neural classifier. Considering the sequential implementation of this system, we identified two points for performance improvement: (i) in the image preprocessing phase, the edge detection task takes 70% of the total machine time; this task could be subdivided into smaller tasks that can be executed in parallel, taking advantage of the properties of the adopted bi-dimensional signal processing algorithm; (ii) the neural net has 16384 inputs and two layers with 100 and 5 neurons, respectively. Both the learning and recognition tasks take a long CPU time. The neural net was subdivided into a number N of sub-blocks, each one containing a subset of the neurons.
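The sub-block partitioning of point (ii) can be sketched in pure Python with toy sizes (the real layer has 100 neurons and 16384 inputs): each block computes the outputs of its own subset of neurons, and concatenating the block outputs reproduces the unpartitioned layer exactly.

```python
# Toy sketch of splitting a layer's neurons into N sub-blocks: each block
# computes the outputs of its subset, and concatenating the block outputs
# reproduces the unpartitioned layer. Sizes here are illustrative only.
def neuron(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))

def layer_block(block_weights, inputs):
    """One sub-block: outputs for its subset of neurons."""
    return [neuron(w, inputs) for w in block_weights]

def split_layer(weights, n_blocks):
    """Partition a layer's neuron weight rows into contiguous sub-blocks."""
    size = len(weights) // n_blocks
    return [weights[i * size:(i + 1) * size] for i in range(n_blocks)]

inputs = [1.0, 2.0]
weights = [[1, 0], [0, 1], [1, 1], [2, 2]]      # 4 neurons, 2 inputs each
blocks = split_layer(weights, 2)                # N = 2 sub-blocks
outputs = [o for b in blocks for o in layer_block(b, inputs)]
full = layer_block(weights, inputs)             # unpartitioned reference
```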

    Figure 5 presents the module configuration for the system. The median and edge detection functions are encapsulated in one composite class called center; in this class, edges is the basic unit of parallelization. It is interesting to note that the number of instances of edges is specified by a parameter in the configuration description text. The other functions of the preprocessing stage are encapsulated in a single module, represented as g-mellin in figure 5.

    Figure 5. Image recognition system

    3.1. Performance measures

    By parallelizing the edge detection function, its initial sequential execution time of 168 seconds was reduced to 36 seconds using 5 Sun4 workstations. In this case the communication costs are very low, allowing an almost linear speed-up. In the same application, the classification phase takes 4 minutes on one processor. A parallel version, using 5 processors, reduced this time to almost two minutes. In this implementation the neural network weights were stored in standard file system (Sun-NFS) files. This introduces a bottleneck when the replicated modules try to access these files concurrently. This could be improved using an I/O system with parallel operation support.
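The reported figures for the edge detection phase work out as follows:

```python
# Speedup and efficiency implied by the measurements above.
sequential_secs, parallel_secs, processors = 168.0, 36.0, 5
speedup = sequential_secs / parallel_secs     # about 4.67
efficiency = speedup / processors             # about 0.93, hence almost linear
print(round(speedup, 2), round(efficiency, 2))   # 4.67 0.93
```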

    P-RIO performance was also compared...

