
Wuhan University Journal of Natural Sciences Vol. 1 No. 3/4 1996, 386-390

AN INTRODUCTION TO THE PMESC PARALLEL PROGRAMMING PARADIGM AND LIBRARY FOR TASK PARALLEL COMPUTATION

S. Crivelli  E.R. Jessup
Department of Computer Science
University of Colorado
Boulder, CO 80309-0430 USA
crivells@colorado.edu  jessup@cs.colorado.edu

Abstract  Task-parallel problems are difficult to implement efficiently in parallel because they are asynchronous and unpredictable. The difficulties are compounded on distributed-memory computers, where interprocessor communication can impose a substantial overhead. A few languages and libraries have been proposed that are specifically designed to support this kind of computation. However, one big challenge still remains: to make those tools understood and used by scientists, engineers, and others who want to exploit the power of parallel computers without spending much effort in mastering those tools. The PMESC programming paradigm and library presented here are designed to make programming on distributed-memory computers easy to understand and to make efficient parallel code easy to produce. The paradigm provides a methodology for structuring task-parallel problems that allows the separation of different phases in the computation. The library provides support for those phases that are application-independent, allowing the users to concentrate on the application-specific ones.

1. THE PMESC PARADIGM

The PMESC programming paradigm is an abstraction for viewing all kinds of parallel algorithms on distributed-memory MIMD (Multiple Instruction, Multiple Data) parallel computers. It is used as a method of structuring parallel algorithms, allowing the separation of different phases involving different programming issues. The paradigm is called Partition-Map-Embed-Solve-Communicate (PMESC). It is composed of five phases bearing those names: the Partition phase splits the work into tasks, the Map phase assigns and reassigns those tasks to the set of processors interconnected by some convenient virtual topology, the Embed phase embeds the virtual topology into the actual machine architecture, the Solve phase performs the algorithm itself, and the Communication phase takes care of the interprocessor communication. The phases may be executed in any number and in any order.
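As a concrete illustration, the phases can be sketched in a few lines of sequential Python. This is a hypothetical skeleton, not the PMESC API: the function names and the trivial summing application are invented for illustration, and the Embed phase is omitted because a sequential simulation has no machine topology to embed into.

```python
def partition(work, grain=2):
    """Partition phase: split the work into fixed-size tasks."""
    return [work[i:i + grain] for i in range(0, len(work), grain)]

def map_tasks(tasks, n_procs):
    """Map phase: assign tasks round-robin to the processors of a virtual topology."""
    queues = [[] for _ in range(n_procs)]
    for i, task in enumerate(tasks):
        queues[i % n_procs].append(task)
    return queues

def solve(task):
    """Solve phase: the application-specific computation (here, summing a chunk)."""
    return sum(task)

def communicate(partials):
    """Communicate phase: combine the per-processor partial results."""
    return sum(partials)

def run(work, n_procs=4):
    queues = map_tasks(partition(work), n_procs)
    partials = [solve(task) for queue in queues for task in queue]
    return communicate(partials)
```

Each phase is a separate function, which is the point of the paradigm: the Partition, Map, and Communicate code is application-independent and reusable, while only the Solve step changes from one application to the next.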

Parallel algorithms can be classified as static, quasi-dynamic, and dynamic. Static algorithms partition and assign work to the processors only once, at the beginning of the execution. They can be efficiently applied to regular computations but do not perform well on irregular ones. Irregular problems are those for which no a priori estimates of load distribution are possible. It is only during program execution that different processors can become responsible for different amounts of work. Adaptive approaches are especially suited to these problems because they react to the changes in the system state, concentrating efforts on those areas that look more promising and making work transfer decisions to keep the processor workload balanced. Adaptive algorithms can be quasi-dynamic or dynamic. Quasi-dynamic approaches apply to those problems that are synchronous and predictable in stages and that require periodic load balancing checks to achieve good performance. Dynamic approaches apply to those computations that are asynchronous and unpredictable and that require continuous, instead of periodic, load balancing checks.

Although the PMESC paradigm applies to all kinds of parallel algorithms, we concentrate on the dynamic ones. Dynamic algorithms can be found, for example, in computations involving a search tree. These include parallel eigenvalue computation by the bisection procedure, parallel adaptive quadrature, discrete optimization, and global optimization problems. Dynamic problems are extremely hard to program on distributed-memory computers because they can require extensive interprocessor communication for such program features as load distribution, sharing of information, and termination detection. Two systems have been implemented to address dynamic problems: Charm [4] and Express [5]. We propose a new one that combines the strengths of both.

2. THE PMESC ENVIRONMENT

The PMESC paradigm and library constitute a medium- to coarse-grain environment for managing dynamic computations on distributed-memory computers. The philosophy of the PMESC environment is simple and powerful. It provides a tool that frees the programmer from dealing with low-level details of such issues as load balancing, interprocessor communication, and program termination, while allowing her or him to concentrate on the application-specific ones.

The fundamental design of PMESC is different from the design of other tools. Rather than starting from the hardware and building a communication system, PMESC began with the applications and their requirements and built up a system to fulfill them. PMESC provides the building blocks to address different programming issues and different frameworks to put these blocks together. The programmer, not the language or system, decides which of those frameworks and blocks are most suitable for the particular application and the computer architecture.

PMESC is conceptually a two-layered environment. At the lower level, it provides support for synchronous and asynchronous message passing. At the higher level, PMESC provides the abstractions for handling more specific programming issues. These routines form the basis for a flexible model of computation in which the underlying topology of the hardware can be completely ignored. Each level of PMESC is logically distinct and practically independent of the other, with the upper level being built on top of the lower one. As a result of that design, we can port the library to a wide variety of computers by taking a vertical approach in which the low level may need to change while the high level, built on top of it, does not. The user code, built on top of those levels, remains unchanged across different computers.

3. THE PMESC PROGRAMMING MODEL

PMESC is designed for task-parallel dynamic applications. The tasks and the queue in which they are stored play a key role in the implementation and performance of dynamic problems on distributed-memory computers. Tasks are important for they define the granularity of the parallel problem. The queue is also important for it originates two different approaches, centralized (maintained by a single processor) and distributed (split into local queues maintained by all the processors), that lead to different memory usage, network usage, and programming complexities.

In the distributed queue approach, every processor works on the execution of its own tasks until its local queue becomes empty. At this point, the idle processor is assigned tasks from the busy processors in order to balance their workloads. That way, the overall execution time is reduced by dynamically distributing the work so that all processors are kept busy most of the time. The choice of load balancing mechanisms is a fundamental issue in achieving good performance.

Another performance issue is information sharing. Global variables cannot be used in asynchronous parallel programs running on distributed-memory computers. However, many uses of global variables can be captured by pseudo-global ones. To keep a pseudo-global variable, each processor has its own copy of the variable in its local memory. All the processors do not necessarily contain the same value at any moment, but they communicate their values once in a while to other processors in order to keep the variable updated. Therefore, maintaining a pseudo-global variable involves communication overhead.
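The mechanics of a pseudo-global variable can be sketched as follows. The class below is an illustrative simulation, not a PMESC routine: it models a pseudo-global minimum, such as the best bound in a branch-and-bound search, where each simulated processor holds a private copy and the copies converge only when values are exchanged.

```python
class PseudoGlobalMin:
    """A per-processor copy of a 'global' minimum. Copies may be stale
    between exchanges; merging on receipt keeps each copy a valid
    (if conservative) upper bound on the true minimum."""
    def __init__(self):
        self.value = float("inf")

    def update_local(self, candidate):
        """Record a locally discovered value."""
        self.value = min(self.value, candidate)

    def receive(self, remote_value):
        """Merge a value communicated by another processor."""
        self.value = min(self.value, remote_value)

# Two simulated processors improve the bound independently...
p0, p1 = PseudoGlobalMin(), PseudoGlobalMin()
p0.update_local(7.0)
p1.update_local(3.0)
# ...and disagree (p0 still holds 7.0) until they exchange values:
p0.receive(p1.value)
p1.receive(p0.value)
```

The communication overhead mentioned above corresponds to the `receive` calls: in a real program each one is an interprocessor message, and how often they are issued trades freshness of the copies against network traffic.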

A final issue that arises in the solution of asynchronous problems using the distributed queue approach is how to detect termination. For these problems, load balancing is checked continuously, and work is transferred accordingly from the heavily loaded processors to the lightly loaded ones. In this context, idle processors waiting for heavily loaded processors to send some work away may wait forever if not informed that the work has been completed. Therefore, because it is impossible for an idle processor to correctly decide whether or not to quit based on its local information alone, a global check for termination becomes necessary.
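One simple way to realize such a global check is to count tasks rather than poll processors. The sketch below is a centralized counting detector, offered as an illustration of the problem rather than as PMESC's actual mechanism (which the text does not specify): a coordinator compares the number of tasks created with the number completed, and termination holds only when the two counts agree.

```python
class TerminationDetector:
    """Centralized counting detector: every task creation and completion is
    reported to a coordinator. In a real distributed setting these reports
    are messages, and work still in flight must also be accounted for."""
    def __init__(self):
        self.created = 0
        self.completed = 0

    def task_created(self, n=1):
        self.created += n

    def task_completed(self, n=1):
        self.completed += n

    def terminated(self):
        # True only when no task is pending on any processor.
        return self.created == self.completed
```

An idle processor can then quit safely once the coordinator reports termination, instead of guessing from its empty local queue.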

4. THE PMESC FEATURES

The PMESC environment is intended to address the following programming issues:

• To provide support for handling the queue of tasks.
• To provide support for dynamic load balancing.
• To provide support for efficient termination checking.
• To provide support for efficient updating of pseudo-global variables.
• To provide support for global asynchronous operations such as broadcast and gather.
• To provide support for low-level message passing.
• To provide support for programming on virtual machines.

The PMESC library offers a set of routines or building blocks to support all these issues. These routines are classified according to the programming phase in the PMESC paradigm for which they provide support.

Thus, the routines that handle the task queue structure make up the Partition module. These routines can create the task queue structure, dynamically allocate and deallocate memory for it, add tasks to the queue, select tasks from the queue, or partition the queue. Depending on the application, queues can be either non-prioritized or prioritized, centralized or distributed. The Partition module of the PMESC library supports all these approaches.
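A minimal sketch of such a prioritized, splittable queue is shown below. The class and method names are invented for illustration and do not reproduce the PMESC interface; the point is that adding, selecting, and partitioning the queue are generic operations a library can supply independently of the application.

```python
import heapq

class TaskQueue:
    """A per-processor prioritized task queue (lower number = higher priority)."""
    def __init__(self):
        self._heap = []

    def add(self, task, priority=0):
        heapq.heappush(self._heap, (priority, task))

    def select(self):
        """Remove and return the highest-priority task."""
        return heapq.heappop(self._heap)[1]

    def split(self):
        """Partition the queue: give away half the tasks, e.g. to an idle
        processor. This sketch donates the highest-priority half; real
        strategies differ in which half they keep."""
        donated = [heapq.heappop(self._heap) for _ in range(len(self._heap) // 2)]
        other = TaskQueue()
        other._heap = donated
        heapq.heapify(other._heap)
        return other

    def __len__(self):
        return len(self._heap)
```

A non-prioritized queue would replace the heap with a plain list, and a centralized variant would keep a single such queue on one coordinating processor.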

The routines that take care of balancing the load and checking for termination comprise the Map module. The PMESC library provides different load balancing routines. These include a random approach that distributes tasks globally, a ring-based approach that distributes them among the neighbors in a ring of processors, and a priority-based approach that balances not just the load but also the priorities among the processors. These strategies can be used in a sender- or receiver-initiated manner. In a sender-initiated approach, congested processors search for other processors onto which to offload work. In a receiver-initiated approach, lightly loaded or idle processors request work from other processors.
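The receiver-initiated idea can be sketched in a few lines. For determinism, this illustration has each idle processor request work from the most heavily loaded peer, which donates half its queue; PMESC's random and ring-based strategies choose the partner differently, and the function below is not part of the library.

```python
def receiver_initiated_balance(queues, idle):
    """Each idle processor p requests work from the most heavily loaded
    peer, which donates the second half of its task list to p."""
    for p in idle:
        victim = max(range(len(queues)), key=lambda q: len(queues[q]))
        half = len(queues[victim]) // 2
        if victim != p and half > 0:
            queues[p].extend(queues[victim][half:])
            del queues[victim][half:]
```

In a sender-initiated variant the roles are reversed: the congested processor scans for an underloaded peer and pushes tasks to it without being asked.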


Another fundamental issue in the implementation of parallel algorithms is the allocation of virtually connected processors to a given architecture. Considering embedding as an independent procedure allows one to program on a virtual machine, thereby separating architecture details from the application. Thus, if the programmer is concerned with efficiency and the high communication costs that the algorithm may incur, he or she should try a virtual topology of processors that can be efficiently mapped onto the actual machine. The PMESC library provides such efficient embedding routines that exploit the hardware characteristics while keeping them hidden from the user. These routines comprise the Embed module and, so far, they include rings, spanning trees, quaternary trees, and arrays.
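A classic example of such an embedding, offered here only to illustrate what an Embed routine does (the PMESC implementations are not shown in this paper), is mapping a virtual ring onto a hypercube with a binary-reflected Gray code, so that neighboring ring positions land on hypercube nodes differing in exactly one bit and every ring message crosses exactly one physical link.

```python
def gray(i):
    """The i-th binary-reflected Gray code."""
    return i ^ (i >> 1)

def embed_ring_in_hypercube(d):
    """Map the 2**d processors of a virtual ring onto a d-dimensional
    hypercube: ring position i is placed on hypercube node gray(i), so
    consecutive ring positions (including the wraparound) are hypercube
    neighbors. This is a dilation-1 embedding: no virtual edge is
    stretched over more than one physical link."""
    return [gray(i) for i in range(2 ** d)]
```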

The routines that take care of handling the low- and high-level interprocessor communication make up the Communication module. They handle such issues as synchronous and asynchronous point-to-point communications, asynchronous global combine operations, and asynchronous updating of pseudo-global variables.

5. EVALUATING THE LIBRARY

There exists no standard scheme for the evaluation of programming libraries. Thus, we propose the following steps to determine how well a library has met the goals of portability, efficiency, and ease of use.

• Tests of portability
  - Difficulty of porting the library efficiently
  - Difficulty of porting the user's code efficiently
• Tests of efficiency
  - Efficiency under different levels of complexity: the static vs. the dynamic approach
  - Efficiency obtainable by the naive user
  - Efficiency obtainable by the expert user
• Comparison with relevant packages
• Tests of ease of use
  - Designer walkthrough method
  - User walkthrough method

The first three of these test categories are familiar to computational scientists. In [2], we present several examples of PMESC programs coming from a variety of computational situations and illustrating a variety of computational issues. We develop programs on the iPSC/2 and port them to the Intel iPSC/860 and Delta and the CM-5. Portability is evaluated by quantifying the extent of the changes required to port the PMESC library itself and then to port PMESC programs for each of the example problems. Efficiency is evaluated through timings of the example programs on the different machines. These timings include comparisons of different library routines for such functions as processor load balance as well. Evaluating the effects of changing routines allows us to draw conclusions about the relative efficiencies obtainable by naive and expert users of PMESC.

A perhaps less familiar concept is that of walkthrough methods. Walkthroughs may be done by the designer or by users of a package. In each case, the programmer works through the development of a program, articulating her or his design decisions at each step.


The walkthroughs by library users were particularly useful in identifying weak points in the PMESC documentation. The pitfalls encountered by a new user of PMESC depended greatly on the user's level of experience with parallel programming. The present version of the PMESC documentation addresses all of these troubles. (For more information on user walkthroughs, see [6].)

6. FINAL REMARKS

PMESC provides support to the inexperienced programmer as well as to the experienced one. It assists the inexperienced programmer in writing portable code without the burden of learning either a new language or the machine architecture details. It assists the experienced programmer by providing a platform for testing new applications and for comparing different strategies, e.g., different load balancing strategies.

The PMESC library balances the trade-off between efficiency and portability. Thus, while its implementation may be tuned to a particular architecture, the user application is portable across any machine on which the library is supported. Our experiments so far include the Intel machines iPSC/2, iPSC/860, and the Delta, as well as the CM-5. Also, our preliminary experiments have demonstrated that even a novice user can write fairly sophisticated but still efficient code with the prototype PMESC library. (For more details regarding PMESC, see [2].)

7. ACKNOWLEDGMENTS

This work was funded by DOE contract DE-FG02-92ER25122, by an NSF Young Investigator Award, and by a grant from the Intel Corporation.

8. REFERENCES

[1] Barth, W., Martin, R. and Wilkinson, J. (1971) Calculation of the eigenvalues of a symmetric tridiagonal matrix by the method of bisection, Handbook for Automatic Computation: Linear Algebra, Springer Verlag, pp. 249-256.

[2] S. Crivelli, (1995) A programming paradigm for distributed-memory computers, Ph.D. thesis, Dept. of Computer Science, University of Colorado, Boulder.

[3] W. Givens, (1954) Numerical computation of the characteristic values of a real symmetric matrix, Tech. Rep. ORNL-1574, Oak Ridge National Laboratory.

[4] A. Gursoy, A. Sinha and L. V. Kale, (1992) The CHARM(3.2) programming language manual, Dept. of Computer Science, University of Illinois.

[5] Express C User's Guide, Version 3.0, Parasoft Co., 1990.

[6] P. G. Polson, C. Lewis, J. Rieman and C. Wharton, (1992) Cognitive walkthroughs: A method for theory-based evaluation of user interfaces, International Journal of Man-Machine Studies, 36, 741-773.