Problem-solving environments for parallel computers



<ul><li><p>Future Generation Computer Systems 7 (1991/92) 221-229 North-Holland </p><p>Problem-solving environments for parallel computers * </p><p>David A. Padua Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA </p><p>Abstract </p><p>Padua, D.A., Problem-solving environments for parallel computers, Future Generation Computer Systems 7 (1991/92) 221-229. </p><p>Man-machine interaction can take place at different levels of abstraction ranging from the machine-instruction level to the problem-specification level. A problem-solving environment should provide restructuring and debugging tools to make the interaction at these different levels possible and to allow the efficient use of the target machine. Restructurers translate from specifications to programs or from programs to more efficient versions. When the target machine is parallel, the restructurers should include techniques for the automatic exploitation of parallelism. Debuggers are necessary to test for correctness and to evaluate performance at the different levels. Debuggers for parallel programs have to deal with the possibility of nondeterminacy. </p><p>Keywords. Parallel computing; compilers; problem-solving environments; programming environments. </p><p>1. Introduction </p><p>Two of the central goals in software have been the development of good man-machine interfaces and compilation techniques for the efficient generation of machine code. These goals are particularly important for parallel computers, whose acceptance by ordinary users is directly dependent on their becoming as easy to use as sequential machines. </p><p>In this paper we discuss man-machine interaction and compilation techniques under the name of problem-solving environments. 
The term programming environment is used more frequently, but it is more restrictive because programming is only a part (which is not always needed) of the problem-solving process. </p><p>A problem-solving environment should facilitate man-machine interaction at different levels of abstraction ranging from the machine-instruction level to the problem-specification level. The next section briefly discusses levels of abstraction in man-machine interaction. The rest of the paper discusses restructurers and debuggers, two of the most important tools that a problem-solving environment should provide. </p><p>* This work was supported in part by the National Science Foundation under Grant No. NSF-MIP-8410110, the US Department of Energy under Grant No. US DOE FG02-85ER25001, and the NASA Ames Research Center under Grant No. NASA (DARPA) NCC2-559. </p><p>2. Levels of abstraction in man-machine interaction </p><p>One of the central goals of software research has been to move the language of man-machine interaction closer to the problem and away from </p><p>0376-5075/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved </p></li><li><p>target machine considerations. Two main approaches have been taken toward this goal. One is the design of very-high-level languages such as Lisp, SETL, and Prolog that tend to simplify the task of programming within some restricted domain. The other is the design of specification-oriented packages that allow the solution of problems without requiring any programming. These packages range from relatively simple tools such as SLADOC, a routine searching system developed in the Applied Mathematical Sciences Programs of the Ames Laboratory, to more complex systems such as ELLPACK [26] used to solve elliptic partial differential equation systems from a description of the equations, the boundary conditions, the domain, and the solution method. 
</p><p>These different approaches can be classified in a hierarchy of abstraction levels. The higher the level in the hierarchy, the fewer the concerns of the user with implementation issues. At the top of the hierarchy is the problem-specification level. Here, the only concern is with the specification of what is to be solved or analyzed: there is no need to be aware of what algorithms or programming languages are used. The interaction takes place by handling knobs or other devices, in the language of science or engineering, or in terms of mathematical formulas. Examples of systems at this level can be found in some application programs designed for engineering and science, and in computer-aided instruction programs, including flight simulation programs. From the end-user's point of view, this is clearly the most desirable level of interaction. However, in many cases it is not possible to design general-purpose systems at this level because the design choices that have to be made when going from specification to algorithms are not understood to the point of automation. </p><p>For this reason, many specification systems require user intervention for algorithm selection. For example, the CIP language [2] accepts problem specifications which are translated into executable programs via a sequence of correctness-preserving transformations selected by the user. Also, PDE solver systems [5,9,19,26] require the user to specify not only the equations and boundary conditions, but also the solution strategy (discretization and solution method, among other things). This is due to our inability to automate the analysis of stability and accuracy. The PDE solver systems just mentioned belong to a second layer in our hierarchy, the solution-specification layer, where the only responsibility of the user is to select the algorithm by either naming it or by selecting an existing program. 
</p><p>At the third, and lowest, level of the hierarchy the interaction takes place in the realm of programming. This level can be decomposed into sublevels corresponding to the different categories of programming languages. These range from assembly language to very-high-level languages such as Lisp, Prolog, and SETL for symbolic computing, and Matlab and FIDIL [16] for numerical computing. Parallel programming languages such as Cedar Fortran [12] and Multilisp [13] are at a lower level than their sequential counterparts because parallel constructs are usually concerned with performance and implementation rather than with the problem itself. </p><p>Traditionally, most of the interaction with sequential machines has been at the programming level, and most of the work on restructuring for parallel computers has concentrated on the translation from sequential programs to parallel versions. This work, described briefly in the next section, facilitates the process of programming by allowing the user to work at the sequential programming level while making the power of parallelism available to the target code. In this way, man-machine interaction takes place at a higher level than that of explicit parallel programming. Also, thanks to restructurers, sequential programs can be translated to different target parallel architectures, facilitating portability. </p><p>Less work has been done on the restructuring of specifications into parallel programs. Part of the problem is that the translation of specifications is not well understood even for sequential machines. It is very likely that much effort will be devoted in the near future to this problem. Effective translators for specifications or even very-high-level languages are bound to become important tools and will probably become a determining factor in making parallel computers widely accepted. 
Being able to translate specifications will help the cause of automatic exploitation of parallelism because at the specification level there are more opportunities for parallelism than at the programming level: once the algorithm has been chosen and implemented, some opportunities for parallelism may be lost. Also, because of </p></li><li><p>the absence of architectural bias, specifications are better than programs for effective porting across widely different target architectures. </p><p>3. Restructurers </p><p>Restructurers translate objects at a level of the interaction hierarchy into objects at a lower level or into more efficient objects at the same level. For example, there are restructurers that translate sequential Fortran programs into equivalent (but lower level) parallel programs. There are also restructurers that translate sequential Fortran programs into more efficient sequential Fortran programs. </p><p>A restructurer could generate machine code directly from a program or specification, or it could generate a program in a high-level language. In either case, when the source code is sequential and the translated version is parallel, the restructurer is called a parallelizer. In the next few paragraphs we discuss parallelization issues for Fortran, Lisp, and specifications. In a final subsection a few words are said about the organization of parallelizers. </p><p>3.1. Fortran parallelization </p><p>Much work has been done over the past 20 years on parallelizing Fortran compilers (see ref. [25] for a tutorial on this work). The most important techniques deal with the translation of do loops, the most important source of parallelism in numerical programs [4]. Thus, e.g. 
the loop </p><p>do i = 1, n </p><p>A(i) = B(i) + D(i-1) </p><p>D(i) = E(i) + 1 </p><p>end do </p><p>can be automatically translated into the following vector statements: </p><p>D(1:n) = E(1:n) + 1 </p><p>A(1:n) = B(1:n) + D(0:n-1) </p><p>If the target machine is a multiprocessor, the code could be translated into </p><p>A(1) = B(1) + D(0) </p><p>doall I = 1, n, K </p><p>do i = I, min(n-1, I+K-1) </p><p>D(i) = E(i) + 1 </p><p>A(i+1) = B(i+1) + D(i) </p><p>end do </p><p>end doall </p><p>D(n) = E(n) + 1 </p><p>where doall means that different iterations of the loop can be executed in parallel and scheduled in any order. In this example, the loop was blocked (i.e., divided into iteration sub-sequences) to make each parallel thread larger and therefore decrease the overhead associated with interprocessor cooperation. Another reason to block is to allow the exploitation of several levels of parallelism. Thus, if the processors of the target multiprocessor had vector capabilities, the inner loop should be translated into a vector statement. </p><p>In addition to loop parallelization, issues such as synchronization [22], communication, and memory usage may have an important influence on performance. For this reason, many parallelizers include strategies for synchronization instruction generation, locality enhancement, and data partitioning and distribution. These last two topics are particularly important for distributed memory machines as well as hierarchical memory systems such as the Cedar multiprocessor [20,7]. Synchronization considerations can be seen in the previous example, where a transformation called alignment was applied. This transformation tries to place the statement instance that generates a value in the same iteration as the instance consuming that value. This is done to avoid synchronization operations. </p><p>Locality enhancement has been studied extensively. For example, some strategies for locality enhancement were presented in ref. 
[21] and extended and developed into automatic strategies in ref. [1]. Also, locality enhancement mechanisms were manually applied to improve the performance of matrix multiplication on the Alliant FX/80 [18]. </p><p>Other important memory-related issues also arise. For example, overlapping vector fetches from global memory and computation is an important optimization for the Cedar multiprocessor [11]. Also, data partitioning and distribution heavily influence performance in distributed </p></li><li><p>memory machines and hierarchical systems such as Cedar. Traditionally, the experimental restructurers that perform data partitioning have required the user to specify the data partitioning via assertions. More recently, there has been some work to automate this process. </p><p>To illustrate how a parallelizer might improve locality, let us assume a multiprocessor with a cache memory on each processor. Further assume that the cache (and thus the memory) is divided into blocks of K words each, and that data are only exchanged between memory and cache as whole blocks. Assume also that matrix columns are sequences of blocks (i.e. matrices are stored in column-major order and columns are much larger than blocks). </p><p>Now consider the loop: </p><p>do i = 1, n </p><p>do j = 1, n </p><p>B(j,i) = A(i,j) + 1 </p><p>end do </p><p>end do </p><p>A naive compiler might transform the outer loop into a doall without any other transformation, causing (1 + 1/K) block transfers between memory and caches for each assignment executed. </p><p>To improve this situation, the compiler could block both loops into groups of K iterations. This would have the effect of processing the matrix A in K × K submatrix order. 
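The transfer counts just quoted can be checked with a small model. The Python sketch below is our own illustration, not from the paper: it counts the distinct memory blocks one parallel task must fetch under an idealized cache (every block fetched exactly once per task), for the naive column traversal versus a K × K tile; the function names, block-addressing scheme, and the infinite-cache assumption are all ours.

```python
# Toy model of the transpose loop B(j,i) = A(i,j) + 1 on a column-major
# machine whose memory is divided into blocks of K words.  A block is
# identified by (array, column, i // K); we count distinct blocks
# touched per task, assuming each block is fetched once (infinite cache).

K, n = 8, 64  # block size and matrix order (n a multiple of K)

def naive_task_blocks(i):
    """Blocks fetched by one iteration of the naive doall: fixed i, all j."""
    blocks = set()
    for j in range(n):
        blocks.add(('A', j, i // K))  # A(i,j): row walk, a new block per j
        blocks.add(('B', i, j // K))  # B(j,i): column walk, a new block per K steps
    return len(blocks)

def tile_task_blocks(I, J):
    """Blocks fetched by one K-by-K tile of the blocked loop nest."""
    blocks = set()
    for i in range(I, I + K):
        for j in range(J, J + K):
            blocks.add(('A', j, i // K))
            blocks.add(('B', i, j // K))
    return len(blocks)

naive = naive_task_blocks(0) / n          # transfers per assignment
tiled = tile_task_blocks(0, 0) / (K * K)  # transfers per assignment
print(naive)  # equals 1 + 1/K under this model
print(tiled)  # equals 2/K under this model
```

Under this idealized model the tile's advantage grows linearly with K, which is consistent with the text's remark that the improvement is clear when K is large.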
After blocking and interchanging loops, we end up with the loop nest: </p><p>do I = 1, n, K </p><p>do J = 1, n, K </p><p>do i = I, I+K-1 </p><p>do j = J, J+K-1 </p><p>B(j,i) = A(i,j) + 1 </p><p>end do </p><p>end do </p><p>end do </p><p>end do </p><p>If the outer loop is now transformed into a doall loop, the number of cache block transfers decreases to 2/K per assignment, a clear improvement over the naive approach when K is large. To conclude the work on this loop we need to block once more for vector registers, vectorize the innermost loop, and map into vector register instructions: </p><p>doall I = 1, n, K </p><p>do J = 1, n, K </p><p>do i = I, I+K-1 </p><p>do j = J, J+K-1, 32 </p><p>m = min(j+31, n) </p><p>vr1 = A(i, j:m) </p><p>vr2 = vr1 + 1 </p><p>B(j:m, i) = vr2 </p><p>end do </p><p>end do </p><p>end do </p><p>end doall </p><p>This resulting code segment is more efficient than the original loop, but it is also less readable. Not all transformations have this effect. For example, vectorization often makes the code more readable. </p><p>There are two issues that are rapidly becoming quite important in the area of Fortran restructuring. One has to do with the automatic optimization of parallel programs. Several parallel Fortran dialects have recently become available and more are expected in the near future. The issue is the correctness of optimization techniques (including parallelization) to be applied to programs written in these parallel dialects. Recent work [23] shows that traditional optimization techniques can be applied to parallel programs provided that some conditions are met. Such conditions use information on threads that are parallel to the one being optimized by the compiler. Consider, for example, the following program: </p><p>a = 0 </p><p>cobegin </p><p>a = 1 </p><p>// </p><p>do i = 1, n </p><p>B(i) = a </p><p>end do </p><p>coend </p><p>At the end of this program, it can be asserted that B will be a sequence of zeros followed by a sequence of ones. A parallelizer using traditional techniques will transform the loop into a doall. 
In </p></li><li><p>the parallelized program, B could be assigned any sequence of zeros and ones, which contradicts the assertion made on the original program. Clearly, this example illustrates a case where parallelization should not be applied without special intervention of the programmer. </p><p>The second issue is that of multilingual parallelization. The objecti...</p></li></ul>
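The hazard in the cobegin example above can be made concrete with a small simulation. The Python sketch below is our own illustration, not from the paper; the function names and the interleaving model (a single execution slot at which the concurrent write lands) are assumptions. It runs the loop once in sequential iteration order and once in a doall-style permuted order, and shows that only the former guarantees zeros followed by ones.

```python
# Simulate the cobegin example: one thread executes "a = 1" while the
# other runs "do i = 1,n: B(i) = a".  'order' is the order in which loop
# iterations execute; 'flip_at' is the execution slot at which the
# concurrent write a = 1 happens to take effect.

n = 6

def run(order, flip_at):
    a = 0
    B = [None] * n
    for slot, i in enumerate(order):
        if slot == flip_at:
            a = 1          # the parallel thread's write lands here
        B[i] = a
    return B

def zeros_then_ones(B):
    return all(x <= y for x, y in zip(B, B[1:]))

# Sequential iteration order: B is zeros followed by ones no matter
# when the concurrent write lands.
assert all(zeros_then_ones(run(list(range(n)), f)) for f in range(n + 1))

# doall order (iterations scheduled in reverse): the assertion can fail.
print(run(list(reversed(range(n))), 3))  # prints [1, 1, 1, 0, 0, 0]
```

This is exactly why a parallelizer must account for the threads running concurrently with a loop before declaring the doall transformation safe.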

