
Machine and collection abstractions for user-implemented data-parallel programming¹

Magne Haveraaen
Department of Informatics, University of Bergen, P.O. Box 7800, N-5020 BERGEN, Norway

Scientific Programming 8 (2000) 231–246. ISSN 1058-9244 / $8.00 © 2000, IOS Press. All rights reserved.

Data parallelism has appeared as a fruitful approach to the parallelisation of compute-intensive programs. Data parallelism has the advantage of mimicking the sequential (and deterministic) structure of programs, as opposed to task parallelism, where the explicit interaction of processes has to be programmed.

In data parallelism, data structures, typically collection classes in the form of large arrays, are distributed on the processors of the target parallel machine. Trying to extract distribution aspects from conventional code often runs into problems with a lack of uniformity in the use of the data structures and in the expression of data dependency patterns within the code. Here we propose a framework with two conceptual classes, Machine and Collection. The Machine class abstracts hardware communication and distribution properties. This gives a programmer high-level access to the important parts of the low-level architecture. The Machine class may readily be used in the implementation of a Collection class, giving the programmer full control of the parallel distribution of data, as well as allowing normal sequential implementation of this class. Any program using such a collection class will be parallelisable, without requiring any modification, by choosing between sequential and parallel versions at link time. Experiments with a commercial application, built using the Sophus library, which uses this approach to parallelisation, show good parallel speed-ups, without any adaptation of the application program being needed.

1. Introduction

Programming of parallel high performance computers is considered a difficult task, and to do so efficiently may require extensive rewriting of a sequential program. One of the simpler approaches to parallel programming is the data parallel model, see [6] for a general introduction. Data parallel programming represents a generalisation of SIMD (Single Instruction, Multiple Data) parallel programming. Early data parallel languages include Actus [35], *Lisp and CM-Fortran for Thinking Machines' Connection Machine series with up to 65 536 processors (see [22] for experience with the CM-1), MPL [28] for the MasPar series of machines, and Fortran-90 [1], which was mostly developed during the 1980s. Much of the work on data parallelism has been influenced by research on systolic arrays [30]. The popularity of data parallel programming took off with the massively parallel machines available in the late 1980s.

¹ This investigation has been carried out with the support of the European Union, ESPRIT project 21871 SAGA (Scientific computing and algebraic abstractions) and a grant of computing resources from the Norwegian Supercomputer Committee.

Task parallelism, also known as control parallelism or functional parallelism, is the notion of independent tasks communicating with each other [23]. It is intrinsically non-deterministic in nature. Task parallelism requires a much more involved semantical approach, checking that tasks rendezvous appropriately in addition to checking the sequential correctness of each task and any of the possible compositions of the tasks [33]. In contrast, data parallelism is much simpler to work with, see [21]. Data parallelism allows us to use sequential reasoning techniques when writing, reading and understanding code [16]. The programs will behave consistently on synchronous (SIMD) or asynchronous (MIMD – Multiple Instruction, Multiple Data) platforms [17].

Here we extend the use of data abstraction oriented programming to provide a framework to encompass data parallel programming. The importance of data abstraction and encapsulation has been recognised for a long time [34]. It is now supported by most programming languages, often under the terminology object-orientation and class abstraction.²

² Based on concepts already clearly defined in the programming language Simula [12].


Our idea is to develop a normal, sequential program using data abstractions. Then one or more of the abstractions are replaced by semantically equivalent, but data parallel, versions [44]. The new configuration will create a data parallel version of the whole program without requiring any reprogramming or extensive compile-time analysis. The basic module for parallelisation is a parameterised data abstraction (also called a template class or generic class), Collection, with user-oriented methods for permutations and data updates. The Collection is then implemented as normal sequential code or by using a Machine template class. Machine provides concepts corresponding to the machine hardware: processors with local memory and communication channels. This may reflect a distributed memory parallel computer, but may also represent a shared memory parallel machine with hierarchical memory (local cache, local memory, off-processor shared memory, secondary store for paging, etc.). The embedding of a Collection onto the parallel machine is then expressed by using Machine in the data structure for the collection. This gives the programmer full control over the distribution and partitioning strategies. We envisage a library of Machine and Collection classes representing both sequential and parallel embeddings. A programmer may then easily experiment with different parallelisation strategies for a program by choosing between different implementations in the library.

The programming strategy to utilise the collection classes is close to using array/collection expressions in languages like APL [24], Actus or Fortran-90. This is in contrast to a data parallel programming language like High Performance Fortran (HPF) [26,38], which is focused around loops (with explicit parallelism) over individual elements of the collections. But by making the choice of collection class implementation available to the programmer, the control over distribution of data is closer to that of HPF, where this can be controlled by directives. The choice between collection class implementations does not require adaptation of the program code, and can be postponed to link time. We will refer to this as configuration time, which is when the components to make up a program are chosen, irrespectively of when this happens in the compile-link-run phases of software development. Configuration does not interact with the actual code of programs and algorithms, but is the act of choosing between (library) implementations.
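To make the notion of configuration time concrete, the following self-contained toy example (ours, not code from the paper) shows a program whose collection implementation is chosen by a compile flag rather than by editing the application code; the class name stand-ins and the PAR_CONFIG macro are hypothetical.

#include <cstdio>
#include <vector>

// Two interchangeable implementations with the same interface.
template<class T>
class SeqCollection {                    // sequential: a plain array
public:
  explicit SeqCollection(int n) : d(n) {}
  void update(int i, T v) { d[i] = v; }
  T read(int i) const { return d[i]; }
private:
  std::vector<T> d;
};

template<class T>
class ParCollection {                    // stand-in for a Machine-based version
public:
  explicit ParCollection(int n) : d(n) {}  // a real version would distribute d
  void update(int i, T v) { d[i] = v; }
  T read(int i) const { return d[i]; }
private:
  std::vector<T> d;
};

// The configuration point: which implementation the name Collection refers to.
#ifdef PAR_CONFIG
template<class T> struct Config { typedef ParCollection<T> Collection; };
#else
template<class T> struct Config { typedef SeqCollection<T> Collection; };
#endif

int main() {                             // "application" code, independent of the flag
  Config<double>::Collection a(10);
  a.update(3, 5.2);
  std::printf("%g\n", a.read(3));
  return 0;
}

Compiling with or without -DPAR_CONFIG (and linking the corresponding library objects) is the whole configuration step; the application source is untouched.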

Being a data parallel framework, our suggestion differs dramatically from parallel programming using message passing, e.g., the MPI library [40,41]. Message passing is associated with task parallelism programming, even though it often is used in a more restricted context when programming parallel machines. Message passing is normally added to some standard programming language, and the nature of this language will guide what kind of abstractions and programming style the programmer has to work with. However, the kind of data that can be sent across the network using message passing (typically only primitive types or contiguous array segments) will influence which abstractions are natural to work with.

This paper is organised as follows. In the next section we discuss some parallelisation strategies for the data parallel programming approach. In Section 3 we present the classes Collection and Machine of our data parallelisation framework, together with a framework, Sophus, which supports the use of such abstractions. Then we provide benchmarks for parallel speed-up. Finally we compare our approach with similar approaches. C++ [42] and Fortran-90 are used for code examples and benchmarks.

2. Data parallelism

The idea behind data parallel programming is that a sequential program may be parallelised by running identical copies of it synchronously as separate tasks. The data is distributed between these tasks, and a reference to non-local data causes a communication between processes. This approach can be proved to be safe in general [2]. Thus no additional reasoning, beyond that of showing it is a correct sequential program, is required to ensure the correctness of any data parallel program.

A data parallel compiler tries to extract the parallelism present in a sequential program and target it for a parallel machine. Normally the elements of array data structures, or other collection oriented data structures, are the candidates for being distributed. Distribution can be looked upon as the task of mapping the index set, the index domain, of the array onto the physical processors. If the mapping is a surjective function all processors will receive some work. Normally the mapping cannot be injective, as there will be more elements in the index set than there are processors on the target machine. So the index domain may have to be partitioned such that each partition is mapped to a distinct processor. According to [13], there are two philosophies to partitioning:

1. machine based: starting from the hardware, introducing virtual processors for the elements of the index domain, and
2. program based: grouping of the index domain of the data structure.

Partitioning is always subject to (1) the constraint of balancing computational load as evenly as possible between the processors, and (2) minimising interprocessor data access, i.e., minimising communication between processors and keeping communication as close as possible to the communication topology of the target machine.

Meeting these requirements requires extensive analysis of data access patterns within a program. The analysis must be performed for all data structures that are candidates for distribution, and the result must be a coherent distribution strategy for all the data of a program. It is important that data structures that interact are aligned, i.e., that their index domains are partitioned using the same strategy. Additionally, the optimal distribution of data may change throughout the program, requiring redistribution of the data as the computation proceeds. Partitioning should also take into account memory hierarchy and communication speed characteristics. Chatterjee et al. [8,9] have studied this problem and use techniques from optimisation theory to find optimal data distributions.

Though data parallelisation of a sequential program always will be correct, in the sense that the parallel versions will compute the same results as the sequential versions, there is no guarantee that the parallel versions will be efficient. While some parallel speed-up can be expected, optimal speed-up cannot be, as the general partitioning problem is NP-complete [27]. Thus the user normally needs to supply more information in order to achieve optimal speed-up. In HPF this is given as directives. The information may otherwise take the form of language extensions, or be interactive instructions to the compiler or run-time system, see [11] for an overview. The form of this information will differ depending on whether the partitioning strategy is virtual machine based or program based. Program based partitioning is the more common strategy at present. HPF and its compilers are based on this strategy [10].

A parallelising compiler for a language like HPF often treats the directives only as hints on the distribution of data. The compiler is free to distribute data in other ways it may find beneficial. This often has to do with alignment of data, where directives that conflict with the use of the data can be discovered. In such a case the compiler may rely more on the intended use of the data than on the hints it has been supplied with.

3. An abstraction based framework for data parallelism

Instead of defining a data parallel language or having the compiler look for implicit parallelism in sequential programs, we will identify abstractions, within a sequential language framework, that allow us to capture parallelism explicitly. Our data parallel framework will thus appear as sequential programming, yet provide the programmer with explicit control over the parallel distribution of the program. Reflecting the machine and program based approaches to data partitioning, we identify two collection-oriented abstractions:

1. a class Machine that reflects the set P of physical processors of the target machine and its communication topology, and
2. a class Collection, reflecting the arrays and other collection data structures declared in a program, with user-chosen index domain I and user-defined permutation patterns.

Both of these classes must be parameterised by a template class T. This ensures that we can place any data structure T on the processors of the target machine. The intention is that parallelisation of a program will be achieved by embedding the user-level abstraction Collection into the machine abstraction Machine. This will allow full programmer control over the embedding, and different embeddings for the same Collection will provide different parallelisation schemes for any program that can use the Collection abstraction.

The basic methods in the interface of the Machine<T> class are derived directly from hardware characteristics.

– Methods map to perform an argument procedure f on the local data of type T in parallel for each processor p ∈ P of the machine, where the behaviour of f may depend on the processor index p, but f is only allowed to access data local to its processor p.
– (Several) methods that capture the structured data permutations between processors consistent with the communication network topology of the hardware.
– (Several) methods that capture unstructured data permutations which are supported by the hardware.
– Broadcast, reduction/prefix operations as supported by the hardware.
– Methods to update and read individual data values at specific processors.

The topology of a machine may take into account issues such as relative access costs of hierarchical memory, not just the physical wiring between processors which traditionally is seen as the connectivity of a machine. Many of the operations above should be familiar as array operations. For example, mapping the plus operation over a pair of collections gives the elementwise array addition operator of Fortran-90. Parallelism is achieved by the map operations (forall style parallelism), which execute f at every processor in parallel, and the permutation operations, which communicate in parallel over the network. Access to a location at a specific processor will force sequentialisation and should be avoided.
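As a concrete illustration, a minimal C++ interface along these lines could look as follows. This is only our sketch of the operations listed above, not the framework's actual Machine header; the method names and signatures are assumptions.

// Sketch of a Machine<T> interface: one value of type T per physical processor.
template<class T>
class Machine {
public:
  // Apply f on every processor p in parallel; f may only access its local value.
  void map(void (*f)(int p, T& local));

  // Structured permutation along one hardware communication link.
  void perm(int link);

  // Unstructured permutation supported by the hardware:
  // processor p sends its value to processor to[p].
  void route(const int* to);

  // Broadcast the value held at processor root to all processors.
  void broadcast(int root);

  // Reduce all local values with an associative operation; every processor
  // receives the result.
  T reduce(T (*op)(const T&, const T&));

  // Update and read an individual value at a specific processor
  // (forces sequentialisation, as noted above).
  void put(int p, const T& v);
  T get(int p) const;
};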

For the Connection Machine series a CM_Machine abstraction will reflect the machine's hypercube architecture. The class CM_Machine will have the processor numbers as index domain, such as the set {0, 1, ..., 65 535}. The permutations will permute data between neighbouring processors for every link of the hypercube, i.e., for i ∈ {1, 2, ..., 16} the pairs of processors that differ in bit i will swap data.
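The pairing of processors across a given hypercube link is a single bit flip of the processor number, which such a permutation could compute as follows (our illustration).

// For a hypercube, the partner of processor p across link i (i = 1, ..., 16 above)
// is the processor whose number differs from p in bit i only.
int hypercube_partner(int p, int i) {
  return p ^ (1 << (i - 1));   // flip bit i, counting bits from 1
}
// Example: across link 1, processors 0 and 1 swap data;
// across link 16, processors 0 and 32768 swap data.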

For the MasPar machine series the MP_Machine abstraction will reflect the machine's 2-dimensional toroidal architecture. So for a 64 by 128 processor MasPar MP-2, MP_Machine will have the pairs {0, 1, ..., 63} × {0, 1, ..., 127} as index domain. The permutations will shift data between processors for every link of the collection, i.e., circularly in directions North/South or East/West, or circularly in the diagonal connections NE/SW or NW/SE. The MasPar also has a general router between arbitrary processors, but it does not perform well for massive data exchanges, and may be consciously omitted from the permutation methods of MP_Machine.

For modern machines like a 128 processor SGI Cray Origin 2000 or networks of (multiprocessor) workstations, like 32 SUN Ultra-10, the network topology becomes more vague. Both these configurations have specific links, but neither offers any direct means of controlling which connection is to be accessed. Thus we may think of one as a 128 node fully connected machine, the other as a 32 node fully connected machine. In both cases the index domain for FC_Machine will be a range of integers, and any communication pattern repeated on all processors is a legal data permutation. In these examples we have not worried about the effect of hierarchical memory.

The user level abstraction Collection is quite similar to Machine. In addition to the template class T, we formally include the index domain I as a template parameter, giving the abstraction Collection<I,T>. Each variable (or object) belonging to this class has a shape I′ ⊆ I and only contains data belonging to the subdomain I′ of the index domain I. In C++ a rank n index³ set I may be realised by the class Index<n> which represents all n independent integer indices i1, ..., in. For n = 3 we may define the extent I′ of the indices to be i1 ∈ {0, 1, 2, 3, 4}, i2 ∈ {0, 1, 2}, i3 ∈ {0, 1, 2, 3} when we declare a variable ind.

int ilim[3];
ilim[0]=5; ilim[1]=3; ilim[2]=4;   // extents of the three indices
Index<3> ind(ilim);

Constructs for defining arbitrary index sets, akin to set in Pascal [25], make it possible to define non-contiguous subsets I′′ of I′. We may now declare a Collection<I,T> variable A to have double as elements and 3 indices in the range of i1, i2, i3 from above by

Collection<Index<3>,double> A(ind);
ilim[0]=2; ilim[1]=2; ilim[2]=2;
set(ind,ilim);          // set the index ind to (2,2,2)
update(A,ind,5.2);      // update the element of A at that index

The last three statements set the index ind to (2, 2, 2) and then update the element at that position in A to 5.2.

The methods for a Collection class are basically the same as for the Machine classes:

– methods map to perform an argument procedure f on the local data of type T in parallel for each data element i ∈ I′ of the index domain, where the behaviour of f may depend on the index i, but f is only allowed to access data local to its index i,
– (several) methods for structured and unstructured permutations of data, as needed by the user,
– methods for broadcast, reduce (with an arbitrary associative operation ⊕), prefix etc., as needed by the user,
– methods to update and read individual data values at specific indices i ∈ I′, and
– methods for extracting and replacing all elements corresponding to a subset I′′ of I′.

³ A rank n index set I is such that the elements of I can be identified by n independent indices (i1, ..., in) ∈ I.


The majority of the operations are only defined as needed by the user. This is because different applications have different requirements, and by tuning the interface of Collection to the needs of the user we lessen the demands on its implementation, making it easier to provide versions tuned for different parallel machines.

The important aspect of the Collection abstraction is that it promotes access to the entire structure at once, separating permutation operations from computations and element update operations. This promotes a programming style which discourages arbitrary access patterns to data elements, but any useful access pattern may of course be defined in an appropriate permutation operation. It also obviates a construct like HPF's “independent do” directive, which only has a meaning for explicitly indexed elements in loops. Instead this will be built into the requirements of relevant (unstructured) permutation operations or be handled by map operations. For example, if computations f and g are to be applied to elements indexed by disjoint subsets J, K ⊆ I′, respectively, then mapping an operation p(i,x), for index i and data element x, with body if i in J then f(x) else if i in K then g(x), over the elements of the collection will do the trick.
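The mapped operation just described can be written out as ordinary sequential code. The following is a small self-contained sketch (ours), with plain int indices and std::set standing in for the index subsets J and K; the real framework's Index and subset types would differ.

#include <set>

static std::set<int> J, K;               // hypothetical index subsets

static void f(double& x) { x *= 2.0; }   // computation for indices in J
static void g(double& x) { x += 1.0; }   // computation for indices in K

// The procedure handed to the collection's map operation.
static void p(int i, double& x) {
  if (J.count(i))      f(x);
  else if (K.count(i)) g(x);
  // indices outside J and K are left unchanged
}
// A.map(p);   // would apply p to every element of the collection in parallel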

The user oriented permutation operations should be restricted as much as is reasonable. The more limitations that can be placed on these permutations, the better, as this will give a larger freedom of choice for the parallel embeddings.

The rest of this section will first present an approach to programming numerical software well suited for using the Collection abstraction. Then we will discuss implementing sequential and parallel versions of Collection and how to define the Machine class.

3.1. The Sophus library

As noted, the use of the proposed data parallel framework requires a more holistic view of programming than the elementwise manipulations commonly found. But using it is feasible: we have used it with success in the area of partial differential equations (PDEs) [19].

Historically, the mathematics of PDEs has been approached in two different ways. The applied, solution-oriented approach uses concrete representations of vectors and matrices, discretisation techniques, and numerical algorithms. The abstract, pure mathematical approach develops the theory in terms of manifolds, vector and tensor fields, and the like, focusing more on the structure of the underlying concepts than on how to calculate with them (see [39] for an introduction).

The former approach is the basis for most of the PDE software in existence today. The latter has very promising potential for the structuring of complicated PDE software when combined with object-oriented and template class based programming languages. Some current languages that support these concepts are Ada [3], C++ [42], Eiffel [29] and GJ [7]. Languages that do not support templates, such as Fortran-90 [1] and Java [15], may also be used, but at a greater coding cost.

The Sophus library framework [18,20] provides the abstract mathematical concepts from PDE theory as programming entities. Its components are based on the notions of manifold, scalar field and tensor field, while the implementations are based on conventional numerical algorithms and discretisations. Sophus is being investigated by implementing it in C++. The framework is structured around the following concepts:

– Variations of the basic collection classes as sketched above. The map operations for numerical operations like +, * have been explicitly defined.
– Manifolds. These are sets with a notion of proximity and direction. They represent the physical space where the problem to be solved takes place.
– Scalar fields. These may be treated as arrays indexed by the points of the manifold with reals as data elements. Scalar fields describe the measurable quantities of the physical problem to be solved, and are the basic layer of “continuous mathematics” in the library. The different discretisation methods provide different designs for the implementation of scalar fields. A typical implementation would use an appropriate collection as underlying discrete data structure. In a finite difference implementation partial derivatives are implemented using simple shift permutations and arithmetic operations on the collection (a one-dimensional sketch of this follows the list).
– Tensors. These are generalisations of vectors and matrices and have scalar fields as components. Tensors are implemented using collections with scalar fields as template arguments.
– Equation administrators. These handle the test functions for finite element methods, volumes for finite volume methods, etc., in order to build (and solve) systems of linear equations. Again, these classes are implemented using appropriate collections.
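To make the finite difference remark in the scalar field item concrete, here is a minimal one-dimensional sketch (ours, not Sophus code) of a derivative built from a shift permutation and elementwise arithmetic, assuming a periodic boundary.

#include <vector>

// Cyclic shift: result[i] = u[i+s], indices taken modulo the length.
static std::vector<double> shift(const std::vector<double>& u, int s) {
  int n = (int)u.size();
  std::vector<double> r(n);
  for (int i = 0; i < n; ++i)
    r[i] = u[((i + s) % n + n) % n];
  return r;
}

// Central difference approximation of du/dx on a grid with spacing h:
// (u(x+h) - u(x-h)) / (2h), written as shifts plus elementwise arithmetic.
static std::vector<double> ddx(const std::vector<double>& u, double h) {
  std::vector<double> fwd = shift(u, +1), bwd = shift(u, -1), r(u.size());
  for (int i = 0; i < (int)u.size(); ++i)
    r[i] = (fwd[i] - bwd[i]) / (2.0 * h);
  return r;
}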


Each of the collection classes used in the abstractions above may have a different implementation.

A partial differential equation uses tensors to represent physical quantities in a system, and it provides a relationship between spatial derivatives of tensor fields and their time derivatives. Given constraints in the form of the values of the tensor fields at a specific instant in time and relevant boundary conditions, the aim of a PDE solver is to show how the physical system will evolve over time, or what state it will converge to if left by itself. Using Sophus, the solvers are formulated at the tensor layer, giving an abstract, high level program for the solution of the problem.

3.2. Implementing the abstractions

3.2.1. The Collection classes
Any given Collection variant may be implemented in many ways. Different sequential and distributed implementations provide different data layout, data traversal and parallelisation strategies. A configuration for a program will define which of these implementations, and hence strategies, to use. Since the collection classes are data abstractions, a new version may be implemented by the user, e.g., if a new hardware topology should require it. This also gives an opportunity for experimenting with optimisation and parallelisation strategies under full programmer control.

A sequential Collection is a normal data abstraction, implemented using sequential code, e.g., using the standard array structures of a programming language. The data layout in memory will be significant if issues like cache misses etc. are important for the target machine. Configuring a program from sequential collections will give a sequential program.⁴ Sequential collections may be nested freely in the same way as any other data type constructor.

⁴ Ideally such a program should have the same run-time efficiency as a conventionally written program, where the data permutation patterns are intermixed with expressions. This will often not be the case, see the discussion in [14], due to current compiler optimisation technology. This technology is geared towards certain programming styles, giving large efficiency improvements to programs expressed using certain idioms, with little or no improvements on programs written using other styles. But optimisation technology is not static, and history has shown that it will adapt to new usage idioms, supporting styles which people believe are important. A discussion of recent optimisation technology improvements for C++ can be found in [36]. User controlled optimisation techniques based on abstraction oriented programming approaches are also being developed, see for example [43,14]. Thus one's fear that the intense use of abstractions may slow down a program may be confirmed by current benchmarks, but this is a situation that should change quickly when the usefulness of abstraction oriented techniques has been established.

A parallel version of the collection classes may be achieved using a Machine abstraction as data structure for Collection. We assume that we may distribute a collection data type in isolation from the rest of the program, that it then can be nested with other data type constructors, and that the set of distributed variables in a program will give a coherent distribution of that program. These assumptions are normally satisfied.

All collection classes which satisfy the same interface and functionality are interchangeable, irrespectively of whether the implementations are sequential or parallel. Thus a program that uses Collection classes may be run on sequential and parallel machines without altering the program code. The key is the configuration, where which Collection implementation to use is defined.

Consider the simple example of a matrix data structure BlockCollection<P,J,T>. We want to distribute it among the processors of a parallel machine as a blocked matrix implementation, where the index domain P for the blocks coincides with the index set for the processors of the target machine, and each block has index domain J. Distributing the blocks directly on the processors would correspond to the data structure Machine<Collection<J,T> >, where the inner class Collection<J,T> is sequential, i.e., is in local memory at each of the processors. This can be expressed by the C++ template class declaration

template<class P, class J, class T>
class BlockCollection
{ Machine< Collection<J,T> > B;
  ...
public:
  void blockperm();
  void columnperm( int j );
  ...
};

where blockperm and columnperm are two of possibly many permutation operations for this collection class. Assume that blockperm is an operation that will permute the blocks in a pattern required by the user, for instance for a block matrix multiplication. If this permutation pattern coincides with the topology of the machine, blockperm may be implemented using a call directly to one of Machine's permutation operations, perm.

template<class P, class J, class T>
void BlockCollection<P,J,T>::blockperm()
{ B.perm(); }

The permutation operation columnperm(int j), which permutes column j of each block between the blocks in the pattern of the topology, can be coded by first extracting the column at all processors, then permuting them, and finally replacing the permuted column on all processors.

template<class P, class J, class T>
void BlockCollection<P,J,T>::columnperm(int j)
{ Machine< Collection<J,T> > C;
  map( B, extract, j, C );
  C.perm();
  map( B, replace, j, C );
}

Here extract(Collection<J,T> M, int j, Collection<J,T>& c) is a Collection operation that extracts column j of the matrix M and stores it in the column variable c. The operation replace(Collection<J,T>& M, int j, Collection<J,T> c) will do the opposite, i.e., replace column j in M by c. These operations are performed on all processors by the map operations, and will, for each block at each processor, in parallel, extract and then replace the appropriate portion of the distributed variable C. The distributed data in C is permuted according to the topology of Machine between the mapped extract and replace operations.

The permutation patterns needed by the user will not always match exactly the communication topology of the target machine. Then the implementation of the permutation operations for the collection will require more complex expressions of communication operations. Most of these communication expressions will in any case be straightforward, since the machine topologies are adapted to the permutations likely to be needed by users. This typically includes the regular permutations needed for matrix multiplication, many linear equation solvers, fast Fourier transforms, finite difference methods, etc. At the general level, it has been shown [31] that any regular permutation pattern may be built in a small number of steps from a few basic communication operations, namely those of meshes and hypercubes. The cost of finding the optimal sequence may be fairly high, but once found, it may be coded in the implementation of a Collection class and freely reused.

Irregular permutation patterns may also be defined as a permutation operation on the collection class [37], irrespectively of whether such communication is directly supported by hardware. In fact, it may be beneficial to implement unstructured permutations by a sequence of structured permutations, even if there is hardware support for irregular permutations [4].

If we look closer at the problem of distributing a data type Collection<I,T> by the data structure Machine<Collection<J,T> >, for a machine with processors indexed by P, we see that we can define an injective location mapping ℓ : I → P × J. This forces a lower bound on the size of J given I and P. Then ℓ(i) = (p, j) will indicate which processor p ∈ P and which index j ∈ J at that processor the element i of the collection Collection<I,T> is mapped to. This is true both when the number of data elements is larger than the number of processors (the most common case) and when there are more processors than elements in the index domain I. In the latter case J may be taken to be a one-element set, which is equivalent to declaring the data structure Machine<T> rather than Machine<Collection<J,T> >. The sequential case is when the set P contains one element. Then the data structure reduces to Collection<J,T> and ℓ defines a reindexing of the data elements, e.g., how to map a rank n index set I into a rank 1 index set J.

The location mapping ℓ can be defined from two functions, the distribution mapping d : I → P and the local memory location mapping e : I → J, by ℓ(i) = (d(i), e(i)). The functions d and e define the distribution among processors and the memory layout within a processor of the data, respectively. Such functions may be used explicitly in the implementation of the collection, or may be given implicitly by the way data of Collection<I,T> is addressed on Machine<Collection<J,T> >. It is always useful to state these functions explicitly, as this is good documentation of the data layout. Normally one would require d to be surjective so that all processors receive data. Also, making e surjective would ensure a good utilisation of memory at each of the processors. Common strategies for d and e include address splitting, where some bits of an index i ∈ I are chosen as processor number and some bits are chosen as location index within each processor. This gives a uniform distribution of the data elements, but may not give a uniform distribution of work load between processors. If there is some subset I′ ⊆ I that we expect to be heavily compute intensive, we would prefer that the distribution mapping d maps the elements of I′ as uniformly as possible across the processors, and how the rest of the data is mapped is less crucial.
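As a concrete illustration of d and e, the following sketch (our own; the paper does not prescribe these formulas) writes a block distribution of a one-dimensional index domain I = {0, ..., N-1} over P processors as the two mappings, together with the address splitting variant mentioned above.

// Block distribution: consecutive indices are grouped into blocks of size B,
// one block per processor.
const int N = 1 << 20;               // |I|, assumed for the example
const int P = 32;                    // number of processors, assumed
const int B = (N + P - 1) / P;       // block size, ceil(N/P); here B = 2^15

int d(int i) { return i / B; }       // distribution mapping        d : I -> P
int e(int i) { return i % B; }       // local memory location map   e : I -> J

// Address splitting (B a power of two): some bits of i give the processor
// number, the remaining bits give the location within the processor.
int d_split(int i) { return i >> 15; }        // high bits: processor number
int e_split(int i) { return i & (B - 1); }    // low 15 bits: local index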

Equally important is taking into account the communication topology of the target machine in the case of a distributed memory machine. Permutations of Collection<I,T> will have to be expressed as communications of Machine, and the more directly we may exploit the hardware topology, the more efficient our program will become.

Sequential and shared memory machines apparently have no network communication topology in the traditional sense. But these typically exhibit complex notions of memory hierarchies: various caches, non-local memory, disk storage for paging, etc. Constraints may then be put on the local memory location mapping e to utilise hierarchical memory efficiently. In practice this seems to be more difficult to control than the utilisation of processors, as the strategies used by memory management systems rarely are available to application developers. For high performance applications, cache and memory misses may be more costly than poor parallel load balancing. Currently we are forced to experiment in the dark with different memory localisation strategies, and rely on the quality of compilers to avoid unnecessarily high execution costs in this area.

When building complex programs using collection classes one will often find that collection classes are nested within collection classes. This is the case for the Sophus framework, where the tensor fields are collections of scalar fields, which themselves are implemented using collections of reals. Sequential implementations of collection classes may be nested freely with sequential and parallel classes. A collection class based on a Machine class can be nested within another parallel collection class only if we have a nested parallel machine structure, e.g., a networked cluster of multiprocessor machines. Some approaches to nested parallelism exist at the programming language level, see [5], and may be exploited to allow such nesting, but this requires advanced handling by the compiler. Otherwise machine classes should not be nested in a program configuration, neither directly nor indirectly, hence we may only have one parallel collection class in the construction of a nested data structure. If the parallel collection class is deeply nested, we typically get fine-grained parallelism (a large data structure with parallel replication of the small parts). If the parallel collection is one of the outer classes we tend to get coarse-grained parallelism (a few parallel instances of large data structures).

3.2.2. The Machine classes
The Machine class is where the data distribution and parallelism are implemented directly in low-level, hardware oriented concepts. This is in a setting where we only have to worry about the hardware characteristics and not about data partitioning, virtualisation, user defined permutations, etc. The main implementation techniques available are

– a sequential language together with message passing akin to MPI, typical for programming MIMD machines, but many message passing libraries have also been adapted for SIMD machines, and
– a language with specific parallel constructs for the target hardware, such as the data parallel C dialect MPL for the MasPar computer series, typical for programming SIMD machines.

Although general, platform independent parallel languages exist, these will seldom be suitable for implementing a Machine class, since they provide an extra layer of abstraction which destroys the close relationship with hardware that we want to achieve. When implementing the Machine classes, the closer we are to the hardware, the better.

Building a Machine class using message passing in principle opens up all the problems of showing correctness that are associated with task parallelism. But the SPMD (Single Program, Multiple Data) restriction of the MIMD model provides assumptions which greatly simplify correctness arguments in our framework. An SPMD execution of a program makes identical copies of the same program code and runs them concurrently, one on each processor. The only place where communication takes place in our framework is within the permutation, broadcast/reduce, update and read operations of Machine. If certain simple guidelines on writing program code are obeyed, these will be executed synchronously in the SPMD model. So we just have to ensure that the communications are correct within each of these communication operations. Thus the reasoning for correctness is simplified from showing correctness for all executions of a collection of tasks [33], to showing correctness for all executions of each of the permutation procedures. This can be shown once for each implementation of a machine abstraction – and need not be repeated for each program being developed, as for general task parallelism. The reason is that operations on Machine<T> are collective operations performed by all SPMD processes on the same arguments, see [2] for details. In contrast, message passing calls in the MIMD style, even if executed as SPMD processes, are generated unrestricted on each processor, requiring extensive proofs to guarantee correct matching of communication at all stages of the program.

The SPMD implementation of Machine for MIMD execution on P processors may be obtained by the following.

– Use T as data structure for Machine<T>.
– map executes the argument procedure on the data structure.
– Permutations are implemented by appropriate message passing send/receives or permutations.
– Broadcast and reduce/prefix operations are implemented as supported by the message passing system. If not supported, they will not be provided by the machine abstraction.
– Individual requests for updates and reads are directed to the appropriate processor using broadcast or send/receive operations.

With this strategy all data is declared on all processors, which gives a P-indexed set of data, and all operations are automatically run in parallel on all processors. The reason is that the SPMD execution style implicitly distributes all variables by replicating the whole program. This allows the compiler to be ignorant about the parallel structure of the program. Only the Machine implementation is aware of this fact, so only the abstract variables defined via the machine class will be truly distributed, the others will be replicated. This also goes for the computations, which will be replicated for all sequential variables, and only the distributed variables will benefit from the parallel execution. Since the most compute-intensive variables normally are declared using Machine, the repeated computation, in parallel, of the replicated variables should represent a small fraction of the total time. This then gives a good speed-up, see the benchmarks in Fig. 8 for the application SeisMod.
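To make this concrete, the following sketch (ours, not the paper's implementation) shows how two Machine operations could be realised with MPI in the SPMD style: every process holds one local value of the distributed variable, and each operation is a collective call made by all processes with the same arguments.

#include <mpi.h>

// map: apply f to the local value; the processor index is the MPI rank.
void machine_map(double& local, void (*f)(int p, double& x)) {
  int p;
  MPI_Comm_rank(MPI_COMM_WORLD, &p);
  f(p, local);
}

// A structured permutation: cyclic shift of the local values by `distance`.
void machine_shift(double& local, int distance) {
  int p, P;
  MPI_Comm_rank(MPI_COMM_WORLD, &p);
  MPI_Comm_size(MPI_COMM_WORLD, &P);
  int dest   = (p + distance % P + P) % P;   // where our value goes
  int source = (p - distance % P + P) % P;   // where our new value comes from
  double incoming;
  MPI_Status status;
  MPI_Sendrecv(&local, 1, MPI_DOUBLE, dest, 0,
               &incoming, 1, MPI_DOUBLE, source, 0,
               MPI_COMM_WORLD, &status);
  local = incoming;
}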

The communication operations for Machine must be provided for any type T, not only the predefined types. This may add some technical difficulties in the implementation of Machine, as any composite, non-primitive type may have to be broken down into primitive components, transmitted, and then be rebuilt by the receiver.

The technique of overlapping communication operations with computation to reduce waiting time is difficult to achieve in the proposed framework. It would require splitting the Machine communication operations into communication initiate and communication close suboperations. The splitting would have to be propagated up to the collection abstractions, which would violate the uniformity of the interface for different collection class implementations, especially between sequential and parallel versions. The splitting would also make program development much more difficult. A way to handle this is through a tool like CodeBoost [14], which could utilise the algebraic properties of the communication suboperations to restructure the program when such overlap is required.
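At the message passing level the initiate/close split corresponds to non-blocking communication. A minimal sketch (ours) of what the two suboperations of a shift could look like with MPI:

#include <mpi.h>

// Communication initiate: post the receive and the send, then return so that
// local computation can proceed while the messages are in transit.
void shift_initiate(double* out, double* in, int dest, int source,
                    MPI_Request req[2]) {
  MPI_Irecv(in, 1, MPI_DOUBLE, source, 0, MPI_COMM_WORLD, &req[0]);
  MPI_Isend(out, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[1]);
}

// Communication close: wait until both the send and the receive have completed.
void shift_close(MPI_Request req[2]) {
  MPI_Status st[2];
  MPI_Waitall(2, req, st);
}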

Programming Machine classes does not run into the problem of communication between tasks if an SIMD language is used, where all operations are synchronous. SIMD language attributes will typically identify data as distributed (indexed by the processors) or sequential (replicated on the processors or stored on a frontend processor). Permutations, broadcasts, reduce/prefix operations will be supported in the SIMD language if available on the machine. Likewise with access to individual elements. So an SIMD language will have direct constructs supporting all operations of a Machine class. These operations should also be at a sufficiently low level to achieve efficiency at the hardware level. A drawback may be that many machines, hence their SIMD languages, have restrictions on which data types may be moved between processors in atomic operations, or which associative operations are supported by reduction/scan operations. The general forms must then be implemented for the machine classes. More seriously, some SIMD languages have restrictions on how to declare distributed data. The problems this entails and how to handle them are outside the scope of this paper, see [17] for some indications.

3.2.3. The Sophus collection classes
The abstraction oriented approach to parallelisation has been used in Sophus. There the index domain I for the Collection is an n-dimensional discrete Cartesian index space, so the Collection can be considered a rank n array, i.e., an array with n indices.

A Collection class with shift operations, for an arbitrary distance along any of the n dimensions, as the permutations, has been defined. It has been implemented in both a sequential version and a parallel version for the fully connected machine. The MPI message passing library provides the operations needed for FC_Machine, as well as more user level operations, and was used directly for the collection code. Since MPI is fairly portable, this gave easy access to parallel implementations on both SGI Cray Origin 2000 and a network of 32 SUN Ultra-10 connected by ethernet.


Ideally we should have used language and communication operations as low-level and close to the hardware as possible. For pragmatic reasons we did not do this in our pilot implementation. The implementations of the collection data structure and methods have been designed with the specific requirements from the Sophus application programs SeisMod in mind [19].

The sequential implementation of Collection uses a rank 1 array with index domain J as main data structure. Then the map operations just traverse these arrays linearly, providing a very simple but fairly efficient implementation, since we get uninterrupted loops over the data. The mapped numerical operators were explicitly defined. The shift operations rearrange data in the array, and the choice of local memory location mapping e : I → J is essential for the speed of these operations. The layout is such that shifts in the lower dimensions move large consecutive blocks of data, while the blocks become smaller and more fragmented as the dimension number increases. Shifts are performed with equal frequency in all dimensions, but since the number of dimensions normally is low this generally gives good behaviour.
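The layout just described can be made concrete with a small sketch (ours): a rank 3 index (i1, i2, i3) with extents (n1, n2, n3) is linearised so that i1 varies slowest, and a cyclic shift along dimension 1 then becomes a single rotation of contiguous blocks in the underlying rank 1 array.

#include <algorithm>
#include <vector>

// Local memory location mapping e for extents (n1, n2, n3), i1 varying slowest.
int e(int i1, int i2, int i3, int n2, int n3) {
  return (i1 * n2 + i2) * n3 + i3;
}

// Cyclic shift by s along dimension 1: rotate whole i1-slices, each of which is
// one contiguous block of n2*n3 elements.  Assumes a.size() == n1*n2*n3.
void shift_dim1(std::vector<double>& a, int n1, int n2, int n3, int s) {
  int block = n2 * n3;
  int k = ((s % n1) + n1) % n1;                 // normalised shift distance
  std::rotate(a.begin(), a.begin() + k * block, a.end());
}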

The parallel implementation of Collection also uses a rank 1 array as main data structure for the efficient implementation of map. Since a fully connected architecture does not put any constraints on data movement, the choice of distribution mapping d : I → P is independent of such topological constraints. The data sets to be stored in Collection were typically rectangular, so splitting the data sets into equally large blocks by cutting the rectangle in pieces along the longer side gave a simple implementation with a balanced distribution of data and reasonably low communication costs. The local memory location mapping e : I → J uses the same principle as in the sequential case, taking care that contiguous blocks of data are used for interprocessor permutations. Other usage patterns would benefit from other distributions of data. In general, communication costs will be lowest if the data blocks distributed are as square as possible.

4. Benchmarks

Test runs of the pilot implementation of this framework approach to parallelisation were made on a network of 32 SUN Ultra-10 (UltraSPARC-IIi processor with 300 MHz clock and 99.9 MHz bus and 128 MByte internal memory) connected by a 10 MHz ethernet network, and on an SGI Cray Origin 2000 with 128 processors. On both computing resources the speed of both the sequential and parallel implementations was measured, and a comparison with similar computations in sequential Fortran-90 has been included. To show certain effects of hierarchical memory, the tests were run using rank 3 array data sets with 8*8*8 = 512, 16*16*16 = 4096, 32*32*32 = 32 768, 64*64*64 = 262 144 and 128*128*128 = 2 097 152 data elements each. These cube shapes do not reflect the actual usage patterns of the applications the pilot implementation was tuned for. The tests are for single precision floating point numbers (4 bytes), thus yielding collection variables with data sizes of 2 kB, 16 kB, 131 kB, 1.0 MB and 8.4 MB each, respectively. Five measurements were taken for each tabular entry, and the median value (at processor zero) has been used. The machines are multi-user machines, and the tests were run during normal operation, but at a time with reasonably low load. On the SUN network the test runs would typically be the only compute-intensive jobs running, while the Cray had idle processors and a load between 50 and 80 during most of the tests.

The test programs were compiled using commands equivalent to those in Fig. 1, where g, which ranges through 8, 16, 32, 64 and 128, designates the number of data elements in each direction, and p, which ranges through 1, 2, 4, 8, 16, 32, 64 (as relevant), is the number of processors the code is intended for. The flag Par controls when the configuration uses the parallel implementation of Collection. No code in the program, or other classes used, is dependent on this flag.⁵

The first set of tests is on the speed of the shift operation.

– Direction 1: data movement internal for each processor, large contiguous segments moved at each processor.
– Direction 2: data movement internal for each processor, many small segments moved at each processor.
– Direction 3: in the parallel case, contiguous segments communicated between processors. In the sequential case, the data segments moved are more fragmented than those for direction 2.

⁵ This is not fully true in our test programs, as we have included (in the main program) some code that prints out the configuration, the number of processors being used and the execution speed of the tests. This code is sensitive to the configuration flags. The rest of the code is written without regard to these flags.


sun4-seq> CC -I. -fast AbstractMatrix-tes-c++.C -DGRIDSIZE=g
sun4-par> CC -I. -fast AbstractMatrix-tes-c++.C -DGRIDSIZE=g -D Par -DNOPROCESSORS=p
sun4-f90> f90 -fast -fixed AbstractMatrix-tes-f90.F -DGRIDSIZE=g -DITERATIONS=1
cray-seq> CC -I. -Ofast AbstractMatrix-tes-c++.C -DGRIDSIZE=g
cray-par> CC -I. -Ofast AbstractMatrix-tes-c++.C -DGRIDSIZE=g -D Par -DNOPROCESSORS=p
cray-f90> f90 -cpp -Ofast AbstractMatrix-tes-f90.F -DGRIDSIZE=g -DITERATIONS=1

Fig. 1. Compilation commands for the benchmarks.

SUN4         Mflshps, direction 1     Mflshps, direction 2     Mflshps, direction 3
processors   Total      Per proc.     Total      Per proc.     Total      Per proc.

8³ elts.
1            74         74            55         55            21         21
2            41         20            88         44            0.49       0.25
4            71         18            136        34            0.37       0.09
seq          76                       56                       21

16³ elts.
1            98         98            70         70            28         28
2            61         30            178        89            2.4        1.2
4            143        36            350        88            1.6        0.41
8            271        34            622        78            0.87       0.11
seq          63                       60                       29

32³ elts.
1            51         51            52         52            36         36
2            66         33            96         48            5.3        2.6
4            92         23            206        52            6.3        1.6
8            338        42            624        78            3.4        0.43
16           695        44            1187       74            2.9        0.18
seq          53                       53                       38

64³ elts.
1            32         32            32         32            25         25
2            67         33            91         45            11         5.4
4            159        39            209        52            13         3.2
8            281        35            384        48            6.9        0.86
16           646        40            816        51            6.3        0.39
32           1208       38            1587       50            5.0        0.16
seq          32                       31                       25

128³ elts.
1            29         29            32         32            29         29
2            53         27            67         33            21         10
4            102        26            131        33            23         5.7
8            203        55            263        33            16         2.0
16           509        32            745        47            15         0.92
32           964        30            1584       50            8.4        0.3
seq          29                       32                       29

Fig. 2. Test runs showing speed of the shift operation on a network of SUN Ultra-10 workstations.

The results are given in Figs 2 and 3. The speed is given in millions of floating point numbers shifted per second, Mflshps.

If shifting of data in all directions is an important aspect of an algorithm, we find that small problems should definitely be run in single processor mode. Larger problems may have an optimal number of processors to run on, see the effects on direction 3 (“total” column). For 64³ elements the optimum seems to be between 8 and 16 processors on the Cray. Adding more processors probably will decrease the total throughput of the program due to the relative dominance of the slow interprocessor communication as the number of processors increases.


Cray         Mflshps, direction 1     Mflshps, direction 2     Mflshps, direction 3
processors   Total      Per proc.     Total      Per proc.     Total      Per proc.

8³ elts.
1            59         59            53         53            34         34
2            60         30            88         44            8.2        4.1
4            100        25            140        35            7.6        1.9
seq          60                       55                       35

16³ elts.
1            76         76            73         73            52         52
2            104        52            142        71            24         12
4            200        50            270        68            23         5.8
8            367        46            495        62            20         2.5
seq          76                       73                       52

32³ elts.
1            28         28            67         67            47         47
2            93         47            135        68            23         12
4            235        59            322        81            31         7.7
8            502        63            657        82            34         4.3
16           899        56            1264       79            37         2.3
seq          28                       66                       44

64³ elts.
1            27         27            68         68            55         55
2            110        55            146        75            29         14
4            220        55            294        73            44         11
8            441        55            585        73            55         6.9
16           799        50            1170       73            66         4.1
32           1313       41            2769       87            47         1.5
seq          27                       67                       54

128³ elts.
1            17         17            28         28            18         18
2            36         18            82         41            24         12
4            146        37            289        72            43         11
8            360        45            597        75            59         7.3
16           483        30            1194       75            79         4.9
32           508        16            2529       79            74         2.3
64           374        5.8           5002       78            7.8        0.12
seq          20                       36                       27

Fig. 3. Test runs showing speed of the shift operation on an SGI Cray Origin 2000 multiprocessor.

The same effect is visible for the SUNs. Due to the much higher communication costs over an Ethernet network, the ideal number of processors is lower than for the Cray. With 64³ elements, around 4 processors seems best.

For the mapped arithmetic operations the story seems different. These have no communication and scale nicely when parallelised. See Figs 4 and 5 for the addition of two matrices and Figs 6 and 7 for a scalar multiplied with a matrix. Computation speed is measured in millions of floating-point operations per second, MFLOPS. For the Fortran-90 tests we used 3-dimensional arrays and the built-in array operations. What is startling is the relationship between C++ and Fortran-90 execution speeds. The C++ implementation is ahead on small arrays. On the SUN, Fortran-90 speeds only approach C++ speeds for larger data sets, while on the Cray, Fortran-90 manages to surpass C++ speeds for the larger data sets. This may imply that the Fortran compiler on the Cray does a better job of aligning data with memory blocks.
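The mapped operations apply the same scalar operation to every element, so each processor works on its locally owned block without exchanging any data. A minimal sketch of such mapped operations — the LocalBlock type and function names are hypothetical, not the paper's Collection interface — is:

```cpp
// Minimal sketch of "mapped" elementwise operations: each processor applies
// the operation to the elements it owns, so no interprocessor communication
// is needed. (Hypothetical LocalBlock type; not the paper's Collection class.)
#include <vector>
#include <cstddef>

typedef std::vector<double> LocalBlock;   // the elements owned by one processor

// c = a + b on the locally owned elements (matrix addition, Figs 4 and 5).
void mappedAdd(const LocalBlock& a, const LocalBlock& b, LocalBlock& c) {
  for (std::size_t i = 0; i < a.size(); ++i)
    c[i] = a[i] + b[i];
}

// a *= s on the locally owned elements (matrix-scalar multiplication, Figs 6 and 7).
void mappedScale(LocalBlock& a, double s) {
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] *= s;
}
```

In contrast, the shift operation of Figs 2 and 3 must move elements across the block boundaries, which is where the communication cost enters.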

Also noticeable on both platforms are irregular changes in execution speed per processor as the data size increases. Matrix-scalar multiplication, which involves only one large collection, seems to outperform addition, which involves two large collections. These changes signal two things. The first is the problem of hierarchical memory, and the interplay between data size and cache/memory misses. This will give lower speeds as the data size increases.


SUN4, Add    8³ elts.           16³ elts.          32³ elts.          64³ elts.          128³ elts.
processors   Total  Per proc.   Total  Per proc.   Total  Per proc.   Total  Per proc.   Total  Per proc.

  1            49     49          27     27          27     27          17     17          17     17
  2            97     49          54     27          53     26          34     17          34     17
  4           190     48         193     48         108     28          81     20          68     17
  8            —                 393     49         215     27         211     26         134     17
  16           —                  —                 658     41         423     26         269     17
  32           —                  —                  —                 857     27         740     23
  seq          49                 27                 26                 17                 17
  f90           6.8               10                 11                  8.7                8.5

Fig. 4. Test runs showing speed of 3-dimensional matrix addition on a network of SUN Ultra-10 workstations.

Cray, Add    8³ elts.           16³ elts.          32³ elts.          64³ elts.          128³ elts.
processors   Total  Per proc.   Total  Per proc.   Total  Per proc.   Total  Per proc.   Total  Per proc.

  1            36     36          36     36          28     28          25     25          11     11
  2            71     35          72     36          55     27          51     26          10     10
  4           138     34         144     36         110     27         101     25          60     15
  8            —                 288     36         288     36         207     26         129     16
  16           —                  —                 579     36         435     27         430     27
  32           —                  —                  —                 882     28         880     28
  64           —                  —                  —                  —                1096     17
  seq          40                 40                 26                 22                 13
  f90          19                 35                 47                 26                 31

Fig. 5. Test runs showing speed of 3-dimensional matrix addition on an SGI Cray Origin 2000 multiprocessor.

The other effect is that of longer arrays allowing more efficient use of vectorisation and loop optimisation technology. This should give higher execution speeds as data size increases. This effect also takes place on each processor in the parallel case, where the per-processor speed often increases as the number of processors increases. Thus it seems very important to be able to control data alignment. Currently no explicit language or library construct for this is available. Only indirect techniques, such as reversing the ordering of declarations or changing the initialisation ordering for the data arrays, can be used to achieve this kind of alignment of data.

Random load distributions on multi-user machines may have a marked influence on the speed of programs using large data sets. This was observed for the tests. Normally the speeds of a series of test runs would vary by no more than 10–15%, but on the Cray the individual measurements for the test case of 128³ elements would vary far more than this. In extreme cases a factor of more than 2 was observed between the slowest and fastest test runs.

To confirm the efficiency on a real program, the parallelisation strategy was tested on the application SeisMod, a collection of programs for seismic simulations6 written using the Sophus library. The tests were run on the isotropic and transverse isotropic versions of the program. The isotropic case means that the rock the seismic wave traverses has identical physical properties in all directions. In the transverse isotropic case the rock will have different properties in the vertical and the horizontal directions. The latter case has a more complex mathematical formulation than the former, requiring more computations per timestep. The program was run for 1000 time-steps, and distributed on varying numbers of processors. Speed-up is measured as sequential simulation time divided by parallel simulation time. Two parallelisation strategies were chosen. The first, shown in Fig. 8, parallelises the scalar field, the inner collection class of Sophus, which gives a fine-grained parallelisation. Here we see that the execution time of the main simulation decreases with the number of processors, but with a noticeable efficiency drop-off starting at 8 processors. This is probably due to the loss of efficiency in shifting data between processors when the number of processors increases.

6 SeisMod is marketed by UniGEO a.s., Thormøhlensgt. 55, N-5008 Bergen, Norway.
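As a worked check on the speed-up columns (our arithmetic, using the times listed in Fig. 8): speed-up is the sequential simulation time divided by the parallel simulation time, so for the isotropic case on 8 processors

\[ \text{speed-up} = \frac{T_{\text{seq}}}{T_{\text{par}}} = \frac{1641\,\text{s}}{331\,\text{s}} \approx 5.0, \]

and correspondingly 2948 s / 504 s ≈ 5.8 for the transverse isotropic case.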


SUN4, Mult   8³ elts.           16³ elts.          32³ elts.          64³ elts.          128³ elts.
processors   Total  Per proc.   Total  Per proc.   Total  Per proc.   Total  Per proc.   Total  Per proc.

  1            49     49          49     49          34     34          27     27          27     27
  2            98     49         100     50          68     34          66     33          54     27
  4           192     48         198     50         139     35         137     34         108     27
  8            —                 378     47         386     48         277     35         217     27
  16           —                  —                 779     49         553     35         513     32
  32           —                  —                  —                 111     35        1084     34
  seq          49                 49                 35                 27                 27
  f90          15                 30                 32                 27                 27

Fig. 6. Test runs showing speed of 3-dimensional matrix–scalar multiplication on a network of SUN Ultra-10 workstations.

Cray, Mult   8³ elts.           16³ elts.          32³ elts.          64³ elts.          128³ elts.
processors   Total  Per proc.   Total  Per proc.   Total  Per proc.   Total  Per proc.   Total  Per proc.

  1            43     43          43     43          38     38          38     38          19     19
  2            85     42          86     43          77     39          77     38          55     27
  4           167     42         172     43         181     45         154     39         141     35
  8            —                 343     43         345     43         308     39         299     37
  16           —                  —                 690     43         619     39         299     37
  32           —                  —                  —                1435     45        1230     38
  64           —                  —                  —                  —                2451     38
  seq          51                 52                 39                 39                 26
  f90          28                 71                 71                 61                 50

Fig. 7. Test runs showing speed of 3-dimensional matrix–scalar multiplication on an SGI Cray Origin 2000 multiprocessor.

SeisMod, fine    Isotropic                        Transverse isotropic
Cray proc.       Init.   Simulate   Speed-up      Init.   Simulate   Speed-up

  1               7s      1899s       0.9          11s     2687s       1.1
  2              17s      1031s       1.6          27s     1538s       1.9
  4              28s       543s       3.0          44s      894s       3.3
  8              40s       331s       5.0          56s      504s       5.8
  16             56s       257s       6.4          68s      347s       8.5
  seq             5s      1641s       1              9s     2948s       1

Fig. 8. Execution times for application SeisMod with parallelisation of scalar fields (fine grain).

SeisMod, coarse    Isotropic                        Transverse isotropic
Cray proc.         Init.   Simulate   Speed-up      Init.   Simulate   Speed-up

  1                 6s      1957s       0.8          10s     3067s       1.0
  2                14s      2527s       0.7          19s     3737s       0.8
  seq               5s      1648s       1              9s     3061s       1

Fig. 9. Execution times for application SeisMod with parallelisation of tensor fields (coarse grain).

The initialisation times increase dramatically with the number of processors, but will amount to a minor part of total execution times for normal simulations. The second case, Fig. 9, shows parallelisation of the tensor fields, the outer collection class in Sophus, giving a much coarser grain. The tensor fields have between 2 and 6 components, and provide a less balanced distribution of the computational effort. We see this in an increase in execution time when the number of processors increases.
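A rough way to see the imbalance (our illustration, not a calculation from the paper): if the coarse-grained strategy distributes the k components of a tensor field over p processors, the busiest processor receives ⌈k/p⌉ of them, so the speed-up for the mapped work is bounded by

\[ \text{speed-up} \le \frac{k}{\lceil k/p \rceil}, \]

e.g., k = 3 components on p = 2 processors gives at most 1.5, and adding processors beyond p = k gives no further gain while the communication overhead keeps growing.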


5. Conclusions

Data parallel programming coupled with data abstraction provides a powerful framework for the development of application programs that easily port between sequential and parallel machines. Porting does not require any reprogramming, just a selection of the appropriate parallelisation module from a library. It also allows the programmer to develop new parallel modules, either for moving onto a new architecture, or for application-specific parallel optimisations. This requires that the parallel module may be nested at any level of the application program. This is automatically satisfied for a normal compiler and SPMD execution as supported by, e.g., MPI [40].
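For concreteness, the sketch below illustrates SPMD execution under MPI: every processor runs the same program on its own block of data, and a data-parallel shift becomes a paired send/receive with a neighbour. It uses only the standard MPI C bindings [40] and is an illustration of the execution model, not the paper's Machine interface.

```cpp
// Minimal SPMD sketch under MPI: each process owns one block of a
// distributed collection and shifts it cyclically to its neighbour.
// Illustration of the execution model only, not the paper's Machine class.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this processor's index
  MPI_Comm_size(MPI_COMM_WORLD, &size);   // number of processors

  std::vector<double> block(1 << 10, double(rank));  // locally owned data
  std::vector<double> incoming(block.size());

  int dest = (rank + 1) % size;            // neighbour to send to
  int source = (rank + size - 1) % size;   // neighbour to receive from
  MPI_Sendrecv(block.data(), int(block.size()), MPI_DOUBLE, dest, 0,
               incoming.data(), int(incoming.size()), MPI_DOUBLE, source, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  block.swap(incoming);                    // the shifted data replaces the old block

  MPI_Finalize();
  return 0;
}
```

In the framework proposed here, such communication would be encapsulated in a parallel Machine implementation, so application code never sees it.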

To fully utilise this framework for data parallel programming it must be coupled with programming styles such as that of Sophus [18–20] for coordinate-free numerics. This is a style that provides a natural focus on permutations and operations mapped over all data, rather than on individual data accesses and loops. A collection abstraction suited for parallelisation may be used at many levels in an application, providing ample opportunities for different parallelisation strategies.

Experiments show that this approach scales nicely with the number of processors and provides a good speed-up compared to the sequential case if the classes selected for parallelisation contain many data items to be distributed. In Sophus this would typically be the scalar field classes, which implement the discretisation techniques for the numerical solvers. Similar speed-up results are shown by Ohta [32]. He proposes to use the scalar field class as the unit of parallelisation directly, rather than using a parallelisable collection to implement it. Programming a parallel version of a scalar field class directly gives more technical details to take into account at one time than our proposal does. It also disregards the possibility of other sources of data parallelism within an application.

A data parallel collection class proved to be fairly easy to implement, and yet obtains a speed that is not too far from equivalent Fortran-90 speed. For a comparison of efficiency differences between C++ and Fortran see [36]. We still expect a certain loss of efficiency due to the extra layer of code introduced by the classes Collection and Machine. But at the same time these classes have clearer algebraic properties than is evident in traditional code. These properties may be utilised by high-level optimisation tools such as CodeBoost [14], perhaps regaining more speed than that being lost.

Machines such as the SGI Cray Origin 2000 and software packages such as MPI camouflage the underlying connectivity of the hardware. Though a useful simplification for many purposes, the framework we propose here would allow programmers to take advantage of the low-level communication topology without sacrificing program structure. Thus a means of directly accessing the hardware topology for the purpose of implementing Machine classes would be useful for further research on our approach. A similar problem appears in the use of the hierarchical memory structure of high performance machines, whether this be swapping between memory and disk or multilevel caches on a processor. This is presently beyond programmer control, but has a great influence on program efficiency. We would like to investigate more closely how to control and utilise such memory structure from a high-level perspective.

Acknowledgements

We would like to thank the following people for their contributions to this work: Ivar Velle and Steinar Søreide, who did most of the parallel implementations and helped with an earlier draft of this paper, and Hans Munthe-Kaas and Andre Friis.

References

[1] J.C. Adams, W.S. Brainerd and J.T. Martin, Fortran 90 Handbook: Complete ANSI/ISO Reference, Intertext Publications, 1992.

[2] C. Bareau, B. Caillaud, C. Jard and R. Thoraval, Correctness of automated distribution of sequential programs, in: PARLE'93 – Parallel architectures and languages Europe, volume 694 of Lecture Notes in Computer Science, A. Bode, M. Reeve and G. Wolf, eds, Springer, 1993, pp. 517–528.

[3] D. Barstow, ed., The Programming Language Ada – Reference Manual, Springer, LNCS 155, 1983.

[4] P.E. Bjørstad and R. Schreiber, Unstructured grids on SIMD torus machines, in: The 1994 Scalable High Performance Computing Conference, E.L. Jacks, ed., IEEE, 1994.

[5] G.E. Blelloch, S. Chatterjee, J.C. Harwick, J. Sipelstein and M. Zagha, Implementation of a portable nested data-parallel language, Journal of Parallel and Distributed Computing 21(1) (1994), 4–14.

[6] L. Bouge, The data parallel model: A semantic perspective, in: The Data Parallel Programming Model, volume 1132 of Lecture Notes in Computer Science, G.-R. Perrin and A. Darte, eds, Springer, 1996, pp. 4–26.

[7] G. Bracha, M. Odersky, D. Stoutamire and P. Wadler, Making the future safe for the past: Adding genericity to the Java programming language, in: Proceedings of the Conference on Object Oriented Programming Systems, Languages, and Applications (OOPSLA '98), 1998.


[8] S. Chatterjee, J.R. Gilbert and R. Schreiber, Optimal evaluation of array expressions on massively parallel machines, ACM Transactions on Programming Languages and Systems 17(1) (1995), 123–156.

[9] S. Chatterjee, J.R. Gilbert, R. Schreiber and S.-H. Teng, Automatic array alignment in data-parallel programs, in: ACM Symposium on Principles of Programming Languages POPL'93, 1993, pp. 16–28.

[10] F. Coelho, C. Germain and J.-L. Pazat, State of the art in compiling HPF, in: The Data Parallel Programming Model, volume 1132 of Lecture Notes in Computer Science, G.-R. Perrin and A. Darte, eds, Springer, 1996, pp. 104–133.

[11] P. Crooks and R.H. Perrott, Language constructs for data partitioning and distribution, Scientific Programming 5(1) (1995), 59–85.

[12] O.-J. Dahl, B. Myhrhaug and K. Nygaard, SIMULA 67 Common Base Language, (Vol. S-2), Norwegian Computing Center, Oslo, 1968.

[13] J.-L. Dekeyser and P. Marquet, Supporting irregular and dynamic computations in data parallel languages, in: The Data Parallel Programming Model, volume 1132 of Lecture Notes in Computer Science, G.-R. Perrin and A. Darte, eds, Springer, 1996, pp. 197–219.

[14] T. Dinesh, M. Haveraaen and J. Heering, An algebraic programming style for numerical software and its optimisation, Scientific Programming 8(4) (2000), 247–259.

[15] J. Gosling, B. Joy and G. Steele, The Java Language Specification, Addison-Wesley, 1996.

[16] Y.L. Guyadec and B. Virot, Sequential-like proofs of data-parallel programs, Parallel Processing Letters 6(3) (1996), 415–426.

[17] M. Haveraaen, Comparing some approaches to programming distributed memory machines, in: The 6th Distributed Memory Computing Conference Proceedings, volume II Software, Q.F. Stout and M. Wolfe, eds, IEEE Computer Society Press, 1991, pp. 222–227.

[18] M. Haveraaen, Case study on algebraic software methodologies for scientific computing, Scientific Programming 8(4) (2000), 261–273.

[19] M. Haveraaen, H.A. Friis and T.A. Johansen, Formal software engineering for computational modeling, Nordic Journal of Computing 6(3) (1999), 241–270.

[20] M. Haveraaen, V. Madsen and H. Munthe-Kaas, Algebraic programming technology for partial differential equations, in: Norsk Informatikk Konferanse – NIK'92, A. Maus and F. Eliassen et al., eds, Tapir, Norway, 1992, pp. 55–68.

[21] M. Haveraaen and N.A. Magnussen, Why parallel programming is easy, and CSP programming is hard, in: Norsk Informatikk Konferanse – NIK'90, B. Kirkerud et al., eds, Tapir, Trondheim, Norway, 1990, pp. 183–192.

[22] W.D. Hillis and G.L. Steele Jr., Data parallel algorithms, Communications of the ACM 29(12) (1986), 1170–1183.

[23] C. Hoare, Communicating sequential processes, Communications of the ACM 21(8) (1978), 666–677.

[24] K.E. Iverson, A Programming Language, John Wiley and Sons, Inc., 1962.

[25] K. Jensen and N. Wirth, Pascal – User Manual and Report, volume 18 of Lecture Notes in Computer Science, Springer, 1974.

[26] C.H. Koelbel, D.B. Loveman, R.S. Schreiber, G.L. Steele Jr. and M.E. Zosel, eds, The High Performance Fortran Handbook, Scientific and Engineering Computation Series, MIT Press, Cambridge, Massachusetts, 1994.

[27] M.E. Mace, Memory storage patterns in parallel processing, volume 30 of The Kluwer International Series in Engineering and Computer Science, Kluwer Academic Publishers, Boston, MA, 1987.

[28] MasPar Computer Corporation, MasPar Parallel Applications Language (MPL), (version 1.00), MasPar Computer Corporation, Sunnyvale, California, 1990.

[29] B. Meyer, Eiffel: The Language, Prentice-Hall, 1992.

[30] W. Moore, A. McCabe and R. Urquhart, eds, Systolic Arrays, Adam Hilger, Bristol and Boston, 1987.

[31] H. Munthe-Kaas, Routing k-tile permutations on parallel computers, in: Norsk Informatikk Konferanse – NIK'94, M. Haveraaen and B. Jæger et al., eds, Tapir, Norway, 1994, pp. 133–150.

[32] T. Ohta, Design of a data class for parallel scientific computing, in: Scientific Computing in Object-Oriented Parallel Environments, volume 1343 of Lecture Notes in Computer Science, Y. Ishikawa, R.R. Oldehoeft, J.V. Reynders and M. Tholburn, eds, Springer, 1997, pp. 211–217.

[33] S. Owicki and D. Gries, Verifying properties of parallel programs: an axiomatic approach, Communications of the ACM 19(5) (1976), 279–285.

[34] D.C. Parnas, On the criteria to be used in decomposing systems into modules, Communications of the ACM 15(12) (1972), 1053–1058.

[35] R.H. Perrott, A language for array and vector processors, ACM Transactions on Programming Languages and Systems 1(2) (1979), 177–195.

[36] A.D. Robison, C++ gets faster for scientific computing, Computers in Physics 10(5) (1996), 458–462.

[37] J.H. Saltz, G. Agrawal, C. Chang, R. Das, G. Edjlali, P. Havlak, Y.-S. Hwang, B. Moon, R. Ponnusamy, S.D. Sharma, A. Sussman and M. Uysal, Programming irregular applications: Runtime support, compilation and tools, Advances in Computers 45 (1997), 105–153.

[38] R.S. Schreiber, An introduction to HPF, in: The Data Parallel Programming Model, volume 1132 of Lecture Notes in Computer Science, G.-R. Perrin and A. Darte, eds, Springer, 1996, pp. 27–44.

[39] B. Schutz, Geometrical Methods of Mathematical Physics, Cambridge University Press, 1980.

[40] M. Snir, S.W. Otto, S. Huss-Lederman, D.W. Walker and J. Dongarra, MPI – The complete reference, MIT Press, Cambridge, Mass., USA, 1996.

[41] J.M. Squyres, B. Saphir and A. Lumsdaine, The design and evolution of the MPI-2 C++ interface, in: Scientific Computing in Object-Oriented Parallel Environments, volume 1343 of Lecture Notes in Computer Science, Y. Ishikawa, R.R. Oldehoeft, J.V. Reynders and M. Tholburn, eds, Springer, 1997, pp. 57–64.

[42] B. Stroustrup, The C++ Programming Language, (3rd ed.), Addison-Wesley, 1997.

[43] T.L. Veldhuizen and M.E. Jernigan, Will C++ be faster than Fortran, in: Scientific Computing in Object-Oriented Parallel Environments, volume 1343 of Lecture Notes in Computer Science, Y. Ishikawa, R.R. Oldehoeft, J.V. Reynders and M. Tholburn, eds, Springer, 1997, pp. 49–56.

[44] I. Velle, Abstrakte datatyper på parallelle maskiner, Master's thesis, Universitetet i Bergen, P.O. Box 7800, N-5020 Bergen, Norway, August 1993.
