

CONCURRENCY: PRACTICE AND EXPERIENCE
Concurrency: Pract. Exper., Vol. 11(11), 671–685 (1999)

Efficient implementation of a portable parallel programming model for image processing

P. J. MORROW 1*, D. CROOKES 2, J. BROWN 2, G. MCALEESE 1, D. ROANTREE 1 AND I. SPENCE 2

1 School of Information and Software Engineering, University of Ulster, Coleraine, BT52 1SA, Northern Ireland, UK (e-mail: [email protected])
2 School of Computer Science, The Queen's University of Belfast, Belfast, BT7 1NN, Northern Ireland, UK (e-mail: [email protected])

SUMMARY
This paper describes a domain specific programming model for execution on parallel and distributed architectures. The model has initially been targeted at the application area of image processing, though the techniques developed may be more generally applicable to other domains where an algebraic or library-based approach is common. Efficiency is achieved by the concept of a self-optimising class library of primitive image processing operations, which allows programs to be written in a high level, algebraic notation and which is automatically parallelised (using an application-specific data parallel approach). The class library is extended automatically with optimised operations, generated by a transformation system, giving improved execution performance. The parallel implementation of the model described here is based on MPI and has been tested on a C40 processor network, a quad-processor Unix workstation, and a network of PCs running Linux. Timings are included to indicate the impact of the automatic optimisation facility (rather than the effect of parallelisation). Copyright 1999 John Wiley & Sons, Ltd.

1. INTRODUCTION

The application area of image processing has benefited greatly in recent years from the use of parallel architectures for the implementation of many computationally intensive aspects and techniques within its domain[1–5]. However, as with other application areas, porting the developed systems to other parallel architectures has been difficult to achieve when efficiency of execution must be maintained. There is a need for a programming model which allows portability of software across a diverse range of parallel architectures whilst at the same time retaining efficiency of execution. A number of research projects concerned with general-purpose parallel programming models have made gradual progress, particularly in data parallel approaches. The work of Jesshope on F-Code[6] is well established, though it does not directly give the optimised efficiency desirable for image processing applications. Data parallel models based on Fortran (HPF and Fortran 90) have made steady progress on the autoparallelisation front; but progress has been hindered by many factors arising from the nature of Fortran. Valiant's BSP model[7], relying on parallel slackness to mask the latency of remote accesses, is another unifying model, and based on this model McColl demonstrates how portable, scalable software can be developed for any general purpose parallel architecture[8]. A broader approach is that adopted by PVM[9] or MPI[10], programming environments in which programs execute on a heterogeneous collection of machines. However, programming using PVM or MPI is undertaken at quite a low level.

* Correspondence to: P. J. Morrow, School of Information and Software Engineering, University of Ulster, Coleraine, BT52 1SA, Northern Ireland, UK. Contract/grant sponsor: UK EPSRC.

CCC 1040–3108/99/110671–15$17.50    Received 24 March 1999
Copyright 1999 John Wiley & Sons, Ltd.

Other programming models have been applied to image processing, including, for example, a number dealing with the integration of task and data parallelism[11–14] and the mapping of image processing tasks on to distributed memory machines[15–17]. Downton has also proposed a pipeline processor farm (PPF) as a general design method to support parallelisation and has applied this to image processing applications using MIMD processors[18,19].

A general theme that can be found in most of this work is the use of a library-based approach to providing parallelism within the programming model. Image processing libraries normally provide routines for a range of operations or instructions. However, efficiency of execution can then become an issue since each library call is 'indivisible' in nature and in many cases, for image processing applications, successive calls could be optimised very easily.

When dealing with programming models for parallel architectures there are three views which must be considered, namely the expressive power of the programming notation, the portability of the programming model and the efficiency of the model's implementation. The goal that many researchers are seeking is a general purpose programming model which is portable across a range of parallel architectures and which optimally uses the resources provided by each architecture. The traditional approach to providing a programming model for this situation is to start with a general-purpose notation (e.g. Fortran), which may be reasonably portable, and attempt to develop efficient implementations for it on the required range of architectures. However, this means that a number of different implementations are necessary, one for each architecture. We have taken a different approach to this traditional view. In our model we have started with a domain-specific programming notation which we know we can implement efficiently on parallel architectures and which is portable (with minimal effort) across these architectures. The search for an optimal solution to this problem is illustrated in Figure 1, where the goal is to have a completely portable, highly expressive (general-purpose) and efficient programming model (i.e. the closer we can get to the 'origin' in the diagram the better; the further out along each axis we go the more machine dependent, application specific and inefficient the model becomes).

Our programming model is based on previous work by the authors[20–22] and provides only programming abstractions which can be implemented efficiently; our ultimate aim is to make the abstractions more powerful and generally applicable. We are therefore approaching the origin along the 'expressive power' axis of Figure 1. Within this framework the image processing model described here lies some distance out on the expressive power axis as shown. In the long term we wish to extend the abstractions provided to make them more general purpose and thus move towards our final objective.

The following sections of this paper provide an outline of the EPIC programming model, examples of how the model can be used, a description of major design issues and a discussion of optimiser timings for the example programs. The timings have been taken across three parallel systems, namely a C40 processor network, a quad-processor Unix workstation and a network of PCs running Linux. The purpose of these timings is to give an indication of the impact of the automatic optimisation facility in a parallel environment, rather than to measure the effect of parallelism itself.

Figure 1. The 'optimal' programming model

2. ARCHITECTURAL OVERVIEW

The architectural model for our system is based on a 'master–slave' model, or more precisely a 'host–coprocessor' model. The programmer writes programs in a sequential notation (in C++) and makes use of the programming abstractions provided by a class library. Within the domain of image processing many of these abstractions are based on templates or small local neighbourhoods around a pixel. By using this type of abstraction we can restrict access to image pixel data so that data access patterns (and hence, data distribution) can be known in advance of any operation. (This also means that the optimiser is able to predict access patterns more easily and hence produce more efficient optimised instructions.) Any of the abstractions that can be executed in parallel will be executed on the parallel coprocessor; those that are sequential in nature will be executed on the host machine. The parallel coprocessor effectively interprets the abstractions when a library call is made. Traditionally, interpreters have been thought of as a very inefficient execution model. However, in this case the image processing abstractions themselves (or instructions) are computationally intensive (compared to instructions found in normal interpreters) and so the interpretative process in this case is not a significant overhead. The architectural model thus far can be called a parallel image coprocessor (PIC).

Although the use of an interpreter aids portability, one of its disadvantages is the indivisible nature of the high-level instructions. The use of such a static instruction set (like a library) results in inefficiencies which can sometimes make the whole approach less attractive to real application developers. These inefficiencies arise typically for two reasons: (a) special cases of instructions are often implemented more efficiently by manual programmers, and (b) the array processing overheads for compound instructions are usually cumulative. We have extended our PIC model for an abstract machine in which these inefficiencies can be avoided while retaining the elegance and portability of the abstract machine model. We have developed an extensible abstract machine, in which the instruction set has two parts: (a) the basic, static instruction set, and (b) a set of additional, optimised instructions which avoid the inefficiencies mentioned above. These additional instructions are program-specific, and implement special cases and compound instructions as efficiently as manual coding. As indicated below, these extended instructions are generated automatically. Thus, given two successive instructions which require the same loop-bounds (when processing image variables), in an efficient implementation we would ideally fuse the instructions into one set of outer loops to improve performance. Using the normal interpretative process this would not be possible within the PIC model. However, within our extensible parallel image coprocessor (EPIC) model new optimised instructions can be generated for this type of situation and on subsequent runs of a user program the optimised instructions will be used rather than the sequence of primitive ones. A user program will therefore become more efficient during successive runs (depending on branching within the program and on optimisation limitations).

Figure 2. Overall EPIC architecture

From the programmer's point of view, he or she is writing a sequential program and making calls to a class library – however, the library is much more complex than a normal library. The overall architecture (shown in Figure 2) consists of a sequential host, a parallel coprocessor and an off-line optimiser.

2.1. Sequential host

On the host system a user program is compiled and linked against the EPIC library. During program execution, syntax (or expression) graphs are constructed for each statement in the program. This is achieved by using C++ operator overloading in the EPIC class library. These graphs are only evaluated when an assignment operator is met, at which point one of two execution paths is followed:

(a) If this is the first time the program has been executed then the syntax graph for a statement is traversed using a tree walking routine, and an instruction stream of primitive operations is generated. This instruction stream will be interpreted as normal on the coprocessor. As a 'by-product' of this run, the syntax graphs for combinations of the primitive instructions are written out for later consideration by the optimiser.

(b) On subsequent runs of the program when a sequence of instructions occurs a check is made to decide if an optimised instruction is available for this sequence (in a table of previously optimised instructions) and if so, this optimised instruction will be executed rather than the sequence of primitive ones.
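The deferred evaluation underlying both paths (building a graph through operator overloading and only traversing it when the assignment operator is reached) can be sketched in a few dozen lines of C++. All class and member names below are our own illustration, not the actual EPIC internals:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// A toy "image" is a flat vector; a node in the expression graph
// records an operation and its operands instead of computing eagerly.
struct Node {
    enum Op { LEAF, ADD, SUB } op = LEAF;
    const std::vector<double>* leaf = nullptr;   // valid when op == LEAF
    std::shared_ptr<Node> lhs, rhs;              // valid otherwise
};

struct Expr {                                    // lightweight handle to a graph
    std::shared_ptr<Node> node;
};

struct Image {
    std::vector<double> data;
    explicit Image(std::size_t n) : data(n, 0.0) {}
    Image(std::initializer_list<double> v) : data(v) {}

    // Using an Image in an expression creates a LEAF node, not a copy.
    operator Expr() const {
        auto n = std::make_shared<Node>();
        n->op = Node::LEAF;
        n->leaf = &data;
        return Expr{n};
    }

    // Assignment is the trigger: walk the graph once per element.
    Image& operator=(const Expr& e) {
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] = eval(*e.node, i);
        return *this;
    }

    static double eval(const Node& n, std::size_t i) {
        switch (n.op) {
            case Node::LEAF: return (*n.leaf)[i];
            case Node::ADD:  return eval(*n.lhs, i) + eval(*n.rhs, i);
            default:         return eval(*n.lhs, i) - eval(*n.rhs, i);
        }
    }
};

Expr makeBinary(Node::Op op, Expr a, Expr b) {
    auto n = std::make_shared<Node>();
    n->op = op; n->lhs = a.node; n->rhs = b.node;
    return Expr{n};
}
Expr operator+(Expr a, Expr b) { return makeBinary(Node::ADD, a, b); }
Expr operator-(Expr a, Expr b) { return makeBinary(Node::SUB, a, b); }
```

With this machinery `dst = a + b - a;` builds a three-node graph at the `+` and `-` calls and evaluates it only in `operator=`, exactly the hook at which a real system can instead emit an instruction stream or look up an optimised instruction.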

2.2. Parallel coprocessor

The parallel coprocessor consists of a number of 'worker' processes and one 'controller' process, and together these processes comprise the interpreter for the instruction stream. From a user's point of view the coprocessor is hidden and parallelism is provided for 'behind the scenes'. (In reality, the user need only specify the number of workers available in the actual architecture being used as a parameter to the execution of his/her program.)

2.3. Off-line optimiser

As an off-line process the optimiser is run on the syntax graphs to determine if an optimised instruction can be generated. If so, then a set of 'C' functions is generated (as an Extended Instruction Library – EIL) for later inclusion in the coprocessor (interpreter). The EIL is compiled and integrated with the interpreter running on the coprocessor.

3. EPIC PROGRAMMING EXAMPLES

In order to illustrate some of the abstractions provided in the EPIC class library, consider the example programs given below. The first program implements a simple neighbourhood operation; the second is a mixture of neighbourhood and point operations; the third illustrates a much more complex neighbourhood operation which would normally be rather inefficient in a library-based approach; and the fourth is an example involving a global operation.

3.1. Example 1: Image noise reduction

In this example a simple mean filter is convolved with an image as a means of reducing any noise pixels found in it (although a significant amount of 'blurring' occurs as a side-effect of this operation). A 7 × 7 template is used to define the neighbourhood around which the averaging operation is performed.

#include <iostream.h>
#include <epic.hh>                        // include the EPIC class library

int main(int argc, char **argv)
{
    EPIC_Image src (argv[1]);             // declare the 'source' image
    EPIC_Image dst (src.rows, src.cols);  // declare the 'destination' image
                                          // to be the same size as the source
    EPIC_Template avg (7, 7);             // declare a neighbourhood template
    EPIC_Scalar numWeights = 49;          // define the number of weights in the template

    ... initialise template weights

    dst = conv (src, avg) / numWeights;   // apply the actual mean filter
}

3.1.1. Variable declaration

The EPIC system provides three classes which can be used in a program, namely EPIC_Image, EPIC_Template and EPIC_Scalar. A number of image constructors are provided in the EPIC_Image class, two of which are illustrated above. If a filename is given as part of the declaration then the image is not only declared and space allocated on the coprocessor but the actual image data is also read from the file and automatically distributed across the worker processes in the coprocessor. Images can also be declared to be of a specific size – in this case space is simply allocated on the coprocessor for the image. The second variable class provided is that of an EPIC_Template. Variables of this type are declared as having a particular size – in contrast to images, storage for each template is allocated on all worker processes in the coprocessor. Finally, variables may be declared of type EPIC_Scalar – this class is provided to permit image-to-scalar operations.

3.1.2. Applying the filter

The averaging operation is applied to the source image by making a call to a convolution method conv – the result of the operation is assigned to a new image dst. Note that although the user seems to be writing a sequential program, the operation is automatically performed in parallel on the coprocessor, i.e. the images are allocated on the coprocessor and once the initial image is partitioned and distributed no further communication of image data is required between the host and the coprocessor unless images are to be saved to disk or displayed.
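The per-pixel semantics of conv (a weighted sum over the template neighbourhood) can be written down as a plain sequential reference. This is a sketch of the operation's meaning, not the EPIC implementation; the paper does not state EPIC's border policy, so this version simply clamps coordinates to the image edge:

```cpp
#include <cassert>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// Reference convolution: for each pixel, sum src values over the template
// neighbourhood weighted by the template entries. Out-of-range coordinates
// are clamped to the nearest valid row/column (an assumed border policy).
Grid conv(const Grid& src, const Grid& t) {
    int rows = (int)src.size(), cols = (int)src[0].size();
    int tr = (int)t.size(), tc = (int)t[0].size();
    Grid dst(rows, std::vector<double>(cols, 0.0));
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            double sum = 0.0;
            for (int i = 0; i < tr; ++i)
                for (int j = 0; j < tc; ++j) {
                    int rr = r + i - tr / 2, cc = c + j - tc / 2;
                    rr = rr < 0 ? 0 : (rr >= rows ? rows - 1 : rr);  // clamp row
                    cc = cc < 0 ? 0 : (cc >= cols ? cols - 1 : cc);  // clamp column
                    sum += src[rr][cc] * t[i][j];
                }
            dst[r][c] = sum;
        }
    return dst;
}
```

With an all-ones template, dividing the result by the number of weights (as the program does with numWeights) yields the neighbourhood mean.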

3.2. Example 2: Frame differencing

Given two successive frames from an image sequence we may wish to remove noise from these frames and subsequently calculate the difference between them. The following program is a simple example of this process – it reads two images (from files in this case), performs a noise reduction operation and then subtracts one from the other placing the result in a destination image:

int main(int argc, char **argv)
{
    EPIC_Image src1 (argv[1]);
    EPIC_Image src2 (argv[2]);
    EPIC_Image dst (src1.rows, src1.cols);
    EPIC_Template avg (7, 7);
    EPIC_Scalar numWeights = 49;

    ... initialise the template weights - all to 1.0

    dst = abs (conv (src1, avg) - conv (src2, avg)) / numWeights;
}

Using the above program as an example, the execution process involved in running an EPIC program is as follows. The first time the program is executed the two images are read from their respective files and processed to produce the destination image. The assignment to the 'dst' variable will require calls to a number of primitive EPIC instructions: two convolution operations for the noise reduction; an image-to-image subtraction; an image 'abs' function call; and finally an image-to-scalar operation to divide by the total number of weights. Since the primitive EPIC instructions are 'indivisible', a number of sets of loops will have to be traversed (across the image dimensions) in order to achieve the desired result. (Note that these individual instructions will still be performed in parallel – the image data are automatically distributed across the available processors in the coprocessor). However, if we were to 'hand-code' a (parallel) version of the operations only one set of outer loops would be necessary since there are no image data dependencies in the expression on the right-hand side of the assignment.

During the execution of the program a syntax graph for the 'dst' assignment will also be saved. The optimiser can then examine this graph (as an off-line process) to determine if an optimised instruction can be generated for the assignment statement. In this case an instruction will be generated which utilises a single set of loops as in the hand-coded version. During subsequent runs of the program we will therefore obtain execution speeds comparable to a good hand-coded version (because the EPIC code generator copies what a good hand coder would produce) but without the 'pain' of having to write a parallel program. In addition, the program will execute on different sizes of the coprocessor network without any change to the program code.
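The loop fusion the optimiser performs can be illustrated on the pixel arithmetic of this example (convolutions omitted for brevity, so the inputs stand for the already-convolved images; the function names are ours, not EPIC's):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Image = std::vector<double>;

// Unfused: each primitive instruction runs its own loop over the whole
// image and materialises an intermediate result, as a static library would.
Image unfused(const Image& a, const Image& b, double n) {
    Image t1(a.size()), t2(a.size()), dst(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) t1[i] = a[i] - b[i];      // subtract
    for (std::size_t i = 0; i < a.size(); ++i) t2[i] = std::fabs(t1[i]); // abs
    for (std::size_t i = 0; i < a.size(); ++i) dst[i] = t2[i] / n;       // divide
    return dst;
}

// Fused: one set of loops and no intermediate images -- the shape of the
// optimised instruction the transformation system would generate for the
// expression graph of this assignment.
Image fused(const Image& a, const Image& b, double n) {
    Image dst(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        dst[i] = std::fabs(a[i] - b[i]) / n;
    return dst;
}
```

Both versions produce identical pixel values; the fused form saves two full passes over the image and two temporary images, which is where the speed-up on subsequent runs comes from.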

3.3. Example 3: Statistical differencing

This example performs a form of image contrast enhancement and illustrates how quite complex expressions can be constructed using the EPIC class library. Image contrast can be improved by performing a 'statistical differencing' operation on it; for example, in [23] such a method is described which enhances an image by stretching contrast in regions of low contrast and compressing contrast in regions where the contrast was originally high. The operation is tuneable by adjusting the values of a gain factor and a proportioning factor. In addition, a desired mean and a desired standard deviation can also be specified. The general formula to achieve this is as follows:

dst(i, j) = [src(i, j) - mean(i, j)] * [desiredSd * gain / (stdDev(i, j) * gain + desiredSd)]
            + [desiredMean * prop + (1 - prop) * mean(i, j)]

where:


dst(i,j)     is the result pixel value at point (i,j),
src(i,j)     is the input pixel value at point (i,j),
mean(i,j)    is the 7 × 7 neighbourhood mean around pixel (i,j),
stdDev(i,j)  is the 7 × 7 neighbourhood standard deviation around pixel (i,j),
desiredSd    is the desired standard deviation,
desiredMean  is the desired mean,
gain         is a gain factor, and
prop         is a proportioning factor.

With a little re-arranging, this quite complex formula can be implemented in an EPIC program in a concise form as shown below:

int main(int argc, char **argv)
{
    EPIC_Image src(argv[1]);
    EPIC_Image mean(src.rows, src.cols);
    EPIC_Image stdDev(src.rows, src.cols);
    EPIC_Image dst(src.rows, src.cols);
    EPIC_Template t1(7,7);

    EPIC_Scalar NhoodSize = 49;
    EPIC_Scalar Gain = 6;
    EPIC_Scalar Prop = 0.7;
    EPIC_Scalar DesiredSd = 85;
    EPIC_Scalar DesiredMean = 128;
    EPIC_Scalar const1 = 510;   // Gain * DesiredSd
    EPIC_Scalar const2 = 89.6;  // Prop * DesiredMean
    EPIC_Scalar const3 = 0.3;   // 1 - Prop

    ... initialise the template weights - all to 1.0

    mean = conv(src,t1) / NhoodSize;
    stdDev = sqrt((conv((src*src), t1) - ((mean*mean)*NhoodSize)) / NhoodSize);

    dst = (((src - mean)/(stdDev*Gain + DesiredSd)) * const1) +
          (mean*const3) + const2;
}

As with the previous example, many of the statements in the program will generate a number of instructions to be interpreted by the coprocessor. For example, the statement which calculates the standard deviation for each neighbourhood (stdDev) will generate: an image-to-image multiplication operation for the sub-expression src*src; an image-to-template instruction for the convolution of the previous result with the template t1; image-to-image and image-to-scalar instructions for the sub-expression (mean*mean)*NhoodSize; and so on. Once again the optimiser will combine many of these operations into optimised instructions and on subsequent runs of the program the optimised versions will be used.

3.4. Example 4: Automatic threshold selection

Our final example illustrates the fact that we can mix sequential and parallel processing within the EPIC system. The program shown below performs an automatic threshold operation on an edge detected image. The threshold is selected on the basis of retaining 10% of the edge pixels.


int main(int argc, char **argv)
{
    ... variable declarations and template initialisation

    tmp = absconv(src, SobelX) + absconv(src, SobelY);  // edge detection

    // scale the result to the range 0..255
    oldmin = (float) min(tmp);
    oldmax = (float) max(tmp);
    scalefactor = 255.0 / (oldmax - oldmin);
    src = (tmp - (EPIC_Scalar)oldmin) * (EPIC_Scalar)scalefactor;

    // calculate the histogram and find the appropriate threshold value
    getHistogram (&src, histogram);
    threshold = 255;
    runningTotal = histogram [threshold--];
    while (runningTotal < pixelsKept)
        runningTotal += histogram [threshold--];

    // apply the threshold value to the edge detected image
    dst = (src > (EPIC_Scalar)threshold) * maxGL;
}

void getHistogram (EPIC_Image* src, int* histogram)
{
    int r, c;

    src->getImage ();  // obtain the image pixel data from the coprocessor
    for (c = 0; c < 256; c++) histogram [c] = 0;

    for (r = 0; r < 256; r++)
        for (c = 0; c < 256; c++)
            histogram [(int)src->bitmap [r][c]]++;
}

Initially an edge detection operation is applied to the source image (a Sobel edge detector in this case). The resulting (tmp) image is then scaled to the range 0–255 to facilitate calculation of the image histogram. All of these program statements generate instruction calls to the coprocessor and are executed in parallel across all worker processes. (The calculation of the image histogram is performed as a sequential process in this program for illustrative purposes. However, since it is a common image processing operation a built-in instruction – which executes in parallel – is also provided in EPIC.) In order for the 'host' program to get access to the pixel data for an image this data must be obtained from the coprocessor (since no image data normally resides on the sequential host). This is achieved by simply making a call to the getImage method – the image pixel data can then be accessed as shown in the getHistogram function above. The results obtained from the optimiser for this example are not as good as for the previous examples. There are a number of reasons for this: firstly, many of the program statements are 'single' instructions (e.g. oldmin = (float) min(tmp);) and the optimiser cannot optimise this any further; secondly, there is an overhead in obtaining the image pixel data from the coprocessor; and thirdly, the histogram itself is calculated in a sequential manner. Care must therefore be taken when designing programs to ensure that image data transfer is kept to a minimum.
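The threshold search in the listing can be isolated as a small function (the function name is ours) to make its semantics explicit: it walks the histogram downwards from the brightest bin until the accumulated count reaches the target, and the returned value is such that pixels strictly greater than it are retained. A bounds guard is added here as a defensive measure not present in the original loop:

```cpp
#include <cassert>
#include <vector>

// Accumulates histogram counts from bin 255 downwards until at least
// `pixelsKept` pixels lie above the returned threshold -- the same scheme
// as the while loop in the example program.
int selectThreshold(const std::vector<int>& histogram, int pixelsKept) {
    int threshold = 255;
    int runningTotal = histogram[threshold--];
    while (runningTotal < pixelsKept && threshold >= 0)  // guard added defensively
        runningTotal += histogram[threshold--];
    return threshold;  // pixels with value > threshold are kept
}
```

Because of the post-decrement, the returned threshold sits one bin below the last bin counted, which matches the program's subsequent test `src > threshold`.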

The above examples have illustrated how EPIC programs can be written in standard C++, making use of an 'intelligent' library which can incorporate optimisations of the EPIC instruction set. User programs effectively evolve where execution efficiency is concerned. The following section describes some details of the design of the class library and the coprocessor interpreter. (Timings for the above programs are discussed in Section 5.)

4. DESIGN DETAILS

During a program's execution three 'processes' are actually running. Firstly, the user program itself is running on the host machine. Secondly, a 'controller' process is running on the first node of the parallel network, and thirdly, a 'worker' process is running on each available processor in the network. The controller and worker processes together comprise the EPIC interpreter.

4.1. Controller process

The controller process acts as a communications channel between the host machine and the network of workers and controls the interpretation process, the distribution and gathering of image data, and the synchronisation of the workers for certain instructions. The overall code structure for the controller is as follows:

while (Host.get (Instruction) != halt) {
    sendInstructionToWorkers;
    // Controller now 'serves' the workers
    while (Workers.get (Command) != completed) {
        switch (Command) {
            case ...
        }
    }
}
send 'halt' instruction to workers;

It receives instructions from the host, relays these to the worker processes and then goes into a 'server' mode where it serves the workers until the instruction is completed – in some cases the workers will require (image) data from the host (e.g. reading an image from file). Further instructions are then received from the host and the process is repeated.

4.2. Worker processes

The worker processes simply receive and interpret instructions from the controller and once the instruction is completed an appropriate message is then sent to the controller:

while (Controller.get (Instruction) != halt) {
    switch (Instruction) {
        case OP_DECLARE: ...            // Execute the instructions
        ...                             // on each worker
        case OP_CALL_STANDARD: ...
        case OP_CALL_OPTIMISED: ...
        ...
    }
    send 'completed' command to Controller;
}
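The controller and worker skeletons above are pseudocode; a self-contained miniature in plain C++ (with in-process queues standing in for the message-passing layer, and a halt opcode named by us) behaves as described, relaying each host instruction to every worker and collecting acknowledgements before moving on:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Opcodes from the worker skeleton above, plus OP_HALT to end the session.
enum Op { OP_DECLARE, OP_CALL_STANDARD, OP_CALL_OPTIMISED, OP_HALT };

// A worker interprets instructions until halt, acknowledging each one.
struct Worker {
    int declared = 0, standard = 0, optimised = 0;
    bool execute(Op op) {                 // returns the 'completed' ack
        switch (op) {
            case OP_DECLARE:        ++declared;  break;
            case OP_CALL_STANDARD:  ++standard;  break;
            case OP_CALL_OPTIMISED: ++optimised; break;
            case OP_HALT:           return false;
        }
        return true;
    }
};

// The controller relays each host instruction to every worker and waits
// for all acknowledgements before fetching the next instruction.
void runController(std::queue<Op>& host, std::vector<Worker>& workers) {
    while (!host.empty()) {
        Op op = host.front(); host.pop();
        for (auto& w : workers)
            if (!w.execute(op)) return;   // halt forwarded; session ends
    }
}
```

In the real system the host-to-controller and controller-to-worker hops are separate processes communicating over MPI; the control flow, however, is the same lock-step relay-and-acknowledge cycle.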


Figure 3. EPIC communications model

Prior to the execution of any instruction, the worker process will ensure that valid (image) data is available in each image segment which is held locally to each worker. This includes 'border' information required for neighbourhood operations where an overlapping border is held as part of each image segment.

4.3. Interpretation of instructions

In order to illustrate the interpretation of EPIC instructions and the automatic parallelisation involved we will consider a simple image-to-image addition operator. For example, given appropriate image declarations consider the following assignment statement:

dst = src1 + src2;

At declaration of the image variables, storage will have been allocated in the worker processes for each of the variables (via EPIC instructions sent to the workers). For example, given four workers, an image would be divided into four horizontal strips and each worker will have allocated storage for a quarter of the image (an image segment). When image data are read into an image variable, the EPIC library, controller and worker processes communicate to ensure that the appropriate image data are read and distributed across all the workers. Once valid image data are held in each image segment (on each worker process), the image-to-image addition operation can be performed. In this case each worker will execute the addition on its own segment – the result image will not be reconstructed and sent to the host unless it is subsequently displayed or saved to disk.
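The strip partitioning and the purely local addition can be sketched in plain C++ as follows. The names (Segment, toStrips, addSegments) are illustrative rather than the EPIC API, and the sketch runs within a single process; in the real system each worker holds only its own strip, with MPI performing the distribution:

```cpp
#include <vector>

using Segment = std::vector<float>;   // one worker's strip, row-major

// Divide a rows-by-cols image into one horizontal strip per worker
// (assumes rows divides evenly, as in the four-worker example).
std::vector<Segment> toStrips(const std::vector<float>& img,
                              int rows, int cols, int nworkers) {
    int stripRows = rows / nworkers;
    std::vector<Segment> strips(nworkers);
    for (int w = 0; w < nworkers; ++w)
        strips[w].assign(img.begin() + w * stripRows * cols,
                         img.begin() + (w + 1) * stripRows * cols);
    return strips;
}

// The work each worker does for dst = src1 + src2: an element-wise
// addition over its own segment only -- no communication is required.
Segment addSegments(const Segment& a, const Segment& b) {
    Segment out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] + b[i];
    return out;
}
```

Because addition is a point operation, no border exchange or result gathering is needed; each worker's output strip is simply the sum of its two input strips.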

The communications routines used in the manipulation of image (and other) data are based on calls to MPI functions. This library of routines provides facilities, amongst others, to divide image data among the available processors (DistributeImage), gather image data back (GatherImage) and update segment border information for neighbourhood operations (UpdateBorders). A layered communications model is used in EPIC, as illustrated in Figure 3. By using MPI we are able to achieve portability of the interpreter to any machine for which an implementation of MPI is available. If no such implementation exists, porting our interpreter should still be a relatively straightforward task, since only a small subset of MPI is used for the communications layer. (This is in fact the case for our C40 implementation, where a small MPI kernel was implemented since MPI was not available on the C40s.)

4.4. Generating and calling optimised (parallel) instructions

The process of generating optimised instructions has been described briefly earlier. When a sequence of EPIC instructions (library calls) occurs, a syntax tree of the expression


comprising the instructions is constructed prior to execution of the individual instructions. That is, a delayed evaluation technique is used, whereby we delay the execution of any instruction involving images until the assignment operator is met (this is achieved through operator overloading in C++). Once assignment is met, we check whether an optimised instruction for the current expression exists and, if so, call it; otherwise the individual instructions comprising the expression are executed and the syntax tree is written out for future optimisation. Thus the execution of an expression can be described as follows:

while ( !assignment operator ) {
    add operator to syntax tree;
}
if (optimised instruction exists for ‘syntax tree’)
    call it;
else {
    execute instructions in ‘syntax tree’;
    write out ‘syntax tree’ for future optimisation;
}
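A minimal, self-contained C++ sketch of this delayed evaluation technique is given below. The classes are illustrative rather than the EPIC implementation: operator+ builds a syntax tree instead of computing anything, and the overloaded assignment operator triggers evaluation of the whole tree. For brevity the sketch omits the lookup of optimised instructions and always evaluates the tree directly; it also assumes the operand images outlive the statement in which they are used:

```cpp
#include <memory>
#include <vector>

// A node of the deferred expression tree: either an interior '+'
// node or a leaf pointing at an image's pixel data.
struct Node {
    char op;                       // '+' for addition, 0 for a leaf
    const std::vector<int>* leaf;  // pixel data (leaf nodes only)
    std::shared_ptr<Node> l, r;
};

struct Image {
    std::vector<int> pixels;
    std::shared_ptr<Node> expr;    // pending (unevaluated) expression

    explicit Image(std::vector<int> p = {}) : pixels(std::move(p)) {}

    // Build the syntax tree; nothing is computed yet.
    friend Image operator+(const Image& a, const Image& b) {
        auto la = a.expr ? a.expr
                         : std::make_shared<Node>(Node{0, &a.pixels, {}, {}});
        auto lb = b.expr ? b.expr
                         : std::make_shared<Node>(Node{0, &b.pixels, {}, {}});
        Image r;
        r.expr = std::make_shared<Node>(Node{'+', nullptr, la, lb});
        return r;
    }

    // Assignment forces evaluation. The real system would first check
    // for an optimised instruction matching this tree shape.
    Image& operator=(const Image& rhs) {
        pixels = rhs.expr ? eval(rhs.expr.get()) : rhs.pixels;
        expr.reset();
        return *this;
    }

    static std::vector<int> eval(const Node* n) {
        if (n->op == 0) return *n->leaf;
        std::vector<int> a = eval(n->l.get());
        std::vector<int> b = eval(n->r.get());
        for (std::size_t i = 0; i < a.size(); ++i) a[i] += b[i];
        return a;
    }
};
```

With these definitions, a statement such as dst = a + b + c builds a two-level tree during expression evaluation and performs all the pixel arithmetic only when the assignment is reached.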

The off-line optimiser takes the syntactic representations generated by execution of the user’s program on the sequential host and generates optimised parallel code to implement the complete operation defined in each statement. The optimised routine, after compilation and linking, becomes a new instruction on the parallel coprocessor (and remains available until it is removed by re-building the coprocessor without any optimised instructions).

The optimiser is a rule-driven system written in Prolog. It uses optimisation techniques which are not in themselves new: for the most part these are loop combining, loop unrolling and loop interchanging, along with the elimination of redundant arithmetic. Several heuristics are employed to identify the most appropriate path to follow, based on the contents of the syntax graph. This includes trial-and-error approaches when an obvious course is not evident from the nature of the syntax graph. A pattern-matching approach is used to find the best applicable transformation.
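Loop combining, the most frequently applied of these transformations, can be illustrated by hand for a hypothetical expression such as dst = (src1 + src2) * k. The C++ sketch below is our own illustration, not the optimiser's actual output: the naive interpretation runs one loop per primitive and materialises an intermediate temporary image, whereas the combined (optimised) instruction performs a single pass with no temporary:

```cpp
#include <vector>

// Naive interpretation: one loop per primitive operation, with a
// full-size temporary image between them.
std::vector<int> naive(const std::vector<int>& s1,
                       const std::vector<int>& s2, int k) {
    std::vector<int> tmp(s1.size()), dst(s1.size());
    for (std::size_t i = 0; i < s1.size(); ++i) tmp[i] = s1[i] + s2[i];
    for (std::size_t i = 0; i < s1.size(); ++i) dst[i] = tmp[i] * k;
    return dst;
}

// After loop combining: a single pass, no temporary image. Each pixel
// is loaded once and stored once, which is where the improvement on
// compute-bound operations comes from.
std::vector<int> fused(const std::vector<int>& s1,
                       const std::vector<int>& s2, int k) {
    std::vector<int> dst(s1.size());
    for (std::size_t i = 0; i < s1.size(); ++i)
        dst[i] = (s1[i] + s2[i]) * k;
    return dst;
}
```

Both versions compute the same result; the optimised instruction simply halves the number of memory traversals and eliminates the temporary image's allocation.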

The output is initially in the form of an object code syntax graph. A code generation module is then used to generate C code with calls to library routines for such purposes as border swapping. The generated optimised instructions are independent of the number of available processors. Direct calls to MPI, which provides the communication framework, are also embedded in the code.

The capabilities built into the optimiser enable it to handle any composition of the operations provided within the class library, including operations nested to any depth. Also provided is the ability to handle the syntactic representations of variant template operations (templates where the weights vary with image position), including sequences of variant templates.

5. RESULTS

In the context of testing the performance of the EPIC system on actual parallel systems, there are several dimensions which could be investigated independently, including:

• optimisation: performance with and without optimisation
• portability: performance on different parallel architectures
• scalability: performance on different sizes of the same architecture.


Table 1. Execution times (in seconds) before and after optimisation

                              Noise      Frame         Statistical   Threshold
                              filter     differencing  differencing  selection

C40 – 2 workers
  Before optimisation         1.05       4.30          3.44          1.86
  After optimisation          0.22       0.61          0.77          1.50
  Improvement                 4.87 fold  7.02 fold     4.46 fold     1.24 fold

C40 – 4 workers
  Before optimisation         0.56       2.17          2.10          1.61
  After optimisation          0.13       0.38          0.55          1.46
  Improvement                 4.23 fold  5.70 fold     3.86 fold     1.10 fold

Quad-processor workstation
  Before optimisation         1.56       5.18          9.21          4.73
  After optimisation          0.87       2.16          5.55          4.06
  Improvement                 1.79 fold  2.40 fold     1.66 fold     1.17 fold

PC-Linux system (4 workers)
  Before optimisation         1.44       4.56          5.83          2.93
  After optimisation          0.57       1.28          2.69          2.42
  Improvement                 2.51 fold  3.56 fold     2.17 fold     1.21 fold

Many studies have been carried out in the past investigating the issues of scalability and types of architecture for image processing. Therefore, our main objective in this section is to consider the impact which the optimisation process has on execution times. This is done by running a selection of typical image processing operations, with and without optimisation, on a number of different parallel architectures.

Timings are given in Table 1 for running the EPIC system, with and without optimisation, on all three of the architectures to which the EPIC system has been ported. The results are tabulated for the four example programs described in Section 3. Times are in seconds and the image size in each case is 256 × 256. The interconnection network consists of direct links for the C40s, shared memory for the quad-processor workstation and Ethernet for the PC/Linux system.

The figures of particular interest here are the improvement factors. From these, a numberof observations can be made:

(i) The impact of optimisation varies from virtually none (1.10) to very significant (7.02). The factor obtained depends both on the operation and on the architecture.

(ii) Greatest improvements are obtained on compute-bound operations, as opposed to I/O-bound ones (e.g. compare frame differencing with threshold selection). This is only to be expected, since the optimiser most often improves only the computation part of operations.

(iii) Image distribution and broadcast-type communication operations (such as histogram distribution) are essentially sequential in our implementation. Algorithms using such operations are likely to be completely I/O bound (threshold selection is a case in point).

(iv) Closely coupled parallel processors (as on the C40 networks) give bigger improvements than loosely coupled networks of multi-processed workstations. This is because they are less prone to being I/O bound.


It is clear that a key requirement for achieving high optimisation impact is to avoid becoming I/O bound. In this regard, it should be noted that the MPI (subset) implementation on the C40 system is architecture-aware, and has lower latency for the communication patterns used in image processing.

6. CONCLUSIONS

In this paper we have described a novel approach to providing a programming model for a domain specific application area on parallel architectures. The key to our approach is to hide parallelism from the programmer and allow him/her essentially to program in a standard sequential notation (C++). The abstractions in our EPIC model have been implemented as a class library which has significant ‘intelligence’ embedded in it. A delayed evaluation technique is employed to allow the EPIC instruction set to be extended with optimised instructions. Because the rule-based optimiser applies the same optimisations which the authors and others have applied manually in developing efficient hand-coded image processing software, the efficiency of the generated EPIC instructions can rival that of good hand-coded versions. At the highest level, possibly the most significant achievement of the project is to enable programmers to continue to operate using a set of high-level primitives – and hence retain application portability – while reclaiming the normally accepted loss of efficiency through sophisticated optimising software tools.

ACKNOWLEDGEMENTS

This work was supported by the UK EPSRC under the Portable Software Tools for Parallel Architectures programme. The support received from Transtech Parallel Systems is also gratefully acknowledged.

REFERENCES

1. A. N. Choudhary and J. H. Patel, Parallel Architectures and Parallel Algorithms for Integrated Vision Systems, Kluwer Academic Publishers, 1990.
2. A. N. Choudhary and S. Ranka, ‘Parallel processing for computer vision and image understanding’, Computer, 25(2), 7–10 (1992).
3. G. Nudd, N. Francis, T. Atherton, D. Kerbyson, R. Packwood and J. Vaudin, ‘Hierarchical multiple-SIMD architecture for image analysis’, Mach. Vis. Appl., 5, 85–103 (1992).
4. J. Webb, ‘Steps towards architecture-independent image processing’, Computer, 25(2), 21–31 (1992).
5. H. C. Webber (ed.), Image Processing and Transputers, IOS Press, 1992.
6. C. Jesshope, ‘The F-code abstract machine and its implementation’, Proc. COMPEURO 1992, IEEE Press, 1992.
7. L. G. Valiant, ‘A bridging model for parallel computation’, Commun. ACM, 33(8), 103–111 (1990).
8. W. F. McColl, ‘Scalability, portability and predictability: the BSP approach to parallel programming’, Future Gener. Comput. Syst., 12(4), 265–272 (1996).
9. V. S. Sunderam, ‘PVM: a framework for parallel distributed computing’, Concurrency Pract. Exper., 2(4), 315–339 (1990).
10. W. Gropp et al., Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994.
11. J. Subhlok and B. L. Yang, ‘A new model for integrated nested task and data parallel programming’, ACM SIGPLAN Notices, 32(7), 1–12 (1997).
12. I. Foster, D. R. Kohr, T. Krishnaiyer and A. Choudhary, ‘A library-based approach to task parallelism in a data-parallel language’, J. Parallel Distrib. Comput., 45(2), 148–158 (1997).
13. J. M. Kim, Y. Kim, S. D. Kim, T. D. Han and S. B. Yang, ‘An adaptive parallel computer vision system’, Int. J. Pattern Recognit. Artif. Intell., 12(3), 311–334 (1998).
14. A. Petrosino and E. Tarantino, ‘Parallel image understanding algorithms on MIMD multicomputers’, Computing, 60(2), 91–107 (1998).
15. C. Lee, Y. F. Wang and T. Yang, ‘Global optimization for mapping parallel image processing tasks on distributed memory machines’, J. Parallel Distrib. Comput., 45(1), 29–45 (1997).
16. C. K. Lee and M. Hamdi, ‘Parallel image processing applications on a network of workstations’, Parallel Comput., 21(1), 137–160 (1995).
17. V. Marion-Poty and S. Miguet, ‘Data allocation strategies for parallel image processing algorithms’, Int. J. Pattern Recognit. Artif. Intell., 9(4), 615–634 (1995).
18. A. C. Downton, R. W. S. Tregidgo and A. Cuhadar, ‘Top-down structured parallelisation of embedded image processing applications’, IEE Proc. – Vis. Image Signal Process., 141(6), 431–437 (1994).
19. A. Cuhadar, A. C. Downton and M. Fleury, ‘Structured parallel design for embedded vision systems: a case study’, Microprocess. Microsyst., 21(2), 131–141 (1997).
20. D. Crookes, P. J. Morrow and P. J. McParland, ‘IAL: a parallel image processing programming language’, IEE Proceedings I, 137(3), 176–182 (1990).
21. T. J. Brown and D. Crookes, ‘A high level language for image processing’, Image Vis. Comput., 12(2), 67–79 (1994).
22. P. J. Morrow and D. Crookes, ‘Using Prolog to implement a compiler for a parallel image processing language’, Proc. ICIP’95, IEEE Computer Society Press, Vol. I, 1995, pp. 89–91.
23. W. Pratt, Digital Image Processing, 2nd ed., Wiley, 1991.