

Parallel Computing 34 (2008) 735–746


Distributed bucket processing: A paradigm embedded in a framework for the parallel processing of pixel sets

P.P. Jonker *, J.G.E. Olk, C. Nicolescu
Bio-Robotics Lab, Faculty of Mechanical, Maritime and Materials Engineering, Delft University of Technology, Netherlands

Article info

Article history: Available online 19 September 2008

Keywords: Data parallel processing; Task parallel processing; Real-time image processing

0167-8191/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.parco.2008.09.006

* Corresponding author. E-mail address: [email protected] (P.P. Jonker).

Abstract

Large datasets, such as pixels and voxels in 2D and 3D images, can usually be reduced during their processing to smaller subsets with fewer datapoints. Such subsets can be the objects in the image, features – edges or corners – or, more generally, regions of interest. For instance, the transformation from a set of datapoints representing an image to one or more subsets of datapoints representing objects in the image is performed by a segmentation algorithm and may involve both the selection of datapoints and a change in data structure. The massive number of pixels in the original image points to a data parallel approach, whereas the processing of the various objects in the image is more suitable for task parallelism. In this paper we introduce a framework for parallel image processing and we focus on an array of buckets that can be distributed over a number of processors and that contains pointers to the data from the dataset. The benefit of this approach is that the processor activity remains focused on the datapoints that need processing and, moreover, that the load can be distributed over many processors, even in a heterogeneous computer architecture. Although the method is generally applicable to the processing of sets, in this paper we take our examples from the domain of image processing. As this method yields speedups that are data dependent, we derived a run-time evaluation that is able to determine whether the use of distributed buckets is beneficial.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Typical real-time computer vision tasks require a large amount of processing power, larger than what can be achieved by current state-of-the-art workstations. By way of example, a real-time image processing system for the classification of steel with fault classes dirt, rust, pits, cuts, galls and inclusions, with a resolution of 0.33 × 0.33 mm, a steel band width of 1.60 m and a speed of 60 km/h, would require a processing speed on a single processor of 2.6 pixels/ns. Parallel processing using MIMD systems, i.e. shared memory or distributed memory multiprocessor systems, or such a system containing an SIMD subsystem, can be an economical solution [11].

This paper describes and discusses a data and task parallel framework that allows parallel applications in the field of image processing to be programmed smoothly. Within this context we focus on a special paradigm for parallel processing, the distributed bucket processing (DBP) paradigm, which is able to dynamically distribute data sets over processors. The idea behind distributed bucket processing is to combine the data reduction and data parallelism strategies. It is geared towards iterative image processing algorithms in which only a subset of the image data is processed. The approach resembles sparse array techniques,




but the distributed bucket processing paradigm focuses on the characteristics of the image processing operators. Several implementations exist for parallel processing of sparse arrays in Fortran 90 [4] and Java [3].

Section 2 presents the three levels of image processing and their possibilities for parallelism. Section 3 presents our data and task parallel framework. Section 4 introduces the distributed bucket processing concept and Section 5 its abstract data types, together with a simple example. Section 6 presents estimates of the benefits of the approach and the results of some experiments. Section 7 discusses the communication handling required for distributed buckets. Section 8 closes the paper with conclusions.

2. Motivation

The type of operations in a computer vision task varies greatly. Starting with a plain image or stream of images, the type of operations moves from arithmetic to symbolic and the amount of data is reduced until eventually a measurement is delivered or a decision is made. The initial processing in a computer vision task is real-time, as it needs to keep up with the rate of the incoming data, for example from a camera. The final type of processing, leading to a decision (like controlling a robot), may be characterized as "just-in-time". Generally, three levels of image processing can be distinguished [7,1]:

(1) Low-level operations. Image oriented. These operations work on whole pixel image structures and yield an image, a vector, or a single value. They have a local nature and include mostly point operations (PO) and local (n × n) neighborhood operations (LNO). They work on single pixels in an image, thus offering fine-grain data parallelism. Examples are smoothing, convolution, and histogram generation.

(2) Intermediate-level operations. Symbolic processing. These operations work on the pixels of the objects in the image and produce more compact data structures such as lists. The available parallelism is medium grain and less evident than that of low-level operations. Examples are region labelling and object tracking.

(3) High-level operations. Knowledge-based processing. Interpretation of the high-level information extracted from the intermediate-level processing. The operations work on graphs or lists and can lead to the decision flow in an application. The available parallelism is coarse grain, can usually be described with a data flow graph, and is highly application dependent. An example of a high-level operation is scene analysis.

In our experience, SIMD architectures are very suitable for low-level processing and SPMD for intermediate-level image processing, while MIMD architectures are suitable for high-level processing. However, the SPMD level can easily be emulated on an SIMD or on an MIMD system. Hence MIMD, possibly combined with an SIMD subsystem, seems profitable for meeting the computational demands of high speed computer vision. A mixed-machine approach [23,5,18,20,11] is most attractive because of the inherently good performance due to the spatial machine parallelism and the simple design of each sub-machine. Our target machine is a Myrinet connected distributed memory system with a shared memory dual Pentium board with an IMAP-Vision SIMD system on its PCI-bus [8,12].

3. A data and task parallel framework

Many algorithms have been developed to parallelize numerous image operators on a variety of parallel machines. Most of these parallel algorithms are either architecture dependent or specifically developed for the application.

However, for a common image processing user with limited knowledge of parallel computing, parallel programming is in both cases a tedious job. Consequently, we are developing an environment in which we can embed both data and task parallelism on a heterogeneous architecture from a language point of view. The solution must lie in the extension of a common programming language like C, C++ or Java with annotations. For this we are working on an underlying framework for mixed data and task parallelism that could eventually be brought into a pre-compiler.

In order to embed data and task parallelism in image processing we use both algorithmic skeletons [6,9,19] and an Image Application Task Graph (IATG) [13], see Fig. 1.

Skeletons are algorithmic abstractions that are common to a series of applications, which can be implemented in parallel. Skeletons are embedded in a sequential language, hence being the only source of parallelism in such a program. However, [21,22,17] showed that exploiting both task and data parallelism in a program yields better speedups than either pure task or pure data parallelism. The main reason is that in many applications the data sets are limited, and thus the amount of data parallelism is limited. For example, the image data size is often determined by the size of the pixel array of the camera. Although very effective, data parallelism saturates above a certain limit and additional parallelism may come from task parallelism. By coding the image processing application using skeletons, and by determining and using an IATG, we have created an environment with both data and task parallelism.

Fig. 1 depicts the interconnection between the different modules of our parallel framework. The source program is a specification of the image processing algorithm in which all data parallelism is made explicit by annotations and by using instances of the skeletons, with image processing functions written in C as parameters. A separate module, called the Syntactic and Dependency analyzer, turns this specification into a task graph, the IATG, with nodes associated to image processing functions (which are executed on sets of processors from an available pool of processors) and edges representing communication channels.

Fig. 1. A framework for embedding Data and Task Parallelism. (Modules shown: Sequential Image Processing Application, Syntactic/Dependence analyzer, IATG, Mapping & Scheduling, Cost Estimation, Profiling, Skeletons library with low-level, intermediate-level and high-level framework skeletons and the DSR skeleton, and the resulting Parallel Executable.)


On the extracted IATG we apply Mapping and Scheduling algorithms [17] to find the minimum execution time of the application. This module computes the set of processors and the corresponding execution time for each node (image processing operator) in the graph. The communication times associated with the edges are also determined. The scheduler's intermediate output is a list of sets of processors on which the image processing operators, executed data parallel inside the skeletons, should be computed. The mapping and scheduling module needs information about the execution costs of the image processing operators for different numbers of processors. This information is provided by a Cost estimation and Profiling module which communicates with the Skeletons library [14]. This library contains skeletons for low-level, intermediate-level and high-level image processing. Currently only the framework for low-level image processing is implemented, together with the data sending and receiving (DSR) skeleton, which is responsible for communicating the images across the processors. Below, an example of a skeleton for a Dyadic Point Operation is given:

void Im_Dyadic_PO(void(*im_op)(), char *im_in1, char *im_in2, char *im_o, int T, list_proc set);

Here *im_op() points to the – sequential – image operation and *im_in1, *im_in2, *im_o point to the two input images and the single output image; T is a task number, used by the task parallel framework to schedule the skeleton, and set is the set of processors on which the skeleton should run, obtained from the scheduler.
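For concreteness, a sketch of a user-supplied sequential operation that could be handed to such a skeleton is given below. This is our own illustration, not code from the framework: the image dimensions, the operation itself and the commented call are assumptions; only the Im_Dyadic_PO prototype above is taken from the text.

// Sequential pixel-wise addition of two byte images, written by the user.
// NC and NR are hypothetical image dimensions known to the application.
enum { NC = 256, NR = 256 };

void add_images(char *in1, char *in2, char *out)
{
    for (int i = 0; i < NC * NR; ++i) {
        int v = (unsigned char)in1[i] + (unsigned char)in2[i];
        out[i] = (char)(v > 255 ? 255 : v);   // clamp the sum to 8 bits
    }
}

// Intended use, with 'procs' obtained from the scheduler and task number 7:
//   Im_Dyadic_PO((void (*)())add_images, a, b, result, 7, procs);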

4. Distributed bucket processing concept

Most low-level image processing operations can be implemented by applying a kernel mask to each pixel of an image. In iterative algorithms, successive scans over the image implement a desired operation, in each scan applying the kernel mask


to every pixel in the image. For example in the erosion operation – as iteratively used in skeletonization operations [16] – in each iteration the contour pixels of the objects in the image are "eaten away". Since the kernel operations on each pixel are identical and independent of one another, a data parallel approach can be used to parallelize this task and thus obtain a linear speedup. Consider Fig. 2, where three successive erosion iterations are applied to an image of 20 × 20 pixels using an erosion kernel mask. Black pixels are object pixels, white pixels are background pixels, and grey pixels are object pixels that are eroded during an iteration. The kernel mask specifies that an object pixel "survives" when it is surrounded by object pixels neighboring left, right, up, and down. Otherwise the pixel is eroded, i.e. set to the background pixel value. If we assume that applying the kernel to a pixel takes t_p (seconds), then each iteration over the 20 × 20 image takes 400 t_p. When performing the erosion data parallel using P processors, each iteration would take 400 t_p / P. We ignored here the time needed to obtain neighbor pixel data. With P processors a speedup of P could be achieved. The data parallel approach seems an efficient and easy way of speeding up image processing operations. With many image processing operations, however, only a small part of the image is actually changed, and thus repetitive scanning of the whole image produces much obsolete processing, addressing pixels that are known beforehand not to change under the operation. This is obvious in Fig. 2. Note that the pixels that change in one iteration are neighbors of pixels that changed in the previous iteration, or will be changed in the next iteration. Table 1 shows that only a small fraction of the image is changed with each iteration. Using a strategy of only processing this small fraction will highly improve the performance due to a significant reduction of the data. For the three iterations (using a single processor) it would yield a speedup of 12.9, 14.8, and 21, respectively. This speedup, being a function of the number of pixels, can be even greater when data parallel processing is used.
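For reference, a straightforward sequential implementation of one such erosion iteration might look like the sketch below. This is ours, not the paper's code; an 8-bit image with 255 for object pixels and the 4-connected kernel of Fig. 2 are assumed, and border pixels are simply treated as background.

#include <cstdint>
#include <vector>

// One erosion iteration over the whole image: an object pixel (255) survives
// only if its left, right, upper and lower neighbours are object pixels too.
std::vector<std::uint8_t> erode_once(const std::vector<std::uint8_t>& in,
                                     int nc, int nr)
{
    std::vector<std::uint8_t> out(in.size(), 0);
    for (int y = 1; y < nr - 1; ++y) {
        for (int x = 1; x < nc - 1; ++x) {
            int i = y * nc + x;
            bool survives = in[i] && in[i - 1] && in[i + 1]
                                  && in[i - nc] && in[i + nc];
            out[i] = survives ? 255 : 0;     // erode otherwise
        }
    }
    return out;
}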

Distributed bucket processing combines data reduction and data parallelism. It is geared towards algorithms in which, dynamically, only subsets of the initial data are processed. By way of example consider Fig. 3, showing an image column-wise mapped onto three processes, which may reside on three connected processors P.

The image contains three subsets: two objects and background. A first global scan over the data is performed to obtain the data of interest, here all object pixels. Note that during this initial global scan, the image may already be processed. During that scan, the data of interest is collected in one or more buckets, the data structures that store the data subsets. During the subsequent bucket processing, data elements are drawn from a bucket and processed, and when new data is generated, it can be put into a bucket again. Processing can be done data parallel when the operations on the data elements of the set are mutually independent. Data parallel processing of a bucket is possible since an ordering of elements in the bucket is neither specified nor guaranteed, and thus the bucket data structure can be distributed over multiple processors. Note that although a bucket can be implemented by a stack, queue or linked list, knowledge of the ordering of data is explicitly banned at the functional level, which is reflected by the word 'bucket'. Furthermore, a write paradigm is enforced, which means that a process can only read from its own part of the bucket, but can write to the whole bucket, also when this is partially mapped onto a different processor. The mapping function determines how the bucket is distributed over the processors. Usually a bucket is mapped in the same way as its related images are mapped onto the processors. So when writing data to a bucket, pixel coordinates can be used to determine the destination bucket.

Fig. 2. Three erosion iterations on an object in a 20 × 20 image (panels show iterations 1–3; legend: kernel mask, object pixel, pixel to be eroded).

Table 1
Number of pixels in each iteration

Iteration   Object pixels   Eroded pixels   Fraction changed (%)
1           88              31              7.75
2           57              27              6.75
3           30              19              4.75

Fig. 3. Processing from image to distributed bucket to bucket array.


The method is beneficial if there is enough data in the buckets to overlap the communication time of the data distribution with the computation time of the data processing.
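For illustration, a column-wise mapping function might look as follows. This is our own sketch: the function name and the block distribution are assumptions, not part of the paper; only the idea that a pixel's coordinates determine the destination partial bucket is taken from the text.

// Column-wise mapping of an nc-wide image onto p_count processors:
// processor p owns a contiguous block of columns, and a pixel written to the
// distributed bucket is routed to the owner of its column.
int owner_of_column(int x, int nc, int p_count)
{
    int cols_per_proc = (nc + p_count - 1) / p_count;  // ceiling division
    return x / cols_per_proc;
}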

Usually, buckets are in memory cache, which can be on-chip for SIMD architectures like IMAP, or the processor cache for distributed memory systems.

Obviously, the benefits of using bucket processing are strongly dependent on the data contents and the data distribution. Data may be ordered; if so, one should use that knowledge efficiently. Typically, for SIMD type processing with buckets, the processing of data elements in the bucket is continued until the bucket is empty. Processors that encounter an empty part of the bucket simply become inactive or execute a NOP (no operation) instruction.

5. Bucket processing data structures

For distributed bucket processing [15] two types of data structures can be defined: a bucket and a bucket array. Both are distributed data structures when they are mapped on the processing nodes of a distributed/shared memory parallel computer and/or on the processing elements of an SIMD computer.

Below, a C++ definition of the bucket data structure with its access functions is given:

class Bucket
{
public:
    Bucket(unsigned int elm_size, char *name = "unnamed Bucket");
    ~Bucket();
    boolean empty();
    int put(void *element);
    void get(void *element);
    void clear();
};

A bucket is defined as a data structure with two main access functions: a put() function to put data elements in the bucket and a get() function to retrieve an arbitrary data element from the bucket. Furthermore, there is an empty() function for checking whether the bucket is empty and a clear() function for removing all data elements from the bucket. Note that both get() and put() are blocking when the bucket is empty or full, respectively; meaning that a call to get() or put() will not return until an element is retrieved from, respectively put into, the bucket. To avoid deadlock the bucket's status should be checked before executing those operations. These definitions allow various implementations of the bucket data structure. Yet, the user may not assume the bucket behaves in a certain way – e.g. as a FIFO – and use that knowledge in the program. In addition, a constructor Bucket() and a destructor ~Bucket() are used to create and destroy bucket objects. On creation, the size of each data element is specified by elm_size and is fixed for the lifetime of the Bucket object; all elements in the bucket thus have identical size, which cannot be altered afterwards.
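A minimal usage sketch is given below; it is ours, not from the paper, assumes an implementation of the Bucket class above (including its boolean type) is available, and uses a hypothetical Pixel element type.

struct Pixel { short x, y; };                   // hypothetical element type

void drain_example()
{
    Bucket border(sizeof(Pixel), (char *)"border pixels");

    Pixel p = { 10, 12 };
    border.put(&p);                             // copy p into the bucket

    while (!border.empty()) {                   // status check avoids a blocking get()
        Pixel q;
        border.get(&q);                         // retrieve an arbitrary element
        // ... process q, possibly put() new work ...
    }
}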

On a parallel system the bucket data structure may be distributed over a number of processors. This distributed bucket data structure is obtained by segmenting the bucket into so-called partial buckets that are allocated on the processors. Within the example in Fig. 3 we suppose that with a read() each processor Px retrieves pixels from its part of the image (Px: a), segments it into object and background pixels – thereby reducing the data – and put()s the data into a bucket distributed over the


three processors (Px: b). The bucket consists of three partial buckets. When get() is called on this distributed bucket, each processor will return data elements that are present in its partial bucket only, i.e. a get() of processor P2 will only return an arbitrary data element contained in the partial bucket of processor P2. However, when a processor puts data in the distributed bucket (Px: c), the data may end up in the (partial) bucket of another processor. When data is put() into the bucket, an additional argument is provided that is evaluated by the mapping function to determine the destination of the data. So the actual mapping and communication required for the data distribution is hidden from the user. The write paradigm enforces that processors can write transparently to neighboring processors but can only read from their own part of the bucket. The difference in use between a distributed bucket and a distributed bucket array is merely that in the latter the partial buckets are made explicit. Here the write paradigm enforces that processors can write explicitly to neighboring processors, as is shown in the lowest row of buckets in Fig. 3 (Qx: d).

In Section 2 it was explained that low-level image processing – sets of pixels – fits naturally on SIMD architectures, while intermediate-level – sets of attributed objects – and high-level – sets of symbols – image processing fit well on MIMD architectures. The distributed bucket processing technique could be used to bridge the gap between architectures in a heterogeneous setting. At some stage of processing, an output bucket (array) may be used that is mapped on another architecture. For example, in Fig. 3 the last put() operation implicitly transfers the pixels to a distributed bucket array (Qy: d) that is mapped onto another set of processors Q. Generally, segmentation of a single set of pixels (an image) into multiple object sets can be done quite effectively on an SIMD processor with Px processing elements, whereas eventually each object set is stored on the separate processors of an MIMD system with Qy processors.

A distributed bucket array is an array of buckets where each bucket has a label and a numeric value – or range of values – attached to it. The bucket array as a whole is considered as a single data structure with two main access functions similar to the bucket data structure: a put() function to put a data element in a specific bucket of the bucket array and a get() function to retrieve an arbitrary data element from a specific bucket of the bucket array. Both access functions thus have an argument indicating the bucket to access. This implies that buckets in the bucket array must have unique, i.e. non-overlapping, labels. Note that the labels of the buckets need not be consecutive. A bucket may have a range of label values, and the bucket will be accessed when the label argument of the access function is within that range. One bucket can be labelled as others, indicating that its label range is everything that falls outside the ranges of the other buckets. Below, a C++ definition of a distributed bucket array is presented.

class BucketArray
{
public:
    BucketArray(unsigned int array_size, unsigned int elm_size,
                char *name = "unnamed BucketArray");
    ~BucketArray();
    boolean empty(unsigned int index);
    int put(int bucketlabel, void *element);
    void get(int bucketlabel, void *element);
    void clear(unsigned int index);
    void setlabel(unsigned int index, int label_l, int label_h);
    unsigned int getindex(int label);
};

The bucket array has, among others, the following access functions:

• put(). The put() function copies data from the location pointed to by element to the bucket whose label interval matches the value of bucketlabel. As all data elements in all buckets of the bucket array have the same size, the size of the data structure that element points to is known. When bucketlabel does not match any of the label intervals and no bucket is labelled others, the put() will simply return. In that case the data element is not copied to any of the buckets and no error is returned or signaled.

• get(). The get() function copies a data element from the bucket whose label interval matches the value of bucketlabel to the memory location pointed to by element. The bucket with label others will match any bucketlabel value when no match is found with the label intervals of the other buckets. When there is no bucket labelled others and get() is called with a bucketlabel that does not match any of the label intervals of the buckets, that call to get() will block forever.
• setlabel(). Each bucket has a label interval that is set by default to its indexed position in the array. With the setlabel() function, each bucket label can be changed.
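A small usage sketch of this interface is given below; it is ours, not from the paper, assumes an implementation of the BucketArray class above is available, and uses arbitrary label values and a hypothetical element type.

struct Sample { short x, y; };                   // hypothetical element type

void label_example()
{
    // three buckets: dark pixels, bright pixels, and a spare bucket
    BucketArray ba(3, sizeof(Sample), (char *)"grey-value buckets");
    ba.setlabel(0,   0, 127);                    // bucket 0 matches labels 0..127
    ba.setlabel(1, 128, 255);                    // bucket 1 matches labels 128..255
    // bucket 2 is left with its default label here

    Sample s = { 3, 4 };
    ba.put(200, &s);                             // label 200 routes to bucket 1

    if (!ba.empty(1)) {                          // check before a blocking get()
        Sample t;
        ba.get(200, &t);
    }
}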

As a simple example to illustrate programming with buckets we look at a number n of consecutive erosions on an image. Instead of scanning the whole image for each erosion, we only need to scan the image once and collect all border pixels of the objects in the image. For each next erosion we know we only need to consider those pixels and do not need to process background pixels anymore. Algorithm 1 lists the code.
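The listing of Algorithm 1 is not reproduced in this version of the text; the following sequential sketch is ours and only illustrates the idea, with plain std::vector containers standing in for the distributed Bucket, 4-connectivity, and objects assumed not to touch the image border (as in Fig. 2).

#include <cstdint>
#include <vector>

// n erosions driven by a work list ("bucket"): the image is scanned once to
// collect the object contour; afterwards only wavefront pixels are processed.
// 255 = object, 0 = background.
void erode_n(std::vector<std::uint8_t>& img, int nc, int nr, int n)
{
    const int nb[4] = { -1, +1, -nc, +nc };
    std::vector<std::uint8_t> queued(img.size(), 0);
    std::vector<int> current, next;

    auto is_contour = [&](int i) {       // object pixel with a background neighbour
        if (!img[i]) return false;
        for (int d : nb) if (!img[i + d]) return true;
        return false;
    };

    // initial scan: collect the contour pixels once
    for (int y = 1; y < nr - 1; ++y)
        for (int x = 1; x < nc - 1; ++x) {
            int i = y * nc + x;
            if (is_contour(i)) { current.push_back(i); queued[i] = 1; }
        }

    for (int it = 0; it < n && !current.empty(); ++it) {
        for (int i : current) img[i] = 0;          // erode the whole wavefront at once
        next.clear();
        for (int i : current)                      // their object neighbours form the new wavefront
            for (int d : nb) {
                int j = i + d;
                if (img[j] && !queued[j]) { next.push_back(j); queued[j] = 1; }
            }
        for (int i : current) queued[i] = 0;       // eroded pixels leave the work list
        current.swap(next);
    }
}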


6. Estimation of dynamic behavior

The distributed bucket processing can be encapsulated into a skeleton, so that it can be used in the scheduling phase of the task parallel framework. However, the main aspect of distributed bucket processing is data parallel processing while dynamically reducing the data. This is very data dependent, as well as dependent on the distribution of the data over the processors. To be used in a skeleton and take part in a scheduling phase, the Cost estimation and Profiling module needs to predict whether DBP is beneficial in comparison with straightforward data parallel processing. Hence, we have developed the following model.

For a system with P processors, the time T_LNO(n) to perform n scan iterations using a local neighborhood operation on an N_c × N_r image is given by

    T_LNO(n) = (n N_c N_r / P) t_p                                            (1)

where t_p is the (average) time needed for a processor to process a pixel. For example, in the case of an erosion operation this would include reading a pixel's value, reading the values of its four neighbors, applying the erosion mask and writing the result back to memory. For DBP the first image scan iteration is done in the same way as a plain image scan, but in addition the wavefront of pixels that is needed for the next iteration of the operation must be collected. These collected


pixels are put into a bucket and the next iterations are done by processing the entries in the bucket. So doing n image scan iterations takes

    T_DBP(n) = (N_c N_r / P)(t_p + t_e) + Σ_{i=2}^{n} (B_i / (β_i P)) t_b,   0 < β_i ≤ 1          (2)

where t_e is the time needed per scanned pixel in the first image scan to evaluate it as a bucket candidate and possibly put it in the bucket, B_i is the number of entries in the distributed bucket for iteration i, and t_b is the time needed to process an entry from the bucket, including generating new work pixels for the bucket. β_i is the workload efficiency, a term that compensates for the spread in workload over the processors. It represents the load balance among the processors; e.g. β_i = 0.1 means that effectively in iteration i only 1/10 of the available processors was used to process the image. Ideally β_i = 1, meaning that the workload is evenly distributed over the available processors for that iteration. Further, β_i ≥ 1/P, thus 1/P ≤ β_i ≤ 1. The value of β_i is strongly data dependent and also depends on the way the image is mapped onto the processors. For an ideal workload, the processed data is evenly divided over the available processors, so the number of bucket elements B_pi – processed by processor p in iteration i – is equal for all processors. Yet, for a less than ideal workload, the processor that locally has the most data from the distributed bucket determines the execution time for an iteration (determined by max(B_pi)). An implementation of bucket processing on SIMD has been done by NEC on their IMAP [12]. A stack based approach was used where each processor of a linear processor array has access to a local stack and can push data to the stacks of its immediate neighbors. β_i can be defined as

    β_i = B_i / (P max_p(B_pi)),   i ≥ 2,  0 ≤ p ≤ P − 1                      (3)

where B_i, the number of bucket elements processed in iteration i, can be defined as

    B_i = Σ_{p=0}^{P−1} B_pi,   i ≥ 2                                         (4)
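As a small illustration (ours, not from the paper), B_i and β_i can be computed directly from the per-processor bucket counts according to Eqs. (3) and (4):

#include <algorithm>
#include <vector>

// Workload efficiency of one iteration, computed from the number of bucket
// entries B_pi held by each of the P processors.
double workload_efficiency(const std::vector<int>& per_proc_entries)
{
    int b_i = 0;                        // B_i: total entries this iteration, Eq. (4)
    int b_max = 0;                      // max_p(B_pi): the busiest processor
    for (int b_pi : per_proc_entries) {
        b_i += b_pi;
        b_max = std::max(b_max, b_pi);
    }
    double p = static_cast<double>(per_proc_entries.size());
    return b_max > 0 ? b_i / (p * b_max) : 1.0;   // beta_i in [1/P, 1], Eq. (3)
}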

B_i can also be defined as a recursive function:

    B_{i+1} = α_i B_i,   i ≥ 2                                                (5)

with α_i a multiplication factor. Normally it is desired that 0 < α_i < 1, as the amount of work is decreasing. Depending on the algorithm, however, α_i can be larger than 1, in which case the bucket elements generate more work than they dissolve. We name α_i the work generation factor.

In order to get an idea of the values for B_i and β_i, several experiments were conducted that involved performing multiple iterations of an erosion operation as well as of a distance transform on a set of test images. The resulting workload efficiency differed significantly depending on the images. For example, in images where the erosions take place in only a small region of the image, the workload efficiency decreased very quickly when the number of processors P increased, in contrast with images where erosions take place over the whole image, where the workload efficiency did not decrease. Moreover, how the image was mapped onto the processors (in our case column-wise) was obviously important for the resulting workload efficiency, but it seemed that only a very fine-grain mapping helped in the case of images with one single large object.

In order for DBP to be more beneficial than straightforward data parallel LNO, T_DBP(n) < T_LNO(n) for n iterations using the same number of processors. This yields the tradeoff:

    (N_c N_r / P)(t_p + t_e) + Σ_{i=2}^{n} (B_i / (β_i P)) t_b < (n N_c N_r / P) t_p              (6)

which can be simplified to

    N_c N_r (t_p + t_e) + Σ_{i=2}^{n} (B_i / β_i) t_b < n N_c N_r t_p                             (7)

The difficulty in this tradeoff lies in the values for B_i and β_i, as these values depend on the type of operation, the images, and the mapping of the image data onto the available processors. Experiments show that the number of elements in the bucket in most cases decreases with growing iteration number. An upper bound is therefore obtained by assuming that the number of elements in a bucket is constant; the same can be done for β_i. So, assuming that B_i and β_i remain constant during the iterations, they can be determined after the first scan of the image (i.e. B_i = B_2, β_i = β_2) and used for the decision whether to perform the next iterations by straightforward data parallel scanning or using the DBP method. Assuming that the number of entries in the bucket for each iteration is constant and equal to the sum of all wavefront pixels in the image found during the first scan, the estimated execution time T_DBPest is

    T_DBPest = (N_c N_r / P)(t_p + t_e) + (n − 1) (B / (β P)) t_b                                 (8)


where B is the number of wavefront pixels collected during the first image scan for the second iteration, i.e. B = B_2, and β the workload distribution determined from the differences in the number of entries in the distributed bucket for each processor as given by Eq. (3), i.e. β = β_2.

For an advantageous use of DBP we need T_DBPest < T_LNO(n) and thus the tradeoff becomes:


    N_c N_r (t_p + t_e) + (n − 1) (B / β) t_b < n N_c N_r t_p

Assume now that t_b = c_1 t_p and t_e = c_2 t_p. This gives

    (1 − n + c_2) N_c N_r + c_1 (n − 1) B / β < 0                                                 (9)

So, for DBP to be faster than multiple data parallel LNO scans – assuming that processing an entry from a bucket (t_b) takes c_1 times as much time as processing a pixel during an image scan (t_p), and that the additional overhead per pixel for filling the bucket during the first scan (t_e) is a factor c_2 of t_p – the number of bucket entries found during the first image scan must satisfy:

    B < N_c N_r β (n − 1 − c_2) / ((n − 1) c_1)                                                   (10)
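At run time the decision therefore reduces to a single comparison after the first scan. A sketch of such a check is given below; it is ours, and it assumes, as above, that B and β stay constant over the iterations.

// Decide after the first scan whether to continue with DBP or with plain data
// parallel LNO scans, following Eq. (10): B and beta are measured during the
// first scan, c1 = t_b/t_p and c2 = t_e/t_p are system constants.
bool dbp_is_beneficial(long b_entries, double beta,
                       int nc, int nr, int n, double c1, double c2)
{
    if (n <= 1) return false;                    // nothing left to iterate over
    double threshold = (double)nc * nr * beta * (n - 1 - c2) / ((n - 1) * c1);
    return (double)b_entries < threshold;
}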

Obviously, the values of c_1 and c_2 depend on the actual system and implementation of the algorithm. To get some feeling for c_1 and c_2, we assume an implementation on a hypothetical, but realistic, SIMD architecture. The architecture is a linear processor array with N_c/m processors where the N_c × N_r image pixels are mapped column-wise onto the processors and 1 ≤ m ≤ N_c. Being SIMD, all processors execute the same instructions and operate in lock step. The processors are connected in a ring and each processor can only communicate directly with its two neighbors. The architecture closely resembles the NEC IMAP-Vision SIMD subsystem [12,8]. Given the instruction set and timing information, the values for t_p, t_b, and t_e for e.g. an erosion operation can be determined. Depending on m, and thus the precise mapping of the image (number of columns per processor), with possible code optimizations or overhead, this yields values for t_p, t_e, and t_b. When m = 1 the neighboring pixels are the central pixels of the corresponding neighboring processors, which makes accessing those pixels a little faster as the data can be read from a neighbor's register. For m > 1 additional address and offset calculations are needed. For simplicity, we assume worst case values, and thus for m > 1 we derived the timings t_p = 23, t_e = 18, and t_b = 51. Given t_p, t_e, and t_b, the values for c_1 and c_2 can be calculated and substituted in Eq. (10), and so the tradeoff of a normal data parallel image scan versus the DBP approach is determined by Eq. (11) and (worst case timings) Eq. (12):

    B < N_c N_r β (n − 2.06) / (1.5 (n − 1)),    m = 1                                            (11)

    B < N_c N_r β (n − 1.78) / (2.22 (n − 1)),   m > 1                                            (12)
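For example, with n = 5 iterations, β = 1 and a 256 × 256 image, Eq. (12) gives B < 65536 × 3.22 / (2.22 × 4) ≈ 23,800: DBP pays off as long as the wavefront found during the first scan covers less than roughly 36% of the image.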

Fig. 4 shows the tradeoff for the number of bucket entries B found during the first erosion iteration, for different values of n and β and the two cases for m, with an image size of 256 × 256 pixels. The left figure shows the case for m = 1, the right figure shows the tradeoff for m > 1. Note that when B < 0 the proper operation is not possible, as not all buckets necessary for the operation are filled, due to the overhead to set up the buckets in the case of this type of skeletonization operation [10].

DBP is more efficient than LNO when the number of entries found during the first iteration scan is below the curve. We notice that when m = 1, DBP may be faster when doing three erosion iterations, depending on the workload efficiency and the work, i.e. the number of pixels in the bucket, found during the first iteration. For m > 1, DBP may be faster when doing only two erosion iterations.

Fig. 4. Tradeoff for B for different β values and an image size of 256 × 256 pixels (bucket entries after the first iteration versus the number of iterations, for β = 1.0, 0.5 and 0.1). Left shows m = 1, right shows m > 1. DBP is more efficient than LNO when the number of bucket entries found during the first iteration is below the curve.

Fig. 5. Original objects and their object skeletons.

Fig. 6. Speedup of DBP versus LNO execution for the thinning operation on various test images (flower, TUD, obscura, trui, cermet), for 2–256 processors.


Moreover, based on our workload efficiency measurements, the workload efficiency in the case m > 1 is expected to be much better than for m = 1.

Obviously the benefits of using DBP instead of a straightforward data parallel LNO operation depend highly on the data in the image, but the number of processors also influences the tradeoff. Multiple erosion iterations are used in a thinning operation. A (topology preserving) thinning operation is a conditional erosion: objects are eroded until a single pixel "skeleton" of the objects remains on the medial axes of the original objects [16], see Fig. 5.

Fig. 6 summarizes the theoretical speedup achieved with DBP over LNO for a thinning operation on some test images. Obviously, the highest speedup is achieved for the lowest number of processors, due to the reduction in processed data and the better workload efficiency. Hence, on a linear SIMD architecture with indirect addressing, the DBP technique is on average about twice as fast as the normal data parallel LNO image scan for more than three iterations of the same kernel operation. The simulated results of Fig. 6 have been verified with an implementation on an IMAP-Vision SIMD system, yielding for the TUD image – the lowest curve in Fig. 6 – 21 ms for a straightforward data parallel implementation versus 6 ms for a DBP implementation, thus giving a speedup of 3.5 [2].

7. Communication handling

In the implementation of a distributed bucket or bucket array on distributed memory systems, the bucket (array) is distributed over a number of processors and a communication network is required to control the distributed bucket (array) data structure and to implement the write paradigm, so that processors can put elements in the bucket (array) that may need to be sent to other processors. We have made the following assumptions:


• The parallel system has an autonomous, deadlock and livelock free communication network that offers (virtual) direct communication between all processors.

• Communication between nodes in the parallel system is done using asynchronous message passing, i.e. the sender of a message need not wait for the actual receipt of the sent message by the destination node.

• The bucket (array) is implemented using the SPMD approach: each node runs the same program but need not run in lock step with other nodes.

The following communication primitives are required for the implementation of a distributed bucket (array):

• A processor should be able to send a data element to another processor, specifying the associated bucket and the data element's content. The destination processor can be any processor that has (part of) the bucket mapped onto it.

• A processor should be able to receive a data element for a locally mapped (part of a) bucket from another processor, determine its associated bucket and put it in the corresponding (local) bucket. This requirement is necessary to avoid deadlock at the application, i.e. bucket (array) implementation, level.

• A processor should be able to respond to non-data transfers and react accordingly. For instance, a request for the status of a (local) bucket should be handled to determine whether the bucket is full or not.
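The paper does not prescribe a particular message passing library; purely as an illustration, the sending and receiving side of a remote put() could be sketched with MPI as follows. The message tag, the byte-wise element layout and the function names are our assumptions.

#include <mpi.h>

// Asynchronous send of one bucket element to the processor owning the
// destination partial bucket.
const int TAG_PUT = 1;

void remote_put(void *element, int elm_size, int dest_rank, MPI_Request *req)
{
    // the sender does not wait for receipt (asynchronous message passing)
    MPI_Isend(element, elm_size, MPI_BYTE, dest_rank, TAG_PUT,
              MPI_COMM_WORLD, req);
}

// On the receiving side, each node periodically probes for incoming elements
// and appends them to the corresponding local partial bucket.
bool poll_incoming(void *element, int elm_size)
{
    int pending = 0;
    MPI_Status status;
    MPI_Iprobe(MPI_ANY_SOURCE, TAG_PUT, MPI_COMM_WORLD, &pending, &status);
    if (!pending) return false;
    MPI_Recv(element, elm_size, MPI_BYTE, status.MPI_SOURCE, TAG_PUT,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return true;
}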

8. Conclusions

For iterative filters with more than three consecutive iterations, DBP is generally the better approach. A useful and sufficiently accurate tradeoff function was established that helps decide between LNO and DBP, based on the amount of work found after the first scan and the way it is distributed over the available processors. One of the aspects of the DBP paradigm is the reduction in processed data. Clearly this should give a speedup in processing. However, this speedup is very data dependent, as well as dependent on the distribution of the data over the processors.

Within the context of one single programming language for a heterogeneous architecture, the paradigm can be used to bridge the gap between the architectures. DBP can be encapsulated into a skeleton and used in a data and task parallel framework that can be mapped onto heterogeneous architectures.

Although the DBP paradigm can be implemented on both SIMD and MIMD, there are several drawbacks in the SIMD approach. In an SIMD architecture each instruction is executed on all processors, meaning that the processor with the largest number of bucket entries determines the speed; all other processors that are ready execute an active wait. The communication that takes place within a bucket is not transparent, and communication and computation may not always overlap. The great advantage of the bucket structure is that it can overlap processing and communication. However, even on SIMD, with the indirect addressing capability to implement the buckets, DBP usually speeds up data parallel processing considerably.

References

[1] D. Ballard, C. Brown, Computer Vision, Prentice Hall, 1982.
[2] W. Bokhove, Fast robot vision using the IMAP-Vision image processing board, Master's Thesis, Delft University of Technology, 2000.
[3] R.G. Chang, T.R. Chang, J.K. Lee, Towards automatic support for parallel sparse computation in Java with continuous compilation, Concurrency: Practice and Experience 9 (11) (1997) 1101–1111.
[4] R.G. Chang, T.R. Chang, J.K. Lee, Compiler optimizations for parallel sparse programs with array intrinsics of Fortran 90, in: Proceedings of the International Conference on Parallel Processing, 1999, p. 103.
[5] M.J. Cola, J.L. Jumpertz, B. Guérin, B. Chéron, F. Battini, B. de Lescure, E. Gautier, J.P. Geffroy, The implementation of P3I, a parallel architecture for video real-time processing: a case study, Proceedings of the IEEE 84 (7) (1996) 1019–1036.
[6] M. Cole, Algorithmic Skeletons: Structured Management of Parallel Computations, Pitman/MIT Press, 1989.
[7] A.N. Choudhary, J.H. Patel, N. Ahuja, NETRA: a hierarchical and partitionable architecture for computer vision systems, IEEE Transactions on Parallel and Distributed Systems 4 (10) (1993) 1092–1104.
[8] Y. Fujita et al., A 10 GIPS SIMD processor for PC-based real-time vision applications, in: Proceedings of the IEEE Workshop on Computer Architecture for Machine Perception (CAMP'97), 1997, pp. 22–26.
[9] H.W. To, J. Darlington, Y.K. Guo, Y. Jing, Skeletons for structured parallel composition, in: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995.
[10] P.P. Jonker, Skeletons in N dimensions using shape primitives, Pattern Recognition Letters 23 (4) (2002) 677–686.
[11] P. Jonker, J. Vogelbruch, The CC/IPP, an MIMD-SIMD architecture for image processing and pattern recognition, in: C.C. Weems Jr. (Ed.), Proceedings of the Fourth IEEE International Workshop on Computer Architecture for Machine Perception, IEEE Computer Society Press, 1997, pp. 33–39.
[12] S. Kyo, K. Sato, Efficient implementation of image processing algorithms on linear processor arrays using the data parallel language 1DC, in: Proceedings of the IAPR Workshop on Machine Vision Applications, Tokyo, Japan, November 1996, pp. 160–165.
[13] C. Nicolescu, P.P. Jonker, A data and task parallel image processing framework, in: Proceedings of the Third Workshop on High Performance Scientific and Engineering Computing with Applications (held in conjunction with ICPP 2001), Valencia, Spain.
[14] C. Nicolescu, P.P. Jonker, Easy pipe – an easy to use parallel image processing environment based on algorithmic skeletons, in: CDROM Proceedings of the Workshop on Parallel and Distributed Computing in Image Processing, Video Processing, and Multimedia (held in conjunction with IPDPS 2001), San Francisco, USA, April 23–28, 2001.
[15] J.G.E. Olk, Distributed Bucket Processing – a paradigm for parallel image processing, PhD Thesis, Delft University of Technology, 2001.
[16] P.P. Jonker, E.R. Komen, M.A. Kraaijveld, A scalable real-time image processing pipeline, Machine Vision and Applications 8 (1995) 110–121.
[17] A. Radulescu, C. Nicolescu, A. van Gemund, P.P. Jonker, CPR: mixed task and data parallel scheduling for distributed systems, in: CDROM Proceedings of the 15th International Parallel and Distributed Processing Symposium, San Francisco, California, 2001 (Best Paper Award).
[18] M.H. Sunwoo, J.K. Aggarwal, VisTA – an image understanding architecture, in: V.K. Prasanna Kumar (Ed.), Parallel Architectures and Algorithms for Image Understanding, Academic Press, 1991, pp. 121–154.
[19] F. Serot, D. Ginhac, J.P. Derutin, Skipper: a skeleton-based programming environment for image processing applications, in: Proceedings of the Fifth International Conference on Parallel Computing Technologies, 1999.
[20] J. Serot, G. Quenot, B. Zavidovique, Functional programming on a dataflow architecture: applications in real-time image processing, Machine Vision and Applications 7 (1) (1993) 44–56.
[21] S. Sapatnekar, S. Ramaswamy, P. Banerjee, A framework for exploiting task and data parallelism on distributed memory multicomputers, IEEE Transactions on Parallel and Distributed Systems 8 (11) (1997) 1098–1115.
[22] J. Subhlok, B. Yang, Optimal use of mixed task and data parallelism for pipelined computations, Journal of Parallel and Distributed Computing 60 (2000) 297–319.
[23] C. Weems, The image understanding architecture: a status report, Proceedings of the SPIE – The International Society for Optical Engineering 2368 (1995) 235–246.