
iVEC Research Internships (2009 - 2010)

Analysis of the Parallelisation of the Duchamp Algorithm

Stefan Westerlund
University of Western Australia

Abstract

A critical step in radio astronomy is to search images to determine the objects they contain. New telescope installations, such as the Murchison Widefield Array (MWA) and the Australian Square Kilometer Array Pathfinder (ASKAP), are capable of observing the sky at higher resolution than previous telescopes. This increased resolution results in a much greater data output, so increased computing power is required in order to search this data for objects. The Square Kilometer Array (SKA) will produce even more data, and require even more computational power to search its output. A parallel application is required to make use of this level of computing performance.

The goal of this project is to examine the source finder program Duchamp to determine how it will perform in a parallel implementation, and to estimate potential combinations of hardware on which to run this parallel implementation. This is done by calculating the arithmetic intensity of Duchamp and matching it to the arithmetic intensity of potential hardware. The comparison is performed using a black box model, to determine the overall performance of the computing system and its bandwidth. A node model is also considered, to determine the number, performance, memory and interconnect bandwidth of the individual nodes that comprise the parallel computer system.

The results of this project suggest two potential computer systems. One consists of 392 nodes, each with an Intel Core i7 975 processor, at least 14.1GB of RAM and a network capable of providing a connection of at least 1.56GB/s (12.5Gbit/s) of bandwidth to each node. The second uses 67 nodes powered by nVidia Tesla C2070 GPUs, at least 82.1GB of RAM per node and a network that can provide at least 4.32GB/s (34.6Gbit/s) of bandwidth to each node. Both systems should use a 100 Gigabit Ethernet network to transfer data to and from the system. Other configurations considered had memory-per-node requirements that exceed currently available commodity hardware.


1. Introduction

Modern telescope installations, such as the Square Kilometer Array (SKA) and the Australian Square Kilometer Array Pathfinder (ASKAP), will search a much larger area of sky in a given amount of time than current telescopes. The SKA will produce data cubes that are terabytes in size. An all-sky survey will produce thousands of such image cubes, resulting in a data set that is petabytes in size. All of this data will need to be searched to find the objects it contains.

The problem is that the computational requirements of searching these images dwarf what is available from desktop machines. Instead, supercomputers are needed to process these images in a reasonable amount of time. Therefore, the source finder programs used for searching the large astronomy images must be able to run in parallel, so they can make use of the computational power of current, parallel supercomputers. This project uses the Duchamp program as a representative example of a source finder, in order to examine the implications of running a source finder on a parallel computer network.

The goal of this project is to examine the Duchamp source finding program and determine the effects of parallelising it. This is done by examining how many operations Duchamp requires and how much data transfer is required in order to search an area of sky. These values will be used to determine an appropriate combination of hardware to run the parallel version of Duchamp. This project will consider the hardware both as a computer system as a whole, and as a network of nodes.

The Background section will describe the knowledge required to understand this report. The manner in which the Duchamp program will be analysed to understand how it will perform with a parallel implementation will be detailed in the Methodology section, along with how estimates for potential hardware configurations for this problem will be obtained. The results of evaluating the models from the Methodology section are shown in the Results section. The implications of these results will be considered and the selection of hardware will be made in the Discussion section. The conclusions of this work are presented, along with a discussion of the limitations of this project and the future work to be done to expand on it, in the Conclusion section.

2. Background

This section will first provide a broad introduction to radio astronomy and the role of source finders. It will then detail Duchamp, the source finder chosen to be investigated in this project. Also described is the a trous image reconstruction algorithm, which comprises the majority of the computational requirements of the Duchamp program. The concepts of arithmetic intensity and computational complexity will be explained, as they are relevant to the understanding of this work.


2.1. Radio Astronomy

Radio astronomy is the study of celestial objects by examining electromagnetic radiation in the radio spectrum. It is possible to examine this radiation from Earth because it is in one of the frequency windows that is not blocked by Earth's atmosphere. Radio waves are beneficial to study because they pass through objects that are opaque to visible light, such as dust clouds. Radio astronomy also allows astronomers to observe objects that do not emit visible light, such as hydrogen clouds, as neutral hydrogen produces 21cm radiation [1]. Radio waves are often detected using arrays of telescopes because the signals can be combined between telescopes in a process called radio interferometry. Using multiple telescopes improves the angular resolution of the system, such that the angular resolution of two telescopes a certain distance apart is the same as that of a single telescope with a dish diameter equal to that distance.

It takes a significant amount of computational power to process the signals received by telescopes into astronomy images. Signals from different telescopes are correlated together, taking account of their relative positions. The results are integrated over time, causing noise to cancel out towards zero and allowing fainter signals to be detected. This information is combined into a structure called a data cube. This cube has three dimensions: two are spatial dimensions that denote where in the sky an element lies, and the third is the frequency channel that a particular element represents. Once these data cubes have been created, they need to be searched.

Source finders are programs that are used to find sources of electromagnetic radiation in an image. The quality of a source finder is measured in terms of its completeness and reliability. Completeness is a measure of how many of the actual sources in the data cube the source finder finds. Reliability is the proportion of the objects a source finder finds that are actual sources, rather than noise. Several source finders were considered for this report. These include MultiFind and TopHat, which were used by the HIPASS survey [2], and the Duchamp source finder [3]. Neither MultiFind nor TopHat was chosen, because of requirements of completeness and reliability. In the HIPASS survey, people were used to confirm each of the sources found by these programs. MultiFind found around 83% and TopHat found around 90% of the sources that were deemed to be in the data, but each one found sources the other did not. Additionally, MultiFind and TopHat found 137,060 and 17,232 sources respectively, compared to 4,315 sources in the final count [2]. The data from newer surveys, such as ASKAP, will have too many sources to be verified by people.

Duchamp [3] is a new source finder program written by Dr. Matthew Whiting. Still under development, Duchamp uses a different algorithm from these earlier source finders. Because of this, Duchamp will be used to estimate how much computing power a source finder will need, and what the effects of parallelising it are. The detection algorithm used by Duchamp is to consider all elements above a certain threshold as bright elements. Adjacent bright elements are then joined together as objects. Objects that are below another threshold in size are discarded as noise, rather than treated as actual sources. Duchamp pre-processes the data cube to reduce its noise and to allow fainter objects to be seen. The pre-processing uses image reconstruction with the a trous method [4], which is explained in detail in the next section.


2.2. The A Trous Image Reconstruction Algorithm

The a trous image reconstruction algorithm is a three-dimensional wavelet transform [5]. Through successive three-dimensional low-pass filtering, it considers the image at several scales. The filtered values at each scale are added to the output only if they are still greater than a threshold. The flowchart for the algorithm is shown in Figure 1. The algorithm is described in more detail in the following paragraphs.

First the algorithm loads the original data cube as the input, and the values of the output data cube are initialised to zero. The data cube is operated on over several iterations of the outer loop, with the stopping criterion dependent on the change in MAD (Median Absolute Deviation) from one iteration to the next, and a minimum of two iterations. For each of these iterations, first the data cube values are set to the original input minus the current output and the scale is initialised to one. The wavelet values are calculated by convolving the data cube with a low-pass filter, and subtracting the filtered values from the values of the data cube. The distance between the elements used in the filter is dependent on the scale. A threshold is calculated from the median of the wavelet values. The wavelet values that are greater than the threshold are added to the output. The data cube is then updated by subtracting the wavelet values from the current data cube values.

The inner loop is repeated, incrementing the scale at each iteration, for a number of scales proportional to the logarithm of the shortest side length of the data cube. Once all the scales have been completed, the final filtered values are added to the output, without regard to a threshold. The stopping condition is then checked to see if another iteration of the outer loop should be performed. Once all the iterations are complete, the output data cube is returned. The exact operations performed by the algorithm are described in the next paragraph.

Starting with the original data cube as the input, it is then convolved with a discrete filter. Consider x, y and z as the coordinates of an element in the image cube; d_{s,l} is the real-valued data cube that comprises the image, at scale s and iteration l; α is the original input data cube; β_{l-1} is the output at the end of iteration l − 1 and the start of iteration l; W = ⌊l_f / 2⌋, where l_f is the one-dimensional length of the filter used; and f[i][j][k] are the coefficients of the three-dimensional filter. Then the values of the data cube are updated from one scale and iteration to the next according to the following equations:

d_{s+1,l}[x][y][z] = Σ_{i=−W}^{W} Σ_{j=−W}^{W} Σ_{k=−W}^{W} f[i][j][k] d_{s,l}[x + 2^{s−1}i][y + 2^{s−1}j][z + 2^{s−1}k]

d_{1,l}[x][y][z] = α[x][y][z] − β_{l−1}[x][y][z]    (1)

This data access pattern is demonstrated, in two dimensions, in Figure 2. If a required element is outside of the cube, then a reflected element is used instead. The number of scales, S, is dependent on the smallest side length of the data cube being examined. If the length of the shortest side is l_min, then the number of scales is S = ⌊log_2(l_min)⌋ − 1.


Figure 1: A Trous Image Reconstruction Flowchart. This flowchart shows the working of the a trous image reconstruction algorithm used by Duchamp. First it loads the original image as the input, and the values of the output data cube are initialised to zero. The image is operated on over several iterations of the outer loop, with the stopping criterion dependent on the change in MAD (Median Absolute Deviation) from one iteration to the next, and a minimum of two iterations. For each of these iterations, the data cube is convolved with a filter over several scales. At each scale, data cube values that are above a threshold are added to the output.


Figure 2: Duchamp Data Access Pattern. This diagram shows the values needed to calculate the next value of a given element. A two-dimensional data set is used instead of a three-dimensional one for clarity. Likewise, only the elements required for the first three scales are shown. The element marked X is the element whose next value is being calculated. The numbers indicate in which scale the surrounding elements are used. Elements that are used in two different scales still need to be read twice, as the values of the surrounding pixels will also have changed from one scale to the next. Note that the distance from the target element to the surrounding elements doubles with each scale. Also, each coloured element will require the black element for its own calculation, at that scale.


The wavelet coefficients, w_{s,l}, at scale s and iteration l, are then equal to the difference between the data cube values at successive scales:

w_{s,l}[x][y][z] = d_{s,l}[x][y][z] − d_{s+1,l}[x][y][z]    (2)

The wavelet coefficients are then added to the output array if and only if they are a certain threshold, t[s], above the median, m[w_{s,l}]. This threshold is a constant, based on the current scale, s. The increase in the output as a result of scale s in iteration l, Δβ_{s,l}, is therefore calculated according to the following equation:

Δβ_{s,l}[x][y][z] = w_{s,l}[x][y][z]  if w_{s,l}[x][y][z] > m[w_{s,l}] + t[s];  0  otherwise    (3)

The threshold is a constant value determined at the start of the program. It is dependent on the scale, and is multiplied by a value given in the program parameters.

The output is calculated for S scales. The final filtered values for the data are added to the output, so the total output for iteration l of the algorithm, β_l, is given according to the following equation:

β_{l+1}[x][y][z] = β_l[x][y][z] + d_{S+1,l}[x][y][z] + Σ_{s=1}^{S} Δβ_{s,l}[x][y][z]

β_0[x][y][z] = 0    (4)

The output after each iteration is calculated until the difference in the median absolute deviation from one iteration to the next is small enough. If M[x] is the median absolute deviation of x and τ is the tolerance, then the stopping condition is evaluated according to the equation:

|M[α − β_l] − M[α − β_{l−1}]| / M[α − β_l] < τ    (5)

The tolerance is specified in the input parameters for the program. With the default setting, the algorithm usually takes three or four iterations.
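To make the loop structure concrete, the following Python sketch follows the description above under stated assumptions: it is not Duchamp's actual implementation; the three-dimensional filter is assumed to be the outer product of a supplied one-dimensional filter filt1d; and the per-scale threshold constants are passed in as the hypothetical parameter t_scale, standing in for t[s] in Equation (3).

    import numpy as np
    from scipy.ndimage import convolve

    def atrous_reconstruct(cube, filt1d, t_scale, tol=0.002, min_iters=2):
        """Minimal sketch of the reconstruction described above (not Duchamp's code).
        cube    : 3-D float array, the input data cube (alpha).
        filt1d  : 1-D filter of odd length l_f; the 3-D filter is its outer product.
        t_scale : per-scale threshold constants, t[s] in Equation (3).
        tol     : stopping tolerance, tau in Equation (5)."""
        f3d = np.einsum('i,j,k->ijk', filt1d, filt1d, filt1d)
        S = int(np.floor(np.log2(min(cube.shape)))) - 1     # number of scales (Section 2.2)
        output = np.zeros(cube.shape)
        prev_mad, iteration = None, 0
        while True:
            iteration += 1
            data = cube - output                             # input minus current output
            for s in range(1, S + 1):
                spacing = 2 ** (s - 1)
                # Dilate the filter by inserting zeros between taps ("a trous" = with holes).
                dilated = np.zeros([(len(filt1d) - 1) * spacing + 1] * 3)
                dilated[::spacing, ::spacing, ::spacing] = f3d
                smoothed = convolve(data, dilated, mode='reflect')  # reflect at the edges
                wavelet = data - smoothed                    # Equation (2)
                thresh = np.median(wavelet) + t_scale[s - 1] # Equation (3) threshold
                output += np.where(wavelet > thresh, wavelet, 0.0)
                data = smoothed
            output += data                                   # final smoothed values, Equation (4)
            resid = cube - output
            mad = np.median(np.abs(resid - np.median(resid)))
            if prev_mad is not None and iteration >= min_iters:
                if abs(mad - prev_mad) / mad < tol:          # Equation (5)
                    break
            prev_mad = mad
        return output

The zero-padded copy of the filter reproduces the increasing element spacing of Equation (1), and scipy's 'reflect' mode approximates the edge-reflection behaviour described for Figure 2.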

2.3. Arithmetic Intensity

The measure used to match the algorithm to potential hardware is arithmetic intensity. This value compares computation to data transfer. The arithmetic intensity of an algorithm is the number of operations it requires per byte of data transferred. The arithmetic intensity of a computer system is its operational performance divided by its bandwidth. If the arithmetic intensity of the algorithm is greater than that of the system, then the problem is computationally bound, and excess bandwidth will be unused. Conversely, if the arithmetic intensity of the algorithm is less than that of the computer system, the algorithm is bandwidth bound, and excess computational performance will be unused. This metric can therefore be used to match an algorithm with suitable hardware.


Equation 6 denotes how the arithmetic intensity of an algorithm and of a computer system is calculated. a_a is the arithmetic intensity of the algorithm and a_c is the arithmetic intensity of the computer system. p is the number of operations required for the algorithm, usually counted as FLOPs, or Floating Point Operations. r is the number of bytes the algorithm needs to transfer. c is the computational power of the computer, in FLOP/s, Floating Point Operations per second. b is the bandwidth of the computer system. If a_a > a_c then the problem is computationally bound, and excess bandwidth will be unused. If a_a < a_c then the algorithm is bandwidth bound, and extra computational performance will not be used. This is how an algorithm can be matched to an appropriate computer system. The number of operations required, and the related computational complexity, is discussed in the next section.

a_a = p / r,    a_c = c / b    (6)
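As a minimal illustration of Equation (6), the following Python helpers (hypothetical names) compute the two intensities and report which side limits performance:

    def algorithm_intensity(p_flops, r_bytes):
        """a_a = p / r: operations per byte of data transferred (Equation 6)."""
        return p_flops / r_bytes

    def system_intensity(c_flops_per_s, b_bytes_per_s):
        """a_c = c / b: FLOP/s of compute per byte/s of bandwidth (Equation 6)."""
        return c_flops_per_s / b_bytes_per_s

    def limiting_factor(a_a, a_c):
        # a_a > a_c: computationally bound (spare bandwidth); a_a < a_c: bandwidth bound.
        if a_a > a_c:
            return "computationally bound"
        if a_a < a_c:
            return "bandwidth bound"
        return "balanced"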

2.4. Computational Complexity and Operation Counts

Computational complexity is a measure of how much effort is required to run an algorithm. It is often written as an upper bound, in big O notation. A function f(n) is of order O(g(n)) if Equation 7 holds for some constant k.

lim_{n→∞} f(n) ≤ k g(n)    (7)

The computational complexity of the a trous image reconstruction method used by Duchamp is O(VSL), where V is the number of elements in the image, S is the number of scales and L is the number of iterations of the outer loop required. This can be used to estimate the running time of the program, based on the running time of the program with different input. If t is the running time of the program with parameters V, S and L, and t_0 is the running time measured using parameters V_0, S_0 and L_0, then the running time can be estimated using Equation 8.

t ≈ t_0 (V S L) / (V_0 S_0 L_0)    (8)

Examining the source code also allows the operation counts to be determined. The filtering portion of the image reconstruction algorithm requires 2180VSL single-precision floating point operations and 250VSL double-precision floating point operations. Calculating the median requires 48V log_10(V)(S + 2)L single-precision floating point operations. This analysis considers the median algorithm to be a single-threaded implementation of introsort [6], followed by picking the middle element, as a worst-case scenario. Parallel, and more efficient, implementations exist, for example Bader, 2004 [7].

The single- and double-precision operations will be combined to derive an equivalent number of single-precision floating point operations. This will be done by considering a double-precision operation to be equivalent to two single-precision operations in the case of CPUs [8], two single-precision operations for nVidia GPUs [9] and five single-precision operations in the case of AMD GPUs [10]. This is because different processors perform double-precision floating point operations at different speeds relative to how fast they can perform single-precision floating point operations. These are the operation counts that will be used in calculating the arithmetic intensity of the image reconstruction algorithm.
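These counts translate directly into a few helper functions; the sketch below (hypothetical names) uses the constants quoted above and the stated double-precision weighting:

    import math

    def filter_ops_sp(V, S, L):
        """Single-precision FLOPs for the a trous filtering: 2180*V*S*L."""
        return 2180 * V * S * L

    def filter_ops_dp(V, S, L):
        """Double-precision FLOPs for the a trous filtering: 250*V*S*L."""
        return 250 * V * S * L

    def median_ops_sp(V, S, L):
        """Single-precision FLOPs for the medians: 48*V*log10(V)*(S+2)*L."""
        return 48 * V * math.log10(V) * (S + 2) * L

    def total_sp_equivalent_ops(V, S, L, dp_weight=2):
        """Combined count in equivalent single-precision FLOPs.
        dp_weight is 2 for the CPUs and nVidia GPUs considered, 5 for the AMD GPU."""
        return (filter_ops_sp(V, S, L) + median_ops_sp(V, S, L)
                + dp_weight * filter_ops_dp(V, S, L))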

3. Methodology

This section will describe how the Duchamp source finder program will be analysed. It will explain how the data cubes that are the input to Duchamp will be considered. Two models of the computing environment in which Duchamp will be run will be considered: a black box model and a node model. The two measures that will be applied to Duchamp, arithmetic intensity and computational complexity, were explained in the previous section.

This report will consider a data cube with two spatial dimensions, X and Y, and a frequency dimension, F. This results in an image cube having XYF elements. Each element has D single-precision values, where D is greater than or equal to one. These values specify different properties of the element, one of them denoting the brightness of that element. Therefore, with a single-precision floating point number requiring four bytes of storage, the total file size for the data cube is 4XYFD bytes. Because Duchamp only uses the brightness value for an element, only this value will be considered when determining how much memory Duchamp needs to store all its data.

3.1. Black Box Model

The first model is a black box model, as shown in Figure 3. This considers the computer system as a black box, with a certain computational rate and a bandwidth that determines the rate at which data is moved on and off the system. This model will compare the total number of floating point operations required by the algorithm to the amount of data transfer involved in moving the input data cube onto the system and moving the output data catalogue off the system. This model will help determine the overall performance of the potential system.

This model gives particular values to use to calculate the arithmetic intensities, as shown in Equation 6. p is the number of floating point operations required, as given in Section 2.4. The value of r is equal to the file size of the image cube, 4XYFD bytes, plus the file size of the catalogue, in bytes. Although only one value per element is used by Duchamp, this report will consider all D values per element being transferred to the computer system, as a worst-case situation. The value c is equal to the total computational performance of the computer system, in FLOP/s, and b is the bandwidth of the communication link that moves data on and off the system.
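As a worked sketch of this black box comparison for the ASKAP-sized cube, inlining the operation counts from Section 2.4 with the report's assumed D = 5, S = 10 and L = 4, and treating the catalogue as negligible:

    import math

    X, Y, F, D = 4096, 4096, 16384, 5
    V, S, L = X * Y * F, 10, 4

    # p: equivalent single-precision FLOPs, with the CPU weighting of 2 for double precision
    p = (2180 * V * S * L) + 2 * (250 * V * S * L) + 48 * V * math.log10(V) * (S + 2) * L
    r = 4 * X * Y * F * D                 # bytes moved onto the system (catalogue ignored)

    a_a = p / r                           # algorithm arithmetic intensity, Equation (6)
    b = 12.5e9                            # e.g. 100 Gigabit Ethernet, in bytes per second
    c_required = a_a * b                  # system FLOP/s needed to balance that link
    print(f"a_a = {a_a:.0f} FLOPs/byte, required performance = {c_required / 1e12:.1f} TFLOP/s")

With these numbers the required performance comes out near the 83.5 TFLOP/s quoted for the 100 Gigabit Ethernet case in the Discussion.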


3.2. Node Model

The second model is the node model, as shown in Figure 4. This concerns the computation once all the data is on the system. It considers a series of nodes, each capable of a certain computational performance. These are connected by an interconnect that has a certain bandwidth from one node to another. This model will compare the operations required by the algorithm against the data it needs to transfer. This model will also consider the memory requirements of the system.

Examining the source code shows that the image reconstruction algorithm keeps five single-precision values in memory for each element of the data cube it stores, so a node holding V elements requires 5 × 4V = 20V bytes of RAM to hold all the required data in memory. In determining the arithmetic intensity, the node model uses the same number of floating point operations as the black box model for p. The computational performance of the computer system, c, is the computational performance of a single node. The system bandwidth, b, is the bandwidth of the interconnect from one node to another. The amount of data transfer required, r, is more complicated, as it depends on the number of nodes used, and on how the elements in the data cube are distributed between the nodes. The amount of data transfer required is calculated in the following paragraphs.

This analysis will consider each node to hold an m × m × m cube of elements from the data cube, and to be responsible for performing the operations required for these elements. If there are a total of v elements in the data cube then the number of nodes used is n = v / m^3, so m = (v/n)^{1/3}. For a filter of length 2w + 1, for some positive integer w, to calculate the wavelet coefficient for a given element at scale s, the node holding that element needs the values of the elements 2^{s−1}, 2 × 2^{s−1}, 3 × 2^{s−1}, ..., w × 2^{s−1} values away, on either side, in each dimension. This requires that each node in the computing network store not only the data cube values for the elements it is responsible for computing, but also the elements that surround these in the data cube. These extra values that are not operated on by a node, but are used in calculations for the elements on that node, are called a halo. If a node needs elements up to d values away from its own elements in each of the three dimensions, and a node works on a cube of elements with side length m, then the elements a node in the computer network needs form a cube with side lengths of m + 2d elements.

The amount of data that needs to be transferred to a node is equal to the volume of the above cube, minus the size of the node's own cube, as the node already holds its own values. Therefore, the amount of data transfer needed per node, d_n, is:

d_n = (m + 2d)^3 − m^3 = 6m^2 d + 12m d^2 + 8d^3    (9)

The total amount of data needing transfer from one node to another is equal to the data transfer needed per node multiplied by the number of nodes. Thus the number of elements being transferred per pass of the filter, d_f, is equal to:

d_f = n d_n = (6m^2 d + 12m d^2 + 8d^3) n = 6n m^2 d + 12n m d^2 + 8n d^3    (10)
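A direct transcription of Equations (9) and (10) into Python (hypothetical function names), using m = (v/n)^{1/3} as defined above:

    def halo_elements_per_node(m, d):
        """Equation (9): the (m + 2d)^3 region a node reads, minus the m^3 block it holds."""
        return (m + 2 * d) ** 3 - m ** 3      # = 6*m^2*d + 12*m*d^2 + 8*d^3

    def elements_transferred_per_pass(v, n, d):
        """Equation (10): halo transfer summed over all n nodes, with m = (v/n)^(1/3)."""
        m = (v / n) ** (1.0 / 3.0)
        return n * halo_elements_per_node(m, d)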


Figure 3: Black Box Model. This figure shows the black box model. It compares the computation required to complete an algorithm on the input given against the data transfer required to move the input onto the computer system and to move the output from the system. This can be used to determine a potential performance for the computer system, when using a given data transfer technology.

Figure 4: Node Model. This figure shows the node model. It considers a number of computing nodes that are each capable of a certain computational performance, and are connected together with an interconnect of a certain bandwidth. Each node holds part of the data cube, and works with the other nodes to calculate the result. This model can be used to match the algorithm to a given computational performance, bandwidth, number of nodes in the network, and memory per node.


Substituting in m = (v/n)^{1/3} gives the amount of transfer per pass of the filter as a function of the total data cube size and the number of nodes:

d_f = 6n^{1/3} v^{2/3} d + 12n^{2/3} v^{1/3} d^2 + 8n d^3    (11)

Thus the amount of data transfer needed increases with the number of nodes. The maximum number of elements stored on a given node is limited by the amount of memory that node has, divided by the amount of memory it needs per element to be able to operate on that element.

For this algorithm, the distance is dependent on the scale: d = w 2^{s−1}, for s = 1 to S. The amount of data transferred at scale s, d_s, is:

d_s = 6n^{1/3} v^{2/3} w 2^{s−1} + 12n^{2/3} v^{1/3} w^2 2^{2s−2} + 8n w^3 2^{3s−3}    (12)

The amount of information transferred for each iteration of the main loop, d_l, is therefore:

d_l = Σ_{s=1}^{S} ( 6n^{1/3} v^{2/3} w 2^{s−1} + 12n^{2/3} v^{1/3} w^2 2^{2s−2} + 8n w^3 2^{3s−3} )    (13)

And so the total amount of information needing transfer for L iterations of the main loop, d_t, is:

d_t = L Σ_{s=1}^{S} ( 6n^{1/3} v^{2/3} w 2^{s−1} + 12n^{2/3} v^{1/3} w^2 2^{2s−2} + 8n w^3 2^{3s−3} )    (14)

The amount of data transfer required for a given scale reaches a maximum when all the data needed in that scale comes from elements that are stored in a different node from the centre element. When this happens, the data transfer for that scale stays constant as the number of processes increases. This is because each node needs data from w^3 − 1 other nodes, rather than needing all the values within a certain distance in the data cube. The data access patterns between nodes can be seen in Figures 5 and 6.

This calculation overestimates the amount of data transfer because it fails to account for the edge cases of the data cube. When an element near the edge of the data cube needs the value of an element that is outside the data cube, a reflected value is used instead. This reflected element may lie in the current node, or in data that has already been loaded from another node, meaning that the reflected value does not need to be loaded itself. The portion of the data that does not need to be loaded because of this increases as the number of elements per node increases relative to the total number of elements. Thus, the overestimation is greatest when the fewest nodes are used. This is why the amount of data transfer does not decrease to zero when the number of nodes is one.


Figure 5: Duchamp Node Data Access Pattern. This diagram shows the elements a node needs from other nodes, for different scales. A two-dimensional data set is shown instead of a three-dimensional data set for clarity. This example uses a node size value of m = 4, a filter of length 5, so w = 2, and shows four scales, S = 4. The boundaries between nodes are shown by the thick black lines. The black elements are the elements whose wavelet values are being calculated by a given node. The first scale uses the blue elements. The second scale uses the blue and red elements. The third scale uses the blue, red, green and yellow elements. The fourth scale requires the yellow and brown elements. Note how for the first three scales, all the elements within a certain distance around the node are needed, but at the fourth scale only certain blocks of values are needed, with 'gaps' in between. In this example, the data transfer between nodes reaches a maximum at the fourth scale, and remains the same for higher scales.


Figure 6: Duchamp Node Data Access Pattern. This diagram shows the elements a node needs to calculate the wavelet coefficients for its own elements, when the node size is not a power of two. This example uses a node size value of m = 5, a filter of length 5, so w = 2, and shows four scales, S = 4. The first scale uses the blue elements. The second scale uses the blue, red and orange elements. The third scale uses the blue, red, orange, yellow and green elements. The fourth scale uses the orange, green and brown elements. Note that the elements needed by a particular node do not align with the elements other nodes hold.


The data transfer for the median requires one transfer of the image each time the median is calculated. The amount of data transfer per calculation of the median is d_m = V. The median is calculated once per scale and twice at the end of each iteration, so the total number of elements needing transfer for calculation of the median, d_{m,t}, is:

d_{m,t} = d_m (S + 2) L = V (S + 2) L    (15)
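Equations (12) to (15) can be evaluated numerically with a short sketch (hypothetical names). Note that it implements the formulas as written, so it omits the per-scale transfer cap and the edge-reflection saving discussed above and therefore overestimates the transfer, particularly for small numbers of nodes:

    def filter_transfer_elements(v, n, w, S, L):
        """Equations (12)-(14): elements exchanged between nodes by the filtering,
        summed over S scales and L iterations of the outer loop."""
        total = 0.0
        for s in range(1, S + 1):
            d = w * 2 ** (s - 1)              # filter reach at scale s
            total += (6 * n ** (1 / 3) * v ** (2 / 3) * d
                      + 12 * n ** (2 / 3) * v ** (1 / 3) * d ** 2
                      + 8 * n * d ** 3)
        return L * total

    def median_transfer_elements(V, S, L):
        """Equation (15): one pass over the cube per median, (S + 2) medians per iteration."""
        return V * (S + 2) * L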

4. Results

There are a number of steps in analysing how a parallel implementation of Duchamp will perform. First the input to Duchamp will be defined, for use in the remainder of the testing. Preliminary testing was first performed to examine the single-threaded implementation, to determine its running time and most computationally intensive methods. Duchamp was then analysed according to the black box model to match the computational requirements of the entire computer system with the speed of the connection used to transfer data on and off the system. Duchamp was then considered using the node model. This model relates the number of nodes, the computational speed and memory of each node, and the speed of the interconnect between nodes.

There are two data cubes that will be considered in this analysis. The first is a data cube of the Virgo cluster, made from data from the HIPASS survey. This cube has spatial dimensions of X × Y = 318 × 355, and F = 256 frequency channels. From these dimensions, the number of scales for this cube, S, will be six. This cube only has one value per element, so D = 1 and the file size is 116MB. This data cube will be used to test the Duchamp program.

The second cube is a hypothetical cube that may be produced by ASKAP, to be used to estimate what hardware a computer system would need to process such a cube. This cube has spatial dimensions of X × Y = 4,096 × 4,096 and F = 16,384 frequency channels. This results in a number of scales, S, of ten. The ASKAP cube may have D = 5 values per element and a file size of 5.5TB. Of the five values, four are the Stokes parameters that determine the polarisation of the electromagnetic radiation, including the brightness, and the fifth is a weighting that measures how exposed that element was over the time the data cube was produced.
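For reference, the basic quantities for the two cubes can be reproduced with a few lines (a sketch; the scale counts S = 6 and S = 10 are taken from the text rather than recomputed):

    def cube_summary(X, Y, F, D):
        """Number of elements and file size (4 bytes per single-precision value)."""
        V = X * Y * F
        size_bytes = 4 * V * D
        return V, size_bytes

    hipass_V, hipass_bytes = cube_summary(318, 355, 256, 1)      # ~116 MB, S = 6
    askap_V, askap_bytes = cube_summary(4096, 4096, 16384, 5)    # ~5.5 TB, S = 10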

As a preliminary test, Duchamp was first run using the HIPASS cube as input. This was to provide an estimate of the time needed, and to check which methods used the majority of the computing time. This test was run on a system with a dual-core AMD 1.8GHz Opteron 265 processor with 4GB of DDR2 memory. The filter chosen for this test, and for the ASKAP data cube, was of length five, so w = 2. Running the Duchamp program on this system with the HIPASS cube as input took 30 minutes. This required three iterations of the outermost loop. This report will estimate that the ASKAP cube will require four iterations of the outermost loop. Using these values, an estimate can be made of how long Duchamp will run when processing an ASKAP cube. Using Equation 8, the estimate of the time taken, T, is:


T ≈ 30 × (4,096 × 4,096 × 16,384 × 10 × 4) / (355 × 318 × 256 × 6 × 3) minutes ≈ 634,000 minutes ≈ 440 days    (16)
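The same estimate can be reproduced with Equation (8) directly (a sketch using the figures quoted above):

    t0 = 30.0                                   # minutes: measured HIPASS run time
    V0, S0, L0 = 318 * 355 * 256, 6, 3          # HIPASS cube parameters
    V1, S1, L1 = 4096 * 4096 * 16384, 10, 4     # assumed ASKAP cube parameters

    t1 = t0 * (V1 * S1 * L1) / (V0 * S0 * L0)   # Equation (8)
    print(f"{t1:.0f} minutes, about {t1 / (60 * 24):.0f} days")   # ~634,000 minutes, ~440 days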

This test produces a catalogue output that is 70kB in size. This shows that for the black box model, the size of the output can be ignored, because it is negligible compared to the size of the input.

Duchamp was profiled with the gprof program in order to determine which method calls take the greatest portion of the running time. Analysis of the operation counts, in Section 2.4, suggests that the a trous image reconstruction algorithm and the calculation of the median would comprise the majority of the computational requirements of Duchamp. Executing Duchamp with the HIPASS data cube shows that the a trous image reconstruction algorithm takes 95% of the running time, including 17% for calculating the median. This confirms that these are the most time consuming parts of the Duchamp program.

The number of operations required is a function of the cube, and independent of the number of processors used. For the filtering, the HIPASS cube requires 1.23 single-precision TFLOPs and 130 double-precision GFLOPs. For calculating the median, it requires 248 single-precision GFLOPs. The ASKAP cube requires 24 single-precision PFLOPs and 2.75 double-precision PFLOPs for the filtering, and 7.24 single-precision PFLOPs for calculating the median.

The black box arithmetic intensity can be calculated from these variables. How the black box arithmetic intensity varies with the size of the data cube is shown in Figure 7. The black box arithmetic intensity of the HIPASS and ASKAP cubes is compared to different network technologies in Figure 8. The technologies shown are Gigabit Ethernet at 125MB/s, 10 Gigabit Ethernet at 1.25GB/s, the proposed 100 Gigabit Ethernet at 12.5GB/s [11] and InfiniBand QDR 4X at 4.00GB/s [12].

The node model arithmetic intensity of the algorithm changes with the number of nodes used, as the amount of data transfer varies. The amount of data transfer required is shown in Figure 9. Comparing this with the number of operations required, the arithmetic intensity can be calculated. The arithmetic intensity of the HIPASS and ASKAP cubes, as the number of nodes varies, is shown in Figures 11 and 12, respectively. Comparing these values against available hardware links a potential combination of hardware to the optimum number of nodes, as shown in Figure 13. The processors shown are an Intel Core i7 975, with a single-precision performance of 213 GFLOP/s [13], an nVidia Tesla C2070 with a single-precision performance of 1.26 TFLOP/s [14] and an AMD Radeon HD 5970 with a single-precision performance of 4.64 TFLOP/s [10]. The interconnects used are 10 Gigabit Ethernet at 2.50GB/s [11], InfiniBand QDR 4X at 8.00GB/s [12], PCI Express v2 x16 at 16.0GB/s [15], and the proposed 100 Gigabit Ethernet at 25.0GB/s [11]. These figures are twice the one-way bandwidth, because the Duchamp algorithm can benefit from transferring information in both directions with full-duplex interconnects. How the RAM requirement for each node varies with the number of nodes is shown in Figure 14.


Figure 7: Duchamp Black Box Computational Intensity. This graph shows how the approximate computational intensity for the Duchamp program varies with the number of elements in the data cube, measured as the number of combined single- and double-precision FLOPs required per byte of data transferred onto the computer system. This graph assumes that each element in the data cube has five single-precision floating point values associated with it, and that four iterations of the main loop are performed. Because of this, the HIPASS cube shown here shows less arithmetic intensity than it does in practice. This graph was made to determine how the balance of computation and data transfer varies for different data cube sizes.


Figure 8: Duchamp Black Box Technological Requirements. This graph compares the floating point performance of the computer system required to keep up with a given bandwidth that transfers data onto the computer system. The x-axis is the bandwidth used to transfer data onto the system in bytes per second and the y-axis is the floating point performance, in floating point operations per second. Also shown are several common network technologies and their bandwidths. This is made in order to match the connection bandwidth to the overall computational performance of the computer system.


Figure 9: Duchamp Data Transfer. This figure shows how the amount of data needing transfer from one node to another varies with the number of nodes. The x-axis is the number of nodes in the computer system and the y-axis is the amount of data, measured in terabytes. This graph calculates the amount of data transfer as the number of nodes in the system varies, to be used to calculate the arithmetic intensity of the image reconstruction algorithm when run on different numbers of nodes.


Figure 10: Duchamp Data Transfer for Large Numbers of Nodes. This diagram shows how the data transfer varies for large numbers of nodes. The x-axis is the number of nodes. Each line starts at one node and ends at a number of nodes equal to the number of elements in that data cube. The y-axis is the amount of data transferred from one node to another, in terabytes. This graph shows the data transfer required for greater numbers of nodes in order to show the effects of the image reconstruction algorithm reaching the maximum data transfer for a given scale.


Figure 11: HIPASS Computational Intensity. This graph shows the computational intensity of the Duchamp 3D image reconstruction algorithm, when run on the HIPASS data cube. This arithmetic intensity can be used to match the algorithm to suitable hardware. The first line shows the single-precision algorithmic intensity. The second line shows the double-precision algorithmic intensity. The third line shows the computational intensity of single- and double-precision operations together, counting one double-precision operation as two single-precision operations. These lines show the single- and double-precision operations because they are performed at different speeds by CPUs and GPUs.


Figure 12: ASKAP Computational Intensity. This graph shows the computational intensity of the Duchamp 3D image reconstruction algorithm, when run on the ASKAP data cube. This arithmetic intensity can be used to match the algorithm to suitable hardware. The first line shows the single-precision algorithmic intensity. The second line shows the double-precision algorithmic intensity. The third line shows the computational intensity of single- and double-precision operations together, counting one double-precision operation as two single-precision operations. These lines show the single- and double-precision operations because they are performed at different speeds by CPUs and GPUs.


Figure 13: Duchamp Optimum Number of Nodes. This graph shows the optimum number of nodes to use for the Duchamp algorithm on an ASKAP data cube. The number of nodes is a function of the bandwidth of the interconnect used and the chosen processor. If a greater number of nodes is used, then there is not enough bandwidth to keep up with the extra computational performance and increased data transfer. If fewer nodes are used, then bandwidth will go unused as the system waits for calculations to complete. The arithmetic intensity figure used for each processor takes into account the relative speed of single- and double-precision floating point operations on that processor. This graph uses the arithmetic intensity of the algorithm to match a processor speed and interconnect bandwidth to the number of nodes used to make up the system.


Figure 14: Duchamp Memory Requirements. This graph shows how many nodes are needed, for a given amount of RAM per node, to store all the information needed in RAM. The x-axis is the number of nodes in the computer system, and the y-axis is the amount of RAM each node needs to store all the data required by Duchamp. This RAM is used to avoid the longer access times of secondary storage. This relation between the amount of RAM needed and the number of nodes, and the maximum amount of RAM per node in available technology, forms a lower bound on the number of nodes that can be used to execute Duchamp.


5. Discussion

The black box arithmetic intensity is first calculated. A technology for the system bandwidth can be chosen, and from the black box arithmetic intensity the overall computational performance of the system can be determined. An additional choice of processor can then be made, and compared to the overall computational performance to determine how many nodes are needed. The node model is then considered. Using the node arithmetic intensity from this model, the interconnect bandwidth can be determined from the number of nodes and the performance of the chosen processor. The node model also determines the amount of memory each node will need, from the number of nodes used. Therefore the information obtained from these models is used to estimate a potential combination of hardware to be used to execute the Duchamp program.

5.1. Black Box Model

The arithmetic intensity of Duchamp is first considered using the black box model. This is done to match the computational performance of the entire system with the bandwidth of the connection that is used to transfer data to and from the system. The black box arithmetic intensity increases with the size of the data cube, as shown in Figure 7. This test is done to show how the black box arithmetic intensity of the algorithm varies with different-sized data cubes. The arithmetic intensity increases because the number of operations required is proportional to both the size of the data cube and the number of scales, while the amount of data transfer required is only proportional to the size of the data cube. The jumps in the graph occur as the data cube becomes large enough that another scale is needed for the filtering. This suggests that a proportionally faster computer system, compared to the bandwidth, can be used as the size of the image increases. For simplicity, this graph uses three assumptions to show the arithmetic intensity as a function of the data cube size. First, it assumes that the data cube has the same length in each of the three dimensions, so that the number of scales is only a function of the total number of elements in the cube, rather than depending on the smallest side length. The second assumption is that the number of values per element that need to be transferred is D = 5, and the third is that the number of iterations of the outermost loop is L = 4. Because of this, the arithmetic intensity shown here is only approximate. In particular, the arithmetic intensity of the HIPASS data cube is higher than that shown, because this graph overestimates how many values need to be transferred onto the system.

The actual arithmetic intensities of the HIPASS and ASKAP data cubes are shown in Figure 8. This graph shows that the black box arithmetic intensity of Duchamp using the HIPASS data cube is greater than that when using the ASKAP data cube. This is because the HIPASS cube requires less data transfer per element of the data cube. From this graph, a network technology can be chosen and the appropriate computational power of the system can be determined.

5.2. Node Model

We now consider Duchamp using the node model. In order to calculate the arithmetic intensity of the system, the number of operations and the amount of data transfer must be known. The number of operations required, as calculated from the equations in Section 2.4, is given in Section 4. The amount of data transfer is more complex and is shown in Figure 9. These graphs show how the amount of data transfer needed increases with the number of nodes. As the number of nodes increases, the amount of data transfer required approaches a linear increase with the number of nodes. There are sudden decreases in slope present in these graphs. These occur when the maximum data transfer is reached for a particular scale, so the data transfer for that scale stops increasing with the number of nodes. These changes can be seen more clearly in Figure 10. This plot shows the data transfer of Duchamp using the two data cubes, from using a single node for the entire data cube to using a single node for each element in the data cubes. Note that these plots overestimate the amount of data transfer required, particularly for low numbers of nodes. This is because this model does not account for reflection of edge values, where a needed value lies outside the data cube and the value of a reflected element is used instead. The reflected element may lie in the original node, or overlap with elements needed from another node.

With these results, the arithmetic intensity of the Duchamp algorithm when using the HIPASS and ASKAP data cubes as input can be calculated. The arithmetic intensity of Duchamp using the HIPASS data cube is shown in Figure 11 and the arithmetic intensity using the ASKAP data cube is shown in Figure 12. These figures show how the arithmetic intensity decreases as the number of nodes increases. This is because the number of operations is constant with the number of nodes, but the data transfer needed increases. These figures each show three lines: the arithmetic intensities calculated using the single-precision floating point operations, the double-precision operations, and the equivalent combined operation counts.

Comparing this arithmetic intensity against available hardware can be used to determine the optimum number of nodes. Figure 13 shows the optimum number of nodes for a given combination of processor and interconnect bandwidth. There is a sudden jump in the optimum number of nodes, near 2000 nodes. This is because a scale reaches its maximum data transfer, so the slope of the data transfer decreases and the arithmetic intensity of the Duchamp algorithm falls more slowly. The slope of the optimum number of nodes otherwise decreases, as the data transfer required increases.

The last factor in the node model is the amount of memory each node needs. Figure 14 shows how the amount of RAM each node needs varies with the number of nodes. Combined with the maximum amount of RAM per node available in current technology, this relation forms a lower limit on the number of nodes that can be effectively used to run Duchamp. As a node needs five single-precision floating point values for each element of the data cube it holds, the amount of RAM required is proportional to the number of elements in the image, and inversely proportional to the number of nodes in the computer system. The graph decreases in steps for low numbers of nodes because only integer numbers of nodes are considered.
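In code, this is just the 20 bytes per element from Section 3.2 divided evenly across the nodes (a sketch with hypothetical names):

    def ram_per_node_bytes(total_elements, n_nodes):
        """Five single-precision values (20 bytes) per element, spread over the nodes."""
        return 5 * 4 * total_elements / n_nodes

    # Example: the ASKAP cube over 392 nodes needs roughly 14 GB per node,
    # consistent with Figure 14 and the Discussion.
    print(ram_per_node_bytes(4096 * 4096 * 16384, 392) / 1e9)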

5.3. Hardware Choices

There are a number of constraints that affect the potential choice of hardware for running Duchamp on an ASKAP-sized data cube. The computer system should finish computation on the data cube in an amount of time equivalent to that needed to transfer the data cube onto the system. Considering the black box arithmetic intensity of Duchamp in Figure 8 and choosing a bandwidth technology gives a suggested performance for the entire computer system. The number of nodes can be determined from this by dividing the total required performance by the performance of a chosen processor. Neither the computing power nor the interconnects should go unused, so the arithmetic intensity of the nodes should match that of the algorithm. Therefore, by comparing the choice of processor and the number of nodes to Figure 13, the suggested interconnect bandwidth can be chosen. The last constraint is the amount of RAM each node needs. By comparing the number of nodes to Figure 14, the amount of RAM per node can be determined. For determining the potential hardware, this report will consider using 10 Gigabit Ethernet and 100 Gigabit Ethernet as the connections used to transfer data onto the computer system, and the Intel Core i7 975 CPU, the nVidia Tesla C2070 GPU and the AMD Radeon HD 5970 GPU as processors.
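
This sizing procedure can be written down directly, as in the sketch below. The black-box arithmetic intensity, the per-processor performance, the cube's memory footprint and the matching node intensity are all taken as parameters, since the report reads them from Figures 8, 13 and 14; the values in the example call are round placeholder numbers used only to show the flow of the calculation, not the report's derived figures. With the values actually read from those figures, this is the procedure that yields the configurations discussed below.

    import math

    def size_system(ingest_bw_bytes_per_s,   # bandwidth of the link feeding the system
                    blackbox_intensity,       # FLOP per byte, read from Figure 8
                    proc_flops,               # performance of one chosen processor
                    cube_bytes,               # total memory footprint of the data cube
                    node_intensity):          # algorithm FLOP per byte at the chosen node count
        # Whole-system performance needed so computation keeps pace with ingest.
        total_flops = ingest_bw_bytes_per_s * blackbox_intensity
        # Number of nodes needed to supply that performance.
        n_nodes = math.ceil(total_flops / proc_flops)
        # Interconnect bandwidth that matches the node's intensity to the algorithm's.
        link_bw_bytes_per_s = proc_flops / node_intensity
        # Memory per node if the cube is divided evenly across the nodes.
        ram_per_node_bytes = cube_bytes / n_nodes
        return n_nodes, link_bw_bytes_per_s, ram_per_node_bytes

    # Illustrative call with placeholder inputs (not the report's derived values).
    print(size_system(12.5e9, 6.7e3, 2.1e11, 5.5e12, 1.4e2))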

Using a 10 Gigabit Ethernet connection to transfer data onto the computer system suggests that the system should have a performance of 8.35 single-precision TFLOP/s. Using the Core i7 processor, this requires 40 nodes, a 627MB/s (5.02Gbit/s) interconnect bandwidth and 138GB of RAM on each node. Using the Tesla C2070 requires 7 nodes, a 2.61GB/s (20.9Gbit/s) interconnect and 786GB of RAM per node. Using the Radeon HD 5970 requires 2 nodes, a 6.85GB/s (54.8Gbit/s) interconnect and 2.75TB of RAM on each of the two nodes.

Using a 100 Gigabit Ethernet connection requires a computer system capable of 83.5 TFLOP/s of single-precision performance. With the Core i7 processors, a computer with the 100 Gigabit Ethernet connection will require 392 nodes, a 1.56GB/s (12.5Gbit/s) interconnect bandwidth and 14.1GB of RAM on each node. Using the Tesla C2070 requires 67 nodes, a 4.32GB/s (34.6Gbit/s) interconnect and 82.1GB of RAM per node. Using the Radeon HD 5970 as the processor requires 18 nodes, a 9.22GB/s (73.8Gbit/s) interconnect bandwidth and 306GB of RAM per node.

For the 10 Gigabit Ethernet system connection, the Tesla and Radeon processors both require an infeasible amount of RAM per node using commodity hardware. The Core i7 solution requires 138GB of RAM per node, which is a large amount but still possible with some current motherboards [16], and its 627MB/s (5.02Gbit/s) interconnect requirement is the lowest of all the considered combinations and can be met by 10 Gigabit Ethernet. For the 100 Gigabit Ethernet system connection, the Tesla solution requires 82.1GB of RAM per node, which is high but possible, and the Radeon solution requires 306GB per node, which is at the edge of what is currently possible. The interconnect bandwidth required by the Tesla and Radeon solutions is also considerable, but can be met with a 100 Gigabit Ethernet interconnect. The Core i7 solution requires the least RAM per node, which can feasibly be met. It also requires a 1.56GB/s network connection, which can be met using both directions of a full-duplex 10 Gigabit Ethernet connection, or an InfiniBand QDR 4X connection. Therefore, systems using a 100 Gigabit Ethernet connection, with nodes powered by either the Core i7 CPU or the nVidia Tesla GPU, are potential platforms for running Duchamp.

6. Conclusion

It is necessary to parallelise Duchamp in order to efficiently process an ASKAP-sized data cube. This is because running Duchamp with a single thread on an ASKAP data cube would take an excessive amount of time, around 440 days, and would require 5.5TB of RAM, far more than a single computer can provide. Because of this, potential hardware on which to run Duchamp in parallel must be determined.

A potential optimum hardware configuration is found by comparing the arithmetic intensity of the Duchamp algorithm to the arithmetic intensity of the hardware being considered for the computer system. This comparison shows that, of the hardware considered, the first potential combination is to use 392 nodes, each powered by an Intel Core i7 975 processor, possessing at least 14.1GB of RAM and a network that provides at least 1.56GB/s (12.5Gbit/s) of bandwidth to each node. The second potential combination is 67 nodes powered by nVidia Tesla C2070 GPUs, with at least 82.1GB of RAM per node and a network capable of supporting 4.32GB/s (34.6Gbit/s) of bandwidth to each node. In both cases, the entire computer system should be connected to the data store by a 100 Gigabit Ethernet link.

This suggests that the most pressing constraints on the system are the bandwidth of the connection used to transfer data on and off the computer system, and the amount of memory required to store all the data. To a lesser extent, the interconnect bandwidth is also a constraining factor. For a given processor, increasing the number of nodes reduces the amount of memory needed on each node, but increases the requirements for the interconnect bandwidth and the system bandwidth. The reverse is true when reducing the number of nodes. This implies that the optimum solution will need to balance both of these requirements.

This report has made several approximations, which may affect the results. The calculations for the number of operations and the amount of data transfer are inexact. In particular, the amount of data transfer is overestimated, especially for low numbers of nodes. Also, the computational performances of processors and the bandwidths of network connections used in this report are theoretical maximum values, rather than what is achievable in practice. The actual optimum combination of hardware will differ, depending on the difference between actual and theoretical performance for each of the technologies considered. This report also does not account for backplane scalability, and treats GPUs simply as faster processors, without considering the issue of transferring data to and from the GPU.

6.1. Future Work

There are a number of ways in which this work could be improved. In considering the effects of parallelising Duchamp, the algorithm itself has been left unchanged. Duchamp could be optimised, reducing the amount of work required and changing its arithmetic intensity. How the elements of the data cube are distributed among the nodes of the computer system could also be changed, to reduce the amount of data transfer required. The number of scales used by Duchamp depends on the shortest side length of the data cube. However, it is only effective to examine scales up to the size of the largest object that can be observed, as larger scales will not detect any objects. Duchamp could therefore improve performance by only examining scales up to that of the largest object that could be found in the data cube. There is also the possibility of splitting a large data cube into multiple smaller data cubes, then examining each of these smaller data cubes independently. This has the advantage of not requiring data transfer from one small cube to another, and a computer system would require less memory to process one of these smaller cubes. Finally, this analysis considers all the values in a data cube to be transferred to the computer system, although Duchamp only uses one value per element. It is possible that the data cubes could be altered before being transferred to the computer system, so that only the one needed value is transferred. This would increase the arithmetic intensity of the Duchamp algorithm in the black box model.
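
Two of these ideas can be quantified with the simple relations already used above. The sketch below uses the 20 bytes per element memory figure from the node model to show how splitting the cube into independent sub-cubes reduces the memory needed for any one piece, and how transferring only the single needed 4-byte value per element, rather than an assumed number of stored values, raises the black-box arithmetic intensity. The element count, operation count and values-per-element figure are all placeholders, not properties of the ASKAP data.

    BYTES_PER_ELEMENT_IN_RAM = 5 * 4   # five single-precision values held per element
    BYTES_PER_VALUE = 4                # the one single-precision value Duchamp needs

    def subcube_memory_gb(n_elements, n_subcubes):
        # Memory needed to process one sub-cube on its own; splitting the cube
        # means no data transfer between sub-cubes is required.
        return (n_elements / n_subcubes) * BYTES_PER_ELEMENT_IN_RAM / 1e9

    def blackbox_intensity(total_ops, n_elements, values_per_element):
        # FLOP per byte transferred onto the computer system.
        return total_ops / (n_elements * values_per_element * BYTES_PER_VALUE)

    n_elements = 1.0e11   # placeholder cube size
    total_ops = 1.0e15    # placeholder operation count

    print(subcube_memory_gb(n_elements, 8))               # one of 8 independent sub-cubes
    print(blackbox_intensity(total_ops, n_elements, 3))   # all stored values transferred
    print(blackbox_intensity(total_ops, n_elements, 1))   # only the needed value transferred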

For the storage requirements, this report only considers commodity RAM for memory, and assumes that hard disk storage is too slow for efficient processing. It is possible that some custom storage technologies, for example the products of Fusion-io [17], could be used to suit Duchamp's needs; this custom hardware should be considered. Another strategy would be to consider the use of local, secondary storage to hold the less frequently used values, reducing the amount of RAM required on each node. Additionally, this report only considers the amount of memory required by the algorithm and does not consider the RAM used by the computer system itself, which may increase the required amount of memory per node by about 2 to 4GB over the figures stated above.

This report only considers GPUs as processors of a particular speed, and does not consider the effects of transferring data to and from the GPU. This transfer may be a performance bottleneck, particularly as the amount of RAM on a GPU [18, 14] is relatively small compared to the amount of RAM required per host. This may require that data be frequently transferred to and from the GPU as it processes all the elements stored on its node. The effects of this data transfer on performance, and how they can be mitigated, will need to be considered for a GPU-based solution.
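
A rough feel for this bottleneck can be obtained by comparing the time needed to stage a node's data through the GPU's onboard memory over PCI Express with the time the GPU spends computing. The sketch below does this with assumed round numbers for the GPU memory size, PCIe bandwidth and per-node workload; none of these figures come from the report, so the result only illustrates the kind of check a GPU-based design would need. If the two times are comparable, the transfers would have to be overlapped with computation to avoid idling the GPU.

    import math

    def gpu_staging_estimate(node_data_bytes, gpu_mem_bytes,
                             pcie_bw_bytes_per_s, node_ops, gpu_flops):
        # Compare the PCIe time for one full pass over a node's data with the
        # GPU compute time for that node's share of the work.
        passes = math.ceil(node_data_bytes / gpu_mem_bytes)
        transfer_s = node_data_bytes / pcie_bw_bytes_per_s
        compute_s = node_ops / gpu_flops
        return passes, transfer_s, compute_s

    # Assumed round numbers (not from the report): 80 GB of node data staged
    # through a 6 GB GPU over an 8 GB/s PCIe link, against 1e13 operations
    # at a sustained 1 TFLOP/s.
    print(gpu_staging_estimate(80e9, 6e9, 8e9, 1e13, 1e12))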

The Duchamp source finder program is still under development. Any future changes to the algorithm used by Duchamp may change the parallel characteristics of the program and its arithmetic intensity, which will affect the optimum choice of hardware for running it. The optimum parameters for finding objects have also yet to be determined; changing the thresholds used in the image reconstruction algorithm may change the number of iterations of the outer loop. The current implementation uses 4-byte integer values for array indices, but some input data cubes may be too large for all the index values to fit in a 4-byte integer, requiring the use of 8-byte integers. This change would alter the vast majority of the single-precision operations to double-precision operations, greatly increasing the amount of equivalent single-precision processing required and changing the arithmetic intensity of the algorithm. Another analysis would need to be performed after these changes, to determine what the best hardware solution is. An entirely different source finding program may also be used in the ASKAP study, in place of Duchamp. In that case, the replacement program will need to be analysed to determine how it performs in parallel, and what the optimum hardware combination for running it is.
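
The index-width point is easy to check for a given cube: a 4-byte signed integer can index at most 2^31 - 1 elements, so any cube with more elements than that forces 8-byte indices. The sketch below encodes this check; the example dimensions are placeholders rather than actual ASKAP cube sizes.

    INT32_MAX = 2**31 - 1   # largest index representable in a 4-byte signed integer

    def index_bytes_needed(dim_x, dim_y, dim_z):
        # Return 4 if 4-byte indices suffice for a cube of these dimensions, else 8.
        n_elements = dim_x * dim_y * dim_z
        return 4 if n_elements <= INT32_MAX else 8

    # Placeholder cube dimensions for illustration only.
    print(index_bytes_needed(1024, 1024, 1024))    # ~1.1e9 elements: 4-byte indices suffice
    print(index_bytes_needed(6000, 6000, 16000))   # ~5.8e11 elements: 8-byte indices needed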

Determining a more accurate result for the optimum combination of hardware will require a more thorough investigation. This will involve the development of parallel prototypes of Duchamp, and benchmarking the technologies involved to determine their actual performance rather than their theoretical peak performance. This analysis would also have to consider a wider range of hardware, the use of multiple processors per node, and different network connection topologies. Additionally, a thorough analysis would consider the costs, both capital and running, of the different hardware involved against the time to completion of Duchamp.


Acknowledgments

I thank the iVEC Research Internships Program for providing this opportunity. I thank the iVEC staff for assisting us interns with our research projects, in particular Valerie Maxville for organising the internship program. I am grateful to iVEC for funding this project, to the Western Australian Supercomputing Program (WASP) for supplying equipment, and to Akos Bruz for his assistance in creating the podcast for this project. I thank Dr. Christopher Harris for supervising this project and for providing extensive feedback on this report. I am also grateful for the assistance of Dr. Martin Meyer, for giving his time to discuss source finders in radio astronomy.

References

[1] K. Rohlfs and T. L. Wilson, Tools of Radio Astronomy. Springer, fourth revised and enlarged ed., 2004.
[2] M. J. Meyer, M. A. Zwaan, R. L. Webster, L. Staveley-Smith, E. Ryan-Weber, M. J. Drinkwater, D. G. Barnes, M. Howlett, V. A. Kilborn, J. Stevens, M. Waugh, M. J. Pierce, et al., "The HIPASS catalogue - I. Data presentation," 2004.
[3] M. Whiting, "Duchamp." Available from http://www.atnf.csiro.au/people/Matthew.Whiting/Duchamp/, Last Visited 7 January 2010.
[4] M. Whiting, "Duchamp user's guide." Available from http://www.atnf.csiro.au/people/Matthew.Whiting/Duchamp/downloads/UserGuide-1.1.8.pdf, Last Visited 7 January 2010.
[5] J. Starck and F. Murtagh, Astronomical Image and Data Analysis. Springer, 1st ed., 2002.
[6] D. R. Musser, "Introspective sorting and selection algorithms," Softw. Pract. Exper., vol. 27, no. 8, pp. 983-993, 1997.
[7] D. A. Bader, "An improved, randomized algorithm for parallel selection with an experimental study," Journal of Parallel and Distributed Computing, vol. 64, no. 9, pp. 1051-1059, 2004.
[8] J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, and J. Dongarra, "Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems)," in SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, (New York, NY, USA), p. 113, ACM, 2006.
[9] nVidia, "Fermi compute architecture whitepaper, v1.1." Whitepaper, available from http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf, Last Visited 9 March 2010.
[10] AMD, "ATI Radeon HD 5970 graphics feature summary." Available from http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5970/Pages/ati-radeon-hd-5970-specifications.aspx, Last Visited 1 March 2010.
[11] LAN/MAN CSMA/CD (Ethernet) Access Method. IEEE Standard 802.3, 2008.
[12] InfiniBand Trade Association, "InfiniBand roadmap." Available from http://www.infinibandta.org/content/pages.php?pg=technology_overview, Last Visited 2 March 2010.
[13] Intel, "Intel Core i7-975 processor Extreme Edition." Available from http://ark.intel.com/Product.aspx?id=37153, Last Visited 3 March 2010.
[14] nVidia, "Tesla C2050 / C2070 GPU computing processor." Available from http://www.nvidia.com/object/product_tesla_C2050_C2070_us.html, Last Visited 9 March 2010.
[15] PCI-SIG, "PCI Express 3.0 frequently asked questions." Available from http://www.pcisig.com/news_room/faqs/pcie3.0_faq/, Last Visited 4 March 2010.
[16] Super Micro Computer, Inc., "Super Micro Computer, Inc. - Products - Motherboards - Xeon." Available from http://www.supermicro.com/products/motherboard/Xeon7000/, Last Visited 4 March 2010.
[17] Fusion-io, "Fusion-io." Available from http://www.fusionio.com/, Last Visited 11 March 2010.
[18] XFX, "Graphic cards - ATI Radeon HD 5000 - ATI Radeon HD 5970." Available from http://www.xfxforce.com/en-gb/products/graphiccards/HD%205000series/5970.aspx#2, Last Visited 8 March 2010.