
Parallel Astronomical Data Processing with Python: Recipes for multicore machines

Navtej Singh [a,1,*], Lisa-Marie Browne [a], Ray Butler [a]

[a] Centre for Astronomy, School of Physics, National University of Ireland - Galway, University Road, Galway, Ireland

Abstract

High performance computing has been used in various fields of astrophysical research. But most of it is implemented on massively parallel systems (supercomputers) or graphical processing unit clusters. With the advent of multicore processors in the last decade, many serial software codes have been re-implemented in parallel mode to utilize the full potential of these processors. In this paper, we propose parallel processing recipes for multicore machines for astronomical data processing. The target audience are astronomers who are using Python as their preferred scripting language and who may be using PyRAF/IRAF for data processing. Three problems of varied complexity were benchmarked on three different types of multicore processors to demonstrate the benefits, in terms of execution time, of parallelizing data processing tasks. The native multiprocessing module available in Python makes it a relatively trivial task to implement the parallel code. We have also compared the three multiprocessing approaches - Pool/Map, Process/Queue and Parallel Python. Our test codes are freely available and can be downloaded from our website.

Keywords: Astronomical data processing, Parallel computing, Multicore Programming, Python Multiprocessing, Parallel Python, Deconvolution

1. Introduction

In 1965, Gordon Moore predicted that the number of transistors in integrated circuits would double every two years (Moore, 1965). This prediction has proved true until now, although semiconductor experts[2] expect it to slow down by the end of 2013 (doubling every 3 years instead of 2). The initial emphasis was on producing single core processors with higher processing power. But with increasing heat dissipation problems and higher power consumption, the focus in the last decade has shifted to multicore processors, where each core acts as a separate processor.

[*] Corresponding author. Email addresses: [email protected] (Navtej Singh), [email protected] (Lisa-Marie Browne), [email protected] (Ray Butler)
[1] Telephone: +353-91-492532, Fax: +353-91-494584
[2] From the 2011 executive summary of the International Technology Roadmap for Semiconductors (http://www.itrs.net/links/2011itrs/2011Chapters/2011ExecSum.pdf)

Each core may have lower processing power compared to a high end single core processor, but it provides better performance by allowing multiple threads to run simultaneously, known as thread-level parallelism (TLP). At present, dual and quad core processors are commonplace in desktop and laptop machines, and even in the current generation of high end smart phones. With both Intel (Garver and Crepps, 2009) and AMD[3] working on next generation multicore processors, the potential for utilizing processing power in desktop machines is massive. However, traditional software for scientific applications (e.g. image processing) is written for single-core Central Processing Units (CPUs) and is not harnessing the full computational potential of multicore machines.

[3] Advanced Micro Devices


Traditionally, high performance computing (HPC) is done on supercomputers with a multitude of processors (and large memory). Computer clusters using commercial off the shelf (COTS) hardware and open source software are also being utilized (Szalay, 2011), and recently graphical processing unit (GPU) based clusters have been put to use for general purpose computing (Strzodka et al., 2005; Belleman et al., 2008). The advent of multicore processors provides a unique opportunity to move parallel computing to desktops and laptops, at least for simple tasks. In addition to hardware, one also needs specific software protocols and tools for parallel processing. The two most popular parallel processing protocols are the Message Passing Interface (MPI) and OpenMP. MPI is used on machines with distributed memory (for example, clusters) whereas OpenMP is geared towards shared memory systems.

Parallel computing has been used in different subfields of astrophysical research. Physical modeling and computationally intensive simulation code have been ported to supercomputers. Examples include N-body simulation of massive star and galaxy clusters (Makino et al., 1997), radiative transfer (Robitaille, 2011), plasma simulation around pulsars, galaxy formation and mergers, cosmology, etc. But most of the astronomical image processing, and general time consuming data processing and analysis tasks, are still run in serial mode. One of the reasons for this is the intrinsic and perceived complexity connected with writing and executing parallel code. Another reason may be that day to day astronomical data processing tasks do not take an extremely long time to execute. Irrespective of this, one can find a few parallel modules developed for astronomical image processing. The cosmic ray removal module CRBLASTER (Mighell, 2010) is written in C and based on the MPI protocol, and can be executed on supercomputers or cluster computers (as well as on single multicore machines). For co-addition of images, Wiley et al. (2011) proposed software based on the MapReduce[4] algorithm, which is geared towards processing terabytes of data (for example, data generated by big sky surveys like the SDSS[5]) using massively parallel systems.

[4] Model to process large data sets on a distributed cluster of computers
[5] SDSS: Sloan Digital Sky Survey [http://www.sdss.org/]

In this paper, we have explored the other end of the spectrum - single multicore machines. We are proposing a few recipes for utilizing multicore machines for parallel computation, to perform faster execution of astronomical tasks. Our work is targeted at astronomers who are using Python as their preferred scripting language and may be using PyRAF[6] or IRAF[7] for image/data processing and analysis. The idea is to make the transition from serial to parallel processing as simple as possible for astronomers who do not have experience in high performance computing. Simple IRAF tasks can be rewritten in Python to use parallel processing, but rewriting the more lengthy tasks may not be straightforward. Therefore, instead of rewriting the existing optimized serial tasks, we can use the Python multiprocessing modules to parallelize iterative processes.

In Section 2, we introduce the concept of parallel data processing and the various options available. Python multiprocessing is discussed in Section 3, with emphasis on the native parallel processing implementation. Three different astronomical data processing examples are benchmarked in Section 4. In Section 5, we discuss load balancing, scalability, and portability of the parallel Python code. Final conclusions are drawn in Section 6.

2. Parallel Data Processing

Processors execute instructions sequentially and therefore, from the initial days of computers to the present, most applications have been written as serial code. Generally, coding and debugging of serial code is much simpler than for parallel code. However, debugging is an issue mainly for parallel programs in which many processes depend on results from other processes; it is much less of an issue when simply processing large datasets in parallel. Moving to parallel coding not only requires new hardware and software tools, but also a new way of tackling the problem at hand.

[6] PyRAF is a product of the Space Telescope Science Institute, which is operated by AURA for NASA.
[7] IRAF is distributed by the National Optical Astronomy Observatories, which are operated by the Association of Universities for Research in Astronomy, Inc., under cooperative agreement with the National Science Foundation.


To run a program in parallel, one needs multiple processors/cores or computing nodes[8]. The first question one asks is how to divide the problem so as to run each sub-task in parallel.

[8] The terms processors and computing nodes will be used interchangeably in the rest of the paper.

Generally speaking, parallelization can be achieved using either task parallelization or data parallelization. In task parallelism, each computing node runs the same or different code in parallel. In data parallelism, the input data is divided across the computing nodes and the same code processes the data elements in parallel. Data parallelism is simpler to implement, as well as being the more appropriate approach in most astronomical data processing applications, and this paper deals only with it.

Considering a system with N processors or computing nodes, the speedup that can be achieved (compared to one processor) is:

    S = T_1 / T_N,    (1)

where T_1 and T_N are the code runtimes for one and N processors respectively. T_N depends not only on the number of computing nodes but also on the fraction of code that is serial. The total runtime of the parallel code using N processors can be expressed using Amdahl's law (Amdahl, 1967):

    T_N = T_S + T_P / N + T_sync,    (2)

where T_S is the execution time of the serial fraction of the code, T_P is the runtime of the code that can be parallelized, and T_sync is the time for synchronization (I/O operations etc.). The efficiency of the parallel code execution depends strongly on how well optimized the code is, i.e. the lower the fraction of serial code, the better. If we ignore synchronization time, theoretically unlimited speedup can be achieved as N tends to infinity by converting the serial code to completely parallel code. More realistically, T_sync can be modelled as K * ln(N), where N is the number of processors and K is a synchronization constant (Gove, 2010). This means that beyond a particular process count, the performance gain over serial code will start decreasing. Minimizing Equation (2) with respect to N gives:

    N = T_P / K    (3)

This means that the value of N up to which the parallel code scales is directly proportional to the fraction of code that is parallel and inversely proportional to the synchronization constant. In other words, keeping N constant, one can achieve better performance by either increasing the fraction of parallel code or decreasing the synchronization time, or both.
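As a quick illustration of Equations (1)-(3), the short sketch below evaluates the predicted wall time and speedup for increasing process counts. The values of T_S, T_P and K are made up for the example, not measurements from this paper; the speedup flattens once N approaches T_P / K:

    import math

    def runtime(n, t_serial, t_parallel, k_sync):
        # Predicted wall time on n processes: T_N = T_S + T_P/N + K*ln(N)
        return t_serial + t_parallel / n + k_sync * math.log(n)

    t1 = runtime(1, 2.0, 30.0, 0.5)
    for n in (1, 2, 4, 8, 16, 32):
        # Speedup S = T_1 / T_N
        print('%2d processes: S = %.2f' % (n, t1 / runtime(n, 2.0, 30.0, 0.5)))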

We have used multiprocessing instead of multithreading to achieve parallelism. There is a very basic difference between threads and processes. Threads are code segments that can be scheduled by the operating system. On single processor machines, the operating system gives the illusion of running multiple threads in parallel, but in actuality it switches between the threads quickly (time division multiplexing). In the case of multicore machines, however, threads run simultaneously on separate cores. Multiple processes are different from multiple threads in the sense that they have separate memory and state from the master process that invokes them (multiple threads use the same state and memory).

The most popular languages for parallel computing are C, C++ and FORTRAN. MPI as well as OpenMP protocols have been developed for these three languages. But wrappers or software implementations do exist to support interpreted languages like Python, Perl, Java etc. As mentioned in the introduction, the target audience of this paper is astronomers using Python as their language of choice, and/or IRAF. Although much more optimized and faster-executing parallel code can be written in FORTRAN, C or any other compiled language, the main objective here is to optimize the astronomer's time. The speed gain with compiled code comes at the cost of longer development time. The secondary objective is code reuse, i.e. using the existing Python code and/or IRAF tasks to parallelize the problem (wherever it is feasible).

3. Python Multiprocessing

Python supports both multi-threaded and multi-process programming. The threads in Python are managed by the host operating system, i.e. scheduling and switching of the threads is done by the operating system and not by the Python interpreter.


Python has a mechanism called the Global Interpreter Lock (GIL) that generally prevents more than one thread running simultaneously, even if multiple cores or processors are available (Python Software Foundation, 2012). This results in only one thread having exclusive access to the interpreter resources, or in other words resources are "locked" by the executing thread. The running thread releases the GIL either for I/O operations or during interpreter periodic checks (by default after every 100 interpreter ticks or bytecode instructions) (Beazley, 2006). The waiting threads can run briefly during this period. This unfortunately affects the performance of multi-threaded applications, and for CPU bound tasks the execution time may actually be higher than for serial execution. The performance deteriorates further on multicore machines, as the Python interpreter wants to run a single thread at a time whereas the operating system will schedule the threads simultaneously on all the available processor cores.

This is only true for the CPython implementation; PyPy[9], Jython[10], and IronPython[11] do not prevent multiple threads from running simultaneously on multiple processor cores. Jython and IronPython use the underlying threading model implemented in their virtual machines. But the default Python implementation on most operating systems is CPython, and therefore the rest of this paper assumes the CPython implementation. In addition, many scientific tools are only available for CPython, which is another reason why these alternative Python distributions are not widely used in scientific programming.
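The effect of the GIL on CPU bound code is easy to demonstrate. The following sketch, a toy example and not part of the paper's test codes, runs the same pure-Python loop in four threads and then in four processes; under CPython the threaded version gains little or nothing on a multicore machine, while the process version does:

    import time
    from threading import Thread
    from multiprocessing import Process

    def burn(n=10**7):
        # Pure-Python CPU bound loop; the GIL serializes it across threads
        total = 0
        for i in range(n):
            total += i * i

    def timed(worker_cls, count=4):
        workers = [worker_cls(target=burn) for _ in range(count)]
        start = time.time()
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        return time.time() - start

    if __name__ == '__main__':
        print('threads   : %.2f s' % timed(Thread))
        print('processes : %.2f s' % timed(Process))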

A better option to get parallel concurrency is to use Python's native multiprocessing module. Other standalone modules for parallelizing on shared memory machines include Parallel Python, pyLinda, and pyPastSet. Python multiprocessing and Parallel Python can also be used on a cluster of machines.

Another parallelizing option, for distributed memory machines, is message passing. A Python MPI implementation or wrappers (e.g. PyCSP (Vinter et al., 2009), mpi4py (Dalcin et al., 2008), pupyMPI (Bromer et al., 2011), and Pypar (Nielsen, 2003)) can be used for this purpose. Parallelization can also be achieved by vectorizing the tasks using NumPy. Vectorization is an efficient and optimized way of replacing explicit iterative loops in Python code. However, not all operations can be parallelized in NumPy/SciPy[12].

[9] Just-in-time (JIT) compilation implementation
[10] Python implementation for the Java virtual machine
[11] Python implementation for the .NET framework

We have used the native multiprocessing module and the standalone Parallel Python module to achieve parallelization. For comparison purposes, we have implemented parallel code for four astronomy routines.

The multiprocessing module is part of Python 2.6 and onwards, and backports exist for versions 2.4 and 2.5[13]. Multiprocessing can also be used with a cluster of computers (using the multiprocessing Manager object), although the implementation is not trivial. This paper deals only with shared memory or symmetric multiprocessing (SMP) machines, although detailed information about computer clusters for astronomical data processing can be found elsewhere[14].

A few approaches exist in the Python multiprocessing library to distribute the workload in parallel. In this paper we will be considering two main approaches - a pool of processes created by the Pool class, and individual processes spawned by the Process class. The Parallel Python module uses inter-process communication (IPC) and dynamic load balancing to execute processes in parallel. Code implementation using these three approaches is discussed in detail in the following sub-sections.

[12] Refer to http://www.scipy.org/ParallelProgramming for more details
[13] More details on http://pypi.python.org/pypi/multiprocessing
[14] One such document can be found on our website - http://astro.nuigalway.ie/staff/navtejs

3.1. The Pool/Map Approach

Out of the two multiprocessing approaches, this is the simplest to implement. The Pool/Map approach spawns a pool of worker processes and returns a list of results. In Python functional programming, a function can be applied to every item of an iterable using the built-in map function. For example, instead of running an iterative loop, the map function can be used:


Listing 1: Example Python iterative function

    # Iterative function
    def worker( indata ):
        ...
        return result

    # Input dataset divided into chunks
    lst = [ in1, in2, ... ]

    # Loop over lst items and append results
    results = []
    for item in lst:
        results.append( worker( item ) )

Listing 2: Python map function

    # Iterative function
    def worker( indata ):
        ...
        return result

    # Input dataset divided into chunks
    lst = [ in1, in2, ... ]

    # Iteratively run worker in serial mode
    results = map( worker, lst )

The map function is extended to the multiprocessing module and can be used with the Pool class to execute worker processes in parallel, as depicted in Listing 3.

Listing 3: Pool/Map multiprocessing implementation

    # Import multiprocessing module
    import multiprocessing as mp

    # Get number of processors on the machine
    # or manually enter required number of processes
    ncpus = mp.cpu_count()

    # Define pool of ncpus worker processes
    pool = mp.Pool( ncpus )

    # Start the pool of ncpus worker processes in parallel;
    # output is appended to the results python list
    results = pool.map( worker, lst )

The import command includes the multiprocessing module in the routine, the cpu_count method gets the number of processors or cores on the machine, the Pool class creates a pool of ncpus processes, and Pool's map method iterates over the input element list in parallel, mapping each element to the worker function. The number of worker processes spawned can be more than the number of cores on the machine but, as we will see in Section 4, best performance is achieved when the number of processes is equal to the number of physical processor cores or the total number of concurrent threads.

3.2. The Process/Queue Approach

The Pool/Map approach allows only one argument as an input parameter to the calling function. There are two ways to send multiple arguments: pack the arguments in a python list or a tuple (a short sketch of this alternative is given at the end of this subsection), or use the Process class in conjunction with a queue or a pipe. Although a process can be used without queues and pipes, it is good programming practice to use them[15]. Two FIFO (First In, First Out) queues are created - one for sending input data elements and another for receiving output data. Parallel worker processes are started using the Process class, and smaller chunks of input data are put on the send queue for processing. Each worker process picks the next data chunk from the queue after processing the previous chunk. The output result is put on the receive queue, and then read at the end for post-processing.

[15] The Python documentation suggests avoiding synchronization locks and instead using a queue or a pipe (http://docs.python.org/library/multiprocessing.html#programming-guidelines/)

An example Python code listing for the Process/Queue approach is shown below:

Listing 4: Process/Queue multiprocessing implementation

    # Import multiprocessing module
    import multiprocessing as mp

    # Worker function
    # iter is a standard python built-in function
    def worker( s_q, r_q, arg1, arg2, ... ):
        for value in iter( s_q.get, 'STOP' ):
            ...
            r_q.put( result )

    # Get number of cores on the machine
    ncpus = mp.cpu_count()

    # Create send and receive queues
    send_q = mp.Queue()
    recv_q = mp.Queue()

    # Create and start ncpus worker processes
    for i in range( ncpus ):
        p = mp.Process( target = worker,
                        args = ( send_q, recv_q, arg1, arg2, ... ) )
        p.start()

    # Put input data chunks on the send queue
    # indata is a python list of the input data set,
    # already divided into smaller chunks
    for chunk in indata:
        send_q.put( chunk )

    # Get output from the receive queue
    results = []
    for i in range( len( indata ) ):
        results.append( recv_q.get() )

    # Stop all the running processes
    for i in range( ncpus ):
        send_q.put( 'STOP' )

The code is self-explanatory, but the main point to notice is the code related to stopping the running processes. STOP (or any other sentinel value) can be put on the send queue to stop the processes, as the worker function reads from the queue until it encounters STOP. We have found that in certain conditions the Process/Queue approach performs better than the Pool/Map approach, as shown in Section 4.
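For completeness, the tuple-packing alternative mentioned at the start of this subsection can be sketched as follows. The worker and argument names here are purely illustrative; each element passed to Pool.map is a single tuple that the worker unpacks itself:

    import multiprocessing as mp

    def worker(packed):
        # Unpack the single map() argument into its components
        chunk, zeropoint, exptime = packed
        return [(value + zeropoint) / exptime for value in chunk]

    if __name__ == '__main__':
        chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
        tasks = [(c, 25.0, 100.0) for c in chunks]
        pool = mp.Pool(mp.cpu_count())
        results = pool.map(worker, tasks)
        pool.close()
        pool.join()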

3.3. Parallel Python Approach

Parallel Python is an open source cross-platform module for parallelizing python code. It provides dynamic computation resource allocation as well as dynamic load balancing at runtime. In addition to executing programs in parallel on symmetric multiprocessing (SMP) machines, it can also be used on clusters of heterogeneous multi-platform machines (Vanovschi, 2013).

Processes in Parallel Python run under a job server. The job server is started locally (or remotely, if running on a cluster of machines) with the desired number of processes (ideally equal to the number of processor cores). A very basic code template is shown below:

Listing 5: Parallel Python implementation

    # Import Parallel Python module
    import pp

    # Parallel Python worker function
    def worker( indata ):
        ...
        return result

    # Create an empty tuple (no remote pp servers, local execution only)
    ppservers = ()

    # Either manually set the number of processes or
    # default to the number of cores on the machine
    if ncpus:
        job_server = pp.Server( int( ncpus ), ppservers = ppservers )
    else:
        job_server = pp.Server( ppservers = ppservers )

    # Divide data into smaller chunks for better
    # performance (based on the scheduler type)
    chunks = getchunks( infile, job_server.get_ncpus(), scheduler )

    # Start the worker processes in parallel
    jobs = []
    for value in chunks:
        indata = ( arg1, arg2, ... )
        jobs.append( job_server.submit( worker, ( indata, ),
                     ( func1, func2, ... ), ( mod1, mod2, ... ) ) )

    # Append the results
    results = []
    for job in jobs:
        results.append( job() )

The number of processes (ncpus) is passed to the program as user input, or defaults to the number of cores on the machine. The input data is divided into smaller chunks for better load balancing. The Parallel Python job server starts the worker processes for parallel execution. At the end of the execution, results are retrieved from the processes. The parameters to the job server's submit method are the worker function, its arguments, and any other functions or modules used by the worker function.
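A minimal runnable Parallel Python example, independent of the template above and using a toy workload (the function and numbers are illustrative only), might look like this:

    import pp

    def sum_squares(lo, hi):
        return sum(x * x for x in range(lo, hi))

    # With no arguments, pp.Server() autodetects the number of cores
    job_server = pp.Server()

    # Submit eight independent jobs and collect their results
    jobs = [job_server.submit(sum_squares, (i * 100000, (i + 1) * 100000))
            for i in range(8)]
    total = sum(job() for job in jobs)
    print('total = %d' % total)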

4. Benchmarking

To benchmark the parallel processing approaches described in the previous section, three different kinds of astronomical problems, of varied complexity, were parallelized. Three machines of different configuration were used to benchmark the code. The hardware and software configuration of the test machines is listed in Table 1.

The homebuilt machine was over-clocked to 3.51 GHz and was running the Ubuntu 11.10 operating system (OS). The Dell Studio XPS was running the same OS, but as a guest OS in a VirtualBox[16] virtual machine on a Windows 7 host. Out of the 6 GB of RAM on the Dell machine, 2 GB was allocated to the guest OS. The iMac was running Mac OS X Mountain Lion. For astronomical data and image processing, existing routines from the ESO Scisoft 7.6[17] software package were used.

[16] VirtualBox is open source virtualization software. More details on https://www.virtualbox.org/
[17] The ESO Scisoft software package is a collection of astronomical utilities distributed by the European Southern Observatory

Machine         | Processor                                                    | Memory          | Operating System
Homebuilt       | AMD Phenom II X4 B60 quad core processor                     | 4 GB @ 1066 MHz | Ubuntu 11.10 32-bit, Linux kernel 3.0.0-16-generic-pae
Dell Studio XPS | Intel Core i7 920 quad core processor (8 threads) @ 2.67 GHz | 6 GB @ 1066 MHz | Ubuntu 12.10 32-bit, Linux kernel 3.5.0-22-generic
iMac 21.5 inch  | Intel Core i5-2400S quad core processor                      | 4 GB @ 1333 MHz | Mac OS X 10.8.1 64-bit, Darwin 12.1.0

Table 1: Hardware and software configuration of the SMP machines used for benchmarking the parallel code.

This version of Scisoft uses Python 2.5, and therefore the Python multiprocessing backport was installed separately. The latest version of Scisoft (version 7.7, released in March 2012) has been upgraded to Python 2.7 and therefore does not require separate installation of the multiprocessing module. The iMac machine was using the latest version of Scisoft.

All of our benchmarking examples concern a class of parallel work flow known as embarrassingly parallel problems. In simple terms this means that the problem can easily be broken down into components to be run in parallel. The first astronomy routine is a parallel implementation of coordinate transformation from charge-coupled device (CCD) pixel coordinates to sky coordinates (RA and DEC), and vice versa, for Hubble Space Telescope (HST) images, for a large number of input values. The second routine parallelized a Monte Carlo completeness test. The third routine is a parallel implementation of sub-sampled deconvolution of HST images with a spatially varying point spread function (PSF). These routines are described in the following sub-sections. They are freely available and can be downloaded from our website[18].

4.1. PIX2SKY and SKY2PIX

The STSDAS[19] package for IRAF includes the routines xy2rd and rd2xy to transform HST CCD image pixel coordinates to sky coordinates, and vice versa. These tasks can only process one coordinate transformation at a time. Running them serially, or even in parallel, to process hundreds to thousands of transformations (e.g. star coordinate transformations in massive star clusters) is not efficient, as it will be performing the same expensive FITS header keyword reads on each call. Non-IRAF routines to transform lists of coordinates do exist, but they default to serial processing. Two such routines (xy2sky and sky2xy, implemented in C) are part of the WCSTools[20] software. A pure Python implementation exists in the pywcs module, which is part of the astropy package[21].

[18] http://astro.nuigalway.ie/staff/navtejs
[19] Space Telescope Science Data Analysis System

As these IRAF routines are not lengthy, they can easily be re-written for multicore machines to process a large input dataset in parallel. We have implemented two tasks - PIX2SKY (based on xy2rd) and SKY2PIX (based on rd2xy) - in Python, using both the multiprocessing and Parallel Python modules. These routines read input files with either [X, Y] pixel coordinates (PIX2SKY) or [RA, DEC] values (SKY2PIX), and output the transformed coordinates. We have used the PyFITS[22] module from the Space Telescope Science Institute (STScI) for handling FITS images.
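The structure of such a routine can be sketched with the modern astropy interface as follows. This is not the paper's PIX2SKY code (which uses PyFITS and the xy2rd algorithm directly); the file names, column layout and chunking factor are assumptions, but the sketch shows the same pattern of mapping coordinate chunks onto a pool of workers:

    import multiprocessing as mp
    import numpy as np
    from astropy.io import fits
    from astropy.wcs import WCS

    IMAGE = 'hst_image.fits'      # hypothetical input image
    COORDS = 'xy_coords.txt'      # hypothetical two-column [X, Y] list

    def worker(chunk):
        # Each process reads the header itself, so only small arrays are pickled
        wcs = WCS(fits.getheader(IMAGE))
        ra, dec = wcs.all_pix2world(chunk[:, 0], chunk[:, 1], 1)  # 1-based pixels
        return np.column_stack((ra, dec))

    if __name__ == '__main__':
        xy = np.loadtxt(COORDS)
        ncpus = mp.cpu_count()
        chunks = np.array_split(xy, ncpus * 20)   # guided-style small chunks
        pool = mp.Pool(ncpus)
        radec = np.vstack(pool.map(worker, chunks))
        pool.close()
        pool.join()
        np.savetxt('radec_out.txt', radec)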

The speedup factor (as defined in Section 2) for PIX2SKY and SKY2PIX is plotted against the number of processes in Figure 1. For this benchmarking, one million input coordinates were fed into the Parallel Python based transformation modules with guided scheduling.

Best performance was achieved when the number of processes was equal to the number of physical cores on the machine. The Intel Core i7 based machine showed better speedup than the AMD Phenom based machine. This is because Intel Core i7 processors use hyper-threading technology[23] and have 2 threads per core. Theoretically, we would have expected close to eight-fold speedup on the Intel Core i7 quad-core processor (with hyper-threading), but bottlenecks like I/O operations restrict higher performance gains (keeping the fraction of code running in parallel constant).

[20] More details on http://tdc-www.harvard.edu/wcstools/
[21] Refer to http://www.astropy.org/ for more details
[22] STScI's python module for working with FITS files
[23] Proprietary technology of Intel Corporation to allow multiple threads to run on each core

Figure 1: Coordinate transformation benchmark, using the Parallel Python approach with a guided scheduler. (a) PIX2SKY and (b) SKY2PIX speedup on the three test machines. Data points are the average of 50 runs. As expected, maximum speedup occurs when the number of processes is equal to the number of physical cores on the machine. Of course, load balancing and the serial to parallel code fraction also influence the speedup.

As the number of processes increases beyond the number of cores, the speedup factor is almost flat. This is a result of dynamic load balancing in Parallel Python, which is not the case for the Process/Queue approach (as we will see in Section 5.3).

Another interesting point regarding Figure 1 is the somewhat inferior performance of the Core i5 processor compared to the other two quad core processors. We could not do a direct comparison between the Core i5 processor and the other two processors, as Turbo Boost[24] and Speed Step[25] are turned on by default in the iMac machine (and cannot be switched off). For the other two processors, Speed Step technology was turned off on the Core i7 machine and Cool'n'Quiet[26] was switched off on the AMD machine, to prevent CPU throttling.

[24] Proprietary technology of Intel Corporation to run processors above their base frequency
[25] Refer to http://en.wikipedia.org/wiki/SpeedStep for more details
[26] Refer to http://www.amd.com/us/products/technologies/cool-n-quiet/Pages/cool-n-quiet.aspx for more details

4.2. Completeness Test

In more complex or lengthy problems, it makes much more sense to reuse existing tasks or optimized serial codes, rather than re-writing code as we did in Section 4.1. A good example is a completeness test - the method to determine the detection efficiency of different magnitude stars in images. It is a Monte Carlo class of algorithm, i.e. it involves repeated random execution. The basic idea behind the method is to add artificial stars of varied magnitudes at random positions within the image, and then determine their recovery rate as a function of magnitude. For the test to be statistically relevant, an average over a few hundred random iterations of modified images is taken.

To reuse the existing code, the addstar task in the DAOPHOT package of IRAF was used to add artificial stars to the image in question, and the daofind and phot tasks in DAOPHOT were used to detect and match the added stars. One way to achieve parallelism in this case is to run multiple iterations in parallel on the available computing nodes, where each node gets a complete image.


Figure 2: HST WFPC2 co-added image of the globular cluster M71 (NGC 6838) in the F555W filter, used for benchmarking the completeness routine. The test was performed on a 512×512 image section (lower left). This image plus 50 artificial stars between 14th and 28th magnitude, generated in one of the Monte Carlo iterations, is shown in the lower right.

For the benchmark, we used the reduced and co-added HST WFPC2[27] (chip PC1) image of the galactic globular star cluster M71 (NGC 6838) in the F555W filter (shown in Figure 2). The completeness test was performed on a 512×512 pixel uncrowded section of the image, with 14th to 28th magnitude stars in each iteration. Fifty stars (a sufficiently small number to avoid changing the star density appreciably) were added randomly to the image section (shown in Figure 2), and 100 iterations per magnitude interval were run. As in the previous benchmark, we used the Parallel Python approach with the guided scheduler.

In Figure 3(a), we plot the completeness test results, computed by averaging results from all the iterations, with a 0.5 magnitude bin interval. Figure 3(b) shows that maximum speedup is achieved when the number of processes is equal to the number of physical cores, or the total number of threads (in the case of the Intel Core i7 machine). All three processors show close to 2× speedup with 2 spawned processes. Beyond four processes, the speedup flattens out for the AMD and Core i5 processors, whereas it keeps on increasing for the Core i7 machine (although not as steeply as from 1 to 4 processes). This is explained by Equation 3 - the number of processes is scaled on the parallel code fraction and synchronization time.

[27] Wide Field Planetary Camera 2

4.3. Parallel Sub-Sampled Deconvolution

Image deconvolution is an embarrassingly parallel problem, as image sections can be deconvolved in parallel and combined at the end. In addition to being under-sampled, HST detector point spread functions are spatially varying. To recover resolution lost to aberration and poor sampling in HST images, Butler (2000) proposed an innovative sub-sampled deconvolution technique. We have implemented a parallel version of this technique.

To deconvolve HST images with a spatially varying PSF, we wrote a parallelized version of the sub-sampled image deconvolution algorithm. The Maximum Entropy Method (MEM) implementation in the STSDAS package was used for deconvolution. As it only uses a spatially invariant PSF, highly overlapping 256×256 sub-images were deconvolved with the appropriate PSF for that position on the CCD. The deconvolved sub-images were reassembled to generate the final sub-sampled deconvolved image.

For benchmarking, we used a coadded HST WFPC2 PC1 chip image of the globular cluster NGC 6293. A spatially varying analytical PSF model was generated from the TinyTim[28] PSF grid. The Parallel Python module with the guided scheduler was used for benchmarking.

[28] TinyTim is a point spread function modelling tool for HST instruments


Figure 3: (a) Detection completeness plot for the images in Figure 2, with artificial star magnitudes varying from 14.0 to 28.0 mag within 0.5 mag intervals; (b) Speedup achieved on the test machines as a function of the number of parallel processes. The Parallel Python approach with guided scheduler was used for benchmarking.

Figure 4: (a) HST WFPC2 PC1 coadded image of globular cluster NGC 6293 in the F814W filter. A normal sampled 4.6″ × 4.6″ image section is shown in the upper left hand corner, and the 2× sub-sampled deconvolved section in the upper right hand corner; (b) Sub-sampled deconvolution benchmark. The AMD quad core performance levels out after four processes, whereas the Intel quad core achieves a speedup factor of almost 5 for eight processes. The Parallel Python approach with guided scheduler was used for benchmarking.


The central region of NGC 6293, along with a high resolution section of the normal sampled image and of the sub-sampled deconvolved image, is shown in Figure 4(a). The speedup factor levels out close to 4 for the AMD machine, whereas it levels out at 5 for the Intel Core i7 machine (see Figure 4(b)). In general, embarrassingly parallel problems provide the best speedup, as a higher fraction of the code runs in parallel.

5. Discussion

5.1. Ease of Implementation

As shown in Section 3, parallelizing existing python serial code can be pretty straightforward for problems that can easily be broken down into smaller parts. This is especially true if the multiprocessing Pool/Map approach is used on existing functional python code. Python includes excellent code profilers - cProfile and Profile. These, along with the pstats module, can be used to optimize the parallel code as well as the serial code.

Astronomers using IRAF scripts now have another reason to move to PyRAF, and to the python scripting language in general. The PyRAF command line interface allows execution of IRAF CL scripts, and also gives the full flexibility and power of the python language. STScI has a very useful PyRAF introductory tutorial[29] on their website. Another reason for using python is that it is an open source language, supported by community development, with most of the modules available for free. The Python modules NumPy and SciPy comprise an extensive collection of numerical and scientific functions[30].

IRAF saves task parameter values in global parameter files. This could lead to deadlock if two instances of the same task try to gain write access to the parameter file at the same time. This can be avoided in PyRAF by explicitly specifying all the parameters in the task call, thus avoiding access to the global parameter file.

5.2. Load Balancing

Load balancing in parallel computing is the distribution of the workload over the computing nodes to optimize performance. It is an important issue to deal with while designing a parallel application. On multicore machines, the cores will not all perform identically, as other activities and tasks are running concurrently, and a parallel program will only run as fast as the slowest core allows. Therefore, efficient load balancing of the workload across the processor cores is very important. On shared memory machines, the OpenMP protocol has implemented four types of schedulers or load balancing routines - static, guided, dynamic and runtime (Chandra et al., 2001). Along the same lines, we have implemented static and guided scheduler routines to slice large input datasets into smaller chunks and distribute them over the computing nodes. In the static scheduler, equal chunks of data are distributed to each computing node. In the guided scheduler, instead of distributing equal chunks in one go, the dataset is divided into much smaller chunks to provide better performance. These two schedulers were used in both the PIX2SKY and SKY2PIX routines. For more efficient load balancing, dynamic or runtime routines could be implemented.

[29] http://stsdas.stsci.edu/pyraf/doc.old/pyraf_tutorial/
[30] Numerical Python (NumPy) and Scientific Python (SciPy) are non-standard Python modules

In normal usage of both PIX2SKY and SKY2PIX, we read hundreds to thousands of input records. Sending large chunks of file data to the computing nodes would result in a longer synchronization time. Instead, we send the starting byte value and length of each file chunk, and the actual file reading is done inside the worker process at each node. Example scheduler implementation code is shown below:

Listing 6: Scheduler implementation code

    import sys
    from os.path import getsize

    def getchunks( infile, ncpus, scheduler = 'guided' ):
        # Based on the scheduler, set the chunk size in bytes
        if scheduler == 'static':
            size = getsize( infile ) / ncpus
        else:
            size = getsize( infile ) / ( ncpus * 20 )

        # Open the input file. Display error and
        # exit if open fails
        try:
            ifile = open( infile )
        except:
            print >> sys.stderr, 'Error : Not able to open ', infile, '. Exiting.'
            sys.exit( -1 )

        # Get starting byte position and length of each chunk,
        # taking care to read up to the end of a full line
        while 1:
            start = ifile.tell()
            ifile.seek( size, 1 )
            s = ifile.readline()
            yield start, ifile.tell() - start
            if not s:
                break

        # Close the input file
        ifile.close()

In the case of the guided scheduler, the number of records is divided into a large number (20 times the number of processes) of smaller chunks. The performance of the PIX2SKY routine with one million input records on our AMD quad core machine, with both the static and the guided scheduler, is shown in Figure 5. The Parallel Python approach was used, but we get the same results with the other two approaches as well.
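The chunks produced by getchunks() can be consumed as in the sketch below: each worker opens the file itself, seeks to its byte offset and reads only its own slice. The two-column layout of the input file, and doing the coordinate transformation where the comment indicates, are assumptions for illustration:

    def process_chunk(args):
        infile, start, size = args
        results = []
        with open(infile) as ifile:
            ifile.seek(start)
            for line in ifile.read(size).splitlines():
                if not line.strip():
                    continue
                x, y = line.split()[:2]
                # the real worker would transform (x, y) here
                results.append((float(x), float(y)))
        return results

    # chunks = [(infile, start, size) for start, size in getchunks(infile, ncpus)]
    # results = pool.map(process_chunk, chunks)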

The graph clearly shows better performance for the guided scheduler. Dividing the work into smaller fragments is therefore better for the overall performance of parallel programs.

5.3. Multiprocessing Method Performance

To compare the performance of the different multiprocessing methods introduced in Section 3, we again took the PIX2SKY and SKY2PIX routines with one million input coordinates to be transformed. Our plots of run time (wall time) versus number of processes for the AMD quad core machine are shown in Figure 6(a) and Figure 6(b).

For both PIX2SKY and SKY2PIX, the Process/Queue approach performs better than both the Pool/Map and Parallel Python approaches. However, Parallel Python scales nicely when the number of processes is larger than the number of physical cores. Although Process/Queue performs better, implementing the Pool/Map method is much more straightforward.

For our routines using existing IRAF tasks - the completeness test and sub-sampled deconvolution - all three approaches have comparable performance. This is because the execution time of these routines is dictated by the performance of the underlying IRAF tasks (see Figure 6(c) and Figure 6(d)).

Figure 6: Performance of different multiprocessing methods. Panels (a) and (b) depict the performance of our PIX2SKY and SKY2PIX routines on our AMD quad core machine respectively. Panel (c) depicts the performance of our completeness test, and panel (d) that of our parallel sub-sampled deconvolution. Data points are the average of 50 iterations for the PIX2SKY benchmark and 10 iterations for the completeness test and deconvolution. The Parallel Python, Pool/Map and Process/Queue approaches show comparable performance for our completeness and sub-sampled deconvolution routines, whereas the Process/Queue approach performs better than both the Parallel Python and Pool/Map approaches for the PIX2SKY and SKY2PIX routines.

5.4. Scalability and Portability

Data parallelism using Python multiprocessing scales nicely with the number of processor cores, as shown in Section 4. Our coordinate transformation code also scales nicely with the number of input data elements. As depicted in Figure 7, the PIX2SKY routine scales linearly with the number of input data elements. It was benchmarked on an AMD quad core machine with 4 processes, using the Process/Queue multiprocessing method and varying the number of input data elements from 500 to 1 million. Similar scalability is also achieved for the SKY2PIX routine.

Figure 7: Scalability of our PIX2SKY routine with the number of input elements, for a fixed number of processes. The number of input coordinates was varied from 500 to 1 million. The benchmark was carried out on an AMD quad core machine with 4 parallel processes. The data points are the average of 50 iterations, with error bars corresponding to 1σ. The inset box is a close-up of the area near the origin. The Process/Queue approach was used with the guided scheduler.

As Python is a platform-independent interpreted language, the parallel code implementation is portable, i.e. it can be run on any Python-supported OS. In two of our examples, we have used IRAF tasks, which can only be executed on Linux or Unix-like OSs (e.g. Mac OS X), as IRAF is only available for these platforms. But astronomical data analysis tasks not requiring IRAF can be run on any other supported platform.

6. Conclusions

We have shown that moving to parallel astronomical data processing is not as daunting as most astronomers perceive. The open source Python programming language provides the necessary tools to implement parallel code on multicore machines. We used three different hardware configurations, three different parallelizing schemes or approaches, two different load balancing routines, and three different applications of varied complexity to demonstrate the ease of implementation and the benefits of parallelizing data processing tasks. Although the emphasis was on the Python multiprocessing module, results from the Parallel Python module were also presented. The Process/Queue approach performed better as a parallelizing scheme than both the Pool/Map and Parallel Python approaches. Parallel performance can be optimized by carefully load balancing the workload. Where there is no possibility of re-writing the code for parallel processing, because of complexity or any other factor, the existing serial code can still be used to parallelize the problem (as shown in the completeness test and sub-sampled deconvolution tasks). While these are not the only or the most optimized methods to parallelize the code, the computational time savings are still very significant, even with these straightforward approaches. The cross-platform nature of Python makes the code portable across multiple computer platforms.


Figure 5: (a) Speedup and (b) wall time for the PIX2SKY routine with different schedulers. The guided scheduler shows better performance than the static one. Wall time is the total execution time for the routine. Data points are averaged over 100 runs. Program wall time on COTS machines depends on the other jobs and programs running. Error bars depict 1 standard deviation. The Parallel Python approach was used.


7. Acknowledgements

We gratefully acknowledge the support of Science Foundation Ireland (under award number 08/RFP/PHY1236). Based on observations made with the NASA/ESA Hubble Space Telescope in programs GO-8118 and GO-5366, obtained from the data archive at the Space Telescope Science Institute. STScI is operated by the Association of Universities for Research in Astronomy, Inc. under NASA contract NAS 5-26555. STSDAS, TABLES, PyRAF and PyFITS are products of the Space Telescope Science Institute, which is operated by AURA for NASA. We also thank the anonymous referees for comments and suggestions that greatly improved the clarity of the article.

References

Amdahl, G.M., 1967. Validity of the single-processor approach to achieving large-scale computing capabilities, in: Proc. Am. Federation of Information Processing Societies Conf., AFIPS Press, pp. 532-533.

Beazley, D.M., 2006. Python Essential Reference. Sams Publishing, third edition.

Belleman, R.G., Bedorf, J., Portegies Zwart, S.F., 2008. High performance direct gravitational N-body simulations on graphics processing units II: an implementation in CUDA. New Astronomy 13, 103-112.

Bromer, R., Hantho, F., Vinter, B., 2011. pupyMPI - MPI implemented in pure Python, in: Proceedings of the 18th European MPI Users' Group conference on Recent Advances in the Message Passing Interface, Springer-Verlag, Berlin, Heidelberg, pp. 130-139.

Butler, R., 2000. Deconvolution as a tool for improved crowded-field photometry with HST, in: ASP Conference Proceedings, Astronomical Society of the Pacific, pp. 595-598.

Chandra, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J., Menon, R., 2001. Parallel Programming in OpenMP. Morgan Kaufmann Publishers.

Dalcin, L., Rodrigo, P., Storti, M., D'Elia, J., 2008. MPI for Python: Performance improvements and MPI-2 extensions. Journal of Parallel and Distributed Computing 68, 655-662.

Garver, S.L., Crepps, B., 2009. The new era of tera-scale computing. http://software.intel.com/en-us/articles/the-new-era-of-tera-scale-computing/. Accessed: 22/05/2012.

Gove, D., 2010. Multicore Application Programming: For Windows, Linux, and Oracle Solaris. Addison-Wesley Professional, illustrated edition.

Makino, J., Taiji, M., Ebisuzaki, T., Sugimoto, D., 1997. GRAPE-4: a massively parallel special-purpose computer for collisional N-body simulations. Astrophysical Journal 480, 432-446.

Mighell, K.J., 2010. CRBLASTER: a parallel-processing computational framework for embarrassingly parallel image-analysis algorithms. Publications of the Astronomical Society of the Pacific 122, 1236-1245.

Moore, G.E., 1965. Cramming more components onto integrated circuits. Electronics Magazine 38.

Nielsen, O., 2003. Pypar software package. http://datamining.anu.edu.au/~ole/pypar.

Python Software Foundation, 2012. Python/C API Reference Manual. Python Software Foundation.

Robitaille, T.P., 2011. HYPERION: an open-source parallelized three-dimensional dust continuum radiative transfer code. Astronomy & Astrophysics 536.

Strzodka, R., Doggett, M., Kolb, A., 2005. Scientific computation for simulations on programmable graphics hardware. Simulation Modelling Practice and Theory, Special Issue: Programmable Graphics Hardware 13, 667-680.

Szalay, A., 2011. Extreme data-intensive scientific computing. Computing in Science & Engineering 13, 34-41.

Vanovschi, V., 2013. Parallel Python software. http://www.parallelpython.com. Accessed: 10/02/2013.

Vinter, B., Bjørndalen, J.M., Friborg, R.M., 2009. PyCSP revisited, in: Communicating Process Architectures 2009, IOS Press, pp. 277-292.

Wiley, K., Connolly, A., Gardner, J., Krughoff, S., Balazinska, M., Howe, B., Kwon, Y., Bu, Y., 2011. Astronomy in the cloud: Using MapReduce for image co-addition. Publications of the Astronomical Society of the Pacific 123, 366-380.