


Publ. Astron. Soc. Japan (2015) 67 (3), 47 (1–9), doi: 10.1093/pasj/psv018

Advance Access Publication Date: 2015 April 30

High-performance parallel image reconstruction

for the New Vacuum Solar Telescope

Xue-Bao LI,1,2,∗ Zhong LIU,1,2 Feng WANG,1,2,3,∗ Zhen-Yu JIN,1,2

Yong-Yuan XIANG,1,2 and Yan-Fang ZHENG1,2,4

1 Yunnan Observatories, Chinese Academy of Sciences, Kunming, China, 650011
2 University of Chinese Academy of Sciences, Beijing, China, 100049
3 Computer Technology Application Key Lab of Yunnan Province, Kunming University of Science and Technology, Chenggong, Kunming, China, 650500

4Jiangsu University of Science and Technology, Zhangjiagang, China, 215600

*E-mail: [email protected] (XBL); [email protected] (FW)

Received 2014 December 17; Accepted 2015 March 6

Abstract

Many technologies have been developed to help improve the spatial resolution of observational images for ground-based solar telescopes, such as adaptive optics (AO) systems and post-processing reconstruction. As any AO system correction is only partial, it is indispensable to use post-processing reconstruction techniques. In the New Vacuum Solar Telescope (NVST), a speckle-masking method is used to achieve the diffraction-limited resolution of the telescope. Although the method is very promising, the computation is quite intensive and the amount of data is tremendous, requiring several months to reconstruct one day of observational data on a high-end computer. To accelerate image reconstruction, we parallelize the program package on a high-performance cluster. We describe parallel implementation details for several reconstruction procedures. The code is written in the C language using the Message Passing Interface (MPI) and is optimized for parallel processing in a multiprocessor environment. We show the excellent performance of the parallel implementation; the whole data processing is about 71 times faster than before. Finally, we analyze the scalability of the code to find possible bottlenecks, and propose several ways to further improve the parallel performance. We conclude that the presented program is capable of executing reconstruction applications in real time at NVST.

Key words: methods: observational — techniques: image processing — telescopes

1 Introduction

Atmospheric turbulence is the primary barrier to obtaining diffraction-limited observations with modern large-aperture ground-based solar telescopes, as it randomly distorts the wavefronts radiating from the Sun. The collected images therefore suffer from motion, blurring, and geometrical distortion. Adaptive optics (AO) systems have been introduced at many

advanced solar telescopes to enhance the spatial resolution of observational images (von der Lühe et al. 2003; Rimmele et al. 2004; Marino et al. 2010). However, any AO system corrects only part of the wavefront aberrations. To achieve diffraction-limited observation, further post-processing reconstruction techniques become indispensable. The most commonly used methods for solar image

© The Author 2015. Published by Oxford University Press on behalf of the Astronomical Society of Japan. All rights reserved. For permissions, please email: [email protected]



Table 1. The data acquisition parameters for the multichannel high-resolution imaging system of NVST.

Channel                         Hα           TiO-band     G-band
Wavelength (Å)                  6563         7058         4300
Resolution (pixels)             1024 × 1024  2560 × 2160  2560 × 2160
Frames per second (fps)         10           10           10
Storage capacity (GB per hour)  75           398          398
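The storage rates in table 1 follow directly from the detector geometry and frame rate. A quick sketch of the arithmetic, assuming 16-bit (2-byte) raw pixels (the 16-bit digitization is stated in section 4) and decimal gigabytes; both conventions are our assumptions, since the table does not state them:

```python
# Sanity check of the table 1 storage rates, assuming 16-bit (2-byte)
# raw pixels and decimal gigabytes (10**9 bytes) -- both assumptions,
# since the table does not state the pixel depth or the GB convention.

def gb_per_hour(width, height, bytes_per_px=2, fps=10):
    """Raw data rate of one camera channel in GB (10**9 bytes) per hour."""
    return width * height * bytes_per_px * fps * 3600 / 1e9

halpha = gb_per_hour(1024, 1024)   # H-alpha channel
tio    = gb_per_hour(2560, 2160)   # TiO-band channel (same as G-band)

print(round(halpha), round(tio))   # 75 398, matching the table
```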

reconstruction are speckle masking (Knox & Thompson 1974; Weigelt 1977; Lohmann et al. 1983; von der Lühe 1993), phase-diversity methods (Gonsalves 1982; Paxman et al. 1992; Löfdahl & Scharmer 1994), and different variants of multiframe blind deconvolution (van Kampen & Paxman 1998; Löfdahl 2002; van Noort et al. 2005).

The 1 m New Vacuum Solar Telescope (NVST) was built at the Fuxian Solar Observatory (FSO) of the Yunnan Observatories in 2010. Its main scientific goal is to observe the fine structures in both the photosphere and the chromosphere. A multichannel high-resolution imaging system, with two photosphere channels and one chromosphere channel, has been installed and brought into use at NVST. In table 1, the data acquisition parameters for the current imaging system of NVST are described. The band for observing the chromosphere is Hα (6563 Å), and the bands for observing the photosphere are TiO (7058 Å) and G-band (4300 Å), respectively. Because of its channel separation, the imaging system simultaneously acquires data using three detectors, capable of generating 2560 × 2160 or 1024 × 1024 pixels of image data at a frame rate of around 10 images per second. When observing for several hours per day, the imaging system produces several terabytes of unreduced data. Although storage cost is dropping continually, such huge data volumes are still difficult to transfer and distribute. The raw image data from NVST are reconstructed using speckle masking. In general, a single "speckle burst" of at least 100 short-exposure images is used to statistically reconstruct one image. The image reconstruction program was originally developed in the Interactive Data Language (IDL). Through an IDL implementation, speckle-masking reconstruction of one 2560 × 2160 pixel image from one burst took about an hour on an Intel Core 3.4 GHz computer, and reconstruction of a single day's data took several months. Obviously, (near) real-time data processing can improve the efficiency of the telescope, because not only is the data volume rapidly reduced by a factor of 100, but the time between observation and data analysis is also dramatically shortened. Thus, there is a strong demand for processing observational data in real time, or at least near real time.
The rapid development of computer technology, e.g., high-performance computing, makes it possible to reconstruct massive speckle data in real time on site when using speckle masking. Some aspects of using post-processing methods to reconstruct speckle data in near real time have already been explored by Denker, Yang, and Wang (2001), Wöger, von der Lühe, and Reardon (2008), and Wöger et al. (2010).

In this study, we present a parallelized implementation of the speckle-masking algorithm. Because of the large volume of intermediate data produced and the high computational complexity of such a speckle reconstruction, the program is implemented in a multiprocessor, multinode computing environment using the Message Passing Interface (MPI: Gropp et al. 1998). The remainder of this paper is organized as follows. In section 2, we survey previous work in the literature concerned with massive speckle data computing. In section 3, we present the system design and a concrete implementation in parallel processing mode. In section 4, reconstruction results using speckle masking are presented, and the scalability of the code with an increasing number of employed processors on a cluster is analyzed, followed by discussions and conclusions in section 5.

2 Related work

Among post-processing image reconstruction techniques, speckle masking is attractive not only to NVST but also to other solar telescopes because of its reliable performance in phase reconstruction. A high-resolution image is usually reconstructed from one burst of at least 100 short-exposure images. Since speckle masking is valid only within a small region, the isoplanatic patch, the raw images have to be divided into a number of partially overlapping data cubes of 100 subimages (Liu et al. 2014). These subimage cubes are processed independently, which makes their parallel processing easy, as they are sent to different processors using MPI. The isoplanatic patch has a size of approximately 5″ × 5″. We use a spectral ratio technique to calculate the seeing parameter r0 that is necessary for reconstruction (von der Lühe 1984). The modulus of the object's Fourier transform is calculated using the classical method of Labeyrie (1970). To derive the phases of the object's Fourier transform, we use the speckle-masking method (Weigelt 1977). Inverse transformation of the modulus and phases of the object's Fourier transform yields a mosaic of partially overlapping reconstructed subimages. Finally, all the reconstructed



Fig. 1. Schematic diagram for reconstructing a high-resolution image on multiple processors using the speckle-masking algorithm.

subimages are aligned and put together to form an entire high-resolution image. In figure 1, the schematic diagram for reconstructing a high-resolution image on multiple processors using the speckle-masking algorithm for NVST is shown.

To deal with both massive amounts of speckle data and large amounts of computation in near real time, the technologies of parallel and distributed computing have been applied at many advanced solar telescopes. The Dutch Open Telescope (DOT) reconstructed a full day's data within 24 hr on a cluster consisting of 70 processors (Bettonvil et al. 2004). Reconstruction of one 1024 × 1024 pixel image from one burst of 100 images took about 22 s, using speckle masking, on a cluster with 23 computation nodes at the Dunn Solar Telescope (DST: Wöger et al. 2008). The computation time was 5–6 min to reconstruct one 2024 × 2024 pixel image from one burst of 100 images using speckle masking on a cluster with 8 computation nodes at the New Solar Telescope (NST: Cao et al. 2010). The Daniel K. Inouye Solar Telescope (DKIST), formerly the Advanced Technology Solar Telescope

(ATST), aims to reconstruct 80 images of 4096 × 4096 pixels into a single image within 3 s, using speckle masking on a cluster of fewer than 50 graphics processing units (GPUs: Wöger & Ferayorni 2012). Although parallel implementations of speckle-masking reconstruction have been employed at several solar telescopes, we have developed a faster data handling system for NVST, different from those of the other telescopes, that reconstructs images in real time on a high-performance computing cluster.

3 System design and implementation

3.1 System design

The use of MPI to implement code for parallel processing has certain implications for the overall design of a program. Because much time has been devoted to reconstructing one 2560 × 2160 pixel image from one burst, reconstruction of a single burst needs to be accelerated. In general terms, the master process takes charge of not only distributing



the tasks to all other processes and receiving the computed results from them, but also performing the calculations alongside them. An overview of the system design is shown in figure 2, and the pseudocode of the whole parallel implementation of speckle masking is shown in figure 3. The system carries out the whole parallel process of speckle masking on a high-performance cluster, assuming a burst of 100 images of 2560 × 2160 pixels. The image reconstruction application for NVST can be seen as a pipeline mainly consisting of the following procedures: flat-field and dark-field preprocessing, correlation tracking, estimation of seeing, reconstruction of subimages, and mosaicing of reconstructed subimages. These image reconstruction procedures are parallelized and accelerated, and the accelerated results are presented in section 4. In what follows, details of the parallel implementation for the individual reconstruction steps are described.

3.2 Implementation

In our MPI implementation, we assume that the number of processes used in the computation is fixed by the user. However, considering that the number of subimage cubes of one burst in the TiO band is about one thousand, and the number of available processors on the cluster is limited to several hundred, 222 processes are launched and the program is executed on 222 processors in parallel using MPI. The master process reads the average flat-field and dark-field images from the storage device, and broadcasts these images to the other processes through two calls of MPI_Bcast, assuming the data are prepared for preprocessing. The first 100 processes are chosen, based on their process identification number, to read one burst of 100 raw images of 2560 × 2160 pixels from the storage device, and to perform flat-field and dark-field processing simultaneously and independently. During the correlation tracking, the master process extracts a ∼2000 × 2000 pixel subimage from the preprocessed image as a reference image, and broadcasts it to the other processes to compensate for image motion. Each of the 100 processes computes the image shifts using a two-dimensional cross-correlation method, and executes the same alignment procedure on different images.
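The paper does not spell out the calibration formula applied in the flat-field and dark-field preprocessing step. A common choice, and presumably what is applied per pixel, is (raw − dark) divided by the normalized dark-subtracted flat. A minimal sketch under that assumption (the formula and the array names are illustrative, not taken from the authors' code):

```python
# Standard flat-field/dark-field calibration sketch -- an assumption,
# since the paper does not give the exact formula it applies:
# corrected = (raw - dark) / gain, where gain is the dark-subtracted
# flat field normalized to unit mean.

def calibrate(raw, flat, dark):
    """Apply dark subtraction and flat-field correction pixel by pixel."""
    gain_raw = [f - d for f, d in zip(flat, dark)]
    mean_gain = sum(gain_raw) / len(gain_raw)
    return [(r - d) / (g / mean_gain) for r, d, g in zip(raw, dark, gain_raw)]

# Tiny example: a uniform source seen through uneven pixel gains
dark = [10.0, 10.0, 10.0, 10.0]
flat = [110.0, 90.0, 100.0, 100.0]   # dark-subtracted gains 100, 80, 90, 90
raw  = [60.0, 50.0, 55.0, 55.0]      # uniform scene * gain + dark
out = calibrate(raw, flat, dark)     # a flat array: the gains cancel
```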

We use the spectral ratio technique to calculate the seeing parameter r0 from the observed data itself (von der Lühe 1984). A burst of aligned images of 2560 × 2160 pixels is divided into 80 non-overlapping subimage cubes of 256 × 256 pixels, in contrast to the many partially overlapping subimage cubes mentioned in section 2. The scheme of image segmentation for calculating the seeing parameter

Fig. 2. Overview of the system design. The system shows the whole parallel process of speckle masking on a high-performance cluster, assuming a burst of 100 images of 2560 × 2160 pixels.



Fig. 3. Pseudocode of the whole parallel implementation of speckle masking on a high-performance cluster, assuming a burst of 100 images of 2560 × 2160 pixels.



is shown in figure 2. The first 80 processes (process id < 80) are selected and rapidly gather the corresponding non-overlapping subimage cubes through 80 calls of MPI_Gather, via the high-speed Infiniband network. The data volume of these subimage cubes is small, so the time cost of data transfer between the processes can be neglected. When the data transfer is finished, the first 80 processes execute the same procedure of computing r0 on different subimage cubes simultaneously. The number of non-overlapping subimage cubes is less than the number of processes, so the selected 80 processes can accomplish the tasks of computing the seeing parameters at the same time. The seeing parameters calculated by the different processes are collected and averaged by the master process through a call of MPI_Gather. Finally, the mean value of the seeing parameter r0 is broadcast to all processes through a call of MPI_Bcast.
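The two segmentation schemes of a 2560 × 2160 frame into 256 × 256 tiles differ only in their stride: a 256-pixel stride (no overlap) gives the 10 × 8 = 80 cubes used for the seeing estimation, while a 64-pixel stride reproduces the 1110 = 37 × 30 partially overlapping cubes used for reconstruction. The 64-pixel stride is our inference, as the paper does not state the overlap, but it is the value consistent with the quoted count. A small sketch of the tile arithmetic:

```python
# Tile counts for both segmentation schemes of a 2560 x 2160 frame.
# The 64-pixel stride of the overlapping scheme is an assumption
# inferred from the quoted count of 1110 cubes (37 x 30).

def n_tiles(image, tile, stride):
    """Number of tile positions along one axis (last partial tile dropped)."""
    return (image - tile) // stride + 1

def grid(width, height, tile=256, stride=256):
    return n_tiles(width, tile, stride) * n_tiles(height, tile, stride)

non_overlapping = grid(2560, 2160, stride=256)   # seeing estimation
overlapping     = grid(2560, 2160, stride=64)    # subimage reconstruction
print(non_overlapping, overlapping)              # 80 1110
```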

We also divide a burst of aligned images of 2560 × 2160 pixels into 1110 partially overlapping subimage cubes of 256 × 256 pixels, matching the isoplanatic size, in the module which divides all overlapping subimage cubes. The scheme of image segmentation for reconstructing a subimage is also shown in figure 2. All these subimage cubes are assigned and gathered by all processes through many calls to MPI_Gather, as indicated in figures 2 and 3. Each process is designed to reconstruct one subimage from one subimage cube, as indicated in figures 1 and 2, on the premise that it has sufficient RAM. However, the overall number of processors is insufficient to reconstruct all subimages at once. The most effective way is to utilize several do-loops to manipulate all subimage cubes on all processes. In the reconstruction of subimages, the object's Fourier modulus is reconstructed according to the classical method of Labeyrie (1970), and the object's Fourier phase is reconstructed using the speckle-masking algorithm (Weigelt 1977; Lohmann et al. 1983). The same procedure of reconstructing an object's Fourier modulus and phase is carried out by all processes on different partially overlapping subimage cubes simultaneously. Once the operation above is completed, all processes resume simultaneous execution of the same inverse Fourier transform instructions.
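The phase part of speckle masking can be illustrated in one dimension: the averaged bispectrum ⟨I(u)I(v)I*(u+v)⟩ is insensitive to random image shifts, and the object phase follows from the recursion φ(u+v) = φ(u) + φ(v) − arg B(u,v). The sketch below is a toy model, not the authors' C code: the "atmosphere" only shifts the frames, and the unknown linear phase term (absolute image position) is fixed by anchoring φ(1) to its true value.

```python
# 1-D toy illustration of speckle-masking phase recovery.
# Frames are randomly shifted copies of the object; the averaged
# bispectrum cancels the shifts, and a recursion rebuilds the phase.
import cmath, random

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * f * n / N)
                for n in range(N)) for f in range(N)]

N = 8
obj = [0.0, 3.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]   # toy object
O = dft(obj)

random.seed(1)
frames = []
for _ in range(20):                      # burst of randomly shifted frames
    s = random.randrange(N)
    frames.append([obj[(n - s) % N] for n in range(N)])

def mean_bispec(u, v):
    """Average bispectrum B(u, v) = <I(u) I(v) conj(I(u + v))> over frames."""
    total = 0
    for fr in frames:
        I = dft(fr)
        total += I[u] * I[v] * I[(u + v) % N].conjugate()
    return total / len(frames)

# Recursive phase reconstruction: phi(u+v) = phi(u) + phi(v) - arg B(u, v)
phi = [0.0] * N
phi[1] = cmath.phase(O[1])               # fixes the unknown image position
for f in range(2, N // 2 + 1):
    phi[f] = phi[1] + phi[f - 1] - cmath.phase(mean_bispec(1, f - 1))
```

For pure translations the averaged bispectrum equals O(u)O(v)O*(u+v) exactly, so the recovered phases match the object's spectrum up to numerical error; with real seeing the average only converges statistically over the burst.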

The object's Fourier modulus is reconstructed according to the classical method of Labeyrie (1970):

\[
|O(f)|^2 = \frac{\langle |I_i(f)|^2 \rangle_i}{\langle |H_i(f)|^2 \rangle_i}. \tag{1}
\]

⟨|H_i(f)|²⟩_i is generally referred to as the speckle transfer function (STF), which is obtained using the spectral ratio technique (von der Lühe 1984). The object's spatial power spectrum |O(f)|² becomes accessible from the measured mean spatial power spectrum ⟨|I_i(f)|²⟩_i. Once |O(f)|² is

calculated, the object's Fourier modulus can be obtained. Because a reference point source cannot be observed simultaneously when observing the Sun, the object's Fourier modulus needs to be calibrated with a model STF. In order to choose the correct model function, Wöger and von der Lühe (2008) and Wöger, von der Lühe, and Reardon (2008) used the spectral ratio technique (von der Lühe 1984) to estimate the strength of atmospheric turbulence, i.e., the value of the seeing parameter r0. In our parallel implementation, we use the same method to get the value of the seeing parameter.
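Equation (1) can be checked with a short numerical experiment: convolve a 1-D object with an ensemble of random PSFs, average the power spectra of the resulting frames, and divide by the mean PSF power spectrum. This is a toy verification of Labeyrie's relation with the transfer functions known exactly, not the model-STF calibration used in practice:

```python
# Numerical check of Labeyrie's relation (equation 1) in one dimension:
# <|I_i(f)|^2> / <|H_i(f)|^2> equals |O(f)|^2 when the PSF power
# spectra are known. Toy model with random 3-tap PSFs.
import cmath, random

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * f * n / N)
                for n in range(N)) for f in range(N)]

def convolve_circular(o, h):
    N = len(o)
    return [sum(o[m] * h[(n - m) % N] for m in range(N)) for n in range(N)]

N = 8
obj = [1.0, 4.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
O2 = [abs(v) ** 2 for v in dft(obj)]          # true |O(f)|^2

random.seed(2)
sum_I2 = [0.0] * N
sum_H2 = [0.0] * N
for _ in range(50):                           # burst of 50 degraded frames
    h = [random.random() for _ in range(3)] + [0.0] * (N - 3)
    frame = convolve_circular(obj, h)
    I, H = dft(frame), dft(h)
    for f in range(N):
        sum_I2[f] += abs(I[f]) ** 2
        sum_H2[f] += abs(H[f]) ** 2

estimate = [si / sh for si, sh in zip(sum_I2, sum_H2)]
# estimate[f] agrees with O2[f] for every f, up to floating-point error
```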

We use several do-loops to reconstruct the subimages on 222 processes, with the do-loop index varying from 1 to 5. After 222 subimages are reconstructed in each do-loop, they are aligned simultaneously on the different processes. Subsequently, they are collected by the master process and merged to form a partially recovered image, as illustrated in figure 2, through a call to MPI_Reduce. A full mosaic high-resolution image of 2368 × 1920 pixels is obtained once all do-loop operations are completed. Eventually, the master process writes the final results into the storage device, including the fully recovered image and the corresponding image header information. The program is designed so that the important parameters, such as the do-loop index and the number of subimage cubes, are computed automatically. The do-loop index for reconstructing all subimages can be reduced by using more than 222 processes.
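The do-loop index of 5 follows from the cube and process counts: 1110 subimage cubes spread over 222 processes need ⌈1110/222⌉ = 5 passes, and the index drops as more processes are added. A one-line sketch of how such a parameter can be derived automatically:

```python
# Number of do-loop passes needed to reconstruct all subimage cubes
# when each process handles one cube per pass (ceiling division).

def n_loops(n_cubes, n_processes):
    return -(-n_cubes // n_processes)      # ceil without floats

print(n_loops(1110, 222))   # 5 passes, as used in the paper
print(n_loops(1110, 555))   # 2 passes with more processes
```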

4 Results and analysis

The observations were made with the NVST at FSO, without AO, on 2013 November 2. A TiO filter (λ = 705.8 nm) was used for observations of solar granulation and sunspots. We used a high-speed CCD camera with 16-bit digitization as a detector. Figure 4a shows one observed frame of one burst before reconstruction. The field of view is 100″ × 80″, and the number of pixels is 2368 × 1920 (0.″042 pixel⁻¹). Figure 4b shows the speckle-masking parallel reconstruction image with the same field of view. The diffraction-limited resolution of the telescope is about 0.″18 at λ = 705.8 nm. Thus, the speckle-reconstructed image is close to the diffraction-limited resolution of the NVST. Such reconstructed data can be used to study the dynamical properties of fine features in the photosphere. The image reconstruction program for speckle masking was first developed and carried out in IDL. Because of the interpreted nature of IDL, the program does not run very efficiently. On a high-end computer equipped with an Intel Core(TM) i7-3770 CPU with a 3.4 GHz clock speed and 32 GB of RAM, it requires about an hour to reconstruct



Fig. 4. (a) One observed frame of one burst before reconstruction. (b) The speckle-masking parallel reconstructed image. The field of view is 100″ × 80″.

a 2368 × 1920 pixel image from a burst of 100 images consisting of 2560 × 2160 pixels.

The parallel reconstruction has been tested, without changing the results, on a high-performance cluster with 45 computation nodes installed at the Yunnan Observatories. Each server (computation node) has two Intel Xeon E5-2650 CPUs at 2 GHz (16 cores in total) and 64 GB of DDR3 RAM. All nodes are connected by a high-speed Infiniband network with a bandwidth of 40 Gb s⁻¹. The Red Hat 6.2 (64-bit) operating system and OpenMPI-1.6.4 are installed on each node. The storage system, a Lustre distributed file system with four servers, is also connected to the 40 Gb s⁻¹ Infiniband network. The total storage capacity is 54 TB. All raw data are first copied to the storage system of the cluster for high-performance data processing. For the tests, we used a burst of 100 images, and this data set was reconstructed with 184775040 bispectrum values and 1110 subimage cubes of 256 × 256 pixels.

Table 2 shows runtime comparisons for reconstructing a 2368 × 1920 pixel image between the various modules in the C language using MPI on 222 processors of the cluster, and in IDL on the high-end computer mentioned above. Since the IDL implementation cannot run on one node of the cluster, we used the high-end computer, which has almost the same computing capacity, in its place. The several time-consuming module operations listed in table 2 are successfully parallelized, and the speedup shows positive results. The runtime of the whole parallel data processing is reduced to around 48 seconds when using 222 processors, and the whole data processing is around 71 times faster than before. The most time-consuming module, reconstruction of all subimages, shows a great speed increase, by a factor of about 202; the module of flat-field and

dark-field preprocessing is about 193 times faster than before, and the processing speed of the other modules is about 9–41 times faster than before. In the module of division of all overlapping subimage cubes, about 16 seconds are used to transfer data between all processes. Although the data volume of the 1110 subimage cubes is around 55 GB, the total amount of data transfer increases to 92 GB. The total amount of data transfer increases as the number of processors increases, as shown in figure 5c. The reason is that division of all overlapping subimage cubes generates additional transfer data when using more than 100 processes, because one burst of 100 aligned images is stored in the first 100 processes. Because the number of do-loops for reconstructing all subimages is 5 on 222 processors, the 92 GB of total transfer data is also divided into 5 groups. Thus, each group in each do-loop has around 18.4 GB of transfer data, avoiding jamming the network. The procedure of data transfer is shown in detail in the pseudocode of figure 3. The communication time makes up a significant proportion of one image reconstruction time. If we reduce the amount of data transfer and increase the bandwidth of the network, the performance of parallel reconstruction will be further improved; this is what we should focus on in our future work.
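The quoted ~55 GB for the 1110 overlapping cubes is consistent with double-precision pixels, which is our assumption; the paper only quotes the total. Each cube holds 100 subimages of 256 × 256 pixels:

```python
# Rough size of the 1110 overlapping subimage cubes, assuming
# 8-byte (double-precision) pixels -- an assumption, since the paper
# only quotes the ~55 GB total.

def cubes_gib(n_cubes, frames=100, tile=256, bytes_per_px=8):
    """Total cube volume in GiB (2**30 bytes)."""
    return n_cubes * frames * tile * tile * bytes_per_px / 2**30

print(round(cubes_gib(1110)))   # about 54 GiB, matching the quoted ~55 GB
```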

We have assessed the scalability of the parallel implementation by measuring the time used for computation of a speckle reconstruction. The execution time of the parallel implementation for the various modules on the cluster is shown in figures 5a and 5b for numbers of employed processors ranging from 32 to 555. In the curves for all modules in figure 5b, the runtime decreases as the number of processors increases. However, the speedup is not linear, especially when the number of processors used exceeds 111, and there is saturation at



Table 2. Runtime comparisons for reconstructing a 2368 × 1920 pixel image between various modules in IDL and C using MPI.

Module                                      C & MPI (s)  IDL (s)  Factor
Flat-field and dark-field preprocessing     0.13         25.2     193.8
Correlation tracking                        7.5          139.6    18.6
Estimation of seeing                        3.5          31.4     9.0
Division of all overlapping subimage cubes  16.4         –        –
Reconstruction of all subimages             15.3         3100     202.6
Mosaicing of all reconstructed subimages    2.5          104.1    41.6
The whole data processing                   48           3425     71.3
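The speedup factors in table 2 are simply the IDL-to-MPI runtime ratios; recomputing them is a quick consistency check of the table:

```python
# Consistency check of the speedup factors in table 2
# (factor = IDL runtime / C & MPI runtime).

timings = {                      # module: (C & MPI seconds, IDL seconds)
    "preprocessing":  (0.13, 25.2),
    "tracking":       (7.5, 139.6),
    "seeing":         (3.5, 31.4),
    "reconstruction": (15.3, 3100.0),
    "mosaicing":      (2.5, 104.1),
    "whole":          (48.0, 3425.0),
}
factors = {m: idl / mpi for m, (mpi, idl) in timings.items()}
print(round(factors["whole"]))   # 71, the overall speedup
```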

Fig. 5. Time used for one reconstruction for the various modules, (a) and (b), and volume of data transfer (c), versus the number of employed processors on the high-performance cluster.

a minimum. Because only the first 100 processes can perform parallel processing in all modules in figure 5b, the runtimes change little when the number of processors varies from 111 to 555. In the curve for reconstruction of all subimages in figure 5a, the runtime always decreases as the number of processors increases, because

the number of do-loops for reconstructing all subimages decreases. However, the runtime always increases in the curve of division of all overlapping subimage cubes in figure 5a as the number of processors increases, because the total amount of data transfer increases. In the curve of the whole data processing in figure 5a, the runtime decreases as



the number of processors increases, and there is saturation at a minimum of around 48 s when the number of processors exceeds 222. The reason is that the runtime increase in the overlapping subimage cube division module almost equals the runtime decrease in the subimage reconstruction module. In any event, a system such as the one tested above already achieves real-time performance for a detector that reads out and stores a 2560 × 2160 pixel frame at an effective rate of 10 frames per second.
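The observed saturation can be read as a plain Amdahl-type effect: once the serial and communication share of the runtime dominates, adding processors stops helping. The sketch below is an illustrative model with a made-up serial fraction, not a fit to the measured NVST curves:

```python
# Amdahl's-law illustration of the observed saturation: with a fixed
# non-parallelizable share of the work, speedup is bounded by 1 / share.
# The 0.3 serial share is illustrative, not fitted to the NVST data.

def amdahl_speedup(p, serial_fraction=0.3):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

for p in (32, 111, 222, 555):
    print(p, round(amdahl_speedup(p), 2))

# Speedup grows with p but can never exceed 1 / 0.3, about 3.3.
```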

5 Discussion and conclusions

In this study, we have presented the details of a parallel implementation of image reconstruction using speckle masking in a solar application. The code is written in the C language for parallel processing using MPI, and the parallel implementation of the various modules shows a great speed increase compared to the previous IDL implementation. The spatial resolution of the reconstructed image is improved significantly compared to that of the original image. We also tested the scalability of the code. The timing results of the parallel implementation for the various modules on the cluster showed a clear advantage with greater numbers of processors. Since a considerable proportion of the computing time is spent in the modules of division of all overlapping subimage cubes and reconstruction of all subimages, we need not only to increase the network bandwidth, but also to accelerate subimage reconstruction, to further improve the performance of the parallel reconstruction.

Obviously, high-performance image reconstruction like speckle masking will be valuable for NVST and next-generation solar telescopes when a high-performance cluster is adopted on site, as it increases the telescope's efficiency. In addition, the development of multiprocessing and parallel numerical algorithms for high-spatial-resolution imaging will become even more important in the context of advanced large-aperture solar telescopes, as both the chip size and the data acquisition speed of detectors increase. The program proposed in this study has been successfully tested on the cluster, and outstanding real-time performance has been achieved. However, we still seek new techniques for real-time image reconstruction for NVST.

General-purpose graphics processing units (GPGPUs) are a parallel computing technology that has been widely used in real-time computing. After the implementation of MPI image reconstruction, we also consider migrating our program from MPI to GPU in order to obtain

higher computing speed. In a preliminary experiment, we used one GPU to accelerate a bispectrum calculation, and gained a significant speedup (around 4–6 times over MPI) in one subimage reconstruction. However, the massive data transfer from host memory to GPU memory is time-consuming because of the limitation of the computer's bus bandwidth. The optimal reconstruction algorithm on GPUs is worth studying in the future.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. U1231205, 11163004, 11203077). The authors thank the NVST team for the support they have given this project, and also gratefully acknowledge the helpful comments and suggestions of the reviewers.

References

Bettonvil, F. C., Hammerschlag, R. H., Sütterlin, P., Rutten, R. J., Jagers, A. P., & Snik, F. 2004, Proc. SPIE, 5489, 362
Cao, W., et al. 2010, Proc. SPIE, 7735, 77355V
Denker, C., Yang, G., & Wang, H. 2001, Sol. Phys., 202, 63
Gonsalves, R. A. 1982, Opt. Eng., 21, 829
Gropp, W., Huss-Lederman, S., Lumsdaine, A., Lusk, E. L., Nitzberg, B., Saphir, W., & Snir, M. 1998, MPI - The Complete Reference, Vol. 2, The MPI Extensions (Cambridge, MA: MIT Press)
Knox, K. T., & Thompson, B. J. 1974, ApJ, 193, L45
Labeyrie, A. 1970, A&A, 8, 85
Liu, Z., et al. 2014, Res. Astron. Astrophys., 14, 705
Löfdahl, M. G. 2002, Proc. SPIE, 4792, 146
Löfdahl, M. G., & Scharmer, G. B. 1994, Proc. SPIE, 2302, 254
Lohmann, A. W., Weigelt, G., & Wirnitzer, B. 1983, Appl. Opt., 22, 4028
Marino, J., Wöger, F., & Rimmele, T. 2010, Proc. SPIE, 7736, 77363E
Paxman, R. G., Schulz, T. J., & Fienup, J. R. 1992, J. Opt. Soc. Am. A, 9, 1072
Rimmele, T. R., et al. 2004, Proc. SPIE, 5171, 179
van Kampen, W. C., & Paxman, R. G. 1998, Proc. SPIE, 3433, 296
van Noort, M., Rouppe van der Voort, L., & Löfdahl, M. G. 2005, Sol. Phys., 228, 191
von der Lühe, O. 1984, J. Opt. Soc. Am. A, 1, 510
von der Lühe, O. 1993, A&A, 268, 374
von der Lühe, O., Soltau, D., Berkefeld, T., & Schelenz, T. 2003, Proc. SPIE, 4853, 187
Weigelt, G. P. 1977, Opt. Commun., 21, 55
Wöger, F., et al. 2010, Proc. SPIE, 7735, 773521
Wöger, F., & Ferayorni, A. 2012, Proc. SPIE, 8451, 84511C
Wöger, F., & von der Lühe, O. 2008, Proc. SPIE, 7019, 70191E
Wöger, F., von der Lühe, O., & Reardon, K. 2008, A&A, 488, 375
