(D)CI related activities at IFCA
Marcos López-Caniego
Instituto de Física de Cantabria (CSIC-UC)
Astro VRC Workshop Paris Nov 7th 2011
The Observational Cosmology and Instrumentation Group at the Instituto de Física de Cantabria (CSIC-UC) is involved in several aspects of the data analysis of Planck, ESA’s mission to study the Cosmic Microwave Background Radiation.
In addition to Planck, we are involved in the analysis of data from other experiments such as WMAP and Herschel, and in the simulation and data analysis of a new CMB experiment called QUIJOTE.
During the EGEE-III project we dedicated a fair amount of time and effort to porting several applications for CMB-related analysis to the GRID:
– Detection of Point Sources in single frequency maps
– Detection of Point Sources using multifrequency information
– Detection of Clusters of Galaxies using the SZ effect
– Detection of Non-Gaussian features in CMB maps
Our experience using the GRID when running these applications was very variable:
– We used input maps of up to 200-400 MB each, and we had many of them!
– To do some multifrequency analysis we had to load more than one map at a time, and nodes with several GB of RAM were not always available.
– In particular, the original SZ cluster application required up to 9 all-sky maps, and the nodes did not have enough RAM to hold the maps and the intermediate files they produced.
– To solve this we decided to divide the maps into hundreds of patches beforehand, then group and gzip them and put them in the SE before starting the job (a sketch of this pre-processing step follows this list).
– This strategy worked very well and we used the GRID to do the SZ analysis of Planck simulations for about two years, but it also implied some additional pre-processing work.
– Analogously, we were able to adapt the non-Gaussianity (NG) codes to avoid continuous data traffic between the SE and the nodes, and they also worked very well.
– Moving this amount of information from the Storage Elements to the nodes was always problematic and very often saturated the network.
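As an illustration of this pre-processing step, here is a minimal sketch of the kind of script we mean: cut an all-sky map into small flat patches, bundle and gzip them, and register the bundle in the Storage Element before submitting the job. It assumes the healpy HEALPix package and the gLite/LCG lcg-cr client of that period; the file names, patch grid, VO and SE endpoint are hypothetical, not our production setup.

    import os
    import subprocess
    import tarfile

    import healpy as hp
    import numpy as np

    N_LON = 10                 # patches along longitude (illustrative grid)
    N_LAT = 10                 # patches along latitude
    PATCH_SIZE_PIX = 256       # pixels per patch side
    PATCH_RESO_ARCMIN = 3.0    # gnomonic projection resolution [arcmin]

    # Hypothetical all-sky input map (a few hundred MB at Planck resolutions).
    sky = hp.read_map("planck_sim_100GHz.fits")

    patch_files = []
    for i in range(N_LON):
        for j in range(N_LAT):
            lon = 360.0 * (i + 0.5) / N_LON
            lat = -60.0 + 120.0 * (j + 0.5) / N_LAT
            # Project a small flat patch around (lon, lat);
            # no_plot avoids opening a figure (available in recent healpy).
            patch = hp.gnomview(sky, rot=(lon, lat), xsize=PATCH_SIZE_PIX,
                                reso=PATCH_RESO_ARCMIN,
                                return_projected_map=True, no_plot=True)
            fname = "patch_%02d_%02d.npy" % (i, j)
            np.save(fname, np.asarray(patch))
            patch_files.append(fname)

    # Group and gzip the patches into a single bundle ...
    bundle = "patches_100GHz.tar.gz"
    with tarfile.open(bundle, "w:gz") as tar:
        for fname in patch_files:
            tar.add(fname)

    # ... and register it in the Storage Element before the job is submitted,
    # so the worker node only has to fetch one modest-sized file.
    # VO name, SE host and logical file name are hypothetical.
    subprocess.check_call([
        "lcg-cr", "--vo", "planck",
        "-d", "srm.example.ifca.es",
        "-l", "lfn:/grid/planck/patches_100GHz.tar.gz",
        "file://" + os.path.abspath(bundle),
    ])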
– It was also common to have failed jobs because the codes started to run before the maps had fully arrived at the nodes, even though this should never happen given the sequential order of commands in the scripts.
– Maybe the most important problem was the high rate of failed/killed jobs. Sometimes 100% of the jobs ran smoothly, and sometimes we had to resubmit most, if not all, of the jobs "by hand" because they had failed for unknown reasons.
– One way to improve the situation was to force the jobs to run on nodes physically close to the Storage Element; the rate of successful jobs improved.
– No tool to control failed jobs was available, and resubmission by hand was not a sustainable option. We heard of things like "metaschedulers", GridWay, etc., but they were not available in our infrastructure (a sketch of the kind of tool we were missing follows this list).
– In the near future our input maps may be as large as 2 GB each, which will only make these problems worse.
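For illustration, here is a minimal sketch of the kind of resubmission wrapper we were missing: submit a set of jobs, poll their status periodically, and automatically resubmit the ones that end up aborted. It assumes the gLite WMS command-line clients of the time (glite-wms-job-submit, glite-wms-job-status); the JDL file names are hypothetical, and the parsing of the command output is only indicative of how the middleware reports "Current Status".

    import subprocess
    import time

    JDL_FILES = ["szjob_%03d.jdl" % k for k in range(100)]   # hypothetical JDL files
    POLL_SECONDS = 600                                        # check every 10 minutes

    def submit(jdl):
        """Submit one job and return its WMS job identifier (the https://... line)."""
        out = subprocess.check_output(["glite-wms-job-submit", "-a", jdl], text=True)
        for line in out.splitlines():
            if line.startswith("https://"):
                return line.strip()
        raise RuntimeError("could not parse job id for %s" % jdl)

    def status(job_id):
        """Return the 'Current Status' string reported for a job, or 'Unknown'."""
        out = subprocess.check_output(["glite-wms-job-status", job_id], text=True)
        for line in out.splitlines():
            if "Current Status" in line:
                return line.split(":", 1)[1].strip()
        return "Unknown"

    # Submit everything once, then keep track of the jobs still pending.
    pending = {jdl: submit(jdl) for jdl in JDL_FILES}

    while pending:
        time.sleep(POLL_SECONDS)
        for jdl, job_id in list(pending.items()):
            st = status(job_id)
            if st.startswith("Done"):
                # Finished; a real tool would also check the exit code here.
                del pending[jdl]
            elif st.startswith(("Aborted", "Cancelled")):
                # This is the step we used to do by hand: resubmit automatically.
                print("resubmitting %s (previous status: %s)" % (jdl, st))
                pending[jdl] = submit(jdl)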
Not everything was bad: there was a huge amount of resources available at a time when obtaining tens of thousands of CPU hours on HPC clusters was not easy, and we did use the GRID a lot.
But… with the launch of Planck in mid-2009 the amount of work increased to a level at which we could not afford to spend valuable time dealing with GRID problems.
So we moved away from the GRID and started to work with conventional clusters.
• Current activities at IFCA (not distributed in the sense of GRID or Cloud):
– The number of people in our group running jobs on HPC systems has doubled in the last couple of years. Now most of the people in the group (10-12 out of 15) use clusters for their daily work.
– We use big infrastructures across Spain and Europe, and sometimes in the US:
• Spain: GRID-CSIC, Altamira (part of the BSC at IFCA)
• Finland: CSC (member of PRACE)
• UK: Darwin, Cosmos and Universe in Cambridge
• Italy: Planck LFI cluster in Trieste
• Germany: a cluster at the Forschungszentrum Jülich
• US: NERSC, IPAC
• Estimate of the CPU time used in the last year, per project:
Of the order of a few million CPU hours and tens of TB of storage.
Component Separation (Planck)

                      Last 12 months    Next 6 months
Total CPU hours       2,000             20,000
RAM/job [GB]          3                 3
Total Storage [GB]    60                500
PARALLEL Jobs         NO                NO
Infrastructure        Altamira          Altamira
Component Separation, Pol (WMAP)

                      Last 12 months
Total CPU hours       100,000
RAM/job [GB]          2
Total Storage [GB]    40
PARALLEL Jobs         NO
Infrastructure        Altamira
Anomalies in LSS (WMAP)

                      Next 6 months
Total CPU hours       40,000
RAM/job [GB]          3
Total Storage [GB]    3
PARALLEL Jobs         NO
Infrastructure        Altamira
Non-Gaussianity Analysis (Planck, f_NL wavelets)

                      Last 12 months    Next 6 months
Total CPU hours       600,000           300,000
RAM/job [GB]          5                 5
Total Storage [GB]    1,000             500
PARALLEL Jobs         NO                NO
Infrastructure        GRID-CSIC         GRID-CSIC
Non-Gaussianity Analysis (Planck, f_NL wavelets)

                      Last 12 months    Next 6 months
Total CPU hours       350,000           175,000
RAM/job [GB]          3                 3
Total Storage [GB]    1,000             500
PARALLEL Jobs         YES               YES
Infrastructure        CSC               CSC
Non-Gaussianity Analysis (WMAP, neural networks)

                      Last 12 months    Next 6 months
Total CPU hours       650,000           325,000
RAM/job [GB]          3                 3
Total Storage [GB]    500               250
PARALLEL Jobs         YES               YES
Infrastructure        Altamira          Altamira
Non-Gaussianity Analysis (WMAP, neural networks)

                      Last 12 months    Next 6 months
Total CPU hours       600,000           300,000
RAM/job [GB]          3                 3
Total Storage [GB]    500               250
PARALLEL Jobs         YES               YES
Infrastructure        GRID-CSIC         GRID-CSIC
Non-Gaussianity Analysis (WMAP, neural networks)

                      Last 12 months
Total CPU hours       600,000
RAM/job [GB]          3
Total Storage [GB]    500
PARALLEL Jobs         YES
Infrastructure        Darwin
Non-Gaussianity Analysis (WMAP, Hamiltonian Sampling)

                      Last 12 months    Next 6 months
Total CPU hours       10,000            5,000
RAM/job [GB]          3                 3
Total Storage [GB]    1,000             500
PARALLEL Jobs         YES               YES
Infrastructure        Cosmos/Universe   Cosmos/Universe
SZ Cluster Detection (Planck)

                      Last 12 months    Next 6 months
Total CPU hours       80,000            40,000
RAM/job [GB]          2                 2
Total Storage [GB]    50                25
PARALLEL Jobs         NO                NO
Infrastructure        Altamira          Altamira
SZ Cluster Detection (Planck)

                      Last 12 months    Next 6 months
Total CPU hours       10,000            5,000
RAM/job [GB]          2                 2
Total Storage [GB]    20                10
PARALLEL Jobs         NO                NO
Infrastructure        Trieste           Trieste
PS Detection (Planck)

                      Last 12 months    Next 6 months
Total CPU hours       5,000             2,500
RAM/job [GB]          2                 2
Total Storage [GB]    2                 2
PARALLEL Jobs         NO                NO
Infrastructure        Trieste           Trieste
Bayesian PS Detection in Pol (Planck/QUIJOTE)

                      Last 12 months    Next 6 months
Total CPU hours       4,000             2,000
RAM/job [GB]          2                 2
Total Storage [GB]    2                 2
PARALLEL Jobs         NO                NO
Infrastructure        Altamira          Altamira
LSS cluster simulations

                      Last 6 months     Next 6 months
Total CPU hours       500,000           500,000
RAM/job [GB]          4                 4
Total Storage [GB]    45,000            45,000
PARALLEL Jobs         YES               YES
Infrastructure        Jülich (DE)       Jülich (DE)
Summary of the resources used per project:

                                    Last 12 months                        Next 6 months
                             CPU hours   RAM [GB]  Storage [GB]    CPU hours   RAM [GB]  Storage [GB]
Component Separation           100,000      3           100           20,000      5          500
SZ Cluster detection            90,000      2            70           45,000      2           35
PS detection (T and P)           5,000      2             2            2,500      2           35
PS Detection (Bayesian)          5,000      2             2            2,500      2            2
Non-Gaussianity              2,800,000      3         4,500        1,100,000      3        2,000
Large Scale Structure sims     500,000      4        45,000          540,000    3-4       45,000
Total                        3,500,000    2-4         50 TB        1,710,000    2-5        48 TB
CONCLUSIONS

• The activities that we carry out at IFCA demand a huge amount of computational resources.
• We were involved in EGEE-III because we really thought that the GRID was the solution to our computing problems, but it evolves too slowly outside the High Energy Physics community.
• After a short learning period we were in a position to submit jobs to the GRID, and we did, but the success rate was too variable.
• We backed the proposal for a big Grid infrastructure in Spain and we use it, but in HPC mode.
• At the same time we started to have access to HPC clusters in Spain and across Europe where jobs run smoothly, and our hopes for a usable GRID died.
• Maybe because of the kind of jobs that we run, the GRID never transitioned from experimental to production, and our continuous Planck data analysis could not wait.
• Maybe the situation has evolved in the last 12-18 months and the GRID in A&A is now more mature.
• In our group at IFCA we are always in need of additional computing resources (we recently bought our own small cluster, with the particularity that single jobs can access up to 256 GB of RAM).
• In addition to Planck and WMAP, we are involved in other experiments that will require large amounts of computing resources (QUIJOTE and the PAU-Javalambre Astrophysical Survey), and we would be happy to count on the GRID or Cloud Computing if it is really at a production stage.