(D)CI related activities at IFCA
Marcos López-Caniego
Instituto de Física de Cantabria (CSIC-UC)
Astro VRC Workshop Paris Nov 7th 2011
The Observational Cosmology and Instrumentation Group at the Instituto de Física de Cantabria (CSIC-UC) is involved in several aspects of the data analysis of Planck, ESA’s mission to study the Cosmic Microwave Background Radiation.
In addition to Planck, we are involved in the analysis of data from other experiments such as WMAP and Herschel, and in the simulation and data analysis of a new CMB experiment called QUIJOTE.
During the EGEE-III project we dedicated a fair amount of time and effort to porting several applications for CMB-related analysis to the GRID:
– Detection of Point Sources in single frequency maps
– Detection of Point Sources using multifrequency information
– Detection of Clusters of Galaxies using the SZ effect
– Detection of Non-Gaussian features in CMB maps
Our experience using the GRID when running these applications was very variable:
– We used input maps of up to 200-400 MB each, and we had many of them!
– To do some multifrequency analysis we had to load more than one map at a time, and nodes with several GB of RAM were not always available.
– In particular, the original SZ cluster application required up to 9 all-sky maps, and the nodes did not have enough RAM to hold the maps and the intermediate files they produced.
– To solve this we decided to divide the maps into hundreds of patches beforehand, then group and gzip them and put them in the SE before starting the job (a sketch of this pre-processing step follows this list).
– This strategy worked very well and we used the GRID to do the SZ analysis of Planck simulations for about two years, but it also implied some additional pre-processing work.
– Analogously, we were able to adapt the non-Gaussianity (NG) codes to avoid continuous data traffic between the SE and the nodes, and they also worked very well.
– Moving this amount of information from the Storage Elements to the nodes was always problematic and very often saturated the network.
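As an illustration of this pre-processing step, here is a minimal sketch of the kind of script we mean: cut an all-sky map into small flat patches, bundle and gzip them, and register the bundle in the Storage Element before submitting the job. It assumes the healpy HEALPix package and the gLite/LCG lcg-cr client of that period; the file names, patch grid, VO and SE endpoint are hypothetical, not our production setup.

    import os
    import subprocess
    import tarfile

    import healpy as hp
    import numpy as np

    N_LON = 10                 # patches along longitude (illustrative grid)
    N_LAT = 10                 # patches along latitude
    PATCH_SIZE_PIX = 256       # pixels per patch side
    PATCH_RESO_ARCMIN = 3.0    # gnomonic projection resolution [arcmin]

    # Hypothetical all-sky input map (a few hundred MB at Planck resolutions).
    sky = hp.read_map("planck_sim_100GHz.fits")

    patch_files = []
    for i in range(N_LON):
        for j in range(N_LAT):
            lon = 360.0 * (i + 0.5) / N_LON
            lat = -60.0 + 120.0 * (j + 0.5) / N_LAT
            # Project a small flat patch around (lon, lat);
            # no_plot avoids opening a figure (available in recent healpy).
            patch = hp.gnomview(sky, rot=(lon, lat), xsize=PATCH_SIZE_PIX,
                                reso=PATCH_RESO_ARCMIN,
                                return_projected_map=True, no_plot=True)
            fname = "patch_%02d_%02d.npy" % (i, j)
            np.save(fname, np.asarray(patch))
            patch_files.append(fname)

    # Group and gzip the patches into a single bundle ...
    bundle = "patches_100GHz.tar.gz"
    with tarfile.open(bundle, "w:gz") as tar:
        for fname in patch_files:
            tar.add(fname)

    # ... and register it in the Storage Element before the job is submitted,
    # so the worker node only has to fetch one modest-sized file.
    # VO name, SE host and logical file name are hypothetical.
    subprocess.check_call([
        "lcg-cr", "--vo", "planck",
        "-d", "srm.example.ifca.es",
        "-l", "lfn:/grid/planck/patches_100GHz.tar.gz",
        "file://" + os.path.abspath(bundle),
    ])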
– It was also common to have failed jobs because the codes started to run before the maps had fully arrived at the nodes, even though this should never happen given the sequential order of commands in the scripts.
– Maybe the most important problem was the high rate of failed/killed jobs. Sometimes 100% of the jobs ran smoothly, and sometimes we had to resubmit most, if not all, of the jobs "by hand" because they had failed for unknown reasons.
– One way to improve the situation was to force the jobs to run on nodes physically close to the Storage Element; the rate of successful jobs improved.
– No tool to control failed jobs was available, and resubmission by hand was not a sustainable option. We heard of things like "metaschedulers", GridWay, etc., but they were not available in our infrastructure (a sketch of the kind of tool we were missing follows this list).
– In the near future our input maps may be as large as 2 GB each, which will only make these problems worse.
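For illustration, here is a minimal sketch of the kind of resubmission wrapper we were missing: submit a set of jobs, poll their status periodically, and automatically resubmit the ones that end up aborted. It assumes the gLite WMS command-line clients of the time (glite-wms-job-submit, glite-wms-job-status); the JDL file names are hypothetical, and the parsing of the command output is only indicative of how the middleware reports "Current Status".

    import subprocess
    import time

    JDL_FILES = ["szjob_%03d.jdl" % k for k in range(100)]   # hypothetical JDL files
    POLL_SECONDS = 600                                        # check every 10 minutes

    def submit(jdl):
        """Submit one job and return its WMS job identifier (the https://... line)."""
        out = subprocess.check_output(["glite-wms-job-submit", "-a", jdl], text=True)
        for line in out.splitlines():
            if line.startswith("https://"):
                return line.strip()
        raise RuntimeError("could not parse job id for %s" % jdl)

    def status(job_id):
        """Return the 'Current Status' string reported for a job, or 'Unknown'."""
        out = subprocess.check_output(["glite-wms-job-status", job_id], text=True)
        for line in out.splitlines():
            if "Current Status" in line:
                return line.split(":", 1)[1].strip()
        return "Unknown"

    # Submit everything once, then keep track of the jobs still pending.
    pending = {jdl: submit(jdl) for jdl in JDL_FILES}

    while pending:
        time.sleep(POLL_SECONDS)
        for jdl, job_id in list(pending.items()):
            st = status(job_id)
            if st.startswith("Done"):
                # Finished; a real tool would also check the exit code here.
                del pending[jdl]
            elif st.startswith(("Aborted", "Cancelled")):
                # This is the step we used to do by hand: resubmit automatically.
                print("resubmitting %s (previous status: %s)" % (jdl, st))
                pending[jdl] = submit(jdl)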
Not everything was bad: there was a huge amount of resources available at a time when obtaining tens of thousands of CPU hours on HPC clusters was not easy, and we did use the GRID a lot.
But… with the launch of Planck in mid-2009 the amount of work increased to a level at which we could not afford to spend valuable time dealing with GRID problems.
So we moved away from the GRID and started to work with conventional clusters.
• Current activities at IFCA (not distributed in the sense of GRID or Cloud):
– The number of people in our group running jobs on HPC systems has doubled in the last couple of years. Now most of the people in the group (10-12 out of 15) use clusters for their daily work.
– We use big infrastructures across Spain and Europe, and sometimes in the US:
• Spain: GRID-CSIC, Altamira (part of the BSC at IFCA)
• Finland: CSC (member of PRACE)
• UK: Darwin, Cosmos and Universe in Cambridge
• Italy: Planck LFI cluster in Trieste
• Germany: a cluster at the Forschungszentrum Jülich
• US: NERSC, IPAC
• Estimate of the CPU time used in the last year, per project:
Of the order of a few million CPU hours and tens of TB of storage.
Component Separation (Planck)

                      Last 12 months    Next 6 months
Total CPU hours       2,000             20,000
RAM/job [GB]          3                 3
Total Storage [GB]    60                500
PARALLEL Jobs         NO                NO
Infrastructure        Altamira          Altamira
Component Separation, Pol (WMAP)

                      Last 12 months
Total CPU hours       100,000
RAM/job [GB]          2
Total Storage [GB]    40
PARALLEL Jobs         NO
Infrastructure        Altamira
Anomalies in LSS (WMAP)

                      Next 6 months
Total CPU hours       40,000
RAM/job [GB]          3
Total Storage [GB]    3
PARALLEL Jobs         NO
Infrastructure        Altamira
Non-Gaussianity Analysis (Planck, f_NL wavelets)

                      Last 12 months    Next 6 months
Total CPU hours       600,000           300,000
RAM/job [GB]          5                 5
Total Storage [GB]    1,000             500
PARALLEL Jobs         NO                NO
Infrastructure        GRID-CSIC         GRID-CSIC
Non-Gaussianity Analysis (Planck, f_NL wavelets)

                      Last 12 months    Next 6 months
Total CPU hours       350,000           175,000
RAM/job [GB]          3                 3
Total Storage [GB]    1,000             500
PARALLEL Jobs         YES               YES
Infrastructure        CSC               CSC
Non-Gaussianity Analysis (WMAP, neural networks)

                      Last 12 months    Next 6 months
Total CPU hours       650,000           325,000
RAM/job [GB]          3                 3
Total Storage [GB]    500               250
PARALLEL Jobs         YES               YES
Infrastructure        Altamira          Altamira
Non-Gaussianity Analysis (WMAP, neural networks)

                      Last 12 months    Next 6 months
Total CPU hours       600,000           300,000
RAM/job [GB]          3                 3
Total Storage [GB]    500               250
PARALLEL Jobs         YES               YES
Infrastructure        GRID-CSIC         GRID-CSIC
Non-Gaussianity Analysis (WMAP, neural networks)

                      Last 12 months
Total CPU hours       600,000
RAM/job [GB]          3
Total Storage [GB]    500
PARALLEL Jobs         YES
Infrastructure        Darwin
Non-Gaussianity Analysis (WMAP, Hamiltonian Sampling)

                      Last 12 months    Next 6 months
Total CPU hours       10,000            5,000
RAM/job [GB]          3                 3
Total Storage [GB]    1,000             500
PARALLEL Jobs         YES               YES
Infrastructure        Cosmos/Universe   Cosmos/Universe
SZ Cluster Detection (Planck)

                      Last 12 months    Next 6 months
Total CPU hours       80,000            40,000
RAM/job [GB]          2                 2
Total Storage [GB]    50                25
PARALLEL Jobs         NO                NO
Infrastructure        Altamira          Altamira
SZ Cluster Detection (Planck)

                      Last 12 months    Next 6 months
Total CPU hours       10,000            5,000
RAM/job [GB]          2                 2
Total Storage [GB]    20                10
PARALLEL Jobs         NO                NO
Infrastructure        Trieste           Trieste
PS Detection (Planck)

                      Last 12 months    Next 6 months
Total CPU hours       5,000             2,500
RAM/job [GB]          2                 2
Total Storage [GB]    2                 2
PARALLEL Jobs         NO                NO
Infrastructure        Trieste           Trieste
Bayesian PS Detection in Pol (Planck/QUIJOTE)

                      Last 12 months    Next 6 months
Total CPU hours       4,000             2,000
RAM/job [GB]          2                 2
Total Storage [GB]    2                 2
PARALLEL Jobs         NO                NO
Infrastructure        Altamira          Altamira
LSS cluster simulations

                      Last 6 months     Next 6 months
Total CPU hours       500,000           500,000
RAM/job [GB]          4                 4
Total Storage [GB]    45,000            45,000
PARALLEL Jobs         YES               YES
Infrastructure        Jülich (DE)       Jülich (DE)
Summary of the resources used per project:

                                    Last 12 months                        Next 6 months
                             CPU hours   RAM [GB]  Storage [GB]    CPU hours   RAM [GB]  Storage [GB]
Component Separation           100,000      3           100           20,000      5          500
SZ Cluster detection            90,000      2            70           45,000      2           35
PS detection (T and P)           5,000      2             2            2,500      2           35
PS Detection (Bayesian)          5,000      2             2            2,500      2            2
Non-Gaussianity              2,800,000      3         4,500        1,100,000      3        2,000
Large Scale Structure sims     500,000      4        45,000          540,000    3-4       45,000
Total                        3,500,000    2-4         50 TB        1,710,000    2-5        48 TB
CONCLUSIONS

• The activities that we carry out at IFCA demand a huge amount of computational resources.
• We were involved in EGEE-III because we really thought that the GRID was the solution to our computing problems, but it evolves too slowly outside the High Energy Physics community.
• After a short learning period we were in a position to submit jobs to the GRID, and we did, but the success rate was too variable.
• We backed the proposal for a big Grid infrastructure in Spain and we use it, but in HPC mode.
• At the same time we started to have access to HPC clusters in Spain and across Europe where jobs run smoothly, and our hopes for a usable GRID died.
• Maybe because of the kind of jobs that we run, the GRID never transitioned from experimental to production, and our continuous Planck data analysis could not wait.
• Maybe the situation has evolved in the last 12-18 months and the GRID in A&A is now more mature.
• In our group at IFCA we are always in need of additional computing resources (we recently bought our own small cluster, with the particularity that single jobs can access up to 256 GB of RAM).
• In addition to Planck and WMAP, we are involved in other experiments that will require large amounts of computing resources (QUIJOTE and the PAU-Javalambre Astrophysical Survey), and we would be happy to count on the GRID or Cloud Computing if it is really at a production stage.