(D)CI related activities at IFCA Marcos López-Caniego Instituto de Física de Cantabria (CSIC-UC) Astro VRC Workshop Paris Nov 7th 2011

(D)CI related activities at IFCA

Marcos López-Caniego
Instituto de Física de Cantabria (CSIC-UC)

Astro VRC Workshop Paris Nov 7th 2011


The Observational Cosmology and Instrumentation Group at the Instituto de Física de Cantabria (CSIC-UC) is involved in several aspects of the data analysis of Planck, ESA’s mission to study the Cosmic Microwave Background Radiation.

In addition to Planck, we are involved in the analysis of data from other experiments such as WMAP and Herschel, and in the simulation and data analysis of a new CMB experiment called QUIJOTE.

During the EGEE-III project we devoted considerable time and effort to porting several CMB-related analysis applications to the GRID:

– Detection of Point Sources in single frequency maps
– Detection of Point Sources using multifrequency information
– Detection of Clusters of Galaxies using the SZ effect
– Detection of Non-Gaussian features in CMB maps


Our experience running these applications on the GRID was very variable:

– We used input maps of up to 200-400 MB each, and we had many of them.

– For some multifrequency analyses we had to load more than one map at a time, and nodes with several GB of RAM were not always available.

– In particular, the original SZ cluster application required up to 9 all-sky maps, and the nodes did not have enough RAM to handle the maps together with the intermediate files that the code produced.

– To solve this we decided to divide the maps into hundreds of patches beforehand, then group them, gzip them and put them in the Storage Element (SE) before starting the job.

– This strategy worked very well, and we used the GRID for the SZ analysis of Planck simulations for about two years, but it also implied some additional pre-processing work.

– Analogously, we were able to adapt the non-Gaussianity (NG) codes to avoid continuous data traffic between the SE and the nodes, and this also worked very well.

– Moving this amount of information from the Storage Elements to the nodes was always problematic and very often saturated the network.
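
The pre-processing step described above (divide a map into patches, group them, gzip them) could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the map is treated as a flat list of pixel values cut into contiguous chunks, which is not necessarily the patch geometry the group used. The function names `split_into_patches` and `pack_patches`, the member file names and the 32-bit float encoding are all hypothetical, and the upload to the SE is not shown.

```python
import io
import tarfile
from array import array


def split_into_patches(pixels, n_patches):
    """Split a flat sequence of pixel values into n_patches nearly equal chunks."""
    size, extra = divmod(len(pixels), n_patches)
    patches, start = [], 0
    for i in range(n_patches):
        # The first `extra` patches get one additional pixel each.
        stop = start + size + (1 if i < extra else 0)
        patches.append(pixels[start:stop])
        start = stop
    return patches


def pack_patches(patches, prefix="patch"):
    """Return the bytes of a gzip-compressed tar archive, one member per patch."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for index, patch in enumerate(patches):
            payload = array("f", patch).tobytes()  # 32-bit floats (assumed encoding)
            info = tarfile.TarInfo(name=f"{prefix}_{index:04d}.bin")
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Toy example: a 1000-pixel "map" cut into 100 patches and packed into one archive.
toy_map = [float(i) for i in range(1000)]
patches = split_into_patches(toy_map, 100)
archive_bytes = pack_patches(patches)
```

Packing the patches into a single compressed archive means each job fetches one file from the SE instead of many small ones, which is the kind of reduction in SE-to-node traffic that the bullets above are after.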


– Jobs also commonly failed because the codes started running before the maps had fully arrived at the nodes, even though the sequential order of commands in the scripts should have made this impossible.

– Perhaps our most important problem was a high rate of failed or killed jobs. Sometimes 100% of the jobs ran smoothly, and sometimes we had to resubmit most, if not all, of the jobs “by hand” because they had failed for unknown reasons.

– One way to improve the situation was to force the jobs to run on nodes physically close to the Storage Element, which raised the rate of successful jobs.

– No tool to manage failed jobs was available, and resubmission by hand was not a sustainable option. I had heard of tools such as “metaschedulers”, GridWay, etc., but they were not available in our infrastructure.

– In the near future our input maps may be as large as 2 GB each, which will cause more problems.
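
Two of the failure modes listed above, codes starting before an input map has fully arrived and failed jobs needing manual resubmission, can be guarded against in a generic way. The following is a minimal sketch using only the Python standard library, not the API of any GRID middleware or metascheduler. The names `file_is_complete` and `run_with_retries` are hypothetical, and it assumes that a SHA-256 checksum of each input file is known in advance.

```python
import hashlib
import time


def file_is_complete(path, expected_sha256):
    """True only if the file exists and its SHA-256 digest matches the expected one."""
    digest = hashlib.sha256()
    try:
        with open(path, "rb") as handle:
            # Read in 1 MiB chunks so large maps need not fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    except OSError:
        return False
    return digest.hexdigest() == expected_sha256


def run_with_retries(job, max_attempts=3, delay_seconds=0.0):
    """Call job() until it succeeds or max_attempts is reached.

    Returns (result, attempts). Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)
```

A job wrapper could call `file_is_complete` on every staged map before launching the analysis code and raise an error otherwise, so that a partial transfer is retried by `run_with_retries` rather than surfacing as an unexplained failure.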

Not everything was bad: a huge amount of resources was available at a time when obtaining tens of thousands of CPU hours on HPC clusters was not easy, and we did use the GRID a lot.

But with the launch of Planck in mid-2009 the amount of work increased to a level at which we could not afford to spend valuable time dealing with GRID problems.

So we moved away from the GRID and started to work with conventional clusters.


• Current activities at IFCA (not distributed in the sense of GRID or Cloud):

– The number of people in our group running jobs on HPC systems has doubled in the last couple of years. Most of the group (10-12 out of 15) now use clusters for their daily work.

– We use big infrastructures across Spain, the rest of Europe and sometimes the US:

• Spain: GRID-CSIC, Altamira (part of the BSC at IFCA)
• Finland: CSC (member of PRACE)
• UK: Darwin, Cosmos and Universe in Cambridge
• Italy: Planck LFI cluster in Trieste
• Germany: a cluster at the Forschungszentrum Jülich
• US: NERSC, IPAC

• Estimate of the CPU time used in the last year, per project: of the order of a few million CPU hours and tens of TB of storage.

Component Separation

                      Planck            Planck
                      Last 12 months    Next 6 months
Total CPU hours       2.000             20.000
RAM/job [GB]          3                 3
Total Storage [GB]    60                500
Parallel jobs         NO                NO
Infrastructure        Altamira          Altamira

Component Separation Pol

                      WMAP
                      Last 12 months
Total CPU hours       100.000
RAM/job [GB]          2
Total Storage [GB]    40
Parallel jobs         NO
Infrastructure        Altamira

Anomalies in LSS

                      WMAP
                      Next 6 months
Total CPU hours       40.000
RAM/job [GB]          3
Total Storage [GB]    3
Parallel jobs         NO
Infrastructure        Altamira

Non-Gaussianity Analysis

                      Planck f_nl wavelets    Planck f_nl wavelets
                      Last 12 months          Next 6 months
Total CPU hours       600.000                 300.000
RAM/job [GB]          5                       5
Total Storage [GB]    1000                    500
Parallel jobs         NO                      NO
Infrastructure        GRID-CSIC               GRID-CSIC

Non-Gaussianity Analysis

                      Planck f_nl wavelets    Planck f_nl wavelets
                      Last 12 months          Next 6 months
Total CPU hours       350.000                 175.000
RAM/job [GB]          3                       3
Total Storage [GB]    1000                    500
Parallel jobs         YES                     YES
Infrastructure        CSC                     CSC

Non-Gaussianity Analysis

                      WMAP neural networks    WMAP neural networks
                      Last 12 months          Next 6 months
Total CPU hours       650.000                 325.000
RAM/job [GB]          3                       3
Total Storage [GB]    500                     250
Parallel jobs         YES                     YES
Infrastructure        Altamira                Altamira

Non-Gaussianity Analysis

                      WMAP neural networks    WMAP neural networks
                      Last 12 months          Next 6 months
Total CPU hours       600.000                 300.000
RAM/job [GB]          3                       3
Total Storage [GB]    500                     250
Parallel jobs         YES                     YES
Infrastructure        GRID-CSIC               GRID-CSIC

Non-Gaussianity Analysis

                      WMAP neural networks
                      Last 12 months
Total CPU hours       600.000
RAM/job [GB]          3
Total Storage [GB]    500
Parallel jobs         YES
Infrastructure        Darwin

Non-Gaussianity Analysis

                      WMAP Hamiltonian sampling   WMAP Hamiltonian sampling
                      Last 12 months              Next 6 months
Total CPU hours       10.000                      5.000
RAM/job [GB]          3                           3
Total Storage [GB]    1000                        500
Parallel jobs         YES                         YES
Infrastructure        Cosmos/Universe             Cosmos/Universe

SZ Cluster Detection

                      Planck            Planck
                      Last 12 months    Next 6 months
Total CPU hours       80.000            40.000
RAM/job [GB]          2                 2
Total Storage [GB]    50                25
Parallel jobs         NO                NO
Infrastructure        Altamira          Altamira

SZ Cluster Detection

                      Planck            Planck
                      Last 12 months    Next 6 months
Total CPU hours       10.000            5.000
RAM/job [GB]          2                 2
Total Storage [GB]    20                10
Parallel jobs         NO                NO
Infrastructure        Trieste           Trieste

PS Detection

                      Planck            Planck
                      Last 12 months    Next 6 months
Total CPU hours       5.000             2.500
RAM/job [GB]          2                 2
Total Storage [GB]    2                 2
Parallel jobs         NO                NO
Infrastructure        Trieste           Trieste

Bayesian PS Detection in Pol

                      Planck/QUIJOTE    Planck/QUIJOTE
                      Last 12 months    Next 6 months
Total CPU hours       4.000             2.000
RAM/job [GB]          2                 2
Total Storage [GB]    2                 2
Parallel jobs         NO                NO
Infrastructure        Altamira          Altamira

LSS cluster simulations

                      Last 6 months     Next 6 months
Total CPU hours       500.000           500.000
RAM/job [GB]          4                 4
Total Storage [GB]    45.000            45.000
Parallel jobs         YES               YES
Infrastructure        Julich-DE         Julich-DE

                                     Last 12 months                          Next 6 months
                                     CPU hours   RAM [GB]   Storage [GB]     CPU hours   RAM [GB]   Storage [GB]
Component Separation                 100.000     3          100              20.000      5          500
SZ Cluster Detection                 90.000      2          70               45.000      2          35
PS Detection (T and P)               5.000       2          2                2.500       2          35
PS Detection (Bayesian)              5.000       2          2                2.500       2          2
Non-Gaussianity                      2.800.000   3          4.500            1.100.000   3          2.000
Large Scale Structure simulations    500.000     4          45.000           540.000     3-4        45.000
Total                                3.500.000   2-4 GB     50 TB            1.710.000   2-5 GB     48 TB
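
As a quick cross-check of the Total row in the summary table above, the per-project figures can be summed directly. The sketch below copies the CPU-hour and storage values from the table, reading the dots as thousands separators. It assumes that the per-project storage values are in GB (as in the per-project slides) and that 1 TB = 1000 GB. The dictionary and function names are hypothetical.

```python
# (CPU hours, storage in GB) per project, copied from the summary table.
LAST_12_MONTHS = {
    "Component Separation": (100_000, 100),
    "SZ Cluster Detection": (90_000, 70),
    "PS Detection (T and P)": (5_000, 2),
    "PS Detection (Bayesian)": (5_000, 2),
    "Non-Gaussianity": (2_800_000, 4_500),
    "Large Scale Structure simulations": (500_000, 45_000),
}

NEXT_6_MONTHS = {
    "Component Separation": (20_000, 500),
    "SZ Cluster Detection": (45_000, 35),
    "PS Detection (T and P)": (2_500, 35),
    "PS Detection (Bayesian)": (2_500, 2),
    "Non-Gaussianity": (1_100_000, 2_000),
    "Large Scale Structure simulations": (540_000, 45_000),
}


def totals(table):
    """Return (total CPU hours, total storage in GB) for one period."""
    cpu_hours = sum(cpu for cpu, _ in table.values())
    storage_gb = sum(storage for _, storage in table.values())
    return cpu_hours, storage_gb


for label, table in (("Last 12 months", LAST_12_MONTHS),
                     ("Next 6 months", NEXT_6_MONTHS)):
    cpu_hours, storage_gb = totals(table)
    # Assumes 1 TB = 1000 GB when converting the storage total.
    print(f"{label}: {cpu_hours:,} CPU hours, {storage_gb / 1000:.1f} TB")
```

Under these assumptions the CPU-hour sums match the quoted totals of 3.500.000 and 1.710.000 exactly, and the storage sums (49.674 GB and 47.572 GB) round to the quoted 50 TB and 48 TB.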

CONCLUSIONS

• The activities that we carry out at IFCA demand a huge amount of computational resources.

• We were involved in EGEE-III because we really thought that the GRID was the solution to our computing problems, but it evolves too slowly outside the High Energy Physics community.

• After a short learning period we were in a position to submit jobs to the GRID, and we did, but the success rate was too variable.

• We backed the proposal for a big Grid infrastructure in Spain, and we use it, but in HPC mode.

• At the same time we started to have access to HPC clusters in Spain and across Europe where jobs ran smoothly, and our hopes for a usable GRID died.

• Maybe because of the kind of jobs that we run, the GRID never transitioned from experimental to production for us, and our continuous Planck data analysis could not wait.

• Maybe the situation has evolved in the last 12-18 months and the GRID in A&A is now more mature.

• In our group at IFCA we are always in need of additional computing resources (we recently bought our own small cluster, whose distinguishing feature is that a single job can access up to 256 GB of RAM).

• In addition to Planck and WMAP, we are involved in other experiments that will require large amounts of computing resources (QUIJOTE and the PAU-Javalambre Astrophysical Survey), and we would be happy to count on the GRID or Cloud Computing if it is really at a production stage.