INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
N. Jacq, 05/09/05, biomedical meeting
Feedback on the experiences of the BioMed DC
Nicolas Jacq
LPC, IN2P3/CNRS, France
DC report
• WISDOM statistics
  – Not all instances registered yet
  – Wisdom.egee-eu.fr unavailable
• JRA2 statistics
  – Taken from the RBs only; biomed DC jobs were selected
  – 1 RB missing (SINICA, for technical reasons)
• Estimation
  – Number of jobs run: +60 000
  – CPU consumed: +100 CPU years
  – Total size: -1TB
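As a rough order-of-magnitude check on the estimate above, the following Python sketch divides the CPU time consumed by the number of jobs. It treats the "+60 000" and "+100 CPU years" figures as round numbers, which is an assumption, so the result is only indicative. The variable names are illustrative.

```python
# Figures from the estimate, taken as round numbers (assumption).
jobs_run = 60_000
cpu_years = 100

# Convert CPU years to CPU hours using a 365-day year.
HOURS_PER_YEAR = 365 * 24
cpu_hours = cpu_years * HOURS_PER_YEAR

# Average CPU time per job, in hours.
avg_cpu_hours_per_job = cpu_hours / jobs_run
print(f"{avg_cpu_hours_per_job:.1f} CPU hours per job")  # prints "14.6 CPU hours per job"
```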
Job failures
• Estimation
  – 40% done successfully
  – 20% done unsuccessfully (after checking)
  – 10% aborted
  – 10% cancelled: resubmission, errors during the process
  – 20% failed/unknown: RB or node failure, other reasons?
• Reasons for unsuccessful done jobs
  – 80% license server: server down, power cut at SCAI, reduced number of server licenses, FlexX jobs not possible on the CE. Important because of automatic resubmission
  – 15% CE configuration: tar problem, missing space…
  – 5% no results transfer: lcg AND globus commands failed
• Reasons for aborted jobs
  – 63% mismatching resources: failing middleware component or wrong request in the job JDL
  – 28% wrong configuration
  – 4% network/connection failure
  – 4% proxy problems
  – 1% JDL problem
• Finally
  – ~30% of failures due to the grid (middleware, resources…)
  – ~30% due to the WISDOM application
  – 40% done successfully
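The two cause breakdowns above are given as percentages within their own outcome class. A minimal Python sketch, using only the figures from the slide, converts them into shares of all jobs. It does not attempt to reproduce the final ~30% / ~30% / 40% split, since the slide does not say how the cancelled and failed/unknown jobs were attributed. The dictionary and function names are illustrative.

```python
# Outcome shares of all jobs, in percent (from the slide).
outcomes = {
    "done_successfully": 40,
    "done_unsuccessfully": 20,
    "aborted": 10,
    "cancelled": 10,
    "failed_unknown": 20,
}

# Cause breakdown within two outcome classes, in percent of that class.
causes = {
    "done_unsuccessfully": {
        "license server": 80,
        "CE configuration": 15,
        "no results transfer": 5,
    },
    "aborted": {
        "mismatching resources": 63,
        "wrong configuration": 28,
        "network/connection failure": 4,
        "proxy problems": 4,
        "JDL problem": 1,
    },
}


def share_of_all_jobs(outcomes, causes):
    """Return each cause as a percentage of all jobs."""
    shares = {}
    for outcome, breakdown in causes.items():
        for cause, pct in breakdown.items():
            shares[cause] = outcomes[outcome] * pct / 100
    return shares


shares = share_of_all_jobs(outcomes, causes)
for cause, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{cause:28s} {pct:5.1f}% of all jobs")
```

Read this way, the license server alone accounts for about 16% of all submitted jobs, more than all aborted jobs together.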
Operational issues
• RB: main bottleneck despite the 12 available RBs (3 to 7 in use at the same time)
  – Overload, crash, space limit for the shared repository, disk failure, bad status information for done jobs in the CE, impossible to download OutputSandboxes
• RLS/RMC: was the main bottleneck for me before the DC
  – No RLS/RMC during a Renater cut
  – 10% of lcg-cp and lcg-cr commands failed
  – 5% of globus-url-copy commands also failed
• SE: no important problem
  – Corrupted tar in the UPV SE
  – Power cut at SCAI, Renater cut for CC
• CE: no critical problem
  – Configuration problems, air-conditioning
  – BDII update problem (problems for job distribution)
  – Difference of priority between biomgrid and biomed certificates
• UI: handicap for the submission
  – Slow in the multithreaded submission process, disk space, crash
Proposed improvements (for intensive use of the grid)
• Load balancing
  – Know all possible CE configurations specific to a VO
  – Have a dynamic information system update for intensive submission
  – Robustness of the BDII update in a CE
  – Define a ranking attribute appropriate for intensive use
• Avoid bottlenecks
  – Not only 1 RLS/RMC (and not only 1 license server)
  – RB synchronisation
    As it is now, an RB failure means losing jobs or OutputSandboxes. There is a real need for synchronisation between RBs (especially the LB service), so that a job can be sent through a given RB and its status checked or its results retrieved via another. This would mean that job databases and OutputSandboxes are not stored on the RB but somewhere on the grid, can be reached by any RB, and are maybe replicated. The idea is to have job management similar to data management, with something like the LFN for jobs.
• Process management
  – Know all possible errors of the grid (RB, commands)
  – Small nodes need a limit on the number of scheduled jobs in their queue
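The RB synchronisation idea above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not existing middleware: `SharedJobCatalog`, `ResourceBroker`, the "logical job name" (LJN) and the `ljn:/...` naming scheme are all hypothetical names introduced here by analogy with the LFN. Job state lives in a catalog outside any single RB, so a job submitted through one RB can still be queried through another after the first one fails.

```python
from dataclasses import dataclass, field


@dataclass
class JobRecord:
    """State of one job, stored outside any single RB."""
    status: str = "SUBMITTED"
    output_sandbox: dict = field(default_factory=dict)


class SharedJobCatalog:
    """Hypothetical grid-wide catalog keyed by a logical job name (LJN)."""

    def __init__(self):
        self._records = {}

    def register(self, ljn):
        self._records[ljn] = JobRecord()

    def lookup(self, ljn):
        return self._records[ljn]


class ResourceBroker:
    """Toy RB that keeps no job state of its own."""

    def __init__(self, name, catalog):
        self.name = name
        self.catalog = catalog
        self.alive = True

    def _require_alive(self):
        if not self.alive:
            raise RuntimeError(f"RB {self.name} is down")

    def submit(self, ljn):
        self._require_alive()
        self.catalog.register(ljn)

    def finish(self, ljn, files):
        # Record completion and store the outputs in the shared catalog.
        self._require_alive()
        record = self.catalog.lookup(ljn)
        record.status = "DONE"
        record.output_sandbox = dict(files)

    def status(self, ljn):
        self._require_alive()
        return self.catalog.lookup(ljn).status

    def get_output(self, ljn):
        self._require_alive()
        return self.catalog.lookup(ljn).output_sandbox


catalog = SharedJobCatalog()
rb_a = ResourceBroker("rb-a", catalog)
rb_b = ResourceBroker("rb-b", catalog)

job = "ljn:/biomed/wisdom/job-0001"  # hypothetical logical job name
rb_a.submit(job)
rb_a.finish(job, {"result.tar": b"..."})
rb_a.alive = False  # simulate an RB crash after the job has finished

# The job is still reachable through the second RB.
print(rb_b.status(job))  # prints "DONE"
```

In a real deployment the catalog itself would have to be replicated, otherwise it simply becomes the next single point of failure, which is the same concern raised above for the RLS/RMC.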