Feedback on the experiences of the BioMed DC

INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

N. Jacq, 05/09/05, biomedical meeting

Nicolas Jacq, LPC, IN2P3/CNRS, France

DC report

• WISDOM statistics
  – All instances not yet registered
  – Wisdom.egee-eu.fr unavailable

• JRA2 statistics:
  – RB data only; biomed DC jobs were selected
  – 1 RB missing (SINICA, for technical reasons)

• Estimation:
  – Number of jobs run: +60 000
  – CPU consumed: +100 CPU years
  – Total size: -1 TB
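
As a rough order of magnitude, the estimates above imply an average CPU time per job. The sketch below reads the "+" as "at least", treats both figures as exact values, and uses a 365-day year; these are assumptions made only for illustration, so the result is indicative at best.

```python
# Rough average CPU time per job from the estimates on this slide.
# Assumption: the "+60 000" and "+100" lower bounds are treated as exact.
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year

jobs_run = 60_000
cpu_years = 100

avg_cpu_hours_per_job = cpu_years * HOURS_PER_YEAR / jobs_run
print(f"{avg_cpu_hours_per_job:.1f} CPU hours per job")
```

Under these assumptions the average comes out at roughly 14.6 CPU hours per job.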

Job failures

• Estimation
  – 40% successfully done
  – 20% unsuccessfully done (after checking)
  – 10% aborted
  – 10% cancelled: resubmission, errors during the process
  – 20% failed/unknown: RB or node failure, other reasons?

• Reasons for unsuccessfully done jobs:
  – 80% license server: server down, power cut at SCAI, drop in the number of server licenses, FlexX jobs not possible on the CE. Important because of automatic resubmission
  – 15% CE configuration: tar problem, missing space…
  – 5% no results transfer: both lcg and globus commands failed

• Reasons for aborted jobs
  – 63% mismatching resources: failing middleware component or wrong request in the job JDL
  – 28% wrong configuration
  – 4% network/connection failure
  – 4% proxy problems
  – 1% JDL problem

• In summary:
  – ~30% failed due to the grid (middleware, resources…)
  – ~30% failed due to the WISDOM application
  – 40% successfully done
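
The per-status percentages on this slide can be combined into shares of all jobs by multiplying each cause's share by the share of its status. The sketch below does only that arithmetic for the two statuses whose causes are itemised; the dictionary and function names are illustrative, and it does not attempt to reproduce how the grid/application split was derived.

```python
# Share of all jobs ending in each status (from the "Estimation" bullet).
status_share = {"unsuccessfully done": 0.20, "aborted": 0.10}

# Breakdown of causes within each status (from the slide).
causes = {
    "unsuccessfully done": {
        "license server": 0.80,
        "CE configuration": 0.15,
        "no results transfer": 0.05,
    },
    "aborted": {
        "mismatching resources": 0.63,
        "wrong configuration": 0.28,
        "network/connection failure": 0.04,
        "proxy problems": 0.04,
        "JDL problem": 0.01,
    },
}


def share_of_all_jobs(status_share, causes):
    """Return {(status, cause): fraction of all jobs}."""
    return {
        (status, cause): status_share[status] * fraction
        for status, fractions in causes.items()
        for cause, fraction in fractions.items()
    }


overall = share_of_all_jobs(status_share, causes)
# Print causes from largest to smallest share of all jobs.
for (status, cause), fraction in sorted(overall.items(), key=lambda kv: -kv[1]):
    print(f"{fraction:6.1%}  {status}: {cause}")
```

With these figures, license server problems alone correspond to about 16% of all jobs, and mismatching resources to about 6.3%.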

Operational issues

• RB: main bottleneck despite the 12 available RBs (3 to 7 in use at the same time)
  – Overload, crashes, space limit on the shared repository, disk failure, wrong status information for jobs done on the CE, output sandboxes impossible to download

• RLS/RMC: was the main bottleneck for me before the DC
  – No RLS/RMC during a Renater outage
  – 10% of lcg-cp and lcg-cr commands failed
  – 5% of globus-url-copy commands also failed

• SE: no major problem
  – Corrupted tar on the UPV SE
  – Power cut at SCAI, Renater outage for CC

• CE: no critical problem
  – Configuration problems, air conditioning
  – BDII update problem (caused problems for job distribution)
  – Different priorities for biomgrid and biomed certificates

• UI: a handicap for submission
  – Slow multithreaded submission process, disk space, crashes
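
The RLS/RMC figures suggest that a second transfer command was tried when the first one failed. A minimal sketch of such a fallback chain is given below. The helper `run_with_fallback` is hypothetical, and the actual argument lists for lcg-cp or globus-url-copy are deliberately left to the caller, since no particular options are assumed here.

```python
import subprocess


def run_with_fallback(commands, retries=1):
    """Try each command in order, retrying each one `retries` times.

    `commands` is a list of argument lists (hypothetical; supplied by the caller).
    Returns the index of the first command that exits with status 0, or None.
    """
    for index, argv in enumerate(commands):
        for _ in range(retries + 1):
            try:
                result = subprocess.run(argv, capture_output=True)
            except OSError:
                break  # command not available on this node: try the next one
            if result.returncode == 0:
                return index
    return None
```

A caller would pass the preferred transfer command first and the alternative second, and treat a `None` result as "no results transfer".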

Improvement proposals (for intensive use of the grid)

• Load balancing
  – Know all possible CE configurations specific to a VO
  – Have a dynamic information system update for intensive submission
  – Robustness of the BDII update in a CE
  – Define a ranking attribute appropriate for intensive use

• Avoid bottlenecks
  – More than one RLS/RMC (and more than one license server)
  – RB synchronisation

As it stands, an RB failure means losing jobs or output sandboxes. There is a real need for synchronisation between RBs (especially the LB service), so that a job can be sent through one RB and its status checked or its results retrieved through another. This would mean that job databases and output sandboxes are not stored on the RB but somewhere on the grid, reachable by any RB and possibly replicated. The idea is to have job management similar to data management, with something like the LFN for jobs.
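
The analogy with data management can be sketched as a catalogue that maps a logical job name to the places holding the job's state, much as an LFN maps to replicas. Everything below (`JobRecord`, `JobCatalog`, and the method names) is hypothetical and only illustrates the idea; it is not an existing middleware interface.

```python
from dataclasses import dataclass, field


@dataclass
class JobRecord:
    """State of one job, kept outside any single RB (hypothetical)."""
    status: str = "submitted"
    replicas: set = field(default_factory=set)  # RBs holding a copy of the job state


class JobCatalog:
    """Hypothetical catalogue: logical job name -> replicated job state."""

    def __init__(self):
        self._jobs = {}

    def register(self, logical_name, rb):
        """Record a new job submitted through `rb`."""
        self._jobs[logical_name] = JobRecord(replicas={rb})

    def replicate(self, logical_name, rb):
        """Record that `rb` also holds a copy of the job state."""
        self._jobs[logical_name].replicas.add(rb)

    def update_status(self, logical_name, status):
        self._jobs[logical_name].status = status

    def status(self, logical_name):
        return self._jobs[logical_name].status

    def locate(self, logical_name, available_rbs):
        """Return the available RBs that can serve this job, sorted by name."""
        return sorted(self._jobs[logical_name].replicas & set(available_rbs))


catalog = JobCatalog()
catalog.register("job-0001", "rb1")
catalog.replicate("job-0001", "rb2")
catalog.update_status("job-0001", "done")
# rb1 has failed; the job can still be reached through rb2.
print(catalog.locate("job-0001", available_rbs=["rb2", "rb3"]))
```

In this sketch the loss of one RB does not lose the job, because status and output location are looked up by logical name rather than through the RB that accepted the submission.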

• Process management
  – Know all possible errors of the grid (RB, commands)
  – Small nodes need a limit on the number of scheduled jobs in their queue
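
The queue limit for small nodes can be illustrated with a simple greedy assignment that caps the number of queued jobs per CE in proportion to its CPU count. This is only a sketch of the idea: the function `assign_jobs`, the `max_queued_per_cpu` parameter, and the CE names and sizes are all assumptions, not features of the existing middleware.

```python
def assign_jobs(jobs, ce_cpus, max_queued_per_cpu=2):
    """Assign jobs to CEs without exceeding a per-CE queue cap.

    `ce_cpus` maps a CE name to its CPU count (hypothetical values).
    Each job goes to the CE with the lowest queued-jobs-per-CPU ratio
    among those still under their cap; jobs that fit nowhere are held back.
    """
    queued = {ce: 0 for ce in ce_cpus}
    assignment, held = {}, []
    for job in jobs:
        candidates = [
            ce for ce in ce_cpus
            if queued[ce] < max_queued_per_cpu * ce_cpus[ce]
        ]
        if not candidates:
            held.append(job)  # every CE is at its cap: resubmit later
            continue
        target = min(candidates, key=lambda ce: (queued[ce] / ce_cpus[ce], ce))
        queued[target] += 1
        assignment[job] = target
    return assignment, held


jobs = [f"job{i}" for i in range(50)]
assignment, held = assign_jobs(jobs, {"small": 2, "large": 20})
small_count = sum(1 for ce in assignment.values() if ce == "small")
print(small_count, len(held))
```

With a cap of 2 queued jobs per CPU, the 2-CPU node never holds more than 4 jobs, and jobs beyond the total capacity are held back rather than piled onto a small queue.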
