CERN – Alice Offline – Thu, 27 Mar 2008 – Marco MEONI - 1
Status of RAW data production (III)
ALICE-LCG Task Force weekly
Current Production
• Switched to raw data production with a single master job per global
run, under the control of the Lightweight Production Monitor (LPM)
• Main problems during the Easter break:
1. Successful jobs (output written at the WN) end up in "EXPIRED" after staying in
"SAVING" for a few hours (ALICE::CERN::CASTOR2 not responding).
2. Jobs go to "ERROR_E" due to zero CPU consumption in the last 20 mins (probably because
a raw file takes longer to be staged the first time it is accessed). Solved using LPM: after
re-submission the number of "ERROR_E" jobs decreases significantly.
3. User alidaq was squeezed by aliprod on the production partition "rawreco":
• CEs are CERN::LCG and CERN::CERN-gLite
• alidaq could run no more than 20 jobs in parallel when the CEs were overloaded
• alidaq now has a higher priority (200 concurrent jobs)
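The "zero CPU consumption in the last 20 mins" condition in problem 2 can be pictured with a small watchdog check. This is only an illustrative sketch, not the AliEn implementation: the function name `is_cpu_idle`, the sample format, and the polling model are all assumptions; only the 20-minute window comes from the slide.

```python
from typing import Sequence, Tuple

IDLE_WINDOW_S = 20 * 60  # 20 minutes, the window quoted on the slide


def is_cpu_idle(samples: Sequence[Tuple[float, float]],
                window_s: float = IDLE_WINDOW_S) -> bool:
    """Return True if cumulative CPU time did not grow over the last window.

    `samples` is a time-ordered sequence of (wall_clock_s, cumulative_cpu_s)
    pairs (hypothetical format, chosen for this sketch).
    """
    if not samples:
        return False
    latest_t, latest_cpu = samples[-1]
    # Not enough history to cover a full window yet: do not flag the job.
    if latest_t - samples[0][0] < window_s:
        return False
    cutoff = latest_t - window_s
    # CPU time recorded by the most recent sample at or before the cutoff.
    baseline_cpu = [cpu for t, cpu in samples if t <= cutoff][-1]
    return latest_cpu <= baseline_cpu


# A job blocked on staging its first raw file accumulates no CPU time.
staging_job = [(0, 0.0), (600, 0.0), (1200, 0.0)]
working_job = [(0, 0.0), (600, 45.0), (1200, 110.0)]
```

With a check of this kind, a job that spends its first 20 minutes waiting for a file to be staged is indistinguishable from a hung job, which is consistent with resubmission (when the file is already staged) reducing the number of "ERROR_E" jobs.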
Output Restructuring
AliEn Catalog (raw data)
..
../LHC08a/000026001/raw/08000026001002.10.root
../LHC08a/000026001/raw/08000026001012.10.root
../LHC08a/000026001/raw/08000026001012.20.root
../LHC08a/000026007
../LHC08a/000026020
..
..
AliEn Catalog (reconstructed data)
..
../000026001/ESDs/pass1/08000026001002.10.root/AliESDs.root
../000026001/ESDs/pass1/08000026001002.10.root/<detector>.QA*.root
../000026001/ESDs/pass1/08000026001002.10.root/<detector>.RecPoints.root
../000026001/ESDs/pass1/08000026001002.10.root/<detector>debug.root
../000026001/ESDs/pass1/08000026001002.10.root/rec.log|stdout|stderr
../000026001/ESDs/pass1/08000026001002.10.root/root_archive
../000026001/ESDs/pass1/08000026001002.10.root/log_archive
..
1..~40
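The catalog layout on this slide places the output of each raw chunk in a directory named after the chunk file. A minimal sketch of that mapping follows; the helper name `esd_output_dir` is hypothetical, and because the path prefixes are elided on the slide (".."), the output prefix is passed in as a parameter rather than guessed.

```python
import posixpath


def esd_output_dir(raw_lfn: str, out_prefix: str, pass_name: str = "pass1") -> str:
    """Map a raw-data catalog path to its reconstructed-output directory.

    Expects a path ending in <run>/raw/<chunk>.root, as on the slide.
    `out_prefix` stands for the elided leading part of the output path.
    """
    parts = raw_lfn.rstrip("/").split("/")
    if len(parts) < 3 or parts[-2] != "raw":
        raise ValueError(f"not a raw chunk path: {raw_lfn!r}")
    run, chunk = parts[-3], parts[-1]
    # Output directory: <prefix>/<run>/ESDs/<pass>/<chunk file name>
    return posixpath.join(out_prefix, run, "ESDs", pass_name, chunk)


out_dir = esd_output_dir("../LHC08a/000026001/raw/08000026001002.10.root", "..")
# Files such as AliESDs.root, rec.log and the archives live under out_dir.
esd_file = posixpath.join(out_dir, "AliESDs.root")
```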
Production using LPM
• The master job keeps running until 95% of the sub-jobs are DONE
• This may slow down or stall the production in the long term if too many sub-jobs end in
ERROR_V
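The 95% rule and the ERROR_V concern can be written down as two small checks. This is a sketch under stated assumptions: the function names are hypothetical, and treating ERROR_V sub-jobs as ones that never reach DONE is an assumption made here for illustration; only the 95% threshold comes from the slide.

```python
def master_job_finished(done: int, total: int, threshold_pct: int = 95) -> bool:
    """True once at least threshold_pct percent of the sub-jobs are DONE."""
    return total > 0 and 100 * done >= threshold_pct * total


def can_still_finish(total: int, failed_for_good: int, threshold_pct: int = 95) -> bool:
    """True if the threshold is still reachable.

    `failed_for_good` counts sub-jobs assumed never to reach DONE
    (e.g. ERROR_V jobs that are not recovered).
    """
    return total > 0 and 100 * (total - failed_for_good) >= threshold_pct * total
```

For a hypothetical master job with 1000 sub-jobs, 60 unrecovered ERROR_V sub-jobs cap the DONE fraction at 94%, so `can_still_finish(1000, 60)` is false and the master job would keep running indefinitely.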
Production using LPM (1)
• Task queue automatically re-filled if number of waiting jobs < 3500
resubmitting error jobs
pid 12028083 had 7 error jobs (job is 30% done - 877 out of 2846)
pid 12051531 had 0 error jobs (job is 49% done - 112 out of 228)
pid 12051672 had 0 error jobs (job is 0% done - 0 out of 2)
pid 12051673 had 797 error jobs (job is 56% done - 2185 out of 3878)
total resubmitted : 804
there are 4192 jobs waiting in queue for user alidaq
target queue size is 4000
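The log excerpt above can be cross-checked mechanically. The sketch below parses the per-master-job lines, sums the error jobs, and applies a refill test; the regular expression and the function names are assumptions written for this excerpt, not part of LPM, and the refill threshold is left as a parameter because the slide quotes both 3500 and a target of 4000.

```python
import re

LOG = """\
resubmitting error jobs
pid 12028083 had 7 error jobs (job is 30% done - 877 out of 2846)
pid 12051531 had 0 error jobs (job is 49% done - 112 out of 228)
pid 12051672 had 0 error jobs (job is 0% done - 0 out of 2)
pid 12051673 had 797 error jobs (job is 56% done - 2185 out of 3878)
total resubmitted : 804
there are 4192 jobs waiting in queue for user alidaq
target queue size is 4000
"""

PID_RE = re.compile(
    r"pid (\d+) had (\d+) error jobs \(job is (\d+)% done - (\d+) out of (\d+)\)"
)


def parse_master_jobs(log: str) -> list:
    """Extract one record per master-job line of the excerpt."""
    keys = ("pid", "errors", "pct", "done", "total")
    return [dict(zip(keys, map(int, m.groups()))) for m in PID_RE.finditer(log)]


def needs_refill(waiting: int, threshold: int) -> bool:
    """Refill the task queue only when fewer than `threshold` jobs are waiting."""
    return waiting < threshold


jobs = parse_master_jobs(LOG)
total_errors = sum(j["errors"] for j in jobs)  # 7 + 0 + 0 + 797
```

The summed error count matches the "total resubmitted : 804" line, and with 4192 jobs waiting no refill is triggered for either threshold value.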
Production using LPM (2)
• All the urgent runs are already scheduled in LPM for production
• At this rate (~900 jobs/h) reconstruction would finish in two days
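The two-day estimate is simple rate arithmetic, sketched below with hypothetical helper names. Read in reverse, ~900 jobs/h sustained for 48 hours corresponds to about 43,200 sub-jobs; that backlog figure is derived here from the two numbers on the slide and is not stated in the presentation.

```python
def hours_to_finish(remaining_jobs: int, jobs_per_hour: float) -> float:
    """Time needed to drain a backlog at a constant completion rate."""
    if jobs_per_hour <= 0:
        raise ValueError("jobs_per_hour must be positive")
    return remaining_jobs / jobs_per_hour


def jobs_completed(jobs_per_hour: float, hours: float) -> float:
    """Number of jobs completed at a constant rate over a period."""
    return jobs_per_hour * hours


implied_backlog = jobs_completed(900, 48)  # two days at ~900 jobs/h
```

The estimate assumes the rate stays constant; jobs that end in EXPIRED or ERROR states and must be resubmitted would lengthen it.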
• Jobs run successfully but fail
when writing outputs to CASTOR2
(EXPIRED)