CERN – Alice Offline – Thu, 27 Mar 2008 – Marco MEONI - 1 Status of RAW data production (III) ALICE-LCG Task Force weekly



Status of RAW data production (III)

ALICE-LCG Task Force weekly


Current Production

• Switched to raw data production with a single master-job per global run, under the control of the Lightweight Production Monitor (LPM)

• Main problems during the Easter break:

1. Successful jobs (output written at the WN) end up in "EXPIRED" after staying in "SAVING" for a few hours (ALICE::CERN::CASTOR2 not responding).

2. Jobs go to "ERROR_E" due to zero CPU consumption in the last 20 minutes (probably because a raw file takes longer to be staged the first time it is accessed). Solved using LPM: after re-submission the number of "ERROR_E" jobs decreases significantly.

3. User alidaq was squeezed by aliprod on the production partition "rawreco":

• CEs are CERN::LCG and CERN::CERN-gLite

• Could run no more than 20 jobs in parallel when the CEs were overloaded

• Now it has a higher priority (200 concurrent jobs)
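
The zero-CPU condition behind problem 2 can be illustrated with a small check. This is only a sketch: the 20-minute window comes from the slide, while the `CpuSample` record, the `is_idle` helper and the sampling scheme are hypothetical and not taken from AliEn or LPM.

```python
from dataclasses import dataclass

IDLE_WINDOW_S = 20 * 60  # 20 minutes, the window quoted on the slide


@dataclass
class CpuSample:
    """Hypothetical monitoring record for one job."""
    timestamp_s: float  # wall-clock time of the sample, in seconds
    cpu_time_s: float   # cumulative CPU time used by the job so far


def is_idle(samples, now_s, window_s=IDLE_WINDOW_S):
    """Return True if the job used no CPU during the last window_s seconds."""
    ordered = sorted(samples, key=lambda s: s.timestamp_s)
    window_start = now_s - window_s
    # Samples taken at or before the start of the window give the baseline.
    before = [s for s in ordered if s.timestamp_s <= window_start]
    if not before:
        return False  # not enough history yet to judge the job
    baseline = before[-1]
    latest = ordered[-1]
    return latest.cpu_time_s - baseline.cpu_time_s <= 0.0
```

A job that is still waiting for its raw file to be staged would look idle under such a check, which is consistent with re-submission helping once the file is on disk.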


Output Restructuring

AliEn Catalog (raw data)
  ..
  ../LHC08a/000026001/raw/08000026001002.10.root
  ../LHC08a/000026001/raw/08000026001012.10.root
  ../LHC08a/000026001/raw/08000026001012.20.root
  ../LHC08a/000026007
  ../LHC08a/000026020
  ..
  ..

AliEn Catalog (reconstructed data)
  ..
  ../000026001/ESDs/pass1/08000026001002.10.root/AliESDs.root
  ../000026001/ESDs/pass1/08000026001002.10.root/<detector>.QA*.root
  ../000026001/ESDs/pass1/08000026001002.10.root/<detector>.RecPoints.root
  ../000026001/ESDs/pass1/08000026001002.10.root/<detector>debug.root
  ../000026001/ESDs/pass1/08000026001002.10.root/rec.log|stdout|stderr
  ../000026001/ESDs/pass1/08000026001002.10.root/root_archive
  ../000026001/ESDs/pass1/08000026001002.10.root/log_archive
  ..

1..~40
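
The mapping between the two catalog layouts above can be sketched as a small path helper. The function name `reco_output_dir` and its parameters are hypothetical; the leading part of each path is elided on the slide, so it is passed in unchanged rather than guessed.

```python
import posixpath


def reco_output_dir(raw_lfn, reco_base, pass_name="pass1"):
    """Map a raw-data catalog path to the directory of its reconstructed output.

    raw_lfn   : path ending in <run>/raw/<chunk>.root, as in the raw catalog
    reco_base : prefix of the reconstructed-data catalog (elided on the slide)
    """
    parts = raw_lfn.split("/")
    if len(parts) < 3 or parts[-2] != "raw":
        raise ValueError("expected a path ending in <run>/raw/<chunk>: %r" % raw_lfn)
    run = parts[-3]    # e.g. 000026001
    chunk = parts[-1]  # e.g. 08000026001002.10.root
    # The reconstructed output lives in a directory named after the raw chunk.
    return posixpath.join(reco_base, run, "ESDs", pass_name, chunk)


raw = "../LHC08a/000026001/raw/08000026001002.10.root"
print(posixpath.join(reco_output_dir(raw, ".."), "AliESDs.root"))
```

Keeping the raw chunk name as the output directory name makes it possible to trace every reconstructed file back to its input chunk.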


Production using LPM

• The master-job is kept running until 95% of the sub-jobs are DONE

• This may slow down or stall the production in the long term if too many sub-jobs end in ERROR_V
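
The 95% rule and the ERROR_V concern can be sketched as two checks. This is a minimal illustration: the function names are hypothetical, and it assumes that sub-jobs in ERROR_V are not recovered by re-submission, which the slide does not state.

```python
DONE_THRESHOLD_PCT = 95  # from the slide


def master_job_finished(n_done, n_total, threshold_pct=DONE_THRESHOLD_PCT):
    """True once at least threshold_pct percent of the sub-jobs are DONE."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return n_done * 100 >= threshold_pct * n_total


def threshold_reachable(n_error_v, n_total, threshold_pct=DONE_THRESHOLD_PCT):
    """True if the threshold can still be met when ERROR_V sub-jobs never finish.

    Assumption (not from the slide): ERROR_V sub-jobs are lost for good.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return (n_total - n_error_v) * 100 >= threshold_pct * n_total
```

Under that assumption, a master-job with more than 5% of its sub-jobs in ERROR_V would never reach the threshold and would keep running indefinitely.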

Production using LPM (1)

• The task queue is automatically re-filled if the number of waiting jobs is < 3500

resubmitting error jobs
    pid 12028083 had 7 error jobs (job is 30% done - 877 out of 2846)
    pid 12051531 had 0 error jobs (job is 49% done - 112 out of 228)
    pid 12051672 had 0 error jobs (job is 0% done - 0 out of 2)
    pid 12051673 had 797 error jobs (job is 56% done - 2185 out of 3878)
  total resubmitted : 804
there are 4192 jobs waiting in queue for user alidaq
target queue size is 4000
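
The refill decision behind this output can be sketched as follows. The 3500 threshold is from the slide and the 4000 target is from the log; the function `jobs_to_submit` and the assumption that a refill tops the queue up to the target size are hypothetical.

```python
REFILL_THRESHOLD = 3500   # refill only below this many waiting jobs (slide)
TARGET_QUEUE_SIZE = 4000  # "target queue size" reported in the log


def jobs_to_submit(n_waiting, threshold=REFILL_THRESHOLD, target=TARGET_QUEUE_SIZE):
    """Return how many new jobs to submit for the current queue length.

    Assumption: a refill tops the queue up to the target size.
    """
    if n_waiting >= threshold:
        return 0  # enough jobs are already waiting
    return target - n_waiting


# With the 4192 waiting jobs shown in the log, nothing is submitted.
print(jobs_to_submit(4192))
```

Keeping the threshold below the target avoids submitting a handful of jobs on every cycle.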

Production using LPM (2)

• All the urgent runs are already scheduled in LPM for production

• At this rate (~900 jobs/h) we would finish the reconstruction in two days

• Jobs run successfully but fail when writing outputs to CASTOR2 (expired)
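
As a rough cross-check of the two-day estimate at ~900 jobs/h, the arithmetic can be written out. The helper names are hypothetical, and the backlog figure is only what the quoted rate and duration imply, not a number given on the slides.

```python
def jobs_processed(rate_per_hour, hours):
    """Number of jobs completed at a constant rate."""
    return rate_per_hour * hours


def hours_to_finish(n_remaining, rate_per_hour):
    """Time needed to clear n_remaining jobs at a constant rate."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return n_remaining / rate_per_hour


# Two days at ~900 jobs/h corresponds to roughly this many sub-jobs.
print(jobs_processed(900, 48))
```

The estimate assumes a constant rate, so jobs that end up expired while writing to CASTOR2 and need re-submission would lengthen it.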