Page 1:

Status of the ALICE Grid
Patricia Méndez Lorenzo (IT)
ALICE OFFLINE WEEK, CERN, 18 October 2010

Page 2:

ALICE OFFLINE WEEK -- ALICE GRID STATUS


Outlook

General results in the last three months

List of general issues

News about services

HI CMS+ALICE exercise

Nagios and Monitoring

Summary and Conclusions

18/10/10

Page 3:

General results in the last three months

Page 4:

List of general issues

T0 site
- Instabilities this summer with the local CREAM-CE
- Instabilities with the AFS software area
- CAF nodes quite stable
- Security patches applied to all ALICE VOBOXes at CERN
- Migration of out-of-warranty VOBOXes (voalice07 to voalice15 and voalice09 to voalice16)
- HI combined exercise

T1 sites
- CREAM-CE issues, including instabilities observed in the resource BDII
- SE problems found at CNAF and CC-IN2P3, related to lack of disk space

T2 sites
- Usual operations; in general quite stable behavior
- Challenge: new sites entering production and upgrades of T2 to T1 sites (from the ALICE perspective)

Page 5:

T2 sites → T1 sites

- Korean and US sites are willing to become ALICE T1 sites
  - Assuming this in terms of service provision and management
- Challenge: bandwidth
  - Poor network observed between these sites and CERN
  - Show-stopper for these sites and also for newcomers
  - 1st approach: bottleneck entering CERN? (firewall stops) It has been found that this is not the issue
  - Current situation: not fully clear (Jeff in contact this week with Edoardo Martelli to report the Supercomputing results)
- "Proposal for Next Generation Architecture interconnecting LHC computing sites" (Nov 2010)
  - Moving toward more dynamically configured links between sites, with a few static connections

Page 6:

CREAM and AliEn v2.19

1. Easy management of the OSB (OutputSandBox)

2. Removal of any reference to the CREAM-DB

3. Check of the CREAM-CE status in the BDII


Page 7:

CREAM and AliEn v2.19

Easy management of the OSB

- The OSB is required by ALICE for debugging purposes only
- Direct submission of jobs via the CREAM-CE requires the specification of a gridftp server to save the OSB
  - The server is specified at the level of the JDL file
  - ALICE solved it by requiring a gridftp server at the local VOBOX
- The OSB cannot be retrieved from the CREAM disk via any client command
  - Well... not fully true: the functionality was possible, but not exposed before CREAM1.6
- Requirements to expose this feature:
  - Automatic purge procedures (from CREAM1.5)
  - Limiters blocking new submissions in case of low free disk space (from CREAM1.6)
- CREAM1.6 exposes the possibility to leave the OSB on the CREAM-CE: outputsandboxbasedesturi="gsiftp://localhost" (agent JDL level)
  - A gridftp server at the VOBOX is no longer needed
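In JDL terms, the CREAM1.6 mechanism is a single attribute at the agent JDL level. The fragment below is an illustrative skeleton only: the executable and sandbox file names are made up, and just the outputsandboxbasedesturi line comes from the slides.

```
[
  // Illustrative agent JDL skeleton; only the outputsandboxbasedesturi
  // attribute comes from the slides, the rest is made up for the example.
  Executable = "agent.sh";
  OutputSandbox = {"std.out", "std.err"};
  // Leave the OSB on the CREAM-CE disk instead of pushing it to a
  // gridftp server on the VOBOX (exposed from CREAM1.6 on):
  outputsandboxbasedesturi = "gsiftp://localhost";
]
```

With this attribute in place, the gridftp server on the local VOBOX becomes unnecessary for OSB handling.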


Page 8:

CREAM and AliEn v2.19

Removal of any reference to the CREAM-DB

- Used for reporting running/waiting jobs, in parallel to the BDII information
- AliEn v2.18 enabled both information sources
  - Selectable on a site-by-site basis through an env variable (CE_USE_BDII) included in LDAP
- AliEn v2.19 keeps the env variable but removes the CREAM-DB reference as an information source
  - Too heavy a query, and not always reliable
  - If not reliable, we could collapse the sites or the opposite: simply not run
- The CREAM-CE developers have proposed to us a tool able to provide the number of waiting/running jobs by querying the batch system
  - Hence the maintenance of the env variable CE_USE_BDII
- WARNING: THE ONLY INFO SYSTEM WE HAVE NOW IS THE RESOURCE BDII
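The source-selection change can be sketched as follows. This is an illustrative sketch, not the actual AliEn code (which is Perl): only the CE_USE_BDII variable name comes from the slides; the function names and return shapes are invented for the example.

```python
import os

# Illustrative sketch of the information-source selection described above;
# not the actual AliEn implementation. Only CE_USE_BDII comes from the
# slides; the function names are invented for the example.

def jobs_info_v218(query_bdii, query_cream_db):
    # AliEn v2.18: the env variable (set per site via LDAP) chooses
    # between the resource BDII and the CREAM-DB.
    if os.environ.get("CE_USE_BDII"):
        return query_bdii()
    return query_cream_db()  # heavy query, not always reliable

def jobs_info_v219(query_bdii):
    # AliEn v2.19: the CREAM-DB branch is removed; the resource BDII
    # is the only information system left.
    return query_bdii()
```

The env variable survives in v2.19 only so that a future batch-system query tool can be switched in per site without another schema change.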


Page 9:

CREAM and AliEn v2.19

Check of the CREAM-CE status

- "Economic" reasons... why keep submitting to CREAM-CEs in draining or maintenance mode?
- Until AliEn v2.19: manual approach
  - Non-operational CEs were manually removed from LDAP
- With AliEn v2.19: automatic approach
  - Before any CREAM-CE operation, the status of the CREAM-CE is queried from the resource BDII
  - If the CE is in maintenance or draining mode, no operation is performed with this CE
  - If there is a list of CREAM-CEs, only those in production will be used
  - No need to restart services when the CE comes back into production
- Procedure implemented and tested at Subatech with good results
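The automatic check above amounts to filtering CEs on the status the resource BDII publishes (GlueCEStateStatus). The sketch below is illustrative, not the AliEn v2.19 code; the CE endpoints are made up, and any status other than "Production" (e.g. "Draining" or "Closed") is simply skipped.

```python
# Illustrative sketch of the automatic CE status check described above,
# not the actual AliEn v2.19 implementation. The resource BDII publishes
# GlueCEStateStatus per CE; only CEs reported as "Production" receive
# operations, so draining/maintenance CEs are skipped without editing
# LDAP or restarting services.

def usable_ces(ce_status):
    """Keep only the CREAM-CEs whose published status is 'Production'.

    ce_status maps a CE endpoint to the status string obtained from the
    resource BDII (the endpoints below are made up for the example).
    """
    return [ce for ce, status in ce_status.items() if status == "Production"]

statuses = {
    "cream01.example.org:8443/cream-pbs-alice": "Production",
    "cream02.example.org:8443/cream-pbs-alice": "Draining",
    "cream03.example.org:8443/cream-pbs-alice": "Closed",
}
print(usable_ces(statuses))  # only cream01 remains
```

Because the check runs before every operation, a CE returning to production is picked up on the next query with no service restart, which matches the Subatech test reported above.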


Page 10:

CREAM Status

- Current CREAM-CE production version: CREAM1.6.3 (gLite3.2/sl5_x86_64), patch #4415
- gLite3.1 version arriving (patch #4387 in staged rollout)
  - BUT! This will be the last CREAM-CE deployment in gLite3.1
- Next CREAM-CE version: CREAM1.7 (gLite3.2/sl5_x86_64 ONLY!)
  - Foreseen for the end of the year / beginning of 2011
- Brief parenthesis... since the last offline week, I have submitted 27 GGUS tickets:
  - 17 associated with CREAM
  - 4 associated with wrong information provided by the BDII
  - 6 associated with SE issues
- Let's see the issues associated with CREAM (and observed by ALICE) in these last three months


Page 11:

CREAM Issues

- Last offline week's advice for sites: migrate to CREAM1.6 as soon as possible
  - Lots of bug fixes reported by ALICE and new features were included in this version
- However, several instabilities were observed after the migration to CREAM1.6:
  - Connection timeout messages observed at submission time
  - Error messages reporting problems with the blparser service (blparser service not alive)
- Issues reported to the CREAM-CE developers
- We created a page for site admins describing the problems and the solutions:
  http://alien2.cern.ch/index.php?option=com_content&view=article&id=46&Itemid=103

Page 12:

CREAM Issues

- Connection timeout error message observed at submission time
  - CREAM service is down
  - Bug #69554: memory leak in util-java if CA loading throws an exception. SOLVED IN CREAM1.6.2
  - Workaround provided by the developers, very easy to apply:
    http://grid.pd.infn.it/cream/field.php?n=Main.KnownIssues
- blparser service is not alive (glite-ce-job-status)
  - Well-documented issue associated with the status of the BLAH blparser service:
    http://grid.pd.infn.it/cream/field.php?n=Main.ErrorMessagesReportedByCREAMToClient#blparsernotalive
- Further problem(s): Bug #69545: CREAM digests asynchronous commands very slowly. SOLVED IN CREAM1.6.2
  - Workaround provided by the developers, very easy to apply:
    http://grid.pd.infn.it/cream/field.php?n=Main.KnownIssues


Page 13:

Other issues

Reported by GridKa
- User proxy delegation problems
  - At delegation time the user gets "not authorized for operation" messages
  - Documentation available at:
    http://grid.pd.infn.it/cream/field.php?n=Main.ErrorMessagesReportedByCREAMToClient

Reported by LPSC
- /tmp area of the CREAM-CE full of glexec "proxy files" (Bug #73961)
  - Not directly a CREAM issue, although the service was affected
  - With CREAM1.6.3 the problem is solved
  - No workaround will have to be applied once sites migrate to this version
- Migration to CREAM1.6.3 is highly recommended

Page 14:

Other issues

Found at CERN
- Lots of timeouts while querying the CREAM-DB during the summer
  1. Increase of the timeout window to 3 min
  2. Deprecation of the CREAM-DB usage

Reported by Subatech
- glite-ce-job-status fails with the message: "EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred"
- Issue associated with insufficient memory in the CREAM-CE (~2 GB when the issue was found)
- CREAM-CE advice: CREAM-CE nodes should have a minimum of 8 GB of memory


Page 15:

More about CREAM

- CREAM1.6.3 includes important bug fixes
  - See Massimo Sgaravatto's presentation during the latest GDB meeting:
    http://indico.cern.ch/conferenceDisplay.py?confId=83604
- The CREAM1.7 client will include glite-ce-job-output
  - This does not require changes in our CREAM.pm module
  - And the possibility to leave the OSB on the CREAM-CE (and retrieve it on demand) is of course available


Page 16:

gLite-VOBOX

- Current production version: VOBOX 3.2.9 (gLite3.2/sl5_x86_64), patch #4257 (5 Oct 2010)
- New features:
  - new Glue 2.0 service publisher
  - new version of the LB clients
- A gridftp server is still included in this version
  - Included, but not configured via YAIM
  - The startup of the service has to be handled outside YAIM
  - The removal of this server will be requested


Page 17:

HI CMS+ALICE exercise

- Combined ALICE+CMS exercise (21 October 14:00 to 22 October 14:00) to check the ability of the IT infrastructure (network and tapes) to cope with the expected rates
  - P2 → CASTOR (2.5 GB/s max) and transfer to tape
  - 2.3 PB available on t0alice and 2.3 PB available on alicedisk
  - Reconstruct ~10% of the data
  - Simultaneous copy of RAW data to the disk pool (via xrd3cp, 2.5 GB/s max)
- 2100 TB of extra space on disk pools provided beforehand by IT
- Asynchronous start-up of the test
  - ALICE exported directly to CASTOR, while CMS was performing a repacking step before the export


Page 18:

P2 → CASTOR transfers

- Average rate: 2 GB/s, with a max rate of 2.5 GB/s
- 160 TB transferred (10% of the expected HI volume), 60587 files (2.7 GB/file)
- Several interruptions for detector reconfiguration and follow-up on data transfer to tapes (realistic scenario)

(Plot provided by L. Betev)

Page 19:

CASTOR disk buffer → tape transfers

(Plot provided by L. Betev; annotations: "Data in from P2", "To tape", Δt = 1 h)

- Average rate: 2.4 GB/s
- Data makes it to tape about 1 hour after being written to the CASTOR buffer
- 3rd-party copy delayed by 1 h

Page 20:

Copy from t0alice to alicedisk + reco


- Copy t0alice → alicedisk (average 2.6 GB/s)
- RAW data reconstruction: reading and writing

(Plot provided by L. Betev)

- Average copy rate: 2.5 GB/s
- Average reco "in" rate: 200 MB/s
- Average reco "out" rate: 20 MB/s

Page 21:

Monitoring: Nagios

- Nagios monitoring of the ALICE VOBOXES in production since Summer 2010
- Visualization of the results via SAM is obsolete
  - Nagios implementation in ML still pending
- Site availability calculation: the calculations through SAM and through Nagios are currently being compared
  - The next MB meeting will show these results
- Pending developments:
  - Implementation of the CREAM-CE standard test suite
  - Redefinition of the site availability algorithm based on CREAM (currently based on the LCG-CE)


Page 22:

Monitoring: Gridview

- The transfer rate reported by Gridview is smaller than the real rate
- The issue was found in August 2010, but it is still pending
  - Tracked in GGUS ticket #61724

(Plot annotations: "average transfer for the day of 20 MB/s" vs "average transfer for the day of 32 MB/s")

Page 23:

Summary and conclusions

- Very smooth production in these last three months
  - Raw data transfer to CASTOR, registration in the AliEn file catalogue, and transfers to T1 sites are already routine
  - Site inefficiencies immediately managed together with the site admins
- Some changes have been included in AliEn v2.19 concerning the CREAM-CE service
  - Based on the experience gained this summer with the first version of CREAM1.6
- Some new improvements can be expected for the next CREAM1.7 version
- The agile approach foreseen by ALICE, with emphasis on the use of T2 sites (some even becoming ALICE T1 sites), will be one of the topics to work on in the following months