
Page 1: ATLAS Site Status Board Automatic queue exclusion based on downtimes

CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it

ES

ATLAS Site Status Board
Automatic queue exclusion based on downtimes

21st Feb 2012

• ATLAS site topology
• Site exclusion algorithm
• Test results
• First real exclusion and recovery

C. Borrego, S. Campana, S. Gayazov, A. DiGirolamo, X. Espinal, E. Magradze, L. Rinaldi, J. Schovancova, G. Stewart, M. Wrigth

[email protected]

Page 2: ATLAS Site Status Board Automatic queue exclusion based on downtimes


ATLAS site topology

• Based on information from AGIS, Schedconfig
• Mapping between various ATLAS site naming conventions
  – AGIS (based on GOCDB/OIM), Panda, DDM
• Populated “exception file”
• ATLAS site-oriented topology
  – http://adc-ssb.cern.ch/SITE_EXCLUSION/ATLAS_sites.json
• ATLAS Panda queue-oriented topology
  – http://adc-ssb.cern.ch/SITE_EXCLUSION/panda_queues.json
  – http://adc-ssb.cern.ch/SITE_EXCLUSION/panda_queues_dict.json
• In touch with the Pilot factory monitoring developers to get the mapping between queues and resources as the Pilot factories see it
  – This will enable us to map ANALY queues to CE downtimes
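As a rough illustration of how a mapping between naming conventions might be used, the sketch below builds a lookup from a Panda queue name to the site record it belongs to. The record layout and every name in it (TOPOLOGY, TESTSITE, OTHERSITE, the field names) are hypothetical and do not reflect the actual schema of ATLAS_sites.json or panda_queues.json.

```python
from typing import Dict, List

# Hypothetical topology records: one entry per ATLAS site, listing the names
# the same site carries in the other naming conventions. Not the real schema.
TOPOLOGY: List[dict] = [
    {
        "atlas_site": "TESTSITE",
        "gocdb_site": "TESTSITE-LCG2",
        "ddm_endpoints": ["TESTSITE_DATADISK"],
        "panda_queues": ["TESTSITE-pbs", "ANALY_TESTSITE"],
    },
    {
        "atlas_site": "OTHERSITE",
        "gocdb_site": "OTHERSITE-LCG2",
        "ddm_endpoints": ["OTHERSITE_DATADISK"],
        "panda_queues": ["OTHERSITE-pbs", "ANALY_OTHERSITE"],
    },
]


def index_by_panda_queue(topology: List[dict]) -> Dict[str, dict]:
    """Return a dict that maps each Panda queue name to its site record."""
    index: Dict[str, dict] = {}
    for site in topology:
        for queue in site["panda_queues"]:
            index[queue] = site
    return index


QUEUE_INDEX = index_by_panda_queue(TOPOLOGY)
print(QUEUE_INDEX["ANALY_TESTSITE"]["gocdb_site"])
```

With such an index, a downtime declared under the GOCDB name of a site can be related to the Panda queues that run there.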


Page 3: ATLAS Site Status Board Automatic queue exclusion based on downtimes


Site exclusion

• Queue exclusion based on the downtime of an SE, CE (or LFC)

– The exclusion tool underwent thorough testing before it was put into production for the first queues


[Diagram, labelled 18 Oct 2011: site downtime information from GOCDB and OIMDB is collected in AGIS, and two collectors read it. Legend: in production / in testing phase.]

• DDM exclusion collector: fetches SE downtime from AGIS
• Site exclusion collector: fetches SE/CE/LFC downtime from AGIS

• Site A: SE downtime starts; SE excluded
• Site B: SE downtime over; SE recovered
• Site C: SE downtime starts; CEs excluded
• Site D: LFC downtime starts; CE(s) excluded, SE(s) excluded
• Site E: CE(s) downtime starts; CE(s) excluded

Page 4: ATLAS Site Status Board Automatic queue exclusion based on downtimes


Site exclusion algorithm

• Fetch ongoing and future downtimes from AGIS

• Map downtimes from sites to queues (topology!)
  – SRM downtime: action with every queue type (ANALY, prod)
  – CE downtime: action only with prod queues
• Decide exclusion/recovery action, consider
  – time of downtime
  – queue type (production, analysis, “special”)
  – current queue status
  – current queue comment
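The mapping step can be sketched roughly as follows. This is only an illustration under stated assumptions: the Queue record, the queue names and the site-level matching are hypothetical, and “special” queues are left out because the slide names only ANALY and prod queues.

```python
from dataclasses import dataclass
from typing import List

# Which queue types a downtime of a given service acts on, as on the slide:
# SRM acts on every queue type (ANALY, prod), CE only on prod queues.
AFFECTED_QUEUE_TYPES = {
    "SRM": {"production", "analysis"},
    "CE": {"production"},
}


@dataclass(frozen=True)
class Queue:
    """Hypothetical queue record, for illustration only."""
    name: str
    site: str
    queue_type: str  # "production", "analysis" or "special"


def queues_for_downtime(service: str, site: str, queues: List[Queue]) -> List[Queue]:
    """Return the queues at `site` that a downtime of `service` should act on."""
    affected_types = AFFECTED_QUEUE_TYPES.get(service, set())
    return [q for q in queues if q.site == site and q.queue_type in affected_types]


QUEUES = [
    Queue("TESTSITE-pbs", "TESTSITE", "production"),
    Queue("ANALY_TESTSITE", "TESTSITE", "analysis"),
    Queue("OTHERSITE-pbs", "OTHERSITE", "production"),
]

print([q.name for q in queues_for_downtime("SRM", "TESTSITE", QUEUES)])
print([q.name for q in queues_for_downtime("CE", "TESTSITE", QUEUES)])
```

Matching a CE downtime to a whole site is a simplification made here; a finer mapping would relate each CE to the individual queues it serves.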


Page 5: ATLAS Site Status Board Automatic queue exclusion based on downtimes


Exclusion of a production queue

• 12 hr in advance of a downtime:
  – setoffline with comment “set.offline.by.SSB” if queue is:
    • Online with any possible comment
    • Brokeroff with comment “set.brokeroff.by.SSB”
    • Test with comment “HC.Test.Me”
  – Otherwise do not touch that queue!
• When downtime starts:
  – Make sure that queue is set offline when appropriate
  – See the rules above, in the T-12h .. T interval
• End of downtime/downtime disappears – recovery:
  – settest with comment “HC.Test.Me” if the current status is Offline with comment “set.offline.by.SSB”
  – Otherwise do not touch that queue!
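The production-queue rules above can be written as a small decision function. This is a minimal sketch, not the SSB code: the function name, the lower-case status strings, the hours_until_start convention and the exact-match comparison of comments are all assumptions.

```python
from typing import Optional, Tuple

OFFLINE_BY_SSB = "set.offline.by.SSB"
BROKEROFF_BY_SSB = "set.brokeroff.by.SSB"
HC_TEST_ME = "HC.Test.Me"


def may_set_offline(status: str, comment: str) -> bool:
    """True if the queue is in a state the rules allow to be set offline."""
    return (
        status == "online"  # any comment
        or (status == "brokeroff" and comment == BROKEROFF_BY_SSB)
        or (status == "test" and comment == HC_TEST_ME)
    )


def production_action(
    status: str, comment: str, hours_until_start: Optional[float]
) -> Optional[Tuple[str, str]]:
    """Return (command, comment) to apply, or None to leave the queue alone.

    hours_until_start is None when there is no ongoing or upcoming downtime,
    and zero or negative while a downtime is ongoing (assumed convention).
    """
    if hours_until_start is None:
        # Recovery: only queues that were set offline by the SSB are touched.
        if status == "offline" and comment == OFFLINE_BY_SSB:
            return ("settest", HC_TEST_ME)
        return None
    if hours_until_start <= 12 and may_set_offline(status, comment):
        return ("setoffline", OFFLINE_BY_SSB)
    return None


print(production_action("online", "dummy", 10))
print(production_action("offline", "dummy", 10))
print(production_action("offline", OFFLINE_BY_SSB, None))
```

Returning None for every state not listed in the rules keeps the “otherwise do not touch that queue” behaviour explicit, so a queue that someone set offline by hand is never recovered automatically.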


Page 6: ATLAS Site Status Board Automatic queue exclusion based on downtimes


Exclusion of an analysis queue

• 6 hr in advance of a downtime:
  – setbrokeroff with comment “set.brokeroff.by.SSB” if queue is:
    • Online with any possible comment
    • Brokeroff with comment “set.brokeroff.by.SSB”
    • Offline with comment “set.offline.by.SSB”
  – Otherwise do not touch that queue!
• 2 hr in advance of a downtime and during downtime:
  – setoffline with comment “set.offline.by.SSB” if queue is:
    • Online with any possible comment
    • Brokeroff with comment “set.brokeroff.by.SSB”
    • Test with comment “HC.Test.Me”
  – Otherwise do not touch that queue!
• End of downtime/downtime disappears – recovery:
  – settest with comment “HC.Test.Me” if the current status is Offline with comment “set.offline.by.SSB”
  – Otherwise do not touch that queue!
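The two-stage rules for analysis queues can be sketched in the same way. Again this is an illustration under assumptions, not the SSB implementation: the function name, the status strings, the hours_until_start convention, the exact-match comparison of comments and the handling of the 6 hr and 2 hr boundaries are choices made here.

```python
from typing import Optional, Tuple

OFFLINE_BY_SSB = "set.offline.by.SSB"
BROKEROFF_BY_SSB = "set.brokeroff.by.SSB"
HC_TEST_ME = "HC.Test.Me"


def analysis_action(
    status: str, comment: str, hours_until_start: Optional[float]
) -> Optional[Tuple[str, str]]:
    """Return (command, comment) to apply, or None to leave the queue alone.

    hours_until_start is None when there is no ongoing or upcoming downtime,
    and zero or negative while a downtime is ongoing (assumed convention).
    """
    if hours_until_start is None:
        # Recovery: only queues that were set offline by the SSB are touched.
        if status == "offline" and comment == OFFLINE_BY_SSB:
            return ("settest", HC_TEST_ME)
        return None
    if hours_until_start <= 2:
        # From T-2h and during the downtime: set offline.
        if (
            status == "online"
            or (status == "brokeroff" and comment == BROKEROFF_BY_SSB)
            or (status == "test" and comment == HC_TEST_ME)
        ):
            return ("setoffline", OFFLINE_BY_SSB)
        return None
    if hours_until_start <= 6:
        # Between T-6h and T-2h: set brokeroff.
        if (
            status == "online"
            or (status == "brokeroff" and comment == BROKEROFF_BY_SSB)
            or (status == "offline" and comment == OFFLINE_BY_SSB)
        ):
            return ("setbrokeroff", BROKEROFF_BY_SSB)
        return None
    return None


print(analysis_action("online", "", 5))
print(analysis_action("brokeroff", BROKEROFF_BY_SSB, 1))
print(analysis_action("offline", OFFLINE_BY_SSB, None))
```

Checking the 2 hr window before the 6 hr window makes the stricter action win once both conditions hold.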

Page 7: ATLAS Site Status Board Automatic queue exclusion based on downtimes


Testing the exclusion idea - 1

• Assembled test data:
  – 2 flavours of production queues (only 1 enabled)
  – 2 flavours of analysis queues (only 1 enabled)
• Phase space of queue status contains every possible combination of [queue type, queue status, queue comment]:
  – FAKE_QUEUE_TYPES (x) FAKE_QUEUE_PREFIXES (x) FAKE_STATES (x) FAKE_COMMENTS, where
    • FAKE_QUEUE_TYPES=[DEFAULT_QUEUE_TYPE_PRODUCTION, DEFAULT_QUEUE_TYPE_ANALYSIS, DEFAULT_QUEUE_TYPE_SPECIAL]
    • FAKE_QUEUE_PREFIXES={DEFAULT_QUEUE_TYPE_PRODUCTION: ['testsite-testsitece02-at2testsite-pbs_test', 'testsite-testsitece03-at2testsite-pbs_test'], DEFAULT_QUEUE_TYPE_ANALYSIS:['ANALY', 'ANALY2'], DEFAULT_QUEUE_TYPE_SPECIAL:['SPECIAL1', 'SPECIAL2']}
    • FAKE_STATES=['online', 'offline', 'test', 'brokeroff']
    • FAKE_COMMENTS=['', 'dummy', 'set.offline.by.SSB', 'set.offline.by.SSB.dummy', 'set.brokeroff.by.SSB', 'set.brokeroff.by.SSB.dummy', 'set.online.by.SSB', 'set.online.by.SSB.dummy', 'HC.Test.Me', 'HC.Test.Me.dummy']
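One way to enumerate such a phase space is with itertools.product, as in the sketch below. The lists are copied from the slide; the string values of the DEFAULT_QUEUE_TYPE_* constants are not given there and are assumed here, as are the names phase_space and CASES.

```python
from itertools import product

# Assumed values: the slide does not show what these constants contain.
DEFAULT_QUEUE_TYPE_PRODUCTION = "production"
DEFAULT_QUEUE_TYPE_ANALYSIS = "analysis"
DEFAULT_QUEUE_TYPE_SPECIAL = "special"

FAKE_QUEUE_TYPES = [
    DEFAULT_QUEUE_TYPE_PRODUCTION,
    DEFAULT_QUEUE_TYPE_ANALYSIS,
    DEFAULT_QUEUE_TYPE_SPECIAL,
]
FAKE_QUEUE_PREFIXES = {
    DEFAULT_QUEUE_TYPE_PRODUCTION: [
        "testsite-testsitece02-at2testsite-pbs_test",
        "testsite-testsitece03-at2testsite-pbs_test",
    ],
    DEFAULT_QUEUE_TYPE_ANALYSIS: ["ANALY", "ANALY2"],
    DEFAULT_QUEUE_TYPE_SPECIAL: ["SPECIAL1", "SPECIAL2"],
}
FAKE_STATES = ["online", "offline", "test", "brokeroff"]
FAKE_COMMENTS = [
    "", "dummy",
    "set.offline.by.SSB", "set.offline.by.SSB.dummy",
    "set.brokeroff.by.SSB", "set.brokeroff.by.SSB.dummy",
    "set.online.by.SSB", "set.online.by.SSB.dummy",
    "HC.Test.Me", "HC.Test.Me.dummy",
]


def phase_space():
    """Yield every (queue type, prefix, state, comment) combination."""
    for queue_type in FAKE_QUEUE_TYPES:
        prefixes = FAKE_QUEUE_PREFIXES[queue_type]
        for prefix, state, comment in product(prefixes, FAKE_STATES, FAKE_COMMENTS):
            yield (queue_type, prefix, state, comment)


CASES = list(phase_space())
# 3 queue types x 2 prefixes x 4 states x 10 comments = 240 combinations
print(len(CASES))
```

The “.dummy” variants of each comment make it possible to check that a comment which merely starts with an SSB marker is not treated as the marker itself.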


Page 8: ATLAS Site Status Board Automatic queue exclusion based on downtimes


Testing the exclusion idea - 2

• “Dashboard” with the timeline for each queue class from the phase space
  http://adc-ssb.cern.ch/SITE_EXCLUSION/switcher/switcher_digest.html
• Log with detailed actions described
  http://adc-ssb.cern.ch/SITE_EXCLUSION/switcher/switcher_digest.log
• Test downtimes:
  – SRM: from 2012-02-05 23:30 UTC to 2012-02-06 02:00 UTC
  – SRM: from 2012-02-06 04:30 UTC to 2012-02-06 06:00 UTC
  – SRM: from 2012-02-07 04:30 UTC to 2012-02-07 06:00 UTC
  – CE: for each queue, on 2012-02-06 from 08:00 to 09:00 UTC

The exclusion algorithm does what is expected and when it is expected!
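To see when actions would be expected for a test downtime, the lead times from the exclusion rules (12 hr for production queues, 6 hr and 2 hr for analysis queues) can be subtracted from the downtime start. The sketch below does this for the first SRM test downtime; the names LEAD_TIMES and action_times are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Lead times taken from the exclusion rules for production and analysis queues.
LEAD_TIMES = {
    ("production", "setoffline"): timedelta(hours=12),
    ("analysis", "setbrokeroff"): timedelta(hours=6),
    ("analysis", "setoffline"): timedelta(hours=2),
}


def action_times(downtime_start: datetime) -> dict:
    """Return the earliest time each action applies for a given downtime start."""
    return {key: downtime_start - lead for key, lead in LEAD_TIMES.items()}


# First SRM test downtime: starts 2012-02-05 23:30 UTC.
start = datetime(2012, 2, 5, 23, 30, tzinfo=timezone.utc)
for (queue_type, action), when in action_times(start).items():
    print(queue_type, action, when.isoformat())
```

For this downtime the production queues would be set offline from 11:30 UTC, and the analysis queues set brokeroff from 17:30 UTC and offline from 21:30 UTC on 2012-02-05.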


Page 9: ATLAS Site Status Board Automatic queue exclusion based on downtimes


Real actions

• After thorough testing and improving log debugging features for operations
  – We started taking real actions for several queues
    https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/33952

The exclusion tool does what is expected and when it is expected!

• Tested with ifae and UKI-SCOTGRID-DURHAM, which have downtimes today.

• Next in the pipeline is SFU-LCG2.


Page 10: ATLAS Site Status Board Automatic queue exclusion based on downtimes


Operational experience - 1

• Every action is logged, so it is easier to debug what went wrong if a problem occurs.
  http://adc-ssb.cern.ch/SITE_EXCLUSION/switcher/switcher.log
• Found a few minor issues on the way
  – Fetched only future downtimes from AGIS. Fixed: now fetching ongoing and future downtimes.
  – Disabled all real queues for the past night. Fixed: now all queues from elog:33952 are enabled again.

The exclusion tool takes only actions we intend it to take!


Page 12: ATLAS Site Status Board Automatic queue exclusion based on downtimes


Summary

Using ATLAS site topology
– http://adc-ssb.cern.ch/SITE_EXCLUSION/ATLAS_sites.json

First real exclusions and recoveries successful!

Next steps:
– Add more queues to real actions
– Add more configurability (now: system-wide)

Questions?

[email protected]