AMOD Report June 24-30, 2013 Torre Wenaus, BNL July 2, 2013

Activities

• Stable operations, utilization tapering off on the weekend – few pending tasks

• ~4.3M analysis jobs, 7M jobs total

• ~560 analysis users

• Ops issues in the week:
  – Recovering from BNL disk pool failure
  – Low disk space at many T1s, T2s

[Plot; annotated value: 140k]

Production & Analysis

Production

Analysis: ~17k min – 37k max

Data transfers

Tier 0, Central Services, ADC

• Mon: HC DB problem from previous weekend fixed with DB restart. DB connections saturated when session count grew with no release of connections. “A follow-up is being discussed.” GGUS:95033

• Ongoing issue “CERN-PROD: file transfer failure from T2 sites due to SECURITY_ERROR” closed because it had been resolved in early June (as pointed out by Maria in the WLCG meeting). GGUS:92166

• Smooth incident-free interventions on Castor, Oracle production DBs, Bourricot, Tracer/Consistency Service

• Problems (curl SSL failure) using pandamon cloud/site control on lxplus (SL6 issue?), experts investigating

Tier 1

• Mon: FZK-LCG2 transfer failures, “all ATLAS jobs/transfers are forced onto the same disk cluster, because all other disks are full to the brim. Consequently, the load cannot get distributed anymore and we now observe higher failure rate.” GGUS:95021

• Mon-Thu: SARA-MATRIX problems with dest/source transfers. gPlazma service interruption, SRM problems fixed with restart. GGUS:95071

• Tue: FZK-LCG2 missing AOD file reported, they can find no trace, waiting for reply. GGUS:95092

• Tue-Wed: IN2P3-CC file transfer failures, SRM crashed during night, fixed in the morning with restart. Site established auto recovery to avoid delays in such cases in the future. GGUS:95093

Tier 1

• Thu: BNL provided incident report on disk pool failure. Recovery worked on through the week

• Fri-Sun: RAL-LCG2 DDM errors due to Castor problems on Fri, downtime over weekend, cloud set brokeroff, downtime ended Sunday when problems were resolved, restored to production. GGUS:95160

• Sat: SARA-MATRIX storage errors due to full DATADISK, blacklisted. Space cleaned up over weekend. GGUS:95175

• Mon 7/1: IN2P3-CC NO_SPACE_LEFT errors but no auto blacklisting, site not publishing that it is full. Inconsistency in SRM DB found, bad space calculation, fixed. GGUS:95204

• Several Tier 1s (and Tier 2s) over the week: low space. FZK, SARA, IN2P3

Other

• Clouds were running out of assigned tasks during the week. Would be very desirable to sustain a deeper todo queue of tasks.
  – [this was the first item on the ‘Other’ slide in my last (Feb) AMOD report; it still applies]

• New manual whitelisting policy
  – Armen in last ADC weekly: “Consider an option of manual whitelisting (by expert shifter, AMOD), not reversible by SAAB. May be needed in some exceptional cases.”
  – Ueda has put this in place
    • “on” (whitelisting = ignore auto-exclusions) added as savannah site exclusion ticket option
    • dq2-set-location-status documentation for the “on” case added to the CentralizedSiteExclusion twiki
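
The precedence implied by the whitelisting policy above (a manual “on” ignores auto-exclusions) can be sketched roughly as follows. This is only an illustration: the `Status` enum, the “auto” and “off” states, and the `site_is_usable` function are hypothetical names chosen here, not the dq2-set-location-status interface or the actual SAAB logic.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical manual status for a site (names are assumptions)."""
    AUTO = "auto"  # no manual override: follow automatic exclusions
    OFF = "off"    # manually excluded
    ON = "on"      # manually whitelisted: ignore auto-exclusions


def site_is_usable(manual: Status, auto_excluded: bool) -> bool:
    """Return True if the site should be used under this sketch's rules."""
    if manual is Status.ON:
        # Manual whitelisting wins over any automatic exclusion.
        return True
    if manual is Status.OFF:
        # Manual exclusion wins regardless of the automatic state.
        return False
    # No manual override: defer to the automatic exclusion decision.
    return not auto_excluded
```

For example, `site_is_usable(Status.ON, auto_excluded=True)` stays true in this sketch, which is the “not reversible by SAAB” behaviour quoted above.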

Thanks!

• Big thanks to very attentive and effective ADCoS shifters