GridPP22 – Service Resilience and Disaster Planning David Britton, 1/Apr/09.



Resilience and Disaster Planning

• The Grid must be made resilient to failures and disasters over a wide scale, from simple disk failures up to major incidents like the prolonged loss of a whole site.

• One of the intrinsic characteristics of the Grid is the use of inherently unreliable and distributed hardware in a fault-tolerant infrastructure. Service resilience is about making this fault-tolerance a reality.


PLAN - A


Towards Plan-B

Fortifying the Service

• Increasing the hardware’s capacity to handle faults.

• Duplicating services or machines.

• Automatic restarts.

• Fast intervention.

• In-depth investigation of the reason for failure.
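
The "automatic restarts" item above can be illustrated with a minimal supervisor sketch. This is only an assumption about how such a restart loop might look, not a description of any tool used at the sites: the function name `supervise`, its parameters, and the convention that a zero exit code means a clean stop are all hypothetical. It restarts a failing command a bounded number of times with exponential back-off and returns the exit codes seen, so that a persistent failure can be handed to a person for fast intervention and investigation.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, backoff_s=1.0):
    """Run cmd, restarting it after each non-zero exit.

    Hypothetical helper: gives up after max_restarts restarts and
    returns the list of exit codes, oldest first.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean stop, nothing to restart
        if attempt < max_restarts:
            # Wait longer before each successive restart.
            time.sleep(backoff_s * 2 ** attempt)
    return exit_codes


if __name__ == "__main__":
    # Stand-in for a service that always fails.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    codes = supervise(failing, max_restarts=2, backoff_s=0.1)
    if codes[-1] != 0:
        print(f"Still failing after {len(codes)} attempts: escalate")
```

The bounded retry count is the design point: automatic restarts cover transient faults, while a repeat failure is surfaced rather than retried forever.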


Disaster Planning

• Taking control early enough.

• (Pre-) establishing possible options.

• Understanding user priorities.

• Timely Action.

• Effective Communication.

See talks by Jeremy (today) and Andrew (tomorrow)


Disasters: Not “if” but “when+where”


wLCG weekly operations report, Feb-09


Disasters: Not “if” but “how big”


A typical campus incident


Purpose of GridPP22

• To understand the experiment priorities and plans (insofar as they are defined) in the case of various disaster scenarios.

• To extract commonalities across our user-base, to inform our priorities and planning in such an event.

• To examine (and help crystallise) the current state of site (and experiment) resilience and disaster planning.

• To raise collective awareness and encourage collaboration and the dissemination of best practice.

• An ounce of prevention is worth a pound of cure.


...and talking of quotes

• When anyone asks me how I can best describe my experience in nearly forty years at sea, I merely say, uneventful. Of course there have been winter gales, and storms and fog and the like. But in all my experience, I have never been in any accident... of any sort worth speaking about. I have seen but one vessel in distress in all my years at sea. I never saw a wreck and never have been wrecked nor was I ever in any predicament that threatened to end in disaster of any sort.

• E. J. Smith, 1907, Captain, RMS Titanic


(Who ordered the ICE? – E. J. Smith, 1912)


Status Update


Swansea, Sep 08


WLCG Growth

September 2008 March 2009


A Magic Moment


Tier-1 Reliability


Last 6 months (Sep-Feb): RAL Reliability = 98%

Target reliability for best 8 sites = 98% (from Jan).

RAL was in the top 5. But this was measured with the OPS VO...

ATLAS VO: Last 6 months (Sep-Feb): RAL Reliability = 90%

Target reliability for best 8 sites = 98%

RAL was 8th out of 11 sites.

But RAL was one of the best sites for both CMS and LHCb.
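
To give a feel for what these percentages mean, the sketch below converts a reliability figure into implied unscheduled downtime. The slide does not define the metric, so this assumes that reliability is the fraction of time the site was up after excluding scheduled downtime. The function names and the default of zero scheduled downtime are illustrative assumptions.

```python
def reliability(up_hours, total_hours, scheduled_down_hours=0.0):
    """Assumed definition: time up / (total time - scheduled downtime)."""
    denominator = total_hours - scheduled_down_hours
    if denominator <= 0:
        raise ValueError("period must be longer than the scheduled downtime")
    return up_hours / denominator


def implied_unscheduled_down_days(rel, period_days, scheduled_down_days=0.0):
    """Days of unscheduled downtime implied by a reliability figure."""
    return (1.0 - rel) * (period_days - scheduled_down_days)


# Sep 2008 to Feb 2009 inclusive: 30 + 31 + 30 + 31 + 31 + 28 days.
PERIOD_DAYS = 30 + 31 + 30 + 31 + 31 + 28

for label, rel in [("OPS VO", 0.98), ("ATLAS VO", 0.90)]:
    days = implied_unscheduled_down_days(rel, PERIOD_DAYS)
    print(f"{label}: reliability {rel:.0%} -> about {days:.1f} days down")
```

Under these assumptions, the gap between 98% and 90% is the difference between roughly 3.6 and 18.1 days of unscheduled downtime over the 181-day period.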


UK CPU Contribution


[Plots: two panels, each covering 6 months]


UK Site Contributions


2007 (2008 in brackets):

NorthGrid: 34(22)%

London: 28(25)%

ScotGrid: 18(17)%

Tier-1: 13(15)%

SouthGrid: 7(16)%

GridIreland: 6.1% (~)

[Plots: 1-year and 6-month periods]


CPU Efficiencies


[Plots: one 1-year panel and three 6-month panels]


Storage (doh!)


UK Tier-2 Storage


Integrals (2008 Q4)

Pledged: 1500 TB

Provided: 2700 TB

Used: 420 TB
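
The three integrals are easier to compare as ratios. The short sketch below does that arithmetic. The variable names are illustrative, and the only inputs are the figures quoted on the slide.

```python
# Tier-2 storage integrals for 2008 Q4, in TB, as quoted on the slide.
pledged_tb = 1500
provided_tb = 2700
used_tb = 420

provided_over_pledged = provided_tb / pledged_tb
used_of_provided = used_tb / provided_tb
used_of_pledged = used_tb / pledged_tb

print(f"Provided / pledged: {provided_over_pledged:.0%}")
print(f"Used / provided:    {used_of_provided:.1%}")
print(f"Used / pledged:     {used_of_pledged:.0%}")
```

On these figures the sites provided 180% of the pledge, while usage was about 15.6% of what was provided and 28% of what was pledged.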


Data Transfers


STEP09 (i.e. CCRC09)


• Currently, it seems likely that this will be in June.

• There may be conflicts with the (much delayed) move to R89.

• It raises issues to do with upgrades such as CASTOR 2.1.8.


Current Issues: CASTOR

• Current version (CASTOR 2.1.7) appears to function at an acceptable level though there are a number of serious bugs that we are learning to work around (notably the BigID and CrossTalk problems).

• These problems have also been observed elsewhere, which adds pressure for them to be addressed.

• CASTOR 2.1.8 is under test at CERN and will shortly be under test at RAL. The consensus is that we need to be very cautious in moving to this version, even though it may address some of the 2.1.7 bugs and offer additional features (e.g. Xrootd functionality).

• Ultimately, this decision must be driven by the experiments (is a consensus possible?). Strongly prefer not to be the first non-CERN site to upgrade.

• Possible conflict with the STEP09 exercise (can we upgrade early enough not to risk participation? Does it make any sense to upgrade after?)

• Is there a bigger risk in not upgrading (degrading support for 2.1.7)?


Current Issues: R89

Hand-over delayed from Dec-22nd 2008 by a number of issues:

• Cleanliness (addressed).

• Inaudible fire alarms (addressed).

• Cooling system (outstanding).

• Plan-A: R89 must be accepted by STFC by 1st May to allow a 2-week migration towards the end of June.

• Plan-B (if there is a small delay) is a 1-week migration of critical components only.

• Plan-C (if there is a longer delay) is to remain completely in the ATLAS building.

• Must balance establishing a stable service for LHC data with the advantages of moving to a better environment.

• Other factors are STEP09; Castor-upgrade; Costs; and Convenience.


Tier-1 Hardware

• The FY2008 hardware procurement is currently waiting to be delivered pending resolution of the R89 situation:

– CPU: ~2500 KSI2K to be added to the existing 4590 KSI2K.

– DISK: ~1500 TB to be added to the existing 2222 TB.

– Tape: up to 2000 TB can be added to the existing 2195 TB.

• The FY09 hardware procurement will start as soon as the experiments have determined revised requirements based on the new LHC schedule (i.e. soon).
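
The FY2008 figures above imply the following totals once the hardware is delivered. This is a small arithmetic sketch only: the added CPU and disk amounts are approximate on the slide ("~"), the tape amount is an upper bound ("up to"), and the dictionary layout and function names are illustrative.

```python
# (existing, to be added) per resource, taken from the slide.
procurement = {
    "CPU (KSI2K)": (4590, 2500),  # added amount is approximate
    "Disk (TB)": (2222, 1500),    # added amount is approximate
    "Tape (TB)": (2195, 2000),    # added amount is an upper bound
}


def totals(items):
    """Capacity after the FY2008 hardware is added."""
    return {name: existing + added for name, (existing, added) in items.items()}


def growth(items):
    """Added capacity as a fraction of existing capacity."""
    return {name: added / existing for name, (existing, added) in items.items()}


new_totals = totals(procurement)
increase = growth(procurement)
for name in procurement:
    print(f"{name}: {new_totals[name]} total, +{increase[name]:.0%}")
```

That gives roughly 7090 KSI2K of CPU, 3722 TB of disk and up to 4195 TB of tape.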


Current Issues: EGI/NGI


• What follows on from EGEE-III in April 2010?

• Current idea is an EGI body (European Grid Infrastructure) coordinating a set of national NGIs, together with a middleware consortium and a set of Specialist Service Centres (e.g. one for HEP).

• EGI-DS underway.

• Timescales and transition are problematic.

• What is the UK NGI? Some evolution of the NGS with components of GridPP?

• Timescales and transition are problematic.

• Funding is complicated.

• Initial step: A joint GridPP/NGS working group to try and identify common services.

• See talks by John and Robin on the last afternoon.


Summary

• This meeting is about making the most of the window of opportunity before LHC data, to ensure that our Grid services are resilient and our disaster planning is in place, not just at RAL but also at the Tier-2s.

• Meanwhile, the UK continues to perform well and make incremental improvements in our delivery to the experiments and the wLCG.

• There are, and will continue to be, vexing issues and future uncertainties. We must all keep our eye on the ball.

• Remember, it’s not “if” a disaster strikes but “when and where”.


LHC Data

TBD !