Hardware Reliability at the RAL Tier1 – Gareth Smith, 16th September 2011


Page 1

Hardware Reliability at the RAL Tier1

Gareth Smith, 16th September 2011

Page 2

Staffing

The following staff have left since GridPP26:

• James Thorne (Fabric - Disk servers)
• Matt Hodges (Grid Services Team leader)
• Derek Ross (Grid Services Team)
• Richard Hellier (Grid Services & Castor Teams)

We thank them for their work whilst with the Tier1 team.

10 April 2023 Tier-1 Status

Page 3

Some Changes

• CVMFS in use for Atlas & LHCb:
– The Atlas (NFS) software server used to give significant problems.
– Some CVMFS teething issues, but overall much better!

• Virtualisation:
– Starting to bear fruit. Uses Hyper-V.
– Numerous test systems.
– Production systems that do not require particular resilience.

• Quattor:
– Large gains already made.


Page 4

6th May 2011

CernVM-FS Service Architecture

[Diagram: batch nodes at any site fetch through a local Squid from cvmfs-public@cern; the Stratum 0 web server in CERN PH is fed by the Atlas and LHCb install boxes in PH, with Stratum 1 replicas cvmfs-ral@cern and cvmfs-bnl@cern.]

• Replication to Stratum 1 by hourly cron (for now)
• Stratum 0 moving to IT by end of year
• BNL almost in production

Page 5

6th May 2011

CernVM-FS Service At RAL

[Diagram: RAL batch nodes fetch through squid caches on lcgsquid05 and lcgsquid06; webfs.gridpp.rl.ac.uk is a web server, backed by iSCSI storage, holding the replicated Stratum 0 from CERN; batch nodes at other sites fetch through their own squid(s).]

cvmfs.gridpp.rl.ac.uk: the replica at RAL, presented as a virtual host in 2 reverse-proxy squids accelerating webfs in the background.

Page 6

Database Infrastructure

We are making significant changes to the Oracle database infrastructure.

Why?
• Old servers are out of maintenance
• Move from 32-bit to 64-bit databases
• Performance improvements
• Standby systems
• Simplified architecture

Page 7

Database Disk Arrays - Now


[Diagram: Oracle RAC nodes connected via a Fibre Channel SAN to disk arrays; power supplies on UPS.]

Page 8

Database Disk Arrays - Future


[Diagram: as before – Oracle RAC nodes connected via a Fibre Channel SAN to disk arrays, power supplies on UPS – with Data Guard added to maintain standby databases.]

Page 9

Castor

Changes since last GridPP Meeting:

• Castor upgrade to 2.1.10 (March).
• Castor version 2.1.10-1 (July), needed for the higher capacity "T10KC" tapes.
• Updated garbage collection algorithm to "LRU" rather than the default, which is based on size (July).
• (Moved 'logrotate' to 1pm rather than 4am.)
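The switch from the default, size-based garbage collector to LRU can be sketched as follows. This is a minimal illustration with hypothetical file records, not Castor's actual implementation:

```python
import heapq
from dataclasses import dataclass

@dataclass
class CachedFile:
    name: str
    size_gb: float
    last_access: float  # epoch seconds of last read

def gc_candidates_lru(files, n):
    """LRU policy: evict the files accessed longest ago."""
    return heapq.nsmallest(n, files, key=lambda f: f.last_access)

def gc_candidates_by_size(files, n):
    """Size-based policy: evict the largest files first."""
    return heapq.nlargest(n, files, key=lambda f: f.size_gb)

files = [
    CachedFile("a", 10.0, 100.0),   # small, cold
    CachedFile("b", 200.0, 900.0),  # large, hot
    CachedFile("c", 50.0, 500.0),
]
print([f.name for f in gc_candidates_lru(files, 1)])      # the coldest file
print([f.name for f in gc_candidates_by_size(files, 1)])  # the largest file
```

The size-based policy would evict the large, recently-used file "b"; LRU keeps hot files resident regardless of size, which is why it behaves better on busy disk caches.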


Page 10

Castor Issues

• Load-related issues on small/full service classes (e.g. AtlasScratchDisk, LHCbRawRDst):
– Load can become concentrated on one or two disk servers.
– Exacerbated by an uneven distribution of disk server sizes.

• Solutions:
– Add more capacity; clean up.
– Changes to tape migration policies.
– Re-organization of service classes.
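Why load concentrates when server sizes are uneven: if file placement is weighted towards free space, a single large, empty server attracts most new writes, and later most of the read traffic. A minimal sketch (hypothetical server names and a simple free-space-weighted placement, not Castor's actual scheduler):

```python
import random

def pick_server(free_space_tb):
    """Place a new file on a server chosen with probability
    proportional to its free space."""
    servers = list(free_space_tb)
    weights = [free_space_tb[s] for s in servers]
    return random.choices(servers, weights=weights, k=1)[0]

random.seed(1)
# Two small, nearly-full servers plus one big new one.
free = {"gdss100": 1.0, "gdss101": 1.0, "gdss102": 40.0}
placements = [pick_server(free) for _ in range(1000)]
share = placements.count("gdss102") / 1000
print(f"fraction of new files landing on gdss102: {share:.0%}")  # roughly 40/42
```

Almost all fresh (hence hot) files land on one machine, which is exactly the concentration the slide describes.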


Page 11

Procurement

All existing bulk capacity orders in production or being deployed.

Problems with the 'SL08' generation overcome. Tenders under way for disk and tape.

Disk:
• Anticipate 2.66 PB usable space.
– Vendor 24-day proving test using our tests, then
– re-install and 7 days of acceptance tests by us.

CPU:
• Anticipate 12k HEPSpec06.
– 14-day proving tests by vendor, then
– 14-day acceptance tests by us.

Evaluation based on 5-year Total Cost of Ownership.
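A sketch of what a 5-year Total Cost of Ownership comparison looks like. All figures here are hypothetical, and the real evaluation weighs more factors than purchase price and power:

```python
def five_year_tco(purchase_price, power_kw, price_per_kwh=0.10,
                  annual_maintenance=0.0, years=5):
    """Total cost of ownership: purchase + electricity + maintenance.
    Illustrative only; all inputs are made-up numbers."""
    hours = years * 365 * 24
    power_cost = power_kw * hours * price_per_kwh
    return purchase_price + power_cost + annual_maintenance * years

# Two hypothetical CPU bids: cheaper to buy vs cheaper to run.
bid_a = five_year_tco(purchase_price=90_000, power_kw=12.0)
bid_b = five_year_tco(purchase_price=100_000, power_kw=9.0)
print(f"bid A: £{bid_a:,.0f}  bid B: £{bid_b:,.0f}")
```

With these made-up numbers the more expensive bid wins on TCO because its lower power draw saves more than the purchase-price difference over five years.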


Page 12

Disk Server Outages by Cause (2011)


Page 13

Disk Server Outages by Service Class (2011)


Page 14

Disk Drive Failure – Year 2011

Page 15

Double Disk Failures (2011)

We are in the process of updating the firmware on the affected batch of disk controllers.


Page 16

Disk Server Issues - Responses

New possibilities with Castor 2.1.9 or later:
• 'Draining' ('passive' and 'active').
• Read-only servers.
• Checksumming (2.1.9): easier to validate files, plus regular checks of files written.

All of these are used regularly when responding to disk server problems.
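Validating a file amounts to re-reading it and comparing against the checksum recorded at write time. A minimal sketch, assuming Adler-32 checksums (the common choice in grid storage); the `verify` helper and file handling here are illustrative, not Castor code:

```python
import os
import tempfile
import zlib

def adler32_of(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file, chunk by chunk."""
    value = 1  # Adler-32 starts at 1
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF

def verify(path, expected):
    """Re-read a file and compare against the checksum recorded at write time."""
    return adler32_of(path) == expected

# Demonstration with a temporary file standing in for a disk server file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"payload written to a disk server")
    path = tmp.name
recorded = adler32_of(path)  # conceptually, stored centrally at write time
ok = verify(path, recorded)
print("file intact:", ok)
os.remove(path)
```

Reading in chunks keeps memory use flat regardless of file size, which matters when the daily and random checks sweep multi-gigabyte files.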


Page 17

Data Loss Incidents

Summary of losses since GridPP26. Total of 12 incidents logged:
• 1 – due to a disk server failure (loss of 8 files for CMS).
• 1 – due to a bad tape (loss of 3 files for LHCb).
• 1 – files in the Castor Nameserver but with no location (9 LHCb files).
• 9 – cases of corrupt files. In most cases the files were old (and pre-date Castor checksumming).

Checksumming in place for tape and disk files. Daily and random checks made on disk files.


Page 18

T10000 Tapes

Type  Capacity  In Use  Total Capacity
A     0.5TB     5570    2.2PB
B     1TB       2170    1.9PB (CMS)
C     5TB       –       –

We have 320 T10KC tapes (capacity ~1.5 PB) already purchased.

Plan is to move data (VOs) using the ‘A’ tapes onto the ‘C’ tapes, leapfrogging the ‘Bs’.
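The arithmetic behind the leapfrog plan, using the figures quoted above (taken as exact for illustration):

```python
# Figures from the slide.
a_data_tb = 2200      # "2.2PB" held on the 'A' tapes
c_tape_tb = 5         # one T10KC tape holds 5 TB
c_tapes_bought = 320

c_pool_tb = c_tapes_bought * c_tape_tb
c_tapes_needed = a_data_tb // c_tape_tb

print(f"'C' pool already purchased: {c_pool_tb / 1000:.2f} PB")  # slide quotes ~1.5 PB
print(f"'C' tapes needed for all 'A' data: {c_tapes_needed}")
```

Since each full 'C' tape replaces ten 'A' tapes, the purchased pool comfortably covers the existing 'A' data with headroom to spare.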


Page 19

T10000C Issues

• Failure of 6 out of 10 tapes.
– Current A/B failure rate is roughly 1 in 1000.
– After writing part of a tape, an error was reported.

• Concerns are threefold:
– A high rate of write errors causes disruption.
– If tapes could not be filled, our capacity would be reduced.
– We were not 100% confident that data would be secure.

• Updated firmware in drives.
– 100 tapes now successfully written without problem.

• In contact with Oracle.
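How far outside the historical rate 6 failures in 10 tapes sits can be checked with a quick binomial calculation (our illustration, not from the slide):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k failures in n trials at failure rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Chance of seeing 6 or more failures in 10 tapes if the true
# failure rate were the historical A/B rate of ~1 in 1000:
p_tail = sum(binom_pmf(k, 10, 0.001) for k in range(6, 11))
print(f"P(>=6 of 10 | rate 1/1000) = {p_tail:.1e}")
```

The probability is vanishingly small, so the observed failures point to a systematic fault (here, drive firmware) rather than bad luck with the media.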

Page 20

Long-Term Operational Issues

• Building R89 (noisy power vs EMC units); electrical discharge in the 11kV supply.
– Noisy electrical current: fixed by isolating transformers in appropriate places.
– Awaiting final information on resolution of the 11kV discharge.

• Asymmetric data transfer rates in/out of the RAL Tier1.
– Many possible causes: load; FTS settings; disk server settings; TCP/IP tuning; network (LAN & WAN) performance.
– Have modified FTS settings with some success.
– Looking at Tier1 to UK Tier2 transfers within GridPP.

Page 21

Long-Term Operational Issues

• BDII issues (site BDII stops updating).
– Monitoring & re-starters in place.

• Packet loss on the RAL network link.
– Particular effect on LFC updates (reported by Atlas).
– Some evidence of being load-related.

• Some 'hangs' within the Castor JobManager.
– Difficult to trace; very intermittent.


Page 23

Other Hardware Issues

The following were reported via the Tier1-Experiments Liaison Meeting:

• Isolating transformers in disk array power feeds (final resolution of a long-standing problem).
• Network switch stack failure.
• Network (transceiver) failure in a switch, in the link to the tape system.
• CMS LSF machine turned off by mistake, resulting in a short outage for srm-cms.
• T10KC tapes.
• 11kV feed into the building: electrical discharge.
• Failure of the Site Access Router.


Page 24

A couple of final comments

Disk server issues are the main area of effort for hardware reliability / stability.

...but do not forget the network.

Hardware that has performed reliably in the past may throw up a systematic problem.


Page 25

Additional Slide: Procedures....

Post Mortems: 1 post mortem during the 6 months since GridPP26.

10th May 2011: LFC outage after database update.
• 1-hour outage following an 'At Risk'; caused by a configuration error in Oracle ACL lists.

Disaster Management:
• Triggered once (for the T10KC tape problem).
