TSM Linux User Experience at CERN


Page 1: TSM Linux User Experience at CERN

TSM Linux User Experience at CERN

David Asbury, CERN, Geneva, Switzerland
Oxford TSM Symposium, 26 September 2007

Page 2: TSM Linux User Experience at CERN

Topics

  • What is CERN?
  • What do we do with all that data?
  • How TSM is used in CERN
  • Managing the growth of data
  • Configuration
  • Experience with Linux

26 September 2007 Oxford TSM Symposium 2007 | Linux User experience at CERN 2

Page 3: TSM Linux User Experience at CERN

What is CERN?

  • European Laboratory for Particle Physics
  • French-Swiss border near Geneva
  • 20 member states, ~3000 staff, ~6500 visiting scientists from ~500 institutes, ~80 nationalities
  • Large Hadron Collider (LHC) to open 2008

Page 4: TSM Linux User Experience at CERN

Large Hadron Collider

Page 5: TSM Linux User Experience at CERN

Accelerator Complex

Page 6: TSM Linux User Experience at CERN

Atlas Experiment

Page 7: TSM Linux User Experience at CERN

Data Pyramid

  • Mail, home directories, databases, systems etc.
  • Derived data, physics dbs
  • Raw data from experiments is distributed among 10 other Grid sites.
  • ~15PB per year

Page 8: TSM Linux User Experience at CERN

CERN Policy on Backup

  • Home directories
      • AFS: AFS volume backup -> Castor
      • Windows DFS: TSM
  • Mail
      • Microsoft Exchange: TSM
  • Databases: TSM
  • Unix group & project servers: TSM
  • Experimental data: Castor

Castor: CERN Advanced Storage Manager (local)

Page 9: TSM Linux User Experience at CERN

Growth!

[Chart: Data Received by TSM. Y-axis: TB per week, 0 to 60. X-axis: Date, quarterly from 02/01/2005 to 02/04/2007.]

Page 10: TSM Linux User Experience at CERN

Managing growth

  • Ask the major clients for forecasts
  • Monitoring everything they do too!
      • Servergraph, moving to home-grown TSMMS
  • Want a repeatable “unit” of TSM
      • Can add when needed to avoid performance problems
      • Use existing TSM FC infrastructure
      • Profit from local Linux expertise and installation
      • Make use of physics robotic tape infrastructure

Page 11: TSM Linux User Experience at CERN

A Unit of TSM Capacity

  • PC running standard RHEL4 64-bit Linux
  • 4 cpus, 8GB memory, 2 Qlogic HBAs for FC
  • System disks mirrored by 3ware card
  • Disks for TSM db & log mirrored by TSM
  • RAID6 disks for staging areas
  • Use physics robot tape infrastructure

Page 12: TSM Linux User Experience at CERN

TSM Configuration

[Diagram, labels: 2nd Storage Centre, Computer Centre, FC stack, FC switch, TAPE ROBOT, SAN, RAID DISK SAN, AIX, Linux…]

Page 13: TSM Linux User Experience at CERN

Setting up the Linux etc.

  • IBM only supports specific Linux kernels
  • IBM tape drives need specific IBM driver
  • More restrictive than AIX or Solaris
      • No “smitty” system tool like AIX
  • Must reload FC driver to add devices
  • Disks MUST be labelled in /etc/fstab for safety
  • Cannot avoid Unix disk cache with ext3 fs
  • Tape drive devices may change name if add new ones
  • Must change access to tape devices if TSM not run as root
  • Usually have to reboot
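The point about labelling disks in /etc/fstab can be illustrated with a small check. Device names such as /dev/sdc1 depend on the order in which devices are discovered, so an entry that uses one may point at a different disk after devices are added, while a LABEL= entry follows the file system itself. The sketch below is only an illustration and not something from the talk: the function name, the sample entries and the mount points are all hypothetical.

```python
def unlabelled_fstab_entries(fstab_text):
    """Return (device, mount point) pairs for entries that name a raw
    /dev/sd* or /dev/hd* device instead of a LABEL= or UUID= reference."""
    risky = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # not a complete fstab entry
        device, mount_point = fields[0], fields[1]
        if device.startswith(("/dev/sd", "/dev/hd")):
            risky.append((device, mount_point))
    return risky


# Hypothetical sample entries, invented for this illustration.
SAMPLE_FSTAB = """\
# device         mount point  type  options   dump pass
LABEL=/          /            ext3  defaults  1 1
LABEL=tsmdb1     /tsm/db1     ext3  defaults  1 2
/dev/sdc1        /tsm/stage1  ext3  defaults  1 2
"""

if __name__ == "__main__":
    for device, mount_point in unlabelled_fstab_entries(SAMPLE_FSTAB):
        print(f"{device} mounted on {mount_point} is not referenced by label")
```

To check a real system, the same function could be given the contents of /etc/fstab read from disk.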

Page 14: TSM Linux User Experience at CERN

Spec of 1st TSM on Linux

  • PC Intel Xeon 2x3GHz cpus, 4GB memory
  • System disks mirrored by 3ware card
  • Standard RHEL4 64-bit Linux (specified)
  • Raptor disks mirrored by TSM for db & log
  • SATA RAID6 Infortrend array for staging
  • Ext3 file system used (specified)
  • 8 IBM 3592J tapes (300GB) in 3584 robot

Page 15: TSM Linux User Experience at CERN

1st TSM on Linux

  • Started well, performance okay
  • Functioned normally
  • High load (>1 cpu) when doing i/o
  • Sometimes does not schedule all TSM processes concurrently?
  • Beware of Linux “tools” for devices
      • Rewound tape drives!
  • Added 2nd Linux machine …

Page 16: TSM Linux User Experience at CERN

Spec of 2nd TSM on Linux

  • AMD Opteron dual core, 4 cpu, 8GB mem.
  • System disks mirrored by 3ware card
  • Standard RHEL4 64-bit Linux (specified)
  • Raptor disks mirrored by TSM for db & log
  • SATA RAID6 Infortrend array for staging
  • 6 LTO3 HP drives in STK 8500 robot
  • LTO drives mounted via ACSLS
      • Need special script to create device files

Page 17: TSM Linux User Experience at CERN

2nd TSM on Linux

  • Started well, but high cpu with i/o again
  • Corrupted file systems with high disk i/o
      • /var/log/messages “trying to seek off end of disk”
      • Reboot stopped - needed manual fsck of file systems
      • System down for some hours to check file systems and ran TSM AUDIT on disks to cleanup
  • Upset Backup clients!
      • System not available when needed
      • Backups corrupted? - yes

Page 18: TSM Linux User Experience at CERN

Tracing the Corruption

  • Tried changing RAID arrays, updated kernel and Qlogic FC driver
  • Tried single-processor kernel.
      • Better, but still corrupted
  • Borrowed RedHat certified PC
      • Still corrupted with memory problems, audit errors
  • Eventually moved big clients back to AIX
      • Linux better lightly loaded, but still see audit errors

Page 19: TSM Linux User Experience at CERN

Corruption

  • Needs high disk i/o - only seen with disks connected by FC
  • Single processor kernel was better, but too slow (limited cpu for i/o)
  • Did not seriously suspect RAID arrays as have worked well with AIX for years
  • Difficult to separate Linux fs from FC
  • Run TSM AUDIT frequently, but cannot check data (only metadata)

Page 20: TSM Linux User Experience at CERN

CERN Corruption Survey

  • Used fsprobe program in C (not TSM)
      • Just reads/writes Unix files and checks
  • Run on ~3000 farm PCs in CERN for some weeks
  • Variety of silent corruption found:
      • Memory errors, less than expected. 1-bit errors are corrected
      • Sector/page sized regions corrupted
      • Larger blocks of invalid data - ext3 file system?
  • All makes of PC eventually showed errors
  • Memory is most dangerous place for your data!
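The write, read back and check idea behind the survey can be sketched in a few lines. This is not fsprobe itself, which the slide says is a C program, and it makes no claim about how fsprobe works internally. It is a minimal Python illustration under assumptions: the 4096-byte block size, the SHA-256 derived pattern and all function names are invented here. Note also that an immediate read-back is likely to be served from the page cache rather than from disk, so a real survey needs larger files and repeated runs over time.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 4096  # assumed reporting granularity, not taken from fsprobe


def make_pattern(num_blocks, seed=0):
    """Build a deterministic pattern in which every block is different."""
    blocks = []
    for i in range(num_blocks):
        digest = hashlib.sha256(f"{seed}:{i}".encode()).digest()  # 32 bytes
        blocks.append(digest * (BLOCK_SIZE // len(digest)))
    return b"".join(blocks)


def write_pattern(path, data):
    """Write the pattern and ask the OS to flush it to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def corrupted_blocks(path, expected):
    """Read the file back and return the indexes of blocks that differ."""
    with open(path, "rb") as f:
        actual = f.read()
    total = max(len(expected), len(actual))
    bad = []
    for offset in range(0, total, BLOCK_SIZE):
        end = offset + BLOCK_SIZE
        if expected[offset:end] != actual[offset:end]:
            bad.append(offset // BLOCK_SIZE)
    return bad


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        probe_file = os.path.join(workdir, "probe.dat")
        pattern = make_pattern(num_blocks=256)
        write_pattern(probe_file, pattern)
        print("mismatched blocks:", corrupted_blocks(probe_file, pattern))
```

Reporting block indexes rather than a single pass or fail result makes it possible to tell isolated bit flips from sector or page sized damage, which is the kind of distinction the survey results above draw.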

Page 21: TSM Linux User Experience at CERN

Conclusions

  • Jury still out. Linux fs or FC-related?
  • Linux offers cheaper repeatable unit?
  • Problem: no single point of contact
      • No clear line between hardware and software
      • Different PCs show corruption in different ways
      • Extremely time consuming, disruptive to service

Page 22: TSM Linux User Experience at CERN

Next Steps

  • Try IBM configuration certified for TSM
      • PC, Qlogic HBAs, IBM RAID with RHEL4
  • Pay IBM to take all problems (Redhat too)
  • Hope for clear answer to problem - do not want to repeat all this with new hardware!
  • Talk in TSM Symposium 2009 on results?

Page 23: TSM Linux User Experience at CERN

Acknowledgements

  • Lio Frost-Ainley
  • Gordon Lee
  • Tim Bell (boss)
  • Charles Silvan (Expert from GATE & IBM)
  • Peter Kelemen (Corruption Survey)

Page 24: TSM Linux User Experience at CERN

Contact Details

  • David Asbury, CERN IT Department
  • Email: [email protected]
  • Website: www.cern.ch

Page 25: TSM Linux User Experience at CERN

Questions?