Winnie Lacesso Bristol Site Report June 2009

Winnie Lacesso

Bristol Site Report, June 2009

2 Staff & Users

• Departmental Physics / Networks: JP Melot, Neil Laws (Microsoft); Rhys Morris (Astrophysics support)

• Particle Physics: Winnie Lacesso, Rhys Morris (.2)

• About 40 PP staff & students

• Desktops: less than 10 - Lx, MS, 2 x iMac

• Laptops: 40 or so mainly Mac (~16), Xp (~15), Lx (SL4/5, FC)

• STAFF CHANGES: Yves Coppens = SouthGrid Technical Support, left; Jon Wakelin = .5 Particle Physics support (GPFS, StoRM) left

• Dr Bob Cregan joined as HPC Storage Admin - will help with StoRM & GPFS

3 Servers

• About 10 non-LCG servers (was 20): consolidated/retired 10 in 1 yr!!

• Win2003: fileserving (480GB); considering Unix/Samba replacement

• Win2K AFS (IBM TransArc 3.6) (230GB): have Unix server ready, no time to get to it & Win2K server keeps working...

• Most servers = SL4/5: NFS (1, was 5), PBS batch (3), compute (~3), subversion/elog, mediawiki, infrastructure (web, DHCP, kickstart)

4 UBristol HPC: PP usage

• Was 30 jobslots, now up to 90 on SL4 HPC cluster (2GB RAM/core)

• Not yet using SL5 HPC cluster (only 1GB RAM/core)

• Jon W was instrumental in getting CE & SE up+running!

5 RAID Grief, SCSI Aggro

• DPM has 2 x RAID arrays attached. The 16-bay array slid into a broken/faulty state after commissioning & 2 years' work. Months of grief + debugging.

• Aug 5 10:30:17 lcgse01 kernel: SCSI error : <1 0 2 0> return code = 0x10000

• Aug 5 10:30:17 lcgse01 kernel: end_request: I/O error, dev sdf, sector 787223

• Aug 5 10:30:17 lcgse01 kernel: Buffer I/O error on device sdf1, logical block 98395

• Aug 5 10:30:17 lcgse01 kernel: lost page write due to I/O error on sdf1

• Aug 5 10:30:37 lcgse01 kernel: scsi1:0:2:0: Attempting to abort cmd ebdd0e00: 0x28 0x0 0x89 0xbf

• Aug 5 10:30:37 lcgse01 kernel: scsi1: At time of recovery, card was not paused

• Aug 5 10:30:37 lcgse01 kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins<<<<<<<<<<<<<<<<<

• Aug 5 10:30:37 lcgse01 kernel: scsi1: Dumping Card State at program address 0x26 Mode 0x33

• Aug 5 10:30:37 lcgse01 kernel: Card was paused

• Replaced SCSI controller (Adaptec, for LSI) - no difference

• Vendor agreed & sent replacement Dec 2008; installed Jan 2009
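When debugging this kind of fault it can help to tally the kernel I/O errors per device, to see whether the errors cluster on one array. The following is a minimal sketch only, not the procedure used at Bristol: the function name `count_io_errors` and the three regular expressions are assumptions, written to match just the message shapes in the log excerpt above.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt above.
SAMPLE_LOG = """\
Aug 5 10:30:17 lcgse01 kernel: SCSI error : <1 0 2 0> return code = 0x10000
Aug 5 10:30:17 lcgse01 kernel: end_request: I/O error, dev sdf, sector 787223
Aug 5 10:30:17 lcgse01 kernel: Buffer I/O error on device sdf1, logical block 98395
Aug 5 10:30:17 lcgse01 kernel: lost page write due to I/O error on sdf1
"""

# Hypothetical patterns: they cover only the three message shapes
# seen in the excerpt, not every kernel I/O error format.
DEVICE_PATTERNS = [
    re.compile(r"I/O error, dev (\w+),"),
    re.compile(r"I/O error on device (\w+),"),
    re.compile(r"I/O error on (\w+)$"),
]


def count_io_errors(lines):
    """Return a Counter mapping device name to number of I/O error lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        for pattern in DEVICE_PATTERNS:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
                break  # count each log line at most once
    return counts


if __name__ == "__main__":
    for device, n in sorted(count_io_errors(SAMPLE_LOG.splitlines()).items()):
        print(device, n)
```

On a real server the same function could be fed the lines of the syslog file instead of `SAMPLE_LOG`; the file location depends on the local syslog configuration.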

6 Shoulder that Load!

7 StoRM SE, GPFS

• New hardware for HPC CE & StoRM SE, also gridftp server & new MON (syslog, Nagios, etc): X7DBU Xeon E5405 with 2GB RAM/core

• HPC CE working well except gpfs timeouts – patchy OPS SAM fails

• Problems with StoRM - gpfs multiclustering not yet working, rfio permission problems (ACLs??) - thought Jon left it in working order but guess not... New Storage Admin (Bob Cregan) will help get gpfs multiclustering working

• Good performance on new hardware!

8 Security

• User laptops frequently go offsite (home, CERN, RAL), come back & reconnect to internal network. No (detected) incidents. Even from users with root/admin access on laptops.

• One laptop lost - student forgot bag at bus stop. Not there on return. Fortunately, USB backup disk kept in different location. Moral of story: carry USB backup disk separate from laptop.

• Ongoing scary ssh-linux incident: no intrusions detected here so far

9 Issues

• Upcoming/pending work:

• Ongoing: New servers replacing old – servers waiting

• VMs will replace existing web/svn/elog/wiki server, existing SL3 MON, & probably others

• Recent/ongoing problems:

• UPS needs rearranging – some important servers not on UPS

• Workload really increased since Yves & Jon left

• A/C failure May 2009 – A/C being replaced (before too hot we hope)