Winnie Lacesso
Bristol Site Report
June 2009
2 Staff & Users
• Departmental Physics / Networks: JP Melot, Neil Laws (Microsoft); Rhys Morris (Astrophysics support)
• Particle Physics: Winnie Lacesso, Rhys Morris (.2)
• About 40 PP staff & students
• Desktops: less than 10 - Lx, MS, 2 x iMac
• Laptops: 40 or so, mainly Mac (~16), XP (~15), Lx (SL4/5, FC)
• STAFF CHANGES: Yves Coppens = SouthGrid Technical Support, left; Jon Wakelin = .5 Particle Physics support (GPFS, StoRM) left
• Dr Bob Cregan joined as HPC Storage Admin - will help with StoRM & GPFS
3 Servers
• About 10 non-LCG servers (was 20): consolidated/retired 10 in 1 yr!!
• Win2003: fileserving (480GB); considering Unix/Samba replacement
• Win2K AFS (IBM TransArc 3.6) (230GB): have Unix server ready, no time to get to it & Win2K server keeps working...
• Most servers = SL4/5: NFS (1 (was 5)), PBS batch (3), compute (~3), subversion/elog, mediawiki, infrastructure (web, DHCP, kickstart)
4 UBristol HPC: PP usage
• Was 30 jobslots, now up to 90 on SL4 HPC cluster (2GB RAM/core)
• Not yet using SL5 HPC cluster (only 1GB RAM/core)
• Jon W was instrumental in getting CE & SE up+running!
5 RAID Grief SCSI Aggro
• DPM has 2 x RAID arrays attached. 16-bay slid into a broken/faulty state after commissioning & 2 years' work. Months of grief + debugging.
• Aug 5 10:30:17 lcgse01 kernel: SCSI error : <1 0 2 0> return code = 0x10000
• Aug 5 10:30:17 lcgse01 kernel: end_request: I/O error, dev sdf, sector 787223
• Aug 5 10:30:17 lcgse01 kernel: Buffer I/O error on device sdf1, logical block 98395
• Aug 5 10:30:17 lcgse01 kernel: lost page write due to I/O error on sdf1
• Aug 5 10:30:37 lcgse01 kernel: scsi1:0:2:0: Attempting to abort cmd ebdd0e00: 0x28 0x0 0x89 0xbf
• Aug 5 10:30:37 lcgse01 kernel: scsi1: At time of recovery, card was not paused
• Aug 5 10:30:37 lcgse01 kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins<<<<<<<<<<<<<<<<<
• Aug 5 10:30:37 lcgse01 kernel: scsi1: Dumping Card State at program address 0x26 Mode 0x33
• Aug 5 10:30:37 lcgse01 kernel: Card was paused
• Replaced SCSI controller (Adaptec, for LSI) - no difference
• Vendor agreed & sent replacement Dec 2008; installed Jan 2009
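When debugging this kind of fault it can help to tally which block devices the kernel is complaining about. The following is only a minimal sketch, assuming syslog lines in the same shape as the excerpts above; the function name `count_io_errors` and the `sample` list are hypothetical and not part of the original report.

```python
import re
from collections import Counter

# Matches kernel lines of the form seen in the excerpt above, e.g.
#   ... kernel: end_request: I/O error, dev sdf, sector 787223
IO_ERROR = re.compile(
    r"kernel: end_request: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)"
)

def count_io_errors(lines):
    """Return a Counter mapping device name -> number of end_request I/O errors."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts

# Hypothetical sample built from the log lines quoted on this slide.
sample = [
    "Aug 5 10:30:17 lcgse01 kernel: SCSI error : <1 0 2 0> return code = 0x10000",
    "Aug 5 10:30:17 lcgse01 kernel: end_request: I/O error, dev sdf, sector 787223",
    "Aug 5 10:30:17 lcgse01 kernel: Buffer I/O error on device sdf1, logical block 98395",
]

for dev, n in count_io_errors(sample).items():
    print(dev, n)
```

In practice the lines would be read from the log file on the affected host rather than from an inline list.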
6 Shoulder that Load!
7 StoRM SE, GPFS
• New hardware for HPC CE & StoRM SE, also gridftp server & new MON (syslog, Nagios, etc): X7DBU Xeon E5405 with 2GB RAM/core
• HPC CE working well except GPFS timeouts – patchy OPS SAM fails
• Problems with StoRM - GPFS multiclustering not yet working, rfio permission problems (ACLs??) - thought Jon left it in working order but guess not... New Storage Admin (Bob Cregan) will help get GPFS multiclustering working
• Good performance on new hardware!
8 Security
• User laptops frequently go offsite (home, CERN, RAL), come back & reconnect to internal network. No (detected) incidents. Even from users with root/admin access on laptops.
• One laptop lost - student forgot bag at bus stop. Not there on return. Fortunately, USB backup disk kept in different location. Moral of story: carry USB backup disk separate from laptop.
• Ongoing scary ssh-linux incident: no intrusions detected here so far
9 Issues
• Upcoming/pending work:
• Ongoing: New servers replacing old – servers waiting
• VMs will replace existing web/svn/elog/wiki server, existing SL3 MON, & probably others
• Recent/ongoing problems:
• UPS needs rearranging – some important servers not on UPS
• Workload really increased since Yves & Jon left
• A/C failure May 2009 – A/C being replaced (before too hot we hope)