
Page 1: RAL PPD Site Report

RAL PPD Site Report

Chris Brew

SciTech/PPD

Page 2: RAL PPD Site Report

Outline

• Hardware
  – Current
    • Grid
    • User
  – New
• Machine Room Issues
  – Power, Air Conditioning & Space
• Plans
  – Tier 3
  – Configuration Management
  – Common Backup
• Issues
  – Log processing
• Windows

Page 3: RAL PPD Site Report

Current Grid Cluster

• CPU:
  – 52 x Dual Opteron 270 Dual Core CPUs, 4GB RAM
  – 40 x Dual PIV Xeon 2.8GHz, 2GB RAM
  – All running SL3 glite-WN
• Disk:
  – 8 x 24 Slot dCache Pool Servers
    • Areca ARC-1170 24-port RAID cards
    • 22 x WD5000YS RAID 6 (Storage) – 10TB
    • 2 x WD1600YD RAID 1 (System)
    • 64 bit SL4, Single large xfs file system
• Misc:
  – GridPP Front Ends running Torque, LFC/NFS, R-GMA, dCache Head
  – Ex WNs running CE, DHCPD/TFTP pxeboot server
• Network now at 10Gb/s but external link still limited by Firewall
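
The 10TB figure quoted for each pool server's storage array follows from the RAID 6 layout: two disks' worth of space goes to parity. A minimal sketch of that arithmetic, assuming the nominal 500GB capacity of the WD5000YS and decimal units, and ignoring file system overhead (the function name `raid_usable_tb` is made up for this illustration):

```python
def raid_usable_tb(disks: int, disk_gb: float, parity_disks: int) -> float:
    """Usable capacity in decimal TB of a parity RAID set.

    parity_disks is 2 for RAID 6 and 1 for RAID 5.
    Assumes nominal disk sizes and no file system overhead.
    """
    if disks <= parity_disks:
        raise ValueError("need more disks than parity disks")
    return (disks - parity_disks) * disk_gb / 1000.0


# Storage array of one pool server: 22 x 500GB (nominal) in RAID 6
per_server_tb = raid_usable_tb(disks=22, disk_gb=500, parity_disks=2)
print(f"Usable per pool server: {per_server_tb:.1f} TB")
```

The remaining two of the 24 slots hold the RAID 1 system pair, so they do not contribute to the storage figure.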

Page 4: RAL PPD Site Report

Current User Cluster

• User Interfaces
  – 7 ex WNs from dual 1.4GHz PIII to dual 2.8GHz PIV
    • 6 x SL3 (1 test, 2 general, 3 expt)
    • 1 SL4 test UI
• 2 x Dell PowerEdge 1850 Disk Servers
  – Dell PERC 4/DC RAID card
  – 6 x 300GB disks in Dell PowerVault 220 SCSI shelf
  – Serves Home and experiment areas via NFS
    • Master copy on one server
    • rsync’d to backup server 1-4 times daily
    • Home area backed up to ADS daily
    • Same hardware as Windows solution, common spares

Page 5: RAL PPD Site Report

Other Miscellaneous Boxen

• Extra Boxes
  – Install/Scratch/Internal Web server
  – Monitoring Server
  – External Web Server
  – Minos CVS Server
  – NIS Master
  – Security Box (Central Logger and Tripwire)
• New Kit (undergoing burn-in now)
  – 32 x Dual Intel Woodcrest 5130 Dual Core CPUs, 8GB RAM (Streamline)
  – 13 Viglen HS160a Disk servers

Page 6: RAL PPD Site Report

Machine Room Issues

• Too much equipment for our small departmental Computer room
• Taken over adjacent “Display” area
  – Historically part of computer room
  – Already has raised floor and three phase power, though new distribution panel needed for latter
  – Common air conditioning with Computer Room
• Refurbished power distribution, installed kit and powered on:
  – Temp in new area rose to 26°C, temp in old area fell by 1°C
  – “Consulting” engineer called in by estates to “rebalance” air conditioning. Very successful - Old/New now 21.5/22.7°C
  – Also calculated total capacity of plant at 50kW of cooling; currently we are using ~30kW
• Next step is to refurbish the power in the old machine room to reinstate the three phase supply

Page 7: RAL PPD Site Report

Monitoring

• 2 Different monitoring systems
  – Ganglia: Monitors per host metrics and records histories to produce graphs, good for trending and viewing current and historic status
  – Nagios: Monitors “services” and issues alerts, good for raising alerts and viewing “what’s currently bad”. See other talk
• In view of current lack of effort, program to get as much monitoring as possible in Nagios to be automatically alerted on.
  – Recently added alerts for SAM tests and Yumit/Pakiti updates
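
Moving a check into Nagios mostly means wrapping it as a plugin: a program that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The following is a minimal sketch of that convention for a pending-updates check. The function name, the thresholds and the way the count is obtained are all assumptions for illustration, not the checks actually deployed at the site.

```python
# Standard Nagios plugin return codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_pending_updates(pending, warn=1, crit=10):
    """Map a count of pending package updates to (exit code, status line).

    warn and crit are hypothetical thresholds; pending=None means the
    count could not be determined.
    """
    if pending is None or pending < 0:
        return UNKNOWN, "UPDATES UNKNOWN - could not determine pending updates"
    if pending >= crit:
        state = CRITICAL
    elif pending >= warn:
        state = WARNING
    else:
        state = OK
    return state, f"UPDATES {LABELS[state]} - {pending} pending package updates"


# Example: a node reporting 3 pending updates
code, line = check_pending_updates(3)
print(code, line)
```

In a real plugin the script would print the status line and then call `sys.exit(code)` so that Nagios picks up the state from the exit status.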

Page 8: RAL PPD Site Report

Plans 1: Tier 3

• Physicists seem to want access to batch other than on the grid so need to provide local access
• Rather than run 2 batch systems want to give local user access to Grid batch workers
• Need to:
  – Merge grid and user cluster account databases
    • Modify YAIM to use NIS pool accounts
  – Change maui settings to Fairshare Grid/Non-Grid before VO before Users
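
One way to read "Grid/Non-Grid before VO before Users" is as a strict ordering of fairshare levels: the user level only breaks ties within a VO, and the VO level only breaks ties within the Grid or Non-Grid pool. The sketch below illustrates that ordering with a lexicographic sort on fairshare deficit (target share minus recent usage). It is an illustration of the idea only, not Maui's actual priority calculation, and all names, targets and usage figures in it are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    pool: str  # "grid" or "local" (hypothetical labels)
    vo: str
    user: str


def order_jobs(jobs, targets, usage):
    """Sort jobs so the largest fairshare deficit runs first.

    Deficits are compared pool first, then VO, then user; a lower level
    only matters when all higher levels are tied.
    """
    def key(job):
        levels = (("pool", job.pool), ("vo", job.vo), ("user", job.user))
        return tuple(targets.get(k, 0.0) - usage.get(k, 0.0) for k in levels)

    return sorted(jobs, key=key, reverse=True)


# Invented targets and recent usage, as fractions of the cluster
targets = {("pool", "grid"): 0.7, ("pool", "local"): 0.3,
           ("vo", "vo_a"): 0.5, ("vo", "vo_b"): 0.5}
usage = {("pool", "grid"): 0.9, ("pool", "local"): 0.1,
         ("vo", "vo_a"): 0.2, ("vo", "vo_b"): 0.7}

jobs = [Job("j1", "grid", "vo_b", "u1"),
        Job("j2", "grid", "vo_a", "u2"),
        Job("j3", "local", "local", "u3")]

ordered = [j.name for j in order_jobs(jobs, targets, usage)]
print(ordered)
```

Here the local job runs first because the local pool is under its target, and within the over-target grid pool the under-used VO goes ahead of the over-used one. A weight-based scheduler can approximate the same ordering by giving each level a weight much larger than the level below it.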

Page 9: RAL PPD Site Report

Plans 2: cfengine

• Getting to be too many worker nodes to manage with current ad hoc system; need to move towards a full configuration management system
• After asking around decided upon cfengine
• Test deployment promising
• Working on re-implementing the Worker Node install in cfengine
• Still need to find good solution for secure key distribution to newly installed nodes

Page 10: RAL PPD Site Report

Plans 3: Common Backup

• Current backup of important files for Unix is to the Atlas Data Store
  – Not sure how much longer the ADS is going to be around, need to look for another solution
• Was intending to look at Amanda but…
  – Dept bought new 30 slot tape robot for Windows Backup
  – Veritas Backup software in use on Windows supports Linux Clients
• Just starting tests on a single node. Will keep you posted.

Page 11: RAL PPD Site Report

Plan 4: Reliable Hardware

• Plan to purchase a new class of “more reliable” worker node type machines
  – Dual system disks in hot swap caddies
  – Possibly redundant hot swap power supplies
• Use this type of machine for running Grid services, Local services (Databases, web servers etc.) and User Interfaces

Page 12: RAL PPD Site Report

Issues 1: Log Processing

• Already running Central Syslog Server (soon to be expanded to 2 hosts for redundancy).
• As with our Tripwire a fairly passive system
  – Hope to get enough info off the system to get some useful info after the event
• Would like some system to monitor these logs and flag “interesting” events.
• Would prefer little or no training required.
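
As a rough picture of what "flagging interesting events" could look like at its simplest, the sketch below runs a small table of regular expressions over syslog-style lines and reports the first rule each line matches. The rule names, patterns and sample lines are all made up for illustration; a real deployment would need patterns tuned to the site's own logs, and this is not a recommendation of any particular tool.

```python
import re

# Hypothetical rules: label -> pattern. Order matters, first match wins.
PATTERNS = {
    "auth-failure": re.compile(r"failed password|authentication failure", re.I),
    "disk-error": re.compile(r"I/O error", re.I),
    "oom": re.compile(r"out of memory", re.I),
}


def flag_interesting(lines, patterns=PATTERNS):
    """Return (label, line) for every line that matches a rule."""
    hits = []
    for line in lines:
        for label, rx in patterns.items():
            if rx.search(line):
                hits.append((label, line.rstrip("\n")))
                break
    return hits


# Synthetic sample lines, not taken from any real host
sample = [
    "Jan  1 10:00:01 host1 demo[100]: Failed password for testuser",
    "Jan  1 10:00:02 host2 demo[101]: routine job finished",
    "Jan  1 10:00:03 host3 demo[102]: I/O error on device demo0",
]

for label, line in flag_interesting(sample):
    print(f"[{label}] {line}")
```

A table like this needs no training step, which matches the stated preference, but it only catches events someone has already thought to write a pattern for.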

Page 13: RAL PPD Site Report

Windows, etc.

• Still using Windows XP, with Office 2003 and Hummingbird eXceed
  – Are looking at Vista and Office 2007 but not yet seriously and have no plans for rollout yet
• Now managed at Business Unit level rather than department
• Looking for synergies between Unix and Windows support:
  – Common file server hardware
  – Common Backup Solution
• Recently equipped PPD Meeting room with Polycom rollabout VideoConferencing system.