Tier1 Status Report
Andrew Sansum
GridPP12, 1 February 2005
Overview
Hardware configuration/utilisation
That old SRM story
dCache deployment
Network developments
Security stuff
Hardware
CPU: 500 dual-processor Intel PIII and Xeon servers, mainly rack-mounted (about 884 KSI2K), plus 100 older systems.
Disk service: mainly a standard configuration. Commodity disk service based on a core of 57 Linux servers with external IDE & SATA/SCSI RAID arrays (Accusys and Infortrend).
ATA and SATA drives (1500 spinning drives, plus another 500 old SCSI drives).
About 220 TB of disk (last tranche deployed in October).
Cheap and (fairly) cheerful.
Tape service:
STK Powderhorn 9310 silo with 8 9940B drives. Maximum capacity 1 PB at present, but capable of 3 PB by 2007.
LCG in September
LCG Load
Babar Tier-A
Last 7 Days
ACTIVE: Babar, D0, LHCb, SNO, H1, ATLAS, ZEUS
dCache
Motivation:
Needed SRM access to disk (and tape).
With 60+ disk servers (140+ filesystems) we needed disk pool management; dCache was the only plausible candidate.
History (Norse saga) of dCache at RAL
Mid 2003: we deployed a non-grid version for CMS. It was never used in production.
End of 2003 / start of 2004: RAL offered to package a production-quality dCache. Stalled due to bugs and holidays; went back to the developers and the LCG developers.
September 2004: redeployed dCache into the LCG system for the CMS and DTeam VOs. dCache also deployed within the JRA1 testing infrastructure for gLite i/o daemon testing.
Jan 2005: still working with CMS to resolve interoperation issues, partly due to hybrid grid/non-grid use.
Jan 2005: prototype back end to tape written.
dCache Deployment
dCache deployed as a production service (also a test instance, JRA1, developer1 and developer2?).
Now available in production for ATLAS, CMS, LHCb and DTEAM (17 TB now configured, 4 TB used); a usage sketch follows below. Reliability is good but the load is light.
Will use dCache (preferably the production instance) as the interface to Service Challenge 2.
Work underway to provide a tape backend; a prototype is already operational. This will be the production SRM to tape at least until after the July service challenge.
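As a rough illustration of how an experiment might write a file into the production dCache through its SRM interface, the sketch below wraps the srmcp client in a small Python script. The endpoint hostname, port and /pnfs path are hypothetical placeholders, not the actual RAL configuration.

```python
#!/usr/bin/env python
"""Minimal sketch: put a local file into dCache through its SRM interface
using the srmcp client. The hostname, port and /pnfs path are hypothetical
placeholders, not the actual RAL production endpoint."""

import subprocess
import sys


def srm_put(local_path, srm_url):
    # srmcp takes a source URL and a destination URL; "file:///" plus an
    # absolute local path gives the file://// form the client expects.
    cmd = ["srmcp", "file:///" + local_path, srm_url]
    status = subprocess.call(cmd)
    if status != 0:
        sys.exit("srmcp failed with exit code %d" % status)


if __name__ == "__main__":
    srm_put("/data/test/file001.dat",
            "srm://dcache.example.ac.uk:8443/pnfs/example.ac.uk/data/dteam/file001.dat")
```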
Current Deployment at RAL
Original Plan for Service Challenges
[Diagram: experiments use the production dCache; the Service Challenges use a test dCache(?), based on the same technology.]
dCache for Service Challenges
[Diagram: Service Challenge experiments connect over UKLIGHT to a gridftp/dCache head node in front of the disk servers.]
Network
Recent upgrade to the Tier-1 network.
Beginning to put in place a new generation of network infrastructure:
Low-cost solution based on commodity hardware.
10 Gigabit ready.
Able to meet the needs of the forthcoming service challenges and increasing production data flows.
September
[Diagram: disk and CPU racks each attached to one of three Summit 7i switches, uplinked at 1 Gbit to the site router.]
Now (Production)
[Diagram: Disk+CPU racks attached at 1 Gbit/link to a Nortel 5510 stack (80 Gbit), with N*1 Gbit uplinks to the Summit 7i and on to the site router.]
Soon (Lightpath)
[Diagram: as the production layout, with Disk+CPU racks at 1 Gbit/link on the Nortel 5510 stack (80 Gbit) and N*1 Gbit uplinks to the Summit 7i and site router, plus a 2*1 Gbit dual attachment from the stack towards UKLIGHT (10 Gb).]
Next (Production)
[Diagram: Disk+CPU racks at 1 Gbit/link to the Nortel 5510 stack (80 Gbit), N*10 Gbit uplinks to a 10 Gigabit switch, and a 10 Gb connection to the new RAL site router.]
Machine Room Upgrade
Large mainframe cooling (cooking) infrastructure: ~540 kW. A substantial overhaul is now completed with good performance gains, but we were close to the maximum by mid-summer; temporary chillers were brought in over August (funded by CCLRC).
Substantial additional capacity ramp-up planned for the Tier-1 (and other e-Science) services.
November (annual) air-conditioning RPM shutdown: major upgrade to a new (independent) cooling system (400 kW+), funded by CCLRC.
Also: profiling power distribution, new access control system.
Worrying about power stability (brownout and blackout in the last quarter).
MSS Stress Testing
Preparation for SC3 (and beyond) is underway (Tim Folkes), ongoing since August.
Motivation: service load has historically been rather low.
Look for gotchas.
Review known limitations.
We are part of the way through the stress-test process; just a taster here (a toy timing harness is sketched after this list):
Measure performance
Fix trivial limitations
Repeat
Buy more hardware
Repeat
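The catalogue tests shown two slides on follow this pattern: time batches of create/query/purge operations and record the averages. Below is a toy Python harness along those lines; catalogue_op() is a hypothetical stand-in for the real flfsys/sysreq catalogue commands, not the actual test code.

```python
"""Toy timing harness for the catalogue stress tests: run batches of
create/query/purge operations and report the average wall-clock time.
catalogue_op() is a hypothetical stand-in for the real catalogue commands."""

import time


def catalogue_op(kind, n):
    # Placeholder: in the real test this would issue n catalogue
    # create/query/purge requests against the test flfsys instance.
    time.sleep(0.001 * n)


def time_batch(kind, n, repeats=5):
    # Average wall-clock time over several repeats of an n-operation batch.
    elapsed = []
    for _ in range(repeats):
        start = time.time()
        catalogue_op(kind, n)
        elapsed.append(time.time() - start)
    return sum(elapsed) / len(elapsed)


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20, 50, 100, 150, 200, 250):
        for kind in ("create", "query", "purge"):
            print("%6s x %4d : %6.2f s" % (kind, n, time_batch(kind, n)))
```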
[Diagram (Thursday, 04 November 2004): ADS tape service architecture, test and production systems. An STK 9310 silo with 8 x 9940 tape drives, four drives to each of two Brocade FC switches (ADS_switch_1, ADS_Switch_2). AIX data servers (ermintrude, florence, zebedee, dougal), the flfsys catalogue host (brian), a test flfsys host (mchenry1) and a test data server (basil), Redhat service nodes (ADS0CNTR counter, ADS0PT01 pathtape, ADS0SB01 SRB interface), dylan (AIX, import/export) and buxton (SunOS, ACSLS), with catalogue and cache disk arrays (array1-array4). Users reach the service through SRB (Inq, S commands, MySRB) and sysreq/pathtape admin commands (create, query). The legend distinguishes physical FC/SCSI connections, sysreq UDP commands, user SRB commands, VTP and SRB data transfers, and STK ACSLS commands; the sysreq, VTP and ACSLS connections shown for dougal also apply to the other data server machines but are left out for clarity.]
Catalogue Manipulation
[Chart: average time in seconds (0-120) for catalogue create, query and purge operations as a function of batch size (1 to 250 queries).]
Write Performance
[Chart: single server test; read and write rates (KB/sec) and total, sampled from about 16:00 to 08:41 the following morning.]
Conclusions
Have found a number of easily fixable bugs.
Have found some less easily fixable architectural issues.
Have a much better understanding of the limitations of the architecture.
Estimate suggests 60-80 MB/s to tape now. Buy more/faster disk and try again.
Current drives are good for 240 MB/s peak; actual performance is likely to be limited by the ratio of drive (read+write) time to (load+unload+seek) time (see the back-of-the-envelope sketch below).
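To make the last point concrete, a back-of-the-envelope sketch: the fraction of time the drives spend actually transferring data, rather than loading, unloading and seeking, scales the aggregate peak rate down. The transfer and overhead times below are purely illustrative assumptions, not measurements.

```python
"""Back-of-the-envelope: effective aggregate tape rate once mount/seek
overheads are included. Transfer/overhead times are illustrative only."""

peak_rate_mb_s = 240.0   # aggregate peak of the 8 x 9940B drives (slide figure)
transfer_s = 90.0        # assumed time per mount spent reading/writing
overhead_s = 210.0       # assumed load + unload + seek time per mount

duty_cycle = transfer_s / (transfer_s + overhead_s)
effective_rate_mb_s = peak_rate_mb_s * duty_cycle

print("duty cycle     : %.0f%%" % (100 * duty_cycle))
# ~72 MB/s with these assumed numbers, in the 60-80 MB/s range quoted above
print("effective rate : %.0f MB/s" % effective_rate_mb_s)
```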
Security Incident
26 August: an X11 scan acquires a userid/password at an upstream site. The hacker logs on to the upstream site and snoops known_hosts.
SSH to a Tier-1 front-end host using an unencrypted private key from the upstream site.
The upstream site responds but does not notify RAL.
The hacker loads an IRC bot on the Tier-1 and registers it at a remote IRC server (for command/control/monitoring).
Tries a root exploit (fails); attempts logins to downstream sites (fails, we think).
7 October: the Tier-1 is notified of the incident via the IRC service and begins its response.
2000-3000 sites involved globally.
Objectives
Comply with the site security policy (disconnect, etc.).
Will disconnect hosts promptly once an active intrusion is detected.
Comply with external security policies (e.g. LCG).
Notification.
Protect downstream sites by notification and pro-active disconnection.
Identify involved sites.
Establish direct contacts with upstream/downstream sites.
Minimise service outage; eradicate the infestation.
Roles
Incident controller: manage the information flood, deal with external contacts.
Log trawler: hunt for contacts.
Hunter/killer(s): search for compromised hosts.
Detailed forensics: understand what happened on compromised hosts.
End user contacts: get in touch with users, confirm usage patterns, terminate IDs.
Chronology
At 09:30 on 7 October the RAL network group forwarded a complaint from Undernet suggesting unauthorised connections from Tier-1 hosts to Undernet.
At 10:00 an initial investigation suggests unauthorised activity on csfmove02; csfmove02 is physically disconnected from the network.
By now 5 Tier-1 staff plus the e-Science security officer are 100% engaged on the incident, with additional support from the CCLRC network group and the site security officer. Effort remained at this level for several days. Babar support staff at RAL were also active tracking down unexplained activity.
At 10:07 a request is made to the site firewall admin for firewall logs of all contacts with the suspected hostile IRC servers.
At 10:37 the firewall admin provides an initial report, confirming unexplained current outbound activity from csfmove02 but no other nodes involved.
At 11:29 Babar report that the bfactory account password was common to the following additional IDs: bbdatsrv and babartst.
At 11:31 Steve completes a rootkit check; no hosts are found, although there are possible false positives on Redhat 7.2 which we are uneasy about.
By 11:40 preliminary investigations at RAL had concluded that an unauthorised access had taken place onto host csfmove02 (a data mover node), which in turn was connected outbound to an IRC service. At this point we notified the security mailing lists (hepix-security, lcg-security, hepsysman).
Security Events
[Chart: number of security events per 6-hour interval (0-35) over the days around the incident.]
Security Summary
The intrusion took 1-2 staff-months of CCLRC effort to investigate: 6 staff full-time for 3 days (5 Tier-1 plus the e-Science security officer), working long hours. Also involved: the networking group, CCLRC site security, Babar support and other sites.
Prompt notification of the incident by the upstream site would have substantially reduced the size and complexity of the investigation.
The good standard of patching on the Tier-1 minimised the spread of the incident internally (but we were lucky).
We can no longer trust who logged-on users are; many userids (globally) are probably compromised.
Conclusions
A period of consolidation: user demand continues to fluctuate, but an increasing number of experiments are able to use LCG.
Good progress on SRM to disk (dCache); making progress with SRM to tape.
Having an SRM isn't enough: it has to meet the needs of the experiments.
Expect focus to shift (somewhat) towards the service challenges.