
  • Tier1 Status Report

    Andrew Sansum

    GRIDPP12, 1 February 2005

  • Overview

    Hardware configuration/utilisation

    That old SRM story

    dCache deployment

    Network developments

    Security stuff

  • Hardware

    CPU: 500 dual-processor Intel PIII and Xeon servers, mainly rack
    mounts (about 884 KSI2K), plus around 100 older systems.

    Disk service: mainly standard configuration
      Commodity disk service based on a core of 57 Linux servers
      External IDE & SATA/SCSI RAID arrays (Accusys and Infortrend)
      ATA and SATA drives (1500 spinning drives, plus another 500 old SCSI drives)
      About 220TB of disk (last tranche deployed in October)
      Cheap and (fairly) cheerful

    Tape service:
      STK Powderhorn 9310 silo with 8 9940B drives. Max capacity 1PB
      at present, but capable of 3PB by 2007.

  • LCG in September

  • LCG Load

  • Babar Tier-A

  • Last 7 Days

    ACTIVE: Babar, D0, LHCb, SNO, H1, ATLAS, ZEUS

  • dCache

    Motivation:
      Needed SRM access to disk (and tape).
      We have 60+ disk servers (140+ filesystems) and needed disk pool
      management; dCache was the only plausible candidate.

  • History (Norse saga) of dCache at RAL

    Mid 2003: We deployed a non-grid version for CMS. It was never used in
    production.

    End of 2003 / start of 2004: RAL offered to package a production-quality
    dCache. Stalled due to bugs and holidays; went back to the dCache and
    LCG developers.

    September 2004: Redeployed dCache into the LCG system for the CMS and
    DTeam VOs. dCache also deployed within the JRA1 testing infrastructure
    for gLite i/o daemon testing.

    Jan 2005: Still working with CMS to resolve interoperation issues,
    partly due to hybrid grid/non-grid use.

    Jan 2005: Prototype back end to tape written.

  • dCache Deployment

    dCache deployed as a production service (also a test instance, JRA1,
    developer1 and developer2?).

    Now available in production for ATLAS, CMS, LHCb and DTEAM (17TB now
    configured, 4TB used). Reliability good, but load is light.

    Will use dCache (preferably the production instance) as the interface to
    Service Challenge 2; a hypothetical transfer is sketched below.

    Work underway to provide a tape backend; a prototype is already
    operational. This will be the production SRM to tape at least until
    after the July service challenge.
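    As a rough illustration of what SRM access to the production dCache
    looks like from the experiment side, the sketch below drives the srmcp
    client (the SRM copy tool distributed with dCache) from Python. The
    endpoint host, port and pnfs path are invented placeholders, not the
    real RAL values, and a valid grid proxy is assumed.

```python
# Hypothetical sketch only: copy a local file into a dCache disk pool via SRM
# using the srmcp client shipped with dCache. Host, port and path below are
# placeholders, not the actual RAL Tier-1 endpoint.
import subprocess

source = "file:////home/user/run1234.dat"             # local file, srmcp URL syntax
destination = ("srm://dcache.example.ac.uk:8443"      # assumed SRM endpoint (placeholder)
               "/pnfs/example.ac.uk/data/atlas/run1234.dat")

# srmcp negotiates a transfer URL with the SRM server, then moves the data
# (normally over gridftp) to whichever pool dCache selects.
subprocess.run(["srmcp", source, destination], check=True)
```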

  • Current Deployment at RAL

  • Original Plan for Service Challenges

    [Diagram: experiments served by the production dCache, with a separate
    test dCache(?) providing the Service Challenge technology]

  • dCache for Service Challenges

    [Diagram: Service Challenge experiments connect over UKLIGHT to a dCache
    head node (gridftp), with the disk servers behind it]

  • Network

    Recent upgrade to the Tier1 network.

    Beginning to put in place a new generation of network infrastructure:
      a low-cost solution based on commodity hardware
      10 Gigabit ready
      able to meet the needs of the forthcoming service challenges and of
      increasing production data flows (a rough sizing sketch follows below)
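    As a back-of-the-envelope illustration of the sizing behind the N*1 Gbit
    trunks in the following diagrams, the sketch below estimates how many
    1 Gbit/s links are needed to sustain a given aggregate rate. The target
    rate and the usable-fraction figure are assumptions for illustration,
    not numbers from the talk.

```python
# Illustrative sizing sketch (assumed numbers, not figures from the talk):
# how many 1 Gbit/s trunk links are needed to sustain a target aggregate rate.
import math

target_rate_MB_s = 500        # assumed aggregate transfer target (placeholder)
usable_fraction = 0.8         # assume ~80% of line rate achievable in practice

usable_MB_s_per_link = 1e9 * usable_fraction / 8 / 1e6   # per 1 Gbit/s link
links_needed = math.ceil(target_rate_MB_s / usable_MB_s_per_link)

print(f"~{usable_MB_s_per_link:.0f} MB/s per link -> {links_needed} x 1 Gbit trunk links")
```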

  • September

    [Diagram: CPU and disk farms each hanging off Summit 7i switches, linked
    at 1 Gbit to the site router]

  • Now (Production)

    [Diagram: disk+CPU racks at 1 Gbit/link into a Nortel 5510 stack
    (80 Gbit backplane), N*1 Gbit trunks to a Summit 7i, N*1 Gbit on to the
    site router]

  • Soon (Lightpath)

    [Diagram: as the production layout, but the Nortel 5510 stack is also
    dual-attached at 2*1 Gbit towards UKLIGHT (10 Gb), alongside the
    existing N*1 Gbit trunks to the Summit 7i and 1 Gbit to the site router]

  • Next (Production)

    [Diagram: disk+CPU racks at 1 Gbit/link into the Nortel 5510 stack
    (80 Gbit), N*10 Gbit trunks to a 10 Gigabit switch, 10 Gb to the new
    site router (RAL site)]

  • Machine Room Upgrade

    Large mainframe cooling (cooking) infrastructure: ~540KW. A substantial
    overhaul is now completed with good performance gains, but we were close
    to maximum by mid-summer; temporary chillers over August (funded by
    CCLRC).

    Substantial additional capacity ramp-up planned for the Tier-1 (and
    other e-Science) services.

    November (annual) air-conditioning RPM shutdown: major upgrade to a new
    (independent) cooling system (400KW+), funded by CCLRC.

    Also: profiling power distribution, new access control system.

    Worrying about power stability (brownout and blackout in the last
    quarter).

  • MSS Stress Testing

    Preparation for SC3 (and beyond) is underway (Tim Folkes), running since
    August. Motivation: service load has historically been rather low.

    Look for gotchas; review known limitations. We are only part of the way
    through the stress-test process, so just a taster here. The cycle (a
    sketch of one such measurement loop follows below):
      Measure performance
      Fix trivial limitations
      Repeat
      Buy more hardware
      Repeat

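    A minimal sketch of the kind of measurement loop behind the
    catalogue-manipulation figures two slides on: time batches of
    create/query/purge requests at increasing batch sizes. The run_op
    function is a stand-in for the site-specific catalogue command (e.g. a
    sysreq/flfsys call), not a real API.

```python
# Generic timing harness sketch; run_op() is a placeholder, not a real command.
import time

def run_op(kind: str, n: int) -> None:
    """Placeholder: issue one catalogue 'create', 'query' or 'purge' request."""
    time.sleep(0.001)  # stand-in for the real command's latency

def average_seconds(kind: str, batch: int, repeats: int = 3) -> float:
    """Average wall-clock time to issue `batch` operations of one kind."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(batch):
            run_op(kind, i)
        total += time.perf_counter() - start
    return total / repeats

# Batch sizes taken from the x-axis of the catalogue-manipulation plot.
for batch in (1, 2, 5, 10, 20, 50, 100, 150, 200, 250):
    for kind in ("create", "query", "purge"):
        print(f"{kind:6s} batch={batch:3d} avg={average_seconds(kind, batch):.3f}s")
```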

  • Test and Production Tape Systems (Thursday, 04 November 2004)

    [Diagram of the test and production tape systems: the STK 9310 silo's
    8 x 9940 tape drives attach through two Brocade FC switches
    (ADS_switch_1 and ADS_Switch_2, 4 drives each) to AIX data servers
    (ermintrude, florence, zebedee and dougal; basil is the test data
    server, mchenry1 runs the test flfsys, brian runs flfsys). Redhat nodes
    provide the counter (ADS0CNTR), pathtape (ADS0PT01) and SRB interface
    (ADS0SB01); dylan (AIX) handles import/export and buxton (SunOS) runs
    ACSLS. Disk arrays 1-4 hold catalogue and cache. Users reach the system
    via SRB (Inq, S commands, MySRB), ADS tape and sysreq admin/pathtape
    commands. The diagram distinguishes physical FC/SCSI connections, sysreq
    UDP commands, user SRB commands, VTP and SRB data transfers, and STK
    ACSLS commands; the sysreq, VTP and ACSLS connections shown to dougal
    also apply to the other data server machines but are left out for
    clarity.]

  • Catalogue Manipulation

    [Plot: average create, query and purge times (seconds, 0-120) against
    the number of queries per batch (1 to 250)]

  • Write Performance

    [Plot: single-server test; read (KB/sec), write (KB/sec) and total rates
    over an overnight run from 16:00 to about 08:41]

  • Conclusions (MSS stress testing)

    Have found a number of easily fixable bugs, and some less easily fixable
    architectural issues. We now have a much better understanding of the
    limitations of the architecture.

    The current estimate suggests 60-80MB/s to tape now. Buy more/faster
    disk and try again.

    The current drives are good for 240MB/s peak; actual performance is
    likely to be limited by the ratio of drive (read+write) time to
    (load+unload+seek) time, as the sketch below illustrates.
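    A worked sketch of that ratio, with illustrative numbers only: the
    ~30 MB/s figure is the nominal 9940B streaming rate (8 drives giving the
    240 MB/s peak quoted above), while the file size and mount overhead are
    assumptions, not measurements from the tests.

```python
# Illustrative duty-cycle arithmetic; all inputs are assumptions except the
# nominal 9940B streaming rate.
streaming_MB_s = 30.0        # nominal streaming rate of one 9940B drive
file_size_MB = 2000.0        # assumed average file size moved per mount
overhead_s = 90.0            # assumed load + position + unload time per mount

transfer_s = file_size_MB / streaming_MB_s
effective_MB_s = file_size_MB / (transfer_s + overhead_s)
duty_cycle = transfer_s / (transfer_s + overhead_s)

print(f"effective rate per drive: {effective_MB_s:.1f} MB/s "
      f"(duty cycle {duty_cycle:.0%}); 8 drives -> {8*effective_MB_s:.0f} MB/s aggregate")
```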

  • Security Incident

    26 August: an X11 scan acquires a userid/password at an upstream site.
    The hacker logs on to the upstream site, snoops known_hosts, and sshes
    to a Tier-1 front-end host using an unencrypted private key from the
    upstream site. The upstream site responds but does not notify RAL.

    The hacker loads an IRC bot on the Tier1 and registers it at a remote
    IRC server (for command/control/monitoring). He tries a root exploit
    (fails) and attempts logins to downstream sites (fails, we think).

    7th October: the Tier-1 is notified of the incident by the IRC service
    and begins its response.

    2000-3000 sites involved globally.

  • Objectives

    Comply with site security policy (disconnect etc.): will disconnect
    hosts promptly once an active intrusion is detected.

    Comply with external security policies (e.g. LCG), including
    notification.

    Protect downstream sites by notification and pro-active disconnection.

    Identify the sites involved and establish direct contacts
    upstream/downstream.

    Minimise service outage. Eradicate the infestation.

  • Roles

    Incident controller: manage the information flood, deal with external
    contacts.

    Log trawler: hunt for contacts.

    Hunter/killer(s): search for compromised hosts.

    Detailed forensics: understand what happened on compromised hosts.

    End-user contacts: get in touch with users, confirm usage patterns,
    terminate IDs.

  • Chronology

    At 09:30 on 7th October the RAL network group forwarded a complaint from
    Undernet suggesting unauthorized connections from Tier1 hosts to
    Undernet.

    At 10:00 initial investigation suggested unauthorised activity on
    csfmove02; csfmove02 was physically disconnected from the network.

    By now 5 Tier1 staff plus the e-Science security officer were 100%
    engaged on the incident, with additional support from the CCLRC network
    group and the site security officer. Effort remained at this level for
    several days. Babar support staff at RAL were also active tracking down
    unexplained activity.

    At 10:07 a request was made to the site firewall admin for firewall logs
    of all contacts with the suspected hostile IRC servers.

    At 10:37 the firewall admin provided an initial report, confirming
    unexplained current outbound activity from csfmove02 but no other nodes
    involved.

    At 11:29 Babar reported that the bfactory account password was common to
    the following additional IDs: bbdatsrv and babartst.

    At 11:31 Steve completed a rootkit check: no hosts found, although with
    possible false positives on Redhat 7.2 that we are uneasy about.

    By 11:40 preliminary investigations at RAL had concluded that an
    unauthorized access had taken place onto host csfmove02 (a data mover
    node), which in turn was connected outbound to an IRC service. At this
    point we notified the security mailing lists (hepix-security,
    lcg-security, hepsysman).

  • Security Events

    [Plot: number of security events (0-35) per 6-hour interval over the
    course of the incident]

  • Security Summary

    The intrusion took 1-2 staff-months of CCLRC effort to investigate: 6
    staff full time for 3 days (5 Tier-1 plus the e-Science security
    officer), working long hours. Also involved: the networking group, CCLRC
    site security, Babar support and other sites.

    Prompt notification of the incident by the upstream site would have
    substantially reduced the size and complexity of the investigation.

    The good standard of patching on the Tier1 minimised the spread of the
    incident internally (but we were lucky).

    We can no longer trust who logged-on users are: many userids (globally)
    are probably compromised.

  • Conclusions

    A period of consolidation. User demand continues to fluctuate, but an
    increasing number of experiments are able to use LCG.

    Good progress on SRM to disk (dCache); making progress with SRM to tape.

    Having an SRM isn't enough: it has to meet the needs of the experiments.

    Expect focus to shift (somewhat) towards the service challenges.