Prague Tier-2 operations
● Tomáš Kouba, Miloš Lokajíček
● GRID 2012, Dubna
● 16.7.2012
Outline
● Who we are, our users
● New HW
● Services
● Internal network
● External connectivity
● IPv6 testbed
Who we are, our users
● Who we are
– Regional Computing Centre for Particle Physics, Institute of Physics of the Academy of Sciences of the Czech Republic
– basic research in particle physics, solid state physics and optics
● Our users
– scientists from our institute and other institutes of the Academy
– Charles University, Czech Technical University
– WLCG (ATLAS, ALICE), EGI (AUGER, CTA), D0 grid
WLCG grid structure
[Diagram: Prague Tier-2 connects to its Tier-1 at KIT and to CERN Tier-0/1, with backup Tier-1s at Taipei and BNL (FNAL); local Tier-3 centres MFF, FJFI and ÚJF; other Tier-2s reached over the Internet]
Disk space and computing capacity
Next year goal:
• support for Tier3 centers
• user support
Capacities over time
Year / VO    HEPSPEC2006    %    disk (TB)          %
2009              10 340            186
2010              19 064   100      427            100
2011              23 484   100    1 714            100
  D0               9 331    40       35              2
  ATLAS            6 796    29    1 316 (16 MFF)    77
  ALICE            7 357    31      363 (60 Řež)    21
2012              29 192   100    2 521            100
  D0               9 980    34       35              1
  ATLAS           11 600    40    1 880 (16 MFF)    74
  ALICE            7 612    26      606 (100 Řež)   24
New HW in 2012
• Worker nodes:
– 23 nodes SGI Rackable C1001-G13
– 2x Opteron 6274 (16 cores each), 64 GB RAM, 2x 300 GB SAS
– 374 W at full load, more than 5 000 HEPSPEC in total
– delivered in a water-cooled rack
• Disk servers:
– 4 Supermicro nodes (4 servers + 3 JBODs)
– 837 TB in total (400 TB still delayed because of floods)
• Infrastructure servers:
– 2x DL360 G7 (HyperV server, NFS server)
• UPS PowerWare 9390 (aka Eaton):
– 2x 100 kW, energy saving mode (offline => 98% efficiency)
[Photos/graphs: water-cooled rack with disk servers and worker nodes (rubus01); good sealing crucial; disk servers on/off (divider added)]
Services
● Batch system: Torque/Maui
● UMD services
– 2x CreamCE
– MONBox
– SE DPM (1x head node, 15 disk nodes)
● VO specific
– AUGER dashboard
– squid (for cvmfs and frontier – ATLAS)
– VOBOX (ALICE)
– 2x SAM station (D0)
● All nodes installed automatically over the network (PXE, kickstart, a simple script finishes the installation)
● All further configuration performed by CFengine (version 2)
– We are evaluating Puppet
● New services in 2012:
– CVMFS (problem with full disks, direct access to CERN stratum 1)
– UMD worker nodes
– perfsonar
Monitoring
● Nagios
– health of hardware, systems, SW, syslog monitor, SNMP traps
– important errors by e-mail and SMS, the rest in consolidated mails 3 times per day
– 7000 services on 466 hosts
– WLCG data transfers, job execution
– Multisite – alternative user interface, mass operations on groups of nodes
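As an illustration of how a site-specific Nagios check can be written, the sketch below warns when the CVMFS cache partition (the "full disks" problem mentioned under Services) fills up. The mount point and thresholds are assumptions for the example, not production values.

#!/usr/bin/env python3
# Minimal Nagios-style check: warn/critical when the CVMFS cache
# partition fills up. Mount point and thresholds are illustrative.
import os
import sys

CACHE_MOUNT = "/var/lib/cvmfs"   # assumed cache location
WARN_PCT = 80
CRIT_PCT = 90

def used_percent(path):
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return 100.0 * (total - free) / total

def main():
    try:
        pct = used_percent(CACHE_MOUNT)
    except OSError as exc:
        print("UNKNOWN - cannot stat %s: %s" % (CACHE_MOUNT, exc))
        sys.exit(3)
    msg = "%s is %.1f%% full" % (CACHE_MOUNT, pct)
    if pct >= CRIT_PCT:
        print("CRITICAL - " + msg)
        sys.exit(2)
    if pct >= WARN_PCT:
        print("WARNING - " + msg)
        sys.exit(1)
    print("OK - " + msg)
    sys.exit(0)

if __name__ == "__main__":
    main()

The exit codes 0/1/2/3 follow the usual Nagios plugin convention, so the same script can feed the e-mail/SMS alerting described above.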
Multisite Nagios UI
Netflow – network monitoring
● Flowtracker, Flowgrapher
● Useful for troubleshooting problems in the past
– e.g. the reason for poor ALICE efficiency at our site
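As a toy illustration of the kind of per-host aggregation such flow tools provide, the sketch below sums transferred bytes per destination from exported flow records. The CSV format (src, dst, bytes columns) is an assumption made for the example, not the actual Flowtracker/Flowgrapher output.

#!/usr/bin/env python3
# Toy aggregation over exported flow records: sum bytes per
# destination host to see where worker-node traffic actually goes.
import csv
import sys
from collections import defaultdict

def top_destinations(csv_path, limit=10):
    totals = defaultdict(int)
    with open(csv_path) as fh:
        for row in csv.DictReader(fh):
            totals[row["dst"]] += int(row["bytes"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]

if __name__ == "__main__":
    for host, nbytes in top_destinations(sys.argv[1]):
        print("%-40s %12.1f MB" % (host, nbytes / 1e6))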
Internal network
● CESNET upgraded our main Cisco router
– 6506 -> 6509
– supervisor SUP720 -> SUP2T
– new 8x 10G X2 card
– planned upgrade of power supplies 2x 3 kW -> 2x 6 kW
– (2 cards 48x 1 Gbps, 1 card 4x 10 Gbps, FW service module)
● FWSM upgraded to support IPv6
● MTU increased to 9000 during spring
– experienced problems with ATLAS data transfers
– ICMP "fragmentation needed" messages were suppressed
– fixed on the main router
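A quick way to see what the local kernel currently believes the path MTU to a remote host to be, and how large a datagram it will send unfragmented, is sketched below (Linux-only, illustrative; not the procedure actually used). If transfers fail while the reported path MTU stays at 9000, the ICMP "fragmentation needed" replies that should lower it are likely being filtered, which matches the problem described above.

#!/usr/bin/env python3
# Rough path-MTU probe (Linux only): send UDP datagrams with the
# "don't fragment" flag set and report the largest size the kernel
# accepts, plus the path MTU the kernel has cached for the route.
# Note: IP+UDP headers add 28 bytes, so an MTU of 9000 allows a
# payload of at most 8972 bytes.
import socket
import sys

# Linux socket option values; some Python builds do not export them.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)
IP_MTU = getattr(socket, "IP_MTU", 14)

def probe(host, port=9, max_payload=9000):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Set the DF bit on outgoing datagrams.
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    sock.connect((host, port))   # port 9 (discard); payload content is irrelevant
    best = 0
    for size in range(1400, max_payload + 1, 100):
        try:
            sock.send(b"x" * size)
            best = size
        except OSError:          # EMSGSIZE: larger than the cached path MTU
            break
    path_mtu = sock.getsockopt(socket.IPPROTO_IP, IP_MTU)
    return best, path_mtu

if __name__ == "__main__":
    sent, mtu = probe(sys.argv[1])
    print("largest unfragmented payload: %d bytes, kernel path MTU: %d" % (sent, mtu))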
Central router (Cisco 6509)
External connectivity
● Exclusive: 1 Gbps (to FZK) + 10 Gbps (CESNET)
● Shared: 10 Gbps (PASNET – GEANT)
[Graphs: traffic FZU -> FZK, FZK -> FZU, and on the PASNET link]
• Not enough for the ATLAS T2D limit (5 MB/s to/from T1s)
• Perfsonar installed:
External connectivity
LHCONE - LHC Open Network Environment
● New concept to connect a T2 to other T1s and T2s
● Tier1 (11), Tier2 (130), Tier3 sites all over the world
● Initially a hierarchical model: each T2 communicates with one T1
● T1s interconnected with the private redundant optical LHCOPN
● Change from the hierarchical to a flat model
[Diagram: mesh of T1 and T2 sites illustrating the flat model]
LHCONE cont.
● LHCONE is complementary to the well-working LHCOPN
● LHCONE only for LHC data
● Realization via L3 VPN using VRF
● Under construction
– Esnet, Internet2, Geant+NREN, Nordunet, USLHCnet, Surfnet, ASGC, CERN
● Evaluation and new improvements in 2013
● Our implementation and HW requirements are being discussed with CESNET
IPv6 testing
● We participate in the HEPiX IPv6 testbed (we focus on an IPv6-only setup)
● HW status (so far tested)
– switches have no problem with IPv6 (only 2 of them can be managed over IPv6)
– firewall upgrade was needed
– no management interfaces of our servers support IPv6
– no facility monitored by SNMP supports IPv6 (air condition, thermometers, UPS, water cooling unit)
– none of the disk arrays' management interfaces support IPv6
● DNS, DHCPv6 running fine
● NTP server runs fine (lack of stratum 1 NTP servers with IPv6 connectivity)
● Many problems with automatic installation (SL5 is simply not ready for IPv6)
IPv6 testing cont.
● Running middleware needs regular CRL updates
– we developed a tool to test CRL availability over IPv6 (a sketch of such a check follows below)
● The IPv6 testing project was partially supported by CESNET, project number 416R1/2011.
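The tool itself is not reproduced here; the following is only a minimal sketch of how such a CRL check over IPv6 might look. The CRL URL is illustrative, and a real check would iterate over the CRL URLs shipped with the installed CA certificates.

#!/usr/bin/env python3
# Minimal sketch: verify that CRL URLs are reachable over IPv6 only.
# Resolves the AAAA record, connects to the IPv6 address explicitly
# and issues a plain HTTP GET for the CRL file.
import socket
import sys
from http.client import HTTPConnection
from urllib.parse import urlparse

# Illustrative CRL URL; a real check would read the list from the
# installed CA bundle.
CRL_URLS = [
    "http://crl.example.org/ca.crl",
]

def fetch_over_ipv6(url, timeout=30):
    parsed = urlparse(url)
    host = parsed.hostname
    port = parsed.port or 80
    # Ask only for AAAA records; raises socket.gaierror if there are none.
    addr = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)[0][4][0]
    conn = HTTPConnection(addr, port, timeout=timeout)
    conn.request("GET", parsed.path or "/", headers={"Host": host})
    resp = conn.getresponse()
    body = resp.read()
    conn.close()
    return resp.status, len(body)

if __name__ == "__main__":
    failures = 0
    for url in CRL_URLS:
        try:
            status, size = fetch_over_ipv6(url)
            print("OK   %s -> HTTP %d, %d bytes" % (url, status, size))
        except OSError as exc:
            print("FAIL %s -> %s" % (url, exc))
            failures += 1
    sys.exit(1 if failures else 0)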