Prague Tier-2 operations
● Tomáš Kouba, Miloš Lokajíček
● GRID 2012, Dubna
● 16.7.2012
Outline
● Who we are, our users
● New HW
● Services
● Internal network
● External connectivity
● IPv6 testbed
Who we are, our users
● Who we are
– Regional Computing Centre for Particle Physics, Institute of Physics of the Academy of Sciences of the Czech Republic
– basic research in particle physics, solid state physics and optics
● Our users
– scientists from our institute and other institutes of the Academy
– Charles University, Czech Technical University
– WLCG (ATLAS, ALICE), EGI (AUGER, CTA), D0 grid
WLCG grid structure
[Diagram: Prague Tier-2 connects to its Tier-1 at KIT and to CERN Tier-0/1, with backup Tier-1s at Taipei and BNL (FNAL); local Tier-3 centres MFF, FJFI and ÚJF; other Tier-2s reached over the Internet]
Disk space and computing capacity
Next year goal:
• support for Tier3 centers
• user support
Capacities over time
Year / VO    HEPSPEC2006    %    disk (TB)          %
2009              10 340            186
2010              19 064   100      427            100
2011              23 484   100    1 714            100
  D0               9 331    40       35              2
  ATLAS            6 796    29    1 316 (16 MFF)    77
  ALICE            7 357    31      363 (60 Řež)    21
2012              29 192   100    2 521            100
  D0               9 980    34       35              1
  ATLAS           11 600    40    1 880 (16 MFF)    74
  ALICE            7 612    26      606 (100 Řež)   24
New HW in 2012
• Worker nodes:
– 23 nodes SGI Rackable C1001-G13
– 2x Opteron 6274 (16 cores each), 64 GB RAM, 2x 300 GB SAS
– 374 W at full load, more than 5 000 HEPSPEC in total
– delivered in a water-cooled rack
• Disk servers:
– 4 Supermicro nodes (4 servers + 3 JBODs)
– 837 TB in total (400 TB still delayed because of floods)
• Infrastructure servers:
– 2x DL360 G7 (HyperV server, NFS server)
• UPS PowerWare 9390 (aka Eaton):
– 2x 100 kW, energy saving mode (offline => 98% efficiency)
[Photos/graphs: water-cooled rack with disk servers and worker nodes (rubus01); good sealing crucial; disk servers on/off (divider added)]
Services
● Batch system: Torque/Maui
● UMD services
– 2x CreamCE
– MONBox
– SE DPM (1x head node, 15 disk nodes)
● VO specific
– AUGER dashboard
– squid (for cvmfs and frontier – ATLAS)
– VOBOX (ALICE)
– 2x SAM station (D0)
● All nodes installed automatically over the network (PXE, kickstart, a simple script finishes the installation)
● All further configuration performed by CFengine (version 2)
– We are evaluating Puppet
● New services in 2012:
– CVMFS (problem with full disks, direct access to CERN stratum 1)
– UMD worker nodes
– perfsonar
Monitoring
● Nagios
– health of hardware, systems, SW, syslog monitor, SNMP traps
– important errors by e-mail and SMS, the rest in consolidated mails 3 times per day
– 7000 services on 466 hosts
– WLCG data transfers, job execution
– Multisite – alternative user interface, mass operations on groups of nodes
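As an illustration of how a site-specific Nagios check can be written, the sketch below warns when the CVMFS cache partition (the "full disks" problem mentioned under Services) fills up. The mount point and thresholds are assumptions for the example, not production values.

#!/usr/bin/env python3
# Minimal Nagios-style check: warn/critical when the CVMFS cache
# partition fills up. Mount point and thresholds are illustrative.
import os
import sys

CACHE_MOUNT = "/var/lib/cvmfs"   # assumed cache location
WARN_PCT = 80
CRIT_PCT = 90

def used_percent(path):
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return 100.0 * (total - free) / total

def main():
    try:
        pct = used_percent(CACHE_MOUNT)
    except OSError as exc:
        print("UNKNOWN - cannot stat %s: %s" % (CACHE_MOUNT, exc))
        sys.exit(3)
    msg = "%s is %.1f%% full" % (CACHE_MOUNT, pct)
    if pct >= CRIT_PCT:
        print("CRITICAL - " + msg)
        sys.exit(2)
    if pct >= WARN_PCT:
        print("WARNING - " + msg)
        sys.exit(1)
    print("OK - " + msg)
    sys.exit(0)

if __name__ == "__main__":
    main()

The exit codes 0/1/2/3 follow the usual Nagios plugin convention, so the same script can feed the e-mail/SMS alerting described above.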
Multisite Nagios UI
Netflow – network monitoring
● Flowtracker, Flowgrapher
● Useful for troubleshooting problems in the past
– e.g. the reason for poor ALICE efficiency at our site
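As a toy illustration of the kind of per-host aggregation such flow tools provide, the sketch below sums transferred bytes per destination from exported flow records. The CSV format (src, dst, bytes columns) is an assumption made for the example, not the actual Flowtracker/Flowgrapher output.

#!/usr/bin/env python3
# Toy aggregation over exported flow records: sum bytes per
# destination host to see where worker-node traffic actually goes.
import csv
import sys
from collections import defaultdict

def top_destinations(csv_path, limit=10):
    totals = defaultdict(int)
    with open(csv_path) as fh:
        for row in csv.DictReader(fh):
            totals[row["dst"]] += int(row["bytes"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]

if __name__ == "__main__":
    for host, nbytes in top_destinations(sys.argv[1]):
        print("%-40s %12.1f MB" % (host, nbytes / 1e6))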
Internal network
● CESNET upgraded our main Cisco router
– 6506 -> 6509
– supervisor SUP720 -> SUP2T
– new 8x 10G X2 card
– planned upgrade of power supplies 2x 3 kW -> 2x 6 kW
– (2 cards 48x 1 Gbps, 1 card 4x 10 Gbps, FW service module)
● FWSM upgraded to support IPv6
● MTU increased to 9000 during spring
– experienced problems with ATLAS data transfers
– ICMP "fragmentation needed" messages were suppressed
– fixed on the main router
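A quick way to see what the local kernel currently believes the path MTU to a remote host to be, and how large a datagram it will send unfragmented, is sketched below (Linux-only, illustrative; not the procedure actually used). If transfers fail while the reported path MTU stays at 9000, the ICMP "fragmentation needed" replies that should lower it are likely being filtered, which matches the problem described above.

#!/usr/bin/env python3
# Rough path-MTU probe (Linux only): send UDP datagrams with the
# "don't fragment" flag set and report the largest size the kernel
# accepts, plus the path MTU the kernel has cached for the route.
# Note: IP+UDP headers add 28 bytes, so an MTU of 9000 allows a
# payload of at most 8972 bytes.
import socket
import sys

# Linux socket option values; some Python builds do not export them.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)
IP_MTU = getattr(socket, "IP_MTU", 14)

def probe(host, port=9, max_payload=9000):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Set the DF bit on outgoing datagrams.
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    sock.connect((host, port))   # port 9 (discard); payload content is irrelevant
    best = 0
    for size in range(1400, max_payload + 1, 100):
        try:
            sock.send(b"x" * size)
            best = size
        except OSError:          # EMSGSIZE: larger than the cached path MTU
            break
    path_mtu = sock.getsockopt(socket.IPPROTO_IP, IP_MTU)
    return best, path_mtu

if __name__ == "__main__":
    sent, mtu = probe(sys.argv[1])
    print("largest unfragmented payload: %d bytes, kernel path MTU: %d" % (sent, mtu))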
Central router (Cisco 6509)
External connectivity
● Exclusive: 1 Gbps (to FZK) + 10 Gbps (CESNET)
● Shared: 10 Gbps (PASNET – GEANT)
[Graphs: traffic FZU -> FZK, FZK -> FZU, and on the PASNET link]
• Not enough for the ATLAS T2D limit (5 MB/s to/from T1s)
• Perfsonar installed:
External connectivity
LHCONE - LHC Open Network Environment
● New concept to connect a T2 to other T1s and T2s
● Tier1 (11), Tier2 (130), Tier3 sites all over the world
● Initially a hierarchical model: each T2 communicates with one T1
● T1s interconnected with the private redundant optical LHCOPN
● Change from the hierarchical to a flat model
[Diagram: mesh of T1 and T2 sites illustrating the flat model]
LHCONE cont.
● LHCONE is complementary to the well-working LHCOPN
● LHCONE only for LHC data
● Realization via L3 VPN using VRF
● Under construction
– Esnet, Internet2, Geant+NREN, Nordunet, USLHCnet, Surfnet, ASGC, CERN
● Evaluation and new improvements in 2013
● Our implementation and HW requirements are being discussed with CESNET
IPv6 testing
● We participate in the HEPiX IPv6 testbed (we focus on an IPv6-only setup)
● HW status (so far tested)
– switches have no problem with IPv6 (only 2 of them can be managed over IPv6)
– firewall upgrade was needed
– no management interfaces of our servers support IPv6
– no facility monitored by SNMP supports IPv6 (air condition, thermometers, UPS, water cooling unit)
– none of the disk arrays' management interfaces support IPv6
● DNS, DHCPv6 running fine
● NTP server runs fine (lack of stratum 1 NTP servers with IPv6 connectivity)
● Many problems with automatic installation (SL5 is simply not ready for IPv6)
IPv6 testing cont.
● Running middleware needs regular CRL updates
– we developed a tool to test CRL availability over IPv6 (a sketch of such a check follows below)
● The IPv6 testing project was partially supported by CESNET, project number 416R1/2011.
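The tool itself is not reproduced here; the following is only a minimal sketch of how such a CRL check over IPv6 might look. The CRL URL is illustrative, and a real check would iterate over the CRL URLs shipped with the installed CA certificates.

#!/usr/bin/env python3
# Minimal sketch: verify that CRL URLs are reachable over IPv6 only.
# Resolves the AAAA record, connects to the IPv6 address explicitly
# and issues a plain HTTP GET for the CRL file.
import socket
import sys
from http.client import HTTPConnection
from urllib.parse import urlparse

# Illustrative CRL URL; a real check would read the list from the
# installed CA bundle.
CRL_URLS = [
    "http://crl.example.org/ca.crl",
]

def fetch_over_ipv6(url, timeout=30):
    parsed = urlparse(url)
    host = parsed.hostname
    port = parsed.port or 80
    # Ask only for AAAA records; raises socket.gaierror if there are none.
    addr = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)[0][4][0]
    conn = HTTPConnection(addr, port, timeout=timeout)
    conn.request("GET", parsed.path or "/", headers={"Host": host})
    resp = conn.getresponse()
    body = resp.read()
    conn.close()
    return resp.status, len(body)

if __name__ == "__main__":
    failures = 0
    for url in CRL_URLS:
        try:
            status, size = fetch_over_ipv6(url)
            print("OK   %s -> HTTP %d, %d bytes" % (url, status, size))
        except OSError as exc:
            print("FAIL %s -> %s" % (url, exc))
            failures += 1
    sys.exit(1 if failures else 0)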