Canada’s national laboratory for particle and nuclear physics and accelerator-based science
Canadian Tier-1 Status and Evolution
Reda Tafirout, TRIUMF
GDB meeting, CERN, May 10 2017
Outline
● Canadian WLCG Scene
– The bigger picture & organization
● Tier-1 centre brief overview
– Historical context & funding history
● Current status and plans
– Computing needs and funding opportunities
– Tier-1 relocation planning & transition activities
● Future outlook
– Remaining activities for 2017 & timeline
Canadian WLCG Scene
● Tier-1: dedicated facility located at TRIUMF
– Managed and operated by ATLAS-Canada, Simon Fraser U. and TRIUMF
● Tier-2's: shared facilities located at Compute Canada centres
– National organization serving all research communities
– Management structure and operations are more complex
– Roughly every year, ATLAS-Canada submits a proposal to the National Resource Allocation Committee (NRAC) to secure resources.
– Two WLCG federations across 4 sites (was 5 prior to 2013).
● Same funding mechanism: Canada Foundation for Innovation (CFI) & provincial partners for matching
– Tier-1: very successful in securing own funding since 2006
– Compute Canada is refreshing all of its infrastructure and aging equipment (going from ~27 to 4-6 larger centres)
– CFI would like a Tier-1 integration within Compute Canada, to minimize infrastructure and operating costs.
Canadian Tier-1 centre
● TRIUMF Tier-1 is well established with an excellent track record in several key areas:
– Availability & Reliability
– Scalability & Performance
– Customer service attitude & user support
– Provision of resources and high level services (under WLCG MOU)
– ~10 years of stable 24x7 operations
● Serving ATLAS VO only
– Providing 10% of Tier-1 resources
● Dedicated facility and personnel
● High visibility project for Canada and TRIUMF
● MOU signed by TRIUMF in 2006 once initial CFI funding secured
[Figure annotation: ~99.7%]
Funding History & Current Status
● Important component of the current TRIUMF five-year plan funding cycle and prior proposals (2005-2010, 2010-2015)
● Successful and critical funding secured throughout the years:
– Significant prototyping since early 2002 helped secure funding
– 2006: CFI Exceptional Opportunities Fund
● $8M CFI + $2.5M IOF, $4M BCKDF (provincial match)
● In-kind: significant vendor discounts, TRIUMF, CANARIE, BCNET
– 2011: $3.3M from CFI for operations (through March '15)
– 2012: CFI LEF $2.5M project cost (40% CFI, 40% BCKDF) + in-kind.
● All CFI proposals led by Simon Fraser University (SFU) on behalf of ATLAS-Canada universities consortium
● Strategic procurement and expansions in line with the ATLAS physics program
● Hardware resources: 4830 cores, 7.8 PB disk, 12 PB tape, 85 servers
● Human resources: 9 FTEs
● TRIUMF presently covering full operations costs (since 2015)
TRIUMF Tier-1 Physical Infrastructure
● Current physical layout as deployed at TRIUMF
● Usage for power & cooling (relative to capacity):
– Power: ~75% of UPS capacity (225 kVA, dual feeds); ~45% of regular power capacity (112 kVA transformer)
– Cooling: ~70% of total capacity (320 kW design)
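The utilisation figures above can be turned into approximate loads and remaining headroom. The following is a minimal sketch that only multiplies the percentages and capacities quoted on the slide; the dictionary layout and function name are illustrative, and the "~" values are approximate to begin with.

```python
# Capacities and approximate utilisation fractions as quoted on the slide.
CAPACITIES = {
    "UPS (kVA)": (225.0, 0.75),
    "Regular power (kVA)": (112.0, 0.45),
    "Cooling (kW)": (320.0, 0.70),
}


def load_and_headroom(capacity, utilisation):
    """Return (current load, remaining headroom) in the unit of `capacity`."""
    load = capacity * utilisation
    return load, capacity - load


for name, (capacity, utilisation) in CAPACITIES.items():
    load, headroom = load_and_headroom(capacity, utilisation)
    print(f"{name}: load ~{load:.1f}, headroom ~{headroom:.1f}")
```

On these numbers the cooling system would be running at roughly 224 kW of its 320 kW design, which is consistent with the later remark that the TRIUMF infrastructure has limited room for growth.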
[Figure: physical layout of the TRIUMF Tier-1, with areas labelled 2007+2008, 2009+2010, 2011+2012 and 2014]
TRIUMF Tier-1 Network Topology
● Current network topology as deployed at TRIUMF
TRIUMF Tier-1 Cluster Diagram
● Current cluster architecture as deployed at TRIUMF
Resources Needs & New Funding
● Computing needs are increasing substantially due to excellent LHC performance in 2016, which is expected to continue for 2017 & 2018
● ATLAS requests are reviewed yearly by the WLCG Computing Resources Scrutiny Group (C-RSG) and approved by CERN Resources Review Board.
TRIUMF Tier-1 Required Resources
Resource    | 2017  | 2018  | 2019  | 2020
CPU (cores) | 7,236 | 7,456 | 8,948 | 10,737
Disk (PB)   | 7.5   | 8.1   | 9.4   | 10.8
Tape (PB)   | 20.7  | 23.2  | 26.7  | 30.7
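To make the growth in the required-resources table concrete, the sketch below computes the year-over-year increase for each resource and the gap to the hardware figures quoted earlier in the deck (4830 cores, 7.8 PB disk, 12 PB tape). Treating those earlier figures as the installed capacity at the start of 2017 is an assumption, and the function names are illustrative.

```python
# Required resources, copied from the table.
REQUIRED = {
    "CPU (cores)": {2017: 7236, 2018: 7456, 2019: 8948, 2020: 10737},
    "Disk (PB)": {2017: 7.5, 2018: 8.1, 2019: 9.4, 2020: 10.8},
    "Tape (PB)": {2017: 20.7, 2018: 23.2, 2019: 26.7, 2020: 30.7},
}

# Assumption: the "Hardware resources" figures from the funding slide
# represent the capacity installed before the 2017 expansion.
INSTALLED = {"CPU (cores)": 4830, "Disk (PB)": 7.8, "Tape (PB)": 12.0}


def yearly_growth(series):
    """Percentage increase of each year relative to the previous year."""
    years = sorted(series)
    return {
        year: 100.0 * (series[year] / series[prev] - 1.0)
        for prev, year in zip(years, years[1:])
    }


def shortfall(resource, year):
    """Capacity still missing for `year`, or 0.0 if already covered."""
    return max(0.0, REQUIRED[resource][year] - INSTALLED[resource])


for resource, series in REQUIRED.items():
    growth = ", ".join(f"{y}: {g:+.1f}%" for y, g in yearly_growth(series).items())
    print(f"{resource}: {growth}; 2017 shortfall: {shortfall(resource, 2017):.1f}")
```

Under that assumption, tape and CPU are the resources that fall short of the 2017 requirement, while the existing disk already covers it.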
● New CFI cyber-infrastructure funding competition announced in late 2014:
– Capital for equipment: only Compute Canada can apply (shared resources)
● Several discussions followed between TRIUMF, SFU, CFI, Compute Canada and ATLAS-Canada (during 2015 & 2016):
– Decision was made to integrate new Tier-1 resources into the new Compute Canada centre at SFU and leverage its infrastructure; this was also a CFI condition
– CFI proposal submitted for the Innovation Fund in October 2016, led by SFU on behalf of ATLAS-Canada (decision expected in June)
Tier-1 relocation plans
● Great majority of TRIUMF Tier-1 equipment reached 5 years in 2017:
– Warranties & support contracts extended until early 2018 (CFI LEF funds)
– Hardware refresh required by then
● TRIUMF infrastructure is aging (~10 years) and floor space limited
● New data centre at Simon Fraser University:
– 2 x 0.5 MW UPS capacity, backed up by generator (HA)
– Large floor space (new building recently renovated)
– Allows room for future expansion (10 MW power)
● For April 2017, need additional Tier-1 capacity as per MoU commitments
– However, limited capital available (remaining CFI LEF)
– Borrowing equipment from Compute Canada for additional tape capacity
– Leveraging the Compute Canada procurement process whenever possible
● Goal is to minimize Tier-1 downtime during the transition
Role of TRIUMF unchanged
● TRIUMF Tier-1 personnel remain responsible for operations at SFU; control and the line management structure are unchanged. Activities will be coordinated by the ATLAS Tier-1 Group Leader at TRIUMF
● New data centre infrastructure aspects are the responsibility of SFU
● Drafts of MoU & SLA exist and will be finalized in the coming months
TRIUMF & SFU Locations
● Distance between TRIUMF and SFU: ~28 km with ~1 ms RTT
● New location: SFU_WTB (Simon Fraser University Water Tower Building)
[Figure: map showing TRIUMF, Simon Fraser U. and BCNET TX (CANARIE)]
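As a sanity check on the quoted ~1 ms RTT, the sketch below computes the lower bound on round-trip time set by light propagation in optical fibre over ~28 km. The refractive index of 1.468 is an assumed typical value for single-mode fibre, and the actual fibre route between the sites may well be longer than 28 km.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
# Assumption: typical refractive index of single-mode fibre.
FIBRE_REFRACTIVE_INDEX = 1.468


def min_rtt_ms(distance_km, refractive_index=FIBRE_REFRACTIVE_INDEX):
    """Propagation-only round-trip time in milliseconds."""
    speed_in_fibre = SPEED_OF_LIGHT_KM_S / refractive_index
    return 2.0 * distance_km / speed_in_fibre * 1000.0


print(f"Propagation-only RTT over 28 km: {min_rtt_ms(28.0):.2f} ms")
```

The propagation floor comes out well below 1 ms, so the measured ~1 ms would be dominated by routing path length and equipment latency rather than by distance alone.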
New Tier-1 deployment plans
● Implement a distributed Tier-1 centre during the transition phase
– Tier-1 resources at TRIUMF and SFU locations seen as one from ATLAS
● Distributed dCache (similar to NDGF); other services should be OK
● Phase 0: pre-production of initial services and testing (Q1 '17)
– Install necessary equipment (core switch, HSM servers, admins, SAN)
– Network configuration (new address space, LHCOPN, DNS, etc.)
– New tape library commissioning and distributed Tier-1 testing
● Phase 1: production at smaller scale & capacity (Q2 '17)
– Production with additional tape capacity and related services
– Install additional disk and CPU capacity (for 2017 WLCG pledges)
● Phase 2: production at larger scale & capacity (Q4 '17 – Q1 '18)
– Hardware refresh and expansion (with new CFI funding)
– Data migration from TRIUMF site to SFU site
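For a rough feel of what the Phase 2 data migration could involve, the sketch below estimates how long a bulk copy would take at a given link speed. It is a back-of-the-envelope calculation only: the 7.8 PB figure is the disk capacity quoted earlier in the deck, the 10G and 20G rates are labels from the network diagram, and assuming the full volume moves over a single link at a fixed efficiency is a simplification introduced here, not the actual migration plan.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(volume_pb, link_gbps, efficiency=1.0):
    """Days needed to move `volume_pb` petabytes (decimal, 1e15 bytes)
    over a link of `link_gbps` gigabits per second.

    `efficiency` is the assumed fraction of the nominal rate achieved.
    """
    bits = volume_pb * 1e15 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / SECONDS_PER_DAY


for gbps in (10, 20):
    print(f"7.8 PB at {gbps} Gb/s: ~{transfer_days(7.8, gbps):.0f} days")
```

Even at full line rate a transfer of this size takes weeks, which is one reason a distributed Tier-1 spanning both sites during the transition is attractive.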
Phase 0 related work
● Intense activities during Q1 of 2017
– Physical installation, configuration, commissioning and testing
● Collaborative effort between TRIUMF, BCNET, CANARIE, SFU
– Network fully implemented with necessary topology (spare core switch from TRIUMF)
● All necessary equipment in place
– Using existing CFI funds for HSM servers, admin nodes, SAN for tape buffer
– Tape library borrowed from Compute Canada (drives and cartridges with logical partitioning)
● New cluster ready for production
[Figure: network diagram. Labels: Tier-1 SFU/GP2; Tier-1 TRIUMF; BCNET LHCONE VRF; BCNET LHCOPN VRF; BCNET R&E VRF; CANARIE; TRIUMF ASN 36391; 10G and 20G links to T0, T1, T2; GRE IP-IP VPN; Q4 2017: 2x100G to T0, T1, T2]
DNS & IP address space
● TRIUMF delegated DNS for the lcg.triumf.ca and t1dev.triumf.ca sub-domains to the Tier-1
– June 2014:
● IPv4: 206.12.1.0/24 and 142.90.144.0/23 at TRIUMF
– Jan 2017:
● IPv4: 206.12.9.128/25, 206.12.9.112/28 at SFU_WTB
– Mar 2017:
● IPv6: 2607:f8f0:660:1::/64 at TRIUMF
● IPv6: 2607:f8f0:660:3::/64 at SFU_WTB
● Hidden DNS master model ansible-ized and put into production in Feb 2017
– 2 public slaves, 3rd one soon at SFU_WTB site
– 4 private slaves (which are also TRIUMF slaves), 2 at TRIUMF and 2 at SFU_WTB site
– # ansible-playbook dns/update/public.yml
– # ansible-playbook dns/push/public.yml -e "target='dns_slaves'"
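The address blocks listed above can be checked programmatically, for example to decide which site a given host belongs to. This is a minimal sketch using Python's standard `ipaddress` module; the `SITE_NETWORKS` mapping and the `site_of` helper are illustrative names, and the host addresses in the usage lines are hypothetical examples chosen from inside the listed ranges.

```python
import ipaddress

# Address blocks as listed on the slide.
SITE_NETWORKS = {
    "TRIUMF": ["206.12.1.0/24", "142.90.144.0/23", "2607:f8f0:660:1::/64"],
    "SFU_WTB": ["206.12.9.128/25", "206.12.9.112/28", "2607:f8f0:660:3::/64"],
}


def site_of(address):
    """Return the site whose block contains `address`, or None."""
    ip = ipaddress.ip_address(address)
    for site, blocks in SITE_NETWORKS.items():
        for block in blocks:
            network = ipaddress.ip_network(block)
            if ip.version == network.version and ip in network:
                return site
    return None


# Size of each IPv4 block.
for site, blocks in SITE_NETWORKS.items():
    for block in blocks:
        network = ipaddress.ip_network(block)
        if network.version == 4:
            print(f"{site}: {block} holds {network.num_addresses} addresses")

# Hypothetical host addresses, for illustration only.
print(site_of("206.12.9.130"))
print(site_of("2607:f8f0:660:1::25"))
```

A check like this could be used when validating generated DNS records, so that a hostname for one site is never paired with an address from the other.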
DNS Master / Slaves
IPv6 Status (Network)
● Almost fully implemented: two aspects remaining
– For the TRIUMF site: work required from core computing services to implement IPv6 on R&E network, needs further coordination
– For the SFU_WTB site: IPv6 on the commercial/commodity network is not implemented
IPv6 Status (WLCG dual stack services)
● Focus has been on storage services (initial Tier-1 WLCG requirement/expectation)
● Implemented in the Middleware Readiness (MW) dCache instance for the moment (pre-production phase)
– IPv4/6 dual stack setup
– MW dCache instance: 1 head node, 1 pool node, SL7, Java 8, 1 Gb network interface
– Straightforward, normal dCache setup procedure:
● Listens on any protocol; IPv6 protocol configured in the Pool Manager; hostname definition removed from /etc/hosts
– iptables: open only necessary ports to the outside; internal storage nodes are trusted
– ip6tables: open only necessary ports to any nodes
– IPv6 is now the primary protocol for data transfer in MW readiness
● Completed and tested very recently
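The idea of IPv6 being the primary protocol on a dual-stack service can be illustrated from the client side: when a host resolves to both address families, the IPv6 addresses are tried first and IPv4 remains as a fallback. The sketch below shows only that ordering step. It is not how dCache itself selects a protocol, the function name is illustrative, and the addresses are hypothetical examples taken from the ranges listed in this deck.

```python
import ipaddress


def order_ipv6_first(addresses):
    """Return addresses with IPv6 entries first, keeping relative order.

    sorted() is stable, so entries within each family stay in input order.
    """
    return sorted(addresses, key=lambda a: ipaddress.ip_address(a).version != 6)


# Hypothetical addresses for a dual-stack storage node.
candidates = ["206.12.9.130", "2607:f8f0:660:3::10", "206.12.1.5"]
print(order_ipv6_first(candidates))
```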
Remaining Activities for 2017
● Phase 1: full production with additional capacity at SFU
– Tape: finalize monitoring aspects for 24x7 operations (now); tape infrastructure borrowed from Compute Canada
– Disk & CPU: procurement process ongoing (expect delivery in June), using remaining CFI funds
– Network: VPN between the two sites for WN access
● Phase 2: large-scale procurement and Tier-1 hardware refresh
– Exact timeline highly dependent on CFI funding decision (June), application for matching funds, lifting of conditions and capital readiness for spending (no later than Q4 '17)
● Finalize MOU & SLA between TRIUMF, SFU and Compute Canada
● Tier-1 personnel: the number of staff who will need to spend significant time at the SFU_WTB location, and their time fractions, are TBD. A new operations model is being developed (to be finalized in early 2018)
TRIUMF: Alberta | British Columbia | Calgary | Carleton | Guelph | Manitoba | McGill | McMaster | Montréal | Northern British Columbia | Queen’s | Regina | Saint Mary’s | Simon Fraser | Toronto | Victoria | Western | Winnipeg | York
Thank you! Merci!