Working group for optimized Computing Capacity Lifecycle Planning
• Created after the ISM meeting on the 16th of June
• Members: Tim B, Eric G, Helge, Massimo, Carles, Benoit, Bernd, Olof
• Also: Eric S, Artur, Arne
• Mandate: To look at the current process and recent difficulties, including having multiple budget codes. Based on this, make a proposal for a revised process which should eliminate the recent issues as well as make the process as efficient as possible. This process should be, as far as possible, consistent for all hardware and services in the IT Computing Facilities.
• Activity: 6 meetings in total, 2 to define the problems and 4 to find solutions and agree on recommendations
• Output: report with recommendations
WG Topics
• Requirements
• Technology survey
• Technical
• Capacity
• Schedule
• Budgeting
• Procurement
• Accounting
• Funding & Chargeback(?)
• Commissioning
• Allocation & Repurposing
• Life-cycle
• Decommissioning
ID Who Recommendation
R01 CF Procurement team gives a yearly public presentation covering technology trends that are relevant for CERN IT and their implications for the services.
R02 CF Procurement team starts every tender cycle by organizing a requirements meeting covering both technical and capacity requirements.
R03 CF Operation team maintains a global table for tracking the deliveries in the Computer Facilities Capacity Coordination Meeting (CFCCM).
R04 CF Procurement team opens a SNOW ticket to the intended customer service FE for hand-over of allocated systems for commissioning.
R05 CS Investigate technical options for separating the logical network from the infrastructure.
R06 All Re-evaluate every use case (e.g. Oracle databases, Drupal) for a private network in view of its cost.
R07 CS Enhance the LANDB interface to better support bulk renumbering of IP services.
R08 CS Add information about blocking factor and number of fibres at switch level in LANDB.
R09 CS Replace cross-charging for network switches with an explicit budget transfer.
R10 CF Define a process for review and approval of requests for using the Barn.
R11 CF Establish the host-based replacement process as outlined in section 5.
R12 Bernd Propose and agree with IT management on a standing justification for adding an option for 20% additional volume to future FC papers.
R13 CF Move to a standard scheme with two procurement cycles per year, targeting the June and December FC meetings.
R14 OIS Test and certify Windows installation on standard bulk hardware.
R15 All Review clusters for potential candidates for best-effort production hardware usage as defined in the host-by-host replacement proposal.
R16 DHO Determine the staffing necessary to implement and operate the processes once recommendations R01 to R15 are all agreed.
R03: CFCCM tracking table for installations, acceptance and commissioning
https://espace2013.cern.ch/CFCCM/Shared%20Documents/Installations%20List/CFCCM_Installations_list.xlsx?Web=1 (Access restricted to e-group: it-service-cfccm)
R13: Typical procurement cycle with FC
[Figure: timeline over months N to N+12 showing the phases spec work, dispatch tender, waiting bids, bid evaluation, waiting FC approval, waiting delivery, acceptance and dispatching orders, with the procurement team's involvement marked as dedicated, assisting or consulting.]
R13: synchronized to FC meetings
[Figure: monthly calendar spanning Year N and Year N+1, marking the FC meetings (March FC, June FC, September FC, December FC) and invoicing.]
R13: 2x cycles June & December FC
[Figure: monthly calendar spanning Year N and Year N+1, marking the FC meetings (June FC, December FC) and invoicing.]
R11: 1-to-1 host replacement
• Once a year: tender for 1-to-1 replacement of hosts (and storage) whose warranty expires in the next 12 months
• One year later: replacement capacity ready for commissioning
• Inform the owner Functional Element (Cloud Infrastructure, EOS, …)
  • List of reliable production hosts to be replaced
  • List of replacement hosts
• Allow one year for migration
• Old host re-purposed for “best effort” production
• Host replacement accounts for some capacity growth. Additional growth is added on top or tendered separately
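The selection step behind the yearly tender can be sketched in a few lines of Python. This is only an illustration: the inventory mapping, the host names and the function `hosts_to_replace` are hypothetical and do not come from the working group report. Hosts whose warranty has already expired before the tender date are left out here, which is an assumption.

```python
from datetime import date


def hosts_to_replace(warranty_end_by_host, tender_date):
    """Return hosts whose warranty expires within 12 months of tender_date.

    warranty_end_by_host: hypothetical inventory, host name -> warranty end date.
    """
    try:
        horizon = tender_date.replace(year=tender_date.year + 1)
    except ValueError:
        # tender_date is 29 February and the next year is not a leap year
        horizon = tender_date.replace(year=tender_date.year + 1, day=28)
    return sorted(
        host
        for host, warranty_end in warranty_end_by_host.items()
        if tender_date <= warranty_end < horizon
    )


# Hypothetical inventory used only to exercise the function.
inventory = {
    "host-a": date(2016, 3, 1),   # expires within the window
    "host-b": date(2017, 6, 1),   # expires later
    "host-c": date(2015, 1, 1),   # already expired, not selected here
}
selected = hosts_to_replace(inventory, date(2015, 12, 1))
print(selected)
```

The resulting list corresponds to the "reliable production hosts to be replaced" that is sent to the owner Functional Element.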
R11: typical 1-to-1 cycle
[Figure: timeline from 2015 to 2017. Phase 1 (procurement): tender (December FC), then replacement capacity available. Phase 2 (commission & repurpose): notify services to migrate, then all old capacity repurposed. Replacement capacity = 1294 systems with age between 2 and 3 years.]
R11: Host lifecycle
[Figure: host age axis from new through 1, 2, 3, 4 and 5 years to older, with the phases commissioning, reliable production, re-purpose and best-effort production, and the milestones tender for 1:1 replacement, original warranty expires and replacement available.]
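As a rough illustration of the lifecycle, the mapping from host age to phase could look like the Python sketch below. The thresholds are assumptions: a 3-year warranty is inferred from the replacement of systems aged 2 to 3 years, the one-year migration period and the 5-year retirement age are taken from the R11 slides, and the names `LifecyclePolicy` and `lifecycle_phase` are hypothetical. The initial commissioning period is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LifecyclePolicy:
    warranty_years: float = 3.0      # assumption: original vendor warranty length
    migration_years: float = 1.0     # one year allowed for migration
    retire_after_years: float = 5.0  # ">5 years" decommissioning guideline


def lifecycle_phase(age_years, policy=LifecyclePolicy()):
    """Classify a host by age into one of three coarse lifecycle phases."""
    if age_years < 0:
        raise ValueError("age_years must be non-negative")
    best_effort_from = policy.warranty_years + policy.migration_years
    if age_years > policy.retire_after_years:
        return "decommissioning candidate"
    if age_years >= best_effort_from:
        return "best-effort production"
    return "reliable production"


for age in (1, 3.5, 4.5, 5.5):
    print(age, lifecycle_phase(age))
```

Keeping the thresholds in one policy object makes it easy to try other values, for example a different warranty length for storage.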
R11: 1-to-1 host replacement
• Once a year: tender for 1-to-1 replacement of hosts (and storage) whose warranty expires in the next 12 months
• One year later: replacement capacity ready for commissioning
• Inform the owner Functional Element (Cloud Infrastructure, EOS, …)
  • List of reliable production hosts to be replaced
  • List of replacement hosts
• Allow one year for migration
• Old host re-purposed for “best effort” production
• Host replacement implies significant capacity growth
• Additional growth on top or tendered for separately
R11: Decommissioning
• When to decide obsolescence?
• Difficult to define general criteria for:
  • Inefficiency (power consumption per capacity, physical space)
  • Failure rates
  • Parts availability
  • Maintenance efforts (firmware support and security)
• More than ~5 years “feels” about right
• Decommissioning process?
  • Big-bang retirement campaigns?
  • Establish a background activity, e.g. trickle 100-200 servers / month?
R11: Simulation of 1:1 replacement in 4th year and constant rate decommissioning of age >5 years
[Figure: number of servers (0 to 14000) from Aug-13 to Jun-20, with series for total servers, decommissioned and commissioned. Annotations: decommissioning at age >5 years at 200 servers/month; adding ~2300 servers in pipeline.]
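A simulation of this kind can be approximated with a simple month-by-month model. The Python sketch below is not the working group's simulation: the function `simulate`, its parameters and the starting fleet are hypothetical, and it assumes that a replacement is commissioned the month a host enters its 4th year (age 36 months) and that hosts older than 60 months are retired oldest first at a capped monthly rate.

```python
from collections import Counter


def simulate(initial_ages, months, replace_at=36, retire_after=60, retire_rate=200):
    """Return a list of (total, commissioned, decommissioned) per month.

    initial_ages: mapping of age in months -> number of servers (hypothetical data).
    """
    fleet = Counter(initial_ages)
    history = []
    for _ in range(months):
        # Every server gets one month older; empty cohorts are dropped.
        fleet = Counter({age + 1: n for age, n in fleet.items() if n > 0})

        # 1:1 replacement: one new server per server entering its 4th year.
        commissioned = fleet.get(replace_at, 0)
        if commissioned:
            fleet[0] += commissioned

        # Constant-rate decommissioning of servers older than retire_after.
        budget = retire_rate
        decommissioned = 0
        for age in sorted(fleet, reverse=True):
            if age <= retire_after or budget == 0:
                break
            take = min(fleet[age], budget)
            fleet[age] -= take
            budget -= take
            decommissioned += take

        history.append((sum(fleet.values()), commissioned, decommissioned))
    return history


# Hypothetical starting fleet: 100 servers about to enter their 4th year
# and 300 servers about to pass the 5-year mark.
for month, row in enumerate(simulate({35: 100, 60: 300}, months=2), start=1):
    print(month, row)
```

In this model, servers that are already older than 36 months at the start never trigger a replacement, so a real starting inventory would need its pending replacements added separately, as the pipeline annotation in the figure suggests.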
R15: Reliable production vs best-effort
• The move from reliable production to best-effort production is expected one year after the expiry of the original vendor warranty
• Best-effort production consists of clusters of hosts that are available for re-purposing or decommissioning
  • It may include new hardware that has not yet been commissioned in reliable production
• Examples of best-effort production clusters are OpenStack cells with short turn-over VMs such as batch worker nodes
• FEs using bare hardware should review their clusters for potential candidates
Summary
• Working group completed its mission
  • Good and focussed discussions (thanks to all involved)
• Conclusions presented (and approved) at the ISM in December
  • Only a subset presented here
  • More details in report attached to the agenda
• Implementation
  • Monitor progress?
@ ITTF tomorrow