Page 1

Working group for optimized Computing Capacity Lifecycle Planning

• Created after ISM meeting 16th of June
• Members: Tim B, Eric G, Helge, Massimo, Carles, Benoit, Bernd, Olof
  • Also: Eric S, Artur, Arne
• Mandate: To look at the current process and recent difficulties, including having multiple budget codes. Based on this, make a proposal for a revised process which should eliminate the recent issues as well as make the process as efficient as possible. This process should be, as far as possible, consistent for all hardware and services in the IT Computing Facilities.
• Activity: 6 meetings in total, 2 to define the problems and 4 to find solutions and agree on recommendations.
• Output: report with recommendations

Page 2

WG Topics

[Diagram of working group topics:]
• Requirements
• Technical
• Capacity
• Schedule
• Technology survey
• Procurement
• Budgeting
• Accounting
• Funding & Chargeback(?)
• Life-cycle
• Commissioning
• Allocation & Repurposing
• Decommissioning

Page 3

ID   Who    Recommendation
R01  CF     Procurement team gives a yearly public presentation covering technology trends that are relevant for CERN IT and the implications for the services.
R02  CF     Procurement team starts every tender cycle by organizing a requirements meeting covering both technical and capacity requirements.
R03  CF     Operation team maintains a global table for tracking the deliveries in the Computer Facilities Capacity Coordination Meeting (CFCCM).
R04  CF     Procurement team opens a SNOW ticket to the intended customer service FE for hand-over of allocated systems for commissioning.
R05  CS     Investigate technical options for separating the logical network from the infrastructure.
R06  All    Re-evaluate every use case for private networks (e.g. Oracle databases, Drupal) in view of the cost.
R07  CS     Enhance the LANDB interface to better support bulk renumbering of IP services.
R08  CS     Add information about blocking factor and number of fibres at switch level in LANDB.
R09  CS     Replace cross-charging for network switches with an explicit budget transfer.
R10  CF     Define a process for review and approval of requests for using the Barn.
R11  CF     Establish a host-based replacement process as outlined in section 5.
R12  Bernd  Propose and agree with IT management on a standing justification for adding an option for 20% additional volume to future FC papers.
R13  CF     Move to a standard scheme with two procurement cycles / year targeting the June and December FC meetings.
R14  OIS    Test and certify Windows installation on standard bulk hardware.
R15  All    Review clusters for potential candidates for best-effort production hardware usage as defined in the host-by-host replacement proposal.
R16  DHO    Determine the necessary staffing to implement and operate the processes once recommendations R01 to R15 are all agreed.

Page 5

R03: CFCCM tracking table for installations, acceptance and commissioning

https://espace2013.cern.ch/CFCCM/Shared%20Documents/Installations%20List/CFCCM_Installations_list.xlsx?Web=1 (Access restricted to e-group: it-service-cfccm)

Page 6

R13: Typical procurement cycle with FC

[Figure: Gantt-style chart with a month axis from N to N+12. Stages shown: Spec work, Dispatch tender, Waiting bids, Bid evaluation, Waiting FC approval, Waiting delivery, Acceptance, Dispatching orders. Legend for the procurement team: Dedicated, Assisting, Consulting.]

Page 7

R13: synchronized to FC meetings

[Figure: monthly timeline across Year N and Year N+1 with rows for FC meetings and Invoicing. Procurement cycles are aligned to the March, June, September and December FC meetings.]

Page 8

R13: 2x cycles June & December FC

[Figure: monthly timeline across Year N and Year N+1 with rows for FC meetings and Invoicing. Two procurement cycles per year, aligned to the June and December FC meetings.]

Page 9

R11: 1-to-1 host replacement

• Once a year: tender for 1-to-1 replacement of hosts (and storage) with warranty expiring in the next 12 months
• One year later: replacement capacity ready for commissioning
• Inform the owner Functional Element (Cloud Infrastructure, EOS, …)
  • List of reliable production hosts to be replaced
  • List of replacement hosts
• Allow one year for migration
• Old host re-purposed for “best effort” production
• Host replacement accounts for some capacity growth. Additional growth is added on top or tendered separately
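The yearly selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Host` record, its field names and the 365-day horizon are hypothetical and do not come from the working group report. Hosts whose warranty has already expired before the tender date are left out here, which is one possible reading of "expiring in the next 12 months".

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Host:
    # All fields are hypothetical, chosen for illustration only.
    name: str            # host name
    owner_fe: str        # owning Functional Element, e.g. "EOS"
    warranty_end: date   # expiry date of the original vendor warranty


def replacement_candidates(hosts, tender_date, horizon_days=365):
    """Group hosts whose warranty expires within the horizon by owner FE.

    Returns a dict mapping each Functional Element to the list of host
    names it should be told will be replaced 1-to-1.
    """
    cutoff = tender_date + timedelta(days=horizon_days)
    by_fe = {}
    for host in hosts:
        # Keep only warranties that expire inside [tender_date, cutoff).
        if tender_date <= host.warranty_end < cutoff:
            by_fe.setdefault(host.owner_fe, []).append(host.name)
    return by_fe
```

The per-FE grouping mirrors the "inform the owner Functional Element" step: each FE receives the list of its reliable production hosts that are due for replacement.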

Page 10

R11: typical 1-to-1 cycle

[Figure: monthly timeline covering 2015, 2016 and 2017.
Phase 1: procurement. Tender (December FC) for replacement capacity = 1294 systems with age between 2 and 3 years.
Phase 2: commission & repurpose. Milestones: Replacement capacity available, Notify services to migrate, All old capacity repurposed.]

Page 11

R11: Host lifecycle

[Figure: host age axis: new, 1 year, 2 years, 3 years, 4 years, 5 years, Older.
Phases: Commissioning, Reliable production, Best effort production, Re-purpose.
Events: Tender for 1:1 replacement, Original warranty expires, Replacement available.]
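As a rough illustration of the lifecycle above, a host's phase can be derived from its age. The thresholds in this sketch are assumptions pieced together from other slides (move to best-effort one year after warranty expiry, decommissioning at age >5 years). The 3-year warranty length in particular is assumed and is not stated in the deck. The function name and the phase labels are hypothetical.

```python
def lifecycle_phase(age_years, warranty_years=3.0, migration_years=1.0,
                    decommission_after=5.0):
    """Return an illustrative lifecycle phase for a host of the given age.

    warranty_years is an assumed value; migration_years reflects the
    "allow one year for migration" rule; decommission_after reflects
    the "age > 5 years" decommissioning criterion.
    """
    if age_years < 0:
        raise ValueError("age_years must not be negative")
    if age_years > decommission_after:
        return "decommissioning candidate"
    if age_years >= warranty_years + migration_years:
        return "best-effort production"
    return "reliable production"
```

With these defaults a host stays in reliable production through its fourth year, which leaves the owning service one year after warranty expiry to migrate to the replacement host.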

Page 12

R11: 1-to-1 host replacement

• Once a year: tender for 1-to-1 replacement of hosts (and storage) with warranty expiring in the next 12 months
• One year later: replacement capacity ready for commissioning
• Inform the owner Functional Element (Cloud Infrastructure, EOS, …)
  • List of reliable production hosts to be replaced
  • List of replacement hosts
• Allow one year for migration
• Old host re-purposed for “best effort” production
• Host replacement implies significant capacity growth
• Additional growth is added on top or tendered for separately

Page 13

R11: Decommissioning

• When to decide obsolescence?
  • Difficult to define general criteria for
    • Inefficiency (power consumption per capacity, physical space)
    • Failure rates
    • Parts availability
    • Maintenance efforts (firmware support and security)
  • ~>5 years “feels” about right
• Decommissioning process?
  • Big-bang retirement campaigns?
  • Establish a background activity, e.g. trickle 100-200 servers / month?

Page 14

R11: Simulation of 1:1 replacement in 4th year and constant rate decommissioning of age >5 years

[Figure: chart of server counts (0 to 14000) from Aug-13 to Jun-20. Series: Total servers, Decommissioned, Commissioned.]

• Decommissioning at age >5 years
• 200 servers/month
• Adding ~2300 servers in pipeline
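A minimal sketch of how such a simulation could be set up follows. It is not the model behind the chart: the monthly time step, the cohort representation and the function name are assumptions, and the ~2300 servers in the pipeline are not modelled. The rules it implements are the ones named on the slide: one new host for every host entering its 4th year, and decommissioning of hosts older than 5 years at a capped monthly rate.

```python
def simulate(initial_fleet, months, replace_at=36, retire_after=60,
             retire_rate=200):
    """Simulate fleet size month by month.

    initial_fleet maps host age in months to a host count (hypothetical
    input format). Returns one snapshot dict per simulated month.
    """
    fleet = dict(initial_fleet)
    commissioned = 0
    decommissioned = 0
    history = []
    for month in range(1, months + 1):
        # Every cohort gets one month older.
        fleet = {age + 1: count for age, count in fleet.items()}
        # 1:1 replacement: each host entering its 4th year adds one new host.
        new_hosts = fleet.get(replace_at, 0)
        if new_hosts:
            fleet[0] = fleet.get(0, 0) + new_hosts
            commissioned += new_hosts
        # Constant-rate decommissioning: oldest first, capped per month.
        budget = retire_rate
        for age in sorted(fleet, reverse=True):
            if age <= retire_after or budget == 0:
                break
            removed = min(fleet[age], budget)
            fleet[age] -= removed
            budget -= removed
            decommissioned += removed
        # Drop cohorts that have been fully retired.
        fleet = {age: count for age, count in fleet.items() if count > 0}
        history.append({
            "month": month,
            "total": sum(fleet.values()),
            "commissioned": commissioned,
            "decommissioned": decommissioned,
        })
    return history
```

Calling `simulate({35: 100, 70: 500}, months=3)` with this made-up starting fleet gives three monthly snapshots whose `total`, `commissioned` and `decommissioned` values correspond to the three series in the chart.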

Page 15

R15: Reliable production vs best-effort

• The move from reliable production to best-effort production is expected one year after the expiry of the original vendor warranty
• Best-effort production consists of clusters of hosts that are available for re-purposing or decommissioning.
  • It may include new hardware that has not yet been commissioned in reliable production
• Examples of best-effort production clusters can be OpenStack cells with short turn-over VMs such as batch worker nodes.
• FEs using bare hardware should review their clusters for potential candidates.

Page 16

Summary

• Working group completed its mission
• Good and focussed discussions (thanks to all involved)
• Conclusions presented (and approved) at the ISM in December
  • Only a subset presented here
  • More details in report attached to the agenda
• Implementation
  • Monitor progress?

Page 17

Recommendations R01 to R16 (same table as on page 3)

@ ITTF tomorrow