LA-UR-15-28541 DRAFT APEX 2020 Technical Requirements Document

Dated 05-23-16

APEX 2020 Technical Requirements, Version 4.1

APEX 2020 Technical Requirements Document for Crossroads and NERSC-9 Systems

LA-UR-15-28541
SAND2016-4325 O

Lawrence Berkeley National Laboratories is operated by the University of California for the U.S. Department of Energy under contract NO. DE-AC02-05CH11231.

Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by Los Alamos National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy under contract DE-AC52-06NA25396. LA-UR-15-28541 Approved for public release; distribution is unlimited.

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND2016-4325 O.

APEX 2020: Technical Requirements

1 INTRODUCTION

1.1 CROSSROADS

1.2 NERSC-9

1.3 SCHEDULE

2 MANDATORY REQUIREMENTS

3 TARGET DESIGN REQUIREMENTS

3.1 SCALABILITY

3.2 SYSTEM SOFTWARE AND RUNTIME

3.3 SOFTWARE TOOLS AND PROGRAMMING ENVIRONMENT

3.4 PLATFORM STORAGE

3.5 APPLICATION PERFORMANCE

3.6 RESILIENCE, RELIABILITY, AND AVAILABILITY

3.7 APPLICATION TRANSITION SUPPORT AND EARLY ACCESS TO APEX TECHNOLOGIES

3.8 TARGET SYSTEM CONFIGURATION

3.9 SYSTEM OPERATIONS

3.10 POWER AND ENERGY

3.11 FACILITIES AND SITE INTEGRATION

4 NON-RECURRING ENGINEERING

5 OPTIONS

5.1 UPGRADES, EXPANSIONS AND ADDITIONS

5.2 EARLY ACCESS DEVELOPMENT SYSTEM

5.3 TEST SYSTEMS

5.4 ONSITE SYSTEM AND APPLICATION SOFTWARE ANALYSTS

5.5 DEINSTALLATION

5.6 MAINTENANCE AND SUPPORT

6 DELIVERY AND ACCEPTANCE

6.1 PRE-DELIVERY TESTING

6.2 SITE INTEGRATION AND POST-DELIVERY TESTING

6.3 ACCEPTANCE TESTING

7 RISK AND PROJECT MANAGEMENT

8 DOCUMENTATION AND TRAINING

8.1 DOCUMENTATION

8.2 TRAINING

9 REFERENCES

APPENDIX A: SAMPLE ACCEPTANCE PLANS

APPENDIX B: LANS/UC SPECIFIC PROJECT MANAGEMENT REQUIREMENTS

DEFINITIONS AND GLOSSARY

1 Introduction

Los Alamos National Security, LLC (LANS), in furtherance of its participation in the Alliance for Computing at Extreme Scale (ACES), a collaboration between Los Alamos National Laboratory and Sandia National Laboratories; and in coordination with the Regents of the University of California (UC), which operates the National Energy Research Scientific Computing (NERSC) Center residing within the Lawrence Berkeley National Laboratory (LBNL), is releasing a joint Request for Proposal (RFP) for two next generation systems, Crossroads and NERSC-9, under the Alliance for application Performance at EXtreme scale (APEX), to be delivered in the 2020 timeframe.

The successful Offeror will be responsible for delivering and installing the Crossroads and NERSC-9 systems at their respective locations. While it is our preference to award both the Crossroads and NERSC-9 subcontracts to a single Offeror, awards may be made to separate Offerors. Awards will be made by LANS on behalf of ACES and by UC on behalf of NERSC. In total there will be four subcontracts, one Non-Recurring Engineering subcontract for each of ACES and NERSC (see Section 4 herein) and the “build” (system) subcontracts (one issued by LANS for Crossroads and one issued by UC for NERSC-9).

The technical requirements in this document describe joint requirements wherever possible. The Offeror shall respond with a single proposal that contains distinct sections showing how and where their proposed Crossroads and NERSC-9 systems differ. Alternative solutions for hardware, software, and/or architecture may also be included in the Offeror’s proposal.

An Offeror’s Technical Proposal shall include narrative and graphics as appropriate, describing its proposed solutions to technical aspects of the project as seen in numbered sections of this Technical Requirements Document. An Offeror shall incorporate its proposed solutions directly into each section of the Technical Requirements Document to the greatest practical extent. The Technical Requirements Document is provided in MS Word format to facilitate this proposal requirement.

The evaluation committee will make no presumption of technical capability when evaluating Offeror responses. Offerors must address each section in a materially responsive manner. The response shall clearly describe the role of any subcontractor(s) and the technology or technologies, both hardware and software, and value added that the subcontractor(s) provide, where appropriate.

If an Offeror chooses to submit alternative solutions to the APEX RFP, complete, separate and distinct proposal packages (to include all applicable/required proposal documents) must be submitted for each alternative. Failure to comply with these proposal submission instructions may cause an Offeror's proposal(s) to be downgraded.

The scope of work and technical specifications for any subcontracts resulting from this RFP will be negotiated based on the requirements and the options in this document and the successful Offeror’s responses.

Crossroads and NERSC-9 each have maximum funding limits over their system lives, to include all design and development, site preparation, maintenance, support and analysts. Total ownership costs will be considered in system selection. The Offeror must respond with a configuration and pricing for both systems.

Application performance and workflow efficiency are essential to these procurements. Success will be defined as meeting APEX 2020 mission needs while at the same time serving as a pre-exascale system that enables our applications to begin to evolve using yet to be defined next generation programming models. The advanced technology aspects of the APEX systems will be pursued both by fielding first of a kind technologies on the path to exascale as part of system build and by selecting and participating in strategic NRE projects with the Offeror and applicable technology providers. A compelling set of NRE projects will be crucial for the success of these platforms, by enabling the deployment of first of a kind technologies in such a way as to maximize their utility. The NRE areas of collaboration should provide substantial value to the Crossroads and NERSC-9 systems with the goals of:

§ Increasing application performance.

§ Increasing workflow efficiency.

§ Increasing the resilience and reliability of the system.

The details of the NRE are more completely described in section 4.

To support the goals of application performance and workflow efficiency an accompanying whitepaper, “APEX Workflows,” is provided that describes how application teams use HPC resources today to advance scientific goals. The whitepaper is designed to provide a framework for reasoning about the optimal solution to these challenges. (The Crossroads/NERSC-9 workflows document can be found on the APEX website.)

1.1 Crossroads

The Department of Energy (DOE) National Nuclear Security Administration (NNSA) Advanced Simulation and Computing (ASC) Program requires a computing system be deployed in 2020 to support the Stockpile Stewardship Program. In the 2020 timeframe, Trinity, the first ASC Advanced Technology System (ATS-1), will be nearing the end of its useful lifetime. Crossroads, the proposed ATS-3 system, provides a replacement, tri-lab computing resource for existing simulation codes and provides a larger resource for ever-increasing computing requirements to support the weapons program. The Crossroads system, to be sited at Los Alamos, NM, is projected to provide a large portion of the ATS resources for the NNSA ASC tri-lab simulation community: Los Alamos National Laboratory (LANL), Sandia National Laboratories (SNL), and Lawrence Livermore National Laboratory (LLNL), during the 2021-2025 timeframe.

In order to fulfill its mission, the NNSA Stockpile Stewardship Program requires higher performance computational resources than are currently available within the Nuclear Security Enterprise (NSE). These capabilities are required for supporting stockpile stewardship certification and assessments to ensure that the nation’s nuclear stockpile is safe, reliable, and secure.

The ASC Program is faced with significant challenges by the ongoing technology revolution. It must continue to meet the mission needs of the current applications but also adapt to radical change in technology in order to continue running the most demanding applications in the future. The ASC Program recognizes that the simulation environment of the future will be transformed with new computing architectures and new programming models that will take advantage of the new architectures. Within this context, ASC recognizes that ASC applications must begin the transition to the new simulation environment or they may become obsolete as a result of not leveraging technology driven by market trends. With this challenge of technology change, it is a major programmatic driver to provide an architecture that keeps ASC moving forward and allows applications to fully explore and exploit upcoming technologies, in addition to meeting NNSA Defense Programs’ mission needs.

It is possible that major modifications to the ASC simulation tools will be required in order to take full advantage of the new technology. However, codes running on NNSA Advanced Technology Systems (Trinity and Sierra) in the 2019 timeframe are expected to run on Crossroads. In some cases new applications also may need to be developed. Crossroads is expected to help technology development for the ASC Program to meet the requirements of future systems with greater computational performance or capability. Crossroads will serve as a technology path for future ASC systems in the next decade.

To directly support the ASC Roadmap, which states that “work in this timeframe will establish a strong technological foundation to build toward exascale computing environments, which predictive capability may demand,” it is critical for the ASC Program to both explore the rapidly changing technology of future systems and to provide systems with higher performance and more memory capacity for predictive capability. Therefore, a design goal of Crossroads is to achieve a balance between usability of current NNSA ASC simulation codes and adaptation to new computing technologies.

1.2 NERSC-9

The DOE Office of Science (SC) requires a high performance production computing system in the 2020 timeframe to provide a significant upgrade to the current computational and data capabilities that support the basic and applied research programs that help accomplish the mission of DOE SC.

The system also needs to provide a firm foundation for future exascale systems in 2023 and beyond, a need identified in the DOE’s Strategic Plan 2014-2018, which calls for “advanced scientific computing to analyze, model, simulate and predict complex phenomena, including the scientific potential that exascale simulation and data will provide in the future.”

The NERSC Center supports nearly 6000 users and about 600 different application codes from a broad range of science disciplines covering all six program offices in SC. The scientific goals are well summarized in the 2012-2014 series of requirements reviews commissioned by the Advanced Scientific Computing Research (ASCR) office that brought together application scientists, computer scientists, applied mathematicians, DOE program managers and NERSC personnel. The 2012-2014 requirements reviews indicated that compute-intensive research and research that attempts scientific discovery through the analysis of experimental and observational data both have a clear need for major increases in computational capability and capacity in the 2017 timeframe and beyond. In addition, several science areas also have a burgeoning need for HPC resources that satisfy an increased compute workload and provide strong support for data-centric workflows and real-time observational science. More details about the DOE SC application requirements are in the reviews located at: http://www.nersc.gov/science/hpc-requirements-reviews/.

NERSC has already begun transitioning the SC user base to energy efficient architectures, with the procurement of the NERSC-8 “Cori” system. In the 2020 timeframe, NERSC also expects a need to address early exascale hardware and software technologies, including the areas of processor technology, memory hierarchies, networking technology, and programming models.

The NERSC-9 system is expected to run for 4-6 years and will be housed in Wang Hall (Building 59) at LBNL, which currently houses the “Cori” system and other resources that NERSC supports. The system must integrate into the NERSC environment and provide high bandwidth access to existing data stored by continuing research projects. For more information about NERSC and the current systems, environment, and support provided for our users, see http://www.nersc.gov.

1.3 Schedule

The following is the tentative schedule for the Crossroads and NERSC-9 systems.

Table 1 Crossroads/NERSC-9 Schedule

Crossroads and NERSC-9 RFP Released      Q3 CY16
Subcontracts (NRE/Build) Awarded         Q4 CY16
On-site System Delivery Begins           Q2 CY20
On-site System Delivery Complete         Q3 CY20
Acceptance Complete                      Q1 CY21

2 Mandatory Requirements

An Offeror shall address all Mandatory Requirements in a materially responsive manner and its proposal shall demonstrate how it meets or exceeds each one. A proposal will be deemed non-responsive/unacceptable, will be rejected, and will not be considered further if each and every one of the following Mandatory Requirements is not met.

2.1.1 The Offeror shall provide a detailed full system architectural description of both the Crossroads and NERSC-9 systems, including diagrams and text describing the following details as they pertain to the Offeror’s system architecture(s):

§ Component architecture – details of all processor(s), memory technologies, storage technologies, network interconnect(s) and any other applicable components.

§ Node architecture(s) – details of how components are combined into the node architecture(s). Details shall include bandwidth and latency specifications (or projections) between components.

§ Board and/or blade architecture(s) – details of how the node architecture(s) is integrated at the board and/or blade level. Details should include all inter-node and inter-board/blade communication paths and any additional board/blade level components.

§ Rack and/or cabinet architecture(s) – details of how boards and/or blades are organized and integrated into racks and/or cabinets. Details should include all inter-rack/cabinet communication paths and any additional rack/cabinet level components.

§ Platform storage – details of how storage is integrated with the system, including a platform storage architectural diagram.

§ System architecture – details of how racks or cabinets are combined to produce the system architecture, including the high-speed interconnects and network topologies (if multiple) and platform storage.

§ Proposed floor plan – including details of the physical footprint of the system and all of the supporting components.

2.1.2 The Offeror shall provide a detailed description of the proposed software eco-system, including a high-level software architectural diagram and, for each software component, its provenance (for example, open source or proprietary) and its support mechanism (for the lifetime of the system, including updates).

2.1.3 The Offeror shall describe how the system does or does not fit into the Offeror’s long-term product roadmap and a potential follow-on system acquisition in the 2025 and beyond timeframe.

3 Target Design Requirements

This section contains detailed system design targets and performance features. It is desirable that the Offeror’s design meets or exceeds all the features and performance metrics outlined in this section. If a Target Design Requirement cannot be met, it is desirable that the Offeror provide a development and deployment plan, including a schedule, to satisfy the requirement. The evaluation committee will make no presumption of technical capability when evaluating Offeror responses to Target Design Requirements. Offerors that do not address the Target Design Requirements in a materially responsive manner will be downgraded. The Offeror may also propose any hardware and/or software architectural features that will provide improvements for any aspect of the system.

3.1 Scalability

The scale of the system necessary to meet the needs of the application requirements of the APEX laboratories adds significant challenges. The Offeror shall propose a system that enables application performance up to the full scale of the system. Additionally, the system proposed should provide functionality that assists users in obtaining performance at up to full scale. Scalability features, both hardware and software, that benefit both current and future programming models are essential.

3.1.1 The system shall support running jobs up to and including the full scale of the system.

3.1.2 The system shall support launching an application at full system scale in less than 30 seconds. The Offeror shall describe factors (such as executable size) that could potentially affect application launch time.

3.1.3 The Offeror shall describe how application launch scales with the number of concurrent launch requests (per second) and the scale of each launch request (resources requested, such as the number of schedulable units, etc.), including information such as:

§ All system-level and node-level overhead in the process startup including how overhead scales with node count for parallel applications, or how overhead scales with the application count for large numbers of serial applications.

§ Any limitations for processes on compute nodes from interfacing with an external work-flow manager, external database or message queue system.

3.1.4 The system shall support thousands of concurrent users and more than 20,000 concurrent batch jobs. The system shall allow a mix of application or user identity wherein at least a subset of nodes can run multiple independent applications from multiple users. The Offeror shall describe details, including limitations of their proposed support for this requirement.

3.1.5 The Offeror shall describe all areas of the system in which node-level resource usage (hardware and software) increases as a job scales up (node, core or thread count).

3.1.6 The system shall utilize an optimized job placement algorithm to reduce job runtime, lower variability, minimize latency, etc. The Offeror shall describe in detail how the algorithm is optimized to the system architecture.

3.1.7 The system shall include an application programming interface to allow applications access to the physical-to-logical mapping information of the job’s node allocation – including a mapping between MPI ranks and network topology coordinates, and core, node and rack identifiers.
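To make the kind of information named in 3.1.7 concrete, the following is a minimal sketch of how a job's physical-to-logical mapping might look to an application once it has been retrieved. Everything in it is hypothetical: the `RankLocation` record, its field names, the `ranks_by_rack` helper, and the sample allocation are illustrative assumptions, not part of MPI or of any vendor interface.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RankLocation:
    """Hypothetical record describing where one MPI rank is placed."""
    rank: int                 # MPI rank within the job
    core: int                 # core identifier on the node
    node: str                 # node identifier
    rack: str                 # rack (or cabinet) identifier
    coords: Tuple[int, ...]   # network topology coordinates


def ranks_by_rack(locations: List[RankLocation]) -> Dict[str, List[int]]:
    """Group MPI ranks by the rack that hosts them."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for loc in locations:
        groups[loc.rack].append(loc.rank)
    return {rack: sorted(ranks) for rack, ranks in groups.items()}


# Made-up allocation: four ranks on two nodes in two racks.
SAMPLE = [
    RankLocation(0, 0, "n0001", "r01", (0, 0, 0)),
    RankLocation(1, 1, "n0001", "r01", (0, 0, 0)),
    RankLocation(2, 0, "n0042", "r02", (1, 3, 0)),
    RankLocation(3, 1, "n0042", "r02", (1, 3, 0)),
]

if __name__ == "__main__":
    print(ranks_by_rack(SAMPLE))
```

An application could use a grouping like this to keep tightly coupled ranks within one rack, assuming the real interface exposes comparable identifiers.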

3.1.8 The system software solution shall provide a low jitter environment for applications and shall provide an estimate of a compute node operating system’s noise profile, both while idle and while running a non-trivial MPI application. If core specialization is used, the Offeror shall describe the system software activity that remains on the application cores.

3.1.9 The system shall provide correct numerical results and consistent runtimes (i.e. wall clock time) that do not vary more than 3% from run to run in dedicated mode and 5% in production mode. The Offeror shall describe strategies for minimizing runtime variability.
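One way to check a set of measured runs against the 3% and 5% limits in 3.1.9 is sketched below. The requirement does not define how variation is computed, so the metric used here (largest relative deviation of any run from the mean wall clock time) is an assumption, as are the function names.

```python
from statistics import mean
from typing import Sequence

DEDICATED_LIMIT = 0.03   # 3% limit from 3.1.9, dedicated mode
PRODUCTION_LIMIT = 0.05  # 5% limit from 3.1.9, production mode


def runtime_variation(wall_times: Sequence[float]) -> float:
    """Largest relative deviation of any run from the mean wall clock time.

    This metric is an assumption; the requirement does not prescribe one.
    """
    if len(wall_times) < 2:
        raise ValueError("need at least two runs to measure variation")
    avg = mean(wall_times)
    return max(abs(t - avg) for t in wall_times) / avg


def within_limit(wall_times: Sequence[float], limit: float) -> bool:
    """True when the measured variation does not exceed the given limit."""
    return runtime_variation(wall_times) <= limit


if __name__ == "__main__":
    runs = [100.0, 102.0, 98.0, 100.0]  # made-up wall clock times in seconds
    print(runtime_variation(runs), within_limit(runs, DEDICATED_LIMIT))
```

A different definition, such as the spread between the fastest and slowest run, would give a stricter test for the same data.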

3.1.10 The system’s high speed interconnect shall support a high messaging bandwidth, high injection rate, low latency, high throughput, and independent progress. The Offeror shall describe:

§ The system interconnect in detail, including any mechanisms for adapting to heavy loads or inoperable links, as well as a description of how different types of failures will be addressed.

§ How the interface will allow all cores in the system to simultaneously communicate synchronously or asynchronously with the high speed interconnect.

§ How the interconnect will enable low-latency communication for one- and two-sided paradigms.

3.1.11 The Offeror shall describe how both hardware and software components of the interconnect support effective computation and communication overlap for both point-to-point operations and collective operations (i.e., the ability of the interconnect subsystem to progress outstanding communication requests in the background of the main computation thread).

3.1.12 The Offeror shall report or project the system’s node injection/ejection bandwidth.

3.1.13 The Offeror shall report or project the system’s bit error rate of the interconnect in terms of time period between errors that interrupt a job running at the full scale of the system.
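The conversion that 3.1.13 asks for, from a per-bit error rate to a time between job-interrupting errors, can be sketched as follows. The model is a deliberate simplification and rests on stated assumptions: errors are independent, every link carries traffic at its full rate, and the rate passed in counts only errors that escape correction and interrupt the job. The function and parameter names are illustrative.

```python
def seconds_between_errors(
    interrupting_error_rate_per_bit: float,
    link_bandwidth_bits_per_s: float,
    active_links: int,
) -> float:
    """Mean time between job-interrupting errors for a full-scale job.

    Assumes independent errors and fully loaded links, so the expected
    error frequency is (errors per bit) x (bits moved per second).
    """
    if interrupting_error_rate_per_bit <= 0:
        raise ValueError("error rate must be positive")
    if link_bandwidth_bits_per_s <= 0 or active_links <= 0:
        raise ValueError("bandwidth and link count must be positive")
    bits_per_second = link_bandwidth_bits_per_s * active_links
    return 1.0 / (interrupting_error_rate_per_bit * bits_per_second)


if __name__ == "__main__":
    # Made-up inputs: 1e-15 errors per bit, 100 Gb/s links, 10,000 links.
    print(seconds_between_errors(1e-15, 100e9, 10_000))
```

With these made-up inputs the estimate is roughly one second, which illustrates why the rate of errors that actually reach the application, rather than the raw link error rate, is the quantity of interest.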

3.1.14 The Offeror shall describe how the interconnect of the system will provide Quality of Service (QoS) capabilities (e.g., in the form of virtual channels or other sub-system QoS capabilities), including but not limited to:

§ An explanation of how these capabilities can be used to prevent core communication traffic from interfering with other classes of communication, such as debugging and performance tools or with I/O traffic.

§ An explanation of how these capabilities allow efficient adaptive routing as well as a capability to prevent traffic from different applications interfering with each other (either through QoS capabilities or appropriate job partitioning).

§ An explanation of any sub-system QoS capabilities (e.g. platform storage QoS features).

3.1.15 The Offeror shall describe specialized hardware or software features of the system that accelerate workflows or components of workflows such as data analysis or visualization, and describe any limits to their scalability on the system. The hardware shall be on the same high speed network as the main compute resources and shall have equal access to other compute resources (e.g. file systems and platform storage). It is desirable that the hardware have the same node level architecture as the main compute resources, but it could, for example, have more memory per node.

3.2 System Software and Runtime

The system shall include a well-integrated and supported system software environment. The overall imperative is to provide users with a productive, high-performing, reliable, and scalable system software environment that enables efficient use of the full capability of the system.

3.2.1 The system shall include a full-featured Linux operating system environment on all user visible service partitions (e.g., front-end nodes, service nodes, I/O nodes). The Offeror shall describe the proposed full-featured Linux operating system environment.

3.2.2 The system shall include an optimized compute partition operating system that provides an efficient execution environment for applications running up to full-system scale. The Offeror shall describe any HPC relevant optimizations made to the compute partition operating system.

3.2.3 The Offeror shall describe the security capabilities of the operating systems proposed in technical requirements 3.2.1 and 3.2.2.

3.2.4 The system shall include efficient support for dynamic shared libraries, both at job load time and during runtime. The Offeror shall describe how applications using shared libraries will execute at full system scale with minimal performance overhead compared to statically linked applications.

3.2.5 The system shall include resource management functionality, including job migration, backfill, targeting of specified resources (e.g., platform storage), advance and persistent reservations, job preemption, job accounting, architecture-aware job placement, power management, job dependencies (e.g., workload management), and resilience management. The Offeror may propose multiple solutions for a vendor-supported resource manager and should describe the benefits of each.

3.2.6 The system shall support jobs consisting of multiple individual applications running simultaneously (inter-node or intra-node) and cooperating as part of an overall multi-component application (e.g., a job that couples a simulation application to an analysis application). The Offeror shall describe in detail how this will be supported by the system software infrastructure (e.g., user interfaces, security model, and inter-application communication).

3.2.7 The system shall include a mechanism that will allow users to provide containerized software images without requiring privileged access to the system or allowing a user to escalate privilege. The startup time for launching a parallel application in a containerized software image at full system scale shall not greatly exceed the startup time for launching a parallel application in the vendor-provided image.

3.2.8 The system shall include a mechanism for dynamically configuring external IPv4/IPv6 connectivity to and from compute nodes, enabling special connectivity paths for subsets of nodes on a per-batch-job basis, and allowing fully routable interactions with external services.

3.2.9 The Offeror shall provide access to source code, and necessary build environment, for all software except for firmware, compilers, and third party products. The Offeror shall provide updates of source code, and any necessary build environment, for all software over the life of the subcontract.

3.3 Software Tools and Programming Environment

The primary programming models used in production applications in this timeframe are the Message Passing Interface (MPI), for inter-node communication, and OpenMP, for fine-grained on-node parallelism. While MPI+OpenMP will be the majority of the workload, the APEX laboratories expect some new applications to exercise emerging asynchronous programming models. System support that would accelerate these programming models/runtimes and benefit MPI+OpenMP is desirable.

3.3.1 The system shall include an implementation of the MPI version 3.1 (or most current) standard specification. The Offeror shall provide a detailed description of the MPI implementation (including specification version) and support for features such as accelerated collectives, and shall describe any limitations relative to the MPI standard.

3.3.2 The Offeror shall describe at what parallel granularity the system can be utilized by MPI-only applications.

3.3.3 The system shall include optimized implementations of collective operations utilizing both inter-node and intra-node features where appropriate, including MPI_Barrier, MPI_Allreduce, MPI_Reduce, MPI_Allgather, and MPI_Gather.

3.3.4 The Offeror shall describe the network transport layer of the system including support for OpenUCX, Portals, libfabric, libverbs, and any other transport layer including any optimizations of their implementation that will benefit application performance or workflow efficiency.

3.3.5 The system shall include a complete implementation of the OpenMP version 4.1 (or most current) standard including, if applicable, accelerator directives, as well as a supporting programming environment. The Offeror shall provide a detailed feature description of the OpenMP implementation(s) and describe any expected deviations from the OpenMP standard.

3.3.6 The Offeror shall provide a description of how OpenMP 3.1 applications will be compiled and executed on the system.

3.3.7 The Offeror shall provide a description of any proposed hardware or software features that enable OpenMP performance optimizations.

3.3.8 The Offeror shall list any PGAS languages and/or libraries that are supported (e.g. UPC, SHMEM, CAF, Global Arrays) and describe any hardware and/or programming environment software that optimizes any of the listed PGAS languages supported on the system. The system shall include a mechanism to compile, run, and debug UPC applications. The Offeror shall describe interoperability with MPI+OpenMP.

3.3.9 The Offeror shall describe and list support for any emerging programming models such as asynchronous task/data models (e.g., Legion, STAPL, HPX, or OCR) and describe any system hardware and/or programming environment software it will provide that optimizes any of the supported models. The Offeror shall describe interoperability with MPI+OpenMP.

3.3.10 The Offeror shall describe the proposed hardware and software environment support for:

§ Fast thread synchronization of subsets of execution threads.

§ Atomic add, fetch-and-add, multiply, bitwise operations, and compare-and-swap operations over integer, single-precision, and double-precision operands.

§ Atomic compare-and-swap operations over 16-byte wide operands that comprise two double precision values or two memory pointer operands.

§ Fast context switching or task-switching.

§ Fast task spawning for unique and identical tasks with data dependencies.

§ Support for active messages.

3.3.11 The Offeror shall describe in detail all programming APIs, languages, compilers and compiler extensions, etc. other than MPI and OpenMP (e.g. OpenACC, CUDA, OpenCL, etc.) that will be supported by the system. It is desirable that instances of all programming models provided be interoperable and efficient when used within a single process or single job running on the same compute node.

3.3.12 The system shall include support for the C, C++ (including complete C++11/14/17), Fortran 77, Fortran 90, and Fortran 2008 programming languages. Providing multiple compilation environments is highly desirable. The Offeror shall describe any limitations that can be expected in meeting full C++17 support based on current expectations.

3.3.13 The system shall include a Python implementation that will run on the compute partition with optimized MPI4Py, NumPy, and SciPy libraries.

3.3.14 The system shall include a programming toolchain(s) that enables runtime coexistence of threading in C, C++, and Fortran, from within applications and any supporting libraries using the same toolchain. The Offeror shall describe the interaction between OpenMP and native parallelism expressed in language standards.

3.3.15 The system shall include C++ compiler(s) that can successfully build the Boost C++ library, http://www.boost.org. The Offeror shall support the most recent stable version of Boost.

3.3.16 The system shall include optimized versions of libm, libgsl, BLAS levels 1, 2 and 3, LAPACK, ScaLAPACK, HDF5, NetCDF, and FFTW. It is desirable for these to efficiently interoperate with applications that utilize OpenMP. The Offeror shall describe all other optimized libraries that will be supported, including a description of the interoperability of these libraries with the programming environments proposed.

3.3.17 The system shall include a mechanism that enables control of task and memory placement within a node for efficient performance. The Offeror shall provide a detailed description of controls provided and any limitations that may exist.

3.3.18 The system shall include a comprehensive software development environment with configuration and source code management tools. On heterogeneous systems, a mechanism (e.g., an upgraded autoconf) shall be provided to create configure scripts to build cross-compiled applications on login nodes.

3.3.19 The system shall include an interactive parallel debugger with an X11-based graphical user interface. The debugger shall provide a single point of control that can debug applications in all supported languages using all granularities of parallelism (e.g. MPI+X) and programming environments provided and scale up to 25% of the system.

3.3.20 The system shall include a suite of tools for detailed performance analysis and profiling of user applications. At least one tool shall support all granularities of parallelism in mixed MPI+OpenMP programs and any additional programming models supported on the system. The tool suite must provide the ability to support multi-node integrated profiling of on-node parallelism and communication performance analysis. The Offeror shall describe all proposed tools and the scalability limitations of each. The Offeror shall describe tools for measuring I/O behavior of user applications.

3.3.21 The system shall include event-tracing tools. Event tracing of interest includes: message-passing event tracing, I/O event tracing, floating point exception tracing, and message-passing profiling. The event-tracing tool API shall provide functions to activate and deactivate event monitoring during execution from within a process.

3.3.22 The system shall include single- and multi-node stack-tracing tools. The tool set shall include a source-level stack traceback, including an API that allows a running process or thread to query its current stack trace.
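As an illustration of what "an API that allows a running process or thread to query its current stack trace" means in practice, the sketch below uses Python's standard `traceback` module. It only demonstrates the concept at the language level; it is not the tool interface the requirement asks the Offeror to supply, and the helper names are illustrative.

```python
import traceback
from typing import List


def current_stack() -> List[str]:
    """Return the calling thread's stack as 'file:line in function' strings."""
    # extract_stack() lists frames oldest first; drop this helper's own frame.
    frames = traceback.extract_stack()[:-1]
    return [f"{f.filename}:{f.lineno} in {f.name}" for f in frames]


def inner() -> List[str]:
    return current_stack()


def outer() -> List[str]:
    return inner()


if __name__ == "__main__":
    for line in outer():
        print(line)
```

The last two entries of the returned list name `outer` and then `inner`, which is the source-level view of the call path at the moment of the query.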

3.3.23 Thesystemshallincludetoolstoassisttheprogrammerinintroducinglimitedlevelsofparallelismanddatastructurerefactoringtocodesusinganyproposedprogrammingmodelsandlanguages.Tool(s)shalladditionallybeprovidedtoassistapplicationdevelopersinthedesignandplacementofthedatastructureswiththegoalofoptimizingdatamovement/placementfortheclassesofmemoryproposedinthesystem.

3.3.24 Thesystemshallincludesoftwarelicensestoenablethefollowingnumberofsimultaneoususersonthesystem:

Crossroads NERSC-9Compiler 20 100Debugger 20 20

3.4 Platform Storage

Platform storage is certain to be one of the advanced technology areas included in any system delivered in this timeframe. The APEX laboratories anticipate these emerging technologies will enable new usage models. With this in mind, an accompanying white paper, “APEX Workflows,” is provided that describes how application teams use HPC resources today to advance scientific goals. The white paper is designed to provide a framework for reasoning about the optimal solution to these challenges. The white paper is intended to help an Offeror develop a platform storage architecture response that accelerates the science workflows while minimizing the total number of platform storage tiers. The Crossroads/NERSC-9 workflows document can be found on the APEX website.

3.4.1 The system shall include platform storage capable of retaining all application input, output, and working data for 12 weeks (84 days), estimated at a minimum of 36% of baseline system memory per day.
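The retention requirement in 3.4.1 can be turned into a capacity figure with one multiplication. The sketch below is illustrative only: the function name is hypothetical, and it assumes the 36% daily figure applies uniformly over all 84 days with no allowance for data protection overhead or compression. Under those assumptions the result is consistent with the "> 30X Baseline Memory" platform storage target listed in Table 2.

```python
# Sketch: storage capacity implied by requirement 3.4.1, expressed as a
# multiple of baseline system memory. The function name is illustrative
# and not part of the requirements document.

def retention_capacity_multiple(retention_days: int, daily_fraction: float) -> float:
    """Capacity, in units of baseline memory, needed to retain
    `retention_days` of data produced at `daily_fraction` of memory per day."""
    return retention_days * daily_fraction


# 12 weeks (84 days) at 36% of baseline memory per day.
multiple = retention_capacity_multiple(retention_days=84, daily_fraction=0.36)
print(f"Required capacity: {multiple:.2f}x baseline memory")
```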

3.4.2 The system shall include platform storage with an appropriate durability or a maintenance plan such that the platform storage is capable of absorbing approximately four times the system's baseline memory per day for the life of the system.

3.4.3 The Offeror shall describe how the system provides sufficient bandwidth to support a JMTTI/Delta-Ckpt ratio of greater than 200 (where Delta-Ckpt is less than 7.2 minutes).

3.4.4 The Offeror shall describe the projected characteristics of all platform storage devices for the system, including but not limited to:

§ Usable capacity, access latencies, platform storage interfaces (e.g. NVMe, PCIe), expected lifetime (warranty period, MTTF, total writes, etc.), and media and device error rates
§ Relevant software/firmware features
§ Compression technologies used by the platform storage devices, the resources used to implement the compression/decompression algorithms, the expected compression rates, and all compression/decompression-related performance impacts

3.4.5 The Offeror shall describe all available interfaces to platform storage for the system, including but not limited to:

§ POSIX
§ APIs
§ Exceptions to POSIX compliance.
§ Time to consistency and any potential delays for reliable data consumption.
§ Any special requirements for users to achieve performance and/or consistent data.

3.4.6 The Offeror shall describe the reliability characteristics of platform storage, including but not limited to:

§ Any single point of failure for all proposed platform storage tiers (note any component failure that will lead to temporary or permanent loss of data availability).
§ Mean time to data loss for each platform storage tier provided.
§ Enumerate platform storage tiers that are designed to be less reliable or do not use data protection techniques (e.g., replication, erasure coding).
§ The magnitudes and duration of performance and reliability degradation brought about by a single or multiple component failures for each reliable platform storage tier.
§ Vendor supplied mechanisms to ensure data integrity for each platform storage tier (e.g., data scrubbing processes, background checksum verification, etc.).
§ Enumerate any platform storage failures that potentially impact scheduled or currently executing jobs that impact the platform storage or system performance and/or availability.
§ Login or interactive nodes access to platform storage when the compute nodes are unavailable.

3.4.7 The Offeror shall describe system features for platform storage tier management designed to accelerate workflows, including but not limited to:

§ Mechanisms for migrating data between platform storage tiers, including manual, scheduled, and/or automatic data migration to include rebalancing, draining, or rewriting data across devices within a tier.
§ How platform storage will be instantiated with each job if it needs to be, and how platform storage may be persisted across jobs.
§ The capabilities provided to define per-user policies and automate data movement between different tiers of platform storage or external storage resources (e.g., archives).
§ The ability to serialize namespaces no longer in use (e.g., snapshots).
§ The ability to restore namespaces needed for a scheduled job that is not currently available.
§ The ability to integrate with or act as a site-wide scheduling resource.
§ A mechanism to incrementally add capacity and bandwidth to a particular tier of platform storage without requiring a tier-wide outage.
§ Capabilities to manage or interface platform storage with external storage resources or archives (e.g., fast storage layers or HPSS).

3.4.8 The Offeror shall describe software features that allow users to optimize I/O for the workflows of the system, including but not limited to:

§ Batch data movement capabilities, especially when data resides on multiple tiers of platform storage.

§ Methods for users to create and manage platform storage allocations.
§ Any ability to directly write to or read from a tier not directly (logically) adjacent to the compute resources.
§ Locality-aware job/data scheduling.
§ I/O utilization for reservations.
§ Features to prevent data duplication on more than one platform storage tier.
§ Methods for users to exploit any enhanced performance of relaxed consistency.
§ Methods for enabling user-defined metadata with the platform storage solution.

3.4.9 The Offeror shall describe the method for walking the entire platform storage metadata, and describe any special capabilities that would mitigate user performance issues for daily full-system namespace walks; expect at least 1 billion objects.

3.4.10 The Offeror shall describe any capabilities to comprehensively collect platform storage usage data (in a scalable way), for the system, including but not limited to:

§ Per client metrics and frequency of collection, including but not limited to: the number of bytes read or written, number of read or write invocations, client cache statistics, and metadata statistics such as number of opens, closes, creates, and other system calls of relevance to the performance of platform storage.
§ Job level metrics, such as the number of sessions each job initiates with each platform storage tier, session duration, total data transmitted (separated as reads and writes) during the session, and the number of total platform storage invocations made during the session.
§ Platform storage tier metrics and frequency of collection, such as the number of bytes read, number of bytes written, number of read invocations, number of write invocations, bytes deleted/purged, number of I/O sessions established, and periods of outage/unavailability.
§ Job level metrics describing usage of a tiered platform storage hierarchy, such as how long files are resident in each tier, hit rate of file pages in each tier (i.e., whether pages are actually read and how many times data is re-read), fraction of data moved between tiers because of a) explicit programmer control and b) transparent caching, and time interval between accesses to the same file (e.g., how long until an analysis program reads a simulation generated output file).

3.4.11 The Offeror shall propose a method for providing access to platform storage from other systems at the facility. In the case of tiered platform storage, at least one tier must satisfy this requirement.

3.4.12 The Offeror shall describe the capability for platform storage tiers to be repaired, serviced, and incrementally patched/upgraded while running different versions of software or firmware without requiring a storage tier-wide outage. The Offeror shall describe the level of performance degradation, if any, anticipated during the repair or service interval.

3.4.13 The Offeror shall specify the minimum number of compute nodes required to read and write the following datasets from/to platform storage:

§ A 1 TB dataset of 20 GB files in 2 seconds.
§ A 5 TB dataset of any chosen file size in 10 seconds. Offeror shall report the file size chosen.
§ A 1 PB dataset of 32 MB files in 1 hour.
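The three datasets in 3.4.13 imply different aggregate bandwidths and file rates, which is presumably why both large-file and small-file cases are listed. The sketch below works out those rates for a single pass (one read or one write) over each dataset. It assumes decimal prefixes (1 TB = 10^12 bytes), which the source does not specify, and the function and variable names are hypothetical.

```python
# Sketch: aggregate rates implied by the three datasets in 3.4.13.
# ASSUMPTION: decimal units (1 TB = 1e12 bytes); the requirements text does
# not say whether decimal or binary prefixes are intended.

MB, GB, TB, PB = 1e6, 1e9, 1e12, 1e15


def required_rates(dataset_bytes, seconds, file_bytes=None):
    """Return (aggregate GB/s, files per second) for one pass over the dataset.

    The files-per-second value is None when no file size is given.
    """
    bandwidth_gb_s = dataset_bytes / seconds / GB
    files_per_s = None
    if file_bytes is not None:
        files_per_s = (dataset_bytes / file_bytes) / seconds
    return bandwidth_gb_s, files_per_s


cases = {
    "1 TB of 20 GB files in 2 s": (1 * TB, 2, 20 * GB),
    "5 TB of chosen file size in 10 s": (5 * TB, 10, None),
    "1 PB of 32 MB files in 1 h": (1 * PB, 3600, 32 * MB),
}
for label, args in cases.items():
    bandwidth, files = required_rates(*args)
    file_text = "n/a" if files is None else f"{files:.1f}"
    print(f"{label}: {bandwidth:.1f} GB/s, files/s = {file_text}")
```

Under these assumptions the first two cases both need about 500 GB/s of aggregate bandwidth, while the 1 PB case needs less bandwidth but a sustained rate of several thousand files per second, so it stresses metadata handling rather than raw throughput.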

3.5 Application Performance

Assuring that real applications perform well on both the Crossroads and NERSC-9 systems is key for their success. Because the full applications are large, often with millions of lines of code, and in some cases are export controlled, a suite of benchmarks has been developed for RFP response evaluation and system acceptance. The benchmark codes are representative of the workloads of the APEX laboratories but often smaller than the full applications. The performance of the benchmarks will be evaluated as part of both the RFP response and system acceptance. Final benchmark acceptance performance targets will be negotiated after a final system configuration is defined. All performance tests must continue to meet negotiated acceptance criteria throughout the lifetime of the system. System acceptance for Crossroads shall also include an ASC Simulation Code Suite comprised of at least two (2) but no more than four (4) ASC applications from the three NNSA laboratories, Sandia, Los Alamos and Lawrence Livermore.

The Crossroads/NERSC-9 benchmarks, information regarding the Crossroads acceptance codes, and supplemental materials can be found on the APEX website.

3.5.1 The Offeror shall provide responses to the benchmarks (SNAP, PENNANT, HPCG, MiniPIC, UMT, MILC, MiniDFT, GTC, and Meraculous) provided on the Crossroads/NERSC-9 benchmarks link on the APEX website. All modifications or new variants of the benchmarks (including makefiles, build scripts, and environment variables) are to be supplied in the Offeror’s response.

§ The results of all problem sizes (baseline and optimized) shall be provided in the Offeror's Scalable System Improvement (SSI) spreadsheets. SSI is the calculation used for measuring improvement and is documented on the APEX website, along with the SSI spreadsheets. If predicted or extrapolated results are provided, the methodology used to derive them should be documented.
§ The Offeror shall provide licenses for the system for all compilers, libraries, and runtimes used to achieve benchmark performance.

3.5.2 The Offeror shall provide performance results for the system that may be benchmarked, predicted, and/or extrapolated for the baseline MPI+OpenMP (or UPC for Meraculous) variants of the benchmarks. The Offeror may modify the benchmarks to include extra OpenMP pragmas as required, but the benchmark must remain a standard-compliant program that maintains existing output subject to the validation criteria described in the benchmark run rules.

3.5.3 The Offeror shall optionally provide performance results from an Offeror optimized variant of the benchmarks. The Offeror may modify the benchmarks, including the algorithm and/or programming model used to demonstrate high system performance. If algorithmic changes are made, the Offeror shall provide an explanation of why the results may deviate from validation criteria described in the benchmark run rules.

3.5.4 For the Crossroads system only: in addition to the Crossroads/NERSC-9 benchmarks, an ASC Simulation Code Suite representing the three NNSA laboratories will be used to judge performance at time of acceptance. The Crossroads system shall achieve a minimum of at least 6 times (6X) improvement over the ASC Trinity system (Knights Landing partition) for each code, measured using SSI. The Offeror shall specify a baseline performance greater than or equal to 6X at time of response. Final acceptance performance targets will be negotiated after a final system configuration is defined. Information regarding ASC Simulation Code Suite run rules and acceptance can be found on the APEX website. Source code will be provided to the Offeror but will require compliance with export control laws and no cost licensing agreements.

3.5.5 The Offeror shall report or project the number of cores necessary to saturate the available node baseline memory bandwidth as measured by the Crossroads/NERSC-9 memory bandwidth benchmark found on the APEX website.

§ If the node contains heterogeneous cores, the Offeror shall report the number of cores of each architecture necessary to saturate the available baseline memory bandwidth.
§ If multiple tiers of memory are available, the Offeror shall report the above for every functional combination of core architecture and baseline or extended memory tier.

3.5.6 The Offeror shall report or project the sustained dense matrix multiplication performance on each type of processor core (individually and/or in parallel) of the system node architecture(s) as measured by the Crossroads/NERSC-9 multithreaded DGEMM benchmark found on the APEX website.

§ The Offeror shall describe the percentage of theoretical double-precision (64-bit) computational peak, which the benchmark GFLOP/s rate achieves for each type of compute core/unit in the response, and describe how this is calculated.

3.5.7 The Offeror shall report, or project, the MPI two-sided message rate of the nodes in the system under the following conditions measured by the communication benchmark specified on the APEX website:

§ Using a single MPI rank per node with MPI_THREAD_SINGLE.
§ Using two, four, and eight MPI ranks per node with MPI_THREAD_SINGLE.
§ Using one, two, four, and eight MPI ranks per node and multiple threads per rank with MPI_THREAD_MULTIPLE.
§ The Offeror may additionally choose to report on other configurations.

3.5.8 The Offeror shall report, or project, the MPI one-sided message rate of the nodes in the system for all passive synchronization RMA methods with both pre-allocated and dynamic memory windows under the following conditions measured by the communication benchmark specified on the APEX website using:

§ A single MPI rank per node with MPI_THREAD_SINGLE.
§ Two, four, and eight MPI ranks per node with MPI_THREAD_SINGLE.
§ One, two, four, and eight MPI ranks per node and multiple threads per rank with MPI_THREAD_MULTIPLE.
§ The Offeror may additionally choose to report on other configurations.

3.5.9 The Offeror shall report, or project, the time to perform the following collective operations for full, half, and quarter machine size in the system and report on core occupancy during the operations measured by the communication benchmark specified on the APEX website for:

§ An 8 byte MPI_Allreduce operation.
§ An 8 byte per rank MPI_Allgather operation.

3.5.10 The Offeror shall report, or project, the minimum and maximum off-node latency of the system for MPI two-sided messages using the following threading modes measured by the communication benchmark specified on the APEX website:

§ MPI_THREAD_SINGLE with a single thread per rank.
§ MPI_THREAD_MULTIPLE with two or more threads per rank.

3.5.11 The Offeror shall report, or project, the minimum and maximum off-node latency for MPI one-sided messages of the system for all passive synchronization RMA methods with both pre-allocated and dynamic memory windows using the following threading modes measured by the communication benchmark specified on the APEX website:

§ MPI_THREAD_SINGLE with a single thread per rank.
§ MPI_THREAD_MULTIPLE with two or more threads per rank.

3.5.12 The Offeror shall provide an efficient implementation of MPI_THREAD_MULTIPLE. Bandwidth, latency, and message throughput measurements using the MPI_THREAD_MULTIPLE thread support level shall have no more than a 10% performance degradation when compared to using the MPI_THREAD_SINGLE support level as measured by the communication benchmark specified on the APEX website.
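One way to read the 10% limit in 3.5.12 is as a relative difference between paired MPI_THREAD_SINGLE and MPI_THREAD_MULTIPLE measurements. The sketch below shows that interpretation. It is an assumption: the source does not define how degradation is computed, particularly for latency, where lower values are better. The function names and sample values are hypothetical and are not benchmark results.

```python
# Sketch: checking the 10% degradation limit from 3.5.12.
# ASSUMPTION: degradation is the relative change versus the
# MPI_THREAD_SINGLE measurement; the requirements text does not define it.

def degradation(single: float, multiple: float, higher_is_better: bool) -> float:
    """Fractional performance loss of MPI_THREAD_MULTIPLE relative to
    MPI_THREAD_SINGLE. A positive value means MULTIPLE is worse."""
    if higher_is_better:  # bandwidth, message throughput
        return (single - multiple) / single
    return (multiple - single) / single  # latency: lower is better


def meets_limit(single: float, multiple: float,
                higher_is_better: bool, limit: float = 0.10) -> bool:
    """True when the degradation does not exceed the allowed limit."""
    return degradation(single, multiple, higher_is_better) <= limit


# Hypothetical measurements for illustration only.
print(meets_limit(10.0, 9.2, higher_is_better=True))    # bandwidth, GB/s
print(meets_limit(1.0, 1.25, higher_is_better=False))   # latency, microseconds
```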

3.5.13 The Offeror shall report, or project, the maximum I/O bandwidths of the system as measured by the IOR benchmark specified on the APEX website.

3.5.14 The Offeror shall report, or project, the metadata rates of the system as measured by the MDTEST benchmark specified on the APEX website.

3.5.15 The Offeror shall be required at time of acceptance to meet specified targets for acceptance benchmarks, and mission codes for Crossroads, listed on the APEX website.

3.5.16 The Offeror shall describe how the system may be configured to support a high rate and bandwidth of TCP/IP connections to external services both from compute nodes and directly to and from the platform storage, including:

§ Compute node external access shall allow all nodes to each initiate 1 connection concurrently within a 1 second window.

§ Transfer of data over the external network to and from the compute nodes and platform storage at 100 GB/s per direction of a 1 TB dataset comprised of 20 GB files in 10 seconds.

3.6 Resilience, Reliability, and Availability

The ability to achieve the APEX mission goals hinges on the productivity of system users. System availability is therefore essential and requires system-wide focus to achieve a resilient, reliable, and available system. For each metric specified below, the Offeror must describe how they arrived at their estimates.

3.6.1 Failure of the system management and/or RAS system(s) shall not cause a system or job interrupt. This requirement does not apply to a RAS system feature, which automatically shuts down the system for safety reasons, such as an overheating condition.

3.6.2 The minimum System Mean Time Between Interrupt (SMTBI) shall be greater than 720 hours.

3.6.3 The minimum Job Mean Time To Interrupt (JMTTI) shall be greater than 24 hours. Automatic restarts do not mitigate a job interrupt for this metric.

3.6.4 The ratio of JMTTI/Delta-Ckpt shall be greater than 200. This metric is a measure of the system’s ability to make progress over a long period of time and corresponds to an efficiency of approximately 90%. If, for example, the JMTTI requirement is not met, the target JMTTI/Delta-Ckpt ratio ensures this minimum level of efficiency.
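The source does not say which model links a ratio of 200 to roughly 90% efficiency. One model that reproduces the figure is Young's first-order checkpoint approximation, used here as an assumption. In that model the optimal compute interval between checkpoints is sqrt(2 * Delta-Ckpt * JMTTI), and the fraction of time lost to checkpointing plus rework is about sqrt(2 * Delta-Ckpt / JMTTI). The sketch below applies it to the 24 hour JMTTI and 7.2 minute Delta-Ckpt targets, ignoring restart time. The function name is hypothetical.

```python
import math

# Sketch: efficiency implied by a JMTTI/Delta-Ckpt ratio, under Young's
# first-order approximation. ASSUMPTION: this model is not named in the
# requirements text; it is one that happens to reproduce the ~90% figure.


def checkpoint_efficiency(jmtti_hours: float, delta_ckpt_minutes: float) -> float:
    """Estimated fraction of wall time spent on useful work when
    checkpointing at the optimal interval sqrt(2 * delta * M)."""
    jmtti_minutes = jmtti_hours * 60.0
    lost_fraction = math.sqrt(2.0 * delta_ckpt_minutes / jmtti_minutes)
    return 1.0 - lost_fraction


ratio = (24 * 60) / 7.2                 # JMTTI / Delta-Ckpt, same units
efficiency = checkpoint_efficiency(24, 7.2)
print(f"JMTTI/Delta-Ckpt = {ratio:.0f}, estimated efficiency = {efficiency:.2f}")
```

Because the estimate depends only on the ratio, a system with a shorter JMTTI and a proportionally shorter Delta-Ckpt gives the same efficiency in this model, which matches the requirement's remark that the ratio preserves the efficiency level even if the JMTTI target is missed.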

3.6.5 An immediate re-launch of an interrupted job shall not require a complete resource reallocation. If a job is interrupted, there shall be a mechanism that allows re-launch of the application using the same allocation of resource (e.g., compute nodes) that it had before the interrupt or an augmented allocation when part of the original allocation experiences a hard failure.

3.6.6 A complete system initialization shall take no more than 30 minutes. The Offeror shall describe the full system initialization sequence and timings.

3.6.7 The system shall achieve 99% scheduled system availability. System availability is defined in the glossary.

3.6.8 The Offeror shall describe the resilience, reliability, and availability mechanisms and capabilities of the system including, but not limited to:

§ Any condition or event that can potentially cause a job interrupt.
§ Resiliency features to achieve the availability targets.

§ Single points of failure (hardware or software), and the potential effect on running applications and system availability.
§ How a job maintains its resource allocation and is able to relaunch an application after an interrupt.
§ A system-level mechanism to collect failure data for each kind of component.

3.7 Application Transition Support and Early Access to APEX Technologies

The Crossroads and NERSC-9 systems will include numerous pre-exascale technologies. The Offeror shall include in their proposal a plan to effectively utilize these technologies and assist in transitioning the mission workflows to the systems. For the Crossroads system only, the Offeror shall support efforts to transition the Advanced Technology Development Mitigation (ATDM) codes to the systems. ATDM codes are currently being developed by the three NNSA weapons laboratories, Sandia, Los Alamos, and Lawrence Livermore. These codes may require compliance with export control laws and no cost licensing agreements. Information about the ATDM program can be found on the NNSA website.

3.7.1 The Offeror shall provide (thus the Offeror shall propose) a vehicle for supporting the successful demonstration of the application performance requirements and the transition of key applications to the Crossroads and NERSC-9 systems (e.g., a Center of Excellence). Support shall be provided by the Offeror and all of its key advanced technology providers (e.g., processor vendors, integrators, etc.). The Offeror shall provide experts in the areas of application porting and performance optimization in the form of staff training, general user training, and deep-dive interactions with a set of application code teams. Support shall include compilers to enable timely bug fixes as well as to enable new functionality. Support shall be provided from the date of subcontract execution through two (2) years after final acceptance of the systems.

3.7.2 The Offeror shall describe which of the proposed APEX hardware and software technologies (physical hardware, emulators, and/or simulators), will be available for access before system delivery and in what timeframe. The proposed technologies should provide value in advanced preparation for the delivery of the final APEX system(s) for pre-system-delivery application porting and performance assessment activities.

3.8 Target System Configuration

Table 2 Target System Configuration

| | Crossroads | NERSC-9 |
|---|---|---|
| Baseline Memory Capacity (excludes all levels of on-die-CPU cache) | > 3 PiB | > 3 PiB |
| Benchmark SSI increase over Edison system | > 20X | > 20X |
| Platform Storage | > 30X Baseline Memory | > 30X Baseline Memory |
| Wall Plate Power | < 20 MW | < 20 MW |
| Peak Power | < 18 MW | < 18 MW |
| Nominal Power | < 15 MW | < 15 MW |
| Idle Power | < 10% Wall Plate Power | < 10% Wall Plate Power |
| Job Mean Time To Interrupt (JMTTI), calculated for a single job running in the entire system | > 24 Hours | > 24 Hours |
| System Mean Time To Interrupt (SMTTI) | > 720 Hours | > 720 Hours |
| Delta-Ckpt | < 7.2 minutes | < 7.2 minutes |
| JMTTI/Delta-Ckpt | > 200 | > 200 |
| System Availability | > 99% | > 99% |

3.9 System Operations

System management shall be an integral feature of the overall system and shall provide the ability to effectively manage system resources with high utilization and throughput under a workload with a wide range of concurrencies. The Offeror shall provide system administrators, security officers, and user-support personnel with productive and efficient system configuration management capabilities and an enhanced diagnostic environment.

3.9.1 The system shall include scalable integrated system management capabilities that provide human interfaces and APIs for system configuration and its ability to be automated, software management, change management, local site integration, and system configuration backup and recovery.

3.9.2 The system shall include a means for tracking and analyzing all software updates, software and hardware failures, and hardware replacements over the lifetime of the system.

3.9.3 The system shall include the ability to perform rolling upgrades and rollbacks on a subset of the system while the balance of the system remains in production operation. The Offeror shall describe the mechanisms, capabilities, and limitations of rolling upgrades and rollbacks. No more than half the system partition shall be required to be down for rolling upgrades and rollbacks.

3.9.4 The system shall include an efficient mechanism for reconfiguring and rebooting compute nodes. The Offeror shall describe in detail the compute node reboot mechanism, differentiating types of boots (warm boot vs. cold boot) required for different node features, as well as how the time required to reboot scales with the number of nodes being rebooted.

3.9.5 The system will include a mechanism whereby all monitoring data and logs captured are available to the system owner, and will support an open monitoring API to facilitate lossless, scalable sampling and data collection for monitored data. Any filtering that may need to occur will be at the option of the system manager. The system will include a sampling and connection framework that allows the system manager to configure independent alternative parallel data streams to be directed off the system to site-configurable consumers.

3.9.6 The system shall include a mechanism to collect and provide metrics and logs which monitor the status, health, and performance of the system, including, but not limited to:

§ Environmental measurement capabilities for all systems and peripherals and their sub-systems and supporting infrastructure, including power and energy consumption and control.
§ Internal HSN performance counters, including measures of network congestion and network resource consumption.
§ All levels of integrated and attached platform storage.

§ The system as a whole, including hardware performance counters for metrics for all levels of integrated and attached platform storage.

3.9.7 The Offeror shall describe what tools it shall provide for the collection, analysis, integration, and visualization of metrics and logs produced by the system (e.g., peripherals, integrated and attached platform storage, and environmental data, including power and energy consumption).

3.9.8 The Offeror shall describe the system configuration management and diagnostic capabilities of the system that address the following topics:

§ Detailed description of the system management support.
§ Any effect or overhead of software management tool components on the CPU or memory available on compute nodes.
§ Release plan, with regression testing and validation for all system related software and security updates.
§ Support for multiple simultaneous or alternative system software configurations, including estimated time and effort required to install both a major and a minor system software update.
§ User activity tracking, such as audit logging and process accounting.
§ Unrestricted privileged access to all hardware components delivered with the system.

3.10 Power and Energy

Power, energy, and temperature will be critical factors in how the APEX laboratories manage systems in this time frame and must be an integral part of overall Systems Operations. The solution must be well integrated into other intersecting areas (e.g., facilities, resource management, runtime systems, and applications). The APEX laboratories expect a growing number of use cases in this area that will require a vertically integrated solution.

3.10.1 The Offeror shall describe all power, energy, and temperature measurement capabilities (system, rack/cabinet, board, node, component, and sub-component level) for the system, including control and response times, sampling frequency, accuracy of the data, and timestamps of the data for individual points of measurement and control.

3.10.2 The Offeror shall describe all control capabilities it shall provide to affect power or energy use (system, rack/cabinet, board, node, component, and sub-component level).

3.10.3 The system shall include system-level interfaces that enable measurement and dynamic control of power and energy relevant characteristics of the system, including but not limited to:

§ AC measurement capabilities at the system or rack level.
§ System-level minimum and maximum power settings (e.g., power caps).
§ System-level power ramp up and down rate.
§ Scalable collection and retention of all measurement data such as:
  o point-in-time power data.
  o energy usage information.
  o minimum and maximum power data.

3.10.4 The system shall include resource manager interfaces that enable measurement and dynamic control of power and energy relevant characteristics of the system, including but not limited to:

§ Job and node level minimum and maximum power settings.
§ Job and node level power ramp up and down rate.
§ Job and node level processor and/or core frequency control.
§ System and job level profiling and forecasting.
  o e.g., prediction of hourly power averages > 24 hours in advance with a 1 MW tolerance.

3.10.5 The system shall include application and runtime system interfaces that enable measurement and dynamic control of power and energy relevant characteristics of the system including but not limited to:

§ Node level minimum and maximum power settings.
§ Node level processor and/or core frequency control.
§ Node level application hints, such as:
  o application entering serial, parallel, computationally intense, I/O intense or communication intense phase.

3.10.6 The system shall include an integrated API for all levels of measurement and control of power relevant characteristics of the system. It is preferable that the provided API complies with the High Performance Computing Power Application Programming Interface Specification (http://powerapi.sandia.gov).

3.10.7 The Offeror shall project (and report) the Wall Plate, Peak, Nominal, and Idle Power of the system.

3.10.8 The Offeror shall describe any controls available to enforce or limit power usage below wall plate power and the reaction time of this mechanism (e.g., what duration and magnitude can power usage exceed the imposed limits).

3.10.9 The Offeror shall describe the status of the system when in an Idle State (describe all Idle States if multiple are available) and the time to transition from the Idle State (or each Idle State if there are multiple) to the start of job execution.

3.11 Facilities and Site Integration

3.11.1 The system shall use 3-phase 480V AC. Other system infrastructure components (e.g., disks, switches, login nodes, and mechanical subsystems such as CDUs) must use either 3-phase 480V AC (strongly preferred), 3-phase 208V AC (second choice), or single-phase 120/240V AC (third choice). The total number of individual branch circuits and phase load imbalance shall be minimized.

3.11.2 All equipment and power control hardware of the system shall be Nationally Recognized Testing Laboratories (NRTL) certified and bear appropriate NRTL labels.

3.11.3 Every rack, network switch, interconnect switch, node, and disk enclosure shall be clearly labeled with a unique identifier visible from the front of the rack and/or the rear of the rack, as appropriate, when the rack door is open. These labels will be high quality so that they do not fall off, fade, disintegrate, or otherwise become unusable or unreadable during the lifetime of the system. Nodes will be labeled from the rear with a unique serial number for inventory tracking. It is desirable that motherboards also have a unique serial number for inventory tracking. Serial numbers shall be visible without having to disassemble the node, or they must be able to be queried from the system management console.

3.11.4 The Offeror shall describe the features of the system related to facilities and site integration, including:

§ Description of the physical packaging of the system, including dimensioned drawings of individual cabinet types and the floor layout of the entire system.
§ Remote environmental monitoring capabilities of the system and how it would integrate into facility monitoring.
§ Emergency shutdown capabilities.
§ Detailed descriptions of power and cooling distributions throughout the system, including power consumption for all subsystems.

§ Description of parasitic power losses within Offeror’s equipment, such as fans, power supply conversion losses, power-factor effects, etc. For the computational and platform storage subsystems separately, give an estimate of the total power and parasitic power losses (whose difference should be power used by computational or platform storage components) at the minimum and maximum ITUE, which is defined as the ratio of total equipment power over power used by computational or platform storage components. Describe the conditions (e.g. “idle”) at which the extrema occur.
§ OS distributions or other client requirements to support off-system access to the platform storage (e.g. LANL File Transfer Agents).
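The ITUE definition in the parasitic power bullet above reduces to a single ratio once total power and parasitic losses are known. The sketch below follows that definition directly: component power is taken as total equipment power minus parasitic losses, as the bullet states. The function name, the choice of kilowatts, and the sample figures are hypothetical and serve only as illustration.

```python
# Sketch: ITUE as defined in 3.11.4, i.e. total equipment power divided by
# the power used by computational or platform storage components.
# The function name and sample values are illustrative assumptions.

def itue(total_equipment_power_kw: float, parasitic_power_kw: float) -> float:
    """Return ITUE, where component power = total power - parasitic losses
    (fans, power supply conversion losses, power-factor effects, etc.)."""
    component_power_kw = total_equipment_power_kw - parasitic_power_kw
    if component_power_kw <= 0:
        raise ValueError("parasitic losses must be smaller than total power")
    return total_equipment_power_kw / component_power_kw


# Hypothetical subsystem drawing 1000 kW with 150 kW of parasitic losses.
print(round(itue(1000.0, 150.0), 3))
```

An ITUE of exactly 1.0 would mean no parasitic losses at all. The requirement asks for this figure at both its minimum and maximum, reported separately for the computational and platform storage subsystems.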

Table3CrossroadsandNERSC-9FacilityRequirements

Crossroads NERSC-9

Location LosAlamosNationalLaboratory,LosAlamos,NM.ThesystemwillbehousedintheStrategicComputingComplex(SCC),Building2327

NationalEnergyResearchScientificComputingCenter,LawrenceBerkeleyNationalLaboratory,Berkeley,CA.ThesystemwillbehousedinWangHall,Building59(formerlyknownastheComputationalTheoryandResearchFacility).

Altitude 7,500feet 650feet

Seismic N/A Systemtobeplacedonaseismicisolationfloor.Systemcabinetsshallhaveanattachmentmechanismthatwillenablethemtobefirmlyattachedtoeachotherandtheisolationfloor.Whensecuredviatheseattachments,thecabinetsshallwithstandseismicdesignaccelerationspertheCaliforniaBuildingCodeandLBNLLateralForceDesignCriteriapolicyin

LA-UR-15-28541 DRAFT APEX 2020 Technical Requirements Document

Dated 05-23-16

APEX 2020 Technical Requirements, Version 4.1 Page 32 of 77

Crossroads NERSC-9

effectatthetimeofsubcontractaward.(TheCBCcurrentlyspecifies0.49gbutisexpectedtobeupdatedin2016.)

WaterCooling ThesystemshalloperateinconformancewithASHRAEClassW2guidelines(dated2011).Thefacilitywillprovideoperatingwatertemperaturethatnominallyvariesbetween60-75°F,atupto35PSIdifferentialpressureatthesystemcabinetsHowever,Offerorshouldnoteifthesystemiscapableofoperatingathighertemperatures.

Note:LANLfacilitywillprovideinletwateratanominal75°F.Itmaygotoaslowas60°Fbasedonfacilityand/orenvironmentalfactors.Totalflowrequirementsmaynotexceed9600GPM.

Same

Note:NERSCfacilitywillprovideinletwateratanominal65°F.Itmaygoashighas75°Fbasedonfacilityand/orenvironmentalfactors.Totalflowrequirementsmaynotexceed9600GPM.

WaterChemistry ThesystemmustoperatewithfacilitywatermeetingbasicASHRAEwaterchemistry.Specialchemistrywaterisnotavailableinthemainbuildingloopandwouldrequireaseparatetertiaryloopprovidedwiththesystem.If

Same

LA-UR-15-28541 DRAFT APEX 2020 Technical Requirements Document

Dated 05-23-16

APEX 2020 Technical Requirements, Version 4.1 Page 33 of 77

Crossroads NERSC-9

tertiaryloopsareincludedinthesystem,theOfferorshalldescribetheiroperationandmaintenance,includingcoolantchemistry,pressures,andflowcontrols.Allcoolantloopswithinthesystemshallhavereliableleakdetection,temperature,andflowalarms,withautomaticprotectionandnotificationmechanisms.

Air Cooling
Crossroads: The system must operate with supply air at 76°F or below, with a relative humidity from 30%-70%. The rate of airflow is between 800-1500 CFM/floor tile. No more than 3 MW of heat shall be removed by air cooling.
NERSC-9: The system must operate with supply air at 76°F or below, with a relative humidity from 30%-80%. The current facility can support up to 60K CFM of airflow, and remove 500 KW of heat. Expansion is possible to 300K CFM and 1.5 MW, but at added expense.

Maximum Power Rate of Change
Crossroads: The hourly average in system power shall not exceed the 2 MW wide power band negotiated at least 2 hours in advance.
NERSC-9: N/A

Power Quality
Crossroads: The system shall be resilient to incoming power fluctuations at least to the level guaranteed by the ITIC power quality curve.
NERSC-9: Same

Floor
Crossroads: 42” raised floor
NERSC-9: 48” raised floor

Ceiling
Crossroads: 16 foot ceiling and an 18’ 6” ceiling plenum
NERSC-9: 17’ 10” ceiling; however, maximum cabinet height is 9’ 5”

Maximum Footprint
Crossroads: 8000 square feet; 80 feet long and 100 feet deep.
NERSC-9: 64’ x 92’, or 5888 square feet (inclusive of compute, platform storage and service aisles). This area is itself surrounded by a minimum 4’ aisle that can be used in the system layout. It is preferred that cabinet rows run parallel to the short dimension.

Shipment Dimensions and Weight
Crossroads: No restrictions.
NERSC-9: For delivery, system components shall weigh less than 7000 pounds and shall fit into an elevator whose door is 6 ft 6 in wide and 9 ft 0 in high and whose depth is 8 ft 3 in. Clear internal width is 8 ft 4 in.

Floor Loading
Crossroads: The average floor loading over the effective area shall be no more than 300 pounds per square foot. The effective area is the actual loading area plus at most a foot of surrounding fully unloaded area. A maximum limit of 300 pounds per square foot also applies to all loads during installation. The Offeror shall describe how the weight will be distributed over the footprint of the rack (point loads, line loads, or evenly distributed over the entire footprint). A point load applied on a one square inch area shall not exceed 1500 pounds. A dynamic load using a CISCA Wheel 1 size shall not exceed 1250 pounds (CISCA Wheel 2 – 1000 pounds).
NERSC-9: The floor loading shall not exceed a uniform load of 500 pounds per square foot. Raised floor tiles are ASM FS400 with an isolated point load of 2000 pounds and a rolling load of 1200 pounds.
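The Crossroads average-loading rule reduces to simple arithmetic. The following is a minimal sketch, assuming a rectangular rack footprint and an unloaded border of at most one foot on every side; the function name and the example rack weight and dimensions are hypothetical and do not come from this document.

```python
def average_floor_loading(weight_lb, width_ft, depth_ft, border_ft=1.0):
    """Return the average loading (lb/sq ft) over the effective area.

    Effective area = rack footprint plus a surrounding fully unloaded
    border of at most one foot on each side (rectangular, by assumption).
    """
    if not 0.0 <= border_ft <= 1.0:
        raise ValueError("border must be between 0 and 1 foot")
    effective_area = (width_ft + 2 * border_ft) * (depth_ft + 2 * border_ft)
    return weight_lb / effective_area


CROSSROADS_LIMIT_PSF = 300.0  # average-loading limit stated for Crossroads

# Hypothetical rack: 4,000 lb on a 2 ft x 4 ft footprint.
loading = average_floor_loading(4000, 2.0, 4.0)
print(f"{loading:.1f} lb/sq ft, within limit: {loading <= CROSSROADS_LIMIT_PSF}")
```

With no unloaded border available (`border_ft=0`), the same hypothetical rack would load the floor at 500 lb/sq ft, which is why the layout of neighboring racks matters as much as the rack weight.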

Cabling
Crossroads: All power cabling and water connections shall be below the access floor. It is preferable that all other cabling (e.g., system interconnect) is above floor and integrated into the system cabinetry. Under floor cables (if unavoidable) shall be plenum rated and comply with NEC 300.22 and NEC 645.5. All communications cables, wherever installed, shall be source/destination labeled at both ends. All communications cables and fibers over 10 meters in length and installed under the floor shall also have a unique serial number and dB loss data document (or equivalent) delivered at time of installation for each cable, if a method of measurement exists for cable type.
NERSC-9: Same

External network interfaces supported by the site for connectivity requirements specified below
Crossroads: 1Gb, 10Gb, 40Gb, 100Gb, IB
NERSC-9: Same

External bandwidth on/off the system for general TCP/IP connectivity
Crossroads: > 100 GB/s per direction
NERSC-9: Same

External bandwidth on/off the system for accessing the system’s PFS
Crossroads: > 100 GB/s
NERSC-9: Same

External bandwidth on/off the system for accessing external, site supplied file systems (e.g., GPFS, NFS)
Crossroads: > 100 GB/s
NERSC-9: Same

4 Non-Recurring Engineering

The APEX team expects to award two (2) Non-Recurring Engineering (NRE) subcontracts, separate from the two (2) system subcontracts. It is expected that Crossroads and NERSC personnel will collaborate in both NRE subcontracts. It is anticipated that the NRE subcontracts will be approximately 10%-15% of the combined Crossroads and NERSC-9 system budgets. The Offeror is encouraged to provide proposals for areas of collaboration they feel provide substantial value to the Crossroads and NERSC-9 systems with the goals of:

§ Increasing application performance.
§ Increasing workflow performance.
§ Increasing the resilience and reliability of the system.

Proposed collaboration areas should focus on topics that provide added value beyond planned roadmap activities. Proposals should not focus on one-off point solutions or gaps created by their proposed design that should be otherwise provided as part of a vertically integrated solution. It is expected that NRE collaborations will have impact on both the Crossroads and NERSC-9 systems and follow-on systems procured by the U.S. Department of Energy's NNSA and Office of Science. NRE topics of interest include, but are not limited to, the following:

§ Development and optimization of hardware and software capabilities to increase the performance of MPI+OpenMP and future task-based asynchronous programming models.


§ Development and optimization of hardware and software capabilities to increase the performance of application workflows, including consideration of consistency requirements, data-migration needs, and system-wide resource management.

§ Development of scalable system management capabilities to enhance the reliability, resilience, power, and energy usage of Crossroads/NERSC-9.

5 Options

The APEX team expects to have future requirements for system upgrades and/or additional quantities of components based on the configurations proposed in response to this solicitation. The Offeror shall address any technical challenges foreseen with respect to scaling and any other production issues. Proposals should be as detailed as possible. The evaluation committee will make no presumption of technical capability when evaluating Offeror responses to Options. Offerors that do not address the Options in a materially responsive manner will be downgraded.

5.1 Upgrades, Expansions and Additions

5.1.1 The Offeror shall propose and separately price upgrades, expansions or procurement of additional system configurations by the following fractions of the system as measured by the Sustained System Improvement (SSI) metric.

§ 25%
§ 50%
§ 100%
§ 200%

5.1.2 The Offeror shall propose a configuration or configurations which double the baseline memory capacity.

5.1.3 The Offeror shall propose upgrades, expansions or procurement of additional platform storage capacity (per tier if multiple tiers are present) in increments of 25%.

5.2 Early Access Development System

To allow for early and/or accelerated development of applications or development of functionality required as a part of the statement of work, the Offeror shall propose options for early access development systems. These systems can be in support of the baseline requirements or any proposed options.


5.2.1 The Offeror shall propose an Early Access Development System. The primary purpose is to expose the application to the same programming environment as will be found on the final system. It is acceptable for the early access system to not use the final processor, node, or high-speed interconnect architectures. However, the programming and runtime environment must be sufficiently similar that a port to the final system is trivial. The early access system shall contain functionality similar to that of the final system, including file systems, but scaled down to the appropriate configuration. The Offeror shall propose an option for the following configurations based on the size of the final Crossroads/NERSC-9 systems.

§ 2% of the compute partition.
§ 5% of the compute partition.
§ 10% of the compute partition.

5.2.2 The Offeror shall propose development test bed systems that will reduce risk and aid the development of any advanced functionality that is exercised as a part of the statement of work, for example, any topics proposed for NRE.

5.3 Test Systems

The Offeror shall propose the following test systems. The systems shall contain all the functionality of the main system, including file systems, but scaled down to the appropriate configuration. Multiple test systems may be awarded.

5.3.1 The Offeror shall propose an Application Regression test system, which shall contain at least 200 compute nodes.

5.3.2 The Offeror shall propose a System Development test system, which shall contain at least 50 compute nodes.

5.4 On Site System and Application Software Analysts

5.4.1 The Offeror shall propose and separately price two (2) System Software Analysts and two (2) Applications Software Analysts for each site. Offerors shall presume each analyst will be utilized for four (4) years. For Crossroads, these positions require a DOE Q-clearance for access.

5.5 Deinstallation

The Offeror shall deinstall, remove and/or recycle the system and supporting infrastructure at end of life. Storage media shall be wiped or destroyed to the satisfaction of ACES and NERSC, and/or returned to ACES and NERSC at their request.


5.6 Maintenance and Support

The Offeror shall propose and separately price maintenance and support with the following features:

5.6.1 Maintenance and Support Period

The Offeror shall propose all maintenance and support for a period of four (4) years from the date of acceptance of the system. Warranty shall be included in the 4 years. For example, if the system is accepted on April 1, 2021 and the Warranty is for one year, then the Warranty ends on March 31, 2022, and the maintenance period begins April 1, 2022 and ends on March 31, 2025. Offeror shall also propose additional maintenance and support extension for years 5-7.
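The period arithmetic in this requirement can be sketched in a few lines. This is only an illustration, assuming that each period ends on the day before the relevant anniversary of the acceptance date; the function names are hypothetical.

```python
from datetime import date, timedelta


def add_years(d, years):
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def support_periods(acceptance, warranty_years=1, total_years=4):
    """Split the total support term into warranty and maintenance periods.

    Assumption: a period ends on the day before the next period starts.
    """
    maintenance_start = add_years(acceptance, warranty_years)
    warranty_end = maintenance_start - timedelta(days=1)
    maintenance_end = add_years(acceptance, total_years) - timedelta(days=1)
    return warranty_end, maintenance_start, maintenance_end


# Acceptance date taken from the example in the requirement text.
for d in support_periods(date(2021, 4, 1)):
    print(d.isoformat())
# 2022-03-31
# 2022-04-01
# 2025-03-31
```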

5.6.2 Maintenance and Support Solutions

The Offeror shall propose the following maintenance and support solutions and propose pricing separately for each solution. ACES and NERSC may purchase either one of the solutions or neither of the solutions, at their discretion. Different maintenance solutions may be selected for the various test systems and final system.

5.6.2.1 Solution 1 – 7x24

The Offeror shall price Solution 1 as full hardware and software support for all Offeror provided hardware components and software. The principal period of maintenance (PPM) shall be 24 hours by 7 days a week with a four hour response to any request for service.

5.6.2.2 Solution 2 – 5x9

The Offeror shall price Solution 2 as full hardware and software support for all Offeror provided hardware components and software. The principal period of maintenance (PPM) shall be 9 hours by 5 days a week (exclusive of holidays observed by ACES or NERSC). The Offeror shall provide hardware maintenance training for ACES/NERSC staff so that staff are able to provide hardware support at all other times when the Offeror is unable to provide hardware repair in a timely manner outside of the PPM. The Offeror shall supply hardware maintenance procedural documentation, training, and manuals necessary to support this effort.

All proposed maintenance and support solutions shall include the following features and meet all requirements of this section.

5.6.3 General Service Provisions


The Offeror shall be responsible for repair or replacement of any failing hardware component that it supplies and correction of defects in software that it provides as part of the system.

At its sole discretion, ACES or NERSC may request advance replacement of components which show a pattern of failures which reasonably indicates that future failures may occur in excess of reliability targets, or for which there is a systemic problem that prevents effective use of the system.

Hardware failures due to environmental changes in facility power and cooling systems which can be reasonably anticipated (such as brown-outs, voltage-spikes or cooling system failures) are the responsibility of the Offeror.

5.6.4 Software and Firmware Update Service

The Offeror shall provide an update service for all software and firmware provided for the duration of the Warranty plus Maintenance period. This shall include new releases of software/firmware and software/firmware patches as required for normal use. The Offeror shall integrate software fixes, revisions or upgraded versions in supplied software, including community software (e.g., Linux or Lustre), and make them available to ACES and NERSC within twelve (12) months of their general availability. The Offeror shall provide prompt availability of patches for cyber security defects.

5.6.5 Call Service

The Offeror shall provide contact information for technical personnel with knowledge of the proposed equipment and software. These personnel shall be available for consultation by telephone and electronic mail with ACES/NERSC personnel. In the case of degraded performance, the Offeror’s services shall be made readily available to develop strategies for improving performance, i.e., patches and workarounds.

5.6.6 On-site Parts Cache

The Offeror shall maintain a parts cache on-site at both the ACES and NERSC facilities. The parts cache shall be sized and provisioned sufficiently to support all normal repair actions for two weeks without the need for parts refresh. The initial sizing and provisioning of the cache shall be based on Offeror’s Mean Time Between Failure (MTBF) estimates for each FRU and each rack, and scaled based on the number of FRUs and racks delivered. The parts cache configuration will be periodically reviewed for quantities needed to satisfy this requirement, and adjusted if necessary, based on observed FRU or node failure rates. The parts cache will be resized, at the Offeror’s expense, should the on-site parts cache prove to be insufficient to sustain the actually observed FRU or node failure rates.
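One way to turn MTBF estimates into an initial two-week stock level is sketched below. It is an illustration only, not a method prescribed by this document. It assumes failures of each FRU type follow a Poisson process with rate equal to unit count divided by MTBF, and stocks enough spares to cover the two-week demand with a chosen confidence. The FRU names, populations, MTBF values, and the 95% confidence level are all hypothetical.

```python
import math

HOURS_PER_TWO_WEEKS = 14 * 24


def expected_failures(unit_count, mtbf_hours, window_hours=HOURS_PER_TWO_WEEKS):
    """Expected number of failures across `unit_count` units in the window."""
    return unit_count * window_hours / mtbf_hours


def spares_needed(unit_count, mtbf_hours, confidence=0.95):
    """Smallest stock s with P(failures <= s) >= confidence.

    Assumption: failures in the window are Poisson distributed.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    lam = expected_failures(unit_count, mtbf_hours)
    if lam > 700:
        raise ValueError("expected count too large for this simple sketch")
    term = math.exp(-lam)  # P(X = 0)
    cumulative = term
    spares = 0
    while cumulative < confidence:
        spares += 1
        term *= lam / spares  # P(X = spares) from P(X = spares - 1)
        cumulative += term
    return spares


# Hypothetical FRU populations and MTBF estimates, for illustration only.
frus = {"DIMM": (80_000, 4_000_000), "power supply": (2_000, 500_000)}
for name, (count, mtbf) in frus.items():
    print(name, round(expected_failures(count, mtbf), 2), spares_needed(count, mtbf))
```

Re-running the same calculation with observed failure rates in place of the initial MTBF estimates is one way to support the periodic review the requirement describes.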


5.6.7 On-Site Node Cache

The Offeror shall also maintain an on-site spare node inventory of at least 1% of the total nodes in the system. These nodes shall be maintained and tested for hardware integrity and functionality utilizing the Hardware Support Cluster defined below, if provided.

The following features and requirements are specific to responses for ACES requirements.

5.6.8 Hardware Support Cluster

The Offeror shall provide a Hardware Support Cluster (HSC). The HSC shall support the hot spare nodes and provide functions such as hardware burn-in, problem diagnosis, etc. The Offeror shall supply sufficient racks, interconnect, networking, storage equipment and any associated hardware/software necessary to make the HSC a stand-alone system capable of running diagnostics on individual or clusters of HSC nodes. ACES will store and inventory the HSC and other on-site parts cache components.

5.6.9 DOE Q-Cleared Technical Service Personnel

The Crossroads system will be installed in security areas that require a DOE Q-clearance for access. It will be possible to install the system with the assistance of uncleared US citizens or L-cleared personnel, but the Offeror shall arrange and pay for appropriate 3rd party security escorts. The Offeror shall obtain necessary clearances for on-site support staff to perform their duties.

6 Delivery and Acceptance

Testing of the system shall proceed in three steps: pre-delivery, post-delivery, and acceptance. Each step is intended to validate the system and feeds into subsequent activities. Sample Acceptance Test plans (Appendix A) shall be provided as part of the Request for Proposal.

6.1 Pre-delivery Testing

The APEX team and the Offeror shall perform pre-delivery testing at the factory on the hardware to be delivered. Any limitations for performing the pre-delivery testing shall be identified in the Offeror’s proposal, including scale and licensing limitations (if any). During pre-delivery testing, the Offeror shall:

§ Demonstrate RAS capabilities and robustness using simple fault injection techniques, such as disconnecting cables, powering down subsystems, or installing known bad parts.


§ Demonstrate functional capabilities on each segment of the system built, including the capacity to build applications, schedule jobs, and run them using a customer-provided testing framework. The root cause of application failure must be identified prior to system shipping.
§ Provide a file system sufficiently provisioned to support the suite of tests.
§ Provide onsite and remote access to the APEX team to monitor testing and analyze results.
§ Instill confidence in the ability to conform to the statement of work.

6.2 Site Integration and Post-delivery Testing

The APEX team and the Offeror staff shall perform site integration and post-delivery testing on the fully delivered system. Limitations and/or special requirements may exist for access to the onsite system by the Offeror.

§ During post-delivery testing, the pre-delivery tests shall be run on the full system installation.
§ Where applicable, tests shall be run at full scale.

6.3 Acceptance Testing

The APEX team and the Offeror staff shall perform onsite acceptance testing on the fully installed system. Limitations and/or special requirements may exist for access to the onsite system by the Offeror.

6.3.1 The Offeror shall demonstrate that the delivered system conforms to the subcontract’s Statement of Work.

7 Risk and Project Management

The Offeror shall propose a risk management strategy and project management plan for the Crossroads and NERSC-9 systems that is closely coordinated between the subcontracts for LANS and UC.

7.1.1 The Offeror shall propose a risk management strategy for the system in the event of technology problems or scheduling delays that affect delivery of the system or achievement of performance targets in the proposed timeframe. Offeror shall describe the impact of substitute technologies (if any) on the overall architecture and performance of the system, in particular addressing the four technology areas listed below:

§ Processor
§ Memory
§ High-speed interconnect
§ Platform storage


7.1.2 The Offeror shall identify any other high-risk areas and accompanying mitigation strategies for the system.

7.1.3 The Offeror shall provide a clear plan for effectively responding to software and hardware defects and system outages at each severity level and document how problems or defects will be escalated.

7.1.4 The Offeror shall propose a roadmap showing how their response to this Request for Proposal aligns with their plans for exascale computing.

7.1.5 The Offeror shall identify additional capabilities, including:

§ Its ability to produce and maintain the system for the life of the system
§ Its ability to achieve specific quality assurance, reliability, availability and serviceability goals
§ Its in-house testing and problem diagnosis capability, including hardware resources at appropriate scale

7.1.6 The Offeror shall provide project management specifics for the APEX team, which shall be detailed as part of the Request for Proposal document. Please see Appendix B for further information.

8 Documentation and Training

The Offeror shall provide documentation and training to effectively operate, configure, maintain, and use the systems to the APEX team and users of the Crossroads and NERSC-9 systems. The APEX team may, at their option, make audio and video recordings of presentations from the Offeror’s speakers at public events targeted at the APEX user communities (e.g., user training events, collaborative application events, best practices discussions, etc.). The Offeror will grant the APEX team use and distribution rights of documentation provided by the Offeror, session materials, and recorded media to be shared with other DOE Labs’ staff and all authorized users and support staff for Crossroads and NERSC-9.

8.1 Documentation

8.1.1 The Offeror shall provide documentation for each delivered system describing the configuration, interconnect topology, labeling schema, hardware layout, etc. of the system as deployed before the commencement of system acceptance testing.


8.1.2 The Offeror shall supply and support system and user-level documentation for all components before the delivery of the system. Upon request by the laboratories, the Offeror shall supply additional documentation necessary for operation and maintenance of the system. All user-level documentation shall be publicly available.

8.1.3 The Offeror shall distribute and update all documentation electronically and in a timely manner. For example, changes to the system shall be accompanied by relevant documentation. Documentation of changes and fixes may be distributed electronically in the form of release notes. Reference manuals may be updated later, but effort should be made to keep all documentation current.

8.2 Training

8.2.1 The Offeror shall provide the following types of training at facilities specified by ACES or NERSC:

                                                 Number of Classes
Class Type                                       ACES    NERSC
System Operations and Advanced Administration    2       2
User Programming                                 3       3

8.2.2 The Offeror shall describe all proposed training and documentation relevant to the proposed solutions utilizing the following methods:

§ Classroom training
§ Onsite training
§ Online documentation
§ Online training

9 References

APEX schedule and high-level information can be found at the primary APEX website http://apex.lanl.gov.

Crossroads/NERSC-9 benchmarks and workflows whitepaper can be found at the APEX Benchmark and Workflows website https://www.nersc.gov/research-and-development/apex/apex-benchmarks-and-workflows.

High Performance Computing Power Application Programming Interface Specification http://powerapi.sandia.gov.


Appendix A: Sample Acceptance Plans

Appendix A-1: LANS Sample Acceptance Plan

Testing of the system shall proceed in three steps: pre-delivery, post-delivery and acceptance. Each step is intended to validate the system and feeds into subsequent activities.

Pre-delivery (Factory) Test

The Subcontractor shall demonstrate all hardware is fully functional prior to shipping. If the system is to be delivered in separate shipments, each shipment shall undergo pre-delivery testing. If the Subcontractor proposes a development system subcomponent, LANS recognizes that the development system is not part of the pre-delivery acceptance criteria.

LANS and Subcontractor staff shall perform pre-delivery testing at the factory on the hardware to be delivered. Any limitations for performing the pre-delivery testing need to be identified including scale and licensing limitations.

• Demonstrate RAS capabilities and robustness, using simple fault injection techniques such as disconnecting cables, powering down subsystems, or installing known bad parts.

• Demonstrate functional capabilities on each segment of the system built, including the capability to build applications, schedule jobs, and run them using the customer-provided testing framework. The root cause of any application failure must be identified.

• The Offeror shall provide a file system sufficiently provisioned to support the suite of tests.

• Provide onsite and remote access for LANS staff to monitor testing and analyze results.

• Instill confidence in the ability to conform to the statement of work.

Pre-Delivery Assembly

• The Subcontractor shall perform the pre-delivery test of Crossroads or agreed-upon sub-configurations of Crossroads at the Subcontractor’s location prior to shipment. At its option, LANS may send a representative(s) to observe testing at the Subcontractor’s facility. Work to be performed by the Subcontractor includes:

o All hardware installation and assembly
o Burn in of all components
o Installation of software
o Implementation of the ACES-specific production system-configuration and programming environment
o Perform tests and benchmarks to validate functionality, performance, reliability, and quality

• Run benchmarks and demonstrate that benchmarks meet performance commitments.

Pre-Delivery Configuration

• TBD

Pre-Delivery Test

Subcontractor shall provide LANS on-site access to the system in order to verify that the system demonstrates the ability to pass acceptance criteria.

The pre-delivery test shall consist of (but is not limited to) the following tests:

Name of Test: Pass Criteria

System power up: All nodes boot successfully

System power down: All nodes shut down

Unix commands: All UNIX/Linux and vendor specific commands function correctly

Monitoring: Monitoring software shows status for all nodes

Reset: “Reset” functions on all nodes

Power On/Off: Power cycle all components of the entire system from the console

Fail Over/Resilience: Demonstrate proper operation of all fail-over or resilience mechanisms

Full Configuration Test: Pre-delivery system can efficiently run applications that use the entire compute resource of the pre-delivery system. The applications to be run will be drawn from the 72-hour test runs, scaled to the pre-delivery configuration

Benchmarks: Benchmarks shall achieve performance within the limits of pre-delivery configuration

72 Hour test: 100% availability of the pre-delivery system for a 72 hour test period while running an agreed-upon workload that exercises at least 99% of the compute resources

Post-delivery Integration and Test

Post-delivery Integration

During Post-Delivery Integration, the Subcontractor’s system(s) shall be delivered, installed, fully integrated, and shall undergo Subcontractor stabilization processes. Post-delivery testing shall include replication of all of the pre-delivery testing steps, along with appropriate tests at scale, on the fully integrated platform. Where applicable, tests shall be run at full scale.

Site Integration

When the Subcontractor has declared the system to be stable, the Subcontractor shall make the system available to LANS personnel for site-specific integration and customization. Once the Subcontractor’s system has undergone site-specific integration and customization, the acceptance test shall commence.

Acceptance Test

The Acceptance Test Period shall commence when the system has been delivered and physically installed, and has completed stabilization and site-specific integration and customization. The duration of the Acceptance Test period is defined in the Statement of Work. All tests shall be performed on the initial production configuration as defined by LANS.

The Subcontractor shall supply source code used, compile scripts, output, and verification files for all tests run by the Subcontractor. All such provided materials become the property of LANS. All tests shall be performed on the initial production configuration of the Crossroads system as it will be deployed to the ACES user community. LANS may run all or any portion of these tests at any time on the system to ensure the Subcontractor’s compliance with the requirements set forth in this document.

The acceptance test shall consist of a Functionality Demonstration, a System Boot Test, a System Resilience Test, a Performance Test, and an Availability Test, performed in that order.

Functionality Demonstration


Subcontractor and LANS will perform the Functionality Demonstration on a dedicated system. The Functionality Demonstration shall show that the system is configured and functions in accordance with the statement of work. Demonstrations shall include, but are not limited to, the following:

• Remote monitoring, power control and boot capability

• Network connectivity

• File system functionality

• Batch system

• System management software

• Program building and debugging (e.g., compilers, linkers, libraries, etc.)

• Unix functions

System Boot Test

Subcontractor and LANS will perform the System Boot Test on a dedicated system. The System Boot Test shall show that the system is configured and functions in accordance with the statement of work. Demonstrations shall include, but are not limited to, the following:

• Two successful system cold boots to production state, with no intervention to bring the system up. Production state is defined as running all system services required for production use and being able to compile and run parallel jobs on the full system. In a cold boot, all elements of the system (compute, login, I/O) are completely powered off before the boot sequence is initiated. All components are then powered on.

• Single node power-fail/reset test: Failure or reset of a single compute node shall not cause system-wide failure.

System Resilience Test

Subcontractor and LANS will perform the System Resilience Test on a dedicated system. The System Resilience Test shall show that the system is configured and functions in accordance with the statement of work.

All system resilience features of Crossroads shall be demonstrated via fault-injection tests when running test applications at scale. Fault injection operations should include both graceful and hard shutdowns of components. The metrics for resilience operations include correct operation, any loss of access or data, and time to complete the initial recovery plus any time required to restore (fail-back) a normal operating mode for the failed components.

Performance Test

Crossroads system performance and benchmark tests are fully documented in the Statement of Work along with guidance and test information found at this website: https://www.nersc.gov/research-and-development/apex/apex-benchmarks-and-workflows.

The Subcontractor shall run the Crossroads tests and application benchmarks, full configuration test, external network test and file system metadata test as described in the Application and Benchmark Run Rules document. Benchmark answers must be correct, and each benchmark result must meet or exceed performance commitments in the performance requirements section.

Benchmarks must be run using the supplied resource management and scheduling software. Except as required by the run rules, benchmarks need not be run concurrently. If requested by LANS, Subcontractor shall reconfigure the resource management software to utilize only a subset of compute nodes, specified by LANS.

JMTTI and System Availability Testing

The JMTTI and System Availability Test will commence after successful completion of the Functionality Demonstration, System Test and Performance Test. LANS will perform the JMTTI and Availability Test.

The Crossroads system must demonstrate the JMTTI and availability metrics defined in the Statement of Work, within an agreed-upon period of time. An automated job launch and outcome analysis tool, such as the Pavilion HPC Testing Framework, shall be used to manage an agreed-upon workload that will be used to measure the reliability of individual jobs. These jobs shall be a mixture of benchmarks from the Performance Test and other applications. Every test in the JMTTI and System Availability Test workload shall obtain a correct result in both dedicated and non-dedicated modes:

• In dedicated mode, each benchmark in the Performance Test shall meet the performance commitment specified in the Statement of Work. In non-dedicated mode, the mean performance of each performance test shall meet or exceed the performance commitment specified in the Statement of Work.

• During the JMTTI and System Availability Test, LANS shall have full access to the system and shall monitor the system. LANS and users designated by LANS shall submit jobs through the Crossroads resource management system.

• During the JMTTI and System Availability Test, the Subcontractor shall adhere to the following requirements:

o All hardware and software shall be fully functional at the end of the JMTTI and Availability Test. Any downtime required to repair failed hardware or software shall be considered an outage unless it can be repaired without impacting system availability.

o Hardware and software upgrades shall not be permitted during the last 7 days of the JMTTI and Availability Test. The system shall be considered down for the time required to perform any upgrades, including rolling upgrades.

o No significant (i.e., levels 1, 2 or 3) problems shall be open during the last 7 days.

• During the JMTTI and Availability Testing period, if any system software upgrade or significant hardware repairs are applied, the Subcontractor shall be required to run the Performance Tests and demonstrate that the changes incur no loss of performance. At its option, LANS may also run any other tests it deems necessary. Time taken to run the Performance and other tests shall not count as downtime, provided that all tests perform to specifications.
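The difference between the dedicated and non-dedicated performance criteria can be illustrated with a short sketch. It assumes a higher-is-better metric and reads "each benchmark shall meet the commitment" as applying to every individual dedicated run; both are assumptions made for illustration, and the function name and sample numbers are hypothetical.

```python
from statistics import mean


def meets_commitment(measurements, commitment, dedicated):
    """Check repeated results of one benchmark against its commitment.

    Dedicated mode: every run must meet the commitment (assumed reading).
    Non-dedicated mode: the mean of the runs must meet or exceed it.
    Assumes a higher-is-better metric.
    """
    if not measurements:
        raise ValueError("no measurements supplied")
    if dedicated:
        return all(m >= commitment for m in measurements)
    return mean(measurements) >= commitment


# Hypothetical results for one benchmark with a commitment of 100 units.
runs = [98, 103, 101]
print(meets_commitment(runs, 100, dedicated=True))   # False: one run is below 100
print(meets_commitment(runs, 100, dedicated=False))  # True: the mean is above 100
```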

Definitions for Node and System Failures

The baseline of interrupts, as used in the JMTTI and SMTBI calculations, shall include, but may not be limited to, the following circumstances:

• A node shall be defined as down if a hardware problem causes Subcontractor supplied software to crash or the node is unavailable. Failures that are transparent to Subcontractor-supplied software because of redundant hardware shall not be classified as a node being down as long as the failure does not impact node or system performance. Low severity software bugs and suggestions (e.g., wrong error message) associated with Subcontractor supplied software will not be classified as a node being down.

• A node shall be classified as down if a defect in the Subcontractor supplied software causes a node to be unavailable. Communication network failures external to the system, and user application program bugs that do not impact other users shall not constitute a node being down.


• Repeat failures within eight hours of the previous failure shall be counted as one continuous failure.

• The Subcontractor's system shall be classified as down (and all nodes shall be considered down) if any of the following requirements cannot be met (“system-wide failures”):

o Complete a POSIX `stat' operation on any file within all Subcontractor-provided file systems and access all data blocks associated with these files.

o Complete a successful interactive login to the Subcontractor's system. Failures in the ACES network do not constitute a system-wide failure.

o Successfully run any part of the performance test. The Performance Test consists of the Crossroads Benchmarks, the Full Configuration Test and the External Network Test.

o Full switch bandwidth is available. Failure of a switch adapter in a node does not constitute a system-wide failure. However, failure of a switch would constitute failure, even if alternate switch paths were available, because full bandwidth would not be available for multiple nodes.

o User applications can be launched and/or completed via the scheduler.

• Other failures in Subcontractor supplied products and services that disrupt work on a significant portion of the nodes shall constitute a system-wide outage.

• If there is a system-wide outage, LANS shall turn over the system to the Subcontractor for service when the Subcontractor indicates they are ready to begin work on the system. All nodes are considered down during a system-wide outage.

• Downtime for any outage shall begin when LANS notifies the Subcontractor of a problem (e.g., an official problem report is opened) and, for system outages, when the system is made available to the Subcontractor. Downtime shall end when:

o For problems that can be addressed by bringing up a spare node or by rebooting the down node, the downtime shall end when a spare node or the down node is available for production use.

o For problems requiring the Subcontractor to repair a failed hardware component, the downtime shall end when the failed component is returned to LANS and available for production use.

o For software downtime, the downtime shall end when the Subcontractor supplies a fix that rectifies the problem or when LANS reverts to a prior copy of the failing software that does not exhibit the same problem.

o A failure due to ACES or to other causes out of the Subcontractor's control shall not be counted against the Subcontractor unless the failure demonstrates a defect in the system. If there are any disagreements as to whether a failure is the fault of the Subcontractor or ACES, they shall be resolved prior to the end of the acceptance period.
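The rule that repeat failures within eight hours count as one continuous failure can be illustrated with a minimal Python sketch. It is not part of the requirements. It assumes each failure on a node is recorded as a (start, end) pair, and that the eight-hour window is measured from the end of the previous failure to the start of the next; the text above does not state which endpoints apply. The function names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the window runs from the end of one failure to the start of the next.
REPEAT_WINDOW = timedelta(hours=8)


def merge_repeat_failures(failures):
    """Merge failures on a single node that recur within eight hours.

    `failures` is a list of (start, end) datetime pairs. A failure that
    starts within REPEAT_WINDOW of the previous failure's end is folded
    into it, so the merged failure spans first start to last end.
    """
    merged = []
    for start, end in sorted(failures):
        if merged and start - merged[-1][1] <= REPEAT_WINDOW:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def downtime_hours(failures):
    """Total downtime in hours after merging repeat failures."""
    total = sum(
        (end - start).total_seconds()
        for start, end in merge_repeat_failures(failures)
    )
    return total / 3600.0
```

Under this reading, the gap between two merged failures is charged as downtime, which is why a node that fails twice in quick succession accrues more downtime than the sum of the two individual outages.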

Appendix A-2: NERSC Sample Acceptance Plan

Integration, Installation, and Acceptance Testing

Pre-delivery

The Subcontractor shall demonstrate all hardware is fully functional prior to shipping. If the system is to be delivered in separate shipments, each shipment should undergo pre-delivery testing. If the Subcontractor proposes a development system subcomponent, the University recognizes that the development system is not part of the pre-delivery acceptance criteria. Deliverables of any NRE effort that are integrated into the build system should be considered part of the pre-delivery acceptance criteria.

Pre-Delivery Assembly

The Subcontractor shall perform the pre-delivery test of the NERSC-9 system or agreed-upon sub-configurations of NERSC-9 at the Subcontractor’s location prior to shipment. At its option, the University may send a representative(s) to observe testing at the Subcontractor’s facility. Work to be performed by the Subcontractor includes:

• All hardware installation and assembly

• Burn-in of all components

• Installation of software

• Successful integration of any NRE components into the build system.

• Implementation of the University-specific production system configuration and programming environment

• Perform tests and benchmarks to validate functionality, performance, reliability, and quality

• Run benchmarks and demonstrate that benchmarks meet performance commitments

Pre-Delivery Test

Subcontractor shall provide the University on-site access to the system in order to verify that the system demonstrates the ability to pass acceptance criteria.

The pre-delivery test shall consist of (but is not limited to) the following tests:

| Name of Test | Pass Criteria |
|---|---|
| System power up | All nodes boot successfully |
| System power down | All nodes shut down |
| Unix commands | All UNIX/Linux and vendor-specific commands function correctly |
| Monitoring | Monitoring software shows status for all nodes |
| Reset | “Reset” functions on all nodes |
| Power On/Off | Power cycle all components of the entire system from the console |
| Fail Over/Resilience | Demonstrate proper operation of all fail-over or resilience mechanisms |
| Full Configuration Test | Full Configuration Test runs successfully on the system |
| Benchmarks | The system shall demonstrate the ability to achieve the required performance level on all benchmark requirements |
| 72 Hour test | High availability of the production system for a 72 hour test period under constant throughput load |

Post-delivery Integration and Test

The Subcontractor’s system(s) shall be delivered, installed, fully integrated, and shall undergo Subcontractor stabilization processes. Post-delivery testing shall include replication of all of the pre-delivery testing steps, along with appropriate tests at scale, on the fully integrated system.

Site Integration

When the Subcontractor has declared the system to be stable, the Subcontractor shall make the system available to University personnel for site-specific integration and customization. Once the Subcontractor’s system has undergone site-specific integration and customization, the acceptance test shall commence.

Acceptance Test

The Acceptance Test Period shall commence when the system has been delivered, physically installed, and undergone stabilization and site-specific integration and customization. The duration of the Acceptance Test Period shall not exceed 60 days.

All tests shall be performed on the production configuration as defined by the University.

The Subcontractor shall not be responsible for failures to meet the performance metrics or the availability metrics set forth in this Section, if such failure is the direct result of modifications made by the University to Subcontractor source code. Such suspension will be only for those requirements that fail due to the modification(s) and only for the length of time the modification(s) result(s) in the failure.

The Subcontractor shall supply source code used, compile scripts, output, and verification files for all tests. All such provided materials become the property of the University.

All tests shall be performed on a production configuration of the NERSC-9 system, as it will be deployed to the University user community. The University may run all or any portion of these tests at any time on the system to ensure the Subcontractor’s compliance with the requirements set forth in this document.

The acceptance test shall consist of Functionality Demonstrations, System Tests, System Resiliency Tests, Performance Tests, and an Availability Test, performed in that order.

Functionality Demonstration

Subcontractor and the University will perform the Functionality Demonstration on a dedicated system. The Functionality Demonstration shall show that the system is configured and functions in accordance with the statement of work. Demonstrations shall include, but are not limited to, the following:

• Remote monitoring, power control and boot capability

• Network connectivity

• File system functionality

• Batch system

• System management software

• Program building and debugging (e.g., compilers, linkers, libraries, etc.)

• Unix functions

System Test

Subcontractor and the University will perform the System Test on a dedicated system. The System Test shall show that the system is configured and functions in accordance with the statement of work. Demonstrations shall include, but are not limited to, the following:

• Two successful system cold boots to production state in accordance with required timings, with no intervention to bring the system up. Production state is defined as running all system services required for production use and being able to compile and run parallel jobs on the full system. In a cold boot, all elements of the system (compute, login, I/O, network) are completely powered off before the boot sequence is initiated. All components are then powered on.

• Single node power-fail/reset test: Failure or reset of a single compute node shall not cause a system-wide failure. A node shall reboot to production state after reset in accordance with required timings.

System Resilience Test

Subcontractor and the University will perform the System Resilience Test on a dedicated system. The System Resilience Test shall show that the system is configured and functions in accordance with the statement of work.

All system resilience features of the NERSC-9 system shall be demonstrated via fault-injection tests when running test applications at scale. Fault injection operations should include both graceful and hard shutdowns of components. The metrics for resilience operations include correct operation, any loss of access or data, and time to complete the initial recovery plus any time required to restore (fail-back) a normal operating mode for the failed components.

Performance Test

The Subcontractor shall run the NERSC-9 tests and application benchmarks, full configuration test, external network test and file system metadata test, a minimum of five times each as described in the Benchmark Run Rules section. Benchmark answers must be correct, and each benchmark result must meet or exceed performance commitments.

Benchmarks must be run using the supplied resource management and scheduling software. Except as required by the run rules, benchmarks need not be run concurrently. If requested by the University, Subcontractor shall reconfigure the resource management software to utilize only a subset of compute nodes, specified by the University. Performance must be consistent from run to run.

Availability Test

The Availability Test will commence after successful completion of the Functionality Demonstration, System Test and Performance Test. The Subcontractor shall perform the Availability Test; at this time or before, the University will add user accounts to the system. The Availability Test shall be 30 contiguous days in a sliding window within the Acceptance Test Period. The NERSC-9 system must demonstrate the required availability of the system.
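The sliding-window wording can be read as: the test passes once any run of 30 contiguous days within the Acceptance Test Period meets the required availability. The Python sketch below illustrates that reading only. The per-day record format, the function name, and the `required` threshold are assumptions; the required availability figure is not stated in this section. The sketch also ignores the additional conditions listed later (for example, no upgrades in the last 7 days).

```python
def find_passing_window(daily, required, window_days=30):
    """Return the start index of the first passing window, or None.

    `daily` holds one (scheduled_node_hours, downtime_node_hours) pair
    per day of the Acceptance Test Period. `required` is a fraction,
    e.g. 0.99 (hypothetical value, not taken from the document).
    """
    if len(daily) < window_days:
        return None

    # Running totals for the current window.
    scheduled = sum(s for s, _ in daily[:window_days])
    downtime = sum(d for _, d in daily[:window_days])
    start = 0
    while True:
        if scheduled > 0 and (scheduled - downtime) / scheduled >= required:
            return start
        end = start + window_days
        if end >= len(daily):
            return None
        # Slide the window forward by one day.
        scheduled += daily[end][0] - daily[start][0]
        downtime += daily[end][1] - daily[start][1]
        start += 1
```

Because the Acceptance Test Period may not exceed 60 days, at most 31 candidate windows exist, so a simple linear scan like this is sufficient.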

During the Availability Test, the University shall have full access to the system and shall monitor the system. The University and users designated by the University shall submit jobs through the NERSC-9 resource management system. These jobs shall be a mixture of benchmarks from the Performance Test and other applications.

The Subcontractor shall adhere to the System Availability and Reliability requirements as defined below:

• All hardware and software shall be fully functional at the end of the Availability Test. Any downtime required to repair failed hardware or software shall be considered an outage unless it can be repaired without impacting system availability.

• Hardware and software upgrades shall not be permitted during the last 7 days of the Availability Test. The system shall be considered down for the time required to perform any upgrades, including rolling upgrades.

• No significant (i.e., levels 1, 2 or 3) problems shall be open during the last 7 days.

• During the Availability Testing period, if any system software upgrade or significant hardware repairs are applied, the Subcontractor shall be required to run the Benchmark Tests and demonstrate that the changes incur no loss of performance. At its option, the University may also run any tests it deems necessary. Time taken to run the Benchmark and other tests shall not count as downtime, provided that all tests perform to specifications.

• Every test in the Functionality Test, Performance Test and NERSC-defined workload shall obtain a correct result in both dedicated and non-dedicated modes.

• In dedicated mode, each benchmark in the Performance Test shall meet or exceed the performance commitment and variation requirement.

• In non-dedicated mode, the mean performance of each performance test shall meet or exceed the performance commitment. The measured Coefficient of Variation (standard deviation divided by the mean) of results from each performance test shall not be greater than 5%.

• Node and system availability will be measured on a node hour basis as follows.

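$$\text{System Availability} = \frac{\sum_{i=1}^{N} (S_i - D_i)}{\sum_{i=1}^{N} S_i}$$

where:

$S_i$ is the number of scheduled hours for node $i$ (wall clock time minus downtime scheduled by the University)

$D_i$ is the number of hours of downtime for node $i$

$N$ is the number of nodes in the system

Node and system outages are defined in the following section.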
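The two numeric criteria above, node-hour availability and the 5% Coefficient of Variation limit, can be computed as in the following Python sketch. It is illustrative only. The function names are hypothetical; the use of the sample standard deviation is an assumption, since the document does not say whether the sample or population form applies; and the check assumes a figure of merit where higher values are better.

```python
from statistics import mean, stdev


def system_availability(nodes):
    """Node-hour availability: sum(S_i - D_i) / sum(S_i).

    `nodes` is an iterable of (scheduled_hours, downtime_hours) pairs,
    one per node.
    """
    nodes = list(nodes)
    total_scheduled = sum(s for s, _ in nodes)
    if total_scheduled <= 0:
        raise ValueError("no scheduled node hours")
    return sum(s - d for s, d in nodes) / total_scheduled


def coefficient_of_variation(results):
    """Standard deviation divided by the mean.

    Assumption: sample standard deviation (statistics.stdev).
    """
    return stdev(results) / mean(results)


def meets_non_dedicated_criteria(results, commitment, max_cov=0.05):
    """Mean must meet or exceed the commitment and CoV must not exceed 5%.

    Assumption: larger result values mean better performance.
    """
    return (
        mean(results) >= commitment
        and coefficient_of_variation(results) <= max_cov
    )
```

For a time-to-solution metric, where lower is better, the comparison against the commitment would need to be reversed.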

Definition of Node and System Failures

• A node shall be defined as down if a hardware problem causes Subcontractor-supplied software to crash or the node is unavailable. Failures that are transparent to Subcontractor-supplied software because of redundant hardware shall not be classified as a node being down as long as the failure does not impact node or system performance. Low-severity software bugs and suggestions (e.g., wrong error message) associated with Subcontractor-supplied software will not be classified as a node being down.

• A node shall be classified as down if a defect in the Subcontractor-supplied software causes a node to be unavailable. Communication network failures external to the system, and user application program bugs that do not impact other users, shall not constitute a node being down.

• Repeat failures within eight hours of the previous failure shall be counted as one continuous failure.

• The Subcontractor's system shall be classified as down (and all nodes shall be considered down) if any of the following requirements cannot be met (“system-wide failures”):

o Complete a POSIX ‘stat’ operation on any file within all Subcontractor-provided file systems and access all data blocks associated with these files.

o Complete a successful interactive login to the Subcontractor's system. Failures in the University network do not constitute a system-wide failure.

o Successfully run any part of the performance test. The Performance Test consists of the NERSC-9 Benchmarks, the Full Configuration Test and the External Network Test.

o Full switch bandwidth is available. Failure of a switch adapter in a node does not constitute a system-wide failure. However, failure of a switch would constitute failure, even if alternate switch paths were available, because full bandwidth would not be available for multiple nodes.

o User applications can be launched and/or completed via the scheduler.

• Other failures in Subcontractor-supplied products and services that disrupt work on a significant portion of the nodes shall constitute a system-wide outage.

• If there is a system-wide outage, the University shall turn over the system to the Subcontractor for service when the Subcontractor indicates they are ready to begin work on the system. All nodes are considered down during a system-wide outage.

• Downtime for any outage shall begin when the University notifies the Subcontractor of a problem (e.g., an official problem report is opened) and, for system outages, when the system is made available to the Subcontractor. Downtime shall end when:

o For problems that can be addressed by bringing up a spare node or by rebooting the down node, the downtime shall end when a spare node or the down node is available for production use.

o For problems requiring the Subcontractor to repair a failed hardware component, the downtime shall end when the failed component is returned to the University and available for production use.

o For software downtime, the downtime shall end when the Subcontractor supplies a fix that rectifies the problem or when the University reverts to a prior copy of the failing software that does not exhibit the same problem.

o A failure due to the University or to other causes out of the Subcontractor's control shall not be counted against the Subcontractor unless the failure demonstrates a defect in the system. If there are disputes as to whether a failure is the fault of the Subcontractor or the University, they shall be resolved prior to the end of the acceptance period.
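The first system-wide failure criterion in the list above (complete a POSIX stat on a file and access all of its data blocks) could be probed along the lines of the Python sketch below. This is an assumption about how a site might check the condition, not a procedure taken from the document. The function names and the choice of sample files are hypothetical, and a read served from a client-side cache may not exercise the underlying storage.

```python
import os


def probe_file(path, chunk_size=1 << 20):
    """Return True if `path` can be stat'ed and read in full.

    os.stat performs the POSIX stat call; reading to end-of-file
    touches every data block of a regular file.
    """
    try:
        size = os.stat(path).st_size
        bytes_read = 0
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                bytes_read += len(chunk)
        return bytes_read == size
    except OSError:
        return False


def filesystems_available(sample_files):
    """True only if every sample file, one or more per file system, passes."""
    return all(probe_file(path) for path in sample_files)
```

A monitoring job might call `filesystems_available` with at least one known file on each Subcontractor-provided file system and open a problem report on the first False result.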

Appendix B: LANS/UC Specific Project Management Requirements

Appendix B-1: LANS Project Management Requirements

NOTE: The following requirements apply to the project management of the delivery of the system proposed by the subcontractor. However, since there is a Non-Recurring Engineering (NRE) component to this Request for Proposal, NRE areas will also have similar project management requirements, should the proposals be negotiated into contracts. Key aspects will include the subcontractor NRE point of contact, NRE delivery milestone schedules, regular updates and reviews, and milestone approvals. The specific requirements for NRE project management will be negotiated by LANS and the selected subcontractor and will be based on the technical NRE areas proposed for evaluation.

Project Management

The development, pre-shipment testing, installation and acceptance testing of the Crossroads system is a complex endeavor and will require close cooperation between the Subcontractor, Los Alamos National Security, LLC (LANS), and ACES. There shall be quarterly executive reviews by corporate officers of the Subcontractor, ACES, and representatives of DOE/DP, to assess the progress of the project.

Project Planning Workshop

• LANS and Subcontractor shall schedule and complete a workshop to mutually understand and agree upon project management goals, techniques, and processes.

• The workshop shall take place no later than award + 45 days

Project Plan

• Delivery Milestone: no later than award + 60 days

Subcontractor shall provide LANS with a detailed Project Plan, which includes a detailed Work Breakdown Structure (WBS). The Project Plan shall contain all aspects of the proposed Subcontractor’s solution and associated engineering (hardware and software) and support activities. The Project Plan shall address or include:

• Program Management

• High Assurance Delivery Process

WBS:

o Facilities Planning (e.g., floor, power & cooling, cabling);

o Computer Hardware Planning;

o Installation & Test Planning;

o Deployment and Integration Milestones

o System Stability Planning;

o System Scalability Planning;

o Software Plan

o NRE deliverables

o Testing (Build and NRE)

o Development

o Interdependencies between Build and NRE

o Testing

o Deployment

o Risk Assessment & Risk Mitigation (Build and NRE)

o Staffing;

o On-site Warranty and Maintenance and Support Planning;

o Training & Education;

Project Plan – Program Management

At a minimum, the Project Plan – Program Management Section shall:

o Identify, by name, the Program Management Team members;

o Identify, by name, the lead Crossroads System Architect

o Identify, by name, the Crossroads System RAS Point of Contact

o Describe the roles and responsibilities of the Team members;

o List Subcontractor’s Management Contacts;

o Define and institutionalize the Periodic Progress Review process with regard to frequency (daily, weekly, monthly, quarterly, and annually), level (support, technical, and executive), and escalation procedures.

• Additionally, the Project Plan – Program Management Section shall detail the joint activities of the Subcontractor and LANS to monitor and assess the overall Program Performance.

• LANS will furnish the Subcontractor with a top-10 list of problems and issues. The Subcontractor is responsible for appointing a point of contact for each of the items on the list. This list shall be reviewed weekly.

• All Subcontractor Program Management shall interface with the designated LANS Crossroads project manager.

• The WBS will be updated by the Subcontractor monthly and reviewed for approval by LANS

• The Subcontractor Project Plan shall be updated by the Subcontractor quarterly and reviewed for approval by LANS

Project Plan - High Assurance Hardware Delivery Process

Subcontractor shall provide LANS with a high assurance delivery process and certification program for hardware deliverables of all stages of the deployment and operational use by the ASC Applications Community of the systems. All assets delivered shall be, at a minimum, factory-tested and field-certified.

A “pre-delivery test” shall take place at the factory prior to each shipment. Functional diagnostics and agreed-upon LANS applications shall be executed to verify the proper functioning of each system prior to shipment. Problems identified as a result of these tests shall be corrected prior to shipment. Assets that have successfully completed this pre-delivery test are “pre-verified.”

Project Plan - High Assurance Software Delivery Process

Subcontractor shall provide LANS with a high assurance delivery process and certification program for software deliverables of all stages of the deployment and operational use by the ASC Applications Community of the systems. In addition, Subcontractor shall provide LANS with documentation of Subcontractor’s anticipated software release schedules during the lifetime of the subcontract. This includes major and minor releases, updates, and fixes as well as expected beta-level availability.

• While Beta software and/or pre-GA software is anticipated to be installed and run on these systems, all such installations are subject to LANS approval;

• Subcontractor shall provide LANS with a list of interdependencies between hardware and software as they pertain to the delivered systems;

Project Plan – WBS, Milestones

Subcontractor shall define appropriate high-level Milestones for the execution of the delivery and acceptance of the Crossroads system.

Project Plan – WBS, Facilities Planning

Compliant with the requirements of the Facilities described in the Technical Requirements.

Project Plan – WBS, System Stability Planning

The stability of scalable systems of the size being delivered can at times prove difficult to predict. The number of components can have a significant effect on stability and may cause scalability problems that affect the stability of the system. LANS requires a plan to progressively qualify a series of configurations of increasing complexity, in terms of both processor counts and interconnect topology.

Subcontractor shall be responsible for delivering a Stabilization Plan that includes the following:

• Plan objectives

• Target Goals for Stability, as agreed to jointly with LANS

• Technical Strategy

• Roles and responsibilities

• Testing Plan

• Progress Evaluation Checkpoints

• Contingencies

Project Plan – Staffing:

• Staff Support shall be for the life of the subcontract.

• Subcontractor shall identify its members of the Project Team.

Project Plan – On-site Warranty and Maintenance and Support Planning

• On-site Warranty and Maintenance and Support shall be for the life of the subcontract

• On-site Warranty and Maintenance and Support shall include Subcontractor’s preventive maintenance schedule.

• On-site Warranty and Maintenance and Support shall include logging and weekly reporting of all interruptions to service. At a minimum, the Subcontractor shall enter all interrupt logging into the LANS tracking system.

Project Plan – Training and Education

• In addition to Subcontractor’s usual and customary customer Training and Education program, Subcontractor shall allow LANS staff access to Subcontractor’s internal Training & Education program;

• Training and Education Support shall be for the life of the subcontract.

Project Plan – Risk Assessment and Risk Mitigation

• Subcontractor shall provide LANS with a Risk Management Plan that identifies and addresses all identified risks.

• Subcontractor shall provide a risk management strategy for the proposed system in case of technology problems or scheduling delays that affect availability or achievement of performance targets in the proposed timeframe. Subcontractor shall describe the impact of substitute technologies on the overall architecture and performance of the system. In particular, the subcontractor shall address the technology areas listed below:

o Processor

o Memory

o High-Speed Interconnect

o Platform Storage and all other I/O subsystems

• Subcontractor shall continuously monitor and assess the risks involved for those major technology components that Subcontractor identifies to be on the Critical Path (i.e., Risk Assessment);

• Subcontractor shall provide LANS with timely and regular updates regarding Subcontractor’s Risk Assessment;

• Subcontractor shall provide LANS with a Risk Mitigation Plan. Each risk mitigation strategy shall be subject to LANS approval. Such Risk Mitigation Plan shall include:

o Risks Categorization – Risks shall be categorized according to:

§ Probability of occurrence (low, medium, or high)

§ Impact to the program if they occur (low, medium, or high)

o Dates for Risk Mitigation Decision Points Identified

o Execution of mitigation plans is subject to LANS approval and may include:

§ Technology Substitution – subject to the condition that substituted technologies shall not have aggregate performance, capability, or capacity less than originally proposed;

§ 3rd Party Assistance – especially in areas of critical software development;

§ Source Code Availability – especially in the areas of Operating Systems, Communication Libraries;

§ Performance Compensation – possibility of compensating for performance shortfalls via additional deliveries.

o Subcontractor’s Risk Mitigation Plan will be reviewed quarterly by LANS.
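The risk categorization described above (probability and impact, each rated low, medium, or high, with a decision-point date) maps naturally onto a small risk register. The Python sketch below shows one possible representation. The class and field names are hypothetical, and ranking risks by the product of probability and impact is a common convention assumed here for illustration; the document requires only the categorization, not any particular scoring.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Risk:
    name: str
    probability: Level   # probability of occurrence
    impact: Level        # impact to the program if the risk occurs
    decision_date: str   # ISO date of the mitigation decision point

    @property
    def score(self):
        # Assumption: probability x impact ranking, not mandated by the text.
        return int(self.probability) * int(self.impact)


def prioritize(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)
```

A register built this way could hold one entry per technology area named earlier (processor, memory, high-speed interconnect, platform storage) and be re-sorted for each quarterly review.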

Appendix B-2: UC Project Management Requirements

Project Management

The development, pre-shipment testing, installation and acceptance testing of the NERSC-9 system and the management of the Non-Recurring Engineering (NRE) subcontract(s) are complex endeavors and will require close cooperation between the Subcontractor and the Laboratory. There shall be quarterly executive reviews by corporate officers of the Subcontractor and UC to assess the progress of the project.

Project Planning Workshop

• LBNL and Subcontractor shall schedule and complete a workshop to mutually understand and agree upon project management goals, techniques, and processes.

• The workshop shall take place no later than 45 days after contract award

• The workshop shall address management goals, techniques and processes for the “Build” (NERSC-9) subcontract and the “NRE” subcontract.

Project Plan

• Subcontractor shall provide the University with detailed Project Plans, which include a detailed Work Breakdown Structure (WBS) for the “Build” and the “NRE” contracts. The Project Plans shall contain all aspects of the proposed Subcontractor’s solution and associated engineering (hardware and software) and support activities.

• The Project Plans shall be submitted no later than 60 days after contract award

• The Project Plans shall address or include:

o Project Management

o Work Breakdown Structure for each of the projects

o Facilities Planning information (e.g., floor, power & cooling, cabling requirements) as applicable to the Build contract

o Computer Hardware Planning

o Installation & Test Planning (including pre-delivery factory tests and acceptance tests)

o Deployment and Integration

o System Stability Planning

o System Scalability Planning

o Software Plan

o Development

o NRE deliverables

o Interdependencies between Build and NRE

o Testing (Build and NRE)

o Risk Assessment & Risk Mitigation (Build and NRE)

o Staffing (for the life of the subcontracts)

o On-site Support and Services Planning (for the life of the subcontracts)

o Training & Education

Project Management Team

• The Subcontractor shall appoint a Project Manager (PM) for the purposes of executing the Project Management Plan for the “Build” system on behalf of the Subcontractor. The PM for the ACES/Crossroads system and the NERSC-9 system shall be the same individual.

• The NRE contract(s) shall also have a Project Manager assigned to oversee the execution of the NRE contract on behalf of the Subcontractor. The PM for the ACES/Crossroads NRE and the NERSC-9 NRE shall be the same individual.

• The PMs for the system build and NRE subcontracts shall closely coordinate the projects. It is desirable that the same individual be the lead PM for all 4 subcontracts.

• The PMs shall be assigned for the duration of the subcontract. The PM for the “Build” system shall be based in the Bay Area through the installation and acceptance of the delivered System. When the PMs are unavailable due to vacation, sick leave, or other absence, the Subcontractor shall provide backups who are knowledgeable of the NERSC-9 “Build” and “NRE” projects and have the authority to make decisions in the absence of the PM. The PMs or backups shall be available for emergency situations via phone on a 24x7 basis.

Subcontractor Management Contacts

The following positions in the Subcontractor management chain are responsible for performance under this subcontract:

• Technical Contact(s)

• Service Manager(s)

• Contract Manager(s)

• Account Manager(s)

Roles and Responsibilities for each of the PMs and management chain (Build and NRE)

The PM has responsibility for overall customer satisfaction and subcontract performance. It is anticipated that he/she shall be an experienced Subcontractor employee with working knowledge of the products and services proposed. The Subcontractor’s PM can and shall:

• Delegate program authority and responsibility to Subcontractor personnel

• Establish internal schedules consistent with the subcontract schedule and respond appropriately to schedule redirection from the designated University authority

• Establish team communication procedures

• Conduct regularly scheduled review meetings

• Approve subcontract deliverables for submittal to the University

• Obtain required resources from the extensive capabilities available from within the Subcontractor and from outside sources

• Act as conduit of information and issues between the University and the Subcontractor

• Provide for timely resolution of problems

• Apprise the University of new hardware and software releases and patches within one week of release to the general marketplace and provide the University with said software within two weeks of request

The PM shall serve as the primary interface for the University into the Subcontractor, managing all aspects of the Subcontractor in response to the program requirements.

• The Technical Contacts shall be responsible for:

o Developing (Build) System configurations to technical design requirements

o Translation of NRE requirements to deliverables and tracking said deliverables

o Updating the University on the Subcontractor’s products and directions

o Working with the respective PMs to review the Subcontractor’s adherence to the Subcontracts

• The Contract Managers:

o Are the Subcontractor’s primary interface for subcontract matters

o Are authorized to sign subcontract documents committing the Subcontractor

o Support the Project Manager by submitting formal proposals and accepting subcontract modifications.

• The Service Managers have the responsibility for:

o Compliance with the Subcontractor’s hardware service requirements.

o Determining workload requirements and assigning services personnel to support the University

o Managing the Subcontractor’s overall service delivery to the University

o Meeting with University personnel regularly to review whether the Subcontractor’s service is fulfilling the University’s requirements

o Helping Subcontractor’s service personnel understand University business needs and future directions

Periodic Progress Reviews

Daily Communication (Build Contract)

• For the Build contract, the Subcontractor’s PM or designate shall communicate daily with the University’s Technical Representatives or designate and appropriate University staff. These daily communications shall commence shortly after subcontract award and continue until both parties agree they are no longer needed. The topics covered in this meeting include:

o System problems – status including escalation

o Non-system problems

o Impending deliveries

o Other topics as appropriate

• The Subcontractor’s PM (or designate) is the owner of this meeting. Target duration for this meeting is one-half hour. Both Subcontractor and the University may submit agenda items for this meeting.

Weekly Status Meeting (Build and NRE contracts)

• The Subcontractor’s PM shall schedule this meeting. Target duration is one hour. Attendees normally include the Subcontractor’s PM, Service Manager, University’s Procurement Representative, Technical Representative and System Administrator(s) as well as other invitees. Topics covered in this meeting include:

o Review of the past seven days and the next seven days with a focus on problems, resolutions, and impending milestones

o Review of the University’s top-10 list of problems and issues.

o Specifically for the Build system

§ System reliability

§ System utilization

§ System configuration changes

§ Open issues (hardware/software) shall be presented by the Subcontractor’s PM. Open issues that are not closed at this meeting shall have an action plan defined and agreed upon by both parties by close of this meeting

o Specifically for the NRE contract(s)

§ Progress towards deliverables

§ Progress towards meeting technical milestones in the Build contract

§ Implications of NRE deliverables for the Build system configuration

§ Other topics as appropriate

Extended Status Review Meeting (Build and NRE contracts)

• Periodically, but no more than once per month and no less than once per quarter, an Extended Status Review Meeting will be conducted in lieu of the Weekly Status Meeting.

• Separate meetings shall be conducted for the NRE and Build contracts.

• The Subcontractor’s PM shall schedule this meeting with the agreement of the University’s Technical Representative. Target duration is one to three hours. Attendees normally include: Subcontractor’s PM, Technical Contact, Field Service Manager and Line Management, University’s Procurement Representative, Technical Representative and Line Management as well as other invitees. Topics covered in this meeting include:

o Review of the past 30 days and the next 30 days with a focus on problems, resolutions and impending milestones (Subcontractor PM to present)

o Deliverables schedule status (Subcontractor PM to present)

o High priority issues (issue owners to present)

o For the “Build” system: Facilities issues (changes in product power, cooling, and space estimates for the to-be-installed products)

o All topics that are normally covered in the Weekly Status Meeting

o Other topics as appropriate

Both Subcontractor and the University may submit agenda items for this meeting.

Quarterly Executive Meeting (Build and NRE contracts)

• Subcontractor’s PM shall schedule this meeting. Target duration is six hours. Attendees normally include: Subcontractor’s PM, Subcontractor’s Senior Management, University’s Procurement Representative, Technical Representative, selected Management, selected Technical Staff and other invitees. Topics covered in this meeting include:

o Program status (Subcontractor to present)

o University satisfaction (University to present)

o Partnership issues and opportunities (joint discussion)

o Future hardware and software product plans and potential impacts for the University

o Participation by Subcontractor’s suppliers as appropriate

o Other topics as appropriate

o Both Subcontractor and the University may submit agenda items for this meeting.

• The meeting will cover both NRE and Build contract issues.

Hardware and Software Support (Build Contract)

• Severity Classifications

o The Subcontractor shall have documented problem severity classifications. These severity classifications shall be provided to the University along with descriptions defining each classification.

• Severity Response

o The Subcontractor shall have a documented response for each severity classification. The guidelines for how the Subcontractor will respond to each severity classification shall be provided to the University.

Problem Search Capabilities (Build and NRE contracts)

• The Subcontractor shall provide the capability of searching a problem database via a web page interface. This capability shall be made available to all individual University staff members designated by the University.

Problem Escalation (Build and NRE contracts)

• The Subcontractor shall utilize a problem escalation system that initiates escalation based either on time or the need for more technical support. Problem escalation procedures are the same for hardware and software problems. A problem is closed when all commitments have been met, the problem is resolved and the University is in agreement.

• As applicable to either contract, the University initiates problem notification to onsite Subcontractor personnel, or designated Subcontractor on-call staff.
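
As an illustration only, the escalation trigger and closure condition described above can be sketched as two predicates. The `Problem` fields and the `time_limit` parameter are hypothetical names introduced here; the document does not specify time limits, which would presumably come from the Subcontractor's documented severity responses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Problem:
    """Hypothetical problem record; field names are assumptions."""
    opened_at: datetime
    commitments_met: bool = False
    resolved: bool = False
    university_agrees: bool = False
    needs_more_technical_support: bool = False


def is_closed(problem: Problem) -> bool:
    """A problem is closed only when all three conditions in the text hold."""
    return (problem.commitments_met
            and problem.resolved
            and problem.university_agrees)


def should_escalate(problem: Problem, now: datetime,
                    time_limit: timedelta) -> bool:
    """Escalate on elapsed time OR on the need for more technical support.

    time_limit is an assumed, severity-dependent threshold.
    """
    if is_closed(problem):
        return False
    timed_out = now - problem.opened_at >= time_limit
    return timed_out or problem.needs_more_technical_support
```

Because the procedure is the same for hardware and software problems, the sketch carries no problem-type field.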

Risk Management (Build and NRE contracts)

• The Subcontractor shall continuously monitor and assess risks affecting the successful completion of the NERSC-9 project (Build and NRE contracts), and provide the University with documentation to facilitate project management, and to assist the University in its risk management obligations to DOE.

• The Subcontractor shall provide the University with a Risk Management Plan (RMP) for the technological, schedule and business risks of the NERSC-9 project. The RMP describes the Subcontractor’s approach to managing NERSC-9 project risks by identifying, analyzing, mitigating, contingency planning, tracking, and ultimately retiring project risks.

• The Plan shall address both the Build and the NRE portions of the project.

• The initial plan is due 30 days after award of the Subcontract. Once approved by the University, the University shall review the Subcontractor’s RMP annually.

• The Subcontractor shall also maintain a formal Risk Register (RR) documenting all individual risk elements that may affect the successful completion of the NERSC-9 project (both Build and NRE contracts). The RR is a database managed using an application and format approved by the University.

• The initial RR is due 30 days after award of the Subcontract. The RR shall be updated at least monthly, and before any Critical Decision (CD) reviews with DOE. After acceptance, the RR shall be updated quarterly.

• Along with each required update to the RR, the Subcontractor shall provide a Risk Assessment Report (RAR) summarizing the status of the risks and any material changes. The initial report and subsequent updates will be reviewed and approved by the University’s Technical Representative or his/her designee.

Risk Management Plan

• The purpose of this RMP, as detailed below, is to document, assess and manage Subcontract’s risks affecting the NERSC-9 project:

o Document procedures and methodology for identifying and analyzing known risks to the NERSC-9 project along with tactics and strategies to mitigate those risks.

o Serve as a basis for identifying alternatives to achieving cost, schedule, and performance goals.

o Assist in making informed decisions by providing risk-related information.

The RMP shall include, but is not limited to, the following components: management, hardware, software; risk assessment, mitigation and contingency plan(s) (fallback strategies).

Risk Register

• The RR shall include an assessment of each likely risk element that may impact the NERSC-9 project. For each identified risk, the report shall include:

o Root cause of identified risk

o Probability of occurrence (low, medium, or high)

o Impact to the project if the risk occurs (low, medium, or high)

o Impact identifies the consequence of a risk event affecting cost, schedule, performance, and/or scope.

o Risk mitigation steps to be taken to reduce likelihood of risk occurrence and/or steps to reduce impact of risk.

• Execution of mitigation plans is subject to University approval and may include:

o Technology substitution - subject to the condition that substituted technologies shall not have aggregate performance, capability, or capacity less than originally proposed;

o 3rd party assistance - especially in areas of critical software development;

o Performance compensation - possibility of compensating for performance shortfalls via additional deliveries.

o Dates for risk mitigation decision points.

o Contingency plans to be executed should risk occur; subject to University approval

o Owner of the risk.
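
The per-risk fields listed above could be captured in a record along the following lines. This is a minimal sketch, not a prescribed schema: the class and field names are assumptions, and the actual RR application and format are whatever the University approves. The `severity` score (probability times impact) is a common convention adopted here for illustration; the document does not define how severity is derived.

```python
from dataclasses import dataclass
from datetime import date
from enum import IntEnum
from typing import List


class Level(IntEnum):
    """Three-point scale used for both probability and impact."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class RiskEntry:
    """One Risk Register row; all names are hypothetical."""
    title: str
    root_cause: str
    probability: Level
    impact: Level
    impact_areas: List[str]       # subset of cost, schedule, performance, scope
    mitigation_steps: List[str]   # execution subject to University approval
    decision_dates: List[date]    # risk mitigation decision points
    contingency_plan: str         # subject to University approval
    owner: str
    project_area: str             # "Build" or "NRE"
    retired: bool = False

    def severity(self) -> int:
        # Assumed scoring rule: probability x impact, range 1..9.
        return int(self.probability) * int(self.impact)


# Usage with made-up example values.
entry = RiskEntry(
    title="Example risk",
    root_cause="Example root cause",
    probability=Level.MEDIUM,
    impact=Level.HIGH,
    impact_areas=["schedule", "performance"],
    mitigation_steps=["Example mitigation step"],
    decision_dates=[date(2019, 1, 15)],
    contingency_plan="Example fallback strategy",
    owner="Subcontractor PM",
    project_area="Build",
)
```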

Risk Assessment Report

• The RAR shall include the following:

o Total number of risks grouped by severity and project area (NRE and Build).

o Summary of newly identified risks from last reporting period.

o Summary of any risks retired since the last report.

o Identification and discussion of the status of the Top 10 (watch list) risks.
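
The four RAR items above amount to a small aggregation over the Risk Register. The sketch below shows one way to derive them; the `Risk` tuple, its `score` ranking field, and the choice to count only active (non-retired) risks in the totals are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, NamedTuple


class Risk(NamedTuple):
    """Hypothetical, simplified view of a Risk Register entry."""
    name: str
    severity: str          # assumed scale: "low", "medium", "high"
    area: str              # "NRE" or "Build"
    score: int             # assumed ranking score; higher is more serious
    is_new: bool = False       # identified during the last reporting period
    is_retired: bool = False   # retired since the last report


def assessment_report(risks: List[Risk], top_n: int = 10) -> dict:
    """Build the four summary items of a Risk Assessment Report."""
    active = [r for r in risks if not r.is_retired]
    # Count active risks per (severity, project area) pair.
    counts = Counter((r.severity, r.area) for r in active)
    # Watch list: the top_n active risks by descending score.
    watch = sorted(active, key=lambda r: r.score, reverse=True)[:top_n]
    return {
        "counts_by_severity_and_area": dict(counts),
        "new_risks": [r.name for r in risks if r.is_new],
        "retired_risks": [r.name for r in risks if r.is_retired],
        "watch_list": [r.name for r in watch],
    }
```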

Definitions and Glossary

Baseline Memory: High performance memory technologies such as DDR-DRAM, HBM, and HMC, for example, that may be included in the system’s memory capacity requirement. It does not include memory associated with caches.

Coefficient of Variation: The ratio of the standard deviation to the mean.

Delta-Ckpt: The time to checkpoint 80% of aggregate memory of the system to persistent storage. For example, if the aggregate memory of the compute partition is 3 PiB, Delta-Ckpt is the time to checkpoint 2.4 PiB. Rationale: This will provide a checkpoint efficiency of about 90% for full system jobs.

Ejection Bandwidth: Bandwidth leaving the node (i.e., NIC to router).

Full Scale: All of the compute nodes in the system. This may or may not include all available compute resources on a node, depending on the use case.

Idle Power: The projected power consumed on the system when the system is in an Idle State.

Idle State: A state when the system is prepared to but not currently executing jobs. There may be multiple idle states.

Injection Bandwidth: Bandwidth entering the node (i.e., router to NIC).

Job Interrupt: Any system event that causes a job to unintentionally terminate.

Job Mean Time to Interrupt (JMTTI): Average time between job interrupts over a given time interval on the full scale of the system. Automatic restarts do not mitigate a job interrupt for this metric.

JMTTI/Delta-Ckpt: Ratio of the JMTTI to Delta-Ckpt, which provides a measure of how much useful work can be achieved on the system.
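
To illustrate how the JMTTI/Delta-Ckpt ratio relates to the "about 90%" checkpoint efficiency cited under Delta-Ckpt, the sketch below applies the first-order Young approximation for the optimal checkpoint interval. Both the choice of model and the ratio of 200 used in the example are assumptions: this section does not state which model underlies the rationale, and 200 is picked here only because it reproduces the 90% figure.

```python
import math


def delta_ckpt_data_pib(aggregate_memory_pib: float) -> float:
    """Amount of data checkpointed: 80% of aggregate memory."""
    return 0.8 * aggregate_memory_pib


def checkpoint_efficiency(jmtti_over_delta: float) -> float:
    """First-order estimate of useful-work fraction (assumed model).

    With checkpoint time d and mean time to interrupt M, Young's
    optimal interval is sqrt(2*d*M). The fraction of time lost to
    writing checkpoints plus rework after an interrupt is then about
    sqrt(2*d/M), so efficiency is 1 - sqrt(2 / (M/d)).
    """
    if jmtti_over_delta <= 2.0:
        raise ValueError("approximation is only meaningful for M/d >> 2")
    return 1.0 - math.sqrt(2.0 / jmtti_over_delta)


print(delta_ckpt_data_pib(3.0))                 # data volume for a 3 PiB system
print(round(checkpoint_efficiency(200.0), 3))   # illustrative ratio of 200
```

Under this model the efficiency depends only on the ratio, which is why the glossary defines JMTTI/Delta-Ckpt as a single figure of merit rather than treating the two times separately.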

Nominal Power: The projected power consumed on the system by the APEX workflows (e.g., a combination of the APEX benchmark codes running large problems on the entire system).

Peak Power: The projected power consumed by an application that utilizes the maximum achievable power consumption such as DGEMM.

Platform Storage: Any nonvolatile storage that is directly usable by the system, its system software, and applications. Examples would include disk drives, RAID devices, and solid state drives, no matter the method of attachment.

Rolling Upgrades / Rolling Rollbacks: A rolling upgrade or a rollback is defined as changing the operating software or firmware of a system component in such a way that the change does not require synchronization across the entire system. Rolling upgrades and rollbacks are designed to be performed with those parts of the system that are not being worked on remaining in full operational capacity.

System Interrupt: Any system event, or accumulation of system events over time, resulting in more than 1% of the compute resource being unavailable at any given time. Loss of access to any dependent subsystem (e.g., platform storage or service partition resource) will also incur a system interrupt.

System Mean Time Between Interrupt (SMTBI): Average time between system interrupts over a given time interval.

System Availability: ((time in period – time unavailable due to outages in period) / (time in period – time unavailable due to scheduled outages in period)) * 100

System Initialization: The time to bring 99% of the compute resource and 100% of any service resource to the point where a job can be successfully launched.
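
The System Availability formula above can be checked with a short calculation. This sketch assumes that "time unavailable due to outages" counts all outages, scheduled and unscheduled, so that scheduled time cancels out of a fully planned period; the function and parameter names are hypothetical.

```python
def system_availability(period_hours: float,
                        all_outage_hours: float,
                        scheduled_outage_hours: float) -> float:
    """Availability in percent, per the glossary formula.

    all_outage_hours is assumed to include the scheduled outage hours.
    """
    if scheduled_outage_hours > all_outage_hours:
        raise ValueError("scheduled outages cannot exceed total outages")
    denominator = period_hours - scheduled_outage_hours
    if denominator <= 0:
        raise ValueError("period must exceed scheduled outage time")
    return (period_hours - all_outage_hours) / denominator * 100.0


# Example with made-up numbers: a 720-hour period with 20 hours of
# outages in total, 12 of which were scheduled.
print(round(system_availability(720.0, 20.0, 12.0), 2))
```

With these inputs the numerator is 700 hours and the denominator 708 hours, so only the 8 unscheduled hours count against availability.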

Wall Plate (Name Plate) Power: The maximum theoretical power the system could consume. This is a design limit, likely not achievable in operation.