Helen He, NERSC User Services Group, July 10, 2014
Babbage: the Intel Many Core (MIC) Testbed System at NERSC
First Message
• Babbage can help you to prepare for Cori regarding thread scalability (hybrid MPI/OpenMP implementation) and vectorization.
• Performance on Babbage will be significantly worse than on Cori. However, using Babbage can expose bottlenecks and weaknesses in your code for improvement.
-2-
Outline
• Knights Corner (KNC) architecture and programming considerations
• System configurations and programming environment
• How to compile and run
• Examples of tuning kernels and applications
-3-
Basic Terminology
• MIC: Intel Many Integrated Cores architecture
• Xeon: Intel processors. Various product names include Nehalem, Westmere, Sandy Bridge (SNB), etc.
• Xeon Phi: Intel's marketing name for the MIC architecture.
– Some code names are: Knights Ferry (KNF), Knights Corner (KNC), Knights Landing (KNL)
• Knights Corner (KNC): first generation product of the MIC Xeon Phi implementation
– Co-processors connected to the host via PCIe
– Validate programming models
– Prepare for next generation production hardware
-4-
Babbage Nodes
• 1 login node: bint01
• 45 compute nodes, each has:
– Host node: 2 Intel Xeon Sandy Bridge EP processors, 8 cores each, 2.6 GHz, 256-bit AVX. Peak performance 166.4 GFlops
– 2 MIC cards (5110P), each with 60 native cores connected by a high-speed bidirectional ring, 1053 MHz, 4 hardware threads per core
– Peak performance 1011 GFlops per card
– 8 GB GDDR5 memory, peak memory bandwidth 320 GB/sec
– 512-bit SIMD instructions, holding 16 SP or 8 DP floating point numbers
-5-
Babbage KNC vs. Cori KNL
• Similarities
– Many integrated cores on a node (>60), 4 hardware threads per core
– MPI + OpenMP as the main programming model
– 512-bit vector length for SIMD
• Significant improvements in KNL
– Self-hosted architecture (not a co-processor!)
– 3X single thread performance over KNC
– Deeper out-of-order execution
– High bandwidth on-package memory
– Improved vectorization capabilities
– And more…
Even with these differences, Babbage can still be helpful in preparing applications for Cori. (Note: Edison can be used as well; future presentation.)
-6-
Programming Considerations
• Use "native" mode on KNC to mimic KNL: ignore the host and run completely on the KNC cards.
• We encourage single node exploration on the KNC cards with problem sizes that can fit.
• It is simple to port from multi-core CPU architectures, but hard to achieve high performance.
• You need to exploit high loop-level parallelism via threading and SIMD vectorization.
• Available threading models: OpenMP, pthreads, etc.
-7-
User Friendly Test System
• Configured with ease of use in mind
– All production file systems are mounted
– System SSH configuration allows password-less access to the host nodes and MIC cards
– Modules created and loaded by default to initiate the Intel compiler and MPI libraries
– Batch scheduler installed for allocating nodes
– MIC cards have access to system libraries, allowing multiple versions of software to co-exist
• No need to copy system libraries or binaries to each MIC card manually as a pre-step for running jobs
• Created scripts and wrappers to further simplify job launching commands
– User environment very similar to other production systems
-8-
Intel Cluster Studio XE Package
• Intel C, C++ and Fortran compilers
• Intel MKL: math kernel libraries
• Intel Integrated Performance Primitives (IPP): performance libraries
• Intel Trace Analyzer and Collector: MPI communications profiling and analysis
• Intel VTune Amplifier XE: advanced threading and performance profiler
• Intel Inspector XE: memory and threading debugger
• Intel Advisor XE: threading prototyping tool
• Intel Threading Building Blocks (TBB) and Intel Cilk Plus: parallel programming models
-9-
Available Software
Loaded by default:
-bash-4.1$ module list
Currently Loaded Modulefiles:
  1) modules     3) torque/4.2.6   5) intel/14.0.0   7) usg-default-modules/1.1
  2) nsg/1.2.0   4) moab/7.2.6     6) impi/4.1.1
Modules available:
-bash-4.1$ module avail
<omit system software modules…>
---------------- /usr/common/usg/Modules/modulefiles ----------------
advisor/4.300519                     impi/4.1.0                  szip/host-2.1
allineatools/4.2.1-36484 (default)   impi/4.1.1 (default)        szip/mic-2.1 (default)
fftw/3.3.4-host                      inspector/2013.304368       totalview/8.12.0-1 (default)
fftw/3.3.4-mic (default)             intel/13.0.1                usg-default-modules/1.0
hdf5/host-1.8.10-p1                  intel/13.1.2                usg-default-modules/1.1 (default)
hdf5/host-1.8.13                     intel/14.0.0 (default)      vtune/2013.update16 (default)
hdf5/mic-1.8.10-p1                   intel/14.0.3                zlib/host-1.2.7
hdf5/mic-1.8.13 (default)            itac/8.1.3                  zlib/host-1.2.8
hdf5-parallel/host-1.8.10-p1         netcdf/host-4.1.3           zlib/mic-1.2.7
hdf5-parallel/host-1.8.13            netcdf/host-4.3.2           zlib/mic-1.2.8 (default)
hdf5-parallel/mic-1.8.10-p1          netcdf/mic-4.1.3
hdf5-parallel/mic-1.8.13 (default)   netcdf/mic-4.3.2 (default)
-10-
How to Compile on Babbage
• Only the Intel compiler and Intel MPI are supported.
• Compile on the login node "bint01" directly to build an executable to run on the host or on the MIC cards.
• You can also compile on a host node. Do not "ssh bcxxxx" directly from "bint01"; instead, use "qsub -I -lnodes=1" to get a node allocated to you.
• Use "ifort", "icc" or "icpc" to compile serial Fortran, C, or C++ codes.
• Use "mpiifort", "mpiicc", or "mpiicpc" to compile parallel Fortran, C, or C++ MPI codes. (NOT mpif90, mpicc, or mpiCC)
• Use the "-openmp" flag for OpenMP codes.
• Use the "-mmic" flag to build an executable to run on the MIC cards.
• Example:
Build a binary for host: % mpiicc -openmp -o xthi.host xthi.c
Build a binary for MIC:  % mpiicc -mmic -openmp -o xthi.mic xthi.c
-11-
Spectrum of Programming Models

              Host Only     Offload       Symmetric        Native
                                          (Host and MIC)   (MIC only)
Xeon          Program foo   Program foo   Program foo      --
(Host)        call bar()    call bar()    call bar()
              End           End           End
Xeon Phi      --            bar()         Program foo      Program foo
(MIC)                                     call bar()       call bar()
                                          End              End

• Knights Landing (KNL) will be in self-hosted mode, thus eliminating the host and the need to communicate between host and MIC.
• We encourage users to focus on optimizing in the Native mode and explore on-node scaling on a single KNC card.
-12-
How to Run on Host
bint01% qsub -I -lnodes=2
<wait for a session>
% cd $PBS_O_WORKDIR
% cat $PBS_NODEFILE
bc1012
bc1011
% get_hostfile
% cat hostfile.$PBS_JOBID
bc1012-ib
bc1011-ib
% export OMP_NUM_THREADS=4
% mpirun -n 2 -hostfile hostfile.$PBS_JOBID -ppn 1 ./xthi.host
Hello from rank 0, thread 0, on bc1012. (core affinity = 0-15)
Hello from rank 0, thread 2, on bc1012. (core affinity = 0-15)
…
Hello from rank 1, thread 3, on bc1011. (core affinity = 0-15)
Hello from rank 1, thread 0, on bc1011. (core affinity = 0-15)
• Useful for comparing performance with running natively on the MIC cards.
-13-
How to Run on MIC Cards Natively
bint01% qsub -I -lnodes=2
<wait for a session>
% cd $PBS_O_WORKDIR
% cat $PBS_NODEFILE
bc1012
bc1011
% get_micfile
% cat micfile.$PBS_JOBID
bc1011-mic0
bc1011-mic1
bc1010-mic0
bc1010-mic1
% export OMP_NUM_THREADS=12
% export KMP_AFFINITY=balanced
% mpirun.mic -n 4 -hostfile micfile.$PBS_JOBID -ppn 1 ./xthi.mic | sort
Hello from rank 0, thread 0, on bc1011-mic0. (core affinity = 1)
Hello from rank 0, thread 1, on bc1011-mic0. (core affinity = 5)
Hello from rank 0, thread 10, on bc1011-mic0. (core affinity = 41)
Hello from rank 0, thread 11, on bc1011-mic0. (core affinity = 45)
…
Hello from rank 3, thread 6, on bc1010-mic1. (core affinity = 25)
Hello from rank 3, thread 7, on bc1010-mic1. (core affinity = 29)
Hello from rank 3, thread 8, on bc1010-mic1. (core affinity = 33)
Hello from rank 3, thread 9, on bc1010-mic1. (core affinity = 37)
-14-
Thread Affinity: KMP_AFFINITY
• none: default option on host
• compact: default option on MIC. Bind threads as close to each other as possible.
• scatter: bind threads as far apart as possible.
• balanced: only available on MIC. Spread threads across cores first, keeping threads with adjacent thread numbers on different hardware threads of the same core.
• explicit: example: setenv KMP_AFFINITY "explicit,granularity=fine,proclist=[1:236:1]"
• New env on coprocessors: KMP_PLACE_THREADS, for exact thread placement.
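A quick way to compare these options is to sweep over them in an interactive session. This is a hedged sketch; the mpirun.mic line is commented out because it assumes the MIC batch session and xthi.mic binary shown on the surrounding slides.

```shell
# Hypothetical sweep over the three main affinity choices on a MIC card.
for aff in compact scatter balanced; do
  export KMP_AFFINITY=$aff
  echo "== KMP_AFFINITY=$aff =="
  # mpirun.mic -n 1 -hostfile micfile.$PBS_JOBID -ppn 1 ./xthi.mic | sort
done
```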
-15-
Example: placing 6 threads on 3 cores with 4 hardware threads (HT) each.

compact:
Node     Core 1              Core 2              Core 3
         HT1 HT2 HT3 HT4     HT1 HT2 HT3 HT4     HT1 HT2 HT3 HT4
Thread    0   1   2   3       4   5

scatter:
Node     Core 1              Core 2              Core 3
         HT1 HT2 HT3 HT4     HT1 HT2 HT3 HT4     HT1 HT2 HT3 HT4
Thread    0   3               1   4               2   5

balanced:
Node     Core 1              Core 2              Core 3
         HT1 HT2 HT3 HT4     HT1 HT2 HT3 HT4     HT1 HT2 HT3 HT4
Thread    0   1               2   3               4   5
MPI Process Affinity: I_MPI_PIN_DOMAIN
• Map CPUs into non-overlapping domains
– 1 MPI process per domain
– OpenMP threads pinned inside each domain
• I_MPI_PIN_DOMAIN=<size>[:<layout>]
<size> =
  omp       adjust to OMP_NUM_THREADS
  auto      #CPUs / #MPI procs
  <n>       a number
<layout> =
  platform  according to BIOS numbering
  compact   close to each other
  scatter   far away from each other
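For example, a hypothetical pinning setup for MPI ranks with 59 OpenMP threads each (assuming the 236-thread budget on a KNC card mentioned later) might look like:

```shell
# Hypothetical I_MPI_PIN_DOMAIN setup: one domain per MPI rank, with the
# rank's OpenMP threads pinned inside it.
export OMP_NUM_THREADS=59
export I_MPI_PIN_DOMAIN=omp          # domain size follows OMP_NUM_THREADS
# an explicit equivalent: fixed-size domains, laid out compactly
export I_MPI_PIN_DOMAIN=59:compact
```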
-16-
3D Stencil Diffusion Algorithm
-17-
– On host: use 16 threads, KMP_AFFINITY=scatter.
– On MIC: tested with different numbers of threads (60, 120, 180, 236, 240), combined with various KMP_AFFINITY options.
– The best speedup on MIC is obtained with 180 threads and scatter affinity.
– Runs faster on host with the base option and the OpenMP-only option.
– Faster on MIC when vectorization is introduced together with OpenMP.
– OpenMP and vectorization both play significant roles on MIC.
– More advanced loop optimization techniques (loop peel and tiling) can improve performance further.
[Figure: 3D Stencil Diffusion speedup vs. optimization step (base, omp, omp+vect, peel, tiled) for compact, scatter, and balanced affinity]
[Figure: 3D Stencil Diffusion MFlops/sec on host and MIC for the base, omp, omp+vect, peel, and tiled versions]
STREAM
-18-
[Figure: STREAM Triad GB/sec vs. number of OpenMP threads (15-240) for the no-vec, basic, prefetch, cache-evict, streaming-stores, and opt builds]
– Best rate is 162 GB/sec with 60 OpenMP threads on 1 MIC card.
– 60% improvement with vectorization.
– Software prefetch helps significantly (35% improvement).
– Intel reports best performance of 174 GB/sec on Xeon Phi 7100P, 61 cores.
Tuning Lessons Learned
• Some code restructuring and algorithm modifications are needed to take advantage of the KNC architecture.
• Some applications won't be able to fit into memory with pure MPI due to the small memory size on KNC cards.
• It is essential to add OpenMP at as high a level as possible to exploit loop-level parallelism, and to make sure large, innermost, computationally intensive loops are vectorized.
• Explore the scalability of the OpenMP implementation.
• Try various MPI and OpenMP affinity options.
• Special compiler options on KNC also help.
• Memory alignment is important.
• Optimizations targeted for KNC can help performance on other architectures: Xeon, KNL.
-19-
Summary
• Performance on Babbage does not represent what it will be on Cori.
• Babbage can help you to prepare for Cori regarding thread scalability (hybrid MPI/OpenMP implementation) and vectorization.
• Please contact consult@nersc.gov with questions.
-20-
Further Information
• Babbage web page:
– https://www.nersc.gov/users/computational-systems/testbeds/babbage
• Intel Xeon Phi Coprocessor Developer Zone:
– http://software.intel.com/mic-developer
• Programming and Compiling for Intel MIC Architecture
– http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture
• Optimizing Memory Bandwidth on Stream Triad
– http://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad
• Interoperability with OpenMP API
– http://software.intel.com/sites/products/documentation/hpc/ics/impi/41/win/Reference_Manual/Interoperability_with_OpenMP.htm
• Intel Cluster Studio XE 2013
– http://software.intel.com/en-us/intel-cluster-studio-xe/
• Intel Xeon Phi Coprocessor High-Performance Programming. Jim Jeffers and James Reinders, published by Elsevier Inc., 2013.
-21-
Acknowledgement
• Babbage system support team, especially Nick Cardo, for configuring the system and solving many mysteries and issues.
• NERSC Application Readiness Team for testing, providing ideas, and reporting problems on the system.
-22-
Thank you.
-23-
Extra Slides
-24-
Babbage Compute Nodes
• 45 nodes: bc09xx, bc10xx, bc11xx
• Host node: 2 Intel Xeon Sandy Bridge E5-2670 processors
– Each processor has 8 cores with 2 hardware threads (HT not enabled), 2.6 GHz, peak performance 166.4 GFlops
– 128 GB memory per node
– Memory bandwidth 51.2 GB/sec
– AVX 32-byte aligned: AVX on host, 256-bit SIMD
• 2 MIC cards (5110P; bc09xx-mic0, bc09xx-mic1), each with:
– 60 native cores, connected by a high-speed bidirectional ring, clock speed 1053 MHz, Error Correcting Code (ECC) enabled
– 4 hardware threads per core
– 8 GB GDDR5 memory, effective speed 5 GT/s, peak memory bandwidth 320 GB/sec
– L1 cache per core: 32 KB 8-way associative data and instruction cache
– L2 cache per core: 512 KB 8-way associative inclusive cache with hardware prefetcher
– Peak performance 1011 GFlops
– Vector unit:
• 512-bit SIMD instructions
• 32 512-bit registers, holding 32 DP and 64 SP…
-25-
Memory Alignment
• Always align data at 64-byte boundaries to ensure it can be loaded from memory to cache optimally
– ~20% performance penalty without memory alignment for DGEMM (matrix size 6000x6000)
• Fortran: compile with the "-align array64byte" option to align all static array data to 64-byte memory address boundaries
• C/C++: declare variables as
– static: float var[100] __attribute__((aligned(64)));
– dynamic: _mm_malloc(size, 64)
• More options with compiler directives
-26-
SIMD and Vectorization
• Vectorization: the process of transforming a scalar instruction (SISD) into a vector instruction (SIMD)
• To tell the compiler to ignore potential dependencies and vectorize anyway:
– Fortran directive: !DIR$ SIMD
– C/C++ directive: #pragma simd
• Example: a, b, c are pointers; the compiler does not know they are independent

Not vectorized (potential aliasing):
for (i = 0; i < n; i++) a[i] = b[i] + c[i];

Not vectorized (true dependence on a[i-1]):
for (i = 0; i < n; i++) a[i] = b[i] + a[i-1];

Vectorized:
#pragma simd
for (i = 0; i < n; i++) a[i] = b[i] + c[i];
-27-
Wrapper Script for mpirun on MIC Cards
• In all module files:
– Set $LD_LIBRARY_PATH for libraries needed on the host
– Set $MIC_LD_LIBRARY_PATH for libraries needed on the MIC cards
• % cat mpirun.mic
#!/bin/sh
mpirun -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH "$@"
• Sample execution line:
% mpirun.mic -n 4 -hostfile micfile.$PBS_JOBID -ppn 2 ./xthi.mic
-28-
-DMIC Trick for Configure on MIC Card
• Sometimes when installing software libraries on MIC cards, a test program needs to be run. Because of the cross-compilation, the test program will fail.
• The trick is to define "-DMIC" in the compiler options such as CC, CXX, FC, etc. used by "configure": export CC="icc -DMIC", …
• Then replace every "-DMIC" in the Makefiles with "-mmic", and compile and build:
files=$(find ./* -name Makefile)
perl -p -i -e 's/-DMIC/-mmic/g' $files
-29-
Sample Batch Script
#PBS -q regular
#PBS -l nodes=2:mics=2
#PBS -l walltime=02:00:00
#PBS -V

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=60
export KMP_AFFINITY=balanced
mpirun.mic -n 4 -hostfile $PBS_MICFILE -ppn 1 ./myexe.mic
# (sometimes the full path to the executable is needed; otherwise you may see a "no such file or directory" error)
-------------------------------------
# Can use a custom hostfile with the -host option:
% mpirun.mic -n 4 -host bc1013-mic0,bc1012-mic1 -ppn 2 ./myexe.mic
# Can pass env with the -env option:
% mpirun.mic -n 16 -host bc1011-mic1 -env OMP_NUM_THREADS 2 -env KMP_AFFINITY balanced ./myexe.mic
-30-
Thread Affinity: KMP_PLACE_THREADS
• New setting, on coprocessors only. In addition to KMP_AFFINITY, it allows exact but still generic thread placement.
• KMP_PLACE_THREADS=<n>Cx<m>T,<o>O
– <n> cores times <m> threads, with an offset of <o> cores
– e.g. 40Cx3T,1O means use 40 cores with 3 threads (HT2,3,4) per core, starting at core 1
• The OS runs on logical proc 0, which lives on physical core 59
– OS procs on core 59: 0, 237, 238, 239
– Avoid using proc 0, i.e., use max_threads=236 on Babbage
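Putting the two points above together, a hypothetical setting that uses every core except the one running the OS might be:

```shell
# Hypothetical placement leaving physical core 59 (logical proc 0 and
# procs 237-239) to the OS: 59 cores x 4 hardware threads = 236 threads.
export KMP_PLACE_THREADS=59Cx4T
export OMP_NUM_THREADS=236
```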
-31-
Synthetic Benchmark Summary (Intel MKL) (5110P)
-32-
STREAM Compiler Options
• -no-vec:
– -O3 -mmic -openmp -no-vec -DSTREAM_ARRAY_SIZE=64000000
• Base:
– -O3 -mmic -openmp -DSTREAM_ARRAY_SIZE=64000000
• Base plus -opt-prefetch-distance=64,8:
– Software prefetch 64 cache lines ahead for L2 cache
– Software prefetch 8 cache lines ahead for L1 cache
• Base plus -opt-streaming-cache-evict=0:
– Turn off all cache line evicts
• Base plus -opt-streaming-stores always:
– Enable generation of streaming stores under the assumption that the application is memory bound
• Opt:
– Use all of the above flags (except -no-vec)
• Huge pages are no longer needed since they are now the default in MPSS
-33-
• novec: -O3 -no-vec -w -ftz -fno-alias -FR -convert big_endian -openmp -fpp -auto
• base: -mmic -O3 -w -openmp -FR -convert big_endian -align array64byte -vec-report6
• precision: -fimf-precision=low -fimf-domain-exclusion=15 -fp-model fast=1 -no-prec-div -no-prec-sqrt
• MIC: -opt-assume-safe-padding -opt-streaming-stores always -opt-streaming-cache-evict=0 -mP2OPT_hlo_pref_use_outer_strategy=F
WRF (Weather Research and Forecasting Model)
[Figure: WRF em_real time (sec) on MIC vs. number of OpenMP threads (15-240) for the novec, base, base+MIC, base+precision, and all option sets]
• Best time is 39.26 sec with 180 OpenMP threads.
• 15.1% improvement with all optimization flags compared with the base options.
Steps to Optimize BerkeleyGW
1. Refactor to create a hierarchical set of loops to be parallelized via MPI, OpenMP and vectorization, and to improve memory locality.
2. Add OpenMP at as high a level as possible.
3. Make sure large, innermost, flop-intensive loops are vectorized.
After optimization, 4 early Intel Xeon Phi cards with MPI/OpenMP are ~1.5X faster than 32 cores of Intel Sandy Bridge on the test problem.*
* eliminate spurious logic, some code restructuring, simplification and other optimizations
[Figure: time per code revision; lower is better]
Courtesy of Jack Deslippe