Helen He, NERSC User Services Group, July 10, 2014
Babbage: the Intel Many Core (MIC) Testbed System at NERSC
First Message
• Babbage can help you to prepare for Cori regarding thread scalability (hybrid MPI/OpenMP implementation) and vectorization.
• Performance on Babbage will be significantly worse than on Cori. However, using Babbage can expose bottlenecks and weaknesses in your code for improvement.
-2-
Outline
• Knights Corner (KNC) architecture and programming considerations
• System configurations and programming environment
• How to compile and run
• Examples of tuning kernels and applications
-3-
Basic Terminology
• MIC: Intel Many Integrated Cores architecture
• Xeon: Intel processors. Various product names include Nehalem, Westmere, Sandy Bridge (SNB), etc.
• Xeon Phi: Intel's marketing name for the MIC architecture.
– Some code names are: Knights Ferry (KNF), Knights Corner (KNC), Knights Landing (KNL)
• Knights Corner (KNC): first generation product of the MIC Xeon Phi implementation
– Co-processors connected to the host via PCIe
– Validate programming models
– Prepare for next generation production hardware
-4-
Babbage Nodes
• 1 login node: bint01
• 45 compute nodes, each has:
– Host node: 2 Intel Xeon Sandy Bridge EP processors, 8 cores each, 2.6 GHz, 256-bit AVX. Peak performance 166.4 GFlops
– 2 MIC cards (5110P), each with 60 native cores connected by a high-speed bidirectional ring, 1053 MHz, 4 hardware threads per core
– Peak performance 1011 GFlops per card
– 8 GB GDDR5 memory, peak memory bandwidth 320 GB/sec
– 512-bit SIMD instructions, holding 16 SP or 8 DP floating point numbers
-5-
Babbage KNC vs. Cori KNL
• Similarities
– Many integrated cores on a node (>60), 4 hardware threads per core
– MPI + OpenMP as the main programming model
– 512-bit vector length for SIMD
• Significant improvements in KNL
– Self-hosted architecture (not a co-processor!)
– 3X single thread performance over KNC
– Deeper out-of-order execution
– High bandwidth on-package memory
– Improved vectorization capabilities
– And more…
Even with these differences, Babbage can still be helpful in preparing applications for Cori. (Note: Edison can be used as well; future presentation.)
-6-
Programming Considerations
• Use "native" mode on KNC to mimic KNL: ignore the host and run completely on the KNC cards.
• We encourage single node exploration on the KNC cards with problem sizes that can fit.
• It is simple to port from multi-core CPU architectures, but hard to achieve high performance.
• You need to exploit high loop-level parallelism via threading and SIMD vectorization.
• Available threading models: OpenMP, pthreads, etc.
-7-
User Friendly Test System
• Configured with ease of use in mind
– All production file systems are mounted
– System SSH configuration allows password-less access to the host nodes and MIC cards
– Modules created and loaded by default to initiate the Intel compiler and MPI libraries
– Batch scheduler installed for allocating nodes
– MIC cards have access to system libraries, allowing multiple versions of software to co-exist
• No need to copy system libraries or binaries to each MIC card manually as a pre-step for running jobs
• Created scripts and wrappers to further simplify job launching commands
– User environment very similar to other production systems
-8-
Intel Cluster Studio XE Package
• Intel C, C++ and Fortran compilers
• Intel MKL: math kernel libraries
• Intel Integrated Performance Primitives (IPP): performance libraries
• Intel Trace Analyzer and Collector: MPI communications profiling and analysis
• Intel VTune Amplifier XE: advanced threading and performance profiler
• Intel Inspector XE: memory and threading debugger
• Intel Advisor XE: threading prototyping tool
• Intel Threading Building Blocks (TBB) and Intel Cilk Plus: parallel programming models
-9-
Available Software
Loaded by default:
-bash-4.1$ module list
Currently Loaded Modulefiles:
  1) modules     3) torque/4.2.6   5) intel/14.0.0   7) usg-default-modules/1.1
  2) nsg/1.2.0   4) moab/7.2.6     6) impi/4.1.1
Modules available:
-bash-4.1$ module avail
<omit system software modules…>
---------------- /usr/common/usg/Modules/modulefiles ----------------
advisor/4.300519                     impi/4.1.0                  szip/host-2.1
allineatools/4.2.1-36484 (default)   impi/4.1.1 (default)        szip/mic-2.1 (default)
fftw/3.3.4-host                      inspector/2013.304368       totalview/8.12.0-1 (default)
fftw/3.3.4-mic (default)             intel/13.0.1                usg-default-modules/1.0
hdf5/host-1.8.10-p1                  intel/13.1.2                usg-default-modules/1.1 (default)
hdf5/host-1.8.13                     intel/14.0.0 (default)      vtune/2013.update16 (default)
hdf5/mic-1.8.10-p1                   intel/14.0.3                zlib/host-1.2.7
hdf5/mic-1.8.13 (default)            itac/8.1.3                  zlib/host-1.2.8
hdf5-parallel/host-1.8.10-p1         netcdf/host-4.1.3           zlib/mic-1.2.7
hdf5-parallel/host-1.8.13            netcdf/host-4.3.2           zlib/mic-1.2.8 (default)
hdf5-parallel/mic-1.8.10-p1          netcdf/mic-4.1.3
hdf5-parallel/mic-1.8.13 (default)   netcdf/mic-4.3.2 (default)
-10-
How to Compile on Babbage
• Only the Intel compiler and Intel MPI are supported.
• Compile on the login node "bint01" directly to build an executable to run on the host or on the MIC cards.
• You can also compile on a host node. Do not "ssh bcxxxx" directly from "bint01"; instead, use "qsub -I -lnodes=1" to get a node allocated to you.
• Use "ifort", "icc" or "icpc" to compile serial Fortran, C, or C++ codes.
• Use "mpiifort", "mpiicc", or "mpiicpc" to compile parallel Fortran, C, or C++ MPI codes. (NOT mpif90, mpicc, or mpiCC)
• Use the "-openmp" flag for OpenMP codes.
• Use the "-mmic" flag to build an executable to run on the MIC cards.
• Example:
Build a binary for host: % mpiicc -openmp -o xthi.host xthi.c
Build a binary for MIC:  % mpiicc -mmic -openmp -o xthi.mic xthi.c
-11-
Spectrum of Programming Models

              Host Only     Offload       Symmetric        Native
                                          (Host and MIC)   (MIC only)
Xeon          Program foo   Program foo   Program foo      --
(Host)        call bar()    call bar()    call bar()
              End           End           End
Xeon Phi      --            bar()         Program foo      Program foo
(MIC)                                     call bar()       call bar()
                                          End              End

• Knights Landing (KNL) will be in self-hosted mode, thus eliminating the host and the need to communicate between host and MIC.
• We encourage users to focus on optimizing in the Native mode and explore on-node scaling on a single KNC card.
-12-
How to Run on Host
bint01% qsub -I -lnodes=2
<wait for a session>
% cd $PBS_O_WORKDIR
% cat $PBS_NODEFILE
bc1012
bc1011
% get_hostfile
% cat hostfile.$PBS_JOBID
bc1012-ib
bc1011-ib
% export OMP_NUM_THREADS=4
% mpirun -n 2 -hostfile hostfile.$PBS_JOBID -ppn 1 ./xthi.host
Hello from rank 0, thread 0, on bc1012. (core affinity = 0-15)
Hello from rank 0, thread 2, on bc1012. (core affinity = 0-15)
…
Hello from rank 1, thread 3, on bc1011. (core affinity = 0-15)
Hello from rank 1, thread 0, on bc1011. (core affinity = 0-15)
• Useful for comparing performance with running natively on the MIC cards.
-13-
How to Run on MIC Cards Natively
bint01% qsub -I -lnodes=2
<wait for a session>
% cd $PBS_O_WORKDIR
% cat $PBS_NODEFILE
bc1012
bc1011
% get_micfile
% cat micfile.$PBS_JOBID
bc1011-mic0
bc1011-mic1
bc1010-mic0
bc1010-mic1
% export OMP_NUM_THREADS=12
% export KMP_AFFINITY=balanced
% mpirun.mic -n 4 -hostfile micfile.$PBS_JOBID -ppn 1 ./xthi.mic | sort
Hello from rank 0, thread 0, on bc1011-mic0. (core affinity = 1)
Hello from rank 0, thread 1, on bc1011-mic0. (core affinity = 5)
Hello from rank 0, thread 10, on bc1011-mic0. (core affinity = 41)
Hello from rank 0, thread 11, on bc1011-mic0. (core affinity = 45)
…
Hello from rank 3, thread 6, on bc1010-mic1. (core affinity = 25)
Hello from rank 3, thread 7, on bc1010-mic1. (core affinity = 29)
Hello from rank 3, thread 8, on bc1010-mic1. (core affinity = 33)
Hello from rank 3, thread 9, on bc1010-mic1. (core affinity = 37)
-14-
Thread Affinity: KMP_AFFINITY
• none: default option on host
• compact: default option on MIC. Bind threads as close to each other as possible.
• scatter: bind threads as far apart as possible.
• balanced: only available on MIC. Spread threads across cores first, keeping threads with adjacent thread numbers on different hardware threads of the same core.
• explicit: example: setenv KMP_AFFINITY "explicit,granularity=fine,proclist=[1:236:1]"
• New env on coprocessors: KMP_PLACE_THREADS, for exact thread placement.
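A quick way to compare these options is to sweep over them in an interactive session. This is a hedged sketch; the mpirun.mic line is commented out because it assumes the MIC batch session and xthi.mic binary shown on the surrounding slides.

```shell
# Hypothetical sweep over the three main affinity choices on a MIC card.
for aff in compact scatter balanced; do
  export KMP_AFFINITY=$aff
  echo "== KMP_AFFINITY=$aff =="
  # mpirun.mic -n 1 -hostfile micfile.$PBS_JOBID -ppn 1 ./xthi.mic | sort
done
```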
-15-
Example: placing 6 threads on 3 cores with 4 hardware threads (HT) each.

compact:
Node     Core 1              Core 2              Core 3
         HT1 HT2 HT3 HT4     HT1 HT2 HT3 HT4     HT1 HT2 HT3 HT4
Thread    0   1   2   3       4   5

scatter:
Node     Core 1              Core 2              Core 3
         HT1 HT2 HT3 HT4     HT1 HT2 HT3 HT4     HT1 HT2 HT3 HT4
Thread    0   3               1   4               2   5

balanced:
Node     Core 1              Core 2              Core 3
         HT1 HT2 HT3 HT4     HT1 HT2 HT3 HT4     HT1 HT2 HT3 HT4
Thread    0   1               2   3               4   5
MPI Process Affinity: I_MPI_PIN_DOMAIN
• Map CPUs into non-overlapping domains
– 1 MPI process per domain
– OpenMP threads pinned inside each domain
• I_MPI_PIN_DOMAIN=<size>[:<layout>]
<size> =
  omp       adjust to OMP_NUM_THREADS
  auto      #CPUs / #MPI procs
  <n>       a number
<layout> =
  platform  according to BIOS numbering
  compact   close to each other
  scatter   far away from each other
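For example, a hypothetical pinning setup for MPI ranks with 59 OpenMP threads each (assuming the 236-thread budget on a KNC card mentioned later) might look like:

```shell
# Hypothetical I_MPI_PIN_DOMAIN setup: one domain per MPI rank, with the
# rank's OpenMP threads pinned inside it.
export OMP_NUM_THREADS=59
export I_MPI_PIN_DOMAIN=omp          # domain size follows OMP_NUM_THREADS
# an explicit equivalent: fixed-size domains, laid out compactly
export I_MPI_PIN_DOMAIN=59:compact
```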
-16-
3D Stencil Diffusion Algorithm
-17-
– On host: use 16 threads, KMP_AFFINITY=scatter.
– On MIC: tested with different numbers of threads (60, 120, 180, 236, 240), combined with various KMP_AFFINITY options.
– The best speedup on MIC is obtained with 180 threads and scatter affinity.
– Runs faster on host with the base option and the OpenMP-only option.
– Faster on MIC when vectorization is introduced together with OpenMP.
– OpenMP and vectorization both play significant roles on MIC.
– More advanced loop optimization techniques (loop peel and tiling) can improve performance further.
[Figure: 3D Stencil Diffusion speedup vs. optimization step (base, omp, omp+vect, peel, tiled) for compact, scatter, and balanced affinity]
[Figure: 3D Stencil Diffusion MFlops/sec on host and MIC for the base, omp, omp+vect, peel, and tiled versions]
STREAM
-18-
[Figure: STREAM Triad GB/sec vs. number of OpenMP threads (15-240) for the no-vec, basic, prefetch, cache-evict, streaming-stores, and opt builds]
– Best rate is 162 GB/sec with 60 OpenMP threads on 1 MIC card.
– 60% improvement with vectorization.
– Software prefetch helps significantly (35% improvement).
– Intel reports best performance of 174 GB/sec on Xeon Phi 7100P, 61 cores.
Tuning Lessons Learned
• Some code restructuring and algorithm modifications are needed to take advantage of the KNC architecture.
• Some applications won't be able to fit into memory with pure MPI due to the small memory size on KNC cards.
• It is essential to add OpenMP at as high a level as possible to exploit loop-level parallelism, and to make sure large, innermost, computationally intensive loops are vectorized.
• Explore the scalability of the OpenMP implementation.
• Try various MPI and OpenMP affinity options.
• Special compiler options on KNC also help.
• Memory alignment is important.
• Optimizations targeted for KNC can help performance on other architectures: Xeon, KNL.
-19-
Summary
• Performance on Babbage does not represent what it will be on Cori.
• Babbage can help you to prepare for Cori regarding thread scalability (hybrid MPI/OpenMP implementation) and vectorization.
• Please contact consult@nersc.gov with questions.
-20-
Further Information
• Babbage web page:
– https://www.nersc.gov/users/computational-systems/testbeds/babbage
• Intel Xeon Phi Coprocessor Developer Zone:
– http://software.intel.com/mic-developer
• Programming and Compiling for Intel MIC Architecture
– http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture
• Optimizing Memory Bandwidth on Stream Triad
– http://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad
• Interoperability with OpenMP API
– http://software.intel.com/sites/products/documentation/hpc/ics/impi/41/win/Reference_Manual/Interoperability_with_OpenMP.htm
• Intel Cluster Studio XE 2013
– http://software.intel.com/en-us/intel-cluster-studio-xe/
• Intel Xeon Phi Coprocessor High-Performance Programming. Jim Jeffers and James Reinders, published by Elsevier Inc., 2013.
-21-
Acknowledgement
• Babbage system support team, especially Nick Cardo, for configuring the system and solving many mysteries and issues.
• NERSC Application Readiness Team for testing, providing ideas, and reporting problems on the system.
-22-
Thank you.
-23-
Extra Slides
-24-
Babbage Compute Nodes
• 45 nodes: bc09xx, bc10xx, bc11xx
• Host node: 2 Intel Xeon Sandy Bridge E5-2670 processors
– Each processor has 8 cores with 2 hardware threads (HT not enabled), 2.6 GHz, peak performance 166.4 GFlops
– 128 GB memory per node
– Memory bandwidth 51.2 GB/sec
– AVX 32-byte aligned: AVX on host, 256-bit SIMD
• 2 MIC cards (5110P; bc09xx-mic0, bc09xx-mic1), each with:
– 60 native cores, connected by a high-speed bidirectional ring, clock speed 1053 MHz, Error Correcting Code (ECC) enabled
– 4 hardware threads per core
– 8 GB GDDR5 memory, effective speed 5 GT/s, peak memory bandwidth 320 GB/sec
– L1 cache per core: 32 KB 8-way associative data and instruction cache
– L2 cache per core: 512 KB 8-way associative inclusive cache with hardware prefetcher
– Peak performance 1011 GFlops
– Vector unit:
• 512-bit SIMD instructions
• 32 512-bit registers, holding 32 DP and 64 SP…
-25-
Memory Alignment
• Always align data at 64-byte boundaries to ensure it can be loaded from memory to cache optimally
– ~20% performance penalty without memory alignment for DGEMM (matrix size 6000x6000)
• Fortran: compile with the "-align array64byte" option to align all static array data to 64-byte memory address boundaries
• C/C++: declare variables as
– static: float var[100] __attribute__((aligned(64)));
– dynamic: _mm_malloc(size, 64)
• More options with compiler directives
-26-
SIMD and Vectorization
• Vectorization: the process of transforming a scalar instruction (SISD) into a vector instruction (SIMD)
• To tell the compiler to ignore potential dependencies and vectorize anyway:
– Fortran directive: !DIR$ SIMD
– C/C++ directive: #pragma simd
• Example: a, b, c are pointers; the compiler does not know they are independent

Not vectorized (potential aliasing):
for (i = 0; i < n; i++) a[i] = b[i] + c[i];

Not vectorized (true dependence on a[i-1]):
for (i = 0; i < n; i++) a[i] = b[i] + a[i-1];

Vectorized:
#pragma simd
for (i = 0; i < n; i++) a[i] = b[i] + c[i];
-27-
Wrapper Script for mpirun on MIC Cards
• In all module files:
– Set $LD_LIBRARY_PATH for libraries needed on the host
– Set $MIC_LD_LIBRARY_PATH for libraries needed on the MIC cards
• % cat mpirun.mic
#!/bin/sh
mpirun -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH "$@"
• Sample execution line:
% mpirun.mic -n 4 -hostfile micfile.$PBS_JOBID -ppn 2 ./xthi.mic
-28-
-DMIC Trick for Configure on MIC Card
• Sometimes when installing software libraries on MIC cards, a test program needs to be run. Because of the cross-compilation, the test program will fail.
• The trick is to define "-DMIC" in the compiler options such as CC, CXX, FC, etc. used by "configure": export CC="icc -DMIC", …
• Then replace every "-DMIC" in the Makefiles with "-mmic", and compile and build:
files=$(find ./* -name Makefile)
perl -p -i -e 's/-DMIC/-mmic/g' $files
-29-
Sample Batch Script
#PBS -q regular
#PBS -l nodes=2:mics=2
#PBS -l walltime=02:00:00
#PBS -V

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=60
export KMP_AFFINITY=balanced
mpirun.mic -n 4 -hostfile $PBS_MICFILE -ppn 1 ./myexe.mic
# (sometimes the full path to the executable is needed; otherwise you may see a "no such file or directory" error)
-------------------------------------
# Can use a custom hostfile with the -host option:
% mpirun.mic -n 4 -host bc1013-mic0,bc1012-mic1 -ppn 2 ./myexe.mic
# Can pass env with the -env option:
% mpirun.mic -n 16 -host bc1011-mic1 -env OMP_NUM_THREADS 2 -env KMP_AFFINITY balanced ./myexe.mic
-30-
Thread Affinity: KMP_PLACE_THREADS
• New setting, on coprocessors only. In addition to KMP_AFFINITY, it allows exact but still generic thread placement.
• KMP_PLACE_THREADS=<n>Cx<m>T,<o>O
– <n> cores times <m> threads, with an offset of <o> cores
– e.g. 40Cx3T,1O means use 40 cores with 3 threads (HT2,3,4) per core, starting at core 1
• The OS runs on logical proc 0, which lives on physical core 59
– OS procs on core 59: 0, 237, 238, 239
– Avoid using proc 0, i.e., use max_threads=236 on Babbage
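Putting the two points above together, a hypothetical setting that uses every core except the one running the OS might be:

```shell
# Hypothetical placement leaving physical core 59 (logical proc 0 and
# procs 237-239) to the OS: 59 cores x 4 hardware threads = 236 threads.
export KMP_PLACE_THREADS=59Cx4T
export OMP_NUM_THREADS=236
```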
-31-
Synthetic Benchmark Summary (Intel MKL) (5110P)
-32-
STREAM Compiler Options
• -no-vec:
– -O3 -mmic -openmp -no-vec -DSTREAM_ARRAY_SIZE=64000000
• Base:
– -O3 -mmic -openmp -DSTREAM_ARRAY_SIZE=64000000
• Base plus -opt-prefetch-distance=64,8:
– Software prefetch 64 cache lines ahead for L2 cache
– Software prefetch 8 cache lines ahead for L1 cache
• Base plus -opt-streaming-cache-evict=0:
– Turn off all cache line evicts
• Base plus -opt-streaming-stores always:
– Enable generation of streaming stores under the assumption that the application is memory bound
• Opt:
– Use all of the above flags (except -no-vec)
• Huge pages are no longer needed since they are now the default in MPSS
-33-
• novec: -O3 -no-vec -w -ftz -fno-alias -FR -convert big_endian -openmp -fpp -auto
• base: -mmic -O3 -w -openmp -FR -convert big_endian -align array64byte -vec-report6
• precision: -fimf-precision=low -fimf-domain-exclusion=15 -fp-model fast=1 -no-prec-div -no-prec-sqrt
• MIC: -opt-assume-safe-padding -opt-streaming-stores always -opt-streaming-cache-evict=0 -mP2OPT_hlo_pref_use_outer_strategy=F
WRF (Weather Research and Forecasting Model)
[Figure: WRF em_real time (sec) on MIC vs. number of OpenMP threads (15-240) for the novec, base, base+MIC, base+precision, and all option sets]
• Best time is 39.26 sec with 180 OpenMP threads.
• 15.1% improvement with all optimization flags compared with the base options.
Steps to Optimize BerkeleyGW
1. Refactor to create a hierarchical set of loops to be parallelized via MPI, OpenMP and vectorization, and to improve memory locality.
2. Add OpenMP at as high a level as possible.
3. Make sure large, innermost, flop-intensive loops are vectorized.
After optimization, 4 early Intel Xeon Phi cards with MPI/OpenMP are ~1.5X faster than 32 cores of Intel Sandy Bridge on the test problem.*
* eliminate spurious logic, some code restructuring, simplification and other optimizations
[Figure: time per code revision; lower is better]
Courtesy of Jack Deslippe