
Thorsten Kurth

Optimizing Code for Intel Xeon Phi 7250 (Knight’s Landing)

NUG Training, 06/09/2017

Multicore vs. manycore

• multicore (Edison)
‣ 5000 nodes
‣ 12 physical cores/CPU
‣ 24 HW threads/CPU
‣ 2.4-3.2 GHz
‣ 4 DP ops/cycle
‣ 30 MB L3 cache
‣ 64 GB/node
‣ 100 GB/s memory bandwidth

• manycore (Cori-KNL)
‣ 9600 nodes
‣ 68 physical cores/CPU
‣ 272 HW threads/CPU
‣ 1.2-1.6 GHz
‣ 2x8 DP ops/cycle
‣ no L3 cache
‣ 16 GB/node (fast) + 96 GB/node (slow)
‣ 450 GB/s memory bandwidth (fast)

Recompile and go?

• x86-64 compatible: can use codes built for older architectures, or recompile
• self-hosted: no need for offloading
• median speedup vs. Edison: 1.15x
• median speedup vs. Haswell: 0.70x

Why should I optimize my code?

• pros
‣ get more for your buck: make efficient use of existing manycore HPC systems
‣ fast success possible: many low-hanging fruits in unoptimized codes
‣ investing in the future: heterogeneous architectures are energy efficient and thus will stay around for a while
‣ benefits on multicore: optimizations targeting manycore architectures mostly improve performance on multicore systems as well
• cons
‣ effort: many of the most beneficial optimizations require significant code changes
‣ investing in the future: what if I bet on the wrong horse?

Optimization targets

• single-node performance
‣ start here: for a representative local problem size, single-node performance is an upper bound on what you get in multi-node
‣ many optimization opportunities, fast turnaround times
‣ many profiling tools available
• multi-node performance
‣ fewer optimization opportunities, profiling/debugging tedious
• IO performance
‣ not many opportunities for improvement

Where do I start?

• get to know your application: don't assume you already do!
• determine hotspots
‣ manual timers: be careful with thread safety / sync barriers (see the sketch below)
‣ profiling tools: NERSC offers a lot
- CrayPat (very lightweight)
- Advisor (finds time-consuming loops)
- VTune (can do a lot of things but is also rather slow)
- MAP (comparably lightweight)
• found hotspots, now what?
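A minimal sketch of a manual timer around a candidate hotspot; hotspot() is an illustrative placeholder, and the timer is started and stopped outside the (possibly threaded) kernel to sidestep the thread-safety issues mentioned above:

```c
#include <stdio.h>
#include <omp.h>

void hotspot(void);   /* the kernel under investigation (illustrative placeholder) */

void time_hotspot(void) {
    /* Start/stop the timer outside the threaded region: timing from inside
     * threads requires care with thread safety and synchronization barriers. */
    double t0 = omp_get_wtime();
    hotspot();
    double t1 = omp_get_wtime();
    printf("hotspot took %.3f s\n", t1 - t0);
}
```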

What architectural feature shall I target?

• KNL has many new features to explore
‣ many threads
‣ bigger vector units
‣ complex intrinsics (ISA)
‣ multiple memory tiers
• understand your hotspots
‣ compute bound: more threads, vectorization, ISA
‣ memory BW bound: memory tiers, more threads
‣ memory latency bound: more threads, vectorization

Prerequisites - compile and run

• recompile your code for KNL: code built for older CPUs is supported but does not make full use of the new architecture
‣ Cray (wrappers): module swap craype-haswell craype-mic-knl
‣ Intel: -xmic-avx512
‣ GNU: -march=knl
• use proper OpenMP settings:
  export OMP_NUM_THREADS=64
  export OMP_PLACES=threads
  export OMP_PROC_BIND=spread
• use the job-script generator on my.nersc.gov or the NERSC website
• node configuration: use -C knl,quad,cache as a start

Prerequisites - #FLOPS

• #FLOPS: number of floating point operations
• manual calculation:
‣ float addition and multiplication: +1
‣ complex multiplication: +6 (4 multiplications + 2 additions, see the sketch below)
‣ etc.
• measure with SDE:
‣ using SDE is more precise, because it accounts for masking
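As a worked instance of the manual count above, one complex multiplication expands into 4 real multiplications and 2 additions (the helper function is illustrative):

```c
#include <complex.h>

/* z = x * y for double complex, written out by hand:
 *   re(z) = re(x)*re(y) - im(x)*im(y)   -> 2 multiplications + 1 addition
 *   im(z) = re(x)*im(y) + im(x)*re(y)   -> 2 multiplications + 1 addition
 * => 6 flops per complex multiplication, matching the count on the slide. */
double complex cmul(double complex x, double complex y) {
    double re = creal(x) * creal(y) - cimag(x) * cimag(y);
    double im = creal(x) * cimag(y) + cimag(x) * creal(y);
    return re + im * I;
}
```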


Prerequisites - #BYTES

• #BYTES: number of bytes transferred from main memory
• manual calculation (not recommended, but a good check):
‣ count the bytes of data to be read and written in the kernel
‣ does not account for data reuse through caching
• measure with VTune:
‣ precisely obtains the uncore counter events

What is limiting my performance?

• Roofline performance model
‣ arithmetic intensity: AI = #FLOPS / #BYTES
‣ performance: P = #FLOPS / time [s]
‣ plot P vs. AI together with the architectural roofline R(AI) = min(memory bandwidth · AI, peak flops)
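As a worked example (the kernel and numbers are illustrative and not taken from the slides), AI can be computed by hand for a daxpy-style update and compared against the KNL roofline:

```c
#include <stddef.h>

/* y[i] = a*x[i] + y[i]
 * #FLOPS per iteration: 1 multiplication + 1 addition          = 2
 * #BYTES per iteration: read x[i], read y[i], write y[i] = 3*8 = 24
 * => AI = 2/24 ≈ 0.083 flops/byte
 * With ~450 GB/s MCDRAM bandwidth the roofline gives
 * R(AI) ≈ 0.083 * 450 GB/s ≈ 37 GFLOP/s, far below peak flops,
 * i.e. this kernel is memory bandwidth bound. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```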

Example roofline

[Figure: example roofline plot (P vs. AI), built up over several animation steps, with the annotations: memory bandwidth bound → use MCDRAM; compute bound → utilize vectorization; (possibly) memory latency bound → threading, vectorization; improve AI.]

How to improve AI?

• definition of arithmetic intensity: AI = #FLOPS / #BYTES
• two possibilities
‣ number of flops ⬆, number of bytes ➡ (not possible/easy, the choice of algorithm determines the flops)
‣ number of flops ➡, number of bytes ⬇
• reality: trade-off between both

Create more work/thread

• loop/kernel fusion: improves cache re-use and reduces overhead
• collapse nested loops (see the OpenMP sketch below)
• rearrange data structures: move OpenMP out (coarse grain)
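A minimal OpenMP sketch of collapsing a nested loop so that the whole iteration space is shared among threads; the loop body and array shape are illustrative:

```c
#include <omp.h>

#define NX 64
#define NY 64

void scale(double a[NX][NY], double s) {
    /* collapse(2) merges the i and j loops into one NX*NY iteration space,
     * providing enough work for many KNL threads even if the outer loop
     * alone is short. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < NX; ++i)
        for (int j = 0; j < NY; ++j)
            a[i][j] *= s;
}
```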


Loop transformations I

• loop tiling: improves cache re-use and can significantly improve performance (see the sketch below)
‣ especially relevant on KNL because of the missing L3
‣ blocking to the shared L2 (512 KiB) is usually good
‣ was my transformation successful? check L1/L2 miss rates, e.g. in VTune
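A minimal sketch of loop tiling, here for a transpose-like access pattern; the kernel, array size, and tile size are illustrative, and the tile size should be chosen so the working set fits into the shared L2:

```c
#define N 4096
#define TILE 64   /* a TILE x TILE block of doubles (32 KiB) fits in the 512 KiB L2 */

/* Tiled matrix transpose: work on TILE x TILE blocks so that data brought
 * into cache is reused before it gets evicted (no L3 to catch the misses). */
void transpose_tiled(const double *restrict a, double *restrict b) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; ++i)
                for (int j = jj; j < jj + TILE; ++j)
                    b[(long)j * N + i] = a[(long)i * N + j];
}
```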

Loop transformations II

• short loop unrolling: helps the compiler vectorize the right loops (see the sketch below)
‣ unrolling pragmas are helpful too
‣ check compiler optimization reports
‣ use Intel Advisor
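A minimal sketch of unrolling a short, fixed-trip-count inner loop by hand so that the compiler vectorizes the long outer loop instead; the 3-component inner dimension is illustrative:

```c
/* The inner dimension has only 3 components, which is too short to vectorize
 * profitably. Fully unrolling it by hand leaves the long i-loop as the only
 * candidate, which the compiler can then vectorize with full-width vectors. */
void scale3(double *restrict x, const double *restrict s, long n) {
    for (long i = 0; i < n; ++i) {
        x[3 * i + 0] *= s[0];   /* unrolled component 0 */
        x[3 * i + 1] *= s[1];   /* unrolled component 1 */
        x[3 * i + 2] *= s[2];   /* unrolled component 2 */
    }
}
```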

Data alignment

• align (and pad) data to 64-byte boundaries to improve prefetching
• can be done easily in the major programming languages
‣ Fortran: -align array64byte (ifort; gfortran does it automagically)
‣ C/C++: aligned_alloc(64, <size>), __attribute__ ((aligned(64))), __declspec(align(64)) (see the sketch below)
‣ C++ trick: overload the new operator
• advanced: manually pad data if array extents are a power of 2, to minimize cache associativity conflicts
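A minimal C sketch of the two alignment mechanisms listed above; the array size and padding scheme are illustrative:

```c
#include <stdlib.h>

/* Statically allocated array aligned to a 64-byte boundary. */
static double table[1024] __attribute__ ((aligned(64)));

/* Heap allocation aligned to 64 bytes; aligned_alloc requires the size to be
 * a multiple of the alignment, so pad up to the next multiple of 64. */
double *make_buffer(size_t n) {
    size_t bytes = ((n * sizeof(double) + 63) / 64) * 64;
    return aligned_alloc(64, bytes);
}
```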


Make use of ISA

• help the compiler generate efficient intrinsics
• runtime example for an app whose kernel has an if condition inside the loop: 1.2 sec before, 0.8 sec after the rework (1.5x speedup)
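The reworked code itself is not reproduced in the slide text; as an illustrative assumption, a typical change of this kind replaces a data-dependent branch in a hot loop with a branchless form that the compiler can map onto AVX-512 compare-and-blend (masked) instructions:

```c
/* Before: a data-dependent branch inside the hot loop can inhibit
 * vectorization or lead to inefficient code. */
void clamp_branch(double *restrict x, long n, double lo) {
    for (long i = 0; i < n; ++i) {
        if (x[i] < lo)
            x[i] = lo;
    }
}

/* After: a branchless form that the compiler can turn into a vector compare
 * followed by a masked/blended store. */
void clamp_blend(double *restrict x, long n, double lo) {
    for (long i = 0; i < n; ++i)
        x[i] = (x[i] < lo) ? lo : x[i];
}
```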

Reduced precision math

• transcendental functions, square roots, etc. are expensive
• use -fp-model fast=2 -no-prec-div during compilation
• replace divisions by constants with multiplications by the inverse (see the sketch below)
• do not expect too much: benefits are usually only visible in heavily compute-bound code sections
• reduced precision might not always be acceptable
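A minimal sketch of trading a division by a constant for a multiplication by its precomputed inverse (names are illustrative):

```c
/* One division, hoisted out of the loop; inside the loop only the much
 * cheaper multiplication remains. Note: results may differ in the last bits. */
void normalize(double *restrict x, long n, double scale) {
    const double inv_scale = 1.0 / scale;
    for (long i = 0; i < n; ++i)
        x[i] *= inv_scale;
}
```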


Benefits of AVX-512

• median speedup: 1.2x
• benefits can be larger than 2x (probably more efficient prefetching)
• automatically enabled when compiling for the KNL architecture

Use MCDRAM

• always use the 16 GiB of on-package memory (MCDRAM)
‣ cache mode works well: request it with -C knl,cache
‣ code fits into 16 GiB: request -C knl,flat and prepend the executable with numactl -m 1

A note on heap allocation

• KNL memory allocation is comparably slow
• avoid allocating and de-allocating memory frequently
‣ remove allocations/deallocations in loop bodies or in functions which are called many times (see the sketch below)
• too involved? use pool allocator libraries (e.g. Intel TBB scalable memory pools)
‣ pros:
- overload new/malloc, no/minimal source code changes necessary
- can give a significant performance boost for certain codes
- take care of thread-safety
‣ cons:
- memory footprint needs to be known/computed in advance
- code might become less portable
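A minimal sketch of pulling an allocation out of a frequently called function (the scratch-buffer pattern here is illustrative; a pool allocator generalizes the same idea):

```c
#include <stdlib.h>

/* Bad: allocates and frees a scratch buffer on every call. */
double sum_scaled_slow(const double *x, long n, double s) {
    double *tmp = malloc(n * sizeof *tmp);
    double sum = 0.0;
    for (long i = 0; i < n; ++i) tmp[i] = s * x[i];
    for (long i = 0; i < n; ++i) sum += tmp[i];
    free(tmp);
    return sum;
}

/* Better: the caller allocates the scratch buffer once and reuses it across
 * many calls, so the (slow) allocator is only hit once. */
double sum_scaled_fast(const double *x, long n, double s, double *tmp) {
    double sum = 0.0;
    for (long i = 0; i < n; ++i) tmp[i] = s * x[i];
    for (long i = 0; i < n; ++i) sum += tmp[i];
    return sum;
}
```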


Multi-node optimizations

• a single KNL thread cannot saturate the Aries injection rate
‣ use thread-level communication or multiple MPI ranks per node
‣ recommended: >4 ranks per node
• dedicate cores to the OS: -S <ncores> in sbatch (ncores=2 is a good choice)

Hugepages, DMAPP and hardware AMO

• hugepages can reduce Aries TLB misses
‣ load the corresponding module at compile and run time
‣ different modules can be used at compile and run time
• MPI-collective-heavy codes: enable DMAPP (add -ldmapp)
  export MPICH_RMA_OVER_DMAPP=1
  export MPICH_USE_DMAPP_COLL=1
  export MPICH_NETWORK_BUFFER_COLL_OPT=1
• enable hardware AMO for MPI-3 RMA atomics:
  export MPICH_RMA_USE_NETWORK_AMO=1

Some notes on IO

[Figure: single-core I/O performance on Cori, showing write bandwidth (MB/s) for buffered, sync, and direct I/O on Haswell (HSW) and KNL, plus the KNL/HSW relative performance.]

Use multiple processes

• use more processes (e.g. with MPI-IO, see the sketch below)
• unfortunately, no good threaded IO solutions are available yet
• always: pool (write big chunks), reduce file operations (open, close)
• large files: use the burst buffer
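A minimal MPI-IO sketch in which each rank writes one large contiguous chunk of a shared file with a single collective call; the file name and chunk size are illustrative:

```c
#include <mpi.h>

#define COUNT (1 << 20)   /* doubles per rank, written as one big chunk */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double buf[COUNT];   /* filled with application data in practice */
    MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(double);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* one large collective write per rank instead of many small writes */
    MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```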


Does it help?

• recompile only: median speedup vs. Edison 1.15x, vs. Haswell 0.70x
• after optimization: median speedup vs. Edison 1.8x, vs. Haswell 1.0x

Summary

• single-node performance (go for that one first)
‣ loop fusion and tiling
‣ ensure good vectorization
‣ use MCDRAM
• multi-node performance
‣ hugepages
‣ DMAPP
• IO performance
‣ use multiple nodes, pool IO, reduce file operations to a minimum

Thank you
