Recent work with RAJA, and a nested loop update DOE COE 2017.pdfRecent work with RAJA, and a nested loop update DOE Centers of Excellence Performance Portability Meeting 2017 Adam

LLNL-PRES-737358This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

RecentworkwithRAJA,andanestedloopupdateDOECentersofExcellencePerformancePortabilityMeeting2017

AdamJ.Kunen

August 23, 2017

LLNL-PRES-7373582

RAJAisaC++abstractionlayerthatenablesportabilitywithsmalldisruptiontoapplicationprogrammingstyles

ThemaingoalofRAJAistobalanceperformance…

§ PreserveandaugmentabilitiesofC++compilerstooptimize

§ Supportvariousformsoffine-grained(on-node)parallelismandvariousprogrammingmodeloptions(OpenMP,CUDA,TBB,OpenACC,...)

…and developerproductivity

§ Maintainsingle-sourcekernelsanddon’tbindanapptoaparticularPM

§ Clearseparationofresponsibilities— RAJA:Executeloops,encapsulatehardware&programmingmodeldetails— Application: SelectloopiterationpatternsandexecutionpolicieswithRAJAAPI

RAJA development is currently driven by the needs of ATDM/ASC applications at LLNL and ECP collaborators

LLNL-PRES-7373583

§ RAJAdecouplesloopiterationandloopbody— Iterationsare“tasks” – aggregate,reorder,etc.

§ RAJAConcepts:— Patterns:forall,forallN,reduce,scan

— Policies: sequential,simd,openmp,cuda,….— Index:iterations– aggregate,reorder,tile,....

RAJAconceptshelpencapsulateloopexecutiondetails

double* x ; double* y ;double a, tsum = 0, tmin = MYMAX;

for ( int i = begin; i < end; ++i ) {y[i] += a * x[i] ;tsum += y[i] ;if ( y[i] < tmin ) tmin = y[i];

}

C-style for-loop

RAJA-style loopdouble* x ; double* y ;double a ;RAJA::SumReduction<reduce_policy, double> tsum(0);RAJA::MinReduction<reduce_policy, double> tmin(MYMAX);

RAJA::forall< exec_policy > ( IndexSet , [=] (int i) {y[i] += a * x[i] ;tsum += y[i];tmin.min( y[i] );

} );

Executionpatterns&policies(scheduling,PMchoice,etc.)

double* x ; double* y ;double a ;RAJA::SumReduction<reduce_policy, double> tsum(0);RAJA::MinReduction<reduce_policy, double> tmin(MYMAX);


} );

Loop body is mostly unchanged (C++ lambda function).

IndexSets(iterationspace,ordering,etc.)



} );PortableReductiontypes



} );

LLNL-PRES-7373584

§ “Lighttouch”— Existingapplicationdatastructures&algorithmsrequirelittlechange,ifany

§ “Lowbarriertoentry”— Parallelismcanbeaddedselectivelyandperformancetunedincrementally

§ “Application-facingdesignphilosophy”— Mapsnaturallytoappsandcanbecustomized– easytograspfor(non-CS)

applicationdevelopers

§ “Performance”— RAJAdoeswellwith“streaming”kernelsthatareprevalentinLLNLcodes— Designedforcoarse-grainedsynchronization– reducesresourcecontention

andmemorysynchronizationoverheads

4

LLNL-PRES-7373585

§ Cleanerorganizationofconcepts&headerfiles,refinedAPIs

§ Backends forOpenMP4.x,OpenACC,TBB

§ Parallelscans

§ NewIndexSet implementationsupportsarbitrarysegmenttypes

§ “Multi-policy”forruntimepolicyselection

§ ImprovedintegrationwithCHAI

§ RAJAPerformanceSuite– runvariousexperimentstocomparekernels(RAJAvs.native),helpguidecompilerNREwork

§ ExpandedandrefinednestedloopcapabilitiesandAPI

5

LLNL-PRES-7373586

§ RecentworkwithNVidianvcc,IBMhackathonatLLNL— Identifiedperformanceissuesinnested-loopabstractions(RAJA::forallN)

• Copyconstructionofloopbody• Capture-by-valuevs.capture-by-referencecausingissueswithnvcccorrectness

§ ReworkofforallN— CurrentimplementationofforallN reliesona”peelandbind”mechanism

togeneratethenestedloopstructureandbindtheloopiteratestothelambda• Causesexcessivecopyconstruction– seenasmassiveperformanceproblemwiththingslikeCHAI,host-devicelambdas,reductionobject.

— RevampofforallN replaces“peelandbind”with“peelandset”mechanismthatdoesn’ttriggercopyconstruction

6

LLNL-PRES-73735877

For i : IFor j : J

For k : Kbody(i,j,k)

auto body = [=](int i, int j, int k){ … };RAJA::forallN<pol>(I, J, K, body);

A = IndexConverter(body)B = PeelOuter(A)For i : I

C = BindFirstArg(A, i)D = PeelOuter(C)

For j : JE = BindFirstArg(C, j)F = PeelOuter(E)

For k : KG = BindFirstArg(E, k)G()

§ Eachloopperformstwocapture-by-valuewrappingsoftheloopbody— Onepeelsoffthatloopsexecutionpolicyand

segment— Theotherbindsthatloopsiteratetothebody

(similartostd::bind)— Numberofcopy-constructionsofbodyO(I*J*K)

LLNL-PRES-7373588

§ Itwasstraightforwardtodesign— Aninitialimplementation

§ Wejustdidn’tknow— Oftenthe”body”isalambdawhichonlycapturesPODtypes

• Thecompilercaneliminatemostofthecopyconstructorsandinlineeverything• Thereisnoapparentinefficiency

§ Sowhathappened?— Threethings:

• WeusedRAJAreductionobjects• WeusedCHAI• CUDAhost-devicelambdas

— Thesehaveexplicitcopyconstructors• Thecompilerdoesnotoptimizetheseaway• Performancedropsthroughthefloor

WhywasPeel-and-Bindusedifit’ssoinefficient?!?

Weonlyseeperformancelosswhenourlambdascapturecomplexobjects

LLNL-PRES-73735899

For i : IFor j : J

For k : Kbody(i,j,k)

auto body = [=](int i, int j, int k){ … };RAJA::forallN<pol>(I, J, K, body);

idx = std::tuple<int, int, int>A = InvokeWrapper(body)For i : I

idx.i = i

For j : Jidx.j = j

For k : Kidx.k = k

A(idx)

§ Eachloopassignsitsiterateintoatuple— Onewrappingofbodyisneededtoprovide

invocation— Wrappercanbecaptured-by-referenceateach

loopnestlevel— Numberofloop-bodycopyconstructionsisO(1)— SideBenefit:Newportablemetaprogramming

library“camp”

LLNL-PRES-73735810

§ AlotofthingsaregoingoninRAJA— Newfeatures— Newbackends

§ RunningupagainstperformanceportabilityissueswithCUDAandOpenMP4.5thatareforcingustorethinkcertainimplementationstrategies

§ Bugreports,featurerequests,codecontributions,areallwelcome!

§ GetRAJAongithub:— https://github.com/LLNL/RAJA

10

LLNL-PRES-73735812

Numberofloop-bodycopyconstructionsforthePeel-and-Bindimplementation

𝐶𝑜𝑝𝑖𝑒𝑠( 𝐼) ) = 2 + .2/

)01

2 𝐼3

)

301

+2 𝐼3

/

301

𝐶𝑜𝑝𝑖𝑒𝑠(𝐼4) = 2 + 𝐼4

𝐶𝑜𝑝𝑖𝑒𝑠(𝐼4, 𝐼1) = 2 + 2 𝐼4 + 𝐼4×𝐼1

𝐶𝑜𝑝𝑖𝑒𝑠 𝐼4, 𝐼1, 𝐼8 = 2 + 2 𝐼4 + 2 𝐼4×𝐼1 + 𝐼4×𝐼1×𝐼8

= 𝒪 2 𝐼3

/

301

Thenumberofcopy-ctors calledisontheorderoftheiterationspacesize

Documents

Recent work with RAJA, and a nested loop update DOE COE 2017.pdfRecent work with RAJA, and a nested loop update DOE Centers of Excellence Performance Portability Meeting 2017 Adam