Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
LLNL-PRES-737358This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
RecentworkwithRAJA,andanestedloopupdateDOECentersofExcellencePerformancePortabilityMeeting2017
AdamJ.Kunen
August 23, 2017
LLNL-PRES-7373582
RAJAisaC++abstractionlayerthatenablesportabilitywithsmalldisruptiontoapplicationprogrammingstyles
ThemaingoalofRAJAistobalanceperformance…
§ PreserveandaugmentabilitiesofC++compilerstooptimize
§ Supportvariousformsoffine-grained(on-node)parallelismandvariousprogrammingmodeloptions(OpenMP,CUDA,TBB,OpenACC,...)
…and developerproductivity
§ Maintainsingle-sourcekernelsanddon’tbindanapptoaparticularPM
§ Clearseparationofresponsibilities— RAJA:Executeloops,encapsulatehardware&programmingmodeldetails— Application: SelectloopiterationpatternsandexecutionpolicieswithRAJAAPI
RAJA development is currently driven by the needs of ATDM/ASC applications at LLNL and ECP collaborators
LLNL-PRES-7373583
§ RAJAdecouplesloopiterationandloopbody— Iterationsare“tasks” – aggregate,reorder,etc.
§ RAJAConcepts:— Patterns:forall,forallN,reduce,scan
— Policies: sequential,simd,openmp,cuda,….— Index:iterations– aggregate,reorder,tile,....
RAJAconceptshelpencapsulateloopexecutiondetails
double* x ; double* y ;double a, tsum = 0, tmin = MYMAX;
for ( int i = begin; i < end; ++i ) {y[i] += a * x[i] ;tsum += y[i] ;if ( y[i] < tmin ) tmin = y[i];
}
C-style for-loop
RAJA-style loopdouble* x ; double* y ;double a ;RAJA::SumReduction<reduce_policy, double> tsum(0);RAJA::MinReduction<reduce_policy, double> tmin(MYMAX);
RAJA::forall< exec_policy > ( IndexSet , [=] (int i) {y[i] += a * x[i] ;tsum += y[i];tmin.min( y[i] );
} );
Executionpatterns&policies(scheduling,PMchoice,etc.)
double* x ; double* y ;double a ;RAJA::SumReduction<reduce_policy, double> tsum(0);RAJA::MinReduction<reduce_policy, double> tmin(MYMAX);
RAJA::forall< exec_policy > ( IndexSet , [=] (int i) {y[i] += a * x[i] ;tsum += y[i];tmin.min( y[i] );
} );
Loop body is mostly unchanged (C++ lambda function).
IndexSets(iterationspace,ordering,etc.)
double* x ; double* y ;double a ;RAJA::SumReduction<reduce_policy, double> tsum(0);RAJA::MinReduction<reduce_policy, double> tmin(MYMAX);
RAJA::forall< exec_policy > ( IndexSet , [=] (int i) {y[i] += a * x[i] ;tsum += y[i];tmin.min( y[i] );
} );PortableReductiontypes
double* x ; double* y ;double a ;RAJA::SumReduction<reduce_policy, double> tsum(0);RAJA::MinReduction<reduce_policy, double> tmin(MYMAX);
RAJA::forall< exec_policy > ( IndexSet , [=] (int i) {y[i] += a * x[i] ;tsum += y[i];tmin.min( y[i] );
} );
LLNL-PRES-7373584
§ “Lighttouch”— Existingapplicationdatastructures&algorithmsrequirelittlechange,ifany
§ “Lowbarriertoentry”— Parallelismcanbeaddedselectivelyandperformancetunedincrementally
§ “Application-facingdesignphilosophy”— Mapsnaturallytoappsandcanbecustomized– easytograspfor(non-CS)
applicationdevelopers
§ “Performance”— RAJAdoeswellwith“streaming”kernelsthatareprevalentinLLNLcodes— Designedforcoarse-grainedsynchronization– reducesresourcecontention
andmemorysynchronizationoverheads
4
LLNL-PRES-7373585
§ Cleanerorganizationofconcepts&headerfiles,refinedAPIs
§ Backends forOpenMP4.x,OpenACC,TBB
§ Parallelscans
§ NewIndexSet implementationsupportsarbitrarysegmenttypes
§ “Multi-policy”forruntimepolicyselection
§ ImprovedintegrationwithCHAI
§ RAJAPerformanceSuite– runvariousexperimentstocomparekernels(RAJAvs.native),helpguidecompilerNREwork
§ ExpandedandrefinednestedloopcapabilitiesandAPI
5
LLNL-PRES-7373586
§ RecentworkwithNVidianvcc,IBMhackathonatLLNL— Identifiedperformanceissuesinnested-loopabstractions(RAJA::forallN)
• Copyconstructionofloopbody• Capture-by-valuevs.capture-by-referencecausingissueswithnvcccorrectness
§ ReworkofforallN— CurrentimplementationofforallN reliesona”peelandbind”mechanism
togeneratethenestedloopstructureandbindtheloopiteratestothelambda• Causesexcessivecopyconstruction– seenasmassiveperformanceproblemwiththingslikeCHAI,host-devicelambdas,reductionobject.
— RevampofforallN replaces“peelandbind”with“peelandset”mechanismthatdoesn’ttriggercopyconstruction
6
LLNL-PRES-73735877
For i : IFor j : J
For k : Kbody(i,j,k)
auto body = [=](int i, int j, int k){ … };RAJA::forallN<pol>(I, J, K, body);
A = IndexConverter(body)B = PeelOuter(A)For i : I
C = BindFirstArg(A, i)D = PeelOuter(C)
For j : JE = BindFirstArg(C, j)F = PeelOuter(E)
For k : KG = BindFirstArg(E, k)G()
§ Eachloopperformstwocapture-by-valuewrappingsoftheloopbody— Onepeelsoffthatloopsexecutionpolicyand
segment— Theotherbindsthatloopsiteratetothebody
(similartostd::bind)— Numberofcopy-constructionsofbodyO(I*J*K)
LLNL-PRES-7373588
§ Itwasstraightforwardtodesign— Aninitialimplementation
§ Wejustdidn’tknow— Oftenthe”body”isalambdawhichonlycapturesPODtypes
• Thecompilercaneliminatemostofthecopyconstructorsandinlineeverything• Thereisnoapparentinefficiency
§ Sowhathappened?— Threethings:
• WeusedRAJAreductionobjects• WeusedCHAI• CUDAhost-devicelambdas
— Thesehaveexplicitcopyconstructors• Thecompilerdoesnotoptimizetheseaway• Performancedropsthroughthefloor
WhywasPeel-and-Bindusedifit’ssoinefficient?!?
Weonlyseeperformancelosswhenourlambdascapturecomplexobjects
LLNL-PRES-73735899
For i : IFor j : J
For k : Kbody(i,j,k)
auto body = [=](int i, int j, int k){ … };RAJA::forallN<pol>(I, J, K, body);
idx = std::tuple<int, int, int>A = InvokeWrapper(body)For i : I
idx.i = i
For j : Jidx.j = j
For k : Kidx.k = k
A(idx)
§ Eachloopassignsitsiterateintoatuple— Onewrappingofbodyisneededtoprovide
invocation— Wrappercanbecaptured-by-referenceateach
loopnestlevel— Numberofloop-bodycopyconstructionsisO(1)— SideBenefit:Newportablemetaprogramming
library“camp”
LLNL-PRES-73735810
§ AlotofthingsaregoingoninRAJA— Newfeatures— Newbackends
§ RunningupagainstperformanceportabilityissueswithCUDAandOpenMP4.5thatareforcingustorethinkcertainimplementationstrategies
§ Bugreports,featurerequests,codecontributions,areallwelcome!
§ GetRAJAongithub:— https://github.com/LLNL/RAJA
10
LLNL-PRES-73735812
Numberofloop-bodycopyconstructionsforthePeel-and-Bindimplementation
𝐶𝑜𝑝𝑖𝑒𝑠( 𝐼) ) = 2 + .2/
)01
2 𝐼3
)
301
+2 𝐼3
/
301
𝐶𝑜𝑝𝑖𝑒𝑠(𝐼4) = 2 + 𝐼4
𝐶𝑜𝑝𝑖𝑒𝑠(𝐼4, 𝐼1) = 2 + 2 𝐼4 + 𝐼4×𝐼1
𝐶𝑜𝑝𝑖𝑒𝑠 𝐼4, 𝐼1, 𝐼8 = 2 + 2 𝐼4 + 2 𝐼4×𝐼1 + 𝐼4×𝐼1×𝐼8
= 𝒪 2 𝐼3
/
301
Thenumberofcopy-ctors calledisontheorderoftheiterationspacesize