LLNL-PRES-681292. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

Nested-Loop RAJA Extensions for Deterministic Transport
DOE Centers of Excellence Performance Portability Meeting
Adam J. Kunen
April 19, 2016




  • Slide 2

    Kripke is a research tool that is informing the development of ASC codes.

    External PM Research/Development Collaborations:
    — RAJA (L2): Demoed OMP/CUDA portability (FY15)
    — Legion: Ongoing (Sam White, UIUC)
    — Charm++: Ongoing (Sam White, UIUC)
    — STAPL: Ongoing (TAMU)
    — OCCA: Demoed OKL DSL (David Medina, 2015)
    — CORAL CoE: Exploring CUDA algorithms, Ongoing

    Internal Research and Development:
    — Kripke v1.0: L, L+, Sweep (Aug 2014)
    — Kripke v1.1: L, L+, Sweep, Source, Scattering (Sept 2015)
    — Nested-Loop RAJA: Developing RAJA extensions for use in ARDRA (Ongoing)
    — ARDRA: Porting to RAJA during FY16/FY17

    Research with Kripke motivated the nested-loop abstractions.

  • Slide 3

    Kripke demonstrates performance is sensitive to data layout, architecture and problem size

    Abstracting data layout and loop order enables performance portability

    Data layout and loop interchange affect performance

    … so do problem dimensions and architecture

    [Figure: performance results on Sandy Bridge (P4, 12^3 zones/thread, 64 grp, 96 dir) and BG/Q (P9, 12^3 zones/thread, 64 grp, 96 dir)]

  • Slide 4

    Discrete ordinates transport needs a portable PM that treats multidimensional loops and data as first-class citizens

    §  Sn transport is dominated by nested loops
    —  High-dimensional phase space
    —  Often loops are nested 2 to 5 deep (sometimes even more)
    —  Many of our loops are perfectly nested
    —  Complex iteration patterns (sweeps, multi-material)

    §  Kripke shows performance is sensitive to:
    —  Data layout, loop-nesting order
    —  Architecture, compiler (vendor and version)
    —  Problem specifications (zones, groups, directions, moments, etc.)

    §  Given Architecture + Compiler + Problem:
    —  Choose data layout and loop-nest order
    —  Choose execution policies

    §  Preparation for Sierra requires porting to GPU
    —  CUDA? OpenMP?
    —  What about exascale? What PM will we need then?

    Sn transport needs performance-portable multidimensional PM concepts

  • Slide 5

    RAJA::forall(IndexSet_I, [=](int i) {
      RAJA::forall(IndexSet_J, [=](int j) {
        RAJA::forall(IndexSet_K, [=](int k) {
          y[i*nj + j] += a[k] * x[j*nk + k];
        });
      });
    });

    Why do we need RAJA::forallN instead of just nesting the existing RAJA::forall?

    Nested loops and multidimensional arrays are needed in addition to RAJA

    Multi-Dimensional Arrays
    •  Manual array index calculations are error-prone
    •  Hard-wires code for specific data layouts

    Nested Loop Constructs
    •  Nesting RAJA::forall's works, but doesn't enable complex loop transformations
    •  Hard-wires code for specific patterns
    •  CUDA kernels require building a thread+block index space from multiple loop indices… very difficult with nested forall's
    •  Nested lambdas are problematic for OMP4 and CUDA

  • Slide 6

    Nested-Loop RAJA extensions provide multidimensional support

    §  Maintains the RAJA philosophy
    —  Separate concepts of loop execution, iteration patterns and loop bodies
    —  Minor code structure changes
    —  Allow incremental transition to RAJA
    —  Leverage existing RAJA code (forall, IndexSet, etc.)

    •  Basic nested loops are functionally equivalent to nested RAJA::forall()'s

    §  Arbitrary Dimensionality
    —  RAJA::forallN for any N
    —  Using variadic template meta-programming, no codegen (almost)

    §  Loop Transformations (for perfectly nested loops)
    —  Loop interchange
    —  Multi-level tiling/blocking
    —  Mapping to CUDA threads and blocks
    —  Loop collapsing (OpenMP collapse(N))

    §  Data Layouts
    —  Arbitrary data striding orders

    §  Portability (and hopefully performance)
    —  Sequential, SIMD, OpenMP (CPU/GPU), CUDA

    RAJA is a good starting point for an Sn programming model

  • Slide 7

    How do we extend the RAJA::forall abstraction for single loops to nested loops?

    Kripke v1.1: LTimes for DGZ layout

    Hand coded loops are inflexible and non-portable

    double *ell_ptr;
    double *psi_ptr;
    double *phi_ptr;
    for (int nm = 0; nm < num_moments; ++nm) {
      double *ell_nm = ell + nm*num_loc_dir;
      double * __restrict__ phi_nm = phi + nm*num_gz;
      for (int d = 0; d < num_loc_dir; d++) {
        double * __restrict__ psi_d = psi + d*num_loc_gz;
        double ell_nm_d = ell_nm[d];
        for (int gz = 0; gz < num_loc_gz; ++gz) {
          phi_nm[gz] += ell_nm_d * psi_d[gz];
        }
      }
    }

    Issues with this coding style:

    §  Loop-nest order is fixed
    —  Loop interchange requires a rewrite

    §  Inner 2 loops are collapsed
    —  An arch-specific optimization

    §  Fixed layout of each variable

    §  No obvious mapping to CUDA
    —  Need to rewrite

    §  Code is just downright ugly

  • Slide 8

    Nested-Loop RAJA abstracts nested loops, loop interchange, and data layouts

    View_Psi psi(psi_ptr, …);
    View_Phi phi(phi_ptr, …);
    View_Ell ell(ell_ptr, …);
    forallN(is_moment, is_dir, is_group, is_zone,
      [=](NM nm, Dir d, Group g, Zone z) {
        phi(nm, g, z) += ell(d, nm) * psi(d, g, z);
      });

    Kripke+RAJA: LTimes for all layouts

    Nested-loop abstraction promotes flexibility, while maintaining code structure

    double *ell_ptr;
    double *psi_ptr;
    double *phi_ptr;
    for (int nm = 0; nm < num_moments; ++nm) {
      double *ell_nm = ell + nm*num_loc_dir;
      double * __restrict__ phi_nm = phi + nm*num_gz;
      for (int d = 0; d < num_loc_dir; d++) {
        double * __restrict__ psi_d = psi + d*num_loc_gz;
        double ell_nm_d = ell_nm[d];
        for (int gz = 0; gz < num_loc_gz; ++gz) {
          phi_nm[gz] += ell_nm_d * psi_d[gz];
        }
      }
    }

    Kripke v1.1: LTimes for DGZ layout

  • Slide 9

    forallN extends RAJA concepts needed for nested loops

    RAJA::forallN(is_moment, is_dir, is_group, is_zone,
      [=](NM nm, Dir d, Group g, Zone z) {
        phi(nm, g, z) += ell(d, nm) * psi(d, g, z);
      });

    forallN() abstracts an N-nested loop

    •  Exec policies for each loop nest, enabling loop transformations
    •  An IndexSet for each loop nest
    •  Loop body with type-safe indices

    Extensions provide nested loop concepts and promote code correctness

    Views abstract data access and provide type safety

  • Slide 10

    Execution policies enable rapid testing of diverse parallelization strategies

    Sequential policy:
    using Pol = NestedPolicy<…>;

    Parallel region, with OpenMP threading over groups:
    using Pol = NestedPolicy<ExecList<…>, OMP_Parallel<Permute<…>>>;

    Parallel region, with OpenMP threading over groups, collapse(2), tiling zones by 512:
    using Pol = NestedPolicy<…>;

    CUDA policy, mapping groups to threads and blocks in X (with 32 threads/block), and zones to blocks in Y:
    using Pol = NestedPolicy<ExecList<…>>;

    Complex loop nesting constructs are easy to implement without kernel changes

  • Slide 11

    Kernel performance depends on choosing the execution policy that matches the architecture

    [Figure: Grind Time for DGZ LTimes Kernel in Kripke for Various Execution Policies. Y-axis: grind time (seconds/unknown), 0 to 1.40E-07; X-axis: number of unknowns, 0 to 1.20E+08. Series: collapse(2), omp_seq, par_collapse(2), par_omp_seq, par_tile256_omp_seq, par_tile512_omp_seq. Configuration: DGZ layout; Dirs/Sets: 96/8; Grp/Sets: 32/1; Zones: 8x8x8, 10x10x10, …; Git hash: d3799680b5fe930423a29e0478329d2edaa2e8a7]

    Performance can be tuned with policies, without modifying kernels

  • Slide 12

    Conclusion

    §  Nested-loop constructs are now officially in RAJA
    —  Offload to threads (OpenMP) and GPU (CUDA)
    —  Complex loop transformations are possible without impacting code

    §  CoE interactions have been crucial
    —  Vendor cooperation has been good
    —  Intel machines are looking good…
    •  Optimization improvements would make a marked improvement
    •  OpenMP kernels in VTune are problematic
    —  IBM and NVIDIA (nvcc) have the most issues to work out
    •  Starting to explore solutions; may impact implementation details

    §  Starting to explore implementation in ARDRA
    —  Starting with concepts that seem less risky and incorporating them into ARDRA
    —  Just starting to move to C++11 (Sequoia + XLC is the only blocker)
    —  Hoping that we get more vendor issues worked out soon

    §  Questions?