The Sparse Matrix Vector Product on High-End GPUs

KIT – The Research University in the Helmholtz Association

Hartwig Anzt, Terry Cojean, Yuhsiang M. Tsai
Steinbuch Centre for Computing (SCC)

www.kit.edu

SIAM Conference on Parallel Processing for Scientific Computing (PP20)
February 12-15, 2020
Hyatt Regency Seattle | Seattle, Washington, U.S.

Mike Tsai

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and by the Helmholtz Impuls und Vernetzungsfond VH-NG-1241.

02/13/2020 Hartwig Anzt: The Sparse Matrix Vector Product on High-End GPUs

SpMV on GPUs: Moving away from the NVIDIA hegemony

• In the past, NVIDIA GPUs dominated the GPGPU market;

• We see an increasing adoption of AMD GPUs in leadership supercomputers:
  • Frontier system in Oak Ridge (2021)
  • El Capitan in Lawrence Livermore National Lab? (2023)

• AMD is heavily investing in the HIP software development ecosystem:
  • HIP programming is similar to CUDA programming;
  • HIP libraries are similar to cuBLAS, cuSPARSE, …

• The race is on!

• How can we prepare the Ginkgo sparse linear algebra library for cross-platform portability?

• Are the CUDA-optimized kernels suitable for AMD GPUs?

• How does the performance compare across different GPUs?


Extend Ginkgo's hardware scope to AMD GPUs

Core
• Library Infrastructure
• Algorithm Implementations
  • Iterative Solvers
  • Preconditioners
  • …

Kernels
• Reference: Reference kernels
  • SpMV
  • Solver kernels
  • Precond kernels
  • …
• OpenMP: OpenMP kernels
  • SpMV
  • Solver kernels
  • Precond kernels
  • …
• CUDA: CUDA-GPU kernels
  • SpMV
  • Solver kernels
  • Precond kernels
  • …

The library core contains the architecture-agnostic algorithm implementations;

Runtime polymorphism selects the right kernel depending on the target architecture;

Architecture-specific kernels execute the algorithm on the target architecture;

Reference kernels are sequential kernels used to check the correctness of the algorithm design and of the optimized kernels;

The OpenMP and CUDA kernels are optimized, architecture-specific kernels;
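The split between an architecture-agnostic core and backend kernels selected at run time can be sketched in plain C++. This is a minimal illustration under stated assumptions, not Ginkgo's actual code: the `Executor`, `ReferenceExecutor`, `UnrolledExecutor`, and `scaled_update` names are hypothetical, and the "optimized" backend is only a manually unrolled loop standing in for an OpenMP or GPU kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT Ginkgo's classes.
struct Executor {
    virtual ~Executor() = default;
    virtual std::string name() const = 0;
    // Architecture-specific kernel: y = alpha * x + y.
    virtual void axpy(double alpha, const std::vector<double>& x,
                      std::vector<double>& y) const = 0;
};

// Sequential reference backend, used to validate the other backends.
struct ReferenceExecutor : Executor {
    std::string name() const override { return "reference"; }
    void axpy(double alpha, const std::vector<double>& x,
              std::vector<double>& y) const override
    {
        assert(x.size() == y.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            y[i] += alpha * x[i];
        }
    }
};

// Stand-in for an optimized backend: same signature, different loop body.
struct UnrolledExecutor : Executor {
    std::string name() const override { return "unrolled"; }
    void axpy(double alpha, const std::vector<double>& x,
              std::vector<double>& y) const override
    {
        assert(x.size() == y.size());
        const std::size_t n = x.size();
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {  // main loop, unrolled by four
            y[i] += alpha * x[i];
            y[i + 1] += alpha * x[i + 1];
            y[i + 2] += alpha * x[i + 2];
            y[i + 3] += alpha * x[i + 3];
        }
        for (; i < n; ++i) {  // remainder loop
            y[i] += alpha * x[i];
        }
    }
};

// Architecture-agnostic "core" routine: it only sees the Executor
// interface, so the kernel is chosen at run time by the object passed in.
void scaled_update(const Executor& exec, double alpha,
                   const std::vector<double>& x, std::vector<double>& y)
{
    exec.axpy(alpha, x, y);
}
```

Calling `scaled_update` once with a `ReferenceExecutor` and once with an `UnrolledExecutor` on the same input and comparing the results mirrors how a sequential reference kernel can be used to check an optimized one.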

https://github.com/ginkgo-project/ginkgo

Part of https://xsdk.info/




A HIP backend is added to the kernels:

• HIP: HIP-GPU kernels
  • SpMV
  • Solver kernels
  • Precond kernels
  • …

Common
• Shared kernels

• To avoid code duplication, kernels shared between the CUDA and HIP backends (differing only in parameter configuration) are relocated to the "common" module.

• New code is necessary for HIP-specific optimizations and for implementing functionality currently missing in the HIP ecosystem (e.g. cooperative groups).
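One way to picture a kernel that is shared "up to parameter configuration" is a template written once and instantiated with a per-backend configuration. The sketch below is a host-side C++ emulation, not GPU code and not Ginkgo's implementation: `CudaConfig`, `HipConfig`, and `warp_sums` are hypothetical names, and the only parameter assumed to differ is the warp (wavefront) size, 32 on NVIDIA GPUs and 64 on AMD GCN GPUs such as the Radeon VII.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-backend configurations (illustrative names only).
struct CudaConfig {
    static constexpr int warp_size = 32;  // NVIDIA warp
};
struct HipConfig {
    static constexpr int warp_size = 64;  // AMD GCN wavefront
};

// "Common" code: written once, instantiated per backend.
// Host-side emulation of a warp-level tree reduction: every group of
// Config::warp_size entries is summed with halving strides, the same
// access pattern a shuffle-based reduction would use on the GPU.
template <typename Config>
std::vector<double> warp_sums(std::vector<double> data)
{
    constexpr int w = Config::warp_size;
    static_assert((w & (w - 1)) == 0, "warp size must be a power of two");

    // Pad with zeros so that the last warp is complete.
    const std::size_t num_warps = (data.size() + w - 1) / w;
    data.resize(num_warps * w, 0.0);

    std::vector<double> result(num_warps);
    for (std::size_t warp = 0; warp < num_warps; ++warp) {
        double* lane = data.data() + warp * w;
        for (int stride = w / 2; stride > 0; stride /= 2) {
            for (int i = 0; i < stride; ++i) {
                lane[i] += lane[i + stride];
            }
        }
        result[warp] = lane[0];
    }
    return result;
}
```

For 128 input values, `warp_sums<CudaConfig>` produces four partial sums and `warp_sums<HipConfig>` produces two; the algorithm body is identical in both instantiations.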

[Figure: module layout; labels: CUDA, common, HIP, new]

How does Ginkgo compare to the vendor libraries - COO SpMV

[Plots: Ginkgo vs hipSPARSE on Radeon VII; Ginkgo vs cuSPARSE on V100]

Results and interactive performance explorer available at: https://ginkgo-project.github.io/gpe/
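For readers unfamiliar with the coordinate (COO) format, a sequential SpMV over COO storage can be sketched as follows. This is a plain reference-style loop for illustration, not the GPU kernel benchmarked here; the `CooMatrix` struct and `spmv_coo` function are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Coordinate (COO) storage: one (row, column, value) triple per nonzero.
// Hypothetical plain struct for illustration; not Ginkgo's matrix class.
struct CooMatrix {
    std::size_t num_rows;
    std::vector<int> row_idxs;
    std::vector<int> col_idxs;
    std::vector<double> values;
};

// Sequential y = A * x: every stored nonzero contributes to one entry of y.
std::vector<double> spmv_coo(const CooMatrix& a, const std::vector<double>& x)
{
    assert(a.row_idxs.size() == a.values.size());
    assert(a.col_idxs.size() == a.values.size());
    std::vector<double> y(a.num_rows, 0.0);
    for (std::size_t k = 0; k < a.values.size(); ++k) {
        y[a.row_idxs[k]] += a.values[k] * x[a.col_idxs[k]];
    }
    return y;
}
```

Because the loop runs over nonzeros rather than rows, the work per iteration is uniform regardless of how unevenly the nonzeros are distributed across rows.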


How does Ginkgo compare to the vendor libraries - CSR SpMV
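As a reminder of what the compressed sparse row (CSR) format stores, here is a minimal sequential SpMV sketch. It is illustrative only and not the benchmarked GPU kernel; `CsrMatrix` and `spmv_csr` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed sparse row (CSR) storage: the nonzeros of row i occupy the
// index range [row_ptrs[i], row_ptrs[i + 1]) of col_idxs and values.
// Hypothetical plain struct for illustration; not Ginkgo's matrix class.
struct CsrMatrix {
    std::vector<int> row_ptrs;  // size num_rows + 1
    std::vector<int> col_idxs;
    std::vector<double> values;
};

// Sequential y = A * x, one dot product per row.
std::vector<double> spmv_csr(const CsrMatrix& a, const std::vector<double>& x)
{
    assert(!a.row_ptrs.empty());
    assert(a.col_idxs.size() == a.values.size());
    const std::size_t num_rows = a.row_ptrs.size() - 1;
    std::vector<double> y(num_rows, 0.0);
    for (std::size_t row = 0; row < num_rows; ++row) {
        double sum = 0.0;
        for (int k = a.row_ptrs[row]; k < a.row_ptrs[row + 1]; ++k) {
            sum += a.values[k] * x[a.col_idxs[k]];
        }
        y[row] = sum;
    }
    return y;
}
```

The inner loop length equals the number of nonzeros in the row, so rows of very different lengths lead to unbalanced work when rows are distributed over parallel threads.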




How does Ginkgo compare to the vendor libraries - ELL SpMV
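The ELL format pads every row to the same number of stored entries. A sequential sketch, assuming column-major storage and zero-valued padding (a common convention; the exact layout of the benchmarked kernels is not given in the slides), could look like this. `EllMatrix` and `spmv_ell` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ELL storage: every row holds exactly max_nnz_per_row entries.
// Assumed layout: column-major, i.e. entry j of row i sits at index
// j * num_rows + i. Padding entries use column 0 and value 0.0.
// Hypothetical plain struct for illustration; not Ginkgo's matrix class.
struct EllMatrix {
    std::size_t num_rows;
    std::size_t max_nnz_per_row;
    std::vector<int> col_idxs;
    std::vector<double> values;
};

// Sequential y = A * x. Padding contributes 0.0 * x[0] and is harmless.
std::vector<double> spmv_ell(const EllMatrix& a, const std::vector<double>& x)
{
    assert(a.values.size() == a.num_rows * a.max_nnz_per_row);
    assert(a.col_idxs.size() == a.values.size());
    std::vector<double> y(a.num_rows, 0.0);
    for (std::size_t j = 0; j < a.max_nnz_per_row; ++j) {
        for (std::size_t i = 0; i < a.num_rows; ++i) {
            const std::size_t idx = j * a.num_rows + i;
            y[i] += a.values[idx] * x[a.col_idxs[idx]];
        }
    }
    return y;
}
```

With this layout, consecutive rows access consecutive memory locations for a fixed `j`, which is the access pattern that suits one-thread-per-row GPU kernels; the price is the padding for short rows.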




How does Ginkgo compare to the vendor libraries - hybrid SpMV
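A hybrid format combines a regular ELL part with a COO part for the entries that do not fit into the fixed ELL width. The sequential sketch below illustrates the idea only; how the split point is chosen and how the GPU kernels are organized are not covered by the slides. `HybridMatrix` and `spmv_hybrid` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hybrid storage: the first ell_width entries of each row go into an ELL
// part (column-major, zero padding), the remaining entries into a COO part.
// Hypothetical plain struct for illustration; not Ginkgo's matrix class.
struct HybridMatrix {
    std::size_t num_rows;
    std::size_t ell_width;
    std::vector<int> ell_cols;     // size num_rows * ell_width
    std::vector<double> ell_vals;  // size num_rows * ell_width
    std::vector<int> coo_rows;     // overflow entries
    std::vector<int> coo_cols;
    std::vector<double> coo_vals;
};

// Sequential y = A * x: ELL contribution first, then the COO overflow.
std::vector<double> spmv_hybrid(const HybridMatrix& a,
                                const std::vector<double>& x)
{
    assert(a.ell_vals.size() == a.num_rows * a.ell_width);
    assert(a.ell_cols.size() == a.ell_vals.size());
    assert(a.coo_rows.size() == a.coo_vals.size());
    assert(a.coo_cols.size() == a.coo_vals.size());
    std::vector<double> y(a.num_rows, 0.0);
    for (std::size_t j = 0; j < a.ell_width; ++j) {
        for (std::size_t i = 0; i < a.num_rows; ++i) {
            const std::size_t idx = j * a.num_rows + i;
            y[i] += a.ell_vals[idx] * x[a.ell_cols[idx]];
        }
    }
    for (std::size_t k = 0; k < a.coo_vals.size(); ++k) {
        y[a.coo_rows[k]] += a.coo_vals[k] * x[a.coo_cols[k]];
    }
    return y;
}
```

Keeping the ELL width small limits padding for matrices with a few long rows, while the COO part absorbs the irregular remainder.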




Performance Profile on AMD's Radeon VII



Performance Profile on NVIDIA's V100



Compiling HIP code for NVIDIA GPUs: comparison against native CUDA code

• Native CUDA vs. HIP compiled for NVIDIA GPUs
• Same kernel
• All tests on NVIDIA V100 (Summit)
• We expect CUDA to be slightly faster

Outliers? Machine noise? HIP faster than CUDA on NVIDIA GPU?





Outlier stats on 100 runs of 20 repetitions each:

Reproducible, but relevant?



[Plot regions: HIP code faster / CUDA code faster]

• Running on V100 GPU
• 2,800 test matrices
• Compare key functionality:
  • Ginkgo Sellp SpMV
  • Ginkgo Coo SpMV
  • Vendor's Csr SpMV
  • Ginkgo's CG solver

Slight advantages on the CUDA side, but usually < 5%.


How do GPU architectures compare in terms of SpMV performance?

[Plots: Ginkgo COO SpMV; Vendor library COO SpMV]

[Plots: Ginkgo CSR SpMV; Vendor library CSR SpMV]

[Plots: Ginkgo ELL SpMV; Vendor library ELL SpMV]



Summary: The Sparse Matrix Vector Product on High-End GPUs

• AMD with its HIP software ecosystem is becoming a relevant alternative to NVIDIA and CUDA;

• Significant similarities between the languages (HIP/CUDA) allow for shared kernel implementations;

• HIP allows compiling for NVIDIA GPUs, in most cases with a moderate performance loss compared to native CUDA code;

• AMD GPUs and NVIDIA GPUs are comparable in sparse linear algebra performance;

• We deployed comprehensive cross-platform SpMV functionality in the Ginkgo library;

• We provide a comprehensive SpMV performance study for interactive exploration:

https://ginkgo-project.github.io/gpe/

Check out our poster on Ginkgo at the poster session tonight in PP2 at 6pm, 5th floor: Ginkgo - a Node-Level Sparse Linear Algebra Library for High Performance Computing



Backup: Ginkgo development workflow

[Workflow diagram; labels: Developer, Push, Source Code Repository, Continuous Integration (CI), CI Build, CI Test, CI Benchmark Tests, Code Review, Trusted Reviewer, Merge into Master Branch, Schedule in Batch System, HPC System, Performance Data Repository, Web-Application, Users]

https://ginkgo-project.github.io/

https://ginkgo-project.github.io/gpe/


Backup: Ginkgo Design

• Open-source C++ framework for sparse linear algebra.
• Sparse linear solvers, preconditioners, SpMV etc.
• Focused on multicore and manycore accelerators;
• Software quality and sustainability efforts guided by xSDK community policies:

https://xsdk.info/

• Static polymorphism for templating precisions
  • ValueType (default: Z, C, D, S), IndexType (int32/64)

• Smart pointers to avoid memory leaks

• Runtime polymorphism for operators and kernels

• Kernels have the same signature for different architectures

• Executor determines which kernel is used
  • Determines where data lives and where operations are executed

• LinOp class for any linear operator:
  • Matrices
  • Solvers
  • Preconditioners
  • …
  • Operations: generate, apply, …
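The combination of static polymorphism (templated value type) and runtime polymorphism (a common linear-operator base class with `apply`, and operators that are generated from a matrix) can be sketched in a few lines. This is a heavily simplified illustration and not Ginkgo's actual interface: the `LinOp`, `Diagonal`, and `generate_jacobi` definitions below are hypothetical, and vectors are plain `std::vector` objects.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified sketch; not Ginkgo's actual interface.
// Static polymorphism: the precision is a template parameter.
template <typename ValueType>
struct LinOp {
    virtual ~LinOp() = default;
    // Runtime polymorphism: every linear operator computes x = op(b).
    virtual void apply(const std::vector<ValueType>& b,
                       std::vector<ValueType>& x) const = 0;
};

// A matrix is a LinOp; a diagonal matrix keeps the example short.
template <typename ValueType>
struct Diagonal : LinOp<ValueType> {
    explicit Diagonal(std::vector<ValueType> d) : diag(std::move(d)) {}

    void apply(const std::vector<ValueType>& b,
               std::vector<ValueType>& x) const override
    {
        assert(b.size() == diag.size());
        x.resize(diag.size());
        for (std::size_t i = 0; i < diag.size(); ++i) {
            x[i] = diag[i] * b[i];
        }
    }

    std::vector<ValueType> diag;
};

// A preconditioner is also a LinOp, generated from a system matrix.
// Here: a Jacobi-style preconditioner, i.e. the inverse of the diagonal.
// The smart pointer owns the generated operator, so nothing leaks.
template <typename ValueType>
std::unique_ptr<LinOp<ValueType>> generate_jacobi(const Diagonal<ValueType>& a)
{
    std::vector<ValueType> inv(a.diag.size());
    for (std::size_t i = 0; i < inv.size(); ++i) {
        inv[i] = ValueType(1) / a.diag[i];
    }
    return std::unique_ptr<LinOp<ValueType>>(
        new Diagonal<ValueType>(std::move(inv)));
}
```

Code that only holds a `LinOp<ValueType>` pointer can call `apply` without knowing whether the object behind it is a matrix, a solver, or a preconditioner, which is the point of treating all of them as linear operators.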


Backup: GPU format conversion
