26
KIT – The Research University in the Helmholtz Association Hartwig Anzt, TerryCojean, Yuhsiang M. Tsai Steinbuch Centre for Computing (SCC) www.kit.edu The Sparse Matrix Vector Product on High-End GPUs SIAM Conference on Parallel Processing for Scientific Computing (PP20) February 12 - 15, 2020 Hyatt Regency Seattle | Seattle, Washington, U.S. Mike Tsai This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration and the Helmholtz Impuls und VernetzungsfondVH-NG-1241.

The Sparse Matrix Vector Product on High-End GPUs

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The Sparse Matrix Vector Product on High-End GPUs

KIT – The Research University in the Helmholtz Association

HartwigAnzt,TerryCojean, Yuhsiang M.TsaiSteinbuchCentre for Computing (SCC)

www.kit.edu

TheSparseMatrixVectorProductonHigh-EndGPUsSIAMConferenceonParallelProcessingforScientificComputing(PP20)February12- 15,2020HyattRegencySeattle|Seattle,Washington,U.S.

MikeTsai

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of EnergyOffice of Science and the National Nuclear Security Administration and the Helmholtz Impuls und VernetzungsfondVH-NG-1241.

Page 2: The Sparse Matrix Vector Product on High-End GPUs

2 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

SpMV onGPUs– MovingawayfromtheNVIDIAhegemony

• Inthepast,NVIDIAGPUsweredominating theGPGPUmarket;

• WeseeanincreasingadoptionofAMDGPUsinleadershipsupercomputers:• FrontiersysteminOakRidge (2021)• ElCapitaninLawrenceLivermoreNationalLab?(2023)

• AMDisheavilyinvestingintheHIPsoftwaredevelopmentecosystem;• HIPprogramming similartoCUDAprogramming;• HIPlibrariessimilartocuBLAS, cuSPARSE,…

• TheRaceison!

• HowcanwepreparetheGinkgosparselinearalgebralibrary forcross-platformportability?

• AretheCUDA-optimizedkernelssuitableforAMDGPUs?

• HowdoestheperformancecompareacrossdifferentGPUs?

Page 3: The Sparse Matrix Vector Product on High-End GPUs

3 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

ExtendGinkgo’shardwarescopetoAMDGPUs

LibraryInfrastructureAlgorithm Implementations• IterativeSolvers• Preconditioners• …

Core

OpenMP-kernels• SpMV• Solverkernels• Precond kernels• …

OpenMPReferencekernels• SpMV• Solverkernels• Precond kernels• …

ReferenceCUDA-GPUkernels• SpMV• Solverkernels• Precond kernels• …

CUDA

Librarycorecontainsarchitecture-agnosticalgorithm implementation;

Runtimepolymorphism selectstherightkerneldepending onthetargetarchitecture;

Architecture-specifickernelsexecutethealgorithmontargetarchitecture;

Referencearesequentialkernels tocheckcorrectnessofalgorithmdesignandoptimizedkernels;

Optimizedarchitecture-specifickernels;

Kernels

https://github.com/ginkgo-project/ginkgo

Part of

https://xsdk.info/

Page 4: The Sparse Matrix Vector Product on High-End GPUs

4 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

ExtendGinkgo’shardwarescopetoAMDGPUs

LibraryInfrastructureAlgorithm Implementations• IterativeSolvers• Preconditioners• …

Core

OpenMP-kernels• SpMV• Solverkernels• Precond kernels• …

OpenMPReferencekernels• SpMV• Solverkernels• Precond kernels• …

ReferenceHIP-GPUkernels• SpMV• Solverkernels• Precond kernels• …

CUDAHIP-GPUkernels• SpMV• Solverkernels• Precond kernels• …

HIP

Librarycorecontainsarchitecture-agnosticalgorithm implementation;

Runtimepolymorphism selectstherightkerneldepending onthetargetarchitecture;

Architecture-specifickernelsexecutethealgorithmontargetarchitecture;

Referencearesequentialkernels tocheckcorrectnessofalgorithmdesignandoptimizedkernels;

Optimizedarchitecture-specifickernels;

Kernels

Page 5: The Sparse Matrix Vector Product on High-End GPUs

5 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

ExtendGinkgo’shardwarescopetoAMDGPUs

LibraryInfrastructureAlgorithm Implementations• IterativeSolvers• Preconditioners• …

Core

OpenMP-kernels• SpMV• Solverkernels• Precond kernels• …

OpenMPReferencekernels• SpMV• Solverkernels• Precond kernels• …

ReferenceHIP-GPUkernels• SpMV• Solverkernels• Precond kernels• …

CUDAHIP-GPUkernels• SpMV• Solverkernels• Precond kernels• …

HIP

Librarycorecontainsarchitecture-agnosticalgorithm implementation;

Runtimepolymorphism selectstherightkerneldepending onthetargetarchitecture;

Architecture-specifickernelsexecutethealgorithmontargetarchitecture;

Referencearesequentialkernels tocheckcorrectnessofalgorithmdesignandoptimizedkernels;

Optimizedarchitecture-specifickernels;

Kernels

CUDA HIP

Page 6: The Sparse Matrix Vector Product on High-End GPUs

6 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

ExtendGinkgo’shardwarescopetoAMDGPUs

LibraryInfrastructureAlgorithm Implementations• IterativeSolvers• Preconditioners• …

Core

OpenMP-kernels• SpMV• Solverkernels• Precond kernels• …

OpenMPReferencekernels• SpMV• Solverkernels• Precond kernels• …

ReferenceCUDA-GPUkernels• SpMV• Solverkernels• Precond kernels• …

CUDAHIP-GPUkernels• SpMV• Solverkernels• Precond kernels• …

HIP

Librarycorecontainsarchitecture-agnosticalgorithm implementation;

Runtimepolymorphism selectstherightkerneldepending onthetargetarchitecture;

Architecture-specifickernelsexecutethealgorithmontargetarchitecture;

Referencearesequentialkernels tocheckcorrectnessofalgorithmdesignandoptimizedkernels;

Optimizedarchitecture-specifickernels;

Kernels

Page 7: The Sparse Matrix Vector Product on High-End GPUs

7 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

ExtendGinkgo’shardwarescopetoAMDGPUs

LibraryInfrastructureAlgorithm Implementations• IterativeSolvers• Preconditioners• …

Core

OpenMP-kernels• SpMV• Solverkernels• Precond kernels• …

OpenMPReferencekernels• SpMV• Solverkernels• Precond kernels• …

ReferenceCUDA-GPUkernels• SpMV• Solverkernels• Precond kernels• …

CUDAHIP-GPUkernels• SpMV• Solverkernels• Precond kernels• …

HIP

Librarycorecontainsarchitecture-agnosticalgorithm implementation;

Runtimepolymorphism selectstherightkerneldepending onthetargetarchitecture;

Architecture-specifickernelsexecutethealgorithmontargetarchitecture;

Referencearesequentialkernels tocheckcorrectnessofalgorithmdesignandoptimizedkernels;

Optimizedarchitecture-specifickernels;

Kernels

• SharedkernelsCommon

Toavoidcodeduplication,commoncontainskernelssharedbetweenCUDAandHIP(upon parameterconfigs)

Page 8: The Sparse Matrix Vector Product on High-End GPUs

8 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

ExtendGinkgo’shardwarescopetoAMDGPUs

• KernelssharedbetweenCUDAandAMDbackends (upon parametersetting)arerelocatedinthe``common’’ module.

• NewcodenecessaryforHIP-specificoptimizationsandforimplementingfunctionalitycurrentlymissing intheHIPecosystem(e.g.cooperativegroups).

CUDA

new

CUDA

common

HIP

Page 9: The Sparse Matrix Vector Product on High-End GPUs

9 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

HowdoesGinkgocomparetothevendorlibraries- COOSpMV

GinkgovsHIPsparse onRadeonVII GinkgovscuSPARSE onV100

Resultsandinteractiveperformanceexploreravailableat: https://ginkgo-project.github.io/gpe/

Page 10: The Sparse Matrix Vector Product on High-End GPUs

10 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

HowdoesGinkgocomparetothevendorlibraries- CSRSpMV

GinkgovsHIPsparse onRadeonVII GinkgovscuSPARSE onV100

Resultsandinteractiveperformanceexploreravailableat: https://ginkgo-project.github.io/gpe/

Page 11: The Sparse Matrix Vector Product on High-End GPUs

11 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

HowdoesGinkgocomparetothevendorlibraries- ELLSpMV

GinkgovsHIPsparse onRadeonVII GinkgovscuSPARSE onV100

Resultsandinteractiveperformanceexploreravailableat: https://ginkgo-project.github.io/gpe/

Page 12: The Sparse Matrix Vector Product on High-End GPUs

12 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

HowdoesGinkgocomparetothevendorlibraries- hybridSpMV

GinkgovsHIPsparse onRadeonVII GinkgovscuSPARSE onV100

Resultsandinteractiveperformanceexploreravailableat: https://ginkgo-project.github.io/gpe/

Page 13: The Sparse Matrix Vector Product on High-End GPUs

13 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

PerformanceProfileonAMD’sRadeonVII

Resultsandinteractiveperformanceexploreravailableat: https://ginkgo-project.github.io/gpe/

Page 14: The Sparse Matrix Vector Product on High-End GPUs

14 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

PerformanceProfileonNVIDIA’sV100

Resultsandinteractiveperformanceexploreravailableat: https://ginkgo-project.github.io/gpe/

Page 15: The Sparse Matrix Vector Product on High-End GPUs

15 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

CompilingHIPcodeforNVIDIAGPUs– comparisonagainstnativeCUDAcode

• NativeCUDAvs.HIPcompiledforNVIDIAGPUs• Samekernel• AlltestsonNVIDIAV100(Summit)• WeexpectCUDAtobeslightlyfaster

Page 16: The Sparse Matrix Vector Product on High-End GPUs

16 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

CompilingHIPcodeforNVIDIAGPUs– comparisonagainstnativeCUDAcode

• NativeCUDAvs.HIPcompiledforNVIDIAGPUs• Samekernel• AlltestsonNVIDIAV100(Summit)• WeexpectCUDAtobeslightlyfaster

outliers?machinenoise?HIPfasterthanCUDAonNVIDIAGPU?

Page 17: The Sparse Matrix Vector Product on High-End GPUs

17 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

CompilingHIPcodeforNVIDIAGPUs– comparisonagainstnativeCUDAcode

• NativeCUDAvs.HIPcompiledforNVIDIAGPUs• Samekernel• AlltestsonNVIDIAV100(Summit)• WeexpectCUDAtobeslightlyfaster

outliers?machinenoise?HIPfasterthanCUDAonNVIDIAGPU?

Outlierstatson100runsa20reps:

Page 18: The Sparse Matrix Vector Product on High-End GPUs

18 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

CompilingHIPcodeforNVIDIAGPUs– comparisonagainstnativeCUDAcode

• NativeCUDAvs.HIPcompiledforNVIDIAGPUs• Samekernel• AlltestsonNVIDIAV100(Summit)• WeexpectCUDAtobeslightlyfaster

outliers?machinenoise?HIPfasterthanCUDAonNVIDIAGPU?

Outlierstatson100runsa20reps:

Reproducible– butrelevant?

Page 19: The Sparse Matrix Vector Product on High-End GPUs

19 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

CompilingHIPcodeforNVIDIAGPUs– comparisonagainstnativeCUDAcode

HIPcodefaster

CUDAcodefaster

• Running onV100GPU• 2,800testmatrices• Comparekeyfunctionality

• GinkgoSellp SpMV• GinkgoCooSpMV• Vendor’sCsr SpMV• Ginkgo’s CGsolver

SlightadvantagesontheCUDAside,butusually<5%.

Page 20: The Sparse Matrix Vector Product on High-End GPUs

20 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

HowdoGPUarchitecturescompareintermsofSpMV performance?

GinkgoCOOSpMV Vendor libraryCOOSpMV

Resultsandinteractiveperformanceexploreravailableat: https://ginkgo-project.github.io/gpe/

Page 21: The Sparse Matrix Vector Product on High-End GPUs

21 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

HowdoGPUarchitecturescompareintermsofSpMV performance?

GinkgoCSRSpMV Vendor libraryCSRSpMV

Resultsandinteractiveperformanceexploreravailableat: https://ginkgo-project.github.io/gpe/

Page 22: The Sparse Matrix Vector Product on High-End GPUs

22 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

HowdoGPUarchitecturescompareintermsofSpMV performance?

GinkgoELLSpMV Vendor libraryELLSpMV

Resultsandinteractiveperformanceexploreravailableat: https://ginkgo-project.github.io/gpe/

Page 23: The Sparse Matrix Vector Product on High-End GPUs

23 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

Summary:TheSparseMatrixVectorProductonHigh-EndGPUs

• AMDanditsHIPsoftwareecosystembecomesarelevantalternativetoNVIDIACUDA;

• Significantsimilarities inthelanguages (HIP/CUDA)allowsforsharedkernel implementations;

• HIPallowstocompileforNVIDIAGPUs

– inmostcaseswithmoderateperformance losscomparedtonativeCUDAcode;

• AMD GPUsandNVIDIA GPUscomparableinsparselinearalgebraperformance;

• Wedeployed comprehensivecross-platformSpMV functionalityintheGinkgo library;

• Weprovideacomprehensive SpMV performancestudyforinteractiveexploration:

https://ginkgo-project.github.io/gpe/

CheckoutourposteronGinkgoatthepostersessiontonight inPP2at6pm,5th floor:Ginkgo- aNode-LevelSparseLinearAlgebraLibraryforHighPerformanceComputing

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of EnergyOffice of Science and the National Nuclear Security Administration and the Helmholtz Impuls und VernetzungsfondVH-NG-1241.

https://xsdk.info/

Page 24: The Sparse Matrix Vector Product on High-End GPUs

24 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

CITest

PerformanceDataRepository

ContinuousIntegration(CI)

DeveloperCodeReview

CIBuildSourceCodeRepository

Push

Schedule inBatchSystem

Web-Application

HPCSystem

TrustedReviewer

Users

MergeintoMasterBranch

CIBenchmarkTests https://ginkgo-project.github.io/

https://ginkgo-project.github.io/gpe/

Backup:Ginkgodevelopmentworkflow

Page 25: The Sparse Matrix Vector Product on High-End GPUs

25 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

Backup:GinkgoDesign

• Open-sourceC++frameworkforsparselinearalgebra.• Sparselinearsolvers,preconditioners, SpMV etc.• FocusedonMulticoreandManycore accelerators;• Softwarequalityandsustainabilityefforts

guidedbyxSDK communitypolicies:

https://xsdk.info/

• Staticpolymorphism fortemplatingprecisions• ValueType (default:Z,C,D,S), IndexType (int32/64)

• Smartpointerstoavoidmemory leaks• Runtimepolymorphism foroperators andkernels

• Kernelshavethesamesignaturefordifferentarchitectures

• Executordetermineswhichkernel isused

• LinOp classforanylinearoperator:• Matrices• Solvers• Preconditioners• …

generateapply…

DetermineswhereDatalives&operationsareexecuted

Page 26: The Sparse Matrix Vector Product on High-End GPUs

26 02/13/2020Hartwig Anzt:TheSparseMatrixVectorProduct onHigh-EndGPUs

Backup:GPUformatconversion