Towards a GraphBLAS Library in Chapel

Ariful Azad & Aydın Buluç, Lawrence Berkeley National Laboratory (LBNL)
CHIUW, IPDPS 2017
Overview

GraphBLAS: building blocks for graph algorithms in the language of sparse linear algebra.
Chapel: an emerging parallel language designed for productive parallel computing at scale.
Both promise: productivity + performance.

• High-level research objective:
  – Enable productive and high-performance graph analytics
  – We used GraphBLAS and Chapel to achieve this goal
• Scope of this paper: a GraphBLAS library in Chapel
  1. Overview of GraphBLAS primitives
  2. Implementation of a subset of GraphBLAS primitives in Chapel, with experimental results
Outline

Warning: this is just an early evaluation, as Chapel's sparse matrix support is actively under development. All experiments were conducted on Chapel 1.13.1. The performance numbers are expected to improve significantly in future releases of Chapel.
Part 1. GraphBLAS overview
GraphBLAS analogy: a ready-to-assemble furniture shop (Ikea)
Building blocks → objects (algorithms) → final product (applications)
• GraphBLAS (http://graphblas.org)
  – Standard building blocks for graph algorithms in the language of sparse linear algebra
  – Inspired by the Basic Linear Algebra Subprograms (BLAS)
  – Participants from industry, academia, and national labs
  – The C API is available on the website (Design of the GraphBLAS API for C; A. Buluç, T. Mattson, S. McMillan, J. Moreira, C. Yang; IPDPS Workshops 2017)
Graph algorithm building blocks
• Employs graph-matrix duality
  – Graph ⇒ sparse matrix
  – A subset of vertices/edges ⇒ sparse/dense vector
• Benefits
  – Standard set of operations
  – Learn from the rich history of numerical linear algebra
  – Offers structured and regular memory accesses and communications (as opposed to the irregular memory accesses in traditional graph algorithms)
  – Opportunity for communication-avoiding algorithms
GraphBLAS as algorithm building blocks
Some GraphBLAS basic primitives

Function                       | Parameters                                                | Returns                           | Matlab notation
MxM (SpGEMM)                   | sparse matrices A and B; optional unary functs            | sparse matrix                     | C = A*B
MxV (SpM{Sp}V)                 | sparse matrix A; sparse/dense vector x                    | sparse/dense vector               | y = A*x
EwiseMult, Add, … (SpEWiseX)   | sparse matrices or vectors; binary funct, optional unarys | in place, or sparse matrix/vector | C = A.*B; C = A+B
Reduce (Reduce)                | sparse matrix A and funct                                 | dense vector                      | y = sum(A, op)
Extract (SpRef)                | sparse matrix A; index vectors p and q                    | sparse matrix                     | B = A(p,q)
Assign (SpAsgn)                | sparse matrices A and B; index vectors p and q            | none                              | A(p,q) = B
BuildMatrix (Sparse)           | list of edges/triples (i,j,v)                             | sparse matrix                     | A = sparse(i,j,v,m,n)
ExtractTuples (Find)           | sparse matrix A                                           | edge list                         | [i,j,v] = find(A)
General-purpose operations via semirings (overloading the addition and multiplication operations)

Real field (R, +, ×)                → classical numerical linear algebra
Boolean algebra ({0,1}, |, &)       → graph traversal
Tropical semiring (R ∪ {∞}, min, +) → shortest paths
(S, select, select)                 → select a subgraph, or contract nodes to form a quotient graph
(edge/vertex attributes, vertex data aggregation, edge data processing) → schema for user-specified computation at vertices and edges
(R, max, +)                         → graph matching & network alignment
(R, min, times)                     → maximal independent set

• Shortened semiring notation: (Set, Add, Multiply); both identities omitted.
• Multiply traverses edges; Add combines edges/paths at a vertex.
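As an illustration of the semiring idea above, here is a hypothetical Python sketch (the paper's actual code is Chapel; the function name and example matrix are invented): a vector-matrix product parameterized by the (add, multiply) pair, shown with the tropical semiring so that each call performs one step of shortest-path relaxation.

```python
INF = float("inf")

def semiring_vxm(x, A, add, mul, zero):
    """y[j] = add-reduction over i of mul(x[i], A[i][j])."""
    n, m = len(A), len(A[0])
    y = [zero] * m
    for j in range(m):
        for i in range(n):
            y[j] = add(y[j], mul(x[i], A[i][j]))
    return y

# Tropical semiring (R U {inf}, min, +): A[i][j] is the weight of edge i->j.
A = [[0.0, 5.0, INF],
     [INF, 0.0, 2.0],
     [1.0, INF, 0.0]]
dist = [0.0, INF, INF]          # distances from vertex 0
for _ in range(2):              # two relaxation steps
    dist = semiring_vxm(dist, A, min, lambda a, b: a + b, INF)
# dist is now [0.0, 5.0, 7.0]
```

Swapping in (|, &) over {0,1} turns the same loop into a step of Boolean graph traversal; nothing about the loop itself changes.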
Example: exploring the next-level vertices via SpMSpV

[Figure: an 8-vertex graph (a-h), its 8×8 adjacency matrix, and the current and next frontier vectors. Overloading (multiply, add) with (select2nd, min) makes the sparse matrix-sparse vector product map the current frontier to the next frontier.]
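A minimal Python sketch of this step, with invented names (the slides' implementation is in Chapel): the "multiply" is select2nd, which simply forwards the frontier value along each edge, and the "add" is min, which picks one value when several edges reach the same vertex.

```python
def next_frontier(adj, frontier, visited):
    """One BFS step as SpMSpV over the (min, select2nd) semiring.
    adj: {vertex: [neighbors]}; frontier: {vertex: value}; visited: set."""
    nxt = {}
    for u, val in frontier.items():        # nonzeros of the input vector
        for v in adj.get(u, ()):           # multiply = select2nd: keep val
            if v not in visited:
                nxt[v] = min(nxt.get(v, val), val)   # add = min
    return nxt

adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
# Frontier values are set to the vertex's own id, so the result records
# a deterministic parent choice for each newly reached vertex.
f1 = next_frontier(adj, {"a": "a"}, {"a"})
# f1 == {"b": "a", "c": "a"}
f2 = next_frontier(adj, {v: v for v in f1}, {"a", "b", "c"})
# f2 == {"d": "b"}  (min picks parent "b" over "c")
```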
Algorithmic coverage

[Figure: GraphBLAS primitives in increasing arithmetic intensity: SpMSpV (sparse matrix-sparse vector), SpMV (sparse matrix-dense vector), SpMM (sparse matrix times multiple dense vectors), SpGEMM (sparse-sparse matrix product), SpDM3 (sparse-dense matrix product).]

Higher-level combinatorial and machine-learning algorithms built on them:
– Shortest paths (all-pairs, single-source, temporal)
– Graph clustering (Markov cluster, peer pressure, spectral, local)
– Miscellaneous: connectivity, traversal (BFS), independent sets (MIS), graph matching
– Centrality (PageRank, betweenness, closeness)
– Classification (support vector machines, logistic regression)
– Dimensionality reduction (NMF, PCA)

• Develop high-performance algorithms for 10-12 primitives.
• Use them in many algorithms (boost productivity).
Expectation: two-layer productivity

Graph algorithms (user space)
  → use GraphBLAS operations (library)
  → use Chapel's productivity features (language)
Part 2. Implementing a subset of GraphBLAS operations in Chapel
For Chapel: a subset of GraphBLAS operations

Function  | Parameters                                       | Returns                 | Semantics
Apply     | x: sparse matrix/vector; f: unary function       | none                    | x[i] = f(x[i])
Assign    | x: sparse matrix/vector; y: sparse matrix/vector | none                    | x[i] = y[i]
eWiseMult | x: sparse matrix/vector; y: sparse matrix/vector | z: sparse matrix/vector | z[i] = x[i]*y[i]
SpMSpV    | A: sparse matrix; x: sparse vector               | y: sparse vector        | y = Ax
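To make the table concrete, here is a hypothetical Python sketch of the semantics of the three elementwise operations (the paper's implementations are in Chapel; the function names are invented). A sparse vector is an {index: value} dict, and only stored entries participate.

```python
def apply_(x, f):
    """Apply: x[i] = f(x[i]), in place, touching only the stored nonzeros."""
    for i in x:
        x[i] = f(x[i])

def assign(x, y):
    """Assign: x[i] = y[i] for every stored entry of y."""
    for i, v in y.items():
        x[i] = v

def ewise_mult(x, y):
    """eWiseMult: z[i] = x[i] * y[i] over the intersection of the patterns."""
    return {i: x[i] * y[i] for i in x.keys() & y.keys()}
```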
Experimental platform

• Chapel details
  – Chapel 1.13.1 (the latest version before the IPDPS deadline)
  – Chapel built from source
  – CHPL_COMM: gasnet/gemini
  – Job launcher: slurm-srun
• Experiment platform: NERSC/Edison
  – Intel Ivy Bridge processors
  – 24 cores on 2 sockets
  – 64 GB memory per node
  – 30 MB L3 cache
Sparse matrices in Chapel

• Block-distributed sparse matrices: the dense container is block-distributed.
• We used the compressed sparse row (CSR) layout to store local matrices.

  var n = 6;
  const D = {0..n-1, 0..n-1} dmapped Block({1..3, 1..3});
  var spD: sparse subdomain(D);
  var A: [spD] real;

In this example: #locales = 9.
In our results, we did not include the time to construct arrays.
The simplest GraphBLAS operation: Apply (x[i] = f(x[i]))

Apply1: high-level (Chapel style)
Apply2: manipulating internal arrays (MPI style)

[Figure: shared-memory scaling of Apply1 and Apply2 on a single node, 1-32 threads; time (ms) on a log scale.]
Example, simple case: Apply (x[i] = f(x[i]))

[Figure: distributed-memory scaling of Apply1 and Apply2 on 1-64 nodes (24 threads per node); time (s) on a log scale.]

Apply1: high-level (Chapel style). Apply2: manipulating internal arrays (C++ style). x: 10M nonzeros. Platform: NERSC/Edison.
Data-parallel loops perform well in shared memory, but do not perform well in distributed memory.
Performance on distributed memory

[Figure: chplvis traces of Apply1 and Apply2 on four locales; red: data in, blue: data out. With Apply1, all work happens at locale 0.]

This issue with sparse arrays was addressed about a week ago.
Assign (x[i] = y[i])

Assign1: high-level (Chapel style)
Assign2: manipulating internal arrays (MPI style)
Shared-memory performance: Assign (x[i] = y[i])

[Figure: single-node scaling of Assign1 and Assign2, 1-32 threads; time (ms) on a log scale. There is a big performance gap, even in shared memory.]

Assign1: high-level (Chapel style). Assign2: manipulating internal arrays (C++ style). x: 1M nonzeros. Platform: NERSC/Edison.
Why? Indexing into a sparse domain uses a binary search; for assignment, it can be avoided.
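A hypothetical Python sketch of that gap (names invented; the real code manipulates Chapel's domain internals): reading one entry of a CSR-style vector costs a binary search over the sorted index list, whereas an assignment between two vectors that share a sparsity pattern is a straight copy of the values array.

```python
import bisect

def get_entry(idx, vals, i):
    """O(log nnz) read: binary-search the sorted index list for position i."""
    k = bisect.bisect_left(idx, i)
    return vals[k] if k < len(idx) and idx[k] == i else 0.0

def assign_same_pattern(x_vals, y_vals):
    """O(nnz) assignment, no searches: valid when x and y share a pattern."""
    x_vals[:] = y_vals
```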
Distributed-memory performance: Assign (x[i] = y[i])

[Figure: scaling of Assign1 and Assign2 on 1-64 nodes (24 threads per node); time (s) on a log scale. There is a big performance gap, even in distributed memory.]

Assign1: high-level (Chapel style). Assign2: manipulating internal arrays (C++ style). x: 1M nonzeros. Platform: NERSC/Edison.
Example, complex case: SpMSpV (y = Ax)

[Figure: y = A*x using a sparse accumulator (SPA): gather the needed entries, multiply, then scatter/accumulate the results into y.]

Algorithm overview: sparse matrix-sparse vector multiply (SpMSpV)
P processors are arranged in a √p × √p processor grid.
1. Gather vertices in a processor column
2. Local multiplication
3. Scatter results in a processor row
Multiply, accessing remote data as needed; no collective communication.

Algorithm (Chapel style) vs. algorithm (MPI style) [code shown side by side on the slide]
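The three phases can be sketched in Python (a hypothetical single-process rendering with invented names; the paper's version distributes A over the processor grid). A is stored column-major, so the nonzeros of x select whole columns:

```python
def spmspv(A_cols, x):
    """y = A*x for sparse x. A_cols: {col: [(row, val), ...]};
    x and y: {index: value} dicts."""
    # 1. Gather: pull in the columns of A selected by x's nonzeros.
    gathered = [(xj, A_cols.get(j, [])) for j, xj in x.items()]
    y = {}
    # 2. Local multiply, and 3. scatter/accumulate into y (the SPA's role).
    for xj, col in gathered:
        for i, aij in col:
            y[i] = y.get(i, 0.0) + aij * xj
    return y

A_cols = {0: [(1, 2.0)], 2: [(0, 3.0), (1, 1.0)]}
x = {0: 1.0, 2: 2.0}
# spmspv(A_cols, x) == {1: 4.0, 0: 6.0}
```

In the distributed version, phase 1 fetches remote vector entries down a processor column and phase 3 accumulates partial results across a processor row; only the middle phase is purely local.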
Distributed-memory performance of SpMSpV on Edison

[Figure: three panels of time (s) vs. number of nodes (1-64, 24 threads per node), each broken down into Gather Input, Local Multiply, and Scatter Output: (a) ER matrix (n=1M, d=16, f=2%); (b) ER matrix (n=1M, d=4, f=2%); (c) ER matrix (n=1M, d=16, f=20%).]

A: random, 16M nonzeros. x: random, 2000 nonzeros.
Remote atomics are expensive in Chapel; we don't know the reason.
Requirements for achieving high performance

• Exploit the available spatial locality in sparse manipulations
  – Efficient access to the nonzeros of sparse matrices/vectors
  – Chapel is almost there; it needs improved parallel iterators
• Use bulk-synchronous communication whenever possible
  – Avoid latency-bound communication
  – Team collectives are useful
Our experience: productivity vs. performance

Productivity (easy to develop a prototype)
Task            | Hardness | Why?
Data structure  | medium   | Manipulating domains and arrays
Functionality   | easy     | Fewer lines of code with built-in features
Parallelization | easy     | No need to think about communication

Performance (hard to achieve performance)
Task               | Hardness | Why?
Data structure     | hard     | Manipulating low-level data structures
Shared-memory      | medium   | Data-parallel iterators for sparse data
Distributed-memory | hard     | Needs bulk-synchronous communication, team collectives, etc.
Summary

• We have implemented a prototype GraphBLAS library in Chapel
  – Implemented breadth-first search as a representative algorithm using these primitives
• Library development in Chapel is easy (relative to C++)
• Chapel's distributed sparse-matrix support is still under development; the distributed-memory performance is expected to improve over time.
Future direction

• Finish a complete GraphBLAS-compliant library in a PGAS language (including Chapel)
  – Achieving high performance is our focus
  – Benchmark our library against other programming models and languages
• Design complex graph algorithms using the library to demonstrate its utility
  – Understand the impact of programming models on graph analytics
Acknowledgements and relevant references

• Funded in part by DOD/ACS and in part by DOE/ASCR
• Acknowledgements: Costin Iancu (LBNL), Brad Chamberlain (Cray), Michael Ferguson (Cray), Engin Kayraklioglu (George Washington University)
• References:
  – A. Azad and A. Buluç. Towards a GraphBLAS library in Chapel. IPDPS Workshops 2017.
  – A. Azad and A. Buluç. A work-efficient parallel sparse matrix-sparse vector multiplication algorithm. IPDPS 2017.

Questions?