Towards a GraphBLAS Library in Chapel

Ariful Azad & Aydın Buluç, Lawrence Berkeley National Laboratory (LBNL)
CHIUW, IPDPS 2017
Overview

GraphBLAS: building blocks for graph algorithms in the language of sparse linear algebra.
Chapel: an emerging parallel language designed for productive parallel computing at scale.
Both promise: productivity + performance.

• High-level research objective:
  – Enable productive and high-performance graph analytics
  – We used GraphBLAS and Chapel to achieve this goal
• Scope of this paper: a GraphBLAS library in Chapel
  1. Overview of GraphBLAS primitives
  2. Implementation of a subset of GraphBLAS primitives in Chapel, with experimental results
Outline

Warning: this is just an early evaluation, as Chapel's sparse matrix support is actively under development. All experiments were conducted on Chapel 1.13.1. The performance numbers are expected to improve significantly in future releases of Chapel.
Part 1. GraphBLAS overview
GraphBLAS analogy: a ready-to-assemble furniture shop (Ikea)
Building blocks → objects (algorithms) → final product (applications)
• GraphBLAS (http://graphblas.org)
  – Standard building blocks for graph algorithms in the language of sparse linear algebra
  – Inspired by the Basic Linear Algebra Subprograms (BLAS)
  – Participants from industry, academia, and national labs
  – The C API is available on the website (Design of the GraphBLAS API for C; A. Buluç, T. Mattson, S. McMillan, J. Moreira, C. Yang; IPDPS Workshops 2017)
Graph algorithm building blocks
• Employs graph-matrix duality
  – Graph ⇒ sparse matrix
  – A subset of vertices/edges ⇒ sparse/dense vector
• Benefits
  – Standard set of operations
  – Learn from the rich history of numerical linear algebra
  – Offers structured and regular memory accesses and communications (as opposed to the irregular memory accesses in traditional graph algorithms)
  – Opportunity for communication-avoiding algorithms
GraphBLAS as algorithm building blocks
Some GraphBLAS basic primitives

Function                       | Parameters                                                | Returns                           | Matlab notation
MxM (SpGEMM)                   | sparse matrices A and B; optional unary functs            | sparse matrix                     | C = A*B
MxV (SpM{Sp}V)                 | sparse matrix A; sparse/dense vector x                    | sparse/dense vector               | y = A*x
EwiseMult, Add, … (SpEWiseX)   | sparse matrices or vectors; binary funct, optional unarys | in place, or sparse matrix/vector | C = A.*B; C = A+B
Reduce (Reduce)                | sparse matrix A and funct                                 | dense vector                      | y = sum(A, op)
Extract (SpRef)                | sparse matrix A; index vectors p and q                    | sparse matrix                     | B = A(p,q)
Assign (SpAsgn)                | sparse matrices A and B; index vectors p and q            | none                              | A(p,q) = B
BuildMatrix (Sparse)           | list of edges/triples (i,j,v)                             | sparse matrix                     | A = sparse(i,j,v,m,n)
ExtractTuples (Find)           | sparse matrix A                                           | edge list                         | [i,j,v] = find(A)
General-purpose operations via semirings (overloading the addition and multiplication operations)

Real field (R, +, ×)                → classical numerical linear algebra
Boolean algebra ({0,1}, |, &)       → graph traversal
Tropical semiring (R ∪ {∞}, min, +) → shortest paths
(S, select, select)                 → select a subgraph, or contract nodes to form a quotient graph
(edge/vertex attributes, vertex data aggregation, edge data processing) → schema for user-specified computation at vertices and edges
(R, max, +)                         → graph matching & network alignment
(R, min, times)                     → maximal independent set

• Shortened semiring notation: (Set, Add, Multiply); both identities omitted.
• Multiply traverses edges; Add combines edges/paths at a vertex.
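As an illustration of the semiring idea above, here is a hypothetical Python sketch (the paper's actual code is Chapel; the function name and example matrix are invented): a vector-matrix product parameterized by the (add, multiply) pair, shown with the tropical semiring so that each call performs one step of shortest-path relaxation.

```python
INF = float("inf")

def semiring_vxm(x, A, add, mul, zero):
    """y[j] = add-reduction over i of mul(x[i], A[i][j])."""
    n, m = len(A), len(A[0])
    y = [zero] * m
    for j in range(m):
        for i in range(n):
            y[j] = add(y[j], mul(x[i], A[i][j]))
    return y

# Tropical semiring (R U {inf}, min, +): A[i][j] is the weight of edge i->j.
A = [[0.0, 5.0, INF],
     [INF, 0.0, 2.0],
     [1.0, INF, 0.0]]
dist = [0.0, INF, INF]          # distances from vertex 0
for _ in range(2):              # two relaxation steps
    dist = semiring_vxm(dist, A, min, lambda a, b: a + b, INF)
# dist is now [0.0, 5.0, 7.0]
```

Swapping in (|, &) over {0,1} turns the same loop into a step of Boolean graph traversal; nothing about the loop itself changes.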
Example: exploring the next-level vertices via SpMSpV

[Figure: an 8-vertex graph (a-h), its 8×8 adjacency matrix, and the current and next frontier vectors. Overloading (multiply, add) with (select2nd, min) makes the sparse matrix-sparse vector product map the current frontier to the next frontier.]
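A minimal Python sketch of this step, with invented names (the slides' implementation is in Chapel): the "multiply" is select2nd, which simply forwards the frontier value along each edge, and the "add" is min, which picks one value when several edges reach the same vertex.

```python
def next_frontier(adj, frontier, visited):
    """One BFS step as SpMSpV over the (min, select2nd) semiring.
    adj: {vertex: [neighbors]}; frontier: {vertex: value}; visited: set."""
    nxt = {}
    for u, val in frontier.items():        # nonzeros of the input vector
        for v in adj.get(u, ()):           # multiply = select2nd: keep val
            if v not in visited:
                nxt[v] = min(nxt.get(v, val), val)   # add = min
    return nxt

adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
# Frontier values are set to the vertex's own id, so the result records
# a deterministic parent choice for each newly reached vertex.
f1 = next_frontier(adj, {"a": "a"}, {"a"})
# f1 == {"b": "a", "c": "a"}
f2 = next_frontier(adj, {v: v for v in f1}, {"a", "b", "c"})
# f2 == {"d": "b"}  (min picks parent "b" over "c")
```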
Algorithmic coverage

[Figure: GraphBLAS primitives in increasing arithmetic intensity: SpMSpV (sparse matrix-sparse vector), SpMV (sparse matrix-dense vector), SpMM (sparse matrix times multiple dense vectors), SpGEMM (sparse-sparse matrix product), SpDM3 (sparse-dense matrix product).]

Higher-level combinatorial and machine-learning algorithms built on them:
– Shortest paths (all-pairs, single-source, temporal)
– Graph clustering (Markov cluster, peer pressure, spectral, local)
– Miscellaneous: connectivity, traversal (BFS), independent sets (MIS), graph matching
– Centrality (PageRank, betweenness, closeness)
– Classification (support vector machines, logistic regression)
– Dimensionality reduction (NMF, PCA)

• Develop high-performance algorithms for 10-12 primitives.
• Use them in many algorithms (boost productivity).
Expectation: two-layer productivity

Graph algorithms (user space)
  → use GraphBLAS operations (library)
  → use Chapel's productivity features (language)
Part 2. Implementing a subset of GraphBLAS operations in Chapel
For Chapel: a subset of GraphBLAS operations

Function  | Parameters                                       | Returns                 | Semantics
Apply     | x: sparse matrix/vector; f: unary function       | none                    | x[i] = f(x[i])
Assign    | x: sparse matrix/vector; y: sparse matrix/vector | none                    | x[i] = y[i]
eWiseMult | x: sparse matrix/vector; y: sparse matrix/vector | z: sparse matrix/vector | z[i] = x[i]*y[i]
SpMSpV    | A: sparse matrix; x: sparse vector               | y: sparse vector        | y = Ax
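To make the table concrete, here is a hypothetical Python sketch of the semantics of the three elementwise operations (the paper's implementations are in Chapel; the function names are invented). A sparse vector is an {index: value} dict, and only stored entries participate.

```python
def apply_(x, f):
    """Apply: x[i] = f(x[i]), in place, touching only the stored nonzeros."""
    for i in x:
        x[i] = f(x[i])

def assign(x, y):
    """Assign: x[i] = y[i] for every stored entry of y."""
    for i, v in y.items():
        x[i] = v

def ewise_mult(x, y):
    """eWiseMult: z[i] = x[i] * y[i] over the intersection of the patterns."""
    return {i: x[i] * y[i] for i in x.keys() & y.keys()}
```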
Experimental platform

• Chapel details
  – Chapel 1.13.1 (the latest version before the IPDPS deadline)
  – Chapel built from source
  – CHPL_COMM: gasnet/gemini
  – Job launcher: slurm-srun
• Experiment platform: NERSC/Edison
  – Intel Ivy Bridge processors
  – 24 cores on 2 sockets
  – 64 GB memory per node
  – 30 MB L3 cache
Sparse matrices in Chapel

• Block-distributed sparse matrices: the dense container is block-distributed.
• We used the compressed sparse row (CSR) layout to store local matrices.

  var n = 6;
  const D = {0..n-1, 0..n-1} dmapped Block({1..3, 1..3});
  var spD: sparse subdomain(D);
  var A: [spD] real;

In this example: #locales = 9.
In our results, we did not include the time to construct arrays.
The simplest GraphBLAS operation: Apply (x[i] = f(x[i]))

Apply1: high-level (Chapel style)
Apply2: manipulating internal arrays (MPI style)

[Figure: shared-memory scaling of Apply1 and Apply2 on a single node, 1-32 threads; time (ms) on a log scale.]
Example, simple case: Apply (x[i] = f(x[i]))

[Figure: distributed-memory scaling of Apply1 and Apply2 on 1-64 nodes (24 threads per node); time (s) on a log scale.]

Apply1: high-level (Chapel style). Apply2: manipulating internal arrays (C++ style). x: 10M nonzeros. Platform: NERSC/Edison.
Data-parallel loops perform well in shared memory, but do not perform well in distributed memory.
Performance on distributed memory

[Figure: chplvis traces of Apply1 and Apply2 on four locales; red: data in, blue: data out. With Apply1, all work happens at locale 0.]

This issue with sparse arrays was addressed about a week ago.
Assign (x[i] = y[i])

Assign1: high-level (Chapel style)
Assign2: manipulating internal arrays (MPI style)
Shared-memory performance: Assign (x[i] = y[i])

[Figure: single-node scaling of Assign1 and Assign2, 1-32 threads; time (ms) on a log scale. There is a big performance gap, even in shared memory.]

Assign1: high-level (Chapel style). Assign2: manipulating internal arrays (C++ style). x: 1M nonzeros. Platform: NERSC/Edison.
Why? Indexing into a sparse domain uses a binary search; for assignment, it can be avoided.
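A hypothetical Python sketch of that gap (names invented; the real code manipulates Chapel's domain internals): reading one entry of a CSR-style vector costs a binary search over the sorted index list, whereas an assignment between two vectors that share a sparsity pattern is a straight copy of the values array.

```python
import bisect

def get_entry(idx, vals, i):
    """O(log nnz) read: binary-search the sorted index list for position i."""
    k = bisect.bisect_left(idx, i)
    return vals[k] if k < len(idx) and idx[k] == i else 0.0

def assign_same_pattern(x_vals, y_vals):
    """O(nnz) assignment, no searches: valid when x and y share a pattern."""
    x_vals[:] = y_vals
```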
Distributed-memory performance: Assign (x[i] = y[i])

[Figure: scaling of Assign1 and Assign2 on 1-64 nodes (24 threads per node); time (s) on a log scale. There is a big performance gap, even in distributed memory.]

Assign1: high-level (Chapel style). Assign2: manipulating internal arrays (C++ style). x: 1M nonzeros. Platform: NERSC/Edison.
Example, complex case: SpMSpV (y = Ax)

[Figure: y = A*x using a sparse accumulator (SPA): gather the needed entries, multiply, then scatter/accumulate the results into y.]

Algorithm overview: sparse matrix-sparse vector multiply (SpMSpV)
P processors are arranged in a √p × √p processor grid.
1. Gather vertices in a processor column
2. Local multiplication
3. Scatter results in a processor row
Multiply, accessing remote data as needed; no collective communication.

Algorithm (Chapel style) vs. algorithm (MPI style) [code shown side by side on the slide]
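The three phases can be sketched in Python (a hypothetical single-process rendering with invented names; the paper's version distributes A over the processor grid). A is stored column-major, so the nonzeros of x select whole columns:

```python
def spmspv(A_cols, x):
    """y = A*x for sparse x. A_cols: {col: [(row, val), ...]};
    x and y: {index: value} dicts."""
    # 1. Gather: pull in the columns of A selected by x's nonzeros.
    gathered = [(xj, A_cols.get(j, [])) for j, xj in x.items()]
    y = {}
    # 2. Local multiply, and 3. scatter/accumulate into y (the SPA's role).
    for xj, col in gathered:
        for i, aij in col:
            y[i] = y.get(i, 0.0) + aij * xj
    return y

A_cols = {0: [(1, 2.0)], 2: [(0, 3.0), (1, 1.0)]}
x = {0: 1.0, 2: 2.0}
# spmspv(A_cols, x) == {1: 4.0, 0: 6.0}
```

In the distributed version, phase 1 fetches remote vector entries down a processor column and phase 3 accumulates partial results across a processor row; only the middle phase is purely local.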
Distributed-memory performance of SpMSpV on Edison

[Figure: three panels of time (s) vs. number of nodes (1-64, 24 threads per node), each broken down into Gather Input, Local Multiply, and Scatter Output: (a) ER matrix (n=1M, d=16, f=2%); (b) ER matrix (n=1M, d=4, f=2%); (c) ER matrix (n=1M, d=16, f=20%).]

A: random, 16M nonzeros. x: random, 2000 nonzeros.
Remote atomics are expensive in Chapel; we don't know the reason.
Requirements for achieving high performance

• Exploit the available spatial locality in sparse manipulations
  – Efficient access to the nonzeros of sparse matrices/vectors
  – Chapel is almost there; it needs improved parallel iterators
• Use bulk-synchronous communication whenever possible
  – Avoid latency-bound communication
  – Team collectives are useful
Our experience: productivity vs. performance

Productivity (easy to develop a prototype)
Task            | Hardness | Why?
Data structure  | medium   | Manipulating domains and arrays
Functionality   | easy     | Fewer lines of code with built-in features
Parallelization | easy     | No need to think about communication

Performance (hard to achieve performance)
Task               | Hardness | Why?
Data structure     | hard     | Manipulating low-level data structures
Shared-memory      | medium   | Data-parallel iterators for sparse data
Distributed-memory | hard     | Needs bulk-synchronous communication, team collectives, etc.
Summary

• We have implemented a prototype GraphBLAS library in Chapel
  – Implemented breadth-first search as a representative algorithm using these primitives
• Library development in Chapel is easy (relative to C++)
• Chapel's distributed sparse-matrix support is still under development; the distributed-memory performance is expected to improve over time.
Future direction

• Finish a complete GraphBLAS-compliant library in a PGAS language (including Chapel)
  – Achieving high performance is our focus
  – Benchmark our library against other programming models and languages
• Design complex graph algorithms using the library to demonstrate its utility
  – Understand the impact of programming models on graph analytics
Acknowledgements and relevant references

• Funded in part by DOD/ACS and in part by DOE/ASCR
• Acknowledgements: Costin Iancu (LBNL), Brad Chamberlain (Cray), Michael Ferguson (Cray), Engin Kayraklioglu (George Washington University)
• References:
  – A. Azad and A. Buluç. Towards a GraphBLAS library in Chapel. IPDPS Workshops 2017.
  – A. Azad and A. Buluç. A work-efficient parallel sparse matrix-sparse vector multiplication algorithm. IPDPS 2017.

Questions?