Sparse Matrices for High Performance Graph Analy8csgilbert/talks/GilbertPurdueJanuary2016.pdfAriful...

Preview:

Citation preview

1

SparseMatricesforHighPerformanceGraphAnaly8cs

JohnR.GilbertUniversityofCalifornia,SantaBarbaraPurdueUniversityJanuary27,2016

Support at UCSB: Intel, Microsoft, DOE Office of Science, NSF

2

Thanks …

Ariful Azad, David Bader, Lucas Bang, Jon Berry, Eric Boman, Aydin Buluc, John Conroy, Kevin Deweese,

Erika Duriakova, Armando Fox, Joey Gonzalez, Shoaib Kamil, Jeremy Kepner, Tristan Konolige, Manoj Kumar, Adam Lugowski, Tim Mattson, Brad McRae, Henning Meyerhenke, Dave Mizell, Jose Moreira, Lenny Oliker,

Carey Priebe, Steve Reinhardt, Lijie Ren, Eric Robinson, Viral Shah, Veronika Strnadova-Neely, Yun Teng, Joshua Vogelstein, Drew Waranis, Sam Williams

3

Outline

•  Mo8va8on:Graphapplica8ons

•  Mathema8cs:Sparsematricesforgraphalgorithms

•  SoLware:CombBLAS,KDT,QuadMat

•  Standards:TheGraphBLASeffort

4

Computa8onalmodelsofthephysicalworld

Cortical bone"

Trabecular bone"

5

Largegraphsareeverywhere…

WWWsnapshot,courtesyY.Hyun Yeastproteininterac8onnetwork,courtesyH.Jeong

•  Internetstructure• Socialinterac8ons

• Scien8ficdatasets:biological,chemical,cosmological,ecological,…

6

Co-authorgraphfrom1993

Householdersymposium

Socialnetworkanalysis(1993)

7

Facebookgraph:>1,200,000,000ver8ces

Socialnetworkanalysis(2016)

8

Computers

Con,nuousphysicalmodeling

Linearalgebra

Discretestructureanalysis

Graphtheory

Computers

Themiddlewarechallengeforgraphanalysis

9

Top 500 List (November 2015)

= x PA L U

Top500 Benchmark: Solve a large system of linear equations

by Gaussian elimination

10

Graph 500 List (November 2015)

Graph500 Benchmark:

Breadth-first search in a large

power-law graph

1 2

3

4 7

6

5

11

Floating-Point vs. Graphs, November 2015

= x P A L U1 2

3

4 7

6

5

34 Peta / 38 Tera is about 900.

34 Petaflops 38 Terateps

12

Nov 2015: 34 Peta / 38 Tera ~ 900 Nov 2010: 2.5 Peta / 6.6 Giga ~ 380,000

Floating-Point vs. Graphs, November 2015

= x P A L U1 2

3

4 7

6

5

34 Petaflops 38 Terateps

13

•  Byanalogytonumericalscien8ficcompu8ng...

•  WhatshouldthecombinatorialBLASlooklike?

Themiddlewarechallengeforgraphanalysis

C = A*B

y = A*x

µ = xT y

BasicLinearAlgebraSubrou,nes(BLAS):Ops/Secvs.MatrixSize

Thecaseforsparsematrices

Coarse-grained parallelism can be exploited by abstractions at the right level.

Vertex/edgegraphcomputa,ons

Graphsinthelanguageoflinearalgebra

Unpredictable,data-drivencommunica8onpaberns

Fixedcommunica8onpaberns

Irregulardataaccesses,withpoorlocality

Matrixblockopera8onsexploitmemoryhierarchy

Finegraineddataaccesses,dominatedbylatency

Coarsegrainedparallelism,limitedbybandwidthnotlatency

Sparsematrix-sparsematrixmul8plica8on

*

Sparsematrix-sparsevectormul8plica8on

*

.*

Sparsearrayprimi8vesforgraphs

Element-wiseopera8ons Sparsematrixindexing

Matricesovervarioussemirings:(+,×),(and,or),(min,+),…

16

Mul8ple-sourcebreadth-firstsearch

B

1 2

3

4 7

6

5

AT

17

Mul8ple-sourcebreadth-firstsearch

•  Sparsearrayrepresenta8on=>spaceefficient

•  Sparsematrix-matrixmul8plica8on=>workefficient

•  Threepossiblelevelsofparallelism:searches,ver8ces,edges

BAT AT B

à

1 2

3

4 7

6

5

18

Examplesofsemiringsingraphalgorithms

(“values”:edge/vertexabributes,“add”:vertexdataaggrega8on,“mul8ply”:edgedataprocessing)

Generalschemaforuser-specifiedcomputa8onatver8cesandedges

Realfield:(R,+,*) Numericallinearalgebra

Booleanalgebra:({01},|,&)

Graphtraversal

Tropicalsemiring:(RU{∞},min,+)

Shortestpaths

(S,select,select)

Selectsubgraph,orcontractnodestoformquo8entgraph

Graphcontrac8onviasparsetripleproduct

5

6

3

1 2

4

A1

A3A2

A1

A2 A3

Contract

1 5 2 3 4 6 1

5

2 3 4

6

1 1 0 00 00 0 1 10 00 0 0 01 1

1 1 01 0 10 1 01 11 1

0 0 1

x x =

1 5 2 3 4 6 1 2 3

Subgraphextrac8onviasparsetripleproduct

5

6

3

1 2

4

Extract 3

12

1 5 2 3 4 6 1

5

2 3 4

6

1 1 1 00 00 0 1 11 00 0 0 01 1

1 1 01 0 11 1 01 11 1

0 0 1

x x =

1 5 2 3 4 6 1 2 3

21

Coun8ngtriangles(clusteringcoefficient)

A

5

6

3

1 2

4

Clusteringcoefficient:

•  Pr(wedgei-j-kmakesatrianglewithedgei-k)

•  3*#triangles/#wedges

•  3*4/19=0.63inexample

•  maywanttocomputeforeachvertexj

22

A

5

6

3

1 2

4

Clusteringcoefficient:

•  Pr(wedgei-j-kmakesatrianglewithedgei-k)

•  3*#triangles/#wedges

•  3*4/19=0.63inexample

•  maywanttocomputeforeachvertexj

“Cohen’s”algorithmtocounttriangles:

-Counttrianglesbylowest-degreevertex.

- Enumerate“low-hinged”wedges.

-Keepwedgesthatclose.

hi hilo

hi hilo

hihilo

Coun8ngtriangles(clusteringcoefficient)

23

A L U

1 2

1 1 1 2

C

A = L + U (hi->lo + lo->hi) L × U = B (wedge, low hinge)A ∧B = C (closed wedge)sum(C)/2 = 4 triangles

A

5

6

3

1 2

4 5

6

3

1 2

4

1

1

2

B, C

Coun8ngtriangles(clusteringcoefficient)

•  Aimedatgraphalgorithmdesigners/programmerswhoarenotexpertinmappingalgorithmstoparallelhardware.

•  Flexible,templatedC++interface.•  Scalableperformancefromlaptopto100,000-processorHPC.

•  OpensourcesoLware,version1.5.0releasedJanuary2016.

Anextensibledistributed-memorylibraryofferingasmallbutpowerfulsetoflinearalgebraicopera8ons

specificallytarge8nggraphanaly8cs.

CombinatorialBLAS

hbp://gauss.cs.ucsb.edu/~aydin/CombBLAS

CombinatorialBLASindistributedmemory

CommGrid

DCSC CSC CSBTriples

SpMatSpDistMatDenseDistMat

DistMat

Enforcesinterfaceonly

CombinatorialBLASfunc7onsandoperators

DenseDistVecSpDistVec

FullyDistVec...HASA

Polymorphism

Matrix/vectordistribu8ons,interleavedoneachother.

5

8

x1,1

x1,2

x1,3

x2,1

x2,2

x2,3

x3,1

x3,2

x3,3

A1,1

A1,2

A1,3

A2,1

A2,2

A2,3

A3,1

A3,2

A3,3

n pr€

n pc

2DLayoutforSparseMatrices&Vectors

-2Dmatrixlayoutwinsover1Dwithlargecorecountsandwithlimitedbandwidth/compute-2Dvectorlayoutsome8mesimportantforloadbalance

Defaultdistribu8oninCombinatorial BLAS.Scalablewithincreasingnumberofprocesses

Benchmarkinggraphanaly8csframeworks

CombinatorialBLASwasfastestamongalltested

graphprocessingframeworkson3outof4benchmarks

inanindependentstudybyIntel.

Sa8shetal."Naviga8ngtheMazeofGraphAnaly8csFrameworksusingMassiveGraphDatasets”,inSIGMOD’14

28

CombinatorialBLAS“users”(Jan2016)

•  IBM (T.J.Watson, Zurich, & Tokyo) •  Intel •  Cray •  Microsoft •  Stanford •  MIT •  UC Berkeley •  Carnegie-Mellon •  Georgia Tech •  Purdue •  Ohio State •  U Texas Austin •  NC State •  UC San Diego •  UC Merced •  UC Santa Barbara

•  Berkeley Lab •  Sandia Labs •  Columbia •  U Minnesota •  Duke •  Indiana U •  Mississippi State •  SEI •  Paradigm4 •  Mellanox •  IHPC (Singapore)

•  Tokyo Inst of Technology

•  Chinese Academy of Sciences •  U Canterbury (New Zealand)

•  King Fahd U (Saudi Arabia) •  Bilkent U (Turkey)

•  U Ghent (Belgium)

•  Aimedatdomainexpertswhoknowtheirproblemwellbutdon’tknowhowtoprogramasupercomputer

•  Easy-to-usePythoninterface•  Runsonalaptopaswellasaclusterwith10,000processors

•  OpensourcesoLware(NewBSDlicense)

Ageneralgraphlibrarywithopera8onsbasedonlinear

algebraicprimi8ves

KnowledgeDiscoveryToolboxhbp://kdt.sourceforge.net/

Example:•  Vertextypes:Person,Phone,

Camera,Gene,Pathway•  Edgetypes:PhoneCall,TextMessage,

CoLoca8on,SequenceSimilarity•  Edgeabributes:Time,Dura8on

•  Calculatecentralityjustforemailsamongengineerssentbetweengivenstartandend8mes

Abributedseman8cgraphsandfilters

def onlyEngineers (self): return self.position == Engineer def timedEmail (self, sTime, eTime): return ((self.type == email) and (self.Time > sTime) and (self.Time < eTime)) G.addVFilter(onlyEngineers) G.addEFilter(timedEmail(start, end)) # rank via centrality based on recent email transactions among engineers bc = G.rank(’approxBC’)

KDT$Algorithm$

CombBLAS$Primi4ve$

Filter$(Py)$

Python'

C++'

Semiring$(Py)$KDT$Algorithm$

CombBLAS$Primi4ve$ Filter$(C++)$

Semiring$(C++)$

Standard$KDT$ KDT+SEJITS$

SEJITS$$$$Transla4on$

Filter$(Py)$

Semiring$(Py)$

SEJITSforfilter/semiringaccelera8on

EmbeddedDSL:Pythonforthewholeapplica8on•  Introspect,translatePythontoequivalentC++code•  Callcompiled/op8mizedC++insteadofPython

FilteredBFSwithSEJITS

!"#$%!"$!%&"!!%#"!!%'"!!%("!!%&)"!!%*#"!!%)'"!!%

&#&% #$)% $+)% &!#'% #!#$%

!"#$%&'

(%)*

"%

+,*-".%/0%!12%3./4"55"5%

,-.% /012./3,-.% 456789:/%

Time(inseconds)forasingleBFSitera8ononscale25RMAT(33Mver8ces,500Medges)with10%ofelementspassingfilter.MachineisNERSC’sHopper.

33

Afewothergraphalgorithmswe’veimplementedinlinearalgebraicstyle

•  Maximalindependentset(KDT/SEJITS)[BDFGKLOW2013]

•  Peer-pressureclustering(SPARQL)[DGLMR2013]

•  Time-dependentshortestpaths(CombBLAS)[Ren2012]

•  Gaussianbeliefpropaga8on(KDT)[LABGRTW2011]

•  Markoffclustering(CombBLAS,KDT)[BG2011,LABGRTW2011]

•  Betweennesscentrality(CombBLAS)[BG2011]

•  Geometricmeshpar88oning(MatlabJ)[GMT1998]

34

The(original)BLAS

•  ExpertsinmappingalgorithmstohardwaretuneBLASforspecificpla�orms.

•  ExpertsinnumericallinearalgebrabuildsoLwareontopoftheBLAStogethighperformance“forfree.”

Todayeverycomputer,phone,etc.comeswith/usr/lib/libblas!

TheBasicLinearAlgebraSubrou8neshadarevolu8onaryimpact

oncomputa8onallinearalgebra.

BLAS1 vectorops Lawson,Hanson,Kincaid,Krogh,1979

LINPACK

BLAS2 matrix-vectorops

Dongarra,DuCroz,Hammarling,Hanson,1988

LINPACKonvectormachines

BLAS3 matrix-matrixops

Dongarra,DuCroz,Duff,Hammarling,1990

LAPACKoncachebasedmachines

Canwestandardizea“GraphBLAS”?

Canwestandardizea“GraphBLAS”?

No, it’s not reasonable to define a universal set

of building blocks.

•  Huge diversity in matching graph algorithms to hardware platforms. •  No consensus on data structures or linguistic primitives. •  Lots of graph algorithms remain to be discovered. •  Early standardization can inhibit innovation.

Canwestandardizea“GraphBLAS”?

Yes, it is reasonable to define a universal set

of building blocks…

… for graphs as linear algebra.

•  Representing graphs in the language of linear algebra is a mature field. •  Algorithms, high level interfaces, and implementations vary. •  But the core primitives are well established.

TheGraphBLASeffort

•  Manifesto, �

HPEC 2013:

•  Workshops at IPDPS and HPEC – next in May 2016•  Periodic working group telecons and meetings•  Graph BLAS Forum: http://graphblas.org

Abstract-- It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.

SomeGraphBLASbasicfunc8ons(namesnotfinal)

Func,on(CombBLASequiv)

Parameters Returns Matlabnota,on

matmul(SpGEMM)

-sparsematricesAandB-op8onalunaryfuncts

sparsematrix C=A*B

matvec(SpM{Sp}V)

-sparsematrixA-sparse/densevectorx

sparse/densevector y=A*x

ewisemult,add,…(SpEWiseX)

-sparsematricesorvectors-binaryfunct,op8onalunarys

inplaceorsparsematrix/vector

C=A.*BC=A+B

reduce(Reduce)

-sparsematrixAandfunct densevector y=sum(A,op)

extract(SpRef)

-sparsematrixA-indexvectorspandq

sparsematrix B=A(p,q)

assign(SpAsgn)

-sparsematricesAandB-indexvectorspandq

none A(p,q)=B

buildMatrix(Sparse)

-listofedges/triples(i,j,v)

sparsematrix A=sparse(i,j,v,m,n)

getTuples(Find)

-sparsematrixA

edgelist [i,j,v]=find(A)

Matrix8mesmatrixoversemiring

Inputs matrix A: MxN (sparse or dense) matrix B: NxL (sparse or dense)Optional Inputs matrix C: MxL (sparse or dense) scalar “add” function ⊕ scalar “multiply” function ⊗ transpose flags for A, B, COutputs matrix C: MxL (sparse or dense)

Specific cases and function names:SpGEMM: sparse matrix times sparse matrixSpMSpV: sparse matrix times sparse vectorSpMV: sparse matrix times dense vectorSpMM: sparse matrix times dense matrix

Notesis the set of scalars, user-specifieddefaults to IEEE double float

⊕ defaults to floating-point +⊗ defaults to floating-point *

Implements C ⊕= A ⊕.⊗ B

for j = 1 : N C(i,k) = C(i,k) ⊕ (A(i,j) ⊗ B(j,k))

If input C is omitted, implements C = A ⊕.⊗ B

Transpose flags specify operation � on AT, BT, and/or CT instead

41

Breadth-firstsearchwith(draL)GraphBLAS

Conclusions

•  Graphanalysispresentschallengesin:–  Performance(bothserialandparallel)–  SoLwarecomplexity–  Interoperability

•  Implemen8nggraphalgorithmsusingmatrix-basedapproachesprovidesseveralusefulsolu8onstothesechallenges.

•  ResearchersfromIntel,IBM,Nvidia,LBL,MIT,UCSB,GaTech,KIT,etc.havejoinedtogethertocreatetheGraphBLASstandard.

•  Severalimplementa8onsinprogress:–  C++:CombBLAS(LBL,UCSB),GraphMAT(Intel)–  C:GraphProgrammingInterface(IBM),S8nger(GaTech)–  Java:Graphulo(MIT)–  Python:NetworkKit(KIT)

43

Thank You!

Computers

Continuous physical modeling

Linear algebra & graph theory

Discrete structure analysis

http://graphblas.org