
September 17, 2014

Task-based sparse linear solvers for modern architectures

P. Ramet (and HiePACS team)

4th HOSCAR workshop
Gramado, Brazil

P. Ramet
HiePACS team
Inria Bordeaux Sud-Ouest
LaBRI, Bordeaux University

Guideline

Introduction

Sparse direct factorization - PaStiX

Sparse solver on heterogeneous architectures

Hybrid methods - MaPHYS and HIPS

Low-rank compression - H-PaStiX

Conclusion

P. Ramet - HOSCAR workshop, September 17, 2014

1 Introduction

Mixed/Hybrid direct-iterative methods

The "spectrum" of linear algebra solvers

Direct
- Robust/accurate for general problems
- BLAS-3 based implementation
- Memory/CPU prohibitive for large 3D problems
- Limited parallel scalability

Iterative
- Problem-dependent efficiency/controlled accuracy
- Only mat-vec required, fine-grain computation
- Less memory consumption, possible trade-off with CPU
- Attractive "built-in" parallel features

2 Sparse direct factorization - PaStiX

Major steps for solving sparse linear systems

1. Analysis: the matrix is preprocessed to improve its structural properties (A'x' = b' with A' = P_n P D_r A D_c Q P^T)

2. Factorization: the matrix is factorized as A = LU, LL^T or LDL^T

3. Solve: the solution x is computed by means of forward and backward substitutions
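The factorization and solve steps can be illustrated on a tiny dense example. This is only a sketch under simplifying assumptions: the matrix is small, dense, symmetric positive definite, and the analysis step is skipped (identity ordering, no scaling). The function names `cholesky` and `solve` are illustrative and are not part of PaStiX.

```python
import math

def cholesky(a):
    """Return the lower-triangular L with A = L L^T (A symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l

def solve(l, b):
    """Solve L L^T x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                      # forward substitution: L y = b
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):            # backward substitution: L^T x = y
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x

# 1D Laplacian: tridiagonal, symmetric positive definite
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
b = [1.0, 0.0, 0.0, 1.0]

L = cholesky(A)      # step 2: factorization
x = solve(L, b)      # step 3: solve
```

Once L is available, each additional right-hand side costs only the two triangular solves, which is why the factorization is kept separate from the solve.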


Direct Method and Nested Dissection


Supernodal methods

Definition
A supernode (or supervariable) is a set of contiguous columns in the factor L that share essentially the same sparsity structure.

- All algorithms (ordering, symbolic factorization, factorization, solve) are generalized to block versions.
- Use of efficient matrix-matrix kernels (improves cache usage).
- Same concept as supervariables for elimination tree/minimum degree ordering.
- Supernodes and pivoting: pivoting inside a supernode does not increase fill-in.
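The definition above can be sketched as a scan over the column structures of L. The sketch assumes the strict criterion (identical structure below the diagonal block); production solvers relax it through amalgamation so that columns with "essentially" the same structure are merged too. The function name and the input format are hypothetical.

```python
def find_supernodes(col_struct):
    """Group contiguous columns of L that share the same sparsity structure.

    col_struct[j] is the set of row indices i > j with L[i][j] != 0.
    Column j+1 joins the supernode of column j when
    col_struct[j] == {j+1} | col_struct[j+1].
    Returns a list of (first_col, last_col) pairs.
    """
    n = len(col_struct)
    if n == 0:
        return []
    supernodes = []
    first = 0
    for j in range(n - 1):
        if col_struct[j] != {j + 1} | col_struct[j + 1]:
            supernodes.append((first, j))   # structure changes: close the supernode
            first = j + 1
    supernodes.append((first, n - 1))
    return supernodes

# Column structures of a 6 x 6 factor (strictly lower part only)
structure = [{1, 2, 5}, {2, 5}, {5}, {4, 5}, {5}, set()]
supernodes = find_supernodes(structure)
```

Here columns 0 to 2 form one dense trapezoid and columns 3 to 5 another, so the factorization can work on two blocks with matrix-matrix kernels instead of six separate columns.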


PaStiX main features

- LL^T, LDL^T, LU: supernodal implementation (BLAS-3)
- Static pivoting + refinement: CG/GMRES/BiCGstab
- Column-block or block mapping
- Single/double precision + real/complex arithmetic
- MPI/threads (cluster/multicore/SMP/NUMA)
- Multiple GPUs using DAG runtimes
- Support for external ordering libraries (PT-Scotch or METIS ...)
- Multiple RHS (direct factorization)
- Incomplete factorization with ILU(k) preconditioner
- Schur complement computation
- C++/Fortran/Python/PETSc/Trilinos/FreeFem...


Current works

- Astrid Casadei (PhD student): memory optimization to build a Schur complement in PaStiX, tight coupling between sparse direct and iterative solvers (HIPS) + graph partitioning with balanced halo
- Xavier Lacoste (PhD student) and the MUMPS team: GPU optimizations for sparse factorizations with StarPU
- Stojce Nakov (PhD student): tight coupling between sparse direct and iterative solvers (MaPHYS) + GPU optimizations for GMRES with StarPU
- Mathieu Faverge (Assistant Professor): redesigned PaStiX static/dynamic scheduling with PaRSEC in order to get a generic framework for multicore/MPI/GPU/out-of-core + low-rank compression


Direct Solver Highlights

Fusion - ITER
PaStiX is used in the JOREK code developed by G. Huysmans at CEA/Cadarache in a fully implicit time evolution scheme for the numerical simulation of the ELM (Edge Localized Mode) instabilities commonly observed in the standard tokamak operating scenario.

[Figure: MHD pellet injection simulated with the JOREK code]


Tasks algorithms

POTRF, TRSM, SYRK, GEMM

[Figure: (a) Task for a dense tile; (b) Task for a sparse supernode]
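To make the four kernels concrete, the sketch below enumerates the tasks of a right-looking tiled Cholesky factorization on an nt x nt grid of tiles. It is an illustration of the dense case only; the task tuples and function names are assumptions made for this example, not the format used by PaStiX or by any runtime.

```python
from collections import Counter

def tiled_cholesky_tasks(nt):
    """List the tasks of a right-looking tiled Cholesky on an nt x nt tile grid."""
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", k))                 # factorize diagonal tile (k, k)
        for i in range(k + 1, nt):
            tasks.append(("TRSM", i, k))           # triangular solve on tile (i, k)
        for i in range(k + 1, nt):
            tasks.append(("SYRK", i, k))           # update diagonal tile (i, i) with (i, k)
            for j in range(k + 1, i):
                tasks.append(("GEMM", i, j, k))    # update tile (i, j) with (i, k) and (j, k)
    return tasks

tasks = tiled_cholesky_tasks(5)
counts = Counter(task[0] for task in tasks)
```

For a 5 x 5 tile grid this gives 5 POTRF, 10 TRSM, 10 SYRK and 10 GEMM tasks. In the sparse supernodal case the same kernels are applied to the blocks of a supernode, and the TRSM and POTRF of one column block are typically grouped into a single panel task.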


DAG representation

[Figure: (c) Dense DAG, a tiled Cholesky factorization with 5 POTRF, 10 TRSM, 10 SYRK and 10 GEMM tasks; (d) Sparse DAG, with tasks panel(0) to panel(11) and gemm(1) to gemm(40). Edge labels give the data flow between tasks, e.g. T=>T, C=>A, C=>B, C=>C, A=>A.]


Direct Solver Highlights (cluster of multicore)

RC3 matrix - complex double precision
N=730700 - NNA=41600758 - Fill-in=50

Facto         1 MPI   2 MPI   4 MPI   8 MPI
1 thread       6820    3520    1900    1890
6 threads      1020     639     337     287
12 threads      525     360     155     121

Mem (Gb)      1 MPI   2 MPI   4 MPI   8 MPI
1 thread       34      19.2    12.5    9.22
6 threads      34.3    19.5    12.8    9.66
12 threads     34.6    19.7    13      9.14

Solve         1 MPI   2 MPI   4 MPI   8 MPI
1 thread       6.97    3.75    1.93    1.03
6 threads      2.5     1.43    0.78    0.54
12 threads     1.33    0.93    0.66    0.59


NUMA-Aware Allocation (up to 20% efficiency)


Direct Solver Highlights (multicore)
SGI NUMA 160-cores

Name   N           NNZA        Fill ratio   Fact
Audi   9.44×10^5   3.93×10^7   31.28        float LL^T
10M    1.04×10^7   8.91×10^7   75.66        complex LDL^T

10M         10     20     40     80     160
Facto (s)   3020   1750   654    356    260
Mem (Gb)    122    124    127    133    146
Solve (s)   24.6   13.5   3.87   2.90   2.89

Audi        128    2x64     4x32     8x16
Facto (s)   17.8   18.6     13.8     13.4
Mem (Gb)    13.4   2x7.68   4x4.54   8x2.69
Solve (s)   0.40   0.32     0.21     0.14


Communication schemes (up to 10% efficiency)


Dynamic scheduling (up to 15% efficiency)

Properties
N      10 423 737
NNZA   89 072 871
NNZL   6 724 303 039
OPC    4.41834e+13

                    4x32   8x32
Static Scheduler     289    195
Dynamic Scheduler    240    184

Table: Factorization time in seconds

- Electromagnetism problem in double complex from CEA
- Cluster Vargas from IDRIS with 32 Power6 per node


Static Scheduling Gantt Diagram

- 10 million test case on 4 MPI processes of 32 threads (color is level in the tree)

Dynamic Scheduling Gantt Diagram

- Reduces time by 10-15%

Automatic Performance Analysis?


Block ILU(k): supernode amalgamation algorithm

Derive a block incomplete LU factorization from the supernodal parallel direct solver

- Based on existing package PaStiX
- Level-3 BLAS incomplete factorization implementation
- Fill-in strategy based on level-fill among block structures identified thanks to the quotient graph
- Amalgamation strategy to enlarge block size

Highlights
- Handles efficiently high level-of-fill
- Solving time faster than with scalar ILU(k)
- Scalable parallel implementation
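The level-of-fill rule behind ILU(k) can be sketched on scalar entries. This is the classical scalar rule, not the block variant on the quotient graph described above: original entries have level 0, and a fill entry (i, j) created when eliminating unknown p gets level lev(i, p) + lev(p, j) + 1, keeping the minimum over all p. Entries whose level exceeds k are dropped. The function name and the input format are assumptions for this example.

```python
def iluk_pattern(pattern, n, k):
    """Symbolic ILU(k): return {(i, j): level} for the entries kept (level <= k).

    pattern is the set of (i, j) positions of the nonzeros of A,
    diagonal included; n is the matrix order.
    """
    lev = {(i, j): 0 for (i, j) in pattern}
    for p in range(n):                                  # eliminate unknown p
        rows = [i for i in range(p + 1, n) if (i, p) in lev]
        cols = [j for j in range(p + 1, n) if (p, j) in lev]
        for i in rows:
            for j in cols:
                new = lev[(i, p)] + lev[(p, j)] + 1
                # keep the entry only if its level is admissible and improves
                if new <= k and new < lev.get((i, j), k + 1):
                    lev[(i, j)] = new
    return lev

# Symmetric 4 x 4 pattern: diagonal plus couplings 0-1, 0-2 and 1-3
pattern = {(i, i) for i in range(4)}
pattern |= {(0, 1), (1, 0), (0, 2), (2, 0), (1, 3), (3, 1)}

sizes = [len(iluk_pattern(pattern, 4, k)) for k in (0, 1, 2)]
```

With k = 0 the pattern of A is kept unchanged; k = 1 adds the fill between unknowns 1 and 2; k = 2 also adds the fill between 2 and 3, which in this small case is the full fill of the direct factorization.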


3 Sparse solver on heterogeneous architectures

Multiple layer approach

[Figure: software layers ALGORITHM / RUNTIME / KERNELS on top of GPU and CPU]

Governing ideas: enable advanced numerical algorithms to be executed on a scalable unified runtime system for exploiting the full potential of future exascale machines.

Basics:
- Graph of tasks
- Out-of-order scheduling
- Fine granularity


DAG schedulers considered

StarPU
- RunTime Team, Inria Bordeaux Sud-Ouest
- C. Augonnet, R. Namyst, S. Thibault
- Dynamic task discovery
- Computes cost models on the fly
- Multiple kernels on the accelerators
- Heterogeneous Earliest Finish Time (HEFT) strategy

PaRSEC (formerly DAGuE)
- ICL, University of Tennessee, Knoxville
- G. Bosilca, A. Bouteiller, A. Danalis, T. Herault
- Parameterized task graph
- Only the most compute-intensive kernel on accelerators
- Simple scheduling strategy based on computing capabilities
- GPU multi-stream enabled


Supernodal sequential algorithm

    forall the Supernode S1 do
        panel(S1);    /* update of the panel */
        forall the extra diagonal block Bi of S1 do
            S2 ← supernode_in_front_of(Bi);
            gemm(S1, S2);    /* sparse GEMM: B_{k, k>=i} × B_i^T subtracted from S2 */
        end
    end


StarPU tasks submission

    forall the Supernode S1 do
        submit_panel(S1);    /* update of the panel */
        forall the extra diagonal block Bi of S1 do
            S2 ← supernode_in_front_of(Bi);
            submit_gemm(S1, S2);    /* sparse GEMM: B_{k, k>=i} × B_i^T subtracted from S2 */
        end
        wait_for_all_tasks();
    end
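The submission loop relies on the runtime discovering dependencies from the order of submission and the data each task touches. The toy class below sketches that idea in plain Python. It is not the StarPU API: `TaskFlow`, `submit`, the `reads`/`writes` arguments and the `facing` structure are all hypothetical names chosen for this illustration, and the sketch only records dependencies without executing anything.

```python
class TaskFlow:
    """Toy sequential-task-flow model: infers dependencies from data accesses."""

    def __init__(self):
        self.tasks = []          # task names, in submission order
        self.deps = {}           # task index -> set of task indices it must wait for
        self.last_writer = {}    # data handle -> index of the last task writing it
        self.readers = {}        # data handle -> tasks reading it since the last write

    def submit(self, name, reads=(), writes=()):
        t = len(self.tasks)
        self.tasks.append(name)
        deps = set()
        for h in reads:
            if h in self.last_writer:
                deps.add(self.last_writer[h])        # read after write
        for h in writes:
            if h in self.last_writer:
                deps.add(self.last_writer[h])        # write after write
            deps.update(self.readers.get(h, ()))     # write after read
        for h in reads:
            self.readers.setdefault(h, []).append(t)
        for h in writes:
            self.last_writer[h] = t
            self.readers[h] = []
        self.deps[t] = deps
        return t

# Hypothetical symbolic structure: supernode -> supernodes faced by its off-diagonal blocks
facing = {0: [2], 1: [2], 2: []}

rt = TaskFlow()
for s1 in sorted(facing):
    rt.submit(f"panel({s1})", writes=[s1])
    for s2 in facing[s1]:
        rt.submit(f"gemm({s1},{s2})", reads=[s1], writes=[s2])
```

In this example panel(0) and panel(1) have no dependencies and could run concurrently, the two gemm updates of supernode 2 are serialized, and panel(2) waits for the last of them.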


PaRSEC's parameterized task graph

[Figure: Task Graph, the sparse DAG with tasks panel(0) to panel(11) and gemm(1) to gemm(40); edge labels A=>A, C=>C, C=>A]

    panel(j)

    /* Execution Space */
    j = 0 .. cblknbr-1

    /* Task Locality (Owner Compute) */
    : A(j)

    /* Data dependencies */
    RW A <- ( leaf ) ? A(j) : C gemm( lastbrow )
         -> A gemm( firstblock+1 .. lastblock )
         -> A(j)

Panel Factorization in JDF Format


Matrices and Machines

Matrix     Prec  Method  Size     nnzA    nnzL      TFlop
FilterV2   Z     LU      0.6e+6   12e+6   536e+6    3.6
Flan       D     LL^T    1.6e+6   59e+6   1712e+6   5.3
Audi       D     LL^T    0.9e+6   39e+6   1325e+6   6.5
MHD        D     LU      0.5e+6   24e+6   1133e+6   6.6
Geo1438    D     LL^T    1.4e+6   32e+6   2768e+6   23
Pmldf      Z     LDL^T   1.0e+6   8e+6    1105e+6   28
Hook       D     LU      1.5e+6   31e+6   4168e+6   35
Serena     D     LDL^T   1.4e+6   32e+6   3365e+6   47

Table: Matrix description (Z: double complex, D: double real)

Machine    Processors                 Frequency  GPUs               RAM
Westmere   Intel Xeon X5650 (2 × 6)   2.67 GHz   Tesla M2070 (×3)   36 GB


CPU scaling study: GFlop/s for numerical factorization

[Figure: bar chart of performance (GFlop/s, axis 0 to 80) for afshell10(D, LU), FilterV2(Z, LU), Flan(D, LL^T), audi(D, LL^T), MHD(D, LU), Geo1438(D, LL^T), pmlDF(Z, LDL^T) and HOOK(D, LU); series: native, StarPU, PaRSEC; legend entries: 1 core, 3 cores]








GPU scaling study: GFlop/s for numerical factorization

[Figure: bar chart of performance (GFlop/s, axis 0 to 250) for afshell10(D, LU), FilterV2(Z, LU), Flan(D, LL^T), audi(D, LL^T), MHD(D, LU), Geo1438(D, LL^T), pmlDF(Z, LDL^T) and HOOK(D, LU); series: Native (CPU only); StarPU (CPU only, 1 GPU, 2 GPU, 3 GPU); PaRSEC 1 stream (CPU only, 1 GPU, 2 GPU, 3 GPU); PaRSEC 3 streams (1 GPU, 2 GPU, 3 GPU)]





Xeon Phi (from Max Planck Institute for Plasma Physics)
[Phi = 4 Xeon cores, GPU = 6 Xeon cores]


Distributed implementation

[Figure: blocks C0, C1, C2, C3 and processes P1, P2]


Distributed (preliminary) results (cluster of multicore)

[Figure: bar chart of performance (GFlop/s, axis 0 to 300) for afshell10(D, LU), FilterV2(Z, LU), Flan(D, LL^T), audi(D, LL^T), MHD(D, LU), Geo1438(D, LL^T), pmlDF(Z, LDL^T), HOOK(D, LU) and Serena(D, LDL^T); series: native, StarPU fan-in, StarPU fan-out, each with 4 MPI, 8 threads]





4 Hybrid methods - MaPHYS and HIPS

Hybrid Linear Solvers

Develop robust scalable parallel hybrid direct/iterative linear solvers

- Exploit the efficiency and robustness of the sparse direct solvers
- Develop robust parallel preconditioners for iterative solvers
- Take advantage of scalable implementation of iterative solvers

Domain Decomposition (DD)

- Natural approach for PDEs
- Extend to general sparse matrices
- Partition the problem into subdomains
- Use a direct solver on the subdomains
- Robust preconditioned iterative solver

Method used in MaPHYS

- Partitioning the global matrix into several local matrices
    - MeTiS [G. Karypis and V. Kumar]
    - Scotch [Pellegrini et al.]
- Local factorization
    - MUMPS [P. Amestoy et al.] (with Schur option)
    - PaStiX [P. Ramet et al.] (with Schur option and multi-threaded version)
- Construction of the preconditioner
    - MKL library
- Solving the reduced system
    - CG/GMRES/FGMRES on the reduced system
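The four steps can be followed on a toy problem. The sketch below is a simplified illustration, not MaPHYS code: the partition is given by hand, the local factorizations are replaced by a small dense Gaussian elimination, and the reduced (Schur complement) system is solved directly instead of with a preconditioned CG/GMRES iteration. All function names are hypothetical.

```python
def dense_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]       # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def block(a, rows, cols):
    """Extract the sub-matrix a[rows, cols]."""
    return [[a[i][j] for j in cols] for i in rows]

def schur_solve(a, b, domains, gamma):
    """Eliminate subdomain interiors, solve on the interface, then back-substitute."""
    ng = len(gamma)
    s = block(a, gamma, gamma)                 # starts as A_GG, becomes the Schur complement
    g = [b[i] for i in gamma]                  # starts as b_G, becomes the reduced rhs
    local = []
    for dom in domains:
        a_dd, a_dg, a_gd = block(a, dom, dom), block(a, dom, gamma), block(a, gamma, dom)
        y = dense_solve(a_dd, [b[i] for i in dom])                           # A_dd^{-1} b_d
        z = [dense_solve(a_dd, [row[c] for row in a_dg]) for c in range(ng)]  # A_dd^{-1} A_dG
        for r in range(ng):
            g[r] -= sum(a_gd[r][k] * y[k] for k in range(len(dom)))
            for c in range(ng):
                s[r][c] -= sum(a_gd[r][k] * z[c][k] for k in range(len(dom)))
        local.append((dom, a_dd, a_dg))
    x = [0.0] * len(b)
    x_g = dense_solve(s, g)                    # reduced system (iterative in a hybrid solver)
    for i, v in zip(gamma, x_g):
        x[i] = v
    for dom, a_dd, a_dg in local:              # recover interior unknowns
        rhs = [b[i] - sum(a_dg[r][c] * x_g[c] for c in range(ng)) for r, i in enumerate(dom)]
        for i, v in zip(dom, dense_solve(a_dd, rhs)):
            x[i] = v
    return x

n = 7
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
b = [1.0] * n
x = schur_solve(A, b, domains=[[0, 1, 2], [4, 5, 6]], gamma=[3])
```

The loop over `domains` contains no coupling between subdomains, which is what makes the local factorizations independent and parallel; only the interface system couples them.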

[Figures: sparsity pattern of a 400 × 400 matrix (nz = 1920) next to the corresponding domain Ω, shown unpartitioned (0 cut edges), split into Ω1 and Ω2 with interface Γ (48 cut edges), and split into Ω1 to Ω4 with interface Γ (88 cut edges)]

HIPS: hybrid direct-iterative solver

Based on a domain decomposition: the interface is one node wide (no overlap, in DD lingo)

    ( A_B   F   )
    ( E     A_C )

B: interior nodes of subdomains (direct factorization).
C: interface nodes.

Special decomposition and ordering of the subset C.
Goal: build a global Schur complement preconditioner (ILU) from the local domain matrices only.
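For reference, the standard block elimination behind this decomposition can be written out as follows. The block names A_B, F, E and A_C follow the matrix above (interior block, couplings, interface block); the vector names x_B, x_C, b_B, b_C and the symbol S are notation assumed here for the illustration.

```latex
\begin{pmatrix} A_B & F \\ E & A_C \end{pmatrix}
\begin{pmatrix} x_B \\ x_C \end{pmatrix}
=
\begin{pmatrix} b_B \\ b_C \end{pmatrix},
\qquad
S = A_C - E\,A_B^{-1}F,
\qquad
S\,x_C = b_C - E\,A_B^{-1}b_B,
\qquad
x_B = A_B^{-1}\left(b_B - F\,x_C\right).
```

Since A_B is block diagonal with one block per subdomain, applying A_B^{-1} amounts to independent direct solves on the interiors, while the system in S is the part handled iteratively with the ILU-based preconditioner.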


HIPS: preconditioners

Main features
- Iterative or "hybrid" direct/iterative methods are implemented.
- Mix direct supernodal (BLAS-3) and sparse ILUT factorization in a seamless manner.
- Memory/load balancing: distribute the domains on the processors (domains > processors).


5 Low-rank compression - H-PaStiX

Toward low-rank compression in supernodal solvers

Many works on hierarchical matrices and direct solvers
- Eric Darve: hierarchical matrices classifications (Building O(N) Linear Solvers Using Nested Dissection)
- Sherry Li: multifrontal solver + HSS (Towards an Optimal-Order Approximate Sparse Factorization Exploiting Data-Sparseness in Separators)
- David Bindel: CHOLMOD + low rank (An Efficient Solver for Sparse Linear Systems Based on Rank-Structured Cholesky Factorization)
- Jean-Yves L'Excellent: MUMPS + Block Low Rank


Symbolic factorization

Nested dissection, 2D mesh/matrix

Computational cost (O(N^2) for 3D PDE)

Low-Rank Compression of Supernodes

FastLA associate team between INRIA/Berkeley/Stanford

Supernodal Solver - Hierarchical Matrices: O(N log^a(N))

1. Check the potential compression ratio on top-level blocks
2. Develop a prototype with:
    - low-rank compression on the larger supernodes
    - compression tree built at each update
    - complexity analysis of the approach
3. Study coupling between nested dissection and compression tree ordering

Which algorithm to find a low-rank approximation?
SVD, RR-LU, RR-QR, ACA, CUR, Random ...

Which family of hierarchical matrices?
H, H^2, HODLR ...
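As an illustration of one candidate from the list, the sketch below implements a cross approximation with full pivoting, a simple relative of ACA that inspects the whole residual (practical ACA variants use partial pivoting to avoid forming it). It is a toy on a small dense block and says nothing about which method H-PaStiX adopts; the kernel matrix, the tolerance and all names are assumptions for the example.

```python
def cross_approximation(a, tol):
    """Greedy cross approximation with full pivoting.

    Returns lists us, vs with a[i][j] ~= sum_r us[r][i] * vs[r][j].
    Stops once every residual entry is below tol in absolute value.
    """
    m, n = len(a), len(a[0])
    res = [row[:] for row in a]              # explicit residual, fine for a small demo
    us, vs = [], []
    while len(us) < min(m, n):
        i, j = max(((r, c) for r in range(m) for c in range(n)),
                   key=lambda rc: abs(res[rc[0]][rc[1]]))
        pivot = res[i][j]
        if abs(pivot) < tol:
            break
        u = [res[r][j] / pivot for r in range(m)]   # scaled pivot column
        v = res[i][:]                               # pivot row
        for r in range(m):
            for c in range(n):
                res[r][c] -= u[r] * v[c]            # subtract the rank-1 cross
        us.append(u)
        vs.append(v)
    return us, vs

# Interaction block between two well-separated point clusters: 1 / (y_j - x_i)
n = 24
xs = [i / n for i in range(n)]           # points in [0, 1)
ys = [2.0 + j / n for j in range(n)]     # points in [2, 3)
K = [[1.0 / (ys[j] - xs[i]) for j in range(n)] for i in range(n)]

us, vs = cross_approximation(K, tol=1e-6)
rank = len(us)
err = max(abs(K[i][j] - sum(u[i] * v[j] for u, v in zip(us, vs)))
          for i in range(n) for j in range(n))
```

The compressed block stores 2 × n × rank values instead of n × n, which pays off as soon as the rank is well below n/2; off-diagonal blocks coupling distant parts of a separator are the typical candidates.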


6 Conclusion

Software

Graph/Mesh partitioner and ordering:
http://scotch.gforge.inria.fr

Sparse linear system solvers:
http://pastix.gforge.inria.fr
http://hips.gforge.inria.fr
https://wiki.bordeaux.inria.fr/maphys/doku.php

Software

Fast Multipole Method:
http://scalfmm-public.gforge.inria.fr/

Matrices Over Runtime Systems (with University of Tennessee):
http://icl.cs.utk.edu/projectsdev/morse

Thank you. Questions?

Pierre Ramet
HiePACS

http://www.labri.fr/~ramet
