
September 17, 2014

Task-based sparse linear solvers for modern architectures

P. Ramet (and HiePACS team)

4th HOSCAR workshop, Gramado, Brazil

P. Ramet, HiePACS team, Inria Bordeaux Sud-Ouest, LaBRI, Bordeaux University


Guideline

Introduction

Sparse direct factorization - PaStiX

Sparse solver on heterogeneous architectures

Hybrid methods - MaPHYS and HIPS

Low-rank compression - H-PaStiX

Conclusion

P. Ramet - HOSCAR workshop September 17, 2014- 2


1 Introduction


Mixed/Hybrid direct-iterative methods

The "spectrum" of linear algebra solvers

Direct methods:
- Robust/accurate for general problems
- BLAS-3 based implementation
- Memory/CPU prohibitive for large 3D problems
- Limited parallel scalability

Iterative methods:
- Problem dependent efficiency/controlled accuracy
- Only mat-vec required, fine grain computation
- Less memory consumption, possible trade-off with CPU
- Attractive "built-in" parallel features


2 Sparse direct factorization - PaStiX


Major steps for solving sparse linear systems

1. Analysis: the matrix is preprocessed to improve its structural properties ($A'x' = b'$ with $A' = P_n P D_r A D_c Q P^T$)

2. Factorization: the matrix is factorized as $A = LU$, $LL^T$ or $LDL^T$

3. Solve: the solution $x$ is computed by means of forward and backward substitutions
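
The factorization and solve steps can be illustrated on a tiny dense example. This is only a minimal sketch, assuming a symmetric positive definite matrix and the $LL^T$ variant; the analysis step (permutations and scalings) is left out, and the function names `cholesky` and `solve` are illustrative, not part of any solver's interface.

```python
import math

def cholesky(a):
    """Return the lower-triangular L with a = L L^T (a must be SPD)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the contribution of the previous columns.
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l

def solve(l, b):
    """Solve L L^T x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                       # forward: L y = b
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):             # backward: L^T x = y
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x

# Small SPD test system whose exact solution is (1, 1, 1).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
L = cholesky(A)          # step 2: factorization
x = solve(L, b)          # step 3: solve
```

Once `L` is available, further right-hand sides only need the cheap `solve` step, which is why the factorization cost dominates in the timing tables of this talk.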


Direct Method and Nested Dissection


Supernodal methods

Definition: A supernode (or supervariable) is a set of contiguous columns in the factors L that share essentially the same sparsity structure.

- All algorithms (ordering, symbolic factorization, factorization, solve) generalized to block versions.
- Use of efficient matrix-matrix kernels (improve cache usage).
- Same concept as supervariables for elimination tree/minimum degree ordering.
- Supernodes and pivoting: pivoting inside a supernode does not increase fill-in.
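
The definition can be made concrete with a small detection routine. The sketch below assumes the strict case (identical structure, not "essentially the same"; relaxing it is what amalgamation does), and the input `col_struct` is a hypothetical 6x6 factor structure made up for illustration.

```python
def find_supernodes(col_struct):
    """col_struct[j] lists, in increasing order, the row indices of the
    nonzeros strictly below the diagonal in column j of L.
    Returns the supernodes as (first_column, last_column) ranges."""
    n = len(col_struct)
    if n == 0:
        return []
    supernodes = []
    first = 0
    for j in range(1, n):
        # Column j extends the current supernode when column j-1 has exactly
        # the structure of column j plus the row index j itself.
        if col_struct[j - 1] != [j] + col_struct[j]:
            supernodes.append((first, j - 1))
            first = j
    supernodes.append((first, n - 1))
    return supernodes

# Hypothetical structure of a 6x6 factor L (below-diagonal rows per column).
col_struct = [
    [1, 2, 5],  # column 0
    [2, 5],     # column 1
    [5],        # column 2
    [4, 5],     # column 3
    [5],        # column 4
    [],         # column 5
]
supernodes = find_supernodes(col_struct)
```

Here columns 0 to 2 form one supernode and columns 3 to 5 another, so each group can be stored and updated as a dense block.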


PaStiX main Features

- LL^T, LDL^T, LU: supernodal implementation (BLAS3)
- Static pivoting + Refinement: CG/GMRES/BiCGstab
- Column-block or Block mapping
- Simple/Double precision + Real/Complex arithmetic
- MPI/Threads (Cluster/Multicore/SMP/NUMA)
- Multiple GPUs using DAG runtimes
- Support external ordering library (PT-Scotch or METIS ...)
- Multiple RHS (direct factorization)
- Incomplete factorization with ILU(k) preconditioner
- Schur complement computation
- C++/Fortran/Python/PETSc/Trilinos/FreeFem...


Current works

- Astrid Casadei (PhD student): memory optimization to build a Schur complement in PaStiX, tight coupling between sparse direct and iterative solvers (HIPS) + graph partitioning with balanced halo

- Xavier Lacoste (PhD student) and the MUMPS team: GPU optimizations for sparse factorizations with StarPU

- Stojce Nakov (PhD student): tight coupling between sparse direct and iterative solvers (MaPHYS) + GPU optimizations for GMRES with StarPU

- Mathieu Faverge (Assistant Professor): redesigned PaStiX static/dynamic scheduling with PaRSEC in order to get a generic framework for multicore/MPI/GPU/Out-of-Core + low-rank compression


Direct Solver Highlights

Fusion - ITER

PaStiX is used in the JOREK code developed by G. Huysmans at CEA/Cadarache in a fully implicit time evolution scheme for the numerical simulations of the ELM (Edge Localized Mode) instabilities commonly observed in the standard tokamak operating scenario.

MHD pellet injection simulated with the JOREK code


Tasks algorithms

Kernels: POTRF, TRSM, SYRK, GEMM

[Figure: (a) Task for a dense tile; (b) Task for a sparse supernode]
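
For the dense case, the four kernels combine into the usual right-looking tiled Cholesky factorization. The following sketch only enumerates the tasks in submission order for an `nt` x `nt` grid of tiles; the function name and the tuple layout are assumptions made for illustration, and no numerical work is done.

```python
from collections import Counter

def tile_cholesky_tasks(nt):
    """List the tasks of a right-looking tiled Cholesky factorization
    as (kernel, tile indices) in sequential submission order."""
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", (k, k)))            # factorize diagonal tile
        for i in range(k + 1, nt):
            tasks.append(("TRSM", (i, k)))         # solve tiles below it
        for i in range(k + 1, nt):
            tasks.append(("SYRK", (i, k)))         # update diagonal tile (i, i)
            for j in range(k + 1, i):
                tasks.append(("GEMM", (i, j, k)))  # update tile (i, j)
    return tasks

tasks = tile_cholesky_tasks(4)
counts = Counter(kernel for kernel, _ in tasks)
```

For a 4 x 4 tile grid this gives 4 POTRF, 6 TRSM, 6 SYRK and 4 GEMM tasks. In the sparse case the slide groups the work per supernode instead, so the task count follows the block structure of the factor rather than a regular grid.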


DAG representation

[Figure: (c) Dense DAG, made of POTRF, TRSM, SYRK and GEMM tasks; (d) Sparse DAG, made of panel and gemm tasks]


Direct Solver Highlights (cluster of multicore)

RC3 matrix - complex double precision
N=730700 - NNZA=41600758 - Fill-in=50

Facto (s)     1 MPI   2 MPI   4 MPI   8 MPI
1 thread       6820    3520    1900    1890
6 threads      1020     639     337     287
12 threads      525     360     155     121

Mem (Gb)      1 MPI   2 MPI   4 MPI   8 MPI
1 thread       34      19.2    12.5    9.22
6 threads      34.3    19.5    12.8    9.66
12 threads     34.6    19.7    13      9.14

Solve (s)     1 MPI   2 MPI   4 MPI   8 MPI
1 thread       6.97    3.75    1.93    1.03
6 threads      2.5     1.43    0.78    0.54
12 threads     1.33    0.93    0.66    0.59
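
The factorization figures can be turned into speedups and parallel efficiencies. The short script below is a sketch under two assumptions not stated on the slide: the factorization figures are times in seconds, and each thread runs on its own core, so a run uses `MPI processes x threads` cores.

```python
# Factorization times from the RC3 table: facto[threads][mpi_processes].
facto = {
    1: {1: 6820, 2: 3520, 4: 1900, 8: 1890},
    6: {1: 1020, 2: 639, 4: 337, 8: 287},
    12: {1: 525, 2: 360, 4: 155, 8: 121},
}

baseline = facto[1][1]   # 1 MPI process, 1 thread
results = {}             # (mpi, threads) -> (speedup, efficiency)
for threads, row in facto.items():
    for mpi, seconds in row.items():
        cores = mpi * threads              # assumption: one core per thread
        speedup = baseline / seconds
        results[(mpi, threads)] = (speedup, speedup / cores)

for (mpi, threads), (speedup, efficiency) in sorted(results.items()):
    print(f"{mpi} MPI x {threads:2d} threads: "
          f"speedup {speedup:5.1f}, efficiency {efficiency:.2f}")
```

The same loop applies unchanged to the solve times, which makes it easy to compare how the two phases scale.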


NUMA-Aware Allocation (up to 20% efficiency)


Direct Solver Highlights (multicore)

SGI NUMA 160-cores

Name   N           NNZA        Fill ratio   Fact
Audi   9.44×10^5   3.93×10^7   31.28        float LL^T
10M    1.04×10^7   8.91×10^7   75.66        complex LDL^T

10M         10     20     40     80     160
Facto (s)   3020   1750   654    356    260
Mem (Gb)    122    124    127    133    146
Solve (s)   24.6   13.5   3.87   2.90   2.89

Audi        128    2x64     4x32     8x16
Facto (s)   17.8   18.6     13.8     13.4
Mem (Gb)    13.4   2x7.68   4x4.54   8x2.69
Solve (s)   0.40   0.32     0.21     0.14


Communication schemes (up to 10% efficiency)


Dynamic scheduling (up to 15% efficiency)

Properties
N      10 423 737
NNZA   89 072 871
NNZL   6 724 303 039
OPC    4.41834e+13

                    4x32   8x32
Static Scheduler    289    195
Dynamic Scheduler   240    184

Table: Factorization time in seconds

- Electromagnetism problem in double complex from CEA
- Cluster Vargas from IDRIS with 32 Power6 per node


Static Scheduling Gantt Diagram

- 10Million test case on 4 MPI processes of 32 threads (color is level in the tree)


Dynamic Scheduling Gantt Diagram

- Reduces time by 10-15%


Automatic Performance Analysis ?


Block ILU(k): supernode amalgamation algorithm

Derive a block incomplete LU factorization from the supernodal parallel direct solver

- Based on existing package PaStiX
- Level-3 BLAS incomplete factorization implementation
- Fill-in strategy based on level-fill among block structures identified thanks to the quotient graph
- Amalgamation strategy to enlarge block size

Highlights
- Handles efficiently high level-of-fill
- Solving time faster than with scalar ILU(k)
- Scalable parallel implementation
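
To recall what "level-fill" means, here is a sketch of the classical scalar ILU(k) symbolic phase, not the block variant described above: an original entry has level 0, and a fill entry (i, j) created through pivot p gets level lev(i, p) + lev(p, j) + 1; entries whose level exceeds k are dropped. The function name and the small arrow-shaped test pattern are assumptions for illustration.

```python
def iluk_pattern(pattern, n, k):
    """Symbolic scalar ILU(k).
    pattern: set of (i, j) positions of the nonzeros of A, diagonal included.
    Returns a dict mapping each kept position to its level of fill."""
    lev = {pos: 0 for pos in pattern}
    for p in range(n):                                   # pivot index
        rows = [i for i in range(p + 1, n) if (i, p) in lev]
        cols = [j for j in range(p + 1, n) if (p, j) in lev]
        for i in rows:
            for j in cols:
                new = lev[(i, p)] + lev[(p, j)] + 1
                # Keep the entry only if its level does not exceed k.
                if new <= k and new < lev.get((i, j), new + 1):
                    lev[(i, j)] = new
    return lev

# Arrow matrix with a full first row and column: eliminating pivot 0
# couples every remaining pair of unknowns.
n = 4
pattern = ({(i, i) for i in range(n)}
           | {(0, j) for j in range(n)}
           | {(i, 0) for i in range(n)})
lev0 = iluk_pattern(pattern, n, 0)   # no fill allowed
lev1 = iluk_pattern(pattern, n, 1)   # level-1 fill allowed
```

With k = 0 the pattern of A is kept as is (10 entries); with k = 1 all six fill entries created by the first pivot are accepted and the pattern becomes full. A block version applies the same rule to blocks of the quotient graph instead of scalar entries.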


3 Sparse solver on heterogeneous architectures


Multiple layer approach

[Figure: layered software stack, from top to bottom: ALGORITHM, RUNTIME, KERNELS, GPU and CPU]

Governing ideas: Enable advanced numerical algorithms to be executed on a scalable unified runtime system for exploiting the full potential of future exascale machines.

Basics:
- Graph of tasks
- Out-of-order scheduling
- Fine granularity


DAG schedulers considered

StarPU
- RunTime Team, Inria Bordeaux Sud-Ouest
- C. Augonnet, R. Namyst, S. Thibault.
- Dynamic Task Discovery
- Computes cost models on the fly
- Multiple kernels on the accelerators
- Heterogeneous Earliest Finish Time (HEFT) strategy

PaRSEC (formerly DAGuE)
- ICL, University of Tennessee, Knoxville
- G. Bosilca, A. Bouteiller, A. Danalis, T. Herault
- Parameterized Task Graph
- Only the most compute intensive kernel on accelerators
- Simple scheduling strategy based on computing capabilities
- GPU multi-stream enabled


Supernodal sequential algorithm

    forall the Supernode S1 do
        panel(S1);                         /* update of the panel */
        forall the extra diagonal block Bi of S1 do
            S2 ← supernode in front of (Bi);
            gemm(S1, S2);                  /* sparse GEMM B_{k, k≥i} × B_i^T subtracted from S2 */
        end
    end


StarPU Tasks submission

    forall the Supernode S1 do
        submit panel(S1);                  /* update of the panel */
        forall the extra diagonal block Bi of S1 do
            S2 ← supernode in front of (Bi);
            submit gemm(S1, S2);           /* sparse GEMM B_{k, k≥i} × B_i^T subtracted from S2 */
        end
        wait for all tasks();
    end
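
With this submission style the loop nest stays sequential and the runtime derives the task graph from the data each task reads and writes. The toy model below sketches that inference; `ToyRuntime`, its `submit` method and the `blocks_of` structure are hypothetical names invented here and are not the StarPU API. It also serializes successive updates of the same supernode, which a real runtime may relax.

```python
class ToyRuntime:
    """Toy sequential-task-flow model: tasks are submitted in program order
    and dependencies are inferred from the data they access."""

    def __init__(self):
        self.deps = {}          # task name -> set of tasks it must wait for
        self.last_writer = {}   # data handle -> last task that wrote it
        self.readers = {}       # data handle -> readers since the last write

    def submit(self, name, reads=(), writes=()):
        deps = set()
        for h in reads:                       # read after write
            if h in self.last_writer:
                deps.add(self.last_writer[h])
        for h in writes:
            if h in self.last_writer:         # write after write
                deps.add(self.last_writer[h])
            deps.update(self.readers.get(h, []))  # write after read
        for h in reads:
            self.readers.setdefault(h, []).append(name)
        for h in writes:
            self.last_writer[h] = name
            self.readers[h] = []
        self.deps[name] = deps

# Hypothetical symbolic structure: blocks_of[s] lists the supernodes faced
# by the off-diagonal blocks of supernode s.
blocks_of = {0: [2], 1: [2], 2: []}

rt = ToyRuntime()
for s1 in sorted(blocks_of):
    rt.submit(f"panel({s1})", writes=[s1])
    for s2 in blocks_of[s1]:
        rt.submit(f"gemm({s1},{s2})", reads=[s1], writes=[s2])
```

In this example `panel(0)` and `panel(1)` have no dependencies and could run concurrently, while `panel(2)` must wait for the last update of supernode 2.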


PaRSEC’s parameterized task graph

[Figure: task graph made of panel and gemm tasks, with A=>A, C=>A and C=>C data flows]

    panel(j)

    /* Execution Space */
    j = 0 .. cblknbr-1

    /* Task Locality (Owner Compute) */
    : A(j)

    /* Data dependencies */
    RW A <- ( leaf ) ? A(j) : C gemm( lastbrow )
         -> A gemm( firstblock+1 .. lastblock )
         -> A(j)

Panel Factorization in JDF Format


Matrices and Machines

Matrix     Prec   Method   Size     nnzA    nnzL      TFlop
FilterV2   Z      LU       0.6e+6   12e+6   536e+6    3.6
Flan       D      LL^T     1.6e+6   59e+6   1712e+6   5.3
Audi       D      LL^T     0.9e+6   39e+6   1325e+6   6.5
MHD        D      LU       0.5e+6   24e+6   1133e+6   6.6
Geo1438    D      LL^T     1.4e+6   32e+6   2768e+6   23
Pmldf      Z      LDL^T    1.0e+6   8e+6    1105e+6   28
Hook       D      LU       1.5e+6   31e+6   4168e+6   35
Serena     D      LDL^T    1.4e+6   32e+6   3365e+6   47

Table: Matrix description (Z: double complex, D: double real)

Processors                           Frequency   GPUs              RAM
Westmere Intel Xeon X5650 (2 × 6)    2.67 GHz    Tesla M2070 (×3)  36 GB


CPU scaling study: GFlop/s for numerical factorization

[Figure: bar chart of performance (GFlop/s, axis from 0 to 80) for afshell10(D, LU), FilterV2(Z, LU), Flan(D, LL^T), audi(D, LL^T), MHD(D, LU), Geo1438(D, LL^T), pmlDF(Z, LDL^T) and HOOK(D, LU); series: native, StarPU and PaRSEC, each on 1, 3, 6, 9 and 12 cores]

GPU scaling study : GFlop/s for numerical factorization

[Figure: bar chart of performance (GFlop/s, axis from 0 to 250) for afshell10(D, LU), FilterV2(Z, LU), Flan(D, LL^T), audi(D, LL^T), MHD(D, LU), Geo1438(D, LL^T), pmlDF(Z, LDL^T) and HOOK(D, LU); series: Native (CPU only); StarPU (CPU only, 1 GPU, 2 GPU, 3 GPU); PaRSEC 1 stream (CPU only, 1 GPU, 2 GPU, 3 GPU); PaRSEC 3 streams (1 GPU, 2 GPU, 3 GPU)]

Xeon Phi (from Max-Planck-Institut for Plasma Physics) [Phi = 4 Xeon cores, GPU = 6 Xeon cores]


Distributed implementation

[Figure labels: C0, C1, C2, C3; P1, P2]


Distributed (preliminary) results (cluster of multicore)

[Figure: bar chart of performance (GFlop/s, axis from 0 to 300) for afshell10(D, LU), FilterV2(Z, LU), Flan(D, LL^T), audi(D, LL^T), MHD(D, LU), Geo1438(D, LL^T), pmlDF(Z, LDL^T), HOOK(D, LU) and Serena(D, LDL^T); series: native, StarPU fan-in and StarPU fan-out, each with 4, 8, 16 and 32 MPI processes of 8 threads]


4 Hybrid methods - MaPHYS and HIPS


Hybrid Linear Solvers

Develop robust scalable parallel hybrid direct/iterative linear solvers

- Exploit the efficiency and robustness of the sparse direct solvers
- Develop robust parallel preconditioners for iterative solvers
- Take advantage of scalable implementation of iterative solvers

Domain Decomposition (DD)

- Natural approach for PDEs
- Extend to general sparse matrices
- Partition the problem into subdomains
- Use a direct solver on the subdomains
- Robust preconditioned iterative solver


Method used in MaPHYS

- Partitioning the global matrix in several local matrices
  - MeTiS [G. Karypis and V. Kumar]
  - Scotch [Pellegrini et al.]
- Local factorization
  - MUMPS [P. Amestoy et al.] (with Schur option)
  - PaStiX [P. Ramet et al.] (with Schur option and multi-threaded version)
- Construction of the preconditioner
  - MKL library
- Solving the reduced system
  - CG/GMRES/FGMRES on the reduced system

[Figure: a graph partitioned successively into one domain Ω (0 cut edges), two subdomains Ω1 and Ω2 with interface Γ (48 cut edges), and four subdomains Ω1 to Ω4 with interface Γ (88 cut edges); next to it, the sparsity pattern of the matrix (400 × 400, nz = 1920) with blocks Ω1 to Ω4 and Γ]

P. Ramet - HOSCAR workshop September 17, 2014- 41

Page 57: Task-based sparse linear solvers for modern … Simple/Double precision + Real/Complex arithmetic ... 3020 1750 654 356 260 Mem ... /*sparse GEMM B k;k i B i T substracted from S 2

Hybrid

HIPS: hybrid direct-iterative solver

Based on a domain decomposition: the interface is one node wide (no overlap in DD lingo).

$$
\begin{pmatrix} A_B & F \\ E & A_C \end{pmatrix}
$$

B: interior nodes of the subdomains (direct factorization).
C: interface nodes.

Special decomposition and ordering of the subset C.
Goal: build a global Schur complement preconditioner (ILU) from the local domain matrices only.
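For reference, the Schur complement mentioned above follows from the standard block factorization of the partitioned matrix. The right-hand side blocks $b_B$, $b_C$ and the unknown blocks $x_B$, $x_C$ are notation assumed here; they do not appear on the slide.

$$
\begin{pmatrix} A_B & F \\ E & A_C \end{pmatrix}
=
\begin{pmatrix} A_B & 0 \\ E & S \end{pmatrix}
\begin{pmatrix} I & A_B^{-1} F \\ 0 & I \end{pmatrix},
\qquad
S = A_C - E A_B^{-1} F .
$$

Solving the full system then reduces to an interface solve followed by an interior solve:

$$
S\, x_C = b_C - E A_B^{-1} b_B,
\qquad
x_B = A_B^{-1} \left( b_B - F x_C \right).
$$

Since interior nodes of different subdomains are not coupled, $A_B$ is block diagonal with one block per subdomain, so $A_B^{-1}$ can be applied domain by domain with a direct factorization, while $S$ is the operator that the ILU preconditioner approximates.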


Hybrid

HIPS: preconditioners

Main features

- Iterative or “hybrid” direct/iterative methods are implemented.
- Mix direct supernodal (BLAS-3) and sparse ILUT factorization in a seamless manner.
- Memory/load balancing: distribute the domains over the processors (domains > processors).



5. Low-rank compression - H-PaStiX


H-PaStiX

Toward low-rank compression in supernodal solvers

Many works on hierarchical matrices and direct solvers:

- Eric Darve: hierarchical matrices classifications (Building O(N) Linear Solvers Using Nested Dissection)
- Sherry Li: multifrontal solver + HSS (Towards an Optimal-Order Approximate Sparse Factorization Exploiting Data-Sparseness in Separators)
- David Bindel: CHOLMOD + low rank (An Efficient Solver for Sparse Linear Systems Based on Rank-Structured Cholesky Factorization)
- Jean-Yves L’Excellent: MUMPS + Block Low Rank


H-PaStiX

Symbolic factorization


H-PaStiX

Nested dissection, 2D mesh/matrix
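As an illustration of the idea named in the title, here is a minimal sketch of a nested dissection ordering on a regular 2D grid with a 5-point stencil. The regular grid, the one-line separators, and the function name `nested_dissection` are assumptions made for the example; a real solver would obtain the ordering from a graph partitioner such as Scotch.

```python
def nested_dissection(r0, r1, c0, c1, leaf=2):
    """Order the grid points of [r0, r1) x [c0, c1): both halves first, separator last."""
    height, width = r1 - r0, c1 - c0
    if height <= 0 or width <= 0:
        return []
    if height <= leaf and width <= leaf:
        # Small enough: keep the natural ordering.
        return [(r, c) for r in range(r0, r1) for c in range(c0, c1)]
    if width >= height:
        # Cut along the middle column; it disconnects left and right halves.
        mid = c0 + width // 2
        separator = [(r, mid) for r in range(r0, r1)]
        return (nested_dissection(r0, r1, c0, mid, leaf)
                + nested_dissection(r0, r1, mid + 1, c1, leaf)
                + separator)
    # Otherwise cut along the middle row.
    mid = r0 + height // 2
    separator = [(mid, c) for c in range(c0, c1)]
    return (nested_dissection(r0, mid, c0, c1, leaf)
            + nested_dissection(mid + 1, r1, c0, c1, leaf)
            + separator)

order = nested_dissection(0, 7, 0, 7)
top_separator = order[-7:]  # unknowns eliminated last
```

Numbering each separator after the two parts it disconnects confines fill-in to the separator blocks, which is why the largest dense blocks of the factor correspond to the top-level separators.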


H-PaStiX

Computational cost ($O(N^2)$ for 3D PDE)
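A short derivation of where this figure comes from, assuming a regular $n \times n \times n$ grid ordered by nested dissection (the grid model is an assumption for illustration; the slide only states the result):

$$
N = n^3,
\qquad
|\text{top separator}| = n^2 = N^{2/3},
\qquad
\text{dense factorization of the separator block} = O\!\left( (N^{2/3})^3 \right) = O(N^2).
$$

The cost of the separators at deeper levels decreases geometrically, so the top-level separator dominates the total. The same argument in 2D, with $N = n^2$ and a top separator of size $n = N^{1/2}$, gives $O(N^{3/2})$.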


H-PaStiX

Low-Rank Compression of Supernodes


H-PaStiX

FastLA associate team between INRIA/Berkeley/Stanford

Supernodal Solver - Hierarchical Matrices $O(N \log^a(N))$

1. Check the potential compression ratio on top-level blocks
2. Develop a prototype with:
   - low-rank compression on the larger supernodes
   - compression tree built at each update
   - complexity analysis of the approach
3. Study the coupling between nested dissection and compression tree ordering

Which algorithm to find a low-rank approximation?
SVD, RR-LU, RR-QR, ACA, CUR, Random ...

Which family of hierarchical matrices?
$\mathcal{H}$, $\mathcal{H}^2$, HODLR ...
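To make one of the listed candidates concrete, here is a minimal sketch of ACA (adaptive cross approximation) in its fully pivoted form, applied to a block coupling two well-separated point sets. The kernel $1/(y - x)$, the point sets, the tolerance, and the function name `aca_full` are all assumptions chosen for illustration; this is not the H-PaStiX code, and a practical solver would use partial pivoting to avoid forming the whole block.

```python
def aca_full(M, tol=1e-6, max_rank=None):
    """Fully pivoted cross approximation: M is approximated by sum_k outer(U[k], V[k])."""
    R = [row[:] for row in M]  # residual, updated in place
    m, n = len(R), len(R[0])
    max_rank = max_rank or min(m, n)
    scale = max(abs(v) for row in R for v in row)
    U, V = [], []
    for _ in range(max_rank):
        # Pivot on the largest remaining residual entry.
        i, j = max(((r, c) for r in range(m) for c in range(n)),
                   key=lambda rc: abs(R[rc[0]][rc[1]]))
        pivot = R[i][j]
        if abs(pivot) <= tol * scale:
            break
        u = [R[r][j] / pivot for r in range(m)]  # scaled pivot column
        v = R[i][:]                              # pivot row
        for r in range(m):
            for c in range(n):
                R[r][c] -= u[r] * v[c]
        U.append(u)
        V.append(v)
    return U, V

# Interaction block between two well-separated clusters: x in [0, 1], y in [3, 4].
size = 30
xs = [k / (size - 1) for k in range(size)]
ys = [3.0 + k / (size - 1) for k in range(size)]
M = [[1.0 / (y - x) for y in ys] for x in xs]

U, V = aca_full(M)
rank = len(U)
err = max(abs(M[r][c] - sum(U[k][r] * V[k][c] for k in range(rank)))
          for r in range(size) for c in range(size))
storage_ratio = rank * 2 * size / (size * size)  # low-rank storage vs dense storage
```

The block is stored as `rank` pairs of vectors instead of `size * size` entries, which is the memory saving that low-rank compression of supernode blocks aims at.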



6. Conclusion


Conclusion

Software

Graph/mesh partitioner and ordering:

http://scotch.gforge.inria.fr

Sparse linear system solvers:

http://pastix.gforge.inria.fr

http://hips.gforge.inria.fr

https://wiki.bordeaux.inria.fr/maphys/doku.php


Conclusion

Software

Fast Multipole Method:

http://scalfmm-public.gforge.inria.fr/

Matrices Over Runtime Systems (with University of Tennessee):

http://icl.cs.utk.edu/projectsdev/morse


Conclusion

Thank you

Questions?

Pierre Ramet
HiePACS

http://www.labri.fr/~ramet