
Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator


Page 1: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

Kengo Nakajima
[email protected]

Supercomputing Division, Information Technology Center, The University of Tokyo
Japan Agency for Marine-Earth Science and Technology

April 22, 2008

Page 2: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 2

Overview

• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works

Page 3: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 3

Large-Scale Computing

• Scalar Processor
  – Big gap between clock rate and memory bandwidth
  – Very low sustained/peak performance ratio (<10%)

Page 4: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 4

Scalar Processors: CPU-Cache-Memory Hierarchical Structure

[Diagram: register and cache sit next to the CPU — fast, small capacity (MB), expensive (100M+ transistors); main memory is large capacity (GB) and cheap, but slow.]

Page 5: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 5

Large-Scale Computing

• Scalar Processor
  – Big gap between clock rate and memory bandwidth
  – Very low sustained/peak performance ratio (<10%)
• Vector Processor
  – Very high sustained/peak performance ratio (e.g. 35% for FEM applications on the Earth Simulator)
  – Requires very special tuning and sufficiently long loops (= large-scale problem size) for such performance
  – Suitable for simple computation

Page 6: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 6

Vector Processors: Vector Register & Fast Memory

[Diagram: vector processor with vector registers connected to main memory through a very fast path.]

• Parallel processing of simple DO loops
• Suitable for simple & large computation

do i= 1, N
  A(i)= B(i) + C(i)
enddo

Page 7: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 7

Large-Scale Computing

• Scalar Processor
  – Big gap between clock rate and memory bandwidth
  – Very low sustained/peak performance ratio (<10%)
• Vector Processor
  – Very high sustained/peak performance ratio (e.g. 35% for FEM applications on the Earth Simulator)
  – Requires very special tuning and sufficiently long loops (= large-scale problem size) for such performance
  – Suitable for simple computation

Page 8: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 8

Earth Simulator

                                   Earth        Hitachi       Hitachi         IBM SP3
                                   Simulator    SR8000/MPP    SR11000/J2      (LBNL)
                                                (U.Tokyo)     (U.Tokyo)
Peak Performance (GFLOPS)          8.00         1.80          9.20            1.50
Measured Memory BW, STREAM
  (GB/sec/PE)                      26.6         2.85          8.00            0.623
Estimated Performance/Core,
  GFLOPS (% of peak)               2.31-3.24    .291-.347     .880-.973       .072-.076
                                   (28.8-40.5)  (16.1-19.3)   (9.6-10.6)      (4.8-5.1)
Measured Performance/Core,
  GFLOPS (% of peak)               2.93 (36.6)  .335 (18.6)   1.34 (14.5)     .122 (8.1)
BYTE/FLOP                          3.325        1.583         0.870           0.413
Comm. BW (GB/sec/Node)             12.3         1.60          12.0            1.00
MPI Latency (μsec)                 -            6-20          4.7 (*5.6-7.7)  16.3

* IBM p595, J.T. Carter et al.

Page 9: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 9

Typical Behavior …

• Earth Simulator: performance is good for large-scale problems due to long vector length (up to ~40% of peak).
• IBM-SP3: performance is good for small problems due to the cache effect (~8% of peak).

[Plots: GFLOPS vs. DOF (problem size, 1.0E+04 - 1.0E+07) for each system.]

Page 10: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 10

Parallel Computing: Strong Scaling (Fixed Problem Size)

[Plots: performance vs. PE#, with the ideal linear speed-up shown for reference.]

• Earth Simulator: performance decreases for many PEs due to communication overhead and small vector length.
• IBM-SP3: super-linear speed-up (cache effect) for a small number of PEs; performance decreases for many PEs due to communication overhead.

Page 11: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 11

Improvement of Memory Performance: IBM SP3 ⇒ Hitachi SR11000/J2

[Plots: GFLOPS vs. DOF; ■ Flat-MPI/DCRS, □ Hybrid/DCRS.]

• IBM SP-3 (POWER3): 375 MHz, 1.0 GB/sec memory BW, 8 MB L2 cache/PE
• Hitachi SR11000/J2 (POWER5+): 2.3 GHz, 8.0 GB/sec memory BW, 18 MB L3 cache/PE

Memory performance (bandwidth, latency, etc.) makes the difference.

Page 12: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 12

My History with Vector Computers

• Cray-1S (1985-1988): Mitsubishi Research Institute (MRI)
• Cray-YMP (1988-1995): MRI, University of Texas at Austin
• Fujitsu VP, VPP series (1985-1999): JAERI, PNC
• NEC SX-4 (1997-2001): CCSE/JAERI
• Hitachi SR2201 (1997-2004): University of Tokyo, CCSE/JAERI
• Hitachi SR8000 (2000-2007): University of Tokyo
• Earth Simulator (2002-)

Page 13: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 13

• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works

Page 14: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 14

GeoFEM: FY.1998-2002
http://geofem.tokyo.rist.or.jp/

• Parallel FEM platform for solid earth simulation
  – parallel I/O, parallel linear solvers, parallel visualization
  – solid earth: earthquake, plate deformation, mantle/core convection, etc.
• Part of a national project by STA/MEXT for large-scale earth science simulations using the Earth Simulator
• Strong collaboration between the HPC and natural science (solid earth) communities

Page 15: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 15

System Configuration of GeoFEM

[Diagram: a one-domain mesh is split by the Partitioner into partitioned meshes for the PEs. The GeoFEM platform provides parallel I/O, equation solvers and a parallel visualizer (output viewed with GPPView), exposed through Solver, Comm. and Vis. interfaces. Pluggable analysis modules: structural analysis (static linear, dynamic linear, contact), fluid, wave; plus utilities.]

Page 16: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 16

Results on Solid Earth Simulation

• Magnetic field of the Earth: MHD code
• Complicated plate model around the Japan Islands
• Simulation of the earthquake generation cycle in southwestern Japan (TSUNAMI !!)
• Transportation by groundwater flow through heterogeneous porous media

[Figures: simulation snapshots (h=5.00, h=1.25; T=100, 200, 300, 400, 500).]

Page 17: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 17

Results by GeoFEM

Page 18: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 18

Features of FEM applications (1/2)

• Local "element-by-element" operations
  – sparse coefficient matrices
  – suitable for parallel computing
• HUGE "indirect" accesses
  – IRREGULAR sparse matrices
  – memory intensive

do i= 1, N
  jS= index(i-1) + 1
  jE= index(i)
  do j= jS, jE
    in= item(j)
    Y(i)= Y(i) + AMAT(j)*X(in)
  enddo
enddo
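For reference, a minimal self-contained sketch of this CRS-style sparse matrix-vector product; the 4×4 tridiagonal matrix and its values are made up purely for illustration, while the array names (index, item, AMAT) follow the slide:

! Minimal, self-contained CRS matrix-vector product (illustrative data).
program crs_matvec
  implicit none
  integer, parameter :: N = 4, NNZ = 10
  integer :: index(0:N)      ! index(i): position of the last nonzero of row i
  integer :: item(NNZ)       ! column numbers of the nonzeros
  real(8) :: AMAT(NNZ)       ! nonzero values, stored row by row
  real(8) :: X(N), Y(N)
  integer :: i, j, jS, jE, in

  index = (/ 0, 2, 5, 8, 10 /)
  item  = (/ 1,2,  1,2,3,  2,3,4,  3,4 /)
  AMAT  = (/ 2.d0,-1.d0, -1.d0,2.d0,-1.d0, -1.d0,2.d0,-1.d0, -1.d0,2.d0 /)
  X     = (/ 1.d0, 2.d0, 3.d0, 4.d0 /)
  Y     = 0.d0

  do i= 1, N
    jS= index(i-1) + 1
    jE= index(i)
    do j= jS, jE
      in= item(j)
      Y(i)= Y(i) + AMAT(j)*X(in)
    enddo
  enddo

  print '(a,4f8.2)', ' Y =', Y    ! expected: 0.00 0.00 0.00 5.00
end program crs_matvec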

Page 19: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 19

Features of FEM applications (2/2)

• In parallel computation …
  – communication with ONLY neighbors (except "dot products" etc.)
  – the amount of data in the messages is relatively small, because only values on the domain boundary are exchanged
  – communication (MPI) latency is critical

Page 20: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 20

• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works

Page 21: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 21

Earth Simulator (ES)
http://www.es.jamstec.go.jp/

• 640×8 = 5,120 vector processors
  – SMP cluster-type architecture
  – 8 GFLOPS/PE
  – 64 GFLOPS/node
  – 40 TFLOPS for the whole ES
• 16 GB memory/node, 10 TB for the whole ES
• 640×640 crossbar network
  – 16 GB/sec × 2
• Memory bandwidth: 32 GB/sec
• 35.6 TFLOPS for LINPACK (March 2002)
• 26 TFLOPS for AFES (climate simulation)

Page 22: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 22

Motivations

• GeoFEM Project (FY.1998-2002)
• FEM-type applications with complicated unstructured grids (not LINPACK, FDM …) on the Earth Simulator (ES)
  – Implicit linear solvers
  – Hybrid vs. Flat MPI parallel programming models

Page 23: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 23

Flat MPI vs. Hybrid

[Diagram: SMP nodes, each with several PEs sharing one memory.]

• Hybrid: hierarchical structure — the PEs within each SMP node work on shared memory.
• Flat-MPI: each PE is treated as an independent process.

Page 24: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 24

• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works

Page 25: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 25

Direct/Iterative Methods for Linear Equations

• Direct methods
  – Gaussian elimination / LU factorization
    • compute A^-1 directly
  – robust for a wide range of applications
  – more expensive than iterative methods (memory, CPU)
  – not suitable for parallel and vector computation due to their global operations
• Iterative methods
  – CG, GMRES, BiCGSTAB
  – less expensive than direct methods, especially in memory
  – suitable for parallel and vector computing
  – convergence strongly depends on the problem and boundary conditions (condition number etc.)
    • preconditioning is required

Page 26: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 26

Preconditioning for Iterative Methods

• The convergence rate of iterative solvers strongly depends on the spectral properties (eigenvalue distribution) of the coefficient matrix A.
• A preconditioner M transforms the linear system into one with more favorable spectral properties.
  – In "ill-conditioned" problems the "condition number" (ratio of max/min eigenvalue if A is symmetric) is large.
  – M transforms the original equation Ax=b into A'x=b', where A'=M^-1 A and b'=M^-1 b.
• ILU (Incomplete LU Factorization) and IC (Incomplete Cholesky Factorization) are well-known preconditioners.
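For reference (standard textbook form, not specific to this talk), the preconditioned CG iteration shows where M enters: one preconditioner solve Mz = r per iteration.

\begin{align*}
& r_0 = b - A x_0, \quad z_0 = M^{-1} r_0, \quad p_0 = z_0 \\
& \text{for } k = 0, 1, 2, \dots \\
& \quad \alpha_k = (r_k, z_k) / (p_k, A p_k) \\
& \quad x_{k+1} = x_k + \alpha_k p_k, \qquad r_{k+1} = r_k - \alpha_k A p_k \\
& \quad z_{k+1} = M^{-1} r_{k+1} \\
& \quad \beta_k = (r_{k+1}, z_{k+1}) / (r_k, z_k), \qquad p_{k+1} = z_{k+1} + \beta_k p_k
\end{align*}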

Page 27: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 27

Strategy in GeoFEM

• Iterative methods are the ONLY choice for large-scale parallel computing.
• Preconditioning is important:
  – general methods, such as ILU(0)/IC(0), cover a wide range of applications
  – problem-specific methods

Page 28: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 28

• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works

Page 29: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 29

Block IC(0)-CG Solver on the Earth Simulator

• 3D linear elastic problems (SPD)
• Parallel iterative linear solver
  – node-based local data structure
  – Conjugate Gradient method (CG): SPD
  – localized Block IC(0) preconditioning (Block Jacobi)
    • modified IC(0): off-diagonal components of the original [A] are kept
  – Additive Schwarz Domain Decomposition (ASDD)
  – http://geofem.tokyo.rist.or.jp/
• Hybrid parallel programming model
  – OpenMP + MPI
  – re-ordering for vector/parallel performance
  – comparison with Flat MPI

Page 30: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 30

Flat MPI vs. Hybrid

[Diagram: SMP nodes, each with several PEs sharing one memory.]

• Hybrid: hierarchical structure — the PEs within each SMP node work on shared memory.
• Flat-MPI: each PE is treated as an independent process.

Page 31: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 31

Local Data Structure: Node-based Partitioning

internal nodes - elements - external nodes

[Figure: a 5×5 node mesh partitioned into four domains (PE#0 - PE#3); each domain stores its internal nodes, the elements touching them, and the external (halo) nodes owned by neighboring domains.]

Page 32: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 32

1 SMP node => 1 domain for the Hybrid programming model
MPI communication among domains

[Figure: four SMP nodes (Node-0 - Node-3), each with 8 PEs sharing one memory; MPI messages are exchanged between nodes.]

Page 33: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 33

Basic Strategy for Parallel Programming on the Earth Simulator

• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.

Page 34: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 34

ILU(0)/IC(0) Factorization

do i= 2, n
  do k= 1, i-1
    if ((i,k) ∈ NonZero(A)) then
      a(i,k)= a(i,k) / a(k,k)
    endif
    do j= k+1, n
      if ((i,j) ∈ NonZero(A)) then
        a(i,j)= a(i,j) - a(i,k)*a(k,j)
      endif
    enddo
  enddo
enddo
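A compilable rendering of this sketch, assuming a dense array a(n,n) with a logical mask nz(i,j) for the nonzero pattern of A; this is an illustration only — real implementations work on the sparse structure directly:

! ILU(0)/IC(0)-style factorization restricted to the original nonzero
! pattern nz(i,j); entries outside the pattern (the "fill-in") are dropped.
subroutine ilu0_pattern(n, a, nz)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(inout) :: a(n,n)
  logical, intent(in)    :: nz(n,n)
  integer :: i, j, k

  do i= 2, n
    do k= 1, i-1
      if (nz(i,k)) then                ! a(i,k) is zero otherwise, so skipping is equivalent
        a(i,k)= a(i,k) / a(k,k)
        do j= k+1, n
          if (nz(i,j)) a(i,j)= a(i,j) - a(i,k)*a(k,j)
        enddo
      endif
    enddo
  enddo
end subroutine ilu0_pattern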

Page 35: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 35

ILU/IC Preconditioning

M = (L+D) D^-1 (D+U)        (L, D, U taken from A)

Mz = r   … need to solve this equation

Forward substitution:
  (L+D) z1 = r  :  z1 = D^-1 (r - L z1)

Backward substitution:
  (I + D^-1 U) z_new = z1  :  z = z - D^-1 U z

Page 36: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 36

ILU/IC Preconditioning

M = (L+D) D^-1 (D+U)        (L, D, U taken from A)

Forward substitution  (L+D)z= r : z= D^-1(r-Lz)

do i= 1, N
  WVAL= R(i)
  do j= 1, INL(i)
    WVAL= WVAL - AL(i,j) * Z(IAL(i,j))
  enddo
  Z(i)= WVAL / D(i)
enddo

Backward substitution  (I+ D^-1 U)z_new= z_old : z= z - D^-1 U z

do i= N, 1, -1
  SW= 0.0d0
  do j= 1, INU(i)
    SW= SW + AU(i,j) * Z(IAU(i,j))
  enddo
  Z(i)= Z(i) - SW / D(i)
enddo

Dependency: you need the most recent value of "z" of connected nodes, so vectorization/parallelization is difficult.

Page 37: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 37

Basic Strategy for Parallel Programming on the Earth Simulator

• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
• Re-ordering for highly parallel/vector performance
  – local operation and no global dependency
  – continuous memory access
  – sufficiently long loops for vectorization

Page 38: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 38

ILU/IC Preconditioning

(Same forward/backward substitution loops as on the previous page.)

Dependency: you need the most recent value of "z" of connected nodes, so vectorization/parallelization is difficult.

Reordering: directly connected nodes do not appear in the RHS.

Page 39: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 39

Basic Strategy for Parallel Programming on the Earth Simulator

• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
• Re-ordering for highly parallel/vector performance
  – local operation and no global dependency
  – continuous memory access
  – sufficiently long loops for vectorization
• 3-way parallelism for the hybrid parallel programming model (see the sketch below)
  – inter-node: MPI
  – intra-node: OpenMP
  – individual PE: vectorization
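A minimal, generic sketch (not GeoFEM code) of this 3-way structure — MPI between nodes, OpenMP within a node, and the innermost loop left to the compiler's vectorizer; the example simply forms a global dot product of a vector of ones:

! Minimal hybrid MPI + OpenMP sketch of the 3-way parallelism above.
program hybrid_sketch
  use mpi
  implicit none
  integer, parameter :: n = 100000
  integer :: ierr, rank, i
  real(8) :: a(n), s

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  a = 1.0d0
  s = 0.0d0
!$omp parallel do reduction(+:s)          ! intra-node: OpenMP threads
  do i= 1, n                              ! innermost loop: vectorized on each PE
    s = s + a(i)*a(i)
  enddo

  call MPI_Allreduce(MPI_IN_PLACE, s, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)   ! inter-node: MPI
  if (rank == 0) print *, 'global dot product =', s
  call MPI_Finalize(ierr)
end program hybrid_sketch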

Page 40: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 40

Re-Ordering Technique for Vector/Parallel Architectures

Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama (NEC))

1. RCM (Reverse Cuthill-McKee)
2. CMC (Cyclic Multicolor)
3. DJDS re-ordering (for vector performance)
4. Cyclic DJDS for SMP units (for SMP parallelism)

These processes can be substituted by traditional multi-coloring (MC).

Page 41: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 41

Reordering = Coloring

• COLOR: unit of independent sets.
• Elements grouped in the same "color" are independent from each other, thus parallel/vector operation is possible.
• Many colors provide faster convergence, but shorter vector length: trade-off !!

[Figure: Red-Black (2 colors), 4 colors, and RCM (Reverse Cuthill-McKee) orderings of a small grid.]
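A minimal sketch of the greedy coloring idea on a CRS-style adjacency graph (the array names index/item follow the earlier matrix-vector slide); this is a generic illustration, not the MC/CM-RCM code actually used in GeoFEM:

! Greedy multi-coloring: nodes that are connected in the graph never share
! a color, so each color forms an independent set.
subroutine greedy_coloring(N, index, item, color, ncolor)
  implicit none
  integer, intent(in)  :: N, index(0:N), item(*)
  integer, intent(out) :: color(N), ncolor
  logical :: used(N)
  integer :: i, j, c

  color  = 0
  ncolor = 0
  do i= 1, N
    used = .false.
    do j= index(i-1)+1, index(i)                 ! colors already taken by neighbors
      if (color(item(j)) > 0) used(color(item(j))) = .true.
    enddo
    c = 1
    do while (used(c))                           ! smallest free color
      c = c + 1
    enddo
    color(i)= c
    ncolor  = max(ncolor, c)
  enddo
end subroutine greedy_coloring

Calling it with the adjacency of the mesh graph gives the independent sets; fewer colors mean longer vectors, more colors usually mean faster convergence, which is exactly the trade-off stated above.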

Page 42: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 42

Large-Scale Sparse Matrix Storage for Unstructured Grids

• 1D storage (CRS): memory saved, but short vector length.
• 2D storage: long vector length, but many ZEROs (padding).

Page 43: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 43

Re-Ordering within Each Color According to the Number of Non-Zero Off-Diagonal Components

Elements in the same color are independent, therefore intra-hyperplane re-ordering does not affect the results.
DJDS: Descending-order Jagged Diagonal Storage
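A sketch of the row permutation behind DJDS for one color: rows are sorted by descending off-diagonal count. The names NZ and perm are illustrative, and insertion sort is used only for brevity:

! Sort the rows of one color so that their off-diagonal counts descend.
subroutine djds_permute(nrow, NZ, perm)
  implicit none
  integer, intent(in)  :: nrow, NZ(nrow)   ! NZ(i): off-diagonal count of row i
  integer, intent(out) :: perm(nrow)       ! perm(k): original row placed k-th
  integer :: i, k, key

  do i= 1, nrow
    perm(i)= i
  enddo
  do i= 2, nrow                            ! stable insertion sort, descending
    key= perm(i)
    k = i - 1
    do while (k >= 1)
      if (NZ(perm(k)) >= NZ(key)) exit
      perm(k+1)= perm(k)
      k = k - 1
    enddo
    perm(k+1)= key
  enddo
end subroutine djds_permute

After this permutation the j-th jagged-diagonal sweep runs over the leading rows that still have a j-th off-diagonal, which is what gives the long, dense innermost loops.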

Page 44: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 44

Cyclic DJDS (MC/CM-RCM): Cyclic Re-Ordering for SMP Units
Load-balancing among PEs

npLX1= NLmax * PEsmpTOT
INL(0:NLmax*PEsmpTOT*NCOLORS)

do iv= 1, NCOLORS
!$omp parallel do
  do ip= 1, PEsmpTOT
    iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
    do j= 1, NLhyp(iv)
      iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
      iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!cdir nodep
      do i= iv0+1, iv0+iE-iS
        k = i+iS-iv0
        kk= IAL(k)
        (important computations)
      enddo
    enddo
  enddo
enddo

Page 45: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 45

Difference between Flat MPI & Hybrid

• Most of the re-ordering effort is for vectorization.
• If you have a long vector, just divide it and distribute the segments to the PEs of the SMP node.
• The source codes of Hybrid and Flat MPI are not so different.
  – Flat MPI corresponds to Hybrid with 1 PE per SMP node.
  – In other words, the Flat MPI code is already sufficiently complicated.

Page 46: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 46

Cyclic DJDS (MC/CM-RCM) for Forward/Backward Substitution in BILU Factorization

do iv= 1, NCOLORS
!$omp parallel do private (iv0,j,iS,iE, ... etc.)     ! SMP parallel
  do ip= 1, PEsmpTOT
    iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
    do j= 1, NLhyp(iv)
      iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
      iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!CDIR NODEP
      do i= iv0+1, iv0+iE-iS                          ! vectorized
        k = i+iS-iv0
        kk= IAL(k)
        X(i)= X(i) - A(k)*X(kk)*DINV(i)   ! etc.
      enddo
    enddo
  enddo
enddo

Page 47: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 47

Simple 3D Cubic Model

[Figure: (Nx-1)×(Ny-1)×(Nz-1) cube of elements.]

• Uz=0 @ z=Zmin
• Ux=0 @ x=Xmin
• Uy=0 @ y=Ymin
• Uniform distributed force in z-direction @ z=Zmin

Page 48: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 48

Effect of Ordering

Page 49: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 49

Effect of Re-Ordering

• PDJDS/CM-RCM: long loops, continuous access
• PDCRS/CM-RCM: short innermost loops, continuous access
• CRS, no re-ordering: short loops, irregular access

Page 50: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 50

Matrix Storage, Loops

• DJDS (Descending-order Jagged Diagonal Storage) with long innermost loops is suitable for vector processors.
• The reduction-type loop of DCRS is more suitable for cache-based scalar processors because of its localized operation.

DCRS:
do i= 1, N
  SW= WW(i,Z)
  isL= index_L(i-1)+1
  ieL= index_L(i)
  do j= isL, ieL
    k= item_L(j)
    SW= SW - AL(j)*Z(k)
  enddo
  Z(i)= SW/DD(i)
enddo

DJDS:
do iv= 1, NVECT
  iv0= STACKmc(iv-1)
  do j= 1, NLhyp(iv)
    iS= index_L(NL*(iv-1)+j-1)
    iE= index_L(NL*(iv-1)+j  )
    do i= iv0+1, iv0+iE-iS
      k = i+iS-iv0
      kk= item_L(k)
      Z(i)= Z(i) - AL(k)*Z(kk)
    enddo
  enddo
  iS= STACKmc(iv-1)+1
  iE= STACKmc(iv  )
  do i= iS, iE
    Z(i)= Z(i)/DD(i)
  enddo
enddo

Page 51: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 51

Effect of Re-Ordering: Results on 1 SMP Node

Color #: 99 (fixed). Re-ordering is REALLY required !!!

[Plot: GFLOPS vs. DOF (1.E+04 - 1.E+07), log scale.
 ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]

• Effect of vector length: ×10
• + re-ordering: ×100
• 22 GFLOPS, 34% of peak (ideal performance for a single CPU: 40%-45%)

Page 52: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 52

Effect of Re-Ordering: Results on 1 SMP Node

Color #: 99 (fixed). Re-ordering is REALLY required !!!

[Plot: GFLOPS vs. DOF, as on the previous page.]

80×80×80 case (1.5M DOF):
• PDJDS/CM-RCM: 212 iterations, 11.2 sec.
• PDCRS/CM-RCM (short innermost loops): 212 iterations, 143.6 sec.
• CRS, no re-ordering: 203 iterations, 674.2 sec.

Page 53: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 53

3D Elastic Simulation: Problem Size vs. GFLOPS
Earth Simulator, 1 SMP node (8 PEs)

• Flat-MPI: 23.4 GFLOPS, 36.6% of peak
• Hybrid (OpenMP): 21.9 GFLOPS, 34.3% of peak

[Plots: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]

Flat-MPI is better: nice intra-node MPI.

Page 54: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 54

Earth Simulator

(Hardware comparison table repeated — see Page 8.)

Page 55: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 55

3D Elastic Simulation: Problem Size vs. GFLOPS
Hitachi SR8000/MPP with pseudo-vectorization, 1 SMP node (8 PEs)

• Flat-MPI: 2.17 GFLOPS, 15.0% of peak
• Hybrid (OpenMP): 2.68 GFLOPS, 18.6% of peak

[Plots: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]

Hybrid is better: low intra-node MPI performance.

Page 56: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 56

3D Elastic Simulation: Problem Size vs. GFLOPS
IBM SP-3 (NERSC), 1 SMP node (8 PEs)

[Plots: GFLOPS vs. DOF for Flat-MPI and Hybrid (OpenMP); ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]

The cache is well utilized in Flat-MPI.

Page 57: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 57

3D Elastic Simulation: Problem Size vs. GFLOPS
Hitachi SR11000/J2 (U.Tokyo), 1 SMP node (8 PEs)

[Plots: GFLOPS vs. DOF (up to ~15 GFLOPS) for Flat-MPI and Hybrid (OpenMP); ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]

Page 58: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 58

SMP node # > 10, up to 176 nodes (1,408 PEs)
The problem size for each SMP node is fixed (weak scaling). PDJDS/CM-RCM, Color #: 99.

Page 59: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 59

3D Elastic Model (Large Case)
256×128×128/SMP node, up to 2,214,592,512 DOF
●: Flat MPI, ○: Hybrid

[Plots: GFLOPS rate and parallel work ratio (%) vs. number of SMP nodes (0-192).]

3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak).

Page 60: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 60

3D Elastic Model (Small Case)
64×64×64/SMP node, up to 125,829,120 DOF
●: Flat MPI, ○: Hybrid

[Plots: GFLOPS rate and parallel work ratio (%) vs. number of SMP nodes (0-192).]

Page 61: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 61

Hybrid outperforms Flat-MPI …

• when
  – the number of SMP nodes (PEs) is large
  – the problem size per node is small
• because Flat-MPI has
  – 8 times as many communicating processes
  – TWICE as large a communication/computation ratio
• The effect of communication becomes significant when the number of SMP nodes (or PEs) is large.
• Performance estimation by D. Kerbyson (LANL), LA-UR-02-5222: relatively large communication latency of the ES.

Page 62: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 62

Flat-MPI and Hybrid

(N = number of FEM nodes in one direction of the cube assigned to one Flat-MPI process; a Hybrid process covers one SMP node = 8 PEs = a 2N×2N×2N block.)

                                            Flat MPI     Hybrid
Problem size per MPI process (DOF)          3N^3         3×8N^3
Message size per neighboring domain         3N^2         3×4N^2
Ratio of communication/computation          1/N          1/(2N)

[Figure: an N×N×N block per Flat-MPI process vs. a 2N×2N×2N block per Hybrid process.]
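The ratios in the last row follow directly from the surface and volume counts above:

\[
\text{Flat MPI: } \frac{3N^2}{3N^3}=\frac{1}{N},
\qquad
\text{Hybrid: } \frac{3\cdot 4N^2}{3\cdot 8N^3}=\frac{1}{2N}.
\]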

Page 63: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 63

Earth Simulator

(Hardware comparison table repeated — see Page 8.)

Page 64: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 64

Why Communication Overhead?

• latency of the network
• finite bandwidth of the network
• synchronization at SEND/RECV, ALLREDUCE, etc.
• memory performance in boundary communications (memory copy)

Page 65: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 65

Domain-to-Domain Communication
Exchange Boundary Information (SEND/RECV)

subroutine SOLVER_SEND_RECV (N, NEIBPETOT, NEIBPE,               &
     &     IMPORT_INDEX, IMPORT_NODE, EXPORT_INDEX, EXPORT_NODE, &
     &     WS, WR, X, SOLVER_COMM, my_rank)
  implicit REAL*8 (A-H,O-Z)
  include 'mpif.h'
  parameter (KREAL= 8)
  integer IMPORT_INDEX(0:NEIBPETOT), IMPORT_NODE(N)
  integer EXPORT_INDEX(0:NEIBPETOT), EXPORT_NODE(N)
  integer NEIBPE(NEIBPETOT)
  integer SOLVER_COMM, my_rank
  integer req1(NEIBPETOT), req2(NEIBPETOT)
  integer sta1(MPI_STATUS_SIZE, NEIBPETOT)
  integer sta2(MPI_STATUS_SIZE, NEIBPETOT)
  real(kind=KREAL) X(N), WS(N), WR(N)

! SEND
  do neib= 1, NEIBPETOT
    istart= EXPORT_INDEX(neib-1)
    inum  = EXPORT_INDEX(neib  ) - istart
    do k= istart+1, istart+inum
      WS(k)= X(EXPORT_NODE(k))
    enddo
    call MPI_ISEND (WS(istart+1), inum, MPI_DOUBLE_PRECISION,     &
     &              NEIBPE(neib), 0, SOLVER_COMM, req1(neib), ierr)
  enddo

! RECEIVE
  do neib= 1, NEIBPETOT
    istart= IMPORT_INDEX(neib-1)
    inum  = IMPORT_INDEX(neib  ) - istart
    call MPI_IRECV (WR(istart+1), inum, MPI_DOUBLE_PRECISION,     &
     &              NEIBPE(neib), 0, SOLVER_COMM, req2(neib), ierr)
  enddo

  call MPI_WAITALL (NEIBPETOT, req2, sta2, ierr)

  do neib= 1, NEIBPETOT
    istart= IMPORT_INDEX(neib-1)
    inum  = IMPORT_INDEX(neib  ) - istart
    do k= istart+1, istart+inum
      X(IMPORT_NODE(k))= WR(k)
    enddo
  enddo

  call MPI_WAITALL (NEIBPETOT, req1, sta1, ierr)

  return
end

Page 66: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 66

Domain-to-Domain Communication
Exchange Boundary Information (SEND/RECV)

(Same SOLVER_SEND_RECV routine as on the previous page; the SEND and RECEIVE phases are highlighted.)

Page 67: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 67

Communication Overhead

[Diagram: total communication overhead = memory copy + comm. bandwidth + comm. latency.]

Page 68: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 68

Communication Overhead

[Diagram: the memory-copy and comm.-bandwidth components depend on the message size; the comm.-latency component does not.]

Page 69: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 69

Communication Overhead: Earth Simulator

[Diagram: on the ES the comm.-latency component dominates.]

Page 70: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 70

Communication Overhead: Hitachi SR11000, IBM SP3, etc.

[Diagram: the memory-copy and comm.-bandwidth components dominate.]

Page 71: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 71

Communication Overhead= Synchronization Overhead

Page 72: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 72

Communication Overhead = Synchronization Overhead

[Diagram: components — memory copy, comm. latency, comm. bandwidth.]

Page 73: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 73

Communication Overhead = Synchronization Overhead

[Diagram: on the Earth Simulator, comm. latency is the dominant component.]

Page 74: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 74

Communication Overhead, Weak Scaling: Earth Simulator

[Plot: communication overhead (sec., 0.00-0.06) vs. PE# (10-10,000).
 ●○ 3×50^3 DOF/PE, ▲△ 3×32^3 DOF/PE; ●▲ Flat-MPI, ○△ Hybrid.]

The effect of message size is small; the effect of latency is large. Memory copy is very fast.

Page 75: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 75

Communication Overhead, Weak Scaling: IBM SP-3

[Plot: communication overhead (sec., 0.00-0.40) vs. PE# (10-10,000).
 ●○ 3×50^3 DOF/PE, ▲△ 3×32^3 DOF/PE; ●▲ Flat-MPI, ○△ Hybrid.]

The effect of message size is more significant.

Page 76: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 76

Communication Overhead, Weak Scaling: Hitachi SR11000/J2 (8 cores/node)

[Plot: communication overhead (sec., 0.00-0.10) vs. cores (10-1,000).
 ●○ 3×50^3 DOF/PE, ▲△ 3×32^3 DOF/PE; ●▲ Flat-MPI, ○△ Hybrid.]

Page 77: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 77

Summary

• Hybrid parallel programming model on an SMP cluster architecture with vector processors, for unstructured grids.
• Nice parallel performance both across and within SMP nodes on the ES: 3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak) in a 3D linear-elastic problem using the BIC(0)-CG method.
  – N. Kushida (student of Prof. Okuda) attained >10 TFLOPS using 512 nodes for a >3G DOF problem.
• Re-ordering is really required.

Page 78: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 78

Summary (cont.)

• Hybrid vs. Flat MPI
  – Flat-MPI is better for a small number of SMP nodes.
  – Hybrid is better for a large number of SMP nodes, especially when the problem size is rather small.
  – Flat MPI: communication; Hybrid: memory.
  – It depends on the application, problem size, etc.
  – Hybrid is much more sensitive to the number of colors than Flat MPI, due to the synchronization overhead of OpenMP.
    • In Mat-Vec operations the difference is not so significant.

Page 79: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 79

• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works

Page 80: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 80

"CUBE" Benchmark

• 3D linear elastic applications on cubes for a wide range of problem sizes.
• Hardware (single CPU):
  – Earth Simulator
  – AMD Opteron (1.8 GHz)

[Figure: (Nx-1)×(Ny-1)×(Nz-1) cube; Uz=0 @ z=Zmin, Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, uniform distributed force in z-direction @ z=Zmin.]

Page 81: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 81

Time for 3×64^3 = 786,432 DOF

                 ES (8.0 GFLOPS peak)              Opteron 1.8 GHz (3.6 GFLOPS peak)
                 DJDS original    DCRS             DJDS original    DCRS
                 sec. (MFLOPS)    sec. (MFLOPS)    sec. (MFLOPS)    sec. (MFLOPS)
Matrix           34.2 (240)       28.6 (291)       12.4 (663)       10.2 (818)
Solver           21.7 (3246)      360 (171)        271 (260)        225 (275)
Total            55.9             389              283              235

Page 82: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 82

Matrix + Solver

[Bar charts: time (sec.) vs. DOF (41,472 - 786,432), split into Matrix and Solver parts, for DJDS (original) on the ES and on the Opteron.]

Page 83: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 83

Computation Time vs. Problem Size

[Plots: Matrix, Solver, and Total time (sec.) vs. DOF (41,472 - 786,432) for ES (DJDS original), Opteron (DJDS original), and Opteron (DCRS).]

Page 84: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 84

The matrix assembling/formation part is rather expensive

• This part should also be optimized for vector processors.
• For example, in nonlinear simulations such as elasto-plastic solid simulations or fully coupled Navier-Stokes flow simulations, the matrices must be updated at every nonlinear iteration.
• This part strongly depends on the application/physics, so it is very difficult to develop general libraries for it, such as those for iterative linear solvers.
  – It also includes complicated processes which are difficult to vectorize.

Page 85: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 85

Typical Procedure for Calculating the Coefficient Matrix in FEM

• Apply Galerkin's method on each element.
• Integrate over each element to get the element matrix.
• Element matrices are accumulated to each node, and the global matrices are obtained => global linear equations.
• Matrix assembling/formation is an embarrassingly parallel procedure due to its element-by-element nature.
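As a concrete (and deliberately tiny) illustration of this element-by-element accumulation, here is a 1D Poisson assembly into a dense global matrix; it is not the GeoFEM code, just the generic pattern of integrate-per-element, then accumulate per node:

! Tiny 1D FEM assembly sketch: linear elements on [0,1] for -u'' = f.
! Element matrix ke = (1/h) [[1,-1],[-1,1]]; dense global matrix K for clarity.
program fem_assembly_1d
  implicit none
  integer, parameter :: nelem = 4, nnode = nelem + 1
  real(8) :: K(nnode,nnode), ke(2,2), h
  integer :: icel, ie, je, nodes(2)

  h  = 1.0d0 / nelem
  K  = 0.0d0
  ke = reshape( (/ 1.d0, -1.d0, -1.d0, 1.d0 /), (/2,2/) ) / h

  do icel= 1, nelem                 ! loop over elements
    nodes = (/ icel, icel+1 /)      ! global node numbers of this element
    do ie= 1, 2                     ! accumulate the element matrix into K
      do je= 1, 2
        K(nodes(ie), nodes(je)) = K(nodes(ie), nodes(je)) + ke(ie,je)
      enddo
    enddo
  enddo

  print '(5f7.2)', (K(icel,:), icel=1,nnode)   ! interior rows read -4, 8, -4
end program fem_assembly_1d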

Page 86: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 86

Element-by-Element Operations

• Integration over each element => element matrix
• Element matrices are accumulated to each node => global matrix
• Linear equations for each node

[Figure: a structured mesh of 24 elements, numbered 1-24.]

Page 87: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 87

Element-by-Element Operations

• Integration over each element => element matrix
• Element matrices are accumulated to each node => global matrix
• Linear equations for each node

[Figure: the same mesh with its 35 nodes (numbered 1-35) overlaid on the 24 elements.]

Page 88: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 88

Element-by-Element Operations

• Integration over each element => element matrix
• Element matrices are accumulated to each node => global matrix
• Linear equations for each node

[Figure: the non-zero pattern contributed by the element matrices.]

Page 89: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 89

Element-by-Element Operations

• Integration over each element => element matrix
• Element matrices are accumulated to each node => global matrix
• Linear equations for each node

Element matrix of a 4-node element:

  [ E11  E12  E13  E14 ]
  [ E21  E22  E23  E24 ]
  [ E31  E32  E33  E34 ]
  [ E41  E42  E43  E44 ]

Page 90: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 90

Element-by-Element Operations

• Integration over each element => element matrix
• Element matrices are accumulated to each node => global matrix
• Linear equations for each node

[Figure: the 35 global nodes of the mesh.]

Page 91: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 91

Element-by-Element Operations

• Integration over each element => element matrix
• Element matrices are accumulated to each node => global matrix
• Linear equations for each node

Global linear equations (35 nodes):

  a(1,1) u1 + a(1,2) u2 + …              = f1
  a(2,1) u1 + a(2,2) u2 + …              = f2
  …        + a(3,3) u3 + …               = f3
  …        + a(33,33) u33 + …            = f33
  …  + a(34,34) u34 + a(34,35) u35       = f34
  …  + a(35,34) u34 + a(35,35) u35       = f35

Page 92: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 92

Element-by-Element Operations

• If you calculate a(23,16) and a(16,23), you have to consider the contributions of both the 13th and the 14th elements.

[Figure: elements 13 and 14 both contain the edge between nodes 16 and 23, so the corresponding entries of the global equations receive contributions from both element matrices.]

Page 93: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 93

Current Approach

[Figure: elements 13 and 14 and the surrounding nodes (15, 16, 17, 22, 23, 24, …).]

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

Page 94: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 94

Current Approach

[Figure: elements 13 and 14 with local node IDs 1-2-3-4 for each bi-linear 4-node element.]

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

Page 95: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 95

Current Approach

[Figure: elements 13 and 14 with local node IDs.]

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

• Nice for cache reuse because of its localized operations.
• Not suitable for vector processors:
  – a(16,23) and a(23,16) might not be calculated properly when the element loop is vectorized (elements sharing a node update the same entries)
  – short innermost loops
  – many "if-then-else"s

(Timing for 3×64^3 DOF, Page 81: the "Matrix" part takes 34.2 sec. at 240 MFLOPS on the ES with the original DJDS, vs. 12.4 sec. on the 1.8 GHz Opteron.)

Page 96: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 96

Inside the loop: integration at the Gaussian quadrature points

do jpn= 1, 2
  do ipn= 1, 2
    coef= dabs(DETJ(ipn,jpn))*WEI(ipn)*WEI(jpn)
    PNXi= PNX(ipn,jpn,ie)
    PNYi= PNY(ipn,jpn,ie)
    PNXj= PNX(ipn,jpn,je)
    PNYj= PNY(ipn,jpn,je)
    a11= a11 + (valX*PNXi*PNXj + valB*PNYi*PNYj)*coef
    a22= a22 + (valX*PNYi*PNYj + valB*PNXi*PNXj)*coef
    a12= a12 + (valA*PNXi*PNYj + valB*PNXj*PNYi)*coef
    a21= a21 + (valA*PNYi*PNXj + valB*PNYj*PNXi)*coef
  enddo
enddo

Page 97: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 97

Remedy

• a(16,23) and a(23,16) might not be calculated properly.
  – Color the elements: elements which do not share any nodes are put in the same color (see the sketch after this slide).

[Figure: elements 13 and 14 and the surrounding nodes.]

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo
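A minimal, quadratic-time sketch of such element coloring — purely illustrative names and a brute-force neighbor test; a production code would use a node-to-element table instead:

! Greedy element coloring: two elements that share at least one node never
! get the same color, so the elements of one color can be processed
! concurrently / vectorized without conflicting matrix updates.
subroutine color_elements(nelem, ien, color, ncolor)
  implicit none
  integer, intent(in)  :: nelem, ien(4,nelem)   ! 4 global node numbers per element
  integer, intent(out) :: color(nelem), ncolor
  logical :: used(nelem)
  integer :: icel, jcel, ie, je, c

  color  = 0
  ncolor = 0
  do icel= 1, nelem
    used = .false.
    do jcel= 1, icel-1                          ! brute-force O(nelem**2) check
      do ie= 1, 4
        do je= 1, 4
          if (ien(ie,icel) == ien(je,jcel)) used(color(jcel)) = .true.
        enddo
      enddo
    enddo
    c = 1
    do while (used(c))                          ! smallest color not used by a neighbor
      c = c + 1
    enddo
    color(icel)= c
    ncolor     = max(ncolor, c)
  enddo
end subroutine color_elements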

Page 98: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 98

Coloring of Elements

[Figure: the mesh with its 35 nodes, before element coloring.]

Page 99: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 99

Coloring of Elements

[Figure: elements sharing the 16th node are assigned to different colors.]

Page 100: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 100

Remedy

• a(16,23) and a(23,16) might not be calculated properly.
  – Color the elements: elements which do not share any nodes are put in the same color.
• Short innermost loops
  – loop exchange

[Figure: elements 13 and 14 and the surrounding nodes.]

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

Page 101: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 101

Remedy

• a(16,23) and a(23,16) might not be calculated properly.
  – Color the elements: elements which do not share any nodes are put in the same color.
• Short innermost loops
  – loop exchange
• There are many "if-then-else"s
  – define an ELEMENT-to-MATRIX array

[Figure: elements 13 and 14 and the surrounding nodes.]

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

Page 102: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 102

Define the ELEMENT-to-MATRIX array

[Figure: elements 13 and 14 (local node IDs ①-④ of each bi-linear 4-node element) and the surrounding global nodes; the global node pair (16, 23) appears in element 13 as local pair (2,3) and in element 14 as local pair (1,4).]

ELEMmat(icel, ie, je)

if kkU = index_U(16-1+k) and item_U(kkU) = 23 then
  ELEMmat(13,2,3)= +kkU
  ELEMmat(14,1,4)= +kkU
endif

if kkL = index_L(23-1+k) and item_L(kkL) = 16 then
  ELEMmat(13,3,2)= -kkL
  ELEMmat(14,4,1)= -kkL
endif

Page 103: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 103

Define the ELEMENT-to-MATRIX array

"ELEMmat" maps each local node pair (ie, je) of each element to the address of the corresponding entry of the global coefficient matrix (a positive value points into the upper triangle, a negative value into the lower triangle).

if kkU = index_U(16-1+k) and item_U(kkU) = 23 then
  ELEMmat(13,2,3)= +kkU
  ELEMmat(14,1,4)= +kkU
endif

if kkL = index_L(23-1+k) and item_L(kkL) = 16 then
  ELEMmat(13,3,2)= -kkL
  ELEMmat(14,4,1)= -kkL
endif
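A generic sketch of how such a table can be built once from the CRS index/item arrays; the names ien and build_elemmat are illustrative, and the slide's if-blocks correspond to the special case icel = 13, 14 with the node pair (16, 23):

! Build ELEMmat(icel,ie,je): signed address of the global matrix entry that
! the (ie,je) pair of element icel accumulates into. +k -> k-th entry of the
! upper triangle (item_U), -k -> k-th entry of the lower triangle (item_L),
! 0 -> diagonal.
subroutine build_elemmat(nelem, nnode, ien, index_U, item_U, index_L, item_L, ELEMmat)
  implicit none
  integer, intent(in)  :: nelem, nnode, ien(4,nelem)
  integer, intent(in)  :: index_U(0:nnode), item_U(*)
  integer, intent(in)  :: index_L(0:nnode), item_L(*)
  integer, intent(out) :: ELEMmat(nelem,4,4)
  integer :: icel, ie, je, i, j, k

  ELEMmat = 0
  do icel= 1, nelem
    do ie= 1, 4
      do je= 1, 4
        i= ien(ie,icel)
        j= ien(je,icel)
        if (j > i) then                        ! search the upper triangle of row i
          do k= index_U(i-1)+1, index_U(i)
            if (item_U(k) == j) ELEMmat(icel,ie,je)= +k
          enddo
        else if (j < i) then                   ! search the lower triangle of row i
          do k= index_L(i-1)+1, index_L(i)
            if (item_L(k) == j) ELEMmat(icel,ie,je)= -k
          enddo
        endif                                  ! j == i: diagonal, left as 0
      enddo
    enddo
  enddo
end subroutine build_elemmat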

Page 104: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 104

Optimized Procedure

do icol= 1, NCOLOR_E_tot
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - define the "ELEMmat" array
      enddo
    enddo
  enddo
enddo

do icol= 1, NCOLOR_E_tot
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - assemble element matrix
      enddo
    enddo
  enddo

  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - accumulate element matrix into global matrix
      enddo
    enddo
  enddo
enddo

Extra storage for:
• the ELEMmat array
• element-matrix components of the elements in each color
→ < 10% increase

Extra computation for:
• ELEMmat

Page 105: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 105

Optimized Procedure

(Same three loop nests as on the previous page.)

PART I — "integer" operations to define "ELEMmat".
In nonlinear cases this part is done just once (before the initial iteration), as long as the mesh connectivity does not change.

Page 106: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 106

Optimized Procedure

(Same three loop nests as on the previous page.)

PART I — "integer" operations to define "ELEMmat": done just once in nonlinear cases (before the initial iteration), as long as the mesh connectivity does not change.

PART II — "floating-point" operations for matrix assembling/accumulation: repeated at every nonlinear iteration.

Page 107: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 107

Time for 3×64^3 = 786,432 DOF

                 ES (8.0 GFLOPS peak)                               Opteron 1.8 GHz (3.6 GFLOPS peak)
                 DJDS original   DJDS improved   DCRS               DJDS original   DJDS improved   DCRS
                 sec. (MFLOPS)   sec. (MFLOPS)   sec. (MFLOPS)      sec. (MFLOPS)   sec. (MFLOPS)   sec. (MFLOPS)
Matrix           34.2 (240)      12.5 (643)      28.6 (291)         12.4 (663)      21.2 (381)      10.2 (818)
Solver           21.7 (3246)     21.7 (3246)     360 (171)          271 (260)       271 (260)       225 (275)
Total            55.9            34.2            389                283             292             235

Page 108: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 108

Time for 3×64^3 = 786,432 DOF

(Same table as on the previous page.)

On the Opteron, the improved DJDS "Matrix" part (21.2 sec.) is slower than the original (12.4 sec.) because of the long innermost loops: data locality has been lost.

Page 109: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 109

Matrix + Solver

[Bar charts: time (sec.) vs. DOF (41,472 - 786,432), split into Matrix and Solver parts, for the original and the improved DJDS on the ES and on the Opteron.]

Page 110: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 110

Computation Time vs. Problem Size

[Plots: Matrix, Solver, and Total time (sec.) vs. DOF (41,472 - 786,432) for ES (DJDS improved), Opteron (DJDS improved), and Opteron (DCRS).]

Page 111: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 111

"Matrix" computation time for the improved version of DJDS

[Bar charts: time (sec., 0-25) vs. DOF (41,472 - 786,432), split into the "integer" (ELEMmat) and "floating-point" parts, on the ES and on the Opteron.]

Page 112: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 112

Optimization of “Matrix” assembling/formation on ES

• DJDS has been much improved compared to the original one, but it’s still slower than DCRS version on Opteron.

• “Integer” operation part is slower.• But, “floating” operation is much faster than Opteron.

• In nonlinear simulations, “integer” operation is executed only once (just before initial iteration), therefore, ES outperforms Opteron if the number of nonlinear iterations is more than 2.

Page 113: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 113

Suppose a "virtual" mode where …

• the "integer" operation part runs on a scalar processor, and
• the "floating-point" operation part and the linear solvers run on the vector processor.

(The scalar performance of the ES processor (500 MHz) is lower than that of a Pentium III.)

Page 114: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 114

Time for 3×64^3 = 786,432 DOF

                 ES (8.0 GFLOPS peak)                               Opteron 1.8 GHz (3.6 GFLOPS peak)
                 DJDS virtual    DJDS improved   DCRS               DJDS improved   DCRS
                 sec. (MFLOPS)   sec. (MFLOPS)   sec. (MFLOPS)      sec. (MFLOPS)   sec. (MFLOPS)
Matrix           1.88 (4431)     12.5 (643)      28.6 (291)         21.2 (381)      10.2 (818)
Solver           21.7 (3246)     21.7 (3246)     360 (171)          271 (260)       225 (275)
Total            23.6            34.2            389                292             235

Page 115: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 115

Summary: Vectorization of FEM Applications

• NOT so easy.
• FEM's good feature of local operations is not necessarily suitable for vector processors.
  – Preconditioned iterative solvers can be vectorized rather more easily, because their target is the "global" matrix.
• Sometimes a major revision of the original code is required.
  – Usually more memory, more lines, additional operations …
• Performance of code optimized for vector processors is not necessarily good on scalar processors (e.g. matrix assembling in FEM).