
Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator


Page 1: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

Kengo Nakajima
[email protected]

Supercomputing Division, Information Technology Center, The University of Tokyo
Japan Agency for Marine-Earth Science and Technology

April 22, 2008

Page 2: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 2

Overview

• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works

Page 3: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 3

Large-Scale Computing

• Scalar Processor
  – Big gap between clock rate and memory bandwidth
  – Very low sustained/peak performance ratio (<10%)

Page 4: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 4

Scalar Processors: CPU-Cache-Memory Hierarchical Structure

[Diagram: register and cache sit next to the CPU — fast, small capacity (MB), expensive (100M+ transistors); main memory is large capacity (GB) and cheap, but slow.]

Page 5: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 5

Large-Scale Computing

• Scalar Processor
  – Big gap between clock rate and memory bandwidth
  – Very low sustained/peak performance ratio (<10%)
• Vector Processor
  – Very high sustained/peak performance ratio (e.g. 35% for FEM applications on the Earth Simulator)
  – Requires very special tuning and sufficiently long loops (= large-scale problem size) for such performance
  – Suitable for simple computation

Page 6: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 6

Vector Processors: Vector Register & Fast Memory

[Diagram: vector processor with vector registers connected to main memory through a very fast path.]

• Parallel processing of simple DO loops
• Suitable for simple & large computation

do i= 1, N
  A(i)= B(i) + C(i)
enddo

Page 7: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 7

Large-Scale Computing

• Scalar Processor
  – Big gap between clock rate and memory bandwidth
  – Very low sustained/peak performance ratio (<10%)
• Vector Processor
  – Very high sustained/peak performance ratio (e.g. 35% for FEM applications on the Earth Simulator)
  – Requires very special tuning and sufficiently long loops (= large-scale problem size) for such performance
  – Suitable for simple computation

Page 8: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 8

Earth Simulator

                                   Earth        Hitachi       Hitachi         IBM SP3
                                   Simulator    SR8000/MPP    SR11000/J2      (LBNL)
                                                (U.Tokyo)     (U.Tokyo)
Peak Performance (GFLOPS)          8.00         1.80          9.20            1.50
Measured Memory BW, STREAM
  (GB/sec/PE)                      26.6         2.85          8.00            0.623
Estimated Performance/Core,
  GFLOPS (% of peak)               2.31-3.24    .291-.347     .880-.973       .072-.076
                                   (28.8-40.5)  (16.1-19.3)   (9.6-10.6)      (4.8-5.1)
Measured Performance/Core,
  GFLOPS (% of peak)               2.93 (36.6)  .335 (18.6)   1.34 (14.5)     .122 (8.1)
BYTE/FLOP                          3.325        1.583         0.870           0.413
Comm. BW (GB/sec/Node)             12.3         1.60          12.0            1.00
MPI Latency (μsec)                 -            6-20          4.7 (*5.6-7.7)  16.3

* IBM p595, J.T. Carter et al.

Page 9: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 9

Typical Behavior …

• Earth Simulator: performance is good for large-scale problems due to long vector length (up to ~40% of peak).
• IBM-SP3: performance is good for small problems due to the cache effect (~8% of peak).

[Plots: GFLOPS vs. DOF (problem size, 1.0E+04 - 1.0E+07) for each system.]

Page 10: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 10

Parallel Computing: Strong Scaling (Fixed Problem Size)

[Plots: performance vs. PE#, with the ideal linear speed-up shown for reference.]

• Earth Simulator: performance decreases for many PEs due to communication overhead and small vector length.
• IBM-SP3: super-linear speed-up (cache effect) for a small number of PEs; performance decreases for many PEs due to communication overhead.

Page 11: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 11

Improvement of Memory Performance: IBM SP3 ⇒ Hitachi SR11000/J2

[Plots: GFLOPS vs. DOF; ■ Flat-MPI/DCRS, □ Hybrid/DCRS.]

• IBM SP-3 (POWER3): 375 MHz, 1.0 GB/sec memory BW, 8 MB L2 cache/PE
• Hitachi SR11000/J2 (POWER5+): 2.3 GHz, 8.0 GB/sec memory BW, 18 MB L3 cache/PE

Memory performance (bandwidth, latency, etc.) makes the difference.

Page 12: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 12

My History with Vector Computers

• Cray-1S (1985-1988): Mitsubishi Research Institute (MRI)
• Cray-YMP (1988-1995): MRI, University of Texas at Austin
• Fujitsu VP, VPP series (1985-1999): JAERI, PNC
• NEC SX-4 (1997-2001): CCSE/JAERI
• Hitachi SR2201 (1997-2004): University of Tokyo, CCSE/JAERI
• Hitachi SR8000 (2000-2007): University of Tokyo
• Earth Simulator (2002-)

Page 13: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 13

• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works

Page 14: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 14

GeoFEM: FY.1998-2002
http://geofem.tokyo.rist.or.jp/

• Parallel FEM platform for solid earth simulation
  – parallel I/O, parallel linear solvers, parallel visualization
  – solid earth: earthquake, plate deformation, mantle/core convection, etc.
• Part of a national project by STA/MEXT for large-scale earth science simulations using the Earth Simulator
• Strong collaboration between the HPC and natural science (solid earth) communities

Page 15: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 15

System Configuration of GeoFEM

[Diagram: a one-domain mesh is split by the Partitioner into partitioned meshes for the PEs. The GeoFEM platform provides parallel I/O, equation solvers and a parallel visualizer (output viewed with GPPView), exposed through Solver, Comm. and Vis. interfaces. Pluggable analysis modules: structural analysis (static linear, dynamic linear, contact), fluid, wave; plus utilities.]

Page 16: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 16

Results on Solid Earth Simulation

• Magnetic field of the Earth: MHD code
• Complicated plate model around the Japan Islands
• Simulation of the earthquake generation cycle in southwestern Japan (TSUNAMI !!)
• Transportation by groundwater flow through heterogeneous porous media

[Figures: simulation snapshots (h=5.00, h=1.25; T=100, 200, 300, 400, 500).]

Page 17: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 17

Results by GeoFEM

Page 18: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 18

Features of FEM applications (1/2)

• Local "element-by-element" operations
  – sparse coefficient matrices
  – suitable for parallel computing
• HUGE "indirect" accesses
  – IRREGULAR sparse matrices
  – memory intensive

do i= 1, N
  jS= index(i-1) + 1
  jE= index(i)
  do j= jS, jE
    in= item(j)
    Y(i)= Y(i) + AMAT(j)*X(in)
  enddo
enddo
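For reference, a minimal self-contained sketch of this CRS-style sparse matrix-vector product; the 4×4 tridiagonal matrix and its values are made up purely for illustration, while the array names (index, item, AMAT) follow the slide:

! Minimal, self-contained CRS matrix-vector product (illustrative data).
program crs_matvec
  implicit none
  integer, parameter :: N = 4, NNZ = 10
  integer :: index(0:N)      ! index(i): position of the last nonzero of row i
  integer :: item(NNZ)       ! column numbers of the nonzeros
  real(8) :: AMAT(NNZ)       ! nonzero values, stored row by row
  real(8) :: X(N), Y(N)
  integer :: i, j, jS, jE, in

  index = (/ 0, 2, 5, 8, 10 /)
  item  = (/ 1,2,  1,2,3,  2,3,4,  3,4 /)
  AMAT  = (/ 2.d0,-1.d0, -1.d0,2.d0,-1.d0, -1.d0,2.d0,-1.d0, -1.d0,2.d0 /)
  X     = (/ 1.d0, 2.d0, 3.d0, 4.d0 /)
  Y     = 0.d0

  do i= 1, N
    jS= index(i-1) + 1
    jE= index(i)
    do j= jS, jE
      in= item(j)
      Y(i)= Y(i) + AMAT(j)*X(in)
    enddo
  enddo

  print '(a,4f8.2)', ' Y =', Y    ! expected: 0.00 0.00 0.00 5.00
end program crs_matvec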

Page 19: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 19

Features of FEM applications (2/2)

• In parallel computation …
  – communication with ONLY neighbors (except "dot products" etc.)
  – the amount of data in the messages is relatively small, because only values on the domain boundary are exchanged
  – communication (MPI) latency is critical

Page 20: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 20

• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works

Page 21: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 21

Earth Simulator (ES)
http://www.es.jamstec.go.jp/

• 640×8 = 5,120 vector processors
  – SMP cluster-type architecture
  – 8 GFLOPS/PE
  – 64 GFLOPS/node
  – 40 TFLOPS for the whole ES
• 16 GB memory/node, 10 TB for the whole ES
• 640×640 crossbar network
  – 16 GB/sec × 2
• Memory bandwidth: 32 GB/sec
• 35.6 TFLOPS for LINPACK (March 2002)
• 26 TFLOPS for AFES (climate simulation)

Page 22: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 22

Motivations

• GeoFEM Project (FY.1998-2002)
• FEM-type applications with complicated unstructured grids (not LINPACK, FDM …) on the Earth Simulator (ES)
  – Implicit linear solvers
  – Hybrid vs. Flat MPI parallel programming models

Page 23: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 23

Flat MPI vs. Hybrid

[Diagram: SMP nodes, each with several PEs sharing one memory.]

• Hybrid: hierarchical structure — the PEs within each SMP node work on shared memory.
• Flat-MPI: each PE is treated as an independent process.

Page 24: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 24

• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works

Page 25: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 25

Direct/Iterative Methods for Linear Equations

• Direct methods
  – Gaussian elimination / LU factorization
    • compute A^-1 directly
  – robust for a wide range of applications
  – more expensive than iterative methods (memory, CPU)
  – not suitable for parallel and vector computation due to their global operations
• Iterative methods
  – CG, GMRES, BiCGSTAB
  – less expensive than direct methods, especially in memory
  – suitable for parallel and vector computing
  – convergence strongly depends on the problem and boundary conditions (condition number etc.)
    • preconditioning is required

Page 26: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 26

Preconditioning for Iterative Methods

• The convergence rate of iterative solvers strongly depends on the spectral properties (eigenvalue distribution) of the coefficient matrix A.
• A preconditioner M transforms the linear system into one with more favorable spectral properties.
  – In "ill-conditioned" problems the "condition number" (ratio of max/min eigenvalue if A is symmetric) is large.
  – M transforms the original equation Ax=b into A'x=b', where A'=M^-1 A and b'=M^-1 b.
• ILU (Incomplete LU Factorization) and IC (Incomplete Cholesky Factorization) are well-known preconditioners.
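For reference (standard textbook form, not specific to this talk), the preconditioned CG iteration shows where M enters: one preconditioner solve Mz = r per iteration.

\begin{align*}
& r_0 = b - A x_0, \quad z_0 = M^{-1} r_0, \quad p_0 = z_0 \\
& \text{for } k = 0, 1, 2, \dots \\
& \quad \alpha_k = (r_k, z_k) / (p_k, A p_k) \\
& \quad x_{k+1} = x_k + \alpha_k p_k, \qquad r_{k+1} = r_k - \alpha_k A p_k \\
& \quad z_{k+1} = M^{-1} r_{k+1} \\
& \quad \beta_k = (r_{k+1}, z_{k+1}) / (r_k, z_k), \qquad p_{k+1} = z_{k+1} + \beta_k p_k
\end{align*}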

Page 27: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 27

Strategy in GeoFEM

• Iterative methods are the ONLY choice for large-scale parallel computing.
• Preconditioning is important:
  – general methods, such as ILU(0)/IC(0), cover a wide range of applications
  – problem-specific methods

Page 28: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 28

• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works

Page 29: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 29

Block IC(0)-CG Solver on the Earth Simulator

• 3D linear elastic problems (SPD)
• Parallel iterative linear solver
  – node-based local data structure
  – Conjugate Gradient method (CG): SPD
  – localized Block IC(0) preconditioning (Block Jacobi)
    • modified IC(0): off-diagonal components of the original [A] are kept
  – Additive Schwarz Domain Decomposition (ASDD)
  – http://geofem.tokyo.rist.or.jp/
• Hybrid parallel programming model
  – OpenMP + MPI
  – re-ordering for vector/parallel performance
  – comparison with Flat MPI

Page 30: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 30

Flat MPI vs. Hybrid

[Diagram: SMP nodes, each with several PEs sharing one memory.]

• Hybrid: hierarchical structure — the PEs within each SMP node work on shared memory.
• Flat-MPI: each PE is treated as an independent process.

Page 31: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 31

Local Data Structure: Node-based Partitioning

internal nodes - elements - external nodes

[Figure: a 5×5 node mesh partitioned into four domains (PE#0 - PE#3); each domain stores its internal nodes, the elements touching them, and the external (halo) nodes owned by neighboring domains.]

Page 32: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 32

1 SMP node => 1 domain for the Hybrid programming model
MPI communication among domains

[Figure: four SMP nodes (Node-0 - Node-3), each with 8 PEs sharing one memory; MPI messages are exchanged between nodes.]

Page 33: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 33

Basic Strategy for Parallel Programming on the Earth Simulator

• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.

Page 34: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 34

ILU(0)/IC(0) Factorization

do i= 2, n
  do k= 1, i-1
    if ((i,k) ∈ NonZero(A)) then
      a(i,k)= a(i,k) / a(k,k)
    endif
    do j= k+1, n
      if ((i,j) ∈ NonZero(A)) then
        a(i,j)= a(i,j) - a(i,k)*a(k,j)
      endif
    enddo
  enddo
enddo
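A compilable rendering of this sketch, assuming a dense array a(n,n) with a logical mask nz(i,j) for the nonzero pattern of A; this is an illustration only — real implementations work on the sparse structure directly:

! ILU(0)/IC(0)-style factorization restricted to the original nonzero
! pattern nz(i,j); entries outside the pattern (the "fill-in") are dropped.
subroutine ilu0_pattern(n, a, nz)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(inout) :: a(n,n)
  logical, intent(in)    :: nz(n,n)
  integer :: i, j, k

  do i= 2, n
    do k= 1, i-1
      if (nz(i,k)) then                ! a(i,k) is zero otherwise, so skipping is equivalent
        a(i,k)= a(i,k) / a(k,k)
        do j= k+1, n
          if (nz(i,j)) a(i,j)= a(i,j) - a(i,k)*a(k,j)
        enddo
      endif
    enddo
  enddo
end subroutine ilu0_pattern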

Page 35: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 35

ILU/IC Preconditioning

M = (L+D) D^-1 (D+U)        (L, D, U taken from A)

Mz = r   … need to solve this equation

Forward substitution:
  (L+D) z1 = r  :  z1 = D^-1 (r - L z1)

Backward substitution:
  (I + D^-1 U) z_new = z1  :  z = z - D^-1 U z

Page 36: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 36

ILU/IC Preconditioning

M = (L+D) D^-1 (D+U)        (L, D, U taken from A)

Forward substitution  (L+D)z= r : z= D^-1(r-Lz)

do i= 1, N
  WVAL= R(i)
  do j= 1, INL(i)
    WVAL= WVAL - AL(i,j) * Z(IAL(i,j))
  enddo
  Z(i)= WVAL / D(i)
enddo

Backward substitution  (I+ D^-1 U)z_new= z_old : z= z - D^-1 U z

do i= N, 1, -1
  SW= 0.0d0
  do j= 1, INU(i)
    SW= SW + AU(i,j) * Z(IAU(i,j))
  enddo
  Z(i)= Z(i) - SW / D(i)
enddo

Dependency: you need the most recent value of "z" of connected nodes, so vectorization/parallelization is difficult.

Page 37: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 37

Basic Strategy for Parallel Programming on the Earth Simulator

• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
• Re-ordering for highly parallel/vector performance
  – local operation and no global dependency
  – continuous memory access
  – sufficiently long loops for vectorization

Page 38: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 38

ILU/IC Preconditioning

(Same forward/backward substitution loops as on the previous page.)

Dependency: you need the most recent value of "z" of connected nodes, so vectorization/parallelization is difficult.

Reordering: directly connected nodes do not appear in the RHS.

Page 39: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 39

Basic Strategy for Parallel Programming on the Earth Simulator

• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
• Re-ordering for highly parallel/vector performance
  – local operation and no global dependency
  – continuous memory access
  – sufficiently long loops for vectorization
• 3-way parallelism for the hybrid parallel programming model (see the sketch below)
  – inter-node: MPI
  – intra-node: OpenMP
  – individual PE: vectorization
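A minimal, generic sketch (not GeoFEM code) of this 3-way structure — MPI between nodes, OpenMP within a node, and the innermost loop left to the compiler's vectorizer; the example simply forms a global dot product of a vector of ones:

! Minimal hybrid MPI + OpenMP sketch of the 3-way parallelism above.
program hybrid_sketch
  use mpi
  implicit none
  integer, parameter :: n = 100000
  integer :: ierr, rank, i
  real(8) :: a(n), s

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  a = 1.0d0
  s = 0.0d0
!$omp parallel do reduction(+:s)          ! intra-node: OpenMP threads
  do i= 1, n                              ! innermost loop: vectorized on each PE
    s = s + a(i)*a(i)
  enddo

  call MPI_Allreduce(MPI_IN_PLACE, s, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)   ! inter-node: MPI
  if (rank == 0) print *, 'global dot product =', s
  call MPI_Finalize(ierr)
end program hybrid_sketch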

Page 40: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 40

Re-Ordering Technique for Vector/Parallel Architectures

Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama (NEC))

1. RCM (Reverse Cuthill-McKee)
2. CMC (Cyclic Multicolor)
3. DJDS re-ordering (for vector performance)
4. Cyclic DJDS for SMP units (for SMP parallelism)

These processes can be substituted by traditional multi-coloring (MC).

Page 41: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 41

Reordering = Coloring

• COLOR: unit of independent sets.
• Elements grouped in the same "color" are independent from each other, thus parallel/vector operation is possible.
• Many colors provide faster convergence, but shorter vector length: trade-off !!

[Figure: Red-Black (2 colors), 4 colors, and RCM (Reverse Cuthill-McKee) orderings of a small grid.]
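A minimal sketch of the greedy coloring idea on a CRS-style adjacency graph (the array names index/item follow the earlier matrix-vector slide); this is a generic illustration, not the MC/CM-RCM code actually used in GeoFEM:

! Greedy multi-coloring: nodes that are connected in the graph never share
! a color, so each color forms an independent set.
subroutine greedy_coloring(N, index, item, color, ncolor)
  implicit none
  integer, intent(in)  :: N, index(0:N), item(*)
  integer, intent(out) :: color(N), ncolor
  logical :: used(N)
  integer :: i, j, c

  color  = 0
  ncolor = 0
  do i= 1, N
    used = .false.
    do j= index(i-1)+1, index(i)                 ! colors already taken by neighbors
      if (color(item(j)) > 0) used(color(item(j))) = .true.
    enddo
    c = 1
    do while (used(c))                           ! smallest free color
      c = c + 1
    enddo
    color(i)= c
    ncolor  = max(ncolor, c)
  enddo
end subroutine greedy_coloring

Calling it with the adjacency of the mesh graph gives the independent sets; fewer colors mean longer vectors, more colors usually mean faster convergence, which is exactly the trade-off stated above.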

Page 42: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 42

Large-Scale Sparse Matrix Storage for Unstructured Grids

• 1D storage (CRS): memory saved, but short vector length.
• 2D storage: long vector length, but many ZEROs (padding).

Page 43: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 43

Re-Ordering within Each Color According to the Number of Non-Zero Off-Diagonal Components

Elements in the same color are independent, therefore intra-hyperplane re-ordering does not affect the results.
DJDS: Descending-order Jagged Diagonal Storage
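A sketch of the row permutation behind DJDS for one color: rows are sorted by descending off-diagonal count. The names NZ and perm are illustrative, and insertion sort is used only for brevity:

! Sort the rows of one color so that their off-diagonal counts descend.
subroutine djds_permute(nrow, NZ, perm)
  implicit none
  integer, intent(in)  :: nrow, NZ(nrow)   ! NZ(i): off-diagonal count of row i
  integer, intent(out) :: perm(nrow)       ! perm(k): original row placed k-th
  integer :: i, k, key

  do i= 1, nrow
    perm(i)= i
  enddo
  do i= 2, nrow                            ! stable insertion sort, descending
    key= perm(i)
    k = i - 1
    do while (k >= 1)
      if (NZ(perm(k)) >= NZ(key)) exit
      perm(k+1)= perm(k)
      k = k - 1
    enddo
    perm(k+1)= key
  enddo
end subroutine djds_permute

After this permutation the j-th jagged-diagonal sweep runs over the leading rows that still have a j-th off-diagonal, which is what gives the long, dense innermost loops.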

Page 44: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 44

Cyclic DJDS (MC/CM-RCM): Cyclic Re-Ordering for SMP Units
Load-balancing among PEs

npLX1= NLmax * PEsmpTOT
INL(0:NLmax*PEsmpTOT*NCOLORS)

do iv= 1, NCOLORS
!$omp parallel do
  do ip= 1, PEsmpTOT
    iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
    do j= 1, NLhyp(iv)
      iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
      iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!cdir nodep
      do i= iv0+1, iv0+iE-iS
        k = i+iS-iv0
        kk= IAL(k)
        (important computations)
      enddo
    enddo
  enddo
enddo

Page 45: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 45

Difference between Flat MPI & Hybrid

• Most of the re-ordering effort is for vectorization.
• If you have a long vector, just divide it and distribute the segments to the PEs of the SMP node.
• The source codes of Hybrid and Flat MPI are not so different.
  – Flat MPI corresponds to Hybrid with 1 PE per SMP node.
  – In other words, the Flat MPI code is already sufficiently complicated.

Page 46: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 46

Cyclic DJDS (MC/CM-RCM) for Forward/Backward Substitution in BILU Factorization

do iv= 1, NCOLORS
!$omp parallel do private (iv0,j,iS,iE, ... etc.)     ! SMP parallel
  do ip= 1, PEsmpTOT
    iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
    do j= 1, NLhyp(iv)
      iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
      iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!CDIR NODEP
      do i= iv0+1, iv0+iE-iS                          ! vectorized
        k = i+iS-iv0
        kk= IAL(k)
        X(i)= X(i) - A(k)*X(kk)*DINV(i)   ! etc.
      enddo
    enddo
  enddo
enddo

Page 47: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 47

Simple 3D Cubic Model

[Figure: (Nx-1)×(Ny-1)×(Nz-1) cube of elements.]

• Uz=0 @ z=Zmin
• Ux=0 @ x=Xmin
• Uy=0 @ y=Ymin
• Uniform distributed force in z-direction @ z=Zmin

Page 48: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 48

Effect of Ordering

Page 49: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 49

Effect of Re-Ordering

• PDJDS/CM-RCM: long loops, continuous access
• PDCRS/CM-RCM: short innermost loops, continuous access
• CRS, no re-ordering: short loops, irregular access

Page 50: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 50

Matrix Storage, Loops

• DJDS (Descending-order Jagged Diagonal Storage) with long innermost loops is suitable for vector processors.
• The reduction-type loop of DCRS is more suitable for cache-based scalar processors because of its localized operation.

DCRS:
do i= 1, N
  SW= WW(i,Z)
  isL= index_L(i-1)+1
  ieL= index_L(i)
  do j= isL, ieL
    k= item_L(j)
    SW= SW - AL(j)*Z(k)
  enddo
  Z(i)= SW/DD(i)
enddo

DJDS:
do iv= 1, NVECT
  iv0= STACKmc(iv-1)
  do j= 1, NLhyp(iv)
    iS= index_L(NL*(iv-1)+j-1)
    iE= index_L(NL*(iv-1)+j  )
    do i= iv0+1, iv0+iE-iS
      k = i+iS-iv0
      kk= item_L(k)
      Z(i)= Z(i) - AL(k)*Z(kk)
    enddo
  enddo
  iS= STACKmc(iv-1)+1
  iE= STACKmc(iv  )
  do i= iS, iE
    Z(i)= Z(i)/DD(i)
  enddo
enddo

Page 51: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 51

Effect of Re-Ordering: Results on 1 SMP Node

Color #: 99 (fixed). Re-ordering is REALLY required !!!

[Plot: GFLOPS vs. DOF (1.E+04 - 1.E+07), log scale.
 ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]

• Effect of vector length: ×10
• + re-ordering: ×100
• 22 GFLOPS, 34% of peak (ideal performance for a single CPU: 40%-45%)

Page 52: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 52

Effect of Re-Ordering: Results on 1 SMP Node

Color #: 99 (fixed). Re-ordering is REALLY required !!!

[Plot: GFLOPS vs. DOF, as on the previous page.]

80×80×80 case (1.5M DOF):
• PDJDS/CM-RCM: 212 iterations, 11.2 sec.
• PDCRS/CM-RCM (short innermost loops): 212 iterations, 143.6 sec.
• CRS, no re-ordering: 203 iterations, 674.2 sec.

Page 53: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 53

3D Elastic Simulation: Problem Size vs. GFLOPS
Earth Simulator, 1 SMP node (8 PEs)

• Flat-MPI: 23.4 GFLOPS, 36.6% of peak
• Hybrid (OpenMP): 21.9 GFLOPS, 34.3% of peak

[Plots: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]

Flat-MPI is better: nice intra-node MPI.

Page 54: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 54

Earth Simulator

(Hardware comparison table repeated — see Page 8.)

Page 55: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 55

3D Elastic Simulation: Problem Size vs. GFLOPS
Hitachi SR8000/MPP with pseudo-vectorization, 1 SMP node (8 PEs)

• Flat-MPI: 2.17 GFLOPS, 15.0% of peak
• Hybrid (OpenMP): 2.68 GFLOPS, 18.6% of peak

[Plots: GFLOPS vs. DOF for ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]

Hybrid is better: low intra-node MPI performance.

Page 56: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 56

3D Elastic Simulation: Problem Size vs. GFLOPS
IBM SP-3 (NERSC), 1 SMP node (8 PEs)

[Plots: GFLOPS vs. DOF for Flat-MPI and Hybrid (OpenMP); ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]

The cache is well utilized in Flat-MPI.

Page 57: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 57

3D Elastic Simulation: Problem Size vs. GFLOPS
Hitachi SR11000/J2 (U.Tokyo), 1 SMP node (8 PEs)

[Plots: GFLOPS vs. DOF (up to ~15 GFLOPS) for Flat-MPI and Hybrid (OpenMP); ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loops), ▲ CRS without re-ordering.]

Page 58: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 58

SMP node # > 10, up to 176 nodes (1,408 PEs)
The problem size for each SMP node is fixed (weak scaling). PDJDS/CM-RCM, Color #: 99.

Page 59: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 59

3D Elastic Model (Large Case)
256×128×128/SMP node, up to 2,214,592,512 DOF
●: Flat MPI, ○: Hybrid

[Plots: GFLOPS rate and parallel work ratio (%) vs. number of SMP nodes (0-192).]

3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak).

Page 60: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 60

3D Elastic Model (Small Case)
64×64×64/SMP node, up to 125,829,120 DOF
●: Flat MPI, ○: Hybrid

[Plots: GFLOPS rate and parallel work ratio (%) vs. number of SMP nodes (0-192).]

Page 61: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 61

Hybrid outperforms Flat-MPI …

• when
  – the number of SMP nodes (PEs) is large
  – the problem size per node is small
• because Flat-MPI has
  – 8 times as many communicating processes
  – TWICE as large a communication/computation ratio
• The effect of communication becomes significant when the number of SMP nodes (or PEs) is large.
• Performance estimation by D. Kerbyson (LANL), LA-UR-02-5222: relatively large communication latency of the ES.

Page 62: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 62

Flat-MPI and Hybrid

(N = number of FEM nodes in one direction of the cube assigned to one Flat-MPI process; a Hybrid process covers one SMP node = 8 PEs = a 2N×2N×2N block.)

                                            Flat MPI     Hybrid
Problem size per MPI process (DOF)          3N^3         3×8N^3
Message size per neighboring domain         3N^2         3×4N^2
Ratio of communication/computation          1/N          1/(2N)

[Figure: an N×N×N block per Flat-MPI process vs. a 2N×2N×2N block per Hybrid process.]
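The ratios in the last row follow directly from the surface and volume counts above:

\[
\text{Flat MPI: } \frac{3N^2}{3N^3}=\frac{1}{N},
\qquad
\text{Hybrid: } \frac{3\cdot 4N^2}{3\cdot 8N^3}=\frac{1}{2N}.
\]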

Page 63: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 63

Earth Simulator

(Hardware comparison table repeated — see Page 8.)

Page 64: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 64

Why Communication Overhead?

• latency of the network
• finite bandwidth of the network
• synchronization at SEND/RECV, ALLREDUCE, etc.
• memory performance in boundary communications (memory copy)

Page 65: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 65

Domain-to-Domain Communication
Exchange Boundary Information (SEND/RECV)

subroutine SOLVER_SEND_RECV (N, NEIBPETOT, NEIBPE,               &
     &     IMPORT_INDEX, IMPORT_NODE, EXPORT_INDEX, EXPORT_NODE, &
     &     WS, WR, X, SOLVER_COMM, my_rank)
  implicit REAL*8 (A-H,O-Z)
  include 'mpif.h'
  parameter (KREAL= 8)
  integer IMPORT_INDEX(0:NEIBPETOT), IMPORT_NODE(N)
  integer EXPORT_INDEX(0:NEIBPETOT), EXPORT_NODE(N)
  integer NEIBPE(NEIBPETOT)
  integer SOLVER_COMM, my_rank
  integer req1(NEIBPETOT), req2(NEIBPETOT)
  integer sta1(MPI_STATUS_SIZE, NEIBPETOT)
  integer sta2(MPI_STATUS_SIZE, NEIBPETOT)
  real(kind=KREAL) X(N), WS(N), WR(N)

! SEND
  do neib= 1, NEIBPETOT
    istart= EXPORT_INDEX(neib-1)
    inum  = EXPORT_INDEX(neib  ) - istart
    do k= istart+1, istart+inum
      WS(k)= X(EXPORT_NODE(k))
    enddo
    call MPI_ISEND (WS(istart+1), inum, MPI_DOUBLE_PRECISION,     &
     &              NEIBPE(neib), 0, SOLVER_COMM, req1(neib), ierr)
  enddo

! RECEIVE
  do neib= 1, NEIBPETOT
    istart= IMPORT_INDEX(neib-1)
    inum  = IMPORT_INDEX(neib  ) - istart
    call MPI_IRECV (WR(istart+1), inum, MPI_DOUBLE_PRECISION,     &
     &              NEIBPE(neib), 0, SOLVER_COMM, req2(neib), ierr)
  enddo

  call MPI_WAITALL (NEIBPETOT, req2, sta2, ierr)

  do neib= 1, NEIBPETOT
    istart= IMPORT_INDEX(neib-1)
    inum  = IMPORT_INDEX(neib  ) - istart
    do k= istart+1, istart+inum
      X(IMPORT_NODE(k))= WR(k)
    enddo
  enddo

  call MPI_WAITALL (NEIBPETOT, req1, sta1, ierr)

  return
end

Page 66: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 66

Domain-to-Domain Communication
Exchange Boundary Information (SEND/RECV)

(Same SOLVER_SEND_RECV routine as on the previous page; the SEND and RECEIVE phases are highlighted.)

Page 67: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 67

Communication Overhead

[Diagram: total communication overhead = memory copy + comm. bandwidth + comm. latency.]

Page 68: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 68

Communication Overhead

[Diagram: the memory-copy and comm.-bandwidth components depend on the message size; the comm.-latency component does not.]

Page 69: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 69

Communication Overhead: Earth Simulator

[Diagram: on the ES the comm.-latency component dominates.]

Page 70: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 70

Communication Overhead: Hitachi SR11000, IBM SP3, etc.

[Diagram: the memory-copy and comm.-bandwidth components dominate.]

Page 71: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 71

Communication Overhead= Synchronization Overhead

Page 72: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 72

Communication Overhead = Synchronization Overhead

[Diagram: components — memory copy, comm. latency, comm. bandwidth.]

Page 73: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 73

Communication Overhead = Synchronization Overhead

[Diagram: on the Earth Simulator, comm. latency is the dominant component.]

Page 74: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 74

Communication Overhead, Weak Scaling: Earth Simulator

[Plot: communication overhead (sec., 0.00-0.06) vs. PE# (10-10,000).
 ●○ 3×50^3 DOF/PE, ▲△ 3×32^3 DOF/PE; ●▲ Flat-MPI, ○△ Hybrid.]

The effect of message size is small; the effect of latency is large. Memory copy is very fast.

Page 75: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 75

Communication Overhead, Weak Scaling: IBM SP-3

[Plot: communication overhead (sec., 0.00-0.40) vs. PE# (10-10,000).
 ●○ 3×50^3 DOF/PE, ▲△ 3×32^3 DOF/PE; ●▲ Flat-MPI, ○△ Hybrid.]

The effect of message size is more significant.

Page 76: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 76

Communication Overhead, Weak Scaling: Hitachi SR11000/J2 (8 cores/node)

[Plot: communication overhead (sec., 0.00-0.10) vs. cores (10-1,000).
 ●○ 3×50^3 DOF/PE, ▲△ 3×32^3 DOF/PE; ●▲ Flat-MPI, ○△ Hybrid.]

Page 77: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 77

Summary

• Hybrid parallel programming model on an SMP cluster architecture with vector processors, for unstructured grids.
• Nice parallel performance both across and within SMP nodes on the ES: 3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak) in a 3D linear-elastic problem using the BIC(0)-CG method.
  – N. Kushida (student of Prof. Okuda) attained >10 TFLOPS using 512 nodes for a >3G DOF problem.
• Re-ordering is really required.

Page 78: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 78

Summary (cont.)

• Hybrid vs. Flat MPI
  – Flat-MPI is better for a small number of SMP nodes.
  – Hybrid is better for a large number of SMP nodes, especially when the problem size is rather small.
  – Flat MPI: communication; Hybrid: memory.
  – It depends on the application, problem size, etc.
  – Hybrid is much more sensitive to the number of colors than Flat MPI, due to the synchronization overhead of OpenMP.
    • In Mat-Vec operations the difference is not so significant.

Page 79: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 79

• Background
  – Vector/Scalar Processors
  – GeoFEM Project
  – Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
  – Matrix Assembling
• Summary & Future Works

Page 80: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 80

"CUBE" Benchmark

• 3D linear elastic applications on cubes for a wide range of problem sizes.
• Hardware (single CPU):
  – Earth Simulator
  – AMD Opteron (1.8 GHz)

[Figure: (Nx-1)×(Ny-1)×(Nz-1) cube; Uz=0 @ z=Zmin, Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, uniform distributed force in z-direction @ z=Zmin.]

Page 81: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 81

Time for 3×64^3 = 786,432 DOF

                 ES (8.0 GFLOPS peak)              Opteron 1.8 GHz (3.6 GFLOPS peak)
                 DJDS original    DCRS             DJDS original    DCRS
                 sec. (MFLOPS)    sec. (MFLOPS)    sec. (MFLOPS)    sec. (MFLOPS)
Matrix           34.2 (240)       28.6 (291)       12.4 (663)       10.2 (818)
Solver           21.7 (3246)      360 (171)        271 (260)        225 (275)
Total            55.9             389              283              235

Page 82: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 82

Matrix + Solver

[Bar charts: time (sec.) vs. DOF (41,472 - 786,432), split into Matrix and Solver parts, for DJDS (original) on the ES and on the Opteron.]

Page 83: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 83

Computation Time vs. Problem Size

[Plots: Matrix, Solver, and Total time (sec.) vs. DOF (41,472 - 786,432) for ES (DJDS original), Opteron (DJDS original), and Opteron (DCRS).]

Page 84: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 84

The matrix assembling/formation part is rather expensive

• This part should also be optimized for vector processors.
• For example, in nonlinear simulations such as elasto-plastic solid simulations or fully coupled Navier-Stokes flow simulations, the matrices must be updated at every nonlinear iteration.
• This part strongly depends on the application/physics, so it is very difficult to develop general libraries for it, such as those for iterative linear solvers.
  – It also includes complicated processes which are difficult to vectorize.

Page 85: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 85

Typical Procedure for Calculating the Coefficient Matrix in FEM

• Apply Galerkin's method on each element.
• Integrate over each element to get the element matrix.
• Element matrices are accumulated to each node, and the global matrices are obtained => global linear equations.
• Matrix assembling/formation is an embarrassingly parallel procedure due to its element-by-element nature.
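As a concrete (and deliberately tiny) illustration of this element-by-element accumulation, here is a 1D Poisson assembly into a dense global matrix; it is not the GeoFEM code, just the generic pattern of integrate-per-element, then accumulate per node:

! Tiny 1D FEM assembly sketch: linear elements on [0,1] for -u'' = f.
! Element matrix ke = (1/h) [[1,-1],[-1,1]]; dense global matrix K for clarity.
program fem_assembly_1d
  implicit none
  integer, parameter :: nelem = 4, nnode = nelem + 1
  real(8) :: K(nnode,nnode), ke(2,2), h
  integer :: icel, ie, je, nodes(2)

  h  = 1.0d0 / nelem
  K  = 0.0d0
  ke = reshape( (/ 1.d0, -1.d0, -1.d0, 1.d0 /), (/2,2/) ) / h

  do icel= 1, nelem                 ! loop over elements
    nodes = (/ icel, icel+1 /)      ! global node numbers of this element
    do ie= 1, 2                     ! accumulate the element matrix into K
      do je= 1, 2
        K(nodes(ie), nodes(je)) = K(nodes(ie), nodes(je)) + ke(ie,je)
      enddo
    enddo
  enddo

  print '(5f7.2)', (K(icel,:), icel=1,nnode)   ! interior rows read -4, 8, -4
end program fem_assembly_1d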

Page 86: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 86

Element-by-Element Operations

• Integration over each element => element matrix
• Element matrices are accumulated to each node => global matrix
• Linear equations for each node

[Figure: a structured mesh of 24 elements, numbered 1-24.]

Page 87: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 87

Element-by-Element Operations

• Integration over each element => element matrix
• Element matrices are accumulated to each node => global matrix
• Linear equations for each node

[Figure: the same mesh with its 35 nodes (numbered 1-35) overlaid on the 24 elements.]

Page 88: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 88

Element-by-Element Operations

• Integration over each element => element matrix
• Element matrices are accumulated to each node => global matrix
• Linear equations for each node

[Figure: the non-zero pattern contributed by the element matrices.]

Page 89: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 89

Element-by-Element Operations

• Integration over each element => element matrix
• Element matrices are accumulated to each node => global matrix
• Linear equations for each node

Element matrix of a 4-node element:

  [ E11  E12  E13  E14 ]
  [ E21  E22  E23  E24 ]
  [ E31  E32  E33  E34 ]
  [ E41  E42  E43  E44 ]

Page 90: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 90

Element-by-Element Operations

• Integration over each element => element matrix
• Element matrices are accumulated to each node => global matrix
• Linear equations for each node

[Figure: the 35 global nodes of the mesh.]

Page 91: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 91

Element-by-Element Operations

• Integration over each element => element matrix
• Element matrices are accumulated to each node => global matrix
• Linear equations for each node

Global linear equations (35 nodes):

  a(1,1) u1 + a(1,2) u2 + …              = f1
  a(2,1) u1 + a(2,2) u2 + …              = f2
  …        + a(3,3) u3 + …               = f3
  …        + a(33,33) u33 + …            = f33
  …  + a(34,34) u34 + a(34,35) u35       = f34
  …  + a(35,34) u34 + a(35,35) u35       = f35

Page 92: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 92

Element-by-Element Operations

• If you calculate a(23,16) and a(16,23), you have to consider the contributions of both the 13th and the 14th elements.

[Figure: elements 13 and 14 both contain the edge between nodes 16 and 23, so the corresponding entries of the global equations receive contributions from both element matrices.]

Page 93: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 93

Current Approach

[Figure: elements 13 and 14 and the surrounding nodes (15, 16, 17, 22, 23, 24, …).]

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

Page 94: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 94

Current Approach

[Figure: elements 13 and 14 with local node IDs 1-2-3-4 for each bi-linear 4-node element.]

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

Page 95: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 95

Current Approach

[Figure: elements 13 and 14 with local node IDs.]

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

• Nice for cache reuse because of its localized operations.
• Not suitable for vector processors:
  – a(16,23) and a(23,16) might not be calculated properly when the element loop is vectorized (elements sharing a node update the same entries)
  – short innermost loops
  – many "if-then-else"s

(Timing for 3×64^3 DOF, Page 81: the "Matrix" part takes 34.2 sec. at 240 MFLOPS on the ES with the original DJDS, vs. 12.4 sec. on the 1.8 GHz Opteron.)

Page 96: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 96

Inside the loop: integration at the Gaussian quadrature points

do jpn= 1, 2
  do ipn= 1, 2
    coef= dabs(DETJ(ipn,jpn))*WEI(ipn)*WEI(jpn)
    PNXi= PNX(ipn,jpn,ie)
    PNYi= PNY(ipn,jpn,ie)
    PNXj= PNX(ipn,jpn,je)
    PNYj= PNY(ipn,jpn,je)
    a11= a11 + (valX*PNXi*PNXj + valB*PNYi*PNYj)*coef
    a22= a22 + (valX*PNYi*PNYj + valB*PNXi*PNXj)*coef
    a12= a12 + (valA*PNXi*PNYj + valB*PNXj*PNYi)*coef
    a21= a21 + (valA*PNYi*PNXj + valB*PNYj*PNXi)*coef
  enddo
enddo

Page 97: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 97

Remedy

• a(16,23) and a(23,16) might not be calculated properly.
  – Color the elements: elements which do not share any nodes are put in the same color (see the sketch after this slide).

[Figure: elements 13 and 14 and the surrounding nodes.]

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo
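A minimal, quadratic-time sketch of such element coloring — purely illustrative names and a brute-force neighbor test; a production code would use a node-to-element table instead:

! Greedy element coloring: two elements that share at least one node never
! get the same color, so the elements of one color can be processed
! concurrently / vectorized without conflicting matrix updates.
subroutine color_elements(nelem, ien, color, ncolor)
  implicit none
  integer, intent(in)  :: nelem, ien(4,nelem)   ! 4 global node numbers per element
  integer, intent(out) :: color(nelem), ncolor
  logical :: used(nelem)
  integer :: icel, jcel, ie, je, c

  color  = 0
  ncolor = 0
  do icel= 1, nelem
    used = .false.
    do jcel= 1, icel-1                          ! brute-force O(nelem**2) check
      do ie= 1, 4
        do je= 1, 4
          if (ien(ie,icel) == ien(je,jcel)) used(color(jcel)) = .true.
        enddo
      enddo
    enddo
    c = 1
    do while (used(c))                          ! smallest color not used by a neighbor
      c = c + 1
    enddo
    color(icel)= c
    ncolor     = max(ncolor, c)
  enddo
end subroutine color_elements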

Page 98: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 98

Coloring of Elements

[Figure: the mesh with its 35 nodes, before element coloring.]

Page 99: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 99

Coloring of Elements

[Figure: elements sharing the 16th node are assigned to different colors.]

Page 100: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 100

Remedy

• a(16,23) and a(23,16) might not be calculated properly.
  – Color the elements: elements which do not share any nodes are put in the same color.
• Short innermost loops
  – loop exchange

[Figure: elements 13 and 14 and the surrounding nodes.]

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

Page 101: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 101

Remedy

• a(16,23) and a(23,16) might not be calculated properly.
  – Color the elements: elements which do not share any nodes are put in the same color.
• Short innermost loops
  – loop exchange
• There are many "if-then-else"s
  – define an ELEMENT-to-MATRIX array

[Figure: elements 13 and 14 and the surrounding nodes.]

do icel= 1, ICELTOT
  do ie= 1, 4
    do je= 1, 4
      - assemble element matrix
      - accumulate element matrix into global matrix
    enddo
  enddo
enddo

Page 102: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 102

Define the ELEMENT-to-MATRIX array

[Figure: elements 13 and 14 (local node IDs ①-④ of each bi-linear 4-node element) and the surrounding global nodes; the global node pair (16, 23) appears in element 13 as local pair (2,3) and in element 14 as local pair (1,4).]

ELEMmat(icel, ie, je)

if kkU = index_U(16-1+k) and item_U(kkU) = 23 then
  ELEMmat(13,2,3)= +kkU
  ELEMmat(14,1,4)= +kkU
endif

if kkL = index_L(23-1+k) and item_L(kkL) = 16 then
  ELEMmat(13,3,2)= -kkL
  ELEMmat(14,4,1)= -kkL
endif

Page 103: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 103

Define the ELEMENT-to-MATRIX array

"ELEMmat" maps each local node pair (ie, je) of each element to the address of the corresponding entry of the global coefficient matrix (a positive value points into the upper triangle, a negative value into the lower triangle).

if kkU = index_U(16-1+k) and item_U(kkU) = 23 then
  ELEMmat(13,2,3)= +kkU
  ELEMmat(14,1,4)= +kkU
endif

if kkL = index_L(23-1+k) and item_L(kkL) = 16 then
  ELEMmat(13,3,2)= -kkL
  ELEMmat(14,4,1)= -kkL
endif
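A generic sketch of how such a table can be built once from the CRS index/item arrays; the names ien and build_elemmat are illustrative, and the slide's if-blocks correspond to the special case icel = 13, 14 with the node pair (16, 23):

! Build ELEMmat(icel,ie,je): signed address of the global matrix entry that
! the (ie,je) pair of element icel accumulates into. +k -> k-th entry of the
! upper triangle (item_U), -k -> k-th entry of the lower triangle (item_L),
! 0 -> diagonal.
subroutine build_elemmat(nelem, nnode, ien, index_U, item_U, index_L, item_L, ELEMmat)
  implicit none
  integer, intent(in)  :: nelem, nnode, ien(4,nelem)
  integer, intent(in)  :: index_U(0:nnode), item_U(*)
  integer, intent(in)  :: index_L(0:nnode), item_L(*)
  integer, intent(out) :: ELEMmat(nelem,4,4)
  integer :: icel, ie, je, i, j, k

  ELEMmat = 0
  do icel= 1, nelem
    do ie= 1, 4
      do je= 1, 4
        i= ien(ie,icel)
        j= ien(je,icel)
        if (j > i) then                        ! search the upper triangle of row i
          do k= index_U(i-1)+1, index_U(i)
            if (item_U(k) == j) ELEMmat(icel,ie,je)= +k
          enddo
        else if (j < i) then                   ! search the lower triangle of row i
          do k= index_L(i-1)+1, index_L(i)
            if (item_L(k) == j) ELEMmat(icel,ie,je)= -k
          enddo
        endif                                  ! j == i: diagonal, left as 0
      enddo
    enddo
  enddo
end subroutine build_elemmat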

Page 104: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 104

Optimized Procedure

do icol= 1, NCOLOR_E_tot
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - define the "ELEMmat" array
      enddo
    enddo
  enddo
enddo

do icol= 1, NCOLOR_E_tot
  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - assemble element matrix
      enddo
    enddo
  enddo

  do ie= 1, 4
    do je= 1, 4
      do ic0= index_COL(icol-1)+1, index_COL(icol)
        icel= item_COL(ic0)
        - accumulate element matrix into global matrix
      enddo
    enddo
  enddo
enddo

Extra storage for:
• the ELEMmat array
• element-matrix components of the elements in each color
→ < 10% increase

Extra computation for:
• ELEMmat

Page 105: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 105

Optimized Procedure

(Same three loop nests as on the previous page.)

PART I — "integer" operations to define "ELEMmat".
In nonlinear cases this part is done just once (before the initial iteration), as long as the mesh connectivity does not change.

Page 106: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 106

Optimized Procedure

(Same three loop nests as on the previous page.)

PART I — "integer" operations to define "ELEMmat": done just once in nonlinear cases (before the initial iteration), as long as the mesh connectivity does not change.

PART II — "floating-point" operations for matrix assembling/accumulation: repeated at every nonlinear iteration.

Page 107: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 107

Time for 3×64^3 = 786,432 DOF

                 ES (8.0 GFLOPS peak)                               Opteron 1.8 GHz (3.6 GFLOPS peak)
                 DJDS original   DJDS improved   DCRS               DJDS original   DJDS improved   DCRS
                 sec. (MFLOPS)   sec. (MFLOPS)   sec. (MFLOPS)      sec. (MFLOPS)   sec. (MFLOPS)   sec. (MFLOPS)
Matrix           34.2 (240)      12.5 (643)      28.6 (291)         12.4 (663)      21.2 (381)      10.2 (818)
Solver           21.7 (3246)     21.7 (3246)     360 (171)          271 (260)       271 (260)       225 (275)
Total            55.9            34.2            389                283             292             235

Page 108: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 108

Time for 3×64^3 = 786,432 DOF

(Same table as on the previous page.)

On the Opteron, the improved DJDS "Matrix" part (21.2 sec.) is slower than the original (12.4 sec.) because of the long innermost loops: data locality has been lost.

Page 109: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 109

Matrix + Solver

[Bar charts: time (sec.) vs. DOF (41,472 - 786,432), split into Matrix and Solver parts, for the original and the improved DJDS on the ES and on the Opteron.]

Page 110: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 110

Computation Time vs. Problem Size

[Plots: Matrix, Solver, and Total time (sec.) vs. DOF (41,472 - 786,432) for ES (DJDS improved), Opteron (DJDS improved), and Opteron (DCRS).]

Page 111: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 111

"Matrix" computation time for the improved version of DJDS

[Bar charts: time (sec., 0-25) vs. DOF (41,472 - 786,432), split into the "integer" (ELEMmat) and "floating-point" parts, on the ES and on the Opteron.]

Page 112: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 112

Optimization of “Matrix” assembling/formation on ES

• DJDS has been much improved compared to the original one, but it’s still slower than DCRS version on Opteron.

• “Integer” operation part is slower.• But, “floating” operation is much faster than Opteron.

• In nonlinear simulations, “integer” operation is executed only once (just before initial iteration), therefore, ES outperforms Opteron if the number of nonlinear iterations is more than 2.

Page 113: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 113

Suppose a "virtual" mode where …

• the "integer" operation part runs on a scalar processor, and
• the "floating-point" operation part and the linear solvers run on the vector processor.

(The scalar performance of the ES processor (500 MHz) is lower than that of a Pentium III.)

Page 114: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 114

Time for 3×64^3 = 786,432 DOF

                 ES (8.0 GFLOPS peak)                               Opteron 1.8 GHz (3.6 GFLOPS peak)
                 DJDS virtual    DJDS improved   DCRS               DJDS improved   DCRS
                 sec. (MFLOPS)   sec. (MFLOPS)   sec. (MFLOPS)      sec. (MFLOPS)   sec. (MFLOPS)
Matrix           1.88 (4431)     12.5 (643)      28.6 (291)         21.2 (381)      10.2 (818)
Solver           21.7 (3246)     21.7 (3246)     360 (171)          271 (260)       225 (275)
Total            23.6            34.2            389                292             235

Page 115: Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

08-APR22 115

Summary: Vectorization of FEM Applications

• NOT so easy.
• FEM's good feature of local operations is not necessarily suitable for vector processors.
  – Preconditioned iterative solvers can be vectorized rather more easily, because their target is the "global" matrix.
• Sometimes a major revision of the original code is required.
  – Usually more memory, more lines, additional operations …
• Performance of code optimized for vector processors is not necessarily good on scalar processors (e.g. matrix assembling in FEM).