PARALLELIZATION OF GCCG SOLVER
Programming of Supercomputers, Team 03
Roberto Camacho, Miroslava Slavcheva



DESCRIPTION

Final project of Programming of Supercomputers WS13. Outlines the parallelization and optimization of a serial code.


Page 1

PARALLELIZATION OF GCCG SOLVER

Programming of Supercomputers, Team 03
Roberto Camacho, Miroslava Slavcheva

Page 2

INTRODUCTION

o Fire benchmark suite

o GCCG solver
  • Generalized orthomin solver with diagonal scaling
  • Based on a linearized continuity equation of the form

      BP * VAR_P - sum_nb( B_nb * VAR_nb ) = SU

    where BP are the [known] boundary pole coefficients, B_nb the [known] boundary cell
    coefficients (BE, BS, etc.), SU the [known] source values, VAR the [unknown] variation
    vector (the flow to be transported), and nb runs over the neighbouring cells.
  • While residual > tolerance:
    - Compute the new directional values (direc) from the old ones.
    - Normalize and update values.
    - Compute the new residual.
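The core of each iteration is the application of this operator to the current direction vector. A minimal sketch of that step in C, assuming the six-neighbour indirect addressing through LCC used by the benchmark: bp, bs, be follow the coefficient names on this slide, while the remaining coefficient names (bn, bw, bl, bh), the output array direc2, the index range nintci..nintcf and the neighbour ordering are assumptions.

    /* Apply the linearized continuity operator to the current direction
     * vector direc1, writing the result into direc2.  Each cell couples its
     * own pole coefficient (bp) with six neighbours addressed indirectly
     * through lcc. */
    void apply_operator(int nintci, int nintcf, const double *bp,
                        const double *bs, const double *be, const double *bn,
                        const double *bw, const double *bl, const double *bh,
                        int lcc[][6], const double *direc1, double *direc2)
    {
        for (int nc = nintci; nc <= nintcf; nc++) {
            direc2[nc] = bp[nc] * direc1[nc]
                       - bs[nc] * direc1[lcc[nc][0]]
                       - be[nc] * direc1[lcc[nc][1]]
                       - bn[nc] * direc1[lcc[nc][2]]
                       - bw[nc] * direc1[lcc[nc][3]]
                       - bl[nc] * direc1[lcc[nc][4]]
                       - bh[nc] * direc1[lcc[nc][5]];
        }
    }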

Page 3

PROGRAMMING OF SUPERCOMPUTERS LAB FLOW

o Sequential optimization of the GCCG solver.

o Definition of performance objectives for parallelization.

o Parallelization:
  • Domain decomposition into volume cells
  • Definition of a communication model
  • Implementation using MPI

o Performance analysis and tuning.


Page 4

SEQUENTIAL OPTIMIZATION

o Performance metrics via the PAPI library (see the sketch below):
  • Execution time, total Mflops, L2- and L3-level cache miss rates

o Compiler optimization flags:
  • -g, -O0, -O1, -O2, -O3, -O3p

o ASCII vs. binary input files
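A minimal sketch of such a PAPI measurement, assuming the low-level event-set API and the preset events PAPI_L2_TCM, PAPI_L3_TCM and PAPI_FP_OPS; where exactly the counters are started and stopped in the GCCG code is an assumption.

    #include <stdio.h>
    #include <papi.h>

    /* Hypothetical wrapper: count cache misses and FP ops around the solver. */
    void measured_solve(void)
    {
        int eventset = PAPI_NULL;
        long long counters[3];

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_L2_TCM);   /* L2 total cache misses     */
        PAPI_add_event(eventset, PAPI_L3_TCM);   /* L3 total cache misses     */
        PAPI_add_event(eventset, PAPI_FP_OPS);   /* floating-point operations */

        PAPI_start(eventset);
        /* ... computation phase of the GCCG solver ... */
        PAPI_stop(eventset, counters);

        printf("L2 misses: %lld  L3 misses: %lld  FP ops: %lld\n",
               counters[0], counters[1], counters[2]);
    }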

Pages 5-6

SEQUENTIAL OPTIMIZATION
[Measurement results; these two slides contain figures only.]

Page 7

BENCHMARK PARALLELIZATION

Page 8

DATA DISTRIBUTION

o Main objective: avoid data replication

o Implemented distributions: classic, METIS dual, METIS nodal

o Problem:
  • The application targets irregular geometries
  • Indirect addressing (LCC)

o METIS: graph partitioning tool (see the sketch after the figure below)
  • Dual approach: elements -> vertices; balanced computational time
  • Nodal approach: nodes -> vertices; "optimal neighbourhood"

[Figure: example distribution over Processors 0-4. Classic distribution: the elements are
simply distributed evenly, with a maximum difference of 1 element between processors.
METIS dual distribution shown alongside for comparison.]
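A minimal sketch of the dual partitioning call, assuming the METIS 5 mesh API; the mesh arrays (eptr/eind), the ncommon value and all variable names are illustrative.

    #include <metis.h>

    /* Sketch of the METIS dual distribution.  ne/nn: number of elements/nodes;
     * eptr/eind: element-to-node mesh description; epart[i] receives the rank
     * that owns element i. */
    void partition_dual(idx_t ne, idx_t nn, idx_t *eptr, idx_t *eind,
                        idx_t nparts, idx_t *epart, idx_t *npart)
    {
        idx_t ncommon = 4;   /* shared nodes that make two cells neighbours (assumption) */
        idx_t objval;        /* edge-cut reported by METIS */

        METIS_PartMeshDual(&ne, &nn, eptr, eind,
                           NULL, NULL,            /* no weights / sizes   */
                           &ncommon, &nparts,
                           NULL, NULL,            /* default tpwgts / options */
                           &objval, epart, npart);

        /* Nodal variant: METIS_PartMeshNodal(&ne, &nn, eptr, eind, NULL, NULL,
         *                    &nparts, NULL, NULL, &objval, epart, npart); */
    }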

Page 9

DATA DISTRIBUTION

o Global-to-local and local-to-global mappings (example: rank 2 owns global elements 8-11)

  Global-to-local mapping (global idx -> local idx):
    0 -> X   1 -> X   ...   8 -> 0   9 -> 1   10 -> 2   11 -> 3   12 -> 4   13 -> X   14 -> 5

  Local-to-global mapping (local idx -> global idx):
    Internal elements: 0 -> 8, 1 -> 9, 2 -> 10, 3 -> 11
    External elements: 4 -> 12, 5 -> 14

o Mapping the neighbours (see the sketch below): the LCC neighbours of a local element are
  given as global indices (e.g. 9, 12, 14); each neighbour is mapped through a lookup in the
  global-to-local mapping, with external elements (12, 14) appended after the internal ones.
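A minimal sketch of how these mappings can be stored and applied to the LCC neighbours; the array names and the use of -1 for the non-local entries (shown as X above) are illustrative.

    /* global_to_local[g] : local index of global element g, or -1 when g is
     *                      neither owned nor an external (ghost) cell on this
     *                      rank (the entries marked X above);
     * local_to_global[l] : global index of local element l, internal elements
     *                      first, external elements appended after them.     */

    /* Rewrite the LCC neighbour array once after the distribution so that the
     * solver loop addresses only local storage (e.g. 9 -> 1, 12 -> 4, 14 -> 5
     * for the rank-2 example above).  Assumes every referenced global index
     * has an entry in global_to_local. */
    void localize_lcc(int nloc, int lcc[][6], const int *global_to_local)
    {
        for (int i = 0; i < nloc; i++)
            for (int j = 0; j < 6; j++)
                lcc[i][j] = global_to_local[lcc[i][j]];
    }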

Page 10

DATA DISTRIBUTION

o Two ways to establish the distribution (see the sketch below):

  • Distribution by the master process: rank 0 reads the input file, computes the
    distribution and transmits each part to ranks 1-4.

  • Distribution computed by each process: every rank reads the input file and computes
    the (identical) distribution itself.
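A minimal sketch of the first variant, assuming rank 0 computes the element-to-rank assignment (e.g. via METIS) and then broadcasts it; in the second variant the broadcast disappears because every rank derives the same assignment itself. All names are illustrative.

    #include <mpi.h>

    /* "Distribution by master process": rank 0 reads the input and fills the
     * element-to-rank assignment epart; one broadcast makes it known to every
     * process, which can then pick out its own cells. */
    void distribute_assignment(int ne, int *epart, int my_rank)
    {
        if (my_rank == 0) {
            /* ... read the input file and fill epart[0..ne-1], e.g. via METIS ... */
        }
        MPI_Bcast(epart, ne, MPI_INT, 0, MPI_COMM_WORLD);
    }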

Page 11

COMMUNICATION MODEL

o Main objective: optimize communication

o Ghost cells

o Communication lists
  • Send
  • Receive

o Synchronization between the send and receive lists

Page 12

COMMUNICATION MODEL

o Send list as a 2D array (one row per rank, Rank 0 to Rank 4)
  • Row length: max #elements in one processor
  • Total storage: N * max_loc_elems

o Send list as a 1D array with an index array (see the sketch below)
  • Idx(0) = 0, Idx(1) = Idx(0) + Send_count(0), then Idx(2), Idx(3), ...
  • Index array must be created; Send_count already exists
  • Total storage: Total_neighbors + N
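A minimal sketch of deriving that index array from the existing send counts; send_idx and the other names are illustrative.

    /* Flatten the per-rank send lists into one 1D array plus an index
     * (displacement) array, as later required by MPI_Alltoallv.
     * send_count[r] = number of local elements to be sent to rank r. */
    void build_send_index(int nprocs, const int *send_count, int *send_idx)
    {
        send_idx[0] = 0;
        for (int r = 1; r < nprocs; r++)
            send_idx[r] = send_idx[r - 1] + send_count[r - 1];
        /* Elements destined for rank r then occupy
         * send_list[ send_idx[r] .. send_idx[r] + send_count[r] - 1 ]. */
    }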

Page 13

COMMUNICATION MODEL: MPI_ALLTOALLV

[Figure: send lists and receive lists of Nodes 0-4. Each node's send list is split into one
segment per destination node; the segment sent from node i to node j lands in the matching
segment of node j's receive list, which is the exchange pattern MPI_Alltoallv implements.]

Page 14

MPI IMPLEMENTATION

o Main loop: compute_solution

o Collective communication
  • Low overhead
  • No deadlock

o direc1 transmission (see the sketch below)
  • Indirect addressing
  • Includes ghost cells and external cells
  • Uses the communication lists and mappings

  direc1 -> send_list -> send_buffer -> MPI_Alltoallv -> recv_buffer -> recv_list -> direc1
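A minimal sketch of that pipeline, assuming the flattened send/receive lists and the count/index arrays from the communication model; buffer and parameter names are illustrative.

    #include <mpi.h>
    #include <stdlib.h>

    /* Ghost-cell exchange of direc1 with MPI_Alltoallv.  send_list / recv_list
     * hold local element indices grouped by destination / source rank, and
     * send_count / send_idx, recv_count / recv_idx are the matching count and
     * displacement arrays. */
    void exchange_direc1(double *direc1,
                         int *send_list, int *send_count, int *send_idx,
                         int *recv_list, int *recv_count, int *recv_idx,
                         int total_send, int total_recv)
    {
        double *send_buffer = malloc(total_send * sizeof(double));
        double *recv_buffer = malloc(total_recv * sizeof(double));

        /* Pack: copy the direc1 values of the elements in the send lists. */
        for (int i = 0; i < total_send; i++)
            send_buffer[i] = direc1[send_list[i]];

        MPI_Alltoallv(send_buffer, send_count, send_idx, MPI_DOUBLE,
                      recv_buffer, recv_count, recv_idx, MPI_DOUBLE,
                      MPI_COMM_WORLD);

        /* Unpack: write the received values into the ghost / external cells. */
        for (int i = 0; i < total_recv; i++)
            direc1[recv_list[i]] = recv_buffer[i];

        free(send_buffer);
        free(recv_buffer);
    }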

Page 15

MPI IMPLEMENTATION

o MPI_Allreduce with MPI_IN_PLACE

[Figure: each of Processors 0-3 holds a local residual in res_updated; the MPI_Allreduce
sums them so that afterwards res_updated contains the global residual on every processor.]
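A minimal sketch of that reduction, assuming a single double-precision residual per rank; the surrounding names are illustrative.

    #include <mpi.h>

    /* On entry res_updated holds this rank's local residual; on return it
     * holds the global sum on every rank, with no separate send buffer
     * thanks to MPI_IN_PLACE. */
    double reduce_residual(double res_updated)
    {
        MPI_Allreduce(MPI_IN_PLACE, &res_updated, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        return res_updated;
    }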

Page 16

FINALIZATION

o Added to verify the results

o Processor 0 does all the work

o Imbalanced load

o High communication
  • Data
  • Local-to-global map
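One way such a finalization can be expressed, assuming the local values and local-to-global maps are gathered on rank 0 with MPI_Gatherv and then placed into a global array there; this is a sketch with illustrative names, not the project's exact code.

    #include <mpi.h>
    #include <stdlib.h>

    /* Collect the distributed VAR values on rank 0 for verification.  Every
     * rank contributes its local values and its local-to-global map; rank 0
     * writes them back into global numbering. */
    void gather_var(double *var_loc, int *local_to_global, int nloc,
                    double *var_global, int my_rank, int nprocs)
    {
        int *recvcounts = NULL, *displs = NULL, *indices = NULL;
        double *values = NULL;
        int total = 0;

        if (my_rank == 0) {
            recvcounts = malloc(nprocs * sizeof(int));
            displs     = malloc(nprocs * sizeof(int));
        }

        /* 1. Rank 0 learns how many elements each rank owns. */
        MPI_Gather(&nloc, 1, MPI_INT, recvcounts, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (my_rank == 0) {
            for (int r = 0; r < nprocs; r++) { displs[r] = total; total += recvcounts[r]; }
            values  = malloc(total * sizeof(double));
            indices = malloc(total * sizeof(int));
        }

        /* 2. Gather the data and the local-to-global maps. */
        MPI_Gatherv(var_loc, nloc, MPI_DOUBLE,
                    values, recvcounts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Gatherv(local_to_global, nloc, MPI_INT,
                    indices, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

        /* 3. Rank 0 scatters the values into the global numbering. */
        if (my_rank == 0) {
            for (int i = 0; i < total; i++)
                var_global[indices[i]] = values[i];
            free(values); free(indices); free(recvcounts); free(displs);
        }
    }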

Page 17

PERFORMANCE ANALYSIS AND TUNING

Page 18

SPEEDUP

o Speedup objective: linear speedup using the METIS dual distribution and 1-8 processors.

o PAPI measurements with up to 8 processors:
  • Linear (super-linear) speedup with the METIS distributions.

o Scalasca/Periscope/Vampir/Cube measurements and visualization (code/makefile
  instrumentation) with up to 64 processors:
  • Speedup plateaus.

Page 19

SPEEDUP
[Speedup results; this slide contains figures only.]

Page 20

LOAD IMBALANCE

o Minimal load imbalance in the computation phase, by design.

o Severe load imbalance in the finalization phase: the root process does all the work.

Page 21

MPI OVERHEAD

o Acceptance criterion: MPI overhead (MPI_Init() and MPI_Finalize() excluded) below 25% on
  the pent input with 4 processors.

o Overhead plateaus.

Page 22

CONCLUSION

o No severe load imbalance or data replication.

o Collective operations used to achieve the performance goals.

o The METIS distributions lead to better performance than the classic distribution.

o The optimal number of processors for best speedup and least communication overhead depends on:
  • the distribution
  • the input file size