PARALLELIZATION OF GCCG SOLVER
Programming of Supercomputers, Team 03
Roberto Camacho, Miroslava Slavcheva



DESCRIPTION

Final project of Programming of Supercomputers WS13. Outlines the parallelization and optimization of a serial code.


Page 1

PARALLELIZATION OF GCCG SOLVER

Programming of Supercomputers, Team 03
Roberto Camacho, Miroslava Slavcheva

Page 2

INTRODUCTION

o Fire benchmark suite

o GCCG solver
  • Generalized orthomin solver with diagonal scaling
  • Based on a linearized continuity equation of the form

      BP * VAR_P - sum_nb( B_nb * VAR_nb ) = SU

    where BP are the [known] boundary pole coefficients, B_nb the [known] boundary cell
    coefficients (BE, BS, etc.), SU the [known] source values, VAR the [unknown] variation
    vector (the flow to be transported), and nb runs over the neighbouring cells.
  • While residual > tolerance:
    - Compute the new directional values (direc) from the old ones.
    - Normalize and update values.
    - Compute the new residual.
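The core of each iteration is the application of this operator to the current direction vector. A minimal sketch of that step in C, assuming the six-neighbour indirect addressing through LCC used by the benchmark: bp, bs, be follow the coefficient names on this slide, while the remaining coefficient names (bn, bw, bl, bh), the output array direc2, the index range nintci..nintcf and the neighbour ordering are assumptions.

    /* Apply the linearized continuity operator to the current direction
     * vector direc1, writing the result into direc2.  Each cell couples its
     * own pole coefficient (bp) with six neighbours addressed indirectly
     * through lcc. */
    void apply_operator(int nintci, int nintcf, const double *bp,
                        const double *bs, const double *be, const double *bn,
                        const double *bw, const double *bl, const double *bh,
                        int lcc[][6], const double *direc1, double *direc2)
    {
        for (int nc = nintci; nc <= nintcf; nc++) {
            direc2[nc] = bp[nc] * direc1[nc]
                       - bs[nc] * direc1[lcc[nc][0]]
                       - be[nc] * direc1[lcc[nc][1]]
                       - bn[nc] * direc1[lcc[nc][2]]
                       - bw[nc] * direc1[lcc[nc][3]]
                       - bl[nc] * direc1[lcc[nc][4]]
                       - bh[nc] * direc1[lcc[nc][5]];
        }
    }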

Page 3

PROGRAMMING OF SUPERCOMPUTERS LAB FLOW

o Sequential optimization of the GCCG solver.

o Definition of performance objectives for parallelization.

o Parallelization:
  • Domain decomposition into volume cells
  • Definition of a communication model
  • Implementation using MPI

o Performance analysis and tuning.


Page 4

SEQUENTIAL OPTIMIZATION

o Performance metrics via the PAPI library (see the sketch below):
  • Execution time, total Mflops, L2- and L3-level cache miss rates

o Compiler optimization flags:
  • -g, -O0, -O1, -O2, -O3, -O3p

o ASCII vs. binary input files
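A minimal sketch of such a PAPI measurement, assuming the low-level event-set API and the preset events PAPI_L2_TCM, PAPI_L3_TCM and PAPI_FP_OPS; where exactly the counters are started and stopped in the GCCG code is an assumption.

    #include <stdio.h>
    #include <papi.h>

    /* Hypothetical wrapper: count cache misses and FP ops around the solver. */
    void measured_solve(void)
    {
        int eventset = PAPI_NULL;
        long long counters[3];

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_L2_TCM);   /* L2 total cache misses     */
        PAPI_add_event(eventset, PAPI_L3_TCM);   /* L3 total cache misses     */
        PAPI_add_event(eventset, PAPI_FP_OPS);   /* floating-point operations */

        PAPI_start(eventset);
        /* ... computation phase of the GCCG solver ... */
        PAPI_stop(eventset, counters);

        printf("L2 misses: %lld  L3 misses: %lld  FP ops: %lld\n",
               counters[0], counters[1], counters[2]);
    }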

Pages 5-6

SEQUENTIAL OPTIMIZATION
[Measurement results; these two slides contain figures only.]

Page 7

BENCHMARK PARALLELIZATION

Page 8

DATA DISTRIBUTION

o Main objective: avoid data replication

o Implemented distributions: classic, METIS dual, METIS nodal

o Problem:
  • The application targets irregular geometries
  • Indirect addressing (LCC)

o METIS: graph partitioning tool (see the sketch after the figure below)
  • Dual approach: elements -> vertices; balanced computational time
  • Nodal approach: nodes -> vertices; "optimal neighbourhood"

[Figure: example distribution over Processors 0-4. Classic distribution: the elements are
simply distributed evenly, with a maximum difference of 1 element between processors.
METIS dual distribution shown alongside for comparison.]
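A minimal sketch of the dual partitioning call, assuming the METIS 5 mesh API; the mesh arrays (eptr/eind), the ncommon value and all variable names are illustrative.

    #include <metis.h>

    /* Sketch of the METIS dual distribution.  ne/nn: number of elements/nodes;
     * eptr/eind: element-to-node mesh description; epart[i] receives the rank
     * that owns element i. */
    void partition_dual(idx_t ne, idx_t nn, idx_t *eptr, idx_t *eind,
                        idx_t nparts, idx_t *epart, idx_t *npart)
    {
        idx_t ncommon = 4;   /* shared nodes that make two cells neighbours (assumption) */
        idx_t objval;        /* edge-cut reported by METIS */

        METIS_PartMeshDual(&ne, &nn, eptr, eind,
                           NULL, NULL,            /* no weights / sizes   */
                           &ncommon, &nparts,
                           NULL, NULL,            /* default tpwgts / options */
                           &objval, epart, npart);

        /* Nodal variant: METIS_PartMeshNodal(&ne, &nn, eptr, eind, NULL, NULL,
         *                    &nparts, NULL, NULL, &objval, epart, npart); */
    }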

Page 9

DATA DISTRIBUTION

o Global-to-local and local-to-global mappings (example: rank 2 owns global elements 8-11)

  Global-to-local mapping (global idx -> local idx):
    0 -> X   1 -> X   ...   8 -> 0   9 -> 1   10 -> 2   11 -> 3   12 -> 4   13 -> X   14 -> 5

  Local-to-global mapping (local idx -> global idx):
    Internal elements: 0 -> 8, 1 -> 9, 2 -> 10, 3 -> 11
    External elements: 4 -> 12, 5 -> 14

o Mapping the neighbours (see the sketch below): the LCC neighbours of a local element are
  given as global indices (e.g. 9, 12, 14); each neighbour is mapped through a lookup in the
  global-to-local mapping, with external elements (12, 14) appended after the internal ones.
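A minimal sketch of how these mappings can be stored and applied to the LCC neighbours; the array names and the use of -1 for the non-local entries (shown as X above) are illustrative.

    /* global_to_local[g] : local index of global element g, or -1 when g is
     *                      neither owned nor an external (ghost) cell on this
     *                      rank (the entries marked X above);
     * local_to_global[l] : global index of local element l, internal elements
     *                      first, external elements appended after them.     */

    /* Rewrite the LCC neighbour array once after the distribution so that the
     * solver loop addresses only local storage (e.g. 9 -> 1, 12 -> 4, 14 -> 5
     * for the rank-2 example above).  Assumes every referenced global index
     * has an entry in global_to_local. */
    void localize_lcc(int nloc, int lcc[][6], const int *global_to_local)
    {
        for (int i = 0; i < nloc; i++)
            for (int j = 0; j < 6; j++)
                lcc[i][j] = global_to_local[lcc[i][j]];
    }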

Page 10

DATA DISTRIBUTION

o Two ways to establish the distribution (see the sketch below):

  • Distribution by the master process: rank 0 reads the input file, computes the
    distribution and transmits each part to ranks 1-4.

  • Distribution computed by each process: every rank reads the input file and computes
    the (identical) distribution itself.
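A minimal sketch of the first variant, assuming rank 0 computes the element-to-rank assignment (e.g. via METIS) and then broadcasts it; in the second variant the broadcast disappears because every rank derives the same assignment itself. All names are illustrative.

    #include <mpi.h>

    /* "Distribution by master process": rank 0 reads the input and fills the
     * element-to-rank assignment epart; one broadcast makes it known to every
     * process, which can then pick out its own cells. */
    void distribute_assignment(int ne, int *epart, int my_rank)
    {
        if (my_rank == 0) {
            /* ... read the input file and fill epart[0..ne-1], e.g. via METIS ... */
        }
        MPI_Bcast(epart, ne, MPI_INT, 0, MPI_COMM_WORLD);
    }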

Page 11

COMMUNICATION MODEL

o Main objective: optimize communication

o Ghost cells

o Communication lists
  • Send
  • Receive

o Synchronization between the send and receive lists

Page 12

COMMUNICATION MODEL

o Send list as a 2D array (one row per rank, Rank 0 to Rank 4)
  • Row length: max #elements in one processor
  • Total storage: N * max_loc_elems

o Send list as a 1D array with an index array (see the sketch below)
  • Idx(0) = 0, Idx(1) = Idx(0) + Send_count(0), then Idx(2), Idx(3), ...
  • Index array must be created; Send_count already exists
  • Total storage: Total_neighbors + N
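A minimal sketch of deriving that index array from the existing send counts; send_idx and the other names are illustrative.

    /* Flatten the per-rank send lists into one 1D array plus an index
     * (displacement) array, as later required by MPI_Alltoallv.
     * send_count[r] = number of local elements to be sent to rank r. */
    void build_send_index(int nprocs, const int *send_count, int *send_idx)
    {
        send_idx[0] = 0;
        for (int r = 1; r < nprocs; r++)
            send_idx[r] = send_idx[r - 1] + send_count[r - 1];
        /* Elements destined for rank r then occupy
         * send_list[ send_idx[r] .. send_idx[r] + send_count[r] - 1 ]. */
    }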

Page 13

COMMUNICATION MODEL: MPI_ALLTOALLV

[Figure: send lists and receive lists of Nodes 0-4. Each node's send list is split into one
segment per destination node; the segment sent from node i to node j lands in the matching
segment of node j's receive list, which is the exchange pattern MPI_Alltoallv implements.]

Page 14

MPI IMPLEMENTATION

o Main loop: compute_solution

o Collective communication
  • Low overhead
  • No deadlock

o direc1 transmission (see the sketch below)
  • Indirect addressing
  • Includes ghost cells and external cells
  • Uses the communication lists and mappings

  direc1 -> send_list -> send_buffer -> MPI_Alltoallv -> recv_buffer -> recv_list -> direc1
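A minimal sketch of that pipeline, assuming the flattened send/receive lists and the count/index arrays from the communication model; buffer and parameter names are illustrative.

    #include <mpi.h>
    #include <stdlib.h>

    /* Ghost-cell exchange of direc1 with MPI_Alltoallv.  send_list / recv_list
     * hold local element indices grouped by destination / source rank, and
     * send_count / send_idx, recv_count / recv_idx are the matching count and
     * displacement arrays. */
    void exchange_direc1(double *direc1,
                         int *send_list, int *send_count, int *send_idx,
                         int *recv_list, int *recv_count, int *recv_idx,
                         int total_send, int total_recv)
    {
        double *send_buffer = malloc(total_send * sizeof(double));
        double *recv_buffer = malloc(total_recv * sizeof(double));

        /* Pack: copy the direc1 values of the elements in the send lists. */
        for (int i = 0; i < total_send; i++)
            send_buffer[i] = direc1[send_list[i]];

        MPI_Alltoallv(send_buffer, send_count, send_idx, MPI_DOUBLE,
                      recv_buffer, recv_count, recv_idx, MPI_DOUBLE,
                      MPI_COMM_WORLD);

        /* Unpack: write the received values into the ghost / external cells. */
        for (int i = 0; i < total_recv; i++)
            direc1[recv_list[i]] = recv_buffer[i];

        free(send_buffer);
        free(recv_buffer);
    }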

Page 15

MPI IMPLEMENTATION

o MPI_Allreduce with MPI_IN_PLACE

[Figure: each of Processors 0-3 holds a local residual in res_updated; the MPI_Allreduce
sums them so that afterwards res_updated contains the global residual on every processor.]
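A minimal sketch of that reduction, assuming a single double-precision residual per rank; the surrounding names are illustrative.

    #include <mpi.h>

    /* On entry res_updated holds this rank's local residual; on return it
     * holds the global sum on every rank, with no separate send buffer
     * thanks to MPI_IN_PLACE. */
    double reduce_residual(double res_updated)
    {
        MPI_Allreduce(MPI_IN_PLACE, &res_updated, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        return res_updated;
    }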

Page 16

FINALIZATION

o Added to verify the results

o Processor 0 does all the work

o Imbalanced load

o High communication
  • Data
  • Local-to-global map
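One way such a finalization can be expressed, assuming the local values and local-to-global maps are gathered on rank 0 with MPI_Gatherv and then placed into a global array there; this is a sketch with illustrative names, not the project's exact code.

    #include <mpi.h>
    #include <stdlib.h>

    /* Collect the distributed VAR values on rank 0 for verification.  Every
     * rank contributes its local values and its local-to-global map; rank 0
     * writes them back into global numbering. */
    void gather_var(double *var_loc, int *local_to_global, int nloc,
                    double *var_global, int my_rank, int nprocs)
    {
        int *recvcounts = NULL, *displs = NULL, *indices = NULL;
        double *values = NULL;
        int total = 0;

        if (my_rank == 0) {
            recvcounts = malloc(nprocs * sizeof(int));
            displs     = malloc(nprocs * sizeof(int));
        }

        /* 1. Rank 0 learns how many elements each rank owns. */
        MPI_Gather(&nloc, 1, MPI_INT, recvcounts, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (my_rank == 0) {
            for (int r = 0; r < nprocs; r++) { displs[r] = total; total += recvcounts[r]; }
            values  = malloc(total * sizeof(double));
            indices = malloc(total * sizeof(int));
        }

        /* 2. Gather the data and the local-to-global maps. */
        MPI_Gatherv(var_loc, nloc, MPI_DOUBLE,
                    values, recvcounts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Gatherv(local_to_global, nloc, MPI_INT,
                    indices, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

        /* 3. Rank 0 scatters the values into the global numbering. */
        if (my_rank == 0) {
            for (int i = 0; i < total; i++)
                var_global[indices[i]] = values[i];
            free(values); free(indices); free(recvcounts); free(displs);
        }
    }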

Page 17

PERFORMANCE ANALYSIS AND TUNING

Page 18

SPEEDUP

o Speedup objective: linear speedup using the METIS dual distribution and 1-8 processors.

o PAPI measurements with up to 8 processors:
  • Linear (super-linear) speedup with the METIS distributions.

o Scalasca/Periscope/Vampir/Cube measurements and visualization (code/makefile
  instrumentation) with up to 64 processors:
  • Speedup plateaus.

Page 19

SPEEDUP
[Speedup results; this slide contains figures only.]

Page 20

LOAD IMBALANCE

o Minimal load imbalance in the computation phase, by design.

o Severe load imbalance in the finalization phase: the root process does all the work.

Page 21

MPI OVERHEAD

o Acceptance criterion: MPI overhead (MPI_Init() and MPI_Finalize() excluded) below 25% on
  the pent input with 4 processors.

o Overhead plateaus.

Page 22

CONCLUSION

o No severe load imbalance or data replication.

o Collective operations used to achieve the performance goals.

o The METIS distributions lead to better performance than the classic distribution.

o The optimal number of processors for best speedup and least communication overhead depends on:
  • the distribution
  • the input file size