Application Optimization with non-blocking Collective Operations
– A case study with a three-dimensional FFT –

Torsten Höfler
Department of Computer Science
Indiana University / Technical University of Chemnitz

Commissariat à l’Énergie Atomique, Direction des applications militaires (CEA-DAM)
Bruyères-le-Châtel, France
12th January 2007


Non-blocking Collective Operations General Application Optimization Use case: A specialized 3D-FFT Conclusions and Future Work



Outline

1 Non-blocking Collective Operations
    General Thoughts
    Overlap
    Process Skew

2 General Application Optimization
    Introduction
    An independent data Algorithm
    An independent data Loop

3 Use case: A specialized 3D-FFT
    A parallel 3D-FFT
    Applying non-blocking Collectives

4 Conclusions and Future Work


General Thoughts

What is it?

Non-blocking Send/Recv
  • MPI_Isend/MPI_Irecv + MPI_Test/MPI_Wait
  • avoid deadlock situations and enable overlap

Collective Operations
  • MPI_Bcast/MPI_Reduce/...
  • often-used communication patterns and performance portability
  • → cf. BLAS for communication

Non-blocking Collective Operations
  • MPI_Ibcast/MPI_Ireduce/... + MPI_Test/MPI_Wait
  • combine all advantages: overlap + performance portability


Where do I find it in the Standard?
  • not part of MPI-2
  • emulating it with an explicit programming model (threads) ⇒ not viable
  • implemented as an addition to MPI-2

Why should I invest the additional effort?
Two main advantages:
  1 hide communication latency
  2 lower the effects of process skew (introduced by OS noise or the algorithm)


Overlap

What is overlap and how does it help?

Hardware Parallelism
  • today’s network hardware can communicate without CPU involvement
  • communication runs in the background, the CPU is freed

Ah, my program runs faster!?
  • not much - “blocking communication” blocks the CPU :-(
  • the CPU waits until the communication is finished
  • non-blocking communication gives control back to the user

But I heard that non-blocking Send/Recv is slow
  • depends on the MPI library
  • some are implemented badly (e.g. the operation is performed blocking during MPI_Wait)


What can I gain with overlap?

The Latency of Collective Operations
  • often implemented on top of point-to-point messages
  • scales logarithmically, O(log2 P), or linearly, O(P), in the number of processes P

Ok, how much is that?
  • simple network model (Hockney) with 1-byte messages
  • time to send from host i to host j (j ≠ i): L
  • L is network dependent:
      Fast Ethernet:    L = 50 − 60 µs
      Gigabit Ethernet: L = 15 − 20 µs
      InfiniBand™:      L = 2 − 7 µs

⇒ 1 µs ≈ 4000 FLOP on a 2 GHz machine (2000 cycles, assuming 2 FLOP per cycle)
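
To put these numbers in perspective, the following sketch estimates the latency of a logarithmic collective under the simple model above and converts it into the floating-point work that could run in the meantime. The function names, the round count ceil(log2 P), and the fixed factor of 4000 FLOP per µs (the rule of thumb from this slide) are assumptions for illustration; none of this is part of MPI or LibNBC.

```c
#include <assert.h>

/* Number of communication rounds of a tree-based collective: ceil(log2(p)).
 * Assumption: every round costs one point-to-point latency L. */
int log_rounds(int p)
{
    assert(p >= 1);
    int rounds = 0;
    for (int n = p - 1; n > 0; n >>= 1)
        rounds++;
    return rounds;
}

/* Estimated latency (in microseconds) of a logarithmic collective. */
double collective_latency_us(int p, double latency_us)
{
    return log_rounds(p) * latency_us;
}

/* FLOP that fit into the given time, using 1 us ~ 4000 FLOP. */
double overlappable_flop(double time_us)
{
    return time_us * 4000.0;
}
```

For example, with P = 64 and L = 15 µs (Gigabit Ethernet) the model gives 6 rounds, i.e. 90 µs, or roughly 360,000 FLOP that could be overlapped.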


Process Skew

  • caused by OS interference or an unbalanced application
  • especially if processors are overloaded
  • worse for big systems
  • can cause a dramatic performance decrease
  • all nodes wait for the last

Example
Petrini et al. (2003): “The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q”
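
A small model makes "all nodes wait for the last" concrete. It assumes a fully synchronizing collective, where no process can leave before the slowest one has arrived; real collectives such as a broadcast synchronize less strictly, so this is a pessimistic estimate. The function name and the arrival-time array are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Total idle time, summed over all processes, when every process has to
 * wait for the last arrival (assumed fully synchronizing collective).
 * arrival[i] is the time at which process i enters the collective. */
double total_wait_time(const double *arrival, size_t n)
{
    assert(arrival != NULL && n > 0);

    /* Find the latest arrival. */
    double last = arrival[0];
    for (size_t i = 1; i < n; i++)
        if (arrival[i] > last)
            last = arrival[i];

    /* Everybody else idles until then. */
    double idle = 0.0;
    for (size_t i = 0; i < n; i++)
        idle += last - arrival[i];
    return idle;
}
```

With arrival times {5, 1, 1, 1} (process 0 delayed by 4 time units) the three punctual processes idle for 4 units each, 12 in total.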


Process Skew - MPI_BCAST Example - Jumpshot

[Figure: Jumpshot timeline, processes P0 to P3 over time; process 0 delayed, black = calculation time, blue = MPI time]


Process Skew - MPI_IBCAST Example - Jumpshot

[Figure: Jumpshot timeline, processes P0 to P3 over time; process 0 delayed, black = calculation time, blue = MPI time]


Great! How do I use it?

Proposal & Interface Definition
  • Hoefler et al. (2006): “Non-Blocking Collective Operations for MPI-2”

Implementation - LibNBC
  • needs only ANSI C + MPI-1
  • BSD License
  • download from http://www.unixer.de/NBC

LibNBC Usage
  NBC_Ibcast(buf1, p, MPI_INT, 0, comm, &req);
  /* ... independent computation that does not touch buf1 ... */
  NBC_Wait(&req);


Introduction

Acknowledgements

I want to thank some inspiring people! (alphabetically)

  • George Bosilca, University of Tennessee (LibNBC)
  • Peter Gottschling, Indiana University (3D-CG Solver, Apps)
  • Andrew Lumsdaine, Indiana University (LibNBC, Apps)
  • Wolfgang Rehm, TU Chemnitz (LibNBC, Apps)
  • Jeff Squyres, Cisco Systems (LibNBC)
  • Gilles Zerah, CEA-DAM France (problem of 3D-FFT)


(incomplete) Classification of parallel Algorithms

Independent Data Applications
  • 3D-CG Poisson solver (inner and halo parts)
  • many implicit iterative solvers (inner and halo parts)

Independent Data in Loops
  • parallel compression (blocks independent)
  • multi-dimensional FFT (lines/planes independent)

Dependent Data in Loops
  • parallel Gauss method (HPL, panel broadcast)
  • parallel Cholesky (strong data dependency)


An independent data Algorithm

3D Poisson Solver

[Figure: 2D domain distributed over processes P0 to P11; each process holds process-local data surrounded by halo data]


3D-Poisson - Parallel Speedup (Best Case)

[Figure: two plots of speedup (0 to 100) over number of CPUs (8 to 96); one compares “Eth blocking” with “Eth non-blocking”, the other “IB blocking” with “IB non-blocking”]

“odin”@IU: 128 dual Opteron 246 (2 GHz) nodes
Interconnect: Gigabit Ethernet, InfiniBand™
System size 800x800x800 (1 node ≈ 5300 s)


An independent data Loop

Parallel Compression

  • block-by-block parallel compression
  • gather compressed data to a single node
  • compression could also be post-processing
  • widely used to record experimental data

for (i = 0; i < my_blocks; i++) {
  compress_block(i);
}
MPI_Gather(<block 0 to my_blocks-1>);


Pipelined Communication

  • start non-blocking communication after some data is ready
  • two parameters:
      1 tile-factor: number of elements per communication
      2 window-size: number of outstanding requests

for (i = 0; i < my_blocks/tile; i++) {
  for (j = 0; j < tile; j++)
    compress_block(i*tile + j);
  MPI_Igather(<block i*tile to (i+1)*tile-1>);
}
MPI_Waitall(<Igather requests>);
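
The interplay of the two parameters can be sketched without any MPI calls. The code below only counts what a pipelined loop would do: how many non-blocking gathers are started for a given tile factor, and how many are in flight at once when the window size caps the number of outstanding requests. The struct, the function name, and the policy of waiting for the oldest request when the window is full are assumptions for illustration, not LibNBC behaviour.

```c
#include <assert.h>

/* Hypothetical bookkeeping for the pipelined loop; no communication is
 * performed, the function only counts requests. */
typedef struct {
    int started;        /* non-blocking collectives started         */
    int completed;      /* requests that have been waited for       */
    int max_in_flight;  /* largest number of simultaneous requests  */
} pipeline_stats;

pipeline_stats simulate_pipeline(int my_blocks, int tile, int window)
{
    assert(my_blocks >= 0 && tile > 0 && window > 0);
    pipeline_stats s = {0, 0, 0};

    /* One iteration per tile; the last tile may be shorter than `tile`. */
    for (int first = 0; first < my_blocks; first += tile) {
        /* blocks [first, first + tile) would be compressed here */

        if (s.started - s.completed == window)
            s.completed++;          /* window full: wait for the oldest request */

        s.started++;                /* start the gather for this tile */
        int in_flight = s.started - s.completed;
        if (in_flight > s.max_in_flight)
            s.max_in_flight = in_flight;
    }
    s.completed = s.started;        /* final wait for all remaining requests */
    return s;
}
```

With 10 blocks, a tile factor of 2 and a window size of 3, five gathers are started and at most three are outstanding at any time.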


Compression - Parallel Speedup (Best Case)

[Figure: two plots of speedup (0 to 90) over # processors (0 to 100), each comparing MPI/blocking, NBC/pipe, NBC/tile and NBC/wintile]

“odin”@IU: 128 dual Opteron 246 (2 GHz) nodes
Interconnect: Gigabit Ethernet, InfiniBand™
System size 57.22 MB (1 node ≈ 9800 s)


A parallel 3D-FFT

Domain Decomposition

Discretized 3D Domain (FFT-Box)

[Figure: discretized 3D box with axes x, y, z]


Memory layout (3x3x3 box), coordinates xyz: 000 → 222

000 001 002 010 011 012 020 021 022
100 101 102 110 111 112 120 121 122
200 201 202 210 211 212 220 221 222
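
In this layout x varies slowest and z varies fastest, so the position of an element in memory follows directly from its coordinates. A minimal sketch of that mapping (the function name is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Linear offset of element (x, y, z) in a box stored in xyz order,
 * i.e. x varies slowest and z varies fastest. ny and nz are the
 * extents of the box in y and z direction. */
size_t index_xyz(size_t x, size_t y, size_t z, size_t ny, size_t nz)
{
    assert(y < ny && z < nz);
    return (x * ny + y) * nz + z;
}
```

For the 3x3x3 box, element 120 lands at offset (1·3 + 2)·3 + 0 = 15, which matches its position in the sequence above.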


Distributed 3D Domain

[Figure: 3D box with axes x, y, z, distributed over processes 0, 1 and 2]


Blocked data distribution

000 001 002 010 011 012 020 021 022
100 101 102 110 111 112 120 121 122
200 201 202 210 211 212 220 221 222
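
A blocked distribution can be expressed as a simple owner computation. The sketch below assumes that the box is cut into contiguous slabs along the slowest index x and that leftover planes go to the lowest ranks; both the function name and this remainder policy are assumptions, not taken from the slides.

```c
#include <assert.h>

/* Rank that owns plane x when nx planes are dealt out in contiguous
 * blocks to p processes. Assumption: the first nx % p ranks receive
 * one extra plane each. */
int block_owner(int x, int nx, int p)
{
    assert(p > 0 && nx > 0 && x >= 0 && x < nx);
    int base = nx / p;               /* planes every rank gets at least  */
    int extra = nx % p;              /* ranks 0..extra-1 get base + 1    */
    int split = extra * (base + 1);  /* first plane owned by a rank with
                                        only `base` planes               */
    if (x < split)
        return x / (base + 1);
    return extra + (x - split) / base;
}
```

For the 3x3x3 box on three processes every rank owns exactly one plane, i.e. one row of nine values in the listing above.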


1D Transformation

1D Transformation in z Direction

[Figure: 3D box with axes x, y, z]


Rearrange Data Layout

rearrange from xyz to xzy (simply swap y and z indices)

xyz layout:
000 001 002 010 011 012 020 021 022
100 101 102 110 111 112 120 121 122
200 201 202 210 211 212 220 221 222

xzy layout:
000 010 020 001 011 021 002 012 022
100 110 120 101 111 121 102 112 122
200 210 220 201 211 221 202 212 222
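
The index swap is a purely local copy. A minimal out-of-place sketch, assuming separate source and destination buffers (the function name is hypothetical, and an in-place variant would need more care):

```c
#include <assert.h>
#include <stddef.h>

/* Copy a box from xyz order (z fastest) into xzy order (y fastest).
 * src and dst must not overlap and must each hold nx*ny*nz values. */
void swap_yz(const double *src, double *dst,
             size_t nx, size_t ny, size_t nz)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t x = 0; x < nx; x++)
        for (size_t y = 0; y < ny; y++)
            for (size_t z = 0; z < nz; z++)
                dst[(x * nz + z) * ny + y] = src[(x * ny + y) * nz + z];
}
```

Because x stays the slowest index, every plane is rearranged independently and no communication is needed for this step.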


1D Transformation in y Direction

[Figure: 3D box with axes x, y, z]


rearrange from xzy to yzx (parallel transpose) ⇒ MPI_Alltoall(v)

before (xzy layout):
000 010 020 001 011 021 002 012 022
100 110 120 101 111 121 102 112 122
200 210 220 201 211 221 202 212 222

after the parallel transpose:
000 100 200 010 110 210 020 120 220
001 101 201 011 111 211 021 121 221
002 102 202 012 112 212 022 122 222
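
Why a transpose maps onto an all-to-all exchange can be seen from a serial model of the data movement. The sketch below involves no MPI at all; it only mimics the semantics of an all-to-all with equally sized blocks, where block j of process i ends up as block i of process j. The function name and the flat buffer convention are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Serial model of an all-to-all exchange among p processes with one
 * block of `count` doubles per pair of processes. send and recv each
 * hold p*p*count values: block j of process i starts at
 * send[(i*p + j)*count]. Afterwards block i of process j holds what
 * process i addressed to process j. */
void alltoall_model(const double *send, double *recv,
                    size_t p, size_t count)
{
    assert(send != NULL && recv != NULL && send != recv);
    for (size_t i = 0; i < p; i++)          /* sending process   */
        for (size_t j = 0; j < p; j++)      /* receiving process */
            for (size_t k = 0; k < count; k++)
                recv[(j * p + i) * count + k] =
                    send[(i * p + j) * count + k];
}
```

With count = 1 this is exactly the transpose of a p×p matrix whose rows are distributed over the processes.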


1D Transformation in x Direction

[Figure: 3D box with axes x, y, z]


Applying non-blocking Collectives

Non-blocking 3D-FFT

Derivation from “normal” implementation
  • distribution identical to “normal” 3D-FFT
  • first FFT in z direction and index-swap identical

Design Goals to Minimize Communication Overhead
  • start communication as early as possible
  • achieve maximum overlap time

Solution
  • start MPI_Ialltoall as soon as first xz-plane is ready
  • calculate next xz-plane
  • start next communication accordingly ...
  • collect multiple xz-planes (tile factor)
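
How much such a scheme can gain is easy to estimate with a two-stage pipeline model. The sketch below is idealized: it assumes perfect overlap, a constant compute time `comp` and communication time `comm` per tile, and at most one communication in flight. The function names and parameters are hypothetical and no measured values are implied.

```c
#include <assert.h>

/* Blocking scheme: every tile is computed and communicated without
 * any overlap. */
double blocking_time(int tiles, double comp, double comm)
{
    assert(tiles >= 0);
    return tiles * (comp + comm);
}

/* Pipelined scheme (idealized): the transfer of tile i runs while
 * tile i+1 is computed, so the slower of the two stages sets the pace.
 * Only the first computation and the last transfer are exposed. */
double pipelined_time(int tiles, double comp, double comm)
{
    assert(tiles >= 0);
    if (tiles == 0)
        return 0.0;
    double step = comp > comm ? comp : comm;
    return comp + (tiles - 1) * step + comm;
}
```

For 4 tiles with comp = 3 and comm = 1 the model gives 16 time units for the blocking scheme and 13 for the pipelined one; the gain grows as the two stage times get closer to each other.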


Transformation in z Direction

Data already transformed in y direction

[Figure: 3x3x3 grid with axes x, y, z; 1 block = 1 double value]


Transform first xz plane in z direction

[Figure: 3x3x3 grid with axes x, y, z; the pattern marks data that was transformed in y and z direction]

Start MPI_Ialltoall of the first xz plane and transform the second plane.

[Figure: 3x3x3 grid with axes x, y, and z. Cyan marks data that is communicated in the background.]

Start MPI_Ialltoall of the second xz plane and transform the third plane.

[Figure: 3x3x3 grid with axes x, y, and z. The data of two planes is not accessible because it is being communicated.]

Transformation in x Direction

Start communication of the third plane and ...

[Figure: 3x3x3 grid with axes x, y, and z.]

... we need the first xz plane to go on ...

... so MPI_Wait for the first MPI_Ialltoall and transform the first plane.

[Figure: 3x3x3 grid with axes x, y, and z. The new pattern marks data that was transformed in x, y, and z direction.]

Wait and transform the second xz plane.

[Figure: 3x3x3 grid with axes x, y, and z.]

The first plane’s data could be accessed for the next operation.

Wait and transform the last xz plane.

[Figure: 3x3x3 grid with axes x, y, and z.]

Done! → one complete 1D-FFT overlaps a communication
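A toy timing model can make the benefit of this overlap concrete. The sketch below is an illustration under stated assumptions, not a measurement and not the implementation from the talk: every plane is assumed to need the same time `t_fft` for a 1D-FFT step and `t_comm` for its exchange, exchanges are assumed to run one after another on the network, and they are assumed to progress fully in the background. The function names are hypothetical.

```python
def blocking_time(n_planes, t_fft, t_comm):
    """All z transforms, one blocking exchange of all planes, all x transforms."""
    return n_planes * t_fft + n_planes * t_comm + n_planes * t_fft


def pipelined_time(n_planes, t_fft, t_comm):
    """z transform of plane i overlaps the exchange of earlier planes."""
    cpu = 0.0       # time at which the CPU becomes free
    net = 0.0       # time at which the network becomes free (assumed serialized)
    finished = []   # completion time of each plane's exchange
    for _ in range(n_planes):
        cpu += t_fft                 # transform this plane in z direction
        start = max(cpu, net)        # exchange starts once data and network are ready
        net = start + t_comm
        finished.append(net)
    for done in finished:
        cpu = max(cpu, done) + t_fft  # wait for the plane, then transform in x direction
    return cpu


print(blocking_time(3, 1.0, 1.0), pipelined_time(3, 1.0, 1.0))
```

In this model the exchanges are hidden behind computation whenever `t_comm` does not exceed `t_fft`; real overlap depends on how well the MPI library progresses communication in the background.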

Performance Tuning - Parameters

Tile factor:
  number of z-planes to gather before MPI_Ialltoall is started
  very performance critical!
  not easily predictable

Window size:
  number of outstanding communications
  not implemented yet
  not very performance critical → fine-tuning

MPI_Test interval:
  progresses internal state and outstanding operations
  unnecessary in threaded NBC implementation (future)
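The opposing effects behind the tile factor can be illustrated with a simple latency-bandwidth cost model. This model is an assumption added for illustration and does not come from the slides; `tile_tradeoff` and all parameter values below are hypothetical. Since the best value is described as not easily predictable, the sketch only shows the direction of the trade-off and is not a way to choose a value.

```python
import math


def tile_tradeoff(n_planes, tile_factor, plane_bytes, latency, bandwidth):
    """Return (exchanges, bytes per full exchange, total communication time).

    Assumed cost of one exchange: latency + bytes / bandwidth.
    The last exchange may carry fewer planes than `tile_factor`.
    """
    exchanges = math.ceil(n_planes / tile_factor)
    bytes_per_exchange = tile_factor * plane_bytes
    total_time = exchanges * latency + n_planes * plane_bytes / bandwidth
    return exchanges, bytes_per_exchange, total_time


# Hypothetical values: 8 planes of 1 MB, 10 microseconds latency, 1 GB/s bandwidth.
for tile in (1, 2, 4, 8):
    print(tile, tile_tradeoff(8, tile, 1_000_000, 1e-5, 1e9))
```

A larger tile factor reduces the number of exchanges and therefore the latency share, but it also delays the first exchange and leaves less computation to overlap with the last one.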

3D-FFT Benchmark Results (small input)

[Figure, left panel: Speedup over Nodes for ideal, NBC, and MPI. Right panel: Communication Overhead (%) over Nodes for NBC and MPI.]

“tantale”@CEA: 128 nodes, each a 2 GHz quad Opteron 844
Interconnect: InfiniBand™
System size 128x128x128 (1 node ≈ 0.75 s)
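For reading the plots, the two plotted quantities can be computed as in the sketch below. Speedup is taken as the usual ratio of single-node time to parallel time. The definition of communication overhead as the share of the runtime spent in communication is an assumption, since the slides do not state it. Only the single-node time of 0.75 s is from the slide; the other times are hypothetical placeholders, not measured results.

```python
def speedup(t_single, t_parallel):
    """Ratio of single-node runtime to parallel runtime."""
    return t_single / t_parallel


def parallel_efficiency(t_single, t_parallel, nodes):
    """Speedup divided by the number of nodes (1.0 is the ideal line)."""
    return speedup(t_single, t_parallel) / nodes


def comm_overhead_percent(t_comm, t_total):
    """Assumed definition: share of the runtime spent in communication."""
    return 100.0 * t_comm / t_total


t_single = 0.75      # from the slide: 1 node takes about 0.75 s
t_parallel = 0.05    # hypothetical parallel runtime, not a measured value
t_comm = 0.01        # hypothetical communication time, not a measured value

print(speedup(t_single, t_parallel))
print(parallel_efficiency(t_single, t_parallel, nodes=32))
print(comm_overhead_percent(t_comm, t_parallel))
```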

3D-FFT Benchmark Results (large input) - InfiniBand

[Figure, left panel: Speedup over Nodes for ideal, NBC single, MPI single, NBC dual, and MPI dual. Right panel: Communication Overhead (%) over Nodes for NBC single, MPI single, NBC dual, and MPI dual.]

“odin”@IU: 128 nodes, each a 2 GHz dual Opteron 246
Interconnect: InfiniBand™
System size 512x512x512 (1 node ≈ 50 s)

3D-FFT Benchmark Results (large input) - Ethernet

[Figure, left panel: Speedup over Nodes for ideal, NBC single, MPI single, NBC dual, and MPI dual. Right panel: Communication Overhead (%) over Nodes for NBC single, MPI single, NBC dual, and MPI dual.]

“odin”@IU: 128 nodes, each a 2 GHz dual Opteron 246
Interconnect: Gigabit Ethernet
System size 512x512x512 (1 node ≈ 50 s)

Conclusions & Future Work

Conclusions:
  applying NBC requires some effort
  NBC improves scaling
  common application patterns exist

Future Work:
  tune FFT further (cache issues)
  automatic parameter assessment (?)
  parallel model for LibNBC
  LibNBC features (e.g. Fortran bindings)

Discussion

THE END

Try LibNBC: http://www.unixer.de/NBC

Thank you for your attention!