
• Use optimized scientific libraries
• Prepost MPI receives
• MPI-IO collective buffering

XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 2


XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 3


• In PWscf (v4.0) the diagonalization is performed by the routines PDSPEV_DRV (REAL) and PZHPEV_DRV (COMPLEX), written by the Quantum ESPRESSO developers
• PWscf scalability stops at ~1000 cores due to the heavy MPI_Allreduce and MPI_Bcast activity triggered by these routines
• Efficient ScaLAPACK routines exist to implement the diagonalization
  • REAL case: PDSYEVD
  • COMPLEX case: PZHEEVD
• To implement the diagonalization with the ScaLAPACK routines it was necessary to:
  • Modify the indexing scheme and distribute a few dynamically allocated arrays according to the block-cyclic distribution expected by ScaLAPACK
  • Replace the call to PDSPEV_DRV with a call to ScaLAPACK PDSYEVD
  • Replace the call to PZHPEV_DRV with a call to ScaLAPACK PZHEEVD

XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 4


XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 5


• Original code

  CALL blk2cyc_redist( n, diag, nrlx, hh, nx, desc )
  CALL pdspev_drv( 'V', diag, nrlx, e, vv, nrlx, nrl, n, &
                   desc(la_npc_)*desc(la_npr_), desc(la_me_), desc(la_comm_) )
  CALL cyc2blk_redist( n, vv, nrlx, hh, nx, desc )

• Modified code

  CALL pdsyevd( 'V', 'U', n, diag, 1, 1, desc_block, e, hh, 1, 1, desc_block, &
                work, lwork, iwork, liwork, info )
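
The descriptor desc_block passed to pdsyevd describes the block-cyclic layout of the distributed matrix. As a hedged sketch (the subroutine name make_block_desc, the block size nb, and the grid-shape heuristic are assumptions, not part of the PWscf patch), such a descriptor could be built with BLACS and descinit roughly like this:

  ! Sketch only: builds a block-cyclic descriptor for an n x n matrix.
  subroutine make_block_desc( n, nb, ictxt, desc_block, lld )
    implicit none
    integer, intent(in)  :: n, nb            ! global matrix order, block size (assumed)
    integer, intent(out) :: ictxt, desc_block(9), lld
    integer :: nprocs, iam, nprow, npcol, myrow, mycol, info
    integer, external :: numroc

    call blacs_pinfo( iam, nprocs )
    ! Choose a nearly square nprow x npcol process grid
    do nprow = int( sqrt( real( nprocs ) ) ), 1, -1
       if ( mod( nprocs, nprow ) == 0 ) exit
    end do
    npcol = nprocs / nprow

    call blacs_get( -1, 0, ictxt )                   ! default system context
    call blacs_gridinit( ictxt, 'R', nprow, npcol )  ! row-major process grid
    call blacs_gridinfo( ictxt, nprow, npcol, myrow, mycol )

    ! Local leading dimension of the distributed n x n matrix
    lld = max( 1, numroc( n, nb, myrow, 0, nprow ) )
    call descinit( desc_block, n, n, nb, nb, 0, 0, ictxt, lld, info )
  end subroutine make_block_desc

The work/lwork and iwork/liwork arguments of pdsyevd can then be sized with a workspace query (lwork = -1, liwork = -1) before the real call.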

XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 6


XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 7


Routine               Time (ASIS)   Time (OPT)
dspev_module_ptqliv   20            -
blk2cyc_redist        18            -
cyc2blk_redist        18            -
MPI_Gather            101           -
MPI_Bcast             277           165
MPI_Allreduce         155           15
Total time            590           180

XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 8


[Figure: QuantumESPRESSO performance, Tflop/s (0 to 7) vs. number of cores (0 to 2500), for CNT10POR8 ASIS and CNT10POR8 OPT.]

XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 9


XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 10


• Use non-blocking sends/receives when it is possible to overlap communication with computation
• If possible, prepost receives before the sender posts the matching send
  • It is best to ensure this via the algorithm
  • An MPI synchronous send can also be used, but this is not as optimal

[Figure: Portals match entries, two cases. (1) Match Entries created by application pre-posting of receives: a pre-posted ME for msgX and one for msgY, each pointing at the application buffer for that message, so an eager short message lands directly in the user buffer. (2) Match Entries posted by MPI to handle unexpected short and long messages: eager short message MEs backed by short unexpected buffers, plus a long message ME (Portals EQ event only), with separate "other EQ" and "unexpected EQ" event queues for the incoming message.]

Portals matches an incoming message with the pre-posted receives and delivers the message data directly into the user buffer.
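
To make the idea concrete, here is a minimal, hedged sketch (the ring exchange pattern, message size, and tag are assumptions, not from the slides) in which every PE posts its MPI_Irecv before issuing the matching send, leaving room to overlap computation before the MPI_Waitall:

  program prepost_example
    use mpi
    implicit none
    integer, parameter :: n = 4096                 ! message size (assumed)
    integer :: ierr, rank, nprocs, left, right, req(2)
    double precision :: sendbuf(n), recvbuf(n)

    call MPI_Init( ierr )
    call MPI_Comm_rank( MPI_COMM_WORLD, rank, ierr )
    call MPI_Comm_size( MPI_COMM_WORLD, nprocs, ierr )
    left  = mod( rank - 1 + nprocs, nprocs )
    right = mod( rank + 1, nprocs )
    sendbuf = dble( rank )

    ! 1) Pre-post the receive first ...
    call MPI_Irecv( recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, req(1), ierr )
    ! 2) ... then post the send
    call MPI_Isend( sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                    MPI_COMM_WORLD, req(2), ierr )
    ! ... independent computation could be overlapped here ...
    call MPI_Waitall( 2, req, MPI_STATUSES_IGNORE, ierr )

    call MPI_Finalize( ierr )
  end program prepost_example

Whether the receive is actually posted before the matching message arrives still depends on the algorithm's timing, which is why the slide recommends ensuring it algorithmically.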

Slide 11


Slides 12-14: Patrick H. Worley, "Importance of Pre-Posting Receives", Cray Technical Workshop, San Francisco, 2008


XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 15


• What is the best way to perform I/O from an MPP system?
  • Input files must be read by all compute nodes
  • At specific intervals, or at the end of the computation, data from all compute nodes must be written to output files
• Solution #1: all PEs read/write a single file
  • Good enough for reads from a medium number of PEs
  • What happens with 10000 PEs?
• Solution #2: all PEs read/write different files
  • Inconvenient: can only be used with the same number of PEs
  • Slow open: MDS bottleneck
• Solution #3: one I/O PE + MPI communication (sketched below)
  • Only one PE performs I/O
  • Good for small files; limited by disk bandwidth
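
As a rough illustration of Solution #3 (the file name, data sizes, and the use of MPI_Gather are assumptions, not from the slides), a single PE can collect the data and be the only process touching the file system:

  program single_writer
    use mpi
    implicit none
    integer, parameter :: n = 1000                 ! local elements per PE (assumed)
    integer :: ierr, rank, nprocs
    double precision :: local(n)
    double precision, allocatable :: global(:)

    call MPI_Init( ierr )
    call MPI_Comm_rank( MPI_COMM_WORLD, rank, ierr )
    call MPI_Comm_size( MPI_COMM_WORLD, nprocs, ierr )
    local = dble( rank )

    ! Only the root needs the full receive buffer
    if ( rank == 0 ) then
       allocate( global(n*nprocs) )
    else
       allocate( global(1) )                       ! dummy, ignored on non-root ranks
    end if
    call MPI_Gather( local, n, MPI_DOUBLE_PRECISION, global, n, &
                     MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr )

    ! Rank 0 is the only PE that touches the file system
    if ( rank == 0 ) then
       open( unit=11, file='output.dat', form='unformatted' )
       write( 11 ) global
       close( 11 )
    end if

    call MPI_Finalize( ierr )
  end program single_writer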

XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 16


• A pool of PEs performs the I/O
• The number of I/O PEs must be chosen properly, depending on:
  • The application requirements: frequency of the I/O, I/O to computation ratio
  • The I/O characteristics: file size, record length
  • The total number of PEs
• If properly configured, this can be efficient and scalable
• However, it may require significant modifications to the application... Sure?

XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 17


• Use collective MPI-IO calls
  • MPI_File_read_all, MPI_File_write_all
  • Must be called by all PEs
• Set collective buffering hints using the MPICH_MPIIO_HINTS environment variable
  • romio_cb_write    enable collective buffering on write
  • romio_cb_read     enable collective buffering on read
  • cb_buffer_size    buffer size
  • cb_nodes          size of the I/O pool ("the aggregators")
  • cb_config_list    select the aggregators
  • striping_factor   Lustre stripe count
  • striping_unit     Lustre stripe size

  MPICH_MPIIO_HINTS=myfile:romio_cb_write=enable:cb_buffer_size=2097152:cb_nodes=16:cb_config_list=*:
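
A minimal, hedged sketch of the collective approach (the file name output.dat, the data size, and the byte offsets are assumptions, not from the slides): every PE opens the same file, sets its file view, and calls MPI_File_write_all; the hints from MPICH_MPIIO_HINTS are picked up when the file is opened.

  program collective_write
    use mpi
    implicit none
    integer, parameter :: n = 1000                 ! local elements per PE (assumed)
    integer :: ierr, rank, fh
    integer(kind=MPI_OFFSET_KIND) :: disp
    double precision :: buf(n)

    call MPI_Init( ierr )
    call MPI_Comm_rank( MPI_COMM_WORLD, rank, ierr )
    buf = dble( rank )                             ! dummy data

    ! All PEs open the same file collectively
    call MPI_File_open( MPI_COMM_WORLD, 'output.dat', &
                        MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr )

    ! Each PE views the file starting at its own byte offset
    disp = int( rank, MPI_OFFSET_KIND ) * n * 8_MPI_OFFSET_KIND
    call MPI_File_set_view( fh, disp, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION, &
                            'native', MPI_INFO_NULL, ierr )

    ! Collective write: the aggregator PEs reorganize the I/O behind the scenes
    call MPI_File_write_all( fh, buf, n, MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, ierr )

    call MPI_File_close( fh, ierr )
    call MPI_Finalize( ierr )
  end program collective_write

With MPICH_MPIIO_HINTS exported for this file (as in the example line above), the collective buffering settings apply to the write without any source change.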

XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 18


• The MPICH_MPIIO_CB_ALIGN environment variable can be used to select the collective buffering alignment algorithm
• MPICH_MPIIO_CB_ALIGN=2
  • Automatic alignment
  • MPI-IO parameters are chosen according to the Lustre settings
  • cb_nodes = stripe_count
  • The optimal choice (most of the time)
• New MPI-IO White Paper: ftp://ftp.cray.com/pub/pe/download/MPI-IO_White_Paper.pdf

XT5 Introduction Workshop - CSCS July 2-3, 2009 Slide 19
