• Use optimized scientific libraries
• Pre-post MPI receives
• MPI-IO collective buffering
• In PWscf (v4.0) the diagonalization is performed by the routines PDSPEV_DRV
  (REAL) and PZHPEV_DRV (COMPLEX), written by the Quantum ESPRESSO developers
• PWscf scalability stops at ~1000 cores due to the heavy MPI_Allreduce and
  MPI_Bcast activity triggered by these routines
• Efficient ScaLAPACK routines exist to implement the diagonalization
  • REAL case: PDSYEVD
  • COMPLEX case: PZHEEVD
• To implement the diagonalization with the ScaLAPACK routines it was necessary to:
  • Modify the indexing scheme and distribute a few dynamically allocated
    arrays according to the block-cyclic distribution expected by ScaLAPACK
  • Replace the call to PDSPEV_DRV with a call to ScaLAPACK PDSYEVD
  • Replace the call to PZHPEV_DRV with a call to ScaLAPACK PZHEEVD
• Original code

  CALL blk2cyc_redist( n, diag, nrlx, hh, nx, desc )
  CALL pdspev_drv( 'V', diag, nrlx, e, vv, nrlx, nrl, n, &
                   desc(la_npc_)*desc(la_npr_), desc(la_me_), desc(la_comm_) )
  CALL cyc2blk_redist( n, vv, nrlx, hh, nx, desc )

• Modified code

  CALL pdsyevd( 'V', 'U', n, diag, 1, 1, desc_block, e, hh, 1, 1, desc_block, &
                work, lwork, iwork, liwork, info )
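• The modified code assumes that diag and hh are distributed block-cyclically over
  a BLACS process grid described by desc_block, while e holds the eigenvalues. A
  minimal sketch of such a setup follows; the grid shape, the block size nb and the
  helper variable names are illustrative assumptions, not the actual Quantum
  ESPRESSO code:

  ! Sketch: build a block-cyclic descriptor for an n x n matrix on an
  ! (approximately square) nprow x npcol BLACS grid with block size nb.
  INTEGER            :: n                    ! matrix dimension (set by the caller)
  INTEGER            :: ictxt, nprow, npcol, myrow, mycol
  INTEGER            :: iam, nproc, nb, np_loc, nq_loc, lld, info
  INTEGER            :: desc_block( 9 )
  INTEGER, EXTERNAL  :: numroc

  nb = 32                                    ! assumed block size
  CALL blacs_pinfo( iam, nproc )             ! my id and total number of tasks
  nprow = INT( SQRT( REAL( nproc ) ) )       ! tasks outside the grid stay idle
  npcol = nproc / nprow
  CALL blacs_get( -1, 0, ictxt )             ! default system context
  CALL blacs_gridinit( ictxt, 'R', nprow, npcol )
  CALL blacs_gridinfo( ictxt, nprow, npcol, myrow, mycol )

  np_loc = numroc( n, nb, myrow, 0, nprow )  ! local rows owned by this task
  nq_loc = numroc( n, nb, mycol, 0, npcol )  ! local columns owned by this task
  lld    = MAX( 1, np_loc )
  CALL descinit( desc_block, n, n, nb, nb, 0, 0, ictxt, lld, info )

  ! diag and hh must then be allocated with the local shape (np_loc, nq_loc), and
  ! the workspace sizes lwork/liwork can be obtained with a workspace query
  ! (lwork = -1) before calling pdsyevd as shown above.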
Routine                  Time ASIS    Time OPT
dspev_module_ptqliv          20           -
blk2cyc_redist               18           -
cyc2blk_redist               18           -
MPI_Gather                  101           -
MPI_Bcast                   277          165
MPI_Allreduce               155           15
Total time                  590          180
Figure: Quantum ESPRESSO performance (Tflop/s vs. number of cores) for the
CNT10POR8 benchmark, comparing the ASIS and OPT versions.
• Use non-blocking sends/receives when it is possible to overlap communication
  with computation
• If possible, pre-post receives before the sender posts the matching send
  (see the sketch below)
  • Best to ensure this via the algorithm
  • MPI synchronous sends can also be used, but they are less efficient
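• As a hedged illustration (the buffer size, tag and ranks below are arbitrary
  choices, not from the original slides), pre-posting a receive with MPI_Irecv
  before the matching send arrives lets the message land directly in the user
  buffer while the intervening computation overlaps with the transfer:

  ! Sketch: rank 1 pre-posts its receive before rank 0 sends, then overlaps
  ! computation with the transfer.
  PROGRAM prepost_example
    USE mpi
    IMPLICIT NONE
    INTEGER, PARAMETER :: nbuf = 100000
    DOUBLE PRECISION   :: buf( nbuf )
    INTEGER :: rank, req, ierr
    INTEGER :: status( MPI_STATUS_SIZE )

    CALL mpi_init( ierr )
    CALL mpi_comm_rank( MPI_COMM_WORLD, rank, ierr )

    IF ( rank == 1 ) THEN
       ! Pre-post the receive: a match entry exists before the message arrives
       CALL mpi_irecv( buf, nbuf, MPI_DOUBLE_PRECISION, 0, 99, &
                       MPI_COMM_WORLD, req, ierr )
       ! ... useful computation here, overlapped with the incoming message ...
       CALL mpi_wait( req, status, ierr )      ! data is now in buf
    ELSE IF ( rank == 0 ) THEN
       buf = 1.0d0
       CALL mpi_send( buf, nbuf, MPI_DOUBLE_PRECISION, 1, 99, &
                      MPI_COMM_WORLD, ierr )
    END IF

    CALL mpi_finalize( ierr )
  END PROGRAM prepost_example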
Figure: Portals message matching. Match entries created by the application's
pre-posting of receives point directly at the application buffers; match entries
posted by MPI handle unexpected messages (eager short messages go into
unexpected-short-message buffers, unexpected long messages produce a Portals EQ
event only). Portals matches an incoming message against the pre-posted receives
and delivers the message data directly into the user buffer.
Figures: Patrick H. Worley, "Importance of Pre-Posting Receives", Cray Technical
Workshop, San Francisco, 2008.
• What is the best way to perform I/O from an MPP system?
  • Input files must be read by all compute nodes
  • At specific intervals, or at the end of the computation, data from all
    compute nodes must be written to output files
• Solution #1: all PEs read/write a single file
  • Good enough for reads from a medium number of PEs
  • What happens on 10000 PEs?
• Solution #2: each PE reads/writes its own file
  • Inconvenient: the files can only be reused with the same number of PEs
  • Slow open: MDS bottleneck
• Solution #3: one I/O PE + MPI communication (see the sketch below)
  • Only one PE performs I/O
  • Good for small files; limited by disk bandwidth
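• A minimal sketch of solution #3 (not from the original slides; the local array
  size, file name and file format are assumptions): rank 0 collects every PE's
  contribution with MPI_Gather and is the only task that touches the file system:

  PROGRAM single_writer
    USE mpi
    IMPLICIT NONE
    INTEGER, PARAMETER :: nloc = 1000          ! local elements per PE (assumed)
    DOUBLE PRECISION   :: local( nloc )
    DOUBLE PRECISION, ALLOCATABLE :: global( : )
    INTEGER :: rank, nproc, ierr

    CALL mpi_init( ierr )
    CALL mpi_comm_rank( MPI_COMM_WORLD, rank, ierr )
    CALL mpi_comm_size( MPI_COMM_WORLD, nproc, ierr )

    local = DBLE( rank )                       ! each PE's contribution
    IF ( rank == 0 ) THEN
       ALLOCATE( global( nloc * nproc ) )
    ELSE
       ALLOCATE( global( 1 ) )                 ! unused on non-root ranks
    END IF

    ! All PEs send their block to rank 0
    CALL mpi_gather( local, nloc, MPI_DOUBLE_PRECISION, &
                     global, nloc, MPI_DOUBLE_PRECISION, &
                     0, MPI_COMM_WORLD, ierr )

    ! Only rank 0 performs the I/O
    IF ( rank == 0 ) THEN
       OPEN( UNIT=10, FILE='output.dat', FORM='unformatted', ACTION='write' )
       WRITE( 10 ) global
       CLOSE( 10 )
    END IF

    DEALLOCATE( global )
    CALL mpi_finalize( ierr )
  END PROGRAM single_writer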
• A pool of PEs performs the I/O
• The number of I/O PEs must be chosen properly, depending on
  • the application requirements: frequency of the I/O,
    I/O to computation ratio
  • the I/O characteristics: file size, record length
  • the total number of PEs
• If properly configured, this can be efficient and scalable
• However, it may require significant modifications to the application… or does it?
• Use collective MPI-IO calls (see the sketch below)
  • MPI_File_read_all, MPI_File_write_all
  • Must be called by all PEs
• Set collective buffering hints using the MPICH_MPIIO_HINTS environment variable
  • romio_cb_write    enable collective buffering on write
  • romio_cb_read     enable collective buffering on read
  • cb_buffer_size    buffer size
  • cb_nodes          size of the I/O pool ("the aggregators")
  • cb_config_list    select the aggregators
  • striping_factor   Lustre stripe count
  • striping_unit     Lustre stripe size

  MPICH_MPIIO_HINTS=myfile:romio_cb_write=enable:cb_buffer_size=2097152:
  cb_nodes=16:cb_config_list=*:
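• For illustration, a minimal sketch (not from the original slides; the file name
  'myfile', the local array size and the fixed 8-byte element size are assumptions)
  of a collective write in which every PE writes its contiguous block of one shared
  file with MPI_File_write_all; with MPICH_MPIIO_HINTS set as above, the MPI-IO
  layer performs the collective buffering through the configured aggregators:

  PROGRAM collective_write
    USE mpi
    IMPLICIT NONE
    INTEGER, PARAMETER :: nloc = 1000          ! local elements per PE (assumed)
    DOUBLE PRECISION   :: local( nloc )
    INTEGER :: rank, fh, ierr
    INTEGER :: status( MPI_STATUS_SIZE )
    INTEGER( KIND = MPI_OFFSET_KIND ) :: disp

    CALL mpi_init( ierr )
    CALL mpi_comm_rank( MPI_COMM_WORLD, rank, ierr )
    local = DBLE( rank )

    CALL mpi_file_open( MPI_COMM_WORLD, 'myfile', &
                        MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr )

    ! Each PE views the file starting at its own byte offset
    ! (8 bytes per DOUBLE PRECISION element assumed)
    disp = INT( rank, MPI_OFFSET_KIND ) * nloc * 8_MPI_OFFSET_KIND
    CALL mpi_file_set_view( fh, disp, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION, &
                            'native', MPI_INFO_NULL, ierr )

    ! Collective call: all PEs participate, so the aggregators can
    ! reorganize the I/O according to the collective buffering hints
    CALL mpi_file_write_all( fh, local, nloc, MPI_DOUBLE_PRECISION, status, ierr )

    CALL mpi_file_close( fh, ierr )
    CALL mpi_finalize( ierr )
  END PROGRAM collective_write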
• The MPICH_MPIIO_CB_ALIGN environment variable selects the collective buffering
  alignment algorithm
• MPICH_MPIIO_CB_ALIGN=2
  • Automatic alignment
  • MPI-IO parameters chosen according to the Lustre settings
  • cb_nodes = stripe_count
  • Optimal choice (most of the time)
• New MPI-IO white paper:
  ftp://ftp.cray.com/pub/pe/download/MPI-IO_White_Paper.pdf