MPI Application Development Using the Analysis Tool MARMOT
Bettina Krammer
HLRS
High Performance Computing Center Stuttgart
Allmandring 30, D-70550 Stuttgart
http://www.hlrs.de
Overview
• General Problems of MPI Programming
• Related Work
• Design of MARMOT
• Examples
• Performance Results with Benchmarks and Real Applications
• Outlook
Problems of MPI Programming
• All problems of serial programming
• Additional problems:
  – Increased difficulty to verify correctness of the program
  – Increased difficulty to debug N parallel processes
  – New parallel problems (deadlock, race conditions)
  – Portability between different MPI implementations (e.g. PACX-MPI)
Related Work I: Parallel Debuggers
• Examples: totalview, DDT, p2d2 • Advantages:
– Same approach and tool as in serial case• Disadvantages:
– Can only fix problems after and if they occur– Scalability: How can you debug programs that crash after 3
hours on 512 nodes?– Reproducibility: How to debug a program that crashes only
every fifth time?– It does not help to improve portability
Related Work II: Debug version of MPI Library
• Examples:
  – catches some incorrect usage, e.g. node count in MPI_CART_CREATE (mpich)
  – deadlock detection (NEC MPI)
• Advantages:
  – good scalability
  – better debugging in combination with totalview
• Disadvantages:
  – Portability: only helps when using this particular MPI implementation
  – Trade-off between performance and safety
  – Reproducibility: does not help to debug irreproducible programs
Related Work III: Special Verification Tools
• Special tools dedicated to message checking, examples:
• MPI-CHECK
  – Restricted to Fortran code
  – Automatic compile-time and run-time analysis
  – Currently not under active development
• Umpire
  – Mature tool, but not freely available
  – First version limited to shared memory platforms
  – Distributed memory version in preparation
• New Approach: MARMOT
Design of MARMOT
Design of MARMOT I
• Design Goals of MARMOT:
  – Portability
    • verify that your program is a correct MPI program
  – Reproducibility
    • detect possible race conditions
    • pseudo-serialize the program to make it reproducible
  – Scalability
    • automatic debugging wherever possible
Design of MARMOT II

Figure: architecture of MARMOT. The application or test program calls the MARMOT core tool through the MPI profiling interface; the core tool forwards the calls to the MPI library and reports to the Debug Server, which runs as an additional process.
Design of MARMOT III
• Library written in C++ that is linked to the application
• This library consists of the debug clients and one debug server
• No source code modification is required except for adding an additional process working as debug server, i.e. the application has to be run with mpirun for n+1 instead of n processes
• Main interface = MPI profiling interface according to the MPI standard 1.2 (a sketch of such a wrapper follows below)
• Implementation of the C language binding of MPI
• Implementation of the Fortran language binding as a wrapper to the C interface
• Environment variables control tool behavior and output (report of errors, warnings and/or remarks, trace-back, etc.)
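To make the profiling-interface mechanism concrete, the following is a minimal sketch of such a wrapper. It is an illustration under the usual PMPI conventions, not MARMOT's actual source, and the wrapper signature has to match the prototype in the local mpi.h (the pre-MPI-3 form is shown here):

  #include <stdio.h>
  #include "mpi.h"

  /* Sketch of a profiling-interface wrapper: the tool intercepts MPI_Send
  ** under its standard name, performs its checks, and forwards the call
  ** to the MPI library through the PMPI_ entry point. */
  int MPI_Send( void *buf, int count, MPI_Datatype datatype,
                int dest, int tag, MPI_Comm comm )
  {
      if( count < 0 )
          fprintf( stderr, "ERROR: MPI_Send called with negative count %d\n", count );
      if( tag < 0 )
          fprintf( stderr, "ERROR: MPI_Send called with negative tag %d\n", tag );
      /* a tool like MARMOT would also notify its debug server here */
      return PMPI_Send( buf, count, datatype, dest, tag, comm );
  }

Because the debug server occupies one extra process, an application that is normally started with mpirun -np 4 is run with mpirun -np 5 under MARMOT.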
Client Checks: verification on the local nodes
• Verification of MPI_Request usage
  – invalid recycling of active request
  – invalid use of unregistered request
  – warning if number of requests is zero
  – warning if all requests are MPI_REQUEST_NULL
• Verification of tag range (see the small example below)
• Verification if requested Cartesian communicator has correct size
• Verification of communicator in Cartesian calls
• Verification of groups in group calls
• Verification of sizes in calls that create groups or communicators
• Verification if ranges are valid (e.g. in group constructor calls)
• Verification if ranges are distinct (e.g. MPI_Group_incl, -excl)
• Check for pending messages and active requests in MPI_Finalize
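As a small, hypothetical example of an error these local checks would flag (not taken from any of the applications discussed here), the following receive passes a negative tag that is not MPI_ANY_TAG, i.e. a value outside the range from 0 up to MPI_TAG_UB permitted by the standard:

  #include "mpi.h"

  int main( int argc, char **argv )
  {
      int value = 0;
      MPI_Status status;

      MPI_Init( &argc, &argv );
      /* erroneous on purpose: -5 is neither MPI_ANY_TAG nor a valid tag,
      ** so the client-side tag-range check reports it */
      MPI_Recv( &value, 1, MPI_INT, 0, -5, MPI_COMM_WORLD, &status );
      MPI_Finalize();
      return 0;
  }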
Server Checks: verification between the nodes, control of program
• Everything that requires a global view
• Control the execution flow
• Signal conditions, e.g. deadlocks
• Check matching send/receive pairs for consistency (see the sketch below)
• Output log (report errors etc.)
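As a hypothetical sketch of an error that only this global view can catch (again not taken from a real application), the following send and receive match by rank and tag but disagree on the transferred datatype; each call looks correct on its own node, and only a comparison of both sides reveals the mismatch:

  #include "mpi.h"

  int main( int argc, char **argv )
  {
      int        ivalue = 42;
      double     dvalue = 0.0;
      int        rank   = -1;
      MPI_Status status;

      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      if( rank == 0 )
          /* sends one MPI_INT with tag 23 ... */
          MPI_Send( &ivalue, 1, MPI_INT, 1, 23, MPI_COMM_WORLD );
      if( rank == 1 )
          /* ... which is received as one MPI_DOUBLE: a type mismatch */
          MPI_Recv( &dvalue, 1, MPI_DOUBLE, 0, 23, MPI_COMM_WORLD, &status );
      MPI_Finalize();
      return 0;
  }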
Comparison with related projects
Name/Features         | MPI-CHECK                                          | MARMOT                                                                                                                          | Umpire
MPI standard          | parts of 1.2/2.0                                   | complete 1.2                                                                                                                    | parts of 1.2
C support             | no                                                 | yes                                                                                                                             | yes
C++ support (MPI 2.0) | no                                                 | no                                                                                                                              | ?
Fortran77 support     | yes                                                | yes                                                                                                                             | yes
Fortran90 support     | yes                                                | yes                                                                                                                             | yes
Design                | parsing and automatic modification of source code  | use of profiling interface, central manager to control execution and collect information, no source code modification required | use of profiling interface, central manager to control execution and collect information, no source code modification required
Functionality         | automatic compile-time and run-time checking       | automatic run-time checking                                                                                                     | automatic run-time checking
Platform              | basically any UNIX environment                     | basically any UNIX environment                                                                                                  | shared memory platforms
Comparison with related projects
Name/Features | MPI-CHECK | MARMOT | Umpire
Check correct use of arguments (ranks, tags, types, communicators, groups, datatypes, user-defined resources, requests, ...) | partly (ranks, types, negative message lengths) | yes, basically all arguments are checked, including proper request handling | partly (data types, requests, communicators)
Check correct handling of requests | no | yes | yes, partly
Check mismatched collective operations | yes | no | yes
Deadlock detection | yes, algorithm based on handshaking strategy, combined with user-specified time-out | yes, algorithm based on user-specified time-out mechanism, traceback on each node possible | yes, algorithm tries to construct dependency graphs, combined with user-specified time-out
Detect race conditions | no | warnings, e.g. when calls are using wildcards | no
Examples
Example 1: request-reuse (source code)
/*
** Here we re-use a request we didn't free before
*/
#include <stdio.h>
#include <assert.h>
#include "mpi.h"

int main( int argc, char **argv )
{
    int size = -1;
    int rank = -1;
    int value = -1;
    int value2 = -1;
    MPI_Status send_status, recv_status;
    MPI_Request send_request, recv_request;

    printf( "We call Irecv and Isend with non-freed requests.\n" );
    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    printf( " I am rank %d of %d PEs\n", rank, size );
Example 1: request-reuse (source code continued)
    if( rank == 0 ) {
        /*** this is just to get the request used ***/
        MPI_Irecv( &value, 1, MPI_INT, 1, 18, MPI_COMM_WORLD, &recv_request );
        /*** going to receive the message and reuse a non-freed request ***/
        MPI_Irecv( &value, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &recv_request );
        MPI_Wait( &recv_request, &recv_status );
        assert( value == 19 );
    }
    if( rank == 1 ) {
        value2 = 19;
        /*** this is just to use the request ***/
        MPI_Isend( &value, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &send_request );
        /*** going to send the message ***/
        MPI_Isend( &value2, 1, MPI_INT, 0, 17, MPI_COMM_WORLD, &send_request );
        MPI_Wait( &send_request, &send_status );
    }
    MPI_Finalize();
    return 0;
}
Example 1: request-reuse (output log)
We call Irecv and Isend with non-freed requests.
1 rank 0 performs MPI_Init
2 rank 1 performs MPI_Init
3 rank 0 performs MPI_Comm_size
4 rank 1 performs MPI_Comm_size
5 rank 0 performs MPI_Comm_rank
6 rank 1 performs MPI_Comm_rank
 I am rank 0 of 2 PEs
7 rank 0 performs MPI_Irecv
 I am rank 1 of 2 PEs
8 rank 1 performs MPI_Isend
9 rank 0 performs MPI_Irecv
10 rank 1 performs MPI_Isend
ERROR: MPI_Irecv Request is still in use !!
11 rank 0 performs MPI_Wait
ERROR: MPI_Isend Request is still in use !!
12 rank 1 performs MPI_Wait
13 rank 0 performs MPI_Finalize
14 rank 1 performs MPI_Finalize
Example 2: deadlock (source code)
/*
** This program produces a deadlock.
** At least 2 nodes are required to run the program.
**
** Rank 0 recv a message from Rank 1.
** Rank 1 recv a message from Rank 0.
**
** AFTERWARDS:
** Rank 0 sends a message to Rank 1.
** Rank 1 sends a message to Rank 0.
*/
#include <stdio.h>
#include "mpi.h"

int main( int argc, char** argv )
{
    int rank = 0;
    int size = 0;
    int dummy = 0;
    MPI_Status status;
Example 2: deadlock (source code continued)
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    if( size < 2 ) {
        fprintf( stderr, " This program needs at least 2 PEs!\n" );
    }
    else {
        if( rank == 0 ) {
            MPI_Recv( &dummy, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 1, 18, MPI_COMM_WORLD );
        }
        if( rank == 1 ) {
            MPI_Recv( &dummy, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 0, 17, MPI_COMM_WORLD );
        }
    }
    MPI_Finalize();
    return 0;
}
Example 2: deadlock (output log)
$ mpirun -np 3 deadlock1
1 rank 0 performs MPI_Init
2 rank 1 performs MPI_Init
3 rank 0 performs MPI_Comm_rank
4 rank 1 performs MPI_Comm_rank
5 rank 0 performs MPI_Comm_size
6 rank 1 performs MPI_Comm_size
7 rank 0 performs MPI_Recv
8 rank 1 performs MPI_Recv
8 Rank 0 is pending!
8 Rank 1 is pending!
WARNING: deadlock detected, all clients are pending
Example 2: deadlock (output log continued)
Last calls (max. 10) on node 0:
timestamp = 1: MPI_Init( *argc, ***argv )
timestamp = 3: MPI_Comm_rank( comm, *rank )
timestamp = 5: MPI_Comm_size( comm, *size )
timestamp = 7: MPI_Recv( *buf, count = -1, datatype = non-predefined datatype, source = -1, tag = -1, comm, *status )

Last calls (max. 10) on node 1:
timestamp = 2: MPI_Init( *argc, ***argv )
timestamp = 4: MPI_Comm_rank( comm, *rank )
timestamp = 6: MPI_Comm_size( comm, *size )
timestamp = 8: MPI_Recv( *buf, count = -1, datatype = non-predefined datatype, source = -1, tag = -1, comm, *status )
Current status of MARMOT
• Full MPI 1.2 implemented
• C and Fortran bindings are supported
• Used for several Fortran and C benchmarks (NASPB) and applications (CrossGrid project and others)
• Tests on different platforms, using different compilers and MPI implementations, e.g.
  – IA32/IA64 clusters (Intel, g++ compiler) with mpich
  – IBM Regatta
  – NEC SX5
  – Hitachi SR8000
Performance with Benchmarks
Bandwidth on an IA64 cluster with Myrinet
Figure: bandwidth in MB/s (0 to 250) over message size from 1 Byte to 2 MB, comparing native MPI, MARMOT, and MARMOT serialized.
Latency on an IA64 cluster with Myrinet
Figure: latency in ms (0.01 to 100, logarithmic scale) over message size from 1 Byte to 2 MB, comparing native MPI, MARMOT, and MARMOT serialized.
cg.B on an IA32 cluster with Myrinet
Figure: total Mop/s (0 to 2500) for 1, 2, 4, 8, and 16 processors, comparing native MPI, MARMOT, and MARMOT serialized.
is.B on an IA32 cluster with Myrinet
Figure: total Mop/s (0 to 180) for 1, 2, 4, 8, and 16 processors, comparing native MPI, MARMOT, and MARMOT serialized.
is.A on an IA32 cluster with Myrinet
Figure: total Mop/s (0 to 180) for 1, 2, 4, 8, and 16 processors, comparing native MPI, MARMOT, and MARMOT serialized.
Performance with Real Applications
CrossGrid Application: WP 1.4: Air pollution modeling
• Air pollution modeling with the STEM-II model
• Transport equation solved with the Petrov-Crank-Nicolson-Galerkin method
• Chemistry and mass transfer are integrated using semi-implicit Euler and pseudo-analytical methods
• 15500 lines of Fortran code
• 12 different MPI calls:
  – MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Type_extent, MPI_Type_struct, MPI_Type_commit, MPI_Type_hvector, MPI_Bcast, MPI_Scatterv, MPI_Barrier, MPI_Gatherv, MPI_Finalize
STEM application on an IA32 cluster with Myrinet
Figure: execution time in seconds (0 to 80) for 1 to 16 processors, comparing native MPI and MARMOT.
CrossGrid Application: WP 1.3: High Energy Physics
• Filtering of real-time data with neural networks (ANN application)
• 11500 lines of C code
• 11 different MPI calls:
  – MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Get_processor_name, MPI_Barrier, MPI_Gather, MPI_Recv, MPI_Send, MPI_Bcast, MPI_Reduce, MPI_Finalize
Figure: multi-level trigger and filtering chain (level 1: special hardware; level 2: embedded processors; level 3: PCs), reducing the data rate from 40 MHz (40 TB/sec) to 75 KHz (75 GB/sec), 5 KHz (5 GB/sec), and finally 100 Hz (100 MB/sec) for data recording & offline analysis.
HEP application on an IA32 cluster with Myrinet
Figure: execution time in seconds (0 to 400) for 2 to 16 processors, comparing native MPI and MARMOT.
CrossGrid Application: WP 1.1: Medical Application
• Calculation of blood flow with Lattice-Boltzmann method
• Stripped down application with 6500 lines of C code
• 14 different MPI calls:
  – MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Pack, MPI_Bcast, MPI_Unpack, MPI_Cart_create, MPI_Cart_shift, MPI_Send, MPI_Recv, MPI_Barrier, MPI_Reduce, MPI_Sendrecv, MPI_Finalize
Medical application on an IA32 cluster with Myrinet
Figure: time per iteration in seconds (0 to 0.6) for 1 to 16 processors, comparing native MPI and MARMOT.
Message statistics with native MPI
Message statistics with MARMOT
Medical application on an IA32 cluster with Myrinet without barrier
Figure: time per iteration in seconds (0 to 0.6) for 1 to 16 processors, comparing native MPI, MARMOT, and MARMOT without barrier.
Barrier with native MPI
Barrier with MARMOT
Feedback of Crossgrid Applications
• Task 1.1 (biomedical):
  – C application
  – Identified issues:
    • Possible race conditions due to use of MPI_ANY_SOURCE
• Task 1.2 (flood):
  – Fortran application
  – Identified issues:
    • Tags outside of valid range
    • Possible race conditions due to use of MPI_ANY_SOURCE
• Task 1.3 (HEP):
  – ANN (C application)
  – No issues found by MARMOT
• Task 1.4 (meteo):
  – STEM-II (Fortran)
  – MARMOT detected holes in self-defined datatypes used in MPI_Scatterv and MPI_Gatherv. These holes were removed, which helped to improve the performance of the communication (a sketch of such a hole follows below).
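To illustrate what such a hole looks like, here is a hypothetical C sketch (the STEM-II code itself is Fortran and is not reproduced here): alignment padding between struct members ends up inside a datatype built with MPI_Type_struct, which makes the type non-contiguous and forces the library to pack and unpack buffers in MPI_Scatterv/MPI_Gatherv.

  #include <stddef.h>
  #include "mpi.h"

  /* Hypothetical sketch of a "hole" in a self-defined datatype
  ** (not the STEM-II source). */
  struct cell {
      int    flag;    /* the compiler typically inserts 4 bytes of padding here */
      double value;
  };

  void build_cell_type( MPI_Datatype *newtype )
  {
      int          blocklens[2] = { 1, 1 };
      MPI_Aint     displs[2]    = { offsetof( struct cell, flag ),
                                    offsetof( struct cell, value ) };
      MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };

      /* The gap between the members makes the datatype non-contiguous, so the
      ** library cannot transfer one contiguous block. Reordering the members
      ** (largest first) or sending the fields separately removes the hole.
      ** MPI_Type_struct is the MPI 1.2 binding used by the application. */
      MPI_Type_struct( 2, blocklens, displs, types, newtype );
      MPI_Type_commit( newtype );
  }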
Conclusion
• MARMOT supports MPI 1.2 for the C and Fortran bindings
• Tested successfully with several applications and platforms
• Performance sufficient for many applications
• Future work:
  – scalability and general performance improvements
  – distribute tests from server to clients
  – better user interface to present problems and warnings
  – extended functionality:
    • more tests to verify collective calls
    • MPI-2
    • hybrid programming
Thanks for your attention