MPI Application Development Using the Analysis Tool MARMOT
Bettina Krammer
HLRS
High Performance Computing Center Stuttgart
Allmandring 30, D-70550 Stuttgart
http://www.hlrs.de
Overview
• General Problems of MPI Programming
• Related Work
• Design of MARMOT
• Examples
• Performance Results with Benchmarks and Real Applications
• Outlook
Problems of MPI Programming
• All problems of serial programming
• Additional problems:
  – Increased difficulty to verify correctness of the program
  – Increased difficulty to debug N parallel processes
  – New parallel problems (deadlock, race conditions)
  – Portability between different MPI implementations (e.g. PACX-MPI)
Related Work I: Parallel Debuggers
• Examples: totalview, DDT, p2d2 • Advantages:
– Same approach and tool as in serial case• Disadvantages:
– Can only fix problems after and if they occur– Scalability: How can you debug programs that crash after 3
hours on 512 nodes?– Reproducibility: How to debug a program that crashes only
every fifth time?– It does not help to improve portability
Related Work II: Debug version of MPI Library
• Examples:
  – catches some incorrect usage, e.g. node count in MPI_CART_CREATE (mpich)
  – deadlock detection (NEC MPI)
• Advantages:
  – good scalability
  – better debugging in combination with totalview
• Disadvantages:
  – Portability: only helps when using this particular MPI implementation
  – Trade-off between performance and safety
  – Reproducibility: does not help to debug irreproducible programs
Related Work III: Special Verification Tools
• Special tools dedicated to message checking, examples:
• MPI-CHECK
  – Restricted to Fortran code
  – Automatic compile-time and run-time analysis
  – Currently not under active development
• Umpire
  – Mature tool, but not freely available
  – First version limited to shared memory platforms
  – Distributed memory version in preparation
• New Approach: MARMOT
Design of MARMOT
Design of MARMOT I
• Design Goals of MARMOT:
  – Portability
    • verify that your program is a correct MPI program
  – Reproducibility
    • detect possible race conditions
    • pseudo-serialize the program to make it reproducible
  – Scalability
    • automatic debugging wherever possible
Design of MARMOT II

Figure: architecture of MARMOT. The application or test program calls the MARMOT core tool through the MPI profiling interface; the core tool forwards the calls to the MPI library and reports to the Debug Server, which runs as an additional process.
Design of MARMOT III
• Library written in C++ that is linked to the application
• This library consists of the debug clients and one debug server
• No source code modification is required except for adding an additional process working as debug server, i.e. the application has to be run with mpirun for n+1 instead of n processes
• Main interface = MPI profiling interface according to the MPI standard 1.2 (a sketch of such a wrapper follows below)
• Implementation of the C language binding of MPI
• Implementation of the Fortran language binding as a wrapper to the C interface
• Environment variables control tool behavior and output (report of errors, warnings and/or remarks, trace-back, etc.)
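To make the profiling-interface mechanism concrete, the following is a minimal sketch of such a wrapper. It is an illustration under the usual PMPI conventions, not MARMOT's actual source, and the wrapper signature has to match the prototype in the local mpi.h (the pre-MPI-3 form is shown here):

  #include <stdio.h>
  #include "mpi.h"

  /* Sketch of a profiling-interface wrapper: the tool intercepts MPI_Send
  ** under its standard name, performs its checks, and forwards the call
  ** to the MPI library through the PMPI_ entry point. */
  int MPI_Send( void *buf, int count, MPI_Datatype datatype,
                int dest, int tag, MPI_Comm comm )
  {
      if( count < 0 )
          fprintf( stderr, "ERROR: MPI_Send called with negative count %d\n", count );
      if( tag < 0 )
          fprintf( stderr, "ERROR: MPI_Send called with negative tag %d\n", tag );
      /* a tool like MARMOT would also notify its debug server here */
      return PMPI_Send( buf, count, datatype, dest, tag, comm );
  }

Because the debug server occupies one extra process, an application that is normally started with mpirun -np 4 is run with mpirun -np 5 under MARMOT.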
Client Checks: verification on the local nodes
• Verification of MPI_Request usage
  – invalid recycling of active request
  – invalid use of unregistered request
  – warning if number of requests is zero
  – warning if all requests are MPI_REQUEST_NULL
• Verification of tag range (see the small example below)
• Verification if requested Cartesian communicator has correct size
• Verification of communicator in Cartesian calls
• Verification of groups in group calls
• Verification of sizes in calls that create groups or communicators
• Verification if ranges are valid (e.g. in group constructor calls)
• Verification if ranges are distinct (e.g. MPI_Group_incl, -excl)
• Check for pending messages and active requests in MPI_Finalize
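As a small, hypothetical example of an error these local checks would flag (not taken from any of the applications discussed here), the following receive passes a negative tag that is not MPI_ANY_TAG, i.e. a value outside the range from 0 up to MPI_TAG_UB permitted by the standard:

  #include "mpi.h"

  int main( int argc, char **argv )
  {
      int value = 0;
      MPI_Status status;

      MPI_Init( &argc, &argv );
      /* erroneous on purpose: -5 is neither MPI_ANY_TAG nor a valid tag,
      ** so the client-side tag-range check reports it */
      MPI_Recv( &value, 1, MPI_INT, 0, -5, MPI_COMM_WORLD, &status );
      MPI_Finalize();
      return 0;
  }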
Server Checks: verification between the nodes, control of program
• Everything that requires a global view
• Control the execution flow
• Signal conditions, e.g. deadlocks
• Check matching send/receive pairs for consistency (see the sketch below)
• Output log (report errors etc.)
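As a hypothetical sketch of an error that only this global view can catch (again not taken from a real application), the following send and receive match by rank and tag but disagree on the transferred datatype; each call looks correct on its own node, and only a comparison of both sides reveals the mismatch:

  #include "mpi.h"

  int main( int argc, char **argv )
  {
      int        ivalue = 42;
      double     dvalue = 0.0;
      int        rank   = -1;
      MPI_Status status;

      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      if( rank == 0 )
          /* sends one MPI_INT with tag 23 ... */
          MPI_Send( &ivalue, 1, MPI_INT, 1, 23, MPI_COMM_WORLD );
      if( rank == 1 )
          /* ... which is received as one MPI_DOUBLE: a type mismatch */
          MPI_Recv( &dvalue, 1, MPI_DOUBLE, 0, 23, MPI_COMM_WORLD, &status );
      MPI_Finalize();
      return 0;
  }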
Comparison with related projects
Name/Features         | MPI-CHECK                                          | MARMOT                                                                                                                          | Umpire
MPI standard          | parts of 1.2/2.0                                   | complete 1.2                                                                                                                    | parts of 1.2
C support             | no                                                 | yes                                                                                                                             | yes
C++ support (MPI 2.0) | no                                                 | no                                                                                                                              | ?
Fortran77 support     | yes                                                | yes                                                                                                                             | yes
Fortran90 support     | yes                                                | yes                                                                                                                             | yes
Design                | parsing and automatic modification of source code  | use of profiling interface, central manager to control execution and collect information, no source code modification required | use of profiling interface, central manager to control execution and collect information, no source code modification required
Functionality         | automatic compile-time and run-time checking       | automatic run-time checking                                                                                                     | automatic run-time checking
Platform              | basically any UNIX environment                     | basically any UNIX environment                                                                                                  | shared memory platforms
Comparison with related projects
Name/Features | MPI-CHECK | MARMOT | Umpire
Check correct use of arguments (ranks, tags, types, communicators, groups, datatypes, user-defined resources, requests, ...) | partly (ranks, types, negative message lengths) | yes, basically all arguments are checked, including proper request handling | partly (data types, requests, communicators)
Check correct handling of requests | no | yes | yes, partly
Check mismatched collective operations | yes | no | yes
Deadlock detection | yes, algorithm based on handshaking strategy, combined with user-specified time-out | yes, algorithm based on user-specified time-out mechanism, traceback on each node possible | yes, algorithm tries to construct dependency graphs, combined with user-specified time-out
Detect race conditions | no | warnings, e.g. when calls are using wildcards | no
Examples
Example 1: request-reuse (source code)
/*
** Here we re-use a request we didn't free before
*/
#include <stdio.h>
#include <assert.h>
#include "mpi.h"

int main( int argc, char **argv )
{
    int size = -1;
    int rank = -1;
    int value = -1;
    int value2 = -1;
    MPI_Status send_status, recv_status;
    MPI_Request send_request, recv_request;

    printf( "We call Irecv and Isend with non-freed requests.\n" );
    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    printf( " I am rank %d of %d PEs\n", rank, size );
Example 1: request-reuse (source code continued)
    if( rank == 0 ) {
        /*** this is just to get the request used ***/
        MPI_Irecv( &value, 1, MPI_INT, 1, 18, MPI_COMM_WORLD, &recv_request );
        /*** going to receive the message and reuse a non-freed request ***/
        MPI_Irecv( &value, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &recv_request );
        MPI_Wait( &recv_request, &recv_status );
        assert( value == 19 );
    }
    if( rank == 1 ) {
        value2 = 19;
        /*** this is just to use the request ***/
        MPI_Isend( &value, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &send_request );
        /*** going to send the message ***/
        MPI_Isend( &value2, 1, MPI_INT, 0, 17, MPI_COMM_WORLD, &send_request );
        MPI_Wait( &send_request, &send_status );
    }
    MPI_Finalize();
    return 0;
}
Example 1: request-reuse (output log)
We call Irecv and Isend with non-freed requests.
1 rank 0 performs MPI_Init
2 rank 1 performs MPI_Init
3 rank 0 performs MPI_Comm_size
4 rank 1 performs MPI_Comm_size
5 rank 0 performs MPI_Comm_rank
6 rank 1 performs MPI_Comm_rank
 I am rank 0 of 2 PEs
7 rank 0 performs MPI_Irecv
 I am rank 1 of 2 PEs
8 rank 1 performs MPI_Isend
9 rank 0 performs MPI_Irecv
10 rank 1 performs MPI_Isend
ERROR: MPI_Irecv Request is still in use !!
11 rank 0 performs MPI_Wait
ERROR: MPI_Isend Request is still in use !!
12 rank 1 performs MPI_Wait
13 rank 0 performs MPI_Finalize
14 rank 1 performs MPI_Finalize
Example 2: deadlock (source code)
/*
** This program produces a deadlock.
** At least 2 nodes are required to run the program.
**
** Rank 0 recv a message from Rank 1.
** Rank 1 recv a message from Rank 0.
**
** AFTERWARDS:
** Rank 0 sends a message to Rank 1.
** Rank 1 sends a message to Rank 0.
*/
#include <stdio.h>
#include "mpi.h"

int main( int argc, char** argv )
{
    int rank = 0;
    int size = 0;
    int dummy = 0;
    MPI_Status status;
Example 2: deadlock (source code continued)
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    if( size < 2 ) {
        fprintf( stderr, " This program needs at least 2 PEs!\n" );
    }
    else {
        if( rank == 0 ) {
            MPI_Recv( &dummy, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 1, 18, MPI_COMM_WORLD );
        }
        if( rank == 1 ) {
            MPI_Recv( &dummy, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 0, 17, MPI_COMM_WORLD );
        }
    }
    MPI_Finalize();
    return 0;
}
Example 2: deadlock (output log)
$ mpirun -np 3 deadlock1
1 rank 0 performs MPI_Init
2 rank 1 performs MPI_Init
3 rank 0 performs MPI_Comm_rank
4 rank 1 performs MPI_Comm_rank
5 rank 0 performs MPI_Comm_size
6 rank 1 performs MPI_Comm_size
7 rank 0 performs MPI_Recv
8 rank 1 performs MPI_Recv
8 Rank 0 is pending!
8 Rank 1 is pending!
WARNING: deadlock detected, all clients are pending
Example 2: deadlock (output log continued)
Last calls (max. 10) on node 0:
timestamp = 1: MPI_Init( *argc, ***argv )
timestamp = 3: MPI_Comm_rank( comm, *rank )
timestamp = 5: MPI_Comm_size( comm, *size )
timestamp = 7: MPI_Recv( *buf, count = -1, datatype = non-predefined datatype, source = -1, tag = -1, comm, *status )

Last calls (max. 10) on node 1:
timestamp = 2: MPI_Init( *argc, ***argv )
timestamp = 4: MPI_Comm_rank( comm, *rank )
timestamp = 6: MPI_Comm_size( comm, *size )
timestamp = 8: MPI_Recv( *buf, count = -1, datatype = non-predefined datatype, source = -1, tag = -1, comm, *status )
Current status of MARMOT
• Full MPI 1.2 implemented
• C and Fortran bindings are supported
• Used for several Fortran and C benchmarks (NASPB) and applications (CrossGrid project and others)
• Tests on different platforms, using different compilers and MPI implementations, e.g.
  – IA32/IA64 clusters (Intel, g++ compiler) with mpich
  – IBM Regatta
  – NEC SX5
  – Hitachi SR8000
Performance with Benchmarks
Bandwidth on an IA64 cluster with Myrinet
Figure: bandwidth in MB/s (0 to 250) over message size from 1 Byte to 2 MB, comparing native MPI, MARMOT, and MARMOT serialized.
Latency on an IA64 cluster with Myrinet
Figure: latency in ms (0.01 to 100, logarithmic scale) over message size from 1 Byte to 2 MB, comparing native MPI, MARMOT, and MARMOT serialized.
cg.B on an IA32 cluster with Myrinet
Figure: total Mop/s (0 to 2500) for 1, 2, 4, 8, and 16 processors, comparing native MPI, MARMOT, and MARMOT serialized.
is.B on an IA32 cluster with Myrinet
Figure: total Mop/s (0 to 180) for 1, 2, 4, 8, and 16 processors, comparing native MPI, MARMOT, and MARMOT serialized.
is.A on an IA32 cluster with Myrinet
Figure: total Mop/s (0 to 180) for 1, 2, 4, 8, and 16 processors, comparing native MPI, MARMOT, and MARMOT serialized.
Performance with Real Applications
CrossGrid Application: WP 1.4: Air pollution modeling
• Air pollution modeling with the STEM-II model
• Transport equation solved with the Petrov-Crank-Nicolson-Galerkin method
• Chemistry and mass transfer are integrated using semi-implicit Euler and pseudo-analytical methods
• 15500 lines of Fortran code
• 12 different MPI calls:
  – MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Type_extent, MPI_Type_struct, MPI_Type_commit, MPI_Type_hvector, MPI_Bcast, MPI_Scatterv, MPI_Barrier, MPI_Gatherv, MPI_Finalize
STEM application on an IA32 cluster with Myrinet
Figure: execution time in seconds (0 to 80) for 1 to 16 processors, comparing native MPI and MARMOT.
CrossGrid Application: WP 1.3: High Energy Physics
• Filtering of real-time data with neural networks (ANN application)
• 11500 lines of C code
• 11 different MPI calls:
  – MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Get_processor_name, MPI_Barrier, MPI_Gather, MPI_Recv, MPI_Send, MPI_Bcast, MPI_Reduce, MPI_Finalize
Figure: multi-level trigger and filtering chain (level 1: special hardware; level 2: embedded processors; level 3: PCs), reducing the data rate from 40 MHz (40 TB/sec) to 75 KHz (75 GB/sec), 5 KHz (5 GB/sec), and finally 100 Hz (100 MB/sec) for data recording & offline analysis.
HEP application on an IA32 cluster with Myrinet
Figure: execution time in seconds (0 to 400) for 2 to 16 processors, comparing native MPI and MARMOT.
CrossGrid Application: WP 1.1: Medical Application
• Calculation of blood flow with Lattice-Boltzmann method
• Stripped down application with 6500 lines of C code
• 14 different MPI calls:
  – MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Pack, MPI_Bcast, MPI_Unpack, MPI_Cart_create, MPI_Cart_shift, MPI_Send, MPI_Recv, MPI_Barrier, MPI_Reduce, MPI_Sendrecv, MPI_Finalize
Medical application on an IA32 cluster with Myrinet
Figure: time per iteration in seconds (0 to 0.6) for 1 to 16 processors, comparing native MPI and MARMOT.
Message statistics with native MPI
Message statistics with MARMOT
Medical application on an IA32 cluster with Myrinet without barrier
Figure: time per iteration in seconds (0 to 0.6) for 1 to 16 processors, comparing native MPI, MARMOT, and MARMOT without barrier.
Barrier with native MPI
Barrier with MARMOT
Feedback of Crossgrid Applications
• Task 1.1 (biomedical):
  – C application
  – Identified issues:
    • Possible race conditions due to use of MPI_ANY_SOURCE
• Task 1.2 (flood):
  – Fortran application
  – Identified issues:
    • Tags outside of valid range
    • Possible race conditions due to use of MPI_ANY_SOURCE
• Task 1.3 (HEP):
  – ANN (C application)
  – No issues found by MARMOT
• Task 1.4 (meteo):
  – STEM-II (Fortran)
  – MARMOT detected holes in self-defined datatypes used in MPI_Scatterv and MPI_Gatherv. These holes were removed, which helped to improve the performance of the communication (a sketch of such a hole follows below).
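To illustrate what such a hole looks like, here is a hypothetical C sketch (the STEM-II code itself is Fortran and is not reproduced here): alignment padding between struct members ends up inside a datatype built with MPI_Type_struct, which makes the type non-contiguous and forces the library to pack and unpack buffers in MPI_Scatterv/MPI_Gatherv.

  #include <stddef.h>
  #include "mpi.h"

  /* Hypothetical sketch of a "hole" in a self-defined datatype
  ** (not the STEM-II source). */
  struct cell {
      int    flag;    /* the compiler typically inserts 4 bytes of padding here */
      double value;
  };

  void build_cell_type( MPI_Datatype *newtype )
  {
      int          blocklens[2] = { 1, 1 };
      MPI_Aint     displs[2]    = { offsetof( struct cell, flag ),
                                    offsetof( struct cell, value ) };
      MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };

      /* The gap between the members makes the datatype non-contiguous, so the
      ** library cannot transfer one contiguous block. Reordering the members
      ** (largest first) or sending the fields separately removes the hole.
      ** MPI_Type_struct is the MPI 1.2 binding used by the application. */
      MPI_Type_struct( 2, blocklens, displs, types, newtype );
      MPI_Type_commit( newtype );
  }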
Conclusion
• MARMOT supports MPI 1.2 for the C and Fortran bindings
• Tested successfully with several applications and platforms
• Performance sufficient for many applications
• Future work:
  – scalability and general performance improvements
  – distribute tests from server to clients
  – better user interface to present problems and warnings
  – extended functionality:
    • more tests to verify collective calls
    • MPI-2
    • hybrid programming
Thanks for your attention