1

MPI and MPICH on Clusters

Rusty Lusk

Mathematics and Computer Science Division

Argonne National Laboratory

2

Outline

MPI implementations on clusters
  – Unix
  – NT

Cluster-related activities at Argonne
  – new cluster
  – MPICH news and plans
  – MPICH on NT
  – other stuff
    – scalable performance visualization
    – managing jobs and processes
    – parallel I/O

3

MPI Implementations

MPI’s design for portability + performance has inspired a wide variety of implementations

Vendor implementations for their own machines

Implementations from software vendors

Freely available public implementations for a number of environments

Experimental implementations to explore research ideas

4

MPI Implementations from Vendors

IBM for SP and RS/6000 workstations, OS/390

Sun for Solaris systems and clusters of them

SGI for SGI Origin, Power Challenge, Cray T3E and C90

HP for Exemplar, HP workstations

Compaq for (Digital) parallel machines

MPI Software Technology, for Microsoft Windows NT, Linux, Mac

Fujitsu for VPP (Pallas)

NEC

Hitachi

Genias (for NT)

5

From the Public for the Public

MPICH
  – for many architectures, including clusters
  – http://www.mcs.anl.gov/mpi/mpich
  – new: http://www.mcs.anl.gov/mpi/mpich/mpich-nt

LAM
  – for clusters
  – http://lam.nd.edu

MPICH-based NT implementations
  – Aachen
  – Portugal
  – Argonne

6

Experimental Implementations

Real-time (Hughes)

Special networks and protocols (not TCP)
  – MPI-FM (U. of I., UCSD)
  – MPI-BIP (Lyon)
  – MPI-AM (Berkeley)
  – MPI over VIA (Berkeley, Parma)
  – MPI-MBCF (Tokyo)
  – MPI over SCI (Oslo)

Wide-area networks
  – Globus (MPICH-G, Argonne)
  – Legion (U. of Virginia)
  – MetaMPI (Germany)

Ames Lab (highly optimized subset)

More implementations listed at http://www.mpi.nd.edu/lam

7

Status of MPI-2 Implementations

Fujitsu: complete (from PALLAS)

NEC has I/O, complete early 2000, for SX

MPICH & LAM: C++ bindings and most of I/O

LAM has parts of dynamic and one-sided

HP has part of one-sided

HP and SGI have most of I/O

Sun, Compaq are working on parts of MPI-2

IBM has I/O, soon will have one-sided

EPCC did one-sided for Cray T3E

Experimental implementations (esp. one-sided)

8

Cluster Activities at Argonne

New cluster - Chiba City

MPICH

Other software

9

Chiba City

8 Computing Towns: 256 dual Pentium III systems

1 Visualization Town: 32 Pentium III systems with Matrox G400 cards

1 Storage Town: 8 Xeon systems with 300 GB disk each

Cluster Management: 12 PIII mayor systems, 4 PIII front-end systems, 2 Xeon file servers, 3.4 TB disk

High Performance Net: 64-bit Myrinet

Management Net: Gigabit and Fast Ethernet

Gigabit External Link

10

Chiba City System Details

Purpose:
  – Scalable CS research
  – Prototype application support

System - 314 computers:
  – 256 computing nodes: PIII 500 MHz, 512 MB, 9 GB local disk
  – 32 visualization nodes: PIII 500 MHz, 512 MB, Matrox G200
  – 8 storage nodes: 500 MHz Xeon, 512 MB, 300 GB disk (2.4 TB total)
  – 10 town mayors, 1 city mayor, other management systems: PIII 500 MHz, 512 MB, 3 TB disk

Communications:
  – 64-bit Myrinet computing net
  – Switched fast/gigabit Ethernet management net
  – Serial control network

Software Environment:
  – Linux (based on RH 6.0), plus "install your own" OS support
  – Compilers: GNU g++, PGI, etc.
  – Libraries and Tools: PETSc, MPICH, Globus, ROMIO, SUMMA3d, Jumpshot, Visualization, PVFS, HPSS, ADSM, PBS + Maui Scheduler

11

Software Research on Clusters at ANL

Scalable Systems Management
  – Chiba City Management Model (w/LANL, LBNL)

MPI and Communications Software
  – GigaNet, Myrinet, ServerNetII

Data Management and Grid Services
  – Globus Services on Linux (w/LBNL, ISI)

Visualization and Collaboration Tools
  – Parallel OpenGL server (w/Princeton, UIUC)
  – vTK and CAVE Software for Linux Clusters
  – Scalable Media Server (FL Voyager Server on Linux Cluster)

Scalable Display Environment and Tools
  – Virtual Frame Buffer Software (w/Princeton)
  – VNC (ATT) modifications for ActiveMural

Parallel I/O
  – MPI-IO and Parallel Filesystems Developments (w/Clemson, PVFS)

12

MPICH

Goals

Misconceptions about MPICH

MPICH architecture
  – the Abstract Device Interface (ADI)

Current work at Argonne on MPICH
  – work above the ADI
  – work below the ADI
  – a new ADI

13

Goals of MPICH

As a research project:
  – to explore tradeoffs between performance and portability in the context of the MPI standard
  – to study algorithms applicable to MPI implementation
  – to investigate interfaces between MPI and tools

As a software project:
  – to provide a portable, freely available MPI to everyone
  – to give vendors and others a running start in the development of specialized MPI implementations
  – to provide a testbed for other research groups working on particular aspects of message passing

14

Misconceptions About MPICH

It is pronounced (by its authors, at least) as “em-pee-eye-see-aitch”, not “em-pitch”.

It runs on networks of heterogeneous machines.

It runs MIMD parallel programs, not just SPMD.

It can use TCP, shared memory, or both at the same time (for networks of SMPs).

It runs over native communication on machines like the IBM SP and Cray T3E (not just TCP).

It is not for Unix only (new NT version).

It doesn't necessarily poll (depends on device).

15

MPICH Architecture

[Architecture diagram] MPI routines call ADI routines: the Abstract Device Interface separates the portable code above the device from the device-specific code below it. Channel device implementations include ch_p4 (sockets, sockets + shmem), ch_shmem, ch_eui, and ch_NT (sockets, optionally shmem); other devices include t3e and Globus.

16

Recent Work Above the Device

Complete MPI 1.2 compliance (in MPICH-1.2.0)
  – including even MPI_Cancel for sends (see the sketch after this list)

Better MPI derived-datatype packing
  – can also be done below the device

MPI-2 C++ bindings
  – thanks to the Notre Dame group

MPI-2 Fortran 90 module
  – permits "use mpi" instead of #include 'mpif.h'
  – extends work of Michael Hennecke

MPI-2 I/O
  – the ROMIO project
  – layers MPI I/O on any MPI implementation and file system
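
As an illustration of the MPI_Cancel corner case above, here is a minimal C sketch (not MPICH test-suite code) of cancelling a send; the tag values and the rank-1 cleanup step are just one way to arrange it.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, cancelled, data = 42;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank == 0) {
        /* Start a send that may never be matched, then try to cancel it. */
        MPI_Isend(&data, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &req);
        MPI_Cancel(&req);
        MPI_Wait(&req, &status);
        MPI_Test_cancelled(&status, &cancelled);
        printf("send %s cancelled\n", cancelled ? "was" : "was not");
        /* Tell rank 1 whether the data message is still coming. */
        MPI_Send(&cancelled, 1, MPI_INT, 1, 18, MPI_COMM_WORLD);
    } else if (size >= 2 && rank == 1) {
        MPI_Recv(&cancelled, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &status);
        if (!cancelled)   /* the original send went through; drain it */
            MPI_Recv(&data, 1, MPI_INT, 0, 17, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}
```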

17

Above the Device (continued)

Error message architecture
  – instance-specific error reporting: "rank 789 invalid" rather than "Invalid rank" (see the sketch after this list)
  – internationalization
    – German
    – Thai

Thread safety

Globus/NGI/collective
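
For context, a minimal user-side sketch of how such instance-specific error text reaches an application through the standard MPI error interface; the deliberately invalid rank 789 echoes the example above, and MPI_Errhandler_set is the MPI-1 name (MPI-2 adds MPI_Comm_set_errhandler).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int err, len, data = 0;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    /* Return error codes instead of aborting, so we can print them. */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Deliberately use an invalid destination rank. */
    err = MPI_Send(&data, 1, MPI_INT, 789, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, msg, &len);
        printf("MPI reported: %s\n", msg);   /* instance-specific text */
    }

    MPI_Finalize();
    return 0;
}
```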

18

Below the Device (continued)

Flow control for socket-based devices, as defined by IMPI

Better multi-protocol support for Linux SMPs
  – the mmap solution that is portable among other Unixes is not usable on Linux
    – can't use MAP_ANONYMOUS with MAP_SHARED!
  – the SysV solution works but can cause problems (race condition in deallocation); see the sketch after this list
  – works better than before in MPICH-1.2.0

New NT implementation of the channel device
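
To illustrate the deallocation hazard, here is a generic SysV shared-memory sketch (illustrative only, not MPICH's ch_shmem/ch_p4 code): the segment has to be marked for removal after it is attached, so a process that dies between creation and removal leaves the segment behind.

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* One shared 1 MB region for parent and child. */
    int shmid = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); exit(1); }

    char *buf = shmat(shmid, NULL, 0);
    if (buf == (void *) -1) { perror("shmat"); exit(1); }

    /* Mark the segment for removal now: it disappears automatically
       once the last process detaches.  If the process dies between
       shmget() and this call, the segment is leaked -- the
       deallocation race mentioned above. */
    shmctl(shmid, IPC_RMID, NULL);

    if (fork() == 0) {          /* child inherits the attachment */
        buf[0] = 'x';           /* stands in for real message traffic */
        _exit(0);
    }

    wait(NULL);
    printf("child wrote '%c'\n", buf[0]);
    shmdt(buf);
    return 0;
}
```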

19

MPICH for NT

open source: build with MS Visual C++ 6.0 and Digital Visual Fortran 6.0, or download dll

complete MPI 1.2, shares above-device code with Unix MPICH

correct implementation, passes test suites from ANL, IBM, Intel

not yet fully optimized

only socket device in current release

DCOM-based job launcher

working on other devices, ADI-3 implementation

http://www.mcs.anl.gov/~ashton/mpichbeta.html

20

Preliminary Experiments with Shared Memory on NT

[Chart: "smp ping" – bandwidth in MBytes/sec (0 to 1200) versus message size (1 to 100,000,000 bytes) for four configurations: shmem, shmem stream 20k, shprocess, shprocess fixed]

21

ADI-3: A New Abstract Device

Motivated by:

Requirements of MPI-2 that are inconsistent with the ADI-2 design
  – thread safety
  – dynamic process management
  – one-sided operations

New capabilities of user-accessible hardware
  – LAPI
  – VIA/SIO (NGIO, FIO)
  – other network interfaces (Myrinet)

Desire for peak efficiency
  – top-to-bottom overhaul of MPICH
  – ADI-1: speed of implementation; ADI-2: portability

22

Runtime Environment Research

Fast startup of MPICH jobs via the mpd

Experiment with process manager, job manager, and scheduler interfaces for parallel jobs

[Diagram: mpirun interacting with the scheduler, job manager, and process manager]

23

Scalable Logfiles: SLOG

From IBM, via AIX tracing

Uses the MPI profiling mechanism (see the sketch after this list)

Both automatic and user-defined states

Can support large logfiles, yet find and display sections quickly

Freely available API for reading/writing SLOG files

Current format read by Jumpshot
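
A minimal sketch of the profiling mechanism SLOG relies on (not the actual MPE/SLOG logging code): the logging library intercepts an MPI call by defining it itself and forwarding to the PMPI_ entry point, shown here with the MPI-1.2 binding of MPI_Send.

```c
#include <mpi.h>
#include <stdio.h>

/* The profiling library's MPI_Send shadows the real one and times it;
   a real tool would append a timestamped SLOG record instead of
   printing to stderr. */
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int err = PMPI_Send(buf, count, datatype, dest, tag, comm);
    double t1 = MPI_Wtime();

    fprintf(stderr, "MPI_Send to %d took %.6f s\n", dest, t1 - t0);
    return err;
}
```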

24

Jumpshot

25

Parallel I/O for Clusters

ROMIO is an implementation of (almost all of) the I/O part of the MPI standard; a small usage sketch follows this slide's bullets.

It can utilize multiple file systems and MPI implementations.

Included in MPICH, LAM, and MPI from SGI, HP, and NEC

A combination for Clusters: Linux, MPICH, and PVFS (Parallel Virtual File System from Clemson).
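
A minimal usage sketch, assuming ROMIO (or any MPI-IO implementation) is available; the file name "testfile" and the block size are arbitrary choices for illustration.

```c
#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, i, buf[N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < N; i++)
        buf[i] = rank * N + i;

    /* Every rank opens the same file and writes its own block. */
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, (MPI_Offset) rank * N * sizeof(int),
                          buf, N, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```

The same MPI-IO calls run unchanged across file systems; ROMIO dispatches to a file-system-specific driver (for example, its PVFS driver) underneath.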

26

Conclusion

There are many MPI implementations for clusters; MPICH is one.

MPI implementation, particularly for fast networks, remains an active research area

Argonne National Laboratory, all of whose software is open source, has a number of ongoing and new cluster-related activities:
  – new cluster
  – MPICH
  – tools

27

Available in November