Using MPI-2 - Advanced Features


    Page i

    Using MPI-2

    Page ii

    Scientific and Engineering Computation

    Janusz Kowalik, editor

    Data-Parallel Programming on MIMD Computers,

    Philip J. Hatcher and Michael J. Quinn, 1991

    Unstructured Scientific Computation on Scalable Multiprocessors,

    edited by Piyush Mehrotra, Joel Saltz, and Robert Voigt, 1992

    Parallel Computational Fluid Dynamics: Implementation and Results,

    edited by Horst D. Simon, 1992

    Enterprise Integration Modeling: Proceedings of the First International Conference,

    edited by Charles J. Petrie, Jr., 1992

    The High Performance Fortran Handbook,

    Charles H. Koelbel, David B. Loveman, Robert S. Schreiber, Guy L. Steele Jr. and Mary E. Zosel, 1994

PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing,

    Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Bob Manchek, and Vaidy Sunderam, 1994

    Practical Parallel Programming,

    Gregory V. Wilson, 1995

    Enabling Technologies for Petaflops Computing,

    Thomas Sterling, Paul Messina, and Paul H. Smith, 1995

    An Introduction to High-Performance Scientific Computing,

    Lloyd D. Fosdick, Elizabeth R. Jessup, Carolyn J. C. Schauble, and Gitta Domik, 1995

    Parallel Programming Using C++,

    edited by Gregory V. Wilson and Paul Lu, 1996

    Using PLAPACK: Parallel Linear Algebra Package,

    Robert A. van de Geijn, 1997


    Fortran 95 Handbook,

    Jeanne C. Adams, Walter S. Brainerd, Jeanne T. Martin, Brian T. Smith, Jerrold L. Wagener, 1997

MPI - The Complete Reference: Volume 1, The MPI Core,

    Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra, 1998

MPI - The Complete Reference: Volume 2, The MPI-2 Extensions,

    William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg,

    William Saphir, and Marc Snir, 1998

    A Programmer's Guide to ZPL,

    Lawrence Snyder, 1999

    How to Build a Beowulf,

    Thomas L. Sterling, John Salmon, Donald J. Becker, and Daniel F. Savarese, 1999

    Using MPI: Portable Parallel Programming with the Message-Passing Interface, second edition,

    William Gropp, Ewing Lusk, and Anthony Skjellum, 1999

    Using MPI-2: Advanced Features of the Message-Passing Interface,

    William Gropp, Ewing Lusk, and Rajeev Thakur, 1999

    Page iii

    Using MPI-2

    Advanced Features of the Message-Passing Interface

    William Gropp

    Ewing Lusk

    Rajeev Thakur

    Page iv

© 1999 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

This book was set in LaTeX by the authors and was printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

    Gropp, William.

    Using MPI-2: advanced features of the message-passing interface /

    William Gropp, Ewing Lusk, Rajeev Thakur.

p. cm. (Scientific and engineering computation)

Includes bibliographical references and index.

ISBN 0-262-57133-1 (pbk.: alk. paper)

1. Parallel programming (Computer science). 2. Parallel computers - Programming. 3. Computer interfaces. I. Lusk, Ewing. II. Thakur, Rajeev. III. Title. IV. Series.

    QA76.642.G762 1999

005.2'75 - dc21 99-042972


    2.2.3 MPI I/O to Separate Files 16

    2.2.4 Parallel MPI I/O to a Single File 19

    2.2.5 Fortran 90 Version 21

    2.2.6 Reading the File with a Different Number of Processes 22

    2.2.7 C++ Version 24

    2.2.8 Other Ways to Write to a Shared File 28

    2.3 Remote Memory Access 29

    2.3.1 The Basic Idea: Memory Windows 30

2.3.2 RMA Version of cpi 30

    2.4 Dynamic Process Management 36

    2.4.1 Spawning Processes 37

    2.4.2 Parallel cp: A Simple System Utility 38

    2.5 More Info on Info 47

    Page viii

    2.5.1 Motivation, Description, and Rationale 47

    2.5.2 An Example from Parallel I/O 47

    2.5.3 An Example from Dynamic Process Management 48

    2.6 Summary 50

    3

    Parallel I/O

    51

    3.1 Introduction 51

    3.2 Using MPI for Simple I/O 51

    3.2.1 Using Individual File Pointers 52

    3.2.2 Using Explicit Offsets 55

    3.2.3 Writing to a File 59


    3.3 Noncontiguous Accesses and Collective I/O 59

    3.3.1 Noncontiguous Accesses 60

    3.3.2 Collective I/O 64

    3.4 Accessing Arrays Stored in Files 67

    3.4.1 Distributed Arrays 68

    3.4.2 A Word of Warning about Darray 71

    3.4.3 Subarray Datatype Constructor 72

    3.4.4 Local Array with Ghost Area 74

    3.4.5 Irregularly Distributed Arrays 78

    3.5 Nonblocking I/O and Split Collective I/O 81

    3.6 Shared File Pointers 83

    3.7 Passing Hints to the Implementation 85

    3.8 Consistency Semantics 89

    3.8.1 Simple Cases 89

    3.8.2 Accessing a Common File Opened with MPI_COMM_WORLD 91

    3.8.3 Accessing a Common File Opened with MPI_COMM_SELF 94

    3.8.4 General Recommendation 95

    3.9 File Interoperability 95

    3.9.1 File Structure 96

    3.9.2 File Data Representation 97

    3.9.3 Use of Datatypes for Portability 98


    Page ix

    3.9.4 User-Defined Data Representations 100

    3.10 Achieving High I/O Performance with MPI 101

    3.10.1 The Four "Levels" of Access 101

    3.10.2 Performance Results 105

    3.10.3 Upshot Graphs 106

    3.11 An Astrophysics Example 112

    3.11.1 ASTRO3D I/O Requirements 112

    3.11.2 Implementing the I/O with MPI 114

    3.11.3 Header Issues 116

    3.12 Summary 118

    4

    Understanding Synchronization

    119

    4.1 Introduction 119

    4.2 Synchronization in Message Passing 119

    4.3 Comparison with Shared Memory 127

    4.3.1 Volatile Variables 129

    4.3.2 Write Ordering 130

    4.3.3 Comments 131

    5

Introduction to Remote Memory Operations

    133

    5.1 Introduction 135

    5.2 Contrast with Message Passing 136

    5.3 Memory Windows 139

    5.3.1 Hints on Choosing Window Parameters 141

    5.3.2 Relationship to Other Approaches 142


    5.4 Moving Data 142

    5.4.1 Reasons for Using Displacement Units 146

    5.4.2 Cautions in Using Displacement Units 147

    5.4.3 Displacement Sizes in Fortran 148

    5.5 Completing Data Transfers 148

    5.6 Examples of RMA Operations 150

    5.6.1 Mesh Ghost Cell Communication 150

    Page x

    5.6.2 Combining Communication and Computation 164

    5.7 Pitfalls in Accessing Memory 169

    5.7.1 Atomicity of Memory Operations 169

    5.7.2 Memory Coherency 171

    5.7.3 Some Simple Rules for RMA 171

    5.7.4 Overlapping Windows 173

    5.7.5 Compiler Optimizations 173

    5.8 Performance Tuning for RMA Operations 175

    5.8.1 Options for MPI_Win_create 175

    5.8.2 Options for MPI_Win_fence 177

    6

    Advanced Remote Memory Access

    181

    6.1 Introduction 181

    6.2 Lock and Unlock 181

    6.2.1 Implementing Blocking, Independent RMA Operations 183

    6.3 Allocating Memory for MPI Windows 184

    6.3.1 Using MPI_Alloc_mem from C/C++ 184


    6.3.2 Using MPI_Alloc_mem from Fortran 185

    6.4 Global Arrays 185

    6.4.1 Create and Free 188

    6.4.2 Put and Get 192

    6.4.3 Accumulate 194

6.5 Another Version of NXTVAL 194

    6.5.1 The Nonblocking Lock 197

6.5.2 A Nonscalable Implementation of NXTVAL 197

    6.5.3 Window Attributes 201

6.5.4 A Scalable Implementation of NXTVAL 204

    6.6 An RMA Mutex 208

    6.7 The Rest of Global Arrays 210

    6.7.1 Read and Increment 210

    6.7.2 Mutual Exclusion for Global Arrays 210

    6.7.3 Comments on the MPI Version of Global Arrays 212

    Page xi

    6.8 Differences between RMA and Shared Memory 212

    6.9 Managing a Distributed Data Structure 215

    6.9.1 A Shared-Memory Distributed List Implementation 215

    6.9.2 An MPI Implementation of a Distributed List 216

    6.9.3 Handling Dynamically Changing Distributed Data Structures 220

    6.9.4 An MPI Implementation of a Dynamic Distributed List 224

    6.10 Compiler Optimization and Passive Targets 225

    6.11 Scalable Synchronization 228


    6.11.1 Exposure Epochs 229

    6.11.2 The Ghost-Point Exchange Revisited 229

    6.11.3 Performance Optimizations for Scalable Synchronization 231

    6.12 Summary 232

    7

    Dynamic Process Management

    233

    7.1 Introduction 233

    7.2 Creating New MPI Processes 233

    7.2.1 Intercommunicators 234

    7.2.2 Matrix-Vector Multiplication Example 235

    7.2.3 Intercommunicator Collective Operations 238

    7.2.4 Intercommunicator Point-to-Point Communication 239

    7.2.5 Finding the Number of Available Processes 242

    7.2.6 Passing Command-Line Arguments to Spawned Programs 245

    7.3 Connecting MPI Processes 245

    7.3.1 Visualizing the Computation in an MPI Program 247

    7.3.2 Accepting Connections from Other Programs 249

    7.3.3 Comparison with Sockets 251

    7.3.4 Moving Data between Groups of Processes 253

    7.3.5 Name Publishing 254

    7.4 Design of the MPI Dynamic Process Routines 258

    7.4.1 Goals for MPI Dynamic Process Management 258


    Page xii

    7.4.2 What MPI Did Not Standardize 260

    8

    Using MPI with Threads

    261

    8.1 Thread Basics and Issues 261

    8.1.1 Thread Safety 262

    8.1.2 Threads and Processes 263

    8.2 MPI and Threads 263

    8.3 Yet Another Version of NXTVAL 266

    8.4 Implementing Nonblocking Collective Operations 268

    8.5 Mixed-Model Programming: MPI for SMP Clusters 269

    9

    Advanced Features

    273

    9.1 Defining New File Data Representations 273

    9.2 External Interface Functions 275

    9.2.1 Decoding Datatypes 277

    9.2.2 Generalized Requests 279

    9.2.3 Adding New Error Codes and Classes 285

    9.3 Mixed-Language Programming 289

    9.4 Attribute Caching 292

    9.5 Error Handling 295

    9.5.1 Error Handlers 295

    9.5.2 Error Codes and Classes 297

    9.6 Topics Not Covered in This Book 298

10

    Conclusions

    301

    10.1 New Classes of Parallel Programs 301


    10.2 MPI-2 Implementation Status 301

    10.2.1 Vendor Implementations 301

    10.2.2 Free, Portable Implementations 302

    10.2.3 Layering 302

    10.3 Where Does MPI Go from Here? 302

    10.3.1 More Remote Memory Operations 303

    Page xiii

    10.3.2 More on Threads 303

    10.3.3 More Language Bindings 304

    10.3.4 Interoperability of MPI Implementations 304

    10.3.5 Real-Time MPI 304

    10.4 Final Words 304

    A

    Summary of MPI-2 Routines and Their Arguments

    307

    B

    MPI Resources on the World Wide Web

    355

    C

    Surprises, Questions, and Problems in MPI

    357

    D

    Standardizing External Startup with mpiexec

    361

    References 365

    Subject Index 373

    Function and Term Index 379


    Page xv

Series Foreword

The world of modern computing potentially offers many helpful methods and tools to scientists and engineers, but the fast pace of change in computer hardware, software, and algorithms often makes practical use of the newest computing technology difficult. The Scientific and Engineering Computation series focuses on rapid advances in computing technologies, with the aim of facilitating transfer of these technologies to applications in science and engineering. It will include books on theories, methods, and original applications in such areas as parallelism, large-scale simulations, time-critical computing, computer-aided design and engineering, use of computers in manufacturing, visualization of scientific data, and human-machine interface technology.

The series is intended to help scientists and engineers understand the current world of advanced computation and to anticipate future developments that will affect their computing environments and open up new capabilities and modes of computation.

This book describes how to use advanced features of the Message-Passing Interface (MPI), a communication library specification for both parallel computers and workstation networks. MPI has been developed as a community standard for message passing and related operations. Its adoption by both users and implementers has provided the parallel-programming community with the portability and features needed to develop application programs and parallel libraries that will tap the power of today's (and tomorrow's) high-performance computers.

    JANUSZ S. KOWALIK

    Page xvii

Preface

MPI (Message-Passing Interface) is a standard library interface for writing parallel programs. MPI was developed in two phases by an open forum of parallel computer vendors, library writers, and application developers. The first phase took place in 1993-1994 and culminated in the first release of the MPI standard, which we call MPI-1. A number of important topics in parallel computing had been deliberately left out of MPI-1 in order to speed its release, and the MPI Forum began meeting again in 1995 to address these topics, as well as to make minor corrections and clarifications to MPI-1 that had been discovered to be necessary. The MPI-2 Standard was released in the summer of 1997. The official Standard documents for MPI-1 (the current version as updated by the MPI-2 forum is 1.2) and MPI-2 are available on the Web at http://www.mpi-forum.org. More polished versions of the standard documents are published by MIT Press in the two volumes of MPI - The Complete Reference [27, 79].

These official documents and the books that describe them are organized so that they will be useful as reference works. The structure of the presentation is according to the chapters of the standard, which in turn reflects the subcommittee structure of the MPI Forum.

In 1994, two of the present authors, together with Anthony Skjellum, wrote Using MPI: Portable Parallel Programming with the Message-Passing Interface [31], a quite differently structured book on MPI-1, taking a more tutorial approach to the material. A second edition [32] of that book has now appeared as a companion to this one, covering the most recent additions and clarifications to the material of MPI-1, and bringing it up to date in various other ways as well. This book takes the same tutorial, example-driven approach to its material that Using MPI does, applying it to the topics of MPI-2. These topics include parallel I/O, dynamic process management, remote memory operations, and external interfaces.

About This Book

Following the pattern set in Using MPI, we do not follow the order of chapters in the MPI-2 Standard, nor do we follow the order of material within a chapter as in the Standard. Instead, we have organized the material in each chapter according to the complexity of the programs we use as examples, starting with simple examples and moving to more complex ones. We do assume that the reader is familiar with at least the simpler aspects of MPI-1. It is not necessary to have read Using MPI, but it wouldn't hurt.


    Page xviii

We begin in Chapter 1 with an overview of the current situation in parallel computing, many aspects of which have changed in the past five years. We summarize the new topics covered in MPI-2 and their relationship to the current and (what we see as) the near-future parallel computing environment.

MPI-2 is not "MPI-1, only more complicated." There are simple and useful parts of MPI-2, and in Chapter 2 we introduce them with simple examples of parallel I/O, dynamic process management, and remote memory operations.

In Chapter 3 we dig deeper into parallel I/O, perhaps the "missing feature" most requested by users of MPI-1. We describe the parallel I/O features of MPI, how to use them in a graduated series of examples, and how they can be used to get high performance, particularly on today's parallel/high-performance file systems.

In Chapter 4 we explore some of the issues of synchronization between senders and receivers of data. We examine in detail what happens (and what must happen) when data is moved between processes. This sets the stage for explaining the design of MPI's remote memory operations in the following chapters.

Chapters 5 and 6 cover MPI's approach to remote memory operations. This can be regarded as the MPI approach to shared memory, since shared-memory and remote-memory operations have much in common. At the same time they are different, since access to the remote memory is through MPI function calls, not some kind of language-supported construct (such as a global pointer or array). This difference arises because MPI is intended to be portable to distributed-memory machines, even heterogeneous clusters.

Because remote memory access operations are different in many ways from message passing, the discussion of remote memory access is divided into two chapters. Chapter 5 covers the basics of remote memory access and a simple synchronization model. Chapter 6 covers more general types of remote memory access and more complex synchronization models.

Chapter 7 covers MPI's relatively straightforward approach to dynamic process management, including both spawning new processes and dynamically connecting to running MPI programs.

The recent rise of the importance of small to medium-size SMPs (shared-memory multiprocessors) means that the interaction of MPI with threads is now far more important than at the time of MPI-1. MPI-2 does not define a standard interface to thread libraries because such an interface already exists, namely, the POSIX threads interface [42]. MPI instead provides a number of features designed to facilitate the use of multithreaded MPI programs. We describe these features in Chapter 8.

In Chapter 9 we describe some advanced features of MPI-2 that are particularly useful to library writers. These features include defining new file data representations, using MPI's external interface functions to build layered libraries, support for mixed-language programming, attribute caching, and error handling.

Page xix

In Chapter 10 we summarize our journey through the new types of parallel programs enabled by MPI-2, comment on the current status of MPI-2 implementations, and speculate on future directions for MPI.

Appendix A contains the C, C++, and Fortran bindings for all the MPI-2 functions.

Appendix B describes how to obtain supplementary material for this book, including complete source code for the examples, and related MPI materials that are available via anonymous ftp and on the World Wide Web.

In Appendix C we discuss some of the surprises, questions, and problems in MPI, including what we view as some shortcomings in the MPI-2 Standard as it is now. We can't be too critical (because we shared in its creation!), but experience and reflection have caused us to reexamine certain topics.

Appendix D covers the MPI program launcher, mpiexec, which the MPI-2 Standard recommends that all implementations support. The availability of a standard interface for running MPI programs further increases the portability of MPI applications, and we hope that this material will encourage MPI users to expect and demand mpiexec from the suppliers of MPI implementations.

In addition to the normal subject index, there is an index for the usage examples and definitions of the MPI-2 functions, constants, and terms used in this book.


We try to be impartial in the use of C, Fortran, and C++ in the book's examples. The MPI Standard has tried to keep the syntax of its calls similar in C and Fortran; for C++ the differences are inevitably a little greater, although the MPI Forum adopted a conservative approach to the C++ bindings rather than a complete object library. When we need to refer to an MPI function without regard to language, we use the C version just because it is a little easier to read in running text.

This book is not a reference manual, in which MPI functions would be grouped according to functionality and completely defined. Instead we present MPI functions informally, in the context of example programs. Precise definitions are given in Volume 2 of MPI - The Complete Reference [27] and in the MPI-2 Standard [59]. Nonetheless, to increase the usefulness of this book to someone working with MPI, we have provided the calling sequences in C, Fortran, and C++ for each MPI-2 function that we discuss. These listings can be found set off in boxes located near where the functions are introduced. C bindings are given in ANSI C style. Arguments that can be of several types (typically message buffers) are defined as void* in C. In the Fortran boxes, such arguments are marked as being of type <type>. This means that one of the appropriate Fortran data types should be used. To find the "binding box" for a given MPI routine, one should use the appropriate bold-face reference in the Function and Term Index: C for C, f90 for Fortran, and C++ for C++. Another place to find this information is in Appendix A, which lists all MPI functions in alphabetical order for each language.

Page xx

Acknowledgments

We thank all those who participated in the MPI-2 Forum. These are the people who created MPI-2, discussed a wide variety of topics (many not included here) with seriousness, intelligence, and wit, and thus shaped our ideas on these areas of parallel computing. The following people (besides ourselves) attended the MPI Forum meetings at one time or another during the formulation of MPI-2: Greg Astfalk, Robert Babb, Ed Benson, Rajesh Bordawekar, Pete Bradley, Peter Brennan, Ron Brightwell, Maciej Brodowicz, Eric Brunner, Greg Burns, Margaret Cahir, Pang Chen, Ying Chen, Albert Cheng, Yong Cho, Joel Clark, Lyndon Clarke, Laurie Costello, Dennis Cottel, Jim Cownie, Zhenqian Cui, Suresh Damodaran-Kamal, Raja Daoud, Judith Devaney, David DiNucci, Doug Doefler, Jack Dongarra, Terry Dontje, Nathan Doss, Anne Elster, Mark Fallon, Karl Feind, Sam Fineberg, Craig Fischberg, Stephen Fleischman, Ian Foster, Hubertus Franke, Richard Frost, Al Geist, Robert George, David Greenberg, John Hagedorn, Kei Harada, Leslie Hart, Shane Hebert, Rolf Hempel, Tom Henderson, Alex Ho, Hans-Christian Hoppe, Steven Huss-Lederman, Joefon Jann, Terry Jones, Carl Kesselman, Koichi Konishi, Susan Kraus, Steve Kubica, Steve Landherr, Mario Lauria, Mark Law, Juan Leon, Lloyd Lewins, Ziyang Lu, Andrew Lumsdaine, Bob Madahar, Peter Madams, John May, Oliver McBryan, Brian McCandless, Tyce McLarty, Thom McMahon, Harish Nag, Nick Nevin, Jarek Nieplocha, Bill Nitzberg, Ron Oldfield, Peter Ossadnik, Steve Otto, Peter Pacheco, Yoonho Park, Perry Partow, Pratap Pattnaik, Elsie Pierce, Paul Pierce, Heidi Poxon, Jean-Pierre Prost, Boris Protopopov, James Pruyve, Rolf Rabenseifner, Joe Rieken, Peter Rigsbee, Tom Robey, Anna Rounbehler, Nobutoshi Sagawa, Arindam Saha, Eric Salo, Darren Sanders, William Saphir, Eric Sharakan, Andrew Sherman, Fred Shirley, Lance Shuler, A. Gordon Smith, Marc Snir, Ian Stockdale, David Taylor, Stephen Taylor, Greg Tensa, Marydell Tholburn, Dick Treumann, Simon Tsang, Manuel Ujaldon, David Walker, Jerrell Watts, Klaus Wolf, Parkson Wong, and Dave Wright. We also acknowledge the valuable input from many persons around the world who participated in MPI Forum discussions via e-mail.

Our interactions with the many users of MPICH have been the source of ideas, examples, and code fragments. Other members of the MPICH group at Argonne have made critical contributions to MPICH and other MPI-related tools that we have used in the preparation of this book. Particular thanks go to Debbie Swider for her enthusiastic and insightful work on MPICH implementation and interaction with users, and to Omer Zaki and Anthony Chan for their work on Upshot and Jumpshot, the performance visualization tools we use with MPICH.

Page xxi

We thank PALLAS GmbH, particularly Hans-Christian Hoppe and Thomas Kentemich, for testing some of the MPI-2 code examples in this book on the Fujitsu MPI implementation.

Gail Pieper, technical writer in the Mathematics and Computer Science Division at Argonne, was our indispensable guide in matters of style and usage and vastly improved the readability of our prose.


    Page 1

1

Introduction

When the MPI Standard was first released in 1994, its ultimate significance was unknown. Although the Standard was the result of a consensus among parallel computer vendors, computer scientists, and application developers, no one knew to what extent implementations would appear or how many parallel applications would rely on it.

Now the situation has clarified. All parallel computing vendors supply their users with MPI implementations, and there are freely available implementations that both compete with vendor implementations on their platforms and supply MPI solutions for heterogeneous networks. Applications large and small have been ported to MPI, and new applications are being written. MPI's goal of stimulating the development of parallel libraries by enabling them to be portable has been realized, and an increasing number of applications become parallel purely through the use of parallel libraries.

This book is about how to use MPI-2, the collection of advanced features that were added to MPI by the second MPI Forum. In this chapter we review in more detail the origins of both MPI-1 and MPI-2. We give an overview of what new functionality has been added to MPI by the release of the MPI-2 Standard. We conclude with a summary of the goals of this book and its organization.

1.1 Background

We present here a brief history of MPI, since some aspects of MPI can be better understood in the context of its development. An excellent description of the history of MPI can also be found in [36].

1.1.1 Ancient History

In the early 1990s, high-performance computing was in the process of converting from the vector machines that had dominated scientific computing in the 1980s to massively parallel processors (MPPs) such as the IBM SP-1, the Thinking Machines CM-5, and the Intel Paragon. In addition, people were beginning to use networks of desktop workstations as parallel computers. Both the MPPs and the workstation networks shared the message-passing model of parallel computation, but programs were not portable. The MPP vendors competed with one another on the syntax of their message-passing libraries. Portable libraries, such as PVM [24], p4 [8], and TCGMSG [35], appeared from the research community and became widely used on workstation networks. Some of them allowed portability to MPPs as well, but

    Page 2

there was no unified, common syntax that would enable a program to run in all the parallel environments that were suitable for it from the hardware point of view.

1.1.2 The MPI Forum

Starting with a workshop in 1992, the MPI Forum was formally organized at Supercomputing '92. MPI succeeded because the effort attracted a broad spectrum of the parallel computing community. Vendors sent their best technical people. The authors of portable libraries participated, and applications programmers were represented as well. The MPI Forum met every six weeks starting in January 1993 and released MPI in the summer of 1994.

To complete its work in a timely manner, the Forum strictly circumscribed its topics. It developed a standard for the strict message-passing model, in which all data transfer is a cooperative operation among participating processes. It was assumed that the number of processes was fixed and that processes were started by some (unspecified) mechanism external to MPI. I/O was ignored, and language bindings were limited to C and Fortran 77. Within these limits, however, the Forum delved deeply, producing a very full-featured message-passing library. In addition to creating a portable syntax for familiar message-passing functions, MPI introduced (or substantially extended the development of) a number of new concepts, such as derived datatypes, contexts, and communicators. MPI constituted a major advance over all existing message-passing libraries in terms of features, precise semantics, and the potential for highly optimized implementations.


In the year following its release, MPI was taken up enthusiastically by users, and a 1995 survey by the Ohio Supercomputer Center showed that even its more esoteric features found users. The MPICH portable implementation [30], layered on top of existing vendor systems, was available immediately, since it had evolved along with the standard. Other portable implementations appeared, particularly LAM [7], and then vendor implementations in short order, some of them leveraging MPICH. The first edition of Using MPI [31] appeared in the fall of 1994, and we like to think that it helped win users to the new Standard.

But the very success of MPI-1 drew attention to what was not there. PVM users missed dynamic process creation, and several users needed parallel I/O. The success of the Cray shmem library on the Cray T3D and the active-message library on the CM-5 made users aware of the advantages of "one-sided" operations in algorithm design. The MPI Forum would have to go back to work.

    Page 3

1.1.3 The MPI-2 Forum

The modern history of MPI begins in the spring of 1995, when the Forum resumed its meeting schedule, with both veterans of MPI-1 and about an equal number of new participants. In the previous three years, much had changed in parallel computing, and these changes would accelerate during the two years the MPI-2 Forum would meet.

On the hardware front, a consolidation of MPP vendors occurred, with Thinking Machines Corp., Meiko, and Intel all leaving the marketplace. New entries such as Convex (now absorbed into Hewlett-Packard) and SGI (now having absorbed Cray Research) championed a shared-memory model of parallel computation although they supported MPI (passing messages through shared memory), and many applications found that the message-passing model was still well suited for extracting peak performance on shared-memory (really NUMA) hardware. Small-scale shared-memory multiprocessors (SMPs) became available from workstation vendors and even PC manufacturers. Fast commodity-priced networks, driven by the PC marketplace, became so inexpensive that clusters of PCs, combined with inexpensive networks, started to appear as "home-brew" parallel supercomputers. A new federal program, the Accelerated Strategic Computing Initiative (ASCI), funded the development of the largest parallel computers ever built, with thousands of processors. ASCI planned for its huge applications to use MPI.

On the software front, MPI, as represented by MPI-1, became ubiquitous as the application programming interface (API) for the message-passing model. The model itself remained healthy. Even on flat shared-memory and NUMA (nonuniform memory access) machines, users found the message-passing model a good way to control cache behavior and thus performance. The perceived complexity of programming with the message-passing model was alleviated by two developments. The first was the convenience of the MPI interface itself, once programmers became more comfortable with it as the result of both experience and tutorial presentations. The second was the appearance of libraries that hide much of the MPI-level complexity from the application programmer. Examples are PETSc [3], ScaLAPACK [12], and PLAPACK [94]. This second development is especially satisfying because it was an explicit design goal for the MPI Forum to encourage the development of libraries by including features that libraries particularly needed.

At the same time, non-message-passing models have been explored. Some of these may be beneficial if actually adopted as portable standards; others may still require interaction with MPI to achieve scalability. Here we briefly summarize two promising, but quite different, approaches.

    Page 4

Explicit multithreading is the use of an API that manipulates threads (see [32] for definitions) within a single address space. This approach may be sufficient on systems that can devote a large number of CPUs to servicing a single process, but interprocess communication will still need to be used on scalable systems. The MPI API has been designed to be thread safe. However, not all implementations are thread safe. An MPI-2 feature is to allow applications to request and MPI implementations to report their level of thread safety (see Chapter 8).

In some cases the compiler generates the thread parallelism. In such cases the application or library uses only the MPI API, and additional parallelism is uncovered by the compiler and expressed in the code it generates. Some compilers do this unaided; others respond to directives in the form of specific comments in the code.

OpenMP is a proposed standard for compiler directives for expressing parallelism, with particular emphasis on loop-level parallelism. Both C [68] and Fortran [67] versions exist.


Thus the MPI-2 Forum met during a time of great dynamism in parallel programming models. What did the Forum do, and what did it come up with?

1.2 What's New in MPI-2?

The MPI-2 Forum began meeting in March of 1995. Since the MPI-1 Forum was judged to have been a successful effort, the new Forum procedures were kept the same as for MPI-1. Anyone was welcome to attend the Forum meetings, which were held every six weeks. Minutes of the meetings were posted to the Forum mailing lists, and chapter drafts were circulated publicly for comments between meetings. At meetings, subcommittees for various chapters met and hammered out details, and the final version of the standard was the result of multiple votes by the entire Forum.

The first action of the Forum was to correct errors and clarify a number of issues that had caused misunderstandings in the original document of July 1994, which was retroactively labeled MPI-1.0. These minor modifications, encapsulated as MPI-1.1, were released in May 1995. Corrections and clarifications to MPI-1 topics continued during the next two years, and the MPI-2 document contains MPI-1.2 as a chapter (Chapter 3) of the MPI-2 release, which is the current version of the MPI standard. MPI-1.2 also contains a number of topics that belong in spirit to the MPI-1 discussion, although they were added by the MPI-2 Forum.

    Page 5

MPI-2 has three "large," completely new areas, which represent extensions of the MPI programming model substantially beyond the strict message-passing model represented by MPI-1. These areas are parallel I/O, remote memory operations, and dynamic process management. In addition, MPI-2 introduces a number of features designed to make all of MPI more robust and convenient to use, such as external interface specifications, C++ and Fortran-90 bindings, support for threads, and mixed-language programming.

1.2.1 Parallel I/O

The parallel I/O part of MPI-2, sometimes just called MPI-IO, originated independently of the Forum activities, as an effort within IBM to explore the analogy between input/output and message passing. After all, one can think of writing to a file as analogous to sending a message to the file system and reading from a file as receiving a message from it. Furthermore, any parallel I/O system is likely to need collective operations, ways of defining noncontiguous data layouts both in memory and in files, and nonblocking operations. In other words, it will need a number of concepts that have already been satisfactorily specified and implemented in MPI. The first study of the MPI-IO idea was carried out at IBM Research [71]. The effort was expanded to include a group at NASA Ames, and the resulting specification appeared in [15]. After that, an open e-mail discussion group was formed, and this group released a series of proposals, culminating in [90]. At that point the group merged with the MPI Forum, and I/O became a part of MPI-2. The I/O specification evolved further over the course of the Forum meetings, until MPI-2 was finalized in July 1997.

In general, I/O in MPI-2 can be thought of as Unix I/O plus quite a lot more. That is, MPI does include analogues of the basic operations of open, close, seek, read, and write. The arguments for these functions are similar to those of the corresponding Unix I/O operations, making an initial port of existing programs to MPI relatively straightforward. The purpose of parallel I/O in MPI, however, is to achieve much higher performance than the Unix API can deliver, and serious users of MPI must avail themselves of the more advanced features, which include

• noncontiguous access in both memory and file,

• collective I/O operations,

• use of explicit offsets to avoid separate seeks,

• both individual and shared file pointers,

• nonblocking I/O,

• portable and customized data representations, and


    Page 6

• hints for the implementation and file system.

We will explore in detail in Chapter 3 exactly how to exploit these features. We will find out there just how the I/O API defined by MPI enables optimizations that the Unix I/O API precludes.
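As a small foretaste of Chapter 3, the following sketch combines two of these features, explicit offsets and collective I/O; the file name and the data layout here are invented purely for illustration and are not taken from the examples in this book.

/* sketch: each process writes its own block of a common file with a
   collective call at an explicit byte offset */
#include "mpi.h"
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    /* offset is in bytes; the call is collective over MPI_COMM_WORLD */
    MPI_File_write_at_all(fh, (MPI_Offset) myrank * BUFSIZE * sizeof(int),
                          buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}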

1.2.2 Remote Memory Operations

The hallmark of the message-passing model is that data is moved from the address space of one process to that of another by means of a cooperative operation such as a send/receive pair. This restriction sharply distinguishes the message-passing model from the shared-memory model, in which processes have access to a common pool of memory and can simply perform ordinary memory operations (load from, store into) on some set of addresses.

In MPI-2, an API is defined that provides elements of the shared-memory model in an MPI environment. These are called MPI's "one-sided" or "remote memory" operations. Their design was governed by the need to

• balance efficiency and portability across several classes of architectures, including shared-memory multiprocessors (SMPs), nonuniform memory access (NUMA) machines, distributed-memory massively parallel processors (MPPs), SMP clusters, and even heterogeneous networks;

• retain the "look and feel" of MPI-1;

• deal with subtle memory behavior issues, such as cache coherence and sequential consistency; and

• separate synchronization from data movement to enhance performance.

The resulting design is based on the idea of remote memory access windows: portions of each process's address space that it explicitly exposes to remote memory operations by other processes defined by an MPI communicator. Then the one-sided operations put, get, and accumulate can store into, load from, and update, respectively, the windows exposed by other processes. All remote memory operations are nonblocking, and synchronization operations are necessary to ensure their completion. A variety of such synchronization operations are provided, some for simplicity, some for precise control, and some for their analogy with shared-memory synchronization operations. In Chapter 4, we explore some of the issues of synchronization between senders and receivers of data. Chapters 5 and 6 describe the remote memory operations of MPI-2 in detail.
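A minimal sketch of the window idea, using the fence synchronization of the simple model covered in Chapter 5 and names chosen only for illustration, might look like this: process 0 exposes an array through a window, and every process deposits one integer into it with a put.

/* sketch: expose memory on process 0 and let every process put one int */
#include "mpi.h"

int main(int argc, char *argv[])
{
    int myrank, nprocs, value, *array = NULL;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (myrank == 0)
        MPI_Alloc_mem(nprocs * sizeof(int), MPI_INFO_NULL, &array);
    /* every process creates the window; only process 0 exposes any memory */
    MPI_Win_create(array, (myrank == 0) ? nprocs * sizeof(int) : 0,
                   sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    value = myrank;
    MPI_Win_fence(0, win);
    MPI_Put(&value, 1, MPI_INT, 0, myrank, 1, MPI_INT, win);
    MPI_Win_fence(0, win);     /* completes all the puts */
    MPI_Win_free(&win);
    if (myrank == 0)
        MPI_Free_mem(array);
    MPI_Finalize();
    return 0;
}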

    Page 7

1.2.3 Dynamic Process Management

The third major departure from the programming model defined by MPI-1 is the ability of an MPI process to participate in the creation of new MPI processes or to establish communication with MPI processes that have been started separately. The main issues faced in designing an API for dynamic process management are

• maintaining simplicity and flexibility;

• interacting with the operating system, the resource manager, and the process manager in a complex system software environment; and

• avoiding race conditions that compromise correctness.

The key to correctness is to make the dynamic process management operations collective, both among the processes doing the creation of new processes and among the new processes being created. The resulting sets of processes are represented in an intercommunicator. Intercommunicators (communicators containing two groups of processes rather than one) are an esoteric feature of MPI-1, but are fundamental for the MPI-2 dynamic process operations. The two families of operations defined in MPI-2, both based on intercommunicators, are creating of new sets of processes, called spawning, and establishing communication with pre-existing MPI programs, called connecting. The latter capability allows applications to have parallel-client/parallel-server structures of processes. Details of the dynamic process management operations can be found in Chapter 7.
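For spawning, the central routine is MPI_Comm_spawn, covered in Chapter 7. The sketch below, with a hypothetical executable name "worker," shows only the shape of a call: the spawn is collective over the parents' communicator, and the intercommunicator it returns contains the parents in one group and the children in the other.

/* sketch: collectively start 4 copies of a (hypothetical) program "worker" */
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Comm workers;

    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &workers, MPI_ERRCODES_IGNORE);
    /* parents and children can now communicate through the
       intercommunicator "workers" */
    MPI_Comm_free(&workers);
    MPI_Finalize();
    return 0;
}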


1.2.4 Odds and Ends

Besides the above "big three," the MPI-2 specification covers a number of issues that were not discussed in MPI-1.

Extended Collective Operations

Extended collective operations in MPI-2 are analogous to the collective operations of MPI-1, but are defined for use on intercommunicators. (In MPI-1, collective operations are restricted to intracommunicators.) MPI-2 also extends the MPI-1 intracommunicator collective operations to allow an "in place" option, in which the send and receive buffers are the same.
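As an illustration of the "in place" option only (the example itself is ours, not drawn from the book's figures), a reduction can use a single buffer for both input and result:

/* sketch: an Allreduce using the MPI-2 "in place" option */
#include "mpi.h"

int main(int argc, char *argv[])
{
    int i, myrank, sum[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < 4; i++)
        sum[i] = myrank + i;                /* local contribution */
    /* input is taken from sum, and the element-wise totals are left in sum */
    MPI_Allreduce(MPI_IN_PLACE, sum, 4, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}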

C++ and Fortran 90

In MPI-1, the only languages considered were C and Fortran, where Fortran was construed as Fortran 77. In MPI-2, all functions (including MPI-1 functions) have C++ bindings, and Fortran means Fortran 90 (or Fortran 95 [1]). For C++, the MPI-2 Forum chose a "minimal" approach in which the C++ versions of MPI functions are quite similar to the C versions, with classes defined for most of the MPI objects (such as MPI::Request for the C MPI_Request). Most MPI functions are member functions of MPI classes (easy to do because MPI has an object-oriented design), and others are in the MPI namespace.

Page 8

MPI can't take advantage of some Fortran-90 features, such as array sections, and some MPI functions, particularly ones like MPI_Send that use a "choice" argument, can run afoul of Fortran's compile-time type checking for arguments to routines. This is usually harmless but can cause warning messages. However, the use of choice arguments does not match the letter of the Fortran standard; some Fortran compilers may require the use of a compiler option to relax this restriction in the Fortran language.1 "Basic" and "extended" levels of support for Fortran 90 are provided in MPI-2. Essentially, basic support requires that mpif.h be valid in both fixed- and free-form format, and "extended" support includes an MPI module and some new functions that use parameterized types. Since these language extensions apply to all of MPI, not just MPI-2, they are covered in detail in the second edition of Using MPI [32] rather than in this book.

Language Interoperability

Language interoperability is a new feature in MPI-2. MPI-2 defines features, both by defining new functions and by specifying the behavior of implementations, that enable mixed-language programming, an area ignored by MPI-1.

External Interfaces

The external interfaces part of MPI makes it easy for libraries to extend MPI by accessing aspects of the implementation that are opaque in MPI-1. It aids in the construction of integrated tools, such as debuggers and performance analyzers, and is already being used in the early implementations of the MPI-2 I/O functionality [88].

Threads

MPI-1, other than designing a thread-safe interface, ignored the issue of threads. In MPI-2, threads are recognized as a potential part of an MPI programming environment. Users can inquire of an implementation at run time what

1. Because Fortran uses compile-time data-type matching rather than run-time data-type matching, it is invalid to make two calls to the same routine in which two different data types are used in the same argument position. This affects the "choice" arguments in the MPI Standard. For example, calling MPI_Send with a first argument of type integer and then with a first argument of type real is invalid in Fortran 77. In Fortran 90, when using the extended Fortran support, it is possible to allow arguments of different types by specifying the appropriate interfaces in the MPI module. However, this requires a different interface for each type and is not a practical approach for Fortran 90 derived types. MPI does provide for data-type checking, but does so at run time through a separate argument, the MPI datatype argument.


    Page 9

its level of thread-safety is. In cases where the implementation supports multiple levels of thread-safety, users can select the level that meets the application's needs while still providing the highest possible performance.
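The inquiry is made through MPI_Init_thread, which is discussed in Chapter 8. A minimal sketch, assuming the application would like full multithreading, is the following; the thread-level names are the standard MPI-2 constants.

/* sketch: request the highest level of thread support and see what was granted */
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        printf("implementation provides only thread level %d\n", provided);
    MPI_Finalize();
    return 0;
}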

1.3 Reading This Book

This book is not a complete reference book for MPI-2. We leave that to the Standard itself [59] and to the two volumes of MPI - The Complete Reference [27, 79]. This book, like its companion Using MPI, which focuses on MPI-1, is organized around using the concepts of MPI-2 in application programs. Hence we take an iterative approach. In the preceding section we presented a very high level overview of the contents of MPI-2. In the next chapter we demonstrate the use of several of these concepts in simple example programs. Then in the following chapters we go into each of the major areas of MPI-2 in detail.

We start with the parallel I/O capabilities of MPI in Chapter 3, since that has proven to be the single most desired part of MPI-2. In Chapter 4 we explore some of the issues of synchronization between senders and receivers of data. The complexity and importance of remote memory operations deserve two chapters, Chapters 5 and 6. The next chapter, Chapter 7, is on dynamic process management. We follow that with a chapter on MPI and threads, Chapter 8, since the mixture of multithreading and message passing is likely to become a widely used programming model. In Chapter 9 we consider some advanced features of MPI-2 that are particularly useful to library writers. We conclude in Chapter 10 with an assessment of possible future directions for MPI.

In each chapter we focus on example programs to illustrate MPI as it is actually used. Some miscellaneous minor topics will just appear where the example at hand seems to be a good fit for them. To find a discussion on a given topic, you can consult either the subject index or the function and term index, which is organized by MPI function name.

Finally, you may wish to consult the companion volume, Using MPI: Portable Parallel Programming with the Message-Passing Interface [32]. Some topics considered by the MPI-2 Forum are small extensions to MPI-1 topics and are covered in the second edition (1999) of Using MPI. Although we have tried to make this volume self-contained, some of the examples have their origins in the examples of Using MPI.

Now, let's get started!

    Page 11

2

Getting Started with MPI-2

In this chapter we demonstrate what MPI-2 "looks like," while deferring the details to later chapters. We use relatively simple examples to give a flavor of the new capabilities provided by MPI-2. We focus on the main areas of parallel I/O, remote memory operations, and dynamic process management, but along the way demonstrate MPI in its new language bindings, C++ and Fortran 90, and touch on a few new features of MPI-2 as they come up.

2.1 Portable Process Startup

One small but useful new feature of MPI-2 is the recommendation of a standard method for starting MPI programs. The simplest version of this is

    mpiexec -n 16 myprog

to run the program myprog with 16 processes.


Strictly speaking, how one starts MPI programs is outside the scope of the MPI specification, which says how to write MPI programs, not how to run them. MPI programs are expected to run in such a wide variety of computing environments, with different operating systems, job schedulers, process managers, and so forth, that standardizing on a multiple-process startup mechanism is impossible. Nonetheless, users who move their programs from one machine to another would like to be able to move their run scripts as well. Several current MPI implementations use mpirun to start MPI jobs. Since the mpirun programs are different from one implementation to another and expect different arguments, this has led to confusion, especially when multiple MPI implementations are installed on the same machine.

In light of all these considerations, the MPI Forum took the following approach, which appears in several other places in the MPI-2 Standard as well. It recommended to implementers that mpiexec be one of the methods for starting an MPI program, and then specified the formats of some of the arguments, which are optional. What it does say is that if an implementation supports startup of MPI jobs with mpiexec and uses the keywords for arguments that are described in the Standard, then the arguments must have the meanings specified in the Standard. That is,

    mpiexec -n 32 myprog

should start 32 MPI processes with 32 as the size of MPI_COMM_WORLD, and not do something else. The name mpiexec was chosen so as to avoid conflict with the various currently established meanings of mpirun.

    Page 12

Besides the -n argument, mpiexec has a small number of other arguments whose behavior is specified by MPI. In each case, the format is a reserved keyword preceded by a hyphen and followed (after whitespace) by a value. The other keywords are -soft, -host, -arch, -wdir, -path, and -file. They are most simply explained by examples.

    mpiexec -n 32 -soft 16 myprog

means that if 32 processes can't be started, because of scheduling constraints, for example, then start 16 instead. (The request for 32 processes is a "soft" request.)

    mpiexec -n 4 -host denali -wdir /home/me/outfiles myprog

means to start 4 processes (by default, a request for a given number of processes is "hard") on the specified host machine ("denali" is presumed to be a machine name known to mpiexec) and have them start with their working directories set to /home/me/outfiles.

    mpiexec -n 12 -soft 1:12 -arch sparc-solaris \
            -path /home/me/sunprogs myprog

says to try for 12 processes, but run any number up to 12 if 12 cannot be run, on a sparc-solaris machine, and look for myprog in the path /home/me/sunprogs, presumably the directory where the user compiles for that architecture. And finally,

    mpiexec -file myfile

tells mpiexec to look in myfile for instructions on what to do. The format of myfile is left to the implementation. More details on mpiexec, including how to start multiple processes with different executables, can be found in Appendix D.

2.2 Parallel I/O

Parallel I/O in MPI starts with functions familiar to users of standard "language" I/O or libraries. MPI also has additional features necessary for performance and portability. In this section we focus on the MPI counterparts of opening and closing files and reading and writing contiguous blocks of data from/to them. At this level the main feature we show is how MPI can conveniently express parallelism in these operations. We give several variations of a simple example in which processes write a single array of integers to a file.


    Page 13

    Figure 2.1

    Sequential I/O from a parallel program

2.2.1 Non-Parallel I/O from an MPI Program

MPI-1 does not have any explicit support for parallel I/O. Therefore, MPI applications developed over the past few years have had to do their I/O by relying on the features provided by the underlying operating system, typically Unix. The most straightforward way of doing this is just to have one process do all I/O. Let us start our sequence of example programs in this section by illustrating this technique, diagrammed in Figure 2.1. We assume that the set of processes have a distributed array of integers to be written to a file. For simplicity, we assume that each process has 100 integers of the array, whose total length thus depends on how many processes there are. In the figure, the circles represent processes; the upper rectangles represent the block of 100 integers in each process's memory; and the lower rectangle represents the file to be written. A program to write such an array is shown in Figure 2.2. The program begins with each process initializing its portion of the array. All processes but process 0 send their section to process 0. Process 0 first writes its own section and then receives the contributions from the other processes in turn (the rank is specified in MPI_Recv) and writes them to the file.

This is often the first way I/O is done in a parallel program that has been converted from a sequential program, since no changes are made to the I/O part of the program. (Note that in Figure 2.2, if numprocs is 1, no MPI communication operations are performed.) There are a number of other reasons why I/O in a parallel program may be done this way.

• The parallel machine on which the program is running may support I/O only from one process.

• One can use sophisticated I/O libraries, perhaps written as part of a high-level data-management layer, that do not have parallel I/O capability.

• The resulting single file is convenient for handling outside the program (by mv, cp, or ftp, for example).

    Page 14

/* example of sequential Unix write into a common file */
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, numprocs, buf[BUFSIZE];
    MPI_Status status;
    FILE *myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    for (i=0; i<BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    if (myrank != 0)
        MPI_Send(buf, BUFSIZE, MPI_INT, 0, 99, MPI_COMM_WORLD);
    else {
        myfile = fopen("testfile", "w");
        fwrite(buf, sizeof(int), BUFSIZE, myfile);
        for (i=1; i<numprocs; i++) {
            MPI_Recv(buf, BUFSIZE, MPI_INT, i, 99, MPI_COMM_WORLD,
                     &status);
            fwrite(buf, sizeof(int), BUFSIZE, myfile);
        }
        fclose(myfile);
    }
    MPI_Finalize();
    return 0;
}

    Page 16

/* example of parallel Unix write into separate files */
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    FILE *myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i=0; i<BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    sprintf(filename, "testfile.%d", myrank);
    myfile = fopen(filename, "w");
    fwrite(buf, sizeof(int), BUFSIZE, myfile);
    fclose(myfile);
    MPI_Finalize();
    return 0;
}

  • 8/14/2019 Using MPI-2 - Advanced Features

    25/275

    MPI_INFO_NULL, &myfile);

    MPI_File_write(myfile, buf, BUFSIZE, MPI_INT,

    MPI_STATUS_IGNORE);

    MPI_File_close(&myfile);

    MPI_Finalize();

    return 0;

    igure 2.5

    MPI I/O to separate files

The file is opened here with the MPI function called MPI_File_open. Let us consider the arguments in the call

    MPI_File_open(MPI_COMM_SELF, filename,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &myfile);

one by one. The first argument is a communicator. In a way, this is the most significant new component of I/O in MPI. Files in MPI are opened by a collection of processes identified by an MPI communicator. This ensures that those processes operating on a file together know which other processes are also operating on the file and can communicate with one another. Here, since each process is opening its own file for its own exclusive use, it uses the communicator MPI_COMM_SELF.

    Page 18

The second argument is a string representing the name of the file, as in fopen. The third argument is the mode in which the file is opened. Here it is being both created (or overwritten if it exists) and will only be written to by this program. The constants MPI_MODE_CREATE and MPI_MODE_WRONLY represent bit flags that are or'd together in C, much as they are in the Unix system call open.

The fourth argument, MPI_INFO_NULL here, is a predefined constant representing a dummy value for the info argument to MPI_File_open. We will describe the MPI_Info object later in this chapter in Section 2.5. In our program we don't need any of its capabilities; hence we pass MPI_INFO_NULL to MPI_File_open. As the last argument, we pass the address of the MPI_File variable, which MPI_File_open will fill in for us. As with all MPI functions in C, MPI_File_open returns as the value of the function a return code, which we hope is MPI_SUCCESS. In our examples in this section, we do not check error codes, for simplicity.
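For readers who do want to check return codes, the following sketch shows one way to do so; it assumes the filename and myfile variables of the program above and relies on the fact that the default error handler for file operations is MPI_ERRORS_RETURN, so that errors on files are returned to the caller rather than aborting the program.

    int errcode, resultlen;
    char errstring[MPI_MAX_ERROR_STRING];

    errcode = MPI_File_open(MPI_COMM_SELF, filename,
                            MPI_MODE_CREATE | MPI_MODE_WRONLY,
                            MPI_INFO_NULL, &myfile);
    if (errcode != MPI_SUCCESS) {
        /* convert the error code into readable text and give up */
        MPI_Error_string(errcode, errstring, &resultlen);
        fprintf(stderr, "MPI_File_open failed: %s\n", errstring);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }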

The next function, which actually does the I/O in this program, is

    MPI_File_write(myfile, buf, BUFSIZE, MPI_INT,
                   MPI_STATUS_IGNORE);

Here we see the analogy between I/O and message passing that was alluded to in Chapter 1. The data to be written is described by the (address, count, datatype) method used to describe messages in MPI-1. This way of describing a buffer to be written (or read) gives the same two advantages as it does in message passing: it allows arbitrary distributions of noncontiguous data in memory to be written with a single call, and it expresses the datatype, rather than just the length, of the data to be written, so that meaningful transformations can be done on it as it is read or written, for heterogeneous environments. Here we just have a contiguous buffer of BUFSIZE integers, starting at address buf. The final argument to MPI_File_write is a "status" argument, of the same type as returned by MPI_Recv. We shall see its use below. In this case we choose to ignore its value. MPI-2 specifies that the special value MPI_STATUS_IGNORE can be passed to any MPI function in place of a status argument, to tell the MPI implementation not to bother filling in the status information because the user intends to ignore it. This technique can slightly improve performance when status information is not needed.

Finally, the function

    MPI_File_close(&myfile);

closes the file. The address of myfile is passed rather than the variable itself because the MPI implementation will replace its value with the constant MPI_FILE_NULL. Thus the user can detect invalid file objects.
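As a small sketch (assuming the myfile variable above), a program can use this property to guard against accidentally reusing a closed handle:

    if (myfile != MPI_FILE_NULL) {
        MPI_File_close(&myfile);   /* close only if still open */
    }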


    Page 19

/* example of parallel MPI write into a single file */
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    MPI_File thefile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i=0; i<BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &thefile);
    MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_write(thefile, buf, BUFSIZE, MPI_INT,
                   MPI_STATUS_IGNORE);
    MPI_File_close(&thefile);
    MPI_Finalize();
    return 0;
}


The new function in this program is MPI_File_set_view, called as

    MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

The first argument identifies the file. The second argument is the displacement (in bytes) into the file where the process's view of the file is to start. Here we multiply the size of the data to be written (BUFSIZE * sizeof(int)) by the rank of the process, so that each process's view starts at the appropriate place in the file. This argument is of a new type, MPI_Offset, which on systems that support large files can be expected to be a 64-bit integer. See Section 2.2.6 for further discussion.

The next argument is called the etype of the view; it specifies the unit of data in the file. Here it is MPI_INT, since we will always be writing some number of MPI_INTs to this file. The next argument, called the filetype, is a very flexible way of describing noncontiguous views in the file. In our simple case here, where there are no noncontiguous units to be written, we can just use the etype, MPI_INT. In general, etype and filetype can be any MPI predefined or derived datatype. See Chapter 3 for details.
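Although none of the examples in this chapter needs one, a noncontiguous filetype is easy to build from a derived datatype. The following sketch assumes the thefile, myrank, and numprocs variables of the example above, plus two hypothetical constants NW (integers per block) and NBLOCKS (blocks per process) and a buffer buf of NBLOCKS * NW integers; each process then writes its blocks interleaved round-robin by rank in the file.

    MPI_Datatype filetype;

    /* NBLOCKS blocks of NW ints each, separated by a stride of
       numprocs * NW ints, describe this process's share of the file */
    MPI_Type_vector(NBLOCKS, NW, NW * numprocs, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);
    MPI_File_set_view(thefile, (MPI_Offset) myrank * NW * sizeof(int),
                      MPI_INT, filetype, "native", MPI_INFO_NULL);
    MPI_File_write(thefile, buf, NBLOCKS * NW, MPI_INT,
                   MPI_STATUS_IGNORE);
    MPI_Type_free(&filetype);

Chapter 3 develops views of this kind in detail.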

The next argument is a character string denoting the data representation to be used in the file. The native representation specifies that data is to be represented in the file exactly as it is in memory. This preserves precision and results in no performance loss from conversion overhead. Other representations are internal and external32, which enable various degrees of file portability across machines with different architectures and thus different data representations. The final argument

    Page 21

    Table 2.1

    C bindings for the I/O functions used in Figure 2.6

int MPI_File_open(MPI_Comm comm, char *filename, int amode, MPI_Info info,
                  MPI_File *fh)

int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype,
                      MPI_Datatype filetype, char *datarep, MPI_Info info)

int MPI_File_write(MPI_File fh, void *buf, int count, MPI_Datatype datatype,
                   MPI_Status *status)

int MPI_File_close(MPI_File *fh)

is an info object, as in MPI_File_open. Here again it is to be ignored, as dictated by specifying MPI_INFO_NULL for this argument.
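To give a flavor of what the info argument is for (Section 2.5 has the details), here is a sketch of how one might pass a file-layout hint when opening the file; "striping_factor" is one of the reserved I/O hints in MPI-2, and an implementation is free to ignore it. The thefile variable is the one from the example above.

    MPI_Info info;

    MPI_Info_create(&info);
    /* ask that the file be striped across 4 I/O devices; a hint only */
    MPI_Info_set(info, "striping_factor", "4");
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &thefile);
    MPI_Info_free(&info);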

Now that each process has its own view, the actual write operation

    MPI_File_write(thefile, buf, BUFSIZE, MPI_INT,
                   MPI_STATUS_IGNORE);

is exactly the same as in our previous version of this program. But because the MPI_File_open specified MPI_COMM_WORLD in its communicator argument, and the MPI_File_set_view gave each process a different view of the file, the write operations proceed in parallel and all go into the same file in the appropriate places.

Why did we not need a call to MPI_File_set_view in the previous example? The reason is that the default view is that of a linear byte stream, with displacement 0 and both etype and filetype set to MPI_BYTE. This is compatible with the way we used the file in our previous example.
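In other words, the default view could have been established explicitly with a call such as the following sketch (using the myfile variable of that example); it is equivalent to what the implementation provides on its own.

    MPI_File_set_view(myfile, 0, MPI_BYTE, MPI_BYTE,
                      "native", MPI_INFO_NULL);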

C bindings for the I/O functions in MPI that we have used so far are given in Table 2.1.

2.2.5

Fortran 90 Version


    Page 23

! example of parallel MPI write into a single file, in Fortran
PROGRAM main
    ! Fortran 90 users can (and should) use
    !    use mpi
    ! instead of include 'mpif.h' if their MPI implementation provides
    ! an mpi module.
    include 'mpif.h'

    integer ierr, i, myrank, BUFSIZE, thefile
    parameter (BUFSIZE=100)
    integer buf(BUFSIZE)
    integer(kind=MPI_OFFSET_KIND) disp

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
    do i = 1, BUFSIZE
        buf(i) = myrank * BUFSIZE + i
    enddo
    call MPI_FILE_OPEN(MPI_COMM_WORLD, 'testfile', &
                       MPI_MODE_WRONLY + MPI_MODE_CREATE, &
                       MPI_INFO_NULL, thefile, ierr)
    ! assume 4-byte integers
    disp = myrank * BUFSIZE * 4
    call MPI_FILE_SET_VIEW(thefile, disp, MPI_INTEGER, &
                           MPI_INTEGER, 'native', &
                           MPI_INFO_NULL, ierr)
    call MPI_FILE_WRITE(thefile, buf, BUFSIZE, MPI_INTEGER, &
                        MPI_STATUS_IGNORE, ierr)
    call MPI_FILE_CLOSE(thefile, ierr)
    call MPI_FINALIZE(ierr)
END PROGRAM main

Figure 2.8
MPI I/O to a single file in Fortran

    Page 24

    Table 2.3

    C bindings for some more I/O functions

int MPI_File_get_size(MPI_File fh, MPI_Offset *size)

int MPI_File_read(MPI_File fh, void *buf, int count, MPI_Datatype datatype,
                  MPI_Status *status)

In Figure 2.9 we show a program to read the file we have been writing in our previous examples. This program is independent of the number of processes that run it. The total size of the file is obtained, and then the views of the various processes are set so that they each have approximately the same amount to read.

One new MPI function is demonstrated here: MPI_File_get_size. The first argument is an open file, and the second is the address of a field in which to store the size of the file in bytes. Since many systems can now handle files whose sizes are too big to be represented in a 32-bit integer, MPI defines a type, MPI_Offset, that is large enough to contain a file size. It is the type used for arguments to MPI functions that refer to displacements in files. In C, one can expect it to be a long or a long long; at any rate, it is a type that can participate in integer arithmetic, as it does here when we compute the displacement used in MPI_File_set_view. Otherwise, the program used to read the file is very similar to the one that writes it.
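One practical point is worth a small sketch (using the variables of Figure 2.9): when computing a displacement for a large file, it is safest to force the multiplication to be done in MPI_Offset arithmetic rather than int arithmetic, so that the product cannot overflow a 32-bit integer.

    MPI_Offset disp = (MPI_Offset) myrank * bufsize * sizeof(int);

    MPI_File_set_view(thefile, disp, MPI_INT, MPI_INT,
                      "native", MPI_INFO_NULL);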


One difference between writing and reading is that one doesn't always know exactly how much data will be read. Here, although we could compute it, we let every process issue the same MPI_File_read call and pass the address of a real MPI_Status instead of MPI_STATUS_IGNORE. Then, just as in the case of an MPI_Recv, we can use MPI_Get_count to find out how many occurrences of a given datatype were read. If it is less than the number of items requested, then end-of-file has been reached.

C bindings for the new functions used in this example are given in Table 2.3.
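For example, the end-of-file test can be written as in the following sketch, which uses the variables of Figure 2.9.

    MPI_File_read(thefile, buf, bufsize, MPI_INT, &status);
    MPI_Get_count(&status, MPI_INT, &count);
    if (count < bufsize) {
        /* fewer items arrived than requested: end-of-file reached */
        printf("process %d hit end of file after %d ints\n",
               myrank, count);
    }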

2.2.7

C++ Version

The MPI Forum faced a number of choices when it came time to provide C++ bindings for the MPI-1 and MPI-2 functions. The simplest choice would be to make them identical to the C bindings. This would be a disappointment to C++ programmers, however. MPI is object-oriented in design, and it seemed a shame not to express this design in C++ syntax, which could be done without changing the basic structure of MPI. Another choice would be to define a complete class library that might look quite different from MPI's C bindings.

    Page 25

/* parallel MPI read with arbitrary number of processes */
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int myrank, numprocs, bufsize, *buf, count;
    MPI_File thefile;
    MPI_Status status;
    MPI_Offset filesize;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &thefile);
    MPI_File_get_size(thefile, &filesize);    /* in bytes */
    filesize = filesize / sizeof(int);        /* in number of ints */
    bufsize = filesize / numprocs + 1;        /* local number to read */
    buf = (int *) malloc(bufsize * sizeof(int));
    MPI_File_set_view(thefile, myrank * bufsize * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_read(thefile, buf, bufsize, MPI_INT, &status);
    MPI_Get_count(&status, MPI_INT, &count);
    printf("process %d read %d ints\n", myrank, count);
    MPI_File_close(&thefile);
    MPI_Finalize();
    return 0;
}

Figure 2.9
Reading the file with a different number of processes


    Page 26

Although the last choice was explored, and one instance was explored in detail [80], in the end the Forum adopted the middle road. The C++ bindings for MPI can almost be deduced from the C bindings, and there is roughly a one-to-one correspondence between C++ functions and C functions. The main features of the C++ bindings are as follows.

• Most MPI "objects," such as groups, communicators, files, requests, and statuses, are C++ objects.

• If an MPI function is naturally associated with an object, then it becomes a method on that object. For example, MPI_Send(. . ., comm) becomes a method on its communicator: comm.Send(. . .).

• Objects that are not components of other objects exist in an MPI name space. For example, MPI_COMM_WORLD becomes MPI::COMM_WORLD, and a constant like MPI_INFO_NULL becomes MPI::INFO_NULL.

• Functions that normally create objects return the object as a return value instead of returning an error code, as they do in C. For example, MPI::File::Open returns an object of type MPI::File.

• Functions that in C return a value in one of their arguments return it instead as the value of the function. For example, comm.Get_rank returns the rank of the calling process in the communicator comm.

• The C++ style of handling errors can be used. Although the default error handler remains MPI::ERRORS_ARE_FATAL in C++, the user can set the default error handler to MPI::ERRORS_THROW_EXCEPTIONS. In this case the C++ exception mechanism will throw an object of type MPI::Exception.

We illustrate some of the features of the C++ bindings by rewriting the previous program in C++. The new program is shown in Figure 2.10. Note that we have used the way C++ can defer defining types, along with the C++ MPI feature that functions can return values or objects. Hence instead of

    int myrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

we have

    int myrank = MPI::COMM_WORLD.Get_rank();

The C++ bindings for basic MPI functions found in nearly all MPI programs are shown in Table 2.4. Note that the new Get_rank has no arguments, instead of the two that the C version, MPI_Comm_rank, has, because it is a method on a communicator (which supplies what was the first argument) and it returns the rank as its value (which supplies what was the second).

    Page 27

// example of parallel MPI read from single file, in C++
#include <iostream>
#include <cstdlib>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int bufsize, *buf, count;
    MPI::Status status;

    MPI::Init();
    int myrank = MPI::COMM_WORLD.Get_rank();
    int numprocs = MPI::COMM_WORLD.Get_size();
    MPI::File thefile = MPI::File::Open(MPI::COMM_WORLD, "testfile",
                                        MPI::MODE_RDONLY,
                                        MPI::INFO_NULL);
    MPI::Offset filesize = thefile.Get_size();  // in bytes
    filesize = filesize / sizeof(int);          // in number of ints
    bufsize = filesize / numprocs + 1;          // local number to read
    buf = (int *) malloc(bufsize * sizeof(int));
    thefile.Set_view(myrank * bufsize * sizeof(int),
                     MPI_INT, MPI_INT, "native", MPI::INFO_NULL);
    thefile.Read(buf, bufsize, MPI_INT, status);
    count = status.Get_count(MPI_INT);
    std::cout << "process " << myrank << " read " << count
              << " ints" << std::endl;
    thefile.Close();
    MPI::Finalize();
    return 0;
}


Page 29

2.2.8

Other Ways to Write to a Shared File

In Section 2.2.4 we used MPI_File_set_view to show how multiple processes can be instructed to share a single file. As is common throughout MPI, there are multiple ways to achieve the same result. MPI_File_seek allows multiple processes to position themselves at a specific byte offset in a file (that is, to move the process's file pointer) before reading or writing. This is a lower-level approach than using file views and is similar to the Unix function lseek. An example that uses this approach is given in Section 3.2. For efficiency and thread safety, a seek and a read operation can be combined in a single function, MPI_File_read_at; similarly, there is an MPI_File_write_at. Finally, another file pointer, called the shared file pointer, is shared among the processes belonging to the communicator passed to MPI_File_open. Functions such as MPI_File_write_shared access data from the current location of the shared file pointer and increment the shared file pointer by the amount of data accessed. This functionality is useful, for example, when all processes are writing event records to a common log file.
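As a sketch of the explicit-offset style (using the thefile, buf, BUFSIZE, and myrank variables of Figure 2.6, with the file left in its default view of a stream of bytes), the single-file write could also be expressed without a file view:

    MPI_Offset offset = (MPI_Offset) myrank * BUFSIZE * sizeof(int);

    MPI_File_write_at(thefile, offset, buf, BUFSIZE, MPI_INT,
                      MPI_STATUS_IGNORE);

Because the offset is passed on every call, no process depends on the position of an individual file pointer, which is what makes this form convenient in multithreaded programs.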

2.3

Remote Memory Access

In this section we discuss how MPI-2 generalizes the strict message-passing model of MPI-1 and provides direct access by one process to parts of the memory of another process. These operations, referred to as get, put, and accumulate, are called remote memory access (RMA) operations in MPI. We will walk through a simple example that uses the MPI-2 remote memory access operations.

The most characteristic feature of the message-passing model of parallel computation is that data is moved from one process's address space to another's only by a cooperative pair of send/receive operations, one executed by each process. The same operations that move the data also perform the necessary synchronization; in other words, when a receive operation completes, the data is available for use in the receiving process.

MPI-2 does not provide a real shared-memory model; nonetheless, the remote memory operations of MPI-2 provide much of the flexibility of shared memory. Data movement can be initiated entirely by the action of one process; hence these operations are also referred to as one sided. In addition, the synchronization needed to ensure that a data-movement operation is complete is decoupled from the (one-sided) initiation of that operation. In Chapters 5 and 6 we will see that MPI-2's remote memory access operations comprise a small but powerful set of data-movement operations and a relatively complex set of synchronization operations. In this chapter we will deal only with the simplest form of synchronization.

It is important to realize that the RMA operations come with no particular guarantee of performance superior to that of send and receive. In particular, they have been designed to work both on shared-memory machines and in environments without any shared-memory hardware at all, such as networks of workstations using TCP/IP as an underlying communication mechanism. Their main utility is in the flexibility they provide for the design of algorithms. The resulting programs will be portable to all MPI implementations and presumably will be efficient on platforms that do provide hardware support for access to the memory of other processes.

Page 30

2.3.1

The Basic Idea: Memory Windows

In strict message passing, the send/receive buffers specified by MPI datatypes represent those portions of a process's address space that are exported to other processes (in the case of send operations) or available to be written into by other processes (in the case of receive operations). In MPI-2, this notion of "communication memory" is generalized to the notion of a remote memory access window. Each process can designate portions of its address space as available to other processes for both read and write access. The read and write operations performed by other processes are called get and put remote memory access operations. A third type of operation is called accumulate. This refers to the update of a remote memory location, for example, by adding a value to it.


The word window in MPI-2 refers to the portion of a single process's memory that it contributes to a distributed object called a window object. Thus, a window object is made up of multiple windows, each of which consists of all the local memory areas exposed to the other processes by a collective window-creation function. A collection of processes can have multiple window objects, and the windows contributed to a window object by a set of processes may vary from process to process. In Figure 2.11 we show a window object made up of windows contributed by two processes. The put and get operations that move data to and from the remote memory of another process are nonblocking; a separate synchronization operation is needed to ensure their completion. To see how this works, let us consider a simple example.

2.3.2

RMA Version of cpi

In this section we rewrite the cpi example that appears in Chapter 3 of Using MPI [32]. This program calculates the value of π by numerical integration. In the original version there are two types of communication. Process 0 prompts the user for a number of intervals to use in the integration and uses MPI_Bcast to send this number to the other processes. Each process then computes a partial sum, and the total sum is obtained by adding the partial sums with an MPI_Reduce operation.

    Page 31

Figure 2.11

Remote memory access window on two processes. The shaded area covers a single window object made up of two windows.

In the one-sided version of this program, process 0 will store the value it reads from the user into its part of an RMA window object, where the other processes can simply get it. After the partial sum calculations, all processes will add their contributions to a value in another window object, using accumulate. Synchronization will be carried out by the simplest of the window synchronization operations, the fence.

Figure 2.12 shows the beginning of the program, including setting up the window objects. In this simple example, each window object consists only of a single number in the memory of process 0. Window objects are represented by variables of type MPI_Win in C. We need two window objects because window objects are made up of variables of a single datatype, and we have an integer n and a double pi that all processes will access separately. Let us look at the first window creation call done on process 0.

    MPI_Win_create(&n, sizeof(int), 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &nwin);

This is matched on the other processes by

    MPI_Win_create(MPI_BOTTOM, 0, 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &nwin);

The call on process 0 needs to be matched on the other processes, even though they are not contributing any memory to the window object, because MPI_Win_create is a collective operation over the communicator specified in its communicator argument. This communicator designates which processes will be able to access the window object.

The first two arguments of MPI_Win_create are the address and length (in bytes) of the window (in local memory) that the calling process is exposing to put/get operations by other processes. Here it is the single integer n on process 0 and no


    Page 32

/* Compute pi by numerical integration, RMA version */
#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    MPI_Win nwin, piwin;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {
        MPI_Win_create(&n, sizeof(int), 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &nwin);
        MPI_Win_create(&pi, sizeof(double), 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &piwin);
    }
    else {
        MPI_Win_create(MPI_BOTTOM, 0, 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &nwin);
        MPI_Win_create(MPI_BOTTOM, 0, 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &piwin);
    }

Figure 2.12
cpi: setting up the RMA windows

memory at all on the other processes, signified by a length of 0. We use MPI_BOTTOM as the address because it is a valid address and we wish to emphasize that these processes are not contributing any local windows to the window object being created.

The next argument is a displacement unit used to specify offsets into the memory in windows. Here each window object contains only one variable, which we will access with a displacement of 0, so the displacement unit is not really important. We specify 1 (byte). The fourth argument is an MPI_Info argument, which can be used to optimize the performance of RMA operations in certain situations. Here we use MPI_INFO_NULL. See Chapter 5 for more on the use of displacement units and the MPI_Info argument. The fifth argument is a communicator, which specifies

    Page 33

the set of processes that will have access to the memory being contributed to the window object. The MPI implementation will return an MPI_Win object as the last argument.
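As an aside (a sketch separate from the cpi example), the displacement unit becomes more useful when a window contains an array; by passing sizeof(double) as the unit, other processes can address entries of the array by index rather than by byte offset.

    double a[100];
    MPI_Win awin;

    /* expose the whole array; a target displacement of i in a get,
       put, or accumulate operation then refers to a[i] */
    MPI_Win_create(a, 100 * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &awin);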

After the first call to MPI_Win_create, each process has access to the data in nwin (consisting of the single integer n) via put and get operations for storing and reading, and via the accumulate operation for updating. Note that we did not have to acquire or set aside special memory for the window; we just used the ordinary program variable n on process 0. It is possible, and sometimes preferable, to acquire such special memory with MPI_Alloc_mem, but we will not do so here. See Chapter 6 for further information on MPI_Alloc_mem.
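For completeness, a sketch of the MPI_Alloc_mem alternative looks like this; some implementations can perform RMA operations faster on memory acquired this way, and Chapter 6 discusses when that matters.

    int *n_ptr;

    MPI_Alloc_mem(sizeof(int), MPI_INFO_NULL, &n_ptr);
    MPI_Win_create(n_ptr, sizeof(int), 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &nwin);
    /* ... RMA access epochs on nwin ... */
    MPI_Win_free(&nwin);
    MPI_Free_mem(n_ptr);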

The second call to MPI_Win_create in each process is similar to the first and creates a window object piwin, giving each process access to the variable pi on process 0, where the total value of π will be accumulated.

Now that the window objects have been created, let us consider the rest of the program, shown in Figure 2.13. It is a loop in which each iteration begins with process 0 asking the user for a number of intervals and continues with the parallel computation and printing of the approximation of π by process 0. The loop terminates when the user enters a 0.


The processes of nonzero rank will get the value of n directly from the window object, without any explicit action on the part of process 0 to send it to them. But before we can call MPI_Get or any other RMA communication function, we must call a special synchronization function, MPI_Win_fence, to start what is known as an RMA access epoch. We would like to emphasize that the function MPI_Barrier cannot be used to achieve the synchronization necessary for remote memory operations. MPI provides three special mechanisms for synchronizing remote memory operations. We consider the simplest of them, MPI_Win_fence, here. The other two mechanisms are discussed in Chapter 6.

The fence operation is invoked by the function MPI_Win_fence. It has two arguments. The first is an "assertion" argument permitting certain optimizations; 0 is always a valid assertion value, and so we use it here for simplicity. The second argument is the window that the fence operation is being performed on. MPI_Win_fence can be thought of as a barrier (across all the processes in the communicator used to create the window object) that separates a set of local operations on the window from the remote operations on the window or (not illustrated here) separates two sets of remote operations. Here,

    MPI_Win_fence(0, nwin);

    Page 34

    while (1) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            fflush(stdout);
            scanf("%d", &n);
            pi = 0.0;
        }
        MPI_Win_fence(0, nwin);
        if (myid != 0)
            MPI_Get(&n, 1, MPI_INT, 0, 0, 1, MPI_INT, nwin);
        MPI_Win_fence(0, nwin);
        if (n == 0)
            break;
        else {
            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double) i - 0.5);
                sum += 4.0 / (1.0 + x * x);
            }
            mypi = h * sum;
            MPI_Win_fence(0, piwin);
            MPI_Accumulate(&mypi, 1, MPI_DOUBLE, 0, 0, 1, MPI_DOUBLE,
                           MPI_SUM, piwin);
            MPI_Win_fence(0, piwin);
            if (myid == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
        }
    }
    MPI_Win_free(&nwin);
    MPI_Win_free(&piwin);
    MPI_Finalize();
    return 0;
}

    Page 35

separates the assignment of the value of n read from the terminal from the operations that follow, which are remote operations. The get operation, performed by all the processes except process 0, is

    MPI_Get(&n, 1, MPI_INT, 0, 0, 1, MPI_INT, nwin);

The easiest way to think of this argument list is as that of a receive/send pair, in which the arguments for both send and receive are specified in a single call on a single process. The get is like a receive, so the receive buffer is specified first, in the normal MPI style, by the triple &n, 1, MPI_INT, in the usual (address, count, datatype) format used for receive buffers. The next argument is the rank of the target process, the process whose memory we are accessing. Here it is 0 because all processes except 0 are accessing the memory of process 0. The next three arguments define the "send buffer" in the window, again in the MPI style of (address, count, datatype). Here the address is given as a displacement into the remote memory on the target process. In this case it is 0 because there is only one value in the window, and therefore its displacement from the beginning of the window is 0. The last argument is the window object.

The remote memory operations only initiate data movement. We are not guaranteed that when MPI_Get returns, the data has been fetched into the variable n. In other words, MPI_Get is a nonblocking operation. To ensure that the operation is complete, we need to call MPI_Win_fence again.

The next few lines in the code compute a partial sum mypi in each process, including process 0. We obtain an approximation of π by having each process update the value pi in the window object by adding its value of mypi to it. First we call another MPI_Win_fence, this time on the piwin window object, to start another RMA access epoch. Then we perform an accumulate operation using

    MPI_Accumulate(&mypi, 1, MPI_DOUBLE, 0, 0, 1, MPI_DOUBLE,
                   MPI_SUM, piwin);

The first three arguments specify the local value being used to do the update, in the usual (address, count, datatype) form. The fourth argument is the rank of the target process, and the subsequent three arguments represent the value being updated, in the form (displacement, count, datatype). Then comes the operation used to do the update. This argument is similar to the op argument to MPI_Reduce, the difference being that only the predefined MPI reduction operations can be used in MPI_Accumulate; user-defined reduction operations cannot be used. In this example, each process needs to add its value of mypi to pi; therefore, we

    Page 36

    Table 2.6

    C bindings for the RMA functions used in the cpi example

int MPI_Win_create(void *base, MPI_Aint size, int disp_unit, MPI_Info info,
                   MPI_Comm comm, MPI_Win *win)

int MPI_Win_fence(int assert, MPI_Win win)

int MPI_Get(void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
            int target_rank, MPI_Aint target_disp, int target_count,
            MPI_Datatype target_datatype, MPI_Win win)

int MPI_Accumulate(void *origin_addr, int origin_count,
                   MPI_Datatype origin_datatype, int target_rank,
                   MPI_Aint target_disp, int target_count,
                   MPI_Datatype target_datatype, MPI_Op op, MPI_Win win)

int MPI_Win_free(MPI_Win *win)

use the