A Parallel Software Infrastructure
for Dynamic Block-Irregular
Scientific Calculations

Scott R. Kohn
[Cover figure: the layers of the software infrastructure, from the machine and programming language (C, Fortran, C++, Cobol) up through the Message Passing Layer, the Implementation Abstractions, and LPARX, to the Adaptive Mesh API and Particle API and applications such as LDA, MDAMG, and SPH3D, as used by the computational scientist.]
UNIVERSITY OF CALIFORNIA, SAN DIEGO

A Parallel Software Infrastructure for Dynamic
Block-Irregular Scientific Calculations

A dissertation submitted in partial satisfaction of the
requirements for the degree Doctor of Philosophy
in the Department of Computer Science and Engineering

by

Scott R. Kohn

Committee in charge:

Professor Scott B. Baden, Chair
Professor Francine D. Berman
Professor William G. Griswold
Professor Keith Marzullo
Professor Maria Elizabeth G. Ong
Professor John H. Weare
1995

Copyright
Scott R. Kohn, 1995
All rights reserved.
The dissertation of Scott R. Kohn is approved, and
it is acceptable in quality and form for publication on
microfilm:

University of California, San Diego

1995
iii
TABLE OF CONTENTS
Signature Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Vita and Publications . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
1 Introduction
   1.1 Parallel Scientific Computation
   1.2 Dynamic Block-Irregular Calculations
   1.3 A Parallel Software Infrastructure
      1.3.1 LPARX
      1.3.2 Implementation Abstractions
      1.3.3 Adaptive Mesh API
      1.3.4 Particle API
   1.4 Organization of the Dissertation
2 Parallelization Abstractions
   2.1 Introduction
   2.2 The LPARX Abstractions
      2.2.1 Philosophy
      2.2.2 Data Types
      2.2.3 Coarse-Grain Data Parallel Computation
      2.2.4 The Region Calculus
      2.2.5 Data Motion
      2.2.6 LPARX Implementation
      2.2.7 Summary
   2.3 LPARX Programming Examples
      2.3.1 Jacobi Relaxation
      2.3.2 Decomposing the Problem Domain
      2.3.3 Parallel Computation
      2.3.4 Communicating Boundary Values
      2.3.5 Dynamic and Irregular Computations
   2.4 Related Work
      2.4.1 Structural Abstraction
      2.4.2 Parallel Languages
      2.4.3 Run-Time Support Libraries
   2.5 Analysis and Discussion
      2.5.1 Structural Abstraction
      2.5.2 Limitations of the Abstractions
         Shared Memory
         Coarse-Grain Data Parallelism
         Language Interoperability
         Communication Model
      2.5.3 Future Work
3 Implementation Methodology
   3.1 Introduction
      3.1.1 Motivation
      3.1.2 Related Work
   3.2 Implementation Abstractions
      3.2.1 Message Passing Layer
      3.2.2 Asynchronous Message Streams
      3.2.3 Distributed Parallel Objects
      3.2.4 Communication Example
   3.3 Implementation and Performance
      3.3.1 Interrupts versus Polling
      3.3.2 DPO and AMS Overheads
      3.3.3 Application Performance
   3.4 Analysis and Discussion
      3.4.1 Flexibility
      3.4.2 Portability
      3.4.3 Implementation Mistakes
4 Adaptive Mesh Applications
   4.1 Introduction
      4.1.1 Motivation
      4.1.2 Related Work
   4.2 Structured Adaptive Mesh Algorithms
   4.3 Adaptive Mesh API
      4.3.1 Software Infrastructure Overview
      4.3.2 Data Structures
      4.3.3 Error Estimation
      4.3.4 Grid Generation
      4.3.5 Load Balancing and Processor Assignment
      4.3.6 Numerical Computation
      4.3.7 Communication
   4.4 Adaptive Eigensolvers in Materials Design
      4.4.1 A Model Problem
      4.4.2 Adaptive Framework
      4.4.3 Eigenvalue Algorithm
      4.4.4 Multigrid
      4.4.5 Finite Difference Discretizations
      4.4.6 Computational Results
   4.5 Performance Analysis
      4.5.1 Performance Comparison
      4.5.2 Execution Time Analysis
      4.5.3 Uniform Grid Patches
   4.6 Analysis and Discussion
      4.6.1 Parallelization Requirements
      4.6.2 Future Research Directions
5 Particle Calculations
   5.1 Introduction
      5.1.1 Motivation
      5.1.2 Related Work
   5.2 Application Programmer Interface
      5.2.1 Balancing Non-Uniform Workloads
      5.2.2 Caching Off-Processor Data
      5.2.3 Writing Back Particle Information
      5.2.4 Repatriating Particles
      5.2.5 Implementation Details
   5.3 Smoothed Particle Hydrodynamics
      5.3.1 Numerical Background
      5.3.2 Performance Comparison
      5.3.3 Execution Time Analysis
      5.3.4 Exploiting Force Law Symmetry
      5.3.5 Communication Optimizations
   5.4 Analysis and Discussion
      5.4.1 Parallelization Requirements
      5.4.2 Unstructured Partitionings
      5.4.3 Future Research Directions

6 Conclusions
   6.1 Research Contributions
   6.2 Outstanding Research Issues
      6.2.1 Implementation Strategies for APIs
      6.2.2 Language Interoperability
   6.3 The Scientific Computing Community

Appendix A: Machine Characteristics

Bibliography
LIST OF FIGURES

Current design trends in parallel architecture favor machines that resemble tightly coupled networks of workstations
An overview of our parallel software infrastructure

The LPARX layer of our software infrastructure provides parallelization mechanisms on which we build application-specific APIs
LPARX applications logically consist of three components: partitioning routines, LPARX code, and serial numerical kernels
The XArray of Grids structure provides a common framework for implementing various block-irregular decompositions of data
Examples of LPARX's region calculus operations
The computational domain for a simple finite difference problem
The main routine for the parallel Jacobi application
The relaxation routine for the parallel Jacobi application
Subroutine FillPatch manages all interprocessor communication

The LPARX run-time system is built on a message passing library, Asynchronous Message Streams, and Distributed Parallel Objects
LPARX programs are modeled as a collection of objects (Grids) with asynchronous and unpredictable communication patterns
Asynchronous communication facilities of the AMS layer
An example of AMS's message stream abstractions
Primary and secondary objects in the DPO model
LPARX function XAlloc supplies a Region and a processor assignment when creating a Grid
Each LPARX Grid is a DPO object
Coarse-grain execution in DPO employs the owner-computes rule
FillPatch will be used to illustrate how the various implementation layers interact in interprocessor communication
A time-line view of the transmission of data to another processor
A time-line view of the reception of data from another processor

The adaptive mesh API provides application-specific facilities for structured adaptive mesh methods
A comparison of unstructured and structured adaptive mesh methods
Structured adaptive mesh methods represent the numerical solution to a partial differential equation using a hierarchy of grid levels
A sample structured adaptive mesh hierarchy for a materials design problem
Organization of the structured adaptive mesh API library
A composite grid is represented using a Grid, an IrregularGrid, and a CompositeGrid
Error estimation and grid generation
Grid generation using the signature algorithm
Two grid generation strategies for uniform refinement regions
A simple load balancing algorithm for grid patches
An improved load balancing strategy
Coarse-grain numerical computation over the individual Grids within an IrregularGrid
A comparison of coarse-grain and fine-grain data parallel execution
Intralevel communication between grids at the same level
Interlevel communication between grids at different levels
Materials design seeks to understand the chemical properties of molecules such as this hydrocarbon ring
Outline of the adaptive eigenvalue solver
An iterative multigrid-based eigenvalue algorithm
The Full Approximation Storage (FAS) multigrid algorithm
Computational results for hydrogen
Computational results for the hydrogen molecular ion
Computational results were gathered for this synthetic eigenvalue problem
Adaptive eigenvalue solver execution times
A level-by-level accounting of the execution time for the eigenvalue algorithm
Execution time breakdown on the Intel Paragon and IBM SP
These graphs illustrate the performance overheads of uniform grid patches as compared to non-uniform patches

Our particle API provides computational scientists with high-level facilities targeted towards particle applications
A framework for a generic particle calculation
Snapshots of a vortex dynamics application with a non-uniform workload distribution
A parallelized version of the generic particle code
An irregular decomposition of the computational domain using the XArray
API function BalanceWorkloads redistributes computational effort across the processors
FetchParticles locally caches copies of off-processor particle information needed for particle interactions
WriteBack updates force information for particles owned by other processors
The API definition for C++ class ChainMesh
Application C++ code to compute local interactions
Customizations for C++ class ParticleList
Our SPH3D application simulates the evolution of a 3d disk galaxy
SPH3D execution times on a Cray C90, Intel Paragon, IBM SP, and an Alpha workstation farm running PVM
Execution time summary for one SPH3D timestep on the Intel Paragon and the IBM SP
A comparison of the SPH3D code with a restricted version that does not fully exploit force law symmetry
A comparison of the SPH3D code to a "naive" implementation that does not attempt to minimize interprocessor communication
A comparison of structured and unstructured partitions

A.1 Alpha workstation cluster message passing performance
A.2 IBM SP message passing performance
A.3 Intel Paragon message passing performance
LIST OF TABLES

A brief description of the four LPARX data types: Point, Region, Grid, and XArray
A summary of LPARX operations

A summary of the facilities provided by DPO, AMS, and the message passing layer
A summary of the asynchronous communication facilities provided by the Asynchronous Message Stream layer
A summary of the object management mechanisms defined by the Distributed Parallel Objects layer
The implementation of communication between Grids depends on whether they are primary or secondary objects
Message length and memory overheads for AMS, DPO, and LPARX
LPARX overheads for a Jacobi application

A breakdown of the eleven thousand lines of code that constitute the adaptive mesh API library
Descriptions of Grid, IrregularGrid, and CompositeGrid
Unknowns and mesh spacing for the adaptive mesh hierarchy used to solve the eigenvalue problem
Software version numbers and compiler optimization flags for the structured adaptive mesh performance results
Adaptive eigenvalue solver execution times
Execution time breakdown on the Intel Paragon
Execution time breakdown on the IBM SP
Average interprocessor communication volume
Uniform grid patches require additional memory resources as compared to non-uniform patches

A survey of the computational structure for various N-body approximation methods
Variables and functions of the smoothed particle hydrodynamics equations
Software version numbers and compiler optimization flags for the SPH3D computational results
SPH3D execution times on a Cray C90, Intel Paragon, IBM SP, and an Alpha workstation farm running PVM
Execution time breakdown of one SPH3D timestep on the Intel Paragon and the IBM SP
Execution time summary for one SPH3D timestep on the Intel Paragon and the IBM SP
A comparison of the SPH3D code with a restricted version that does not fully exploit force law symmetry
A comparison of the SPH3D code to a "naive" implementation that does not attempt to minimize interprocessor communication

A.1 Software version numbers and compiler optimization flags
A.2 A summary of machine characteristics
ACKNOWLEDGEMENTS
As one of those rare individuals destined for true greatness, this record of my thoughts and convictions will provide invaluable insight into budding genius. Think of it! A priceless historical document in the making.

-- Calvin, "Calvin and Hobbes"
Many people contribute to the completion of a dissertation, and I would like to thank everyone who has contributed to mine.

I have had the privilege and pleasure to work with Scott Baden for the last five years. I am indebted to him for his support and encouragement. I have enjoyed our numerous "lively discussions," which have greatly contributed to my work. I would also like to thank my committee members -- Fran Berman, Bill Griswold, Keith Marzullo, Beth Ong, and John Weare -- for offering criticisms and comments.

Special thanks go to Steve Fink and Val Donaldson. Their keen insights, thoughtful comments, and honest criticisms are the stuff of good science. Sharing a lab with them has been a pleasure that I will miss. I doubt that they know how much I have valued their input.

I appreciate the many useful suggestions from Greg Cook, Steve Fink, Chris Myers, and Charles Rendleman on how to improve LPARX and the adaptive mesh software. I also thank Eric Bylaska, Alan Edelman, Ryoichi Kawai, Beth Ong, and John Weare for numerous valuable discussions on numerical methods in materials design.

I would like to thank my family for their support, encouragement, and love. In particular, I thank my father for his sense of curiosity and my mother for trying to make me read Dr. Seuss when I only wanted to read science books. I also have to thank my two cats for putting things in perspective; they do not recognize the importance of writing a dissertation, and they have never hesitated to let me know that their needs (e.g., being fed on time) should come first and foremost.

Finally, I would like to dedicate this dissertation to my wife, Kristin. I find it difficult to express in words what I feel for her in my heart. I thank her for always being there when I needed her and for always reminding me what is most important in my life.
Generous financial support has been provided by a General Atomics fellowship, an NSF ASC contract, and an ONR contract. Access to the Cray C90, IBM SP, Intel Paragon, and DEC Alpha workstation farm has been provided by the San Diego Supercomputer Center (through a UCSD School of Engineering Block Grant) and the Cornell Theory Center.
VITA

B.S., Electrical Engineering, with additional majors in Mathematics and Computer Science, University of Wisconsin at Madison

M.S., Computer Science, University of California at San Diego

Ph.D., Computer Science, University of California at San Diego
PUBLICATIONS

Submitted for Publication

S. R. Kohn and S. B. Baden, "A Parallel Software Infrastructure for Structured Adaptive Mesh Methods," submitted to Supercomputing '95.

Journals

S. R. Kohn and S. B. Baden, "Irregular Coarse-Grain Data Parallelism Under LPARX," to appear, Journal of Scientific Programming.

S. B. Baden and S. R. Kohn, "Portable Parallel Programming of Numerical Problems Under the LPAR System," Journal of Parallel and Distributed Computation.

Conferences

S. R. Kohn and S. B. Baden, "The Parallelization of an Adaptive Multigrid Eigenvalue Solver with LPARX," Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

E. J. Bylaska, S. R. Kohn, S. B. Baden, A. Edelman, R. Kawai, M. E. Ong, and J. H. Weare, "Scalable Parallel Numerical Methods and Software Tools for Material Design," Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

S. B. Baden, S. R. Kohn, and S. J. Fink, "Programming with LPARX," Proceedings of the 1994 Intel Supercomputer User's Group, San Diego, CA, June 1994.

S. R. Kohn and S. B. Baden, "A Robust Parallel Programming Model for Dynamic Non-Uniform Scientific Computations," Proceedings of the 1994 Scalable High Performance Computing Conference, Knoxville, TN, May 1994.

S. R. Kohn and S. B. Baden, "An Implementation of the LPAR Parallel Programming Model for Scientific Computations," Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, Norfolk, VA, March 1993.

S. B. Baden and S. R. Kohn, "Lattice Parallelism: A Parallel Programming Model for Manipulating Non-Uniform Structured Scientific Data Structures," Proceedings of the Workshop on Languages, Compilers, and Run-Time Environments for Distributed Memory Multiprocessors, Boulder, CO, October 1992.

S. B. Baden and S. R. Kohn, "A Comparison of Load Balancing Strategies for Particle Methods Running on MIMD Multiprocessors," Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, Houston, TX, March 1991.

Technical Reports

S. R. Kohn and S. B. Baden, "Blobs: Visualization of Particle Methods on Multiprocessors," Technical Report, University of California, San Diego.

S. B. Baden and S. R. Kohn, "The Reference Guide to GenMP: The Generic Multiprocessor," Technical Report, University of California, San Diego.
ABSTRACT OF THE DISSERTATION

A Parallel Software Infrastructure for Dynamic
Block-Irregular Scientific Calculations

by

Scott R. Kohn

Doctor of Philosophy in Computer Science

University of California, San Diego, 1995

Professor Scott B. Baden, Chair

Dear Sir or Madam, will you read my book?
It took me years to write, will you take a look?

-- John Lennon and Paul McCartney, "Paperback Writer"
The accurate solution of many problems in science and engineering requires the resolution of unpredictable, localized physical phenomena. Such applications may involve the solution of complicated, time-dependent partial differential equations such as those in materials design, computational fluid dynamics, astrophysics, and molecular dynamics. The important feature of these numerical problems is that some portions of the computational domain require higher resolution, and thus more computational effort, than others.

Parallel supercomputers offer the power to solve many of these computationally intensive tasks; however, these applications are particularly challenging to implement on parallel architectures because they rely on dynamic, complicated, irregular structures with dynamic and irregular communication patterns. Current parallel software technology does not yet afford a solution, and new programming abstractions, along with the accompanying run-time support, are needed.

We have developed a parallel software infrastructure to simplify the implementation of dynamic, irregular, block-structured scientific computations on high-performance parallel supercomputers. Our software infrastructure provides computational scientists with high-level, domain-specific tools that hide low-level details of the parallel hardware. It is portable across a wide range of parallel architectures.

At the center of our infrastructure is the LPARX parallel programming system. LPARX introduces the concept of "structural abstraction," which enables applications to dynamically manipulate irregular data decompositions as language-level objects. LPARX provides a framework for creating decompositions that may be tailored to meet the needs of a particular application.

Building on the LPARX abstractions, we have developed application programmer interfaces (APIs) for two important classes of applications: structured adaptive mesh methods and particle calculations. These APIs enable scientists to concentrate on the mathematics and the physics of their application; APIs provide high-level software tools that hide underlying implementation details. Our parallel software infrastructure has enabled computational scientists to explore new approaches to solving a variety of problems, and it has reduced the development time of challenging numerical applications. Indeed, we have applied our structured adaptive mesh API to the adaptive solution of eigenvalue problems in materials design and our particle API to a 3d smoothed particle hydrodynamics application in astrophysics.
Chapter 1

Introduction

I realized that the purpose of writing is to inflate weak ideas, obscure poor reasoning, and inhibit clarity. With a little practice, writing can be an intimidating and impenetrable fog . . . Academia, here I come!

-- Calvin, "Calvin and Hobbes"
1.1 Parallel Scientific Computation
Parallel supercomputers offer the power to solve many of the computationally intensive problems that arise in science and engineering. Unfortunately, this potential has only been partially realized due to the difficulty of implementing scientific applications on parallel platforms. To put it bluntly, today's parallel computers are hard to use, and parallel software technology does not yet afford the performance and ease of use that computational scientists have come to expect from sequential and vector supercomputers.
It would not be an understatement to say that parallel software is in a state of crisis. The vast majority of scientific programmers find that current parallel software support is inadequate [?]. In fact, they are more likely to develop their own in-house software support rather than use existing products [?]. The most commonly used parallel programming paradigm today is message passing. Standardization efforts have resulted in a portable message passing library called MPI (Message Passing Interface) [?]. Unfortunately, programming with message passing is tedious, as the programmer must explicitly manage low-level details of data placement and interprocessor communication.
Developments in High Performance Fortran* (HPF) [?] are promising; unfortunately, HPF will require improvements before it becomes a general-purpose parallel language. For example, HPF does not adequately address dynamic and irregular problems [?], and these limitations have prompted a second HPF standardization effort. The HPF2 committee is currently investigating enhancements to HPF, but it will be some time before we know what strategies will be effective and how difficult they will be to support in the compiler. Improvements in HPF for dynamic and irregular scientific applications will likely require new parallel programming abstractions and run-time support libraries.
Parallel computers are difficult to use because they require the explicit and low-level management of data locality. Current design trends in high-performance parallel architectures favor machines constructed with commodity components. More than anything else, today's parallel computers resemble tightly coupled networks of workstations (see Figure 1.1). The programmer, compiler, or run-time system must distribute data carefully because access to remote data (through the interconnection network) is typically several orders of magnitude more expensive than access to local data. Some parallel computers, such as the Intel Paragon and the IBM SP, provide very little hardware support for managing data distributed across processor memories. Other machines, such as the Stanford FLASH [?] and the Wisconsin COW [?], contain hardware for the automatic caching of remote data. However, recent studies with these "distributed shared memory" machines [?] indicate that such hardware caching mechanisms are inadequate for dynamic scientific applications. In fact, these studies conclude that efficient distributed shared memory applications require the same attention to data management and the same implementation techniques as message passing applications.
* High Performance Fortran is a data parallel Fortran language that is quickly becoming accepted by a number of manufacturers as a standard parallel programming language for scientific computing.
[Figure: five processing nodes, each a processor P paired with a local memory M, connected by an interconnection network.]

Figure 1.1: Current design trends in parallel architecture favor machines built with commodity components. Today's parallel computers resemble tightly coupled networks of workstations and typically contain a few tens to a few hundreds of powerful processing nodes connected by an interconnection network. Each processor P is tied to a local memory M, and remote data is accessed through the interconnection network. At any one time, several parallel applications share the machine, with a single application generally using a few tens of dedicated processors. Appendix A summarizes the machine characteristics for the parallel architectures used in this dissertation.
Another concern when writing parallel applications is portability. Parallel platforms obsolesce at an alarming rate, and portability is essential so that applications will run on the next generation of architectures. In just the two years spent developing our software infrastructure, four parallel computers have become obsolete (the nCUBE 2, Intel iPSC/860, Kendall Square Research KSR-1, and Thinking Machines CM-5), two manufacturers have declared bankruptcy (Kendall Square Research and Thinking Machines), and two manufacturers have entered the parallel scientific computing market (IBM and Silicon Graphics). This trend is likely to continue in the near future due to the rapidly changing microprocessor and interconnect technology used to build parallel machines.
The key to portability is hiding low-level, machine-dependent details. For sequential programs, this can be easily achieved through the use of a standard programming language such as Fortran. However, parallel programs typically contain a considerable amount of hardware-dependent code to manage data distribution and interprocessor communication. Such hardware dependencies hamper portability. To be portable, parallelization mechanisms must hide these low-level, architecture-dependent implementation details.
Implementing portable parallel programs without high-level software support is a difficult task. Computational scientists would rather address the mathematics and the physics of their problems than worry about efficient parallel implementation techniques. Low-level, machine-dependent details reduce portability and obscure the algorithms underlying an application. Appropriate software support is essential for developing architecture-independent, high-performance parallel scientific applications.
1.2 Dynamic Block-Irregular Calculations

My own interests are in using computers as God intended -- to do arithmetic.

-- Cleve Moler
Many scientific computations involve the study of dynamic, irregular, locally structured physical phenomena. Such applications may involve the solution of complicated, time-dependent partial differential equations such as those in materials design [...], computational fluid dynamics [...], or localized deformations in geophysical systems. Also included are particle methods in molecular dynamics [...], astrophysics [...], and vortex dynamics [...]. More recently, adaptive methods have been applied to the study of entire ecosystems through satellite imagery at multiple resolutions [...]. These applications are particularly challenging to implement on parallel computers owing to their dynamic, irregular decompositions of data.
Our research addresses the programming abstractions and the accompanying software support required for dynamic, irregular, block-structured scientific computations running on MIMD [...] parallel computers. The distinguishing characteristics of this class of problems are that (1) numerical work is non-uniformly distributed over space, (2) the workload distribution changes as the computation progresses, and (3) the workload exhibits a local structure. Such applications employ dynamic, irregular -- but locally structured -- meshes to represent the changing numerical computation. They spend considerably more effort in some portions of the problem space than in others. The distribution of computational effort is not known at compile-time, and the application must adapt to the evolving calculation at run-time. Numerical work tends to be localized in regions irregularly distributed across the problem domain. This localization property is especially important on multiprocessors, since we can exploit data locality to reduce interprocessor communication costs and improve parallel performance.
We focus on two important classes of dynamic, block-irregular applications:

- structured adaptive mesh methods [...], and

- particle methods based on link-cell techniques [...].

Such applications can be difficult to implement without advanced software support because they rely on dynamic, complicated irregular array structures with irregular communication patterns.* The programmer is burdened with the responsibility of managing dynamically changing data distributed across processor memories and orchestrating interprocessor communication and synchronization. Little information is available at compile-time to guide a parallel compiler because numerical workloads change in response to the dynamics of the particular problem being solved.
Current parallel programming languages provide little support for dynamic, block-irregular applications. Data parallel Fortran languages such as High Performance Fortran typically focus on regular, static problems such as dense linear algebra. HPF defines a set of built-in, uniform data decompositions specified through compile-time directives; however, it provides few mechanisms for dynamically changing irregular data. Extending HPF will require developments in (1) parallel programming abstractions and (2) run-time support libraries. New parallelization abstractions are needed because the current compile-time data distribution mechanisms are inadequate for dynamic problems. Such applications will also require sophisticated run-time support to manage changing data distributions and communication patterns.

* Further details can be found in Sections [...] and [...].
A number of run-time support systems have already been developed, including CHAOS (formerly called PARTI) [...], multiblock PARTI [...], and Multipol [...]. Both CHAOS and multiblock PARTI have been used as run-time support for data parallel Fortran compilers. CHAOS has been very successful in addressing unstructured problems such as sparse linear algebra and finite elements [...]. Multiblock PARTI has been employed in the parallelization of applications with a small number of large, static blocks [...]; its support for dynamic block structured problems is unclear. The Multipol library provides a collection of distributed non-array data structures such as graphs, unstructured grids, hash tables, sets, trees, and queues. However, none of these systems directly address the dynamic, block-irregular problems that are the focus of our research.*
1.3 A Parallel Software Infrastructure

Solving a problem is similar to building a house. We must collect the right material, but collecting the material is not enough; a heap of stones is not yet a house. To construct the house or the solution, we must put together the parts and organize them into a purposeful whole.

-- George Polya
We have developed a parallel software infrastructure to simplify the implementation of dynamic, irregular, block-structured scientific computations on high-performance parallel supercomputers. Our software infrastructure has enabled computational scientists to explore new approaches to solving applied problems. It has reduced the development time of challenging numerical applications. Our infrastructure has been implemented as a C++ class library and consists of approximately thirty thousand lines of C++ and Fortran code.* It has been employed by researchers at the University of California at San Diego, George Mason University, Lawrence Livermore National Laboratories, Sandia National Laboratories, and the Cornell Theory Center for applications in gas dynamics [...], smoothed particle hydrodynamics, particle simulation studies [...], adaptive eigenvalue solvers in materials design [...], genetic algorithms [...], adaptive multigrid methods in numerical relativity, and the dynamics of earthquake faults (see Section [...] for a complete list).

* We will discuss such related work in detail in Section [...].
Our parallel software infrastructure addresses two goals of software support for scientific applications [...]:

- it hides low-level details of the hardware, and

- it provides high-level, efficient mechanisms that match the scientist's view of the computation.

The first goal is necessary for portability. Software that exposes too much of the underlying hardware will not run efficiently on parallel platforms with different hardware characteristics. Our software meets this goal through the use of high-level parallelization mechanisms that assume very little about the underlying hardware architecture. The second goal is necessary for ease-of-use and simplified code development. Our software infrastructure provides the programmer with high-level tools appropriate for the task at hand through domain-specific application programmer interfaces (APIs) built upon our parallelization mechanisms.
Figure 1.2 illustrates the organization of our parallel software infrastructure. At the very top of the infrastructure lie applications, and at the bottom lies a portable message passing layer. Each level provides more powerful -- and also more specific -- abstractions. The infrastructure consists of four primary components: (1) implementation support, (2) a set of parallelization abstractions called LPARX ("ell-par-eks"), (3) a structured adaptive mesh API, and (4) a particle API.

* The current software distribution may be obtained through the World Wide Web at address http://www.cse.ucsd.edu/users/skohn.
[Figure: layered stack -- at the top, the applications LDA, AMG, SPH3D, and MD; below them the Adaptive Mesh API (Chapter 4) and Particle API (Chapter 5); then LPARX (Chapter 2); then the Implementation Abstractions (Chapter 3); at the bottom, the Message Passing Layer.]

Figure 1.2: This figure shows our parallel software infrastructure for dynamic, irregular, block-structured scientific computations. It consists of four primary components, each of which has been labeled with the chapter of this dissertation that describes that particular component. Higher levels of the infrastructure provide more powerful -- and more specialized -- abstractions. At the top lie applications (LDA, AMG, SPH3D, and MD), and at the bottom lies a portable message passing layer. See the text for a brief description of each component.
There are three advantages to a layered software infrastructure as compared to a single, monolithic application: portability, code reusability, and extensibility. Because applications do not directly rely on the message passing layer and instead employ the mechanisms provided by their application-specific APIs, low-level changes in the implementation do not directly affect the applications. For example, applications are completely shielded from low-level changes in the LPARX implementation. Code reuse is achieved because multiple application libraries share the same parallelization mechanisms. Optimizations in the LPARX implementation are realized by both particle and adaptive mesh applications. Finally, because the infrastructure provides tools, and not canned solutions, computational scientists can tailor and extend our abstractions to match their applications.
The following sections briefly describe the four main components of our parallel software infrastructure.
1.3.1 LPARX

At the center of the software infrastructure is the LPARX parallel programming system, which defines high-level, efficient mechanisms for data distribution, partitioning and mapping, parallel execution, and interprocessor communication. It provides a common set of parallelization facilities on which we built the particle and adaptive mesh APIs.

LPARX introduces the concept of "structural abstraction," which enables applications to dynamically manipulate irregular data decompositions. Instead of forcing the programmer to choose from a small set of predefined decompositions, LPARX provides a framework for creating decompositions that may be tailored to meet the needs of a particular application. To our knowledge, LPARX is the first and only system that efficiently supports arbitrary dynamic, user-defined, block-irregular data distributions on parallel architectures.

LPARX assumes only basic message passing support and is therefore portable to a variety of high-performance computing platforms. Our current implementation runs on the Cray C90 (single processor), IBM SP2, Intel Paragon, single processor workstations (for code development and debugging), and networks of workstations connected via PVM [...].
1.3.2 Implementation Abstractions

At the very bottom of our software infrastructure is a portable message passing layer called MP++ (our own version of MPI [...]). To simplify the implementation of the LPARX run-time system, we have introduced two levels of software abstraction between LPARX and the message passing layer: Asynchronous Message Streams (AMS) and Distributed Parallel Objects (DPO). AMS and DPO provide support for parallel programs consisting of a relatively small number of large, complicated objects with asynchronous and unpredictable communication patterns. They build on ideas from the concurrent object oriented programming community.

AMS defines a "message stream" abstraction that greatly simplifies the communication of complicated data structures between processors. Its mechanisms combine ideas from asynchronous remote procedure calls [...], Active Messages [...], and the C++ I/O stream library [...]. DPO provides object oriented mechanisms for manipulating objects that are physically distributed across processor memories and is based on communicating object models from the distributed systems community [...].
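As a rough illustration of the stream idea, the sketch below packs and unpacks values through an operator<< / operator>> interface reminiscent of the C++ I/O streams that inspired AMS. The MsgStream class and its flat byte-buffer layout are our own invention for this example, not the AMS interface itself.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Toy "message stream": operator<< packs structured data into a byte buffer
// that could be handed to a message passing layer; operator>> unpacks it on
// the receiving side. Works only for trivially copyable types.
struct MsgStream {
    std::vector<char> buf;   // packed message bytes
    size_t rd = 0;           // read cursor for unpacking
    template <class T> MsgStream& operator<<(const T& v) {
        const char* p = reinterpret_cast<const char*>(&v);
        buf.insert(buf.end(), p, p + sizeof(T));
        return *this;
    }
    template <class T> MsgStream& operator>>(T& v) {
        std::memcpy(&v, buf.data() + rd, sizeof(T));
        rd += sizeof(T);
        return *this;
    }
};
```

In AMS the same stream style is extended to asynchronous delivery between processors; here both ends share one buffer purely to show the packing discipline.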
1.3.3 Adaptive Mesh API

Our adaptive mesh API defines specialized, high-level facilities tailored to structured multilevel adaptive mesh refinement applications [...]. Such numerical methods dynamically refine the local representation of a problem in "interesting" portions of the computational domain, such as shock regions in computational fluid dynamics [...]. They are difficult to implement because refinement regions vary in size and location, resulting in complicated geometries and irregular communication patterns.

Computational scientists using our adaptive mesh API can concentrate on their numerical applications rather than being concerned with low-level implementation details. The API library, built upon the parallelization and communication abstractions of LPARX, provides mechanisms for automatic error estimation, grid generation, load balancing, and grid hierarchy management. All details associated with parallelism are completely hidden from the programmer.

We have used our software infrastructure to develop a parallel adaptive eigenvalue solver (LDA) and an adaptive multigrid solver (AMG) for problems arising in materials design [...]. By exploiting adaptivity, we have reduced memory consumption and computation time by more than two orders of magnitude over an equivalent non-adaptive method. To our knowledge, this is the first time that structured adaptive mesh techniques have been used to solve eigenvalue problems in materials design.
1.3.4 Particle API

Our particle API provides computational scientists high-level tools that simplify the implementation of particle applications [...] on parallel computers. Particle methods are difficult to parallelize because they require dynamic, irregular data decompositions to balance changing non-uniform workloads. Built on top of the LPARX mechanisms, our particle API defines facilities specifically tailored towards particle methods. The use of the LPARX abstractions enabled us to provide functionality and explore performance optimizations that would have been difficult had the library been implemented using only a primitive message passing layer. Using our software infrastructure, we have developed a 3D smoothed particle hydrodynamics [...] code (SPH3D) that simulates the evolution of galactic bodies in astrophysics, and we are currently developing a 3D molecular dynamics application (MD) to study fracture dynamics in solids [...].
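For readers unfamiliar with the link-cell technique mentioned earlier, the sketch below shows its core idea in sequential form: space is divided into cells no smaller than the interaction cutoff, so each particle need only examine its own and neighboring cells. The names (P, bin_particles) are illustrative and not part of our particle API.

```cpp
#include <cassert>
#include <vector>

struct P { double x, y; };   // a particle position in [0, L) x [0, L)

// Assign each particle index to a cell on an ncell x ncell grid over the
// domain [0, L)^2. Force evaluation would then loop over each cell and its
// eight neighbors instead of over all particle pairs.
std::vector<std::vector<int>> bin_particles(const std::vector<P>& ps,
                                            double L, int ncell) {
    std::vector<std::vector<int>> cells(ncell * ncell);
    double h = L / ncell;                          // cell width (>= cutoff)
    for (int i = 0; i < (int)ps.size(); ++i) {
        int cx = (int)(ps[i].x / h), cy = (int)(ps[i].y / h);
        cells[cy * ncell + cx].push_back(i);       // store particle index
    }
    return cells;
}
```

When workloads are non-uniform, the cells themselves become the units that an irregular decomposition distributes across processors.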
1.4 Organization of the Dissertation

Sixty minutes of thinking of any kind is bound to lead to confusion and unhappiness.

-- James Thurber

This dissertation is organized into six chapters. Each of Chapters 2 through 5 covers a portion of the software infrastructure shown in Figure 1.2. Each chapter is self-contained, with its own introduction, motivation, related work, and analysis and conclusions. The discussion of LPARX in Chapter 2 is a starting point for all further chapters; otherwise, Chapter 3 (Implementation Mechanisms), Chapter 4 (Adaptive Mesh Applications), and Chapter 5 (Particle Calculations) may be read independently of the others. We conclude with the contributions of this work in Chapter 6. The parallel architectures used in this dissertation are described in Appendix A.
Chapter 2

Parallelization Abstractions

Fundamental definitions do not arise at the start but at the end of the exploration, because in order to define a thing you must know what it is and what it is good for.

-- Hans Freudenthal, "Developments in Mathematical Education"

If at first you do succeed -- try to hide your astonishment.

-- Harry F. Banks
2.1 Introduction

The LPARX parallel programming system [...] provides portable facilities for the efficient implementation of dynamic, non-uniform scientific applications on MIMD architectures. Such applications are typically difficult to implement without sophisticated software support. The LPARX mechanisms hide low-level implementation details and provide powerful tools for data distribution, partitioning and mapping, parallel execution, and interprocessor communication. LPARX requires only basic message passing support and is therefore portable to a variety of high-performance computing platforms. Our current implementation runs on the Cray C90 (single processor), IBM SP2, Intel Paragon, and networks of workstations connected via PVM [...]. LPARX applications may be developed and debugged on a single processor workstation.
[Figure: layered stack -- the applications LDA, AMG, SPH3D, and MD at the top; the Adaptive Mesh API and Particle API; LPARX; the Implementation Abstractions; and the Message Passing Layer at the bottom.]

Figure 2.1: The LPARX layer of our software infrastructure provides parallelization facilities designed for scientific applications that employ dynamic, irregular, structured representations. Based on the LPARX mechanisms, we have developed application-specific APIs for particle computations and structured adaptive mesh methods.
Building on the LPARX mechanisms described in this chapter (see Figure 2.1), we have developed application-specific support libraries for two important classes of applications: multilevel structured adaptive mesh methods [...] and particle calculations [...]. In Chapters 4 and 5, we describe how LPARX provides the parallelization support infrastructure needed to efficiently and easily implement these re-usable APIs.

This chapter is organized as follows. We begin with a description of the LPARX abstractions in Section 2.2. Section 2.3 illustrates how these abstractions are used to parallelize a simple application. We compare our approach with other related work in Section 2.4. Finally, we conclude with an analysis of the advantages and limitations of the LPARX approach.
2.2 The LPARX Abstractions

A breakthrough is not a breakthrough unless you coin a term for it.

-- Sidney Harris, "Einstein Simplified"

I think you've done it. All we need now is a trademark and a theme song.

-- Sidney Harris, "From Personal Ads to Cloning Labs"
LPARX [...] is a coarse-grain, domain-specific parallel programming model that provides high-level abstractions for representing and manipulating dynamic, irregular block-structured data on MIMD distributed memory architectures. Dynamic irregular block decompositions are not currently supported by programming languages such as High Performance Fortran (HPF) [...], Fortran D [...], Vienna Fortran [...], or Fortran 90D [...]. They arise in two important classes of scientific computations:

- multilevel structured adaptive finite difference methods [...], which represent refinement regions using block-irregular data structures, and

- parallel computations such as particle methods [...] that require an irregular data decomposition [...] to balance non-uniform workloads across parallel processors.

We have used the LPARX mechanisms to implement domain-specific APIs and representative applications from each of these two problem classes.
LPARX should not be thought of as a "language" but rather as a set of data distribution and parallel coordination abstractions which may be implemented in a library (as we have done) or added to a language. The design goals of LPARX are as follows:

- Express irregular data decompositions, layouts, and data dependencies at run-time using high-level, intuitive abstractions.

- Require only basic message passing support and give portable performance across diverse parallel architectures.

- Separate parallel control and communication from numerical computation.

- Provide the basis for an expandable software infrastructure of application-specific APIs.

Implementing dynamic, irregular computations on parallel computers is a difficult task. To achieve reasonable parallel performance, the application must explicitly manage low-level details of data locality and communication, even on shared memory multiprocessors [...]. This burden soon becomes unmanageable and can obscure the salient features of the algorithm. LPARX hides many of these implementation details and provides high-level coordination mechanisms to manage data locality within the memory hierarchy and minimize communication costs. The software support provided by LPARX greatly simplifies the development of high-performance, portable, parallel applications software.

The following sections describe LPARX's facilities. We begin with an overview of the philosophy underlying the LPARX model. Section 2.2.2 introduces the LPARX data types and its representation of irregular block decompositions. We then present LPARX's model of coarse-grain data parallel execution. Sections 2.2.4 and 2.2.5 describe LPARX's region calculus and data motion primitives, which express data decompositions and dependencies in geometric terms. We briefly discuss the LPARX implementation in Section 2.2.6 and then conclude with a summary.
2.2.1 Philosophy

The LPARX parallel programming model separates the expression of data decomposition, communication, and parallel execution from numerical computation. As shown in Figure 2.2, LPARX applications are logically organized into three separate pieces: partitioners, LPARX code, and serial numerical kernels.
The LPARX layer provides facilities for the coordination and control of parallel execution. LPARX is a coarse-grain data parallel programming model; it gives the illusion of a single global address space and a single logical thread of control. On a MIMD parallel computer, the underlying run-time system executes in Single Program Multiple Data (SPMD) mode.

[Figure: three components of an LPARX application -- partitioning routines, LPARX code, and serial numerical kernels.]

Figure 2.2: The logical organization of an LPARX application consists of three components: partitioning routines, LPARX code, and serial numerical kernels.
Computations are divided into a relatively small number of coarse-grain pieces. Each work unit represents a substantial computation with thousands or tens of thousands of floating point operations executing on a single logical processing node. Parallel execution is expressed using a coarse-grain loop; each iteration of the loop executes as if on its own processor. The computation for each piece is performed by a numerical kernel, and the computations proceed independently of one another. Numerical routines may be written in any language, such as C++, C, or Fortran. The advantage of this approach is that LPARX can leverage serial compiler technology and existing sequential code. Heavily optimized numerical routines need not be re-implemented to parallelize an application. Furthermore, numerical code can be optimized for a processing node without regard to the higher level parallelization. LPARX does not define what constitutes a single logical node; a node may correspond to a single processor, a processing cluster, or a processor subset. Thus, kernels may be tuned to take advantage of low-level node characteristics, such as vector units, cache sizes, or multiple processors.
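The division of labor between coordination code and serial kernels can be sketched as follows. The kernel is defined inline here to keep the example self-contained, but in practice it could be separately compiled, heavily optimized Fortran or C; all names are illustrative, not part of LPARX.

```cpp
#include <cassert>
#include <vector>

// Serial numerical kernel: one Jacobi-style smoothing pass over the interior
// of a 1D block. In a real application this would be existing, tuned serial
// code; extern "C" linkage lets it be supplied by a C or Fortran compiler.
extern "C" void smooth_kernel(double* u, int n) {
    std::vector<double> tmp(u, u + n);          // copy of the old values
    for (int i = 1; i < n - 1; ++i)
        u[i] = 0.5 * (tmp[i - 1] + tmp[i + 1]);
}

// Coarse-grain driver: each block is a substantial unit of work. In LPARX
// the loop body would execute concurrently, one iteration per logical node;
// here it is a plain sequential loop for illustration.
void smooth_all(std::vector<std::vector<double>>& blocks) {
    for (auto& b : blocks)
        smooth_kernel(b.data(), (int)b.size());
}
```

The driver knows nothing about the kernel's internals, and the kernel knows nothing about parallelism, which is the separation the text describes.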
An important part of the LPARX philosophy is that data partitioning for dynamic, non-uniform scientific computations is extremely problem-dependent and therefore is best left to the application (or the API). No specific data decomposition strategies have been built into the LPARX model. Rather, all data decomposition is performed at run-time under the direct control of the application. LPARX provides the application a uniform framework for representing and manipulating block-irregular decompositions. Although our implementation supplies a standard library of decomposition routines, the programmer is free to write others.
Our approach to data decomposition differs from most parallel languages, such as HPF [...], which require the programmer to choose from a small number of predefined decomposition methods. Vienna Fortran [...] provides some facilities for irregular user-defined data decompositions but limits them to tensor products of irregular one-dimensional decompositions. Block-irregular decompositions may be constructed using the pointwise mapping arrays of Fortran D [...]; however, pointwise decompositions are inappropriate and unnatural for calculations which exhibit block structures. Because pointwise decompositions have no knowledge of the block structure, mapping information must be maintained for each individual array element at a substantial cost in memory and communication overheads. By comparison, coarse-grain partitionings incur a cost proportional to the number of blocks, which is typically three or four orders of magnitude smaller than the number of array elements.

Once a decomposition has been specified, the details of the data partitioning are hidden from the application. The programmer can change partitioning strategies without affecting the correctness of the underlying code. Thus, LPARX views partitioners as interchangeable, and the application may change decomposition strategies by simply invoking a different partitioning routine.
At the core of LPARX is the concept of structural abstraction. Structural abstraction enables an application to express the logical structure of data and its decomposition across processors as first-class, language-level objects. The key idea is that the structure of the data -- the "floorplan" describing how the data is decomposed and where the data is located -- is represented and manipulated separately from the data itself. LPARX expresses communication and operations on data decompositions using intuitive geometric operations, such as intersection, instead of explicit indexing. Interprocessor communication is hidden by the run-time system, and the application is completely unaware of low-level details. Although the current LPARX implementation is limited to representing irregular, block-structured decompositions, the concept of structural abstraction is general and extends to other classes of applications, such as unstructured finite element meshes [...].
2.2.2 Data Types

LPARX provides the following four basic data types:

- Point, an integer n-tuple representing a point in Z^n;

- Region, an object representing a rectangular subset of array index space;

- Grid, a dynamic array instantiated over a Region; and

- XArray, a dynamic array of Grids distributed over processors.

The Point is a simple, auxiliary data type used to define and manipulate Regions. Element-wise addition and scalar multiplication are defined over Points in the obvious way.
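A minimal sketch of such a Point type, with just the two operations mentioned above; the class and operator names are ours for illustration and are not the actual LPARX declarations.

```cpp
#include <array>
#include <cassert>

// Illustrative N-dimensional integer tuple with element-wise addition and
// scalar multiplication, in the spirit of the LPARX Point.
template <int N>
struct Point {
    std::array<int, N> x{};
    Point operator+(const Point& p) const {
        Point r;
        for (int i = 0; i < N; ++i) r.x[i] = x[i] + p.x[i];
        return r;
    }
    Point operator*(int s) const {
        Point r;
        for (int i = 0; i < N; ++i) r.x[i] = x[i] * s;
        return r;
    }
    bool operator==(const Point& p) const { return x == p.x; }
};
```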
The Region provides the basis for structural abstraction. An n-dimensional Region represents a subset of Z^n, the space of n-dimensional integer vectors. The Region does not contain data elements, as does an array, but rather represents a portion of index space. In the current implementation of LPARX, we restrict Regions to be rectangular; however, the concepts described here apply to arbitrary subsets of Z^n [...]. A Region is uniquely defined by the two Points at its lower and upper corners. We denote the lower bound of a Region R by lwb(R) and its upper bound by upb(R). Although there is no identical construct in Fortran or C, the Region is related to the array section specifiers found in Fortran 90. Unlike Fortran 90 array section specifiers, however, the Region is a first-class object and may be assigned and manipulated at run-time. The concept of first-class array section objects (called domains) was introduced in the FIDIL programming language [...].
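The following self-contained sketch illustrates a first-class 2D region with lower and upper corner points, together with the geometric intersection operation on which a region calculus can be built. The names Region2 and intersect are ours for illustration, not necessarily those of the LPARX library.

```cpp
#include <algorithm>
#include <cassert>

// A rectangular subset of 2D index space, defined by its inclusive lower
// and upper corners -- a first-class object that can be assigned, passed
// around, and intersected at run-time.
struct Region2 {
    int lo[2], hi[2];
    bool empty() const { return lo[0] > hi[0] || lo[1] > hi[1]; }
};

// Geometric intersection: the overlap of two Regions, possibly empty.
// Operations like this let data dependencies be expressed without any
// explicit index arithmetic in application code.
Region2 intersect(const Region2& a, const Region2& b) {
    Region2 r;
    for (int d = 0; d < 2; ++d) {
        r.lo[d] = std::max(a.lo[d], b.lo[d]);
        r.hi[d] = std::min(a.hi[d], b.hi[d]);
    }
    return r;
}
```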
The Grid is a dynamic array defined over an arbitrary rectangular index set specified by a Region. The Grid is similar to a Fortran 90 allocatable array. Each Grid remembers its associated Region, which can be queried at run-time, a convenience that greatly reduces bookkeeping for dynamically defined Grids.* All Grid elements must have the same type; they may be integers, floating point numbers, or any user-defined type or class. For example, in addition to representing a mesh of floating point numbers, the Grid may also be used to implement the spatial data structures [...] common in particle calculations. Grids may be manipulated using high-level block copy operations, described in Section 2.2.5.
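A 1D sketch of the Grid idea: the array remembers the Region it was allocated over, indexing uses global coordinates, and a block copy transfers data over the intersection of two Grids' Regions. This is an illustrative stand-in with invented names, not the LPARX Grid interface.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A dynamic 1D array that remembers its index Region [lo, hi].
struct Grid1 {
    int lo, hi;                  // inclusive global index bounds
    std::vector<double> data;
    Grid1(int l, int h) : lo(l), hi(h), data(h - l + 1, 0.0) {}
    double& at(int i) { return data[i - lo]; }   // global index -> storage
};

// Block copy: transfer src into dst wherever their Regions overlap. In a
// parallel setting this is the operation behind which interprocessor
// communication would be hidden.
void copy_on_intersection(Grid1& dst, const Grid1& src) {
    int l = std::max(dst.lo, src.lo), h = std::min(dst.hi, src.hi);
    for (int i = l; i <= h; ++i)
        dst.at(i) = src.data[i - src.lo];
}
```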
LPARX is targeted towards applications with irregular, block structures. To support such structures, it provides a special array -- the XArray -- for organizing a dynamic collection of Grids. Each Grid in an XArray is arbitrarily assigned to a single processor; individual Grids are not subdivided across processors. The XArray can be viewed as a coarse-grain analogue of a Fortran D array decomposed via mapping arrays, except that XArray elements are themselves arrays (Grids).

The Grids in an XArray may have different origins, sizes, and index sets, but all Grids must have the same spatial dimension. To allocate an XArray, the application invokes the LPARX operation XAlloc with an array of Regions representing the structure of the Grids and a corresponding array of processor assignments (i.e., the floorplan). LPARX provides a default assignment of Grids to processors if none is given. An XArray is intended to implement coarse-grain irregular decompositions; thus, each processor is typically assigned only a few Grids.
LPARX defines a coarse grain looping construct -- forall -- which iterates concurrently over the Grids of an XArray. The semantics of forall are similar to HPF's INDEPENDENT forall [...]: each loop iteration is executed as if it were an atomic operation. In writing a forall loop, the programmer is unaware of the assignment of Grids to processors -- each XArray element is treated as if it were assigned to its own processor -- and the LPARX run-time system correctly manages the parallelism. LPARX also defines a for loop, a sequential version of the forall.

* Compare this to C, which requires the programmer to keep track of bounds for dynamically allocated array storage.

[Figure: three example XArray of Grids structures; the rightmost shows one possible mapping of Grids to processors 1 through 4.]

Figure 2.3: Two examples of an XArray of Grids data structure. The recursive bisection decomposition on the far left is usually employed to balance non-uniform workloads in particle calculations. The structure in the middle is typical of a single level mesh refinement in structured adaptive mesh methods. On the right, we show one possible mapping of XArray elements to processors. Note that the XArray is a container for the Grids and its elements are Grids, not pointers.
The XArray of Grids structure provides a common framework for imple�
menting various block�irregular decompositions of data� This framework is used by
standard load balancing routines such as recursive bisection ��� �see Chapter � and
also by application�speci�c routines� such as the grid generator for an adaptive mesh
calculation ���� �see Chapter ��� Figure �� shows decompositions arising in two
di�erent applications� In each case� the data has been divided into Grids� each repre�
senting a di�erent portion of the computational domain� which have been assigned to
an XArray� The following section provides more detail about how XArrays are used
to organize a parallel computation�
�.�.� Coarse-Grain Data Parallel Computation

Recall that an LPARX application consists of three components: partitioning
routines, LPARX code, and serial numerical kernels. Here we show how these
pieces work together in an application. LPARX provides the programmer with a
simple model of coarse-grain parallel computation:

1. Decompose the computational structure into an array of Regions.

2. Specify an assignment of each Region in (1) to a processor.

3. Create an XArray of Grids representing the data decomposition floorplan generated by steps (1) and (2).

4. Satisfy data dependencies between Grids in the XArray using LPARX's communication facilities (described in the following sections).

5. Perform calculations on the Grids in the XArray in parallel using the coarse-grain forall loop.
The decomposition in (1) may be managed explicitly by the application,
as in generating refinement regions, or by load-balancing utilities that implement
partitioning strategies. The LPARX implementation provides a standard library of
partitioners that implement recursive coordinate bisection [�] and uniform block
partitioning.
The assignment of Regions to processors in (2) provides applications the
flexibility to delegate work to processors. In general, this information will be returned
by the routine which renders the partitions. This step may be omitted, in which case
LPARX generates a default assignment.
In step (3), the application invokes an LPARX operation called XAlloc
which, using the partitioning and the processor assignment information, instantiates
an XArray of Grids implementing the data decomposition. LPARX creates Grids
based on the supplied Region information and assigns them to the appropriate
processors.

After the decomposition and allocation of data, applications typically alternate
between steps (4) and (5). In (4), data dependencies between the Grids in
the XArray are satisfied using LPARX's region calculus and block copy operations,
described in the following sections. Finally, the application computes in parallel on
the Grids in the XArray using a forall loop (5). For each Grid, a numerical routine
is called to perform the computation; the computation executes on a single logical
processing node, which may actually consist of many physical processors. The execution
of forall assumes the Grids are decoupled: they are processed independently
and asynchronously.
�.�.� The Region Calculus

LPARX defines a region calculus which enables the programmer to manipulate
index sets (Regions) in high-level geometric terms. In this section, we describe
the most important region calculus operations: shift, intersection, and grow.

Given a Region R and a Point P, shift(R,P) denotes the Region R translated
by a distance P, as shown in Figure �.�a. The intersection of two Regions is
simply the set of points which the two have in common. The dark shaded area in Figure
�.�b represents the intersection of Regions R and S, written as R * S. Regions are
closed under intersection: the intersection of two Regions is always another Region.
If two Regions do not overlap, the resulting intersection is said to be empty.

Grow surrounds a Region with a boundary layer of a specified width. It
takes two arguments, a Region and a Point, and returns a new Region which has
been extended (or shrunk, for negative widths) in each coordinate direction by the
specified amount. The second argument to grow may also be an integer, in which
case each dimension is grown by the same amount. Figure �.�c shows the Region
resulting from grow(R,1).
�.�.� Data Motion

LPARX coordinates data motion between Grids using two types of block
copy operations: copy-on-intersect and general block copy. Copy-on-intersect copies
data from a Grid into the corresponding elements of another where their Regions
overlap in the underlying integer coordinate system. Of course, all Grids and Regions

(A complete list of all LPARX operations can be found in Table �.�.)
[Figure: four panels (a)-(d) on an integer grid showing Region R, Shift(R, [7,-1]), the intersection R * S, and Grow(R,1).]
Figure �.�: Four examples of LPARX's region calculus. Although shown in 2d for simplicity, these operations generalize readily to higher dimensions. (a) Shift takes a Region and a Point and returns a Region translated by the specified distance. (b) Intersection returns the set of points shared by two Regions. (c) Grow adds a boundary layer to a Region. (d) Data dependencies for a ghost cell region can be calculated easily using the grow and intersection operations; in this example, the darkest Region is grow(R,1) * S.
in the same copy statement must have the same spatial dimension. For Grids G and
H, the statement:

copy into G from H

copies data from H into G where the Regions of G and H intersect. Another form:

copy into G from H on R

where R is a Region, limits the copy to the index space in which all three Regions
intersect. General block copy is similar to copy-on-intersect except that it allows a
shift between the source and destination Regions. The statement:

copy into G on R from H on S

copies data from Region S of Grid H into Region R of Grid G.

The default behavior of both data motion operations is to simply copy data
from the source into the destination. LPARX also provides a reduction form:

copy into (...) from (...) using combine

where combine is a specified commutative, associative reduction function. Instead
of copying the data, LPARX applies combine elementwise to combine corresponding
source and destination data values. For example:

copy into G from H using sum

adds corresponding elements of Grid H to Grid G; portions of G that do not intersect
with H remain unchanged. Section �.� illustrates how this reduction variant is used
to sum force information in a particle application.
We now show how these simple but powerful operations are used to calculate
data dependencies. One common communication operation in scientific codes is the
transmission of data to fill ghost cells, boundary elements added to each processor's
local data partition (see Figure �.�b). The region calculus represents the processor's
local partition as a Region. We grow the Region to define ghost cells and then use
intersection to calculate the Region of data required from another processor. Finally,
a copy updates the ghost region's data values (see Figure �.�d). Recall that copy-on-intersect
copies values that lie in the intersection of the ghost region and interacting
blocks. The calculation of data dependencies involves no explicit computations involving
subscripts, as copy-on-intersect manages all bookkeeping details. The region
calculus is independent of the Grid dimension, and the same operations work for any
problem dimension. All interprocessor communication is managed by the run-time
system and is completely hidden from the user.
�.�.� LPARX Implementation

In this section, we briefly describe the LPARX implementation; further details
are provided in Chapter �.

LPARX has been implemented as a C++ run-time library consisting of approximately
fifteen thousand lines of code (excluding the application-specific API
libraries described in Chapters � and �). The implementation requires no special
compiler other than a standard C++ compiler, and LPARX code may be freely
mixed with calls to other C++, C, or Fortran routines.

LPARX defines C++ classes for Point, Region, Grid, and XArray. Grid
elements may be standard C++ types (e.g. int or double), structures, or other C++
classes. All classes are strongly typed by the number of spatial dimensions; for example,
Region2 represents a 2d Region, Region3 a 3d Region, and so on. Dimension-independent
code is written using an X in the place of the spatial dimension in the
class name (e.g. RegionX) and is translated into dimension-specific code (e.g. RegionX
to Region2) by a preprocessor at compilation time.
In our examples, we will employ LPARX pseudocode instead of actual C++
code for clarity and to separate the semantics of LPARX operations from their current
implementation as a C++ class library. Of course, other implementations of LPARX
are possible. With the exception of minor syntactic differences, the LPARX code and
the actual C++ code are nearly identical.
Data Type  Description

Point: n-tuple representing a point in integer space; used to define and manipulate Regions

Region: represents a subset of array index space; used to describe irregular data decompositions; manipulated with shift, intersect, and grow

Grid: a dynamic array defined over a Region; Grid elements may be any user-defined type; Grids communicate via geometric block copies

XArray: an array of Grids distributed over processors; common framework for irregular block decompositions; coarse-grain execution using the forall loop

Table �.�: A brief summary of the four LPARX data types defined in Section �.�.
�.�.� Summary

LPARX provides run-time mechanisms for user-defined irregular block decompositions.
The structure of a data decomposition, the floorplan describing how
data is to be decomposed across processors, is a first-class, language-level object
which may be manipulated by the application. In contrast, many data parallel languages
such as High Performance Fortran [�] specify data decompositions at compile
time using compiler directives, which force the user to choose from a limited set of
built-in, regular decompositions.

LPARX defines four new data types (see Table �.�): Point, Region, Grid,
and XArray. Its forall loop implements a coarse-grain model of data parallelism in
which an operation is applied in parallel to all Grids within an XArray. Communication
among Grids is expressed using the region calculus abstractions and high-level
block copy operations. LPARX operations are summarized in Table �.�.
Operation  Description

R1 = R2 * R3: Region R1 is the intersection of Regions R2 and R3

P = lwb(R): Point P is the lower bound of Region R

P = upb(R): Point P is the upper bound of Region R

R1 = shift(R2, P): Region R1 is the Region R2 shifted by Point P

R1 = grow(R2, P): Region R1 is the Region R2 extended by an amount P (a Point) in each coordinate dimension

R1 = grow(R2, i): Region R1 is the Region R2 extended by integer amount i in all coordinate dimensions

R = region(G): Region R is the Region associated with Grid G

G(i1, i2, ..., in): Array indexing for an n-dimensional Grid G; returns a Grid element

X(i1, i2, ..., in): Array indexing for an n-dimensional XArray X; returns a Grid

XAlloc(X, n, R, M): Allocate the data for XArray X with n Grids using the floorplan specified by the Array of Regions R and the Array of integers M; if M is omitted, LPARX provides a default mapping

copy into G1 from G2: Copy data from Grid G2 into Grid G1 where their Regions intersect (copy-on-intersect)

copy into G1 from G2 on R: Copy data from Grid G2 into Grid G1 where their Regions intersect with Region R (copy-on-intersect)

copy into G1 on R1 from G2 on R2: Copy data from Region R2 of Grid G2 into Region R1 of Grid G1 (general block copy)

copy into (...) from (...) using f: Reduction form of copy in which the commutative, associative function f is applied elementwise to combine corresponding source and destination elements

forall i1, i2, ..., in in X ... end forall: A coarse-grain data parallel loop that iterates concurrently over the Grids in the n-dimensional XArray X

for i1, i2, ..., in in X ... end for: A sequential loop that iterates over the Grids in the n-dimensional XArray X

Uniform(R, P) and RCB(W, P): External uniform and recursive bisection [�] partitioning routines provided by a standard LPARX library; both return an Array of Regions describing the computational space (indicated by Region R or workload estimate W) decomposed into P Regions representing approximately equal amounts of computational work

Table �.�: This table summarizes the operations defined by LPARX.
�.� LPARX Programming Examples

You know my methods. Apply them.
- Sherlock Holmes, "The Hound of the Baskervilles"

In this section, we illustrate how to use the LPARX mechanisms to parallelize
a simple application: Jacobi relaxation on a rectangular domain. Although this
particular computation is neither irregular nor dynamic, it is easy to explain, and the
techniques described here generalize immediately to the irregular, dynamic applications
of Chapters � and �. Sections �.�.� through �.�.� describe the parallelization
of the Jacobi code; Section �.�.� shows how the techniques used to parallelize the
Jacobi application also apply to dynamic, irregular computations.
�.�.� Jacobi Relaxation

Consider the Laplace equation in two dimensions subject to Dirichlet boundary
conditions:

∇²u = 0 in Ω,   u = f on ∂Ω

where f and u are real-valued functions of two variables, the domain Ω ⊂ R² is a
rectangle, and ∂Ω is the boundary of Ω. We discretize the computation using the
method of finite differences, solving a set of discrete equations defined on a regularly
spaced 2d mesh of size (M + 2) × (N + 2).

The interior of the mesh is defined as:

Region Interior = [1:M, 1:N]

The square bracket notation indicates that we will number the interior points of the
mesh from 1 to M in the x-coordinate and from 1 to N in the y-coordinate. The
interior region will be extended with a boundary region (using grow) to contain the
Dirichlet boundary conditions for the problem, as shown in Figure �.�a.

To parallelize Jacobi relaxation, we decompose the computational domain
into subdomains and assign each subdomain to a processor. A standard blockwise decomposition
for �� processors is shown in Figure �.�b. Each subdomain is augmented
Figure �.�: (a) A finite difference mesh defined over the 2d Region [0:M+1, 0:N+1] with interior [1:M, 1:N]. (b) A blockwise decomposition of the computational space into �� subblocks. The lightly shaded area shows the ghost region for a typical partition.
with a ghost cell region that locally caches either interior data from adjoining subdomains
or Dirichlet boundary conditions (for those subdomains on the boundary).
We refresh these ghost cells before computing on each subdomain. Each processor
then updates the solution for the subdomains it owns; this computation proceeds in
parallel, and each processor performs its calculations independently of the others.
�.�.� Decomposing the Problem Domain

Recall that LPARX does not predefine specific data partitioning strategies;
rather, data partitioning is under the control of the application. One possible partitioning
for this problem is a uniform BLOCK decomposition such as that provided by
HPF. By convention, LPARX expects the partitioner to return an Array of Regions
that describes the uniform partitioning:

Array of Region Partition = Uniform(Interior, P)
Here, our Array represents the standard array type provided by most programming
languages. The partitioner Uniform takes two arguments: the Region to be partitioned
and the desired number of subdomains, P, which is usually the number of
processors. Recall that LPARX defines some common partitioning utilities, such as
Uniform, in a standard library, but the programmer is free to write others.

After determining the partitioning of space, we extend each subdomain with
ghost cells. The exact thickness of the ghost cell region depends on the particulars of
the numerical method. In our case, we will assume a finite difference scheme requiring
a ghost cell region one cell thick. We apply grow to augment each subdomain of
Partition with a ghost region:

Array of Region Ghosts = grow(Partition, P, 1)

The computational domain is now logically divided into an Array of overlapping
Regions called Ghosts. We next allocate an XArray of Grids of Double to
implement this data decomposition. This occurs in two steps. First, we declare the
XArray of Grids structure:

XArray of Grid of Double U

and next we instantiate the storage using LPARX's XAlloc operation:

call XAlloc(U, P, Ghosts)

XAlloc takes three arguments: the XArray to be allocated, the number of elements
to allocate, and an Array of Regions, one Region for each element in the XArray.
We may optionally supply a processor assignment for each XArray component; if no
such processor assignment is specified, as in the code above, then LPARX chooses a
default mapping. To override the default, we provide an Array of integer processor
identifiers, one for each XArray element, as an optional fourth argument to XAlloc:

call XAlloc(U, P, Ghosts, Mapping)

(Grow is overloaded in the obvious way to handle arrays of Regions.)
-- The main routine of the Jacobi relaxation program
function main

  -- Initialize M, N, and number of processors P (not shown)

  -- Partition the computational domain
  Region Interior = [1:M, 1:N]
  Array of Region Partition = Uniform(Interior, P)
  Array of Region Ghosts = grow(Partition, P, 1)

  -- Allocate and initialize data (initialization not shown)
  XArray of Grid of Double U
  call XAlloc(U, P, Ghosts)

  -- Iterate until the solution converges (error check not shown)
  while (the solution has not converged) do
    call relax(U)
  end while

end function

Figure �.�: LPARX code which partitions the computational space, allocates the XArray of Grids structure, and calls the Jacobi relaxation routine (described in Section �.�.�).
Such a mapping may be used to better balance workloads or to optimize interprocessor
communication traffic for a particular hardware interconnect topology.

After instantiating the XArray, we are ready to compute. The main computational
loop iterates until the solution meets some (unspecified) error criterion:

while (the solution has not converged) do

call relax(U)

end while

where relax, described in Section �.�.�, performs the computation. The main routine
of the Jacobi code is summarized in Figure �.�.

Note that we may change the data partitioning scheme at any time without
affecting the correctness of the code. For example, the "box-like" partitioning may
-- Execute one iteration of Jacobi relaxation
function relax(XArray of Grid of Double U)

  -- Refresh the ghost cell region with the newest values
  call FillPatch(U)

  -- Compute in parallel over all of the subblocks
  forall i in U
    -- Call a numerical kernel to do the computation
    call smooth(U(i))
  end forall

end function

Figure �.�: The Jacobi relaxation routine invokes FillPatch to fetch data values from adjacent processors and then calls smooth, a computational kernel.
be replaced with a strip decomposition or even a recursive bisection decomposition
simply by calling a different partitioner. No other changes would need to be made.
Furthermore, the computational domain need not be restricted to a rectangle; it may
be an "L"-shaped region or, in general, any irregular collection of blocks.
�.�.� Parallel Computation

Function relax performs the major tasks in solving Laplace's equation: it
invokes subroutine FillPatch (described in the following section) to refresh the ghost
cell regions and then calls the computational kernel smooth (not shown). The code
for relax is shown in Figure �.�. The forall loop computes in parallel over all of
the Grids of U. Smooth is called to perform the computation for each U(i), the ith
Grid of XArray U.

Typically, computational kernels are written in a language such as Fortran
which might not understand the concept of an LPARX "Grid". LPARX provides
a simple interface for calling C++, C, or Fortran which enables the programmer to
extract Grid data in a form understandable by the numerical routine. These three
languages require only a pointer to the Grid data and the extents of the associated
-- Communicate boundary data between neighboring partitions
function FillPatch(XArray of Grid of Double U)

  -- Loop over all pairs of grids in U
  forall i in U
    -- Mask off the ghost cells (copy interior values only)
    -- Function region() extracts the region from its argument
    Region Inside = grow(region(U(i)), -1)
    for j in U
      -- Copy data from intersecting regions
      copy into U(j) from U(i) on Inside
    end for
  end forall

end function
Figure �.�: FillPatch updates ghost cell regions of Grid U(j) with overlapping non-ghost cell data from adjacent Grids U(i).
Region. However, interoperability with languages such as High Performance Fortran
is still an open research question and is addressed in Section �.�.
�.�.� Communicating Boundary Values

The final piece of Jacobi code is FillPatch, shown in Figure �.�. This
routine updates the ghost cell regions of each subgrid with data from the interior (non-ghost
cell) sections of adjacent subgrids. For every pair of Grids U(i) and U(j), it
copies into the ghost cells of U(j) the overlapping non-ghost cell data from U(i). The
outer loop is a parallel forall; processors calculate data dependencies only for those
Grids they own. Aggregate data motion between Grids is handled through LPARX's
copy-on-intersect primitive. FillPatch employs grow with a negative width to peel
away the ghost cell region to obtain the Inside of Grid U(i).

For applications in which the structure of the N subgrids is simple and static,
as in our example, this O(N²) algorithm is naive because it examines all possible intersections.
However, the communication structure for dynamic irregular computations
is neither static nor regular and thus cannot be easily predicted. In localized computations
such as Jacobi, many of these O(N²) intersections will be empty; in such
cases, the LPARX run-time system does not communicate data. We will discuss how
LPARX eliminates such unnecessary communication in Chapter �.
FillPatch works for any problem dimension. In fact, none of our code
examples in this section have assumed a particular spatial dimension. Moreover,
we can replace Double with any valid data type to handle different types of Grids,
such as Grids of particles employed by our particle API (see Section �.�). Finally,
FillPatch does not assume a simple uniform partitioning; this same code will work
for any style of data partitioning, regular or irregular.
�.�.� Dynamic and Irregular Computations

We have used LPARX to develop a straightforward parallel implementation
of Jacobi relaxation, a simple application requiring only a uniform, static data decomposition.
In this section, we show how the LPARX parallelization mechanisms can be
used to address dynamic, irregular computations such as structured adaptive mesh
methods [�].

As described in Chapter �, structured adaptive mesh methods represent the
solution to partial differential equations using a hierarchy of irregular but locally
structured meshes. Our adaptive mesh API implementation represents each level of
this adaptive mesh hierarchy as an XArray of Grid. Unlike the Jacobi example, each
mesh level typically consists of an irregular collection of blocks. Instead of the uniform
block partitioner, the application calls error estimation and regridding routines
which perform data decomposition at run-time. FillPatch works without change (see
Section �.�.�) because LPARX's structural abstractions apply equally well to both
uniform decompositions and irregular block structures. Of course, the adaptive application
adds other routines to manage the transfer of numerical information between
levels of the hierarchy (e.g. interpolation operators). The key observation, however,
is that the LPARX abstractions used in the Jacobi code generalize immediately to
dynamic, irregular computations.
�.� Related Work

Yes, I get by with a little help from my friends.
- Lennon and McCartney, "With a Little Help from My Friends"

In this section, we compare the LPARX approach with other related work.
We divide our survey into three areas: structural abstraction (Section �.�.�), parallel
languages (Section �.�.�), and run-time support libraries (Section �.�.�).
�.�.� Structural Abstraction

LPARX's Region abstraction and its region calculus are based in part on
the domain abstractions explored in the scientific programming language FIDIL [�].
FIDIL's domain calculus provides operations such as union and intersection over arbitrary
index sets; however, FIDIL is intended for vector supercomputers and therefore
does not address data distribution. LPARX borrows a subset of FIDIL's calculus
operations to provide the structural abstractions for data decomposition and interprocessor
communication on multiprocessors.

Whereas FIDIL supports the notion of arbitrary non-rectangular index sets,
LPARX restricts index sets to rectangles. A prototype of LPARX, called LPAR
[�], supported FIDIL-style regions. However, we found that such generality,
and the associated complexity and run-time performance penalty, was unnecessary
for the class of irregular block-structured applications targeted by LPARX. We believe
that such abstractions, if needed, should be included as a separate type.

FIDIL's irregular array structure, called a Map, is used to represent both
meshes and arrays of meshes. We found that the Map's overloaded functionality complicated
the programmer's model. Therefore, LPARX implements the Map using an
XArray and a Grid, and it distinguishes between concurrent computation (over the
Grids in an XArray) and sequential computation (over the elements of a Grid).

Crutchfield et al. independently developed similar Region abstractions based
upon FIDIL for vector architectures [�]. Based on this framework, they have developed
domain-specific libraries for adaptive mesh refinement applications in gas dynamics
[�]. Their adaptive mesh refinement libraries have been parallelized using
our software infrastructure [�] (see Section �.�.�).

The array sublanguage ZPL [�] employs a form of region abstraction.
ZPL does not explicitly manage data distribution, which it assumes is handled by another
language. It uses its region constructs to simplify array indexing and as iteration
masks; in contrast, LPARX employs Regions to specify run-time data decompositions
and express communication dependencies. ZPL regions are not first-class, assignable
objects as in LPARX.
Building on the LPARX region calculus and structural abstraction, Fink
and Baden [�] have developed a run-time data distribution library that provides an
HPF-like mapping strategy with first-class, dynamic distribution objects supporting
both regular and irregular block decompositions. In their system, all decisions about
data decomposition and mapping are made at run-time, providing support for distributions
that are unknown at compile-time or which may change during execution.
Currently, compiled languages such as HPF support neither general block-irregular
decompositions nor run-time data distribution.

The Structural Abstraction (SA) parallel programming model [�] extends
the LPARX abstractions with a new data type (an unstructured Region) to address
other classes of irregular scientific applications, such as unstructured finite element
problems and irregularly coupled regular meshes [�]. The goal of SA is to unify
several previous domain-specific systems, including LPARX, multiblock PARTI [�],
and CHAOS [�].
�.�.� Parallel Languages

The parallel programming literature describes numerous languages, each of
which provides facilities specialized for its own intended class of applications. In the
following survey, we evaluate various parallel languages on their ability to solve the
dynamic, block-irregular problems targeted by LPARX.
Data Parallel Fortran Languages
High Performance Fortran (HPF) [�] is a data parallel Fortran which combines
the array operations of Fortran 90, a parallel forall loop, and data decomposition
directives based on the research languages Fortran D [�] and Fortran 90D
[�]. It is quickly becoming accepted by a number of manufacturers as a
standard parallel programming language for scientific computing. HPF has been targeted
towards regular, static applications such as dense linear algebra but provides
little support for irregular, dynamic computations [�]. HPF represents data decompositions
using an abstract index space called a template. Arrays are mapped to
templates, and then templates are decomposed across processors.

One limitation of templates, as well as all other HPF data decomposition
entities, is that they are not first-class, language-level objects. Rather, they exist
only as compile-time directives, which are essentially comments to the compiler.
Thus, the application has limited control over run-time data distribution. Although
HPF supports dynamically allocatable arrays and pointer arrays, their utility is limited
at present because the application has little run-time control over how array
data is distributed; the processor distribution must be known at compile-time. This
has motivated the High Performance Fortran Forum to consider new mechanisms for
dynamic applications. HPF defines a redistribute directive that allows the application
to change array decompositions, but data redistribution is local to a program
unit and cannot be passed back to the calling routine. Furthermore, arrays may
be decomposed using only a limited set of regular, predefined distribution methods
(e.g. a uniform block decomposition). HPF does not yet support user-defined irregular
decompositions.
Fortran D [�] relies on the same data distribution model as HPF and
therefore suffers the same limitations for dynamic problems. One distinguishing feature
of Fortran D is its support for pointwise mapping arrays in which individual
array elements may be mapped to arbitrary processors [�]. An application could
in theory construct block-irregular decompositions by mapping each array element in
a block to the same processor. However, such an element-by-element decomposition
cannot exploit the inherent block structure of the application and must instead maintain
mapping information for each array element, at substantial cost in both memory
and communication overheads. Pointwise mappings are therefore inappropriate for
block-irregular applications.

To avoid the limitations of the HPF decomposition model, Vienna Fortran
[�] defines more general dynamic data distribution directives. However, Vienna
Fortran restricts the types of irregular data decompositions available to the application.
It supports pointwise mappings as in Fortran D (with the same limitations for
block-structured methods) and also tensor products of 1d block-irregular decompositions.
These mechanisms alone cannot describe the irregular blocking structures that
arise in adaptive mesh refinement and recursive coordinate bisection [�].
Data Parallel C++ Languages

The pC++ [�] programming language is a data parallel extension of
C++. It implements a "concurrent aggregate" [�] model in which a parallel operation
is applied simultaneously to all elements of a data aggregate called a "collection".

(Information about the second High Performance Fortran standardization effort can be found at World Wide Web address ftp://hpsl.cs.umd.edu/pub/hpf_bench/index.html.)
Each element of a collection may be a complicated C++ object. This form of coarse-grain
data parallelism is similar to LPARX's forall loop acting on Grid elements of
an XArray collection.
pC�� aligns and distributes collections across processors using the same
model as HPF� however� pC�� de�nes �rst�class Processor� Distribution� and
Align objects� Because decompositions may be easily modi�ed at run�time� pC��
allows more �exibility than HPF for dynamic applications� Although pC�� does not
currently support irregular decompositions� classes similar to LPARX�s Region� Grid�
and XArray could be written in pC���
The C** [?] language defines a coarse-grain, concurrent aggregate model of parallelism similar to that of pC++. However, it does not provide explicit data decomposition mechanisms; the application has no control whatsoever over data distribution.
Task Parallel Languages

In the previously described data parallel models, programs apply a sequence of operations to an array or collection of data objects distributed across processors. The task parallel programming model takes a different approach in which programs consist of a number of asynchronous, independent, communicating parallel processes. Task parallel languages such as CC++ [?], CHARM [?], Charm++ [?], Fortran M [?], and Linda [?] define a set of mechanisms that coordinate process execution and communication among autonomous tasks. Task parallelism provides no explicit support for data decomposition.

Task parallelism is ideally suited for computations integrating various heterogeneous operations, such as a multidisciplinary simulation coordinating various independent components [?]. However, it is inappropriate for the coarse-grain scientific applications addressed by LPARX, which are more naturally expressed in a coarse-grain data parallel fashion (see Section [?]).
Split-C

Split-C [?] is a parallel extension to C for distributed memory multiprocessors. Split-C gets its name from its split-phase communications model: it allows the programmer to overlap communication and computation through a two-phase data request. The application initiates a request for data and then computes until the data arrives. Synchronization primitives ensure that communication has completed.

Split-C supports fine-grain data accesses to a global address space through a special type of pointer called a "global pointer." Dereferencing a global pointer results in interprocessor communication. By distinguishing global pointers from local pointers (i.e. pointers within a single address space), Split-C provides a simple but realistic cost model for interprocessor communication. Split-C also supports data layout for regular problems through the "spread array," which is distributed across processors as in HPF's uniform BLOCK and BLOCK CYCLIC decompositions. There is no support for irregular arrays.
Although Split-C defines efficient communication mechanisms for executing interprocessor communication, it does not help the programmer determine the schedule of communication or manage the associated data structures. The Split-C run-time system does not eliminate duplicate requests for the same data item, nor does it aggregate messages, two optimizations provided by the CHAOS run-time system [?]. For example, although the numerical kernel for the Split-C EM3D application [?] is only about ten lines of code, EM3D would require several hundred lines of initialization code to calculate data dependencies and manage ghost cells.
Applicative Languages

SISAL [?] and NESL [?] are applicative programming languages which restrict functions to be free of side effects, a requirement that simplifies the work of the compiler and exposes more potential parallelism. SISAL and NESL rely on sophisticated compiler technology to analyze the program and automatically decompose data and schedule parallel tasks. While the automatic detection of parallelism is extremely attractive, these languages have not yet demonstrated that the compiler alone can extract sufficient information from the program to efficiently distribute data for dynamic, irregular problems on message passing architectures.

(Applicative languages require that the value of any expression depend only on the value of each constituent subexpression and not on their order of evaluation. Functions are not allowed to modify global data.)
Run-Time Support Libraries

The CHAOS (formerly PARTI) [?] and multiblock PARTI [?] libraries provide run-time support for data parallel compilers such as HPF and Fortran D [?]. Both libraries support an "inspector-executor" model for scheduling communication at run-time. In the inspector phase, the application computes the data motion required to satisfy data dependencies and saves the resulting communication pattern in a "schedule." The executor later uses this schedule to fetch remote data values. Schedule generation can be thought of as "run-time compilation" to compute data dependencies that cannot be known at compile-time. CHAOS and multiblock PARTI optimize schedules to minimize interprocessor communication; standard optimizations include eliminating duplicate requests for the same remote data item and aggregating many small messages into a single large message. Often, the cost of creating a communications schedule can be amortized over many uses if data dependencies do not change.
CHAOS implements pointwise mapping arrays for unstructured calculations such as sweeps over finite element meshes and sparse matrix computations [?]. It has also been used to parallelize portions of the CHARMM molecular dynamics application [?]. The Fortran D run-time system employs CHAOS to support Fortran D's mapping arrays [?]. Recently, CHAOS has been extended to support unstructured applications consisting of complicated C++ objects [?]. However, recall that such unstructured representations are inappropriate for the irregular but structured applications targeted by LPARX.
The multiblock PARTI library is targeted towards block-structured applications. It supports the uniform BLOCK, BLOCK CYCLIC, and CYCLIC array decompositions of HPF and has been used in the run-time system for the Fortran 90D compiler [?]. Multiblock PARTI defines canned routines that fill ghost cells and copy regular sections between arrays. Although it has been employed in the parallelization of computations with a small number of large, static blocks (e.g. irregularly coupled regular meshes [?]), multiblock PARTI has not been applied to problems with a large number of smaller, dynamic blocks, such as the structured adaptive mesh problems targeted by LPARX.
Quinlan has developed a parallel C++ array class library called P++ [?] that supports fine-grain data parallel operations on arrays distributed across collections of processors. P++ automatically manages data decomposition, interprocessor communication, and synchronization. In contrast to the fine-grain parallelism of P++, LPARX employs coarse-grain parallelism, which is a better match to current coarse-grain message passing architectures because it allows more asynchrony between processors. Indeed, to improve the efficiency of the fine-grain model, Parsons and Quinlan [?] are developing run-time methods for automatically extracting coarse-grain tasks from P++.
The POOMA (Parallel Object Oriented Methods and Applications) project at Los Alamos National Laboratory is developing a parallel run-time system for scientific simulations. When completed, it will support arrays (as in P++), matrices, particle methods, and unstructured meshes. POOMA employs a layering strategy (similar in philosophy to our own) in which libraries at higher levels in the abstraction hierarchy provide more application-specific tools than lower layers.
PETSc (Portable Extensible Tools for Scientific Computing) [?] is a large toolkit of mathematical software for both serial and parallel scientific computation. It targets more "traditional" mathematical algorithms for sparse and dense matrices, including Krylov iterative methods, linear and nonlinear system solvers, and some (sequential) partial differential equation solvers for finite element and finite difference schemes [?]. The toolkit employs a data-structure-neutral implementation which permits users of the numerical routines to use their own application-specific data structures. Although PETSc does not currently provide the irregular array structures needed by structured multilevel adaptive mesh applications (see Chapter [?]), it may be possible to extend PETSc by integrating LPARX's XArray and Grid abstractions into the PETSc data-structure-neutral framework.

(Papers on the POOMA project have not yet been published; information is available from their World Wide Web address http://www.acl.lanl.gov/PoomaFramework.)
Multipol [?] is a run-time library of distributed data structures designed to simplify the implementation of irregular problems on distributed memory architectures. It supports a number of non-array data structures such as graphs, unstructured grids, hash tables, sets, trees, and queues. However, Multipol does not currently support the type of irregular arrays employed by LPARX applications.
Analysis and Discussion

Every great scientific truth goes through three stages. First, people say it conflicts with the Bible. Next, they say it had been discovered before. Lastly, they say they always believed it.

— Louis Agassiz
LPARX is a portable programming model and run-time system which supports coarse-grain data parallelism efficiently over a wide range of MIMD parallel platforms. LPARX's abstractions enable the programmer to reason about an algorithm at a high level, and we have used it as a foundation for building APIs for structured adaptive mesh methods and particle methods. Its structural abstraction enables the application to manipulate data decompositions as first-class objects. LPARX is intended for applications with changing non-uniform workloads, such as particle calculations, and for computations with dynamic, block-irregular data structures, such as structured adaptive mesh methods. Its philosophy is that data partitioning for such irregular algorithms is heavily problem-dependent and therefore must be under the control of the application.
LPARX provides four new data types: an integer vector called a Point, an index set-valued object called a Region, a dynamic array called a Grid, and an array of distributed Grids called an XArray. Efficient high-level copies between Grids hide interprocessor communication and low-level bookkeeping details. LPARX supports a coarse-grain data parallel model of execution over XArrays via the forall loop.
In the following sections, we analyze in more detail the contributions and limitations of the LPARX approach.
Structural Abstraction

Perhaps the greatest contribution of LPARX is the concept of "structural abstraction," that is, the ability to represent and manipulate decompositions as first-class, language-level objects separately from the data. Instead of supporting only a limited set of predefined data distribution strategies, LPARX provides the application with a framework for creating its own problem-specific decompositions. To our knowledge, LPARX is the first and only parallel system that efficiently supports arbitrary dynamic, user-defined block-irregular data decompositions.
LPARX's region calculus operations express data dependencies in geometric terms, independently of the spatial dimension and data decomposition. Dimension independence means that the same data decomposition and communications code can be used for 2D and 3D versions of an application. The programmer can develop a simpler 2D version of the problem on a workstation and, when confident that the code has been debugged, apply the computational resources of a parallel machine to the full 3D application. Indeed, we adopted this approach in the design and implementation of the structured adaptive mesh API library and application described in Chapter [?].
While we have emphasized the utility of structural abstraction for multiprocessors, these ideas also apply to single processor systems. Many scientific applications, such as structured adaptive mesh methods, exhibit irregular data structures and irregular communication patterns independent of the parallelism. The region calculus provides a powerful methodology for describing and managing such irregularity.
Limitations of the Abstractions

The design of any software system involves a trade-off between generality and specific mechanisms supporting a particular problem class. We believe that LPARX strikes a good balance. Its parallelization mechanisms are sufficiently general to support both particle methods and structured adaptive mesh methods. Of course, there are many applications that LPARX does not address. LPARX does not support the unstructured methods targeted by CHAOS [?], nor does it provide the dynamic irregular data types of Multipol [?]. It cannot handle the task parallelism of CC++ [?] or Fortran M [?]. LPARX applies only to problems with irregular, block-structured data exhibiting coarse-grain data parallelism. Recent work with the Structural Abstraction (SA) model [?] extends the LPARX ideas to address other classes of irregular scientific applications (e.g. unstructured methods).
Another limitation of LPARX is that its representation of a data decomposition may not necessarily match the programmer's view of the data. LPARX's XArray and Grid abstractions are intended to support dynamic, block-irregular computations, and this representation may not be the most natural one for some applications. For example, particle applications require a non-uniform decomposition of space to balance workloads. Programmers do not care that the non-uniform decomposition is represented using an XArray of Grids; they are only interested in accessing particle information from a particular region of space. The solution in this case is to use LPARX's parallelization facilities as a basis for application-specific APIs that hide such details (see Chapter [?]).
One final limitation of LPARX is that each Grid may be assigned to only one logical processor. (Recall that a single logical processor may actually consist of many physical processors.) Currently, this restriction has little practical impact, as most numerical kernels are written in sequential Fortran 77. However, with the upcoming availability of parallel HPF numerical routines, it will become increasingly important that LPARX support hierarchical parallelism, in which LPARX manages communication and data decomposition for irregular collections of arrays that are, in turn, split across processor subsets [?] with different numbers of logical processors. Fortunately, LPARX's current restriction is easy to lift: we simply allow each Grid to be assigned to a subset of processors. However, calling HPF from LPARX introduces some interesting language interoperability issues (see Section [?]).
Shared Memory

At first glance, shared memory multiprocessors with coherent caches might appear easier to use than message passing multiprocessors, because the programmer could allow the hardware caching mechanisms to manage data locality. Unfortunately, this is not always the case [?]. Experiments with the Wisconsin Wind Tunnel shared memory simulator [?] indicate that the explicit management of data locality can dramatically improve performance for dynamic scientific applications [?]. Instead of relying on the hardware cache coherence mechanism alone, irregular calculations employ specialized communication scheduling techniques [?] similar to those pioneered in CHAOS [?]. Thus, efficient shared memory implementations require the same memory management techniques as efficient message passing implementations; it is not sufficient to rely on automatic hardware caching mechanisms. LPARX provides the application with the explicit, high-level mechanisms needed to efficiently manage data locality within the memory hierarchy.
Coarse-Grain Data Parallelism

Recall from Section [?] that LPARX separates the expression of parallelism (data decomposition, communication, and parallel execution) from numerical computation. To LPARX, numerical work is performed by sequential processing nodes. This execution model matches the systems architecture of most message passing multiprocessors, which typically consist of powerful compute nodes connected via a communications network. LPARX manages the parallelism across nodes and the interprocessor communication between nodes, and the numerical routines handle computation on a single node.
Note that a processing "node" may actually consist of multiple physical processors, providing a simple form of hierarchical parallelism. For example, some of the nodes on newer Intel Paragons actually contain two compute processors (ignoring the third processor normally reserved for communication). Programmers can annotate Fortran numerical routines running on the Paragon to take advantage of this second processor.

There are two main advantages to separating the management of parallelism from numerical computation: (1) performance and (2) software re-use. LPARX's model enables programmers to tune numerical kernels without concern for the parallel structure of the application. Code may be optimized to take advantage of specialized node characteristics, such as multiple processors (as on the Intel Paragon), cache sizes, or vector units (as on the Cray C90). Efficient parallel programs start with the efficient use of node processors.
Numerical routines may be written in any language, enabling LPARX to leverage mature sequential compiler technology. Existing optimized kernels may be used, often without change, in parallel applications. Programmers may use the language which is most appropriate; for example, Fortran, in spite of its limitations, provides a natural and simple syntax for array-based computation.

The primary disadvantage of external numerical routines is language interoperability, which we discuss next.
Language Interoperability

Numerical routines in LPARX applications are typically written in Fortran, which does not understand the concept of an LPARX "Grid." Language interoperability addresses the question of how to interface between two different languages. Recall that interoperability is not difficult for sequential languages such as Fortran: calling Fortran requires only a pointer to the Grid data (passed to Fortran as an array) and the dimensions of the associated Region. By default, LPARX adopts Fortran's column-major array ordering convention. Language interoperability for High Performance Fortran, however, is substantially more involved [?].
Although HPF defines an interface to subroutines written in other languages through the notion of "extrinsic procedures," it does not address how other languages may call routines written in HPF. One difficult problem is how to communicate the representation of distributed data between LPARX and HPF. HPF arrays are considerably more complicated than Fortran or C arrays. Fortran arrays of a given type are completely described by three items: (1) the starting location of the array in memory, (2) the bounds of the array, and (3) the ordering of the array elements (i.e. column-major). In comparison, elements of an HPF array may be distributed across processors, aligned to other arrays, and ordered in various ways (e.g. BLOCK or CYCLIC). To call HPF, LPARX must:
- understand how HPF represents decomposed arrays,
- allocate HPF arrays of a particular decomposition and alignment, and
- pass array structure information into HPF numerical routines.
Unfortunately, the High Performance Fortran specification [?] does not define a standard interface for external languages; instead, it allows manufacturers to develop their own external array representations. The Parallel Compiler Runtime Consortium (PCRC) is developing standard language interoperability mechanisms between run-time libraries, task parallel languages, and data parallel compilers [?]; however, interfacing to HPF in a portable manner is still an open research question.
Communication Model

When designing LPARX, we determined that the basic communication mechanism would be a block copy between two individual Grids. The disadvantage of this mechanism is that it reveals little about the global communication structure among all interacting Grids, limiting opportunities for communication optimizations.
One possible solution [?] employs the communication schedule building techniques of Saltz [?], in which communication is split into two phases: an inspection phase and an execution phase. In the inspection phase, processors build a schedule describing the communication pattern. In our case, the schedule would be built using operations similar to LPARX's block copies between Grids. Communication only occurs when this schedule is later "executed." Typically, the application saves schedules for later re-use.
The advantage of this approach is that, before communication begins, all processors have prior knowledge of the global communication pattern. Thus, they can perform optimizations to minimize communication overheads, such as pre-allocating message buffers and aggregating messages. The lack of global knowledge in the LPARX implementation results in communication overheads (see Section [?]) which could be reduced using schedules [?].
Schedules introduce a variety of interesting implementation issues. How do we keep track of the vast number of schedules in complicated dynamic applications? The structured adaptive mesh calculation described in Chapter [?] would require perhaps forty different active communication schedules. Such bookkeeping facilities are not provided by CHAOS [?] and multiblock PARTI [?], which assume that schedules are managed either by the compiler or the user. Since communication dependencies change, schedules will need to change. How do we know when to re-calculate a schedule? These are open questions for future research.
Future Work

LPARX currently addresses only those applications with irregular but structured data decompositions. The Structural Abstraction (SA) model [?] extends the LPARX ideas to other classes of irregular scientific applications. SA has not yet been implemented, however, and its implementation will require the unification of three different run-time support libraries: LPARX, CHAOS [?], and multiblock PARTI [?].
The acceptance of High Performance Fortran by the scientific computing community introduces a number of interesting research issues. How will LPARX and other languages and run-time systems interface to HPF? The HPF standardization committee has not yet defined the portable language interoperability mechanisms required so that other languages may call external routines written in HPF. The Parallel Compiler Runtime Consortium has begun standardization efforts, but their work is far from finished. Furthermore, LPARX does not yet support multiple processor owners per Grid, limiting its ability to exploit processor subsets [?].
Finally, more research remains on how to integrate communication schedules into the LPARX model. LPARX will require new forms of run-time support to manage the numerous changing communication schedules employed by dynamic, irregular applications.
Chapter [?]

Implementation Methodology

I really hate this damned machine;
I wish that they would sell it.
It never does quite what I want,
But only what I tell it.

— Dennie L. Van Tassel, "The Compleat Computer"
Introduction

In Chapter [?] we introduced the LPARX parallel programming model. In this chapter, we describe the implementation methodology and the set of programming abstractions used in the development of the LPARX run-time system.
As illustrated in Figure [?], the LPARX implementation is based upon three different software layers. At the very bottom of the software infrastructure is a basic portable message passing system called MP++. Built on top of the message passing layer, Asynchronous Message Streams (AMS) provides high-level abstractions for asynchronous interprocessor communication that hide low-level details such as message buffer management. The Distributed Parallel Objects (DPO) layer extends AMS's communication mechanisms with distributed object naming and object-to-object communication facilities. We have implemented MP++, AMS, and DPO as C++ class libraries.
[Figure omitted: a layer diagram showing the Adaptive Mesh and Particle APIs and their applications above LPARX, the implementation abstractions, and the message passing layer.]

Figure [?]: The LPARX run-time system is based upon the following three levels of the software infrastructure: a message passing library called MP++, Asynchronous Message Streams (AMS), and Distributed Parallel Objects (DPO).
While we will emphasize the use of the DPO and AMS mechanisms in the design of the LPARX run-time system, we note that these facilities may be useful in other application domains. Many scientific methods, such as tree-based algorithms in N-body simulations [?], rely on elaborate, dynamic data structures and exhibit unpredictable, unstructured communication patterns. The implementation of such numerical methods would be greatly simplified using the run-time support of DPO and AMS.

This chapter is organized as follows. We begin with the motivation behind the DPO and AMS abstractions and discuss related work. In Section [?] we describe the MP++, AMS, and DPO mechanisms. Implementation details and performance overheads are presented in Section [?]. Finally, we conclude in Section [?] with an analysis and discussion of this work.
[Figure omitted: five communicating objects and the corresponding communication time-line ending at a barrier.]

Figure [?]: LPARX programs are modeled as a number of objects (Grids) with asynchronous and unpredictable communication patterns. All processors must wait at a global synchronization barrier until all interprocessor communication has terminated.
Motivation

For the purposes of the LPARX implementation and run-time system, we characterize LPARX programs as follows:

- An LPARX program consists of a relatively small number (e.g. tens to hundreds) of large, complicated objects (Grids), each of which is owned by a particular processor.
- Communication between these objects is asynchronous and unpredictable; that is, Grids do not know when, or even if, communication will occur. Communication between Grids is specified via the LPARX copy operations.
- Communication phases containing LPARX copies are terminated by a global barrier synchronization that ensures that all interprocessor communication has finished.
This execution model is illustrated in Figure [?]. The leftmost figure shows five objects with unpredictable and asynchronous communication patterns. Communication times between objects are presented in the time-line view on the right. Upon reaching the global synchronization barrier, processors wait until all interprocessor communication has terminated.
For most of the remainder of this chapter, we will assume that this asynchronous model accurately reflects the run-time characteristics of real LPARX applications. However, in Section [?], we will see that the assumption of asynchronous and unpredictable interprocessor communication is unnecessarily general. In fact, we can predict communication patterns between Grids using schedules [?], and we can exploit this knowledge to reduce run-time overheads.
Developing asynchronous code directly on top of a message passing library can be tedious and error-prone, as the programmer is responsible for a number of low-level activities:

- The message passing model forces the programmer to explicitly manage message buffers. The programmer must ensure that buffers are sufficiently large to hold all message information. Buffers must be packed and unpacked, often with data of various types, sizes, and alignments. Such message buffer management can be particularly challenging for complicated objects, since storage requirements for message buffer space may not be known in advance.
- Every message send must be matched by a corresponding message receive on the appropriate processor. While this is not generally difficult in applications where communication patterns are known, asynchronous applications do not usually know when (or even if) messages are expected to arrive.
- To implement global synchronization barriers, the application must detect when all interprocessor communication has terminated.
To alleviate the burden of implementing LPARX, we introduce two levels of intermediate abstractions between LPARX and the message passing library: Distributed Parallel Objects (DPO) and Asynchronous Message Streams (AMS).
Related Work

The abstractions provided by the Asynchronous Message Stream and Distributed Parallel Objects libraries build on a number of ideas originally developed by the concurrent object oriented programming community. The notion of communicating objects, or "actors," was first described by Hewitt [?] and then further developed by Agha [?]. Actors are concurrent objects which communicate with each other via messages. Actors execute in response to messages, and each actor object may contain several concurrently executing tasks. Actor-based languages include ABCL [?], Cantor [?], and Charm++ [?].
Implementations of actor languages require complicated compilation strategies and sophisticated run-time support libraries, such as the Concert system [?] for fine-grain object management. Because of this complexity, we have not based the LPARX run-time system on an existing concurrent object-based language. Instead, we borrowed features that were specifically needed for the LPARX implementation. For example, DPO's distributed object naming mechanisms and its notion of primary and secondary objects (described in Section [?]) are based in part on the distributed facilities described by Deshpande et al. [?]. The AMS abstractions combine ideas from Active Messages [?], asynchronous remote procedure calls [?], and the C++ I/O stream model [?].
Another related paradigm, developed by the distributed systems community, is virtual shared memory, which provides the illusion of a single, shared, coherent address space for systems with physically distributed memories. Virtual shared memory models include page-based [?] and object-based systems [?]. Page-based virtual shared memory enforces consistency at the level of the memory page, typically one to four thousand bytes. Because such a coarse page granularity results in poor performance due to false sharing, object-based systems provide consistency at the level of a single user-defined object. These systems are inappropriate for our implementation for two reasons. First, they require complicated and expensive operating system and compiler support that is currently unavailable on production multiprocessor architectures. Second, the virtual shared memory paradigm implements a read-modify-write model (similar to cache lines and virtual memory pages) that results in unnecessary interprocessor communication: to modify an object, a processor must first read the entire object into local memory, modify it, and then write it back. In contrast, DPO and AMS communicate only the message data necessary to modify an object.
The designers of the pC++ run-time support library [?] address some of the same implementation issues as LPARX. In fact, their model of parallel execution for the pC++ run-time system is very similar to ours described in Section [?]. There are two important differences: (1) the pC++ run-time system is supported by a compiler, and (2) pC++ objects may be fine-grain objects. The design of DPO assumes that programs contain a relatively small number of large, coarse-grain objects, which enables DPO to replicate object naming information across processors, whereas pC++ must distribute such data. Our implementation eliminates the costly communication needed to translate object names at the cost of additional, although acceptable, memory overheads (see Section [?]).
Active Messages [?] is an asynchronous communication mechanism which, like asynchronous remote procedure calls [?], sends a message to a specified function that executes on message arrival. AMS combines this asynchronous message delivery mechanism with the concept of a message stream [?] to hide message buffer management details. Active Messages is optimized for message sizes of only a few tens of bytes, whereas AMS messages are typically hundreds or thousands of bytes long. AMS does not assume efficient fine-grain message passing facilities, which are not currently available on most parallel architectures, and requires only basic message passing support. The implementation of the CHAOS++ system for unstructured collections of objects [?] employs a similar abstraction of "mobile objects" that define packing and unpacking operations similar to those of AMS.
�
Layer   Facilities

DPO     name translation for distributed objects;
        communication between objects;
        control over object ownership

AMS     message stream abstraction hides buffering details;
        communication between handlers on processors;
        global synchronization barriers

MP++    point-to-point message sends and receives;
        collective communication (broadcasts and reductions)

Table [...]: A brief summary of the facilities provided by the three LPARX implementation layers: DPO, AMS, and MP++.
Implementation Abstractions

Nothing you can't spell will ever work.
– Will Rogers
The implementation of the LPARX run-time system was influenced by a
number of design considerations. First, the implementation must be portable across
a wide range of MIMD parallel platforms and yet provide good performance. It
should not rely on architecture-specific facilities, such as fine-grain message passing
(e.g. Active Messages [...]). Second, it may not assume compiler support other than
that provided by a standard compiler such as C++. All decisions about data distribution,
communication, and synchronization are to be made at run-time. Finally,
the LPARX implementation should provide communication facilities for complicated
data structures (e.g. those with pointers), which may require special treatment when
communicated across address spaces.

The LPARX implementation infrastructure consists of three layered software
libraries: Distributed Parallel Objects (DPO), Asynchronous Message Streams
(AMS), and a portable message passing library called MP++. Table [...] summarizes
the operations provided by each layer. The MP++ portable message passing layer,
described in Section [...], implements very basic interprocessor communication facilities.
Building on the point-to-point messages of MP++, AMS (Section [...]) defines
a "message stream" abstraction that hides details of packing and unpacking data
and sending and receiving messages. DPO (Section [...]) defines mechanisms for
managing objects distributed across processor memories. We conclude this section
with an example of how these three layers interact with LPARX to implement the
interprocessor communication necessary for LPARX Grid copies.
Message Passing Layer

At the very bottom of the LPARX software hierarchy is MP++, an
architecture-independent message passing layer similar in spirit to MPI [...]. MP++
provides facilities for asynchronous and synchronous point-to-point message passing,
barrier synchronization, broadcasts, and global reductions.

To port our software infrastructure (approximately thirty thousand lines of
C++ and Fortran code) to a new multiprocessor, we need only port the MP++ library;
no other code changes are necessary. Porting MP++ to a new parallel machine typically takes only
a few hours and a few hundred lines of code; for example, the port to the IBM SP2
required only about [...] lines of new code. To port MP++, the programmer must
translate the generic MP++ message passing calls into architecture-specific calls. For
instance, MP++ message send routine mpSend is implemented using csend on the Intel
Paragon and mpc_bsend on the IBM SP2. Our software is currently running on the
Cray C90 (single processor), IBM SP2, Intel Paragon, single processor workstations,
and networks of workstations under PVM [...]. In the past, MP++ has also supported
the Intel iPSC/860, Kendall Square Research KSR-1, nCUBE nCUBE/2, and
Thinking Machines CM-5, all of which are now obsolete.
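The porting strategy described above can be sketched in a few lines of C++. This is an illustrative sketch, not MP++'s actual source: the signatures of mpSend and mpReceive are assumptions, the vendor mappings (csend, mpc_bsend) appear only as comments taken from the text, and the default branch substitutes a hypothetical in-process loopback queue so the sketch compiles and runs anywhere.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <deque>
#include <vector>

// Sketch of an MP++-style portability shim.  Each port maps the generic
// calls onto a machine's native primitives (csend on the Paragon,
// mpc_bsend on the SP2); the fallback below uses an in-process queue.
#if defined(PARAGON)
// void mpSend(const void* buf, int len, int dest, int tag)
//     { csend(tag, buf, len, dest, 0); }
#elif defined(IBM_SP2)
// void mpSend(const void* buf, int len, int dest, int tag)
//     { mpc_bsend(const_cast<void*>(buf), len, dest, tag); }
#else
static std::deque<std::vector<char> > pending;  // loopback "network"

void mpSend(const void* buf, int len, int /*dest*/, int /*tag*/) {
    const char* p = static_cast<const char*>(buf);
    pending.push_back(std::vector<char>(p, p + len));  // enqueue a copy
}

int mpReceive(void* buf, int maxlen, int /*src*/, int /*tag*/) {
    std::vector<char> msg = pending.front();  // oldest message first
    pending.pop_front();
    int len = std::min(static_cast<int>(msg.size()), maxlen);
    std::memcpy(buf, msg.data(), len);
    return len;  // number of bytes delivered
}
#endif
```

Because every call site in the layers above uses only the generic names, retargeting the library means rewriting only the bodies inside one such conditional branch.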
On a single processor workstation, MP++ can execute in a "simulated parallel
machine" mode in which multiple UNIX processes emulate the processing nodes
of a parallel machine. This environment is well suited for code development and
debugging, as the workstation programming environment is more mature than that
provided on most parallel architectures. In practice, very little LPARX application
development and debugging is performed on parallel architectures. For the past two
years at the University of California at San Diego, MP++ has been used on workstations
to teach message passing parallel programming.

[Footnote: MPI is a relatively new portable message passing standard. When efficient implementations of MPI are readily available on parallel architectures, the MP++ layer will be replaced with MPI.]
Asynchronous Message Streams

The Asynchronous Message Stream (AMS) communication paradigm builds
on ideas from asynchronous remote procedure calls [...], Active Messages [...],
and the C++ I/O stream library [...]. AMS requires only basic message passing
support such as that provided by MPI [...]. Its message stream abstraction frees the
programmer from many low-level message passing details.

Both AMS and the Active Messages [...] model provide mechanisms for
sending a message to a handler which then consumes the message. An important
difference is that AMS is intended for coarse-grain communication. Although Active
Messages provides some facilities for sending long messages, it emphasizes fine-grain
message passing. AMS hides all message buffer management via its message stream
abstraction; Active Messages does not.
Message Streams

Communication between processors uses AMS's "message stream" abstraction,
based on the C++ I/O stream model. A message stream contains two endpoints:
a sending end and a receiving end. Data is written into the communication stream
at the sending end and read out from the receiving end. AMS message streams
are intended to be short-term message connections between processors. They hide
all details of message buffer management from the programmer. AMS automatically
packetizes the message stream and coordinates interprocessor communication
through the message passing layer. Because the application is shielded from the internal
representation of data in the message stream, AMS could transparently encode
data [...] for transmission among heterogeneous processors which use different
data representations; our current AMS implementation does not provide this service
because of the high cost of changing data formats.

Each user-defined object to be communicated between processors must define
pack and unpack functions which copy object data into and out of the message
stream. These pack and unpack functions are simple to write and resemble standard
C++ I/O statements (see the example later in this section). AMS defines pack and
unpack functions for C++ built-in types such as integer and double.
Asynchronous Communication

AMS supports two forms of interprocessor communication: sends and forwarding
sends. In an AMS send, the processor initiating the send specifies a destination
processor and a user-defined function handler on that processor. The handler is
simply a function which is to be called when the AMS message arrives. The originating
processor opens an AMS message stream connection to the handler on the remote
processor. As shown in Figure [...]a, the handler is awoken by AMS, consumes data
from the message stream, takes some appropriate action defined by the handler, and
then exits. The handler may perform computation on the incoming data stream or
incorporate the message data into local data structures but may not return data.

AMS's forwarding send allows handlers to return data. In the forwarding
send, the originating processor provides two additional arguments: the processor and
user-defined handler which are to receive the reply from the first handler. The handler
processing the data request is oblivious as to where the results are being forwarded;
it is only aware that it is writing data into an outgoing message stream. All message
stream connections between processors are managed by AMS. Figure [...]b shows processor
P sending a request to a handler on Q; the result of the computation is returned
to a handler on P. In the general case, the reply may be directed towards any processor.
Note that P does not block while waiting for the reply from Q; instead, it overlaps
the communication with Q with other computation or perhaps other communication.
Figure [...]: (a) Communication between processors P and Q via the Asynchronous Message Stream layer. Processor P sends a message to Q. A user-defined handler function awakes on Q, consumes the incoming message stream, performs some computation, and exits. (b) Processor P makes a data request from processor Q, which processes the request and returns data to P. Note that processor P continues to compute while Q services the request.
Global Synchronization

Detecting the end of a communication phase in an application with asynchronous
and unpredictable interprocessor communication patterns is a difficult task.
Because communication patterns are not known, processors cannot predict when or
even if messages are expected to arrive. Furthermore, most message passing implementations
cannot guarantee that messages sent from the same processor arrive in
the same order as they were sent.

AMS implements a very simple synchronization protocol: processors can
pass the synchronization barrier only after their communication with all other processors
has ended. In all interprocessor communication, AMS routines that open
the message streams record the number of messages destined for every other processor.
Upon reaching the global synchronization point, processors perform a global
addition to obtain the number of messages that each processor is to have received.
Each processor then waits until it has received the proper number of messages. This
protocol guarantees that processors will pass the synchronization point only after all
communication has terminated.

Note that it is not sufficient to execute a simple global barrier because
messages from the same processor may not arrive in the same order as they were
sent. Thus, it would be possible for an asynchronous message sent before the barrier
on one processor to arrive after the barrier on another processor if the asynchronous
message and the barrier message arrived out of order. The cost of this synchronization
protocol is discussed in Section [...].
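The message-counting protocol described above can be illustrated with a small sketch. This is not AMS's actual code: the real system spreads the bookkeeping across processors, whereas this hypothetical simulation holds all the per-processor send counts in one address space and performs the "global addition" directly.

```cpp
#include <cassert>
#include <vector>

// sent[p][q] counts messages processor p has sent to q during the phase.
// The global reduction gives each processor q the total number of
// messages it must receive before it may pass the barrier.
std::vector<int> expectedReceives(const std::vector<std::vector<int> >& sent) {
    const int P = static_cast<int>(sent.size());
    std::vector<int> expected(P, 0);
    for (int p = 0; p < P; ++p)           // "global addition" over senders
        for (int q = 0; q < P; ++q)
            expected[q] += sent[p][q];
    return expected;
}

// A processor may pass the barrier only once its local receive count
// matches the globally agreed total; out-of-order delivery is harmless
// because only the count matters, not the arrival order.
bool mayPassBarrier(int received, int expected) {
    return received >= expected;
}
```

The key property is visible in the sketch: a plain barrier message could overtake a data message, but a processor that knows it still owes receipt of one more message simply keeps draining the network before proceeding.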
An AMS Example

Figure [...] illustrates sample AMS C++ code based on a geographically structured
genetic algorithms application developed using LPARX [...]. Recall from Chapter [...]
that LPARX Grids may contain elements of any user-defined type. In this particular
application, Grid elements are of type GA_Individual. For many user-defined
structures, such as those containing pointers or other user-defined types, the LPARX
run-time system does not know how to pack and unpack data for transmission across
memory spaces; instead, the application must supply these routines. Fortunately,
they are easy to write and very short (e.g. ten to twenty lines of code).

AMS overloads the standard C++ I/O operators << and >> to write into
and read out of message streams; such an approach mirrors standard C++ input
and output. The definitions of the message stream operators << and >> for class
GA_Individual are shown in Figure [...]. SendPacket in << represents the outgoing
message stream and RecvPacket in >> the incoming stream. In this genetic algorithms
code, all GA_Individual objects contain "genotype" information but only some contain
"phenotype" data, depending on the value of flag has_phenotype. The message
stream functions determine at run-time which data to send and receive based on this
flag. We do not show the operator definitions for class GA_Point.
   // Define the complicated C++ class (GA_Point defined elsewhere)
   class GA_Individual {
      double eval;
      GA_Point genotype;
      int has_phenotype;
      GA_Point phenotype;
   public:
      // public member functions
   };

   // C++ code to write data into the outgoing message stream
   SendPacket& operator<<(SendPacket& outgoing, const GA_Individual& GA)
   {
      outgoing << GA.eval << GA.genotype << GA.has_phenotype;
      if (GA.has_phenotype)
         outgoing << GA.phenotype;
      return(outgoing);
   }

   // C++ code to read data from the incoming message stream
   RecvPacket& operator>>(RecvPacket& incoming, GA_Individual& GA)
   {
      incoming >> GA.eval >> GA.genotype >> GA.has_phenotype;
      if (GA.has_phenotype)
         incoming >> GA.phenotype;
      return(incoming);
   }
Figure [...]: This C++ code illustrates AMS's message stream abstractions for a genetic algorithms application [...]. Information for class GA_Individual is written into a message stream using << and read out with >>. This approach is similar to standard C++ input and output. Similar code written without the benefits of the AMS message stream abstraction would be considerably more complicated.
Mechanism                         Description

open s to (h, p)                  open a message stream s to handler h on processor p

open s to (h1, p1) and            open a message stream s to handler h1 on processor p1 and
forward to (h2, p2)               then forward the resulting communication to handler h2 on
                                  processor p2

close s                           close the message stream s

s << o                            write data for object o into the message stream s

s >> o                            read data for object o from message stream s

s << PackArray(o, n)              write data for n objects of array o into message stream s

s >> UnPackArray(o, n)            read data for n objects of array o from message stream s

barrier                           force all processors to wait at a global synchronization barrier
                                  until all communication has terminated

Table [...]: This table summarizes the asynchronous communication facilities provided by the AMS layer.
This simple example illustrates a number of AMS's features. First, AMS
hides all low-level message buffering details. The application may determine at
run-time what class information is to be transmitted through the message stream.
Furthermore, applications are free to mix objects of various types in the same message
stream. Finally, the message stream operators hide class information in a hierarchical
manner (e.g. the definitions of << and >> for GA_Point are hidden from
GA_Individual). Similar code written without the benefits of the AMS message
stream abstractions would be considerably more complicated; the code would be responsible
for managing buffer pointers, moving data into and out of a message buffer,
and checking for buffer underflow and overflow. AMS hides these details.

Summary

Table [...] summarizes the asynchronous communication facilities provided
by the Asynchronous Message Stream layer.
Distributed Parallel Objects

The Asynchronous Message Stream abstractions hide many low-level message
passing details; however, one detail that AMS shares with the message passing
model is that messages are directed towards specific processors. The LPARX execution
model views the application as a collection of communicating objects (Grids);
thus, mechanisms that communicate directly between objects, rather than between
processors, would be more appropriate. Such facilities are provided by the Distributed
Parallel Objects (DPO) layer, which extends AMS's mechanisms with distributed object
naming and object-to-object communication facilities.

Distributed Objects and Name Resolution

The Distributed Parallel Objects layer manipulates physically distributed
C++ objects in a shared name space. Programs consist of a relatively small number
of large, coarse-grain objects and execute in SPMD (Single Program, Multiple Data)
fashion. Objects communicate through asynchronous messages. Each object is assigned
to a particular processor, and ownership does not change. Each processor
has a copy of every object, although the processor owning an object has a different
version of the object than all other processors. The owner has a primary copy whereas
all other processors have a secondary, or ghost, copy (see Figure [...]). The primary
copy of an object contains all pertinent data. Although secondary objects may explicitly
cache data locally, they are intended to act as "handles" through which the
program accesses the primary version of the object. DPO's model of primary and secondary
objects is based upon the distributed communication and object management
mechanisms described by Deshpande et al. [...].

Object identifiers are used to name DPO objects lying on different processors.
DPO assigns each primary object a globally unique identifier upon creation
and enters the object's memory location, its identifier, and its owner into a registry.

[Footnote: In fact, our current DPO implementation does provide facilities for object migration, but we have not found these mechanisms to be useful.]
Figure [...]: These figures show the five objects of Figure [...] distributed across two processors. Processor 0 owns objects 1 and 2, and Processor 1 owns 3, 4, and 5. Primary objects are indicated by solid circles and secondary copies by dashed circles. Each processor has a copy of every object, although the type of copy (primary or secondary) differs.
Secondary objects receive the same identifier as their associated primary object. An
object may determine whether it is primary or secondary by simply comparing its
processor identifier against the object's assigned owner in the registry. Registry information
is replicated across processors to avoid the interprocessor communication
otherwise required by distributed name translation.

Note that the DPO model is ultimately unscalable because secondary object
information and registry data are replicated on all processors. However, these memory
overheads are not a concern for our targeted class of applications. DPO objects are
intended to be large (e.g. an entire Grid), for which the overhead of data replication
is small in comparison to the quantity of data stored in the primary object (see
Section [...]). Furthermore, current trends in multiprocessor design favor parallel
computers with a small number of very powerful processors, not thousands of small
processors. Scientific applications typically use only a few tens of processors at any
one time. Thus, our design decision to replicate storage is appropriate for today's
parallel machines.
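The replicated registry can be sketched as follows. The class and member names here (Registry, registerObject, ownerOf, isPrimary) are illustrative stand-ins, not DPO's actual interface; the point is that because every processor holds the whole table, translating an identifier to an owner is a purely local lookup.

```cpp
#include <cassert>
#include <map>

// One entry per distributed object, replicated on every processor.
struct RegistryEntry {
    int   owner;     // processor holding the primary copy
    void* location;  // this processor's copy (primary or secondary)
};

class Registry {
    std::map<int, RegistryEntry> table;  // identical on all processors
    int nextId;
public:
    Registry() : nextId(0) {}

    // Assign a globally unique identifier and record owner and location.
    int registerObject(int owner, void* location) {
        table[nextId] = RegistryEntry();
        table[nextId].owner = owner;
        table[nextId].location = location;
        return nextId++;
    }

    // Name translation requires no interprocessor communication.
    int ownerOf(int id) const { return table.find(id)->second.owner; }

    bool isPrimary(int id, int myProcessor) const {
        return ownerOf(id) == myProcessor;
    }
};
```

In SPMD execution all processors create objects in the same order, so the identifier counters stay in step and the replicated tables remain consistent without any exchange of messages.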
   // Show the definition of an XArray_ofGrid_ofDouble
   class XArray_ofGrid_ofDouble {
      int n;
      Grid_ofDouble **grids;
   public:
      // public member functions
      Grid_ofDouble& operator() (const int i) { return(*(grids[i])); }
   };

   // Define function XAlloc for a 1d XArray of 2d Grid of double
   void XAlloc(XArray_ofGrid_ofDouble& xarray, const int n,
               Region2 *regions, int *assignments)
   {
      xarray.grids = new Grid_ofDouble *[n];
      for (int i = 0; i < n; i++)
         xarray.grids[i] = new Grid_ofDouble(regions[i], assignments[i]);
      xarray.n = n;
   }
Figure [...]: LPARX function XAlloc allocates an XArray of Grids using Region and processor assignment information. See Figure [...] for the definition of the Grid class.
Each LPARX Grid is a DPO object. LPARX function XAlloc supplies a
Region and a processor assignment when creating a Grid (see Figure [...]). All copies
of the Grid, even the secondary versions, store the Region, but only the processor
that actually owns the Grid allocates the array data (see Figure [...]). The distinction
between primary and secondary objects is hidden from the LPARX application. Because
Region information for each Grid is replicated on all processors, the LPARX
run-time system can calculate the data to send to an off-processor Grid, determined
by the intersection of the two Regions, without the need to explicitly fetch the Region
from the other processor; copies with empty intersections require no communication
whatsoever.
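The replicated-Region optimization boils down to a local box-intersection test. The sketch below uses a hypothetical minimal Region2 (a 2-d box with inclusive bounds); LPARX's actual Region class is far richer, but the no-communication check works the same way.

```cpp
#include <algorithm>
#include <cassert>

// A 2-d index box with inclusive lower and upper bounds.
struct Region2 {
    int lo[2], hi[2];
};

// Intersection of two boxes; may come out empty.
Region2 intersect(const Region2& a, const Region2& b) {
    Region2 r;
    for (int d = 0; d < 2; ++d) {
        r.lo[d] = std::max(a.lo[d], b.lo[d]);
        r.hi[d] = std::min(a.hi[d], b.hi[d]);
    }
    return r;
}

bool empty(const Region2& r) {
    return r.lo[0] > r.hi[0] || r.lo[1] > r.hi[1];
}

// Because every processor stores every Grid's Region, this test is
// purely local: when the result is false, the copy is already done
// and no message is ever sent.
bool copyNeedsCommunication(const Region2& dst, const Region2& src,
                            const Region2& mask) {
    return !empty(intersect(intersect(dst, src), mask));
}
```

In spatially localized computations most pairs of Grids do not overlap, so in the common case the copy terminates after a handful of integer comparisons.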
   // Define a 2d Grid of double
   class Grid_ofDouble : private DistributedParallelObject {
      Region2 myRegion;
      double *myData;
   public:
      // public member functions
   };

   // Define the Grid constructor (called when creating a Grid)
   Grid_ofDouble::Grid_ofDouble(const Region2& region, const int processor) :
      DistributedParallelObject(processor)
   {
      myRegion = region;
      if (is_a_primary_copy()) {
         const int size = myRegion.number_of_elements();
         myData = new double[size];
      } else {
         myData = (double *) NULL;
      }
   }
Figure [...]: This C++ code illustrates how a Grid object is created. Grid is a subclass of DistributedParallelObject. The Grid is assigned a unique identifier and registered during the initialization of the base class. All Grids store Region information but only the owner (determined by the call to is_a_primary_copy, a member function of class DistributedParallelObject) actually allocates the data for the array.
Execution Model

DPO provides a very simple execution model that avoids many of the implementation
difficulties and overheads (such as execution dispatch, multiple execution
streams, frame allocation, and scheduling [...]) found in more complicated systems.
All processors execute the same code in SPMD fashion, and only the owner of an
object (as reported by the DPO registry) executes the computation for that object
(i.e. the owner-computes rule).

For example, recall that LPARX's forall loop iterates in parallel over the
   // User's C++ code actually looks like this (forall is a C++ macro)
   forall(i, xarray)
      ...
   end_forall

   // But after C++ translates the macro, the compiler sees this code
   for (int i = 0; i < xarray.n; i++) {
      if (xarray(i).owner()) {
         ...
      }
   }
   Synchronize();
Figure [...]: The LPARX forall loop is implemented as a C++ macro. The user writes the code at the top of the figure, and the C++ pre-processor generates the code at the bottom. The call to Synchronize at the end of the parallel loop synchronizes all processors. LPARX also defines a special form of the end_forall that allows the programmer to override the implicit synchronization; there is no need to synchronize at the end of parallel loops that perform no interprocessor communication.
Grids of an XArray (see Section [...]). In the DPO implementation, all processors
loop over all elements of the XArray, but only the processor owning a Grid executes
the computation for that Grid (see Figure [...]). Note that because Grids are coarse-grain
objects, the overhead associated with the check for ownership is insignificant
when compared to the computation for a Grid. Of course, all details such as checks
for ownership are managed by the forall loop and are completely hidden from the
programmer. To eliminate the need for repeated ownership checks in every forall
loop, we could have the XArray locally cache a list of all Grids owned by its processor,
but we have not done so since the overhead of ownership checks is negligible for our
coarse-grain applications.
Communication

Object-to-object communication in DPO uses the same message stream abstraction
as AMS. However, instead of sending a message to a handler on a specified
processor, the message is sent to a function belonging to a specified object. DPO
supports two forms of object-based communication:

- A send sends a message to a handler belonging to the primary copy of a specified
object. This communication mechanism is an object-based version of the AMS
send shown in Figure [...]a.

- A forwarding send sends a request to a handler belonging to the primary copy
of a specified object. The object processes the request and forwards the reply
to another object. This is an object-based version of the AMS forwarding send
of Figure [...]b.

Note that DPO manages all details of object name translation and ownership. As
in AMS, all communication is asynchronous. Global synchronization barriers are
provided using AMS's protocol.
Summary

Table [...] summarizes the object-to-object communication and object management
facilities provided by the Distributed Parallel Objects layer.
Communication Example

In this section, we describe how MP++, AMS, DPO, LPARX, and an LPARX
application interact to implement interprocessor communication. Note that the following
details (except for AMS message stream packing and unpacking for user-defined
types) are hidden from LPARX applications and are completely managed by
the LPARX run-time system. Our example will be the execution of the FillPatch
loop described in Section [...] and reproduced here in Figure [...]. Recall that
Mechanism                     Description

open s to o                   open a message stream s to primary object o

open s to o1 and              open a message stream s to primary object o1 and forward
forward to o2                 the resulting communication to o2 (primary or secondary)

close s                       close the message stream s

lookup id                     look up object identifier id in the DPO registry and return
                              the associated object

register o                    register object o in the DPO registry and return an identifier

owner o, is_primary o,        return information about the specified object o
is_secondary o

s << o, s >> o, barrier       same as in AMS

Table [...]: This table summarizes the object management mechanisms defined by the Distributed Parallel Objects layer.
   // Communicate boundary data between neighboring partitions
   function FillPatch(XArray of Grid of Double U)
      // Loop over all pairs of grids in U
      forall i in U
         // Mask off the ghost cells (copy interior values only)
         // Function region() extracts the region from its argument
         Region Inside = grow(region(U(i)), -1)
         for j in U
            // Copy data from intersecting regions
            copy into U(j) from U(i) on Inside
         end for
      end forall
   end function

Figure [...]: FillPatch updates ghost cell regions of Grid U(j) with overlapping non-ghost cell data from adjacent Grids U(i). This code is reproduced here from Figure [...]. Communication in the current LPARX implementation is asynchronous and processors calculate data dependencies only for those Grids they own.
FillPatch updates the ghost cell regions of each subgrid with data from the interior
(non-ghost cell) portions of adjacent subgrids.

In our LPARX implementation, all message communication between processors
is asynchronous. Processors execute the iterations of the communication loop
independently, calculating data dependencies only for those Grids they own. Communication
within the copy routine is asynchronous and non-blocking, and the actual
data copy may not complete until a later time. In fact, data motion is not guaranteed
to terminate until all processors reach a global synchronization point. Thus, there
may be multiple copy operations executing in parallel, overlapping interprocessor
communication. LPARX inserts a global synchronization barrier at the end of every
communication phase (e.g. in the end_forall at the end of the forall loop) to ensure
that all communication has terminated before computation begins. Split-C [...]
supports a similar split-phase communications paradigm.
We now consider the execution of the LPARX statement:

   copy into A from B on Inside

where A and B are Grids and Inside is a Region. Note that we have replaced U(j)
and U(i) from Figure [...] with A and B (respectively) to simplify the notation. Recall
from Section [...] that this statement copies data from B into A where the Regions
of A and B intersect with Inside. Because Region information for Grids A and B
is replicated on all processors, LPARX can immediately calculate the intersection
between the Regions of A and B. If this intersection is empty, then no data is to be
moved and the copy is finished. This optimization is extremely important in spatially
localized computations (such as those addressed by LPARX) since most intersections
are empty. Otherwise, there are four possible cases to consider, as shown in Table [...],
depending on whether A and B are primary or secondary objects.

LPARX stores Grid data only on the processor which actually owns the
Grid. If both A and B are primary objects, then all data is available locally and a
memory-to-memory copy suffices. If A is a primary copy and B is a secondary copy,
then A asks the processor owning B to send it the necessary Grid data (a "get"). If
Grid A      Grid B      Action

primary     primary     local memory-to-memory copy
primary     secondary   A requests Grid data from the owner of B
secondary   primary     Grid data from B is sent directly to the owner of A
secondary   secondary   request the processor owning B to send the appropriate
                        Grid data to the processor owning A

Table [...]: There are four possible cases to consider in the implementation of the LPARX statement "copy into A from B", depending on whether Grids A or B are primary or secondary DPO objects. LPARX stores Grid data only on the processor which actually owns the Grid (the primary copy). Accessing Grid data for secondary objects requires interprocessor communication. Routines to check whether objects are primary or secondary are provided by DPO (see Table [...]).
the data for B is local but A is stored elsewhere, then Grid data from B is sent directly
to the processor owning A (a "put"). Finally, if neither A nor B are primary objects,
then a request is sent to the processor owning B to send Grid data to the processor
owning A.
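The four cases above reduce to a two-bit dispatch. The enum and function names in this sketch are illustrative, not LPARX's actual internals, but the case analysis is exactly the one in the table.

```cpp
#include <cassert>

// Action chosen by "copy into A from B" based on where the primary
// copies of the two Grids live relative to the executing processor.
enum CopyAction {
    LocalCopy,   // both primary here: memory-to-memory copy
    Get,         // A primary, B remote: ask B's owner for the data
    Put,         // B primary, A remote: ship data straight to A's owner
    ThirdParty   // neither primary: ask B's owner to send to A's owner
};

CopyAction dispatchCopy(bool aIsPrimary, bool bIsPrimary) {
    if (aIsPrimary && bIsPrimary) return LocalCopy;
    if (aIsPrimary)               return Get;
    if (bIsPrimary)               return Put;
    return ThirdParty;
}
```

Because every processor executes the same copy statement in SPMD fashion, each one runs this dispatch locally (using the replicated registry's primary/secondary tests) and performs only the role the case assigns to it.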
Figures [...] and [...] provide time-line views (not drawn to scale) of how
MP++, AMS, DPO, LPARX, and an LPARX application interact to implement the
interprocessor communication associated with an LPARX copy. These examples assume
that Grid elements are complicated C++ objects, such as GA_Individual in
Figure [...], so that application-level packing and unpacking routines are required.
For Grids whose elements are standard C++ types (e.g. double or integer), LPARX
manages packing and unpacking automatically without the intervention of the application.
Figure [...] illustrates the transmission of Grid information to a Grid lying
on another processor. The application calls the LPARX copy routine, which calculates
what Region of Grid data is necessary to satisfy the copy. DPO translates the
name of the destination Grid, and AMS opens a message stream connection to the
destination. Grid data is copied into the message stream using the application-level
packing routines. If the internal message stream buffer overflows, AMS initiates a
Figure [...]: This figure provides a time-line view (not to scale) of the transmission of Grid data to another processor. Arcs show transitions between MP++, AMS, DPO, LPARX, and the LPARX application. The message send in MP++ and the copying of new data in the application-level packing routines occur in parallel. See the text for a detailed explanation.
Figure [...]: This figure provides a time-line view (not to scale) of the reception of Grid data from another processor. Arcs show transitions between MP++, AMS, DPO, LPARX, and the LPARX application. The reception of messages in MP++ and the unpacking of data in the application-level routines occur in parallel. See the text for a detailed explanation.
send to transmit the data bu�er� allocates a new data bu�er� and resumes packing
data� Note that the transmission of the old bu�er and the allocation of the new bu�er
are completely transparent to the application�level packing routines� Interprocessor
communication executes in parallel with the �lling of the new bu�er� After all Grid
data has been transmitted� AMS closes the message stream and LPARX returns from
the copy routine�
Figure �.�� shows the other end of the interprocessor communication: the
reception of Grid information from a Grid lying on another processor. Once the first
packet of the message stream arrives, AMS activates a message handler provided by
DPO. DPO then translates the name of the destination object (an LPARX Grid)
and calls an LPARX routine to process the incoming data stream. LPARX extracts
Region information and begins to copy data out of the message stream (using the
application-level unpacking routines) and into the Grid. Concurrently, MP++ receives
the next portion of the message stream. When the internal message buffer underflows,
the application-level unpacking routines transparently switch to the new message
buffer. If the needed message buffer is not yet available, then AMS waits until the
appropriate message arrives. While waiting, AMS will remove incoming message
packets and queue them locally, but it will not activate another handler (to avoid
corrupting global state) until the current one finishes. After all Grid information has
been extracted, LPARX exits to AMS, which closes the message stream, leaves the
handler, and returns control to the application.
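The buffer management that the packing routines never see can be illustrated with a small sketch. The class below is not the AMS interface; the names and the flush policy are illustrative, and in the real system a full buffer is handed to MP++ for an asynchronous send and replaced by a fresh buffer, so packing and transmission overlap.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Illustrative sketch of the AMS message-stream idea: a packer writes into a
// fixed-size buffer, and when the buffer overflows the stream "flushes" it.
// Real AMS would start an asynchronous MP++ send of the old buffer here and
// swap in a new one; this sketch just counts flushes.
class MessageStream {
public:
    explicit MessageStream(std::size_t capacity) : buffer_(capacity), used_(0) {}

    // Pack raw bytes; transparently flush when the internal buffer fills.
    void pack(const void* data, std::size_t n) {
        const char* p = static_cast<const char*>(data);
        while (n > 0) {
            if (used_ == buffer_.size()) flush();
            std::size_t chunk = std::min(n, buffer_.size() - used_);
            std::memcpy(buffer_.data() + used_, p, chunk);
            used_ += chunk;
            p += chunk;
            n -= chunk;
        }
    }

    std::size_t flushes() const { return flushes_; }
    std::size_t pending() const { return used_; }

private:
    void flush() {
        ++flushes_;   // stand-in for: start asynchronous send, allocate new buffer
        used_ = 0;
    }
    std::vector<char> buffer_;
    std::size_t used_;
    std::size_t flushes_ = 0;
};
```

Because the flush happens inside `pack`, the application-level packing routine simply writes bytes and remains oblivious to buffer boundaries, which is the transparency property described above.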
�.� Implementation and Performance

None of us really understands what's going on with all these numbers.
- David Allen Stockman, referring to the ���� federal budget
Distributed Parallel Objects, Asynchronous Message Streams, and MP++
have been implemented as a collection of C++ classes; they require no special compiler
support. MP++ consists of approximately ��� lines of C++ code, AMS ��� lines, and
DPO ���� lines. LPARX and its associated libraries add another ��� lines of code.
In the following three sections, we discuss implementation issues and over-
heads. We begin in Section �.�.� with a comparison of interrupt and polling mech-
anisms for asynchronous communication. In Section �.�.�, we present memory and
communication overheads for AMS and DPO. Finally, we analyze the performance of
LPARX on a simple application (the Jacobi problem of Section �.�) and compare its
performance to a message passing implementation.
�.�.� Interrupts versus Polling
The AMS run-time system requires a mechanism to detect when an AMS
message has arrived and invoke the appropriate function handler. There are two
methods typically employed to process such asynchronous events: interrupts and
polling (which we use). Each method has its advantages and disadvantages.
The primary advantage of interrupt-driven message handlers is that they
do not require polling calls in the code. Interrupt-driven messages, though, have a
number of drawbacks. First, interrupt mechanisms are not portable and vary signifi-
cantly from multiprocessor to multiprocessor. In fact, some message passing libraries
(e.g. MPI [��]) do not support interrupt-driven messages. Second, interrupt-driven
message handlers must be careful when writing to global variables (i.e. those variables
not local to the handler). Because there is little control over when interrupts occur,
handlers may corrupt global state being modified by another routine. For instance,
suppose an interrupt occurred while the main program was allocating memory. If the
interrupt handler also attempted to allocate memory, the handler could corrupt the
state of the memory allocator. One solution would be to require the programmer to
mask interrupts during all sensitive calculations, but such an approach is tedious and
error-prone.
In the implementation of AMS and DPO, we use polling to process asyn-
chronous events. The AMS run-time system provides a special polling function which
the libraries call to check on pending messages and, if found, invoke the associated
function handlers. The advantage of this method is that handlers are called only at
those times considered safe. The drawback is that the code must periodically check
for pending message events. In practice, however, this requirement is not particularly
bothersome: LPARX applications are oblivious to polling, as such calls are hidden
within the implementation of LPARX's copy functions.

  Description                                       Overhead
  AMS overhead per message                          � bytes
  DPO and AMS overhead per message                  �-� bytes
  Total overhead per message                        ��-��� bytes
  Memory overhead per DPO object (per processor)    � bytes
  Total storage per LPARX Grid (per processor)      �-� bytes

Table �.�: Message length and memory overheads for AMS, DPO, and LPARX. See the text for a detailed explanation.
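The polling discipline described above can be sketched with a small model. This is illustrative rather than the actual AMS interface: pending handlers are queued as they arrive, and they run only when the library reaches a known-safe point and calls `poll()`, so a handler can never interrupt another operation mid-update.

```cpp
#include <deque>
#include <functional>
#include <utility>

// Illustrative model of polling-based event handling (hypothetical names, not
// the AMS API). Message-arrival events are queued; poll(), invoked only at
// safe points such as inside LPARX's copy functions, drains the queue and runs
// each handler to completion before the next one starts.
class EventQueue {
public:
    void post(std::function<void()> handler) {
        pending_.push_back(std::move(handler));
    }

    // Returns the number of handlers run during this poll.
    int poll() {
        int handled = 0;
        while (!pending_.empty()) {
            std::function<void()> h = std::move(pending_.front());
            pending_.pop_front();
            h();   // safe: no other library operation is in progress
            ++handled;
        }
        return handled;
    }

private:
    std::deque<std::function<void()>> pending_;
};
```

This is exactly the property that interrupt-driven handlers lack: in this model, global state (e.g. a memory allocator) is touched by handlers only between, never during, library operations.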
�.�.� DPO and AMS Overheads
Table �.� provides various memory and message length overheads for AMS
and DPO; performance overheads will be discussed in the following section. Each
of the LPARX, DPO, and AMS layers adds header information to all interprocessor
communication. AMS requires � bytes of header information for function handlers,
message size, packet sequence numbers, and other miscellaneous data. DPO adds an
additional � to � bytes (depending on the type of message) for object identification.
Finally, LPARX adds between � and � bytes (� + �d, where d = �, ..., � is the
spatial dimension of the Grid) to specify the Regions used in a copy. Note that such
overheads should not significantly affect message passing times. On most parallel ar-
chitectures, the transmission time for short messages is dominated by communication
start-up overheads; for long messages, the additional overhead of a hundred bytes is
insignificant.
On every processor, each DPO object requires a memory overhead of �
bytes for name translation and object ownership information. To that, LPARX adds
another �-� bytes (� + �d for a d-dimensional Grid) for Region and other data.
Only the primary version of an object allocates the Grid array. Recall that the
memory overhead of data replication is relatively modest in comparison to the size
of a typical Grid, which may contain several tens of thousands of bytes. An alter-
nate, scalable implementation strategy would involve fine-grain, distributed transla-
tion schemes such as those implemented by CHAOS [��] and pC++ [��], at the cost
of additional interprocessor communication. While scalable, such implementation
approaches are inappropriate for our coarse-grain applications running on today's
parallel platforms.
�.�.� Application Performance
To provide an overall estimate of LPARX run-time performance overheads,
we implemented a �d Jacobi iterative solver with a ��-point finite difference stencil
in LPARX and also by hand using message passing. The message passing implemen-
tation should provide an approximate lower bound for the "best" possible implemen-
tation. We chose the Jacobi application because it is simple enough to parallelize by
hand. While we would have preferred to compare a "real" LPARX application such
as a structured adaptive mesh application, it would have taken months, if not years,
to parallelize such a code by hand.
Table �.� compares the performance of the two codes for a ��� x ��� x ���
mesh on � Paragon nodes. The hand-coded application made a number of simplifying
assumptions, namely that each processor was assigned only one subgrid and that the
problem was static so that it could precompute communication schedules. Without
these simplifying assumptions, the hand implementation would have been consider-
ably more difficult. While such assumptions may apply to this simple example, they
are not valid for the dynamic irregular applications which are the intended target of
LPARX. For example, structured adaptive mesh calculations may assign several sub-
grids to each processor, and particle methods change communication dependencies as
                        By Hand    LPARX v�.�    No Barrier
  Total time (ms)       �����      ���           ����
  Computation (ms)      ����       ����          ����
  Communication (ms)    ����       ��            ���
  Messages (kilobytes)  ��         ����          ����
  Message starts        ���        ��            ���

Table �.�: LPARX overheads for a �d Jacobi (��-point stencil) relaxation calculation on a ��� x ��� x ��� mesh on � Paragon nodes. The "By Hand" application was parallelized using only message passing. The numbers for LPARX v�.� reflect the current LPARX implementation, and the "No Barrier" numbers estimate the performance of LPARX without the global barrier synchronization. All numbers measure the wall-clock time for one iteration of the algorithm and were averaged over ��� iterations. Message statistics represent single processor averages for one iteration.
particles move.
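For reference, one relaxation sweep of such a Jacobi solver can be sketched serially as below. The 7-point stencil and the flat array layout are illustrative assumptions; in the LPARX and message passing versions, each processor additionally exchanges ghost cells with neighboring subgrids before sweeping its local interior.

```cpp
#include <vector>

// Serial sketch of one Jacobi relaxation sweep on an n*n*n mesh with a 7-point
// stencil (stencil choice assumed for illustration). Each interior point is
// replaced by the average of its six axis-aligned neighbors; boundary values
// are left untouched.
void jacobi_sweep(const std::vector<double>& u, std::vector<double>& unew, int n) {
    auto idx = [n](int i, int j, int k) { return (i * n + j) * n + k; };
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < n - 1; ++j)
            for (int k = 1; k < n - 1; ++k)
                unew[idx(i, j, k)] =
                    (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)] +
                     u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)] +
                     u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]) / 6.0;
}
```

The computation phase is identical in both implementations; only the surrounding communication machinery differs, which is why the table isolates computation time from communication time.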
Table �.� reports five performance numbers: total execution time, numerical
computation time, communication time, average number of bytes communicated per
processor, and average number of message sends per processor. All measurements
are reported per iteration and were averaged over ��� iterations. The performance
numbers in the "By Hand" column reflect the message passing implementation; the
"LPARX v�.�" column represents the performance of the current LPARX software
release.
The LPARX computation time is identical to that of the message passing
code; LPARX overheads appear only in the communication routines. The LPARX
communication time is �% slower than the message passing version. This translates
into an overall execution time which is �% longer than the equivalent message passing
code.
The LPARX code communicated approximately five percent more bytes than
the message passing implementation. Part of this overhead is due to the additional
information (described in the previous section) which must be communicated with
each LPARX message. Because LPARX cannot assume that only one subgrid is as-
signed to each processor (as was assumed in the hand-coded message passing version),
it must incorporate descriptive information into each message identifying the subgrid
where data is to be stored.
Most of LPARX's communication overhead can be attributed to the extra
messages sent as part of its synchronization protocol. Recall that at the end of a
communications loop, LPARX detects the termination of communication via a global
barrier synchronization that accounts for the additional message sends. Our synchro-
nization protocol requires log P message starts on P processors. However, it would
be possible to eliminate this costly synchronization through an alternative implemen-
tation strategy (see Section �.�.�) using run-time schedule analysis techniques [��].
By eliminating the barrier, we obtain the results in the "No Barrier" column; LPARX
overheads now drop to approximately one percent of the total execution time of the
program.
�.� Analysis and Discussion

It is a capital mistake to theorize before one has data.
- Sir Arthur Conan Doyle
To simplify the implementation of the LPARX run-time system, we have
introduced two intermediate software layers between LPARX and our MP++ portable
message passing library: Asynchronous Message Streams (AMS) and Distributed Par-
allel Objects (DPO). AMS and DPO provide support for SPMD programs consisting
of a small number of large, complicated, coarse-grain objects with asynchronous,
unpredictable communication patterns. AMS defines a message stream abstraction
which hides low-level message passing details such as message buffer management.
Building on the AMS facilities, DPO provides mechanisms for manipulating physi-
cally distributed objects in a shared name space. Our software infrastructure runs on
a variety of parallel architectures and requires only basic message passing support.
The primary run-time overhead associated with our implementation strategy
is the global barrier synchronization used to detect the end of interprocessor commu-
nication. We are working to eliminate this overhead through run-time communication
schedule techniques [��].
�.�.� Flexibility
One advantage of the modular implementation approach of our software
infrastructure is the flexibility to re-use various software components as needed.
For example, scientists at Lawrence Livermore National Laboratory have used our
DPO, AMS, and MP++ software to parallelize a structured adaptive mesh library for
hyperbolic problems in gas dynamics [��, ���]. Their library defines a set of ab-
stractions similar to those of LPARX. Their versions of Point, Region, and Grid have been
specialized for their particular class of applications; for example, their Region de-
scribes application-specific properties such as whether meshes are cell-centered or
node-centered.
A direct implementation using LPARX would have involved customizing our
Region and Grid classes to conform to their standard. Because of their investment
in tens of thousands of lines of code and their established user base, it would have
been impossible to re-write their library to conform to LPARX. Instead, they used
the DPO, AMS, and MP++ layers. Their Grid became a DPO object, and they
implemented their own version of the XArray class. This approach enabled them to
leverage both our parallelization support and their extensive amount of application-
specific code.
�.�.� Portability
In the design of the LPARX implementation, we decided that portability
would be provided through the use of a portable message passing layer. Our entire
software infrastructure, approximately thirty thousand lines of C++ and Fortran code,
can be ported to a new architecture merely by changing a few hundred lines of code
in the MP++ library.
The downside to portability via a message passing library is that the message
passing model may not necessarily match or exploit the low-level hardware charac-
teristics of a particular parallel platform. For example, how should LPARX be im-
plemented on a shared memory architecture that supports fine-grain communication?
One possible solution would be to implement a message passing library on top of the
hardware's shared memory mechanisms. However, this strategy would probably not
be as efficient as an implementation which directly takes into account the hardware
support for shared memory.
In the LPARX multitasking port to the Cray C90, our first step was to port
a version of the MP++ library that emulated message passing through shared memory
message buffers. This implementation introduced unnecessary copying through inter-
mediate buffers within the "message passing" layer. We later modified the LPARX
layer software so that Grid-to-Grid copies exploited the Cray's shared memory archi-
tecture and bypassed the message passing layer. Note that these issues deal only with
the implementation of the LPARX system and not LPARX itself; these architecture-
specific implementation details are hidden from the programmer.
Such portability issues apply not only to the design of the LPARX run-time
system but to parallel programs in general. For example, how will MPI [��] message
passing applications run on a shared memory architecture? MPI programs will require
the implementation of a shared memory message passing library similar to what we
implemented for the Cray C90. However, applications using this library will in all
likelihood be less efficient than programs that have been specifically designed to take
advantage of the shared memory support.
What is the best method for implementing LPARX (or, for that matter, any
dynamic, irregular scientific application) on a distributed shared memory multipro-
cessor? We believe this to be an open research question. Studies with distributed
shared memory computers indicate that dynamic, irregular applications can dramati-
cally improve performance by explicitly managing cache locality [��, ���]; it is simply
not sufficient to rely on hardware caching mechanisms. What is the role of the AMS
layer on such a computer? Certainly, the application does not need to pack and
unpack complicated objects for transmission across address spaces if it allows the
hardware coherence mechanism to cache object data automatically. However, appli-
cations might improve performance by bypassing the hardware mechanisms and using
AMS instead. Such questions require further study.
Part of the difficulty in writing portable parallel programs is that no single,
unifying, realistic model of parallel computation and communication has emerged.
There have been several attempts to define a unifying model, such as PMH [��],
LogP [��], BSP [���], and CTA [���]. However, these models tend to focus on per-
formance evaluation. What is needed is a single set of realistic, portable, general
purpose mechanisms for efficient parallel programming. Without high-level support,
programmers are left to employ different implementations on different architectures,
hampering portability.
The advantage of LPARX is that it provides a very high-level set of portable
tools that isolates the programmer from the architecture and gives an intuitive per-
formance model. Although tools such as MPI are portable, they are, in some sense,
"closer" to the machine. The LPARX run-time system may look very different on
a shared memory machine than on a message passing machine, but such details are
hidden from the LPARX programmer.
�.�.� Implementation Mistakes
Recall that in Section �.�.� we modeled LPARX programs as a collection of
objects with asynchronous and unpredictable communication patterns. Our imple-
mentation with AMS and DPO is based on this assumption. In retrospect, we believe
that this model of LPARX communication is unnecessarily general.
LPARX applications alternate between communication and computation
phases. Thus, communication is not asynchronous but is, in fact, limited to the
well-defined communication phases of the program. Furthermore, we can predict
global communication patterns using the inspector/executor paradigm pioneered in
CHAOS [��] and multiblock PARTI [��]. In this model of communication, the inspec-
tor phase calculates a schedule of data motion which is then executed in the executor
phase.
The LPARX schedule building loop (the inspector) would employ the re-
gion calculus and copy-on-intersect operations (i.e. structural abstraction) to specify
data dependencies. Because the schedule provides global knowledge of communica-
tion patterns, each processor would know when its communication had completed and
the costly global barrier synchronization would no longer be required. Preliminary
work with integrating schedule building techniques into LPARX [��] indicates that
most of the overheads described in Section �.�.� can be eliminated.
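The inspector/executor split can be illustrated with a deliberately simplified sketch. It uses one-dimensional intervals in place of LPARX's d-dimensional Regions, and the executor merely counts the cells a real executor would transmit; all names here are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Minimal inspector/executor sketch. The inspector intersects local regions
// with remote regions once to build a data-motion schedule; the executor then
// replays that schedule every iteration. Because the schedule gives each
// processor global knowledge of its communication, no barrier is needed to
// detect when communication has completed.
struct Region { int lo, hi; };            // closed interval [lo, hi]

struct Schedule { std::vector<Region> moves; };

// Inspector: compute all non-empty intersections (the "copy-on-intersect"
// dependencies) between local and remote regions.
Schedule inspect(const std::vector<Region>& local,
                 const std::vector<Region>& remote) {
    Schedule s;
    for (const Region& a : local)
        for (const Region& b : remote) {
            Region r{std::max(a.lo, b.lo), std::min(a.hi, b.hi)};
            if (r.lo <= r.hi) s.moves.push_back(r);
        }
    return s;
}

// Executor: here it just totals the cells to be moved; a real executor would
// issue the corresponding sends and receives.
int execute(const Schedule& s) {
    int cells = 0;
    for (const Region& r : s.moves) cells += r.hi - r.lo + 1;
    return cells;
}
```

The key cost shift is that the (relatively expensive) intersection analysis runs once per change to the grid structure, while the executor runs every iteration with no global synchronization.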
Chapter �

Adaptive Mesh Applications

I have yet to see any problem, however complicated, which, when you looked at it in the right way, did not become still more complicated.
- Poul Anderson, New Scientist
�.� Introduction
In this chapter, we describe the adaptive mesh API (application programmer
interface) component of our software infrastructure (see Figure �.�). This API has
been implemented as a library built upon the parallelization and communication
abstractions of LPARX. It provides the scientific programmer with specialized, high-
level facilities specifically tailored to structured adaptive mesh applications.
We have used our adaptive mesh API to develop a parallel adaptive eigen-
value solver (LDA in Figure �.�) and an adaptive multigrid solver (AMG) for the solution
of eigenvalue problems arising in the first principles simulation of real materials. Ma-
terials design attempts to understand the chemical properties of materials through
computer simulation. Such applications require adaptive numerical methods to accu-
rately capture the chemical behavior of molecules containing atoms with steep nuclear
potentials (e.g. oxygen or transition metals). To our knowledge, this is the first time
that structured adaptive mesh techniques have been used to solve eigenvalue problems
in materials design.
Figure �.�: The adaptive mesh API, built on top of the parallelization and communication mechanisms of LPARX, provides application-specific facilities for structured adaptive mesh methods.
It is an open research question whether the irregularity of an adaptive mesh
calculation can be efficiently implemented in a data parallel language such as High
Performance Fortran [���], which does not readily support dynamic and irregular array
structures. We will present computational results (Section �.�.�) which show that
the restrictions imposed by a data parallel Fortran implementation may significantly
impact parallel performance.
This chapter is organized as follows. We begin by describing the importance
of adaptive mesh algorithms in the solution of numerical problems and review related
work. Section �.� introduces the salient features of the adaptive mesh algorithm to
motivate the software facilities required by the adaptive mesh API. In Section �.�,
we describe our API in detail and explain how its facilities are built on top of the LPARX
abstractions. Section �.� describes our materials design eigenvalue solver, provides
details about the numerical methods, and presents computational results for some
simple materials design calculations. Section �.� analyzes parallel performance and
library overheads. We conclude in Section �.� with an analysis and discussion.
�.�.� Motivation
The accurate solution of many problems in science and engineering requires
the resolution of unpredictable, localized physical phenomena. Examples include
shock waves in computational fluid dynamics [��] and the near-singular atomic core
potentials in materials design [���]. The key feature of these problems is that some
portions of the problem domain (for example, regions containing the shock waves
or the atomic nuclei) require higher resolution, and thus more computational effort,
than other areas of the computational space.
Structured adaptive numerical methods dynamically place computational
resources, such as CPU cycles and memory, in interesting portions of the solution
space; thus, they can achieve better accuracy for the same computational resources
as compared to non-adaptive methods. Although structured adaptive mesh methods
incur some overhead costs associated with adaptivity, such as error estimation and
data structure management, these overheads are insignificant when compared to the
savings gained through selective refinement [��]. For example, by exploiting adap-
tivity in a materials design application (see Section �.�), we have reduced memory
consumption and computation time by more than two orders of magnitude over an
equivalent uniform mesh method [���]. The adaptive code allows us to solve problems
on a high-performance, single processor workstation which would otherwise require
hundreds of gigaflops on a supercomputer with gigabytes of memory.
In general, adaptive methods may be structured or unstructured, depending
on how they represent the numerical solution to the problem, as shown in Figure �.�.
Unstructured adaptive methods [��, ���] store the solution using graph or tree
representations [���, ���]; these methods are called "unstructured" because connec-
tivity information must be stored for each unknown (node) of the graph. Structured
methods, such as adaptive mesh refinement [��] and structured multigrid algorithms
[���, ���], employ a hierarchy of nested mesh levels in which each level consists of
many simple, rectangular grids. Each rectangular grid in the hierarchy represents a
structured block of many thousands of unknowns. Because of these dissimilar data
Figure �.�: (a) Unstructured adaptive methods employ a graph-like representation which requires connectivity information for each unknown in the graph. (b) Structured adaptive methods use a hierarchy of levels in which each level consists of a number of rectangular grids. Each rectangular grid may contain many thousands of unknowns. The dark shaded area represents the portion of level l covered by level l+1.
representation strategies, structured adaptive methods require different software sup-
port and implementation approaches than unstructured adaptive methods. Here we
consider only structured adaptive methods.
Structured adaptive mesh methods are difficult to implement on serial
architectures, not to mention parallel machines, because they rely on dynamic, ir-
regular data structures. Regions of the computational space are dynamically refined
in response to run-time estimates of local solution error, resulting in irregular data
dependencies and communication patterns. On parallel platforms, the programmer is
further burdened with the responsibility of managing data distributed across proces-
sor memories and orchestrating interprocessor communication and synchronization.
Such distractions can significantly increase application development time. Because
adaptive applications change in response to the dynamics of the problem, little can
be known about the structure of the computation at compile-time. Thus, decisions
about data decomposition, the assignment of work to processors, and the calculation
of communication patterns must be made at run-time.
We have developed a structured adaptive mesh API that hides these im-
plementation details. It presents computational scientists with high-level tools that
allow them to concentrate on the application and the mathematics instead of low-level
concerns of data distribution and interprocessor communication. Such support en-
ables scientists to develop efficient, parallel, portable, high-performance applications
in a fraction of the time that would have been required if the application had been
developed from scratch.
�.�.� Related Work
Adaptive mesh refinement techniques for multiple spatial dimensions were
first developed by Berger and Oliger [��, ��] to solve time-dependent hyperbolic
partial differential equations. These techniques are based on previous work on locally
nested refinement structures in one spatial dimension by Bolstad [��]. Adaptive mesh
refinement methods were later used by Berger and Colella to resolve shock waves in
computational fluid dynamics [��]. Our work with adaptive mesh methods applies
this same adaptive framework to elliptic partial differential equations and adaptive
eigenvalue problems [���].
Berger and Saltzman have implemented a parallel �d adaptive mesh refine-
ment code in Connection Machine Fortran for the CM-� [��]. Their data parallel
implementation required that all regions of refinement be the same size. As a result,
the application over-refined some portions of the computational space, using �%
more memory than an equivalent implementation without the uniform size restric-
tion. Our experiments indicate that uniform refinement regions also result in excessive
overheads in three dimensions (see Section �.�.�). Because of compiler limitations,
their code did not execute efficiently on the CM-�.
Quinlan et al. have developed an adaptive mesh library called AMR++ [���],
based on the P++ data parallel C++ array class library [���]. P++ supports fine-grain
data parallel operations on arrays distributed across collections of processors; it auto-
matically manages data decomposition, interprocessor communication, and synchro-
nization. In contrast to this fine-grain array parallelism, we employ a coarse-grain
parallelism in which operations are applied in parallel to entire collections of arrays
(see Section �.�.�). Fine-grain parallelism is difficult to implement efficiently on to-
day's coarse-grain architectures; indeed, Parsons and Quinlan [���] are developing
techniques to extract coarse-grain parallelism from P++ to improve the efficiency of
the fine-grain approach.
An object-oriented library for structured adaptive mesh refinement has been
developed at Lawrence Livermore National Laboratory by Crutchfield et al. [��].
This software is intended to support hyperbolic gas dynamics applications running
on vector supercomputers [���]. The basic abstractions employed in Crutchfield's work
are very similar to our own; in fact, their adaptive mesh refinement libraries have been
parallelized using the LPARX software [���].
Parashar and Browne are developing a software infrastructure supporting
parallel adaptive mesh refinement methods for black hole interactions [���]. Their
method is based on a clever load balancing and processor mapping strategy that
maps grids to processors through locality-preserving space-filling curves [���, ���].
However, their approach imposes two restrictions on the grid hierarchy: (1) all re-
finement regions must be the same size, and (2) refinement regions must be nested in
a tree structure.¹ Our performance analysis in Section �.�.� indicates that uniform
refinement regions over-refine the computational space and are therefore less efficient
than non-uniform refinements. Although we require nested grids, our infrastructure
allows grids to have multiple parents (see Section �.�.�), and therefore provides the
application more freedom in constructing refinement structures. The implications of
the tree nesting restriction have yet to be studied, although we believe that such a
strategy will result in additional costly over-refinement.

¹In general, structured refinement hierarchies do not form a tree.
Figure �.�: The solution to a partial differential equation is resolved on a composite grid which represents a non-uniform discretization of space. In practice, composite grids are implemented as a hierarchy of grid levels. This �d grid hierarchy consists of three grid levels with a mesh refinement factor of four. Levels 0 and 1 each have only one grid, but level 2 has two grids. This composite grid hierarchy is modeled after the one used in the �d hydrogen molecular ion problem shown in Figure �.�a.
�.� Structured Adaptive Mesh Algorithms

Oh, laddie, you've got a lot to learn if you want people to think of you as a miracle worker.
- Scotty, Star Trek: The Next Generation episode "Relics"
This section provides a high-level description of structured adaptive mesh
algorithms. We present the salient features of the method to motivate the abstractions
described in Section �.�. Further numerical details can be found in Section �.�.
Structured adaptive mesh methods [��] solve partial differential equations
using a hierarchy of nested, locally structured finite difference grids. The grid hi-
erarchy can be thought of as a single composite grid in which the discretization is
non-uniform (see Figure �.�). All grids at the same level of the hierarchy have the
same mesh spacing, but each successive level has finer spacing than the one preced-
ing it, providing a more accurate representation of the solution. The difference in
mesh spacing between a grid level and the finer resolution level above it is called the
refinement factor, which is typically two or four.
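The level/spacing relationship can be stated concretely: with refinement factor r, the mesh spacing at level l is the base spacing divided by r applied l times, i.e. h_l = h_0 / r^l. A small helper (illustrative only, not part of the library's API) makes this explicit:

```cpp
// Mesh spacing at a given level of the hierarchy: each level refines the
// previous one by a fixed integer factor, so h_l = h_0 / r^l. The function
// name is illustrative, not an LPARX or adaptive-mesh-API routine.
double level_spacing(double h0, int refinement_factor, int level) {
    double h = h0;
    for (int l = 0; l < level; ++l)
        h /= refinement_factor;   // one refinement per level
    return h;
}
```

For example, with a base spacing of 1.0 and a refinement factor of four (as in the composite grid of Figure �.�), level 2 has spacing 1/16.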
Adaptive methods refine this discretization of space to accurately represent
localized physical phenomena (see Figure �.�). When creating a new level, the hi-
erarchy is refined according to an error estimate calculated at run-time. These new
higher-resolution grids, called refinement patches, are used only where necessary to
meet accuracy requirements. Adaptive methods use computational resources effi-
ciently because they expend extra memory and computation time only in regions of
unacceptable error. In general, the location and size of refinement patches must be
computed at run-time, as they cannot be predicted a priori.

Figure �.�: Three levels of a structured adaptive mesh hierarchy. The eight dark circles represent regions of high error, such as atomic nuclei in materials design applications [���]. The mesh spacing of each level is half of the previous coarser level. This problem is similar to the problem solved in Section �.�.
Structured adaptive mesh algorithms communicate information about the
numerical solution between levels of the hierarchy and also among grids at the same
level of the hierarchy. Around the boundary of each grid patch is a ghost cell region
which locally caches data from adjacent grids or, where no neighboring grids exist,
from the next coarser level of the hierarchy. Without the proper software support,
managing these bookkeeping details can be difficult because of the irregular and
unpredictable placement of refinement patches [��]. Note that ghost cell regions and
communication are required by the adaptive mesh method and are not simply artifacts
of the parallel implementation.
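The same-level part of this ghost-cell bookkeeping can be sketched in one dimension. This is illustrative only: LPARX's actual region calculus is d-dimensional, and the function name and interval representation here are hypothetical.

```cpp
#include <algorithm>

// 1-d sketch of ghost-cell accounting. A patch's ghost region is its index
// interval grown by the ghost width g; the cells of that grown region that
// overlap a same-level neighbor are filled by copying from the neighbor, and
// any remaining ghost cells must come from the next coarser level.
struct Interval { int lo, hi; };   // closed interval [lo, hi]

// Number of ghost cells of `patch` that `neighbor` can supply.
int ghost_cells_from_neighbor(Interval patch, Interval neighbor, int g) {
    Interval left{patch.lo - g, patch.lo - 1};    // ghost cells below the patch
    Interval right{patch.hi + 1, patch.hi + g};   // ghost cells above the patch
    auto overlap = [](Interval a, Interval b) {
        return std::max(0, std::min(a.hi, b.hi) - std::max(a.lo, b.lo) + 1);
    };
    return overlap(left, neighbor) + overlap(right, neighbor);
}
```

In LPARX terms, this grow-and-intersect pattern is precisely what the region calculus and copy-on-intersect operations express, which is why the library can hide the bookkeeping from the application.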
Adaptive mesh methods that use structured refinements possess a number
of advantages over unstructured adaptive methods, which represent solutions using
a graph representation (see Figure �.�). Structured refinement patches exploit the
local structure within the solution: if a point is flagged as needing refinement, then it is
likely that nearby points will also need refinement. Although the grid hierarchy itself
may be non-uniform, patches themselves are uniform. Location and size information
need only be stored for each patch, which in turn may contain many thousands of
unknowns. Because the structure of a grid patch may be represented using only a few
tens of bytes, structure information for the entire grid hierarchy may be replicated
across processor memories.
In comparison� unstructured representations require connectivity informa�
tion for each unknown in the graph� signi�cantly increasing memory overheads� On
parallel computers� the calculation of data dependencies for unstructured problems
scales as the number of unknowns� whereas algorithms for structured adaptive mesh
methods scale as the number of patches�
Furthermore, numerical kernels for structured adaptive mesh methods are
simpler and more efficient than those for unstructured methods [���]. Solvers for
structured methods may employ compact, high-order finite difference stencils. Indexing
is fast and efficient since grid patches are essentially rectangular arrays. Numerical
kernels typically make better use of the cache, as array elements are stored contiguously
in memory, improving cache locality, rather than scattered across memory.

Although local refinement saves both computation time and memory, the
savings in memory may play a more important role in many applications. Available
memory places a hard limit on the problem sizes which can be solved in-core. Problems
larger than a fixed size must resort either to paging, which is terribly slow on
most multiprocessors, or to out-of-core algorithms.
Following the work by Berger and Colella [��], our structured adaptive mesh
application consists of three main components: a numerical solver, an error estimator,
and a grid generator. Note that although our algorithm looks somewhat similar to
adaptive mesh refinement [��], it is not identical. Our intended applications employ
elliptic, not hyperbolic, partial differential equations. Elliptic solvers require different
types of numerical schemes: they use implicit iterative numerical methods [��], as
compared to the explicit time-marching schemes [��] used for hyperbolic problems.

(Footnote: Of course, unstructured representations may be more appropriate for some problems, such as
those with irregular boundaries.)
We begin with a single grid level that covers the entire computational domain
and build the adaptive grid hierarchy level-by-level. Assuming that we have already
constructed a hierarchy with levels l = 0, ..., L, our algorithm is as follows (refer
back to Figure ���):

1. Solve the partial differential equation using all levels of the grid hierarchy. The
resulting solution is the most accurate representation of the answer thus far.

2. Flag points of the grids on level L (and only on level L) where the estimated
error exceeds some specified error threshold.

3. Calculate the locations and sizes of refinement patches which cover the flagged
points on level L. If running on a parallel computer, assign these refinement
patches to processors.

4. Create level L+1 in the hierarchy using refinement patch information from the
previous step. Interpolate the current solution from level L to L+1.

5. Increment L and continue at step (1).

These five steps are repeated until a user-specified maximum number of refinement
levels has been reached or until the problem has been solved to sufficient accuracy.
The structure of the grids (the size, location, and number of refinement regions) is
determined at run-time in response to error estimates derived from the current solution.
The software abstractions needed to implement this algorithm are the subject
of the following section.
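As a concrete illustration of this cycle, the following sketch builds one refinement level in a toy one-dimensional setting. This is our own hypothetical code, not the dissertation's library: cells whose estimated error exceeds a threshold are flagged, maximal runs of flagged cells become patches, and each patch is refined by a factor of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A 1D patch: half-open cell interval [lo, hi) at some level.
struct Patch1D { int lo, hi; };

// Step 2: flag cells whose estimated error exceeds a threshold.
std::vector<bool> flagCells(const std::vector<double>& error, double tol) {
    std::vector<bool> flags(error.size(), false);
    for (std::size_t i = 0; i < error.size(); ++i)
        flags[i] = (error[i] > tol);
    return flags;
}

// Step 3: cover flagged cells with maximal contiguous runs.
std::vector<Patch1D> coverFlagged(const std::vector<bool>& flags) {
    std::vector<Patch1D> patches;
    int n = static_cast<int>(flags.size());
    for (int i = 0; i < n; ) {
        if (!flags[i]) { ++i; continue; }
        int j = i;
        while (j < n && flags[j]) ++j;   // extend the run of flagged cells
        patches.push_back({i, j});
        i = j;
    }
    return patches;
}

// Step 4: create the next level by refining each patch
// (refinement factor two by default).
std::vector<Patch1D> refine(const std::vector<Patch1D>& patches, int factor = 2) {
    std::vector<Patch1D> fine;
    for (const Patch1D& p : patches)
        fine.push_back({p.lo * factor, p.hi * factor});
    return fine;
}
```

For example, a six-cell level with high error in cells 1-2 and 5 yields two patches, [1,3) and [5,6), which become [2,6) and [10,12) on the refined level.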
��� Adaptive Mesh API
Scientists come up with great, wild theories, but then they give them dull,
unimaginative names. ... I tell you, there's a fortune to be made here.

— Calvin, "Calvin and Hobbes"
The structured adaptive mesh algorithm of the previous section is quite difficult
to implement on both sequential and parallel architectures. Refinement regions
vary in size and location in the computational space, resulting in complicated geometries
(see Figure ���). Communication patterns between grid patches and between
grid levels are irregular and change as the hierarchy is modified. On message passing
platforms, the programmer must explicitly manage grid data distributed across
the processor memories and orchestrate interprocessor communication. Even shared
memory multiprocessors require the explicit, low-level management of data locality
and communication for reasonable performance [���, ���]. These implementation
difficulties soon become unmanageable and can obscure the mathematics behind the
algorithms.

The goal of our adaptive mesh API is to provide scientists with high-level
support for structured adaptive mesh applications. Scientists using our API facilities
can concentrate on their specific applications rather than being concerned with
the underlying implementation details. Of course, our software is portable and efficient.
Portability among high-performance computing platforms guarantees that
applications software will run on the most powerful and up-to-date computational
resources available. Such a powerful software infrastructure is essential in developing
sophisticated, reusable code.
����� Software Infrastructure Overview

Figure �� illustrates the organization of our adaptive mesh infrastructure.
The software consists of three primary components: numerical operations, grid management
facilities, and display routines. The numerical routines define the elliptic
partial differential equation to be solved. The display library contains some simple
graphing and plotting facilities for visualizing data (for example, see Figure ���a and
Figure ���). The grid hierarchy management routines comprise the most complex
and interesting portion of the adaptive mesh libraries. These facilities manage all
aspects of the grid hierarchy: data structure bookkeeping, error estimation, workload
balancing and processor assignment, and communication. An important observation
is that the grid management facilities are independent of the numerical details of a
particular elliptic partial differential equation; the same routines may be used to solve
a number of different numerical problems.

Figure ��: Organization of the structured adaptive mesh API library. The software
infrastructure consists of three main components: numerical routines, grid management
facilities, and display functions.
One feature of our adaptive mesh library is that its facilities are independent
of problem dimension. Scientists using the API see the same abstractions and interface
whether they are working in two or three spatial dimensions. Numerical details
differ, but the interfaces for regridding, error estimation, load balancing, and grid
hierarchy management are identical. Dimension independence provides programmers
the freedom to develop and debug simpler, faster 2D versions of their applications
on a workstation using simplified 2D numerical kernels. Then, when confident that
the code is working, programmers can insert the appropriate 3D numerical routines
and recompile on a supercomputer. In practice, we have found dimension independence
particularly useful: the adaptive mesh API libraries and the materials design
application presented in Sections ��� and �� were first developed on workstations
in 2D.

Component             Code Lines
Data Structures       ���
Display Routines      ����
Error Estimation      ��
Grid Generation       ��
Numerical Routines    ���
Statistics Gathering  ��
Workload Balancing    ���
Total                 �����

Table ���: This table provides a breakdown of the eleven thousand lines of code that
constitute the structured adaptive mesh API library, implemented as a collection of
C++ classes and Fortran routines.
We have implemented our structured adaptive mesh API as a collection of
C++ classes and Fortran routines consisting of approximately eleven thousand lines
of code, as shown in Table ���. Our software is built on top of the LPARX abstractions
described in Chapter �. LPARX provides run-time parallel support such as
distributed data management, coarse-grain parallel execution, interprocessor communication,
and synchronization. The adaptive mesh libraries add facilities specifically
tailored towards adaptive mesh applications.

LPARX's concept of structural abstraction and its support for first-class
data decompositions have been vital to our success. Structural abstraction enables us
to represent and manipulate the structure of data (the "floorplan" describing where
refinement regions are located in space and how they are mapped to processors)
separately from the data itself. For example, when adding a new level to the adaptive
mesh hierarchy, we represent refinement regions at the new level as first-class,
language-level objects. The structure of the new refinement level is determined by
regridding routines. Refinement patches are then manipulated by load balancing and
processor assignment algorithms. Only then do we actually allocate the data associated
with the refinement patches. Structural abstraction enables us to represent and
modify dynamic refinement structures at run-time.
The following six sections describe the grid management routines of Figure ��
in more detail. We begin in Section ���� with a discussion of the data
structures used to represent the structured grid hierarchy. We then describe some
of the algorithms used in error estimation (Section �����), regridding (Section �����),
and workload balancing and processor assignment (Section �����). We conclude with
a description of coarse-grain data parallel numerical computation (Section ����)
and communication (Section �����).
����� Data Structures

Recall from Section �� that structured adaptive mesh methods store data
using a composite grid implemented as a hierarchy of levels (see Figure ��). To
represent this structure, we use three data types: a Grid, an IrregularGrid, and
a CompositeGrid. The Grid represents a single, logically rectangular array at one
level of the composite grid hierarchy. A collection of Grids at one level of the hierarchy
is an IrregularGrid, and a set of IrregularGrids organized into levels is a
CompositeGrid.

Figure ��: A composite grid is represented using a Grid, an IrregularGrid, and
a CompositeGrid. A CompositeGrid consists of IrregularGrid objects organized
into levels. Each IrregularGrid is a collection of Grids. Real grid hierarchies in
multiple dimensions are vastly more complicated (see, for example, Figure ���).

LPARX defines the basic building blocks (Region, Grid, and XArray) for
IrregularGrid and CompositeGrid. The IrregularGrid is similar to LPARX's
XArray but is specialized for adaptive mesh methods; for example, it encapsulates
information about mesh spacing that would be inappropriate for an XArray. Each
object provides facilities appropriate for its role in the adaptive mesh hierarchy, as
described in Table ��.
The parallelism in our adaptive mesh application lies at the level of the
IrregularGrid. We can parallelize across one level of the hierarchy, but there is little
opportunity for parallelism across levels. Therefore, the Grids in an IrregularGrid
are distributed across processors, and applications compute over these Grids in parallel.
Following the LPARX model, each Grid in an IrregularGrid is assigned to
one processor. Of course, a single processor may be responsible for many Grids.

Communication among Grids in the hierarchy employs LPARX's copy-on-intersect
operation (see Section ���), a high-level facility that copies data between
the logically overlapping portions of two Grids. Data motion involves no explicit
computations involving subscripts; all bookkeeping and interprocessor communication
details are managed by the run-time system. We discuss communication in detail in
Section �����.
In the implementation of our adaptive mesh libraries, we often found it
convenient to represent the structure of an IrregularGrid separately from the
IrregularGrid itself (i.e. structural abstraction). For example, regridding and load
balancing routines manipulate and return the structure (the locations of refinement
patches and their assignments to processors) of an IrregularGrid. Such structure
information is encapsulated in a GridStructure, which consists of an array of LPARX
Regions and an array of corresponding processor assignments.

Grid: Grid represents a single refinement patch in the adaptive grid hierarchy.
Grid computations are typically performed in serial numerical routines (see
Section �����).

IrregularGrid: IrregularGrid represents one level in the adaptive mesh hierarchy.
Grids in an IrregularGrid are distributed across processors, and applications
compute over these Grids in parallel (see Section �����). IrregularGrid provides
communication routines to fill boundary cells for Grids at the same level of
refinement (see Section �����).

CompositeGrid: CompositeGrid represents the entire adaptive mesh hierarchy. It
provides mechanisms to communicate between levels (see Section �����) and to
create new levels through error estimation (Section �����), grid generation
(Section �����), and load balancing (Section �����).

Table ��: Descriptions of the three basic data types used to represent the adaptive
grid hierarchy (refer to Figure ��): Grid, IrregularGrid, and CompositeGrid. The
operations defined on these data types are described in detail in succeeding sections.
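The relationships among these types can be pictured as a handful of C++ aggregates. The field choices below are illustrative assumptions of ours, not the actual LPARX or adaptive mesh API declarations:

```cpp
#include <cassert>
#include <vector>

// Region: a rectangular index-space box (2D here for brevity).
struct Region {
    int lo[2], hi[2];              // inclusive index bounds
};

// Grid: one logically rectangular patch plus its data.
struct Grid {
    Region region;                 // where the patch lives in index space
    std::vector<double> data;      // unknowns stored on the patch
};

// GridStructure: the "floorplan" of one level -- regions plus
// processor assignments, with no grid data allocated yet.
struct GridStructure {
    std::vector<Region> regions;
    std::vector<int> owners;       // owners[i] = processor for regions[i]
};

// IrregularGrid: one level of the hierarchy, a collection of Grids.
struct IrregularGrid {
    std::vector<Grid> grids;
    double spacing;                // mesh spacing at this level
};

// CompositeGrid: the whole hierarchy, organized into levels.
struct CompositeGrid {
    std::vector<IrregularGrid> levels;
};
```

The separation between GridStructure (structure only) and IrregularGrid (structure plus data) mirrors the structural abstraction described above: regridding and load balancing manipulate the former, and only afterwards is the latter instantiated.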
����� Error Estimation

To refine their representation of the solution, structured adaptive mesh algorithms
add additional levels to the grid hierarchy. Error estimation and regridding
routines are called to calculate where to place the computational resources on the new
level. Error estimation evaluates the solution error on the level of the grid hierarchy
with the finest resolution, and regridding uses this error estimate to determine where
to place new grid patches to refine portions of the domain with the highest error.

Our adaptive mesh API provides two common algorithms for estimating
solution error: solution gradient and Richardson extrapolation. The solution gradient
does not actually measure error but rather indicates where the solution is changing
most rapidly. We use this as an ad-hoc estimate of error, as no further refinement
is generally needed in regions where the solution is changing slowly (i.e. has a small
gradient). Richardson extrapolation [���] attempts to calculate an exact estimate of
the local truncation error using the solution at the two coarser levels of the grid
hierarchy.

Figure ���: (a) The error estimation procedure has flagged the points of highest error
(as indicated by the solid dots). (b) The regridding routine has generated refinement
patches which cover all flagged points but which enclose few non-flagged points.
After obtaining an estimate of error, we must flag points where the error
is "too high," as shown in Figure ���a. One well-known method, used for both elliptic
[���] and hyperbolic [��] partial differential equations, is to flag every location
which exceeds some predetermined error threshold. This method attempts to bound
the overall solution error by bounding the error at every grid point. Another approach,
which we have not seen mentioned in the literature, focuses on fixing computational
resources. This strategy flags a specified number of points with the highest error
and is appropriate for applications for which good analytical estimates of error are
unavailable.

We typically use the latter approach in our experiments with adaptive eigenvalue
calculations for materials design [���]. For eigenvalue problems, the correlation
between the pointwise error on a grid level and the final error in the eigenvalue (the
value of interest) is not straightforward [���]. Thus, we attempt to obtain the best answer
to our problem for fixed computational resources. If we find in the end that this
answer is not sufficiently accurate, we simply allocate more resources and solve again.
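The fixed-resources strategy amounts to selecting the indices of the highest-error points, which can be sketched with a partial sort. The function name and interface below are our own, not the library's API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Flag the `budget` points with the highest estimated error and
// return their indices, ordered from highest error down.
std::vector<std::size_t> flagHighestError(const std::vector<double>& error,
                                          std::size_t budget) {
    std::vector<std::size_t> idx(error.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    if (budget > idx.size()) budget = idx.size();
    // Partial sort: the `budget` largest-error indices come first.
    std::partial_sort(idx.begin(), idx.begin() + budget, idx.end(),
                      [&](std::size_t a, std::size_t b) {
                          return error[a] > error[b];
                      });
    idx.resize(budget);
    return idx;
}
```

Unlike threshold flagging, the amount of refinement work here is fixed in advance by `budget`, regardless of how the error happens to be distributed.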
����� Grid Generation

After the error estimator has flagged points on the highest level of the grid
hierarchy, the next step is to calculate where to place grid refinements for the next
level. Regridding routines take the flagged points from the error estimator and return
a GridStructure representing the refinement structure for the newest level. The goal
of grid generation is to create refinement patches that cover all flagged points of the
previous level. Patches should be relatively large (to minimize overheads) but should
enclose as few non-flagged points as possible (see Figure ���b). Furthermore, they
should be nested (i.e. every grid point at level l+1 must lie above some grid point at
level l). Each patch may have multiple parents; refinement hierarchies do not form a
tree. Patches are assumed to be rectangular and lie parallel to the coordinate axes.
Our adaptive mesh API implements a regridding algorithm by Berger and
Rigoutsos [��] based on signatures from pattern recognition. In this method, information
about the spatial distribution of flagged points within a specified region of space
is collapsed onto the one-dimensional axes; these signatures are then used to generate refinement
patches. The algorithm begins by using signatures to calculate the smallest bounding
box that contains all flagged points (see Figure ���a). If the ratio of flagged points to
total points in this box satisfies a specified efficiency threshold, then the refinement
patch is accepted. If not, the method splits the refinement patch and calls itself
recursively on the two sub-patches. In the recursive calls, signatures are calculated only
over the specified region of space enclosing the sub-patch (e.g. the "mask region" in
Figure ���b).

Signatures are used to find a splitting point for the two sub-patches. The
algorithm tries to choose a splitting point which minimizes communication across the
interface. For example, consider the signatures shown in Figure ���a. The signature
on the horizontal axis represents the number of flagged points that lie above it. Because
the center portion of the signature is zero, the splitting algorithm knows that
no flagged points exist in this region. Thus, the algorithm chooses this section as a
separator for the two smaller sub-patches.
Figure ���: (a) Refinement regions are calculated using signatures. The signature
on the horizontal axis represents the number of flagged points which lie above; the
signature on the right represents the number of flagged points to the left. The dotted
lines show the bounding box enclosing all flagged points. The two darkly shaded
regions represent efficient patches. (b) We calculate signatures using a parallel reduction
across an irregular array structure. Contributions from flagged points outside
the "mask region" are ignored.
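The signature computation and the zero-gap splitting heuristic can be sketched as follows. This is a simplified, hypothetical rendering of the idea, not the full Berger-Rigoutsos algorithm (the efficiency-threshold recursion is omitted):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column signature of a 2D flag array (1 = flagged, 0 = not):
// signature[x] counts the flagged points in column x.
std::vector<int> signature(const std::vector<std::vector<int>>& flags) {
    std::size_t ny = flags.size(), nx = flags[0].size();  // assumes non-empty
    std::vector<int> sig(nx, 0);
    for (std::size_t y = 0; y < ny; ++y)
        for (std::size_t x = 0; x < nx; ++x)
            sig[x] += flags[y][x];
    return sig;
}

// Choose a split index in the middle of the widest zero run of the
// signature; returns -1 if the signature contains no zeros.  A zero
// run means no flagged points cross the cut, minimizing communication
// across the resulting interface.
int chooseSplit(const std::vector<int>& sig) {
    int bestLen = 0, bestMid = -1;
    for (std::size_t i = 0; i < sig.size(); ) {
        if (sig[i] != 0) { ++i; continue; }
        std::size_t j = i;
        while (j < sig.size() && sig[j] == 0) ++j;   // zero run [i, j)
        if (static_cast<int>(j - i) > bestLen) {
            bestLen = static_cast<int>(j - i);
            bestMid = static_cast<int>((i + j) / 2);
        }
        i = j;
    }
    return bestMid;
}
```

In the full algorithm, the box would first be shrunk to the bounding box of the flagged points, and each half of a split would be regridded recursively until the flagged-to-total ratio meets the efficiency threshold.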
Previous implementations of the signature algorithm have represented
flagged locations using lists of points, with one point for each flagged location [��, �].
We implement a different strategy based on parallel array reductions over irregular
grid structures (an IrregularGrid of integers). Flagged points are assigned a value
of 1 and all other locations 0. Our strategy calculates signatures using array reductions
with addition over a specified region of space. For the signatures in Figure ���b,
we employ array reductions across the two Grids and mask out any contributions
that lie outside the specified mask region. Irregular array reduction is a direct generalization
of standard rectangular array reductions; portions of the index space not
covered by a Grid simply make no contribution to the result. The advantages of this
method are that it is easy to parallelize, uses the same data structures as the grid
hierarchy, and is efficient. In our computations, regridding using parallel reductions
requires only about one percent of the total computation time.

(Footnote: In their data parallel Fortran implementation, Berger and Saltzman [���] employ parallel
array reductions over rectangular arrays, but their regridding algorithm returns patches of uniform
size and is not based on signatures.)

Figure ���: Two methods for calculating uniformly sized refinement regions. Here the
flagged regions are represented by the shaded areas. (a) The computational space is
tiled (dashed lines) and only those regions with flagged points (solid lines) are kept
as refinement patches. (b) Shifted tiles cover the flagged area more efficiently (seven
patches instead of nine).

To simplify the communication of information between levels of the hierarchy,
we require that grids are logically nested within grids at the next coarser level.
The signature algorithm alone does not ensure proper nesting of the grids because it
does not take into account the boundaries of the grid patches on the previous level.
Thus, after grid generation, we ensure nesting by intersecting all new grid patches
against the underlying refinement regions. In practice, this step is rarely necessary,
as refinement regions are usually already nested.
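The nesting step reduces to rectangle intersection. A minimal sketch, with types and names of our own choosing:

```cpp
#include <cassert>
#include <vector>

// A rectangular region with inclusive integer bounds (2D).
struct Box {
    int xlo, ylo, xhi, yhi;
    bool empty() const { return xlo > xhi || ylo > yhi; }
};

// Intersection of two boxes; may be empty.
Box intersect(const Box& a, const Box& b) {
    return { a.xlo > b.xlo ? a.xlo : b.xlo,
             a.ylo > b.ylo ? a.ylo : b.ylo,
             a.xhi < b.xhi ? a.xhi : b.xhi,
             a.yhi < b.yhi ? a.yhi : b.yhi };
}

// Enforce nesting: clip each new patch against the parent regions,
// keeping one piece per non-empty (patch, parent) overlap.  A sketch;
// the library's actual routine may differ.
std::vector<Box> enforceNesting(const std::vector<Box>& newPatches,
                                const std::vector<Box>& parents) {
    std::vector<Box> nested;
    for (const Box& p : newPatches)
        for (const Box& parent : parents) {
            Box clipped = intersect(p, parent);
            if (!clipped.empty()) nested.push_back(clipped);
        }
    return nested;
}
```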
Our adaptive mesh libraries also provide an alternative regridding algorithm
based on work by Berger and Saltzman for uniform refinement regions [���].
Unlike the previous algorithm, this method guarantees that all refinement patches
are the same size. The uniform algorithm was originally motivated by an adaptive
mesh refinement implementation in a data parallel language (Connection Machine
Fortran) that required uniform patches. It is much simpler than the non-uniform
algorithm. Essentially, the computational space is tiled with refinement patches of a
specified size (see Figure ���a). Each patch is checked to see whether it includes a flagged
point. If so, then that patch is added to the new refinement level; if not, the patch is
discarded. One improvement in this algorithm implemented by Berger and Saltzman
(and also by us) allows the tiles to shift to better cover the flagged region (see
Figure ���b). These algorithms employ the same irregular array reduction primitives as
the non-uniform method. A performance comparison for the signature and uniform
regridding algorithms is presented in Section �����.
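The uniform tiling scheme, including the optional shift, can be sketched in one dimension. The 1D simplification and all names are ours:

```cpp
#include <cassert>
#include <vector>

struct Tile { int lo, hi; };               // half-open cell range [lo, hi)

// Tile a 1D domain with fixed-size patches and keep only tiles that
// contain a flagged cell.  `shift` slides the tiling left so that a
// different alignment may cover the flags with fewer patches.
std::vector<Tile> uniformRegrid(const std::vector<bool>& flags,
                                int tileSize, int shift = 0) {
    std::vector<Tile> kept;
    int n = static_cast<int>(flags.size());
    for (int lo = -shift; lo < n; lo += tileSize) {
        int a = lo < 0 ? 0 : lo;                           // clip to domain
        int b = lo + tileSize < n ? lo + tileSize : n;
        bool hasFlag = false;
        for (int i = a; i < b; ++i)
            if (flags[i]) { hasFlag = true; break; }
        if (hasFlag) kept.push_back({a, b});               // keep this tile
    }
    return kept;
}
```

With flags spanning a tile boundary, the unshifted tiling keeps two patches while a shifted tiling can cover the same flags with one, mirroring the effect shown in the figure.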
����� Load Balancing and Processor Assignment

The regridding procedure of the previous section generates a GridStructure
describing the refinement structure for the newest level in the adaptive mesh hierarchy.
In general, the refinement patches rendered by the regridding procedure vary in size
and number; there may be fewer patches than processors or many more patches than
processors. Thus, a simple cyclic assignment of patches to processors would not
typically result in good load balance. Therefore, before creating the new level, the
API routines must distribute this computational work across processors. Our API
defines load balancing and processor assignment facilities that take the structural
description returned by the grid generator, manipulate and modify it, and then return
a new GridStructure that is used to instantiate the newest grid level. The goal of
load balancing and processor assignment is to evenly distribute computational work
across the processors of the machine.
Our regridding routines seek to create refinement regions which minimize
computational effort; they have no knowledge of load balancing or processor assignment.
One possible implementation strategy would integrate load balancing and
processor assignment into regridding. We have avoided this approach for two reasons.
First, we believe that the refinement structure of the numerical computation
should not be influenced by its parallel implementation. By decoupling regridding
from parallelization, we guarantee that the regridding procedure will generate identical
refinement structures when running the same problem on varying numbers of
processors. Second, we may change either regridding methods or load balancing
strategies without influencing the other. Future improvements in one algorithm will
not force changes in the other.

Recall that all parallelism in our adaptive mesh method lies across the grid
patches in a single level of the grid hierarchy. With the exception of the communication
optimization described later in this section, we partition each grid level
independently of all other levels. For our particular numerical algorithms, the workload
associated with each refinement region is directly proportional to the size of the
region.

// Fracture refinement patches that are too large
let P = number of processors
let w = (sum of all patch sizes) / P
recursively divide patches larger than w

// Bin-pack refinement patches to processors
sort patches by size from largest to smallest
for each patch (largest first)
    assign patch to the processor with the least work
end for

Figure ����: This load balancing routine takes a set of refinement patches from the
regridding algorithm and assigns them to processors. Patches that are too large
(larger than the average workload to be assigned to a processor) are divided into
smaller ones. The resulting patches are then bin-packed to processors.
A simple but effective load balancing algorithm is shown in Figure ����.
The first step of the method is to calculate the approximate average workload w to
be assigned to each processor. Next, patches which represent more work than w are
recursively divided until all patches are size w or smaller, guaranteeing that large
patches will be evenly distributed across processors. The load balancing routines
adopt LPARX's parallelization model that an individual grid is assigned to only one
processor. Although not an issue in our particular application, dividing patches may
not be appropriate or desirable for other numerical methods for which introducing
new boundary elements creates additional computational work (e.g. flux correction
for hyperbolic partial differential equations [��]). When recursively dividing patches,
our algorithm does not generate sub-patches smaller than a specified, architecture-specific
minimum size. Although small blocks reduce load imbalance, they introduce
additional interprocessor communication. After patches have been divided, they are
sorted in decreasing order by size and are bin-packed to processors [��].
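The fracture-then-bin-pack routine can be sketched as follows. For brevity, this hypothetical version chops oversized patches into chunks of at most the average workload rather than recursively halving them, and it abstracts patch workloads as plain integers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Fracture patches larger than the average workload w, then bin-pack
// the pieces largest-first onto the least-loaded processor.  Returns
// the final per-processor loads; `owner[i]` receives the processor
// assigned to the i-th (sorted) piece.
std::vector<int> loadBalance(const std::vector<int>& work, int P,
                             std::vector<int>& owner) {
    // 1. Fracture: no piece may exceed the average workload w.
    long total = 0;
    for (int s : work) total += s;
    int w = static_cast<int>(total / P);
    if (w < 1) w = 1;
    std::vector<int> pieces;
    for (int s : work)
        while (s > 0) {
            int piece = s > w ? w : s;
            pieces.push_back(piece);
            s -= piece;
        }

    // 2. Bin-pack: largest piece first, to the least-loaded processor.
    std::sort(pieces.begin(), pieces.end(), std::greater<int>());
    std::vector<int> load(P, 0);
    owner.assign(pieces.size(), -1);
    for (std::size_t i = 0; i < pieces.size(); ++i) {
        int best = 0;
        for (int p = 1; p < P; ++p)
            if (load[p] < load[best]) best = p;
        load[best] += pieces[i];
        owner[i] = best;
    }
    return load;
}
```

Because the largest remaining piece always goes to the least-loaded processor, a single oversized patch cannot leave one processor idle while another is overloaded.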
This load balancing method works well in practice but does not take into
account interprocessor communication between levels of the grid hierarchy. Patches
communicate with their parents on the previous level. Interprocessor communication
costs for hyperbolic numerical methods are dominated by computation costs; however,
communication overheads become a significant portion of the total execution
time for elliptic problems (see Section ���).

We have developed a new processor assignment strategy that introduces interprocessor
communication costs into the bin-packing algorithm by including a "processor
preference," or affinity, for each patch (see Figure ����). Before distributing
work to processors, we assign each patch a processor preference value which estimates
the amount of communication between the patch and its parents lying on that processor.
Communication costs are directly related to the size of the intersection between
a patch and its underlying parents. When bin-packing, we attempt to place patches
on their preferred processors. This simple optimization has reduced interprocessor
communication by as much as �% in some parts of the code. Unfortunately, the
benefits of this optimization are limited to the highest levels of the grid hierarchy.
For the applications we have run to date, we have found little change in the overall
execution time. Processor preferences are most effective when there are many
more patches than processors; otherwise, the algorithm has little freedom in mapping
patches to processors. We believe, though, that this optimization will become more
important as we begin to run larger, more realistic applications requiring more levels
of refinement on larger numbers of processors.

(Footnote: Of course, new load balancing routines may be defined to handle these special cases.)
(Footnote: Phillip Colella, personal communication.)
// Fracture refinement patches which are too large
let P = number of processors
let w = (sum of all patch sizes) / P
recursively divide patches larger than w

// Calculate a processor affinity for each patch
for each patch i
    for each parent j of patch i
        // Estimate communication costs as the intersection
        // between a patch and its parent
        let c = intersection between patch i and parent j
        let p = processor owning patch j
        assign patch i a preference c for processor p
    end for
end for

// Bin-pack refinement patches to processors
sort patches by size from largest to smallest
for each patch i (largest first)
    for each processor j in i's preference list (most preferred first)
        if (assigning patch i to processor j does not exceed work w for j) then
            assign patch i to processor j
            stop considering processors for patch i
        end if
    end for
end for
for each unassigned patch (largest first)
    assign patch to the processor with the least work
end for

Figure ����: This load balancing algorithm is an improvement of the method in
Figure ����. In this approach, we use an estimate of interprocessor communication
between levels of the grid hierarchy to calculate a "processor preference," or affinity,
for each patch. When possible, each patch is assigned to its preferred processor.
// Compute in parallel over the elements of U
// U is an IrregularGrid and its elements are Grids
forall i in U
    call update(U(i))
end forall

Figure ����: Parallel computation over the individual Grids in an IrregularGrid U
(distributed here over four processors P0 through P3). The forall loop is a coarse-grain
data parallel loop which executes each iteration as if on its own virtual processor.
Function update is an externally defined numerical routine, often written in Fortran,
which performs some computation on Grid U(i).
����� Numerical Computation

The previous section described how the load balancing and processor assignment
routines decompose a single level of the adaptive grid hierarchy (an
IrregularGrid) across processors. In this section, we discuss parallel computation
over such distributed structures.

Consider the single grid level shown in Figure ����, which has been distributed
over four processors p0 through p3. Each grid has been assigned to one processor.
The largest rectangular patch has been divided over two processors, and p� has been
assigned the two smaller grids.

We express parallel execution using LPARX's forall construct, a coarse-grain
data parallel loop which executes each iteration as if on its own virtual processor.
Coarse-Grain Parallelism:

// Parallel loop over grids
forall i in U
    // Serial loop over elements
    for j,k,l in U(i)
        // Do numerical work
    end for
end forall

Fine-Grain Parallelism:

// Serial loop over grids
for i in U
    // Parallel loop over elements
    forall j,k,l in U(i)
        // Do numerical work
    end forall
end for

Figure ����: Coarse-grain data parallelism (left) expresses parallel execution over the
entire collection of grids; computation on each individual grid is serial. In contrast,
fine-grain data parallelism (right) expresses parallelism over the data elements of each
grid, and the grids are handled sequentially.
Each iteration executes independently of all other iterations� For each Grid U�i�� we
call the routine update� an externally de�ned serial numerical kernel� which executes
on one processor�
There are advantages to separating parallel execution from serial numerical computation. Numerical code may be optimized to take advantage of low-level node characteristics, such as vector units or multiple physical processors, without regard to the higher level parallelism. Existing serial code may not need to be re-implemented when parallelizing an application. Furthermore, we can leverage existing mature sequential and vector compiler technology.
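This division of labor can be sketched in a few lines outside LPARX; in the sketch below, `update` is a hypothetical stand-in for the external serial kernel, and a thread pool plays the role of the coarse-grain forall (illustrative only, not the LPARX API):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def update(grid):
    # Serial numerical kernel for one grid; in the dissertation this role
    # is played by an externally defined (often Fortran) routine.
    out = grid.copy()
    out[1:-1] = 0.5 * (grid[:-2] + grid[2:])   # one Jacobi-style sweep
    return out

def forall_update(grids, workers=4):
    # Coarse-grain data parallelism: the loop over whole grids runs in
    # parallel; the computation within each grid remains sequential.
    with ThreadPoolExecutor(workers) as pool:
        return list(pool.map(update, grids))

# Grids of irregular sizes, as in an IrregularGrid
grids = [np.zeros(8), np.array([0.0, 0.0, 4.0, 0.0, 0.0]), np.zeros(16)]
new_grids = forall_update(grids)
```

Because each kernel invocation touches only its own grid, the iterations are independent, which is exactly the property the forall construct exploits.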
Figure ���� compares our model of coarse-grain parallelism with a fine-grain data parallel style [��]. In the former, we execute in parallel over the entire collection of grids. Each grid is assigned to one processor, and numerical computation on that grid is sequential. In contrast, fine-grain parallelism processes grids sequentially and expresses parallelism over the elements of a single grid.
There are a number of advantages to coarse-grain parallelism. Because the numerical computation is serial, we may employ numerical methods on each grid which do not parallelize efficiently. For example, Gauss-Seidel relaxation works well as a smoother in multigrid, but it cannot be easily expressed in a fine-grain data parallel style. Coarse-grain parallelism also allows more asynchrony between processors and is therefore a better match to current coarse-grain message passing architectures. To improve the efficiency of the fine-grain model, Parsons and Quinlan [��] are developing run-time methods for automatically extracting coarse-grain tasks from fine-grain data parallel loops. Another model, processor subsets [��], combines the coarse-grain and fine-grain approaches: parallelism is expressed both over grids and within each grid.
In our discussion of parallel execution, we have ignored the interprocessor communication required to satisfy data dependencies. This is the subject of the following section.
���� Communication
Adaptive mesh methods exhibit two basic forms of communication: intralevel communication among grids at the same level in the grid hierarchy, and interlevel communication between adjacent levels of the hierarchy. Both forms of communication employ LPARX's copy-on-intersect operation (see Section ���), which copies a block of data between the logically overlapping portions of two grids.

The purpose of intralevel communication is to obtain boundary information from neighboring grids. Around the boundary of each grid patch is a ghost cell region used to locally cache data from adjacent grids. These ghost cells are needed by adaptive mesh algorithms even on serial architectures; they are an intrinsic component of the computation and are not simply an artifact of parallelization. The pseudocode shown in Figure ���� updates the ghost cell regions of each grid with data from the interior (non-ghost cell) portions of adjacent grids.
Interlevel communication transfers information up (from coarser grids to finer grids) and down (from finer to coarser) the adaptive grid hierarchy. We will describe only the latter process, called coarsening, as the computational structure of the former is identical. As shown in Figure ���, coarsening involves two steps. First, information at the fine grid level is averaged into a temporary, intermediate grid level,
    // Communicate boundary information between grids at the same level
    // U is the irregular array of grids at this level of the hierarchy
    function FillPatch(U)
      // Loop over all pairs of grids in U
      forall i in U
        for j in U
          // Copy data from intersecting regions
          // Interior() returns the interior of its argument
          copy into U(j) from U(i) on Interior(U(i))
        end for
      end forall
    end function
Figure ����: Intralevel communication copies boundary information from the interiors of adjacent grids at the same level of the adaptive mesh hierarchy.
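The copy-on-intersect behavior that FillPatch relies on can be sketched with plain NumPy; the `Patch` class below is a hypothetical stand-in for an LPARX grid that records its global index space and a one-cell ghost halo:

```python
import numpy as np

class Patch:
    # A grid patch over the global cell range [lo, hi) plus a ghost halo.
    def __init__(self, lo, hi, ghost=1):
        self.lo, self.hi, self.ghost = lo, hi, ghost
        self.data = np.zeros([h - l + 2 * ghost for l, h in zip(lo, hi)])

    def copy_from(self, src):
        # Copy-on-intersect: copy from src's interior into the logically
        # overlapping portion of this patch (typically its ghost cells).
        g = self.ghost
        lo = [max(a - g, b) for a, b in zip(self.lo, src.lo)]
        hi = [min(a + g, b) for a, b in zip(self.hi, src.hi)]
        if any(l >= h for l, h in zip(lo, hi)):
            return                                   # empty intersection
        dst = tuple(slice(l - a + g, h - a + g)
                    for l, h, a in zip(lo, hi, self.lo))
        s = tuple(slice(l - a + src.ghost, h - a + src.ghost)
                  for l, h, a in zip(lo, hi, src.lo))
        self.data[dst] = src.data[s]

def fillpatch(patches):
    # Update every patch's ghost cells from the interiors of its neighbors.
    for p in patches:
        for q in patches:
            if p is not q:
                p.copy_from(q)

a = Patch([0], [4])                  # interior cells 0..3
b = Patch([4], [8])                  # interior cells 4..7
a.data[1:5] = [10.0, 11.0, 12.0, 13.0]
b.data[1:5] = [20.0, 21.0, 22.0, 23.0]
fillpatch([a, b])
```

After the exchange, each patch's ghost cells hold a copy of the neighboring patch's adjacent interior cells; the same index arithmetic extends unchanged to two and three dimensions.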
which is a coarsened version of the fine grid level. The numerical computation by subroutine Average is performed in parallel. Second, the coarsened data in the intermediate grid level is copied into the coarse grid level using LPARX's copy operation. Although the redistribution of data from the intermediate grid into the coarse grid may appear expensive, this is typically not the case. In the act of averaging, the quantity of data in the fine grid is generally reduced by a factor of r^d for a mesh refinement factor of r in d dimensions; for our particular application, r = �, d = �, and r^d = ��. Thus, the intermediate grid level represents a relatively small amount of data.
    // Communicate between grid levels in the hierarchy
    // Fine is the fine grid level
    // Temp is the coarsened version of Fine
    // Coarse is the coarse grid level
    function Coarsen(Fine, Temp, Coarse)
      // Average information in Fine down to Temp
      forall i in Fine
        call Average(Fine(i), Temp(i))
      end forall
      // Copy data from Temp into grid level Coarse
      forall i in Temp
        for j in Coarse
          copy into Coarse(j) from Temp(i)
        end for
      end forall
    end function
Figure ����: Interlevel communication transfers information between levels of the adaptive mesh hierarchy. Coarsen illustrates communication between a fine level and the coarse level beneath it. Data is first averaged into an intermediate, temporary grid level, which is then copied into the coarse level.
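A serial sketch of the two steps of Coarsen, assuming a refinement ratio r = 2 in two dimensions (the function names and shapes here are illustrative, not LPARX's):

```python
import numpy as np

def average(fine, r=2):
    # Step 1: average r x r blocks of the fine patch into a temporary grid
    # at the coarse level's resolution; data volume shrinks by r**d.
    nx, ny = fine.shape[0] // r, fine.shape[1] // r
    return fine.reshape(nx, r, ny, r).mean(axis=(1, 3))

def coarsen(fine, coarse, offset, r=2):
    # Step 2: copy the (small) temporary grid into the portion of the
    # coarse level that the fine patch overlies, as LPARX's copy does.
    temp = average(fine, r)
    i, j = offset                       # coarse-level index of the patch
    coarse[i:i + temp.shape[0], j:j + temp.shape[1]] = temp

fine = np.arange(16, dtype=float).reshape(4, 4)
coarse = np.zeros((4, 4))
coarsen(fine, coarse, (1, 1))
```

The intermediate array `temp` is a factor of r² smaller than the fine patch, which is why the second, redistribution step is cheap in practice.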
��� Adaptive Eigensolvers in Materials Design
And all this science I don't understand;
it's just my job five days a week.
- Elton John, "Rocket Man"
We have applied our structured adaptive mesh infrastructure to the solution of model eigenvalue problems arising in the first principles design of real materials. To our knowledge, this is the first time that such techniques have been used to solve these problems on parallel computers. Materials design seeks to understand the chemical properties of technologically important materials through computer simulation. Consider the C��H�� molecule shown in Figure ���. Materials scientists would like to understand properties of this molecule, such as: What is the size of this ring? What is the bond distance between two carbons? What is the energy of this system? Computer modeling offers the possibility of answering these questions without actually constructing the compound in the laboratory.
One difficulty in modeling real materials is that some atoms, such as oxygen or transition metals, exhibit steep effective potentials localized about the nucleus. Because the chemical properties of an atom are determined by these localized potentials, it is vital that numerical methods represent them accurately. Materials scientists have traditionally employed Fourier transform methods, but such methods do not easily admit the non-uniformity needed to capture localized phenomena [��].

Adaptive numerical methods support the non-uniform representation needed to accurately resolve localized potentials. While preliminary, the results presented in this section indicate that adaptive eigenvalue methods may provide an important alternative solution methodology for materials design applications.
The need for adaptive refinement in materials design has been noted by other researchers. Bernholc et al. [��] have implemented a semi-adaptive code that places a single, static refinement patch over each of their atoms. Others have attempted adaptive solutions using either finite element methods [��] or a combination of finite element and wavelet techniques [��].

Figure ���: Materials design seeks to understand the chemical properties of molecules such as the C��H�� ring shown here. This image was provided by Kawai and Weare.

Although finite element methods provide adaptivity, they are more computationally expensive than structured adaptive mesh methods because they require significantly more memory and CPU time for the same number of unknowns. Thus, given the same computational resources, finite element calculations must employ a coarser representation than an equivalent structured adaptive mesh method.
Section ����� describes our model materials design application in more detail. In the succeeding four sections, we describe the numerical algorithms used in our adaptive eigenvalue solver. Section ����� briefly reviews the adaptive mesh framework that has already been discussed in Section ���. We have integrated into this adaptive methodology a multilevel eigenvalue solver (Section �����) that is based on the multigrid method (Section �����). Section ����� describes the finite difference stencil used to discretize the continuous equations. Finally, Section ����� presents computational results for two simple materials design problems.
����� A Model Problem
The first principles design of real materials models the properties of complex chemical compounds through solving approximations to the Schrödinger equation. One common approach uses the Local Density Approximation (LDA) of Kohn and Sham [��]. In the LDA, the electronic wavefunctions u_i are given by the solutions to the nonlinear eigenvalue problem

    H u_i = λ_i u_i.    (���)

The differential operator H, which is elliptic, self-adjoint, real, and indefinite, is given by the Hamiltonian

    H = -(1/(2m)) ∇² + V.    (���)
The wavefunction, or eigenvector, u_i provides a measure for the location of the electron: u_i² gives the probability that the ith electron is located at a particular point in space. The potential term V contains contributions from various electron-electron and electron-nucleus interactions; it is a function of both position and the wavefunctions u_i. In general, we wish to find only the N_a lowest eigenvalues and associated eigenvectors, where N_a is typically the number of atoms in the system. N_a is several orders of magnitude smaller than the number of grid points used to represent the system.

Several length scales are represented in the solution to Eqs. ��� and ���. The overall size of the system is determined by the atomic positions and the associated electron density. Furthermore, associated with each atomic center is an effective nuclear charge which varies according to the atomic species. Our goal is to show that adaptive numerical methods can accurately resolve these different length scales. Thus, in our experiments, we use a model potential V which captures the essential near-singular behavior and multiple length scales of real effective potentials while removing many nonessential details.
    L = 0
    while (L < MaxLevels) do
      call EigenvalueSolver
      call ErrorEstimation
      call GridGeneration
      L = L + 1
    end while
Figure ����: The adaptive eigenvalue solver builds the adaptive grid hierarchy level-by-level. The solution on L levels is used to create the next finer grid level.
����� Adaptive Framework
Recall from Section �� that structured adaptive numerical methods build
the grid hierarchy level�by�level� That is� the algorithm solves the problem on a
hierarchy of L levels before proceeding to one with L ! � levels� The solution on
L levels is used to guide the creation of the next �ner grid level� Furthermore� this
�nested iteration� or �full multigrid cycle� �as it is called by the multigrid community�
speeds convergence because the solution on L levels provides a good estimate for the
solution at level L ! �� An outline of our adaptive eigenvalue solver is shown in
Figure �����
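The payoff of nested iteration is easy to demonstrate on a toy 1-D Poisson problem: interpolating the coarse-level solution as the initial guess on the finer level reduces the number of smoothing sweeps needed there. (Gauss-Seidel stands in for the per-level solver; all names are illustrative.)

```python
import numpy as np

def sweeps_to_converge(u, f, h, tol=1e-8):
    # Gauss-Seidel on -u'' = f (zero boundaries); count sweeps until the
    # maximum residual falls below tol.  u is updated in place.
    count = 0
    while True:
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i-1] + u[i+1] + h * h * f[i])
        count += 1
        r = f[1:-1] - (2*u[1:-1] - u[:-2] - u[2:]) / (h * h)
        if np.max(np.abs(r)) < tol:
            return count

n, h = 65, 1.0 / 64
x = np.linspace(0.0, 1.0, n)
f = np.sin(np.pi * x)

# Cold start: zero initial guess on the fine level.
cold = sweeps_to_converge(np.zeros(n), f, h)

# Nested iteration: solve on the coarse level, interpolate, then finish.
xc, hc = x[::2], 2.0 * h
uc = np.zeros(33)
sweeps_to_converge(uc, np.sin(np.pi * xc), hc)
warm_guess = np.interp(x, xc, uc)
warm = sweeps_to_converge(warm_guess, f, h)
```

The coarse solve is cheap (fewer unknowns, faster smoothing), and the interpolated guess starts the fine-level iteration several orders of magnitude closer to the answer.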
����� Eigenvalue Algorithm
We wish to solve the following generalized eigenvalue problem:

    H u_i = λ_i K u_i,   i = 1, ..., N_a    (���)

where H and K are symmetric and positive definite. We require only the lowest N_a eigenvalues and eigenvectors, where N_a is much smaller than the number of unknowns, N_g. Typically, N_a ≈ �� to ���, whereas N_g ≈ ���.
Although K = I in many eigenvalue problems, we must address the general case because the higher order O(h⁴) discretization employed by our finite difference scheme introduces a right hand side matrix K ≠ I.
    let u be an initial guess (u != 0)
    repeat
      H-normalize u: (u, Hu) = 1
      let λ = (u, Hu) / (u, Ku)
      perform one multigrid V-cycle on (H - λK)u = 0
    until ||(H - λK)u|| < ε  (some error tolerance)
Figure ����: Mandel and McCormick's [��] linearized iterative multigrid-based eigenvalue solver finds the lowest eigenvalue λ and its associated eigenvector u from the generalized eigenvalue problem Hu = λKu.
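The outer structure of this iteration can be sketched in NumPy, with a direct inverse-iteration solve standing in for the multigrid V-cycle; this illustrates the H-normalization and Rayleigh-quotient steps, not the optimal O(N_g) method, and the test matrix is synthetic:

```python
import numpy as np

def lowest_eigenpair(H, K, iters=80, seed=0):
    # Lowest eigenpair of the generalized problem H u = lambda K u,
    # for H and K symmetric positive definite.
    u = np.random.default_rng(seed).standard_normal(H.shape[0])
    for _ in range(iters):
        u = np.linalg.solve(H, K @ u)        # stand-in for one V-cycle
        u /= np.sqrt(u @ (H @ u))            # H-normalize: (u, Hu) = 1
    return (u @ (H @ u)) / (u @ (K @ u)), u  # Rayleigh quotient, vector

# Synthetic SPD matrix with known spectrum 1, 2, ..., n.
n = 40
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((n, n)))
H = Q @ np.diag(np.arange(1.0, n + 1.0)) @ Q.T
lam, u = lowest_eigenpair(H, np.eye(n))
```

Each pass damps the components of u belonging to larger eigenvalues, so the Rayleigh quotient settles onto the lowest eigenvalue.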
Because we need only a few of the lowest eigenvalues, we do not apply standard dense eigenvalue algorithms (e.g. QR methods [��]) to the solution of Eq. ���. Instead, we use a multigrid-based iterative method by Mandel and McCormick [��] which efficiently calculates the lowest eigenvalue and associated eigenvector. This algorithm is shown in Figure ����. Cai et al. [��] have proven that this method is optimal in the sense that it requires O(N_g) work and O(1) iterations. Convergence is independent of the distribution of eigenvalues.

To calculate eigenvalues other than the lowest, we apply the above procedure and, at each step, orthogonalize the candidate eigenvector u_n against all previously calculated eigenvectors u_i, 1 ≤ i ≤ n-1. Unfortunately, this strategy does not appear to retain the optimal convergence properties proven for the lowest eigenvalue. The best approach to extracting the lowest N_a eigenvalues and eigenvectors is still an open research question.
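The deflation strategy can be sketched as follows (for K = I; the lowest eigenvector is taken as known from the synthetic construction, and inverse iteration again stands in for the multigrid cycle):

```python
import numpy as np

def next_eigenpair(H, prev_vecs, iters=80, seed=2):
    # Inverse iteration with Gram-Schmidt deflation: orthogonalize the
    # candidate against previously computed (unit-norm) eigenvectors.
    u = np.random.default_rng(seed).standard_normal(H.shape[0])
    for _ in range(iters):
        u = np.linalg.solve(H, u)
        for v in prev_vecs:
            u -= (v @ u) * v                 # project out converged modes
        u /= np.linalg.norm(u)
    return u @ (H @ u), u

# Synthetic SPD matrix with known spectrum 1, 2, ..., n.
n = 40
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((n, n)))
H = Q @ np.diag(np.arange(1.0, n + 1.0)) @ Q.T
lam2, u2 = next_eigenpair(H, [Q[:, 0]])      # deflate the lowest mode
```

With the lowest mode projected out at every step, the iteration converges to the second-lowest eigenpair instead.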
As with many iterative algorithms, a good initial guess for u can significantly reduce the time to solution. Such good guesses are often available. For example, in problems which evolve over time, the solution at one timestep may be used to seed the solution at the following timestep.
One potential difficulty in applying this method to our materials design application is that the Hamiltonian H is not positive definite; some eigenvalues, in fact those eigenvalues of interest, are negative. The normalization (u, Hu) = 1, which requires the calculation of sqrt((u, Hu)), is only well-defined for H positive definite. Thus, we must shift the eigenvalues of H to make them positive. If -μ is a lower bound for the eigenvalues of H, then H + μI is positive definite. Because the performance of our solver is independent of the eigenvalue distribution of H, shifting the eigenvalues by μ does not change its convergence properties. The lower bound μ need not be tight; in materials design calculations, an approximate lower bound (which actually represents the energy of the system) can usually be found using experimental knowledge or experience.
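A quick numerical illustration of the shift: for an indefinite symmetric H the lowest eigenvalue is negative, so (u, Hu) can be negative and the H-normalization is undefined, while H + μI is positive definite and has exactly the same eigenvectors (the matrices here are random test data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
A = rng.standard_normal((n, n))
H = 0.5 * (A + A.T)                    # symmetric but indefinite
eigs = np.linalg.eigvalsh(H)           # ascending; eigs[0] < 0 here

# Shift: if -mu is any lower bound on the spectrum, H + mu*I is SPD.
mu = -eigs[0] + 1.0                    # a loose bound is good enough
H_shift = H + mu * np.eye(n)
shifted = np.linalg.eigvalsh(H_shift)  # each eigenvalue moves by exactly mu
```

Since every eigenvalue is translated by the same amount and the eigenvectors are untouched, the shift changes nothing about the solver's behavior beyond making the normalization well-defined.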
����� Multigrid
The eigenvalue solver of the previous section depends on a multilevel iterative technique called multigrid. Multigrid [��] is a fast method for solving partial differential equations. It represents the solution on a grid hierarchy. Multigrid uses the multiple levels of the hierarchy to accelerate the communication of numerical information across the computational domain and reduce the time to solution. Multigrid techniques integrate easily into the adaptive mesh framework; the same grid hierarchy used by adaptive mesh methods to represent the solution can be used by multigrid to accelerate convergence. In this section, we briefly describe Brandt's Full Approximation Storage (FAS) variant [��] of multigrid; further details can be found elsewhere [��].

We wish to solve the partial differential equation Lu = 0 subject to Dirichlet boundary conditions on the composite grid u consisting of L + 1 levels u_l, 0 ≤ l ≤ L. In general, L is a nonlinear operator. We divide each level u_l of u into two parts: the boundary, denoted by boundary(u_l), and the interior, interior(u_l). Multigrid requires two operators to transfer data values between grids at neighboring levels of the hierarchy. The coarsening operator I_C takes data at level l + 1 and averages it down to level l. The refining operator I_F takes data at level l and interpolates it to grids at level l + 1. Finally, we define a relaxation procedure relax that performs
    FAS(l, u_l, f_l):
      u_l <- relax(L u_l = f_l)                      // pre-smoothing
      if (l != 0)
        r_{l-1} <- I_C (f_l - L u_l)                 // restrict the residual
        t_{l-1} <- I_C u_l                           // restrict the solution
        f_{l-1} <- r_{l-1} + L t_{l-1}  on interior(u_{l-1});  0 otherwise
        u_{l-1} <- FAS(l-1, u_{l-1}, f_{l-1})
        u_l <- u_l + I_F (u_{l-1} - t_{l-1})  on interior(u_l)
        u_l <- I_F u_{l-1}                    on boundary(u_l)
      end if
      u_l <- relax(L u_l = f_l)                      // post-smoothing

Figure ����: The Full Approximation Storage (FAS) multigrid algorithm of Brandt [��]. When called via FAS(L, u_L, 0), this method performs one V-cycle on the equation Lu = 0 for the composite grid u consisting of L + 1 levels u_l, 0 ≤ l ≤ L.
smoothing (e.g. Gauss-Seidel [��]) on each multigrid level.

The FAS multigrid method is shown in Figure ����. When invoked using FAS(L, u_L, 0), the algorithm starts at level L of the grid hierarchy, winds down through the grids to level 0, and then works back up to level L. This cycling strategy is called a multigrid V-cycle. Each application of FAS drives u closer to the solution. We have found that between ten and twenty V-cycle iterations are usually sufficient to solve our problems to machine precision.

Immediately following the recursive call to level l - 1, grid level u_l is updated using new information from level l - 1. Since we may not have explicit Dirichlet boundary information for higher levels in the grid hierarchy, we must calculate the appropriate boundary conditions from the next lower level. We use the interpolated data values from the underlying grid level u_{l-1} to obtain the Dirichlet boundary conditions for level u_l.
An important distinction between the FAS method and other multigrid variants is that FAS, like the adaptive mesh method, stores unknowns at each level of the multigrid hierarchy. Many multigrid methods store error corrections to the unknown (i.e. residuals) rather than the unknown itself. Such a storage technique would be incompatible with the adaptive mesh framework.
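A minimal V-cycle for the 1-D model problem -u'' = f illustrates the cycling structure; note that this sketch uses the simpler correction scheme (storing residual corrections on coarse levels) rather than FAS, and is independent of LPARX:

```python
import numpy as np

def gauss_seidel(u, f, h, sweeps):
    # In-place Gauss-Seidel smoothing for -u'' = f, u = 0 on the boundary.
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i-1] + u[i+1] + h * h * f[i])

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2*u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def v_cycle(u, f, h):
    # One V-cycle: pre-smooth, restrict the residual, recurse,
    # interpolate the coarse correction, post-smooth.
    if len(u) <= 3:
        gauss_seidel(u, f, h, 50)       # "exact" solve on coarsest grid
        return u
    gauss_seidel(u, f, h, 2)
    r = residual(u, f, h)
    rc = r[::2].copy()                  # full-weighting restriction
    rc[1:-1] = 0.25*r[1:-2:2] + 0.5*r[2:-1:2] + 0.25*r[3::2]
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h)
    e = np.zeros_like(u)                # linear interpolation of correction
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    u += e
    gauss_seidel(u, f, h, 2)
    return u

n, h = 65, 1.0 / 64
x = np.linspace(0.0, 1.0, n)
f = np.sin(np.pi * x)
u = np.zeros(n)
for _ in range(20):
    v_cycle(u, f, h)
```

Each cycle reduces the error by a fixed factor independent of the grid size, which is the property that makes multigrid an O(N) solver.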
����� Finite Difference Discretizations
To implement these numerical algorithms on a computer, we require a method for representing the partial differential operator L used in the multigrid procedure. We solve for elliptic operators of the general form

    Lu = (-α∇² + f)u + g    (���)

where α ∈ R and u, f, g : R³ → R. For example, for Poisson's equation ∇²u = ρ, we have α = -1, f = 0, and g = -ρ. For the Hamiltonian eigenvalue problem

    (-(1/(2m))∇² + V - λ)u = 0,    (���)

α = 1/(2m), f = V - λ, and g = 0.
We discretize this expression using a finite difference scheme. Consider the cube of 27 integer locations in three dimensions (3 × 3 × 3) centered at the point i. Define face to be the set of six points at the center of each face of the cube. Likewise, define edge as the set of twelve points on the edges (but not the corners) of the cube. We ignore the eight corner points. The fourth order O(h⁴) finite difference discretization corresponding to mesh location i with mesh spacing h can be written as [��]
    (L_h u)_i = -(α / (6h²)) ( -24 u_i + 2 Σ_{j∈face} u_j + Σ_{j∈edge} u_j )
                + (1/12) ( 6 f_i u_i + Σ_{j∈face} f_j u_j )
                + (1/12) ( 6 g_i + Σ_{j∈face} g_j )    (���)
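The Laplacian portion of this formula is the standard 19-point Mehrstellen stencil. The sketch below checks its fourth-order accuracy for the pure Laplacian case (α = 1, f = g = 0) on a periodic grid, assuming the usual weights (-24, 2, 1) together with the matching right-hand-side face averaging:

```python
import numpy as np

def face_sum(a):
    # Sum over the 6 face neighbors of every point (periodic grid).
    return sum(np.roll(a, s, axis) for axis in range(3) for s in (-1, 1))

def edge_sum(a):
    # Sum over the 12 edge neighbors (corners excluded).
    total = np.zeros_like(a)
    for ax1, ax2 in ((0, 1), (0, 2), (1, 2)):
        for s1 in (-1, 1):
            for s2 in (-1, 1):
                total += np.roll(np.roll(a, s1, ax1), s2, ax2)
    return total

def mehrstellen(u, h):
    # 19-point compact approximation to the Laplacian.
    return (-24.0 * u + 2.0 * face_sum(u) + edge_sum(u)) / (6.0 * h * h)

def rhs_average(F):
    # Matching face averaging of the right-hand side; needed for O(h^4).
    return (6.0 * F + face_sum(F)) / 12.0

def truncation_error(n):
    h = 1.0 / n
    x = np.arange(n) * h
    X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
    u = np.sin(2*np.pi*X) * np.sin(2*np.pi*Y) * np.sin(2*np.pi*Z)
    F = -12.0 * np.pi**2 * u               # exact Laplacian of u
    return np.max(np.abs(mehrstellen(u, h) - rhs_average(F)))
```

Halving the mesh spacing should shrink the truncation error by roughly a factor of sixteen, confirming fourth-order accuracy even though the stencil touches only nearest neighbors.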
����� Computational Results
To validate our approach, we chose two deceptively simple problems whose analytic solutions are known: the hydrogen atom and the H₂⁺ molecule. While the Hamiltonians for these problems are very simple, they cannot be solved directly using current Fourier transform techniques because of a 1/r singularity in the atomic potential. All of the following model problems were solved in 3d; we did not attempt to exploit problem symmetry. Each solution required approximately one minute running on an IBM RS/6000 model ���. Note that real materials design problems will contain tens or hundreds of atoms, not just one or two, and will require the computation and memory resources of a high-performance parallel computer.
Hydrogen
The Hamiltonian operator for hydrogen has a simple form:

    H = -∇²/2 - Z/r.    (���)

While the eigenvalue problem associated with Eq. ��� can be solved analytically, the singular behavior at r = 0 can cause significant difficulties for non-adaptive numerical methods. For example, current Fourier methods with ��³ grid points return ���� as the lowest eigenvalue instead of the correct value of ����. The solution to this problem is an exponential with the form e^(-Zr) and eigenvalue λ = -Z²/2. As Z increases, the solution becomes increasingly localized about the origin.
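The stated ground state can be checked with a small radial finite-difference calculation (a uniform radial grid here, so only modest accuracy near the cusp; this stands apart from the adaptive solver itself):

```python
import numpy as np

def hydrogen_ground_state(Z=1.0, R=20.0, n=1000):
    # For u(r) = r * psi(r), the radial problem is
    #   -(1/2) u'' - (Z/r) u = E u,  u(0) = u(R) = 0,
    # whose exact ground state energy is -Z**2 / 2.
    h = R / n
    r = h * np.arange(1, n)                    # interior points
    diag = 1.0 / h**2 - Z / r                  # kinetic + Coulomb terms
    off = -0.5 / h**2 * np.ones(n - 2)
    H = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
    return np.linalg.eigvalsh(H)[0]
```

Even this dense one-dimensional solve recovers E ≈ -1/2 for Z = 1; resolving the same cusp in a full 3-d grid without adaptivity is what makes the problem expensive.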
The solution to the hydrogen problem (Z = 1) is plotted in Figure ���a. Note that the units for the values of the wavefunction are arbitrary. The adaptive eigenvalue solution and the exact answer (not plotted) are identical at the scale of this graph. The cusp at the origin is a result of the singularity of the potential at this point and is usually very expensive to resolve with a non-adaptive numerical method. The accurate representation of the cusp requires a very dense clustering of grid points about the origin (see the abscissa of Figure ���a).
Figure ���: (a) The left graph displays the lowest energy eigenvector for the hydrogen atom; graph data was extracted from the 3d volume along the Z axis. Tick marks on the abscissa represent the mesh points of the composite grid. (b) The right plot shows the eigenvalues for a -Z/R potential as a function of the nuclear charge Z. All three solution methods used approximately the same number of grid points. Note that the "Exact" solution and the "Four Adaptive Levels" solution lie on top of one another.
Figure ���b illustrates how the eigenvalues of the system change with increasing nuclear charge Z. To accurately capture the increasingly localized and near-singular solution, it was necessary to use higher levels of adaptivity. However, because of the increased localization about the origin, the total number of grid points remains approximately the same. The resolution of the finest grid level in the adaptive solution would correspond to a uniform grid with ���³ points, as compared to the fewer than ��� points required by the adaptive algorithm, a savings of ����. Obviously, a uniform mesh of this resolution would be unthinkable.
Hydrogen Molecular Ion
Another simple problem which is more commonly used as a test case for
Figure ���: (a) The left graph displays the lowest energy eigenvector for the hydrogen molecular ion; graph data was extracted from the 3d volume along the Z axis. Tick marks on the abscissa represent the mesh points of the composite grid. (b) The right plot shows the binding energy (Morse) curve as a function of atomic separation. The minimum point on this curve represents the minimum energy distance between the two atoms.
chemical methods is the hydrogen molecular ion (H₂⁺). In this problem, there is only one electron but two nuclei with 1/r singularities:

    H = -∇²/(2m) - 1/|r + R_a/2| - 1/|r - R_a/2|,  where R_a is the atomic separation.    (���)
Analytic solutions for this problem are known [��], but it is too stiff for practical solution via Fourier methods. The adaptive eigenvalue solver method performs quite well, however. The eigenvector solution is plotted in Figure ���a; note the increased density of grid points in the vicinity of the two nuclei. Figure ���b shows the binding energy curve for H₂⁺. The binding energy is defined as the total energy of the atoms at a specified distance minus the energy at infinite separation. The minimum point in this curve tells materials scientists the preferred minimum energy separation between the two atoms of the molecule.
��� Performance Analysis
First get your facts; then you can distort them at your leisure.
- Samuel Clemens
Portability and performance are two vital considerations in the design and implementation of any numerical library. Parallel computers become obsolete at an alarming rate; today's state-of-the-art supercomputer is tomorrow's space heater. Portability ensures that numerical software will run on the most powerful and up-to-date computational resources available. Computational scientists will not use software libraries unless they deliver reasonable performance. In this section, we analyze the performance and overheads of our adaptive mesh library. We begin with a performance comparison of an Intel Paragon and an IBM SP2 with one processor of a Cray C-90, and the succeeding section presents a detailed breakdown of parallel execution times.
It is an open research question whether non-uniform refinement structures can be efficiently supported in a data parallel language. One implementation strategy for structured adaptive mesh methods in a data parallel language such as High Performance Fortran [��] would restrict all refinement patches to be the same size [��]. We therefore conclude this section with an analysis of the performance implications of requiring uniformly sized refinement regions.

The motivating application for our structured adaptive mesh API is the adaptive solution of nonlinear eigenvalue problems arising in materials design [��]. We present computational results for the calculation of the lowest eigenvalue and associated eigenvector for the following 3d eigenvalue problem:

    (-∇² + V)u = λu    (���)

    V(r) = -Σ_{i=1}^{10} 1 / |r - (cos(2πi/10), sin(2πi/10), 0)|    (����)

The potential V represents a ring of ten hydrogen ions located in the Z = 0 plane, as shown in Figure ���. While this is a synthetic problem, its structure resembles real materials design applications of interest (e.g. ring structures).
Figure ���: Computational results were gathered for a 3d synthetic eigenvalue problem with ten hydrogen ions located in a ring, as illustrated here by a slice through the Z = 0 plane. Local refinement regions are represented by the rectangles superimposed on the image.
The adaptive mesh hierarchy for this problem consists of eight levels with a total of ������� grid points (see Table ���). The first six levels are the usual uniform multigrid grids (with a mesh refinement ratio of two) and the next two are adaptively refined (with a mesh refinement ratio of four). The resolution on the finest level corresponds to a uniform mesh of size ���³; thus, for this application, adaptivity reduced memory requirements by a factor of roughly ��� as compared to a uniform fine mesh.

Recall from Section ��� that this grid hierarchy is built level-by-level at run-time. In the following sections, we will typically report the cost for one iteration of the eigenvalue algorithm over all eight levels and will ignore the iterations used to build it. In real materials design problems, the building phase is brief and is followed by numerous iterations over the entire hierarchy structure. Moreover, most of the computational work and the memory storage is located at the highest levels.

Each complete iteration requires approximately ��� million floating point operations, or approximately ��� flops per grid point spread out over about ten
Multigrid Levelsl � � l � � l � l � � l � � l � �
Unknowns � � ��� ���� ���� ��� ��� ���
Mesh Spacing ���� ���� ���� ���� ���� ���� ����� ����
Adaptive Re�nement Totall � � l � �
Unknowns ��� ��� ���� ��� ���� ���
Mesh Spacing ����� ���� ����� ����
Table ���: Number of unknowns and mesh spacing for the adaptive mesh hierarchy used to solve the problem pictured in Figure ���. The resolution on the finest grid level corresponds to a uniform mesh of size ���³.
Machine C�� Fortran OperatingCompiler Optimization Compiler Optimization System
C��� CC v������� �O cft�� v��� �O� UNICOS �����
Paragon g�� v��� �O �mnoieee if�� v���� �O� �Knoieee OSF�� �����
SP xlC v�� �O �Q xlf v��� �O� AIX v��
Table ���: Software version numbers and compiler optimization flags for all computations in this chapter. All benchmarks used LPARX release v���. Detailed machine characteristics are reported in Appendix A.
different numerical routines with intermittent communication. Table ��� summarizes software releases and compiler flags for all benchmarks; refer to Appendix A for a detailed description of the computer architectures. All floating point arithmetic used 64-bit numbers.
����� Performance Comparison
Table ��� and Figure ��� compare the execution times for the IBM SP2, Intel Paragon, and one processor of a Cray C-90. Note that although the SP2 processors

(Footnote: The IBM SP2 results were obtained on a pre-production machine at the Cornell Theory Center; these times should improve as the system is tuned and enters full production use.)
Number of Processors
                 P = 1    P = 2    P = 4    P = 8    P = 16   P = 32
Cray C��� ��
IBM SP ���� ��� ��� ���� ����
Intel Paragon ���� ��� ���� ���
Table ���: Adaptive eigenvalue solver execution times for the IBM SP2, Intel Paragon, and one processor of the Cray C-90. Times (in seconds) represent one iteration, averaged over ten, of the eigenvalue algorithm. We report wallclock times for the SP2 and Paragon (processor nodes are not time-shared) and CPU times for the C-90. The application would not run on fewer than four Paragon processors because of memory constraints. These numbers are graphed in Figure ���.
are approximately four times faster than the Paragon processors, its communication network is about half as fast. We ran the same applications code on all machines, except that the Fortran kernels on the Cray C-90 are annotated to aid vectorization.

The Paragon and the SP2 compare quite favorably against the C-90: for this application, four SP2 nodes or �� Paragon nodes deliver the performance of one C-90 processor. Although all Fortran numerical kernels of our code vectorize, hardware performance monitors on the C-90 report that our application achieves an aggregate rate of only �� megaflops (million floating point operations per second) over the entire code and a peak rate of ��� megaflops. Our application realizes only a fraction of the Cray C-90's peak performance of ���� megaflops due to short vector lengths (between �� and ���) in the Fortran routines.
Of course, vector lengths are tied directly to grid size. We could achieve a higher megaflop rate and longer vector lengths by using larger grids and more memory. Note, however, that time to solution for a specified accuracy, not megaflop rate, is the important metric. Placing additional grid points in regions where they are not needed to improve resolution does not necessarily result in more accurate solutions. For example, we doubled the number of grid points used by the solver for this problem and yet obtained the same answer (to within ����%). The additional grid points simply over-refined portions of the computational space where
Figure ���: These figures graph the execution time results presented in Table ���. (a) Adaptive eigenvalue solver performance results for the IBM SP2, Intel Paragon, and one processor of the Cray C-90. The Cray C-90 time on one processor is shown as a reference line. (b) Relative execution times for the SP2 and Paragon as compared to one processor of the Cray C-90.
no further refinement was necessary.

On the Cray C-90, our implementation using the adaptive mesh libraries would be comparable in performance to a Fortran code developed by hand without library support. Approximately ��% of the execution time of our application is spent on numerical computation in Fortran routines, �% in transferring data between grids (which happens to be written in C++ but would also be required in an all-Fortran implementation), and the remaining �% in miscellaneous routines. Even if we attribute the last �% as all library overhead (which it is not), the ease of using an API and the benefits of portability to high-performance parallel architectures far outweigh the small loss in performance.
[Graphs: execution time by level on the Intel Paragon for p = 4, 8, 16, and 32, and cumulative execution time broken down by levels 0-3 through level 7.]

Figure ���: A level-by-level accounting of the execution time for one iteration of the eigenvalue algorithm. The benefits of parallelism are limited to the highest levels of the hierarchy because lower levels have too little work for efficient parallelization. (a) The execution times for the highest levels drop as we add more processors; times for the other levels do not change significantly. (b) This graph shows a level-by-level breakdown of the cumulative execution time.
����� Execution Time Analysis
Figure ��� illustrates that almost all of the benefit of additional processors is in the reduction of execution times at the highest levels of the adaptive grid hierarchy. Lower levels have too little work for efficient parallelization. Note that we cannot simply remove the lower levels because they play a vital role in the numerical convergence of our eigenvalue algorithm. We can expect better scaling as we address more complicated problems, which place additional computational work at the highest levels.

Tables ��� and ��� provide a detailed accounting for the parallel execution time on the Intel Paragon and the IBM SP2. We divide the execution time for the entire eigenvalue algorithm, including time spent building the adaptive grid hierarchy,
Task                       P = 4        P = 8        P = 16       P = 32
                           Time   %     Time   %     Time   %     Time   %
Computation                ��     �     ����   ��    ���    �     ����   ��
Load Imbalance             ����   �     ����   �     ����   ��    ���    �
Intralevel Communication   ����   �     ����   ��    ���    �     ����   �
Interlevel Communication   ���    �     ����   �     ����   ��    ����   �
Error Estimation           ����   ��    �����  ��    �����  ��    ����   ��
Load Balancing             �����  ��    �����  ��    �����  ��    ����   ��
Grid Generation            ����   ��    �����  �     ���    �     �����  �
Total                      ���    100   ���    100   ����   100   ����   100
Table ���: Execution time breakdown for the eigenvalue calculation on the Intel Paragon. Times are in seconds. Percentages may not add up to 100% due to rounding. The relative cost of communication increases with additional processors; communication overheads account for almost half of the total execution time on 32 nodes.
into numerical computation, time lost to load imbalance, communication among grids at the same level of the hierarchy, communication between levels, error estimation, load balancing, and grid generation. The vast majority of the time is spent in numerical computation, load imbalance, and communication (intralevel and interlevel). Error estimation, load balancing, and grid generation consume only a few percent of the total execution time. Figure ��� graphs the times of the four most expensive operations for one iteration of the algorithm. As computation times drop with additional processors, communication overheads become a dominant factor in overall performance. On 32 Paragon nodes, communication accounts for about half of the total execution time.
It is clear from this data that interprocessor communication times are limiting parallel performance. Table ��� shows the amount of communication (in kilobytes) between processors on the Intel Paragon; numbers for the SP2 are identical. Intralevel and interlevel communication clearly dominate. For these two routines, we have measured the average message size to be between ��� and ���� bytes. Communication times on both the Paragon and SP2 are dominated by message start-up costs. On the Paragon, the operating system message start-up overhead is about ��� µsec with a
Task                       P = 1       P = 2       P = 4       P = 8       P = 16
                           Time  %     Time  %     Time  %     Time  %     Time  %
Computation                ���   ��    ����  �     ����  ��    ����  ��    ����  ��
Load Imbalance             �     �     ����  �     ����  �     ���   �     ����  �
Intralevel Communication   ����  �     ����  �     ����  ��    ����  �     ����  ��
Interlevel Communication   ����  �     ����  �     ���   ��    �� �  ��    ���   ��
Error Estimation           ����  ��    ���   ��    ����� ��    ����� ��    ����� ��
Load Balancing             �     �     ����� ��    ����� ��    ����� ��    ����� ��
Grid Generation            ���   ��    ����  ��    ����� ��    ���   �     ����� �
Total                      ���   100   ����  100   ����  100   ����  100   ����  100
Table ���: Execution time breakdown for the eigenvalue calculation on the IBM SP2. Times are in seconds. Percentages may not add up to 100% due to rounding. The SP2 spends a majority of its time in communication for eight and sixteen nodes; this problem size can use only four processors efficiently.
[Figure: "Eigenvalue Solver Time Breakdown" -- seconds per iteration versus processors for (a) the Intel Paragon (4-32 processors) and (b) the IBM SP2 (1-16 processors), broken into Interlevel Communication, Intralevel Communication, Load Imbalance, and Computation.]
Figure ���: Execution time breakdown for one iteration (averaged over ten) for the (a) Intel Paragon and the (b) IBM SP2. We report computation time, time lost to load imbalance, and communication (both interlevel and intralevel). The SP2 processors are approximately five times as powerful as the Paragon processors, but the communications network is about half as fast.
Task                       P = 4        P = 8        P = 16       P = 32
                           Comm   %     Comm   %     Comm   %     Comm   %
Intralevel Communication   ����   ��    ����   ��    ���    ��    ����   ��
Interlevel Communication   ���    ��    ���    ���   �      ���   �
Error Estimation           ���    ���   �      ���   �      ���   �
Grid Generation            �      ��    ��     ��    ��     ��    ���    ��
Total                      ����   100   ����   100   ����   100   �����  100
Table ���: Average interprocessor communication volume (in kilobytes) for each processor for the eigenvalue calculation on the Intel Paragon. Percentages may not add up to 100% due to rounding. Interprocessor communication figures on the IBM SP2 are identical.
peak bandwidth of �� megabytes/sec for very long messages (see Appendix A). The corresponding numbers for the SP2 are about �� µsec and �� megabytes/sec. Given these figures, over ��% of the cost of sending a message of this length on the Paragon is due to message start-up costs (��% for the SP2).
It is difficult to assess adaptive mesh library overheads on parallel computers since we do not yet have detailed hardware performance analyzers such as those on the Cray C-90. We can assume that there is little overhead in computation, since all numerical work is done in Fortran. The remaining contributor of overheads is interprocessor communication, as described above. Experiments indicate that perhaps half of the interprocessor communication time is due to overheads in the LPARX communication routines (see Section ���); the remainder is spent in the operating system message routines. We are currently working on a redesign of the LPARX communication libraries which we believe will eliminate most of this additional overhead [��].
����� Uniform Grid Patches
Data parallel Fortran languages such as High Performance Fortran [��, ��, ��] do not readily support non-uniform grid structures. In their Connection Machine Fortran implementation of a �d adaptive mesh refinement application on the CM��, Berger and Saltzman [��] required that all refinement regions be the same size. To ascertain the performance implications of such a restriction, we have implemented a grid generation strategy identical to that used by Berger and Saltzman.
One of the important tradeoffs in a uniform refinement strategy is the selection of the appropriate patch size. The key is to find a grid size which is large enough to be computationally efficient on the target parallel architecture yet small enough to limit over-refinement. Large patches typically refine more of the computational domain than is needed and thus waste memory resources; small patches may represent too little work to be efficient.
Another consideration when choosing a uniform patch size is the ratio of ghost cells (boundary points used to locally cache data from other grids) to interior grid points. The width of this boundary region depends on the numerical kernels of the application: our eigenvalue solver uses a ghost cell width of one; Berger and Saltzman use four. For small patches, these ghost regions can represent a significant fraction of the total memory, especially in three dimensions. For example, the boundary cells for a �� patch in 3d with a ghost cell width of four (��� total) represent ��% of the memory used to store the patch. Furthermore, boundary cells may introduce additional computational work for some numerical methods (e.g. flux correction for hyperbolic partial differential equations [��]).
We compare the non-uniform refinement approach against four uniform grid sizes: 12x12x12, 16x16x16, 24x24x24, and 32x32x32. Each patch is augmented with a ghost cell region of width one. These four sizes bracket the range of useful patch sizes; for this particular application, 12x12x12 is too small and 32x32x32 is too large. Memory overheads are reported in Table ���. Uniform refinement with the smallest patch size requires only about ��% more memory than non-uniform refinement; the largest patch size uses almost three times more memory.
Figure ���(a) presents the execution time for one iteration of the adaptive eigenvalue application on the Intel Paragon. Note that we do not report the results
Patch Type           Unknowns       Over-        Unknowns and   Excess
                                    Refinement   Ghost Cells    Memory
Non-Uniform          ����� � ���    ���          ��� � ���      ���
Uniform (12x12x12)   ���� � ���     ����         ��� � ���      ��
Uniform (16x16x16)   ���� � ���     ����         �� � � ���     ����
Uniform (24x24x24)   ��� � ���      ���          � � � ���      ��
Uniform (32x32x32)   ��� � ���      ��           ��� � ���      ���
Table ���: Uniform grid patches require additional memory resources as compared to non-uniform patches because the grid generator does not have the freedom to select the optimal grid size needed to cover a particular region of space; thus, uniform patches lead to over-refinement. Small patches reduce over-refinement; however, they also introduce a large number of ghost cells relative to the number of unknowns. In this application, the ghost cell region is one cell thick. Note that the number of ghost cells increases slightly (by about ��% between 4 and 32 processors) for the non-uniform strategy as patches are split to balance the load across processors. For our results, we have chosen the worst-case non-uniform patch numbers.
for the 32x32x32 patch size on four processors; this problem would not run because of memory limitations. The � � patch size gives the best performance for all numbers of processors, running between ��% and ��% slower than the non-uniform refinement method. This figure includes both computation time and communication time; numerical computation time alone is plotted in Figure ���(b). Note that in the absence of interprocessor communication, the �� and � � grid sizes are very competitive with the non-uniform strategy. Interprocessor communication costs (in millions of bytes) are presented in Figure ���(c).
Both memory usage and computation time are important computational resources for adaptive mesh applications. In fact, many accounting systems for parallel computers charge not only for CPU time but also for memory usage. To capture both resources, Figure ���(d) presents the relative space-time (e.g. megabyte-hour) cost of uniform refinement patches as compared to non-uniform patches. In this metric, uniform patches are between two and eight times more expensive.
[Figure: four panels versus 4-32 Paragon processors, each comparing Non-Uniform Patches against Uniform 12x12x12, 16x16x16, 24x24x24, and 32x32x32 patches -- (a) "Uniform Patch Execution Time" (wallclock seconds per iteration); (b) "Uniform Patch Numerical Execution Time"; (c) "Uniform Patch Communication" (Mbytes per iteration); (d) "Uniform Patch Space-Time Cost" (relative space-time cost).]

Figure ���: These graphs illustrate the performance costs of uniform grid patches as compared to non-uniform patches. By default, the adaptive mesh libraries employ non-uniform patches. Results for the 32x32x32 patch size on four processors are not reported; the problem would not run due to memory limitations. (a) A comparison of execution times for a variety of uniform patch sizes. (b) Numerical computation times only. (c) Total interprocessor communication (in millions of bytes) for the different refinement strategies. (d) Relative space-time (megabyte-hour) costs of uniform refinement patches as compared to non-uniform patches.

These results clearly show that uniform refinement patches are more expensive than the identical application using non-uniform patches. To solve a problem to a specified accuracy, structured adaptive mesh methods employing uniform refinement regions will require more computational resources (flops and memory) than one using non-uniform refinement regions. Likewise, given fixed resources, non-uniform refinements will solve a particular problem to higher accuracy. The additional complexity of the non-uniform implementation requires a powerful software infrastructure such as our adaptive mesh API.
�� Analysis and Discussion
You can observe a lot by watching.
-- Yogi Berra
We have developed an efficient, portable, parallel software infrastructure for structured adaptive mesh algorithms. It provides computational scientists with high-level tools that hide implementation details of parallelism and resource management. Such powerful software support is essential for the timely development of quality reusable numerical software. We are applying our adaptive mesh infrastructure to the solution of adaptive eigenvalue problems arising in materials design [��].
Two distinguishing characteristics of our work are the concepts of structural abstraction and coarse-grain data parallelism, both borrowed from LPARX. Structural abstraction and first-class data decompositions enable our software to represent and manipulate refinement structures as language-level objects. In contrast, a language such as High Performance Fortran that supports compile-time data layout provides little freedom in the expression of irregular, run-time data distributions. Our model of coarse-grain data parallel numerical computation maps efficiently onto the current generation of message passing parallel architectures.
It is an open research question whether data parallel languages can efficiently support the irregular refinement structures employed by structured adaptive mesh algorithms. Previous implementations [��] have required uniform refinements to fit the fine-grain data parallel model. Our experiments in 3d indicate that such a restriction results in costly over-refinement and a corresponding loss in computational performance. Thus, the efficient, portable implementation of structured adaptive mesh methods remains an outstanding challenge for the data parallel community.
����� Parallelization Requirements
The parallelization mechanisms provided by LPARX greatly simplified the design and implementation of our structured adaptive mesh API. We believe that the following LPARX features were instrumental to our success:
• LPARX's concept of structural abstraction enables our software to dynamically create the user-defined, irregular array structures required to represent refinement levels. Our regridding routines produce a description of the refinement structure, and this description is in turn modified by load balancing and processor assignment routines to distribute the computational work across processors. Only then is this "floorplan" instantiated as the new level in the hierarchy. Structural abstraction provided us the freedom to explore and compare different regridding strategies (both non-uniform and uniform refinement patches) and processor mapping algorithms. To our knowledge, no other parallel programming system supports such user-defined, dynamic block structures.
• The region calculus and the copy-on-intersect operation are intuitive and natural mechanisms for manipulating refinement structures and expressing interprocessor communication. The same set of primitives can be used to manage the different communication patterns required by structured adaptive mesh applications. They are based on simple geometric notions (e.g. "intersection") that are independent of the spatial dimension of the problem. As a result, our software supports both 2d and 3d applications with the same API.
• Efficient parallel programs begin with the effective use of node resources. Because LPARX separates parallel execution from computation, it does not interfere with the performance of our serial numerical routines, which run at full Fortran speeds. Furthermore, we were able to reuse routines from previous serial multigrid codes in our parallel materials design application. Although such numerical kernels are generally short, they are extremely tedious to write and debug because they involve complicated stencil computations and array indexing (see Eq. ���).

Altogether, LPARX enabled us to easily and efficiently implement the software support necessary for structured adaptive mesh applications.
����� Future Research Directions
More research is needed on whether structured adaptive mesh methods can be efficiently implemented in a data parallel Fortran language such as High Performance Fortran [��], which does not readily support irregular array structures. As discussed in Section ���, one potential solution would employ uniform refinement regions. However, our experiments indicate that this design decision results in costly over-refinement and a corresponding loss in performance. In addition, even the uniform refinement strategy results in irregular communication patterns between refinement patches. To simplify the management of these bookkeeping details, we would still need to implement region calculus operations such as intersection and copy-on-intersect in an API library written for HPF.
Although we have shown that adaptivity can resolve the near-singular potentials arising in materials design applications, much work remains before we can address real problems, such as carbon clusters or carbon filaments. Addressing the full generality of the LDA eigenvalue equations will require substantial advances in numerical algorithms; in particular: (1) how do we extend the linear adaptive eigenvalue solver to address the full nonlinear LDA eigenvalue problem? (2) how do we efficiently extract multiple eigenvectors from the current adaptive algorithm? and (3) how do we incorporate LDA-specific approximations (which we have not discussed here) into the solution process? Fortunately, we believe that the software support provided by our structured adaptive mesh API will greatly simplify the exploration of these numerical issues.
Finally, further research is needed to investigate whether we can merge software support for elliptic and hyperbolic partial differential equations. Although elliptic and hyperbolic structured adaptive mesh methods employ similar data representations, their numerical structures are very different (as described in Section ���). We plan to explore how to unify support for these two problem classes in a single API. C. Duncan of Bowling Green State University would like to use our structured adaptive mesh infrastructure to parallelize an adaptive hyperbolic solver for simulations of relativistic extragalactic jets [��], and we view this as an opportunity to develop a common API framework for both types of partial differential equations.
Chapter �
Particle Calculations
The difference between a text without problems and a text with problems is like the difference between learning to read a language and learning to speak it.
-- Freeman Dyson, "Disturbing the Universe"
��� Introduction
In Chapter � we introduced the LPARX parallelization abstractions for managing irregular, block-structured data. In this chapter, we describe how these facilities are used in the design and implementation of an API library for particle applications, which require irregular data decompositions to balance non-uniform workloads on parallel architectures. Our particle API, implemented as a library built on top of the LPARX mechanisms (see Figure ���), provides computational scientists the high-level software tools needed to efficiently and easily parallelize particle calculations. Our facilities are independent of the problem's spatial dimension and present the same interface for both 2d and 3d applications. We show that such functionality is easily expressed using the LPARX primitives.

Using our software infrastructure, we have developed a 3d smoothed particle hydrodynamics [���] application (SPH3D in Figure ���) which simulates the evolution of galactic bodies in astrophysics. Our API's high-level mechanisms have enabled us
[Figure: software infrastructure diagram -- the applications (LDA, AMG, SPH3D, MD) sit atop the Adaptive Mesh API and the Particle API, which are built on LPARX, the implementation abstractions, and the message passing layer.]
Figure ���: The particle API portion of our software infrastructure is built on top of LPARX and provides computational scientists with high-level facilities targeted towards particle applications. We have developed a 3d smoothed particle hydrodynamics application (SPH3D) using this software and are currently implementing a 3d molecular dynamics application (MD).
to explore performance optimizations that others have found difficult with a message passing implementation. Our facilities have also been employed by Figueira and Baden to analyze the performance of various parallelization strategies for localized N-body solvers [��].
This chapter is organized as follows. We begin with an overview of particle methods and review related work. Section ��� introduces the parallelization facilities provided by our particle API and describes how they have been implemented using the LPARX mechanisms. Section ��� describes our smoothed particle hydrodynamics application and analyzes its performance in detail. Finally, we conclude with a discussion and analysis of this work.
����� Motivation
Simulations using particles play an important role in many fields of computational science, including astrophysics, fluid flow, molecular dynamics, and plasma physics [��]. In particle applications, some quantity of interest, such as mass or charge, is stored on bodies, called particles, which move unpredictably under their mutual influence. Particles interact according to a problem-dependent force law (e.g. gravitation). Computations proceed over a series of discrete timesteps. At each timestep, the algorithm evaluates the force acting on every particle and then moves the particles according to the calculated force field. The force evaluation is typically the most time-consuming portion of the computation; moving particles takes only a few percent of the total execution time. Figure ��� provides an outline for a generic particle code.
In general, each particle feels the influence of every other particle. A naive force evaluation scheme would, for a system of N particles, calculate all O(N²) particle-particle interactions directly. Such an approach is too expensive for systems of more than a few thousand particles.
Rapid approximation methods [��, ��, ��, ��, ��] accelerate the force evaluation by trading some accuracy for speed. Approximation algorithms typically divide the force evaluation into two components: local particle-particle interactions and far-field force computations. Local interactions evaluate the influence of nearby particles by calculating direct interactions only for those particles lying within a specified cutoff distance. The remaining non-local, far-field contributions are calculated separately.
Two different techniques are typically used to evaluate far-field forces. The first technique averages particle data onto a grid covering the entire computational domain and employs a fast partial differential equation (PDE) solver to evaluate the force equations on the grid; values representing far-field forces are then interpolated back onto the particles. The second technique evaluates far-field influences using a hierarchical structure in which forces are represented at varying length scales and accuracies; the influences of nearby particles are represented more accurately than particles that are further away. Table ��� surveys the computational structure of
// Advect is the main routine of the particle calculation
function Advect
    for t = 1 to MaxSteps
        call CalculateForces
        call MoveParticles
    end for
end function

// Calculate forces using local and far-field interactions
// Note that we do not show the code for FarFieldForces
function CalculateForces
    call LocalInteractions
    call FarFieldForces
end function

// Calculate forces arising from nearby particle interactions
function LocalInteractions
    for each particle p
        for each particle q in a neighborhood of p
            calculate interaction between p and q
        end for
    end for
end function

// Update particle positions according to calculated forces
function MoveParticles
    for each particle p
        move particle p using force information
    end for
end function
Figure ���: A framework for a generic particle calculation. Particle simulations proceed over a sequence of timesteps. In each timestep, the algorithm evaluates forces on particles and then moves particles according to the calculated forces. Rapid approximation methods typically divide the force computation into two components: local particle-particle interactions and far-field force evaluation (not shown).
Approximation Algorithm                        Local         Fast PDE   Hierarchical
                                               Interactions  Solver     Representations
Local Force Approximations [��, ��]            √
Particle-Particle Particle-Mesh [��]           √             √
Particle-in-Cell (PIC) [��]                                  √
Method of Local Corrections (MLC) [��]         √             √
Adaptive MLC [��]                              √             √          √
Fast Multipole Method (FMM) [��]               √                        √
Barnes-Hut [��]                                √                        √
Hierarchical Element Method [��]               √                        √
Table ���: A survey of the computational structure for various N-body approximation methods. This chart indicates whether a particular approximation algorithm employs local particle-particle interactions, a fast partial differential equation (PDE) solver, or hierarchical representations. Local force approximations assume that the force law is zero beyond a specified cutoff and ignore far-field influences.
some of the most common rapid approximation algorithms.

In this chapter, we only consider software abstractions for the evaluation of local particle-particle interactions. Software facilities for fast PDE solvers and adaptive hierarchical representations have been described in Chapter �.
Particle methods are difficult to implement on parallel computers because they require dynamic load balancing to maintain an equal distribution of work. The computational effort required to evaluate the forces acting on a particle depends on the local particle density, and workload distributions change with time as particles move. Furthermore, when partitioning the problem, we would like to take advantage of the spatial locality of the particle-particle interactions: by subdividing the computational space into large, contiguous blocks, we can minimize interprocessor communication, since nearby particles are likely to be assigned to the same processor [��].
Figure ���(a), which depicts a uniform block decomposition of the computational space, illustrates the need to handle load balancing. Each of the sixteen partitions has been assigned to a processor numbered from p0 to p15. Such a uniform decomposition does not efficiently distribute workloads; for example, no work
[Figure: three snapshots (a), (b), and (c) of the partitioned particle distribution.]
Figure ���: These pictures show several snapshots from a 2d vortex dynamics application in which the computational domain has been partitioned across sixteen processors. Particles are represented by dots, and the workload is directly related to local particle density. A uniform block decomposition (a) is unable to balance a non-uniform workload distribution. Recursive bisection (b and c) adjusts the assignment of work to processors according to the workload distribution. Partitionings must change dynamically in accordance with the redistribution of the particles; several repartitioning phases have occurred between the times represented by the last two snapshots.
has been assigned to processors p�, p�, p��, and p��.
A better method for decomposing non-uniform workloads is shown in Figures ���(b) and ���(c), which illustrate two irregular block assignments rendered using recursive bisection [��]. In these decompositions, each processor receives approximately the same amount of work. Because the distribution of particles and the associated workloads change over time, we must periodically redistribute the work across processors to maintain load balance. How we represent such dynamic, irregular data decompositions is the subject of Section ���.
����� Related Work
The literature for parallel particle calculations is quite rich and expansive; here we provide only a brief survey of related work. Our particle library facilities are based in part on previous work by Baden [��], who developed a programming methodology for parallelizing particle calculations running on MIMD multiprocessors. His implementation of a 2d vortex dynamics application was the first to employ a recursive bisection decomposition [��] to dynamically balance particle methods.

Tamayo et al. [���] investigated various data parallel implementation strategies for molecular dynamics simulations running on SIMD computers such as the CM��. More recently, Figueira and Baden [��] employed our software infrastructure to analyze the performance of different parallelization strategies for localized N-body methods running on MIMD multiprocessors.
Some previous efforts with parallel particle calculations have concentrated on the parallelization of a particular program instead of a general software infrastructure. For example, Clark et al. [��, ��] implemented a parallel version of the GROMOS molecular dynamics application. Their approach uses non-uniform, dynamic partitions similar to our own, which were implemented (with considerable effort) using a message passing library. The parallelization of GROMOS would have been significantly easier had they employed the software abstractions we describe in Section ���.
Portions of the CHARMM molecular dynamics application have been parallelized using the CHAOS software primitives [��, ���]. CHAOS employs a fine-grain decomposition strategy in which particles are individually assigned to processors. The drawback to this type of fine-grain approach is that algorithms for scheduling interprocessor communication scale as the number of particles. In contrast, we use a coarse-grain strategy that assigns aggregates of particles to processors, and our communication algorithms depend only on the number of processors, not on the number of particles.
Experimental data distribution primitives targeted towards particle calculations have been added to Fortran D [��], using CHAOS as the run-time support library. Adhara [��] is a run-time library for particle applications that is not as general as CHAOS but has been specifically designed and optimized for certain classes of particle methods.
Warren and Salmon [���] developed a parallel tree code intended for Barnes-Hut [��] and fast multipole [��] methods. Their approach dynamically distributes nodes of the tree across processors using a clever hashing mechanism. Singh et al. [��, ���] have also implemented a parallel fast multipole algorithm for shared memory multiprocessors. However, these tree-based mechanisms are inappropriate for particle methods that employ fast PDE solvers based on array representations [��, ��].
��� Application Programmer Interface
The beauty of this is that it is only of theoretical importance, and there is no way it can be of any practical use whatsoever.
-- Sidney Harris, "Einstein Simplified"
Our particle API provides scientists with high-level computational tools that enable easy and efficient portable implementations of particle applications on parallel architectures. We have implemented our API using the LPARX primitives of Chapter �. We show that the techniques required to parallelize particle applications are easily and succinctly expressed in LPARX. Without the powerful LPARX abstractions, such software support normally constitutes many thousands of lines of complicated message passing code. In contrast, we will present LPARX code for all of the major activities needed to balance non-uniform workloads and manage interprocessor communication. The LPARX operations enable us to manage the implementation complexity at a reasonable level. As discussed in this section and in Section ���, the high-level treatment of our API implementation has enabled us to explore performance optimizations that others have found difficult with a message passing implementation. Such optimizations have reduced the execution times of our smoothed particle hydrodynamics application by an average of ��%.
As shown in Figure ��, parallelization of the sequential code of Figure �� requires four modifications to the application:

1. In Advect, call a load balancing routine to divide the computational workload across the processors. Each processor will be responsible for all calculations involving particles in the data partition assigned to it.

2. Before calculating local interactions, fetch off-processor particles required to satisfy data dependencies.

3. After calculating local interactions, update force information for particles owned by other processors.

4. After updating particle positions in MoveParticles, repatriate particles to their rightful processor owners if they have migrated off of the local data partition.

Functions BalanceWorkloads, FetchParticles, WriteBack, and RepatriateParticles are provided by our particle API. The following five sections describe these facilities and their implementation using the LPARX mechanisms in more detail.
����� Balancing Non-Uniform Workloads

We address two issues in this section: (1) how do we represent the non-uniform decompositions needed to evenly distribute workloads (e.g., see Figures ��b
// Advect is the main routine of the particle calculation
function Advect
    for t = 1 to MaxSteps
        // Redistribute workload across processors
        if (time to rebalance workloads) then
            call BalanceWorkloads
        end if
        call CalculateForces
        call MoveParticles
    end for
end function

// Calculate forces arising from nearby particle interactions
function LocalInteractions
    // Fetch particle data from adjoining processors
    call FetchParticles
    for each particle p owned by this processor
        for each particle q in a neighborhood of p
            calculate interaction between p and q
        end for
    end for
    // Write back forces for off-processor particles
    call WriteBack
end function

// Update particle positions according to calculated forces
function MoveParticles
    for each particle p owned by this processor
        move particle p using force information
    end for
    // Repatriate particles which have moved off our partition
    call RepatriateParticles
end function

Figure ��: A parallel version of the generic particle code shown in Figure ��. Four changes are necessary to parallelize the application: (1) balance workloads and distribute computational effort across the processors (in Advect); (2) before performing local interactions, call FetchParticles to cache off-processor particle information needed to calculate forces; (3) after calculating interactions, update force information for particles owned by other processors; and (4) repatriate particles to their rightful owners if they have migrated off of the local data partition. Note that we do not show function CalculateForces because it has not changed from Figure ��.
and ��c), and (2) how do we dynamically redistribute computational effort as workloads change?

A common technique used to implement particle applications employs a chaining or binning mesh which covers the entire computational domain ����. Particles are sorted into the mesh according to their spatial location; each element (or bin) of the mesh contains the particles lying in the corresponding portion of space. This binning structure is used to accelerate the O(N²) search for neighboring particles that would otherwise be required. A sequential application would represent the mesh as a Grid of particle lists:

Grid of ParticleList bins

where ParticleList is the user-defined type implementing an unordered collection of particles. The computational work carried by each bin is a function of the number of particles in the bin and the local particle density.
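To make the binning idea concrete, the following is a minimal sequential sketch of a 2d chaining mesh (the names ChainMesh2d and Particle are illustrative stand-ins, not the library's types): particles are sorted into bins of width equal to the interaction cutoff, so a neighbor search examines only a particle's own bin and the adjacent bins instead of all N particles.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Particle { double x, y; };

// Toy 2d chaining mesh: bins of width h (the interaction cutoff) over a
// square domain. Neighbor queries scan at most a 3x3 block of bins.
struct ChainMesh2d {
    double h;                                  // bin width == cutoff
    int nx, ny;                                // bins per dimension
    std::vector<std::vector<Particle>> bins;   // bins[i * ny + j]

    ChainMesh2d(double domain, double cutoff)
        : h(cutoff), nx(int(domain / cutoff)), ny(int(domain / cutoff)),
          bins(std::size_t(nx) * ny) {}

    // Sort a particle into the bin covering its spatial location.
    void add(const Particle& p) {
        int i = std::min(nx - 1, int(p.x / h));
        int j = std::min(ny - 1, int(p.y / h));
        bins[i * ny + j].push_back(p);
    }

    // Count particles within distance h of p, scanning only adjacent bins.
    int neighbors(const Particle& p) const {
        int bi = std::min(nx - 1, int(p.x / h));
        int bj = std::min(ny - 1, int(p.y / h));
        int count = 0;
        for (int i = std::max(0, bi - 1); i <= std::min(nx - 1, bi + 1); ++i)
            for (int j = std::max(0, bj - 1); j <= std::min(ny - 1, bj + 1); ++j)
                for (const Particle& q : bins[i * ny + j]) {
                    double dx = p.x - q.x, dy = p.y - q.y;
                    if (std::sqrt(dx * dx + dy * dy) <= h) ++count;
                }
        return count;
    }
};
```

Because the cutoff bounds the interaction range, a correct neighbor search never needs to look beyond the adjacent bins, which is what reduces the quadratic search to roughly linear work.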
To balance non-uniform workloads on a parallel machine, we decompose this binning structure across processors. Recall that LPARX represents irregular block decompositions using the XArray; thus, our parallel data decomposition is implemented as:

XArray of Grid of ParticleList bins

where each Grid of the XArray contains the particle list data for its corresponding data partition. Irregular data distributions are determined by a partitioning utility which attempts to evenly divide the work among the processors. A 2d sample data decomposition for four processors and the associated XArray are shown in Figure ��. When calculating forces and moving particles, processors employ LPARX's coarse-grain data parallel forall loop to compute over only those particles which lie in their assigned partition(s).
This binning mesh must be periodically repartitioned in response to the changing workload distribution. In general, the application need not call the rebalancing routine every timestep. The maximum distance a particle may move in
Figure ��: Irregular decompositions of the computational domain are represented using the XArray. For particle applications, each processor is typically assigned a single data partition, which corresponds to an XArray element (a Grid). Processors compute only for those particles within their assigned partition.
a single timestep is limited by the stability requirements of the numerical method; therefore, workloads change slowly ��� ��� ����. For example, the smoothed particle hydrodynamics application described in Section �� repartitions every ten timesteps.
Partitioning the computational domain introduces data dependencies between the various subproblems: particles near the boundary of a partition may interact with particles belonging to other processors. We extend each partition with a ghost cell region (see Figure ��) used to locally cache copies of off-processor particles. In general, the width of this ghost cell region depends on the mesh spacing and may be different in each dimension. Prior to each force evaluation, ghost cells are filled with the most recent off-processor particle data. How this is managed is described in Section ���.
Dynamic load balancing is handled by particle library routine BalanceWorkloads, shown in Figure ��. The first step is to estimate how much computational effort will be required to calculate the forces for all particles in a particular bin. Our API automatically measures the amount of time spent computing in each bin and uses these timing measurements from previous timesteps to guide the partitioning for the following timesteps.

In the next step, we call a recursive bisection ��� partitioning utility, provided by the LPARX standard libraries.* This partitioner takes the workload estimate

*The application may substitute another irregular block partitioner.
// Rebalance workloads and copy from the old mesh into the new
// Bins and NewBins are XArray of Grid of ParticleList
// Partition and NewPartition are Array of Region
// NGHOST is the ghost cell width
// P is the number of processors
function BalanceWorkloads
    // Step (1): Estimate the workload distribution
    Array of Double Work = EstimateWork()
    // Step (2): Partition this workload
    Array of Region NewPartition = RCB(Work, P)
    // Step (3): Add ghost cells to the partition
    Array of Region Ghosts = grow(NewPartition, P, NGHOST)
    // Step (4): Allocate the storage for this structure
    call XAlloc(NewBins, P, Ghosts)
    // Step (5): Copy data from Bins into NewBins
    forall i in NewBins
        for j in Bins
            copy into NewBins(i) from Bins(j) on Partition(j)
        end for
    end forall
end function

Figure ��: API function BalanceWorkloads redistributes computational effort across the processors. Workload information gathered by the particle API is used to guide the RCB partitioning utility provided by the LPARX libraries. After the routine adds ghost cell regions and allocates storage, data from the old binning structure is copied into the new mesh. All of these details are hidden from users of the API. Note that the ghost cells are not shown in the picture.
and the number of desired partitions and returns a description of the data decomposition. Note that the Regions returned represent the structure of the new partitioning (i.e., structural abstraction) but do not actually allocate storage. Next, we pad each partition with a ghost cell region of size NGHOST using grow from LPARX. Finally, we allocate the storage for the new binning array NewBins using this structure information.

The new binning mesh is initially empty; before using it, we must copy particle information from the previous data distribution into the new decomposition. The two nested loops copy values from Bins into the new mesh NewBins. For each i and j, NewBins(i) is assigned the portion of Grid Bins(j) that logically overlaps with j's associated partition. Any interprocessor communication is automatically managed by the LPARX run-time system. Copies of particle lists local to a processor are implemented by simply copying pointers. Note that although we temporarily duplicate the storage for the chaining mesh, the particle data (which is likely to require far more memory resources) is not duplicated. Of course, all of these details are hidden by the API.
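The recursive bisection idea behind RCB can be sketched in one dimension as follows. This is a toy stand-in for LPARX's partitioner, not its implementation: it splits a workload array at the index that best halves the total work and recurses on each side, assuming the number of partitions is a power of two; the real RCB generalizes this to d-dimensional Regions.

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

using Range = std::pair<int, int>;   // half-open index range [lo, hi)

// Split work[lo, hi) into `parts` contiguous ranges with roughly equal sums.
static void bisect(const std::vector<double>& work, int lo, int hi,
                   int parts, std::vector<Range>& out) {
    if (parts == 1) { out.push_back({lo, hi}); return; }
    double total = std::accumulate(work.begin() + lo, work.begin() + hi, 0.0);
    double half = 0.0;
    int cut = lo;
    // Advance the cut until roughly half of the work lies to its left.
    while (cut < hi - 1 && half + work[cut] < total / 2.0)
        half += work[cut++];
    bisect(work, lo, cut, parts / 2, out);
    bisect(work, cut, hi, parts / 2, out);
}

std::vector<Range> rcb1d(const std::vector<double>& work, int parts) {
    std::vector<Range> out;
    bisect(work, 0, int(work.size()), parts, out);
    return out;
}
```

The resulting partitions are narrower where per-bin work is heavy and wider where it is light, which is exactly the behavior the timing-driven workload estimates exploit.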
����� Caching Off-Processor Data

Recall from the previous section that each processor's partition is surrounded with a ghost cell region used to locally cache copies of off-processor particles from neighboring partitions. In general, efficiently filling ghost cell regions for irregular, dynamic decompositions is a difficult task. Data dependencies between processors change as workloads are rebalanced; communication structures are neither static nor regular and cannot be easily predicted. Wide ghost cell regions may span several other partitions.

All of these details are managed by the FetchParticles code shown in Figure ��. For every pair of Grids Bins(i) and Bins(j), this routine copies into the ghost cells of Bins(j) interior (non-ghost cell) particle information from all adjacent Grids Bins(i). We select only the interior particles from the source Bins(i) by grow-
// Communicate boundary particle data between neighboring partitions
// Bins is the binning mesh used to store the particles
function FetchParticles(XArray of Grid of ParticleList Bins)
    // Loop over all pairs of grids in Bins
    forall i in Bins
        // Mask off the ghost cells (copy interior values only)
        // Function region() extracts the region from its argument
        Region Interior = grow(region(Bins(i)), -NGHOST)
        for j in Bins
            // Copy data from intersecting regions
            copy into Bins(j) from Bins(i) on Interior
        end for
    end forall
end function

Figure ��: FetchParticles locally caches copies of off-processor particle information needed for particle interactions. Ghost cell regions are updated with particle data from the interiors of adjacent partitions.
ing its Region by a negative ghost cell width. Aggregate data motion between Grids is handled through LPARX's copy-on-intersect operation, which efficiently copies data between Grids, ignoring points which are not shared.
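A minimal sketch of the two region operations this code leans on, grow and intersection, using a toy 2d Region rather than LPARX's dimension-independent one:

```cpp
#include <algorithm>
#include <cassert>

// Toy 2d Region with inclusive index bounds; illustration only.
struct Region {
    int lo[2], hi[2];
    bool empty() const { return lo[0] > hi[0] || lo[1] > hi[1]; }
};

// grow pads a Region by g cells in every direction; a negative g shrinks
// it, which is how FetchParticles masks off the ghost cells of a source.
Region grow(const Region& r, int g) {
    return {{r.lo[0] - g, r.lo[1] - g}, {r.hi[0] + g, r.hi[1] + g}};
}

// The heart of copy-on-intersect: only points in the geometric
// intersection of the source and destination Regions are transferred.
Region intersect(const Region& a, const Region& b) {
    return {{std::max(a.lo[0], b.lo[0]), std::max(a.lo[1], b.lo[1])},
            {std::min(a.hi[0], b.hi[0]), std::min(a.hi[1], b.hi[1])}};
}
```

For two abutting 10×10 partitions, growing the left one by a one-cell ghost region and intersecting with the right one yields exactly the one-cell-wide column of ghost cells that must be filled; irregular neighbors and multi-partition overlaps fall out of the same intersection test.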
The simplicity of our routine belies the fact that the code to perform FetchParticles becomes quite complicated in the absence of the powerful LPARX facilities. Similar functionality in the GenMP system ��� required over ��� lines of message passing code. In describing the parallelization of the GROMOS molecular dynamics application, Clark et al. ��� point out the difficulty of supporting irregular partitions and ghost cell regions that may span several partitions. Such special cases are automatically managed by the copy-on-intersect primitives provided by LPARX. Furthermore, FetchParticles is independent of the type of data decomposition, and the same algorithm works for both 2d and 3d applications.
����� Writing Back Particle Information

Many of the force laws employed by particle applications are symmetric; that is, the force acting on particle p by particle q is equal and opposite to the force acting on particle q by particle p (Newton's Third Law �����). By exploiting this symmetry, we reduce computational costs by about half: once we have calculated the force acting on p by q, we know that the force acting on q by p is the same but in the opposite direction.
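In serial form, the savings comes from visiting each pair only once; a sketch (with a placeholder force law, not the SPH forces used later in this chapter) might look like:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct P { double x, f; };

// Exploit Newton's third law: each pair (i, j) is visited once, and the
// computed force is applied with opposite signs to both particles,
// halving the number of force evaluations.
void pairForces(std::vector<P>& ps) {
    for (std::size_t i = 0; i < ps.size(); ++i)
        for (std::size_t j = i + 1; j < ps.size(); ++j) {
            double fij = ps[i].x - ps[j].x;   // placeholder pairwise force
            ps[i].f += fij;                   // force on i from j
            ps[j].f -= fij;                   // equal and opposite on j
        }
}
```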
Subtle implementation issues arise when employing symmetric force laws. Consider the force calculation between a particle p in the interior of a processor's partition and a particle q in its ghost cell region. If we exploit the symmetry of the computation, we update the forces for both particles p and q. However, at the end of the local interactions, particle q, a locally cached copy of a particle owned by another processor, contains important force information which must be transmitted back to the processor owning q.

Thus, by exploiting the symmetry of the force law, we halve the numerical computation at the expense of additional interprocessor communication. One compromise employs symmetry only if neither interacting particle lies in the ghost cell region ����; forces for such particles are computed redundantly on different processors instead of communicated between processors. While this approach eliminates the extra communication, our experiments (reported in Section �����) indicate that it results in ��% longer execution times because of the redundant computation.

Figure �� shows the API routine WriteBack, which implements the force update. This code is essentially the same as FetchParticles described in Section ��� except that data travels in the opposite direction. One notable difference is that
// Write back force information between neighboring partitions
// Bins is the binning mesh used to store the particles
function WriteBack(XArray of Grid of ParticleList Bins)
    // Loop over all pairs of grids in Bins
    for i in Bins
        for j in Bins
            // Mask off the ghost cells (copy interior values only)
            Region Interior = grow(region(Bins(j)), -NGHOST)
            // Copy data from intersecting regions
            copy into Bins(j) from Bins(i) on Interior using CombineForces
        end for
    end for
end function

Figure ��: WriteBack updates force information for particles owned by other processors. This code is essentially the same as FetchParticles in Figure ��, except that data flows in the opposite direction. This routine employs the reduction form of the LPARX copy-on-intersect operation. In this example, the reduction function CombineForces sums forces from off-processor particles into locally owned particle lists. As described in Section ���, this code is also used to repatriate particles to their rightful processor owners if they have migrated off of the local data partition. For that version, reduction function CombineLists is used to combine off-processor particles with lists of locally owned particles.
WriteBack employs the reduction form of the LPARX copy-on-intersect operation. Recall that this primitive takes a commutative, associative reduction function as an argument; instead of simply copying data, the specified function is applied elementwise to combine corresponding source and destination data values. In this case, the reduction function CombineForces takes two ParticleLists, sums the forces for corresponding particles in the two lists, and returns the result. In the general case, CombineForces must be provided by the application because symmetric force laws often calculate more than just forces; for example, the smoothed particle hydrodynamics application of Section �� calculates both forces and densities. Writing CombineForces is simple, however, and we will show an example in Section ���.

As before, the code for WriteBack is difficult to implement without the support provided by LPARX. Indeed, the parallel implementation of the GROMOS molecular dynamics application ��� ignores symmetry and redundantly computes interactions involving particles lying in ghost cell regions, even though the implementors expect a dramatic increase in performance with this optimization. We will explore the performance implications of this design decision in Section �����.
����� Repatriating Particles

The fourth and final facility required to parallelize a particle application repatriates particles across processors if they have migrated off of their processor's partition. The last phase of each timestep moves particles according to the calculated forces acting on each particle. In this step, some particles may move off of the partition owned by their processor into the ghost cell region. (Prior to moving particles, we remove from the ghost cell region the off-processor particles locally cached by FetchParticles.) Particles will not move past the ghost cell region because the numerical methods limit the maximum distance a particle may move in a single timestep due to stability requirements �����. Because these particles no longer lie in their processor's partition, they must be communicated to the processors which rightfully own them.

The computational structure of repatriation is identical to that of the force
update described in the previous section. In fact, the code for RepatriateParticles is identical to that in Figure �� with one change: the reduction function CombineForces is replaced by CombineLists, which takes two particle lists and returns their union. As particles lying in ghost cell regions are copied back onto their proper partitions, they are combined with the particles already lying in those bins via CombineLists. The user of our API does not supply CombineLists and only needs to call RepatriateParticles; all interprocessor communication and list concatenation is managed automatically by the run-time system.
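Since the text only names CombineLists, here is a hedged sketch of what such a list-union reduction might look like; the list type is a toy stand-in for the library's ParticleList, with a particle reduced to an id:

```cpp
#include <cassert>
#include <vector>

using ParticleList = std::vector<int>;   // toy: a particle is just an id

// Unlike CombineForces, which sums fields of corresponding particles,
// CombineLists simply concatenates the incoming off-processor particles
// onto the locally owned list.
void CombineLists(ParticleList& local, const ParticleList& incoming) {
    local.insert(local.end(), incoming.begin(), incoming.end());
}
```

Because particle lists are unordered collections, concatenation behaves as the commutative, associative reduction that the copy-on-intersect primitive requires.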
����� Implementation Details

We have implemented our particle API as a library of C++ classes built on top of the LPARX parallelization abstractions. The library consists of approximately one thousand lines of C++ code and defines two classes, ChainMesh and ParticleList, that provide the functionality described in the previous sections. Programmers are completely isolated from LPARX and never see types Region, Grid, or XArray; in fact, we could implement the library on top of another parallel run-time system, and the API would not change. ChainMesh and ParticleList are described briefly in the following two sections.

ChainMesh

ChainMesh implements the chaining mesh structure ���� for organizing the particles. Recall from Section ��� that this chaining mesh covers the entire computational domain, and particles are sorted into the mesh based on their spatial location. Each mesh element or bin contains a ParticleList of the particles lying in the corresponding region of space. Figure �� shows part of the C++ class interface for ChainMesh. Internally, the chaining mesh is implemented as an XArray of Grid of ParticleList, although this representation is hidden from the programmer. ChainMesh defines member functions corresponding to all of the parallelization mechanisms described previously: BalanceWorkloads, FetchParticles, WriteBack, and
// Define a chaining mesh class for a particle calculation
// Class Particle defines particle attributes (position, velocity, ...)
// Class Index is a simple opaque index object (like an LPARX Point)

class ChainMesh {
    // The chaining mesh is an XArray of Grids of ParticleList
    XArray_of_Grid_of_ParticleList mesh;

    // ChainMesh automatically times iterations for load balancing
    XArray_of_Grid_of_Double workload;

    // Define other miscellaneous flags and variables
    int has_periodic_boundary_conditions;
    double interaction_distance;
    ...

public:
    void AddParticle(Particle p);
    void BalanceWorkloads();
    void FetchParticles();
    void WriteBack(ForceReductionFunction CombineForces);
    void RepatriateParticles();
    ParticleList operator()(const Index I);
    ...
};

// ForAll loops over all ParticleLists owned by this processor
#define ForAll(I, MESH) ...
#define EndForAll ...

// ForAllInteracting loops over all particles J interacting with I
#define ForAllInteracting(J, I, MESH) ...
#define EndForAllInteracting ...

Figure ��: This API definition is taken from the C++ header file for class ChainMesh, which implements the chaining mesh structure used to organize particles ����. ChainMesh provides all of the parallelization mechanisms described previously: BalanceWorkloads, FetchParticles, WriteBack, and RepatriateParticles. It is implemented on top of the LPARX parallelization mechanisms.
RepatriateParticles. It also provides functions for adding particles to the mesh (AddParticle) and for indexing the mesh to extract a single list of particles.

The implementation also defines two loops, ForAll and ForAllInteracting, that iterate over the particle lists in the chaining mesh. ForAll is a parallel loop that iterates over the particle lists owned by a particular processor. It automatically times the computation associated with each bin, and this timing information is used by ChainMesh to guide the partitioning of the mesh in BalanceWorkloads. Although the application must explicitly rebalance the mesh by calling BalanceWorkloads, the difficult task of determining the non-uniform workload distribution is handled automatically. In practice, applications typically repartition every Nth timestep, where N depends upon how quickly particles move.

The other loop, ForAllInteracting, iterates over all bins that contain particles interacting with the bin returned by ForAll. Figure ��� shows how these loops are used in a local interactions computation. LocalInteractions is a C++ routine that calculates local interactions using numerical kernel ComputeInteractions (not shown). This C++ code looks very similar to the local interactions loop of our generic parallel particle application in Figure ��.
ParticleList

In addition to ChainMesh, the particle library defines a class called ParticleList to represent a list of particles. Such particle lists are not an artifact of the parallel implementation but are also required by serial codes. Traditionally, chaining meshes have represented particle lists using a linked list ����. The advantage of the linked list strategy is that it is easy to add and remove particles by simply manipulating pointers.

Instead of this approach, our implementation of ParticleList represents a list of particles using an array (see Figure ���). Although this array representation is more complicated (particle information must be copied into and out of the list, and the array must grow and shrink dynamically), it has its performance advantages.
// Show a simple routine which does local particle interactions
extern void ComputeInteractions(ParticleList A, ParticleList B);
extern void CombineForces(ParticleList A, const ParticleList B);

void LocalInteractions(ChainMesh mesh) {
    // Fetch particle data from adjoining neighbors
    mesh.FetchParticles();

    // Calculate forces arising from local particle interactions
    ForAll(I, mesh)
        ForAllInteracting(J, I, mesh)
            ComputeInteractions(mesh(I), mesh(J));
        EndForAllInteracting
    EndForAll

    // Write back forces for off-processor particles
    mesh.WriteBack(CombineForces);
}

Figure ���: Application C++ code to compute local interactions using the particle library. The ForAll loop iterates in parallel over all particle lists in the chaining mesh, and ForAllInteracting iterates over bins that contain particles interacting with bins returned by ForAll. ComputeInteractions is a numerical routine that computes the interactions between two particle lists and is not shown.
Because particle information is arranged in arrays, it is easier to vectorize numerical kernels on vector architectures such as the Cray C90. Arrays are easier to pass to Fortran numerical routines than linked lists. Finally, the array representation offers improved cache locality because particle values lie contiguously in memory.

The physical information stored on each particle typically depends on the type of numerical simulation. Thus, the programmer must take the following steps to customize the particle representations provided by our library (see Figure ���):

• Define a C++ class called Particle that includes all physical information needed to characterize a particle, such as position, velocity, acceleration, force, mass, density, or pressure.
// Class Particle defines important particle attributes
class Particle {
    double position[3], velocity[3], force[3];
    ...
};

// Class ParticleList represents Particle information as arrays
class ParticleList {
    int number;
    double (*position)[3], (*velocity)[3], (*force)[3];
    ...
};

// CombineForces (called by WriteBack) adds forces from B into A
void CombineForces(ParticleList A, const ParticleList B) {
    for (int i = 0; i < A.number; i++) {
        A.force[i][0] += B.force[i][0];
        A.force[i][1] += B.force[i][1];
        A.force[i][2] += B.force[i][2];
    }
}

// Packing routine to transmit ParticleList data between processors
SendPacket& operator<<(SendPacket& stream, const ParticleList& PL)
{
    stream << PL.number;
    stream << PackArray(PL.position, 3*PL.number);
    stream << PackArray(PL.velocity, 3*PL.number);
    ...
    return stream;
}

Figure ���: Because the information represented by a particle depends on the physics of the computation, the programmer must customize our ParticleList class for a particular application. These changes are simple and could be managed automatically using a pre-processor. ParticleList represents particle information using arrays of data instead of the linked list method typically used in chaining mesh codes ����. Function PackArray is defined by the AMS libraries and packs an entire array of data into the outgoing message stream.
• Modify ParticleList to represent the same information as a Particle and write routines to copy a Particle into and out of a ParticleList. These copying routines are used internally by the particle library. They are simple and resemble the gather and scatter routines used on vector architectures.

• Write message stream packing and unpacking routines needed to transmit particle data across memory spaces (see Section ���). Again, these routines are easy to write and resemble standard C++ I/O. A sample packing routine is shown in Figure ���. The corresponding unpacking routine would be similar except that it would extract data from the message stream.

• Write the CombineForces routine as needed by WriteBack (see Section ���). Recall that CombineForces is used by WriteBack to combine the force contributions from two particle lists. The sample code in Figure ��� loops over the particles in particle lists A and B and adds the forces from particles in list B to the corresponding particles in list A.

Although these changes are not difficult, they could be automated by using a simple pre-processor. Of course, the programmer must also write the numerical computation routines to calculate particle interactions and update particle positions.
One possible performance optimization that we explore in Section ��� is the selective packing and unpacking of particle information. Particles typically contain a significant amount of data (tens to hundreds of bytes), and it is not always necessary to communicate all of this information between processors. For example, only forces typically need to be communicated when writing back force data. Thus, the programmer can apply this application-specific knowledge to selectively transfer data in the packing and unpacking routines. Our experiments with this optimization in Section ��� indicate that it reduces execution times between �% and ��% and reduces the amount of interprocessor communication by a factor of four to five.
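A hedged sketch of the idea follows; the buffer layout and the names Plist and packForcesOnly are illustrative, since the real code would write through the AMS SendPacket stream and PackArray:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Toy particle list: three arrays of 3 doubles per particle.
struct Plist {
    std::vector<double> pos, vel, force;
};

// Selective packing: when writing back forces, serialize only the force
// array rather than the full per-particle record, shrinking the message.
std::vector<char> packForcesOnly(const Plist& pl) {
    std::vector<char> buf(pl.force.size() * sizeof(double));
    std::memcpy(buf.data(), pl.force.data(), buf.size());
    return buf;   // pos and vel are deliberately omitted from the message
}
```

The corresponding unpacking routine on the receiving side would copy the doubles back into the force array only, leaving positions and velocities untouched.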
Figure ��: Our smoothed particle hydrodynamics application simulates the evolution of the 3d disk galaxy shown here. Particles are equally distributed around the ring and are assigned a random vertical position clustered about the Z = 0 plane.
��� Smoothed Particle Hydrodynamics

We've discovered a massive dust and gas cloud which is either the beginning of a new star or just a hell of a lot of dust and gas.
- Sidney Harris, "From Personal Ads to Cloning Labs"

We have developed a 3d smoothed particle hydrodynamics application (SPH3D) based on the software facilities described in the previous section. Smoothed particle hydrodynamics is a particle-based simulation method which has been applied to gas dynamics, stellar collisions, planet formation, cloud collisions, cosmology, magnetic phenomena, and nearly incompressible flow �����. Our particular application* arises in astrophysics and models the evolution of the disk galaxy shown in Figure ��.

The computational structure of our smoothed particle hydrodynamics application is similar to that of the generic particle codes shown in Figures �� and ���. Interactions between particles occur only over short ranges, and there are no far-field

*The original code and a sample data set were provided by John Wallin (Institute for Computational Sciences and Informatics at George Mason University) and Curtis Struck-Marcell (Department of Physics and Astronomy, Iowa State University).
forces. Each interaction is expensive, requiring approximately one hundred floating point operations. Associated with each particle is ��� bytes of information describing position, velocity, acceleration, mass, density, and pressure. Local interactions take two forms. First, the method calculates a local density for each particle. Then, using this density information, it computes pressure gradients and their associated forces. These forces are used to move the particles in preparation for the next timestep. Because there are two local interaction phases, two calls to FetchParticles and WriteBack are needed every timestep.

We begin this section with a description of the numerical calculation, which may be skipped without loss of continuity. The succeeding four sections present computational results.
����� Numerical Background

You know, I don't think math is a science. I think it's a religion. All these equations are like miracles. You take two numbers, and when you add them, they magically become one new number. No one can say how it happens. You either believe it or you don't. This whole [section] is full of things that have to be accepted on faith. It's a religion. As a math atheist, I should be excused from this...
- Calvin, "Calvin and Hobbes"

Smoothed particle hydrodynamics represents each particle not as a point but as a smooth "blob" smeared over a portion of space �����. The general form of a blob is given by the interaction basis function, or kernel, Φ. Our particular application uses the following kernel function:
Φ(r, h) = (1/(π h³)) (1 − (3/2)(r/h)² + (3/4)(r/h)³)    for 0 ≤ r ≤ h
Φ(r, h) = (1/(4 π h³)) (2 − r/h)³                        for h ≤ r ≤ 2h
Φ(r, h) = 0                                              otherwise

where r ∈ R is the distance away from the center of the particle and h ∈ R gives the "spreading" of the blob. Note that this kernel has compact support: Φ is zero for r ≥ 2h. Thus, particles separated by more than 2h do not interact.
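A direct transcription of the cubic spline kernel follows. The normalizing constants assume the standard Monaghan form; the dissertation's exact coefficients did not survive extraction cleanly, so treat them as illustrative.

```cpp
#include <cassert>
#include <cmath>

// Cubic spline SPH kernel with compact support of radius 2h.
// Coefficients follow the standard Monaghan form (an assumption here).
double kernel(double r, double h) {
    const double PI = 3.14159265358979323846;
    const double q = r / h;
    const double norm = 1.0 / (PI * h * h * h);
    if (q <= 1.0)
        return norm * (1.0 - 1.5 * q * q + 0.75 * q * q * q);
    if (q <= 2.0)
        return norm * 0.25 * (2.0 - q) * (2.0 - q) * (2.0 - q);
    return 0.0;   // compact support: particles beyond 2h do not interact
}
```

A useful sanity check on any such kernel is continuity at the piece boundary r = h, where both polynomial branches must agree.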
Variable   Space          Physical Meaning
x          R³             position
v          R³             velocity
a          R³             acceleration
ρ          R              density
m          R              mass
P          R              pressure
V          R              viscosity
Δt         R              timestep
Φ          R × R → R      interaction kernel function
h          R              interaction distance
g          R³ → R³        external gravitational field

Table ��: A summary of the variables and functions used in the smoothed particle hydrodynamics equations.
Associated with each particle is information about its position x, velocity v, acceleration a, mass m, and density ρ (see Table �� for a summary of all variables and functions defined in this section). Local interactions consist of two separate computation phases: density calculations and force calculations. We compute the density for a particular particle i by summing mass contributions from nearby particles j:

ρᵢ = Σⱼ mⱼ Φ(‖xᵢ − xⱼ‖, h)

Although written as a sum over all particles j, only nearby particles contribute to the density because Φ is zero for ‖xᵢ − xⱼ‖ ≥ 2h.

After we have calculated the local density for each particle, we compute the
forces on each particle i:

aᵢ = −Σⱼ mⱼ { Pᵢ,ⱼ + Vᵢ,ⱼ } ∇ᵢ Φ(‖xᵢ − xⱼ‖, h) + g(xᵢ)

where ∇ᵢ is the gradient taken with respect to the coordinates of particle i. Pᵢ,ⱼ represents the force component due to pressure and is given by:

Pᵢ,ⱼ = √(Pᵢ Pⱼ) / (ρᵢ ρⱼ)
The viscosity, or "stickiness", of the fluid is represented by the term Vᵢ,ⱼ, defined as:

Vᵢ,ⱼ = h [ (vᵢ − vⱼ) · (xᵢ − xⱼ) ]² / [ (‖xᵢ − xⱼ‖² + εh²) ρᵢ ρⱼ ]    if (vᵢ − vⱼ) · (xᵢ − xⱼ) < 0
Vᵢ,ⱼ = 0                                                              if (vᵢ − vⱼ) · (xᵢ − xⱼ) ≥ 0

where (·) is the standard dot product in R³ and ε is a small constant that keeps the denominator from vanishing. The term g(xᵢ) in the acceleration equation represents the influence of an external, problem-dependent gravitational field. With the exception of the multiplication by mⱼ, the computation for aᵢ within the sum is identical to that for aⱼ. Thus, we exploit the symmetric nature of the force law to reduce computational costs by about a factor of two.
Using the acceleration information from Eq. ��, we update the velocity and position of particle i using the first-order Euler's method [��]:

v_i \leftarrow v_i + a_i \, \Delta t
x_i \leftarrow x_i + v_i \, \Delta t

where Δt represents the timestep. The application dynamically changes the timestep Δt to satisfy stability criteria such as the Courant-Friedrichs-Lewy (CFL) condition [��].
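The density phase and the Euler update can be sketched in C++ as follows. The `Particle` record and function names here are ours, not LPARX's, and a real code would enumerate neighbors through the chaining mesh rather than the all-pairs loop shown:

```cpp
#include <cmath>
#include <vector>

// Hypothetical per-particle record: position, velocity, acceleration,
// mass, and density.
struct Particle {
    double x[3] = {0, 0, 0}, v[3] = {0, 0, 0}, a[3] = {0, 0, 0};
    double m = 0.0, rho = 0.0;
};

// Density by direct summation over all particles; the kernel `phi` is zero
// beyond the interaction distance h, so only nearby particles contribute.
void computeDensity(std::vector<Particle>& p, double h,
                    double (*phi)(double, double)) {
    for (auto& pi : p) {
        pi.rho = 0.0;
        for (const auto& pj : p) {
            double d2 = 0.0;
            for (int k = 0; k < 3; ++k) {
                const double dx = pi.x[k] - pj.x[k];
                d2 += dx * dx;
            }
            pi.rho += pj.m * phi(std::sqrt(d2), h);
        }
    }
}

// First-order Euler update: v <- v + a*dt, then x <- x + v*dt.
void eulerStep(std::vector<Particle>& p, double dt) {
    for (auto& pi : p)
        for (int k = 0; k < 3; ++k) {
            pi.v[k] += pi.a[k] * dt;
            pi.x[k] += pi.v[k] * dt;
        }
}
```

Note that the position update uses the already-updated velocity, matching the order of the two formulas above.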
����� Performance Comparison

We present performance results for the SPH3D application on the Cray C-90 (single processor), Intel Paragon, IBM SP2, and a network of Alpha workstations connected by a GIGAswitch running PVM [��]; refer to Table �� for software versions and Appendix A for machine characteristics. Time is reported in seconds per timestep. All floating point arithmetic was performed using 64-bit numbers. The application code was identical on all machines except that the C-90 version gathered and scattered particles to obtain longer vector lengths.

We ran simulations with 12k, 24k, 48k, and 96k particles for the spatial distribution shown in Figure ��. Numerical

* The IBM SP2 results were obtained on a pre-production machine at the Cornell Theory Center; these times should improve as the system is tuned and enters full production use.
Machine    C++ Compiler      Optimization     Fortran Compiler   Optimization      Operating System
Alphas     g++ v���          -O               f77 v���           -O�               OSF/1 ��
C-90       CC v�������       -O               cft77 v���         -O�               UNICOS �����
Paragon    g++ v���          -O -mnoieee      if77 v����         -O� -Knoieee      OSF/1 �����
SP2        xlC v��           -O -Q            xlf v���           -O�               AIX v��

Table ��: Software version numbers and compiler optimization flags for all computations in this chapter. The Alpha cluster consists of eight DEC Alpha workstations communicating through PVM ���� over a GIGAswitch network interconnect. All benchmarks used LPARX release v���. Detailed machine characteristics are reported in Appendix A.
computation costs vary as the square of the local particle density; asymptotically, doubling the number of particles in the same computational space requires four times more work. We may eliminate the square dependence on the number of particles by reducing particle interaction distances as the local density increases [��], but we have not taken this approach in these simulations.

We employed a chaining mesh of size ������ on ��, ��, ��, and � processors, and ����� on � and � processors. These mesh sizes were chosen to minimize the total execution time; larger meshes help reduce load imbalance because they allow a finer partitioning of the problem space. In general, the choice of the best mesh size depends on factors such as the number of processors, kernel interaction distance, load imbalance, workload distribution, processor computational speed, and particle density [��]. On the parallel machines, we rebalanced workloads every ten timesteps.
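For concreteness, a chaining mesh of this kind can be sketched as follows: cells are at least h wide, so each particle's interaction partners are confined to its own cell and the adjacent cells. The class layout is hypothetical and far simpler than the LPARX Grid of ParticleList used in the actual implementation:

```cpp
#include <array>
#include <vector>

// Hypothetical chaining mesh over a cubic domain: particle indices are
// binned into cells whose side is at least the interaction distance h, so
// neighbor searches need only examine a cell and its 26 adjacent cells.
struct ChainingMesh {
    int n;                                   // cells per dimension
    double cell;                             // cell width (>= h)
    std::vector<std::vector<int>> bins;      // particle indices per cell

    ChainingMesh(double domain, double h)
        : n(static_cast<int>(domain / h) > 0 ? static_cast<int>(domain / h) : 1),
          cell(domain / n),
          bins(static_cast<std::size_t>(n) * n * n) {}

    // Map a position to the flat index of its cell, clamping to the domain.
    int index(const std::array<double, 3>& x) const {
        int i[3];
        for (int k = 0; k < 3; ++k) {
            i[k] = static_cast<int>(x[k] / cell);
            if (i[k] < 0) i[k] = 0;
            if (i[k] >= n) i[k] = n - 1;
        }
        return (i[2] * n + i[1]) * n + i[0];
    }

    void insert(int particle, const std::array<double, 3>& x) {
        bins[index(x)].push_back(particle);
    }
};
```

A finer mesh (larger n) gives the load balancer more pieces to distribute, at the cost of more cells to manage per neighbor search.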
Table �� and Figure ��� present computational performance for one timestep of the SPH3D application. Although the numerical kernels of SPH3D vectorize on the C-90, the kernels are rather complicated and contain a number of conditionals which hinder efficient utilization of the vector units. Furthermore, even though the C-90 code gathers and scatters particles to increase vector lengths, vectors are still quite short. These vectorization limitations are intrinsic to the algorithm and are not artifacts of parallelization. For ��k particles, hardware performance monitors
Particles   Cray C-90 Time (P = 1)   Alpha Time (P = 8)
12k         ��� �
24k         ���� ����
48k         ���� ����
96k         ���� ����

Intel Paragon Performance
Particles   P = 8            P = 16           P = 32           P = 64
            Time  Speedup    Time  Speedup    Time  Speedup    Time  Speedup
12k         �� ��� ���� ��� ����� � ��� ���
24k         ���� ��� ���� ��� ��� ��� ���� ���
48k         ��� ��� ���� ��� ��� ��� ���� ���
96k         ��� ��� ���� ��� ��� ��� ���� ���

IBM SP2 Performance
Particles   P = 4            P = 8            P = 16
            Time  Speedup    Time  Speedup    Time  Speedup
12k         ��� ��� ����� ��� ����� ��
24k         ��� ��� ��� ��� ���� ��
48k         ���� ��� ���� ��� ���� ��
96k         ���� ��� ���� ��� ��� ��

Table ��: These tables present SPH3D performance results on a Cray C-90, Intel Paragon, IBM SP2, and an Alpha workstation farm running PVM. All times are in seconds per timestep. Cray times were averaged over �� timesteps, Alpha times over �� timesteps, and all other times over ��� timesteps. The C-90 measurements are CPU times on a production system; measurements on the Alpha farm, Paragon, and SP2 are wallclock times since processor nodes are not time-shared. For the Paragon and SP2, speedups are reported relative to the smallest number of processors used to gather data. These numbers are graphed in Figure ���.
[Figure ���: log-scale plots of seconds per timestep versus number of particles (12k, 24k, 48k, 96k): (a) SPH3D Performance Comparison for the Alpha Cluster (P = 8), Cray C-90 (P = 1), Paragon (P = 16), and SP2 (P = 4); (b) SPH3D Performance on the Paragon (P = 8, 16, 32, 64); (c) SPH3D Performance on the SP2 (P = 4, 8, 16).]

Figure ���: These graphs present SPH3D performance results on a Cray C-90, Intel Paragon, IBM SP2, and an Alpha workstation farm running PVM. Measurements were gathered as described in Table ��. In graph (a), the number of processors for a particular machine was chosen to provide performance roughly comparable to a single processor of a Cray C-90; processor numbers are given in parentheses. The bottom two bar charts present timings as a function of the number of processors for (b) the Intel Paragon and (c) the IBM SP2.
on the C-90 report an average floating point execution rate of ��� megaflops and an average vector length of sixty; the peak, not-to-exceed performance for one processor of the C-90 is approximately ���� megaflops. For this particular problem size, one processor of the C-90 is roughly equivalent to 8 Alpha processors, 16 Paragon processors, or 4 SP2 processors.

The C-90 and the Alpha cluster exhibit relatively poor performance on the smallest problem size (12k particles). The Alphas suffer because of the high overheads of message passing through PVM; in larger problems, this overhead is hidden by the increased computational costs of particle interactions. Poor performance on the C-90 is due to short vector lengths. The Cray C-90 times improve relative to the other machines for the largest problem size because of increasing vector lengths. Because applications implemented using our particle library are portable across a diversity of high-performance machines, computational scientists have the freedom to choose the most cost-effective architecture (e.g., Cray C-90 or Alpha cluster) for a particular problem size.
����� Execution Time Analysis

To better understand the various costs of a parallel particle application, we provide a detailed breakdown of the Paragon and SP2 execution times for the SPH3D calculation. We have chosen the 24k data set for our analysis because it exhibits reasonable performance across all processor sizes: the 12k problem does not have enough computational work for �� processors, and numerical work dominates all other costs for the larger problems running on � and � processors.

Table �� presents a breakdown of the execution time for one timestep of SPH3D (averaged over ��� timesteps). Times (in milliseconds) are reported for the following categories: force calculation, move particles, load imbalance, fetch particles, write back forces, repatriate particles, and rebalance workloads. The first two categories measure numerical work and the last five categories measure communication and parallelization overheads. The majority of the time is spent in force calculation, load
Intel Paragon Performance Breakdown
Task                     P = 8        P = 16       P = 32       P = 64
                         Time   %     Time   %     Time   %     Time   %
Force Calculation ���� �� ��� �� ���� �� ��� �
Move Particles ��� �� �� ��
Load Imbalance ��� � ���� �� ��� �� ��� ��
Fetch Particles �� � �� � �� � �� �
Write Back Forces ��� ��� � ��� � ��� �
Repatriate Particles �� �� � �� �� �� �
Rebalance Workloads �� �� �� � �� � �� �
Total ���� ��� ���� ��� ��� ��� ���� ���
IBM SP2 Performance Breakdown
Task                     P = 4        P = 8        P = 16
                         Time   %     Time   %     Time   %
Force Calculation ��� �� ���� �� ��� ��
Move Particles ��� ���� ����
Load Imbalance �� � �� � ��� ��
Fetch Particles ���� �� � ��� �
Write Back Forces ���� � �� � ���� �
Repatriate Particles ��� �� ���� � ����
Rebalance Workloads ���� �� ���� �� ���� �
Total ��� ��� ��� ��� ���� ���
Table ��: A breakdown of the execution time of one SPH3D timestep (averaged over ���) with 24k particles on the Intel Paragon and the IBM SP2. Times are reported in milliseconds. Workloads were rebalanced every ten timesteps. Numbers may not add up to the "Total" due to rounding.
imbalance, and interprocessor communication for fetching and updating interacting particles.

The computational work per processor drops by a factor of two each time the number of processors doubles, and the force calculation times reflect this pattern. One interesting anomaly occurs between � and � Paragon processors, for which the computation time is more than halved. This effect is probably due to better caching behavior on � processors. Recall that SPH3D uses a finer chaining mesh on � processors than on �; thus, there are fewer particles per bin because each bin
covers less of the computational domain. The on-chip cache in the Paragon's i860 XP processor is very small (16 Kbytes of data) and can simultaneously cache only a few tens of particles. With fewer particles per bin, there is a higher probability that particles will remain in the data cache during the inner loops of the numerical computation.

The application loses a significant amount of time to load imbalance; on � Paragon processors, nearly ��% of the total execution time is spent waiting for other processors. The reason for this poor load balancing is that our sample data set (see Figure ��) distributes most of the workload in a 2-d plane. Because the computational work is clustered in a small area, the recursive bisection algorithm cannot efficiently partition the workload across processors. Although we could refine the mesh to obtain a better load balance, a finer mesh would incur additional computational overheads [��], resulting in worse overall performance.

The communication of interacting particle information accounts for most of the remaining execution time. Although the time spent in communication remains somewhat constant as we increase the number of processors, its relative cost increases as computation time decreases. On � Paragon processors, interprocessor communication accounts for approximately �% of the total execution time. Note that the parallel overhead of performing load balancing, which includes partitioning and copying particles from the old decomposition into the new, is only a few percent of the execution time.

Table �� and Figure ��� provide another view of the execution time for one SPH3D timestep. In this breakdown, the "force calculation" and "move particles" times are combined, and the total interprocessor communication time is subdivided into two categories: buffer management (packing and unpacking of data) and communication costs (sending and receiving messages and synchronization). These results clearly show that the overhead associated with gathering and scattering data into and out of message buffers cannot be neglected. Note that the application does not change data representation, as would be required for a heterogeneous network of machines with differing number formats; instead, buffer packing employs simple memory-to-memory
Intel Paragon Performance Breakdown
Task                     P = 8        P = 16       P = 32       P = 64
                         Time   %     Time   %     Time   %     Time   %
Computation ���� �� �� �� ���� �� �� ��
Load Imbalance ��� � ���� �� ��� �� ��� ��
Communication �� � �� � �� � �� ��
Packing�Unpacking Data �� � � � � � ��� �
Total ���� ��� ���� ��� ��� ��� ���� ���
IBM SP2 Performance Breakdown
Task                     P = 4        P = 8        P = 16
                         Time   %     Time   %     Time   %
Computation ���� �� ���� �� ��� ��
Load Imbalance �� � �� � ��� ��
Communication ��� �� � ��� ��
Packing�Unpacking Data �� ���� � ���� �
Total ���� ��� ��� ��� ���� ���
Table ��: A breakdown of one SPH3D timestep (averaged over ���) with 24k particles on the Intel Paragon and the IBM SP2. The execution time is divided into four categories: computation time, load imbalance, interprocessor communication costs, and buffer packing and unpacking overheads. Times are given in milliseconds. Numbers may not add up to the "Total" due to rounding. This data is also graphed in Figure ���.
copies. On � Paragon processors, copying alone accounts for almost ��% of the total execution time.

These numbers also indicate that for larger numbers of processors, communication overheads are dominated by message start-up costs. Message packing times are directly related to the number of bytes transmitted between processors. If communication were bandwidth limited, we would expect communication times to scale as message packing times. Instead, communication costs increase faster, indicating that communication is dominated by start-up overheads.
[Figure ���: stacked bar charts ("SPH3D Execution Time Breakdown") of wallclock seconds per timestep versus processors for (a) the Intel Paragon (24k particles) and (b) the IBM SP2 (24k particles), with bars divided into pack/unpack data, communication, load imbalance, and computation.]

Figure ���: A graph of the data presented in Table �� for (a) the Intel Paragon and (b) the IBM SP2. Execution time is divided into four categories: computation time, load imbalance, interprocessor communication costs, and buffer packing and unpacking overheads.
����� Exploiting Force Law Symmetry

Recall from Section ��� that many particle applications employ a symmetric force law in which the force acting on a particle p by particle q is equal and opposite to the force acting on q by p. The SPH3D application exploits this symmetry to reduce numerical computation costs by about a factor of two. However, this savings is offset somewhat by additional communication, since forces for particles lying in the ghost cell regions must be transmitted back to the processors owning those particles. This write back phase can be difficult to implement without the proper software support; in fact, the parallel implementation of the molecular dynamics program GROMOS does not use a symmetric force law for ghost cell particles [��] for this very reason. In this section, we investigate the performance tradeoffs of this design decision.

We modified the SPH3D code to ignore symmetry for interactions involving particles in ghost cell regions; these modifications required changes to fewer than ten
Intel Paragon (24k particles)
Task                     P = 8        P = 16       P = 32       P = 64
FS PS FS PS FS PS FS PS
Computation ���� ���� �� ���� ���� ���� �� ���
Load Imbalance ��� ���� ���� ��� ��� ���� ��� ���
Communication ��� ��� ��� �� ��� ��� ��� ��
Total ���� ����� ���� ���� ��� ���� ���� ���
Intel Paragon (48k particles)
Task                     P = 8        P = 16       P = 32       P = 64
FS PS FS PS FS PS FS PS
Computation ����� ���� ���� ����� ���� ���� ��� ����
Load Imbalance ��� ���� ��� ����� �� ���� ��� ����
Communication ��� ��� ��� ��� �� ��� ��� ���
Total ���� ����� ����� ���� ��� ����� ���� ����
Table ��: These tables compare the performance of the SPH3D code, which exploits the full symmetry ("FS") of the force law, to a restricted version that exploits only some of the symmetry ("PS", for partial symmetry). The "PS" variant does not use symmetry for interactions involving particles lying in the ghost cell regions. On average, "FS" runs about ��% faster than "PS". Times (in milliseconds) represent one timestep (averaged over ���) on the Intel Paragon. Numbers may not add up to the "Total" due to rounding. These times are also graphed in Figure ���.
lines of code. Of course, we still exploit symmetry if neither particle lies in the ghost cell region. Table �� and Figure ��� compare the Paragon execution times for the two SPH3D versions on simulations with 24k and 48k particles. Without the write back communications phase, the modified SPH3D code (labelled "partial symmetry" or "PS") spends an average of ��% less time in interprocessor communication. However, this savings is more than offset by increased computational costs in redundant force calculations, along with a corresponding increase in load imbalance. Overall execution times for the modified code are an average of ��% slower than the original version.

The computation times in Table �� reveal that the relative penalty of redundant force calculations increases with larger numbers of processors. The modified code's force computations run about ��% slower than the original SPH3D code on � processors and about �% slower on � processors. Larger numbers of processors
[Figure ���: stacked bar charts ("Exploiting Force Law Symmetry") of wallclock seconds per timestep versus processors (P = 8, 16, 32, 64) on the Intel Paragon, with paired "Full Symmetry" and "Partial Symmetry" bars divided into communication, load imbalance, and computation, for (a) 24k and (b) 48k particles.]

Figure ���: These graphs compare the performance of the SPH3D code (left bars, labeled "Full Symmetry") and a version that does not exploit symmetry for interactions involving particles lying in the ghost cell regions (right bars, labeled "Partial Symmetry"). Execution times (taken from Table ��) are reported for one timestep of (a) 24k-particle and (b) 48k-particle simulations on the Intel Paragon.
divide the computational space into smaller partitions, and the increased surface-to-volume ratio of small partitions means that a more significant percentage of interactions involve ghost cell particles. Thus, a greater fraction of redundant computations are executed for larger numbers of processors.
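The full-symmetry strategy amounts to visiting each unordered pair once and accumulating equal-and-opposite contributions into both particles. A one-dimensional C++ sketch (the name `pairForce` is ours, standing in for the pressure/viscosity kernel-gradient term):

```cpp
#include <cstddef>
#include <vector>

// Symmetric force accumulation: each pair (i, j) with j > i is evaluated
// once, and the result is added to particle i and subtracted from particle
// j, halving the number of force evaluations relative to a full double loop.
void symmetricForces(std::vector<double>& a, const std::vector<double>& x,
                     double (*pairForce)(double, double)) {
    for (auto& ai : a) ai = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = i + 1; j < x.size(); ++j) {
            const double f = pairForce(x[i], x[j]);
            a[i] += f;   // force on i due to j
            a[j] -= f;   // equal and opposite force on j due to i
        }
}
```

In the parallel setting, when j is a ghost copy owned by another processor, the accumulated `a[j]` is what must be shipped back during the write back phase.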
����� Communication Optimizations

Each particle in our smoothed particle hydrodynamics application requires approximately ��� bytes of memory to describe position, velocity, acceleration, and various other physical parameters. However, not all of this information is needed by each phase of the calculation. For example, the local interactions to calculate density (see Eq. ��) require only mass and position data for each particle. After computing the density, only the density values, not mass and position, which have
[Figure ���: (a) stacked bar charts ("SPH3D Communication Optimizations") of wallclock seconds per timestep versus processors (P = 8, 16, 32, 64) on the Intel Paragon (24k particles), with paired "Optimized" and "Naive" bars divided into pack/unpack data, communication, load imbalance, and computation; (b) bar charts ("SPH3D Communication Costs") of kilobytes communicated per timestep, divided into rebalance workloads, repatriate, write back, and fetch neighbors.]

Figure ���: These graphs compare the performance of the SPH3D code (left bars) to a "naive" implementation (right bars) which does not attempt to minimize interprocessor communication. All numbers represent averages for one timestep (averaged over ���) on the Intel Paragon for 24k particles. (a) The optimized SPH3D runs between ��% and ��% faster than the naive version. (b) Average interprocessor communication (in kilobytes) for each processor during the timestep. These numbers are also reported in Table ��.
not changed, need to be communicated between processors and updated.

Therefore, the SPH3D application communicates only the particle information required during each phase of the computation. Figure ���a compares the execution time of one timestep of SPH3D to a "naive" implementation without this communication optimization. The current (optimized) SPH3D code runs between ��% and ��% faster than the naive implementation.

Figure ���b and Table �� illustrate the additional interprocessor communication costs incurred by the naive version for the four primary communication routines of the SPH3D application: fetch particles, write back forces, repatriate particles, and rebalance workloads. Repatriating particles and rebalancing workloads already require that all particle information be transferred between processors; thus,
Task                     P = 8         P = 16        P = 32        P = 64
                         Opt   Naive   Opt   Naive   Opt   Naive   Opt   Naive
Fetch Particles ���� ��� �� ���� ���� ��� ��� ����
Write Back Forces ���� ��� ���� ���� ���� ��� ��� ����
Repatriate Particles ����� ����� ����� ����� ����� ����� ����� �����
Rebalance Workloads ���� ���� ���� ���� ���� ���� ��� ���
Total ���� �� ���� �� ��� �� � ����
Table ��: Average interprocessor communication costs (in kilobytes) per Paragon processor for one iteration of the SPH3D code ("Opt") and a "Naive" implementation that does not attempt to minimize interprocessor communication. Note that repatriating particles and balancing workloads require the communication of all particle information. Overall, the naive implementation sends between four and five times more data. These numbers are also graphed in Figure ���b.
the quantity of interprocessor communication in these routines does not change. Communicating only the required data significantly reduces message traffic when fetching interacting particles and writing back calculated forces. Overall, communications traffic is reduced by a factor of four to five.

While it may seem obvious that an application should transfer only the particle information needed by each phase of the calculation, implementing this optimization significantly affects the design of a software support library. The library must allow the programmer to specify what data is to be sent during each phase of the calculation, and it must support selective packing and unpacking of data. These types of design considerations are also vital for performance on distributed shared memory machines with coherent caches [��].
��� Analysis and Discussion

The great tragedy of science is the slaying of a beautiful hypothesis by an ugly fact.

— Thomas Henry Huxley

Particle applications are difficult to parallelize because they require dynamic, irregular partitionings of space to maintain an equal distribution of computational work. Based upon the parallelization mechanisms of LPARX, we have developed run-time support facilities which greatly simplify the task of implementing efficient, portable, parallel particle codes. The use of the LPARX abstractions allowed us to provide functionality and explore performance optimizations which would have been very difficult using only a message passing library. Applications written using our API library are portable to a number of high-performance architectures, including the Intel Paragon, IBM SP2, and networks of workstations, with good performance.

Based on our detailed performance analysis in Section ���, we make the following observations and recommendations for parallel implementations of particle calculations:

- Applications with symmetric force laws should not ignore symmetry for interactions involving particles lying in ghost cell regions. Although ignoring such symmetry reduces communication overheads, any savings is more than offset by increased computational costs in redundant particle interactions. Furthermore, the performance penalty increases with the number of processors. For our SPH3D application, a code which fully exploits symmetry runs an average of ��% faster than one that does not.

- When transmitting particle information between processors, applications must be careful to communicate only the information needed by a particular computational phase. Our experiments indicate that this simple optimization can reduce execution times by ��% to ��% and the amount of interprocessor communication by a factor of four to five.
- Load imbalance can become the dominant cost for computations with localized densities, as was the case with our SPH3D sample data set. The recursive bisection decomposition method may be inadequate for such workload distributions. We discuss an alternative decomposition strategy in Section ���.

Our software support infrastructure provided the high-level abstractions that enabled us to easily explore these various design decisions.
����� Parallelization Requirements

We found the following features of LPARX essential in the development of our particle library:

- Our use of recursive bisection to balance dynamic, non-uniform workloads relies on LPARX's concept of structural abstraction and its support for dynamic, user-defined, irregular block decompositions. Through structural abstraction, we are able to define data decompositions appropriate for our particular application.

- LPARX's region calculus (e.g., grow) and its copy-on-intersect operation greatly simplify the expression of interprocessor communication in the API routines BalanceWorkloads, FetchParticles, WriteBack, and RepatriateParticles.

- LPARX supports Grids of complicated types, such as ParticleList. Without such support, we could not have implemented the chaining mesh structure needed to organize the particles.

Overall, LPARX enabled us to reason about the structure of the particle computation at a high level and simplified the implementation. An equivalent library written using only message passing would have been considerably more complicated and would have required many times more code.
[Figure ���: side-by-side illustrations of (a) a structured partitioning and (b) an unstructured partitioning of a two-dimensional domain.]

Figure ���: These pictures compare (a) structured partitions with (b) unstructured partitions. Unstructured partitions are better at balancing workloads because they allow partition boundaries to meander through the domain. However, the reduction in load imbalance is offset somewhat by the expense of additional overheads in more complicated communications analysis. It is an open research question whether structured or unstructured partitions are better for particle methods.
����� Unstructured Partitionings

The performance analysis of the SPH3D application in Section ��� reveals that a sizeable portion of the available computational resources is lost due to load imbalance. For this problem, the recursive bisection (RCB) partitioner [��] was unable to efficiently balance workloads. One drawback of RCB is that all partition cuts are straight lines (see Figure ���a); thus, RCB does not have the freedom to insert a "kink" in the cut to improve load balance. This drawback is also an advantage, however, because RCB renders structured, boxy partitions which are easily and efficiently supported by LPARX.

An alternative partitioning strategy, such as Inverse Space-filling Partitioning (ISP) [��], is better at balancing workloads because it allows the cuts to meander through the space. The resulting partitioning is unstructured (see Figure ���b).
Pilkington and Baden [��] show that such decompositions can significantly reduce load imbalance. However, unstructured partitions employ different types of programming abstractions, such as those provided by the CHAOS run-time system [��]. Unstructured implementations require more complicated communications analysis when fetching off-processor data and therefore may be more expensive in total execution time. It is an open research question which method is better for particle calculations.
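For reference, the RCB scheme discussed above can be sketched in a few lines of C++. This is a hypothetical 2-d, equal-count version; the dissertation's partitioner weights particles by computational work rather than simple count:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Recursive coordinate bisection sketch: split the particle set at the
// median along the axis of largest spread, recursing until each of `parts`
// partitions owns a contiguous index range of particles.
struct Pt { double x, y; };

void rcb(std::vector<Pt>& p, std::size_t lo, std::size_t hi,
         int parts, std::vector<int>& owner, int first) {
    if (parts <= 1 || hi - lo <= 1) {       // base case: assign this range
        for (std::size_t i = lo; i < hi; ++i) owner[i] = first;
        return;
    }
    double xmin = p[lo].x, xmax = p[lo].x, ymin = p[lo].y, ymax = p[lo].y;
    for (std::size_t i = lo; i < hi; ++i) {
        xmin = std::min(xmin, p[i].x); xmax = std::max(xmax, p[i].x);
        ymin = std::min(ymin, p[i].y); ymax = std::max(ymax, p[i].y);
    }
    const bool cutX = (xmax - xmin) >= (ymax - ymin);  // straight-line cut
    const std::size_t mid = lo + (hi - lo) / 2;        // median (equal counts)
    std::nth_element(p.begin() + lo, p.begin() + mid, p.begin() + hi,
                     [cutX](const Pt& a, const Pt& b) {
                         return cutX ? a.x < b.x : a.y < b.y;
                     });
    rcb(p, lo, mid, parts / 2, owner, first);
    rcb(p, mid, hi, parts - parts / 2, owner, first + parts / 2);
}
```

Because every cut is a straight line perpendicular to a coordinate axis, the resulting subdomains are boxes, which is exactly the structured property that LPARX supports efficiently, and exactly the constraint that ISP relaxes.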
����� Future Research Directions

Thus far we have considered only the software facilities for local particle-particle interactions. Recall from Table �� that most particle methods also employ either fast PDE solvers or hierarchical data representations. In Chapter � (Adaptive Mesh Applications), we described the techniques and software support required to support fast PDE solvers and structured adaptive hierarchical representations. We plan to combine the techniques of these two chapters to implement a hierarchical particle method such as Almgren's Adaptive Method of Local Corrections (AMLC) [��]. AMLC coupled with the power of parallel architectures would enable computational scientists to study vortex dynamics problems with considerably larger numbers of particles.

Traditionally, multipole [��] and Barnes-Hut [��] methods have been implemented using unstructured tree codes [��]. An alternative implementation strategy would employ a hierarchy of irregular but structured refinements [��] using our software infrastructure. To date, no one has directly compared these two implementation strategies using real codes. Such a comparison would provide valuable insight into the relative strengths and weaknesses of each representation.
Chapter

Conclusions

My life has been a fascinating series of amazing exploits about which I have many profound insights. But frankly, none of it is any of your darn business...

— Calvin, "Calvin and Hobbes"
�� Research Contributions

We have developed a set of programming abstractions and the accompanying software support for dynamic, irregular, block-structured scientific computations running on high-performance parallel computers. Such applications are difficult to implement without the appropriate software support. Our parallel software infrastructure simplifies code development because it provides computational scientists with high-level, domain-specific (yet flexible) tools that hide low-level implementation details. Our software is portable across a wide range of MIMD parallel platforms and is currently running on the Cray C-90 (single processor), IBM SP2, Intel Paragon, and networks of workstations via PVM [��].

We have designed and implemented application programmer interfaces (APIs) for two important classes of scientific applications: structured adaptive mesh applications and particle calculations. These APIs provide computational tools that match the scientist's view of the application. We have applied our structured adaptive mesh API to the solution of model eigenvalue problems in materials design. The particle API has been used in the development of a 3-d smoothed particle hydrodynamics application in astrophysics. Our parallel software infrastructure has enabled computational scientists to explore new approaches to solving real problems (see Section ���).

Our APIs are layered on top of the LPARX parallel programming system. LPARX introduces the concept of "structural abstraction", which enables applications to dynamically manipulate irregular data decompositions as language-level objects. Instead of requiring the programmer to choose from a small set of predefined decompositions, LPARX provides a framework for creating decompositions that may be tailored to meet the needs of a particular application.

Extending High Performance Fortran [��] will require developments in parallel programming abstractions and run-time support libraries. A second High Performance Fortran [��] standardization effort is currently addressing the limitations of HPF for dynamic and irregular applications. We believe that the abstractions and run-time support provided by our software infrastructure may provide some of the answers.
�� Outstanding Research Issues

Work is the greatest thing in the world, so we should always leave some of it for tomorrow.

— Don Herold

We have already described specific future research directions at the end of each of Chapters � through �. Here we discuss two challenging and broad research areas for the computational science community: implementation strategies for application programmer interfaces (Section ���) and language interoperability (Section ���).
Implementation Strategies for APIs

Because of the growing complexity of scientific applications, we believe that it will be increasingly important to provide computational scientists with application programmer interfaces (APIs) that provide high-level, domain-specific tools. We have taken this approach with our particle and structured adaptive mesh libraries. It is our belief that scientists should only be required to concentrate on the mathematics and physics of their application; that is their area of expertise. It is the responsibility of the computer scientist to respond to the needs of the scientific community and provide the appropriate software tools. This is not to say that computer scientists are to become "technicians" or "programmers" at the beck and call of the scientists; indeed, the development of such APIs involves a number of interesting and challenging research issues, as we have shown in our work.
There are two general strategies for implementing a suite of domain-specific toolkits: (1) languages or (2) libraries. In the first approach, a new language, with the appropriate syntax, control structures, and data types, is developed for each new application domain. In the second, an application library is created on top of an existing language (as we have done in C++).
The primary advantage of the language-based strategy is that each new language can be tailored to the specific problem domain. Languages would be supported by compilers that could apply domain-specific transformations to improve the quality of compiled code. For example, current optimizing compilers commonly reorder numerical operations to improve performance by eliminating common numerical sub-expressions or by scheduling instructions to avoid pipeline bubbles. Similar optimizations could be applied to a "matrix language" to block matrix operations, use efficient BLAS operations, or chain operations on vector architectures […]. Unfortunately, scientists would need to learn a new language syntax, and computer scientists would need to develop a new compilation system, for each new problem domain. In addition, scientists could not easily combine codes from different application domains (e.g. a structured adaptive mesh solver with a particle code) since the syntax, data types, and compilers would be different.
The other approach is to build an API as a library on top of an existing programming language. Although scientists would still be required to learn the specifics of a particular API library, they would not be burdened with mastering an entirely new language. C++ provides powerful and efficient facilities for data abstraction and has been adopted by many as the language of choice for constructing API libraries. However, the C++ compiler cannot apply domain-specific knowledge to optimize code. For example, it is very difficult to implement an efficient matrix library in C++ […] because the compiler does not understand the special properties of a "matrix" object. The Sage++ […] compilation system for C++ helps the compiler to generate efficient code by defining a suite of high-level compiler transformations that enable API writers to incorporate domain-specific knowledge.
One particularly troublesome limitation of C++ as a parallel programming language is that it provides no mechanisms for control abstraction (i.e. user-defined control structures). Thus, C++ makes it difficult to express parallel execution constructs such as parallel loops. Two partial solutions are to (1) introduce "parallel control constructs" using C++ macros and (2) hide parallelism within a data object. LPARX takes the first approach to implement its forall loop; P++ […] takes the second: each P++ parallel array is invisibly divided across processors, and the P++ programmer is unaware of the parallel execution of array operations within the library. The first strategy is not ideal, since it essentially creates a macro "sub-language" within C++. The second approach does not apply to all applications, such as those addressed by LPARX, since it is not always possible to completely encapsulate the parallelism within a single object. Clearly, implementation techniques for APIs are a fertile area for future research.
Language Interoperability

Currently, parallel software written in one programming language or run-time library is likely to be incompatible with software written in another system. Language (and library) interoperability is driven by two key factors: (1) code reuse and (2) heterogeneous programming models […]. Code reuse is difficult today since common subroutines cannot in general be shared by different parallel systems. Heterogeneous programming models enable the programmer to use the programming language or run-time library best suited for the task at hand. Some applications are more naturally expressed in one paradigm than another; for example, task parallelism applies to pipeline and producer-consumer applications but is usually inappropriate for array-based computations, which are often better handled by data parallelism.
Language interoperability raises research issues in three key areas: (1) run-time systems, (2) data representation, and (3) language extensions for external procedures. Common implementation support is needed to merge parallel languages and libraries with different run-time behaviors. For example, a data parallel language such as HPF […] is typically implemented using only a single thread of control per processor, whereas a task parallel language such as CC++ […] or Fortran M […] might require several interacting execution threads per processor. Thus, combining these two models will require common run-time support for task management and communication.
Several consortia have been formed to investigate unified support mechanisms for task parallel and data parallel systems. For example, the PORTS (POrtable Run-Time System) consortium has developed a set of portable task-based facilities for creating and scheduling tasks and for managing fine-grain inter-thread communication, and the PCRC (Parallel Compiler Runtime Consortium) is investigating common high-level run-time support techniques.
The second interoperability issue addresses common data representation formats. Parallel languages and libraries define a rich set of data distributions across processors: uniform block, irregular block, cyclic, pointwise, and so on. To share data, the run-time support must ensure that each system understands the data representations used by the others (e.g. see Section …). Thus, a unified data descriptor format is needed. Such a data definition interface is currently under investigation by the PCRC.

Information on PORTS is available at http://www.cs.uoregon.edu/paracomp/ports/. Information on the PCRC is available at http://aldebaran.npac.syr.edu/index.html.
Finally, language designers must consider the types of language extensions that will be required to call externally defined routines. For example, how should the programmer in a task parallel language specify a call to a data parallel routine? How should data be transferred across the call interface? HPF […] defines a rudimentary external procedure interface, and others have investigated calling HPF from pC++ […] and also from Fortran M […]. However, generally applicable mechanisms are currently unknown.
The Scientific Computing Community

Build it, and they will come...
-- "Field of Dreams"

The goal of our research has been the development of software tools to enable computational scientists to explore new approaches to solving applied problems on high-performance parallel computers. It is therefore fitting that we conclude this dissertation with a list of the projects that have benefitted from our software infrastructure:
• W. Hart has implemented a geometrically structured genetic algorithms code to study locally adaptive search techniques on parallel computers […].

• In collaboration with J. Wallin (George Mason University), we have parallelized a 3d smoothed particle hydrodynamics code for modeling the evolution of galaxy clusters (see Chapter …).

• G. Cook (Cornell Theory Center) has used LPARX as the base for an application programmer interface for adaptive multigrid methods in numerical relativity as part of the Black Hole Binary Grand Challenge Project.
• Scientists at Lawrence Livermore National Laboratories have employed our Distributed Parallel Object, Asynchronous Message Stream, and MP++ software to parallelize a structured adaptive mesh library for hyperbolic problems in gas dynamics […].

• C. Myers (Cornell Theory Center), B. Shaw (Lamont-Doherty Earth Observatory), and J. Langer (University of California at Santa Barbara) have implemented a parallel code to study localized slip modes in the dynamics of earthquake faults.

• C. Myers and J. Sethna (Cornell University) have developed a parallel time-dependent Ginzburg-Landau model of shape transformations to study shape-memory effects in martensitic alloys. Their code extends the LPARX Grid to support deformable cartesian meshes.

• C. Myers has also written a Cornell Theory Center "Smart Node" newsletter describing LPARX, and will discuss some of his experiences with it at the … meeting of the APS Physics Computing Conference. Myers' article is available at http://www.tc.cornell.edu/SmartNodes/Newsletters/…/VN.Myers; the abstract of his talk, "Some ABCs of OOP for PDEs on MPPs," is available at http://aps.org/BAPSPC…/abs/SJ….html.

• In collaboration with materials scientists and mathematicians, we have developed adaptive numerical techniques and the parallel software support for the solution of eigenvalue problems arising in materials design (see Chapter …) […].

• LPARX has been used to implement a dimension-independent code for connected component labeling for spin models in statistical mechanics […].

• Building on LPARX, S. Fink and S. Baden have developed run-time HPF-like data distribution techniques for block structured applications […].
• S. Figueira and S. Baden have employed our software infrastructure to analyze the performance tradeoffs of various parallelization strategies for localized N-body solvers […].

• G. Duncan (Bowling Green State University) is planning to use our structured adaptive mesh infrastructure to parallelize an adaptive hyperbolic solver for simulations of relativistic extragalactic jets […].

• In collaboration with F. Abraham (IBM Almaden), we are using our particle library to develop a molecular dynamics application to study fracture dynamics in solids […].

In addition, our software has been used to teach undergraduate and graduate courses in computational science at the University of California at San Diego.
I'm not going to school anymore. I've decided to be a "hunter-gatherer" when I grow up. I'll be living naked in a tropical forest, subsisting on berries, grubs, and the occasional frog, and spending my free time grooming for lice.
-- Calvin, "Calvin and Hobbes"
Appendix A

Machine Characteristics

The Fast drives out the Slow even if the Fast is wrong.
-- W. Kahan

In this Appendix, we describe the four supercomputers used to gather performance data in this dissertation: the Cray C90, IBM SP2, Intel Paragon, and a network of eight DEC Alpha workstations located at the San Diego Supercomputer Center. The Alphas are connected via a GIGAswitch and communicate through PVM […]. Even though the C90 contains more than one processor, it is rarely used as a true parallel machine in production mode; instead, the processors run several independent jobs at the same time. Thus, we have only reported performance results for a single processor.
For the three message passing architectures (Alpha cluster, IBM SP2, and Intel Paragon), we characterize interprocessor communication overheads using the simple linear cost model commonly used in the literature. Message passing performance is reported using two numbers, T0 and BW. T0, often incorrectly called the message latency, represents the time to send a zero-byte message; in fact, it incorporates both message latency and unavoidable software overheads […]. BW is the average peak communications bandwidth for large message sizes (several hundred kilobytes). Thus, the time to send a message of length L can be approximated by T0 + L/BW. We measured message passing times with a simple program that sends messages of
Machine     C++ Compiler         Fortran Compiler       Operating System
Alphas      g++ (-O)             f77 (-O)               OSF/1
Cray C90    CC (-O)              cft77 (-O)             UNICOS
Paragon     g++ (-O -mnoieee)    if77 (-O -Knoieee)     OSF/1
IBM SP2     xlC (-O -Q)          xlf (-O)               AIX

Table A.1: Compilers and optimization flags for all computations in this dissertation. On the Alpha workstation cluster, we used PVM for interprocessor communication. All benchmarks used the same LPARX software release.
                    Alphas    Cray C90    IBM SP2    Intel Paragon
Typical Mflops        …          …           …            …
Memory (Mbytes)       …          …           …            …
T0 (usec)             …          …           …            …
BW (Mbytes/sec)       …          …           …            …

Table A.2: A summary of machine characteristics for the Alpha cluster, Cray C90, IBM SP2, and Intel Paragon. All numbers reflect one processor of the machine. The memory limit on the Cray represents the memory available to tasks in the largest memory queue at the San Diego Supercomputer Center. Note that these figures are intended to provide only a very rough estimate of expected application performance. All Mflops (million floating point operations per second) measurements reflect 64-bit floating point rates.
varying sizes around in a ring.
Table A.1 summarizes software version numbers and compiler flags, and Table A.2 summarizes machine characteristics. Interprocessor communication times are presented in Figure A.1 (DEC Alphas), Figure A.2 (IBM SP2), and Figure A.3 (Intel Paragon).
[Figure A.1: log-log plots of (a) message bandwidth (Mbytes/sec) and (b) message passing time (milliseconds) versus message length (64 bytes to 256 Kbytes) for the Alpha cluster.]

Figure A.1: Alpha workstation cluster message passing performance for (a) message bandwidth and (b) message sending times as a function of the message size. Note that the vertical scale for (b) is in milliseconds, not microseconds as in the other graphs.
[Figure A.2: log-log plots of (a) message bandwidth (Mbytes/sec) and (b) message passing time (microseconds) versus message length (64 bytes to 256 Kbytes) for the IBM SP2.]

Figure A.2: IBM SP2 message passing performance for (a) message bandwidth and (b) message sending times as a function of the message size.
[Figure A.3: log-log plots of (a) message bandwidth (Mbytes/sec) and (b) message passing time (microseconds) versus message length (64 bytes to 256 Kbytes) for the Intel Paragon.]

Figure A.3: Intel Paragon message passing performance for (a) message bandwidth and (b) message sending times as a function of the message size.
Bibliography

One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It's always best to come second when you can look at the mistakes the pioneers made.
-- Seymour Cray

[1] F. F. Abraham, D. Brodbeck, R. A. Rafey, and W. E. Rudge, Instability dynamics of fracture: A computer simulation investigation, Physical Review Letters.

[2] G. Agha, Actors: A Model of Concurrent Computation in Distributed Systems, MIT Press, 1986.

[3] G. Agrawal, A. Sussman, and J. Saltz, An integrated runtime and compile-time approach for parallelizing structured and block structured applications, IEEE Transactions on Parallel and Distributed Systems (to appear).

[4] A. Almgren, T. Buttke, and P. Colella, A fast vortex method in three dimensions, in Proceedings of the AIAA Computational Fluid Dynamics Conference, Honolulu, Hawaii.

[5] A. S. Almgren, A Fast Adaptive Vortex Method Using Local Corrections, PhD thesis, University of California at Berkeley.

[6] B. Alpern, L. Carter, E. Feig, and T. Selker, The uniform memory hierarchy model of computation, Algorithmica.

[7] A. L. Ananda, B. H. Tay, and E. K. Koh, Astra: An asynchronous remote procedure call facility, in Proceedings of the International Conference on Distributed Computing Systems.

[8] C. R. Anderson, A method of local corrections for computing the velocity field due to a distribution of vortex blobs, Journal of Computational Physics.
[9] C. R. Anderson, An implementation of the fast multipole method without multipoles, SIAM Journal on Scientific and Statistical Computing.

[10] I. Ashok and J. Zahorjan, Adhara: Runtime support for dynamic space-based applications on distributed memory MIMD multiprocessors, in Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.

[11] W. Athas and N. Boden, Cantor: An actor programming system for scientific computing, in Proceedings of the ACM SIGPLAN Workshop on Object-Based Concurrent Programming.

[12] S. B. Baden, Programming abstractions for dynamically partitioning and coordinating localized scientific calculations running on multiprocessors, SIAM Journal on Scientific and Statistical Computing.

[13] S. B. Baden, S. J. Fink, and S. R. Kohn, Structural abstraction: A unifying parallel programming model for data motion and partitioning in irregular scientific computations (in preparation).

[14] S. B. Baden and S. R. Kohn, A comparison of load balancing strategies for particle methods running on MIMD multiprocessors, in Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, March 1991.

[15] ---, Portable parallel programming of numerical problems under the LPAR system, Journal of Parallel and Distributed Computing.

[16] H. E. Bal and A. S. Tanenbaum, Distributed programming with shared data, in Proceedings of the International Conference on Computer Languages, October 1988.

[17] J. Barnes and P. Hut, A hierarchical O(N log N) force-calculation algorithm, Nature, 324 (1986), p. 446.

[18] D. R. Bates, K. Ledsham, and A. L. Stewart, Wave functions of the hydrogen molecular ion, Phil. Trans. Roy. Soc. London.

[19] J. Bell, M. Berger, J. Saltzman, and M. Welcome, Three-dimensional adaptive mesh refinement for hyperbolic conservation laws, SIAM Journal on Scientific and Statistical Computing.

[20] M. J. Berger, Adaptive Mesh Refinement for Hyperbolic Partial Differential Equations, PhD thesis, Stanford University, 1982.

[21] M. J. Berger and S. H. Bokhari, A partitioning strategy for nonuniform problems on multiprocessors, IEEE Transactions on Computers, C-36 (1987).
[22] M. J. Berger and P. Colella, Local adaptive mesh refinement for shock hydrodynamics, Journal of Computational Physics, 82 (1989).

[23] M. J. Berger and J. Oliger, Adaptive mesh refinement for hyperbolic partial differential equations, Journal of Computational Physics, 53 (1984).

[24] M. J. Berger and I. Rigoutsos, An algorithm for point clustering and grid generation, IEEE Transactions on Systems, Man and Cybernetics.

[25] M. J. Berger and J. Saltzman, AMR on the CM-2, Tech. Rep., RIACS, Moffett Field, CA.

[26] ---, Structured adaptive mesh refinement on the Connection Machine, in Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, March 1993.

[27] J. Bernholc, J.-Y. Yi, and D. J. Sullivan, Structural transitions in metal clusters, Faraday Discussions.

[28] G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha, Implementation of a portable nested data parallel language, in Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993.

[29] F. Bodin, P. Beckman, D. Gannon, J. Gotwals, S. Narayana, S. Srinivas, and B. Winnicka, Sage++: An object-oriented toolkit and class library for building Fortran and C++ restructuring tools, in Object Oriented Numerics Conference (OONSKI).

[30] F. Bodin, P. Beckman, D. Gannon, S. Narayana, and S. X. Yang, Distributed pC++: Basic ideas for an object parallel language, Journal of Scientific Programming.

[31] F. Bodin, P. Beckman, D. Gannon, S. Yang, S. Kesavan, A. Malony, and B. Mohr, Implementing a parallel C++ runtime system for scalable parallel systems, in Proceedings of Supercomputing '93, November 1993.

[32] J. Bolstad, PhD thesis, Stanford University.

[33] A. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka, Fortran 90D/HPF compiler for distributed memory MIMD computers: Design, implementation, and performance results, in Proceedings of Supercomputing '93, November 1993.

[34] A. Brandt, Multi-level adaptive solutions to boundary-value problems, Mathematics of Computation, 31 (1977).
[35] W. L. Briggs, A Multigrid Tutorial, SIAM.

[36] K. G. Budge, J. S. Perry, and A. C. Robinson, High performance scientific computing using C++, in USENIX C++ Conference Proceedings.

[37] E. J. Bylaska, S. R. Kohn, S. B. Baden, A. Edelman, R. Kawai, M. E. G. Ong, and J. H. Weare, Scalable parallel numerical methods and software tools for material design, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

[38] Z. Cai, J. Mandel, and S. McCormick, Multigrid methods for nearly singular linear equations and eigenvalue problems (submitted for publication).

[39] N. Carriero and D. Gelernter, Linda in context, Communications of the ACM, 32 (1989).

[40] S. Chakrabarti, E. Deprit, E.-J. Im, J. Jones, A. Krishnamurthy, C.-P. Wen, and K. Yelick, Multipol: A distributed data structure library, in Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, July 1995.

[41] K. M. Chandy and C. Kesselman, Compositional C++: Compositional parallel programming, in Fifth International Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August 1992.

[42] C. Chang, A. Sussman, and J. Saltz, Support for distributed dynamic data structures in C++, Tech. Rep. CS-TR-…, University of Maryland.

[43] B. Chapman, P. Mehrotra, H. Moritsch, and H. Zima, Dynamic data distribution in Vienna Fortran, in Proceedings of Supercomputing '93, November 1993.

[44] B. Chapman, P. Mehrotra, and H. Zima, Extending HPF for advanced data parallel applications, Tech. Rep., ICASE, May 1994.

[45] C. Chase, K. Crowley, J. Saltz, and A. Reeves, Parallelization of irregularly coupled regular meshes, Tech. Rep., ICASE, NASA Langley Research Center.

[46] J. S. Chase, F. G. Amador, E. D. Lazowska, H. M. Levy, and R. J. Littlefield, The Amber system: Parallel programming on a network of multiprocessors, in Proceedings of the 12th ACM Symposium on Operating Systems Principles, December 1989.

[47] A. A. Chien, Concurrent Aggregates: Supporting Modularity in Massively Parallel Programs, MIT Press, 1993.
[48] K. Cho, T. A. Arias, J. D. Joannopoulos, and P. K. Lam, Wavelets in electronic structure calculations, Physical Review Letters, 71 (1993).

[49] T. W. Clark, R. v. Hanxleden, J. A. McCammon, and L. R. Scott, Parallelization strategies for a molecular dynamics program, in Intel Technology Focus Conference Proceedings.

[50] ---, Parallelizing molecular dynamics using spatial decomposition, in Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.

[51] P. Colella and P. Woodward, The piecewise parabolic method (PPM) for gas-dynamical simulations, Journal of Computational Physics, 54 (1984).

[52] L. Collatz, The Numerical Treatment of Differential Equations, Springer-Verlag.

[53] C. R. Cook, C. M. Pancake, and R. Walpole, Are expectations for parallelism too high? A survey of potential parallel users, in Proceedings of Supercomputing '94, November 1994.

[54] W. Y. Crutchfield, Load balancing irregular algorithms, Tech. Rep. UCRL-JC-…, Lawrence Livermore National Laboratory, July 1991.

[55] W. Y. Crutchfield and M. L. Welcome, Object oriented implementation of adaptive mesh refinement algorithms, Journal of Scientific Programming.

[56] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, LogP: Towards a realistic model of parallel computation, in Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1993.

[57] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick, Parallel programming in Split-C, in Proceedings of Supercomputing '93, November 1993.

[58] R. Das, D. J. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy, The design and implementation of a parallel unstructured Euler solver using software primitives, Tech. Rep., ICASE, Hampton, VA.

[59] R. Das and J. Saltz, Parallelizing molecular dynamics codes using PARTI software primitives, in Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, March 1993.
[60] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang, Communication optimizations for irregular scientific computations on distributed memory architectures, Journal of Parallel and Distributed Computing (to appear).

[61] S. Deshpande, P. Delisle, and A. G. Daghi, A communication facility for distributed object-oriented applications, in USENIX C++ Conference Proceedings.

[62] K. D. Devine and J. E. Flaherty, Dynamic load balancing for parallel finite element methods with h- and p-refinement, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

[63] G. C. Duncan and P. A. Hughes, Simulations of relativistic extragalactic jets, The Astrophysical Journal.

[64] D. J. Edelsohn, Hierarchical tree-structures as adaptive meshes, International Journal of Modern Physics C (Physics and Computers).

[65] B. Falsafi, A. R. Lebeck, S. K. Reinhardt, I. Schoinas, M. D. Hill, J. R. Larus, A. Rogers, and D. A. Wood, Application-specific protocols for user-level shared memory, in Proceedings of Supercomputing '94, November 1994.

[66] M. J. Feeley and H. M. Levy, Distributed shared memory with versioned objects, in Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), October 1992.

[67] J. T. Feo and D. C. Cann, A report on the SISAL language project, Journal of Parallel and Distributed Computing.

[68] S. M. Figueira and S. B. Baden, Performance analysis of parallel strategies for localized n-body solvers, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

[69] S. J. Fink and S. B. Baden, Run-time data distribution for block-structured applications on distributed memory computers, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

[70] S. J. Fink, S. B. Baden, and S. R. Kohn, Flexible communication schedules for block structured applications (in preparation).

[71] S. J. Fink, C. Huston, S. B. Baden, and K. Jansen, Parallel cluster identification for multidimensional lattices (submitted to IEEE Transactions on Parallel and Distributed Systems).
[72] M. J. Flynn, Some computer organizations and their effectiveness, IEEE Transactions on Computers, C-21 (1972).

[73] K. Forsman, W. Gropp, L. Kettunen, and D. Levine, Computational electromagnetics and parallel dense matrix computations, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

[74] I. Foster and K. M. Chandy, Fortran M: A language for modular parallel programming, Journal of Parallel and Distributed Computing (to appear).

[75] I. Foster and C. Kesselman, Integrating task and data parallelism, in Proceedings of Supercomputing.

[76] I. Foster, M. Xu, B. Avalani, and A. Choudhary, A compilation system that integrates High Performance Fortran and Fortran M, in Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.

[77] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu, Fortran D language specification, Tech. Rep., Department of Computer Science, Rice University, Houston, TX, December 1990.

[78] G. H. Golub and C. F. Van Loan, eds., Matrix Computations (Second Edition), The Johns Hopkins University Press, Baltimore, 1989.

[79] L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, Journal of Computational Physics, 73 (1987).

[80] W. Gropp and B. Smith, Scalable, extensible, and portable numerical libraries, in Proceedings of the Scalable Parallel Libraries Conference, 1993.

[81] W. E. Hart, Adaptive Global Optimization with Local Search, PhD thesis, University of California at San Diego, 1994.

[82] C. Hewitt, P. Bishop, and R. Steiger, A universal ACTOR formalism for artificial intelligence, in Proceedings of the International Joint Conference on Artificial Intelligence, 1973.

[83] High Performance Fortran Forum, High Performance Fortran Language Specification.

[84] ---, HPF-2 Scope of Activities and Motivating Applications.

[85] P. N. Hilfinger and P. Colella, FIDIL: A language for scientific programming, Tech. Rep. UCRL-…, Lawrence Livermore National Laboratory.
[86] S. Hiranandani, K. Kennedy, and C.-W. Tseng, Preliminary experiences with the Fortran D compiler, in Proceedings of Supercomputing '93, November 1993.

[87] R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles, McGraw-Hill, 1981.

[88] Y.-S. Hwang, R. Das, J. Saltz, B. Brooks, and M. Hodoscek, Parallelizing molecular dynamics programs for distributed memory machines: An application of the CHAOS runtime support library, Tech. Rep. CS-TR-…, University of Maryland, College Park, MD.

[89] E. Jul, H. Levy, N. Hutchinson, and A. Black, Fine-grained mobility in the Emerald system, ACM Transactions on Computer Systems, 6 (1988).

[90] L. Kale and S. Krishnan, CHARM++: A portable concurrent object oriented system based on C++, in Proceedings of OOPSLA, September 1993.

[91] V. Karamcheti and A. Chien, Concert: Efficient runtime support for concurrent object-oriented programming languages on stock hardware, in Proceedings of Supercomputing '93, November 1993.

[92] S. R. Kohn and S. B. Baden, An implementation of the LPAR parallel programming model for scientific computations, in Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, Norfolk, VA, March 1993.

[93] ---, A robust parallel programming model for dynamic non-uniform scientific computations, in Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.

[94] ---, The parallelization of an adaptive multigrid eigenvalue solver with LPARX, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

[95] ---, Irregular coarse-grain data parallelism under LPARX, Journal of Scientific Programming (to appear).

[96] W. Kohn and L. Sham, Physical Review, 140 (1965), p. A1133.

[97] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy, The Stanford FLASH multiprocessor, in Proceedings of the 21st International Symposium on Computer Architecture, April 1994.
���� J� R� Larus� C�� A large�grain object oriented data parallel programming lan�guage� in Fifth International Workshop of Languages and Compilers for ParallelComputing� New Haven� CT� August ����
���� M� Lemke and D� Quinlan� P�� A C�� virtual shared grids basedprogramming environment for architecture�independent development of struc�tured grid applications� in Lecture Notes in Computer Science� Springer�Verlag�September ����
E. C. Lewis, C. Lin, L. Snyder, and G. Turkiyyah, A portable parallel n-body solver, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.
K. Li and P. Hudak, Memory coherence in shared virtual memory systems, ACM Transactions on Computer Systems, 7 (1989), pp. 321–359.
C. Lin and L. Snyder, ZPL: An array sublanguage, in Proceedings of the Sixth International Workshop on Languages and Compilers for Parallel Computing, Springer-Verlag, 1993, pp. 96–114.
J. Mandel and S. McCormick, Multilevel variational method for Au = λBu on composite grids, Journal of Computational Physics, 80 (1989), pp. 442–452.
S. F. McCormick, ed., Multilevel Adaptive Methods for Partial Differential Equations, SIAM, Philadelphia, 1989.
Message Passing Interface Forum, MPI: A Message-Passing Interface Standard (v1.0), May 1994.
R. E. Minnear, P. A. Muckelbauer, and V. F. Russo, Integrating the Sun Microsystems XDR/RPC protocols into the C++ stream model, in USENIX C++ Conference Proceedings, 1994.
W. F. Mitchell, Refinement tree based partitioning for adaptive grids, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.
J. J. Monaghan, Smoothed particle hydrodynamics, Annual Review of Astronomy and Astrophysics, 30 (1992), pp. 543–574.
S. S. Mukherjee, S. D. Sharma, M. D. Hill, J. R. Larus, A. Rogers, and J. Saltz, Efficient support for irregular applications on distributed memory machines, to appear in Proceedings of the 1995 Symposium on Principles and Practice of Parallel Programming, 1995.
B. J. Nelson, Remote Procedure Call, PhD thesis, Carnegie-Mellon University, Pittsburgh, PA, 1981.
I. Newton, Philosophiae Naturalis Principia Mathematica, 1687.
C. M. Pancake and D. Bergmark, Do parallel languages respond to the needs of scientific programmers?, IEEE Computer, 23 (1990), pp. 13–23.
C. M. Pancake and C. Cook, What users need in parallel tool support: Survey results and analysis, in Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.
Parallel Compiler Runtime Consortium, Common Runtime Support for High-Performance Parallel Languages, July 1993.
M. Parashar and J. C. Browne, An infrastructure for parallel adaptive mesh refinement techniques, (draft), 1995.
M. Parashar, S. Hariri, T. Haupt, and G. C. Fox, Interpreting the performance of HPF/Fortran 90D, in Proceedings of Supercomputing '94, November 1994.
R. Parsons and D. Quinlan, Run-time recognition of task parallelism within the P++ parallel array class library, in Scalable Libraries Conference, 1993.
J. R. Pilkington and S. B. Baden, Dynamic partitioning of non-uniform structured workloads with space-filling curves, (submitted to IEEE Transactions on Parallel and Distributed Systems), 1994.
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, 1992.
D. Quinlan, Parallel Adaptive Mesh Refinement, PhD thesis, University of Colorado at Denver, 1993.
S. K. Reinhardt, M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, and D. A. Wood, The Wisconsin Wind Tunnel: Virtual prototyping of parallel computers, in Proceedings of the 1993 ACM SIGMETRICS Conference, May 1993.
S. K. Reinhardt, J. R. Larus, and D. A. Wood, Tempest and Typhoon: User-level shared memory, in Proceedings of the ACM/IEEE International Symposium on Computer Architecture, April 1994.
M.-C. Rivara, Design and data structure of fully adaptive, multigrid, finite-element software, ACM Transactions on Mathematical Software, 10 (1984), pp. 242–264.
H. Samet, The Design and Analysis of Spatial Data Structures, Addison-Wesley, 1990.
W. W. Shu and L. V. Kale, Chare kernel: A runtime support system for parallel computations, Journal of Parallel and Distributed Computing, 11 (1991), pp. 198–211.
J. P. Singh, Parallel Hierarchical N-Body Methods and their Implications for Multiprocessors, PhD thesis, Stanford University, 1993.
J. P. Singh and J. L. Hennessy, Finding and exploiting parallelism in an ocean simulation program: Experiences, results, and implications, Journal of Parallel and Distributed Computing, 15 (1992), pp. 27–48.
J. P. Singh, C. Holt, J. L. Hennessy, and A. Gupta, A parallel adaptive fast multipole method, in Proceedings of Supercomputing '93, November 1993.
L. Snyder, Type architectures, shared memory, and the corollary of modest potential, Annual Review of Computer Science, 1 (1986), pp. 289–317.
L. Stals, Adaptive multigrid in parallel, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.
B. Stroustrup, The C++ Programming Language (Second Edition), Addison-Wesley, 1991.
V. S. Sunderam, PVM: A framework for parallel distributed computing, Concurrency: Practice and Experience, 2 (1990), pp. 315–339.
P. Tamayo, J. P. Mesirov, and B. M. Boghosian, Parallel approaches to short range molecular dynamics simulations, in Proceedings of Supercomputing '91, Albuquerque, NM, November 1991.
E. Tsuchida and M. Tsukada, Real space approach to electronic-structure calculations, Department of Physics, University of Tokyo (unpublished manuscript), 1995.
C. J. Turner and J. G. Turner, Adaptive data parallel methods for ecosystem monitoring, in Proceedings of Supercomputing '93, November 1993.
R. v. Hanxleden, K. Kennedy, and J. Saltz, Value-based distributions in Fortran D: A preliminary report, Tech. Rep. CRPC-TR93365-S, Center for Research on Parallel Computation, Rice University, Houston, TX, December 1993.
L. G. Valiant, A bridging model for parallel computation, Communications of the Association for Computing Machinery, 33 (1990), pp. 103–111.
T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, Active Messages: A mechanism for integrated communication and computation, in Proceedings of the 19th International Symposium on Computer Architecture, May 1992.
R. von Hanxleden, K. Kennedy, C. Koelbel, R. Das, and J. Saltz, Compiler analysis for irregular problems in Fortran D, in Fifth International Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August 1992.
M. S. Warren and J. K. Salmon, A parallel hashed oct-tree n-body algorithm, in Proceedings of Supercomputing '93, November 1993.
M. Welcome, B. Crutchfield, C. Rendleman, J. Bell, L. Howell, V. Beckner, and D. Simkins, BoxLib user's guide and manual, (draft), 1995.
S. R. White, J. W. Wilkins, and M. P. Teter, Finite-element method for electronic structure, Physical Review B, 39 (1989), pp. 5819–5833.
M. Wu and G. Fox, Fortran 90D compiler for distributed memory MIMD parallel computers, Tech. Rep. SCCS-327b, Syracuse University, 1992.
S. X. Yang, D. Gannon, S. Srinivas, F. Bodin, and P. Bode, High Performance Fortran interface to the parallel C++, in Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.
A. Yonezawa, ABCL: An Object-Oriented Concurrent System, MIT Press, 1990.