A Parallel Software Infrastructure
for Dynamic Block-Irregular
Scientific Calculations

Scott R. Kohn
[Cover figure: the layers of the software infrastructure, from the machine and programming language (C, Fortran, C++, Cobol) up through the Message Passing Layer, the Implementation Abstractions, and LPARX, to the Adaptive Mesh API and Particle API and applications such as LDA, MDAMG, and SPH3D, as used by the computational scientist.]
UNIVERSITY OF CALIFORNIA, SAN DIEGO

A Parallel Software Infrastructure for Dynamic
Block-Irregular Scientific Calculations

A dissertation submitted in partial satisfaction of the
requirements for the degree Doctor of Philosophy
in the Department of Computer Science and Engineering

by

Scott R. Kohn

Committee in charge:

Professor Scott B. Baden, Chair
Professor Francine D. Berman
Professor William G. Griswold
Professor Keith Marzullo
Professor Maria Elizabeth G. Ong
Professor John H. Weare
1995

Copyright
Scott R. Kohn, 1995
All rights reserved.
The dissertation of Scott R. Kohn is approved, and
it is acceptable in quality and form for publication on
microfilm:

University of California, San Diego

1995
iii
TABLE OF CONTENTS
Signature Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Vita and Publications . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
1 Introduction
   1.1 Parallel Scientific Computation
   1.2 Dynamic Block-Irregular Calculations
   1.3 A Parallel Software Infrastructure
      1.3.1 LPARX
      1.3.2 Implementation Abstractions
      1.3.3 Adaptive Mesh API
      1.3.4 Particle API
   1.4 Organization of the Dissertation
2 Parallelization Abstractions
   2.1 Introduction
   2.2 The LPARX Abstractions
      2.2.1 Philosophy
      2.2.2 Data Types
      2.2.3 Coarse-Grain Data Parallel Computation
      2.2.4 The Region Calculus
      2.2.5 Data Motion
      2.2.6 LPARX Implementation
      2.2.7 Summary
   2.3 LPARX Programming Examples
      2.3.1 Jacobi Relaxation
      2.3.2 Decomposing the Problem Domain
      2.3.3 Parallel Computation
      2.3.4 Communicating Boundary Values
      2.3.5 Dynamic and Irregular Computations
   2.4 Related Work
      2.4.1 Structural Abstraction
      2.4.2 Parallel Languages
      2.4.3 Run-Time Support Libraries
   2.5 Analysis and Discussion
      2.5.1 Structural Abstraction
      2.5.2 Limitations of the Abstractions
         Shared Memory
         Coarse-Grain Data Parallelism
         Language Interoperability
         Communication Model
      2.5.3 Future Work
3 Implementation Methodology
   3.1 Introduction
      3.1.1 Motivation
      3.1.2 Related Work
   3.2 Implementation Abstractions
      3.2.1 Message Passing Layer
      3.2.2 Asynchronous Message Streams
      3.2.3 Distributed Parallel Objects
      3.2.4 Communication Example
   3.3 Implementation and Performance
      3.3.1 Interrupts versus Polling
      3.3.2 DPO and AMS Overheads
      3.3.3 Application Performance
   3.4 Analysis and Discussion
      3.4.1 Flexibility
      3.4.2 Portability
      3.4.3 Implementation Mistakes
4 Adaptive Mesh Applications
   4.1 Introduction
      4.1.1 Motivation
      4.1.2 Related Work
   4.2 Structured Adaptive Mesh Algorithms
   4.3 Adaptive Mesh API
      4.3.1 Software Infrastructure Overview
      4.3.2 Data Structures
      4.3.3 Error Estimation
      4.3.4 Grid Generation
      4.3.5 Load Balancing and Processor Assignment
      4.3.6 Numerical Computation
      4.3.7 Communication
   4.4 Adaptive Eigensolvers in Materials Design
      4.4.1 A Model Problem
      4.4.2 Adaptive Framework
      4.4.3 Eigenvalue Algorithm
      4.4.4 Multigrid
      4.4.5 Finite Difference Discretizations
      4.4.6 Computational Results
   4.5 Performance Analysis
      4.5.1 Performance Comparison
      4.5.2 Execution Time Analysis
      4.5.3 Uniform Grid Patches
   4.6 Analysis and Discussion
      4.6.1 Parallelization Requirements
      4.6.2 Future Research Directions
5 Particle Calculations
   5.1 Introduction
      5.1.1 Motivation
      5.1.2 Related Work
   5.2 Application Programmer Interface
      5.2.1 Balancing Non-Uniform Workloads
      5.2.2 Caching Off-Processor Data
      5.2.3 Writing Back Particle Information
      5.2.4 Repatriating Particles
      5.2.5 Implementation Details
   5.3 Smoothed Particle Hydrodynamics
      5.3.1 Numerical Background
      5.3.2 Performance Comparison
      5.3.3 Execution Time Analysis
      5.3.4 Exploiting Force Law Symmetry
      5.3.5 Communication Optimizations
   5.4 Analysis and Discussion
      5.4.1 Parallelization Requirements
      5.4.2 Unstructured Partitionings
      5.4.3 Future Research Directions

6 Conclusions
   6.1 Research Contributions
   6.2 Outstanding Research Issues
      6.2.1 Implementation Strategies for APIs
      6.2.2 Language Interoperability
   6.3 The Scientific Computing Community

Appendix A: Machine Characteristics

Bibliography
LIST OF FIGURES

Current design trends in parallel architecture favor machines that resemble tightly coupled networks of workstations
An overview of our parallel software infrastructure

The LPARX layer of our software infrastructure provides parallelization mechanisms on which we build application-specific APIs
LPARX applications logically consist of three components: partitioning routines, LPARX code, and serial numerical kernels
The XArray of Grids structure provides a common framework for implementing various block-irregular decompositions of data
Examples of LPARX's region calculus operations
The computational domain for a simple finite difference problem
The main routine for the parallel Jacobi application
The relaxation routine for the parallel Jacobi application
Subroutine FillPatch manages all interprocessor communication

The LPARX run-time system is built on a message passing library, Asynchronous Message Streams, and Distributed Parallel Objects
LPARX programs are modeled as a collection of objects (Grids) with asynchronous and unpredictable communication patterns
Asynchronous communication facilities of the AMS layer
An example of AMS's message stream abstractions
Primary and secondary objects in the DPO model
LPARX function XAlloc supplies a Region and a processor assignment when creating a Grid
Each LPARX Grid is a DPO object
Coarse-grain execution in DPO employs the owner-computes rule
FillPatch will be used to illustrate how the various implementation layers interact in interprocessor communication
A time-line view of the transmission of data to another processor
A time-line view of the reception of data from another processor

The adaptive mesh API provides application-specific facilities for structured adaptive mesh methods
A comparison of unstructured and structured adaptive mesh methods
Structured adaptive mesh methods represent the numerical solution to a partial differential equation using a hierarchy of grid levels
A sample structured adaptive mesh hierarchy for a materials design problem
Organization of the structured adaptive mesh API library
A composite grid is represented using a Grid, an IrregularGrid, and a CompositeGrid
Error estimation and grid generation
Grid generation using the signature algorithm
Two grid generation strategies for uniform refinement regions
A simple load balancing algorithm for grid patches
An improved load balancing strategy
Coarse-grain numerical computation over the individual Grids within an IrregularGrid
A comparison of coarse-grain and fine-grain data parallel execution
Intralevel communication between grids at the same level
Interlevel communication between grids at different levels
Materials design seeks to understand the chemical properties of molecules such as this hydrocarbon ring
Outline of the adaptive eigenvalue solver
An iterative multigrid-based eigenvalue algorithm
The Full Approximation Storage (FAS) multigrid algorithm
Computational results for hydrogen
Computational results for the hydrogen molecular ion
Computational results were gathered for this synthetic eigenvalue problem
Adaptive eigenvalue solver execution times
A level-by-level accounting of the execution time for the eigenvalue algorithm
Execution time breakdown on the Intel Paragon and IBM SP
These graphs illustrate the performance overheads of uniform grid patches as compared to non-uniform patches

Our particle API provides computational scientists with high-level facilities targeted towards particle applications
A framework for a generic particle calculation
Snapshots of a vortex dynamics application with a non-uniform workload distribution
A parallelized version of the generic particle code
An irregular decomposition of the computational domain using the XArray
API function BalanceWorkloads redistributes computational effort across the processors
FetchParticles locally caches copies of off-processor particle information needed for particle interactions
WriteBack updates force information for particles owned by other processors
The API definition for C++ class ChainMesh
Application C++ code to compute local interactions
Customizations for C++ class ParticleList
Our SPH3D application simulates the evolution of a 3d disk galaxy
SPH3D execution times on a Cray C90, Intel Paragon, IBM SP, and an Alpha workstation farm running PVM
Execution time summary for one SPH3D timestep on the Intel Paragon and the IBM SP
A comparison of the SPH3D code with a restricted version that does not fully exploit force law symmetry
A comparison of the SPH3D code to a "naive" implementation that does not attempt to minimize interprocessor communication
A comparison of structured and unstructured partitions

A.1 Alpha workstation cluster message passing performance
A.2 IBM SP message passing performance
A.3 Intel Paragon message passing performance
LIST OF TABLES

A brief description of the four LPARX data types: Point, Region, Grid, and XArray
A summary of LPARX operations

A summary of the facilities provided by DPO, AMS, and the message passing layer
A summary of the asynchronous communication facilities provided by the Asynchronous Message Stream layer
A summary of the object management mechanisms defined by the Distributed Parallel Objects layer
The implementation of communication between Grids depends on whether they are primary or secondary objects
Message length and memory overheads for AMS, DPO, and LPARX
LPARX overheads for a Jacobi application

A breakdown of the eleven thousand lines of code that constitute the adaptive mesh API library
Descriptions of Grid, IrregularGrid, and CompositeGrid
Unknowns and mesh spacing for the adaptive mesh hierarchy used to solve the eigenvalue problem
Software version numbers and compiler optimization flags for the structured adaptive mesh performance results
Adaptive eigenvalue solver execution times
Execution time breakdown on the Intel Paragon
Execution time breakdown on the IBM SP
Average interprocessor communication volume
Uniform grid patches require additional memory resources as compared to non-uniform patches

A survey of the computational structure for various N-body approximation methods
Variables and functions of the smoothed particle hydrodynamics equations
Software version numbers and compiler optimization flags for the SPH3D computational results
SPH3D execution times on a Cray C90, Intel Paragon, IBM SP, and an Alpha workstation farm running PVM
Execution time breakdown of one SPH3D timestep on the Intel Paragon and the IBM SP
Execution time summary for one SPH3D timestep on the Intel Paragon and the IBM SP
A comparison of the SPH3D code with a restricted version that does not fully exploit force law symmetry
A comparison of the SPH3D code to a "naive" implementation that does not attempt to minimize interprocessor communication

A.1 Software version numbers and compiler optimization flags
A.2 A summary of machine characteristics
ACKNOWLEDGEMENTS
As one of those rare individuals destined for true greatness, this record of my thoughts and convictions will provide invaluable insight into budding genius. Think of it! A priceless historical document in the making.

-- Calvin, "Calvin and Hobbes"
Many people contribute to the completion of a dissertation, and I would like to thank everyone who has contributed to mine.

I have had the privilege and pleasure to work with Scott Baden for the last five years. I am indebted to him for his support and encouragement. I have enjoyed our numerous "lively discussions," which have greatly contributed to my work. I would also like to thank my committee members -- Fran Berman, Bill Griswold, Keith Marzullo, Beth Ong, and John Weare -- for offering criticisms and comments.

Special thanks go to Steve Fink and Val Donaldson. Their keen insights, thoughtful comments, and honest criticisms are the stuff of good science. Sharing a lab with them has been a pleasure that I will miss. I doubt that they know how much I have valued their input.

I appreciate the many useful suggestions from Greg Cook, Steve Fink, Chris Myers, and Charles Rendleman on how to improve LPARX and the adaptive mesh software. I also thank Eric Bylaska, Alan Edelman, Ryoichi Kawai, Beth Ong, and John Weare for numerous valuable discussions on numerical methods in materials design.

I would like to thank my family for their support, encouragement, and love. In particular, I thank my father for his sense of curiosity and my mother for trying to make me read Dr. Seuss when I only wanted to read science books. I also have to thank my two cats for putting things in perspective; they do not recognize the importance of writing a dissertation, and they have never hesitated to let me know that their needs (e.g., being fed on time) should come first and foremost.

Finally, I would like to dedicate this dissertation to my wife, Kristin. I find it difficult to express in words what I feel for her in my heart. I thank her for always being there when I needed her and for always reminding me what is most important in my life.
Generous financial support has been provided by a General Atomics fellowship, an NSF ASC contract, and an ONR contract. Access to the Cray C90, IBM SP, Intel Paragon, and DEC Alpha workstation farm has been provided by the San Diego Supercomputer Center (through a UCSD School of Engineering Block Grant) and the Cornell Theory Center.
VITA

B.S., Electrical Engineering, with additional majors in Mathematics and Computer Science, University of Wisconsin at Madison

M.S., Computer Science, University of California at San Diego

Ph.D., Computer Science, University of California at San Diego
PUBLICATIONS

Submitted for Publication

S. R. Kohn and S. B. Baden, "A Parallel Software Infrastructure for Structured Adaptive Mesh Methods," submitted to Supercomputing '95.

Journals

S. R. Kohn and S. B. Baden, "Irregular Coarse-Grain Data Parallelism Under LPARX," to appear, Journal of Scientific Programming.

S. B. Baden and S. R. Kohn, "Portable Parallel Programming of Numerical Problems Under the LPAR System," Journal of Parallel and Distributed Computation.

Conferences

S. R. Kohn and S. B. Baden, "The Parallelization of an Adaptive Multigrid Eigenvalue Solver with LPARX," Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

E. J. Bylaska, S. R. Kohn, S. B. Baden, A. Edelman, R. Kawai, M. E. Ong, and J. H. Weare, "Scalable Parallel Numerical Methods and Software Tools for Material Design," Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

S. B. Baden, S. R. Kohn, and S. J. Fink, "Programming with LPARX," Proceedings of the 1994 Intel Supercomputer User's Group, San Diego, CA, June 1994.

S. R. Kohn and S. B. Baden, "A Robust Parallel Programming Model for Dynamic Non-Uniform Scientific Computations," Proceedings of the 1994 Scalable High Performance Computing Conference, Knoxville, TN, May 1994.

S. R. Kohn and S. B. Baden, "An Implementation of the LPAR Parallel Programming Model for Scientific Computations," Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, Norfolk, VA, March 1993.

S. B. Baden and S. R. Kohn, "Lattice Parallelism: A Parallel Programming Model for Manipulating Non-Uniform Structured Scientific Data Structures," Proceedings of the Workshop on Languages, Compilers, and Run-Time Environments for Distributed Memory Multiprocessors, Boulder, CO, October 1992.

S. B. Baden and S. R. Kohn, "A Comparison of Load Balancing Strategies for Particle Methods Running on MIMD Multiprocessors," Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, Houston, TX, March 1991.

Technical Reports

S. R. Kohn and S. B. Baden, "Blobs: Visualization of Particle Methods on Multiprocessors," Technical Report, University of California, San Diego.

S. B. Baden and S. R. Kohn, "The Reference Guide to GenMP: The Generic Multiprocessor," Technical Report, University of California, San Diego.
ABSTRACT OF THE DISSERTATION

A Parallel Software Infrastructure for Dynamic
Block-Irregular Scientific Calculations

by

Scott R. Kohn

Doctor of Philosophy in Computer Science

University of California, San Diego, 1995

Professor Scott B. Baden, Chair

Dear Sir or Madam, will you read my book?
It took me years to write, will you take a look?

-- John Lennon and Paul McCartney, "Paperback Writer"
The accurate solution of many problems in science and engineering requires the resolution of unpredictable, localized physical phenomena. Such applications may involve the solution of complicated, time-dependent partial differential equations such as those in materials design, computational fluid dynamics, astrophysics, and molecular dynamics. The important feature of these numerical problems is that some portions of the computational domain require higher resolution, and thus more computational effort, than others.

Parallel supercomputers offer the power to solve many of these computationally intensive tasks; however, these applications are particularly challenging to implement on parallel architectures because they rely on dynamic, complicated, irregular structures with dynamic and irregular communication patterns. Current parallel software technology does not yet afford a solution, and new programming abstractions, along with the accompanying run-time support, are needed.

We have developed a parallel software infrastructure to simplify the implementation of dynamic, irregular, block-structured scientific computations on high-performance parallel supercomputers. Our software infrastructure provides computational scientists with high-level, domain-specific tools that hide low-level details of the parallel hardware. It is portable across a wide range of parallel architectures.

At the center of our infrastructure is the LPARX parallel programming system. LPARX introduces the concept of "structural abstraction," which enables applications to dynamically manipulate irregular data decompositions as language-level objects. LPARX provides a framework for creating decompositions that may be tailored to meet the needs of a particular application.

Building on the LPARX abstractions, we have developed application programmer interfaces (APIs) for two important classes of applications: structured adaptive mesh methods and particle calculations. These APIs enable scientists to concentrate on the mathematics and the physics of their application; APIs provide high-level software tools that hide underlying implementation details. Our parallel software infrastructure has enabled computational scientists to explore new approaches to solving a variety of problems, and it has reduced the development time of challenging numerical applications. Indeed, we have applied our structured adaptive mesh API to the adaptive solution of eigenvalue problems in materials design and our particle API to a 3d smoothed particle hydrodynamics application in astrophysics.
Chapter 1

Introduction

I realized that the purpose of writing is to inflate weak ideas, obscure poor reasoning, and inhibit clarity. With a little practice, writing can be an intimidating and impenetrable fog . . . Academia, here I come!

-- Calvin, "Calvin and Hobbes"
1.1 Parallel Scientific Computation
Parallel supercomputers offer the power to solve many of the computationally intensive problems that arise in science and engineering. Unfortunately, this potential has only been partially realized due to the difficulty of implementing scientific applications on parallel platforms. To put it bluntly, today's parallel computers are hard to use, and parallel software technology does not yet afford the performance and ease of use that computational scientists have come to expect from sequential and vector supercomputers.
It would not be an understatement to say that parallel software is in a state of crisis. The vast majority of scientific programmers find that current parallel software support is inadequate [?]. In fact, they are more likely to develop their own in-house software support rather than use existing products [?]. The most commonly used parallel programming paradigm today is message passing. Standardization efforts have resulted in a portable message passing library called MPI (Message Passing Interface) [?]. Unfortunately, programming with message passing is tedious, as the programmer must explicitly manage low-level details of data placement and interprocessor communication.
Developments in High Performance Fortran* (HPF) [?] are promising; unfortunately, HPF will require improvements before it becomes a general-purpose parallel language. For example, HPF does not adequately address dynamic and irregular problems [?], and these limitations have prompted a second HPF standardization effort. The HPF2 committee is currently investigating enhancements to HPF, but it will be some time before we know what strategies will be effective and how difficult they will be to support in the compiler. Improvements in HPF for dynamic and irregular scientific applications will likely require new parallel programming abstractions and run-time support libraries.
Parallel computers are difficult to use because they require the explicit and low-level management of data locality. Current design trends in high-performance parallel architectures favor machines constructed with commodity components. More than anything else, today's parallel computers resemble tightly coupled networks of workstations (see Figure 1.1). The programmer, compiler, or run-time system must distribute data carefully because access to remote data (through the interconnection network) is typically several orders of magnitude more expensive than access to local data. Some parallel computers, such as the Intel Paragon and the IBM SP, provide very little hardware support for managing data distributed across processor memories. Other machines, such as the Stanford FLASH [?] and the Wisconsin COW [?], contain hardware for the automatic caching of remote data. However, recent studies with these "distributed shared memory" machines [?] indicate that such hardware caching mechanisms are inadequate for dynamic scientific applications. In fact, these studies conclude that efficient distributed shared memory applications require the same attention to data management and the same implementation techniques as message passing applications.
* High Performance Fortran is a data parallel Fortran language that is quickly becoming accepted by a number of manufacturers as a standard parallel programming language for scientific computing.
[Figure: five processing nodes, each a processor P paired with a local memory M, connected by an interconnection network.]

Figure 1.1: Current design trends in parallel architecture favor machines built with commodity components. Today's parallel computers resemble tightly coupled networks of workstations and typically contain a few tens to a few hundreds of powerful processing nodes connected by an interconnection network. Each processor P is tied to a local memory M, and remote data is accessed through the interconnection network. At any one time, several parallel applications share the machine, with a single application generally using a few tens of dedicated processors. Appendix A summarizes the machine characteristics for the parallel architectures used in this dissertation.
Another concern when writing parallel applications is portability. Parallel platforms obsolesce at an alarming rate, and portability is essential so that applications will run on the next generation of architectures. In just the two years spent developing our software infrastructure, four parallel computers have become obsolete (the nCUBE 2, Intel iPSC/860, Kendall Square Research KSR-1, and Thinking Machines CM-5), two manufacturers have declared bankruptcy (Kendall Square Research and Thinking Machines), and two manufacturers have entered the parallel scientific computing market (IBM and Silicon Graphics). This trend is likely to continue in the near future due to the rapidly changing microprocessor and interconnect technology used to build parallel machines.
The key to portability is hiding low-level, machine-dependent details. For sequential programs, this can be easily achieved through the use of a standard programming language such as Fortran. However, parallel programs typically contain a considerable amount of hardware-dependent code to manage data distribution and interprocessor communication. Such hardware dependencies hamper portability. To be portable, parallelization mechanisms must hide these low-level, architecture-dependent implementation details.
Implementing portable parallel programs without high-level software support is a difficult task. Computational scientists would rather address the mathematics and the physics of their problems than worry about efficient parallel implementation techniques. Low-level, machine-dependent details reduce portability and obscure the algorithms underlying an application. Appropriate software support is essential for developing architecture-independent, high-performance parallel scientific applications.
1.2 Dynamic Block-Irregular Calculations

My own interests are in using computers as God intended -- to do arithmetic.

-- Cleve Moler
Many scientific computations involve the study of dynamic, irregular, locally structured physical phenomena. Such applications may involve the solution of complicated, time-dependent partial differential equations such as those in materials design [...], computational fluid dynamics [...], or localized deformations in geophysical systems. Also included are particle methods in molecular dynamics [...], astrophysics [...], and vortex dynamics [...]. More recently, adaptive methods have been applied to the study of entire ecosystems through satellite imagery at multiple resolutions [...]. These applications are particularly challenging to implement on parallel computers owing to their dynamic, irregular decompositions of data.
Our research addresses the programming abstractions and the accompanying software support required for dynamic, irregular, block-structured scientific computations running on MIMD [...] parallel computers. The distinguishing characteristics of this class of problems are that (1) numerical work is non-uniformly distributed over space, (2) the workload distribution changes as the computation progresses, and (3) the workload exhibits a local structure. Such applications employ dynamic, irregular -- but locally structured -- meshes to represent the changing numerical computation. They spend considerably more effort in some portions of the problem space than in others. The distribution of computational effort is not known at compile-time, and the application must adapt to the evolving calculation at run-time. Numerical work tends to be localized in regions irregularly distributed across the problem domain. This localization property is especially important on multiprocessors, since we can exploit data locality to reduce interprocessor communication costs and improve parallel performance.
We focus on two important classes of dynamic, block-irregular applications:

- structured adaptive mesh methods [...], and

- particle methods based on link-cell techniques [...].

Such applications can be difficult to implement without advanced software support because they rely on dynamic, complicated irregular array structures with irregular communication patterns.* The programmer is burdened with the responsibility of managing dynamically changing data distributed across processor memories and orchestrating interprocessor communication and synchronization. Little information is available at compile-time to guide a parallel compiler because numerical workloads change in response to the dynamics of the particular problem being solved.
Current parallel programming languages provide little support for dynamic, block-irregular applications. Data parallel Fortran languages such as High Performance Fortran typically focus on regular, static problems such as dense linear algebra. HPF defines a set of built-in, uniform data decompositions specified through compile-time directives; however, it provides few mechanisms for dynamically changing irregular data. Extending HPF will require developments in (1) parallel programming abstractions and (2) run-time support libraries. New parallelization abstractions are needed because the current compile-time data distribution mechanisms are inadequate for dynamic problems. Such applications will also require sophisticated run-time support to manage changing data distributions and communication patterns.

* Further details can be found in Sections [...] and [...].
A number of run-time support systems have already been developed, including CHAOS (formerly called PARTI) [...], multiblock PARTI [...], and Multipol [...]. Both CHAOS and multiblock PARTI have been used as run-time support for data parallel Fortran compilers. CHAOS has been very successful in addressing unstructured problems such as sparse linear algebra and finite elements [...]. Multiblock PARTI has been employed in the parallelization of applications with a small number of large, static blocks [...]; its support for dynamic block structured problems is unclear. The Multipol library provides a collection of distributed non-array data structures such as graphs, unstructured grids, hash tables, sets, trees, and queues. However, none of these systems directly address the dynamic, block-irregular problems that are the focus of our research.*
1.3 A Parallel Software Infrastructure

Solving a problem is similar to building a house. We must collect the right material, but collecting the material is not enough; a heap of stones is not yet a house. To construct the house or the solution, we must put together the parts and organize them into a purposeful whole.

-- George Polya
We have developed a parallel software infrastructure to simplify the implementation of dynamic, irregular, block-structured scientific computations on high-performance parallel supercomputers. Our software infrastructure has enabled computational scientists to explore new approaches to solving applied problems. It has reduced the development time of challenging numerical applications. Our infrastructure has been implemented as a C++ class library and consists of approximately thirty thousand lines of C++ and Fortran code.* It has been employed by researchers at the University of California at San Diego, George Mason University, Lawrence Livermore National Laboratories, Sandia National Laboratories, and the Cornell Theory Center for applications in gas dynamics [...], smoothed particle hydrodynamics, particle simulation studies [...], adaptive eigenvalue solvers in materials design [...], genetic algorithms [...], adaptive multigrid methods in numerical relativity, and the dynamics of earthquake faults (see Section [...] for a complete list).

* We will discuss such related work in detail in Section [...].
Our parallel software infrastructure addresses two goals of software support for scientific applications [...]:

- it hides low-level details of the hardware, and

- it provides high-level, efficient mechanisms that match the scientist's view of the computation.

The first goal is necessary for portability. Software that exposes too much of the underlying hardware will not run efficiently on parallel platforms with different hardware characteristics. Our software meets this goal through the use of high-level parallelization mechanisms that assume very little about the underlying hardware architecture. The second goal is necessary for ease-of-use and simplified code development. Our software infrastructure provides the programmer with high-level tools appropriate for the task at hand through domain-specific application programmer interfaces (APIs) built upon our parallelization mechanisms.
Figure 1.2 illustrates the organization of our parallel software infrastructure. At the very top of the infrastructure lie applications, and at the bottom lies a portable message passing layer. Each level provides more powerful -- and also more specific -- abstractions. The infrastructure consists of four primary components: (1) implementation support, (2) a set of parallelization abstractions called LPARX ("ell-par-eks"), (3) a structured adaptive mesh API, and (4) a particle API.

* The current software distribution may be obtained through the World Wide Web at address http://www.cse.ucsd.edu/users/skohn.
[Figure: layered stack -- at the top, the applications LDA, AMG, SPH3D, and MD; below them the Adaptive Mesh API (Chapter 4) and Particle API (Chapter 5); then LPARX (Chapter 2); then the Implementation Abstractions (Chapter 3); at the bottom, the Message Passing Layer.]

Figure 1.2: This figure shows our parallel software infrastructure for dynamic, irregular, block-structured scientific computations. It consists of four primary components, each of which has been labeled with the chapter of this dissertation that describes that particular component. Higher levels of the infrastructure provide more powerful -- and more specialized -- abstractions. At the top lie applications (LDA, AMG, SPH3D, and MD), and at the bottom lies a portable message passing layer. See the text for a brief description of each component.
There are three advantages to a layered software infrastructure as compared to a single, monolithic application: portability, code reusability, and extensibility. Because applications do not directly rely on the message passing layer and instead employ the mechanisms provided by their application-specific APIs, low-level changes in the implementation do not directly affect the applications. For example, applications are completely shielded from low-level changes in the LPARX implementation. Code reuse is achieved because multiple application libraries share the same parallelization mechanisms. Optimizations in the LPARX implementation are realized by both particle and adaptive mesh applications. Finally, because the infrastructure provides tools, and not canned solutions, computational scientists can tailor and extend our abstractions to match their applications.
The following sections briefly describe the four main components of our parallel software infrastructure.
1.3.1 LPARX

At the center of the software infrastructure is the LPARX parallel programming system, which defines high-level, efficient mechanisms for data distribution, partitioning and mapping, parallel execution, and interprocessor communication. It provides a common set of parallelization facilities on which we built the particle and adaptive mesh APIs.

LPARX introduces the concept of "structural abstraction," which enables applications to dynamically manipulate irregular data decompositions. Instead of forcing the programmer to choose from a small set of predefined decompositions, LPARX provides a framework for creating decompositions that may be tailored to meet the needs of a particular application. To our knowledge, LPARX is the first and only system that efficiently supports arbitrary dynamic, user-defined, block-irregular data distributions on parallel architectures.

LPARX assumes only basic message passing support and is therefore portable to a variety of high-performance computing platforms. Our current implementation runs on the Cray C90 (single processor), IBM SP2, Intel Paragon, single processor workstations (for code development and debugging), and networks of workstations connected via PVM [...].
1.3.2 Implementation Abstractions

At the very bottom of our software infrastructure is a portable message passing layer called MP++ (our own version of MPI [...]). To simplify the implementation of the LPARX run-time system, we have introduced two levels of software abstraction between LPARX and the message passing layer: Asynchronous Message Streams (AMS) and Distributed Parallel Objects (DPO). AMS and DPO provide support for parallel programs consisting of a relatively small number of large, complicated objects with asynchronous and unpredictable communication patterns. They build on ideas from the concurrent object oriented programming community.

AMS defines a "message stream" abstraction that greatly simplifies the communication of complicated data structures between processors. Its mechanisms combine ideas from asynchronous remote procedure calls [...], Active Messages [...], and the C++ I/O stream library [...]. DPO provides object oriented mechanisms for manipulating objects that are physically distributed across processor memories and is based on communicating object models from the distributed systems community [...].
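As a rough illustration of the stream idea, the sketch below packs and unpacks values through an operator<< / operator>> interface reminiscent of the C++ I/O streams that inspired AMS. The MsgStream class and its flat byte-buffer layout are our own invention for this example, not the AMS interface itself.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Toy "message stream": operator<< packs structured data into a byte buffer
// that could be handed to a message passing layer; operator>> unpacks it on
// the receiving side. Works only for trivially copyable types.
struct MsgStream {
    std::vector<char> buf;   // packed message bytes
    size_t rd = 0;           // read cursor for unpacking
    template <class T> MsgStream& operator<<(const T& v) {
        const char* p = reinterpret_cast<const char*>(&v);
        buf.insert(buf.end(), p, p + sizeof(T));
        return *this;
    }
    template <class T> MsgStream& operator>>(T& v) {
        std::memcpy(&v, buf.data() + rd, sizeof(T));
        rd += sizeof(T);
        return *this;
    }
};
```

In AMS the same stream style is extended to asynchronous delivery between processors; here both ends share one buffer purely to show the packing discipline.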
1.3.3 Adaptive Mesh API

Our adaptive mesh API defines specialized, high-level facilities tailored to structured multilevel adaptive mesh refinement applications [...]. Such numerical methods dynamically refine the local representation of a problem in "interesting" portions of the computational domain, such as shock regions in computational fluid dynamics [...]. They are difficult to implement because refinement regions vary in size and location, resulting in complicated geometries and irregular communication patterns.

Computational scientists using our adaptive mesh API can concentrate on their numerical applications rather than being concerned with low-level implementation details. The API library, built upon the parallelization and communication abstractions of LPARX, provides mechanisms for automatic error estimation, grid generation, load balancing, and grid hierarchy management. All details associated with parallelism are completely hidden from the programmer.

We have used our software infrastructure to develop a parallel adaptive eigenvalue solver (LDA) and an adaptive multigrid solver (AMG) for problems arising in materials design [...]. By exploiting adaptivity, we have reduced memory consumption and computation time by more than two orders of magnitude over an equivalent non-adaptive method. To our knowledge, this is the first time that structured adaptive mesh techniques have been used to solve eigenvalue problems in materials design.
1.3.4 Particle API

Our particle API provides computational scientists high-level tools that simplify the implementation of particle applications [...] on parallel computers. Particle methods are difficult to parallelize because they require dynamic, irregular data decompositions to balance changing non-uniform workloads. Built on top of the LPARX mechanisms, our particle API defines facilities specifically tailored towards particle methods. The use of the LPARX abstractions enabled us to provide functionality and explore performance optimizations that would have been difficult had the library been implemented using only a primitive message passing layer. Using our software infrastructure, we have developed a 3D smoothed particle hydrodynamics [...] code (SPH3D) that simulates the evolution of galactic bodies in astrophysics, and we are currently developing a 3D molecular dynamics application (MD) to study fracture dynamics in solids [...].
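For readers unfamiliar with the link-cell technique mentioned earlier, the sketch below shows its core idea in sequential form: space is divided into cells no smaller than the interaction cutoff, so each particle need only examine its own and neighboring cells. The names (P, bin_particles) are illustrative and not part of our particle API.

```cpp
#include <cassert>
#include <vector>

struct P { double x, y; };   // a particle position in [0, L) x [0, L)

// Assign each particle index to a cell on an ncell x ncell grid over the
// domain [0, L)^2. Force evaluation would then loop over each cell and its
// eight neighbors instead of over all particle pairs.
std::vector<std::vector<int>> bin_particles(const std::vector<P>& ps,
                                            double L, int ncell) {
    std::vector<std::vector<int>> cells(ncell * ncell);
    double h = L / ncell;                          // cell width (>= cutoff)
    for (int i = 0; i < (int)ps.size(); ++i) {
        int cx = (int)(ps[i].x / h), cy = (int)(ps[i].y / h);
        cells[cy * ncell + cx].push_back(i);       // store particle index
    }
    return cells;
}
```

When workloads are non-uniform, the cells themselves become the units that an irregular decomposition distributes across processors.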
1.4 Organization of the Dissertation

Sixty minutes of thinking of any kind is bound to lead to confusion and unhappiness.

-- James Thurber

This dissertation is organized into six chapters. Each of Chapters 2 through 5 covers a portion of the software infrastructure shown in Figure 1.2. Each chapter is self-contained, with its own introduction, motivation, related work, and analysis and conclusions. The discussion of LPARX in Chapter 2 is a starting point for all further chapters; otherwise, Chapter 3 (Implementation Mechanisms), Chapter 4 (Adaptive Mesh Applications), and Chapter 5 (Particle Calculations) may be read independently of the others. We conclude with the contributions of this work in Chapter 6. The parallel architectures used in this dissertation are described in Appendix A.
Chapter 2

Parallelization Abstractions

Fundamental definitions do not arise at the start but at the end of the exploration, because in order to define a thing you must know what it is and what it is good for.

-- Hans Freudenthal, "Developments in Mathematical Education"

If at first you do succeed -- try to hide your astonishment.

-- Harry F. Banks
2.1 Introduction

The LPARX parallel programming system [...] provides portable facilities for the efficient implementation of dynamic, non-uniform scientific applications on MIMD architectures. Such applications are typically difficult to implement without sophisticated software support. The LPARX mechanisms hide low-level implementation details and provide powerful tools for data distribution, partitioning and mapping, parallel execution, and interprocessor communication. LPARX requires only basic message passing support and is therefore portable to a variety of high-performance computing platforms. Our current implementation runs on the Cray C90 (single processor), IBM SP2, Intel Paragon, and networks of workstations connected via PVM [...]. LPARX applications may be developed and debugged on a single processor workstation.
[Figure: layered stack -- the applications LDA, AMG, SPH3D, and MD at the top; the Adaptive Mesh API and Particle API; LPARX; the Implementation Abstractions; and the Message Passing Layer at the bottom.]

Figure 2.1: The LPARX layer of our software infrastructure provides parallelization facilities designed for scientific applications that employ dynamic, irregular, structured representations. Based on the LPARX mechanisms, we have developed application-specific APIs for particle computations and structured adaptive mesh methods.
Building on the LPARX mechanisms described in this chapter (see Figure 2.1), we have developed application-specific support libraries for two important classes of applications: multilevel structured adaptive mesh methods [...] and particle calculations [...]. In Chapters 4 and 5, we describe how LPARX provides the parallelization support infrastructure needed to efficiently and easily implement these re-usable APIs.

This chapter is organized as follows. We begin with a description of the LPARX abstractions in Section 2.2. Section 2.3 illustrates how these abstractions are used to parallelize a simple application. We compare our approach with other related work in Section 2.4. Finally, we conclude with an analysis of the advantages and limitations of the LPARX approach.
2.2 The LPARX Abstractions

A breakthrough is not a breakthrough unless you coin a term for it.

-- Sidney Harris, "Einstein Simplified"

I think you've done it. All we need now is a trademark and a theme song.

-- Sidney Harris, "From Personal Ads to Cloning Labs"
LPARX [...] is a coarse-grain, domain-specific parallel programming model that provides high-level abstractions for representing and manipulating dynamic, irregular block-structured data on MIMD distributed memory architectures. Dynamic irregular block decompositions are not currently supported by programming languages such as High Performance Fortran (HPF) [...], Fortran D [...], Vienna Fortran [...], or Fortran 90D [...]. They arise in two important classes of scientific computations:

- multilevel structured adaptive finite difference methods [...], which represent refinement regions using block-irregular data structures, and

- parallel computations such as particle methods [...] that require an irregular data decomposition [...] to balance non-uniform workloads across parallel processors.

We have used the LPARX mechanisms to implement domain-specific APIs and representative applications from each of these two problem classes.
LPARX should not be thought of as a "language" but rather as a set of data distribution and parallel coordination abstractions which may be implemented in a library (as we have done) or added to a language. The design goals of LPARX are as follows:

- Express irregular data decompositions, layouts, and data dependencies at run-time using high-level, intuitive abstractions.

- Require only basic message passing support and give portable performance across diverse parallel architectures.

- Separate parallel control and communication from numerical computation.

- Provide the basis for an expandable software infrastructure of application-specific APIs.

Implementing dynamic, irregular computations on parallel computers is a difficult task. To achieve reasonable parallel performance, the application must explicitly manage low-level details of data locality and communication, even on shared memory multiprocessors [...]. This burden soon becomes unmanageable and can obscure the salient features of the algorithm. LPARX hides many of these implementation details and provides high-level coordination mechanisms to manage data locality within the memory hierarchy and minimize communication costs. The software support provided by LPARX greatly simplifies the development of high-performance, portable, parallel applications software.

The following sections describe LPARX's facilities. We begin with an overview of the philosophy underlying the LPARX model. Section 2.2.2 introduces the LPARX data types and its representation of irregular block decompositions. We then present LPARX's model of coarse-grain data parallel execution. Sections 2.2.4 and 2.2.5 describe LPARX's region calculus and data motion primitives, which express data decompositions and dependencies in geometric terms. We briefly discuss the LPARX implementation in Section 2.2.6 and then conclude with a summary.
2.2.1 Philosophy

The LPARX parallel programming model separates the expression of data decomposition, communication, and parallel execution from numerical computation. As shown in Figure 2.2, LPARX applications are logically organized into three separate pieces: partitioners, LPARX code, and serial numerical kernels.
The LPARX layer provides facilities for the coordination and control of parallel execution. LPARX is a coarse-grain data parallel programming model; it gives the illusion of a single global address space and a single logical thread of control. On a MIMD parallel computer, the underlying run-time system executes in Single Program Multiple Data (SPMD) mode.

[Figure: three components of an LPARX application -- partitioning routines, LPARX code, and serial numerical kernels.]

Figure 2.2: The logical organization of an LPARX application consists of three components: partitioning routines, LPARX code, and serial numerical kernels.
Computations are divided into a relatively small number of coarse-grain pieces. Each work unit represents a substantial computation with thousands or tens of thousands of floating point operations executing on a single logical processing node. Parallel execution is expressed using a coarse-grain loop; each iteration of the loop executes as if on its own processor. The computation for each piece is performed by a numerical kernel, and the computations proceed independently of one another. Numerical routines may be written in any language, such as C++, C, or Fortran. The advantage of this approach is that LPARX can leverage serial compiler technology and existing sequential code. Heavily optimized numerical routines need not be re-implemented to parallelize an application. Furthermore, numerical code can be optimized for a processing node without regard to the higher level parallelization. LPARX does not define what constitutes a single logical node; a node may correspond to a single processor, a processing cluster, or a processor subset. Thus, kernels may be tuned to take advantage of low-level node characteristics, such as vector units, cache sizes, or multiple processors.
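The division of labor between coordination code and serial kernels can be sketched as follows. The kernel is defined inline here to keep the example self-contained, but in practice it could be separately compiled, heavily optimized Fortran or C; all names are illustrative, not part of LPARX.

```cpp
#include <cassert>
#include <vector>

// Serial numerical kernel: one Jacobi-style smoothing pass over the interior
// of a 1D block. In a real application this would be existing, tuned serial
// code; extern "C" linkage lets it be supplied by a C or Fortran compiler.
extern "C" void smooth_kernel(double* u, int n) {
    std::vector<double> tmp(u, u + n);          // copy of the old values
    for (int i = 1; i < n - 1; ++i)
        u[i] = 0.5 * (tmp[i - 1] + tmp[i + 1]);
}

// Coarse-grain driver: each block is a substantial unit of work. In LPARX
// the loop body would execute concurrently, one iteration per logical node;
// here it is a plain sequential loop for illustration.
void smooth_all(std::vector<std::vector<double>>& blocks) {
    for (auto& b : blocks)
        smooth_kernel(b.data(), (int)b.size());
}
```

The driver knows nothing about the kernel's internals, and the kernel knows nothing about parallelism, which is the separation the text describes.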
An important part of the LPARX philosophy is that data partitioning for dynamic, non-uniform scientific computations is extremely problem-dependent and therefore is best left to the application (or the API). No specific data decomposition strategies have been built into the LPARX model. Rather, all data decomposition is performed at run-time under the direct control of the application. LPARX provides the application a uniform framework for representing and manipulating block-irregular decompositions. Although our implementation supplies a standard library of decomposition routines, the programmer is free to write others.
Our approach to data decomposition differs from most parallel languages, such as HPF [...], which require the programmer to choose from a small number of predefined decomposition methods. Vienna Fortran [...] provides some facilities for irregular user-defined data decompositions but limits them to tensor products of irregular one-dimensional decompositions. Block-irregular decompositions may be constructed using the pointwise mapping arrays of Fortran D [...]; however, pointwise decompositions are inappropriate and unnatural for calculations which exhibit block structures. Because pointwise decompositions have no knowledge of the block structure, mapping information must be maintained for each individual array element at a substantial cost in memory and communication overheads. By comparison, coarse-grain partitionings incur a cost proportional to the number of blocks, which is typically three or four orders of magnitude smaller than the number of array elements.

Once a decomposition has been specified, the details of the data partitioning are hidden from the application. The programmer can change partitioning strategies without affecting the correctness of the underlying code. Thus, LPARX views partitioners as interchangeable, and the application may change decomposition strategies by simply invoking a different partitioning routine.
At the core of LPARX is the concept of structural abstraction. Structural abstraction enables an application to express the logical structure of data and its decomposition across processors as first-class, language-level objects. The key idea is that the structure of the data -- the "floorplan" describing how the data is decomposed and where the data is located -- is represented and manipulated separately from the data itself. LPARX expresses communication and operations on data decompositions using intuitive geometric operations, such as intersection, instead of explicit indexing. Interprocessor communication is hidden by the run-time system, and the application is completely unaware of low-level details. Although the current LPARX implementation is limited to representing irregular, block-structured decompositions, the concept of structural abstraction is general and extends to other classes of applications, such as unstructured finite element meshes [...].
2.2.2 Data Types

LPARX provides the following four basic data types:

- Point, an integer n-tuple representing a point in Z^n;

- Region, an object representing a rectangular subset of array index space;

- Grid, a dynamic array instantiated over a Region; and

- XArray, a dynamic array of Grids distributed over processors.

The Point is a simple, auxiliary data type used to define and manipulate Regions. Element-wise addition and scalar multiplication are defined over Points in the obvious way.
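A minimal sketch of such a Point type, with just the two operations mentioned above; the class and operator names are ours for illustration and are not the actual LPARX declarations.

```cpp
#include <array>
#include <cassert>

// Illustrative N-dimensional integer tuple with element-wise addition and
// scalar multiplication, in the spirit of the LPARX Point.
template <int N>
struct Point {
    std::array<int, N> x{};
    Point operator+(const Point& p) const {
        Point r;
        for (int i = 0; i < N; ++i) r.x[i] = x[i] + p.x[i];
        return r;
    }
    Point operator*(int s) const {
        Point r;
        for (int i = 0; i < N; ++i) r.x[i] = x[i] * s;
        return r;
    }
    bool operator==(const Point& p) const { return x == p.x; }
};
```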
The Region provides the basis for structural abstraction. An n-dimensional Region represents a subset of Z^n, the space of n-dimensional integer vectors. The Region does not contain data elements, as does an array, but rather represents a portion of index space. In the current implementation of LPARX, we restrict Regions to be rectangular; however, the concepts described here apply to arbitrary subsets of Z^n [...]. A Region is uniquely defined by the two Points at its lower and upper corners. We denote the lower bound of a Region R by lwb(R) and its upper bound by upb(R). Although there is no identical construct in Fortran or C, the Region is related to the array section specifiers found in Fortran 90. Unlike Fortran 90 array section specifiers, however, the Region is a first-class object and may be assigned and manipulated at run-time. The concept of first-class array section objects (called domains) was introduced in the FIDIL programming language [...].
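The following self-contained sketch illustrates a first-class 2D region with lower and upper corner points, together with the geometric intersection operation on which a region calculus can be built. The names Region2 and intersect are ours for illustration, not necessarily those of the LPARX library.

```cpp
#include <algorithm>
#include <cassert>

// A rectangular subset of 2D index space, defined by its inclusive lower
// and upper corners -- a first-class object that can be assigned, passed
// around, and intersected at run-time.
struct Region2 {
    int lo[2], hi[2];
    bool empty() const { return lo[0] > hi[0] || lo[1] > hi[1]; }
};

// Geometric intersection: the overlap of two Regions, possibly empty.
// Operations like this let data dependencies be expressed without any
// explicit index arithmetic in application code.
Region2 intersect(const Region2& a, const Region2& b) {
    Region2 r;
    for (int d = 0; d < 2; ++d) {
        r.lo[d] = std::max(a.lo[d], b.lo[d]);
        r.hi[d] = std::min(a.hi[d], b.hi[d]);
    }
    return r;
}
```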
The Grid is a dynamic array defined over an arbitrary rectangular index set specified by a Region. The Grid is similar to a Fortran 90 allocatable array. Each Grid remembers its associated Region, which can be queried at run-time, a convenience that greatly reduces bookkeeping for dynamically defined Grids.* All Grid elements must have the same type; they may be integers, floating point numbers, or any user-defined type or class. For example, in addition to representing a mesh of floating point numbers, the Grid may also be used to implement the spatial data structures [...] common in particle calculations. Grids may be manipulated using high-level block copy operations, described in Section 2.2.5.
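A 1D sketch of the Grid idea: the array remembers the Region it was allocated over, indexing uses global coordinates, and a block copy transfers data over the intersection of two Grids' Regions. This is an illustrative stand-in with invented names, not the LPARX Grid interface.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A dynamic 1D array that remembers its index Region [lo, hi].
struct Grid1 {
    int lo, hi;                  // inclusive global index bounds
    std::vector<double> data;
    Grid1(int l, int h) : lo(l), hi(h), data(h - l + 1, 0.0) {}
    double& at(int i) { return data[i - lo]; }   // global index -> storage
};

// Block copy: transfer src into dst wherever their Regions overlap. In a
// parallel setting this is the operation behind which interprocessor
// communication would be hidden.
void copy_on_intersection(Grid1& dst, const Grid1& src) {
    int l = std::max(dst.lo, src.lo), h = std::min(dst.hi, src.hi);
    for (int i = l; i <= h; ++i)
        dst.at(i) = src.data[i - src.lo];
}
```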
LPARX is targeted towards applications with irregular, block structures. To support such structures, it provides a special array -- the XArray -- for organizing a dynamic collection of Grids. Each Grid in an XArray is arbitrarily assigned to a single processor; individual Grids are not subdivided across processors. The XArray can be viewed as a coarse-grain analogue of a Fortran D array decomposed via mapping arrays, except that XArray elements are themselves arrays (Grids).

The Grids in an XArray may have different origins, sizes, and index sets, but all Grids must have the same spatial dimension. To allocate an XArray, the application invokes the LPARX operation XAlloc with an array of Regions representing the structure of the Grids and a corresponding array of processor assignments (i.e., the floorplan). LPARX provides a default assignment of Grids to processors if none is given. An XArray is intended to implement coarse-grain irregular decompositions; thus, each processor is typically assigned only a few Grids.
LPARX defines a coarse grain looping construct -- forall -- which iterates concurrently over the Grids of an XArray. The semantics of forall are similar to HPF's INDEPENDENT forall [...]: each loop iteration is executed as if it were an atomic operation. In writing a forall loop, the programmer is unaware of the assignment of Grids to processors -- each XArray element is treated as if it were assigned to its own processor -- and the LPARX run-time system correctly manages the parallelism. LPARX also defines a for loop, a sequential version of the forall.

* Compare this to C, which requires the programmer to keep track of bounds for dynamically allocated array storage.

[Figure: three example XArray of Grids structures; the rightmost shows one possible mapping of Grids to processors 1 through 4.]

Figure 2.3: Two examples of an XArray of Grids data structure. The recursive bisection decomposition on the far left is usually employed to balance non-uniform workloads in particle calculations. The structure in the middle is typical of a single level mesh refinement in structured adaptive mesh methods. On the right, we show one possible mapping of XArray elements to processors. Note that the XArray is a container for the Grids and its elements are Grids, not pointers.
The XArray of Grids structure provides a common framework for imple�
menting various block�irregular decompositions of data� This framework is used by
standard load balancing routines such as recursive bisection ��� �see Chapter � and
also by application�speci�c routines� such as the grid generator for an adaptive mesh
calculation ���� �see Chapter ��� Figure �� shows decompositions arising in two
di�erent applications� In each case� the data has been divided into Grids� each repre�
senting a di�erent portion of the computational domain� which have been assigned to
an XArray� The following section provides more detail about how XArrays are used
to organize a parallel computation�
�.�.� Coarse-Grain Data Parallel Computation

Recall that an LPARX application consists of three components: partitioning
routines, LPARX code, and serial numerical kernels. Here we show how these
pieces work together in an application. LPARX provides the programmer with a
simple model of coarse-grain parallel computation:

1. Decompose the computational structure into an array of Regions.

2. Specify an assignment of each Region in (1) to a processor.

3. Create an XArray of Grids representing the data decomposition floorplan generated by steps (1) and (2).

4. Satisfy data dependencies between Grids in the XArray using LPARX's communication facilities (described in the following sections).

5. Perform calculations on the Grids in the XArray in parallel using the coarse-grain forall loop.
The decomposition in (1) may be managed explicitly by the application,
as in generating refinement regions, or by load-balancing utilities that implement
partitioning strategies. The LPARX implementation provides a standard library of
partitioners that implement recursive coordinate bisection [�] and uniform block
partitioning.
The assignment of Regions to processors in (2) provides applications the
flexibility to delegate work to processors. In general, this information will be returned
by the routine which renders the partitions. This step may be omitted, in which case
LPARX generates a default assignment.
In step (3), the application invokes an LPARX operation called XAlloc
which, using the partitioning and the processor assignment information, instantiates
an XArray of Grids implementing the data decomposition. LPARX creates Grids
based on the supplied Region information and assigns them to the appropriate
processors.

After the decomposition and allocation of data, applications typically alternate
between steps (4) and (5). In (4), data dependencies between the Grids in
the XArray are satisfied using LPARX's region calculus and block copy operations,
described in the following sections. Finally, the application computes in parallel on
the Grids in the XArray using a forall loop (5). For each Grid, a numerical routine
is called to perform the computation; the computation executes on a single logical
processing node, which may actually consist of many physical processors. The execution
of forall assumes the Grids are decoupled: they are processed independently
and asynchronously.
�.�.� The Region Calculus

LPARX defines a region calculus which enables the programmer to manipulate
index sets (Regions) in high-level geometric terms. In this section, we describe
the most important region calculus operations: shift, intersection, and grow.

Given a Region R and a Point P, shift(R,P) denotes the Region R translated
by a distance P, as shown in Figure �.�a. The intersection of two Regions is
simply the set of points which the two have in common. The dark shaded area in Figure
�.�b represents the intersection of Regions R and S, written as R * S. Regions are
closed under intersection: the intersection of two Regions is always another Region.
If two Regions do not overlap, the resulting intersection is said to be empty.

Grow surrounds a Region with a boundary layer of a specified width. It
takes two arguments, a Region and a Point, and returns a new Region which has
been extended (or shrunk, for negative widths) in each coordinate direction by the
specified amount. The second argument to grow may also be an integer, in which
case each dimension is grown by the same amount. Figure �.�c shows the Region
resulting from grow(R,1).
�.�.� Data Motion

LPARX coordinates data motion between Grids using two types of block
copy operations: copy-on-intersect and general block copy. Copy-on-intersect copies
data from a Grid into the corresponding elements of another where their Regions
overlap in the underlying integer coordinate system. Of course, all Grids and Regions

(A complete list of all LPARX operations can be found in Table �.�.)
[Figure: four panels (a)-(d) on an integer grid showing Region R, Shift(R, [7,-1]), the intersection R * S, and Grow(R,1).]
Figure �.�: Four examples of LPARX's region calculus. Although shown in 2d for simplicity, these operations generalize readily to higher dimensions. (a) Shift takes a Region and a Point and returns a Region translated by the specified distance. (b) Intersection returns the set of points shared by two Regions. (c) Grow adds a boundary layer to a Region. (d) Data dependencies for a ghost cell region can be calculated easily using the grow and intersection operations; in this example, the darkest Region is grow(R,1) * S.
in the same copy statement must have the same spatial dimension. For Grids G and
H, the statement:

copy into G from H

copies data from H into G where the Regions of G and H intersect. Another form:

copy into G from H on R

where R is a Region, limits the copy to the index space in which all three Regions
intersect. General block copy is similar to copy-on-intersect except that it allows a
shift between the source and destination Regions. The statement:

copy into G on R from H on S

copies data from Region S of Grid H into Region R of Grid G.

The default behavior of both data motion operations is to simply copy data
from the source into the destination. LPARX also provides a reduction form:

copy into (...) from (...) using combine

where combine is a specified commutative, associative reduction function. Instead
of copying the data, LPARX applies combine elementwise to combine corresponding
source and destination data values. For example:

copy into G from H using sum

adds corresponding elements of Grid H to Grid G; portions of G that do not intersect
with H remain unchanged. Section �.� illustrates how this reduction variant is used
to sum force information in a particle application.
We now show how these simple but powerful operations are used to calculate
data dependencies. One common communication operation in scientific codes is the
transmission of data to fill ghost cells, boundary elements added to each processor's
local data partition (see Figure �.�b). The region calculus represents the processor's
local partition as a Region. We grow the Region to define ghost cells and then use
intersection to calculate the Region of data required from another processor. Finally,
a copy updates the ghost region's data values (see Figure �.�d). Recall that copy-on-intersect
copies values that lie in the intersection of the ghost region and interacting
blocks. The calculation of data dependencies involves no explicit computations involving
subscripts, as copy-on-intersect manages all bookkeeping details. The region
calculus is independent of the Grid dimension, and the same operations work for any
problem dimension. All interprocessor communication is managed by the run-time
system and is completely hidden from the user.
�.�.� LPARX Implementation

In this section, we briefly describe the LPARX implementation; further details
are provided in Chapter �.

LPARX has been implemented as a C++ run-time library consisting of approximately
fifteen thousand lines of code (excluding the application-specific API
libraries described in Chapters � and �). The implementation requires no special
compiler other than a standard C++ compiler, and LPARX code may be freely
mixed with calls to other C++, C, or Fortran routines.

LPARX defines C++ classes for Point, Region, Grid, and XArray. Grid
elements may be standard C++ types (e.g. int or double), structures, or other C++
classes. All classes are strongly typed by the number of spatial dimensions; for example,
Region2 represents a 2d Region, Region3 a 3d Region, and so on. Dimension-independent
code is written using an X in the place of the spatial dimension in the
class name (e.g. RegionX) and is translated into dimension-specific code (e.g. RegionX
to Region2) by a preprocessor at compilation time.
In our examples, we will employ LPARX pseudocode instead of actual C++
code for clarity and to separate the semantics of LPARX operations from their current
implementation as a C++ class library. Of course, other implementations of LPARX
are possible. With the exception of minor syntactic differences, the LPARX code and
the actual C++ code are nearly identical.
Data Type  Description

Point: n-tuple representing a point in integer space; used to define and manipulate Regions

Region: represents a subset of array index space; used to describe irregular data decompositions; manipulated with shift, intersect, and grow

Grid: a dynamic array defined over a Region; Grid elements may be any user-defined type; Grids communicate via geometric block copies

XArray: an array of Grids distributed over processors; common framework for irregular block decompositions; coarse-grain execution using the forall loop

Table �.�: A brief summary of the four LPARX data types defined in Section �.�.
�.�.� Summary

LPARX provides run-time mechanisms for user-defined irregular block decompositions.
The structure of a data decomposition, the floorplan describing how
data is to be decomposed across processors, is a first-class, language-level object
which may be manipulated by the application. In contrast, many data parallel languages
such as High Performance Fortran [�] specify data decompositions at compile
time using compiler directives, which force the user to choose from a limited set of
built-in, regular decompositions.

LPARX defines four new data types (see Table �.�): Point, Region, Grid,
and XArray. Its forall loop implements a coarse-grain model of data parallelism in
which an operation is applied in parallel to all Grids within an XArray. Communication
among Grids is expressed using the region calculus abstractions and high-level
block copy operations. LPARX operations are summarized in Table �.�.
Operation  Description

R1 = R2 * R3: Region R1 is the intersection of Regions R2 and R3

P = lwb(R): Point P is the lower bound of Region R

P = upb(R): Point P is the upper bound of Region R

R1 = shift(R2, P): Region R1 is the Region R2 shifted by Point P

R1 = grow(R2, P): Region R1 is the Region R2 extended by an amount P (a Point) in each coordinate dimension

R1 = grow(R2, i): Region R1 is the Region R2 extended by integer amount i in all coordinate dimensions

R = region(G): Region R is the Region associated with Grid G

G(i1, i2, ..., in): Array indexing for an n-dimensional Grid G; returns a Grid element

X(i1, i2, ..., in): Array indexing for an n-dimensional XArray X; returns a Grid

XAlloc(X, n, R, M): Allocate the data for XArray X with n Grids using the floorplan specified by the Array of Regions R and the Array of integers M; if M is omitted, LPARX provides a default mapping

copy into G1 from G2: Copy data from Grid G2 into Grid G1 where their Regions intersect (copy-on-intersect)

copy into G1 from G2 on R: Copy data from Grid G2 into Grid G1 where their Regions intersect with Region R (copy-on-intersect)

copy into G1 on R1 from G2 on R2: Copy data from Region R2 of Grid G2 into Region R1 of Grid G1 (general block copy)

copy into (...) from (...) using f: Reduction form of copy in which the commutative, associative function f is applied elementwise to combine corresponding source and destination elements

forall i1, i2, ..., in in X ... end forall: A coarse-grain data parallel loop that iterates concurrently over the Grids in the n-dimensional XArray X

for i1, i2, ..., in in X ... end for: A sequential loop that iterates over the Grids in the n-dimensional XArray X

Uniform(R, P) and RCB(W, P): External uniform and recursive bisection [�] partitioning routines provided by a standard LPARX library; both return an Array of Regions describing the computational space (indicated by Region R or workload estimate W) decomposed into P Regions representing approximately equal amounts of computational work

Table �.�: This table summarizes the operations defined by LPARX.
�.� LPARX Programming Examples

You know my methods. Apply them.
- Sherlock Holmes, "The Hound of the Baskervilles"

In this section, we illustrate how to use the LPARX mechanisms to parallelize
a simple application: Jacobi relaxation on a rectangular domain. Although this
particular computation is neither irregular nor dynamic, it is easy to explain, and the
techniques described here generalize immediately to the irregular, dynamic applications
of Chapters � and �. Sections �.�.� through �.�.� describe the parallelization
of the Jacobi code; Section �.�.� shows how the techniques used to parallelize the
Jacobi application also apply to dynamic, irregular computations.
�.�.� Jacobi Relaxation

Consider the Laplace equation in two dimensions subject to Dirichlet boundary
conditions:

∇²u = 0 in Ω,   u = f on ∂Ω

where f and u are real-valued functions of two variables, the domain Ω ⊂ R² is a
rectangle, and ∂Ω is the boundary of Ω. We discretize the computation using the
method of finite differences, solving a set of discrete equations defined on a regularly
spaced 2d mesh of size (M + 2) × (N + 2).

The interior of the mesh is defined as:

Region Interior = [1:M, 1:N]

The square bracket notation indicates that we will number the interior points of the
mesh from 1 to M in the x-coordinate and from 1 to N in the y-coordinate. The
interior region will be extended with a boundary region (using grow) to contain the
Dirichlet boundary conditions for the problem, as shown in Figure �.�a.

To parallelize Jacobi relaxation, we decompose the computational domain
into subdomains and assign each subdomain to a processor. A standard blockwise decomposition
for �� processors is shown in Figure �.�b. Each subdomain is augmented
Figure �.�: (a) A finite difference mesh defined over the 2d Region [0:M+1, 0:N+1] with interior [1:M, 1:N]. (b) A blockwise decomposition of the computational space into �� subblocks. The lightly shaded area shows the ghost region for a typical partition.
with a ghost cell region that locally caches either interior data from adjoining subdomains
or Dirichlet boundary conditions (for those subdomains on the boundary).
We refresh these ghost cells before computing on each subdomain. Each processor
then updates the solution for the subdomains it owns; this computation proceeds in
parallel, and each processor performs its calculations independently of the others.
�.�.� Decomposing the Problem Domain

Recall that LPARX does not predefine specific data partitioning strategies;
rather, data partitioning is under the control of the application. One possible partitioning
for this problem is a uniform BLOCK decomposition such as that provided by
HPF. By convention, LPARX expects the partitioner to return an Array of Regions
that describes the uniform partitioning:

Array of Region Partition = Uniform(Interior, P)
Here, our Array represents the standard array type provided by most programming
languages. The partitioner Uniform takes two arguments: the Region to be partitioned
and the desired number of subdomains, P, which is usually the number of
processors. Recall that LPARX defines some common partitioning utilities, such as
Uniform, in a standard library, but the programmer is free to write others.

After determining the partitioning of space, we extend each subdomain with
ghost cells. The exact thickness of the ghost cell region depends on the particulars of
the numerical method. In our case, we will assume a finite difference scheme requiring
a ghost cell region one cell thick. We apply grow to augment each subdomain of
Partition with a ghost region:

Array of Region Ghosts = grow(Partition, P, 1)

The computational domain is now logically divided into an Array of overlapping
Regions called Ghosts. We next allocate an XArray of Grids of Double to
implement this data decomposition. This occurs in two steps. First, we declare the
XArray of Grids structure:

XArray of Grid of Double U

and next we instantiate the storage using LPARX's XAlloc operation:

call XAlloc(U, P, Ghosts)

XAlloc takes three arguments: the XArray to be allocated, the number of elements
to allocate, and an Array of Regions, one Region for each element in the XArray.
We may optionally supply a processor assignment for each XArray component; if no
such processor assignment is specified, as in the code above, then LPARX chooses a
default mapping. To override the default, we provide an Array of integer processor
identifiers, one for each XArray element, as an optional fourth argument to XAlloc:

call XAlloc(U, P, Ghosts, Mapping)

(Grow is overloaded in the obvious way to handle arrays of Regions.)
-- The main routine of the Jacobi relaxation program
function main

  -- Initialize M, N, and number of processors P (not shown)

  -- Partition the computational domain
  Region Interior = [1:M, 1:N]
  Array of Region Partition = Uniform(Interior, P)
  Array of Region Ghosts = grow(Partition, P, 1)

  -- Allocate and initialize data (initialization not shown)
  XArray of Grid of Double U
  call XAlloc(U, P, Ghosts)

  -- Iterate until the solution converges (error check not shown)
  while (the solution has not converged) do
    call relax(U)
  end while

end function

Figure �.�: LPARX code which partitions the computational space, allocates the XArray of Grids structure, and calls the Jacobi relaxation routine (described in Section �.�.�).
Such a mapping may be used to better balance workloads or to optimize interprocessor
communication traffic for a particular hardware interconnect topology.

After instantiating the XArray, we are ready to compute. The main computational
loop iterates until the solution meets some (unspecified) error criterion:

while (the solution has not converged) do

call relax(U)

end while

where relax, described in Section �.�.�, performs the computation. The main routine
of the Jacobi code is summarized in Figure �.�.

Note that we may change the data partitioning scheme at any time without
affecting the correctness of the code. For example, the "box-like" partitioning may
-- Execute one iteration of Jacobi relaxation
function relax(XArray of Grid of Double U)

  -- Refresh the ghost cell region with the newest values
  call FillPatch(U)

  -- Compute in parallel over all of the subblocks
  forall i in U
    -- Call a numerical kernel to do the computation
    call smooth(U(i))
  end forall

end function

Figure �.�: The Jacobi relaxation routine invokes FillPatch to fetch data values from adjacent processors and then calls smooth, a computational kernel.
be replaced with a strip decomposition or even a recursive bisection decomposition
simply by calling a different partitioner. No other changes would need to be made.
Furthermore, the computational domain need not be restricted to a rectangle; it may
be an "L"-shaped region or, in general, any irregular collection of blocks.
�.�.� Parallel Computation

Function relax performs the major tasks in solving Laplace's equation: it
invokes subroutine FillPatch (described in the following section) to refresh the ghost
cell regions and then calls the computational kernel smooth (not shown). The code
for relax is shown in Figure �.�. The forall loop computes in parallel over all of
the Grids of U. Smooth is called to perform the computation for each U(i), the ith
Grid of XArray U.

Typically, computational kernels are written in a language such as Fortran
which might not understand the concept of an LPARX "Grid". LPARX provides
a simple interface for calling C++, C, or Fortran which enables the programmer to
extract Grid data in a form understandable by the numerical routine. These three
languages require only a pointer to the Grid data and the extents of the associated
-- Communicate boundary data between neighboring partitions
function FillPatch(XArray of Grid of Double U)

  -- Loop over all pairs of grids in U
  forall i in U
    -- Mask off the ghost cells (copy interior values only)
    -- Function region() extracts the region from its argument
    Region Inside = grow(region(U(i)), -1)
    for j in U
      -- Copy data from intersecting regions
      copy into U(j) from U(i) on Inside
    end for
  end forall

end function
Figure �.�: FillPatch updates ghost cell regions of Grid U(j) with overlapping non-ghost cell data from adjacent Grids U(i).
Region. However, interoperability with languages such as High Performance Fortran
is still an open research question and is addressed in Section �.�.
�.�.� Communicating Boundary Values

The final piece of Jacobi code is FillPatch, shown in Figure �.�. This
routine updates the ghost cell regions of each subgrid with data from the interior (non-ghost
cell) sections of adjacent subgrids. For every pair of Grids U(i) and U(j), it
copies into the ghost cells of U(j) the overlapping non-ghost cell data from U(i). The
outer loop is a parallel forall; processors calculate data dependencies only for those
Grids they own. Aggregate data motion between Grids is handled through LPARX's
copy-on-intersect primitive. FillPatch employs grow with a negative width to peel
away the ghost cell region to obtain the Inside of Grid U(i).

For applications in which the structure of the N subgrids is simple and static,
as in our example, this O(N²) algorithm is naive because it examines all possible intersections.
However, the communication structure for dynamic irregular computations
is neither static nor regular and thus cannot be easily predicted. In localized computations
such as Jacobi, many of these O(N²) intersections will be empty; in such
cases, the LPARX run-time system does not communicate data. We will discuss how
LPARX eliminates such unnecessary communication in Chapter �.
FillPatch works for any problem dimension. In fact, none of our code
examples in this section have assumed a particular spatial dimension. Moreover,
we can replace Double with any valid data type to handle different types of Grids,
such as Grids of particles employed by our particle API (see Section �.�). Finally,
FillPatch does not assume a simple uniform partitioning; this same code will work
for any style of data partitioning, regular or irregular.
�.�.� Dynamic and Irregular Computations

We have used LPARX to develop a straightforward parallel implementation
of Jacobi relaxation, a simple application requiring only a uniform, static data decomposition.
In this section, we show how the LPARX parallelization mechanisms can be
used to address dynamic, irregular computations such as structured adaptive mesh
methods [�].

As described in Chapter �, structured adaptive mesh methods represent the
solution to partial differential equations using a hierarchy of irregular but locally
structured meshes. Our adaptive mesh API implementation represents each level of
this adaptive mesh hierarchy as an XArray of Grid. Unlike the Jacobi example, each
mesh level typically consists of an irregular collection of blocks. Instead of the uniform
block partitioner, the application calls error estimation and regridding routines
which perform data decomposition at run-time. FillPatch works without change (see
Section �.�.�) because LPARX's structural abstractions apply equally well to both
uniform decompositions and irregular block structures. Of course, the adaptive application
adds other routines to manage the transfer of numerical information between
levels of the hierarchy (e.g. interpolation operators). The key observation, however,
is that the LPARX abstractions used in the Jacobi code generalize immediately to
dynamic, irregular computations.
�.� Related Work

Yes, I get by with a little help from my friends.
- Lennon and McCartney, "With a Little Help from My Friends"

In this section, we compare the LPARX approach with other related work.
We divide our survey into three areas: structural abstraction (Section �.�.�), parallel
languages (Section �.�.�), and run-time support libraries (Section �.�.�).
�.�.� Structural Abstraction

LPARX's Region abstraction and its region calculus are based in part on
the domain abstractions explored in the scientific programming language FIDIL [�].
FIDIL's domain calculus provides operations such as union and intersection over arbitrary
index sets; however, FIDIL is intended for vector supercomputers and therefore
does not address data distribution. LPARX borrows a subset of FIDIL's calculus
operations to provide the structural abstractions for data decomposition and interprocessor
communication on multiprocessors.

Whereas FIDIL supports the notion of arbitrary non-rectangular index sets,
LPARX restricts index sets to rectangles. A prototype of LPARX, called LPAR
[�], supported FIDIL-style regions. However, we found that such generality,
and the associated complexity and run-time performance penalty, was unnecessary
for the class of irregular block-structured applications targeted by LPARX. We believe
that such abstractions, if needed, should be included as a separate type.

FIDIL's irregular array structure, called a Map, is used to represent both
meshes and arrays of meshes. We found that the Map's overloaded functionality complicated
the programmer's model. Therefore, LPARX implements the Map using an
XArray and a Grid, and it distinguishes between concurrent computation (over the
Grids in an XArray) and sequential computation (over the elements of a Grid).

Crutchfield et al. independently developed similar Region abstractions based
upon FIDIL for vector architectures [�]. Based on this framework, they have developed
domain-specific libraries for adaptive mesh refinement applications in gas dynamics
[�]. Their adaptive mesh refinement libraries have been parallelized using
our software infrastructure [�] (see Section �.�.�).

The array sublanguage ZPL [�] employs a form of region abstraction.
ZPL does not explicitly manage data distribution, which it assumes is handled by another
language. It uses its region constructs to simplify array indexing and as iteration
masks; in contrast, LPARX employs Regions to specify run-time data decompositions
and express communication dependencies. ZPL regions are not first-class, assignable
objects as in LPARX.
Building on the LPARX region calculus and structural abstraction, Fink
and Baden [�] have developed a run-time data distribution library that provides an
HPF-like mapping strategy with first-class, dynamic distribution objects supporting
both regular and irregular block decompositions. In their system, all decisions about
data decomposition and mapping are made at run-time, providing support for distributions
that are unknown at compile-time or which may change during execution.
Currently, compiled languages such as HPF support neither general block-irregular
decompositions nor run-time data distribution.

The Structural Abstraction (SA) parallel programming model [�] extends
the LPARX abstractions with a new data type (an unstructured Region) to address
other classes of irregular scientific applications, such as unstructured finite element
problems and irregularly coupled regular meshes [�]. The goal of SA is to unify
several previous domain-specific systems, including LPARX, multiblock PARTI [�],
and CHAOS [�].
�.�.� Parallel Languages

The parallel programming literature describes numerous languages, each of
which provides facilities specialized for its own intended class of applications. In the
following survey, we evaluate various parallel languages on their ability to solve the
dynamic, block-irregular problems targeted by LPARX.
Data Parallel Fortran Languages
High Performance Fortran (HPF) [�] is a data parallel Fortran which combines
the array operations of Fortran 90, a parallel forall loop, and data decomposition
directives based on the research languages Fortran D [�] and Fortran 90D
[�]. It is quickly becoming accepted by a number of manufacturers as a
standard parallel programming language for scientific computing. HPF has been targeted
towards regular, static applications such as dense linear algebra but provides
little support for irregular, dynamic computations [�]. HPF represents data decompositions
using an abstract index space called a template. Arrays are mapped to
templates, and then templates are decomposed across processors.

One limitation of templates, as well as all other HPF data decomposition
entities, is that they are not first-class, language-level objects. Rather, they exist
only as compile-time directives, which are essentially comments to the compiler.
Thus, the application has limited control over run-time data distribution. Although
HPF supports dynamically allocatable arrays and pointer arrays, their utility is limited
at present because the application has little run-time control over how array
data is distributed; the processor distribution must be known at compile-time. This
has motivated the High Performance Fortran Forum to consider new mechanisms for
dynamic applications. HPF defines a redistribute directive that allows the application
to change array decompositions, but data redistribution is local to a program
unit and cannot be passed back to the calling routine. Furthermore, arrays may
be decomposed using only a limited set of regular, predefined distribution methods
(e.g. a uniform block decomposition). HPF does not yet support user-defined irregular
decompositions.
Fortran D [�] relies on the same data distribution model as HPF and
therefore suffers the same limitations for dynamic problems. One distinguishing feature
of Fortran D is its support for pointwise mapping arrays in which individual
array elements may be mapped to arbitrary processors [�]. An application could
in theory construct block-irregular decompositions by mapping each array element in
a block to the same processor. However, such an element-by-element decomposition
cannot exploit the inherent block structure of the application and must instead maintain
mapping information for each array element, at substantial cost in both memory
and communication overheads. Pointwise mappings are therefore inappropriate for
block-irregular applications.

To avoid the limitations of the HPF decomposition model, Vienna Fortran
[�] defines more general dynamic data distribution directives. However, Vienna
Fortran restricts the types of irregular data decompositions available to the application.
It supports pointwise mappings as in Fortran D (with the same limitations for
block-structured methods) and also tensor products of 1d block-irregular decompositions.
These mechanisms alone cannot describe the irregular blocking structures that
arise in adaptive mesh refinement and recursive coordinate bisection [�].
Data Parallel C++ Languages

The pC++ [�] programming language is a data parallel extension of
C++. It implements a "concurrent aggregate" [�] model in which a parallel operation
is applied simultaneously to all elements of a data aggregate called a "collection".

(Information about the second High Performance Fortran standardization effort can be found at World Wide Web address ftp://hpsl.cs.umd.edu/pub/hpf_bench/index.html.)
Each element of a collection may be a complicated C++ object. This form of coarse-grain
data parallelism is similar to LPARX's forall loop acting on Grid elements of
an XArray collection.
pC�� aligns and distributes collections across processors using the same
model as HPF� however� pC�� de�nes �rst�class Processor� Distribution� and
Align objects� Because decompositions may be easily modi�ed at run�time� pC��
allows more �exibility than HPF for dynamic applications� Although pC�� does not
currently support irregular decompositions� classes similar to LPARX�s Region� Grid�
and XArray could be written in pC���
The C** [?] language defines a coarse-grain, concurrent aggregate model of parallelism similar to that of pC++. However, it does not provide explicit data decomposition mechanisms; the application has no control whatsoever over data distribution.
Task Parallel Languages

In the previously described data parallel models, programs apply a sequence of operations to an array or collection of data objects distributed across processors. The task parallel programming model takes a different approach in which programs consist of a number of asynchronous, independent, communicating parallel processes. Task parallel languages such as CC++ [?], CHARM [?], Charm++ [?], Fortran M [?], and Linda [?] define a set of mechanisms that coordinate process execution and communication among autonomous tasks. Task parallelism provides no explicit support for data decomposition.

Task parallelism is ideally suited for computations integrating various heterogeneous operations, such as a multidisciplinary simulation coordinating various independent components [?]. However, it is inappropriate for the coarse-grain scientific applications addressed by LPARX, which are more naturally expressed in a coarse-grain data parallel fashion (see Section [?]).
Split-C

Split-C [?] is a parallel extension to C for distributed memory multiprocessors. Split-C gets its name from its split-phase communications model: it allows the programmer to overlap communication and computation through a two-phase data request. The application initiates a request for data and then computes until the data arrives. Synchronization primitives ensure that communication has completed.

Split-C supports fine-grain data accesses to a global address space through a special type of pointer called a "global pointer." Dereferencing a global pointer results in interprocessor communication. By distinguishing global pointers from local pointers (i.e. pointers within a single address space), Split-C provides a simple but realistic cost model for interprocessor communication. Split-C also supports data layout for regular problems through the "spread array," which is distributed across processors as in HPF's uniform BLOCK and BLOCK CYCLIC decompositions. There is no support for irregular arrays.
Although Split-C defines efficient communication mechanisms for executing interprocessor communication, it does not help the programmer determine the schedule of communication or manage the associated data structures. The Split-C run-time system does not eliminate duplicate requests for the same data item, nor does it aggregate messages, two optimizations provided by the CHAOS run-time system [?]. For example, although the numerical kernel for the Split-C EM3D application [?] is only about ten lines of code, EM3D would require several hundred lines of initialization code to calculate data dependencies and manage ghost cells.
Applicative Languages

SISAL [?] and NESL [?] are applicative programming languages which restrict functions to be free of side effects, a requirement that simplifies the work of the compiler and exposes more potential parallelism. SISAL and NESL rely on sophisticated compiler technology to analyze the program and automatically decompose data and schedule parallel tasks. While the automatic detection of parallelism is extremely attractive, these languages have not yet demonstrated that the compiler alone can extract sufficient information from the program to efficiently distribute data for dynamic, irregular problems on message passing architectures.

(Applicative languages require that the value of any expression depend only on the value of each constituent subexpression and not on their order of evaluation. Functions are not allowed to modify global data.)
Run-Time Support Libraries

The CHAOS (formerly PARTI) [?] and multiblock PARTI [?] libraries provide run-time support for data parallel compilers such as HPF and Fortran D [?]. Both libraries support an "inspector-executor" model for scheduling communication at run-time. In the inspector phase, the application computes the data motion required to satisfy data dependencies and saves the resulting communication pattern in a "schedule." The executor later uses this schedule to fetch remote data values. Schedule generation can be thought of as "run-time compilation" to compute data dependencies that cannot be known at compile-time. CHAOS and multiblock PARTI optimize schedules to minimize interprocessor communication; standard optimizations include eliminating duplicate requests for the same remote data item and aggregating many small messages into a single large message. Often, the cost of creating a communications schedule can be amortized over many uses if data dependencies do not change.
CHAOS implements pointwise mapping arrays for unstructured calculations such as sweeps over finite element meshes and sparse matrix computations [?]. It has also been used to parallelize portions of the CHARMM molecular dynamics application [?]. The Fortran D run-time system employs CHAOS to support Fortran D's mapping arrays [?]. Recently, CHAOS has been extended to support unstructured applications consisting of complicated C++ objects [?]. However, recall that such unstructured representations are inappropriate for the irregular but structured applications targeted by LPARX.
The multiblock PARTI library is targeted towards block-structured applications. It supports the uniform BLOCK, BLOCK CYCLIC, and CYCLIC array decompositions of HPF and has been used in the run-time system for the Fortran 90D compiler [?]. Multiblock PARTI defines canned routines that fill ghost cells and copy regular sections between arrays. Although it has been employed in the parallelization of computations with a small number of large, static blocks (e.g. irregularly coupled regular meshes [?]), multiblock PARTI has not been applied to problems with a large number of smaller, dynamic blocks, such as the structured adaptive mesh problems targeted by LPARX.
Quinlan has developed a parallel C++ array class library called P++ [?] that supports fine-grain data parallel operations on arrays distributed across collections of processors. P++ automatically manages data decomposition, interprocessor communication, and synchronization. In contrast to the fine-grain parallelism of P++, LPARX employs coarse-grain parallelism, which is a better match to current coarse-grain message passing architectures because it allows more asynchrony between processors. Indeed, to improve the efficiency of the fine-grain model, Parsons and Quinlan [?] are developing run-time methods for automatically extracting coarse-grain tasks from P++.
The POOMA (Parallel Object Oriented Methods and Applications) project at Los Alamos National Laboratory is developing a parallel run-time system for scientific simulations. When completed, it will support arrays (as in P++), matrices, particle methods, and unstructured meshes. POOMA employs a layering strategy (similar in philosophy to our own) in which libraries at higher levels in the abstraction hierarchy provide more application-specific tools than lower layers.
PETSc (Portable Extensible Tools for Scientific Computing) [?] is a large toolkit of mathematical software for both serial and parallel scientific computation. It targets more "traditional" mathematical algorithms for sparse and dense matrices, including Krylov iterative methods, linear and nonlinear system solvers, and some (sequential) partial differential equation solvers for finite element and finite difference schemes [?]. The toolkit employs a data-structure-neutral implementation which permits users of the numerical routines to use their own application-specific data structures. Although PETSc does not currently provide the irregular array structures needed by structured multilevel adaptive mesh applications (see Chapter [?]), it may be possible to extend PETSc by integrating LPARX's XArray and Grid abstractions into the PETSc data-structure-neutral framework.

(Papers on the POOMA project have not yet been published; information is available from their World Wide Web address http://www.acl.lanl.gov/PoomaFramework.)
Multipol [?] is a run-time library of distributed data structures designed to simplify the implementation of irregular problems on distributed memory architectures. It supports a number of non-array data structures such as graphs, unstructured grids, hash tables, sets, trees, and queues. However, Multipol does not currently support the type of irregular arrays employed by LPARX applications.
Analysis and Discussion

Every great scientific truth goes through three stages. First, people say it conflicts with the Bible. Next, they say it had been discovered before. Lastly, they say they always believed it.

— Louis Agassiz
LPARX is a portable programming model and run-time system which supports coarse-grain data parallelism efficiently over a wide range of MIMD parallel platforms. LPARX's abstractions enable the programmer to reason about an algorithm at a high level, and we have used it as a foundation for building APIs for structured adaptive mesh methods and particle methods. Its structural abstraction enables the application to manipulate data decompositions as first-class objects. LPARX is intended for applications with changing non-uniform workloads, such as particle calculations, and for computations with dynamic, block-irregular data structures, such as structured adaptive mesh methods. Its philosophy is that data partitioning for such irregular algorithms is heavily problem-dependent and therefore must be under the control of the application.
LPARX provides four new data types: an integer vector called a Point, an index set-valued object called a Region, a dynamic array called a Grid, and an array of distributed Grids called an XArray. Efficient high-level copies between Grids hide interprocessor communication and low-level bookkeeping details. LPARX supports a coarse-grain data parallel model of execution over XArrays via the forall loop.
In the following sections, we analyze in more detail the contributions and limitations of the LPARX approach.
Structural Abstraction

Perhaps the greatest contribution of LPARX is the concept of "structural abstraction," that is, the ability to represent and manipulate decompositions as first-class, language-level objects separately from the data. Instead of supporting only a limited set of predefined data distribution strategies, LPARX provides the application with a framework for creating its own problem-specific decompositions. To our knowledge, LPARX is the first and only parallel system that efficiently supports arbitrary dynamic, user-defined block-irregular data decompositions.
LPARX's region calculus operations express data dependencies in geometric terms, independently of the spatial dimension and data decomposition. Dimension independence means that the same data decomposition and communications code can be used for 2D and 3D versions of an application. The programmer can develop a simpler 2D version of the problem on a workstation and, when confident that the code has been debugged, apply the computational resources of a parallel machine to the full 3D application. Indeed, we adopted this approach in the design and implementation of the structured adaptive mesh API library and application described in Chapter [?].
While we have emphasized the utility of structural abstraction for multiprocessors, these ideas also apply to single processor systems. Many scientific applications, such as structured adaptive mesh methods, exhibit irregular data structures and irregular communication patterns independent of the parallelism. The region calculus provides a powerful methodology for describing and managing such irregularity.
Limitations of the Abstractions

The design of any software system involves a trade-off between generality and specific mechanisms supporting a particular problem class. We believe that LPARX strikes a good balance. Its parallelization mechanisms are sufficiently general to support both particle methods and structured adaptive mesh methods. Of course, there are many applications that LPARX does not address. LPARX does not support the unstructured methods targeted by CHAOS [?], nor does it provide the dynamic irregular data types of Multipol [?]. It cannot handle the task parallelism of CC++ [?] or Fortran M [?]. LPARX applies only to problems with irregular, block-structured data exhibiting coarse-grain data parallelism. Recent work with the Structural Abstraction (SA) model [?] extends the LPARX ideas to address other classes of irregular scientific applications (e.g. unstructured methods).
Another limitation of LPARX is that its representation of a data decomposition may not necessarily match the programmer's view of the data. LPARX's XArray and Grid abstractions are intended to support dynamic, block-irregular computations, and this representation may not be the most natural one for some applications. For example, particle applications require a non-uniform decomposition of space to balance workloads. Programmers do not care that the non-uniform decomposition is represented using an XArray of Grids; they are only interested in accessing particle information from a particular region of space. The solution in this case is to use LPARX's parallelization facilities as a basis for application-specific APIs that hide such details (see Chapter [?]).
One final limitation of LPARX is that each Grid may be assigned to only one logical processor. (Recall that a single logical processor may actually consist of many physical processors.) Currently, this restriction has little practical impact, as most numerical kernels are written in sequential Fortran 77. However, with the upcoming availability of parallel HPF numerical routines, it will become increasingly important that LPARX support hierarchical parallelism, in which LPARX manages communication and data decomposition for irregular collections of arrays that are, in turn, split across processor subsets [?] with different numbers of logical processors. Fortunately, LPARX's current restriction is easy to lift: we simply allow each Grid to be assigned to a subset of processors. However, calling HPF from LPARX introduces some interesting language interoperability issues (see Section [?]).
Shared Memory

At first glance, shared memory multiprocessors with coherent caches might appear easier to use than message passing multiprocessors, because the programmer could allow the hardware caching mechanisms to manage data locality. Unfortunately, this is not always the case [?]. Experiments with the Wisconsin Wind Tunnel shared memory simulator [?] indicate that the explicit management of data locality can dramatically improve performance for dynamic scientific applications [?]. Instead of relying on the hardware cache coherence mechanism alone, irregular calculations employ specialized communication scheduling techniques [?] similar to those pioneered in CHAOS [?]. Thus, efficient shared memory implementations require the same memory management techniques as efficient message passing implementations; it is not sufficient to rely on automatic hardware caching mechanisms. LPARX provides the application with the explicit, high-level mechanisms needed to efficiently manage data locality within the memory hierarchy.
Coarse-Grain Data Parallelism

Recall from Section [?] that LPARX separates the expression of parallelism (data decomposition, communication, and parallel execution) from numerical computation. To LPARX, numerical work is performed by sequential processing nodes. This execution model matches the systems architecture of most message passing multiprocessors, which typically consist of powerful compute nodes connected via a communications network. LPARX manages the parallelism across nodes and the interprocessor communication between nodes, and the numerical routines handle computation on a single node.
Note that a processing "node" may actually consist of multiple physical processors, providing a simple form of hierarchical parallelism. For example, some of the nodes on newer Intel Paragons actually contain two compute processors (ignoring the third processor normally reserved for communication). Programmers can annotate Fortran numerical routines running on the Paragon to take advantage of this second processor.

There are two main advantages to separating the management of parallelism from numerical computation: (1) performance and (2) software re-use. LPARX's model enables programmers to tune numerical kernels without concern for the parallel structure of the application. Code may be optimized to take advantage of specialized node characteristics, such as multiple processors (as on the Intel Paragon), cache sizes, or vector units (as on the Cray C90). Efficient parallel programs start with the efficient use of node processors.
Numerical routines may be written in any language, enabling LPARX to leverage mature sequential compiler technology. Existing optimized kernels may be used, often without change, in parallel applications. Programmers may use the language which is most appropriate; for example, Fortran, in spite of its limitations, provides a natural and simple syntax for array-based computation.

The primary disadvantage of external numerical routines is language interoperability, which we discuss next.
Language Interoperability

Numerical routines in LPARX applications are typically written in Fortran, which does not understand the concept of an LPARX "Grid." Language interoperability addresses the question of how to interface between two different languages. Recall that interoperability is not difficult for sequential languages such as Fortran: calling Fortran requires only a pointer to the Grid data (passed to Fortran as an array) and the dimensions of the associated Region. By default, LPARX adopts Fortran's column-major array ordering convention. Language interoperability for High Performance Fortran, however, is substantially more involved [?].
Although HPF defines an interface to subroutines written in other languages through the notion of "extrinsic procedures," it does not address how other languages may call routines written in HPF. One difficult problem is how to communicate the representation of distributed data between LPARX and HPF. HPF arrays are considerably more complicated than Fortran or C arrays. Fortran arrays of a given type are completely described by three items: (1) the starting location of the array in memory, (2) the bounds of the array, and (3) the ordering of the array elements (i.e. column-major). In comparison, elements of an HPF array may be distributed across processors, aligned to other arrays, and ordered in various ways (e.g. BLOCK or CYCLIC). To call HPF, LPARX must:
- understand how HPF represents decomposed arrays,
- allocate HPF arrays of a particular decomposition and alignment, and
- pass array structure information into HPF numerical routines.
Unfortunately, the High Performance Fortran specification [?] does not define a standard interface for external languages; instead, it allows manufacturers to develop their own external array representations. The Parallel Compiler Runtime Consortium (PCRC) is developing standard language interoperability mechanisms between run-time libraries, task parallel languages, and data parallel compilers [?]; however, interfacing to HPF in a portable manner is still an open research question.
Communication Model

When designing LPARX, we determined that the basic communication mechanism would be a block copy between two individual Grids. The disadvantage of this mechanism is that it reveals little about the global communication structure among all interacting Grids, limiting opportunities for communication optimizations.
One possible solution [?] employs the communication schedule building techniques of Saltz [?], in which communication is split into two phases: an inspection phase and an execution phase. In the inspection phase, processors build a schedule describing the communication pattern. In our case, the schedule would be built using operations similar to LPARX's block copies between Grids. Communication only occurs when this schedule is later "executed." Typically, the application saves schedules for later re-use.
The advantage of this approach is that, before communication begins, all processors have prior knowledge of the global communication pattern. Thus, they can perform optimizations to minimize communication overheads, such as pre-allocating message buffers and aggregating messages. The lack of global knowledge in the LPARX implementation results in communication overheads (see Section [?]) which could be reduced using schedules [?].
Schedules introduce a variety of interesting implementation issues. How do we keep track of the vast number of schedules in complicated dynamic applications? The structured adaptive mesh calculation described in Chapter [?] would require perhaps forty different active communication schedules. Such bookkeeping facilities are not provided by CHAOS [?] and multiblock PARTI [?], which assume that schedules are managed either by the compiler or the user. Since communication dependencies change, schedules will need to change. How do we know when to re-calculate a schedule? These are open questions for future research.
Future Work

LPARX currently addresses only those applications with irregular but structured data decompositions. The Structural Abstraction (SA) model [?] extends the LPARX ideas to other classes of irregular scientific applications. SA has not yet been implemented, however, and its implementation will require the unification of three different run-time support libraries: LPARX, CHAOS [?], and multiblock PARTI [?].
The acceptance of High Performance Fortran by the scientific computing community introduces a number of interesting research issues. How will LPARX and other languages and run-time systems interface to HPF? The HPF standardization committee has not yet defined the portable language interoperability mechanisms required so that other languages may call external routines written in HPF. The Parallel Compiler Runtime Consortium has begun standardization efforts, but their work is far from finished. Furthermore, LPARX does not yet support multiple processor owners per Grid, limiting its ability to exploit processor subsets [?].
Finally, more research remains on how to integrate communication schedules into the LPARX model. LPARX will require new forms of run-time support to manage the numerous changing communication schedules employed by dynamic, irregular applications.
Chapter [?]

Implementation Methodology

I really hate this damned machine;
I wish that they would sell it.
It never does quite what I want,
But only what I tell it.

— Dennie L. Van Tassel, "The Compleat Computer"
Introduction

In Chapter [?] we introduced the LPARX parallel programming model. In this chapter, we describe the implementation methodology and the set of programming abstractions used in the development of the LPARX run-time system.
As illustrated in Figure [?], the LPARX implementation is based upon three different software layers. At the very bottom of the software infrastructure is a basic portable message passing system called MP++. Built on top of the message passing layer, Asynchronous Message Streams (AMS) provides high-level abstractions for asynchronous interprocessor communication that hide low-level details such as message buffer management. The Distributed Parallel Objects (DPO) layer extends AMS's communication mechanisms with distributed object naming and object-to-object communication facilities. We have implemented MP++, AMS, and DPO as C++ class libraries.
[Figure omitted: a layer diagram showing the Adaptive Mesh and Particle APIs and their applications above LPARX, the implementation abstractions, and the message passing layer.]

Figure [?]: The LPARX run-time system is based upon the following three levels of the software infrastructure: a message passing library called MP++, Asynchronous Message Streams (AMS), and Distributed Parallel Objects (DPO).
While we will emphasize the use of the DPO and AMS mechanisms in the design of the LPARX run-time system, we note that these facilities may be useful in other application domains. Many scientific methods, such as tree-based algorithms in N-body simulations [?], rely on elaborate, dynamic data structures and exhibit unpredictable, unstructured communication patterns. The implementation of such numerical methods would be greatly simplified using the run-time support of DPO and AMS.

This chapter is organized as follows. We begin with the motivation behind the DPO and AMS abstractions and discuss related work. In Section [?] we describe the MP++, AMS, and DPO mechanisms. Implementation details and performance overheads are presented in Section [?]. Finally, we conclude in Section [?] with an analysis and discussion of this work.
[Figure omitted: five communicating objects and the corresponding communication time-line ending at a barrier.]

Figure [?]: LPARX programs are modeled as a number of objects (Grids) with asynchronous and unpredictable communication patterns. All processors must wait at a global synchronization barrier until all interprocessor communication has terminated.
Motivation

For the purposes of the LPARX implementation and run-time system, we characterize LPARX programs as follows:

- An LPARX program consists of a relatively small number (e.g. tens to hundreds) of large, complicated objects (Grids), each of which is owned by a particular processor.
- Communication between these objects is asynchronous and unpredictable; that is, Grids do not know when, or even if, communication will occur. Communication between Grids is specified via the LPARX copy operations.
- Communication phases containing LPARX copies are terminated by a global barrier synchronization that ensures that all interprocessor communication has finished.
This execution model is illustrated in Figure [?]. The leftmost figure shows five objects with unpredictable and asynchronous communication patterns. Communication times between objects are presented in the time-line view on the right. Upon reaching the global synchronization barrier, processors wait until all interprocessor communication has terminated.
For most of the remainder of this chapter, we will assume that this asynchronous model accurately reflects the run-time characteristics of real LPARX applications. However, in Section [?], we will see that the assumption of asynchronous and unpredictable interprocessor communication is unnecessarily general. In fact, we can predict communication patterns between Grids using schedules [?], and we can exploit this knowledge to reduce run-time overheads.
Developing asynchronous code directly on top of a message passing library can be tedious and error-prone, as the programmer is responsible for a number of low-level activities:

- The message passing model forces the programmer to explicitly manage message buffers. The programmer must ensure that buffers are sufficiently large to hold all message information. Buffers must be packed and unpacked, often with data of various types, sizes, and alignments. Such message buffer management can be particularly challenging for complicated objects, since storage requirements for message buffer space may not be known in advance.
- Every message send must be matched by a corresponding message receive on the appropriate processor. While this is not generally difficult in applications where communication patterns are known, asynchronous applications do not usually know when (or even if) messages are expected to arrive.
- To implement global synchronization barriers, the application must detect when all interprocessor communication has terminated.
To alleviate the burden of implementing LPARX, we introduce two levels of intermediate abstractions between LPARX and the message passing library: Distributed Parallel Objects (DPO) and Asynchronous Message Streams (AMS).
Related Work

The abstractions provided by the Asynchronous Message Stream and Distributed Parallel Objects libraries build on a number of ideas originally developed by the concurrent object oriented programming community. The notion of communicating objects, or "actors," was first described by Hewitt [?] and then further developed by Agha [?]. Actors are concurrent objects which communicate with each other via messages. Actors execute in response to messages, and each actor object may contain several concurrently executing tasks. Actor-based languages include ABCL [?], Cantor [?], and Charm++ [?].
Implementations of actor languages require complicated compilation strategies and sophisticated run-time support libraries, such as the Concert system [?] for fine-grain object management. Because of this complexity, we have not based the LPARX run-time system on an existing concurrent object-based language. Instead, we borrowed features that were specifically needed for the LPARX implementation. For example, DPO's distributed object naming mechanisms and its notion of primary and secondary objects (described in Section [?]) are based in part on the distributed facilities described by Deshpande et al. [?]. The AMS abstractions combine ideas from Active Messages [?], asynchronous remote procedure calls [?], and the C++ I/O stream model [?].
Another related paradigm, developed by the distributed systems community, is virtual shared memory, which provides the illusion of a single, shared, coherent address space for systems with physically distributed memories. Virtual shared memory models include page-based [?] and object-based systems [?]. Page-based virtual shared memory enforces consistency at the level of the memory page, typically one to four thousand bytes. Because such a coarse page granularity results in poor performance due to false sharing, object-based systems provide consistency at the level of a single user-defined object. These systems are inappropriate for our implementation for two reasons. First, they require complicated and expensive operating system and compiler support that is currently unavailable on production multiprocessor architectures. Second, the virtual shared memory paradigm implements a read-modify-write model (similar to cache lines and virtual memory pages) that results in unnecessary interprocessor communication: to modify an object, a processor must first read the entire object into local memory, modify it, and then write it back. In contrast, DPO and AMS communicate only the message data necessary to modify an object.
The designers of the pC++ run-time support library [?] address some of the same implementation issues as LPARX. In fact, their model of parallel execution for the pC++ run-time system is very similar to ours described in Section [?]. There are two important differences: (1) the pC++ run-time system is supported by a compiler, and (2) pC++ objects may be fine-grain objects. The design of DPO assumes that programs contain a relatively small number of large, coarse-grain objects, which enables DPO to replicate object naming information across processors, whereas pC++ must distribute such data. Our implementation eliminates the costly communication needed to translate object names at the cost of additional, although acceptable, memory overheads (see Section [?]).
Active Messages [?] is an asynchronous communication mechanism which, like asynchronous remote procedure calls [?], sends a message to a specified function that executes on message arrival. AMS combines this asynchronous message delivery mechanism with the concept of a message stream [?] to hide message buffer management details. Active Messages is optimized for message sizes of only a few tens of bytes, whereas AMS messages are typically hundreds or thousands of bytes long. AMS does not assume efficient fine-grain message passing facilities, which are not currently available on most parallel architectures, and requires only basic message passing support. The implementation of the CHAOS++ system for unstructured collections of objects [?] employs a similar abstraction of "mobile objects" that define packing and unpacking operations similar to those of AMS.
�
Layer   Facilities

DPO     name translation for distributed objects;
        communication between objects;
        control over object ownership

AMS     message stream abstraction hides buffering details;
        communication between handlers on processors;
        global synchronization barriers

MP++    point-to-point message sends and receives;
        collective communication (broadcasts and reductions)

Table [...]: A brief summary of the facilities provided by the three LPARX implementation layers: DPO, AMS, and MP++.
Implementation Abstractions

Nothing you can't spell will ever work.
– Will Rogers
The implementation of the LPARX run-time system was influenced by a
number of design considerations. First, the implementation must be portable across
a wide range of MIMD parallel platforms and yet provide good performance. It
should not rely on architecture-specific facilities, such as fine-grain message passing
(e.g. Active Messages [...]). Second, it may not assume compiler support other than
that provided by a standard compiler such as C++. All decisions about data distribution,
communication, and synchronization are to be made at run-time. Finally,
the LPARX implementation should provide communication facilities for complicated
data structures (e.g. those with pointers), which may require special treatment when
communicated across address spaces.

The LPARX implementation infrastructure consists of three layered software
libraries: Distributed Parallel Objects (DPO), Asynchronous Message Streams
(AMS), and a portable message passing library called MP++. Table [...] summarizes
the operations provided by each layer. The MP++ portable message passing layer,
described in Section [...], implements very basic interprocessor communication facilities.
Building on the point-to-point messages of MP++, AMS (Section [...]) defines
a "message stream" abstraction that hides details of packing and unpacking data
and sending and receiving messages. DPO (Section [...]) defines mechanisms for
managing objects distributed across processor memories. We conclude this section
with an example of how these three layers interact with LPARX to implement the
interprocessor communication necessary for LPARX Grid copies.
Message Passing Layer

At the very bottom of the LPARX software hierarchy is MP++, an
architecture-independent message passing layer similar in spirit to MPI [...]. MP++
provides facilities for asynchronous and synchronous point-to-point message passing,
barrier synchronization, broadcasts, and global reductions.

To port our software infrastructure (approximately thirty thousand lines of
C++ and Fortran code) to a new multiprocessor, we need only port the MP++ library;
no other code changes are necessary. Porting MP++ to a new parallel machine typically takes only
a few hours and a few hundred lines of code; for example, the port to the IBM SP2
required only about [...] lines of new code. To port MP++, the programmer must
translate the generic MP++ message passing calls into architecture-specific calls. For
instance, MP++ message send routine mpSend is implemented using csend on the Intel
Paragon and mpc_bsend on the IBM SP2. Our software is currently running on the
Cray C90 (single processor), IBM SP2, Intel Paragon, single processor workstations,
and networks of workstations under PVM [...]. In the past, MP++ has also supported
the Intel iPSC/860, Kendall Square Research KSR-1, nCUBE nCUBE/2, and
Thinking Machines CM-5, all of which are now obsolete.
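The porting strategy described above can be sketched in a few lines of C++. This is an illustrative sketch, not MP++'s actual source: the signatures of mpSend and mpReceive are assumptions, the vendor mappings (csend, mpc_bsend) appear only as comments taken from the text, and the default branch substitutes a hypothetical in-process loopback queue so the sketch compiles and runs anywhere.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <deque>
#include <vector>

// Sketch of an MP++-style portability shim.  Each port maps the generic
// calls onto a machine's native primitives (csend on the Paragon,
// mpc_bsend on the SP2); the fallback below uses an in-process queue.
#if defined(PARAGON)
// void mpSend(const void* buf, int len, int dest, int tag)
//     { csend(tag, buf, len, dest, 0); }
#elif defined(IBM_SP2)
// void mpSend(const void* buf, int len, int dest, int tag)
//     { mpc_bsend(const_cast<void*>(buf), len, dest, tag); }
#else
static std::deque<std::vector<char> > pending;  // loopback "network"

void mpSend(const void* buf, int len, int /*dest*/, int /*tag*/) {
    const char* p = static_cast<const char*>(buf);
    pending.push_back(std::vector<char>(p, p + len));  // enqueue a copy
}

int mpReceive(void* buf, int maxlen, int /*src*/, int /*tag*/) {
    std::vector<char> msg = pending.front();  // oldest message first
    pending.pop_front();
    int len = std::min(static_cast<int>(msg.size()), maxlen);
    std::memcpy(buf, msg.data(), len);
    return len;  // number of bytes delivered
}
#endif
```

Because every call site in the layers above uses only the generic names, retargeting the library means rewriting only the bodies inside one such conditional branch.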
On a single processor workstation, MP++ can execute in a "simulated parallel
machine" mode in which multiple UNIX processes emulate the processing nodes
of a parallel machine. This environment is well suited for code development and
debugging, as the workstation programming environment is more mature than that
provided on most parallel architectures. In practice, very little LPARX application
development and debugging is performed on parallel architectures. For the past two
years at the University of California at San Diego, MP++ has been used on workstations
to teach message passing parallel programming.

[Footnote: MPI is a relatively new portable message passing standard. When efficient implementations of MPI are readily available on parallel architectures, the MP++ layer will be replaced with MPI.]
Asynchronous Message Streams

The Asynchronous Message Stream (AMS) communication paradigm builds
on ideas from asynchronous remote procedure calls [...], Active Messages [...],
and the C++ I/O stream library [...]. AMS requires only basic message passing
support such as that provided by MPI [...]. Its message stream abstraction frees the
programmer from many low-level message passing details.

Both AMS and the Active Messages [...] model provide mechanisms for
sending a message to a handler which then consumes the message. An important
difference is that AMS is intended for coarse-grain communication. Although Active
Messages provides some facilities for sending long messages, it emphasizes fine-grain
message passing. AMS hides all message buffer management via its message stream
abstraction; Active Messages does not.
Message Streams

Communication between processors uses AMS's "message stream" abstraction,
based on the C++ I/O stream model. A message stream contains two endpoints:
a sending end and a receiving end. Data is written into the communication stream
at the sending end and read out from the receiving end. AMS message streams
are intended to be short-term message connections between processors. They hide
all details of message buffer management from the programmer. AMS automatically
packetizes the message stream and coordinates interprocessor communication
through the message passing layer. Because the application is shielded from the internal
representation of data in the message stream, AMS could transparently encode
data [...] for transmission among heterogeneous processors which use different
data representations; our current AMS implementation does not provide this service
because of the high cost of changing data formats.

Each user-defined object to be communicated between processors must define
pack and unpack functions which copy object data into and out of the message
stream. These pack and unpack functions are simple to write and resemble standard
C++ I/O statements (see the example later in this section). AMS defines pack and
unpack functions for C++ built-in types such as integer and double.
Asynchronous Communication

AMS supports two forms of interprocessor communication: sends and forwarding
sends. In an AMS send, the processor initiating the send specifies a destination
processor and a user-defined function handler on that processor. The handler is
simply a function which is to be called when the AMS message arrives. The originating
processor opens an AMS message stream connection to the handler on the remote
processor. As shown in Figure [...]a, the handler is awoken by AMS, consumes data
from the message stream, takes some appropriate action defined by the handler, and
then exits. The handler may perform computation on the incoming data stream or
incorporate the message data into local data structures but may not return data.

AMS's forwarding send allows handlers to return data. In the forwarding
send, the originating processor provides two additional arguments: the processor and
user-defined handler which are to receive the reply from the first handler. The handler
processing the data request is oblivious as to where the results are being forwarded;
it is only aware that it is writing data into an outgoing message stream. All message
stream connections between processors are managed by AMS. Figure [...]b shows processor
P sending a request to a handler on Q; the result of the computation is returned
to a handler on P. In the general case, the reply may be directed towards any processor.
Note that P does not block while waiting for the reply from Q; instead, it overlaps
the communication with Q with other computation or perhaps other communication.
Figure [...]: (a) Communication between processors P and Q via the Asynchronous Message Stream layer. Processor P sends a message to Q. A user-defined handler function awakes on Q, consumes the incoming message stream, performs some computation, and exits. (b) Processor P makes a data request from processor Q, which processes the request and returns data to P. Note that processor P continues to compute while Q services the request.
Global Synchronization

Detecting the end of a communication phase in an application with asynchronous
and unpredictable interprocessor communication patterns is a difficult task.
Because communication patterns are not known, processors cannot predict when or
even if messages are expected to arrive. Furthermore, most message passing implementations
cannot guarantee that messages sent from the same processor arrive in
the same order as they were sent.

AMS implements a very simple synchronization protocol: processors can
pass the synchronization barrier only after their communication with all other processors
has ended. In all interprocessor communication, AMS routines that open
the message streams record the number of messages destined for every other processor.
Upon reaching the global synchronization point, processors perform a global
addition to obtain the number of messages that each processor is to have received.
Each processor then waits until it has received the proper number of messages. This
protocol guarantees that processors will pass the synchronization point only after all
communication has terminated.

Note that it is not sufficient to execute a simple global barrier because
messages from the same processor may not arrive in the same order as they were
sent. Thus, it would be possible for an asynchronous message sent before the barrier
on one processor to arrive after the barrier on another processor if the asynchronous
message and the barrier message arrived out of order. The cost of this synchronization
protocol is discussed in Section [...].
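The message-counting protocol described above can be illustrated with a small sketch. This is not AMS's actual code: the real system spreads the bookkeeping across processors, whereas this hypothetical simulation holds all the per-processor send counts in one address space and performs the "global addition" directly.

```cpp
#include <cassert>
#include <vector>

// sent[p][q] counts messages processor p has sent to q during the phase.
// The global reduction gives each processor q the total number of
// messages it must receive before it may pass the barrier.
std::vector<int> expectedReceives(const std::vector<std::vector<int> >& sent) {
    const int P = static_cast<int>(sent.size());
    std::vector<int> expected(P, 0);
    for (int p = 0; p < P; ++p)           // "global addition" over senders
        for (int q = 0; q < P; ++q)
            expected[q] += sent[p][q];
    return expected;
}

// A processor may pass the barrier only once its local receive count
// matches the globally agreed total; out-of-order delivery is harmless
// because only the count matters, not the arrival order.
bool mayPassBarrier(int received, int expected) {
    return received >= expected;
}
```

The key property is visible in the sketch: a plain barrier message could overtake a data message, but a processor that knows it still owes receipt of one more message simply keeps draining the network before proceeding.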
An AMS Example

Figure [...] illustrates sample AMS C++ code based on a geographically structured
genetic algorithms application developed using LPARX [...]. Recall from Chapter [...]
that LPARX Grids may contain elements of any user-defined type. In this particular
application, Grid elements are of type GA_Individual. For many user-defined
structures, such as those containing pointers or other user-defined types, the LPARX
run-time system does not know how to pack and unpack data for transmission across
memory spaces; instead, the application must supply these routines. Fortunately,
they are easy to write and very short (e.g. ten to twenty lines of code).

AMS overloads the standard C++ I/O operators << and >> to write into
and read out of message streams; such an approach mirrors standard C++ input
and output. The definitions of the message stream operators << and >> for class
GA_Individual are shown in Figure [...]. SendPacket in << represents the outgoing
message stream and RecvPacket in >> the incoming stream. In this genetic algorithms
code, all GA_Individual objects contain "genotype" information but only some contain
"phenotype" data, depending on the value of flag has_phenotype. The message
stream functions determine at run-time which data to send and receive based on this
flag. We do not show the operator definitions for class GA_Point.
   // Define the complicated C++ class (GA_Point defined elsewhere)
   class GA_Individual {
      double eval;
      GA_Point genotype;
      int has_phenotype;
      GA_Point phenotype;
   public:
      // public member functions
   };

   // C++ code to write data into the outgoing message stream
   SendPacket& operator<<(SendPacket& outgoing, const GA_Individual& GA)
   {
      outgoing << GA.eval << GA.genotype << GA.has_phenotype;
      if (GA.has_phenotype)
         outgoing << GA.phenotype;
      return(outgoing);
   }

   // C++ code to read data from the incoming message stream
   RecvPacket& operator>>(RecvPacket& incoming, GA_Individual& GA)
   {
      incoming >> GA.eval >> GA.genotype >> GA.has_phenotype;
      if (GA.has_phenotype)
         incoming >> GA.phenotype;
      return(incoming);
   }
Figure [...]: This C++ code illustrates AMS's message stream abstractions for a genetic algorithms application [...]. Information for class GA_Individual is written into a message stream using << and read out with >>. This approach is similar to standard C++ input and output. Similar code written without the benefits of the AMS message stream abstraction would be considerably more complicated.
Mechanism                         Description

open s to (h, p)                  open a message stream s to handler h on processor p

open s to (h1, p1) and            open a message stream s to handler h1 on processor p1 and
forward to (h2, p2)               then forward the resulting communication to handler h2 on
                                  processor p2

close s                           close the message stream s

s << o                            write data for object o into the message stream s

s >> o                            read data for object o from message stream s

s << PackArray(o, n)              write data for n objects of array o into message stream s

s >> UnPackArray(o, n)            read data for n objects of array o from message stream s

barrier                           force all processors to wait at a global synchronization barrier
                                  until all communication has terminated

Table [...]: This table summarizes the asynchronous communication facilities provided by the AMS layer.
This simple example illustrates a number of AMS's features. First, AMS
hides all low-level message buffering details. The application may determine at
run-time what class information is to be transmitted through the message stream.
Furthermore, applications are free to mix objects of various types in the same message
stream. Finally, the message stream operators hide class information in a hierarchical
manner (e.g. the definitions of << and >> for GA_Point are hidden from
GA_Individual). Similar code written without the benefits of the AMS message
stream abstractions would be considerably more complicated; the code would be responsible
for managing buffer pointers, moving data into and out of a message buffer,
and checking for buffer underflow and overflow. AMS hides these details.

Summary

Table [...] summarizes the asynchronous communication facilities provided
by the Asynchronous Message Stream layer.
Distributed Parallel Objects

The Asynchronous Message Stream abstractions hide many low-level message
passing details; however, one detail that AMS shares with the message passing
model is that messages are directed towards specific processors. The LPARX execution
model views the application as a collection of communicating objects (Grids);
thus, mechanisms that communicate directly between objects, rather than between
processors, would be more appropriate. Such facilities are provided by the Distributed
Parallel Objects (DPO) layer, which extends AMS's mechanisms with distributed object
naming and object-to-object communication facilities.

Distributed Objects and Name Resolution

The Distributed Parallel Objects layer manipulates physically distributed
C++ objects in a shared name space. Programs consist of a relatively small number
of large, coarse-grain objects and execute in SPMD (Single Program, Multiple Data)
fashion. Objects communicate through asynchronous messages. Each object is assigned
to a particular processor, and ownership does not change. Each processor
has a copy of every object, although the processor owning an object has a different
version of the object than all other processors. The owner has a primary copy whereas
all other processors have a secondary, or ghost, copy (see Figure [...]). The primary
copy of an object contains all pertinent data. Although secondary objects may explicitly
cache data locally, they are intended to act as "handles" through which the
program accesses the primary version of the object. DPO's model of primary and secondary
objects is based upon the distributed communication and object management
mechanisms described by Deshpande et al. [...].

Object identifiers are used to name DPO objects lying on different processors.
DPO assigns each primary object a globally unique identifier upon creation
and enters the object's memory location, its identifier, and its owner into a registry.

[Footnote: In fact, our current DPO implementation does provide facilities for object migration, but we have not found these mechanisms to be useful.]
Figure [...]: These figures show the five objects of Figure [...] distributed across two processors. Processor 0 owns objects 1 and 2, and Processor 1 owns 3, 4, and 5. Primary objects are indicated by solid circles and secondary copies by dashed circles. Each processor has a copy of every object, although the type of copy (primary or secondary) differs.
Secondary objects receive the same identifier as their associated primary object. An
object may determine whether it is primary or secondary by simply comparing its
processor identifier against the object's assigned owner in the registry. Registry information
is replicated across processors to avoid the interprocessor communication
otherwise required by distributed name translation.

Note that the DPO model is ultimately unscalable because secondary object
information and registry data are replicated on all processors. However, these memory
overheads are not a concern for our targeted class of applications. DPO objects are
intended to be large (e.g. an entire Grid), for which the overhead of data replication
is small in comparison to the quantity of data stored in the primary object (see
Section [...]). Furthermore, current trends in multiprocessor design favor parallel
computers with a small number of very powerful processors, not thousands of small
processors. Scientific applications typically use only a few tens of processors at any
one time. Thus, our design decision to replicate storage is appropriate for today's
parallel machines.
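The replicated registry can be sketched as follows. The class and member names here (Registry, registerObject, ownerOf, isPrimary) are illustrative stand-ins, not DPO's actual interface; the point is that because every processor holds the whole table, translating an identifier to an owner is a purely local lookup.

```cpp
#include <cassert>
#include <map>

// One entry per distributed object, replicated on every processor.
struct RegistryEntry {
    int   owner;     // processor holding the primary copy
    void* location;  // this processor's copy (primary or secondary)
};

class Registry {
    std::map<int, RegistryEntry> table;  // identical on all processors
    int nextId;
public:
    Registry() : nextId(0) {}

    // Assign a globally unique identifier and record owner and location.
    int registerObject(int owner, void* location) {
        table[nextId] = RegistryEntry();
        table[nextId].owner = owner;
        table[nextId].location = location;
        return nextId++;
    }

    // Name translation requires no interprocessor communication.
    int ownerOf(int id) const { return table.find(id)->second.owner; }

    bool isPrimary(int id, int myProcessor) const {
        return ownerOf(id) == myProcessor;
    }
};
```

In SPMD execution all processors create objects in the same order, so the identifier counters stay in step and the replicated tables remain consistent without any exchange of messages.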
   // Show the definition of an XArray_ofGrid_ofDouble
   class XArray_ofGrid_ofDouble {
      int n;
      Grid_ofDouble **grids;
   public:
      // public member functions
      Grid_ofDouble& operator() (const int i) { return(*(grids[i])); }
   };

   // Define function XAlloc for a 1d XArray of 2d Grid of double
   void XAlloc(XArray_ofGrid_ofDouble& xarray, const int n,
               Region2 *regions, int *assignments)
   {
      xarray.grids = new Grid_ofDouble *[n];
      for (int i = 0; i < n; i++)
         xarray.grids[i] = new Grid_ofDouble(regions[i], assignments[i]);
      xarray.n = n;
   }
Figure [...]: LPARX function XAlloc allocates an XArray of Grids using Region and processor assignment information. See Figure [...] for the definition of the Grid class.
Each LPARX Grid is a DPO object. LPARX function XAlloc supplies a
Region and a processor assignment when creating a Grid (see Figure [...]). All copies
of the Grid, even the secondary versions, store the Region, but only the processor
that actually owns the Grid allocates the array data (see Figure [...]). The distinction
between primary and secondary objects is hidden from the LPARX application. Because
Region information for each Grid is replicated on all processors, the LPARX
run-time system can calculate the data to send to an off-processor Grid, determined
by the intersection of the two Regions, without the need to explicitly fetch the Region
from the other processor; copies with empty intersections require no communication
whatsoever.
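The replicated-Region optimization boils down to a local box-intersection test. The sketch below uses a hypothetical minimal Region2 (a 2-d box with inclusive bounds); LPARX's actual Region class is far richer, but the no-communication check works the same way.

```cpp
#include <algorithm>
#include <cassert>

// A 2-d index box with inclusive lower and upper bounds.
struct Region2 {
    int lo[2], hi[2];
};

// Intersection of two boxes; may come out empty.
Region2 intersect(const Region2& a, const Region2& b) {
    Region2 r;
    for (int d = 0; d < 2; ++d) {
        r.lo[d] = std::max(a.lo[d], b.lo[d]);
        r.hi[d] = std::min(a.hi[d], b.hi[d]);
    }
    return r;
}

bool empty(const Region2& r) {
    return r.lo[0] > r.hi[0] || r.lo[1] > r.hi[1];
}

// Because every processor stores every Grid's Region, this test is
// purely local: when the result is false, the copy is already done
// and no message is ever sent.
bool copyNeedsCommunication(const Region2& dst, const Region2& src,
                            const Region2& mask) {
    return !empty(intersect(intersect(dst, src), mask));
}
```

In spatially localized computations most pairs of Grids do not overlap, so in the common case the copy terminates after a handful of integer comparisons.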
   // Define a 2d Grid of double
   class Grid_ofDouble : private DistributedParallelObject {
      Region2 myRegion;
      double *myData;
   public:
      // public member functions
   };

   // Define the Grid constructor (called when creating a Grid)
   Grid_ofDouble::Grid_ofDouble(const Region2& region, const int processor) :
      DistributedParallelObject(processor)
   {
      myRegion = region;
      if (is_a_primary_copy()) {
         const int size = myRegion.number_of_elements();
         myData = new double[size];
      } else {
         myData = (double *) NULL;
      }
   }
Figure [...]: This C++ code illustrates how a Grid object is created. Grid is a subclass of DistributedParallelObject. The Grid is assigned a unique identifier and registered during the initialization of the base class. All Grids store Region information but only the owner (determined by the call to is_a_primary_copy, a member function of class DistributedParallelObject) actually allocates the data for the array.
Execution Model

DPO provides a very simple execution model that avoids many of the implementation
difficulties and overheads (such as execution dispatch, multiple execution
streams, frame allocation, and scheduling [...]) found in more complicated systems.
All processors execute the same code in SPMD fashion, and only the owner of an
object (as reported by the DPO registry) executes the computation for that object
(i.e. the owner-computes rule).

For example, recall that LPARX's forall loop iterates in parallel over the
   // User's C++ code actually looks like this (forall is a C++ macro)
   forall(i, xarray)
      ...
   end_forall

   // But after C++ translates the macro, the compiler sees this code
   for (int i = 0; i < xarray.n; i++) {
      if (xarray(i).owner()) {
         ...
      }
   }
   Synchronize();
Figure [...]: The LPARX forall loop is implemented as a C++ macro. The user writes the code at the top of the figure, and the C++ pre-processor generates the code at the bottom. The call to Synchronize at the end of the parallel loop synchronizes all processors. LPARX also defines a special form of the end_forall that allows the programmer to override the implicit synchronization; there is no need to synchronize at the end of parallel loops that perform no interprocessor communication.
Grids of an XArray (see Section [...]). In the DPO implementation, all processors
loop over all elements of the XArray, but only the processor owning a Grid executes
the computation for that Grid (see Figure [...]). Note that because Grids are coarse-grain
objects, the overhead associated with the check for ownership is insignificant
when compared to the computation for a Grid. Of course, all details such as checks
for ownership are managed by the forall loop and are completely hidden from the
programmer. To eliminate the need for repeated ownership checks in every forall
loop, we could have the XArray locally cache a list of all Grids owned by its processor,
but we have not done so since the overhead of ownership checks is negligible for our
coarse-grain applications.
Communication

Object-to-object communication in DPO uses the same message stream abstraction
as AMS. However, instead of sending a message to a handler on a specified
processor, the message is sent to a function belonging to a specified object. DPO
supports two forms of object-based communication:

- A send sends a message to a handler belonging to the primary copy of a specified
object. This communication mechanism is an object-based version of the AMS
send shown in Figure [...]a.

- A forwarding send sends a request to a handler belonging to the primary copy
of a specified object. The object processes the request and forwards the reply
to another object. This is an object-based version of the AMS forwarding send
of Figure [...]b.

Note that DPO manages all details of object name translation and ownership. As
in AMS, all communication is asynchronous. Global synchronization barriers are
provided using AMS's protocol.
Summary

Table [...] summarizes the object-to-object communication and object management
facilities provided by the Distributed Parallel Objects layer.
Communication Example

In this section, we describe how MP++, AMS, DPO, LPARX, and an LPARX
application interact to implement interprocessor communication. Note that the following
details (except for AMS message stream packing and unpacking for user-defined
types) are hidden from LPARX applications and are completely managed by
the LPARX run-time system. Our example will be the execution of the FillPatch
loop described in Section [...] and reproduced here in Figure [...]. Recall that
Mechanism                     Description

open s to o                   open a message stream s to primary object o

open s to o1 and              open a message stream s to primary object o1 and forward
forward to o2                 the resulting communication to o2 (primary or secondary)

close s                       close the message stream s

lookup id                     look up object identifier id in the DPO registry and return
                              the associated object

register o                    register object o in the DPO registry and return an identifier

owner o, is_primary o,        return information about the specified object o
is_secondary o

s << o, s >> o, barrier       same as in AMS

Table [...]: This table summarizes the object management mechanisms defined by the Distributed Parallel Objects layer.
   // Communicate boundary data between neighboring partitions
   function FillPatch(XArray of Grid of Double U)
      // Loop over all pairs of grids in U
      forall i in U
         // Mask off the ghost cells (copy interior values only)
         // Function region() extracts the region from its argument
         Region Inside = grow(region(U(i)), -1)
         for j in U
            // Copy data from intersecting regions
            copy into U(j) from U(i) on Inside
         end for
      end forall
   end function

Figure [...]: FillPatch updates ghost cell regions of Grid U(j) with overlapping non-ghost cell data from adjacent Grids U(i). This code is reproduced here from Figure [...]. Communication in the current LPARX implementation is asynchronous and processors calculate data dependencies only for those Grids they own.
FillPatch updates the ghost cell regions of each subgrid with data from the interior
(non-ghost cell) portions of adjacent subgrids.

In our LPARX implementation, all message communication between processors
is asynchronous. Processors execute the iterations of the communication loop
independently, calculating data dependencies only for those Grids they own. Communication
within the copy routine is asynchronous and non-blocking, and the actual
data copy may not complete until a later time. In fact, data motion is not guaranteed
to terminate until all processors reach a global synchronization point. Thus, there
may be multiple copy operations executing in parallel, overlapping interprocessor
communication. LPARX inserts a global synchronization barrier at the end of every
communication phase (e.g. in the end_forall at the end of the forall loop) to ensure
that all communication has terminated before computation begins. Split-C [...]
supports a similar split-phase communications paradigm.
We now consider the execution of the LPARX statement:

   copy into A from B on Inside

where A and B are Grids and Inside is a Region. Note that we have replaced U(j)
and U(i) from Figure [...] with A and B (respectively) to simplify the notation. Recall
from Section [...] that this statement copies data from B into A where the Regions
of A and B intersect with Inside. Because Region information for Grids A and B
is replicated on all processors, LPARX can immediately calculate the intersection
between the Regions of A and B. If this intersection is empty, then no data is to be
moved and the copy is finished. This optimization is extremely important in spatially
localized computations (such as those addressed by LPARX) since most intersections
are empty. Otherwise, there are four possible cases to consider, as shown in Table [...],
depending on whether A and B are primary or secondary objects.

LPARX stores Grid data only on the processor which actually owns the
Grid. If both A and B are primary objects, then all data is available locally and a
memory-to-memory copy suffices. If A is a primary copy and B is a secondary copy,
then A asks the processor owning B to send it the necessary Grid data (a "get"). If
Grid A      Grid B      Action

primary     primary     local memory-to-memory copy
primary     secondary   A requests Grid data from the owner of B
secondary   primary     Grid data from B is sent directly to the owner of A
secondary   secondary   request the processor owning B to send the appropriate
                        Grid data to the processor owning A

Table [...]: There are four possible cases to consider in the implementation of the LPARX statement "copy into A from B", depending on whether Grids A or B are primary or secondary DPO objects. LPARX stores Grid data only on the processor which actually owns the Grid (the primary copy). Accessing Grid data for secondary objects requires interprocessor communication. Routines to check whether objects are primary or secondary are provided by DPO (see Table [...]).
the data for B is local but A is stored elsewhere, then Grid data from B is sent directly
to the processor owning A (a "put"). Finally, if neither A nor B are primary objects,
then a request is sent to the processor owning B to send Grid data to the processor
owning A.
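The four cases above reduce to a two-bit dispatch. The enum and function names in this sketch are illustrative, not LPARX's actual internals, but the case analysis is exactly the one in the table.

```cpp
#include <cassert>

// Action chosen by "copy into A from B" based on where the primary
// copies of the two Grids live relative to the executing processor.
enum CopyAction {
    LocalCopy,   // both primary here: memory-to-memory copy
    Get,         // A primary, B remote: ask B's owner for the data
    Put,         // B primary, A remote: ship data straight to A's owner
    ThirdParty   // neither primary: ask B's owner to send to A's owner
};

CopyAction dispatchCopy(bool aIsPrimary, bool bIsPrimary) {
    if (aIsPrimary && bIsPrimary) return LocalCopy;
    if (aIsPrimary)               return Get;
    if (bIsPrimary)               return Put;
    return ThirdParty;
}
```

Because every processor executes the same copy statement in SPMD fashion, each one runs this dispatch locally (using the replicated registry's primary/secondary tests) and performs only the role the case assigns to it.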
Figures [...] and [...] provide time-line views (not drawn to scale) of how
MP++, AMS, DPO, LPARX, and an LPARX application interact to implement the
interprocessor communication associated with an LPARX copy. These examples assume
that Grid elements are complicated C++ objects, such as GA_Individual in
Figure [...], so that application-level packing and unpacking routines are required.
For Grids whose elements are standard C++ types (e.g. double or integer), LPARX
manages packing and unpacking automatically without the intervention of the application.
Figure [...] illustrates the transmission of Grid information to a Grid lying
on another processor. The application calls the LPARX copy routine, which calculates
what Region of Grid data is necessary to satisfy the copy. DPO translates the
name of the destination Grid, and AMS opens a message stream connection to the
destination. Grid data is copied into the message stream using the application-level
packing routines. If the internal message stream buffer overflows, AMS initiates a
Figure [...]: This figure provides a time-line view (not to scale) of the transmission of Grid data to another processor. Arcs show transitions between MP++, AMS, DPO, LPARX, and the LPARX application. The message send in MP++ and the copying of new data in the application-level packing routines occur in parallel. See the text for a detailed explanation.
Figure [...]: This figure provides a time-line view (not to scale) of the reception of Grid data from another processor. Arcs show transitions between MP++, AMS, DPO, LPARX, and the LPARX application. The reception of messages in MP++ and the unpacking of data in the application-level routines occur in parallel. See the text for a detailed explanation.
send to transmit the data bu�er� allocates a new data bu�er� and resumes packing
data� Note that the transmission of the old bu�er and the allocation of the new bu�er
are completely transparent to the application�level packing routines� Interprocessor
communication executes in parallel with the �lling of the new bu�er� After all Grid
data has been transmitted� AMS closes the message stream and LPARX returns from
the copy routine�
Figure �.�� shows the other end of the interprocessor communication: the
reception of Grid information from a Grid lying on another processor. Once the first
packet of the message stream arrives, AMS activates a message handler provided by
DPO. DPO then translates the name of the destination object (an LPARX Grid)
and calls an LPARX routine to process the incoming data stream. LPARX extracts
Region information and begins to copy data out of the message stream (using the
application-level unpacking routines) and into the Grid. Concurrently, MP++ receives
the next portion of the message stream. When the internal message buffer underflows,
the application-level unpacking routines transparently switch to the new message
buffer. If the needed message buffer is not yet available, then AMS waits until the
appropriate message arrives. While waiting, AMS will remove incoming message
packets and queue them locally, but it will not activate another handler (to avoid
corrupting global state) until the current one finishes. After all Grid information has
been extracted, LPARX exits to AMS, which closes the message stream, leaves the
handler, and returns control to the application.
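The buffer management that the packing routines never see can be illustrated with a small sketch. The class below is not the AMS interface; the names and the flush policy are illustrative, and in the real system a full buffer is handed to MP++ for an asynchronous send and replaced by a fresh buffer, so packing and transmission overlap.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Illustrative sketch of the AMS message-stream idea: a packer writes into a
// fixed-size buffer, and when the buffer overflows the stream "flushes" it.
// Real AMS would start an asynchronous MP++ send of the old buffer here and
// swap in a new one; this sketch just counts flushes.
class MessageStream {
public:
    explicit MessageStream(std::size_t capacity) : buffer_(capacity), used_(0) {}

    // Pack raw bytes; transparently flush when the internal buffer fills.
    void pack(const void* data, std::size_t n) {
        const char* p = static_cast<const char*>(data);
        while (n > 0) {
            if (used_ == buffer_.size()) flush();
            std::size_t chunk = std::min(n, buffer_.size() - used_);
            std::memcpy(buffer_.data() + used_, p, chunk);
            used_ += chunk;
            p += chunk;
            n -= chunk;
        }
    }

    std::size_t flushes() const { return flushes_; }
    std::size_t pending() const { return used_; }

private:
    void flush() {
        ++flushes_;   // stand-in for: start asynchronous send, allocate new buffer
        used_ = 0;
    }
    std::vector<char> buffer_;
    std::size_t used_;
    std::size_t flushes_ = 0;
};
```

Because the flush happens inside `pack`, the application-level packing routine simply writes bytes and remains oblivious to buffer boundaries, which is the transparency property described above.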
�.� Implementation and Performance

None of us really understands what's going on with all these numbers.
- David Allen Stockman, referring to the ���� federal budget
Distributed Parallel Objects, Asynchronous Message Streams, and MP++
have been implemented as a collection of C++ classes; they require no special compiler
support. MP++ consists of approximately ��� lines of C++ code, AMS ��� lines, and
DPO ���� lines. LPARX and its associated libraries add another ��� lines of code.
In the following three sections, we discuss implementation issues and over-
heads. We begin in Section �.�.� with a comparison of interrupt and polling mech-
anisms for asynchronous communication. In Section �.�.�, we present memory and
communication overheads for AMS and DPO. Finally, we analyze the performance of
LPARX on a simple application (the Jacobi problem of Section �.�) and compare its
performance to a message passing implementation.
�.�.� Interrupts versus Polling
The AMS run-time system requires a mechanism to detect when an AMS
message has arrived and invoke the appropriate function handler. There are two
methods typically employed to process such asynchronous events: interrupts and
polling (which we use). Each method has its advantages and disadvantages.
The primary advantage of interrupt-driven message handlers is that they
do not require polling calls in the code. Interrupt-driven messages, though, have a
number of drawbacks. First, interrupt mechanisms are not portable and vary signifi-
cantly from multiprocessor to multiprocessor. In fact, some message passing libraries
(e.g. MPI [��]) do not support interrupt-driven messages. Second, interrupt-driven
message handlers must be careful when writing to global variables (i.e. those variables
not local to the handler). Because there is little control over when interrupts occur,
handlers may corrupt global state being modified by another routine. For instance,
suppose an interrupt occurred while the main program was allocating memory. If the
interrupt handler also attempted to allocate memory, the handler could corrupt the
state of the memory allocator. One solution would be to require the programmer to
mask interrupts during all sensitive calculations, but such an approach is tedious and
error-prone.
In the implementation of AMS and DPO, we use polling to process asyn-
chronous events. The AMS run-time system provides a special polling function which
the libraries call to check on pending messages and, if found, invoke the associated
function handlers. The advantage of this method is that handlers are called only at
those times considered safe. The drawback is that the code must periodically check
for pending message events. In practice, however, this requirement is not particularly
bothersome: LPARX applications are oblivious to polling, as such calls are hidden
within the implementation of LPARX's copy functions.

  Description                                       Overhead
  AMS overhead per message                          � bytes
  DPO and AMS overhead per message                  �-� bytes
  Total overhead per message                        ��-��� bytes
  Memory overhead per DPO object (per processor)    � bytes
  Total storage per LPARX Grid (per processor)      �-� bytes

Table �.�: Message length and memory overheads for AMS, DPO, and LPARX. See the text for a detailed explanation.
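The polling discipline described above can be sketched with a small model. This is illustrative rather than the actual AMS interface: pending handlers are queued as they arrive, and they run only when the library reaches a known-safe point and calls `poll()`, so a handler can never interrupt another operation mid-update.

```cpp
#include <deque>
#include <functional>
#include <utility>

// Illustrative model of polling-based event handling (hypothetical names, not
// the AMS API). Message-arrival events are queued; poll(), invoked only at
// safe points such as inside LPARX's copy functions, drains the queue and runs
// each handler to completion before the next one starts.
class EventQueue {
public:
    void post(std::function<void()> handler) {
        pending_.push_back(std::move(handler));
    }

    // Returns the number of handlers run during this poll.
    int poll() {
        int handled = 0;
        while (!pending_.empty()) {
            std::function<void()> h = std::move(pending_.front());
            pending_.pop_front();
            h();   // safe: no other library operation is in progress
            ++handled;
        }
        return handled;
    }

private:
    std::deque<std::function<void()>> pending_;
};
```

This is exactly the property that interrupt-driven handlers lack: in this model, global state (e.g. a memory allocator) is touched by handlers only between, never during, library operations.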
�.�.� DPO and AMS Overheads
Table �.� provides various memory and message length overheads for AMS
and DPO; performance overheads will be discussed in the following section. Each
of the LPARX, DPO, and AMS layers adds header information to all interprocessor
communication. AMS requires � bytes of header information for function handlers,
message size, packet sequence numbers, and other miscellaneous data. DPO adds an
additional � to � bytes (depending on the type of message) for object identification.
Finally, LPARX adds between � and � bytes (� + �d, where d = �, ..., � is the
spatial dimension of the Grid) to specify the Regions used in a copy. Note that such
overheads should not significantly affect message passing times. On most parallel ar-
chitectures, the transmission time for short messages is dominated by communication
start-up overheads; for long messages, the additional overhead of a hundred bytes is
insignificant.
On every processor, each DPO object requires a memory overhead of �
bytes for name translation and object ownership information. To that, LPARX adds
another �-� bytes (� + �d for a d-dimensional Grid) for Region and other data.
Only the primary version of an object allocates the Grid array. Recall that the
memory overhead of data replication is relatively modest in comparison to the size
of a typical Grid, which may contain several tens of thousands of bytes. An alter-
nate, scalable implementation strategy would involve fine-grain, distributed transla-
tion schemes such as those implemented by CHAOS [��] and pC++ [��], at the cost
of additional interprocessor communication. While scalable, such implementation
approaches are inappropriate for our coarse-grain applications running on today's
parallel platforms.
�.�.� Application Performance
To provide an overall estimate of LPARX run-time performance overheads,
we implemented a �d Jacobi iterative solver with a ��-point finite difference stencil
in LPARX and also by hand using message passing. The message passing implemen-
tation should provide an approximate lower bound for the "best" possible implemen-
tation. We chose the Jacobi application because it is simple enough to parallelize by
hand. While we would have preferred to compare a "real" LPARX application such
as a structured adaptive mesh application, it would have taken months, if not years,
to parallelize such a code by hand.
Table �.� compares the performance of the two codes for a ��� x ��� x ���
mesh on � Paragon nodes. The hand-coded application made a number of simplifying
assumptions, namely that each processor was assigned only one subgrid and that the
problem was static so that it could precompute communication schedules. Without
these simplifying assumptions, the hand implementation would have been consider-
ably more difficult. While such assumptions may apply to this simple example, they
are not valid for the dynamic irregular applications which are the intended target of
LPARX. For example, structured adaptive mesh calculations may assign several sub-
grids to each processor, and particle methods change communication dependencies as
                        By Hand    LPARX v�.�    No Barrier
  Total time (ms)       �����      ���           ����
  Computation (ms)      ����       ����          ����
  Communication (ms)    ����       ��            ���
  Messages (kilobytes)  ��         ����          ����
  Message starts        ���        ��            ���

Table �.�: LPARX overheads for a �d Jacobi (��-point stencil) relaxation calculation on a ��� x ��� x ��� mesh on � Paragon nodes. The "By Hand" application was parallelized using only message passing. The numbers for LPARX v�.� reflect the current LPARX implementation, and the "No Barrier" numbers estimate the performance of LPARX without the global barrier synchronization. All numbers measure the wall-clock time for one iteration of the algorithm and were averaged over ��� iterations. Message statistics represent single processor averages for one iteration.
particles move.
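For reference, one relaxation sweep of such a Jacobi solver can be sketched serially as below. The 7-point stencil and the flat array layout are illustrative assumptions; in the LPARX and message passing versions, each processor additionally exchanges ghost cells with neighboring subgrids before sweeping its local interior.

```cpp
#include <vector>

// Serial sketch of one Jacobi relaxation sweep on an n*n*n mesh with a 7-point
// stencil (stencil choice assumed for illustration). Each interior point is
// replaced by the average of its six axis-aligned neighbors; boundary values
// are left untouched.
void jacobi_sweep(const std::vector<double>& u, std::vector<double>& unew, int n) {
    auto idx = [n](int i, int j, int k) { return (i * n + j) * n + k; };
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < n - 1; ++j)
            for (int k = 1; k < n - 1; ++k)
                unew[idx(i, j, k)] =
                    (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)] +
                     u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)] +
                     u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]) / 6.0;
}
```

The computation phase is identical in both implementations; only the surrounding communication machinery differs, which is why the table isolates computation time from communication time.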
Table �.� reports five performance numbers: total execution time, numerical
computation time, communication time, average number of bytes communicated per
processor, and average number of message sends per processor. All measurements
are reported per iteration and were averaged over ��� iterations. The performance
numbers in the "By Hand" column reflect the message passing implementation; the
"LPARX v�.�" column represents the performance of the current LPARX software
release.
The LPARX computation time is identical to that of the message passing
code; LPARX overheads appear only in the communication routines. The LPARX
communication time is �% slower than the message passing version. This translates
into an overall execution time which is �% longer than the equivalent message passing
code.
The LPARX code communicated approximately five percent more bytes than
the message passing implementation. Part of this overhead is due to the additional
information (described in the previous section) which must be communicated with
each LPARX message. Because LPARX cannot assume that only one subgrid is as-
signed to each processor (as was assumed in the hand-coded message passing version),
it must incorporate descriptive information into each message identifying the subgrid
where data is to be stored.
Most of LPARX's communication overhead can be attributed to the extra
messages sent as part of its synchronization protocol. Recall that at the end of a
communications loop, LPARX detects the termination of communication via a global
barrier synchronization that accounts for the additional message sends. Our synchro-
nization protocol requires log P message starts on P processors. However, it would
be possible to eliminate this costly synchronization through an alternative implemen-
tation strategy (see Section �.�.�) using run-time schedule analysis techniques [��].
By eliminating the barrier, we obtain the results in the "No Barrier" column; LPARX
overheads now drop to approximately one percent of the total execution time of the
program.
�.� Analysis and Discussion

It is a capital mistake to theorize before one has data.
- Sir Arthur Conan Doyle
To simplify the implementation of the LPARX run-time system, we have
introduced two intermediate software layers between LPARX and our MP++ portable
message passing library: Asynchronous Message Streams (AMS) and Distributed Par-
allel Objects (DPO). AMS and DPO provide support for SPMD programs consisting
of a small number of large, complicated, coarse-grain objects with asynchronous,
unpredictable communication patterns. AMS defines a message stream abstraction
which hides low-level message passing details such as message buffer management.
Building on the AMS facilities, DPO provides mechanisms for manipulating physi-
cally distributed objects in a shared name space. Our software infrastructure runs on
a variety of parallel architectures and requires only basic message passing support.
The primary run-time overhead associated with our implementation strategy
is the global barrier synchronization used to detect the end of interprocessor commu-
nication. We are working to eliminate this overhead through run-time communication
schedule techniques [��].
�.�.� Flexibility
One advantage of the modular implementation approach of our software
infrastructure is the flexibility to re-use various software components as needed.
For example, scientists at Lawrence Livermore National Laboratory have used our
DPO, AMS, and MP++ software to parallelize a structured adaptive mesh library for
hyperbolic problems in gas dynamics [��, ���]. Their library defines a set of ab-
stractions similar to those of LPARX. Their versions of Point, Region, and Grid have been
specialized for their particular class of applications; for example, their Region de-
scribes application-specific properties such as whether meshes are cell-centered or
node-centered.
A direct implementation using LPARX would have involved customizing our
Region and Grid classes to conform to their standard. Because of their investment
in tens of thousands of lines of code and their established user base, it would have
been impossible to re-write their library to conform to LPARX. Instead, they used
the DPO, AMS, and MP++ layers. Their Grid became a DPO object, and they
implemented their own version of the XArray class. This approach enabled them to
leverage both our parallelization support and their extensive amount of application-
specific code.
�.�.� Portability
In the design of the LPARX implementation, we decided that portability
would be provided through the use of a portable message passing layer. Our entire
software infrastructure, approximately thirty thousand lines of C++ and Fortran code,
can be ported to a new architecture merely by changing a few hundred lines of code
in the MP++ library.
The downside to portability via a message passing library is that the message
passing model may not necessarily match or exploit the low-level hardware charac-
teristics of a particular parallel platform. For example, how should LPARX be im-
plemented on a shared memory architecture that supports fine-grain communication?
One possible solution would be to implement a message passing library on top of the
hardware's shared memory mechanisms. However, this strategy would probably not
be as efficient as an implementation which directly takes into account the hardware
support for shared memory.
In the LPARX multitasking port to the Cray C90, our first step was to port
a version of the MP++ library that emulated message passing through shared memory
message buffers. This implementation introduced unnecessary copying through inter-
mediate buffers within the "message passing" layer. We later modified the LPARX
layer software so that Grid-to-Grid copies exploited the Cray's shared memory archi-
tecture and bypassed the message passing layer. Note that these issues deal only with
the implementation of the LPARX system and not LPARX itself; these architecture-
specific implementation details are hidden from the programmer.
Such portability issues apply not only to the design of the LPARX run-time
system but to parallel programs in general. For example, how will MPI [��] message
passing applications run on a shared memory architecture? MPI programs will require
the implementation of a shared memory message passing library similar to what we
implemented for the Cray C90. However, applications using this library will in all
likelihood be less efficient than programs that have been specifically designed to take
advantage of the shared memory support.
What is the best method for implementing LPARX (or, for that matter, any
dynamic, irregular scientific application) on a distributed shared memory multipro-
cessor? We believe this to be an open research question. Studies with distributed
shared memory computers indicate that dynamic, irregular applications can dramati-
cally improve performance by explicitly managing cache locality [��, ���]; it is simply
not sufficient to rely on hardware caching mechanisms. What is the role of the AMS
layer on such a computer? Certainly, the application does not need to pack and
unpack complicated objects for transmission across address spaces if it allows the
hardware coherence mechanism to cache object data automatically. However, appli-
cations might improve performance by bypassing the hardware mechanisms and using
AMS instead. Such questions require further study.
Part of the difficulty in writing portable parallel programs is that no single,
unifying, realistic model of parallel computation and communication has emerged.
There have been several attempts to define a unifying model, such as PMH [��],
LogP [��], BSP [���], and CTA [���]. However, these models tend to focus on per-
formance evaluation. What is needed is a single set of realistic, portable, general
purpose mechanisms for efficient parallel programming. Without high-level support,
programmers are left to employ different implementations on different architectures,
hampering portability.
The advantage of LPARX is that it provides a very high-level set of portable
tools that isolates the programmer from the architecture and gives an intuitive per-
formance model. Although tools such as MPI are portable, they are, in some sense,
"closer" to the machine. The LPARX run-time system may look very different on
a shared memory machine than on a message passing machine, but such details are
hidden from the LPARX programmer.
�.�.� Implementation Mistakes
Recall that in Section �.�.� we modeled LPARX programs as a collection of
objects with asynchronous and unpredictable communication patterns. Our imple-
mentation with AMS and DPO is based on this assumption. In retrospect, we believe
that this model of LPARX communication is unnecessarily general.
LPARX applications alternate between communication and computation
phases. Thus, communication is not asynchronous but is, in fact, limited to the
well-defined communication phases of the program. Furthermore, we can predict
global communication patterns using the inspector/executor paradigm pioneered in
CHAOS [��] and multiblock PARTI [��]. In this model of communication, the inspec-
tor phase calculates a schedule of data motion which is then executed in the executor
phase.
The LPARX schedule building loop (the inspector) would employ the re-
gion calculus and copy-on-intersect operations (i.e. structural abstraction) to specify
data dependencies. Because the schedule provides global knowledge of communica-
tion patterns, each processor would know when its communication had completed and
the costly global barrier synchronization would no longer be required. Preliminary
work with integrating schedule building techniques into LPARX [��] indicates that
most of the overheads described in Section �.�.� can be eliminated.
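The inspector/executor split can be illustrated with a deliberately simplified sketch. It uses one-dimensional intervals in place of LPARX's d-dimensional Regions, and the executor merely counts the cells a real executor would transmit; all names here are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Minimal inspector/executor sketch. The inspector intersects local regions
// with remote regions once to build a data-motion schedule; the executor then
// replays that schedule every iteration. Because the schedule gives each
// processor global knowledge of its communication, no barrier is needed to
// detect when communication has completed.
struct Region { int lo, hi; };            // closed interval [lo, hi]

struct Schedule { std::vector<Region> moves; };

// Inspector: compute all non-empty intersections (the "copy-on-intersect"
// dependencies) between local and remote regions.
Schedule inspect(const std::vector<Region>& local,
                 const std::vector<Region>& remote) {
    Schedule s;
    for (const Region& a : local)
        for (const Region& b : remote) {
            Region r{std::max(a.lo, b.lo), std::min(a.hi, b.hi)};
            if (r.lo <= r.hi) s.moves.push_back(r);
        }
    return s;
}

// Executor: here it just totals the cells to be moved; a real executor would
// issue the corresponding sends and receives.
int execute(const Schedule& s) {
    int cells = 0;
    for (const Region& r : s.moves) cells += r.hi - r.lo + 1;
    return cells;
}
```

The key cost shift is that the (relatively expensive) intersection analysis runs once per change to the grid structure, while the executor runs every iteration with no global synchronization.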
Chapter �

Adaptive Mesh Applications

I have yet to see any problem, however complicated, which, when you looked at it in the right way, did not become still more complicated.
- Poul Anderson, New Scientist
�.� Introduction
In this chapter, we describe the adaptive mesh API (application programmer
interface) component of our software infrastructure (see Figure �.�). This API has
been implemented as a library built upon the parallelization and communication
abstractions of LPARX. It provides the scientific programmer with specialized, high-
level facilities specifically tailored to structured adaptive mesh applications.
We have used our adaptive mesh API to develop a parallel adaptive eigen-
value solver (LDA in Figure �.�) and an adaptive multigrid solver (AMG) for the solution
of eigenvalue problems arising in the first principles simulation of real materials. Ma-
terials design attempts to understand the chemical properties of materials through
computer simulation. Such applications require adaptive numerical methods to accu-
rately capture the chemical behavior of molecules containing atoms with steep nuclear
potentials (e.g. oxygen or transition metals). To our knowledge, this is the first time
that structured adaptive mesh techniques have been used to solve eigenvalue problems
in materials design.
Figure �.�: The adaptive mesh API, built on top of the parallelization and communication mechanisms of LPARX, provides application-specific facilities for structured adaptive mesh methods.
It is an open research question whether the irregularity of an adaptive mesh
calculation can be efficiently implemented in a data parallel language such as High
Performance Fortran [���], which does not readily support dynamic and irregular array
structures. We will present computational results (Section �.�.�) which show that
the restrictions imposed by a data parallel Fortran implementation may significantly
impact parallel performance.
This chapter is organized as follows. We begin by describing the importance
of adaptive mesh algorithms in the solution of numerical problems and review related
work. Section �.� introduces the salient features of the adaptive mesh algorithm to
motivate the software facilities required by the adaptive mesh API. In Section �.�,
we describe our API in detail and explain how its facilities are built on top of the LPARX
abstractions. Section �.� describes our materials design eigenvalue solver, provides
details about the numerical methods, and presents computational results for some
simple materials design calculations. Section �.� analyzes parallel performance and
library overheads. We conclude in Section �.� with an analysis and discussion.
�.�.� Motivation
The accurate solution of many problems in science and engineering requires
the resolution of unpredictable, localized physical phenomena. Examples include
shock waves in computational fluid dynamics [��] and the near-singular atomic core
potentials in materials design [���]. The key feature of these problems is that some
portions of the problem domain (for example, regions containing the shock waves
or the atomic nuclei) require higher resolution, and thus more computational effort,
than other areas of the computational space.
Structured adaptive numerical methods dynamically place computational
resources, such as CPU cycles and memory, in interesting portions of the solution
space; thus, they can achieve better accuracy for the same computational resources
as compared to non-adaptive methods. Although structured adaptive mesh methods
incur some overhead costs associated with adaptivity, such as error estimation and
data structure management, these overheads are insignificant when compared to the
savings gained through selective refinement [��]. For example, by exploiting adap-
tivity in a materials design application (see Section �.�), we have reduced memory
consumption and computation time by more than two orders of magnitude over an
equivalent uniform mesh method [���]. The adaptive code allows us to solve problems
on a high-performance, single processor workstation which would otherwise require
hundreds of gigaflops on a supercomputer with gigabytes of memory.
In general, adaptive methods may be structured or unstructured, depending
on how they represent the numerical solution to the problem, as shown in Figure �.�.
Unstructured adaptive methods [��, ���] store the solution using graph or tree
representations [���, ���]; these methods are called "unstructured" because connec-
tivity information must be stored for each unknown (node) of the graph. Structured
methods, such as adaptive mesh refinement [��] and structured multigrid algorithms
[���, ���], employ a hierarchy of nested mesh levels in which each level consists of
many simple, rectangular grids. Each rectangular grid in the hierarchy represents a
structured block of many thousands of unknowns. Because of these dissimilar data
Figure �.�: (a) Unstructured adaptive methods employ a graph-like representation which requires connectivity information for each unknown in the graph. (b) Structured adaptive methods use a hierarchy of levels in which each level consists of a number of rectangular grids. Each rectangular grid may contain many thousands of unknowns. The dark shaded area represents the portion of level l covered by level l+1.
representation strategies, structured adaptive methods require different software sup-
port and implementation approaches than unstructured adaptive methods. Here we
consider only structured adaptive methods.
Structured adaptive mesh methods are difficult to implement on serial
architectures, not to mention parallel machines, because they rely on dynamic, ir-
regular data structures. Regions of the computational space are dynamically refined
in response to run-time estimates of local solution error, resulting in irregular data
dependencies and communication patterns. On parallel platforms, the programmer is
further burdened with the responsibility of managing data distributed across proces-
sor memories and orchestrating interprocessor communication and synchronization.
Such distractions can significantly increase application development time. Because
adaptive applications change in response to the dynamics of the problem, little can
be known about the structure of the computation at compile-time. Thus, decisions
about data decomposition, the assignment of work to processors, and the calculation
of communication patterns must be made at run-time.
We have developed a structured adaptive mesh API that hides these im-
plementation details. It presents computational scientists with high-level tools that
allow them to concentrate on the application and the mathematics instead of low-level
concerns of data distribution and interprocessor communication. Such support en-
ables scientists to develop efficient, parallel, portable, high-performance applications
in a fraction of the time that would have been required if the application had been
developed from scratch.
�.�.� Related Work
Adaptive mesh refinement techniques for multiple spatial dimensions were
first developed by Berger and Oliger [��, ��] to solve time-dependent hyperbolic
partial differential equations. These techniques are based on previous work on locally
nested refinement structures in one spatial dimension by Bolstad [��]. Adaptive mesh
refinement methods were later used by Berger and Colella to resolve shock waves in
computational fluid dynamics [��]. Our work with adaptive mesh methods applies
this same adaptive framework to elliptic partial differential equations and adaptive
eigenvalue problems [���].
Berger and Saltzman have implemented a parallel �d adaptive mesh refine-
ment code in Connection Machine Fortran for the CM-� [��]. Their data parallel
implementation required that all regions of refinement be the same size. As a result,
the application over-refined some portions of the computational space, using �%
more memory than an equivalent implementation without the uniform size restric-
tion. Our experiments indicate that uniform refinement regions also result in excessive
overheads in three dimensions (see Section �.�.�). Because of compiler limitations,
their code did not execute efficiently on the CM-�.
Quinlan et al. have developed an adaptive mesh library called AMR++ [���],
based on the P++ data parallel C++ array class library [���]. P++ supports fine-grain
data parallel operations on arrays distributed across collections of processors; it auto-
matically manages data decomposition, interprocessor communication, and synchro-
nization. In contrast to this fine-grain array parallelism, we employ a coarse-grain
parallelism in which operations are applied in parallel to entire collections of arrays
(see Section �.�.�). Fine-grain parallelism is difficult to implement efficiently on to-
day's coarse-grain architectures; indeed, Parsons and Quinlan [���] are developing
techniques to extract coarse-grain parallelism from P++ to improve the efficiency of
the fine-grain approach.
An object-oriented library for structured adaptive mesh refinement has been
developed at Lawrence Livermore National Laboratory by Crutchfield et al. [��].
This software is intended to support hyperbolic gas dynamics applications running
on vector supercomputers [���]. The basic abstractions employed in Crutchfield's work
are very similar to our own; in fact, their adaptive mesh refinement libraries have been
parallelized using the LPARX software [���].
Parashar and Browne are developing a software infrastructure supporting
parallel adaptive mesh refinement methods for black hole interactions [���]. Their
method is based on a clever load balancing and processor mapping strategy that
maps grids to processors through locality-preserving space-filling curves [���, ���].
However, their approach imposes two restrictions on the grid hierarchy: (1) all re-
finement regions must be the same size, and (2) refinement regions must be nested in
a tree structure.¹ Our performance analysis in Section �.�.� indicates that uniform
refinement regions over-refine the computational space and are therefore less efficient
than non-uniform refinements. Although we require nested grids, our infrastructure
allows grids to have multiple parents (see Section �.�.�), and therefore provides the
application more freedom in constructing refinement structures. The implications of
the tree nesting restriction have yet to be studied, although we believe that such a
strategy will result in additional costly over-refinement.

¹In general, structured refinement hierarchies do not form a tree.
Figure �.�: The solution to a partial differential equation is resolved on a composite grid which represents a non-uniform discretization of space. In practice, composite grids are implemented as a hierarchy of grid levels. This �d grid hierarchy consists of three grid levels with a mesh refinement factor of four. Levels 0 and 1 each have only one grid, but level 2 has two grids. This composite grid hierarchy is modeled after the one used in the �d hydrogen molecular ion problem shown in Figure �.�a.
�.� Structured Adaptive Mesh Algorithms

Oh, laddie, you've got a lot to learn if you want people to think of you as a miracle worker.
- Scotty, Star Trek: The Next Generation episode "Relics"
This section provides a high-level description of structured adaptive mesh
algorithms. We present the salient features of the method to motivate the abstractions
described in Section �.�. Further numerical details can be found in Section �.�.
Structured adaptive mesh methods [��] solve partial differential equations
using a hierarchy of nested, locally structured finite difference grids. The grid hi-
erarchy can be thought of as a single composite grid in which the discretization is
non-uniform (see Figure �.�). All grids at the same level of the hierarchy have the
same mesh spacing, but each successive level has finer spacing than the one preced-
ing it, providing a more accurate representation of the solution. The difference in
mesh spacing between a grid level and the finer resolution level above it is called the
refinement factor, which is typically two or four.
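The level/spacing relationship can be stated concretely: with refinement factor r, the mesh spacing at level l is the base spacing divided by r applied l times, i.e. h_l = h_0 / r^l. A small helper (illustrative only, not part of the library's API) makes this explicit:

```cpp
// Mesh spacing at a given level of the hierarchy: each level refines the
// previous one by a fixed integer factor, so h_l = h_0 / r^l. The function
// name is illustrative, not an LPARX or adaptive-mesh-API routine.
double level_spacing(double h0, int refinement_factor, int level) {
    double h = h0;
    for (int l = 0; l < level; ++l)
        h /= refinement_factor;   // one refinement per level
    return h;
}
```

For example, with a base spacing of 1.0 and a refinement factor of four (as in the composite grid of Figure �.�), level 2 has spacing 1/16.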
Adaptive methods refine this discretization of space to accurately represent
localized physical phenomena (see Figure �.�). When creating a new level, the hi-
erarchy is refined according to an error estimate calculated at run-time. These new
higher-resolution grids, called refinement patches, are used only where necessary to
meet accuracy requirements. Adaptive methods use computational resources effi-
ciently because they expend extra memory and computation time only in regions of
unacceptable error. In general, the location and size of refinement patches must be
computed at run-time, as they cannot be predicted a priori.

Figure �.�: Three levels of a structured adaptive mesh hierarchy. The eight dark circles represent regions of high error, such as atomic nuclei in materials design applications [���]. The mesh spacing of each level is half of the previous coarser level. This problem is similar to the problem solved in Section �.�.
Structured adaptive mesh algorithms communicate information about the
numerical solution between levels of the hierarchy and also among grids at the same
level of the hierarchy. Around the boundary of each grid patch is a ghost cell region
which locally caches data from adjacent grids or, where no neighboring grids exist,
from the next coarser level of the hierarchy. Without the proper software support,
managing these bookkeeping details can be difficult because of the irregular and
unpredictable placement of refinement patches [��]. Note that ghost cell regions and
communication are required by the adaptive mesh method and are not simply artifacts
of the parallel implementation.
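The same-level part of this ghost-cell bookkeeping can be sketched in one dimension. This is illustrative only: LPARX's actual region calculus is d-dimensional, and the function name and interval representation here are hypothetical.

```cpp
#include <algorithm>

// 1-d sketch of ghost-cell accounting. A patch's ghost region is its index
// interval grown by the ghost width g; the cells of that grown region that
// overlap a same-level neighbor are filled by copying from the neighbor, and
// any remaining ghost cells must come from the next coarser level.
struct Interval { int lo, hi; };   // closed interval [lo, hi]

// Number of ghost cells of `patch` that `neighbor` can supply.
int ghost_cells_from_neighbor(Interval patch, Interval neighbor, int g) {
    Interval left{patch.lo - g, patch.lo - 1};    // ghost cells below the patch
    Interval right{patch.hi + 1, patch.hi + g};   // ghost cells above the patch
    auto overlap = [](Interval a, Interval b) {
        return std::max(0, std::min(a.hi, b.hi) - std::max(a.lo, b.lo) + 1);
    };
    return overlap(left, neighbor) + overlap(right, neighbor);
}
```

In LPARX terms, this grow-and-intersect pattern is precisely what the region calculus and copy-on-intersect operations express, which is why the library can hide the bookkeeping from the application.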
Adaptive mesh methods that use structured refinements possess a number
of advantages over unstructured adaptive methods, which represent solutions using
a graph representation (see Figure �.�). Structured refinement patches exploit the
local structure within the solution: if a point is flagged as needing refinement, then it is
likely that nearby points will also need refinement. Although the grid hierarchy itself
may be non-uniform, patches themselves are uniform. Location and size information
need only be stored for each patch, which in turn may contain many thousands of
unknowns. Because the structure of a grid patch may be represented using only a few
tens of bytes, structure information for the entire grid hierarchy may be replicated
across processor memories.
In comparison� unstructured representations require connectivity informa�
tion for each unknown in the graph� signi�cantly increasing memory overheads� On
parallel computers� the calculation of data dependencies for unstructured problems
scales as the number of unknowns� whereas algorithms for structured adaptive mesh
methods scale as the number of patches�
Furthermore, numerical kernels for structured adaptive mesh methods are
simpler and more efficient than those for unstructured methods [���]. Solvers for
structured methods may employ compact, high-order finite difference stencils. Indexing
is fast and efficient since grid patches are essentially rectangular arrays. Numerical
kernels typically make better use of the cache, as array elements are stored contiguously
in memory, improving cache locality, rather than scattered across memory.

Although local refinement saves both computation time and memory, the
savings in memory may play a more important role in many applications. Available
memory places a hard limit on the problem sizes which can be solved in-core. Problems
larger than a fixed size must resort either to paging, which is terribly slow on
most multiprocessors, or to out-of-core algorithms.
Following the work by Berger and Colella [��], our structured adaptive mesh
application consists of three main components: a numerical solver, an error estimator,
and a grid generator. Note that although our algorithm looks somewhat similar to
adaptive mesh refinement [��], it is not identical. Our intended applications employ
elliptic, not hyperbolic, partial differential equations. Elliptic solvers require different
types of numerical schemes: they use implicit iterative numerical methods [��], as
compared to the explicit time-marching schemes [��] used for hyperbolic problems.

(Footnote: Of course, unstructured representations may be more appropriate for some problems, such as
those with irregular boundaries.)
We begin with a single grid level that covers the entire computational domain
and build the adaptive grid hierarchy level-by-level. Assuming that we have already
constructed a hierarchy with levels l = 0, ..., L, our algorithm is as follows (refer
back to Figure ���):

1. Solve the partial differential equation using all levels of the grid hierarchy. The
resulting solution is the most accurate representation of the answer thus far.

2. Flag points of the grids on level L (and only on level L) where the estimated
error exceeds some specified error threshold.

3. Calculate the locations and sizes of refinement patches which cover the flagged
points on level L. If running on a parallel computer, assign these refinement
patches to processors.

4. Create level L+1 in the hierarchy using refinement patch information from the
previous step. Interpolate the current solution from level L to L+1.

5. Increment L and continue at step (1).

These five steps are repeated until a user-specified maximum number of refinement
levels has been reached or until the problem has been solved to sufficient accuracy.
The structure of the grids (the size, location, and number of refinement regions) is
determined at run-time in response to error estimates derived from the current solution.
The software abstractions needed to implement this algorithm are the subject
of the following section.
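As a concrete illustration of this cycle, the following sketch builds one refinement level in a toy one-dimensional setting. This is our own hypothetical code, not the dissertation's library: cells whose estimated error exceeds a threshold are flagged, maximal runs of flagged cells become patches, and each patch is refined by a factor of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A 1D patch: half-open cell interval [lo, hi) at some level.
struct Patch1D { int lo, hi; };

// Step 2: flag cells whose estimated error exceeds a threshold.
std::vector<bool> flagCells(const std::vector<double>& error, double tol) {
    std::vector<bool> flags(error.size(), false);
    for (std::size_t i = 0; i < error.size(); ++i)
        flags[i] = (error[i] > tol);
    return flags;
}

// Step 3: cover flagged cells with maximal contiguous runs.
std::vector<Patch1D> coverFlagged(const std::vector<bool>& flags) {
    std::vector<Patch1D> patches;
    int n = static_cast<int>(flags.size());
    for (int i = 0; i < n; ) {
        if (!flags[i]) { ++i; continue; }
        int j = i;
        while (j < n && flags[j]) ++j;   // extend the run of flagged cells
        patches.push_back({i, j});
        i = j;
    }
    return patches;
}

// Step 4: create the next level by refining each patch
// (refinement factor two by default).
std::vector<Patch1D> refine(const std::vector<Patch1D>& patches, int factor = 2) {
    std::vector<Patch1D> fine;
    for (const Patch1D& p : patches)
        fine.push_back({p.lo * factor, p.hi * factor});
    return fine;
}
```

For example, a six-cell level with high error in cells 1-2 and 5 yields two patches, [1,3) and [5,6), which become [2,6) and [10,12) on the refined level.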
��� Adaptive Mesh API
Scientists come up with great, wild theories, but then they give them dull,
unimaginative names. ... I tell you, there's a fortune to be made here.

— Calvin, "Calvin and Hobbes"
The structured adaptive mesh algorithm of the previous section is quite difficult
to implement on both sequential and parallel architectures. Refinement regions
vary in size and location in the computational space, resulting in complicated geometries
(see Figure ���). Communication patterns between grid patches and between
grid levels are irregular and change as the hierarchy is modified. On message passing
platforms, the programmer must explicitly manage grid data distributed across
the processor memories and orchestrate interprocessor communication. Even shared
memory multiprocessors require the explicit, low-level management of data locality
and communication for reasonable performance [���, ���]. These implementation
difficulties soon become unmanageable and can obscure the mathematics behind the
algorithms.

The goal of our adaptive mesh API is to provide scientists with high-level
support for structured adaptive mesh applications. Scientists using our API facilities
can concentrate on their specific applications rather than being concerned with
the underlying implementation details. Of course, our software is portable and efficient.
Portability among high-performance computing platforms guarantees that
applications software will run on the most powerful and up-to-date computational
resources available. Such a powerful software infrastructure is essential in developing
sophisticated, reusable code.
����� Software Infrastructure Overview

Figure �� illustrates the organization of our adaptive mesh infrastructure.
The software consists of three primary components: numerical operations, grid management
facilities, and display routines. The numerical routines define the elliptic
partial differential equation to be solved. The display library contains some simple
graphing and plotting facilities for visualizing data (for example, see Figure ���a and
Figure ���). The grid hierarchy management routines comprise the most complex
and interesting portion of the adaptive mesh libraries. These facilities manage all
aspects of the grid hierarchy: data structure bookkeeping, error estimation, workload
balancing and processor assignment, and communication. An important observation
is that the grid management facilities are independent of the numerical details of a
particular elliptic partial differential equation; the same routines may be used to solve
a number of different numerical problems.

Figure ��: Organization of the structured adaptive mesh API library. The software
infrastructure consists of three main components: numerical routines, grid management
facilities, and display functions.
One feature of our adaptive mesh library is that its facilities are independent
of problem dimension. Scientists using the API see the same abstractions and interface
whether they are working in two or three spatial dimensions. Numerical details
differ, but the interfaces for regridding, error estimation, load balancing, and grid
hierarchy management are identical. Dimension independence provides programmers
the freedom to develop and debug simpler, faster 2D versions of their applications
on a workstation using simplified 2D numerical kernels. Then, when confident that
the code is working, programmers can insert the appropriate 3D numerical routines
and recompile on a supercomputer. In practice, we have found dimension independence
particularly useful: the adaptive mesh API libraries and the materials design
application presented in Sections ��� and �� were first developed on workstations
in 2D.

Component             Code Lines
Data Structures       ���
Display Routines      ����
Error Estimation      ��
Grid Generation       ��
Numerical Routines    ���
Statistics Gathering  ��
Workload Balancing    ���
Total                 �����

Table ���: This table provides a breakdown of the eleven thousand lines of code that
constitute the structured adaptive mesh API library, implemented as a collection of
C++ classes and Fortran routines.
We have implemented our structured adaptive mesh API as a collection of
C++ classes and Fortran routines consisting of approximately eleven thousand lines
of code, as shown in Table ���. Our software is built on top of the LPARX abstractions
described in Chapter �. LPARX provides run-time parallel support such as
distributed data management, coarse-grain parallel execution, interprocessor communication,
and synchronization. The adaptive mesh libraries add facilities specifically
tailored towards adaptive mesh applications.

LPARX's concept of structural abstraction and its support for first-class
data decompositions have been vital to our success. Structural abstraction enables us
to represent and manipulate the structure of data (the "floorplan" describing where
refinement regions are located in space and how they are mapped to processors)
separately from the data itself. For example, when adding a new level to the adaptive
mesh hierarchy, we represent refinement regions at the new level as first-class,
language-level objects. The structure of the new refinement level is determined by
regridding routines. Refinement patches are then manipulated by load balancing and
processor assignment algorithms. Only then do we actually allocate the data associated
with the refinement patches. Structural abstraction enables us to represent and
modify dynamic refinement structures at run-time.
The following six sections describe the grid management routines of Figure ��
in more detail. We begin in Section ���� with a discussion of the data
structures used to represent the structured grid hierarchy. We then describe some
of the algorithms used in error estimation (Section �����), regridding (Section �����),
and workload balancing and processor assignment (Section �����). We conclude with
a description of coarse-grain data parallel numerical computation (Section ����)
and communication (Section �����).
����� Data Structures

Recall from Section �� that structured adaptive mesh methods store data
using a composite grid implemented as a hierarchy of levels (see Figure ��). To
represent this structure, we use three data types: a Grid, an IrregularGrid, and
a CompositeGrid. The Grid represents a single, logically rectangular array at one
level of the composite grid hierarchy. A collection of Grids at one level of the hierarchy
is an IrregularGrid, and a set of IrregularGrids organized into levels is a
CompositeGrid.

Figure ��: A composite grid is represented using a Grid, an IrregularGrid, and
a CompositeGrid. A CompositeGrid consists of IrregularGrid objects organized
into levels. Each IrregularGrid is a collection of Grids. Real grid hierarchies in
multiple dimensions are vastly more complicated (see, for example, Figure ���).

LPARX defines the basic building blocks (Region, Grid, and XArray) for
IrregularGrid and CompositeGrid. The IrregularGrid is similar to LPARX's
XArray but is specialized for adaptive mesh methods; for example, it encapsulates
information about mesh spacing that would be inappropriate for an XArray. Each
object provides facilities appropriate for its role in the adaptive mesh hierarchy, as
described in Table ��.
The parallelism in our adaptive mesh application lies at the level of the
IrregularGrid. We can parallelize across one level of the hierarchy, but there is little
opportunity for parallelism across levels. Therefore, the Grids in an IrregularGrid
are distributed across processors, and applications compute over these Grids in parallel.
Following the LPARX model, each Grid in an IrregularGrid is assigned to
one processor. Of course, a single processor may be responsible for many Grids.

Communication among Grids in the hierarchy employs LPARX's copy-on-intersect
operation (see Section ���), a high-level facility that copies data between
the logically overlapping portions of two Grids. Data motion involves no explicit
computations involving subscripts; all bookkeeping and interprocessor communication
details are managed by the run-time system. We discuss communication in detail in
Section �����.
In the implementation of our adaptive mesh libraries, we often found it
convenient to represent the structure of an IrregularGrid separately from the
IrregularGrid itself (i.e. structural abstraction). For example, regridding and load
balancing routines manipulate and return the structure (the locations of refinement
patches and their assignments to processors) of an IrregularGrid. Such structure
information is encapsulated in a GridStructure, which consists of an array of LPARX
Regions and an array of corresponding processor assignments.

Grid: Grid represents a single refinement patch in the adaptive grid hierarchy.
Grid computations are typically performed in serial numerical routines (see
Section �����).

IrregularGrid: IrregularGrid represents one level in the adaptive mesh hierarchy.
Grids in an IrregularGrid are distributed across processors, and applications
compute over these Grids in parallel (see Section �����). IrregularGrid provides
communication routines to fill boundary cells for Grids at the same level of
refinement (see Section �����).

CompositeGrid: CompositeGrid represents the entire adaptive mesh hierarchy. It
provides mechanisms to communicate between levels (see Section �����) and to
create new levels through error estimation (Section �����), grid generation
(Section �����), and load balancing (Section �����).

Table ��: Descriptions of the three basic data types used to represent the adaptive
grid hierarchy (refer to Figure ��): Grid, IrregularGrid, and CompositeGrid. The
operations defined on these data types are described in detail in succeeding sections.
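The relationships among these types can be pictured as a handful of C++ aggregates. The field choices below are illustrative assumptions of ours, not the actual LPARX or adaptive mesh API declarations:

```cpp
#include <cassert>
#include <vector>

// Region: a rectangular index-space box (2D here for brevity).
struct Region {
    int lo[2], hi[2];              // inclusive index bounds
};

// Grid: one logically rectangular patch plus its data.
struct Grid {
    Region region;                 // where the patch lives in index space
    std::vector<double> data;      // unknowns stored on the patch
};

// GridStructure: the "floorplan" of one level -- regions plus
// processor assignments, with no grid data allocated yet.
struct GridStructure {
    std::vector<Region> regions;
    std::vector<int> owners;       // owners[i] = processor for regions[i]
};

// IrregularGrid: one level of the hierarchy, a collection of Grids.
struct IrregularGrid {
    std::vector<Grid> grids;
    double spacing;                // mesh spacing at this level
};

// CompositeGrid: the whole hierarchy, organized into levels.
struct CompositeGrid {
    std::vector<IrregularGrid> levels;
};
```

The separation between GridStructure (structure only) and IrregularGrid (structure plus data) mirrors the structural abstraction described above: regridding and load balancing manipulate the former, and only afterwards is the latter instantiated.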
����� Error Estimation

To refine their representation of the solution, structured adaptive mesh algorithms
add additional levels to the grid hierarchy. Error estimation and regridding
routines are called to calculate where to place the computational resources on the new
level. Error estimation evaluates the solution error on the level of the grid hierarchy
with the finest resolution, and regridding uses this error estimate to determine where
to place new grid patches to refine portions of the domain with the highest error.

Our adaptive mesh API provides two common algorithms for estimating
solution error: solution gradient and Richardson extrapolation. The solution gradient
does not actually measure error but rather indicates where the solution is changing
most rapidly. We use this as an ad-hoc estimate of error, as no further refinement
is generally needed in regions where the solution is changing slowly (i.e. has a small
gradient). Richardson extrapolation [���] attempts to calculate an exact estimate of
the local truncation error using the solution at the two coarser levels of the grid
hierarchy.

Figure ���: (a) The error estimation procedure has flagged the points of highest error
(as indicated by the solid dots). (b) The regridding routine has generated refinement
patches which cover all flagged points but which enclose few non-flagged points.
After obtaining an estimate of error, we must flag points where the error
is "too high," as shown in Figure ���a. One well-known method, used for both elliptic
[���] and hyperbolic [��] partial differential equations, is to flag every location
which exceeds some predetermined error threshold. This method attempts to bound
the overall solution error by bounding the error at every grid point. Another approach,
which we have not seen mentioned in the literature, focuses on fixing computational
resources. This strategy flags a specified number of points with the highest error
and is appropriate for applications for which good analytical estimates of error are
unavailable.

We typically use the latter approach in our experiments with adaptive eigenvalue
calculations for materials design [���]. For eigenvalue problems, the correlation
between the pointwise error on a grid level and the final error in the eigenvalue (the
value of interest) is not straightforward [���]. Thus, we attempt to obtain the best answer
to our problem for fixed computational resources. If we find in the end that this
answer is not sufficiently accurate, we simply allocate more resources and solve again.
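The fixed-resources strategy amounts to selecting the indices of the highest-error points, which can be sketched with a partial sort. The function name and interface below are our own, not the library's API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Flag the `budget` points with the highest estimated error and
// return their indices, ordered from highest error down.
std::vector<std::size_t> flagHighestError(const std::vector<double>& error,
                                          std::size_t budget) {
    std::vector<std::size_t> idx(error.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    if (budget > idx.size()) budget = idx.size();
    // Partial sort: the `budget` largest-error indices come first.
    std::partial_sort(idx.begin(), idx.begin() + budget, idx.end(),
                      [&](std::size_t a, std::size_t b) {
                          return error[a] > error[b];
                      });
    idx.resize(budget);
    return idx;
}
```

Unlike threshold flagging, the amount of refinement work here is fixed in advance by `budget`, regardless of how the error happens to be distributed.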
����� Grid Generation

After the error estimator has flagged points on the highest level of the grid
hierarchy, the next step is to calculate where to place grid refinements for the next
level. Regridding routines take the flagged points from the error estimator and return
a GridStructure representing the refinement structure for the newest level. The goal
of grid generation is to create refinement patches that cover all flagged points of the
previous level. Patches should be relatively large (to minimize overheads) but should
enclose as few non-flagged points as possible (see Figure ���b). Furthermore, they
should be nested (i.e. every grid point at level l+1 must lie above some grid point at
level l). Each patch may have multiple parents; refinement hierarchies do not form a
tree. Patches are assumed to be rectangular and lie parallel to the coordinate axes.
Our adaptive mesh API implements a regridding algorithm by Berger and
Rigoutsos [��] based on signatures from pattern recognition. In this method, information
about the spatial distribution of flagged points within a specified region of space
is collapsed onto the one-dimensional axes; these signatures are then used to generate refinement
patches. The algorithm begins by using signatures to calculate the smallest bounding
box that contains all flagged points (see Figure ���a). If the ratio of flagged points to
total points in this box satisfies a specified efficiency threshold, then the refinement
patch is accepted. If not, the method splits the refinement patch and calls itself
recursively on the two sub-patches. In the recursive calls, signatures are calculated only
over the specified region of space enclosing the sub-patch (e.g. the "mask region" in
Figure ���b).

Signatures are used to find a splitting point for the two sub-patches. The
algorithm tries to choose a splitting point which minimizes communication across the
interface. For example, consider the signatures shown in Figure ���a. The signature
on the horizontal axis represents the number of flagged points that lie above it. Because
the center portion of the signature is zero, the splitting algorithm knows that
no flagged points exist in this region. Thus, the algorithm chooses this section as a
separator for the two smaller sub-patches.
Figure ���: (a) Refinement regions are calculated using signatures. The signature
on the horizontal axis represents the number of flagged points which lie above; the
signature on the right represents the number of flagged points to the left. The dotted
lines show the bounding box enclosing all flagged points. The two darkly shaded
regions represent efficient patches. (b) We calculate signatures using a parallel reduction
across an irregular array structure. Contributions from flagged points outside
the "mask region" are ignored.
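The signature computation and the zero-gap splitting heuristic can be sketched as follows. This is a simplified, hypothetical rendering of the idea, not the full Berger-Rigoutsos algorithm (the efficiency-threshold recursion is omitted):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column signature of a 2D flag array (1 = flagged, 0 = not):
// signature[x] counts the flagged points in column x.
std::vector<int> signature(const std::vector<std::vector<int>>& flags) {
    std::size_t ny = flags.size(), nx = flags[0].size();  // assumes non-empty
    std::vector<int> sig(nx, 0);
    for (std::size_t y = 0; y < ny; ++y)
        for (std::size_t x = 0; x < nx; ++x)
            sig[x] += flags[y][x];
    return sig;
}

// Choose a split index in the middle of the widest zero run of the
// signature; returns -1 if the signature contains no zeros.  A zero
// run means no flagged points cross the cut, minimizing communication
// across the resulting interface.
int chooseSplit(const std::vector<int>& sig) {
    int bestLen = 0, bestMid = -1;
    for (std::size_t i = 0; i < sig.size(); ) {
        if (sig[i] != 0) { ++i; continue; }
        std::size_t j = i;
        while (j < sig.size() && sig[j] == 0) ++j;   // zero run [i, j)
        if (static_cast<int>(j - i) > bestLen) {
            bestLen = static_cast<int>(j - i);
            bestMid = static_cast<int>((i + j) / 2);
        }
        i = j;
    }
    return bestMid;
}
```

In the full algorithm, the box would first be shrunk to the bounding box of the flagged points, and each half of a split would be regridded recursively until the flagged-to-total ratio meets the efficiency threshold.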
Previous implementations of the signature algorithm have represented
flagged locations using lists of points, with one point for each flagged location [��, �].
We implement a different strategy based on parallel array reductions over irregular
grid structures (an IrregularGrid of integers). Flagged points are assigned a value
of 1 and all other locations 0. Our strategy calculates signatures using array reductions
with addition over a specified region of space. For the signatures in Figure ���b,
we employ array reductions across the two Grids and mask out any contributions
that lie outside the specified mask region. Irregular array reduction is a direct generalization
of standard rectangular array reductions; portions of the index space not
covered by a Grid simply make no contribution to the result. The advantages of this
method are that it is easy to parallelize, uses the same data structures as the grid
hierarchy, and is efficient. In our computations, regridding using parallel reductions
requires only about one percent of the total computation time.

(Footnote: In their data parallel Fortran implementation, Berger and Saltzman [���] employ parallel
array reductions over rectangular arrays, but their regridding algorithm returns patches of uniform
size and is not based on signatures.)

Figure ���: Two methods for calculating uniformly sized refinement regions. Here the
flagged regions are represented by the shaded areas. (a) The computational space is
tiled (dashed lines) and only those regions with flagged points (solid lines) are kept
as refinement patches. (b) Shifted tiles cover the flagged area more efficiently (seven
patches instead of nine).

To simplify the communication of information between levels of the hierarchy,
we require that grids are logically nested within grids at the next coarser level.
The signature algorithm alone does not ensure proper nesting of the grids because it
does not take into account the boundaries of the grid patches on the previous level.
Thus, after grid generation, we ensure nesting by intersecting all new grid patches
against the underlying refinement regions. In practice, this step is rarely necessary,
as refinement regions are usually already nested.
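The nesting step reduces to rectangle intersection. A minimal sketch, with types and names of our own choosing:

```cpp
#include <cassert>
#include <vector>

// A rectangular region with inclusive integer bounds (2D).
struct Box {
    int xlo, ylo, xhi, yhi;
    bool empty() const { return xlo > xhi || ylo > yhi; }
};

// Intersection of two boxes; may be empty.
Box intersect(const Box& a, const Box& b) {
    return { a.xlo > b.xlo ? a.xlo : b.xlo,
             a.ylo > b.ylo ? a.ylo : b.ylo,
             a.xhi < b.xhi ? a.xhi : b.xhi,
             a.yhi < b.yhi ? a.yhi : b.yhi };
}

// Enforce nesting: clip each new patch against the parent regions,
// keeping one piece per non-empty (patch, parent) overlap.  A sketch;
// the library's actual routine may differ.
std::vector<Box> enforceNesting(const std::vector<Box>& newPatches,
                                const std::vector<Box>& parents) {
    std::vector<Box> nested;
    for (const Box& p : newPatches)
        for (const Box& parent : parents) {
            Box clipped = intersect(p, parent);
            if (!clipped.empty()) nested.push_back(clipped);
        }
    return nested;
}
```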
Our adaptive mesh libraries also provide an alternative regridding algorithm
based on work by Berger and Saltzman for uniform refinement regions [���].
Unlike the previous algorithm, this method guarantees that all refinement patches
are the same size. The uniform algorithm was originally motivated by an adaptive
mesh refinement implementation in a data parallel language (Connection Machine
Fortran) that required uniform patches. It is much simpler than the non-uniform
algorithm. Essentially, the computational space is tiled with refinement patches of a
specified size (see Figure ���a). Each patch is checked to see whether it includes a flagged
point. If so, then that patch is added to the new refinement level; if not, the patch is
discarded. One improvement in this algorithm implemented by Berger and Saltzman
(and also by us) allows the tiles to shift to better cover the flagged region (see
Figure ���b). These algorithms employ the same irregular array reduction primitives as
the non-uniform method. A performance comparison for the signature and uniform
regridding algorithms is presented in Section �����.
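The uniform tiling scheme, including the optional shift, can be sketched in one dimension. The 1D simplification and all names are ours:

```cpp
#include <cassert>
#include <vector>

struct Tile { int lo, hi; };               // half-open cell range [lo, hi)

// Tile a 1D domain with fixed-size patches and keep only tiles that
// contain a flagged cell.  `shift` slides the tiling left so that a
// different alignment may cover the flags with fewer patches.
std::vector<Tile> uniformRegrid(const std::vector<bool>& flags,
                                int tileSize, int shift = 0) {
    std::vector<Tile> kept;
    int n = static_cast<int>(flags.size());
    for (int lo = -shift; lo < n; lo += tileSize) {
        int a = lo < 0 ? 0 : lo;                           // clip to domain
        int b = lo + tileSize < n ? lo + tileSize : n;
        bool hasFlag = false;
        for (int i = a; i < b; ++i)
            if (flags[i]) { hasFlag = true; break; }
        if (hasFlag) kept.push_back({a, b});               // keep this tile
    }
    return kept;
}
```

With flags spanning a tile boundary, the unshifted tiling keeps two patches while a shifted tiling can cover the same flags with one, mirroring the effect shown in the figure.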
����� Load Balancing and Processor Assignment

The regridding procedure of the previous section generates a GridStructure
describing the refinement structure for the newest level in the adaptive mesh hierarchy.
In general, the refinement patches rendered by the regridding procedure vary in size
and number; there may be fewer patches than processors or many more patches than
processors. Thus, a simple cyclic assignment of patches to processors would not
typically result in good load balance. Therefore, before creating the new level, the
API routines must distribute this computational work across processors. Our API
defines load balancing and processor assignment facilities that take the structural
description returned by the grid generator, manipulate and modify it, and then return
a new GridStructure that is used to instantiate the newest grid level. The goal of
load balancing and processor assignment is to evenly distribute computational work
across the processors of the machine.
Our regridding routines seek to create refinement regions which minimize
computational effort; they have no knowledge of load balancing or processor assignment.
One possible implementation strategy would integrate load balancing and
processor assignment into regridding. We have avoided this approach for two reasons.
First, we believe that the refinement structure of the numerical computation
should not be influenced by its parallel implementation. By decoupling regridding
from parallelization, we guarantee that the regridding procedure will generate identical
refinement structures when running the same problem on varying numbers of
processors. Second, we may change either regridding methods or load balancing
strategies without influencing the other. Future improvements in one algorithm will
not force changes in the other.

Recall that all parallelism in our adaptive mesh method lies across the grid
patches in a single level of the grid hierarchy. With the exception of the communication
optimization described later in this section, we partition each grid level
independently of all other levels. For our particular numerical algorithms, the workload
associated with each refinement region is directly proportional to the size of the
region.

// Fracture refinement patches that are too large
let P = number of processors
let w = (sum of all patch sizes) / P
recursively divide patches larger than w

// Bin-pack refinement patches to processors
sort patches by size from largest to smallest
for each patch (largest first)
    assign patch to the processor with the least work
end for

Figure ����: This load balancing routine takes a set of refinement patches from the
regridding algorithm and assigns them to processors. Patches that are too large
(larger than the average workload to be assigned to a processor) are divided into
smaller ones. The resulting patches are then bin-packed to processors.
A simple but effective load balancing algorithm is shown in Figure ����.
The first step of the method is to calculate the approximate average workload w to
be assigned to each processor. Next, patches which represent more work than w are
recursively divided until all patches are size w or smaller, guaranteeing that large
patches will be evenly distributed across processors. The load balancing routines
adopt LPARX's parallelization model that an individual grid is assigned to only one
processor. Although not an issue in our particular application, dividing patches may
not be appropriate or desirable for other numerical methods for which introducing
new boundary elements creates additional computational work (e.g. flux correction
for hyperbolic partial differential equations [��]). When recursively dividing patches,
our algorithm does not generate sub-patches smaller than a specified, architecture-specific
minimum size. Although small blocks reduce load imbalance, they introduce
additional interprocessor communication. After patches have been divided, they are
sorted in decreasing order by size and are bin-packed to processors [��].
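The fracture-then-bin-pack routine can be sketched as follows. For brevity, this hypothetical version chops oversized patches into chunks of at most the average workload rather than recursively halving them, and it abstracts patch workloads as plain integers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Fracture patches larger than the average workload w, then bin-pack
// the pieces largest-first onto the least-loaded processor.  Returns
// the final per-processor loads; `owner[i]` receives the processor
// assigned to the i-th (sorted) piece.
std::vector<int> loadBalance(const std::vector<int>& work, int P,
                             std::vector<int>& owner) {
    // 1. Fracture: no piece may exceed the average workload w.
    long total = 0;
    for (int s : work) total += s;
    int w = static_cast<int>(total / P);
    if (w < 1) w = 1;
    std::vector<int> pieces;
    for (int s : work)
        while (s > 0) {
            int piece = s > w ? w : s;
            pieces.push_back(piece);
            s -= piece;
        }

    // 2. Bin-pack: largest piece first, to the least-loaded processor.
    std::sort(pieces.begin(), pieces.end(), std::greater<int>());
    std::vector<int> load(P, 0);
    owner.assign(pieces.size(), -1);
    for (std::size_t i = 0; i < pieces.size(); ++i) {
        int best = 0;
        for (int p = 1; p < P; ++p)
            if (load[p] < load[best]) best = p;
        load[best] += pieces[i];
        owner[i] = best;
    }
    return load;
}
```

Because the largest remaining piece always goes to the least-loaded processor, a single oversized patch cannot leave one processor idle while another is overloaded.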
This load balancing method works well in practice but does not take into
account interprocessor communication between levels of the grid hierarchy. Patches
communicate with their parents on the previous level. Interprocessor communication
costs for hyperbolic numerical methods are dominated by computation costs; however,
communication overheads become a significant portion of the total execution
time for elliptic problems (see Section ���).

We have developed a new processor assignment strategy that introduces interprocessor
communication costs into the bin-packing algorithm by including a "processor
preference," or affinity, for each patch (see Figure ����). Before distributing
work to processors, we assign each patch a processor preference value which estimates
the amount of communication between the patch and its parents lying on that processor.
Communication costs are directly related to the size of the intersection between
a patch and its underlying parents. When bin-packing, we attempt to place patches
on their preferred processors. This simple optimization has reduced interprocessor
communication by as much as �% in some parts of the code. Unfortunately, the
benefits of this optimization are limited to the highest levels of the grid hierarchy.
For the applications we have run to date, we have found little change in the overall
execution time. Processor preferences are most effective when there are many
more patches than processors; otherwise, the algorithm has little freedom in mapping
patches to processors. We believe, though, that this optimization will become more
important as we begin to run larger, more realistic applications requiring more levels
of refinement on larger numbers of processors.

(Footnote: Of course, new load balancing routines may be defined to handle these special cases.)
(Footnote: Phillip Colella, personal communication.)
// Fracture refinement patches which are too large
let P = number of processors
let w = (sum of all patch sizes) / P
recursively divide patches larger than w

// Calculate a processor affinity for each patch
for each patch i
    for each parent j of patch i
        // Estimate communication costs as the intersection
        // between a patch and its parent
        let c = intersection between patch i and parent j
        let p = processor owning patch j
        assign patch i a preference c for processor p
    end for
end for

// Bin-pack refinement patches to processors
sort patches by size from largest to smallest
for each patch i (largest first)
    for each processor j in i's preference list (most preferred first)
        if (assigning patch i to processor j does not exceed work w for j) then
            assign patch i to processor j
            stop considering processors for patch i
        end if
    end for
end for
for each unassigned patch (largest first)
    assign patch to the processor with the least work
end for

Figure ����: This load balancing algorithm is an improvement of the method in
Figure ����. In this approach, we use an estimate of interprocessor communication
between levels of the grid hierarchy to calculate a "processor preference," or affinity,
for each patch. When possible, each patch is assigned to its preferred processor.
// Compute in parallel over the elements of U
// U is an IrregularGrid and its elements are Grids
forall i in U
    call update(U(i))
end forall

Figure ����: Parallel computation over the individual Grids in an IrregularGrid U
(distributed here over four processors P0 through P3). The forall loop is a coarse-grain
data parallel loop which executes each iteration as if on its own virtual processor.
Function update is an externally defined numerical routine, often written in Fortran,
which performs some computation on Grid U(i).
����� Numerical Computation

The previous section described how the load balancing and processor assignment
routines decompose a single level of the adaptive grid hierarchy (an
IrregularGrid) across processors. In this section, we discuss parallel computation
over such distributed structures.

Consider the single grid level shown in Figure ����, which has been distributed
over four processors p0 through p3. Each grid has been assigned to one processor.
The largest rectangular patch has been divided over two processors, and p� has been
assigned the two smaller grids.

We express parallel execution using LPARX's forall construct, a coarse-grain
data parallel loop which executes each iteration as if on its own virtual processor.
Coarse-Grain Parallelism:

// Parallel loop over grids
forall i in U
    // Serial loop over elements
    for j,k,l in U(i)
        // Do numerical work
    end for
end forall

Fine-Grain Parallelism:

// Serial loop over grids
for i in U
    // Parallel loop over elements
    forall j,k,l in U(i)
        // Do numerical work
    end forall
end for

Figure ����: Coarse-grain data parallelism (left) expresses parallel execution over the
entire collection of grids; computation on each individual grid is serial. In contrast,
fine-grain data parallelism (right) expresses parallelism over the data elements of each
grid, and the grids are handled sequentially.
Each iteration executes independently of all other iterations� For each Grid U�i�� we
call the routine update� an externally de�ned serial numerical kernel� which executes
on one processor�
There are advantages to separating parallel execution from serial numerical computation. Numerical code may be optimized to take advantage of low-level node characteristics, such as vector units or multiple physical processors, without regard to the higher level parallelism. Existing serial code may not need to be re-implemented when parallelizing an application. Furthermore, we can leverage existing mature sequential and vector compiler technology.
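This division of labor can be sketched in a few lines outside LPARX; in the sketch below, `update` is a hypothetical stand-in for the external serial kernel, and a thread pool plays the role of the coarse-grain forall (illustrative only, not the LPARX API):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def update(grid):
    # Serial numerical kernel for one grid; in the dissertation this role
    # is played by an externally defined (often Fortran) routine.
    out = grid.copy()
    out[1:-1] = 0.5 * (grid[:-2] + grid[2:])   # one Jacobi-style sweep
    return out

def forall_update(grids, workers=4):
    # Coarse-grain data parallelism: the loop over whole grids runs in
    # parallel; the computation within each grid remains sequential.
    with ThreadPoolExecutor(workers) as pool:
        return list(pool.map(update, grids))

# Grids of irregular sizes, as in an IrregularGrid
grids = [np.zeros(8), np.array([0.0, 0.0, 4.0, 0.0, 0.0]), np.zeros(16)]
new_grids = forall_update(grids)
```

Because each kernel invocation touches only its own grid, the iterations are independent, which is exactly the property the forall construct exploits.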
Figure ���� compares our model of coarse-grain parallelism with a fine-grain data parallel style [��]. In the former, we execute in parallel over the entire collection of grids. Each grid is assigned to one processor, and numerical computation on that grid is sequential. In contrast, fine-grain parallelism processes grids sequentially and expresses parallelism over the elements of a single grid.
There are a number of advantages to coarse-grain parallelism. Because the numerical computation is serial, we may employ numerical methods on each grid which do not parallelize efficiently. For example, Gauss-Seidel relaxation works well as a smoother in multigrid, but it cannot be easily expressed in a fine-grain data parallel style. Coarse-grain parallelism also allows more asynchrony between processors and is therefore a better match to current coarse-grain message passing architectures. To improve the efficiency of the fine-grain model, Parsons and Quinlan [��] are developing run-time methods for automatically extracting coarse-grain tasks from fine-grain data parallel loops. Another model, processor subsets [��], combines the coarse-grain and fine-grain approaches: parallelism is expressed both over grids and within each grid.
In our discussion of parallel execution, we have ignored the interprocessor communication required to satisfy data dependencies. This is the subject of the following section.
���� Communication
Adaptive mesh methods exhibit two basic forms of communication: intralevel communication among grids at the same level in the grid hierarchy, and interlevel communication between adjacent levels of the hierarchy. Both forms of communication employ LPARX's copy-on-intersect operation (see Section ���), which copies a block of data between the logically overlapping portions of two grids.

The purpose of intralevel communication is to obtain boundary information from neighboring grids. Around the boundary of each grid patch is a ghost cell region used to locally cache data from adjacent grids. These ghost cells are needed by adaptive mesh algorithms even on serial architectures; they are an intrinsic component of the computation and are not simply an artifact of parallelization. The pseudocode shown in Figure ���� updates the ghost cell regions of each grid with data from the interior (non-ghost cell) portions of adjacent grids.
Interlevel communication transfers information up (from coarser grids to finer grids) and down (from finer to coarser) the adaptive grid hierarchy. We will describe only the latter process, called coarsening, as the computational structure of the former is identical. As shown in Figure ���, coarsening involves two steps. First, information at the fine grid level is averaged into a temporary, intermediate grid level,
    // Communicate boundary information between grids at the same level
    // U is the irregular array of grids at this level of the hierarchy
    function FillPatch(U)
      // Loop over all pairs of grids in U
      forall i in U
        for j in U
          // Copy data from intersecting regions
          // Interior() returns the interior of its argument
          copy into U(j) from U(i) on Interior(U(i))
        end for
      end forall
    end function
Figure ����: Intralevel communication copies boundary information from the interiors of adjacent grids at the same level of the adaptive mesh hierarchy.
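The copy-on-intersect behavior that FillPatch relies on can be sketched with plain NumPy; the `Patch` class below is a hypothetical stand-in for an LPARX grid that records its global index space and a one-cell ghost halo:

```python
import numpy as np

class Patch:
    # A grid patch over the global cell range [lo, hi) plus a ghost halo.
    def __init__(self, lo, hi, ghost=1):
        self.lo, self.hi, self.ghost = lo, hi, ghost
        self.data = np.zeros([h - l + 2 * ghost for l, h in zip(lo, hi)])

    def copy_from(self, src):
        # Copy-on-intersect: copy from src's interior into the logically
        # overlapping portion of this patch (typically its ghost cells).
        g = self.ghost
        lo = [max(a - g, b) for a, b in zip(self.lo, src.lo)]
        hi = [min(a + g, b) for a, b in zip(self.hi, src.hi)]
        if any(l >= h for l, h in zip(lo, hi)):
            return                                   # empty intersection
        dst = tuple(slice(l - a + g, h - a + g)
                    for l, h, a in zip(lo, hi, self.lo))
        s = tuple(slice(l - a + src.ghost, h - a + src.ghost)
                  for l, h, a in zip(lo, hi, src.lo))
        self.data[dst] = src.data[s]

def fillpatch(patches):
    # Update every patch's ghost cells from the interiors of its neighbors.
    for p in patches:
        for q in patches:
            if p is not q:
                p.copy_from(q)

a = Patch([0], [4])                  # interior cells 0..3
b = Patch([4], [8])                  # interior cells 4..7
a.data[1:5] = [10.0, 11.0, 12.0, 13.0]
b.data[1:5] = [20.0, 21.0, 22.0, 23.0]
fillpatch([a, b])
```

After the exchange, each patch's ghost cells hold a copy of the neighboring patch's adjacent interior cells; the same index arithmetic extends unchanged to two and three dimensions.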
which is a coarsened version of the fine grid level. The numerical computation by subroutine Average is performed in parallel. Second, the coarsened data in the intermediate grid level is copied into the coarse grid level using LPARX's copy operation. Although the redistribution of data from the intermediate grid into the coarse grid may appear expensive, this is typically not the case. In the act of averaging, the quantity of data in the fine grid is generally reduced by a factor of r^d for a mesh refinement factor of r in d dimensions; for our particular application, r = �, d = �, and r^d = ��. Thus, the intermediate grid level represents a relatively small amount of data.
    // Communicate between grid levels in the hierarchy
    // Fine is the fine grid level
    // Temp is the coarsened version of Fine
    // Coarse is the coarse grid level
    function Coarsen(Fine, Temp, Coarse)
      // Average information in Fine down to Temp
      forall i in Fine
        call Average(Fine(i), Temp(i))
      end forall
      // Copy data from Temp into grid level Coarse
      forall i in Temp
        for j in Coarse
          copy into Coarse(j) from Temp(i)
        end for
      end forall
    end function
Figure ����: Interlevel communication transfers information between levels of the adaptive mesh hierarchy. Coarsen illustrates communication between a fine level and the coarse level beneath it. Data is first averaged into an intermediate, temporary grid level, which is then copied into the coarse level.
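A serial sketch of the two steps of Coarsen, assuming a refinement ratio r = 2 in two dimensions (the function names and shapes here are illustrative, not LPARX's):

```python
import numpy as np

def average(fine, r=2):
    # Step 1: average r x r blocks of the fine patch into a temporary grid
    # at the coarse level's resolution; data volume shrinks by r**d.
    nx, ny = fine.shape[0] // r, fine.shape[1] // r
    return fine.reshape(nx, r, ny, r).mean(axis=(1, 3))

def coarsen(fine, coarse, offset, r=2):
    # Step 2: copy the (small) temporary grid into the portion of the
    # coarse level that the fine patch overlies, as LPARX's copy does.
    temp = average(fine, r)
    i, j = offset                       # coarse-level index of the patch
    coarse[i:i + temp.shape[0], j:j + temp.shape[1]] = temp

fine = np.arange(16, dtype=float).reshape(4, 4)
coarse = np.zeros((4, 4))
coarsen(fine, coarse, (1, 1))
```

The intermediate array `temp` is a factor of r² smaller than the fine patch, which is why the second, redistribution step is cheap in practice.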
��� Adaptive Eigensolvers in Materials Design
And all this science I don't understand;
it's just my job five days a week.
- Elton John, "Rocket Man"
We have applied our structured adaptive mesh infrastructure to the solution of model eigenvalue problems arising in the first principles design of real materials. To our knowledge, this is the first time that such techniques have been used to solve these problems on parallel computers. Materials design seeks to understand the chemical properties of technologically important materials through computer simulation. Consider the C��H�� molecule shown in Figure ���. Materials scientists would like to understand properties of this molecule, such as: What is the size of this ring? What is the bond distance between two carbons? What is the energy of this system? Computer modeling offers the possibility of answering these questions without actually constructing the compound in the laboratory.
One difficulty in modeling real materials is that some atoms, such as oxygen or transition metals, exhibit steep effective potentials localized about the nucleus. Because the chemical properties of an atom are determined by these localized potentials, it is vital that numerical methods represent them accurately. Materials scientists have traditionally employed Fourier transform methods, but such methods do not easily admit the non-uniformity needed to capture localized phenomena [��].

Adaptive numerical methods support the non-uniform representation needed to accurately resolve localized potentials. While preliminary, the results presented in this section indicate that adaptive eigenvalue methods may provide an important alternative solution methodology for materials design applications.
The need for adaptive refinement in materials design has been noted by other researchers. Bernholc et al. [��] have implemented a semi-adaptive code that places a single, static refinement patch over each of their atoms. Others have attempted adaptive solutions using either finite element methods [��] or a combination of finite element and wavelet techniques [��].

Figure ���: Materials design seeks to understand the chemical properties of molecules such as the C��H�� ring shown here. This image was provided by Kawai and Weare.

Although finite element methods provide adaptivity, they are more computationally expensive than structured adaptive mesh methods because they require significantly more memory and CPU time for the same number of unknowns. Thus, given the same computational resources, finite element calculations must employ a coarser representation than an equivalent structured adaptive mesh method.
Section ����� describes our model materials design application in more detail. In the succeeding four sections, we describe the numerical algorithms used in our adaptive eigenvalue solver. Section ����� briefly reviews the adaptive mesh framework that has already been discussed in Section ���. We have integrated into this adaptive methodology a multilevel eigenvalue solver (Section �����) that is based on the multigrid method (Section �����). Section ����� describes the finite difference stencil used to discretize the continuous equations. Finally, Section ����� presents computational results for two simple materials design problems.
����� A Model Problem
The first principles design of real materials models the properties of complex chemical compounds through solving approximations to the Schrödinger equation. One common approach uses the Local Density Approximation (LDA) of Kohn and Sham [��]. In the LDA, the electronic wavefunctions u_i are given by the solutions to the nonlinear eigenvalue problem

    H u_i = λ_i u_i.    (���)

The differential operator H, which is elliptic, self-adjoint, real, and indefinite, is given by the Hamiltonian

    H = -(1/(2m)) ∇² + V.    (���)
The wavefunction, or eigenvector, u_i provides a measure for the location of the electron: u_i² gives the probability that the ith electron is located at a particular point in space. The potential term V contains contributions from various electron-electron and electron-nucleus interactions; it is a function of both position and the wavefunctions u_i. In general, we wish to find only the N_a lowest eigenvalues and associated eigenvectors, where N_a is typically the number of atoms in the system. N_a is several orders of magnitude smaller than the number of grid points used to represent the system.

Several length scales are represented in the solution to Eqs. ��� and ���. The overall size of the system is determined by the atomic positions and the associated electron density. Furthermore, associated with each atomic center is an effective nuclear charge which varies according to the atomic species. Our goal is to show that adaptive numerical methods can accurately resolve these different length scales. Thus, in our experiments, we use a model potential V which captures the essential near-singular behavior and multiple length scales of real effective potentials while removing many nonessential details.
    L = 0
    while (L < MaxLevels) do
      call EigenvalueSolver
      call ErrorEstimation
      call GridGeneration
      L = L + 1
    end while
Figure ����: The adaptive eigenvalue solver builds the adaptive grid hierarchy level-by-level. The solution on L levels is used to create the next finer grid level.
����� Adaptive Framework
Recall from Section �� that structured adaptive numerical methods build
the grid hierarchy level�by�level� That is� the algorithm solves the problem on a
hierarchy of L levels before proceeding to one with L ! � levels� The solution on
L levels is used to guide the creation of the next �ner grid level� Furthermore� this
�nested iteration� or �full multigrid cycle� �as it is called by the multigrid community�
speeds convergence because the solution on L levels provides a good estimate for the
solution at level L ! �� An outline of our adaptive eigenvalue solver is shown in
Figure �����
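The payoff of nested iteration is easy to demonstrate on a toy 1-D Poisson problem: interpolating the coarse-level solution as the initial guess on the finer level reduces the number of smoothing sweeps needed there. (Gauss-Seidel stands in for the per-level solver; all names are illustrative.)

```python
import numpy as np

def sweeps_to_converge(u, f, h, tol=1e-8):
    # Gauss-Seidel on -u'' = f (zero boundaries); count sweeps until the
    # maximum residual falls below tol.  u is updated in place.
    count = 0
    while True:
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i-1] + u[i+1] + h * h * f[i])
        count += 1
        r = f[1:-1] - (2*u[1:-1] - u[:-2] - u[2:]) / (h * h)
        if np.max(np.abs(r)) < tol:
            return count

n, h = 65, 1.0 / 64
x = np.linspace(0.0, 1.0, n)
f = np.sin(np.pi * x)

# Cold start: zero initial guess on the fine level.
cold = sweeps_to_converge(np.zeros(n), f, h)

# Nested iteration: solve on the coarse level, interpolate, then finish.
xc, hc = x[::2], 2.0 * h
uc = np.zeros(33)
sweeps_to_converge(uc, np.sin(np.pi * xc), hc)
warm_guess = np.interp(x, xc, uc)
warm = sweeps_to_converge(warm_guess, f, h)
```

The coarse solve is cheap (fewer unknowns, faster smoothing), and the interpolated guess starts the fine-level iteration several orders of magnitude closer to the answer.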
����� Eigenvalue Algorithm
We wish to solve the following generalized eigenvalue problem:

    H u_i = λ_i K u_i,   i = 1, ..., N_a    (���)

where H and K are symmetric and positive definite. We require only the lowest N_a eigenvalues and eigenvectors, where N_a is much smaller than the number of unknowns, N_g. Typically, N_a ≈ �� to ���, whereas N_g ≈ ���.
Although K = I in many eigenvalue problems, we must address the general case because the higher order O(h⁴) discretization employed by our finite difference scheme introduces a right hand side matrix K ≠ I.
    let u be an initial guess (u != 0)
    repeat
      H-normalize u: (u, Hu) = 1
      let λ = (u, Hu) / (u, Ku)
      perform one multigrid V-cycle on (H - λK)u = 0
    until ||(H - λK)u|| < ε  (some error tolerance)
Figure ����: Mandel and McCormick's [��] linearized iterative multigrid-based eigenvalue solver finds the lowest eigenvalue λ and its associated eigenvector u from the generalized eigenvalue problem Hu = λKu.
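The outer structure of this iteration can be sketched in NumPy, with a direct inverse-iteration solve standing in for the multigrid V-cycle; this illustrates the H-normalization and Rayleigh-quotient steps, not the optimal O(N_g) method, and the test matrix is synthetic:

```python
import numpy as np

def lowest_eigenpair(H, K, iters=80, seed=0):
    # Lowest eigenpair of the generalized problem H u = lambda K u,
    # for H and K symmetric positive definite.
    u = np.random.default_rng(seed).standard_normal(H.shape[0])
    for _ in range(iters):
        u = np.linalg.solve(H, K @ u)        # stand-in for one V-cycle
        u /= np.sqrt(u @ (H @ u))            # H-normalize: (u, Hu) = 1
    return (u @ (H @ u)) / (u @ (K @ u)), u  # Rayleigh quotient, vector

# Synthetic SPD matrix with known spectrum 1, 2, ..., n.
n = 40
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((n, n)))
H = Q @ np.diag(np.arange(1.0, n + 1.0)) @ Q.T
lam, u = lowest_eigenpair(H, np.eye(n))
```

Each pass damps the components of u belonging to larger eigenvalues, so the Rayleigh quotient settles onto the lowest eigenvalue.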
Because we need only a few of the lowest eigenvalues, we do not apply standard dense eigenvalue algorithms (e.g. QR methods [��]) to the solution of Eq. ���. Instead, we use a multigrid-based iterative method by Mandel and McCormick [��] which efficiently calculates the lowest eigenvalue and associated eigenvector. This algorithm is shown in Figure ����. Cai et al. [��] have proven that this method is optimal in the sense that it requires O(N_g) work and O(1) iterations. Convergence is independent of the distribution of eigenvalues.

To calculate eigenvalues other than the lowest, we apply the above procedure and, at each step, orthogonalize the candidate eigenvector u_n against all previously calculated eigenvectors u_i, 1 ≤ i ≤ n-1. Unfortunately, this strategy does not appear to retain the optimal convergence properties proven for the lowest eigenvalue. The best approach to extracting the lowest N_a eigenvalues and eigenvectors is still an open research question.
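The deflation strategy can be sketched as follows (for K = I; the lowest eigenvector is taken as known from the synthetic construction, and inverse iteration again stands in for the multigrid cycle):

```python
import numpy as np

def next_eigenpair(H, prev_vecs, iters=80, seed=2):
    # Inverse iteration with Gram-Schmidt deflation: orthogonalize the
    # candidate against previously computed (unit-norm) eigenvectors.
    u = np.random.default_rng(seed).standard_normal(H.shape[0])
    for _ in range(iters):
        u = np.linalg.solve(H, u)
        for v in prev_vecs:
            u -= (v @ u) * v                 # project out converged modes
        u /= np.linalg.norm(u)
    return u @ (H @ u), u

# Synthetic SPD matrix with known spectrum 1, 2, ..., n.
n = 40
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((n, n)))
H = Q @ np.diag(np.arange(1.0, n + 1.0)) @ Q.T
lam2, u2 = next_eigenpair(H, [Q[:, 0]])      # deflate the lowest mode
```

With the lowest mode projected out at every step, the iteration converges to the second-lowest eigenpair instead.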
As with many iterative algorithms, a good initial guess for u can significantly reduce the time to solution. Such good guesses are often available. For example, in problems which evolve over time, the solution at one timestep may be used to seed the solution at the following timestep.
One potential difficulty in applying this method to our materials design application is that the Hamiltonian H is not positive definite; some eigenvalues, in fact those eigenvalues of interest, are negative. The normalization (u, Hu) = 1, which requires the calculation of sqrt((u, Hu)), is only well-defined for H positive definite. Thus, we must shift the eigenvalues of H to make them positive. If -μ is a lower bound for the eigenvalues of H, then H + μI is positive definite. Because the performance of our solver is independent of the eigenvalue distribution of H, shifting the eigenvalues by μ does not change its convergence properties. The lower bound μ need not be tight; in materials design calculations, an approximate lower bound (which actually represents the energy of the system) can usually be found using experimental knowledge or experience.
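A quick numerical illustration of the shift: for an indefinite symmetric H the lowest eigenvalue is negative, so (u, Hu) can be negative and the H-normalization is undefined, while H + μI is positive definite and has exactly the same eigenvectors (the matrices here are random test data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
A = rng.standard_normal((n, n))
H = 0.5 * (A + A.T)                    # symmetric but indefinite
eigs = np.linalg.eigvalsh(H)           # ascending; eigs[0] < 0 here

# Shift: if -mu is any lower bound on the spectrum, H + mu*I is SPD.
mu = -eigs[0] + 1.0                    # a loose bound is good enough
H_shift = H + mu * np.eye(n)
shifted = np.linalg.eigvalsh(H_shift)  # each eigenvalue moves by exactly mu
```

Since every eigenvalue is translated by the same amount and the eigenvectors are untouched, the shift changes nothing about the solver's behavior beyond making the normalization well-defined.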
����� Multigrid
The eigenvalue solver of the previous section depends on a multilevel iterative technique called multigrid. Multigrid [��] is a fast method for solving partial differential equations. It represents the solution on a grid hierarchy. Multigrid uses the multiple levels of the hierarchy to accelerate the communication of numerical information across the computational domain and reduce the time to solution. Multigrid techniques integrate easily into the adaptive mesh framework; the same grid hierarchy used by adaptive mesh methods to represent the solution can be used by multigrid to accelerate convergence. In this section, we briefly describe Brandt's Full Approximation Storage (FAS) variant [��] of multigrid; further details can be found elsewhere [��].

We wish to solve the partial differential equation Lu = 0 subject to Dirichlet boundary conditions on the composite grid u consisting of L + 1 levels u_l, 0 ≤ l ≤ L. In general, L is a nonlinear operator. We divide each level u_l of u into two parts: the boundary, denoted by boundary(u_l), and the interior, interior(u_l). Multigrid requires two operators to transfer data values between grids at neighboring levels of the hierarchy. The coarsening operator I_C takes data at level l + 1 and averages it down to level l. The refining operator I_F takes data at level l and interpolates it to grids at level l + 1. Finally, we define a relaxation procedure relax that performs
    FAS(l, u_l, f_l):
      u_l <- relax(L u_l = f_l)                      // pre-smoothing
      if (l != 0)
        r_{l-1} <- I_C (f_l - L u_l)                 // restrict the residual
        t_{l-1} <- I_C u_l                           // restrict the solution
        f_{l-1} <- r_{l-1} + L t_{l-1}  on interior(u_{l-1});  0 otherwise
        u_{l-1} <- FAS(l-1, u_{l-1}, f_{l-1})
        u_l <- u_l + I_F (u_{l-1} - t_{l-1})  on interior(u_l)
        u_l <- I_F u_{l-1}                    on boundary(u_l)
      end if
      u_l <- relax(L u_l = f_l)                      // post-smoothing

Figure ����: The Full Approximation Storage (FAS) multigrid algorithm of Brandt [��]. When called via FAS(L, u_L, 0), this method performs one V-cycle on the equation Lu = 0 for the composite grid u consisting of L + 1 levels u_l, 0 ≤ l ≤ L.
smoothing (e.g. Gauss-Seidel [��]) on each multigrid level.

The FAS multigrid method is shown in Figure ����. When invoked using FAS(L, u_L, 0), the algorithm starts at level L of the grid hierarchy, winds down through the grids to level 0, and then works back up to level L. This cycling strategy is called a multigrid V-cycle. Each application of FAS drives u closer to the solution. We have found that between ten and twenty V-cycle iterations are usually sufficient to solve our problems to machine precision.

Immediately following the recursive call to level l - 1, grid level u_l is updated using new information from level l - 1. Since we may not have explicit Dirichlet boundary information for higher levels in the grid hierarchy, we must calculate the appropriate boundary conditions from the next lower level. We use the interpolated data values from the underlying grid level u_{l-1} to obtain the Dirichlet boundary conditions for level u_l.
An important distinction between the FAS method and other multigrid variants is that FAS, like the adaptive mesh method, stores unknowns at each level of the multigrid hierarchy. Many multigrid methods store error corrections to the unknown (i.e. residuals) rather than the unknown itself. Such a storage technique would be incompatible with the adaptive mesh framework.
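A minimal V-cycle for the 1-D model problem -u'' = f illustrates the cycling structure; note that this sketch uses the simpler correction scheme (storing residual corrections on coarse levels) rather than FAS, and is independent of LPARX:

```python
import numpy as np

def gauss_seidel(u, f, h, sweeps):
    # In-place Gauss-Seidel smoothing for -u'' = f, u = 0 on the boundary.
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i-1] + u[i+1] + h * h * f[i])

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2*u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def v_cycle(u, f, h):
    # One V-cycle: pre-smooth, restrict the residual, recurse,
    # interpolate the coarse correction, post-smooth.
    if len(u) <= 3:
        gauss_seidel(u, f, h, 50)       # "exact" solve on coarsest grid
        return u
    gauss_seidel(u, f, h, 2)
    r = residual(u, f, h)
    rc = r[::2].copy()                  # full-weighting restriction
    rc[1:-1] = 0.25*r[1:-2:2] + 0.5*r[2:-1:2] + 0.25*r[3::2]
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h)
    e = np.zeros_like(u)                # linear interpolation of correction
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    u += e
    gauss_seidel(u, f, h, 2)
    return u

n, h = 65, 1.0 / 64
x = np.linspace(0.0, 1.0, n)
f = np.sin(np.pi * x)
u = np.zeros(n)
for _ in range(20):
    v_cycle(u, f, h)
```

Each cycle reduces the error by a fixed factor independent of the grid size, which is the property that makes multigrid an O(N) solver.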
����� Finite Difference Discretizations
To implement these numerical algorithms on a computer, we require a method for representing the partial differential operator L used in the multigrid procedure. We solve for elliptic operators of the general form

    Lu = (-α∇² + f)u + g    (���)

where α ∈ R and u, f, g : R³ → R. For example, for Poisson's equation ∇²u = ρ, we have α = -1, f = 0, and g = -ρ. For the Hamiltonian eigenvalue problem

    (-(1/(2m))∇² + V - λ)u = 0,    (���)

α = 1/(2m), f = V - λ, and g = 0.
We discretize this expression using a finite difference scheme. Consider the cube of 27 integer locations in three dimensions (3 × 3 × 3) centered at the point i. Define face to be the set of six points at the center of each face of the cube. Likewise, define edge as the set of twelve points on the edges (but not the corners) of the cube. We ignore the eight corner points. The fourth order O(h⁴) finite difference discretization corresponding to mesh location i with mesh spacing h can be written as [��]
    (L_h u)_i = -(α / (6h²)) ( -24 u_i + 2 Σ_{j∈face} u_j + Σ_{j∈edge} u_j )
                + (1/12) ( 6 f_i u_i + Σ_{j∈face} f_j u_j )
                + (1/12) ( 6 g_i + Σ_{j∈face} g_j )    (���)
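The Laplacian portion of this formula is the standard 19-point Mehrstellen stencil. The sketch below checks its fourth-order accuracy for the pure Laplacian case (α = 1, f = g = 0) on a periodic grid, assuming the usual weights (-24, 2, 1) together with the matching right-hand-side face averaging:

```python
import numpy as np

def face_sum(a):
    # Sum over the 6 face neighbors of every point (periodic grid).
    return sum(np.roll(a, s, axis) for axis in range(3) for s in (-1, 1))

def edge_sum(a):
    # Sum over the 12 edge neighbors (corners excluded).
    total = np.zeros_like(a)
    for ax1, ax2 in ((0, 1), (0, 2), (1, 2)):
        for s1 in (-1, 1):
            for s2 in (-1, 1):
                total += np.roll(np.roll(a, s1, ax1), s2, ax2)
    return total

def mehrstellen(u, h):
    # 19-point compact approximation to the Laplacian.
    return (-24.0 * u + 2.0 * face_sum(u) + edge_sum(u)) / (6.0 * h * h)

def rhs_average(F):
    # Matching face averaging of the right-hand side; needed for O(h^4).
    return (6.0 * F + face_sum(F)) / 12.0

def truncation_error(n):
    h = 1.0 / n
    x = np.arange(n) * h
    X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
    u = np.sin(2*np.pi*X) * np.sin(2*np.pi*Y) * np.sin(2*np.pi*Z)
    F = -12.0 * np.pi**2 * u               # exact Laplacian of u
    return np.max(np.abs(mehrstellen(u, h) - rhs_average(F)))
```

Halving the mesh spacing should shrink the truncation error by roughly a factor of sixteen, confirming fourth-order accuracy even though the stencil touches only nearest neighbors.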
����� Computational Results
To validate our approach, we chose two deceptively simple problems whose analytic solutions are known: the hydrogen atom and the H₂⁺ molecule. While the Hamiltonians for these problems are very simple, they cannot be solved directly using current Fourier transform techniques because of a 1/r singularity in the atomic potential. All of the following model problems were solved in 3d; we did not attempt to exploit problem symmetry. Each solution required approximately one minute running on an IBM RS/6000 model ���. Note that real materials design problems will contain tens or hundreds of atoms, not just one or two, and will require the computation and memory resources of a high-performance parallel computer.
Hydrogen
The Hamiltonian operator for hydrogen has a simple form:

    H = -∇²/2 - Z/r.    (���)

While the eigenvalue problem associated with Eq. ��� can be solved analytically, the singular behavior at r = 0 can cause significant difficulties for non-adaptive numerical methods. For example, current Fourier methods with ��³ grid points return ���� as the lowest eigenvalue instead of the correct value of ����. The solution to this problem is an exponential with the form e^(-Zr) and eigenvalue λ = -Z²/2. As Z increases, the solution becomes increasingly localized about the origin.
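The stated ground state can be checked with a small radial finite-difference calculation (a uniform radial grid here, so only modest accuracy near the cusp; this stands apart from the adaptive solver itself):

```python
import numpy as np

def hydrogen_ground_state(Z=1.0, R=20.0, n=1000):
    # For u(r) = r * psi(r), the radial problem is
    #   -(1/2) u'' - (Z/r) u = E u,  u(0) = u(R) = 0,
    # whose exact ground state energy is -Z**2 / 2.
    h = R / n
    r = h * np.arange(1, n)                    # interior points
    diag = 1.0 / h**2 - Z / r                  # kinetic + Coulomb terms
    off = -0.5 / h**2 * np.ones(n - 2)
    H = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
    return np.linalg.eigvalsh(H)[0]
```

Even this dense one-dimensional solve recovers E ≈ -1/2 for Z = 1; resolving the same cusp in a full 3-d grid without adaptivity is what makes the problem expensive.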
The solution to the hydrogen problem (Z = 1) is plotted in Figure ���a. Note that the units for the values of the wavefunction are arbitrary. The adaptive eigenvalue solution and the exact answer (not plotted) are identical at the scale of this graph. The cusp at the origin is a result of the singularity of the potential at this point and is usually very expensive to resolve with a non-adaptive numerical method. The accurate representation of the cusp requires a very dense clustering of grid points about the origin (see the abscissa of Figure ���a).
Figure ���: (a) The left graph displays the lowest energy eigenvector for the hydrogen atom; graph data was extracted from the 3d volume along the Z axis. Tick marks on the abscissa represent the mesh points of the composite grid. (b) The right plot shows the eigenvalues for a -Z/R potential as a function of the nuclear charge Z. All three solution methods used approximately the same number of grid points. Note that the "Exact" solution and the "Four Adaptive Levels" solution lie on top of one another.
Figure ���b illustrates how the eigenvalues of the system change with increasing nuclear charge Z. To accurately capture the increasingly localized and near-singular solution, it was necessary to use higher levels of adaptivity. However, because of the increased localization about the origin, the total number of grid points remains approximately the same. The resolution of the finest grid level in the adaptive solution would correspond to a uniform grid with ���³ points, as compared to the fewer than ��� points required by the adaptive algorithm, a savings of ����. Obviously, a uniform mesh of this resolution would be unthinkable.
Hydrogen Molecular Ion
Another simple problem which is more commonly used as a test case for
Figure ���: (a) The left graph displays the lowest energy eigenvector for the hydrogen molecular ion; graph data was extracted from the 3d volume along the Z axis. Tick marks on the abscissa represent the mesh points of the composite grid. (b) The right plot shows the binding energy (Morse) curve as a function of atomic separation. The minimum point on this curve represents the minimum energy distance between the two atoms.
chemical methods is the hydrogen molecular ion (H₂⁺). In this problem, there is only one electron but two nuclei with 1/r singularities:

    H = -∇²/(2m) - 1/|r + R_a/2| - 1/|r - R_a/2|,  where R_a is the atomic separation.    (���)
Analytic solutions for this problem are known [��], but it is too stiff for practical solution via Fourier methods. The adaptive eigenvalue solver method performs quite well, however. The eigenvector solution is plotted in Figure ���a; note the increased density of grid points in the vicinity of the two nuclei. Figure ���b shows the binding energy curve for H₂⁺. The binding energy is defined as the total energy of the atoms at a specified distance minus the energy at infinite separation. The minimum point in this curve tells materials scientists the preferred minimum energy separation between the two atoms of the molecule.
��� Performance Analysis
First get your facts; then you can distort them at your leisure.
- Samuel Clemens
Portability and performance are two vital considerations in the design and implementation of any numerical library. Parallel computers become obsolete at an alarming rate; today's state-of-the-art supercomputer is tomorrow's space heater. Portability ensures that numerical software will run on the most powerful and up-to-date computational resources available. Computational scientists will not use software libraries unless they deliver reasonable performance. In this section, we analyze the performance and overheads of our adaptive mesh library. We begin with a performance comparison of an Intel Paragon and an IBM SP2 with one processor of a Cray C-90, and the succeeding section presents a detailed breakdown of parallel execution times.
It is an open research question whether non-uniform refinement structures can be efficiently supported in a data parallel language. One implementation strategy for structured adaptive mesh methods in a data parallel language such as High Performance Fortran [��] would restrict all refinement patches to be the same size [��]. We therefore conclude this section with an analysis of the performance implications of requiring uniformly sized refinement regions.

The motivating application for our structured adaptive mesh API is the adaptive solution of nonlinear eigenvalue problems arising in materials design [��]. We present computational results for the calculation of the lowest eigenvalue and associated eigenvector for the following 3d eigenvalue problem:

    (-∇² + V)u = λu    (���)

    V(r) = -Σ_{i=1}^{10} 1 / |r - (cos(2πi/10), sin(2πi/10), 0)|    (����)

The potential V represents a ring of ten hydrogen ions located in the Z = 0 plane, as shown in Figure ���. While this is a synthetic problem, its structure resembles real materials design applications of interest (e.g. ring structures).
Figure ���: Computational results were gathered for a 3d synthetic eigenvalue problem with ten hydrogen ions located in a ring, as illustrated here by a slice through the Z = 0 plane. Local refinement regions are represented by the rectangles superimposed on the image.
The adaptive mesh hierarchy for this problem consists of eight levels with a total of ������� grid points (see Table ���). The first six levels are the usual uniform multigrid grids (with a mesh refinement ratio of two) and the next two are adaptively refined (with a mesh refinement ratio of four). The resolution on the finest level corresponds to a uniform mesh of size ���³; thus, for this application, adaptivity reduced memory requirements by a factor of roughly ��� as compared to a uniform fine mesh.

Recall from Section ��� that this grid hierarchy is built level-by-level at run-time. In the following sections, we will typically report the cost for one iteration of the eigenvalue algorithm over all eight levels and will ignore the iterations used to build it. In real materials design problems, the building phase is brief and is followed by numerous iterations over the entire hierarchy structure. Moreover, most of the computational work and the memory storage is located at the highest levels.

Each complete iteration requires approximately ��� million floating point operations, or approximately ��� flops per grid point spread out over about ten
Multigrid Levelsl � � l � � l � l � � l � � l � �
Unknowns � � ��� ���� ���� ��� ��� ���
Mesh Spacing ���� ���� ���� ���� ���� ���� ����� ����
Adaptive Re�nement Totall � � l � �
Unknowns ��� ��� ���� ��� ���� ���
Mesh Spacing ����� ���� ����� ����
Table ���: Number of unknowns and mesh spacing for the adaptive mesh hierarchy used to solve the problem pictured in Figure ���. The resolution on the finest grid level corresponds to a uniform mesh of size ���³.
Machine C�� Fortran OperatingCompiler Optimization Compiler Optimization System
C��� CC v������� �O cft�� v��� �O� UNICOS �����
Paragon g�� v��� �O �mnoieee if�� v���� �O� �Knoieee OSF�� �����
SP xlC v�� �O �Q xlf v��� �O� AIX v��
Table ���: Software version numbers and compiler optimization flags for all computations in this chapter. All benchmarks used LPARX release v���. Detailed machine characteristics are reported in Appendix A.
different numerical routines with intermittent communication. Table ��� summarizes software releases and compiler flags for all benchmarks; refer to Appendix A for a detailed description of the computer architectures. All floating point arithmetic used 64-bit numbers.
����� Performance Comparison
Table ��� and Figure ��� compare the execution times for the IBM SP2, Intel Paragon, and one processor of a Cray C-90. Note that although the SP2 processors

(Footnote: The IBM SP2 results were obtained on a pre-production machine at the Cornell Theory Center; these times should improve as the system is tuned and enters full production use.)
Number of Processors
                 P = 1    P = 2    P = 4    P = 8    P = 16   P = 32
Cray C��� ��
IBM SP ���� ��� ��� ���� ����
Intel Paragon ���� ��� ���� ���
Table ���: Adaptive eigenvalue solver execution times for the IBM SP2, Intel Paragon, and one processor of the Cray C-90. Times (in seconds) represent one iteration, averaged over ten, of the eigenvalue algorithm. We report wallclock times for the SP2 and Paragon (processor nodes are not time-shared) and CPU times for the C-90. The application would not run on fewer than four Paragon processors because of memory constraints. These numbers are graphed in Figure ���.
are approximately four times faster than the Paragon processors, its communication network is about half as fast. We ran the same applications code on all machines, except that the Fortran kernels on the Cray C-90 are annotated to aid vectorization.

The Paragon and the SP2 compare quite favorably against the C-90: for this application, four SP2 nodes or �� Paragon nodes deliver the performance of one C-90 processor. Although all Fortran numerical kernels of our code vectorize, hardware performance monitors on the C-90 report that our application achieves an aggregate rate of only �� megaflops (million floating point operations per second) over the entire code and a peak rate of ��� megaflops. Our application realizes only a fraction of the Cray C-90's peak performance of ���� megaflops due to short vector lengths (between �� and ���) in the Fortran routines.
Of course, vector lengths are tied directly to grid size. We could achieve a higher megaflop rate and longer vector lengths by using larger grids and more memory. Note, however, that time to solution for a specified accuracy, not megaflop rate, is the important metric. Placing additional grid points in regions where they are not needed to improve resolution does not necessarily result in more accurate solutions. For example, we doubled the number of grid points used by the solver for this problem and yet obtained the same answer (to within ����%). The additional grid points simply over-refined portions of the computational space where
Figure ���: These figures graph the execution time results presented in Table ���. (a) Adaptive eigenvalue solver performance results for the IBM SP2, Intel Paragon, and one processor of the Cray C-90. The Cray C-90 time on one processor is shown as a reference line. (b) Relative execution times for the SP2 and Paragon as compared to one processor of the Cray C-90.
no further refinement was necessary.

On the Cray C-90, our implementation using the adaptive mesh libraries would be comparable in performance to a Fortran code developed by hand without library support. Approximately ��% of the execution time of our application is spent on numerical computation in Fortran routines, �% in transferring data between grids (which happens to be written in C++ but would also be required in an all-Fortran implementation), and the remaining �% in miscellaneous routines. Even if we attribute the last �% as all library overhead (which it is not), the ease of using an API and the benefits of portability to high-performance parallel architectures far outweigh the small loss in performance.
[Graphs: execution time by level on the Intel Paragon for p = 4, 8, 16, and 32, and cumulative execution time broken down by levels 0-3 through level 7.]

Figure ���: A level-by-level accounting of the execution time for one iteration of the eigenvalue algorithm. The benefits of parallelism are limited to the highest levels of the hierarchy because lower levels have too little work for efficient parallelization. (a) The execution times for the highest levels drop as we add more processors; times for the other levels do not change significantly. (b) This graph shows a level-by-level breakdown of the cumulative execution time.
����� Execution Time Analysis
Figure ��� illustrates that almost all of the benefit of additional processors is in the reduction of execution times at the highest levels of the adaptive grid hierarchy. Lower levels have too little work for efficient parallelization. Note that we cannot simply remove the lower levels because they play a vital role in the numerical convergence of our eigenvalue algorithm. We can expect better scaling as we address more complicated problems, which place additional computational work at the highest levels.

Tables ��� and ��� provide a detailed accounting for the parallel execution time on the Intel Paragon and the IBM SP2. We divide the execution time for the entire eigenvalue algorithm, including time spent building the adaptive grid hierarchy,
Task                       P = 4        P = 8        P = 16       P = 32
                           Time   %     Time   %     Time   %     Time   %
Computation                ��     �     ����   ��    ���    �     ����   ��
Load Imbalance             ����   �     ����   �     ����   ��    ���    �
Intralevel Communication   ����   �     ����   ��    ���    �     ����   �
Interlevel Communication   ���    �     ����   �     ����   ��    ����   �
Error Estimation           ����   ��    �����  ��    �����  ��    ����   ��
Load Balancing             �����  ��    �����  ��    �����  ��    ����   ��
Grid Generation            ����   ��    �����  �     ���    �     �����  �
Total                      ���    100   ���    100   ����   100   ����   100
Table ���: Execution time breakdown for the eigenvalue calculation on the Intel Paragon. Times are in seconds. Percentages may not add up to 100% due to rounding. The relative cost of communication increases with additional processors; communication overheads account for almost half of the total execution time on 32 nodes.
into numerical computation, time lost to load imbalance, communication among grids at the same level of the hierarchy, communication between levels, error estimation, load balancing, and grid generation. The vast majority of the time is spent in numerical computation, load imbalance, and communication (intralevel and interlevel). Error estimation, load balancing, and grid generation consume only a few percent of the total execution time. Figure ��� graphs the times of the four most expensive operations for one iteration of the algorithm. As computation times drop with additional processors, communication overheads become a dominant factor in overall performance. On 32 Paragon nodes, communication accounts for about half of the total execution time.
It is clear from this data that interprocessor communication times are limiting parallel performance. Table ��� shows the amount of communication (in kilobytes) between processors on the Intel Paragon; numbers for the SP2 are identical. Intralevel and interlevel communication clearly dominate. For these two routines, we have measured the average message size to be between ��� and ���� bytes. Communication times on both the Paragon and SP2 are dominated by message start-up costs. On the Paragon, the operating system message start-up overhead is about ��� µsec with a
Task                       P = 1       P = 2       P = 4       P = 8       P = 16
                           Time  %     Time  %     Time  %     Time  %     Time  %
Computation                ���   ��    ����  �     ����  ��    ����  ��    ����  ��
Load Imbalance             �     �     ����  �     ����  �     ���   �     ����  �
Intralevel Communication   ����  �     ����  �     ����  ��    ����  �     ����  ��
Interlevel Communication   ����  �     ����  �     ���   ��    �� �  ��    ���   ��
Error Estimation           ����  ��    ���   ��    ����� ��    ����� ��    ����� ��
Load Balancing             �     �     ����� ��    ����� ��    ����� ��    ����� ��
Grid Generation            ���   ��    ����  ��    ����� ��    ���   �     ����� �
Total                      ���   100   ����  100   ����  100   ����  100   ����  100
Table ���: Execution time breakdown for the eigenvalue calculation on the IBM SP2. Times are in seconds. Percentages may not add up to 100% due to rounding. The SP2 spends a majority of its time in communication for eight and sixteen nodes; this problem size can use only four processors efficiently.
[Figure: "Eigenvalue Solver Time Breakdown" -- seconds per iteration versus processors for (a) the Intel Paragon (4-32 processors) and (b) the IBM SP2 (1-16 processors), broken into Interlevel Communication, Intralevel Communication, Load Imbalance, and Computation.]
Figure ���: Execution time breakdown for one iteration (averaged over ten) for the (a) Intel Paragon and the (b) IBM SP2. We report computation time, time lost to load imbalance, and communication (both interlevel and intralevel). The SP2 processors are approximately five times as powerful as the Paragon processors, but the communications network is about half as fast.
Task                       P = 4        P = 8        P = 16       P = 32
                           Comm   %     Comm   %     Comm   %     Comm   %
Intralevel Communication   ����   ��    ����   ��    ���    ��    ����   ��
Interlevel Communication   ���    ��    ���    ���   �      ���   �
Error Estimation           ���    ���   �      ���   �      ���   �
Grid Generation            �      ��    ��     ��    ��     ��    ���    ��
Total                      ����   100   ����   100   ����   100   �����  100
Table ���: Average interprocessor communication volume (in kilobytes) for each processor for the eigenvalue calculation on the Intel Paragon. Percentages may not add up to 100% due to rounding. Interprocessor communication figures on the IBM SP2 are identical.
peak bandwidth of �� megabytes/sec for very long messages (see Appendix A). The corresponding numbers for the SP2 are about �� µsec and �� megabytes/sec. Given these figures, over ��% of the cost of sending a message of this length on the Paragon is due to message start-up costs (��% for the SP2).
It is difficult to assess adaptive mesh library overheads on parallel computers since we do not yet have detailed hardware performance analyzers such as those on the Cray C-90. We can assume that there is little overhead in computation, since all numerical work is done in Fortran. The remaining contributor of overheads is interprocessor communication, as described above. Experiments indicate that perhaps half of the interprocessor communication time is due to overheads in the LPARX communication routines (see Section ���); the remainder is spent in the operating system message routines. We are currently working on a redesign of the LPARX communication libraries which we believe will eliminate most of this additional overhead [��].
����� Uniform Grid Patches
Data parallel Fortran languages such as High Performance Fortran [��, ��, ��] do not readily support non-uniform grid structures. In their Connection Machine Fortran implementation of a �d adaptive mesh refinement application on the CM��, Berger and Saltzman [��] required that all refinement regions be the same size. To ascertain the performance implications of such a restriction, we have implemented a grid generation strategy identical to that used by Berger and Saltzman.
One of the important tradeoffs in a uniform refinement strategy is the selection of the appropriate patch size. The key is to find a grid size which is large enough to be computationally efficient on the target parallel architecture yet small enough to limit over-refinement. Large patches typically refine more of the computational domain than is needed and thus waste memory resources; small patches may represent too little work to be efficient.
Another consideration when choosing a uniform patch size is the ratio of ghost cells (boundary points used to locally cache data from other grids) to interior grid points. The width of this boundary region depends on the numerical kernels of the application: our eigenvalue solver uses a ghost cell width of one; Berger and Saltzman use four. For small patches, these ghost regions can represent a significant fraction of the total memory, especially in three dimensions. For example, the boundary cells for a �� patch in 3d with a ghost cell width of four (��� total) represent ��% of the memory used to store the patch. Furthermore, boundary cells may introduce additional computational work for some numerical methods (e.g. flux correction for hyperbolic partial differential equations [��]).
We compare the non-uniform refinement approach against four uniform grid sizes: 12x12x12, 16x16x16, 24x24x24, and 32x32x32. Each patch is augmented with a ghost cell region of width one. These four sizes bracket the range of useful patch sizes; for this particular application, 12x12x12 is too small and 32x32x32 is too large. Memory overheads are reported in Table ���. Uniform refinement with the smallest patch size requires only about ��% more memory than non-uniform refinement; the largest patch size uses almost three times more memory.
Figure ���(a) presents the execution time for one iteration of the adaptive eigenvalue application on the Intel Paragon. Note that we do not report the results
Patch Type           Unknowns       Over-        Unknowns and   Excess
                                    Refinement   Ghost Cells    Memory
Non-Uniform          ����� � ���    ���          ��� � ���      ���
Uniform (12x12x12)   ���� � ���     ����         ��� � ���      ��
Uniform (16x16x16)   ���� � ���     ����         �� � � ���     ����
Uniform (24x24x24)   ��� � ���      ���          � � � ���      ��
Uniform (32x32x32)   ��� � ���      ��           ��� � ���      ���
Table ���: Uniform grid patches require additional memory resources as compared to non-uniform patches because the grid generator does not have the freedom to select the optimal grid size needed to cover a particular region of space; thus, uniform patches lead to over-refinement. Small patches reduce over-refinement; however, they also introduce a large number of ghost cells relative to the number of unknowns. In this application, the ghost cell region is one cell thick. Note that the number of ghost cells increases slightly (by about ��% between 4 and 32 processors) for the non-uniform strategy as patches are split to balance the load across processors. For our results, we have chosen the worst-case non-uniform patch numbers.
for the 32x32x32 patch size on four processors; this problem would not run because of memory limitations. The � � patch size gives the best performance for all numbers of processors, running between ��% and ��% slower than the non-uniform refinement method. This figure includes both computation time and communication time; numerical computation time alone is plotted in Figure ���(b). Note that in the absence of interprocessor communication, the �� and � � grid sizes are very competitive with the non-uniform strategy. Interprocessor communication costs (in millions of bytes) are presented in Figure ���(c).
Both memory usage and computation time are important computational resources for adaptive mesh applications. In fact, many accounting systems for parallel computers charge not only for CPU time but also for memory usage. To capture both resources, Figure ���(d) presents the relative space-time (e.g. megabyte-hour) cost of uniform refinement patches as compared to non-uniform patches. In this metric, uniform patches are between two and eight times more expensive.
[Figure: four panels versus 4-32 Paragon processors, each comparing Non-Uniform Patches against Uniform 12x12x12, 16x16x16, 24x24x24, and 32x32x32 patches -- (a) "Uniform Patch Execution Time" (wallclock seconds per iteration); (b) "Uniform Patch Numerical Execution Time"; (c) "Uniform Patch Communication" (Mbytes per iteration); (d) "Uniform Patch Space-Time Cost" (relative space-time cost).]

Figure ���: These graphs illustrate the performance costs of uniform grid patches as compared to non-uniform patches. By default, the adaptive mesh libraries employ non-uniform patches. Results for the 32x32x32 patch size on four processors are not reported; the problem would not run due to memory limitations. (a) A comparison of execution times for a variety of uniform patch sizes. (b) Numerical computation times only. (c) Total interprocessor communication (in millions of bytes) for the different refinement strategies. (d) Relative space-time (megabyte-hour) costs of uniform refinement patches as compared to non-uniform patches.

These results clearly show that uniform refinement patches are more expensive than the identical application using non-uniform patches. To solve a problem to a specified accuracy, structured adaptive mesh methods employing uniform refinement regions will require more computational resources (flops and memory) than one using non-uniform refinement regions. Likewise, given fixed resources, non-uniform refinements will solve a particular problem to higher accuracy. The additional complexity of the non-uniform implementation requires a powerful software infrastructure such as our adaptive mesh API.
�� Analysis and Discussion
You can observe a lot by watching.
-- Yogi Berra
We have developed an efficient, portable, parallel software infrastructure for structured adaptive mesh algorithms. It provides computational scientists with high-level tools that hide implementation details of parallelism and resource management. Such powerful software support is essential for the timely development of quality reusable numerical software. We are applying our adaptive mesh infrastructure to the solution of adaptive eigenvalue problems arising in materials design [��].
Two distinguishing characteristics of our work are the concepts of structural abstraction and coarse-grain data parallelism, both borrowed from LPARX. Structural abstraction and first-class data decompositions enable our software to represent and manipulate refinement structures as language-level objects. In contrast, a language such as High Performance Fortran that supports compile-time data layout provides little freedom in the expression of irregular, run-time data distributions. Our model of coarse-grain data parallel numerical computation maps efficiently onto the current generation of message passing parallel architectures.
It is an open research question whether data parallel languages can efficiently support the irregular refinement structures employed by structured adaptive mesh algorithms. Previous implementations [��] have required uniform refinements to fit the fine-grain data parallel model. Our experiments in 3d indicate that such a restriction results in costly over-refinement and a corresponding loss in computational performance. Thus, the efficient, portable implementation of structured adaptive mesh methods remains an outstanding challenge for the data parallel community.
����� Parallelization Requirements
The parallelization mechanisms provided by LPARX greatly simplified the design and implementation of our structured adaptive mesh API. We believe that the following LPARX features were instrumental to our success:
• LPARX's concept of structural abstraction enables our software to dynamically create the user-defined, irregular array structures required to represent refinement levels. Our regridding routines produce a description of the refinement structure, and this description is in turn modified by load balancing and processor assignment routines to distribute the computational work across processors. Only then is this "floorplan" instantiated as the new level in the hierarchy. Structural abstraction provided us the freedom to explore and compare different regridding strategies (both non-uniform and uniform refinement patches) and processor mapping algorithms. To our knowledge, no other parallel programming system supports such user-defined, dynamic block structures.
• The region calculus and the copy-on-intersect operation are intuitive and natural mechanisms for manipulating refinement structures and expressing interprocessor communication. The same set of primitives can be used to manage the different communication patterns required by structured adaptive mesh applications. They are based on simple geometric notions (e.g. "intersection") that are independent of the spatial dimension of the problem. As a result, our software supports both 2d and 3d applications with the same API.
• Efficient parallel programs begin with the effective use of node resources. Because LPARX separates parallel execution from computation, it does not interfere with the performance of our serial numerical routines, which run at full Fortran speeds. Furthermore, we were able to reuse routines from previous serial multigrid codes in our parallel materials design application. Although such numerical kernels are generally short, they are extremely tedious to write and debug because they involve complicated stencil computations and array indexing (see Eq. ���).

Altogether, LPARX enabled us to easily and efficiently implement the software support necessary for structured adaptive mesh applications.
����� Future Research Directions
More research is needed on whether structured adaptive mesh methods can be efficiently implemented in a data parallel Fortran language such as High Performance Fortran [��], which does not readily support irregular array structures. As discussed in Section ���, one potential solution would employ uniform refinement regions. However, our experiments indicate that this design decision results in costly over-refinement and a corresponding loss in performance. In addition, even the uniform refinement strategy results in irregular communication patterns between refinement patches. To simplify the management of these bookkeeping details, we would still need to implement region calculus operations such as intersection and copy-on-intersect in an API library written for HPF.
Although we have shown that adaptivity can resolve the near-singular potentials arising in materials design applications, much work remains before we can address real problems, such as carbon clusters or carbon filaments. Addressing the full generality of the LDA eigenvalue equations will require substantial advances in numerical algorithms; in particular: (1) how do we extend the linear adaptive eigenvalue solver to address the full nonlinear LDA eigenvalue problem? (2) how do we efficiently extract multiple eigenvectors from the current adaptive algorithm? and (3) how do we incorporate LDA-specific approximations (which we have not discussed here) into the solution process? Fortunately, we believe that the software support provided by our structured adaptive mesh API will greatly simplify the exploration of these numerical issues.
Finally, further research is needed to investigate whether we can merge software support for elliptic and hyperbolic partial differential equations. Although elliptic and hyperbolic structured adaptive mesh methods employ similar data representations, their numerical structures are very different (as described in Section ���). We plan to explore how to unify support for these two problem classes in a single API. C. Duncan of Bowling Green State University would like to use our structured adaptive mesh infrastructure to parallelize an adaptive hyperbolic solver for simulations of relativistic extragalactic jets [��], and we view this as an opportunity to develop a common API framework for both types of partial differential equations.
Chapter �
Particle Calculations
The difference between a text without problems and a text with problems is like the difference between learning to read a language and learning to speak it.
-- Freeman Dyson, "Disturbing the Universe"
��� Introduction
In Chapter � we introduced the LPARX parallelization abstractions for managing irregular, block-structured data. In this chapter, we describe how these facilities are used in the design and implementation of an API library for particle applications, which require irregular data decompositions to balance non-uniform workloads on parallel architectures. Our particle API, implemented as a library built on top of the LPARX mechanisms (see Figure ���), provides computational scientists the high-level software tools needed to efficiently and easily parallelize particle calculations. Our facilities are independent of the problem's spatial dimension and present the same interface for both 2d and 3d applications. We show that such functionality is easily expressed using the LPARX primitives.

Using our software infrastructure, we have developed a 3d smoothed particle hydrodynamics [���] application (SPH3D in Figure ���) which simulates the evolution of galactic bodies in astrophysics. Our API's high-level mechanisms have enabled us
[Figure: software infrastructure diagram -- the applications (LDA, AMG, SPH3D, MD) sit atop the Adaptive Mesh API and the Particle API, which are built on LPARX, the implementation abstractions, and the message passing layer.]
Figure ���: The particle API portion of our software infrastructure is built on top of LPARX and provides computational scientists with high-level facilities targeted towards particle applications. We have developed a 3d smoothed particle hydrodynamics application (SPH3D) using this software and are currently implementing a 3d molecular dynamics application (MD).
to explore performance optimizations that others have found difficult with a message passing implementation. Our facilities have also been employed by Figueira and Baden to analyze the performance of various parallelization strategies for localized N-body solvers [��].
This chapter is organized as follows. We begin with an overview of particle methods and review related work. Section ��� introduces the parallelization facilities provided by our particle API and describes how they have been implemented using the LPARX mechanisms. Section ��� describes our smoothed particle hydrodynamics application and analyzes its performance in detail. Finally, we conclude with a discussion and analysis of this work.
����� Motivation
Simulations using particles play an important role in many fields of computational science, including astrophysics, fluid flow, molecular dynamics, and plasma physics [��]. In particle applications, some quantity of interest, such as mass or charge, is stored on bodies, called particles, which move unpredictably under their mutual influence. Particles interact according to a problem-dependent force law (e.g. gravitation). Computations proceed over a series of discrete timesteps. At each timestep, the algorithm evaluates the force acting on every particle and then moves the particles according to the calculated force field. The force evaluation is typically the most time-consuming portion of the computation; moving particles takes only a few percent of the total execution time. Figure ��� provides an outline for a generic particle code.
In general, each particle feels the influence of every other particle. A naive force evaluation scheme would, for a system of N particles, calculate all O(N²) particle-particle interactions directly. Such an approach is too expensive for systems of more than a few thousand particles.
Rapid approximation methods [��, ��, ��, ��, ��] accelerate the force evaluation by trading some accuracy for speed. Approximation algorithms typically divide the force evaluation into two components: local particle-particle interactions and far-field force computations. Local interactions evaluate the influence of nearby particles by calculating direct interactions only for those particles lying within a specified cutoff distance. The remaining non-local, far-field contributions are calculated separately.
Two different techniques are typically used to evaluate far-field forces. The first technique averages particle data onto a grid covering the entire computational domain and employs a fast partial differential equation (PDE) solver to evaluate the force equations on the grid; values representing far-field forces are then interpolated back onto the particles. The second technique evaluates far-field influences using a hierarchical structure in which forces are represented at varying length scales and accuracies; the influences of nearby particles are represented more accurately than particles that are further away. Table ��� surveys the computational structure of
// Advect is the main routine of the particle calculation
function Advect
    for t = 1 to MaxSteps
        call CalculateForces
        call MoveParticles
    end for
end function

// Calculate forces using local and far-field interactions
// Note that we do not show the code for FarFieldForces
function CalculateForces
    call LocalInteractions
    call FarFieldForces
end function

// Calculate forces arising from nearby particle interactions
function LocalInteractions
    for each particle p
        for each particle q in a neighborhood of p
            calculate interaction between p and q
        end for
    end for
end function

// Update particle positions according to calculated forces
function MoveParticles
    for each particle p
        move particle p using force information
    end for
end function
Figure ���: A framework for a generic particle calculation. Particle simulations proceed over a sequence of timesteps. In each timestep, the algorithm evaluates forces on particles and then moves particles according to the calculated forces. Rapid approximation methods typically divide the force computation into two components: local particle-particle interactions and far-field force evaluation (not shown).
Approximation Algorithm                        Local         Fast PDE   Hierarchical
                                               Interactions  Solver     Representations
Local Force Approximations [��, ��]            √
Particle-Particle Particle-Mesh [��]           √             √
Particle-in-Cell (PIC) [��]                                  √
Method of Local Corrections (MLC) [��]         √             √
Adaptive MLC [��]                              √             √          √
Fast Multipole Method (FMM) [��]               √                        √
Barnes-Hut [��]                                √                        √
Hierarchical Element Method [��]               √                        √
Table ���: A survey of the computational structure for various N-body approximation methods. This chart indicates whether a particular approximation algorithm employs local particle-particle interactions, a fast partial differential equation (PDE) solver, or hierarchical representations. Local force approximations assume that the force law is zero beyond a specified cutoff and ignore far-field influences.
some of the most common rapid approximation algorithms.

In this chapter, we only consider software abstractions for the evaluation of local particle-particle interactions. Software facilities for fast PDE solvers and adaptive hierarchical representations have been described in Chapter �.
Particle methods are difficult to implement on parallel computers because they require dynamic load balancing to maintain an equal distribution of work. The computational effort required to evaluate the forces acting on a particle depends on the local particle density, and workload distributions change with time as particles move. Furthermore, when partitioning the problem, we would like to take advantage of the spatial locality of the particle-particle interactions: by subdividing the computational space into large, contiguous blocks, we can minimize interprocessor communication, since nearby particles are likely to be assigned to the same processor [��].
Figure ���(a), which depicts a uniform block decomposition of the computational space, illustrates the need to handle load balancing. Each of the sixteen partitions has been assigned to a processor numbered from p0 to p15. Such a uniform decomposition does not efficiently distribute workloads; for example, no work
[Figure: three snapshots (a), (b), and (c) of the partitioned particle distribution.]
Figure ���: These pictures show several snapshots from a 2d vortex dynamics application in which the computational domain has been partitioned across sixteen processors. Particles are represented by dots, and the workload is directly related to local particle density. A uniform block decomposition (a) is unable to balance a non-uniform workload distribution. Recursive bisection (b and c) adjusts the assignment of work to processors according to the workload distribution. Partitionings must change dynamically in accordance with the redistribution of the particles; several repartitioning phases have occurred between the times represented by the last two snapshots.
has been assigned to processors p�, p�, p��, and p��.
A better method for decomposing non-uniform workloads is shown in Figures ���(b) and ���(c), which illustrate two irregular block assignments rendered using recursive bisection [��]. In these decompositions, each processor receives approximately the same amount of work. Because the distribution of particles and the associated workloads change over time, we must periodically redistribute the work across processors to maintain load balance. How we represent such dynamic, irregular data decompositions is the subject of Section ���.
����� Related Work
The literature for parallel particle calculations is quite rich and expansive; here we provide only a brief survey of related work. Our particle library facilities are based in part on previous work by Baden [��], who developed a programming methodology for parallelizing particle calculations running on MIMD multiprocessors. His implementation of a 2d vortex dynamics application was the first to employ a recursive bisection decomposition [��] to dynamically balance particle methods.

Tamayo et al. [���] investigated various data parallel implementation strategies for molecular dynamics simulations running on SIMD computers such as the CM��. More recently, Figueira and Baden [��] employed our software infrastructure to analyze the performance of different parallelization strategies for localized N-body methods running on MIMD multiprocessors.
Some previous efforts with parallel particle calculations have concentrated on the parallelization of a particular program instead of a general software infrastructure. For example, Clark et al. [��, ��] implemented a parallel version of the GROMOS molecular dynamics application. Their approach uses non-uniform, dynamic partitions similar to our own, which were implemented (with considerable effort) using a message passing library. The parallelization of GROMOS would have been significantly easier had they employed the software abstractions we describe in Section ���.
Portions of the CHARMM molecular dynamics application have been parallelized using the CHAOS software primitives [��, ���]. CHAOS employs a fine-grain decomposition strategy in which particles are individually assigned to processors. The drawback to this type of fine-grain approach is that algorithms for scheduling interprocessor communication scale as the number of particles. In contrast, we use a coarse-grain strategy that assigns aggregates of particles to processors, and our communication algorithms depend only on the number of processors, not on the number of particles.
Experimental data distribution primitives targeted towards particle calculations have been added to Fortran D [��], using CHAOS as the run-time support library. Adhara [��] is a run-time library for particle applications that is not as general as CHAOS but has been specifically designed and optimized for certain classes of particle methods.
Warren and Salmon [���] developed a parallel tree code intended for Barnes-Hut [��] and fast multipole [��] methods. Their approach dynamically distributes nodes of the tree across processors using a clever hashing mechanism. Singh et al. [��, ���] have also implemented a parallel fast multipole algorithm for shared memory multiprocessors. However, these tree-based mechanisms are inappropriate for particle methods that employ fast PDE solvers based on array representations [��, ��].
��� Application Programmer Interface
The beauty of this is that it is only of theoretical importance, and there is no way it can be of any practical use whatsoever.
-- Sidney Harris, "Einstein Simplified"
Our particle API provides scientists with high-level computational tools that enable easy and efficient portable implementations of particle applications on parallel architectures. We have implemented our API using the LPARX primitives of Chapter �. We show that the techniques required to parallelize particle applications are easily and succinctly expressed in LPARX. Without the powerful LPARX abstractions, such software support normally constitutes many thousands of lines of complicated message passing code. In contrast, we will present LPARX code for all of the major activities needed to balance non-uniform workloads and manage interprocessor communication. The LPARX operations enable us to manage the implementation complexity at a reasonable level. As discussed in this section and in Section ���, the high-level treatment of our API implementation has enabled us to explore performance optimizations that others have found difficult with a message passing implementation. Such optimizations have reduced the execution times of our smoothed particle hydrodynamics application by an average of ��%.
As shown in Figure ��, parallelization of the sequential code of Figure �� requires four modifications to the application:

1. In Advect, call a load balancing routine to divide the computational workload across the processors. Each processor will be responsible for all calculations involving particles in the data partition assigned to it.

2. Before calculating local interactions, fetch off-processor particles required to satisfy data dependencies.

3. After calculating local interactions, update force information for particles owned by other processors.

4. After updating particle positions in MoveParticles, repatriate particles to their rightful processor owners if they have migrated off of the local data partition.

Functions BalanceWorkloads, FetchParticles, WriteBack, and RepatriateParticles are provided by our particle API. The following five sections describe these facilities and their implementation using the LPARX mechanisms in more detail.
����� Balancing Non-Uniform Workloads

We address two issues in this section: (1) how do we represent the non-uniform decompositions needed to evenly distribute workloads (e.g., see Figures ��b
// Advect is the main routine of the particle calculation
function Advect
    for t = 1 to MaxSteps
        // Redistribute workload across processors
        if (time to rebalance workloads) then
            call BalanceWorkloads
        end if
        call CalculateForces
        call MoveParticles
    end for
end function

// Calculate forces arising from nearby particle interactions
function LocalInteractions
    // Fetch particle data from adjoining processors
    call FetchParticles
    for each particle p owned by this processor
        for each particle q in a neighborhood of p
            calculate interaction between p and q
        end for
    end for
    // Write back forces for off-processor particles
    call WriteBack
end function

// Update particle positions according to calculated forces
function MoveParticles
    for each particle p owned by this processor
        move particle p using force information
    end for
    // Repatriate particles which have moved off our partition
    call RepatriateParticles
end function

Figure ��: A parallel version of the generic particle code shown in Figure ��. Four changes are necessary to parallelize the application: (1) balance workloads and distribute computational effort across the processors (in Advect); (2) before performing local interactions, call FetchParticles to cache off-processor particle information needed to calculate forces; (3) after calculating interactions, update force information for particles owned by other processors; and (4) repatriate particles to their rightful owners if they have migrated off of the local data partition. Note that we do not show function CalculateForces because it has not changed from Figure ��.
and ��c), and (2) how do we dynamically redistribute computational effort as workloads change?

A common technique used to implement particle applications employs a chaining or binning mesh which covers the entire computational domain ����. Particles are sorted into the mesh according to their spatial location; each element (or bin) of the mesh contains the particles lying in the corresponding portion of space. This binning structure is used to accelerate the O(N²) search for neighboring particles that would otherwise be required. A sequential application would represent the mesh as a Grid of particle lists:

Grid of ParticleList bins

where ParticleList is the user-defined type implementing an unordered collection of particles. The computational work carried by each bin is a function of the number of particles in the bin and the local particle density.
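To make the binning idea concrete, the following is a minimal sequential sketch of a 2d chaining mesh (the names ChainMesh2d and Particle are illustrative stand-ins, not the library's types): particles are sorted into bins of width equal to the interaction cutoff, so a neighbor search examines only a particle's own bin and the adjacent bins instead of all N particles.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Particle { double x, y; };

// Toy 2d chaining mesh: bins of width h (the interaction cutoff) over a
// square domain. Neighbor queries scan at most a 3x3 block of bins.
struct ChainMesh2d {
    double h;                                  // bin width == cutoff
    int nx, ny;                                // bins per dimension
    std::vector<std::vector<Particle>> bins;   // bins[i * ny + j]

    ChainMesh2d(double domain, double cutoff)
        : h(cutoff), nx(int(domain / cutoff)), ny(int(domain / cutoff)),
          bins(std::size_t(nx) * ny) {}

    // Sort a particle into the bin covering its spatial location.
    void add(const Particle& p) {
        int i = std::min(nx - 1, int(p.x / h));
        int j = std::min(ny - 1, int(p.y / h));
        bins[i * ny + j].push_back(p);
    }

    // Count particles within distance h of p, scanning only adjacent bins.
    int neighbors(const Particle& p) const {
        int bi = std::min(nx - 1, int(p.x / h));
        int bj = std::min(ny - 1, int(p.y / h));
        int count = 0;
        for (int i = std::max(0, bi - 1); i <= std::min(nx - 1, bi + 1); ++i)
            for (int j = std::max(0, bj - 1); j <= std::min(ny - 1, bj + 1); ++j)
                for (const Particle& q : bins[i * ny + j]) {
                    double dx = p.x - q.x, dy = p.y - q.y;
                    if (std::sqrt(dx * dx + dy * dy) <= h) ++count;
                }
        return count;
    }
};
```

Because the cutoff bounds the interaction range, a correct neighbor search never needs to look beyond the adjacent bins, which is what reduces the quadratic search to roughly linear work.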
To balance non-uniform workloads on a parallel machine, we decompose this binning structure across processors. Recall that LPARX represents irregular block decompositions using the XArray; thus, our parallel data decomposition is implemented as:

XArray of Grid of ParticleList bins

where each Grid of the XArray contains the particle list data for its corresponding data partition. Irregular data distributions are determined by a partitioning utility which attempts to evenly divide the work among the processors. A 2d sample data decomposition for four processors and the associated XArray are shown in Figure ��. When calculating forces and moving particles, processors employ LPARX's coarse-grain data parallel forall loop to compute over only those particles which lie in their assigned partition(s).
This binning mesh must be periodically repartitioned in response to the changing workload distribution. In general, the application need not call the rebalancing routine every timestep. The maximum distance a particle may move in
Figure ��: Irregular decompositions of the computational domain are represented using the XArray. For particle applications, each processor is typically assigned a single data partition, which corresponds to an XArray element (a Grid). Processors compute only for those particles within their assigned partition.
a single timestep is limited by the stability requirements of the numerical method; therefore, workloads change slowly ��� ��� ����. For example, the smoothed particle hydrodynamics application described in Section �� repartitions every ten timesteps.
Partitioning the computational domain introduces data dependencies between the various subproblems: particles near the boundary of a partition may interact with particles belonging to other processors. We extend each partition with a ghost cell region (see Figure ��) used to locally cache copies of off-processor particles. In general, the width of this ghost cell region depends on the mesh spacing and may be different in each dimension. Prior to each force evaluation, ghost cells are filled with the most recent off-processor particle data. How this is managed is described in Section ���.
Dynamic load balancing is handled by particle library routine BalanceWorkloads, shown in Figure ��. The first step is to estimate how much computational effort will be required to calculate the forces for all particles in a particular bin. Our API automatically measures the amount of time spent computing in each bin and uses these timing measurements from previous timesteps to guide the partitioning for the following timesteps.

In the next step, we call a recursive bisection ��� partitioning utility, provided by the LPARX standard libraries.* This partitioner takes the workload estimate

*The application may substitute another irregular block partitioner.
// Rebalance workloads and copy from the old mesh into the new
// Bins and NewBins are XArray of Grid of ParticleList
// Partition and NewPartition are Array of Region
// NGHOST is the ghost cell width
// P is the number of processors
function BalanceWorkloads
    // Step (1): Estimate the workload distribution
    Array of Double Work = EstimateWork()
    // Step (2): Partition this workload
    Array of Region NewPartition = RCB(Work, P)
    // Step (3): Add ghost cells to the partition
    Array of Region Ghosts = grow(NewPartition, P, NGHOST)
    // Step (4): Allocate the storage for this structure
    call XAlloc(NewBins, P, Ghosts)
    // Step (5): Copy data from Bins into NewBins
    forall i in NewBins
        for j in Bins
            copy into NewBins(i) from Bins(j) on Partition(j)
        end for
    end forall
end function

Figure ��: API function BalanceWorkloads redistributes computational effort across the processors. Workload information gathered by the particle API is used to guide the RCB partitioning utility provided by the LPARX libraries. After the routine adds ghost cell regions and allocates storage, data from the old binning structure is copied into the new mesh. All of these details are hidden from users of the API. Note that the ghost cells are not shown in the picture.
and the number of desired partitions and returns a description of the data decomposition. Note that the Regions returned represent the structure of the new partitioning (i.e., structural abstraction) but do not actually allocate storage. Next, we pad each partition with a ghost cell region of size NGHOST using grow from LPARX. Finally, we allocate the storage for the new binning array NewBins using this structure information.

The new binning mesh is initially empty; before using it, we must copy particle information from the previous data distribution into the new decomposition. The two nested loops copy values from Bins into the new mesh NewBins. For each i and j, NewBins(i) is assigned the portion of Grid Bins(j) that logically overlaps with j's associated partition. Any interprocessor communication is automatically managed by the LPARX run-time system. Copies of particle lists local to a processor are implemented by simply copying pointers. Note that although we temporarily duplicate the storage for the chaining mesh, the particle data (which is likely to require far more memory resources) is not duplicated. Of course, all of these details are hidden by the API.
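The recursive bisection idea behind RCB can be sketched in one dimension as follows. This is a toy stand-in for LPARX's partitioner, not its implementation: it splits a workload array at the index that best halves the total work and recurses on each side, assuming the number of partitions is a power of two; the real RCB generalizes this to d-dimensional Regions.

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

using Range = std::pair<int, int>;   // half-open index range [lo, hi)

// Split work[lo, hi) into `parts` contiguous ranges with roughly equal sums.
static void bisect(const std::vector<double>& work, int lo, int hi,
                   int parts, std::vector<Range>& out) {
    if (parts == 1) { out.push_back({lo, hi}); return; }
    double total = std::accumulate(work.begin() + lo, work.begin() + hi, 0.0);
    double half = 0.0;
    int cut = lo;
    // Advance the cut until roughly half of the work lies to its left.
    while (cut < hi - 1 && half + work[cut] < total / 2.0)
        half += work[cut++];
    bisect(work, lo, cut, parts / 2, out);
    bisect(work, cut, hi, parts / 2, out);
}

std::vector<Range> rcb1d(const std::vector<double>& work, int parts) {
    std::vector<Range> out;
    bisect(work, 0, int(work.size()), parts, out);
    return out;
}
```

The resulting partitions are narrower where per-bin work is heavy and wider where it is light, which is exactly the behavior the timing-driven workload estimates exploit.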
����� Caching Off-Processor Data

Recall from the previous section that each processor's partition is surrounded with a ghost cell region used to locally cache copies of off-processor particles from neighboring partitions. In general, efficiently filling ghost cell regions for irregular, dynamic decompositions is a difficult task. Data dependencies between processors change as workloads are rebalanced; communication structures are neither static nor regular and cannot be easily predicted. Wide ghost cell regions may span several other partitions.

All of these details are managed by the FetchParticles code shown in Figure ��. For every pair of Grids Bins(i) and Bins(j), this routine copies into the ghost cells of Bins(j) interior (non-ghost cell) particle information from all adjacent Grids Bins(i). We select only the interior particles from the source Bins(i) by grow-
// Communicate boundary particle data between neighboring partitions
// Bins is the binning mesh used to store the particles
function FetchParticles(XArray of Grid of ParticleList Bins)
    // Loop over all pairs of grids in Bins
    forall i in Bins
        // Mask off the ghost cells (copy interior values only)
        // Function region() extracts the region from its argument
        Region Interior = grow(region(Bins(i)), -NGHOST)
        for j in Bins
            // Copy data from intersecting regions
            copy into Bins(j) from Bins(i) on Interior
        end for
    end forall
end function

Figure ��: FetchParticles locally caches copies of off-processor particle information needed for particle interactions. Ghost cell regions are updated with particle data from the interiors of adjacent partitions.
ing its Region by a negative ghost cell width. Aggregate data motion between Grids is handled through LPARX's copy-on-intersect operation, which efficiently copies data between Grids, ignoring points which are not shared.
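A minimal sketch of the two region operations this code leans on, grow and intersection, using a toy 2d Region rather than LPARX's dimension-independent one:

```cpp
#include <algorithm>
#include <cassert>

// Toy 2d Region with inclusive index bounds; illustration only.
struct Region {
    int lo[2], hi[2];
    bool empty() const { return lo[0] > hi[0] || lo[1] > hi[1]; }
};

// grow pads a Region by g cells in every direction; a negative g shrinks
// it, which is how FetchParticles masks off the ghost cells of a source.
Region grow(const Region& r, int g) {
    return {{r.lo[0] - g, r.lo[1] - g}, {r.hi[0] + g, r.hi[1] + g}};
}

// The heart of copy-on-intersect: only points in the geometric
// intersection of the source and destination Regions are transferred.
Region intersect(const Region& a, const Region& b) {
    return {{std::max(a.lo[0], b.lo[0]), std::max(a.lo[1], b.lo[1])},
            {std::min(a.hi[0], b.hi[0]), std::min(a.hi[1], b.hi[1])}};
}
```

For two abutting 10×10 partitions, growing the left one by a one-cell ghost region and intersecting with the right one yields exactly the one-cell-wide column of ghost cells that must be filled; irregular neighbors and multi-partition overlaps fall out of the same intersection test.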
The simplicity of our routine belies the fact that the code to perform FetchParticles becomes quite complicated in the absence of the powerful LPARX facilities. Similar functionality in the GenMP system ��� required over ��� lines of message passing code. In describing the parallelization of the GROMOS molecular dynamics application, Clark et al. ��� point out the difficulty of supporting irregular partitions and ghost cell regions that may span several partitions. Such special cases are automatically managed by the copy-on-intersect primitives provided by LPARX. Furthermore, FetchParticles is independent of the type of data decomposition, and the same algorithm works for both 2d and 3d applications.
����� Writing Back Particle Information

Many of the force laws employed by particle applications are symmetric; that is, the force acting on particle p by particle q is equal and opposite to the force acting on particle q by particle p (Newton's Third Law �����). By exploiting this symmetry, we reduce computational costs by about half: once we have calculated the force acting on p by q, we know that the force acting on q by p is the same but in the opposite direction.
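In serial form, the savings comes from visiting each pair only once; a sketch (with a placeholder force law, not the SPH forces used later in this chapter) might look like:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct P { double x, f; };

// Exploit Newton's third law: each pair (i, j) is visited once, and the
// computed force is applied with opposite signs to both particles,
// halving the number of force evaluations.
void pairForces(std::vector<P>& ps) {
    for (std::size_t i = 0; i < ps.size(); ++i)
        for (std::size_t j = i + 1; j < ps.size(); ++j) {
            double fij = ps[i].x - ps[j].x;   // placeholder pairwise force
            ps[i].f += fij;                   // force on i from j
            ps[j].f -= fij;                   // equal and opposite on j
        }
}
```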
Subtle implementation issues arise when employing symmetric force laws. Consider the force calculation between a particle p in the interior of a processor's partition and a particle q in its ghost cell region. If we exploit the symmetry of the computation, we update the forces for both particles p and q. However, at the end of the local interactions, particle q, a locally cached copy of a particle owned by another processor, contains important force information which must be transmitted back to the processor owning q.

Thus, by exploiting the symmetry of the force law, we halve the numerical computation at the expense of additional interprocessor communication. One compromise employs symmetry only if neither interacting particle lies in the ghost cell region ����; forces for such particles are computed redundantly on different processors instead of communicated between processors. While this approach eliminates the extra communication, our experiments (reported in Section �����) indicate that it results in ��% longer execution times because of the redundant computation.

Figure �� shows the API routine WriteBack, which implements the force update. This code is essentially the same as FetchParticles described in Section ��� except that data travels in the opposite direction. One notable difference is that
// Write back force information between neighboring partitions
// Bins is the binning mesh used to store the particles
function WriteBack(XArray of Grid of ParticleList Bins)
    // Loop over all pairs of grids in Bins
    for i in Bins
        for j in Bins
            // Mask off the ghost cells (copy interior values only)
            Region Interior = grow(region(Bins(j)), -NGHOST)
            // Copy data from intersecting regions
            copy into Bins(j) from Bins(i) on Interior using CombineForces
        end for
    end for
end function

Figure ��: WriteBack updates force information for particles owned by other processors. This code is essentially the same as FetchParticles in Figure ��, except that data flows in the opposite direction. This routine employs the reduction form of the LPARX copy-on-intersect operation. In this example, the reduction function CombineForces sums forces from off-processor particles into locally owned particle lists. As described in Section ���, this code is also used to repatriate particles to their rightful processor owners if they have migrated off of the local data partition. For that version, reduction function CombineLists is used to combine off-processor particles with lists of locally owned particles.
WriteBack employs the reduction form of the LPARX copy-on-intersect operation. Recall that this primitive takes a commutative, associative reduction function as an argument; instead of simply copying data, the specified function is applied elementwise to combine corresponding source and destination data values. In this case, the reduction function CombineForces takes two ParticleLists, sums the forces for corresponding particles in the two lists, and returns the result. In the general case, CombineForces must be provided by the application because symmetric force laws often calculate more than just forces; for example, the smoothed particle hydrodynamics application of Section �� calculates both forces and densities. Writing CombineForces is simple, however, and we will show an example in Section ���.

As before, the code for WriteBack is difficult to implement without the support provided by LPARX. Indeed, the parallel implementation of the GROMOS molecular dynamics application ��� ignores symmetry and redundantly computes interactions involving particles lying in ghost cell regions, even though the implementors expect a dramatic increase in performance with this optimization. We will explore the performance implications of this design decision in Section �����.
����� Repatriating Particles

The fourth and final facility required to parallelize a particle application repatriates particles across processors if they have migrated off of their processor's partition. The last phase of each timestep moves particles according to the calculated forces acting on each particle. In this step, some particles may move off of the partition owned by their processor into the ghost cell region. (Prior to moving particles, we remove from the ghost cell region the off-processor particles locally cached by FetchParticles.) Particles will not move past the ghost cell region because the numerical methods limit the maximum distance a particle may move in a single timestep due to stability requirements �����. Because these particles no longer lie in their processor's partition, they must be communicated to the processors which rightfully own them.

The computational structure of repatriation is identical to that of the force
update described in the previous section. In fact, the code for RepatriateParticles is identical to that in Figure �� with one change: the reduction function CombineForces is replaced by CombineLists, which takes two particle lists and returns their union. As particles lying in ghost cell regions are copied back onto their proper partitions, they are combined with the particles already lying in those bins via CombineLists. The user of our API does not supply CombineLists and only needs to call RepatriateParticles; all interprocessor communication and list concatenation is managed automatically by the run-time system.
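Since the text only names CombineLists, here is a hedged sketch of what such a list-union reduction might look like; the list type is a toy stand-in for the library's ParticleList, with a particle reduced to an id:

```cpp
#include <cassert>
#include <vector>

using ParticleList = std::vector<int>;   // toy: a particle is just an id

// Unlike CombineForces, which sums fields of corresponding particles,
// CombineLists simply concatenates the incoming off-processor particles
// onto the locally owned list.
void CombineLists(ParticleList& local, const ParticleList& incoming) {
    local.insert(local.end(), incoming.begin(), incoming.end());
}
```

Because particle lists are unordered collections, concatenation behaves as the commutative, associative reduction that the copy-on-intersect primitive requires.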
����� Implementation Details

We have implemented our particle API as a library of C++ classes built on top of the LPARX parallelization abstractions. The library consists of approximately one thousand lines of C++ code and defines two classes, ChainMesh and ParticleList, that provide the functionality described in the previous sections. Programmers are completely isolated from LPARX and never see types Region, Grid, or XArray; in fact, we could implement the library on top of another parallel run-time system, and the API would not change. ChainMesh and ParticleList are described briefly in the following two sections.

ChainMesh

ChainMesh implements the chaining mesh structure ���� for organizing the particles. Recall from Section ��� that this chaining mesh covers the entire computational domain, and particles are sorted into the mesh based on their spatial location. Each mesh element or bin contains a ParticleList of the particles lying in the corresponding region of space. Figure �� shows part of the C++ class interface for ChainMesh. Internally, the chaining mesh is implemented as an XArray of Grid of ParticleList, although this representation is hidden from the programmer. ChainMesh defines member functions corresponding to all of the parallelization mechanisms described previously: BalanceWorkloads, FetchParticles, WriteBack, and
// Define a chaining mesh class for a particle calculation
// Class Particle defines particle attributes (position, velocity, ...)
// Class Index is a simple opaque index object (like an LPARX Point)

class ChainMesh {
    // The chaining mesh is an XArray of Grids of ParticleList
    XArray_of_Grid_of_ParticleList mesh;

    // ChainMesh automatically times iterations for load balancing
    XArray_of_Grid_of_Double workload;

    // Define other miscellaneous flags and variables
    int has_periodic_boundary_conditions;
    double interaction_distance;
    ...

public:
    void AddParticle(Particle p);
    void BalanceWorkloads();
    void FetchParticles();
    void WriteBack(ForceReductionFunction CombineForces);
    void RepatriateParticles();
    ParticleList operator()(const Index I);
    ...
};

// ForAll loops over all ParticleLists owned by this processor
#define ForAll(I, MESH) ...
#define EndForAll ...

// ForAllInteracting loops over all particles J interacting with I
#define ForAllInteracting(J, I, MESH) ...
#define EndForAllInteracting ...

Figure ��: This API definition is taken from the C++ header file for class ChainMesh, which implements the chaining mesh structure used to organize particles ����. ChainMesh provides all of the parallelization mechanisms described previously: BalanceWorkloads, FetchParticles, WriteBack, and RepatriateParticles. It is implemented on top of the LPARX parallelization mechanisms.
RepatriateParticles. It also provides functions for adding particles to the mesh (AddParticle) and for indexing the mesh to extract a single list of particles.

The implementation also defines two loops, ForAll and ForAllInteracting, that iterate over the particle lists in the chaining mesh. ForAll is a parallel loop that iterates over the particle lists owned by a particular processor. It automatically times the computation associated with each bin, and this timing information is used by ChainMesh to guide the partitioning of the mesh in BalanceWorkloads. Although the application must explicitly rebalance the mesh by calling BalanceWorkloads, the difficult task of determining the non-uniform workload distribution is handled automatically. In practice, applications typically repartition every Nth timestep, where N depends upon how quickly particles move.

The other loop, ForAllInteracting, iterates over all bins that contain particles interacting with the bin returned by ForAll. Figure ��� shows how these loops are used in a local interactions computation. LocalInteractions is a C++ routine that calculates local interactions using numerical kernel ComputeInteractions (not shown). This C++ code looks very similar to the local interactions loop of our generic parallel particle application in Figure ��.
ParticleList

In addition to ChainMesh, the particle library defines a class called ParticleList to represent a list of particles. Such particle lists are not an artifact of the parallel implementation but are also required by serial codes. Traditionally, chaining meshes have represented particle lists using a linked list ����. The advantage of the linked list strategy is that it is easy to add and remove particles by simply manipulating pointers.

Instead of this approach, our implementation of ParticleList represents a list of particles using an array (see Figure ���). Although this array representation is more complicated (particle information must be copied into and out of the list, and the array must grow and shrink dynamically), it has its performance advantages.
// Show a simple routine which does local particle interactions
extern void ComputeInteractions(ParticleList A, ParticleList B);
extern void CombineForces(ParticleList A, const ParticleList B);

void LocalInteractions(ChainMesh mesh) {
    // Fetch particle data from adjoining neighbors
    mesh.FetchParticles();

    // Calculate forces arising from local particle interactions
    ForAll(I, mesh)
        ForAllInteracting(J, I, mesh)
            ComputeInteractions(mesh(I), mesh(J));
        EndForAllInteracting
    EndForAll

    // Write back forces for off-processor particles
    mesh.WriteBack(CombineForces);
}

Figure ���: Application C++ code to compute local interactions using the particle library. The ForAll loop iterates in parallel over all particle lists in the chaining mesh, and ForAllInteracting iterates over bins that contain particles interacting with bins returned by ForAll. ComputeInteractions is a numerical routine that computes the interactions between two particle lists and is not shown.
Because particle information is arranged in arrays, it is easier to vectorize numerical kernels on vector architectures such as the Cray C90. Arrays are easier to pass to Fortran numerical routines than linked lists. Finally, the array representation offers improved cache locality because particle values lie contiguously in memory.

The physical information stored on each particle typically depends on the type of numerical simulation. Thus, the programmer must take the following steps to customize the particle representations provided by our library (see Figure ���):

• Define a C++ class called Particle that includes all physical information needed to characterize a particle, such as position, velocity, acceleration, force, mass, density, or pressure.
// Class Particle defines important particle attributes
class Particle {
    double position[3], velocity[3], force[3];
    ...
};

// Class ParticleList represents Particle information as arrays
class ParticleList {
    int number;
    double (*position)[3], (*velocity)[3], (*force)[3];
    ...
};

// CombineForces (called by WriteBack) adds forces from B into A
void CombineForces(ParticleList A, const ParticleList B) {
    for (int i = 0; i < A.number; i++) {
        A.force[i][0] += B.force[i][0];
        A.force[i][1] += B.force[i][1];
        A.force[i][2] += B.force[i][2];
    }
}

// Packing routine to transmit ParticleList data between processors
SendPacket& operator<<(SendPacket& stream, const ParticleList& PL)
{
    stream << PL.number;
    stream << PackArray(PL.position, 3*PL.number);
    stream << PackArray(PL.velocity, 3*PL.number);
    ...
    return stream;
}

Figure ���: Because the information represented by a particle depends on the physics of the computation, the programmer must customize our ParticleList class for a particular application. These changes are simple and could be managed automatically using a pre-processor. ParticleList represents particle information using arrays of data instead of the linked list method typically used in chaining mesh codes ����. Function PackArray is defined by the AMS libraries and packs an entire array of data into the outgoing message stream.
• Modify ParticleList to represent the same information as a Particle and write routines to copy a Particle into and out of a ParticleList. These copying routines are used internally by the particle library. They are simple and resemble the gather and scatter routines used on vector architectures.

• Write message stream packing and unpacking routines needed to transmit particle data across memory spaces (see Section ���). Again, these routines are easy to write and resemble standard C++ I/O. A sample packing routine is shown in Figure ���. The corresponding unpacking routine would be similar except that it would extract data from the message stream.

• Write the CombineForces routine as needed by WriteBack (see Section ���). Recall that CombineForces is used by WriteBack to combine the force contributions from two particle lists. The sample code in Figure ��� loops over the particles in particle lists A and B and adds the forces from particles in list B to the corresponding particles in list A.

Although these changes are not difficult, they could be automated by using a simple pre-processor. Of course, the programmer must also write the numerical computation routines to calculate particle interactions and update particle positions.
One possible performance optimization that we explore in Section ��� is the selective packing and unpacking of particle information. Particles typically contain a significant amount of data (tens to hundreds of bytes), and it is not always necessary to communicate all of this information between processors. For example, only forces typically need to be communicated when writing back force data. Thus, the programmer can apply this application-specific knowledge to selectively transfer data in the packing and unpacking routines. Our experiments with this optimization in Section ��� indicate that it reduces execution times between �% and ��% and reduces the amount of interprocessor communication by a factor of four to five.
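A hedged sketch of the idea follows; the buffer layout and the names Plist and packForcesOnly are illustrative, since the real code would write through the AMS SendPacket stream and PackArray:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Toy particle list: three arrays of 3 doubles per particle.
struct Plist {
    std::vector<double> pos, vel, force;
};

// Selective packing: when writing back forces, serialize only the force
// array rather than the full per-particle record, shrinking the message.
std::vector<char> packForcesOnly(const Plist& pl) {
    std::vector<char> buf(pl.force.size() * sizeof(double));
    std::memcpy(buf.data(), pl.force.data(), buf.size());
    return buf;   // pos and vel are deliberately omitted from the message
}
```

The corresponding unpacking routine on the receiving side would copy the doubles back into the force array only, leaving positions and velocities untouched.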
Figure ��: Our smoothed particle hydrodynamics application simulates the evolution of the 3d disk galaxy shown here. Particles are equally distributed around the ring and are assigned a random vertical position clustered about the Z = 0 plane.
��� Smoothed Particle Hydrodynamics

We've discovered a massive dust and gas cloud which is either the beginning of a new star or just a hell of a lot of dust and gas.
- Sidney Harris, "From Personal Ads to Cloning Labs"

We have developed a 3d smoothed particle hydrodynamics application (SPH3D) based on the software facilities described in the previous section. Smoothed particle hydrodynamics is a particle-based simulation method which has been applied to gas dynamics, stellar collisions, planet formation, cloud collisions, cosmology, magnetic phenomena, and nearly incompressible flow �����. Our particular application* arises in astrophysics and models the evolution of the disk galaxy shown in Figure ��.

The computational structure of our smoothed particle hydrodynamics application is similar to that of the generic particle codes shown in Figures �� and ���. Interactions between particles occur only over short ranges, and there are no far-field

*The original code and a sample data set were provided by John Wallin (Institute for Computational Sciences and Informatics at George Mason University) and Curtis Struck-Marcell (Department of Physics and Astronomy, Iowa State University).
forces. Each interaction is expensive, requiring approximately one hundred floating point operations. Associated with each particle is ��� bytes of information describing position, velocity, acceleration, mass, density, and pressure. Local interactions take two forms. First, the method calculates a local density for each particle. Then, using this density information, it computes pressure gradients and their associated forces. These forces are used to move the particles in preparation for the next timestep. Because there are two local interaction phases, two calls to FetchParticles and WriteBack are needed every timestep.

We begin this section with a description of the numerical calculation, which may be skipped without loss of continuity. The succeeding four sections present computational results.
����� Numerical Background

You know, I don't think math is a science. I think it's a religion. All these equations are like miracles. You take two numbers, and when you add them, they magically become one new number. No one can say how it happens. You either believe it or you don't. This whole [section] is full of things that have to be accepted on faith. It's a religion. As a math atheist, I should be excused from this...
- Calvin, "Calvin and Hobbes"

Smoothed particle hydrodynamics represents each particle not as a point but as a smooth "blob" smeared over a portion of space �����. The general form of a blob is given by the interaction basis function, or kernel, Φ. Our particular application uses the following kernel function:
Φ(r, h) = (1/(π h³)) (1 − (3/2)(r/h)² + (3/4)(r/h)³)    for 0 ≤ r ≤ h
Φ(r, h) = (1/(4 π h³)) (2 − r/h)³                        for h ≤ r ≤ 2h
Φ(r, h) = 0                                              otherwise

where r ∈ R is the distance away from the center of the particle and h ∈ R gives the "spreading" of the blob. Note that this kernel has compact support: Φ is zero for r ≥ 2h. Thus, particles separated by more than 2h do not interact.
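A direct transcription of the cubic spline kernel follows. The normalizing constants assume the standard Monaghan form; the dissertation's exact coefficients did not survive extraction cleanly, so treat them as illustrative.

```cpp
#include <cassert>
#include <cmath>

// Cubic spline SPH kernel with compact support of radius 2h.
// Coefficients follow the standard Monaghan form (an assumption here).
double kernel(double r, double h) {
    const double PI = 3.14159265358979323846;
    const double q = r / h;
    const double norm = 1.0 / (PI * h * h * h);
    if (q <= 1.0)
        return norm * (1.0 - 1.5 * q * q + 0.75 * q * q * q);
    if (q <= 2.0)
        return norm * 0.25 * (2.0 - q) * (2.0 - q) * (2.0 - q);
    return 0.0;   // compact support: particles beyond 2h do not interact
}
```

A useful sanity check on any such kernel is continuity at the piece boundary r = h, where both polynomial branches must agree.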
Variable   Space          Physical Meaning
x          R³             position
v          R³             velocity
a          R³             acceleration
ρ          R              density
m          R              mass
P          R              pressure
V          R              viscosity
Δt         R              timestep
Φ          R × R → R      interaction kernel function
h          R              interaction distance
g          R³ → R³        external gravitational field

Table ��: A summary of the variables and functions used in the smoothed particle hydrodynamics equations.
Associated with each particle is information about its position x, velocity v, acceleration a, mass m, and density ρ (see Table �� for a summary of all variables and functions defined in this section). Local interactions consist of two separate computation phases: density calculations and force calculations. We compute the density for a particular particle i by summing mass contributions from nearby particles j:

ρᵢ = Σⱼ mⱼ Φ(‖xᵢ − xⱼ‖, h)

Although written as a sum over all particles j, only nearby particles contribute to the density because Φ is zero for ‖xᵢ − xⱼ‖ ≥ 2h.

After we have calculated the local density for each particle, we compute the
forces on each particle i:

aᵢ = −Σⱼ mⱼ { Pᵢ,ⱼ + Vᵢ,ⱼ } ∇ᵢ Φ(‖xᵢ − xⱼ‖, h) + g(xᵢ)

where ∇ᵢ is the gradient taken with respect to the coordinates of particle i. Pᵢ,ⱼ represents the force component due to pressure and is given by:

Pᵢ,ⱼ = √(Pᵢ Pⱼ) / (ρᵢ ρⱼ)
The viscosity, or "stickiness", of the fluid is represented by the term Vᵢ,ⱼ, defined as:

Vᵢ,ⱼ = h [ (vᵢ − vⱼ) · (xᵢ − xⱼ) ]² / [ (‖xᵢ − xⱼ‖² + εh²) ρᵢ ρⱼ ]    if (vᵢ − vⱼ) · (xᵢ − xⱼ) < 0
Vᵢ,ⱼ = 0                                                              if (vᵢ − vⱼ) · (xᵢ − xⱼ) ≥ 0

where (·) is the standard dot product in R³ and ε is a small constant that keeps the denominator from vanishing. The term g(xᵢ) in the acceleration equation represents the influence of an external, problem-dependent gravitational field. With the exception of the multiplication by mⱼ, the computation for aᵢ within the sum is identical to that for aⱼ. Thus, we exploit the symmetric nature of the force law to reduce computational costs by about a factor of two.
Using the acceleration information from Eq. ��, we update the velocity and position of particle i using the first-order Euler's method [��]:

v_i \leftarrow v_i + a_i \, \Delta t
x_i \leftarrow x_i + v_i \, \Delta t

where Δt represents the timestep. The application dynamically changes the timestep Δt to satisfy stability criteria such as the Courant-Friedrichs-Lewy (CFL) condition [��].
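The density phase and the Euler update can be sketched in C++ as follows. The `Particle` record and function names here are ours, not LPARX's, and a real code would enumerate neighbors through the chaining mesh rather than the all-pairs loop shown:

```cpp
#include <cmath>
#include <vector>

// Hypothetical per-particle record: position, velocity, acceleration,
// mass, and density.
struct Particle {
    double x[3] = {0, 0, 0}, v[3] = {0, 0, 0}, a[3] = {0, 0, 0};
    double m = 0.0, rho = 0.0;
};

// Density by direct summation over all particles; the kernel `phi` is zero
// beyond the interaction distance h, so only nearby particles contribute.
void computeDensity(std::vector<Particle>& p, double h,
                    double (*phi)(double, double)) {
    for (auto& pi : p) {
        pi.rho = 0.0;
        for (const auto& pj : p) {
            double d2 = 0.0;
            for (int k = 0; k < 3; ++k) {
                const double dx = pi.x[k] - pj.x[k];
                d2 += dx * dx;
            }
            pi.rho += pj.m * phi(std::sqrt(d2), h);
        }
    }
}

// First-order Euler update: v <- v + a*dt, then x <- x + v*dt.
void eulerStep(std::vector<Particle>& p, double dt) {
    for (auto& pi : p)
        for (int k = 0; k < 3; ++k) {
            pi.v[k] += pi.a[k] * dt;
            pi.x[k] += pi.v[k] * dt;
        }
}
```

Note that the position update uses the already-updated velocity, matching the order of the two formulas above.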
����� Performance Comparison

We present performance results for the SPH3D application on the Cray C-90 (single processor), Intel Paragon, IBM SP2, and a network of Alpha workstations connected by a GIGAswitch running PVM [��]; refer to Table �� for software versions and Appendix A for machine characteristics. Time is reported in seconds per timestep. All floating point arithmetic was performed using 64-bit numbers. The application code was identical on all machines except that the C-90 version gathered and scattered particles to obtain longer vector lengths.

We ran simulations with 12k, 24k, 48k, and 96k particles for the spatial distribution shown in Figure ��. Numerical

* The IBM SP2 results were obtained on a pre-production machine at the Cornell Theory Center; these times should improve as the system is tuned and enters full production use.
Machine    C++ Compiler      Optimization     Fortran Compiler   Optimization      Operating System
Alphas     g++ v���          -O               f77 v���           -O�               OSF/1 ��
C-90       CC v�������       -O               cft77 v���         -O�               UNICOS �����
Paragon    g++ v���          -O -mnoieee      if77 v����         -O� -Knoieee      OSF/1 �����
SP2        xlC v��           -O -Q            xlf v���           -O�               AIX v��

Table ��: Software version numbers and compiler optimization flags for all computations in this chapter. The Alpha cluster consists of eight DEC Alpha workstations communicating through PVM ���� over a GIGAswitch network interconnect. All benchmarks used LPARX release v���. Detailed machine characteristics are reported in Appendix A.
computation costs vary as the square of the local particle density; asymptotically, doubling the number of particles in the same computational space requires four times more work. We may eliminate the square dependence on the number of particles by reducing particle interaction distances as the local density increases [��], but we have not taken this approach in these simulations.

We employed a chaining mesh of size ������ on ��, ��, ��, and � processors, and ����� on � and � processors. These mesh sizes were chosen to minimize the total execution time; larger meshes help reduce load imbalance because they allow a finer partitioning of the problem space. In general, the choice of the best mesh size depends on factors such as the number of processors, kernel interaction distance, load imbalance, workload distribution, processor computational speed, and particle density [��]. On the parallel machines, we rebalanced workloads every ten timesteps.
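For concreteness, a chaining mesh of this kind can be sketched as follows: cells are at least h wide, so each particle's interaction partners are confined to its own cell and the adjacent cells. The class layout is hypothetical and far simpler than the LPARX Grid of ParticleList used in the actual implementation:

```cpp
#include <array>
#include <vector>

// Hypothetical chaining mesh over a cubic domain: particle indices are
// binned into cells whose side is at least the interaction distance h, so
// neighbor searches need only examine a cell and its 26 adjacent cells.
struct ChainingMesh {
    int n;                                   // cells per dimension
    double cell;                             // cell width (>= h)
    std::vector<std::vector<int>> bins;      // particle indices per cell

    ChainingMesh(double domain, double h)
        : n(static_cast<int>(domain / h) > 0 ? static_cast<int>(domain / h) : 1),
          cell(domain / n),
          bins(static_cast<std::size_t>(n) * n * n) {}

    // Map a position to the flat index of its cell, clamping to the domain.
    int index(const std::array<double, 3>& x) const {
        int i[3];
        for (int k = 0; k < 3; ++k) {
            i[k] = static_cast<int>(x[k] / cell);
            if (i[k] < 0) i[k] = 0;
            if (i[k] >= n) i[k] = n - 1;
        }
        return (i[2] * n + i[1]) * n + i[0];
    }

    void insert(int particle, const std::array<double, 3>& x) {
        bins[index(x)].push_back(particle);
    }
};
```

A finer mesh (larger n) gives the load balancer more pieces to distribute, at the cost of more cells to manage per neighbor search.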
Table �� and Figure ��� present computational performance for one timestep of the SPH3D application. Although the numerical kernels of SPH3D vectorize on the C-90, the kernels are rather complicated and contain a number of conditionals which hinder efficient utilization of the vector units. Furthermore, even though the C-90 code gathers and scatters particles to increase vector lengths, vectors are still quite short. These vectorization limitations are intrinsic to the algorithm and are not artifacts of parallelization. For ��k particles, hardware performance monitors
Particles   Cray C-90 Time (P = 1)   Alpha Time (P = 8)
12k         ��� �
24k         ���� ����
48k         ���� ����
96k         ���� ����

Intel Paragon Performance
Particles   P = 8            P = 16           P = 32           P = 64
            Time  Speedup    Time  Speedup    Time  Speedup    Time  Speedup
12k         �� ��� ���� ��� ����� � ��� ���
24k         ���� ��� ���� ��� ��� ��� ���� ���
48k         ��� ��� ���� ��� ��� ��� ���� ���
96k         ��� ��� ���� ��� ��� ��� ���� ���

IBM SP2 Performance
Particles   P = 4            P = 8            P = 16
            Time  Speedup    Time  Speedup    Time  Speedup
12k         ��� ��� ����� ��� ����� ��
24k         ��� ��� ��� ��� ���� ��
48k         ���� ��� ���� ��� ���� ��
96k         ���� ��� ���� ��� ��� ��

Table ��: These tables present SPH3D performance results on a Cray C-90, Intel Paragon, IBM SP2, and an Alpha workstation farm running PVM. All times are in seconds per timestep. Cray times were averaged over �� timesteps, Alpha times over �� timesteps, and all other times over ��� timesteps. The C-90 measurements are CPU times on a production system; measurements on the Alpha farm, Paragon, and SP2 are wallclock times since processor nodes are not time-shared. For the Paragon and SP2, speedups are reported relative to the smallest number of processors used to gather data. These numbers are graphed in Figure ���.
[Figure ���: log-scale plots of seconds per timestep versus number of particles (12k, 24k, 48k, 96k): (a) SPH3D Performance Comparison for the Alpha Cluster (P = 8), Cray C-90 (P = 1), Paragon (P = 16), and SP2 (P = 4); (b) SPH3D Performance on the Paragon (P = 8, 16, 32, 64); (c) SPH3D Performance on the SP2 (P = 4, 8, 16).]

Figure ���: These graphs present SPH3D performance results on a Cray C-90, Intel Paragon, IBM SP2, and an Alpha workstation farm running PVM. Measurements were gathered as described in Table ��. In graph (a), the number of processors for a particular machine was chosen to provide performance roughly comparable to a single processor of a Cray C-90; processor numbers are given in parentheses. The bottom two bar charts present timings as a function of the number of processors for (b) the Intel Paragon and (c) the IBM SP2.
on the C-90 report an average floating point execution rate of ��� megaflops and an average vector length of sixty; the peak, not-to-exceed performance for one processor of the C-90 is approximately ���� megaflops. For this particular problem size, one processor of the C-90 is roughly equivalent to 8 Alpha processors, 16 Paragon processors, or 4 SP2 processors.

The C-90 and the Alpha cluster exhibit relatively poor performance on the smallest problem size (12k particles). The Alphas suffer because of the high overheads of message passing through PVM; in larger problems, this overhead is hidden by the increased computational costs of particle interactions. Poor performance on the C-90 is due to short vector lengths. The Cray C-90 times improve relative to the other machines for the largest problem size because of increasing vector lengths. Because applications implemented using our particle library are portable across a diversity of high-performance machines, computational scientists have the freedom to choose the most cost-effective architecture (e.g., Cray C-90 or Alpha cluster) for a particular problem size.
����� Execution Time Analysis

To better understand the various costs of a parallel particle application, we provide a detailed breakdown of the Paragon and SP2 execution times for the SPH3D calculation. We have chosen the 24k data set for our analysis because it exhibits reasonable performance across all processor sizes: the 12k problem does not have enough computational work for �� processors, and numerical work dominates all other costs for the larger problems running on � and � processors.

Table �� presents a breakdown of the execution time for one timestep of SPH3D (averaged over ��� timesteps). Times (in milliseconds) are reported for the following categories: force calculation, move particles, load imbalance, fetch particles, write back forces, repatriate particles, and rebalance workloads. The first two categories measure numerical work and the last five categories measure communication and parallelization overheads. The majority of the time is spent in force calculation, load
Intel Paragon Performance Breakdown
Task                     P = 8        P = 16       P = 32       P = 64
                         Time   %     Time   %     Time   %     Time   %
Force Calculation ���� �� ��� �� ���� �� ��� �
Move Particles ��� �� �� ��
Load Imbalance ��� � ���� �� ��� �� ��� ��
Fetch Particles �� � �� � �� � �� �
Write Back Forces ��� ��� � ��� � ��� �
Repatriate Particles �� �� � �� �� �� �
Rebalance Workloads �� �� �� � �� � �� �
Total ���� ��� ���� ��� ��� ��� ���� ���
IBM SP2 Performance Breakdown
Task                     P = 4        P = 8        P = 16
                         Time   %     Time   %     Time   %
Force Calculation ��� �� ���� �� ��� ��
Move Particles ��� ���� ����
Load Imbalance �� � �� � ��� ��
Fetch Particles ���� �� � ��� �
Write Back Forces ���� � �� � ���� �
Repatriate Particles ��� �� ���� � ����
Rebalance Workloads ���� �� ���� �� ���� �
Total ��� ��� ��� ��� ���� ���
Table ��: A breakdown of the execution time of one SPH3D timestep (averaged over ���) with 24k particles on the Intel Paragon and the IBM SP2. Times are reported in milliseconds. Workloads were rebalanced every ten timesteps. Numbers may not add up to the "Total" due to rounding.
imbalance, and interprocessor communication for fetching and updating interacting particles.

The computational work per processor drops by a factor of two each time the number of processors doubles, and the force calculation times reflect this pattern. One interesting anomaly occurs between � and � Paragon processors, for which the computation time is more than halved. This effect is probably due to better caching behavior on � processors. Recall that SPH3D uses a finer chaining mesh on � processors than on �; thus, there are fewer particles per bin because each bin
covers less of the computational domain. The on-chip cache in the Paragon's i860 XP processor is very small (16 Kbytes of data) and can simultaneously cache only a few tens of particles. With fewer particles per bin, there is a higher probability that particles will remain in the data cache during the inner loops of the numerical computation.

The application loses a significant amount of time to load imbalance; on � Paragon processors, nearly ��% of the total execution time is spent waiting for other processors. The reason for this poor load balancing is that our sample data set (see Figure ��) distributes most of the workload in a 2-d plane. Because the computational work is clustered in a small area, the recursive bisection algorithm cannot efficiently partition the workload across processors. Although we could refine the mesh to obtain a better load balance, a finer mesh would incur additional computational overheads [��], resulting in worse overall performance.

The communication of interacting particle information accounts for most of the remaining execution time. Although the time spent in communication remains somewhat constant as we increase the number of processors, its relative cost increases as computation time decreases. On � Paragon processors, interprocessor communication accounts for approximately �% of the total execution time. Note that the parallel overhead of performing load balancing, which includes partitioning and copying particles from the old decomposition into the new, is only a few percent of the execution time.

Table �� and Figure ��� provide another view of the execution time for one SPH3D timestep. In this breakdown, the "force calculation" and "move particles" times are combined, and the total interprocessor communication time is subdivided into two categories: buffer management (packing and unpacking of data) and communication costs (sending and receiving messages and synchronization). These results clearly show that the overhead associated with gathering and scattering data into and out of message buffers cannot be neglected. Note that the application does not change data representation, as would be required for a heterogeneous network of machines with differing number formats; instead, buffer packing employs simple memory-to-memory
Intel Paragon Performance Breakdown
Task                     P = 8        P = 16       P = 32       P = 64
                         Time   %     Time   %     Time   %     Time   %
Computation ���� �� �� �� ���� �� �� ��
Load Imbalance ��� � ���� �� ��� �� ��� ��
Communication �� � �� � �� � �� ��
Packing�Unpacking Data �� � � � � � ��� �
Total ���� ��� ���� ��� ��� ��� ���� ���
IBM SP2 Performance Breakdown
Task                     P = 4        P = 8        P = 16
                         Time   %     Time   %     Time   %
Computation ���� �� ���� �� ��� ��
Load Imbalance �� � �� � ��� ��
Communication ��� �� � ��� ��
Packing�Unpacking Data �� ���� � ���� �
Total ���� ��� ��� ��� ���� ���
Table ��: A breakdown of one SPH3D timestep (averaged over ���) with 24k particles on the Intel Paragon and the IBM SP2. The execution time is divided into four categories: computation time, load imbalance, interprocessor communication costs, and buffer packing and unpacking overheads. Times are given in milliseconds. Numbers may not add up to the "Total" due to rounding. This data is also graphed in Figure ���.
copies. On � Paragon processors, copying alone accounts for almost ��% of the total execution time.

These numbers also indicate that for larger numbers of processors, communication overheads are dominated by message start-up costs. Message packing times are directly related to the number of bytes transmitted between processors. If communication were bandwidth limited, we would expect communication times to scale as message packing times. Instead, communication costs increase faster, indicating that communication is dominated by start-up overheads.
[Figure ���: stacked bar charts ("SPH3D Execution Time Breakdown") of wallclock seconds per timestep versus processors for (a) the Intel Paragon (24k particles) and (b) the IBM SP2 (24k particles), with bars divided into pack/unpack data, communication, load imbalance, and computation.]

Figure ���: A graph of the data presented in Table �� for (a) the Intel Paragon and (b) the IBM SP2. Execution time is divided into four categories: computation time, load imbalance, interprocessor communication costs, and buffer packing and unpacking overheads.
����� Exploiting Force Law Symmetry

Recall from Section ��� that many particle applications employ a symmetric force law in which the force acting on a particle p by particle q is equal and opposite to the force acting on q by p. The SPH3D application exploits this symmetry to reduce numerical computation costs by about a factor of two. However, this savings is offset somewhat by additional communication, since forces for particles lying in the ghost cell regions must be transmitted back to the processors owning those particles. This write back phase can be difficult to implement without the proper software support; in fact, the parallel implementation of the molecular dynamics program GROMOS does not use a symmetric force law for ghost cell particles [��] for this very reason. In this section, we investigate the performance tradeoffs of this design decision.

We modified the SPH3D code to ignore symmetry for interactions involving particles in ghost cell regions; these modifications required changes to fewer than ten
Intel Paragon (24k particles)
Task                     P = 8        P = 16       P = 32       P = 64
FS PS FS PS FS PS FS PS
Computation ���� ���� �� ���� ���� ���� �� ���
Load Imbalance ��� ���� ���� ��� ��� ���� ��� ���
Communication ��� ��� ��� �� ��� ��� ��� ��
Total ���� ����� ���� ���� ��� ���� ���� ���
Intel Paragon (48k particles)
Task                     P = 8        P = 16       P = 32       P = 64
FS PS FS PS FS PS FS PS
Computation ����� ���� ���� ����� ���� ���� ��� ����
Load Imbalance ��� ���� ��� ����� �� ���� ��� ����
Communication ��� ��� ��� ��� �� ��� ��� ���
Total ���� ����� ����� ���� ��� ����� ���� ����
Table ��: These tables compare the performance of the SPH3D code, which exploits the full symmetry ("FS") of the force law, to a restricted version that exploits only some of the symmetry ("PS", for partial symmetry). The "PS" variant does not use symmetry for interactions involving particles lying in the ghost cell regions. On average, "FS" runs about ��% faster than "PS". Times (in milliseconds) represent one timestep (averaged over ���) on the Intel Paragon. Numbers may not add up to the "Total" due to rounding. These times are also graphed in Figure ���.
lines of code. Of course, we still exploit symmetry if neither particle lies in the ghost cell region. Table �� and Figure ��� compare the Paragon execution times for the two SPH3D versions on simulations with 24k and 48k particles. Without the write back communications phase, the modified SPH3D code (labelled "partial symmetry" or "PS") spends an average of ��% less time in interprocessor communication. However, this savings is more than offset by increased computational costs in redundant force calculations, along with a corresponding increase in load imbalance. Overall execution times for the modified code are an average of ��% slower than the original version.

The computation times in Table �� reveal that the relative penalty of redundant force calculations increases with larger numbers of processors. The modified code's force computations run about ��% slower than the original SPH3D code on � processors and about �% slower on � processors. Larger numbers of processors
[Figure ���: stacked bar charts ("Exploiting Force Law Symmetry") of wallclock seconds per timestep versus processors (P = 8, 16, 32, 64) on the Intel Paragon, with paired "Full Symmetry" and "Partial Symmetry" bars divided into communication, load imbalance, and computation, for (a) 24k and (b) 48k particles.]

Figure ���: These graphs compare the performance of the SPH3D code (left bars, labeled "Full Symmetry") and a version that does not exploit symmetry for interactions involving particles lying in the ghost cell regions (right bars, labeled "Partial Symmetry"). Execution times (taken from Table ��) are reported for one timestep of (a) 24k-particle and (b) 48k-particle simulations on the Intel Paragon.
divide the computational space into smaller partitions, and the increased surface-to-volume ratio of small partitions means that a more significant percentage of interactions involve ghost cell particles. Thus, a greater fraction of redundant computations are executed for larger numbers of processors.
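The full-symmetry strategy amounts to visiting each unordered pair once and accumulating equal-and-opposite contributions into both particles. A one-dimensional C++ sketch (the name `pairForce` is ours, standing in for the pressure/viscosity kernel-gradient term):

```cpp
#include <cstddef>
#include <vector>

// Symmetric force accumulation: each pair (i, j) with j > i is evaluated
// once, and the result is added to particle i and subtracted from particle
// j, halving the number of force evaluations relative to a full double loop.
void symmetricForces(std::vector<double>& a, const std::vector<double>& x,
                     double (*pairForce)(double, double)) {
    for (auto& ai : a) ai = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = i + 1; j < x.size(); ++j) {
            const double f = pairForce(x[i], x[j]);
            a[i] += f;   // force on i due to j
            a[j] -= f;   // equal and opposite force on j due to i
        }
}
```

In the parallel setting, when j is a ghost copy owned by another processor, the accumulated `a[j]` is what must be shipped back during the write back phase.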
����� Communication Optimizations

Each particle in our smoothed particle hydrodynamics application requires approximately ��� bytes of memory to describe position, velocity, acceleration, and various other physical parameters. However, not all of this information is needed by each phase of the calculation. For example, the local interactions to calculate density (see Eq. ��) require only mass and position data for each particle. After computing the density, only the density values, not mass and position, which have
[Figure ���: (a) stacked bar charts ("SPH3D Communication Optimizations") of wallclock seconds per timestep versus processors (P = 8, 16, 32, 64) on the Intel Paragon (24k particles), with paired "Optimized" and "Naive" bars divided into pack/unpack data, communication, load imbalance, and computation; (b) bar charts ("SPH3D Communication Costs") of kilobytes communicated per timestep, divided into rebalance workloads, repatriate, write back, and fetch neighbors.]

Figure ���: These graphs compare the performance of the SPH3D code (left bars) to a "naive" implementation (right bars) which does not attempt to minimize interprocessor communication. All numbers represent averages for one timestep (averaged over ���) on the Intel Paragon for 24k particles. (a) The optimized SPH3D runs between ��% and ��% faster than the naive version. (b) Average interprocessor communication (in kilobytes) for each processor during the timestep. These numbers are also reported in Table ��.
not changed, need to be communicated between processors and updated.

Therefore, the SPH3D application communicates only the particle information required during each phase of the computation. Figure ���a compares the execution time of one timestep of SPH3D to a "naive" implementation without this communication optimization. The current (optimized) SPH3D code runs between ��% and ��% faster than the naive implementation.

Figure ���b and Table �� illustrate the additional interprocessor communication costs incurred by the naive version for the four primary communication routines of the SPH3D application: fetch particles, write back forces, repatriate particles, and rebalance workloads. Repatriating particles and rebalancing workloads already require that all particle information be transferred between processors; thus,
Task                     P = 8         P = 16        P = 32        P = 64
                         Opt   Naive   Opt   Naive   Opt   Naive   Opt   Naive
Fetch Particles ���� ��� �� ���� ���� ��� ��� ����
Write Back Forces ���� ��� ���� ���� ���� ��� ��� ����
Repatriate Particles ����� ����� ����� ����� ����� ����� ����� �����
Rebalance Workloads ���� ���� ���� ���� ���� ���� ��� ���
Total ���� �� ���� �� ��� �� � ����
Table ��: Average interprocessor communication costs (in kilobytes) per Paragon processor for one iteration of the SPH3D code ("Opt") and a "Naive" implementation that does not attempt to minimize interprocessor communication. Note that repatriating particles and balancing workloads require the communication of all particle information. Overall, the naive implementation sends between four and five times more data. These numbers are also graphed in Figure ���b.
the quantity of interprocessor communication in these routines does not change. Communicating only the required data significantly reduces message traffic when fetching interacting particles and writing back calculated forces. Overall, communications traffic is reduced by a factor of four to five.

While it may seem obvious that an application should transfer only the particle information needed by each phase of the calculation, implementing this optimization significantly affects the design of a software support library. The library must allow the programmer to specify what data is to be sent during each phase of the calculation, and it must support selective packing and unpacking of data. These types of design considerations are also vital for performance on distributed shared memory machines with coherent caches [��].
��� Analysis and Discussion

The great tragedy of science is the slaying of a beautiful hypothesis by an ugly fact.

— Thomas Henry Huxley

Particle applications are difficult to parallelize because they require dynamic, irregular partitionings of space to maintain an equal distribution of computational work. Based upon the parallelization mechanisms of LPARX, we have developed run-time support facilities which greatly simplify the task of implementing efficient, portable, parallel particle codes. The use of the LPARX abstractions allowed us to provide functionality and explore performance optimizations which would have been very difficult using only a message passing library. Applications written using our API library are portable to a number of high-performance architectures, including the Intel Paragon, IBM SP2, and networks of workstations, with good performance.

Based on our detailed performance analysis in Section ���, we make the following observations and recommendations for parallel implementations of particle calculations:

- Applications with symmetric force laws should not ignore symmetry for interactions involving particles lying in ghost cell regions. Although ignoring such symmetry reduces communication overheads, any savings is more than offset by increased computational costs in redundant particle interactions. Furthermore, the performance penalty increases with the number of processors. For our SPH3D application, a code which fully exploits symmetry runs an average of ��% faster than one that does not.

- When transmitting particle information between processors, applications must be careful to communicate only the information needed by a particular computational phase. Our experiments indicate that this simple optimization can reduce execution times by ��% to ��% and the amount of interprocessor communication by a factor of four to five.
- Load imbalance can become the dominant cost for computations with localized densities, as was the case with our SPH3D sample data set. The recursive bisection decomposition method may be inadequate for such workload distributions. We discuss an alternative decomposition strategy in Section ���.

Our software support infrastructure provided the high-level abstractions that enabled us to easily explore these various design decisions.
����� Parallelization Requirements

We found the following features of LPARX essential in the development of our particle library:

- Our use of recursive bisection to balance dynamic, non-uniform workloads relies on LPARX's concept of structural abstraction and its support for dynamic, user-defined, irregular block decompositions. Through structural abstraction, we are able to define data decompositions appropriate for our particular application.

- LPARX's region calculus (e.g., grow) and its copy-on-intersect operation greatly simplify the expression of interprocessor communication in the API routines BalanceWorkloads, FetchParticles, WriteBack, and RepatriateParticles.

- LPARX supports Grids of complicated types, such as ParticleList. Without such support, we could not have implemented the chaining mesh structure needed to organize the particles.

Overall, LPARX enabled us to reason about the structure of the particle computation at a high level and simplified the implementation. An equivalent library written using only message passing would have been considerably more complicated and would have required many times more code.
[Figure ���: side-by-side illustrations of (a) a structured partitioning and (b) an unstructured partitioning of a two-dimensional domain.]

Figure ���: These pictures compare (a) structured partitions with (b) unstructured partitions. Unstructured partitions are better at balancing workloads because they allow partition boundaries to meander through the domain. However, the reduction in load imbalance is offset somewhat by the expense of additional overheads in more complicated communications analysis. It is an open research question whether structured or unstructured partitions are better for particle methods.
����� Unstructured Partitionings

The performance analysis of the SPH3D application in Section ��� reveals that a sizeable portion of the available computational resources is lost due to load imbalance. For this problem, the recursive bisection (RCB) partitioner [��] was unable to efficiently balance workloads. One drawback of RCB is that all partition cuts are straight lines (see Figure ���a); thus, RCB does not have the freedom to insert a "kink" in the cut to improve load balance. This drawback is also an advantage, however, because RCB renders structured, boxy partitions which are easily and efficiently supported by LPARX.

An alternative partitioning strategy, such as Inverse Space-filling Partitioning (ISP) [��], is better at balancing workloads because it allows the cuts to meander through the space. The resulting partitioning is unstructured (see Figure ���b).
Pilkington and Baden [��] show that such decompositions can significantly reduce load imbalance. However, unstructured partitions employ different types of programming abstractions, such as those provided by the CHAOS run-time system [��]. Unstructured implementations require more complicated communications analysis when fetching off-processor data and therefore may be more expensive in total execution time. It is an open research question which method is better for particle calculations.
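For reference, the RCB scheme discussed above can be sketched in a few lines of C++. This is a hypothetical 2-d, equal-count version; the dissertation's partitioner weights particles by computational work rather than simple count:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Recursive coordinate bisection sketch: split the particle set at the
// median along the axis of largest spread, recursing until each of `parts`
// partitions owns a contiguous index range of particles.
struct Pt { double x, y; };

void rcb(std::vector<Pt>& p, std::size_t lo, std::size_t hi,
         int parts, std::vector<int>& owner, int first) {
    if (parts <= 1 || hi - lo <= 1) {       // base case: assign this range
        for (std::size_t i = lo; i < hi; ++i) owner[i] = first;
        return;
    }
    double xmin = p[lo].x, xmax = p[lo].x, ymin = p[lo].y, ymax = p[lo].y;
    for (std::size_t i = lo; i < hi; ++i) {
        xmin = std::min(xmin, p[i].x); xmax = std::max(xmax, p[i].x);
        ymin = std::min(ymin, p[i].y); ymax = std::max(ymax, p[i].y);
    }
    const bool cutX = (xmax - xmin) >= (ymax - ymin);  // straight-line cut
    const std::size_t mid = lo + (hi - lo) / 2;        // median (equal counts)
    std::nth_element(p.begin() + lo, p.begin() + mid, p.begin() + hi,
                     [cutX](const Pt& a, const Pt& b) {
                         return cutX ? a.x < b.x : a.y < b.y;
                     });
    rcb(p, lo, mid, parts / 2, owner, first);
    rcb(p, mid, hi, parts - parts / 2, owner, first + parts / 2);
}
```

Because every cut is a straight line perpendicular to a coordinate axis, the resulting subdomains are boxes, which is exactly the structured property that LPARX supports efficiently, and exactly the constraint that ISP relaxes.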
����� Future Research Directions

Thus far we have considered only the software facilities for local particle-particle interactions. Recall from Table �� that most particle methods also employ either fast PDE solvers or hierarchical data representations. In Chapter � (Adaptive Mesh Applications), we described the techniques and software support required to support fast PDE solvers and structured adaptive hierarchical representations. We plan to combine the techniques of these two chapters to implement a hierarchical particle method such as Almgren's Adaptive Method of Local Corrections (AMLC) [��]. AMLC coupled with the power of parallel architectures would enable computational scientists to study vortex dynamics problems with considerably larger numbers of particles.

Traditionally, multipole [��] and Barnes-Hut [��] methods have been implemented using unstructured tree codes [��]. An alternative implementation strategy would employ a hierarchy of irregular but structured refinements [��] using our software infrastructure. To date, no one has directly compared these two implementation strategies using real codes. Such a comparison would provide valuable insight into the relative strengths and weaknesses of each representation.
Chapter

Conclusions

My life has been a fascinating series of amazing exploits about which I have many profound insights. But frankly, none of it is any of your darn business...

— Calvin, "Calvin and Hobbes"
�� Research Contributions

We have developed a set of programming abstractions and the accompanying software support for dynamic, irregular, block-structured scientific computations running on high-performance parallel computers. Such applications are difficult to implement without the appropriate software support. Our parallel software infrastructure simplifies code development because it provides computational scientists with high-level, domain-specific (yet flexible) tools that hide low-level implementation details. Our software is portable across a wide range of MIMD parallel platforms and is currently running on the Cray C-90 (single processor), IBM SP2, Intel Paragon, and networks of workstations via PVM [��].

We have designed and implemented application programmer interfaces (APIs) for two important classes of scientific applications: structured adaptive mesh applications and particle calculations. These APIs provide computational tools that match the scientist's view of the application. We have applied our structured adaptive mesh API to the solution of model eigenvalue problems in materials design. The particle API has been used in the development of a 3-d smoothed particle hydrodynamics application in astrophysics. Our parallel software infrastructure has enabled computational scientists to explore new approaches to solving real problems (see Section ���).

Our APIs are layered on top of the LPARX parallel programming system. LPARX introduces the concept of "structural abstraction", which enables applications to dynamically manipulate irregular data decompositions as language-level objects. Instead of requiring the programmer to choose from a small set of predefined decompositions, LPARX provides a framework for creating decompositions that may be tailored to meet the needs of a particular application.

Extending High Performance Fortran [��] will require developments in parallel programming abstractions and run-time support libraries. A second High Performance Fortran [��] standardization effort is currently addressing the limitations of HPF for dynamic and irregular applications. We believe that the abstractions and run-time support provided by our software infrastructure may provide some of the answers.
�� Outstanding Research Issues

Work is the greatest thing in the world, so we should always leave some of it for tomorrow.

— Don Herold

We have already described specific future research directions at the end of each of Chapters � through �. Here we discuss two challenging and broad research areas for the computational science community: implementation strategies for application programmer interfaces (Section ���) and language interoperability (Section ���).
Implementation Strategies for APIs

Because of the growing complexity of scientific applications, we believe that it will be increasingly important to provide computational scientists with application programmer interfaces (APIs) that provide high-level, domain-specific tools. We have taken this approach with our particle and structured adaptive mesh libraries. It is our belief that scientists should only be required to concentrate on the mathematics and physics of their application; that is their area of expertise. It is the responsibility of the computer scientist to respond to the needs of the scientific community and provide the appropriate software tools. This is not to say that computer scientists are to become "technicians" or "programmers" at the beck and call of the scientists; indeed, the development of such APIs involves a number of interesting and challenging research issues, as we have shown in our work.
There are two general strategies for implementing a suite of domain-specific toolkits: (1) languages or (2) libraries. In the first approach, a new language, with the appropriate syntax, control structures, and data types, is developed for each new application domain. In the second, an application library is created on top of an existing language (as we have done in C++).
The primary advantage of the language-based strategy is that each new language can be tailored to the specific problem domain. Languages would be supported by compilers that could apply domain-specific transformations to improve the quality of compiled code. For example, current optimizing compilers commonly reorder numerical operations to improve performance by eliminating common numerical sub-expressions or by scheduling instructions to avoid pipeline bubbles. Similar optimizations could be applied to a "matrix language" to block matrix operations, use efficient BLAS operations, or chain operations on vector architectures […]. Unfortunately, scientists would need to learn a new language syntax, and computer scientists would need to develop a new compilation system, for each new problem domain. In addition, scientists could not easily combine codes from different application domains (e.g. a structured adaptive mesh solver with a particle code) since the syntax, data types, and compilers would be different.
The other approach is to build an API as a library on top of an existing programming language. Although scientists would still be required to learn the specifics of a particular API library, they would not be burdened with mastering an entirely new language. C++ provides powerful and efficient facilities for data abstraction and has been adopted by many as the language of choice for constructing API libraries. However, the C++ compiler cannot apply domain-specific knowledge to optimize code. For example, it is very difficult to implement an efficient matrix library in C++ […] because the compiler does not understand the special properties of a "matrix" object. The Sage++ […] compilation system for C++ helps the compiler to generate efficient code by defining a suite of high-level compiler transformations that enable API writers to incorporate domain-specific knowledge.
One particularly troublesome limitation of C++ as a parallel programming language is that it provides no mechanisms for control abstraction (i.e. user-defined control structures). Thus, C++ makes it difficult to express parallel execution constructs such as parallel loops. Two partial solutions are to (1) introduce "parallel control constructs" using C++ macros and (2) hide parallelism within a data object. LPARX takes the first approach to implement its forall loop; P++ […] takes the second: each P++ parallel array is invisibly divided across processors, and the P++ programmer is unaware of the parallel execution of array operations within the library. The first strategy is not ideal, since it essentially creates a macro "sub-language" within C++. The second approach does not apply to all applications, such as those addressed by LPARX, since it is not always possible to completely encapsulate the parallelism within a single object. Clearly, implementation techniques for APIs are a fertile area for future research.
Language Interoperability

Currently, parallel software written in one programming language or run-time library is likely to be incompatible with software written in another system. Language (and library) interoperability is driven by two key factors: (1) code reuse and (2) heterogeneous programming models […]. Code reuse is difficult today since common subroutines cannot in general be shared by different parallel systems. Heterogeneous programming models enable the programmer to use the programming language or run-time library best suited for the task at hand. Some applications are more naturally expressed in one paradigm than another; for example, task parallelism applies to pipeline and producer-consumer applications but is usually inappropriate for array-based computations, which are often better handled by data parallelism.
Language interoperability raises research issues in three key areas: (1) run-time systems, (2) data representation, and (3) language extensions for external procedures. Common implementation support is needed to merge parallel languages and libraries with different run-time behaviors. For example, a data parallel language such as HPF […] is typically implemented using only a single thread of control per processor, whereas a task parallel language such as CC++ […] or Fortran M […] might require several interacting execution threads per processor. Thus, combining these two models will require common run-time support for task management and communication.
Several consortia have been formed to investigate unified support mechanisms for task parallel and data parallel systems. For example, the PORTS (POrtable Run-Time System) consortium has developed a set of portable task-based facilities for creating and scheduling tasks and for managing fine-grain inter-thread communication, and the PCRC (Parallel Compiler Runtime Consortium) is investigating common high-level run-time support techniques.
The second interoperability issue addresses common data representation formats. Parallel languages and libraries define a rich set of data distributions across processors: uniform block, irregular block, cyclic, pointwise, and so on. To share data, the run-time support must ensure that each system understands the data representations used by the others (e.g. see Section …). Thus, a unified data descriptor format is needed. Such a data definition interface is currently under investigation by the PCRC.

Information on PORTS is available at http://www.cs.uoregon.edu/paracomp/ports/. Information on the PCRC is available at http://aldebaran.npac.syr.edu/index.html.
Finally, language designers must consider the types of language extensions that will be required to call externally defined routines. For example, how should the programmer in a task parallel language specify a call to a data parallel routine? How should data be transferred across the call interface? HPF […] defines a rudimentary external procedure interface, and others have investigated calling HPF from pC++ […] and also from Fortran M […]. However, generally applicable mechanisms are currently unknown.
The Scientific Computing Community

Build it, and they will come...
-- "Field of Dreams"

The goal of our research has been the development of software tools to enable computational scientists to explore new approaches to solving applied problems on high-performance parallel computers. It is therefore fitting that we conclude this dissertation with a list of the projects that have benefitted from our software infrastructure:
• W. Hart has implemented a geometrically structured genetic algorithms code to study locally adaptive search techniques on parallel computers […].

• In collaboration with J. Wallin (George Mason University), we have parallelized a 3d smoothed particle hydrodynamics code for modeling the evolution of galaxy clusters (see Chapter …).

• G. Cook (Cornell Theory Center) has used LPARX as the base for an application programmer interface for adaptive multigrid methods in numerical relativity as part of the Black Hole Binary Grand Challenge Project.
• Scientists at Lawrence Livermore National Laboratories have employed our Distributed Parallel Object, Asynchronous Message Stream, and MP++ software to parallelize a structured adaptive mesh library for hyperbolic problems in gas dynamics […].

• C. Myers (Cornell Theory Center), B. Shaw (Lamont-Doherty Earth Observatory), and J. Langer (University of California at Santa Barbara) have implemented a parallel code to study localized slip modes in the dynamics of earthquake faults.

• C. Myers and J. Sethna (Cornell University) have developed a parallel time-dependent Ginzburg-Landau model of shape transformations to study shape-memory effects in martensitic alloys. Their code extends the LPARX Grid to support deformable cartesian meshes.

• C. Myers has also written a Cornell Theory Center "Smart Node" newsletter describing LPARX, and will discuss some of his experiences with it at the … meeting of the APS Physics Computing Conference. Myers' article is available at http://www.tc.cornell.edu/SmartNodes/Newsletters/…/VN.Myers; the abstract of his talk, "Some ABCs of OOP for PDEs on MPPs," is available at http://aps.org/BAPSPC…/abs/SJ….html.

• In collaboration with materials scientists and mathematicians, we have developed adaptive numerical techniques and the parallel software support for the solution of eigenvalue problems arising in materials design (see Chapter …) […].

• LPARX has been used to implement a dimension-independent code for connected component labeling for spin models in statistical mechanics […].

• Building on LPARX, S. Fink and S. Baden have developed run-time HPF-like data distribution techniques for block structured applications […].
• S. Figueira and S. Baden have employed our software infrastructure to analyze the performance tradeoffs of various parallelization strategies for localized N-body solvers […].

• G. Duncan (Bowling Green State University) is planning to use our structured adaptive mesh infrastructure to parallelize an adaptive hyperbolic solver for simulations of relativistic extragalactic jets […].

• In collaboration with F. Abraham (IBM Almaden), we are using our particle library to develop a molecular dynamics application to study fracture dynamics in solids […].

In addition, our software has been used to teach undergraduate and graduate courses in computational science at the University of California at San Diego.
I'm not going to school anymore. I've decided to be a "hunter-gatherer" when I grow up. I'll be living naked in a tropical forest, subsisting on berries, grubs, and the occasional frog, and spending my free time grooming for lice.
-- Calvin, "Calvin and Hobbes"
Appendix A

Machine Characteristics

The Fast drives out the Slow even if the Fast is wrong.
-- W. Kahan

In this Appendix, we describe the four supercomputers used to gather performance data in this dissertation: the Cray C90, IBM SP2, Intel Paragon, and a network of eight DEC Alpha workstations located at the San Diego Supercomputer Center. The Alphas are connected via a GIGAswitch and communicate through PVM […]. Even though the C90 contains more than one processor, it is rarely used as a true parallel machine in production mode; instead, the processors run several independent jobs at the same time. Thus, we have only reported performance results for a single processor.
For the three message passing architectures (Alpha cluster, IBM SP2, and Intel Paragon), we characterize interprocessor communication overheads using the simple linear cost model commonly used in the literature. Message passing performance is reported using two numbers, T0 and BW. T0, often incorrectly called the message latency, represents the time to send a zero-byte message; in fact, it incorporates both message latency and unavoidable software overheads […]. BW is the average peak communications bandwidth for large message sizes (several hundred kilobytes). Thus, the time to send a message of length L can be approximated by T0 + L/BW. We measured message passing times with a simple program that sends messages of
Machine     C++ Compiler         Fortran Compiler       Operating System
Alphas      g++ (-O)             f77 (-O)               OSF/1
Cray C90    CC (-O)              cft77 (-O)             UNICOS
Paragon     g++ (-O -mnoieee)    if77 (-O -Knoieee)     OSF/1
IBM SP2     xlC (-O -Q)          xlf (-O)               AIX

Table A.1: Compilers and optimization flags for all computations in this dissertation. On the Alpha workstation cluster, we used PVM for interprocessor communication. All benchmarks used the same LPARX software release.
                    Alphas    Cray C90    IBM SP2    Intel Paragon
Typical Mflops        …          …           …            …
Memory (Mbytes)       …          …           …            …
T0 (usec)             …          …           …            …
BW (Mbytes/sec)       …          …           …            …

Table A.2: A summary of machine characteristics for the Alpha cluster, Cray C90, IBM SP2, and Intel Paragon. All numbers reflect one processor of the machine. The memory limit on the Cray represents the memory available to tasks in the largest memory queue at the San Diego Supercomputer Center. Note that these figures are intended to provide only a very rough estimate of expected application performance. All Mflops (million floating point operations per second) measurements reflect 64-bit floating point rates.
varying sizes around in a ring.
Table A.1 summarizes software version numbers and compiler flags, and Table A.2 summarizes machine characteristics. Interprocessor communication times are presented in Figure A.1 (DEC Alphas), Figure A.2 (IBM SP2), and Figure A.3 (Intel Paragon).
[Figure A.1: log-log plots of (a) message bandwidth (Mbytes/sec) and (b) message passing time (milliseconds) versus message length (64 bytes to 256 Kbytes) for the Alpha cluster.]

Figure A.1: Alpha workstation cluster message passing performance for (a) message bandwidth and (b) message sending times as a function of the message size. Note that the vertical scale for (b) is in milliseconds, not microseconds as in the other graphs.
[Figure A.2: log-log plots of (a) message bandwidth (Mbytes/sec) and (b) message passing time (microseconds) versus message length (64 bytes to 256 Kbytes) for the IBM SP2.]

Figure A.2: IBM SP2 message passing performance for (a) message bandwidth and (b) message sending times as a function of the message size.
[Figure A.3: log-log plots of (a) message bandwidth (Mbytes/sec) and (b) message passing time (microseconds) versus message length (64 bytes to 256 Kbytes) for the Intel Paragon.]

Figure A.3: Intel Paragon message passing performance for (a) message bandwidth and (b) message sending times as a function of the message size.
Bibliography

One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It's always best to come second when you can look at the mistakes the pioneers made.
-- Seymour Cray

[1] F. F. Abraham, D. Brodbeck, R. A. Rafey, and W. E. Rudge, Instability dynamics of fracture: A computer simulation investigation, Physical Review Letters.

[2] G. Agha, Actors: A Model of Concurrent Computation in Distributed Systems, MIT Press, 1986.

[3] G. Agrawal, A. Sussman, and J. Saltz, An integrated runtime and compile-time approach for parallelizing structured and block structured applications, IEEE Transactions on Parallel and Distributed Systems (to appear).

[4] A. Almgren, T. Buttke, and P. Colella, A fast vortex method in three dimensions, in Proceedings of the AIAA Computational Fluid Dynamics Conference, Honolulu, Hawaii.

[5] A. S. Almgren, A Fast Adaptive Vortex Method Using Local Corrections, PhD thesis, University of California at Berkeley.

[6] B. Alpern, L. Carter, E. Feig, and T. Selker, The uniform memory hierarchy model of computation, Algorithmica.

[7] A. L. Ananda, B. H. Tay, and E. K. Koh, Astra: An asynchronous remote procedure call facility, in Proceedings of the International Conference on Distributed Computing Systems.

[8] C. R. Anderson, A method of local corrections for computing the velocity field due to a distribution of vortex blobs, Journal of Computational Physics.
[9] C. R. Anderson, An implementation of the fast multipole method without multipoles, SIAM Journal on Scientific and Statistical Computing.

[10] I. Ashok and J. Zahorjan, Adhara: Runtime support for dynamic space-based applications on distributed memory MIMD multiprocessors, in Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.

[11] W. Athas and N. Boden, Cantor: An actor programming system for scientific computing, in Proceedings of the ACM SIGPLAN Workshop on Object-Based Concurrent Programming.

[12] S. B. Baden, Programming abstractions for dynamically partitioning and coordinating localized scientific calculations running on multiprocessors, SIAM Journal on Scientific and Statistical Computing.

[13] S. B. Baden, S. J. Fink, and S. R. Kohn, Structural abstraction: A unifying parallel programming model for data motion and partitioning in irregular scientific computations (in preparation).

[14] S. B. Baden and S. R. Kohn, A comparison of load balancing strategies for particle methods running on MIMD multiprocessors, in Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, March 1991.

[15] ---, Portable parallel programming of numerical problems under the LPAR system, Journal of Parallel and Distributed Computing.

[16] H. E. Bal and A. S. Tanenbaum, Distributed programming with shared data, in Proceedings of the International Conference on Computer Languages, October 1988.

[17] J. Barnes and P. Hut, A hierarchical O(N log N) force-calculation algorithm, Nature, 324 (1986), p. 446.

[18] D. R. Bates, K. Ledsham, and A. L. Stewart, Wave functions of the hydrogen molecular ion, Phil. Trans. Roy. Soc. London.

[19] J. Bell, M. Berger, J. Saltzman, and M. Welcome, Three-dimensional adaptive mesh refinement for hyperbolic conservation laws, SIAM Journal on Scientific and Statistical Computing.

[20] M. J. Berger, Adaptive Mesh Refinement for Hyperbolic Partial Differential Equations, PhD thesis, Stanford University, 1982.

[21] M. J. Berger and S. H. Bokhari, A partitioning strategy for nonuniform problems on multiprocessors, IEEE Transactions on Computers, C-36 (1987).
[22] M. J. Berger and P. Colella, Local adaptive mesh refinement for shock hydrodynamics, Journal of Computational Physics, 82 (1989).

[23] M. J. Berger and J. Oliger, Adaptive mesh refinement for hyperbolic partial differential equations, Journal of Computational Physics, 53 (1984).

[24] M. J. Berger and I. Rigoutsos, An algorithm for point clustering and grid generation, IEEE Transactions on Systems, Man and Cybernetics.

[25] M. J. Berger and J. Saltzman, AMR on the CM-2, Tech. Rep., RIACS, Moffett Field, CA.

[26] ---, Structured adaptive mesh refinement on the Connection Machine, in Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, March 1993.

[27] J. Bernholc, J.-Y. Yi, and D. J. Sullivan, Structural transitions in metal clusters, Faraday Discussions.

[28] G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha, Implementation of a portable nested data parallel language, in Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993.

[29] F. Bodin, P. Beckman, D. Gannon, J. Gotwals, S. Narayana, S. Srinivas, and B. Winnicka, Sage++: An object-oriented toolkit and class library for building Fortran and C++ restructuring tools, in Object Oriented Numerics Conference (OONSKI).

[30] F. Bodin, P. Beckman, D. Gannon, S. Narayana, and S. X. Yang, Distributed pC++: Basic ideas for an object parallel language, Journal of Scientific Programming.

[31] F. Bodin, P. Beckman, D. Gannon, S. Yang, S. Kesavan, A. Malony, and B. Mohr, Implementing a parallel C++ runtime system for scalable parallel systems, in Proceedings of Supercomputing '93, November 1993.

[32] J. Bolstad, PhD thesis, Stanford University.

[33] A. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka, Fortran 90D/HPF compiler for distributed memory MIMD computers: Design, implementation, and performance results, in Proceedings of Supercomputing '93, November 1993.

[34] A. Brandt, Multi-level adaptive solutions to boundary-value problems, Mathematics of Computation, 31 (1977).
[35] W. L. Briggs, A Multigrid Tutorial, SIAM.

[36] K. G. Budge, J. S. Perry, and A. C. Robinson, High performance scientific computing using C++, in USENIX C++ Conference Proceedings.

[37] E. J. Bylaska, S. R. Kohn, S. B. Baden, A. Edelman, R. Kawai, M. E. G. Ong, and J. H. Weare, Scalable parallel numerical methods and software tools for material design, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

[38] Z. Cai, J. Mandel, and S. McCormick, Multigrid methods for nearly singular linear equations and eigenvalue problems (submitted for publication).

[39] N. Carriero and D. Gelernter, Linda in context, Communications of the ACM, 32 (1989).

[40] S. Chakrabarti, E. Deprit, E.-J. Im, J. Jones, A. Krishnamurthy, C.-P. Wen, and K. Yelick, Multipol: A distributed data structure library, in Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, July 1995.

[41] K. M. Chandy and C. Kesselman, Compositional C++: Compositional parallel programming, in Fifth International Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August 1992.

[42] C. Chang, A. Sussman, and J. Saltz, Support for distributed dynamic data structures in C++, Tech. Rep. CS-TR-…, University of Maryland.

[43] B. Chapman, P. Mehrotra, H. Moritsch, and H. Zima, Dynamic data distribution in Vienna Fortran, in Proceedings of Supercomputing '93, November 1993.

[44] B. Chapman, P. Mehrotra, and H. Zima, Extending HPF for advanced data parallel applications, Tech. Rep., ICASE, May 1994.

[45] C. Chase, K. Crowley, J. Saltz, and A. Reeves, Parallelization of irregularly coupled regular meshes, Tech. Rep., ICASE, NASA Langley Research Center.

[46] J. S. Chase, F. G. Amador, E. D. Lazowska, H. M. Levy, and R. J. Littlefield, The Amber system: Parallel programming on a network of multiprocessors, in Proceedings of the 12th ACM Symposium on Operating Systems Principles, December 1989.

[47] A. A. Chien, Concurrent Aggregates: Supporting Modularity in Massively Parallel Programs, MIT Press, 1993.
[48] K. Cho, T. A. Arias, J. D. Joannopoulos, and P. K. Lam, Wavelets in electronic structure calculations, Physical Review Letters, 71 (1993).

[49] T. W. Clark, R. v. Hanxleden, J. A. McCammon, and L. R. Scott, Parallelization strategies for a molecular dynamics program, in Intel Technology Focus Conference Proceedings.

[50] ---, Parallelizing molecular dynamics using spatial decomposition, in Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.

[51] P. Colella and P. Woodward, The piecewise parabolic method (PPM) for gas-dynamical simulations, Journal of Computational Physics, 54 (1984).

[52] L. Collatz, The Numerical Treatment of Differential Equations, Springer-Verlag.

[53] C. R. Cook, C. M. Pancake, and R. Walpole, Are expectations for parallelism too high? A survey of potential parallel users, in Proceedings of Supercomputing '94, November 1994.

[54] W. Y. Crutchfield, Load balancing irregular algorithms, Tech. Rep. UCRL-JC-…, Lawrence Livermore National Laboratory, July 1991.

[55] W. Y. Crutchfield and M. L. Welcome, Object oriented implementation of adaptive mesh refinement algorithms, Journal of Scientific Programming.

[56] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, LogP: Towards a realistic model of parallel computation, in Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1993.

[57] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick, Parallel programming in Split-C, in Proceedings of Supercomputing '93, November 1993.

[58] R. Das, D. J. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy, The design and implementation of a parallel unstructured Euler solver using software primitives, Tech. Rep., ICASE, Hampton, VA.

[59] R. Das and J. Saltz, Parallelizing molecular dynamics codes using PARTI software primitives, in Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, March 1993.
[60] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang, Communication optimizations for irregular scientific computations on distributed memory architectures, Journal of Parallel and Distributed Computing (to appear).

[61] S. Deshpande, P. Delisle, and A. G. Daghi, A communication facility for distributed object-oriented applications, in USENIX C++ Conference Proceedings.

[62] K. D. Devine and J. E. Flaherty, Dynamic load balancing for parallel finite element methods with h- and p-refinement, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

[63] G. C. Duncan and P. A. Hughes, Simulations of relativistic extragalactic jets, The Astrophysical Journal.

[64] D. J. Edelsohn, Hierarchical tree-structures as adaptive meshes, International Journal of Modern Physics C (Physics and Computers).

[65] B. Falsafi, A. R. Lebeck, S. K. Reinhardt, I. Schoinas, M. D. Hill, J. R. Larus, A. Rogers, and D. A. Wood, Application-specific protocols for user-level shared memory, in Proceedings of Supercomputing '94, November 1994.

[66] M. J. Feeley and H. M. Levy, Distributed shared memory with versioned objects, in Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), October 1992.

[67] J. T. Feo and D. C. Cann, A report on the SISAL language project, Journal of Parallel and Distributed Computing.

[68] S. M. Figueira and S. B. Baden, Performance analysis of parallel strategies for localized n-body solvers, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

[69] S. J. Fink and S. B. Baden, Run-time data distribution for block-structured applications on distributed memory computers, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

[70] S. J. Fink, S. B. Baden, and S. R. Kohn, Flexible communication schedules for block structured applications (in preparation).

[71] S. J. Fink, C. Huston, S. B. Baden, and K. Jansen, Parallel cluster identification for multidimensional lattices (submitted to IEEE Transactions on Parallel and Distributed Systems).
[72] M. J. Flynn, Some computer organizations and their effectiveness, IEEE Transactions on Computers, C-21 (1972).

[73] K. Forsman, W. Gropp, L. Kettunen, and D. Levine, Computational electromagnetics and parallel dense matrix computations, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

[74] I. Foster and K. M. Chandy, Fortran M: A language for modular parallel programming, Journal of Parallel and Distributed Computing (to appear).

[75] I. Foster and C. Kesselman, Integrating task and data parallelism, in Proceedings of Supercomputing.

[76] I. Foster, M. Xu, B. Avalani, and A. Choudhary, A compilation system that integrates High Performance Fortran and Fortran M, in Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.

[77] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu, Fortran D language specification, Tech. Rep., Department of Computer Science, Rice University, Houston, TX, December 1990.

[78] G. H. Golub and C. F. Van Loan, eds., Matrix Computations (Second Edition), The Johns Hopkins University Press, Baltimore, 1989.

[79] L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, Journal of Computational Physics, 73 (1987).

[80] W. Gropp and B. Smith, Scalable, extensible, and portable numerical libraries, in Proceedings of the Scalable Parallel Libraries Conference, 1993.

[81] W. E. Hart, Adaptive Global Optimization with Local Search, PhD thesis, University of California at San Diego, 1994.

[82] C. Hewitt, P. Bishop, and R. Steiger, A universal ACTOR formalism for artificial intelligence, in Proceedings of the International Joint Conference on Artificial Intelligence, 1973.

[83] High Performance Fortran Forum, High Performance Fortran Language Specification.

[84] ---, HPF-2 Scope of Activities and Motivating Applications.

[85] P. N. Hilfinger and P. Colella, FIDIL: A language for scientific programming, Tech. Rep. UCRL-…, Lawrence Livermore National Laboratory.
[86] S. Hiranandani, K. Kennedy, and C.-W. Tseng, Preliminary experiences with the Fortran D compiler, in Proceedings of Supercomputing '93, November 1993.

[87] R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles, McGraw-Hill, 1981.

[88] Y.-S. Hwang, R. Das, J. Saltz, B. Brooks, and M. Hodoscek, Parallelizing molecular dynamics programs for distributed memory machines: An application of the CHAOS runtime support library, Tech. Rep. CS-TR-…, University of Maryland, College Park, MD.

[89] E. Jul, H. Levy, N. Hutchinson, and A. Black, Fine-grained mobility in the Emerald system, ACM Transactions on Computer Systems, 6 (1988).

[90] L. Kale and S. Krishnan, CHARM++: A portable concurrent object oriented system based on C++, in Proceedings of OOPSLA, September 1993.

[91] V. Karamcheti and A. Chien, Concert: Efficient runtime support for concurrent object-oriented programming languages on stock hardware, in Proceedings of Supercomputing '93, November 1993.

[92] S. R. Kohn and S. B. Baden, An implementation of the LPAR parallel programming model for scientific computations, in Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, Norfolk, VA, March 1993.

[93] ---, A robust parallel programming model for dynamic non-uniform scientific computations, in Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.

[94] ---, The parallelization of an adaptive multigrid eigenvalue solver with LPARX, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.

[95] ---, Irregular coarse-grain data parallelism under LPARX, Journal of Scientific Programming (to appear).

[96] W. Kohn and L. Sham, Physical Review, 140 (1965), p. A1133.

[97] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy, The Stanford FLASH multiprocessor, in Proceedings of the 21st International Symposium on Computer Architecture, April 1994.
���� J� R� Larus� C�� A large�grain object oriented data parallel programming lan�guage� in Fifth International Workshop of Languages and Compilers for ParallelComputing� New Haven� CT� August ����
���� M� Lemke and D� Quinlan� P�� A C�� virtual shared grids basedprogramming environment for architecture�independent development of struc�tured grid applications� in Lecture Notes in Computer Science� Springer�Verlag�September ����
E. C. Lewis, C. Lin, L. Snyder, and G. Turkiyyah, A portable parallel n-body solver, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.
K. Li and P. Hudak, Memory coherence in shared virtual memory systems, ACM Transactions on Computer Systems, 7 (1989), pp. 321–359.
C. Lin and L. Snyder, ZPL: An array sublanguage, in Proceedings of the Sixth International Workshop on Languages and Compilers for Parallel Computing, Springer-Verlag, 1993, pp. 96–114.
J. Mandel and S. McCormick, Multilevel variational method for Au = λBu on composite grids, Journal of Computational Physics, 80 (1989), pp. 442–452.
S. F. McCormick, ed., Multilevel Adaptive Methods for Partial Differential Equations, SIAM, Philadelphia, 1989.
Message Passing Interface Forum, MPI: A Message-Passing Interface Standard (v1.0), May 1994.
R. E. Minnear, P. A. Muckelbauer, and V. F. Russo, Integrating the Sun Microsystems XDR/RPC protocols into the C++ stream model, in USENIX C++ Conference Proceedings, 1994.
W. F. Mitchell, Refinement tree based partitioning for adaptive grids, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.
J. J. Monaghan, Smoothed particle hydrodynamics, Annual Review of Astronomy and Astrophysics, 30 (1992), pp. 543–574.
S. S. Mukherjee, S. D. Sharma, M. D. Hill, J. R. Larus, A. Rogers, and J. Saltz, Efficient support for irregular applications on distributed memory machines, to appear in Proceedings of the 1995 Symposium on Principles and Practice of Parallel Programming, 1995.
B. J. Nelson, Remote Procedure Call, PhD thesis, Carnegie-Mellon University, Pittsburgh, PA, 1981.
I. Newton, Philosophiae Naturalis Principia Mathematica, 1687.
C. M. Pancake and D. Bergmark, Do parallel languages respond to the needs of scientific programmers?, IEEE Computer, 23 (1990), pp. 13–23.
C. M. Pancake and C. Cook, What users need in parallel tool support: Survey results and analysis, in Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.
Parallel Compiler Runtime Consortium, Common Runtime Support for High-Performance Parallel Languages, July 1993.
M. Parashar and J. C. Browne, An infrastructure for parallel adaptive mesh refinement techniques, (draft), 1995.
M. Parashar, S. Hariri, T. Haupt, and G. C. Fox, Interpreting the performance of HPF/Fortran 90D, in Proceedings of Supercomputing '94, November 1994.
R. Parsons and D. Quinlan, Run-time recognition of task parallelism within the P++ parallel array class library, in Scalable Libraries Conference, 1993.
J. R. Pilkington and S. B. Baden, Dynamic partitioning of non-uniform structured workloads with space-filling curves, (submitted to IEEE Transactions on Parallel and Distributed Systems), 1994.
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, 1992.
D. Quinlan, Parallel Adaptive Mesh Refinement, PhD thesis, University of Colorado at Denver, 1993.
S. K. Reinhardt, M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, and D. A. Wood, The Wisconsin Wind Tunnel: Virtual prototyping of parallel computers, in Proceedings of the 1993 ACM SIGMETRICS Conference, May 1993.
S. K. Reinhardt, J. R. Larus, and D. A. Wood, Tempest and Typhoon: User-level shared memory, in Proceedings of the ACM/IEEE International Symposium on Computer Architecture, April 1994.
M.-C. Rivara, Design and data structure of fully adaptive, multigrid, finite-element software, ACM Transactions on Mathematical Software, 10 (1984), pp. 242–264.
H. Samet, The Design and Analysis of Spatial Data Structures, Addison-Wesley, 1990.
W. W. Shu and L. V. Kale, Chare kernel: A runtime support system for parallel computations, Journal of Parallel and Distributed Computing, 11 (1991), pp. 198–211.
J. P. Singh, Parallel Hierarchical N-Body Methods and their Implications for Multiprocessors, PhD thesis, Stanford University, 1993.
J. P. Singh and J. L. Hennessy, Finding and exploiting parallelism in an ocean simulation program: Experiences, results, and implications, Journal of Parallel and Distributed Computing, 15 (1992), pp. 27–48.
J. P. Singh, C. Holt, J. L. Hennessy, and A. Gupta, A parallel adaptive fast multipole method, in Proceedings of Supercomputing '93, November 1993.
L. Snyder, Type architectures, shared memory, and the corollary of modest potential, Annual Review of Computer Science, 1 (1986), pp. 289–317.
L. Stals, Adaptive multigrid in parallel, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.
B. Stroustrup, The C++ Programming Language (Second Edition), Addison-Wesley, 1991.
V. S. Sunderam, PVM: A framework for parallel distributed computing, Concurrency: Practice and Experience, 2 (1990), pp. 315–339.
P. Tamayo, J. P. Mesirov, and B. M. Boghosian, Parallel approaches to short range molecular dynamics simulations, in Proceedings of Supercomputing '91, Albuquerque, NM, November 1991.
E. Tsuchida and M. Tsukada, Real space approach to electronic-structure calculations, Department of Physics, University of Tokyo (unpublished manuscript), 1995.
C. J. Turner and J. G. Turner, Adaptive data parallel methods for ecosystem monitoring, in Proceedings of Supercomputing '93, November 1993.
R. v. Hanxleden, K. Kennedy, and J. Saltz, Value-based distributions in Fortran D: A preliminary report, Tech. Rep. CRPC-TR93365-S, Center for Research on Parallel Computation, Rice University, Houston, TX, December 1993.
L. G. Valiant, A bridging model for parallel computation, Communications of the Association for Computing Machinery, 33 (1990), pp. 103–111.
T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, Active Messages: A mechanism for integrated communication and computation, in Proceedings of the 19th International Symposium on Computer Architecture, May 1992.
R. von Hanxleden, K. Kennedy, C. Koelbel, R. Das, and J. Saltz, Compiler analysis for irregular problems in Fortran D, in Fifth International Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August 1992.
M. S. Warren and J. K. Salmon, A parallel hashed oct-tree n-body algorithm, in Proceedings of Supercomputing '93, November 1993.
M. Welcome, B. Crutchfield, C. Rendleman, J. Bell, L. Howell, V. Beckner, and D. Simkins, BoxLib user's guide and manual, (draft), 1995.
S. R. White, J. W. Wilkins, and M. P. Teter, Finite-element method for electronic structure, Physical Review B, 39 (1989), pp. 5819–5833.
M. Wu and G. Fox, Fortran 90D compiler for distributed memory MIMD parallel computers, Tech. Rep. SCCS-327b, Syracuse University, 1992.
S. X. Yang, D. Gannon, S. Srinivas, F. Bodin, and P. Bode, High Performance Fortran interface to the parallel C++, in Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.
A. Yonezawa, ABCL: An Object-Oriented Concurrent System, MIT Press, 1990.