Algorithms for Cluster Computing Based on Domain Decomposition
David E. Keyes
Center for Computational Science, Old Dominion University
Institute for Computer Applications in Science & Engineering, NASA Langley Research Center
Institute for Scientific Computing Research, Lawrence Livermore National Laboratory
NASA Langley Cluster SIG
Plan of Presentation
Introduction
Imperative of domain decomposition and multilevel methods for terascale computing
Basic and “advanced” domain decomposition and multilevel algorithmic concepts
Vignettes of domain decomposition and multilevel performance (with a little help from my friends…)
Agenda for future research
Terascale Optimal PDE Simulations (TOPS) software project
Conclusions
Terascale simulation has been “sold”
(Figure: application areas arranged around “Scientific Simulation”, in each case because experiments are expensive, controversial, prohibited, or dangerous)
Environment: global climate, groundwater flow
Lasers & Energy: combustion, ICF modeling
Engineering: structural dynamics, electromagnetics
Chemistry & Bio: materials modeling, drug design
Applied Physics: radiation transport, hydrodynamics
However, it is far from proven! To meet expectations, we need to handle problems of multiple physical scales.
Large platforms have been provided
The ASCI program of the U.S. DOE has a roadmap to reach 100 Tflop/s by 2006 (www.llnl.gov/asci/platforms)
(Figure: capability vs. calendar year, ‘97-‘06, showing plan/develop/use phases for machines at Sandia, Los Alamos, and Livermore)
Red: 1+ Tflop / 0.5 TB; Blue: 3+ Tflop / 1.5 TB; White: 10+ Tflop / 4 TB; then 30+ Tflop / 10 TB, 50+ Tflop / 25 TB, and 100+ Tflop / 30 TB
NSF’s 13.6 TF TeraGrid coming on line
(Figure: TeraGrid network diagram, with site resources and HPSS/UniTree archival storage at each site linked through external networks; SDSC: 4.1 TF, 225 TB; NCSA/PACI: 8 TF, 240 TB; plus Caltech and Argonne)
TeraGrid: NCSA, SDSC, Caltech, Argonne (www.teragrid.org)
Algorithmic requirements from architecture
Must run on physically distributed memory units connected by a message-passing network, each serving one or more processors with multiple levels of cache (pictured: Cray T3E)
“horizontal” aspects: message passing, shared-memory threads
“vertical” aspects: register blocking, cache blocking, prefetching
Building platforms is the “easy” part
Algorithms must be:
  highly concurrent and straightforward to load balance
  latency tolerant
  cache friendly (good temporal and spatial locality)
  highly scalable (in the sense of convergence)
Domain decomposed multilevel methods are “natural” for all of these
Domain decomposition is also “natural” for software engineering
Fortunate that mathematicians built up its theory in advance of the requirements!
Decomposition strategies for Lu = f in \Omega
Operator decomposition: L = \sum_k L_k
Function space decomposition: f = \sum_k f_k \Phi_k, \quad u = \sum_k u_k \Phi_k
Domain decomposition: \Omega = \bigcup_k \Omega_k
Consider the implicitly discretized parabolic case:
[I - \tau (L_x + L_y)]\, u^{k+1} = u^k + \tau f
Operator decomposition: consider ADI
[I - \tfrac{\tau}{2} L_x]\, u^{k+1/2} = [I + \tfrac{\tau}{2} L_y]\, u^k + \tfrac{\tau}{2} f
[I - \tfrac{\tau}{2} L_y]\, u^{k+1} = [I + \tfrac{\tau}{2} L_x]\, u^{k+1/2} + \tfrac{\tau}{2} f
The iteration matrix consists of four multiplicative substeps per timestep: two sparse matrix-vector multiplies and two sets of unidirectional bandsolves
There is parallelism within each substep, but global data exchanges are required between the bandsolve substeps
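A minimal serial sketch of one such ADI step, under assumed inputs (sparse 1D-direction operators Lx and Ly, timestep tau); scipy's general sparse direct solve stands in here for the unidirectional bandsolves:

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def adi_step(u, f, Lx, Ly, tau):
    """One Peaceman-Rachford ADI step for u_t = (Lx + Ly) u + f (assumed model problem)."""
    I = sp.identity(u.size, format="csc")
    # Substep 1: sparse matrix-vector multiply, then a solve in the x-direction
    rhs1 = (I + 0.5 * tau * Ly) @ u + 0.5 * tau * f
    u_half = spla.spsolve(I - 0.5 * tau * Lx, rhs1)
    # Substep 2: sparse matrix-vector multiply, then a solve in the y-direction
    rhs2 = (I + 0.5 * tau * Lx) @ u_half + 0.5 * tau * f
    return spla.spsolve(I - 0.5 * tau * Ly, rhs2)
```

In a distributed implementation, the switch of solve direction between the two substeps is where the global data exchange arises.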
Function space decomposition: consider a spectral Galerkin method
u(x, y, t) = \sum_{j=1}^{N} a_j(t) \Phi_j(x, y)
(\tfrac{du}{dt}, \Phi_i) = (L u, \Phi_i) + (f, \Phi_i), \quad i = 1, \dots, N
\sum_j (\Phi_j, \Phi_i) \tfrac{da_j}{dt} = \sum_j (L \Phi_j, \Phi_i)\, a_j + (f, \Phi_i), \quad i = 1, \dots, N
\tfrac{da}{dt} = M^{-1} K a + M^{-1} f, \quad M \equiv [(\Phi_i, \Phi_j)], \; K \equiv [(L \Phi_i, \Phi_j)]
This is a system of ordinary differential equations
Perhaps M and K are diagonal matrices
Perfect parallelism across the spectral index, but global data exchanges are needed to transform back to physical variables at each step
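A tiny illustrative sketch (an added example, assuming M, K, f and a basis matrix Phi are available) of advancing the Galerkin ODE system and then transforming back to physical variables:

```python
import numpy as np

def galerkin_step(a, M, K, f, dt):
    # Forward-Euler step of da/dt = M^{-1} K a + M^{-1} f; embarrassingly parallel
    # across the spectral index when M and K are diagonal
    return a + dt * np.linalg.solve(M, K @ a + f)

# u_physical = Phi @ a   # Phi: basis evaluation matrix (hypothetical); this global
#                        # transform is the all-to-all exchange noted above
```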
Domain decomposition: consider restriction and extension operators R_i, R_i^T for subdomains, and R_0, R_0^T for a possible coarse grid
Replace the discretized system Au = f with B^{-1}Au = B^{-1}f, where
B^{-1} = R_0^T A_0^{-1} R_0 + \sum_i R_i^T A_i^{-1} R_i, \quad A_i = R_i A R_i^T
Solve by a Krylov method
Matrix-vector multiplies with B^{-1}A require:
  parallelism on each subdomain
  nearest-neighbor exchanges and global reductions
  a possible small global system (not needed for the parabolic case)
Comparison
Operator decomposition (ADI): natural row-based assignment requires all-to-all, bulk data exchanges in each step (for the transpose)
Function space decomposition (Fourier): natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for the transform)
Domain decomposition (Schwarz): natural domain-based assignment requires local (nearest-neighbor) data exchanges, global reductions, and an optional small global problem
(Of course, domain decomposition can be interpreted as a special operator or function space decomposition)
Theoretical Scaling of Domain Decomposition for Three Common Network Topologies
With tree-based (logarithmic) global reductions and scalable nearest-neighbor hardware: the optimal number of processors scales linearly with problem size (scalable; assumes one subdomain per processor)
With 3D torus-based global reductions and scalable nearest-neighbor hardware: the optimal number of processors scales as the three-fourths power of problem size (almost scalable)
With a common bus network (heavy contention): the optimal number of processors scales as the one-fourth power of problem size (not scalable); bad news for conventional Beowulf clusters, but see the 2000 Bell Prize “price-performance awards” using multiple NICs per Beowulf node!
Basic Concepts
  Iterative correction
  Schwarz preconditioning
  Schur preconditioning
“Advanced” Concepts
  Polynomial combinations of Schwarz projections
  Schwarz-Schur combinations: Neumann-Neumann/FETI (Schwarz on Schur), LNKS (Schwarz inside Schur)
  Nonlinear Schwarz
Iterative correction: the most basic idea in iterative methods
u \leftarrow u + B^{-1}(f - Au)
Evaluate the residual accurately, but solve approximately, where B^{-1} is an approximate inverse to A
A sequence of complementary solves can be used; e.g., with B_1^{-1} first and then B_2^{-1}, one has
u \leftarrow u + [B_1^{-1} + B_2^{-1} - B_2^{-1} A B_1^{-1}](f - Au)
Optimal polynomials of B^{-1}A lead to various preconditioned Krylov methods
Scale recurrence, e.g., with B_2^{-1} = R^T (R A R^T)^{-1} R, leads to multilevel methods
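A dense numpy sketch of the two complementary solves applied in sequence (B1 first, then B2), which is algebraically the same as the bracketed combined formula; apply_B1inv and apply_B2inv are assumed user-supplied approximate solves:

```python
import numpy as np

def corrected_iteration(A, f, u, apply_B1inv, apply_B2inv, num_iters=10):
    for _ in range(num_iters):
        u = u + apply_B1inv(f - A @ u)   # accurate residual, approximate solve with B1
        u = u + apply_B2inv(f - A @ u)   # complementary approximate solve with B2
    return u
```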
Multilevel Preconditioning: a Multigrid V-cycle
(Figure: a smoother is applied on the finest grid; restriction transfers the residual from the fine grid to the first coarse grid, which has fewer cells (less work and storage); the idea is applied recursively until an easy problem remains to be solved; prolongation transfers the correction from the coarse grid back to the fine grid)
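A minimal recursive V-cycle sketch under assumed inputs: A is a list of level matrices (finest first), P[l] prolongates from level l+1 to level l (restriction taken as its transpose), and weighted Jacobi stands in for the smoother:

```python
import numpy as np

def v_cycle(A, P, b, x, level=0, nu=2, omega=2.0 / 3.0):
    if level == len(A) - 1:                       # coarsest grid: an easy problem, solve directly
        return np.linalg.solve(A[level], b)
    D = np.diag(A[level])
    for _ in range(nu):                           # pre-smoothing (weighted Jacobi)
        x = x + omega * (b - A[level] @ x) / D
    r_coarse = P[level].T @ (b - A[level] @ x)    # restriction: transfer residual to coarse grid
    e_coarse = v_cycle(A, P, r_coarse, np.zeros_like(r_coarse), level + 1, nu, omega)
    x = x + P[level] @ e_coarse                   # prolongation: transfer correction to fine grid
    for _ in range(nu):                           # post-smoothing
        x = x + omega * (b - A[level] @ x) / D
    return x
```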
Schwarz Preconditioning
Given Ax = b, partition x into subvectors corresponding to subdomains \Omega_i of the domain \Omega of the PDE, nonempty, possibly overlapping, whose union is all of the elements of x \in R^n
Let the Boolean rectangular matrix R_i extract the i-th subset of x: x_i = R_i x
Let A_i = R_i A R_i^T
The Boolean matrices are gather/scatter operators, mapping between a global vector and its subdomain support
The simplest Schwarz preconditioner is then B^{-1} = \sum_i R_i^T A_i^{-1} R_i
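A serial numpy sketch of the resulting one-level additive Schwarz action z = B^{-1} r = \sum_i R_i^T A_i^{-1} R_i r, with each R_i represented simply as an index set of (possibly overlapping) subdomain unknowns:

```python
import numpy as np

def additive_schwarz_apply(A, r, subdomain_index_sets):
    z = np.zeros_like(r)
    for idx in subdomain_index_sets:              # independent local solves (concurrent in practice)
        A_i = A[np.ix_(idx, idx)]                 # A_i = R_i A R_i^T  (gather)
        z[idx] += np.linalg.solve(A_i, r[idx])    # R_i^T A_i^{-1} R_i r  (local solve, scatter-add)
    return z
```

In a Krylov-Schwarz solver this routine would serve as the preconditioner application, with A itself only ever needed through matrix-vector products.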
Iteration Count Estimates from the Schwarz Theory
In terms of N and P, where for d-dimensional isotropic problems N = h^{-d} and P = H^{-d}, for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows:

Preconditioning Type       | in 2D          | in 3D
Point Jacobi               | O(N^{1/2})     | O(N^{1/3})
Domain Jacobi (δ = 0)      | O((NP)^{1/4})  | O((NP)^{1/6})
1-level Additive Schwarz   | O(P^{1/2})     | O(P^{1/3})
2-level Additive Schwarz   | O(1)           | O(1)

Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square root of the condition number of the Schwarz-preconditioned system
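For reference, the standard Schwarz-theory condition-number bounds behind these estimates may be summarized as follows (an added summary, not on the original slide; C is independent of h and H, and delta denotes the overlap width); the tabulated iteration counts follow by taking square roots:

```latex
\kappa(B^{-1}A) \;\lesssim\;
\begin{cases}
C\, h^{-2}, & \text{point Jacobi},\\
C\, (hH)^{-1}, & \text{domain Jacobi (no overlap, no coarse grid)},\\
C\, H^{-2}\,(1 + H/\delta), & \text{1-level additive Schwarz},\\
C\,(1 + H/\delta), & \text{2-level additive Schwarz}.
\end{cases}
```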
Schur Preconditioning
Given a partition
\begin{pmatrix} A_{ii} & A_{i\Gamma} \\ A_{\Gamma i} & A_{\Gamma\Gamma} \end{pmatrix} \begin{pmatrix} u_i \\ u_\Gamma \end{pmatrix} = \begin{pmatrix} f_i \\ f_\Gamma \end{pmatrix}
Condense: S u_\Gamma = g, where S \equiv A_{\Gamma\Gamma} - A_{\Gamma i} A_{ii}^{-1} A_{i\Gamma} and g \equiv f_\Gamma - A_{\Gamma i} A_{ii}^{-1} f_i
Let M be a good preconditioner for S
Then \begin{pmatrix} A_{ii} & A_{i\Gamma} \\ 0 & M \end{pmatrix} is a preconditioner for A
Moreover, solves with A_{ii} may be done approximately if all degrees of freedom are retained
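A dense numpy sketch (assuming the two-by-two block partition above, with interior block A_ii and interface block A_gg) of the condensation and back-substitution behind Schur-complement preconditioning:

```python
import numpy as np

def schur_condense_and_solve(A_ii, A_ig, A_gi, A_gg, f_i, f_g):
    S = A_gg - A_gi @ np.linalg.solve(A_ii, A_ig)   # Schur complement (precondition with M ~ S)
    g = f_g - A_gi @ np.linalg.solve(A_ii, f_i)     # condensed right-hand side
    u_g = np.linalg.solve(S, g)                     # interface solve
    u_i = np.linalg.solve(A_ii, f_i - A_ig @ u_g)   # back-substitute for the interior unknowns
    return u_i, u_g
```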
Schwarz polynomials
Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations
We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively):
u \leftarrow u + \sum_i B_i^{-1}(f - Au)
u \leftarrow u + B_0^{-1}(f - Au)
This leads to B^{-1} = B_0^{-1} + (I - B_0^{-1}A)(\sum_i B_i^{-1}), the “Hybrid II” method in Smith, Bjorstad & Gropp
Schwarz-on-Schur
Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem
Neumann-Neumann: M^{-1} = \sum_i D_i R_i^T S_i^{-1} R_i D_i
Balancing Neumann-Neumann: M^{-1} = M_0^{-1} + (I - M_0^{-1} S)\,(\sum_i D_i R_i^T S_i^{-1} R_i D_i)\,(I - S M_0^{-1})
Multigrid can also be applied on the Schur complement
Schwarz-inside-Schur
Equality constrained optimization leads to the KKT system for states x, designs u, and multipliers \lambda:
\begin{pmatrix} W_{xx} & W_{xu} & J_x^T \\ W_{ux} & W_{uu} & J_u^T \\ J_x & J_u & 0 \end{pmatrix} \begin{pmatrix} \delta x \\ \delta u \\ \lambda \end{pmatrix} = - \begin{pmatrix} g_x \\ g_u \\ f \end{pmatrix}
Then Newton Reduced SQP solves the Schur complement system H \delta u = g, where H is the reduced Hessian
H = W_{uu} + J_u^T J_x^{-T} W_{xx} J_x^{-1} J_u - W_{ux} J_x^{-1} J_u - J_u^T J_x^{-T} W_{xu}
g = g_u - J_u^T J_x^{-T} g_x + (W_{ux} - J_u^T J_x^{-T} W_{xx}) J_x^{-1} f
followed by the state and adjoint updates
J_x \delta x = -f - J_u \delta u, \qquad J_x^T \lambda = -g_x - W_{xx} \delta x - W_{xu} \delta u
Schwarz-inside-Schur, cont.
Problems:
  J_x is the Jacobian of a PDE: huge!
  the W blocks involve Hessians of the objective and constraints: second derivatives, and also huge
  H is unreasonable to form, store, or invert
Solutions:
  use Schur preconditioning on the full system
  form the forward action of the Hessians by automatic differentiation
  form the approximate inverse action of the state Jacobian and its transpose by Schwarz
Nonlinear Schwarz Preconditioning
Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free
It replaces F(u) = 0 with a new nonlinear system possessing the same root, \Phi(u) = 0
Define a correction \delta_i(u) to the i-th partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system:
R_i F(u - \delta_i(u)) = 0
where \delta_i(u) \in R^n is nonzero only in the components of the i-th partition
Then sum the corrections: \Phi(u) = \sum_i \delta_i(u)
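A serial illustrative sketch of evaluating the nonlinearly preconditioned residual Phi(u); scipy's generic fsolve stands in for the local subdomain Newton solves, and partitions are given as index sets (all names here are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import fsolve

def nonlinear_schwarz_residual(F, u, partitions):
    Phi = np.zeros_like(u)
    for idx in partitions:                            # independent local nonlinear solves
        def local_residual(d_local, idx=idx):
            d = np.zeros_like(u)
            d[idx] = d_local                          # delta_i is nonzero only on partition i
            return F(u - d)[idx]                      # R_i F(u - delta_i)
        Phi[idx] += fsolve(local_residual, np.zeros(len(idx)))
    return Phi
```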
Nonlinear Schwarz, cont.
It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root, then \Phi(u) = 0 and F(u) = 0 have the same unique root
To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate, for any u, v \in R^n:
  the residual \Phi(u) = \sum_i \delta_i(u)
  the Jacobian-vector product \Phi'(u)\, v
Remarkably (Cai-Keyes, 2000), it can be shown that
\Phi'(u)\, v \approx \sum_i R_i^T J_i^{-1} R_i J v
where J = F'(u) and J_i = R_i J R_i^T
All required actions are available in terms of F(u)!
Experimental Example of Nonlinear Schwarz
(Figure: convergence histories) Newton's method: difficulty at critical Re, stagnation beyond critical Re. Additive Schwarz Preconditioned Inexact Newton (ASPIN): convergence for all Re
“Unreasonable effectiveness” of Schwarz
When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right!
Let r_i be a complete set of orthonormal row eigenvectors for A: r_i A r_i^T = a_i, or A = \sum_i r_i^T a_i r_i
Then A^{-1} = \sum_i r_i^T a_i^{-1} r_i = \sum_i r_i^T (r_i A r_i^T)^{-1} r_i, which is the Schwarz formula!
Good decompositions are a compromise between conditioning and parallel complexity, in practice
Some anecdotal successes
Newton-Krylov-Schwarz: computational aerodynamics (Anderson et al., NASA, ODU, ANL); free convection (Shadid & Tuminaro, Sandia)
Pseudo-spectral Schwarz: incompressible flow (Fischer & Tufo, ANL)
FETI: computational structural dynamics (Farhat, CU-Boulder & Pierson, Sandia)
LNKS: PDE-constrained optimization (Biros, NYU & Ghattas, CMU)
Newton-Krylov-Schwarz
Newton: nonlinear solver, asymptotically quadratic
Krylov: accelerator, spectrally adaptive
Schwarz: preconditioner, parallelizable
Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994)
Jacobian-Free Newton-Krylov Method
In the Jacobian-Free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products
These are approximated by the Fréchet derivative
J(u)\, v \approx \tfrac{1}{\epsilon} [F(u + \epsilon v) - F(u)]
so that the actual Jacobian elements are never explicitly needed, where \epsilon is chosen with a fine balance between approximation and floating-point rounding error
Schwarz preconditions, using approximate elements
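A minimal sketch of the matrix-free Jacobian-vector product used inside the Krylov solver; the differencing-parameter choice below is one common heuristic, not the talk's specific recipe:

```python
import numpy as np

def jfnk_matvec(F, u, v, Fu=None, sqrt_eps=1.0e-7):
    """Approximate J(u) v by the Frechet difference (1/eps) [F(u + eps v) - F(u)]."""
    if Fu is None:
        Fu = F(u)                                 # reuse the residual already computed by Newton
    vnorm = np.linalg.norm(v)
    if vnorm == 0.0:
        return np.zeros_like(u)
    # eps balances truncation error against floating-point rounding error
    eps = sqrt_eps * (1.0 + np.linalg.norm(u)) / vnorm
    return (F(u + eps * v) - Fu) / eps
```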
Computational Aerodynamics
mesh c/o D. Mavriplis, ICASE
Implemented in PETSc
www.mcs.anl.gov/petsc
Transonic “Lambda” Shock, Mach contours on surfaces
Fixed-size Parallel Scaling Results
Four orders of magnitude in 13 years
c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith
128 nodes: 43 min
3072 nodes: 2.5 min, 226 Gflop/s
11M unknowns, 15 µs/unknown, 70% efficient
This scaling study, featuring our widest range of processor number, was done for the incompressible case.
Fixed-size Parallel Scaling Results on ASCI Red
ONERA M6 wing test case, tetrahedral grid of 2.8 million vertices, on up to 3072 ASCI Red nodes (Pentium Pro 333 MHz processors)
PDE Workingsets
Smallest: data for a single stencil
Largest: data for the entire subdomain
Intermediate: data for a neighborhood collection of stencils, reused as much as possible
Improvements Resulting from Locality Reordering (a factor of five!)

Processor      | Clock MHz | Peak Mflop/s | Orig. Mflop/s | Orig. % of Peak | Interl. only Mflop/s | Reord. only Mflop/s | Opt. Mflop/s | Opt. % of Peak
Pent. Pro      | 200 | 200  | 16 | 8.0 | 26 | 27 | 42  | 21.0
Pent. Pro      | 333 | 333  | 21 | 6.3 | 36 | 40 | 60  | 18.8
Pent. II/NT    | 400 | 400  | 31 | 7.8 | 49 | 49 | 78  | 19.5
Pent. II/LIN   | 400 | 400  | 33 | 8.3 | 47 | 52 | 83  | 20.8
Ultra II/HPC   | 400 | 800  | 20 | 2.5 | 36 | 47 | 71  | 8.9
Ultra II       | 360 | 720  | 25 | 3.5 | 47 | 54 | 94  | 13.0
Ultra II       | 300 | 600  | 18 | 3.0 | 35 | 42 | 75  | 12.5
Alpha 21164    | 600 | 1200 | 16 | 1.3 | 37 | 47 | 91  | 7.6
Alpha 21164    | 450 | 900  | 14 | 1.6 | 32 | 39 | 75  | 8.3
604e           | 332 | 664  | 15 | 2.3 | 31 | 43 | 66  | 9.9
P2SC (4 card)  | 120 | 480  | 15 | 3.1 | 40 | 59 | 117 | 24.3
P2SC (2 card)  | 120 | 480  | 13 | 2.7 | 35 | 51 | 101 | 21.4
P3             | 200 | 800  | 32 | 4.0 | 68 | 87 | 163 | 20.3
R10000         | 250 | 500  | 26 | 5.2 | 59 | 74 | 127 | 25.4
Cache Traffic for PDEs
As successive workingsets “drop” into a level of memory, capacity misses (and, with effort, conflict misses) disappear, leaving only compulsory misses and reducing demand on main memory bandwidth
Traffic decreases as the cache gets bigger or the subdomains get smaller
Transport Modeling
MPSalsa/Aztec Newton-Krylov-Schwarz solver (www.cs.sandia.gov/CRF/MPSalsa), c/o J. Shadid and R. Tuminaro
Newton-Krylov solver with Aztec non-restarted GMRES; preconditioners: 1-level domain decomposition with ILUT subdomain solver, and ML 2-level DD with Gauss-Seidel subdomain solver. Coarse solver: “Exact” = SuperLU (1 proc), “Approx” = one step of ILU (8 procs in parallel)
3D results for a thermal convection problem (Ra = 1000); temperature iso-lines on a slice plane, velocity iso-surfaces and streamlines in 3D
(Figure: average iterations per Newton step vs. total unknowns, from 1 proc to 512 procs; iteration counts grow roughly as N^0.45, N^0.24, and N^0 for 1-level DD, 2-level DD with approximate coarse solve, and 2-level DD with exact coarse solve, respectively)
Incompressible Flow: Nek5000, an unstructured spectral element code (c/o P. Fischer and H. Tufo; www.mcs.anl.gov/appliedmath/Flow/cfd.html)
Unsteady Navier-Stokes solver: high-order tensor-product polynomials in space (N ~ 5-15); high-order operator splitting in time
Two-level overlapping additive Schwarz for the pressure: spectral elements taken as subdomains; fast local solves (tensor-product diagonalization); fast coarse-grid solver
Sparse exact factorization yields x_0 = A_0^{-1} b = X X^T b, applied as a parallel matrix-vector product (see the sketch below); low communication; scales to 10,000 processors
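A small numpy sketch (assuming a dense coarse operator A0; Nek5000's actual data structures are not shown here) of how the factored coarse solve is applied as matrix-vector products, x_0 = A_0^{-1} b = X X^T b:

```python
import numpy as np

def make_coarse_solver(A0):
    L = np.linalg.cholesky(A0)        # one-time factorization (sparse in the real code)
    X = np.linalg.inv(L).T            # X X^T = (L L^T)^{-1} = A0^{-1}
    return lambda b: X @ (X.T @ b)    # each application is two matvecs, with low communication
```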
(Figures: transition near a roughness element; transition in an arterio-venous graft, Re_v = 2700)
ASCI Red scaling to P = 4096: 376 Gflop/s on a 27 M gridpoint Navier-Stokes calculation, using dual MPI / multithreading
FETI-DP for structural mechanics
(Figure: time in seconds (0-250) vs. number of ASCI White processors (up to ~4000) for total Salinas and for FETI-DP alone, on problems of 1, 4, 9, 18, 30, and 60 Mdof)
Numerically scalable, hardware scalable solutions for realistic solid/shell models
Used in the Sandia applications Salinas, Adagio, and Andante
c/o C. Farhat and K. Pierson
PDE-constrained Optimization
Lagrange-Newton-Krylov-Schur, implemented in Veltisto/PETSc (c/o G. Biros and O. Ghattas; www.cs.nyu.edu/~biros/veltisto/)
Optimal control of laminar viscous flow: optimization variables are surface suction/injection; the objective is minimum drag; 700,000 states and 4,000 controls; 128 Cray T3E processors; ~5 hrs for the optimal solution (~1 hr for analysis)
(Figures: wing tip vortices with no control (left) and with optimal control (right); optimal boundary controls shown as velocity vectors)
Agenda for future research
High concurrency (100,000 processors)
Asynchrony
Fault tolerance
Automated tuning
Integration of simulation with studies of sensitivity, stability, and optimization
High Concurrency
Today: 10,000 processors in a single room with a tightly coupled network; DD computations scale well when provided with a network rich enough for parallel near-neighbor communication and with fast global reductions (complexity sublinear in processor count)
Future: 100,000 processors, in a room or as part of a grid; most phases of DD computations scale well (favorable surface-to-volume comm-to-comp ratio), but latencies will nix frequent exact reductions. Paradigm: extrapolate data in retarded messages and correct (if necessary) when the message arrives, as in the C(p,q,j) schemes of Garbey and Tromeur-Dervout
Asynchrony
Today: a priori partitionings for quasi-static meshes provide load-balanced computational tasks between frequent synchronization points; good load balance is critical to parallel scalability on 1,000 processors and more
Future: adaptivity requirements and far-flung, nondedicated networks will lead to idleness and imbalance at synchronization points; we need algorithms with looser outer loops than global Newton-Krylov. Can we design algorithms that are robust with respect to incomplete convergence of inner tasks, as inexact Newton is? Paradigm: nonlinear Schwarz, with regional (not global) nonlinear solvers where most execution time is spent
Fault Tolerance
Today: fault tolerance is not a driver in most scientific application code projects; it is handled as follows:
  detection of a wrong result: by the system (in hardware), the framework (by the runtime environment), or the library (in the math or communication library)
  notification of the application: an interrupt (signal sent to the job) or an error code returned by the application process
  recovery: restart from a checkpoint, migration of the task to new hardware, or reassignment of work to the remaining tasks
Future: with 100,000 processors or worldwide networks, MTBF will be measured in minutes, and checkpoint-restart could take longer than the time to the next failure. Paradigm: naturally fault-tolerant algorithms, robust with respect to failure, such as a new FD algorithm at ORNL (c/o A. Geist)
Automated Tuning
Today: knowledgeable user-developers parameterize their solvers with experience and theoretically informed intuition for: problem size/processor ratio, outer solver type, Krylov solver type, DD preconditioner type, maximum subspace dimensions, overlaps, fill levels, inner tolerances, and potentially many others
Future: less knowledgeable users will be required to employ parallel iterative solvers in taxing applications and will need safe defaults and automated tuning strategies. Paradigm: parallel direct search (PDS) derivative-free optimization methods, using overall parallel computational complexity as the objective function and algorithm tuning parameters as the design variables, to tune the solver in preproduction trial executions
Integrated Software
Today: each analysis is a “special effort”; optimization, sensitivity analysis (e.g., for uncertainty quantification), and stability analysis to fully exploit and contextualize scientific results are rare
Future: analysis becomes an “inner loop” around which more sophisticated science-driven tasks are wrapped; PDE task functionality (e.g., residual evaluation, Jacobian evaluation, Jacobian inverse) needs to be exposed to optimization/sensitivity/stability algorithms. Paradigm: integrated software based on common distributed data structures
Lab-university collaborations to develop “Integrated Software Infrastructure Centers” (ISICs) and partner with application groups
For FY2002, 51 new projects at $57M/year total: approximately one-third for ISICs, a third for grid infrastructure and collaboratories, and a third for applications groups
5 Tflop/s IBM SP platforms, “Seaborg” at NERSC (#3 in the latest Top 500) and “Cheetah” at ORNL (being installed now), are available for SciDAC
Introducing the “Terascale Optimal PDE Simulations” (TOPS) ISIC: nine institutions, five years, 24 co-PIs
TOPS
Not just algorithms, but vertically integrated software suites: portable, scalable, extensible, tunable implementations, starring PETSc and hypre, among other existing packages
Driven by three applications SciDAC groups: LBNL-led “21st Century Accelerator” designs, ORNL-led core collapse supernovae simulations, and PPPL-led magnetic fusion energy simulations; intended for many others
Background of PETSc Library (in which the FUN3D example was implemented)
Developed by Balay, Gropp, McInnes & Smith (ANL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments; now joined by four additional staff (Buschelman, Kaushik, Knepley, Zhang) under SciDAC
Distributed data structures as fundamental objects: index sets, vectors/gridfunctions, and matrices/arrays
Iterative linear and nonlinear solvers, combinable modularly, recursively, and extensibly
Portable, and callable from C, C++, and Fortran
Uniform high-level API, with multi-layered entry
Aggressively optimized: copies minimized, communication aggregated and overlapped, caches and registers reused, memory chunks preallocated, inspector-executor model for repetitive tasks (e.g., gather/scatter)
See http://www.mcs.anl.gov/petsc
User Code/PETSc Library Interactions
(Diagram: the user code supplies the main routine, application initialization, function evaluation, Jacobian evaluation, and post-processing; the PETSc code supplies the timestepping solvers (TS), nonlinear solvers (SNES), and linear solvers (SLES), built on the PC and KSP components)
User Code/PETSc Library Interactions, continued
(Same diagram, with the user-supplied evaluation routines marked “To be AD code”, i.e., slated to be generated by automatic differentiation)
Background of Hypre Library (to be combined with PETSc under SciDAC)
Developed by Chow, Cleary & Falgout (LLNL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments; now joined by seven additional staff (Henson, Jones, Lambert, Painter, Tong, Treadway, Yang) under ASCI and SciDAC
Object-oriented design similar to PETSc
Concentrates on linear problems only
Richer in preconditioners than PETSc, with a focus on algebraic multigrid; includes other preconditioners, including sparse approximate inverse (ParaSails) and parallel ILU (Euclid)
See http://www.llnl.gov/CASC/hypre/
Hypre’s “Conceptual Interfaces”
(Diagram: linear system interfaces map different data layouts (structured, composite, block-structured, unstructured, CSR) onto linear solvers such as GMG, FAC, Hybrid, AMGe, and ILU)
Slide c/o E. Chow, LLNL
Sample of Hypre’s Scaled Efficiency
(Figure: scaled efficiency (0 to 1) vs. number of processors (up to ~4000) for PFMG-CG on ASCI Red with a 40x40x40 problem per processor, with separate curves for the setup and solve phases)
Slide c/o E. Chow, LLNL
Scope for TOPS
Design and implementation of “solvers”:
  time integrators, with sensitivity analysis: f(x, \dot{x}, t, p) = 0
  nonlinear solvers, with sensitivity analysis: F(x, p) = 0
  optimizers: \min_u \phi(x, u) \;\; \text{s.t.} \;\; F(x, u) = 0
  linear solvers: Ax = b
  eigensolvers: Ax = \lambda Bx
Software integration
Performance optimization
(Diagram: Optimizer, Sensitivity Analyzer, Time integrator, Nonlinear solver, Eigensolver, and Linear solver, with arrows indicating dependences among them)
Keyword: “Optimal”
Convergence rate nearly independent of discretization parameters: multilevel schemes for linear and nonlinear problems; Newton-like schemes for quadratic convergence of nonlinear problems
Convergence rate as independent as possible of physical parameters: continuation schemes; physics-based preconditioning
The solver is a key part, but not the only part, of the simulation that needs to be scalable
(Figure: time to solution vs. problem size (increasing with the number of processors), contrasting unscalable growth with flat, scalable behavior; example: Prometheus parallel multigrid on a steel/rubber composite, c/o M. Adams, Berkeley-Sandia)
TOPS Philosophy on PDEs
Solution of a system of PDEs is rarely a goal in itself
PDEs are solved to derive various outputs from specified inputs; the actual goal is characterization of a response surface or a design or control strategy
Together with analysis, sensitivities and stability are often desired
Tools for PDE solution should also support these related desires
TOPS Philosophy on Operators
A continuous operator may appear in a discrete code in many different instances
Optimal algorithms tend to be hierarchical and nested iterative
Processor-scalable algorithms tend to be domain-decomposed and concurrent iterative
The majority of progress towards the desired highly resolved, high-fidelity result occurs through cost-effective, low-resolution, low-fidelity, parallel-efficient stages
Operator abstractions and recurrence must be supported
Traditional Approach to Software Interoperability
Direct interfacing between different packages/libraries/apps: public interfaces are unique; many-to-many couplings require Many^2 interfaces; often a heroic effort to understand the details of both codes; not a scalable solution
(Diagram: data/mesh software (SUMAA3d, DAs, GRACE, Overture) coupled pairwise to linear solvers (Hypre, Trilinos, ISIS++, PETSc))
Slide c/o L. McInnes, ANL
CCA Approach: Common Interface Specification
Reduces the many-to-many problem to a many-to-one problem
Allows interchangeability and experimentation
Difficulties: interface agreement, functionality limitations, maintaining performance
(Diagram: the same data/mesh software (SUMAA3d, DAs, GRACE, Overture) and linear solvers (Hypre, Trilinos, ISIS++, PETSc) all connect through the common ESI interface)
Slide c/o L. McInnes, ANL
CCA Concept: SCMD (SPMD) Components
(Diagram: an MPI application using CCA for interaction between components A and B within the same address space on each process (Proc1, Proc2, Proc3, etc.); for example, an adaptive mesh component written by user1 and a solver component written by user2, with the direct connection supplied by the framework at compile/runtime)
Slide c/o L. McInnes, ANL
Conclusions
Domain decomposition and multilevel iteration are the dominant paradigm in contemporary terascale PDE simulation
Several freely available software toolkits exist and successfully scale to thousands of tightly coupled processors for problems on quasi-static meshes
Concerted efforts are underway to make elements of these toolkits interoperate and to allow expression of the best methods, which tend to be modular, hierarchical, recursive, and above all adaptive!
Many challenges loom at the “next scale” of computation
Undoubtedly, new theory and algorithms will be part of the solution
Acknowledgments
This talk was prepared despite interruptions by the …
Acknowledgments
Early supporters at ICASE: Bob Voigt, Yousuff Hussaini
Early supporters at NASA: Manny Salas, Jim Thomas
Acknowledgments
Collaborators or contributors: George Biros (NYU), Xiao-Chuan Cai (Univ. Colorado, Boulder), Paul Fischer (ANL), Al Geist (ORNL), Omar Ghattas (Carnegie Mellon), Dinesh Kaushik (ODU), Dana Knoll (LANL), Dimitri Mavriplis (ICASE), Kendall Pierson (Sandia), Henry Tufo (ANL); the AZTEC team at Sandia National Laboratories (John Shadid, Ray Tuminaro); the PETSc team at Argonne National Laboratory (Satish Balay, Bill Gropp, Lois McInnes, Barry Smith)
Sponsors: DOE, NASA, NSF
Computer resources: LLNL, LANL, SNL, NERSC
Related URLs
Personal homepage (papers, talks, etc.): http://www.math.odu.edu/~keyes
SciDAC initiative: http://www.science.doe.gov/scidac
TOPS project: http://www.math.odu.edu/~keyes/scidac
PETSc project: http://www.mcs.anl.gov/petsc
Hypre project: http://www.llnl.gov/CASC/hypre
ASCI platforms: http://www.llnl.gov/asci/platforms
Bibliography
Jacobian-Free Newton-Krylov Methods: Approaches and Applications, Knoll & Keyes, 2002, to be submitted to J. Comp. Phys.
Nonlinearly Preconditioned Inexact Newton Algorithms, Cai & Keyes, 2002, to appear in SIAM J. Sci. Comp.
High Performance Parallel Implicit CFD, Gropp, Kaushik, Keyes & Smith, 2001, Parallel Computing 27:337-362
Four Horizons for Enhancing the Performance of Parallel Simulations based on Partial Differential Equations, Keyes, 2000, Lect. Notes Comp. Sci., Springer, 1900:1-17
Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel CFD, Gropp, Keyes, McInnes & Tidriri, 2000, Int. J. High Performance Computing Applications 14:102-136
Achieving High Sustained Performance in an Unstructured Mesh CFD Application, Anderson, Gropp, Kaushik, Keyes & Smith, 1999, Proceedings of SC'99
Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in “Parallel Solution of Partial Differential Equations,” Springer, pp. 247-278
How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, pp. 286-297