Algorithms for Cluster Computing Based on Domain Decomposition
David E. Keyes
Center for Computational Science, Old Dominion University
Institute for Computer Applications in Science & Engineering, NASA Langley Research Center
Institute for Scientific Computing Research, Lawrence Livermore National Laboratory
NASA Langley Cluster SIG
Plan of Presentation
Introduction
Imperative of domain decomposition and multilevel methods for terascale computing
Basic and “advanced” domain decomposition and multilevel algorithmic concepts
Vignettes of domain decomposition and multilevel performance (with a little help from my friends…)
Agenda for future research
Terascale Optimal PDE Simulations (TOPS) software project
Conclusions
Terascale simulation has been “sold”
(Figure: application areas arranged around “Scientific Simulation”, in each case because experiments are expensive, controversial, prohibited, or dangerous)
Environment: global climate, groundwater flow
Lasers & Energy: combustion, ICF modeling
Engineering: structural dynamics, electromagnetics
Chemistry & Bio: materials modeling, drug design
Applied Physics: radiation transport, hydrodynamics
However, it is far from proven! To meet expectations, we need to handle problems of multiple physical scales.
Large platforms have been provided
The ASCI program of the U.S. DOE has a roadmap to reach 100 Tflop/s by 2006 (www.llnl.gov/asci/platforms)
(Figure: capability vs. calendar year, ‘97-‘06, showing plan/develop/use phases for machines at Sandia, Los Alamos, and Livermore)
Red: 1+ Tflop / 0.5 TB; Blue: 3+ Tflop / 1.5 TB; White: 10+ Tflop / 4 TB; then 30+ Tflop / 10 TB, 50+ Tflop / 25 TB, and 100+ Tflop / 30 TB
NSF’s 13.6 TF TeraGrid coming on line
(Figure: TeraGrid network diagram, with site resources and HPSS/UniTree archival storage at each site linked through external networks; SDSC: 4.1 TF, 225 TB; NCSA/PACI: 8 TF, 240 TB; plus Caltech and Argonne)
TeraGrid: NCSA, SDSC, Caltech, Argonne (www.teragrid.org)
Algorithmic requirements from architecture
Must run on physically distributed memory units connected by a message-passing network, each serving one or more processors with multiple levels of cache (pictured: Cray T3E)
“horizontal” aspects: message passing, shared-memory threads
“vertical” aspects: register blocking, cache blocking, prefetching
Building platforms is the “easy” part
Algorithms must be:
  highly concurrent and straightforward to load balance
  latency tolerant
  cache friendly (good temporal and spatial locality)
  highly scalable (in the sense of convergence)
Domain decomposed multilevel methods are “natural” for all of these
Domain decomposition is also “natural” for software engineering
Fortunate that mathematicians built up its theory in advance of the requirements!
Decomposition strategies for Lu = f in \Omega
Operator decomposition: L = \sum_k L_k
Function space decomposition: f = \sum_k f_k \Phi_k, \quad u = \sum_k u_k \Phi_k
Domain decomposition: \Omega = \bigcup_k \Omega_k
Consider the implicitly discretized parabolic case:
[I - \tau (L_x + L_y)]\, u^{k+1} = u^k + \tau f
Operator decomposition: consider ADI
[I - \tfrac{\tau}{2} L_x]\, u^{k+1/2} = [I + \tfrac{\tau}{2} L_y]\, u^k + \tfrac{\tau}{2} f
[I - \tfrac{\tau}{2} L_y]\, u^{k+1} = [I + \tfrac{\tau}{2} L_x]\, u^{k+1/2} + \tfrac{\tau}{2} f
The iteration matrix consists of four multiplicative substeps per timestep: two sparse matrix-vector multiplies and two sets of unidirectional bandsolves
There is parallelism within each substep, but global data exchanges are required between the bandsolve substeps
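A minimal serial sketch of one such ADI step, under assumed inputs (sparse 1D-direction operators Lx and Ly, timestep tau); scipy's general sparse direct solve stands in here for the unidirectional bandsolves:

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def adi_step(u, f, Lx, Ly, tau):
    """One Peaceman-Rachford ADI step for u_t = (Lx + Ly) u + f (assumed model problem)."""
    I = sp.identity(u.size, format="csc")
    # Substep 1: sparse matrix-vector multiply, then a solve in the x-direction
    rhs1 = (I + 0.5 * tau * Ly) @ u + 0.5 * tau * f
    u_half = spla.spsolve(I - 0.5 * tau * Lx, rhs1)
    # Substep 2: sparse matrix-vector multiply, then a solve in the y-direction
    rhs2 = (I + 0.5 * tau * Lx) @ u_half + 0.5 * tau * f
    return spla.spsolve(I - 0.5 * tau * Ly, rhs2)
```

In a distributed implementation, the switch of solve direction between the two substeps is where the global data exchange arises.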
Function space decomposition: consider a spectral Galerkin method
u(x, y, t) = \sum_{j=1}^{N} a_j(t) \Phi_j(x, y)
(\tfrac{du}{dt}, \Phi_i) = (L u, \Phi_i) + (f, \Phi_i), \quad i = 1, \dots, N
\sum_j (\Phi_j, \Phi_i) \tfrac{da_j}{dt} = \sum_j (L \Phi_j, \Phi_i)\, a_j + (f, \Phi_i), \quad i = 1, \dots, N
\tfrac{da}{dt} = M^{-1} K a + M^{-1} f, \quad M \equiv [(\Phi_i, \Phi_j)], \; K \equiv [(L \Phi_i, \Phi_j)]
This is a system of ordinary differential equations
Perhaps M and K are diagonal matrices
Perfect parallelism across the spectral index, but global data exchanges are needed to transform back to physical variables at each step
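A tiny illustrative sketch (an added example, assuming M, K, f and a basis matrix Phi are available) of advancing the Galerkin ODE system and then transforming back to physical variables:

```python
import numpy as np

def galerkin_step(a, M, K, f, dt):
    # Forward-Euler step of da/dt = M^{-1} K a + M^{-1} f; embarrassingly parallel
    # across the spectral index when M and K are diagonal
    return a + dt * np.linalg.solve(M, K @ a + f)

# u_physical = Phi @ a   # Phi: basis evaluation matrix (hypothetical); this global
#                        # transform is the all-to-all exchange noted above
```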
Domain decomposition: consider restriction and extension operators R_i, R_i^T for subdomains, and R_0, R_0^T for a possible coarse grid
Replace the discretized system Au = f with B^{-1}Au = B^{-1}f, where
B^{-1} = R_0^T A_0^{-1} R_0 + \sum_i R_i^T A_i^{-1} R_i, \quad A_i = R_i A R_i^T
Solve by a Krylov method
Matrix-vector multiplies with B^{-1}A require:
  parallelism on each subdomain
  nearest-neighbor exchanges and global reductions
  a possible small global system (not needed for the parabolic case)
Comparison
Operator decomposition (ADI): natural row-based assignment requires all-to-all, bulk data exchanges in each step (for the transpose)
Function space decomposition (Fourier): natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for the transform)
Domain decomposition (Schwarz): natural domain-based assignment requires local (nearest-neighbor) data exchanges, global reductions, and an optional small global problem
(Of course, domain decomposition can be interpreted as a special operator or function space decomposition)
Theoretical Scaling of Domain Decomposition for Three Common Network Topologies
With tree-based (logarithmic) global reductions and scalable nearest-neighbor hardware: the optimal number of processors scales linearly with problem size (scalable; assumes one subdomain per processor)
With 3D torus-based global reductions and scalable nearest-neighbor hardware: the optimal number of processors scales as the three-fourths power of problem size (almost scalable)
With a common bus network (heavy contention): the optimal number of processors scales as the one-fourth power of problem size (not scalable); bad news for conventional Beowulf clusters, but see the 2000 Bell Prize “price-performance awards” using multiple NICs per Beowulf node!
Basic Concepts
  Iterative correction
  Schwarz preconditioning
  Schur preconditioning
“Advanced” Concepts
  Polynomial combinations of Schwarz projections
  Schwarz-Schur combinations: Neumann-Neumann/FETI (Schwarz on Schur), LNKS (Schwarz inside Schur)
  Nonlinear Schwarz
Iterative correction: the most basic idea in iterative methods
u \leftarrow u + B^{-1}(f - Au)
Evaluate the residual accurately, but solve approximately, where B^{-1} is an approximate inverse to A
A sequence of complementary solves can be used; e.g., with B_1^{-1} first and then B_2^{-1}, one has
u \leftarrow u + [B_1^{-1} + B_2^{-1} - B_2^{-1} A B_1^{-1}](f - Au)
Optimal polynomials of B^{-1}A lead to various preconditioned Krylov methods
Scale recurrence, e.g., with B_2^{-1} = R^T (R A R^T)^{-1} R, leads to multilevel methods
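A dense numpy sketch of the two complementary solves applied in sequence (B1 first, then B2), which is algebraically the same as the bracketed combined formula; apply_B1inv and apply_B2inv are assumed user-supplied approximate solves:

```python
import numpy as np

def corrected_iteration(A, f, u, apply_B1inv, apply_B2inv, num_iters=10):
    for _ in range(num_iters):
        u = u + apply_B1inv(f - A @ u)   # accurate residual, approximate solve with B1
        u = u + apply_B2inv(f - A @ u)   # complementary approximate solve with B2
    return u
```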
Multilevel Preconditioning: a Multigrid V-cycle
(Figure: a smoother is applied on the finest grid; restriction transfers the residual from the fine grid to the first coarse grid, which has fewer cells (less work and storage); the idea is applied recursively until an easy problem remains to be solved; prolongation transfers the correction from the coarse grid back to the fine grid)
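A minimal recursive V-cycle sketch under assumed inputs: A is a list of level matrices (finest first), P[l] prolongates from level l+1 to level l (restriction taken as its transpose), and weighted Jacobi stands in for the smoother:

```python
import numpy as np

def v_cycle(A, P, b, x, level=0, nu=2, omega=2.0 / 3.0):
    if level == len(A) - 1:                       # coarsest grid: an easy problem, solve directly
        return np.linalg.solve(A[level], b)
    D = np.diag(A[level])
    for _ in range(nu):                           # pre-smoothing (weighted Jacobi)
        x = x + omega * (b - A[level] @ x) / D
    r_coarse = P[level].T @ (b - A[level] @ x)    # restriction: transfer residual to coarse grid
    e_coarse = v_cycle(A, P, r_coarse, np.zeros_like(r_coarse), level + 1, nu, omega)
    x = x + P[level] @ e_coarse                   # prolongation: transfer correction to fine grid
    for _ in range(nu):                           # post-smoothing
        x = x + omega * (b - A[level] @ x) / D
    return x
```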
Schwarz Preconditioning
Given Ax = b, partition x into subvectors corresponding to subdomains \Omega_i of the domain \Omega of the PDE, nonempty, possibly overlapping, whose union is all of the elements of x \in R^n
Let the Boolean rectangular matrix R_i extract the i-th subset of x: x_i = R_i x
Let A_i = R_i A R_i^T
The Boolean matrices are gather/scatter operators, mapping between a global vector and its subdomain support
The simplest Schwarz preconditioner is then B^{-1} = \sum_i R_i^T A_i^{-1} R_i
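A serial numpy sketch of the resulting one-level additive Schwarz action z = B^{-1} r = \sum_i R_i^T A_i^{-1} R_i r, with each R_i represented simply as an index set of (possibly overlapping) subdomain unknowns:

```python
import numpy as np

def additive_schwarz_apply(A, r, subdomain_index_sets):
    z = np.zeros_like(r)
    for idx in subdomain_index_sets:              # independent local solves (concurrent in practice)
        A_i = A[np.ix_(idx, idx)]                 # A_i = R_i A R_i^T  (gather)
        z[idx] += np.linalg.solve(A_i, r[idx])    # R_i^T A_i^{-1} R_i r  (local solve, scatter-add)
    return z
```

In a Krylov-Schwarz solver this routine would serve as the preconditioner application, with A itself only ever needed through matrix-vector products.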
Iteration Count Estimates from the Schwarz Theory
In terms of N and P, where for d-dimensional isotropic problems N = h^{-d} and P = H^{-d}, for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows:

Preconditioning Type       | in 2D          | in 3D
Point Jacobi               | O(N^{1/2})     | O(N^{1/3})
Domain Jacobi (δ = 0)      | O((NP)^{1/4})  | O((NP)^{1/6})
1-level Additive Schwarz   | O(P^{1/2})     | O(P^{1/3})
2-level Additive Schwarz   | O(1)           | O(1)

Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square root of the condition number of the Schwarz-preconditioned system
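For reference, the standard Schwarz-theory condition-number bounds behind these estimates may be summarized as follows (an added summary, not on the original slide; C is independent of h and H, and delta denotes the overlap width); the tabulated iteration counts follow by taking square roots:

```latex
\kappa(B^{-1}A) \;\lesssim\;
\begin{cases}
C\, h^{-2}, & \text{point Jacobi},\\
C\, (hH)^{-1}, & \text{domain Jacobi (no overlap, no coarse grid)},\\
C\, H^{-2}\,(1 + H/\delta), & \text{1-level additive Schwarz},\\
C\,(1 + H/\delta), & \text{2-level additive Schwarz}.
\end{cases}
```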
Schur Preconditioning
Given a partition
\begin{pmatrix} A_{ii} & A_{i\Gamma} \\ A_{\Gamma i} & A_{\Gamma\Gamma} \end{pmatrix} \begin{pmatrix} u_i \\ u_\Gamma \end{pmatrix} = \begin{pmatrix} f_i \\ f_\Gamma \end{pmatrix}
Condense: S u_\Gamma = g, where S \equiv A_{\Gamma\Gamma} - A_{\Gamma i} A_{ii}^{-1} A_{i\Gamma} and g \equiv f_\Gamma - A_{\Gamma i} A_{ii}^{-1} f_i
Let M be a good preconditioner for S
Then \begin{pmatrix} A_{ii} & A_{i\Gamma} \\ 0 & M \end{pmatrix} is a preconditioner for A
Moreover, solves with A_{ii} may be done approximately if all degrees of freedom are retained
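A dense numpy sketch (assuming the two-by-two block partition above, with interior block A_ii and interface block A_gg) of the condensation and back-substitution behind Schur-complement preconditioning:

```python
import numpy as np

def schur_condense_and_solve(A_ii, A_ig, A_gi, A_gg, f_i, f_g):
    S = A_gg - A_gi @ np.linalg.solve(A_ii, A_ig)   # Schur complement (precondition with M ~ S)
    g = f_g - A_gi @ np.linalg.solve(A_ii, f_i)     # condensed right-hand side
    u_g = np.linalg.solve(S, g)                     # interface solve
    u_i = np.linalg.solve(A_ii, f_i - A_ig @ u_g)   # back-substitute for the interior unknowns
    return u_i, u_g
```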
Schwarz polynomials
Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations
We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively):
u \leftarrow u + \sum_i B_i^{-1}(f - Au)
u \leftarrow u + B_0^{-1}(f - Au)
This leads to B^{-1} = B_0^{-1} + (I - B_0^{-1}A)(\sum_i B_i^{-1}), the “Hybrid II” method in Smith, Bjorstad & Gropp
Schwarz-on-Schur
Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem
Neumann-Neumann: M^{-1} = \sum_i D_i R_i^T S_i^{-1} R_i D_i
Balancing Neumann-Neumann: M^{-1} = M_0^{-1} + (I - M_0^{-1} S)\,(\sum_i D_i R_i^T S_i^{-1} R_i D_i)\,(I - S M_0^{-1})
Multigrid can also be applied on the Schur complement
Schwarz-inside-Schur
Equality constrained optimization leads to the KKT system for states x, designs u, and multipliers \lambda:
\begin{pmatrix} W_{xx} & W_{xu} & J_x^T \\ W_{ux} & W_{uu} & J_u^T \\ J_x & J_u & 0 \end{pmatrix} \begin{pmatrix} \delta x \\ \delta u \\ \lambda \end{pmatrix} = - \begin{pmatrix} g_x \\ g_u \\ f \end{pmatrix}
Then Newton Reduced SQP solves the Schur complement system H \delta u = g, where H is the reduced Hessian
H = W_{uu} + J_u^T J_x^{-T} W_{xx} J_x^{-1} J_u - W_{ux} J_x^{-1} J_u - J_u^T J_x^{-T} W_{xu}
g = g_u - J_u^T J_x^{-T} g_x + (W_{ux} - J_u^T J_x^{-T} W_{xx}) J_x^{-1} f
followed by the state and adjoint updates
J_x \delta x = -f - J_u \delta u, \qquad J_x^T \lambda = -g_x - W_{xx} \delta x - W_{xu} \delta u
Schwarz-inside-Schur, cont.
Problems:
  J_x is the Jacobian of a PDE: huge!
  the W blocks involve Hessians of the objective and constraints: second derivatives, and also huge
  H is unreasonable to form, store, or invert
Solutions:
  use Schur preconditioning on the full system
  form the forward action of the Hessians by automatic differentiation
  form the approximate inverse action of the state Jacobian and its transpose by Schwarz
Nonlinear Schwarz Preconditioning
Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free
It replaces F(u) = 0 with a new nonlinear system possessing the same root, \Phi(u) = 0
Define a correction \delta_i(u) to the i-th partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system:
R_i F(u - \delta_i(u)) = 0
where \delta_i(u) \in R^n is nonzero only in the components of the i-th partition
Then sum the corrections: \Phi(u) = \sum_i \delta_i(u)
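A serial illustrative sketch of evaluating the nonlinearly preconditioned residual Phi(u); scipy's generic fsolve stands in for the local subdomain Newton solves, and partitions are given as index sets (all names here are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import fsolve

def nonlinear_schwarz_residual(F, u, partitions):
    Phi = np.zeros_like(u)
    for idx in partitions:                            # independent local nonlinear solves
        def local_residual(d_local, idx=idx):
            d = np.zeros_like(u)
            d[idx] = d_local                          # delta_i is nonzero only on partition i
            return F(u - d)[idx]                      # R_i F(u - delta_i)
        Phi[idx] += fsolve(local_residual, np.zeros(len(idx)))
    return Phi
```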
Nonlinear Schwarz, cont.
It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root, then \Phi(u) = 0 and F(u) = 0 have the same unique root
To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate, for any u, v \in R^n:
  the residual \Phi(u) = \sum_i \delta_i(u)
  the Jacobian-vector product \Phi'(u)\, v
Remarkably (Cai-Keyes, 2000), it can be shown that
\Phi'(u)\, v \approx \sum_i R_i^T J_i^{-1} R_i J v
where J = F'(u) and J_i = R_i J R_i^T
All required actions are available in terms of F(u)!
Experimental Example of Nonlinear Schwarz
(Figure: convergence histories) Newton's method: difficulty at critical Re, stagnation beyond critical Re. Additive Schwarz Preconditioned Inexact Newton (ASPIN): convergence for all Re
“Unreasonable effectiveness” of Schwarz
When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right!
Let r_i be a complete set of orthonormal row eigenvectors for A: r_i A r_i^T = a_i, or A = \sum_i r_i^T a_i r_i
Then A^{-1} = \sum_i r_i^T a_i^{-1} r_i = \sum_i r_i^T (r_i A r_i^T)^{-1} r_i, which is the Schwarz formula!
Good decompositions are a compromise between conditioning and parallel complexity, in practice
Some anecdotal successes
Newton-Krylov-Schwarz: computational aerodynamics (Anderson et al., NASA, ODU, ANL); free convection (Shadid & Tuminaro, Sandia)
Pseudo-spectral Schwarz: incompressible flow (Fischer & Tufo, ANL)
FETI: computational structural dynamics (Farhat, CU-Boulder & Pierson, Sandia)
LNKS: PDE-constrained optimization (Biros, NYU & Ghattas, CMU)
Newton-Krylov-Schwarz
Newton: nonlinear solver, asymptotically quadratic
Krylov: accelerator, spectrally adaptive
Schwarz: preconditioner, parallelizable
Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994)
Jacobian-Free Newton-Krylov Method
In the Jacobian-Free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products
These are approximated by the Fréchet derivative
J(u)\, v \approx \tfrac{1}{\epsilon} [F(u + \epsilon v) - F(u)]
so that the actual Jacobian elements are never explicitly needed, where \epsilon is chosen with a fine balance between approximation and floating-point rounding error
Schwarz preconditions, using approximate elements
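A minimal sketch of the matrix-free Jacobian-vector product used inside the Krylov solver; the differencing-parameter choice below is one common heuristic, not the talk's specific recipe:

```python
import numpy as np

def jfnk_matvec(F, u, v, Fu=None, sqrt_eps=1.0e-7):
    """Approximate J(u) v by the Frechet difference (1/eps) [F(u + eps v) - F(u)]."""
    if Fu is None:
        Fu = F(u)                                 # reuse the residual already computed by Newton
    vnorm = np.linalg.norm(v)
    if vnorm == 0.0:
        return np.zeros_like(u)
    # eps balances truncation error against floating-point rounding error
    eps = sqrt_eps * (1.0 + np.linalg.norm(u)) / vnorm
    return (F(u + eps * v) - Fu) / eps
```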
Computational Aerodynamics
mesh c/o D. Mavriplis, ICASE
Implemented in PETSc
www.mcs.anl.gov/petsc
Transonic “Lambda” Shock, Mach contours on surfaces
Fixed-size Parallel Scaling Results
Four orders of magnitude in 13 years
c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith
128 nodes: 43 min
3072 nodes: 2.5 min, 226 Gflop/s
11M unknowns, 15 µs/unknown, 70% efficient
This scaling study, featuring our widest range of processor number, was done for the incompressible case.
Fixed-size Parallel Scaling Results on ASCI Red
ONERA M6 wing test case, tetrahedral grid of 2.8 million vertices, on up to 3072 ASCI Red nodes (Pentium Pro 333 MHz processors)
PDE Workingsets
Smallest: data for a single stencil
Largest: data for the entire subdomain
Intermediate: data for a neighborhood collection of stencils, reused as much as possible
Improvements Resulting from Locality Reordering (a factor of five!)

Processor      | Clock MHz | Peak Mflop/s | Orig. Mflop/s | Orig. % of Peak | Interl. only Mflop/s | Reord. only Mflop/s | Opt. Mflop/s | Opt. % of Peak
Pent. Pro      | 200 | 200  | 16 | 8.0 | 26 | 27 | 42  | 21.0
Pent. Pro      | 333 | 333  | 21 | 6.3 | 36 | 40 | 60  | 18.8
Pent. II/NT    | 400 | 400  | 31 | 7.8 | 49 | 49 | 78  | 19.5
Pent. II/LIN   | 400 | 400  | 33 | 8.3 | 47 | 52 | 83  | 20.8
Ultra II/HPC   | 400 | 800  | 20 | 2.5 | 36 | 47 | 71  | 8.9
Ultra II       | 360 | 720  | 25 | 3.5 | 47 | 54 | 94  | 13.0
Ultra II       | 300 | 600  | 18 | 3.0 | 35 | 42 | 75  | 12.5
Alpha 21164    | 600 | 1200 | 16 | 1.3 | 37 | 47 | 91  | 7.6
Alpha 21164    | 450 | 900  | 14 | 1.6 | 32 | 39 | 75  | 8.3
604e           | 332 | 664  | 15 | 2.3 | 31 | 43 | 66  | 9.9
P2SC (4 card)  | 120 | 480  | 15 | 3.1 | 40 | 59 | 117 | 24.3
P2SC (2 card)  | 120 | 480  | 13 | 2.7 | 35 | 51 | 101 | 21.4
P3             | 200 | 800  | 32 | 4.0 | 68 | 87 | 163 | 20.3
R10000         | 250 | 500  | 26 | 5.2 | 59 | 74 | 127 | 25.4
Cache Traffic for PDEs
As successive workingsets “drop” into a level of memory, capacity misses (and, with effort, conflict misses) disappear, leaving only compulsory misses and reducing demand on main memory bandwidth
Traffic decreases as the cache gets bigger or the subdomains get smaller
Transport Modeling
MPSalsa/Aztec Newton-Krylov-Schwarz solver (www.cs.sandia.gov/CRF/MPSalsa), c/o J. Shadid and R. Tuminaro
Newton-Krylov solver with Aztec non-restarted GMRES; preconditioners: 1-level domain decomposition with ILUT subdomain solver, and ML 2-level DD with Gauss-Seidel subdomain solver. Coarse solver: “Exact” = SuperLU (1 proc), “Approx” = one step of ILU (8 procs in parallel)
3D results for a thermal convection problem (Ra = 1000); temperature iso-lines on a slice plane, velocity iso-surfaces and streamlines in 3D
(Figure: average iterations per Newton step vs. total unknowns, from 1 proc to 512 procs; iteration counts grow roughly as N^0.45, N^0.24, and N^0 for 1-level DD, 2-level DD with approximate coarse solve, and 2-level DD with exact coarse solve, respectively)
Incompressible Flow: Nek5000, an unstructured spectral element code (c/o P. Fischer and H. Tufo; www.mcs.anl.gov/appliedmath/Flow/cfd.html)
Unsteady Navier-Stokes solver: high-order tensor-product polynomials in space (N ~ 5-15); high-order operator splitting in time
Two-level overlapping additive Schwarz for the pressure: spectral elements taken as subdomains; fast local solves (tensor-product diagonalization); fast coarse-grid solver
Sparse exact factorization yields x_0 = A_0^{-1} b = X X^T b, applied as a parallel matrix-vector product (see the sketch below); low communication; scales to 10,000 processors
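A small numpy sketch (assuming a dense coarse operator A0; Nek5000's actual data structures are not shown here) of how the factored coarse solve is applied as matrix-vector products, x_0 = A_0^{-1} b = X X^T b:

```python
import numpy as np

def make_coarse_solver(A0):
    L = np.linalg.cholesky(A0)        # one-time factorization (sparse in the real code)
    X = np.linalg.inv(L).T            # X X^T = (L L^T)^{-1} = A0^{-1}
    return lambda b: X @ (X.T @ b)    # each application is two matvecs, with low communication
```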
(Figures: transition near a roughness element; transition in an arterio-venous graft, Re_v = 2700)
ASCI Red scaling to P = 4096: 376 Gflop/s on a 27 M gridpoint Navier-Stokes calculation, using dual MPI / multithreading
FETI-DP for structural mechanics
(Figure: time in seconds (0-250) vs. number of ASCI White processors (up to ~4000) for total Salinas and for FETI-DP alone, on problems of 1, 4, 9, 18, 30, and 60 Mdof)
Numerically scalable, hardware scalable solutions for realistic solid/shell models
Used in the Sandia applications Salinas, Adagio, and Andante
c/o C. Farhat and K. Pierson
PDE-constrained Optimization
Lagrange-Newton-Krylov-Schur, implemented in Veltisto/PETSc (c/o G. Biros and O. Ghattas; www.cs.nyu.edu/~biros/veltisto/)
Optimal control of laminar viscous flow: optimization variables are surface suction/injection; the objective is minimum drag; 700,000 states and 4,000 controls; 128 Cray T3E processors; ~5 hrs for the optimal solution (~1 hr for analysis)
(Figures: wing tip vortices with no control (left) and with optimal control (right); optimal boundary controls shown as velocity vectors)
Agenda for future research
High concurrency (100,000 processors)
Asynchrony
Fault tolerance
Automated tuning
Integration of simulation with studies of sensitivity, stability, and optimization
High Concurrency
Today: 10,000 processors in a single room with a tightly coupled network; DD computations scale well when provided with a network rich enough for parallel near-neighbor communication and with fast global reductions (complexity sublinear in processor count)
Future: 100,000 processors, in a room or as part of a grid; most phases of DD computations scale well (favorable surface-to-volume comm-to-comp ratio), but latencies will nix frequent exact reductions. Paradigm: extrapolate data in retarded messages and correct (if necessary) when the message arrives, as in the C(p,q,j) schemes of Garbey and Tromeur-Dervout
Asynchrony
Today: a priori partitionings for quasi-static meshes provide load-balanced computational tasks between frequent synchronization points; good load balance is critical to parallel scalability on 1,000 processors and more
Future: adaptivity requirements and far-flung, nondedicated networks will lead to idleness and imbalance at synchronization points; we need algorithms with looser outer loops than global Newton-Krylov. Can we design algorithms that are robust with respect to incomplete convergence of inner tasks, as inexact Newton is? Paradigm: nonlinear Schwarz, with regional (not global) nonlinear solvers where most execution time is spent
Fault Tolerance
Today: fault tolerance is not a driver in most scientific application code projects; it is handled as follows:
  detection of a wrong result: by the system (in hardware), the framework (by the runtime environment), or the library (in the math or communication library)
  notification of the application: an interrupt (signal sent to the job) or an error code returned by the application process
  recovery: restart from a checkpoint, migration of the task to new hardware, or reassignment of work to the remaining tasks
Future: with 100,000 processors or worldwide networks, MTBF will be measured in minutes, and checkpoint-restart could take longer than the time to the next failure. Paradigm: naturally fault-tolerant algorithms, robust with respect to failure, such as a new FD algorithm at ORNL (c/o A. Geist)
Automated Tuning
Today: knowledgeable user-developers parameterize their solvers with experience and theoretically informed intuition for: problem size/processor ratio, outer solver type, Krylov solver type, DD preconditioner type, maximum subspace dimensions, overlaps, fill levels, inner tolerances, and potentially many others
Future: less knowledgeable users will be required to employ parallel iterative solvers in taxing applications and will need safe defaults and automated tuning strategies. Paradigm: parallel direct search (PDS) derivative-free optimization methods, using overall parallel computational complexity as the objective function and algorithm tuning parameters as the design variables, to tune the solver in preproduction trial executions
Integrated Software
Today: each analysis is a “special effort”; optimization, sensitivity analysis (e.g., for uncertainty quantification), and stability analysis to fully exploit and contextualize scientific results are rare
Future: analysis becomes an “inner loop” around which more sophisticated science-driven tasks are wrapped; PDE task functionality (e.g., residual evaluation, Jacobian evaluation, Jacobian inverse) needs to be exposed to optimization/sensitivity/stability algorithms. Paradigm: integrated software based on common distributed data structures
Lab-university collaborations to develop “Integrated Software Infrastructure Centers” (ISICs) and partner with application groups
For FY2002, 51 new projects at $57M/year total: approximately one-third for ISICs, a third for grid infrastructure and collaboratories, and a third for applications groups
5 Tflop/s IBM SP platforms, “Seaborg” at NERSC (#3 in the latest Top 500) and “Cheetah” at ORNL (being installed now), are available for SciDAC
Introducing the “Terascale Optimal PDE Simulations” (TOPS) ISIC: nine institutions, five years, 24 co-PIs
TOPS
Not just algorithms, but vertically integrated software suites: portable, scalable, extensible, tunable implementations, starring PETSc and hypre, among other existing packages
Driven by three applications SciDAC groups: LBNL-led “21st Century Accelerator” designs, ORNL-led core collapse supernovae simulations, and PPPL-led magnetic fusion energy simulations; intended for many others
Background of PETSc Library (in which the FUN3D example was implemented)
Developed by Balay, Gropp, McInnes & Smith (ANL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments; now joined by four additional staff (Buschelman, Kaushik, Knepley, Zhang) under SciDAC
Distributed data structures as fundamental objects: index sets, vectors/gridfunctions, and matrices/arrays
Iterative linear and nonlinear solvers, combinable modularly, recursively, and extensibly
Portable, and callable from C, C++, and Fortran
Uniform high-level API, with multi-layered entry
Aggressively optimized: copies minimized, communication aggregated and overlapped, caches and registers reused, memory chunks preallocated, inspector-executor model for repetitive tasks (e.g., gather/scatter)
See http://www.mcs.anl.gov/petsc
User Code/PETSc Library Interactions
(Diagram: the user code supplies the main routine, application initialization, function evaluation, Jacobian evaluation, and post-processing; the PETSc code supplies the timestepping solvers (TS), nonlinear solvers (SNES), and linear solvers (SLES), built on the PC and KSP components)
User Code/PETSc Library Interactions, continued
(Same diagram, with the user-supplied evaluation routines marked “To be AD code”, i.e., slated to be generated by automatic differentiation)
Background of Hypre Library (to be combined with PETSc under SciDAC)
Developed by Chow, Cleary & Falgout (LLNL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments; now joined by seven additional staff (Henson, Jones, Lambert, Painter, Tong, Treadway, Yang) under ASCI and SciDAC
Object-oriented design similar to PETSc
Concentrates on linear problems only
Richer in preconditioners than PETSc, with a focus on algebraic multigrid; includes other preconditioners, including sparse approximate inverse (ParaSails) and parallel ILU (Euclid)
See http://www.llnl.gov/CASC/hypre/
Hypre’s “Conceptual Interfaces”
(Diagram: linear system interfaces map different data layouts (structured, composite, block-structured, unstructured, CSR) onto linear solvers such as GMG, FAC, Hybrid, AMGe, and ILU)
Slide c/o E. Chow, LLNL
Sample of Hypre’s Scaled Efficiency
(Figure: scaled efficiency (0 to 1) vs. number of processors (up to ~4000) for PFMG-CG on ASCI Red with a 40x40x40 problem per processor, with separate curves for the setup and solve phases)
Slide c/o E. Chow, LLNL
Scope for TOPS
Design and implementation of “solvers”:
  time integrators, with sensitivity analysis: f(x, \dot{x}, t, p) = 0
  nonlinear solvers, with sensitivity analysis: F(x, p) = 0
  optimizers: \min_u \phi(x, u) \;\; \text{s.t.} \;\; F(x, u) = 0
  linear solvers: Ax = b
  eigensolvers: Ax = \lambda Bx
Software integration
Performance optimization
(Diagram: Optimizer, Sensitivity Analyzer, Time integrator, Nonlinear solver, Eigensolver, and Linear solver, with arrows indicating dependences among them)
Keyword: “Optimal”
Convergence rate nearly independent of discretization parameters: multilevel schemes for linear and nonlinear problems; Newton-like schemes for quadratic convergence of nonlinear problems
Convergence rate as independent as possible of physical parameters: continuation schemes; physics-based preconditioning
The solver is a key part, but not the only part, of the simulation that needs to be scalable
(Figure: time to solution vs. problem size (increasing with the number of processors), contrasting unscalable growth with flat, scalable behavior; example: Prometheus parallel multigrid on a steel/rubber composite, c/o M. Adams, Berkeley-Sandia)
TOPS Philosophy on PDEs
Solution of a system of PDEs is rarely a goal in itself
PDEs are solved to derive various outputs from specified inputs; the actual goal is characterization of a response surface or a design or control strategy
Together with analysis, sensitivities and stability are often desired
Tools for PDE solution should also support these related desires
TOPS Philosophy on Operators
A continuous operator may appear in a discrete code in many different instances
Optimal algorithms tend to be hierarchical and nested iterative
Processor-scalable algorithms tend to be domain-decomposed and concurrent iterative
The majority of progress towards the desired highly resolved, high-fidelity result occurs through cost-effective, low-resolution, low-fidelity, parallel-efficient stages
Operator abstractions and recurrence must be supported
Traditional Approach to Software Interoperability
Direct interfacing between different packages/libraries/apps: public interfaces are unique; many-to-many couplings require Many^2 interfaces; often a heroic effort to understand the details of both codes; not a scalable solution
(Diagram: data/mesh software (SUMAA3d, DAs, GRACE, Overture) coupled pairwise to linear solvers (Hypre, Trilinos, ISIS++, PETSc))
Slide c/o L. McInnes, ANL
CCA Approach: Common Interface Specification
Reduces the many-to-many problem to a many-to-one problem
Allows interchangeability and experimentation
Difficulties: interface agreement, functionality limitations, maintaining performance
(Diagram: the same data/mesh software (SUMAA3d, DAs, GRACE, Overture) and linear solvers (Hypre, Trilinos, ISIS++, PETSc) all connect through the common ESI interface)
Slide c/o L. McInnes, ANL
CCA Concept: SCMD (SPMD) Components
(Diagram: an MPI application using CCA for interaction between components A and B within the same address space on each process (Proc1, Proc2, Proc3, etc.); for example, an adaptive mesh component written by user1 and a solver component written by user2, with the direct connection supplied by the framework at compile/runtime)
Slide c/o L. McInnes, ANL
Conclusions
Domain decomposition and multilevel iteration are the dominant paradigm in contemporary terascale PDE simulation
Several freely available software toolkits exist and successfully scale to thousands of tightly coupled processors for problems on quasi-static meshes
Concerted efforts are underway to make elements of these toolkits interoperate and to allow expression of the best methods, which tend to be modular, hierarchical, recursive, and above all adaptive!
Many challenges loom at the “next scale” of computation
Undoubtedly, new theory and algorithms will be part of the solution
Acknowledgments
This talk was prepared despite interruptions by the …
Acknowledgments
Early supporters at ICASE: Bob Voigt, Yousuff Hussaini
Early supporters at NASA: Manny Salas, Jim Thomas
Acknowledgments
Collaborators or contributors: George Biros (NYU), Xiao-Chuan Cai (Univ. Colorado, Boulder), Paul Fischer (ANL), Al Geist (ORNL), Omar Ghattas (Carnegie Mellon), Dinesh Kaushik (ODU), Dana Knoll (LANL), Dimitri Mavriplis (ICASE), Kendall Pierson (Sandia), Henry Tufo (ANL); the AZTEC team at Sandia National Laboratories (John Shadid, Ray Tuminaro); the PETSc team at Argonne National Laboratory (Satish Balay, Bill Gropp, Lois McInnes, Barry Smith)
Sponsors: DOE, NASA, NSF
Computer resources: LLNL, LANL, SNL, NERSC
Related URLs
Personal homepage (papers, talks, etc.): http://www.math.odu.edu/~keyes
SciDAC initiative: http://www.science.doe.gov/scidac
TOPS project: http://www.math.odu.edu/~keyes/scidac
PETSc project: http://www.mcs.anl.gov/petsc
Hypre project: http://www.llnl.gov/CASC/hypre
ASCI platforms: http://www.llnl.gov/asci/platforms
Bibliography
Jacobian-Free Newton-Krylov Methods: Approaches and Applications, Knoll & Keyes, 2002, to be submitted to J. Comp. Phys.
Nonlinearly Preconditioned Inexact Newton Algorithms, Cai & Keyes, 2002, to appear in SIAM J. Sci. Comp.
High Performance Parallel Implicit CFD, Gropp, Kaushik, Keyes & Smith, 2001, Parallel Computing 27:337-362
Four Horizons for Enhancing the Performance of Parallel Simulations based on Partial Differential Equations, Keyes, 2000, Lect. Notes Comp. Sci., Springer, 1900:1-17
Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel CFD, Gropp, Keyes, McInnes & Tidriri, 2000, Int. J. High Performance Computing Applications 14:102-136
Achieving High Sustained Performance in an Unstructured Mesh CFD Application, Anderson, Gropp, Kaushik, Keyes & Smith, 1999, Proceedings of SC'99
Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in “Parallel Solution of Partial Differential Equations,” Springer, pp. 247-278
How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, pp. 286-297