Author
barny
View
40
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures. Ananth Grama and Ahmed Sameh Department of Computer Science Purdue University. http://www.cs.purdue.edu/people/{sameh/ayg}. Linear Solvers Grant Kickoff Meeting, 9/26/06. - PowerPoint PPT Presentation
Evaluating Sparse Linear System
Solvers on Scalable Parallel Architectures
Ananth Grama and Ahmed Sameh Department of Computer Science
Purdue University.http://www.cs.purdue.edu/people/{sameh/ayg}
Linear Solvers Grant Kickoff Meeting, 9/26/06.
Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures
Project Overview• Identify sweet spots in algorithm-architecture-programming model space for efficient sparse linear system solvers.• Design a new class of highly scalable sparse solvers suited to petascale HPC systems.• Develop analytical frameworks for performance prediction and projection.
Methodology
• Design generalized sparse solvers (direct, iterative, and hybrid) and evaluate their scaling/communication characteristics.
• Evaluate architectural features and their impact on scalable solver performance.
• Evaluate performance and productivity aspects of programming models -- PGAs (CAF, UPC) and MPI.
Challenges and Impact
• Generalizing the space of parallel sparse linear system solvers.
• Analysis and implementation on parallel platforms
• Performance projection to the petascale
•Guidance for architecture and programming model design / performance envelope.
• Benchmarks and libraries for HPCS.
Milestones / Schedule
• Final deliverable: Comprehensive evaluation of scaling properties of existing (and novel sparse solvers).
• Six month target: Comparative performance of solvers on multicore SMPs and clusters.
• 12-month target: Evaluation on these solvers on Cray X1, BG, JS20/21, for CAF/UPC/MPI implementations.
Introduction
• A critical aspect of High-Productivity is the identification of points/regions in the algorithm/ architecture/ programming model space that are amenable for implementation on petascale systems.
• This project aims at identifying such points for commonly used sparse linear system solvers, and at developing more robust novel solvers.
• These novel solvers emphasize reduction in memory/remote accesses at the expense of (possibly) higher FLOP counts – yielding much better actual performance.
Project Rationale
• Sparse system solvers govern the overall performance of many CSE applications on HPC systems.
• Design of HPC architectures and programming models should be influenced by their suitability for such solvers and related kernels.
• Extreme need for concurrency on novel architectural models require fundamental re-examination of conventional sparse solvers.
Typical Computational Kernels for PDE’sIntegration
Newton Iteration
Linear system solvers
k
k
t
Fluid Structure Interaction
NESSIE – Nanoelectronics Simulation Environment
Transport/ElectrostaticsMulti-scale, Multi-physics
Multi-method
Numerical Parallel Algorithms:
Linear solver (SPIKE),Eigenpairs solver (TraceMin), preconditioning strategies,…
Mathematical Methodologies:
Finite Element Method, mode decomposition, multi-scale, non-linear numerical schemes,…
S DCNT channel
Gate insulatorGate
tc
18nm
tox=20nm
Lc=30nm
Top Gate
ε2=16
ε1=3.9
3D CNTs-3D Molec. Devices 3D Si Nanowires 3D III/V Devices 2D MOSFETs
E. Polizzi (1998-2005)
Simulation, Model Reduction and Real-Time Control of Structures
Fluid-Solid Interaction
Project Goals• Develop generalizations of direct and iterative
solvers – e.g. the Spike polyalgorithm.• Implement such generalizations on various
architectures (multicore, multicore SMPs, multicore SMP aggregates) and programming models (PGAs, Messaging APIs)
• Analytically quantify performance and project to petascale platforms.
• Compare relative performance, identify architecture/programming model features, and guide algorithm/ architecture/ programming model co-design.
Background
• Personnel:– Ahmed Sameh, Samuel Conte Professor of Computer
Science, has worked on development of parallel numerical algorithms for four decades.
– Ananth Grama, Professor and University Scholar, has worked both on numerical aspects of sparse solvers, as well as analytical frameworks for parallel systems.
– (To be named – Postdoctoral Researcher)* will be primarily responsible for implementation and benchmarking.
*We have identified three candidates for this position and will shortly be hiring one of them.
Background…
• Technical– We have built extensive infrastructure for
parallel sparse solvers – including the Spike parallel toolkit, augmented-spectral ordering techniques, and multipole-based preconditioners
– We have diverse hardware infrastructure, including Intel/AMP multicore SMP clusters, JS20/21 Blade servers, BlueGene/L, Cray X1.
Background…
• Technical…– We have initiated installation of Co-Array
Fortran and Unified Parallel C on our machines and porting our toolkits to these PGAs.
– We have extensive experience in analysis of performance and scalability of parallel algorithms, including development of the isoefficiency metric for scalability.
Technical Highlights
The SPIKE Toolkit
SPIKE: Introduction
• Engineering problems usually produce large sparse linear systems• Banded (or banded with low-rank perturbations) structure is often
obtained after reordering• SPIKE partitions the banded matrix into a block tridiagonal form• Each partition is associated with one CPU, or one node multilevel parallelism
after RCM reordering
SPIKE: Introduction…
SPIKE: Introduction …
Reduced system ((p-1) x 2m)
Retrieve solution
AX=F SX=diag(A1-1,…,Ap
-1) F
A(n x n)Bj, Cj (m x m), m<<n
W2
V1
V2
V3W3
W4
SPIKE: A Hybrid Algorithm
• The spikes can be computed:– Explicitly (fully or partially)– On the Fly– Approximately
• The diagonal blocks can be solved:– Directly (dense LU, Cholesky,
or sparse counterparts)– Iteratively (with a
preconditioning strategy)• The reduced system can be
solved:– Directly (Recursive SPIKE)– Iteratively (with a
preconditioning scheme)– Approximately (Truncated
SPIKE)
Different choices depending on the properties of the matrix and/or platform architecture
The SPIKE algorithm
Level Description
3 SPIKE
2 LAPACK blocked algorithms
1 Primitives for banded matrices
(our own)
0 BLAS3 (matrix-matrix primitives)
Hierarchy of Computational Modules(systems dense within the band)
SPIKE versionsAlgorithm
Factorization
E
Explicit
R
Recursive
T
Truncated
F
on the Fly
P
LU w/
pivoting
Explicit generation of spikes- reduced system is solved iteratively with a preconditioner.
Explicit generation of spikes- reduced system is solved directly using recursive SPIKE
Implicit generation of reduced system which is solved on-the-fly using an iterative method.
L
LU w/o pivoting
Explicit generation of spikes- reduced system is solved iteratively with a preconditioner.
Explicit generation of spikes- reduce system is solved directly using recursive SPIKE
Truncated generation of spike tips: Vb is exact, Wt is approx.- reduced system is solved directly
Implicit generation of reduced system which is solved on-the-fly using an iterative method.
U
LU and UL
w/o pivot.
Truncated generation of spike tips: Vb, Wt are exact- reduced system is solved directly
Implicit generation of reduced system which is solved on-the-fly using an iterative method with precond.
A
alternate
LU / UL
Explicit generation of spikes using new partitioning- reduced system is solved iteratively with a preconditioner.
Truncated generation of spikes using new partitioning- reduced system is solved directly
SPIKE Hybrids1. SPIKE versions
R = recursive E = explicitF = on-the-fly T = truncated
2. FactorizationNo pivoting:
L = LUU = LU & UL A = alternate LU & UL
Pivoting: P = LU
3. Solution improvement:– 0 direct solver only– 2 iterative refinement– 3 outer Bicgstab iterations
SPIKE “on-the-fly” • Does not require generating the spikes explicitly
Ideally suited for banded systems with large sparse bands
• The reduced system is solved iteratively with/without a preconditioning strategy
iiiiii CIA,IFBIAI,E
00
00 11
IFEIHGIF
EIHGIF
EI
S
4
33
33
22
22
1
00
0000
00
iiiiii CIAI,HBIA,IG
00
00 11
Numerical Experiments(dense within the band)
–Computing platforms• 4 nodes Linux Xeon cluster
• 512 processor IBM-SP
–Performance comparison w/ LAPACK, and ScaLAPACK
# procs. 8 16 32 64 128 256 512
N=0.5 M 7.2 8.2 7.9 7.0 6.0 5.0 4.1
N=1M 8.4 8.2 7.7 6.9 5.7 4.8
N=2M 8.3 8.0 7.6 6.0 5.4
N=4M 8.2 8.0 7.4 6.4
Speed improvement Spike vs ScalapackIBM-SP Spike (RL0)
b=401; RHS=1;
4
5
6
7
8
9
8 16 32 64 128 256 512
# processors
Tscal/
Tsp
ike
0.5 M
1M
2M
3M
SPIKE: Scalability
SPIKE partitioning - 4 processor example
A1
A2
A3
A4
C2
C3
C4
B1
B2
B3
Processors
1
2
3
4
Factorizations (w/o pivoting)
LU
LU and UL
LU and UL
UL
A1
A2
A3
C2
C3
B1
B2
Processors
1
2,3
4
Factorizations (w/o pivoting)
LUUL (p=2); LU (p=3)
UL
Partitioning -- 1
Partitioning -- 2
SPIKE: Small number of processors• ScaLAPACK needs at least 4-8 processors to perform as
well as LAPACK. • SPIKE benefits from the LU-UL strategy and realizes speed
improvement over LAPACK on 2 or more processors.
00.5
11.52
2.53
3.54
4.55
1 CPU 2 CPU 4 CPU 8 CPU
Spee
d-u
p /
LAPACK
LAPACKScaLAPACKSPIKE TruncatedSPIKE-new
4-node Xeon Intel Linux cluster with infiniband interconnection - Two 3.2 Ghz processors per node; 4 GB of memory/ node; 1 MB cache; 64-bit arithmetic.Intel fortran, Intel MPIIntel MKL libraries for LAPACK, SCALAPACK
n=960,000b=201
System is diag. dominant
General sparse systems
After RCM reordering
• Science and engineering problems often produce large sparse linear systems
• Banded (or banded with low-rank perturbations) structure is often obtained after reordering
• While most ILU preconditioners depend on reordering strategies that minimize the fill-in in the factorization stage, we propose to:– extract a narrow banded matrix, via a reordering-dropping strategy,
to be used as a preconditioner.– make use of an improved SPIKE “on-the-fly” scheme that is ideally
suited for banded systems that are sparse within the band.
Sparse parallel direct solvers and banded systems
PARDISO Reordering Factorization Solve
1 CPU- (1 Node) 19.0 4.13 0.83
2 CPU- (1 Node) 18.3 3.11 0.83
SuperLU Reordering Factorization Solve1 CPU - (1 Node) 10.3 17* 8.4*2 CPU- (2 Nodes) 8.2 37* 12.5*4 CPU- (4 Nodes) 8.2 41* 16.7*8 CPU- (4 Nodes) 7.6 178* 15.9*
MUMPS Reordering Factorization Solve1 CPU - (1 Node) 16.2 6.3 0.82 CPU- (2 Nodes) 17.2 4.2 0.64 CPU- (4 Nodes) 17.8 3 0.558 CPU- (4 Nodes) 17.7 20* 1.9*
* Memory swap –too much fill-in
N=432,000, b= 177, nnz= 7, 955, 116, fill-in of the band: 10.4%
Multilevel Parallelism: SPIKE calling MKL-PARDISO for banded systems that are sparse
within the band
Node 1 Node 2 Node 3 Node 4
Pardiso Pardiso Pardiso Pardiso
SPIKEA1
A2
A3
A4
C2
C3
C4
B1
B2
B3
This SPIKE hybrid scheme exhibits better performance than other parallel direct sparse solvers used alone.
SPIKE-MKL ”on-the-fly” for systems that are sparse within the bandN=432,000, b= 177, nnz= 7, 955, 116, sparsity of the band: 10.4%
MUMPS: time (2-nodes) = 21.35 s; time (4-nodes) = 39.6 s For narrow banded systems, SPIKE will consider the matrix dense within the band. Reordering schemes for minimizing the bandwidth can be used if necessary.
N=471,800, b= 1455, nnz= 9, 499, 744, sparsity of the band: 1.4%
Good scalability using “on-the-fly” SPIKE scheme
A Preconditioning Approach
1. Reorder the matrix to bring most of the elements within a band via
• HSL’s MC64 (to maximize sum or product of the diagonal elements)
• RCM (Reverse Cuthill-McKee) or MD (Minimum Degree)
2. Extract a band from the reordered matrix to use as preconditioner.
3. Use an iterative method (e.g. Bicgstab) to solve the linear system (outer solver).
4. Use SPIKE to solve the system involving the preconditioner (inner solver).
Matrix DW8192, order N = 8192, NNZ = 41746, Condest (A) = 1.39e+07, Square Dielectric Waveguide Sparsity = 0.0622% , (A+A’) indefinite
GMRES + ILU preconditionervs. Bicgstab with banded
preconditioner
• Gmres w/o precond: >5000 iterations– Gmres + ILU (no fill-in): Fails – Gmres + ILU (1 e-3): Fails– Gmres + ILU (1 e-4): 16 iters., ||r|| ~ 10-5
• NNZ(L) = 612,889• NNZ(U) = 414,661
• Spike - Bicgstab: – preconditioner of bandwidth = 70 – 16 iters., ||r|| ~ 10-7
• NNZ = 573,440
Comparisons on a 4-node Xeon Linux Cluster (2 CPUs/node)
• SuperLU:– 1 node -- 3.56 s– 2 nodes-- 1.87 s– 4 nodes-- 3.02 s (memory limitation)
• MUMPS:– 1 node -- 0.26 s– 2 nodes-- 0.36 s– 4 nodes-- 0.34 s
• SPIKE – RL3 (bandwidth = 129):– 1 node -- 0.28 s– 2 nodes-- 0.27 s– 4 nodes-- 0.20 s– Bicgstab required 6 iterations only– Norm of rel. res. ~ 10-6
Matrix DW8192
Matrix DW8192
Sparse Matvec kernel needs fine-tuning
Sparse Matvec kernel needs fine-tuning
Matrix: BMW7st_1(stiffness matrix)
Order N =141,347, NNZ = 7,318,399
after RCM reordering
Comparisons on a 4-node Xeon Linux Cluster (2 CPUs/node)
• SuperLU:– 1 node -- 26.56 s– 2 nodes -- 21.17 s– 4 nodes -- 45.03 s
• MUMPS:– 1 node -- 09.71 s– 2 nodes -- 08.03 s– 4 nodes -- 08.05 s
• SPIKE – TL3 (bandwidth = 101):– 1 node -- 02.38 s– 2 nodes -- 01.62 s– 4 nodes -- 01.33 s– Bicgstab required 5 iterations only– Norm of rel. res. ~ 10-6
Matrix:BMW7st_1is diag. dominant
Sparse Matvec kernel needs fine-tuning
Reservoir Modeling
Bicgstab + ILU (‘0’, fill-in) NNZ = 50,720
Matrix after RCM reordering
Bicgtab + banded preconditioner bandwidth = 61; NNZ = 12,464
Driven Cavity Example
• Square domain: 1 x, y 1
B.CS.: ux = uy = 0 on x, y = 1 ; x = 1
ux = 1, uy = 0 on y = 1
• Picard’s iterations; Q2/Q1 elements: Linearized equations (Oseen problem)
• G 0
• E = As = (A + AT)/2
• viscosity: 0.1, 0.02, 0.01,0.002
Linear system for 128 x 128 grid
after spectral reordering
MA48 as a preconditioner for GMRES on a uniprocessor
Droptol # of iter. T(factor.) T(GMRES iters.) nnz(L+U) final residual v
0 2 142.7 0.3 7,195,157 2.07E-13 1/50
1.00E-04 24 115.5 1.4 2,177,181 2.83E-09 1/50
1.00E-03 248 109.6 17.3 1,611,030 5.11E-08 1/50
** for drop tolerance = .0001, total time ~ 117 sec.** No convergence for a drop tolerance of 10-2
Spike – Pardiso for solving systems involving the banded
preconditioner• On 4 nodes (8 CPU’s) of a Xeon-Intel cluster:
– bandwidth of extracted preconditioner = 1401– total time = 16 sec.– # of Gmres iters. = 131– 2-norm of residual = 10-7
• Speed improvement over sequential procedure with drop tolerance = .0001:– 117/16 ~ 7.3
Technical Highlights…
• Analysis of Scaling Properties– In early work, we developed the Isoefficiency
metric for scalability.– With the likely scenario of utilizing up to 100K
processing cores, this work becomes critical.– Isoefficiency quantifies the performance of a
parallel system (a parallel program and the underlying architecture) as the number of processors is increased.
Technical Highlights…• Isoefficiency Analysis
– The efficiency of any parallel program for a given problem instance goes down with increasing number of processors.
– For a family of parallel programs (formally referred to as scalable programs), increasing the problem size results in an increase in efficiency.
Technical Highlights…• Isoefficiency is the rate at which problem
size must be increased w.r.t. number of processors, to maintain constant efficiency.
• This rate is critical, since it is ultimately limited by total memory size.
• Isoefficiency is a key indicator of a program’s ability to scale to very large machine configurations.
• Isoefficiency analysis will be used extensively for performance projection and scaling properties.
Architecture
• We target the following currently available architectures
– IBM JS20/21 and BlueGene/L platforms– Cray X1/XT3– AMD Opteron multicore SMP and SMP clusters– Intel Xeon multicore SMP and SMP clusters
• These platforms represent a wide range of currently available architectural extremes.
_______________________________________________
Intel, AMD, and IBM JS21 platforms are available locally at Purdue. We intend to get access to the BlueGene at LLNL and the Cray XT3 at ORNL.
Implementation
• Current implementations are MPI based.
• The Spike tooklit will be ported to– POSIX and OpenMP– UPC and CAF– Titanium and X10 (if releases are available)
• These implementations will be comprehensively benchmarked across platforms.
Benchmarks/Metrics
• We aim to formally specify a number of benchmark problems (sparse systems arising in nanoelectronics, structural mechanics, and fluid-structure interaction)
• We will abstract architecture characteristics – processor speed, memory bandwidth, link bandwidth, bisection bandwidth (some of these abstractions can be derived from HPCC benchmarks).
• We will quantify solvers on the basis of wall-clock time, FLOP count, parallel efficiency, scalability, and projected performance to petascale systems.
Benchmarks/Metrics
• Other popular benchmarks such as HPCC do not address sparse linear solvers in meaningful ways.
• HPCC comprises of seven benchmark tests – HPL, DGEMM, STREAM, PTRANS, RandomAccess, FFT, and Communication bandwidth and latency.
• They only peripherally address underlying performance issues in sparse system solvers.
Progress/Accomplishments
• Implementation of the parallel Spike polyalgorithm toolkit.
• Incorporation of a number of parallel direct and preconditioned iterative solvers (e.g. SuperLU, MUMPS, and Pardiso) into the Spike toolkit.
• Evaluation of Spike on the IBM/SP, SGI Altix, Intel Xeon clusters, and Intel multicore platforms.
Milestones
• Final deliverable: Comprehensive evaluation of scaling properties of existing (and new solvers).
• Six month target: Comparative performance of solvers on multicore SMPs and clusters.
• Twelve-month target: Comprehensive evaluation on Cray X1, BG, JS20/21, of CAF/UPC/MPI implementations.
Financials
• The total cost of this project is approximately $150K for its one-year duration.
• The budget primarily accounts for a post-doctoral researcher’s salary/benefits and minor summer-time for the PIs.
• Together, these three project personnel are responsible for accomplishing project milestones and reporting.
Concluding Remarks
• This project takes a comprehensive view of parallel sparse linear system solvers and their suitability for petascale HPC systems.
• Its results directly influence ongoing and future development of HPC systems.
• A number of major challenges are likely to emerge, both as a result of this project, and from impending architectural innovations.
Concluding Remarks…
• Architectural innovations on the horizon– Scalable multicore platforms: 64 to 128 cores on the
horizon– Heterogeneous multicore: It is likely that cores will be
heterogeneous – some with floating point units, others with vector units, yet others with programmable hardware (indeed such chips are commonly used in cell phones)
– Significantly higher pressure on the memory subsystem
Concluding Remarks…
• Challenges for programming models and runtime systems:– Affinity scheduling is important for
performance – need to specify tasks that must be co-scheduled (suitable programming abstractions needed).
– Programming constructs for utilizing heterogeneity.
Concluding Remarks…
• Challenges for algorithm and application development– FLOPS are cheap, memory references are expensive
– explore new families of algorithms that optimize for (minimize) the latter
– Algorithmic techniques and programming constructs for specifying algorithmic asynchrony (used to mask system latency)
– Many of the optimizations are likely to be beyond the technical reach of applications programmers – need for scalable library support
– Increased emphasis on scalability analysis