Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures
Ananth Grama and Ahmed Sameh
Department of Computer Science, Purdue University
http://www.cs.purdue.edu/people/{sameh/ayg}
Linear Solvers Grant Kickoff Meeting, 9/26/06

Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures

Project Overview
- Identify sweet spots in the algorithm-architecture-programming model space for efficient sparse linear system solvers.
- Design a new class of highly scalable sparse solvers suited to petascale HPC systems.
- Develop analytical frameworks for performance prediction and projection.

Methodology
- Design generalized sparse solvers (direct, iterative, and hybrid) and evaluate their scaling/communication characteristics.
- Evaluate architectural features and their impact on scalable solver performance.
- Evaluate performance and productivity aspects of programming models -- PGAS languages (CAF, UPC) and MPI.

Challenges and Impact
- Generalizing the space of parallel sparse linear system solvers.
- Analysis and implementation on parallel platforms; performance projection to the petascale.
- Guidance for architecture and programming model design / performance envelope.
- Benchmarks and libraries for HPCS.

Milestones / Schedule
- Six-month target: comparative performance of solvers on multicore SMPs and clusters.
- 12-month target: evaluation of these solvers on the Cray X1, BlueGene, and JS20/21, for CAF/UPC/MPI implementations.
- Final deliverable: comprehensive evaluation of the scaling properties of existing (and novel) sparse solvers.

Introduction
- A critical aspect of high productivity is the identification of points/regions in the algorithm/architecture/programming model space that are amenable to implementation on petascale systems.
- This project aims at identifying such points for commonly used sparse linear system solvers, and at developing novel, more robust solvers.
- These novel solvers emphasize reduction in memory/remote accesses at the expense of (possibly) higher FLOP counts, yielding much better actual performance.

Project Rationale
- Sparse system solvers govern the overall performance of many CSE applications on HPC systems.
- The design of HPC architectures and programming models should be influenced by their suitability for such solvers and related kernels.
- The extreme need for concurrency on novel architectural models requires a fundamental re-examination of conventional sparse solvers.

Typical Computational Kernels for PDEs
- Integration
- Newton iteration
- Linear system solvers
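As a minimal illustration of how these kernels interact (the problem and discretization below are illustrative, not from the slides): a Newton iteration on a discretized nonlinear boundary-value problem, where each Newton step reduces to a banded linear solve.

```python
import numpy as np
from scipy.linalg import solve_banded

# Each Newton iteration of a discretized nonlinear PDE reduces to a banded
# linear solve; here the model problem u'' = exp(u), u(0) = u(1) = 0,
# on a uniform grid with 99 interior points.
n, h = 99, 1.0 / 100
u = np.zeros(n)

def residual(u):
    lap = (np.r_[u[1:], 0.0] - 2 * u + np.r_[0.0, u[:-1]]) / h**2
    return lap - np.exp(u)

for _ in range(20):
    # Tridiagonal Jacobian in solve_banded's (1, 1) banded storage layout.
    ab = np.zeros((3, n))
    ab[0, 1:] = 1.0 / h**2               # superdiagonal
    ab[1, :] = -2.0 / h**2 - np.exp(u)   # main diagonal
    ab[2, :-1] = 1.0 / h**2              # subdiagonal
    du = solve_banded((1, 1), ab, -residual(u))
    u += du
    if np.linalg.norm(du) < 1e-12:       # quadratic convergence reached
        break

print(np.linalg.norm(residual(u)) < 1e-8)  # True
```

Time integration of a PDE typically nests this structure: each implicit step calls Newton, and each Newton step calls a (banded or sparse) linear solver, which is why solver performance dominates overall application performance.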

Fluid Structure Interaction

NESSIE: Nanoelectronics Simulation Environment

- Transport/Electrostatics
- Multi-scale, multi-physics
- Multi-method

Numerical Parallel Algorithms:
- Linear solver (SPIKE)
- Eigenpair solver (TraceMin)
- Preconditioning strategies

Mathematical Methodologies:
- Finite element method
- Mode decomposition
- Multi-scale, non-linear numerical schemes

Applications (E. Polizzi, 1998-2005):
- 3D CNTs, 3D molecular devices
- 3D Si nanowires
- 3D III/V devices
- 2D MOSFETs

Simulation, Model Reduction and Real-Time Control of Structures

Fluid-Solid Interaction

Project Goals
- Develop generalizations of direct and iterative solvers, e.g., the SPIKE polyalgorithm.
- Implement such generalizations on various architectures (multicore, multicore SMPs, multicore SMP aggregates) and programming models (PGAS languages, messaging APIs).
- Analytically quantify performance and project to petascale platforms.
- Compare relative performance, identify architecture/programming model features, and guide algorithm/architecture/programming model co-design.

Background: Personnel
- Ahmed Sameh, Samuel Conte Professor of Computer Science, has worked on the development of parallel numerical algorithms for four decades.
- Ananth Grama, Professor and University Scholar, has worked both on numerical aspects of sparse solvers and on analytical frameworks for parallel systems.
- A (to be named) Postdoctoral Researcher* will be primarily responsible for implementation and benchmarking.

*We have identified three candidates for this position and will shortly be hiring one of them.

Background: Technical
- We have built extensive infrastructure for parallel sparse solvers, including the SPIKE parallel toolkit, augmented-spectral ordering techniques, and multipole-based preconditioners.
- We have diverse hardware infrastructure, including Intel/AMD multicore SMP clusters, JS20/21 blade servers, BlueGene/L, and the Cray X1.

Background: Technical
- We have initiated the installation of Co-Array Fortran and Unified Parallel C on our machines and are porting our toolkits to these PGAS languages.
- We have extensive experience in the analysis of performance and scalability of parallel algorithms, including development of the isoefficiency metric for scalability.
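The isoefficiency metric mentioned above can be illustrated with the textbook example of adding n numbers on p processors; the unit add and communication costs below are an assumption of this sketch, not figures from the slides.

```python
import math

# Isoefficiency illustration: adding n numbers on p processors with
# unit-time addition and communication gives
#   T_serial   = n - 1
#   T_parallel = n/p + 2 log2(p)   (local sums + reduction tree)
def efficiency(n, p):
    t_serial = n - 1
    t_parallel = n / p + 2 * math.log2(p)
    return t_serial / (p * t_parallel)

# Fixed problem size: efficiency decays as p grows.
print([round(efficiency(4096, p), 2) for p in (4, 16, 64)])
# -> [1.0, 0.97, 0.84]

# Growing the work as W = Theta(p log p) holds efficiency constant --
# p log p is the isoefficiency function of this computation.
print([round(efficiency(64 * p * math.log2(p), p), 2) for p in (4, 16, 64)])
# -> [0.97, 0.97, 0.97]
```

The same analysis applied to sparse solvers quantifies how fast the problem size must grow with processor count to sustain efficiency, which is the basis for the petascale projections proposed in this project.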

Technical Highlights

The SPIKE Toolkit

SPIKE: Introduction
- Engineering problems usually produce large sparse linear systems.
- A banded structure (or banded with low-rank perturbations) is often obtained after reordering.
- SPIKE partitions the banded matrix into a block tridiagonal form.
- Each partition is associated with one CPU or one node, enabling multilevel parallelism.

after RCM reordering

SPIKE: Introduction

SPIKE: Introduction
- System A X = F, with A banded (n x n) and coupling blocks Bj, Cj (m x m), m << n.
- Premultiplying by the block-diagonal inverse gives S X = diag(A1^-1, ..., Ap^-1) F.
- Solve the reduced system (of size (p-1) x 2m), then retrieve the full solution.
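The preprocessing step above can be checked numerically: premultiplying a banded matrix by the inverse of its block-diagonal part D = diag(A1, ..., Ap) yields the spike matrix S with identity diagonal blocks, and solving A x = f is equivalent to solving S x = D^-1 f. A small dense sketch (sizes and the random matrix are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 12, 3, 1            # order, partitions, half-bandwidth
k = n // p                    # partition size

# Banded test matrix, made comfortably diagonally dominant.
A = np.zeros((n, n))
for i in range(n):
    for j in range(max(0, i - m), min(n, i + m + 1)):
        A[i, j] = rng.standard_normal()
    A[i, i] += 10.0

# Block-diagonal part D = diag(A1, ..., Ap).
D = np.zeros_like(A)
for blk in range(p):
    s = slice(blk * k, (blk + 1) * k)
    D[s, s] = A[s, s]

# Spike matrix S = D^-1 A: identity diagonal blocks plus the "spikes"
# generated by the coupling blocks Bj, Cj.
S = np.linalg.solve(D, A)

# Solving A x = f is equivalent to solving S x = D^-1 f.
f = rng.standard_normal(n)
x = np.linalg.solve(S, np.linalg.solve(D, f))
print(np.allclose(A @ x, f))  # True
```

In the actual algorithm S is of course never formed densely: only the spikes (and the small reduced system coupling them) are computed, which is where the variants on the next slide differ.
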
SPIKE: A Hybrid Algorithm
The spikes can be computed:
- explicitly (fully or partially)
- on the fly
- approximately
The diagonal blocks can be solved:
- directly (dense LU, Cholesky, or their sparse counterparts)
- iteratively (with a preconditioning strategy)
The reduced system can be solved:
- directly (recursive SPIKE)
- iteratively (with a preconditioning scheme)
- approximately (truncated SPIKE)
Different choices apply depending on the properties of the matrix and/or the platform architecture.

The SPIKE Algorithm
Hierarchy of computational modules (systems dense within the band)

SPIKE versions

SPIKE Hybrids
1. SPIKE version:
R = recursive
E = explicit
F = on-the-fly
T = truncated

2. Factorization:
No pivoting:
L = LU
U = LU & UL
A = alternate LU & UL
Pivoting:
P = LU

3. Solution improvement:
0 = direct solver only
2 = iterative refinement
3 = outer BiCGStab iterations
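Under this naming scheme, a variant label such as the "RL0" used in the ScaLAPACK comparison can be decoded field by field; the helper below is purely illustrative and not part of the SPIKE toolkit.

```python
# Illustrative lookup tables for the three fields of a SPIKE variant code:
# spike computation, factorization, and solution improvement.
SPIKE_CODES = (
    {"R": "recursive", "E": "explicit", "F": "on-the-fly", "T": "truncated"},
    {"L": "LU (no pivoting)", "U": "LU & UL (no pivoting)",
     "A": "alternate LU & UL (no pivoting)", "P": "LU (pivoting)"},
    {"0": "direct solver only", "2": "iterative refinement",
     "3": "outer BiCGStab iterations"},
)

def decode(version):
    """Expand e.g. 'RL0' into its version/factorization/improvement fields."""
    return [SPIKE_CODES[i][c] for i, c in enumerate(version)]

print(decode("RL0"))
# -> ['recursive', 'LU (no pivoting)', 'direct solver only']
```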

SPIKE On-the-Fly
- Does not require generating the spikes explicitly.
- Ideally suited for banded systems with large sparse bands.
- The reduced system is solved iteratively, with or without a preconditioning strategy.

Numerical Experiments (dense within the band)
Computing platforms:
- 4-node Linux Xeon cluster
- 512-processor IBM SP
Performance comparison with LAPACK and ScaLAPACK.

[Figure: speed improvement, SPIKE (RL0) vs. ScaLAPACK on the IBM SP; b = 401, one RHS.]
SPIKE: Scalability

[Figure: SPIKE partitioning, 4-processor example. Two partitionings of the matrix into blocks A1, A2, A3 with coupling blocks B1, B2, C2, C3; processors 1, 2-3, and 4. Factorizations (w/o pivoting): LU; UL (p = 2) and LU (p = 3); UL.]

SPIKE: Small Number of Processors
- ScaLAPACK needs at least 4-8 processors to perform as well as LAPACK.
- SPIKE benefits from the LU-UL strategy and realizes speed improvement over LAPACK on 2 or more processors.

4-node Intel Xeon Linux cluster with InfiniBand interconnect:
- two 3.2 GHz processors per node, 4 GB of memory per node, 1 MB cache, 64-bit arithmetic
- Intel Fortran, Intel MPI
- Intel MKL libraries for LAPACK and ScaLAPACK
Test system: n = 960,000, b = 201, diagonally dominant.

General Sparse Systems
- Science and engineering problems often produce large sparse linear systems.
- A banded structure (or banded with low-rank perturbations) is often obtained after reordering.

While most ILU preconditioners depend on reordering strategies that minimize fill-in in the factorization stage, we propose to:
- extract a narrow banded matrix, via a reordering-dropping strategy, to be used as a preconditioner;
- make use of an improved SPIKE on-the-fly scheme that is ideally suited for banded systems that are sparse within the band.

Sparse parallel direct solvers and banded systems
N = 432,000, b = 177, nnz = 7,955,116, fill-in of the band: 10.4%
(* memory swap -- too much fill-in)

SuperLU timings (s):
                    Reordering   Factorization   Solve
1 CPU  (1 node)     10.3         17*             8.4*
2 CPUs (2 nodes)    8.2          37*             12.5*
4 CPUs (4 nodes)    8.2          41*             16.7*
8 CPUs (4 nodes)    7.6          178*            15.9*

MUMPS timings (s):
                    Reordering   Factorization   Solve
1 CPU  (1 node)     16.2         6.3             0.8
2 CPUs (2 nodes)    17.2         4.2             0.6
4 CPUs (4 nodes)    17.8         3               0.55
8 CPUs (4 nodes)    17.7         20*             1.9*

Multilevel Parallelism: SPIKE calling MKL-PARDISO for banded systems that are sparse within the band

This SPIKE hybrid scheme exhibits better performance than other parallel direct sparse solvers used alone.

SPIKE-MKL on-the-fly for systems that are sparse within the band
N = 432,000, b = 177, nnz = 7,955,116, sparsity of the band: 10.4%
MUMPS: time (2 nodes) = 21.35 s; time (4 nodes) = 39.6 s
For narrow banded systems, SPIKE will consider the matrix dense within the band. Reordering schemes for minimizing the bandwidth can be used if necessary.

N = 471,800, b = 1455, nnz = 9,499,744, sparsity of the band: 1.4%
Good scalability using the on-the-fly SPIKE scheme.

A Preconditioning Approach

- Reorder the matrix to bring most of the elements within a band via HSL's MC64 (to maximize the sum or product of the diagonal elements) and RCM (reverse Cuthill-McKee) or MD (minimum degree).
- Extract a band from the reordered matrix to use as a preconditioner.
- Use an iterative method (e.g., BiCGStab) to solve the linear system (outer solver).
- Use SPIKE to solve the systems involving the preconditioner (inner solver).
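A minimal SciPy sketch of this outer/inner structure, under stated assumptions: the test matrix, bandwidth, and tolerances are illustrative, and a direct LU of the extracted band stands in for the SPIKE inner solve.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

rng = np.random.default_rng(1)
n, band = 500, 2   # matrix order and preconditioner half-bandwidth (illustrative)

# Sparse test matrix: a dominant banded part plus scattered weak entries,
# standing in for a matrix after an MC64/RCM-style reordering.
A = sp.random(n, n, density=0.001, random_state=1, format="lil")
for d in range(-band, band + 1):
    A.setdiag(rng.standard_normal(n - abs(d)), d)
A.setdiag(A.diagonal() + 10.0)   # keep it comfortably diagonally dominant
A = A.tocsc()

# Extract the band |i - j| <= band and factor it once; this banded solve
# is the inner preconditioner (SPIKE in the project, plain LU here).
offsets = list(range(-band, band + 1))
P = sp.diags([A.diagonal(d) for d in offsets], offsets, format="csc")
M = spla.LinearOperator((n, n), spla.splu(P).solve)

# Outer BiCGStab iteration, preconditioned by the banded solve.
b = rng.standard_normal(n)
x, info = spla.bicgstab(A, b, M=M)
print(info == 0, np.linalg.norm(A @ x - b) / np.linalg.norm(b) < 1e-3)
```

Because the dropped entries are small relative to the band, the preconditioned operator is close to the identity and the outer iteration converges in very few steps; the quality/cost trade-off of the extracted band is exactly what the reordering-dropping strategy controls.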

Matrix DW8192 (square dielectric waveguide): order N = 8192, NNZ = 41,746, condest(A) = 1.39e+07, sparsity = 0.0622%, (A + A') indefinite

GMRES + ILU preconditioner vs. BiCGStab with banded preconditioner

GMRES w/o preconditioner: > 5000 iterations
GMRES + ILU (no fill-in): fails
GMRES + ILU (1