High Performance Computing: Concepts, Methods & Means HPC Libraries. Hartmut Kaiser PhD Center for Computation & Technology Louisiana State University April 19 th , 2007. Outline. Introduction to High Performance Libraries Linear Algebra Libraries (BLAS, LAPACK) PDE Solvers (PETSc) - PowerPoint PPT Presentation
High Performance Computing: Concepts, Methods & Means
HPC Libraries
Hartmut Kaiser, PhD
Center for Computation & Technology
Louisiana State University
April 19th, 2007
Outline
• Introduction to High Performance Libraries
• Linear Algebra Libraries (BLAS, LAPACK)
• PDE Solvers (PETSc)
• Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE)
• Special purpose libraries (FFTW)
• General purpose libraries (C++: Boost)
• Summary – Materials for test
2
Puzzle of the Day
#include <stdio.h>

int main()
{
    int a = 10;
    switch (a) {
    case '1':
        printf("ONE\n");
        break;
    case '2':
        printf("TWO\n");
        break;
    defa1ut:
        printf("NONE\n");
    }
    return 0;
}
4
If you expect the output of the above program to be NONE, I would request you to check it out!
Application domains
• Linear algebra
  – BLAS, ATLAS, LAPACK, ScaLAPACK, Slatec, pim
• Ordinary and partial differential equations
  – PETSc
• Mesh manipulation and load balancing
  – METIS, ParMETIS, CHACO, JOSTLE, PARTY
• Graph manipulation
  – Boost.Graph library
• Vector/Signal/Image processing
  – VSIPL, PSSL
• General parallelization
  – MPI, pthreads
• Other domain specific libraries
  – NAMD, NWChem, Fluent, Gaussian, LS-DYNA
5
Application Domain Overview
• Linear Algebra Libraries
  – Provide optimized methods for constructing sets of linear equations, performing operations on them (matrix-matrix products, matrix-vector products) and solving them (factoring, forward & backward substitution).
  – Commonly used libraries include BLAS, ATLAS, LAPACK, ScaLAPACK, PLAPACK
• PDE Solvers
  – General-purpose, parallel numerical PDE libraries
  – Usual toolsets include manipulation of sparse data structures, iterative linear system solvers, preconditioners, nonlinear solvers and time-stepping methods.
  – Commonly used libraries for solving PDEs include SAMRAI, PETSc, PARASOL, Overture, among others.
6
Application Domain Overview
• Mesh manipulation and Load Balancing
  – These libraries help partition meshes into roughly equal-sized pieces across processors, thereby balancing the workload while minimizing the size of separators and communication costs.
  – Commonly used libraries for this purpose include METIS, ParMETIS, Chaco, JOSTLE, among others.
• Other packages:
  – FFTW: a highly optimized Fourier transform package including both real and complex multidimensional transforms in sequential, multithreaded, and parallel versions.
  – NAMD: molecular dynamics library available for Unix/Linux, Windows, OS X
  – Fluent: computational fluid dynamics package, used for such applications as environment control systems, propulsion, reactor modeling etc.
7
BLAS
• (Updated set of) Basic Linear Algebra Subprograms
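• The BLAS functionality is divided into three levels:
  – Level 1: contains vector operations of the form $\mathbf{y} \leftarrow \alpha \mathbf{x} + \mathbf{y}$, as well as scalar dot products and vector norms
  – Level 2: contains matrix-vector operations of the form $\mathbf{y} \leftarrow \alpha A \mathbf{x} + \beta \mathbf{y}$, as well as solving $T \mathbf{x} = \mathbf{y}$ for $\mathbf{x}$ with $T$ being triangular
  – Level 3: contains matrix-matrix operations of the form $C \leftarrow \alpha A B + \beta C$, as well as solving $B \leftarrow \alpha T^{-1} B$ for triangular matrices $T$. This level contains the widely used General Matrix Multiply (GEMM) operation.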
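The three levels can be pictured as one, two and three nested loops. The sketch below is a plain reference version for illustration only, not the BLAS API: the names `ref_axpy`, `ref_gemv` and `ref_gemm` are hypothetical, and row-major storage is an assumption of this example. It uses the standard forms y ← αx + y, y ← αAx + βy and C ← αAB + βC.

```c
#include <assert.h>
#include <stddef.h>

/* Level 1 pattern (vector-vector): y <- alpha*x + y. */
void ref_axpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != y);
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Level 2 pattern (matrix-vector): y <- alpha*A*x + beta*y.
   A is m-by-n, stored row-major (assumption of this sketch). */
void ref_gemv(size_t m, size_t n, double alpha, const double *A,
              const double *x, double beta, double *y)
{
    assert(x != y);
    for (size_t i = 0; i < m; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        y[i] = alpha * sum + beta * y[i];
    }
}

/* Level 3 pattern (matrix-matrix): C <- alpha*A*B + beta*C.
   A is m-by-k, B is k-by-n, C is m-by-n, all row-major. */
void ref_gemm(size_t m, size_t n, size_t k, double alpha, const double *A,
              const double *B, double beta, double *C)
{
    assert(A != C && B != C);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * sum + beta * C[i * n + j];
        }
    }
}
```

An optimized BLAS computes the same results but reorders and blocks these loops for the cache hierarchy, which is why Level 3 routines tend to benefit most from tuning.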
9
BLAS
• Several implementations for different languages exist
  – Reference implementation (F77 and C): http://www.netlib.org/blas/
  – ATLAS, highly optimized for particular processor architectures
  – A generic C++ template class library providing BLAS functionality: uBLAS, http://www.boost.org
  – Several vendors provide libraries optimized for their architecture (AMD, HP, IBM, Intel, NEC, NVIDIA, Sun)
10
BLAS: C naming conventions
• F77 routine name is changed to lowercase and prefixed with cblas_
• All routines which accept two-dimensional arrays have an additional first parameter specifying the matrix memory layout (row major or column major)
• Character parameters are replaced by corresponding enum values
• Input arguments are declared const
• Non-complex scalar input parameters are passed by value
• Complex scalar input arguments are passed using a void*
• Arrays are passed by address
• Output scalar arguments are passed by address
• Complex functions become subroutines which return the result via an additional last parameter (void*), appending _sub to the name
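To make the layout parameter concrete, here is a small sketch of what "row major or column major" means for addressing element (i, j). The enum `mat_layout` and the helper `mat_index` are hypothetical names for this illustration; CBLAS itself expresses the choice with the enum values CblasRowMajor and CblasColMajor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the CBLAS layout argument. */
enum mat_layout { LAYOUT_ROW_MAJOR, LAYOUT_COL_MAJOR };

/* Offset of element (i, j) in a flat array.
   ld is the leading dimension: the row length for row-major storage,
   the column length for column-major storage. */
size_t mat_index(enum mat_layout layout, size_t i, size_t j, size_t ld)
{
    assert(layout == LAYOUT_ROW_MAJOR || layout == LAYOUT_COL_MAJOR);
    if (layout == LAYOUT_ROW_MAJOR)
        return i * ld + j;   /* C convention: rows are contiguous */
    return i + j * ld;       /* Fortran convention: columns are contiguous */
}
```

For a 2-by-3 matrix, element (0, 1) sits at offset 1 in row-major storage (ld = 3) and at offset 2 in column-major storage (ld = 2). Passing the layout explicitly lets the same C array be handed to routines that were originally written for Fortran's column-major convention.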
12
BLAS Level 1 routines
• Vector operations (xROT, xSWAP, xCOPY etc.)
• Scalar dot products (xDOT etc.)
• Vector norms (xNRM2, xASUM) and index of the largest absolute value (IxAMAX)
13
BLAS Level 2 routines
• Matrix-vector operations (xGEMV, xGBMV, xHEMV, xHBMV etc.)
• Solving Tx = y for x, where T is triangular (xTRSV, xTBSV etc.)
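Solving Tx = y with a triangular T needs no factorization: the unknowns can be computed one after another. The following is a minimal sketch of forward substitution for a lower-triangular T, assuming row-major storage; `ref_lower_solve` is a hypothetical name for this example and not a BLAS routine.

```c
#include <assert.h>
#include <stddef.h>

/* Solve T*x = y for x, where T is an n-by-n lower-triangular matrix
   stored row-major. The solution overwrites y.
   Returns 0 on success, -1 if a zero diagonal entry is found. */
int ref_lower_solve(size_t n, const double *T, double *y)
{
    assert(T != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        double sum = y[i];
        /* Subtract the contributions of the already computed unknowns. */
        for (size_t j = 0; j < i; ++j)
            sum -= T[i * n + j] * y[j];
        if (T[i * n + i] == 0.0)
            return -1;
        y[i] = sum / T[i * n + i];
    }
    return 0;
}
```

For an upper-triangular T the same idea runs from the last row upwards (backward substitution).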
14
BLAS Level 3 routines
• Matrix-matrix operations (xGEMM etc.)
• Solving for triangular matrices (xTRSM)
• Widely used matrix-matrix multiply (xSYMM, xGEMM)
15
Demo 1
• Shows solving a matrix multiplication problem using BLAS expressed in FORTRAN, C, and C++
• Shows genericity of uBLAS, by comparing generic and banded matrix versions
• Shows newmat, a C++ matrix library which uses operator overloading
16
LAPACK
• Linear Algebra PACKage
  – http://www.netlib.org/lapack/
  – Written in F77
  – Provides routines for
    • Solving systems of simultaneous linear equations,
    • Least-squares solutions of linear systems of equations,
    • Eigenvalue problems,
    • Householder transformation to implement QR decomposition on a matrix and
    • Singular value problems
  – Was initially designed to run efficiently on shared memory vector machines
  – Depends on BLAS
  – Has been extended for distributed-memory systems (ScaLAPACK and PLAPACK)
18
PETSc (pronounced PET-see)
• Portable, Extensible Toolkit for Scientific Computation (http://www-unix.mcs.anl.gov/petsc/petsc-as/)
  – Suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations (PDEs)
  – Employs the MPI standard for all message-passing communication
  – Intended for use in large-scale application projects
  – Includes a large suite of parallel linear and nonlinear equation solvers
  – Easily used in application codes written in C, C++, Fortran and Python
• Good introduction: http://www-unix.mcs.anl.gov/petsc/petsc-as/documentation/tutorials/nersc02/nersc02.ppt
23
PETSc (general features)
• Features include:
  – Parallel vectors
    • Scatters (handles communicating ghost point information)
    • Gathers
  – Parallel matrices
    • Several sparse storage formats
    • Easy, efficient assembly
  – Scalable parallel preconditioners
  – Krylov subspace methods
  – Parallel Newton-based nonlinear solvers
  – Parallel time stepping (ODE) solvers
24
PETSc: Component details
• Vector operations (Vec): Provides the vector operations required for setting up and solving large-scale linear and nonlinear problems. Includes easy-to-use parallel scatter and gather operations, as well as special-purpose code for handling ghost points for regular data structures.
• Matrix operations (Mat): A large suite of data structures and code for the manipulation of parallel sparse matrices. Includes four different parallel matrix data structures, each appropriate for a different class of problems.
• Preconditioners (PC): A collection of sequential and parallel preconditioners, including
  – (sequential) ILU(k) (incomplete factorization),
  – LU (lower/upper decomposition),
  – both sequential and parallel block Jacobi, overlapping additive Schwarz methods
• Time stepping ODE solvers (TS): Code for the time evolution of solutions of PDEs. In addition, provides pseudo-transient continuation techniques for computing steady-state solutions.
26
PETSc: Component details
• Krylov subspace solvers (KSP): Parallel implementations of many popular Krylov subspace iterative methods, including
  – GMRES (Generalized Minimal Residual method),
  – CG (Conjugate Gradient),
  – CGS (Conjugate Gradient Squared),
  – Bi-CG-Stab (BiConjugate Gradient Stabilized),
  – two variants of TFQMR (transpose free QMR),
  – CR (Conjugate Residuals),
  – LSQR (least-squares QR method).
All are coded so that they are immediately usable with any preconditioners and any matrix data structures, including matrix-free methods.
• Non-linear solvers (SNES): Data-structure-neutral implementations of Newton-like methods for nonlinear systems. Includes both line search and trust region techniques with a single interface. Employs by default the above data structures and linear solvers. Users can set custom monitoring routines, convergence criteria, etc.
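To give a feel for what a Krylov subspace solver does, here is a minimal sketch of the unpreconditioned Conjugate Gradient iteration for a small dense symmetric positive definite system. It is a textbook serial version, not PETSc code: `cg_solve` and its helpers are hypothetical names, the dense row-major matrix is an assumption, and the stopping test uses the squared residual norm to avoid a square root.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static double dot(size_t n, const double *a, const double *b)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* y = A*x for a dense row-major n-by-n matrix A. */
static void matvec(size_t n, const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = dot(n, &A[i * n], x);
}

/* Solve A*x = b for symmetric positive definite A. x holds the initial
   guess on entry and the approximate solution on exit. Returns the number
   of iterations used, or -1 on allocation failure, breakdown, or if the
   squared residual norm stays above tol_sq after max_iter iterations. */
int cg_solve(size_t n, const double *A, const double *b, double *x,
             double tol_sq, int max_iter)
{
    assert(n > 0 && max_iter >= 0);
    double *r = malloc(n * sizeof *r);    /* residual b - A*x */
    double *p = malloc(n * sizeof *p);    /* search direction */
    double *Ap = malloc(n * sizeof *Ap);  /* A times search direction */
    int result = -1;

    if (r != NULL && p != NULL && Ap != NULL) {
        matvec(n, A, x, Ap);
        for (size_t i = 0; i < n; ++i) {
            r[i] = b[i] - Ap[i];
            p[i] = r[i];
        }
        double rr = dot(n, r, r);
        int it = 0;
        while (rr > tol_sq && it < max_iter) {
            matvec(n, A, p, Ap);
            double pAp = dot(n, p, Ap);
            if (pAp <= 0.0)
                break;                    /* A is not positive definite */
            double alpha = rr / pAp;      /* step length along p */
            for (size_t i = 0; i < n; ++i) {
                x[i] += alpha * p[i];
                r[i] -= alpha * Ap[i];
            }
            double rr_new = dot(n, r, r);
            double beta = rr_new / rr;    /* keeps directions A-conjugate */
            for (size_t i = 0; i < n; ++i)
                p[i] = r[i] + beta * p[i];
            rr = rr_new;
            ++it;
        }
        if (rr <= tol_sq)
            result = it;
    }
    free(r);
    free(p);
    free(Ap);
    return result;
}
```

The only access to A is through `matvec`, which is what makes matrix-free variants possible: any routine that applies the operator to a vector can take its place.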
27
Introduction to Meshes and Grids
• Mesh/Grid : 2D or 3D representation of the computational domain.
• Common 2D meshes are composed of triangular or quadrilateral elements
• Common 3D meshes are composed of hexahedral, tetrahedral or pyramidal elements
30
[Figure: 2D mesh elements (triangle, quadrilateral); 3D mesh elements (tetrahedron, hexahedron, prism)]
Structured Grids (Meshes)
• Cartesian grids, logically rectangular grids
• Mesh info accessed implicitly using grid point indices
  – Efficient in both computation and storage
• Typically use finite difference discretization

Unstructured Meshes
• Mesh connectivity information must be stored
  – Incurs additional memory and computational cost
• Handles complex geometries and grid adaptivity
• Typically use finite volume or finite element discretization
• Mesh quality becomes a concern
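The difference between implicit and stored connectivity can be sketched in a few lines. In the illustration below, all names (`grid_neighbors`, `tri_mesh`, `triangles_at_vertex`) are hypothetical: on a structured grid the neighbours of a point follow from index arithmetic alone, while an unstructured triangle mesh has to carry an explicit element-to-vertex table that must be searched or indexed.

```c
#include <assert.h>
#include <stddef.h>

/* Structured nx-by-ny grid: the neighbours of point (i, j) are computed
   from the indices, nothing is stored. Writes up to four linear indices
   (i + j*nx) into out and returns how many were written. */
size_t grid_neighbors(size_t nx, size_t ny, size_t i, size_t j, size_t out[4])
{
    assert(i < nx && j < ny);
    size_t count = 0;
    if (i > 0)      out[count++] = (i - 1) + j * nx;
    if (i + 1 < nx) out[count++] = (i + 1) + j * nx;
    if (j > 0)      out[count++] = i + (j - 1) * nx;
    if (j + 1 < ny) out[count++] = i + (j + 1) * nx;
    return count;
}

/* Unstructured mesh: connectivity is data. Each triangle lists the
   indices of its three vertices. */
struct tri_mesh {
    size_t num_triangles;
    const size_t (*triangles)[3];
};

/* Count the triangles touching vertex v by scanning the stored table. */
size_t triangles_at_vertex(const struct tri_mesh *mesh, size_t v)
{
    assert(mesh != NULL);
    size_t count = 0;
    for (size_t t = 0; t < mesh->num_triangles; ++t)
        for (size_t k = 0; k < 3; ++k)
            if (mesh->triangles[t][k] == v)
                ++count;
    return count;
}
```

The extra table is the memory cost of an unstructured mesh; the scan (or an additional inverse index to avoid it) is the computational cost.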
31
Structured/Unstructured Meshes
Mesh Decomposition
• Goal is to maximize interior while minimizing connections between subdomains. That is, minimize communication.
• Such decomposition problems have been studied in load balancing for parallel computation.
• Lots of choices:
  • METIS, ParMETIS -- University of Minnesota,
  • PARTI -- University of Maryland,
  • CHACO -- Sandia National Laboratories,
  • JOSTLE -- University of Greenwich,
  • PARTY -- University of Paderborn,
  • SCOTCH -- Université Bordeaux,
  • TOP/DOMDEC -- NAS at NASA Ames Research Center.
http://www.hlrs.de
34
Mesh Decomposition
• Load balancing– Distribute elements evenly across processors.
– Each processor should have equal share of work.
• Communication costs should be minimized. – Minimize sub-domain boundary elements.
– Minimize number of neighboring domains.
• Distribution should reflect machine architecture.– Communication versus calculation.
– Bandwidth versus latency.
• Note that optimizing load balance and communication cost simultaneously is an NP-hard problem.
http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-13.html
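The two competing goals above can be turned into numbers for a given partition. The sketch below assumes the mesh is represented as a graph with an edge list and a `part` array assigning each vertex to a processor; the function names are hypothetical and are not taken from METIS or any other partitioner. The edge cut counts connections crossing subdomain boundaries (a proxy for communication), and the imbalance is the largest subdomain size divided by the average.

```c
#include <assert.h>
#include <stddef.h>

/* Number of edges whose two endpoints lie in different parts. */
size_t edge_cut(size_t num_edges, const size_t (*edges)[2], const int *part)
{
    assert(part != NULL);
    size_t cut = 0;
    for (size_t e = 0; e < num_edges; ++e)
        if (part[edges[e][0]] != part[edges[e][1]])
            ++cut;
    return cut;
}

/* Largest part size divided by the average part size.
   1.0 means perfectly balanced; larger values mean more imbalance. */
double load_imbalance(size_t num_vertices, const int *part, int num_parts)
{
    assert(num_vertices > 0 && num_parts > 0);
    size_t max_count = 0;
    for (int q = 0; q < num_parts; ++q) {
        size_t count = 0;
        for (size_t v = 0; v < num_vertices; ++v)
            if (part[v] == q)
                ++count;
        if (count > max_count)
            max_count = count;
    }
    return (double)max_count * (double)num_parts / (double)num_vertices;
}
```

On a chain of four vertices split over two processors, the assignment {0, 0, 1, 1} cuts one edge with imbalance 1.0, whereas {0, 1, 0, 1} is equally balanced but cuts all three edges. A partitioner searches for assignments that keep both numbers low.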
35
Static Grids (Meshes)
• Decomposition need only be carried out once
• Static decomposition may therefore be carried out as a preprocessing step, often done in serial
Dynamic Meshes
• Decomposition must be adapted as underlying mesh or processor load changes.
• Dynamic decomposition therefore becomes part of the calculation itself and cannot be carried out solely as a pre-processing step.
37
http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-14.html
Static and Dynamic Meshes
HP J6700, 1 CPU: solve time 13:26 (baseline time)
38
src : Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt
Linux cluster, 2 CPUs: solve time 5:20, speed-up 2.5X
39
src : Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt
Linux cluster, 4 CPUs: solve time 3:07, speed-up 4.3X
40
src : Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt
Linux cluster, 8 CPUs: solve time 1:51, speed-up 7.3X
41
src : Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt
Linux cluster, 16 CPUs: solve time 1:03, speed-up 12.8X
42
src : Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt
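The quoted speed-ups follow directly from the solve times: speed-up is the baseline time divided by the parallel time, and dividing that by the CPU count gives the parallel efficiency. A small sketch with hypothetical helper names is shown below. Note that the baseline here was measured on a different machine (the HP workstation), so the ratios mix hardware differences with parallel scaling.

```c
#include <assert.h>

/* Convert a mm:ss solve time into seconds. */
int to_seconds(int minutes, int seconds)
{
    assert(minutes >= 0 && seconds >= 0 && seconds < 60);
    return minutes * 60 + seconds;
}

/* Speed-up: baseline time divided by the measured time. */
double speedup(int base_seconds, int run_seconds)
{
    assert(run_seconds > 0);
    return (double)base_seconds / (double)run_seconds;
}

/* Parallel efficiency: speed-up per CPU. */
double efficiency(int base_seconds, int run_seconds, int cpus)
{
    assert(cpus > 0);
    return speedup(base_seconds, run_seconds) / (double)cpus;
}
```

With the 13:26 baseline (806 s), the 16-CPU run at 1:03 (63 s) gives a speed-up of about 12.8 and an efficiency of about 0.80, while the 2-CPU run at 5:20 (320 s) gives about 2.5, which is more than the CPU count and reflects the faster cluster nodes rather than ideal scaling.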
FFTW
• Fastest Fourier Transform in the West
• Portable C subroutine library for computing discrete cosine/sine transform (DCT/DST)
• Computes arbitrary size discrete Fourier and Hartley transforms on real or complex data, in one or more dimensions
• Optimized for speed through application of special-purpose compiler genfft (codelet generator), originally written in OCaml; performance comparable even with vendor optimized libraries
• Free software, distributed under the GPL; also available under a commercial license from MIT
• Developed at MIT by Matteo Frigo and Steven G. Johnson
• Won the J. H. Wilkinson Prize for Numerical Software in 1999
• Most recent stable version is 3.1.2 (http://www.fftw.org)
53
Main FFTW Features
• C and FORTRAN interfaces, C++ wrappers available
• Speed, including support for SSE, SSE2, 3dNow! and Altivec
• Arbitrary size transforms with complexity of O(n·log(n)) (sizes which factor into powers of 2, 3, 5 and 7 are most efficient by default, but custom code can also be generated for other sizes if required)
• Even/odd data (DCT/DST), types I-IV
• Can produce pure real output, or process pure real input data
• Efficient handling of multiple, strided transforms (e.g. transformation of multiple arrays at once; one dimension of a multi-dimensional array; one field of a multi-component array)
• Parallel code supporting Cilk, SMP platforms with threads, or MPI
• Ability to save and restore plans optimized for a given platform (through the wisdom mechanism)
• Portable to any platform with a working C compiler
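Because sizes built from the factors 2, 3, 5 and 7 are the fast case mentioned above, a common trick is to pad the input to the next such size before transforming. The sketch below shows one way to find that size; `is_smooth_2357` and `next_smooth_size` are hypothetical helpers written for this illustration and are not part of the FFTW API.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if n > 0 has no prime factors other than 2, 3, 5 and 7. */
int is_smooth_2357(size_t n)
{
    static const size_t primes[4] = {2, 3, 5, 7};
    if (n == 0)
        return 0;
    for (size_t k = 0; k < 4; ++k)
        while (n % primes[k] == 0)
            n /= primes[k];
    return n == 1;
}

/* Smallest size >= n that factors completely into 2, 3, 5 and 7. */
size_t next_smooth_size(size_t n)
{
    assert(n <= ((size_t)-1) / 2);  /* a power of two exists below the limit */
    if (n == 0)
        return 1;
    while (!is_smooth_2357(n))
        ++n;
    return n;
}
```

A prime length such as 1021 would be padded to 1024 here. Whether padding is acceptable depends on the application, since zero-padding changes the frequency resolution of the result.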
54
FFTW Sample Code
Source: http://www.fftw.org/fftw3.pdf
Computing 1-D complex DFT
55
#include <fftw3.h>
...
{
    fftw_complex *in, *out;
    fftw_plan p;
    ...
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    /* populate in[] with input data */
    ...
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    ...
    fftw_execute(p); /* repeat as needed */
    /* transform now available in out[] */
    ...
    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
}
What is Boost?
• Data Structures, Containers, Iterators, and Algorithms
• String and Text Processing
• Function Objects and Higher-Order Programming
• Generic Programming and Template Metaprogramming
• Math and Numerics
• Input/Output
• Miscellaneous
• Mostly header only
58
What’s important
• OS abstraction
  – Thread: OS independent kernel level thread interface
  – Asio: asynchronous input output
  – Filesystem: file system operations such as file copy, delete, directory create, file path handling
  – System: OS error code abstraction and handling
  – Program options: handling of command line arguments and parameters
  – Streams: build your own C++ streams
  – DateTime: handling of dates, times and time periods
  – Timer: simple timer object
59
What’s important
• Data types, Container types, all extending STL
  – Pointer containers: allow for pointers in STL containers: vector<char *> becomes ptr_vector<char>
  – Multi index: data structures with multiple indices
  – Constant sized arrays: array<char, 10>, acts like vector or plain 'C' array
  – Any: can hold values of any type (if you need polymorphism)
  – Variant: can hold values of any of the types specified at compile time ('C' equivalent is discriminated union)
  – Optional: can hold a value or nothing
  – Tuple: like a vector or array, but every element may have a different type (similar to plain struct)
  – Graph library: very sophisticated collection of graph related data structures and algorithms
    • Parallel version exists (using MPI)
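The remark that the 'C' equivalent of Variant is a discriminated union can be made concrete. The sketch below is plain C written for this comparison, with hypothetical names throughout: a tag records which union member is currently valid, and every access has to check it by hand, which is the bookkeeping a variant type does for you.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* The tag says which member of the union is currently valid. */
enum value_kind { VALUE_INT, VALUE_DOUBLE, VALUE_TEXT };

struct value {
    enum value_kind kind;
    union {
        int i;
        double d;
        const char *s;
    } as;
};

struct value make_int(int i)
{
    struct value v;
    v.kind = VALUE_INT;
    v.as.i = i;
    return v;
}

struct value make_double(double d)
{
    struct value v;
    v.kind = VALUE_DOUBLE;
    v.as.d = d;
    return v;
}

struct value make_text(const char *s)
{
    struct value v;
    v.kind = VALUE_TEXT;
    v.as.s = s;
    return v;
}

/* Format the held value into buf, dispatching on the tag.
   Returns the snprintf result, or -1 for an unknown tag. */
int value_format(const struct value *v, char *buf, size_t size)
{
    assert(v != NULL && buf != NULL);
    switch (v->kind) {
    case VALUE_INT:    return snprintf(buf, size, "%d", v->as.i);
    case VALUE_DOUBLE: return snprintf(buf, size, "%g", v->as.d);
    case VALUE_TEXT:   return snprintf(buf, size, "%s", v->as.s);
    }
    return -1;
}
```

Nothing stops C code from reading `as.d` while the tag says `VALUE_INT`; a compile-time checked variant rules out that class of mistake.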
60
What’s important
• Helper classes
  – Smart pointers: working with pointers without having to worry about memory management
  – Memory pools: specialized memory allocation for containers
  – Iterator library: write your own iterator classes with ease (non trivial otherwise)
61
Other stuff in Boost
• String and Text processing
  – Regex, parsing, format, conversion etc.
• Algorithms
  – String algos, FOR_EACH, minmax etc.
• Math and numerics
  – Conversion, interval, random, octonion, quaternion, special functions, rational, uBLAS
• Functional and higher order programming
  – Bind, lambda, function, ref, signals etc.
• Generic and template metaprogramming
  – Proto, mpl, fusion, phoenix, enable_if etc.
• Testing
  – Unit tests, concept checks, static_assert
62
Conclusion
• Look at Boost first if you need something not available in the Standard Library
• Even if it's not in Boost, look around: there are a lot of libraries in preparation for Boost (Boost Sandbox, File Vault)
63
Links
• Boost, current release V1.33.1 – Web: http://www.boost.org
– CVS: http://sourceforge.net/projects/boost
• Boost Sandbox– CVS: http://sourceforge.net/projects/boost-sandbox
– File Vault: http://boost-consulting.com/vault/
• Boost mailing lists– http://www.boost.org/more/mailing_lists.htm
64
Outlook
Functional specification with a Domain Specific Embedded Language (DSEL)
equation = sum<vertex_edge>
           [
               sumf<edge_vertex>(0.0, _e) [ pot * orient(_e, _1) ] * A / d * eps
           ] - V * rho
65
Elliptic PDE discretized by Finite Volume
References: [1]
References
1. Rene Heinzl, Modern Application Design using Modern Programming Paradigms and a Library-Centric Software Approach, OOPSLA 2006, Workshop on Library Centric Software Design, Portland, Oregon, October 2006.
66
Summary – Material for the Test
• High performance libraries: 5, 6, 7
• Linear algebra libraries: BLAS: 9, 11, 12
• Linear algebra libraries: LAPACK: 18
• PDE Solvers: 23, 24, 26, 27
• Mesh decomposition & load balancing: 30, 31, 34, 35, 37, 44, 45, 46, 48, 49
• FFTW: 53, 54
• Boost: 58, 59, 60, 61, 62