High Performance Computing: Concepts, Methods & Means
HPC Libraries
Hartmut Kaiser PhD
Center for Computation & Technology
Louisiana State University
April 19th, 2007
Outline
• Introduction to High Performance Libraries
• Linear Algebra Libraries (BLAS, LAPACK)
• PDE Solvers (PETSc)
• Mesh manipulation and load balancing
(METIS/ParMETIS, JOSTLE)
• Special purpose libraries (FFTW)
• General purpose libraries (C++: Boost)
• Summary – Materials for test
2
Puzzle of the Day
#include <stdio.h>
int main()
{
    int a = 10;
    switch(a) {
    case '1':
        printf("ONE\n");
        break;
    case '2':
        printf("TWO\n");
        break;
    defa1ut:
        printf("NONE\n");
    }
    return 0;
}
If you expect the output of the above program to be NONE, I would request you to check it out!
4
Application domains
• Linear algebra
– BLAS, ATLAS, LAPACK, ScaLAPACK, Slatec, pim
• Ordinary and partial differential equations
– PETSc
• Mesh manipulation and Load Balancing
– METIS, ParMETIS, CHACO, JOSTLE, PARTY
• Graph manipulation
– Boost.Graph library
• Vector/Signal/Image processing
– VSIPL, PSSL.
• General parallelization
– MPI, pthreads
• Other domain specific libraries
– NAMD, NWChem, Fluent, Gaussian, LS-DYNA
5
Application Domain Overview
• Linear Algebra Libraries
– Provide optimized methods for constructing sets of linear equations,
performing operations on them (matrix-matrix products, matrix-vector
products) and solving them (factoring, forward & backward
substitution).
– Commonly used libraries include BLAS, ATLAS, LAPACK,
ScaLAPACK, PLAPACK
• PDE Solvers:
– Developing general-purpose, parallel numerical PDE libraries
– Usual toolsets include manipulation of sparse data structures,
iterative linear system solvers, preconditioners, nonlinear solvers and
time-stepping methods.
– Commonly used libraries for solving PDEs include SAMRAI, PETSc,
PARASOL, Overture, among others.
6
Application Domain Overview
• Mesh manipulation and Load Balancing
– These libraries help in partitioning meshes in roughly equal sizes
across processors, thereby balancing the workload while
minimizing size of separators and communication costs.
– Commonly used libraries for this purpose include METIS, ParMetis,
Chaco, JOSTLE among others.
• Other packages:
– FFTW: highly optimized Fourier transform package
featuring both real and complex multidimensional transforms in
sequential, multithreaded, and parallel versions.
– NAMD: molecular dynamics library available for Unix/Linux,
Windows, OS X
– Fluent: computational fluid dynamics package, used for such
applications as environment control systems, propulsion, reactor
modeling etc.
7
BLAS
• (Updated set of) Basic Linear Algebra Subprograms
• The BLAS functionality is divided into three levels:
– Level 1: contains vector operations of the form $y \leftarrow \alpha x + y$, as well as scalar dot products and vector norms
– Level 2: contains matrix-vector operations of the form $y \leftarrow \alpha A x + \beta y$, as well as solving $Tx = y$ for $x$ with $T$ being triangular
– Level 3: contains matrix-matrix operations of the form $C \leftarrow \alpha A B + \beta C$, as well as solving $B \leftarrow \alpha T^{-1} B$ for triangular matrices $T$. This level contains the widely used General Matrix Multiply (GEMM) operation.
9
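To make the Level 3 operation concrete, here is a minimal, unoptimized sketch of the update C ← αAB + βC in plain C. The function name naive_dgemm and the row-major storage are assumptions made for this illustration; it is not a BLAS routine and makes no attempt at the blocking and vectorization that tuned implementations rely on.

```c
#include <assert.h>
#include <stddef.h>

/* Reference-style sketch of C <- alpha*A*B + beta*C.
 * A is m x k, B is k x n, C is m x n, all stored row-major.
 * naive_dgemm is a hypothetical name, not part of any BLAS library. */
void naive_dgemm(size_t m, size_t n, size_t k,
                 double alpha, const double *A, const double *B,
                 double beta, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double acc = 0.0;               /* (A*B)(i, j) */
            for (size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

Calling it with two 2 x 2 matrices, alpha = 2 and beta = 1 scales the product by two and adds it to the existing contents of C. An optimized library computes the same result but reorders the loops to reuse data in cache.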
BLAS
• Several implementations for different languages exist
– Reference implementation (F77 and C): http://www.netlib.org/blas/
– ATLAS, highly optimized for particular processor architectures
– A generic C++ template class library providing BLAS functionality: uBLAS, http://www.boost.org
– Several vendors provide libraries optimized for their architecture (AMD, HP, IBM, Intel, NEC, NVIDIA, Sun)
10
BLAS: F77 naming conventions
11
BLAS: C naming conventions
• F77 routine name is changed to lowercase and prefixed with cblas_
• All routines which accept two dimensional arrays have a new additional first parameter specifying the matrix memory layout (row major or column major)
• Character parameters are replaced by corresponding enum values
• Input arguments are declared const
• Non-complex scalar input parameters are passed by value
• Complex scalar input arguments are passed using a void*
• Arrays are passed by address
• Output scalar arguments are passed by address
• Complex functions become subroutines which return the result via an additional last parameter (void*), appending _sub to the name
12
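The extra layout parameter only changes how an (i, j) pair is mapped to a position in the flat array. The sketch below illustrates that mapping with a hypothetical enum demo_layout and helper demo_get; these names are invented for this example (the CBLAS header itself names the two layouts CblasRowMajor and CblasColMajor).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the layout enum of the C interface. */
typedef enum { DEMO_ROW_MAJOR, DEMO_COL_MAJOR } demo_layout;

/* Return element (i, j) of a matrix stored in a flat array.
 * ld is the leading dimension: the row length for row-major storage,
 * the column length for column-major storage. */
double demo_get(demo_layout layout, const double *a, size_t ld,
                size_t i, size_t j)
{
    assert(a != NULL);
    return layout == DEMO_ROW_MAJOR ? a[i * ld + j] : a[j * ld + i];
}
```

The same six numbers in memory therefore describe different 2 x 3 matrices depending on the layout argument, which is why the C interface asks for it explicitly while the Fortran interface always assumes column-major storage.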
BLAS Level 1 routines
• Vector operations (xROT, xSWAP, xCOPY etc.)
• Scalar dot products (xDOT etc.)
• Vector norms (xNRM2, IxAMAX etc.)
13
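As a rough illustration of what Level 1 routines compute, the following sketch spells out three of them as plain loops: the update y ← αx + y (what xAXPY does), a dot product (xDOT) and the index of the entry with the largest absolute value (IxAMAX). The naive_ function names are made up for this example, unit strides are assumed, and the index is returned 0-based, which is a choice of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* y <- alpha*x + y */
void naive_daxpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Dot product of x and y. */
double naive_ddot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* Index (0-based in this sketch) of the entry with largest |x[i]|. */
size_t naive_idamax(size_t n, const double *x)
{
    assert(n > 0);
    size_t best = 0;
    double best_abs = x[0] < 0.0 ? -x[0] : x[0];
    for (size_t i = 1; i < n; ++i) {
        double a = x[i] < 0.0 ? -x[i] : x[i];
        if (a > best_abs) {
            best_abs = a;
            best = i;
        }
    }
    return best;
}
```

All three touch each vector element once, so Level 1 routines do O(n) work on O(n) data and are typically limited by memory bandwidth rather than arithmetic.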
BLAS Level 2 routines
• Matrix-vector operations (xGEMV, xGBMV, xHEMV, xHBMV etc.)
• Solving Tx = y for x, where T is triangular (xTRSV etc.)
• Rank-one updates (xGER, xHER etc.)
14
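Solving Tx = y with a triangular T needs no factorization, only substitution. The sketch below does forward substitution for a lower triangular matrix; the name lower_tri_solve, the row-major storage and the separate output vector are assumptions of this example rather than the interface of the corresponding BLAS routine (xTRSV).

```c
#include <assert.h>
#include <stddef.h>

/* Solve T x = y for x by forward substitution.
 * T is n x n, lower triangular, stored row-major.
 * Returns 0 on success, -1 if a zero diagonal entry is found. */
int lower_tri_solve(size_t n, const double *T, const double *y, double *x)
{
    assert(T != NULL && y != NULL && x != NULL);
    for (size_t i = 0; i < n; ++i) {
        double s = y[i];
        /* subtract the contributions of the already known x[0..i-1] */
        for (size_t j = 0; j < i; ++j)
            s -= T[i * n + j] * x[j];
        if (T[i * n + i] == 0.0)
            return -1;
        x[i] = s / T[i * n + i];
    }
    return 0;
}
```

Row i can only be processed after rows 0 to i-1, which is why triangular solves are harder to parallelize than matrix-vector products of the same size.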
BLAS Level 3 routines
• Matrix-matrix operations (xGEMM etc.)
• Solving for triangular matrices (xTRSM)
• Widely used matrix-matrix multiply (xSYMM, xGEMM)
15
Demo 1
• Shows solving a matrix multiplication problem using
BLAS expressed in FORTRAN, C, and C++
• Shows genericity of uBLAS, by comparing generic
and banded matrix versions
• Shows newmat, a C++ matrix library which uses
operator overloading
16
LAPACK
• Linear Algebra PACKage
– http://www.netlib.org/lapack/
– Written in F77
– Provides routines for
• Solving systems of simultaneous linear equations,
• Least-squares solutions of linear systems of equations,
• Eigenvalue problems,
• Householder transformation to implement QR decomposition on a matrix and
• Singular value problems
– Was initially designed to run efficiently on shared memory vector machines
– Depends on BLAS
– Has been extended for distributed-memory systems (ScaLAPACK and PLAPACK)
18
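For the first item in the list, solving a system of simultaneous linear equations, the textbook method is Gaussian elimination with partial pivoting followed by back substitution. The sketch below is a small self-contained version of that method, not LAPACK code: the name solve_dense, the row-major storage and the in-place overwriting of A and b are choices made for this example.

```c
#include <assert.h>
#include <stddef.h>

static double lu_abs(double v) { return v < 0.0 ? -v : v; }

/* Solve A x = b by Gaussian elimination with partial pivoting.
 * A is n x n row-major and is overwritten; b is overwritten with x.
 * Returns 0 on success, -1 if a zero pivot makes the system singular. */
int solve_dense(size_t n, double *A, double *b)
{
    assert(A != NULL && b != NULL);
    for (size_t col = 0; col < n; ++col) {
        /* pick the row with the largest entry in this column */
        size_t piv = col;
        for (size_t r = col + 1; r < n; ++r)
            if (lu_abs(A[r * n + col]) > lu_abs(A[piv * n + col]))
                piv = r;
        if (A[piv * n + col] == 0.0)
            return -1;
        if (piv != col) {
            for (size_t c = 0; c < n; ++c) {
                double t = A[col * n + c];
                A[col * n + c] = A[piv * n + c];
                A[piv * n + c] = t;
            }
            double t = b[col];
            b[col] = b[piv];
            b[piv] = t;
        }
        /* eliminate the column below the pivot */
        for (size_t r = col + 1; r < n; ++r) {
            double f = A[r * n + col] / A[col * n + col];
            for (size_t c = col; c < n; ++c)
                A[r * n + c] -= f * A[col * n + c];
            b[r] -= f * b[col];
        }
    }
    /* back substitution on the upper triangular system */
    for (size_t i = n; i-- > 0; ) {
        double s = b[i];
        for (size_t c = i + 1; c < n; ++c)
            s -= A[i * n + c] * b[c];
        b[i] = s / A[i * n + i];
    }
    return 0;
}
```

A library routine performs the same mathematical steps but organizes them in blocks so that most of the work ends up in Level 3 BLAS calls, which is where the dependency on BLAS pays off.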
19
LAPACK (Architecture)
LAPACK naming conventions
20
Demo 2
• Shows how using a library might speed
up the computation considerably
21
PETSc (pronounced PET-see)
• Portable, Extensible Toolkit for Scientific Computation (http://www-unix.mcs.anl.gov/petsc/petsc-as/)
– Suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations (PDEs)
– Employs the MPI standard for all message-passing communication
– Intended for use in large-scale application projects
– Includes a large suite of parallel linear and nonlinear equation solvers
– Easily used in application codes written in C, C++, Fortran and Python
• Good introduction: http://www-unix.mcs.anl.gov/petsc/petsc-as/documentation/tutorials/nersc02/nersc02.ppt
23
PETSc (general features)
• Features include:
– Parallel vectors
• Scatters (handles communicating ghost point information)
• Gathers
– Parallel matrices
• Several sparse storage formats
• Easy, efficient assembly.
– Scalable parallel preconditioners
– Krylov subspace methods
– Parallel Newton-based nonlinear solvers
– Parallel time stepping (ODE) solvers
24
PETSc (Architecture)
25
PETSc: Module architecture and layers of abstraction
PETSc: Component details
• Vector operations (Vec): Provides the vector operations required for setting up and solving large-scale linear and nonlinear problems. Includes easy-to-use parallel scatter and gather operations, as well as special-purpose code for handling ghost points for regular data structures.
• Matrix operations (Mat): A large suite of data structures and code for the manipulation of parallel sparse matrices. Includes four different parallel matrix data structures, each appropriate for a different class of problems.
• Preconditioners (PC): A collection of sequential and parallel preconditioners, including
– (sequential) ILU(k) (incomplete factorization),
– LU (lower/upper decomposition),
– both sequential and parallel block Jacobi, overlapping additive Schwarz methods
• Time stepping ODE solvers (TS): Code for the time evolution of solutions of PDEs. In addition, provides pseudo-transient continuation techniques for computing steady-state solutions.
26
PETSc: Component details
• Krylov subspace solvers (KSP): Parallel implementations of many popular Krylov subspace iterative methods, including
– GMRES (Generalized Minimal Residual method),
– CG (Conjugate Gradient),
– CGS (Conjugate Gradient Squared),
– Bi-CG-Stab (BiConjugate Gradient Stabilized),
– two variants of TFQMR (transpose free QMR),
– CR (Conjugate Residuals),
– LSQR (least-squares QR method).
All are coded so that they are immediately usable with any preconditioners and any matrix data structures, including matrix-free methods.
• Non-linear solvers (SNES): Data-structure-neutral implementations of Newton-like methods for nonlinear systems. Includes both line search and trust region techniques with a single interface. Employs by default the above data structures and linear solvers. Users can set custom monitoring routines, convergence criteria, etc.
27
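To give a feel for what a Krylov subspace method does, here is a minimal serial sketch of the unpreconditioned Conjugate Gradient iteration for a small dense symmetric positive definite system. It is a teaching sketch and not PETSc code: the name cg_solve, the dense row-major matrix and the fixed bound CG_MAX_N are assumptions of this example, whereas a KSP solver works on distributed sparse matrices and is usually combined with a preconditioner.

```c
#include <assert.h>
#include <stddef.h>

#define CG_MAX_N 64   /* hypothetical size limit; avoids heap allocation */

static void cg_matvec(size_t n, const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = 0.0;
        for (size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
    }
}

static double cg_dot(size_t n, const double *a, const double *b)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* Conjugate Gradient for A x = b with A symmetric positive definite.
 * x holds the initial guess on entry and the approximation on exit.
 * Stops when the residual 2-norm drops to tol or after max_iter steps.
 * Returns the number of iterations performed. */
int cg_solve(size_t n, const double *A, const double *b, double *x,
             double tol, int max_iter)
{
    double r[CG_MAX_N], p[CG_MAX_N], Ap[CG_MAX_N];
    assert(n > 0 && n <= CG_MAX_N);

    cg_matvec(n, A, x, Ap);
    for (size_t i = 0; i < n; ++i) {
        r[i] = b[i] - Ap[i];        /* initial residual */
        p[i] = r[i];                /* first search direction */
    }
    double rr = cg_dot(n, r, r);

    int it = 0;
    while (it < max_iter && rr > tol * tol) {
        cg_matvec(n, A, p, Ap);
        double alpha = rr / cg_dot(n, p, Ap);   /* step length */
        for (size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = cg_dot(n, r, r);
        double beta = rr_new / rr;
        for (size_t i = 0; i < n; ++i)
            p[i] = r[i] + beta * p[i];          /* next direction */
        rr = rr_new;
        ++it;
    }
    return it;
}
```

The only operations on A are matrix-vector products, which is what allows such methods to run with any matrix data structure, including matrix-free operators.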
Mesh libraries
• Introduction
– Structured/unstructured meshes
– Examples
• Mesh decomposition
29
Introduction to Meshes and Grids
• Mesh/Grid : 2D or 3D
representation of the computational
domain.
• Common 2D meshes are composed
of triangular or quadrilateral
elements
• Common 3D meshes are composed
of hexahedral, tetrahedral or
pyramidal elements
30
2D mesh elements: triangle, quadrilateral
3D mesh elements: tetrahedron, hexahedron, prism
Structured Grids (Meshes)
• Cartesian grids, logically rectangular grids
• Mesh info accessed implicitly using grid point indices
– Efficient in both computation and storage
• Typically use finite difference discretization
Unstructured Meshes
• Mesh connectivity information must be stored
– Incurs additional memory and computational cost
• Handles complex geometries and grid adaptivity
• Typically use finite volume or finite element discretization
• Mesh quality becomes a concern
31
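The difference between implicit and stored connectivity can be shown in a few lines. In the sketch below a structured grid finds a neighbor by index arithmetic alone, while the unstructured triangle mesh has to carry an explicit table of vertex indices. All names (grid_index, grid_east, tri_mesh, tri_count_at_vertex) are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Structured grid: node (i, j) of a grid with nx nodes per row
 * maps to a linear index, so no connectivity needs to be stored. */
size_t grid_index(size_t nx, size_t i, size_t j)
{
    return j * nx + i;
}

/* Neighbor to the right of (i, j). Returns 1 and stores its linear
 * index in *out, or returns 0 if (i, j) lies on the right boundary. */
int grid_east(size_t nx, size_t i, size_t j, size_t *out)
{
    assert(out != NULL);
    if (i + 1 >= nx)
        return 0;
    *out = grid_index(nx, i + 1, j);
    return 1;
}

/* Unstructured triangle mesh: connectivity is explicit data. */
typedef struct {
    size_t n_tri;               /* number of triangles */
    const size_t (*tri)[3];     /* vertex indices of each triangle */
} tri_mesh;

/* Count the triangles touching vertex v by scanning the stored table. */
size_t tri_count_at_vertex(const tri_mesh *m, size_t v)
{
    assert(m != NULL);
    size_t count = 0;
    for (size_t t = 0; t < m->n_tri; ++t) {
        for (int k = 0; k < 3; ++k) {
            if (m->tri[t][k] == v) {
                ++count;
                break;
            }
        }
    }
    return count;
}
```

The grid query costs a multiplication and an addition, whereas the mesh query reads memory proportional to the number of elements unless additional lookup structures are built. That is the extra memory and computational cost referred to above.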
Structured/Unstructured Meshes
Mesh examples
32
Meshes are used for Computation
33
Mesh Decomposition
• Goal is to maximize interior while minimizing connections between subdomains.
That is, minimize communication.
• Such decomposition problems have been studied in load balancing for parallel
computation.
• Lots of choices:
• METIS, ParMETIS -- University of Minnesota.
• PARTI -- University of Maryland,
• CHACO -- Sandia National Laboratories,
• JOSTLE -- University of Greenwich,
• PARTY -- University of Paderborn,
• SCOTCH -- Université Bordeaux,
• TOP/DOMDEC -- NAS at NASA Ames Research Center.
http://www.hlrs.de
34
Mesh Decomposition
• Load balancing
– Distribute elements evenly across processors.
– Each processor should have equal share of work.
• Communication costs should be minimized.
– Minimize sub-domain boundary elements.
– Minimize number of neighboring domains.
• Distribution should reflect machine architecture.
– Communication versus calculation.
– Bandwidth versus latency.
• Note that optimizing load balance and communication cost
simultaneously is an NP-hard problem.
http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-13.html
35
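Both goals can be measured for a given partition. A common way to do this is to view the mesh as a graph, count the edges that cross between parts (the edge cut, a proxy for communication volume) and look at the size of the largest part (a proxy for load balance). The sketch below computes both; the function names and the array-based graph representation are assumptions of this example, not the interface of any partitioning library.

```c
#include <assert.h>
#include <stddef.h>

/* Number of edges whose two endpoints are assigned to different parts.
 * edges[e] holds the two vertex indices of edge e; part[v] is the part
 * (processor) that vertex v is assigned to. */
size_t edge_cut(size_t n_edges, const size_t (*edges)[2], const int *part)
{
    assert(edges != NULL && part != NULL);
    size_t cut = 0;
    for (size_t e = 0; e < n_edges; ++e)
        if (part[edges[e][0]] != part[edges[e][1]])
            ++cut;
    return cut;
}

/* Size of the largest part; with perfect balance this equals
 * n_vertices / n_parts (rounded up). */
size_t max_part_size(size_t n_vertices, const int *part, int n_parts)
{
    assert(part != NULL && n_parts > 0);
    size_t worst = 0;
    for (int p = 0; p < n_parts; ++p) {
        size_t size = 0;
        for (size_t v = 0; v < n_vertices; ++v)
            if (part[v] == p)
                ++size;
        if (size > worst)
            worst = size;
    }
    return worst;
}
```

For a chain of six vertices, splitting it in the middle and alternating the vertices between two parts are equally well balanced, yet the first cuts one edge and the second cuts all five. A partitioner searches for assignments of the first kind.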
36
Mesh decomposition
Static Grids (Meshes)
• Decomposition need only be
carried out once
• Static decomposition may
therefore be carried out as a
preprocessing step, often done in
serial
Dynamic Meshes
• Decomposition must be adapted
as underlying mesh or processor
load changes.
• Dynamic decomposition therefore
becomes part of the calculation
itself and cannot be carried out
solely as a pre-processing step.
37
http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-14.html
Static and Dynamic Meshes
HP J6700, 1 CPU. Solve time: 13:26 (baseline time)
38
src: Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt
Linux cluster, 2 CPUs. Solve time: 5:20. Speed-up: 2.5X
39
Linux cluster, 4 CPUs. Solve time: 3:07. Speed-up: 4.3X
40
Linux cluster, 8 CPUs. Solve time: 1:51. Speed-up: 7.3X
41
Linux cluster, 16 CPUs. Solve time: 1:03. Speed-up: 12.8X
42
Speedup due to decomposition
# CPUs Run-times (s)
1 806
2 320
4 187
8 111
16 63
43
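The speed-up figures quoted for the cluster runs follow directly from the run-times in the table: speed-up is the baseline time divided by the parallel time, and parallel efficiency is the speed-up divided by the number of CPUs. A small sketch, with function names chosen for this example:

```c
#include <assert.h>

/* Speed-up S(p) = T(1) / T(p). */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency E(p) = S(p) / p. */
double efficiency(double t_serial, double t_parallel, int n_cpus)
{
    assert(n_cpus > 0);
    return speedup(t_serial, t_parallel) / n_cpus;
}
```

With the tabulated times, 806 s against 63 s on 16 CPUs gives a speed-up of about 12.8 and an efficiency of about 0.80. The 2-CPU run comes out above 2 (806 / 320 is about 2.5); the baseline was measured on a different machine than the cluster, so that figure presumably reflects the hardware difference as well as the parallelization.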
44
Jostle and Metis
Jostle
45
Jostle
46
Jostle
47
Metis
48
ParMetis
49
Metis (serial)
50
Comparison
51
FFTW
• Fastest Fourier Transform in the West
• Portable C subroutine library for computing discrete cosine/sine
transform (DCT/DST)
• Computes arbitrary size discrete Fourier and Hartley transforms
on real or complex data, in one or more dimensions
• Optimized for speed through application of special-purpose
compiler genfft (codelet generator), originally written in OCaml;
performance comparable even with vendor optimized libraries
• Free software, distributed under GPL; also available under a
commercial license from MIT
• Developed at MIT by Matteo Frigo and Steven G. Johnson
• Won J. H. Wilkinson Prize for Numerical Software in 1999
• Most recent stable version is 3.1.2 (http://www.fftw.org)
53
Main FFTW Features
• C and FORTRAN interfaces, C++ wrappers available
• Speed, including support for SSE, SSE2, 3dNow! and Altivec
• Arbitrary size transforms with complexity of O(n·log(n)) (sizes which
can be factored to 2, 3, 5 and 7 are most efficient by default, but a
custom code can be also generated for other sizes if required)
• Even/odd data (DCT/DST), types I-IV
• Can produce pure real output, or process pure real input data
• Efficient handling of multiple, strided transforms (e.g. transformation of
multiple arrays at once; one dimension of multi-dimensional array; one
field of multi-component array)
• Parallel code supporting Cilk, SMP platforms with threads, or MPI
• Ability to save and restore plans optimized for a given platform (through
wisdom mechanism)
• Portable to any platform with a working C compiler
54
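When the transform length can be chosen freely, for example by zero-padding, it can help to pick a size made up only of the small factors mentioned above. The sketch below checks whether a length factors completely into 2, 3, 5 and 7 and finds the next such length. Both function names are invented for this example; nothing here calls FFTW, and whether padding actually pays off should be measured for the application at hand.

```c
#include <assert.h>

/* Returns 1 if n > 0 has no prime factor other than 2, 3, 5 and 7. */
int has_small_factors(unsigned long n)
{
    static const unsigned long primes[] = {2, 3, 5, 7};
    if (n == 0)
        return 0;
    for (int i = 0; i < 4; ++i)
        while (n % primes[i] == 0)
            n /= primes[i];
    return n == 1;
}

/* Smallest length >= n whose only prime factors are 2, 3, 5 and 7. */
unsigned long next_fast_size(unsigned long n)
{
    assert(n > 0);
    while (!has_small_factors(n))
        ++n;
    return n;
}
```

For instance 1000 = 2^3 · 5^3 qualifies as is, while 1001 = 7 · 11 · 13 does not and would be padded to 1008 = 2^4 · 3^2 · 7.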
FFTW Sample Code
Source: http://www.fftw.org/fftw3.pdf
Computing 1-D complex DFT
55
#include <fftw3.h>
...
{
    fftw_complex *in, *out;
    fftw_plan p;
    ...
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    /* populate in[] with input data */
    ...
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    ...
    fftw_execute(p); /* repeat as needed */
    /* transform now available in out[] */
    ...
    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
}
The Boost Libraries
• What’s Boost
– What’s important
– Other stuff
57
What is Boost?
• Data Structures, Containers, Iterators, and Algorithms
• String and Text Processing
• Function Objects and Higher-Order Programming
• Generic Programming and Template Metaprogramming
• Math and Numerics
• Input/Output
• Miscellaneous
• Mostly header only
58
What’s important
• OS abstraction
– Thread: OS independent kernel level thread interface
– Asio: asynchronous input output
– Filesystem: file system operations such as file copy, delete, directory create, file path handling
– System: OS error code abstraction and handling
– Program options: handling of command line arguments and parameters
– Streams: build your own C++ streams
– DateTime: Handling of dates, times and time periods
– Timer: simple timer object
59
What’s important
• Data types, Container types, all extending STL
– Pointer containers: allow for pointers in STL containers: vector<char *> → ptr_vector<char>
– Multi index: data structures with multiple indices
– Constant sized arrays: array<char, 10>, acts like vector or plain 'C' array
– Any: can hold values of any type (if you need polymorphism)
– Variant: can hold values of any of the types specified at compile time ('C' equivalent is discriminated union)
– Optional: can hold a value or nothing
– Tuple: like a vector or array, but every element may have a different type (similar to plain struct)
– Graph library: very sophisticated collection of graph related data structures and algorithms
• Parallel version exists (using MPI)
60
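For readers who have not met the 'C' equivalent of Variant, the following sketch shows a discriminated (tagged) union in plain C: a tag records which member of the union is currently valid, and every access checks the tag first. The type and function names are invented for this example. The sketch also shows what the C version lacks, since nothing stops code from reading the wrong member, which is the kind of mistake a variant type rules out at compile time.

```c
#include <assert.h>
#include <string.h>

/* The tag says which member of the union currently holds a value. */
typedef enum { VAL_INT, VAL_DOUBLE, VAL_TEXT } value_tag;

typedef struct {
    value_tag tag;
    union {
        int i;
        double d;
        char text[16];
    } as;
} value;

value value_from_int(int i)
{
    value v;
    v.tag = VAL_INT;
    v.as.i = i;
    return v;
}

value value_from_double(double d)
{
    value v;
    v.tag = VAL_DOUBLE;
    v.as.d = d;
    return v;
}

value value_from_text(const char *s)
{
    value v;
    assert(s != NULL);
    v.tag = VAL_TEXT;
    strncpy(v.as.text, s, sizeof v.as.text - 1);
    v.as.text[sizeof v.as.text - 1] = '\0';   /* always terminated */
    return v;
}

/* Read the value as a number. *ok is set to 1 for numeric contents
 * and to 0 if the value currently holds text. */
double value_as_number(const value *v, int *ok)
{
    assert(v != NULL && ok != NULL);
    switch (v->tag) {
    case VAL_INT:
        *ok = 1;
        return (double)v->as.i;
    case VAL_DOUBLE:
        *ok = 1;
        return v->as.d;
    default:
        *ok = 0;
        return 0.0;
    }
}
```

The switch over the tag plays the role that a visitor plays for a variant: one branch per possible type, selected at run time.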
What’s important
• Helper classes
– Smart pointers: working with pointers
without having to worry about memory
management
– Memory pools: specialized memory
allocation for containers
– Iterator library: write your own iterator
classes with ease (non trivial otherwise)
61
Other stuff in Boost
• String and Text processing
– Regex, parsing, format, conversion etc.
• Algorithms
– String algos, FOR_EACH, minmax etc.
• Math and numerics
– Conversion, interval, random, octonion, quaternion, special functions, rational, uBLAS
• Functional and higher order programming
– Bind, lambda, function, ref, signals etc.
• Generic and template metaprogramming
– Proto, mpl, fusion, phoenix, enable_if etc.
• Testing
– Unit tests, concept checks, static_assert
62
Conclusion
• Look at Boost first if you need something not
available in Standard library
• Even if it's not in Boost, look around: there are a lot of
libraries in preparation for Boost (Boost Sandbox, File
Vault)
63
Links
• Boost, current release V1.33.1
– Web: http://www.boost.org
– CVS: http://sourceforge.net/projects/boost
• Boost Sandbox
– CVS: http://sourceforge.net/projects/boost-sandbox
– File Vault: http://boost-consulting.com/vault/
• Boost mailing lists
– http://www.boost.org/more/mailing_lists.htm
64
Outlook
Functional specification with a
Domain Specific Embedded
Language (DSEL)
equation = sum<vertex_edge>
           [
               sumf<edge_vertex>(0.0, _e)
               [
                   pot * orient(_e, _1)
               ] * A / d * eps
           ] - V * rho
65
Elliptic PDE discretized by Finite Volume
References: [1]
References
1. Rene Heinzl, Modern Application Design using Modern Programming Paradigms and
a Library-Centric Software Approach, OOPSLA 2006, Workshop on Library Centric Software
Design, Portland, Oregon, October 2006.
66
Summary – Material for the Test
• High performance libraries: 5, 6, 7
• Linear algebra libraries: BLAS: 9, 11, 12
• Linear algebra libraries: LAPACK: 18
• PDE Solvers: 23, 24, 26, 27
• Mesh decomposition & load balancing: 30, 31,
34, 35, 37, 44, 45, 46, 48, 49
• FFTW: 53, 54
• Boost: 58, 59, 60, 61, 62