
High Performance Computing: Concepts, Methods & Means

HPC Libraries

Hartmut Kaiser PhD

Center for Computation & Technology

Louisiana State University

April 19th, 2007

Outline

• Introduction to High Performance Libraries

• Linear Algebra Libraries (BLAS, LAPACK)

• PDE Solvers (PETSc)

• Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE)

• Special purpose libraries (FFTW)

• General purpose libraries (C++: Boost)

• Summary – Materials for test

2


Puzzle of the Day

#include <stdio.h>

int main()
{
    int a = 10;
    switch(a) {
    case '1':
        printf("ONE\n");
        break;
    case '2':
        printf("TWO\n");
        break;
    defa1ut:
        printf("NONE\n");
    }
    return 0;
}

4

If you expect the output of the above program to be NONE, I would request you to check it out!

Application domains

• Linear algebra

– BLAS, ATLAS, LAPACK, ScaLAPACK, Slatec, pim

• Ordinary and partial Differential Equations

– PETSc

• Mesh manipulation and Load Balancing

– METIS, ParMETIS, CHACO, JOSTLE, PARTY

• Graph manipulation

– Boost.Graph library

• Vector/Signal/Image processing

– VSIPL, PSSL.

• General parallelization

– MPI, pthreads

• Other domain specific libraries

– NAMD, NWChem, Fluent, Gaussian, LS-DYNA

5

Application Domain Overview

• Linear Algebra Libraries

– Provide optimized methods for constructing sets of linear equations, performing operations on them (matrix-matrix products, matrix-vector products) and solving them (factoring, forward & backward substitution).

– Commonly used libraries include BLAS, ATLAS, LAPACK, ScaLAPACK, PLAPACK

• PDE Solvers:

– Developing general-purpose, parallel numerical PDE libraries

– Usual toolsets include manipulation of sparse data structures, iterative linear system solvers, preconditioners, nonlinear solvers and time-stepping methods.

– Commonly used libraries for solving PDEs include SAMRAI, PETSc, PARASOL, Overture, among others.

6

Application Domain Overview

• Mesh manipulation and Load Balancing

– These libraries help in partitioning meshes into roughly equal sizes across processors, thereby balancing the workload while minimizing size of separators and communication costs.

– Commonly used libraries for this purpose include METIS, ParMETIS, Chaco, JOSTLE among others.

• Other packages:

– FFTW: features highly optimized Fourier transform package including both real and complex multidimensional transforms in sequential, multithreaded, and parallel versions.

– NAMD: molecular dynamics library available for Unix/Linux, Windows, OS X

– Fluent: computational fluid dynamics package, used for such applications as environment control systems, propulsion, reactor modeling etc.

7


BLAS

• (Updated set of) Basic Linear Algebra Subprograms

• The BLAS functionality is divided into three levels:

– Level 1: contains vector operations of the form $y \leftarrow \alpha x + y$, as well as scalar dot products and vector norms

– Level 2: contains matrix-vector operations of the form $y \leftarrow \alpha A x + \beta y$, as well as solving $Tx = y$ for x with T being triangular

– Level 3: contains matrix-matrix operations of the form $C \leftarrow \alpha A B + \beta C$, as well as solving for triangular matrices T. This level contains the widely used General Matrix Multiply (GEMM) operation.
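The three levels can be sketched with plain, unoptimized reference loops. This is only an illustration of the operation forms y <- alpha*x + y, y <- alpha*A*x + beta*y and C <- alpha*A*B + beta*C; the function names `ref_axpy`, `ref_gemv` and `ref_gemm` and the row-major storage are assumptions made for this sketch, not BLAS routines.

```c
#include <assert.h>
#include <stddef.h>

/* Level 1 (AXPY form): y <- alpha*x + y, vectors of length n. */
void ref_axpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Level 2 (GEMV form): y <- alpha*A*x + beta*y, A is m-by-n, row-major. */
void ref_gemv(size_t m, size_t n, double alpha, const double *A,
              const double *x, double beta, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < m; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        y[i] = alpha * sum + beta * y[i];
    }
}

/* Level 3 (GEMM form): C <- alpha*A*B + beta*C,
   A is m-by-k, B is k-by-n, C is m-by-n, all row-major. */
void ref_gemm(size_t m, size_t n, size_t k, double alpha, const double *A,
              const double *B, double beta, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * sum + beta * C[i * n + j];
        }
    }
}
```

Level 1 does O(n) work on O(n) data, Level 2 does O(n^2) work on O(n^2) data, and Level 3 does O(n^3) work on O(n^2) data, which is why optimized Level 3 routines tend to benefit most from cache blocking.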

9

BLAS

• Several implementations for different languages exist

– Reference implementation (F77 and C): http://www.netlib.org/blas/

– ATLAS, highly optimized for particular processor architectures

– A generic C++ template class library providing BLAS functionality: uBLAS, http://www.boost.org

– Several vendors provide libraries optimized for their architecture (AMD, HP, IBM, Intel, NEC, NVIDIA, Sun)

10

BLAS: F77 naming conventions

11

BLAS: C naming conventions

• F77 routine name is changed to lowercase and prefixed with cblas_

• All routines which accept two dimensional arrays have a new additional first parameter specifying the matrix memory layout (row major or column major)

• Character parameters are replaced by corresponding enum values

• Input arguments are declared const

• Non-complex scalar input parameters are passed by value

• Complex scalar input arguments are passed using a void*

• Arrays are passed by address

• Output scalar arguments are passed by address

• Complex functions become subroutines which return the result via an additional last parameter (void*), appending _sub to the name
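The first and last naming rules above can be illustrated with a small string helper. The helper `cblas_name` is hypothetical and exists only for this sketch; it is not part of any BLAS distribution. The names it produces in the usage note, `cblas_dgemm` and `cblas_zdotc_sub`, are real entry points of the C interface.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: derive the C binding name from an F77 BLAS name.
   is_complex_function is nonzero for complex-valued functions, which the
   C interface turns into subroutines with a "_sub" suffix. */
void cblas_name(const char *f77_name, int is_complex_function,
                char *out, size_t out_size)
{
    assert(f77_name != NULL && out != NULL);
    char lower[32];
    size_t len = strlen(f77_name);
    assert(len < sizeof lower);
    for (size_t i = 0; i <= len; ++i)          /* copies the '\0' too */
        lower[i] = (char)tolower((unsigned char)f77_name[i]);
    snprintf(out, out_size, "cblas_%s%s", lower,
             is_complex_function ? "_sub" : "");
}
```

Called with "DGEMM" and a zero flag the helper yields "cblas_dgemm"; called with "ZDOTC" and a nonzero flag it yields "cblas_zdotc_sub".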

12

BLAS Level 1 routines

• Vector operations (xROT, xSWAP, xCOPY etc.)

• Scalar dot products (xDOT etc.)

• Vector norms (xNRM2, IxAMAX etc.)

13


BLAS Level 2 routines

• Matrix-vector operations (xGEMV, xGBMV, xHEMV, xHBMV etc.)

• Solving Tx = y for x, where T is triangular (xTRSV, xTBSV etc.)

• Rank-1 updates (xGER, xHER etc.)

14


BLAS Level 3 routines

• Matrix-matrix operations (xGEMM etc.)

• Solving for triangular matrices (xTRSM; xTRMM multiplies by a triangular matrix)

• Widely used matrix-matrix multiply (xSYMM, xGEMM)

15

Demo 1

• Shows solving a matrix multiplication problem using

BLAS expressed in FORTRAN, C, and C++

• Shows genericity of uBLAS, by comparing generic

and banded matrix versions

• Shows newmat, a C++ matrix library which uses

operator overloading

16


LAPACK

• Linear Algebra PACKage

– http://www.netlib.org/lapack/

– Written in F77

– Provides routines for

• Solving systems of simultaneous linear equations,

• Least-squares solutions of linear systems of equations,

• Eigenvalue problems,

• Householder transformation to implement QR decomposition on a matrix and

• Singular value problems

– Was initially designed to run efficiently on shared memory vector machines

– Depends on BLAS

– Has been extended for distributed-memory systems (ScaLAPACK and PLAPACK)
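To make "solving systems of simultaneous linear equations" concrete, here is a minimal dense solver using Gaussian elimination with partial pivoting followed by backward substitution. It is a teaching sketch, not LAPACK code: the name `ref_solve` and the row-major layout are assumptions, and a real application would call an optimized library routine instead.

```c
#include <assert.h>
#include <stddef.h>

/* Solve A x = b in place for a dense n-by-n row-major matrix A.
   On success returns 0 and b holds x; returns -1 if a pivot column
   is exactly zero. A is overwritten during elimination. */
int ref_solve(size_t n, double *A, double *b)
{
    assert(A != NULL && b != NULL);
    for (size_t k = 0; k < n; ++k) {
        /* Partial pivoting: pick the row with the largest |A[i][k]|. */
        size_t piv = k;
        double best = A[k * n + k] < 0 ? -A[k * n + k] : A[k * n + k];
        for (size_t i = k + 1; i < n; ++i) {
            double v = A[i * n + k] < 0 ? -A[i * n + k] : A[i * n + k];
            if (v > best) { best = v; piv = i; }
        }
        if (best == 0.0)
            return -1;
        if (piv != k) {                    /* swap rows k and piv */
            for (size_t j = 0; j < n; ++j) {
                double t = A[k * n + j];
                A[k * n + j] = A[piv * n + j];
                A[piv * n + j] = t;
            }
            double t = b[k]; b[k] = b[piv]; b[piv] = t;
        }
        /* Eliminate column k below the pivot (forward elimination). */
        for (size_t i = k + 1; i < n; ++i) {
            double f = A[i * n + k] / A[k * n + k];
            for (size_t j = k; j < n; ++j)
                A[i * n + j] -= f * A[k * n + j];
            b[i] -= f * b[k];
        }
    }
    /* Backward substitution on the upper triangular system. */
    for (size_t i = n; i-- > 0;) {
        double s = b[i];
        for (size_t j = i + 1; j < n; ++j)
            s -= A[i * n + j] * b[j];
        b[i] = s / A[i * n + i];
    }
    return 0;
}
```

The triple loop costs O(n^3) operations with poor cache reuse, which is the kind of code an optimized library is expected to outperform considerably.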

18

19

LAPACK (Architecture)

LAPACK naming conventions

20

Demo 2

• Shows how using a library might speed

up the computation considerably

21


PETSc (pronounced PET-see)

• Portable, Extensible Toolkit for Scientific Computation (http://www-unix.mcs.anl.gov/petsc/petsc-as/)

– Suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations (PDEs)

– Employs the MPI standard for all message-passing communication

– Intended for use in large-scale application projects

– Includes a large suite of parallel linear and nonlinear equation solvers

– Easily used in application codes written in C, C++, Fortran and Python

• Good introduction: http://www-unix.mcs.anl.gov/petsc/petsc-as/documentation/tutorials/nersc02/nersc02.ppt

23

PETSc (general features)

• Features include:

– Parallel vectors

• Scatters (handles communicating ghost point information)

• Gathers

– Parallel matrices

• Several sparse storage formats

• Easy, efficient assembly.

– Scalable parallel preconditioners

– Krylov subspace methods

– Parallel Newton-based nonlinear solvers

– Parallel time stepping (ODE) solvers

24

PETSc (Architecture)

25

PETSc: Module architecture and layers of abstraction

PETSc: Component details

• Vector operations (Vec): Provides the vector operations required for setting up and solving large-scale linear and nonlinear problems. Includes easy-to-use parallel scatter and gather operations, as well as special-purpose code for handling ghost points for regular data structures.

• Matrix operations (Mat): A large suite of data structures and code for the manipulation of parallel sparse matrices. Includes four different parallel matrix data structures, each appropriate for a different class of problems.

• Preconditioners (PC): A collection of sequential and parallel preconditioners, including

– (sequential) ILU(k) (incomplete factorization),

– LU (lower/upper decomposition),

– both sequential and parallel block Jacobi, overlapping additive Schwarz methods

• Time stepping ODE solvers (TS): Code for the time evolution of solutions of PDEs. In addition, provides pseudo-transient continuation techniques for computing steady-state solutions.

26

PETSc: Component details

• Krylov subspace solvers (KSP): Parallel implementations of many popular Krylov subspace iterative methods, including

– GMRES (Generalized Minimal Residual method),

– CG (Conjugate Gradient),

– CGS (Conjugate Gradient Squared),

– Bi-CG-Stab (BiConjugate Gradient Stabilized),

– two variants of TFQMR (transpose free QMR),

– CR (Conjugate Residuals),

– LSQR (least squares method).

All are coded so that they are immediately usable with any preconditioners and any matrix data structures, including matrix-free methods.

• Non-linear solvers (SNES): Data-structure-neutral implementations of Newton-like methods for nonlinear systems. Includes both line search and trust region techniques with a single interface. Employs by default the above data structures and linear solvers. Users can set custom monitoring routines, convergence criteria, etc.
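As an illustration of what a Krylov subspace method does, the following is a serial, unpreconditioned conjugate gradient (CG) loop for a small dense symmetric positive definite system. It is a sketch only and does not use or mirror the PETSc API; the name `ref_cg`, the fixed size bound `CG_MAX_N` and the dense row-major storage are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>

#define CG_MAX_N 64   /* illustrative fixed upper bound on the system size */

/* Unpreconditioned conjugate gradient for a dense, symmetric positive
   definite n-by-n row-major matrix A. x holds the initial guess on entry
   and the approximate solution on return. Iterates until the residual
   2-norm drops to tol or max_iter is reached; returns the iteration count. */
size_t ref_cg(size_t n, const double *A, const double *b, double *x,
              double tol, size_t max_iter)
{
    assert(n > 0 && n <= CG_MAX_N);
    double r[CG_MAX_N], p[CG_MAX_N], Ap[CG_MAX_N];
    double rr = 0.0;

    for (size_t i = 0; i < n; ++i) {            /* r = b - A x, p = r */
        double s = 0.0;
        for (size_t j = 0; j < n; ++j)
            s += A[i * n + j] * x[j];
        r[i] = b[i] - s;
        p[i] = r[i];
        rr += r[i] * r[i];
    }

    size_t it = 0;
    while (it < max_iter && rr > tol * tol) {
        double pAp = 0.0;
        for (size_t i = 0; i < n; ++i) {        /* Ap = A p */
            double s = 0.0;
            for (size_t j = 0; j < n; ++j)
                s += A[i * n + j] * p[j];
            Ap[i] = s;
            pAp += p[i] * s;
        }
        double alpha = rr / pAp;                /* step length */
        double rr_new = 0.0;
        for (size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            rr_new += r[i] * r[i];
        }
        double beta = rr_new / rr;              /* direction update */
        for (size_t i = 0; i < n; ++i)
            p[i] = r[i] + beta * p[i];
        rr = rr_new;
        ++it;
    }
    return it;
}
```

The only place the matrix appears is the product A*p, which is why such solvers can work with any matrix data structure, or with a matrix-free routine that merely applies the operator.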

27


Mesh libraries

• Introduction

– Structured/unstructured meshes

– Examples

• Mesh decomposition

29

Introduction to Meshes and Grids

• Mesh/Grid: 2D or 3D representation of the computational domain.

• Common 2D meshes are composed of triangular or quadrilateral elements

• Common 3D meshes are composed of hexahedral, tetrahedral or pyramidal elements

30

2D Mesh elements: Triangle, Quadrilateral

3D Mesh elements: Tetrahedron, Hexahedron, Prism

Structured Grids (Meshes)

• Cartesian grids, logically rectangular grids

• Mesh info accessed implicitly using grid point indices

– Efficient in both computation and storage

• Typically use finite difference discretization

Unstructured Meshes

• Mesh connectivity information must be stored

– Incurs additional memory and computational cost

• Handles complex geometries and grid adaptivity

• Typically use finite volume or finite element discretization

• Mesh quality becomes a concern
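The difference between implicit and stored connectivity can be sketched in a few lines. The names `grid_index`, `tri_mesh` and `triangles_at_node` are made up for this illustration and do not come from any mesh library.

```c
#include <assert.h>
#include <stddef.h>

/* Structured grid: point (i, j) of a grid with nx points per row maps to
   one array index, and neighbours follow from index arithmetic alone. */
size_t grid_index(size_t i, size_t j, size_t nx)
{
    assert(i < nx);
    return j * nx + i;
}

/* Unstructured mesh: connectivity is stored explicitly. Here each
   triangle lists the indices of its three corner nodes. */
typedef struct {
    size_t num_triangles;
    const size_t (*triangles)[3];
} tri_mesh;

/* Count how many triangles touch a given node by scanning connectivity. */
size_t triangles_at_node(const tri_mesh *mesh, size_t node)
{
    assert(mesh != NULL);
    size_t count = 0;
    for (size_t t = 0; t < mesh->num_triangles; ++t)
        for (size_t k = 0; k < 3; ++k)
            if (mesh->triangles[t][k] == node)
                ++count;
    return count;
}
```

On the structured grid the right-hand neighbour of a point is simply index + 1 and the neighbour in the next row is index + nx, with no extra storage. On the unstructured mesh the same kind of question requires the stored triangle list, which is the additional memory and computational cost mentioned above.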

31

Structured/Unstructured Meshes

Mesh examples

32

Meshes are used for Computation

33

Mesh Decomposition

• Goal is to maximize interior while minimizing connections between subdomains. That is, minimize communication.

• Such decomposition problems have been studied in load balancing for parallel computation.

• Lots of choices:

• METIS, ParMETIS -- University of Minnesota.

• PARTI -- University of Maryland,

• CHACO -- Sandia National Laboratories,

• JOSTLE -- University of Greenwich,

• PARTY -- University of Paderborn,

• SCOTCH -- Université Bordeaux,

• TOP/DOMDEC -- NAS at NASA Ames Research Center.

http://www.hlrs.de

34

Mesh Decomposition

• Load balancing

– Distribute elements evenly across processors.

– Each processor should have equal share of work.

• Communication costs should be minimized.

– Minimize sub-domain boundary elements.

– Minimize number of neighboring domains.

• Distribution should reflect machine architecture.

– Communication versus calculation.

– Bandwidth versus latency.

• Note that optimizing load balance and communication cost simultaneously is an NP-hard problem.
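The two competing goals can be measured on a given partition. The sketch below computes the edge cut (a proxy for communication cost) and a load imbalance factor for a graph stored as an edge list; the function names and the 64-part limit are assumptions for the example and are not taken from METIS, JOSTLE or any other partitioner.

```c
#include <assert.h>
#include <stddef.h>

/* Edge cut: number of graph edges whose endpoints lie in different parts.
   edges[e][0] and edges[e][1] are vertex ids; part[v] is the part of v. */
size_t edge_cut(size_t num_edges, const size_t (*edges)[2], const int *part)
{
    assert(edges != NULL && part != NULL);
    size_t cut = 0;
    for (size_t e = 0; e < num_edges; ++e)
        if (part[edges[e][0]] != part[edges[e][1]])
            ++cut;
    return cut;
}

/* Load imbalance: largest part size divided by the average part size.
   1.0 means perfectly balanced. Assumes 0 <= part[v] < num_parts <= 64. */
double load_imbalance(size_t num_vertices, const int *part, int num_parts)
{
    assert(part != NULL && num_vertices > 0);
    assert(num_parts > 0 && num_parts <= 64);
    size_t count[64] = {0};
    for (size_t v = 0; v < num_vertices; ++v) {
        assert(part[v] >= 0 && part[v] < num_parts);
        ++count[part[v]];
    }
    size_t largest = 0;
    for (int p = 0; p < num_parts; ++p)
        if (count[p] > largest)
            largest = count[p];
    return (double)largest * num_parts / (double)num_vertices;
}
```

For a 2-by-3 grid graph with 7 edges, splitting into two rows of three vertices gives an edge cut of 3 with perfect balance, while splitting into a block of four and a block of two gives an edge cut of only 2 but an imbalance of about 1.33. Improving one measure can worsen the other, which is the trade-off a partitioner has to negotiate.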

http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-13.html

35

36

http://www.hlrs.de

Mesh decomposition

Static Grids (Meshes)

• Decomposition need only be carried out once

• Static decomposition may therefore be carried out as a preprocessing step, often done in serial

Dynamic Meshes

• Decomposition must be adapted as underlying mesh or processor load changes.

• Dynamic decomposition therefore becomes part of the calculation itself and cannot be carried out solely as a pre-processing step.

37

http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-14.html

Static and Dynamic Meshes

HP J6700, 1 CPU. Solve Time: 13:26 (Baseline Time)

38

src : Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt

Linux Cluster, 2 CPUs. Solve Time: 5:20. Speed-Up: 2.5X

39



40



41



42


Speedup due to decomposition

# CPUs    Run-time (s)
1         806
2         320
4         187
8         111
16        63
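Speedup and parallel efficiency follow directly from these run-times, using the usual definitions S(p) = T(1) / T(p) and E(p) = S(p) / p. The helper names below are illustrative.

```c
#include <assert.h>

/* Speedup S(p) = T(1) / T(p). */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency E(p) = S(p) / p. */
double efficiency(double t_serial, double t_parallel, int cpus)
{
    assert(cpus > 0);
    return speedup(t_serial, t_parallel) / cpus;
}
```

With the table values, 2 CPUs give 806 / 320, roughly 2.5, and 16 CPUs give 806 / 63, roughly 12.8, an efficiency of about 0.80. The efficiency above 1 at 2 and 4 CPUs may simply reflect that the 1 CPU baseline was measured on a different machine than the cluster runs.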

43

44

Jostle and Metis

http://www.hlrs.de

Jostle

45

Jostle

46

Jostle

47

Metis

48

ParMetis

49

Metis (serial)

50

Comparison

51


FFTW

• Fastest Fourier Transform in the West

• Portable C subroutine library for computing discrete cosine/sine transform (DCT/DST)

• Computes arbitrary size discrete Fourier and Hartley transforms on real or complex data, in one or more dimensions

• Optimized for speed through application of special-purpose compiler genfft (codelet generator), originally written in OCaml; performance comparable even with vendor optimized libraries

• Free software, distributed under GPL; also available under a commercial license from MIT

• Developed at MIT by Matteo Frigo and Steven G. Johnson

• Won J. H. Wilkinson Prize for Numerical Software in 1999

• Most recent stable version is 3.1.2 (http://www.fftw.org)

53

Main FFTW Features

• C and FORTRAN interfaces, C++ wrappers available

• Speed, including support for SSE, SSE2, 3dNow! and Altivec

• Arbitrary size transforms with complexity of O(n·log(n)) (sizes which can be factored into 2, 3, 5 and 7 are most efficient by default, but custom code can also be generated for other sizes if required)

• Even/odd data (DCT/DST), types I-IV

• Can produce pure real output, or process pure real input data

• Efficient handling of multiple, strided transforms (e.g. transformation of multiple arrays at once; one dimension of multi-dimensional array; one field of multi-component array)

• Parallel code supporting Cilk, SMP platforms with threads, or MPI

• Ability to save and restore plans optimized for a given platform (through wisdom mechanism)

• Portable to any platform with a working C compiler

54

FFTW Sample Code

Source: http://www.fftw.org/fftw3.pdf

Computing 1-D complex DFT

55

#include <fftw3.h>
...
{
    fftw_complex *in, *out;
    fftw_plan p;
    ...
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    /* populate in[] with input data */
    ...
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    ...
    fftw_execute(p); /* repeat as needed */
    /* transform now available in out[] */
    ...
    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
}


The Boost Libraries

• What’s Boost

– What’s important

– Other stuff

57

What is Boost?

• Data Structures, Containers, Iterators, and Algorithms

• String and Text Processing

• Function Objects and Higher-Order Programming

• Generic Programming and Template Metaprogramming

• Math and Numerics

• Input/Output

• Miscellaneous

• Mostly header only

58

What’s important

• OS abstraction

– Thread: OS independent kernel level thread interface

– Asio: asynchronous input output

– Filesystem: file system operations such as file copy, delete, directory create, file path handling

– System: OS error code abstraction and handling

– Program options: handling of command line arguments and parameters

– Streams: build your own C++ streams

– DateTime: Handling of dates, times and time periods

– Timer: simple timer object

59

What’s important

• Data types, Container types, all extending STL

– Pointer containers: allow for pointers in STL containers: vector<char *> → ptr_vector<char>

– Multi index: data structures with multiple indices

– Constant sized arrays: array<char, 10>, acts like vector or plain 'C' array

– Any: can hold values of any type (if you need polymorphism)

– Variant: can hold values of any of the types specified at compile time ('C' equivalent is discriminated union)

– Optional: can hold a value or nothing

– Tuple: like a vector or array, but every element may have a different type (similar to plain struct)

– Graph library: very sophisticated collection of graph related data structures and algorithms

• Parallel version exists (using MPI)

60

What’s important

• Helper classes

– Smart pointers: working with pointers without having to worry about memory management

– Memory pools: specialized memory allocation for containers

– Iterator library: write your own iterator classes with ease (non trivial otherwise)

61

Other stuff in Boost

• String and Text processing

– Regex, parsing, format, conversion etc.

• Algorithms

– String algos, FOR_EACH, minmax etc.

• Math and numerics

– Conversion, interval, random, octonion, quaternion, special functions, rational, uBLAS

• Functional and higher order programming

– Bind, lambda, function, ref, signals etc.

• Generic and template metaprogramming

– Proto, mpl, fusion, phoenix, enable_if etc.

• Testing

– Unit tests, concept checks, static_assert

62

Conclusion

• Look at Boost first if you need something not available in Standard library

• Even if it's not in Boost look around, there are a lot of libraries in preparation for Boost (Boost Sandbox, File Vault)

63

Links

• Boost, current release V1.33.1

– Web: http://www.boost.org

– CVS: http://sourceforge.net/projects/boost

• Boost Sandbox

– CVS: http://sourceforge.net/projects/boost-sandbox

– File Vault: http://boost-consulting.com/vault/

• Boost mailing lists

– http://www.boost.org/more/mailing_lists.htm

64

Outlook

Functional specification with a Domain Specific Embedded Language (DSEL)

equation = sum<vertex_edge>
[
    sumf<edge_vertex>(0.0, _e)
    [
        pot * orient(_e, _1)
    ] * A / d * eps
] - V * rho

65

Elliptic PDE discretized by Finite Volume

References: [1]

References

1. Rene Heinzl, Modern Application Design using Modern Programming Paradigms and a Library-Centric Software Approach, OOPSLA 2006, Workshop on Library Centric Software Design, Portland, Oregon, October 2006.

66


Summary – Material for the Test

• High performance libraries: 5, 6, 7

• Linear algebra libraries: BLAS: 9, 11, 12

• Linear algebra libraries: LAPACK: 18

• PDE Solvers: 23, 24, 26, 27

• Mesh decomposition & load balancing: 30, 31, 34, 35, 37, 44, 45, 46, 48, 49

• FFTW: 53, 54

• Boost: 58, 59, 60, 61, 62
