
Page 1

Hierarchic Data Structures for Sparse Matrix Representation in Large-scale DFT/HF Calculations

Paweł Sałek
Theoretical Chemistry, KTH, Stockholm

16-18 October 2007, Linköping

Page 2

Talk Outline

Running large-scale ab initio calculations – scaling.

Solving problems with sparse matrices.

OpenMP parallelization – problems.

Page 3

Performance of the SCF Cycle

Computation of the Kohn-Sham matrix F is time-consuming.

For really large systems, density evaluation (F → D) is time-consuming as well.

Matrix memory usage grows quadratically.

Local basis set – basis functions localized on atoms.

Flowchart of the SCF cycle: get a starting guess D(0); compute F(n) = F(D(n)); find D(n+1) = D(F(n)); repeat until converged, i.e. until D(n+1) = D(n).

Page 4

Research Group

Elias Rudberg: Interaction evaluation (PhD in December).

Emanuel Rubensson: Sparse matrices.

Page 5

F → D Step

Density traditionally obtained via diagonalization and the aufbau principle:

FC = εSC,   D = C_occ C_occ^T

Diagonalization does not scale linearly.

Density optimization and purification algorithms scale linearly when sparsity is used.

Page 6

Sparse Matrices in Quantum Chemistry

Matrices are used to represent the operators D and F.

Overlap matrix, density matrix, Fock matrix, Kohn-Sham matrix.

Matrices must be represented in such a way that common operations are fast.

Sparsity appears only for larger molecules (> 50 atoms).

Page 7

Sparsity Patterns

Figure: each block of the density matrix couples a pair of atoms A and B, and the magnitude of its elements decays with the distance between the atom centers. For glycine, element magnitudes fall from about 10^3 to below 10^-6 over 0-60 au.

Matrix sparsity depends on the basis set and geometry, and to some extent on the band gap.

Page 8

Taking Advantage of Sparsity Patterns

Sparsity appears in blocks.

Reorder atoms to merge the atom blocks into larger ones.

Use BLAS for operations on blocks and the Compressed Sparse Row (CSR) format for block storage (see the sketch below).

Enforce sparsity by truncating small elements.

Figure: alkane-chain Fock matrix sparsity pattern shown unordered, blocked by atoms, and reordered and blocked.
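To make the storage scheme concrete, here is a minimal block-CSR sketch in C++; the struct and field names are hypothetical and do not reflect the actual library layout:

#include <vector>

// Minimal block-CSR sketch (hypothetical names): each stored entry is a
// dense block, so BLAS can be applied block by block.
struct BlockCSR {
  int blockSize;                 // rows/columns per dense block
  int nBlockRows;                // number of block rows
  std::vector<int> rowStart;     // nBlockRows+1 offsets into cols/blocks
  std::vector<int> cols;         // block-column index of each stored block
  std::vector<std::vector<double> > blocks; // blockSize*blockSize data each

  // Return a pointer to block (i, j), or 0 if it was truncated away.
  const double* find(int i, int j) const {
    for (int k = rowStart[i]; k < rowStart[i + 1]; ++k)
      if (cols[k] == j) return &blocks[k][0];
    return 0;
  }
};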

Page 9

F → D Step With Sparse Matrices

Use Trace-Correcting Purification (TC2) – a series of spectral transformations.

Performance is limited by the sparse matrix multiplication speed.

Figure: the two purification polynomials plotted as x(n+1) against x(n): f1(x) = x^2 and f2(x) = 2x - x^2, each mapping [0, 1] onto itself.

compute P = (lmax*I - F) / (lmax - lmin)
while abs(trace(P) - N) > threshold
  if trace(P) > N then
    P := P*P
  else
    P := 2*P - P*P
end while
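For illustration, a minimal dense-matrix version of TC2 in plain C++ (no sparsity, no BLAS). This is only a sketch of the algorithm above, under the assumptions that lmin and lmax bound the spectrum of F and N is the target trace:

#include <cmath>
#include <vector>

typedef std::vector<double> Mat; // dense n*n matrix, row-major

static double trace(const Mat& a, int n) {
  double t = 0.0;
  for (int i = 0; i < n; ++i) t += a[i * n + i];
  return t;
}

static Mat mul(const Mat& a, const Mat& b, int n) {
  Mat c(n * n, 0.0);
  for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k)
      for (int j = 0; j < n; ++j)
        c[i * n + j] += a[i * n + k] * b[k * n + j];
  return c;
}

// TC2: purify F into a density matrix P with trace N.
static Mat tc2(const Mat& F, int n, double N,
               double lmin, double lmax, double threshold) {
  Mat P(n * n);
  // P = (lmax*I - F) / (lmax - lmin) maps the spectrum into [0, 1].
  for (int i = 0; i < n * n; ++i) P[i] = -F[i] / (lmax - lmin);
  for (int i = 0; i < n; ++i) P[i * n + i] += lmax / (lmax - lmin);
  while (std::fabs(trace(P, n) - N) > threshold) {
    Mat P2 = mul(P, P, n);
    if (trace(P, n) > N) {
      P = P2;                      // trace too large: P := P*P
    } else {
      for (int i = 0; i < n * n; ++i)
        P[i] = 2.0 * P[i] - P2[i]; // trace too small: P := 2*P - P*P
    }
  }
  return P;
}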

Page 10

Example TC2 Application

Figure (glycine molecule), four panels over about 30 TC2 iterations: the eigenvalues converge to 0 and 1; the accumulated error stays between 10^-6 and 10^-4; the idempotency error drops from 10^0 toward 10^-6; and the share of nonzero elements varies between 20% and 60%.

Page 11

Problems with TC2

The number of TC2 iterations depends on the band gap.

Control the error: the TC2 error grows exponentially with the number of iterations.

A more flexible representation than Compressed Sparse Row is needed for easy implementation of other algorithms.

Page 12

Systematic Small-Submatrix Selection Algorithm

Error introduced by truncation of small elements.

First approaches considered only the distance between atoms and empirical threshold factors – unreliable!

More advanced approaches look at the norms of the neglected blocks – more reliable, but strict error control is still impossible.

SSSA looks at the error of the entire matrix and provides strict error control.

Page 13

The Effect of SSSA on Total Energy Error

Benchmark on water clusters with Hartree-Fock and STO-3G. Alg. 1: threshold-based filtering; alg. 2: SSSA.

Two plots: (1) truncation error (10^-9 to 10^-6) in the Fock and density matrices versus the number of atoms (73 to 353) for both algorithms; (2) error in the total energy (Hartree, 10^-8 to 10^0) versus matrix size (0 to 4000) for truncation thresholds from 10^-1.5 down to 10^-3.

SSSA provides rigorous error control. This saves time and gives trustworthy results; energy extrapolation becomes possible.

Page 14

Hierarchic Matrix Library (C++)

HML allows for low-overhead random element access.


typedef Matrix<Matrix<Matrix<double> > > MyMatrixType;
typedef Matrix<Matrix<Matrix<long double> > > MyAccurateMatrixType;
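The low-overhead element access can be pictured as a recursive descent through the template hierarchy. The layout and names below are hypothetical, not the real HML internals; the point is that zero submatrices can be stored as null pointers, so lookups in truncated regions return 0.0 immediately:

// Bottom of the recursion: the innermost children are plain scalars.
inline double getElement(const double& x, int, int) { return x; }

// Hypothetical sketch of one hierarchy level.
template <class Telement>
struct Matrix {
  Telement** blocks;  // nrows*ncols children; a null pointer is a zero block
  int nrows, ncols;   // children per side at this level
  int rowsPerChild, colsPerChild;
};

template <class Telement>
double getElement(const Matrix<Telement>& m, int row, int col) {
  const Telement* child =
      m.blocks[(row / m.rowsPerChild) * m.ncols + (col / m.colsPerChild)];
  if (!child) return 0.0;  // truncated-away (zero) submatrix
  return getElement(*child, row % m.rowsPerChild, col % m.colsPerChild);
}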

Page 15

HML Features

Easy to code and maintain.

Block size determined by the architecture's performance, not by chemistry.

Low overhead random element access.

Blocked algorithms are easy to express:
1. Matrix multiplication, also by transposed matrices.
2. Use of matrix symmetry.
3. INverse CHolesky factorisation (INCH).

Page 16

Block Size Tradeoff

Smaller block size → more opportunity for screening.

Larger block size → better block-block multiplication performance.

Figure: MKL dgemm performance (0-9 Gflops) versus matrix size (0-180) on Woodcrest and Nocona CPUs.

Page 17

Example Implementation of C := beta*C + A*B

static void multiply(const Matrix<Telement>& A,
                     const Matrix<Telement>& B,
                     Matrix<Telement>& C, double beta) {
  for (int colC = 0; colC < C.ncols; colC++)
    for (int rowC = 0; rowC < C.nrows; rowC++) {
      Telement::multiply(A(rowC, 0), B(0, colC),
                         C(rowC, colC), beta);
      for (int colA = 1; colA < A.ncols; colA++)
        Telement::multiply(A(rowC, colA), B(colA, colC),
                           C(rowC, colC), 1);
    }
}

Lowest level (block) multiplication expressed in terms of BLAS calls (see the sketch below).

Template expansion will generate (instantiate) code for all the remaining hierarchy levels.
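At the bottom of the hierarchy this might look as follows; a hypothetical sketch assuming dense, square, column-major blocks and the Fortran BLAS interface (dgemm_), not the actual HML code:

extern "C" void dgemm_(const char* transa, const char* transb,
                       const int* m, const int* n, const int* k,
                       const double* alpha, const double* a, const int* lda,
                       const double* b, const int* ldb,
                       const double* beta, double* c, const int* ldc);

// Hypothetical bottom-level block. It exposes the same multiply()
// signature as the template levels, so the recursion bottoms out in a
// single BLAS call.
struct DenseBlock {
  double* data; // n*n column-major elements
  int n;        // block dimension

  static void multiply(const DenseBlock& A, const DenseBlock& B,
                       DenseBlock& C, double beta) {
    const double one = 1.0; // C := 1.0*A*B + beta*C
    dgemm_("N", "N", &C.n, &C.n, &A.n, &one,
           A.data, &A.n, B.data, &B.n, &beta, C.data, &C.n);
  }
};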

Page 18

Intel MKL vs HML Benchmark

The HML design allows for easy implementation of symmetric matrix squaring (sysq: S := αT^2 + βS) as needed by TC2; sysq is twice as fast as a general sparse multiplication.
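The factor of two comes from symmetry: for symmetric T, T^2 is itself symmetric, so only one triangle of the result has to be formed. A minimal dense sketch of the idea (illustrative only, not the HML implementation):

#include <vector>

// sysq sketch: S := alpha*T^2 + beta*S for symmetric T (dense, row-major).
// Only the upper triangle is computed and then mirrored, roughly halving
// the multiplications of a general product.
void sysq(double alpha, const std::vector<double>& T,
          double beta, std::vector<double>& S, int n) {
  for (int i = 0; i < n; ++i)
    for (int j = i; j < n; ++j) {       // upper triangle only
      double sum = 0.0;
      for (int k = 0; k < n; ++k)       // T(k,j) = T(j,k) by symmetry
        sum += T[i * n + k] * T[j * n + k];
      S[i * n + j] = alpha * sum + beta * S[i * n + j];
      S[j * n + i] = S[i * n + j];      // mirror into the lower triangle
    }
}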

Figure: matrix multiplication benchmark on water clusters/3-21G; time (0-50 seconds) versus matrix size (0-15000) for dgemm (MKL), dgemm (HML, τ = 10^-6), and dsysq (HML, τ = 10^-6).

Page 19

OpenMP Parallelization

Parallel programs are necessary to efficiently use modern multi-core hardware.

OpenMP is less invasive and easier to load-balance.

Problems with scaling and... compiler support.

Poor compiler support! GNU gcc is the only reliable, OpenMP-enabled compiler known to us so far.

Page 20

Details of OpenMP Parallelization

Pick a level in the hierarchy and run a parallel loop with dynamic scheduling over it (see the sketch below).

The approach is trivial to implement.

Higher levels: coarse load distribution.

Lower levels: thread startup overhead.

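A minimal sketch of this scheme with hypothetical names: the block pairs of the chosen level are flattened into a task list and distributed with dynamic scheduling, while everything below that level runs serially within a thread (compile with -fopenmp or equivalent):

// Hypothetical sketch: parallelize C := C + A*B over the blocks of one
// chosen hierarchy level. schedule(dynamic) balances the uneven work
// that sparsity causes across threads.
template <class MatrixT>
void parallelMultiply(const MatrixT& A, const MatrixT& B, MatrixT& C) {
  const int ntasks = C.nrows * C.ncols;
#pragma omp parallel for schedule(dynamic)
  for (int task = 0; task < ntasks; task++) {
    const int rowC = task % C.nrows;
    const int colC = task / C.nrows;
    // Hypothetical helper: serially accumulates
    // C(rowC, colC) += sum over k of A(rowC, k) * B(k, colC).
    multiplyBlockRowByColumn(A, B, C, rowC, colC);
  }
}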

Page 21

Exceptions and OpenMP

OpenMP and C++ exceptions do interact.

Threads must catch any exceptions that they generate; the behavior is undefined otherwise.

We do the right thing (in case you ask).

#pragma omp parallel for
for (int i = 0; i < MAX; i++) {
  try {
    // Heavy lifting here
  } catch (...) { /* Handle it nicely. */ }
}

Page 22

Compiler Problems

GNU C++: OpenMP support since 4.1(?). No problems found. Sequential performance lower than its competitors'.

Portland C++: fairly warns that it cannot handle exceptions and OpenMP at the same time. An honest warning, but...

Intel C++: three versions tried. All of them had bugs either in sequential code or in OpenMP parallelization.
  8.1: fails to generate correct sequential code; miscompiles OpenMP code as well.
  9.1: works sequentially; the compiler crashes when invoked with the -openmp flag.
  10.0: fails to generate correct sequential code.
Support tickets with Intel are open.

Page 23

OpenMP Speedup

Timings taken on a 1.5 GHz Itanium2 with 4 CPUs (luc2, PDC), 4 threads.

Glycine-alanine chain with 1600+ atoms, HF method. GNU C++.

Operation      CPU time [s]   Wall time [s]   Speedup
FDS-SDF        133.54         53              2.66
Purification   947.59         454             2.13

Acceptable multiplication load balancing (3.5/4.0), but serial data management has a negative impact on scalability.

Additionally, purification involves serial error estimation routines.

Page 24

Summary & Outlook

SSSA – for strict error control.

HML – flexible sparse matrix representation.

A number of algorithms (arbitrary MxM multiplications, inverse Cholesky factorisation) already implemented.

OpenMP parallelization.

Outlook

Analyse the sparsity in QM methods beyond the algorithms relevant for SCF: linear response for the calculation of molecular properties.

Page 25

Small Submatrix Selection Algorithm (SSSA)

Given a matrix norm ‖·‖ and an error limit ε, we want to find a sparse approximation Ã of A such that ‖A − Ã‖ < ε.

SSSA:

1. Compute the Frobenius norm of each submatrix.

2. Sort the values in descending order.

3. Remove submatrices from the end as long as the error stays within the desired accuracy.

⇒ The resulting error comes very close to the requested value in the Frobenius norm.
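A minimal sketch of these steps in the Frobenius norm, assuming the submatrix norms are already available (illustrative only, with hypothetical names). It uses the fact that for disjoint blocks ‖A − Ã‖_F^2 equals the sum of the squared norms of the dropped blocks:

#include <algorithm>
#include <utility>
#include <vector>

// SSSA sketch: given the Frobenius norm of every submatrix, select which
// ones to drop so that the total error stays below epsilon.
// Returns the indices of the blocks to remove.
std::vector<int> selectBlocksToDrop(const std::vector<double>& blockNorms,
                                    double epsilon) {
  // Order block indices by norm, smallest first (equivalent to sorting
  // in descending order and removing from the end).
  std::vector<std::pair<double, int> > order;
  for (int i = 0; i < (int)blockNorms.size(); ++i)
    order.push_back(std::make_pair(blockNorms[i], i));
  std::sort(order.begin(), order.end());

  // Greedily drop the smallest blocks while the accumulated squared
  // error stays below epsilon^2.
  std::vector<int> dropped;
  double err2 = 0.0;
  for (size_t k = 0; k < order.size(); ++k) {
    const double norm = order[k].first;
    if (err2 + norm * norm >= epsilon * epsilon) break;
    err2 += norm * norm;
    dropped.push_back(order[k].second);
  }
  return dropped;
}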