
Large-scale Eigensolvers for SciDAC Applications
Eugene Vecharynski, Meiyue Shao, Chao Yang, Esmond Ng

Lawrence Berkeley National Laboratory

Overview

Eigenvalue problems arise in a number of SciDAC applications. We highlight some recent progress on 1) computing a large number of eigenpairs of a Hermitian matrix in the context of density functional theory based electronic structure calculations; 2) computing a few selected eigenpairs of a non-Hermitian matrix in the context of equation-of-motion coupled-cluster (EOM-CC) and complex-scaling configuration interaction calculations; and 3) computing all or a selected number of eigenpairs of the Bethe–Salpeter and Casida Hamiltonian matrices, which have a special structure.

Computing a large invariant subspace of a Hermitian matrix

Motivation:
- Large-scale density functional theory based electronic structure calculations require computing a large number of lowest eigenpairs (10^3 pairs or more).
- Density functional perturbation theory requires many more lowest eigenpairs (10^3–10^5).

The challenge:
- Existing eigensolvers contain repeated calls to the Rayleigh–Ritz (RR) procedure, which becomes a bottleneck when many eigenpairs are computed on massively parallel distributed-memory machines.
- Standard computational kernels for solving dense eigenvalue problems (ScaLAPACK) do not scale beyond a certain number of cores.

Our goal:
- Compute many lowest eigenpairs on massively parallel high performance computers.
- Avoid or reduce the amount of RR computations.

The Projected Preconditioned Conjugate Gradient (PPCG) algorithm

- A new eigensolver for computing large invariant subspaces of Hermitian matrices.
- The standard Rayleigh–Ritz procedure is replaced by a sequence of small dense eigenvalue problems plus a QR factorization of the approximate eigenspace.
- The Rayleigh–Ritz computation is performed only once every 5–10 iterations.
- Takes advantage of the available preconditioning techniques.
- Relatively easy to implement; a simplified sketch is given after the reference below.

[1] E. Vecharynski, C. Yang, J. E. Pask: A projected preconditioned conjugate gradient algorithm for computing many extreme eigenpairs of a Hermitian matrix, J. Comp. Phys., Vol. 290, pp. 73–89, 2015.
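A minimal dense-matrix sketch of the PPCG idea from [1] (simplifications and assumptions: single-column sub-blocks, NumPy/SciPy dense kernels, and the illustrative names ppcg_sketch, precond, rr_period; the production solver works with distributed sparse operators and sub-blocks of columns):

```python
import numpy as np
from scipy.linalg import eigh

def ppcg_sketch(A, X, precond=None, maxiter=50, rr_period=5):
    """Simplified PPCG-style iteration for the lowest eigenpairs of Hermitian A.

    A       : (n, n) Hermitian ndarray
    X       : (n, k) block with orthonormal columns (initial guess)
    precond : callable applied to the residual block, or None
    Returns an orthonormal basis of the approximate invariant subspace.
    This is only a sketch of the idea in [1], not the production algorithm.
    """
    n, k = X.shape
    P = np.zeros_like(X)
    for it in range(maxiter):
        AX = A @ X
        W = AX - X @ (X.conj().T @ AX)        # residuals projected away from span(X)
        if precond is not None:
            W = precond(W)
            W -= X @ (X.conj().T @ W)
        Xnew, Pnew = np.empty_like(X), np.empty_like(X)
        for j in range(k):                    # one tiny (2x2 or 3x3) problem per column
            cols = [X[:, j], W[:, j]] + ([P[:, j]] if it > 0 else [])
            S = np.column_stack(cols)
            vals, Y = eigh(S.conj().T @ A @ S, S.conj().T @ S)
            c = Y[:, 0]                       # coefficients of the lowest local Ritz pair
            Xnew[:, j] = S @ c
            Pnew[:, j] = S[:, 1:] @ c[1:]     # new search direction (W and P parts only)
        X, P = Xnew, Pnew
        X, _ = np.linalg.qr(X)                # cheap re-orthonormalization instead of RR
        if (it + 1) % rr_period == 0:         # occasional full Rayleigh-Ritz rotation
            _, V = np.linalg.eigh(X.conj().T @ A @ X)
            X = X @ V
    return X
```

The point of the sketch is that the expensive full Rayleigh–Ritz rotation is replaced most of the time by k tiny independent eigenproblems plus one QR factorization of the block, which is far easier to scale than a 2k x 2k (or larger) dense eigensolve at every iteration.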

Performance of the PPCG algorithm in Quantum Espresso

Benchmark systems: the solvation of LiPF6 in ethylene carbonate and propylene carbonate liquids containing 318 atoms (left), a 16 by 16 supercell of graphene containing 512 carbon atoms (center), and a 5 by 5 by 5 supercell of bulk silicon containing 1000 silicon atoms (right).

[Figures: convergence of the Davidson and PPCG eigensolvers (Frobenius norm of the residual vs. time in seconds) for (a) Li318 (2,062 eigenpairs, 480 cores), (b) Graphene512 (2,254 eigenpairs, 576 cores), and (c) Silicon1000 (2,550 eigenpairs, 2,400 cores); and strong scaling of both eigensolvers for Li318 (time in seconds vs. number of CPUs).]

Ongoing work: PPCG in QBox

The ultimate goal is to perform a molecular dynamics simulation for up to 10,000 atoms to study lithium-ion batteries.

[Figures: eigenvalues and occupations for Al108 (left); SCF convergence of Al108 on 120 cores of Edison, SCF error vs. time in seconds, comparing PPCG (4 steps/SCF iteration) with PSDA (7 steps/SCF iteration) (right).]

- 2x speedup due to a reduced SCF iteration count compared to the default QBox eigensolver (PSDA).

Non-Hermitian eigenvalue problems: computational challenges

Computing a subset of eigenpairs closest to a given shift σ
- Electronic resonant states (method of complex coordinate rotation).
- Equation-of-motion coupled-cluster (EOM-CC) method.

Difficulties with the existing solution approaches
- Require inverting A − σI ("shift-and-invert"); see the sketch below.
- Performance issues:
  - Limited degree of parallelism ("one-by-one" eigenpair computation).
  - Failure to fully take advantage of BLAS3.
- Robustness issues.

Our goal:
- Develop a novel eigensolver that overcomes the known difficulties.
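For context, a minimal sketch of the conventional shift-and-invert route using SciPy's ARPACK wrapper on a made-up sparse matrix; passing sigma makes eigs factorize the shifted matrix A − σI internally, which is precisely the expensive and hard-to-parallelize step the method below is designed to avoid:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigs

# Toy non-Hermitian sparse matrix (illustrative only, not an EOM-CC operator).
n = 2000
A = sp.diags([np.arange(1.0, n + 1), 0.5 * np.ones(n - 1), -0.3 * np.ones(n - 1)],
             [0, 1, -1], format="csc")

sigma = 11.0
# Shift-and-invert Arnoldi: ARPACK requires solves with (A - sigma*I),
# so a sparse LU factorization of the shifted matrix is built internally.
vals, vecs = eigs(A, k=3, sigma=sigma, which="LM")
print(vals)   # eigenvalues closest to sigma
```

For large matrix-free operators such as those in EOM-CC, this factorization is not available, which motivates a preconditioned, factorization-free alternative.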

The Generalized Preconditioned Locally Harmonic Residual (GPLHR) method
- Uses the harmonic Rayleigh–Ritz procedure to extract approximate eigenpairs from low-dimensional search subspaces (a sketch of this extraction step follows the references below).
- Performs block iterations, effectively leverages BLAS3 kernels, and provides multiple levels of concurrency.
- Takes advantage of the available preconditioning techniques.
- Robust; maintains good convergence when memory is limited.
- Provides an option of switching between approximate eigenvector and Schur vector iterations.

[1] E. Vecharynski, C. Yang, F. Xue: Generalized preconditioned locally harmonic residual method for non-Hermitian eigenproblems, SIAM J. Sci. Comput., submitted (2015).
[2] D. Zuev, E. Vecharynski, C. Yang, N. Orms, and A. I. Krylov: New algorithms for iterative matrix-free eigensolvers in quantum chemistry, J. Comp. Chem., Vol. 36, Issue 5, pp. 273–284, 2015.
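A minimal dense sketch of the harmonic Rayleigh–Ritz extraction step used inside GPLHR (only the extraction; the preconditioned block construction of the search subspace and the Schur-vector option are omitted, and the function name and arguments are illustrative):

```python
import numpy as np

def harmonic_rayleigh_ritz(A, V, sigma, nev):
    """Extract approximations to the eigenpairs of A closest to sigma
    from the search subspace spanned by the columns of V.

    Harmonic Ritz pairs satisfy  W^H W y = theta * W^H V y  with
    W = (A - sigma*I) V; the approximate eigenvalues are sigma + theta.
    Assumes W^H V is nonsingular (fine for a sketch).
    """
    W = A @ V - sigma * V
    G1 = W.conj().T @ W                  # m x m projected matrices (m = dim of subspace)
    G2 = W.conj().T @ V
    theta, Y = np.linalg.eig(np.linalg.solve(G2, G1))
    order = np.argsort(np.abs(theta))    # smallest |theta| -> eigenvalues closest to sigma
    theta, Y = theta[order[:nev]], Y[:, order[:nev]]
    ritz_vecs = V @ Y
    ritz_vecs /= np.linalg.norm(ritz_vecs, axis=0)
    return sigma + theta, ritz_vecs
```

Because only small projected matrices are formed, the extraction is dominated by BLAS3 products with tall skinny blocks and never requires solving with A − σI.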

GPLHR in Q-Chem: EOM-CC benchmark

Benchmark systems: hydrated photoactive yellow protein chromophore PYPa-Wp (left) and dihydrated 1,3-dimethyluracil (mU)2-(H2O)2 (right).

PYPa-Wp/6-31+G(d,p), GPLHR (σ = 11 a.u.)

nroots^a   niters^b   m   Max. # of stored vectors   # matvec^c
   1          4       1              8                    9
   2          4       1             16                   18
   3          4       1             24                   27
   5          8       1             40                   63

^a The number of requested eigenpairs. ^b The number of iterations to converge all eigenpairs. ^c The total number of matrix-vector multiplications. Davidson failed to deliver the solution.

Left: PYPa-Wp/6-31+G(d,p) for the pairs with converged energies of 4.11 and 4.20 eV; Right: (mU)2-(H2O)2/6-311+G(d,p) for the pairs with converged energies of 8.89 and 10.04 eV.

GPLHR in other applications

GPLHR exhibits rapid and reliable convergence for a variety of standard and generalized eigenvalue problems across different applications (e.g., crystal growth simulation or stability analysis of fluid flows).

[Figures: largest relative eigenresidual norm vs. number of preconditioned matvecs for CRY10000 (k = 5, left) and LSTAB_NS (k = 7, right), comparing GPLHR with BGD(4k), BGD(6k), and BGD(8k).]

- Often requires significantly less memory to maintain a convergence rate similar to state-of-the-art approaches (e.g., block Davidson methods).

Eigenvalue problems with a paired matrix structure

- The Bethe–Salpeter eigenvalue (BSE) problem
- The linear response (LR) eigenvalue problems in time-dependent density functional theory (TDDFT)

[Figures: an exciton (electron-hole pair) spanning the band gap (left); absorption spectrum (right).]

Computing all/selected eigenpairs of the Casida Hamiltonian

Problem setting
- The Casida Hamiltonian matrix is of the form

      H = [  A    B ]
          [ -B̄   -Ā ]  ∈ C^(2n×2n),

  where A = A^* is Hermitian, B = B^T is complex symmetric, and the Hermitian matrix [A, B; B̄, Ā] is positive definite (the bar denotes entrywise complex conjugation). A toy numerical illustration of the resulting ±λ pairing of the spectrum follows this list.
- In BSE, often all eigenpairs of H are required.
- In TDDFT, H is real and sparse for molecules, and only several smallest positive eigenpairs are needed.
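A quick made-up numerical illustration (not one of the poster's benchmark systems) of the paired structure: under the positive definiteness condition above, the spectrum of H is real and symmetric about zero, so eigenvalues occur in ±λ pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = Z + Z.conj().T + 10 * n * np.eye(n)   # Hermitian; the diagonal shift keeps
                                          # [A, B; conj(B), conj(A)] positive definite
Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = (Z + Z.T) / 2                         # complex symmetric

# Casida/BSE Hamiltonian with the paired block structure
H = np.block([[A, B], [-B.conj(), -A.conj()]])
lam = np.linalg.eigvals(H)
print(np.sort_complex(lam))   # real (up to roundoff) and symmetric about 0
```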

Our goal:
- Develop a structure-preserving parallel eigensolver for BSE.
- Develop and apply block preconditioned eigensolver techniques tailored specifically to the LR problem.

Methodology
- BSE can be reduced to a real positive definite Hamiltonian eigenvalue problem and solved by a skew-symmetric eigensolver.
- When H is real, the problem is equivalent to the n × n product eigenvalue problem MKx = λ²x, where M = A + B ≻ 0 and K = A − B ≻ 0.
- Cholesky factorization and the SVD are used to solve MKx = λ²x when all eigenpairs are needed (a sketch follows this list).
- Several smallest eigenpairs of MK are computed by properly preconditioned iterative eigensolvers applied to the symmetric generalized eigenvalue problem KMKx = λ²Kx.
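A minimal dense sketch of the Cholesky + SVD reduction for the real case (names are illustrative; the production solver distributes these steps with ScaLAPACK), assuming M = A + B and K = A − B are symmetric positive definite:

```python
import numpy as np
from scipy.linalg import cholesky, svd, solve_triangular

def casida_all_eigs(A, B):
    """All positive eigenvalues lambda and eigenvectors x of M K x = lambda^2 x,
    with M = A + B > 0 and K = A - B > 0 (real symmetric)."""
    LM = cholesky(A + B, lower=True)                # M = LM LM^T
    LK = cholesky(A - B, lower=True)                # K = LK LK^T
    # Singular values of LM^T LK are the positive eigenvalues lambda:
    # (LM^T LK)^T (LM^T LK) = LK^T M LK, which is similar to M K.
    U, s, Vt = svd(LM.T @ LK)
    X = solve_triangular(LK.T, Vt.T, lower=False)   # x = LK^{-T} v, up to scaling
    return s, X
```

For the TDDFT case, where only a few smallest eigenvalues are wanted, the same M and K enter the symmetric generalized problem KMKx = λ²Kx, which is the form handed to the preconditioned block solvers (LOBPCG-LR, Davidson-LR) discussed below.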

Experimental results

Parallel scalability of the dense BSE solver

[Figure: parallel scalability of the dense BSE solver; wall-clock time (sec) vs. number of processes for the total solve and its components (Cholesky, forming W, tridiagonalization, diagonalization, eigenvector Householder transformations, eigenvector GEMM).]

- The parallel solver is built on top of ScaLAPACK.
- The scalability of the solver is comparable to that of GEMM.

Efficiency of new LR eigensolvers

[Figure: residual norm vs. number of matrix-vector multiplications for LOBPCG-LR, Davidson-LR, and standard Davidson; 5 roots requested out of 11,799 eigenvalues in total.]

- The new eigensolvers (LOBPCG-LR and Davidson-LR) achieve a 2x speedup compared to the traditional Davidson approach for the LR eigenproblem.
- The proposed approaches require 2x fewer matrix-vector products and offer a significant reduction in memory usage.