1
A High-Performance Preconditioned SVD Solver for Accurately Computing Large-Scale Singular Value Problems in PRIMME Lingfei Wu, Andreas Stathopoulos (Advisor), Eloy Romero (Advisor), College of William and Mary, Williamsburg, Virginia, USA Introduction Experimental setup: Compare PRIMME_SVDS against TRLBD, Krylov-Schur (KS), Generalized Davidson (GD), and Jacob-Davidson (JD) in SLEPc Seek a small number of extreme singular triplets W/ and W/O preconditioning Matrices and vectors distributed on total 1024 processes on Edison (NERSC) or 96 processes on SciClone (W&M) Our approach: a preconditioned hybrid, two-stage SVD method on top of the state-of-the-art eigensolver PRIMME Experiments and Results PRIMME_SVDS: A High-Performance SVD Solver State-of-The-Art SVD Solvers The problem: find k extreme singular triplets of A m×n Aν i = σ i u i , i = 1, 2, , k , k << n Iterative methods for computing SVD: Hermitian eigenvalue problem on C = A T A or C = AA T Hermitian eigenvalue problem on B = [0 A T ; A 0] Lanczos bidiagonalization method (LBD) A = PB d Q T and B d = XΣY T where U = PX and V = QY. Comparison of different approaches Fig. 3: Comparing different methods on relat9 and delaunay_n24 Our goal: solve large-scale, sparse SVD problems with unprecedented efficiency , robustness and accuracy Scarcity of SVD software for large-scale problems A clear need for a high quality SVD solver software Implement state-of-the-art methods with massively parallelism Compute smallest singular triplets effectively and accurately Leverage preconditioning to deal with growing size and difficulty of real-world problems Flexible interface suitable for black-box and advanced usage Software State-of-the-art Methods Classic Methods Lang MPI/ SMP Precond- itioning Extreme SVs Fast Full Accuracy PRIMME PRIMME_SVDS Multimethod C Both Y Y Y N IRRHLB N/A Matlab N N Y Y N IRLBA N/A Matlab N N Y Y N JDSVD N/A Matlab N Y Y Y N SVDIFP N/A Matlab N Y Y N SLEPc N/A Many C MPI Y Y N PROPACK N/A LBD F77 SMP N Y N SVDPACK N/A Lanczos F77 N N Y N Applications DNA Least squares Graph clustering QCD Matrix relat9 cage15 delaunay_n24 Laplacian Dimension 12,360,060×549,336 5,154,859 16,777,216 8,000 per proc NNZ 38,955,420 99,199,551 100,663,202 55,760 per proc Target Smallest Smallest Largest Largest Preconditioning N/A BoomerAMG N/A N/A Eigenmethod on C Fast for largest SVs Slow for smallest SVs Achieve accuracy of O(κ(A)||A||ε mach ) Eigenmethod on B Slow for largest SVs Extremely slow for smallest SVs (interior eigenvalue problem) Achieve accuracy of O(||A||ε mach ) LBD on A Fast for largest SVs Similar to C but exhibits irregular convergence for smallest SVs Achieve accuracy of O(||A||ε mach ) GD+k on C + dynamic tolerance JDQMR on B + initial guesses from C + refined projection PRIMME_SVDS: a preconditioned two-stage SVD method Results I: methods comparison (low accuracy) 0 500 1000 1500 2000 2500 3000 10 10 10 5 10 0 10 5 Matvecs ||A T u m v|| A=diag([1:10 1000:100:1e6]), Prec=A+rand(1,10000)*1e4 GD+1 on B GD+1 on C to limit GD+1 on C to eps||A|| 2 switch to GD+1 on B Fig. 2: Hybrid, two-stage SVD method and an example This work is supported by NSF under grants No. CCF 1218349 and ACI SI2-SSE 1440700, and by DOE under a grant No. DE-FC02-12ER41890 Key components: Stage I: fast convergence on matrix C to compute approximate eigenvalues and eigenvectors Stage II: improve accuracy by exploiting power of PRIMME and specialized refined projection on matrix B Parallel Implementation of PRIMME_SVDS User defined data distribution among processes User defined parallel matrix-vector and preconditioning functions User provided global summation for dot products Table 3: Dimension of the matrices and parameters used in the experiments Fig. 1: Advantages and disadvantages of different approaches Table 1: Functionalities of different SVD solvers Results II: scalability performance (high accuracy) Fig. 4: Strong scaling: seek smallest on relat9 and cage15 Fig. 5: Strong and Weak scaling: seek largest on delaunay_n24 and Laplacian Highlights of PRIMME_SVDS: Among the fastest and most robust production level software for computing a small number of singular triplets Exploits preconditioning for large-scale problems Computes efficiently smallest SVs in full accuracy Demonstrates good scalability under both strong and weak scaling even for highly sparse matrices Free software, available at: https://github.com/primme/primme Number of MPI Processes 16 32 64 128 256 512 Runtime (Second) 0 500 1000 1500 Seeking Smallest without Preconditioning Parallel Speedup 16 32 64 128 256 Number of MPI Processes 16 32 64 128 256 512 1024 Runtime (Second) 0 500 1000 1500 Seeking Smallest with Preconditioning Parallel Speedup 16 32 64 128 256 512 Number of MPI Processes 16 32 64 128 256 512 Runtime (Second) 0 500 1000 1500 2000 Seeking Largest without Preconditioning Parallel Speedup 16 32 64 128 256 512 Number of MPI Processes 64 125 216 343 512 729 1000 Runtime (Second) 0 0.005 0.01 0.015 0.02 Seeking Largest without Preconditioning Parallel Efficiency 0 0.25 0.5 0.75 1 Operations Kernels or Libs Cost per iteration Scalability Dense algebra: MV, MM, Inner Prods, Scale BLAS (eg, MKL, ESSL, OpenBlas, ACML) O((m+n)*numSVs) Good Sparse algebra: SpMV, SpMM, Preconditioner User defined (eg, PETSc, Trilinos, HYPRE, librsb) O(1) calls Application dependent Reduction User defined (eg, MPI_AllReduce) O(1) calls of size O(numSVs) Machine dependent Table 2: Parallel characteristics of PRIMME operations Number of MPI Processes 20 40 60 80 100 Speedup over TRLBD 10 0 10 1 10 2 Seeking Smallest without Preconditioning PRIMME_SVDS JD on C KS on C TRLBD Number of MPI Processes 20 40 60 80 100 Speedup Over TRLBD 1 1.5 2 2.5 3 3.5 4 Seeking Largest without Preconditioning PRIMME_SVDS GD on C KS on C TRLBD

A High-Performance Preconditioned SVD Solver for ...sc15.supercomputing.org/sites/all/themes/SC15...A High-Performance Preconditioned SVD Solver for Accurately Computing Large-Scale

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A High-Performance Preconditioned SVD Solver for ...sc15.supercomputing.org/sites/all/themes/SC15...A High-Performance Preconditioned SVD Solver for Accurately Computing Large-Scale

A High-Performance Preconditioned SVD Solver for Accurately Computing Large-Scale Singular Value Problems in PRIMME

Lingfei Wu, Andreas Stathopoulos (Advisor), Eloy Romero (Advisor), College of William and Mary, Williamsburg, Virginia, USA Introduction

v Experimental setup: Ø  Compare PRIMME_SVDS against TRLBD, Krylov-Schur (KS), Generalized

Davidson (GD), and Jacob-Davidson (JD) in SLEPc Ø  Seek a small number of extreme singular triplets W/ and W/O preconditioning Ø  Matrices and vectors distributed on total 1024 processes on Edison (NERSC) or 96 processes on SciClone (W&M)

v Our approach: a preconditioned hybrid, two-stage SVD method on top of the state-of-the-art eigensolver PRIMME

Experiments and Results

PRIMME_SVDS: A High-Performance SVD Solver

State-of-The-Art SVD Solvers

v The problem: find k extreme singular triplets of Am×n Aνi = σiui , i = 1, 2, …, k , k << n

v  Iterative methods for computing SVD: �  Hermitian eigenvalue problem on C = ATA or C = AAT �  Hermitian eigenvalue problem on B = [0 AT; A 0] �  Lanczos bidiagonalization method (LBD)

A = PBdQT and Bd = XΣYT where U = PX and V = QY. v Comparison of different approaches

Fig. 3: Comparing different methods on relat9 and delaunay_n24

v Our goal: solve large-scale, sparse SVD problems with unprecedented efficiency, robustness and accuracy

v Scarcity of SVD software for large-scale problems

v A clear need for a high quality SVD solver software Ø  Implement state-of-the-art methods with massively parallelism Ø  Compute smallest singular triplets effectively and accurately Ø  Leverage preconditioning to deal with growing size and

difficulty of real-world problems Ø  Flexible interface suitable for black-box and advanced usage

Software State-of-the-art Methods

Classic Methods

Lang MPI/SMP

Precond-itioning

Extreme SVs

Fast Full Accuracy

PRIMME PRIMME_SVDS Multimethod C Both Y Y Y

N IRRHLB N/A Matlab N N Y Y

N IRLBA N/A Matlab N N Y Y

N JDSVD N/A Matlab N Y Y Y

N SVDIFP N/A Matlab N Y Y N

SLEPc N/A Many C MPI Y Y N

PROPACK N/A LBD F77 SMP N Y N

SVDPACK N/A Lanczos F77 N N Y N

Applications DNA Least squares Graph clustering QCD

Matrix relat9 cage15 delaunay_n24 Laplacian

Dimension 12,360,060×549,336 5,154,859 16,777,216 8,000 per proc

NNZ 38,955,420 99,199,551 100,663,202 55,760 per proc

Target Smallest Smallest Largest Largest

Preconditioning N/A BoomerAMG N/A N/A

Eigenmethod on C

•  Fast for largest SVs •  Slow for smallest SVs •  Achieve accuracy of

O(κ(A)||A||εmach)

Eigenmethod on B

•  Slow for largest SVs •  Extremely slow for

smallest SVs (interior eigenvalue problem)

•  Achieve accuracy of O(||A||εmach)

LBD on A

•  Fast for largest SVs •  Similar to C but

exhibits irregular convergence for smallest SVs

•  Achieve accuracy of O(||A||εmach)

GD+k on C + dynamic tolerance

JDQMR on B + initial guesses

from C + refined projection

PRIMME_SVDS: a preconditioned two-stage SVD

method

v Results I: methods comparison (low accuracy)

0 500 1000 1500 2000 2500 300010−10

10−5

100

105

Matvecs

||AT u

− m

v||

A=diag([1:10 1000:100:1e6]), Prec=A+rand(1,10000)*1e4

GD+1 on BGD+1 on C to limitGD+1 on C to eps||A||2switch to GD+1 on B

Fig. 2: Hybrid, two-stage SVD method and an example

This work is supported by NSF under grants No. CCF 1218349 and ACI SI2-SSE 1440700, and by DOE under a grant No. DE-FC02-12ER41890

v Key components: Ø  Stage I: fast convergence on matrix C to compute approximate

eigenvalues and eigenvectors Ø  Stage II: improve accuracy by exploiting power of PRIMME and

specialized refined projection on matrix B v Parallel Implementation of PRIMME_SVDS Ø  User defined data distribution among processes Ø  User defined parallel matrix-vector and preconditioning functions Ø  User provided global summation for dot products

Table 3: Dimension of the matrices and parameters used in the experiments

Fig. 1: Advantages and disadvantages of different approaches

Table 1: Functionalities of different SVD solvers

v Results II: scalability performance (high accuracy)

Fig. 4: Strong scaling: seek smallest on relat9 and cage15

Fig. 5: Strong and Weak scaling: seek largest on delaunay_n24 and Laplacian

v Highlights of PRIMME_SVDS: Ø  Among the fastest and most robust production level software for

computing a small number of singular triplets Ø  Exploits preconditioning for large-scale problems Ø  Computes efficiently smallest SVs in full accuracy Ø  Demonstrates good scalability under both strong and weak scaling even for highly sparse matrices Ø  Free software, available at: https://github.com/primme/primme

Number of MPI Processes16 32 64 128 256 512

Run

time

(Sec

ond)

0

500

1000

1500Seeking Smallest without Preconditioning

Para

llel S

peed

up

16

32

64

128

256

Number of MPI Processes16 32 64 128 256 512 1024

Run

time

(Sec

ond)

0

500

1000

1500Seeking Smallest with Preconditioning

Para

llel S

peed

up

16

32

64

128

256

512

Number of MPI Processes16 32 64 128 256 512

Run

time

(Sec

ond)

0

500

1000

1500

2000Seeking Largest without Preconditioning

Para

llel S

peed

up

16

32

64

128

256

512

Number of MPI Processes64 125 216 343 512 729 1000

Run

time

(Sec

ond)

0

0.005

0.01

0.015

0.02Seeking Largest without Preconditioning

Para

llel E

ffici

ency

0

0.25

0.5

0.75

1

Operations Kernels or Libs Cost per iteration Scalability Dense algebra: MV, MM, Inner Prods, Scale

BLAS (eg, MKL, ESSL, OpenBlas, ACML)

O((m+n)*numSVs) Good

Sparse algebra: SpMV, SpMM, Preconditioner

User defined (eg, PETSc, Trilinos, HYPRE, librsb)

O(1) calls Application dependent

Reduction User defined (eg, MPI_AllReduce)

O(1) calls of size O(numSVs)

Machine dependent

Table 2: Parallel characteristics of PRIMME operations

Number of MPI Processes20 40 60 80 100

Spee

dup

over

TR

LBD

100

101

102 Seeking Smallest without Preconditioning

PRIMME_SVDSJD on CKS on CTRLBD

Number of MPI Processes20 40 60 80 100

Spee

dup

Ove

r TR

LBD

1

1.5

2

2.5

3

3.5

4Seeking Largest without Preconditioning

PRIMME_SVDSGD on CKS on CTRLBD