Kate Clark, July 28th 2016
ACCELERATING LATTICE QCD MULTIGRID ON GPUS USING FINE-GRAINED PARALLELIZATION
CONTENTS
• QUDA Library
• Adaptive Multigrid
• QUDA Multigrid
• Results
• Conclusion
QUDA
• “QCD on CUDA” – http://lattice.github.com/quda (open source, BSD license)
• Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma, CPS, MILC, TIFR, etc.
 – Latest release 0.8.0 (8th February 2016)
• Provides:
 – Various solvers for all major fermionic discretizations, with multi-GPU support
 – Additional performance-critical routines needed for gauge-field generation
• Maximize performance
 – Exploit physical symmetries to minimize memory traffic
 – Mixed-precision methods
 – Autotuning for high performance on all CUDA-capable architectures
 – Domain-decomposed (Schwarz) preconditioners for strong scaling
 – Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
 – Multi-source solvers
 – Multigrid solvers for optimal convergence
• A research tool for how to reach the exascale
QUDA CONTRIBUTORS
§ Ron Babich (NVIDIA)
§ Michael Baldhauf (Regensburg)
§ Kip Barros (LANL)
§ Rich Brower (Boston University)
§ Nuno Cardoso (NCSA)
§ Kate Clark (NVIDIA)
§ Michael Cheng (Boston University)
§ Carleton DeTar (Utah University)
§ Justin Foley (Utah -> NIH)
§ Joel Giedt (Rensselaer Polytechnic Institute)
§ Arjun Gambhir (William and Mary)
§ Steve Gottlieb (Indiana University)
§ Dean Howarth (Rensselaer Polytechnic Institute)
§ Bálint Joó (JLab)
§ Hyung-Jin Kim (BNL -> Samsung)
§ Claudio Rebbi (Boston University)
§ Guochun Shi (NCSA -> Google)
§ Mario Schröck (INFN)
§ Alexei Strelchenko (FNAL)
§ Alejandro Vaquero (INFN)
§ Mathias Wagner (NVIDIA)
§ Frank Winter (JLab)
(multigrid collaborators in green)
MULTIGRID FOR QCD
WHY MULTIGRID?
[Figure: Dirac operator applications vs. mass parameter (-0.43 to -0.40, log scale from 100 to 1e+05) for CG, Eig-CG, and MG-GCR solvers on 16^3x64, 24^3x64, and 32^3x96 lattices. Babich et al 2010]
[Figure: wallclock time for light-quark solves in single precision (average run time for 1 source, in seconds): QUDA (32 XK nodes) vs. MultiGrid (16 XE nodes); and for strange-quark solves: QUDA (16 XK nodes) vs. MultiGrid (16 XE nodes)]
Chroma propagator benchmark. Figure by Balint Joo
MG Chroma integration by Saul Cohen, MG algorithm by James Osborn
Osborn et al 2010
[Diagram: multigrid design goals: Optimality, Speed, Stability]
ADAPTIVE GEOMETRIC MULTIGRID
Adaptively find candidate null-space vectors
Dynamically learn the null space and use this to define the prolongator
Algorithm is self-learning
Setup
1. Set solver to be simple smoother
2. Apply current solver to random vector vi = P(D) ηi
3. If convergence good enough, solver setup complete
4. Construct prolongator using fixed coarsening (1 - P R) vk = 0
 ➡ Typically use 4^4 geometric blocks
 ➡ Preserve chirality when coarsening: R = γ5 P† γ5 = P†
5. Construct coarse operator (Dc = R D P)
6. Recurse on coarse problem
7. Set solver to be augmented V-cycle, goto 2
Falgout
Babich et al 2010
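The blocking in step 4 can be made concrete with a CPU sketch (hypothetical helper names, real scalars instead of spin-color complex fields): one candidate vector is chopped into geometric blocks and normalized per block, the normalized pieces become the columns of P, and R = P† — so by construction (1 - P R) applied to the candidate vector gives zero.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch: build a prolongator P from one candidate null-space vector v by
// blocking the fine sites into chunks of block_size and normalizing v within
// each block; R = P^T here (real scalars, no spin/chirality structure).
struct Prolongator {
  int block_size;
  std::vector<double> p; // normalized block pieces of v
};

Prolongator build_prolongator(const std::vector<double>& v, int block_size) {
  Prolongator P{block_size, v};
  for (size_t b = 0; b < v.size(); b += block_size) {
    double norm = 0.0;
    for (int i = 0; i < block_size; i++) norm += v[b+i] * v[b+i];
    norm = std::sqrt(norm);
    for (int i = 0; i < block_size; i++) P.p[b+i] = v[b+i] / norm;
  }
  return P;
}

// Restrict then prolongate: P (P^T x).  For x in the range of P this
// reproduces x exactly, i.e. (1 - P R) x = 0.
std::vector<double> P_R(const Prolongator& P, const std::vector<double>& x) {
  std::vector<double> y(x.size());
  for (size_t b = 0; b < x.size(); b += P.block_size) {
    double coarse = 0.0; // the single coarse dof for this block
    for (int i = 0; i < P.block_size; i++) coarse += P.p[b+i] * x[b+i];
    for (int i = 0; i < P.block_size; i++) y[b+i] = P.p[b+i] * coarse;
  }
  return y;
}
```

In the real setup each block carries multiple null-space vectors (block orthogonalization, step 4) and chirality is preserved by doubling the coarse spin; this sketch keeps only the geometric part.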
MULTIGRID ON GPUS
THE CHALLENGE OF MULTIGRID ON GPU
GPU requirements very different from CPU
• Each thread is slow, but O(10,000) threads per GPU
Fine grids run very efficiently
• High parallel-throughput problem
Coarse grids are the worst possible scenario
• More cores than degrees of freedom
• Increasingly serial and latency bound
• Little's law (bytes = bandwidth * latency)
• Amdahl's law limiter
Multigrid exposes many of the problems expected at the Exascale
INGREDIENTS FOR PARALLEL ADAPTIVE MULTIGRID
▪ Multigrid setup – Block orthogonalization of null space vectors – Batched QR decomposition
▪ Smoothing (relaxation on a given grid) – Repurpose existing solvers
▪ Prolongation – interpolation from coarse grid to fine grid – one-to-many mapping
▪ Restriction – restriction from fine grid to coarse grid – many-to-one mapping
▪ Coarse Operator construction (setup) – Evaluate R A P locally – Batched (small) dense matrix multiplication
▪ Coarse grid solver – Need optimal coarse-grid operator
[Diagram: Wilson dslash stencil, with links U_μ(x) connecting site x to its neighbors x±μ]
COARSE GRID OPERATOR
▪ Coarse operator looks like a Dirac operator (many more colors)
 – Link matrices have dimension 2Nv x 2Nv (e.g., 48 x 48)
▪ Fine vs. coarse grid parallelization
 – Fine grid operator has plenty of grid-level parallelism
 – E.g., 16x16x16x16 = 65,536 lattice sites
 – Coarse grid operator has diminishing grid-level parallelism
 – First coarse grid 4x4x4x4 = 256 lattice sites
 – Second coarse grid 2x2x2x2 = 16 lattice sites
▪ Current GPUs have up to 3840 processing cores
▪ Need to consider finer-grained parallelization
 – Increase parallelism to use all GPU resources
 – Load balancing
dofs (geometry). We start by defining the fields

\[
W^{\pm\mu}_{ksc,\,ls'c'} \;=\; V^{\dagger}_{ksc,\,k\bar{s}\bar{c}}\; P^{\pm\mu}_{\bar{s},\bar{s}'}\; U_{(k+\mu)\bar{c},\,l\bar{c}'}\;\delta_{k+\mu,l}\; V_{l\bar{s}'\bar{c}',\,ls'c'},
\]

where repeated barred (fine spin and colour) indices are summed. Note that here we are defining different links for forwards and backwards; they are not simply the conjugate of each other (because of the different spin projection between the two). Also note that these effective link matrices have a spin index, because the vectors used to define the V rotation matrices now have spin dependence. In this form we can write down the coarse Dirac operator as

\[
D_{isc,\,js'c'} = -\bigl[\delta_{i,k/B}\,\delta_{s,\,s/B_s}\bigr]\sum_{\mu}\Bigl[\,W^{-\mu}_{ksc,\,ls'c'}\,\delta_{k+\mu,l} + W^{+\mu\,\dagger}_{ksc,\,ls'c'}\,\delta_{k-\mu,l}\,\Bigr]\bigl[\delta_{l/B,j}\,\delta_{s'/B_s,\,s'}\bigr] + M\,\delta_{isc,\,js'c'}.
\]

We now finish up by blocking the geometry and spin onto the coarse lattice, defining the effective link matrices \(Y^{\pm\mu}\) that connect sites on the coarse lattice:

\[
Y^{\pm\mu}_{isc,\,js'c'} = \bigl[\delta_{i,k/B}\,\delta_{s,\,s/B_s}\bigr]\,W^{\pm\mu}_{ksc,\,ls'c'}\,\bigl[\delta_{l/B,j}\,\delta_{s'/B_s,\,s'}\bigr]\,\delta_{i\mp\mu,j}, \tag{2}
\]

\[
X_{isc,\,js'c'} = \bigl[\delta_{i,k/B}\,\delta_{s,\,s/B_s}\bigr]\sum_{\mu}\Bigl[\,W^{-\mu}_{isc,\,ks'c'} + W^{+\mu\,\dagger}_{isc,\,ks'c'}\,\Bigr]\bigl[\delta_{l/B,j}\,\delta_{s'/B_s,\,s'}\bigr]\,\delta_{i,j},
\]

where we note that the matrix X is not Hermitian. Thus the coarse operator is written

\[
D_{isc,\,js'c'} = -\sum_{\mu}\Bigl[\,Y^{-\mu}_{isc,\,js'c'}\,\delta_{i+\mu,j} + Y^{+\mu\,\dagger}_{isc,\,js'c'}\,\delta_{i-\mu,j}\,\Bigr] + \bigl(M - X_{isc,\,js'c'}\bigr)\,\delta_{isc,\,js'c'}. \tag{3}
\]

For the explicit form of these matrices we refer the reader to Appendix A. After the first blocking, subsequent blockings require that \(B_s = 1\), i.e., we cannot block the spin dimension again since we cannot remove the chirality. Apart from this observation, the next coarse operator will have a similar form to the current one: it will be a nearest-neighbour non-Hermitian operator connecting sites with \(d_s = 2\) spin dimension (in 2d and 4d anyway).

We note in passing that because the definition of the matrix field V includes explicit spin dependence, the tensor product structure of spin and colour on the coarse operator is destroyed, i.e., we have to define an effective link matrix that rotates in spin and colour space. If this were not the case, i.e., if V were spin independent, then this structure would be preserved.
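The triple product Dc = R D P above is, per block, just small dense matrix multiplication (which is why QUDA evaluates it as batched GEMMs). A toy dense sketch, with R = P^T and plain real matrices standing in for the spin-colour blocks:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Toy Galerkin product D_c = R D P with dense matrices; in QUDA this is
// evaluated locally per block as batched small dense matrix multiplications.
using Mat = std::vector<std::vector<double>>;

Mat matmul(const Mat& A, const Mat& B) {
  size_t n = A.size(), m = B[0].size(), k = B.size();
  Mat C(n, std::vector<double>(m, 0.0));
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < m; j++)
      for (size_t l = 0; l < k; l++) C[i][j] += A[i][l] * B[l][j];
  return C;
}

Mat transpose(const Mat& A) {
  Mat T(A[0].size(), std::vector<double>(A.size()));
  for (size_t i = 0; i < A.size(); i++)
    for (size_t j = 0; j < A[0].size(); j++) T[j][i] = A[i][j];
  return T;
}
```

With P a block prolongator with orthonormal columns and D the identity, the coarse operator R D P comes out as the coarse identity, a quick sanity check on the construction.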
SOURCE OF PARALLELISM
1. Grid parallelism: volume of threads

2. Link matrix-vector partitioning: 2 Nvec-way parallelism (spin * color)

\[
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix} \mathrel{+}=
\begin{pmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\]
(each output row handled by a different thread y index)

3. Stencil direction: 8-way parallelism
[Diagram: the eight (dimension, direction) stencil contributions assigned to warps 0-3]

4. Dot-product partitioning: 4-way parallelism

\[
\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\;\Rightarrow\;
\begin{pmatrix} a_{00} & a_{01} \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \end{pmatrix}
+
\begin{pmatrix} a_{02} & a_{03} \end{pmatrix}
\begin{pmatrix} b_2 \\ b_3 \end{pmatrix}
\]
COARSE GRID OPERATOR PERFORMANCE
Tesla K20X (Titan), FP32, Nvec = 24
[Chart: GFLOPS vs. lattice length (10, 8, 6, 4, 2) for successive optimizations: baseline, color-spin, dimension + direction, dot product. On the smallest grid the fully partitioned kernel is 24,576-way parallel, versus 16-way parallel for the baseline]
COARSE GRID OPERATOR PERFORMANCE
▪ Autotuner finds optimum degree of parallelization
▪ Larger grids favor less fine-grained parallelization
▪ Coarse grids favor the most fine-grained parallelization
▪ GPU is nearly always faster than CPU
▪ Expect in future that coarse grids will favor CPUs
▪ For now, use GPU exclusively
8-core Haswell 2.4 GHz (solid line) vs. M6000 (dashed line), FP32
[Chart: coarse dslash performance, GFLOPS vs. 2Nv (0-100) for grids 2x2x2x2, 4x2x2x2, 4x2x2x4, 4x2x4x4, 4x4x4x4; solid symbols CPU, open symbols / dashed lines GPU]
RESULTS
MULTIGRID VERSUS BICGSTAB
• Compare MG against the state-of-the-art traditional Krylov solver on GPU
 • BiCGstab in double/half precision with reliable updates
 • 12/8 reconstruct
 • Symmetric red-black preconditioning
• Adaptive multigrid algorithm
 • GCR outer solver wraps a 3-level MG preconditioner
 • GCR restarts done in double, everything else in single
 • 24 null-space vectors on fine grid
 • Minimum Residual smoother
 • Symmetric red-black preconditioning on each level
 • GCR coarse-grid solver
\[
M_{ee} = 1_{ee} - \kappa^2\, A^{-1}_{ee}\, D_{eo}\, A^{-1}_{oo}\, D_{oe}
\]
[Chart: time to solution (0-12) vs. mass parameter (-0.42 to -0.40) for BiCGstab (double-half) and GCR-MG (double-single)]

          Iterations           GFLOPS
mass      BiCGstab  GCR-MG    BiCGstab  GCR-MG
-0.400    251       15        980       376
-0.405    372       16        980       372
-0.410    510       17        980       353
-0.415    866       18        980       314
-0.420    3103      19        980       293
Wilson, V = 24^3x64, single workstation (3x M6000)
MULTIGRID VERSUS BICGSTAB
Wilson-clover, strong scaling on Titan (K20X); V = 40^3x256 (mπ = 230 MeV) and V = 48^3x96 (mπ = 197 MeV)
[Charts: time to solution vs. number of nodes for BiCGstab and MG. One panel: 24 and 48 nodes, MG speedups over BiCGstab of 6.6x and 6.3x; the other panel: 20 and 32 nodes, speedups of 7.9x and 6.1x]
MULTIGRID VERSUS BICGSTAB
Wilson-clover, strong scaling on Titan (K20X), V = 64^3x128, mπ = 197 MeV
[Chart: time to solution vs. number of nodes (32, 64, 128, 256, 512) for BiCGstab and MG; MG speedups of 5.5x, 10.2x, 8.9x, 7.4x]
MULTIGRID TIMING BREAKDOWN
Wilson-clover, strong scaling on Titan (K20X), V = 64^3x128, 12 linear solves
[Chart: time vs. number of nodes (64, 128, 256, 512), broken down by multigrid level (level 1, level 2, level 3)]
MULTIGRID FUTURE WORK
• Absolute performance tuning, e.g., half precision on coarse grids
• Strong-scaling improvements:
 • Combine with Schwarz preconditioner
 • Accelerate coarse-grid solver: CA-GMRES instead of GCR, deflation
 • More flexible coarse-grid distribution, e.g., redundant nodes
• Investigate offload of coarse grids to the CPU
 • Use CPU and GPU simultaneously using additive MG
• Full offload of setup phase to GPU - required for HMC
MULTI-SRC SOLVERS
• Multi-src solvers increase locality through link-field reuse
 • See previous talk by M. Wagner
• Multigrid operators benefit even more, since link matrices are 48x48
 • Coarse dslash / prolongator / restrictor
• Coarsest grids are also latency limited
 • Kernel-level latency
 • Network latency
• Multi-src solvers are a solution
 • More parallelism
 • Bigger messages
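A back-of-the-envelope model shows why batching right-hand sides pays off for the coarse operator: the 48x48 complex link matrix is loaded once per application, so its bytes are amortized over all the vectors it multiplies. The sketch below (illustrative model only, not QUDA's actual traffic accounting) computes flops per byte for n right-hand sides in single precision.

```cpp
#include <cassert>

// Arithmetic intensity (flops/byte) of applying one dim x dim complex
// link matrix to n right-hand-side vectors, single precision:
// 8 flops per complex multiply-add; the matrix is loaded once, and each
// vector is read once and its result written once.
double intensity(int n, int dim = 48) {
  double flops = 8.0 * dim * dim * n;
  double bytes = 8.0 * dim * dim          // matrix, complex float
               + 8.0 * dim * n * 2.0;     // read + write each vector
  return flops / bytes;
}
```

With one source the kernel is strongly bandwidth bound (intensity below 1 flop/byte); by 16 sources the intensity grows roughly tenfold, consistent with the large speedups in the chart below.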
[Chart: coarse dslash GFLOPS on M6000 vs. number of right-hand sides (1, 2, 4, 8, 16, 32, 64, 128) for 2^4 and 4^4 grids; > 3x speedup]
PASCAL
• Newly launched Pascal P100 GPU provides significant performance uplift through stacked HBM memory
• Wilson-clover dslash 3x speedup vs K40 • double ~500 GFLOPS • single ~1000 GFLOPS • half ~2000 GFLOPS
• Same kernel from 2008 runs unchanged • 5 generations ago
• Similar gains for MG building blocks
[Charts: Wilson-clover dslash GFLOPS vs. lattice length (2-10) for Kepler, Maxwell, and Pascal; coarse dslash performance on P100]
DGX-1
• Pascal includes NVLink interconnect technology • 4x NVLink connections per GPU • 160 GB/s bi-directional bandwidth per GPU
• NVIDIA DGX-1
 • 8x P100 GPUs
 • 2^3 NVLink topology
 • 4x EDR IB (via PCIe switches)
 • 2x Intel Xeon hosts
• 1 DGX-1 has inter-GPU bandwidth equivalent to 64 nodes of Titan
COMING SOON
• >10x speedup from MG
 • Lower bound, since there is much more optimization to do
• 6x node-to-node speedup
 • DGX-1 (8x GP100) vs Pi0g (4x K40)
• 3x speedup from multi-src solvers
 • Increased temporal locality from links and increased parallelism for MG
• Expect >100x speedup for analysis workloads versus previous GPU workflow
CONCLUSIONS AND OUTLOOK
• Multigrid algorithms for LQCD are running well on GPUs
 • In production by multiple groups
 • Wilson, Wilson-clover, twisted-mass, twisted-clover all supported
 • Staggered in progress (see next talk by E. Weinberg)
 • 5-d operators when I get a chance…
• Up to 10x speedup • Fine-grained parallelization was key • To be presented / published at SC16
• Lots more to come • Combine with multi-right-hand side solver and Pascal architecture
GRID PARALLELISM
__device__ void grid_idx(int x[], const int X[]) {
  // X[] holds the local lattice dimensions
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  int za = idx / X[0];
  int zb = za / X[1];
  x[1] = za - zb * X[1];
  x[3] = zb / X[2];
  x[2] = zb - x[3] * X[2];
  x[0] = idx - za * X[0];
  // x[] now holds the thread coordinates
}

Thread x dimension maps to location on the grid
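A host-side mirror of the decomposition (hypothetical helper names) makes it easy to verify that the kernel's coordinate computation inverts the lexicographical site index, with x[0] the fastest-running dimension:

```cpp
#include <cassert>

// CPU mirror of grid_idx: decompose a linear site index into 4-d lattice
// coordinates, exactly as in the kernel above.
void grid_coords(int idx, int x[4], const int X[4]) {
  int za = idx / X[0];
  int zb = za / X[1];
  x[1] = za - zb * X[1];
  x[3] = zb / X[2];
  x[2] = zb - x[3] * X[2];
  x[0] = idx - za * X[0];
}

// Reassemble the linear index: idx = ((x3*X2 + x2)*X1 + x1)*X0 + x0.
int linear_idx(const int x[4], const int X[4]) {
  return ((x[3] * X[2] + x[2]) * X[1] + x[1]) * X[0] + x[0];
}
```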
MATRIX-VECTOR PARALLELISM
template<int Nv>
__device__ void color_spin_idx(int &s, int &c) {
  int yIdx = blockDim.y*blockIdx.y + threadIdx.y;
  s = yIdx / Nv;  // s is now the spin index for this thread
  c = yIdx % Nv;  // c is now the color index for this thread
}
Each stencil application is a sum of matrix-vector products
Parallelize over output vector indices (parallelization over color and spin)
Thread y dimension maps to vector indices
Up to 2 x Nv more parallelism
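A quick host-side check of this mapping (hypothetical helper names): every thread y index corresponds to a unique (spin, color) pair, and the mapping round-trips, so the 2*Nv output rows are covered exactly once.

```cpp
#include <cassert>

// CPU mirror of color_spin_idx: thread y index -> (spin, color),
// with Nv colors per spin, plus the inverse mapping.
template <int Nv>
void color_spin(int yIdx, int& s, int& c) {
  s = yIdx / Nv;
  c = yIdx % Nv;
}

template <int Nv>
int y_index(int s, int c) { return s * Nv + c; }
```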
\[
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix} \mathrel{+}=
\begin{pmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\]
(each output row handled by a different thread y index)
Write result to shared memory
Synchronize
dim=0/dir=0 threads combine and write out result
Introduces up to 8x more parallelism
STENCIL DIRECTION PARALLELISM
__device__ void dim_dir_idx(int &dim, int &dir) {
  int zIdx = blockDim.z*blockIdx.z + threadIdx.z;
  dir = zIdx % 2;  // dir is now the fwd/back direction for this thread
  dim = zIdx / 2;  // dim is now the dimension for this thread
}
Partition computation over stencil direction and dimension onto different threads
[Diagram: the eight (dimension, direction) contributions of the stencil mapped to warps 0-3]
DOT PRODUCT PARALLELIZATION I
const int warp_size = 32;                   // warp size
const int n_split = 4;                      // four-way warp split
const int grid_points = warp_size/n_split;  // grid points per warp
int lane = threadIdx.x % warp_size;
int chunk = lane / grid_points;             // which quarter of the sum this thread owns
complex<real> sum = 0.0;
for (int i = chunk; i < N; i += n_split) sum += a[i] * b[i];

// cascading reduction
for (int offset = warp_size/2; offset >= grid_points; offset /= 2)
  sum += __shfl_down(sum, offset);
// first grid_points threads now hold the desired result
Partition dot product between threads in the same warp Use warp shuffle for final result Useful when not enough grid parallelism to fill a warp
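The lane bookkeeping can be simulated on the CPU to check that the scheme is consistent: with a 32-lane warp split 4 ways, lane l owns chunk l/8 of the dot product for grid point l%8, and after the cascading reduction (modeled here as array reads, standing in for `__shfl_down`) the first 8 lanes hold the complete dot products. A sketch with real scalars:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU simulation of the 4-way warp-split dot product: 8 grid points per
// 32-lane warp; lane l handles chunk l/8 of the sum for grid point l%8.
std::vector<double> warp_split_dot(const std::vector<std::vector<double>>& a,
                                   const std::vector<std::vector<double>>& b) {
  const int warp_size = 32, n_split = 4;
  const int grid_points = warp_size / n_split;  // 8
  const int N = (int)a[0].size();
  std::vector<double> sum(warp_size, 0.0);
  for (int lane = 0; lane < warp_size; lane++) {
    int point = lane % grid_points;  // which grid point this lane serves
    int chunk = lane / grid_points;  // which quarter of the sum it owns
    for (int i = chunk; i < N; i += n_split)
      sum[lane] += a[point][i] * b[point][i];
  }
  // cascading reduction, as with __shfl_down (ascending lane order is safe
  // here since each lane only reads higher, not-yet-updated lanes)
  for (int offset = warp_size / 2; offset >= grid_points; offset /= 2)
    for (int lane = 0; lane + offset < warp_size; lane++)
      sum[lane] += sum[lane + offset];
  return sum;  // first grid_points entries hold the full dot products
}
```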
\[
\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\;\Rightarrow\;
\begin{pmatrix} a_{00} & a_{01} \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \end{pmatrix}
+
\begin{pmatrix} a_{02} & a_{03} \end{pmatrix}
\begin{pmatrix} b_2 \\ b_3 \end{pmatrix}
\]
DOT PRODUCT PARALLELIZATION II
const int n_ilp = 2;  // two-way ILP
complex<real> sum[n_ilp] = { };
for (int i=0; i<N; i+=n_ilp)
  for (int j=0; j<n_ilp; j++)
    sum[j] += a[i+j] * b[i+j];

complex<real> total = static_cast<real>(0.0);
for (int j=0; j<n_ilp; j++) total += sum[j];
Degree of ILP exposed
Multiple computations with no dependencies
Compute final result
Partition dot product computation within a thread
Hide dependent arithmetic latency within a thread
More important for Kepler than Maxwell / Pascal
\[
\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\;\Rightarrow\;
\begin{pmatrix} a_{00} & a_{01} \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \end{pmatrix}
+
\begin{pmatrix} a_{02} & a_{03} \end{pmatrix}
\begin{pmatrix} b_2 \\ b_3 \end{pmatrix}
\]
ERROR REDUCTION AND VARIANCE
V = 40^3x256, mπ = 230 MeV