Kate Clark, July 28th 2016
ACCELERATING LATTICE QCD MULTIGRID ON GPUS USING FINE-GRAINED PARALLELIZATION
CONTENTS
• QUDA Library
• Adaptive Multigrid
• QUDA Multigrid
• Results
• Conclusion
QUDA
• “QCD on CUDA” – http://lattice.github.com/quda (open source, BSD license)
• Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma, CPS, MILC, TIFR, etc.
 – Latest release 0.8.0 (8th February 2016)
• Provides:
 – Various solvers for all major fermionic discretizations, with multi-GPU support
 – Additional performance-critical routines needed for gauge-field generation
• Maximize performance
 – Exploit physical symmetries to minimize memory traffic
 – Mixed-precision methods
 – Autotuning for high performance on all CUDA-capable architectures
 – Domain-decomposed (Schwarz) preconditioners for strong scaling
 – Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
 – Multi-source solvers
 – Multigrid solvers for optimal convergence
• A research tool for how to reach the exascale
QUDA CONTRIBUTORS
§ Ron Babich (NVIDIA)
§ Michael Baldhauf (Regensburg)
§ Kip Barros (LANL)
§ Rich Brower (Boston University)
§ Nuno Cardoso (NCSA)
§ Kate Clark (NVIDIA)
§ Michael Cheng (Boston University)
§ Carleton DeTar (Utah University)
§ Justin Foley (Utah -> NIH)
§ Joel Giedt (Rensselaer Polytechnic Institute)
§ Arjun Gambhir (William and Mary)
§ Steve Gottlieb (Indiana University)
§ Dean Howarth (Rensselaer Polytechnic Institute)
§ Bálint Joó (JLab)
§ Hyung-Jin Kim (BNL -> Samsung)
§ Claudio Rebbi (Boston University)
§ Guochun Shi (NCSA -> Google)
§ Mario Schröck (INFN)
§ Alexei Strelchenko (FNAL)
§ Alejandro Vaquero (INFN)
§ Mathias Wagner (NVIDIA)
§ Frank Winter (JLab)
(multigrid collaborators in green)
MULTIGRID FOR QCD
WHY MULTIGRID?
[Figure: Dirac operator applications vs. mass parameter (-0.43 to -0.40, log scale from 100 to 1e+05) for CG, Eig-CG, and MG-GCR solvers on 16^3x64, 24^3x64, and 32^3x96 lattices. Babich et al 2010]
[Figure: wallclock time for light-quark solves in single precision (average run time for 1 source, in seconds): QUDA (32 XK nodes) vs. MultiGrid (16 XE nodes); and for strange-quark solves: QUDA (16 XK nodes) vs. MultiGrid (16 XE nodes)]
Chroma propagator benchmark. Figure by Balint Joo
MG Chroma integration by Saul Cohen, MG algorithm by James Osborn
Osborn et al 2010
[Diagram: multigrid design goals: Optimality, Speed, Stability]
ADAPTIVE GEOMETRIC MULTIGRID
Adaptively find candidate null-space vectors
Dynamically learn the null space and use this to define the prolongator
Algorithm is self-learning
Setup
1. Set solver to be simple smoother
2. Apply current solver to random vector vi = P(D) ηi
3. If convergence good enough, solver setup complete
4. Construct prolongator using fixed coarsening (1 - P R) vk = 0
 ➡ Typically use 4^4 geometric blocks
 ➡ Preserve chirality when coarsening: R = γ5 P† γ5 = P†
5. Construct coarse operator (Dc = R D P)
6. Recurse on coarse problem
7. Set solver to be augmented V-cycle, goto 2
Falgout
Babich et al 2010
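The blocking in step 4 can be made concrete with a CPU sketch (hypothetical helper names, real scalars instead of spin-color complex fields): one candidate vector is chopped into geometric blocks and normalized per block, the normalized pieces become the columns of P, and R = P† — so by construction (1 - P R) applied to the candidate vector gives zero.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch: build a prolongator P from one candidate null-space vector v by
// blocking the fine sites into chunks of block_size and normalizing v within
// each block; R = P^T here (real scalars, no spin/chirality structure).
struct Prolongator {
  int block_size;
  std::vector<double> p; // normalized block pieces of v
};

Prolongator build_prolongator(const std::vector<double>& v, int block_size) {
  Prolongator P{block_size, v};
  for (size_t b = 0; b < v.size(); b += block_size) {
    double norm = 0.0;
    for (int i = 0; i < block_size; i++) norm += v[b+i] * v[b+i];
    norm = std::sqrt(norm);
    for (int i = 0; i < block_size; i++) P.p[b+i] = v[b+i] / norm;
  }
  return P;
}

// Restrict then prolongate: P (P^T x).  For x in the range of P this
// reproduces x exactly, i.e. (1 - P R) x = 0.
std::vector<double> P_R(const Prolongator& P, const std::vector<double>& x) {
  std::vector<double> y(x.size());
  for (size_t b = 0; b < x.size(); b += P.block_size) {
    double coarse = 0.0; // the single coarse dof for this block
    for (int i = 0; i < P.block_size; i++) coarse += P.p[b+i] * x[b+i];
    for (int i = 0; i < P.block_size; i++) y[b+i] = P.p[b+i] * coarse;
  }
  return y;
}
```

In the real setup each block carries multiple null-space vectors (block orthogonalization, step 4) and chirality is preserved by doubling the coarse spin; this sketch keeps only the geometric part.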
MULTIGRID ON GPUS
THE CHALLENGE OF MULTIGRID ON GPU
GPU requirements very different from CPU
• Each thread is slow, but O(10,000) threads per GPU
Fine grids run very efficiently
• High parallel-throughput problem
Coarse grids are the worst possible scenario
• More cores than degrees of freedom
• Increasingly serial and latency bound
• Little's law (bytes = bandwidth * latency)
• Amdahl's law limiter
Multigrid exposes many of the problems expected at the Exascale
INGREDIENTS FOR PARALLEL ADAPTIVE MULTIGRID
▪ Multigrid setup – Block orthogonalization of null space vectors – Batched QR decomposition
▪ Smoothing (relaxation on a given grid) – Repurpose existing solvers
▪ Prolongation – interpolation from coarse grid to fine grid – one-to-many mapping
▪ Restriction – restriction from fine grid to coarse grid – many-to-one mapping
▪ Coarse Operator construction (setup) – Evaluate R A P locally – Batched (small) dense matrix multiplication
▪ Coarse grid solver – Need optimal coarse-grid operator
[Diagram: Wilson dslash stencil, with links U_μ(x) connecting site x to its neighbors x±μ]
COARSE GRID OPERATOR
▪ Coarse operator looks like a Dirac operator (many more colors)
 – Link matrices have dimension 2Nv x 2Nv (e.g., 48 x 48)
▪ Fine vs. coarse grid parallelization
 – Fine grid operator has plenty of grid-level parallelism
 – E.g., 16x16x16x16 = 65,536 lattice sites
 – Coarse grid operator has diminishing grid-level parallelism
 – First coarse grid 4x4x4x4 = 256 lattice sites
 – Second coarse grid 2x2x2x2 = 16 lattice sites
▪ Current GPUs have up to 3840 processing cores
▪ Need to consider finer-grained parallelization
 – Increase parallelism to use all GPU resources
 – Load balancing
dofs (geometry). We start by defining the fields

\[
W^{\pm\mu}_{ksc,\,ls'c'} \;=\; V^{\dagger}_{ksc,\,k\bar{s}\bar{c}}\; P^{\pm\mu}_{\bar{s},\bar{s}'}\; U_{(k+\mu)\bar{c},\,l\bar{c}'}\;\delta_{k+\mu,l}\; V_{l\bar{s}'\bar{c}',\,ls'c'},
\]

where repeated barred (fine spin and colour) indices are summed. Note that here we are defining different links for forwards and backwards; they are not simply the conjugate of each other (because of the different spin projection between the two). Also note that these effective link matrices have a spin index, because the vectors used to define the V rotation matrices now have spin dependence. In this form we can write down the coarse Dirac operator as

\[
D_{isc,\,js'c'} = -\bigl[\delta_{i,k/B}\,\delta_{s,\,s/B_s}\bigr]\sum_{\mu}\Bigl[\,W^{-\mu}_{ksc,\,ls'c'}\,\delta_{k+\mu,l} + W^{+\mu\,\dagger}_{ksc,\,ls'c'}\,\delta_{k-\mu,l}\,\Bigr]\bigl[\delta_{l/B,j}\,\delta_{s'/B_s,\,s'}\bigr] + M\,\delta_{isc,\,js'c'}.
\]

We now finish up by blocking the geometry and spin onto the coarse lattice, defining the effective link matrices \(Y^{\pm\mu}\) that connect sites on the coarse lattice:

\[
Y^{\pm\mu}_{isc,\,js'c'} = \bigl[\delta_{i,k/B}\,\delta_{s,\,s/B_s}\bigr]\,W^{\pm\mu}_{ksc,\,ls'c'}\,\bigl[\delta_{l/B,j}\,\delta_{s'/B_s,\,s'}\bigr]\,\delta_{i\mp\mu,j}, \tag{2}
\]

\[
X_{isc,\,js'c'} = \bigl[\delta_{i,k/B}\,\delta_{s,\,s/B_s}\bigr]\sum_{\mu}\Bigl[\,W^{-\mu}_{isc,\,ks'c'} + W^{+\mu\,\dagger}_{isc,\,ks'c'}\,\Bigr]\bigl[\delta_{l/B,j}\,\delta_{s'/B_s,\,s'}\bigr]\,\delta_{i,j},
\]

where we note that the matrix X is not Hermitian. Thus the coarse operator is written

\[
D_{isc,\,js'c'} = -\sum_{\mu}\Bigl[\,Y^{-\mu}_{isc,\,js'c'}\,\delta_{i+\mu,j} + Y^{+\mu\,\dagger}_{isc,\,js'c'}\,\delta_{i-\mu,j}\,\Bigr] + \bigl(M - X_{isc,\,js'c'}\bigr)\,\delta_{isc,\,js'c'}. \tag{3}
\]

For the explicit form of these matrices we refer the reader to Appendix A. After the first blocking, subsequent blockings require that \(B_s = 1\), i.e., we cannot block the spin dimension again since we cannot remove the chirality. Apart from this observation, the next coarse operator will have a similar form to the current one: it will be a nearest-neighbour non-Hermitian operator connecting sites with \(d_s = 2\) spin dimension (in 2d and 4d anyway).

We note in passing that because the definition of the matrix field V includes explicit spin dependence, the tensor product structure of spin and colour on the coarse operator is destroyed, i.e., we have to define an effective link matrix that rotates in spin and colour space. If this were not the case, i.e., if V were spin independent, then this structure would be preserved.
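The triple product Dc = R D P above is, per block, just small dense matrix multiplication (which is why QUDA evaluates it as batched GEMMs). A toy dense sketch, with R = P^T and plain real matrices standing in for the spin-colour blocks:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Toy Galerkin product D_c = R D P with dense matrices; in QUDA this is
// evaluated locally per block as batched small dense matrix multiplications.
using Mat = std::vector<std::vector<double>>;

Mat matmul(const Mat& A, const Mat& B) {
  size_t n = A.size(), m = B[0].size(), k = B.size();
  Mat C(n, std::vector<double>(m, 0.0));
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < m; j++)
      for (size_t l = 0; l < k; l++) C[i][j] += A[i][l] * B[l][j];
  return C;
}

Mat transpose(const Mat& A) {
  Mat T(A[0].size(), std::vector<double>(A.size()));
  for (size_t i = 0; i < A.size(); i++)
    for (size_t j = 0; j < A[0].size(); j++) T[j][i] = A[i][j];
  return T;
}
```

With P a block prolongator with orthonormal columns and D the identity, the coarse operator R D P comes out as the coarse identity, a quick sanity check on the construction.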
SOURCE OF PARALLELISM
1. Grid parallelism: volume of threads

2. Link matrix-vector partitioning: 2 Nvec-way parallelism (spin * color)

\[
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix} \mathrel{+}=
\begin{pmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\]
(each output row handled by a different thread y index)

3. Stencil direction: 8-way parallelism
[Diagram: the eight (dimension, direction) stencil contributions assigned to warps 0-3]

4. Dot-product partitioning: 4-way parallelism

\[
\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\;\Rightarrow\;
\begin{pmatrix} a_{00} & a_{01} \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \end{pmatrix}
+
\begin{pmatrix} a_{02} & a_{03} \end{pmatrix}
\begin{pmatrix} b_2 \\ b_3 \end{pmatrix}
\]
COARSE GRID OPERATOR PERFORMANCE
Tesla K20X (Titan), FP32, Nvec = 24
[Chart: GFLOPS vs. lattice length (10, 8, 6, 4, 2) for successive optimizations: baseline, color-spin, dimension + direction, dot product. On the smallest grid the fully partitioned kernel is 24,576-way parallel, versus 16-way parallel for the baseline]
COARSE GRID OPERATOR PERFORMANCE
▪ Autotuner finds optimum degree of parallelization
▪ Larger grids favor less fine-grained parallelization
▪ Coarse grids favor the most fine-grained parallelization
▪ GPU is nearly always faster than CPU
▪ Expect in future that coarse grids will favor CPUs
▪ For now, use GPU exclusively
8-core Haswell 2.4 GHz (solid line) vs. M6000 (dashed line), FP32
[Chart: coarse dslash performance, GFLOPS vs. 2Nv (0-100) for grids 2x2x2x2, 4x2x2x2, 4x2x2x4, 4x2x4x4, 4x4x4x4; solid symbols CPU, open symbols / dashed lines GPU]
RESULTS
MULTIGRID VERSUS BICGSTAB
• Compare MG against the state-of-the-art traditional Krylov solver on GPU
 • BiCGstab in double/half precision with reliable updates
 • 12/8 reconstruct
 • Symmetric red-black preconditioning
• Adaptive multigrid algorithm
 • GCR outer solver wraps a 3-level MG preconditioner
 • GCR restarts done in double, everything else in single
 • 24 null-space vectors on fine grid
 • Minimum Residual smoother
 • Symmetric red-black preconditioning on each level
 • GCR coarse-grid solver
\[
M_{ee} = 1_{ee} - \kappa^2\, A^{-1}_{ee}\, D_{eo}\, A^{-1}_{oo}\, D_{oe}
\]
[Chart: time to solution (0-12) vs. mass parameter (-0.42 to -0.40) for BiCGstab (double-half) and GCR-MG (double-single)]

          Iterations           GFLOPS
mass      BiCGstab  GCR-MG    BiCGstab  GCR-MG
-0.400    251       15        980       376
-0.405    372       16        980       372
-0.410    510       17        980       353
-0.415    866       18        980       314
-0.420    3103      19        980       293
Wilson, V = 24^3x64, single workstation (3x M6000)
MULTIGRID VERSUS BICGSTAB
Wilson-clover, strong scaling on Titan (K20X); V = 40^3x256 (mπ = 230 MeV) and V = 48^3x96 (mπ = 197 MeV)
[Charts: time to solution vs. number of nodes for BiCGstab and MG. One panel: 24 and 48 nodes, MG speedups over BiCGstab of 6.6x and 6.3x; the other panel: 20 and 32 nodes, speedups of 7.9x and 6.1x]
MULTIGRID VERSUS BICGSTAB
Wilson-clover, strong scaling on Titan (K20X), V = 64^3x128, mπ = 197 MeV
[Chart: time to solution vs. number of nodes (32, 64, 128, 256, 512) for BiCGstab and MG; MG speedups of 5.5x, 10.2x, 8.9x, 7.4x]
MULTIGRID TIMING BREAKDOWN
Wilson-clover, strong scaling on Titan (K20X), V = 64^3x128, 12 linear solves
[Chart: time vs. number of nodes (64, 128, 256, 512), broken down by multigrid level (level 1, level 2, level 3)]
MULTIGRID FUTURE WORK
• Absolute performance tuning, e.g., half precision on coarse grids
• Strong-scaling improvements:
 • Combine with Schwarz preconditioner
 • Accelerate coarse-grid solver: CA-GMRES instead of GCR, deflation
 • More flexible coarse-grid distribution, e.g., redundant nodes
• Investigate offload of coarse grids to the CPU
 • Use CPU and GPU simultaneously using additive MG
• Full offload of setup phase to GPU - required for HMC
MULTI-SRC SOLVERS
• Multi-src solvers increase locality through link-field reuse
 • See previous talk by M. Wagner
• Multigrid operators benefit even more, since link matrices are 48x48
 • Coarse dslash / prolongator / restrictor
• Coarsest grids are also latency limited
 • Kernel-level latency
 • Network latency
• Multi-src solvers are a solution
 • More parallelism
 • Bigger messages
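A back-of-the-envelope model shows why batching right-hand sides pays off for the coarse operator: the 48x48 complex link matrix is loaded once per application, so its bytes are amortized over all the vectors it multiplies. The sketch below (illustrative model only, not QUDA's actual traffic accounting) computes flops per byte for n right-hand sides in single precision.

```cpp
#include <cassert>

// Arithmetic intensity (flops/byte) of applying one dim x dim complex
// link matrix to n right-hand-side vectors, single precision:
// 8 flops per complex multiply-add; the matrix is loaded once, and each
// vector is read once and its result written once.
double intensity(int n, int dim = 48) {
  double flops = 8.0 * dim * dim * n;
  double bytes = 8.0 * dim * dim          // matrix, complex float
               + 8.0 * dim * n * 2.0;     // read + write each vector
  return flops / bytes;
}
```

With one source the kernel is strongly bandwidth bound (intensity below 1 flop/byte); by 16 sources the intensity grows roughly tenfold, consistent with the large speedups in the chart below.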
[Chart: coarse dslash GFLOPS on M6000 vs. number of right-hand sides (1, 2, 4, 8, 16, 32, 64, 128) for 2^4 and 4^4 grids; > 3x speedup]
PASCAL
• Newly launched Pascal P100 GPU provides significant performance uplift through stacked HBM memory
• Wilson-clover dslash 3x speedup vs K40 • double ~500 GFLOPS • single ~1000 GFLOPS • half ~2000 GFLOPS
• Same kernel from 2008 runs unchanged • 5 generations ago
• Similar gains for MG building blocks
[Charts: Wilson-clover dslash GFLOPS vs. lattice length (2-10) for Kepler, Maxwell, and Pascal; coarse dslash performance on P100]
DGX-1
• Pascal includes NVLink interconnect technology • 4x NVLink connections per GPU • 160 GB/s bi-directional bandwidth per GPU
• NVIDIA DGX-1
 • 8x P100 GPUs
 • 2^3 NVLink topology
 • 4x EDR IB (via PCIe switches)
 • 2x Intel Xeon hosts
• 1 DGX-1 has inter-GPU bandwidth equivalent to 64 nodes of Titan
COMING SOON
• >10x speedup from MG
 • Lower bound, since there is much more optimization to do
• 6x node-to-node speedup
 • DGX-1 (8x GP100) vs Pi0g (4x K40)
• 3x speedup from multi-src solvers
 • Increased temporal locality from links and increased parallelism for MG
• Expect >100x speedup for analysis workloads versus previous GPU workflow
CONCLUSIONS AND OUTLOOK
• Multigrid algorithms for LQCD are running well on GPUs
 • In production by multiple groups
 • Wilson, Wilson-clover, twisted-mass, twisted-clover all supported
 • Staggered in progress (see next talk by E. Weinberg)
 • 5-d operators when I get a chance…
• Up to 10x speedup • Fine-grained parallelization was key • To be presented / published at SC16
• Lots more to come • Combine with multi-right-hand side solver and Pascal architecture
GRID PARALLELISM
__device__ void grid_idx(int x[], const int X[]) {
  // X[] holds the local lattice dimensions
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  int za = idx / X[0];
  int zb = za / X[1];
  x[1] = za - zb * X[1];
  x[3] = zb / X[2];
  x[2] = zb - x[3] * X[2];
  x[0] = idx - za * X[0];
  // x[] now holds the thread coordinates
}

Thread x dimension maps to location on the grid
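A host-side mirror of the decomposition (hypothetical helper names) makes it easy to verify that the kernel's coordinate computation inverts the lexicographical site index, with x[0] the fastest-running dimension:

```cpp
#include <cassert>

// CPU mirror of grid_idx: decompose a linear site index into 4-d lattice
// coordinates, exactly as in the kernel above.
void grid_coords(int idx, int x[4], const int X[4]) {
  int za = idx / X[0];
  int zb = za / X[1];
  x[1] = za - zb * X[1];
  x[3] = zb / X[2];
  x[2] = zb - x[3] * X[2];
  x[0] = idx - za * X[0];
}

// Reassemble the linear index: idx = ((x3*X2 + x2)*X1 + x1)*X0 + x0.
int linear_idx(const int x[4], const int X[4]) {
  return ((x[3] * X[2] + x[2]) * X[1] + x[1]) * X[0] + x[0];
}
```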
MATRIX-VECTOR PARALLELISM
template<int Nv>
__device__ void color_spin_idx(int &s, int &c) {
  int yIdx = blockDim.y*blockIdx.y + threadIdx.y;
  s = yIdx / Nv;  // s is now the spin index for this thread
  c = yIdx % Nv;  // c is now the color index for this thread
}
Each stencil application is a sum of matrix-vector products
Parallelize over output vector indices (parallelization over color and spin)
Thread y dimension maps to vector indices
Up to 2 x Nv more parallelism
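A quick host-side check of this mapping (hypothetical helper names): every thread y index corresponds to a unique (spin, color) pair, and the mapping round-trips, so the 2*Nv output rows are covered exactly once.

```cpp
#include <cassert>

// CPU mirror of color_spin_idx: thread y index -> (spin, color),
// with Nv colors per spin, plus the inverse mapping.
template <int Nv>
void color_spin(int yIdx, int& s, int& c) {
  s = yIdx / Nv;
  c = yIdx % Nv;
}

template <int Nv>
int y_index(int s, int c) { return s * Nv + c; }
```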
\[
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix} \mathrel{+}=
\begin{pmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\]
(each output row handled by a different thread y index)
Write result to shared memory
Synchronize
dim=0/dir=0 threads combine and write out result
Introduces up to 8x more parallelism
STENCIL DIRECTION PARALLELISM
__device__ void dim_dir_idx(int &dim, int &dir) {
  int zIdx = blockDim.z*blockIdx.z + threadIdx.z;
  dir = zIdx % 2;  // dir is now the fwd/back direction for this thread
  dim = zIdx / 2;  // dim is now the dimension for this thread
}
Partition computation over stencil direction and dimension onto different threads
[Diagram: the eight (dimension, direction) contributions of the stencil mapped to warps 0-3]
DOT PRODUCT PARALLELIZATION I
const int warp_size = 32;                   // warp size
const int n_split = 4;                      // four-way warp split
const int grid_points = warp_size/n_split;  // grid points per warp
int lane = threadIdx.x % warp_size;
int chunk = lane / grid_points;             // which quarter of the sum this thread owns
complex<real> sum = 0.0;
for (int i = chunk; i < N; i += n_split) sum += a[i] * b[i];

// cascading reduction
for (int offset = warp_size/2; offset >= grid_points; offset /= 2)
  sum += __shfl_down(sum, offset);
// first grid_points threads now hold the desired result
Partition dot product between threads in the same warp Use warp shuffle for final result Useful when not enough grid parallelism to fill a warp
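The lane bookkeeping can be simulated on the CPU to check that the scheme is consistent: with a 32-lane warp split 4 ways, lane l owns chunk l/8 of the dot product for grid point l%8, and after the cascading reduction (modeled here as array reads, standing in for `__shfl_down`) the first 8 lanes hold the complete dot products. A sketch with real scalars:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU simulation of the 4-way warp-split dot product: 8 grid points per
// 32-lane warp; lane l handles chunk l/8 of the sum for grid point l%8.
std::vector<double> warp_split_dot(const std::vector<std::vector<double>>& a,
                                   const std::vector<std::vector<double>>& b) {
  const int warp_size = 32, n_split = 4;
  const int grid_points = warp_size / n_split;  // 8
  const int N = (int)a[0].size();
  std::vector<double> sum(warp_size, 0.0);
  for (int lane = 0; lane < warp_size; lane++) {
    int point = lane % grid_points;  // which grid point this lane serves
    int chunk = lane / grid_points;  // which quarter of the sum it owns
    for (int i = chunk; i < N; i += n_split)
      sum[lane] += a[point][i] * b[point][i];
  }
  // cascading reduction, as with __shfl_down (ascending lane order is safe
  // here since each lane only reads higher, not-yet-updated lanes)
  for (int offset = warp_size / 2; offset >= grid_points; offset /= 2)
    for (int lane = 0; lane + offset < warp_size; lane++)
      sum[lane] += sum[lane + offset];
  return sum;  // first grid_points entries hold the full dot products
}
```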
\[
\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\;\Rightarrow\;
\begin{pmatrix} a_{00} & a_{01} \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \end{pmatrix}
+
\begin{pmatrix} a_{02} & a_{03} \end{pmatrix}
\begin{pmatrix} b_2 \\ b_3 \end{pmatrix}
\]
DOT PRODUCT PARALLELIZATION II
const int n_ilp = 2;  // two-way ILP
complex<real> sum[n_ilp] = { };
for (int i=0; i<N; i+=n_ilp)
  for (int j=0; j<n_ilp; j++)
    sum[j] += a[i+j] * b[i+j];

complex<real> total = static_cast<real>(0.0);
for (int j=0; j<n_ilp; j++) total += sum[j];
Degree of ILP exposed
Multiple computations with no dependencies
Compute final result
Partition dot product computation within a thread
Hide dependent arithmetic latency within a thread
More important for Kepler than Maxwell / Pascal
\[
\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\;\Rightarrow\;
\begin{pmatrix} a_{00} & a_{01} \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \end{pmatrix}
+
\begin{pmatrix} a_{02} & a_{03} \end{pmatrix}
\begin{pmatrix} b_2 \\ b_3 \end{pmatrix}
\]
ERROR REDUCTION AND VARIANCE
V = 40^3x256, mπ = 230 MeV