
Page 1


Multi-GPU Calculations in Lattice Quantum Chromodynamics

Ronald Babich, Boston University

SIAM Conference on CS&E, Reno, Nevada, March 1, 2011

Page 2

Collaborators & References

● Kipton Barros (now at LANL)

● Richard Brower (Boston University)

● Michael Clark (Harvard-Smithsonian)

● Bálint Joó (Jefferson Lab)

● Claudio Rebbi (Boston University)

● K. Barros, R. Babich, R. Brower, M. A. Clark, and C. Rebbi, “Blasting through lattice calculations using CUDA,” PoS(LATTICE2008) 045 (2008) [arXiv:0810.5365 [hep-lat]].

● M. A. Clark, R. Babich, K. Barros, R. Brower, and C. Rebbi, “Solving Lattice QCD systems of equations using mixed precision solvers on GPUs,” Comput. Phys. Commun. 181, 1517 (2010) [arXiv:0911.3191 [hep-lat]].

● R. Babich, M. A. Clark, and B. Joó, “Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics,” SC'10 [arXiv:1011.0024 [hep-lat]].

Page 3

A bit about lattice QCD...

Page 4

Quarks and gluons

● The strong force is one of the basic forces of nature (along with gravity, electromagnetism, and the weak force).

● It's what binds together the quarks and gluons in the proton (and the neutron, as well as hundreds of other particles seen in accelerator experiments).

(Images: Thomas Jefferson National Accelerator Facility; Fermi National Accelerator Laboratory)

Page 5

QCD and lattice QCD

● The theory of the strong force is called Quantum Chromodynamics (QCD).

● Similar to QED (the quantum theory of electromagnetism) but with electric charge replaced by three “color” charges.

● Unlike QED, QCD at everyday energies cannot be treated with perturbation theory (Feynman diagrams).

● Instead, we must evaluate the QCD path integral numerically, sampling all possible configurations of the quark and gluon fields in a region of spacetime.

● To make this possible, continuous spacetime is replaced with a four-dimensional grid (lattice), hence lattice QCD.

Page 6

Steps in a lattice QCD calculation

1. Generate an ensemble of gluon field configurations.

2. Compute quark propagators in these fixed backgrounds by solving the Dirac equation (“Ax = b”) for various right-hand sides.

Page 7

Configuration generation

● Markov process (sequential)

● Requires capability machines sustaining more than O(10) Tflops: BlueGene/P, Cray XT5, etc.

(Images: “Intrepid,” Argonne Leadership Computing Facility; “Jaguar,” Oak Ridge Leadership Computing Facility)

Page 8

Computing propagators

● This “analysis” stage is suitable for capacity-type machines but accounts for as many as half the cycles in modern calculations.

● Each job requires tens of cluster nodes . . .

(Clusters dedicated to lattice QCD at Fermilab and Jefferson Lab)


Page 9

Computing propagators

● . . . or a handful of GPUs (this talk).

● For smaller lattices, even a single GPU might suffice. More typical problems require O(10).

● Replacing capability machines for lattice generation would require the use of at least hundreds of GPUs in parallel.

Page 10

Krylov solvers

● (Conjugate gradient, BiCGstab, and friends)

● Search for the solution to Ax = b in the subspace spanned by {b, Ab, A²b, ...}

● Upshot:

● We need fast code to apply A to an arbitrary vector

● ... as well as fast routines for vector addition, inner products, etc. (home-grown “BLAS”); a CG sketch using such routines follows below.

● QUDA: A library for lattice QCD on GPUs

● http://lattice.github.com/quda
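To make the two ingredients concrete, here is a minimal sketch of a conjugate gradient loop. The helpers (apply_A, copy_vec, axpy, xpay, dot) are hypothetical GPU-backed routines assumed for illustration; this is not the QUDA API.

#include <cuda_runtime.h>

// Hypothetical GPU-backed helpers (assumptions, not the QUDA API):
void apply_A(float* out, const float* in, int n);     // out = A * in (the stencil kernel)
void copy_vec(float* dst, const float* src, int n);   // dst = src
void axpy(float a, const float* x, float* y, int n);  // y += a * x
void xpay(const float* x, float a, float* y, int n);  // y = x + a * y
float dot(const float* x, const float* y, int n);     // global inner product

void cg_solve(float* x, const float* b, int n, int max_iter, float tol)
{
    float *r, *p, *Ap;
    cudaMalloc(&r,  n * sizeof(float));
    cudaMalloc(&p,  n * sizeof(float));
    cudaMalloc(&Ap, n * sizeof(float));

    apply_A(Ap, x, n);                  // r = b - A x0
    copy_vec(r, b, n);
    axpy(-1.0f, Ap, r, n);
    copy_vec(p, r, n);

    float rr = dot(r, r, n);
    for (int k = 0; k < max_iter && rr > tol * tol; k++) {
        apply_A(Ap, p, n);              // dominant cost: the matrix-vector product
        float alpha = rr / dot(p, Ap, n);
        axpy( alpha, p,  x, n);         // x += alpha * p
        axpy(-alpha, Ap, r, n);         // r -= alpha * A p
        float rr_new = dot(r, r, n);
        xpay(r, rr_new / rr, p, n);     // p = r + beta * p
        rr = rr_new;
    }
    cudaFree(r); cudaFree(p); cudaFree(Ap);
}

Everything in the loop is either an application of A or a BLAS-like vector operation, which is why those two pieces are the only ones that need to be fast.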

Page 11

Our “A”: The Wilson-clover operator

● Here we consider the clover-improved Wilson discretization of the Dirac operator (QUDA supports others), which takes the form given in arXiv:0911.3191:

M_{x,x'} = -\frac{1}{2} \sum_{\mu=1}^{4} \left( P^{-\mu} \otimes U_x^\mu \,\delta_{x+\hat\mu,x'} + P^{+\mu} \otimes U_{x-\hat\mu}^{\mu\dagger} \,\delta_{x-\hat\mu,x'} \right) + (4 + m + A_x)\,\delta_{x,x'}

● This is a finite-difference operator with a 9-point stencil in 4 dimensions.

● P^{±μ} = 1 ± γ_μ are 4×4 projection matrices acting in “spin” space with entries 0, ±1, ±i (never explicitly stored).

● U_x^μ are fields of 3×3 matrices acting in “color” space.

● A_x (the “clover” term) is a field of 12×12 matrices.

● Altogether, our vector consists of 12 complex numbers per site.

Page 12

Hardware considerations

Page 13

A tale of two processors

Intel Xeon X5680 (“Gulftown”):

● 6 cores (each with a 4-wide SSE unit)
● 1.17 billion transistors
● Shared L3 cache: 12 MB
● L1+L2: 6 × 320 KB = 1920 KB
● 160 Gflops (SP)
● 32 GB/s memory bandwidth
● Up to 288 GB of memory (96 GB is realistic)

NVIDIA GeForce GTX 480 (“Fermi”):

● 480 cores
● 3.0 billion transistors
● Shared L2 cache: 768 KB
● L1+SM+Reg: 15 × 192 KB = 2880 KB
● 1345 Gflops (SP)
● 177 GB/s memory bandwidth
● 1.5 GB of memory (up to 6 GB in the Tesla variant)

Page 14

Bandwidth constraints

(Recap: Xeon X5680, 160 Gflops SP with 32 GB/s memory bandwidth; GTX 480, 1345 Gflops SP with 177 GB/s.)

● Per lattice site, our matrix-vector product carries out 1824 flops while reading/writing 432 floats, corresponding to a byte/flop ratio of 0.95 in single precision or 1.90 in double (the arithmetic is spelled out below).

● The basic linear algebra routines are even more memory-bound.

● We're entirely constrained by memory bandwidth. On the GPU, flops are virtually free.
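To spell out the arithmetic behind those ratios (the GTX 480 comparison is our addition here):

432 floats × 4 bytes / 1824 flops ≈ 0.95 bytes/flop (single precision)
432 floats × 8 bytes / 1824 flops ≈ 1.90 bytes/flop (double precision)
GTX 480 hardware: 177 GB/s ÷ 1345 Gflops ≈ 0.13 bytes/flop available

The kernel thus demands roughly 7× more bandwidth per flop than the hardware supplies even in single precision, which is why the flops are effectively free.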

Page 15

GPU memory hierarchy

(GeForce GTX 480)

Page 16

Single GPU strategy and performance

Page 17

Strategies

● We employ several strategies to reduce bandwidth requirements in the matrix-vector product. These are somewhat application-specific but amount to recomputing data on the fly and performing basis rotations to increase the sparsity of the matrix.

● The matrix-vector kernel itself is produced by a code generator written in Python.

● Among the strategies we employ to speed up the linear algebra routines are

● kernel fusion

● auto-tuned launch parameters

● Multi-precision solvers are key. Even half (16-bit) precision is worthwhile, given the right algorithm.

Page 18

Kernel fusion

● Consider the following set of operations taken from our BiCGstab solver:

z = z + ax + by
y = y - bw
c = |y|²
d = (v, w)

(Diagram: executed as four separate kernels, these operations require 8 vector reads and 2 vector writes in total.)

Page 19

Kernel fusion

● We can avoid memory transfers by fusing these operations into a single compute kernel (a CUDA sketch follows the diagram):

z = z + ax + by
y = y - bw
c = |y|²
d = (v, w)

(Diagram: fused into one kernel, each input vector z, x, y, w, v is read once and each output z, y written once: 5 vector reads and 2 vector writes, down from 8 reads.)
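A minimal CUDA sketch of such a fused kernel, simplified to real-valued vectors and using atomicAdd for the two reductions (QUDA itself uses efficient tree reductions; this is only illustrative):

#include <cuda_runtime.h>

// Fused kernel for: z = z + a*x + b*y;  y = y - b*w;  c = |y|^2;  d = (v, w).
// Note: double-precision atomicAdd requires sm_60 or later; Fermi-era code
// would use a block-level tree reduction instead. *c and *d must be zeroed
// on the device before launch.
__global__ void fused_update(double a, double b,
                             double* z, const double* x, double* y,
                             const double* w, const double* v,
                             double* c, double* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Each input vector is read from global memory exactly once.
    double xi = x[i], wi = w[i], vi = v[i], yi = y[i], zi = z[i];

    zi = zi + a * xi + b * yi;   // z = z + a x + b y (uses the old y)
    yi = yi - b * wi;            // y = y - b w

    z[i] = zi;                   // each output written exactly once
    y[i] = yi;

    atomicAdd(c, yi * yi);       // accumulate |y|^2 with the updated y
    atomicAdd(d, vi * wi);       // accumulate (v, w)
}

// Launch example:
//   fused_update<<<(n + 255) / 256, 256>>>(a, b, z, x, y, w, v, c, d, n);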

Page 20

Auto-tuned linear algebra

$ make
...
$ make tune
...

Benchmarking 16 bit precision
copyCuda : 256 threads per block, 2048 blocks per grid, Gflops/s = 0.000000, GiB/s = 127.606472
axpbyCuda : 64 threads per block, 2048 blocks per grid, Gflops/s = 62.037775, GiB/s = 125.183891
xpyCuda : 256 threads per block, 512 blocks per grid, Gflops/s = 20.661412, GiB/s = 125.075855
axpyCuda : 64 threads per block, 2048 blocks per grid, Gflops/s = 41.360739, GiB/s = 125.190617
xpayCuda : 64 threads per block, 2048 blocks per grid, Gflops/s = 41.375916, GiB/s = 125.236556
mxpyCuda : 64 threads per block, 2048 blocks per grid, Gflops/s = 20.686066, GiB/s = 125.225099
axCuda : 64 threads per block, 2048 blocks per grid, Gflops/s = 31.444969, GiB/s = 126.903442
caxpyCuda : 64 threads per block, 2048 blocks per grid, Gflops/s = 82.751603, GiB/s = 125.236209
caxpbyCuda : 128 threads per block, 2048 blocks per grid, Gflops/s = 145.006273, GiB/s = 125.401357
cxpaypbzCuda : 256 threads per block, 1024 blocks per grid, Gflops/s = 125.884968, GiB/s = 127.009472
axpyZpbxCuda : 128 threads per block, 2048 blocks per grid, Gflops/s = 101.000918, GiB/s = 127.378922
caxpbypzYmbwCuda : 64 threads per block, 4096 blocks per grid, Gflops/s = 120.062473, GiB/s = 121.134966
sumCuda : 256 threads per block, 256 blocks per grid, Gflops/s = 51.459132, GiB/s = 103.837612
normCuda : 256 threads per block, 256 blocks per grid, Gflops/s = 103.021799, GiB/s = 95.946527
reDotProductCuda : 128 threads per block, 256 blocks per grid, Gflops/s = 59.747498, GiB/s = 120.562419
axpyNormCuda : 256 threads per block, 2048 blocks per grid, Gflops/s = 74.018444, GiB/s = 112.019453
xmyNormCuda : 256 threads per block, 4096 blocks per grid, Gflops/s = 55.737687, GiB/s = 112.471159
cDotProductCuda : 128 threads per block, 256 blocks per grid, Gflops/s = 119.348463, GiB/s = 120.414577
xpaycDotzyCuda : 256 threads per block, 2048 blocks per grid, Gflops/s = 85.237100, GiB/s = 114.664674
cDotProductNormACuda : 128 threads per block, 64 blocks per grid, Gflops/s = 173.619070, GiB/s = 116.779982
cDotProductNormBCuda : 128 threads per block, 64 blocks per grid, Gflops/s = 173.822401, GiB/s = 116.916746
caxpbypzYmbwcDotProductWYNormYQuda: 256 threads per block, 512 blocks per grid, Gflops/s = 145.992303, GiB/s = 114.563884

Benchmarking 32 bit precision
copyCuda : 64 threads per block, 4096 blocks per grid, Gflops/s = 0.000000, GiB/s = 126.151752
...

Benchmarking 64 bit precision
copyCuda : 256 threads per block, 4096 blocks per grid, Gflops/s = 0.000000, GiB/s = 125.865711
...

Writing optimal parameters to blas_param.h
make[1]: Leaving directory `/home/rbabich/quda/tests'
Autotuning completed successfully. Please type 'make' to rebuild library.

$ make
...
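The idea behind make tune can be sketched as follows: time each kernel over a grid of launch configurations with CUDA events and keep the fastest. This is an illustrative sketch with a hypothetical axpy kernel, not QUDA's actual tuning code:

#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

// Grid-stride axpy, so that any (blocks, threads) combination is correct.
__global__ void axpy_kernel(float a, const float* x, float* y, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] += a * x[i];
}

// Time the kernel over a grid of launch configurations; keep the fastest.
void tune_axpy(const float* x, float* y, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int best_threads = 0, best_blocks = 0;
    float best_ms = FLT_MAX;

    for (int threads = 64; threads <= 256; threads *= 2) {
        for (int blocks = 256; blocks <= 4096; blocks *= 2) {
            cudaEventRecord(start);
            for (int rep = 0; rep < 10; rep++)   // average over repetitions
                axpy_kernel<<<blocks, threads>>>(1.0f, x, y, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            if (ms < best_ms) {
                best_ms = ms;
                best_threads = threads;
                best_blocks = blocks;
            }
        }
    }
    // QUDA records winners like these in blas_param.h for the next build.
    printf("axpyCuda : %d threads per block, %d blocks per grid\n",
           best_threads, best_blocks);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}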

Page 21

Mixed precision with reliable updates

● Using a mixed-precision solver incorporating “reliable updates” (Clark et al., arXiv:0911.3191) with half precision greatly reduces time-to-solution while maintaining double precision accuracy.
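The flavor of the approach can be conveyed with a defect-correction-style sketch: run the inner Krylov solve in low precision, then periodically recompute the true residual in double and restart. The actual reliable-update scheme of arXiv:0911.3191 is more refined (it corrects the residual inside a single Krylov sequence), and all helpers here are hypothetical:

#include <vector>

// Hypothetical helpers (assumptions, not the QUDA API):
void apply_A_double(double* out, const double* in, int n);      // out = A * in
void solve_inner_half(double* e, const double* r, int n, double tol); // low-precision solve of A e = r
double norm2(const double* x, int n);                           // squared norm

void solve_mixed(double* x, const double* b, int n, double tol)
{
    std::vector<double> r(n), e(n);

    apply_A_double(r.data(), x, n);                 // true residual r = b - A x
    for (int i = 0; i < n; i++) r[i] = b[i] - r[i];

    const double target = tol * tol * norm2(b, n);

    while (norm2(r.data(), n) > target) {
        // Inner solve of A e = r in half precision, to a loose tolerance;
        // rounding error limits how far each inner solve can converge.
        solve_inner_half(e.data(), r.data(), n, 1e-1);

        for (int i = 0; i < n; i++) x[i] += e[i];   // update the solution

        apply_A_double(r.data(), x, n);             // recompute the true residual
        for (int i = 0; i < n; i++) r[i] = b[i] - r[i];
    }
}

Because the outer residual is always recomputed in double precision, the converged solution carries double-precision accuracy even though almost all of the work is done in half.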

Page 22

Performance results

● Results are for the even/odd preconditioned clover-improved Wilson matrix-vector product (“Dslash”).

● Runs were done on a GeForce GTX 480 (a consumer-level “Fermi” card) with a slightly out-of-date version of QUDA; a newer version would give perhaps 10% higher performance.

● For reference, a standard dual-socket node with recent (Westmere) quad-core Xeons would sustain around 20 Gflops in single precision for a well-optimized Wilson-clover Dslash.

● We'll compare results for double, single, and half precision. In this case, half is a 16-bit quasi-fixed-point implementation (sketched after this list), but GPUs support true FP16 as well.

● The spatial volume is held fixed at 24³.
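One way to realize such a quasi-fixed-point format, sketched here under the assumption of a single float scale factor per site (QUDA's actual half-precision layout differs in detail):

#include <cstdint>
#include <cmath>

// Quasi-fixed-point 16-bit storage: each site's 24 reals are quantized to
// int16 against one per-site float scale. Illustrative only.
#define SITE_LEN 24   // 12 complex numbers = 24 reals per site

__device__ void store_site_half(const float* in, int16_t* out, float* scale)
{
    float max_abs = 0.0f;
    for (int i = 0; i < SITE_LEN; i++)
        max_abs = fmaxf(max_abs, fabsf(in[i]));
    *scale = (max_abs > 0.0f) ? max_abs / 32767.0f : 1.0f;  // avoid divide-by-zero
    for (int i = 0; i < SITE_LEN; i++)
        out[i] = (int16_t)rintf(in[i] / *scale);            // quantize to 16 bits
}

__device__ void load_site_half(const int16_t* in, float scale, float* out)
{
    for (int i = 0; i < SITE_LEN; i++)
        out[i] = scale * (float)in[i];                      // dequantize on the fly
}

Halving the storage halves the memory traffic, which on a bandwidth-bound kernel translates almost directly into speedup.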

Page 23

Matrix-vector performance

● Single and half performance are about 2.6x and 5.1x higher than double, respectively.

Page 24

Multi-GPU strategy and performance

Page 25

Challenges to scaling up

● GPU-to-host and inter-node bandwidth

● GPU-to-host and inter-node latency

(Diagram: node architecture; each GPU-host link sustains ~ 3+3 GB/s, and nodes are connected by a QDR InfiniBand fabric.)

Page 26

Multi-GPU strategy

Slide: Bálint Joó

Page 27

Multi-GPU strategy

● In this first pass, we divide up the temporal direction only.

● We must contend with the fact that the spinor field is stored in 6 separate arrays (necessary to ensure memory coalescing).

● With our choice of spin basis, we need only transfer half the spin components (e.g., upper in the backward direction).

● The 3 sub-arrays containing these components on the boundary time-slice are copied into a contiguous buffer on the host.

● The buffer is then transferred across the network to the remote host, where it is copied onto the remote GPU.

● We use CUDA streams and cudaMemcpyAsync() to overlap boundary transfers with interior computation (see the sketch after the diagram below).

(Diagram: sending device → sending host → network → receiving host → receiving device.)
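A condensed sketch of this pipeline using CUDA streams, cudaMemcpyAsync(), and MPI. The kernels and buffers are hypothetical placeholders; the real code additionally handles the six sub-arrays and spin projection described above, and exchanges in both time directions.

#include <mpi.h>
#include <cuda_runtime.h>

// Assumed (hypothetical) kernels and buffers, set up elsewhere:
__global__ void pack_boundary(float* d_send);          // gather projected boundary spinors
__global__ void dslash_interior();                     // stencil on interior sites
__global__ void dslash_boundary(const float* d_recv);  // stencil on boundary sites
extern float *d_send_buf, *d_recv_buf;                 // device buffers
extern float *h_send_buf, *h_recv_buf;                 // pinned host buffers
extern size_t buf_bytes;
extern int fwd_rank, bwd_rank;                         // neighbors in time
extern dim3 grid, block;

void dslash_one_direction()
{
    cudaStream_t interior, boundary;
    cudaStreamCreate(&interior);
    cudaStreamCreate(&boundary);

    // 1. Pack the boundary time-slice into a contiguous device buffer and
    //    start the device-to-host copy asynchronously.
    pack_boundary<<<grid, block, 0, boundary>>>(d_send_buf);
    cudaMemcpyAsync(h_send_buf, d_send_buf, buf_bytes,
                    cudaMemcpyDeviceToHost, boundary);

    // 2. Meanwhile, compute the interior sites, which need no remote data.
    dslash_interior<<<grid, block, 0, interior>>>();

    // 3. When the copy completes, exchange buffers with the neighbor ranks.
    cudaStreamSynchronize(boundary);
    MPI_Sendrecv(h_send_buf, (int)buf_bytes, MPI_BYTE, fwd_rank, 0,
                 h_recv_buf, (int)buf_bytes, MPI_BYTE, bwd_rank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // 4. Move the received data to the GPU and finish the boundary sites.
    cudaMemcpyAsync(d_recv_buf, h_recv_buf, buf_bytes,
                    cudaMemcpyHostToDevice, boundary);
    dslash_boundary<<<grid, block, 0, boundary>>>(d_recv_buf);

    cudaStreamSynchronize(interior);
    cudaStreamSynchronize(boundary);
    cudaStreamDestroy(interior);
    cudaStreamDestroy(boundary);
}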

Page 28

Overlapping communications

Page 29

Multi-GPU results

● All performance numbers are for the full solver (BiCGstab, anisotropic clover-improved Wilson with “symmetric” even/odd preconditioning).

● Tests were run on a 16-node cluster at Jefferson Lab, interconnected by QDR InfiniBand.

● Each node has 2 GeForce GTX 285 cards (previous generation; 240 cores/GPU).

● More recent versions of QUDA obtain higher performance.

Page 30

Weak scaling (32⁴ local)

● Local volume (per GPU) is held fixed: 32⁴

(Plot annotation: 32³ × 256)

Page 31

Strong scaling (32³ × 256)

● Total volume is held fixed: 32³ × 256

Page 32

Multi-GPU results on Fermi

● 1 node (dual-socket, dual-chipset)

● 4 NVIDIA GeForce GTX 480 cards

● Again, latest code would achieve higher performance

● Sustained performance in the inverter (BiCGstab, clover-improved Wilson, mixed single/half):

1023 Gflops

Page 33

Ongoing work

● Decomposing the lattice along only one dimension is sufficient for “analysis” jobs on most (but not all) lattice sizes of interest, allowing us to fit the problem in GPU memory and sustain ~ 1-4 Tflops.

● Ultimately, we're interested in the strong-scaling regime. A first pass at multi-dimensional parallelization is nearly complete.

● Inter-node latency and bandwidth are major constraints. CUDA 4.0 and GPUDirect v2.0 will help somewhat.

● Pushing beyond O(100) GPUs will demand more sophisticated algorithms (domain decomposition, etc.).

● Compute/communication imbalance is likely to be a recurring theme in the future (see, e.g., DARPA exascale report). In this sense, GPU clusters are a glimpse of things to come.