
Page 1


Multi-GPU Calculations in Lattice Quantum Chromodynamics

Ronald Babich, Boston University

SIAM Conference on CS&E, Reno, Nevada, March 1, 2011

Page 2

Collaborators & References

● Kipton Barros (now at LANL)

● Richard Brower (Boston University)

● Michael Clark (Harvard-Smithsonian)

● Bálint Joó (Jefferson Lab)

● Claudio Rebbi (Boston University)

● K. Barros, R. Babich, R. Brower, M. A. Clark, and C. Rebbi, “Blasting through lattice calculations using CUDA,” PoS(LATTICE2008) 045 (2008) [arXiv:0810.5365 [hep-lat]].

● M. A. Clark, R. Babich, K. Barros, R. Brower, and C. Rebbi, “Solving Lattice QCD systems of equations using mixed precision solvers on GPUs,” Comput. Phys. Commun. 181, 1517 (2010) [arXiv:0911.3191 [hep-lat]].

● R. Babich, M. A. Clark, and B. Joó, “Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics,” SC'10 [arXiv:1011.0024 [hep-lat]].

Page 3

A bit about lattice QCD...

Page 4

Quarks and gluons

● The strong force is one of the basic forces of nature (along with gravity, electromagnetism, and the weak force).

● It's what binds together the quarks and gluons in the proton (and the neutron, as well as hundreds of other particles seen in accelerator experiments).

(Images: Thomas Jefferson National Accelerator Facility; Fermi National Accelerator Laboratory)

Page 5

QCD and lattice QCD

● The theory of the strong force is called Quantum Chromodynamics (QCD).

● Similar to QED (the quantum theory of electromagnetism) but with electric charge replaced by three “color” charges.

● Unlike QED, QCD at everyday energies cannot be treated with perturbation theory (Feynman diagrams).

● Instead, we must evaluate the QCD path integral numerically, sampling all possible configurations of the quark and gluon fields in a region of spacetime.

● To make this possible, continuous spacetime is replaced with a four-dimensional grid (lattice), hence lattice QCD.

Page 6

Steps in a lattice QCD calculation

1. Generate an ensemble of gluon field configurations.

2. Compute quark propagators in these fixed backgrounds by solving the Dirac equation (“Ax = b”) for various right-hand sides.

Page 7

Configuration generation

● Markov process (sequential)

● Requires capability machines sustaining more than O(10) Tflops: BlueGene/P, Cray XT5, etc.

(Images: “Intrepid,” Argonne Leadership Computing Facility; “Jaguar,” Oak Ridge Leadership Computing Facility)

Page 8

Computing propagators

● This “analysis” stage is suitable for capacity-type machines but accounts for as many as half the cycles in modern calculations.

● Each job requires tens of cluster nodes . . .

(Clusters dedicated to lattice QCD at Fermilab and Jefferson Lab)


Page 9

Computing propagators

● . . . or a handful of GPUs (this talk).

● For smaller lattices, even a single GPU might suffice. More typical problems require O(10).

● Replacing capability machines for lattice generation would require the use of at least hundreds of GPUs in parallel.

Page 10

Krylov solvers

● (Conjugate gradient, BiCGstab, and friends)

● Search for the solution to Ax = b in the subspace spanned by {b, Ab, A²b, ...}

● Upshot:

● We need fast code to apply A to an arbitrary vector

● ... as well as fast routines for vector addition, inner products, etc. (home-grown “BLAS”); a CG sketch using such routines follows below.

● QUDA: A library for lattice QCD on GPUs

● http://lattice.github.com/quda
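To make the two ingredients concrete, here is a minimal sketch of a conjugate gradient loop. The helpers (apply_A, copy_vec, axpy, xpay, dot) are hypothetical GPU-backed routines assumed for illustration; this is not the QUDA API.

#include <cuda_runtime.h>

// Hypothetical GPU-backed helpers (assumptions, not the QUDA API):
void apply_A(float* out, const float* in, int n);     // out = A * in (the stencil kernel)
void copy_vec(float* dst, const float* src, int n);   // dst = src
void axpy(float a, const float* x, float* y, int n);  // y += a * x
void xpay(const float* x, float a, float* y, int n);  // y = x + a * y
float dot(const float* x, const float* y, int n);     // global inner product

void cg_solve(float* x, const float* b, int n, int max_iter, float tol)
{
    float *r, *p, *Ap;
    cudaMalloc(&r,  n * sizeof(float));
    cudaMalloc(&p,  n * sizeof(float));
    cudaMalloc(&Ap, n * sizeof(float));

    apply_A(Ap, x, n);                  // r = b - A x0
    copy_vec(r, b, n);
    axpy(-1.0f, Ap, r, n);
    copy_vec(p, r, n);

    float rr = dot(r, r, n);
    for (int k = 0; k < max_iter && rr > tol * tol; k++) {
        apply_A(Ap, p, n);              // dominant cost: the matrix-vector product
        float alpha = rr / dot(p, Ap, n);
        axpy( alpha, p,  x, n);         // x += alpha * p
        axpy(-alpha, Ap, r, n);         // r -= alpha * A p
        float rr_new = dot(r, r, n);
        xpay(r, rr_new / rr, p, n);     // p = r + beta * p
        rr = rr_new;
    }
    cudaFree(r); cudaFree(p); cudaFree(Ap);
}

Everything in the loop is either an application of A or a BLAS-like vector operation, which is why those two pieces are the only ones that need to be fast.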

Page 11

Our “A”: The Wilson-clover operator

● Here we consider the clover-improved Wilson discretization of the Dirac operator (QUDA supports others), which takes the form given in arXiv:0911.3191:

M_{x,x'} = -\frac{1}{2} \sum_{\mu=1}^{4} \left( P^{-\mu} \otimes U_x^\mu \,\delta_{x+\hat\mu,x'} + P^{+\mu} \otimes U_{x-\hat\mu}^{\mu\dagger} \,\delta_{x-\hat\mu,x'} \right) + (4 + m + A_x)\,\delta_{x,x'}

● This is a finite-difference operator with a 9-point stencil in 4 dimensions.

● P^{±μ} = 1 ± γ_μ are 4×4 projection matrices acting in “spin” space with entries 0, ±1, ±i (never explicitly stored).

● U_x^μ are fields of 3×3 matrices acting in “color” space.

● A_x (the “clover” term) is a field of 12×12 matrices.

● Altogether, our vector consists of 12 complex numbers per site.

Page 12

Hardware considerations

Page 13

A tale of two processors

Intel Xeon X5680 (“Gulftown”):

● 6 cores (each with a 4-wide SSE unit)
● 1.17 billion transistors
● Shared L3 cache: 12 MB
● L1+L2: 6 × 320 KB = 1920 KB
● 160 Gflops (SP)
● 32 GB/s memory bandwidth
● Up to 288 GB of memory (96 GB is realistic)

NVIDIA GeForce GTX 480 (“Fermi”):

● 480 cores
● 3.0 billion transistors
● Shared L2 cache: 768 KB
● L1+SM+Reg: 15 × 192 KB = 2880 KB
● 1345 Gflops (SP)
● 177 GB/s memory bandwidth
● 1.5 GB of memory (up to 6 GB in the Tesla variant)

Page 14

Bandwidth constraints

(Recap: Xeon X5680, 160 Gflops SP with 32 GB/s memory bandwidth; GTX 480, 1345 Gflops SP with 177 GB/s.)

● Per lattice site, our matrix-vector product carries out 1824 flops while reading/writing 432 floats, corresponding to a byte/flop ratio of 0.95 in single precision or 1.90 in double (the arithmetic is spelled out below).

● The basic linear algebra routines are even more memory-bound.

● We're entirely constrained by memory bandwidth. On the GPU, flops are virtually free.
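To spell out the arithmetic behind those ratios (the GTX 480 comparison is our addition here):

432 floats × 4 bytes / 1824 flops ≈ 0.95 bytes/flop (single precision)
432 floats × 8 bytes / 1824 flops ≈ 1.90 bytes/flop (double precision)
GTX 480 hardware: 177 GB/s ÷ 1345 Gflops ≈ 0.13 bytes/flop available

The kernel thus demands roughly 7× more bandwidth per flop than the hardware supplies even in single precision, which is why the flops are effectively free.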

Page 15

GPU memory hierarchy

(GeForce GTX 480)

Page 16

Single GPU strategy and performance

Page 17

Strategies

● We employ several strategies to reduce bandwidth requirements in the matrix-vector product. These are somewhat application-specific but amount to recomputing data on the fly and performing basis rotations to increase the sparsity of the matrix.

● The matrix-vector kernel itself is produced by a code generator written in Python.

● Among the strategies we employ to speed up the linear algebra routines are

● kernel fusion

● auto-tuned launch parameters

● Multi-precision solvers are key. Even half (16-bit) precision is worthwhile, given the right algorithm.

Page 18

Kernel fusion

● Consider the following set of operations taken from our BiCGstab solver:

z = z + ax + by
y = y - bw
c = |y|²
d = (v, w)

(Diagram: executed as four separate kernels, these operations require 8 vector reads and 2 vector writes in total.)

Page 19

Kernel fusion

● We can avoid memory transfers by fusing these operations into a single compute kernel (a CUDA sketch follows the diagram):

z = z + ax + by
y = y - bw
c = |y|²
d = (v, w)

(Diagram: fused into one kernel, each input vector z, x, y, w, v is read once and each output z, y written once: 5 vector reads and 2 vector writes, down from 8 reads.)
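A minimal CUDA sketch of such a fused kernel, simplified to real-valued vectors and using atomicAdd for the two reductions (QUDA itself uses efficient tree reductions; this is only illustrative):

#include <cuda_runtime.h>

// Fused kernel for: z = z + a*x + b*y;  y = y - b*w;  c = |y|^2;  d = (v, w).
// Note: double-precision atomicAdd requires sm_60 or later; Fermi-era code
// would use a block-level tree reduction instead. *c and *d must be zeroed
// on the device before launch.
__global__ void fused_update(double a, double b,
                             double* z, const double* x, double* y,
                             const double* w, const double* v,
                             double* c, double* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Each input vector is read from global memory exactly once.
    double xi = x[i], wi = w[i], vi = v[i], yi = y[i], zi = z[i];

    zi = zi + a * xi + b * yi;   // z = z + a x + b y (uses the old y)
    yi = yi - b * wi;            // y = y - b w

    z[i] = zi;                   // each output written exactly once
    y[i] = yi;

    atomicAdd(c, yi * yi);       // accumulate |y|^2 with the updated y
    atomicAdd(d, vi * wi);       // accumulate (v, w)
}

// Launch example:
//   fused_update<<<(n + 255) / 256, 256>>>(a, b, z, x, y, w, v, c, d, n);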

Page 20

Auto-tuned linear algebra

$ make
...
$ make tune
...

Benchmarking 16 bit precision
copyCuda : 256 threads per block, 2048 blocks per grid, Gflops/s = 0.000000, GiB/s = 127.606472
axpbyCuda : 64 threads per block, 2048 blocks per grid, Gflops/s = 62.037775, GiB/s = 125.183891
xpyCuda : 256 threads per block, 512 blocks per grid, Gflops/s = 20.661412, GiB/s = 125.075855
axpyCuda : 64 threads per block, 2048 blocks per grid, Gflops/s = 41.360739, GiB/s = 125.190617
xpayCuda : 64 threads per block, 2048 blocks per grid, Gflops/s = 41.375916, GiB/s = 125.236556
mxpyCuda : 64 threads per block, 2048 blocks per grid, Gflops/s = 20.686066, GiB/s = 125.225099
axCuda : 64 threads per block, 2048 blocks per grid, Gflops/s = 31.444969, GiB/s = 126.903442
caxpyCuda : 64 threads per block, 2048 blocks per grid, Gflops/s = 82.751603, GiB/s = 125.236209
caxpbyCuda : 128 threads per block, 2048 blocks per grid, Gflops/s = 145.006273, GiB/s = 125.401357
cxpaypbzCuda : 256 threads per block, 1024 blocks per grid, Gflops/s = 125.884968, GiB/s = 127.009472
axpyZpbxCuda : 128 threads per block, 2048 blocks per grid, Gflops/s = 101.000918, GiB/s = 127.378922
caxpbypzYmbwCuda : 64 threads per block, 4096 blocks per grid, Gflops/s = 120.062473, GiB/s = 121.134966
sumCuda : 256 threads per block, 256 blocks per grid, Gflops/s = 51.459132, GiB/s = 103.837612
normCuda : 256 threads per block, 256 blocks per grid, Gflops/s = 103.021799, GiB/s = 95.946527
reDotProductCuda : 128 threads per block, 256 blocks per grid, Gflops/s = 59.747498, GiB/s = 120.562419
axpyNormCuda : 256 threads per block, 2048 blocks per grid, Gflops/s = 74.018444, GiB/s = 112.019453
xmyNormCuda : 256 threads per block, 4096 blocks per grid, Gflops/s = 55.737687, GiB/s = 112.471159
cDotProductCuda : 128 threads per block, 256 blocks per grid, Gflops/s = 119.348463, GiB/s = 120.414577
xpaycDotzyCuda : 256 threads per block, 2048 blocks per grid, Gflops/s = 85.237100, GiB/s = 114.664674
cDotProductNormACuda : 128 threads per block, 64 blocks per grid, Gflops/s = 173.619070, GiB/s = 116.779982
cDotProductNormBCuda : 128 threads per block, 64 blocks per grid, Gflops/s = 173.822401, GiB/s = 116.916746
caxpbypzYmbwcDotProductWYNormYQuda: 256 threads per block, 512 blocks per grid, Gflops/s = 145.992303, GiB/s = 114.563884

Benchmarking 32 bit precision
copyCuda : 64 threads per block, 4096 blocks per grid, Gflops/s = 0.000000, GiB/s = 126.151752
...

Benchmarking 64 bit precision
copyCuda : 256 threads per block, 4096 blocks per grid, Gflops/s = 0.000000, GiB/s = 125.865711
...

Writing optimal parameters to blas_param.h
make[1]: Leaving directory `/home/rbabich/quda/tests'
Autotuning completed successfully. Please type 'make' to rebuild library.

$ make
...
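The idea behind make tune can be sketched as follows: time each kernel over a grid of launch configurations with CUDA events and keep the fastest. This is an illustrative sketch with a hypothetical axpy kernel, not QUDA's actual tuning code:

#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

// Grid-stride axpy, so that any (blocks, threads) combination is correct.
__global__ void axpy_kernel(float a, const float* x, float* y, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] += a * x[i];
}

// Time the kernel over a grid of launch configurations; keep the fastest.
void tune_axpy(const float* x, float* y, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int best_threads = 0, best_blocks = 0;
    float best_ms = FLT_MAX;

    for (int threads = 64; threads <= 256; threads *= 2) {
        for (int blocks = 256; blocks <= 4096; blocks *= 2) {
            cudaEventRecord(start);
            for (int rep = 0; rep < 10; rep++)   // average over repetitions
                axpy_kernel<<<blocks, threads>>>(1.0f, x, y, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            if (ms < best_ms) {
                best_ms = ms;
                best_threads = threads;
                best_blocks = blocks;
            }
        }
    }
    // QUDA records winners like these in blas_param.h for the next build.
    printf("axpyCuda : %d threads per block, %d blocks per grid\n",
           best_threads, best_blocks);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}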

Page 21

Mixed precision with reliable updates

● Using a mixed-precision solver incorporating “reliable updates” (Clark et al., arXiv:0911.3191) with half precision greatly reduces time-to-solution while maintaining double precision accuracy.
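The flavor of the approach can be conveyed with a defect-correction-style sketch: run the inner Krylov solve in low precision, then periodically recompute the true residual in double and restart. The actual reliable-update scheme of arXiv:0911.3191 is more refined (it corrects the residual inside a single Krylov sequence), and all helpers here are hypothetical:

#include <vector>

// Hypothetical helpers (assumptions, not the QUDA API):
void apply_A_double(double* out, const double* in, int n);      // out = A * in
void solve_inner_half(double* e, const double* r, int n, double tol); // low-precision solve of A e = r
double norm2(const double* x, int n);                           // squared norm

void solve_mixed(double* x, const double* b, int n, double tol)
{
    std::vector<double> r(n), e(n);

    apply_A_double(r.data(), x, n);                 // true residual r = b - A x
    for (int i = 0; i < n; i++) r[i] = b[i] - r[i];

    const double target = tol * tol * norm2(b, n);

    while (norm2(r.data(), n) > target) {
        // Inner solve of A e = r in half precision, to a loose tolerance;
        // rounding error limits how far each inner solve can converge.
        solve_inner_half(e.data(), r.data(), n, 1e-1);

        for (int i = 0; i < n; i++) x[i] += e[i];   // update the solution

        apply_A_double(r.data(), x, n);             // recompute the true residual
        for (int i = 0; i < n; i++) r[i] = b[i] - r[i];
    }
}

Because the outer residual is always recomputed in double precision, the converged solution carries double-precision accuracy even though almost all of the work is done in half.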

Page 22

Performance results

● Results are for the even/odd preconditioned clover-improved Wilson matrix-vector product (“Dslash”).

● Runs were done on a GeForce GTX 480 (a consumer-level “Fermi” card) with a slightly out-of-date version of QUDA; a newer version would give perhaps 10% higher performance.

● For reference, a standard dual-socket node with recent (Westmere) quad-core Xeons would sustain around 20 Gflops in single precision for a well-optimized Wilson-clover Dslash.

● We'll compare results for double, single, and half precision. In this case, half is a 16-bit quasi-fixed-point implementation (sketched after this list), but GPUs support true FP16 as well.

● The spatial volume is held fixed at 24³.
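One way to realize such a quasi-fixed-point format, sketched here under the assumption of a single float scale factor per site (QUDA's actual half-precision layout differs in detail):

#include <cstdint>
#include <cmath>

// Quasi-fixed-point 16-bit storage: each site's 24 reals are quantized to
// int16 against one per-site float scale. Illustrative only.
#define SITE_LEN 24   // 12 complex numbers = 24 reals per site

__device__ void store_site_half(const float* in, int16_t* out, float* scale)
{
    float max_abs = 0.0f;
    for (int i = 0; i < SITE_LEN; i++)
        max_abs = fmaxf(max_abs, fabsf(in[i]));
    *scale = (max_abs > 0.0f) ? max_abs / 32767.0f : 1.0f;  // avoid divide-by-zero
    for (int i = 0; i < SITE_LEN; i++)
        out[i] = (int16_t)rintf(in[i] / *scale);            // quantize to 16 bits
}

__device__ void load_site_half(const int16_t* in, float scale, float* out)
{
    for (int i = 0; i < SITE_LEN; i++)
        out[i] = scale * (float)in[i];                      // dequantize on the fly
}

Halving the storage halves the memory traffic, which on a bandwidth-bound kernel translates almost directly into speedup.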

Page 23

Matrix-vector performance

● Single and half performance are about 2.6x and 5.1x higher than double, respectively.

Page 24

Multi-GPU strategy and performance

Page 25

Challenges to scaling up

● GPU-to-host and inter-node bandwidth

● GPU-to-host and inter-node latency

(Diagram: node architecture; each GPU-host link sustains ~ 3+3 GB/s, and nodes are connected by a QDR InfiniBand fabric.)

Page 26

Multi-GPU strategy

Slide: Bálint Joó

Page 27

Multi-GPU strategy

● In this first pass, we divide up the temporal direction only.

● We must contend with the fact that the spinor field is stored in 6 separate arrays (necessary to ensure memory coalescing).

● With our choice of spin basis, we need only transfer half the spin components (e.g., upper in the backward direction).

● The 3 sub-arrays containing these components on the boundary time-slice are copied into a contiguous buffer on the host.

● The buffer is then transferred across the network to the remote host, where it is copied onto the remote GPU.

● We use CUDA streams and cudaMemcpyAsync() to overlap boundary transfers with interior computation (see the sketch after the diagram below).

(Diagram: sending device → sending host → network → receiving host → receiving device.)
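A condensed sketch of this pipeline using CUDA streams, cudaMemcpyAsync(), and MPI. The kernels and buffers are hypothetical placeholders; the real code additionally handles the six sub-arrays and spin projection described above, and exchanges in both time directions.

#include <mpi.h>
#include <cuda_runtime.h>

// Assumed (hypothetical) kernels and buffers, set up elsewhere:
__global__ void pack_boundary(float* d_send);          // gather projected boundary spinors
__global__ void dslash_interior();                     // stencil on interior sites
__global__ void dslash_boundary(const float* d_recv);  // stencil on boundary sites
extern float *d_send_buf, *d_recv_buf;                 // device buffers
extern float *h_send_buf, *h_recv_buf;                 // pinned host buffers
extern size_t buf_bytes;
extern int fwd_rank, bwd_rank;                         // neighbors in time
extern dim3 grid, block;

void dslash_one_direction()
{
    cudaStream_t interior, boundary;
    cudaStreamCreate(&interior);
    cudaStreamCreate(&boundary);

    // 1. Pack the boundary time-slice into a contiguous device buffer and
    //    start the device-to-host copy asynchronously.
    pack_boundary<<<grid, block, 0, boundary>>>(d_send_buf);
    cudaMemcpyAsync(h_send_buf, d_send_buf, buf_bytes,
                    cudaMemcpyDeviceToHost, boundary);

    // 2. Meanwhile, compute the interior sites, which need no remote data.
    dslash_interior<<<grid, block, 0, interior>>>();

    // 3. When the copy completes, exchange buffers with the neighbor ranks.
    cudaStreamSynchronize(boundary);
    MPI_Sendrecv(h_send_buf, (int)buf_bytes, MPI_BYTE, fwd_rank, 0,
                 h_recv_buf, (int)buf_bytes, MPI_BYTE, bwd_rank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // 4. Move the received data to the GPU and finish the boundary sites.
    cudaMemcpyAsync(d_recv_buf, h_recv_buf, buf_bytes,
                    cudaMemcpyHostToDevice, boundary);
    dslash_boundary<<<grid, block, 0, boundary>>>(d_recv_buf);

    cudaStreamSynchronize(interior);
    cudaStreamSynchronize(boundary);
    cudaStreamDestroy(interior);
    cudaStreamDestroy(boundary);
}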

Page 28

Overlapping communications

Page 29

Multi-GPU results

● All performance numbers are for the full solver (BiCGstab, anisotropic clover-improved Wilson with “symmetric” even/odd preconditioning).

● Tests were run on a 16-node cluster at Jefferson Lab, interconnected by QDR InfiniBand.

● Each node has 2 GeForce GTX 285 cards (previous generation; 240 cores/GPU).

● More recent versions of QUDA obtain higher performance.

Page 30

Weak scaling (32⁴ local)

● Local volume (per GPU) is held fixed: 32⁴

(Plot annotation: 32³ × 256)

Page 31

Strong scaling (32³ × 256)

● Total volume is held fixed: 32³ × 256

Page 32

Multi-GPU results on Fermi

● 1 node (dual-socket, dual-chipset)

● 4 NVIDIA GeForce GTX 480 cards

● Again, latest code would achieve higher performance

● Sustained performance in the inverter (BiCGstab, clover-improved Wilson, mixed single/half):

1023 Gflops

Page 33

Ongoing work

● Decomposing the lattice along only one dimension is sufficient for “analysis” jobs on most (but not all) lattice sizes of interest, allowing us to fit the problem in GPU memory and sustain ~ 1-4 Tflops.

● Ultimately, we're interested in the strong-scaling regime. A first pass at multi-dimensional parallelization is nearly complete.

● Inter-node latency and bandwidth are major constraints. CUDA 4.0 and GPUDirect v2.0 will help somewhat.

● Pushing beyond O(100) GPUs will demand more sophisticated algorithms (domain decomposition, etc.).

● Compute/communication imbalance is likely to be a recurring theme in the future (see, e.g., DARPA exascale report). In this sense, GPU clusters are a glimpse of things to come.