
  • Quarks, GPUs and Exotic Matter

    Bálint Joó, Jefferson Lab
    Ron Babich, NVIDIA (presenter)

    NVIDIA Theater SC’12 Salt Lake City, Utah

    Nov 2012

  • Acknowledgements

    • Science results: Hadron Spectrum Collaboration
    • Software:
      – the QUDA Community & NVIDIA
      – Frank Winter for his work on the JIT version of QDP++
    • Machines:
      – USQCD National Facility for access to clusters at JLab; JLab SciComp Team
      – LLNL for access to the Edge Cluster
      – NERSC for access to the Dirac Cluster
      – Oak Ridge Leadership Computing Facility, for access to TitanDev and for a Director's Discretionary Allocation
      – NSF NICS for access to the Keeneland Cluster
      – NCSA for access to Blue Waters
    • Funding: US DOE
      – Contract DE-AC05-06OR23177, under which Jefferson Science Associates, LLC, manages and operates Jefferson Laboratory
      – Grant No. DE-FC02-06ER41440 (USQCD SciDAC-II project)
    • Funding: NSF
      – Grants PHY-0835713 and OCI-0946441
    • Special thanks from Bálint to Ron for stepping in to present this talk.

  • Nuclear Physics and QCD

    • Ordinary matter is made up of atoms
      – atom = nucleus + “orbiting” electrons
      – nucleus = protons + neutrons (nucleons)
      – nucleon = quarks + gluons
    • Almost all of our mass comes from quarks & gluons
    • Quantum Chromodynamics (QCD) is the theory of quarks and gluons
      – quarks carry color charge (r, g, b)
      – gluons carry the color interactions, e.g. (-r, +b)
    • We can only see things with net zero color charge
      – we never see individual quarks or gluons, only combinations
      – color charges must cancel between quarks and gluons
      – QCD allows “exotics”: quark-gluon excitations, glueballs

  • QCD in Nuclear Physics

    Figure: Hägler, Musch, Negele, Schäfer, EPL 88, 61001

    • Can QCD predict the spectrum of hadrons?
      – what is the role of the gluons?
      – what about exotics?
      – GlueX experiment at the Jefferson Lab 12 GeV upgrade, Hall D
    • How do quarks and gluons make nucleons?
      – what are the distributions of quarks, gluons, spin, etc.?
      – GPD experiments, e.g. Jefferson Lab, Halls A & B
    • QCD must explain nuclear interactions
      – ab initio calculations for simple systems
      – bridges to higher-level effective theories
    • QCD phase structure, equation of state
      – experiments at RHIC
      – input to higher-level effective theories
      – astrophysics (physics of the Early Universe)

  • Lattice QCD

    • Lattice QCD is the only known model-independent, non-perturbative technique for carrying out QCD calculations
      – replace continuum space-time with a lattice
      – gluons live on links as SU(3) matrices
      – quarks live on sites as vectors/spinors
      – this turns QCD into a system similar to a crystal
    • Evaluate the path integral using the Markov Chain Monte Carlo method
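
    To make “evaluate the path integral” concrete, the standard form (standard notation, not spelled out on the slide) is

        \langle \mathcal{O} \rangle = \frac{1}{Z} \int \mathcal{D}U \, \mathcal{O}[U] \, \det M[U] \, e^{-S_g[U]},
        \qquad Z = \int \mathcal{D}U \, \det M[U] \, e^{-S_g[U]}

    Markov Chain Monte Carlo generates an ensemble of gauge configurations U_i with probability proportional to \det M[U] \, e^{-S_g[U]}, so the expectation value is estimated as the ensemble average \langle \mathcal{O} \rangle \approx \frac{1}{N} \sum_{i=1}^{N} \mathcal{O}[U_i].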

  • Large-Scale LQCD Simulations

    • Stage 1: Generate configurations
      – snapshots of the QCD vacuum
      – configurations generated in sequence
      – capability computing needed for large lattices and light quarks
    • Stage 2a: Compute quark propagators
      – task-parallelizable (per configuration)
      – capacity workload (but can also use capability hardware)
    • Stage 2b: Contract propagators into correlation functions
      – determines the physics you’ll see
      – complicated multi-index tensor contractions (see the example below)
    • Stage 3: Extract physics
      – on workstations, small cluster partitions
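
    As a minimal illustration of such a contraction (the textbook meson two-point function, not taken from the slide), with quark propagator S(x,0) and a Dirac matrix \Gamma selecting the quantum numbers:

        C(t) = \sum_{\vec{x}} \mathrm{Tr}\left[ \Gamma \, S(x,0) \, \bar{\Gamma} \, \gamma_5 \, S^\dagger(x,0) \, \gamma_5 \right]

    where the trace runs over both color and spin indices; real analyses involve many such terms over large operator bases.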

    Titan Image Courtesy of Oak Ridge Leadership Computing Facility (OLCF), Oak Ridge National Laboratory

  • The Lattice Dirac Equation

    • Describes how quarks interact with the gluons
    • Must be solved in gauge generation (Stage 1)
      – O(1M) times, in sequence
      – ~60%-80% of workload spent in solvers
    • Must be solved to generate quark propagators (Stage 2)
      – O(10M) times, but task-parallel
      – solver is >90% of workload
    • Operator has dimension ~100M, but is very sparse
      – efficient matrix-vector operations are crucial
      – need optimized solvers

    \begin{pmatrix} A_{ee} & -D_{eo} \\ -D_{oe} & A_{oo} \end{pmatrix} \phi = \chi
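
    The even-odd block structure above is commonly exploited via Schur-complement (“even-odd”) preconditioning; eliminating the odd sites (a standard step, not spelled out on the slide) gives

        \left( A_{ee} - D_{eo} A_{oo}^{-1} D_{oe} \right) \phi_e = \chi_e + D_{eo} A_{oo}^{-1} \chi_o

    a better-conditioned system on half the sites; \phi_o = A_{oo}^{-1} \left( \chi_o + D_{oe} \phi_e \right) then follows by back-substitution.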

  • Software: Chroma + QUDA

    • Chroma is a large lattice QCD framework
      – algorithms for gauge generation, quark propagators, etc.
      – abstractions for components (solvers)
      – open source: http://usqcd.jlab.org/usqcd-docs/chroma/
      – developed/maintained through US DOE SciDAC funding
      – integrates the QUDA library as a solver component (see the sketch below)
      – R. G. Edwards, B. Joó, Nucl. Phys. Proc. Suppl. 140 (2005) 832

    • QUDA is a highly optimized library for lattice QCD on GPUs
      – linear solvers, force terms, interfaces to code bases
      – open source: http://lattice.github.com/quda
      – developed/maintained by NVIDIA & the QUDA community
      – M. Clark, R. Babich, K. Barros, R. C. Brower, C. Rebbi, Comput. Phys. Commun. 181:1517-1528, 2010

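    As a sketch of what “integrates QUDA as a solver component” looks like at the code level: the functions below (initQuda, loadGaugeQuda, invertQuda, endQuda) are QUDA’s public C interface, but the parameter setup is heavily abridged; a real caller such as Chroma sets many more fields.

      // Abridged sketch of driving QUDA's solver from host code.
      #include <quda.h>

      void solve_on_gpu(void *links[4], void *source, void *solution)
      {
          initQuda(0);                                 // attach to GPU 0

          QudaGaugeParam gp = newQudaGaugeParam();     // default-initialized
          gp.cuda_prec   = QUDA_SINGLE_PRECISION;      // device precision
          gp.reconstruct = QUDA_RECONSTRUCT_12;        // 2-row compression
          /* ... lattice dims, boundary conditions, field order ... */
          loadGaugeQuda((void *)links, &gp);           // copy links to GPU

          QudaInvertParam ip = newQudaInvertParam();
          ip.inv_type         = QUDA_BICGSTAB_INVERTER;  // Krylov solver
          ip.cuda_prec_sloppy = QUDA_HALF_PRECISION;     // mixed precision
          ip.tol              = 1e-7;                    // target residual
          /* ... Dirac operator type, quark mass, etc. ... */
          invertQuda(solution, source, &ip);           // solve D x = b

          endQuda();
      }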

  • QUDA Performance Optimization

    • LQCD is typically memory-bound
      – Dslash: nearest-neighbour stencil in 4D
      – Wilson formulation: 0.92 FLOP/B (SP)
      – staggered formulation: ~0.66 FLOP/B (SP)
      – key optimizations focus on being memory-friendly
    • Lay out data for coalesced memory access
    • Use symmetries to compress SU(3) matrices (see the sketch after this list)
      – 2-row storage or 8-parameter storage
      – reconstruct the 3rd row with “free” FLOPs
      – trade bandwidth for compute
    • Use reduced precision where possible (e.g. 16-bit)
      – mixed-precision solver
      – iterative refinement + reliable updates (see the sketch after the layout figure below)
    • Fuse BLAS-like kernels to increase reuse
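
    A sketch of the 2-row compression named above, based on the SU(3) identity row2 = conj(row0 × row1); this is illustrative CUDA, not QUDA’s actual kernel code:

      // Rebuild the third row of an SU(3) link matrix from the first two,
      // trading memory bandwidth for "free" arithmetic.
      #include <cuComplex.h>
      #include <cstdio>

      __host__ __device__
      void reconstruct_third_row(const cuFloatComplex r0[3],
                                 const cuFloatComplex r1[3],
                                 cuFloatComplex r2[3])
      {
          for (int i = 0; i < 3; i++) {
              int j = (i + 1) % 3, k = (i + 2) % 3;     // cyclic indices
              cuFloatComplex cross = cuCsubf(cuCmulf(r0[j], r1[k]),
                                             cuCmulf(r0[k], r1[j]));
              r2[i] = cuConjf(cross);                   // conj of cross product
          }
      }

      int main() {  // quick host-side check on the identity matrix
          cuFloatComplex r0[3] = { {1,0},{0,0},{0,0} };
          cuFloatComplex r1[3] = { {0,0},{1,0},{0,0} };
          cuFloatComplex r2[3];
          reconstruct_third_row(r0, r1, r2);            // expect (0,0,1)
          printf("row2 = (%g,%g) (%g,%g) (%g,%g)\n",
                 r2[0].x, r2[0].y, r2[1].x, r2[1].y, r2[2].x, r2[2].y);
      }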

    [Figure: data layout diagram, fields stored as padded arrays: (V-1 sites) × 12 floats and (V-1 sites) × 4 floats, each followed by a pad to a block boundary]
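
    The mixed-precision bullets above amount to defect correction: keep the true residual in high precision, compute corrections with a cheap low-precision solve, and repeat. A self-contained toy (a 1-D Laplacian standing in for the Dirac operator and Jacobi sweeps for the inner Krylov solver; none of this is QUDA code):

      #include <algorithm>
      #include <cmath>
      #include <cstdio>
      #include <vector>

      const int N = 16;

      // y = A x for the stand-in operator: 1-D Laplacian, Dirichlet walls.
      template <typename T>
      void apply_A(const std::vector<T> &x, std::vector<T> &y) {
          for (int i = 0; i < N; i++) {
              T l = (i > 0)     ? x[i - 1] : T(0);
              T r = (i < N - 1) ? x[i + 1] : T(0);
              y[i] = T(2) * x[i] - l - r;
          }
      }

      // Cheap low-precision inner solve: a fixed number of Jacobi sweeps.
      void inner_solve(const std::vector<float> &r, std::vector<float> &e) {
          std::vector<float> Ae(N);
          std::fill(e.begin(), e.end(), 0.0f);
          for (int sweep = 0; sweep < 100; sweep++) {
              apply_A(e, Ae);
              for (int i = 0; i < N; i++) e[i] += (r[i] - Ae[i]) / 2.0f;
          }
      }

      int main() {
          std::vector<double> b(N, 1.0), x(N, 0.0), r(N), Ax(N);
          std::vector<float> rf(N), ef(N);
          const double bnorm = std::sqrt((double)N);
          for (int iter = 0; iter < 50; iter++) {
              apply_A(x, Ax);                      // true residual, double
              double rnorm = 0.0;
              for (int i = 0; i < N; i++) {
                  r[i] = b[i] - Ax[i];
                  rnorm += r[i] * r[i];
              }
              rnorm = std::sqrt(rnorm);
              printf("refinement %2d: |r|/|b| = %.3e\n", iter, rnorm / bnorm);
              if (rnorm / bnorm < 1e-12) break;    // converged in double
              for (int i = 0; i < N; i++) rf[i] = (float)r[i];
              inner_solve(rf, ef);                 // correction in float
              for (int i = 0; i < N; i++) x[i] += ef[i];
          }
          return 0;
      }

    The “reliable update” is the recomputation of r = b - A x in full precision on every pass, which stops low-precision rounding from accumulating in the residual.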

  • Using GPUs in Capacity Mode

    • USQCD National Facility (FNAL, JLab, BNL)
      – distributed computational facility for LQCD
      – JLab and Fermilab operate GPU clusters
      – the JLab GPU cluster is used for generating quark propagators

    [Figure: solver performance chart. Orange bars: data from the NERSC Dirac Cluster; other data from the JLab 9G & 10G clusters]

    JLab 9G GPU Cluster

    JLab: 127 quad nodes: Mix of Tesla C2050, M2050, and GTX 285/480/580 GPUs

    FNAL: 72 dual nodes: M2050 GPUs

  • Science From GPU Clusters

    J. J. Dudek, R. G. Edwards, “Hybrid Baryons in QCD”, Phys. Rev. D85, 054016

    [Figure: excited hadron spectrum; vertical scale 0-2000 MeV]

    • Hybrid excitations in mesons and baryons at a common scale of ~1200 MeV
    • Pattern suggests a chromo-magnetic excitation
      – common to mesons and baryons
      – an “effective degree of freedom”?
      – first-principles calculations can agree with or disfavor effective models

  • Point to take home here:

    • These analysis computations are extremely demanding
    • Needed (apart from the gauge configurations):
      – innovation in the method of the computation
        • the so-called “distillation” technique
        • variational method, with a large operator basis
      – an optimized formulation of the lattice theory
        • anisotropic lattices: cleaner determination of excited states
      – availability of cheap capacity FLOPs
        • GPUs are highly cost-effective
        • recall: O(10M) solves of the Dirac equation
        • lots of partitions of 4-16 GPUs (today)
        • => 32-64 GPUs tomorrow for larger lattices

  • Gauge Generation on GPUs

    • Gauge generation is not task-parallel
      – proceeds sequentially
      – O(1M) solves of the Dirac equation
      – needs the concentrated power of capability computing facilities
    • Need to scale to 100s-1000s of GPUs
    • Two main obstacles:
      – the host/accelerator model & Amdahl’s law
        • code not running on the GPU limits speedup
      – a hardware bottleneck
        • ratio of peak device-memory to PCIe2 bandwidth ~ 170/16 (GB/s, for Fermi)
        • PCIe3, GPU Direct, etc. should help here

    S_{\mathrm{app}} = \frac{1}{(1 - P) + P/S}
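
    Here P is the fraction of the runtime that is accelerated and S is the speedup of that fraction. With illustrative numbers (not from the slide): P = 0.9 and S = 10 give S_app = 1 / (0.1 + 0.9/10) ≈ 5.3; even an infinitely fast solver would cap the whole application at 1 / (1 - P) = 10×.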

  • [Figure: strong-scaling plot on Titan, XK6 nodes, CPU only: sustained Tflops (log scale, 0.0625 to 128) vs. Interlagos sockets (16 cores/socket, 16 to 8192); Single Precision Reliable-IBiCGS]