Quarks, GPUs and Exotic Matter

Bálint Joó, Jefferson Lab
Ron Babich, NVIDIA (presenter)

NVIDIA Theater SC’12 Salt Lake City, Utah

Nov 2012

Acknowledgements

• Science Results: Hadron Spectrum Collaboration
• Software:
  – The QUDA Community & NVIDIA
  – Frank Winter for his work on the JIT version of QDP++
• Machines:
  – USQCD National Facility for access to clusters at JLab, JLab SciComp Team
  – LLNL for access to Edge Cluster
  – NERSC for access to Dirac Cluster
  – Oak Ridge Leadership Computing Facility for access to TitanDev, and for a Director's Discretionary Allocation
  – NSF NICS for access to Keeneland Cluster
  – NCSA for access to Blue Waters
• Funding: US DOE
  – Contract DE-AC05-06OR23177, under which Jefferson Science Associates, LLC, manages and operates Jefferson Laboratory
  – Grant No. DE-FC02-06ER41440 (USQCD SciDAC II project)
• Funding: NSF
  – Grants PHY-0835713 and OCI-0946441
• Special thanks from Balint to Ron for stepping in to present this talk.

Nuclear Physics and QCD

• Ordinary matter is made up of atoms
  – atom = nucleus + “orbiting” electrons
  – nucleus = protons + neutrons (nucleons)
  – nucleon = quarks + gluons
• Almost all of our mass comes from quarks & gluons
• Quantum Chromodynamics (QCD) is the theory of quarks and gluons.
  – quarks carry color charge (r,g,b)
  – gluons carry the color interactions, e.g. (-r,+b)
• We can only see things with net 0 color charge
  – never see individual quarks, gluons, only combinations
  – color charges must cancel between quarks and gluons
  – QCD allows “exotics”: quark-gluon excitations, glueballs

QCD in Nuclear Physics

Hägler, Musch, Negele, Schäfer, EPL 88 61001

• Can QCD predict the spectrum of hadrons?
  – what is the role of the gluons?
  – what about exotics?
  – GlueX experiment at Jefferson Lab 12 GeV, Hall D
• How do quarks and gluons make nucleons?
  – what are the distributions of quarks, gluons, spin, etc.?
  – GPD experiments, e.g. Jefferson Lab, Halls A & B
• QCD must explain nuclear interactions
  – ab initio calculations for simple systems
  – bridges to higher level effective theories
• QCD phase structure, equation of state
  – experiments at RHIC
  – input to higher level effective theories
  – astrophysics (physics of the Early Universe)

Lattice QCD

• Lattice QCD is the only known model-independent, non-perturbative technique for carrying out QCD calculations.
  – Replace continuum space-time with a lattice
  – Gluons live on links as SU(3) matrices
  – Quarks live on sites as vectors/spinors
  – Change QCD to a system similar to a crystal
• Evaluate the path integral using the Markov Chain Monte Carlo method
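The Markov chain Monte Carlo idea can be illustrated on a toy problem. The sketch below is not lattice QCD: it assumes a free real scalar field on a small periodic 1D lattice, with an action, parameters and function names that are all hypothetical choices for illustration. It uses single-site Metropolis updates, estimates the average of phi^2 over the generated configurations, and compares it with the exact free-field value.

```python
import math
import random


def local_action(phi, i, value, mass2):
    """Part of the action that depends on site i when phi[i] = value."""
    n = len(phi)
    left, right = phi[(i - 1) % n], phi[(i + 1) % n]
    kinetic = 0.5 * ((right - value) ** 2 + (value - left) ** 2)
    return kinetic + 0.5 * mass2 * value ** 2


def metropolis_sweep(phi, mass2, step, rng):
    """One Metropolis update attempt per lattice site."""
    for i in range(len(phi)):
        proposal = phi[i] + rng.uniform(-step, step)
        delta_s = (local_action(phi, i, proposal, mass2)
                   - local_action(phi, i, phi[i], mass2))
        # Accept with probability min(1, exp(-delta_s)).
        if delta_s <= 0.0 or rng.random() < math.exp(-delta_s):
            phi[i] = proposal


def estimate_phi2(n_sites=8, mass2=1.0, n_therm=1000, n_meas=20000,
                  step=1.5, seed=1):
    """Monte Carlo estimate of <phi^2> from a chain of configurations."""
    rng = random.Random(seed)
    phi = [0.0] * n_sites
    for _ in range(n_therm):      # discard thermalization sweeps
        metropolis_sweep(phi, mass2, step, rng)
    total = 0.0
    for _ in range(n_meas):       # configurations are generated in sequence
        metropolis_sweep(phi, mass2, step, rng)
        total += sum(x * x for x in phi) / n_sites
    return total / n_meas


def exact_phi2(n_sites=8, mass2=1.0):
    """Exact free-field result: average over the lattice momenta."""
    return sum(
        1.0 / (mass2 + 4.0 * math.sin(math.pi * k / n_sites) ** 2)
        for k in range(n_sites)
    ) / n_sites


if __name__ == "__main__":
    print("Monte Carlo:", estimate_phi2())
    print("Exact:      ", exact_phi2())
```

Each new configuration is built from the previous one, which is why this stage cannot be split into independent tasks.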

Large Scale LQCD Simulations

• Stage 1: Generate Configurations
  – snapshots of QCD vacuum
  – configurations generated in sequence
  – capability computing needed for large lattices and light quarks
• Stage 2a: Compute quark propagators
  – task parallelizable (per configuration)
  – capacity workload (but can also use capability h/w)
• Stage 2b: Contract propagators into Correlation Functions
  – determines the physics you’ll see
  – complicated multi-index tensor contractions
• Stage 3: Extract Physics
  – on workstations, small cluster partitions

Titan Image Courtesy of Oak Ridge Leadership Computing Facility (OLCF), Oak Ridge National Laboratory

The Lattice Dirac Equation

• Describes how quarks interact with the gluons
• Must be solved in gauge generation (Stage 1)
  – O(1M) times, in sequence
  – ~60%-80% of workload spent in solvers
• Must be solved to generate quark propagators (Stage 2)
  – O(10M) times, but task parallel
  – solver is >90% of workload
• Operator has dimension ~100M, but very sparse
  – Efficient Matrix-Vector operations are crucial
  – Need optimized solvers

\[
\begin{pmatrix} A_{ee} & -D_{eo} \\ -D_{oe} & A_{oo} \end{pmatrix} \phi = \chi
\]
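The even-odd block form above can be solved by eliminating the even sites and running an iterative solver on the odd sites only. The sketch below assumes a toy operator, not the Dirac operator: a periodic 1D stencil in which A_ee and A_oo are a constant times the identity and D_eo, D_oe are nearest-neighbour hops. All function names and the mass parameter are hypothetical. It applies the Schur complement A_oo - D_oe A_ee^{-1} D_eo with plain conjugate gradient, then reconstructs the even sites.

```python
import math


def hop_eo(odd):
    """D_eo: gather the two odd neighbours of each even site (periodic)."""
    h = len(odd)
    return [odd[j] + odd[(j - 1) % h] for j in range(h)]


def hop_oe(even):
    """D_oe: gather the two even neighbours of each odd site (periodic)."""
    h = len(even)
    return [even[j] + even[(j + 1) % h] for j in range(h)]


def schur_apply(odd, diag):
    """Apply A_oo - D_oe A_ee^{-1} D_eo, with A_ee = A_oo = diag * identity."""
    hopped = hop_oe(hop_eo(odd))
    return [diag * x - y / diag for x, y in zip(odd, hopped)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(apply_op, rhs, tol=1e-12, max_iter=1000):
    """Plain CG for a symmetric positive definite operator."""
    x = [0.0] * len(rhs)
    r = list(rhs)
    p = list(r)
    rr = dot(r, r)
    target = tol * tol * dot(rhs, rhs)
    for _ in range(max_iter):
        if rr <= target:
            break
        ap = apply_op(p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def solve_even_odd(chi_e, chi_o, mass2=0.1):
    """Solve the 2x2 block system by eliminating the even sites."""
    diag = 2.0 + mass2
    rhs_o = [co + y / diag for co, y in zip(chi_o, hop_oe(chi_e))]
    phi_o = conjugate_gradient(lambda v: schur_apply(v, diag), rhs_o)
    phi_e = [(ce + y) / diag for ce, y in zip(chi_e, hop_eo(phi_o))]
    return phi_e, phi_o


def residual_norm(phi_e, phi_o, chi_e, chi_o, mass2=0.1):
    """Norm of chi - M phi for the full (unpreconditioned) operator."""
    diag = 2.0 + mass2
    r_e = [ce - (diag * pe - y)
           for ce, pe, y in zip(chi_e, phi_e, hop_eo(phi_o))]
    r_o = [co - (diag * po - y)
           for co, po, y in zip(chi_o, phi_o, hop_oe(phi_e))]
    return math.sqrt(dot(r_e, r_e) + dot(r_o, r_o))
```

The solver only ever touches the operator through stencil applications, so the matrix is never stored; the cost is dominated by the matrix-vector products.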

Software: Chroma + QUDA

• Chroma is a large lattice QCD framework
  – algorithms for gauge generation, quark propagators, etc.
  – abstractions for components (solvers)
  – open source: http://usqcd.jlab.org/usqcd-docs/chroma/
  – developed/maintained through US DOE SciDAC funding
  – integrates QUDA library as a solver component
  – R. G. Edwards, B. Joo, Nucl. Phys. Proc. Suppl. 140 (2005) 832 (arXiv:hep-lat/0409003)
• QUDA is a highly optimized library for lattice QCD on GPUs
  – Linear Solvers, Force Terms, interfaces to code-bases
  – open source: http://lattice.github.com/quda
  – developed/maintained by NVIDIA & QUDA Community
  – M. Clark, R. Babich, K. Barros, R. C. Brower, C. Rebbi, Comp. Phys. Commun. 181:1517-1528, 2010 (http://arxiv.org/abs/0911.3191)

QUDA Performance Optimization

• LQCD is typically memory bound
  – Dslash: nearest-neighbour stencil in 4D
    • Wilson formulation: 0.92 FLOP/B (SP)
    • Staggered formulation: ~0.66 FLOP/B (SP)
• Key optimizations focus on being memory friendly
• Lay out data for coalesced memory access
• Use symmetries to compress SU(3) matrices
  – 2 row storage or 8 parameter storage
  – reconstruct 3rd row with “free” FLOPs
  – trade bandwidth for compute
• Use reduced precision if possible (e.g. 16 bit)
  – mixed precision solver
  – iterative refinement + reliable updates
• Fuse BLAS-like kernels to increase reuse
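The 2 row storage item in the list above relies on a standard property of SU(3): the third row is the complex conjugate of the cross product of the first two. The sketch below checks this on a hypothetical example matrix and then estimates the effect on arithmetic intensity. The function names are illustrative only, and the FLOP/byte estimate assumes the commonly quoted 1320 FLOP per site for the Wilson Dslash (a figure not stated on the slide), 8 links and 8 neighbour spinors read, 1 spinor written, and it ignores the extra reconstruction FLOPs.

```python
import cmath
import math


def matmul(a, b):
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def example_su3():
    """A hypothetical SU(3) element: two rotations and a unit-determinant phase matrix."""
    c1, s1 = math.cos(0.3), math.sin(0.3)
    c2, s2 = math.cos(1.1), math.sin(1.1)
    rot12 = [[c1, s1, 0], [-s1, c1, 0], [0, 0, 1]]
    rot23 = [[1, 0, 0], [0, c2, s2], [0, -s2, c2]]
    phases = [[cmath.exp(0.7j), 0, 0],
              [0, cmath.exp(-0.2j), 0],
              [0, 0, cmath.exp(-0.5j)]]
    return matmul(matmul(rot12, phases), rot23)


def compress(u):
    """2 row storage: keep 12 real numbers instead of 18."""
    return [list(u[0]), list(u[1])]


def reconstruct(rows):
    """Rebuild the third row as the complex conjugate of (row0 x row1)."""
    a, b = rows
    c = [
        (a[1] * b[2] - a[2] * b[1]).conjugate(),
        (a[2] * b[0] - a[0] * b[2]).conjugate(),
        (a[0] * b[1] - a[1] * b[0]).conjugate(),
    ]
    return [list(a), list(b), c]


def wilson_dslash_intensity(reals_per_link=18, bytes_per_real=4):
    """FLOP per byte for one site of the Wilson Dslash (assumed counts)."""
    flops = 1320                                 # assumed per-site FLOP count
    reals = 8 * reals_per_link + 8 * 24 + 24     # links + neighbours + output
    return flops / (reals * bytes_per_real)


if __name__ == "__main__":
    u = example_su3()
    v = reconstruct(compress(u))
    print("max error in row 3:", max(abs(u[2][j] - v[2][j]) for j in range(3)))
    for n in (18, 12, 8):
        print(n, "reals per link:", wilson_dslash_intensity(n), "FLOP/B")
```

With 18 reals per link the estimate reproduces the single precision Wilson figure quoted above; fewer reals per link raise the FLOP/byte ratio, which is the bandwidth-for-compute trade.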

[Figure: memory layout diagram. Labels: "(V-1 sites) x 12 floats", "12 floats", "(V-1 sites) x 4 floats", "4 floats", "Pad", "1 block"]
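The mixed precision solver with iterative refinement named in the optimization list can be sketched as follows. This is a minimal illustration under stated assumptions, not QUDA's implementation: it shows plain iterative refinement (defect correction) only, not reliable updates; it emulates single precision by rounding through the `struct` module rather than using 16 bit storage; and the operator is a toy periodic 1D stencil. All function names are hypothetical.

```python
import math
import struct


def to_single(x):
    """Round a Python float (double) to IEEE single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]


def apply_op(v, mass2=0.1):
    """Toy SPD stencil standing in for the lattice operator (periodic 1D)."""
    n = len(v)
    return [(2.0 + mass2) * v[i] - v[(i - 1) % n] - v[(i + 1) % n]
            for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def low_precision_cg(rhs, rel_tol=1e-3, max_iter=200):
    """Inner CG whose vectors are rounded to single precision at every step."""
    x = [0.0] * len(rhs)
    r = [to_single(b) for b in rhs]
    p = list(r)
    rr = dot(r, r)                      # reductions are accumulated in double
    target = rel_tol * rel_tol * rr
    for _ in range(max_iter):
        if rr <= target:
            break
        ap = [to_single(a) for a in apply_op(p)]
        alpha = rr / dot(p, ap)
        x = [to_single(xi + alpha * pi) for xi, pi in zip(x, p)]
        r = [to_single(ri - alpha * api) for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [to_single(ri + (rr_new / rr) * pi) for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def mixed_precision_solve(b, tol=1e-12, max_outer=50):
    """Iterative refinement: double precision residual, low precision correction."""
    x = [0.0] * len(b)
    b_norm = math.sqrt(dot(b, b))
    history = []
    for _ in range(max_outer):
        # True residual, computed in double precision.
        r = [bi - axi for bi, axi in zip(b, apply_op(x))]
        r_norm = math.sqrt(dot(r, r))
        history.append(r_norm / b_norm)
        if r_norm <= tol * b_norm:
            break
        correction = low_precision_cg(r)     # cheap approximate solve
        x = [xi + ei for xi, ei in zip(x, correction)]
    return x, history


if __name__ == "__main__":
    _, history = mixed_precision_solve([1.0] + [0.0] * 15)
    print("relative residual per outer step:", history)
```

Most of the work happens in the low precision inner solve, which moves fewer bytes per site; the occasional double precision residual keeps the final answer accurate.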

Using GPUs in Capacity Mode

• USQCD National Facility (FNAL, JLab, BNL)
  – distributed computational facility for LQCD
  – JLab and Fermilab operate GPU clusters
  – JLab GPU cluster used for generating quark propagators

Orange bars: from NERSC Dirac Cluster. Other data from JLab 9G & 10G Clusters.

JLab 9G GPU Cluster

JLab: 127 quad nodes: Mix of Tesla C2050, M2050, and GTX 285/480/580 GPUs

FNAL: 72 dual nodes: M2050 GPUs

Science From GPU Clusters

J. J. Dudek, R. G. Edwards, “Hybrid Baryons in QCD”, Phys. Rev. D85, 054016

• Hybrid excitations in mesons and baryons at a common scale of ~1200 MeV
• Pattern suggests chromo-magnetic excitation
  – common in mesons, baryons
  – “Effective degree of freedom”?
  – first-principles calculation can agree with or disfavor effective models

Point to take home here:
• These analysis computations are extremely demanding.
• Need (apart from the gauge configurations):
  – innovation in the method of the computation
    • so-called “distillation” technique
    • variational method, with large operator basis
  – optimized formulation of the lattice theory
    • Anisotropic lattices: cleaner determination of excited states
  – availability of cheap capacity FLOPs
    • GPUs highly cost effective
    • recall: O(10M) solves of the Dirac Equation
    • lots of partitions of 4-16 GPUs (today)
    • => 32-64 GPUs tomorrow for larger lattices

Gauge Generation on GPUs

• Gauge Generation is not task parallel
  – proceeds sequentially
  – O(1M) solves of the Dirac Equation
  – needs the concentrated power of capability computing facilities
• Need to scale to 100s-1000s of GPUs
• Two main obstacles:
  – Host/Accelerator Model & Amdahl’s law
    • Code not running on GPU limits speedup.
  – Hardware Bottleneck
    • Ratio of peak device memory/PCIe2 bandwidths ~ 170/16 (for Fermi)
    • PCIe3, GPU Direct, etc. should help here.

Amdahl’s law, where P is the fraction of the runtime that is accelerated and S is the speedup of that fraction:

\[
S_{app} = \frac{1}{(1 - P) + \frac{P}{S}}
\]
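A short numerical check makes the Amdahl limit concrete. The sketch below evaluates the formula for a few values of P; the function name and the sample values of P and S are illustrative assumptions, chosen to bracket the 60%-80% solver fraction quoted earlier for gauge generation.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is sped up by a factor s."""
    if not 0.0 <= p <= 1.0 or s <= 0.0:
        raise ValueError("need 0 <= p <= 1 and s > 0")
    return 1.0 / ((1.0 - p) + p / s)


if __name__ == "__main__":
    for p in (0.6, 0.8, 0.9, 0.99):
        print(f"P = {p:4.2f}: S = 10 -> {amdahl_speedup(p, 10.0):5.2f}x, "
              f"limit for large S -> {1.0 / (1.0 - p):6.1f}x")
```

If only the solver is accelerated and it accounts for 80% of the runtime, the overall speedup stays below 5x no matter how fast the GPU is, which is why the remaining code also has to move to the accelerator.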

[Plot: x-axis: Interlagos Sockets (16 core/socket), 16 to 8192; y-axis: Tflops Sustained, 0.0625 to 128]

Titan, XK6 nodes, CPU only: Single Precision Reliable-IBiCGS