
  • Quarks, GPUs and Exotic Matter

    Bálint Joó, Jefferson Lab
    Ron Babich, NVIDIA (presenter)

    NVIDIA Theater SC’12 Salt Lake City, Utah

    Nov 2012

  • Acknowledgements

    • Science results: Hadron Spectrum Collaboration
    • Software:
      – the QUDA Community & NVIDIA
      – Frank Winter for his work on the JIT version of QDP++
    • Machines:
      – USQCD National Facility for access to clusters at JLab; JLab SciComp Team
      – LLNL for access to the Edge Cluster
      – NERSC for access to the Dirac Cluster
      – Oak Ridge Leadership Computing Facility, for access to TitanDev and for a Director's Discretionary Allocation
      – NSF NICS for access to the Keeneland Cluster
      – NCSA for access to Blue Waters
    • Funding: US DOE
      – Contract DE-AC05-06OR23177, under which Jefferson Science Associates, LLC, manages and operates Jefferson Laboratory
      – Grant No. DE-FC02-06ER41440 (USQCD SciDAC-II project)
    • Funding: NSF
      – Grants PHY-0835713 and OCI-0946441
    • Special thanks from Bálint to Ron for stepping in to present this talk.

  • Nuclear Physics and QCD

    • Ordinary matter is made up of atoms
      – atom = nucleus + “orbiting” electrons
      – nucleus = protons + neutrons (nucleons)
      – nucleon = quarks + gluons
    • Almost all of our mass comes from quarks & gluons
    • Quantum Chromodynamics (QCD) is the theory of quarks and gluons
      – quarks carry color charge (r, g, b)
      – gluons carry the color interactions, e.g. (-r, +b)
    • We can only see things with net zero color charge
      – we never see individual quarks or gluons, only combinations
      – color charges must cancel between quarks and gluons
      – QCD allows “exotics”: quark-gluon excitations, glueballs

  • QCD in Nuclear Physics

    Figure: Hägler, Musch, Negele, Schäfer, EPL 88, 61001

    • Can QCD predict the spectrum of hadrons?
      – what is the role of the gluons?
      – what about exotics?
      – GlueX experiment at the Jefferson Lab 12 GeV upgrade, Hall D
    • How do quarks and gluons make nucleons?
      – what are the distributions of quarks, gluons, spin, etc.?
      – GPD experiments, e.g. Jefferson Lab, Halls A & B
    • QCD must explain nuclear interactions
      – ab initio calculations for simple systems
      – bridges to higher-level effective theories
    • QCD phase structure, equation of state
      – experiments at RHIC
      – input to higher-level effective theories
      – astrophysics (physics of the Early Universe)

  • Lattice QCD

    • Lattice QCD is the only known model-independent, non-perturbative technique for carrying out QCD calculations
      – replace continuum space-time with a lattice
      – gluons live on links as SU(3) matrices
      – quarks live on sites as vectors/spinors
      – this turns QCD into a system similar to a crystal
    • Evaluate the path integral using the Markov Chain Monte Carlo method
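
    To make “evaluate the path integral” concrete, the standard form (standard notation, not spelled out on the slide) is

        \langle \mathcal{O} \rangle = \frac{1}{Z} \int \mathcal{D}U \, \mathcal{O}[U] \, \det M[U] \, e^{-S_g[U]},
        \qquad Z = \int \mathcal{D}U \, \det M[U] \, e^{-S_g[U]}

    Markov Chain Monte Carlo generates an ensemble of gauge configurations U_i with probability proportional to \det M[U] \, e^{-S_g[U]}, so the expectation value is estimated as the ensemble average \langle \mathcal{O} \rangle \approx \frac{1}{N} \sum_{i=1}^{N} \mathcal{O}[U_i].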

  • Large-Scale LQCD Simulations

    • Stage 1: Generate configurations
      – snapshots of the QCD vacuum
      – configurations generated in sequence
      – capability computing needed for large lattices and light quarks
    • Stage 2a: Compute quark propagators
      – task-parallelizable (per configuration)
      – capacity workload (but can also use capability hardware)
    • Stage 2b: Contract propagators into correlation functions
      – determines the physics you’ll see
      – complicated multi-index tensor contractions (see the example below)
    • Stage 3: Extract physics
      – on workstations, small cluster partitions
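
    As a minimal illustration of such a contraction (the textbook meson two-point function, not taken from the slide), with quark propagator S(x,0) and a Dirac matrix \Gamma selecting the quantum numbers:

        C(t) = \sum_{\vec{x}} \mathrm{Tr}\left[ \Gamma \, S(x,0) \, \bar{\Gamma} \, \gamma_5 \, S^\dagger(x,0) \, \gamma_5 \right]

    where the trace runs over both color and spin indices; real analyses involve many such terms over large operator bases.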

    Titan Image Courtesy of Oak Ridge Leadership Computing Facility (OLCF), Oak Ridge National Laboratory

  • The Lattice Dirac Equation

    • Describes how quarks interact with the gluons
    • Must be solved in gauge generation (Stage 1)
      – O(1M) times, in sequence
      – ~60%-80% of workload spent in solvers
    • Must be solved to generate quark propagators (Stage 2)
      – O(10M) times, but task-parallel
      – solver is >90% of workload
    • Operator has dimension ~100M, but is very sparse
      – efficient matrix-vector operations are crucial
      – need optimized solvers

    \begin{pmatrix} A_{ee} & -D_{eo} \\ -D_{oe} & A_{oo} \end{pmatrix} \phi = \chi
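
    The even-odd block structure above is commonly exploited via Schur-complement (“even-odd”) preconditioning; eliminating the odd sites (a standard step, not spelled out on the slide) gives

        \left( A_{ee} - D_{eo} A_{oo}^{-1} D_{oe} \right) \phi_e = \chi_e + D_{eo} A_{oo}^{-1} \chi_o

    a better-conditioned system on half the sites; \phi_o = A_{oo}^{-1} \left( \chi_o + D_{oe} \phi_e \right) then follows by back-substitution.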

  • Software: Chroma + QUDA

    • Chroma is a large lattice QCD framework
      – algorithms for gauge generation, quark propagators, etc.
      – abstractions for components (solvers)
      – open source: http://usqcd.jlab.org/usqcd-docs/chroma/
      – developed/maintained through US DOE SciDAC funding
      – integrates the QUDA library as a solver component (see the sketch below)
      – R. G. Edwards, B. Joó, Nucl. Phys. Proc. Suppl. 140 (2005) 832

    • QUDA is a highly optimized library for lattice QCD on GPUs
      – linear solvers, force terms, interfaces to code bases
      – open source: http://lattice.github.com/quda
      – developed/maintained by NVIDIA & the QUDA community
      – M. Clark, R. Babich, K. Barros, R. C. Brower, C. Rebbi, Comput. Phys. Commun. 181:1517-1528, 2010

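    As a sketch of what “integrates QUDA as a solver component” looks like at the code level: the functions below (initQuda, loadGaugeQuda, invertQuda, endQuda) are QUDA’s public C interface, but the parameter setup is heavily abridged; a real caller such as Chroma sets many more fields.

      // Abridged sketch of driving QUDA's solver from host code.
      #include <quda.h>

      void solve_on_gpu(void *links[4], void *source, void *solution)
      {
          initQuda(0);                                 // attach to GPU 0

          QudaGaugeParam gp = newQudaGaugeParam();     // default-initialized
          gp.cuda_prec   = QUDA_SINGLE_PRECISION;      // device precision
          gp.reconstruct = QUDA_RECONSTRUCT_12;        // 2-row compression
          /* ... lattice dims, boundary conditions, field order ... */
          loadGaugeQuda((void *)links, &gp);           // copy links to GPU

          QudaInvertParam ip = newQudaInvertParam();
          ip.inv_type         = QUDA_BICGSTAB_INVERTER;  // Krylov solver
          ip.cuda_prec_sloppy = QUDA_HALF_PRECISION;     // mixed precision
          ip.tol              = 1e-7;                    // target residual
          /* ... Dirac operator type, quark mass, etc. ... */
          invertQuda(solution, source, &ip);           // solve D x = b

          endQuda();
      }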

  • QUDA Performance Optimization

    • LQCD is typically memory-bound
      – Dslash: nearest-neighbour stencil in 4D
      – Wilson formulation: 0.92 FLOP/B (SP)
      – staggered formulation: ~0.66 FLOP/B (SP)
      – key optimizations focus on being memory-friendly
    • Lay out data for coalesced memory access
    • Use symmetries to compress SU(3) matrices (see the sketch after this list)
      – 2-row storage or 8-parameter storage
      – reconstruct the 3rd row with “free” FLOPs
      – trade bandwidth for compute
    • Use reduced precision where possible (e.g. 16-bit)
      – mixed-precision solver
      – iterative refinement + reliable updates (see the sketch after the layout figure below)
    • Fuse BLAS-like kernels to increase reuse
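
    A sketch of the 2-row compression named above, based on the SU(3) identity row2 = conj(row0 × row1); this is illustrative CUDA, not QUDA’s actual kernel code:

      // Rebuild the third row of an SU(3) link matrix from the first two,
      // trading memory bandwidth for "free" arithmetic.
      #include <cuComplex.h>
      #include <cstdio>

      __host__ __device__
      void reconstruct_third_row(const cuFloatComplex r0[3],
                                 const cuFloatComplex r1[3],
                                 cuFloatComplex r2[3])
      {
          for (int i = 0; i < 3; i++) {
              int j = (i + 1) % 3, k = (i + 2) % 3;     // cyclic indices
              cuFloatComplex cross = cuCsubf(cuCmulf(r0[j], r1[k]),
                                             cuCmulf(r0[k], r1[j]));
              r2[i] = cuConjf(cross);                   // conj of cross product
          }
      }

      int main() {  // quick host-side check on the identity matrix
          cuFloatComplex r0[3] = { {1,0},{0,0},{0,0} };
          cuFloatComplex r1[3] = { {0,0},{1,0},{0,0} };
          cuFloatComplex r2[3];
          reconstruct_third_row(r0, r1, r2);            // expect (0,0,1)
          printf("row2 = (%g,%g) (%g,%g) (%g,%g)\n",
                 r2[0].x, r2[0].y, r2[1].x, r2[1].y, r2[2].x, r2[2].y);
      }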

    [Figure: data layout diagram, fields stored as padded arrays: (V-1 sites) × 12 floats and (V-1 sites) × 4 floats, each followed by a pad to a block boundary]
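
    The mixed-precision bullets above amount to defect correction: keep the true residual in high precision, compute corrections with a cheap low-precision solve, and repeat. A self-contained toy (a 1-D Laplacian standing in for the Dirac operator and Jacobi sweeps for the inner Krylov solver; none of this is QUDA code):

      #include <algorithm>
      #include <cmath>
      #include <cstdio>
      #include <vector>

      const int N = 16;

      // y = A x for the stand-in operator: 1-D Laplacian, Dirichlet walls.
      template <typename T>
      void apply_A(const std::vector<T> &x, std::vector<T> &y) {
          for (int i = 0; i < N; i++) {
              T l = (i > 0)     ? x[i - 1] : T(0);
              T r = (i < N - 1) ? x[i + 1] : T(0);
              y[i] = T(2) * x[i] - l - r;
          }
      }

      // Cheap low-precision inner solve: a fixed number of Jacobi sweeps.
      void inner_solve(const std::vector<float> &r, std::vector<float> &e) {
          std::vector<float> Ae(N);
          std::fill(e.begin(), e.end(), 0.0f);
          for (int sweep = 0; sweep < 100; sweep++) {
              apply_A(e, Ae);
              for (int i = 0; i < N; i++) e[i] += (r[i] - Ae[i]) / 2.0f;
          }
      }

      int main() {
          std::vector<double> b(N, 1.0), x(N, 0.0), r(N), Ax(N);
          std::vector<float> rf(N), ef(N);
          const double bnorm = std::sqrt((double)N);
          for (int iter = 0; iter < 50; iter++) {
              apply_A(x, Ax);                      // true residual, double
              double rnorm = 0.0;
              for (int i = 0; i < N; i++) {
                  r[i] = b[i] - Ax[i];
                  rnorm += r[i] * r[i];
              }
              rnorm = std::sqrt(rnorm);
              printf("refinement %2d: |r|/|b| = %.3e\n", iter, rnorm / bnorm);
              if (rnorm / bnorm < 1e-12) break;    // converged in double
              for (int i = 0; i < N; i++) rf[i] = (float)r[i];
              inner_solve(rf, ef);                 // correction in float
              for (int i = 0; i < N; i++) x[i] += ef[i];
          }
          return 0;
      }

    The “reliable update” is the recomputation of r = b - A x in full precision on every pass, which stops low-precision rounding from accumulating in the residual.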

  • Using GPUs in Capacity Mode

    • USQCD National Facility (FNAL, JLab, BNL)
      – distributed computational facility for LQCD
      – JLab and Fermilab operate GPU clusters
      – the JLab GPU cluster is used for generating quark propagators

    [Figure: solver performance chart. Orange bars: data from the NERSC Dirac Cluster; other data from the JLab 9G & 10G clusters]

    JLab 9G GPU Cluster

    JLab: 127 quad nodes: Mix of Tesla C2050, M2050, and GTX 285/480/580 GPUs

    FNAL: 72 dual nodes: M2050 GPUs

  • Science From GPU Clusters

    J. J. Dudek, R. G. Edwards, “Hybrid Baryons in QCD”, Phys. Rev. D85, 054016

    [Figure: excited hadron spectrum; vertical scale 0-2000 MeV]

    • Hybrid excitations in mesons and baryons at a common scale of ~1200 MeV
    • Pattern suggests a chromo-magnetic excitation
      – common to mesons and baryons
      – an “effective degree of freedom”?
      – first-principles calculations can agree with or disfavor effective models

  • Point to take home here:

    • These analysis computations are extremely demanding
    • Needed (apart from the gauge configurations):
      – innovation in the method of the computation
        • the so-called “distillation” technique
        • variational method, with a large operator basis
      – an optimized formulation of the lattice theory
        • anisotropic lattices: cleaner determination of excited states
      – availability of cheap capacity FLOPs
        • GPUs are highly cost-effective
        • recall: O(10M) solves of the Dirac equation
        • lots of partitions of 4-16 GPUs (today)
        • => 32-64 GPUs tomorrow for larger lattices

  • Gauge Generation on GPUs

    • Gauge generation is not task-parallel
      – proceeds sequentially
      – O(1M) solves of the Dirac equation
      – needs the concentrated power of capability computing facilities
    • Need to scale to 100s-1000s of GPUs
    • Two main obstacles:
      – the host/accelerator model & Amdahl’s law
        • code not running on the GPU limits speedup
      – a hardware bottleneck
        • ratio of peak device-memory to PCIe2 bandwidth ~ 170/16 (GB/s, for Fermi)
        • PCIe3, GPU Direct, etc. should help here

    S_{\mathrm{app}} = \frac{1}{(1 - P) + P/S}
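
    Here P is the fraction of the runtime that is accelerated and S is the speedup of that fraction. With illustrative numbers (not from the slide): P = 0.9 and S = 10 give S_app = 1 / (0.1 + 0.9/10) ≈ 5.3; even an infinitely fast solver would cap the whole application at 1 / (1 - P) = 10×.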

  • [Figure: strong-scaling plot on Titan, XK6 nodes, CPU only: sustained Tflops (log scale, 0.0625 to 128) vs. Interlagos sockets (16 cores/socket, 16 to 8192); Single Precision Reliable-IBiCGS]