Understanding the building blocks of matter by solving Quantum
Chromodynamics
SciDAC-3 PI Meeting, Rockville, Maryland, July 26th, 2013
David Richards, Jefferson Lab
• Who we are
• Why we're here
• Exploiting new architectures for LQCD
• Algorithms for NP calculations
• A sampling of science
Teams
Computing Properties of Hadrons, Nuclei and Nuclear Matter from QCD
SciDAC-3 Scientific Computation Application Partnership project
Project Director: Frithjof Karsch, Brookhaven National Laboratory
Project Co-director for Science: David Richards, JLab
Project Co-director for Computation: Richard Brower, Boston University
• Frithjof Karsch - Brookhaven National Laboratory
• Richard Brower - Boston University
• Robert Edwards - Thomas Jefferson National Accelerator Facility
• Martin Savage - University of Washington
• John Negele - Massachusetts Institute of Technology
• Rob Fowler - University of North Carolina (SUPER)
• Andreas Stathopoulos - College of William and Mary
Lattice QCD for NP: Science
Lattice QCD for NP: Computation
Leadership-class Clusters + GPUs + MIC
Software + Algorithms
+ FASTMath
Multigrid with HYPRE for Lattice QCD, Andrew Pochinsky
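As a sketch of the multigrid idea behind this effort (a toy two-grid cycle for the 1D Poisson operator, not the HYPRE interface and not the lattice Dirac operator; all sizes illustrative):

```python
import numpy as np

def poisson(n):
    # 1D Poisson matrix: simple stand-in for the real lattice operator
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def jacobi(A, x, b, sweeps=3, w=2.0 / 3.0):
    # weighted-Jacobi smoother: damps high-frequency error components
    d = np.diag(A)
    for _ in range(sweeps):
        x = x + w * (b - A @ x) / d
    return x

def two_grid(A, x, b):
    n = len(b)
    nc = (n - 1) // 2
    # linear interpolation P; full-weighting restriction R = 0.5 P^T
    P = np.zeros((n, nc))
    for j in range(nc):
        P[2 * j, j] = 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] = 0.5
    R = 0.5 * P.T
    x = jacobi(A, x, b)                      # pre-smooth
    r = b - A @ x
    Ac = R @ A @ P                           # Galerkin coarse operator
    x = x + P @ np.linalg.solve(Ac, R @ r)   # coarse-grid correction
    return jacobi(A, x, b)                   # post-smooth

n = 63
A = poisson(n)
b = np.ones(n)
x = np.zeros(n)
for _ in range(10):
    x = two_grid(A, x, b)
assert np.linalg.norm(A @ x - b) < 1e-8 * np.linalg.norm(b)
```

The coarse-grid correction removes exactly the smooth error modes on which Jacobi stalls, which is why the cycle converges in a handful of iterations independent of n.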
QDP++ and Chroma on GPUs
• QDP-JIT/C is production-ready.
• QDP-JIT/PTX is fully featured; requires some small integration with QUDA.
• Have performed full 2+1-flavor gauge generation on Blue Waters with both variants.
• JIT/PTX has used Chroma's internal solvers; JIT/C has used QUDA solvers.
• Paper submitted to SC'13.
• Prognosis: excellent; tasks will be complete by end of year.
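The just-in-time idea can be caricatured in a few lines: build kernel source for a whole expression at run time, compile it once, and run the fused kernel. This pure-Python stand-in only illustrates the concept; QDP-JIT itself emits C or PTX for the GPU:

```python
import numpy as np

def jit_kernel(expr, args):
    """Compile `expr` over named arrays into one fused element-wise kernel."""
    src = (
        "def kernel({params}, out):\n"
        "    for i in range(out.size):\n"
        "        out[i] = {expr}\n"
    ).format(params=", ".join(args), expr=expr)
    namespace = {}
    exec(compile(src, "<jit>", "exec"), namespace)
    return namespace["kernel"]

# The whole expression becomes a single kernel, instead of one temporary
# array per operation as naive operator overloading would produce.
a = np.arange(4.0)
b = np.ones(4)
out = np.empty(4)
kernel = jit_kernel("2.0 * a[i] + b[i]", ["a", "b"])
kernel(a, b, out)
assert np.allclose(out, [1.0, 3.0, 5.0, 7.0])
```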
Porting Lattice QCD Calculations to Novel Architectures: Balint Joo, Frank Winter
Aim: put all of the application code on the GPU. QDP-JIT: just-in-time compilation.
[Figure: Strong scaling of QUDA+Chroma+QDP-JIT(PTX) on Titan: sustained TFLOPS (0-450) vs. Titan nodes (GPUs, up to 4608), for BiCGStab and DD+GCR solvers on 72³x256 and 96³x256 lattices.]
B. Joo, F. Winter (JLab), M. Clark (NVIDIA)
GPUs and Heterogeneous Architectures
• 2010: QUDA parallelized (SC '10 paper)
• 2011: 256 GPUs on Edge cluster (SC '11 paper)
• 2012: 768 GPUs on TitanDev (ACSS '12 contribution, invited APS contribution)
• 2013: On Blue Waters (NVIDIA GTC contribution)
Solver performance
DD: domain-decomposed solver - architecture-aware
[Figure: Time for trajectory 226 (seconds, 0-16000) vs. XK7 nodes (up to 1792), comparing CPU, QDP-JIT (PTX), and QDP-JIT (PTX) + QUDA. Anisotropic clover HMC, V=48x48x48x512, physical pion mass attempt, NCSA Blue Waters.]
Gauge Generation: HMC - I
Aim: port the whole of the HMC code to the GPU.
Strong scaling for anisotropic clover HMC on GPUs, using the QUDA solver: key for spectroscopy on 48³x512 lattices.
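A minimal sketch of the HMC algorithm itself, for a single Gaussian degree of freedom rather than gauge links (the production code's expensive step, the fermion force requiring a QUDA solve at every molecular-dynamics step, has no analogue in this toy):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy HMC with action S(q) = q^2 / 2, so the force is simply dS/dq = q.
def hmc_step(q, n_md=10, dt=0.2):
    p = rng.normal()                    # refresh the conjugate momentum
    q0 = q
    h0 = 0.5 * p * p + 0.5 * q * q      # Hamiltonian before the trajectory
    p -= 0.5 * dt * q                   # leapfrog: initial half step
    for _ in range(n_md - 1):
        q += dt * p
        p -= dt * q
    q += dt * p
    p -= 0.5 * dt * q                   # final half step
    h1 = 0.5 * p * p + 0.5 * q * q
    # Metropolis accept/reject corrects the integration error exactly
    return q if rng.random() < np.exp(h0 - h1) else q0

q = 0.0
samples = []
for _ in range(5000):
    q = hmc_step(q)
    samples.append(q)

# The chain samples exp(-q^2/2), which has unit variance.
assert abs(np.var(samples) - 1.0) < 0.1
```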
Gauge Generation: HMC - II
Current production running
Preconditioned Wilson CG in single precision, GFLOPS (uncompressed / compressed gauge field):

  Volume          Xeon E5-2680 (SNB-EP)   Xeon Phi (KNC) 5110P   Xeon Phi (KNC) B1PRQ-7110P
  24x24x24x128         94 / 95                 172 / 195               188 / 215
  32x32x32x128         96 / 96                 187 / 211               204 / 235
  40x40x40x96          93 / 93                 164 / 197               186 / 219
  48x48x24x64          95 / 95                 172 / 205               192 / 229
  32x40x24x96          97 / 97                 185 / 215               202 / 237

From B. Joo, D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. Lee, P. Dubey, W. Watson III, ISC 2013, LNCS 7905, pp. 40-54, 2013.
KNC 5110P: 60 cores @ 1.053 GHz, MPSS 2.1.4346-16 (Gold).
KNC B1PRQ-7110P: 61 cores @ 1.1 GHz, B1 stepping, MPSS 2.1.3552-1, 7936 MB memory.
Intel Xeon-Phi
Collaboration with Intel Parallel Labs
• Performance competitive with NVIDIA
• Exploit, e.g., Stampede
Typical LQCD Workflow
• Generate the configurations: leadership level. Few big jobs, few big files.
• Analyze: typically mid-range level. Many small jobs, many big files, I/O movement.
• Extract: extract information from measured observables.
IMPORTANT FOR NP
Correlation functions: Distillation
• Use the new "distillation" method (M. Peardon et al., PRD 80, 054506 (2009)).
• Truncate the sum at sufficient i to capture the relevant physics modes; we use 64, and set the "weights" f to be unity.
• Decompose the meson correlation function using the "distillation" operator, built from eigenvectors of the Laplacian, into "perambulators".
This is a capacity-computing task, a solver with many right-hand sides: well suited to GPUs.
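A toy illustration of the distillation construction: here the Laplacian is free (no gauge field) on a tiny lattice, and only a handful of eigenvectors are kept rather than the 64 used in production; the smearing operator is the rank-N projector built from the lowest modes:

```python
import numpy as np

# Toy version of distillation (Peardon et al., PRD 80, 054506): the smearing
# operator is the projector onto the lowest eigenvectors of the 3D Laplacian.
L, N = 4, 8                 # L^3 spatial sites; N retained eigenvectors
V = L ** 3

def idx(x, y, z):
    # periodic lattice site -> linear index
    return (x % L) * L * L + (y % L) * L + (z % L)

# Free (gauge-field-free) 3D lattice Laplacian as a V x V matrix
lap = np.zeros((V, V))
for x in range(L):
    for y in range(L):
        for z in range(L):
            i = idx(x, y, z)
            lap[i, i] = 6.0
            for dx, dy, dz in ((1,0,0), (-1,0,0), (0,1,0),
                               (0,-1,0), (0,0,1), (0,0,-1)):
                lap[i, idx(x + dx, y + dy, z + dz)] -= 1.0

# Lowest N modes define the distillation operator, sum_i f_i v_i v_i^dagger,
# with unit weights f_i as described above.
evals, evecs = np.linalg.eigh(lap)
v = evecs[:, :N]
box = v @ v.T

assert np.allclose(box @ box, box)      # a projector...
assert np.isclose(np.trace(box), N)     # ...of rank N
```

Because the operator is rank-N, all correlation functions reduce to contractions of small N x N "perambulator" matrices, at the price of many Dirac solves (one per eigenvector, spin, and time source).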
All-Mode Averaging - I
Chulwoo Jung, BNL
T. Blum, T. Izubuchi, E. Shintani, arXiv:1208.4349
All-Mode Averaging - II
Electromagnetic form factor of the proton
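The AMA estimator of Blum, Izubuchi, and Shintani averages many cheap approximate measurements and corrects the result with a single exact one; a toy numerical model of the idea (all numbers synthetic, not physics data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic model: an "exact" observable per source location, and a cheap
# "sloppy" approximation (e.g. a relaxed solver) that is highly correlated
# with it but carries a systematic offset of 0.1.
n_src = 200
exact = rng.normal(1.0, 0.5, n_src)
sloppy = exact + 0.1 + rng.normal(0.0, 0.02, n_src)

# AMA estimator: average the cheap measurements over all sources, then add a
# single exact-minus-sloppy correction, which removes the bias on average.
ama = sloppy.mean() + (exact[0] - sloppy[0])

assert abs((sloppy.mean() - exact.mean()) - 0.1) < 0.05   # the sloppy bias
assert abs(ama - exact.mean()) < 0.1                      # bias cancelled
```

The gain is that statistical precision comes from the many sloppy solves while the single exact solve keeps the estimator unbiased.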
Andreas Stathopoulos - AM/CS at William and Mary - development of methods for multiple right-hand sides
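One standard way to attack such many-right-hand-side workloads is with block Krylov solvers; a minimal block-CG sketch on a stand-in SPD matrix (sizes and the dense matrix are illustrative only, not the specific methods under development):

```python
import numpy as np

rng = np.random.default_rng(1)

# Small symmetric positive-definite system standing in for the (vastly larger,
# sparse) lattice operator.
n, n_rhs = 64, 4
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)          # SPD by construction
B = rng.normal(size=(n, n_rhs))      # several right-hand sides at once

# Block conjugate gradient (O'Leary, 1980): each matrix application serves the
# whole block, and the Krylov space grows by n_rhs directions per iteration.
X = np.zeros_like(B)
R = B - A @ X
P = R.copy()
for _ in range(n):
    AP = A @ P
    alpha = np.linalg.solve(P.T @ AP, R.T @ R)
    X = X + P @ alpha
    R_new = R - AP @ alpha
    if np.linalg.norm(R_new) < 1e-8 * np.linalg.norm(B):
        R = R_new
        break
    beta = np.linalg.solve(R.T @ R, R_new.T @ R_new)
    P = R_new + P @ beta
    R = R_new

assert np.linalg.norm(A @ X - B) < 1e-6 * np.linalg.norm(B)
```

Sharing each matrix application across the block also improves arithmetic intensity, which is exactly what memory-bandwidth-bound lattice solvers need.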
Wick Contraction Methods
K. Orginos, W. Detmold
Nuclear Physics calculations involve many contractions...
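The cost driver is easy to see by counting: a naive evaluation involves N_u! x N_d! x N_s! Wick contractions for an interpolating operator with N_u up, N_d down, and N_s strange quarks. A quick illustration:

```python
from math import factorial

# Naive Wick-contraction count for a system with the given quark content;
# the factorials count permutations of identical quark fields between
# source and sink.
def naive_contractions(n_u, n_d, n_s=0):
    return factorial(n_u) * factorial(n_d) * factorial(n_s)

assert naive_contractions(2, 1) == 2         # proton (uud)
assert naive_contractions(3, 3) == 36        # deuteron (p + n)
assert naive_contractions(6, 6) == 518400    # 4He: already half a million terms
```

This factorial growth is what the improved contraction methods of Detmold and Orginos are designed to tame.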
Momentum-dependent Phase Shifts
[Figure: ππ elastic-scattering phase shift (degrees, 0-180) vs. centre-of-mass energy (800-1050 MeV).]
Extension to I = 1 ππ elastic scattering
I = 2 ππ elastic scattering
[Figure: Excited-meson spectrum: isoscalar and isovector states of negative and positive parity, exotics, and the YM glueball.]
Hadron Structure
Momentum fraction of quarks in proton
How is spin apportioned in a proton?
[Figure: Binding energies -B (MeV, down to -200) of two-, three-, and four-body systems: d, nn, nΣ, H-dibaryon, nΞ, 3He, 3ΛH, 3ΛHe, 3ΣHe, 4He, 4ΛHe, 4ΛΛHe, grouped by strangeness s = 0, -1, -2, with J^P assignments.]
Three- and Four-body systems
Binding energies at physical strange-quark mass
Chiral Transition Temperature
OUTLOOK
• Effective use of JIT to exploit accelerated architectures is the basis of work with SUPER to develop a domain-specific compiler for LQCD; apply the lessons to other domain-specific frameworks.
• Effective exploitation of Xeon Phi.
• Collaboration with FASTMath on multigrid methods.
• Lattice QCD for nuclear physics is characterised by heavy capacity requirements as well as capability requirements:
  - methods for multiple RHS
  - Wick-contraction methods
• Exciting Physics