Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
QCD in GPU
Ting-Wai Chiu (趙挺偉)
Department of Physics Center for Quantum Science and Engineering
National Taiwan University
GPU Technology Conference Beijing, China
December 14-15, 2011
2
which build up the hadrons
(e.g., neutron, proton, pion, etc.).
QCD provides the framework to understand the n
The quantum field theory for the strong interaction
between quarks and gluons
uclear
force/energy from the first principles.
QCD plays an important role in the evolution of the
early universe, from the quark gluon "plasma" phase
to the hadron phase.
Quantum Chromodynamics (QCD)
T.W. Chiu, GTC-Asia-2011 2011/12/14
T.W. Chiu, GTC-Asia-2011 3 2011/12/14
Outline
Introduction
Lattice Dirac Operator : Quark Matrix
Cuda Kernels for Conjugate Gradient
Conclusion & Outlook
References:
Taiwan Lattice QCD Collaboration (TWQCD):
arXiv: 0911.5029, 1101.0405, 1101.0402, 1101.0423,
1105.4414, 1109.3675
T.W. Chiu, GTC-Asia-2011 5 2011/12/14
2011/12/14 T.W. Chiu, GTC-Asia-2011 6
2011/12/14 T.W. Chiu, GTC-Asia-2011 7
11 12 13
21 22 23
31 32 33
† †
ˆˆ ˆ ˆ( , , , ) , , , ,
U U U
U x y z t U U U x y z t
U U U
U U U U I
Gluon fields on the Lattice
Then the gluon action on the lattice can be written as
where
ˆx a ˆˆx a a
x ˆx a
The color gluon field are defined on each link connecting and , through the link variable
A x 3SUx ˆx a
ˆexp2
aU x iagA x
42plaquette
6 1 11 Re 0
3 2g pS U tr U a d x tr F x F x
g
† †ˆˆpU U x U x a U x a U x
8 T.W. Chiu, GPU Workshop, Jan 16, '09
2011/12/14 T.W. Chiu, GTC-Asia-2011 9
The Quark Matrix (Lattice Dirac Operator)
ijDD
( , , ) ( , , )i a x j b y
, 1,2,3 (color indices)a b
, 1,2,3,4 (Dirac spinor indices)
, 1, , (Lattice site indices)sitesx y N
2011/12/14 T.W. Chiu, GTC-Asia-2011 10
Salient Features of the Quark Matrix
D is prohibitively large for exact solvers.
In general, D is a sparse matrix, since it only
involves (next-)nearest neighbor interactions in
4-dim or 5-dim lattice.
Iterative algorithms (conjugate gradient, Lanczos,
etc.) are used, which involve the matrix-vector
multiplication.
CUDA kernels can be optimized for the
matrix-vector multiplication in QCD.
Lattice QCD
,
The QCD action
where is the action of the gluon fields
( )
( )
, , , , ,
(
flavo r i
)
( ) (
n e
)
d
G
f a x b
G
f f
a x b yy
S S U
S U
f
D U
D U D U
u d s c b t
sites
x
color index
Di
, 1,2,3
, =1,2,3,4
rac index
site index
For exampl
, 1, , =
e
x y z t
a b
x y N N N N N
3
1
16 32
1,572,864 1,
, on the lattice, for each flavor,
is a complex matrix of size
( , , ) ( , )( , , )
572,864
det( )
det( )
G
G
SS
SS
dUd d U
D
De dU U eU
dUd d e dU e
D
D
11 T.W. Chiu, GTC-Asia-2011
Kenneth G. Wilson Nobel Prize (1982)
2011/12/14
Domain-Wall Fermions
dwfˆis a local op. with the nearest neighbor coupling along sD
1 2sN
s
Left-handed Right-handed
( , )x s( ,1)x ( , )sx N
5dwf5
5 52
1exp det
1
, , 0, Exact Chiral Sym.
c c
s c c
Sd d D D D
S
HN S D D
H
12 T.W. Chiu, GTC-Asia-2011
2011/12/14
T.W. Chiu, GTC-Asia-2011 13
DWF with even-odd preconditioning
2011/12/14
T.W. Chiu, GTC-Asia-2011 14 2011/12/14
T.W. Chiu, GTC-Asia-2011 15 2011/12/14
Mixed-Precision CG (2)
1
1
1
1 1
1.
2. If | | | |, then stop
3.
4.
5. Go to 1.
Pr
| | | |
| | | |
Solve in single precision to
Let
then
an accuracy 1
,
| | |
oof :
k k
k
k k
k k k k
k k
k
k
k k
k
k
At r
r b Ax
r b
x x
u r A u r
r b Ax b Ax A
t
t
t u
1| | | | |k k kr r
16 T.W. Chiu, GTC-Asia-2011 2011/12/14
T.W. Chiu, GTC-Asia-2011 17 2011/12/14
T.W. Chiu, GTC-Asia-2011 18 2011/12/14
T.W. Chiu, GTC-Asia-2011 19 2011/12/14
T.W. Chiu, GTC-Asia-2011 20 2011/12/14
T.W. Chiu, GTC-Asia-2011 21 2011/12/14
T.W. Chiu, GTC-Asia-2011 22 2011/12/14
T.W. Chiu, GTC-Asia-2011 23 2011/12/14
T.W. Chiu, GTC-Asia-2011 24 2011/12/14
T.W. Chiu, GTC-Asia-2011 25 2011/12/14
T.W. Chiu, GTC-Asia-2011 26 2011/12/14
T.W. Chiu, GTC-Asia-2011 27
Dw(Single) M5(Single) Dw(Double) M5(Double) CG(Mixed)
GTX285 177 346 33 69 181
C1060 128 290 29 61 132
C2070 171 244 22 96 156
GTX480 293 309 37 116 252
GTX580 338 445 41 150 317
Benchmarks
CG (mixed prec.) attains 317 Gflops on GTX580
The bottleneck is Dw single-precision multiplication
All numbers are in unit of Gflops, tested with ODWF on 163 x 32 x 16 lattice
2011/12/14
GPU Cluster at NTU
16 units of Nvidia Tesla S1070 (total 64 GPUs)
connected to 16 servers (total 48 Intel QC Xeon)
280 Nvidia GPUs with peak performance > 300 TFLOPS
Attaining 80 TFLOPS (sustained) for LQCD with Optimal DWF
Developed efficient CUDA codes for full QCD.
320/156/180/132 Gflops for GTX580/C2070/GTX285/C1060
32 Nvidia C1060 (total 32 GPUs),
connected to 16 servers (total 16 Intel i7)
Hard disk storage > 300 TB, Lustre cluster file system
122 Nvidia GTX285 (total 122 GPUs),
connected to 62 servers (total 62 Intel i7)
22/36/7 Nvidia C2070/GTX580/GTX480 (total 47 GPUs),
connected to 27 servers (total 27 Intel i7)
28 T.W. Chiu, GTC-Asia-2011 2011/12/14
GPU Cluster at NTU (a partial view)
29 T.W. Chiu, GTC-Asia-2011 2011/12/14
T.W. Chiu, GTC-Asia-2011 30
We have implemented efficient CUDA codes for
lattice QCD with domain-wall fermion. On GTX580,
our CG solver attains 320 Gflops (sustained).
GPU has provided the optimal price/performance
for large-scale lattice QCD simulations.
Conclusions and Outlook
2011/12/14
Lattice QCD with domain-wall fermion on the
243 x 48 x 16 can be simulated efficiently with
one single GPU, and for larger lattices
(e.g., 323 x 64 x 16) with multiGPUs.
GPU has revolutionized the advancement of QCD
2011/12/14 T.W. Chiu, GTC-Asia-2011 31
Conclusions and Outlook (cont.)
First Physics Results from the NTU GPU cluster:
1. T.W. Chiu, T.H. Hsieh, Y.Y. Mao (TWQCD),
“Topological Susceptibility in Two Flavors Lattice QCD with the Optimal Domain-Wall Fermion”,
Phys. Lett. B 702 (2011) 131.
2. T.W. Chiu, T.H. Hsieh, Y.Y. Mao (TWQCD), “Pseudoscalar Meson in Two Flavors QCD with the
Optimal Domain-Wall Fermion”,
arXiv:1109.3675
T.W. Chiu, GTC-Asia-2011 32
Backup slides
2011/12/14
33
which build up the hadrons (e.g., neutron,
proton, pion, etc.). QCD provides the framework to understand
the nuclear force
The quantum field theory for the strong interaction between
quarks and gluons
/energy from the first principles, and plays an
important role in the evolution of the early universe, from
the quark gluon "plasma" phase to the hadron
Gauge group gluo(3)
phase.
SU
:Salient features
( ) 0 as 0
( ) 1 at
ns have self-i
1 fm
nteractions.
Asymptotic freedom: .
IR slavory: quark (color) confinement
Spontaneously chiral symmetry breaking
g r r
g r r
Quantum Chromodynamics (QCD)
T.W. Chiu, GTC-Asia-2011 2011/12/14
T.W. Chiu, GTC-Asia-2011 34
It took 23 years (1974 ~1997) to realize that Lattice QCD with Exact Chiral Symmetry is the ideal theoretical framework to study the nonperturbative
physics from the first principles of QCD.
But, it is challenging to perform the simulation such that the chiral sym. is preserved to very high
precision & all topological sectors are sampled
ergodically.
Since 2009, the TWQCD collaboration has been simulating QCD with optimal domain-wall quarks. The chiral sym. is preserved to a good precision
with , and all topological sectors are
sampled ergodically.
0.0004resm a
2011/12/14
T.W. Chiu, GTC-Asia-2011 35
Exact Chiral Symmetry on the Lattice
The proper way to break the chiral symmetry at finite lattice
spacing is to impose the Ginsparg-Wilson relation (1982)
5 55D D D D
equivalently, 1 15 5 ,5( , ) ( , ) x yD x y D x y
which is realized by the Domain-Wall Fermion (Kaplan,1992),
and the overlap Dirac operator (Neuberger,1998)
†5 2
,
0,
is exponentially local for sufficently smooth gauge field.
In the continuum li
,
mit
( ).a
H HH
D IH
D
D igA
2011/12/14
T.W. Chiu, GTC-Asia-2011 36
Central Problems in Lattice QCD
To compute the quark propagator D−1
To compute the (low-lying) eigenmodes of D
The matrix D is prohibitively large for exact solvers.
Iterative algorithms involve the matrix-vector multiplication.
For the overlap fermion operator, it involves
2
HY
H
The inverse square-root cannot be computed exactly.
What is the best way to proceed ?
2011/12/14
T.W. Chiu, GTC-Asia-2011 37
Nested Conjugate Gradient
1, 2
2( )
n nHY HR H Y
H
1, 2
21 1
2 solved by CG with mul
( )
ti-shifts
nl
l
l
l
nn n l
l l
l
bR H Y Y
H d
H d
b Z
Z Y
5 2
nested CTo compute quark propagator requires
G
Y bD YH
IH
Q: Can we avoid the inverse square root ?
A: Yes, to introduce an extra dim. (degree of freedom).
2011/12/14
Domain-Wall Fermions
dwfˆis a local op. with the nearest neighbor coupling along sD
1 2sN
s
Left-handed Right-handed
( , )x s( ,1)x ( , )sx N
5dwf5
5 52
1exp det
1
, , 0, Exact
For fini
Chiral Sym.
te , which DWF gives the best rational approx. to ?
c c
s c c
sN
Sd d D D D
S
HN S D D
H
S
38 T.W. Chiu, GTC-Asia-2011
2011/12/14
Optimal Domain-Wall Fermion
[ TWC, Phys. Rev. Lett. 90 (2003) 071601 ]
with boundary conditions
odwf , , , 1 , 1 ,, ,, 1 ,
odwf
sN
x s w s s w s s s s x sx x x xx
s
s s x
sA I D I D P P
D
4
0 0
1
, 0,2wD t W m m
†, ,1
,2
x x x xt x x U x U x
4
†
, , ,
1
1, 2
2x x x x x xW x x U x U x
0
,0 , , : bare quark mass2
q
s q
mP x P x N m
m
50
1, 1 ,1 , (1 )
2 2
q
s
mP x N P x P
m
39 T.W. Chiu, GTC-Asia-2011 2011/12/14
The state-of-the-art in the simulations of
unquenched QCD with exact chiral symmetry
40 T.W. Chiu, GTC-Asia-2011
3 3 3
Machine:
Lattice fermion:
Lat
RBC
tice sizes:
Machine:
and UKQCD Collaborations
QCDOC
Domain-Wall Fermion
16
JLQCD Collaboration
IBM BlueGene/L
32 16, 24 48 16, 32 64 16
3 3
=0Lattice fermion:
Lattice sizes:
Overlap Fermion (with fixed topology )
16
TWQCD C
Machin
ollaboration
Optimal D
e:
Lattice fermion: o ma
32, 16 48
tQ
GPU cluster with 280 GPUs
3 3 3 Lattice sizes
in-Wall Fermion
16: 32 16, 20 40 16, 24 48 16
2011/12/14
T.W. Chiu, GTC-Asia-2011 41 2011/12/14
T.W. Chiu, GTC-Asia-2011 42 2011/12/14
T.W. Chiu, GTC-Asia-2011 43 2011/12/14