spcl.inf.ethz.ch
@spcl_eth
TORSTEN HOEFLER
Progress in automatic GPU compilation and
why you want to run MPI on your GPU
with Tobias Grosser and Tobias Gysi @ SPCL
presented at High Performance Computing, Cetraro, Italy 2016
Evading various “ends” – the hardware view
Pete’s system software view
My software/programming model view
T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
Holy grail – auto-parallelization/heterogenization

Automatic accelerator mapping: automatic, regression free, high performance.
Non-goal: algorithmic changes.

How close can we get?
Tool: Polyhedral Modeling

Program code:

for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S(i,j);

Iteration space (N = 4), bounded by 0 ≤ i, 0 ≤ j, j ≤ i, and i ≤ N:

D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

(i,j) = (0,0) (1,0) (1,1) (2,0) (2,1) (2,2) (3,0) (3,1) (3,2) (3,3) (4,0) (4,1) (4,2) (4,3) (4,4)
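To make the set concrete, here is a minimal host-side sketch (plain C++, not from the slides) that enumerates D for N = 4 and reproduces exactly the point sequence listed above:

#include <cstdio>

// Enumerate D = { (i,j) | 0 <= i <= N and 0 <= j <= i } for N = 4;
// prints the 15 points (0,0) (1,0) (1,1) ... (4,4) in loop order.
int main() {
    const int N = 4;
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= i; j++)
            std::printf("(%d,%d) ", i, j);
    std::printf("\n");
    return 0;
}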
Mapping Computation to Device

The iteration space is tiled onto device blocks and threads; here a 2x2 grid of 4x3 thread blocks:

TID = { (i,j) → (i mod 4, j mod 3) }
BID = { (i,j) → (⌊i/4⌋ mod 2, ⌊j/3⌋ mod 2) }
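A minimal CUDA sketch (illustrative, not Polly-ACC's generated code) of how a kernel can invert these maps: launched on a 2x2 grid of 4x3 thread blocks, each block walks over the tiles whose indices match its own modulo 2, so the fixed grid covers iteration spaces of any size.

// TID = { (i,j) -> (i % 4, j % 3) }, BID = { (i,j) -> (floor(i/4) % 2, floor(j/3) % 2) }
// launch: mapped<<<dim3(2, 2), dim3(4, 3)>>>(Mi, Mj);
__global__ void mapped(int Mi, int Mj) {                      // extents of i and j
    for (int ti = blockIdx.x; ti * 4 < Mi; ti += gridDim.x)   // tiles in i, wrapped mod 2
        for (int tj = blockIdx.y; tj * 3 < Mj; tj += gridDim.y) {
            int i = ti * 4 + threadIdx.x;                     // hence i % 4 == threadIdx.x
            int j = tj * 3 + threadIdx.y;                     // hence j % 3 == threadIdx.y
            if (i < Mi && j < Mj) {
                // S(i, j) would execute here
            }
        }
}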
Memory Hierarchy of a Heterogeneous System
Host-device data transfers
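The transfer figures are not reproduced here; as a minimal hand-written sketch of what they illustrate (not compiler output), an offloaded kernel is bracketed by a host-to-device copy of its inputs and a device-to-host copy of its results:

#include <cuda_runtime.h>

__global__ void scale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

void offload(float *h_a, int n) {
    float *d_a;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device
    scale<<<(n + 255) / 256, 256>>>(d_a, n);                          // compute on the device copy
    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_a);
}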
Mapping onto fast memory
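The figure is not reproduced here; a minimal CUDA sketch (illustrative, not Polly-ACC output) of the idea, using the 5-point stencil from the next slides: each block stages its tile plus a one-element halo into fast __shared__ memory before computing.

#define TILE 16

__global__ void stencil_shared(const float *A, float *B, int n) {
    __shared__ float tile[TILE + 2][TILE + 2];                // tile plus halo
    int i = blockIdx.y * TILE + threadIdx.y + 1;              // +1 skips the global border
    int j = blockIdx.x * TILE + threadIdx.x + 1;              // (bounds checks omitted)
    int li = threadIdx.y + 1, lj = threadIdx.x + 1;

    tile[li][lj] = A[i * n + j];                              // interior element
    if (li == 1)    tile[0][lj]        = A[(i - 1) * n + j];  // halo rows and columns
    if (li == TILE) tile[TILE + 1][lj] = A[(i + 1) * n + j];
    if (lj == 1)    tile[li][0]        = A[i * n + j - 1];
    if (lj == TILE) tile[li][TILE + 1] = A[i * n + j + 1];
    __syncthreads();

    B[i * n + j] = tile[li + 1][lj] + tile[li - 1][lj] + tile[li][lj]
                 + tile[li][lj - 1] + tile[li][lj + 1];       // all reads hit fast memory
}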
Accessed Data (for a 2x2 thread block)

for (i = 1; i <= 6; i++)
  for (j = 1; j <= 4; j++)
    … = A[i+1][j] + A[i-1][j] + A[i][j] + A[i][j-1] + A[i][j+1];
• Data needed on device: 12 elements (minimal data, but complex transfer)
• One-dimensional hull: 20 elements (simple transfer, but redundant data)
• Two-dimensional hull: 16 elements (simple transfer, less redundant data)

Modeling multi-dimensional access behavior is important.
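A minimal host-side sketch of transferring such a 2D hull (hand-written; assumes the block covers i, j in {1,2} of the loop above and a row length of 6): cudaMemcpy2D moves the rectangle as one strided transfer, one contiguous run per row.

#include <cuda_runtime.h>

// Copy the 2D hull [i0..i1] x [j0..j1] of A (leading dimension ld) to the device.
void copy_hull(double *d_A, const double *h_A, size_t ld,
               int i0, int i1, int j0, int j1) {
    size_t rows = i1 - i0 + 1, cols = j1 - j0 + 1;
    cudaMemcpy2D(d_A + i0 * ld + j0, ld * sizeof(double),   // dst and dst pitch
                 h_A + i0 * ld + j0, ld * sizeof(double),   // src and src pitch
                 cols * sizeof(double), rows,               // width in bytes, height in rows
                 cudaMemcpyHostToDevice);
}

// copy_hull(d_A, h_A, 6, 0, 3, 0, 3) transfers the 4x4 = 16-element hull.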
Profitability Heuristic

All loop nests pass through a filter before they are mapped to the GPU: nests that are trivial, unsuitable, or offer insufficient compute stay on the host. Execution is modeled statically where loop bounds are known and dynamically otherwise.
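A minimal sketch of that decision flow (function, names, and the compute threshold are illustrative here, not Polly-ACC's internals):

enum Target { HOST_CPU, ACCEL_GPU };

// Filter a loop nest before GPU mapping: unsuitable or trivial nests and
// nests with insufficient compute stay on the host; when loop bounds are
// unknown at compile time, the decision falls back to a run-time check.
Target choose_target(bool affine, bool trivial,
                     long static_iters,      // -1 if unknown statically
                     long dynamic_iters,     // evaluated at run time
                     long work_per_iter) {
    if (!affine || trivial) return HOST_CPU;            // unsuitable / trivial
    long iters = (static_iters >= 0) ? static_iters     // static modeling
                                     : dynamic_iters;   // dynamic check
    const long kMinWork = 1 << 20;                      // illustrative threshold
    return (iters * work_per_iter < kMinWork) ? HOST_CPU  // insufficient compute
                                              : ACCEL_GPU;
}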
Some results: Polybench 3.2

Speedup over icc -O3: arithmetic mean ~30x, geometric mean ~6x.
Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)
Compiles all of SPEC CPU 2006 – Example: LBM

[Chart: runtime (m:s) of icc, icc -openmp, clang, and Polly ACC on a mobile system and a workstation; annotated speedups of ~20% and ~4x.]

Workstation: Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)
Mobile: essentially my 4-core x86 laptop with the (free) GPU that's in there
Brave new compiler world!?

Unfortunately not …
• Limited to affine code regions (maybe generalizes to control-restricted programs)
• No distributed anything!!

Good news: much of traditional HPC fits that model, and the infrastructure is coming along.
Bad news: modern data-driven HPC and Big Data fit less well.

We need a programming model for distributed heterogeneous machines!
How do we program GPUs today?

[Diagram: ld/st instruction timelines per device compute core, showing active threads and instruction latency.]

CUDA
• over-subscribe the hardware
• use spare parallel slack for latency hiding

MPI
• host controlled
• full device synchronization
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
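A minimal sketch of the CUDA bullets above (a plain copy kernel, not from the paper): launching far more blocks than the GPU has multiprocessors gives the scheduler spare warps to run while others wait on loads, which is exactly the ld/st latency hiding in the diagram.

__global__ void copy(const double *in, double *out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // ld ... st; the stall in between is hidden
}                                // by switching to other resident warps

// over-subscribed launch: many thousands of blocks on a few dozen SMs
// copy<<<(n + 255) / 256, 256>>>(d_in, d_out, n);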
Latency hiding at the cluster level?

[Diagram: as before, but with device-side put operations interleaved between the loads and stores.]

dCUDA (distributed CUDA)
• unified programming model for GPU clusters
• avoid unnecessary device synchronization to enable system-wide latency hiding
dCUDA extends CUDA with MPI-3 RMA and notifications

for (int i = 0; i < steps; ++i) {
  // computation: iterative stencil kernel, thread-specific idx
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] + in[idx + 1] + in[idx - 1]
             + in[idx + jstride] + in[idx - jstride];

  // communication: device-side put operations into neighbor windows
  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);

  // synchronization via notifications
  dcuda_wait_notifications(ctx, wout, DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out); swap(win, wout);
}

• map ranks to blocks
• device-side put/get operations
• notifications for synchronization
• shared and distributed memory
Hardware supported communication overlap

[Diagram: timelines of blocks 1-8 across device compute cores. Traditional MPI-CUDA computes all blocks, fully synchronizes the device, then communicates; dCUDA keeps blocks resident, so the communication of one rank overlaps the computation of the others.]
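For contrast with the dCUDA loop shown earlier, a minimal sketch of the traditional MPI-CUDA timeline (kernel and buffer names are hypothetical; boundary ranks and one exchange direction omitted): the cudaDeviceSynchronize() is the full-device barrier that dCUDA eliminates.

#include <mpi.h>
#include <utility>

__global__ void stencil(const double *in, double *out, int len);  // as on the dCUDA slide

void run(double *in, double *out, int len, int jstride, int rank, int steps) {
    for (int i = 0; i < steps; ++i) {
        stencil<<<128, 256>>>(in, out, len);
        cudaDeviceSynchronize();              // no block computes during the exchange
        // halo exchange (assumes CUDA-aware MPI; mirror direction omitted)
        MPI_Sendrecv(out + jstride, jstride, MPI_DOUBLE, rank - 1, 0,
                     out + len,     jstride, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::swap(in, out);
    }
}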
Implementation of the dCUDA runtime system

[Diagram: each block's device-side library communicates with a host-side block manager through per-block logging, command, ack, and notification queues; a host-side event handler connects the block manager to MPI.]
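A heavily simplified sketch of the host-side event handler the diagram implies (struct and field names are invented here, not the actual dCUDA sources): it drains commands the device library enqueues in host-mapped memory and forwards them to MPI.

#include <mpi.h>

enum { QUEUE_LEN = 1024 };
struct Command { int target_rank; MPI_Aint offset; int size; void *src; };

// One event-handler step for one block's command queue: forward
// device-issued puts to MPI, then acknowledge them to the device library.
void event_handler_step(Command *queue, volatile int *head, volatile int *tail,
                        MPI_Win win) {
    while (*tail != *head) {                        // commands pending?
        Command c = queue[*tail % QUEUE_LEN];
        MPI_Put(c.src, c.size, MPI_BYTE, c.target_rank,
                c.offset, c.size, MPI_BYTE, win);
        // ...post the matching notification and push an ack to the ack queue...
        *tail = *tail + 1;
    }
}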
Benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

Overlap of a copy kernel with halo exchange

[Plot: execution time [ms] (0 to 1000) vs. number of copy iterations per exchange (30, 60, 90); curves for compute & exchange, compute only, and halo exchange.]
Benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

Weak scaling of MPI-CUDA and dCUDA for a stencil program

[Plot: execution time [ms] (0 to 100) vs. number of nodes (2 to 8); curves for dCUDA, MPI-CUDA, and halo exchange.]
Benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

Weak scaling of MPI-CUDA and dCUDA for a particle simulation

[Plot: execution time [ms] (0 to 200) vs. number of nodes (2 to 8); curves for dCUDA, MPI-CUDA, and halo exchange.]
Benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

Weak scaling of MPI-CUDA and dCUDA for sparse-matrix vector multiplication

[Plot: execution time [ms] (0 to 200) vs. number of nodes (1, 4, 9); curves for dCUDA, MPI-CUDA, and communication.]
for (int i = 0; i < steps; ++i) {
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] + in[idx + 1] + in[idx - 1]
             + in[idx + jstride] + in[idx - jstride];

  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout, DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out); swap(win, wout);
}

http://spcl.inf.ethz.ch/Polly-ACC
• Automatic
• "Regression free" high performance

dCUDA – distributed memory
• Automatic overlap
• High performance

try now: https://translate.google.de/#en/de/a%20bad%20day%20for%20Europe