spcl.inf.ethz.ch
@spcl_eth
TORSTEN HOEFLER
Progress in automatic GPU compilation and
why you want to run MPI on your GPU
with Tobias Grosser and Tobias Gysi @ SPCL
presented at High Performance Computing, Cetraro, Italy 2016
Evading various “ends” – the hardware view
Pete’s system software view
My software/programming model view
T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
Holy grail – auto-parallelization/heterogenization

Automatic accelerator mapping: automatic, regression free, high performance.
Non-goal: algorithmic changes.

How close can we get?
Tool: Polyhedral Modeling

Program code:

for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S(i,j);

Iteration space (N = 4), bounded by 0 ≤ i, 0 ≤ j, j ≤ i, and i ≤ N:

D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

(i,j) = (0,0) (1,0) (1,1) (2,0) (2,1) (2,2) (3,0) (3,1) (3,2) (3,3) (4,0) (4,1) (4,2) (4,3) (4,4)
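To make the set concrete, here is a minimal host-side sketch (plain C++, not from the slides) that enumerates D for N = 4 and reproduces exactly the point sequence listed above:

#include <cstdio>

// Enumerate D = { (i,j) | 0 <= i <= N and 0 <= j <= i } for N = 4;
// prints the 15 points (0,0) (1,0) (1,1) ... (4,4) in loop order.
int main() {
    const int N = 4;
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= i; j++)
            std::printf("(%d,%d) ", i, j);
    std::printf("\n");
    return 0;
}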
Mapping Computation to Device

The iteration space is tiled onto device blocks and threads; here a 2x2 grid of 4x3 thread blocks:

TID = { (i,j) → (i mod 4, j mod 3) }
BID = { (i,j) → (⌊i/4⌋ mod 2, ⌊j/3⌋ mod 2) }
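A minimal CUDA sketch (illustrative, not Polly-ACC's generated code) of how a kernel can invert these maps: launched on a 2x2 grid of 4x3 thread blocks, each block walks over the tiles whose indices match its own modulo 2, so the fixed grid covers iteration spaces of any size.

// TID = { (i,j) -> (i % 4, j % 3) }, BID = { (i,j) -> (floor(i/4) % 2, floor(j/3) % 2) }
// launch: mapped<<<dim3(2, 2), dim3(4, 3)>>>(Mi, Mj);
__global__ void mapped(int Mi, int Mj) {                      // extents of i and j
    for (int ti = blockIdx.x; ti * 4 < Mi; ti += gridDim.x)   // tiles in i, wrapped mod 2
        for (int tj = blockIdx.y; tj * 3 < Mj; tj += gridDim.y) {
            int i = ti * 4 + threadIdx.x;                     // hence i % 4 == threadIdx.x
            int j = tj * 3 + threadIdx.y;                     // hence j % 3 == threadIdx.y
            if (i < Mi && j < Mj) {
                // S(i, j) would execute here
            }
        }
}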
Memory Hierarchy of a Heterogeneous System
Host-device data transfers
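The transfer figures are not reproduced here; as a minimal hand-written sketch of what they illustrate (not compiler output), an offloaded kernel is bracketed by a host-to-device copy of its inputs and a device-to-host copy of its results:

#include <cuda_runtime.h>

__global__ void scale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

void offload(float *h_a, int n) {
    float *d_a;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device
    scale<<<(n + 255) / 256, 256>>>(d_a, n);                          // compute on the device copy
    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_a);
}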
Mapping onto fast memory
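The figure is not reproduced here; a minimal CUDA sketch (illustrative, not Polly-ACC output) of the idea, using the 5-point stencil from the next slides: each block stages its tile plus a one-element halo into fast __shared__ memory before computing.

#define TILE 16

__global__ void stencil_shared(const float *A, float *B, int n) {
    __shared__ float tile[TILE + 2][TILE + 2];                // tile plus halo
    int i = blockIdx.y * TILE + threadIdx.y + 1;              // +1 skips the global border
    int j = blockIdx.x * TILE + threadIdx.x + 1;              // (bounds checks omitted)
    int li = threadIdx.y + 1, lj = threadIdx.x + 1;

    tile[li][lj] = A[i * n + j];                              // interior element
    if (li == 1)    tile[0][lj]        = A[(i - 1) * n + j];  // halo rows and columns
    if (li == TILE) tile[TILE + 1][lj] = A[(i + 1) * n + j];
    if (lj == 1)    tile[li][0]        = A[i * n + j - 1];
    if (lj == TILE) tile[li][TILE + 1] = A[i * n + j + 1];
    __syncthreads();

    B[i * n + j] = tile[li + 1][lj] + tile[li - 1][lj] + tile[li][lj]
                 + tile[li][lj - 1] + tile[li][lj + 1];       // all reads hit fast memory
}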
Accessed Data (for a 2x2 thread block)

for (i = 1; i <= 6; i++)
  for (j = 1; j <= 4; j++)
    … = A[i+1][j] + A[i-1][j] + A[i][j] + A[i][j-1] + A[i][j+1];
• Data needed on device: 12 elements (minimal data, but complex transfer)
• One-dimensional hull: 20 elements (simple transfer, but redundant data)
• Two-dimensional hull: 16 elements (simple transfer, less redundant data)

Modeling multi-dimensional access behavior is important.
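A minimal host-side sketch of transferring such a 2D hull (hand-written; assumes the block covers i, j in {1,2} of the loop above and a row length of 6): cudaMemcpy2D moves the rectangle as one strided transfer, one contiguous run per row.

#include <cuda_runtime.h>

// Copy the 2D hull [i0..i1] x [j0..j1] of A (leading dimension ld) to the device.
void copy_hull(double *d_A, const double *h_A, size_t ld,
               int i0, int i1, int j0, int j1) {
    size_t rows = i1 - i0 + 1, cols = j1 - j0 + 1;
    cudaMemcpy2D(d_A + i0 * ld + j0, ld * sizeof(double),   // dst and dst pitch
                 h_A + i0 * ld + j0, ld * sizeof(double),   // src and src pitch
                 cols * sizeof(double), rows,               // width in bytes, height in rows
                 cudaMemcpyHostToDevice);
}

// copy_hull(d_A, h_A, 6, 0, 3, 0, 3) transfers the 4x4 = 16-element hull.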
Profitability Heuristic

All loop nests pass through a filter before they are mapped to the GPU: nests that are trivial, unsuitable, or offer insufficient compute stay on the host. Execution is modeled statically where loop bounds are known and dynamically otherwise.
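A minimal sketch of that decision flow (function, names, and the compute threshold are illustrative here, not Polly-ACC's internals):

enum Target { HOST_CPU, ACCEL_GPU };

// Filter a loop nest before GPU mapping: unsuitable or trivial nests and
// nests with insufficient compute stay on the host; when loop bounds are
// unknown at compile time, the decision falls back to a run-time check.
Target choose_target(bool affine, bool trivial,
                     long static_iters,      // -1 if unknown statically
                     long dynamic_iters,     // evaluated at run time
                     long work_per_iter) {
    if (!affine || trivial) return HOST_CPU;            // unsuitable / trivial
    long iters = (static_iters >= 0) ? static_iters     // static modeling
                                     : dynamic_iters;   // dynamic check
    const long kMinWork = 1 << 20;                      // illustrative threshold
    return (iters * work_per_iter < kMinWork) ? HOST_CPU  // insufficient compute
                                              : ACCEL_GPU;
}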
Some results: Polybench 3.2

Speedup over icc -O3: arithmetic mean ~30x, geometric mean ~6x.
Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)
Compiles all of SPEC CPU 2006 – Example: LBM

[Chart: runtime (m:s) of icc, icc -openmp, clang, and Polly ACC on a mobile system and a workstation; annotated speedups of ~20% and ~4x.]

Workstation: Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)
Mobile: essentially my 4-core x86 laptop with the (free) GPU that's in there
Brave new compiler world!?

Unfortunately not …
• Limited to affine code regions (maybe generalizes to control-restricted programs)
• No distributed anything!!

Good news: much of traditional HPC fits that model, and the infrastructure is coming along.
Bad news: modern data-driven HPC and Big Data fit less well.

We need a programming model for distributed heterogeneous machines!
How do we program GPUs today?

[Diagram: ld/st instruction timelines per device compute core, showing active threads and instruction latency.]

CUDA
• over-subscribe the hardware
• use spare parallel slack for latency hiding

MPI
• host controlled
• full device synchronization
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
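A minimal sketch of the CUDA bullets above (a plain copy kernel, not from the paper): launching far more blocks than the GPU has multiprocessors gives the scheduler spare warps to run while others wait on loads, which is exactly the ld/st latency hiding in the diagram.

__global__ void copy(const double *in, double *out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // ld ... st; the stall in between is hidden
}                                // by switching to other resident warps

// over-subscribed launch: many thousands of blocks on a few dozen SMs
// copy<<<(n + 255) / 256, 256>>>(d_in, d_out, n);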
Latency hiding at the cluster level?

[Diagram: as before, but with device-side put operations interleaved between the loads and stores.]

dCUDA (distributed CUDA)
• unified programming model for GPU clusters
• avoid unnecessary device synchronization to enable system-wide latency hiding
dCUDA extends CUDA with MPI-3 RMA and notifications

for (int i = 0; i < steps; ++i) {
  // computation: iterative stencil kernel, thread-specific idx
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] + in[idx + 1] + in[idx - 1]
             + in[idx + jstride] + in[idx - jstride];

  // communication: device-side put operations into neighbor windows
  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);

  // synchronization via notifications
  dcuda_wait_notifications(ctx, wout, DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out); swap(win, wout);
}

• map ranks to blocks
• device-side put/get operations
• notifications for synchronization
• shared and distributed memory
Hardware supported communication overlap

[Diagram: timelines of blocks 1-8 across device compute cores. Traditional MPI-CUDA computes all blocks, fully synchronizes the device, then communicates; dCUDA keeps blocks resident, so the communication of one rank overlaps the computation of the others.]
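For contrast with the dCUDA loop shown earlier, a minimal sketch of the traditional MPI-CUDA timeline (kernel and buffer names are hypothetical; boundary ranks and one exchange direction omitted): the cudaDeviceSynchronize() is the full-device barrier that dCUDA eliminates.

#include <mpi.h>
#include <utility>

__global__ void stencil(const double *in, double *out, int len);  // as on the dCUDA slide

void run(double *in, double *out, int len, int jstride, int rank, int steps) {
    for (int i = 0; i < steps; ++i) {
        stencil<<<128, 256>>>(in, out, len);
        cudaDeviceSynchronize();              // no block computes during the exchange
        // halo exchange (assumes CUDA-aware MPI; mirror direction omitted)
        MPI_Sendrecv(out + jstride, jstride, MPI_DOUBLE, rank - 1, 0,
                     out + len,     jstride, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::swap(in, out);
    }
}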
Implementation of the dCUDA runtime system

[Diagram: each block's device-side library communicates with a host-side block manager through per-block logging, command, ack, and notification queues; a host-side event handler connects the block manager to MPI.]
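A heavily simplified sketch of the host-side event handler the diagram implies (struct and field names are invented here, not the actual dCUDA sources): it drains commands the device library enqueues in host-mapped memory and forwards them to MPI.

#include <mpi.h>

enum { QUEUE_LEN = 1024 };
struct Command { int target_rank; MPI_Aint offset; int size; void *src; };

// One event-handler step for one block's command queue: forward
// device-issued puts to MPI, then acknowledge them to the device library.
void event_handler_step(Command *queue, volatile int *head, volatile int *tail,
                        MPI_Win win) {
    while (*tail != *head) {                        // commands pending?
        Command c = queue[*tail % QUEUE_LEN];
        MPI_Put(c.src, c.size, MPI_BYTE, c.target_rank,
                c.offset, c.size, MPI_BYTE, win);
        // ...post the matching notification and push an ack to the ack queue...
        *tail = *tail + 1;
    }
}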
Benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

Overlap of a copy kernel with halo exchange

[Plot: execution time [ms] (0 to 1000) vs. number of copy iterations per exchange (30, 60, 90); curves for compute & exchange, compute only, and halo exchange.]
Benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

Weak scaling of MPI-CUDA and dCUDA for a stencil program

[Plot: execution time [ms] (0 to 100) vs. number of nodes (2 to 8); curves for dCUDA, MPI-CUDA, and halo exchange.]
Benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

Weak scaling of MPI-CUDA and dCUDA for a particle simulation

[Plot: execution time [ms] (0 to 200) vs. number of nodes (2 to 8); curves for dCUDA, MPI-CUDA, and halo exchange.]
Benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

Weak scaling of MPI-CUDA and dCUDA for sparse-matrix vector multiplication

[Plot: execution time [ms] (0 to 200) vs. number of nodes (1, 4, 9); curves for dCUDA, MPI-CUDA, and communication.]
for (int i = 0; i < steps; ++i) {
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] + in[idx + 1] + in[idx - 1]
             + in[idx + jstride] + in[idx - jstride];

  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout, DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out); swap(win, wout);
}

http://spcl.inf.ethz.ch/Polly-ACC
• Automatic
• "Regression free" high performance

dCUDA – distributed memory
• Automatic overlap
• High performance

try now: https://translate.google.de/#en/de/a%20bad%20day%20for%20Europe