spcl.inf.ethz.ch @spcl_eth

TORSTEN HOEFLER

Progress in automatic GPU compilation and why you want to run MPI on your GPU

with Tobias Grosser and Tobias Gysi @ SPCL

presented at High Performance Computing, Cetraro, Italy 2016



Page 1: Progress in automatic GPU compilation and why you want to run MPI on your GPU

Page 2: Evading various “ends” – the hardware view

Page 3: Pete’s system software view

Page 4: My software/programming model view

Page 5: Holy grail – auto-parallelization / heterogenization

Automatic accelerator mapping: automatic, regression-free, high performance. How close can we get?

Non-goal: algorithmic changes.

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 6: Tool: Polyhedral Modeling

Iteration space (N = 4): 0 ≤ i, i ≤ N, 0 ≤ j, j ≤ i

D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

(i, j) = (0,0) (1,0) (1,1) (2,0) (2,1) (2,2) (3,0) (3,1) (3,2) (3,3) (4,0) (4,1) (4,2) (4,3) (4,4)

Program code:

for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S(i,j);

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 7: Mapping Computation to Device

Iteration space → device blocks & threads (4×3 thread blocks on a 2×2 grid):

BID = { (i,j) → (⌊i/4⌋ % 2, ⌊j/3⌋ % 2) }

TID = { (i,j) → (i % 4, j % 3) }

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 8: Memory Hierarchy of a Heterogeneous System

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 9: Host-device data transfers

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 10: Host-device data transfers

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 11: Mapping onto fast memory

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 12: Mapping onto fast memory

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 13: Accessed Data (for a 2x2 thread block)

for (i = 1; i <= 6; i++)
  for (j = 1; j <= 4; j++)
    … = A[i+1][j] + A[i-1][j] + A[i][j] + A[i][j-1] + A[i][j+1];

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 14: Accessed Data (for a 2x2 thread block)

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 15: Accessed Data (for a 2x2 thread block)

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 16: Accessed Data (for a 2x2 thread block)

• Data needed on device: 12 elements; minimal data, but complex transfer

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 17: Accessed Data (for a 2x2 thread block)

• One-dimensional hull: 20 elements; simple transfer, but redundant data

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 18: Accessed Data (for a 2x2 thread block)

• Two-dimensional hull: 16 elements; simple transfer, less redundant data

Modeling multi-dimensional access behavior is important.

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 19: Profitability Heuristic

[Decision flow: all loop nests are modeled statically and dynamically; trivial, unsuitable, and insufficient-compute nests are filtered out, and the remaining nests are mapped to the GPU.]

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 20: Some results: Polybench 3.2

Speedup over icc -O3; Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop).

arithmetic mean: ~30x; geometric mean: ~6x

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 21: Compiles all of SPEC CPU 2006 – Example: LBM

[Bar chart: runtime (m:s, 0:00 to 8:24) on a mobile workstation for icc, icc -openmp, clang, and Polly-ACC; chart annotations: ~20%, ~4x.]

Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop); essentially my 4-core x86 laptop with the (free) GPU that’s in there.

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 22: Brave new compiler world!?

Unfortunately not …

• Limited to affine code regions (maybe generalizes to control-restricted programs)
• No distributed anything!!

Good news: much of traditional HPC fits that model, and the infrastructure is coming along.

Bad news: modern data-driven HPC and Big Data fit less well.

Need a programming model for distributed heterogeneous machines!

Page 23: How do we program GPUs today?

[Timeline figure: device compute cores, active threads, instruction latency; interleaved ld/st operations.]

CUDA: over-subscribe the hardware; use spare parallel slack for latency hiding.

MPI: host controlled; full device synchronization.

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 24: Latency hiding at the cluster level?

[Timeline figure: device compute cores, active threads, instruction latency; ld/st operations now interleaved with put operations.]

dCUDA (distributed CUDA): a unified programming model for GPU clusters; avoid unnecessary device synchronization to enable system-wide latency hiding.

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 25: dCUDA extends CUDA with MPI-3 RMA and notifications

for (int i = 0; i < steps; ++i) {
  // computation: iterative stencil kernel, thread-specific idx
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] + in[idx + 1] + in[idx - 1]
             + in[idx + jstride] + in[idx - jstride];

  // communication: device-side puts with notifications
  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout, DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out); swap(win, wout);
}

• computation: iterative stencil kernel; thread-specific idx
• communication: map ranks to blocks; device-side put/get operations; notifications for synchronization; shared and distributed memory

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 26: Hardware supported communication overlap

[Timeline figure comparing traditional MPI-CUDA and dCUDA: device compute cores run active blocks 1–8; MPI-CUDA serializes compute and communication, while dCUDA overlaps them.]

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 27: Implementation of the dCUDA runtime system

[Diagram: host side (MPI, event handler, block manager) and device side (device library, per-block context), connected by logging, command, ack, and notification queues; one queue set per block, repeated for more blocks.]

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 28: Benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

Overlap of a copy kernel with halo exchange.

[Chart: execution time (ms, 0–1000) vs. # of copy iterations per exchange (30, 60, 90), for compute & exchange, compute only, and halo exchange.]

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 29: Benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

Weak scaling of MPI-CUDA and dCUDA for a stencil program.

[Chart: execution time (ms, 0–100) vs. # of nodes (2–8), for dCUDA, MPI-CUDA, and halo exchange.]

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 30: Benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

Weak scaling of MPI-CUDA and dCUDA for a particle simulation.

[Chart: execution time (ms, 0–200) vs. # of nodes (2–8), for dCUDA, MPI-CUDA, and halo exchange.]

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 31: Benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

Weak scaling of MPI-CUDA and dCUDA for sparse-matrix vector multiplication.

[Chart: execution time (ms, 0–200) vs. # of nodes (1, 4, 9), for dCUDA, MPI-CUDA, and communication.]

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 32:

http://spcl.inf.ethz.ch/Polly-ACC

Polly-ACC: automatic, “regression free”, high performance.

dCUDA (distributed memory): automatic overlap, high performance.

try now: https://translate.google.de/#en/de/a%20bad%20day%20for%20Europe