Thinking Outside of the Tera-Scale Box
Piotr Luszczek

S1 0950 Luszczek presentation - archive.ll.mit.edu


Page 1:

Thinking Outside of the Tera-Scale Box

Piotr Luszczek

Page 2:

Brief History of Tera-flop: 1997

1997: ASCI Red

Page 3:

Brief History of Tera-flop: 2007

2007: Intel Polaris
1997: ASCI Red

Page 4:

Brief History of Tera-flop: GPGPU

2008: GPGPU
2007: Intel Polaris
1997: ASCI Red

Page 5:

Tera-flop: a Dictionary Perspective

Belongs to a supercomputer expert:
• Consumes 500 kW
• Costs over $1M
• Requires distributed memory programming
• Memory: 1 GiB or more

Belongs to a gaming geek:
• Consumes under 500 W
• Costs under $200
• Requires only a CUDA kernel
• Memory: 1 GiB or more

Page 6:

Double-Dip Recession?

[Chart: productivity over time, dipping at each transition: Multicore (2005), then Hybrid Multicore (2007). Associated programming models: Cilk, Intel TBB, Intel Ct, Apple GCD, MS ConcRT, …; HMPP, PGI Accelerator, PathScale?]

Page 7:

Locality Space in HPC Challenge

[Chart: HPC Challenge benchmarks placed in locality space: HPL and Mat-Mat-Mul; Parallel-Transpose and STREAM; FFT and RandomAccess.]

Page 8:

Locality Space in HPC Challenge: Bounds

[Chart: the same locality space annotated with bounds: HPL and Mat-Mat-Mul are compute-bound, STREAM and Parallel-Transpose are bandwidth-bound, FFT and RandomAccess are latency-bound; the AORSA2D application and the PCIe limit are also marked.]

Page 9:

CUDA vs. OpenCL: Similar Software Stacks

Page 10:

Matrix-Matrix Multiply: the Simple Form

for i = 1:M
  for j = 1:N
    for k = 1:T
      C(i, j) += A(i, k) * B(k, j)

Implementations: CUBLAS, Volkov et al., MAGMA, Nakasato
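The triple loop above can be rendered directly; a minimal Python sketch (loop bounds M, N, T named as on the slide):

```python
def matmul_simple(A, B, M, N, T):
    """The 'simple form' of mat-mat-mul: C(i, j) += A(i, k) * B(k, j)."""
    C = [[0.0] * N for _ in range(M)]   # M x N result, zero-initialized
    for i in range(M):
        for j in range(N):
            for k in range(T):          # T is the shared inner dimension
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Tuned implementations such as CUBLAS, Volkov's kernels, MAGMA, and Nakasato's code keep this arithmetic but reorder and block the loop nest for registers, shared memory, and cache.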

Page 11:

Fermi C2050: from CUDA to OpenCL

[Chart: Performance [Gflop/s], 0 to 700, vs. problem size for five variants: CUDA with texture, CUDA with no texture, OpenCL with no change, OpenCL with good unrolling, OpenCL with bad unrolling.]

Page 12:

Fermi C2050: Simple OpenCL Porting

[Chart: Gflop/s, 0 to 600, vs. problem size (192 to 5952): Native OpenCL Mat-mul vs. Ported OpenCL Mat-mul.]

Page 13:

ATI 5870: Simple OpenCL Porting

[Chart: Gflop/s, 0 to 1600, vs. problem size (192 to 5952): Native OpenCL Mat-mul, Ported OpenCL Mat-mul (inlined indexing), and Ported OpenCL Mat-mul.]

Page 14:

Making Case for Autotuning: Use of Texture

[Chart: Gflop/s, 0 to 600, vs. problem size: Fermi C2050 OpenCL with texture vs. without texture.]

Page 15:

Making Case for Autotuning: Blocking Factors

Provided by NVIDIA
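The autotuning idea can be made concrete with a small illustrative sketch: enumerate candidate blocking factors, time each variant, and keep the fastest. This is a hypothetical host-side example (real GPU autotuners sweep kernel launch parameters); `blocked_matmul`, `autotune`, and the candidate set are assumptions for illustration, not any project's API:

```python
import time

def blocked_matmul(A, B, n, nb):
    """n x n matmul with blocking factor nb (the tunable parameter)."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            for kk in range(0, n, nb):
                # Update one nb x nb tile of C from tiles of A and B.
                for i in range(ii, min(ii + nb, n)):
                    for j in range(jj, min(jj + nb, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + nb, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C

def autotune(A, B, n, candidates=(8, 16, 32, 64)):
    """Time each candidate blocking factor and return the fastest one."""
    best_nb, best_t = None, float("inf")
    for nb in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, n, nb)
        elapsed = time.perf_counter() - t0
        if elapsed < best_t:
            best_nb, best_t = nb, elapsed
    return best_nb
```

The vendor-provided blocking factors on the slide correspond to one fixed point in this search space; an autotuner explores the space per device instead.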

Page 16:

Mat-Mul Portability Summary

Achieved performance:

                        Single precision   Double precision
ATI OpenCL              52%                57%
ATI IL                  74%                87%
NVIDIA CUDA             63%                58%
NVIDIA OpenCL           57%                49%
Multicore (vendor)      79%                79%
Multicore (autotuned)   46%                40%

Page 17:

Now, let's think outside of the box

Page 18:

Recipe for Performance Before Multicore

• Load/Store
• Instr. scheduling
• Wider superscalar instr. issue
• Increase clock frequency
• Increase number of transistors

Page 19:

Recipe for Performance After Multicore

More cores

Parallel software

Page 20:

Recipe for Future Performance

More cores

Better sequential software

DAG Runtime

Page 21:

From Assembly to DAG: PLASMA

• Typical loop in assembly:

  load R0, #1000
  divf F1, F2, F3
  load R1, #1000
  divf F2, F4, F5
  dec R1
  jump …

  A sequential instruction stream, handled by instruction scheduling.

• LU factorization:

  for i = 1:N
    GETRF2(A[i, i])
    for j = i+1:N
      TRSM(A[i, i], A[i, j])
      for k = i+1:N
        GEMM(A[k, i], A[i, j], A[k, j])

  A sequential task stream, handled by DAG scheduling: each GETRF2 feeds its TRSM and GEMM tasks, which in turn feed the next GETRF2.
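The step from loop nest to DAG can be sketched in a few lines: a loop over tiles emits a sequential task stream, and the runtime infers edges from which tiles each task reads and writes. This is a hedged illustration of the idea, not PLASMA's actual API; it builds the DAG only (read-after-write and write-after-write dependences, which suffice for this sketch) and does not execute the factorization:

```python
def lu_task_dag(N):
    """Emit GETRF2/TRSM/GEMM tasks for an N x N tile grid and infer DAG edges."""
    tasks = []          # (name, tiles read, tiles written)
    last_writer = {}    # tile (row, col) -> index of task that last wrote it
    edges = set()       # (producer task index, consumer task index)

    def emit(name, reads, writes):
        tid = len(tasks)
        for tile in reads + writes:
            if tile in last_writer:          # depend on the tile's last writer
                edges.add((last_writer[tile], tid))
        for tile in writes:
            last_writer[tile] = tid
        tasks.append((name, reads, writes))

    for i in range(N):
        emit(f"GETRF2(A[{i},{i}])", [], [(i, i)])
        for j in range(i + 1, N):
            emit(f"TRSM(A[{i},{i}],A[{i},{j}])", [(i, i)], [(i, j)])
            for k in range(i + 1, N):
                emit(f"GEMM(A[{k},{i}],A[{i},{j}],A[{k},{j}])",
                     [(k, i), (i, j)], [(k, j)])
    return tasks, edges
```

For N = 2 this yields the chain from the slide: GETRF2 enables TRSM, TRSM enables GEMM, and GEMM enables the next GETRF2; a DAG runtime is free to run any tasks whose incoming edges are satisfied, rather than following program order.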

Page 22:

Scalability of LU Factorization on Multicore

[Charts: Gflop/s vs. matrix size (1000 to 10000) for PLASMA, MKL 11.0, and LAPACK at 6 cores (1 socket), 24 cores (4 sockets), and 48 cores (8 sockets); AMD Istanbul 2.8 GHz.]

Page 23:

Combining Multicore and GPU for QR Fact.

[Chart: Tflop/s, 0 to 0.6, vs. matrix size (3840 to 19200) for CPUs + GPU, CPUs alone, and GPU alone; AMD Istanbul 2.8 GHz, 24 cores, and Fermi GTX 480.]

Page 24:

Performance on Distributed Memory

[Charts: Tflop/s, 0 to 35, vs. number of cores (108, 192, 432, 768, 1200, 3072) for QR, LU, and Cholesky: DPLASMA vs. Cray LibSci, with peak marked. System: ORNL Kraken, Cray XT5, AMD Istanbul 2.6 GHz, dual-socket 6-core, SeaStar 3D-torus interconnect.]

Page 25:

People and Projects

People:
• Jack Dongarra
• Mathieu Faverge
• Jakub Kurzak
• Hatem Ltaief
• Stan Tomov
• Asim YarKhan
• Peng Du
• Rick Weber

Projects:
• MAGMA
• PLASMA
• DPLASMA and DAGuE:
  – George Bosilca
  – Aurelien Bouteiller
  – Anthony Danalis
  – Thomas Herault
  – Pierre Lemarinier