Thinking Outside of the Tera-Scale Box
Piotr Luszczek
Brief History of Tera-flop
• 1997: ASCI Red
• 2007: Intel Polaris
• 2008: GPGPU
Tera-flop: a Dictionary Perspective
• Belongs to supercomputer expert
• Consumes 500 kW
• Costs over $1M
• Requires distributed memory programming
• Memory: 1 GiB or more

• Belongs to gaming geek
• Consumes under 500 W
• Costs under $200
• Requires only a CUDA kernel
• Memory: 1 GiB or more
Double-Dip Recession?
[Figure: productivity versus time, with marks at 2005 (Multicore) and 2007 (Hybrid Multicore).]
• Multicore: Cilk, Intel TBB, Intel Ct, Apple GCD, MS Concrt, …
• Hybrid Multicore: HMPP, PGI Accelerator, PathScale?
Locality Space in HPC Challenge: Bounds
[Figure: HPC Challenge benchmarks placed in locality space: HPL, Mat-Mat-Mul, Parallel-Transpose, STREAM, FFT, RandomAccess. Regions marked: latency-bound, bandwidth-bound, compute-bound. Additional labels: AORSA2D, PCIe.]
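One way to see why Mat-Mat-Mul lands in the compute-bound region while STREAM is bandwidth-bound is to compare flops performed per byte moved. The sketch below is an idealized estimate, not a measurement: it assumes 8-byte words, assumes each matrix crosses the memory interface exactly once, and ignores caches and write-allocate traffic. The function names are hypothetical.

```python
def matmul_intensity(n, bytes_per_word=8):
    """Flops per byte for C += A * B with n-by-n matrices (idealized traffic)."""
    flops = 2 * n**3                          # one multiply and one add per (i, j, k)
    bytes_moved = 4 * n**2 * bytes_per_word   # read A, B, C once; write C once
    return flops / bytes_moved


def triad_intensity(bytes_per_word=8):
    """Flops per byte for the STREAM triad a[i] = b[i] + s * c[i]."""
    flops = 2                                 # one multiply and one add per element
    bytes_moved = 3 * bytes_per_word          # read b[i], c[i]; write a[i]
    return flops / bytes_moved


for n in (256, 1024, 4096):
    print(n, matmul_intensity(n))
print("triad", triad_intensity())
```

Under these assumptions the matrix-multiply ratio grows linearly with n, while the triad ratio stays constant regardless of vector length.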
CUDA vs. OpenCL: Similar Software Stacks
Matrix-Matrix Multiply: the Simple Form
for i = 1:M
  for j = 1:N
    for k = 1:T
      C(i, j) += A(i, k) * B(k, j)
CUBLAS, Volkov et al., MAGMA, Nakasato
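The simple form above can be written out as a runnable reference. This is a minimal sketch in plain Python with 0-based indices and lists of lists; the function name `matmul_simple` is hypothetical. It is useful only as a correctness baseline, not as a representation of any of the tuned implementations named on the slide.

```python
def matmul_simple(A, B, C):
    """C += A * B using the i, j, k loop order of the simple form."""
    M, T, N = len(A), len(B), len(B[0])
    for i in range(M):
        for j in range(N):
            for k in range(T):
                C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = matmul_simple(A, B, [[0.0, 0.0], [0.0, 0.0]])
print(C)
```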
Fermi C2050: from CUDA to OpenCL
[Figure: performance in Gflop/s (axis 0 to 700) versus problem size. Series: CUDA, texture; CUDA, NO texture; OpenCL, no change; OpenCL, good unrolling; OpenCL, bad unrolling.]
Fermi C2050: Simple OpenCL Porting
[Figure: Gflop/s (axis 0 to 600) versus problem size (192 to 5952). Series: Native OpenCL Mat-mul; Ported OpenCL Mat-mul.]
ATI 5870: Simple OpenCL Porting
[Figure: Gflop/s (axis 0 to 1600) versus problem size (192 to 5952). Series: Native OpenCL Mat-mul; Ported OpenCL Mat-mul (inlined indexing); Ported OpenCL.]
Making Case for Autotuning: Use of Texture
[Figure: Gflop/s (axis 0 to 600) versus problem size (192 to 7872). Series: Fermi C2050 OpenCL texture; Fermi C2050 OpenCL NO texture.]
Making Case for Autotuning: Blocking Factors
Provided by NVIDIA
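The idea behind autotuning a blocking factor can be sketched as an empirical search: run the same blocked kernel with each candidate block size, time it, and keep the fastest. The example below is a toy illustration in plain Python, not the tuner used for the results in this talk. The names `matmul_blocked` and `autotune`, the candidate list, and the problem size are all assumptions chosen to keep the run short.

```python
import time


def matmul_blocked(A, B, n, nb):
    """C = A * B for n-by-n lists of lists, computed in nb-by-nb blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    Ci, Ai = C[i], A[i]
                    for k in range(kk, min(kk + nb, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + nb, n)):
                            Ci[j] += a * Bk[j]
    return C


def autotune(n, candidates, repeats=3):
    """Time every candidate blocking factor; return (best_nb, timings)."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(repeats):          # keep the fastest of several runs
            t0 = time.perf_counter()
            matmul_blocked(A, B, n, nb)
            best = min(best, time.perf_counter() - t0)
        timings[nb] = best
    return min(timings, key=timings.get), timings


best_nb, timings = autotune(n=48, candidates=[4, 8, 16, 48])
print("best blocking factor:", best_nb)
```

The winning block size depends on the machine and the problem size, which is the argument for searching rather than hard-coding a value.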
Mat-Mul Portability Summary
Achieved Performance
                        Single precision   Double precision
ATI OpenCL              52%                57%
ATI IL                  74%                87%
NVIDIA CUDA             63%                58%
NVIDIA OpenCL           57%                49%
Multicore (vendor)      79%                79%
Multicore (autotuned)   46%                40%
Now, let’s think outside of the box
Recipe for Performance Before Multicore
Wider superscalar instr. issue
Increase clock frequency
Increase number of transistors
Load/Store
Instr. scheduling
Recipe for Performance After Multicore
More cores
Parallel software
Recipe for Future Performance
More cores
Better sequential software
DAG Runtime
From Assembly to DAG: PLASMA
• Typical loop in assembly (sequential instruction stream → instr. scheduling)
  – load R0, #1000
  – divf F1, F2, F3
  – load R1, #1000
  – divf F2, F4, F5
  – dec R1
  – jump …
• LU factorization (sequential task stream → DAG scheduling)
  – for i = 1:N
  –   GETRF2(A[i, i])
  –   for j = i+1:N
  –     TRSM(A[i, i], A[i, j])
  –     for k = i+1:N
  –       GEMM(A[k, i], A[i, j], A[k, j])
[Figure: task DAG with GETRF2, TRSM, and GEMM nodes.]
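How a sequential task stream becomes a DAG can be sketched as follows: replay the loop nest, record which tiles each task reads and writes, and add an edge whenever a task touches a tile that an earlier task wrote. This is a simplified illustration of the idea, not PLASMA's actual runtime. The function names are hypothetical, indices are 0-based, and the read/write sets are an assumption based on the usual meaning of these kernels (each task updates its last argument in place).

```python
def lu_task_stream(N):
    """Replay the tile LU loop nest; each task is (name, tiles_read, tile_written)."""
    tasks = []
    for i in range(N):
        tasks.append((f"GETRF2({i},{i})", [], (i, i)))
        for j in range(i + 1, N):
            tasks.append((f"TRSM({i},{j})", [(i, i)], (i, j)))
            for k in range(i + 1, N):
                tasks.append((f"GEMM({k},{i},{j})", [(k, i), (i, j)], (k, j)))
    return tasks


def build_dag(tasks):
    """Map each task to the set of earlier tasks it must wait for."""
    last_writer, readers = {}, {}
    deps = {name: set() for name, _, _ in tasks}
    for name, reads, write in tasks:
        for tile in reads + [write]:        # written tile is updated in place
            if tile in last_writer:
                deps[name].add(last_writer[tile])
        for reader in readers.get(write, []):   # do not overwrite a tile still being read
            deps[name].add(reader)
        for tile in reads:
            readers.setdefault(tile, []).append(name)
        last_writer[write] = name
        readers[write] = []
    return deps


def schedule_levels(deps):
    """Group tasks into waves; tasks in one wave can run concurrently."""
    level = {}
    for name in deps:                       # stream order: predecessors come first
        level[name] = 1 + max((level[d] for d in deps[name]), default=-1)
    waves = {}
    for name, lvl in level.items():
        waves.setdefault(lvl, []).append(name)
    return [waves[lvl] for lvl in sorted(waves)]


waves = schedule_levels(build_dag(lu_task_stream(3)))
for step, wave in enumerate(waves):
    print(step, wave)
```

The programmer still writes the sequential loop nest; the parallelism comes from the dependence analysis, which lets independent TRSM and GEMM tasks share a wave.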
Scalability of LU Factorization on Multicore
[Figure: three panels: 6 cores = 1 socket; 24 cores = 4 sockets; 48 cores = 8 sockets. Horizontal axis ticks from 1000 to 10000. Series: PLASMA; MKL 11.0; LAPACK. Platform: AMD Istanbul 2.8 GHz.]
Combining Multicore and GPU for QR Fact.
[Figure: Tflop/s (axis 0 to 0.6) versus problem size (3840 to 19200). Series: CPUs + GPU; CPUs; GPU. Platform: AMD Istanbul 2.8 GHz, 24 cores and Fermi GTX 480.]
Performance on Distributed Memory
[Figure: three panels: QR; LU; Cholesky. Tflop/s (axis 0 to 35) versus number of cores (108, 192, 432, 768, 1200, 3072). Series: Cray LibSci; DPLASMA; Peak.]
Platform: ORNL Kraken, Cray XT5, AMD Istanbul 2.6 GHz, dual-socket 6-core. Interconnect: Seastar, 3D torus.
People and Projects
• Jack Dongarra
• Mathieu Faverge
• Jakub Kurzak
• Hatem Ltaief
• Stan Tomov
• Asim YarKhan
• Peng Du
• Rick Weber

• MAGMA
• PLASMA
• DPLASMA and DAGuE
  – George Bosilca
  – Aurelien Bouteiller
  – Anthony Danalis
  – Thomas Herault
  – Pierre Lemarinier