
XeonPhi

K20

Evghenii Gaburov

Clash of the Titans

..a personal view..

Thursday, June 20, 13

> 1 TFLOP/s

on a desktop


K20X, XeonPhi


192 fp32 cores, 64 fp64 cores, 32 SFU, 32 LD/ST units

64KB L1$+shared

1.5MB L2$

15 SMX @0.73GHz

240 GB/s

32 SIMD width

hardware thread scheduling

255 reg/thread

K20X

2048 threads

in-order execution

image: GK110 whitepaper

1.4 TFLOP/s fp64


XeonPhi (KNC)

software thread scheduling

61 pentium cores @1.1GHz

352 GB/s

16 fp32 SIMD 8 fp64 SIMD

32KB L1$

512KB L2$ shared

512bit SIMD register

32 SIMD reg/thread

4 threads

in-order execution

image: Intel Xeon Phi programming overview

30.5MB L2$

1.1 TFLOP/s fp64


effective # compute units:

K20: 15 SMX x 64 CUDA cores = 960




Xeon Phi: 61 cores x 2 threads x 8 doubles = 976

Xeon E5: 8 cores x 1 thread x 4 doubles = 32

Xeon Phi is much more parallel than Xeon E5!

above all: *algorithm* MUST scale!




~3x for the same number of threads
~3x to get the same performance
~3x if app doesn’t scale ... or worse

image: Intel Xeon Phi programming overview






.. don’t forget Amdahl’s law

P=0.99, N=1: S1=1, 𝛆=100%
P=0.99, N=32: S32=24, 𝛆=75%
P=0.99, N=960: S960=91, 𝛆=9.4%
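The numbers above follow from Amdahl's law, S = 1 / ((1 - P) + P / N), with efficiency 𝛆 = S / N. A minimal sketch in C for recomputing them; the function names are chosen here for illustration and do not come from the talk. Unrounded, N=32 gives S of about 24.4 and 𝛆 of about 76%; the 75% on the slide corresponds to rounding S to 24 first.

```c
#include <stdio.h>

/* Amdahl's law: speed-up on n units when a fraction p of the work is parallel. */
double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

/* Parallel efficiency: achieved speed-up relative to the ideal speed-up n. */
double amdahl_efficiency(double p, double n)
{
    return amdahl_speedup(p, n) / n;
}

/* Print one row of the table for a given parallel fraction and unit count. */
void print_amdahl_row(double p, double n)
{
    printf("P=%.2f N=%4.0f  S=%5.1f  eff=%5.1f%%\n",
           p, n, amdahl_speedup(p, n), 100.0 * amdahl_efficiency(p, n));
}
```

Calling print_amdahl_row(0.99, 960) shows why a code that is "only" 99% parallel uses less than a tenth of a 960-unit device.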


XeonPhi
immature compiler
Intel only, not cheap ($$$)
native/offload

K20
mature compiler
many vendors (CUDA LLVM)
offload only




MPI, OpenMP, POSIX threads, Cilk++, OpenCL, etc..

CUDA C/Fortran, OpenCL, OpenACC, R, Python, Matlab ...




MPI, OpenMP

MPI    Not possible

MPI, MPI+OpenMP

MPI+OpenCL, MPI+....    Not possible

XeonPhi: software scheduling, thread affinity is important

K20: hardware scheduling, no worries about threads


for (int j = 0; j < M; j++)
{
  .. some code
  for (int i = 0; i < N; i++)
  {
    some code
  }
}

#pragma omp parallel for
for (int j = 0; j < M; j++)
{
  .. some code
  for (int i = 0; i < N; i++)
  {
    some code
  }
}

(figure: M x N grid of loop iterations)



say M=64, N=1024 ..

XeonE5: OMP_NUM_THREADS = 8

XeonPhi: OMP_NUM_THREADS = 240

K20X: use CUDA, it works!
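With #pragma omp parallel for on the outer loop only, at most M threads can receive work. A rough sketch of what that means for M=64, assuming every outer iteration costs the same and iterations are split as evenly as possible; the helper names are hypothetical and not taken from the talk.

```c
/* Number of outer-loop iterations given to the busiest thread when m
   iterations are split as evenly as possible over t threads. */
int max_iters_per_thread(int m, int t)
{
    return (m + t - 1) / t;   /* ceil(m / t) */
}

/* Fraction of thread-time spent on useful work, assuming equal cost per
   iteration: m units of work, t threads, ceil(m / t) rounds. */
double thread_utilization(int m, int t)
{
    return (double)m / ((double)t * max_iters_per_thread(m, t));
}
```

For M=64 this gives full utilization with 8 threads (8 iterations each) but only 64/240, roughly 27%, with 240 threads: 176 threads have nothing to do unless the inner loop is parallelized as well.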


M=64, N=1024

max # parallel units: 64x1024 = 64K

much larger than #FPUs

minimize surface-to-volume ratio


(figure: M x N domain tiled into nbx x nby blocks)

{ /* thread-block code */
  bid = blockIdx.x;
  nb  = gridDim.x;
  bx  = bid % nbx;
  by  = bid / nbx;
  nby = nb / nbx;
  compute ib & ie for bx
  compute jb & je for by
  for (int j = jb; j < je; j++)
  {
    .. some thread code
    /* each thread starts at its own offset and strides by the block size */
    for (int i = ib + threadIdx.x; i < ie; i += blockDim.x)
    {
      some thread code
    }
  }
}



#pragma omp parallel
{ /* thread-block code */
  bid = omp_get_thread_num();
  nb  = omp_get_num_threads();
  bx  = bid % nbx;
  by  = bid / nbx;
  nby = nb / nbx;
  compute ib & ie for bx
  compute jb & je for by
  for (int j = jb; j < je; j++)
  {
    .. some thread code
    for (int i = ib; i < ie; i++)
    {
      some thread code
    }
  }
}
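The "compute ib & ie" and "compute jb & je" steps are left open on the slides. One possible way to fill them in is sketched below in plain C, assuming nb is a multiple of nbx and that tiles should differ in size by at most one element; range_t, split_range and tile_of_block are names invented for this sketch.

```c
/* Half-open index range [begin, end). */
typedef struct { int begin, end; } range_t;

/* Split n items into nparts contiguous chunks whose sizes differ by at most
   one, and return the chunk owned by part k (0 <= k < nparts). */
range_t split_range(int n, int nparts, int k)
{
    int base = n / nparts, extra = n % nparts;
    range_t r;
    r.begin = k * base + (k < extra ? k : extra);
    r.end   = r.begin + base + (k < extra ? 1 : 0);
    return r;
}

/* Map a linear block id to its tile of the M x N domain, following the
   slides: bx = bid % nbx, by = bid / nbx, nby = nb / nbx.
   Assumes nb is a multiple of nbx. */
void tile_of_block(int bid, int nb, int nbx, int M, int N,
                   range_t *irange, range_t *jrange)
{
    int bx = bid % nbx, by = bid / nbx, nby = nb / nbx;
    *irange = split_range(N, nbx, bx);   /* ib .. ie along N */
    *jrange = split_range(M, nby, by);   /* jb .. je along M */
}
```

In the OpenMP version, bid and nb would come from omp_get_thread_num() and omp_get_num_threads(); in CUDA, from blockIdx.x and gridDim.x. For example, 240 threads with nbx=16 give nby=15 tiles along M, each 64 elements wide along N.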


CUDA programming model maps well to Xeon Phi

omp_get_thread_num() -> blockIdx

omp_get_num_threads() -> gridDim

omp_get_ .. what? .. -> threadIdx, blockDim






#pragma omp simd ... not that simple

This is where I find the biggest limitation ... the TOOLS!

it doesn’t exist, but very important!

get_simd_lane_index() -> threadIdx
get_simd_size() -> blockDim


#pragma omp parallel
{ /* “thread-block” */
  for (int j = jb; j < je; j++)
  {
    .. some code
    for (int i = ib; i < ie; i++)
    {
      simple code
    }
  }
}

auto-vectorization usually works well for simple cases

#pragma omp parallel
{ /* “thread-block” */
  for (int j = jb; j < je; j++)
  {
    .. some code
#pragma omp simd
    for (int i = ib; i < ie; i++)
    {
      complex code
    }
  }
}

use #pragma omp simd ..





we’re still at the mercy of the compiler...

please compiler have mercy!

do not fight the compiler




#pragma omp parallel
{ /* “thread-block” */
  for (int j = jb; j < je; j++)
  {
    .. some code
#pragma simd
    {
      simdsize = get_simd_size();
      simdlane = get_simd_lane_index();
      /* each lane starts at its own offset and strides by the SIMD width */
      for (int i = ib + simdlane; i < ie; i += simdsize)
      {
        complex code executed by each lane
      }
    }
  }
}

manual vectorization, as in CUDA..
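Since get_simd_size() and get_simd_lane_index() do not exist, the lane-strided indexing can only be emulated. The sketch below does that in plain C for a simple y += a*x update: the lanes run one after another, so it illustrates the index mapping only, not parallel execution. SIMD_SIZE = 8 is an assumption (8 doubles per 512-bit register), and both function names are made up for this example.

```c
#define SIMD_SIZE 8   /* assumed lane count: 8 doubles per 512-bit register */

/* Serial reference: y[i] += a * x[i] for i in [ib, ie). */
void axpy_serial(double a, const double *x, double *y, int ib, int ie)
{
    for (int i = ib; i < ie; i++)
        y[i] += a * x[i];
}

/* Same update written lane by lane: lane `simdlane` starts at ib + simdlane
   and strides by SIMD_SIZE, just as a CUDA thread starts at threadIdx.x and
   strides by blockDim.x. Every index in [ib, ie) is visited exactly once. */
void axpy_lanes(double a, const double *x, double *y, int ib, int ie)
{
    for (int simdlane = 0; simdlane < SIMD_SIZE; simdlane++)
        for (int i = ib + simdlane; i < ie; i += SIMD_SIZE)
            y[i] += a * x[i];
}
```

A SPMD compiler such as the one linked below is meant to run the lane loop across the vector unit instead of sequentially.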



http://ispc.github.com




“The reality is that there is no such thing as a “magic” compiler that will automatically parallelize your code.”

MySerialCode.cpp ./parallel_a.out


Both K20 & Xeon Phi can deliver excellent performance
if algorithm scales and is vectorized.. either auto- or manually

It is not easy to get excellent performance
easier to get performance for bandwidth-bound applications on Xeon Phi than K20
have to worry about thread scheduling & vectorization on Xeon Phi




Xeon Phi can run any code natively (big plus)
this can solve PCIe bottleneck on legacy apps w/o extensive app-modification
minimizes start-up efforts, doesn’t require memory management, code rewrites, etc..

CUDA programming model maps well to Xeon Phi
however, there is a lack of tools to take advantage of this... (Intel SPMD compiler may help)


http://arxiv.org/abs/1302.1078

http://icl.cs.utk.edu/magma/



http://clbenchmark.com



