Clash of the Titans
XeonPhi vs. K20
..a personal view..

Evghenii Gaburov

Thursday, June 20, 2013

> 1 TFLOP/s on a desktop

K20X vs. XeonPhi

K20X

15 SMX @0.73GHz
192 fp32 cores
64 fp64 cores
32 SFU
32 LD/ST unit
64KB L1$+shared
1.5MB L2$
240 GB/s
32 SIMD width
255 reg/thread
2048 threads
in-order execution
hardware thread scheduling
1.4 TFLOP/s fp64

image: GK110 whitepaper

XeonPhi (KNC)

61 pentium cores @1.1GHz
16 fp32 SIMD
8 fp64 SIMD
32KB L1$
512KB L2$ shared
30.5MB L2$
352 GB/s
512bit SIMD register
32 SIMD reg/thread
4 threads
in-order execution
software thread scheduling
1.1 TFLOP/s fp64

image: Intel Xeon Phi programming overview
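
The peak fp64 figures on the two hardware slides can be roughly reproduced from the unit counts and clock rates. The sketch below assumes each fp64 lane retires one fused multiply-add (2 flops) per cycle; that factor is an assumption, not something the slides state, and `peak_gflops` is a hypothetical helper name.

```c
#include <assert.h>

/* Theoretical peak in GFLOP/s:
   units x fp64 lanes per unit x flops per lane per cycle x clock (GHz).
   flops_per_cycle = 2 assumes one fused multiply-add per lane per cycle;
   that is an assumption, not something stated on the slides. */
double peak_gflops(int units, int lanes, int flops_per_cycle, double ghz)
{
    assert(units > 0 && lanes > 0 && flops_per_cycle > 0 && ghz > 0.0);
    return (double)units * lanes * flops_per_cycle * ghz;
}
```

With the slide numbers, `peak_gflops(15, 64, 2, 0.73)` comes to about 1400 GFLOP/s and `peak_gflops(61, 8, 2, 1.1)` to about 1070 GFLOP/s, in line with the quoted 1.4 and 1.1 TFLOP/s.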

effective # compute units:

K20: 15 SMX x 64 CUDA cores = 960

Xeon Phi: 61 core x 2 threads x 8 double = 976

Xeon E5: 8 core x 1 thread x 4 double = 32

Xeon Phi is much more parallel than Xeon E5!

above all: *algorithm* MUST scale!

[figure: Intel Xeon Phi programming overview]

annotations on the figure:
~3x for the same number of threads
~3x to get the same performance
if app doesn’t scale ... or worse

.. don’t forget Amdahl’s law

P=0.99, N=1:    S1=1,    ε=100%
P=0.99, N=32:   S32=24,  ε=75%
P=0.99, N=960:  S960=91, ε=9.4%
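
The numbers above follow from Amdahl's law, S = 1 / ((1 - P) + P/N), with efficiency ε = S/N. A minimal sketch for checking them; the function names are illustrative and not from the slides:

```c
#include <assert.h>

/* Amdahl's law: speedup on n units when a fraction p of the work is parallel. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

/* Parallel efficiency: achieved speedup divided by the number of units. */
double amdahl_efficiency(double p, int n)
{
    return amdahl_speedup(p, n) / n;
}
```

For P=0.99 this gives a speedup of about 24.4 at N=32 and about 90.7 at N=960. The unrounded efficiency at N=32 is closer to 76%; the slide's 75% corresponds to the rounded speedup 24/32.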

XeonPhi                                            | K20
immature compiler                                  | mature compiler
Intel only, not cheap ($$$)                        | many vendors (CUDA LLVM)
native/offload                                     | offload only
MPI, OpenMP, POSIX threads, Cilk++, OpenCL, etc..  | CUDA C/Fortran, OpenCL, OpenACC, R, Python, Matlab ...
MPI                                                | Not possible
MPI, MPI+OpenMP, MPI+OpenCL, MPI+....              | Not possible
software scheduling: thread affinity is important  | hardware scheduling: no worries about threads

#pragma omp parallel for
for (int j = 0; j < M; j++)
{
  /* .. some code */
  for (int i = 0; i < N; i++)
  {
    /* some code */
  }
}

[diagram: M x N iteration space]

say M=64, N=1024 ..

XeonE5: OMP_NUM_THREADS = 8

XeonPhi: OMP_NUM_THREADS = 240

K20X: use CUDA, it works!
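
One way to read the M=64 example: `#pragma omp parallel for` on the outer loop hands out at most M iterations, so with more threads than iterations some threads receive no work. The sketch below counts this, assuming only the outer loop is parallelized (no nested parallelism, no `collapse`); `busy_threads` and `thread_utilization` are hypothetical helper names.

```c
#include <assert.h>

/* Upper bound on the number of threads that can hold an iteration when a
   loop of `iterations` trips is divided among `threads` threads. */
int busy_threads(int iterations, int threads)
{
    assert(iterations >= 0 && threads > 0);
    return iterations < threads ? iterations : threads;
}

/* Fraction of the threads that can have work to do. */
double thread_utilization(int iterations, int threads)
{
    return (double)busy_threads(iterations, threads) / threads;
}
```

With M=64, all 8 threads of the Xeon E5 are busy, while at most 64 of the 240 Xeon Phi threads (about 27%) get any outer-loop iterations.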

[diagram: M x N iteration space, M=64, N=1024]

max # parallel units: 64x1024 = 64K
much larger than #FPUs

minimize surface-to-volume ratio
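
To make the surface-to-volume remark concrete, the sketch below computes the perimeter-to-area ratio of one tile for a given block grid. Treating the tile perimeter as a proxy for data shared with neighbouring tiles is an assumption; the slides do not define the ratio, and `tile_surface_to_volume` is a hypothetical name.

```c
#include <assert.h>

/* Perimeter-to-area ("surface-to-volume") ratio of one tile when an
   n x m iteration space is split into nbx x nby equal tiles.
   Assumes nbx divides n and nby divides m. */
double tile_surface_to_volume(int n, int m, int nbx, int nby)
{
    assert(nbx > 0 && nby > 0 && n % nbx == 0 && m % nby == 0);
    double w = (double)n / nbx;   /* tile width  along i */
    double h = (double)m / nby;   /* tile height along j */
    return 2.0 * (w + h) / (w * h);
}
```

For N=1024, M=64 split into 64 tiles, one row per tile (nbx=1, nby=64) gives a ratio of about 2.0, while square 32x32 tiles (nbx=32, nby=2) give 0.125.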

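[diagram: M x N iteration space split into nbx x nby blocks]

{ /* thread-block code */
  bid = blockIdx.x;
  nb  = gridDim.x;
  bx  = bid % nbx;
  by  = bid / nbx;
  nby = nb / nbx;
  /* compute ib & ie for bx */
  /* compute jb & je for by */
  for (int j = jb; j < je; j++)
  {
    /* .. some thread code */
    /* each thread starts at its own offset and strides by the block size */
    for (int i = ib + threadIdx.x; i < ie; i += blockDim.x)
    {
      /* some thread code */
    }
  }
}

minimize surface-to-volume ratio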

[diagram: M x N iteration space split into nbx x nby blocks]

#pragma omp parallel
{ /* thread-block code */
  bid = omp_get_thread_num();
  nb  = omp_get_num_threads();
  bx  = bid % nbx;
  by  = bid / nbx;
  nby = nb / nbx;
  /* compute ib & ie for bx */
  /* compute jb & je for by */
  for (int j = jb; j < je; j++)
  {
    /* .. some thread code */
    for (int i = ib; i < ie; i++)
    {
      /* some thread code */
    }
  }
}

minimize surface-to-volume ratio
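
The "compute ib & ie for bx" and "compute jb & je for by" steps are left open on the slide. One possible way to fill them in is an even split of each dimension; this is an assumption, not necessarily what the author used, and both helper names are hypothetical.

```c
#include <assert.h>

/* Half-open range [*begin, *end) owned by block b when n iterations are
   split as evenly as possible over nblocks blocks. */
void block_range(int n, int nblocks, int b, int *begin, int *end)
{
    assert(n >= 0 && nblocks > 0 && b >= 0 && b < nblocks);
    *begin = (int)((long long)n * b / nblocks);
    *end   = (int)((long long)n * (b + 1) / nblocks);
}

/* Map a linear block id onto an nbx x nby grid and return its tile:
   i in [*ib, *ie) and j in [*jb, *je). Assumes nb is a multiple of nbx. */
void tile_for_block(int bid, int nb, int nbx, int n, int m,
                    int *ib, int *ie, int *jb, int *je)
{
    assert(nbx > 0 && nb % nbx == 0 && bid >= 0 && bid < nb);
    int bx  = bid % nbx;
    int by  = bid / nbx;
    int nby = nb / nbx;
    block_range(n, nbx, bx, ib, ie);
    block_range(m, nby, by, jb, je);
}
```

Inside the parallel region, each thread would call `tile_for_block(bid, nb, nbx, N, M, &ib, &ie, &jb, &je)` once and then run the two loops over its own tile. With 240 threads and nbx=16, the M=64 rows are split over 15 block rows of 4 or 5 rows each.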

CUDA programming model maps well to Xeon Phi

omp_get_thread_num()   <->  blockIdx
omp_get_num_threads()  <->  gridDim
omp_get_ .. what? ..   <->  threadIdx, blockDim

#pragma omp simd ... not that simple

This is where I find the biggest limitation ... the TOOLS!

it doesn’t exist, but very important!

get_simd_lane_index()  <->  threadIdx
get_simd_size()        <->  blockDim

#pragma omp parallel
{ /* “thread-block” */
  for (int j = jb; j < je; j++)
  {
    /* .. some code */
    for (int i = ib; i < ie; i++)
    {
      /* simple code */
    }
  }
}

auto-vectorization usually works well for simple cases

it doesn’t exist, but very important!

get_simd_lane_index()  <->  threadIdx
get_simd_size()        <->  blockDim

#pragma omp parallel
{ /* “thread-block” */
  for (int j = jb; j < je; j++)
  {
    /* .. some code */
#pragma omp simd
    for (int i = ib; i < ie; i++)
    {
      /* complex code */
    }
  }
}

auto-vectorization usually works well for simple cases
use #pragma omp simd ..
we’re still at the mercy of the compiler...

please compiler have mercy!

do not fight the compiler

it doesn’t exist, but very important!

get_simd_lane_index()  <->  threadIdx
get_simd_size()        <->  blockDim

#pragma omp parallel
{ /* “thread-block” */
  for (int j = jb; j < je; j++)
  {
    /* .. some code */
#pragma simd
    {
      simdsize = get_simd_size();
      simdlane = get_simd_lane_index();
      for (int i = ib; i < ie; i += simdsize)
      {
        /* complex code executed by each lane */
      }
    }
  }
}

auto-vectorization usually works well for simple cases
use #pragma omp simd ..
manual vectorization, as in CUDA..
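
Since `get_simd_size()` and `get_simd_lane_index()` do not exist, the lane-oriented style can only be imitated in plain C. The sketch below strip-mines the inner loop into chunks of a fixed width, with an explicit lane loop standing in for `threadIdx`. `SIMD_WIDTH = 8` assumes the 8 fp64 lanes from the hardware slide, `axpy_lanes` is a hypothetical example kernel, and whether a compiler actually vectorizes the lane loop is not guaranteed.

```c
#include <assert.h>

#define SIMD_WIDTH 8  /* assumed: 8 fp64 lanes per 512-bit register */

/* y[i] += a * x[i] for i in [ib, ie), written "as in CUDA":
   the outer loop advances one SIMD-width chunk at a time and the inner
   loop plays the role of the lanes (threadIdx), with SIMD_WIDTH standing
   in for blockDim. */
void axpy_lanes(int ib, int ie, double a, const double *x, double *y)
{
    assert(ib <= ie);
    for (int i = ib; i < ie; i += SIMD_WIDTH) {
        for (int lane = 0; lane < SIMD_WIDTH; lane++) {
            int idx = i + lane;
            if (idx < ie)           /* mask off lanes past the end */
                y[idx] += a * x[idx];
        }
    }
}
```

The `if (idx < ie)` test is the scalar counterpart of a lane mask: the last chunk may be only partly filled, just as the last warp of a CUDA block may have idle threads.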

http://ispc.github.com

“The reality is that there is no such thing as a “magic” compiler that will automatically parallelize your code.”

MySerialCode.cpp -> ./parallel_a.out

Both K20 & Xeon Phi can deliver excellent performance
  if algorithm scales and is vectorized.. either auto- or manually

It is not easy to get excellent performance
  easier to get performance for bandwidth-bound applications on Xeon Phi than K20
  have to worry about thread scheduling & vectorization on Xeon Phi

Xeon Phi can run any code natively (big plus)
  this can solve PCIe bottleneck on legacy apps w/o extensive app-modification
  minimizes start-up efforts, doesn’t require memory management, code rewrites, etc..

CUDA programming model maps well to Xeon Phi
  however, there is lack of tools to take advantage of this... (Intel SPMD compiler may help)

http://arxiv.org/abs/1302.1078

http://icl.cs.utk.edu/magma/

http://clbenchmark.com