Astrosim 2017
Les « Houches » the 7th ;-)
GPU: the disruptive technology of the 21st century
A little comparison between CPU, MIC and GPU
Emmanuel Quémener



Page 1: Astrosim 2017 Les « Houches » the 7th ;-)

Astrosim 2017

Les « Houches » the 7th ;-)

GPU: the disruptive technology of the 21st century

A little comparison between CPU, MIC and GPU

Emmanuel Quémener

Page 2: Astrosim 2017 Les « Houches » the 7th ;-)

Emmanuel QUÉMENER CC BY-NC-SA 2/48September 29, 2017

A (human) generation ago...
A serial B movie in 1984

● 1984: The Last Starfighter
  – 27 minutes of synthetic images
  – 22.5×10⁹ operations per image
  – Computed on a Cray X-MP (130 kW)
  – 66 days of computation (in fact, a full year was needed)

● 2016: on a GTX 1080Ti (250 W)
  – 3.4 seconds
  – GTX 1080Ti / Cray comparison:
    ● Performance: x1,600,000!
    ● Energy consumption: ~x900,000,000!
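The slide's two ratios can be checked with back-of-the-envelope arithmetic, assuming the quoted figures (66 days vs 3.4 s, 130 kW vs 250 W):

```python
# Sanity check of the Cray X-MP vs GTX 1080Ti comparison on the slide.
cray_time_s = 66 * 86400        # 66 days of computation on the Cray
gtx_time_s = 3.4                # seconds on a GTX 1080Ti
speedup = cray_time_s / gtx_time_s

cray_power_w = 130_000          # Cray X-MP: 130 kW
gtx_power_w = 250               # GTX 1080Ti TDP: 250 W
energy_ratio = speedup * cray_power_w / gtx_power_w

print(f"speedup ~ {speedup:.2e}")        # ~1.7e6, matching "x1,600,000"
print(f"energy ratio ~ {energy_ratio:.2e}")  # ~8.7e8, matching "~x900,000,000"
```

The energy ratio is simply the speedup multiplied by the power ratio (520), which recovers the slide's order of magnitude.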

Page 3: Astrosim 2017 Les « Houches » the 7th ;-)


« The beginning of all things »
● « Nvidia Launches Tesla Personal Supercomputer »
● When: 19/11/2008. Who: Nvidia
● Where: on Tom's Hardware
● What: a C1060 PCIe card with 240 cores
● How much: 933 GFlops SP (but 78 GFlops DP)

Page 4: Astrosim 2017 Les « Houches » the 7th ;-)


What position for GPUs?
The other « accelerators »
● Accelerators, or the old history of coprocessors...
  – 1980: 8087 (for the 8086/8088), for floating-point operations
  – 1989: 80387 (for the 80386) and compliance with IEEE 754
  – 1990: 80486DX and the integration of the FPU inside the CPU
  – 1997: K6 3DNow! & Pentium MMX: SIMD inside the CPU
  – 1999: SSE instructions and the birth of a long series (up to SSE4 & AVX)
● When chips stand apart from the CPU
  – 1998: DSPs such as the TMS320C67x as tools
  – 2008: the Cell inside the PS3 and inside IBM's Roadrunner, #1 of the Top500
● A business of compilers & very specific programming models

Page 5: Astrosim 2017 Les « Houches » the 7th ;-)


Why is a GPU so powerful?
To construct 3D scenes!
● 2 approaches:
  – Raytracing: PovRay
  – Shading: 3 operations
● Raytracing:
  – Starting from the eye towards the objects of the scene
● Shading:
  – Model2World: vectorial objects placed in the scene
  – World2View: projection of the objects onto a viewing plane
  – View2Projection: pixelisation of the vectorial viewing plane

Page 6: Astrosim 2017 Les « Houches » the 7th ;-)


Why is a GPU so powerful?
Shading & matrix computing
● Model2World: 3 matrix products
  – Rotation
  – Translation
  – Scaling
● World2View: 2 matrix products
  – Camera position
  – Direction of the point aimed at
● View2Projection
  – Pixelisation

A GPU: a « huge » Matrix Multiplier (M2W, then W2V, then V2P)
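As a sketch of what this pipeline computes, here is a hypothetical Model2World composition in homogeneous coordinates with numpy; the matrices and the example vertex are illustrative, not taken from the deck:

```python
import numpy as np

# Model2World built from Scaling, Rotation and Translation, as 4x4
# homogeneous matrices (the three matrix products named on the slide).
def scaling(s):
    return np.diag([s, s, s, 1.0])

def rotation_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0, 0],
                     [s,  c, 0, 0],
                     [0,  0, 1, 0],
                     [0,  0, 0, 1]])

def translation(tx, ty, tz):
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

# Order matters: scale first, then rotate, then translate.
M2W = translation(1, 2, 0) @ rotation_z(np.pi / 2) @ scaling(2.0)
p = np.array([1.0, 0.0, 0.0, 1.0])   # a vertex in model space
world = M2W @ p
print(world[:3])  # scaled to (2,0,0), rotated to (0,2,0), translated to (1,4,0)
```

Applying the same composed matrix to every vertex of a scene is exactly the massively repeated matrix work a GPU is built for.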

Page 7: Astrosim 2017 Les « Houches » the 7th ;-)


Is the GPU really so powerful?
With my best MyriALUs...

[Chart: xGEMM throughput (0 to 10000, OpenBLAS / clBLAS / Thunking / CuBLAS, each in DP and SP) on Tesla P100, GTX 1080Ti, R9 Fury, Xeon Phi, Ryzen7 1800 and 2x E5-2680v4. Groups: Tesla for Pros, GTX for Gamers, AMD for Alt*, CPU]

Page 8: Astrosim 2017 Les « Houches » the 7th ;-)


BLAS match: CPU vs GPU
xGEMM when x=D(P) or S(P)

[Chart: OpenBLAS / clBLAS / Thunking / CuBLAS results (0 to 10000, DP and SP) across Tesla C1060, M2090, K20m, K80, P100, Quadro K4000, GTX 560Ti to 1080Ti, HD7970, R9-295x, R9 Fury, Nano, Xeon Phi, FX9590, Ryzen7 1800 and Intel Xeons from E6300 and X5550 x2 to E5-2680v4 x2. Groups: Tesla for Pros, GTX for Gamers, AMD for Alt*, CPU]

Page 9: Astrosim 2017 Les « Houches » the 7th ;-)


What Nvidia never broadcasts...
xGEMM over a large domain...

[Chart: xGEMM performance vs. matrix size for Tesla P100 and GTX 1080]

Yes, performance exists, in specific conditions !

Page 10: Astrosim 2017 Les « Houches » the 7th ;-)


How a GPU must be considered
A laboratory of parallelism
● You've got dozens of thousands of Equivalent PUs!
● You have the choice between: CPU, GPU, MIC

Page 11: Astrosim 2017 Les « Houches » the 7th ;-)


How many elements inside?
Differences between GPU & CPU
● Operations
  – Matrix multiply
  – Vectorization
  – Pipelining
  – Shader (multi)processors
● Programming: 1993
  – OpenGL, Glide, Direct3D, ...
● Genericity: 2002
  – Cg Toolkit, CUDA, OpenCL


Page 12: Astrosim 2017 Les « Houches » the 7th ;-)


Why is the GPU so powerful?
Coming back to generic processing

Inside the GPU:
● Specialized processing units
● Pipelining efficiency

Adaptation loss:
● Changing scenes
● Different nature of details

The solution: more generalist processing units!

Page 13: Astrosim 2017 Les « Houches » the 7th ;-)


Why is the GPU so powerful?
All the « M » of Flynn's taxonomy included
● Vectorial: SIMD (Single Instruction Multiple Data)
  – Addition of 2 positions (x,y,z): one single instruction
● Parallel: MIMD (Multiple Instructions Multiple Data)
  – Several executions in parallel on (almost) the same data
● In fact, SIMT: Single Instruction Multiple Threads
  – All processing units share the threads
  – Each processing unit can work independently from the others
● Need to synchronize the threads

Page 14: Astrosim 2017 Les « Houches » the 7th ;-)


Important dates

● 1992-01: OpenGL and the birth of a standard
● 1998-03: OpenGL 1.2 and interesting functions
● 2002-12: Cg Toolkit (Nvidia) and language extensions
  – Wrappers for all languages (Python)
● 2007-06: CUDA (Nvidia), or the arrival of a real language
● 2008-06: Snow Leopard (Apple) integrates OpenCL
  – The will to make the best use of one's machine?
● 2008-11: OpenCL 1.0 and its first specifications
● 2011-04: WebCL and its first version by Nokia

Page 15: Astrosim 2017 Les « Houches » the 7th ;-)


Who uses OpenCL?
As applications:
● OS: Apple in MacOSX
● « Big » applications:
  – LibreOffice
  – FFmpeg
● Well-known applications:
  – Graphical: PhotoScan, ...
  – Scientific: OpenMM, Gromacs, Lammps, Amber, ...
● Hundreds of software packages and libraries! But checking is essential...
  – https://www.khronos.org/opencl/resources/opencl-applications-using-opencl

Page 16: Astrosim 2017 Les « Houches » the 7th ;-)


Why GPU is a disruptive technology
How it reshuffled the cards in scientific computing

● At a conference in February 2006, this prediction: x100 in 5 years
● Between 2000 and 2015:
  – GeForce2 Go to GTX 980Ti: from 286 to 2,816,000 MOperations/s: x10000
  – Classical CPUs: x100
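The two growth factors can also be expressed as annual rates (a derived figure, not from the slide):

```python
# Converting the x10000 GPU gain and x100 CPU gain over 15 years (2000-2015)
# into equivalent compound annual growth rates.
gpu_factor, cpu_factor, years = 10_000, 100, 15
gpu_rate = gpu_factor ** (1 / years) - 1   # roughly 85% per year
cpu_rate = cpu_factor ** (1 / years) - 1   # roughly 36% per year
print(f"GPU: ~{gpu_rate:.0%}/year, CPU: ~{cpu_rate:.0%}/year")
```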

Page 17: Astrosim 2017 Les « Houches » the 7th ;-)


Just a reminder of last Friday!

● OpenCL is the most versatile one
● CUDA can ONLY be used with Nvidia GPUs
● Accelerators seem to be the most universal, but...

Parallel programming models

           Cluster   Node CPU   Node GPU   Node Nvidia   Accelerator
MPI        Yes       Yes        No         No            Yes*
PVM        Yes       Yes        No         No            Yes*
OpenMP     No        Yes        No         No            Yes*
Pthreads   No        Yes        No         No            Yes*
OpenCL     No        Yes        Yes        Yes           Yes
CUDA       No        No         No         Yes           No
TBB        No        Yes        No         No            Yes*

Page 18: Astrosim 2017 Les « Houches » the 7th ;-)


Act as an integrator: do not reinvent the wheel!

● Classic libraries can be used on GPU
  – Nvidia provides lots of implementations
  – OpenCL provides several of them, but not as complete

Parallel programming libraries

         Cluster          Node CPU        Node GPU   Node Nvidia   Accelerator
BLAS     BLACS, MKL       OpenBLAS, MKL   clBLAS     CuBLAS        OpenBLAS, MKL
LAPACK   Scalapack, MKL   Atlas, MKL      clMAGMA    MAGMA         MagmaMIC
FFT      FFTw3            FFTw3           clFFT      CuFFT         FFTw3
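As a minimal example of the integrator approach, assuming a numpy built against an optimized BLAS (OpenBLAS, MKL, ...), a GEMM is just a matmul call, with no kernel to write:

```python
import numpy as np

# numpy's matmul delegates xGEMM to whatever BLAS numpy was built
# against: SGEMM here, since the operands are float32.
rng = np.random.default_rng(42)
a = rng.random((256, 256), dtype=np.float32)
b = rng.random((256, 256), dtype=np.float32)
c = a @ b

# Check one entry against a slow float64 reference dot product.
ref = float(np.dot(a[0].astype(np.float64), b[:, 0].astype(np.float64)))
assert abs(float(c[0, 0]) - ref) < 1e-2
```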

Page 19: Astrosim 2017 Les « Houches » the 7th ;-)


Why OpenCL? A non-politically-correct history
● In the mid-2000s, Apple needed power for MacOSX
  – Some processing could be done by Nvidia chipsets
  – CUDA existed as a good successor of the Cg Toolkit
● But Apple did not want to be jailed
  – (as their users are :-) )
● Apple promoted the creation of the Khronos consortium
  – AMD, IBM and ARM came quickly
  – ... and Nvidia, Intel, Altera, ... came reluctantly...

Page 20: Astrosim 2017 Les « Houches » the 7th ;-)


What OpenCL offers
10 implementations on x86
● 3 implementations for CPU:
  – The AMD one: the original
  – The Intel one: efficient at specific parallel rates
  – PortableCL (POCL): open source
● 4 implementations for GPU:
  – The AMD one: for AMD/ATI graphics boards
  – The Nvidia one: for Nvidia graphics boards
  – « Beignet »: the open-source one for Intel graphics chipsets
  – « Mesa »: the open-source one for all open-source drivers in the Linux kernel
● 1 implementation for accelerators: the Intel one for the Xeon Phi
● 1 implementation for FPGAs: the Intel one for Altera FPGAs
● On other platforms, it seems to be possible

Page 21: Astrosim 2017 Les « Houches » the 7th ;-)


Goal: replace loops by distribution
2 types of distribution

● Blocks / WorkItems
  – « Global » domain: a large but slow amount of memory available
● Threads
  – « Local » domain: a small but fast amount of memory available
  – Needed for the synchronisation of processes
● Different accesses to memory
● « clinfo » gives the properties of the CL devices, e.g. Max work items
  – Max work items: 1024x1024x1024 for a CPU, so about 1.1 billion distributed items
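The two-level distribution can be sketched in plain Python (1D case, illustrative sizes): a flat global index splits into a work-group id (Blocks) and a local id (Threads), and only work-items of the same group share the fast local memory.

```python
# get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)
global_size, local_size = 16, 4
layout = []
for gid in range(global_size):
    group_id, local_id = divmod(gid, local_size)
    assert gid == group_id * local_size + local_id
    layout.append((group_id, local_id))

n_groups = len(set(g for g, _ in layout))
print(n_groups, "work-groups of", local_size, "work-items")
```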

Page 22: Astrosim 2017 Les « Houches » the 7th ;-)


How to do it? A little example...
A « Hello World » in OpenCL...
● Define two vectors in ASCII
● Duplicate them into 2 huge vectors
● Add them using an OpenCL kernel
● Print the result on screen

Addition of 2 vectors: a + b = c
For each n: c[n] = a[n] + b[n]

Hello OpenCL World! Hello OpenCL World! Hello OpenCL World! (repeated)

The End

Page 23: Astrosim 2017 Les « Houches » the 7th ;-)


Kernel Code & Build and Call
Adding vectors in OpenCL

__kernel void VectorAdd(__global int* c, __global int* a, __global int* b)
{
    // Index of the elements to add
    unsigned int n = get_global_id(0);
    // Sum the n-th elements of vectors a and b and store the result in c
    c[n] = a[n] + b[n];
}

OpenCLProgram = cl.Program(ctx, OpenCLSource).build()
OpenCLProgram.VectorAdd(queue, HostVector1.shape, None,
                        GPUOutputVector, GPUVector1, GPUVector2)

Page 24: Astrosim 2017 Les « Houches » the 7th ;-)


How to OpenCL?
Write « Hello World! » in C

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

// The OpenCL kernel, embedded as 7 source strings
const char* OpenCLSource[] = {
    "__kernel void VectorAdd(__global int* c, __global int* a, __global int* b)",
    "{",
    "  // Index of the elements to add \n",
    "  unsigned int n = get_global_id(0);",
    "  // Sum the n'th element of vectors a and b and store in c \n",
    "  c[n] = a[n] + b[n];",
    "}"
};

int InitialData1[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};

#define SIZE 2048

int main(int argc, char **argv)
{
    int HostVector1[SIZE], HostVector2[SIZE];
    for (int c = 0; c < SIZE; c++) {
        HostVector1[c] = InitialData1[c%20];
        HostVector2[c] = InitialData2[c%20];
    }
    cl_platform_id cpPlatform;
    clGetPlatformIDs(1, &cpPlatform, NULL);
    cl_int ciErr1;
    cl_device_id cdDevice;
    ciErr1 = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &cdDevice, NULL);
    cl_context GPUContext = clCreateContext(0, 1, &cdDevice, NULL, NULL, &ciErr1);
    cl_command_queue cqCommandQueue = clCreateCommandQueue(GPUContext, cdDevice, 0, NULL);
    cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                       sizeof(int) * SIZE, HostVector1, NULL);
    cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                       sizeof(int) * SIZE, HostVector2, NULL);
    cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
                                            sizeof(int) * SIZE, NULL, NULL);
    cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource,
                                                         NULL, NULL);
    clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
    cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);
    clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUOutputVector);
    clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1);
    clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2);
    size_t WorkSize[1] = {SIZE}; // one-dimensional range
    clEnqueueNDRangeKernel(cqCommandQueue, OpenCLVectorAdd, 1, NULL, WorkSize,
                           NULL, 0, NULL, NULL);
    int HostOutputVector[SIZE];
    clEnqueueReadBuffer(cqCommandQueue, GPUOutputVector, CL_TRUE, 0,
                        SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);
    clReleaseKernel(OpenCLVectorAdd);
    clReleaseProgram(OpenCLProgram);
    clReleaseCommandQueue(cqCommandQueue);
    clReleaseContext(GPUContext);
    clReleaseMemObject(GPUVector1);
    clReleaseMemObject(GPUVector2);
    clReleaseMemObject(GPUOutputVector);
    for (int Rows = 0; Rows < (SIZE/20); Rows++) {
        printf("\t");
        for (int c = 0; c < 20; c++) {
            printf("%c", (char)HostOutputVector[Rows * 20 + c]);
        }
    }
    printf("\n\nThe End\n\n");
    return 0;
}

Page 25: Astrosim 2017 Les « Houches » the 7th ;-)


How to OpenCL?
Write « Hello World! » in Python

import pyopencl as cl
import numpy
import sys

# The OpenCL kernel
OpenCLSource = """
__kernel void VectorAdd(__global int* c, __global int* a, __global int* b)
{
  // Index of the elements to add
  unsigned int n = get_global_id(0);
  // Sum the n'th element of vectors a and b and store in c
  c[n] = a[n] + b[n];
}
"""

InitialData1=[37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17]
InitialData2=[35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15]

SIZE=2048

HostVector1=numpy.zeros(SIZE).astype(numpy.int32)
HostVector2=numpy.zeros(SIZE).astype(numpy.int32)
for c in range(SIZE):
    HostVector1[c] = InitialData1[c%20]
    HostVector2[c] = InitialData2[c%20]

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
GPUVector1 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=HostVector1)
GPUVector2 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=HostVector2)
GPUOutputVector = cl.Buffer(ctx, mf.WRITE_ONLY, HostVector1.nbytes)

OpenCLProgram = cl.Program(ctx, OpenCLSource).build()
# Kernel call
OpenCLProgram.VectorAdd(queue, HostVector1.shape, None,
                        GPUOutputVector, GPUVector1, GPUVector2)

HostOutputVector = numpy.empty_like(HostVector1)
cl.enqueue_copy(queue, HostOutputVector, GPUOutputVector)
GPUVector1.release()
GPUVector2.release()
GPUOutputVector.release()

OutputString = ''
for rows in range(SIZE//20):     # integer division (Python 3)
    OutputString += '\t'
    for c in range(20):
        OutputString += chr(HostOutputVector[rows*20+c])
print(OutputString)
sys.stdout.write("\nThe End\n\n")

Page 26: Astrosim 2017 Les « Houches » the 7th ;-)


How to OpenCL?
« Hello World » in C vs Python: weighing them

● On the previous OpenCL implementations:
  – In C: 75 lines, 262 words, 2848 bytes
  – In Python: 51 lines, 137 words, 1551 bytes
  – Factors: 0.68, 0.52 and 0.54 in lines, words and bytes
● Programming OpenCL:
  – A difficult programming context
    ● « Opening the box is more difficult than assembling the furniture! »
    ● A need to simplify the calls through an API
  – No compatibility between the AMD and Nvidia SDKs
    ● Everyone rewrites their own API
● One solution, or rather « The » solution: Python

Page 27: Astrosim 2017 Les « Houches » the 7th ;-)


Which simple code to implement in parallel?
PiMC: Pi by the dartboard method

● The historical example of the Monte Carlo method
● Parallel implementation:
  – Equivalent distribution of the iterations
● From 2 to 4 parameters:
  – Number of iterations
  – Parallel Rate (PR)
  – (Type of variable: INT32, INT64, FP32, FP64)
  – (RNG: MWC, CONG, SHR3, KISS)
● 2 simple observables:
  – Estimation of Pi (indicative only, Pi not being rational :-))
  – Elapsed time (individual or global)
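A serial Python sketch of this dartboard estimator, with a `pr` parameter mimicking the equivalent distribution of iterations across work-items (the function name and seed are illustrative):

```python
import random

def pi_mc(iterations, pr, seed=12345):
    rng = random.Random(seed)
    per_worker = iterations // pr          # equivalent distribution of iterations
    inside = 0
    for _ in range(pr):                    # one pass per "work-item"
        for _ in range(per_worker):
            x, y = rng.random(), rng.random()
            inside += (x * x + y * y) <= 1.0   # dart landed in the quarter disc
    return 4.0 * inside / (per_worker * pr)

print(pi_mc(100_000, pr=4))   # ~3.14, indicative only
```

On a GPU the outer loop becomes the distributed dimension, each work-item running its own inner loop with a decorrelated RNG stream.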

Page 28: Astrosim 2017 Les « Houches » the 7th ;-)


Differences between CUDA and OpenCL

// CUDA version
__device__ ulong MainLoop(ulong iterations, uint seed_z, uint seed_w, size_t work)
{
    uint jcong = seed_z + work;
    ulong total = 0;
    for (ulong i = 0; i < iterations; i++) {
        float x = CONGfp;
        float y = CONGfp;
        ulong inside = ((x*x + y*y) <= THEONE) ? 1 : 0;
        total += inside;
    }
    return(total);
}

__global__ void MainLoopBlocks(ulong *s, ulong iterations, uint seed_w, uint seed_z)
{
    ulong total = MainLoop(iterations, seed_z, seed_w, blockIdx.x);
    s[blockIdx.x] = total;
    __syncthreads();
}

// OpenCL version
ulong MainLoop(ulong iterations, uint seed_z, uint seed_w, size_t work)
{
    uint jcong = seed_z + work;
    ulong total = 0;
    for (ulong i = 0; i < iterations; i++) {
        float x = CONGfp;
        float y = CONGfp;
        ulong inside = ((x*x + y*y) <= THEONE) ? 1 : 0;
        total += inside;
    }
    return(total);
}

__kernel void MainLoopGlobal(__global ulong *s, ulong iterations, uint seed_w, uint seed_z)
{
    ulong total = MainLoop(iterations, seed_z, seed_w, get_global_id(0));
    barrier(CLK_GLOBAL_MEM_FENCE);
    s[get_global_id(0)] = total;
}
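The CONGfp macro in the kernels above comes from Marsaglia's CONG generator; assuming its usual definition (jcong = 69069*jcong + 1234567 mod 2^32, mapped to [0,1)), the per-work-item streams can be sketched in Python:

```python
# Marsaglia's CONG linear congruential step, seeded per work-item with
# seed + work id as in MainLoop above (the exact macro is an assumption).
def cong_stream(seed_z, work, n):
    jcong = (seed_z + work) & 0xFFFFFFFF
    for _ in range(n):
        jcong = (69069 * jcong + 1234567) & 0xFFFFFFFF
        yield jcong / 2**32            # CONGfp: a float in [0, 1)

# Two work-items get distinct streams from the same base seed
s0 = list(cong_stream(110271, work=0, n=4))
s1 = list(cong_stream(110271, work=1, n=4))
assert s0 != s1 and all(0.0 <= v < 1.0 for v in s0 + s1)
```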

Page 29: Astrosim 2017 Les « Houches » the 7th ;-)


You think Python is not efficient?
Let's have a quick comparison...

[Chart: ITOPS from 0 to 1.4×10^11 for MPI/C, MPI+OpenMP/C, MPI+PyOpenCL NoHT and MPI+PyOpenCL HT]

Page 30: Astrosim 2017 Les « Houches » the 7th ;-)


Pi Dart Board match (yesterday): Python vs C & CPU vs GPU vs Phi

[Chart: ITOPS from 0 to 2.5×10^11 across dozens of devices: Tesla C1060 to P100 (including K80 and P100 with pThreads, annotated « 2 Boards on Host! » and « 2 Chips on Board! »), NVS 315, Quadro 4000/K2000/K4000, GTX 560Ti to 1080Ti, GT 430 to 730, AMD HD5850 to R9-295X2, Nano & R9 Fury, Kaveri A10-7850K GPU, Xeon Phi 7120P, AMD FX-9590 & Ryzen7 1800, Intel CPUs from Pentium M 755 to 2x E5-2680v4, and a 64-node 8-core cluster with MPI+PyCL (NoHT and HT), MPI/C and MPI+OpenMP/C. Groups: Tesla for Pros, GTX for Gamers, AMD/ATI, CPU]

Page 31: Astrosim 2017 Les « Houches » the 7th ;-)


With 1 QPU, at low « regime »... the CPU explodes the GPU!
● The GPU is from 20 to 50x slower!
● About 1.5 orders of magnitude...

[Charts: PiOpenCL, PiOCL Base and PiOpenCL Boost (log scale 10^6 to 10^9 and linear scale 0 to 2.5×10^8) on X5550, E5-2665, E5-2620v2, E5-2637v2, E5-2687v3, Xeon Phi, R9-295X, R9-290, GTX Titan, Quadro K4000, Tesla K40c, GTX 980Ti and GTX 1080]

Page 32: Astrosim 2017 Les « Houches » the 7th ;-)


From 1 to 1024 PR
A small comparison between *PUs: the MIC Phi emerges, the GPUs lie in ambush

Page 33: Astrosim 2017 Les « Houches » the 7th ;-)


Deep exploration of Parallel Rates
Parallel Rate from 1024 to 16384

● Xeon Phi with 60 cores: peaks at multiples of 60 (60×27, 60×41, 60×64)
● Nvidia GTX Titan & Tesla K40: all prime numbers > 1024
● AMD HD7970 with 2048 ALUs and R9-290X with 2816 ALUs: peaks at 2048×4 and 2816×4

Page 34: Astrosim 2017 Les « Houches » the 7th ;-)


Deep exploration of Parallel Rates
Parallel Rate from 1024 to 65536

● AMD HD7970 & R9-290: long period, 4x the number of ALUs
● Nvidia GTX Titan & Tesla K40: short period, the number of SMX units (period of 14 for 14 SMX, period of 15 for 15 SMX)

Page 35: Astrosim 2017 Les « Houches » the 7th ;-)


And the latest Nvidia GTX 1080: always affected by oddities?

Single precision / Double precision: variability without real structure

Page 36: Astrosim 2017 Les « Houches » the 7th ;-)

Emmanuel QUÉMENER CC BY-NC-SA 36/48September 29, 2017

Behaviour of 20 (GP)GPUs vs Parallel Rate

Page 37: Astrosim 2017 Les « Houches » the 7th ;-)


Is the Parallel Envelope reproducible?
For 2 specific boards: Nvidia GT620 / AMD HD6450

Page 38: Astrosim 2017 Les « Houches » the 7th ;-)


Variability: a distinctive criterion!
What a strange property :-/ ...
Mode Blocks / Mode Threads

Page 39: Astrosim 2017 Les « Houches » the 7th ;-)


A « memory-bound » test: the Splutter code on OpenCL devices

Devices: AMD/ATI GPUs (HD 7970 & R9-290X), Intel IvyBridge & Haswell, Nvidia GT/GTX/Tesla GPUs, Xeon Phi accelerator

Page 40: Astrosim 2017 Les « Houches » the 7th ;-)


"Back to the Physics"A Newtonian N-body code ...● Pass on a code "fine grain"

– Code N-body very simply integrated– Newton's second law, autogravitating system

● Differential integration methods:– Euler Implicite, Euler Explicit, Heun, Runge Kutta

● Settings :– Physical: number of particles– Numeric: no integration, number of steps– Then: influence FP32, FP64
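A minimal numpy sketch of one explicit Euler step for such a self-gravitating system (G=1; the softening `eps`, which avoids the 1/0 singularity, is an assumption, not stated in the deck):

```python
import numpy as np

def euler_step(pos, vel, mass, dt, eps=1e-3):
    d = pos[None, :, :] - pos[:, None, :]        # d[i,j] = pos_j - pos_i
    r2 = (d ** 2).sum(-1) + eps ** 2             # softened squared distances
    # acc_i = sum_j m_j * d_ij / |r_ij|^3 (self term vanishes since d_ii = 0)
    acc = (mass[None, :, None] * d / r2[..., None] ** 1.5).sum(axis=1)
    return pos + dt * vel, vel + dt * acc        # explicit Euler update

rng = np.random.default_rng(0)
pos = rng.standard_normal((64, 3))
vel = np.zeros((64, 3))
mass = np.ones(64) / 64
for _ in range(10):
    pos, vel = euler_step(pos, vel, mass, dt=1e-3)

# Pairwise forces cancel, so total momentum stays ~0
assert np.abs((mass[:, None] * vel).sum(axis=0)).max() < 1e-12
```

The O(N²) pairwise force matrix is exactly the kind of regular, data-parallel work that maps well onto GPU work-items, one particle per work-item.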

Page 41: Astrosim 2017 Les « Houches » the 7th ;-)


"Back to the Physics"A Newtonian N-body code ...● Pass on a code "fine grain"

– Code N-body very simply integrated– Newton's second law, autogravitating system

● Differential integration methods:– Euler Implicite, Euler Explicit, Heun, Runge Kutta

● Settings :– Physical: number of particles– Numeric: no integration, number of steps– Then: influence FP32, FP64

Page 42: Astrosim 2017 Les « Houches » the 7th ;-)


Very different performances: Nvidia, AMD, Intel, single/double precision

[Chart: D²/T_FP32 and D²/T_FP64 from 0 to 3×10^10 across Tesla C1060 to P100, GTX 560Ti to 1080Ti, AMD HD5850 to Nano, A10-7850K GPU, Xeon Phi, AMD FX-9590, A10-7850K CPU & Ryzen7 1800, and Intel CPUs from Pentium M 755 x1 to E5-2680v4 x2. Groups: CPUs (AMD, Intel), AMD/ATI for Alt*, GTX for Gamers, Tesla for Pros]

Page 43: Astrosim 2017 Les « Houches » the 7th ;-)


For high-powered *PUs: from « load » to saturation...
● Results expressed in units of Size²/Elapsed

[Charts: single and double precision, size from 32 to 32768, log scale from 10^5 up to 10^11, for GTX 1080, Nano, Xeon Phi, FX-9590 and E5-2687v3 x2]

Page 44: Astrosim 2017 Les « Houches » the 7th ;-)


But if the GPU is so powerful...
What about its original goal?

● When using a GPU for compute, beware of Xorg!
● For AMD/ATI, annoying as a non-root user...

[Charts: performance vs. size from 32 to 32768 with Xorg Off/On, in SP and DP]

Page 45: Astrosim 2017 Les « Houches » the 7th ;-)


Performances normalized to TDP: Nvidia, AMD, Intel, single/double precision

[Chart: Ratio32W and Ratio64W from 0 to 1.2×10^8 across Tesla C1060 to P100, GTX 560Ti to 1080Ti, AMD HD5850 to Nano, A10-7850K, Xeon Phi, AMD & Intel CPUs from Pentium M 755 x1 to E5-2680v4 x2]

Page 46: Astrosim 2017 Les « Houches » the 7th ;-)


Performances normalized to price: Nvidia, AMD, Intel, single/double precision

[Chart: Ratio32Pr and Ratio64Pr from 0 to 4.5×10^7 across the same set of Tesla, GTX, AMD/ATI, Xeon Phi and CPU devices]

Page 47: Astrosim 2017 Les « Houches » the 7th ;-)


Performances normalized to Cores.MHz: Nvidia, AMD, Intel, single/double precision

[Chart: Ratio32 and Ratio64 from 0 to 80000 across the same set of Tesla, GTX, AMD/ATI, Xeon Phi and CPU devices]

Page 48: Astrosim 2017 Les « Houches » the 7th ;-)


But what's the use of all this?
Characterize & avoid regrets...
● Today, the « brute » power is inside GPUs
● But one QPU (PR=1) of a GPU is 50x slower than a CPU core
● To exploit a GPU, the parallel rate MUST be over 1000
● To program a GPU, prefer:
  – The developer approach: Python with PyOpenCL (or PyCUDA)
  – The integrator approach: optimized external libraries
● In all cases, instrument your runs!