Équipe Projet INRIA AlGorille
Équipe IMS

Computing and energy performance optimization of a multi-algorithms PDE solver on CPU and GPU clusters

Stéphane Vialle, Sylvain Contassot-Vivier, Thomas Jost

13/01/2011


Page 1: on CPU and GPU clusters - CentraleSupelec

Équipe Projet INRIA AlGorille

Équipe IMS

Computing and energy performance optimization of a multi-algorithms PDE solver on CPU and GPU clusters

Stéphane Vialle, Sylvain Contassot-Vivier, Thomas Jost

13/01/2011

Page 2

1 ‐ First experiments on GPU clusters

Page 3

First experiments on GPU clusters

Experimental testbed

Two 16-node CPU+GPU clusters:
• Xeon dual-core + GT8800
• Xeon dual-core + GT285
• Nehalem quad-core + GT285
• Nehalem quad-core + GT480
(regular upgrades of the system)

• a 16-node state-of-the-art CPU+GPU cluster
• an older 16-node CPU+GPU cluster
• 2 Gigabit Ethernet switches
→ a heterogeneous 32-node cluster

• Some energy monitoring with external devices: Raritan (Dominion PX)

Page 4

First experiments on GPU clusters 

Collection of experiments

3 benchmarks with different features:

1 ‐ European option pricer:
Embarrassingly parallel Monte-Carlo computations
A parallel random number generator has been ported to the GPU

2 ‐ PDE solver:
Strong computations
Regular communications between nodes
Some computations (must) remain on the CPU

3 ‐ 2D Jacobi relaxation:
Repetitive light computations
Frequent communications between neighbor nodes

Page 5

First experiments on GPU clusters 

Collection of experiments

[Figure: log-log plots of execution time (s) and energy consumption (Wh) versus the number of nodes (1 to 100) for the three benchmarks (Pricer with parallel MC, PDE solver synchronous, Jacobi relaxation), comparing the monocore-CPU, multicore-CPU and manycore-GPU clusters.]

Page 6

First experiments on GPU clusters 

Computational & energy model design

Temporal gain (speedup) and energy gain of the GPU cluster vs the CPU cluster:

[Figure: log-log plots of the speedup and energy gain of the GPU cluster vs the multicore CPU cluster, versus the number of nodes (1 to 100), for the Pricer (parallel MC: OK), the PDE solver (synchronous: Hum…) and the Jacobi relaxation (Beyond? Predictions?).]

Up to 16 nodes this GPU cluster is more interesting than our CPU cluster, but its interest decreases:
‐ why?
‐ beyond 16 nodes?


Page 8

First experiments on GPU clusters 

Computational & energy model design

                 CPU cluster                GPU cluster
Computations     T-calc-CPU                 If the algorithm is adapted to the GPU architecture: T-calc-GPU << T-calc-CPU; else: do not use GPUs!
Communications   T-comm-CPU = T-comm-MPI    T-comm-GPU ≥ T-comm-CPU, with T-comm-GPU = T-transfer-GPUtoCPU + T-comm-MPI + T-transfer-CPUtoGPU
Total time       T-CPU                      T-GPU <?> T-CPU …

For a fixed problem size: when the number of nodes increases, T-comm becomes dominant and the interest of the GPU cluster decreases.
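The communication-dominance argument above can be made concrete with a toy numerical sketch; every constant below is an illustrative assumption, not a measured value:

```python
# Toy model of total execution time on CPU and GPU clusters for a
# fixed-size problem, illustrating why the GPU advantage shrinks as
# the number of nodes N grows. All constants are illustrative.

def t_cpu(n, t_calc=1000.0, t_comm_mpi=2.0):
    # Computation is shared among the n nodes; MPI cost grows with n.
    return t_calc / n + t_comm_mpi * n

def t_gpu(n, t_calc=100.0, t_comm_mpi=2.0, t_transfer=1.0):
    # The GPU computes 10x faster here, but pays CPU<->GPU transfers
    # on top of MPI: T-comm-GPU >= T-comm-CPU.
    return t_calc / n + (t_comm_mpi + 2 * t_transfer) * n

if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        # Temporal gain of the GPU cluster: decreases as n grows.
        print(n, round(t_cpu(n) / t_gpu(n), 2))
```

With these numbers the gain starts well above 1 on few nodes and drops below 1 on many nodes, which is the qualitative behaviour observed in the experiments.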

Page 9

2 ‐ A first performance model

Page 10

A first performance model

First modelling approach

Observation of the first experimental performances:
• there exists a « scalable area »,
• the performances of the CPU and GPU clusters have different slopes.

[Figure: log-log plots of the execution time (s) and energy (Wh) of the synchronous PDE solver versus the number of nodes, with the « scalable area » highlighted, and of the speedup of the GPU cluster vs the multicore CPU cluster.]

Modelling of the « scalable area », assuming some experimental measurements of the real application are possible (simple modelling).

Page 11

A first performance model

First modelling approach

We model the « scalable area »:

[Figure: log-log plots of T and E versus N (nodes); T decreases with slopes −σT_CPU and −σT_GPU, E increases with slopes σE_CPU and σE_GPU.]

We observe:
T(N) = T(1) / N^σT
E(N) = E(1) · N^σE
with: σT_CPU > σT_GPU

We consider the electrical power dissipated by the nodes and the switch:
E(N) = P(1)·T(N)·N + Pswitch·(N/Nmax)·T(N)
with P: electrical power (Watts)
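The power-law time fit and the node-plus-switch energy decomposition can be sketched directly; the parameter values below are made up for illustration:

```python
# Scalable-area model:
#   T(N) = T(1) / N**sigma_T
#   E(N) = P(1)*T(N)*N + P_switch*(N/N_max)*T(N)
# All numeric values below are illustrative assumptions.

def exec_time(n, t1=500.0, sigma_t=0.9):
    # Sub-linear speedup: time falls as N**sigma_t with sigma_t < 1.
    return t1 / n ** sigma_t

def energy(n, t1=500.0, sigma_t=0.9, p_node=200.0, p_switch=100.0, n_max=16):
    # Nodes dissipate p_node each; the switch's power is charged
    # proportionally to the fraction of its ports in use.
    t = exec_time(n, t1, sigma_t)
    return p_node * t * n + p_switch * (n / n_max) * t

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(exec_time(n), 1), round(energy(n) / 3600.0, 3))  # s, Wh
```

Because sigma_t < 1, the time decreases but the energy grows with N, which is exactly the E(N) = E(1)·N^σE behaviour observed in the measurements.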

Page 12

A first performance model

First modelling approach

We obtain:
SU_G/C(N) = SU_G/C(1) · N^(σT_GPU − σT_CPU)
EG_G/C(N) = EG_G/C(1) · N^(σE_CPU − σE_GPU)

We compute the 2 threshold numbers of nodes:
SU_G/C(N) = 1  for  N = N_T^G/C = SU_G/C(1)^(−1/(σT_GPU − σT_CPU))
EG_G/C(N) = 1  for  N = N_E^G/C = EG_G/C(1)^(−1/(σE_CPU − σE_GPU))

Page 13

A first performance model

First modelling approach

3 areas appear when increasing the number of nodes:

• 1 ≤ N < min(N_T^G/C, N_E^G/C): the GPU cluster is more efficient (about both T and E) → choose the GPU cluster
• min(N_T^G/C, N_E^G/C) ≤ N < max(N_T^G/C, N_E^G/C): the GPU cluster is faster OR less energy consuming → a strategy and a heuristic are required to choose the GPU or the CPU cluster
• N ≥ max(N_T^G/C, N_E^G/C): the CPU cluster is more efficient (about both T and E) → choose the CPU cluster
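The two thresholds and the three-region decision can be sketched as follows; the initial gains and σ differences are placeholder values standing in for real measurements:

```python
# Thresholds where the GPU-vs-CPU gains cross 1, following
#   SU(N) = SU(1) * N**(sigma_t_gpu - sigma_t_cpu)
#   EG(N) = EG(1) * N**(sigma_e_cpu - sigma_e_gpu)
# All exponents and initial gains below are placeholders.

def threshold(gain_at_1, exponent):
    # Solve gain_at_1 * N**exponent == 1 for N (exponent < 0, gain > 1).
    return gain_at_1 ** (-1.0 / exponent)

def choose_cluster(n, n_t, n_e):
    lo, hi = min(n_t, n_e), max(n_t, n_e)
    if n < lo:
        return "GPU"        # faster AND less energy consuming
    if n < hi:
        return "heuristic"  # faster OR less energy consuming, not both
    return "CPU"            # CPU cluster wins on both criteria

n_t = threshold(8.0, -0.7)  # SU(1) = 8, sigma_t_gpu - sigma_t_cpu = -0.7
n_e = threshold(5.0, -0.5)  # EG(1) = 5, sigma_e_cpu - sigma_e_gpu = -0.5
print(round(n_t, 1), round(n_e, 1))
for n in (8, 22, 64):
    print(n, choose_cluster(n, n_t, n_e))
```

With these placeholders the speedup threshold lands near 20 nodes and the energy threshold at 25 nodes, reproducing the three regions of the slide.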

Page 14

3 ‐ Improving performances with asynchronous algorithms

Investigation with our PDE solver

Page 15

Improving performances with asynchronous algorithms

Asynchronous parallel computing

Asynchronous algorithms provide implicit overlapping of communications and computations, and … communications are important on GPU clusters.
→ They should improve executions on GPU clusters.

But:
• some iterative algorithms can be turned into asynchronous algorithms (not all),
• a strong mathematical theory supports this approach.

And:
• the convergence detection of the algorithm is more complex and requires more communications (than with a synchronous algorithm),
• some extra iterations are required to achieve the same accuracy.
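The asynchronous scheme can be illustrated on a toy contracting fixed-point problem: each "node" keeps updating its own component from whatever (possibly stale) value the other has last published, with no synchronization barrier, and the contraction still guarantees convergence. A minimal sketch; the problem and constants are invented for illustration:

```python
import random

# Toy "asynchronous" (chaotic) relaxation: at each step one randomly
# chosen component is updated from the current shared iterate, so the
# components evolve at different speeds without any barrier.
# Contracting fixed point (invented example):
#   x0 = 0.5*x1 + 1,   x1 = 0.5*x0 + 2

def async_relaxation(steps=200, seed=42):
    rng = random.Random(seed)
    x = [0.0, 0.0]
    c = [1.0, 2.0]
    for _ in range(steps):
        i = rng.randrange(2)         # which "node" gets to update now
        j = 1 - i
        x[i] = 0.5 * x[j] + c[i]     # may use a stale x[j]: that's allowed
    return x

x = async_relaxation()
# Exact fixed point: x0 = 8/3, x1 = 10/3
print(x)
```

Because the update map is a contraction and both components are updated infinitely often, any interleaving converges, which is the essence of the mathematical theory mentioned above.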

Page 16

Improving performances with asynchronous algorithms

Parallel iterative PDE solver

Page 17

Improving performances with asynchronous algorithms

Inner linear solver

Page 18

Improving performances with asynchronous algorithms

Asynchronous version

… and a really more complex parallel implementation!

Page 19

Improving performances with asynchronous algorithms

Asynchronous version

Page 20

Improving performances with asynchronous algorithms

Performances on a heterogeneous cluster

Execution time using both GPU clusters of Supelec (to minimize):
• 17 nodes Xeon dual-core + GT8800
• 16 nodes Nehalem quad-core + GT285
• 2 interconnected Gigabit switches
Rmk: the two clusters are managed by one OAR server

[Figure: execution time T-exec (s) versus the number of fast nodes, for the « GPUs & synchronous » and « GPUs & asynchronous » versions.]

Page 21

Improving performances with asynchronous algorithms

Performances on a heterogeneous cluster

Speedup vs 1 GPU (to maximize):
• the asynchronous version achieves a more regular speedup
• the asynchronous version achieves a better speedup on a high number of nodes

[Figure: speedup vs 1 GPU (and vs sequential) as a function of the number of fast nodes, for the synchronous and asynchronous GPU cluster versions.]

Page 22

Improving performances with asynchronous algorithms

Performances on a heterogeneous cluster

Energy consumption (to minimize):
• measurement errors become important
• the sync. and async. energy consumption curves are (just) « different »

[Figure: energy consumption (W.h) versus the number of fast nodes, for the synchronous and asynchronous GPU cluster versions.]

Page 23

Improving performances with asynchronous algorithms

Performances on a heterogeneous cluster

Energy overhead factor vs 1 GPU (to minimize):
• the overhead curves are (just) « different »
→ no more globally attractive solution!

[Figure: energy overhead factor vs 1 GPU versus the number of fast nodes, for the synchronous and asynchronous GPU cluster versions.]

Page 24

4 ‐ Need for a new performance model and an auto‐adaptive solution

Page 25

Need for a new performance model and an auto‐adaptive solution

Relative async. vs sync. performances

The relative async. vs sync. speedup and energy gain exhibit some similarities: they can be used to choose the version to run
→ we need a fine model (region frontiers are complex)
→ we need a heuristic when only one gain is greater than 1

[Figure: maps of the relative speedup and relative energy gain, showing the regions where the asynchronous version is better.]

Page 26

Need for a new performance model and an auto‐adaptive solution

Relative async. vs sync. performances

Energy-Delay Product (EDP) (to minimize):
• to track a global optimum, considering both T and E
• parallel runs on many nodes seem better
• no large differences between the sync. and async. versions

[Figure: EDP versus the number of fast nodes, for the synchronous and asynchronous GPU cluster versions.]

Page 27

Need for a new performance model and an auto‐adaptive solution

Relative async. vs sync. performances

Async. vs sync. relative Energy-Delay Product ratio:
• we compute: EDP-sync / EDP-async
• it can be used to make a choice inside ambiguous regions

→ Choose the async. version where both relative gains favour it, the sync. version where both favour it, and compute the EDP to choose the sync. or async. version in the ambiguous region (where relative-SU > 1 and relative-EG < 1).
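This EDP-based tie-break can be expressed directly; the measured T and E values used below are invented placeholders:

```python
# Choose between the synchronous and asynchronous versions.
# Clear cases use the relative speedup (SU) and energy gain (EG) of
# async vs sync; the ambiguous region (one gain > 1, the other < 1)
# is settled by the Energy-Delay Product EDP = E * T (to minimize).

def choose_version(t_sync, e_sync, t_async, e_async):
    su = t_sync / t_async          # > 1 means async is faster
    eg = e_sync / e_async          # > 1 means async consumes less energy
    if su >= 1.0 and eg >= 1.0:
        return "async"
    if su <= 1.0 and eg <= 1.0:
        return "sync"
    # Ambiguous region: compare the EDPs.
    return "async" if e_async * t_async < e_sync * t_sync else "sync"

if __name__ == "__main__":
    # Async is faster but costs more energy: the EDP decides.
    print(choose_version(100.0, 50.0, 80.0, 55.0))
```

The EDP acts as a single compromise criterion, which matches its use on the previous slide.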

Page 28

Need for a new performance model and an auto‐adaptive solution

Automatic choice criteria

Automatic selection of the « best » version to run, among: {CPU cluster, GPU cluster} × {synchronous algo., asynchronous algo.}

Criteria:
• relative speedup: tracking HPC performances
• relative energy gain: tracking low energy consumption
• relative energy-delay product: tracking a compromise

But we need a model to automatize this choice:
• a fine model: criteria variations are small and region frontiers are complex
• a model not requiring long and large experiments

Page 29

Need for a new performance model and an auto‐adaptive solution

Fine model required

First model limitations:
• it requires/assumes a « scalable area » → an approximative model (not fine)
• it requires 4 executions of the entire application: on 1 and N0 nodes, running the 2 versions to compare, and measuring both T and E
→ adapted to optimize the execution of a long-life application scaling on a parallel machine

To achieve an automatic selection on a short-life application, we need:
• a model requiring only small elementary benchmarks to fix the model parameter values on the hardware used,
• a fine model not requiring the application to exhibit a perfect scalability on the architecture.

Page 30

Need for a new performance model and an auto‐adaptive solution

Fine model required

A first version of this fine model exists.

It takes into account:
• the different power dissipations of the different « identical » nodes of the cluster
• when starting computations on the GPU, the power dissipation increases 2 times: when the GPU starts to compute, and when the fan of the GPU starts and/or when the GPU increases its frequency, …
• when stopping computations on the GPU, the power dissipation decreases several times, but not immediately!
• …
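The stepped power behaviour can be captured by integrating a piecewise-constant power profile over time; a toy sketch where the levels and delays are invented (the real ones would come from the Raritan measurements):

```python
# Energy of one GPU run from a piecewise-constant power profile:
# idle -> compute start (first jump) -> fan/frequency kick in
# (second jump) -> computation ends but power decays in delayed
# steps, not immediately. Profile = list of (duration_s, power_W).

def energy_wh(profile):
    # Integrate P over t; return watt-hours.
    return sum(dt * p for dt, p in profile) / 3600.0

run = [
    (5.0, 120.0),    # idle before the kernel starts
    (2.0, 180.0),    # GPU starts to compute: first power jump
    (60.0, 240.0),   # fan on / frequency raised: second jump
    (4.0, 200.0),    # computation over, power decays...
    (6.0, 150.0),    # ...in several delayed steps
    (5.0, 120.0),    # back to idle
]

print(round(energy_wh(run), 3))
```

Summing segment-by-segment like this is what lets the fine model charge the start-up jumps and the delayed power decay to the right part of the run.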

Page 31

Need for a new performance model and an auto‐adaptive solution

Fine model required

[Figure: measured electrical power (Watts) versus time (s) during a run, showing the stepped power increases at GPU start-up and the delayed, stepped decreases after the computation ends.]

Page 32

Need for a new performance model and an auto‐adaptive solution

Fine model required

First evaluation:
• on our PDE solver execution and on our heterogeneous GPU cluster
• error (model vs observation): 6%
• after some bias corrections: 1%

A stronger evaluation is planned:
• on different hardware
• on different applications
• then: a heuristic of auto-adaptation/auto-selection of the right algorithm will be implemented

To be continued…

Page 33

5 ‐ Conclusion and perspectives

Page 34

Long term objectives

The user runs: a.out ‐O [speed, energy, edp, …]

Available implementations:
• Parallel algorithm 1: Kernel 1-v1, Kernel 1-v2, Kernel 1-v3
• Parallel algorithm 2: Kernel 2-v1, Kernel 2-v2

Auto-adaptation of a.out: a heuristic scheduler, fed by the performance model (energy and computations) and by elementary hardware benchmarks, selects the right parallel algorithm and kernels, so that the run ends after a fast execution, a low-energy-consumption execution, or an edp-compromise execution, depending on the requested optimization.

Page 35

Équipe Projet INRIA AlGorille
Équipe IMS

Computing and energy performance optimization of a multi-algorithms PDE solver on CPU and GPU clusters

Questions?