Équipe Projet INRIA AlGorille
Équipe IMS

Computing and energy performance optimization of a multi-algorithms PDE solver on CPU and GPU clusters

Stéphane Vialle, Sylvain Contassot-Vivier, Thomas Jost

13/01/2011


Page 1: on CPU and GPU clusters - CentraleSupelec

Équipe Projet INRIA AlGorille

Équipe IMS

Computing and energy performance optimization of a multi-algorithms PDE solver on CPU and GPU clusters

Stéphane Vialle, Sylvain Contassot-Vivier, Thomas Jost

13/01/2011

Page 2

1 ‐ First experiments on GPU clusters

Page 3

First experiments on GPU clusters

Experimental testbed

Two 16-node CPU+GPU clusters:
• Xeon dual-core + GT8800
• Xeon dual-core + GT285
• Nehalem quad-core + GT285
• Nehalem quad-core + GT480
(regular upgrades of the system)

• a 16-node state-of-the-art CPU+GPU cluster
• an older 16-node CPU+GPU cluster
• 2 Gigabit Ethernet switches
→ a heterogeneous 32-node cluster

• Some energy monitoring with external devices: Raritan (Dominion PX)

Page 4

First experiments on GPU clusters 

Collection of experiments

3 benchmarks with different features:

1 ‐ European option pricer:
Embarrassingly parallel Monte-Carlo computations
A parallel random number generator has been ported to the GPU

2 ‐ PDE solver:
Strong computations
Regular communications between nodes
Some computations (must) remain on the CPU

3 ‐ 2D Jacobi relaxation:
Repetitive light computations
Frequent communications between neighbor nodes

Page 5

First experiments on GPU clusters 

Collection of experiments

[Figure: log-log plots of execution time (s) and energy consumption (Wh) versus the number of nodes (1 to 100) for the three benchmarks (Pricer with parallel MC, PDE solver synchronous, Jacobi relaxation), comparing the monocore-CPU, multicore-CPU and manycore-GPU clusters.]

Page 6

First experiments on GPU clusters 

Computational & energy model design

Temporal gain (speedup) and energy gain of the GPU cluster vs the CPU cluster:

[Figure: log-log plots of the speedup and energy gain of the GPU cluster vs the multicore CPU cluster, versus the number of nodes (1 to 100), for the Pricer (parallel MC: OK), the PDE solver (synchronous: Hum…) and the Jacobi relaxation (Beyond? Predictions?).]

Up to 16 nodes this GPU cluster is more interesting than our CPU cluster, but its interest decreases:
‐ why?
‐ beyond 16 nodes?


Page 8

First experiments on GPU clusters 

Computational & energy model design

                 CPU cluster                GPU cluster
Computations     T-calc-CPU                 If the algorithm is adapted to the GPU architecture: T-calc-GPU << T-calc-CPU; else: do not use GPUs!
Communications   T-comm-CPU = T-comm-MPI    T-comm-GPU ≥ T-comm-CPU, with T-comm-GPU = T-transfer-GPUtoCPU + T-comm-MPI + T-transfer-CPUtoGPU
Total time       T-CPU                      T-GPU <?> T-CPU …

For a fixed problem size: when the number of nodes increases, T-comm becomes dominant and the interest of the GPU cluster decreases.
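The communication-dominance argument above can be made concrete with a toy numerical sketch; every constant below is an illustrative assumption, not a measured value:

```python
# Toy model of total execution time on CPU and GPU clusters for a
# fixed-size problem, illustrating why the GPU advantage shrinks as
# the number of nodes N grows. All constants are illustrative.

def t_cpu(n, t_calc=1000.0, t_comm_mpi=2.0):
    # Computation is shared among the n nodes; MPI cost grows with n.
    return t_calc / n + t_comm_mpi * n

def t_gpu(n, t_calc=100.0, t_comm_mpi=2.0, t_transfer=1.0):
    # The GPU computes 10x faster here, but pays CPU<->GPU transfers
    # on top of MPI: T-comm-GPU >= T-comm-CPU.
    return t_calc / n + (t_comm_mpi + 2 * t_transfer) * n

if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        # Temporal gain of the GPU cluster: decreases as n grows.
        print(n, round(t_cpu(n) / t_gpu(n), 2))
```

With these numbers the gain starts well above 1 on few nodes and drops below 1 on many nodes, which is the qualitative behaviour observed in the experiments.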

Page 9

2 ‐ A first performance model

Page 10

A first performance model

First modelling approach

Observation of the first experimental performances:
• there exists a « scalable area »,
• the performances of the CPU and GPU clusters have different slopes.

[Figure: log-log plots of the execution time (s) and energy (Wh) of the synchronous PDE solver versus the number of nodes, with the « scalable area » highlighted, and of the speedup of the GPU cluster vs the multicore CPU cluster.]

Modelling of the « scalable area », assuming some experimental measurements of the real application are possible (simple modelling).

Page 11

A first performance model

First modelling approach

We model the « scalable area »:

[Figure: log-log plots of T and E versus N (nodes); T decreases with slopes −σT_CPU and −σT_GPU, E increases with slopes σE_CPU and σE_GPU.]

We observe:
T(N) = T(1) / N^σT
E(N) = E(1) · N^σE
with: σT_CPU > σT_GPU

We consider the electrical power dissipated by the nodes and the switch:
E(N) = P(1)·T(N)·N + Pswitch·(N/Nmax)·T(N)
with P: electrical power (Watts)
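The power-law time fit and the node-plus-switch energy decomposition can be sketched directly; the parameter values below are made up for illustration:

```python
# Scalable-area model:
#   T(N) = T(1) / N**sigma_T
#   E(N) = P(1)*T(N)*N + P_switch*(N/N_max)*T(N)
# All numeric values below are illustrative assumptions.

def exec_time(n, t1=500.0, sigma_t=0.9):
    # Sub-linear speedup: time falls as N**sigma_t with sigma_t < 1.
    return t1 / n ** sigma_t

def energy(n, t1=500.0, sigma_t=0.9, p_node=200.0, p_switch=100.0, n_max=16):
    # Nodes dissipate p_node each; the switch's power is charged
    # proportionally to the fraction of its ports in use.
    t = exec_time(n, t1, sigma_t)
    return p_node * t * n + p_switch * (n / n_max) * t

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(exec_time(n), 1), round(energy(n) / 3600.0, 3))  # s, Wh
```

Because sigma_t < 1, the time decreases but the energy grows with N, which is exactly the E(N) = E(1)·N^σE behaviour observed in the measurements.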

Page 12

A first performance model

First modelling approach

We obtain:
SU_G/C(N) = SU_G/C(1) · N^(σT_GPU − σT_CPU)
EG_G/C(N) = EG_G/C(1) · N^(σE_CPU − σE_GPU)

We compute the 2 threshold numbers of nodes:
SU_G/C(N) = 1  for  N = N_T^G/C = SU_G/C(1)^(−1/(σT_GPU − σT_CPU))
EG_G/C(N) = 1  for  N = N_E^G/C = EG_G/C(1)^(−1/(σE_CPU − σE_GPU))

Page 13

A first performance model

First modelling approach

3 areas appear when increasing the number of nodes:

• 1 ≤ N < min(N_T^G/C, N_E^G/C): the GPU cluster is more efficient (about both T and E) → choose the GPU cluster
• min(N_T^G/C, N_E^G/C) ≤ N < max(N_T^G/C, N_E^G/C): the GPU cluster is faster OR less energy consuming → a strategy and a heuristic are required to choose the GPU or the CPU cluster
• N ≥ max(N_T^G/C, N_E^G/C): the CPU cluster is more efficient (about both T and E) → choose the CPU cluster
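The two thresholds and the three-region decision can be sketched as follows; the initial gains and σ differences are placeholder values standing in for real measurements:

```python
# Thresholds where the GPU-vs-CPU gains cross 1, following
#   SU(N) = SU(1) * N**(sigma_t_gpu - sigma_t_cpu)
#   EG(N) = EG(1) * N**(sigma_e_cpu - sigma_e_gpu)
# All exponents and initial gains below are placeholders.

def threshold(gain_at_1, exponent):
    # Solve gain_at_1 * N**exponent == 1 for N (exponent < 0, gain > 1).
    return gain_at_1 ** (-1.0 / exponent)

def choose_cluster(n, n_t, n_e):
    lo, hi = min(n_t, n_e), max(n_t, n_e)
    if n < lo:
        return "GPU"        # faster AND less energy consuming
    if n < hi:
        return "heuristic"  # faster OR less energy consuming, not both
    return "CPU"            # CPU cluster wins on both criteria

n_t = threshold(8.0, -0.7)  # SU(1) = 8, sigma_t_gpu - sigma_t_cpu = -0.7
n_e = threshold(5.0, -0.5)  # EG(1) = 5, sigma_e_cpu - sigma_e_gpu = -0.5
print(round(n_t, 1), round(n_e, 1))
for n in (8, 22, 64):
    print(n, choose_cluster(n, n_t, n_e))
```

With these placeholders the speedup threshold lands near 20 nodes and the energy threshold at 25 nodes, reproducing the three regions of the slide.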

Page 14

3 ‐ Improving performances with asynchronous algorithms

Investigation with our PDE solver

Page 15

Improving performances with asynchronous algorithms

Asynchronous parallel computing

Asynchronous algorithms provide implicit overlapping of communications and computations, and … communications are important on GPU clusters.
→ They should improve executions on GPU clusters.

But:
• some iterative algorithms can be turned into asynchronous algorithms (not all),
• a strong mathematical theory supports this approach.

And:
• the convergence detection of the algorithm is more complex and requires more communications (than with a synchronous algorithm),
• some extra iterations are required to achieve the same accuracy.
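The asynchronous scheme can be illustrated on a toy contracting fixed-point problem: each "node" keeps updating its own component from whatever (possibly stale) value the other has last published, with no synchronization barrier, and the contraction still guarantees convergence. A minimal sketch; the problem and constants are invented for illustration:

```python
import random

# Toy "asynchronous" (chaotic) relaxation: at each step one randomly
# chosen component is updated from the current shared iterate, so the
# components evolve at different speeds without any barrier.
# Contracting fixed point (invented example):
#   x0 = 0.5*x1 + 1,   x1 = 0.5*x0 + 2

def async_relaxation(steps=200, seed=42):
    rng = random.Random(seed)
    x = [0.0, 0.0]
    c = [1.0, 2.0]
    for _ in range(steps):
        i = rng.randrange(2)         # which "node" gets to update now
        j = 1 - i
        x[i] = 0.5 * x[j] + c[i]     # may use a stale x[j]: that's allowed
    return x

x = async_relaxation()
# Exact fixed point: x0 = 8/3, x1 = 10/3
print(x)
```

Because the update map is a contraction and both components are updated infinitely often, any interleaving converges, which is the essence of the mathematical theory mentioned above.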

Page 16

Improving performances with asynchronous algorithms

Parallel iterative PDE solver

Page 17

Improving performances with asynchronous algorithms

Inner linear solver

Page 18

Improving performances with asynchronous algorithms

Asynchronous version

… and a really more complex parallel implementation!

Page 19

Improving performances with asynchronous algorithms

Asynchronous version

Page 20

Improving performances with asynchronous algorithms

Performances on a heterogeneous cluster

Execution time using both GPU clusters of Supelec (to minimize):
• 17 nodes Xeon dual-core + GT8800
• 16 nodes Nehalem quad-core + GT285
• 2 interconnected Gigabit switches
Rmk: the two clusters are managed by one OAR server

[Figure: execution time T-exec (s) versus the number of fast nodes, for the « GPUs & synchronous » and « GPUs & asynchronous » versions.]

Page 21

Improving performances with asynchronous algorithms

Performances on a heterogeneous cluster

Speedup vs 1 GPU (to maximize):
• the asynchronous version achieves a more regular speedup
• the asynchronous version achieves a better speedup on a high number of nodes

[Figure: speedup vs 1 GPU (and vs sequential) as a function of the number of fast nodes, for the synchronous and asynchronous GPU cluster versions.]

Page 22

Improving performances with asynchronous algorithms

Performances on a heterogeneous cluster

Energy consumption (to minimize):
• measurement errors become important
• the sync. and async. energy consumption curves are (just) « different »

[Figure: energy consumption (W.h) versus the number of fast nodes, for the synchronous and asynchronous GPU cluster versions.]

Page 23

Improving performances with asynchronous algorithms

Performances on a heterogeneous cluster

Energy overhead factor vs 1 GPU (to minimize):
• the overhead curves are (just) « different »
→ no more globally attractive solution!

[Figure: energy overhead factor vs 1 GPU versus the number of fast nodes, for the synchronous and asynchronous GPU cluster versions.]

Page 24

4 ‐ Need for a new performance model and an auto‐adaptive solution

Page 25

Need for a new performance model and an auto‐adaptive solution

Relative async. vs sync. performances

The relative async. vs sync. speedup and energy gain exhibit some similarities: they can be used to choose the version to run
→ we need a fine model (region frontiers are complex)
→ we need a heuristic when only one gain is greater than 1

[Figure: maps of the relative speedup and relative energy gain, showing the regions where the asynchronous version is better.]

Page 26

Need for a new performance model and an auto‐adaptive solution

Relative async. vs sync. performances

Energy-Delay Product (EDP) (to minimize):
• to track a global optimum, considering both T and E
• parallel runs on many nodes seem better
• no large differences between the sync. and async. versions

[Figure: EDP versus the number of fast nodes, for the synchronous and asynchronous GPU cluster versions.]

Page 27

Need for a new performance model and an auto‐adaptive solution

Relative async. vs sync. performances

Async. vs sync. relative Energy-Delay Product ratio:
• we compute: EDP-sync / EDP-async
• it can be used to make a choice inside ambiguous regions

→ Choose the async. version where both relative gains favour it, the sync. version where both favour it, and compute the EDP to choose the sync. or async. version in the ambiguous region (where relative-SU > 1 and relative-EG < 1).
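This EDP-based tie-break can be expressed directly; the measured T and E values used below are invented placeholders:

```python
# Choose between the synchronous and asynchronous versions.
# Clear cases use the relative speedup (SU) and energy gain (EG) of
# async vs sync; the ambiguous region (one gain > 1, the other < 1)
# is settled by the Energy-Delay Product EDP = E * T (to minimize).

def choose_version(t_sync, e_sync, t_async, e_async):
    su = t_sync / t_async          # > 1 means async is faster
    eg = e_sync / e_async          # > 1 means async consumes less energy
    if su >= 1.0 and eg >= 1.0:
        return "async"
    if su <= 1.0 and eg <= 1.0:
        return "sync"
    # Ambiguous region: compare the EDPs.
    return "async" if e_async * t_async < e_sync * t_sync else "sync"

if __name__ == "__main__":
    # Async is faster but costs more energy: the EDP decides.
    print(choose_version(100.0, 50.0, 80.0, 55.0))
```

The EDP acts as a single compromise criterion, which matches its use on the previous slide.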

Page 28

Need for a new performance model and an auto‐adaptive solution

Automatic choice criteria

Automatic selection of the « best » version to run, among: {CPU cluster, GPU cluster} × {synchronous algo., asynchronous algo.}

Criteria:
• relative speedup: tracking HPC performances
• relative energy gain: tracking low energy consumption
• relative energy-delay product: tracking a compromise

But we need a model to automatize this choice:
• a fine model: criteria variations are small and region frontiers are complex
• a model not requiring long and large experiments

Page 29

Need for a new performance model and an auto‐adaptive solution

Fine model required

First model limitations:
• it requires/assumes a « scalable area » → an approximative model (not fine)
• it requires 4 executions of the entire application: on 1 and N0 nodes, running the 2 versions to compare, and measuring both T and E
→ adapted to optimize the execution of a long-life application scaling on a parallel machine

To achieve an automatic selection on a short-life application, we need:
• a model requiring only small elementary benchmarks to fix the model parameter values on the hardware used,
• a fine model not requiring the application to exhibit a perfect scalability on the architecture.

Page 30

Need for a new performance model and an auto‐adaptive solution

Fine model required

A first version of this fine model exists.

It takes into account:
• the different power dissipations of the different « identical » nodes of the cluster
• when starting computations on the GPU, the power dissipation increases 2 times: when the GPU starts to compute, and when the fan of the GPU starts and/or when the GPU increases its frequency, …
• when stopping computations on the GPU, the power dissipation decreases several times, but not immediately!
• …
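The stepped power behaviour can be captured by integrating a piecewise-constant power profile over time; a toy sketch where the levels and delays are invented (the real ones would come from the Raritan measurements):

```python
# Energy of one GPU run from a piecewise-constant power profile:
# idle -> compute start (first jump) -> fan/frequency kick in
# (second jump) -> computation ends but power decays in delayed
# steps, not immediately. Profile = list of (duration_s, power_W).

def energy_wh(profile):
    # Integrate P over t; return watt-hours.
    return sum(dt * p for dt, p in profile) / 3600.0

run = [
    (5.0, 120.0),    # idle before the kernel starts
    (2.0, 180.0),    # GPU starts to compute: first power jump
    (60.0, 240.0),   # fan on / frequency raised: second jump
    (4.0, 200.0),    # computation over, power decays...
    (6.0, 150.0),    # ...in several delayed steps
    (5.0, 120.0),    # back to idle
]

print(round(energy_wh(run), 3))
```

Summing segment-by-segment like this is what lets the fine model charge the start-up jumps and the delayed power decay to the right part of the run.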

Page 31

Need for a new performance model and an auto‐adaptive solution

Fine model required

[Figure: measured electrical power (Watts) versus time (s) during a run, showing the stepped power increases at GPU start-up and the delayed, stepped decreases after the computation ends.]

Page 32

Need for a new performance model and an auto‐adaptive solution

Fine model required

First evaluation:
• on our PDE solver execution and on our heterogeneous GPU cluster
• error (model vs observation): 6%
• after some bias corrections: 1%

A stronger evaluation is planned:
• on different hardware
• on different applications
• then: a heuristic of auto-adaptation/auto-selection of the right algorithm will be implemented

To be continued…

Page 33

5 ‐ Conclusion and perspectives

Page 34

Long term objectives

The user runs: a.out ‐O [speed, energy, edp, …]

Available implementations:
• Parallel algorithm 1: Kernel 1-v1, Kernel 1-v2, Kernel 1-v3
• Parallel algorithm 2: Kernel 2-v1, Kernel 2-v2

Auto-adaptation of a.out: a heuristic scheduler, fed by the performance model (energy and computations) and by elementary hardware benchmarks, selects the right parallel algorithm and kernels, so that the run ends after a fast execution, a low-energy-consumption execution, or an edp-compromise execution, depending on the requested optimization.

Page 35

Équipe Projet INRIA AlGorille
Équipe IMS

Computing and energy performance optimization of a multi-algorithms PDE solver on CPU and GPU clusters

Questions?