Nanco: a large HPC cluster for RBNI
(Russell Berrie Nanotechnology Institute)
Anne Weill-Zrahia, Technion Computer Center
October 2008
Resources needed for applications arising from Nanotechnology
- Large memory – Tbytes
- High floating point computing speed – Tflops
- High data throughput – state of the art …
SMP architecture
[Diagram: multiple processors (P) sharing a single memory]
Cluster architecture
[Diagram: nodes, each with its own processors and memory, connected by an interconnection network]
Why not a cluster
- Single SMP system easier to purchase/maintain
- Ease of programming in SMP systems
Why a cluster
- Scalability
- Total available physical RAM
- Reduced cost
But …
- Having an application which exploits the parallel capabilities
- Studying the application or applications which will run on the cluster
Things to include in design

Property of code                    | Essential component
CPU bound                           | Fast computing unit
Memory bound                        | Large memory, fast access
Global flow of data in parallel app | Fast interconnect
Our choices

Property of code               | Essential component       | Choice
Computationally intensive, FP  | Fast computing unit       | 64-bit dual-core Opteron, Rev. F
Large matrices                 | Large memory, fast access | 8 GB/node
Finite element, spectral codes | Fast interconnect         | InfiniBand DDR (20 Gb/s, low latency)
Other requirements
- Space, power, cooling constraints, strength of floors

Software configuration:
1. Operating system
2. Compilers & application development tools
3. Load balancing and job scheduling
4. System management tools
Configuration
[Diagram: compute nodes, each with processors (P) and memory (M), connected through an InfiniBand switch]
Before finalizing our choice …
One should check, on a similar system:
- Single processor peak performance
- InfiniBand interconnect performance
- SMP behaviour
- Non-commercial parallel applications behaviour
Parallel applications issues
- Execution time
- Parallel speedup: Sp = T1/Tp (worked example below)
- Scalability
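As a worked example, using the Monte-Carlo timings reported later in this deck: on Nanco T1 = 4389 s and T4 = 1154.8 s, giving S4 = 4389/1154.8 ≈ 3.8, i.e. a parallel efficiency (Sp/p) of about 95% on 4 processes.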
Benchmark design
- Must give a good estimate of the performance of your application
- Acceptance test – should match all its components
Comparison of performance

Computer                | Carmel     | Nanco
LAPACK program, N=9000  | 487 Mflops | 3826.4 Mflops

Ratio of 7.8 !!
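For context, here is a minimal sketch (in C) of the kind of LAPACK timing run behind these numbers. It is not the actual benchmark code: the use of LAPACKE_dgesv, the random right-hand side, and the (2/3)n^3 flop count are our illustrative assumptions. Link with -llapacke -llapack -lblas.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

int main(void)
{
    const int n = 9000;                   /* matrix size from the slide */
    double *a = malloc((size_t)n * n * sizeof *a);
    double *b = malloc((size_t)n * sizeof *b);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
    if (!a || !b || !ipiv) return 1;

    srand(1);                             /* arbitrary reproducible input */
    for (size_t i = 0; i < (size_t)n * n; i++)
        a[i] = rand() / (double)RAND_MAX;
    for (int i = 0; i < n; i++)
        b[i] = rand() / (double)RAND_MAX;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* LU-factorize A and solve A x = b in one call */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1, a, n, ipiv, b, 1);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    /* An LU solve costs roughly (2/3) n^3 floating-point operations */
    double flops = 2.0 / 3.0 * (double)n * (double)n * (double)n;
    printf("info=%d  time=%.2f s  rate=%.1f Mflops\n",
           (int)info, secs, flops / secs / 1e6);
    free(a); free(b); free(ipiv);
    return 0;
}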
Execution time of Monte-Carlo parallel code (MPI), in seconds

Processes | Carmel           | Nanco
1         | 22042 (~6 hrs !) | 4389 (~1 hr)
2         | 12246            | 1739
4         | 4809             | 1154.8
8         | 3540             | 642.12
16        |                  | 282.5
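The production Monte-Carlo code is not shown in the slides; as a stand-in, here is a minimal MPI Monte-Carlo sketch in C (pi estimation) illustrating the embarrassingly parallel compute-then-reduce pattern being timed. The sample count and the per-rank srand seeding are illustrative choices only.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n = 10000000L;         /* samples per process (arbitrary) */
    srand(rank + 1);                  /* crude per-rank stream separation */

    double t0 = MPI_Wtime();
    long hits = 0;
    for (long i = 0; i < n; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0)     /* point falls inside quarter circle */
            hits++;
    }
    long total = 0;
    MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    double dt = MPI_Wtime() - t0;

    if (rank == 0)
        printf("pi ~ %.6f  (%d processes, %.2f s)\n",
               4.0 * total / ((double)n * size), size, dt);
    MPI_Finalize();
    return 0;
}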
[Chart: Speedup of Parallel Monte Carlo (MILC); x-axis: number of processes (2–64); y-axis: execution time]
What did work
- Running MPI code interactively
- Running a serial job through the queue
- Compiling C code with MPI (see the sketch below)
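For reference, a minimal MPI "hello" in C of the sort used for these smoke tests; the build and launch lines in the comment are the generic mpicc/mpirun ones, not Nanco's actual queue-submission syntax.

/* Build and run (generic):
 *   mpicc hello.c -o hello
 *   mpirun -np 4 ./hello
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}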
What did not work
- Compiling F90 or C++ code with MPI
- Running MPI code through the queue
- Queues do not do accounting per CPU
Parallel performance results
- Theoretical peak: 2.1 Tflops
- Nanco performance on HPL: 0.58 Tflops
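For scale, 0.58 Tflops out of a 2.1 Tflops theoretical peak is roughly 28% HPL efficiency.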
Comparison with Sun Benchmark
[Chart: Sun benchmark vs. Nanco (Pathscale), 2 ppn; x-axis: number of processes (2–16); applications: MVH1, MILC, IGOR]
Execution time – comparison of compilers
[Chart: "MILC small – 2th/n"; execution time vs. number of processes (1–32) for Sun-bench, Nanco-gcc3, Nanco-sunc, Nanco-path, Nanco-gcc4]
[Chart: "Parallel Speedup for MILC (2th/n)"; speedup vs. number of processes (2–64) for SUN-bench, Nanco-sun, Nanco-path]
Performance with different optimizations
[Chart: execution time of MVH1 on Nanco with 32 threads, by type of optimization: VoltaireMPI+Pathscale, OpenMPI+opt.plac., OpenMPI+opt.plac.+tmp disk]
Conclusions from acceptance tests
- New gcc (gcc4) is faster than Pathscale for some applications
- MPI collective communication functions are implemented differently in the various MPI versions (see the timing sketch below)
- Disk access times are crucial – use attached storage when possible
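A minimal sketch of the kind of micro-benchmark that exposes the collective-communication differences noted above; the message size and repetition count are arbitrary illustrative choices.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;              /* 1M doubles = 8 MB per buffer */
    const int reps = 50;
    double *in  = malloc(n * sizeof *in);
    double *out = malloc(n * sizeof *out);
    for (int i = 0; i < n; i++) in[i] = (double)i;

    MPI_Barrier(MPI_COMM_WORLD);        /* synchronize before timing */
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; r++)
        MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double dt = (MPI_Wtime() - t0) / reps;

    if (rank == 0)
        printf("%d ranks: MPI_Allreduce of %d doubles: %.3f ms average\n",
               size, n, dt * 1e3);
    free(in); free(out);
    MPI_Finalize();
    return 0;
}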
Scheduling decisions
- Assessing priorities between user groups
- Assessing parallel efficiency of different job types (MPI, serial, OpenMP) and commercial software, and designing special queues for them
- Avoiding starvation by giving weight to the urgency parameter
Observations during production mode
- Assessing users' understanding of the machine – support in writing scripts and efficient parallelization
- Lack of visualization tools – writing a script to show current usage of the cluster
Utilization of cluster
[Chart: daily utilization of Nanco, September 2008; utilization vs. date]
Nanco jobs by type
[Pie chart: Nanco jobs, February 2008, by job type: scalar, Fullwave, self-developed code]
Conclusion
- Correct benchmark design is crucial to test the capabilities of a proposed architecture
- Acceptance tests allow us to negotiate with vendors and give insights on future choices
- Only after several weeks of running the cluster at full capacity can we make informed decisions on management of the cluster