Nanco: a large HPC cluster for RBNI
(Russell Berrie Nanotechnology Institute)
Anne Weill-Zrahia, Technion Computer Center
October 2008
Resources needed for applications arising from Nanotechnology
- Large memory – Tbytes
- High floating point computing speed – Tflops
- High data throughput – state of the art …
SMP architecture
[Diagram: multiple processors (P) sharing a single memory]
Cluster architecture
[Diagram: nodes, each with its own processors and memory, connected by an interconnection network]
Why not a cluster
- Single SMP system easier to purchase/maintain
- Ease of programming in SMP systems
Why a cluster
- Scalability
- Total available physical RAM
- Reduced cost
But …
- Having an application which exploits the parallel capabilities
- Studying the application or applications which will run on the cluster
Things to include in design

Property of code                    | Essential component
CPU bound                           | Fast computing unit
Memory bound                        | Large memory, fast access
Global flow of data in parallel app | Fast interconnect
Our choices

Property of code               | Essential component       | Choice
Computationally intensive, FP  | Fast computing unit       | 64-bit dual-core Opteron, Rev. F
Large matrices                 | Large memory, fast access | 8 GB/node
Finite element, spectral codes | Fast interconnect         | InfiniBand DDR (20 Gb/s, low latency)
Other requirements
- Space, power, cooling constraints, strength of floors

Software configuration:
1. Operating system
2. Compilers & application development tools
3. Load balancing and job scheduling
4. System management tools
Configuration
[Diagram: compute nodes, each with processors (P) and memory (M), connected through an InfiniBand switch]
Before finalizing our choice …
One should check, on a similar system:
- Single processor peak performance
- InfiniBand interconnect performance
- SMP behaviour
- Non-commercial parallel applications behaviour
Parallel applications issues
- Execution time
- Parallel speedup: Sp = T1/Tp (worked example below)
- Scalability
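As a worked example, using the Monte-Carlo timings reported later in this deck: on Nanco T1 = 4389 s and T4 = 1154.8 s, giving S4 = 4389/1154.8 ≈ 3.8, i.e. a parallel efficiency (Sp/p) of about 95% on 4 processes.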
Benchmark design
- Must give a good estimate of the performance of your application
- Acceptance test – should match all its components
Comparison of performance

Computer                | Carmel     | Nanco
LAPACK program, N=9000  | 487 Mflops | 3826.4 Mflops

Ratio of 7.8 !!
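For context, here is a minimal sketch (in C) of the kind of LAPACK timing run behind these numbers. It is not the actual benchmark code: the use of LAPACKE_dgesv, the random right-hand side, and the (2/3)n^3 flop count are our illustrative assumptions. Link with -llapacke -llapack -lblas.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

int main(void)
{
    const int n = 9000;                   /* matrix size from the slide */
    double *a = malloc((size_t)n * n * sizeof *a);
    double *b = malloc((size_t)n * sizeof *b);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
    if (!a || !b || !ipiv) return 1;

    srand(1);                             /* arbitrary reproducible input */
    for (size_t i = 0; i < (size_t)n * n; i++)
        a[i] = rand() / (double)RAND_MAX;
    for (int i = 0; i < n; i++)
        b[i] = rand() / (double)RAND_MAX;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* LU-factorize A and solve A x = b in one call */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1, a, n, ipiv, b, 1);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    /* An LU solve costs roughly (2/3) n^3 floating-point operations */
    double flops = 2.0 / 3.0 * (double)n * (double)n * (double)n;
    printf("info=%d  time=%.2f s  rate=%.1f Mflops\n",
           (int)info, secs, flops / secs / 1e6);
    free(a); free(b); free(ipiv);
    return 0;
}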
Execution time of Monte-Carlo parallel code (MPI), in seconds

Processes | Carmel           | Nanco
1         | 22042 (~6 hrs !) | 4389 (~1 hr)
2         | 12246            | 1739
4         | 4809             | 1154.8
8         | 3540             | 642.12
16        |                  | 282.5
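The production Monte-Carlo code is not shown in the slides; as a stand-in, here is a minimal MPI Monte-Carlo sketch in C (pi estimation) illustrating the embarrassingly parallel compute-then-reduce pattern being timed. The sample count and the per-rank srand seeding are illustrative choices only.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n = 10000000L;         /* samples per process (arbitrary) */
    srand(rank + 1);                  /* crude per-rank stream separation */

    double t0 = MPI_Wtime();
    long hits = 0;
    for (long i = 0; i < n; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0)     /* point falls inside quarter circle */
            hits++;
    }
    long total = 0;
    MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    double dt = MPI_Wtime() - t0;

    if (rank == 0)
        printf("pi ~ %.6f  (%d processes, %.2f s)\n",
               4.0 * total / ((double)n * size), size, dt);
    MPI_Finalize();
    return 0;
}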
[Chart: Speedup of Parallel Monte Carlo (MILC); x-axis: number of processes (2–64); y-axis: execution time]
What did work
- Running MPI code interactively
- Running a serial job through the queue
- Compiling C code with MPI (see the sketch below)
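For reference, a minimal MPI "hello" in C of the sort used for these smoke tests; the build and launch lines in the comment are the generic mpicc/mpirun ones, not Nanco's actual queue-submission syntax.

/* Build and run (generic):
 *   mpicc hello.c -o hello
 *   mpirun -np 4 ./hello
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}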
What did not work
- Compiling F90 or C++ code with MPI
- Running MPI code through the queue
- Queues do not do accounting per CPU
Parallel performance results
- Theoretical peak: 2.1 Tflops
- Nanco performance on HPL: 0.58 Tflops
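For scale, 0.58 Tflops out of a 2.1 Tflops theoretical peak is roughly 28% HPL efficiency.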
Comparison with Sun Benchmark
[Chart: Sun benchmark vs. Nanco (Pathscale), 2 ppn; x-axis: number of processes (2–16); applications: MVH1, MILC, IGOR]
Execution time – comparison of compilers
[Chart: "MILC small – 2th/n"; execution time vs. number of processes (1–32) for Sun-bench, Nanco-gcc3, Nanco-sunc, Nanco-path, Nanco-gcc4]
[Chart: "Parallel Speedup for MILC (2th/n)"; speedup vs. number of processes (2–64) for SUN-bench, Nanco-sun, Nanco-path]
Performance with different optimizations
[Chart: execution time of MVH1 on Nanco with 32 threads, by type of optimization: VoltaireMPI+Pathscale, OpenMPI+opt.plac., OpenMPI+opt.plac.+tmp disk]
Conclusions from acceptance tests
- New gcc (gcc4) is faster than Pathscale for some applications
- MPI collective communication functions are implemented differently in the various MPI versions (see the timing sketch below)
- Disk access times are crucial – use attached storage when possible
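A minimal sketch of the kind of micro-benchmark that exposes the collective-communication differences noted above; the message size and repetition count are arbitrary illustrative choices.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;              /* 1M doubles = 8 MB per buffer */
    const int reps = 50;
    double *in  = malloc(n * sizeof *in);
    double *out = malloc(n * sizeof *out);
    for (int i = 0; i < n; i++) in[i] = (double)i;

    MPI_Barrier(MPI_COMM_WORLD);        /* synchronize before timing */
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; r++)
        MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double dt = (MPI_Wtime() - t0) / reps;

    if (rank == 0)
        printf("%d ranks: MPI_Allreduce of %d doubles: %.3f ms average\n",
               size, n, dt * 1e3);
    free(in); free(out);
    MPI_Finalize();
    return 0;
}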
Scheduling decisions
- Assessing priorities between user groups
- Assessing parallel efficiency of different job types (MPI, serial, OpenMP) and commercial software, and designing special queues for them
- Avoiding starvation by giving weight to the urgency parameter
Observations during production mode
- Assessing users' understanding of the machine – support in writing scripts and efficient parallelization
- Lack of visualization tools – writing a script to show current usage of the cluster
Utilization of cluster
[Chart: daily utilization of Nanco, September 2008; utilization vs. date]
Nanco jobs by type
[Pie chart: Nanco jobs, February 2008, by job type: scalar, Fullwave, self-developed code]
Conclusion
- Correct benchmark design is crucial to test the capabilities of a proposed architecture
- Acceptance tests allow us to negotiate with vendors and give insights on future choices
- Only after several weeks of running the cluster at full capacity can we make informed decisions on management of the cluster