CriB 2010 Seminar Series

Scientific Computing on the Cloud: Many Task Computing and other opportunities

Constantinos Evangelinos, Chris Hill, Pierre F. J. Lermusiaux, Jinshan Xu, Patrick J. Haley Jr.
MIT Earth, Atmospheric and Planetary Sciences / MIT Mechanical Engineering
C. Evangelinos ([email protected])




Outline

● Many Task Computing
● ESSE as an MTC application

● ESSE on clusters, grids and Amazon EC2

● Amazon EC2 for HPC?

● Amazon EC2 for education

● Conclusions


Motivation

● Could cloud computing be in our future for climate (ocean and coupled climate) models?
 – Can it be useful for more than EP or Map-Reduce type applications?
 – Are the days of having to purchase, install and maintain personal clusters coming to an end?
 – Could grant money buy cloud cycles some day?
 – Can it be used for HPC instruction?
 – Can it be used for Geosciences education?
● What about HPC performance in a virtual machine environment?
 – Issues and middleware


Many Task Computing

● Loose definition by Foster et al.: high-performance computations comprising multiple distinct activities, coupled via (for example) file system operations or message passing. Tasks may be small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large.
● What it is not:
 – Plain MPMD (unless one speaks of dynamic/heterogeneous)
 – Workflow (only part of the story)
 – Capacity computing
 – High Throughput computing
 – Embarrassingly parallel computing
● Instead of jobs/day, the metric is tasks completed per second or per hour.


DA Motivation

● Improve the forecasting capabilities of ocean data assimilation and related fields via increased access to parallelism
● Move the existing computational framework to a more modern, non-site-specific setup
● Test the opportunities for executing massive task count workflows on distributed clusters, Grid and Cloud platforms
● Provide an external outlet to handle peak demand for compute resources during live experiments in the field


Ocean Data Assimilation

dx = M(x, t) dt + dη;  M the model operator
y_k^o = H(x_k, t_k) + ε_k;  H the measurement operator
min_x J(x_k, y_k^o; dη, ε_k, Q(t), R_k);  J the objective function

Model errors are assumed Brownian: dη = N(0, Q(t)) with E{dη(t) dη(t)^T} = Q(t) dt.
In fact the models are forced by processes with noise correlated in space and time (meteo).
Measurement errors follow white Gaussian statistics: ε_k = N(0, R_k).
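In standard notation the formulation above can be typeset as follows (a transcription of the bullets; writing the model term as M(x,t) dt follows the usual stochastic-model convention implied by the Brownian error statistics):

```latex
\begin{aligned}
  d\mathbf{x} &= \mathcal{M}(\mathbf{x},t)\,dt + d\boldsymbol{\eta},
    \qquad d\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0},\mathbf{Q}(t)),
    \qquad E\{d\boldsymbol{\eta}(t)\,d\boldsymbol{\eta}(t)^{T}\} = \mathbf{Q}(t)\,dt,\\
  \mathbf{y}_{k}^{o} &= \mathcal{H}(\mathbf{x}_{k},t_{k}) + \boldsymbol{\varepsilon}_{k},
    \qquad \boldsymbol{\varepsilon}_{k} \sim \mathcal{N}(\mathbf{0},\mathbf{R}_{k}),\\
  \hat{\mathbf{x}} &= \arg\min_{\mathbf{x}}\,
    J(\mathbf{x}_{k},\mathbf{y}_{k}^{o};\, d\boldsymbol{\eta},
      \boldsymbol{\varepsilon}_{k},\mathbf{Q}(t),\mathbf{R}_{k}).
\end{aligned}
```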


Ocean Acoustics

Estimates of the ocean temperature and salinity fields (and their uncertainties) are necessary for calculating acoustic fields and their uncertainties.

Sound-propagation studies often focus on vertical sections. Time is fixed and an acoustic broadband transmission loss (TL) field is computed for each ocean realization.

A sound source of specific frequency, location and depth is chosen. The coupled physical-acoustical covariance P for the section is computed, non-dimensionalized and used for assimilation of hydrographic and TL data.


Acoustic climatology maps

● Underwater acoustics transmission loss variability predictions in a 56 x 33 km area northeast of Taiwan.
● 2D propagation over 15 km distance at 31x31 = 961 grid points x 8 directions.
● Each job is a short 3-minute 2D acoustics ray propagation problem.
● Distributed on 100 dual-core compute nodes; speedup of more than 100x in a real-time experiment (SGE overhead of scheduling short jobs).

[Maps: Mean Transmission Loss (TL), TL STD over depth, TL STD over bearing; 77 km x 65 km domain; mean TL 55-65 dB, STD 0.1-3 dB; effects of internal tides and of steep bathymetry are visible.]


Canyon Nx2D acoustics modeling

[Figures: OMAS moving sound source; bathymetry of Mien Hua Canyon]


AOSN-II Monterey Bay


Error Subspace Statistical Estimation


ESSE Surf. Temp. Error Standard Deviation Forecasts for AOSN-II

[Panels for Aug 12, 13, 14 (Start of Upwelling, First Upwelling period) and Aug 24, 27, 28 (End of Relaxation, Second Upwelling period)]

Leonard and Ramp, Lead PIs


Serial and Parallel ESSE workflows


The ESSE workflow engine

● Is actually (for historical and practical reasons) a heavily modified C-shell script (master)!
 – Catches signals to kill all remaining jobs
● Grid Engine, Condor and PBS variants
 – Submits and tracks singleton jobs
 – Or uses job arrays for scalability
 – Further variants depending on I/O strategy:
  ● Separate pert singletons?
  ● Input/output to shared or local disk (or mixed)?
● Shared directories store files with the execution status of each of the singleton scripts
● Singletons need the perturbation number: tricks!
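The shared-directory bookkeeping can be sketched in a few lines. This is a minimal stand-in (file names and the in-process `singleton` are hypothetical); in the real engine the singletons are queued shell scripts, and the perturbation number arrives via a scheduler trick such as SGE's `$SGE_TASK_ID` for job arrays:

```python
import pathlib
import tempfile

# Shared directory where each singleton records its execution status,
# mirroring the C-shell master's bookkeeping (file names are hypothetical).
status_dir = pathlib.Path(tempfile.mkdtemp())

def singleton(pert_no: int) -> None:
    # Stand-in for one queued singleton job; it marks itself DONE on success.
    out = status_dir / f"pert_{pert_no:04d}.status"
    out.write_text("DONE\n")

for pert in range(1, 6):        # tiny 5-member stand-in ensemble
    singleton(pert)

# The master polls the shared directory to find finished members.
done = sorted(p.stem for p in status_dir.glob("pert_*.status")
              if p.read_text().strip() == "DONE")
print(f"{len(done)} of 5 singletons finished")
```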


Multi-level parallelism in ESSE

● Nested ocean model runs (HOPS) are run in parallel
 – Limited parallelism
 – 2 or 3 levels
 – bi-directional
● SVD calculation is based on parallelizable LAPACK routines
● Convergence check calculation also


ESSE and ocean acoustics

● As things stand ESSE is used to provide the necessary temperature and salinity information for sound propagation studies.
● The ESSE framework can also be extended to acoustic data assimilation. With significantly more compute power one can compute the whole “acoustic climate” in a 3D region:
 – providing TL for any source and receiver locations in the region as a function of time and frequency,
 – by running multiple independent tasks for different sources/frequencies/slices at different times.


Canyon Nx2D acoustics modeling

● Acoustics transmission loss difference over 6 hours (internal tides or other uncertainties)
● In the future, incorporating this with ESSE for uncertainty estimation will bring the computational cost to 1800 (directions) x 15 locations x hundreds of cases.


Ocean DA/ESSE/acoustics: MTC

● A minimum of hundreds to thousands (and with increased fidelity tens of thousands) of ocean model runs (tens of minutes or more), preceded by an equal number of IC perturbations (secs)
● File I/O intensive, both for reading and writing
● Concurrent reads to forcing files etc.
● Thousands of short acoustics runs (mins)
● Future directions for ESSE will generate even more tasks:
 – dynamic path sampling for observing assets
 – combined physical-acoustical ESSE


“Real-time” experiments


Notable differences

From many parameter sweeps and other MTC apps:
● there is a hard deadline associated with the execution of the ensemble workflow, as a forecast needs to be timely;
● the size of the ensemble is dynamically adjusted according to the convergence of the ESSE workflow, which is not a DAG;
● individual ensemble members are not significant (and their results can be ignored if unavailable) - what is important is the statistical coverage of the ensemble;
● the full resulting dataset of each ensemble member forecast is required, not just a small set of numbers; ICs are different for each ensemble member;
● individual forecasts within an ensemble, especially in the case of interdisciplinary interactions and nested meshes, can be parallel programs themselves.


And their implications

● Deadline: use any Advance Reservation capabilities available.
● Dynamic: the actual total compute and data requirements for the forecast are not known beforehand and change dynamically.
● Dropped members: failures (due to software or hardware problems) are not catastrophic and can be tolerated. Moreover, runs that have not finished (or even started) by the forecast deadline can be safely ignored, provided they do not collectively represent a systematic hole in the statistical coverage.
● I/O needs: relatively high data storage and network bandwidth constraints will be placed on the underlying infrastructure.
● Parallel ensemble members: the compute requirements will not be insignificant either.
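The "deadline plus dropped members" pattern can be sketched with a thread pool: collect whatever finishes before the cutoff and compute the ensemble statistics from those members only. Timings and the mock forecast values below are hypothetical stand-ins for real model runs:

```python
import concurrent.futures as cf
import statistics
import time

FORECAST_DEADLINE_S = 0.5          # hypothetical cutoff; real forecasts use hours

def ensemble_member(i: int) -> float:
    # Stand-in for one perturbed ocean-model run; every 4th member is a
    # straggler that will miss the deadline.
    time.sleep(1.5 if i % 4 == 0 else 0.01)
    return 10.0 + 0.1 * i          # mock forecast value

pool = cf.ThreadPoolExecutor(max_workers=8)
futures = {pool.submit(ensemble_member, i): i for i in range(12)}

results = {}
try:
    for fut in cf.as_completed(futures, timeout=FORECAST_DEADLINE_S):
        results[futures[fut]] = fut.result()
except cf.TimeoutError:
    pass                           # stragglers are simply dropped, as in ESSE

pool.shutdown(wait=False, cancel_futures=True)   # don't wait for late members
ensemble_mean = statistics.mean(results.values())
print(f"{len(results)}/12 members used; ensemble mean = {ensemble_mean:.2f}")
```

The real workflow applies the same idea through scheduler job tracking rather than an in-process pool, with the added check that the surviving members still cover the error subspace.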


Ocean DA on local clusters

● Local Opteron cluster
 – Opteron 250 2.4GHz (4GB RAM) compute nodes (single gigabit network connection)
 – Opteron 2380 2.5GHz (24GB RAM) head node
 – 18TB of shared disk (NFS) over 10Gbit Ethernet
 – 200Gbit switch backplane
 – Grid Engine and Condor co-existing
● Tried both Grid Engine and Condor versions of the ESSE workflows. Test 600-member ensemble:
 – I/O optimizations (all local dirs): 86 down to 77 mins
 – SGE 10-20% faster than Condor
  ● without heroic tuning of the latter


Ocean DA on the Teragrid

● Extensive use of sshfs to share directories for checking state of runs etc.
● Remote job submissions (over (gsi)ssh)
 – part of driver and modified singletons
● Or Condor-C and Glide-in (with care, if root)
● Condor-G will not scale
● Or Personal Condor & MyCluster

System  cores  pert   pemodel
ORNL    2      67.83  1823.99
Purdue  4      6.25   1107.4
local   2      6.21   1531.33


Advantages of the Teragrid

● Enormous numbers of theoretically available cores and very large sizes for storage
 – Condor pool supposedly 14-27k cores (~1800)
● Shared high-speed parallel filesystems
● High speed connections to the home cluster
● Suites of Grid software for remote file access and job submission, control etc.
 – Mixed blessing...
● Free, after writing the proposal to convince Teragrid to get the SUs...


Disadvantages of the Teragrid

● Very large heterogeneity in hardware, O/S and paths (to scratch disks etc.), requiring mods to the singleton code - user confusion.
● Without advance reservations one cannot be guaranteed not to have to use multiple Teragrid sites to reach the desired number of processors within the deadline.
 – Backfilling can help, but per-user job limits also limit the usability of a single Teragrid site
 – Schedulers favor large processor count runs
 – Complicated tricks to submit many jobs as one
● Teragrid MPPs not always suitable for scripts
● Careful fetching of results back to home (congestion)


Ocean DA on the Cloud

● We have been experimenting with the use of Cloud computing for more traditional HPC usage - including parallel runs of I/O intensive data parallel ocean models such as MITgcm.
● Given the limitations seen in network performance, it was natural to investigate the usability of Amazon EC2 for MTC applications such as ESSE.


Cloud Modes of usage

● Stand-alone (batch) on-demand EC2 cluster
 – Torque or SGE (all-in-the-cloud or remote submits)
● Augmented local cluster with EC2 nodes
 – We have a Torque setup
 – Used recipes for SGE setup
 – Condor use of EC2 too restrictive
 – MyCluster dynamic SGE or Condor merged clusters
 – Commercial (Univa UniCloud, Sun Cloud Adapter in Hedeby/SDM) for fully dynamic provisioning
● Experimentation with parallel filesystems: PVFS2/GlusterFS/FhGFS


Serial pert/pemodel performance

System      cores  pert   pemodel
m1.small    0.5    13.53  2850.14
m1.large    2      9.33   1817.13
m1.xlarge   4      9.14   1860.81
c1.medium   2      9.8    1008.11
c1.xlarge   8      6.67   1030.42
m2.2xlarge  4      3.39   779.77
m2.4xlarge  8      3.35   790.86

● m1.xxxx AMIs are using Opteron processors
● A binary optimized with the Pathscale compilers was used
● All cores were loaded
● I/O is to local disk (EBS is slower; so is NFS, which is used for the centrally coordinating directory of the run)
● Total runtime is reported
● Better than 2.5x speedup from m1.small to c1.medium
● Nehalems (m2.xxxxx) not the best option for price/perf.


Advantages of the Cloud

● For all intents and purposes the response is immediate. Currently a request for a virtual EC2 cluster gets satisfied on-demand, without having to worry about queue times and backfill slots.
● The use of virtual machines allows for deploying the same environment as the home cluster. This provides for a very clean integration of the two clusters.
● Having the same software environment also results in no need to rebuild (and in most cases revalidate) executables. This means that last-minute changes (because of model build-time parameter tuning) can be used ASAP instead of having to go through a build-test-deploy cycle on each remote platform.
● EC2 allows our virtual clusters to scale at will (default limit 20 instances).
● Since the remote machines are under our complete control, scheduling software and policies etc. are tuned to our needs.


Cost analysis

● Cost-wise, for example, an ESSE calculation with 1.5GB input data and 960 ensemble members each sending back 11MB (for a total of 10.56GB) would cost:
 – 1.5(GB) x 0.1 + 10.56(GB) x 0.17 for the data
 – 2(hr) x 20 x 0.68 for the computations
 – For a total of $29.15
● Use of reserved instances would drop pricing for the cpu usage by more than a factor of 3.
● Compare that to the cost of overprovisioning your local cluster resources to handle the peak load required a few times a year.


Disadvantages of the Cloud

● Inhomogeneity needs to be kept in mind or it will bite you.
● Any extra security issues need to be worked out.
● EC2 usage needs to be paid directly to Amazon. Amazon charges by the hour - like a cell phone, 1 hour 1 sec. counts as 2 hours. Charges for data movement in and out of EC2.
● The performance of virtual machines is less than that of “bare metal”; the difference is more pronounced when it comes to I/O.
● No persistent large parallel filesystem. One can be constructed on demand (just like the virtual clusters), but the Gigabit Ethernet connectivity used throughout Amazon EC2, alongside the randomization of instance placement, means that parallel performance of the filesystem is not up to par. Horror stories...
● Unlike national and state supercomputing facilities, Amazon's connections to the home cluster are bound to be slower and result in file transfer delays.


Future work directions

● Reimplement the workflow engine.
 – Considering Swift - other options? Nimrod?
● Generalize the ESSE work-engine away:
 – Use with other ocean models (MITgcm, ROMS)
● Expand production use of ESSE:
 – Heterogeneous sites on the Teragrid
 – Open Science Grid
 – MPPs with sufficient support: Blue Gene/P?
● Expand uses for ESSE (and number of tasks):
 – ESSE for Acoustics
 – ESSE for adaptive sampling


Which sampling on Aug 26 optimally reduces uncertainties on Aug 27?

● 4 candidate tracks, overlaid on the surface T fct for Aug 26
● Starting from the IC (nowcast) on Aug 24, a 2-day ESSE fct is run to Aug 26, followed by DA of each candidate track (DA 1-4) and ESSE fcts for Tracks 1-4 to Aug 27
● Based on nonlinear error covariance evolution: for every choice of adaptive strategy, an ensemble is computed
● Best predicted relative error reduction: track 1


Memory Bandwidth

[Bar chart: per-thread memory bandwidth with 1 thread vs N threads, for m1.small, c1.medium and a 1.4GHz Opteron system; reported values include 5.4, 2.6, 2.8, 5.3 and 5.6 GB/s per thread]

● The small instance memory bandwidth appears to be equal to the full memory bandwidth expected from such a platform, despite the 50% cpu time throttler - not entirely unexpected for memory bandwidth.
● The faster CPU in the c1.medium instance does considerably worse.
● In fact an original 1st-gen 1.4 GHz Opteron system also does worse (DDR2 memory in the m1.small instance should help).
● This suggests that for memory bandwidth limited applications the small instance may be the most efficient.
● The increase of memory bandwidth with the c1.medium instance suggests that the 2 cores are not on the same die. This would be an Amazon policy.


Serial Performance

System     Class A  Class W  EP (A)  EP (W)
m1.small   132      149      6.66    6.73
c1.medium  312      357      15.59   15.04
ratio      2.36     2.4      2.34    2.23

● NAS NPB serial (geometric mean of all tests except EP) in Mop/s
● Compiled with the system gcc (generic flags)
● A single instance running in the c1.medium case (no memory bandwidth contention)
● The 1:2.5 theoretical ratio becomes 1:2.3
● Still, the price ratio is 1:2
● When loading both cores in the c1.medium, the resulting ratio depends on the memory vs cpu utilization characteristics of the individual benchmark
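A quick check of the table's ratio row, using the same geometric-mean aggregation the NPB Mop/s numbers rely on:

```python
from math import prod

def geomean(xs):
    # Geometric mean, the aggregation used for the NPB Mop/s summary numbers.
    return prod(xs) ** (1.0 / len(xs))

# c1.medium over m1.small speedups from the table (Class A, Class W, EP A, EP W)
ratios = [312 / 132, 357 / 149, 15.59 / 6.66, 15.04 / 6.73]
print([round(r, 2) for r in ratios])      # matches the table's ratio row
print(round(geomean(ratios), 2))          # overall speedup, close to the 1:2.3 quoted
```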


I/O performance

Serial IOR read and write bandwidth (128KB requests, 1MB blocksize, 100 segments), fsync, MB/s:

           m1.small  c1.medium  c1.medium 2 cores
/tmp write 168       284        264
/tmp read  1626      1882       2399
NFS write  40        42         41
NFS read   73        63         63

Serial NAS NPB 3.3 BT-IO (Fortran I/O), MB/s:

             m1.small  c1.medium  c1.medium 2 cores
Class A /tmp 17        29         28
Class W /tmp 16        29         28
Class A NFS  15        26         21
Class W NFS  14        30         26


MPI performance

                unidirectional BW (MB/s)  bidirectional BW (MB/s)  latency (us)
LAM             57.85                     81.98                    81.2
GridMPI         54.6                      77.07                    83.46
MPICH2 nemesis  15.72                     26.08                    300
MPICH2 sock     58.49                     83.42                    85.87
OpenMPI         16.44                     17.99                    300
LAM/ACES        117.64                    198.59                   35.83


MPI performance cont.


Coupled climate simulation

● The MITgcm (MIT General Circulation Model) is a numerical model designed for study of the atmosphere, ocean, and climate.
 – MPI (+OpenMP) code base, Fortran 77(/90) with some C bits - very portable
 – Memory bandwidth intensive but not entirely memory bound - also I/O intensive for climate applications
● Coupled ocean-coupler-(atmosphere-land-ice) model on a ~2.8° cubed sphere (6 32x32 faces)
 – MPMD mode, 3 binaries, up to 6+6+1 processes in a standard configuration


ECCO-GODAE

● 1-degree MITgcm ocean simulation (including sea-ice) that computes costs with respect to misfits to observational data. Automatic differentiation.
 – Followed by an optimization step that generally will not fit on EC2 nodes (large memory)
 – Loops over - so a lot of data transfer involved.
● 32, 60 or 64 processor runs usually.
● Very I/O intensive (60-120 or more GB input data, 25-200 or more GB output data that need to be kept, more in terms of intermediate files).
● Per-process I/O useful but bothersome.
● Ensembles of forward runs less I/O demanding (MTC at large scale?)


Modes of usage

● Stand-alone (interactive) on-demand EC2 cluster
● Stand-alone (batch) on-demand EC2 cluster
 – Torque or SGE
● Augmented local cluster with EC2 nodes
 – We have a Torque setup
 – Used recipes for SGE setup
● Project Hedeby
● Parallel filesystems: PVFS2/GlusterFS/FhGFS
● Inhomogeneity needs to be kept in mind
● Security issues need to be worked out


Optimizing compiler issues

● Two high performance compilers that can be deployed without licensing issues for academics and may perform better: Open64 and Sun Studio 12.
 – The latter provides an 11.5% performance boost for the geometric mean of the tests (up to 25% for MG).
 – The MPI runtime may need to be rebuilt for the new compiler every time.
● To use the Intel, Absoft, PGI compilers one can employ a local virtual machine with a valid software license, using the same OS and middleware as the virtual cluster, and then run the executables on the EC2 cluster.

NPB Mop/s by compiler:

         gcc 4.1  open64 4.1  Pathscale 2.5  PGI 6.1  Absoft 10  Intel 9.1  Studio12
Class W  148.83   150.22      151.11         159.46   149.91     150.97     165.91
Class A  131.57   139.01      141.25         143.82   139.02     146        -


The economics of Clouds

• So can we move to an all-cloud option for our HPC needs?
 – The enticement: no more worrying about hardware maintenance, upgrades, network administration, possibly system administration (using pre-configured clusters), leading to lower costs.
 – At the same time, “virtual” clusters retain part of the “cluster”-hugging mentality of some users.
 – And at the institute level:
  – No need to worry about building/renovating/retrofitting datacenters
  – And, most importantly in days of increasing energy costs, you don't see electricity bills anymore
  – The carbon impact becomes someone else's problem.

An exercise


• Part of an effort at MIT for investigating future needs:
 – $0.68/hour for a 2-cpu, 8-core Xeon instance on Amazon EC2 (cheapest option offering fullest flexibility currently available)
 – Cost of a 158-rack, 2U, low density, 21-node-per-rack equivalent is 158 x 21 x 0.68 = $2256.24 per hour.
 – Using reserved instances it is 158 x 21 x 0.24 = $796.32 per hour.
 – Assuming an 85% utilization, that amounts to 2256.24 x 24 x 365 x 0.85 = $16.8 million per year, ~7 times our expected electricity bill for a highly efficient datacenter; with reserved instances, 2800/3 x 158 x 21 + 2654.4 x 24 x 365 x 0.24 = ~$8.7 million per annum.
 – With the cost of building a datacenter included, the cloud costs more after 4 (9) years (or less for more racks).
 – But sporadic use is very well suited economically to the use of clouds.
 – Gigabit Ethernet limitations for large instance counts.
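The slide's numbers can be reproduced directly. One assumption in the reserved-instance line: the 2654.4 factor appears to be the 3318-instance count scaled by 80% utilization, and 2800/3 an amortized per-instance upfront fee over a 3-year term:

```python
instances = 158 * 21                       # 158 racks x 21 nodes/rack = 3318
on_demand_hr = instances * 0.68            # on-demand $/hour
reserved_hr = instances * 0.24             # reserved-instance $/hour
on_demand_yr = on_demand_hr * 24 * 365 * 0.85            # 85% utilization
reserved_yr = 2800 / 3 * instances + 2654.4 * 24 * 365 * 0.24
print(f"${on_demand_hr:.2f}/h  ${reserved_hr:.2f}/h  "
      f"${on_demand_yr/1e6:.1f}M/yr  ${reserved_yr/1e6:.1f}M/yr")
```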

Education


● Using cloud computing for geosciences education.
 – Multi-tiered approach (client-server emphasis)
 – Up to one Amazon EC2 instance per student (full cpu power for each student if needed)
 – VNC or other remote visualization approach
 – Menu/forms driven models
 – Web interface integrating course material with demonstrations
 – Simulations mimicking experiments run in class
● MPI/OpenMP class taught at MIT (IAP 2008-10)
 – EC2 and/or VMware image

Educational uses


● The opportunity to host all of ESSE's computational needs on EC2 allows for a vision of ocean DA for education.
● CITE (Cloud-computing Infrastructure and Technology for Education) - NSF STCI project.

Virtual teaching environment


LCML/LEGEND


● LCML (Legacy Computing Markup Language) is an XML Schema based framework for encapsulating the build-time and run-time configuration of legacy binaries, alongside constraints.
 – It was implemented for ocean/climate models but designed for general applications that use Makefiles, imake, cmake, autoconf etc. to set up build-time configuration (not ant).
● LEGEND is a Java-based validating GUI generator that parses LCML files describing an application and produces a GUI for the user to build and run the model.

LEGEND in action
