CriB 2010 Seminar Series

Scientific Computing on the Cloud: Many Task Computing and other opportunities

Constantinos Evangelinos, Chris Hill, Pierre F. J. Lermusiaux, Jinshan Xu, Patrick J. Haley Jr.
MIT Earth, Atmospheric and Planetary Sciences / MIT Mechanical Engineering
C. Evangelinos ([email protected])




Outline

● Many Task Computing
● ESSE as an MTC application

● ESSE on clusters, grids and Amazon EC2

● Amazon EC2 for HPC?

● Amazon EC2 for education

● Conclusions


Motivation

● Could cloud computing be in our future for climate (ocean and coupled climate) models?
 – Can it be useful for more than EP or Map-Reduce type applications?
 – Are the days of having to purchase, install and maintain personal clusters coming to an end?
 – Could grant money buy cloud cycles some day?
 – Can it be used for HPC instruction?
 – Can it be used for Geosciences education?
● What about HPC performance in a virtual machine environment?
 – Issues and middleware


Many Task Computing

● Loose definition by Foster et al.: high-performance computations comprising multiple distinct activities, coupled via (for example) file system operations or message passing. Tasks may be small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large.
● What it is not:
 – Plain MPMD (unless one speaks of dynamic/heterogeneous)
 – Workflow (only part of the story)
 – Capacity computing
 – High Throughput computing
 – Embarrassingly parallel computing
● Instead of jobs/day, the metric is tasks completed per second or per hour.


DA Motivation

● Improve the forecasting capabilities of ocean data assimilation and related fields via increased access to parallelism
● Move the existing computational framework to a more modern, non-site-specific setup
● Test the opportunities for executing massive task count workflows on distributed clusters, Grid and Cloud platforms
● Provide an external outlet to handle peak demand for compute resources during live experiments in the field


Ocean Data Assimilation

dx = M(x, t) dt + dη;  M the model operator
y_k^o = H(x_k, t_k) + ε_k;  H the measurement operator
min_x J(x_k, y_k^o; dη, ε_k, Q(t), R_k);  J the objective function

Model errors are assumed Brownian: dη = N(0, Q(t)) with E{dη(t) dη(t)^T} = Q(t) dt.
In fact the models are forced by processes with noise correlated in space and time (meteo).
Measurement errors follow white Gaussian statistics: ε_k = N(0, R_k).
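In standard notation the formulation above can be typeset as follows (a transcription of the bullets; writing the model term as M(x,t) dt follows the usual stochastic-model convention implied by the Brownian error statistics):

```latex
\begin{aligned}
  d\mathbf{x} &= \mathcal{M}(\mathbf{x},t)\,dt + d\boldsymbol{\eta},
    \qquad d\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0},\mathbf{Q}(t)),
    \qquad E\{d\boldsymbol{\eta}(t)\,d\boldsymbol{\eta}(t)^{T}\} = \mathbf{Q}(t)\,dt,\\
  \mathbf{y}_{k}^{o} &= \mathcal{H}(\mathbf{x}_{k},t_{k}) + \boldsymbol{\varepsilon}_{k},
    \qquad \boldsymbol{\varepsilon}_{k} \sim \mathcal{N}(\mathbf{0},\mathbf{R}_{k}),\\
  \hat{\mathbf{x}} &= \arg\min_{\mathbf{x}}\,
    J(\mathbf{x}_{k},\mathbf{y}_{k}^{o};\, d\boldsymbol{\eta},
      \boldsymbol{\varepsilon}_{k},\mathbf{Q}(t),\mathbf{R}_{k}).
\end{aligned}
```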


Ocean Acoustics

Estimates of the ocean temperature and salinity fields (and their uncertainties) are necessary for calculating acoustic fields and their uncertainties.

Sound-propagation studies often focus on vertical sections. Time is fixed and an acoustic broadband transmission loss (TL) field is computed for each ocean realization.

A sound source of specific frequency, location and depth is chosen. The coupled physical-acoustical covariance P for the section is computed, non-dimensionalized and used for assimilation of hydrographic and TL data.


Acoustic climatology maps

● Underwater acoustics transmission loss variability predictions in a 56 x 33 km area northeast of Taiwan.
● 2D propagation over 15 km distance at 31x31 = 961 grid points x 8 directions.
● Each job is a short 3-minute 2D acoustics ray propagation problem.
● Distributed on 100 dual-core compute nodes; speedup of more than 100x in a real-time experiment (SGE overhead of scheduling short jobs).

[Maps: Mean Transmission Loss (TL), TL STD over depth, TL STD over bearing; 77 km x 65 km domain; mean TL 55-65 dB, STD 0.1-3 dB; effects of internal tides and of steep bathymetry are visible.]


Canyon Nx2D acoustics modeling

[Figures: OMAS moving sound source; bathymetry of Mien Hua Canyon]


AOSN-II Monterey Bay


Error Subspace Statistical Estimation


ESSE Surf. Temp. Error Standard Deviation Forecasts for AOSN-II

[Panels for Aug 12, 13, 14 (Start of Upwelling, First Upwelling period) and Aug 24, 27, 28 (End of Relaxation, Second Upwelling period)]

Leonard and Ramp, Lead PIs


Serial and Parallel ESSE workflows


The ESSE workflow engine

● Is actually (for historical and practical reasons) a heavily modified C-shell script (master)!
 – Catches signals to kill all remaining jobs
● Grid Engine, Condor and PBS variants
 – Submits and tracks singleton jobs
 – Or uses job arrays for scalability
 – Further variants depending on I/O strategy:
  ● Separate pert singletons?
  ● Input/output to shared or local disk (or mixed)?
● Shared directories store files with the execution status of each of the singleton scripts
● Singletons need the perturbation number: tricks!
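The shared-directory bookkeeping can be sketched in a few lines. This is a minimal stand-in (file names and the in-process `singleton` are hypothetical); in the real engine the singletons are queued shell scripts, and the perturbation number arrives via a scheduler trick such as SGE's `$SGE_TASK_ID` for job arrays:

```python
import pathlib
import tempfile

# Shared directory where each singleton records its execution status,
# mirroring the C-shell master's bookkeeping (file names are hypothetical).
status_dir = pathlib.Path(tempfile.mkdtemp())

def singleton(pert_no: int) -> None:
    # Stand-in for one queued singleton job; it marks itself DONE on success.
    out = status_dir / f"pert_{pert_no:04d}.status"
    out.write_text("DONE\n")

for pert in range(1, 6):        # tiny 5-member stand-in ensemble
    singleton(pert)

# The master polls the shared directory to find finished members.
done = sorted(p.stem for p in status_dir.glob("pert_*.status")
              if p.read_text().strip() == "DONE")
print(f"{len(done)} of 5 singletons finished")
```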


Multi-level parallelism in ESSE

● Nested ocean model runs (HOPS) are run in parallel
 – Limited parallelism
 – 2 or 3 levels
 – bi-directional
● SVD calculation is based on parallelizable LAPACK routines
● Convergence check calculation also


ESSE and ocean acoustics

● As things stand ESSE is used to provide the necessary temperature and salinity information for sound propagation studies.
● The ESSE framework can also be extended to acoustic data assimilation. With significantly more compute power one can compute the whole “acoustic climate” in a 3D region:
 – providing TL for any source and receiver locations in the region as a function of time and frequency,
 – by running multiple independent tasks for different sources/frequencies/slices at different times.


Canyon Nx2D acoustics modeling

● Acoustics transmission loss difference over 6 hours (internal tides or other uncertainties)
● In the future, incorporating this with ESSE for uncertainty estimation will bring the computational cost to 1800 (directions) x 15 locations x hundreds of cases.


Ocean DA/ESSE/acoustics: MTC

● A minimum of hundreds to thousands (and with increased fidelity tens of thousands) of ocean model runs (tens of minutes or more), preceded by an equal number of IC perturbations (secs)
● File I/O intensive, both for reading and writing
● Concurrent reads to forcing files etc.
● Thousands of short acoustics runs (mins)
● Future directions for ESSE will generate even more tasks:
 – dynamic path sampling for observing assets
 – combined physical-acoustical ESSE


“Real-time” experiments


Notable differences

From many parameter sweeps and other MTC apps:
● there is a hard deadline associated with the execution of the ensemble workflow, as a forecast needs to be timely;
● the size of the ensemble is dynamically adjusted according to the convergence of the ESSE workflow, which is not a DAG;
● individual ensemble members are not significant (and their results can be ignored if unavailable) - what is important is the statistical coverage of the ensemble;
● the full resulting dataset of each ensemble member forecast is required, not just a small set of numbers; ICs are different for each ensemble member;
● individual forecasts within an ensemble, especially in the case of interdisciplinary interactions and nested meshes, can be parallel programs themselves.


And their implications

● Deadline: use any Advance Reservation capabilities available.
● Dynamic: the actual total compute and data requirements for the forecast are not known beforehand and change dynamically.
● Dropped members: failures (due to software or hardware problems) are not catastrophic and can be tolerated. Moreover, runs that have not finished (or even started) by the forecast deadline can be safely ignored, provided they do not collectively represent a systematic hole in the statistical coverage.
● I/O needs: relatively high data storage and network bandwidth constraints will be placed on the underlying infrastructure.
● Parallel ensemble members: the compute requirements will not be insignificant either.
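The "deadline plus dropped members" pattern can be sketched with a thread pool: collect whatever finishes before the cutoff and compute the ensemble statistics from those members only. Timings and the mock forecast values below are hypothetical stand-ins for real model runs:

```python
import concurrent.futures as cf
import statistics
import time

FORECAST_DEADLINE_S = 0.5          # hypothetical cutoff; real forecasts use hours

def ensemble_member(i: int) -> float:
    # Stand-in for one perturbed ocean-model run; every 4th member is a
    # straggler that will miss the deadline.
    time.sleep(1.5 if i % 4 == 0 else 0.01)
    return 10.0 + 0.1 * i          # mock forecast value

pool = cf.ThreadPoolExecutor(max_workers=8)
futures = {pool.submit(ensemble_member, i): i for i in range(12)}

results = {}
try:
    for fut in cf.as_completed(futures, timeout=FORECAST_DEADLINE_S):
        results[futures[fut]] = fut.result()
except cf.TimeoutError:
    pass                           # stragglers are simply dropped, as in ESSE

pool.shutdown(wait=False, cancel_futures=True)   # don't wait for late members
ensemble_mean = statistics.mean(results.values())
print(f"{len(results)}/12 members used; ensemble mean = {ensemble_mean:.2f}")
```

The real workflow applies the same idea through scheduler job tracking rather than an in-process pool, with the added check that the surviving members still cover the error subspace.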


Ocean DA on local clusters

● Local Opteron cluster
 – Opteron 250 2.4GHz (4GB RAM) compute nodes (single gigabit network connection)
 – Opteron 2380 2.5GHz (24GB RAM) head node
 – 18TB of shared disk (NFS) over 10Gbit Ethernet
 – 200Gbit switch backplane
 – Grid Engine and Condor co-existing
● Tried both Grid Engine and Condor versions of the ESSE workflows. Test 600-member ensemble:
 – I/O optimizations (all local dirs): 86 down to 77 mins
 – SGE 10-20% faster than Condor
  ● without heroic tuning of the latter


Ocean DA on the Teragrid

● Extensive use of sshfs to share directories for checking state of runs etc.
● Remote job submissions (over (gsi)ssh)
 – part of driver and modified singletons
● Or Condor-C and Glide-in (with care, if root)
● Condor-G will not scale
● Or Personal Condor & MyCluster

System  cores  pert   pemodel
ORNL    2      67.83  1823.99
Purdue  4      6.25   1107.4
local   2      6.21   1531.33


Advantages of the Teragrid

● Enormous numbers of theoretically available cores and very large sizes for storage
 – Condor pool supposedly 14-27k cores (~1800)
● Shared high-speed parallel filesystems
● High speed connections to the home cluster
● Suites of Grid software for remote file access and job submission, control etc.
 – Mixed blessing...
● Free, after writing the proposal to convince Teragrid to get the SUs...


Disadvantages of the Teragrid

● Very large heterogeneity in hardware, O/S and paths (to scratch disks etc.), requiring mods to the singleton code - user confusion.
● Without advance reservations one cannot be guaranteed not to have to use multiple Teragrid sites to reach the desired number of processors within the deadline.
 – Backfilling can help, but per-user job limits also limit the usability of a single Teragrid site
 – Schedulers favor large processor count runs
 – Complicated tricks to submit many jobs as one
● Teragrid MPPs not always suitable for scripts
● Careful fetching of results back to home (congestion)


Ocean DA on the Cloud

● We have been experimenting with the use of Cloud computing for more traditional HPC usage - including parallel runs of I/O intensive data parallel ocean models such as MITgcm.
● Given the limitations seen in network performance, it was natural to investigate the usability of Amazon EC2 for MTC applications such as ESSE.


Cloud Modes of usage

● Stand-alone (batch) on-demand EC2 cluster
 – Torque or SGE (all-in-the-cloud or remote submits)
● Augmented local cluster with EC2 nodes
 – We have a Torque setup
 – Used recipes for SGE setup
 – Condor use of EC2 too restrictive
 – MyCluster dynamic SGE or Condor merged clusters
 – Commercial (Univa UniCloud, Sun Cloud Adapter in Hedeby/SDM) for fully dynamic provisioning
● Experimentation with parallel filesystems: PVFS2/GlusterFS/FhGFS


Serial pert/pemodel performance

System      cores  pert   pemodel
m1.small    0.5    13.53  2850.14
m1.large    2      9.33   1817.13
m1.xlarge   4      9.14   1860.81
c1.medium   2      9.8    1008.11
c1.xlarge   8      6.67   1030.42
m2.2xlarge  4      3.39   779.77
m2.4xlarge  8      3.35   790.86

● m1.xxxx AMIs are using Opteron processors
● A binary optimized with the Pathscale compilers was used
● All cores were loaded
● I/O is to local disk (EBS is slower; so is NFS, which is used for the centrally coordinating directory of the run)
● Total runtime is reported
● Better than 2.5x speedup from m1.small to c1.medium
● Nehalems (m2.xxxxx) not the best option for price/perf.


Advantages of the Cloud

● For all intents and purposes the response is immediate. Currently a request for a virtual EC2 cluster gets satisfied on-demand, without having to worry about queue times and backfill slots.
● The use of virtual machines allows for deploying the same environment as the home cluster. This provides for a very clean integration of the two clusters.
● Having the same software environment also results in no need to rebuild (and in most cases revalidate) executables. This means that last-minute changes (because of model build-time parameter tuning) can be used ASAP instead of having to go through a build-test-deploy cycle on each remote platform.
● EC2 allows our virtual clusters to scale at will (default limit 20 instances).
● Since the remote machines are under our complete control, scheduling software and policies etc. are tuned to our needs.


Cost analysis

● Cost-wise, for example, an ESSE calculation with 1.5GB input data and 960 ensemble members each sending back 11MB (for a total of 10.56GB) would cost:
 – 1.5(GB) x 0.1 + 10.56(GB) x 0.17 for the data
 – 2(hr) x 20 x 0.68 for the computations
 – For a total of $29.15
● Use of reserved instances would drop pricing for the cpu usage by more than a factor of 3.
● Compare that to the cost of overprovisioning your local cluster resources to handle the peak load required a few times a year.


Disadvantages of the Cloud

● Inhomogeneity needs to be kept in mind or it will bite you.
● Any extra security issues need to be worked out.
● EC2 usage needs to be paid directly to Amazon. Amazon charges by the hour - like a cell phone, 1 hour 1 sec. counts as 2 hours. Charges for data movement in and out of EC2.
● The performance of virtual machines is less than that of “bare metal”; the difference is more pronounced when it comes to I/O.
● No persistent large parallel filesystem. One can be constructed on demand (just like the virtual clusters), but the Gigabit Ethernet connectivity used throughout Amazon EC2, alongside the randomization of instance placement, means that parallel performance of the filesystem is not up to par. Horror stories...
● Unlike national and state supercomputing facilities, Amazon's connections to the home cluster are bound to be slower and result in file transfer delays.


Future work directions

● Reimplement the workflow engine.
 – Considering Swift - other options? Nimrod?
● Generalize the ESSE work-engine away:
 – Use with other ocean models (MITgcm, ROMS)
● Expand production use of ESSE:
 – Heterogeneous sites on the Teragrid
 – Open Science Grid
 – MPPs with sufficient support: Blue Gene/P?
● Expand uses for ESSE (and number of tasks):
 – ESSE for Acoustics
 – ESSE for adaptive sampling


Which sampling on Aug 26 optimally reduces uncertainties on Aug 27?

● 4 candidate tracks, overlaid on the surface T fct for Aug 26
● Starting from the IC (nowcast) on Aug 24, a 2-day ESSE fct is run to Aug 26, followed by DA of each candidate track (DA 1-4) and ESSE fcts for Tracks 1-4 to Aug 27
● Based on nonlinear error covariance evolution: for every choice of adaptive strategy, an ensemble is computed
● Best predicted relative error reduction: track 1


Memory Bandwidth

[Bar chart: per-thread memory bandwidth with 1 thread vs N threads, for m1.small, c1.medium and a 1.4GHz Opteron system; reported values include 5.4, 2.6, 2.8, 5.3 and 5.6 GB/s per thread]

● The small instance memory bandwidth appears to be equal to the full memory bandwidth expected from such a platform, despite the 50% cpu time throttler - not entirely unexpected for memory bandwidth.
● The faster CPU in the c1.medium instance does considerably worse.
● In fact an original 1st-gen 1.4 GHz Opteron system also does worse (DDR2 memory in the m1.small instance should help).
● This suggests that for memory bandwidth limited applications the small instance may be the most efficient.
● The increase of memory bandwidth with the c1.medium instance suggests that the 2 cores are not on the same die. This would be an Amazon policy.


Serial Performance

System     Class A  Class W  EP (A)  EP (W)
m1.small   132      149      6.66    6.73
c1.medium  312      357      15.59   15.04
ratio      2.36     2.4      2.34    2.23

● NAS NPB serial (geometric mean of all tests except EP) in Mop/s
● Compiled with the system gcc (generic flags)
● A single instance running in the c1.medium case (no memory bandwidth contention)
● The 1:2.5 theoretical ratio becomes 1:2.3
● Still, the price ratio is 1:2
● When loading both cores in the c1.medium, the resulting ratio depends on the memory vs cpu utilization characteristics of the individual benchmark
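A quick check of the table's ratio row, using the same geometric-mean aggregation the NPB Mop/s numbers rely on:

```python
from math import prod

def geomean(xs):
    # Geometric mean, the aggregation used for the NPB Mop/s summary numbers.
    return prod(xs) ** (1.0 / len(xs))

# c1.medium over m1.small speedups from the table (Class A, Class W, EP A, EP W)
ratios = [312 / 132, 357 / 149, 15.59 / 6.66, 15.04 / 6.73]
print([round(r, 2) for r in ratios])      # matches the table's ratio row
print(round(geomean(ratios), 2))          # overall speedup, close to the 1:2.3 quoted
```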


I/O performance

Serial IOR read and write bandwidth (128KB requests, 1MB blocksize, 100 segments), fsync, MB/s:

           m1.small  c1.medium  c1.medium 2 cores
/tmp write 168       284        264
/tmp read  1626      1882       2399
NFS write  40        42         41
NFS read   73        63         63

Serial NAS NPB 3.3 BT-IO (Fortran I/O), MB/s:

             m1.small  c1.medium  c1.medium 2 cores
Class A /tmp 17        29         28
Class W /tmp 16        29         28
Class A NFS  15        26         21
Class W NFS  14        30         26


MPI performance

                unidirectional BW (MB/s)  bidirectional BW (MB/s)  latency (us)
LAM             57.85                     81.98                    81.2
GridMPI         54.6                      77.07                    83.46
MPICH2 nemesis  15.72                     26.08                    300
MPICH2 sock     58.49                     83.42                    85.87
OpenMPI         16.44                     17.99                    300
LAM/ACES        117.64                    198.59                   35.83


MPI performance cont.


Coupled climate simulation

● The MITgcm (MIT General Circulation Model) is a numerical model designed for study of the atmosphere, ocean, and climate.
 – MPI (+OpenMP) code base, Fortran 77(/90) with some C bits - very portable
 – Memory bandwidth intensive but not entirely memory bound - also I/O intensive for climate applications
● Coupled ocean-coupler-(atmosphere-land-ice) model on a ~2.8° cubed sphere (6 32x32 faces)
 – MPMD mode, 3 binaries, up to 6+6+1 processes in a standard configuration


ECCO-GODAE

● 1-degree MITgcm ocean simulation (including sea-ice) that computes costs with respect to misfits to observational data. Automatic differentiation.
 – Followed by an optimization step that generally will not fit on EC2 nodes (large memory)
 – Loops over - so a lot of data transfer involved.
● 32, 60 or 64 processor runs usually.
● Very I/O intensive (60-120 or more GB input data, 25-200 or more GB output data that need to be kept, more in terms of intermediate files).
● Per-process I/O useful but bothersome.
● Ensembles of forward runs less I/O demanding (MTC at large scale?)


Modes of usage

● Stand-alone (interactive) on-demand EC2 cluster
● Stand-alone (batch) on-demand EC2 cluster
 – Torque or SGE
● Augmented local cluster with EC2 nodes
 – We have a Torque setup
 – Used recipes for SGE setup
● Project Hedeby
● Parallel filesystems: PVFS2/GlusterFS/FhGFS
● Inhomogeneity needs to be kept in mind
● Security issues need to be worked out


Optimizing compiler issues

● Two high performance compilers that can be deployed without licensing issues for academics and may perform better: Open64 and Sun Studio 12.
 – The latter provides an 11.5% performance boost for the geometric mean of the tests (up to 25% for MG).
 – The MPI runtime may need to be rebuilt for the new compiler every time.
● To use the Intel, Absoft, PGI compilers one can employ a local virtual machine with a valid software license, using the same OS and middleware as the virtual cluster, and then run the executables on the EC2 cluster.

NPB Mop/s by compiler:

         gcc 4.1  open64 4.1  Pathscale 2.5  PGI 6.1  Absoft 10  Intel 9.1  Studio12
Class W  148.83   150.22      151.11         159.46   149.91     150.97     165.91
Class A  131.57   139.01      141.25         143.82   139.02     146        -


The economics of Clouds

• So can we move to an all-cloud option for our HPC needs?
 – The enticement: no more worrying about hardware maintenance, upgrades, network administration, possibly system administration (using pre-configured clusters), leading to lower costs.
 – At the same time, “virtual” clusters retain part of the “cluster”-hugging mentality of some users.
 – And at the institute level:
  – No need to worry about building/renovating/retrofitting datacenters
  – And, most importantly in days of increasing energy costs, you don't see electricity bills anymore
  – The carbon impact becomes someone else's problem.

An exercise


• Part of an effort at MIT for investigating future needs:
 – $0.68/hour for a 2-cpu, 8-core Xeon instance on Amazon EC2 (cheapest option offering fullest flexibility currently available)
 – Cost of a 158-rack, 2U, low density, 21-node-per-rack equivalent is 158 x 21 x 0.68 = $2256.24 per hour.
 – Using reserved instances it is 158 x 21 x 0.24 = $796.32 per hour.
 – Assuming an 85% utilization, that amounts to 2256.24 x 24 x 365 x 0.85 = $16.8 million per year, ~7 times our expected electricity bill for a highly efficient datacenter; with reserved instances, 2800/3 x 158 x 21 + 2654.4 x 24 x 365 x 0.24 = ~$8.7 million per annum.
 – With the cost of building a datacenter included, the cloud costs more after 4 (9) years (or less for more racks).
 – But sporadic use is very well suited economically to the use of clouds.
 – Gigabit Ethernet limitations for large instance counts.
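The slide's numbers can be reproduced directly. One assumption in the reserved-instance line: the 2654.4 factor appears to be the 3318-instance count scaled by 80% utilization, and 2800/3 an amortized per-instance upfront fee over a 3-year term:

```python
instances = 158 * 21                       # 158 racks x 21 nodes/rack = 3318
on_demand_hr = instances * 0.68            # on-demand $/hour
reserved_hr = instances * 0.24             # reserved-instance $/hour
on_demand_yr = on_demand_hr * 24 * 365 * 0.85            # 85% utilization
reserved_yr = 2800 / 3 * instances + 2654.4 * 24 * 365 * 0.24
print(f"${on_demand_hr:.2f}/h  ${reserved_hr:.2f}/h  "
      f"${on_demand_yr/1e6:.1f}M/yr  ${reserved_yr/1e6:.1f}M/yr")
```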

Education


● Using cloud computing for geosciences education.
 – Multi-tiered approach (client-server emphasis)
 – Up to one Amazon EC2 instance per student (full cpu power for each student if needed)
 – VNC or other remote visualization approach
 – Menu/forms driven models
 – Web interface integrating course material with demonstrations
 – Simulations mimicking experiments run in class
● MPI/OpenMP class taught at MIT (IAP 2008-10)
 – EC2 and/or VMware image

Educational uses


● The opportunity to host all of ESSE's computational needs on EC2 allows for a vision of ocean DA for education.
● CITE (Cloud-computing Infrastructure and Technology for Education) - NSF STCI project.

Virtual teaching environment


LCML/LEGEND


● LCML (Legacy Computing Markup Language) is an XML Schema based framework for encapsulating the build-time and run-time configuration of legacy binaries, alongside constraints.
 – It was implemented for ocean/climate models but designed for general applications that use Makefiles, imake, cmake, autoconf etc. to set up build-time configuration (not ant).
● LEGEND is a Java-based validating GUI generator that parses LCML files describing an application and produces a GUI for the user to build and run the model.

LEGEND in action
