Algorithms in Scientific Computing II
a.k.a. Parallel Programming and High Performance Computing II
– Intro –
Michael Bader
TUM – SCCS
Winter 2011/2012
Part I
Scientific Computing and Numerical
Simulation
The Simulation Pipeline
phenomenon, process, etc.
→ modelling → mathematical model
→ numerical treatment → numerical algorithm
→ implementation → simulation code
→ visualization → results to interpret
(validation feeds back along the whole chain; further arrows in the diagram are labelled "embedding" and "statement, tool")
The Simulation Pipeline
phenomenon, process, etc.
→ modelling → mathematical model
→ numerical treatment → numerical algorithm
→ parallel implementation → simulation code
→ visualization → results to interpret
(same chain as above, now with a parallel implementation step; validation again feeds back along the chain)
Example – Millennium-XXL Project
(Springel, Angulo, et al., 2010)
N-body simulation with N = 3 · 10^11 “particles”
compute gravitational forces and effects
(every “particle” corresponds to ∼ 10^9 suns; see the direct-summation sketch after this list)
simulation of the generation of galaxy clusters
plausibility of the “cold dark matter” model
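To get a feeling for the computational challenge, here is a minimal, purely illustrative sketch of direct pairwise gravitational force summation in Python/NumPy; it is not the Millennium-XXL code (which relies on hierarchical algorithms), and all names, constants, and the particle count are assumptions for illustration. The O(N^2) cost of this direct sum is exactly why N = 3 · 10^11 particles is out of reach without tree- or mesh-based methods.

    import numpy as np

    def direct_gravity(pos, mass, G=6.674e-11, eps=1e-3):
        """Naive O(N^2) pairwise gravitational accelerations.
        pos: (N, 3) positions, mass: (N,) masses, eps: softening length."""
        N = pos.shape[0]
        acc = np.zeros_like(pos)
        for i in range(N):
            r = pos - pos[i]                          # vectors from particle i to all others
            dist2 = np.sum(r * r, axis=1) + eps**2    # softened squared distances
            inv_d3 = dist2 ** -1.5
            inv_d3[i] = 0.0                           # exclude self-interaction
            acc[i] = G * np.sum((mass * inv_d3)[:, None] * r, axis=0)
        return acc

    # toy run: 1000 particles already mean 10^6 pair interactions per step;
    # at N = 3e11, a direct sum would need roughly 10^23 interactions per step
    pos = np.random.rand(1000, 3)
    mass = np.ones(1000)
    acc = direct_gravity(pos, mass)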
Example – Millennium-XXL Project (2)
Simulation – HPC-Related Data:
N-body simulation with N = 3 · 10^11 “particles”
10 TB RAM required only to store particle positions and
velocities in single precision (see the memory estimate after this list)
total memory requirement: 29 TB
JuRoPA supercomputer (Jülich)
simulation on 1536 nodes
(each 2x QuadCore, thus 12 288 cores)
hybrid parallelisation: MPI plus OpenMP/POSIX threads
runtime: 9.3 days; 300 CPU years in total
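A back-of-the-envelope check of the 10 TB figure quoted above, assuming 3 single-precision position components and 3 velocity components per particle (the exact per-particle layout is an assumption here; the gap to the quoted 10 TB suggests additional per-particle data such as IDs, which the slide does not break down):

    N = 3e11                 # number of particles
    bytes_per_value = 4      # single precision
    values_per_particle = 6  # 3 position + 3 velocity components (assumed layout)

    total_bytes = N * values_per_particle * bytes_per_value
    print(f"{total_bytes / 1e12:.1f} TB")   # ~7.2 TB from this naive count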
Example – Gordon Bell Prize 2010
(Rahimian, . . . , Biros, 2010)
direct simulation of blood flow
particulate flow simulation (coupled problem)
Stokes flow for blood plasma
red blood cells as immersed, deformable particles
Example – Gordon Bell Prize 2010 (2)
Simulation – HPC-Related Data:
up to 260 million blood cells, up to 9 · 10^10 unknowns
fast multipole method to compute Stokes flow
(octree-based; octree-level 4–24)
scalability: 327 CPU-GPU nodes on Keeneland cluster,
200,000 AMD cores on Jaguar (ORNL)
0.7 PFlop/s sustained performance on Jaguar
extensive use of the GEMM routine (matrix multiplication; see the sketch below)
runtime: ≈ 1 minute per time step
Article for Supercomputing conference:
http://www.cc.gatech.edu/~gbiros/papers/sc10.pdf
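The slide notes extensive use of GEMM. As a minimal illustration (matrix sizes are made up, not taken from the paper), NumPy's @ operator dispatches to the optimized BLAS GEMM routine that delivers most of the sustained Flop/s in such codes:

    import numpy as np

    # dense blocks such as those arising from near-field interactions
    # (sizes are illustrative only)
    A = np.random.rand(512, 512)
    B = np.random.rand(512, 512)

    # the @ operator (np.matmul) calls the underlying BLAS GEMM
    C = A @ B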
“Faster, Bigger, More”
Why parallel high performance computing:
Response time: compute a problem in 1/p of the time (see the ideal-case sketch after this list)
speed up engineering processes
real-time simulations (tsunami warning?)
Problem size: compute a p-times bigger problem
Simulation of multi-scale phenomena
maximal problem size that “fits into the machine”
Throughput: compute p problems at once
case and parameter studies, statistical risk scenarios, etc.
massively distributed computing (SETI@home, e.g.)
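A small sketch making the three ideal cases above explicit; the serial job time and core count are hypothetical, and perfect parallel efficiency is assumed (no serial fraction or communication overhead):

    def ideal_parallel_gains(t_serial, p):
        """Ideal effects of p processors, mirroring the three cases above."""
        response_time = t_serial / p   # same problem, 1/p of the time
        problem_size_factor = p        # p-times bigger problem in the same time
        throughput_factor = p          # p independent problems at once
        return response_time, problem_size_factor, throughput_factor

    # hypothetical 1-hour serial job on p = 1024 cores:
    print(ideal_parallel_gains(3600.0, 1024))   # (3.515625, 1024, 1024)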
Part II
HPC in the Literature – Past and Present
Trends
Four Horizons for Enhancing the Performance . . .
of Parallel Simulations Based on Partial Differential Equations
(David Keyes, 2000)
1 Expanded Number of Processors
→ in 2000: 1000 cores; in 2010: 200,000 cores
2 More Efficient Use of Faster Processors
→ PDE working sets, cache efficiency
3 More Architecture-Friendly Algorithms
→ improve temporal/spatial locality (see the blocking sketch after this list)
4 Algorithms Delivering More “Science per Flop”
→ adaptivity (in space and time), higher-order methods,
fast solvers
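Horizon 3 is about restructuring algorithms so that data is reused while it still resides in cache. A schematic illustration is loop blocking (tiling) of a matrix multiply; the sketch expresses the idea in Python/NumPy for readability only, since in practice such blocking lives in C/Fortran kernels or inside a tuned BLAS:

    import numpy as np

    def blocked_matmul(A, B, bs=64):
        """Tiled matrix multiply: each bs-by-bs block of A and B is reused
        many times while it fits in cache (temporal/spatial locality)."""
        n = A.shape[0]
        C = np.zeros((n, n))
        for i in range(0, n, bs):
            for j in range(0, n, bs):
                for k in range(0, n, bs):
                    C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
        return C

    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B), A @ B)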
The Seven Dwarfs of HPC
“dwarfs” = key algorithmic kernels in many scientific
computing applications
P. Colella (LBNL), 2004:
1 dense linear algebra
2 sparse linear algebra
3 spectral methods
4 N-body methods
5 structured grids
6 unstructured grids
7 Monte Carlo
Computational Science Demands a New Paradigm
Computational simulation must meet three challenges to
become a mature partner of theory and experiment
(Post & Votta, 2005)
1 performance challenge
→ exponential growth of performance, massively parallel
architectures
2 programming challenge
→ new (parallel) programming models
3 prediction challenge
→ careful verification and validation of codes; towards
reproducible simulation experiments
International Exascale Software Project Roadmap – Towards an Exa-Flop/s Platform in 2018 (www.exascale.org):
1 technology trends
→ concurrency, reliability, power consumption, . . .
→ blueprint of an exascale system: 10-billion-way
concurrency, 100 million to 1 billion cores, 10-to-100-way
concurrency per core, hundreds of cores per die, . . .
2 science trends
→ climate, high-energy physics, nuclear physics, fusion
energy sciences, materials science and chemistry, . . .
3 X-stack (software stack for exascale)
→ energy, resiliency, heterogeneity, I/O and memory
4 Politico-economic trends
→ exascale systems run by government labs, used by CSE
scientists
Performance Development in Supercomputing
Top 500 (www.top500.org) – June 2011
TOP500 List – June 2011 (ranks 1–12 shown). Rmax and Rpeak values are in TFlop/s; power data in kW for the entire system. For details about the other fields, see the TOP500 description.
 1  RIKEN Advanced Institute for Computational Science (AICS), Japan
    K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect / 2011, Fujitsu
    Cores 548352, Rmax 8162.00, Rpeak 8773.63, Power 9898.56
 2  National Supercomputing Center in Tianjin, China
    Tianhe-1A - NUDT TH MPP, X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C / 2010, NUDT
    Cores 186368, Rmax 2566.00, Rpeak 4701.00, Power 4040.00
 3  DOE/SC/Oak Ridge National Laboratory, United States
    Jaguar - Cray XT5-HE Opteron 6-core 2.6 GHz / 2009, Cray Inc.
    Cores 224162, Rmax 1759.00, Rpeak 2331.00, Power 6950.60
 4  National Supercomputing Centre in Shenzhen (NSCS), China
    Nebulae - Dawning TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU / 2010, Dawning
    Cores 120640, Rmax 1271.00, Rpeak 2984.30, Power 2580.00
 5  GSIC Center, Tokyo Institute of Technology, Japan
    TSUBAME 2.0 - HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows / 2010, NEC/HP
    Cores 73278, Rmax 1192.00, Rpeak 2287.63, Power 1398.61
 6  DOE/NNSA/LANL/SNL, United States
    Cielo - Cray XE6 8-core 2.4 GHz / 2011, Cray Inc.
    Cores 142272, Rmax 1110.00, Rpeak 1365.81, Power 3980.00
 7  NASA/Ames Research Center/NAS, United States
    Pleiades - SGI Altix ICE 8200EX/8400EX, Xeon HT QC 3.0/Xeon 5570/5670 2.93 GHz, Infiniband / 2011, SGI
    Cores 111104, Rmax 1088.00, Rpeak 1315.33, Power 4102.00
 8  DOE/SC/LBNL/NERSC, United States
    Hopper - Cray XE6 12-core 2.1 GHz / 2010, Cray Inc.
    Cores 153408, Rmax 1054.00, Rpeak 1288.63, Power 2910.00
 9  Commissariat a l'Energie Atomique (CEA), France
    Tera-100 - Bull bullx super-node S6010/S6030 / 2010, Bull SA
    Cores 138368, Rmax 1050.00, Rpeak 1254.55, Power 4590.00
10  DOE/NNSA/LANL, United States
    Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband / 2009, IBM
    Cores 122400, Rmax 1042.00, Rpeak 1375.78, Power 2345.50
11  National Institute for Computational Sciences/University of Tennessee, United States
    Kraken XT5 - Cray XT5-HE Opteron Six Core 2.6 GHz / 2011, Cray Inc.
    Cores 112800, Rmax 919.10, Rpeak 1173.00, Power 3090.00
12  Forschungszentrum Juelich (FZJ), Germany
    JUGENE - Blue Gene/P Solution / 2009, IBM
    Cores 294912, Rmax 825.50, Rpeak 1002.70, Power 2268.00
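Two quantities worth reading off the table above are the fraction of peak performance actually achieved (Rmax/Rpeak) and the energy efficiency (Rmax per watt). The sketch below computes both for the top three systems; the input values are copied from the table, while the derived metrics themselves are not part of the original list:

    # Rmax, Rpeak in TFlop/s; power in kW (values copied from the table above)
    systems = {
        "K computer": (8162.00, 8773.63, 9898.56),
        "Tianhe-1A":  (2566.00, 4701.00, 4040.00),
        "Jaguar":     (1759.00, 2331.00, 6950.60),
    }

    for name, (rmax, rpeak, power) in systems.items():
        efficiency = rmax / rpeak        # fraction of peak achieved in LINPACK
        gflops_per_watt = rmax / power   # TFlop/s per kW equals GFlop/s per W
        print(f"{name:12s} {efficiency:6.1%} of peak, {gflops_per_watt:.2f} GFlop/s per W")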
Top 500 (www.top500.org) – CPUs
→ Any CPU left with fewer than 4 cores?
Part III
Organisation
Lecture Topics – The Seven Dwarfs of HPC
Algorithms, parallelisation, and HPC aspects
for problems related to:
1 dense linear algebra
2 sparse linear algebra
3 spectral methods
4 N-body methods
5 structured grids
6 unstructured grids
7 Monte Carlo
Tutorials – Computing on Manycore Hardware
Examples on GPUs using CUDA:
dense linear algebra
structured grids
N-body methods
Tutors and time/place:
Daniel Butnaru,
Christoph Kowitz
roughly bi-weekly tutorials (90 min each)
(see website for exact schedule)
small “dwarf-related” projects
(1–2 tutorial lessons per project)
Exams, ECTS, Modules
Exam:
oral exam highly likely
include exercises (as exam or separate?)
ECTS, Modules
4 ECTS (2 hours of lectures + 1 hour of tutorials per week)
Informatik/Computer Science: Master catalogue
CSE: Application area
others?