Algorithms in Scientific Computing II
a.k.a. Parallel Programming and High Performance Computing II
– Intro –
Michael Bader
TUM – SCCS
Winter 2011/2012
Part I
Scientific Computing and Numerical
Simulation
The Simulation Pipeline
phenomenon, process, etc.
→ modelling → mathematical model
→ numerical treatment → numerical algorithm
→ implementation → simulation code
→ visualization → results to interpret
(validation feeds back along the whole chain; further arrows in the diagram are labelled "embedding" and "statement, tool")
The Simulation Pipeline
phenomenon, process, etc.
→ modelling → mathematical model
→ numerical treatment → numerical algorithm
→ parallel implementation → simulation code
→ visualization → results to interpret
(same chain as above, now with a parallel implementation step; validation again feeds back along the chain)
Example – Millennium-XXL Project
(Springel, Angulo, et al., 2010)
N-body simulation with N = 3 · 10^11 “particles”
compute gravitational forces and effects
(every “particle” corresponds to ∼ 10^9 suns; see the direct-summation sketch after this list)
simulation of the generation of galaxy clusters
plausibility of the “cold dark matter” model
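To get a feeling for the computational challenge, here is a minimal, purely illustrative sketch of direct pairwise gravitational force summation in Python/NumPy; it is not the Millennium-XXL code (which relies on hierarchical algorithms), and all names, constants, and the particle count are assumptions for illustration. The O(N^2) cost of this direct sum is exactly why N = 3 · 10^11 particles is out of reach without tree- or mesh-based methods.

    import numpy as np

    def direct_gravity(pos, mass, G=6.674e-11, eps=1e-3):
        """Naive O(N^2) pairwise gravitational accelerations.
        pos: (N, 3) positions, mass: (N,) masses, eps: softening length."""
        N = pos.shape[0]
        acc = np.zeros_like(pos)
        for i in range(N):
            r = pos - pos[i]                          # vectors from particle i to all others
            dist2 = np.sum(r * r, axis=1) + eps**2    # softened squared distances
            inv_d3 = dist2 ** -1.5
            inv_d3[i] = 0.0                           # exclude self-interaction
            acc[i] = G * np.sum((mass * inv_d3)[:, None] * r, axis=0)
        return acc

    # toy run: 1000 particles already mean 10^6 pair interactions per step;
    # at N = 3e11, a direct sum would need roughly 10^23 interactions per step
    pos = np.random.rand(1000, 3)
    mass = np.ones(1000)
    acc = direct_gravity(pos, mass)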
Example – Millennium-XXL Project (2)
Simulation – HPC-Related Data:
N-body simulation with N = 3 · 10^11 “particles”
10 TB RAM required only to store particle positions and
velocities in single precision (see the memory estimate after this list)
total memory requirement: 29 TB
JuRoPA supercomputer (Jülich)
simulation on 1536 nodes
(each 2x QuadCore, thus 12 288 cores)
hybrid parallelisation: MPI plus OpenMP/POSIX threads
runtime: 9.3 days; 300 CPU years in total
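A back-of-the-envelope check of the 10 TB figure quoted above, assuming 3 single-precision position components and 3 velocity components per particle (the exact per-particle layout is an assumption here; the gap to the quoted 10 TB suggests additional per-particle data such as IDs, which the slide does not break down):

    N = 3e11                 # number of particles
    bytes_per_value = 4      # single precision
    values_per_particle = 6  # 3 position + 3 velocity components (assumed layout)

    total_bytes = N * values_per_particle * bytes_per_value
    print(f"{total_bytes / 1e12:.1f} TB")   # ~7.2 TB from this naive count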
Example – Gordon Bell Prize 2010
(Rahimian, . . . , Biros, 2010)
direct simulation of blood flow
particulate flow simulation (coupled problem)
Stokes flow for blood plasma
red blood cells as immersed, deformable particles
Example – Gordon Bell Prize 2010 (2)
Simulation – HPC-Related Data:
up to 260 million blood cells, up to 9 · 10^10 unknowns
fast multipole method to compute Stokes flow
(octree-based; octree-level 4–24)
scalability: 327 CPU-GPU nodes on Keeneland cluster,
200,000 AMD cores on Jaguar (ORNL)
0.7 PFlop/s sustained performance on Jaguar
extensive use of the GEMM routine (matrix multiplication; see the sketch below)
runtime: ≈ 1 minute per time step
Article for Supercomputing conference:
http://www.cc.gatech.edu/~gbiros/papers/sc10.pdf
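The slide notes extensive use of GEMM. As a minimal illustration (matrix sizes are made up, not taken from the paper), NumPy's @ operator dispatches to the optimized BLAS GEMM routine that delivers most of the sustained Flop/s in such codes:

    import numpy as np

    # dense blocks such as those arising from near-field interactions
    # (sizes are illustrative only)
    A = np.random.rand(512, 512)
    B = np.random.rand(512, 512)

    # the @ operator (np.matmul) calls the underlying BLAS GEMM
    C = A @ B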
“Faster, Bigger, More”
Why parallel high performance computing:
Response time: compute a problem in 1/p of the time (see the ideal-case sketch after this list)
speed up engineering processes
real-time simulations (tsunami warning?)
Problem size: compute a p-times bigger problem
Simulation of multi-scale phenomena
maximal problem size that “fits into the machine”
Throughput: compute p problems at once
case and parameter studies, statistical risk scenarios, etc.
massively distributed computing (SETI@home, e.g.)
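A small sketch making the three ideal cases above explicit; the serial job time and core count are hypothetical, and perfect parallel efficiency is assumed (no serial fraction or communication overhead):

    def ideal_parallel_gains(t_serial, p):
        """Ideal effects of p processors, mirroring the three cases above."""
        response_time = t_serial / p   # same problem, 1/p of the time
        problem_size_factor = p        # p-times bigger problem in the same time
        throughput_factor = p          # p independent problems at once
        return response_time, problem_size_factor, throughput_factor

    # hypothetical 1-hour serial job on p = 1024 cores:
    print(ideal_parallel_gains(3600.0, 1024))   # (3.515625, 1024, 1024)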
Part II
HPC in the Literature – Past and Present
Trends
Four Horizons for Enhancing the Performance . . .
of Parallel Simulations Based on Partial Differential Equations
(David Keyes, 2000)
1 Expanded Number of Processors
→ in 2000: 1000 cores; in 2010: 200,000 cores
2 More Efficient Use of Faster Processors
→ PDE working sets, cache efficiency
3 More Architecture-Friendly Algorithms
→ improve temporal/spatial locality (see the blocking sketch after this list)
4 Algorithms Delivering More “Science per Flop”
→ adaptivity (in space and time), higher-order methods,
fast solvers
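Horizon 3 is about restructuring algorithms so that data is reused while it still resides in cache. A schematic illustration is loop blocking (tiling) of a matrix multiply; the sketch expresses the idea in Python/NumPy for readability only, since in practice such blocking lives in C/Fortran kernels or inside a tuned BLAS:

    import numpy as np

    def blocked_matmul(A, B, bs=64):
        """Tiled matrix multiply: each bs-by-bs block of A and B is reused
        many times while it fits in cache (temporal/spatial locality)."""
        n = A.shape[0]
        C = np.zeros((n, n))
        for i in range(0, n, bs):
            for j in range(0, n, bs):
                for k in range(0, n, bs):
                    C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
        return C

    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B), A @ B)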
The Seven Dwarfs of HPC
“dwarfs” = key algorithmic kernels in many scientific
computing applications
P. Colella (LBNL), 2004:
1 dense linear algebra
2 sparse linear algebra
3 spectral methods
4 N-body methods
5 structured grids
6 unstructured grids
7 Monte Carlo
Computational Science Demands a New Paradigm
Computational simulation must meet three challenges to
become a mature partner of theory and experiment
(Post & Votta, 2005)
1 performance challenge
→ exponential growth of performance, massively parallel
architectures
2 programming challenge
→ new (parallel) programming models
3 prediction challenge
→ careful verification and validation of codes; towards
reproducible simulation experiments
International Exascale Software Project Roadmap – Towards an Exa-Flop/s Platform in 2018 (www.exascale.org):
1 technology trends
→ concurrency, reliability, power consumption, . . .
→ blueprint of an exascale system: 10-billion-way
concurrency, 100 million to 1 billion cores, 10-to-100-way
concurrency per core, hundreds of cores per die, . . .
2 science trends
→ climate, high-energy physics, nuclear physics, fusion
energy sciences, materials science and chemistry, . . .
3 X-stack (software stack for exascale)
→ energy, resiliency, heterogeneity, I/O and memory
4 Politico-economic trends
→ exascale systems run by government labs, used by CSE
scientists
Performance Development in Supercomputing
Top 500 (www.top500.org) – June 2011
TOP500 List – June 2011 (ranks 1–12 shown). Rmax and Rpeak values are in TFlop/s; power data in kW for the entire system. For details about the other fields, see the TOP500 description.
 1  RIKEN Advanced Institute for Computational Science (AICS), Japan
    K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect / 2011, Fujitsu
    Cores 548352, Rmax 8162.00, Rpeak 8773.63, Power 9898.56
 2  National Supercomputing Center in Tianjin, China
    Tianhe-1A - NUDT TH MPP, X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C / 2010, NUDT
    Cores 186368, Rmax 2566.00, Rpeak 4701.00, Power 4040.00
 3  DOE/SC/Oak Ridge National Laboratory, United States
    Jaguar - Cray XT5-HE Opteron 6-core 2.6 GHz / 2009, Cray Inc.
    Cores 224162, Rmax 1759.00, Rpeak 2331.00, Power 6950.60
 4  National Supercomputing Centre in Shenzhen (NSCS), China
    Nebulae - Dawning TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU / 2010, Dawning
    Cores 120640, Rmax 1271.00, Rpeak 2984.30, Power 2580.00
 5  GSIC Center, Tokyo Institute of Technology, Japan
    TSUBAME 2.0 - HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows / 2010, NEC/HP
    Cores 73278, Rmax 1192.00, Rpeak 2287.63, Power 1398.61
 6  DOE/NNSA/LANL/SNL, United States
    Cielo - Cray XE6 8-core 2.4 GHz / 2011, Cray Inc.
    Cores 142272, Rmax 1110.00, Rpeak 1365.81, Power 3980.00
 7  NASA/Ames Research Center/NAS, United States
    Pleiades - SGI Altix ICE 8200EX/8400EX, Xeon HT QC 3.0/Xeon 5570/5670 2.93 GHz, Infiniband / 2011, SGI
    Cores 111104, Rmax 1088.00, Rpeak 1315.33, Power 4102.00
 8  DOE/SC/LBNL/NERSC, United States
    Hopper - Cray XE6 12-core 2.1 GHz / 2010, Cray Inc.
    Cores 153408, Rmax 1054.00, Rpeak 1288.63, Power 2910.00
 9  Commissariat a l'Energie Atomique (CEA), France
    Tera-100 - Bull bullx super-node S6010/S6030 / 2010, Bull SA
    Cores 138368, Rmax 1050.00, Rpeak 1254.55, Power 4590.00
10  DOE/NNSA/LANL, United States
    Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband / 2009, IBM
    Cores 122400, Rmax 1042.00, Rpeak 1375.78, Power 2345.50
11  National Institute for Computational Sciences/University of Tennessee, United States
    Kraken XT5 - Cray XT5-HE Opteron Six Core 2.6 GHz / 2011, Cray Inc.
    Cores 112800, Rmax 919.10, Rpeak 1173.00, Power 3090.00
12  Forschungszentrum Juelich (FZJ), Germany
    JUGENE - Blue Gene/P Solution / 2009, IBM
    Cores 294912, Rmax 825.50, Rpeak 1002.70, Power 2268.00
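Two quantities worth reading off the table above are the fraction of peak performance actually achieved (Rmax/Rpeak) and the energy efficiency (Rmax per watt). The sketch below computes both for the top three systems; the input values are copied from the table, while the derived metrics themselves are not part of the original list:

    # Rmax, Rpeak in TFlop/s; power in kW (values copied from the table above)
    systems = {
        "K computer": (8162.00, 8773.63, 9898.56),
        "Tianhe-1A":  (2566.00, 4701.00, 4040.00),
        "Jaguar":     (1759.00, 2331.00, 6950.60),
    }

    for name, (rmax, rpeak, power) in systems.items():
        efficiency = rmax / rpeak        # fraction of peak achieved in LINPACK
        gflops_per_watt = rmax / power   # TFlop/s per kW equals GFlop/s per W
        print(f"{name:12s} {efficiency:6.1%} of peak, {gflops_per_watt:.2f} GFlop/s per W")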
Top 500 (www.top500.org) – CPUs
→ Any CPU left with fewer than 4 cores?
Part III
Organisation
Lecture Topics – The Seven Dwarfs of HPC
Algorithms, parallelisation, and HPC aspects
for problems related to:
1 dense linear algebra
2 sparse linear algebra
3 spectral methods
4 N-body methods
5 structured grids
6 unstructured grids
7 Monte Carlo
Tutorials – Computing on Manycore Hardware
Examples on GPUs using CUDA:
dense linear algebra
structured grids
N-body methods
Tutors and time/place:
Daniel Butnaru,
Christoph Kowitz
roughly bi-weekly tutorials (90 min each)
(see website for exact schedule)
small “dwarf-related” projects
(1–2 tutorial lessons per project)
Exams, ECTS, Modules
Exam:
oral exam highly likely
include exercises (as exam or separate?)
ECTS, Modules
4 ECTS (2 hours of lectures + 1 hour of tutorials per week)
Informatik/Computer Science: Master catalogue
CSE: Application area
others?