FFMK: A FAST AND FAULT-TOLERANT MICROKERNEL-BASED SYSTEM FOR EXASCALE COMPUTING
Scientific Network (Selection)
■ Center for Advancing Electronics Dresden (cfaed), TU Dresden Excellence Cluster
■ Frank Bellosa, Karlsruhe Institute of Technology
■ Laxmikant V. Kale, University of Illinois at Urbana-Champaign, Charm++
■ Yutaka Ishikawa, University of Tokyo / RIKEN
■ Argo / Hobbes / mOS, Argonne / Sandia / Intel
■ SPPEXA: ESSEX / GROMEX, Gerhard Wellein / Ivo Kabadshow
■ Highly Adaptive Energy-Efficient Computing, SFB 912
■ ASTEROID, SPP 1500
■ Gernot Heiser, UNSW / NICTA
■ Vijay Saraswat, IBM Research Zürich, X10
■ Torsten Hoefler, ETH Zurich
■ Michael Bussmann, Helmholtz-Zentrum Dresden-Rossendorf
■ Eric Van Hensbergen, IBM Research Austin; DARPA HPCS, FastOS, X-Stack
■ Frank Mueller, North Carolina State University
Phase 1 Results: Summary
■ First L4-based prototype
■ Several source-compatible MPI applications ported
■ Tested on small island of real HPC cluster
■ Gossip scalability and resilience modeled, simulated, and measured
■ Erasure-coded in-memory checkpoints with XtreemFS, tested on Cray XC40 (see the sketch after this list)
■ 2 SPPEXA Workshops
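The principle behind erasure-coded checkpoints can be shown with the simplest possible code: one XOR parity block per stripe of K data blocks tolerates the loss of any single block. This is an illustration only; XtreemFS uses more general erasure codes, and `K`, `BLK`, and both function names here are ours, not XtreemFS APIs.

```c
/* Simplest erasure code: one XOR parity block per stripe of K data
 * blocks survives the loss of any single block. Illustrative only. */
#include <string.h>

#define K   4       /* data blocks per stripe */
#define BLK 4096    /* block size in bytes    */

/* XOR all K data blocks of one stripe into the parity block. */
void encode_parity(unsigned char data[K][BLK], unsigned char parity[BLK])
{
    memset(parity, 0, BLK);
    for (int b = 0; b < K; b++)
        for (int i = 0; i < BLK; i++)
            parity[i] ^= data[b][i];
}

/* Rebuild one lost data block from the K-1 survivors plus parity. */
void recover_block(unsigned char data[K][BLK], unsigned char parity[BLK],
                   int lost)
{
    memcpy(data[lost], parity, BLK);
    for (int b = 0; b < K; b++)
        if (b != lost)
            for (int i = 0; i < BLK; i++)
                data[lost][i] ^= data[b][i];
}
```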
Prof. Alexander Reinefeld, Zuse Institute Berlin
Thomas Steinke, Thorsten Schuett, Florian Wende
Distributed File Systems
■ Flease - Lease Coordination Without a Lock Server, International Parallel and Distributed Processing Symposium, 2011
■ Consistency and Fault Tolerance for Erasure-Coded Distributed Storage Systems, Workshop on Data Intensive Distributed Computing at HPDC 2012
Prof. Amnon Barak, Hebrew University of Jerusalem
Amnon Shiloh, Ely Levy, Tal Ben-Nun, Alexander Margolin, Michael Sutton
Load Balancing
■ Resilient Gossip Algorithms for Collecting Online Management Information in Exascale Clusters, Concurrency and Computation: Practice and Experience, 2015
■ An Opportunity Cost Approach for Job Assignment in a Scalable Computing Cluster, IEEE Transactions on Parallel and Distributed Systems, Vol. 11, 2000
FFMK System Architecture
■ L4 microkernel on every node
■ Programming paradigms provided as library-based runtimes
■ Performance-critical parts of MPI, InfiniBand, and checkpointing run directly on L4
■ Non-critical support functionality reuses Linux (e.g., XtreemFS MRC+OSD, MPI startup+control)
■ Gossip algorithms disseminate information for platform management (see the sketch after this list)
■ Linux compatibility via virtualization
■ Optional application hints can improve decision making
■ Applications: GROMEX, COSMO-SPECS+FD4, CP2K, benchmarks, mini-apps, …
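The gossip-based dissemination above can be pictured with a minimal push-gossip sketch. This is not the FFMK protocol: the record layout and the helpers `now_ms()`, `random_peer()`, and `net_send()` are hypothetical stand-ins for the real clock and transport; FFMK's actual algorithms are described in the publications listed under Load Balancing.

```c
/* Minimal push-gossip sketch for spreading per-node status records.
 * Record layout and helper functions are hypothetical stand-ins. */
#include <stddef.h>
#include <stdint.h>

#define NNODES 1024

typedef struct {
    uint32_t node;     /* originating node id             */
    uint64_t stamp;    /* originator's timestamp (ms)     */
    float    load;     /* e.g., CPU or memory utilization */
} record_t;

static record_t view[NNODES];  /* this node's view of the cluster */

/* Assumed runtime helpers (not shown): */
extern uint64_t now_ms(void);
extern uint32_t random_peer(uint32_t self);
extern void     net_send(uint32_t peer, const void *buf, size_t len);

/* Merge a received view: keep whichever record is newer per node. */
void gossip_merge(const record_t *in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (in[i].stamp > view[in[i].node].stamp)
            view[in[i].node] = in[i];
}

/* One gossip round: refresh the local record, push the view to a
 * random peer. Run once per interval (cf. the 2-1024 ms intervals
 * measured in Figures 3-5). */
void gossip_round(uint32_t self, float my_load)
{
    view[self] = (record_t){ self, now_ms(), my_load };
    net_send(random_peer(self), view, sizeof(view));
}
```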
Second L4-based Prototype: Decoupled Execution
■ Avoids operating system noise by sidestepping Linux
■ HPC applications are ordinary Linux processes, but their threads are moved to compute cores controlled by L4
■ Communication over InfiniBand through direct hardware access
■ Linux system calls: move the thread back into Linux, handle the operation on a service core, then return to the compute core (see the sketch after this list)
■ L4 system calls: faster scheduling, threads, memory, …
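A minimal sketch of this control flow, assuming hypothetical primitives (the `l4_`, `lx_`, and `migrate_` functions are placeholders, not the real L4/L4Linux interfaces): the thread computes undisturbed on its compute core and only visits a Linux service core for the duration of a system call.

```c
/* Control-flow sketch of decoupled execution. All functions below are
 * hypothetical placeholders; only the sequence of steps is the point. */
typedef struct thread_state thread_state_t;  /* registers, etc. */

extern int  is_linux_syscall(const thread_state_t *ts);
extern void migrate_to_service_core(thread_state_t *ts);
extern void migrate_to_compute_core(thread_state_t *ts);
extern void lx_handle_syscall(thread_state_t *ts);  /* runs in Linux */
extern void l4_resume(thread_state_t *ts);

/* Called by the L4 microkernel when a decoupled thread traps. */
void on_trap(thread_state_t *ts)
{
    if (is_linux_syscall(ts)) {
        migrate_to_service_core(ts);  /* compute core stays quiet     */
        lx_handle_syscall(ts);        /* Linux serves the system call */
        migrate_to_compute_core(ts);  /* resume noise-free execution  */
    }
    l4_resume(ts);
}
```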
Prof. Wolfgang E. Nagel, Technische Universität Dresden, ZIH
Matthias Lieber
MPI and Performance Analysis
■ The International Exascale Software Roadmap, International Journal of High Performance Computing Applications 25(1), 2011
■ VAMPIR: Visualization and Analysis of MPI Resources, Supercomputer 63, XII(1):69–80, 1996
Prof. Hermann Härtig, Technische Universität Dresden
Carsten Weinhold, Adam Lackorzynski, Jan Bierbaum, Martin Küttler, Maksym Planeta, Hannes Weisbach
[Diagram: L4 microkernel hosting Linux alongside non-critical and critical applications]
Operating Systems
■ The Performance of µ-Kernel-Based Systems, SOSP 1997
■ VPFS: Building a Virtual Private File System with a Small Trusted Computing Base, EuroSys 2008
■ ATLAS: Look-Ahead Scheduling Using Workload Metrics, RTAS 2013
[Diagram: XtreemFS client accessing the MRC (metadata) and OSDs (file content)]
[Diagram: FFMK node architecture: application, MPI library, and proxies on compute cores running the light-weight kernel (L4 microkernel); Linux kernel with runtime, platform management, decision making, gossip, checkpointing, and InfiniBand monitor on service cores]
[Diagram: decision making fed by application hints, hardware monitoring, platform info (gossip), and resource prediction, driving migration]
Figure 3: PTRANS performance in GB/s (higher is better). [Panels: 1024, 2048, 4096, and 8192 nodes; bars compare gossip intervals from 2 ms to 1024 ms against a no-gossip baseline; throughput degrades only a few percent even at a 2 ms interval.]
Figure 4: MPI-FFT runtime (lower is better); inner red part indicates the MPI portion. [Panels: 1024, 2048, 4096, and 8192 nodes; bars compare gossip intervals from 2 ms to 1024 ms against a no-gossip baseline.]
Figure 5: COSMO-SPECS+FD4 runtime (lower is better); inner red part indicates the MPI portion. [Panels: 1024, 2048, 4096, and 8192 nodes; bars compare gossip intervals from 1 ms to 1024 ms against a no-gossip baseline.]
Gossip: Scalability/Overhead
[Chart: MPI-FFT on Blue Gene/Q, 1024 nodes; legend: no gossip and gossip intervals of 1024, 256, 64, 16, 8, 4, and 2 ms]
Dynamic Platform Management
■ Consider CPU cycles, memory bandwidth, and other resources
■ Classification based on memory load ("memory dwarfs") to optimize scheduling and placement
■ Prediction of resource usage using hardware counters and application-level hints (e.g., number of particles, time steps); see the counter-sampling sketch after this list
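A minimal sketch of the counter-sampling mechanism, using the standard Linux perf_event_open(2) interface: sample a hardware counter around one time step and feed it into an exponentially weighted moving average as a next-step estimate. The choice of counter and the EWMA predictor are our assumptions, not FFMK's actual models.

```c
/* Sample an instruction counter per time step and form a trivial
 * next-step prediction (EWMA). Standard perf_event_open(2) usage. */
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static long perf_open(struct perf_event_attr *attr)
{
    /* this process (pid 0), any CPU (-1), no group, no flags */
    return syscall(SYS_perf_event_open, attr, 0, -1, -1, 0);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type     = PERF_TYPE_HARDWARE;
    attr.size     = sizeof(attr);
    attr.config   = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;

    int fd = (int)perf_open(&attr);
    if (fd < 0)
        return 1;

    double predicted = 0.0, alpha = 0.5;
    for (int step = 0; step < 10; step++) {
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... one application time step would run here ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        if (read(fd, &count, sizeof(count)) != sizeof(count))
            break;

        /* EWMA as a trivial stand-in for resource prediction */
        predicted = alpha * (double)count + (1.0 - alpha) * predicted;
        printf("step %d: %llu instructions, next-step estimate %.0f\n",
               step, (unsigned long long)count, predicted);
    }
    close(fd);
    return 0;
}
```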
Fault Tolerance
■ Application interfaces to optimize or avoid checkpoint/restart (e.g., hints on when to checkpoint, ability to recover from node loss); a hypothetical hint API is sketched after this list
■ Node-level fault tolerance: Multiple Linux instances, micro-rebooting, proactive migration away from failing nodes
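The sketch below shows what such an application interface could look like. The names `ffmk_hint_checkpoint_now` and `ffmk_register_recovery` are illustrative, not FFMK's real API: the application marks points where its live state is cheap to checkpoint and registers a callback so it can survive node loss without a full restart.

```c
/* Hypothetical checkpoint-hint and recovery-registration interface;
 * names are illustrative, not the real FFMK API. */
#include <stddef.h>

typedef void (*recover_fn)(int lost_rank);

extern void ffmk_hint_checkpoint_now(const void *state, size_t len);
extern void ffmk_register_recovery(recover_fn cb);

/* Example recovery: re-derive the lost rank's subdomain locally. */
static void rebuild_lost_rank(int lost_rank)
{
    (void)lost_rank;  /* application-specific reconstruction */
}

void timestep_loop(void *state, size_t len, int nsteps)
{
    ffmk_register_recovery(rebuild_lost_rank);
    for (int t = 0; t < nsteps; t++) {
        /* ... compute one time step ... */
        if (t % 100 == 0)  /* live state is smallest between steps */
            ffmk_hint_checkpoint_now(state, len);
    }
}
```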
Load Imbalances
[Figure: COSMO-SPECS+FD4 (128 ranks, 180 timesteps); per-process computation-time fraction (0.1-1.0) over timesteps]
[Chart: L4Linux on the L4 microkernel; runtime in seconds (approx. 18.1-18.7 s) over number of cores (30-750), comparing standard vs. decoupled execution]
Phase 1 (2013-2015)
Phase 2 (2016-2017)