Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Increasing Fault Resiliency in a Message-Passing Environment
Rolf Riesen, Kurt Ferreira, Ron Oldfield, Jon Stearley,
James Laros, Kevin Pedretti, Ron Brightwell
Sandia National Laboratories

Todd Kordenbrock
Hewlett-Packard Company
October 14, 2009
Motivation
Motivation
Design
Evaluation
Analysis
Implications
Summary
HPC Resiliency 2009 2 / 32
■ Checkpoint/restart is the common way to deal with faults
■ What can we do to increase the checkpoint interval?
■ Explore redundant computation for MPI applications
◆ Can it be done at user level?
◆ What are the requirements for the RAS system and runtime?
◆ What is the overhead?
◆ Cost versus benefit?
■ Write the rMPI library at the MPI profiling layer to learn what the issues are
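The profiling-layer approach means rMPI exports the MPI entry points and calls the underlying implementation after adding its own logic. A toy Python analogue of that interposition, where a hypothetical `transport` object stands in for the underlying PMPI_* entry points (names are ours, not rMPI's):

```python
class RMPISketch:
    """Toy analogue of PMPI interposition: every send goes through the
    wrapper, which duplicates it to the redundant partner node.
    `transport` stands in for the underlying PMPI_* layer; `n` is the
    number of active ranks (both names are assumptions)."""

    def __init__(self, transport, n):
        self.transport = transport
        self.n = n

    def send(self, buf, dest, tag):
        # deliver to the active node, then to its redundant partner
        self.transport.send(buf, dest, tag)
        self.transport.send(buf, dest + self.n, tag)
```

The appeal of the profiling layer is exactly this: no changes to the application or the MPI implementation are required.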
Design
Redundant Computation
■ Active and redundant node do the same computation
■ One node continues when the other fails
■ MPI_Comm_rank() returns the same value on both nodes
■ Not every node needs to have a redundant partner
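One way to make MPI_Comm_rank() agree on both nodes is a fixed mapping from physical to logical ranks. A sketch under the assumption that redundant ranks are offset by the active-node count (the layout and function names are ours):

```python
def logical_rank(phys, n):
    """Logical rank the application sees: physical ranks 0..n-1 are
    active, ranks n..2n-1 mirror them (assumed layout)."""
    return phys % n

def partner(phys, n, world_size):
    """Physical rank of the redundant partner, or None when this
    logical rank has no partner (partial redundancy)."""
    p = phys + n if phys < n else phys - n
    return p if p < world_size else None
```

With `world_size` between `n` and `2n`, only the first ranks get partners, which matches the point that not every node needs a redundant partner.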
Basic Operation
■ Each message gets sent twice (four times, if we count the redundant nodes)
■ Message and redundant copy are received into the same buffer
■ Redundant messages have an unused tag bit set
■ A protocol is needed to coordinate receives and other MPI operations
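The unused-tag-bit scheme is plain bit arithmetic. Which bit is actually free depends on the implementation's MPI_TAG_UB, so the constant below is an assumption:

```python
REDUNDANT_BIT = 1 << 27  # assumed spare bit below MPI_TAG_UB

def mark_redundant(tag):
    """Tag used for the copy sent to/from a redundant node."""
    return tag | REDUNDANT_BIT

def is_redundant(tag):
    """True if this message is the redundant copy."""
    return bool(tag & REDUNDANT_BIT)

def application_tag(tag):
    """Recover the tag the application actually posted."""
    return tag & ~REDUNDANT_BIT
```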
Message Order
■ Active and redundant node must receive messages in the same order
■ MPI_ANY_SOURCE and MPI_ANY_TAG are problematic
■ The redundant node maintains a queue of posted receives if MPI_ANY_SOURCE is used
■ It coordinates with the active node to post a specific receive
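The coordination can be sketched as follows: the redundant node parks MPI_ANY_SOURCE receives in a queue and converts each into a specific receive once the active node reports which sender actually matched. This is a simplification of whatever handshake rMPI uses internally:

```python
from collections import deque

class WildcardCoordinator:
    """Redundant-node side: defer MPI_ANY_SOURCE receives until the
    active node names the matched sender, so both replicas end up
    receiving in the same order (sketch, not rMPI's actual protocol)."""

    ANY_SOURCE = -1

    def __init__(self):
        self.deferred = deque()  # wildcard receives awaiting a source

    def post_receive(self, source):
        """Return the source to receive from, or None if deferred."""
        if source == self.ANY_SOURCE:
            self.deferred.append(source)
            return None
        return source

    def active_node_matched(self, actual_source):
        """Active node reports who matched; the oldest deferred receive
        becomes a specific receive from that sender."""
        self.deferred.popleft()
        return actual_source
```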
Other Functions
■ The redundant node must return the same information for probe, test, and time functions
■ The active node performs the operation and sends the result to the redundant node
■ Collectives use redundant point-to-point operations
■ rMPI re-implements almost all of MPI
◆ rMPI uses MPICH (mostly) as a transport layer
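The "active node decides, partner follows" rule can be sketched for the time functions: the active node reads the clock and forwards the value so both replicas return identical results. The `channel` object here is a hypothetical pairwise link, not an rMPI API:

```python
import time

def coordinated_wtime(is_active, channel):
    """Both replicas must see the same clock value: the active node
    measures and forwards it; the redundant node waits for the
    forwarded value. `channel` is an assumed link with send()/recv()."""
    if is_active:
        t = time.monotonic()
        channel.send(t)
        return t
    return channel.recv()
```

Probe and test results would be coordinated the same way, which is why they appear together on this slide.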
Status
■ Can't do MPI_ANY_TAG and MPI_ANY_SOURCE simultaneously
■ Most major functions of MPI-2 are implemented
■ We are considering transactions to limit data volume
■ We plan to open-source it
■ Few RAS features are needed:
◆ Notification of node availability
◆ No Byzantine behavior (error-correction protocol on the network)
◆ Messages to/from dead nodes must be consumed (no dead- or live-lock)
Evaluation
Bandwidth
[Figure: bandwidth (0 B/s to 2.0 GB/s) and difference to native (0-100%) vs. message size (1 B to 10 MB). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D'.]
Latency
[Figure: latency (4 us to 20 us) and difference to native (0-100%) vs. message size (1 B to 10 kB). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D'.]
Latency with MPI ANY SOURCE
[Figure: latency with MPI_ANY_SOURCE (0 to 30 us) and difference to native (0-160%) vs. message size (1 B to 10 kB). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D'.]
Allreduce
[Figure: Allreduce performance on 32 nodes; time per operation (0 to 6 ms) and difference to native (0-100%) vs. message size (1 B to 1 MB). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D', Reverse = ABCD|D'C'B'A', Shuffle = e.g., ABCD|C'B'D'A'.]
CTH
[Figure: CTH execution time (240 s to 440 s) and difference to native (0-100%) vs. number of nodes (4 to 2,048). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D', Reverse = ABCD|D'C'B'A', Shuffle = e.g., ABCD|C'B'D'A'.]
SAGE
[Figure: SAGE execution time (300 s to 1.1 ks) and difference to native (0-100%) vs. number of nodes (4 to 2,048). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D', Reverse = ABCD|D'C'B'A', Shuffle = e.g., ABCD|C'B'D'A'.]
LAMMPS
[Figure: LAMMPS execution time (315 s to 350 s) and difference to native (0-100%) vs. number of nodes (4 to 2,048). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D', Reverse = ABCD|D'C'B'A', Shuffle = e.g., ABCD|C'B'D'A'.]
HPCCG
[Figure: HPCCG execution time (0 to 35 s) and difference to native (0-100%) vs. number of nodes (4 to 1,024). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D', Reverse = ABCD|D'C'B'A', Shuffle = e.g., ABCD|C'B'D'A'.]
Analysis
rMPI Model
■ Acts as a filter
◆ Input is a series of faults
◆ Output is the application interrupts that cause a restart
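A minimal version of this filter, under our simplifying assumption of full redundancy where physical nodes r and r+n back logical rank r, and an interrupt occurs only once both nodes of some pair have failed:

```python
def first_interrupt(fault_nodes, n):
    """Filter a fault sequence (physical node numbers) into the first
    application interrupt of a fully redundant run with n logical
    ranks. Returns the index of the interrupting fault, or None if
    redundancy absorbed all faults (sketch of the rMPI model)."""
    failed = set()
    for i, node in enumerate(fault_nodes):
        rank = node % n          # node and node+n back the same rank
        if rank in failed:       # partner already down: interrupt
            return i
        failed.add(rank)
    return None
```

Without redundancy, every fault in the input stream would be an interrupt, which is why redundancy drops the interrupt probability so sharply in the next plot.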
Application Interrupts
[Figure: probability of application interrupt (interrupts/faults), log scale from 0.001 to 1.0, vs. number of nodes (100 to 100,000); curves for no redundancy and for N redundant nodes.]
Application Lifetime
■ The application needs to finish a set amount of work
■ It checkpoints and restarts when necessary
Application Simulation
■ Combine the rMPI model with a state machine
■ Finish a set amount of work
■ Checkpoint and restart when necessary
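The state machine can be sketched as a small loop over checkpoint intervals. The rollback rule below (a fault discards the interval in progress and costs a restart) is our simplification, not the exact simulator:

```python
def simulate(work, interval, t_ckpt, t_restart, fault_times):
    """Elapsed time to finish `work` hours when checkpointing every
    `interval` hours. `fault_times` is a sorted list of absolute fault
    times (hours); each fault forces rework of the current interval
    plus a restart (sketch of the application simulator)."""
    t = 0.0        # wall-clock time
    done = 0.0     # work safely checkpointed
    fi = 0
    while done < work:
        chunk = min(interval, work - done)
        finish = t + chunk + t_ckpt
        if fi < len(fault_times) and fault_times[fi] < finish:
            t = fault_times[fi] + t_restart   # rework + restart
            fi += 1
        else:
            t = finish                        # chunk and checkpoint done
            done += chunk
    return t
```

Feeding this loop the interrupt stream produced by the rMPI model yields the work/checkpoint/rework/restart breakdowns shown on the following slides.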
Application Behavior without Redundancy
[Figure: normalized elapsed time (0-100%) spent in work, checkpoint, rework, and restart vs. number of nodes (100 to 100,000). 168 h work, 5 min checkpoint, 10 min restart, 5-year node MTBF.]
Application Behavior without Redundancy
[Figure: total elapsed time in hours (0 to 700) spent in work, checkpoint, rework, and restart vs. number of nodes (100 to 100,000). 168 h work, 5 min checkpoint, 10 min restart, 5-year node MTBF.]
Application Behavior with Redundancy
[Figure: elapsed time in hours (0 to 250) spent in work, checkpoint, rework, and restart vs. number of nodes (100 to 100,000), with redundant nodes. 168 h work, 5 min checkpoint, 10 min restart, 5-year node MTBF.]
Validation
Daly's equation from "A higher order estimate of the optimum checkpoint interval for restart dumps":

Tw(τ) = Θ e^(R/Θ) (e^((τ+δ)/Θ) − 1) Ts/τ,   for δ ≪ Ts
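In code, Daly's estimate reads as follows (Ts = solve time, τ = checkpoint interval, δ = checkpoint write time, R = restart time, Θ = system MTBF):

```python
import math

def daly_walltime(ts, tau, delta, r, theta):
    """Daly's higher-order wall-clock estimate
    Tw(tau) = Theta * e^(R/Theta) * (e^((tau+delta)/Theta) - 1) * Ts/tau,
    valid for delta << Ts."""
    return (theta * math.exp(r / theta)
            * (math.exp((tau + delta) / theta) - 1.0) * ts / tau)
```

As a sanity check, with δ = R = 0 and τ small relative to Θ the expression tends to Ts, i.e., no overhead at all.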
[Figure: execution time in hours (600 to 2,000) vs. number of nodes (100 to 50,000), comparing Daly's model with the application simulation.]
Implications and Trade-Offs
Implications
R = (Tw + To(none)) / (Tw + To(full) + TrMPI)   (1)

■ Tw: amount of work
■ To(none): checkpoint, restart, and rework overhead without redundant nodes
■ To(full): overhead with redundant nodes
■ If R > 2, then using redundant nodes makes sense
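Since full redundancy doubles the node count, the break-even test from equation (1) is R > 2. A direct transcription:

```python
def redundancy_ratio(t_w, t_o_none, t_o_full, t_rmpi):
    """Equation (1): ratio of the non-redundant run time (work plus
    checkpoint/restart/rework overhead) to the redundant run time
    (smaller overhead, plus rMPI's own cost t_rmpi)."""
    return (t_w + t_o_none) / (t_w + t_o_full + t_rmpi)

def pays_off(t_w, t_o_none, t_o_full, t_rmpi):
    """Redundancy makes sense when it buys back more than the 2x
    nodes it consumes, i.e., when R > 2."""
    return redundancy_ratio(t_w, t_o_none, t_o_full, t_rmpi) > 2.0
```

For example, 100 h of work with 300 h of non-redundant overhead against 50 h of redundant overhead and 10 h of rMPI cost gives R = 400/160 = 2.5, so redundancy wins.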
Implications
[Figure: number of application interrupts (1 to 10 M, log scale) vs. number of nodes (100 to 100 k) and level of redundancy (none, 1/4, 1/2, 3/4, full). 5,000 h work, 5 min checkpoint, 10 min restart, 1-year node MTBF.]
Summary
■ A redundant MPI library can be done at user level
■ The overhead is not significant for most applications
■ The application restart simulator allows modeling of
◆ varying node counts
◆ node MTBF
◆ level of redundancy
◆ failure distribution function (exponential, gamma, Weibull)
◆ amount of work, checkpoint and restart times
■ The model helps determine when redundancy pays off