Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Increasing Fault Resiliency in a Message-Passing Environment
Rolf Riesen, Kurt Ferreira, Ron Oldfield, Jon Stearley,
James Laros, Kevin Pedretti, Ron Brightwell
Sandia National Laboratories

Todd Kordenbrock
Hewlett-Packard Company
October 14, 2009
Motivation
Motivation
Design
Evaluation
Analysis
Implications
Summary
HPC Resiliency 2009 2 / 32
■ Checkpoint/restart is the common way to deal with faults
■ What can we do to increase the checkpoint interval?
■ Explore redundant computation for MPI applications
◆ Can it be done at user level?
◆ What are the requirements for the RAS system and runtime?
◆ What is the overhead?
◆ Cost versus benefit?
■ Write the rMPI library at the MPI profiling layer to learn what the issues are
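The profiling-layer approach means rMPI exports the MPI entry points and calls the underlying implementation after adding its own logic. A toy Python analogue of that interposition, where a hypothetical `transport` object stands in for the underlying PMPI_* entry points (names are ours, not rMPI's):

```python
class RMPISketch:
    """Toy analogue of PMPI interposition: every send goes through the
    wrapper, which duplicates it to the redundant partner node.
    `transport` stands in for the underlying PMPI_* layer; `n` is the
    number of active ranks (both names are assumptions)."""

    def __init__(self, transport, n):
        self.transport = transport
        self.n = n

    def send(self, buf, dest, tag):
        # deliver to the active node, then to its redundant partner
        self.transport.send(buf, dest, tag)
        self.transport.send(buf, dest + self.n, tag)
```

The appeal of the profiling layer is exactly this: no changes to the application or the MPI implementation are required.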
Design
Redundant Computation
■ Active and redundant node do the same computation
■ One node continues when the other fails
■ MPI_Comm_rank() returns the same value on both nodes
■ Not every node needs to have a redundant partner
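One way to make MPI_Comm_rank() agree on both nodes is a fixed mapping from physical to logical ranks. A sketch under the assumption that redundant ranks are offset by the active-node count (the layout and function names are ours):

```python
def logical_rank(phys, n):
    """Logical rank the application sees: physical ranks 0..n-1 are
    active, ranks n..2n-1 mirror them (assumed layout)."""
    return phys % n

def partner(phys, n, world_size):
    """Physical rank of the redundant partner, or None when this
    logical rank has no partner (partial redundancy)."""
    p = phys + n if phys < n else phys - n
    return p if p < world_size else None
```

With `world_size` between `n` and `2n`, only the first ranks get partners, which matches the point that not every node needs a redundant partner.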
Basic Operation
■ Each message gets sent twice (four times, if we count the redundant nodes)
■ Message and redundant copy are received into the same buffer
■ Redundant messages have an unused tag bit set
■ A protocol is needed to coordinate receives and other MPI operations
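The unused-tag-bit scheme is plain bit arithmetic. Which bit is actually free depends on the implementation's MPI_TAG_UB, so the constant below is an assumption:

```python
REDUNDANT_BIT = 1 << 27  # assumed spare bit below MPI_TAG_UB

def mark_redundant(tag):
    """Tag used for the copy sent to/from a redundant node."""
    return tag | REDUNDANT_BIT

def is_redundant(tag):
    """True if this message is the redundant copy."""
    return bool(tag & REDUNDANT_BIT)

def application_tag(tag):
    """Recover the tag the application actually posted."""
    return tag & ~REDUNDANT_BIT
```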
Message Order
■ Active and redundant node must receive messages in the same order
■ MPI_ANY_SOURCE and MPI_ANY_TAG are problematic
■ The redundant node maintains a queue of posted receives if MPI_ANY_SOURCE is used
■ It coordinates with the active node to post a specific receive
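The coordination can be sketched as follows: the redundant node parks MPI_ANY_SOURCE receives in a queue and converts each into a specific receive once the active node reports which sender actually matched. This is a simplification of whatever handshake rMPI uses internally:

```python
from collections import deque

class WildcardCoordinator:
    """Redundant-node side: defer MPI_ANY_SOURCE receives until the
    active node names the matched sender, so both replicas end up
    receiving in the same order (sketch, not rMPI's actual protocol)."""

    ANY_SOURCE = -1

    def __init__(self):
        self.deferred = deque()  # wildcard receives awaiting a source

    def post_receive(self, source):
        """Return the source to receive from, or None if deferred."""
        if source == self.ANY_SOURCE:
            self.deferred.append(source)
            return None
        return source

    def active_node_matched(self, actual_source):
        """Active node reports who matched; the oldest deferred receive
        becomes a specific receive from that sender."""
        self.deferred.popleft()
        return actual_source
```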
Other Functions
■ The redundant node must return the same information for probe, test, and time functions
■ The active node performs the operation and sends the result to the redundant node
■ Collectives use redundant point-to-point operations
■ rMPI re-implements almost all of MPI
◆ rMPI uses MPICH (mostly) as a transport layer
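The "active node decides, partner follows" rule can be sketched for the time functions: the active node reads the clock and forwards the value so both replicas return identical results. The `channel` object here is a hypothetical pairwise link, not an rMPI API:

```python
import time

def coordinated_wtime(is_active, channel):
    """Both replicas must see the same clock value: the active node
    measures and forwards it; the redundant node waits for the
    forwarded value. `channel` is an assumed link with send()/recv()."""
    if is_active:
        t = time.monotonic()
        channel.send(t)
        return t
    return channel.recv()
```

Probe and test results would be coordinated the same way, which is why they appear together on this slide.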
Status
■ Can't do MPI_ANY_TAG and MPI_ANY_SOURCE simultaneously
■ Most major functions of MPI-2 are implemented
■ We are considering transactions to limit data volume
■ We plan to open-source it
■ Few RAS features are needed:
◆ Notification of node availability
◆ No Byzantine behavior (error-correction protocol on the network)
◆ Messages to/from dead nodes must be consumed (no dead- or live-lock)
Evaluation
Bandwidth
[Figure: bandwidth (0 B/s to 2.0 GB/s) and difference to native (0-100%) vs. message size (1 B to 10 MB). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D'.]
Latency
[Figure: latency (4 us to 20 us) and difference to native (0-100%) vs. message size (1 B to 10 kB). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D'.]
Latency with MPI ANY SOURCE
[Figure: latency with MPI_ANY_SOURCE (0 to 30 us) and difference to native (0-160%) vs. message size (1 B to 10 kB). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D'.]
Allreduce
[Figure: Allreduce performance on 32 nodes; time per operation (0 to 6 ms) and difference to native (0-100%) vs. message size (1 B to 1 MB). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D', Reverse = ABCD|D'C'B'A', Shuffle = e.g., ABCD|C'B'D'A'.]
CTH
[Figure: CTH execution time (240 s to 440 s) and difference to native (0-100%) vs. number of nodes (4 to 2,048). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D', Reverse = ABCD|D'C'B'A', Shuffle = e.g., ABCD|C'B'D'A'.]
SAGE
[Figure: SAGE execution time (300 s to 1.1 ks) and difference to native (0-100%) vs. number of nodes (4 to 2,048). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D', Reverse = ABCD|D'C'B'A', Shuffle = e.g., ABCD|C'B'D'A'.]
LAMMPS
[Figure: LAMMPS execution time (315 s to 350 s) and difference to native (0-100%) vs. number of nodes (4 to 2,048). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D', Reverse = ABCD|D'C'B'A', Shuffle = e.g., ABCD|C'B'D'A'.]
HPCCG
[Figure: HPCCG execution time (0 to 35 s) and difference to native (0-100%) vs. number of nodes (4 to 1,024). Native = benchmark w/o rMPI, Base = rMPI with no redundancy, Forward = ABCD|A'B'C'D', Reverse = ABCD|D'C'B'A', Shuffle = e.g., ABCD|C'B'D'A'.]
Analysis
rMPI Model
■ Acts as a filter
◆ Input is a series of faults
◆ Output is the application interrupts that cause a restart
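A minimal version of this filter, under our simplifying assumption of full redundancy where physical nodes r and r+n back logical rank r, and an interrupt occurs only once both nodes of some pair have failed:

```python
def first_interrupt(fault_nodes, n):
    """Filter a fault sequence (physical node numbers) into the first
    application interrupt of a fully redundant run with n logical
    ranks. Returns the index of the interrupting fault, or None if
    redundancy absorbed all faults (sketch of the rMPI model)."""
    failed = set()
    for i, node in enumerate(fault_nodes):
        rank = node % n          # node and node+n back the same rank
        if rank in failed:       # partner already down: interrupt
            return i
        failed.add(rank)
    return None
```

Without redundancy, every fault in the input stream would be an interrupt, which is why redundancy drops the interrupt probability so sharply in the next plot.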
Application Interrupts
[Figure: probability of application interrupt (interrupts/faults), log scale from 0.001 to 1.0, vs. number of nodes (100 to 100,000); curves for no redundancy and for N redundant nodes.]
Application Lifetime
■ The application needs to finish a set amount of work
■ It checkpoints and restarts when necessary
Application Simulation
■ Combine the rMPI model with a state machine
■ Finish a set amount of work
■ Checkpoint and restart when necessary
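The state machine can be sketched as a small loop over checkpoint intervals. The rollback rule below (a fault discards the interval in progress and costs a restart) is our simplification, not the exact simulator:

```python
def simulate(work, interval, t_ckpt, t_restart, fault_times):
    """Elapsed time to finish `work` hours when checkpointing every
    `interval` hours. `fault_times` is a sorted list of absolute fault
    times (hours); each fault forces rework of the current interval
    plus a restart (sketch of the application simulator)."""
    t = 0.0        # wall-clock time
    done = 0.0     # work safely checkpointed
    fi = 0
    while done < work:
        chunk = min(interval, work - done)
        finish = t + chunk + t_ckpt
        if fi < len(fault_times) and fault_times[fi] < finish:
            t = fault_times[fi] + t_restart   # rework + restart
            fi += 1
        else:
            t = finish                        # chunk and checkpoint done
            done += chunk
    return t
```

Feeding this loop the interrupt stream produced by the rMPI model yields the work/checkpoint/rework/restart breakdowns shown on the following slides.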
Application Behavior without Redundancy
[Figure: normalized elapsed time (0-100%) spent in work, checkpoint, rework, and restart vs. number of nodes (100 to 100,000). 168 h work, 5 min checkpoint, 10 min restart, 5-year node MTBF.]
Application Behavior without Redundancy
[Figure: total elapsed time in hours (0 to 700) spent in work, checkpoint, rework, and restart vs. number of nodes (100 to 100,000). 168 h work, 5 min checkpoint, 10 min restart, 5-year node MTBF.]
Application Behavior with Redundancy
[Figure: elapsed time in hours (0 to 250) spent in work, checkpoint, rework, and restart vs. number of nodes (100 to 100,000), with redundant nodes. 168 h work, 5 min checkpoint, 10 min restart, 5-year node MTBF.]
Validation
Daly's equation from "A higher order estimate of the optimum checkpoint interval for restart dumps":

Tw(τ) = Θ e^(R/Θ) (e^((τ+δ)/Θ) − 1) Ts/τ,   for δ ≪ Ts
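In code, Daly's estimate reads as follows (Ts = solve time, τ = checkpoint interval, δ = checkpoint write time, R = restart time, Θ = system MTBF):

```python
import math

def daly_walltime(ts, tau, delta, r, theta):
    """Daly's higher-order wall-clock estimate
    Tw(tau) = Theta * e^(R/Theta) * (e^((tau+delta)/Theta) - 1) * Ts/tau,
    valid for delta << Ts."""
    return (theta * math.exp(r / theta)
            * (math.exp((tau + delta) / theta) - 1.0) * ts / tau)
```

As a sanity check, with δ = R = 0 and τ small relative to Θ the expression tends to Ts, i.e., no overhead at all.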
[Figure: execution time in hours (600 to 2,000) vs. number of nodes (100 to 50,000), comparing Daly's model with the application simulation.]
Implications and Trade-Offs
Implications
R = (Tw + To(none)) / (Tw + To(full) + TrMPI)   (1)

■ Tw: amount of work
■ To(none): checkpoint, restart, and rework overhead without redundant nodes
■ To(full): overhead with redundant nodes
■ If R > 2, then using redundant nodes makes sense
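Since full redundancy doubles the node count, the break-even test from equation (1) is R > 2. A direct transcription:

```python
def redundancy_ratio(t_w, t_o_none, t_o_full, t_rmpi):
    """Equation (1): ratio of the non-redundant run time (work plus
    checkpoint/restart/rework overhead) to the redundant run time
    (smaller overhead, plus rMPI's own cost t_rmpi)."""
    return (t_w + t_o_none) / (t_w + t_o_full + t_rmpi)

def pays_off(t_w, t_o_none, t_o_full, t_rmpi):
    """Redundancy makes sense when it buys back more than the 2x
    nodes it consumes, i.e., when R > 2."""
    return redundancy_ratio(t_w, t_o_none, t_o_full, t_rmpi) > 2.0
```

For example, 100 h of work with 300 h of non-redundant overhead against 50 h of redundant overhead and 10 h of rMPI cost gives R = 400/160 = 2.5, so redundancy wins.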
Implications
[Figure: number of application interrupts (1 to 10 M, log scale) vs. number of nodes (100 to 100 k) and level of redundancy (none, 1/4, 1/2, 3/4, full). 5,000 h work, 5 min checkpoint, 10 min restart, 1-year node MTBF.]
Summary
■ A redundant MPI library can be done at user level
■ The overhead is not significant for most applications
■ The application restart simulator allows modeling of
◆ varying node counts
◆ node MTBF
◆ level of redundancy
◆ failure distribution function (exponential, gamma, Weibull)
◆ amount of work, checkpoint and restart times
■ The model helps determine when redundancy pays off