DEFINITION
"Fault tolerance is the property that enables a system to continue operating
properly in the event of the failure of (or one or more faults within) some of
its components."
"Fault tolerance," Wikipedia (2012) [Online]. Available:
http://en.wikipedia.org/wiki/Fault-tolerant_system
Fault tolerance is the capability of a long-running application or system to
tolerate or avoid a certain amount of failures that may appear during
computation.
My own definition, Michal Waleszczuk
FAULT CLASSIFICATION
• Software failure
• Hardware failure

Causes:
• Hardware/software faults
• Human factor
• Network congestion
• Server overload

Occurrence frequency:
• Permanent – malfunctioning until fixed
• Transient – temporary malfunctioning, which disappears after some time
• Intermittent – frequent but temporary failure
MAIN TASKS
• Monitoring: watches an application to verify correct execution.
• Protection: creates backups (checkpoints).
• Detection: notices failures.
• Migration: moves data from a failed node/process to another one.
• Fault masking: responsible for proper communication after a failure.
• Recovery: restores the process from a checkpoint.
• Reconfiguration: restores configuration properties.
ROLLBACK RECOVERY
Rollback recovery is a mechanism that allows a system to create checkpoints
of a running application. In case of failure it rolls the application back
to a previously saved checkpoint. Main features:
• Transparency: independent from the application.
• Platform portability: not bound to a particular operating system or
application framework.
• Intelligence/automation: uses failure prediction and detection mechanisms
to recognize and deal with failures on its own, without involving the user.
CHECKPOINT SYSTEM IMPLEMENTATION
• Application-level: requires programmer knowledge about the application.
Not transparent (see the sketch below).
• Library-level: a patch or library is attached to the application. Usually
not transparent; may lack access to some system information.
• System-level: not possible on all systems, since not all expose their
kernel code; platform dependent, but transparent to users.
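As an illustration of the application-level approach, here is a minimal
sketch (all names are hypothetical, not taken from the presented
application): the programmer decides what constitutes the state and writes
it to a file, so a restarted run can pick up where it left off.

#include <stdio.h>

#define N 1000
static double state[N];          /* the state the programmer chose to save */

int save_checkpoint(const char *path, int rows_done) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&rows_done, sizeof rows_done, 1, f) == 1 &&
             fwrite(state, sizeof state, 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Returns the number of rows already done, or 0 if there is no usable
 * checkpoint, in which case the computation starts from scratch. */
int load_checkpoint(const char *path) {
    int rows_done = 0;
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    if (fread(&rows_done, sizeof rows_done, 1, f) != 1 ||
        fread(state, sizeof state, 1, f) != 1)
        rows_done = 0;           /* corrupt checkpoint: start over */
    fclose(f);
    return rows_done;
}

The cost of this approach is visible here: the code must know exactly which
variables make up the state, which is what the library- and system-level
approaches avoid.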
CHECKPOINT SYSTEM STORAGE POLICIES

Factors                            Full checkpoints   Incremental checkpoints
Restore time                       (+) Faster         (-) Slower
Resource usage                     (-) Higher         (+) Lower
Effect on application throughput   (-) Higher         (+) Lower

The full vs. incremental trade-off is sketched in code below this list.
Checkpoints may be:
• Invoked from: an API (event driven) or the command line (time driven)
• Executed as: coordinated (simultaneously for all processes) or
uncoordinated (each process on its own, with message logging)
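A minimal sketch of the trade-off from the table (illustrative only, not
DMTCP's implementation): a full checkpoint writes every row, while an
incremental one writes only the rows dirtied since the last checkpoint.

#include <stdio.h>
#include <string.h>

#define ROWS 1000
#define COLS 1000

static double matrix[ROWS][COLS];
static unsigned char dirty[ROWS];   /* set by the computation on row change */

void write_checkpoint(FILE *f, int incremental) {
    for (int r = 0; r < ROWS; r++) {
        if (incremental && !dirty[r])
            continue;                /* incremental: skip unchanged rows */
        fwrite(&r, sizeof r, 1, f);  /* row index, so restore knows its place */
        fwrite(matrix[r], sizeof matrix[r], 1, f);
    }
    memset(dirty, 0, sizeof dirty);  /* everything is now captured */
}

Restoring from incremental checkpoints is slower because the full image must
be rebuilt by replaying the chain of increments, which matches the (+)/(-)
entries in the table above.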
DMTCP
Distributed MultiThreaded Checkpointing:
• Coordinated checkpoint protocol
• Transparent to Linux applications
• Supports HPC environments such as MPI, InfiniBand, and SLURM
APPLICATION
• Matrix multiplication written in C, using MPI for parallel computing. It
includes dmtcp.h to create checkpoints from the source code. The application
creates two square matrices at random and multiplies them. Parameters:
• n: matrix size (n x n)
• modulo: upper bound for cell values
• A bash script executes the app and returns the mean runtime. Parameters:
• number of executions (loop iterations)
• n and modulo: application parameters
• (number of nodes): in case of the parallel application
A sketch of the application's structure follows below.
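This is a minimal sketch of that structure, assuming DMTCP's dmtcp.h
interface (dmtcp_is_enabled(), dmtcp_checkpoint()); the matrix allocation,
distribution, and gathering steps are elided, and the row-striping scheme is
an illustrative assumption rather than the presented code.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <dmtcp.h>               /* dmtcp_is_enabled(), dmtcp_checkpoint() */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, procs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    if (argc < 3) {
        if (rank == 0) fprintf(stderr, "usage: %s n modulo\n", argv[0]);
        MPI_Finalize();
        return 1;
    }
    int n = atoi(argv[1]);       /* matrix size: n x n */
    int modulo = atoi(argv[2]);  /* cells drawn as rand() % modulo */
    (void)modulo;

    /* ... allocate a, b, c; rank 0 fills a and b at random and
       distributes them to the other ranks ... */

    int done = 0;                /* rows computed by this rank */
    for (int row = rank; row < n; row += procs) {
        /* ... compute row `row` of c = a x b ... */
        done++;
        /* request a coordinated checkpoint every 100 computed rows; the
           coordinator then checkpoints every connected process */
        if (rank == 0 && done % 100 == 0 && dmtcp_is_enabled())
            dmtcp_checkpoint();
    }

    /* ... gather the result rows on rank 0 ... */
    MPI_Finalize();
    return 0;
}

Run without DMTCP, dmtcp_is_enabled() is false and the program behaves like
the plain version; run under dmtcp_launch, every 100th computed row triggers
a coordinated checkpoint.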
RUNTIME COMPARISON
[Two charts: mean runtime in milliseconds vs. matrix size n, comparing
execution under the DMTCP coordinator ("Coordinator") with plain execution
("Normal"); the left chart covers n = 10 to 90 (runtimes up to ~40 ms), the
right one n = 100 to 1000 (runtimes up to ~15000 ms).]
• 30 repetitions for each matrix size
• Serial execution
• modulo set to 10000
• Matrix size n means an n x n matrix
RUNTIME COMPARISON
Mean runtime in seconds by checkpointing mode and matrix size n:

Mode             800 (x1)   1000 (x1.5)   1500 (x3.5)   2000 (x6.3)
No coordinator     4.76         9.64         33.37         78.39
1 checkpoint       5.63        10.57         35.02         81.39
3 checkpoints      7.00        12.54         38.65         87.46
5 checkpoints      8.39        14.37         41.90         92.44

Checkpoint positions, as a percentage of execution measured by computed rows:
• 1 checkpoint: 50%
• 3 checkpoints: 30%, 60%, 90%
• 5 checkpoints: 19%, 38%, 57%, 76%, 95%
RUNTIME OVERHEAD
Runtime overhead relative to execution without the coordinator ("no coor"):

Matrix size          1 checkpoint   3 checkpoints   5 checkpoints
n=800 (640k el.)        18.19%         47.00%          76.10%
n=1000 (1000k el.)       9.62%         30.09%          49.05%
n=1500 (2250k el.)       4.93%         15.82%          25.56%
n=2000 (4000k el.)       3.82%         11.56%          17.92%
CHECKPOINT CONFIGURATION
• DMTCP configuration: "./configure --prefix=/home/michal/opt/dmtcp
--enable-timing --enable-unique-checkpoint-filenames"
• Checkpoints are invoked mainly from the application code, using dmtcp.h
and its dmtcp_checkpoint() function, although there is also an option to
checkpoint at fixed time intervals without touching the application code.
• For the presentation I used checkpoints invoked from code every 100
computed rows. I also wrote a script to retrieve checkpoint sizes and
restoration times; a sketch of the size part follows below.
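The measurement script itself is not shown in the slides; a C equivalent of
the size part might look like this, assuming DMTCP's default ckpt_*.dmtcp
file naming in the working directory.

#include <glob.h>
#include <stdio.h>
#include <sys/stat.h>

/* Sum the sizes of all local checkpoint images. The ckpt_*.dmtcp pattern
 * follows DMTCP's default naming; adjust it if a checkpoint directory or
 * different naming is configured. */
long long checkpoint_bytes(void) {
    glob_t g;
    long long total = 0;
    if (glob("ckpt_*.dmtcp", 0, NULL, &g) != 0)
        return 0;                        /* no checkpoint files yet */
    for (size_t i = 0; i < g.gl_pathc; i++) {
        struct stat st;
        if (stat(g.gl_pathv[i], &st) == 0)
            total += st.st_size;
    }
    globfree(&g);
    return total;
}

int main(void) {
    printf("checkpoint size: %.2f MB\n",
           checkpoint_bytes() / (1024.0 * 1024.0));
    return 0;
}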
CHECKPOINTS SIZE
Checkpoint size in MB by computed rows:

Computed rows   n=1000   n=1500   n=2000
100              7.46     14.01    23.17
200              8.07     14.98    24.52
300              8.65     15.84    25.68
400              9.21     16.69    26.82
500              9.78     17.55    27.97
600             10.35     18.41    29.12
700             10.91     19.26    30.28
800             11.46     20.12    31.43
900             12.02     20.98    32.57
1000            12.57     21.83    33.72
CHECKPOINT STORETIME
Checkpoint store time in seconds by computed rows:

Computed rows   n=1000   n=1500   n=2000
100              0.69     1.17     1.85
200              0.72     1.29     1.99
300              0.75     1.30     2.04
400              0.78     1.33     2.12
500              0.81     1.39     2.19
600              0.85     1.49     2.25
700              0.93     1.46     2.25
800              0.98     1.51     2.31
900              0.96     1.58     2.42
1000             0.96     1.66     2.53
CHECKPOINT RESTORETIME
Checkpoint restore time in seconds by computed rows:

Computed rows   n=1000   n=1500   n=2000
100              0.22     0.38     0.61
200              0.23     0.43     0.72
300              0.27     0.44     0.73
400              0.24     0.45     0.75
500              0.24     0.45     0.81
600              0.25     0.46     0.77
700              0.26     0.47     0.79
800              0.26     0.48     0.81
900              0.27     0.49     0.81
1000             0.28     0.50     0.82
ENVIRONMENT
• Cluster of 4 nodes
• OS: Ubuntu 12.04
• CPU: Intel Core i5-650 3.2GHz
• RAM: 8 GB
• Checkpoint System: DMTCP 2.4.0
• MPI: OpenMPI 1.8.7
• Shared file system: NFS
BACKGROUND
• DMTCP checkpoint files are distributed across the nodes; each node stores
only its own checkpoint files.
• In case of a permanent failure, a node's checkpoint files are lost
irrecoverably.
• Along with the checkpoints, a restart script is created, which recovers
all processes simultaneously.
• To restart a client application on different nodes it is necessary to
modify the script.
• Checkpoints are stored and overwritten during a computation.
• Protection requires constant monitoring and determining when a checkpoint
was taken.
PROTECTOR
[Diagram: four nodes, each with its own disk; Node 0 runs the coordinator
and holds jtimings.csv; every node passes its hostname to the next one and
backs up its checkpoint files to that neighbour's disk.]
The protector:
• Sends the hostname of each node to the next one, so the other node
receives the location to which it should copy checkpoint files.
• Watches jtimings.csv, whose changes imply that a checkpoint was taken.
• Copies the checkpoint files when jtimings.csv is modified.
• Closes when jtimings.csv is closed (= the coordinator went off).
A sketch of the watch loop follows below.
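A minimal sketch of that watch loop (the paths, the neighbour hostname, and
the use of scp are illustrative assumptions, not the presented code; the
original detects the coordinator's shutdown via the file being closed, which
this polling version approximates by stopping when the file disappears):

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *watched = "jtimings.csv";  /* written by the coordinator */
    struct stat st;
    time_t last_mtime = 0;

    /* stop when the file can no longer be read (coordinator went off) */
    while (stat(watched, &st) == 0) {
        if (st.st_mtime != last_mtime) {   /* a new checkpoint was taken */
            last_mtime = st.st_mtime;
            /* hypothetical destination: the hostname received from the
             * previous node at startup */
            system("scp ckpt_*.dmtcp next-node:/home/michal/backup/");
        }
        sleep(1);                           /* poll once per second */
    }
    return 0;
}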
SOURCES
• Jorge Villamayor, Dolores Rexachs, Emilio Luque, "Configuring Fault
Tolerance with Coordinated Checkpoint/Restart".
• Ifeanyi P. Egwutuoha, David Levy, Bran Selic, Shiping Chen, "A survey of
fault tolerance mechanisms and checkpoint/restart implementations for high
performance computing systems".
• http://dmtcp.sourceforge.net/
DONE TASKS
• Explore the subject of fault tolerance.
• Get familiar with DMTCP and OpenMPI; install and configure them.
• Prepare a simple application (matrix multiplication) in C and convert it
to a parallel app with MPI.
• Create a script that executes an application a given number of times and
returns the mean execution time.
• Create a protecting application which copies checkpoint files to other
nodes.
• Manage to restore an application from checkpoints on different nodes.