DEFINITION
"Fault tolerance is the property that enables a system to continue operating
properly in the event of the failure of (or one or more faults within) some of
its components."
"Fault tolerance," Wikipedia (2012) [Online]. Available:
http://en.wikipedia.org/wiki/Fault-tolerant_system
Fault tolerance is the capability of a long-running application or system to
tolerate or avoid a certain amount of failures that may appear during
computation.
My own definition, Michal Waleszczuk
FAULT CLASSIFICATION
• Software failure
• Hardware failure

Causes:
• Hardware/software faults
• Human factor
• Network congestion
• Server overload

Occurrence frequency:
• Permanent – malfunctioning until fixed
• Transient – temporary malfunctioning, which disappears after some time
• Intermittent – frequent but temporary failure
MAIN TASKS
• Monitoring: watches an application to verify correct execution.
• Protection: creates backups (checkpoints).
• Detection: notices failures.
• Migration: moves data from a failed node/process to another one.
• Fault masking: responsible for proper communication after a failure.
• Recovery: restores the process from a checkpoint.
• Reconfiguration: restores configuration properties.
ROLLBACK RECOVERY
Rollback recovery is a mechanism that allows a system to create checkpoints
of a running application. In case of failure it rolls the application back
to a previously saved checkpoint. Main features:
• Transparency: independent from the application.
• Platform portability: not bound to a particular operating system or
application framework.
• Intelligence/automation: uses failure prediction and detection mechanisms
to recognize and deal with failures on its own, without involving the user.
CHECKPOINT SYSTEM IMPLEMENTATION
• Application-level: requires programmer knowledge about the application.
Not transparent (see the sketch below).
• Library-level: a patch or library is attached to the application. Usually
not transparent; may lack access to some system information.
• System-level: not possible on all systems, since not all expose their
kernel code; platform dependent, but transparent to users.
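As an illustration of the application-level approach, here is a minimal
sketch (all names are hypothetical, not taken from the presented
application): the programmer decides what constitutes the state and writes
it to a file, so a restarted run can pick up where it left off.

#include <stdio.h>

#define N 1000
static double state[N];          /* the state the programmer chose to save */

int save_checkpoint(const char *path, int rows_done) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&rows_done, sizeof rows_done, 1, f) == 1 &&
             fwrite(state, sizeof state, 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Returns the number of rows already done, or 0 if there is no usable
 * checkpoint, in which case the computation starts from scratch. */
int load_checkpoint(const char *path) {
    int rows_done = 0;
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    if (fread(&rows_done, sizeof rows_done, 1, f) != 1 ||
        fread(state, sizeof state, 1, f) != 1)
        rows_done = 0;           /* corrupt checkpoint: start over */
    fclose(f);
    return rows_done;
}

The cost of this approach is visible here: the code must know exactly which
variables make up the state, which is what the library- and system-level
approaches avoid.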
CHECKPOINT SYSTEM STORAGE POLICIES

Factors                            Full checkpoints   Incremental checkpoints
Restore time                       (+) Faster         (-) Slower
Resource usage                     (-) Higher         (+) Lower
Effect on application throughput   (-) Higher         (+) Lower

The full vs. incremental trade-off is sketched in code below this list.
Checkpoints may be:
• Invoked from: an API (event driven) or the command line (time driven)
• Executed as: coordinated (simultaneously for all processes) or
uncoordinated (each process on its own, with message logging)
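A minimal sketch of the trade-off from the table (illustrative only, not
DMTCP's implementation): a full checkpoint writes every row, while an
incremental one writes only the rows dirtied since the last checkpoint.

#include <stdio.h>
#include <string.h>

#define ROWS 1000
#define COLS 1000

static double matrix[ROWS][COLS];
static unsigned char dirty[ROWS];   /* set by the computation on row change */

void write_checkpoint(FILE *f, int incremental) {
    for (int r = 0; r < ROWS; r++) {
        if (incremental && !dirty[r])
            continue;                /* incremental: skip unchanged rows */
        fwrite(&r, sizeof r, 1, f);  /* row index, so restore knows its place */
        fwrite(matrix[r], sizeof matrix[r], 1, f);
    }
    memset(dirty, 0, sizeof dirty);  /* everything is now captured */
}

Restoring from incremental checkpoints is slower because the full image must
be rebuilt by replaying the chain of increments, which matches the (+)/(-)
entries in the table above.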
DMTCP
Distributed MultiThreaded Checkpointing:
• Coordinated checkpoint protocol
• Transparent to Linux applications
• Supports HPC environments such as MPI, InfiniBand, and SLURM
APPLICATION
• Matrix multiplication written in C, using MPI for parallel computing. It
includes dmtcp.h to create checkpoints from the source code. The application
creates two square matrices at random and multiplies them. Parameters:
• n: matrix size (n x n)
• modulo: upper bound for cell values
• A bash script executes the app and returns the mean runtime. Parameters:
• number of executions (loop iterations)
• n and modulo: application parameters
• (number of nodes): in case of the parallel application
A sketch of the application's structure follows below.
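This is a minimal sketch of that structure, assuming DMTCP's dmtcp.h
interface (dmtcp_is_enabled(), dmtcp_checkpoint()); the matrix allocation,
distribution, and gathering steps are elided, and the row-striping scheme is
an illustrative assumption rather than the presented code.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <dmtcp.h>               /* dmtcp_is_enabled(), dmtcp_checkpoint() */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, procs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    if (argc < 3) {
        if (rank == 0) fprintf(stderr, "usage: %s n modulo\n", argv[0]);
        MPI_Finalize();
        return 1;
    }
    int n = atoi(argv[1]);       /* matrix size: n x n */
    int modulo = atoi(argv[2]);  /* cells drawn as rand() % modulo */
    (void)modulo;

    /* ... allocate a, b, c; rank 0 fills a and b at random and
       distributes them to the other ranks ... */

    int done = 0;                /* rows computed by this rank */
    for (int row = rank; row < n; row += procs) {
        /* ... compute row `row` of c = a x b ... */
        done++;
        /* request a coordinated checkpoint every 100 computed rows; the
           coordinator then checkpoints every connected process */
        if (rank == 0 && done % 100 == 0 && dmtcp_is_enabled())
            dmtcp_checkpoint();
    }

    /* ... gather the result rows on rank 0 ... */
    MPI_Finalize();
    return 0;
}

Run without DMTCP, dmtcp_is_enabled() is false and the program behaves like
the plain version; run under dmtcp_launch, every 100th computed row triggers
a coordinated checkpoint.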
RUNTIME COMPARISON
[Two charts: mean runtime in milliseconds vs. matrix size n, comparing
execution under the DMTCP coordinator ("Coordinator") with plain execution
("Normal"); the left chart covers n = 10 to 90 (runtimes up to ~40 ms), the
right one n = 100 to 1000 (runtimes up to ~15000 ms).]
• 30 repetitions for each matrix size
• Serial execution
• modulo set to 10000
• Matrix size n means an n x n matrix
RUNTIME COMPARISON
Mean runtime in seconds by checkpointing mode and matrix size n:

Mode             800 (x1)   1000 (x1.5)   1500 (x3.5)   2000 (x6.3)
No coordinator     4.76         9.64         33.37         78.39
1 checkpoint       5.63        10.57         35.02         81.39
3 checkpoints      7.00        12.54         38.65         87.46
5 checkpoints      8.39        14.37         41.90         92.44

Checkpoint positions, as a percentage of execution measured by computed rows:
• 1 checkpoint: 50%
• 3 checkpoints: 30%, 60%, 90%
• 5 checkpoints: 19%, 38%, 57%, 76%, 95%
RUNTIME OVERHEAD
Runtime overhead relative to execution without the coordinator ("no coor"):

Matrix size          1 checkpoint   3 checkpoints   5 checkpoints
n=800 (640k el.)        18.19%         47.00%          76.10%
n=1000 (1000k el.)       9.62%         30.09%          49.05%
n=1500 (2250k el.)       4.93%         15.82%          25.56%
n=2000 (4000k el.)       3.82%         11.56%          17.92%
CHECKPOINT CONFIGURATION
• DMTCP configuration: "./configure --prefix=/home/michal/opt/dmtcp
--enable-timing --enable-unique-checkpoint-filenames"
• Checkpoints are invoked mainly from the application code, using dmtcp.h
and its dmtcp_checkpoint() function, although there is also an option to
checkpoint at fixed time intervals without touching the application code.
• For the presentation I used checkpoints invoked from code every 100
computed rows. I also wrote a script to retrieve checkpoint sizes and
restoration times; a sketch of the size part follows below.
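The measurement script itself is not shown in the slides; a C equivalent of
the size part might look like this, assuming DMTCP's default ckpt_*.dmtcp
file naming in the working directory.

#include <glob.h>
#include <stdio.h>
#include <sys/stat.h>

/* Sum the sizes of all local checkpoint images. The ckpt_*.dmtcp pattern
 * follows DMTCP's default naming; adjust it if a checkpoint directory or
 * different naming is configured. */
long long checkpoint_bytes(void) {
    glob_t g;
    long long total = 0;
    if (glob("ckpt_*.dmtcp", 0, NULL, &g) != 0)
        return 0;                        /* no checkpoint files yet */
    for (size_t i = 0; i < g.gl_pathc; i++) {
        struct stat st;
        if (stat(g.gl_pathv[i], &st) == 0)
            total += st.st_size;
    }
    globfree(&g);
    return total;
}

int main(void) {
    printf("checkpoint size: %.2f MB\n",
           checkpoint_bytes() / (1024.0 * 1024.0));
    return 0;
}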
CHECKPOINTS SIZE
Checkpoint size in MB by computed rows:

Computed rows   n=1000   n=1500   n=2000
100              7.46     14.01    23.17
200              8.07     14.98    24.52
300              8.65     15.84    25.68
400              9.21     16.69    26.82
500              9.78     17.55    27.97
600             10.35     18.41    29.12
700             10.91     19.26    30.28
800             11.46     20.12    31.43
900             12.02     20.98    32.57
1000            12.57     21.83    33.72
CHECKPOINT STORETIME
Checkpoint store time in seconds by computed rows:

Computed rows   n=1000   n=1500   n=2000
100              0.69     1.17     1.85
200              0.72     1.29     1.99
300              0.75     1.30     2.04
400              0.78     1.33     2.12
500              0.81     1.39     2.19
600              0.85     1.49     2.25
700              0.93     1.46     2.25
800              0.98     1.51     2.31
900              0.96     1.58     2.42
1000             0.96     1.66     2.53
CHECKPOINT RESTORETIME
Checkpoint restore time in seconds by computed rows:

Computed rows   n=1000   n=1500   n=2000
100              0.22     0.38     0.61
200              0.23     0.43     0.72
300              0.27     0.44     0.73
400              0.24     0.45     0.75
500              0.24     0.45     0.81
600              0.25     0.46     0.77
700              0.26     0.47     0.79
800              0.26     0.48     0.81
900              0.27     0.49     0.81
1000             0.28     0.50     0.82
ENVIRONMENT
• Cluster of 4 nodes
• OS: Ubuntu 12.04
• CPU: Intel Core i5-650 3.2GHz
• RAM: 8 GB
• Checkpoint System: DMTCP 2.4.0
• MPI: OpenMPI 1.8.7
• Shared file system: NFS
BACKGROUND
• DMTCP checkpoint files are distributed across the nodes; each node stores
only its own checkpoint files.
• In case of a permanent failure, a node's checkpoint files are lost
irrecoverably.
• Along with the checkpoints, a restart script is created, which recovers
all processes simultaneously.
• To restart a client application on different nodes it is necessary to
modify the script.
• Checkpoints are stored and overwritten during a computation.
• Protection requires constant monitoring and determining when a checkpoint
was taken.
PROTECTOR
[Diagram: four nodes, each with its own disk; Node 0 runs the coordinator
and holds jtimings.csv; every node passes its hostname to the next one and
backs up its checkpoint files to that neighbour's disk.]
The protector:
• Sends the hostname of each node to the next one, so the other node
receives the location to which it should copy checkpoint files.
• Watches jtimings.csv, whose changes imply that a checkpoint was taken.
• Copies the checkpoint files when jtimings.csv is modified.
• Closes when jtimings.csv is closed (= the coordinator went off).
A sketch of the watch loop follows below.
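A minimal sketch of that watch loop (the paths, the neighbour hostname, and
the use of scp are illustrative assumptions, not the presented code; the
original detects the coordinator's shutdown via the file being closed, which
this polling version approximates by stopping when the file disappears):

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *watched = "jtimings.csv";  /* written by the coordinator */
    struct stat st;
    time_t last_mtime = 0;

    /* stop when the file can no longer be read (coordinator went off) */
    while (stat(watched, &st) == 0) {
        if (st.st_mtime != last_mtime) {   /* a new checkpoint was taken */
            last_mtime = st.st_mtime;
            /* hypothetical destination: the hostname received from the
             * previous node at startup */
            system("scp ckpt_*.dmtcp next-node:/home/michal/backup/");
        }
        sleep(1);                           /* poll once per second */
    }
    return 0;
}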
SOURCES
• Jorge Villamayor, Dolores Rexachs, Emilio Luque, "Configuring Fault
Tolerance with Coordinated Checkpoint/Restart".
• Ifeanyi P. Egwutuoha, David Levy, Bran Selic, Shiping Chen, "A survey of
fault tolerance mechanisms and checkpoint/restart implementations for high
performance computing systems".
• http://dmtcp.sourceforge.net/
DONE TASKS
• Explore the subject of fault tolerance.
• Get familiar with DMTCP and OpenMPI; install and configure them.
• Prepare a simple application (matrix multiplication) in C and convert it
to a parallel app with MPI.
• Create a script that executes an application a given number of times and
returns the mean execution time.
• Create a protecting application which copies checkpoint files to other
nodes.
• Manage to restore an application from checkpoints on different nodes.