Fault Tolerant Parallel Data-Intensive Algorithms
Mucahid Kutlu, Gagan Agrawal, Oguz Kurt (Ohio State University)
Introduction and Motivation
◆ The Mean-Time-To-Failure (MTTF) of systems is decreasing as the number of cores grows.
◆ For future exascale systems, it is argued that checkpointing and recovery time (with current methods) will even exceed the MTTF.
◆ Algorithm-based fault tolerance can be an alternative method.
Our Goal
◆ We focus only on fail-stop failures.
◆ We do not use any backup nodes; after a failure, processing continues on the remaining nodes.
◆ We have two main goals for faster recovery:
  ◆ Minimize data loss, since lost data must be reread from the storage cluster.
  ◆ Minimize re-processing of the lost data.
Our Approach
- Intelligent Replication
◆ Minimum data intersection between processors, achieved by dividing the data into blocks and distributing them across different processors.
◆ Passive Replicas
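The idea of spreading a processor's blocks across different replica holders can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `place_blocks` and `max_shared_blocks`, and the placement rule (the k-th block of processor p is replicated on the k-th next processor, wrapping around) are assumptions chosen so that no two processors share more than one block.

```python
def place_blocks(num_procs, blocks_per_proc):
    """Assign primary and replica blocks to processors numbered 1..num_procs.

    Hypothetical placement rule (an assumption, not from the poster):
    the k-th primary block of processor p gets its replica on the
    processor k+1 positions after p, wrapping around.
    """
    if blocks_per_proc >= num_procs:
        raise ValueError("need fewer blocks per processor than processors")
    primary = {p: [] for p in range(1, num_procs + 1)}
    replica = {p: [] for p in range(1, num_procs + 1)}
    block = 1
    for p in range(1, num_procs + 1):
        for k in range(blocks_per_proc):
            primary[p].append(block)
            # (p + k) % num_procs + 1 is the processor k+1 steps after p.
            holder = (p + k) % num_procs + 1
            replica[holder].append(block)
            block += 1
    return primary, replica


def max_shared_blocks(primary, replica):
    """Largest number of blocks that one processor's primaries share
    with another processor's replicas."""
    return max(
        len(set(primary[p]) & set(replica[q]))
        for p in primary
        for q in replica
        if p != q
    )


primary, replica = place_blocks(4, 2)
print(primary)
print(replica)
print(max_shared_blocks(primary, replica))
```

With 4 processors and 2 blocks each, the replicas of P1's two blocks land on P2 and P3, so a single additional failure can take out at most one of them.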
- Summarization
◆ After processing one block, a summary is generated for that block and sent to the master node.
◆ Blocks whose summary was already sent before the failure do not need to be re-processed.
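The summarization step can be sketched as below. This is a simplified illustration under stated assumptions: the `Master` class, its method names, and the choice of a (sum, count) pair as the per-block summary are all hypothetical; the actual summaries for k-means or apriori would carry algorithm-specific state.

```python
def summarize(block):
    """Hypothetical per-block summary: (sum, count) of the values.

    Enough to combine blocks into a global mean without keeping the data.
    """
    return (sum(block), len(block))


class Master:
    """Toy master node that stores one summary per processed block."""

    def __init__(self):
        self.summaries = {}

    def receive(self, block_id, summary):
        # Called when a worker finishes a block and sends its summary.
        self.summaries[block_id] = summary

    def pending_after_failure(self, blocks_of_failed_node):
        # Only blocks without a stored summary must be processed again.
        return [b for b in blocks_of_failed_node if b not in self.summaries]

    def global_mean(self):
        total = sum(s for s, _ in self.summaries.values())
        count = sum(c for _, c in self.summaries.values())
        return total / count


master = Master()
master.receive(1, summarize([1.0, 2.0, 3.0]))  # block 1 finished before the failure
print(master.pending_after_failure([1, 2]))    # block 2 had no summary yet
```

Here the failed node owned blocks 1 and 2, but only block 2 is returned for re-processing because the summary of block 1 had already reached the master.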
[Figure: master node, file system, and processors P1 to P4]
Recovery Scenario
◆ P1 and P2 fail at the beginning of the iteration.
◆ The master node notifies:
  - P3 to process D2 and D3
  - P4 to process D4
◆ Since all copies of D1 are lost, the master node reads D1 from the file system/storage cluster and notifies P4 to process it.
[Figure: block layout, two rows of blocks per processor. P1: D1 D2 / D6 D7; P2: D3 D4 / D1 D8; P3: D5 D6 / D2 D3; P4: D7 D8 / D4 D5]
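The recovery scenario can be traced with a short sketch. The layout below is one reading of the figure (first row taken as primary blocks, second row as replicas), and the `plan_recovery` function and its tie-breaking by current load are assumptions made for illustration. The sketch also ignores summaries, so every block of a failed node is treated as unprocessed.

```python
def plan_recovery(primary, replica, failed):
    """Reassign the primary blocks of failed processors to survivors.

    Returns (assignments, reread): extra blocks per surviving processor,
    and the blocks that must be reread from storage because every copy
    was on a failed processor.
    """
    survivors = [p for p in primary if p not in failed]
    if not survivors:
        raise ValueError("no surviving processors")
    load = {p: len(primary[p]) for p in survivors}
    assignments = {p: [] for p in survivors}
    reread = []

    lost = [b for p in sorted(failed) for b in primary[p]]
    for block in lost:
        holders = [p for p in survivors if block in replica[p]]
        if holders:
            # A surviving replica exists: process it where it already lives.
            target = min(holders, key=lambda p: load[p])
            assignments[target].append(block)
            load[target] += 1
        else:
            reread.append(block)

    # Blocks with no surviving copy go to the least-loaded survivor.
    for block in reread:
        target = min(survivors, key=lambda p: load[p])
        assignments[target].append(block)
        load[target] += 1
    return assignments, reread


# Assumed layout, read off the figure (primary row, then replica row).
primary = {"P1": ["D1", "D2"], "P2": ["D3", "D4"],
           "P3": ["D5", "D6"], "P4": ["D7", "D8"]}
replica = {"P1": ["D6", "D7"], "P2": ["D1", "D8"],
           "P3": ["D2", "D3"], "P4": ["D4", "D5"]}

assignments, reread = plan_recovery(primary, replica, failed={"P1", "P2"})
print(assignments)
print(reread)
```

Under this layout the plan matches the scenario described above: P3 takes D2 and D3, P4 takes D4, and D1 is reread from storage and handed to P4, which carries the lighter load at that point.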
Experimental Setup
◆ Implemented the k-means and apriori algorithms in C using the MPI library.
◆ Used 2.5 GHz Opteron processors with 24 GB of memory.
◆ The number of processors is 8.
◆ In the experiments with Hadoop:
  ◆ Replication factor (R): 3
  ◆ Summarization frequency (S): 4
Experimental Results
[Figure: Impact of Summary Exchange Frequency in Apriori: Varying Number of Failures]
[Figure: Total Execution Time that Changes with the Number of Failures]
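[Figure: primary and replica block placement on processors P1 to P7. Primary blocks: P1: 1 2; P2: 3 4; P3: 5 6; P4: 7 8; P5: 9 10; P6: 11 12; P7: 13 14. Replica blocks, in figure order: 12 13 1 14 2 3 4 5 6 7 8 9 10 11]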